cs.CV

[1] Dynamic Epsilon Scheduling: A Multi-Factor Adaptive Perturbation Budget for Adversarial Training

Alan Mitkiy,James Smith,Hana Satou,Hiroshi Tanaka,Emily Johnson,F Monkey

Main category: cs.CV

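TL;DR: The paper proposes Dynamic Epsilon Scheduling (DES), a method that dynamically adjusts the perturbation budget used in adversarial training by combining distance to the decision boundary, prediction confidence, and model uncertainty, significantly improving both adversarial robustness and standard accuracy.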

Details

Motivation: Existing adversarial training methods rely on a fixed perturbation budget, which cannot adapt to instance-specific robustness characteristics and therefore limits their effectiveness.

Method: Dynamic Epsilon Scheduling (DES) combines gradient-based proxies, softmax entropy, and Monte Carlo dropout to adjust the perturbation budget for each instance and each training iteration.

Result: On CIFAR-10 and CIFAR-100, DES significantly outperforms fixed-budget baselines and existing adaptive methods, improving both adversarial robustness and standard accuracy.

Conclusion: DES opens a new direction for instance-aware, data-driven adversarial training and is backed by theoretical analysis.

Abstract: Adversarial training is among the most effective strategies for defending deep neural networks against adversarial examples. A key limitation of existing adversarial training approaches lies in their reliance on a fixed perturbation budget, which fails to account for instance-specific robustness characteristics. While prior works such as IAAT and MMA introduce instance-level adaptations, they often rely on heuristic or static approximations of data robustness. In this paper, we propose Dynamic Epsilon Scheduling (DES), a novel framework that adaptively adjusts the adversarial perturbation budget per instance and per training iteration. DES integrates three key factors: (1) the distance to the decision boundary approximated via gradient-based proxies, (2) prediction confidence derived from softmax entropy, and (3) model uncertainty estimated via Monte Carlo dropout. By combining these cues into a unified scheduling strategy, DES tailors the perturbation budget dynamically to guide more effective adversarial learning. Experimental results on CIFAR-10 and CIFAR-100 show that our method consistently improves both adversarial robustness and standard accuracy compared to fixed-epsilon baselines and prior adaptive methods. Moreover, we provide theoretical insights into the stability and convergence of our scheduling policy. This work opens a new avenue for instance-aware, data-driven adversarial training methods.
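
To make the three cues in the abstract more concrete, here is a minimal sketch of how a margin proxy, softmax entropy, and a Monte Carlo dropout variance could be fused into a per-instance budget. The abstract does not give the actual scheduling rule, so the equal weighting, the direction of each term (larger budget for confident, far-from-boundary samples), the epsilon range, and the names `dynamic_epsilon`, `margin_scale`, and `var_scale` are all assumptions made for illustration.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a softmax output, scaled to roughly [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def dynamic_epsilon(probs, logit_gap, grad_norm, mc_variance,
                    eps_min=2 / 255, eps_max=8 / 255,
                    margin_scale=1.0, var_scale=0.05):
    """Hypothetical fusion rule, not the rule from the paper.

    probs       : softmax probabilities for one sample
    logit_gap   : top-1 logit minus top-2 logit
    grad_norm   : norm of the gradient of that gap w.r.t. the input
    mc_variance : predictive variance over several dropout passes
    """
    # First-order proxy for the distance to the decision boundary.
    margin = logit_gap / (grad_norm + 1e-12)
    margin_term = min(1.0, margin / margin_scale)
    # Low entropy means a confident prediction.
    confidence_term = 1.0 - normalized_entropy(probs)
    # Low Monte Carlo dropout variance means low model uncertainty.
    certainty_term = 1.0 - min(1.0, mc_variance / var_scale)
    # Assumed equal weighting of the three cues, clamped to [0, 1].
    score = (margin_term + confidence_term + certainty_term) / 3.0
    score = max(0.0, min(1.0, score))
    return eps_min + score * (eps_max - eps_min)


easy = dynamic_epsilon([0.97, 0.01, 0.01, 0.01], logit_gap=5.0,
                       grad_norm=2.0, mc_variance=0.001)
hard = dynamic_epsilon([0.25, 0.25, 0.25, 0.25], logit_gap=0.1,
                       grad_norm=2.0, mc_variance=0.05)
print(f"confident sample: {easy:.4f}, ambiguous sample: {hard:.4f}")
```

Under these assumptions, the confident sample receives a budget near `eps_max` and the ambiguous one stays near `eps_min`. A real schedule would recompute the three inputs at every training iteration.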

Yi Lu,Jiawang Cao,Yongliang Wu,Bozheng Li,Licheng Tang,Yangguang Ji,Chong Wu,Jay Wu,Wenbo Zhu

Main category: cs.CV

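TL;DR: The RSVP framework unifies multimodal reasoning and visual segmentation through visual prompting, significantly improving performance.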

Details

Motivation: Multimodal large language models lack explicit mechanisms for visual grounding and segmentation, which leaves a gap between cognitive reasoning and visual perception.

Method: RSVP is a two-stage framework. The reasoning stage generates region proposals through multimodal chain-of-thought visual prompts; the segmentation stage refines the segmentation masks with a vision-language segmentation module.

Result: RSVP surpasses existing methods on ReasonSeg (by up to +6.5 gIoU and +9.2 cIoU) and reaches 49.7 mAP on SegInW in the zero-shot setting.

Conclusion: RSVP provides an effective and scalable framework for combining cognitive reasoning with structured visual understanding.

Abstract: Multi-modal Large Language Models (MLLMs) have demonstrated remarkable reasoning capability yet lack explicit mechanisms for visual grounding and segmentation, creating a gap between cognitive reasoning and visual perception. To bridge this gap, we introduce Reasoning Segmentation via Visual Prompting (RSVP), a novel framework that unifies multi-step multimodal reasoning with grounded visual understanding. RSVP is a two-stage structured framework that integrates reasoning-driven localization with segmentation refinement. In the reasoning stage, RSVP employs multimodal chain-of-thought visual prompts to help MLLMs understand queries and infer targets, generating interpretable region proposals that enhance visual grounding. In the segmentation stage, RSVP refines these proposals with a Vision-Language Segmentation Module (VLSM), seamlessly integrating textual and visual cues to produce precise segmentation masks. By explicitly modelling the interaction between multimodal reasoning and segmentation, RSVP introduces a new paradigm for interpretable reasoning segmentation. It exploits MLLMs' inherent localization capabilities, enabling the models to not only reason about objects but also generate structured visual representations. Our extensive experiments demonstrate that RSVP achieves state-of-the-art performance, surpassing state-of-the-art methods by up to +6.5 gIoU and +9.2 cIoU on ReasonSeg, and achieves 49.7 mAP on SegInW under zero-shot settings. These results validate RSVP as an effective and scalable framework for integrating cognitive reasoning with structured visual understanding.
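
The two ReasonSeg metrics quoted above are easy to confuse. As commonly defined for that benchmark, gIoU is the mean of per-image IoU values, while cIoU is the cumulative intersection divided by the cumulative union over the whole set. The sketch below assumes those definitions and represents masks as flat lists of 0/1 values; the handling of an empty union is an assumption, not something the abstract specifies.

```python
def intersection_and_union(pred, gt):
    """Pixel counts for one pair of binary masks (flat lists of 0/1)."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union


def giou_and_ciou(preds, gts):
    """Return (gIoU, cIoU) over a dataset of mask pairs."""
    per_image = []
    total_inter = 0
    total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = intersection_and_union(pred, gt)
        # Assumed convention: two empty masks count as a perfect match.
        per_image.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    giou = sum(per_image) / len(per_image)
    ciou = total_inter / total_union if total_union else 1.0
    return giou, ciou


preds = [[1, 1, 0, 0], [1, 1, 1, 1]]
gts = [[1, 0, 0, 0], [1, 1, 1, 1]]
giou, ciou = giou_and_ciou(preds, gts)
print(giou, ciou)
```

Because cIoU pools pixels across images, large objects dominate it, whereas gIoU weights every image equally. This is one reason the two improvements reported for RSVP differ in size.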

[3] Evaluating MLLMs with Multimodal Multi-image Reasoning Benchmark

Ziming Cheng,Binrui Xu,Lisheng Gong,Zuhe Song,Tianshuo Zhou,Shiqi Zhong,Siyu Ren,Mingxiang Chen,Xiangchao Meng,Yuxin Zhang,Yanlin Li,Lei Ren,Wei Chen,Zhiyuan Huang,Mingjie Zhan,Xiaojie Wang,Fangxiang Feng

Main category: cs.CV

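TL;DR: The paper introduces the first multi-image reasoning benchmark (MMRB) for evaluating the structured reasoning ability of multimodal large language models (MLLMs) on multi-image inputs, and finds a significant gap between open-source and commercial models.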

Details

Motivation: Existing MLLM benchmarks focus mainly on single-image reasoning or on final-answer evaluation for multi-image tasks, and lack a systematic assessment of reasoning over multi-image inputs.

Method: The MMRB benchmark comprises 92 sub-tasks covering spatial, temporal, and semantic reasoning, with multi-solution, chain-of-thought annotations generated by GPT-4o and refined by human experts. A sentence-level matching framework based on open-source LLMs is proposed for fast evaluation.

Result: Experiments show that open-source MLLMs lag significantly behind commercial models on multi-image reasoning tasks, and that current multimodal reward models are nearly incapable of handling multi-image reward ranking tasks.

Conclusion: MMRB fills the gap in multi-image reasoning evaluation, exposes the shortcomings of open-source models, and points to directions for future research.

Abstract: With enhanced capabilities and widespread applications, Multimodal Large Language Models (MLLMs) are increasingly required to process and reason over multiple images simultaneously. However, existing MLLM benchmarks focus either on single-image visual reasoning or on multi-image understanding tasks with only final-answer evaluation, leaving the reasoning capabilities of MLLMs over multi-image inputs largely underexplored. To address this gap, we introduce the Multimodal Multi-image Reasoning Benchmark (MMRB), the first benchmark designed to evaluate structured visual reasoning across multiple images. MMRB comprises 92 sub-tasks covering spatial, temporal, and semantic reasoning, with multi-solution, CoT-style annotations generated by GPT-4o and refined by human experts. A derivative subset is designed to evaluate multimodal reward models in multi-image scenarios. To support fast and scalable evaluation, we propose a sentence-level matching framework using open-source LLMs. Extensive baseline experiments on 40 MLLMs, including 9 reasoning-specific models and 8 reward models, demonstrate that open-source MLLMs still lag significantly behind commercial MLLMs in multi-image reasoning tasks. Furthermore, current multimodal reward models are nearly incapable of handling multi-image reward ranking tasks.

[4] HuGeDiff: 3D Human Generation via Diffusion with Gaussian Splatting

Maksym Ivashechkin,Oscar Mendez,Richard Bowden

Main category: cs.CV

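TL;DR: The paper proposes a weakly supervised pipeline that uses an image diffusion model to generate a human image dataset with controllable attributes, maps the images to 3D point clouds with a transformer-based architecture, and finally trains a point-cloud diffusion model, significantly improving the speed, realism, and text alignment of 3D human generation.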

Details

Motivation: Current 3D human generation methods fall short in fine detail, rendering of hands and faces, realism, and controllability, and human image data lacks diversity and annotation. The paper aims to address these problems.

Method: (1) Generate a photorealistic human image dataset with controllable attributes using an image diffusion model; (2) propose an efficient transformer-based mapping from image features to 3D point clouds; (3) train a conditional point-cloud diffusion model.

Result: Compared with existing methods, the approach achieves orders-of-magnitude speed-ups and significantly improves text alignment, realism, and rendering quality.

Conclusion: The proposed weakly supervised pipeline effectively addresses key challenges in 3D human generation; the code and dataset will be released.

Abstract: 3D human generation is an important problem with a wide range of applications in computer vision and graphics. Despite recent progress in generative AI such as diffusion models or rendering methods like Neural Radiance Fields or Gaussian Splatting, controlling the generation of accurate 3D humans from text prompts remains an open challenge. Current methods struggle with fine detail, accurate rendering of hands and faces, human realism, and controllability over appearance. The lack of diversity, realism, and annotation in human image data also remains a challenge, hindering the development of a foundational 3D human model. We present a weakly supervised pipeline that tries to address these challenges. In the first step, we generate a photorealistic human image dataset with controllable attributes such as appearance, race, gender, etc., using a state-of-the-art image diffusion model. Next, we propose an efficient mapping approach from image features to 3D point clouds using a transformer-based architecture. Finally, we close the loop by training a point-cloud diffusion model that is conditioned on the same text prompts used to generate the original samples. We demonstrate orders-of-magnitude speed-ups in 3D human generation compared to the state-of-the-art approaches, along with significantly improved text-prompt alignment, realism, and rendering quality. We will make the code and dataset available.

[5] ReXVQA: A Large-scale Visual Question Answering Benchmark for Generalist Chest X-ray Understanding

Ankit Pal,Jung-Oh Lee,Xiaoman Zhang,Malaikannan Sankarasubbu,Seunghyeon Roh,Won Jung Kim,Meesun Lee,Pranav Rajpurkar

Main category: cs.CV

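TL;DR: ReXVQA is the largest and most comprehensive benchmark for chest X-ray visual question answering (VQA), with approximately 696,000 questions and 160,000 chest X-ray studies. Eight multimodal large language models are evaluated; MedGemma performs best (83.24% overall accuracy) and, in a reader study, even outperforms radiology residents (83.84% vs. 77.27% for the best resident).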

Details

Motivation: Fill the gap left by the lack of diverse, clinically authentic tasks in chest X-ray VQA, and push AI systems toward expert-level clinical reasoning.

Method: Build the ReXVQA benchmark, which covers five radiological reasoning skills, evaluate eight multimodal large language models on it, and conduct a human reader study.

Result: MedGemma performs best (83.24% overall accuracy) and, in the reader study, outperforms the human readers (83.84% vs. 77.27% for the best radiology resident). The study also reveals distinct performance patterns between AI models and human readers in chest X-ray interpretation.

Conclusion: ReXVQA sets a new standard for evaluating generalist radiological AI systems and lays the foundation for the next generation of such systems.

Abstract: We present ReXVQA, the largest and most comprehensive benchmark for visual question answering (VQA) in chest radiology, comprising approximately 696,000 questions paired with 160,000 chest X-ray studies across training, validation, and test sets. Unlike prior efforts that rely heavily on template-based queries, ReXVQA introduces a diverse and clinically authentic task suite reflecting five core radiological reasoning skills: presence assessment, location analysis, negation detection, differential diagnosis, and geometric reasoning. We evaluate eight state-of-the-art multimodal large language models, including MedGemma-4B-it, Qwen2.5-VL, Janus-Pro-7B, and Eagle2-9B. The best-performing model (MedGemma) achieves 83.24% overall accuracy. To bridge the gap between AI performance and clinical expertise, we conducted a comprehensive human reader study involving 3 radiology residents on 200 randomly sampled cases. Our evaluation demonstrates that MedGemma achieved superior performance (83.84% accuracy) compared to human readers (best radiology resident: 77.27%), representing a significant milestone where AI performance exceeds expert human evaluation on chest X-ray interpretation. The reader study reveals distinct performance patterns between AI models and human experts, with strong inter-reader agreement among radiologists while showing more variable agreement patterns between human readers and AI models. ReXVQA establishes a new standard for evaluating generalist radiological AI systems, offering public leaderboards, fine-grained evaluation splits, structured explanations, and category-level breakdowns. This benchmark lays the foundation for next-generation AI systems capable of mimicking expert-level clinical reasoning beyond narrow pathology classification. Our dataset will be open-sourced at https://huggingface.co/datasets/rajpurkarlab/ReXVQA

[6] WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning

Delong Chen,Willy Chung,Yejin Bang,Ziwei Ji,Pascale Fung

Main category: cs.CV

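TL;DR: WorldPrediction is a video-based benchmark for evaluating the world modeling and procedural planning capabilities of AI models, with an emphasis on actions that involve temporal and semantic abstraction.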

Details

Motivation: It is unclear how current AI models, especially generative models, learn world models and carry out procedural planning, so a new evaluation method is needed.

Method: Models are evaluated on their ability to distinguish the correct action, or the correctly ordered action sequence, from counterfactual distractors, with states and actions represented by visual observations.

Result: Frontier models reach only 57% accuracy on WorldPrediction-WM and 38% on WorldPrediction-PP, far below the perfect performance of humans.

Conclusion: WorldPrediction provides a reliable benchmark for evaluating world modeling and planning in AI models and exposes the limitations of current models.

Abstract: Humans are known to have an internal "world model" that enables us to carry out action planning based on world states. AI agents need to have such a world model for action planning as well. It is not clear how current AI models, especially generative models, are able to learn such world models and carry out procedural planning in diverse environments. We introduce WorldPrediction, a video-based benchmark for evaluating world modeling and procedural planning capabilities of different AI models. In contrast to prior benchmarks that focus primarily on low-level world modeling and robotic motion planning, WorldPrediction is the first benchmark that emphasizes actions with temporal and semantic abstraction. Given initial and final world states, the task is to distinguish the proper action (WorldPrediction-WM) or the properly ordered sequence of actions (WorldPrediction-PP) from a set of counterfactual distractors. This discriminative task setup enables us to evaluate different types of world models and planners and realize a thorough comparison across different hypotheses. The benchmark represents states and actions using visual observations. To prevent models from exploiting low-level continuity cues in background scenes, we provide "action equivalents" - identical actions observed in different contexts - as candidates for selection. This benchmark is grounded in a formal framework of a partially observable semi-MDP, ensuring better reliability and robustness of the evaluation. We conduct extensive human filtering and validation on our benchmark and show that current frontier models barely achieve 57% accuracy on WorldPrediction-WM and 38% on WorldPrediction-PP whereas humans are able to solve both tasks perfectly.

[7] Puck Localization Using Contextual Cues

Liam Salass,Jerrin Bright,Amir Nazemi,Yuhao Chen,John Zelek,David Clausi

Main category: cs.CV

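TL;DR: PLUCC uses contextual cues from player behaviour for puck detection in ice hockey, combining multiscale features with a gating decoder and significantly outperforming baseline methods.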

Details

Motivation: Puck detection is challenging because the puck is small, frequently occluded, and motion-blurred, and existing methods do not fully exploit contextual cues from player behaviour.

Method: PLUCC consists of a contextual encoder, a feature pyramid encoder, and a gating decoder, combining player pose with multiscale features.

Result: On PuckDataset, PLUCC improves average precision by 12.2% and RSLE average precision by 25%, achieving the best performance.

Conclusion: Contextual understanding is critical for puck detection and has broad implications for automated sports analysis.

Abstract: Puck detection in ice hockey broadcast videos poses significant challenges due to the puck's small size, frequent occlusions, motion blur, broadcast artifacts, and scale inconsistencies due to varying camera zoom and broadcast camera viewpoints. Prior works focus on appearance-based or motion-based cues of the puck without explicitly modelling the cues derived from player behaviour. Players consistently turn their bodies and direct their gaze toward the puck. Motivated by this strong contextual cue, we propose Puck Localization Using Contextual Cues (PLUCC), a novel approach for scale-aware and context-driven single-frame puck detections. PLUCC consists of three components: (a) a contextual encoder, which utilizes player orientations and positioning as helpful priors; (b) a feature pyramid encoder, which extracts multiscale features from the dual encoders; and (c) a gating decoder that combines latent features with a channel gating mechanism. For evaluation, in addition to standard average precision, we propose Rink Space Localization Error (RSLE), a scale-invariant homography-based metric for removing perspective bias from rink space evaluation. The experimental results of PLUCC on the PuckDataset dataset demonstrated state-of-the-art detection performance, surpassing previous baseline methods by an average precision improvement of 12.2% and RSLE average precision of 25%. Our research demonstrates the critical role of contextual understanding in improving puck detection performance, with broad implications for automated sports analysis.
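
The abstract describes RSLE only as a scale-invariant, homography-based metric, so its exact formula is not available here. The sketch below illustrates the general idea such a metric relies on: project both the predicted and the ground-truth pixel locations into rink coordinates with a 3x3 homography, then measure the error there instead of in pixels. The function names and the example matrix are hypothetical, and this is not claimed to be the paper's definition of RSLE.

```python
import math


def apply_homography(H, point):
    """Map an image point (x, y) to rink coordinates with a 3x3 homography."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)  # divide out the projective scale


def rink_space_error(H, predicted_px, ground_truth_px):
    """Euclidean distance between two detections after projection to the rink."""
    px, py = apply_homography(H, predicted_px)
    gx, gy = apply_homography(H, ground_truth_px)
    return math.hypot(px - gx, py - gy)


# Hypothetical camera-to-rink homography: a pure scaling of 0.1 units per pixel.
H = [[0.1, 0.0, 0.0],
     [0.0, 0.1, 0.0],
     [0.0, 0.0, 1.0]]
print(rink_space_error(H, (100.0, 50.0), (130.0, 90.0)))
```

Measuring in rink space means that a 30-pixel miss in a zoomed-in shot and a 5-pixel miss in a wide shot can be compared on the same physical scale. This is the perspective bias the abstract says the metric removes.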

[8] Fine-Tuning Video Transformers for Word-Level Bangla Sign Language: A Comparative Analysis for Classification Tasks

Jubayer Ahmed Bhuiyan Shawon,Hasan Mahmud,Kamrul Hasan

Main category: cs.CV

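TL;DR: The study fine-tunes video transformer architectures (VideoMAE, ViViT, and TimeSformer) on the Bangla sign language datasets BdSLW60 and BdSLW401, achieving high-accuracy sign recognition that significantly outperforms traditional methods.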

Details

Motivation: Improve the accuracy and scalability of Bangla Sign Language (BdSL) recognition to make communication more accessible for the hearing-impaired community.

Method: Video transformer architectures (VideoMAE, ViViT, TimeSformer) are fine-tuned and evaluated on the BdSLW60 and BdSLW401 datasets, with data augmentation and 10-fold stratified cross-validation.

Result: VideoMAE reaches 95.5% accuracy on BdSLW60 and 81.04% on BdSLW401, significantly outperforming traditional methods.

Conclusion: Video transformer models perform strongly on Bangla sign language recognition and show potential for scalable, accurate recognition.

Abstract: Sign Language Recognition (SLR) involves the automatic identification and classification of sign gestures from images or video, converting them into text or speech to improve accessibility for the hearing-impaired community. In Bangladesh, Bangla Sign Language (BdSL) serves as the primary mode of communication for many individuals with hearing impairments. This study fine-tunes state-of-the-art video transformer architectures -- VideoMAE, ViViT, and TimeSformer -- on BdSLW60 (arXiv:2402.08635), a small-scale BdSL dataset with 60 frequent signs. We standardized the videos to 30 FPS, resulting in 9,307 user trial clips. To evaluate scalability and robustness, the models were also fine-tuned on BdSLW401 (arXiv:2503.02360), a large-scale dataset with 401 sign classes. Additionally, we benchmark performance against public datasets, including LSA64 and WLASL. Data augmentation techniques such as random cropping, horizontal flipping, and short-side scaling were applied to improve model robustness. To ensure balanced evaluation across folds during model selection, we employed 10-fold stratified cross-validation on the training set, while signer-independent evaluation was carried out using held-out test data from unseen users U4 and U8. Results show that video transformer models significantly outperform traditional machine learning and deep learning approaches. Performance is influenced by factors such as dataset size, video quality, frame distribution, frame rate, and model architecture. Among the models, the VideoMAE variant (MCG-NJU/videomae-base-finetuned-kinetics) achieved the highest accuracies of 95.5% on the frame-rate-corrected BdSLW60 dataset and 81.04% on the front-facing signs of BdSLW401 -- demonstrating strong potential for scalable and accurate BdSL recognition.
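
The 10-fold stratified cross-validation mentioned in the abstract keeps the class proportions roughly equal in every fold. A minimal, dependency-free sketch of that splitting step follows; the function name, the seed, and the toy label list are illustrative and are not taken from the paper's code.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=10, seed=0):
    """Split sample indices into k folds with near-equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the samples of this class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Toy data: 6 sign classes with 20 clips each.
labels = [sign for sign in range(6) for _ in range(20)]
folds = stratified_kfold(labels, k=10)
for fold_id, validation_indices in enumerate(folds):
    held_out = set(validation_indices)
    train_indices = [i for i in range(len(labels)) if i not in held_out]
    print(fold_id, len(train_indices), len(validation_indices))
```

Stratification only balances classes. It does not separate signers, which is why the abstract reports signer-independent results on a separate held-out test set from users U4 and U8.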

[9] Visualizing and Controlling Cortical Responses Using Voxel-Weighted Activation Maximization

Matthew W. Shinkle,Mark D. Lescroart

Main category: cs.CV

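TL;DR: The paper shows how activation maximization can be applied to DNN-based brain encoding models: images are generated to optimize predicted brain responses, and the approach is validated in the human visual system.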

Details

Motivation: Although DNN-based encoding models can predict brain responses to visual stimuli, they offer little insight into the specific features that drive those responses.

Method: Activations are extracted from a pretrained Inception V3 network and downsampled, linear regression is used to predict fMRI responses, and activation maximization is applied to generate optimized images.

Result: The generated images are consistent with known feature selectivity, and an fMRI experiment confirms that they effectively drive activity in the targeted brain regions.

Conclusion: Activation maximization can be successfully applied to DNN-based encoding models, providing a flexible tool for understanding the human visual system.

Abstract: Deep neural networks (DNNs) trained on visual tasks develop feature representations that resemble those in the human visual system. Although DNN-based encoding models can accurately predict brain responses to visual stimuli, they offer limited insight into the specific features driving these responses. Here, we demonstrate that activation maximization -- a technique designed to interpret vision DNNs -- can be applied to DNN-based encoding models of the human brain. We extract and adaptively downsample activations from multiple layers of a pretrained Inception V3 network, then use linear regression to predict fMRI responses. This yields a full image-computable model of brain responses. Next, we apply activation maximization to generate images optimized for predicted responses in individual cortical voxels. We find that these images contain visual characteristics that qualitatively correspond with known selectivity and enable exploration of selectivity across the visual cortex. We further extend our method to whole regions of interest (ROIs) of the brain and validate its efficacy by presenting these images to human participants in an fMRI study. We find that the generated images reliably drive activity in targeted regions across both low- and high-level visual areas and across subjects. These results demonstrate that activation maximization can be successfully applied to DNN-based encoding models. By addressing key limitations of alternative approaches that require natively generative models, our approach enables flexible characterization and modulation of responses across the human visual system.
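
Activation maximization on an encoding model amounts to gradient ascent on the input with respect to a predicted response. The toy sketch below replaces the Inception V3 features with a single tanh layer (`W`) and the fitted regression with a weight vector (`b`), so that the gradient can be written by hand. Every number and name in it is an illustrative assumption, and it does not reproduce the paper's pipeline.

```python
import math


def predicted_response(x, W, b):
    """Toy encoding model: voxel response = b . tanh(W x)."""
    return sum(bj * math.tanh(sum(w * xi for w, xi in zip(row, x)))
               for row, bj in zip(W, b))


def response_gradient(x, W, b):
    """Analytic gradient of the predicted response w.r.t. the input."""
    grad = [0.0] * len(x)
    for row, bj in zip(W, b):
        z = sum(w * xi for w, xi in zip(row, x))
        slope = bj * (1.0 - math.tanh(z) ** 2)
        for i, w in enumerate(row):
            grad[i] += slope * w
    return grad


def activation_maximization(x0, W, b, steps=200, lr=0.1):
    """Gradient ascent on the input, keeping 'pixels' in the range [0, 1]."""
    x = list(x0)
    for _ in range(steps):
        grad = response_gradient(x, W, b)
        x = [min(1.0, max(0.0, xi + lr * g)) for xi, g in zip(x, grad)]
    return x


W = [[1.0, -1.0], [0.5, 0.5]]   # stand-in feature extractor
b = [1.0, 0.5]                  # stand-in regression weights for one voxel
start = [0.5, 0.5]
optimized = activation_maximization(start, W, b)
print(predicted_response(start, W, b), predicted_response(optimized, W, b))
```

In practice the gradient would come from automatic differentiation through the full network, and image priors or regularizers are typically needed to keep the optimized images interpretable.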

[10] Is Perturbation-Based Image Protection Disruptive to Image Editing?

Qiuyu Tang,Bonor Ayambem,Mooi Choo Chuah,Aparna Bharati

Main category: cs.CV

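TL;DR: The study finds that existing perturbation-based image protection methods cannot fully prevent editing by diffusion models and may even improve the edits.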

Details

Motivation: Examine the potential for misuse of diffusion models (such as Stable Diffusion) in image generation, and the limitations of existing image protection methods.

Method: Experimentally evaluate several perturbation-based image protection methods across domains (natural scene images and artworks) and editing tasks (image-to-image generation and style editing).

Result: In most cases, protected images can still be edited successfully by diffusion models, and the perturbations may even improve the edits.

Conclusion: Perturbation-based methods are not sufficient to provide robust image protection against diffusion-based editing.

Abstract: The remarkable image generation capabilities of state-of-the-art diffusion models, such as Stable Diffusion, can also be misused to spread misinformation and plagiarize copyrighted materials. To mitigate the potential risks associated with image editing, current image protection methods rely on adding imperceptible perturbations to images to obstruct diffusion-based editing. A fully successful protection for an image implies that the output of editing attempts is an undesirable, noisy image which is completely unrelated to the reference image. In our experiments with various perturbation-based image protection methods across multiple domains (natural scene images and artworks) and editing tasks (image-to-image generation and style editing), we discover that such protection does not achieve this goal completely. In most scenarios, diffusion-based editing of protected images generates a desirable output image which adheres precisely to the guidance prompt. Our findings suggest that adding noise to images may paradoxically increase their association with given text prompts during the generation process, leading to unintended consequences such as better resultant edits. Hence, we argue that perturbation-based methods may not provide a sufficient solution for robust image protection against diffusion-based editing.

[11] Normalize Filters! Classical Wisdom for Deep Vision

Gustavo Perez,Stella X. Yu

Main category: cs.CV

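TL;DR: The paper proposes filter normalization followed by learnable scaling and shifting (akin to batch normalization) to fix the distorted responses of learned convolutional filters under atmospheric transfer, significantly improving performance.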

Details

Motivation: Classical image filters are carefully normalized to ensure consistency and interpretability, whereas convolutional filters learned end-to-end in deep networks lack such constraints, so their responses become distorted under atmospheric transfer.

Method: Filter normalization combined with learnable scaling and shifting makes the filters atmosphere-equivariant. It applies to both convolutional neural networks and convolution-dependent vision transformers.

Result: The method yields significant improvements on artificial and natural intensity variation benchmarks, with ResNet34 even outperforming CLIP by a large margin.

Conclusion: Filter normalization regularizes learning, promotes diversity, and improves robustness and generalization.

Abstract: Classical image filters, such as those for averaging or differencing, are carefully normalized to ensure consistency, interpretability, and to avoid artifacts like intensity shifts, halos, or ringing. In contrast, convolutional filters learned end-to-end in deep networks lack such constraints. Although they may resemble wavelets and blob/edge detectors, they are not normalized in the same or any way. Consequently, when images undergo atmospheric transfer, their responses become distorted, leading to incorrect outcomes. We address this limitation by proposing filter normalization, followed by learnable scaling and shifting, akin to batch normalization. This simple yet effective modification ensures that the filters are atmosphere-equivariant, enabling co-domain symmetry. By integrating classical filtering principles into deep learning (applicable to both convolutional neural networks and convolution-dependent vision transformers), our method achieves significant improvements on artificial and natural intensity variation benchmarks. Our ResNet34 could even outperform CLIP by a large margin. Our analysis reveals that unnormalized filters degrade performance, whereas filter normalization regularizes learning, promotes diversity, and improves robustness and generalization.
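
The abstract does not spell out the normalization itself. The sketch below therefore assumes one classical choice: scale the positive and the negative weights of a filter separately so that each part sums to one in magnitude, which is how averaging and differencing kernels are conventionally normalized. It then checks how the responses behave under an intensity change of the form `a * x + b`, used here as a simple stand-in for atmospheric transfer. The function names and numbers are hypothetical.

```python
def normalize_filter(weights):
    """Assumed scheme: positive and negative parts each sum to 1 in magnitude."""
    positive_sum = sum(w for w in weights if w > 0)
    negative_sum = -sum(w for w in weights if w < 0)
    normalized = []
    for w in weights:
        if w > 0:
            normalized.append(w / positive_sum)
        elif w < 0:
            normalized.append(w / negative_sum)
        else:
            normalized.append(0.0)
    return normalized


def respond(weights, patch, scale=1.0, shift=0.0):
    """Filter response followed by a learnable-style scale and shift."""
    return scale * sum(w * x for w, x in zip(weights, patch)) + shift


patch = [0.2, 0.5, 0.9]
a, b = 2.0, 0.3  # gain and offset of the intensity change
changed = [a * x + b for x in patch]

edge = normalize_filter([3.0, -1.0, -2.0])  # differencing filter, sums to 0
blur = normalize_filter([1.0, 2.0, 1.0])    # averaging filter, sums to 1

# Differencing response ignores the offset and scales with the gain.
print(respond(edge, changed), a * respond(edge, patch))
# Averaging response undergoes the same gain and offset as the image.
print(respond(blur, changed), a * respond(blur, patch) + b)
```

With this normalization the responses change predictably with the input intensities, which is the kind of equivariance the abstract refers to. An unnormalized filter whose weights sum to an arbitrary value would mix the offset into its response in an uncontrolled way.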

[12] Photoreal Scene Reconstruction from an Egocentric Device

Zhaoyang Lv,Maurizio Monge,Ka Chen,Yufeng Zhu,Michael Goesele,Jakob Engel,Zhao Dong,Richard Newcombe

Main category: cs.CV

TL;DR: 论文研究了使用自我中心设备进行高动态范围场景真实感重建的挑战，提出了基于视觉-惯性束调整（VIBA）的高频轨迹校准方法和基于高斯泼溅的物理图像形成模型，显著提升了重建质量。

Details

Motivation: Existing methods typically assume frame-rate 6DoF poses estimated by the device's visual-inertial odometry system, which may miss details that are crucial for pixel-accurate reconstruction.

Method: (1) Use VIBA to calibrate the high-frequency trajectory of the rolling-shutter RGB camera; (2) incorporate a physical image formation model into the Gaussian Splatting representation.

Result: Experiments show that VIBA yields a PSNR gain of 1 dB and the physical image formation model adds a further 1 dB.

Conclusion: The proposed method significantly improves photorealistic reconstruction quality and applies to several variants of the Gaussian Splatting representation.

Abstract: In this paper, we investigate the challenges associated with using egocentric devices to photorealistically reconstruct scenes in high dynamic range. Existing methodologies typically assume the use of frame-rate 6DoF poses estimated from the device's visual-inertial odometry system, which may neglect crucial details necessary for pixel-accurate reconstruction. This study presents two significant findings. Firstly, in contrast to mainstream work that treats the RGB camera as a global-shutter frame-rate camera, we emphasize the importance of employing visual-inertial bundle adjustment (VIBA) to calibrate the precise timestamps and movement of the rolling-shutter RGB sensing camera in a high-frequency trajectory format, which ensures an accurate calibration of the physical properties of the rolling-shutter camera. Secondly, we incorporate a physical image formation model into Gaussian Splatting, which effectively addresses the sensor characteristics, including the rolling-shutter effect of RGB cameras and the dynamic ranges measured by sensors. Our proposed formulation is applicable to the widely-used variants of Gaussian Splats representation. We conduct a comprehensive evaluation of our pipeline using the open-source Project Aria device under diverse indoor and outdoor lighting conditions, and further validate it on a Meta Quest3 device. Across all experiments, we observe a consistent visual enhancement of +1 dB in PSNR by incorporating VIBA, with an additional +1 dB achieved through our proposed image formation model. Our complete implementation, evaluation datasets, and recording profile are available at http://www.projectaria.com/photoreal-reconstruction/

[13] HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation

Hermann Kumbong,Xian Liu,Tsung-Yi Lin,Ming-Yu Liu,Xihui Liu,Ziwei Liu,Daniel Y. Fu,Christopher Ré,David W. Romero

Main category: cs.CV

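TL;DR: HMAR is a new image generation algorithm that addresses the problems of parallel generation in VAR, delivering higher-quality images and faster sampling.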

Details

Motivation: VAR suffers from reduced quality when generating tokens in parallel, sequence lengths that grow superlinearly with resolution, and a sampling schedule that cannot be changed without retraining. HMAR aims to solve these problems.

Method: HMAR generates images scale by scale using a Markovian formulation and multi-step masked generation, combined with efficient block-sparse attention kernels.

Result: On the ImageNet 256x256 and 512x512 benchmarks, HMAR matches or outperforms VAR, diffusion, and autoregressive baselines, with training and inference more than 2.5x and 1.75x faster than VAR, respectively.

Conclusion: HMAR improves both the quality and the efficiency of image generation, and also offers a flexible sampling schedule and zero-shot image editing.

Abstract: Visual Auto-Regressive modeling (VAR) has shown promise in bridging the speed and quality gap between autoregressive image models and diffusion models. VAR reformulates autoregressive modeling by decomposing an image into successive resolution scales. During inference, an image is generated by predicting all the tokens in the next (higher-resolution) scale, conditioned on all tokens in all previous (lower-resolution) scales. However, this formulation suffers from reduced image quality due to the parallel generation of all tokens in a resolution scale; has sequence lengths scaling superlinearly in image resolution; and requires retraining to change the sampling schedule. We introduce Hierarchical Masked Auto-Regressive modeling (HMAR), a new image generation algorithm that alleviates these issues using next-scale prediction and masked prediction to generate high-quality images with fast sampling. HMAR reformulates next-scale prediction as a Markovian process, wherein the prediction of each resolution scale is conditioned only on tokens in its immediate predecessor instead of the tokens in all predecessor resolutions. When predicting a resolution scale, HMAR uses a controllable multi-step masked generation procedure to generate a subset of the tokens in each step. On ImageNet 256x256 and 512x512 benchmarks, HMAR models match or outperform parameter-matched VAR, diffusion, and autoregressive baselines. We develop efficient IO-aware block-sparse attention kernels that allow HMAR to achieve faster training and inference times over VAR by over 2.5x and 1.75x respectively, as well as over 3x lower inference memory footprint. Finally, HMAR yields additional flexibility over VAR; its sampling schedule can be changed without further training, and it can be applied to image editing tasks in a zero-shot manner.
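
The difference between conditioning on all previous scales and conditioning only on the immediate predecessor can be quantified by counting conditioning tokens. The sketch below does that for an illustrative schedule of square token maps. The schedule and the function name are assumptions chosen for the example, and the count ignores everything else that affects runtime, such as the attention kernels mentioned in the abstract.

```python
def conditioning_tokens(side_lengths, markovian):
    """Number of tokens each scale is conditioned on, from the second scale on.

    side_lengths : side length of the square token map at each scale
    markovian    : if True, condition only on the immediately preceding scale
    """
    tokens = [side * side for side in side_lengths]
    counts = []
    for k in range(1, len(tokens)):
        if markovian:
            counts.append(tokens[k - 1])        # immediate predecessor only
        else:
            counts.append(sum(tokens[:k]))      # all coarser scales
    return counts


# Illustrative multi-scale schedule ending in a 16x16 token map.
schedule = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)
all_previous = conditioning_tokens(schedule, markovian=False)
predecessor_only = conditioning_tokens(schedule, markovian=True)
print(all_previous[-1], predecessor_only[-1])  # 424 169
```

For this schedule the final scale is conditioned on 424 tokens when all coarser scales are used and on 169 when only the predecessor is used, and the gap widens as more scales are added.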

[14] Towards Large-Scale Pose-Invariant Face Recognition Using Face Defrontalization

Patrik Mesec,Alan Jović

Main category: cs.CV

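TL;DR: The paper proposes "defrontalization", a method that augments the training dataset to improve face recognition under extreme head poses. It adds no time overhead at inference and outperforms existing methods on several public datasets.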

Details

Motivation: Current face recognition methods perform poorly under extreme head poses and depend on sophisticated methods (such as frontalization) and small datasets. The paper aims to improve real-world performance by augmenting training data through defrontalization.

Method: (1) Train a defrontalization model (FFWM) on a preprocessed dataset of frontal-profile pairs; (2) train a ResNet-50 feature extraction model with ArcFace loss on a large-scale dataset containing raw and randomly defrontalized images.

Result: Defrontalization outperforms existing methods on LFW, AgeDB, and CFP, but not on Multi-PIE at extreme poses (75 and 90 degrees), which suggests that current methods may be overfitted to small datasets.

Conclusion: Defrontalization is an effective training-data augmentation method that improves face recognition under extreme poses, although overfitting to small datasets remains a concern.

Abstract: Face recognition under extreme head poses is a challenging task. Ideally, a face recognition system should perform well across different head poses, which is known as pose-invariant face recognition. To achieve pose invariance, current approaches rely on sophisticated methods, such as face frontalization and various facial feature extraction model architectures. However, these methods are somewhat impractical in real-life settings and are typically evaluated on small scientific datasets, such as Multi-PIE. In this work, we propose the inverse method of face frontalization, called face defrontalization, to augment the training dataset of a facial feature extraction model. The method does not introduce any time overhead during the inference step. The method is composed of: 1) training an adapted face defrontalization FFWM model on a frontal-profile pairs dataset, which has been preprocessed using our proposed face alignment method; 2) training a ResNet-50 facial feature extraction model based on ArcFace loss on a raw and randomly defrontalized large-scale dataset, where defrontalization was performed with our previously trained face defrontalization model. Our method was compared with the existing approaches on four open-access datasets: LFW, AgeDB, CFP, and Multi-PIE. Defrontalization shows improved results compared to models without defrontalization, while the proposed adjustments show clear superiority over the state-of-the-art face frontalization FFWM method on three larger open-access datasets, but not on the small Multi-PIE dataset for extreme poses (75 and 90 degrees). The results suggest that at least some of the current methods may be overfitted to small datasets.
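
The ArcFace loss used to train the feature extractor adds an angular margin to the target class before the softmax. A minimal single-sample sketch is given below. It takes the cosine similarities between a normalized embedding and the normalized class centres as input; the defaults `scale=64.0` and `margin=0.5` are commonly used values and are assumed here, since the abstract does not state the paper's settings.

```python
import math


def arcface_loss(cosines, target, scale=64.0, margin=0.5):
    """Cross-entropy over logits with an additive angular margin on the target.

    cosines : cosine similarity between the embedding and each class centre
    target  : index of the ground-truth identity
    """
    logits = []
    for class_index, cosine in enumerate(cosines):
        cosine = max(-1.0, min(1.0, cosine))  # guard acos against rounding
        if class_index == target:
            theta = math.acos(cosine)
            logits.append(scale * math.cos(theta + margin))
        else:
            logits.append(scale * cosine)
    # Numerically stable log-sum-exp.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[target]


cosines = [0.8, 0.1, -0.2]
print(arcface_loss(cosines, target=0, scale=10.0, margin=0.5))
print(arcface_loss(cosines, target=0, scale=10.0, margin=0.0))
```

The margin makes the target logit smaller than plain cosine softmax would, so the network has to pull embeddings of the same identity closer together in angle to reach the same loss. Production implementations usually add special handling for the case where the angle plus the margin exceeds pi, which this sketch omits.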

[15] FALO: Fast and Accurate LiDAR 3D Object Detection on Resource-Constrained Devices

Shizhong Han,Hsin-Pai Cheng,Hong Cai,Jihad Masri,Soyeb Nagori,Fatih Porikli

Main category: cs.CV

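TL;DR: FALO is a hardware-friendly LiDAR 3D detection method that combines high accuracy with fast inference and is suited to resource-constrained edge devices.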

Details

Motivation: Existing LiDAR 3D detection methods rely on sparse convolutions or transformers, whose high computational cost and irregular memory access patterns make them hard to run on edge devices.

Method: FALO arranges sparse 3D voxels into a 1D sequence and processes it with ConvDotMix blocks (large-kernel convolutions, Hadamard products, and linear layers), introducing implicit grouping to make inference more efficient.

Result: FALO performs competitively on the nuScenes and Waymo benchmarks, with inference 1.6x to 9.8x faster than the latest SOTA.

Conclusion: FALO is an efficient LiDAR 3D detection method for edge devices that offers both high accuracy and fast inference.

Abstract: Existing LiDAR 3D object detection methods predominantly rely on sparse convolutions and/or transformers, which can be challenging to run on resource-constrained edge devices, due to irregular memory access patterns and high computational costs. In this paper, we propose FALO, a hardware-friendly approach to LiDAR 3D detection, which offers both state-of-the-art (SOTA) detection accuracy and fast inference speed. More specifically, given the 3D point cloud and after voxelization, FALO first arranges sparse 3D voxels into a 1D sequence based on their coordinates and proximity. The sequence is then processed by our proposed ConvDotMix blocks, consisting of large-kernel convolutions, Hadamard products, and linear layers. ConvDotMix provides sufficient mixing capability in both spatial and embedding dimensions, and introduces higher-order nonlinear interaction among spatial features. Furthermore, when going through the ConvDotMix layers, we introduce implicit grouping, which balances the tensor dimensions for more efficient inference and takes into account the growing receptive field. All these operations are friendly to run on resource-constrained platforms, and the proposed FALO can be readily deployed on compact, embedded devices. Our extensive evaluation on LiDAR 3D detection benchmarks such as nuScenes and Waymo shows that FALO achieves competitive performance. Meanwhile, FALO is 1.6x to 9.8x faster than the latest SOTA on mobile Graphics Processing Unit (GPU) and mobile Neural Processing Unit (NPU).

[16] AuthGuard: Generalizable Deepfake Detection via Language Guidance

Guangyu Shen,Zhihua Li,Xiang Xu,Tianchen Zhao,Zheng Zhang,Dongsheng An,Zhuowen Tu,Yifan Xing,Qin Zhang

Main category: cs.CV

TL;DR: AuthGuard combines language guidance with a vision encoder to improve the generalization and accuracy of deepfake detection.

Details

Motivation: Existing deepfake detection techniques struggle to keep up with constantly evolving forgery methods because they rely on statistical artifacts learned during training, which may not generalize to new forgery methods.

Method: A vision encoder is trained by combining discriminative classification with image-text contrastive learning, and data uncertainty learning is integrated to reduce noise.

Result: AuthGuard achieves the best performance on multiple datasets, with substantial AUC gains.

Conclusion: Through language guidance and vision-language learning, AuthGuard significantly improves the generalization and interpretability of deepfake detection.

Abstract: Existing deepfake detection techniques struggle to keep up with the ever-evolving novel, unseen forgery methods. This limitation stems from their reliance on statistical artifacts learned during training, which are often tied to specific generation processes that may not be representative of samples from new, unseen deepfake generation methods encountered at test time. We propose that incorporating language guidance can improve deepfake detection generalization by integrating human-like commonsense reasoning -- such as recognizing logical inconsistencies and perceptual anomalies -- alongside statistical cues. To achieve this, we train an expert deepfake vision encoder by combining discriminative classification with image-text contrastive learning, where the text is generated by generalist MLLMs using few-shot prompting. This allows the encoder to extract both language-describable, commonsense deepfake artifacts and statistical forgery artifacts from pixel-level distributions. To further enhance robustness, we integrate data uncertainty learning into vision-language contrastive learning, mitigating noise in image-text supervision. Our expert vision encoder seamlessly interfaces with an LLM, further enabling more generalized and interpretable deepfake detection while also boosting accuracy. The resulting framework, AuthGuard, achieves state-of-the-art deepfake detection accuracy in both in-distribution and out-of-distribution settings, achieving AUC gains of 6.15% on the DFDC dataset and 16.68% on the DF40 dataset. Additionally, AuthGuard significantly enhances deepfake reasoning, improving performance by 24.69% on the DDVQA dataset.

[17] Pruning Everything, Everywhere, All at Once

Gustavo Henrique do Nascimento,Ian Pons,Anna Helena Reali Costa,Artur Jordao

Main category: cs.CV

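TL;DR: Proposes a new method that prunes neurons and layers simultaneously, selecting the best subnetwork by representation similarity; it substantially improves computational efficiency and model sparsity while preserving predictive ability.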

Details

Motivation: Deep learning models are complex and computationally expensive, and existing pruning methods target either neurons or layers but not both at once. Method: The better subnetwork is selected by representation similarity (Centered Kernel Alignment), and neurons and layers are pruned iteratively. Result: The method performs strongly on standard architectures and benchmarks, with large FLOPs reductions (e.g., 86.37% on ResNet56) and improved model robustness. Conclusion: The method opens a new direction for pruning, substantially lowering computational cost and carbon emissions and advancing GreenAI. Abstract: Deep learning stands as the modern paradigm for solving cognitive tasks. However, as the problem complexity increases, models grow deeper and computationally prohibitive, hindering advancements in real-world and resource-constrained applications. Extensive studies reveal that pruning structures in these models efficiently reduces model complexity and improves computational efficiency. Successful strategies in this sphere include removing neurons (i.e., filters, heads) or layers, but not both together. Therefore, simultaneously pruning different structures remains an open problem. To fill this gap and leverage the benefits of eliminating neurons and layers at once, we propose a new method capable of pruning different structures within a model as follows. Given two candidate subnetworks (pruned models), one from layer pruning and the other from neuron pruning, our method decides which to choose by selecting the one with the highest representation similarity to its parent (the network that generates the subnetworks) using the Centered Kernel Alignment metric. Iteratively repeating this process provides highly sparse models that preserve the original predictive ability. Through extensive experiments on standard architectures and benchmarks, we confirm the effectiveness of our approach and show that it outperforms state-of-the-art layer and filter pruning techniques. At high levels of Floating Point Operations reduction, most state-of-the-art methods degrade accuracy, whereas our approach either improves it or experiences only a minimal drop. Notably, on the popular ResNet56 and ResNet110, we achieve a milestone of 86.37% and 95.82% FLOPs reduction, respectively. Besides, our pruned models obtain robustness to adversarial and out-of-distribution samples and take an important step towards GreenAI, reducing carbon emissions by up to 83.31%. Overall, we believe our work opens a new chapter in pruning.
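
The selection step described in the abstract (keep whichever candidate subnetwork is most similar to its parent under Centered Kernel Alignment) can be sketched as follows. This is a minimal illustration using the linear form of CKA on small feature matrices; the function names, the choice of linear CKA, and the toy features are assumptions for illustration, not the authors' implementation.

```python
import math

def center_columns(feats):
    """Subtract the per-column mean from an n x d feature matrix (list of rows)."""
    n, d = len(feats), len(feats[0])
    means = [sum(row[j] for row in feats) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in feats]

def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for matrices a (n x p) and b (n x q)."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            dot = sum(ra[i] * rb[j] for ra, rb in zip(a, b))
            total += dot * dot
    return total

def linear_cka(x, y):
    """Linear CKA between two feature matrices computed on the same n inputs."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(xc, yc)
    denominator = math.sqrt(cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc))
    return numerator / denominator

def pick_subnetwork(parent_feats, candidates):
    """Return the name of the candidate whose features are most similar to the parent
    (hypothetical helper; `candidates` maps a name to a feature matrix)."""
    return max(candidates, key=lambda name: linear_cka(parent_feats, candidates[name]))

# Toy features: 4 inputs, 2 feature dimensions each (made-up numbers).
parent = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0]]
candidates = {
    "layer_pruned": [[2.0 * v for v in row] for row in parent],  # rescaled copy of parent
    "neuron_pruned": [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
}
print(pick_subnetwork(parent, candidates))
```

Linear CKA is invariant to isotropic rescaling of the features, so the rescaled copy scores 1 and is selected here; in an iterative pruning loop the chosen candidate would become the parent for the next round.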

[18] EECD-Net: Energy-Efficient Crack Detection with Spiking Neural Networks and Gated Attention

Shuo Zhang

Main category: cs.CV

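TL;DR: Proposes EECD-Net, a multi-stage road crack detection method that combines SRCNN, SCU, and GAT modules to substantially improve detection accuracy and energy efficiency.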

Details

Motivation: Smart terminal devices struggle with real-time monitoring because of limited energy and low-resolution imaging, so an efficient, low-power crack detection method is needed. Method: SRCNN raises image resolution, the SCU lowers power consumption, and the GAT module fuses multi-scale features to make detection more robust. Result: On the CrackVision12K benchmark, the method reaches 98.6% detection accuracy while consuming only 5.6 mJ, 33% less than the baseline. Conclusion: EECD-Net offers a scalable, low-power solution for real-time crack detection in resource-constrained environments. Abstract: Crack detection on road surfaces is a critical measurement technology in the instrumentation domain, essential for ensuring infrastructure safety and transportation reliability. However, due to limited energy and low-resolution imaging, smart terminal devices struggle to maintain real-time monitoring performance. To overcome these challenges, this paper proposes a multi-stage detection approach for road crack detection, EECD-Net, to enhance accuracy and energy efficiency of instrumentation. Specifically, the sophisticated Super-Resolution Convolutional Neural Network (SRCNN) is employed to address the inherent challenges of low-quality images, which effectively enhances image resolution while preserving critical structural details. Meanwhile, a Spike Convolution Unit (SCU) with Continuous Integrate-and-Fire (CIF) neurons is proposed to convert these images into sparse pulse sequences, significantly reducing power consumption. Additionally, a Gated Attention Transformer (GAT) module is designed to strategically fuse multi-scale feature representations through adaptive attention mechanisms, effectively capturing both long-range dependencies and intricate local crack patterns, and significantly enhancing detection robustness across varying crack morphologies. The experiments on the CrackVision12K benchmark demonstrate that EECD-Net achieves a remarkable 98.6% detection accuracy, surpassing state-of-the-art counterparts such as Hybrid-Segmentor by a significant 1.5%. Notably, EECD-Net maintains exceptional energy efficiency, consuming merely 5.6 mJ, which is a substantial 33% reduction compared to baseline implementations. This work pioneers a transformative approach in instrumentation-based crack detection, offering a scalable, low-power solution for real-time, large-scale infrastructure monitoring in resource-constrained environments.
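
To make "converting inputs into sparse pulse sequences" concrete, here is a minimal sketch of a textbook integrate-and-fire encoder. It is a generic neuron model with a soft reset, chosen for illustration; it is not the paper's Continuous Integrate-and-Fire neuron or its Spike Convolution Unit, and the threshold and inputs are assumed values.

```python
def integrate_and_fire(intensities, threshold=1.0):
    """Encode a sequence of non-negative inputs as a binary spike train.

    Generic integrate-and-fire sketch (assumption): the membrane potential
    accumulates the input and emits a spike whenever it reaches the threshold.
    """
    potential = 0.0
    spikes = []
    for value in intensities:
        potential += value
        if potential >= threshold:
            spikes.append(1)
            potential -= threshold  # soft reset keeps the residual charge
        else:
            spikes.append(0)
    return spikes

# A weak, constant input produces a sparse spike train.
print(integrate_and_fire([0.125] * 8))
```

Weaker inputs take longer to reach the threshold and therefore fire less often, which is the intuition behind the sparsity, and hence the energy savings, usually attributed to spiking networks.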

[19] Enhancing Frequency for Single Image Super-Resolution with Learnable Separable Kernels

Heng Tian

Main category: cs.CV

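TL;DR: Proposes Learnable Separable Kernels (LSKs), a plug-and-play module that improves single-image super-resolution (SISR) by directly enhancing image frequency components while substantially reducing parameters and computation.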

Details

Motivation: Existing methods usually improve SISR performance indirectly (e.g., through specialized loss functions), whereas LSKs aim to enhance image quality directly from a frequency perspective. Method: LSKs are designed as rank-one matrices that decompose into orthogonal, mergeable one-dimensional kernels, reducing parameters and computation. Result: Experiments show that LSKs cut parameters and computation by more than 60% while improving model performance, with larger gains as the upscaling factor increases. Conclusion: LSKs are both efficient and effective, offering a direct and interpretable solution for SISR. Abstract: Existing approaches often enhance the performance of single-image super-resolution (SISR) methods by incorporating auxiliary structures, such as specialized loss functions, to indirectly boost the quality of low-resolution images. In this paper, we propose a plug-and-play module called Learnable Separable Kernels (LSKs), which are formally rank-one matrices designed to directly enhance image frequency components. We begin by explaining why LSKs are particularly suitable for SISR tasks from a frequency perspective. Baseline methods incorporating LSKs demonstrate a significant reduction of over 60% in both the number of parameters and computational requirements. This reduction is achieved through the decomposition of LSKs into orthogonal and mergeable one-dimensional kernels. Additionally, we perform an interpretable analysis of the feature maps generated by LSKs. Visualization results reveal the capability of LSKs to enhance image frequency components effectively. Extensive experiments show that incorporating LSKs not only reduces the number of parameters and computational load but also improves overall model performance. Moreover, these experiments demonstrate that models utilizing LSKs exhibit superior performance, particularly as the upscaling factor increases.
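
The parameter saving from a rank-one kernel follows from a standard identity: a k x k kernel that is the outer product of a column vector and a row vector gives the same result as a 1 x k pass followed by a k x 1 pass, so it needs 2k weights instead of k^2. The sketch below checks this on a toy image; the kernel values and helper names are illustrative assumptions, not the learned LSKs from the paper.

```python
def outer(col, row):
    """Rank-one k x k kernel built from a column vector and a row vector."""
    return [[c * r for r in row] for c in col]

def correlate2d_valid(image, kernel):
    """Plain 2D cross-correlation with no padding ('valid' output size)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def separable_correlate(image, col, row):
    """Horizontal pass with the 1 x k row kernel, then vertical pass with the k x 1 column kernel."""
    horizontal = correlate2d_valid(image, [row])
    return correlate2d_valid(horizontal, [[c] for c in col])

image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
col, row = [1, 2, 1], [-1, 0, 1]  # illustrative values, not learned weights

full = correlate2d_valid(image, outer(col, row))
separable = separable_correlate(image, col, row)
print(full == separable, len(col) * len(row), len(col) + len(row))
```

For this 3 x 3 example the weight count drops from 9 to 6, and the gap widens as k grows because k^2 increases faster than 2k.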

Yunhao Gou,Kai Chen,Zhili Liu,Lanqing Hong,Xin Jin,Zhenguo Li,James T. Kwok,Yu Zhang

Main category: cs.CV

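TL;DR: RACRO uses reinforcement learning to optimize the captions produced by a visual extractor so that they support complex reasoning in multi-modal large language models, avoiding costly multi-modal re-alignment.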

Details

Motivation: To address the high cost of vision-language alignment in multi-modal large language models while ensuring that the visual extractor's descriptions are both accurate and useful for reasoning. Method: Proposes RACRO, a reasoning-guided reinforcement learning strategy that optimizes the visual extractor's caption generation, closing the loop between perception and reasoning. Result: On multi-modal math and science benchmarks, RACRO achieves state-of-the-art performance and supports plug-and-play use of more advanced reasoning models. Conclusion: By optimizing visual caption generation, RACRO substantially improves the performance and scalability of multi-modal reasoning while lowering cost. Abstract: Recent advances in slow-thinking language models (e.g., OpenAI-o1 and DeepSeek-R1) have demonstrated remarkable abilities in complex reasoning tasks by emulating human-like reflective cognition. However, extending such capabilities to multi-modal large language models (MLLMs) remains challenging due to the high cost of retraining vision-language alignments when upgrading the underlying reasoner LLMs. A straightforward solution is to decouple perception from reasoning, i.e., converting visual inputs into language representations (e.g., captions) that are then passed to a powerful text-only reasoner. However, this decoupling introduces a critical challenge: the visual extractor must generate descriptions that are both faithful to the image and informative enough to support accurate downstream reasoning. To address this, we propose Reasoning-Aligned Perceptual Decoupling via Caption Reward Optimization (RACRO), a reasoning-guided reinforcement learning strategy that aligns the extractor's captioning behavior with the reasoning objective. By closing the perception-reasoning loop via reward-based optimization, RACRO significantly enhances visual grounding and extracts reasoning-optimized representations. Experiments on multi-modal math and science benchmarks show that the proposed RACRO method achieves state-of-the-art average performance while enabling superior scalability and plug-and-play adaptation to more advanced reasoning LLMs without the necessity for costly multi-modal re-alignment.

[21] LGM-Pose: A Lightweight Global Modeling Network for Real-time Human Pose Estimation

Biao Guo,Fangmin Guo,Guibo Luo,Xiaonan Luo,Feng Zhang

Main category: cs.CV

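TL;DR: Proposes a lightweight global modeling network (LGM-Pose) whose single-branch structure addresses the redundancy of multi-branch CNNs and their weak capture of global context.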

Details

Motivation: Current multi-branch CNNs for multi-person pose estimation have redundant structures and high latency, and they struggle to capture global context. Method: Designs a lightweight MobileViM Block with a LARM module that uses NPT-Op to extract global information, and introduces an SFusion module to integrate multi-scale information. Result: On the COCO and MPII datasets, the method reduces the parameter count while improving performance and speed. Conclusion: Through its single-branch structure and new module designs, LGM-Pose effectively addresses the shortcomings of existing methods. Abstract: Most current lightweight top-down multi-person pose estimation methods are based on multi-branch parallel pure-CNN architectures, which often struggle to capture the global context required for detecting semantically complex keypoints and are hindered by high latency due to their intricate and redundant structures. In this article, an approximate single-branch lightweight global modeling network (LGM-Pose) is proposed to address these challenges. In the network, a lightweight MobileViM Block is designed with a proposed Lightweight Attentional Representation Module (LARM), which integrates information within and between patches using the Non-Parametric Transformation Operation (NPT-Op) to extract global information. Additionally, a novel Shuffle-Integrated Fusion Module (SFusion) is introduced to effectively integrate multi-scale information, mitigating performance degradation often observed in single-branch structures. Experimental evaluations on the COCO and MPII datasets demonstrate that our approach not only reduces the number of parameters compared to existing mainstream lightweight methods but also achieves superior performance and faster processing speeds.

[22] Follow-Your-Creation: Empowering 4D Creation through Video Inpainting

Yue Ma,Kunyu Feng,Xinhua Zhang,Hongyu Liu,David Junhao Zhang,Jinbo Xing,Yinhan Zhang,Ayden Yang,Zeyu Wang,Qifeng Chen

Main category: cs.CV

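TL;DR: Follow-Your-Creation is a novel 4D video generation and editing framework that generates and edits content from a monocular video input, using a video inpainting model as a generative prior and combining depth-based rendering with masking to improve generation quality.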

Details

Motivation: Existing 4D video generation and editing methods lack consistency and flexibility when handling complex camera motion and user edits, so a more efficient and general solution is needed. Method: Depth-based rendering produces invisibility masks, which are combined with user editing masks to build a composite mask dataset; a video inpainting model is trained on it, and a self-iterative tuning strategy and a temporal packaging module improve generation quality. Result: The method generates 4D videos with multi-view consistency and high quality, supports prompt-based content editing, and outperforms existing methods. Conclusion: Through its masking technique and self-iterative training strategy, the Follow-Your-Creation framework substantially improves the flexibility and quality of 4D video generation and editing. Abstract: We introduce Follow-Your-Creation, a novel 4D video creation framework capable of both generating and editing 4D content from a single monocular video input. By leveraging a powerful video inpainting foundation model as a generative prior, we reformulate 4D video creation as a video inpainting task, enabling the model to fill in missing content caused by camera trajectory changes or user edits. To facilitate this, we generate composite masked inpainting video data to effectively fine-tune the model for 4D video generation. Given an input video and its associated camera trajectory, we first perform depth-based point cloud rendering to obtain invisibility masks that indicate the regions that should be completed. Simultaneously, editing masks are introduced to specify user-defined modifications, and these are combined with the invisibility masks to create a composite mask dataset. During training, we randomly sample different types of masks to construct diverse and challenging inpainting scenarios, enhancing the model's generalization and robustness in various 4D editing and generation tasks. To handle temporal consistency under large camera motion, we design a self-iterative tuning strategy that gradually increases the viewing angles during training, where the model is used to generate the next-stage training data after each fine-tuning iteration. Moreover, we introduce a temporal packaging module during inference to enhance generation quality. Our method effectively leverages the prior knowledge of the base model without degrading its original performance, enabling the generation of 4D videos with consistent multi-view coherence. In addition, our approach supports prompt-based content editing, demonstrating strong flexibility and significantly outperforming state-of-the-art methods in both quality and versatility.
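
As a small illustration of combining an invisibility mask with an editing mask, the sketch below takes their element-wise union for a single frame. The abstract does not say how the two masks are merged, so the union rule, the convention that 1 marks a pixel to inpaint, and the toy masks are all assumptions.

```python
def composite_mask(invisibility, editing):
    """Element-wise union of two binary masks of the same shape (1 = region to inpaint).

    Assumption: a pixel is inpainted if it is either newly revealed by the camera
    move (invisibility mask) or marked by the user (editing mask).
    """
    return [
        [int(inv or edit) for inv, edit in zip(inv_row, edit_row)]
        for inv_row, edit_row in zip(invisibility, editing)
    ]

invisibility = [[1, 0, 0], [0, 0, 0]]  # toy 2 x 3 frame
editing = [[0, 0, 1], [0, 1, 0]]
print(composite_mask(invisibility, editing))
```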

Ziqi Jia,Anmin Wang,Xiaoyang Qu,Xiaowen Yang,Jianzong Wang

Main category: cs.CV

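TL;DR: The paper proposes a hierarchical continual learning framework (HEC) and the Task-aware MoILE method, which address catastrophic forgetting through hierarchical learning and LoRA expert selection.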

Details

Motivation: Existing continual learning methods neglect high-level planning and multi-level knowledge learning; the HEC framework aims to address this. Method: HEC splits learning into a high level and a low level; Task-aware MoILE clusters visual-text embeddings and uses dual routing to select LoRA experts, with SVD used to preserve key parameters. Result: Experiments show that the method markedly reduces forgetting of old tasks and supports agents in continually learning new ones. Conclusion: The HEC framework and the Task-aware MoILE method effectively address catastrophic forgetting and improve continual learning ability. Abstract: Previous continual learning setups for embodied intelligence focused on executing low-level actions based on human commands, neglecting the ability to learn high-level planning and multi-level knowledge. To address these issues, we propose the Hierarchical Embodied Continual Learning Setups (HEC) that divide the agent's continual learning process into two layers: high-level instructions and low-level actions, and define five embodied continual learning sub-setups. Building on these setups, we introduce the Task-aware Mixture of Incremental LoRA Experts (Task-aware MoILE) method. This approach achieves task recognition by clustering visual-text embeddings and uses both a task-level router and a token-level router to select the appropriate LoRA experts. To effectively address the issue of catastrophic forgetting, we apply Singular Value Decomposition (SVD) to the LoRA parameters obtained from prior tasks, preserving key components while orthogonally training the remaining parts. The experimental results show that our method stands out in reducing the forgetting of old tasks compared to other methods, effectively supporting agents in retaining prior knowledge while continuously learning new tasks.
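
One common way to "train orthogonally" to preserved components is to remove from each update its projection onto the preserved directions. The sketch below assumes those directions are already available as orthonormal vectors (for example, leading singular vectors from an SVD computed elsewhere); the helper name and the projection rule are illustrative assumptions rather than the paper's exact procedure.

```python
def project_out(gradient, preserved_directions):
    """Remove from `gradient` its components along the preserved directions.

    Assumption: `preserved_directions` is a list of orthonormal vectors, e.g. the
    top singular vectors of earlier-task LoRA weights obtained from an SVD.
    """
    result = list(gradient)
    for direction in preserved_directions:
        coeff = sum(g * d for g, d in zip(gradient, direction))
        result = [r - coeff * d for r, d in zip(result, direction)]
    return result

gradient = [3.0, 4.0, 5.0]
preserved = [[1.0, 0.0, 0.0]]  # hypothetical direction kept from a prior task
print(project_out(gradient, preserved))
```

After projection the update has zero component along every preserved direction, so a gradient step leaves what was stored along those directions unchanged.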

[24] SmartAvatar: Text- and Image-Guided Human Avatar Generation with VLM AI Agents

Alexander Huang-Menders,Xinhang Liu,Andy Xu,Yuyao Zhang,Chi-Keung Tang,Yu-Wing Tai

Main category: cs.CV

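TL;DR: SmartAvatar is a vision-language-agent framework that generates animatable 3D human avatars from a single photo or text prompt, using large vision-language models and parametric human generators for high-quality customization.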

Details

Motivation: Existing diffusion methods struggle to precisely control identity, body shape, and animation readiness in 3D human avatar generation; SmartAvatar aims to solve this. Method: Combines vision-language models with parametric generators, iteratively adjusting generation parameters through an autonomous verification loop and supporting refinement through natural-language interaction. Result: The generated avatars are high quality, support pose manipulation, and outperform existing methods in mesh quality, identity fidelity, and animation readiness. Conclusion: SmartAvatar provides an efficient, customizable 3D avatar generation tool for consumer-grade hardware. Abstract: SmartAvatar is a vision-language-agent-driven framework for generating fully rigged, animation-ready 3D human avatars from a single photo or textual prompt. While diffusion-based methods have made progress in general 3D object generation, they continue to struggle with precise control over human identity, body shape, and animation readiness. In contrast, SmartAvatar leverages the commonsense reasoning capabilities of large vision-language models (VLMs) in combination with off-the-shelf parametric human generators to deliver high-quality, customizable avatars. A key innovation is an autonomous verification loop, where the agent renders draft avatars, evaluates facial similarity, anatomical plausibility, and prompt alignment, and iteratively adjusts generation parameters for convergence. This interactive, AI-guided refinement process promotes fine-grained control over both facial and body features, enabling users to iteratively refine their avatars via natural-language conversations. Unlike diffusion models that rely on static pre-trained datasets and offer limited flexibility, SmartAvatar brings users into the modeling loop and ensures continuous improvement through an LLM-driven procedural generation and verification system. The generated avatars are fully rigged and support pose manipulation with consistent identity and appearance, making them suitable for downstream animation and interactive applications. Quantitative benchmarks and user studies demonstrate that SmartAvatar outperforms recent text- and image-driven avatar generation systems in terms of reconstructed mesh quality, identity fidelity, attribute accuracy, and animation readiness, making it a versatile tool for realistic, customizable avatar creation on consumer-grade hardware.

[25] Perfecting Depth: Uncertainty-Aware Enhancement of Metric Depth

Jinyoung Jun,Lei Chu,Jiahao Li,Yan Lu,Chang-Su Kim

Main category: cs.CV

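TL;DR: Proposes Perfecting Depth, a two-stage framework for sensor depth enhancement that combines stochastic uncertainty modeling with deterministic refinement to produce high-quality depth maps.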

Details

Motivation: To handle unreliable regions in sensor depth measurements while preserving geometric cues, improving the reliability and accuracy of depth maps. Method: The first stage (stochastic estimation) uses the stochasticity of diffusion models to detect unreliable regions and infer geometric structure; the second stage (deterministic refinement) uses the uncertainty map to enforce structural consistency and pixel-level accuracy. Result: Experiments show that the method produces dense, artifact-free depth maps and performs well across diverse real-world scenarios. Conclusion: The framework sets a new baseline for sensor depth enhancement, with applications in autonomous driving, robotics, and immersive technologies. Abstract: We propose a novel two-stage framework for sensor depth enhancement, called Perfecting Depth. This framework leverages the stochastic nature of diffusion models to automatically detect unreliable depth regions while preserving geometric cues. In the first stage (stochastic estimation), the method identifies unreliable measurements and infers geometric structure by leveraging a training-inference domain gap. In the second stage (deterministic refinement), it enforces structural consistency and pixel-level accuracy using the uncertainty map derived from the first stage. By combining stochastic uncertainty modeling with deterministic refinement, our method yields dense, artifact-free depth maps with improved reliability. Experimental results demonstrate its effectiveness across diverse real-world scenarios. Furthermore, theoretical analysis, various experiments, and qualitative visualizations validate its robustness and scalability. Our framework sets a new baseline for sensor depth enhancement, with potential applications in autonomous driving, robotics, and immersive technologies.

[26] Deep Learning Reforms Image Matching: A Survey and Outlook

Shihua Zhang,Zizhuo Li,Kaining Zhang,Yifan Lu,Yuxin Deng,Linfeng Tang,Xingyu Jiang,Jiayi Ma

Main category: cs.CV

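TL;DR: This survey reviews how deep learning has progressively reshaped the traditional image matching pipeline, both by replacing individual steps and by merging steps into end-to-end modules; it benchmarks representative methods and discusses future research directions.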

Details

Motivation: Traditional image matching pipelines perform poorly in complex scenes, and deep learning has markedly improved their robustness and accuracy; this survey aims to organize these advances systematically. Method: Comprehensively reviews the use of deep learning either to replace individual steps of the traditional pipeline (e.g., learnable detector-descriptors) or to merge steps into end-to-end modules (e.g., sparse matchers). Result: Representative methods are benchmarked on relative pose recovery, homography estimation, and visual localization, showing the advantages of deep learning. Conclusion: Deep learning has brought substantial improvements to image matching, but challenges remain, and future research should pursue further innovation. Abstract: Image matching, which establishes correspondences between two-view images to recover 3D structure and camera geometry, serves as a cornerstone in computer vision and underpins a wide range of applications, including visual localization, 3D reconstruction, and simultaneous localization and mapping (SLAM). Traditional pipelines composed of "detector-descriptor, feature matcher, outlier filter, and geometric estimator" falter in challenging scenarios. Recent deep-learning advances have significantly boosted both robustness and accuracy. This survey adopts a unique perspective by comprehensively reviewing how deep learning has incrementally transformed the classical image matching pipeline. Our taxonomy closely aligns with the traditional pipeline in two key aspects: i) the replacement of individual steps in the traditional pipeline with learnable alternatives, including learnable detector-descriptor, outlier filter, and geometric estimator; and ii) the merging of multiple steps into end-to-end learnable modules, encompassing middle-end sparse matcher, end-to-end semi-dense/dense matcher, and pose regressor. We first examine the design principles, advantages, and limitations of both aspects, and then benchmark representative methods on relative pose recovery, homography estimation, and visual localization tasks. Finally, we discuss open challenges and outline promising directions for future research. By systematically categorizing and evaluating deep learning-driven strategies, this survey offers a clear overview of the evolving image matching landscape and highlights key avenues for further innovation.

[27] Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations

Linjie Li,Mahtab Bigverdi,Jiawei Gu,Zixian Ma,Yinuo Yang,Ziang Li,Yejin Choi,Ranjay Krishna

Main category: cs.CV

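TL;DR: STARE is a new benchmark for evaluating multimodal large language models on spatial cognition tasks; it finds that models perform poorly on complex tasks, whereas humans become markedly faster with visual simulations.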

Details

Motivation: Existing AI benchmarks focus mainly on verbal reasoning and overlook the complexity of non-verbal, multi-step visual simulation; STARE is intended to fill this gap. Method: STARE contains 4K tasks covering geometric transformations, integrated spatial reasoning, and real-world spatial reasoning, and evaluates models through multi-step visual simulation. Result: Models do well on simple 2D tasks but perform close to chance on complex 3D tasks; humans become markedly faster with visual simulations, while models benefit inconsistently. Conclusion: Models still fall short in using visual simulation for complex spatial reasoning and need further improvement. Abstract: Spatial cognition is essential for human intelligence, enabling problem-solving through visual simulations rather than solely relying on verbal reasoning. However, existing AI benchmarks primarily assess verbal reasoning, neglecting the complexities of non-verbal, multi-step visual simulation. We introduce STARE (Spatial Transformations and Reasoning Evaluation), a benchmark designed to rigorously evaluate multimodal large language models on tasks better solved through multi-step visual simulation. STARE features 4K tasks spanning foundational geometric transformations (2D and 3D), integrated spatial reasoning (cube net folding and tangram puzzles), and real-world spatial reasoning (perspective and temporal reasoning), reflecting practical cognitive challenges like object assembly, mechanical diagram interpretation, and everyday spatial navigation. Our evaluations show that models excel at reasoning over simpler 2D transformations, but perform close to random chance on more complex tasks like 3D cube net folding and tangram puzzles that require multi-step visual simulations. Humans achieve near-perfect accuracy but take considerable time (up to 28.9s) on complex tasks, significantly speeding up (down by 7.5 seconds on average) with intermediate visual simulations. In contrast, models exhibit inconsistent performance gains from visual simulations, improving on most tasks but declining in specific cases like tangram puzzles (GPT-4o, o1) and cube net folding (Claude-3.5, Gemini-2.0 Flash), indicating that models may not know how to effectively leverage intermediate visual information.

[28] Text-Aware Real-World Image Super-Resolution via Diffusion Model with Joint Segmentation Decoders

Qiming Hu,Linlong Fan,Yiyan Luo,Yuhang Yu,Xiaojie Guo,Qingnan Fan

Main category: cs.CV

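TL;DR: TADiSR is a diffusion-based image super-resolution framework that uses text-aware attention and joint segmentation decoders to improve the structural fidelity of text regions and the natural detail of real-world images.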

Details

Motivation: Generative models often distort text structures in image super-resolution; TADiSR aims to solve this problem. Method: Combines text-aware attention with joint segmentation decoders and proposes a complete pipeline for synthesizing high-quality images. Result: Experiments show that TADiSR markedly improves text legibility and achieves the best performance on multiple metrics. Conclusion: TADiSR performs well in real-world scenarios, and the code is open source. Abstract: The introduction of generative models has significantly advanced image super-resolution (SR) in handling real-world degradations. However, they often incur fidelity-related issues, particularly distorting textual structures. In this paper, we introduce a novel diffusion-based SR framework, namely TADiSR, which integrates text-aware attention and joint segmentation decoders to recover not only natural details but also the structural fidelity of text regions in degraded real-world images. Moreover, we propose a complete pipeline for synthesizing high-quality images with fine-grained full-image text masks, combining realistic foreground text regions with detailed background content. Extensive experiments demonstrate that our approach substantially enhances text legibility in super-resolved images, achieving state-of-the-art performance across multiple evaluation metrics and exhibiting strong generalization to real-world scenarios. Our code is available at https://github.com/mingcv/TADiSR.

[29] FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion

Akide Liu,Zeyu Zhang,Zhexin Li,Xuehai Bai,Yizeng Han,Jiasheng Tang,Yuanjie Xing,Jichao Wu,Mingyang Yang,Weihua Chen,Jiahao He,Yuanyu He,Fan Wang,Gholamreza Haffari,Bohan Zhuang

Main category: cs.CV

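TL;DR: FPSAttention proposes a training-aware co-design of FP8 quantization and sparsity that markedly speeds up inference for video generation while preserving generation quality.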

Details

Motivation: Diffusion generative models excel at high-quality video generation, but slow inference and high computational demands limit practical use. Quantization and sparsity can each accelerate inference, but existing methods lack joint optimization, which degrades performance. Method: FPSAttention relies on three innovations: 1) a unified 3D tile-wise granularity that supports both quantization and sparsity; 2) a strategy that adapts to the noise schedule; 3) a hardware-friendly kernel design. Result: Tested on Wan2.1 models, FPSAttention achieves a 7.09x speedup for attention operations and a 4.96x end-to-end speedup for video generation without sacrificing generation quality. Conclusion: FPSAttention offers an efficient solution for video generation, markedly improving inference speed while preserving generation quality. Abstract: Diffusion generative models have become the standard for producing high-quality, coherent video content, yet their slow inference speeds and high computational demands hinder practical deployment. Although both quantization and sparsity can independently accelerate inference while maintaining generation quality, naively combining these techniques in existing training-free approaches leads to significant performance degradation due to the lack of joint optimization. We introduce FPSAttention, a novel training-aware co-design of FP8 quantization and sparsity for video generation, with a focus on the 3D bi-directional attention mechanism. Our approach features three key innovations: 1) A unified 3D tile-wise granularity that simultaneously supports both quantization and sparsity; 2) A denoising step-aware strategy that adapts to the noise schedule, addressing the strong correlation between quantization/sparsity errors and denoising steps; 3) A native, hardware-friendly kernel that leverages FlashAttention and is implemented with optimized Hopper architecture features for highly efficient execution. Trained on Wan2.1's 1.3B and 14B models and evaluated on the VBench benchmark, FPSAttention achieves a 7.09x kernel speedup for attention operations and a 4.96x end-to-end speedup for video generation compared to the BF16 baseline at 720p resolution, without sacrificing generation quality.
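
The two reported speedups can be related through Amdahl's law. The sketch below solves 1/S = (1 - f) + f/k for the fraction f of baseline runtime spent in the accelerated kernel, given the kernel speedup k and the end-to-end speedup S. It assumes that only the attention kernel is accelerated and that everything else runs at baseline speed; the abstract does not state this, so the resulting fraction is an illustrative estimate, not a figure from the paper.

```python
def accelerated_fraction(kernel_speedup, end_to_end_speedup):
    """Solve Amdahl's law 1/S = (1 - f) + f/k for the accelerated fraction f.

    Assumption: only the accelerated kernel changes speed; the rest of the
    pipeline runs exactly as in the baseline.
    """
    return (1.0 - 1.0 / end_to_end_speedup) / (1.0 - 1.0 / kernel_speedup)

# Speedups quoted in the abstract: 7.09x for the kernel, 4.96x end to end.
fraction = accelerated_fraction(7.09, 4.96)
print(round(fraction, 3))
```

Under that assumption, attention would account for roughly 93% of baseline runtime, which is consistent with attention dominating the cost of long video sequences.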

[30] Feature-Based Lie Group Transformer for Real-World Applications

Takayuki Komatsu,Yoshiyuki Ohmura,Kayato Nishitsunoi,Yasuo Kuniyoshi

Main category: cs.CV

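TL;DR: Proposes a method combining feature extraction and object segmentation that applies group decomposition theory to more realistic scenarios, addressing the inability of conventional representation learning to handle conditional independence.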

Details

Motivation: Conventional representation learning assumes that disentangled, independent feature axes make a good representation, but this cannot account for conditional independence. This study aims to solve the problem with group decomposition theory and extend it to real-world applications. Method: Combines feature extraction with object segmentation, replacing pixel translation with feature translation and defining object segmentation as grouping features under the same transformation. Result: The method is validated on a dataset containing real-world objects and backgrounds. Conclusion: The method may lead to a better understanding of how humans develop object recognition in the real world. Abstract: The main goal of representation learning is to acquire meaningful representations from real-world sensory inputs without supervision. Representation learning explains some aspects of human development. Various neural network (NN) models have been proposed that acquire empirically good representations. However, the formulation of a good representation has not been established. We recently proposed a method for categorizing changes between a pair of sensory inputs. A unique feature of this approach is that transformations between two sensory inputs are learned to satisfy algebraic structural constraints. Conventional representation learning often assumes that disentangled independent feature axes are a good representation; however, we found that such a representation cannot account for conditional independence. To overcome this problem, we proposed a new method using group decomposition in Galois algebra theory. Although this method is promising for defining a more general representation, it assumes pixel-to-pixel translation without feature extraction, and can only process low-resolution images with no background, which prevents real-world application. In this study, we provide a simple method to apply our group decomposition theory to a more realistic scenario by combining feature extraction and object segmentation. We replace pixel translation with feature translation and formulate object segmentation as grouping features under the same transformation. We validated the proposed method on a practical dataset containing both real-world objects and backgrounds. We believe that our model will lead to a better understanding of human development of object recognition in the real world.

[31] Interpretable Few-Shot Image Classification via Prototypical Concept-Guided Mixture of LoRA Experts

Zhong Ji,Rongshuai Wei,Jingren Liu,Yanwei Pang,Jungong Han

Main category: cs.CV

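TL;DR: The paper proposes a Few-Shot Prototypical Concept Classification (FSPCC) framework that uses parameter-efficient adaptation and multi-level feature fusion to address the weak performance of self-explainable models in data-scarce settings.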

Details

Motivation: Self-explainable models perform poorly in data-scarce settings, mainly because of parametric imbalance and representation misalignment. Method: Uses a Mixture of LoRA Experts (MoLE) for parameter-efficient adaptation and introduces cross-module concept guidance and a geometry-aware concept discrimination loss. Result: On six benchmarks, FSPCC clearly outperforms existing methods, with relative gains of 4.2%-8.7%. Conclusion: By coupling concept learning with few-shot adaptation, FSPCC achieves higher accuracy and better model interpretability. Abstract: Self-Explainable Models (SEMs) rely on Prototypical Concept Learning (PCL) to make their visual recognition processes more interpretable, but they often struggle in data-scarce settings where insufficient training samples lead to suboptimal performance. To address this limitation, we propose a Few-Shot Prototypical Concept Classification (FSPCC) framework that systematically mitigates two key challenges under low-data regimes: parametric imbalance and representation misalignment. Specifically, our approach leverages a Mixture of LoRA Experts (MoLE) for parameter-efficient adaptation, ensuring a balanced allocation of trainable parameters between the backbone and the PCL module. Meanwhile, cross-module concept guidance enforces tight alignment between the backbone's feature representations and the prototypical concept activation patterns. In addition, we incorporate a multi-level feature preservation strategy that fuses spatial and semantic cues across various layers, thereby enriching the learned representations and mitigating the challenges posed by limited data availability. Finally, to enhance interpretability and minimize concept overlap, we introduce a geometry-aware concept discrimination loss that enforces orthogonality among concepts, encouraging more disentangled and transparent decision boundaries. Experimental results on six popular benchmarks (CUB-200-2011, mini-ImageNet, CIFAR-FS, Stanford Cars, FGVC-Aircraft, and DTD) demonstrate that our approach consistently outperforms existing SEMs by a notable margin, with 4.2%-8.7% relative gains in 5-way 5-shot classification. These findings highlight the efficacy of coupling concept learning with few-shot adaptation to achieve both higher accuracy and clearer model interpretability, paving the way for more transparent visual recognition systems.
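
A loss that "enforces orthogonality among concepts" can take many forms; one simple, generic choice is to penalize the squared cosine similarity between every pair of concept vectors. The sketch below shows that generic penalty only. It is an assumption for illustration and not the paper's geometry-aware concept discrimination loss; the concept vectors are made-up values.

```python
import math

def normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def orthogonality_penalty(concepts):
    """Sum of squared cosine similarities over all pairs of distinct concept vectors.

    Generic penalty (assumption): zero when all concepts are mutually orthogonal,
    growing as concepts overlap in direction.
    """
    unit = [normalize(c) for c in concepts]
    penalty = 0.0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            cos = sum(a * b for a, b in zip(unit[i], unit[j]))
            penalty += cos * cos
    return penalty

print(orthogonality_penalty([[1.0, 0.0], [0.0, 2.0]]))  # orthogonal concepts
print(orthogonality_penalty([[1.0, 1.0], [2.0, 2.0]]))  # fully overlapping concepts
```

Minimizing such a term alongside the classification loss pushes concept vectors apart in direction, which is the usual motivation for orthogonality constraints in concept-based models.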

[32] Gen-n-Val: Agentic Image Data Generation and Validation

Jing-En Huang,I-Sheng Fang,Tzuhsuan Huang,Chih-Yu Wang,Jun-Cheng Chen

Main category: cs.CV

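TL;DR: Gen-n-Val is a new data generation framework that combines Layer Diffusion, LLMs, and VLLMs to address multiple objects per mask, inaccurate segmentation, and incorrect labels in synthetic data, markedly improving instance segmentation and object detection performance.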

Details

Motivation: Data scarcity and label noise are serious problems in computer vision tasks, and existing synthetic data generation methods suffer from multiple objects per mask, inaccurate segmentation, and incorrect labels. Method: Gen-n-Val consists of two agents: the LD prompt agent (an LLM) optimizes prompts to generate high-quality single-object masks, and the data validation agent (a VLLM) filters out low-quality data. The system prompts are refined with TextGrad, and image harmonization is used to combine multiple instances. Result: Gen-n-Val reduces invalid synthetic data from 50% to 7%, improves COCO instance segmentation by 1% mAP, and improves open-vocabulary object detection by 7.1% mAP. Conclusion: Gen-n-Val markedly improves synthetic data quality and task performance, offering an efficient solution for computer vision tasks. Abstract: Recently, Large Language Models (LLMs) and Vision Large Language Models (VLLMs) have demonstrated impressive performance as agents across various tasks, while data scarcity and label noise remain significant challenges in computer vision tasks such as object detection and instance segmentation. A common solution for resolving these issues is to generate synthetic data. However, current synthetic data generation methods struggle with issues such as multiple objects per mask, inaccurate segmentation, and incorrect category labels, limiting their effectiveness. To address these issues, we introduce Gen-n-Val, a novel agentic data generation framework that leverages Layer Diffusion (LD), LLMs, and VLLMs to produce high-quality, single-object masks and diverse backgrounds. Gen-n-Val consists of two agents: (1) The LD prompt agent, an LLM, optimizes prompts for LD to generate high-quality foreground instance images and segmentation masks. These optimized prompts ensure the generation of single-object synthetic data with precise instance masks and clean backgrounds. (2) The data validation agent, a VLLM, filters out low-quality synthetic instance images. The system prompts for both agents are refined through TextGrad. Additionally, we use image harmonization to combine multiple instances within scenes. Compared to state-of-the-art synthetic data approaches like MosaicFusion, our approach reduces invalid synthetic data from 50% to 7% and improves performance by 1% mAP on rare classes in COCO instance segmentation with YOLOv9c and YOLO11m. Furthermore, Gen-n-Val shows significant improvements (7.1% mAP) over YOLO-Worldv2-M in open-vocabulary object detection benchmarks with YOLO11m. Moreover, Gen-n-Val improves the performance of YOLOv9 and YOLO11 families in instance segmentation and object detection.

[33] MARS: Radio Map Super-resolution and Reconstruction Method under Sparse Channel Measurements

Chuyun Deng,Na Liu,Wei Xie,Lianming Xu,Li Wang

Main category: cs.CV

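TL;DR: MARS is a multi-scale aware radio map super-resolution method that combines CNNs and Transformers, improving reconstruction accuracy through multi-scale feature fusion and residual connections.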

Details

Motivation: Traditional interpolation and inpainting methods lack environmental awareness, while deep learning approaches depend on detailed scene data, which limits generalization, so a more efficient method is needed. Method: Combines CNNs and Transformers with multi-scale feature fusion and residual connections, attending to both global and local feature extraction. Result: In experiments across different scenes and antenna locations, MARS outperforms baseline models in MSE and SSIM at low computational cost. Conclusion: MARS has strong practical potential and can reconstruct radio maps efficiently. Abstract: Radio maps reflect the spatial distribution of signal strength and are essential for applications like smart cities, IoT, and wireless network planning. However, reconstructing accurate radio maps from sparse measurements remains challenging. Traditional interpolation and inpainting methods lack environmental awareness, while many deep learning approaches depend on detailed scene data, limiting generalization. To address this, we propose MARS, a Multi-scale Aware Radiomap Super-resolution method that combines CNNs and Transformers with multi-scale feature fusion and residual connections. MARS focuses on both global and local feature extraction, enhancing feature representation across different receptive fields and improving reconstruction accuracy. Experiments across different scenes and antenna locations show that MARS outperforms baseline models in both MSE and SSIM, while maintaining low computational cost, demonstrating strong practical potential.

[34] HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model

Youngwan Lee,Kangsan Kim,Kwanyong Park,Ilcahe Jung,Soojin Jang,Seanie Lee,Yong-Ju Lee,Sung Ju Hwang

Main category: cs.CV

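TL;DR: The paper proposes the HoliSafe dataset and the SafeLLaVA model to address shortcomings in the safety of existing vision-language models (VLMs), including incomplete dataset coverage and a lack of architectural innovation.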

Details

Motivation: Existing safety-tuning datasets and benchmarks only partially consider the harmful content that image-text interactions can produce, and existing methods rely on data-centric tuning with little architectural innovation. Method: Introduces the HoliSafe dataset, which covers all five safe/unsafe image-text combinations, and proposes SafeLLaVA, a model with a learnable safety meta token and a dedicated safety head. Result: Experiments show that SafeLLaVA achieves state-of-the-art safety performance on multiple VLM benchmarks, and the HoliSafe benchmark reveals critical vulnerabilities in existing models. Conclusion: HoliSafe and SafeLLaVA offer new directions for future multimodal alignment research and advance work on robust, interpretable VLM safety. Abstract: Despite emerging efforts to enhance the safety of Vision-Language Models (VLMs), current approaches face two main shortcomings. 1) Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content, often overlooking contextually unsafe outcomes from seemingly benign pairs. This narrow coverage leaves VLMs vulnerable to jailbreak attacks in unseen configurations. 2) Prior methods rely primarily on data-centric tuning, with limited architectural innovations to intrinsically strengthen safety. We address these gaps by introducing a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations, providing a more robust basis for both training and evaluation. We further propose SafeLLaVA, a novel VLM augmented with a learnable safety meta token and a dedicated safety head. The meta token encodes harmful visual cues during training, intrinsically guiding the language model toward safer responses, while the safety head offers interpretable harmfulness classification aligned with refusal rationales. Experiments show that SafeLLaVA, trained on HoliSafe, achieves state-of-the-art safety performance across multiple VLM benchmarks. Additionally, the HoliSafe benchmark itself reveals critical vulnerabilities in existing models. We hope that HoliSafe and SafeLLaVA will spur further research into robust and interpretable VLM safety, expanding future avenues for multimodal alignment.

[35] Line of Sight: On Linear Representations in VLLMs

Achyuta Rajaram,Sarah Schwettmann,Jacob Andreas,Arthur Conmy

Main category: cs.CV

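TL;DR: The paper examines how image concepts are represented in the multimodal language model LlaVA-Next, finds linearly decodable features, and verifies their causality by editing model outputs. Training sparse autoencoders increases the diversity of the features studied and shows that multimodal representations become increasingly shared in deeper layers.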

Details

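Motivation: To study how multimodal language models represent image concepts in their hidden activations, in order to better understand the models' internal mechanisms. Method: Uses the LlaVA-Next model, analyzes linearly decodable features in its residual stream, verifies causality through targeted edits, and trains multimodal sparse autoencoders to increase feature diversity. Result: Finds linearly decodable features for ImageNet classes and verifies that the features are causal; multimodal representations become increasingly shared in deeper layers. Conclusion: Image representations in multimodal models are linearly decodable and causal, deep-layer features are increasingly shared across modalities, and sparse autoencoders can increase feature diversity. Abstract: Language models can be equipped with multimodal capabilities by fine-tuning on embeddings of visual inputs. But how do such multimodal models represent images in their hidden activations? We explore representations of image concepts within LlaVA-Next, a popular open-source VLLM. We find a diverse set of ImageNet classes represented via linearly decodable features in the residual stream. We show that the features are causal by performing targeted edits on the model output. In order to increase the diversity of the studied linear features, we train multimodal Sparse Autoencoders (SAEs), creating a highly interpretable dictionary of text and image features. We find that although model representations across modalities are quite disjoint, they become increasingly shared in deeper layers.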

[36] Robust Few-Shot Vision-Language Model Adaptation

Hanxin Wang,Tian Liu,Shu Kong

Main category: cs.CV

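TL;DR: The paper studies the robustness of pretrained vision-language models (VLMs) under few-shot adaptation and proposes SRAPF, a new two-stage finetuning method that significantly improves ID and OOD accuracy.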

Details

Motivation: Pretrained VLMs generalize poorly to OOD data under few-shot adaptation, so both ID and OOD accuracy need to improve. Method: Compares several adaptation methods (such as prompt tuning, linear probing, and contrastive finetuning) and proposes SRAPF, a two-stage finetuning method based on retrieval augmentation and adversarial perturbation. Result: SRAPF achieves state-of-the-art ID and OOD accuracy on the ImageNet OOD benchmarks. Conclusion: Partial finetuning of the visual encoder, combined with retrieval augmentation and adversarial perturbation, is an effective way to improve VLM adaptation. Abstract: Pretrained VLMs achieve strong performance on downstream tasks when adapted with just a few labeled examples. As the adapted models inevitably encounter out-of-distribution (OOD) test data that deviates from the in-distribution (ID) task-specific training data, enhancing OOD generalization in few-shot adaptation is critically important. We study robust few-shot VLM adaptation, aiming to increase both ID and OOD accuracy. By comparing different adaptation methods (e.g., prompt tuning, linear probing, contrastive finetuning, and full finetuning), we uncover three key findings: (1) finetuning with proper hyperparameters significantly outperforms the popular VLM adaptation methods prompt tuning and linear probing; (2) visual encoder-only finetuning achieves better efficiency and accuracy than contrastively finetuning both visual and textual encoders; (3) finetuning the top layers of the visual encoder provides the best balance between ID and OOD accuracy. Building on these findings, we propose partial finetuning of the visual encoder empowered with two simple augmentation techniques: (1) retrieval augmentation which retrieves task-relevant data from the VLM's pretraining dataset to enhance adaptation, and (2) adversarial perturbation which promotes robustness during finetuning. Results show that the former/latter boosts OOD/ID accuracy while slightly sacrificing the ID/OOD accuracy. Yet, perhaps understandably, naively combining the two does not maintain their best OOD/ID accuracy. We address this dilemma with the developed SRAPF, Stage-wise Retrieval Augmentation-based Adversarial Partial Finetuning. SRAPF consists of two stages: (1) partial finetuning the visual encoder using both ID and retrieved data, and (2) adversarial partial finetuning with few-shot ID data. Extensive experiments demonstrate that SRAPF achieves the state-of-the-art ID and OOD accuracy on the ImageNet OOD benchmarks.

[37] Towards Holistic Visual Quality Assessment of AI-Generated Videos: A LLM-Based Multi-Dimensional Evaluation Model

Zelu Qi,Ping Shi,Chaoyang Zhang,Shuqi Wang,Fei Zhao,Da Pan,Zefeng Ying

Main category: cs.CV

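TL;DR: The paper proposes an automatic visual quality assessment method for AI-generated videos (AIGVs) based on multi-dimensional features and a large language model (LLM), which took second place in the NTIRE 2025 challenge.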

Details

Motivation: AIGV technology is advancing rapidly but still has visual quality defects (such as noise, blurriness, and frame jitter) that hurt the viewing experience, so an effective automatic quality assessment method is needed. Method: Decomposes AIGV visual quality into three dimensions (technical quality, motion quality, and video semantics), designs a corresponding encoder to extract features for each, and introduces an LLM as the quality regression module, combined with multi-modal prompt engineering and LoRA fine-tuning. Result: Took second place in the NTIRE 2025 challenge, which supports the method's effectiveness. Conclusion: By combining multi-dimensional features with an LLM, the method significantly improves the accuracy of AIGV visual quality assessment and provides a useful tool for content regulation and generative model improvement. Abstract: The development of AI-Generated Video (AIGV) technology has been remarkable in recent years, significantly transforming the paradigm of video content production. However, AIGVs still suffer from noticeable visual quality defects, such as noise, blurriness, frame jitter and low dynamic degree, which severely impact the user's viewing experience. Therefore, an effective automatic visual quality assessment is of great importance for AIGV content regulation and generative model improvement. In this work, we decompose the visual quality of AIGVs into three dimensions: technical quality, motion quality, and video semantics. For each dimension, we design a corresponding encoder to achieve effective feature representation. Moreover, considering the outstanding performance of large language models (LLMs) in various vision and language tasks, we introduce an LLM as the quality regression module. To better enable the LLM to establish reasoning associations between multi-dimensional features and visual quality, we propose a specially designed multi-modal prompt engineering framework. Additionally, we incorporate LoRA fine-tuning technology during the training phase, allowing the LLM to better adapt to specific tasks. Our proposed method achieved second place in the NTIRE 2025 Quality Assessment of AI-Generated Content Challenge: Track 2 AI Generated video, demonstrating its effectiveness. Codes can be obtained at https://github.com/QiZelu/AIGVEval.

[38] Learning dissection trajectories from expert surgical videos via imitation learning with equivariant diffusion

Hongyu Wang,Yonghao Long,Yueyao Chen,Hon-Chi Yip,Markus Scheppach,Philip Wai-Yan Chiu,Yeung Yam,Helen Mei-Ling Meng,Qi Dou

Main category: cs.CV

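TL;DR: This paper proposes iDPOE, a new method that uses an implicit diffusion policy and equivariant representations to improve trajectory prediction in Endoscopic Submucosal Dissection (ESD), improving surgical skill training.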

Details

Motivation: Predicting dissection trajectories in ESD videos has the potential to enhance surgical skill training and simplify the learning process, but the area is underexplored. Existing imitation learning methods struggle with future uncertainty, geometric symmetries, and diverse surgical scenarios. Method: Proposes iDPOE, which combines a diffusion model with equivariant representations, models expert behavior through a joint state-action distribution, and uses a forward-process guided action inference strategy to handle state mismatches. Result: On a dataset of nearly 2000 ESD video clips, iDPOE surpasses existing methods in trajectory prediction. Conclusion: iDPOE is the first method to apply imitation learning to dissection trajectory prediction for surgical skill training, and it shows clear advantages. Abstract: Endoscopic Submucosal Dissection (ESD) is a well-established technique for removing epithelial lesions. Predicting dissection trajectories in ESD videos offers significant potential for enhancing surgical skill training and simplifying the learning process, yet this area remains underexplored. While imitation learning has shown promise in acquiring skills from expert demonstrations, challenges persist in handling uncertain future movements, learning geometric symmetries, and generalizing to diverse surgical scenarios. To address these, we introduce a novel approach: Implicit Diffusion Policy with Equivariant Representations for Imitation Learning (iDPOE). Our method models expert behavior through a joint state action distribution, capturing the stochastic nature of dissection trajectories and enabling robust visual representation learning across various endoscopic views. By incorporating a diffusion model into policy learning, iDPOE ensures efficient training and sampling, leading to more accurate predictions and better generalization. Additionally, we enhance the model's ability to generalize to geometric symmetries by embedding equivariance into the learning process. To address state mismatches, we develop a forward-process guided action inference strategy for conditional sampling. Using an ESD video dataset of nearly 2000 clips, experimental results show that our approach surpasses state-of-the-art methods, both explicit and implicit, in trajectory prediction. To the best of our knowledge, this is the first application of imitation learning to surgical skill development for dissection trajectory prediction.

[39] Using In-Context Learning for Automatic Defect Labelling of Display Manufacturing Data

Babar Hussain,Qiang Liu,Gang Chen,Bihai She,Dahai Yu

Main category: cs.CV

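TL;DR: This paper proposes an AI-based auto-labeling system for display panel defect detection that enhances the SegGPT architecture using in-context learning and introduces a scribble-based annotation mechanism. Experiments show that the system significantly outperforms the baseline model on industrial datasets, and that models trained on the auto-labeled data perform close to those trained on human-labeled data.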

Details

Motivation: To reduce the manual annotation workload in industrial inspection systems and improve the efficiency and accuracy of defect detection. Method: Adopts and enhances the SegGPT architecture, introduces a scribble-based annotation mechanism, and uses a two-stage training approach. Result: On industrial display panel datasets, average IoU increases by 0.22 and recall improves by 14%, with auto-labeling coverage of about 60%. Conclusion: The system offers a practical auto-labeling solution for industrial inspection and significantly reduces the need for manual annotation. Abstract: This paper presents an AI-assisted auto-labeling system for display panel defect detection that leverages in-context learning capabilities. We adopt and enhance the SegGPT architecture with several domain-specific training techniques and introduce a scribble-based annotation mechanism to streamline the labeling process. Our two-stage training approach, validated on industrial display panel datasets, demonstrates significant improvements over the baseline model, achieving an average IoU increase of 0.22 and a 14% improvement in recall across multiple product types, while maintaining approximately 60% auto-labeling coverage. Experimental results show that models trained on our auto-labeled data match the performance of those trained on human-labeled data, offering a practical solution for reducing manual annotation efforts in industrial inspection systems.
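
The two metrics reported above, IoU and recall, can be computed for a single predicted defect mask as in the minimal sketch below. It assumes binary masks stored as nested lists; the function name `iou_and_recall` is illustrative and is not taken from the paper.

```python
def iou_and_recall(pred, truth):
    """Compute IoU and recall for two equal-sized binary masks (nested lists of 0/1)."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1          # defect pixel correctly labelled
            elif p and not t:
                fp += 1          # labelled as defect, but background
            elif t and not p:
                fn += 1          # defect pixel that was missed
    union = tp + fp + fn
    # Convention assumed here: two empty masks count as a perfect match.
    iou = tp / union if union else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return iou, recall


pred = [[1, 1], [0, 0]]
truth = [[1, 0], [1, 0]]
iou, recall = iou_and_recall(pred, truth)
```

In this toy case one pixel is shared, one is a false positive, and one is missed, so the IoU is 1/3 and the recall is 1/2.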

[40] Bridging Annotation Gaps: Transferring Labels to Align Object Detection Datasets

Mikhail Kennerley,Angelica Alives-Reviro,Carola-Bibiane Schönlieb,Robby T. Tan

Main category: cs.CV

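TL;DR: LAT is a label-aligned transfer framework that resolves annotation inconsistencies across multiple datasets through pseudo-label generation and feature fusion, significantly improving object detection performance.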

Details

Motivation: Combining multiple datasets can improve generalization, but inconsistent class semantics and bounding box annotations stand in the way. Existing methods either assume shared label taxonomies or require manual relabelling, and cannot meet the need for a fixed target label space. Method: LAT trains dataset-specific detectors to generate pseudo-labels, then uses a Privileged Proposal Generator (PPG) and a Semantic Feature Fusion (SFF) module to align label spaces and refine features. Result: LAT significantly improves target-domain detection performance across multiple benchmarks, with gains of up to 4.8 AP. Conclusion: LAT resolves class and bounding box inconsistencies without shared label spaces or manual annotation, and is suitable for training on heterogeneous datasets. Abstract: Combining multiple object detection datasets offers a path to improved generalisation but is hindered by inconsistencies in class semantics and bounding box annotations. Some methods to address this assume shared label taxonomies and address only spatial inconsistencies; others require manual relabelling, or produce a unified label space, which may be unsuitable when a fixed target label space is required. We propose Label-Aligned Transfer (LAT), a label transfer framework that systematically projects annotations from diverse source datasets into the label space of a target dataset. LAT begins by training dataset-specific detectors to generate pseudo-labels, which are then combined with ground-truth annotations via a Privileged Proposal Generator (PPG) that replaces the region proposal network in two-stage detectors. To further refine region features, a Semantic Feature Fusion (SFF) module injects class-aware context and features from overlapping proposals using a confidence-weighted attention mechanism. This pipeline preserves dataset-specific annotation granularity while enabling many-to-one label space transfer across heterogeneous datasets, resulting in a semantically and spatially aligned representation suitable for training a downstream detector. LAT thus jointly addresses both class-level misalignments and bounding box inconsistencies without relying on shared label spaces or manual annotations. Across multiple benchmarks, LAT demonstrates consistent improvements in target-domain detection performance, achieving gains of up to +4.8 AP over semi-supervised baselines.

[41] SRD: Reinforcement-Learned Semantic Perturbation for Backdoor Defense in VLMs

Shuhan Xu,Siyuan Liang,Hongling Zheng,Yong Luo,Aishan Liu,Dacheng Tao

Main category: cs.CV

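TL;DR: The paper proposes Semantic Reward Defense (SRD), a reinforcement learning framework that defends vision-language models (VLMs) against backdoor attacks and lowers attack success rates without prior knowledge of the triggers.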

Details

Motivation: Vision-language models perform well on image captioning but are vulnerable to backdoor attacks, in which attackers inject small perturbations that make the model output malicious captions. These attacks are stealthy and cross-modal, which makes them hard to detect and defend against. Method: Proposes the SRD framework, which uses a Deep Q-Network to learn to apply discrete perturbations (such as occlusion and color masking) to sensitive image regions, with a semantic fidelity score as the reward signal that guides the model toward robust and accurate captions. Result: Experiments show that SRD reduces the attack success rate to 5.6% while preserving caption quality on clean inputs, with a performance drop below 10%. Conclusion: SRD offers an interpretable defense paradigm against stealthy backdoor threats in multimodal generative models that needs no prior knowledge of triggers. Abstract: Vision-Language Models (VLMs) have achieved remarkable performance in image captioning, but recent studies show they are vulnerable to backdoor attacks. Attackers can inject imperceptible perturbations, such as local pixel triggers or global semantic phrases, into the training data, causing the model to generate malicious, attacker-controlled captions for specific inputs. These attacks are hard to detect and defend due to their stealthiness and cross-modal nature. By analyzing attack samples, we identify two key vulnerabilities: (1) abnormal attention concentration on specific image regions, and (2) semantic drift and incoherence in generated captions. To counter this, we propose Semantic Reward Defense (SRD), a reinforcement learning framework that mitigates backdoor behavior without prior knowledge of triggers. SRD uses a Deep Q-Network to learn policies for applying discrete perturbations (e.g., occlusion, color masking) to sensitive image regions, aiming to disrupt the activation of malicious pathways. We design a semantic fidelity score as the reward signal, which jointly evaluates semantic consistency and linguistic fluency of the output, guiding the agent toward generating robust yet faithful captions. Experiments across mainstream VLMs and datasets show SRD reduces attack success rates to 5.6%, while preserving caption quality on clean inputs with less than 10% performance drop. SRD offers a trigger-agnostic, interpretable defense paradigm against stealthy backdoor threats in multimodal generative models.

[42] Physics Informed Capsule Enhanced Variational AutoEncoder for Underwater Image Enhancement

Niki Martinel,Rita Pucci

Main category: cs.CV

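TL;DR: Proposes a novel dual-stream architecture that combines a physical model with capsule clustering-based feature learning to achieve parameter-free underwater image enhancement, significantly improving performance while reducing computational complexity.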

Details

Motivation: Underwater image enhancement must respect the physical model while preserving semantic structure, and existing methods struggle to do both. Method: Uses a dual-stream architecture that learns features through a physics estimator and capsule clustering respectively, together with an optimization objective that ensures physical consistency and perceptual quality. Result: Across six benchmarks, PSNR improves by 0.5 dB with computational complexity reduced by two-thirds, or by more than 1 dB against methods with the same computational budget. Conclusion: The method outperforms existing techniques in both performance and efficiency and offers a new approach to underwater image enhancement. Abstract: We present a novel dual-stream architecture that achieves state-of-the-art underwater image enhancement by explicitly integrating the Jaffe-McGlamery physical model with capsule clustering-based feature representation learning. Our method simultaneously estimates transmission maps and spatially-varying background light through a dedicated physics estimator while extracting entity-level features via capsule clustering in a parallel stream. This physics-guided approach enables parameter-free enhancement that respects underwater formation constraints while preserving semantic structures and fine-grained details. Our approach also features a novel optimization objective ensuring both physical adherence and perceptual quality across multiple spatial frequencies. To validate our approach, we conducted extensive experiments across six challenging benchmarks. Results demonstrate consistent improvements of +0.5 dB PSNR over the best existing methods while requiring only one-third of their computational complexity (FLOPs), or alternatively, more than +1 dB PSNR improvement when compared to methods with similar computational budgets. Code and data will be available at https://github.com/iN1k1/.
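
To make the roles of the transmission map and background light concrete, the sketch below uses the simplified per-pixel formation model I = J·t + B·(1 − t) that is commonly derived from the Jaffe-McGlamery model. This is an assumption for illustration only: the abstract does not state which form the authors use, and in the paper t and B are estimated by a network rather than given. The function names and the `t_min` clamp are hypothetical.

```python
def form_image(scene, transmission, background):
    """Simplified formation model per channel: I = J * t + B * (1 - t)."""
    return [j * transmission + b * (1.0 - transmission)
            for j, b in zip(scene, background)]


def restore_pixel(observed, transmission, background, t_min=0.1):
    """Invert the model: J = (I - B * (1 - t)) / t, clamped to [0, 1].

    t_min is an assumed lower bound that avoids dividing by a near-zero
    transmission in very turbid regions.
    """
    t = max(transmission, t_min)
    return [min(1.0, max(0.0, (i - b * (1.0 - t)) / t))
            for i, b in zip(observed, background)]


scene = [0.8, 0.4, 0.2]          # true RGB radiance of one pixel
background = [0.1, 0.5, 0.7]     # bluish-green background light
observed = form_image(scene, 0.5, background)
restored = restore_pixel(observed, 0.5, background)
```

With exact t and B the inversion recovers the scene radiance, so the quality of the enhancement depends on how well those two quantities are estimated.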

Shenshen Li,Kaiyuan Deng,Lei Wang,Hao Yang,Chong Peng,Peng Yan,Fumin Shen,Heng Tao Shen,Xing Xu

Main category: cs.CV

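TL;DR: The paper proposes RAP, a data selection method that identifies high-value cognitive samples, significantly reducing training data volume and computational cost while improving multi-modal reasoning ability.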

Details

Motivation: Multi-modal large language models are conventionally assumed to need large amounts of training data, which leads to data redundancy and high computational cost. The paper challenges this assumption, arguing that a small number of high-value samples is enough to trigger effective multi-modal reasoning. Method: Proposes RAP, which includes two complementary estimators (CDE and ACE) and a Difficulty-aware Replacement Module (DRM) for identifying and refining cognitive samples. Result: Experiments on six datasets show that RAP achieves better performance with only 9.3% of the training data and reduces computational cost by over 43%. Conclusion: Through efficient data selection, RAP significantly improves the efficiency and performance of multi-modal reasoning and offers a new approach to model training. Abstract: While multi-modal large language models (MLLMs) have made significant progress in complex reasoning tasks via reinforcement learning, it is commonly believed that extensive training data is necessary for improving multi-modal reasoning ability, inevitably leading to data redundancy and substantial computational costs. However, can smaller high-value datasets match or outperform full corpora for multi-modal reasoning in MLLMs? In this work, we challenge this assumption through a key observation: meaningful multi-modal reasoning is triggered by only a sparse subset of training samples, termed cognitive samples, whereas the majority contribute marginally. Building on this insight, we propose a novel data selection paradigm termed Reasoning Activation Potential (RAP), which identifies cognitive samples by estimating each sample's potential to stimulate genuine multi-modal reasoning with two complementary estimators: 1) the Causal Discrepancy Estimator (CDE), which is based on the potential outcome model principle and eliminates samples that overly rely on language priors by comparing outputs between multi-modal and text-only inputs; 2) the Attention Confidence Estimator (ACE), which exploits token-level self-attention to discard samples dominated by irrelevant but over-emphasized tokens in intermediate reasoning stages. Moreover, we introduce a Difficulty-aware Replacement Module (DRM) to substitute trivial instances with cognitively challenging ones, thereby ensuring complexity for robust multi-modal reasoning. Experiments on six datasets show that our RAP method consistently achieves superior performance using only 9.3% of the training data, while reducing computational costs by over 43%. Our code is available at https://github.com/Leo-ssl/RAP.

[44] Toward Better SSIM Loss for Unsupervised Monocular Depth Estimation

Yijun Cao,Fuya Luo,Yongjie Li

Main category: cs.CV

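TL;DR: This paper proposes a new form of SSIM that combines its components by addition rather than multiplication, improving training for unsupervised monocular depth learning.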

Details

Motivation: Conventional methods ignore how the different components of the SSIM function and their hyperparameters affect training, which limits performance. Method: Proposes a new form of SSIM that combines the luminance, contrast, and structural similarity components by addition instead of multiplication, and tunes the parameter combination. Result: On the KITTI-2015 dataset, the optimized SSIM loss function significantly outperforms the baseline. Conclusion: The new SSIM form produces smoother gradients and improves unsupervised depth estimation performance. Abstract: Unsupervised monocular depth learning generally relies on the photometric relation among temporally adjacent images. Most previous works use both mean absolute error (MAE) and the structural similarity index measure (SSIM) in its conventional form as the training loss. However, they ignore the effect of different components in the SSIM function and the corresponding hyperparameters on the training. To address these issues, this work proposes a new form of SSIM. Compared with the original SSIM function, the proposed new form uses addition rather than multiplication to combine the luminance, contrast, and structural similarity related components in SSIM. The loss function constructed with this scheme yields smoother gradients and achieves higher performance on unsupervised depth estimation. We conduct extensive experiments to determine the relatively optimal combination of parameters for our new SSIM. Based on the popular MonoDepth approach, the optimized SSIM loss function can remarkably outperform the baseline on the KITTI-2015 outdoor dataset.
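
The difference between the two combinations can be sketched on a single image patch. The three components below follow the standard SSIM definition; the equal weights in `ssim_additive` are an assumption made for illustration, since the abstract does not give the paper's tuned parameter combination.

```python
import math


def ssim_components(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Luminance, contrast, and structure terms for two flat patches in [0, 1]."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x, std_y = math.sqrt(var_x), math.sqrt(var_y)
    c3 = c2 / 2  # standard choice in the original SSIM definition
    luminance = (2 * mean_x * mean_y + c1) / (mean_x ** 2 + mean_y ** 2 + c1)
    contrast = (2 * std_x * std_y + c2) / (var_x + var_y + c2)
    structure = (cov + c3) / (std_x * std_y + c3)
    return luminance, contrast, structure


def ssim_multiplicative(x, y):
    """Conventional SSIM: the product of the three components."""
    l, c, s = ssim_components(x, y)
    return l * c * s


def ssim_additive(x, y, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Additive variant: a weighted sum (weights are assumed, not from the paper)."""
    l, c, s = ssim_components(x, y)
    return weights[0] * l + weights[1] * c + weights[2] * s


patch_a = [0.2, 0.4, 0.6, 0.8]
patch_b = [0.8, 0.6, 0.4, 0.2]   # same values, reversed structure
product_score = ssim_multiplicative(patch_a, patch_b)
sum_score = ssim_additive(patch_a, patch_b)
```

In the product form, each component's gradient is scaled by the other two, so one near-zero term suppresses the rest; in the sum, each component contributes independently. That is one plausible reading of the smoother gradients mentioned in the abstract.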

[45] HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition

Suhan Woo,Seongwon Lee,Jinwoo Jang,Euntai Kim

Main category: cs.CV

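TL;DR: HypeVPR is a hierarchical embedding framework in hyperbolic space that addresses the challenges of P2E VPR, significantly improving retrieval speed and accuracy through hierarchical feature aggregation and a coarse-to-fine search strategy.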

Details

Motivation: Real-world Visual Place Recognition (VPR) must handle query images from many viewpoints, which makes the P2E formulation a natural choice, but existing methods do not fully exploit the hierarchical structure of panoramic images. Method: Proposes the HypeVPR framework, which uses hyperbolic space to represent hierarchical feature relationships and introduces a hierarchical feature aggregation mechanism and a coarse-to-fine search strategy. Result: HypeVPR outperforms existing methods on multiple benchmark datasets, with retrieval up to 5x faster. Conclusion: Through hierarchical representation in hyperbolic space and an efficient search strategy, HypeVPR offers a better solution for P2E VPR. Abstract: When applying Visual Place Recognition (VPR) to real-world mobile robots and similar applications, the perspective-to-equirectangular (P2E) formulation naturally emerges as a suitable approach to accommodate diverse query images captured from various viewpoints. In this paper, we introduce HypeVPR, a novel hierarchical embedding framework in hyperbolic space, designed to address the unique challenges of P2E VPR. The key idea behind HypeVPR is that visual environments captured by panoramic views exhibit inherent hierarchical structures. To leverage this property, we employ hyperbolic space to represent hierarchical feature relationships and preserve distance properties within the feature space. To achieve this, we propose a hierarchical feature aggregation mechanism that organizes local-to-global feature representations within hyperbolic space. Additionally, HypeVPR adopts an efficient coarse-to-fine search strategy, optimally balancing speed and accuracy to ensure robust matching, even between descriptors from different image types. This approach enables HypeVPR to outperform state-of-the-art methods while significantly reducing retrieval time, achieving up to 5x faster retrieval across diverse benchmark datasets. The code and models will be released at https://github.com/suhan-woo/HypeVPR.git.
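
As a small illustration of why hyperbolic space suits hierarchies, the sketch below computes the geodesic distance in the Poincaré ball, one common model of hyperbolic space. The abstract does not say which model HypeVPR uses, so the choice of the Poincaré ball and the function name are assumptions.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball.

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


# The same Euclidean gap of 0.1 is much longer near the boundary than near the origin.
near_origin = poincare_distance((0.0, 0.0), (0.1, 0.0))
near_boundary = poincare_distance((0.85, 0.0), (0.95, 0.0))
```

Because distances grow rapidly toward the boundary, coarse (global) descriptors can sit near the origin and fine (local) ones near the edge, which leaves room for tree-like structure.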

Gaia Di Lorenzo,Federico Tombari,Marc Pollefeys,Daniel Barath

Main category: cs.CV

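TL;DR: Object-X is a multi-modal 3D object representation framework that encodes rich information and decodes it into geometric and visual reconstructions, supports a range of downstream tasks, and is storage-efficient.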

Details

Motivation: Existing methods are usually designed for specific tasks and cannot support both geometric reconstruction and reuse across tasks. Object-X aims to solve this problem. Method: Embeds multi-modal information (such as images, point clouds, and text) in a 3D voxel grid and learns an unstructured embedding that fuses the voxels with object attributes. Result: On real-world datasets, Object-X achieves high-fidelity novel-view synthesis and improved geometric accuracy, and performs well on scene alignment and localization tasks. Conclusion: Object-X is an efficient, scalable solution for multi-modal 3D scene representation. Abstract: Learning effective multi-modal 3D representations of objects is essential for numerous applications, such as augmented reality and robotics. Existing methods often rely on task-specific embeddings that are tailored either for semantic understanding or geometric reconstruction. As a result, these embeddings typically cannot be decoded into explicit geometry and simultaneously reused across tasks. In this paper, we propose Object-X, a versatile multi-modal object representation framework capable of encoding rich object embeddings (e.g. images, point cloud, text) and decoding them back into detailed geometric and visual reconstructions. Object-X operates by geometrically grounding the captured modalities in a 3D voxel grid and learning an unstructured embedding fusing the information from the voxels with the object attributes. The learned embedding enables 3D Gaussian Splatting-based object reconstruction, while also supporting a range of downstream tasks, including scene alignment, single-image 3D object reconstruction, and localization. Evaluations on two challenging real-world datasets demonstrate that Object-X produces high-fidelity novel-view synthesis comparable to standard 3D Gaussian Splatting, while significantly improving geometric accuracy. Moreover, Object-X achieves competitive performance with specialized methods in scene alignment and localization. Critically, our object-centric descriptors require 3-4 orders of magnitude less storage compared to traditional image- or point cloud-based approaches, establishing Object-X as a scalable and highly practical solution for multi-modal 3D scene representation.

[47] LotusFilter: Fast Diverse Nearest Neighbor Search via a Learned Cutoff Table

Yusuke Matsui

Main category: cs.CV

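TL;DR: LotusFilter is a post-processing module that diversifies approximate nearest neighbor search (ANNS) results by removing redundant vectors using a precomputed table and greedy lookups.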

Details

Motivation: In RAG-like applications, ANNS results can be too similar to one another, while results need to stay similar to the query and also be diverse. Method: Precomputes a cutoff table that summarizes vectors close to each other; during filtering, greedily looks up the table to delete redundant vectors from the candidates. Result: LotusFilter runs fast (0.02 ms/query) in settings resembling real-world RAG applications, using features such as OpenAI embeddings. Conclusion: LotusFilter is an efficient post-processing module for scenarios that need diversified ANNS results. Abstract: Approximate nearest neighbor search (ANNS) is an essential building block for applications like RAG but can sometimes yield results that are overly similar to each other. In certain scenarios, search results should be similar to the query and yet diverse. We propose LotusFilter, a post-processing module to diversify ANNS results. We precompute a cutoff table summarizing vectors that are close to each other. During the filtering, LotusFilter greedily looks up the table to delete redundant vectors from the candidates. We demonstrate that LotusFilter operates fast (0.02 ms/query) in settings resembling real-world RAG applications, utilizing features such as OpenAI embeddings. Our code is publicly available at https://github.com/matsui528/lotf.
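
The precompute-then-filter idea can be sketched as follows. This is a simplified reading of the abstract, not the released implementation: the brute-force table construction, the squared-distance threshold `eps`, and both function names are assumptions.

```python
def build_cutoff_table(vectors, eps):
    """For each vector id, record the ids whose squared distance to it is below eps.

    Brute force for clarity (quadratic time); this stands in for whatever
    offline construction a real system would use.
    """
    table = {i: set() for i in range(len(vectors))}
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            sq_dist = sum((a - b) ** 2 for a, b in zip(vectors[i], vectors[j]))
            if sq_dist < eps:
                table[i].add(j)
                table[j].add(i)
    return table


def diverse_filter(candidates, table, k):
    """Greedily keep candidates in ranked order, dropping neighbours of kept ones."""
    removed = set()
    selected = []
    for candidate in candidates:   # candidates are sorted by distance to the query
        if candidate in removed:
            continue
        selected.append(candidate)
        if len(selected) == k:
            break
        removed |= table[candidate]   # table lookup only, no distance computation
    return selected


vectors = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
table = build_cutoff_table(vectors, eps=1.0)
result = diverse_filter([0, 1, 2, 3, 4], table, k=3)
```

Here ids 1 and 3 are near-duplicates of ids 0 and 2, so the filter returns ids 0, 2, and 4. At query time the filter touches only the table, never the vectors, which fits the low per-query latency reported above.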

[48] SupeRANSAC: One RANSAC to Rule Them All

Daniel Barath

Main category: cs.CV

TL;DR: SupeRANSAC is a new unified RANSAC pipeline designed to improve robust estimation in computer vision tasks, significantly outperforming existing methods.

Details

Motivation: RANSAC and its variants are the gold standard for geometric model estimation in computer vision, but their performance is inconsistent across tasks and depends heavily on implementation details and problem-specific optimizations. Method: Proposes SupeRANSAC, which analyzes and integrates the techniques that make RANSAC effective for specific vision tasks (such as homography, fundamental/essential matrix, and absolute/rigid pose estimation). Result: SupeRANSAC significantly outperforms existing methods across multiple tasks and datasets, for example by 6 AUC points on average for fundamental matrix estimation. Conclusion: Through a unified pipeline, SupeRANSAC achieves consistently high performance across tasks and offers a better solution for robust estimation in computer vision. Abstract: Robust estimation is a cornerstone in computer vision, particularly for tasks like Structure-from-Motion and Simultaneous Localization and Mapping. RANSAC and its variants are the gold standard for estimating geometric models (e.g., homographies, relative/absolute poses) from outlier-contaminated data. Despite RANSAC's apparent simplicity, achieving consistently high performance across different problems is challenging. While recent research often focuses on improving specific RANSAC components (e.g., sampling, scoring), overall performance is frequently more influenced by the "bells and whistles" (i.e., the implementation details and problem-specific optimizations) within a given library. Popular frameworks like OpenCV and PoseLib demonstrate varying performance, excelling in some tasks but lagging in others. We introduce SupeRANSAC, a novel unified RANSAC pipeline, and provide a detailed analysis of the techniques that make RANSAC effective for specific vision tasks, including homography, fundamental/essential matrix, and absolute/rigid pose estimation. SupeRANSAC is designed for consistent accuracy across these tasks, improving upon the best existing methods by, for example, 6 AUC points on average for fundamental matrix estimation. We demonstrate significant performance improvements over the state-of-the-art on multiple problems and datasets. Code: https://github.com/danini/superansac
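
For readers unfamiliar with the baseline that SupeRANSAC builds on, the sketch below shows the plain RANSAC loop (sample, fit, score by inlier count) on 2D line fitting. It is the textbook algorithm only, with none of the sampling, scoring, or refinement improvements the paper analyzes; the function name and parameters are illustrative.

```python
import math
import random


def ransac_line(points, threshold, iterations, seed=0):
    """Return the largest inlier set found for a 2D line a*x + b*y + c = 0."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iterations):
        # 1. Sample a minimal set: two points define a line.
        p, q = rng.sample(points, 2)
        a, b = q[1] - p[1], p[0] - q[0]
        norm = math.hypot(a, b)
        if norm == 0:
            continue  # degenerate sample (coincident points)
        c = -(a * p[0] + b * p[1])
        # 2. Score the model by counting points within the distance threshold.
        inliers = [pt for pt in points
                   if abs(a * pt[0] + b * pt[1] + c) / norm <= threshold]
        # 3. Keep the model with the most support.
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return best_inliers


line_points = [(x, 2 * x + 1) for x in range(10)]
outliers = [(3, 40), (7, -20), (1, 15)]
inliers = ransac_line(line_points + outliers, threshold=0.5, iterations=100)
```

Each of the three numbered steps is a place where libraries differ (how samples are drawn, how models are scored, whether the best model is refined afterwards), which is the kind of implementation detail the abstract refers to as "bells and whistles".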

[49] MegaHan97K: A Large-Scale Dataset for Mega-Category Chinese Character Recognition with over 97K Categories

Yuyi Zhang,Yongxin Shi,Peirong Zhang,Yixin Zhao,Zhenhua Yang,Lianwen Jin

Main category: cs.CV

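TL;DR: The paper introduces the MegaHan97K dataset, which supports the GB18030-2022 standard, contains 97,455 categories of Chinese characters, addresses the long-tail distribution problem, and reveals new challenges in mega-category recognition.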

Details

Motivation: Chinese characters span a vast and growing set of categories that existing datasets cannot cover, so a large-scale dataset that fully supports the latest standard is needed to advance cultural heritage preservation and digital applications. Method: Builds the MegaHan97K dataset, which comprises handwritten, historical, and synthetic subsets covering 97,455 categories of Chinese characters, and runs comprehensive benchmarking experiments. Result: MegaHan97K is the first dataset to fully support the GB18030-2022 standard, with at least six times more categories than existing datasets; it addresses the long-tail distribution problem and reveals new challenges, including high storage demands, recognition of morphologically similar characters, and zero-shot learning. Conclusion: MegaHan97K is an important resource for mega-category Chinese character recognition research and advances the fields of OCR and pattern recognition. Abstract: Foundational to the Chinese language and culture, Chinese characters encompass extraordinarily extensive and ever-expanding categories, with the latest Chinese GB18030-2022 standard containing 87,887 categories. The accurate recognition of this vast number of characters, termed mega-category recognition, presents a formidable yet crucial challenge for cultural heritage preservation and digital applications. Despite significant advances in Optical Character Recognition (OCR), mega-category recognition remains unexplored due to the absence of comprehensive datasets, with the largest existing dataset containing merely 16,151 categories. To bridge this critical gap, we introduce MegaHan97K, a mega-category, large-scale dataset covering an unprecedented 97,455 categories of Chinese characters. Our work offers three major contributions: (1) MegaHan97K is the first dataset to fully support the latest GB18030-2022 standard, providing at least six times more categories than existing datasets; (2) It effectively addresses the long-tail distribution problem by providing balanced samples across all categories through its three distinct subsets: handwritten, historical and synthetic subsets; (3) Comprehensive benchmarking experiments reveal new challenges in mega-category scenarios, including increased storage demands, morphologically similar character recognition, and zero-shot learning difficulties, while also unlocking substantial opportunities for future research. To the best of our knowledge, MegaHan97K is likely the dataset with the largest number of classes not only in the field of OCR but possibly also in the broader domain of pattern recognition. The dataset is available at https://github.com/SCUT-DLVCLab/MegaHan97K.

[50] Spike-TBR: a Noise Resilient Neuromorphic Event Representation

Gabriele Magrini,Federico Becattini,Luca Cultrera,Lorenzo Berlincioni,Pietro Pala,Alberto Del Bimbo

Main category: cs.CV

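TL;DR: Spike-TBR is an event encoding strategy based on Temporal Binary Representation (TBR) that integrates spiking neurons to improve noise resilience and performs well in noisy settings.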

Details

Motivation: Event cameras offer advantages such as high temporal resolution and low latency, but converting event streams into formats compatible with standard computer vision pipelines remains challenging, especially in the presence of noise. Method: Proposes Spike-TBR, which combines the frame-based advantages of TBR with the noise-filtering capabilities of spiking neural networks, and evaluates four variants that use different spiking neurons. Result: The method performs well on both noisy and clean data, supporting its noise resilience and its performance gains. Conclusion: Spike-TBR bridges the gap between spike-based and frame-based processing and offers a simple, noise-resilient solution for event-driven vision applications. Abstract: Event cameras offer significant advantages over traditional frame-based sensors, including higher temporal resolution, lower latency, and higher dynamic range. However, efficiently converting event streams into formats compatible with standard computer vision pipelines remains a challenging problem, particularly in the presence of noise. In this paper, we propose Spike-TBR, a novel event-based encoding strategy based on Temporal Binary Representation (TBR), addressing its vulnerability to noise by integrating spiking neurons. Spike-TBR combines the frame-based advantages of TBR with the noise-filtering capabilities of spiking neural networks, creating a more robust representation of event streams. We evaluate four variants of Spike-TBR, each using different spiking neurons, across multiple datasets, demonstrating superior performance in noise-affected scenarios while improving the results on clean data. Our method bridges the gap between spike-based and frame-based processing, offering a simple noise-resilient solution for event-driven vision applications.
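
The idea of filtering events with a spiking neuron before binary encoding can be illustrated for a single pixel. This is only a minimal sketch under stated assumptions, not the authors' implementation: the leaky integrate-and-fire update, the `leak` and `threshold` values, the bit ordering, and the normalisation by 2^N - 1 are all hypothetical choices, and the function names are made up for illustration.

```python
def lif_spikes(counts, leak=0.5, threshold=1.5):
    """Leaky integrate-and-fire over per-bin event counts for one pixel.

    Hypothetical parameters: the membrane potential decays by `leak` each
    bin and the neuron emits a spike (1) once it reaches `threshold`.
    """
    potential, spikes = 0.0, []
    for count in counts:
        potential = potential * leak + count
        if potential >= threshold:
            spikes.append(1)
            potential = 0.0  # reset after spiking
        else:
            spikes.append(0)
    return spikes


def bits_to_value(bits):
    """Pack N binary bins into one normalised intensity in [0, 1].

    Assumption: the earliest bin is the most significant bit.
    """
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value / (2 ** len(bits) - 1)


# An isolated event never reaches the threshold and is suppressed,
# while two events in consecutive bins produce a spike.
isolated = lif_spikes([1, 0, 0, 0])   # [0, 0, 0, 0]
burst = lif_spikes([1, 1, 0, 0])      # [0, 1, 0, 0]
pixel_value = bits_to_value(burst)    # 4 / 15
```

In this toy setting a lone noise event is dropped before encoding, which is the behaviour the abstract attributes to integrating spiking neurons into TBR.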

[51] Fool the Stoplight: Realistic Adversarial Patch Attacks on Traffic Light Detectors

Svetlana Pavlitska,Jamie Robb,Nikolai Polley,Melih Yazgan,J. Marius Zöllner

Main category: cs.CV

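TL;DR: The paper shows how CNN-based traffic light detectors can be attacked with printed adversarial patches, proposes a threat model and a training strategy, and validates the attacks in experiments and real-world settings.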

Details

Motivation: Adversarial attacks on camera-based perception tasks of autonomous vehicles have been studied widely, but attacks on traffic light detection have received little attention; this work fills that gap. Method: Proposes a threat model in which an adversarial patch placed under a traffic light attacks the CNN model, and designs a corresponding training strategy. Result: The experiments achieve targeted label-flipping attacks (e.g., red to green) and attacks on pictogram classification, and the attacks are validated in real-world settings. Conclusion: Traffic light detection systems are vulnerable to adversarial patch attacks, and stronger defenses are needed. Abstract: Realistic adversarial attacks on various camera-based perception tasks of autonomous vehicles have been successfully demonstrated so far. However, only a few works considered attacks on traffic light detectors. This work shows how CNNs for traffic light detection can be attacked with printed patches. We propose a threat model, where each instance of a traffic light is attacked with a patch placed under it, and describe a training strategy. We demonstrate successful adversarial patch attacks in universal settings. Our experiments show realistic targeted red-to-green label-flipping attacks and attacks on pictogram classification. Finally, we perform a real-world evaluation with printed patches and demonstrate attacks in lab settings with a mobile traffic light for construction sites and in a test area with stationary traffic lights. Our code is available at https://github.com/KASTEL-MobilityLab/attacks-on-traffic-light-detection.

[52] DualX-VSR: Dual Axial Spatial$\times$Temporal Transformer for Real-World Video Super-Resolution without Motion Compensation

Shuo Cao,Yihao Liu,Xiaohui Li,Yuanting Gao,Yu Zhou,Chao Dong

Main category: cs.CV

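TL;DR: DualX-VSR proposes a novel dual axial spatial×temporal attention mechanism that addresses the pixel-level precision demands of video super-resolution without motion compensation and achieves superior performance.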

Details

Motivation: Existing Transformer-based video super-resolution models are limited in pixel-level precision and by their reliance on motion compensation; DualX-VSR aims to address these issues. Method: Uses a dual axial spatial×temporal attention mechanism that integrates spatial and temporal information along orthogonal directions, which simplifies the structure and avoids motion compensation. Result: DualX-VSR achieves high fidelity and superior performance on real-world video super-resolution tasks. Conclusion: Through its attention mechanism, DualX-VSR overcomes limitations of existing models and offers a better solution for video super-resolution. Abstract: Transformer-based models like ViViT and TimeSformer have advanced video understanding by effectively modeling spatiotemporal dependencies. Recent video generation models, such as Sora and Vidu, further highlight the power of transformers in long-range feature extraction and holistic spatiotemporal modeling. However, directly applying these models to real-world video super-resolution (VSR) is challenging, as VSR demands pixel-level precision, which can be compromised by tokenization and sequential attention mechanisms. While recent transformer-based VSR models attempt to address these issues using smaller patches and local attention, they still face limitations such as restricted receptive fields and dependence on optical flow-based alignment, which can introduce inaccuracies in real-world settings. To overcome these issues, we propose Dual Axial Spatial$\times$Temporal Transformer for Real-World Video Super-Resolution (DualX-VSR), which introduces a novel dual axial spatial$\times$temporal attention mechanism that integrates spatial and temporal information along orthogonal directions. DualX-VSR eliminates the need for motion compensation, offering a simplified structure that provides a cohesive representation of spatiotemporal information. As a result, DualX-VSR achieves high fidelity and superior performance in real-world VSR tasks.

[53] OpenMaskDINO3D : Reasoning 3D Segmentation via Large Language Model

Kunshen Zhang

Main category: cs.CV

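TL;DR: OpenMaskDINO3D is an LLM-based model for understanding and segmenting 3D point cloud data that generates high-precision segmentation masks from text prompts.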

Details

Motivation: Existing 2D perception systems rely on explicit instructions or pre-defined categories, while 3D reasoning segmentation lacks a comparable framework. Method: Introduces a SEG token and an object identifier, and processes point cloud data together with text prompts to generate instance segmentation masks. Result: The model's effectiveness is validated on the ScanNet dataset. Conclusion: OpenMaskDINO3D fills the gap in 3D reasoning segmentation by mapping natural language instructions directly to point cloud segmentation. Abstract: Although perception systems have made remarkable advancements in recent years, particularly in 2D reasoning segmentation, these systems still rely on explicit human instruction or pre-defined categories to identify target objects before executing visual recognition tasks. Such systems have matured significantly, demonstrating the ability to reason and comprehend implicit user intentions in two-dimensional contexts, producing accurate segmentation masks based on complex and implicit query text. However, a comparable framework and structure for 3D reasoning segmentation remain absent. This paper introduces OpenMaskDINO3D, an LLM designed for comprehensive 3D understanding and segmentation. OpenMaskDINO3D processes point cloud data and text prompts to produce instance segmentation masks, excelling in many 3D tasks. By introducing a SEG token and object identifier, we achieve high-precision 3D segmentation mask generation, enabling the model to directly produce accurate point cloud segmentation results from natural language instructions. Experimental results on large-scale ScanNet datasets validate the effectiveness of our OpenMaskDINO3D across various tasks.

[54] Geological Field Restoration through the Lens of Image Inpainting

Vladislav Trifonov,Ivan Oseledets,Ekaterina Muravleva

Main category: cs.CV

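TL;DR: Proposes a method for reconstructing multidimensional geological fields based on low-rank tensor completion that outperforms traditional kriging.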

Details

Motivation: Reconstructing multidimensional geological fields from sparse observations is difficult; traditional methods such as kriging have limited effectiveness, and a more efficient approach is needed. Method: Combines tensor completion with geostatistics and recovers missing values by enforcing a global low-rank structure. Result: In experiments on synthetic geological fields, the tensor completion method clearly outperforms kriging. Conclusion: The method provides a more accurate solution for geological field reconstruction. Abstract: We present a new viewpoint on reconstructing multidimensional geological fields from sparse observations. Drawing inspiration from deterministic image inpainting techniques, we model a partially observed spatial field as a multidimensional tensor and recover missing values by enforcing a global low-rank structure. Our approach combines ideas from tensor completion and geostatistics, providing a robust optimization framework. Experiments on synthetic geological fields demonstrate that the tensor completion method used yields significant improvements in reconstruction accuracy over ordinary kriging for various percentages of observed data.
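
Recovering missing values by enforcing low-rank structure can be shown on the simplest possible case. The sketch below is an assumption-laden illustration, not the paper's algorithm: it handles a 2D grid rather than a multidimensional tensor, fixes the rank to 1, and uses plain alternating least squares; `complete_rank1` is a hypothetical name.

```python
def complete_rank1(grid, n_iters=100):
    """Fill None entries of a 2D grid with a rank-1 fit u[i] * v[j].

    Alternating least squares over the observed entries only; each update
    is the closed-form minimiser of the squared error for one factor.
    """
    rows, cols = len(grid), len(grid[0])
    u = [1.0] * rows
    v = [1.0] * cols
    for _ in range(n_iters):
        for i in range(rows):
            seen = [j for j in range(cols) if grid[i][j] is not None]
            den = sum(v[j] ** 2 for j in seen)
            if den > 0:
                u[i] = sum(grid[i][j] * v[j] for j in seen) / den
        for j in range(cols):
            seen = [i for i in range(rows) if grid[i][j] is not None]
            den = sum(u[i] ** 2 for i in seen)
            if den > 0:
                v[j] = sum(grid[i][j] * u[i] for i in seen) / den
    # Keep observed values, fill the gaps from the low-rank model.
    return [
        [grid[i][j] if grid[i][j] is not None else u[i] * v[j] for j in range(cols)]
        for i in range(rows)
    ]


# A field equal to the outer product of [1, 2, 3] and [1, 2, 4],
# with two entries hidden.
observed = [
    [1.0, 2.0, None],
    [2.0, 4.0, 8.0],
    [None, 6.0, 12.0],
]
filled = complete_rank1(observed)
```

On this noiseless toy field the hidden entries are recovered as approximately 4 and 3. Real geological fields would need higher ranks, a tensor factorisation, and some handling of noise.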

[55] Invisible Backdoor Triggers in Image Editing Model via Deep Watermarking

Yu-Feng Chen,Tzuhsuan Huang,Pin-Yen Chiu,Jun-Cheng Chen

Main category: cs.CV

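TL;DR: The paper proposes a novel backdoor attack framework that embeds invisible triggers into the image editing process via poisoned training data, using existing watermarking models to make the attack stealthy.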

Details

Motivation: Existing research focuses mainly on backdoor attacks against image generation, while backdoor attacks on image editing are less studied, and most existing methods use visible triggers, which limits their practicality. Method: Uses off-the-shelf watermarking models to encode imperceptible watermarks as backdoor triggers and mounts the attack through poisoned training data. Result: Experiments across different watermarking models show promising attack success rates, and an analysis of watermark characteristics further supports the method's effectiveness. Conclusion: The method achieves a stealthy backdoor attack and offers a new direction for security research in image editing. Abstract: Diffusion models have achieved remarkable progress in both image generation and editing. However, recent studies have revealed their vulnerability to backdoor attacks, in which specific patterns embedded in the input can manipulate the model's behavior. Most existing research in this area has proposed attack frameworks focused on the image generation pipeline, leaving backdoor attacks in image editing relatively unexplored. Among the few studies targeting image editing, most utilize visible triggers, which are impractical because they introduce noticeable alterations to the input image before editing. In this paper, we propose a novel attack framework that embeds invisible triggers into the image editing process via poisoned training data. We leverage off-the-shelf deep watermarking models to encode imperceptible watermarks as backdoor triggers. Our goal is to make the model produce the predefined backdoor target when it receives watermarked inputs, while editing clean images normally according to the given prompt. With extensive experiments across different watermarking models, the proposed method achieves promising attack success rates. In addition, the analysis results of the watermark characteristics in terms of backdoor attacks further support the effectiveness of our approach. The code is available at: https://github.com/aiiu-lab/BackdoorImageEditing

[56] Learning to Plan via Supervised Contrastive Learning and Strategic Interpolation: A Chess Case Study

Andrew Hamara,Greg Hamerly,Pablo Rivas,Andrew C. Freeman

Main category: cs.CV

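TL;DR: The paper proposes an intuition-driven planning approach that trains a transformer encoder with supervised contrastive learning to embed board states into a latent space, where move selection takes place entirely within the embedding space without relying on deep search.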

Details

Motivation: Modern chess engines achieve superhuman performance through deep tree search and regressive evaluation, whereas human players rely on intuition to select candidate moves and validate them with a shallow search. The paper aims to model this intuition-driven planning process. Method: Trains a transformer encoder with supervised contrastive learning to embed board states into a latent space in which distance reflects evaluative similarity, and selects moves with a 6-ply beam search. Result: Using only a 6-ply beam search, the model achieves an estimated Elo rating of 2593, and performance improves with model size and embedding dimensionality. Conclusion: Latent planning may be a viable alternative to traditional search, and the approach can generalize to other perfect-information games. All source code is publicly available. Abstract: Modern chess engines achieve superhuman performance through deep tree search and regressive evaluation, while human players rely on intuition to select candidate moves followed by a shallow search to validate them. To model this intuition-driven planning process, we train a transformer encoder using supervised contrastive learning to embed board states into a latent space structured by positional evaluation. In this space, distance reflects evaluative similarity, and visualized trajectories display interpretable transitions between game states. We demonstrate that move selection can occur entirely within this embedding space by advancing toward favorable regions, without relying on deep search. Despite using only a 6-ply beam search, our model achieves an estimated Elo rating of 2593. Performance improves with both model size and embedding dimensionality, suggesting that latent planning may offer a viable alternative to traditional search. Although we focus on chess, the proposed embedding-based planning method can be generalized to other perfect-information games where state evaluations are learnable. All source code is available at https://github.com/andrewhamara/SOLIS.
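
Selecting moves by advancing toward a favorable region of an embedding space with a shallow beam search can be sketched generically. Everything below is an illustrative assumption rather than the SOLIS implementation: `beam_search`, `successors`, `embed`, and `goal` are hypothetical names, the score is a plain squared Euclidean distance, and the alternation between the two players in a real game is not modelled.

```python
def beam_search(start, successors, embed, goal, depth=6, width=3):
    """Return the first move of the best path found by a depth-limited beam search.

    `successors(state)` yields (move, next_state) pairs, `embed(state)` maps a
    state to a vector, and paths are ranked by the squared distance between
    the embedding of their last state and the `goal` vector (lower is better).
    """
    def score(state):
        return sum((a - b) ** 2 for a, b in zip(embed(state), goal))

    beam = [(score(start), start, None)]  # (score, state, first move on the path)
    for _ in range(depth):
        candidates = []
        for _, state, first in beam:
            for move, nxt in successors(state):
                candidates.append((score(nxt), nxt, move if first is None else first))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0])  # stable sort, states are never compared
        beam = candidates[:width]
    return beam[0][2]


# Toy usage: states are grid points, the embedding is the identity,
# and the favorable region is the point (3, 0).
def grid_moves(state):
    x, y = state
    return [("right", (x + 1, y)), ("up", (x, y + 1)), ("left", (x - 1, y))]

best = beam_search((0, 0), grid_moves, lambda s: s, (3, 0), depth=6, width=3)
```

Here `best` is `"right"`, since every surviving path starts by stepping toward the goal. In the paper's setting the embedding would come from the trained transformer encoder rather than an identity map.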

[57] From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes

Tianxu Wang,Zhuofan Zhang,Ziyu Zhu,Yue Fan,Jing Xiong,Pengxiang Li,Xiaojian Ma,Qing Li

Main category: cs.CV

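TL;DR: Anywhere3D-Bench is a holistic 3D visual grounding benchmark covering four grounding levels; it reveals that current models fall well short on space-level and part-level tasks.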

Details

Motivation: To explore visual grounding beyond the object level in 3D scenes and fill a gap in existing research. Method: Proposes the Anywhere3D-Bench benchmark and evaluates existing 3D visual grounding methods and large language models on tasks at four different levels. Result: Space-level and part-level tasks are the hardest; the best model, OpenAI o4-mini, reaches only 23.57% and 33.94% accuracy on them, respectively. Conclusion: Current models show clear deficiencies in understanding and reasoning at the space and part levels of 3D scenes. Abstract: 3D visual grounding has made notable progress in localizing objects within complex 3D scenes. However, grounding referring expressions beyond objects in 3D scenes remains unexplored. In this paper, we introduce Anywhere3D-Bench, a holistic 3D visual grounding benchmark consisting of 2,632 referring expression-3D bounding box pairs spanning four different grounding levels: human-activity areas, unoccupied space beyond objects, objects in the scene, and fine-grained object parts. We assess a range of state-of-the-art 3D visual grounding methods alongside large language models (LLMs) and multimodal LLMs (MLLMs) on Anywhere3D-Bench. Experimental results reveal that space-level and part-level visual grounding pose the greatest challenges: space-level tasks require a more comprehensive spatial reasoning ability, for example, modeling distances and spatial relations within 3D space, while part-level tasks demand fine-grained perception of object composition. Even the best-performing model, OpenAI o4-mini, achieves only 23.57% accuracy on space-level tasks and 33.94% on part-level tasks, significantly lower than its performance on area-level and object-level tasks. These findings underscore a critical gap in current models' capacity to understand and reason about 3D scenes beyond object-level semantics.

[58] Generating Synthetic Stereo Datasets using 3D Gaussian Splatting and Expert Knowledge Transfer

Filip Slezak,Magnus K. Gjerde,Joakim B. Haurum,Ivan Nikolov,Morten S. Laursen,Thomas B. Moeslund

Main category: cs.CV

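TL;DR: Proposes a 3D Gaussian Splatting (3DGS)-based method for stereo dataset generation that is more efficient than NeRF-based methods, and explores knowledge transfer using explicit 3D representations and the FoundationStereo model.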

Details

Motivation: To provide a low-cost, high-fidelity way to generate datasets for stereo models and to fine-tune those models quickly. Method: Combines geometry reconstructed by 3DGS with depth estimates from FoundationStereo for knowledge transfer and model fine-tuning. Result: Models fine-tuned on 3DGS-generated datasets perform well on zero-shot generalization benchmarks, but the reconstructed geometry is noisy; disparity estimates from FoundationStereo are cleaner and give better performance. Conclusion: The 3DGS approach shows strong potential for dataset generation and fast fine-tuning, but its robustness in complex scenes still needs improvement. Abstract: In this paper, we introduce a 3D Gaussian Splatting (3DGS)-based pipeline for stereo dataset generation, offering an efficient alternative to Neural Radiance Fields (NeRF)-based methods. To obtain useful geometry estimates, we explore utilizing the reconstructed geometry from the explicit 3D representations as well as depth estimates from the FoundationStereo model in an expert knowledge transfer setup. We find that stereo models fine-tuned on 3DGS-generated datasets achieve competitive performance in zero-shot generalization benchmarks. When using the reconstructed geometry directly, we observe that it is often noisy and contains artifacts, which propagate noise to the trained model. In contrast, we find that the disparity estimates from FoundationStereo are cleaner and consequently result in a better performance on the zero-shot generalization benchmarks. Our method highlights the potential for low-cost, high-fidelity dataset creation and fast fine-tuning for deep stereo models. Moreover, we also reveal that while the latest Gaussian Splatting based methods have achieved superior performance on established benchmarks, their robustness falls short in challenging in-the-wild settings warranting further exploration.
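
When depth rendered from a reconstructed scene is used as a training label for a stereo model, it usually has to be expressed as disparity. The sketch below shows the standard rectified-stereo relation, disparity = focal length × baseline / depth. It is an assumption that the pipeline described here performs this conversion in this form; `depth_to_disparity` and its parameter names are hypothetical.

```python
def depth_to_disparity(depth_m, focal_px, baseline_m):
    """Convert metric depth to disparity in pixels for a rectified stereo pair.

    depth_m:    distance along the optical axis, in metres (must be positive)
    focal_px:   focal length in pixels
    baseline_m: distance between the two camera centres, in metres
    """
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_m / depth_m


def depth_map_to_disparity(depth_map, focal_px, baseline_m):
    """Apply the conversion to a 2D depth map given as a list of rows."""
    return [
        [depth_to_disparity(d, focal_px, baseline_m) for d in row]
        for row in depth_map
    ]


# A point 2 m away, seen with an 800 px focal length and a 0.5 m baseline,
# shifts by 200 px between the two views.
disparity = depth_to_disparity(2.0, 800.0, 0.5)
```

Because disparity is inversely proportional to depth, noise in the depth of nearby surfaces has a much larger effect on the label than the same noise far away.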

[59] Light and 3D: a methodological exploration of digitisation techniques adapted to a selection of objects from the Musée d'Archéologie Nationale

Antoine Laurent,Jean Mélou,Catherine Schwab,Rolande Simon-Millot,Sophie Féret,Thomas Sagory,Carole Fritz,Jean-Denis Durou

Main category: cs.CV

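TL;DR: The article examines the diversity of photographic 3D digitization methods for cultural heritage and stresses that no single method suits every object; the tool should be chosen according to the object's characteristics and intended use.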

Details

Motivation: The need to digitize cultural heritage is widespread, but existing 3D digitization methods are diverse, so the best method must be chosen for each object. Method: Uses objects from the collections of the French national archaeology museum as case studies to analyze the suitability of different photographic 3D digitization methods. Result: The study shows that digitization tools should be selected according to the object's characteristics and its future use, rather than by seeking an absolute classification. Conclusion: Digitizing cultural heritage requires cross-disciplinary collaboration and a flexible choice of tools to fit the needs of different objects. Abstract: The need to digitize heritage objects is now widely accepted. This article presents the very fashionable context of the creation of "digital twins". It illustrates the diversity of photographic 3D digitization methods, but this is not its only objective. Using a selection of objects from the collections of the musée d'Archéologie nationale, it shows that no single method is suitable for all cases. Rather, the method to be recommended for a given object should be the result of a concerted choice between those involved in heritage and those involved in the digital domain, as each new object may require the adaptation of existing tools. It would therefore be pointless to attempt an absolute classification of 3D digitization methods. On the contrary, we need to find the digital tool best suited to each object, taking into account not only its characteristics, but also the future use of its digital twin.

[60] CzechLynx: A Dataset for Individual Identification and Pose Estimation of the Eurasian Lynx

Lukas Picek,Elisa Belotti,Michal Bojda,Ludek Bufka,Vojtech Cermak,Martin Dula,Rostislav Dvorak,Luboslav Hrdy,Miroslav Jirik,Vaclav Kocourek,Josefa Krausova,Jirí Labuda,Jakub Straka,Ludek Toman,Vlado Trulík,Martin Vana,Miroslav Kutal

Main category: cs.CV

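TL;DR: CzechLynx is a large-scale, open-access dataset for individual identification, 2D pose estimation, and instance segmentation of the Eurasian lynx; it contains real and synthetic images and provides three evaluation protocols.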

Details

Motivation: To provide the first large-scale dataset for individual identification and pose estimation of lynx, supporting research and model development across spatial and temporal domains. Method: The dataset contains more than 30k real camera trap images and more than 100k synthetic images, annotated with segmentation masks, identity labels, and 20-point skeletons, and defines three evaluation protocols. Result: The dataset covers 219 unique individuals across 15 years and two geographic regions, supporting a variety of research needs. Conclusion: CzechLynx is expected to support the benchmarking of state-of-the-art models and the development of new methods. Abstract: We introduce CzechLynx, the first large-scale, open-access dataset for individual identification, 2D pose estimation, and instance segmentation of the Eurasian lynx (Lynx lynx). CzechLynx includes more than 30k camera trap images annotated with segmentation masks, identity labels, and 20-point skeletons and covers 219 unique individuals across 15 years of systematic monitoring in two geographically distinct regions: Southwest Bohemia and the Western Carpathians. To increase the data variability, we create a complementary synthetic set with more than 100k photorealistic images generated via a Unity-based pipeline and diffusion-driven text-to-texture modeling, covering diverse environments, poses, and coat-pattern variations. To allow testing generalization across spatial and temporal domains, we define three tailored evaluation protocols/splits: (i) geo-aware, (ii) time-aware open-set, and (iii) time-aware closed-set. This dataset is targeted to be instrumental in benchmarking state-of-the-art models and the development of novel methods for not just individual animal re-identification.

[61] Time-Lapse Video-Based Embryo Grading via Complementary Spatial-Temporal Pattern Mining

Yong Sun,Yipeng Wang,Junyu Shi,Zhiyuan Zhang,Yanmei Xiao,Lei Zhu,Manxi Jiang,Qiang Nie

Main category: cs.CV

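TL;DR: Proposes a video-based embryo grading task that uses full-length time-lapse monitoring videos to predict embryo quality, together with a Complementary Spatial-Temporal Pattern Mining (CoSTeM) framework that integrates static and dynamic features.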

Details

Motivation: Existing embryo assessment methods either lack a holistic evaluation or are confounded by factors that affect clinical outcomes, which limits their clinical use. This work aims to fill that gap. Method: Proposes the CoSTeM framework with two branches, a morphological branch and a morphokinetic branch, which capture local structural features and global developmental trajectories, respectively. Result: Experiments show that the design outperforms existing methods and provides a methodological framework for AI-assisted embryo selection. Conclusion: The study offers a new paradigm for embryo quality assessment; the dataset and source code will be made public. Abstract: Artificial intelligence has recently shown promise in automated embryo selection for In-Vitro Fertilization (IVF). However, current approaches either address partial embryo evaluation lacking holistic quality assessment or target clinical outcomes inevitably confounded by extra-embryonic factors, both limiting clinical utility. To bridge this gap, we propose a new task called Video-Based Embryo Grading - the first paradigm that directly utilizes full-length time-lapse monitoring (TLM) videos to predict embryologists' overall quality assessments. To support this task, we curate a real-world clinical dataset comprising over 2,500 TLM videos, each annotated with a grading label indicating the overall quality of embryos. Grounded in clinical decision-making principles, we propose a Complementary Spatial-Temporal Pattern Mining (CoSTeM) framework that conceptually replicates embryologists' evaluation process. The CoSTeM comprises two branches: (1) a morphological branch using a Mixture of Cross-Attentive Experts layer and a Temporal Selection Block to select discriminative local structural features, and (2) a morphokinetic branch employing a Temporal Transformer to model global developmental trajectories, synergistically integrating static and dynamic determinants for grading embryos. Extensive experimental results demonstrate the superiority of our design. This work provides a valuable methodological framework for AI-assisted embryo selection. The dataset and source code will be publicly available upon acceptance.

[62] Robustness as Architecture: Designing IQA Models to Withstand Adversarial Perturbations

Igor Meleshin,Anna Chistyakova,Anastasia Antsiferova,Dmitriy Vatolin

Main category: cs.CV

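TL;DR: The paper proposes making image quality assessment (IQA) models robust to adversarial attacks through architectural design rather than through training.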

Details

Motivation: Existing IQA models are vulnerable to adversarial attacks, and traditional data-driven defenses may not be the best solution. Method: Redesigns the model structure around orthogonal information flow and norm-preserving operations, and further stabilizes the system through pruning and fine-tuning. Result: Yields a robust IQA architecture that withstands adversarial attacks without adversarial training. Conclusion: The study suggests shifting from optimizing robustness through data to engineering it through design. Abstract: Image Quality Assessment (IQA) models are increasingly relied upon to evaluate image quality in real-world systems -- from compression and enhancement to generation and streaming. Yet their adoption brings a fundamental risk: these models are inherently unstable. Adversarial manipulations can easily fool them, inflating scores and undermining trust. Traditionally, such vulnerabilities are addressed through data-driven defenses -- adversarial retraining, regularization, or input purification. But what if this is the wrong lens? What if robustness in perceptual models is not something to learn but something to design? In this work, we propose a provocative idea: robustness as an architectural prior. Rather than training models to resist perturbations, we reshape their internal structure to suppress sensitivity from the ground up. We achieve this by enforcing orthogonal information flow, constraining the network to norm-preserving operations -- and further stabilizing the system through pruning and fine-tuning. The result is a robust IQA architecture that withstands adversarial attacks without requiring adversarial training or significant changes to the original model. This approach suggests a shift in perspective: from optimizing robustness through data to engineering it through design.
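
Why norm-preserving operations limit sensitivity can be checked numerically: an orthogonal linear map W satisfies ||W(x + d) - Wx|| = ||d||, so a perturbation is never amplified by that layer. The sketch below uses a 2D rotation as the simplest orthogonal matrix. It illustrates the general property only and says nothing about how the paper's architecture enforces orthogonality; all function names are hypothetical.

```python
import math


def rotation(theta):
    """A 2x2 rotation matrix, the simplest example of an orthogonal map."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def norm(vector):
    """Euclidean norm."""
    return math.sqrt(sum(x * x for x in vector))


W = rotation(0.7)
x = [3.0, 4.0]
delta = [0.01, -0.02]  # a small input perturbation

clean = matvec(W, x)
perturbed = matvec(W, [a + b for a, b in zip(x, delta)])
output_change = norm([p - c for p, c in zip(perturbed, clean)])

# Up to floating-point error, the output moves exactly as far as the input.
amplification = output_change / norm(delta)
```

Here `amplification` is 1 up to rounding, whereas an unconstrained weight matrix can stretch the same perturbation by its largest singular value.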

[63] APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval

Hong Gao,Yiming Bao,Xuezhan Tu,Bin Zhong,Minling Zhang

Main category: cs.CV

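TL;DR: The APVR framework tackles long video understanding through hierarchical visual information retrieval, handling hour-level videos efficiently without training.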

Details

Motivation: Existing video-based multimodal large language models struggle with hour-level videos because of computational constraints and inefficient information extraction. Method: APVR uses two components: Pivot Frame Retrieval (semantic expansion and multi-modal confidence scoring) and Pivot Token Retrieval (query-aware, attention-driven token selection). Result: It performs strongly on LongVideoBench and VideoMME, surpassing not only training-free methods but also training-based ones. Conclusion: APVR offers an efficient, plug-and-play solution for long video understanding. Abstract: Current video-based multimodal large language models struggle with hour-level video understanding due to computational constraints and inefficient information extraction from extensive temporal sequences. We propose APVR (Adaptive Pivot Visual information Retrieval), a training-free framework that addresses the memory wall limitation through hierarchical visual information retrieval. APVR operates via two complementary components: Pivot Frame Retrieval employs semantic expansion and multi-modal confidence scoring to identify semantically relevant video frames, while Pivot Token Retrieval performs query-aware attention-driven token selection within the pivot frames. This dual granularity approach enables processing of hour-long videos while maintaining semantic fidelity. Experimental validation on LongVideoBench and VideoMME demonstrates significant performance improvements, establishing state-of-the-art results for not only training-free but also training-based approaches while providing plug-and-play integration capability with existing MLLM architectures.
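
The frame-level half of this retrieval idea can be sketched as a top-k selection by similarity to the query. This is a deliberately reduced illustration under stated assumptions: APVR's semantic expansion and multi-modal confidence scoring are replaced by a single cosine similarity, the embeddings are assumed to be given, and `pivot_frames` is a hypothetical name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors, 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pivot_frames(query_vec, frame_vecs, k):
    """Return the indices of the k frames most similar to the query.

    Indices are returned in temporal order so the selected frames can be
    fed to a downstream model as a shortened video.
    """
    ranked = sorted(
        range(len(frame_vecs)),
        key=lambda i: cosine(query_vec, frame_vecs[i]),
        reverse=True,
    )
    return sorted(ranked[:k])


# Five frames with toy 2D embeddings; frames 1 and 3 point the same way
# as the query.
frames = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5], [1.0, 0.0], [-1.0, 0.0]]
selected = pivot_frames([1.0, 0.0], frames, k=2)
```

With these toy vectors `selected` is `[1, 3]`. A token-level stage, as described in the abstract, would then prune further within the chosen frames.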

[64] FEAT: Full-Dimensional Efficient Attention Transformer for Medical Video Generation

Huihan Wang,Zhiwen Yang,Hui Zhang,Dan Zhao,Bingzheng Wei,Yan Xu

Main category: cs.CV

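TL;DR: FEAT is an efficient full-dimensional attention Transformer that addresses the challenges of dynamic medical video synthesis through spatial-temporal-channel attention, a linear-complexity design, and a residual value guidance module, outperforming existing methods.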

Details

Motivation: Dynamic medical video synthesis must model both spatial consistency and temporal dynamics, and existing Transformer methods fall short in channel interactions, computational complexity, and handling of varying noise levels. Method: FEAT uses spatial-temporal-channel attention, a linear-complexity design, and a residual value guidance module to capture global dependencies efficiently and adapt to different noise levels. Result: FEAT-S has only 23% of the parameters of Endora yet achieves comparable or better performance; FEAT-L surpasses all compared methods across multiple datasets. Conclusion: FEAT is both efficient and scalable for dynamic medical video synthesis and offers new ideas for related fields. Abstract: Synthesizing high-quality dynamic medical videos remains a significant challenge due to the need for modeling both spatial consistency and temporal dynamics. Existing Transformer-based approaches face critical limitations, including insufficient channel interactions, high computational complexity from self-attention, and coarse denoising guidance from timestep embeddings when handling varying noise levels. In this work, we propose FEAT, a full-dimensional efficient attention Transformer, which addresses these issues through three key innovations: (1) a unified paradigm with sequential spatial-temporal-channel attention mechanisms to capture global dependencies across all dimensions, (2) a linear-complexity design for attention mechanisms in each dimension, utilizing weighted key-value attention and global channel attention, and (3) a residual value guidance module that provides fine-grained pixel-level guidance to adapt to different noise levels. We evaluate FEAT on standard benchmarks and downstream tasks, demonstrating that FEAT-S, with only 23% of the parameters of the state-of-the-art model Endora, achieves comparable or even superior performance. Furthermore, FEAT-L surpasses all comparison methods across multiple datasets, showcasing both superior effectiveness and scalability. Code is available at https://github.com/Yaziwel/FEAT.

[65] Bringing SAM to new heights: Leveraging elevation data for tree crown segmentation from drone imagery

Mélisande Teng,Arthur Ouaknine,Etienne Laliberté,Yoshua Bengio,David Rolnick,Hugo Larochelle

Main category: cs.CV

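TL;DR: The paper compares methods that use the Segment Anything Model (SAM) and Digital Surface Model (DSM) data for tree crown instance segmentation in drone imagery, proposes the BalSAM model, and examines how well SAM works across three forest types.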

Details

Motivation: Traditional forest monitoring methods are costly and time-consuming, while drone remote sensing and computer vision offer the potential to map individual trees at scale. Method: Compares SAM-based tree crown instance segmentation across three forest types, studies the integration of DSM data, and proposes the BalSAM model. Result: Using SAM out of the box does not outperform a custom Mask R-CNN, but end-to-end tuning and integrating DSM data both show promise. Conclusion: Tuning SAM end-to-end and integrating DSM data are promising ways to improve tree crown instance segmentation models. Abstract: Information on trees at the individual level is crucial for monitoring forest ecosystems and planning forest management. Current monitoring methods involve ground measurements, requiring extensive cost, time and labor. Advances in drone remote sensing and computer vision offer great potential for mapping individual trees from aerial imagery at broad scale. Large pre-trained vision models, such as the Segment Anything Model (SAM), represent a particularly compelling choice given limited labeled data. In this work, we compare methods leveraging SAM for the task of automatic tree crown instance segmentation in high resolution drone imagery in three use cases: 1) boreal plantations, 2) temperate forests and 3) tropical forests. We also study the integration of elevation data into models, in the form of Digital Surface Model (DSM) information, which can readily be obtained at no additional cost from RGB drone imagery. We present BalSAM, a model leveraging SAM and DSM information, which shows potential over other methods, particularly in the context of plantations. We find that methods using SAM out-of-the-box do not outperform a custom Mask R-CNN, even with well-designed prompts. However, efficiently tuning SAM end-to-end and integrating DSM information are both promising avenues for tree crown instance segmentation models.

[66] TextVidBench: A Benchmark for Long Video Scene Text Understanding

Yangyang Zhong,Ji Qi,Yuan Yao,Pengxin Luo,Yunfeng Yan,Donglian Qi,Zhiyuan Liu,Tat-Seng Chua

Main category: cs.CV

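TL;DR: The paper introduces TextVidBench, the first benchmark for text question answering on long videos (>3 minutes), which addresses the short video durations and narrow evaluation scope of existing datasets.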

Details

Motivation: Existing short-video Text-Visual Question Answering (ViteVQA) datasets cannot adequately assess the capabilities of multimodal large language models (MLLMs), especially for long-video understanding. Method: 1) Builds a cross-domain long-video dataset (average length 2306 seconds); 2) proposes a three-stage evaluation framework; 3) introduces an efficient improvement paradigm (the IT-Rope mechanism, non-uniform positional encoding, and more). Result: TextVidBench poses significant challenges to existing models, and the proposed method performs well on long-video scene text understanding. Conclusion: TextVidBench provides a more realistic evaluation benchmark for long-video text question answering, and the proposed method offers an effective route to improving long-video understanding. Abstract: Despite recent progress on the short-video Text-Visual Question Answering (ViteVQA) task - largely driven by benchmarks such as M4-ViteVQA - existing datasets still suffer from limited video duration and narrow evaluation scopes, making it difficult to adequately assess the growing capabilities of powerful multimodal large language models (MLLMs). To address these limitations, we introduce TextVidBench, the first benchmark specifically designed for long-video text question answering (>3 minutes). TextVidBench makes three key contributions: 1) Cross-domain long-video coverage: Spanning 9 categories (e.g., news, sports, gaming), with an average video length of 2306 seconds, enabling more realistic evaluation of long-video understanding. 2) A three-stage evaluation framework: "Text Needle-in-Haystack -> Temporal Grounding -> Text Dynamics Captioning". 3) High-quality fine-grained annotations: Containing over 5,000 question-answer pairs with detailed semantic labeling. Furthermore, we propose an efficient paradigm for improving large models through: (i) introducing the IT-Rope mechanism and temporal prompt engineering to enhance temporal perception, (ii) adopting non-uniform positional encoding to better handle long video sequences, and (iii) applying lightweight fine-tuning on video-text data. Extensive experiments on multiple public datasets as well as TextVidBench demonstrate that our new benchmark presents significant challenges to existing models, while our proposed method offers valuable insights into improving long-video scene text understanding capabilities.

[67] Multi-scale Image Super Resolution with a Single Auto-Regressive Model

Enrique Sanchez,Isma Hadji,Adrian Bulat,Christos Tzelepis,Brais Martinez,Georgios Tzimiropoulos

Main category: cs.CV

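TL;DR: The paper proposes an image super-resolution (ISR) method based on Visual Auto-Regressive (VAR) modeling that overcomes limitations of existing methods through multi-scale image tokenization and Direct Preference Optimization (DPO), super-resolves in a single forward pass, and reaches state-of-the-art results with a small model and no external data.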

Details

Motivation: The existing VARSR method is restricted to a fixed resolution and depends on a large model; this work aims to remove these limitations through better tokenization and optimization. Method: Proposes a multi-scale image tokenization approach and a DPO regularization term, and trains a VAR model to produce semantically consistent residuals. Result: The model super-resolves in a single forward pass and achieves state-of-the-art results with a small model (300M parameters) and without external data. Conclusion: The method performs strongly on ISR, resolves the limitations of the existing VARSR approach, and shows the potential of small models combined with efficient optimization. Abstract: In this paper we tackle Image Super Resolution (ISR), using recent advances in Visual Auto-Regressive (VAR) modeling. VAR iteratively estimates the residual in latent space between gradually increasing image scales, a process referred to as next-scale prediction. Thus, the strong priors learned during pre-training align well with the downstream task (ISR). To our knowledge, only VARSR has exploited this synergy so far, showing promising results. However, due to the limitations of existing residual quantizers, VARSR works only at a fixed resolution, i.e. it fails to map intermediate outputs to the corresponding image scales. Additionally, it relies on a 1B transformer architecture (VAR-d24), and leverages a large-scale private dataset to achieve state-of-the-art results. We address these limitations through two novel components: a) a Hierarchical Image Tokenization approach with a multi-scale image tokenizer that progressively represents images at different scales while simultaneously enforcing token overlap across scales, and b) a Direct Preference Optimization (DPO) regularization term that, relying solely on the LR and HR tokenizations, encourages the transformer to produce the latter over the former. To the best of our knowledge, this is the first time a quantizer is trained to force semantically consistent residuals at different scales, and the first time that preference-based optimization is used to train a VAR. Using these two components, our model can denoise the LR image and super-resolve at half and full target upscale factors in a single forward pass. Additionally, we achieve state-of-the-art results on ISR, while using a small model (300M params vs ~1B params of VARSR), and without using external training data.

[68] PATS: Proficiency-Aware Temporal Sampling for Multi-View Sports Skill Assessment

Edoardo Bianchi,Antonio Liotta

Main category: cs.CV

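TL;DR: PATS is a novel video sampling strategy that improves the accuracy of sports skill assessment by preserving complete fundamental movements within each sampled segment.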

Details

Motivation: Current video sampling methods break the temporal continuity needed to assess sports skill, which makes it harder to distinguish expert from novice performance. Method: Proposes PATS, which adaptively segments videos so that each analyzed segment contains the full execution of a critical movement, and evaluates it under multi-view configurations. Result: On the EgoExo4D benchmark, PATS outperforms existing methods in all viewing configurations (+0.65% to +3.05%) and does especially well in specific domains (e.g., bouldering, +26.22%). Conclusion: PATS adapts to the characteristics of different activities and is an effective adaptive temporal sampling method that advances automated skill assessment in real-world settings. Abstract: Automated sports skill assessment requires capturing fundamental movement patterns that distinguish expert from novice performance, yet current video sampling methods disrupt the temporal continuity essential for proficiency evaluation. To this end, we introduce Proficiency-Aware Temporal Sampling (PATS), a novel sampling strategy that preserves complete fundamental movements within continuous temporal segments for multi-view skill assessment. PATS adaptively segments videos to ensure each analyzed portion contains full execution of critical performance components, repeating this process across multiple segments to maximize information coverage while maintaining temporal coherence. Evaluated on the EgoExo4D benchmark with SkillFormer, PATS surpasses the state-of-the-art accuracy across all viewing configurations (+0.65% to +3.05%) and delivers substantial gains in challenging domains (+26.22% bouldering, +2.39% music, +1.13% basketball). Systematic analysis reveals that PATS successfully adapts to diverse activity characteristics, from high-frequency sampling for dynamic sports to fine-grained segmentation for sequential skills, demonstrating its effectiveness as an adaptive approach to temporal sampling that advances automated skill assessment for real-world applications.
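
The contrast between sparse uniform sampling and sampling contiguous runs of frames can be made concrete with a small index generator. This is a hypothetical sketch of the general idea only: the abstract does not specify how PATS sizes or places its segments, so the fixed segment length, the centring rule, and the name `contiguous_segments` are all assumptions.

```python
def contiguous_segments(num_frames, n_segments, seg_len):
    """Pick n_segments runs of seg_len consecutive frame indices.

    The video is split into n_segments equal windows and one contiguous
    run is taken from the centre of each window, so every run keeps its
    internal temporal continuity.
    """
    if n_segments <= 0 or seg_len <= 0:
        raise ValueError("n_segments and seg_len must be positive")
    if n_segments * seg_len > num_frames:
        raise ValueError("not enough frames for the requested segments")
    window = num_frames / n_segments
    segments = []
    for s in range(n_segments):
        centre = (s + 0.5) * window
        start = int(centre - seg_len / 2)
        start = max(0, min(start, num_frames - seg_len))
        segments.append(list(range(start, start + seg_len)))
    return segments


# Two runs of four consecutive frames from a 100-frame clip.
runs = contiguous_segments(100, 2, 4)
```

For this call `runs` is `[[23, 24, 25, 26], [73, 74, 75, 76]]`, whereas uniform sampling of eight frames would return isolated frames roughly 12 apart and split any movement that spans several frames.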

[69] Beyond Cropped Regions: New Benchmark and Corresponding Baseline for Chinese Scene Text Retrieval in Diverse Layouts

Gengluo Li,Huawen Shen,Yu Zhou

Main category: cs.CV

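TL;DR: This paper proposes CSTR-CLIP, a new model for Chinese scene text retrieval that substantially improves performance by combining global visual information with multi-granularity alignment training.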

Details

Motivation: Chinese scene text retrieval is highly challenging because of complex and diverse text layouts, and existing methods do not handle this well. Method: Proposes the CSTR-CLIP model, which uses two-stage training and combines global visual information with multi-granularity alignment. Result: On an existing benchmark, CSTR-CLIP improves accuracy by 18.82% and also offers faster inference. Conclusion: CSTR-CLIP handles diverse text layouts well; the dataset and code will be released to facilitate research. Abstract: Chinese scene text retrieval is a practical task that aims to search for images containing visual instances of a Chinese query text. This task is extremely challenging because Chinese text often features complex and diverse layouts in real-world scenes. Current efforts tend to inherit the solution for English scene text retrieval, failing to achieve satisfactory performance. In this paper, we establish a Diversified Layout benchmark for Chinese Street View Text Retrieval (DL-CSVTR), which is specifically designed to evaluate retrieval performance across various text layouts, including vertical, cross-line, and partial alignments. To address the limitations in existing methods, we propose Chinese Scene Text Retrieval CLIP (CSTR-CLIP), a novel model that integrates global visual information with multi-granularity alignment training. CSTR-CLIP applies a two-stage training process to overcome previous limitations, such as the exclusion of visual features outside the text region and reliance on single-granularity alignment, thereby enabling the model to effectively handle diverse text layouts. Experiments on an existing benchmark show that CSTR-CLIP outperforms the previous state-of-the-art model by 18.82% in accuracy and also provides faster inference speed. Further analysis on DL-CSVTR confirms the superior performance of CSTR-CLIP in handling various text layouts. The dataset and code will be publicly available to facilitate research in Chinese scene text retrieval.

[70] Structure-Aware Radar-Camera Depth Estimation

Fuyi Zhang,Zhu Yu,Chunhao Li,Runmin Zhang,Xiaokai Bai,Zili Zhou,Si-Yuan Cao,Wang Wang,Hui-Liang Shen

Main category: cs.CV

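TL;DR: The paper discusses progress in monocular depth estimation, focusing on deep learning methods and the challenge of generalizing to unseen domains.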

Details

Motivation: Monocular depth estimation aims to predict per-pixel depth from an RGB image, and deep learning has significantly advanced the field. Generalization to unseen domains nevertheless remains a challenge. Method: The paper reviews several approaches, including multi-scale fusion networks, recasting regression as classification, incorporating additional priors, and designing more effective objective functions. Recent methods use affine-invariant losses to enable joint training on multiple datasets. Result: Depth Anything leads in zero-shot monocular depth estimation; although it falls short on metric depth, it excels at extracting structural information from unseen images. Conclusion: Despite significant progress in monocular depth estimation, generalization and metric depth accuracy still need further research. Abstract: Monocular depth estimation aims to determine the depth of each pixel from an RGB image captured by a monocular camera. The development of deep learning has significantly advanced this field by facilitating the learning of depth features from some well-annotated datasets \cite{Geiger_Lenz_Stiller_Urtasun_2013,silberman2012indoor}. Eigen et al. \cite{eigen2014depth} first introduced a multi-scale fusion network for depth regression. Subsequent improvements have come from reinterpreting the regression task as a classification problem \cite{bhat2021adabins,Li_Wang_Liu_Jiang_2022}, incorporating additional priors \cite{shao2023nddepth,yang2023gedepth}, and developing more effective objective functions \cite{xian2020structure,Yin_Liu_Shen_Yan_2019}. Despite these advances, generalizing to unseen domains remains a challenge. Recently, several methods have employed affine-invariant loss to enable multi-dataset joint training \cite{MiDaS,ZeroDepth,guizilini2023towards,Dany}. Among them, Depth Anything \cite{Dany} has shown leading performance in zero-shot monocular depth estimation. While it struggles to estimate accurate metric depth due to the lack of explicit depth cues, it excels at extracting structural information from unseen images, producing structure-detailed monocular depth.
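To make the idea of an affine-invariant loss concrete, the sketch below shows one common formulation: the prediction is aligned to the target with a least-squares scale and shift before the error is measured, so two depth maps that differ only by an affine transform incur zero loss. This is an illustrative assumption about the general technique; the methods cited in the abstract may normalize differently, and the function name is hypothetical.

```python
def scale_shift_invariant_error(pred, target):
    """Mean absolute error after least-squares scale/shift alignment.

    pred, target: flat lists of depth (or inverse-depth) values of equal length.
    The closed-form solution of min_{s,b} sum (s*p + b - t)^2 is
    s = cov(p, t) / var(p) and b = mean(t) - s * mean(p).
    """
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    cov_pt = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    scale = cov_pt / var_p if var_p > 0 else 0.0  # constant prediction: no scale
    shift = mean_t - scale * mean_p
    return sum(abs(scale * p + shift - t) for p, t in zip(pred, target)) / n


# A prediction that is an exact affine transform of the target has (near) zero error.
print(scale_shift_invariant_error([1.0, 2.0, 4.0], [5.0, 7.0, 11.0]))
```

Because the alignment absorbs unknown scale and shift, datasets with incompatible depth units can be mixed during training, which is the property the abstract attributes to this family of losses.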

[71] Point Cloud Segmentation of Agricultural Vehicles using 3D Gaussian Splatting

Alfred T. Christiansen,Andreas H. Højrup,Morten K. Stephansen,Md Ibtihaj A. Sakib,Taman S. Poojary,Filip Slezak,Morten S. Laursen,Thomas B. Moeslund,Joakim B. Haurum

Main category: cs.CV

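TL;DR: Proposes a pipeline that uses 3D Gaussian Splatting and Gaussian Opacity Fields to generate synthetic data for training 3D point cloud semantic segmentation models, reaching high accuracy without any real data.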

Details

Motivation: Acquiring and annotating real point cloud data is costly and time-consuming, so an efficient, low-cost way to generate synthetic data is needed. Method: Combines 3D Gaussian Splatting and Gaussian Opacity Fields to generate 3D assets of agricultural vehicles, then produces point clouds with a simulated LiDAR in a simulated environment. Result: Models trained only on synthetic data (e.g., PTv3) reach an mIoU of 91.35%, and in some cases even outperform models trained on real data. Conclusion: Synthetic data can effectively replace real data and performs better in some scenarios; the models also generalize to semantic classes they were not trained on. Abstract: Training neural networks for tasks such as 3D point cloud semantic segmentation demands extensive datasets, yet obtaining and annotating real-world point clouds is costly and labor-intensive. This work aims to introduce a novel pipeline for generating realistic synthetic data, by leveraging 3D Gaussian Splatting (3DGS) and Gaussian Opacity Fields (GOF) to generate 3D assets of multiple different agricultural vehicles instead of using generic models. These assets are placed in a simulated environment, where the point clouds are generated using a simulated LiDAR. This is a flexible approach that allows changing the LiDAR specifications without incurring additional costs. We evaluated the impact of synthetic data on segmentation models such as PointNet++, Point Transformer V3, and OACNN, by training and validating the models only on synthetic data. Remarkably, the PTv3 model had an mIoU of 91.35%, a noteworthy result given that the model had neither been trained nor validated on any real data. Further studies even suggested that in certain scenarios the models trained only on synthetically generated data performed better than models trained on real-world data. Finally, experiments demonstrated that the models can generalize across semantic classes, enabling accurate predictions on mesh models they were never trained on.
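For readers unfamiliar with the mIoU figure reported above, the following sketch shows how mean intersection-over-union is typically computed from per-point class labels. It is a generic definition, not the authors' evaluation code; the function name and the choice to skip classes absent from both prediction and ground truth are assumptions of this sketch.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for per-point labels.

    pred, target: equal-length sequences of integer class ids, one per point.
    Classes that appear in neither sequence are skipped (a convention chosen
    here; evaluation toolkits differ on this detail).
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Six points, three classes: per-class IoUs are 1/3, 2/3 and 1/2.
print(mean_iou([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], 3))
```

Averaging per class rather than per point keeps small classes from being swamped by large ones, which matters when vehicle parts occupy very different numbers of LiDAR returns.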

[72] UAV4D: Dynamic Neural Rendering of Human-Centric UAV Imagery using Gaussian Splatting

Jaehoon Choi,Dongki Jung,Christopher Maxey,Yonghan Lee,Sungmin Eum,Dinesh Manocha,Heesung Kwon

Main category: cs.CV

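TL;DR: The UAV4D framework combines a 3D foundation model with a human mesh reconstruction model to reconstruct dynamic scenes with multiple moving pedestrians from monocular UAV video, yielding higher-quality novel view synthesis.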

Details

Motivation: Existing dynamic neural rendering methods do not adequately address the distinctive challenges of scenes captured by a monocular UAV camera, such as top-down viewpoints and multiple moving pedestrians, and suitable datasets are lacking. Method: Combines a 3D foundation model with a human mesh reconstruction model, resolves scale ambiguity by identifying human-scene contact points, and initializes Gaussian splats from the SMPL model and a background mesh for holistic rendering. Result: Tested on three complex UAV datasets, the method improves PSNR by 1.5 dB and gives sharper visuals than existing methods. Conclusion: UAV4D effectively addresses the challenges of rendering dynamic UAV scenes and markedly improves novel view synthesis quality. Abstract: Despite significant advancements in dynamic neural rendering, existing methods fail to address the unique challenges posed by UAV-captured scenarios, particularly those involving monocular camera setups, top-down perspective, and multiple small, moving humans, which are not adequately represented in existing datasets. In this work, we introduce UAV4D, a framework for enabling photorealistic rendering for dynamic real-world scenes captured by UAVs. Specifically, we address the challenge of reconstructing dynamic scenes with multiple moving pedestrians from monocular video data without the need for additional sensors. We use a combination of a 3D foundation model and a human mesh reconstruction model to reconstruct both the scene background and humans. We propose a novel approach to resolve the scene scale ambiguity and place both humans and the scene in world coordinates by identifying human-scene contact points. Additionally, we exploit the SMPL model and background mesh to initialize Gaussian splats, enabling holistic scene rendering. We evaluated our method on three complex UAV-captured datasets: VisDrone, Manipal-UAV, and Okutama-Action, each with distinct characteristics and 10 to 50 humans. Our results demonstrate the benefits of our approach over existing methods in novel view synthesis, achieving a 1.5 dB PSNR improvement and superior visual sharpness.
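To put the reported 1.5 dB PSNR gain in context, the sketch below computes PSNR from two flattened images using the usual definition. It is a generic illustration, not the authors' evaluation script; the function name and the assumption that pixel values lie in [0, 1] are choices made here.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flattened images.

    reference, rendered: equal-length sequences of pixel intensities.
    max_value: peak intensity (1.0 assumes images normalised to [0, 1]).
    """
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] image gives an MSE of 0.01, i.e. about 20 dB.
print(psnr([0.0, 0.0], [0.1, 0.1]))
```

Since PSNR is logarithmic in the mean squared error, a 1.5 dB improvement corresponds to an MSE that is lower by a factor of about 10^0.15, roughly 1.41.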

[73] Physical Annotation for Automated Optical Inspection: A Concept for In-Situ, Pointer-Based Trainingdata Generation

Oliver Krumpek,Oliver Heimann,Jörg Krüger

Main category: cs.CV

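TL;DR: Proposes a novel physical annotation system for generating training data for automated optical inspection, using pointer interaction and a projected interface to improve annotation efficiency and accuracy.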

Details

Motivation: Conventional screen-based annotation is inefficient and unintuitive, and it fails to make full use of human inspectors' expertise. The new system aims to raise annotation quality by capturing annotation trajectories directly through physical interaction. Method: Uses calibrated, tracked pointers to record user input, provides visual guidance through a projected interface, and converts the physical interactions into standardized annotation formats. Result: A preliminary evaluation confirms that the system can capture detailed annotation trajectories, and that integration with CVAT streamlines the workflow for subsequent ML tasks. Conclusion: The system bridges the gap between human expertise and automated data generation, lets non-IT experts take part in ML training, and may advance data generation for automated optical inspection. Abstract: This paper introduces a novel physical annotation system designed to generate training data for automated optical inspection. The system uses pointer-based in-situ interaction to transfer the valuable expertise of trained inspection personnel directly into a machine learning (ML) training pipeline. Unlike conventional screen-based annotation methods, our system captures physical trajectories and contours directly on the object, providing a more intuitive and efficient way to label data. The core technology uses calibrated, tracked pointers to accurately record user input and transform these spatial interactions into standardised annotation formats that are compatible with open-source annotation software. Additionally, a simple projector-based interface projects visual guidance onto the object to assist users during the annotation process, ensuring greater accuracy and consistency. The proposed concept bridges the gap between human expertise and automated data generation, enabling non-IT experts to contribute to the ML training pipeline and preventing the loss of valuable training samples. Preliminary evaluation results confirm the feasibility of capturing detailed annotation trajectories and demonstrate that integration with CVAT streamlines the workflow for subsequent ML tasks. This paper details the system architecture, calibration procedures and interface design, and discusses its potential contribution to future ML data generation for automated optical inspection.

[74] FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing

Guangzhao Li,Yanming Yang,Chenxi Song,Chi Zhang

Main category: cs.CV

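TL;DR: FlowDirector proposes an inversion-free video editing framework that evolves the video directly in data space via an ODE, preserving temporal consistency and structural detail, and combines attention-guided masking with a guidance-enhanced editing strategy for efficient, coherent video editing.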

Details

Motivation: Existing inversion-based methods often cause temporal inconsistency and structural degradation in video editing; FlowDirector aims to solve these problems. Method: Evolves the video directly in data space via an ODE, combined with attention-guided masking and a guidance-enhanced editing strategy, to achieve localized, controllable edits and semantic alignment. Result: Experiments show that FlowDirector achieves state-of-the-art performance in instruction adherence, temporal consistency, and background preservation. Conclusion: FlowDirector offers a new paradigm for efficient video editing without inversion. Abstract: Text-driven video editing aims to modify video content according to natural language instructions. While recent training-free approaches have made progress by leveraging pre-trained diffusion models, they typically rely on inversion-based techniques that map input videos into the latent space, which often leads to temporal inconsistencies and degraded structural fidelity. To address this, we propose FlowDirector, a novel inversion-free video editing framework. Our framework models the editing process as a direct evolution in data space, guiding the video via an Ordinary Differential Equation (ODE) to smoothly transition along its inherent spatiotemporal manifold, thereby preserving temporal coherence and structural details. To achieve localized and controllable edits, we introduce an attention-guided masking mechanism that modulates the ODE velocity field, preserving non-target regions both spatially and temporally. Furthermore, to address incomplete edits and enhance semantic alignment with editing instructions, we present a guidance-enhanced editing strategy inspired by Classifier-Free Guidance, which leverages differential signals between multiple candidate flows to steer the editing trajectory toward stronger semantic alignment without compromising structural consistency. Extensive experiments across benchmarks demonstrate that FlowDirector achieves state-of-the-art performance in instruction adherence, temporal consistency, and background preservation, establishing a new paradigm for efficient and coherent video editing without inversion.

[75] A Survey on Vietnamese Document Analysis and Recognition: Challenges and Future Directions

Anh Le,Thanh Lam,Dung Nguyen

Main category: cs.CV

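TL;DR: This survey reviews the state of Vietnamese document analysis and recognition (DAR), discusses the challenges facing traditional OCR and deep learning methods as well as the potential of large language models (LLMs) and vision-language models, and proposes future research directions.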

Details

Motivation: Vietnamese text recognition faces unique challenges because of its complex diacritics, tonal variations, and lack of large-scale annotated datasets, and new technical advances are needed. Method: Reviews existing techniques, including traditional OCR, deep learning methods, and applications of LLMs, and analyzes their limitations. Result: LLMs and vision-language models show promise for text recognition and document understanding, but domain adaptation, multimodal learning, and computational efficiency remain open problems. Conclusion: Future research should focus on dataset development, model optimization, and the integration of multimodal approaches to advance Vietnamese DAR. Abstract: Vietnamese document analysis and recognition (DAR) is a crucial field with applications in digitization, information retrieval, and automation. Despite advancements in OCR and NLP, Vietnamese text recognition faces unique challenges due to its complex diacritics, tonal variations, and lack of large-scale annotated datasets. Traditional OCR methods often struggle with real-world document variations, while deep learning approaches have shown promise but remain limited by data scarcity and generalization issues. Recently, large language models (LLMs) and vision-language models have demonstrated remarkable improvements in text recognition and document understanding, offering a new direction for Vietnamese DAR. However, challenges such as domain adaptation, multimodal learning, and computational efficiency persist. This survey provides a comprehensive review of existing techniques in Vietnamese document recognition, highlights key limitations, and explores how LLMs can revolutionize the field. We discuss future research directions, including dataset development, model optimization, and the integration of multimodal approaches for improved document intelligence. By addressing these gaps, we aim to foster advancements in Vietnamese DAR and encourage community-driven solutions.

[76] SeedEdit 3.0: Fast and High-Quality Generative Image Editing

Peng Wang,Yichun Shi,Xiaochen Lian,Zhonghua Zhai,Xin Xia,Xuefeng Xiao,Weilin Huang,Jianchao Yang

Main category: cs.CV

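TL;DR: SeedEdit 3.0 is the companion to Seedream 3.0; it markedly improves edit instruction following and image content preservation, and achieves better performance through an improved data pipeline and joint learning.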

Details

Motivation: To address the shortcomings of existing image editing tools in instruction following and content preservation, and to improve editing results on real image inputs. Method: 1. Develops an enhanced data pipeline that uses a meta-info embedding strategy to mix data from multiple sources; 2. Introduces a joint learning pipeline that combines a diffusion loss and a reward loss; 3. Upgrades the T2I model. Result: On the testing benchmarks, SeedEdit 3.0 achieves a high usability rate of 56.1%, ahead of SeedEdit 1.6 (38.4%), GPT4o (37.1%), and Gemini 2.0 (30.3%). Conclusion: SeedEdit 3.0 achieves the best trade-off across multiple aspects and markedly improves the practicality and performance of image editing. Abstract: We introduce SeedEdit 3.0, a companion to our T2I model Seedream 3.0 [22], which significantly improves over our previous version [27] in both edit instruction following and image content (e.g., ID/IP) preservation on real image inputs. In addition to upgrading the model with T2I, in this report we present several key improvements. First, we develop an enhanced data curation pipeline with a meta-info paradigm and meta-info embedding strategy that help mix images from multiple data sources. This allows us to scale editing data effectively, and the meta information helps connect the VLM with the diffusion model more closely. Second, we introduce a joint learning pipeline for computing a diffusion loss and a reward loss. Finally, we evaluate SeedEdit 3.0 on our testing benchmarks for real image editing, where it achieves the best trade-off between multiple aspects, yielding a high usability rate of 56.1%, compared to SeedEdit 1.6 (38.4%), GPT4o (37.1%) and Gemini 2.0 (30.3%).

[77] Interpretable Multimodal Framework for Human-Centered Street Assessment: Integrating Visual-Language Models for Perceptual Urban Diagnostics

HaoTian Lan

Main category: cs.CV

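TL;DR: The study proposes a Multimodal Street Evaluation Framework (MSEF) that combines vision and language models to assess streetscapes in an interpretable way, capturing both subjective perceptions and objective features.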

Details

Motivation: Traditional objective street metrics cannot adequately capture subjective perceptions, which are essential to inclusive urban design. Method: The study builds MSEF from a vision transformer (VisualGLM-6B) and a large language model (GPT-4), with parameter-efficient fine-tuning via LoRA and P-Tuning v2. Result: The model reaches an F1 score of 0.84 on objective features and 89.3% agreement with resident perceptions, and captures context-dependent contradictions and nonlinear patterns. Conclusion: MSEF offers a methodological innovation for urban perception modeling and a practical tool for planning systems that need to balance infrastructural precision with residents' experience. Abstract: While objective street metrics derived from imagery or GIS have become standard in urban analytics, they remain insufficient to capture subjective perceptions essential to inclusive urban design. This study introduces a novel Multimodal Street Evaluation Framework (MSEF) that fuses a vision transformer (VisualGLM-6B) with a large language model (GPT-4), enabling interpretable dual-output assessment of streetscapes. Leveraging over 15,000 annotated street-view images from Harbin, China, we fine-tune the framework using LoRA and P-Tuning v2 for parameter-efficient adaptation. The model achieves an F1 score of 0.84 on objective features and 89.3 percent agreement with aggregated resident perceptions, validated across stratified socioeconomic geographies. Beyond classification accuracy, MSEF captures context-dependent contradictions: for instance, informal commerce boosts perceived vibrancy while simultaneously reducing pedestrian comfort. It also identifies nonlinear and semantically contingent patterns -- such as the divergent perceptual effects of architectural transparency across residential and commercial zones -- revealing the limits of universal spatial heuristics. By generating natural-language rationales grounded in attention mechanisms, the framework bridges sensory data with socio-affective inference, enabling transparent diagnostics aligned with SDG 11. This work offers both methodological innovation in urban perception modeling and practical utility for planning systems seeking to reconcile infrastructural precision with lived experience.

[78] FG 2025 TrustFAA: the First Workshop on Towards Trustworthy Facial Affect Analysis: Advancing Insights of Fairness, Explainability, and Safety (TrustFAA)

Jiaee Cheong,Yang Liu,Harold Soh,Hatice Gunes

Main category: cs.CV

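TL;DR: The workshop aims to examine the trustworthiness of Emotion AI-powered facial affect analysis (FAA) tools, focusing on fairness, explainability, and safety.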

Details

Motivation: As FAA tools become widespread, trustworthiness concerns such as bias and privacy are increasingly prominent, and related research needs to be encouraged. Method: A workshop that brings together researchers to discuss trustworthiness challenges in FAA tasks such as expression recognition and human-robot interaction. Result: Supports FG2025's ethics initiative and encourages research and discussion on trustworthy FAA. Conclusion: The workshop aims to promote fair, transparent, and safe development of the FAA field. Abstract: With the increasing prevalence and deployment of Emotion AI-powered facial affect analysis (FAA) tools, concerns about the trustworthiness of these systems have become more prominent. This first workshop on "Towards Trustworthy Facial Affect Analysis: Advancing Insights of Fairness, Explainability, and Safety (TrustFAA)" aims to bring together researchers who are investigating different challenges in relation to trustworthiness (such as interpretability, uncertainty, biases, and privacy) across various facial affect analysis tasks, including macro/micro-expression recognition, facial action unit detection, other corresponding applications such as pain and depression detection, as well as human-robot interaction and collaboration. In alignment with FG2025's emphasis on ethics, as demonstrated by the inclusion of an Ethical Impact Statement requirement for this year's submissions, this workshop supports FG2025's efforts by encouraging research, discussion and dialogue on trustworthy FAA.

[79] Astraea: A GPU-Oriented Token-wise Acceleration Framework for Video Diffusion Transformers

Haosong Liu,Yuge Cheng,Zihan Liu,Aiyue Chen,Yiwu Yao,Chen Chen,Jingwen Leng,Yu Feng,Minyi Guo

Main category: cs.CV

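TL;DR: ASTRAEA is an automatic framework that substantially speeds up inference for video diffusion transformers through lightweight token selection and an efficient sparse attention strategy, while preserving generation quality.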

Details

Motivation: Video diffusion transformers (vDiTs) perform well in text-to-video generation, but their high computational demands limit practical deployment. Existing acceleration methods mostly rely on heuristics and have limited applicability. Method: ASTRAEA proposes a lightweight token selection mechanism and a memory-efficient, GPU-parallel sparse attention strategy, and uses an evolutionary algorithm to automatically optimize the allocation of the token budget. Result: Achieves up to 2.4x inference speedup on a single GPU (up to 13.2x on 8 GPUs) with minimal loss in video quality (<0.5% loss on the VBench score). Conclusion: ASTRAEA accelerates vDiT inference while maintaining high-quality generation, offering a viable route to practical deployment. Abstract: Video diffusion transformers (vDiTs) have made impressive progress in text-to-video generation, but their high computational demands present major challenges for practical deployment. While existing acceleration methods reduce workload at various granularities, they often rely on heuristics, limiting their applicability. We introduce ASTRAEA, an automatic framework that searches for near-optimal configurations for vDiT-based video generation. At its core, ASTRAEA proposes a lightweight token selection mechanism and a memory-efficient, GPU-parallel sparse attention strategy, enabling linear reductions in execution time with minimal impact on generation quality. To determine optimal token reduction for different timesteps, we further design a search framework that leverages a classic evolutionary algorithm to automatically determine the distribution of the token budget effectively. Together, ASTRAEA achieves up to 2.4x inference speedup on a single GPU with great scalability (up to 13.2x speedup on 8 GPUs) while retaining better video quality compared to the state-of-the-art methods (<0.5% loss on the VBench score compared to the baseline vDiT models).

[80] DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models

Revant Teotia,Candace Ross,Karen Ullrich,Sumit Chopra,Adriana Romero-Soriano,Melissa Hall,Matthew J. Muckley

Main category: cs.CV

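TL;DR: The paper proposes the DIM-CIM framework for reference-free evaluation of diversity and generalization in text-to-image models, and finds with the COCO-DIMCIM benchmark that generalization improves as parameter count grows while default-mode diversity declines.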

Details

Motivation: Existing evaluation methods either depend on reference image datasets or are not specific about the kind of diversity they measure, which limits their adaptability and interpretability. Method: Proposes the DIM-CIM framework and evaluates models' default-mode diversity and generalization capacity with the COCO-DIMCIM benchmark. Result: Finds that generalization improves with more parameters while default-mode diversity declines, and identifies fine-grained failure cases. Conclusion: DIM-CIM provides a flexible and interpretable framework for assessing the diversity and generalization of text-to-image models. Abstract: Recent advances in text-to-image (T2I) models have achieved impressive quality and consistency. However, this has come at the cost of representation diversity. While automatic evaluation methods exist for benchmarking model diversity, they either require reference image datasets or lack specificity about the kind of diversity measured, limiting their adaptability and interpretability. To address this gap, we introduce the Does-it/Can-it framework, DIM-CIM, a reference-free measurement of default-mode diversity ("Does" the model generate images with expected attributes?) and generalization capacity ("Can" the model generate diverse attributes for a particular concept?). We construct the COCO-DIMCIM benchmark, which is seeded with COCO concepts and captions and augmented by a large language model. With COCO-DIMCIM, we find that widely-used models improve in generalization at the cost of default-mode diversity when scaling from 1.5B to 8.1B parameters. DIMCIM also identifies fine-grained failure cases, such as attributes that are generated with generic prompts but are rarely generated when explicitly requested. Finally, we use DIMCIM to evaluate the training data of a T2I model and observe a correlation of 0.85 between diversity in training images and default-mode diversity. Our work provides a flexible and interpretable framework for assessing T2I model diversity and generalization, enabling a more comprehensive understanding of model performance.

[81] Practical Manipulation Model for Robust Deepfake Detection

Benedikt Hopf,Radu Timofte

Main category: cs.CV

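TL;DR: The paper proposes a Practical Manipulation Model (PMM) that markedly improves the robustness and performance of deepfake detectors by enlarging the space of pseudo-fakes and adding strong degradations to the training images.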

Details

Motivation: Current deepfake detection models are unstable under non-ideal conditions and easy to circumvent, so their robustness and generalization need to improve. Method: Develops the PMM, which extends the space of pseudo-fakes with Poisson blending, more diverse masks, generator artifacts, and distractors, and strengthens training by adding strong degradations to the images. Result: AUC increases by 3.51% and 6.21% on the DFDC and DFDCP datasets, respectively, with markedly better robustness and performance. Conclusion: The PMM effectively addresses the robustness problems of existing detectors and yields clear improvements on standard benchmark datasets. Abstract: Modern deepfake detection models have achieved strong performance even on the challenging cross-dataset task. However, detection performance under non-ideal conditions remains very unstable, limiting success on some benchmark datasets and making it easy to circumvent detection. Inspired by the move to a more real-world degradation model in the area of image super-resolution, we have developed a Practical Manipulation Model (PMM) that covers a larger set of possible forgeries. We extend the space of pseudo-fakes by using Poisson blending, more diverse masks, generator artifacts, and distractors. Additionally, we improve the detectors' generality and robustness by adding strong degradations to the training images. We demonstrate that these changes not only significantly enhance the model's robustness to common image degradations but also improve performance on standard benchmark datasets. Specifically, we show clear increases of 3.51% and 6.21% AUC on the DFDC and DFDCP datasets, respectively, over the s-o-t-a LAA backbone. Furthermore, we highlight the lack of robustness in previous detectors and our improvements in this regard. Code can be found at https://github.com/BenediktHopf/PMM

[82] CIVET: Systematic Evaluation of Understanding in VLMs

Massimo Rizzoli,Simone Alghisi,Olha Khomyn,Gabriel Roccabruna,Seyed Mahed Mousavi,Giuseppe Riccardi

Main category: cs.CV

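TL;DR: The CIVET framework systematically evaluates how well vision-language models (VLMs) understand object properties and relations, and finds that current VLMs are limited in recognizing basic properties, depend on object position, struggle with relations, and fall short of human performance.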

Details

Motivation: To study how well VLMs understand the structure and semantics of a scene, and to fill the gap left by the lack of standardized systematic evaluation. Method: Introduces the CIVET framework and uses controlled stimuli to systematically evaluate five state-of-the-art VLMs, avoiding interference from noise, bias, and scene complexity. Result: VLMs accurately recognize only a limited set of basic properties, their performance depends on object position, and they struggle to understand relations among objects, falling short of human performance. Conclusion: Current VLMs remain limited in scene understanding and need further improvement to approach human ability. Abstract: While Vision-Language Models (VLMs) have achieved competitive performance in various tasks, their comprehension of the underlying structure and semantics of a scene remains understudied. To investigate the understanding of VLMs, we study their capability regarding object properties and relations in a controlled and interpretable manner. To this end, we introduce CIVET, a novel and extensible framework for systematiC evaluatIon Via controllEd sTimuli. CIVET addresses the lack of standardized systematic evaluation for assessing VLMs' understanding, enabling researchers to test hypotheses with statistical rigor. With CIVET, we evaluate five state-of-the-art VLMs on exhaustive sets of stimuli, free from annotation noise, dataset-specific biases, and uncontrolled scene complexity. Our findings reveal that 1) current VLMs can accurately recognize only a limited set of basic object properties; 2) their performance heavily depends on the position of the object in the scene; 3) they struggle to understand basic relations among objects. Furthermore, a comparative evaluation with human annotators reveals that VLMs still fall short of achieving human-level accuracy.

[83] FRED: The Florence RGB-Event Drone Dataset

Gabriele Magrini,Niccolò Marini,Federico Becattini,Lorenzo Berlincioni,Niccolò Biondi,Pietro Pala,Alberto Del Bimbo

Main category: cs.CV

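TL;DR: The paper introduces the Florence RGB-Event Drone dataset (FRED), designed for drone detection, tracking, and trajectory forecasting; it combines RGB video with event streams and fills gaps in existing benchmarks.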

Details

Motivation: Conventional RGB cameras are limited in capturing small, fast-moving drones, while event cameras offer high temporal resolution and dynamic range but lack benchmark datasets with drone-specific motion patterns. Method: Presents the FRED dataset, with more than 7 hours of densely annotated drone trajectories covering 5 drone models as well as adverse weather and lighting conditions. Result: FRED provides detailed evaluation protocols and standard metrics that support reproducible benchmarking. Conclusion: FRED is expected to advance research in high-speed drone perception and multimodal spatiotemporal understanding. Abstract: Small, fast, and lightweight drones present significant challenges for traditional RGB cameras due to their limitations in capturing fast-moving objects, especially under challenging lighting conditions. Event cameras offer an ideal solution, providing high temporal definition and dynamic range, yet existing benchmarks often lack fine temporal resolution or drone-specific motion patterns, hindering progress in these areas. This paper introduces the Florence RGB-Event Drone dataset (FRED), a novel multimodal dataset specifically designed for drone detection, tracking, and trajectory forecasting, combining RGB video and event streams. FRED features more than 7 hours of densely annotated drone trajectories, using 5 different drone models and including challenging scenarios such as rain and adverse lighting conditions. We provide detailed evaluation protocols and standard metrics for each task, facilitating reproducible benchmarking. The authors hope FRED will advance research in high-speed drone perception and multimodal spatiotemporal understanding.

[84] Through-the-Wall Radar Human Activity Recognition WITHOUT Using Neural Networks

Weicheng Gao

Main category: cs.CV

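TL;DR: The paper proposes a through-the-wall radar human activity recognition method that does not rely on neural networks, achieving intelligent recognition through signal processing and topological similarity computation.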

Details

Motivation: Through-the-wall radar human activity recognition currently relies heavily on neural network training and overlooks the physical interpretability and signal-processing foundations of earlier template-matching methods. The author wants to return to that original path and challenge the intelligent recognition ability of neural network models. Method: First generates the radar range-time and Doppler-time maps, determines the target foreground and noise background regions by corner detection, segments the micro-Doppler signature with a multiphase active contour model, discretizes it into a two-dimensional point cloud, and finally computes the topological similarity between the point cloud and template data with the Mapper algorithm. Result: The method's effectiveness is verified through numerical simulations and measured experiments. Conclusion: The method shows that intelligent recognition is possible without neural networks, and the code is open-sourced. Abstract: After a few years of research in the field of through-the-wall radar (TWR) human activity recognition (HAR), I found that we seem to be stuck in the mindset of training on radar image data through neural network models. The earliest related works in this field, based on template matching, did not require a training process, and I believe they have never died, because these methods possess strong physical interpretability and are closer to the basis of theoretical signal processing research. In this paper, I would like to return to the original path by eschewing neural networks for the TWR HAR task, and to try to match the intelligent recognition achieved by neural network models. In detail, the range-time map and Doppler-time map of TWR are first generated. Then, the initial regions of the human target foreground and noise background on the maps are determined using a corner detection method, and the micro-Doppler signature is segmented using the multiphase active contour model. The micro-Doppler segmentation feature is discretized into a two-dimensional point cloud. Finally, the topological similarity between the resulting point cloud and the point clouds of the template data is calculated using the Mapper algorithm to obtain the recognition results. The effectiveness of the proposed method is demonstrated by numerically simulated and measured experiments. The open-source code of this work is released at: https://github.com/JoeyBGOfficial/Through-the-Wall-Radar-Human-Activity-Recognition-Without-Using-Neural-Networks.

[85] Track Any Anomalous Object: A Granular Video Anomaly Detection Pipeline

Yuzhi Huang,Chenxin Li,Haitao Zhang,Zixu Lin,Yunlong Lin,Hengyu Liu,Wuyang Li,Xinyu Liu,Jiechao Gao,Yue Huang,Xinghao Ding,Yixuan Yuan

Main category: cs.CV

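TL;DR: Proposes TAO, a new framework for fine-grained video anomaly detection that tracks anomalous objects at the pixel level, requires no threshold tuning, and improves accuracy and robustness.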

Details

Motivation: Existing methods mostly focus on anomalous frames or objects and neglect pixel-level analysis, which limits the range of anomalies they can detect. TAO aims to address this. Method: Recasts anomaly detection as pixel-level tracking and links it to segmentation and tracking tasks, avoiding threshold tuning. Result: Experiments show that TAO sets new benchmarks in accuracy and robustness. Conclusion: TAO offers a more precise and efficient solution for video anomaly detection. Abstract: Video anomaly detection (VAD) is crucial in scenarios such as surveillance and autonomous driving, where timely detection of unexpected activities is essential. Although existing methods have primarily focused on detecting anomalous objects in videos -- either by identifying anomalous frames or objects -- they often neglect finer-grained analysis, such as anomalous pixels, which limits their ability to capture a broader range of anomalies. To address this challenge, we propose a new framework called Track Any Anomalous Object (TAO), which introduces a granular video anomaly detection pipeline that, for the first time, integrates the detection of multiple fine-grained anomalous objects into a unified framework. Unlike methods that assign anomaly scores to every pixel, our approach transforms the problem into pixel-level tracking of anomalous objects. By linking anomaly scores to downstream tasks such as segmentation and tracking, our method removes the need for threshold tuning and achieves more precise anomaly localization in long and complex video sequences. Experiments demonstrate that TAO sets new benchmarks in accuracy and robustness. Project page available online.

[86] Single GPU Task Adaptation of Pathology Foundation Models for Whole Slide Image Analysis

Neeraj Kumar,Swaraj Nanda,Siddharth Singi,Jamal Benhamida,David Kim,Jie-Fu Chen,Amir Momeni-Boroujeni,Gregory M. Goldgof,Gabriele Campanella,Chad Vanderbilt

Main category: cs.CV

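TL;DR: TAPFM is a single-GPU task adaptation method that uses ViT attention for MIL aggregation and optimizes both feature representations and attention weights, substantially improving pathology foundation models on clinical tasks.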

Details

Motivation: Adapting pretrained pathology foundation models (PFMs) to specific clinical tasks is challenging, mainly because only weak WSI-level labels are available and an MIL paradigm is required. Method: Proposes TAPFM, which performs MIL aggregation with ViT attention, optimizes features and attention weights together, and keeps separate computational graphs for the MIL aggregator and the PFM for stable training. Result: On mutation prediction tasks for bladder cancer and lung adenocarcinoma, TAPFM outperforms conventional approaches and handles multi-label classification effectively. Conclusion: TAPFM makes adaptation of pretrained PFMs practical on standard hardware for a range of clinical applications. Abstract: Pathology foundation models (PFMs) have emerged as powerful tools for analyzing whole slide images (WSIs). However, adapting these pretrained PFMs for specific clinical tasks presents considerable challenges, primarily due to the availability of only weak (WSI-level) labels for gigapixel images, necessitating a multiple instance learning (MIL) paradigm for effective WSI analysis. This paper proposes a novel approach for single-GPU Task Adaptation of PFMs (TAPFM) that uses vision transformer (ViT) attention for MIL aggregation while optimizing both feature representations and attention weights. The proposed approach maintains separate computational graphs for the MIL aggregator and the PFM to create stable training dynamics that align with downstream task objectives during end-to-end adaptation. Evaluated on mutation prediction tasks for bladder cancer and lung adenocarcinoma across institutional and TCGA cohorts, TAPFM consistently outperforms conventional approaches, with H-Optimus-0 (TAPFM) outperforming the benchmarks. TAPFM effectively handles multi-label classification of actionable mutations as well. Thus, TAPFM makes adaptation of powerful pre-trained PFMs practical on standard hardware for various clinical applications.
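The MIL aggregation step mentioned in this abstract can be pictured with the minimal sketch below: per-tile attention scores are turned into softmax weights, and the slide-level feature is the weighted sum of tile features. This is generic attention pooling under assumed inputs, not TAPFM's actual aggregator; in particular, where the scores come from (the abstract says ViT attention) is outside the sketch, and the function name is hypothetical.

```python
import math


def attention_mil_pool(tile_features, attention_scores):
    """Aggregate tile features into one slide-level vector.

    tile_features: list of equal-length feature vectors, one per tile.
    attention_scores: one unnormalised score per tile (assumed to be given).
    Returns (slide_feature, weights), where weights sum to 1.
    """
    top = max(attention_scores)
    exps = [math.exp(s - top) for s in attention_scores]  # stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(tile_features[0])
    slide_feature = [
        sum(w * feat[d] for w, feat in zip(weights, tile_features))
        for d in range(dim)
    ]
    return slide_feature, weights


# With equal scores the pooled feature is the plain average of the tiles.
feature, weights = attention_mil_pool([[1.0, 0.0], [3.0, 2.0]], [0.0, 0.0])
print(feature, weights)
```

Because only a slide-level label is available, the weights are what lets the training signal reach the tiles that matter most for the prediction.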

[87] MokA: Multimodal Low-Rank Adaptation for MLLMs

Yake Wei,Yu Miao,Dongzhan Zhou,Di Hu

Main category: cs.CV

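TL;DR: This paper proposes Multimodal low-rank Adaptation (MokA), which addresses the limitations of current multimodal fine-tuning methods by combining modality-specific parameters with enhanced cross-modal interaction for more efficient fine-tuning of multimodal models.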

Details

Motivation: Existing efficient multimodal fine-tuning methods are mostly borrowed directly from LLMs and ignore the intrinsic differences of multimodal scenarios, so the modalities are not fully utilized. This paper aims to address that. Method: Proposes MokA, which compresses unimodal information with modality-specific parameters and explicitly enhances cross-modal interaction, achieving joint unimodal and cross-modal adaptation. Result: Experiments across several multimodal scenarios and LLM backbones show that MokA significantly improves performance, demonstrating its efficacy and versatility. Conclusion: MokA provides a targeted solution for efficient adaptation of multimodal models and lays groundwork for future research. Abstract: In this paper, we reveal that most current efficient multimodal fine-tuning methods are hindered by a key limitation: they are directly borrowed from LLMs, often neglecting the intrinsic differences of multimodal scenarios and even affecting the full utilization of all modalities. Inspired by our empirical observation, we argue that unimodal adaptation and cross-modal adaptation are two essential parts for the effective fine-tuning of MLLMs. From this perspective, we propose Multimodal low-rank Adaptation (MokA), a multimodal-aware efficient fine-tuning strategy that takes multimodal characteristics into consideration. It compresses unimodal information by modality-specific parameters while explicitly enhancing cross-modal interaction, ensuring both unimodal and cross-modal adaptation. Extensive experiments cover three representative multimodal scenarios (audio-visual-text, visual-text, and speech-text), and multiple LLM backbones (LLaMA2/3, Qwen2, Qwen2.5-VL, etc.). Consistent improvements indicate the efficacy and versatility of the proposed method. Ablation studies and efficiency evaluation are also conducted to fully assess our method. Overall, we think MokA provides a more targeted solution for efficient adaptation of MLLMs, paving the way for further exploration. The project page is at https://gewu-lab.github.io/MokA.

[88] Vision-Based Autonomous MM-Wave Reflector Using ArUco-Driven Angle-of-Arrival Estimation

Josue Marroquin,Nan Inzali,Miles Dillon Lantz,Campbell Freeman,Amod Ashtekar,Ajinkya Umesh Mulik,Mohammed E Eltayeb

Main category: cs.CV

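TL;DR: This paper presents a vision-aided autonomous reflector system that dynamically steers a metallic plate to reflect mmWave signals, significantly improving communication performance under non-line-of-sight conditions.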

Details

Motivation: Reliable mmWave communication under non-line-of-sight conditions is a major challenge for both military and civilian use, especially in urban or infrastructure-limited environments. Method: The system uses a monocular camera to detect ArUco markers, estimates angles of arrival, and adjusts the reflection direction in real time with a motorized metallic plate, enabling selective beam coverage. Result: Experiments at 60 GHz show a 23 dB average gain in received signal strength and a 0.89 probability of keeping signal reception above -65 dB in an indoor environment. Conclusion: The system shows strong adaptability and robustness for mmWave communication in complex, dynamic environments. Abstract: Reliable millimeter-wave (mmWave) communication in non-line-of-sight (NLoS) conditions remains a major challenge for both military and civilian operations, especially in urban or infrastructure-limited environments. This paper presents a vision-aided autonomous reflector system designed to enhance mmWave link performance by dynamically steering signal reflections using a motorized metallic plate. The proposed system leverages a monocular camera to detect ArUco markers on allied transmitter and receiver nodes, estimate their angles of arrival, and align the reflector in real time for optimal signal redirection. This approach enables selective beam coverage by serving only authenticated targets with visible markers and reduces the risk of unintended signal exposure. The designed prototype, built on a Raspberry Pi 4 and low-power hardware, operates autonomously without reliance on external infrastructure or GPS. Experimental results at 60 GHz demonstrate a 23 dB average gain in received signal strength and a 0.89 probability of maintaining signal reception above a target threshold of -65 dB in an indoor environment, far exceeding the static and no-reflector baselines. These results demonstrate the system's potential for resilient and adaptive mmWave connectivity in complex and dynamic environments.
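
The geometry behind camera-based angle estimation and reflector alignment can be sketched as follows. This is a simplified 2-D illustration, not the authors' implementation: it assumes a pinhole camera co-located with the plate's pivot, far-field nodes, and specular reflection, so the plate normal should bisect the bearings to the transmitter and receiver. The intrinsics (`fx`, `cx`) and marker pixel columns are hypothetical values.

```python
import math

def pixel_to_angle_deg(u, cx, fx):
    """Horizontal bearing (degrees) of pixel column u under a pinhole model."""
    return math.degrees(math.atan2(u - cx, fx))

def reflector_normal_deg(theta_tx_deg, theta_rx_deg):
    """Bearing of the plate normal that bisects the TX and RX bearings.

    Valid as a plain average when both bearings lie within the camera's
    field of view, i.e. strictly between -90 and 90 degrees.
    """
    return 0.5 * (theta_tx_deg + theta_rx_deg)

# Hypothetical camera intrinsics and detected marker centre columns.
fx, cx = 600.0, 320.0
theta_tx = pixel_to_angle_deg(120.0, cx, fx)   # marker left of centre
theta_rx = pixel_to_angle_deg(520.0, cx, fx)   # marker right of centre
normal = reflector_normal_deg(theta_tx, theta_rx)
```

With markers placed symmetrically about the image centre, the two bearings cancel and the plate faces straight ahead.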

[89] Quantifying Cross-Modality Memorization in Vision-Language Models

Yuxin Wen,Yangsibo Huang,Tom Goldstein,Ravi Kumar,Badih Ghazi,Chiyuan Zhang

Main category: cs.CV

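TL;DR: The study examines cross-modality memorization in multimodal models, uses a synthetic dataset to quantify knowledge memorization and cross-modal transfer, finds a significant gap in transfer between modalities, and proposes a mitigation method.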

Details

Motivation: Understanding what neural networks memorize during training matters for both privacy protection and knowledge acquisition, yet prior work focuses on single modalities and the cross-modality memorization behavior of multimodal models remains unclear. Method: Introduces a synthetic persona dataset, trains on a single modality and evaluates on the other to quantify memorization and transfer, and analyzes behavior across scenarios. Result: Knowledge transfers between modalities, but a significant gap exists between the source and target modalities, and this gap persists across a range of scenarios. Conclusion: The study offers a new perspective on multimodal learning, and the proposed baseline method may help advance more robust cross-modal transfer techniques. Abstract: Understanding what and how neural networks memorize during training is crucial, both from the perspective of unintentional memorization of potentially sensitive information and from the standpoint of effective knowledge acquisition for real-world, knowledge-intensive tasks. While previous studies primarily investigate memorization within a single modality, such as text memorization in large language models or image memorization in diffusion models, unified multimodal models are becoming increasingly prevalent in practical applications. In this work, we focus on the unique characteristics of cross-modality memorization and conduct a systematic study centered on vision-language models. To facilitate controlled experiments, we first introduce a synthetic persona dataset comprising diverse synthetic person images and textual descriptions. We quantify factual knowledge memorization and cross-modal transferability by training models on a single modality and evaluating their performance in the other. Our results reveal that facts learned in one modality transfer to the other, but a significant gap exists between recalling information in the source and target modalities. Furthermore, we observe that this gap exists across various scenarios, including more capable models, machine unlearning, and the multi-hop case. Finally, we propose a baseline method to mitigate this challenge. We hope our study can inspire future research on developing more robust multimodal learning techniques to enhance cross-modal transferability.

[90] Grounding Beyond Detection: Enhancing Contextual Understanding in Embodied 3D Grounding

Yani Zhang,Dongming Wu,Hao Shi,Yingfei Liu,Tiancai Wang,Haoqiang Fan,Xingping Dong

Main category: cs.CV

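TL;DR: DEGround shares DETR queries as the object representation for both detection and grounding, significantly improving embodied 3D grounding and outperforming existing methods.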

Details

Motivation: Asks whether embodied 3D grounding benefits enough from detection; existing methods depend on detection models yet gain little from them, so improvement is needed. Method: Proposes the DEGround framework, which shares DETR queries as the object representation and adds a regional activation grounding module and a query-wise modulation module. Result: On the EmbodiedScan validation set, DEGround's overall accuracy is 7.52% higher than BIP3D's. Conclusion: By combining detection and grounding, DEGround significantly improves embodied 3D grounding; the code will be open-sourced. Abstract: Embodied 3D grounding aims to localize target objects described in human instructions from an ego-centric viewpoint. Most methods typically follow a two-stage paradigm where a trained 3D detector's optimized backbone parameters are used to initialize a grounding model. In this study, we explore a fundamental question: Does embodied 3D grounding benefit enough from detection? To answer this question, we assess the grounding performance of detection models using predicted boxes filtered by the target category. Surprisingly, these detection models without any instruction-specific training outperform the grounding models explicitly trained with language instructions. This indicates that even category-level embodied 3D grounding may not be well resolved, let alone more fine-grained context-aware grounding. Motivated by this finding, we propose DEGround, which shares DETR queries as object representation for both DEtection and Grounding and enables the grounding to benefit from basic category classification and box detection. Based on this framework, we further introduce a regional activation grounding module that highlights instruction-related regions and a query-wise modulation module that incorporates sentence-level semantics into the query representation, strengthening the context-aware understanding of language instructions. Remarkably, DEGround outperforms the state-of-the-art model BIP3D by 7.52% in overall accuracy on the EmbodiedScan validation set. The source code will be publicly available at https://github.com/zyn213/DEGround.
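
The detection-only baseline described in the abstract (predicted boxes filtered by the target category) might look like the sketch below. The abstract does not say how a single box is chosen among the filtered ones, so picking the highest-scoring match is an assumption, and the detection record layout (`category`, `score`, `box`) is hypothetical.

```python
def ground_by_category(detections, target_category):
    """Return the highest-scoring detection of the target category, or None.

    Uses no language instruction beyond the category name, mirroring the
    category-level baseline; the tie-breaking rule is an assumption.
    """
    matches = [d for d in detections if d["category"] == target_category]
    return max(matches, key=lambda d: d["score"]) if matches else None

# Hypothetical detector output: 3D boxes as (x, y, z, w, l, h).
detections = [
    {"category": "chair", "score": 0.62, "box": (1.0, 2.0, 0.5, 0.6, 0.6, 1.0)},
    {"category": "table", "score": 0.91, "box": (3.0, 1.0, 0.4, 1.5, 0.8, 0.8)},
    {"category": "chair", "score": 0.88, "box": (4.0, 2.5, 0.5, 0.6, 0.6, 1.0)},
]
picked = ground_by_category(detections, "chair")
```

A baseline like this ignores context such as "the chair next to the window", which is why its strength relative to trained grounding models is a notable finding.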

[91] OGGSplat: Open Gaussian Growing for Generalizable Reconstruction with Expanded Field-of-View

Yanbo Wang,Ziyi Wang,Wenzhao Zheng,Jie Zhou,Jiwen Lu

Main category: cs.CV

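TL;DR: OGGSplat is an open Gaussian growing method that expands the field of view through a semantically consistent inpainting module, enabling 3D scene reconstruction from sparse views.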

Details

Motivation: Demand for 3D scene reconstruction from sparse views is pressing, but existing methods reconstruct regions outside the field of view poorly and are computationally expensive. Method: Exploits the semantic attributes of open Gaussians together with an RGB-semantic consistent inpainting module, expanding the field of view through bidirectionally controlled diffusion models. Result: OGGSplat performs well in both semantic and generative quality and supports two-view reconstruction from smartphone captures. Conclusion: OGGSplat offers an efficient, generalizable solution for semantic-aware 3D reconstruction from sparse views. Abstract: Reconstructing semantic-aware 3D scenes from sparse views is a challenging yet essential research direction, driven by the demands of emerging applications such as virtual reality and embodied AI. Existing per-scene optimization methods require dense input views and incur high computational costs, while generalizable approaches often struggle to reconstruct regions outside the input view cone. In this paper, we propose OGGSplat, an open Gaussian growing method that expands the field-of-view in generalizable 3D reconstruction. Our key insight is that the semantic attributes of open Gaussians provide strong priors for image extrapolation, enabling both semantic consistency and visual plausibility. Specifically, once open Gaussians are initialized from sparse views, we introduce an RGB-semantic consistent inpainting module applied to selected rendered views. This module enforces bidirectional control between an image diffusion model and a semantic diffusion model. The inpainted regions are then lifted back into 3D space for efficient and progressive Gaussian parameter optimization. To evaluate our method, we establish a Gaussian Outpainting (GO) benchmark that assesses both semantic and generative quality of reconstructed open-vocabulary scenes. OGGSplat also demonstrates promising semantic-aware scene reconstruction capabilities when provided with two view images captured directly from a smartphone camera.

[92] Follow-Your-Motion: Video Motion Transfer via Efficient Spatial-Temporal Decoupled Finetuning

Yue Ma,Yulong Liu,Qiyuan Zhu,Ayden Yang,Kunyu Feng,Xinhua Zhang,Zhifeng Li,Sirui Han,Chenyang Qi,Qifeng Chen

Main category: cs.CV

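TL;DR: The paper proposes Follow-Your-Motion, an efficient video motion transfer framework that uses a spatial-temporal decoupled LoRA and an improved training strategy to address the motion inconsistency and tuning inefficiency of existing methods.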

Details

Motivation: Existing LoRA-based motion transfer methods suffer from motion inconsistency and tuning inefficiency when applied to large video diffusion transformers. Method: Proposes a spatial-temporal decoupled LoRA that decouples the attention architecture to handle spatial appearance and temporal motion separately, and designs sparse motion sampling and adaptive RoPE in the second training stage to speed up tuning. Result: Evaluations on the proposed MotionBench benchmark verify the method's superiority. Conclusion: Follow-Your-Motion performs well in both motion consistency and tuning efficiency, offering an efficient solution for video motion transfer. Abstract: Recently, breakthroughs in the video diffusion transformer have shown remarkable capabilities in diverse motion generations. As for the motion-transfer task, current methods mainly use two-stage Low-Rank Adaptations (LoRAs) finetuning to obtain better performance. However, existing adaptation-based motion transfer still suffers from motion inconsistency and tuning inefficiency when applied to large video diffusion transformers. Naive two-stage LoRA tuning struggles to maintain motion consistency between generated and input videos due to the inherent spatial-temporal coupling in the 3D attention operator. Additionally, they require time-consuming fine-tuning processes in both stages. To tackle these issues, we propose Follow-Your-Motion, an efficient two-stage video motion transfer framework that finetunes a powerful video diffusion transformer to synthesize complex motion. Specifically, we propose a spatial-temporal decoupled LoRA to decouple the attention architecture for spatial appearance and temporal motion processing. During the second training stage, we design the sparse motion sampling and adaptive RoPE to accelerate the tuning speed. To address the lack of a benchmark for this field, we introduce MotionBench, a comprehensive benchmark comprising diverse motion, including creative camera motion, single object motion, multiple object motion, and complex human motion. We show extensive evaluations on MotionBench to verify the superiority of Follow-Your-Motion.

[93] Towards Vision-Language-Garment Models For Web Knowledge Garment Understanding and Generation

Jan Ackermann,Kiyohiro Nakayama,Guandao Yang,Tong Wu,Gordon Wetzstein

Main category: cs.CV

TL;DR: The VLG model generates garments by combining visual and language information; preliminary experiments show good zero-shot generalization.

Details

Motivation: Explores how well multimodal foundation models transfer knowledge to specialized domains such as garment generation. Method: Proposes VLG, a model that generates garments from textual descriptions and visual imagery, and tests its zero-shot generalization. Result: Preliminary results show that VLG has promise on unseen garment styles and prompts. Conclusion: Multimodal foundation models have the potential to adapt to specialized domains such as fashion design. Abstract: Multimodal foundation models have demonstrated strong generalization, yet their ability to transfer knowledge to specialized domains such as garment generation remains underexplored. We introduce VLG, a vision-language-garment model that synthesizes garments from textual descriptions and visual imagery. Our experiments assess VLG's zero-shot generalization, investigating its ability to transfer web-scale reasoning to unseen garment styles and prompts. Preliminary results indicate promising transfer capabilities, highlighting the potential for multimodal foundation models to adapt effectively to specialized domains like fashion design.

[94] DSG-World: Learning a 3D Gaussian World Model from Dual State Videos

Wenhao Hu,Xuexiang Wen,Xi Li,Gaoang Wang

Main category: cs.CV

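TL;DR: DSG-World is an end-to-end framework based on dual-state observations that explicitly builds a 3D Gaussian world model, addressing the training difficulty and lack of physical consistency in existing methods.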

Details

Motivation: Building an efficient and physically consistent world model from limited observations is a long-standing challenge; existing methods are either hard to train or lack 3D or physical consistency. Method: Uses dual-state observations of the same scene under different object configurations to build dual segmentation-aware Gaussian fields, and improves stability through bidirectional photometric and semantic consistency. Result: DSG-World generalizes strongly to novel views and scene states and supports high-fidelity rendering and object-level scene manipulation. Conclusion: The method provides an efficient and consistent solution for real-world 3D reconstruction and simulation. Abstract: Building an efficient and physically consistent world model from limited observations is a long-standing challenge in vision and robotics. Many existing world modeling pipelines are based on implicit generative models, which are hard to train and often lack 3D or physical consistency. On the other hand, explicit 3D methods built from a single state often require multi-stage processing (such as segmentation, background completion, and inpainting) due to occlusions. To address this, we leverage two perturbed observations of the same scene under different object configurations. These dual states offer complementary visibility, alleviating occlusion issues during state transitions and enabling more stable and complete reconstruction. In this paper, we present DSG-World, a novel end-to-end framework that explicitly constructs a 3D Gaussian World model from Dual State observations. Our approach builds dual segmentation-aware Gaussian fields and enforces bidirectional photometric and semantic consistency. We further introduce a pseudo intermediate state for symmetric alignment and design collaborative co-pruning strategies to refine geometric completeness. DSG-World enables efficient real-to-simulation transfer purely in the explicit Gaussian representation space, supporting high-fidelity rendering and object-level scene manipulation without relying on dense observations or multi-stage pipelines. Extensive experiments demonstrate strong generalization to novel views and scene states, highlighting the effectiveness of our approach for real-world 3D reconstruction and simulation.

[95] MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm

Zhang Li,Yuliang Liu,Qiang Liu,Zhiyin Ma,Ziyang Zhang,Shuo Zhang,Zidun Guo,Jiarui Zhang,Xinyu Wang,Xiang Bai

Main category: cs.CV

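TL;DR: MonkeyOCR is a vision-language model for document parsing built on the SRR (Structure-Recognition-Relation) triplet paradigm; by simplifying the pipeline and processing efficiently, it outperforms existing methods in both accuracy and speed.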

Details

Motivation: Existing document parsing methods (modular pipelines or large end-to-end models) are complex and inefficient; MonkeyOCR aims to solve these problems with the SRR paradigm. Method: Adopts the SRR triplet paradigm, decomposing document parsing into three core problems (structure, recognition, and relation) that correspond to layout analysis, content identification, and logical ordering. Result: MonkeyOCR outperforms MinerU by 5.1% on average, with notable gains on formulas and tables, and runs faster. Conclusion: With the SRR paradigm, MonkeyOCR achieves efficient and accurate document parsing, and its 3B-parameter model surpasses much larger models in both performance and deployment efficiency. Abstract: We introduce MonkeyOCR, a vision-language model for document parsing that advances the state of the art by leveraging a Structure-Recognition-Relation (SRR) triplet paradigm. This design simplifies what would otherwise be a complex multi-tool pipeline (as in MinerU's modular approach) and avoids the inefficiencies of processing full pages with giant end-to-end models (e.g., large multimodal LLMs like Qwen-VL). In SRR, document parsing is abstracted into three fundamental questions - "Where is it?" (structure), "What is it?" (recognition), and "How is it organized?" (relation) - corresponding to layout analysis, content identification, and logical ordering. This focused decomposition balances accuracy and speed: it enables efficient, scalable processing without sacrificing precision. To train and evaluate this approach, we introduce MonkeyDoc (the most comprehensive document parsing dataset to date), with 3.9 million instances spanning over ten document types in both Chinese and English. Experiments show that MonkeyOCR outperforms MinerU by an average of 5.1%, with particularly notable improvements on challenging content such as formulas (+15.0%) and tables (+8.6%). Remarkably, our 3B-parameter model surpasses much larger and top-performing models, including Qwen2.5-VL (72B) and Gemini 2.5 Pro, achieving state-of-the-art average performance on English document parsing tasks. In addition, MonkeyOCR processes multi-page documents significantly faster (0.84 pages per second compared to 0.65 for MinerU and 0.12 for Qwen2.5-VL-7B). The 3B model can be efficiently deployed for inference on a single NVIDIA 3090 GPU.
Code and models will be released at https://github.com/Yuliang-Liu/MonkeyOCR.

[96] SAM-aware Test-time Adaptation for Universal Medical Image Segmentation

Jianghao Wu,Yicheng Wu,Yutong Xie,Wenjia Bai,You Zhang,Feilong Tang,Yulong Li,Yasmeen George,Imran Razzak

Main category: cs.CV

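TL;DR: SAM-TTA is a new test-time adaptation framework that improves SAM's performance on medical image segmentation while preserving its generalization ability.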

Details

Motivation: Addresses SAM's limited adaptability to medical image segmentation while avoiding the reduced generalization of existing adaptations such as MedSAM. Method: Proposes the SAM-TTA framework with two key components: SBCT, which adaptively transforms the input, and DUMT, which aligns semantics through consistency learning. Result: On five public datasets, SAM-TTA outperforms existing TTA methods and in some scenarios even surpasses fully fine-tuned models. Conclusion: SAM-TTA offers a new paradigm for universal medical image segmentation that combines high performance with generalization. Abstract: Universal medical image segmentation using the Segment Anything Model (SAM) remains challenging due to its limited adaptability to medical domains. Existing adaptations, such as MedSAM, enhance SAM's performance in medical imaging but at the cost of reduced generalization to unseen data. Therefore, in this paper, we propose SAM-aware Test-Time Adaptation (SAM-TTA), a fundamentally different pipeline that preserves the generalization of SAM while improving its segmentation performance in medical imaging via a test-time framework. SAM-TTA tackles two key challenges: (1) input-level discrepancies caused by differences in image acquisition between natural and medical images and (2) semantic-level discrepancies due to fundamental differences in object definition between natural and medical domains (e.g., clear boundaries vs. ambiguous structures). Specifically, our SAM-TTA framework comprises (1) Self-adaptive Bezier Curve-based Transformation (SBCT), which adaptively converts single-channel medical images into three-channel SAM-compatible inputs while maintaining structural integrity, to mitigate the input gap between medical and natural images, and (2) Dual-scale Uncertainty-driven Mean Teacher adaptation (DUMT), which employs consistency learning to align SAM's internal representations to medical semantics, enabling efficient adaptation without auxiliary supervision or expensive retraining. Extensive experiments on five public datasets demonstrate that our SAM-TTA outperforms existing TTA approaches and even surpasses fully fine-tuned models such as MedSAM in certain scenarios, establishing a new paradigm for universal medical image segmentation. Code can be found at https://github.com/JianghaoWu/SAM-TTA.

[97] Spatiotemporal Contrastive Learning for Cross-View Video Localization in Unstructured Off-road Terrains

Zhiyun Deng,Dongmyeong Lee,Amanda Adkins,Jesse Quattrociocchi,Christian Ellis,Joydeep Biswas

Main category: cs.CV

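TL;DR: MoViX is a self-supervised cross-view video localization framework for GPS-denied environments with seasonal change; it learns season-invariant representations and combines motion information with a lightweight temporal aggregator to achieve accurate localization.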

Details

Motivation: In GPS-denied off-road environments, repetitive vegetation, unstructured terrain, and seasonal change make visual localization difficult, and traditional methods struggle to align with outdated satellite imagery. Method: MoViX uses a pose-dependent positive sampling strategy and temporally aligned hard negative mining, together with a motion-informed frame sampler and a lightweight temporal aggregator, to learn cross-view matching representations. Result: On TartanDrive 2.0, trained on under 30 minutes of data and tested over 12.29 km, MoViX localizes within 25 meters 93% of the time and within 50 meters 100% of the time, outperforming existing methods. Conclusion: MoViX performs well in complex environments without environment-specific tuning and generalizes to different geographic regions and robot platforms. Abstract: Robust cross-view 3-DoF localization in GPS-denied, off-road environments remains challenging due to (1) perceptual ambiguities from repetitive vegetation and unstructured terrain, and (2) seasonal shifts that significantly alter scene appearance, hindering alignment with outdated satellite imagery. To address this, we introduce MoViX, a self-supervised cross-view video localization framework that learns viewpoint- and season-invariant representations while preserving directional awareness essential for accurate localization. MoViX employs a pose-dependent positive sampling strategy to enhance directional discrimination and temporally aligned hard negative mining to discourage shortcut learning from seasonal cues. A motion-informed frame sampler selects spatially diverse frames, and a lightweight temporal aggregator emphasizes geometrically aligned observations while downweighting ambiguous ones. At inference, MoViX runs within a Monte Carlo Localization framework, using a learned cross-view matching module in place of handcrafted models. Entropy-guided temperature scaling enables robust multi-hypothesis tracking and confident convergence under visual ambiguity. We evaluate MoViX on the TartanDrive 2.0 dataset, training on under 30 minutes of data and testing over 12.29 km. Despite outdated satellite imagery, MoViX localizes within 25 meters of ground truth 93% of the time, and within 50 meters 100% of the time in unseen regions, outperforming state-of-the-art baselines without environment-specific tuning. We further demonstrate generalization on a real-world off-road dataset from a geographically distinct site with a different robot platform.

[98] LeanPO: Lean Preference Optimization for Likelihood Alignment in Video-LLMs

Xiaodong Wang,Jinfa Huang,Li Yuan,Peixi Peng

Main category: cs.CV

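TL;DR: The paper proposes LeanPO, which redefines the reward and adds a dynamic label smoothing strategy to counter the rise in non-target response probability caused by preference alignment in Video-LLMs.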

Details

Motivation: Preference alignment techniques used by existing Video-LLMs (such as DPO) cause the likelihoods of both the winning and the losing response to decrease during training, inadvertently raising the probability of non-target responses. Method: Proposes LeanPO, which redefines the reward as the average likelihood of the response under the policy model and adds a dynamic label smoothing strategy to reduce the impact of noise. Result: Experiments show that LeanPO significantly improves Video-LLM performance with little additional training overhead. Conclusion: LeanPO offers a simple and effective preference alignment solution for Video-LLMs, improving model reliability and efficiency. Abstract: Most Video Large Language Models (Video-LLMs) adopt preference alignment techniques, e.g., DPO (Rafailov et al.), to optimize the reward margin between a winning response ($y_w$) and a losing response ($y_l$). However, the likelihood displacement observed in DPO indicates that both $\log \pi_\theta (y_w\mid x)$ and $\log \pi_\theta (y_l\mid x)$ often decrease during training, inadvertently boosting the probabilities of non-target responses. In this paper, we systematically revisit this phenomenon from LLMs to Video-LLMs, showing that it intensifies when dealing with the redundant complexity of video content. To alleviate the impact of this phenomenon, we propose \emph{Lean Preference Optimization} (LeanPO), a reference-free approach that reformulates the implicit reward as the average likelihood of the response with respect to the policy model. A key component of LeanPO is the reward-trustworthiness correlated self-generated preference data pipeline, which carefully infuses relevant prior knowledge into the model while continuously refining the preference data via self-reflection. This allows the policy model to obtain high-quality paired data and accurately estimate the newly defined reward, thus mitigating the unintended drop. In addition, we introduce a dynamic label smoothing strategy that mitigates the impact of noise in responses from diverse video content, preventing the model from overfitting to spurious details. Extensive experiments demonstrate that LeanPO significantly enhances the performance of state-of-the-art Video-LLMs, consistently boosting baselines of varying capacities with minimal additional training overhead.
Moreover, LeanPO offers a simple yet effective solution for aligning Video-LLM preferences with human trustworthiness, paving the way toward reliable and efficient Video-LLMs.
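
To make the "average likelihood as reward" idea concrete, here is a minimal sketch of a reference-free preference loss with label smoothing. It is not LeanPO's published objective: the sigmoid-margin form, the scale `beta`, and the fixed smoothing weight `eps` are assumptions for illustration (the paper's smoothing is dynamic, and its schedule is not given in the abstract).

```python
import math

def avg_log_likelihood(token_logprobs):
    """Length-normalized log-likelihood of a response under the policy."""
    return sum(token_logprobs) / len(token_logprobs)

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def smoothed_preference_loss(logp_w, logp_l, beta=2.0, eps=0.1):
    """Label-smoothed loss on the margin between average-likelihood rewards.

    logp_w / logp_l: per-token log-probabilities of the winning and losing
    responses. beta and eps are hypothetical hyperparameters.
    """
    margin = beta * (avg_log_likelihood(logp_w) - avg_log_likelihood(logp_l))
    return -(1.0 - eps) * log_sigmoid(margin) - eps * log_sigmoid(-margin)

# Hypothetical per-token log-probabilities for one preference pair.
loss = smoothed_preference_loss([-0.2, -0.4, -0.3], [-1.1, -0.9])
```

Because the reward is a per-token average, a long response is not penalized merely for having more tokens, and no reference model is needed to compute it.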

[99] Can Foundation Models Generalise the Presentation Attack Detection Capabilities on ID Cards?

Juan E. Tapia,Christoph Busch

Main category: cs.CV

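TL;DR: The study explores how foundation models (FM) can improve the generalization of presentation attack detection (PAD) on ID cards, especially for ID cards from unseen countries.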

Details

Motivation: Because of privacy restrictions, current PAD systems are trained on only a few ID card types, so they generalize poorly and fall short of commercial requirements. Method: Uses zero-shot and fine-tuning protocols on two ID card datasets: a private Chilean dataset and an open dataset covering Finland, Spain, and Slovakia. Result: Bona fide images are found to be the key to improving generalization. Conclusion: Foundation models show potential for improving the generalization of ID card PAD, and bona fide image data is especially important. Abstract: Nowadays, one of the main challenges in presentation attack detection (PAD) on ID cards is obtaining generalisation capabilities for a diversity of countries that are issuing ID cards. Most PAD systems are trained on one, two, or three ID documents because of privacy protection concerns. As a result, they do not obtain competitive results for commercial purposes when tested on ID cards from an unknown new country. In this scenario, Foundation Models (FM) trained on huge datasets can help to improve generalisation capabilities. This work intends to improve and benchmark the capabilities of FM and how to use them to adapt the generalisation on PAD of ID Documents. Different test protocols were used, considering zero-shot and fine-tuning and two different ID card datasets: one private dataset based on Chilean IDs and one open-set dataset based on three ID-issuing countries (Finland, Spain, and Slovakia). Our findings indicate that bona fide images are the key to generalisation.

[100] From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos

Animesh Gupta,Jay Parmar,Ishan Rajendrakumar Dave,Mubarak Shah

Main category: cs.CV

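TL;DR: TF-CoVR is a new benchmark for temporally fine-grained composed video retrieval that uses an LLM to generate query-modification pairs; the accompanying TF-CoVR-Base framework improves retrieval performance.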

Details

Motivation: Existing CoVR benchmarks cannot capture subtle, fast-paced temporal differences, which limits practical use. Method: Proposes the TF-CoVR benchmark and the TF-CoVR-Base framework, which pre-trains a video encoder and uses contrastive learning to align queries with candidate videos. Result: TF-CoVR-Base improves retrieval in both zero-shot and fine-tuned settings (mAP@50 from 5.92 to 7.51, and from 19.83 to 25.82, respectively). Conclusion: TF-CoVR provides a new benchmark for temporally fine-grained video retrieval, and TF-CoVR-Base significantly improves performance. Abstract: Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. Existing CoVR benchmarks emphasize appearance shifts or coarse event changes and therefore do not test the ability to capture subtle, fast-paced temporal differences. We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR. TF-CoVR focuses on gymnastics and diving and provides 180K triplets drawn from FineGym and FineDiving. Previous CoVR benchmarks focusing on the temporal aspect link each query to a single target segment taken from the same video, limiting practical usefulness. In TF-CoVR, we instead construct each pair by prompting an LLM with the label differences between clips drawn from different videos; every pair is thus associated with multiple valid target videos (3.9 on average), reflecting real-world tasks such as sports-highlight generation. To model these temporal dynamics we propose TF-CoVR-Base, a concise two-stage training framework: (i) pre-train a video encoder on fine-grained action classification to obtain temporally discriminative embeddings; (ii) align the composed query with candidate videos using contrastive learning. We conduct the first comprehensive study of image, video, and general multimodal embedding (GME) models on temporally fine-grained composed retrieval in both zero-shot and fine-tuning regimes. On TF-CoVR, TF-CoVR-Base improves zero-shot mAP@50 from 5.92 (LanguageBind) to 7.51, and after fine-tuning raises the state-of-the-art from 19.83 to 25.82.
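
Since each TF-CoVR query has several valid targets, a ranking metric such as mAP@50 is a natural fit. The sketch below shows one common way to compute mAP@K; the normalization by `min(k, number of relevant items)` is an assumption, as conventions differ and the abstract does not specify the benchmark's exact definition. The video ids are hypothetical.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k=50):
    """AP@k for one query with possibly several valid targets."""
    relevant = set(relevant_ids)
    hits, precision_sum = 0, 0.0
    for rank, vid in enumerate(ranked_ids[:k], start=1):
        if vid in relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this hit
    denom = min(k, len(relevant))
    return precision_sum / denom if denom else 0.0

def mean_ap_at_k(rankings, ground_truth, k=50):
    """Mean of AP@k over all queries (dicts keyed by query id)."""
    aps = [average_precision_at_k(rankings[q], ground_truth[q], k)
           for q in rankings]
    return sum(aps) / len(aps)

# Hypothetical query: targets "a" and "c" retrieved at ranks 1 and 3.
ap = average_precision_at_k(["a", "b", "c", "d"], ["a", "c"], k=50)
```

Here the precisions at the two hits are 1/1 and 2/3, so AP is their mean, 5/6; benchmark tables typically report this value multiplied by 100.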

[101] Unifying Appearance Codes and Bilateral Grids for Driving Scene Gaussian Splatting

Nan Wang,Yuantao Chen,Lixing Xiao,Weiqing Xiao,Bohan Li,Zhaoxi Chen,Chongjie Ye,Shaocong Xu,Saining Zhang,Ziyang Yan,Pierre Merriaux,Lei Lei,Tianfan Xue,Hao Zhao

Main category: cs.CV

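TL;DR: Proposes a multi-scale bilateral grid that unifies appearance codes and bilateral grids, significantly improving geometric accuracy in dynamic autonomous driving scene reconstruction.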

Details

Motivation: Addresses the drop in reconstruction quality caused by photometric inconsistency in real-world scenes; the existing remedies (appearance codes and bilateral grids) each have limitations. Method: Proposes a multi-scale bilateral grid that combines the strengths of appearance codes and bilateral grids to improve geometric reconstruction. Result: Performs strongly on four datasets (Waymo, NuScenes, Argoverse, and PandaSet), with significantly improved geometric accuracy. Conclusion: The multi-scale bilateral grid effectively reduces floaters caused by photometric inconsistency, which matters for obstacle avoidance and control in autonomous driving. Abstract: Neural rendering techniques, including NeRF and Gaussian Splatting (GS), rely on photometric consistency to produce high-quality reconstructions. However, in real-world scenarios, it is challenging to guarantee perfect photometric consistency in acquired images. Appearance codes have been widely used to address this issue, but their modeling capability is limited, as a single code is applied to the entire image. Recently, the bilateral grid was introduced to perform pixel-wise color mapping, but it is difficult to optimize and constrain effectively. In this paper, we propose a novel multi-scale bilateral grid that unifies appearance codes and bilateral grids. We demonstrate that this approach significantly improves geometric accuracy in dynamic, decoupled autonomous driving scene reconstruction, outperforming both appearance codes and bilateral grids. This is crucial for autonomous driving, where accurate geometry is important for obstacle avoidance and control. Our method shows strong results across four datasets: Waymo, NuScenes, Argoverse, and PandaSet. We further demonstrate that the improvement in geometry is driven by the multi-scale bilateral grid, which effectively reduces floaters caused by photometric inconsistency.

[102] Rectified Point Flow: Generic Point Cloud Pose Estimation

Tao Sun,Liyuan Zhu,Shengyu Huang,Shuran Song,Iro Armeni

Main category: cs.CV

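TL;DR: Rectified Point Flow is a unified parameterization that treats point cloud registration and multi-part shape assembly as a single conditional generative problem; it recovers target positions by learning a continuous point-wise velocity field and learns assembly symmetries without symmetry labels.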

Details

Motivation: Addresses the challenges of symmetry handling and joint training on diverse datasets in point cloud registration and shape assembly. Method: Uses a self-supervised encoder and a learned continuous point-wise velocity field to transport noisy points to their target positions, from which part poses are recovered. Result: Achieves state-of-the-art performance on six benchmarks; joint training improves the learning of shared geometric priors and accuracy. Conclusion: Rectified Point Flow provides an efficient, unified solution that significantly improves point cloud registration and shape assembly. Abstract: We introduce Rectified Point Flow, a unified parameterization that formulates pairwise point cloud registration and multi-part shape assembly as a single conditional generative problem. Given unposed point clouds, our method learns a continuous point-wise velocity field that transports noisy points toward their target positions, from which part poses are recovered. In contrast to prior work that regresses part-wise poses with ad-hoc symmetry handling, our method intrinsically learns assembly symmetries without symmetry labels. Together with a self-supervised encoder focused on overlapping points, our method achieves a new state-of-the-art performance on six benchmarks spanning pairwise registration and shape assembly. Notably, our unified formulation enables effective joint training on diverse datasets, facilitating the learning of shared geometric priors and consequently boosting accuracy. Project page: https://rectified-pointflow.github.io/.
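
The idea of transporting noisy points along a velocity field can be sketched with forward Euler integration. This is a toy illustration under stated assumptions, not the paper's model: the learned network is replaced by a hypothetical `oracle_velocity` that knows the targets and returns the ideal straight-line field, and the point coordinates are made up.

```python
def euler_transport(points, velocity_fn, steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with forward Euler."""
    dt = 1.0 / steps
    pts = [list(p) for p in points]
    for i in range(steps):
        t = i * dt
        vels = velocity_fn(pts, t)
        pts = [[x + dt * v for x, v in zip(p, vel)]
               for p, vel in zip(pts, vels)]
    return pts

# Hypothetical target positions for two points.
targets = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]

def oracle_velocity(pts, t):
    """Stand-in for a learned field: straight-line velocity (x1 - x_t) / (1 - t)."""
    return [[(g - x) / (1.0 - t) for x, g in zip(p, tgt)]
            for p, tgt in zip(pts, targets)]

noisy = [[0.3, -0.7, 0.2], [-1.0, 0.5, 0.9]]
moved = euler_transport(noisy, oracle_velocity, steps=10)
```

On straight-line paths the velocity is constant, so Euler integration reaches the targets up to floating-point error; a learned field is only approximately straight, and part poses would then be fitted to the transported points in a separate step not shown here.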

[103] Video World Models with Long-term Spatial Memory

Tong Wu,Shuai Yang,Ryan Po,Yinghao Xu,Ziwei Liu,Dahua Lin,Gordon Wetzstein

Main category: cs.CV

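TL;DR: The paper proposes a new framework built on geometry-grounded long-term spatial memory to improve the long-term consistency of video world models, addressing the scene forgetting caused by limited temporal context windows.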

Details

Motivation: Because of limited temporal context windows, existing video world models struggle to keep scenes consistent while generating frames and tend to forget previously generated environments on revisits. Method: Inspired by human memory mechanisms, introduces a geometry-grounded long-term spatial memory framework with mechanisms for storing and retrieving information, and uses custom datasets to train and evaluate models with explicit 3D memory. Result: Evaluations show improved generation quality, consistency, and context length compared to relevant baselines. Conclusion: The framework opens a new path toward long-term consistent video world generation. Abstract: Emerging world models autoregressively generate video frames in response to actions, such as camera movements and text prompts, among other control signals. Due to limited temporal context window sizes, these models often struggle to maintain scene consistency during revisits, leading to severe forgetting of previously generated environments. Inspired by the mechanisms of human memory, we introduce a novel framework to enhance long-term consistency of video world models through a geometry-grounded long-term spatial memory. Our framework includes mechanisms to store and retrieve information from the long-term spatial memory and we curate custom datasets to train and evaluate world models with explicitly stored 3D memory mechanisms. Our evaluations show improved quality, consistency, and context length compared to relevant baselines, paving the way towards long-term consistent world generation.

[104] RaySt3R: Predicting Novel Depth Maps for Zero-Shot Object Completion

Bardienus P. Duisterhof,Jan Oberst,Bowen Wen,Stan Birchfield,Deva Ramanan,Jeffrey Ichnowski

Main category: cs.CV

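TL;DR: RaySt3R recasts 3D shape completion as a novel view synthesis problem: from a single RGB-D image and a set of query rays it predicts depth maps, object masks, and confidence scores, yielding efficient and consistent 3D reconstruction.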

Details

Motivation: Existing 3D shape completion methods lack consistency, are computationally expensive, and struggle to capture sharp boundaries; RaySt3R aims to address these problems. Method: Given a single RGB-D image and query rays, a feedforward transformer predicts depth maps, object masks, and confidence scores, and the multi-view predictions are fused to complete the 3D reconstruction. Result: Strong performance on synthetic and real-world datasets, improving 3D chamfer distance over baselines by up to 44%. Conclusion: By treating completion as novel view synthesis, RaySt3R achieves efficient, consistent 3D shape completion and clearly outperforms existing methods. Abstract: 3D shape completion has broad applications in robotics, digital twin reconstruction, and extended reality (XR). Although recent advances in 3D object and scene completion have achieved impressive results, existing methods lack 3D consistency, are computationally expensive, and struggle to capture sharp object boundaries. Our work (RaySt3R) addresses these limitations by recasting 3D shape completion as a novel view synthesis problem. Specifically, given a single RGB-D image and a novel viewpoint (encoded as a collection of query rays), we train a feedforward transformer to predict depth maps, object masks, and per-pixel confidence scores for those query rays. RaySt3R fuses these predictions across multiple query views to reconstruct complete 3D shapes. We evaluate RaySt3R on synthetic and real-world datasets, and observe that it achieves state-of-the-art performance, outperforming the baselines on all datasets by up to 44% in 3D chamfer distance. Project page: https://rayst3r.github.io
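
For readers unfamiliar with the metric quoted above, here is a minimal sketch of a symmetric chamfer distance between two point sets. Conventions differ between papers (squared versus unsquared distances, sums versus means); this version assumes unsquared Euclidean distances averaged per direction, which may not match the exact variant used in the RaySt3R evaluation.

```python
import math
from typing import Sequence, Tuple

Point3 = Tuple[float, float, float]


def chamfer_distance(a: Sequence[Point3], b: Sequence[Point3]) -> float:
    """Symmetric chamfer distance: mean nearest-neighbor distance from a to b
    plus mean nearest-neighbor distance from b to a (brute force, O(|a||b|))."""
    def one_sided(src: Sequence[Point3], dst: Sequence[Point3]) -> float:
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_sided(a, b) + one_sided(b, a)


predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0)]
score = chamfer_distance(predicted, reference)  # 0.5 under this convention
```

The brute-force search is fine for small sets; large point clouds would normally use a spatial index such as a k-d tree.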

[105] Stable Vision Concept Transformers for Medical Diagnosis

Lijie Hu,Songning Lai,Yuan Hua,Shu Yang,Jingfeng Zhang,Di Wang

Main category: cs.CV

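TL;DR: The paper proposes the VCT and SVCT models to address the performance and stability problems of existing concept bottleneck models in the medical domain.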

Details

Motivation: The medical domain needs transparent and stable explainable-AI methods; existing concept bottleneck models (CBMs) rely solely on concept features, overlook the intrinsic features of medical images, and are sensitive to input perturbations. Method: Proposes the Vision Concept Transformer (VCT) and the Stable Vision Concept Transformer (SVCT), which combine a vision transformer with a conceptual layer, fuse concept and image features, and improve stability through Denoised Diffusion Smoothing. Result: On four medical datasets, VCT and SVCT remain interpretable while maintaining accuracy, and SVCT still provides stable explanations under perturbations. Conclusion: VCT and SVCT address the limitations of CBMs and provide an effective, trustworthy explainable-AI solution for the medical domain. Abstract: Transparency is a paramount concern in the medical field, prompting researchers to delve into the realm of explainable AI (XAI). Among these XAI methods, Concept Bottleneck Models (CBMs) aim to restrict the model's latent space to human-understandable high-level concepts by generating a conceptual layer for extracting conceptual features, which has drawn much attention recently. However, existing methods rely solely on concept features to determine the model's predictions, which overlooks the intrinsic feature embeddings within medical images. To address this utility gap between the original models and concept-based models, we propose Vision Concept Transformer (VCT). Furthermore, despite their benefits, CBMs have been found to negatively impact model performance and fail to provide stable explanations when faced with input perturbations, which limits their application in the medical field. To address this faithfulness issue, this paper further proposes the Stable Vision Concept Transformer (SVCT) based on VCT, which leverages the vision transformer (ViT) as its backbone and incorporates a conceptual layer. SVCT employs conceptual features to enhance decision-making capabilities by fusing them with image features and ensures model faithfulness through the integration of Denoised Diffusion Smoothing. Comprehensive experiments on four medical datasets demonstrate that our VCT and SVCT maintain accuracy while remaining interpretable compared to baselines. Furthermore, even when subjected to perturbations, our SVCT model consistently provides faithful explanations, thus meeting the needs of the medical field.

[106] EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an Egocentric World?

Yuqian Yuan,Ronghao Dang,Long Li,Wentong Li,Dian Jiao,Xin Li,Deli Zhao,Fan Wang,Wenqiao Zhang,Jun Xiao,Yueting Zhuang

Main category: cs.CV

TL;DR: EOC-Bench is a new benchmark for evaluating object cognition in dynamic egocentric scenarios, filling a gap left by existing benchmarks.

Details

Motivation: Existing benchmarks focus mainly on static scenes and neglect the dynamic changes caused by user interactions, so a new benchmark is needed to evaluate object cognition in dynamic scenes. Method: Builds EOC-Bench, with 3,277 annotated QA pairs covering 11 fine-grained evaluation dimensions and 3 visual object referencing types, together with a mixed-format annotation framework and a multi-scale temporal accuracy metric. Result: A comprehensive evaluation of a range of MLLMs; EOC-Bench serves as an important tool for advancing the embodied object cognition of MLLMs. Conclusion: EOC-Bench lays a solid foundation for developing reliable core models for embodied systems. Abstract: The emergence of multimodal large language models (MLLMs) has driven breakthroughs in egocentric vision applications. These applications necessitate persistent, context-aware understanding of objects, as users interact with tools in dynamic and cluttered environments. However, existing embodied benchmarks primarily focus on static scene exploration, emphasizing objects' appearance and spatial attributes while neglecting the assessment of dynamic changes arising from users' interactions. To address this gap, we introduce EOC-Bench, an innovative benchmark designed to systematically evaluate object-centric embodied cognition in dynamic egocentric scenarios. Specifically, EOC-Bench features 3,277 meticulously annotated QA pairs categorized into three temporal categories: Past, Present, and Future, covering 11 fine-grained evaluation dimensions and 3 visual object referencing types. To ensure thorough assessment, we develop a mixed-format human-in-the-loop annotation framework with four types of questions and design a novel multi-scale temporal accuracy metric for open-ended temporal evaluation. Based on EOC-Bench, we conduct comprehensive evaluations of various proprietary, open-source, and object-level MLLMs. EOC-Bench serves as a crucial tool for advancing the embodied object cognitive capabilities of MLLMs, establishing a robust foundation for developing reliable core models for embodied systems.

[107] AliTok: Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model

Pingyu Wu,Kai Zhu,Yu Liu,Longxiang Tang,Jian Yang,Yansong Peng,Wei Zhai,Yang Cao,Zheng-Jun Zha

Main category: cs.CV

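TL;DR: Proposes a new Aligned Tokenizer (AliTok) that uses unidirectional dependencies to improve autoregressive image generation, significantly improving generation quality and efficiency.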

Details

Motivation: Existing image tokenizers introduce bidirectional dependencies during compression, which hinders effective modeling by autoregressive models. Method: Uses a causal decoder to establish unidirectional dependencies, combined with prefix tokens and two-stage tokenizer training, to improve reconstruction consistency and generation-friendliness. Result: On the ImageNet-256 benchmark, AliTok reaches a gFID of 1.50 and an IS of 305.9 with 177M parameters; with 662M parameters it reaches a gFID of 1.35, surpassing the state-of-the-art diffusion method while sampling 10x faster. Conclusion: By aligning the tokenizer with the autoregressive model, AliTok significantly improves the performance and efficiency of image generation. Abstract: Autoregressive image generation aims to predict the next token based on previous ones. However, existing image tokenizers encode tokens with bidirectional dependencies during the compression process, which hinders the effective modeling by autoregressive models. In this paper, we propose a novel Aligned Tokenizer (AliTok), which utilizes a causal decoder to establish unidirectional dependencies among encoded tokens, thereby aligning the token modeling approach between the tokenizer and autoregressive model. Furthermore, by incorporating prefix tokens and employing two-stage tokenizer training to enhance reconstruction consistency, AliTok achieves great reconstruction performance while being generation-friendly. On the ImageNet-256 benchmark, using a standard decoder-only autoregressive model as the generator with only 177M parameters, AliTok achieves a gFID score of 1.50 and an IS of 305.9. When the parameter count is increased to 662M, AliTok achieves a gFID score of 1.35, surpassing the state-of-the-art diffusion method with 10x faster sampling speed. The code and weights are available at https://github.com/ali-vilab/alitok.
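
The notion of unidirectional dependencies can be made concrete with a generic causal attention mask, in which token i may attend only to tokens 0..i. This is a textbook illustration, not AliTok's decoder; `causal_mask` and `masked_softmax` are hypothetical helper names.

```python
import math
from typing import List


def causal_mask(n: int) -> List[List[bool]]:
    """Lower-triangular mask: entry [i][j] is True when token i may see token j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores: List[List[float]],
                   mask: List[List[bool]]) -> List[List[float]]:
    """Row-wise softmax over the allowed positions; masked positions get weight 0."""
    weights = []
    for row, allowed in zip(scores, mask):
        peak = max(s for s, m in zip(row, allowed) if m)  # for numerical stability
        exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# With uniform scores, each token spreads its attention evenly over its past.
uniform_scores = [[0.0] * 3 for _ in range(3)]
attention = masked_softmax(uniform_scores, causal_mask(3))
```

Because no token receives information from later positions, a next-token predictor trained on such tokens sees the same dependency structure at generation time that the tokens were encoded with.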

[108] SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training

Jianyi Wang,Shanchuan Lin,Zhijie Lin,Yuxi Ren,Meng Wei,Zongsheng Yue,Shangchen Zhou,Hao Chen,Yang Zhao,Ceyuan Yang,Xuefeng Xiao,Chen Change Loy,Lu Jiang

Main category: cs.CV

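TL;DR: SeedVR2 is a one-step diffusion-based video restoration model that uses adversarial training and a dynamic window attention mechanism to cut computational cost substantially while performing well on high-resolution video restoration.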

Details

Motivation: Although diffusion models have made notable progress in video restoration, their inference cost is prohibitive, and existing methods are hard to extend to high-resolution video restoration. Method: Proposes SeedVR2, which uses adversarial training and an attention mechanism with dynamically adjusted window size, and introduces a feature matching loss to stabilize training. Result: Experiments show that SeedVR2, with single-step inference, performs comparably to or better than existing methods. Conclusion: Through efficient one-step inference, SeedVR2 offers a practical solution for high-resolution video restoration. Abstract: Recent advances in diffusion-based video restoration (VR) demonstrate significant improvement in visual quality, yet incur a prohibitive computational cost during inference. While several distillation-based approaches have exhibited the potential of one-step image restoration, extending existing approaches to VR remains challenging and underexplored, particularly when dealing with high-resolution video in real-world settings. In this work, we propose a one-step diffusion-based VR model, termed SeedVR2, which performs adversarial VR training against real data. To handle the challenging high-resolution VR within a single step, we introduce several enhancements to both model architecture and training procedures. Specifically, an adaptive window attention mechanism is proposed, where the window size is dynamically adjusted to fit the output resolutions, avoiding window inconsistency observed under high-resolution VR using window attention with a predefined window size. To stabilize and improve the adversarial post-training towards VR, we further verify the effectiveness of a series of losses, including a proposed feature matching loss without significantly sacrificing training efficiency. Extensive experiments show that SeedVR2 can achieve comparable or even better performance compared with existing VR approaches in a single step.

[109] Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos

Weifeng Lin,Xinyu Wei,Ruichuan An,Tianhe Ren,Tingwei Chen,Renrui Zhang,Ziyu Guo,Wentao Zhang,Lei Zhang,Hongsheng Li

Main category: cs.CV

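TL;DR: PAM is an efficient region-level visual understanding framework that combines SAM 2 with LLMs to deliver object segmentation alongside diverse semantic outputs.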

Details

Motivation: Improve the comprehensiveness and efficiency of region-level visual understanding and provide a lightweight solution for practical applications. Method: Integrates SAM 2 with LLMs, introduces a Semantic Perceiver to transform visual features, and develops a data augmentation pipeline. Result: PAM performs strongly across a range of tasks while running faster and using less memory. Conclusion: PAM provides a strong baseline for future research on region-level visual understanding. Abstract: We present Perceive Anything Model (PAM), a conceptually straightforward and efficient framework for comprehensive region-level visual understanding in images and videos. Our approach extends the powerful segmentation model SAM 2 by integrating Large Language Models (LLMs), enabling simultaneous object segmentation with the generation of diverse, region-specific semantic outputs, including categories, label definitions, functional explanations, and detailed captions. A key component, Semantic Perceiver, is introduced to efficiently transform SAM 2's rich visual features, which inherently carry general vision, localization, and semantic priors, into multi-modal tokens for LLM comprehension. To support robust multi-granularity understanding, we also develop a dedicated data refinement and augmentation pipeline, yielding a high-quality dataset of 1.5M image and 0.6M video region-semantic annotations, including novel region-level streaming video caption data. PAM is designed to be lightweight and efficient, while also demonstrating strong performance across a diverse range of region understanding tasks. It runs 1.2-2.4x faster and consumes less GPU memory than prior approaches, offering a practical solution for real-world applications. We believe that our effective approach will serve as a strong baseline for future research in region-level visual understanding.

[110] Do It Yourself: Learning Semantic Correspondence from Pseudo-Labels

Olaf Dünkel,Thomas Wimmer,Christian Theobalt,Christian Rupprecht,Adam Kortylewski

Main category: cs.CV

TL;DR: The paper proposes improving semantic matching with 3D-aware pseudo-labels, yielding significant gains and a new state of the art on the SPair-71k dataset.

Details

Motivation: Resolve the ambiguities that symmetric objects or repeated parts cause in semantic matching, while reducing reliance on dataset-specific annotations. Method: Trains an adapter that refines off-the-shelf features using pseudo-labels generated by 3D-aware chaining, filtering wrong labels with relaxed cyclic consistency and 3D spherical prototype mapping constraints. Result: An absolute gain of over 4% on SPair-71k, and over 7% against methods with similar supervision requirements. Conclusion: The approach is general and extends easily to other data sources, offering a more efficient solution for semantic matching. Abstract: Finding correspondences between semantically similar points across images and object instances is one of the everlasting challenges in computer vision. While large pre-trained vision models have recently been demonstrated as effective priors for semantic matching, they still suffer from ambiguities for symmetric objects or repeated object parts. We propose to improve semantic correspondence estimation via 3D-aware pseudo-labeling. Specifically, we train an adapter to refine off-the-shelf features using pseudo-labels obtained via 3D-aware chaining, filtering wrong labels through relaxed cyclic consistency, and 3D spherical prototype mapping constraints. While reducing the need for dataset-specific annotations compared to prior work, we set a new state-of-the-art on SPair-71k by over 4% absolute gain and by over 7% against methods with similar supervision requirements. The generality of our proposed approach simplifies extension of training to other data sources, which we demonstrate in our experiments.
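
One way to picture a cyclic consistency filter is the sketch below: a match from image A to image B is kept only if matching back from B lands near the starting keypoint. Interpreting "relaxed" as a distance tolerance is an assumption made here for illustration, and the function name and data layout are hypothetical rather than taken from the paper.

```python
import math
from typing import Dict, Sequence, Tuple

Point2 = Tuple[float, float]


def cycle_consistent_matches(points_a: Sequence[Point2],
                             forward: Dict[int, int],
                             backward: Dict[int, int],
                             tol: float = 0.0) -> Dict[int, int]:
    """Keep match i -> j when the return match j -> i' ends within `tol`
    of keypoint i in image A. tol=0 is the strict check; tol>0 relaxes it."""
    kept = {}
    for i, j in forward.items():
        back = backward.get(j)
        if back is None:
            continue  # no return match, so the cycle cannot be verified
        if math.dist(points_a[i], points_a[back]) <= tol:
            kept[i] = j
    return kept


keypoints_a = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
a_to_b = {0: 0, 1: 1, 2: 2}
b_to_a = {0: 0, 1: 0, 2: 0}  # every return match lands on keypoint 0
strict = cycle_consistent_matches(keypoints_a, a_to_b, b_to_a, tol=0.0)
relaxed = cycle_consistent_matches(keypoints_a, a_to_b, b_to_a, tol=1.5)
```

The tolerance trades label quantity for label purity: a larger value keeps more pseudo-labels but admits more near-miss matches.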

[111] MARBLE: Material Recomposition and Blending in CLIP-Space

Ta-Ying Cheng,Prafull Sharma,Mark Boss,Varun Jampani

Main category: cs.CV

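TL;DR: MARBLE proposes a method based on material embeddings in CLIP-space and pre-trained text-to-image models for editing the materials of objects in images and controlling fine-grained attributes.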

Details

Motivation: Improve exemplar-based material editing to enable fine-grained control over material attributes and multi-attribute edits. Method: Finds material embeddings in CLIP-space, exploits the block in the denoising UNet responsible for material attribution, and uses a shallow network to predict the direction of a material attribute change. Result: Qualitative and quantitative analyses show MARBLE is effective at material blending and fine-grained attribute control, and supports multiple edits in a single forward pass as well as application to paintings. Conclusion: MARBLE provides an efficient and flexible approach to material editing suited to a variety of practical applications. Abstract: Editing materials of objects in images based on exemplar images is an active area of research in computer vision and graphics. We propose MARBLE, a method for performing material blending and recomposing fine-grained material properties by finding material embeddings in CLIP-space and using them to control pre-trained text-to-image models. We improve exemplar-based material editing by finding a block in the denoising UNet responsible for material attribution. Given two material exemplar-images, we find directions in the CLIP-space for blending the materials. Further, we can achieve parametric control over fine-grained material attributes such as roughness, metallic, transparency, and glow using a shallow network to predict the direction for the desired material attribute change. We perform qualitative and quantitative analysis to demonstrate the efficacy of our proposed method. We also present the ability of our method to perform multiple edits in a single forward pass and applicability to painting. Project Page: https://marblecontrol.github.io/

[112] ProJo4D: Progressive Joint Optimization for Sparse-View Inverse Physics Estimation

Daniel Rho,Jun Myeong Choi,Biswadip Dey,Roni Sengupta

Main category: cs.CV

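TL;DR: ProJo4D proposes a progressive joint optimization framework for estimating physical parameters from sparse multi-view videos, addressing the error accumulation that existing methods suffer with sparse inputs.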

Details

Motivation: Existing methods perform poorly with sparse multi-view video input, causing error accumulation that limits the accuracy of physical parameter estimation. Method: A progressive joint optimization strategy gradually enlarges the set of optimized parameters, ending in joint optimization over geometry, appearance, physical state, and material properties. Result: On the PAC-NeRF and Spring-Gaus datasets, ProJo4D outperforms existing methods in 4D future state prediction, novel view rendering of future states, and material parameter estimation. Conclusion: ProJo4D provides an effective solution for physically grounded 4D scene understanding, applicable to digital twin creation in robotics and XR. Abstract: Neural rendering has made significant strides in 3D reconstruction and novel view synthesis. Integrated with physics, it opens up new applications. The inverse problem of estimating physics from visual data, however, still remains challenging, limiting its effectiveness for applications like physically accurate digital twin creation in robotics and XR. Existing methods that incorporate physics into neural rendering frameworks typically require dense multi-view videos as input, making them impractical for scalable, real-world use. When presented with sparse multi-view videos, the sequential optimization strategy used by existing approaches introduces significant error accumulation, e.g., poor initial 3D reconstruction leads to bad material parameter estimation in subsequent stages. Instead of sequential optimization, directly optimizing all parameters at the same time also fails due to the highly non-convex and often non-differentiable nature of the problem. We propose ProJo4D, a progressive joint optimization framework that gradually increases the set of jointly optimized parameters guided by their sensitivity, leading to fully joint optimization over geometry, appearance, physical state, and material property. Evaluations on PAC-NeRF and Spring-Gaus datasets show that ProJo4D outperforms prior work in 4D future state prediction, novel view rendering of future state, and material parameter estimation, demonstrating its effectiveness in physically grounded 4D scene understanding. For demos, please visit the project webpage: https://daniel03c1.github.io/ProJo4D/

[113] Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs

Haoyuan Li,Yanpeng Zhou,Yufei Gao,Tao Tang,Jianhua Han,Yujie Yuan,Dave Zhenyu Chen,Jiawang Bian,Hang Xu,Xiaodan Liang

Main category: cs.CV

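TL;DR: The paper examines the performance gap among 3D vision-language models (VLMs), finds that 3D scene-centric models rely too little on the 3D encoder, and proposes a new dataset to improve 3D understanding.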

Details

Motivation: Investigate the performance gap among 3D VLMs, in particular why 3D scene-centric models underperform, in order to promote more genuine 3D scene understanding. Method: Categorizes 3D VLMs by encoder design (3D object-centric, 2D image-based, and 3D scene-centric), analyzes the performance differences, and proposes a new dataset, 3D Relevance Discrimination QA. Result: 3D scene-centric models rely only weakly on the 3D encoder, pre-training is less effective than in 2D VLMs, and the benefits of data scaling are not pronounced. Conclusion: More advanced evaluation and improved strategies are needed to strengthen 3D scene understanding in 3D VLMs. Abstract: Remarkable progress in 2D Vision-Language Models (VLMs) has spurred interest in extending them to 3D settings for tasks like 3D Question Answering, Dense Captioning, and Visual Grounding. Unlike 2D VLMs that typically process images through an image encoder, 3D scenes, with their intricate spatial structures, allow for diverse model architectures. Based on their encoder design, this paper categorizes recent 3D VLMs into 3D object-centric, 2D image-based, and 3D scene-centric approaches. Despite the architectural similarity of 3D scene-centric VLMs to their 2D counterparts, they have exhibited lower performance than the latest 3D object-centric and 2D image-based approaches. To understand this gap, we conduct an in-depth analysis, revealing that 3D scene-centric VLMs show limited reliance on the 3D scene encoder, and the pre-train stage appears less effective than in 2D VLMs. Furthermore, we observe that data scaling benefits are less pronounced on larger datasets. Our investigation suggests that while these models possess cross-modal alignment capabilities, they tend to over-rely on linguistic cues and overfit to frequent answer distributions, thereby diminishing the effective utilization of the 3D encoder. To address these limitations and encourage genuine 3D scene understanding, we introduce a novel 3D Relevance Discrimination QA dataset designed to disrupt shortcut learning and improve 3D understanding. Our findings highlight the need for advanced evaluation and improved strategies for better 3D understanding in 3D VLMs.

[114] Revisiting Depth Representations for Feed-Forward 3D Gaussian Splatting

Duochao Shi,Weijie Wang,Donny Y. Chen,Zeyu Zhang,Jia-Wang Bian,Bohan Zhuang,Chunhua Shen

Main category: cs.CV

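TL;DR: The paper proposes a pointmap-based regularization loss (PM-Loss) to address geometric discontinuities at object boundaries in depth maps, improving the rendering quality of 3D Gaussian Splatting (3DGS).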

Details

Motivation: Depth maps are commonly used in 3DGS to produce 3D point clouds, but depth discontinuities at object boundaries lead to sparse or fragmented point clouds and degrade rendering quality. Method: Introduces PM-Loss, a regularization loss that uses a pointmap predicted by a pre-trained transformer to enforce geometric smoothness. Result: The improved depth maps significantly improve 3DGS rendering across a variety of architectures and scenes. Conclusion: PM-Loss effectively addresses the limitations of depth maps at object boundaries and improves 3DGS rendering quality. Abstract: Depth maps are widely used in feed-forward 3D Gaussian Splatting (3DGS) pipelines by unprojecting them into 3D point clouds for novel view synthesis. This approach offers advantages such as efficient training, the use of known camera poses, and accurate geometry estimation. However, depth discontinuities at object boundaries often lead to fragmented or sparse point clouds, degrading rendering quality -- a well-known limitation of depth-based representations. To tackle this issue, we introduce PM-Loss, a novel regularization loss based on a pointmap predicted by a pre-trained transformer. Although the pointmap itself may be less accurate than the depth map, it effectively enforces geometric smoothness, especially around object boundaries. With the improved depth map, our method significantly improves the feed-forward 3DGS across various architectures and scenes, delivering consistently better rendering results. Our project page: https://aim-uofa.github.io/PMLoss
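
The unprojection step mentioned in the abstract follows the standard pinhole camera model. The sketch below assumes camera-space output, intrinsics given as `fx`, `fy`, `cx`, `cy`, and that non-positive depths mark invalid pixels; these conventions are illustrative and not taken from the paper's code.

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def unproject_depth(depth: Sequence[Sequence[float]],
                    fx: float, fy: float, cx: float, cy: float) -> List[Point3]:
    """Lift a depth map to camera-space 3D points with the pinhole model:
    X = (u - cx) * d / fx, Y = (v - cy) * d / fy, Z = d."""
    points = []
    for v, row in enumerate(depth):      # v indexes image rows (y axis)
        for u, d in enumerate(row):      # u indexes image columns (x axis)
            if d <= 0.0:
                continue                 # treat non-positive depth as invalid
            points.append(((u - cx) * d / fx, (v - cy) * d / fy, d))
    return points


toy_depth = [[1.0, 2.0],
             [0.0, 4.0]]
cloud = unproject_depth(toy_depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

Neighboring pixels that straddle an object boundary can have very different depths, so their unprojected points land far apart, which is the source of the fragmented point clouds the abstract describes.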

[115] AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs

Lidong Lu,Guo Chen,Zhiqi Li,Yicheng Liu,Tong Lu

Main category: cs.CV

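TL;DR: The paper introduces the CG-AV-Counting benchmark and the AV-Reasoner model, addressing limitations of video counting tasks and improving performance through reinforcement learning.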

Details

Motivation: Existing video understanding models perform poorly on counting tasks, and benchmarks suffer from short videos, closed-set queries, and missing clue annotations. Method: Introduces the CG-AV-Counting benchmark and the AV-Reasoner model, which combines GRPO and curriculum learning to improve counting ability. Result: AV-Reasoner achieves state-of-the-art performance on multiple benchmarks, but reasoning in the language space brings no gains on out-of-domain benchmarks. Conclusion: CG-AV-Counting provides a comprehensive testbed for counting, and AV-Reasoner demonstrates the effectiveness of reinforcement learning, though out-of-domain performance needs improvement. Abstract: Despite progress in video understanding, current MLLMs struggle with counting tasks. Existing benchmarks are limited by short videos, closed-set queries, lack of clue annotations, and weak multimodal coverage. In this paper, we introduce CG-AV-Counting, a manually-annotated clue-grounded counting benchmark with 1,027 multimodal questions and 5,845 annotated clues over 497 long videos. It supports both black-box and white-box evaluation, serving as a comprehensive testbed for both end-to-end and reasoning-based counting. To explore ways to improve models' counting capability, we propose AV-Reasoner, a model trained with GRPO and curriculum learning to generalize counting ability from related tasks. AV-Reasoner achieves state-of-the-art results across multiple benchmarks, demonstrating the effectiveness of reinforcement learning. However, experiments show that on out-of-domain benchmarks, reasoning in the language space fails to bring performance gains. The code and benchmark have been released at https://av-reasoner.github.io.

[116] MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning

Xinyan Chen,Renrui Zhang,Dongzhi Jiang,Aojun Zhou,Shilin Yan,Weifeng Lin,Hongsheng Li

Main category: cs.CV

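TL;DR: MINT-CoT proposes dynamically interleaving visual tokens to strengthen multimodal mathematical reasoning, significantly improving model performance on math problems.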

Details

Motivation: Existing approaches to extending Chain-of-Thought (CoT) to multimodal domains have limitations: reliance on coarse-grained image regions, vision encoders with limited perception of math content, and dependence on external capabilities for visual modification. Method: MINT-CoT uses an Interleave Token to dynamically select visual regions of any shape within math figures, and builds a dataset of 54K math problems. The model is trained with a three-stage strategy (text-only CoT SFT, interleaved CoT SFT, interleaved CoT RL). Result: MINT-CoT-7B outperforms the baseline model by 34.08% on MathVista, 28.78% on GeoQA, and 23.2% on MMStar. Conclusion: MINT-CoT performs strongly in multimodal mathematical reasoning and offers an effective solution for reasoning that combines vision and text. Abstract: Chain-of-Thought (CoT) has widely enhanced mathematical reasoning in Large Language Models (LLMs), but extending it to multimodal domains remains challenging. Existing works either adopt a similar textual reasoning for image input, or seek to interleave visual signals into mathematical CoT. However, they face three key limitations for math problem-solving: reliance on coarse-grained box-shaped image regions, limited perception of vision encoders on math content, and dependence on external capabilities for visual modification. In this paper, we propose MINT-CoT, introducing Mathematical INterleaved Tokens for Chain-of-Thought visual reasoning. MINT-CoT adaptively interleaves relevant visual tokens into textual reasoning steps via an Interleave Token, which dynamically selects visual regions of any shape within math figures. To empower this capability, we construct the MINT-CoT dataset, containing 54K mathematical problems aligning each reasoning step with visual regions at the token level, accompanied by a rigorous data generation pipeline. We further present a three-stage MINT-CoT training strategy, progressively combining text-only CoT SFT, interleaved CoT SFT, and interleaved CoT RL, which derives our MINT-CoT-7B model. Extensive experiments demonstrate the effectiveness of our method for visual interleaved reasoning in mathematical domains, where MINT-CoT-7B outperforms the baseline model by +34.08% on MathVista, +28.78% on GeoQA, and +23.2% on MMStar, respectively. Our code and data are available at https://github.com/xinyan-cxy/MINT-CoT

[117] Unleashing Hour-Scale Video Training for Long Video-Language Understanding

Jingyang Lin,Jialian Wu,Ximeng Sun,Ze Wang,Jiang Liu,Yusheng Su,Xiaodong Yu,Hao Chen,Jiebo Luo,Zicheng Liu,Emad Barsoum

Main category: cs.CV

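TL;DR: VideoMarathon is a large-scale long-video instruction-following dataset containing around 9,700 hours of long videos and 3.3M high-quality QA pairs and supporting 22 tasks. Building on it, the Hour-LLaVA model uses a memory augmentation module to enable hour-long video training and inference at 1-FPS sampling, and achieves the best performance on multiple long video-language benchmarks.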

Details

Motivation: Address the scarcity of annotated long videos and advance video large multimodal models (Video-LMMs). Method: Presents the VideoMarathon dataset of long videos and diverse QA pairs, and develops the Hour-LLaVA model, which uses a memory augmentation module for efficient training and inference. Result: Hour-LLaVA achieves the best performance on multiple long video-language benchmarks, validating both the dataset and the model. Conclusion: VideoMarathon fills a gap in long-video datasets, and Hour-LLaVA demonstrates superiority on long video understanding tasks. Abstract: Recent long-form video-language understanding benchmarks have driven progress in video large multimodal models (Video-LMMs). However, the scarcity of well-annotated long videos has left the training of hour-long Video-LLMs underexplored. To close this gap, we present VideoMarathon, a large-scale hour-long video instruction-following dataset. This dataset includes around 9,700 hours of long videos sourced from diverse domains, ranging from 3 to 60 minutes per video. Specifically, it contains 3.3M high-quality QA pairs, spanning six fundamental topics: temporality, spatiality, object, action, scene, and event. Compared to existing video instruction datasets, VideoMarathon significantly extends training video durations up to 1 hour, and supports 22 diverse tasks requiring both short- and long-term video comprehension. Building on VideoMarathon, we propose Hour-LLaVA, a powerful and efficient Video-LMM for hour-scale video-language modeling. It enables hour-long video training and inference at 1-FPS sampling by leveraging a memory augmentation module, which adaptively integrates user question-relevant and spatiotemporal-informative semantics from a cached full video context. In our experiments, Hour-LLaVA achieves the best performance on multiple long video-language benchmarks, demonstrating the high quality of the VideoMarathon dataset and the superiority of the Hour-LLaVA model.

[118] VideoMolmo: Spatio-Temporal Grounding Meets Pointing

Ghazi Shazan Ahmad,Ahmed Heakl,Hanan Gani,Abdelrahman Shaker,Zhiqiang Shen,Ranjay Krishna,Fahad Shahbaz Khan,Salman Khan

Main category: cs.CV

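TL;DR: VideoMolmo is a large multimodal model for fine-grained spatio-temporal localization conditioned on textual descriptions; it combines an LLM with a temporal module to improve accuracy and consistency.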

Details

Motivation: Current video localization methods lack contextual understanding and generalization; VideoMolmo aims to address this. Method: Combines the Molmo architecture with a temporal attention module, and uses bidirectional point propagation and mask fusion. Result: Substantially improves spatio-temporal localization accuracy and reasoning capability across multiple tasks. Conclusion: VideoMolmo provides an effective solution for complex spatio-temporal localization tasks, with code and models released publicly. Abstract: Spatio-temporal localization is vital for precise interactions across diverse domains, from biological research to autonomous navigation and interactive interfaces. Current video-based approaches, while proficient in tracking, lack the sophisticated reasoning capabilities of large language models, limiting their contextual understanding and generalization. We introduce VideoMolmo, a large multimodal model tailored for fine-grained spatio-temporal pointing conditioned on textual descriptions. Building upon the Molmo architecture, VideoMolmo incorporates a temporal module utilizing an attention mechanism to condition each frame on preceding frames, ensuring temporal consistency. Additionally, our novel temporal mask fusion pipeline employs SAM2 for bidirectional point propagation, significantly enhancing coherence across video sequences. This two-step decomposition, i.e., first using the LLM to generate precise pointing coordinates, then relying on a sequential mask-fusion module to produce coherent segmentation, not only simplifies the task for the language model but also enhances interpretability. Due to the lack of suitable datasets, we curate a comprehensive dataset comprising 72k video-caption pairs annotated with 100k object points. To evaluate the generalization of VideoMolmo, we introduce VPoS-Bench, a challenging out-of-distribution benchmark spanning five real-world scenarios: Cell Tracking, Egocentric Vision, Autonomous Driving, Video-GUI Interaction, and Robotics. We also evaluate our model on Referring Video Object Segmentation (Refer-VOS) and Reasoning VOS tasks. In comparison to existing models, VideoMolmo substantially improves spatio-temporal pointing accuracy and reasoning capability. Our code and models are publicly available at https://github.com/mbzuai-oryx/VideoMolmo.

[119] Defurnishing with X-Ray Vision: Joint Removal of Furniture from Panoramas and Mesh

Alan Dolhasz,Chen Ma,Dave Gausebeck,Kevin Chen,Gregor Miller,Lucas Hayne,Gunnar Hovden,Azwad Sabik,Olaf Brandt,Mira Slavcheva

Main category: cs.CV

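TL;DR: Proposes a pipeline that generates defurnished indoor scenes from textured meshes and multi-view panoramic images, achieving high-quality results via a simplified mesh and ControlNet inpainting.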

Details

Motivation: Existing methods such as neural radiance fields and RGB-D inpainting suffer from blur, low resolution, or hallucinations, so a higher-quality way to generate defurnished scenes is needed. Method: (1) segment and remove furniture to produce a simplified defurnished mesh (SDM); (2) extract Canny edges from the SDM; (3) inpaint the panoramic images with ControlNet; (4) retexture the mesh with the inpainted images. Result: Compared with neural radiance fields and RGB-D inpainting, the method produces higher-quality assets and avoids blur and hallucinations. Conclusion: The proposed pipeline effectively generates high-quality defurnished scenes and is suited to indoor space reconstruction. Abstract: We present a pipeline for generating defurnished replicas of indoor spaces represented as textured meshes and corresponding multi-view panoramic images. To achieve this, we first segment and remove furniture from the mesh representation, extend planes, and fill holes, obtaining a simplified defurnished mesh (SDM). This SDM acts as an "X-ray" of the scene's underlying structure, guiding the defurnishing process. We extract Canny edges from depth and normal images rendered from the SDM. We then use these as a guide to remove the furniture from panorama images via ControlNet inpainting. This control signal ensures the availability of global geometric information that may be hidden from a particular panoramic view by the furniture being removed. The inpainted panoramas are used to texture the mesh. We show that our approach produces higher quality assets than methods that rely on neural radiance fields, which tend to produce blurry low-resolution images, or RGB-D inpainting, which is highly susceptible to hallucinations.

[120] Direct Numerical Layout Generation for 3D Indoor Scene Synthesis via Spatial Reasoning

Xingjian Ran,Yixuan Li,Linning Xu,Mulin Yu,Bo Dai

Main category: cs.CV

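TL;DR: DirectLayout is a large language model-based framework that generates 3D indoor scene layouts directly from text descriptions, addressing the shortcomings of existing methods in open-vocabulary generation and alignment with fine-grained user instructions.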

Details

Motivation: 3D indoor scene synthesis is vital for AI and digital content creation, but existing methods for layout generation are held back by limited datasets and insufficient flexibility. Method: DirectLayout generates layouts in three stages (Bird's-Eye View generation, lifting into 3D space, and refinement of object placement), using Chain-of-Thought Activation and a generative layout reward to strengthen spatial reasoning. Result: Experiments show that DirectLayout performs strongly in semantic consistency, generalization, and physical plausibility. Conclusion: Through direct layout generation and iterative refinement, DirectLayout significantly improves the quality and flexibility of 3D scene synthesis. Abstract: Realistic 3D indoor scene synthesis is vital for embodied AI and digital content creation. It can be naturally divided into two subtasks: object generation and layout generation. While recent generative models have significantly advanced object-level quality and controllability, layout generation remains challenging due to limited datasets. Existing methods either overfit to these datasets or rely on predefined constraints to optimize numerical layouts, which sacrifices flexibility. As a result, they fail to generate scenes that are both open-vocabulary and aligned with fine-grained user instructions. We introduce DirectLayout, a framework that directly generates numerical 3D layouts from text descriptions using generalizable spatial reasoning of large language models (LLMs). DirectLayout decomposes the generation into three stages: producing a Bird's-Eye View (BEV) layout, lifting it into 3D space, and refining object placements. To enable explicit spatial reasoning and help the model grasp basic principles of object placement, we employ Chain-of-Thought (CoT) Activation based on the 3D-Front dataset. Additionally, we design CoT-Grounded Generative Layout Reward to enhance generalization and spatial planning. During inference, DirectLayout addresses asset-layout mismatches via Iterative Asset-Layout Alignment through in-context learning. Extensive experiments demonstrate that DirectLayout achieves impressive semantic consistency, generalization and physical plausibility.

[121] Refer to Anything with Vision-Language Prompts

Shengcao Cao,Zijun Wei,Jason Kuen,Kangning Liu,Lingzhi Zhang,Jiuxiang Gu,HyunJoon Jung,Liang-Yan Gui,Yu-Xiong Wang

Main category: cs.CV

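TL;DR: The paper introduces a new task, omnimodal referring expression segmentation (ORES), and proposes the RAS framework, which strengthens segmentation models through multimodal interaction.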

Details

Motivation: Existing image segmentation models cannot handle complex queries that require comprehensive semantic understanding of both language and vision, which limits their use in user-friendly interactions. Method: Proposes the RAS framework, which augments segmentation models with multimodal interaction and comprehension via a mask-centric large multimodal model. Result: RAS shows superior performance on the ORES task as well as the classic RES and GRES tasks. Conclusion: The RAS framework effectively addresses omnimodal referring expression segmentation and improves multimodal interaction. Abstract: Recent image segmentation models have advanced to segment images into high-quality masks for visual entities, and yet they cannot provide comprehensive semantic understanding for complex queries based on both language and vision. This limitation reduces their effectiveness in applications that require user-friendly interactions driven by vision-language prompts. To bridge this gap, we introduce a novel task of omnimodal referring expression segmentation (ORES). In this task, a model produces a group of masks based on arbitrary prompts specified by text only or text plus reference visual entities. To address this new challenge, we propose a novel framework to "Refer to Any Segmentation Mask Group" (RAS), which augments segmentation models with complex multimodal interactions and comprehension via a mask-centric large multimodal model. For training and benchmarking ORES models, we create datasets MaskGroups-2M and MaskGroups-HQ to include diverse mask groups specified by text and reference entities. Through extensive evaluation, we demonstrate superior performance of RAS on our new ORES task, as well as classic referring expression segmentation (RES) and generalized referring expression segmentation (GRES) tasks. Project page: https://Ref2Any.github.io.

[122] ContentV: Efficient Training of Video Generation Models with Limited Compute

Wenfeng Lin,Renjie Chen,Boyuan Liu,Shiyue Yan,Ruoyu Feng,Jiangchuan Wei,Yichen Zhang,Yimeng Zhou,Chao Feng,Jiao Ran,Qi Wu,Zuotao Liu,Mingyu Guo

Main category: cs.CV

TL;DR: ContentV is an 8B-parameter text-to-video model that achieves efficient training and high-quality video generation through three key innovations.

Details

Motivation: As video generation advances, computational costs rise sharply, calling for more efficient training methods. Method: (1) a minimalist architecture that reuses pre-trained image generation models; (2) a multi-stage training strategy that uses flow matching for efficiency; (3) a low-cost reinforcement learning framework that needs no additional human annotation. Result: The model reaches state-of-the-art performance of 85.14 on VBench and supports video generation at multiple resolutions and durations. Conclusion: ContentV demonstrates the potential of efficient training for high-performance video generation; the code and models are open source. Abstract: Recent advances in video generation demand increasingly efficient training recipes to mitigate escalating computational costs. In this report, we present ContentV, an 8B-parameter text-to-video model that achieves state-of-the-art performance (85.14 on VBench) after training on 256 x 64GB Neural Processing Units (NPUs) for merely four weeks. ContentV generates diverse, high-quality videos across multiple resolutions and durations from text prompts, enabled by three key innovations: (1) A minimalist architecture that maximizes reuse of pre-trained image generation models for video generation; (2) A systematic multi-stage training strategy leveraging flow matching for enhanced efficiency; and (3) A cost-effective reinforcement learning with human feedback framework that improves generation quality without requiring additional human annotations. All the code and models are available at: https://contentv.github.io.

[123] SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs

Jiahui Wang,Zuyan Liu,Yongming Rao,Jiwen Lu

Main category: cs.CV

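TL;DR: The study finds that only a small fraction of attention heads (about 5%) in multimodal large language models (MLLMs) contribute to visual understanding, proposes a training-free framework to identify these heads, and builds on it the SparseMM optimization strategy, which substantially improves inference efficiency.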

Details

Motivation: To explore how MLLMs process visual inputs and to expose the sparsity of their attention mechanisms in order to improve model efficiency. Method: A training-free framework quantifies the visual relevance of each attention head; the KV-Cache optimization strategy SparseMM then allocates asymmetric computation budgets across heads. Result: On mainstream multimodal benchmarks, SparseMM delivers 1.38x real-time acceleration and a 52% memory reduction while maintaining performance. Conclusion: The sparsity of visual attention heads can be exploited to make MLLM inference more efficient, and SparseMM offers a new approach to efficient multimodal inference. Abstract: Multimodal Large Language Models (MLLMs) are commonly derived by extending pre-trained Large Language Models (LLMs) with visual capabilities. In this work, we investigate how MLLMs process visual inputs by analyzing their attention mechanisms. We reveal a surprising sparsity phenomenon: only a small subset (less than approximately 5%) of attention heads in LLMs actively contribute to visual understanding, termed visual heads. To identify these heads efficiently, we design a training-free framework that quantifies head-level visual relevance through targeted response analysis. Building on this discovery, we introduce SparseMM, a KV-Cache optimization strategy that allocates asymmetric computation budgets to heads in LLMs based on their visual scores, leveraging the sparsity of visual heads for accelerating the inference of MLLMs. Compared with prior KV-Cache acceleration methods that ignore the particularity of visual information, SparseMM prioritizes stressing and retaining visual semantics during decoding. Extensive evaluations across mainstream multimodal benchmarks demonstrate that SparseMM achieves superior accuracy-efficiency trade-offs. Notably, SparseMM delivers 1.38x real-time acceleration and 52% memory reduction during generation while maintaining performance parity on efficiency test. Our project is open sourced at https://github.com/CR400AF-A/SparseMM.
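The score-based budget allocation described in the abstract can be illustrated with a minimal sketch. The abstract does not give SparseMM's actual allocation rule, so everything here is an assumption made for illustration: the function name `allocate_kv_budget`, the per-head floor `min_per_head`, and the proportional split of the remaining cache entries.

```python
import numpy as np

def allocate_kv_budget(visual_scores, total_budget, min_per_head=1):
    """Split a total KV-cache budget across heads by visual score.

    Hypothetical rule (not taken from the paper): every head keeps a small
    floor, and the remaining entries are shared in proportion to the scores.
    """
    scores = np.asarray(visual_scores, dtype=float)
    n_heads = scores.size
    floor_total = min_per_head * n_heads
    if total_budget < floor_total:
        raise ValueError("total_budget is too small for the per-head floor")

    remaining = total_budget - floor_total
    if scores.sum() > 0:
        weights = scores / scores.sum()
    else:
        weights = np.full(n_heads, 1.0 / n_heads)

    # Proportional share, rounded down to whole cache entries.
    extra = np.floor(weights * remaining).astype(int)

    # Entries lost to rounding go to the highest-scoring heads first.
    leftover = int(remaining - extra.sum())
    order = np.argsort(-scores, kind="stable")
    extra[order[:leftover]] += 1
    return extra + min_per_head

budgets = allocate_kv_budget([2.0, 1.0, 1.0, 0.0], total_budget=104)
print(budgets)
```

Heads with a zero visual score still keep the floor, so no head loses its cache entirely; whether the real method does this is not stated in the abstract.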

[124] Neural Inverse Rendering from Propagating Light

Anagh Malik,Benjamin Attal,Andrew Xie,Matthew O'Toole,David B. Lindell

Main category: cs.CV

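TL;DR: The first physically based neural inverse rendering system for multi-viewpoint videos of propagating light; using time-resolved neural radiance caching, it achieves high-accuracy 3D reconstruction and decomposition of light transport effects.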

Details

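Motivation: To address the challenge of 3D reconstruction under strong indirect light, and to explore new capabilities for decomposing and relighting propagating light. Method: Neural radiance caching is extended to the time-resolved domain and combined with multi-viewpoint video data to capture direct and indirect light transport effects. Result: The system achieves high-accuracy 3D reconstruction and supports view synthesis of propagating light, decomposition into direct and indirect components, and multi-view time-resolved relighting. Conclusion: The system provides a new tool for 3D reconstruction and light transport analysis in scenes with complex light transport, showing the potential of neural inverse rendering. Abstract: We present the first system for physically based, neural inverse rendering from multi-viewpoint videos of propagating light. Our approach relies on a time-resolved extension of neural radiance caching -- a technique that accelerates inverse rendering by storing infinite-bounce radiance arriving at any point from any direction. The resulting model accurately accounts for direct and indirect light transport effects and, when applied to captured measurements from a flash lidar system, enables state-of-the-art 3D reconstruction in the presence of strong indirect light. Further, we demonstrate view synthesis of propagating light, automatic decomposition of captured measurements into direct and indirect components, as well as novel capabilities such as multi-view time-resolved relighting of captured scenes.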

[125] FreeTimeGS: Free Gaussians at Anytime and Anywhere for Dynamic Scene Reconstruction

Yifan Wang,Peishan Yang,Zhen Xu,Jiaming Sun,Zhanhua Zhang,Yong Chen,Hujun Bao,Sida Peng,Xiaowei Zhou

Main category: cs.CV

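TL;DR: The paper proposes FreeTimeGS, a new 4D representation that tackles complex motion in dynamic 3D scene reconstruction by letting Gaussian primitives appear at arbitrary times and locations and giving each one a motion function, which substantially improves rendering quality.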

Details

Motivation: Existing methods perform poorly on dynamic 3D scenes with complex motion because deformation fields are hard to optimize. Method: FreeTimeGS is a flexible 4D representation that lets Gaussian primitives appear at arbitrary times and locations and gives each one a motion function to reduce temporal redundancy. Result: Experiments show that the method's rendering quality on several datasets is substantially better than that of existing methods. Conclusion: With a flexible 4D representation and motion functions, FreeTimeGS improves the ability to model dynamic 3D scenes. Abstract: This paper addresses the challenge of reconstructing dynamic 3D scenes with complex motions. Some recent works define 3D Gaussian primitives in the canonical space and use deformation fields to map canonical primitives to observation spaces, achieving real-time dynamic view synthesis. However, these methods often struggle to handle scenes with complex motions due to the difficulty of optimizing deformation fields. To overcome this problem, we propose FreeTimeGS, a novel 4D representation that allows Gaussian primitives to appear at arbitrary times and locations. In contrast to canonical Gaussian primitives, our representation possesses strong flexibility, thus improving the ability to model dynamic 3D scenes. In addition, we endow each Gaussian primitive with a motion function, allowing it to move to neighboring regions over time, which reduces the temporal redundancy. Experimental results on several datasets show that the rendering quality of our method outperforms recent methods by a large margin.

[126] VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos

Hanoona Rasheed,Abdelrahman Shaker,Anqi Tang,Muhammad Maaz,Ming-Hsuan Yang,Salman Khan,Fahad Khan

Main category: cs.CV

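TL;DR: VideoMathQA is a benchmark for evaluating cross-modal mathematical reasoning in videos; it spans 10 mathematical domains and emphasizes temporally extended, multimodal integration.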

Details

Motivation: Mathematical reasoning in real-world videos requires integrating visual, audio, and textual information, and existing methods struggle with such complex settings. Method: A high-quality benchmark is built from diverse questions covering direct problem solving, conceptual transfer, and deep instructional comprehension, with expert annotation. Result: The benchmark exposes the limitations of existing methods and provides a tool for fine-grained diagnosis of model capabilities. Conclusion: VideoMathQA provides a systematic evaluation framework for temporally extended, multimodal mathematical reasoning and advances research in this area. Abstract: Mathematical reasoning in real-world video settings presents a fundamentally different challenge than in static images or text. It requires interpreting fine-grained visual information, accurately reading handwritten or digital text, and integrating spoken cues, often dispersed non-linearly over time. In such multimodal contexts, success hinges not just on perception, but on selectively identifying and integrating the right contextual details from a rich and noisy stream of content. To this end, we introduce VideoMathQA, a benchmark designed to evaluate whether models can perform such temporally extended cross-modal reasoning on videos. The benchmark spans 10 diverse mathematical domains, covering videos ranging from 10 seconds to over 1 hour. It requires models to interpret structured visual content, understand instructional narratives, and jointly ground concepts across visual, audio, and textual modalities. We employ graduate-level experts to ensure high quality, totaling over 920 man-hours of annotation. To reflect real-world scenarios, questions are designed around three core reasoning challenges: direct problem solving, where answers are grounded in the presented question; conceptual transfer, which requires applying learned methods to new problems; and deep instructional comprehension, involving multi-step reasoning over extended explanations and partially worked-out solutions. Each question includes multi-step reasoning annotations, enabling fine-grained diagnosis of model capabilities. Through this benchmark, we highlight the limitations of existing approaches and establish a systematic evaluation framework for models that must reason, rather than merely perceive, across temporally extended and modality-rich mathematical problem settings.
Our benchmark and evaluation code are available at: https://mbzuai-oryx.github.io/VideoMathQA

[127] Contrastive Flow Matching

George Stoica,Vivek Ramanujan,Xiang Fan,Ali Farhadi,Ranjay Krishna,Judy Hoffman

Main category: cs.CV

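TL;DR: The paper proposes Contrastive Flow Matching, which addresses the non-uniqueness of flows in conditional settings and substantially improves generation quality and training efficiency.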

Details

Motivation: In conditional settings (such as class-conditioned models), the uniqueness of flows in flow matching is not guaranteed, which leads to ambiguous generations. Method: A contrastive flow matching objective is introduced that maximizes the dissimilarity between predicted flows of different sample pairs, strengthening condition separation. Result: Experiments show substantial gains from Contrastive Flow Matching in training speed, number of denoising steps, and FID. Conclusion: Contrastive Flow Matching is an effective improvement for conditional generation tasks. Abstract: Unconditional flow-matching trains diffusion models to transport samples from a source distribution to a target distribution by enforcing that the flows between sample pairs are unique. However, in conditional settings (e.g., class-conditioned models), this uniqueness is no longer guaranteed--flows from different conditions may overlap, leading to more ambiguous generations. We introduce Contrastive Flow Matching, an extension to the flow matching objective that explicitly enforces uniqueness across all conditional flows, enhancing condition separation. Our approach adds a contrastive objective that maximizes dissimilarities between predicted flows from arbitrary sample pairs. We validate Contrastive Flow Matching by conducting extensive experiments across varying model architectures on both class-conditioned (ImageNet-1k) and text-to-image (CC3M) benchmarks. Notably, we find that training models with Contrastive Flow Matching (1) improves training speed by a factor of up to 9x, (2) requires up to 5x fewer de-noising steps and (3) lowers FID by up to 8.9 compared to training the same models with flow matching. We release our code at: https://github.com/gstoica27/DeltaFM.git.
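One way to read the contrastive objective is as a flow matching loss with an added repulsion term. The abstract does not state the loss in closed form, so the sketch below rests on assumptions: a linear interpolation path with target velocity `x1 - x0`, a negative pair obtained by shifting the batch by one position, and a hypothetical weight `lam`.

```python
import numpy as np

def contrastive_fm_loss(v_pred, x0, x1, lam=0.05):
    """Flow matching loss minus a weighted contrastive term.

    v_pred : predicted velocities, shape (batch, dim)
    x0, x1 : source and target samples, shape (batch, dim)
    lam    : hypothetical weight of the contrastive term
    """
    # Standard flow matching target for a linear path from x0 to x1.
    target = x1 - x0
    # Target velocity of a *different* pair in the batch (assumed negative).
    negative = np.roll(target, shift=1, axis=0)

    fm_term = np.mean(np.sum((v_pred - target) ** 2, axis=1))
    contrast_term = np.mean(np.sum((v_pred - negative) ** 2, axis=1))

    # Pull predictions toward their own target, push them away from others.
    return fm_term - lam * contrast_term

x0 = np.zeros((2, 2))
x1 = np.array([[1.0, 0.0], [0.0, 1.0]])
print(contrastive_fm_loss(v_pred=x1 - x0, x0=x0, x1=x1))
```

With `lam=0` the function reduces to the plain flow matching loss, which makes the effect of the extra term easy to isolate in an ablation.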

cs.GR [Back]

[128] SSIMBaD: Sigma Scaling with SSIM-Guided Balanced Diffusion for AnimeFace Colorization

Junpyo Seo,Hanbin Koo,Jieun Yook,Byung-Ro Moon

Main category: cs.GR

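TL;DR: A diffusion-based framework for automatic colorization of anime-style facial sketches is proposed; its SSIMBaD technique achieves structural fidelity and style transfer.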

Details

Motivation: Traditional methods rely on predefined noise schedules, which can compromise perceptual consistency, so a more balanced and faithful approach is needed. Method: The framework builds on continuous-time diffusion models and introduces SSIMBaD, a sigma-space transformation that makes perceptual degradation progress linearly. Result: On a large-scale anime face dataset, the method outperforms existing models in both pixel accuracy and perceptual quality and generalizes to diverse styles. Conclusion: The SSIMBaD framework performs well on anime face colorization, combining high fidelity with style adaptability. Abstract: We propose a novel diffusion-based framework for automatic colorization of Anime-style facial sketches. Our method preserves the structural fidelity of the input sketch while effectively transferring stylistic attributes from a reference image. Unlike traditional approaches that rely on predefined noise schedules -- which often compromise perceptual consistency -- our framework builds on continuous-time diffusion models and introduces SSIMBaD (Sigma Scaling with SSIM-Guided Balanced Diffusion). SSIMBaD applies a sigma-space transformation that aligns perceptual degradation, as measured by structural similarity (SSIM), in a linear manner. This scaling ensures uniform visual difficulty across timesteps, enabling more balanced and faithful reconstructions. Experiments on a large-scale Anime face dataset demonstrate that our method outperforms state-of-the-art models in both pixel accuracy and perceptual quality, while generalizing to diverse styles. Code is available at github.com/Giventicket/SSIMBaD-Sigma-Scaling-with-SSIM-Guided-Balanced-Diffusion-for-AnimeFace-Colorization

[129] Handle-based Mesh Deformation Guided By Vision Language Model

Xingpeng Sun,Shiyang Jia,Zherong Pan,Kui Wu,Aniket Bera

Main category: cs.GR

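TL;DR: A training-free, handle-based mesh deformation method is proposed that uses a vision-language model (VLM), steered by prompt engineering, to produce high-quality deformations.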

Details

Motivation: Existing mesh deformation methods suffer from low output quality, require manual tuning, or depend on data-driven training, and need improvement. Method: Cone singularity detection yields a sparse set of handles; a VLM selects the deformable sub-parts and handles; multi-view voting reduces prediction uncertainty. Result: On benchmarks, the method produces deformations that match user intent more closely while introducing low distortion. Conclusion: The method is training-free, highly automated, and consistently delivers high-quality mesh deformations. Abstract: Mesh deformation is a fundamental tool in 3D content manipulation. Despite extensive prior research, existing approaches often suffer from low output quality, require significant manual tuning, or depend on data-intensive training. To address these limitations, we introduce a training-free, handle-based mesh deformation method. Our core idea is to leverage a Vision-Language Model (VLM) to interpret and manipulate a handle-based interface through prompt engineering. We begin by applying cone singularity detection to identify a sparse set of potential handles. The VLM is then prompted to select both the deformable sub-parts of the mesh and the handles that best align with user instructions. Subsequently, we query the desired deformed positions of the selected handles in screen space. To reduce uncertainty inherent in VLM predictions, we aggregate the results from multiple camera views using a novel multi-view voting scheme. Across a suite of benchmarks, our method produces deformations that align more closely with user intent, as measured by CLIP and GPTEval3D scores, while introducing low distortion -- quantified via membrane energy. In summary, our approach is training-free, highly automated, and consistently delivers high-quality mesh deformations.

[130] VoxDet: Rethinking 3D Semantic Occupancy Prediction as Dense Object Detection

Wuyang Li,Zhu Yu,Alexandre Alahi

Main category: cs.GR

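TL;DR: VoxDet recasts voxel-level semantic occupancy prediction as instance-level dense detection, addressing the lack of instance discriminability in existing methods and reaching state-of-the-art accuracy and efficiency.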

Details

Motivation: Existing methods treat 3D semantic occupancy prediction as dense segmentation and neglect instance-level discriminability, which leads to incomplete instances and ambiguity between adjacent objects. Method: The VoxDet framework decouples the task into two sub-tasks, offset regression and semantic prediction, and achieves instance-aware prediction with a spatially-decoupled voxel encoder and a task-decoupled dense predictor. Result: VoxDet reaches state-of-the-art performance with both camera and LiDAR input, with 63.0 IoU on the SemanticKITTI test set, ranking 1st. Conclusion: Through instance-level optimization, VoxDet substantially improves the accuracy and efficiency of 3D semantic occupancy prediction. Abstract: 3D semantic occupancy prediction aims to reconstruct the 3D geometry and semantics of the surrounding environment. With dense voxel labels, prior works typically formulate it as a dense segmentation task, independently classifying each voxel. However, this paradigm neglects critical instance-centric discriminability, leading to instance-level incompleteness and adjacent ambiguities. To address this, we highlight a free lunch of occupancy labels: the voxel-level class label implicitly provides insight at the instance level, which is overlooked by the community. Motivated by this observation, we first introduce a training-free Voxel-to-Instance (VoxNT) trick: a simple yet effective method that freely converts voxel-level class labels into instance-level offset labels. Building on this, we further propose VoxDet, an instance-centric framework that reformulates the voxel-level occupancy prediction as dense object detection by decoupling it into two sub-tasks: offset regression and semantic prediction. Specifically, based on the lifted 3D volume, VoxDet first uses (a) Spatially-decoupled Voxel Encoder to generate disentangled feature volumes for the two sub-tasks, which learn task-specific spatial deformation in the densely projected tri-perceptive space. Then, we deploy (b) Task-decoupled Dense Predictor to address this task via dense detection. Here, we first regress a 4D offset field to estimate distances (6 directions) between voxels and object borders in the voxel space. The regressed offsets are then used to guide the instance-level aggregation in the classification branch, achieving instance-aware prediction.
Experiments show that VoxDet can be deployed on both camera and LiDAR input, jointly achieving state-of-the-art results on both benchmarks. VoxDet is not only highly efficient, but also achieves 63.0 IoU on the SemanticKITTI test set, ranking 1st on the online leaderboard.
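The idea of turning voxel-level class labels into offset labels can be sketched on a toy grid. The abstract does not specify how the Voxel-to-Instance conversion works, so this is only an illustrative assumption: a run of same-class voxels along an axis is treated as one object, and the offset is the number of same-class voxels between a voxel and the end of its run. The function names are hypothetical.

```python
import numpy as np

def run_length_to_border(labels, axis, toward_positive):
    """Count same-class voxels between each voxel and the run border.

    Assumption: a contiguous run of equal labels along `axis` stands in
    for one object instance; this is not necessarily the paper's rule.
    """
    lab = np.moveaxis(labels, axis, 0)
    if toward_positive:
        lab = lab[::-1]
    dist = np.zeros(lab.shape, dtype=int)
    for i in range(1, lab.shape[0]):
        same = lab[i] == lab[i - 1]
        # Extend the run if the neighbour has the same class, else restart.
        dist[i] = np.where(same, dist[i - 1] + 1, 0)
    if toward_positive:
        dist = dist[::-1]
    return np.moveaxis(dist, 0, axis)

def six_direction_offsets(labels):
    """Stack offsets for -x, +x, -y, +y, -z, +z into shape (6, X, Y, Z)."""
    fields = [
        run_length_to_border(labels, axis, toward_positive)
        for axis in range(3)
        for toward_positive in (False, True)
    ]
    return np.stack(fields, axis=0)

labels = np.array([1, 1, 1, 0, 0]).reshape(1, 1, 5)
offsets = six_direction_offsets(labels)
print(offsets[4, 0, 0], offsets[5, 0, 0])
```

The conversion needs no learned parameters, which matches the "training-free" description; separating touching objects of the same class would need something beyond this sketch.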

[131] A Fast Unsupervised Scheme for Polygonal Approximation

Bimal Kumar Ray

Main category: cs.GR

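TL;DR: The paper proposes a fast, unsupervised scheme for polygonal approximation of closed digital curves that is faster than existing techniques and performs well on Rosin's measure and aesthetically. The scheme consists of initial segmentation, iterative vertex insertion, iterative merging, and vertex adjustment.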

Details

Motivation: Existing methods fall short in speed and aesthetic quality, so a more efficient and visually pleasing approximation scheme is needed. Method: The scheme has three phases followed by a final adjustment: initial segmentation detects high-curvature vertices; iterative vertex insertion adds low-curvature vertices; iterative merging removes redundant vertices; vertex adjustment then improves the aesthetic result. Result: The scheme is faster than existing techniques, performs well on Rosin's measure and aesthetically, and is robust to geometric transformations. Conclusion: The scheme offers an efficient, visually pleasing, and robust approximation method for closed digital curves. Abstract: This paper proposes a fast and unsupervised scheme for a polygonal approximation of a closed digital curve. It is demonstrated that the approximation scheme is faster than state-of-the-art approximation and is competitive with it in Rosin's measure and in its aesthetic aspect. The scheme comprises three phases: initial segmentation, iterative vertex insertion, and iterative merging, followed by vertex adjustment. The initial segmentation is used to detect sharp turnings, that is, the vertices that seemingly have high curvature. It is likely that some important vertices with low curvature might have been missed at the first phase, and so iterative vertex insertion is used to add vertices in a region where the curvature changes slowly but steadily. The initial phase may pick up some undesirable vertices, and so merging is used to eliminate the redundant vertices. Finally, vertex adjustment is used to facilitate enhancement in the aesthetic look of the approximation. The quality of the approximations is measured using Rosin's measure. The robustness of the proposed scheme with respect to geometric transformation is observed.

[132] Midplane based 3D single pass unbiased segment-to-segment contact interaction using penalty method

Indrajeet Sahu,Nik Petrinic

Main category: cs.GR

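TL;DR: An unbiased contact interaction method is proposed that avoids designating master and slave surfaces; contact tractions are evaluated in a single pass with respect to a midplane, which preserves traction equilibrium, and the method is validated on high-accuracy contact problems.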

Details

Motivation: Traditional contact methods require designating master and slave surfaces, which can introduce bias. The paper aims for an unbiased method that simplifies contact traction evaluation and improves accuracy. Method: Contact tractions are evaluated in a single pass with respect to a midplane by penalising true interpenetration, and the geometric configurations of interacting 3D segments are analysed in detail to improve traction evaluation. Result: The method's accuracy and robustness are validated on the contact patch test, two-beam bending, Hertzian contact, and the flat punch test, and it supports high-accuracy contact on nonconformal meshes. Conclusion: The method handles contact between flat surfaces, curved surfaces, and sharp corners, as well as dynamic collision problems, showing its versatility in general contact problems. Abstract: This work introduces a contact interaction methodology for an unbiased treatment of contacting surfaces without assigning surfaces as master and slave. The contact tractions between interacting discrete segments are evaluated with respect to a midplane in a single pass, inherently maintaining the equilibrium of tractions. These tractions are based on the penalisation of true interpenetration between opposite surfaces, and the procedure of their integral for discrete contacting segments is described in this paper. A meticulous examination of the different possible geometric configurations of interacting 3D segments is presented to develop visual understanding and better traction evaluation accuracy. The accuracy and robustness of the proposed method are validated against the analytical solutions of the contact patch test, two-beam bending, Hertzian contact, and flat punch test, thus proving the capability to reproduce contact between flat surfaces, curved surfaces, and sharp corners in contact, respectively. The method passes the contact patch test with the uniform transmission of contact pressure matching the accuracy levels of finite elements. It converges towards the analytical solution with mesh refinement and a suitably high penalty factor. The effectiveness of the proposed algorithm also extends to self-contact problems and has been tested for self-contact between flat and curved surfaces with inelastic material. Dynamic problems of elastic and inelastic collisions between bars, as well as oblique collisions of cylinders, are also presented. The ability of the algorithm to resolve contacts between flat and curved surfaces for nonconformal meshes with high accuracy demonstrates its versatility in general contact problems.
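The core penalty idea can be shown in one dimension along the contact normal. This is a deliberately reduced sketch, not the paper's 3D segment-to-segment integration: the function name, the scalar geometry, and the linear penalty law `penalty * penetration` are assumptions made for illustration.

```python
def midplane_penalty_tractions(top_of_lower, bottom_of_upper, penalty):
    """Penalty tractions for two facing surfaces along one normal axis.

    top_of_lower    : position of the lower body's upper surface
    bottom_of_upper : position of the upper body's lower surface
    penalty         : penalty factor (traction per unit penetration)

    Returns (midplane, traction_on_lower, traction_on_upper).
    """
    # Positive penetration means the two bodies overlap.
    penetration = top_of_lower - bottom_of_upper
    # Neither surface is privileged: both are measured from the midplane.
    midplane = 0.5 * (top_of_lower + bottom_of_upper)

    if penetration <= 0.0:
        return midplane, 0.0, 0.0

    traction = penalty * penetration
    # Equal and opposite by construction, so the pair is in equilibrium.
    return midplane, -traction, traction

mid, t_lower, t_upper = midplane_penalty_tractions(1.02, 1.00, penalty=1.0e4)
print(mid, t_lower, t_upper)
```

Because a single evaluation produces both tractions, swapping the roles of the two bodies cannot change the result, which is the sense in which such a scheme is unbiased.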

[133] Towards the target and not beyond: 2d vs 3d visual aids in mr-based neurosurgical simulation

Pasquale Cascarano,Andrea Loretti,Matteo Martinoni,Luca Zanuttini,Alessio Di Pasquale,Gustavo Marfia

Main category: cs.GR

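TL;DR: NeuroMix is a mixed reality (MR) simulator for training External Ventricular Drain (EVD) placement. The study finds that training with combined 2D and 3D visual aids improves precision in unaided testing by 44% without affecting cognitive load or technology acceptance.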

Details

Motivation: In neurosurgery, precisely reconstructing complex 3D anatomy from 2D slices is a major challenge. MR technology has great potential but limited clinical availability, so training systems are needed that improve skill retention in unaided conditions. Method: The study compares three training modalities: no visual aids, 2D aids only, and combined 2D and 3D aids. After training on digital objects, 48 participants performed freehand EVD placement without MR aids and were compared with an untrained control group. Result: The group trained with combined 2D and 3D aids improved precision in unaided testing by 44%, substantially more than the other groups. All training modalities received high usability and technology acceptance ratings, but operation times were longer. Conclusion: Training with combined 2D and 3D visual aids substantially improves surgical precision without affecting cognitive load and is suitable for neurosurgical training. Abstract: Neurosurgery increasingly uses Mixed Reality (MR) technologies for intraoperative assistance. The greatest challenge in this area is mentally reconstructing complex 3D anatomical structures from 2D slices with millimetric precision, which is required in procedures like External Ventricular Drain (EVD) placement. MR technologies have shown great potential in improving surgical performance; however, their limited availability in clinical settings underscores the need for training systems that foster skill retention in unaided conditions. In this paper, we introduce NeuroMix, an MR-based simulator for EVD placement. We conduct a study with 48 participants to assess the impact of 2D and 3D visual aids on usability, cognitive load, technology acceptance, and procedure precision and execution time. Three training modalities are compared: one without visual aids, one with 2D aids only, and one combining both 2D and 3D aids. The training phase takes place entirely on digital objects, followed by a freehand EVD placement testing phase performed with a physical catheter and a physical phantom without MR aids. We then compare the participants' performance with that of a control group that does not undergo training. Our findings show that participants trained with both 2D and 3D aids achieve a 44% improvement in precision during unaided testing compared to the control group, substantially higher than the improvement observed in the other groups. All three training modalities receive high usability and technology acceptance ratings, with significant equivalence across groups.
The combination of 2D and 3D visual aids does not significantly increase cognitive workload, though it leads to longer operation times during freehand testing compared to the control group.

[134] Uniform Sampling of Surfaces by Casting Rays

Selena Ling,Abhishek Madan,Nicholas Sharp,Alec Jacobson

Main category: cs.GR

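TL;DR: A simple, general method based on the intersections of random rays with a surface is proposed for uniformly sampling points on implicit surfaces without extracting an intermediate mesh.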

Details

Motivation: In geometry processing, sampling points on implicit surfaces is harder than on explicit meshes, so an efficient and general method is needed. Method: Uniform sampling is obtained from the intersections of random rays with the surface; the approach is particularly suited to implicit signed distance functions, where sphere marching makes ray casting efficient. Result: Experiments show the method is more efficient than alternative strategies on a variety of representations, and it supports blue noise and stratified sampling. Conclusion: The method achieves efficient uniform sampling on implicit surfaces and extends to further applications. Abstract: Randomly sampling points on surfaces is an essential operation in geometry processing. This sampling is computationally straightforward on explicit meshes, but it is much more difficult on other shape representations, such as widely-used implicit surfaces. This work studies a simple and general scheme for sampling points on a surface, which is derived from a connection to the intersections of random rays with the surface. Concretely, given a subroutine to cast a ray against a surface and find all intersections, we can use that subroutine to uniformly sample white noise points on the surface. This approach is particularly effective in the context of implicit signed distance functions, where sphere marching allows us to efficiently cast rays and sample points, without needing to extract an intermediate mesh. We analyze the basic method to show that it guarantees uniformity, and find experimentally that it is significantly more efficient than alternative strategies on a variety of representations. Furthermore, we show extensions to blue noise sampling and stratified sampling, and applications to deforming neural implicit surfaces as well as moment estimation.
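The ray-based sampling idea can be tried on the simplest possible surface. The sketch below is an illustration under stated assumptions, not the paper's implementation: the surface is a unit sphere intersected analytically (rather than an implicit surface traced by sphere marching), random lines are drawn with a uniform direction and a uniform offset in the perpendicular disk of a bounding ball, and the name `sample_unit_sphere_by_rays` and the parameter `bound_radius` are hypothetical.

```python
import numpy as np

def sample_unit_sphere_by_rays(n_rays, bound_radius=1.5, seed=0):
    """Collect all intersections of random lines with the unit sphere."""
    rng = np.random.default_rng(seed)

    # Uniform random directions.
    d = rng.normal(size=(n_rays, 3))
    d /= np.linalg.norm(d, axis=1, keepdims=True)

    # Orthonormal basis (u, v) of the plane perpendicular to each direction.
    helper = np.where(np.abs(d[:, :1]) < 0.9, [1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
    u = np.cross(d, helper)
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    v = np.cross(d, u)

    # Uniform offset inside the disk of radius bound_radius in that plane.
    r = bound_radius * np.sqrt(rng.uniform(size=n_rays))
    phi = rng.uniform(0.0, 2.0 * np.pi, size=n_rays)
    o = (r * np.cos(phi))[:, None] * u + (r * np.sin(phi))[:, None] * v

    # The line is o + t * d with o perpendicular to d, so |o|^2 + t^2 = 1.
    disc = 1.0 - np.sum(o * o, axis=1)
    hit = disc > 0.0
    t = np.sqrt(disc[hit])[:, None]

    # Keep every intersection of every line that hits the surface.
    return np.concatenate([o[hit] + t * d[hit], o[hit] - t * d[hit]])

points = sample_unit_sphere_by_rays(2000)
print(points.shape)
```

Keeping all intersections of each line, rather than only the first hit, is what ties the point density to surface area; keeping only first hits would oversample the parts of a non-convex shape that are visible from outside.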

cs.CL [Back]

[135] GEM: Empowering LLM for both Embedding Generation and Language Understanding

Caojin Zhang,Qiang Zhang,Ke Li,Sai Vidyaranya Nuthalapati,Benyu Zhang,Jason Liu,Serena Li,Lizhu Zhang,Xiangjun Fan

Main category: cs.CL

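TL;DR: The paper proposes GEM, a self-supervised method that lets decoder-only language models (LLMs) generate high-quality text embeddings while retaining their original text generation and reasoning capabilities.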

Details

Motivation: In current applications, text embedding and generation often rely on separate models, which adds system complexity and introduces discrepancies in understanding. Method: Special tokens are inserted and the attention mask is adjusted so that the LLM produces text embeddings; the method fits into the post-training or fine-tuning stage. Result: On the MTEB and MMLU benchmarks, GEM substantially improves the LLMs' text embedding ability with minimal impact on their NLP performance. Conclusion: GEM gives LLMs strong text embedding capabilities while preserving their original performance, and has broad application potential. Abstract: Large decoder-only language models (LLMs) have achieved remarkable success in generation and reasoning tasks, where they generate text responses given instructions. However, many applications, e.g., retrieval augmented generation (RAG), still rely on separate embedding models to generate text embeddings, which can complicate the system and introduce discrepancies in understanding of the query between the embedding model and LLMs. To address this limitation, we propose a simple self-supervised approach, Generative Embedding large language Model (GEM), that enables any large decoder-only LLM to generate high-quality text embeddings while maintaining its original text generation and reasoning capabilities. Our method inserts new special token(s) into a text body, and generates summarization embedding of the text by manipulating the attention mask. This method could be easily integrated into post-training or fine tuning stages of any existing LLMs. We demonstrate the effectiveness of our approach by applying it to two popular LLM families, ranging from 1B to 8B parameters, and evaluating the transformed models on both text embedding benchmarks (MTEB) and NLP benchmarks (MMLU). The results show that our proposed method significantly improves the original LLMs on MTEB while having a minimal impact on MMLU. Our strong results indicate that our approach can empower LLMs with state-of-the-art text embedding capabilities while maintaining their original NLP performance.

[136] Effects of Speaker Count, Duration, and Accent Diversity on Zero-Shot Accent Robustness in Low-Resource ASR

Zheng-Xin Yong,Vineel Pratap,Michael Auli,Jean Maillard

Main category: cs.CL

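TL;DR: The study shows that, in low-resource training, increasing the number of speakers improves ASR robustness to unseen accents more than increasing the audio duration per speaker, while accent diversity contributes little.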

Details

Motivation: An automatic speech recognition (ASR) system that serves users worldwide must be robust to a wide range of accents, including unseen ones. Method: The study systematically examines how three training-data variables (number of speakers, audio duration per speaker, and accent diversity) affect ASR robustness. Result: For a fixed number of training hours, adding speakers is more effective than adding audio per speaker; accent diversity contributes little. Conclusion: Practitioners should prioritize speaker count in ASR training data over accent diversity or audio duration per speaker. Abstract: To build an automatic speech recognition (ASR) system that can serve everyone in the world, the ASR needs to be robust to a wide range of accents including unseen accents. We systematically study how three different variables in training data -- the number of speakers, the audio duration per individual speaker, and the diversity of accents -- affect ASR robustness towards unseen accents in a low-resource training regime. We observe that for a fixed number of ASR training hours, it is more beneficial to increase the number of speakers (which means each speaker contributes less) than the number of hours contributed per speaker. We also observe that more speakers enable ASR performance gains from scaling the number of hours. Surprisingly, we observe minimal benefits to prioritizing speakers with different accents when the number of speakers is controlled. Our work suggests that practitioners should prioritize increasing the speaker count in ASR training data composition for new languages.

[137] Mechanistic Decomposition of Sentence Representations

Matthieu Tehenan,Vikram Natarajan,Jonathan Michala,Milton Lin,Juri Opitz

Main category: cs.CL

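TL;DR: A new method is proposed that uses dictionary learning to decompose sentence embeddings into interpretable components, making them more transparent and controllable.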

Details

Motivation: Sentence embeddings are central to modern NLP and AI systems, but their internal structure is opaque and hard to interpret. Method: Dictionary learning is applied to token-level representations, and the way pooling compresses these features into sentence representations is analysed. Result: Many semantic and syntactic features are found to be linearly encoded in sentence embeddings. Conclusion: The method yields more transparent and controllable sentence representations and sheds light on their inner workings. Abstract: Sentence embeddings are central to modern NLP and AI systems, yet little is known about their internal structure. While we can compare these embeddings using measures such as cosine similarity, the contributing features are not human-interpretable, and the content of an embedding seems untraceable, as it is masked by complex neural transformations and a final pooling operation that combines individual token embeddings. To alleviate this issue, we propose a new method to mechanistically decompose sentence embeddings into interpretable components, by using dictionary learning on token-level representations. We analyze how pooling compresses these features into sentence representations, and assess the latent features that reside in a sentence embedding. This bridges token-level mechanistic interpretability with sentence-level analysis, making for more transparent and controllable representations. In our studies, we obtain several interesting insights into the inner workings of sentence embedding spaces, for instance, that many semantic and syntactic aspects are linearly encoded in the embeddings.
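Why token-level dictionary features can survive pooling is easiest to see for mean pooling, which is linear. The toy sketch below assumes mean pooling and exact sparse codes over a random dictionary; the abstract specifies neither, and all names and sizes are hypothetical.

```python
import numpy as np

def mean_pool(x):
    """Average over the token axis."""
    return x.mean(axis=0)

def make_toy_tokens(n_tokens=5, n_features=16, dim=8, seed=0):
    """Build tokens that are exact sparse combinations of dictionary rows."""
    rng = np.random.default_rng(seed)
    dictionary = rng.normal(size=(n_features, dim))
    # Roughly 20% of the features are active for each token.
    active = rng.random(size=(n_tokens, n_features)) < 0.2
    codes = rng.random(size=(n_tokens, n_features)) * active
    tokens = codes @ dictionary
    return dictionary, codes, tokens

dictionary, codes, tokens = make_toy_tokens()

# Pooling the tokens and pooling the codes commute with the dictionary,
# so the sentence embedding has its own code: the mean of the token codes.
sentence_embedding = mean_pool(tokens)
sentence_code = mean_pool(codes)
print(np.allclose(sentence_embedding, sentence_code @ dictionary))
```

In practice learned dictionaries reconstruct tokens only approximately, so the identity holds up to the mean reconstruction error; other pooling schemes, such as taking a single designated token, would need a separate analysis.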

[138] Hierarchical Text Classification Using Contrastive Learning Informed Path Guided Hierarchy

Neeraj Agrawal,Saurabh Kumar,Priyanka Bhatt,Tanishka Agarwal

Main category: cs.CL

TL;DR: HTC-CLIP combines contrastive learning with a path-guided hierarchy to improve hierarchical text classification, raising Macro F1 by 0.99-2.37%.

Details

Motivation: Existing HTC models either encode the label hierarchy separately from the text or guide the label hierarchy through the text encoder; the two approaches are complementary but have not been combined. Method: HTC-CLIP uses contrastive learning to learn a hierarchy-aware text representation and a path-guided hierarchy representation; two sets of class probability distributions are learned during training and pooled at inference. Result: On two public datasets, HTC-CLIP improves Macro F1 by 0.99-2.37% over the existing state-of-the-art models. Conclusion: HTC-CLIP effectively combines the two approaches and improves hierarchical text classification. Abstract: Hierarchical Text Classification (HTC) has recently gained traction given the ability to handle complex label hierarchy. This has found applications in domains like e-commerce, customer care and medicine industry among other real-world applications. Existing HTC models either encode label hierarchy separately and mix it with text encoding or guide the label hierarchy structure in the text encoder. Both approaches capture different characteristics of label hierarchy and are complementary to each other. In this paper, we propose a Hierarchical Text Classification using Contrastive Learning Informed Path guided hierarchy (HTC-CLIP), which learns hierarchy-aware text representation and text informed path guided hierarchy representation using contrastive learning. During the training of HTC-CLIP, we learn two different sets of class probability distributions and during inference, we use the pooled output of both probabilities for each class to get the best of both representations. Our results show that the two previous approaches can be effectively combined into one architecture to achieve improved performance. Tests on two public benchmark datasets showed an improvement of 0.99 - 2.37% in Macro F1 score using HTC-CLIP over the existing state-of-the-art models.
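The inference step, pooling two sets of class probabilities, can be sketched as follows. The abstract does not say which pooling is used, so the arithmetic mean, the decision threshold, and the function name are all assumptions made for illustration.

```python
import numpy as np

def pool_class_probabilities(p_text, p_path, threshold=0.5):
    """Average two per-class probability vectors and threshold the result.

    p_text    : probabilities from the hierarchy-aware text branch
    p_path    : probabilities from the path-guided hierarchy branch
    threshold : hypothetical decision threshold for multi-label output
    """
    pooled = 0.5 * (np.asarray(p_text, dtype=float) + np.asarray(p_path, dtype=float))
    return pooled, pooled >= threshold

pooled, predicted = pool_class_probabilities([0.9, 0.4, 0.1], [0.7, 0.7, 0.2])
print(pooled, predicted)
```

In this toy input the second class is accepted only because the path branch is confident where the text branch is not, which is the complementarity the abstract appeals to.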

[139] MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP

Kurt Micallef,Claudia Borg

Main category: cs.CL

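TL;DR: The paper evaluates 55 publicly available large language models on Maltese, a low-resource language, and finds that smaller fine-tuned models perform better and that exposure during pre-training and instruction-tuning affects performance the most.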

Details

Motivation: Large language models remain limited on low-resource languages, so their suitability and ways to improve them need to be explored. Method: 55 models are evaluated on a benchmark covering 11 tasks, and factors such as pre-training and fine-tuning are analysed. Result: Smaller models perform better after fine-tuning, and pre-training and instruction-tuning are the key factors. Conclusion: Researchers working on low-resource languages are advised to consider traditional modelling approaches, and the importance of inclusive language technology is emphasized. Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across various Natural Language Processing (NLP) tasks, largely due to their generalisability and ability to perform tasks without additional training. However, their effectiveness for low-resource languages remains limited. In this study, we evaluate the performance of 55 publicly available LLMs on Maltese, a low-resource language, using a newly introduced benchmark covering 11 discriminative and generative tasks. Our experiments highlight that many models perform poorly, particularly on generative tasks, and that smaller fine-tuned models often perform better across all tasks. From our multidimensional analysis, we investigate various factors impacting performance. We conclude that prior exposure to Maltese during pre-training and instruction-tuning emerges as the most important factor. We also examine the trade-offs between fine-tuning and prompting, highlighting that while fine-tuning requires a higher initial cost, it yields better performance and lower inference costs. Through this work, we aim to highlight the need for more inclusive language technologies and recommend that researchers working with low-resource languages consider more "traditional" language modelling approaches.

[140] Building a Few-Shot Cross-Domain Multilingual NLU Model for Customer Care

Saurabh Kumar,Sourav Bansal,Neeraj Agrawal,Priyanka Bhatt

Main category: cs.CL

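TL;DR: Proposes an embedder-cum-classifier architecture that extends domain-specific models to other domains with only a few labeled samples, substantially improving intent classification accuracy.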

Details

Motivation: Scarce cross-domain data limits how well intent classifiers generalize; the goal is multi-domain generalization from only a few annotations. Method: Trains a domain-specific sentence embedder with supervised fine-tuning and isotropic regularizers, then generalizes it across domains with a multilingual knowledge distillation strategy. Result: On Canada and Mexico e-commerce customer care datasets, few-shot intent detection accuracy improves by 20-23% over existing state-of-the-art pre-trained models. Conclusion: The proposed architecture markedly improves cross-domain intent classification in few-shot settings and has practical value. Abstract: Customer care is an essential pillar of the e-commerce shopping experience with companies spending millions of dollars each year, employing automation and human agents, across geographies (like US, Canada, Mexico, Chile), channels (like Chat, Interactive Voice Response (IVR)), and languages (like English, Spanish). SOTA pre-trained models like multilingual-BERT, fine-tuned on annotated data, have shown good performance in downstream tasks relevant to Customer Care. However, model performance is largely subject to the availability of sufficient annotated domain-specific data. Cross-domain availability of data remains a bottleneck, thus building an intent classifier that generalizes across domains (defined by channel, geography, and language) with only a few annotations is of great practical value. In this paper, we propose an embedder-cum-classifier model architecture which extends state-of-the-art domain-specific models to other domains with only a few labeled samples. We adopt a supervised fine-tuning approach with isotropic regularizers to train a domain-specific sentence embedder and a multilingual knowledge distillation strategy to generalize this embedder across multiple domains. The trained embedder, further augmented with a simple linear classifier, can be deployed for new domains. Experiments on Canada and Mexico e-commerce Customer Care datasets with few-shot intent detection show an increase in accuracy of 20-23% against the existing state-of-the-art pre-trained models.

[141] MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale

Ran Xu,Yuchen Zhuang,Yishan Zhong,Yue Yu,Xiangru Tang,Hang Wu,May D. Wang,Peifeng Ruan,Donghan Yang,Tao Wang,Guanghua Xiao,Carl Yang,Yang Xie,Wenqi Shi

Main category: cs.CL

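TL;DR: MedAgentGYM is a publicly available training environment for improving code-based medical reasoning in LLM agents, with 72,413 task instances across 129 categories drawn from real-world biomedical scenarios.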

Details

Motivation: Close the performance gap between commercial and open-source LLMs on coding-based medical tasks and provide scalable training resources. Method: Wraps tasks in executable coding environments and improves models with supervised fine-tuning followed by reinforcement learning. Result: Med-Copilot-7B gains substantially (+36.44% from supervised fine-tuning, +42.47% from continued reinforcement learning) and becomes competitive with gpt-4o. Conclusion: MedAgentGYM offers an integrated platform for developing high-performing LLM coding assistants for biomedical research and practice. Abstract: We introduce MedAgentGYM, the first publicly available training environment designed to enhance coding-based medical reasoning capabilities in large language model (LLM) agents. MedAgentGYM comprises 72,413 task instances across 129 categories derived from authentic real-world biomedical scenarios. Tasks are encapsulated within executable coding environments, each featuring detailed task descriptions, interactive feedback mechanisms, verifiable ground-truth annotations, and scalable training trajectory generation. Extensive benchmarking of over 30 LLMs reveals a notable performance disparity between commercial API-based models and open-source counterparts. Leveraging MedAgentGYM, Med-Copilot-7B achieves substantial performance gains through supervised fine-tuning (+36.44%) and continued reinforcement learning (+42.47%), emerging as an affordable and privacy-preserving alternative competitive with gpt-4o. By offering both a comprehensive benchmark and accessible, expandable training resources within unified execution environments, MedAgentGYM delivers an integrated platform to develop LLM-based coding assistants for advanced biomedical research and practice.

[142] Unpacking Let Alone: Human-Scale Models Generalize to a Rare Construction in Form but not Meaning

Wesley Scivetti,Tatsuya Aoyama,Ethan Wilcox,Nathan Schneider

Main category: cs.CL

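TL;DR: Compares humans with language models trained on human-scale data on a rare grammatical construction: the models capture its form but not its meaning.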

Details

Motivation: Examine how humans and language models differ in generalizing to rare grammatical phenomena such as the LET-ALONE construction. Method: Builds a synthetic benchmark that tests models on both the syntactic form and the semantics of LET-ALONE. Result: The models are sensitive to the form but do not generalize correctly about the meaning. Conclusion: Current architectures show an asymmetry in sample efficiency between language form and meaning that is absent in human learners. Abstract: Humans have a remarkable ability to acquire and understand grammatical phenomena that are seen rarely, if ever, during childhood. Recent evidence suggests that language models with human-scale pretraining data may possess a similar ability by generalizing from frequent to rare constructions. However, it remains an open question how widespread this generalization ability is, and to what extent this knowledge extends to meanings of rare constructions, as opposed to just their forms. We fill this gap by testing human-scale transformer language models on their knowledge of both the form and meaning of the (rare and quirky) English LET-ALONE construction. To evaluate our LMs we construct a bespoke synthetic benchmark that targets syntactic and semantic properties of the construction. We find that human-scale LMs are sensitive to form, even when related constructions are filtered from the dataset. However, human-scale LMs do not make correct generalizations about LET-ALONE's meaning. These results point to an asymmetry in the current architectures' sample efficiency between language form and meaning, something which is not present in human language learners.

[143] Empaths at SemEval-2025 Task 11: Retrieval-Augmented Approach to Perceived Emotions Prediction

Lev Morozov,Aleksandr Mogilevskii,Alexander Shirnin

Main category: cs.CL

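TL;DR: EmoRAG is a multi-label emotion detection system that needs no additional training and relies only on an ensemble of models; it achieves results comparable to the best systems in the SemEval-2025 task.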

Details

Motivation: Detect the speaker's perceived emotions, such as joy, sadness, and fear, from a text snippet. Method: Uses an ensemble of models with no additional training. Result: Performance is comparable to the best systems while being more efficient, scalable, and easier to implement. Conclusion: EmoRAG is an efficient and practical solution for multi-label emotion detection. Abstract: This paper describes EmoRAG, a system designed to detect perceived emotions in text for SemEval-2025 Task 11, Subtask A: Multi-label Emotion Detection. We focus on predicting the perceived emotions of the speaker from a given text snippet, labeling it with emotions such as joy, sadness, fear, anger, surprise, and disgust. Our approach does not require additional model training and only uses an ensemble of models to predict emotions. EmoRAG achieves results comparable to the best performing systems, while being more efficient, scalable, and easier to implement.

[144] Zero-Shot Open-Schema Entity Structure Discovery

Xueqiang Xu,Jinfeng Xiao,James Barry,Mohab Elkaref,Jiaru Zou,Pengcheng Jiang,Yunyi Zhang,Max Giammona,Geeth de Mel,Jiawei Han

Main category: cs.CL

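TL;DR: Proposes Zero-Shot Open-schema Entity Structure Discovery (ZOES), which needs no predefined schema or annotated data and improves LLM entity structure extraction through enrichment, refinement, and unification.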

Details

Motivation: Existing LLM-based methods depend on predefined entity attribute schemas or annotated data, which often leads to incomplete extraction. Method: ZOES applies an enrichment, refinement, and unification mechanism that exploits the mutually reinforcing relation between an entity and its structure. Result: Experiments in three different domains show that ZOES consistently helps LLMs extract more complete entity structures and generalizes well. Conclusion: The ZOES mechanism may serve as a principled way to improve LLM-based entity structure discovery in various scenarios. Abstract: Entity structure extraction, which aims to extract entities and their associated attribute-value structures from text, is an essential task for text understanding and knowledge graph construction. Existing methods based on large language models (LLMs) typically rely heavily on predefined entity attribute schemas or annotated datasets, often leading to incomplete extraction results. To address these challenges, we introduce Zero-Shot Open-schema Entity Structure Discovery (ZOES), a novel approach to entity structure extraction that does not require any schema or annotated samples. ZOES operates via a principled mechanism of enrichment, refinement, and unification, based on the insight that an entity and its associated structure are mutually reinforcing. Experiments demonstrate that ZOES consistently enhances LLMs' ability to extract more complete entity structures across three different domains, showcasing both the effectiveness and generalizability of the method. These findings suggest that such an enrichment, refinement, and unification mechanism may serve as a principled approach to improving the quality of LLM-based entity structure discovery in various scenarios.

[145] Watermarking Degrades Alignment in Language Models: Analysis and Mitigation

Apurv Verma,NhatHai Phan,Shubhendu Trivedi

Main category: cs.CL

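TL;DR: Systematically analyses how two watermarking techniques (Gumbel and KGW) affect the alignment properties of LLM outputs, and proposes Alignment Resampling (AR) to restore alignment.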

Details

Motivation: The effects of watermarking on the truthfulness, safety, and helpfulness of LLM outputs are underexamined. Method: Measures the impact of watermarking on four aligned LLMs and proposes AR, which uses an external reward model at inference time to restore alignment. Result: Sampling just 2-4 watermarked generations with AR recovers or surpasses the unwatermarked baseline alignment scores. Conclusion: Exposes the balance between watermark strength and model alignment and offers a simple, effective solution for deploying watermarked LLMs. Abstract: Watermarking techniques for large language models (LLMs) can significantly impact output quality, yet their effects on truthfulness, safety, and helpfulness remain critically underexamined. This paper presents a systematic analysis of how two popular watermarking approaches, Gumbel and KGW, affect these core alignment properties across four aligned LLMs. Our experiments reveal two distinct degradation patterns: guard attenuation, where enhanced helpfulness undermines model safety, and guard amplification, where excessive caution reduces model helpfulness. These patterns emerge from watermark-induced shifts in token distribution, surfacing the fundamental tension that exists between alignment objectives. To mitigate these degradations, we propose Alignment Resampling (AR), an inference-time sampling method that uses an external reward model to restore alignment. We establish a theoretical lower bound on the improvement in expected reward score as the sample size is increased and empirically demonstrate that sampling just 2-4 watermarked generations effectively recovers or surpasses baseline (unwatermarked) alignment scores. To overcome the limited response diversity of standard Gumbel watermarking, our modified implementation sacrifices strict distortion-freeness while maintaining robust detectability, ensuring compatibility with AR. Experimental results confirm that AR successfully recovers baseline alignment in both watermarking approaches, while maintaining strong watermark detectability. This work reveals the critical balance between watermark strength and model alignment, providing a simple inference-time solution to responsibly deploy watermarked LLMs in practice.
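
As a rough illustration of an inference-time resampling scheme of this kind, the Python sketch below draws n watermarked generations and keeps the one an external reward model scores highest. Picking the single top-scoring sample (best-of-n) is an assumption about how AR selects among samples, and `generate` and `reward` are hypothetical callables standing in for the watermarked sampler and the reward model.

```python
from typing import Callable, Tuple


def alignment_resample(prompt: str,
                       generate: Callable[[str], str],
                       reward: Callable[[str, str], float],
                       n: int = 4) -> Tuple[str, float]:
    """Draw n generations and return the best one with its reward score.

    generate: hypothetical watermarked sampler, called once per candidate.
    reward:   hypothetical external reward model scoring (prompt, response).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward(prompt, c) for c in candidates]
    # Index of the highest-scoring candidate; ties go to the earliest sample.
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]
```

Because every candidate comes from the watermarked sampler, the selected response still carries the watermark; only the choice among samples depends on the reward model.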

[146] Aligning Large Language Models with Implicit Preferences from User-Generated Content

Zhaoxuan Tan,Zheng Li,Tianyi Liu,Haodong Wang,Hyokun Yun,Ming Zeng,Pei Chen,Zhihan Zhang,Yifan Gao,Ruijie Wang,Priyanka Nigam,Bing Yin,Meng Jiang

Main category: cs.CL

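TL;DR: The PUGC framework derives preference data from implicit preferences in unlabeled user-generated content (UGC), replacing reliance on human annotation and markedly improving model performance.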

Details

Motivation: Conventional preference learning depends on costly, hard-to-scale curated data, whereas UGC carries implicit human preferences that could address this. Method: PUGC turns UGC into user queries, generates responses from the policy model, and scores them with the UGC as reference text, thereby learning the implicit preferences. Result: Models trained with DPO and PUGC improve by 9.37% and reach a state-of-the-art 35.93% length-controlled win rate on Alpaca Eval 2. Conclusion: PUGC is an efficient, scalable approach to preference learning, with notable gains in domain-specific alignment and reward quality. Abstract: Learning from preference feedback is essential for aligning large language models (LLMs) with human values and improving the quality of generated responses. However, existing preference learning methods rely heavily on curated data from humans or advanced LLMs, which is costly and difficult to scale. In this work, we present PUGC, a novel framework that leverages implicit human Preferences in unlabeled User-Generated Content (UGC) to generate preference data. Although UGC is not explicitly created to guide LLMs in generating human-preferred responses, it often reflects valuable insights and implicit preferences from its creators that have the potential to address readers' questions. PUGC transforms UGC into user queries and generates responses from the policy model. The UGC is then leveraged as a reference text for response scoring, aligning the model with these implicit preferences. This approach improves the quality of preference data while enabling scalable, domain-specific alignment. Experimental results on Alpaca Eval 2 show that models trained with DPO and PUGC achieve a 9.37% performance improvement over traditional methods, setting a 35.93% state-of-the-art length-controlled win rate using Mistral-7B-Instruct. Further studies highlight gains in reward quality, domain-specific alignment effectiveness, robustness against UGC quality, and theory of mind capabilities. Our code and dataset are available at https://zhaoxuan.info/PUGC.github.io/

[147] SQLens: An End-to-End Framework for Error Detection and Correction in Text-to-SQL

Yue Gong,Chuan Lei,Xiao Qin,Kapil Vaidya,Balakrishnan Narayanaswamy,Tim Kraska

Main category: cs.CL

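TL;DR: SQLens is an end-to-end framework that detects and corrects semantic errors in LLM-generated SQL, substantially improving the execution accuracy of text-to-SQL systems.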

Details

Motivation: LLMs perform well on text-to-SQL, but their queries can be semantically wrong and there is little insight into their reliability. Method: SQLens combines error signals from the database and the LLM to identify potential semantic errors in SQL clauses and to guide query correction. Result: SQLens beats the best LLM-based self-evaluation method by 25.78% F1 on error detection and improves execution accuracy of text-to-SQL systems by up to 20%. Conclusion: SQLens effectively addresses semantic errors in LLM-generated SQL and improves system reliability and accuracy. Abstract: Text-to-SQL systems translate natural language (NL) questions into SQL queries, enabling non-technical users to interact with structured data. While large language models (LLMs) have shown promising results on the text-to-SQL task, they often produce semantically incorrect yet syntactically valid queries, with limited insight into their reliability. We propose SQLens, an end-to-end framework for fine-grained detection and correction of semantic errors in LLM-generated SQL. SQLens integrates error signals from both the underlying database and the LLM to identify potential semantic errors within SQL clauses. It further leverages these signals to guide query correction. Empirical results on two public benchmarks show that SQLens outperforms the best LLM-based self-evaluation method by 25.78% in F1 for error detection, and improves execution accuracy of out-of-the-box text-to-SQL systems by up to 20%.

[148] DRE: An Effective Dual-Refined Method for Integrating Small and Large Language Models in Open-Domain Dialogue Evaluation

Kun Zhao,Bohao Yang,Chen Tang,Siyuan Dai,Haoteng Tang,Chenghua Lin,Liang Zhan

Main category: cs.CL

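TL;DR: SLIDE combines the strengths of small and large language models through adaptive weighting to make dialogue evaluation more reliable; the follow-up DRE method refines the integration in two stages and outperforms existing methods in experiments.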

Details

Motivation: LLMs are unreliable in ambiguous scenarios, while small language models (SLMs) are susceptible to misleading inputs; each is stronger on either negative or positive examples, so combining them can make evaluation more reliable. Method: Introduces SLIDE, which integrates SLMs and LLMs via adaptive weighting, and then DRE, in which SLM-generated insights guide the LLM's initial evaluation and SLM-derived adjustments refine its scores. Result: DRE outperforms existing methods across diverse benchmarks and aligns more closely with human judgment. Conclusion: Combining small and large models can markedly improve reliability on open-ended tasks such as dialogue evaluation. Abstract: Large Language Models (LLMs) excel at many tasks but struggle with ambiguous scenarios where multiple valid responses exist, often yielding unreliable results. Conversely, Small Language Models (SLMs) demonstrate robustness in such scenarios but are susceptible to misleading or adversarial inputs. We observed that LLMs handle negative examples effectively, while SLMs excel with positive examples. To leverage their complementary strengths, we introduce SLIDE (Small and Large Integrated for Dialogue Evaluation), a method integrating SLMs and LLMs via adaptive weighting. Building on SLIDE, we further propose a Dual-Refinement Evaluation (DRE) method to enhance SLM-LLM integration: (1) SLM-generated insights guide the LLM to produce initial evaluations; (2) SLM-derived adjustments refine the LLM's scores for improved accuracy. Experiments demonstrate that DRE outperforms existing methods, showing stronger alignment with human judgment across diverse benchmarks. This work illustrates how combining small and large models can yield more reliable evaluation tools, particularly for open-ended tasks such as dialogue evaluation.

[149] Please Translate Again: Two Simple Experiments on Whether Human-Like Reasoning Helps Translation

Di Wu,Seth Aycock,Christof Monz

Main category: cs.CL

TL;DR: In translation, explicitly decomposing the task into steps (as in Chain-of-Thought) brings no clear gain, while simply prompting LLMs to "translate again" works better.

Details

Motivation: Test whether explicitly decomposing translation into steps (as in CoT) really improves LLM translation. Method: Compares step-by-step decomposition prompts with a simple "translate again" prompt. Result: Explicit decomposition yields no clear gain, while the "translate again" prompt performs better. Conclusion: Future work should explore which factors actually make CoT effective for translation. Abstract: Large Language Models (LLMs) demonstrate strong reasoning capabilities for many tasks, often by explicitly decomposing the task via Chain-of-Thought (CoT) reasoning. Recent work on LLM-based translation designs hand-crafted prompts to decompose translation, or trains models to incorporate intermediate steps. Translating Step-by-step (Briakou et al., 2024), for instance, introduces a multi-step prompt with decomposition and refinement of translation with LLMs, which achieved state-of-the-art results on WMT24. In this work, we scrutinise this strategy's effectiveness. Empirically, we find no clear evidence that performance gains stem from explicitly decomposing the translation process, at least for the models on test; and we show that simply prompting LLMs to "translate again" yields even better results than human-like step-by-step prompting. Our analysis does not rule out the role of reasoning, but instead invites future work exploring the factors for CoT's effectiveness in the context of translation.

[150] Is It JUST Semantics? A Case Study of Discourse Particle Understanding in LLMs

William Sheffield,Kanishka Misra,Valentina Pyatkin,Ashwini Deo,Kyle Mahowald,Junyi Jessy Li

Main category: cs.CL

TL;DR: Studies how well LLMs distinguish the fine-grained senses of the English particle "just": they separate broad categories but miss subtler differences.

Details

Motivation: Ask whether LLMs understand the fine-grained senses of polyfunctional discourse particles such as "just". Method: Evaluates LLMs on an expert-annotated dataset of the different senses of "just". Result: LLMs distinguish broad sense categories but do poorly on subtler distinctions. Conclusion: LLMs are limited in their understanding of the nuanced semantics of discourse particles. Abstract: Discourse particles are crucial elements that subtly shape the meaning of text. These words, often polyfunctional, give rise to nuanced and often quite disparate semantic/discourse effects, as exemplified by the diverse uses of the particle "just" (e.g., exclusive, temporal, emphatic). This work investigates the capacity of LLMs to distinguish the fine-grained senses of English "just", a well-studied example in formal semantics, using data meticulously created and labeled by expert linguists. Our findings reveal that while LLMs exhibit some ability to differentiate between broader categories, they struggle to fully capture more subtle nuances, highlighting a gap in their understanding of discourse particles.

[151] BSBench: will your LLM find the largest prime number?

K. O. T. Erziev

Main category: cs.CL

TL;DR: Proposes a benchmark for testing LLMs on questions that have no reasonable answer and finds that existing models perform poorly.

Details

Motivation: Study how LLMs behave on questions with no reasonable answer as a test of their real capabilities. Method: Presents a benchmark for such testing and a method to modify existing datasets. Result: Existing models are far from perfect on questions with no reasonable answer. Conclusion: The study offers a new angle on LLM evaluation; code and data are open-source. Abstract: We propose that benchmarking LLMs on questions which have no reasonable answer actually isn't as silly as it sounds. We also present a benchmark that allows such testing and a method to modify existing datasets, and discover that existing models demonstrate performance far from perfect on such questions. Our code and data artifacts are available at https://github.com/L3G5/impossible-bench

[152] SSA-COMET: Do LLMs Outperform Learned Metrics in Evaluating MT for Under-Resourced African Languages?

Senyu Li,Jiayi Wang,Felermino D. M. A. Ali,Colin Cherry,Daniel Deutsch,Eleftheria Briakou,Rui Sousa-Silva,Henrique Lopes Cardoso,Pontus Stenetorp,David Ifeoluwa Adelani

Main category: cs.CL

TL;DR: Introduces the SSA-MTE dataset and the SSA-COMET metrics to improve machine translation quality evaluation for under-resourced African languages.

Details

Motivation: Existing metrics cover African low-resource languages poorly and perform badly on them, so better solutions are needed. Method: Builds a large-scale human-annotated dataset covering 13 African language pairs and uses it to develop the SSA-COMET and SSA-COMET-QE metrics. Result: SSA-COMET significantly outperforms AfriCOMET and performs especially well on low-resource languages. Conclusion: SSA-MTE and SSA-COMET are effective tools for MT evaluation in African low-resource languages; all resources are openly released. Abstract: Evaluating machine translation (MT) quality for under-resourced African languages remains a significant challenge, as existing metrics often suffer from limited language coverage and poor performance in low-resource settings. While recent efforts, such as AfriCOMET, have addressed some of the issues, they are still constrained by small evaluation sets, a lack of publicly available training data tailored to African languages, and inconsistent performance in extremely low-resource scenarios. In this work, we introduce SSA-MTE, a large-scale human-annotated MT evaluation (MTE) dataset covering 13 African language pairs from the News domain, with over 63,000 sentence-level annotations from a diverse set of MT systems. Based on this data, we develop SSA-COMET and SSA-COMET-QE, improved reference-based and reference-free evaluation metrics. We also benchmark prompting-based approaches using state-of-the-art LLMs like GPT-4o and Claude. Our experimental results show that SSA-COMET models significantly outperform AfriCOMET and are competitive with the strongest LLM (Gemini 2.5 Pro) evaluated in our study, particularly on low-resource languages such as Twi, Luo, and Yoruba. All resources are released under open licenses to support future research.

[153] Demonstrations of Integrity Attacks in Multi-Agent Systems

Can Zheng,Yuhan Cao,Xiaoning Dong,Tianxing He

Main category: cs.CL

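TL;DR: Examines four types of attack that malicious agents can mount in multi-agent systems (MAS) through prompt manipulation, exposing the limits of current monitoring mechanisms.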

Details

Motivation: Study how malicious agents in multi-agent systems can bias system behavior through prompt manipulation, revealing gaps in existing security protocols. Method: Analyses four attack types (Scapegoater, Boaster, Self-Dealer, Free-Rider) and validates them experimentally. Result: Malicious agents successfully manipulate system behavior and bypass advanced LLM-based monitors such as GPT-4o-mini and o3-mini. Conclusion: More secure MAS architectures and monitoring systems are needed to counter such attacks. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding, code generation, and complex planning. Simultaneously, Multi-Agent Systems (MAS) have garnered attention for their potential to enable cooperation among distributed agents. However, from a multi-party perspective, MAS could be vulnerable to malicious agents that exploit the system to serve self-interests without disrupting its core functionality. This work explores integrity attacks where malicious agents employ subtle prompt manipulation to bias MAS operations and gain various benefits. Four types of attacks are examined: Scapegoater, who misleads the system monitor to underestimate other agents' contributions; Boaster, who misleads the system monitor to overestimate their own performance; Self-Dealer, who manipulates other agents to adopt certain tools; and Free-Rider, who hands off its own task to others. We demonstrate that strategically crafted prompts can introduce systematic biases in MAS behavior and executable instructions, enabling malicious agents to effectively mislead evaluation systems and manipulate collaborative agents. Furthermore, our attacks can bypass advanced LLM-based monitors, such as GPT-4o-mini and o3-mini, highlighting the limitations of current detection mechanisms. Our findings underscore the critical need for MAS architectures with robust security protocols and content validation mechanisms, alongside monitoring systems capable of comprehensive risk scenario assessment.

[154] Reasoning or Overthinking: Evaluating Large Language Models on Financial Sentiment Analysis

Dimitris Vamvourellis,Dhagash Mehta

Main category: cs.CL

TL;DR: In zero-shot financial sentiment analysis, reasoning does not improve LLM performance; GPT-4o without any reasoning prompt performs best.

Details

Motivation: Assess how effective LLMs are at financial sentiment analysis, in particular how reasoning and non-reasoning models differ. Method: Uses the Financial PhraseBank dataset to compare LLMs and prompting strategies that simulate System 1 or System 2 thinking, benchmarked against smaller models fine-tuned on financial sentiment. Result: Reasoning, whether prompted or built into the model, does not improve performance; GPT-4o without Chain-of-Thought prompting is the most accurate. Linguistic complexity and annotation agreement affect performance. Conclusion: For financial sentiment classification, fast and intuitive "System 1" thinking aligns more closely with human judgment than slow reasoning, challenging the assumption that more reasoning always improves LLM decisions. Abstract: We investigate the effectiveness of large language models (LLMs), including reasoning-based and non-reasoning models, in performing zero-shot financial sentiment analysis. Using the Financial PhraseBank dataset annotated by domain experts, we evaluate how various LLMs and prompting strategies align with human-labeled sentiment in a financial context. We compare three proprietary LLMs (GPT-4o, GPT-4.1, o3-mini) under different prompting paradigms that simulate System 1 (fast and intuitive) or System 2 (slow and deliberate) thinking and benchmark them against two smaller models (FinBERT-Prosus, FinBERT-Tone) fine-tuned on financial sentiment analysis. Our findings suggest that reasoning, either through prompting or inherent model design, does not improve performance on this task. Surprisingly, the most accurate and human-aligned combination of model and method was GPT-4o without any Chain-of-Thought (CoT) prompting. We further explore how performance is impacted by linguistic complexity and annotation agreement levels, uncovering that reasoning may introduce overthinking, leading to suboptimal predictions. This suggests that for financial sentiment classification, fast, intuitive "System 1"-like thinking aligns more closely with human judgment compared to "System 2"-style slower, deliberative reasoning simulated by reasoning models or CoT prompting. Our results challenge the default assumption that more reasoning always leads to better LLM decisions, particularly in high-stakes financial applications.

[155] Are LLMs Reliable Translators of Logical Reasoning Across Lexically Diversified Contexts?

Qingchuan Li,Jiatong Li,Zirui Liu,Mingyue Cheng,Yuting Zeng,Qi Liu,Tongxuan Liu

Main category: cs.CL

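TL;DR: Proposes the SCALe benchmark and the MenTaL method to address LLMs' weakness in handling lexical diversification when translating text into formal logic.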

Details

Motivation: LLMs are unreliable logic translators when the text is lexically diversified, and existing benchmarks lack lexical diversity. Method: Proposes SCALe, which evaluates LLMs through logic-invariant lexical diversification, and MenTaL, which has the model build a table unifying diverse expressions before translating. Result: Experiments confirm that LLMs are deficient at translating lexically diversified text and that MenTaL significantly improves performance. Conclusion: SCALe and MenTaL fill the gap in handling lexical diversification in LLM-based logic translation. Abstract: Neuro-symbolic approaches combining large language models (LLMs) with solvers excel at logical reasoning problems that need long reasoning chains. In this paradigm, LLMs serve as translators, converting natural language reasoning problems into formal logic formulas. Then reliable symbolic solvers return correct solutions. Despite their success, we find that LLMs, as translators, struggle to handle lexical diversification, a common linguistic phenomenon, indicating that LLMs as logic translators are unreliable in real-world scenarios. Moreover, existing logical reasoning benchmarks lack lexical diversity, failing to challenge LLMs' ability to translate such text and thus obscuring this issue. In this work, we propose SCALe, a benchmark designed to address this significant gap through logic-invariant lexical diversification. By using LLMs to transform original benchmark datasets into lexically diversified but logically equivalent versions, we evaluate LLMs' ability to consistently map diverse expressions to uniform logical symbols on these new datasets. Experiments using SCALe further confirm that current LLMs exhibit deficiencies in this capability. Building directly on the deficiencies identified through our benchmark, we propose a new method, MenTaL, to address this limitation. This method guides LLMs to first construct a table unifying diverse expressions before performing translation. Applying MenTaL through in-context learning and supervised fine-tuning (SFT) significantly improves the performance of LLM translators on lexically diversified text. Our code is now available at https://github.com/wufeiwuwoshihua/LexicalDiver.

[156] Selecting Demonstrations for Many-Shot In-Context Learning via Gradient Matching

Jianfei Zhang,Bei Li,Jun Bai,Rumei Li,Yanmeng Wang,Chenghua Lin,Wenge Rong

Main category: cs.CL

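TL;DR: Proposes a gradient-matching method for selecting demonstrations that improves LLM performance in many-shot in-context learning (ICL), replacing the usual random selection.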

Details

Motivation: ICL depends on demonstration selection, yet existing many-shot work uses only random selection, which does not meet the needs of many-shot scenarios. The authors hypothesize that in-context learning and fine-tuning have analogous data requirements and propose a new selection method on that basis. Method: Selects demonstrations by gradient matching, aligning the fine-tuning gradients of the selected examples with those of the target task's entire training set so that the selection approximates the learning effect of the full set. Result: From 4-shot to 128-shot, the method consistently beats random selection across 9 datasets, for example by 4% on Qwen2.5-72B and Llama3-70B and by around 2% on 5 closed-source LLMs. Conclusion: The method makes many-shot ICL more reliable and effective and supports its broader application. Abstract: In-Context Learning (ICL) empowers Large Language Models (LLMs) for rapid task adaptation without Fine-Tuning (FT), but its reliance on demonstration selection remains a critical challenge. While many-shot ICL shows promising performance through scaled demonstrations, the selection method for many-shot demonstrations remains limited to random selection in existing work. Since the conventional instance-level retrieval is not suitable for many-shot scenarios, we hypothesize that the data requirements for in-context learning and fine-tuning are analogous. To this end, we introduce a novel gradient matching approach that selects demonstrations by aligning fine-tuning gradients between the entire training set of the target task and the selected examples, so as to approach the learning effect on the entire training set within the selected examples. Through gradient matching on relatively small models, e.g., Qwen2.5-3B or Llama3-8B, our method consistently outperforms random selection on larger LLMs from 4-shot to 128-shot scenarios across 9 diverse datasets. For instance, it surpasses random selection by 4% on Qwen2.5-72B and Llama3-70B, and by around 2% on 5 closed-source LLMs. This work unlocks more reliable and effective many-shot ICL, paving the way for its broader application.
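
One way to picture gradient matching is as picking a subset whose mean gradient stays close to the mean gradient of the whole training set. The Python sketch below does this greedily over precomputed per-example gradient vectors. The greedy procedure, the squared Euclidean distance, and the function names are all assumptions made for illustration; the abstract does not describe the paper's actual selection algorithm.

```python
from typing import List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_by_gradient_matching(gradients: Sequence[Sequence[float]],
                                k: int) -> List[int]:
    """Greedily pick k example indices whose mean gradient tracks the full-set mean."""
    if not 1 <= k <= len(gradients):
        raise ValueError("k must be between 1 and the number of examples")
    target = mean_vector(gradients)
    dim = len(target)
    chosen: List[int] = []
    running_sum = [0.0] * dim
    remaining = set(range(len(gradients)))
    while len(chosen) < k:
        size = len(chosen) + 1

        def gap(i: int) -> float:
            # Distance to the target if example i were added to the selection.
            candidate_mean = [(running_sum[d] + gradients[i][d]) / size
                              for d in range(dim)]
            return squared_distance(candidate_mean, target)

        best = min(sorted(remaining), key=gap)  # sorted() makes ties deterministic
        chosen.append(best)
        remaining.remove(best)
        running_sum = [running_sum[d] + gradients[best][d] for d in range(dim)]
    return chosen
```

In practice the gradient vectors would come from a smaller proxy model, as the abstract suggests, and would likely be projected to a low dimension before matching; both steps are left out of this sketch.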

[157] SUCEA: Reasoning-Intensive Retrieval for Adversarial Fact-checking through Claim Decomposition and Editing

Hongjun Liu,Yilun Zhao,Arman Cohan,Chen Zhao

Main category: cs.CL

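TL;DR: Proposes SUCEA, a training-free method that decomposes and rewrites adversarial claims to improve retrieval and label prediction accuracy in fact-checking systems.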

Details

Motivation: Adversarial claims challenge existing fact-checking systems, which calls for a more effective way to handle them. Method: The SUCEA framework has three steps: 1) claim segmentation and decontextualization; 2) iterative evidence retrieval and claim editing; 3) evidence aggregation and label prediction. Result: On two datasets, it significantly improves retrieval and label prediction accuracy and outperforms four baselines. Conclusion: SUCEA effectively handles fact-checking of adversarial claims without any additional training. Abstract: Automatic fact-checking has recently received more attention as a means of combating misinformation. Despite significant advancements, fact-checking systems based on retrieval-augmented language models still struggle to tackle adversarial claims, which are intentionally designed by humans to challenge fact-checking systems. To address these challenges, we propose a training-free method designed to rephrase the original claim, making it easier to locate supporting evidence. Our modular framework, SUCEA, decomposes the task into three steps: 1) Claim Segmentation and Decontextualization that segments adversarial claims into independent sub-claims; 2) Iterative Evidence Retrieval and Claim Editing that iteratively retrieves evidence and edits the subclaim based on the retrieved evidence; 3) Evidence Aggregation and Label Prediction that aggregates all retrieved evidence and predicts the entailment label. Experiments on two challenging fact-checking datasets demonstrate that our framework significantly improves on both retrieval and entailment label accuracy, outperforming four strong claim-decomposition-based baselines.

[158] MuSciClaims: Multimodal Scientific Claim Verification

Yash Kumar Lal,Manikanta Bandham,Mohammad Saqib Hasan,Apoorva Kashi,Mahnaz Koupaee,Niranjan Balasubramanian

Main category: cs.CL

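TL;DR: Introduces MuSciClaims, a new benchmark for claim verification over multimodal data in scientific literature, and shows where current vision-language models fall short on the task.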

Details

Motivation: Existing multimodal tasks such as scientific QA and figure captioning do not directly test claim verification, leaving a gap to fill. Method: Automatically extracts supported claims from scientific articles, manually perturbs them into contradicted claims, and designs diagnostic tasks to analyse model failures. Result: Most vision-language models perform poorly (about 0.3-0.5 F1), the best reaches only 0.77 F1, and models are biased towards judging claims as supported. Conclusion: Models are weak at localizing evidence, aggregating information across modalities, and understanding basic figure components, and need further improvement. Abstract: Assessing scientific claims requires identifying, extracting, and reasoning with multimodal data expressed in information-rich figures in scientific literature. Despite the large body of work in scientific QA, figure captioning, and other multimodal reasoning tasks over chart-based data, there are no readily usable multimodal benchmarks that directly test claim verification abilities. To remedy this gap, we introduce a new benchmark MuSciClaims accompanied by diagnostic tasks. We automatically extract supported claims from scientific articles, which we manually perturb to produce contradicted claims. The perturbations are designed to test for a specific set of claim verification capabilities. We also introduce a suite of diagnostic tasks that help understand model failures. Our results show most vision-language models are poor (~0.3-0.5 F1), with even the best model only achieving 0.77 F1. They are also biased towards judging claims as supported, likely misunderstanding nuanced perturbations within the claims. Our diagnostics show models are bad at localizing correct evidence within figures, struggle with aggregating information across modalities, and often fail to understand basic components of the figure.

[159] LESS: Large Language Model Enhanced Semi-Supervised Learning for Speech Foundational Models

Wen Ding,Fan Qian

Main category: cs.CL

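TL;DR: The LESS framework uses large language models (LLMs) to refine pseudo labels in semi-supervised learning and adds a data filtering strategy to improve efficiency, significantly reducing WER and improving BLEU scores on ASR and AST tasks.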

Details

Motivation: Address the low quality of pseudo labels in semi-supervised learning by using LLM knowledge to improve speech processing performance. Method: An LLM corrects the pseudo labels produced by ASR/AST, combined with a data filtering strategy that optimizes knowledge transfer efficiency. Result: On Mandarin ASR, WER drops by an absolute 3.77%; on Spanish-to-English AST, BLEU reaches 34.0 and 64.7 on the two test sets. Conclusion: LESS adapts well and improves performance across languages, tasks, and domains. Abstract: We introduce LESS (Large Language Model Enhanced Semi-supervised Learning), a versatile framework that leverages Large Language Models (LLMs) to correct pseudo labels generated from in-the-wild data. Within the LESS framework, pseudo-labeled text from Automatic Speech Recognition (ASR) or Automatic Speech Translation (AST) of the unsupervised data is refined by an LLM, and augmented by a data filtering strategy to optimize LLM knowledge transfer efficiency. Experiments on both Mandarin ASR and Spanish-to-English AST tasks show that LESS achieves a notable absolute WER reduction of 3.77% on the Wenet Speech test set, as well as BLEU scores of 34.0 and 64.7 on Callhome and Fisher test sets respectively. These results validate the adaptability of LESS across different languages, tasks, and domains. Ablation studies conducted with various LLMs and prompt configurations provide novel insights into leveraging LLM-derived knowledge for speech processing applications.

[160] Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification

Chengwu Liu,Ye Yuan,Yichun Yin,Yan Xu,Xin Xu,Zaoyu Chen,Yasheng Wang,Lifeng Shang,Qun Liu,Ming Zhang

Main category: cs.CL

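TL;DR: The paper proposes $Safe$, a framework that uses the formal mathematical language Lean 4 to verify the correctness of LLM reasoning steps and thereby reduce hallucinations.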

Details

Motivation: Current CoT prompting methods lack checkable evidence, which makes hallucinations hard to detect, so a transparent and verifiable solution is needed. Method: $Safe$ is a retrospective, step-aware formal verification framework that translates each reasoning step into a formal statement in Lean 4 and provides a proof. Result: $Safe$ is validated across multiple LLMs and mathematical datasets, with significant performance gains and interpretable evidence. Conclusion: $Safe$ is the first framework to use Lean 4 to verify LLM-generated content, offering a verifiable approach to the hallucination problem. Abstract: Chain-of-Thought (CoT) prompting has become the de facto method to elicit reasoning capabilities from large language models (LLMs). However, to mitigate hallucinations in CoT that are notoriously difficult to detect, current methods such as process reward models (PRMs) or self-consistency operate as opaque boxes and do not provide checkable evidence for their judgments, possibly limiting their effectiveness. To address this issue, we draw inspiration from the idea that "the gold standard for supporting a mathematical claim is to provide a proof". We propose a retrospective, step-aware formal verification framework $Safe$. Rather than assigning arbitrary scores, we strive to articulate mathematical claims in formal mathematical language Lean 4 at each reasoning step and provide formal proofs to identify hallucinations. We evaluate our framework $Safe$ across multiple language models and various mathematical datasets, demonstrating a significant performance improvement while offering interpretable and verifiable evidence. We also propose $FormalStep$ as a benchmark for step correctness theorem proving with $30,809$ formal statements. To the best of our knowledge, our work represents the first endeavor to utilize formal mathematical language Lean 4 for verifying natural language content generated by LLMs, aligning with the reason why formal mathematical languages were created in the first place: to provide a robust foundation for hallucination-prone human-written proofs.

[161] A MISMATCHED Benchmark for Scientific Natural Language Inference

Firoz Shaik,Mobashir Sadat,Nikita Gautam,Doina Caragea,Cornelia Caragea

Main category: cs.CL

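TL;DR: The paper introduces MISMATCHED, a new benchmark for scientific natural language inference (NLI) that covers three non-computer-science domains (psychology, engineering, and public health) and contains 2,700 human-annotated sentence pairs. Strong baselines are established with pre-trained small language models (SLMs) and large language models (LLMs); the best baseline reaches a Macro F1 of 78.17%, leaving substantial room for improvement. In addition, including sentence pairs with an implicit scientific NLI relation in training improves model performance.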

Details

Motivation: Existing scientific NLI datasets cover only computer science, and non-CS domains are ignored entirely; this gap needs to be filled. Method: Build the MISMATCHED benchmark with 2,700 annotated sentence pairs from three non-CS domains, establish baselines with SLMs and LLMs, and study the effect of adding sentence pairs with an implicit scientific NLI relation to training. Result: The best baseline reaches a Macro F1 of 78.17%, indicating substantial headroom; adding implicit-relation sentence pairs improves model performance. Conclusion: MISMATCHED fills the gap in scientific NLI for non-CS domains; future work can improve performance further through better models and the use of implicit relations. Abstract: Scientific Natural Language Inference (NLI) is the task of predicting the semantic relation between a pair of sentences extracted from research articles. Existing datasets for this task are derived from various computer science (CS) domains, whereas non-CS domains are completely ignored. In this paper, we introduce a novel evaluation benchmark for scientific NLI, called MISMATCHED. The new MISMATCHED benchmark covers three non-CS domains (PSYCHOLOGY, ENGINEERING, and PUBLIC HEALTH) and contains 2,700 human annotated sentence pairs. We establish strong baselines on MISMATCHED using both Pre-trained Small Language Models (SLMs) and Large Language Models (LLMs). Our best performing baseline shows a Macro F1 of only 78.17% illustrating the substantial headroom for future improvements. In addition to introducing the MISMATCHED benchmark, we show that incorporating sentence pairs having an implicit scientific NLI relation between them in model training improves their performance on scientific NLI. We make our dataset and code publicly available on GitHub.
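
The headline number in this entry is a Macro F1 score. As a reminder of what that metric measures, here is a minimal sketch that computes per-class F1 and averages it with equal weight per class. The label names in the usage note are hypothetical placeholders, not the benchmark's actual label set, and the function is not the authors' evaluation script.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)
```

For example, `macro_f1(["entail", "entail", "neutral", "neutral"], ["entail", "neutral", "neutral", "neutral"])` averages an F1 of 2/3 for the first class and 4/5 for the second. Because every class counts equally, a rare class that the model handles badly pulls the score down as much as a frequent one.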

[162] Revisiting Test-Time Scaling: A Survey and a Diversity-Aware Method for Efficient Reasoning

Ho-Lam Chung,Teng-Yun Hsiao,Hsiao-Ying Huang,Chunerh Cho,Jian-Ren Lin,Zhang Ziwei,Yun-Nung Chen

Main category: cs.CL

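TL;DR: Test-Time Scaling (TTS) improves the reasoning performance of large language models (LLMs) by allocating more compute at inference. The ADAPT method applies diversity-oriented prefix fine-tuning, which substantially reduces the compute required while improving accuracy.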

Details

Motivation: Reasoning-optimized models produce outputs with too little diversity, which limits the effectiveness of TTS, so a lightweight way to increase diversity is needed. Method: Propose ADAPT (A Diversity Aware Prefix fine-Tuning), which combines prefix tuning with a diversity-focused data strategy. Result: On mathematical reasoning tasks, ADAPT reaches 80% accuracy with eight times less compute. Conclusion: Generative diversity is essential for maximizing TTS effectiveness. Abstract: Test-Time Scaling (TTS) improves the reasoning performance of Large Language Models (LLMs) by allocating additional compute during inference. We conduct a structured survey of TTS methods and categorize them into sampling-based, search-based, and trajectory optimization strategies. We observe that reasoning-optimized models often produce less diverse outputs, which limits TTS effectiveness. To address this, we propose ADAPT (A Diversity Aware Prefix fine-Tuning), a lightweight method that applies prefix tuning with a diversity-focused data strategy. Experiments on mathematical reasoning tasks show that ADAPT reaches 80% accuracy using eight times less compute than strong baselines. Our findings highlight the essential role of generative diversity in maximizing TTS effectiveness.

[163] Subjective Perspectives within Learned Representations Predict High-Impact Innovation

Likun Cao,Rui Pan,James Evans

Main category: cs.CL

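TL;DR: The paper uses machine learning to model innovators' subjective perspectives and interpersonal innovation opportunities, and finds that perspective diversity predicts innovative outcomes better than background diversity.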

Details

Motivation: Existing studies emphasize how social structures shape innovation capacity, but machine learning methods can model innovators' personal perspectives and opportunities for interaction in finer detail. Method: Construct a geometric space of concepts from dynamic language representations, quantify subjective perspectives and innovation opportunities within it, and analyze data on scientists, inventors, and other groups. Result: Perspective diversity, not background diversity, significantly predicts future innovative achievement; a natural experiment and AI simulations are consistent with this finding. Conclusion: Successful creative collaboration depends on common language and diverse perspectives, with important implications for team assembly and research policy. Abstract: Existing studies of innovation emphasize the power of social structures to shape innovation capacity. Emerging machine learning approaches, however, enable us to model innovators' personal perspectives and interpersonal innovation opportunities as a function of their prior trajectories of experience. We theorize then quantify subjective perspectives and innovation opportunities based on innovator positions within the geometric space of concepts inscribed by dynamic language representations. Using data on millions of scientists, inventors, writers, entrepreneurs, and Wikipedia contributors across the creative domains of science, technology, film, entrepreneurship, and Wikipedia, here we show that measured subjective perspectives anticipate what ideas individuals and groups creatively attend to and successfully combine in future. When perspective and background diversity are decomposed as the angular difference between collaborators' perspectives on their creation and between their experiences, the former consistently anticipates creative achievement while the latter portends its opposite, across all cases and time periods examined. We analyze a natural experiment and simulate creative collaborations between AI (large language model) agents designed with various perspective and background diversity, which are consistent with our observational findings. We explore mechanisms underlying these findings and identify how successful collaborators leverage common language to weave together diverse experience obtained through trajectories of prior work that converge to provoke one another and innovate. We explore the importance of these findings for team assembly and research policy.

[164] Static Word Embeddings for Sentence Semantic Representation

Takashi Wada,Yuki Hirakawa,Ryotaro Shimizu,Takahiro Kawashima,Yuki Saito

Main category: cs.CL

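TL;DR: The paper proposes optimized static word embeddings that refine pre-trained word embeddings with sentence-level principal component analysis followed by knowledge distillation or contrastive learning, yielding sentence semantic representations at low cost.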

Details

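Motivation: Improve existing static word embedding methods so that they suit sentence semantic representation tasks better while keeping computational cost low. Method: Extract word embeddings from a pre-trained Sentence Transformer, improve them with sentence-level principal component analysis, then apply knowledge distillation or contrastive learning. At inference, a sentence is represented by simply averaging its word embeddings. Result: The model substantially outperforms existing static models on monolingual and cross-lingual tasks, and on some data sets even rivals a basic Sentence Transformer model (SimCSE). Conclusion: The method removes word embedding components that are irrelevant to sentence semantics and adjusts vector norms according to each word's influence on sentence semantics. Abstract: We propose new static word embeddings optimised for sentence semantic representation. We first extract word embeddings from a pre-trained Sentence Transformer, and improve them with sentence-level principal component analysis, followed by either knowledge distillation or contrastive learning. During inference, we represent sentences by simply averaging word embeddings, which requires little computational cost. We evaluate models on both monolingual and cross-lingual tasks and show that our model substantially outperforms existing static models on sentence semantic tasks, and even rivals a basic Sentence Transformer model (SimCSE) on some data sets. Lastly, we perform a variety of analyses and show that our method successfully removes word embedding components that are irrelevant to sentence semantics, and adjusts the vector norms based on the influence of words on sentence semantics.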
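
The inference step described in this abstract, representing a sentence by averaging its word embeddings, is simple enough to sketch directly. The snippet below is a minimal illustration only: the two-dimensional toy embedding table in the usage note is hypothetical, and the PCA, distillation, and contrastive training stages from the paper are not reproduced here.

```python
import math

def sentence_vector(tokens, embeddings):
    """Average the static vectors of known tokens; return None if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

With a toy table such as `{"cat": [1.0, 0.0], "dog": [0.0, 1.0]}`, `sentence_vector(["cat", "dog"], table)` gives `[0.5, 0.5]`. Since averaging is a lookup plus a mean, the cost at inference is tiny compared with a Transformer forward pass, which is the efficiency argument the abstract makes.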

[165] Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning

Zhiyuan Ma,Jiayu Liu,Xianzhen Luo,Zhenya Huang,Qingfu Zhu,Wanxiang Che

Main category: cs.CL

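TL;DR: Tool-MVR uses multi-agent meta-verification and exploration-based reflection learning to address the weaknesses of large language models in tool planning and reflection, markedly improving performance and generalization.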

Details

Motivation: Current large language models are weak at tool planning and reflection, with unreliable tool invocation and ineffective error correction. Method: Propose Tool-MVR, which combines Multi-Agent Meta-Verification (MAMV) to build the high-quality dataset ToolBench-V with Exploration-based Reflection Learning (EXPLORE) to generate the reflection dataset ToolBench-R. Result: On StableToolBench, Tool-MVR surpasses ToolLLM and GPT-4 while reducing API calls by 31.4%; on RefineToolBench it reaches a 58.9% error correction rate. Conclusion: Through systematic verification and dynamic learning, Tool-MVR markedly improves tool use and provides more reliable AI agents for complex problem solving. Abstract: Empowering large language models (LLMs) with effective tool utilization capabilities is crucial for enabling AI agents to solve complex problems. However, current models face two major limitations: (1) unreliable tool planning and invocation due to low-quality instruction datasets (e.g., widespread hallucinated API calls), and (2) weak tool reflection abilities (over 90% of errors cannot be corrected) resulting from static imitation learning. To address these critical limitations, we propose Tool-MVR, a novel Tool-Augmented LLM that achieves comprehensive System 2 reasoning through two key innovations. Specifically, we first introduce Multi-Agent Meta-Verification (MAMV), a systematic pipeline that rigorously validates APIs, queries, and reasoning trajectories to construct ToolBench-V, a new high-quality instruction dataset that addresses the limitation of unreliable tool planning and invocation. Second, we propose Exploration-based Reflection Learning (EXPLORE), which enhances tool reflection capabilities by leveraging tool feedback through a dynamic "Error -> Reflection -> Correction" learning paradigm, resulting in our reflection dataset ToolBench-R and addressing the critical weakness in tool reflection. Finally, we obtain Tool-MVR by finetuning open-source LLMs (e.g., Qwen-7B) on both ToolBench-V and ToolBench-R. Our experiments demonstrate that Tool-MVR achieves state-of-the-art performance on StableToolBench, surpassing both ToolLLM (by 23.9%) and GPT-4 (by 15.3%) while reducing API calls by 31.4%, with strong generalization capabilities across unseen tools and scenarios. Additionally, on our proposed RefineToolBench, the first benchmark specifically designed to evaluate tool reflection capabilities, Tool-MVR achieves a 58.9% error correction rate, significantly outperforming ToolLLM's 9.1%.

Thai-Binh Nguyen,Thi Van Nguyen,Quoc Truong Do,Chi Mai Luong

Main category: cs.CL

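TL;DR: The paper presents an efficient method for generating AVSR datasets from raw video and demonstrates it on Vietnamese, with significantly better performance in noisy environments.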

Details

Motivation: AVSR models are held back by the scarcity of datasets, especially for non-English languages; automated data collection can widen their reach. Method: Propose a refined automated data collection method that generates AVSR datasets from raw video, and develop a baseline model for Vietnamese. Result: The automatically collected dataset supports a strong baseline that is competitive in clean conditions and significantly outperforms conventional ASR in noisy environments. Conclusion: The method offers a practical path for extending AVSR to more languages, particularly under-resourced ones. Abstract: Audio-Visual Speech Recognition (AVSR) has gained significant attention recently due to its robustness against noise, which often challenges conventional speech recognition systems that rely solely on audio features. Despite this advantage, AVSR models remain limited by the scarcity of extensive datasets, especially for most languages beyond English. Automated data collection offers a promising solution. This work presents a practical approach to generate AVSR datasets from raw video, refining existing techniques for improved efficiency and accessibility. We demonstrate its broad applicability by developing a baseline AVSR model for Vietnamese. Experiments show the automatically collected dataset enables a strong baseline, achieving competitive performance with robust ASR in clean conditions and significantly outperforming it in noisy environments like cocktail parties. This efficient method provides a pathway to expand AVSR to more languages, particularly under-resourced ones.

[167] TaDA: Training-free recipe for Decoding with Adaptive KV Cache Compression and Mean-centering

Vinay Joshi,Pratik Prabhanjan Brahma,Zicheng Liu,Emad Barsoum

Main category: cs.CL

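TL;DR: TaDA is a training-free KV cache compression method that uses adaptive quantization and mean-centering to remove the need for separate outlier handling, substantially reducing memory use while preserving model accuracy.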

Details

Motivation: The memory demand of the KV cache in Transformer models grows sharply with sequence length, which limits scalable deployment of large language models. Method: Propose TaDA, which adapts quantization precision and applies mean-centering so that outliers need not be managed separately. Result: Experiments show that TaDA reduces the KV cache memory footprint to 27% of the original 16-bit baseline while maintaining comparable accuracy. Conclusion: TaDA paves the way for scalable, high-performance inference in language models, supporting longer contexts and longer reasoning chains. Abstract: The key-value (KV) cache in transformer models is a critical component for efficient decoding or inference, yet its memory demands scale poorly with sequence length, posing a major challenge for scalable deployment of large language models. Among several approaches to KV cache compression, quantization of key and value activations has been widely explored. Most KV cache quantization methods still need to manage sparse and noncontiguous outliers separately. To address this, we introduce TaDA, a training-free recipe for KV cache compression with quantization precision that adapts to error sensitivity across layers and a mean centering to eliminate separate outlier handling. Our approach yields substantial accuracy improvements for multiple models supporting various context lengths. Moreover, our approach does not need to separately manage outlier elements -- a persistent hurdle in most traditional quantization methods. Experiments on standard benchmarks demonstrate that our technique reduces KV cache memory footprint to 27% of the original 16-bit baseline while achieving comparable accuracy. Our method paves the way for scalable and high-performance reasoning in language models by potentially enabling inference for longer context length models, reasoning models, and longer chain of thoughts.
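
To see why mean-centering can help low-bit quantization, consider the toy sketch below. It is an illustration under assumptions, not TaDA itself: the abstract does not say along which axis the mean is taken or how per-layer precision is chosen, so this example simply centers one small vector and applies symmetric uniform quantization. The function names and the 4-bit setting are hypothetical.

```python
def quantize(values, bits=4, center=True):
    """Symmetric uniform quantization, optionally after subtracting the mean."""
    mean = sum(values) / len(values) if center else 0.0
    centered = [v - mean for v in values]
    max_abs = max(abs(c) for c in centered)
    levels = 2 ** (bits - 1) - 1          # e.g. 7 positive levels for 4 bits
    scale = max_abs / levels if max_abs > 0 else 1.0
    codes = [round(c / scale) for c in centered]
    return codes, scale, mean

def dequantize(codes, scale, mean):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale + mean for c in codes]

def max_abs_error(values, bits=4, center=True):
    """Largest reconstruction error after a quantize/dequantize round trip."""
    codes, scale, mean = quantize(values, bits=bits, center=center)
    restored = dequantize(codes, scale, mean)
    return max(abs(a - b) for a, b in zip(values, restored))
```

For values clustered around a large offset, such as `[10.0, 10.1, 9.9, 10.2]`, the uncentered quantizer spends its whole range on the offset and collapses every value onto the same code, while the centered one spends the range on the small deviations. The trade-off is that the mean must be stored alongside the scale.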

[168] Flex-TravelPlanner: A Benchmark for Flexible Planning with Language Agents

Juhyun Oh,Eunsu Kim,Alice Oh

Main category: cs.CL

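TL;DR: Flex-TravelPlanner is a new benchmark that evaluates how flexibly language models reason in dynamic planning scenarios; it finds that existing models handle multi-turn constraint adaptation and constraint prioritization poorly.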

Details

Motivation: Existing benchmarks focus mainly on static, single-turn planning scenarios and do not reflect the dynamic, multi-constraint planning needed in the real world. Method: Building on the TravelPlanner dataset, introduce two new evaluation settings: sequential constraint introduction across multiple turns, and competing constraints with explicit priorities. Result: Single-turn performance does not predict multi-turn adaptation; the order in which constraints are introduced significantly affects performance; models struggle to handle constraint priorities correctly. Conclusion: The work underlines the importance of evaluating LLMs in more realistic, dynamic planning scenarios and points to specific directions for improving model performance. Abstract: Real-world planning problems require constant adaptation to changing requirements and balancing of competing constraints. However, current benchmarks for evaluating LLMs' planning capabilities primarily focus on static, single-turn scenarios. We introduce Flex-TravelPlanner, a benchmark that evaluates language models' ability to reason flexibly in dynamic planning scenarios. Building on the TravelPlanner dataset (Xie et al., 2024), we introduce two novel evaluation settings: (1) sequential constraint introduction across multiple turns, and (2) scenarios with explicitly prioritized competing constraints. Our analysis of GPT-4o and Llama 3.1 70B reveals several key findings: models' performance on single-turn tasks poorly predicts their ability to adapt plans across multiple turns; constraint introduction order significantly affects performance; and models struggle with constraint prioritization, often incorrectly favoring newly introduced lower priority preferences over existing higher-priority constraints. These findings highlight the importance of evaluating LLMs in more realistic, dynamic planning scenarios and suggest specific directions for improving model performance on complex planning tasks. The code and dataset for our framework are publicly available at https://github.com/juhyunohh/FlexTravelBench.

[169] Normative Conflicts and Shallow AI Alignment

Raphaël Millière

Main category: cs.CL

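TL;DR: The paper examines the value alignment problem for large language models (LLMs), argues that current alignment strategies cannot effectively prevent misuse, and identifies their fundamental limitation.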

Details

Motivation: As AI systems such as LLMs advance, their safe deployment becomes a more pressing concern, and inadequate value alignment in particular could lead to harm. Method: Analyze the limitations of existing alignment strategies and, drawing on research in moral psychology, compare the normative reasoning capacities of humans and LLMs. Result: LLMs lack a deep capacity for normative reasoning and are vulnerable to adversarial attacks, a problem that existing methods do not solve. Conclusion: Current alignment methods are insufficient for the risks posed by increasingly capable AI systems, and deeper alignment strategies need to be explored. Abstract: The progress of AI systems such as large language models (LLMs) raises increasingly pressing concerns about their safe deployment. This paper examines the value alignment problem for LLMs, arguing that current alignment strategies are fundamentally inadequate to prevent misuse. Despite ongoing efforts to instill norms such as helpfulness, honesty, and harmlessness in LLMs through fine-tuning based on human preferences, they remain vulnerable to adversarial attacks that exploit conflicts between these norms. I argue that this vulnerability reflects a fundamental limitation of existing alignment methods: they reinforce shallow behavioral dispositions rather than endowing LLMs with a genuine capacity for normative deliberation. Drawing on research in moral psychology, I show how humans' ability to engage in deliberative reasoning enhances their resilience against similar adversarial tactics. LLMs, by contrast, lack a robust capacity to detect and rationally resolve normative conflicts, leaving them susceptible to manipulation; even recent advances in reasoning-focused LLMs have not addressed this vulnerability. This "shallow alignment" problem carries significant implications for AI safety and regulation, suggesting that current approaches are insufficient for mitigating potential harms posed by increasingly capable AI systems.

Gio Paik,Geewook Kim,Jinbae Im

Main category: cs.CL

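TL;DR: MMRefine is a multimodal refinement benchmark that evaluates the error refinement capabilities of multimodal large language models (MLLMs) across six scenarios and six error types, revealing performance bottlenecks.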

Details

Motivation: As the emphasis shifts toward enhancing reasoning at inference time, MLLMs need to be evaluated on their ability to detect and correct errors, not only on final accuracy before and after refinement. Method: MMRefine evaluates refinement ability across six distinct scenarios and six error types, with experiments on open and closed MLLMs. Result: Experiments reveal bottlenecks in refinement performance and room for improvement, particularly in effective reasoning enhancement. Conclusion: MMRefine provides an evaluation framework for the error refinement capabilities of MLLMs and points to directions for future improvement. Abstract: This paper introduces MMRefine, a MultiModal Refinement benchmark designed to evaluate the error refinement capabilities of Multimodal Large Language Models (MLLMs). As the emphasis shifts toward enhancing reasoning during inference, MMRefine provides a framework that evaluates MLLMs' abilities to detect and correct errors across six distinct scenarios beyond just comparing final accuracy before and after refinement. Furthermore, the benchmark analyzes the refinement performance by categorizing errors into six error types. Experiments with various open and closed MLLMs reveal bottlenecks and factors impeding refinement performance, highlighting areas for improvement in effective reasoning enhancement. Our code and dataset are publicly available at https://github.com/naver-ai/MMRefine.

[171] Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models

Thao Nguyen,Yang Li,Olga Golovneva,Luke Zettlemoyer,Sewoong Oh,Ludwig Schmidt,Xian Li

Main category: cs.CL

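TL;DR: The paper proposes REWIRE, a method that rewrites low-quality web text to enrich pre-training data; experiments show that it significantly improves model performance.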

Details

Motivation: Pre-training data is limited in scale and high-quality text is especially scarce, so the paper explores how to make use of low-quality data that filtering discards. Method: Propose REWIRE, which rewrites filtered-out low-quality text into synthetic data and mixes it with high-quality raw data. Result: At the 1B, 3B, and 7B scales, the mixed data improves over filtered data alone by 1.0, 1.3, and 2.5 percentage points respectively, and is more effective than simply having more web data. Conclusion: Rewriting low-quality text is a simple and effective way to expand pre-training data, and REWIRE outperforms other approaches to synthetic data generation. Abstract: Scaling laws predict that the performance of large language models improves with increasing model size and data size. In practice, pre-training has been relying on massive web crawls, using almost all data sources publicly available on the internet so far. However, this pool of natural data does not grow at the same rate as the compute supply. Furthermore, the availability of high-quality texts is even more limited: data filtering pipelines often remove up to 99% of the initial web scrapes to achieve state-of-the-art. To address the "data wall" of pre-training scaling, our work explores ways to transform and recycle data discarded in existing filtering processes. We propose REWIRE, REcycling the Web with guIded REwrite, a method to enrich low-quality documents so that they could become useful for training. This in turn allows us to increase the representation of synthetic data in the final pre-training set. Experiments at 1B, 3B and 7B scales of the DCLM benchmark show that mixing high-quality raw texts and our rewritten texts leads to 1.0, 1.3 and 2.5 percentage point improvements respectively across 22 diverse tasks, compared to training on only filtered web data. Training on the raw-synthetic data mix is also more effective than having access to 2x web data. Through further analysis, we demonstrate that about 82% of the mixed-in texts come from transforming lower-quality documents that would otherwise be discarded. REWIRE also outperforms related approaches of generating synthetic data, including Wikipedia-style paraphrasing, question-answer synthesizing and knowledge extraction. These results suggest that recycling web texts holds the potential for being a simple and effective approach for scaling pre-training data.

[172] Cracking the Code: Enhancing Implicit Hate Speech Detection through Coding Classification

Lu Wei,Liangzhi Li,Tong Xiang,Xiao Liu,Noa Garcia

Main category: cs.CL

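TL;DR: The paper proposes a new taxonomy for implicit hate speech (im-HS) detection, improves detection through six encoding strategies (codetypes), and validates the approach on Chinese and English datasets.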

Details

Motivation: Hate speech (HS) on the internet threatens societal harmony and individual well-being, and existing methods detect implicit hate speech (im-HS) poorly. Method: Define six encoding strategies (codetypes) and use them in two ways: 1) prompting large language models (LLMs) directly to classify sentences; 2) embedding codetypes during the encoding process. Result: Experiments show that codetypes improve im-HS detection on both Chinese and English datasets. Conclusion: The method improves implicit hate speech detection and applies across languages. Abstract: The internet has become a hotspot for hate speech (HS), threatening societal harmony and individual well-being. While automatic detection methods perform well in identifying explicit hate speech (ex-HS), they struggle with more subtle forms, such as implicit hate speech (im-HS). We tackle this problem by introducing a new taxonomy for im-HS detection, defining six encoding strategies named codetypes. We present two methods for integrating codetypes into im-HS detection: 1) prompting large language models (LLMs) directly to classify sentences based on generated responses, and 2) using LLMs as encoders with codetypes embedded during the encoding process. Experiments show that the use of codetypes improves im-HS detection in both Chinese and English datasets, validating the effectiveness of our approach across different languages.

[173] Accelerated Test-Time Scaling with Model-Free Speculative Sampling

Woomin Song,Saket Dingliwal,Sai Muralidhar Jayanthi,Bhavana Ganesh,Jinwoo Shin,Aram Galstyan,Sravan Babu Bodapati

Main category: cs.CL

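TL;DR: STAND is a model-free speculative decoding method that exploits redundancy in reasoning trajectories to speed up inference significantly without losing accuracy.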

Details

Motivation: Test-time scaling techniques for reasoning tasks, such as best-of-N sampling and tree search, demand substantial compute, creating a trade-off between performance and efficiency. Method: STAND uses stochastic adaptive N-gram drafting, combined with Gumbel-Top-K sampling and data-driven tree construction, to predict tokens efficiently. Result: Across several reasoning tasks, STAND reduces inference latency by 60-65%, exceeds existing methods in throughput by 14-28%, and reduces latency by 48-58% in single-trajectory scenarios. Conclusion: STAND is a plug-and-play solution that needs no additional training, applies to any existing language model, and markedly improves inference efficiency. Abstract: Language models have demonstrated remarkable capabilities in reasoning tasks through test-time scaling techniques like best-of-N sampling and tree search. However, these approaches often demand substantial computational resources, creating a critical trade-off between performance and efficiency. We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach that leverages the inherent redundancy in reasoning trajectories to achieve significant acceleration without compromising accuracy. Our analysis reveals that reasoning paths frequently reuse similar reasoning patterns, enabling efficient model-free token prediction without requiring separate draft models. By introducing stochastic drafting and preserving probabilistic information through a memory-efficient logit-based N-gram module, combined with optimized Gumbel-Top-K sampling and data-driven tree construction, STAND significantly improves token acceptance rates. Extensive evaluations across multiple models and reasoning tasks (AIME-2024, GPQA-Diamond, and LiveCodeBench) demonstrate that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding while maintaining accuracy. Furthermore, STAND outperforms state-of-the-art speculative decoding methods by 14-28% in throughput and shows strong performance even in single-trajectory scenarios, reducing inference latency by 48-58%. As a model-free approach, STAND can be applied to any existing language model without additional training, being a powerful plug-and-play solution for accelerating language model reasoning.
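
The core idea of model-free drafting, proposing the next few tokens from n-grams already seen in earlier reasoning text instead of from a separate draft model, can be sketched in a few lines. This is a deliberately simplified illustration and not STAND: it drafts greedily from raw counts, whereas the abstract describes stochastic, logit-based drafting with Gumbel-Top-K sampling and tree construction, and it omits the verification step in which the target model accepts or rejects the drafted tokens. All function names are hypothetical.

```python
from collections import Counter, defaultdict

def build_ngram_table(tokens, n=2):
    """Map each n-token context to a count of the tokens that followed it."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - n):
        context = tuple(tokens[i:i + n])
        table[context][tokens[i + n]] += 1
    return table

def draft(table, prefix, n=2, max_draft=4):
    """Greedily propose up to max_draft tokens that continue the prefix."""
    drafted = []
    context = tuple(prefix[-n:])
    for _ in range(max_draft):
        if context not in table:
            break  # unseen context: stop drafting and fall back to the model
        next_token = table[context].most_common(1)[0][0]
        drafted.append(next_token)
        context = context[1:] + (next_token,)
    return drafted
```

If an earlier trajectory contained "so x equals 2" twice, a table built from it will propose `["equals", "2"]` after the prefix `["so", "x"]`. Drafts like this cost only a dictionary lookup, so any that the target model accepts are tokens produced without an extra forward pass per token.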

[174] IIITH-BUT system for IWSLT 2025 low-resource Bhojpuri to Hindi speech translation

Bhavana Akkiraju,Aishwarya Pothula,Santosh Kesiraju,Anil Kumar Vuppala

Main category: cs.CL

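TL;DR: The paper describes the IIITH-BUT submission to the IWSLT 2025 shared task on speech translation for the low-resource Bhojpuri-Hindi language pair, and studies how hyperparameter optimization and data augmentation affect the performance of the SeamlessM4T model.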

Details

Motivation: Explore whether hyperparameter optimization and data augmentation can improve speech translation for a low-resource language pair (Bhojpuri-Hindi). Method: Systematically study hyperparameters including learning rate schedules, number of update steps, warm-up steps, label smoothing, and batch size; apply speed perturbation and SpecAugment as data augmentation; and introduce a cross-lingual signal through joint training on Marathi and Bhojpuri speech data. Result: Careful hyperparameter selection and simple but effective augmentation significantly improve translation performance in the low-resource setting. The paper also analyzes the error types in translation hypotheses and their effect on BLEU. Conclusion: In low-resource speech translation, hyperparameter optimization and data augmentation are effective ways to improve model performance. Abstract: This paper presents the submission of IIITH-BUT to the IWSLT 2025 shared task on speech translation for the low-resource Bhojpuri-Hindi language pair. We explored the impact of hyperparameter optimisation and data augmentation techniques on the performance of the SeamlessM4T model fine-tuned for this specific task. We systematically investigated a range of hyperparameters including learning rate schedules, number of update steps, warm-up steps, label smoothing, and batch sizes; and report their effect on translation quality. To address data scarcity, we applied speed perturbation and SpecAugment and studied their effect on translation quality. We also examined the use of cross-lingual signal through joint training with Marathi and Bhojpuri speech data. Our experiments reveal that careful selection of hyperparameters and the application of simple yet effective augmentation techniques significantly improve performance in low-resource settings. We also analysed the translation hypotheses to understand various kinds of errors that impacted the translation quality in terms of BLEU.

[175] SPARTA ALIGNMENT: Collectively Aligning Multiple Language Models through Combat

Yuru Jiang,Wenxuan Ding,Shangbin Feng,Greg Durrett,Yulia Tsvetkov

Main category: cs.CL

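TL;DR: SPARTA ALIGNMENT is an algorithm that collectively aligns multiple LLMs through competition and combat, using contests and peer evaluation among models to increase generation diversity and reduce bias.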

Details

Motivation: A single LLM lacks diversity in generation and is biased in evaluation, so a collective competition mechanism is needed to improve model performance. Method: Multiple LLMs form a "sparta tribe", generate preference pairs through combat and peer evaluation, and weight each judge's evaluation with an Elo-ranking-based reputation system. Result: The method outperforms the initial models and 4 self-alignment baselines on 10 of 12 tasks and datasets, with a 7.0% average improvement. Conclusion: SPARTA ALIGNMENT improves model performance through collective competition and generalizes better to unseen tasks. Abstract: We propose SPARTA ALIGNMENT, an algorithm to collectively align multiple LLMs through competition and combat. To complement a single model's lack of diversity in generation and biases in evaluation, multiple LLMs form a "sparta tribe" to compete against each other in fulfilling instructions while serving as judges for the competition of others. For each iteration, one instruction and two models are selected for a duel, the other models evaluate the two responses, and their evaluation scores are aggregated through an adapted Elo-ranking-based reputation system, where winners/losers of combat gain/lose weight in evaluating others. The peer-evaluated combat results then become preference pairs where the winning response is preferred over the losing one, and all models learn from these preferences at the end of each iteration. SPARTA ALIGNMENT enables the self-evolution of multiple LLMs in an iterative and collective competition process. Extensive experiments demonstrate that SPARTA ALIGNMENT outperforms initial models and 4 self-alignment baselines across 10 out of 12 tasks and datasets with 7.0% average improvement. Further analysis reveals that SPARTA ALIGNMENT generalizes more effectively to unseen tasks and leverages the expertise diversity of participating models to produce more logical, direct and informative outputs.
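
The reputation system in this abstract is described as an adapted Elo ranking, without further detail. For orientation, here is a sketch of the standard Elo update that such a system would presumably start from. The adaptation used in the paper is not specified in the abstract, so this is the textbook rule rather than the authors' method; the K-factor of 32 and the 400-point scale are conventional defaults assumed for illustration.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(winner_rating, loser_rating, k=32):
    """Return the (winner, loser) ratings after one decisive duel."""
    surprise = 1.0 - expected_score(winner_rating, loser_rating)
    delta = k * surprise  # larger transfer when the win was less expected
    return winner_rating + delta, loser_rating - delta
```

Two equally rated models at 1000 end at 1016 and 984 after a duel. An upset win against a stronger model transfers more rating than an expected win does, so a judge's weight tracks how well it has actually performed rather than how often it has competed.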

[176] Lifelong Evolution: Collaborative Learning between Large and Small Language Models for Continuous Emergent Fake News Detection

Ziyi Zhou,Xiaoming Zhang,Litian Zhang,Yibo Zhang,Zhenyu Guan,Chaozhuo Li,Philip S. Yu

Main category: cs.CL

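TL;DR: The paper proposes C²EFND, a new framework that combines the strengths of large language models (LLMs) and small language models (SLMs) through multi-round collaborative learning and continual knowledge updating, significantly improving the accuracy and adaptability of fake news detection.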

Details

Motivation: The wide spread of fake news on social media has seriously affected society, and existing methods based on SLMs or LLMs struggle because of data scarcity and outdated knowledge. Method: C²EFND combines the generalization power of LLMs with the classification expertise of SLMs through multi-round collaborative learning, and adds a Mixture-of-Experts-based knowledge editing module and a replay-based continual learning method. Result: Experiments on the Pheme and Twitter16 datasets show that C²EFND significantly outperforms existing methods, improving detection accuracy and adaptability. Conclusion: C²EFND offers an effective solution to the continuity and adaptability challenges in fake news detection. Abstract: The widespread dissemination of fake news on social media has significantly impacted society, resulting in serious consequences. Conventional deep learning methodologies employing small language models (SLMs) suffer from extensive supervised training requirements and difficulties adapting to evolving news environments due to data scarcity and distribution shifts. Large language models (LLMs), despite robust zero-shot capabilities, fall short in accurately detecting fake news owing to outdated knowledge and the absence of suitable demonstrations. In this paper, we propose a novel Continuous Collaborative Emergent Fake News Detection (C²EFND) framework to address these challenges. The C²EFND framework strategically leverages both LLMs' generalization power and SLMs' classification expertise via a multi-round collaborative learning framework. We further introduce a lifelong knowledge editing module based on a Mixture-of-Experts architecture to incrementally update LLMs and a replay-based continual learning method to ensure SLMs retain prior knowledge without retraining entirely. Extensive experiments on Pheme and Twitter16 datasets demonstrate that C²EFND significantly outperforms existing methods, effectively improving detection accuracy and adaptability in continuous emergent fake news scenarios.

[177] Identifying Reliable Evaluation Metrics for Scientific Text Revision

Léane Jourdan,Florian Boudin,Richard Dufour,Nicolas Hernandez

Main category: cs.CL

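TL;DR: The paper examines the challenge of evaluating text revision in scientific writing, analyzes the limitations of traditional metrics, and proposes a hybrid approach that combines LLM-based judging with domain-specific metrics.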

Details

Motivation: Traditional metrics such as ROUGE and BERTScore focus on similarity and not on the quality of the improvement, so they do not reflect what a revision actually achieves. Method: Assess revision quality through a manual annotation study, explore reference-free evaluation metrics, and analyze how well LLMs perform as judges. Result: LLMs assess instruction-following well but struggle with correctness; domain-specific metrics provide complementary information. Conclusion: A hybrid approach that combines LLM-as-a-judge evaluation with task-specific metrics assesses revision quality most reliably. Abstract: Evaluating text revision in scientific writing remains a challenge, as traditional metrics such as ROUGE and BERTScore primarily focus on similarity rather than capturing meaningful improvements. In this work, we analyse and identify the limitations of these metrics and explore alternative evaluation methods that better align with human judgments. We first conduct a manual annotation study to assess the quality of different revisions. Then, we investigate reference-free evaluation metrics from related NLP domains. Additionally, we examine LLM-as-a-judge approaches, analysing their ability to assess revisions with and without a gold reference. Our results show that LLMs effectively assess instruction-following but struggle with correctness, while domain-specific metrics provide complementary insights. We find that a hybrid approach combining LLM-as-a-judge evaluation and task-specific metrics offers the most reliable assessment of revision quality.

[178] Fine-Grained Interpretation of Political Opinions in Large Language Models

Jingyu Hu,Mengyue Yang,Mengnan Du,Weiru Liu

Main category: cs.CL

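TL;DR: Using a multi-dimensional political learning framework and interpretable representation engineering techniques, this study reveals the internal political leanings of LLMs and validates the ability to detect and intervene in them.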

Details

Motivation: The open-ended responses of LLMs are misaligned with their internal intentions, and existing analyses mostly rely on single-axis concepts that are prone to confounds, so more transparent methods are needed. Method: Design a four-dimensional political learning framework, construct a dataset, and apply three representation engineering techniques to eight open-source LLMs. Result: The learned vectors disentangle political concept confounds; detection tasks validate their semantic meaning and show strong generalization; intervention experiments can shift the political leanings of LLMs. Conclusion: The multi-dimensional approach effectively reveals and intervenes in the political leanings of LLMs, providing a new tool for transparency research. Abstract: Studies of LLMs' political opinions mainly rely on evaluations of their open-ended responses. Recent work indicates that there is a misalignment between LLMs' responses and their internal intentions. This motivates us to probe LLMs' internal mechanisms and help uncover their internal political states. Additionally, we found that the analysis of LLMs' political opinions often relies on single-axis concepts, which can lead to concept confounds. In this work, we extend the single-axis analysis to multiple dimensions and apply interpretable representation engineering techniques for more transparent LLM political concept learning. Specifically, we designed a four-dimensional political learning framework and constructed a corresponding dataset for fine-grained political concept vector learning. These vectors can be used to detect and intervene in LLM internals. Experiments are conducted on eight open-source LLMs with three representation engineering techniques. Results show these vectors can disentangle political concept confounds. Detection tasks validate the semantic meaning of the vectors and show good generalization and robustness in OOD settings. Intervention experiments show these vectors can intervene in LLMs to generate responses with different political leanings.

[179] MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark

Dingdong Wang,Jincenzi Wu,Junan Li,Dongchao Yang,Xueyuan Chen,Tianhua Zhang,Helen Meng

Main category: cs.CL

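TL;DR: MMSU is a comprehensive benchmark designed for spoken language understanding and reasoning. It contains 5,000 audio-question-answer triplets covering 47 tasks, evaluates 14 advanced SpeechLLMs, and reveals the shortcomings of existing models.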

Details

Motivation: Spoken language understanding requires integrating semantic, paralinguistic, and phonological features, yet the fine-grained perception and complex reasoning abilities of existing SpeechLLMs in this area have not been fully explored. Method: Introduce the MMSU benchmark, which systematically incorporates a range of linguistic phenomena (such as phonetics, prosody, and rhetoric) and evaluates models with audio-question-answer triplets. Result: The evaluation finds that existing SpeechLLMs still have substantial room for improvement in spoken language understanding and reasoning. Conclusion: MMSU provides a new evaluation standard for spoken language understanding and points the way toward optimizing and developing more sophisticated human-AI speech interaction systems. Abstract: Speech inherently contains rich acoustic information that extends far beyond the textual language. In real-world spoken language understanding, effective interpretation often requires integrating semantic meaning (e.g., content), paralinguistic features (e.g., emotions, speed, pitch) and phonological characteristics (e.g., prosody, intonation, rhythm), which are embedded in speech. While recent multimodal Speech Large Language Models (SpeechLLMs) have demonstrated remarkable capabilities in processing audio information, their ability to perform fine-grained perception and complex reasoning in natural speech remains largely unexplored. To address this gap, we introduce MMSU, a comprehensive benchmark designed specifically for understanding and reasoning in spoken language. MMSU comprises 5,000 meticulously curated audio-question-answer triplets across 47 distinct tasks. To ground our benchmark in linguistic theory, we systematically incorporate a wide range of linguistic phenomena, including phonetics, prosody, rhetoric, syntactics, semantics, and paralinguistics. Through a rigorous evaluation of 14 advanced SpeechLLMs, we identify substantial room for improvement in existing models, highlighting meaningful directions for future optimization. MMSU establishes a new standard for comprehensive assessment of spoken language understanding, providing valuable insights for developing more sophisticated human-AI speech interaction systems. The MMSU benchmark is available at https://huggingface.co/datasets/ddwang2000/MMSU. Evaluation code is available at https://github.com/dingdongwang/MMSU_Bench.

[180] Towards LLM-Centric Multimodal Fusion: A Survey on Integration Strategies and Techniques

Jisu An,Junseok Lee,Jeoungeun Lee,Yongseok Son

Main category: cs.CL

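TL;DR: This paper systematically analyzes Multimodal Large Language Models (MLLMs), proposes a classification framework based on three dimensions, and summarizes development trends across 125 models.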

Details

Motivation: Existing literature lacks an in-depth understanding of how multimodal inputs connect to the language backbone; this paper aims to fill that gap. Method: By analyzing 125 MLLMs, the authors propose a classification framework based on architectural strategies, representation learning techniques, and training paradigms. Result: The survey summarizes key patterns in modality integration, representation learning, and training methods, providing guidance for future model development. Conclusion: The classification framework gives researchers a structured view that helps in developing more robust multimodal integration strategies. Abstract: The rapid progress of Multimodal Large Language Models (MLLMs) has transformed the AI landscape. These models combine pre-trained LLMs with various modality encoders. This integration requires a systematic understanding of how different modalities connect to the language backbone. Our survey presents an LLM-centric analysis of current approaches. We examine methods for transforming and aligning diverse modal inputs into the language embedding space. This addresses a significant gap in existing literature. We propose a classification framework for MLLMs based on three key dimensions. First, we examine architectural strategies for modality integration. This includes both the specific integration mechanisms and the fusion level. Second, we categorize representation learning techniques as either joint or coordinate representations. Third, we analyze training paradigms, including training strategies and objective functions. By examining 125 MLLMs developed between 2021 and 2025, we identify emerging patterns in the field. Our taxonomy provides researchers with a structured overview of current integration techniques. These insights aim to guide the development of more robust multimodal integration strategies for future models built on pre-trained foundations.

[181] Dissecting Logical Reasoning in LLMs: A Fine-Grained Evaluation and Supervision Study

Yujun Zhou,Jiayi Ye,Zipeng Ling,Yufei Han,Yue Huang,Haomin Zhuang,Zhenwen Liang,Kehan Guo,Taicheng Guo,Xiangqi Wang,Xiangliang Zhang

Main category: cs.CL

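TL;DR: FineLogic is a fine-grained framework for evaluating the logical reasoning of large language models in terms of accuracy, stepwise soundness, and representation alignment. The study finds that natural language supervision generalizes better, whereas symbolic supervision promotes structured reasoning chains.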

Details

Motivation: Existing benchmarks look only at final-answer accuracy and ignore the quality and structure of the reasoning process, so a more comprehensive evaluation framework is needed. Method: Propose the FineLogic framework, which evaluates three dimensions (accuracy, stepwise soundness, and representation alignment), and study how four supervision formats affect reasoning ability. Result: Natural language supervision performs better on generalization tasks, while symbolic supervision yields more structured reasoning chains. Fine-tuning improves reasoning behavior mainly through step-by-step generation. Conclusion: FineLogic provides a more rigorous and interpretable way to evaluate and improve logical reasoning in LLMs. Abstract: Logical reasoning is a core capability for many applications of large language models (LLMs), yet existing benchmarks often rely solely on final-answer accuracy, failing to capture the quality and structure of the reasoning process. We propose FineLogic, a fine-grained evaluation framework that assesses logical reasoning across three dimensions: overall benchmark accuracy, stepwise soundness, and representation-level alignment. In addition, to better understand how reasoning capabilities emerge, we conduct a comprehensive study on the effects of supervision format during fine-tuning. We construct four supervision styles (one natural language and three symbolic variants) and train LLMs under each. Our findings reveal that natural language supervision yields strong generalization even on out-of-distribution and long-context tasks, while symbolic reasoning styles promote more structurally sound and atomic inference chains. Further, our representation-level probing shows that fine-tuning primarily improves reasoning behaviors through step-by-step generation, rather than enhancing shortcut prediction or internalized correctness. Together, our framework and analysis provide a more rigorous and interpretable lens for evaluating and improving logical reasoning in LLMs.

[182] Design of intelligent proofreading system for English translation based on CNN and BERT

Feijun Liu,Huifeng Wang,Kun Wang,Yizhen Wang

Main category: cs.CL

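TL;DR: Proposes a hybrid approach combining CNN and BERT for machine translation proofreading, trained end to end to optimize post-editing performance; experiments show it outperforms existing techniques.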

Details

Motivation: Automatic translations can contain errors that require human proofreading, so developing efficient machine translation proofreading methods is essential. Method: A CNN extracts local n-gram patterns and BERT generates context-rich sequence representations; an attention mechanism detects translation errors, and a GRU decoder combined with translation memory proposes corrections. Result: Experiments reach 90% accuracy, 89.37% F1, and 16.24% MSE, exceeding existing techniques by more than 10%. Conclusion: The method performs well at identifying and correcting translation errors and achieves state-of-the-art performance. Abstract: Since automatic translations can contain errors that require substantial human post-editing, machine translation proofreading is essential for improving quality. This paper proposes a novel hybrid approach for robust proofreading that combines convolutional neural networks (CNN) with Bidirectional Encoder Representations from Transformers (BERT). In order to extract semantic information from phrases and expressions, CNN uses a variety of convolution kernel filters to capture local n-gram patterns. Meanwhile, BERT creates context-rich representations of whole sequences by utilizing stacked bidirectional transformer encoders. Using BERT's attention mechanisms, the integrated error detection component relates tokens to spot translation irregularities including word order problems and omissions. The correction module then uses parallel English-German alignment and GRU decoder models in conjunction with translation memory to propose logical modifications that maintain original meaning. A unified end-to-end training process optimized for post-editing performance is applied to the whole pipeline. The multi-domain collection of WMT and the conversational dialogues of Open-Subtitles are two of the English-German parallel corpora used to train the model. Multiple loss functions supervise detection and correction capabilities. Experiments attain 90% accuracy, 89.37% F1, and 16.24% MSE, exceeding recent proofreading techniques by over 10% overall. Comparative benchmarking demonstrates state-of-the-art performance in identifying and coherently rectifying mistranslations and omissions.

[183] Evaluating Vision-Language and Large Language Models for Automated Student Assessment in Indonesian Classrooms

Nurul Aisyah,Muhammad Dehan Al Kautsar,Arif Hidayat,Raqib Chowdhury,Fajri Koto

Main category: cs.CL

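TL;DR: The study evaluates a VLM and several LLMs on handwritten exam answers from Indonesian fourth-grade students, finding that the VLM struggles with handwriting recognition but that LLM-generated feedback retains some utility.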

Details

Motivation: Explore how well VLMs and LLMs work for educational assessment in real classroom settings, particularly in under-resourced educational contexts. Method: Use a VLM and several LLMs to grade and generate feedback on 646 handwritten exam responses from Indonesian fourth-grade students, covering Mathematics and English. Result: The VLM performs poorly at handwriting recognition, which hurts the accuracy of LLM grading, but LLM-generated feedback retains some utility. Conclusion: VLMs and LLMs show potential for educational assessment, but handwriting recognition and the personalization and contextual relevance of feedback need improvement. Abstract: Although vision-language and large language models (VLM and LLM) offer promising opportunities for AI-driven educational assessment, their effectiveness in real-world classroom settings, particularly in underrepresented educational contexts, remains underexplored. In this study, we evaluated the performance of a state-of-the-art VLM and several LLMs on 646 handwritten exam responses from grade 4 students in six Indonesian schools, covering two subjects: Mathematics and English. These sheets contain more than 14K student answers that span multiple choice, short answer, and essay questions. Assessment tasks include grading these responses and generating personalized feedback. Our findings show that the VLM often struggles to accurately recognize student handwriting, leading to error propagation in downstream LLM grading. Nevertheless, LLM-generated feedback retains some utility, even when derived from imperfect input, although limitations in personalization and contextual relevance persist.

[184] A Reasoning-Based Approach to Cryptic Crossword Clue Solving

Martin Andrews,Sam Witteveen

Main category: cs.CL

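TL;DR: The paper describes an LLM-based system that solves cryptic crossword clues by hypothesizing answers, proposing wordplay explanations, and verifying them, achieving state-of-the-art performance on the Cryptonite dataset.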

Details

Motivation: Cryptic crossword clues are complex language tasks for which new test sets appear daily, and they call for efficient, interpretable solutions. Method: The system is built from open-licensed components and solves clues by hypothesizing answers, proposing wordplay explanations, and running a verification step based on Python code. Result: It achieves state-of-the-art performance on the Cryptonite dataset and provides interpretable wordplay reasoning. Conclusion: The system offers an efficient and interpretable solution for cryptic crosswords and demonstrates the potential of LLMs on complex language tasks. Abstract: Cryptic crossword clues are challenging language tasks for which new test sets are released daily by major newspapers on a global basis. Each cryptic clue contains both the definition of the answer to be placed in the crossword grid (in common with regular crosswords), and 'wordplay' that proves that the answer is correct (i.e. a human solver can be confident that an answer is correct without needing crossing words as confirmation). This work describes an LLM-based reasoning system built from open-licensed components that solves cryptic clues by (i) hypothesising answers; (ii) proposing wordplay explanations; and (iii) using a verifier system that operates on codified reasoning steps. Overall, this system establishes a new state-of-the-art performance on the challenging Cryptonite dataset of clues from The Times and The Telegraph newspapers in the UK. Because each proved solution is expressed in Python, interpretable wordplay reasoning for proven answers is available for inspection.

[185] Joint Evaluation of Answer and Reasoning Consistency for Hallucination Detection in Large Reasoning Models

Changyue Wang,Weihang Su,Qingyao Ai,Yiqun Liu

Main category: cs.CL

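TL;DR: The paper proposes the RACE framework for detecting hallucinations in Large Reasoning Models (LRMs), improving detection by analyzing the consistency of reasoning steps and answers.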

Details

Motivation: Existing hallucination detection methods focus mainly on answer-level uncertainty and have difficulty detecting redundancy or logical inconsistency in the reasoning process, problems that are especially prominent in LRMs. Method: RACE extracts the essential reasoning steps and computes four diagnostic signals (reasoning consistency, answer uncertainty, semantic alignment, and internal coherence) to detect hallucinations. Result: Experiments show that RACE outperforms existing baselines across multiple datasets and LLMs, providing finer-grained hallucination detection. Conclusion: RACE provides a robust and generalizable solution for evaluating LRMs and can effectively detect hallucinations in the reasoning process. Abstract: Large Reasoning Models (LRMs) extend large language models with explicit, multi-step reasoning traces to enhance transparency and performance on complex tasks. However, these reasoning traces can be redundant or logically inconsistent, making them a new source of hallucination that is difficult to detect. Existing hallucination detection methods focus primarily on answer-level uncertainty and often fail to detect hallucinations or logical inconsistencies arising from the model's reasoning trace. This oversight is particularly problematic for LRMs, where the explicit thinking trace is not only an important support to the model's decision-making process but also a key source of potential hallucination. To this end, we propose RACE (Reasoning and Answer Consistency Evaluation), a novel framework specifically tailored for hallucination detection in LRMs. RACE operates by extracting essential reasoning steps and computing four diagnostic signals: inter-sample consistency of reasoning traces, entropy-based answer uncertainty, semantic alignment between reasoning and answers, and internal coherence of reasoning. This joint analysis enables fine-grained hallucination detection even when the final answer appears correct. Experiments across datasets and different LLMs demonstrate that RACE outperforms existing hallucination detection baselines, offering a robust and generalizable solution for evaluating LRMs. Our code is available at: https://github.com/bebr2/RACE.
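
To make the "entropy-based answer uncertainty" signal concrete, here is a minimal sketch of one way such a quantity could be computed from several sampled answers. It is an illustration under stated assumptions, not the RACE implementation: the function name `answer_entropy` is hypothetical, and answers are grouped by exact match after lowercasing, whereas the paper may group them semantically.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical distribution of sampled answers.

    Assumption: two answers count as the same if they match exactly after
    stripping whitespace and lowercasing. This is a simplification for illustration.
    """
    counts = Counter(a.strip().lower() for a in answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Samples that all agree give zero entropy; an even split gives higher entropy.
print(answer_entropy(["Paris", "paris", "Paris "]))
print(answer_entropy(["Paris", "Lyon"]))
```

Higher entropy means the sampled answers disagree more, which is the kind of answer-level signal the abstract says is insufficient on its own for reasoning models.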

[186] MockConf: A Student Interpretation Dataset: Analysis, Word- and Span-level Alignment and Baselines

Dávid Javorský,Ondřej Bojar,François Yvon

Main category: cs.CL

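TL;DR: The paper introduces the MockConf dataset and the InterAlign tool for automatically aligning and analyzing simultaneous interpreting, addressing the shortcomings of existing parallel corpora.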

Details

Motivation: Existing parallel corpora and algorithms cannot adequately model the long-range interactions and specific types of divergence (such as simplification and functional generalization) found in simultaneous interpreting, so dedicated datasets and tools are needed. Method: Collect the MockConf student interpreting dataset (7 hours, 5 languages), develop the web-based InterAlign annotation tool, and propose an automatic alignment baseline and evaluation metrics. Result: The MockConf dataset, with transcripts and alignments, and the InterAlign tool are released, supporting automatic annotation and analysis of interpreting. Conclusion: MockConf and InterAlign provide practical resources for simultaneous interpreting research and advance automatic alignment and evaluation. Abstract: In simultaneous interpreting, an interpreter renders a source speech into another language with a very short lag, much sooner than sentences are finished. In order to understand and later reproduce this dynamic and complex task automatically, we need dedicated datasets and tools for analysis, monitoring, and evaluation, such as parallel speech corpora, and tools for their automatic annotation. Existing parallel corpora of translated texts and associated alignment algorithms hardly fill this gap, as they fail to model long-range interactions between speech segments or specific types of divergences (e.g., shortening, simplification, functional generalization) between the original and interpreted speeches. In this work, we introduce MockConf, a student interpreting dataset that was collected from Mock Conferences run as part of the students' curriculum. This dataset contains 7 hours of recordings in 5 European languages, transcribed and aligned at the level of spans and words. We further implement and release InterAlign, a modern web-based annotation tool for parallel word and span annotations on long inputs, suitable for aligning simultaneous interpreting. We propose metrics for the evaluation and a baseline for automatic alignment. Dataset and tools are released to the community.

[187] Multiple-Choice Question Generation Using Large Language Models: Methodology and Educator Insights

Giorgio Biancini,Alessio Ferrato,Carla Limongelli

Main category: cs.CL

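TL;DR: This paper explores the potential of large language models (LLMs) for generating multiple-choice questions (MCQs), compares Llama 2, Mistral, and GPT-3.5, and finds that GPT-3.5 performs best.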

Details

Motivation: Manually writing MCQs in education is time-consuming and labor-intensive, and LLMs may offer an efficient solution. Method: Generate MCQs by injecting knowledge into the prompt rather than relying on the LLM's own knowledge, and have 21 educators evaluate the results. Result: GPT-3.5 performs best at generating MCQs, but acceptance of AI in education still needs to improve. Conclusion: LLMs show potential for generating MCQs and can improve the educational experience, but further adoption and refinement are needed. Abstract: Integrating Artificial Intelligence (AI) in educational settings has brought new learning approaches, transforming the practices of both students and educators. Among the various technologies driving this transformation, Large Language Models (LLMs) have emerged as powerful tools for creating educational materials and question answering, but there is still room for new applications. Educators commonly use Multiple-Choice Questions (MCQs) to assess student knowledge, but manually generating these questions is resource-intensive and requires significant time and cognitive effort. In our opinion, LLMs offer a promising solution to these challenges. This paper presents a novel comparative analysis of three widely known LLMs - Llama 2, Mistral, and GPT-3.5 - to explore their potential for creating informative and challenging MCQs. In our approach, we do not rely on the knowledge of the LLM, but we inject the knowledge into the prompt to counter hallucinations, which also gives educators control over the test's source text. Our experiment involving 21 educators shows that GPT-3.5 generates the most effective MCQs across several known metrics. Additionally, it shows that there is still some reluctance to adopt AI in the educational field. This study sheds light on the potential of LLMs to generate MCQs and improve the educational experience, providing valuable insights for the future.

[188] Prompting LLMs: Length Control for Isometric Machine Translation

Dávid Javorský,Ondřej Bojar,François Yvon

Main category: cs.CL

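TL;DR: Studies isometric machine translation across several language pairs, examining how prompting strategy, number of few-shot examples, and demonstration selection affect translation quality and length control.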

Details

Motivation: Explore how large language models perform on isometric translation and how prompts and examples can be used to improve the output. Method: Use eight open-source LLMs of different sizes to analyze how prompting strategy, number of few-shot examples, and demonstration selection affect translation. Result: Aligning the instruction wording with the demonstrations is crucial for length control; extreme examples shorten translations, whereas isometric demonstrations tend to lead models to ignore the length constraint; few-shot prompting improves quality with diminishing returns; considering multiple outputs improves the trade-off between length and quality. Conclusion: Prompting strategy and demonstration selection strongly affect isometric translation, and the multiple-output approach reaches state-of-the-art performance for some language pairs. Abstract: In this study, we explore the effectiveness of isometric machine translation across multiple language pairs (En$\to$De, En$\to$Fr, and En$\to$Es) under the conditions of the IWSLT Isometric Shared Task 2022. Using eight open-source large language models (LLMs) of varying sizes, we investigate how different prompting strategies, varying numbers of few-shot examples, and demonstration selection influence translation quality and length control. We discover that the phrasing of instructions, when aligned with the properties of the provided demonstrations, plays a crucial role in controlling the output length. Our experiments show that LLMs tend to produce shorter translations only when presented with extreme examples, while isometric demonstrations often lead to the models disregarding length constraints. While few-shot prompting generally enhances translation quality, further improvements are marginal across 5, 10, and 20-shot settings. Finally, considering multiple outputs notably improves the overall trade-off between length and quality, yielding state-of-the-art performance for some language pairs.
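
As a rough illustration of what "length control" means here, the sketch below checks whether a translation stays within a tolerance of the source length. The abstract does not state the criterion, so the character-based ratio and the 10% default tolerance are assumptions, and `length_compliant` is a hypothetical helper name.

```python
def length_compliant(source: str, translation: str, tolerance: float = 0.10) -> bool:
    """Return True if the translation length is within `tolerance` of the source length.

    Assumptions (not taken from the abstract): length is measured in characters,
    and the default tolerance is 10% of the source length.
    """
    if not source:
        raise ValueError("source must be non-empty")
    ratio = len(translation) / len(source)
    return abs(ratio - 1.0) <= tolerance


src = "a" * 20
print(length_compliant(src, "b" * 21))  # ratio 1.05
print(length_compliant(src, "b" * 25))  # ratio 1.25
```

When several candidate translations are sampled, a check like this could serve as a filter before picking the best-scoring candidate, which is one plausible reading of the "multiple outputs" strategy.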

[189] Evaluating the Effectiveness of Linguistic Knowledge in Pretrained Language Models: A Case Study of Universal Dependencies

Wenxi Li

Main category: cs.CL

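TL;DR: The paper integrates Universal Dependencies (UD) into pretrained language models to improve their performance on a cross-lingual adversarial paraphrase identification task. Experiments show that UD significantly improves accuracy and F1 scores.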

Details

Motivation: The effectiveness of UD as a framework for cross-lingual syntactic representation has not been fully explored. This paper aims to fill that gap by testing whether UD can improve pretrained models. Method: Integrate UD into pretrained language models and evaluate the effect on a cross-lingual adversarial paraphrase identification task. Result: Integrating UD significantly improves accuracy and F1 scores (average gains of 3.85% and 6.08%, respectively) and narrows the performance gap between pretrained models and large language models. Conclusion: UD shows potential for cross-lingual tasks, and a language's UD-based similarity to English is positively correlated with model performance. Abstract: Universal Dependencies (UD), while widely regarded as the most successful linguistic framework for cross-lingual syntactic representation, remains underexplored in terms of its effectiveness. This paper addresses this gap by integrating UD into pretrained language models and assessing whether UD can improve their performance on a cross-lingual adversarial paraphrase identification task. Experimental results show that incorporation of UD yields significant improvements in accuracy and $F_1$ scores, with average gains of 3.85% and 6.08% respectively. These enhancements reduce the performance gap between pretrained models and large language models in some language pairs, and the enhanced models even outperform the latter in some others. Furthermore, the UD-based similarity score between a given language and English is positively correlated with the performance of models in that language. Both findings highlight the validity and potential of UD in out-of-domain tasks.

[190] ICPC-Eval: Probing the Frontiers of LLM Reasoning with Competitive Programming Contests

Shiyi Xu,Yiwen Hu,Yingqian Min,Zhipeng Chen,Wayne Xin Zhao,Ji-Rong Wen

Main category: cs.CL

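TL;DR: Proposes ICPC-Eval, a benchmark for evaluating the coding ability of large language models in realistic competition settings, comprising 118 ICPC contest problems and introducing the Refine@K evaluation metric.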

Details

Motivation: Existing benchmarks and evaluation metrics cannot adequately assess the coding and reflective abilities of large language models in realistic competition settings. Method: Design the ICPC-Eval benchmark, which includes real contest problems, a local evaluation toolkit, and the Refine@K evaluation metric. Result: Top reasoning models rely on multi-turn code feedback to reach their potential, yet still lag behind top human teams. Conclusion: ICPC-Eval provides an effective tool for evaluating complex reasoning ability and exposes the gap between model and human performance. Abstract: With the significant progress of large reasoning models in complex coding and reasoning tasks, existing benchmarks, like LiveCodeBench and CodeElo, are insufficient to evaluate the coding capabilities of large language models (LLMs) in real competition environments. Moreover, current evaluation metrics such as Pass@K fail to capture the reflective abilities of reasoning models. To address these challenges, we propose ICPC-Eval, a top-level competitive coding benchmark designed to probe the frontiers of LLM reasoning. ICPC-Eval includes 118 carefully curated problems from 11 recent ICPC contests held in various regions of the world, offering three key contributions: 1) A challenging realistic ICPC competition scenario, featuring a problem type and difficulty distribution consistent with actual contests. 2) A robust test case generation method and a corresponding local evaluation toolkit, enabling efficient and accurate local evaluation. 3) An effective test-time scaling evaluation metric, Refine@K, which allows iterative repair of solutions based on execution feedback. The results underscore the significant challenge in evaluating complex reasoning abilities: top-tier reasoning models like DeepSeek-R1 often rely on multi-turn code feedback to fully unlock their in-context reasoning potential when compared to non-reasoning counterparts. Furthermore, despite recent advancements in code generation, these models still lag behind top-performing human teams. We release the benchmark at: https://github.com/RUCAIBox/Slow_Thinking_with_LLMs
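
The idea of "iterative repair of solutions based on execution feedback" can be sketched as a simple loop. This is a hedged illustration, not the paper's exact definition of Refine@K: `refine_at_k`, `generate`, and `judge` are hypothetical names, and the assumption is that a problem counts as solved if any of up to K attempts passes, with each retry seeing the previous attempt's feedback.

```python
def refine_at_k(problems, generate, judge, k):
    """Fraction of problems solved within k attempts with feedback between attempts.

    generate(problem, feedback) -> candidate solution (feedback is None at first).
    judge(problem, solution)    -> (passed: bool, feedback: str).
    Both callables are placeholders supplied by the caller.
    """
    solved = 0
    for problem in problems:
        feedback = None
        for _ in range(k):
            solution = generate(problem, feedback)
            passed, feedback = judge(problem, solution)
            if passed:
                solved += 1
                break
    return solved / len(problems)


# Toy stand-ins: the "model" only succeeds after it has seen feedback once.
def toy_generate(problem, feedback):
    return "fixed" if feedback is not None else "buggy"


def toy_judge(problem, solution):
    return solution == "fixed", "wrong answer on test 1"


print(refine_at_k(["p1", "p2"], toy_generate, toy_judge, k=1))
print(refine_at_k(["p1", "p2"], toy_generate, toy_judge, k=2))
```

The contrast with Pass@K is that attempts are not independent samples: each one is conditioned on what the previous attempt got wrong.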

[191] Verbose ListOps (VLO): Beyond Long Context -- Unmasking LLM's Reasoning Blind Spots

Alex Pan,Mary-Anne Williams

Main category: cs.CL

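TL;DR: Verbose ListOps is a new benchmark that turns ListOps computations into long stories to test LLMs on nested narrative reasoning, revealing their limitations in state management.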

Details

Motivation: Existing benchmarks do not effectively test LLMs on nested narrative reasoning and so mask a fundamental limitation; Verbose ListOps aims to fill this gap. Method: Programmatically generate ListOps computations in the form of long stories, forcing LLMs to perform internal computation and state management while controlling both narrative length and reasoning difficulty. Result: Leading LLMs (e.g., OpenAI o4, Gemini 2.5 Pro) collapse on narratives of modest length (about 10k tokens), even though they solve raw ListOps problems easily. Conclusion: Verbose ListOps exposes a key weakness of LLMs in nested reasoning and offers a targeted direction for improving reasoning beyond simply enlarging the context window. Abstract: Large Language Models (LLMs), whilst great at extracting facts from text, struggle with nested narrative reasoning. Existing long context and multi-hop QA benchmarks inadequately test this, lacking realistic distractors or failing to decouple context length from reasoning complexity, masking a fundamental LLM limitation. We introduce Verbose ListOps, a novel benchmark that programmatically transposes ListOps computations into lengthy, coherent stories. This uniquely forces internal computation and state management of nested reasoning problems by withholding intermediate results, and offers fine-grained controls for both narrative size and reasoning difficulty. Whilst benchmarks like LongReason (2025) advance approaches for synthetically expanding the context size of multi-hop QA problems, Verbose ListOps pinpoints a specific LLM vulnerability: difficulty in state management for nested sub-reasoning amongst semantically-relevant, distracting narrative. Our experiments show that leading LLMs (e.g., OpenAI o4, Gemini 2.5 Pro) collapse in performance on Verbose ListOps at modest (~10k token) narrative lengths, despite effortlessly solving raw ListOps equations. Addressing this failure is paramount for real-world text interpretation which requires identifying key reasoning points, tracking conceptual intermediate results, and filtering irrelevant information. Verbose ListOps and its extensible generation framework thus enable targeted reasoning enhancements beyond mere context-window expansion; a critical step to automating the world's knowledge work.
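
For readers unfamiliar with the "raw ListOps equations" mentioned above, the following sketch evaluates a nested ListOps expression. It assumes the commonly used operator set (`MAX`, `MIN`, `MED` for median, `SM` for sum modulo 10); the exact operators and formatting used by Verbose ListOps may differ, and `eval_listops` is a hypothetical helper name.

```python
import statistics

# Assumed operator set; the median is truncated to an int for even-length lists.
OPS = {
    "MAX": max,
    "MIN": min,
    "MED": lambda xs: int(statistics.median(xs)),
    "SM": lambda xs: sum(xs) % 10,
}


def eval_listops(expr: str) -> int:
    """Evaluate a bracketed prefix expression such as '[MAX 2 9 [MIN 4 7 ] 0 ]'."""
    tokens = expr.replace("[", " [ ").replace("]", " ] ").split()
    stack = []
    for tok in tokens:
        if tok == "[":
            continue  # the operator name that follows marks the start of a list
        if tok in OPS:
            stack.append(tok)
        elif tok == "]":
            args = []
            while not isinstance(stack[-1], str):  # pop operands back to the operator
                args.append(stack.pop())
            op = stack.pop()
            stack.append(OPS[op](args[::-1]))
        else:
            stack.append(int(tok))
    return stack[0]


print(eval_listops("[MAX 2 9 [MIN 4 7 ] 0 ]"))
print(eval_listops("[SM 5 6 [MED 1 3 8 ] ]"))
```

The stack holds exactly the intermediate state that a model must track implicitly when the same computation is spread across a long narrative with the intermediate results withheld.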

[192] A Practitioner's Guide to Building ASR Models for Low-Resource Languages: A Case Study on Scottish Gaelic

Ondřej Klejch,William Lamb,Peter Bell

Main category: cs.CL

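TL;DR: This paper challenges the presumed superiority of fine-tuning multilingual end-to-end models for low-resource ASR and proposes combining hybrid HMMs with self-supervised models, which substantially improves performance.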

Details

Motivation: Examine whether fine-tuning a multilingual end-to-end model is always the best approach for low-resource ASR, and propose a better alternative. Method: Combine hybrid HMMs with self-supervised models, making full use of the available data through continued self-supervised pre-training and semi-supervised training. Result: On Scottish Gaelic, WER is reduced by 32% relative to the best fine-tuned Whisper model. Conclusion: The combination of hybrid HMMs and self-supervised models outperforms plain fine-tuning of end-to-end models for low-resource ASR. Abstract: An effective approach to the development of ASR systems for low-resource languages is to fine-tune an existing multilingual end-to-end model. When the original model has been trained on large quantities of data from many languages, fine-tuning can be effective with limited training data, even when the language in question was not present in the original training data. The fine-tuning approach has been encouraged by the availability of public-domain E2E models and is widely believed to lead to state-of-the-art results. This paper, however, challenges that belief. We show that an approach combining hybrid HMMs with self-supervised models can yield substantially better performance with limited training data. This combination allows better utilisation of all available speech and text data through continued self-supervised pre-training and semi-supervised training. We benchmark our approach on Scottish Gaelic, achieving WER reductions of 32% relative over our best fine-tuned Whisper model.
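
Because the headline result is a relative WER reduction, a short sketch of how word error rate and a relative reduction are usually computed may help. The WER values in the usage example are made up for illustration and are not taken from the paper; `wer` and `relative_reduction` are hypothetical helper names.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, new: float) -> float:
    """Relative improvement of `new` over `baseline`, e.g. 0.32 means 32% relative."""
    return (baseline - new) / baseline


print(wer("the cat sat", "the dog sat"))
# Illustrative numbers only: a drop from 25% to 17% WER is a 32% relative reduction.
print(relative_reduction(0.25, 0.17))
```

A relative reduction says nothing about the absolute WER, so two systems with the same relative gain can differ widely in actual accuracy.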

[193] Simulating LLM-to-LLM Tutoring for Multilingual Math Feedback

Junior Cedric Tonga,KV Aditya Srivatsa,Kaushal Kumar Maurya,Fajri Koto,Ekaterina Kochmar

Main category: cs.CL

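TL;DR: The study examines the effectiveness of feedback from large language models (LLMs) in multilingual education and finds that multilingual hints significantly improve learning outcomes, especially in low-resource languages.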

Details

Motivation: Assess the ability of LLMs to provide feedback on mathematical reasoning tasks in different languages, filling a gap in multilingual educational support tools. Method: Simulate multilingual tutor-student interactions with different LLMs and prompting strategies, and analyze how factors such as feedback language and student input language affect learning outcomes. Result: Multilingual hints significantly improve learning outcomes, especially when the feedback language matches the student's native language. Conclusion: The study offers practical recommendations for building multilingual LLM-based educational tools and stresses the importance of matching the feedback language to the student's native language. Abstract: Large language models (LLMs) have demonstrated the ability to generate formative feedback and instructional hints in English, making them increasingly relevant for AI-assisted education. However, their ability to provide effective instructional support across different languages, especially for mathematically grounded reasoning tasks, remains largely unexamined. In this work, we present the first large-scale simulation of multilingual tutor-student interactions using LLMs. A stronger model plays the role of the tutor, generating feedback in the form of hints, while a weaker model simulates the student. We explore 352 experimental settings across 11 typologically diverse languages, four state-of-the-art LLMs, and multiple prompting strategies to assess whether language-specific feedback leads to measurable learning gains. Our study examines how student input language, teacher feedback language, model choice, and language resource level jointly influence performance. Results show that multilingual hints can significantly improve learning outcomes, particularly in low-resource languages when feedback is aligned with the student's native language. These findings offer practical insights for developing multilingual, LLM-based educational tools that are both effective and inclusive.

[194] ConECT Dataset: Overcoming Data Scarcity in Context-Aware E-Commerce MT

Mikołaj Pokrywka,Wojciech Kusa,Mieszko Rutkowski,Mikołaj Koszowski

Main category: cs.CL

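TL;DR: The paper studies how adding contextual information such as images and product metadata improves neural machine translation in the e-commerce domain, and releases a new Czech-Polish dataset.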

Details

Motivation: Neural machine translation suffers from word ambiguity and insufficient context in domain-specific applications, and performs particularly poorly on e-commerce data. Method: Create ConECT, a Czech-Polish dataset of 11,400 sentence pairs, test a vision-language model and text-to-text models, and explore ways of integrating contextual information. Result: Experiments show that adding visual context and other contextual information significantly improves translation quality. Conclusion: Integrating contextual information effectively improves machine translation quality, and the new dataset is released publicly to encourage further research. Abstract: Neural Machine Translation (NMT) has improved translation by using Transformer-based models, but it still struggles with word ambiguity and context. This problem is especially important in domain-specific applications, which often have problems with unclear sentences or poor data quality. Our research explores how adding information to models can improve translations in the context of e-commerce data. To this end, we create ConECT -- a new Czech-to-Polish e-commerce product translation dataset coupled with images and product metadata consisting of 11,400 sentence pairs. We then investigate and compare different methods that are applicable to context-aware translation. We test a vision-language model (VLM), finding that visual context aids translation quality. Additionally, we explore the incorporation of contextual information into text-to-text models, such as the product's category path or image descriptions. The results of our study demonstrate that the incorporation of contextual information leads to an improvement in the quality of machine translation. We make the new dataset publicly available.

[195] From Struggle (06-2024) to Mastery (02-2025) LLMs Conquer Advanced Algorithm Exams and Pave the Way for Editorial Generation

Adrian Marius Dumitran,Theodor-Pierre Moroianu,Vasile Paul Alexe

Main category: cs.CL

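TL;DR: This paper evaluates large language models (LLMs) on university-level algorithms exams, finding that the latest models perform very well but still struggle with graph tasks, and discusses their potential in education.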

Details

Motivation: Study how LLMs perform on complex algorithmic problems and explore their potential in educational settings. Method: Test multiple LLMs on a Romanian exam and its high-quality English translation, analyzing their problem-solving ability, consistency, and multilingual performance. Result: The latest models perform close to top students and show complex multi-step reasoning ability, but still have difficulty with graph tasks. Conclusion: LLMs have the potential to support high-quality content generation in education, pointing the way toward integrating generative AI into algorithm education. Abstract: This paper presents a comprehensive evaluation of the performance of state-of-the-art Large Language Models (LLMs) on challenging university-level algorithms exams. By testing multiple models on both a Romanian exam and its high-quality English translation, we analyze LLMs' problem-solving capabilities, consistency, and multilingual performance. Our empirical study reveals that the most recent models not only achieve scores comparable to top-performing students but also demonstrate robust reasoning skills on complex, multi-step algorithmic challenges, even though difficulties remain with graph-based tasks. Building on these findings, we explore the potential of LLMs to support educational environments through the generation of high-quality editorial content, offering instructors a powerful tool to enhance student feedback. The insights and best practices discussed herein pave the way for further integration of generative AI in advanced algorithm education.

[196] Better Semi-supervised Learning for Multi-domain ASR Through Incremental Retraining and Data Filtering

Andres Carofilis,Pradeep Rangappa,Srikanth Madikeri,Shashi Kumar,Sergio Burdisso,Jeena Prakash,Esau Villatoro-Tello,Petr Motlicek,Bidisha Sharma,Kadri Hacioglu,Shankar Venkatesan,Saurabh Vyas,Andreas Stolcke

Main category: cs.CL

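TL;DR: The paper proposes an incremental semi-supervised learning pipeline that combines a small in-domain labeled set with auxiliary data from a related domain, significantly improving ASR performance in low-resource settings.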

Details

Motivation: Labeled data is scarce when fine-tuning pretrained ASR models for a specific domain, but unlabeled audio and labeled data from related domains are often available. Method: An incremental semi-supervised learning pipeline first integrates a small in-domain labeled set with auxiliary data from a related domain, then selects and iteratively refines pseudo-labels by filtering with multi-model consensus or named entity recognition (NER). Result: Evaluated on the Wow call center and Fisher English corpora, consensus-based filtering performs best, with up to 22.3% (Wow) and 24.8% (Fisher) relative improvement over single-step fine-tuning with random selection; NER is second best at a lower computational cost. Conclusion: Consensus filtering and NER are effective ways to refine pseudo-labels and significantly improve ASR performance, especially in low-resource domains. Abstract: Fine-tuning pretrained ASR models for specific domains is challenging when labeled data is scarce. But unlabeled audio and labeled data from related domains are often available. We propose an incremental semi-supervised learning pipeline that first integrates a small in-domain labeled set and an auxiliary dataset from a closely related domain, achieving a relative improvement of 4% over no auxiliary data. Filtering based on multi-model consensus or named entity recognition (NER) is then applied to select and iteratively refine pseudo-labels, showing slower performance saturation compared to random selection. Evaluated on the multi-domain Wow call center and Fisher English corpora, it outperforms single-step fine-tuning. Consensus-based filtering outperforms other methods, providing up to 22.3% relative improvement on Wow and 24.8% on Fisher over single-step fine-tuning with random selection. NER is the second-best filter, providing competitive performance at a lower computational cost.
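
The abstract does not say how multi-model consensus is measured, so the following is only a minimal sketch under stated assumptions: each unlabeled utterance has one transcript per ASR model, agreement is the mean pairwise word-level edit distance normalized by the longer transcript, and the 0.1 threshold and all function names are hypothetical.

```python
from itertools import combinations


def word_edit_distance(a, b):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def disagreement(hyp_a, hyp_b):
    """Edit distance normalized by the longer transcript (0 = identical)."""
    a, b = hyp_a.split(), hyp_b.split()
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else word_edit_distance(a, b) / longest


def consensus_filter(hypotheses, max_disagreement=0.1):
    """Keep utterances whose model transcripts agree closely.

    hypotheses: dict mapping utterance id -> list of transcripts, one per model.
    Returns a dict mapping utterance id -> pseudo-label. Using the first
    model's transcript as the label is an assumption of this sketch.
    """
    kept = {}
    for utt_id, hyps in hypotheses.items():
        pairs = list(combinations(hyps, 2))
        if not pairs:
            continue  # consensus needs at least two models
        mean_dis = sum(disagreement(x, y) for x, y in pairs) / len(pairs)
        if mean_dis <= max_disagreement:
            kept[utt_id] = hyps[0]
    return kept
```

In an incremental setup, the retained pseudo-labels would be added to the training set and the filter re-run after each retraining round.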

[197] SCOP: Evaluating the Comprehension Process of Large Language Models from a Cognitive View

Yongjie Xiao,Hongru Liang,Peixin Qin,Yao Zhang,Wenqiang Lei

Main category: cs.CL

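TL;DR: The paper proposes SCOP, which evaluates the comprehension ability of large language models (LLMs) from a cognitive view; it finds that expert-level comprehension remains challenging for LLMs and suggests a direction for improvement.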

Details

Motivation: Although LLMs show great potential in machine comprehension, there is no rational explanation of whether their comprehension process is aligned with that of experts, so a systematic evaluation is needed. Method: SCOP comprises a definition of five requisite skills in the comprehension process, a strict framework for constructing test data, and a detailed analysis of open-source and closed-source LLMs. Result: LLMs fall short of an expert-level comprehension process but share some similarities with experts (for example, they comprehend local information better than global information), and they may reach correct answers through flawed processes. Conclusion: One direction for improving LLMs is to focus more on the comprehension process, ensuring that all comprehension skills are thoroughly developed during training. Abstract: Despite the great potential of large language models (LLMs) in machine comprehension, it is still disturbing to fully count on them in real-world scenarios. This is probably because there is no rational explanation for whether the comprehension process of LLMs is aligned with that of experts. In this paper, we propose SCOP to carefully examine how LLMs perform during the comprehension process from a cognitive view. Specifically, it is equipped with a systematic definition of five requisite skills during the comprehension process, a strict framework to construct testing data for these skills, and a detailed analysis of advanced open-sourced and closed-sourced LLMs using the testing data. With SCOP, we find that it is still challenging for LLMs to perform an expert-level comprehension process. Even so, we notice that LLMs share some similarities with experts, e.g., performing better at comprehending local information than global information. Further analysis reveals that LLMs can be somewhat unreliable -- they might reach correct answers through flawed comprehension processes. Based on SCOP, we suggest that one direction for improving LLMs is to focus more on the comprehension process, ensuring all comprehension skills are thoroughly developed during training.

[198] ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development

Zhenran Xu,Xue Yang,Yiyu Wang,Qingli Hu,Zijiao Wu,Longyue Wang,Weihua Luo,Kaifu Zhang,Baotian Hu,Min Zhang

Main category: cs.CL

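TL;DR: ComfyUI-Copilot is a large language model-powered plugin that aims to improve the usability and efficiency of ComfyUI, addressing the challenges newcomers face through intelligent node recommendations and one-click workflow construction.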

Details

Motivation: ComfyUI is flexible and user-friendly, but newcomers face limited documentation, model misconfigurations, and complex workflow design; ComfyUI-Copilot aims to address these problems. Method: A hierarchical multi-agent framework with a central assistant agent and specialized worker agents, supported by knowledge bases for debugging and deployment. Result: Offline evaluations and user feedback show that the plugin recommends nodes accurately and accelerates workflow development, lowering the entry barrier for beginners and improving efficiency. Conclusion: ComfyUI-Copilot effectively addresses ComfyUI's usability problems and suits both beginners and experienced users. Abstract: We introduce ComfyUI-Copilot, a large language model-powered plugin designed to enhance the usability and efficiency of ComfyUI, an open-source platform for AI-driven art creation. Despite its flexibility and user-friendly interface, ComfyUI can present challenges to newcomers, including limited documentation, model misconfigurations, and the complexity of workflow design. ComfyUI-Copilot addresses these challenges by offering intelligent node and model recommendations, along with automated one-click workflow construction. At its core, the system employs a hierarchical multi-agent framework comprising a central assistant agent for task delegation and specialized worker agents for different usages, supported by our curated ComfyUI knowledge bases to streamline debugging and deployment. We validate the effectiveness of ComfyUI-Copilot through both offline quantitative evaluations and online user feedback, showing that it accurately recommends nodes and accelerates workflow development. Additionally, use cases illustrate that ComfyUI-Copilot lowers entry barriers for beginners and enhances workflow efficiency for experienced users. The ComfyUI-Copilot installation package and a demo video are available at https://github.com/AIDC-AI/ComfyUI-Copilot.

[199] Controlling Summarization Length Through EOS Token Weighting

Zeno Belligoli,Emmanouil Stergiadis,Eran Fainman,Ilya Gusev

Main category: cs.CL

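TL;DR: A simple method controls the length of generated text by increasing the importance of the EOS token in the cross-entropy loss; it works across model architectures and decoding algorithms.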

Details

Motivation: Existing methods often require complex model alterations, which limits compatibility with pre-trained models. Method: Generation length is controlled by increasing the importance of correctly predicting the EOS token in the cross-entropy loss. Result: The method works with encoder-decoder and GPT-style models and controls length, often without affecting summary quality. Conclusion: The method is simple and general, achieving length control without complex model alterations. Abstract: Controlling the length of generated text can be crucial in various text-generation tasks, including summarization. Existing methods often require complex model alterations, limiting compatibility with pre-trained models. We address these limitations by developing a simple approach for controlling the length of automatic text summaries by increasing the importance of correctly predicting the EOS token in the cross-entropy loss computation. The proposed methodology is agnostic to architecture and decoding algorithms and orthogonal to other inference-time techniques to control the generation length. We tested it with encoder-decoder and modern GPT-style LLMs, and show that this method can control generation length, often without affecting the quality of the summary.
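
One way to read "increasing the importance of correctly predicting the EOS token" is as a per-token weight in the cross-entropy loss. The sketch below assumes exactly that: positions whose gold token is EOS get weight `eos_weight`, all others get weight 1, and the loss is normalized by the sum of weights. The function name, the weighting scheme, and the normalization are assumptions for illustration, not the authors' stated formulation.

```python
import math


def eos_weighted_cross_entropy(logits, targets, eos_id, eos_weight=1.0):
    """Weighted mean token-level cross-entropy.

    logits:  list of per-position score lists (one score per vocabulary item).
    targets: list of gold token ids, same length as logits.
    Positions whose gold token is eos_id are weighted by eos_weight.
    Dividing by the weight sum (not the token count) is an assumption.
    """
    total, weight_sum = 0.0, 0.0
    for scores, gold in zip(logits, targets):
        # Numerically stable log-sum-exp for the softmax normalizer.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        nll = log_norm - scores[gold]  # negative log-likelihood of gold token
        weight = eos_weight if gold == eos_id else 1.0
        total += weight * nll
        weight_sum += weight
    return total / weight_sum
```

With `eos_weight=1.0` this reduces to the ordinary mean cross-entropy; raising the weight makes a mispredicted EOS position contribute more to the loss, which is the lever the abstract describes.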

[200] Automatic Robustness Stress Testing of LLMs as Mathematical Problem Solvers

Yutao Hou,Zeguan Xiao,Fei Yu,Yihan Jiang,Xuetao Wei,Hailiang Huang,Yun Chen,Guanhua Chen

Main category: cs.CL

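TL;DR: AR-Checker is a framework that automatically generates variants of mathematical problems to stress-test the robustness of large language models (LLMs) while avoiding data contamination.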

Details

Motivation: LLMs can fail unexpectedly on simple reasoning tasks, and existing evaluation methods carry a risk of data contamination. Method: Multi-round parallel streams of LLM-based rewriting and verification generate problem variants that keep the original meaning but may cause the LLM to fail. Result: AR-Checker performs strongly on mathematical tasks such as GSM8K and MATH-500 and on non-mathematical benchmarks such as MMLU, MMLU-Pro, and CommonsenseQA. Conclusion: AR-Checker generates test cases dynamically and evaluates LLM robustness effectively. Abstract: Large language models (LLMs) have achieved distinguished performance on various reasoning-intensive tasks. However, LLMs might still face the challenges of robustness issues and fail unexpectedly in some simple reasoning tasks. Previous works evaluate the LLM robustness with hand-crafted templates or a limited set of perturbation rules, indicating potential data contamination in pre-training or fine-tuning datasets. In this work, inspired by stress testing in software engineering, we propose a novel framework, Automatic Robustness Checker (AR-Checker), to generate mathematical problem variants that maintain the semantic meanings of the original one but might fail the LLMs. The AR-Checker framework generates mathematical problem variants through multi-round parallel streams of LLM-based rewriting and verification. Our framework can generate benchmark variants dynamically for each LLM, thus minimizing the risk of data contamination. Experiments on GSM8K and MATH-500 demonstrate the strong performance of AR-Checker on mathematical tasks. We also evaluate AR-Checker on benchmarks beyond mathematics, including MMLU, MMLU-Pro, and CommonsenseQA, where it also achieves strong performance, further proving the effectiveness of AR-Checker.

[201] TALL -- A Trainable Architecture for Enhancing LLM Performance in Low-Resource Languages

Moshe Ofer,Orel Zamler,Amos Azaria

Main category: cs.CL

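TL;DR: The TALL architecture combines an LLM with bilingual translation models to improve performance in low-resource languages; experiments show significant gains over baseline methods.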

Details

Motivation: To address LLMs' poor performance in low-resource languages, which stems from limited training data. Method: TALL integrates an LLM with two bilingual translation models and transforms inputs through dimension alignment layers and custom transformers. Result: Experiments on Hebrew show significant improvements over direct use, naive translation, and fine-tuning. Conclusion: TALL balances computational efficiency with performance gains in a parameter-efficient way. Abstract: Large Language Models (LLMs) excel in high-resource languages but struggle with low-resource languages due to limited training data. This paper presents TALL (Trainable Architecture for Enhancing LLM Performance in Low-Resource Languages), which integrates an LLM with two bilingual translation models. TALL transforms low-resource inputs into high-resource representations, leveraging the LLM's capabilities while preserving linguistic features through dimension alignment layers and custom transformers. Our experiments on Hebrew demonstrate significant improvements over several baselines, including direct use, naive translation, and fine-tuning approaches. The architecture employs a parameter-efficient strategy, freezing pre-trained components while training only lightweight adapter modules, balancing computational efficiency with performance gains.

[202] Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation

Noy Sternlicht,Ariel Gera,Roy Bar-Haim,Tom Hope,Noam Slonim

Main category: cs.CL

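TL;DR: The paper introduces debate speech evaluation as a novel benchmark task for testing LLM judges, finding that larger models approximate human judges in some respects but differ substantially in overall judgment behavior.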

Details

Motivation: Evaluating debate speeches requires deep understanding at multiple levels, and existing LLM benchmarks have paid limited attention to these cognitive abilities. Method: Using a dataset of over 600 annotated debate speeches, the authors analyze how state-of-the-art LLMs compare with human judges. Result: Larger models approximate human judges in some respects but differ substantially in overall judgment behavior; frontier LLMs may generate persuasive speeches at a human level. Conclusion: Debate speech evaluation is a challenging benchmark task that reveals both the differences between LLMs and human judges and the models' potential. Abstract: We introduce Debate Speech Evaluation as a novel and challenging benchmark for assessing LLM judges. Evaluating debate speeches requires a deep understanding of the speech at multiple levels, including argument strength and relevance, the coherence and organization of the speech, the appropriateness of its style and tone, and so on. This task involves a unique set of cognitive abilities that have previously received limited attention in systematic LLM benchmarking. To explore such skills, we leverage a dataset of over 600 meticulously annotated debate speeches and present the first in-depth analysis of how state-of-the-art LLMs compare to human judges on this task. Our findings reveal a nuanced picture: while larger models can approximate individual human judgments in some respects, they differ substantially in their overall judgment behavior. We also investigate the ability of frontier LLMs to generate persuasive, opinionated speeches, showing that models may perform at a human level on this task.

[203] Does It Make Sense to Speak of Introspection in Large Language Models?

Iulia Comşa,Murray Shanahan

Main category: cs.CL

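TL;DR: The paper examines self-reports by large language models (LLMs), analyzes whether they amount to introspection, and presents two examples as a basis for discussion.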

Details

Motivation: As LLMs grow in linguistic fluency and cognitive capability, their self-reports raise the question of whether the concept of introspection applies to them. Method: Two examples of LLM self-report are analyzed to assess whether each constitutes introspection. Result: The first example (describing the creative writing process) does not constitute introspection; the second (inferring the model's own temperature parameter) can be considered a minimal example of introspection, though presumably without conscious experience. Conclusion: Some LLM behavior can be regarded as introspection, but how it differs in nature from human introspection needs further study. Abstract: Large language models (LLMs) exhibit compelling linguistic behaviour, and sometimes offer self-reports, that is to say statements about their own nature, inner workings, or behaviour. In humans, such reports are often attributed to a faculty of introspection and are typically linked to consciousness. This raises the question of how to interpret self-reports produced by LLMs, given their increasing linguistic fluency and cognitive capabilities. To what extent (if any) can the concept of introspection be meaningfully applied to LLMs? Here, we present and critique two examples of apparent introspective self-report from LLMs. In the first example, an LLM attempts to describe the process behind its own "creative" writing, and we argue this is not a valid example of introspection. In the second example, an LLM correctly infers the value of its own temperature parameter, and we argue that this can be legitimately considered a minimal example of introspection, albeit one that is (presumably) not accompanied by conscious experience.

[204] RIVAL: Reinforcement Learning with Iterative and Adversarial Optimization for Machine Translation

Tianjiao Li,Mengran Yu,Chenyu Shi,Yanjun Zhao,Xiaojing Liu,Qiang Zhang,Qi Zhang,Xuanjing Huang,Jiayin Wang

Main category: cs.CL

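TL;DR: The paper investigates why RLHF performs poorly on colloquial subtitle translation and proposes RIVAL, an adversarial training framework that improves translation quality by iteratively updating the reward model and the LLM.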

Details

Motivation: The offline reward model gradually diverges from the online LLM because of distributional shift, which leads to poor translation results. Method: RIVAL formulates training as a min-max game between the reward model and the LLM and combines qualitative and quantitative preference rewards. Result: Experiments show that RIVAL significantly improves upon translation baselines. Conclusion: Adversarial training in RIVAL effectively addresses the distributional shift and improves translation quality. Abstract: Large language models (LLMs) possess strong multilingual capabilities, and combining Reinforcement Learning from Human Feedback (RLHF) with translation tasks has shown great potential. However, we observe that this paradigm performs unexpectedly poorly when applied to colloquial subtitle translation tasks. In this work, we investigate this issue and find that the offline reward model (RM) gradually diverges from the online LLM due to distributional shift, ultimately leading to undesirable training outcomes. To address this, we propose RIVAL, an adversarial training framework that formulates the process as a min-max game between the RM and the LLM. RIVAL iteratively updates both models, with the RM trained to distinguish strong from weak translations (qualitative preference reward), and the LLM trained to enhance its translation for closing this gap. To stabilize training and improve generalizability, we also incorporate quantitative preference reward (e.g., BLEU) into the RM, enabling reference-free quality modeling aligned with human evaluation. Through extensive experiments, we demonstrate that the proposed adversarial training framework significantly improves upon translation baselines.

[205] Just a Scratch: Enhancing LLM Capabilities for Self-harm Detection through Intent Differentiation and Emoji Interpretation

Soumitra Ghosh,Gopendra Vikram Singh,Shambhavi,Sabarna Choudhury,Asif Ekbal

Main category: cs.CL

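TL;DR: The study introduces the CESM-100 emoji matrix and the SHINES dataset and combines them with a multi-task learning framework, improving the ability of large language models (LLMs) to detect self-harm intent on social media and to generate explainable rationales for their predictions.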

Details

Motivation: Self-harm detection on social media is critical for early intervention and mental health support, but current LLMs struggle to interpret implicit language and emojis. Method: The authors present the CESM-100 emoji sensitivity matrix and the SHINES dataset, and design a multi-task learning framework that combines self-harm detection with intent differentiation (casual-mention and serious-intent span detection) and generates explanatory rationales. Result: Evaluated on three state-of-the-art LLMs, the framework improves both self-harm detection and explanation and addresses the ambiguity of self-harm signals. Conclusion: By coupling contextual cues with intent differentiation, the approach improves LLM performance on self-harm detection, and the dataset and tools are publicly available. Abstract: Self-harm detection on social media is critical for early intervention and mental health support, yet remains challenging due to the subtle, context-dependent nature of such expressions. Identifying self-harm intent aids suicide prevention by enabling timely responses, but current large language models (LLMs) struggle to interpret implicit cues in casual language and emojis. This work enhances LLMs' comprehension of self-harm by distinguishing intent through nuanced language-emoji interplay. We present the Centennial Emoji Sensitivity Matrix (CESM-100), a curated set of 100 emojis with contextual self-harm interpretations, and the Self-Harm Identification aNd intent Extraction with Supportive emoji sensitivity (SHINES) dataset, offering detailed annotations for self-harm labels, casual mentions (CMs), and serious intents (SIs). Our unified framework: a) enriches inputs using CESM-100; b) fine-tunes LLMs for multi-task learning: self-harm detection (primary) and CM/SI span detection (auxiliary); c) generates explainable rationales for self-harm predictions. We evaluate the framework on three state-of-the-art LLMs (Llama 3, Mental-Alpaca, and MentalLlama) across zero-shot, few-shot, and fine-tuned scenarios. By coupling intent differentiation with contextual cues, our approach commendably enhances LLM performance in both detection and explanation tasks, effectively addressing the inherent ambiguity in self-harm signals. The SHINES dataset, CESM-100 and codebase are publicly available at: https://www.iitp.ac.in/~ai-nlp-ml/resources.html#SHINES.

[206] Parking, Perception, and Retail: Street-Level Determinants of Community Vitality in Harbin

HaoTian Lan

Main category: cs.CL

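TL;DR: The study proposes an interpretable, image-based framework to analyze how the commercial vitality of community streets in Chinese cities relates to vehicular accessibility, environmental quality, and pedestrian perception, finding that moderate vehicle presence helps commerce while excessive parking reduces satisfaction.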

Details

Motivation: To examine how the commercial vitality of community streets relates to street-level features (such as parking density, greenery, and cleanliness) and to inform urban design and planning. Method: Using street view imagery and a multimodal large language model (VisualGLM-6B), the authors build a Community Commercial Vitality Index (CCVI) from Meituan and Dianping data and analyze it against spatial attributes extracted via GPT-4-based perception modeling. Result: Moderate vehicle presence improves commercial access, but excessive on-street parking reduces walkability and satisfaction; greenery and cleanliness significantly raise satisfaction but are only weakly associated with pricing; street width moderates the effect of vehicles. Conclusion: The study demonstrates the value of combining AI-assisted perception with urban morphological analysis and provides theoretical and methodological support for community revitalization. Abstract: The commercial vitality of community-scale streets in Chinese cities is shaped by complex interactions between vehicular accessibility, environmental quality, and pedestrian perception. This study proposes an interpretable, image-based framework to examine how street-level features -- including parked vehicle density, greenery, cleanliness, and street width -- impact retail performance and user satisfaction in Harbin, China. Leveraging street view imagery and a multimodal large language model (VisualGLM-6B), we construct a Community Commercial Vitality Index (CCVI) from Meituan and Dianping data and analyze its relationship with spatial attributes extracted via GPT-4-based perception modeling. Our findings reveal that while moderate vehicle presence may enhance commercial access, excessive on-street parking -- especially in narrow streets -- erodes walkability and reduces both satisfaction and shop-level pricing. In contrast, streets with higher perceived greenery and cleanliness show significantly greater satisfaction scores but only weak associations with pricing. Street width moderates the effects of vehicle presence, underscoring the importance of spatial configuration. These results demonstrate the value of integrating AI-assisted perception with urban morphological analysis to capture non-linear and context-sensitive drivers of commercial success. This study advances both theoretical and methodological frontiers by highlighting the conditional role of vehicle activity in neighborhood commerce and demonstrating the feasibility of multimodal AI for perceptual urban diagnostics. The implications extend to urban design, parking management, and scalable planning tools for community revitalization.

Tianyi Huang,Zikun Cui,Cuiqianhe Du,Chia-En Chiang

Main category: cs.CL

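TL;DR: The paper proposes CL-ISR, a new framework that combines contrastive learning with implicit stance reasoning to improve the accuracy of misleading-text detection on social media.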

Details

Motivation: Misleading texts on social media can cause public misunderstanding, social panic, and economic losses, so detecting them is critical. Method: A contrastive learning algorithm strengthens the model's ability to learn semantic differences between truthful and misleading texts, and an implicit stance reasoning module analyzes the latent stance tendencies in a text and their relationship to related topics. Result: The CL-ISR framework significantly improves misleading-text detection and performs especially well in linguistically complex cases. Conclusion: By combining contrastive learning with implicit stance reasoning, CL-ISR effectively improves misleading-text detection on social media. Abstract: Misleading text detection on social media platforms is a critical research area, as these texts can lead to public misunderstanding, social panic and even economic losses. This paper proposes a novel framework, CL-ISR (Contrastive Learning and Implicit Stance Reasoning), which combines contrastive learning and implicit stance reasoning to improve the detection accuracy of misleading texts on social media. First, we use the contrastive learning algorithm to improve the model's ability to learn semantic differences between truthful and misleading texts. Contrastive learning helps the model better capture the distinguishing features between different categories by constructing positive and negative sample pairs. This approach enables the model to capture distinguishing features more effectively, particularly in linguistically complicated situations. Second, we introduce the implicit stance reasoning module to explore the potential stance tendencies in the text and their relationships with related topics. This method is effective for identifying content that misleads through stance shifting or emotional manipulation, because it can capture the implicit information behind the text. Finally, we integrate these two algorithms to form a new framework, CL-ISR, which leverages the discriminative power of contrastive learning and the interpretive depth of stance reasoning to significantly improve detection performance.

[208] The NTNU System at the S&I Challenge 2025 SLA Open Track

Hong-Yun Lin,Tien-Hong Lo,Yu-Hsuan Fang,Jhen-Ke Lin,Chung-Chun Wang,Hao-Chien Lu,Berlin Chen

Main category: cs.CL

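TL;DR: The paper proposes a system that combines wav2vec 2.0 with the Phi-4 multimodal large language model for spoken language assessment, addressing the respective limitations of BERT and wav2vec 2.0.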

Details

Motivation: BERT and wav2vec 2.0 each have limitations in spoken language assessment: BERT relies on ASR transcripts, which miss prosodic and phonetic cues, and wav2vec 2.0 lacks semantic interpretability. Method: wav2vec 2.0 and the Phi-4 multimodal large language model are integrated through a score fusion strategy. Result: On the Speak & Improve Challenge 2025, the system achieves an RMSE of 0.375 and places second. Conclusion: The proposed system combines the strengths of both models and improves spoken language assessment performance. Abstract: A recent line of research on spoken language assessment (SLA) employs neural models such as BERT and wav2vec 2.0 (W2V) to evaluate speaking proficiency across linguistic and acoustic modalities. Although both models effectively capture features relevant to oral competence, each exhibits modality-specific limitations. BERT-based methods rely on ASR transcripts, which often fail to capture prosodic and phonetic cues for SLA. In contrast, W2V-based methods excel at modeling acoustic features but lack semantic interpretability. To overcome these limitations, we propose a system that integrates W2V with the Phi-4 multimodal large language model (MLLM) through a score fusion strategy. The proposed system achieves a root mean square error (RMSE) of 0.375 on the official test set of the Speak & Improve Challenge 2025, securing second place in the competition. For comparison, the RMSEs of the top-ranked, third-ranked, and official baseline systems are 0.364, 0.384, and 0.444, respectively.
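
The abstract names a "score fusion strategy" and reports RMSE but does not give the fusion rule. As a rough illustration only, the sketch below assumes the simplest option, a fixed linear combination of the two systems' per-speaker scores, and shows how RMSE against reference scores would be computed. The function names, the weight, and the scores in the usage note are made up for the example.

```python
import math


def fuse_scores(acoustic, llm, acoustic_weight=0.5):
    """Linear late fusion of two per-speaker score lists.

    acoustic_weight is a hypothetical tunable; the paper's actual fusion
    rule is not given in the abstract.
    """
    if len(acoustic) != len(llm):
        raise ValueError("score lists must have the same length")
    return [acoustic_weight * a + (1.0 - acoustic_weight) * b
            for a, b in zip(acoustic, llm)]


def rmse(predicted, reference):
    """Root mean square error between predictions and reference scores."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))
```

For instance, fusing illustrative scores `[4.0, 3.0]` and `[5.0, 3.0]` with equal weights gives `[4.5, 3.0]`, and the fusion weight would normally be tuned on a development set by minimizing `rmse`.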

[209] DiCoRe: Enhancing Zero-shot Event Detection via Divergent-Convergent LLM Reasoning

Tanmay Parekh,Kartik Mehta,Ninareh Mehrabi,Kai-Wei Chang,Nanyun Peng

Main category: cs.CL

TL;DR: The DiCoRe framework improves zero-shot event detection through divergent-convergent reasoning, raising average F1 by 4-7%.

Details

Motivation: To address the challenges of understanding complex event ontologies and extracting domain-specific triggers in zero-shot event detection. Method: A framework that pairs Dreamer (divergent reasoning) with Grounder (convergent reasoning), with an LLM-Judge verifying the outputs. Result: DiCoRe outperforms baselines on six datasets, with average F1 gains of 4-7%. Conclusion: DiCoRe is an effective framework for zero-shot event detection. Abstract: Zero-shot Event Detection (ED), the task of identifying event mentions in natural language text without any training data, is critical for document understanding in specialized domains. Understanding the complex event ontology, extracting domain-specific triggers from the passage, and structuring them appropriately overloads and limits the utility of Large Language Models (LLMs) for zero-shot ED. To this end, we propose DiCoRe, a divergent-convergent reasoning framework that decouples the task of ED using Dreamer and Grounder. Dreamer encourages divergent reasoning through open-ended event discovery, which helps to boost event coverage. Conversely, Grounder introduces convergent reasoning to align the free-form predictions with the task-specific instructions using finite-state machine guided constrained decoding. Additionally, an LLM-Judge verifies the final outputs to ensure high precision. Through extensive experiments on six datasets across five domains and nine LLMs, we demonstrate how DiCoRe consistently outperforms prior zero-shot, transfer-learning, and reasoning baselines, achieving 4-7% average F1 gains over the best baseline -- establishing DiCoRe as a strong zero-shot ED framework.

[210] Information Locality as an Inductive Bias for Neural Language Models

Taiga Someya,Anej Svete,Brian DuSell,Timothy J. O'Donnell,Mario Giulianelli,Ryan Cotterell

Main category: cs.CL

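TL;DR: The paper proposes a quantitative framework that uses $m$-local entropy to probe the inductive biases of language models, finding that languages with higher local entropy are harder for Transformer and LSTM models to learn, which suggests that neural language models, like humans, are sensitive to the local statistical structure of a language.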

Details

Motivation: To investigate whether the inductive biases of neural language models align with or diverge from the constraints of human language processing. Method: The authors propose $m$-local entropy, an information-theoretic measure, and validate it in experiments on perturbed natural language corpora and on languages defined by probabilistic finite-state automata. Result: Languages with higher $m$-local entropy are harder for Transformer and LSTM models to learn. Conclusion: Neural language models, like humans, are highly sensitive to the local statistical structure of a language. Abstract: Inductive biases are inherent in every machine learning system, shaping how models generalize from finite data. In the case of neural language models (LMs), debates persist as to whether these biases align with or diverge from human processing constraints. To address this issue, we propose a quantitative framework that allows for controlled investigations into the nature of these biases. Within our framework, we introduce $m$-local entropy, an information-theoretic measure derived from average lossy-context surprisal that captures the local uncertainty of a language by quantifying how effectively the $m-1$ preceding symbols disambiguate the next symbol. In experiments on both perturbed natural language corpora and languages defined by probabilistic finite-state automata (PFSAs), we show that languages with higher $m$-local entropy are more difficult for Transformer and LSTM LMs to learn. These results suggest that neural LMs, much like humans, are highly sensitive to the local statistical structure of a language.
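
To make the idea of "how effectively the $m-1$ preceding symbols disambiguate the next symbol" concrete, the sketch below computes a plug-in (count-based) estimate of the conditional entropy of a symbol given its $m-1$ predecessors. This is an illustrative approximation and not the paper's exact definition, which is derived from average lossy-context surprisal; the function name is hypothetical.

```python
import math
from collections import Counter


def m_local_entropy(sequence, m):
    """Plug-in estimate of H(next symbol | previous m-1 symbols), in bits.

    sequence: a string or list of hashable symbols.
    m:        window size; m=1 gives the unigram entropy.
    """
    if m < 1 or len(sequence) < m:
        raise ValueError("need m >= 1 and at least m symbols")
    context_counts = Counter()
    ngram_counts = Counter()
    for i in range(m - 1, len(sequence)):
        context = tuple(sequence[i - (m - 1):i])
        context_counts[context] += 1
        ngram_counts[context + (sequence[i],)] += 1
    total = sum(ngram_counts.values())
    entropy = 0.0
    for ngram, count in ngram_counts.items():
        p_joint = count / total                      # P(context, symbol)
        p_cond = count / context_counts[ngram[:-1]]  # P(symbol | context)
        entropy -= p_joint * math.log2(p_cond)
    return entropy
```

On the alternating string `"abababab"`, the estimate is 1 bit for `m=1` (two equally likely symbols) and 0 bits for `m=2`, because a single preceding symbol fully determines the next one.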

[211] AudioLens: A Closer Look at Auditory Attribute Perception of Large Audio-Language Models

Chih-Kai Yang,Neo Ho,Yi-Jyun Lee,Hung-yi Lee

Main category: cs.CL

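TL;DR: Using vocabulary projection, the paper analyzes the internal mechanisms of large audio-language models (LALMs), finding that attribute information decreases with layer depth when recognition fails and that resolving attributes at earlier layers correlates with higher accuracy, and it proposes a method for improvement.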

Details

Motivation: Understanding the internal mechanisms of LALMs is crucial for interpreting their behavior and improving performance. Method: Vocabulary projection is applied to three state-of-the-art LALMs to track how attribute information evolves across layers and token positions. Result: Attribute information decreases with layer depth when recognition fails, resolving attributes at earlier layers correlates with higher accuracy, and the models rely on querying auditory inputs rather than aggregating information in hidden states. Conclusion: The study offers new ideas for improving LALMs and sheds light on how auditory attributes are processed. Abstract: Understanding the internal mechanisms of large audio-language models (LALMs) is crucial for interpreting their behavior and improving performance. This work presents the first in-depth analysis of how LALMs internally perceive and recognize auditory attributes. By applying vocabulary projection on three state-of-the-art LALMs, we track how attribute information evolves across layers and token positions. We find that attribute information generally decreases with layer depth when recognition fails, and that resolving attributes at earlier layers correlates with better accuracy. Moreover, LALMs heavily rely on querying auditory inputs for predicting attributes instead of aggregating necessary information in hidden states at attribute-mentioning positions. Based on our findings, we demonstrate a method to enhance LALMs. Our results offer insights into auditory attribute processing, paving the way for future improvements.

[212] Do Large Language Models Judge Error Severity Like Humans?

Diege Sun,Guanyi Chen,Fan Zhao,Xiaorong Cheng,Tingting He

Main category: cs.CL

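TL;DR: The study compares human and LLM assessments of the severity of semantic errors in image descriptions, finding that LLM ratings of gender and colour errors diverge from human ratings, with only Doubao and DeepSeek-V3 performing comparatively well.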

Details

Motivation: To examine whether LLMs can accurately replicate human judgments of error severity. Method: The experimental framework of van Miltenburg et al. is extended to evaluate four error types (age, gender, clothing type, and clothing colour) in unimodal and multimodal settings. Result: Humans assign different severity to different error types, and visual context significantly affects the perceived severity of colour and type errors. LLM ratings of gender and colour errors diverge from human ratings, with only Doubao and DeepSeek-V3 performing comparatively well. Conclusion: LLMs may have internalised social norms that influence gender judgments but lack perceptual grounding for colour, and only a few models approach human-level assessment. Abstract: Large Language Models (LLMs) are increasingly used as automated evaluators in natural language generation, yet it remains unclear whether they can accurately replicate human judgments of error severity. In this study, we systematically compare human and LLM assessments of image descriptions containing controlled semantic errors. We extend the experimental framework of van Miltenburg et al. (2020) to both unimodal (text-only) and multimodal (text + image) settings, evaluating four error types: age, gender, clothing type, and clothing colour. Our findings reveal that humans assign varying levels of severity to different error types, with visual context significantly amplifying perceived severity for colour and type errors. Notably, most LLMs assign low scores to gender errors but disproportionately high scores to colour errors, unlike humans, who judge both as highly severe but for different reasons. This suggests that these models may have internalised social norms influencing gender judgments but lack the perceptual grounding to emulate human sensitivity to colour, which is shaped by distinct neural mechanisms. Only one of the evaluated LLMs, Doubao, replicates the human-like ranking of error severity, but it fails to distinguish between error types as clearly as humans. Surprisingly, DeepSeek-V3, a unimodal LLM, achieves the highest alignment with human judgments across both unimodal and multimodal conditions, outperforming even state-of-the-art multimodal models.

[213] Knowledgeable-r1: Policy Optimization for Knowledge Exploration in Retrieval-Augmented Generation

Chenyu Lin,Yilin Wen,Du Su,Fei Sun,Muhan Chen,Chenfu Bao,Zhonghou Lv

Main category: cs.CL

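TL;DR: The paper proposes Knowledgeable-r1, which uses joint sampling and multiple policy distributions for knowledge capability exploration to balance retrieved context against the model's inherent knowledge in retrieval-augmented generation (RAG), significantly improving robustness and reasoning accuracy.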

Details

Motivation: Current RAG systems rely too heavily on retrieved context and may overlook the model's inherent knowledge, performing poorly when the retrieved information is misleading or redundant. Method: Knowledgeable-r1 uses joint sampling and multiple policy distributions for knowledge capability exploration, encouraging the model to integrate its parametric and contextual knowledge. Result: Experiments show that Knowledgeable-r1 significantly improves performance on parametric-contextual conflict tasks and general RAG tasks, outperforming baselines by 17.07% in counterfactual scenarios. Conclusion: Knowledgeable-r1 effectively balances retrieved and model knowledge, improving the robustness and accuracy of RAG systems. Abstract: Retrieval-augmented generation (RAG) is a mainstream method for improving performance on knowledge-intensive tasks. However, current RAG systems often place too much emphasis on retrieved contexts. This can lead to reliance on inaccurate sources and overlook the model's inherent knowledge, especially when dealing with misleading or excessive information. To resolve this imbalance, we propose Knowledgeable-r1, which uses joint sampling and defines multi-policy distributions in knowledge capability exploration to stimulate large language models' self-integrated utilization of parametric and contextual knowledge. Experiments show that Knowledgeable-r1 significantly enhances robustness and reasoning accuracy in both parametric-contextual conflict tasks and general RAG tasks, especially outperforming baselines by 17.07% in counterfactual scenarios and demonstrating consistent gains across RAG tasks. Our code is available at https://github.com/lcy80366872/knowledgeable-r1.

[214] Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective

Bhavik Chandna,Zubair Bashir,Procheta Sen

Main category: cs.CL

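TL;DR: The paper uses mechanistic interpretability to analyze social, demographic, and gender biases in LLMs, finding that bias-related computation is concentrated in a small number of layers and that removing these components affects other NLP tasks.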

Details

Motivation: Large language models (LLMs) exhibit social, demographic, and gender biases; the study aims to reveal how these biases are structurally represented in the model and what effects they have. Method: Using mechanistic interpretability, the authors analyze GPT-2 and Llama2, identify the internal edges responsible for biased behavior, and assess their stability, localization, and generalizability. Result: Bias-related computation is concentrated in a small number of layers and changes across fine-tuning settings; removing the bias components reduces biased outputs but affects other NLP tasks. Conclusion: Bias is highly localized in the model and shares important components with other tasks, so removing it requires weighing the effect on those tasks. Abstract: Large Language Models (LLMs) are known to exhibit social, demographic, and gender biases, often as a consequence of the data on which they are trained. In this work, we adopt a mechanistic interpretability approach to analyze how such biases are structurally represented within models such as GPT-2 and Llama2. Focusing on demographic and gender biases, we explore different metrics to identify the internal edges responsible for biased behavior. We then assess the stability, localization, and generalizability of these components across dataset and linguistic variations. Through systematic ablations, we demonstrate that bias-related computations are highly localized, often concentrated in a small subset of layers. Moreover, the identified components change across fine-tuning settings, including those unrelated to bias. Finally, we show that removing these components not only reduces biased outputs but also affects other NLP tasks, such as named entity recognition and linguistic acceptability judgment, because of the sharing of important components with these tasks.

[215] ECoRAG: Evidentiality-guided Compression for Long Context RAG

Yeonseok Jeong,Jinsu Kim,Dohyeon Lee,Seung-won Hwang

Main category: cs.CL

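TL;DR: The ECoRAG framework compresses retrieved documents based on evidentiality, improving LLM performance in open-domain question answering while reducing latency and token usage.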

Details

Motivation: Existing retrieval-augmented generation (RAG) compression methods do not filter out non-evidential information, which limits performance. Method: The ECoRAG framework compresses documents based on evidentiality and retrieves more when the evidence is insufficient. Result: Experiments show that ECoRAG outperforms existing compression methods on ODQA tasks and is cost-efficient. Conclusion: By compressing documents down to evidential information, ECoRAG improves both LLM performance and efficiency. Abstract: Large Language Models (LLMs) have shown remarkable performance in Open-Domain Question Answering (ODQA) by leveraging external documents through Retrieval-Augmented Generation (RAG). To reduce the RAG overhead from longer contexts, context compression is necessary. However, prior compression methods do not focus on filtering out non-evidential information, which limits the performance of LLM-based RAG. We thus propose the Evidentiality-guided RAG, or ECoRAG, framework. ECoRAG improves LLM performance by compressing retrieved documents based on evidentiality, ensuring that answer generation is supported by the correct evidence. As an additional step, ECoRAG reflects whether the compressed content provides sufficient evidence, and if not, retrieves more until sufficient. Experiments show that ECoRAG improves LLM performance on ODQA tasks, outperforming existing compression methods. Furthermore, ECoRAG is highly cost-efficient, as it not only reduces latency but also minimizes token usage by retaining only the necessary information to generate the correct answer. Code is available at https://github.com/ldilab/ECoRAG.
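
The compress-then-reflect step described in the abstract can be pictured as a simple control loop. The sketch below is only a schematic under stated assumptions: `retrieve`, `compress`, and `is_sufficient` are hypothetical caller-supplied callables standing in for the retriever, the evidentiality-based compressor, and the sufficiency check, and the growth schedule (`initial_k`, `step`, `max_k`) is invented for illustration. It is not the ECoRAG implementation.

```python
def adaptive_compress(query, retrieve, compress, is_sufficient,
                      initial_k=5, step=5, max_k=20):
    """Retrieve, compress, and widen retrieval until evidence suffices.

    retrieve(query, k)             -> list of the top-k documents
    compress(query, docs)          -> compressed evidence (any object)
    is_sufficient(query, evidence) -> bool
    Returns (evidence, k_used). Stops at max_k even if still insufficient.
    """
    k = initial_k
    while True:
        docs = retrieve(query, k)
        evidence = compress(query, docs)
        if is_sufficient(query, evidence) or k >= max_k:
            return evidence, k
        k = min(k + step, max_k)  # not enough evidence: retrieve more
```

The `max_k` cap matters in practice: without it, a query with no supporting evidence in the corpus would keep the loop running indefinitely.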

[216] Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Yanzhao Zhang,Mingxin Li,Dingkun Long,Xin Zhang,Huan Lin,Baosong Yang,Pengjun Xie,An Yang,Dayiheng Liu,Junyang Lin,Fei Huang,Jingren Zhou

Main category: cs.CL

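TL;DR: The Qwen3 Embedding series, built on the Qwen3 foundation models, substantially improves text embedding and reranking through a multi-stage training pipeline and model merging strategies; it comes in multiple model sizes and reaches state-of-the-art results on multilingual and retrieval tasks.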

Details

Motivation: Improve on the text embedding and reranking capabilities of the GTE-Qwen series, leverage the multilingual understanding of the Qwen3 LLMs, and meet the needs of different deployment scenarios. Method: A multi-stage training pipeline that combines large-scale unsupervised pre-training with supervised fine-tuning on high-quality datasets, together with model merging strategies; the Qwen3 LLMs are also used to generate diverse training data. Result: The Qwen3 Embedding series performs strongly on multilingual benchmarks such as MTEB and on retrieval tasks, and is available in several model sizes (0.6B, 4B, 8B). Conclusion: The Qwen3 Embedding series excels on multilingual and retrieval tasks, and the models are open-sourced to support community research and development. Abstract: In this work, we introduce the Qwen3 Embedding series, a significant advancement over its predecessor, the GTE-Qwen series, in text embedding and reranking capabilities, built upon the Qwen3 foundation models. Leveraging the Qwen3 LLMs' robust capabilities in multilingual text understanding and generation, our innovative multi-stage training pipeline combines large-scale unsupervised pre-training with supervised fine-tuning on high-quality datasets. Effective model merging strategies further ensure the robustness and adaptability of the Qwen3 Embedding series. During the training process, the Qwen3 LLMs serve not only as backbone models but also play a crucial role in synthesizing high-quality, rich, and diverse training data across multiple domains and languages, thus enhancing the training pipeline. The Qwen3 Embedding series offers a spectrum of model sizes (0.6B, 4B, 8B) for both embedding and reranking tasks, addressing diverse deployment scenarios where users can optimize for either efficiency or effectiveness. Empirical evaluations demonstrate that the Qwen3 Embedding series achieves state-of-the-art results across diverse benchmarks. Notably, it excels on the multilingual evaluation benchmark MTEB for text embedding, as well as in various retrieval tasks, including code retrieval, cross-lingual retrieval and multilingual retrieval. To facilitate reproducibility and promote community-driven research and development, the Qwen3 Embedding models are publicly available under the Apache 2.0 license.

[217] Counterfactual reasoning: an analysis of in-context emergence

Moritz Miller,Bernhard Schölkopf,Siyuan Guo

Main category: cs.CL

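TL;DR: The paper studies the in-context counterfactual reasoning ability of large-scale neural language models (LMs), that is, the ability to predict the consequences of changes under hypothetical scenarios. Through synthetic experiments on a linear regression task, it finds that the models can carry out noise abduction, and it shows how self-attention, model depth, and data diversity affect performance.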

Details

Motivation: Explore the ability of language models to reason counterfactually, in particular their potential to predict the consequences of changes under hypothetical scenarios. Method: Synthetic experiments on a linear regression task that require the model to infer and copy the noise from in-context observations in order to reason counterfactually. Result: Language models can reason counterfactually in the controlled setup, and self-attention, model depth, and data diversity have a marked effect on performance. The study also shows that this ability extends to sequential data, pointing towards counterfactual story generation. Conclusion: Language models have the potential for counterfactual reasoning, especially on noise abduction tasks, which provides preliminary evidence for broader applications such as story generation. Abstract: Large-scale neural language models (LMs) exhibit remarkable performance in in-context learning: the ability to learn from and reason about the input context on the fly without parameter updates. This work studies in-context counterfactual reasoning in language models, that is, to predict the consequences of changes under hypothetical scenarios. We focus on studying a well-defined synthetic setup: a linear regression task that requires noise abduction, where accurate prediction is based on inferring and copying the contextual noise from factual observations. We show that language models are capable of counterfactual reasoning in this controlled setup and provide insights that counterfactual reasoning for a broad class of functions can be reduced to a transformation on in-context observations; we find self-attention, model depth, and data diversity in pre-training drive performance in Transformers. More interestingly, our findings extend beyond regression tasks and show that Transformers can perform noise abduction on sequential data, providing preliminary evidence on the potential for counterfactual story generation. Our code is available at https://github.com/moXmiller/counterfactual-reasoning.git .
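
To make "noise abduction" concrete, here is a minimal worked sketch. It assumes a scalar additive-noise model y = beta * x + u with a known coefficient; the function names and the numbers are illustrative assumptions and are not the paper's experimental setup.

```python
def abduct_noise(x, y, beta):
    """Abduction: recover the noise u from a factual observation y = beta * x + u."""
    return y - beta * x


def counterfactual_outcome(x_cf, beta, u):
    """Prediction: reuse (copy) the abducted noise under the hypothetical input x_cf."""
    return beta * x_cf + u


# Illustrative values (assumed, not from the paper).
beta = 2.0
x_factual, y_factual = 3.0, 6.5

u = abduct_noise(x_factual, y_factual, beta)   # noise implied by the factual pair
y_cf = counterfactual_outcome(5.0, beta, u)    # "what would y have been if x were 5?"
```

The point of the sketch is that the counterfactual prediction depends on the factual observation only through the inferred noise term, which is why the task can be framed as inferring and copying that term.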

[218] RELIC: Evaluating Compositional Instruction Following via Language Recognition

Jackson Petty,Michael Y. Hu,Wentao Wang,Shauli Ravfogel,William Merrill,Tal Linzen

Main category: cs.CL

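TL;DR: The RELIC framework evaluates the instruction-following ability of large language models (LLMs) through language recognition tasks; it finds that performance degrades as task complexity increases and that models fall back on shallow heuristics instead of following complex instructions.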

Details

Motivation: Evaluate the ability of LLMs to perform tasks based only on a task specification given in context (instruction following), and design a scalable testing framework. Method: Introduces the RELIC framework, which uses language recognition tasks generated from formal grammars and automatically generates test instances to mitigate data contamination. Result: LLMs perform near chance on complex grammars and samples, and their performance can be predicted from grammar and sample complexity. Conclusion: Current state-of-the-art LLMs show limited performance on complex instruction-following tasks and tend to rely on shallow heuristics. Abstract: Large language models (LLMs) are increasingly expected to perform tasks based only on a specification of the task provided in context, without examples of inputs and outputs; this ability is referred to as instruction following. We introduce the Recognition of Languages In-Context (RELIC) framework to evaluate instruction following using language recognition: the task of determining if a string is generated by a formal grammar. Unlike many standard evaluations of LLMs' ability to use their context, this task requires composing together a large number of instructions (grammar productions) retrieved from the context. Because the languages are synthetic, the task can be increased in complexity as LLMs' skills improve, and new instances can be automatically generated, mitigating data contamination. We evaluate state-of-the-art LLMs on RELIC and find that their accuracy can be reliably predicted from the complexity of the grammar and the individual example strings, and that even the most advanced LLMs currently available show near-chance performance on more complex grammars and samples, in line with theoretical expectations. We also use RELIC to diagnose how LLMs attempt to solve increasingly difficult reasoning tasks, finding that as the complexity of the language recognition task increases, models switch to relying on shallow heuristics instead of following complex instructions.
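
For readers unfamiliar with language recognition, the ground-truth label for such a task can be computed with a standard parser. The sketch below uses the textbook CYK algorithm for grammars in Chomsky normal form; the abstract does not say which recognizer or grammar format RELIC uses, so the grammar encoding and the toy grammar here are assumptions made purely for illustration.

```python
def recognizes(grammar, start, tokens):
    """CYK membership test for a grammar in Chomsky normal form.

    grammar maps each nonterminal to a list of right-hand sides; a right-hand
    side is a 1-tuple (one terminal) or a 2-tuple (two nonterminals).
    """
    n = len(tokens)
    if n == 0:
        return False  # this sketch does not handle the empty string
    # table[i][j] holds the nonterminals that derive tokens[i : i + j + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        for lhs, rhss in grammar.items():
            if (tok,) in rhss:
                table[i][0].add(lhs)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            for split in range(1, span):
                left = table[i][split - 1]
                right = table[i + split][span - split - 1]
                for lhs, rhss in grammar.items():
                    for rhs in rhss:
                        if len(rhs) == 2 and rhs[0] in left and rhs[1] in right:
                            table[i][span - 1].add(lhs)
    return start in table[0][n - 1]


# Hypothetical toy grammar for the language { a^n b^n : n >= 1 }.
TOY_GRAMMAR = {
    "S": [("A", "B"), ("A", "T")],
    "T": [("S", "B")],
    "A": [("a",)],
    "B": [("b",)],
}
```

Deciding membership for even this four-rule grammar requires chaining several productions, which gives a sense of why the task becomes compositional as grammars and strings grow.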

[219] The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Nikhil Kandpal,Brian Lester,Colin Raffel,Sebastian Majstorovic,Stella Biderman,Baber Abbasi,Luca Soldaini,Enrico Shippole,A. Feder Cooper,Aviya Skowron,John Kirchenbauer,Shayne Longpre,Lintang Sutawika,Alon Albalak,Zhenlin Xu,Guilherme Penedo,Loubna Ben Allal,Elie Bakouch,John David Pressman,Honglu Fan,Dashiell Stander,Guangyu Song,Aaron Gokaslan,Tom Goldstein,Brian R. Bartoldson,Bhavya Kailkhura,Tyler Murray

Main category: cs.CL

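TL;DR: The paper presents Common Pile v0.1, an 8TB dataset of openly licensed text for training large language models (LLMs), addressing the problem that existing open datasets are too small or too low in quality.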

Details

Motivation: LLMs are typically trained on unlicensed text, which raises intellectual property and ethical concerns. Openly licensed text datasets are a way forward, but existing ones are too small. Method: Collect, curate, and release Common Pile v0.1, with diverse content from 30 sources. Two 7B-parameter LLMs (Comma v0.1-1T and Comma v0.1-2T) are trained to validate it. Result: The trained models perform comparably to LLMs trained on unlicensed text, such as Llama 1 and 2 7B. Conclusion: Common Pile v0.1 offers a viable route to training LLMs on openly licensed text, and the dataset, code, and model checkpoints are publicly released. Abstract: Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models attain performance competitive with LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.

[220] Improving Low-Resource Morphological Inflection via Self-Supervised Objectives

Adam Wiemerslage,Katharina von der Wense

Main category: cs.CL

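TL;DR: Studies the effectiveness of self-supervised auxiliary tasks for morphological inflection (a character-level task) in extremely low-resource settings; autoencoding performs best when data is very scarce, while character masked language modeling (CMLM) becomes more effective as data increases.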

Details

Motivation: Explore the potential of self-supervised objectives for character-level tasks such as morphological inflection, especially for low-resource languages. Method: Train encoder-decoder transformers for 19 languages and 13 auxiliary objectives, comparing the effect of different self-supervised tasks. Result: Autoencoding performs best when data is very limited, and CMLM becomes more effective as data increases; sampling masks based on known morpheme boundaries consistently improves performance. Conclusion: Self-supervised auxiliary tasks are promising for low-resource morphological modeling, especially CMLM combined with morpheme boundary information. Abstract: Self-supervised objectives have driven major advances in NLP by leveraging large-scale unlabeled data, but such resources are scarce for many of the world's languages. Surprisingly, they have not been explored much for character-level tasks, where smaller amounts of data have the potential to be beneficial. We investigate the effectiveness of self-supervised auxiliary tasks for morphological inflection -- a character-level task highly relevant for language documentation -- in extremely low-resource settings, training encoder-decoder transformers for 19 languages and 13 auxiliary objectives. Autoencoding yields the best performance when unlabeled data is very limited, while character masked language modeling (CMLM) becomes more effective as data availability increases. Though objectives with stronger inductive biases influence model predictions intuitively, they rarely outperform standard CMLM. However, sampling masks based on known morpheme boundaries consistently improves performance, highlighting a promising direction for low-resource morphological modeling.

[221] Towards a Unified System of Representation for Continuity and Discontinuity in Natural Language

Ratna Kandala,Prakash Mondal

Main category: cs.CL

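TL;DR: Proposes a unified system for representing continuous and discontinuous structures in natural language, combining features of Phrase Structure Grammar, Dependency Grammar, and Categorial Grammar.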

Details

Motivation: Address the fact that different grammar formalisms analyze discontinuous structures in non-converging ways. Method: Combine the constituency of Phrase Structure Grammar (PSG), the head-dependent relations of Dependency Grammar (DG), and the functor-argument relations of Categorial Grammar (CG) in a unified mathematical derivation. Result: Shows that both discontinuous and continuous structures can be analyzed through a unified mathematical derivation. Conclusion: Argues that the three grammar formalisms can be used together to analyze linguistic structure, offering a new perspective on discontinuity. Abstract: Syntactic discontinuity is a grammatical phenomenon in which a constituent is split into more than one part because of the insertion of an element which is not part of the constituent. This is observed in many languages across the world such as Turkish, Russian, Japanese, Warlpiri, Navajo, Hopi, Dyirbal, Yidiny, etc. Different formalisms/frameworks in current linguistic theory approach the problem of discontinuous structures in different ways. Each framework/formalism has been widely viewed as an independent and non-converging system of analysis. In this paper, we propose a unified system of representation for both continuity and discontinuity in structures of natural languages by taking into account three formalisms, in particular, Phrase Structure Grammar (PSG) for its widely used notion of constituency, Dependency Grammar (DG) for its head-dependent relations, and Categorial Grammar (CG) for its focus on functor-argument relations. We attempt to show that discontinuous expressions as well as continuous structures can be analysed through a unified mathematical derivation incorporating the representations of linguistic structure in these three grammar formalisms.

[222] CLATTER: Comprehensive Entailment Reasoning for Hallucination Detection

Ron Eliav,Arie Cattan,Eran Hirsch,Shahaf Bassan,Elias Stengel-Eskin,Mohit Bansal,Ido Dagan

Main category: cs.CL

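TL;DR: The paper proposes improving hallucination detection through a systematic reasoning process (decomposing the text, attributing and classifying sub-claims, and aggregating the classifications) and validates its effectiveness.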

Details

Motivation: Existing methods treat hallucination detection as a natural language inference task, but complex reasoning tasks call for a more explicit reasoning process to improve accuracy and enable fine-grained decisions. Method: Defines a three-step reasoning process (claim decomposition, sub-claim attribution and entailment classification, and aggregated classification) and validates it with metrics on the quality of the intermediate steps. Result: Systematic reasoning improves hallucination detection performance, and the intermediate metrics support the improvement in reasoning quality. Conclusion: Guiding models to reason systematically and comprehensively improves the accuracy and granularity of hallucination detection. Abstract: A common approach to hallucination detection casts it as a natural language inference (NLI) task, often using LLMs to classify whether the generated text is entailed by corresponding reference texts. Since entailment classification is a complex reasoning task, one would expect that LLMs could benefit from generating an explicit reasoning process, as in CoT reasoning or the explicit "thinking" of recent reasoning models. In this work, we propose that guiding such models to perform a systematic and comprehensive reasoning process -- one that both decomposes the text into smaller facts and also finds evidence in the source for each fact -- allows models to execute much finer-grained and accurate entailment decisions, leading to increased performance. To that end, we define a 3-step reasoning process, consisting of (i) claim decomposition, (ii) sub-claim attribution and entailment classification, and (iii) aggregated classification, showing that such guided reasoning indeed yields improved hallucination detection. Following this reasoning framework, we introduce an analysis scheme, consisting of several metrics that measure the quality of the intermediate reasoning steps, which provides additional empirical evidence for the improved quality of our guided reasoning scheme.
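
Step (iii) of the process above, aggregated classification, can be illustrated with a small sketch. The abstract does not state the aggregation rule, so the rule below (any contradicted sub-claim makes the claim a contradiction; the claim is entailed only if every sub-claim is entailed; otherwise it is neutral) is an assumption, as are the label strings and the function name.

```python
def aggregate_labels(sub_claim_labels):
    """Combine per-sub-claim entailment labels into one label for the whole claim.

    Assumed rule, for illustration only:
      - any "contradiction"  -> "contradiction"
      - all "entailment"     -> "entailment"
      - anything else        -> "neutral" (at least one unsupported sub-claim)
    """
    if not sub_claim_labels:
        raise ValueError("expected at least one sub-claim label")
    if any(label == "contradiction" for label in sub_claim_labels):
        return "contradiction"
    if all(label == "entailment" for label in sub_claim_labels):
        return "entailment"
    return "neutral"


# Hypothetical output of steps (i) and (ii) for one generated sentence.
labels = ["entailment", "neutral", "entailment"]
verdict = aggregate_labels(labels)
```

Under this rule a single unsupported sub-claim is enough to flag the sentence, which is the fine-grained behavior that decomposition is meant to enable.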

[223] Micro-Act: Mitigate Knowledge Conflict in Question Answering via Actionable Self-Reasoning

Nan Huo,Jinyang Li,Bowen Qin,Ge Qu,Xiaolong Li,Xiaodong Li,Chenhao Ma,Reynold Cheng

Main category: cs.CL

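TL;DR: The paper proposes the Micro-Act framework, which resolves knowledge conflicts in RAG systems through a hierarchical action space and significantly improves accuracy on QA tasks.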

Details

Motivation: In RAG systems, conflicts between externally retrieved knowledge and the inherent knowledge of LLMs hurt downstream task performance, and existing methods have limited effect because of lengthy contexts. Method: Proposes the Micro-Act framework, which automatically perceives context complexity and decomposes knowledge sources into fine-grained comparisons represented as actionable steps. Result: On five benchmark datasets, Micro-Act significantly improves QA accuracy, most notably on temporal and semantic conflict types. Conclusion: Micro-Act not only resolves knowledge conflicts but also performs robustly on non-conflict questions, giving it practical value. Abstract: Retrieval-Augmented Generation (RAG) systems commonly suffer from Knowledge Conflicts, where retrieved external knowledge contradicts the inherent, parametric knowledge of large language models (LLMs). This adversely affects performance on downstream tasks such as question answering (QA). Existing approaches often attempt to mitigate conflicts by directly comparing two knowledge sources in a side-by-side manner, but this can overwhelm LLMs with extraneous or lengthy contexts, ultimately hindering their ability to identify and mitigate inconsistencies. To address this issue, we propose Micro-Act, a framework with a hierarchical action space that automatically perceives context complexity and adaptively decomposes each knowledge source into a sequence of fine-grained comparisons. These comparisons are represented as actionable steps, enabling reasoning beyond the superficial context. Through extensive experiments on five benchmark datasets, Micro-Act consistently achieves a significant increase in QA accuracy over state-of-the-art baselines across all 5 datasets and 3 conflict types, especially in temporal and semantic types where all baselines fail significantly. More importantly, Micro-Act exhibits robust performance on non-conflict questions simultaneously, highlighting its practical value in real-world RAG applications.

[224] ProRefine: Inference-time Prompt Refinement with Textual Feedback

Deepak Pandita,Tharindu Cyril Weerasooriya,Ankit Parag Shah,Christopher M. Homan,Wei Wei

Main category: cs.CL

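TL;DR: ProRefine is an inference-time prompt optimization method that dynamically refines prompts for multi-step reasoning tasks, significantly improving the accuracy and efficiency of collaborating AI agents.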

Details

Motivation: In workflows where multiple AI agents collaborate, poorly designed prompts cause error propagation and degraded performance, limiting the reliability and scalability of these systems. Method: ProRefine uses textual feedback from large language models to refine prompts dynamically, without additional training or ground-truth labels. Result: On five mathematical reasoning benchmark datasets, ProRefine beats zero-shot Chain-of-Thought baselines by 3 to 37 percentage points. Conclusion: ProRefine not only improves accuracy but also lets smaller models match the performance of larger ones, showing potential for efficient, scalable deployment and for broadening access to high-performing AI. Abstract: Agentic workflows, where multiple AI agents collaborate to accomplish complex tasks like reasoning or planning, are becoming increasingly prevalent. However, these workflows often suffer from error propagation and sub-optimal performance, largely due to poorly designed prompts that fail to effectively guide individual agents. This is a critical problem because it limits the reliability and scalability of these powerful systems. We introduce ProRefine, an innovative inference-time prompt optimization method that leverages textual feedback from large language models (LLMs) to address this challenge. ProRefine dynamically refines prompts for multi-step reasoning tasks without additional training or ground truth labels. Evaluated on five benchmark mathematical reasoning datasets, ProRefine significantly surpasses zero-shot Chain-of-Thought baselines by 3 to 37 percentage points. This approach not only boosts accuracy but also allows smaller models to match the performance of larger ones, highlighting its potential for efficient and scalable AI deployment, and democratizing access to high-performing AI.

[225] Constrained Entropic Unlearning: A Primal-Dual Framework for Large Language Models

Taha Entesari,Arman Hatami,Rinat Khaziev,Anil Ramakrishna,Mahyar Fazlyab

Main category: cs.CL

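TL;DR: The paper proposes a new LLM unlearning method that uses constrained optimization and a logit-margin flattening loss to remove sensitive information effectively while preserving performance on retained data.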

Details

Motivation: Existing unlearning methods usually trade off forgetting and retention through regularization, which leads to unstable optimization and degraded performance. Method: Formulates LLM unlearning as a constrained optimization problem, enforcing forgetting with a logit-margin flattening loss and preserving retention through a hard constraint. Result: On the TOFU and MUSE benchmarks, the method matches or exceeds existing baselines, removing the targeted information while preserving downstream performance. Conclusion: The method offers a more efficient and stable solution for LLM unlearning. Abstract: Large Language Models (LLMs) deployed in real-world settings increasingly face the need to unlearn sensitive, outdated, or proprietary information. Existing unlearning methods typically formulate forgetting and retention as a regularized trade-off, combining both objectives into a single scalarized loss. This often leads to unstable optimization and degraded performance on retained data, especially under aggressive forgetting. We propose a new formulation of LLM unlearning as a constrained optimization problem: forgetting is enforced via a novel logit-margin flattening loss that explicitly drives the output distribution toward uniformity on a designated forget set, while retention is preserved through a hard constraint on a separate retain set. Compared to entropy-based objectives, our loss is softmax-free, numerically stable, and maintains non-vanishing gradients, enabling more efficient and robust optimization. We solve the constrained problem using a scalable primal-dual algorithm that exposes the trade-off between forgetting and retention through the dynamics of the dual variable. Evaluations on the TOFU and MUSE benchmarks across diverse LLM architectures demonstrate that our approach consistently matches or exceeds state-of-the-art baselines, effectively removing targeted information while preserving downstream utility.
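
The general shape of a primal-dual scheme for "minimize a forget loss subject to a retain-loss constraint" can be sketched on a toy scalar problem. This is a generic gradient descent-ascent on the Lagrangian, not the paper's algorithm: the quadratic stand-ins for the two losses, the step sizes, and all names are assumptions, and the logit-margin flattening loss itself is not reproduced here.

```python
def primal_dual(forget_grad, retain_loss, retain_grad, eps,
                x0=0.0, lr=0.05, dual_lr=0.05, steps=5000):
    """Solve  min_x forget(x)  s.t.  retain(x) <= eps  by descent-ascent.

    Lagrangian: L(x, lam) = forget(x) + lam * (retain(x) - eps), with lam >= 0.
    """
    x, lam = x0, 0.0
    for _ in range(steps):
        # Primal step: gradient descent on the Lagrangian in x.
        x -= lr * (forget_grad(x) + lam * retain_grad(x))
        # Dual step: projected gradient ascent in lam (kept non-negative).
        lam = max(0.0, lam + dual_lr * (retain_loss(x) - eps))
    return x, lam


# Toy stand-ins (assumed): forget(x) = (x - 2)^2, retain(x) = x^2, budget eps = 1.
x_star, lam_star = primal_dual(
    forget_grad=lambda x: 2.0 * (x - 2.0),
    retain_loss=lambda x: x * x,
    retain_grad=lambda x: 2.0 * x,
    eps=1.0,
)
```

In this toy problem the unconstrained minimizer x = 2 violates the constraint, so the dual variable rises until the iterate settles on the constraint boundary at x = 1 with multiplier 1. The size of the multiplier indicates how strongly the retention constraint is pulling against forgetting, which is the kind of trade-off signal the abstract attributes to the dual variable.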

[226] Search Arena: Analyzing Search-Augmented LLMs

Mihran Miroyan,Tsung-Han Wu,Logan King,Tianle Li,Jiayi Pan,Xinyan Hu,Wei-Lin Chiang,Anastasios N. Angelopoulos,Trevor Darrell,Narges Norouzi,Joseph E. Gonzalez

Main category: cs.CL

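TL;DR: Search Arena is a large-scale, multi-turn human-preference dataset of user interactions, used to analyze user preferences for search-augmented language models; it reveals how the number and source of citations affect those preferences.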

Details

Motivation: Existing datasets are small in scale and narrow in scope, making it hard to analyze search-augmented language models comprehensively, so a larger and more diverse dataset is needed. Method: Crowd-sources over 24,000 paired multi-turn user interactions with around 12,000 human preference votes, and analyzes how the number and source of citations affect preferences. Result: User preferences are influenced by the number of citations, even when the cited content does not directly support the claims; community-driven platforms are generally preferred, and static encyclopedic sources are not always reliable. Web search does not degrade performance in non-search settings, but quality in search settings drops significantly when models rely solely on parametric knowledge. Conclusion: Search Arena is an important resource for future research; it reveals the relationship between user preferences and citation behavior and shows how search-augmented models perform across settings. Abstract: Search-augmented language models combine web search with Large Language Models (LLMs) to improve response groundedness and freshness. However, analyzing these systems remains challenging: existing datasets are limited in scale and narrow in scope, often constrained to static, single-turn, fact-checking questions. In this work, we introduce Search Arena, a crowd-sourced, large-scale, human-preference dataset of over 24,000 paired multi-turn user interactions with search-augmented LLMs. The dataset spans diverse intents and languages, and contains full system traces with around 12,000 human preference votes. Our analysis reveals that user preferences are influenced by the number of citations, even when the cited content does not directly support the attributed claims, uncovering a gap between perceived and actual credibility. Furthermore, user preferences vary across cited sources, revealing that community-driven platforms are generally preferred and static encyclopedic sources are not always appropriate and reliable. To assess performance across different settings, we conduct cross-arena analyses by testing search-augmented LLMs in a general-purpose chat environment and conventional LLMs in search-intensive settings. We find that web search does not degrade and may even improve performance in non-search settings; however, the quality in search settings is significantly affected if solely relying on the model's parametric knowledge. We open-sourced the dataset to support future research in this direction. Our dataset and code are available at: https://github.com/lmarena/search-arena.

[227] Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models

Anirudh Bharadwaj,Chaitanya Malaviya,Nitish Joshi,Mark Yatskar

Main category: cs.CL

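TL;DR: Language models used for preference evaluation show systematic biases, over-relying on superficial features such as length and structure, which leads to reward hacking and unreliable evaluations. The study links these biases to artifacts in the training data and proposes a post-training method based on counterfactual data augmentation that reduces them.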

Details

Motivation: Language models serve as proxies for human preference judgments, but they exhibit systematic biases that undermine evaluation reliability. The study investigates the relationship between training data biases and preference model miscalibration and proposes a mitigation. Method: Uses controlled counterfactual pairs to quantify how much preference models rely on biased features, and proposes a post-training method based on counterfactual data augmentation (CDA) to reduce the bias. Result: Preference models favor the bias-magnified response in more than 60% of instances and show roughly 40% miscalibration relative to human preferences. CDA reduces average miscalibration from 39.4% to 32.5% and average absolute skew difference from 20.5% to 10.0%. Conclusion: Counterfactual data augmentation effectively reduces the biases of preference models and improves evaluation reliability while maintaining overall performance. Abstract: Language models serve as proxies for human preference judgements in alignment and evaluation, yet they exhibit systematic miscalibration, prioritizing superficial patterns over substantive qualities. This bias manifests as overreliance on features like length, structure, and style, leading to issues like reward hacking and unreliable evaluations. Evidence suggests these biases originate in artifacts in human training data. In this work, we systematically investigate the relationship between training data biases and preference model miscalibration across five idiosyncratic features of language model generations: length, structure, jargon, sycophancy and vagueness. Using controlled counterfactual pairs, we first quantify the extent to which preference models favor responses with magnified biases (skew), finding this preference occurs in >60% of instances, and model preferences show high miscalibration (~40%) compared to human preferences. Notably, bias features only show mild negative correlations to human preference labels (mean r_human = -0.12) but show moderately strong positive correlations with labels from a strong reward model (mean r_model = +0.36), suggesting that models may overrely on spurious cues. To mitigate these issues, we propose a simple post-training method based on counterfactual data augmentation (CDA) using synthesized contrastive examples. Finetuning models with CDA reduces average miscalibration from 39.4% to 32.5% and average absolute skew difference from 20.5% to 10.0%, while maintaining overall RewardBench performance, showing that targeted debiasing is effective for building reliable preference models.
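
A small sketch may help with the "skew" terminology. It assumes that skew is the fraction of counterfactual pairs in which a judge prefers the bias-magnified response, and that the "absolute skew difference" compares the model's skew with the human skew on the same pairs; the abstract does not spell out these formulas, so both definitions, the function names, and the toy votes are assumptions.

```python
def skew(prefers_biased):
    """Fraction of counterfactual pairs where the bias-magnified response is preferred."""
    if not prefers_biased:
        raise ValueError("expected at least one pair")
    return sum(prefers_biased) / len(prefers_biased)


def absolute_skew_difference(model_votes, human_votes):
    """Gap between model skew and human skew on the same pairs (assumed definition)."""
    return abs(skew(model_votes) - skew(human_votes))


# Toy votes on five pairs for one feature, e.g. length (illustrative, not real data).
model_votes = [True, True, True, False, True]    # model picks the longer answer 4/5 times
human_votes = [True, False, False, False, True]  # humans pick it 2/5 times

model_skew = skew(model_votes)
gap = absolute_skew_difference(model_votes, human_votes)
```

A judge with no preference for the magnified feature would be expected to have a skew near 0.5 on balanced pairs, and a debiasing method aims to shrink the gap to the human rate.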

Zhaolu Kang,Junhao Gong,Jiaxu Yan,Wanke Xia,Yian Wang,Ziwen Wang,Huaxuan Ding,Zhuo Cheng,Wenhao Cao,Zhiyuan Feng,Siqi He,Shannan Yan,Junzhe Chen,Xiaomin He,Chaoya Jiang,Wei Ye,Kaidong Yu,Xuelong Li

Main category: cs.CL

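TL;DR: HSSBench is a benchmark designed to evaluate multimodal large language models (MLLMs) on Humanities and Social Sciences (HSS) tasks, filling a gap in existing evaluations.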

Details

Motivation: Existing MLLM evaluations focus mainly on STEM fields and overlook the interdisciplinary thinking and knowledge integration that HSS requires. Method: Presents HSSBench, with over 13,000 carefully designed samples covering six categories, generated through collaboration between domain experts and automated agents. Result: Tests on more than 20 mainstream MLLMs show that HSSBench remains challenging for current models. Conclusion: HSSBench is expected to stimulate research on the cross-disciplinary reasoning abilities of MLLMs. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant potential to advance a broad range of domains. However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences (HSS). Tasks in the HSS domain require more horizontal, interdisciplinary thinking and a deep integration of knowledge across related fields, which presents unique challenges for MLLMs, particularly in linking abstract concepts with corresponding visual representations. Addressing this gap, we present HSSBench, a dedicated benchmark designed to assess the capabilities of MLLMs on HSS tasks in multiple languages, including the six official languages of the United Nations. We also introduce a novel data generation pipeline tailored for HSS scenarios, in which multiple domain experts and automated agents collaborate to generate and iteratively refine each sample. HSSBench contains over 13,000 meticulously designed samples, covering six key categories. We benchmark more than 20 mainstream MLLMs on HSSBench and demonstrate that it poses significant challenges even for state-of-the-art models. We hope that this benchmark will inspire further research into enhancing the cross-disciplinary reasoning abilities of MLLMs, especially their capacity to internalize and connect knowledge across fields.

cs.AI [Back]

[229] Contextual Integrity in LLMs via Reasoning and Reinforcement Learning

Guangchen Lan,Huseyin A. Inan,Sahar Abdelnabi,Janardhan Kulkarni,Lukas Wutschitz,Reza Shokri,Christopher G. Brinton,Robert Sim

Main category: cs.AI

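TL;DR: The paper proposes a reinforcement learning framework that improves the ability of large language models to maintain contextual integrity (CI) while carrying out tasks, substantially reducing inappropriate information disclosure.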

Details

Motivation: As autonomous agents that make decisions on behalf of users become widespread, ensuring contextual integrity (CI), that is, deciding what information is appropriate to share in a given task, becomes a central question. Method: First prompt LLMs to reason explicitly about CI, then develop a reinforcement learning framework that further trains models to achieve CI, using a synthetic dataset of about 700 diverse examples. Result: The method substantially reduces inappropriate information disclosure while maintaining task performance, and the improvements transfer to human-annotated CI benchmarks such as PrivacyLens. Conclusion: The reinforcement learning framework effectively improves the contextual integrity reasoning of models on complex tasks and generalizes beyond the training data. Abstract: As the era of autonomous agents making decisions on behalf of users unfolds, ensuring contextual integrity (CI) -- what is the appropriate information to share while carrying out a certain task -- becomes a central question to the field. We posit that CI demands a form of reasoning where the agent needs to reason about the context in which it is operating. To test this, we first prompt LLMs to reason explicitly about CI when deciding what information to disclose. We then extend this approach by developing a reinforcement learning (RL) framework that further instills in models the reasoning necessary to achieve CI. Using a synthetic, automatically created dataset of only about 700 examples but with diverse contexts and information disclosure norms, we show that our method substantially reduces inappropriate information disclosure while maintaining task performance across multiple model sizes and families. Importantly, improvements transfer from this synthetic dataset to established CI benchmarks such as PrivacyLens, which has human annotations and evaluates privacy leakage of AI assistants in actions and tool calls.

[230] A Graph-Retrieval-Augmented Generation Framework Enhances Decision-Making in the Circular Economy

Yang Zhao,Chengxiao Dai,Dusit Niyato,Chuan Fu Tan,Keyi Xiang,Yueyang Wang,Zhiquan Yeo,Daren Tan Zong Loong,Jonathan Low Zhaozhi,Eugene H. Z. HO

Main category: cs.AI

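TL;DR: CircuGraphRAG is a knowledge-graph-based retrieval-augmented generation framework that improves the accuracy and reliability of large language models in the circular economy domain.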

Details

Motivation: Large language models often produce incorrect industrial codes and emission factors in sustainable manufacturing, which undermines decisions. Method: CircuGraphRAG grounds outputs in a domain knowledge graph, using multi-hop reasoning and SPARQL queries to retrieve verified subgraphs. Result: Strong performance on question answering (ROUGE-L F1 up to 1.0) and better efficiency (response time halved, token usage reduced by 16%). Conclusion: CircuGraphRAG provides reliable support for circular economy planning and advances low-carbon resource decision making. Abstract: Large language models (LLMs) hold promise for sustainable manufacturing, but often hallucinate industrial codes and emission factors, undermining regulatory and investment decisions. We introduce CircuGraphRAG, a retrieval-augmented generation (RAG) framework that grounds LLM outputs in a domain-specific knowledge graph for the circular economy. This graph connects 117,380 industrial and waste entities with classification codes and GWP100 emission data, enabling structured multi-hop reasoning. Natural language queries are translated into SPARQL and verified subgraphs are retrieved to ensure accuracy and traceability. Compared with Standalone LLMs and Naive RAG, CircuGraphRAG achieves superior performance in single-hop and multi-hop question answering, with ROUGE-L F1 scores up to 1.0, while baselines score below 0.08. It also improves efficiency, halving the response time and reducing token usage by 16% in representative tasks. CircuGraphRAG provides fact-checked, regulatory-ready support for circular economy planning, advancing reliable, low-carbon resource decision making.
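
Since the headline numbers are ROUGE-L F1 scores, a minimal sketch of that metric may be useful. It uses the usual definition based on the longest common subsequence (LCS) of tokens, with precision LCS/|candidate|, recall LCS/|reference|, and their harmonic mean. Whitespace tokenization, the absence of stemming, and the example strings are simplifying assumptions; the paper's exact scoring setup is not described in the abstract.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 on whitespace tokens (no stemming; simplified sketch)."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Illustrative strings (not from the paper's data).
exact = rouge_l_f1("steel scrap recycling", "steel scrap recycling")
partial = rouge_l_f1("the cat sat", "the dog sat")
```

A score of 1.0 therefore means the generated answer matches the reference token for token, which is plausible for short factual answers such as codes or numeric factors.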

[231] Matter-of-Fact: A Benchmark for Verifying the Feasibility of Literature-Supported Claims in Materials Science

Peter Jansen,Samiah Hassan,Ruoyao Wang

Main category: cs.AI

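TL;DR: The paper introduces Matter-of-Fact, a dataset for assessing the feasibility of hypotheses framed as claims, with the aim of improving hypothesis filtering in scientific discovery systems.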

Details

Motivation: Automated experiments are costly, whereas hypotheses are comparatively cheap to generate. Filtering for feasible hypotheses can improve the efficiency and success rate of scientific discovery systems. Method: Builds a dataset of 8.4k scientific claims spanning four materials science topics and tests baselines including retrieval-augmented generation and code generation. Result: Baselines do not exceed 72% performance (chance is 50%), while expert verification suggests nearly all claims are solvable, exposing the limits of current models. Conclusion: The task is challenging for existing models, but near-term progress could accelerate scientific discovery. Abstract: Contemporary approaches to assisted scientific discovery use language models to automatically generate large numbers of potential hypotheses to test, while also automatically generating code-based experiments to test those hypotheses. While hypotheses can be comparatively inexpensive to generate, automated experiments can be costly, particularly when run at scale (i.e., thousands of experiments). Developing the capacity to filter hypotheses based on their feasibility would allow discovery systems to run at scale, while increasing their likelihood of making significant discoveries. In this work we introduce Matter-of-Fact, a challenge dataset for determining the feasibility of hypotheses framed as claims. Matter-of-Fact includes 8.4k claims extracted from scientific articles spanning four high-impact contemporary materials science topics, including superconductors, semiconductors, batteries, and aerospace materials, while including qualitative and quantitative claims from theoretical, experimental, and code/simulation results. We show that strong baselines that include retrieval augmented generation over scientific literature and code generation fail to exceed 72% performance on this task (chance performance is 50%), while domain-expert verification suggests nearly all are solvable -- highlighting both the difficulty of this task for current models, and the potential to accelerate scientific discovery by making near-term progress.

[232] Evaluation is All You Need: Strategic Overclaiming of LLM Reasoning Capabilities Through Evaluation Design

Lin Sun,Weihong Lin,Jinzhu Wu,Yongfu Zhu,Xiaoqi Jian,Guangxiang Zhao,Change Jia,Linglin Zhang,Sai-er Hu,Yuhan Wu,Xiangzheng Zhang

Main category: cs.AI

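TL;DR: The study finds that evaluation results for the Deepseek-R1-Distill series and its derivatives fluctuate significantly, and calls for a more rigorous evaluation paradigm.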

Details

Motivation: Expose the instability of evaluation results for current reasoning models and promote more reliable performance evaluation. Method: Empirical evaluation of performance fluctuations in the Deepseek-R1-Distill series. Result: Evaluation results are strongly affected by subtle changes in conditions, making claimed performance improvements difficult to reproduce. Conclusion: A more rigorous model evaluation paradigm is needed to ensure reliable results. Abstract: Reasoning models represented by the Deepseek-R1-Distill series have been widely adopted by the open-source community due to their strong performance in mathematics, science, programming, and other domains. However, our study reveals that their benchmark evaluation results are subject to significant fluctuations caused by various factors. Subtle differences in evaluation conditions can lead to substantial variations in results. Similar phenomena are observed in other open-source inference models fine-tuned based on the Deepseek-R1-Distill series, as well as in the QwQ-32B model, making their claimed performance improvements difficult to reproduce reliably. Therefore, we advocate for the establishment of a more rigorous paradigm for model performance evaluation and present our empirical assessments of the Deepseek-R1-Distill series models.

[233] When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models

Kai Wang,Yihao Zhang,Meng Sun

Main category: cs.AI

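TL;DR: The paper studies strategic deception in large language models (LLMs), using representation engineering to induce, detect, and control it, and provides tools for more trustworthy AI.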

Details

Motivation: As advanced LLMs with chain-of-thought (CoT) reasoning may strategically deceive humans, studying this deceptive behavior is a critical alignment challenge. Method: Uses representation engineering and Linear Artificial Tomography (LAT) to systematically induce, detect, and control deception, extracting "deception vectors" that reach 89% detection accuracy. Result: Activation steering achieves a 40% success rate in eliciting context-appropriate deception without explicit prompts. Conclusion: Reveals an honesty-related issue specific to reasoning models and provides tools for trustworthy AI alignment. Abstract: The honesty of large language models (LLMs) is a critical alignment challenge, especially as advanced systems with chain-of-thought (CoT) reasoning may strategically deceive humans. Unlike traditional honesty issues in LLMs, which could possibly be explained as some kind of hallucination, those models' explicit thought paths enable us to study strategic deception -- goal-driven, intentional misinformation where reasoning contradicts outputs. Using representation engineering, we systematically induce, detect, and control such deception in CoT-enabled LLMs, extracting "deception vectors" via Linear Artificial Tomography (LAT) for 89% detection accuracy. Through activation steering, we achieve a 40% success rate in eliciting context-appropriate deception without explicit prompts, unveiling the specific honesty-related issue of reasoning models and providing tools for trustworthy AI alignment.

[234] LLM-First Search: Self-Guided Exploration of the Solution Space

Nathan Herr,Tim Rocktäschel,Roberta Raileanu

Main category: cs.AI

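TL;DR: The paper proposes LLM-First Search (LFS), a method that lets the large language model (LLM) control the search process itself, with no predefined search strategy, enabling more flexible and context-sensitive problem solving.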

Details

Motivation: Existing methods such as MCTS rely on fixed exploration hyperparameters and adapt poorly to tasks of varying difficulty, which limits their practical use. LFS aims to remove the dependence on external heuristics and hardcoded policies by letting the LLM guide the search itself. Method: LFS is an LLM self-guided search method in which the model uses its internal scoring mechanisms to decide whether to continue along the current search path or explore other branches, with no manual tuning or task-specific adaptation. Result: On Countdown and Sudoku, LFS outperforms ToT-BFS, BestFS, and MCTS, particularly on the more challenging tasks; it is also more computationally efficient and scales better with stronger models and larger compute budgets. Conclusion: By letting the LLM control the search process, LFS offers a more flexible, efficient, and scalable approach to problem solving that applies to a range of reasoning tasks. Abstract: Large Language Models (LLMs) have demonstrated remarkable improvements in reasoning and planning through increased test-time compute, often by framing problem-solving as a search process. While methods like Monte Carlo Tree Search (MCTS) have proven effective in some domains, their reliance on fixed exploration hyperparameters limits their adaptability across tasks of varying difficulty, rendering them impractical or expensive in certain settings. In this paper, we propose LLM-First Search (LFS), a novel LLM Self-Guided Search method that removes the need for pre-defined search strategies by empowering the LLM to autonomously control the search process via self-guided exploration. Rather than relying on external heuristics or hardcoded policies, the LLM evaluates whether to pursue the current search path or explore alternative branches based on its internal scoring mechanisms. This enables more flexible and context-sensitive reasoning without requiring manual tuning or task-specific adaptation. We evaluate LFS on Countdown and Sudoku against three classic widely-used search algorithms, Tree-of-Thoughts' Breadth First Search (ToT-BFS), Best First Search (BestFS), and MCTS, each of which has been used to achieve SotA results on a range of challenging reasoning tasks. We found that LFS (1) performs better on more challenging tasks without additional tuning, (2) is more computationally efficient compared to the other methods, especially when powered by a stronger model, (3) scales better with stronger models, due to its LLM-First design, and (4) scales better with increased compute budget. Our code is publicly available at https://github.com/NathanHerr/LLM-First-Search.

[235] Ontology-based knowledge representation for bone disease diagnosis: a foundation for safe and sustainable medical artificial intelligence systems

Loan Dao,Ngoc Quoc Ly

Main category: cs.AI

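TL;DR: The study proposes an ontology-based framework for bone disease diagnosis that combines a hierarchical neural network, a visual question answering system, and a multimodal deep learning model, aiming to improve the diagnostic reliability of medical AI systems.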

Details

Motivation: Medical AI systems often lack systematic integration of domain expertise, which can compromise diagnostic reliability. Method: Develops an ontology-based framework comprising a hierarchical neural network, a VQA system, and a multimodal deep learning model, incorporating visual language models and clinical data. Result: The framework shows potential for bone disease diagnosis, and its standardized structure and reusable components allow extension to other diseases. Conclusion: The theoretical foundations are in place, but experimental validation is still pending. Future work will expand the clinical dataset and carry out systematic validation. Abstract: Medical artificial intelligence (AI) systems frequently lack systematic domain expertise integration, potentially compromising diagnostic reliability. This study presents an ontology-based framework for bone disease diagnosis, developed in collaboration with Ho Chi Minh City Hospital for Traumatology and Orthopedics. The framework introduces three theoretical contributions: (1) a hierarchical neural network architecture guided by bone disease ontology for segmentation-classification tasks, incorporating Visual Language Models (VLMs) through prompts, (2) an ontology-enhanced Visual Question Answering (VQA) system for clinical reasoning, and (3) a multimodal deep learning model that integrates imaging, clinical, and laboratory data through ontological relationships. The methodology maintains clinical interpretability through systematic knowledge digitization, standardized medical terminology mapping, and modular architecture design. The framework demonstrates potential for extension beyond bone diseases through its standardized structure and reusable components. While theoretical foundations are established, experimental validation remains pending due to current dataset and computational resource limitations. Future work will focus on expanding the clinical dataset and conducting comprehensive system validation.

eess.AS

[236] Can we reconstruct a dysarthric voice with the large speech model Parler TTS?

Ariadna Sanchez,Simon King

Main category: eess.AS

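TL;DR: The paper investigates using the large speech model Parler TTS to reconstruct the voices of people with dysarthria, aiming for intelligible speech that preserves speaker identity; the model struggles to control intelligibility and to keep speaker identity consistent.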

Details

Motivation: People with dysarthria have difficulty communicating; personalised text-to-speech is a potential communication aid that aims to reconstruct the speaker's voice from before the onset of the condition. Method: Fine-tunes Parler TTS on a curated dataset annotated with speaker and intelligibility information to generate an approximation of the speaker's pre-onset voice. Result: The model learns to generate speech from this challenging data but struggles to control intelligibility and to maintain consistent speaker identity. Conclusion: Future work should improve the controllability of this class of model for the voice reconstruction task. Abstract: Speech disorders can make communication hard or even impossible for those who develop them. Personalised Text-to-Speech is an attractive option as a communication aid. We attempt voice reconstruction using a large speech model, with which we generate an approximation of a dysarthric speaker's voice prior to the onset of their condition. In particular, we investigate whether a state-of-the-art large speech model, Parler TTS, can generate intelligible speech while maintaining speaker identity. We curate a dataset and annotate it with relevant speaker and intelligibility information, and use this to fine-tune the model. Our results show that the model can indeed learn to generate from the distribution of this challenging data, but struggles to control intelligibility and to maintain consistent speaker identity. We propose future directions to improve controllability of this class of model, for the voice reconstruction task.

[237] Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model

Haibin Wu,Yuxuan Hu,Ruchao Fan,Xiaofei Wang,Kenichi Kumatani,Bo Ren,Jianwei Yu,Heng Lu,Lijuan Wang,Yao Qian,Jinyu Li

Main category: eess.AS

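TL;DR: The paper compares joint speech-text decoding strategies and proposes a new early-stop interleaved (ESI) pattern that significantly accelerates decoding and slightly improves performance.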

Details

Motivation: To study how the joint speech-text decoding strategy affects performance, efficiency, and alignment quality, in order to improve speech language models for spoken dialogue systems. Method: Systematically compares the interleaved and parallel generation paradigms under the same base model, speech tokenizer, and training data, and proposes the ESI pattern. Result: The interleaved approach gives the best alignment but is slow at inference; the ESI pattern significantly accelerates decoding and performs slightly better. Conclusion: ESI is an efficient, well-performing joint decoding strategy, and curated high-quality QA datasets further improve speech QA performance. Abstract: Speech language models (Speech LMs) enable end-to-end speech-text modelling within a single model, offering a promising direction for spoken dialogue systems. The choice of joint speech-text decoding paradigm plays a critical role in performance, efficiency, and alignment quality. In this work, we systematically compare representative joint speech-text decoding strategies, including the interleaved and parallel generation paradigms, under a controlled experimental setup using the same base language model, speech tokenizer and training data. Our results show that the interleaved approach achieves the best alignment. However, it suffers from slow inference due to long token sequence length. To address this, we propose a novel early-stop interleaved (ESI) pattern that not only significantly accelerates decoding but also yields slightly better performance. Additionally, we curate high-quality question answering (QA) datasets to further improve speech QA performance.

[238] EMO-Debias: Benchmarking Gender Debiasing Techniques in Multi-Label Speech Emotion Recognition

Yi-Cheng Lin,Huang-Cheng Chou,Yu-Hsuan Li Liang,Hung-yi Lee

Main category: eess.AS

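TL;DR: The paper presents EMO-Debias, a benchmark that compares 13 debiasing methods for multi-label speech emotion recognition and analyzes the trade-off between fairness and accuracy.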

Details

Motivation: Speech emotion recognition systems exhibit gender bias, and the effectiveness and robustness of existing debiasing methods in multi-label settings remain underexplored. Method: Presents EMO-Debias, a comparison of 13 debiasing methods (spanning pre-processing, regularization, adversarial learning, and others) evaluated under gender imbalance. Result: The experiments quantify the fairness-accuracy trade-off and identify methods that narrow the gender performance gap without hurting overall performance. Conclusion: The study offers practical guidance for choosing effective debiasing strategies and highlights the impact of dataset distributions. Abstract: Speech emotion recognition (SER) systems often exhibit gender bias. However, the effectiveness and robustness of existing debiasing methods in such multi-label scenarios remain underexplored. To address this gap, we present EMO-Debias, a large-scale comparison of 13 debiasing methods applied to multi-label SER. Our study encompasses techniques from pre-processing, regularization, adversarial learning, biased learners, and distributionally robust optimization. Experiments conducted on acted and naturalistic emotion datasets, using WavLM and XLSR representations, evaluate each method under conditions of gender imbalance. Our analysis quantifies the trade-offs between fairness and accuracy, identifying which approaches consistently reduce gender performance gaps without compromising overall model performance. The findings provide actionable insights for selecting effective debiasing strategies and highlight the impact of dataset distributions.

cs.CR

[239] Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets

Lei Hsiung,Tianyu Pang,Yung-Chen Tang,Linyue Song,Tsung-Yi Ho,Pin-Yu Chen,Yaoqing Yang

Main category: cs.CR

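TL;DR: The study finds that representation similarity between upstream safety-alignment datasets and downstream fine-tuning tasks strongly affects model safety: high similarity weakens the guardrails, while low similarity makes the model more robust.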

Details

Motivation: To examine how upstream safety-alignment data affects safety guardrails during fine-tuning, a factor that existing mitigation methods overlook. Method: Experimentally analyzes how representation similarity between upstream alignment datasets and downstream fine-tuning tasks affects safety guardrails. Result: High similarity weakens the guardrails; low similarity yields more robust models and reduces the harmfulness score by up to 10.33%. Conclusion: Upstream dataset design is crucial for building durable safety guardrails, and the findings give fine-tuning service providers actionable guidance. Abstract: Recent advancements in large language models (LLMs) have underscored their vulnerability to safety alignment jailbreaks, particularly when subjected to downstream fine-tuning. However, existing mitigation strategies primarily focus on reactively addressing jailbreak incidents after safety guardrails have been compromised, removing harmful gradients during fine-tuning, or continuously reinforcing safety alignment throughout fine-tuning. As such, they tend to overlook a critical upstream factor: the role of the original safety-alignment data. This paper therefore investigates the degradation of safety guardrails through the lens of representation similarity between upstream alignment datasets and downstream fine-tuning tasks. Our experiments demonstrate that high similarity between these datasets significantly weakens safety guardrails, making models more susceptible to jailbreaks. Conversely, low similarity between these two types of datasets yields substantially more robust models and thus reduces harmfulness score by up to 10.33%. By highlighting the importance of upstream dataset design in the building of durable safety guardrails and reducing real-world vulnerability to jailbreak attacks, these findings offer actionable insights for fine-tuning service providers.

cs.MA

[240] Time to Talk: LLM Agents for Asynchronous Group Communication in Mafia Games

Niv Eckhaus,Uri Berger,Gabriel Stanovsky

Main category: cs.MA

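TL;DR: The paper develops an asynchronous LLM agent that decides when to speak, not only what to say. Evaluated on a dataset of online Mafia games, the agent performs on par with human players, and the data is released to support future research.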

Details

Motivation: Many real-world settings (such as group chats, team meetings, or social games) are asynchronous and have no explicit turn-taking, so deciding when to speak is a key part of a participant's decision making. Method: Develops an adaptive asynchronous LLM agent that decides when to speak, and evaluates it on a dataset of online Mafia games that includes both human players and the agent. Result: The agent is on par with humans in game performance and in blending in with human players; its timing behavior resembles human patterns, though its message content differs. Conclusion: The work paves the way for integrating LLMs into realistic human group settings, from assistance in team discussions to educational and professional environments where complex social dynamics must be navigated. Abstract: LLMs are used predominantly in synchronous communication, where a human user and a model communicate in alternating turns. In contrast, many real-world settings are inherently asynchronous. For example, in group chats, online team meetings, or social games, there is no inherent notion of turns; therefore, the decision of when to speak forms a crucial part of the participant's decision making. In this work, we develop an adaptive asynchronous LLM-agent which, in addition to determining what to say, also decides when to say it. To evaluate our agent, we collect a unique dataset of online Mafia games, including both human participants, as well as our asynchronous agent. Overall, our agent performs on par with human players, both in game performance, as well as in its ability to blend in with the other human players. Our analysis shows that the agent's behavior in deciding when to speak closely mirrors human patterns, although differences emerge in message content. We release all our data and code to support and encourage further research for more realistic asynchronous communication between LLM agents. This work paves the way for integration of LLMs into realistic human group settings, from assistance in team discussions to educational and professional environments where complex social dynamics must be navigated.

cs.LG

[241] Behavioural vs. Representational Systematicity in End-to-End Models: An Opinionated Survey

Ivan Vegner,Sydelle de Souza,Valentin Forch,Martha Lewis,Leonidas A. A. Doumas

Main category: cs.LG

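TL;DR: The paper discusses the importance of systematicity in machine learning models, distinguishes behavioural from representational systematicity, and analyzes the limitations of existing benchmarks.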

Details

Motivation: Systematicity is a core aspect of compositionality and is crucial for generalization. Existing work focuses mostly on behavioural systematicity and neglects representational systematicity; the paper aims to emphasize this distinction. Method: Building on Hadley's (1994) taxonomy, analyzes the extent to which key benchmarks in language and vision test behavioural systematicity, and discusses ways of assessing representational systematicity. Result: Existing benchmarks mainly test behavioural systematicity and do little to assess representational systematicity. The paper highlights ways of assessing representational systematicity drawn from mechanistic interpretability. Conclusion: The paper calls for a more comprehensive assessment of systematicity, especially of representations, to advance generalization on complex tasks. Abstract: A core aspect of compositionality, systematicity is a desirable property in ML models as it enables strong generalization to novel contexts. This has led to numerous studies proposing benchmarks to assess systematic generalization, as well as models and training regimes designed to enhance it. Many of these efforts are framed as addressing the challenge posed by Fodor and Pylyshyn. However, while they argue for systematicity of representations, existing benchmarks and models primarily focus on the systematicity of behaviour. We emphasize the crucial nature of this distinction. Furthermore, building on Hadley's (1994) taxonomy of systematic generalization, we analyze the extent to which behavioural systematicity is tested by key benchmarks in the literature across language and vision. Finally, we highlight ways of assessing systematicity of representations in ML models as practiced in the field of mechanistic interpretability.

[242] Clustering and Median Aggregation Improve Differentially Private Inference

Kareem Amin,Salman Avestimehr,Sara Babakniya,Alex Bie,Weiwei Kong,Natalia Ponomareva,Umar Syed

Main category: cs.LG

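TL;DR: The paper proposes an improved method for differentially private language model inference that clusters the input data and aggregates token predictions with medians, improving generated-text quality and privacy protection.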

Details

Motivation: Existing methods generate synthetic text from batches of sensitive inputs sampled uniformly at random, which works poorly when topics are heterogeneous; improvements are needed in text quality and privacy. Method: First clusters the input data, then introduces a new algorithm that aggregates token predictions by privately computing medians rather than averages, exploiting the median's reduced local sensitivity. Result: Experiments show gains on representativeness metrics (e.g., MAUVE) and on downstream task performance, producing high-quality synthetic data at lower privacy cost. Conclusion: With clustering and median aggregation, the method substantially improves differentially private language model inference and outperforms a previous state-of-the-art method. Abstract: Differentially private (DP) language model inference is an approach for generating private synthetic text. A sensitive input example is used to prompt an off-the-shelf large language model (LLM) to produce a similar example. Multiple examples can be aggregated together to formally satisfy the DP guarantee. Prior work creates inference batches by sampling sensitive inputs uniformly at random. We show that uniform sampling degrades the quality of privately generated text, especially when the sensitive examples concern heterogeneous topics. We remedy this problem by clustering the input data before selecting inference batches. Next, we observe that clustering also leads to more similar next-token predictions across inferences. We use this insight to introduce a new algorithm that aggregates next token statistics by privately computing medians instead of averages. This approach leverages the fact that the median has decreased local sensitivity when next token predictions are similar, allowing us to state a data-dependent and ex-post DP guarantee about the privacy properties of this algorithm. Finally, we demonstrate improvements in terms of representativeness metrics (e.g., MAUVE) as well as downstream task performance. We show that our method produces high-quality synthetic data at significantly lower privacy cost than a previous state-of-the-art method.

[243] Urania: Differentially Private Insights into AI Use

Daogao Liu,Edith Cohen,Badih Ghazi,Peter Kairouz,Pritish Kamath,Alexander Knop,Ravi Kumar,Pasin Manurangsi,Adam Sealfon,Da Yu,Chiyuan Zhang

Main category: cs.LG

TL;DR: Urania is a novel framework for generating insights about LLM chatbot interactions under rigorous differential privacy (DP) guarantees.

Details

Motivation: To extract meaningful insights from LLM chatbot interactions while protecting user privacy. Method: Uses a private clustering mechanism and keyword extraction methods (frequency-based, TF-IDF-based, and LLM-guided), combined with DP tools (clustering, partition selection, and histogram-based summarization) for end-to-end privacy protection. Result: The evaluation shows that the framework preserves semantic content and similarity while providing rigorous privacy protection, and an empirical privacy evaluation shows it is more robust than the non-private baseline. Conclusion: Urania balances data utility with privacy preservation, offering an effective solution for analyzing LLM interactions. Abstract: We introduce Urania, a novel framework for generating insights about LLM chatbot interactions with rigorous differential privacy (DP) guarantees. The framework employs a private clustering mechanism and innovative keyword extraction methods, including frequency-based, TF-IDF-based, and LLM-guided approaches. By leveraging DP tools such as clustering, partition selection, and histogram-based summarization, Urania provides end-to-end privacy protection. Our evaluation assesses lexical and semantic content preservation, pair similarity, and LLM-based metrics, benchmarking against a non-private Clio-inspired pipeline (Tamkin et al., 2024). Moreover, we develop a simple empirical privacy evaluation that demonstrates the enhanced robustness of our DP pipeline. The results show the framework's ability to extract meaningful conversational insights while maintaining stringent user privacy, effectively balancing data utility with privacy preservation.

[244] From EHRs to Patient Pathways: Scalable Modeling of Longitudinal Health Trajectories with LLMs

Chantal Pellegrini,Ege Özsoy,David Bani-Harouni,Matthias Keicher,Nassir Navab

Main category: cs.LG

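TL;DR: The paper proposes EHR2Path, which transforms electronic health record data into a structured representation and uses a dedicated prediction model to forecast future health trajectories, together with an efficient summary mechanism.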

Details

Motivation: Existing approaches to personalized care struggle with the complex interactions in patient data; a more holistic solution is needed. Method: Transforms diverse electronic health record data into a structured representation and designs the EHR2Path model, with a summary mechanism that embeds long-term temporal context. Result: EHR2Path performs strongly in predicting and simulating patient trajectories, outperforming baseline models. Conclusion: EHR2Path opens a path towards predictive and personalized healthcare and supports diverse evaluation tasks. Abstract: Healthcare systems face significant challenges in managing and interpreting vast, heterogeneous patient data for personalized care. Existing approaches often focus on narrow use cases with a limited feature space, overlooking the complex, longitudinal interactions needed for a holistic understanding of patient health. In this work, we propose a novel approach to patient pathway modeling by transforming diverse electronic health record (EHR) data into a structured representation and designing a holistic pathway prediction model, EHR2Path, optimized to predict future health trajectories. Further, we introduce a novel summary mechanism that embeds long-term temporal context into topic-specific summary tokens, improving performance over text-only models, while being much more token-efficient. EHR2Path demonstrates strong performance in both next time-step prediction and longitudinal simulation, outperforming competitive baselines. It enables detailed simulations of patient trajectories, inherently targeting diverse evaluation tasks, such as forecasting vital signs, lab test results, or length-of-stay, opening a path towards predictive and personalized healthcare.

[245] Dissecting Long Reasoning Models: An Empirical Study

Yongyu Mu,Jiali Zeng,Bei Li,Xinyan Guan,Fandong Meng,Jie Zhou,Tong Xiao,Jingbo Zhu

Main category: cs.LG

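TL;DR: The paper studies three key issues in reinforcement learning for long-context reasoning models: the roles of positive and negative samples, data inefficiency, and unstable performance.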

Details

Motivation: Despite progress in training long-context reasoning models with reinforcement learning, open questions and counterintuitive behaviors remain. The paper sets out to analyze them systematically. Method: Analyzes the roles of positive and negative samples in RL; explores strategies to improve data efficiency (relative length rewards and offline sample injection); investigates performance instability and evaluates over multiple runs. Result: Negative samples significantly enhance generalization; the proposed strategies improve data utilization; multiple evaluation runs mitigate performance instability. Conclusion: The study identifies key issues in RL training and proposes practical remedies, providing a reference for future work. Abstract: Despite recent progress in training long-context reasoning models via reinforcement learning (RL), several open questions and counterintuitive behaviors remain. This work focuses on three key aspects: (1) We systematically analyze the roles of positive and negative samples in RL, revealing that positive samples mainly facilitate data fitting, whereas negative samples significantly enhance generalization and robustness. Interestingly, training solely on negative samples can rival standard RL training performance. (2) We identify substantial data inefficiency in group relative policy optimization, where over half of the samples yield zero advantage. To address this, we explore two straightforward strategies, including relative length rewards and offline sample injection, to better leverage these data and enhance reasoning efficiency and capability. (3) We investigate unstable performance across various reasoning models and benchmarks, attributing instability to uncertain problems with ambiguous outcomes, and demonstrate that multiple evaluation runs mitigate this issue.
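The zero-advantage effect mentioned in point (2) can be illustrated with a minimal sketch. It assumes the commonly described group-relative form, in which each rollout's reward is standardized against the other rollouts sampled for the same prompt; the abstract does not spell out the formula, and the function name `group_relative_advantages` and the `eps` constant are hypothetical choices for this illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each rollout's reward against its own group.

    Assumed form (not taken from the paper): (r - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes yields non-zero advantages.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout gets the same reward yields all-zero
# advantages, so it contributes no learning signal under this form.
all_correct = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Under this assumed form, any prompt whose rollouts are all correct or all wrong produces no gradient signal, which is one plausible reading of why a large share of samples can end up with zero advantage.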

[246] Mitigating Degree Bias Adaptively with Hard-to-Learn Nodes in Graph Contrastive Learning

Jingyu Hu,Hongbo Bo,Jun Hong,Xiaowei Liu,Weiru Liu

Main category: cs.LG

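TL;DR: The paper proposes HAR, a contrastive loss that adds positive pairs and adaptively weights positive and negative pairs to address degree bias in GNNs, and develops the SHARP framework to validate it.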

Details

Motivation: GNNs suffer from degree bias in node classification; in existing GCL methods, the limited number of positive pairs and the equal weighting of all positives and negatives leave low-degree nodes with insufficient and noisy information. Method: Proposes the HAR contrastive loss, which uses node labels to add positive pairs and adaptively weights positive and negative pairs by learning hardness; develops the SHARP framework to extend HAR to more scenarios. Result: Experiments on four datasets show that SHARP outperforms baselines at both the global and degree levels. Conclusion: HAR and SHARP effectively mitigate degree bias in GNNs and improve node classification performance. Abstract: Graph Neural Networks (GNNs) often suffer from degree bias in node classification tasks, where prediction performance varies across nodes with different degrees. Several approaches, which adopt Graph Contrastive Learning (GCL), have been proposed to mitigate this bias. However, the limited number of positive pairs and the equal weighting of all positives and negatives in GCL still lead to low-degree nodes acquiring insufficient and noisy information. This paper proposes the Hardness Adaptive Reweighted (HAR) contrastive loss to mitigate degree bias. It adds more positive pairs by leveraging node labels and adaptively weights positive and negative pairs based on their learning hardness. In addition, we develop an experimental framework named SHARP to extend HAR to a broader range of scenarios. Both our theoretical analysis and experiments validate the effectiveness of SHARP. The experimental results across four datasets show that SHARP achieves better performance against baselines at both global and degree levels.

[247] Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts

Danil Sivtsov,Ivan Rodkin,Gleb Kuzmin,Yuri Kuratov,Ivan Oseledets

Main category: cs.LG

TL;DR: Diagonal Batching is a scheduling scheme that removes the sequential execution bottleneck of RMTs and substantially speeds up long-context inference.

Details

Motivation: Transformers incur high time and memory costs on long contexts; RMTs reduce that cost, but their sequential memory update creates a performance bottleneck. Method: Proposes Diagonal Batching, a scheduling scheme that parallelizes across segments while preserving exact recurrence, with no retraining of existing RMT models. Result: On a LLaMA-1B ARMT model, Diagonal Batching achieves a 3.3x speedup over full-attention LLaMA-1B and a 1.8x speedup over the sequential RMT implementation on 131,072-token sequences, reducing inference cost and latency. Conclusion: By removing the sequential bottleneck, Diagonal Batching makes RMTs a practical option for real-world long-context applications. Abstract: Transformer models struggle with long-context inference due to their quadratic time and linear memory complexity. Recurrent Memory Transformers (RMTs) offer a solution by reducing the asymptotic cost to linear time and constant memory usage. However, their memory update mechanism leads to sequential execution, causing a performance bottleneck. We introduce Diagonal Batching, a scheduling scheme that unlocks parallelism across segments in RMTs while preserving exact recurrence. This approach eliminates the sequential constraint, enabling efficient GPU inference even for single long-context inputs without complex batching and pipelining techniques. Because the technique is purely a run-time computation reordering, existing RMT models adopt it with no retraining. Applied to a LLaMA-1B ARMT model, Diagonal Batching yields a 3.3x speedup over standard full-attention LLaMA-1B and a 1.8x speedup over the sequential RMT implementation on 131,072-token sequences. By removing the sequential bottleneck, Diagonal Batching reduces inference cost and latency, thereby strengthening RMTs as a practical solution for real-world, long-context applications.
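The reordering idea can be sketched as a wavefront schedule over a (segment, layer) grid. This is only an illustration under an assumed dependency structure, namely that the computation for segment s at layer l needs the results of (s, l-1) and (s-1, l); the abstract does not state the exact dependencies, and the function name `diagonal_schedule` is hypothetical.

```python
def diagonal_schedule(num_segments, num_layers):
    """Group (segment, layer) cells into waves that could run concurrently.

    Assumption for this sketch: cell (s, l) depends only on (s, l - 1)
    and (s - 1, l). Cells on the same anti-diagonal (s + l == d) then
    share no dependencies and can be batched together.
    """
    waves = []
    for d in range(num_segments + num_layers - 1):
        wave = [
            (s, d - s)
            for s in range(num_segments)
            if 0 <= d - s < num_layers
        ]
        waves.append(wave)
    return waves


waves = diagonal_schedule(num_segments=3, num_layers=2)
# waves == [[(0, 0)], [(0, 1), (1, 0)], [(1, 1), (2, 0)], [(2, 1)]]
```

In this toy case the six cells are covered in four waves instead of six sequential steps, and the order within the dependency graph is unchanged, which matches the abstract's description of a pure run-time reordering.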

[248] MesaNet: Sequence Modeling by Locally Optimal Test-Time Training

Johannes von Oswald,Nino Scherrer,Seijin Kobayashi,Luca Versari,Songlin Yang,Maximilian Schlegel,Kaitlin Maile,Yanick Schimpf,Oliver Sieberling,Alexander Meulemans,Rif A. Saurous,Guillaume Lajoie,Charlotte Frenkel,Razvan Pascanu,Blaise Agüera y Arcas,João Sacramento

Main category: cs.LG

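TL;DR: The paper presents a numerically stable, parallelizable version of the Mesa layer, derived from an online-learning view of recurrent layers, that improves language model performance by optimizing an in-context loss.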

Details

Motivation: Transformers require memory and compute that grow linearly during inference; the paper explores more efficient RNN models. Method: Introduces a numerically stable Mesa layer that minimizes an in-context loss with a conjugate gradient solver and supports chunkwise parallelization. Result: In language modeling at the billion-parameter scale, the Mesa layer outperforms previous RNNs, especially on long-context tasks. Conclusion: The Mesa layer improves performance at the cost of additional inference-time compute, in line with the recent trend of increasing test-time compute to improve performance. Abstract: Sequence modeling is currently dominated by causal transformer architectures that use softmax self-attention. Although widely adopted, transformers require scaling memory and compute linearly during inference. A recent stream of work linearized the softmax operation, resulting in powerful recurrent neural network (RNN) models with constant memory and compute costs such as DeltaNet, Mamba or xLSTM. These models can be unified by noting that their recurrent layer dynamics can all be derived from an in-context regression objective, approximately optimized through an online learning rule. Here, we join this line of work and introduce a numerically stable, chunkwise parallelizable version of the recently proposed Mesa layer (von Oswald et al., 2024), and study it in language modeling at the billion-parameter scale. This layer again stems from an in-context loss, but which is now minimized to optimality at every time point using a fast conjugate gradient solver. Through an extensive suite of experiments, we show that optimal test-time training enables reaching lower language modeling perplexity and higher downstream benchmark performance than previous RNNs, especially on tasks requiring long context understanding. This performance gain comes at the cost of additional FLOPs spent during inference time. Our results are therefore intriguingly related to recent trends of increasing test-time compute to improve performance -- here by spending compute to solve sequential optimization problems within the neural network itself.
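For readers unfamiliar with the solver named in the abstract, the following is a generic textbook conjugate gradient routine for a symmetric positive-definite system A x = b, written in plain Python. It is not the paper's implementation: how the Mesa layer forms its linear system is not specified here, and the helper names (`matvec`, `dot`, `conjugate_gradient`) and the tolerance are assumptions for this sketch.

```python
def matvec(a, x):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def conjugate_gradient(a, b, tol=1e-10, max_iter=100):
    """Solve A x = b for symmetric positive-definite A.

    Stops when the squared residual norm drops below tol.
    """
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x, with x starting at zero
    p = list(r)          # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old < tol:
            break
        ap = matvec(a, p)
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Small SPD example with exact solution x = (1/11, 7/11).
solution = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

An iterative solver of this kind only needs matrix-vector products, which is one reason it is a natural fit when a regularized least-squares problem has to be re-solved at every time step.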

[249] Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay

Yifan Sun,Jingyan Shen,Yibin Wang,Tianyu Chen,Zhendong Wang,Mingyuan Zhou,Huan Zhang

Main category: cs.LG

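TL;DR: The paper proposes two techniques (difficulty-targeted online data selection and rollout replay) to improve the data efficiency of RL fine-tuning for LLMs, substantially reducing training time.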

Details

Motivation: Existing RL fine-tuning is resource-intensive and largely overlooks data efficiency. Method: (1) adaptive difficulty-targeted online data selection; (2) an attention-based framework for estimating difficulty; (3) a rollout replay mechanism that reduces computation. Result: Across 6 LLM-dataset combinations, the method reduces training time by 25% to 65% while reaching the same performance as GRPO. Conclusion: The proposed techniques substantially improve the data and compute efficiency of RL fine-tuning. Abstract: Reinforcement learning (RL) has become an effective approach for fine-tuning large language models (LLMs), particularly to enhance their reasoning capabilities. However, RL fine-tuning remains highly resource-intensive, and existing work has largely overlooked the problem of data efficiency. In this paper, we propose two techniques to improve data efficiency in LLM RL fine-tuning: difficulty-targeted online data selection and rollout replay. We introduce the notion of adaptive difficulty to guide online data selection, prioritizing questions of moderate difficulty that are more likely to yield informative learning signals. To estimate adaptive difficulty efficiently, we develop an attention-based framework that requires rollouts for only a small reference set of questions. The adaptive difficulty of the remaining questions is then estimated based on their similarity to this set. To further reduce rollout cost, we introduce a rollout replay mechanism that reuses recent rollouts, lowering per-step computation while maintaining stable updates. Extensive experiments across 6 LLM-dataset combinations show that our method reduces RL fine-tuning time by 25% to 65% to reach the same level of performance as the original GRPO algorithm.
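The selection step, prioritizing questions of moderate difficulty, can be sketched as follows. The sketch assumes difficulty is represented by an estimated pass rate in [0, 1] and that "moderate" means close to 0.5; neither detail comes from the abstract, and `select_moderate`, `target`, and the example question IDs are hypothetical.

```python
def select_moderate(pass_rates, k, target=0.5):
    """Return the k question IDs whose estimated pass rate is closest to target.

    pass_rates: dict mapping question ID -> estimated pass rate in [0, 1].
    Assumption for this sketch: questions the policy always solves or
    never solves are the least informative, so distance from `target`
    is used as the ranking key.
    """
    ranked = sorted(pass_rates, key=lambda q: abs(pass_rates[q] - target))
    return ranked[:k]


estimates = {"q1": 0.0, "q2": 0.45, "q3": 1.0, "q4": 0.7}
batch = select_moderate(estimates, k=2)
# batch == ["q2", "q4"]
```

In the paper the difficulty estimates themselves come from an attention-based framework over a small reference set; here they are simply given as inputs.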

[250] Kinetics: Rethinking Test-Time Scaling Laws

Ranajoy Sadhukhan,Zhuoming Chen,Haizhong Zheng,Yang Zhou,Emma Strubell,Beidi Chen

Main category: cs.LG

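TL;DR: The study finds that the test-time efficiency of small models is overestimated, proposes a new Kinetics Scaling Law, and emphasizes the importance of sparse attention.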

Details

Motivation: To revisit test-time scaling laws, expose the overestimated efficiency of small models, and account for both computation and memory access costs. Method: Analyzes models from 0.6B to 32B parameters, derives the Kinetics Scaling Law, and validates the advantage of sparse attention. Result: Sparse attention models outperform dense ones in both low-cost and high-cost regimes, with gains of over 60 points and over 5 points respectively in problem-solving accuracy on AIME. Conclusion: Sparse attention is key to realizing the potential of test-time scaling; the code is open source. Abstract: We rethink test-time scaling laws from a practical efficiency perspective, revealing that the effectiveness of smaller models is significantly overestimated. Prior work, grounded in compute-optimality, overlooks critical memory access bottlenecks introduced by inference-time strategies (e.g., Best-of-N, long CoTs). Our holistic analysis, spanning models from 0.6B to 32B parameters, reveals a new Kinetics Scaling Law that better guides resource allocation by incorporating both computation and memory access costs. Kinetics Scaling Law suggests that test-time compute is more effective when used on models above a threshold than smaller ones. A key reason is that in test-time scaling (TTS), attention, rather than parameter count, emerges as the dominant cost factor. Motivated by this, we propose a new scaling paradigm centered on sparse attention, which lowers per-token cost and enables longer generations and more parallel samples within the same resource budget. Empirically, we show that sparse attention models consistently outperform dense counterparts, achieving gains of over 60 points in low-cost regimes and over 5 points in high-cost regimes for problem-solving accuracy on AIME, encompassing evaluations on state-of-the-art MoEs. These results suggest that sparse attention is essential for realizing the full potential of test-time scaling because, unlike training, where parameter scaling saturates, test-time accuracy continues to improve through increased generation. The code is available at https://github.com/Infini-AI-Lab/Kinetics.

[251] Inference-Time Hyper-Scaling with KV Cache Compression

Adrian Łańcucki,Konrad Staniszewski,Piotr Nawrot,Edoardo M. Ponti

Main category: cs.LG

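TL;DR: Compressing the KV cache enables inference-time hyper-scaling; Dynamic Memory Sparsification (DMS) substantially improves inference efficiency while maintaining high accuracy.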

Details

Motivation: In Transformer LLMs, generation cost is bottlenecked by the size of the KV cache rather than the number of generated tokens, so the paper explores compressing the KV cache to improve inference efficiency. Method: Proposes Dynamic Memory Sparsification (DMS), which delays token eviction and implicitly merges representations, reaching 8× compression with only 1K training steps. Result: Across several LLM families, DMS improves accuracy at comparable compute budgets; for example, Qwen-R1 32B gains an average of 7.6 to 9.6 points across benchmarks. Conclusion: DMS makes inference-time hyper-scaling practical, improving both efficiency and accuracy. Abstract: Inference-time scaling trades efficiency for increased reasoning accuracy by generating longer or more parallel sequences. However, in Transformer LLMs, generation cost is bottlenecked by the size of the key-value (KV) cache, rather than the number of generated tokens. Hence, we explore inference-time hyper-scaling: by compressing the KV cache, we can generate more tokens within the same compute budget and further improve the accuracy of scaled inference. The success of this approach, however, hinges on the ability of compression methods to preserve accuracy even at high compression ratios. To make hyper-scaling practical, we introduce Dynamic Memory Sparsification (DMS), a novel method for sparsifying KV caches that only requires 1K training steps to achieve 8× compression, while maintaining better accuracy than training-free sparse attention. Instead of prematurely discarding cached tokens, DMS delays token eviction, implicitly merging representations and preserving critical information. We demonstrate the effectiveness of inference-time hyper-scaling with DMS on multiple families of LLMs, showing that it boosts accuracy for comparable inference runtime and memory load. For instance, we enhance Qwen-R1 32B by an average of 9.1 points on AIME 24, 7.6 on GPQA, and 9.6 on LiveCodeBench across compute budgets.

[252] You Only Train Once

Christos Sakaridis

Main category: cs.LG

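TL;DR: The paper proposes YOTO, a method that automatically optimizes loss-weight hyperparameters in a single training run, avoiding the tedium of conventional grid search.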

Details

Motivation: Conventional approaches need many training runs to tune loss weights, which is inefficient. YOTO aims to optimize loss weights automatically in a single run. Method: YOTO treats loss weights as network parameters and learns them by gradient-based optimization, adding a regularization loss that encodes a uniformity prior and keeps the optima bounded. Result: On 3D estimation and semantic segmentation, YOTO outperforms the best grid-search model on unseen test data. Conclusion: YOTO provides an efficient single-run approach that improves both the efficiency and the outcome of loss-weight optimization. Abstract: The title of this paper is perhaps an overclaim. Of course, the process of creating and optimizing a learned model inevitably involves multiple training runs which potentially feature different architectural designs, input and output encodings, and losses. However, our method, You Only Train Once (YOTO), indeed contributes to limiting training to one shot for the latter aspect of loss selection and weighting. We achieve this by automatically optimizing loss weight hyperparameters of learned models in one shot via standard gradient-based optimization, treating these hyperparameters as regular parameters of the networks and learning them. To this end, we leverage the differentiability of the composite loss formulation which is widely used for optimizing multiple empirical losses simultaneously and model it as a novel layer which is parameterized with a softmax operation that satisfies the inherent positivity constraints on loss hyperparameters while avoiding degenerate empirical gradients. We complete our joint end-to-end optimization scheme by defining a novel regularization loss on the learned hyperparameters, which models a uniformity prior among the employed losses while ensuring boundedness of the identified optima. We evidence the efficacy of YOTO in jointly optimizing loss hyperparameters and regular model parameters in one shot by comparing it to the commonly used brute-force grid search across state-of-the-art networks solving two key problems in computer vision, i.e. 3D estimation and semantic segmentation, and showing that it consistently outperforms the best grid-search model on unseen test data. Code will be made publicly available.
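The softmax parameterization of loss weights described in the abstract can be sketched in a few lines. This shows only the forward computation: the weights are obtained from unconstrained logits, so they are positive and sum to one. The gradient-based learning of the logits and the paper's regularization loss are omitted because their exact form is not given here; the names `softmax` and `composite_loss` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def composite_loss(losses, logits):
    """Weighted sum of individual losses with softmax-derived weights.

    Returns (total, weights). In a YOTO-style setup the logits would be
    trainable parameters; here they are plain inputs for illustration.
    """
    weights = softmax(logits)
    total = sum(w * loss for w, loss in zip(weights, losses))
    return total, weights


# Equal logits give equal weights, so the composite is the plain mean.
total, weights = composite_loss([2.0, 4.0], [0.0, 0.0])
# total == 3.0, weights == [0.5, 0.5]
```

Because a weighted sum with free positive weights could be minimized trivially by pushing all weight onto the smallest loss, some additional constraint is needed, which is consistent with the abstract's mention of a regularization loss that models a uniformity prior.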

[253] StatsMerging: Statistics-Guided Model Merging via Task-Specific Teacher Distillation

Ranjith Merugu,Bryan Bo Cao,Shubham Jain

Main category: cs.LG

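TL;DR: StatsMerging is a lightweight, statistics-guided, learning-based model merging method that uses SVD singular values to capture task importance, requires no ground-truth labels or test samples, and performs strongly across multiple tasks.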

Details

Motivation: Merge multiple large models under a constrained memory budget without relying on ground-truth labels or test samples. Method: Use SVD singular values to guide task-coefficient prediction, and introduce a lightweight learner, StatsMergeLearner, together with Task-Specific Teacher Distillation. Result: Across eight tasks, it outperforms state-of-the-art techniques, with notable gains in overall accuracy, generalization, and robustness. Conclusion: StatsMerging offers an efficient, general solution for model merging that applies to heterogeneous architectures and diverse tasks. Abstract: Model merging has emerged as a promising solution to accommodate multiple large models within constrained memory budgets. We present StatsMerging, a novel lightweight learning-based model merging method guided by weight distribution statistics without requiring ground truth labels or test samples. StatsMerging offers three key advantages: (1) It uniquely leverages singular values from singular value decomposition (SVD) to capture task-specific weight distributions, serving as a proxy for task importance to guide task coefficient prediction; (2) It employs a lightweight learner StatsMergeLearner to model the weight distributions of task-specific pre-trained models, improving generalization and enhancing adaptation to unseen samples; (3) It introduces Task-Specific Teacher Distillation for merging vision models with heterogeneous architectures, a merging learning paradigm that avoids costly ground-truth labels by task-specific teacher distillation. Notably, we present two types of knowledge distillation, (a) distilling knowledge from task-specific models to StatsMergeLearner; and (b) distilling knowledge from models with heterogeneous architectures prior to merging. Extensive experiments across eight tasks demonstrate the effectiveness of StatsMerging. Our results show that StatsMerging outperforms state-of-the-art techniques in terms of overall accuracy, generalization to unseen tasks, and robustness to image quality variations.

[254] Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets

Marianna Nezhurina,Tomer Porian,Giovanni Pucceti,Tommie Kerssies,Romain Beaumont,Mehdi Cherti,Jenia Jitsev

Main category: cs.LG

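TL;DR: The paper examines how scaling laws can be used to compare models and datasets in order to improve pre-training. Comparing scaling laws for two language-vision learning procedures, CLIP and MaMMUT, it finds that MaMMUT improves more strongly with scale and is more sample-efficient than CLIP.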

Details

Motivation: Systematic comparison via scaling laws avoids misleading conclusions drawn from a single reference scale, giving a sound basis for improving open foundation models and datasets. Method: Densely measure scaling laws across model and samples-seen scales, and compare CLIP and MaMMUT on downstream tasks including classification, retrieval, and segmentation. The study also verifies that scaling laws can be derived with a constant learning rate schedule. Result: MaMMUT shows stronger improvement with scale and better sample efficiency than CLIP, consistently across datasets (DataComp, DFN, Re-LAION) and tasks. Conclusion: Accurate scaling-law derivation provides an effective tool for comparing models and datasets across scales, supporting systematic improvement of open foundation models and datasets. Abstract: In studies of transferable learning, scaling laws are obtained for various important foundation models to predict their properties and performance at larger scales. We show here how scaling law derivation can also be used for model and dataset comparison, allowing to decide which procedure is to be preferred for pre-training. For the first time, full scaling laws based on dense measurements across a wide span of model and samples seen scales are derived for two important language-vision learning procedures, CLIP and MaMMUT, that use either contrastive only or contrastive and captioning text generative loss. Ensuring sufficient prediction accuracy for held out points, we use derived scaling laws to compare both models, obtaining evidence for MaMMUT's stronger improvement with scale and better sample efficiency than standard CLIP. To strengthen validity of the comparison, we show scaling laws for various downstream tasks, classification, retrieval, and segmentation, and for different open datasets, DataComp, DFN and Re-LAION, observing consistently the same trends. We show that comparison can also be performed when deriving scaling laws with a constant learning rate schedule, reducing compute cost. Accurate derivation of scaling laws provides thus means to perform model and dataset comparison across scale spans, avoiding misleading conclusions based on measurements from single reference scales only, paving the road for systematic comparison and improvement of open foundation models and datasets for their creation.
We release all the pre-trained models with their intermediate checkpoints, including openMaMMUT-L/14, which achieves 80.3% zero-shot ImageNet-1k accuracy, trained on 12.8B samples from DataComp-1.4B. Code for reproducing experiments in the paper and raw experiments data can be found at https://github.com/LAION-AI/scaling-laws-for-comparison.

[255] Exploring bidirectional bounds for minimax-training of Energy-based models

Cong Geng,Jia Wang,Li Chen,Zhiyong Gao,Jes Frellsen,Søren Hauberg

Main category: cs.LG

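TL;DR: The paper proposes training energy-based models (EBMs) with bidirectional bounds, maximizing a lower bound while minimizing an upper bound, to address the instability of conventional training.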

Details

Motivation: Energy-based models (EBMs) estimate unnormalized densities elegantly, but training them is typically difficult and unstable. Method: The authors train EBMs with bidirectional bounds (maximizing a lower bound and minimizing an upper bound) and study four bounds on the log-likelihood: lower bounds based on the singular values of the generator Jacobian and on mutual information, and upper bounds based on a gradient penalty and on diffusion processes. Result: Experiments show that bidirectional bounds stabilize EBM training and yield high-quality density estimation and sample generation. Conclusion: Bidirectional bounds provide a stable and efficient approach to EBM training. Abstract: Energy-based models (EBMs) estimate unnormalized densities in an elegant framework, but they are generally difficult to train. Recent work has linked EBMs to generative adversarial networks, by noting that they can be trained through a minimax game using a variational lower bound. To avoid the instabilities caused by minimizing a lower bound, we propose to instead work with bidirectional bounds, meaning that we maximize a lower bound and minimize an upper bound when training the EBM. We investigate four different bounds on the log-likelihood derived from different perspectives. We derive lower bounds based on the singular values of the generator Jacobian and on mutual information. To upper bound the negative log-likelihood, we consider a gradient penalty-like bound, as well as one based on diffusion processes. In all cases, we provide algorithms for evaluating the bounds. We compare the different bounds to investigate the pros and cons of the different approaches. Finally, we demonstrate that the use of bidirectional bounds stabilizes EBM training and yields high-quality density estimation and sample generation.

[256] Identifying and Understanding Cross-Class Features in Adversarial Training

Zeming Wei,Yiwen Guo,Yisen Wang

Main category: cs.LG

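TL;DR: The paper studies adversarial training (AT) through the lens of class-wise feature attribution, finds that cross-class features play a key role in robust classification, and shows how the model shifts during AT from learning cross-class features to relying on class-specific ones.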

Details

Motivation: Adversarial training (AT) is an effective way to make deep neural networks robust to adversarial attacks, but its training mechanisms and dynamics are not yet well understood. The paper aims to shed light on them through class-wise feature attribution. Method: Study AT via class-wise feature attribution, focusing on the role of cross-class features, with theoretical support from a synthetic data model. Result: Early in AT the model tends to learn more cross-class features; after reaching the best robustness it shifts to relying on class-specific features, leading to robust overfitting. Conclusion: The study offers a new perspective on AT mechanisms, gives a unified explanation of the advantage of soft-label training and of robust overfitting, and deepens the understanding of AT. Abstract: Adversarial training (AT) has been considered one of the most effective methods for making deep neural networks robust against adversarial attacks, while the training mechanisms and dynamics of AT remain open research problems. In this paper, we present a novel perspective on studying AT through the lens of class-wise feature attribution. Specifically, we identify the impact of a key family of features on AT that are shared by multiple classes, which we call cross-class features. These features are typically useful for robust classification, which we offer theoretical evidence to illustrate through a synthetic data model. Through systematic studies across multiple model architectures and settings, we find that during the initial stage of AT, the model tends to learn more cross-class features until the best robustness checkpoint. As AT further squeezes the training robust loss and causes robust overfitting, the model tends to make decisions based on more class-specific features. Based on these discoveries, we further provide a unified view of two existing properties of AT, including the advantage of soft-label training and robust overfitting. Overall, these insights refine the current understanding of AT mechanisms and provide new perspectives on studying them. Our code is available at https://github.com/PKU-ML/Cross-Class-Features-AT.

[257] Aligning Latent Spaces with Flow Priors

Yizhuo Li,Yuying Ge,Yixiao Ge,Ying Shan,Ping Luo

Main category: cs.LG

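TL;DR: Proposes a new framework that uses flow-based generative models as priors to align learnable latent spaces with target distributions, avoiding expensive likelihood evaluations and ODE solving.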

Details

Motivation: Align latent spaces with target distributions while reducing computational cost. Method: Pretrain a flow-based model to capture the target distribution, then regularize the latent space with an alignment loss whose minimization serves as a surrogate for maximizing a variational lower bound. Result: Experiments show that the alignment loss approximates the negative log-likelihood of the target distribution, and the approach is validated on ImageNet. Conclusion: The framework offers a new approach to latent space alignment with both theoretical and empirical support. Abstract: This paper presents a novel framework for aligning learnable latent spaces to arbitrary target distributions by leveraging flow-based generative models as priors. Our method first pretrains a flow model on the target features to capture the underlying distribution. This fixed flow model subsequently regularizes the latent space via an alignment loss, which reformulates the flow matching objective to treat the latents as optimization targets. We formally prove that minimizing this alignment loss establishes a computationally tractable surrogate objective for maximizing a variational lower bound on the log-likelihood of latents under the target distribution. Notably, the proposed method eliminates computationally expensive likelihood evaluations and avoids ODE solving during optimization. As a proof of concept, we demonstrate in a controlled setting that the alignment loss landscape closely approximates the negative log-likelihood of the target distribution. We further validate the effectiveness of our approach through large-scale image generation experiments on ImageNet with diverse target distributions, accompanied by detailed discussions and ablation studies. With both theoretical and empirical validation, our framework paves a new way for latent space alignment.

cs.IR

[258] Exp4Fuse: A Rank Fusion Framework for Enhanced Sparse Retrieval using Large Language Model-based Query Expansion

Lingyuan Liu,Mengxiang Zhang

Main category: cs.IR

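TL;DR: Exp4Fuse is a new rank fusion framework that improves sparse retrieval through zero-shot LLM-based query expansion; experiments show it outperforms existing methods.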

Details

Motivation: The quality of LLM-generated documents depends on complex prompt strategies and comes at high computational cost; the paper explores zero-shot LLM-based query expansion to improve sparse retrieval. Method: Exp4Fuse combines two retrieval routes, one using the original query and one using the LLM-augmented query, and fuses the resulting rankings with a modified reciprocal rank fusion method to produce the final ranking. Result: Across multiple datasets, Exp4Fuse surpasses existing LLM-based query expansion methods and reaches SOTA results on several benchmarks when combined with advanced sparse retrievers. Conclusion: Exp4Fuse performs well at improving sparse retrieval and offers an efficient solution for query expansion. Abstract: Large Language Models (LLMs) have shown potential in generating hypothetical documents for query expansion, thereby enhancing information retrieval performance. However, the efficacy of this method is highly dependent on the quality of the generated documents, which often requires complex prompt strategies and the integration of advanced dense retrieval techniques. This can be both costly and computationally intensive. To mitigate these limitations, we explore the use of zero-shot LLM-based query expansion to improve sparse retrieval, particularly for learned sparse retrievers. We introduce a novel fusion ranking framework, Exp4Fuse, which enhances the performance of sparse retrievers through an indirect application of zero-shot LLM-based query expansion. Exp4Fuse operates by simultaneously considering two retrieval routes, one based on the original query and the other on the LLM-augmented query. It then generates two ranked lists using a sparse retriever and fuses them using a modified reciprocal rank fusion method. We conduct extensive evaluations of Exp4Fuse against leading LLM-based query expansion methods and advanced retrieval techniques on three MS MARCO-related datasets and seven low-resource datasets. Experimental results reveal that Exp4Fuse not only surpasses existing LLM-based query expansion methods in enhancing sparse retrievers but also, when combined with advanced sparse retrievers, achieves SOTA results on several benchmarks. This highlights the superior performance and effectiveness of Exp4Fuse in improving query expansion for sparse retrieval.
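
Illustrative sketch (not the authors' code): the abstract fuses two ranked lists with a modified reciprocal rank fusion method. The snippet below implements plain, unmodified reciprocal rank fusion, where each document scores the sum of 1 / (k + rank) over the lists it appears in. The paper's modification is not reproduced here; the constant k = 60 is a commonly used default and the document IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs with standard RRF.

    Each list is ordered best-first. Ties in fused score are broken by
    document ID so the output is deterministic.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Route 1: ranking from the original query; route 2: from the expanded query.
original_route = ["d1", "d2", "d3"]
expanded_route = ["d2", "d4", "d1"]
fused = reciprocal_rank_fusion([original_route, expanded_route])
# fused == ["d2", "d1", "d4", "d3"]
```

Documents that appear in both routes (d1, d2) rise above those found by only one, which is the behavior a fusion step is meant to provide.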

[259] GOLFer: Smaller LM-Generated Documents Hallucination Filter & Combiner for Query Expansion in Information Retrieval

Lingyuan Liu,Mengxiang Zhang

Main category: cs.IR

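TL;DR: GOLFer is a new query expansion method that uses smaller open-source language models, filtering hallucinated content out of generated documents and combining the useful information, which markedly improves performance.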

Details

Motivation: Using large language models (LLMs) for query expansion is costly and computationally intensive, while documents generated by smaller language models (LMs) often contain non-factual content. GOLFer aims to address both problems. Method: GOLFer has two modules: a hallucination filter, which detects and removes non-factual sentences, and a documents combiner, which balances the query and the generated content using a weight vector. Result: Experiments on multiple datasets show that GOLFer outperforms other smaller-LM methods and is competitive with large-LLM methods. Conclusion: GOLFer offers a viable low-cost, efficient option for query expansion, especially in resource-constrained settings. Abstract: Large language models (LLMs)-based query expansion for information retrieval augments queries with generated hypothetical documents with LLMs. However, its performance relies heavily on the scale of the language models (LMs), necessitating larger, more advanced LLMs. This approach is costly, computationally intensive, and often has limited accessibility. To address these limitations, we introduce GOLFer - Smaller LMs-Generated Documents Hallucination Filter & Combiner - a novel method leveraging smaller open-source LMs for query expansion. GOLFer comprises two modules: a hallucination filter and a documents combiner. The former detects and removes non-factual and inconsistent sentences in generated documents, a common issue with smaller LMs, while the latter combines the filtered content with the query using a weight vector to balance their influence. We evaluate GOLFer alongside dominant LLM-based query expansion methods on three web search and ten low-resource datasets. Experimental results demonstrate that GOLFer consistently outperforms other methods using smaller LMs, and maintains competitive performance against methods using large-size LLMs, demonstrating its effectiveness.

[260] Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings

Yubo Ma,Jinsong Li,Yuhang Zang,Xiaobao Wu,Xiaoyi Dong,Pan Zhang,Yuhang Cao,Haodong Duan,Jiaqi Wang,Yixin Cao,Aixin Sun

Main category: cs.IR

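TL;DR: The paper studies ways to reduce the memory footprint of visual document retrieval (VDR). Among pruning methods, a random strategy beats more sophisticated ones, but merging is more effective; the resulting Light-ColPali/ColQwen2 cuts memory use substantially with little performance loss.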

Details

Motivation: ColPali/ColQwen2 perform strongly in VDR but use a lot of memory, so the paper studies how to reduce the number of embeddings per page. Method: Evaluate two strategies, pruning and merging. Among pruning methods a random strategy performs best, but merging is better suited to VDR; Light-ColPali/ColQwen2 is developed by optimizing the merging strategy. Result: Light-ColPali/ColQwen2 retains 98.2% of retrieval performance at only 11.8% of the original memory usage, and 94.6% at 2.8%. Conclusion: Merging is better suited to VDR, and Light-ColPali/ColQwen2 provides a valuable baseline and insights for research on efficient VDR. Abstract: Despite the strong performance of ColPali/ColQwen2 in Visualized Document Retrieval (VDR), it encodes each page into multiple patch-level embeddings and leads to excessive memory usage. This empirical study investigates methods to reduce patch embeddings per page at minimum performance degradation. We evaluate two token-reduction strategies: token pruning and token merging. Regarding token pruning, we surprisingly observe that a simple random strategy outperforms other sophisticated pruning methods, though still far from satisfactory. Further analysis reveals that pruning is inherently unsuitable for VDR as it requires removing certain page embeddings without query-specific information. Turning to token merging (more suitable for VDR), we search for the optimal combinations of merging strategy across three dimensions and develop Light-ColPali/ColQwen2. It maintains 98.2% of retrieval performance with only 11.8% of original memory usage, and preserves 94.6% effectiveness at 2.8% memory footprint. We expect our empirical findings and resulting Light-ColPali/ColQwen2 offer valuable insights and establish a competitive baseline for future research towards efficient VDR.
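
Illustrative sketch (not the authors' code): token merging reduces the number of patch embeddings stored per page by combining several embeddings into one. The snippet shows the simplest possible variant, averaging consecutive groups of a fixed size. The grouping rule, the function name, and the toy vectors are assumptions for illustration; the paper searches over merging strategies and its final choice may be different.

```python
def merge_patches(embeddings, factor):
    """Average every `factor` consecutive patch embeddings into one vector.

    `embeddings` is a list of equal-length float lists. The last group may
    be smaller when the count is not divisible by `factor`.
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")
    merged = []
    for start in range(0, len(embeddings), factor):
        group = embeddings[start:start + factor]
        dim = len(group[0])
        merged.append([sum(vec[i] for vec in group) / len(group) for i in range(dim)])
    return merged


# Four toy 2-dimensional patch embeddings merged pairwise into two.
page = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0], [7.0, 7.0]]
light_page = merge_patches(page, factor=2)
# light_page == [[2.0, 2.0], [6.0, 6.0]]

# Storage scales with the number of vectors kept per page.
memory_ratio = len(light_page) / len(page)
```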

astro-ph.SR

[261] Deep learning image burst stacking to reconstruct high-resolution ground-based solar observations

Christoph Schirninger,Robert Jarolim,Astrid M. Veronig,Christoph Kuckein

Main category: astro-ph.SR

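TL;DR: Proposes a deep learning method for real-time image reconstruction that addresses the image degradation caused by atmospheric turbulence in ground-based solar telescope observations.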

Details

Motivation: Ground-based solar telescope observations are degraded by atmospheric turbulence, and existing reconstruction methods struggle with strong turbulence and high computational cost. Method: A deep learning model based on unpaired image-to-image translation reconstructs a burst of 100 short-exposure images into one high-quality image in real time. Result: The method is more robust in terms of perceptual quality, especially when the reference images contain artifacts. Conclusion: The method makes efficient use of the image information and achieves the best reconstructions when given the full image burst. Abstract: Large aperture ground based solar telescopes allow the solar atmosphere to be resolved in unprecedented detail. However, observations are limited by Earth's turbulent atmosphere, requiring post image corrections. Current reconstruction methods using short exposure bursts face challenges with strong turbulence and high computational costs. We introduce a deep learning approach that reconstructs 100 short exposure images into one high quality image in real time. Using unpaired image to image translation, our model is trained on degraded bursts with speckle reconstructions as references, improving robustness and generalization. Our method shows an improved robustness in terms of perceptual quality, especially when speckle reconstructions show artifacts. An evaluation with a varying number of images per burst demonstrates that our method makes efficient use of the combined image information and achieves the best reconstructions when provided with the full image burst.

cs.RO

[262] RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

Enshen Zhou,Jingkun An,Cheng Chi,Yi Han,Shanyu Rong,Chi Zhang,Pengwei Wang,Zhongyuan Wang,Tiejun Huang,Lu Sheng,Shanghang Zhang

Main category: cs.RO

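TL;DR: RoboRefer is a 3D-aware vision-language model that improves spatial understanding and reasoning through supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), outperforming existing methods in experiments.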

Details

Motivation: Existing vision-language models fall short in spatial understanding and dynamic reasoning in complex 3D scenes; RoboRefer aims to address this. Method: Combine SFT and RFT training, introduce a dedicated depth encoder and metric-sensitive reward functions, and use the large-scale dataset RefSpatial and the benchmark RefSpatial-Bench. Result: SFT-trained RoboRefer reaches an average success rate of 89.6% on spatial understanding; RFT-trained RoboRefer further outperforms the baselines, surpassing Gemini-2.5-Pro by 17.4% in average accuracy on RefSpatial-Bench. Conclusion: RoboRefer performs strongly on spatial referring and can be integrated with various robot control policies for complex real-world scenes. Abstract: Spatial referring is a fundamental capability of embodied robots to interact with the 3D physical world. However, even with the powerful pretrained vision language models (VLMs), recent approaches are still not qualified to accurately understand the complex 3D scenes and dynamically reason about the instruction-indicated locations for interaction. To this end, we propose RoboRefer, a 3D-aware VLM that can first achieve precise spatial understanding by integrating a disentangled but dedicated depth encoder via supervised fine-tuning (SFT). Moreover, RoboRefer advances generalized multi-step spatial reasoning via reinforcement fine-tuning (RFT), with metric-sensitive process reward functions tailored for spatial referring tasks. To support SFT and RFT training, we introduce RefSpatial, a large-scale dataset of 20M QA pairs (2x prior), covering 31 spatial relations (vs. 15 prior) and supporting complex reasoning processes (up to 5 steps). In addition, we introduce RefSpatial-Bench, a challenging benchmark filling the gap in evaluating spatial referring with multi-step reasoning. Experiments show that SFT-trained RoboRefer achieves state-of-the-art spatial understanding, with an average success rate of 89.6%. RFT-trained RoboRefer further outperforms all other baselines by a large margin, even surpassing Gemini-2.5-Pro by 17.4% in average accuracy on RefSpatial-Bench. Notably, RoboRefer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (e.g., UR5, G1 humanoid) in cluttered real-world scenes.

[263] Learning Smooth State-Dependent Traversability from Dense Point Clouds

Zihao Dong,Alan Papalia,Leonard Jung,Alenna Spiro,Philip R. Osteen,Christa S. Robison,Michael Everett

Main category: cs.RO

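TL;DR: SPARTA estimates approach-angle-conditioned terrain traversability from point clouds, addressing the large data requirements and computational inefficiency of conventional approaches.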

Details

Motivation: In off-road autonomy, terrain traversability often depends on the vehicle's state; conventional approaches need large training datasets and repeated model inference, which is inefficient. Method: SPARTA outputs a smooth analytical function, composed of Fourier basis functions, that predicts the risk distribution for any angle of approach, reducing computational overhead. Result: In high-fidelity simulation, SPARTA achieves a 91% success rate crossing a 40 m boulder field, versus 73% for the baseline, and hardware tests demonstrate its generalization ability. Conclusion: By imposing geometric structure and using Fourier basis functions, SPARTA handles traversability prediction for off-road autonomy efficiently and generalizes well. Abstract: A key open challenge in off-road autonomy is that the traversability of terrain often depends on the vehicle's state. In particular, some obstacles are only traversable from some orientations. However, learning this interaction by encoding the angle of approach as a model input demands a large and diverse training dataset and is computationally inefficient during planning due to repeated model inference. To address these challenges, we present SPARTA, a method for estimating approach angle conditioned traversability from point clouds. Specifically, we impose geometric structure into our network by outputting a smooth analytical function over the 1-Sphere that predicts risk distribution for any angle of approach with minimal overhead and can be reused for subsequent queries. The function is composed of Fourier basis functions, which have important advantages for generalization due to their periodic nature and smoothness. We demonstrate SPARTA both in a high-fidelity simulation platform, where our model achieves a 91% success rate crossing a 40m boulder field (compared to 73% for the baseline), and on hardware, illustrating the generalization ability of the model to real-world settings.
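
Illustrative sketch (not the authors' code): the abstract describes a network that outputs a smooth function over approach angles built from Fourier basis functions, so that one inference can answer queries for any angle. The snippet evaluates a truncated Fourier series at a given angle. Reducing the predicted risk distribution to a single scalar, the function name, and the coefficient values are assumptions for illustration.

```python
import math


def fourier_risk(theta, a0, cos_coeffs, sin_coeffs):
    """Evaluate a0 + sum_k (a_k cos(k*theta) + b_k sin(k*theta)) at angle theta (radians)."""
    value = a0
    for k, (a_k, b_k) in enumerate(zip(cos_coeffs, sin_coeffs), start=1):
        value += a_k * math.cos(k * theta) + b_k * math.sin(k * theta)
    return value


# Hypothetical coefficients, as if predicted once for a terrain patch.
a0, cos_coeffs, sin_coeffs = 0.5, [0.3], [0.0]

# The same coefficients are reused for every queried approach angle.
risk_head_on = fourier_risk(0.0, a0, cos_coeffs, sin_coeffs)       # about 0.8
risk_reverse = fourier_risk(math.pi, a0, cos_coeffs, sin_coeffs)   # about 0.2
```

Because sine and cosine are periodic, the function takes the same value at theta and theta + 2*pi, which matches the geometry of an approach angle without any special handling at the wrap-around point.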

[264] MineInsight: A Multi-sensor Dataset for Humanitarian Demining Robotics in Off-Road Environments

Mario Malizia,Charles Hamesse,Ken Hasselmann,Geert De Cubber,Nikolaos Tsiogkas,Eric Demeester,Rob Haelterman

Main category: cs.RO

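TL;DR: The paper introduces MineInsight, a publicly available multi-sensor, multi-spectral dataset intended to improve the validation of landmine detection algorithms.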

Details

Motivation: Diverse, realistic landmine detection datasets are lacking, which limits reliable validation of algorithms. Method: Present the MineInsight dataset, which integrates dual-view sensor scans (from an unmanned ground vehicle and its robotic arm), LiDAR, and multi-spectral imagery (RGB, VIS-SWIR, LWIR), along with target location estimates. Result: The dataset contains 35 targets and approximately 38,000 RGB frames, 53,000 VIS-SWIR frames, and 108,000 LWIR frames, supporting testing in daylight and nighttime conditions. Conclusion: MineInsight provides a benchmark for developing and evaluating landmine detection algorithms and fills a gap in existing datasets. Abstract: The use of robotics in humanitarian demining increasingly involves computer vision techniques to improve landmine detection capabilities. However, in the absence of diverse and realistic datasets, the reliable validation of algorithms remains a challenge for the research community. In this paper, we introduce MineInsight, a publicly available multi-sensor, multi-spectral dataset designed for off-road landmine detection. The dataset features 35 different targets (15 landmines and 20 commonly found objects) distributed along three distinct tracks, providing a diverse and realistic testing environment. MineInsight is, to the best of our knowledge, the first dataset to integrate dual-view sensor scans from both an Unmanned Ground Vehicle and its robotic arm, offering multiple viewpoints to mitigate occlusions and improve spatial awareness. It features two LiDARs, as well as images captured at diverse spectral ranges, including visible (RGB, monochrome), visible short-wave infrared (VIS-SWIR), and long-wave infrared (LWIR). Additionally, the dataset comes with an estimation of the location of the targets, offering a benchmark for evaluating detection algorithms. We recorded approximately one hour of data in both daylight and nighttime conditions, resulting in around 38,000 RGB frames, 53,000 VIS-SWIR frames, and 108,000 LWIR frames. MineInsight serves as a benchmark for developing and evaluating landmine detection algorithms. Our dataset is available at https://github.com/mariomlz99/MineInsight.

[265] Synthetic Dataset Generation for Autonomous Mobile Robots Using 3D Gaussian Splatting for Vision Training

Aneesh Deogan,Wout Beks,Peter Teurlings,Koen de Vos,Mark van den Brand,Rene van de Molengraft

Main category: cs.RO

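TL;DR: Proposes a method for automatically generating synthetic data in Unreal Engine to address the time cost, error-proneness, and limited diversity of manually annotated datasets, with particular relevance to robotics.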

Details

Motivation: Manually creating annotated datasets is time-consuming and error-prone, and robotics in particular demands a diversity that traditional approaches struggle to provide. Method: Use Unreal Engine and 3D Gaussian splats to rapidly generate photorealistic synthetic data, and combine it with real data to improve detection performance. Result: An object detector trained on synthetic data performs on par with one trained on real data, and combining the two improves performance further. Conclusion: The method offers an efficient, scalable way to generate datasets for robotics and substantially reduces the need for manual annotation. Abstract: Annotated datasets are critical for training neural networks for object detection, yet their manual creation is time- and labour-intensive, subject to human error, and often limited in diversity. This challenge is particularly pronounced in the domain of robotics, where diverse and dynamic scenarios further complicate the creation of representative datasets. To address this, we propose a novel method for automatically generating annotated synthetic data in Unreal Engine. Our approach leverages photorealistic 3D Gaussian splats for rapid synthetic data generation. We demonstrate that synthetic datasets can achieve performance comparable to that of real-world datasets while significantly reducing the time required to generate and annotate data. Additionally, combining real-world and synthetic data significantly increases object detection performance by leveraging the quality of real-world images with the easier scalability of synthetic data. To our knowledge, this is the first application of synthetic data for training object detection algorithms in the highly dynamic and varied environment of robot soccer. Validation experiments reveal that a detector trained on synthetic images performs on par with one trained on manually annotated real-world images when tested on robot soccer match scenarios. Our method offers a scalable and comprehensive alternative to traditional dataset creation, eliminating the labour-intensive, error-prone manual annotation process. By generating datasets in a simulator where all elements are intrinsically known, we ensure accurate annotations while significantly reducing manual effort, which makes it particularly valuable for robotics applications requiring diverse and scalable training data.

eess.IV

[266] Gradient Inversion Attacks on Parameter-Efficient Fine-Tuning

Hasin Us Sami,Swapneel Sen,Amit K. Roy-Chowdhury,Srikanth V. Krishnamurthy,Basak Guler

Main category: eess.IV

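TL;DR: The paper examines the privacy risks of parameter-efficient fine-tuning (PEFT) in federated learning, showing how an attacker can reconstruct user data through maliciously designed pretrained models and adapter modules.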

Details

Motivation: Investigate the privacy vulnerabilities of PEFT in federated learning and show how an attacker can use gradient inversion attacks to steal user data. Method: Design a malicious pretrained model and adapter modules, then apply gradient inversion to reconstruct users' fine-tuning data. Result: Experiments show that an attacker can reconstruct a large batch of fine-tuning images with high fidelity. Conclusion: The study highlights the need for privacy-preserving mechanisms for PEFT and points to future research directions. Abstract: Federated learning (FL) allows multiple data-owners to collaboratively train machine learning models by exchanging local gradients, while keeping their private data on-device. To simultaneously enhance privacy and training efficiency, recently parameter-efficient fine-tuning (PEFT) of large-scale pretrained models has gained substantial attention in FL. While keeping a pretrained (backbone) model frozen, each user fine-tunes only a few lightweight modules to be used in conjunction, to fit specific downstream applications. Accordingly, only the gradients with respect to these lightweight modules are shared with the server. In this work, we investigate how the privacy of the fine-tuning data of the users can be compromised via a malicious design of the pretrained model and trainable adapter modules. We demonstrate gradient inversion attacks on a popular PEFT mechanism, the adapter, which allow an attacker to reconstruct local data samples of a target user, using only the accessible adapter gradients. Via extensive experiments, we demonstrate that a large batch of fine-tuning images can be retrieved with high fidelity. Our attack highlights the need for privacy-preserving mechanisms for PEFT, while opening up several future directions. Our code is available at https://github.com/info-ucr/PEFTLeak.

[267] A Poisson-Guided Decomposition Network for Extreme Low-Light Image Enhancement

Isha Rao,Sanjay Ghosh

Main category: eess.IV

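TL;DR: Proposes a lightweight deep learning method that combines Retinex decomposition with Poisson denoising for image denoising and enhancement under extreme low-light conditions.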

Details

Motivation: Address image denoising and enhancement under signal-dependent (Poisson) noise in low light, where the conventional Gaussian noise assumption does not hold. Method: An encoder-decoder network that integrates Retinex decomposition with a Poisson denoising loss and requires no prior reflectance or illumination information. Result: Markedly improves visibility and brightness in low light while preserving image structure and color constancy. Conclusion: The method is effective and practical under extreme low-light conditions and handles signal-dependent noise. Abstract: Low-light image denoising and enhancement are challenging, especially when traditional noise assumptions, such as Gaussian noise, do not hold in most cases. In many real-world scenarios, such as low-light imaging, noise is signal-dependent and is better represented as Poisson noise. In this work, we address the problem of denoising images degraded by Poisson noise under extreme low-light conditions. We introduce a light-weight deep learning-based method that integrates Retinex based decomposition with Poisson denoising into a unified encoder-decoder network. The model simultaneously enhances illumination and suppresses noise by incorporating a Poisson denoising loss to address signal-dependent noise. Without prior requirement for reflectance and illumination, the network learns an effective decomposition process while ensuring consistent reflectance and smooth illumination without causing any form of color distortion. The experimental results demonstrate the effectiveness and practicality of the proposed low-light illumination enhancement method. Our method significantly improves visibility and brightness in low-light conditions, while preserving image structure and color constancy under ambient illumination.
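
Illustrative sketch (not the authors' code): the abstract mentions a Poisson denoising loss for signal-dependent noise. A common choice for such a loss is the Poisson negative log-likelihood, which for a predicted rate lam and an observed count y is lam - y * log(lam) up to a term that does not depend on lam. Whether the paper uses exactly this form is an assumption; the function name and the small eps guard are illustrative.

```python
import math


def poisson_nll(predicted, observed, eps=1e-8):
    """Mean Poisson negative log-likelihood, dropping the log(y!) constant.

    `predicted` holds non-negative rate estimates (e.g. clean pixel
    intensities) and `observed` holds the noisy counts. `eps` guards
    against log(0).
    """
    total = 0.0
    for lam, y in zip(predicted, observed):
        total += lam - y * math.log(lam + eps)
    return total / len(predicted)


# For a single observation of 4, the loss is lowest when the prediction is 4.
loss_under = poisson_nll([3.0], [4.0])
loss_match = poisson_nll([4.0], [4.0])
loss_over = poisson_nll([5.0], [4.0])
```

Unlike a squared-error loss, this objective does not penalize deviations symmetrically: under-predicting by a given amount costs more than over-predicting by the same amount, reflecting a Poisson model whose variance grows with the rate.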

[268] DACN: Dual-Attention Convolutional Network for Hyperspectral Image Super-Resolution

Usman Muhammad,Jorma Laaksonen

Main category: eess.IV

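TL;DR: DACN is a dual-attention convolutional network that addresses the reliance on local neighborhoods and the lack of global contextual understanding in hyperspectral image super-resolution.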

Details

Motivation: Conventional 2D CNNs for hyperspectral image super-resolution rely on local neighborhoods and lack global contextual understanding, and are further limited by band correlation and data scarcity. Method: DACN combines multi-head attention and channel attention, uses augmented convolutions to capture local and global feature dependencies, and optimizes the loss function to ensure spectral fidelity. Result: Experiments on two hyperspectral datasets show that combining multi-head and channel attention outperforms either mechanism alone. Conclusion: With its dual-attention mechanism and optimized loss, DACN markedly improves hyperspectral image super-resolution performance. Abstract: 2D convolutional neural networks (CNNs) have attracted significant attention for hyperspectral image super-resolution tasks. However, a key limitation is their reliance on local neighborhoods, which leads to a lack of global contextual understanding. Moreover, band correlation and data scarcity continue to limit their performance. To mitigate these issues, we introduce DACN, a dual-attention convolutional network for hyperspectral image super-resolution. Specifically, the model first employs augmented convolutions, integrating multi-head attention to effectively capture both local and global feature dependencies. Next, we infer separate attention maps for the channel and spatial dimensions to determine where to focus across different channels and spatial positions. Furthermore, a custom optimized loss function is proposed that combines L2 regularization with spatial-spectral gradient loss to ensure accurate spectral fidelity. Experimental results on two hyperspectral datasets demonstrate that the combination of multi-head attention and channel attention outperforms either attention mechanism used individually.

[269] PixCell: A generative foundation model for digital histopathology images

Srikar Yellapragada,Alexandros Graikos,Zilinghan Li,Kostas Triaridis,Varun Belagali,Saarthak Kapse,Tarak Nath Nandi,Ravi K Madduri,Prateek Prasanna,Tahsin Kurc,Rajarsi R. Gupta,Joel Saltz,Dimitris Samaras

Main category: eess.IV

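TL;DR: PixCell is a diffusion-based generative foundation model for histopathology that generates high-quality images, helps with data scarcity and privacy concerns, and supports data augmentation and educational use.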

Details

Motivation: Meet the need for synthetic images in pathology, for example to overcome annotated-data scarcity, enable privacy-preserving data sharing, and perform generative tasks such as virtual staining. Method: Train the diffusion model PixCell on the PanCan-30M dataset using a progressive training strategy and self-supervision-based conditioning. Result: PixCell generates diverse, high-quality images that can be used to train self-supervised discriminative models, and supports data augmentation and inference of molecular marker study results. Conclusion: PixCell gives computational pathology a powerful generative tool that accelerates research and addresses practical problems. Abstract: The digitization of histology slides has revolutionized pathology, providing massive datasets for cancer diagnosis and research. Contrastive self-supervised and vision-language models have been shown to effectively mine large pathology datasets to learn discriminative representations. On the other hand, generative models, capable of synthesizing realistic and diverse images, present a compelling solution to address unique problems in pathology that involve synthesizing images: overcoming annotated data scarcity, enabling privacy-preserving data sharing, and performing inherently generative tasks, such as virtual staining. We introduce PixCell, the first diffusion-based generative foundation model for histopathology. We train PixCell on PanCan-30M, a vast, diverse dataset derived from 69,184 H&E-stained whole slide images covering various cancer types. We employ a progressive training strategy and a self-supervision-based conditioning that allows us to scale up training without any annotated data. PixCell generates diverse and high-quality images across multiple cancer types, which we find can be used in place of real data to train a self-supervised discriminative model. Synthetic images shared between institutions are subject to fewer regulatory barriers than would be the case with real clinical images. Furthermore, we showcase the ability to precisely control image generation using a small set of annotated images, which can be used for both data augmentation and educational purposes. Testing on a cell segmentation task, a mask-guided PixCell enables targeted data augmentation, improving downstream performance.
Finally, we demonstrate PixCell's ability to use H&E structural staining to infer results from molecular marker studies; we use this capability to infer IHC staining from H&E images. Our trained models are publicly released to accelerate research in computational pathology.

[270] DM-SegNet: Dual-Mamba Architecture for 3D Medical Image Segmentation with Global Context Modeling

Hangyu Ji

Main category: eess.IV

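TL;DR: DM-SegNet is a dual-Mamba architecture that uses directional state transitions and anatomy-aware hierarchical decoding to reconcile global context modeling with spatial topology preservation in medical image segmentation.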

Details

Motivation: Existing medical state space models (SSMs) suffer from encoder-decoder incompatibility: 1D sequence flattening destroys spatial structure, and conventional decoders cannot exploit Mamba's state propagation. Method: DM-SegNet combines a spatial Mamba module with four-directional 3D scanning, a gated spatial convolution layer, and a Mamba-driven decoding framework that synchronizes states bidirectionally across scales. Result: The model reaches a DSC of 85.44% on Synapse and 90.22% on BraTS2023, the best results among the compared methods. Conclusion: Through its architectural design, DM-SegNet substantially improves the accuracy of 3D medical image segmentation. Abstract: Accurate 3D medical image segmentation demands architectures capable of reconciling global context modeling with spatial topology preservation. While State Space Models (SSMs) like Mamba show potential for sequence modeling, existing medical SSMs suffer from encoder-decoder incompatibility: the encoder's 1D sequence flattening compromises spatial structures, while conventional decoders fail to leverage Mamba's state propagation. We present DM-SegNet, a Dual-Mamba architecture integrating directional state transitions with anatomy-aware hierarchical decoding. The core innovations include a quadri-directional spatial Mamba module employing four-directional 3D scanning to maintain anatomical spatial coherence, a gated spatial convolution layer that enhances spatially sensitive feature representation prior to state modeling, and a Mamba-driven decoding framework enabling bidirectional state synchronization across scales. Extensive evaluation on two clinically significant benchmarks demonstrates the efficacy of DM-SegNet: it achieves a state-of-the-art Dice Similarity Coefficient (DSC) of 85.44% on the Synapse dataset for abdominal organ segmentation and 90.22% on the BraTS2023 dataset for brain tumor segmentation.
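For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how a Dice Similarity Coefficient is commonly computed for one binary mask pair. The function name `dice_coefficient`, the flat 0/1 list representation, and the choice to score two empty masks as 1.0 are assumptions made for illustration; the abstract does not describe the evaluation code actually used.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two binary masks.

    Both masks are flat sequences of 0/1 values (a flattened 3D volume
    would work the same way).
    """
    if len(pred) != len(target):
        raise ValueError("masks must contain the same number of voxels")
    # Voxels labeled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy example: 2 overlapping voxels, 3 foreground voxels in each mask.
predicted = [1, 1, 0, 0, 1, 0]
reference = [1, 0, 0, 0, 1, 1]
print(round(dice_coefficient(predicted, reference), 4))  # 0.6667
```

Benchmark figures such as the ones above are typically averages of this per-structure score over organs or tumor regions and over test cases.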

cs.CY [Back]

[271] Understanding and Meeting Practitioner Needs When Measuring Representational Harms Caused by LLM-Based Systems

Emma Harvey,Emily Sheng,Su Lin Blodgett,Alexandra Chouldechova,Jean Garcia-Gathright,Alexandra Olteanu,Hanna Wallach

Main category: cs.CY

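TL;DR: The study finds that publicly available instruments often fail to meet practitioners' needs when measuring representational harms caused by LLM-based systems, either because the instruments are not useful or because barriers prevent their use.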

Details

Motivation: To examine whether publicly available instruments meet the needs of practitioners who measure representational harms caused by LLM-based systems. Method: Semi-structured interviews with 12 practitioners, analyzing how they use the instruments and what challenges they face. Result: Practitioners are often unable to use publicly available instruments, either because the instruments are not useful or because practical and institutional barriers stand in the way. Conclusion: The authors recommend improving instruments on the basis of measurement theory and pragmatic measurement to better meet practitioner needs. Abstract: The NLP research community has made publicly available numerous instruments for measuring representational harms caused by large language model (LLM)-based systems. These instruments have taken the form of datasets, metrics, tools, and more. In this paper, we examine the extent to which such instruments meet the needs of practitioners tasked with evaluating LLM-based systems. Via semi-structured interviews with 12 such practitioners, we find that practitioners are often unable to use publicly available instruments for measuring representational harms. We identify two types of challenges. In some cases, instruments are not useful because they do not meaningfully measure what practitioners seek to measure or are otherwise misaligned with practitioner needs. In other cases, instruments, even useful ones, are not used by practitioners due to practical and institutional barriers impeding their uptake. Drawing on measurement theory and pragmatic measurement, we provide recommendations for addressing these challenges to better meet practitioner needs.

cs.HC [Back]

[272] Beyond the Desktop: XR-Driven Segmentation with Meta Quest 3 and MX Ink

Lisle Faray de Paiva,Gijs Luijten,Ana Sofia Ferreira Santos,Moon Kim,Behrus Puladi,Jens Kleesiek,Jan Egger

Main category: cs.HC

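TL;DR: The study develops an extended reality (XR)-based medical image segmentation tool that pairs the Meta Quest 3 headset with the Logitech MX Ink stylus to simplify manual annotation in clinical settings.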

Details

Motivation: Medical image segmentation is essential in clinical practice, but manual annotation is slow and laborious. The study uses XR to reduce workflow fragmentation and cognitive load. Method: The authors build an immersive interface for real-time interaction with 2D and 3D medical imaging data, combining stylus-driven annotation with instant 3D rendering, and evaluate it in a user study on a public craniofacial CT dataset. Result: The tool scores 66 on the System Usability Scale (SUS). Users rate its controls as intuitive (4.1/5) and praise the spatial interaction design, but ask for better task precision and error management. Conclusion: The XR-stylus paradigm is a viable foundation for immersive segmentation tools; future work should refine haptic feedback and workflow personalization to advance clinical adoption. Abstract: Medical imaging segmentation is essential in clinical settings for diagnosing diseases, planning surgeries, and other procedures. However, manual annotation is a cumbersome and effortful task. To mitigate these aspects, this study implements and evaluates the usability and clinical applicability of an extended reality (XR)-based segmentation tool for anatomical CT scans, using the Meta Quest 3 headset and Logitech MX Ink stylus. We develop an immersive interface enabling real-time interaction with 2D and 3D medical imaging data in a customizable workspace designed to mitigate workflow fragmentation and cognitive demands inherent to conventional manual segmentation tools. The platform combines stylus-driven annotation, mirroring traditional pen-on-paper workflows, with instant 3D volumetric rendering. A user study with a public craniofacial CT dataset demonstrated the tool's foundational viability, achieving a System Usability Scale (SUS) score of 66, within the expected range for medical applications. Participants highlighted the system's intuitive controls (scoring 4.1/5 for self-descriptiveness on ISONORM metrics) and spatial interaction design, with qualitative feedback highlighting strengths in hybrid 2D/3D navigation and realistic stylus ergonomics. While users identified opportunities to enhance task-specific precision and error management, the platform's core workflow enabled dynamic slice adjustment, reducing cognitive load compared to desktop tools. Results position the XR-stylus paradigm as a promising foundation for immersive segmentation tools, with iterative refinements targeting haptic feedback calibration and workflow personalization to advance adoption in preoperative planning.
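As background for the SUS figure above, here is a minimal sketch of the standard System Usability Scale scoring rule: ten items rated 1 to 5, odd-numbered items contributing (rating - 1), even-numbered items contributing (5 - rating), and the sum scaled by 2.5 to a 0-100 range. The function name `sus_score` and the sample ratings are illustrative assumptions, not data from the study.

```python
def sus_score(responses):
    """Score one participant's SUS questionnaire on the 0-100 scale.

    `responses` holds the ten item ratings (integers 1..5) in
    questionnaire order.
    """
    if len(responses) != 10 or any(r < 1 or r > 5 for r in responses):
        raise ValueError("expected ten ratings between 1 and 5")
    total = 0
    for index, rating in enumerate(responses):
        if index % 2 == 0:
            # Items 1, 3, 5, 7, 9 are positively worded.
            total += rating - 1
        else:
            # Items 2, 4, 6, 8, 10 are negatively worded.
            total += 5 - rating
    return total * 2.5


# Hypothetical ratings from two participants (not taken from the paper).
participants = [
    [4, 2, 4, 2, 4, 2, 4, 2, 4, 2],
    [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
]
scores = [sus_score(p) for p in participants]
print(scores)                     # [75.0, 50.0]
print(sum(scores) / len(scores))  # 62.5
```

A study-level SUS value such as the one reported is usually the mean of these per-participant scores.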

[273] From Screen to Space: Evaluating Siemens' Cinematic Reality

Gijs Luijten,Lisle Faray de Paiva,Sebastian Krueger,Alexander Brost,Laura Mazilescu,Ana Sofia Ferreira Santos,Peter Hoyer,Jens Kleesiek,Sophia Marie-Therese Schmitz,Ulf Peter Neumann,Jan Egger

Main category: cs.HC

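TL;DR: The paper evaluates the usability and clinical potential of Siemens' Cinematic Reality on the Apple Vision Pro; feedback from medical experts indicates that it is feasible and may have clinical value.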

Details

Motivation: To explore the practical potential of immersive cinematic rendering in medical imaging. Method: Using the CHAOS and MRCP_DLRecon datasets, 14 medical experts assess usability and the potential for clinical integration through questionnaires. Result: The feedback confirms feasibility and identifies key strengths as well as features that still need improvement. Conclusion: Immersive cinematic rendering has potential value in medical imaging but needs further refinement to fit clinical workflows. Abstract: As one of the first research teams with full access to Siemens' Cinematic Reality, we evaluate its usability and clinical potential for cinematic volume rendering on the Apple Vision Pro. We visualized venous-phase liver computed tomography and magnetic resonance cholangiopancreatography scans from the CHAOS and MRCP_DLRecon datasets. Fourteen medical experts assessed usability and anticipated clinical integration potential using the System Usability Scale, ISONORM 9242-110-S questionnaire, and an open-ended survey. Their feedback identified feasibility, key usability strengths, and required features to catalyze the adaptation in real-world clinical workflows. The findings provide insights into the potential of immersive cinematic rendering in medical imaging.

cs.MM [Back]

[274] CMIE: Combining MLLM Insights with External Evidence for Explainable Out-of-Context Misinformation Detection

Fanxiao Li,Jiaying Wu,Canyuan He,Wei Zhou

Main category: cs.MM

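TL;DR: The paper proposes CMIE, a framework that uses coexistence relationship generation and an association scoring mechanism to improve how multimodal large language models detect out-of-context misinformation.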

Details

Motivation: Existing MLLMs struggle to capture deeper semantic links when detecting out-of-context misinformation, and noisy evidence reduces their accuracy. Method: CMIE combines a Coexistence Relationship Generation (CRG) strategy with an Association Scoring (AS) mechanism to identify the underlying relationships between images and text and to use evidence selectively. Result: Experiments show that CMIE outperforms existing methods. Conclusion: By strengthening semantic association and reducing the influence of noise, CMIE improves the accuracy of misinformation detection. Abstract: Multimodal large language models (MLLMs) have demonstrated impressive capabilities in visual reasoning and text generation. While previous studies have explored the application of MLLMs for detecting out-of-context (OOC) misinformation, our empirical analysis reveals two persisting challenges of this paradigm. Evaluating the representative GPT-4o model on direct reasoning and evidence-augmented reasoning, results indicate that MLLMs struggle to capture deeper relationships, specifically cases in which the image and text are not directly connected but are associated through underlying semantic links. Moreover, noise in the evidence further impairs detection accuracy. To address these challenges, we propose CMIE, a novel OOC misinformation detection framework that incorporates a Coexistence Relationship Generation (CRG) strategy and an Association Scoring (AS) mechanism. CMIE identifies the underlying coexistence relationships between images and text, and selectively utilizes relevant evidence to enhance misinformation detection. Experimental results demonstrate that our approach outperforms existing methods.

cs.SD [Back]

[275] Grapheme-Coherent Phonemic and Prosodic Annotation of Speech by Implicit and Explicit Grapheme Conditioning

Hien Ohnaka,Yuma Shirahata,Byeongseon Park,Ryuichi Yamamoto

Main category: cs.SD

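TL;DR: The paper proposes a model that aligns phonemic and prosodic labels with graphemes through implicit and explicit conditioning, significantly improving consistency, and validates the approach on an accent estimation task.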

Details

Motivation: Existing methods generate labels only by fine-tuning a pre-trained ASR model, with no direct link to graphemes, which limits their use in tasks such as text-to-speech. Method: 1) A prompt encoder using pre-trained BERT features provides implicit grapheme conditioning; 2) label hypotheses inconsistent with the graphemes are explicitly pruned during inference. Result: Consistency between labels and graphemes improves significantly, and the resulting three-way parallel data effectively raises accuracy on the accent estimation task. Conclusion: By combining implicit and explicit conditioning, the proposed method aligns labels with graphemes and provides high-quality data for downstream tasks. Abstract: We propose a model to obtain phonemic and prosodic labels of speech that are coherent with graphemes. Unlike previous methods that simply fine-tune a pre-trained ASR model with the labels, the proposed model conditions the label generation on corresponding graphemes by two methods: 1) Add implicit grapheme conditioning through a prompt encoder using pre-trained BERT features. 2) Explicitly prune the label hypotheses inconsistent with the grapheme during inference. These methods enable obtaining parallel data of speech, the labels, and graphemes, which is applicable to various downstream tasks such as text-to-speech and accent estimation from text. Experiments showed that the proposed method significantly improved the consistency between graphemes and the predicted labels. Further, experiments on an accent estimation task confirmed that the parallel data created by the proposed method effectively improves the estimation accuracy.
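The explicit pruning step mentioned above can be pictured with a deliberately simplified sketch. It assumes a hypothetical lookup of allowed label sequences for a grapheme string and filters whole hypotheses after decoding; the abstract does not specify how consistency is checked or whether pruning happens inside beam search, so `prune_hypotheses`, the fallback behavior, and the toy phoneme symbols are all assumptions rather than the authors' procedure.

```python
def prune_hypotheses(hypotheses, allowed_readings):
    """Keep only hypotheses whose labels match an allowed reading.

    hypotheses: list of (label_sequence, log_score) pairs.
    allowed_readings: set of label tuples considered consistent with the
        grapheme string (a hypothetical lexicon lookup).
    Returns the best surviving hypothesis. If nothing survives, falls back
    to the best unpruned hypothesis (an assumed policy).
    """
    if not hypotheses:
        raise ValueError("no hypotheses to prune")
    consistent = [
        (labels, score)
        for labels, score in hypotheses
        if tuple(labels) in allowed_readings
    ]
    pool = consistent if consistent else hypotheses
    return max(pool, key=lambda item: item[1])


# Toy example for the grapheme string "read" (illustrative symbols only).
allowed = {("r", "iy", "d"), ("r", "eh", "d")}
beam = [
    (("l", "iy", "d"), -0.9),   # best score, but inconsistent with "read"
    (("r", "iy", "d"), -1.2),
    (("r", "eh", "d"), -2.0),
]
print(prune_hypotheses(beam, allowed))  # (('r', 'iy', 'd'), -1.2)
```

The point of the sketch is only that a grapheme constraint can override the acoustically best hypothesis when that hypothesis contradicts the text.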

[276] LLM-based phoneme-to-grapheme for phoneme-based speech recognition

Te Ma,Min Bi,Saierdaer Yusuyin,Hao Huang,Zhijian Ou

Main category: cs.SD

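TL;DR: The paper proposes LLM-based phoneme-to-grapheme (LLM-P2G) decoding for phoneme-based automatic speech recognition (ASR), addresses information loss with data augmentation and a randomized training strategy, and outperforms conventional WFST decoding in crosslingual ASR for Polish and German.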

Details

Motivation: Phoneme-based multilingual pre-training and crosslingual fine-tuning are data-efficient and competitive for ASR, but conventional WFST decoding cannot leverage large language models and requires a complex pipeline. Method: LLM-P2G decoding consists of a speech-to-phoneme (S2P) stage and a phoneme-to-grapheme (P2G) stage; data augmentation with noisy phonemes (DANP) and randomized top-K marginalized (TKM) training address the information loss between the stages. Result: In crosslingual ASR, LLM-P2G reduces WER by a relative 3.6% for Polish and 6.9% for German. Conclusion: LLM-P2G outperforms conventional WFST decoding in phoneme-based ASR and shows the potential of large language models for crosslingual tasks. Abstract: In automatic speech recognition (ASR), phoneme-based multilingual pre-training and crosslingual fine-tuning is attractive for its high data efficiency and competitive results compared to subword-based models. However, Weighted Finite State Transducer (WFST) based decoding is limited by its complex pipeline and inability to leverage large language models (LLMs). Therefore, we propose LLM-based phoneme-to-grapheme (LLM-P2G) decoding for phoneme-based ASR, consisting of speech-to-phoneme (S2P) and phoneme-to-grapheme (P2G). A challenge is that there appears to be information loss in cascading S2P and P2G. To address this challenge, we propose two training strategies: data augmentation with noisy phonemes (DANP), and randomized top-$K$ marginalized (TKM) training and decoding. Our experimental results show that LLM-P2G outperforms WFST-based systems in crosslingual ASR for Polish and German, by relative WER reductions of 3.6% and 6.9% respectively.
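To make the phrase "relative WER reduction" concrete, the sketch below computes word error rate as word-level edit distance divided by reference length, and then the relative reduction between a baseline and a new system. The function names and all numeric inputs are illustrative assumptions; the abstract reports only the relative reductions, not the underlying absolute WERs.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edit distance between the processed part of ref and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)


def relative_reduction(baseline_wer, new_wer):
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# One substitution and one deletion against a six-word reference.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 4))  # 0.3333

# Made-up absolute WERs chosen so the relative reduction comes out at 6.9%.
print(round(relative_reduction(20.0, 18.62), 3))  # 0.069
```

Note that a relative reduction of 6.9% on a 20% baseline corresponds to an absolute improvement of only 1.38 points, which is why the two kinds of figures should not be confused.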

Table of Contents

cs.CV [Back]

[1] Dynamic Epsilon Scheduling: A Multi-Factor Adaptive Perturbation Budget for Adversarial Training

[2] RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought

[3] Evaluating MLLMs with Multimodal Multi-image Reasoning Benchmark

[4] HuGeDiff: 3D Human Generation via Diffusion with Gaussian Splatting

[5] ReXVQA: A Large-scale Visual Question Answering Benchmark for Generalist Chest X-ray Understanding

[6] WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning

[7] Puck Localization Using Contextual Cues

[8] Fine-Tuning Video Transformers for Word-Level Bangla Sign Language: A Comparative Analysis for Classification Tasks

[9] Visualizing and Controlling Cortical Responses Using Voxel-Weighted Activation Maximization

[10] Is Perturbation-Based Image Protection Disruptive to Image Editing?

[11] Normalize Filters! Classical Wisdom for Deep Vision

[12] Photoreal Scene Reconstruction from an Egocentric Device

[13] HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation

[14] Towards Large-Scale Pose-Invariant Face Recognition Using Face Defrontalization

[15] FALO: Fast and Accurate LiDAR 3D Object Detection on Resource-Constrained Devices

[16] AuthGuard: Generalizable Deepfake Detection via Language Guidance

[17] Pruning Everything, Everywhere, All at Once

[18] EECD-Net: Energy-Efficient Crack Detection with Spiking Neural Networks and Gated Attention

[19] Enhancing Frequency for Single Image Super-Resolution with Learnable Separable Kernels

[20] Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning

[21] LGM-Pose: A Lightweight Global Modeling Network for Real-time Human Pose Estimation

[22] Follow-Your-Creation: Empowering 4D Creation through Video Inpainting

[23] Hierarchical-Task-Aware Multi-modal Mixture of Incremental LoRA Experts for Embodied Continual Learning

[24] SmartAvatar: Text- and Image-Guided Human Avatar Generation with VLM AI Agents

[25] Perfecting Depth: Uncertainty-Aware Enhancement of Metric Depth

[26] Deep Learning Reforms Image Matching: A Survey and Outlook

[27] Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations

[28] Text-Aware Real-World Image Super-Resolution via Diffusion Model with Joint Segmentation Decoders

[29] FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion

[30] Feature-Based Lie Group Transformer for Real-World Applications

[31] Interpretable Few-Shot Image Classification via Prototypical Concept-Guided Mixture of LoRA Experts

[32] Gen-n-Val: Agentic Image Data Generation and Validation

[33] MARS: Radio Map Super-resolution and Reconstruction Method under Sparse Channel Measurements

[34] HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model

[35] Line of Sight: On Linear Representations in VLLMs

[36] Robust Few-Shot Vision-Language Model Adaptation

[37] Towards Holistic Visual Quality Assessment of AI-Generated Videos: A LLM-Based Multi-Dimensional Evaluation Model

[38] Learning dissection trajectories from expert surgical videos via imitation learning with equivariant diffusion

[39] Using In-Context Learning for Automatic Defect Labelling of Display Manufacturing Data

[40] Bridging Annotation Gaps: Transferring Labels to Align Object Detection Datasets

[41] SRD: Reinforcement-Learned Semantic Perturbation for Backdoor Defense in VLMs

[42] Physics Informed Capsule Enhanced Variational AutoEncoder for Underwater Image Enhancement

[43] Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning

[44] Toward Better SSIM Loss for Unsupervised Monocular Depth Estimation

[45] HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition

[46] Object-X: Learning to Reconstruct Multi-Modal 3D Object Representations

[47] LotusFilter: Fast Diverse Nearest Neighbor Search via a Learned Cutoff Table

[48] SupeRANSAC: One RANSAC to Rule Them All

[49] MegaHan97K: A Large-Scale Dataset for Mega-Category Chinese Character Recognition with over 97K Categories

[50] Spike-TBR: a Noise Resilient Neuromorphic Event Representation

[51] Fool the Stoplight: Realistic Adversarial Patch Attacks on Traffic Light Detectors

[52] DualX-VSR: Dual Axial Spatial×Temporal Transformer for Real-World Video Super-Resolution without Motion Compensation

[53] OpenMaskDINO3D: Reasoning 3D Segmentation via Large Language Model

[54] Geological Field Restoration through the Lens of Image Inpainting

[55] Invisible Backdoor Triggers in Image Editing Model via Deep Watermarking

[56] Learning to Plan via Supervised Contrastive Learning and Strategic Interpolation: A Chess Case Study

[57] From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes

[58] Generating Synthetic Stereo Datasets using 3D Gaussian Splatting and Expert Knowledge Transfer

[59] Light and 3D: a methodological exploration of digitisation techniques adapted to a selection of objects from the Musée d'Archéologie Nationale

[60] CzechLynx: A Dataset for Individual Identification and Pose Estimation of the Eurasian Lynx

[61] Time-Lapse Video-Based Embryo Grading via Complementary Spatial-Temporal Pattern Mining

[62] Robustness as Architecture: Designing IQA Models to Withstand Adversarial Perturbations

[63] APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval

[64] FEAT: Full-Dimensional Efficient Attention Transformer for Medical Video Generation

[65] Bringing SAM to new heights: Leveraging elevation data for tree crown segmentation from drone imagery

[66] TextVidBench: A Benchmark for Long Video Scene Text Understanding

[67] Multi-scale Image Super Resolution with a Single Auto-Regressive Model

[68] PATS: Proficiency-Aware Temporal Sampling for Multi-View Sports Skill Assessment

[69] Beyond Cropped Regions: New Benchmark and Corresponding Baseline for Chinese Scene Text Retrieval in Diverse Layouts

[70] Structure-Aware Radar-Camera Depth Estimation

[71] Point Cloud Segmentation of Agricultural Vehicles using 3D Gaussian Splatting

[72] UAV4D: Dynamic Neural Rendering of Human-Centric UAV Imagery using Gaussian Splatting

[73] Physical Annotation for Automated Optical Inspection: A Concept for In-Situ, Pointer-Based Trainingdata Generation

[74] FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing

[75] A Survey on Vietnamese Document Analysis and Recognition: Challenges and Future Directions

[76] SeedEdit 3.0: Fast and High-Quality Generative Image Editing

[77] Interpretable Multimodal Framework for Human-Centered Street Assessment: Integrating Visual-Language Models for Perceptual Urban Diagnostics

[78] FG 2025 TrustFAA: the First Workshop on Towards Trustworthy Facial Affect Analysis: Advancing Insights of Fairness, Explainability, and Safety (TrustFAA)