cs.CV [Total: 189]
cs.GR [Total: 12]
cs.CL [Total: 133]
cs.MM [Total: 1]
stat.ML [Total: 1]
eess.SP [Total: 3]
cs.MA [Total: 2]
cs.CY [Total: 5]
cs.RO [Total: 10]
cs.AR [Total: 2]
eess.IV [Total: 9]
physics.ed-ph [Total: 1]
cs.IR [Total: 6]
eess.AS [Total: 2]
cs.AI [Total: 14]
cs.CR [Total: 4]
cs.LG [Total: 32]
q-fin.ST [Total: 1]

cs.CV

[1] Facial Foundational Model Advances Early Warning of Coronary Artery Disease from Live Videos with DigitalShadow

Juexiao Zhou,Zhongyi Han,Mankun Xin,Xingwei He,Guotao Wang,Jiaoyan Song,Gongning Luo,Wenjia He,Xintong Li,Yuetan Chu,Juanwen Chen,Bo Wang,Xia Wu,Wenwen Duan,Zhixia Guo,Liyan Bai,Yilin Pan,Xuefei Bi,Lu Liu,Long Feng,Xiaonan He,Xin Gao

Main category: cs.CV

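TL;DR: DigitalShadow is an early warning system for CAD built on a facial foundation model; it extracts facial features from video streams without contact and generates personalized risk reports and health recommendations.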

Details

Motivation: Global population aging and the high mortality of CAD motivate a contactless, passive early detection system to improve preventive care.

Method: The system uses a model pre-trained on 21 million facial images and fine-tuned into LiveCAD, which assesses CAD risk from 7,004 facial images of 1,751 subjects.

Result: DigitalShadow passively and contactlessly generates CAD risk reports and personalized health recommendations, and supports local deployment to protect data privacy.

Conclusion: DigitalShadow offers a novel, privacy-preserving approach to early CAD detection and management.

Abstract: Global population aging presents increasing challenges to healthcare systems, with coronary artery disease (CAD) responsible for approximately 17.8 million deaths annually, making it a leading cause of global mortality. As CAD is largely preventable, early detection and proactive management are essential. In this work, we introduce DigitalShadow, an advanced early warning system for CAD, powered by a fine-tuned facial foundation model. The system is pre-trained on 21 million facial images and subsequently fine-tuned into LiveCAD, a specialized CAD risk assessment model trained on 7,004 facial images from 1,751 subjects across four hospitals in China. DigitalShadow functions passively and contactlessly, extracting facial features from live video streams without requiring active user engagement. Integrated with a personalized database, it generates natural language risk reports and individualized health recommendations. With privacy as a core design principle, DigitalShadow supports local deployment to ensure secure handling of user data.

[2] Exploring Adversarial Watermarking in Transformer-Based Models: Transferability and Robustness Against Defense Mechanism for Medical Images

Rifat Sadik,Tanvir Rahman,Arpan Bhattacharjee,Bikash Chandra Halder,Ismail Hossain

Main category: cs.CV

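TL;DR: The paper studies the vulnerability of Transformer-based vision models (ViTs) to adversarial watermarking attacks on medical images and shows that adversarial training improves their robustness.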

Details

Motivation: With Transformer models succeeding across computer vision tasks, it is important to study how vulnerable they are to adversarial attacks in medical image analysis.

Method: Adversarial watermarks are generated with Projected Gradient Descent (PGD) to test the vulnerability of ViTs, and the effect of adversarial training is evaluated.

Result: ViTs are markedly vulnerable to adversarial attacks (accuracy falls to as low as 27.6%), but adversarial training raises accuracy to 90.0%.

Conclusion: ViTs are easily attacked on medical images, and adversarial training is an effective defense.

Abstract: Deep learning models have shown remarkable success in dermatological image analysis, offering potential for automated skin disease diagnosis. Previously, convolutional neural network (CNN) based architectures have achieved immense popularity and success in computer vision (CV) tasks such as skin image recognition, generation, and video analysis. But with the emergence of transformer-based models, CV tasks are now carried out using these models. Vision Transformers (ViTs) are one such family of transformer-based models that has shown success in computer vision. They use self-attention mechanisms to achieve state-of-the-art performance across various tasks. However, their reliance on global attention mechanisms makes them susceptible to adversarial perturbations. This paper aims to investigate the susceptibility of ViTs for medical images to adversarial watermarking, a method that adds so-called imperceptible perturbations in order to fool models. By generating adversarial watermarks through Projected Gradient Descent (PGD), we examine the transferability of such attacks to CNNs and analyze the performance of a defense mechanism, adversarial training. Results indicate that while performance is not compromised for clean images, ViTs certainly become much more vulnerable to adversarial attacks, with accuracy dropping to as low as 27.6%. Nevertheless, adversarial training raises it up to 90.0%.
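The entry above names Projected Gradient Descent (PGD) as the way the adversarial perturbations are produced. As a rough illustration of the general technique only, the sketch below runs an L-infinity PGD attack against a tiny logistic-regression classifier. The classifier, the function name `pgd_attack`, and the values of `eps`, `step`, and `iters` are assumptions chosen for readability; the paper attacks ViTs and its actual settings are not given in the abstract.

```python
import numpy as np


def pgd_attack(x, y, w, b, eps=0.1, step=0.02, iters=20):
    """L-infinity PGD on a logistic-regression model (a stand-in, not a ViT).

    x: input vector with values in [0, 1]; y: true label (0 or 1);
    w, b: model weights and bias. All hyperparameters are illustrative.
    """
    x_adv = x.copy()
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(x_adv @ w + b)))  # predicted P(y = 1)
        grad = (p - y) * w                          # gradient of cross-entropy w.r.t. the input
        x_adv = x_adv + step * np.sign(grad)        # signed gradient ascent on the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)    # project back into the eps-ball around x
        x_adv = np.clip(x_adv, 0.0, 1.0)            # keep values in a valid pixel range
    return x_adv


x = np.full(4, 0.5)
w = np.array([1.0, -1.0, 2.0, -2.0])
x_adv = pgd_attack(x, y=1, w=w, b=0.0)
print("clean logit:", x @ w, "adversarial logit:", x_adv @ w)
```

The projection step is what separates PGD from plain gradient ascent: the perturbation can never exceed `eps` per coordinate, which is how the perturbation is kept small.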

[3] (LiFT) Lightweight Fitness Transformer: A language-vision model for Remote Monitoring of Physical Training

A. Postlmayr,P. Cosman,S. Dey

Main category: cs.CV

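TL;DR: A remote fitness tracking system that uses only an RGB smartphone camera; it is private, scalable, and cost-effective, and supports detection and repetition counting for hundreds of exercises.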

Details

Motivation: Existing fitness tracking models either cover too few exercises or are too complex to deploy, and they generalize poorly.

Method: A multitask motion analysis model is developed using the large-scale Olympia dataset (more than 1,900 exercises), with a vision-language model performing exercise detection and repetition counting.

Result: On Olympia, the model reaches 76.5% exercise detection accuracy and 85.3% off-by-one repetition counting accuracy.

Conclusion: A single vision-language model handles both exercise recognition and counting, a step toward broadly accessible AI fitness tracking.

Abstract: We introduce a fitness tracking system that enables remote monitoring for exercises using only an RGB smartphone camera, making fitness tracking more private, scalable, and cost effective. Although prior work explored automated exercise supervision, existing models are either too limited in exercise variety or too complex for real-world deployment. Prior approaches typically focus on a small set of exercises and fail to generalize across diverse movements. In contrast, we develop a robust, multitask motion analysis model capable of performing exercise detection and repetition counting across hundreds of exercises, a scale far beyond previous methods. We overcome previous data limitations by assembling a large-scale fitness dataset, Olympia, covering more than 1,900 exercises. To our knowledge, our vision-language model is the first that can perform multiple tasks on skeletal fitness data. On Olympia, our model can detect exercises with 76.5% accuracy and count repetitions with 85.3% off-by-one accuracy, using only RGB video. By presenting a single vision-language transformer model for both exercise identification and rep counting, we take a significant step toward democratizing AI-powered fitness tracking.
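The repetition-counting figure above is reported as off-by-one accuracy. The abstract does not define the metric, so the sketch below assumes the common reading: a clip counts as correct when the predicted count is within one repetition of the true count. The function name `off_by_one_accuracy` and the sample counts are illustrative, not taken from the paper.

```python
def off_by_one_accuracy(predicted, actual):
    """Fraction of clips whose predicted rep count is within 1 of the true count.

    Assumed definition; the paper may compute the metric differently.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("need at least one clip")
    hits = sum(abs(p - a) <= 1 for p, a in zip(predicted, actual))
    return hits / len(predicted)


# Three of the four clips are within one repetition of the truth.
print(off_by_one_accuracy([10, 8, 5, 12], [10, 9, 8, 12]))
```

Under this definition the metric is more forgiving than exact-match accuracy, which is worth keeping in mind when comparing it with the 76.5% detection figure.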

[4] GS4: Generalizable Sparse Splatting Semantic SLAM

Mingqi Jiang,Chanho Kim,Chen Ziwen,Li Fuxin

Main category: cs.CV

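TL;DR: A generalizable semantic SLAM algorithm based on Gaussian Splatting (GS) that incrementally builds a 3D scene representation with a learned network, removing the per-scene optimization that existing GS methods depend on.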

Details

Motivation: Traditional SLAM algorithms may produce low-resolution, incomplete maps, while existing GS methods rely on per-scene optimization, which is slow and generalizes poorly.

Method: An RGB-D image recognition backbone predicts Gaussian parameters, 3D semantic segmentation is integrated, and the GS is optimized for only one iteration after global localization to correct localization drift.

Result: State-of-the-art semantic SLAM performance on ScanNet with an order of magnitude fewer Gaussians than other GS methods, plus zero-shot generalization to the NYUv2 and TUM RGB-D datasets.

Conclusion: The method is efficient and generalizable, and improves both the performance and the generalization of semantic SLAM.

Abstract: Traditional SLAM algorithms are excellent at camera tracking but might generate lower resolution and incomplete 3D maps. Recently, Gaussian Splatting (GS) approaches have emerged as an option for SLAM with accurate, dense 3D map building. However, existing GS-based SLAM methods rely on per-scene optimization which is time-consuming and does not generalize to diverse scenes well. In this work, we introduce the first generalizable GS-based semantic SLAM algorithm that incrementally builds and updates a 3D scene representation from an RGB-D video stream using a learned generalizable network. Our approach starts from an RGB-D image recognition backbone to predict the Gaussian parameters from every downsampled and backprojected image location. Additionally, we seamlessly integrate 3D semantic segmentation into our GS framework, bridging 3D mapping and recognition through a shared backbone. To correct localization drifting and floaters, we propose to optimize the GS for only 1 iteration following global localization. We demonstrate state-of-the-art semantic SLAM performance on the real-world benchmark ScanNet with an order of magnitude fewer Gaussians compared to other recent GS-based methods, and showcase our model's generalization capability through zero-shot transfer to the NYUv2 and TUM RGB-D datasets.

[5] Bridging Audio and Vision: Zero-Shot Audiovisual Segmentation by Connecting Pretrained Models

Seung-jae Lee,Paul Hongsuck Seo

Main category: cs.CV

TL;DR: A zero-shot audiovisual segmentation framework that uses pretrained models to segment sound sources without task-specific training.

Details

Motivation: Traditional methods depend on large-scale pixel-level annotations, which are costly and time-consuming, so an approach that needs no task-specific annotation is required.

Method: Audio, vision, and text representations are integrated by connecting pretrained models, bridging modality gaps to achieve precise segmentation.

Result: State-of-the-art zero-shot audiovisual segmentation performance on multiple datasets.

Conclusion: Integrating multimodal models is effective for fine-grained audiovisual segmentation.

Abstract: Audiovisual segmentation (AVS) aims to identify visual regions corresponding to sound sources, playing a vital role in video understanding, surveillance, and human-computer interaction. Traditional AVS methods depend on large-scale pixel-level annotations, which are costly and time-consuming to obtain. To address this, we propose a novel zero-shot AVS framework that eliminates task-specific training by leveraging multiple pretrained models. Our approach integrates audio, vision, and text representations to bridge modality gaps, enabling precise sound source segmentation without AVS-specific annotations. We systematically explore different strategies for connecting pretrained models and evaluate their efficacy across multiple datasets. Experimental results demonstrate that our framework achieves state-of-the-art zero-shot AVS performance, highlighting the effectiveness of multimodal model integration for fine-grained audiovisual segmentation.

[6] Securing Traffic Sign Recognition Systems in Autonomous Vehicles

Thushari Hapuarachchi,Long Dang,Kaiqi Xiong

Main category: cs.CV

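TL;DR: The paper studies the robustness of deep neural networks (DNNs) for traffic sign recognition, proposes a data augmentation-based training method to defend against error-minimizing attacks, and develops a detection model that identifies poisoned data.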

Details

Motivation: DNNs are widely used for traffic sign recognition and are trained on data from unknown sources, so keeping the models from being attacked or poisoned is essential.

Method: Error-minimizing attacks are carried out by adding small perturbations to the training data; a data augmentation-based training method using nonlinear transformations is proposed to improve robustness.

Result: The attack reduces DNN prediction accuracy from 99.90% to 10.6%; the proposed method restores it to 96.05%, and the detection model identifies the attack with a success rate above 99%.

Conclusion: Advanced training methods are needed to defend against data poisoning and keep traffic sign recognition systems secure.

Abstract: Deep Neural Networks (DNNs) are widely used for traffic sign recognition because they can automatically extract high-level features from images. These DNNs are trained on large-scale datasets obtained from unknown sources. Therefore, it is important to ensure that the models remain secure and are not compromised or poisoned during training. In this paper, we investigate the robustness of DNNs trained for traffic sign recognition. First, we perform the error-minimizing attacks on DNNs used for traffic sign recognition by adding imperceptible perturbations on training data. Then, we propose a data augmentation-based training method to mitigate the error-minimizing attacks. The proposed training method utilizes nonlinear transformations to disrupt the perturbations and improve the model robustness. We experiment with two well-known traffic sign datasets to demonstrate the severity of the attack and the effectiveness of our mitigation scheme. The error-minimizing attacks reduce the prediction accuracy of the DNNs from 99.90% to 10.6%. However, our mitigation scheme successfully restores the prediction accuracy to 96.05%. Moreover, our approach outperforms adversarial training in mitigating the error-minimizing attacks. Furthermore, we propose a detection model capable of identifying poisoned data even when the perturbations are imperceptible to human inspection. Our detection model achieves a success rate of over 99% in identifying the attack. This research highlights the need to employ advanced training methods for DNNs in traffic sign recognition systems to mitigate the effects of data poisoning attacks.

[7] Textile Analysis for Recycling Automation using Transfer Learning and Zero-Shot Foundation Models

Yannis Spyridis,Vasileios Argyriou

Main category: cs.CV

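TL;DR: The paper examines the feasibility of automated textile classification and contaminant segmentation using RGB images and deep learning (transfer learning and foundation models), demonstrating an effective, low-cost approach.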

Details

Motivation: Textile recycling urgently needs automated material identification and contaminant detection, but doing this accurately and efficiently from sensor data remains difficult.

Method: RGB images are used with transfer learning (EfficientNetB0) for classification and a zero-shot approach (Grounding DINO + SAM) for segmentation.

Result: Classification accuracy of 81.25% on a held-out test set and a segmentation mIoU of 0.90.

Conclusion: RGB imagery combined with modern deep learning can serve key pre-processing steps in textile recycling and has practical potential.

Abstract: Automated sorting is crucial for improving the efficiency and scalability of textile recycling, but accurately identifying material composition and detecting contaminants from sensor data remains challenging. This paper investigates the use of standard RGB imagery, a cost-effective sensing modality, for key pre-processing tasks in an automated system. We present computer vision components designed for a conveyor belt setup to perform (a) classification of four common textile types and (b) segmentation of non-textile features such as buttons and zippers. For classification, several pre-trained architectures were evaluated using transfer learning and cross-validation, with EfficientNetB0 achieving the best performance on a held-out test set with 81.25% accuracy. For feature segmentation, a zero-shot approach combining the Grounding DINO open-vocabulary detector with the Segment Anything Model (SAM) was employed, demonstrating excellent performance with a mIoU of 0.90 for the generated masks against ground truth. This study demonstrates the feasibility of using RGB images coupled with modern deep learning techniques, including transfer learning for classification and foundation models for zero-shot segmentation, to enable essential analysis steps for automated textile recycling pipelines.
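The segmentation result above is reported as mIoU between generated masks and ground truth. The sketch below shows one plausible way to compute it for binary masks, averaging the per-mask intersection over union. The averaging scheme (per mask pair rather than per class), the function name `mean_iou`, and the handling of empty masks are assumptions; the abstract does not say how the authors aggregate.

```python
import numpy as np


def mean_iou(pred_masks, gt_masks):
    """Mean intersection-over-union across pairs of binary masks.

    Assumes one predicted mask per ground-truth mask; two empty masks
    are treated as a perfect match (IoU = 1).
    """
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        pred = np.asarray(pred, dtype=bool)
        gt = np.asarray(gt, dtype=bool)
        union = np.logical_or(pred, gt).sum()
        inter = np.logical_and(pred, gt).sum()
        ious.append(1.0 if union == 0 else inter / union)
    return float(np.mean(ious))


pred = [np.array([[1, 1], [0, 0]]), np.array([[0, 1], [0, 1]])]
gt = [np.array([[1, 0], [0, 0]]), np.array([[0, 1], [0, 1]])]
print(mean_iou(pred, gt))  # first pair overlaps on 1 of 2 pixels, second pair matches exactly
```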

[8] A Deep Learning Approach for Facial Attribute Manipulation and Reconstruction in Surveillance and Reconnaissance

Anees Nashath Shaik,Barbara Villarini,Vasileios Argyriou

Main category: cs.CV

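TL;DR: A data-driven platform that generates synthetic training data to address bias in existing AI facial analysis models, improving the accuracy and fairness of surveillance systems.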

Details

Motivation: Low-quality images and video reduce face recognition accuracy in existing surveillance systems, and AI models are biased with respect to skin tone and occlusion, which limits their real-world effectiveness.

Method: Deep learning (autoencoders and GANs) generates diverse synthetic facial data, and an integrated image enhancement module improves the clarity of low-resolution or occluded faces.

Result: Evaluation on the CelebA dataset indicates that the platform improves training data diversity and model fairness.

Conclusion: The work reduces bias in AI facial analysis and improves the accuracy and reliability of surveillance in challenging environments.

Abstract: Surveillance systems play a critical role in security and reconnaissance, but their performance is often compromised by low-quality images and videos, leading to reduced accuracy in face recognition. Additionally, existing AI-based facial analysis models suffer from biases related to skin tone variations and partially occluded faces, further limiting their effectiveness in diverse real-world scenarios. These challenges are the results of data limitations and imbalances, where available training datasets lack sufficient diversity, resulting in unfair and unreliable facial recognition performance. To address these issues, we propose a data-driven platform that enhances surveillance capabilities by generating synthetic training data tailored to compensate for dataset biases. Our approach leverages deep learning-based facial attribute manipulation and reconstruction using autoencoders and Generative Adversarial Networks (GANs) to create diverse and high-quality facial datasets. Additionally, our system integrates an image enhancement module, improving the clarity of low-resolution or occluded faces in surveillance footage. We evaluate our approach using the CelebA dataset, demonstrating that the proposed platform enhances both training data diversity and model fairness. This work contributes to reducing bias in AI-based facial analysis and improving surveillance accuracy in challenging environments, leading to fairer and more reliable security applications.

[9] EV-LayerSegNet: Self-supervised Motion Segmentation using Event Cameras

Youssef Farah,Federico Paredes-Vallés,Guido De Croon,Muhammad Ahmed Humais,Hussain Sajwani,Yahya Zweiri

Main category: cs.CV

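TL;DR: EV-LayerSegNet is a self-supervised CNN for event-camera motion segmentation; it learns optical flow and segmentation masks through a layered representation of scene dynamics and uses deblurring quality as the self-supervised loss.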

Details

Motivation: The high temporal resolution of event cameras suits motion-related tasks, but ground truth is expensive and difficult to obtain, so self-supervised methods are needed.

Method: EV-LayerSegNet learns affine optical flow and segmentation masks separately via a layered representation of scene dynamics, and uses deblurring quality as the self-supervised loss.

Result: On a simulated dataset with only affine motion, IoU and detection rate reach up to 71% and 87%, respectively.

Conclusion: EV-LayerSegNet performs event-camera motion segmentation effectively under self-supervision.

Abstract: Event cameras are novel bio-inspired sensors that capture motion dynamics with much higher temporal resolution than traditional cameras, since pixels react asynchronously to brightness changes. They are therefore better suited for tasks involving motion such as motion segmentation. However, training event-based networks still represents a difficult challenge, as obtaining ground truth is very expensive, error-prone and limited in frequency. In this article, we introduce EV-LayerSegNet, a self-supervised CNN for event-based motion segmentation. Inspired by a layered representation of the scene dynamics, we show that it is possible to learn affine optical flow and segmentation masks separately, and use them to deblur the input events. The deblurring quality is then measured and used as self-supervised learning loss. We train and test the network on a simulated dataset with only affine motion, achieving IoU and detection rate up to 71% and 87% respectively.

[10] RARL: Improving Medical VLM Reasoning and Generalization with Reinforcement Learning and LoRA under Data and Hardware Constraints

Tan-Hanh Pham,Chris Ngo

Main category: cs.CV

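TL;DR: The paper proposes RARL, a framework that uses reinforcement learning to improve the reasoning of medical vision-language models while staying efficient in resource-constrained settings. Experiments show RARL outperforming supervised fine-tuning and traditional RL fine-tuning on medical image analysis and clinical reasoning tasks.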

Details

Motivation: Current medical vision-language models are limited in generalization, transparency, and computational efficiency, which hinders deployment in real-world, resource-constrained settings.

Method: A lightweight base model, Qwen2-VL-2B-Instruct, is fine-tuned with Low-Rank Adaptation and custom reward functions (covering diagnostic accuracy and reasoning quality) on a single NVIDIA A100-PCIE-40GB GPU.

Result: RARL outperforms supervised fine-tuning by approximately 7.78% on reasoning tasks and, on unseen datasets, improves on traditional RL fine-tuning by about 4%.

Conclusion: Reasoning-guided learning and reasoning prompting can improve the transparency, accuracy, and resource efficiency of medical vision-language models.

Abstract: The growing integration of vision-language models (VLMs) in medical applications offers promising support for diagnostic reasoning. However, current medical VLMs often face limitations in generalization, transparency, and computational efficiency, barriers that hinder deployment in real-world, resource-constrained settings. To address these challenges, we propose a Reasoning-Aware Reinforcement Learning framework, RARL, that enhances the reasoning capabilities of medical VLMs while remaining efficient and adaptable to low-resource environments. Our approach fine-tunes a lightweight base model, Qwen2-VL-2B-Instruct, using Low-Rank Adaptation and custom reward functions that jointly consider diagnostic accuracy and reasoning quality. Training is performed on a single NVIDIA A100-PCIE-40GB GPU, demonstrating the feasibility of deploying such models in constrained environments. We evaluate the model using an LLM-as-judge framework that scores both correctness and explanation quality. Experimental results show that RARL significantly improves VLM performance in medical image analysis and clinical reasoning, outperforming supervised fine-tuning on reasoning-focused tasks by approximately 7.78%, while requiring fewer computational resources. Additionally, we demonstrate the generalization capabilities of our approach on unseen datasets, achieving around 27% improved performance compared to supervised fine-tuning and about 4% over traditional RL fine-tuning. Our experiments also illustrate that diversity prompting during training and reasoning prompting during inference are crucial for enhancing VLM performance. Our findings highlight the potential of reasoning-guided learning and reasoning prompting to steer medical VLMs toward more transparent, accurate, and resource-efficient clinical decision-making. Code and data are publicly available.
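The abstract credits Low-Rank Adaptation for keeping fine-tuning feasible on a single GPU but gives no rank, scaling factor, or target layers. The sketch below illustrates the general idea only: a frozen weight matrix plus a trainable low-rank update scaled by alpha / rank. The class name `LoRALinear` and the values of `rank`, `alpha`, and the layer sizes are hypothetical and are not taken from the paper.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer with a trainable low-rank update (illustrative only)."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                             # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, (rank, in_dim))   # trainable down-projection
        self.B = np.zeros((out_dim, rank))               # trainable up-projection, zero at start
        self.scale = alpha / rank

    def __call__(self, x):
        # Output = frozen path + scaled low-rank path: x W^T + scale * (x A^T) B^T
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size


W = np.ones((512, 512))
layer = LoRALinear(W, rank=4)
print("trainable:", layer.trainable_params(), "frozen:", W.size)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only the small `A` and `B` matrices would need gradients and optimizer state during fine-tuning.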

[11] Zero Shot Composed Image Retrieval

Santhosh Kakarla,Gautama Shastry Bulusu Venkata

Main category: cs.CV

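TL;DR: The paper improves on zero-shot composed image retrieval by fine-tuning BLIP-2 with a lightweight Q-Former, substantially raising retrieval performance, and analyzes the limitations of Retrieval-DPO.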

Details

Motivation: Zero-shot composed image retrieval (CIR) performs poorly on the FashionIQ benchmark, which motivates a search for more effective retrieval methods.

Method: (1) Fine-tune BLIP-2 with a lightweight Q-Former that fuses visual and textual features; (2) try Retrieval-DPO, which fine-tunes CLIP's text encoder.

Result: The BLIP-2 approach raises Recall@10 (up to 50.4%) and Recall@50 (67.6% on average); Retrieval-DPO reaches only 0.02% Recall@10.

Conclusion: Effective CIR requires multimodal fusion, ranking-aware objectives, and high-quality negatives; Retrieval-DPO failed because of design flaws.

Abstract: Composed image retrieval (CIR) allows a user to locate a target image by applying a fine-grained textual edit (e.g., "turn the dress blue" or "remove stripes") to a reference image. Zero-shot CIR, which embeds the image and the text with separate pretrained vision-language encoders, reaches only 20-25% Recall@10 on the FashionIQ benchmark. We improve this by fine-tuning BLIP-2 with a lightweight Q-Former that fuses visual and textual features into a single embedding, raising Recall@10 to 45.6% (shirt), 40.1% (dress), and 50.4% (top-tee) and increasing the average Recall@50 to 67.6%. We also examine Retrieval-DPO, which fine-tunes CLIP's text encoder with a Direct Preference Optimization loss applied to FAISS-mined hard negatives. Despite extensive tuning of the scaling factor, index, and sampling strategy, Retrieval-DPO attains only 0.02% Recall@10 -- far below zero-shot and prompt-tuned baselines -- because it (i) lacks joint image-text fusion, (ii) uses a margin objective misaligned with top-K metrics, (iii) relies on low-quality negatives, and (iv) keeps the vision and Transformer layers frozen. Our results show that effective preference-based CIR requires genuine multimodal fusion, ranking-aware objectives, and carefully curated negatives.
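All the retrieval numbers above are Recall@K values. As a reading aid, the sketch below computes Recall@K from query and gallery embeddings using cosine similarity, assuming exactly one correct gallery item per query. The function name `recall_at_k`, the toy embeddings, and the single-target assumption are illustrative; the FashionIQ evaluation protocol itself is not described in the abstract.

```python
import numpy as np


def recall_at_k(query_emb, gallery_emb, target_idx, k):
    """Fraction of queries whose target appears among the k most similar gallery items.

    query_emb: (Q, D) fused query embeddings; gallery_emb: (N, D) image embeddings;
    target_idx: length-Q sequence of gallery indices of the correct images.
    """
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T                               # cosine similarity, shape (Q, N)
    top_k = np.argsort(-sims, axis=1)[:, :k]     # indices of the k best matches per query
    hits = (top_k == np.asarray(target_idx)[:, None]).any(axis=1)
    return float(hits.mean())


gallery = np.eye(3)
queries = np.array([[1.0, 0.1, 0.0], [0.0, 0.2, 1.0]])
targets = [0, 1]
print(recall_at_k(queries, gallery, targets, k=1), recall_at_k(queries, gallery, targets, k=2))
```

The metric only checks membership in the top K, which is consistent with the abstract's point that a margin objective can be misaligned with it.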

[12] PhysLab: A Benchmark Dataset for Multi-Granularity Visual Parsing of Physics Experiments

Minghao Zou,Qingtian Zeng,Yongping Miao,Shangkun Liu,Zilong Wang,Hantao Liu,Wei Zhou

Main category: cs.CV

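TL;DR: PhysLab is a video dataset for educational settings that focuses on physics experiments and provides multilevel annotations supporting a variety of vision tasks.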

Details

Motivation: Existing datasets fall short in annotation granularity, domain coverage, and procedural guidance, which impedes fine-grained scene understanding and high-level reasoning.

Method: The PhysLab dataset is introduced, with 620 long-form videos covering four representative physics experiments and multilevel annotations.

Result: Strong baselines are established and extensively evaluated, highlighting key challenges in parsing procedural educational videos.

Conclusion: PhysLab is expected to advance fine-grained visual parsing, intelligent classroom systems, and closer integration of computer vision and educational technology.

Abstract: Visual parsing of images and videos is critical for a wide range of real-world applications. However, progress in this field is constrained by limitations of existing datasets: (1) insufficient annotation granularity, which impedes fine-grained scene understanding and high-level reasoning; (2) limited coverage of domains, particularly a lack of datasets tailored for educational scenarios; and (3) lack of explicit procedural guidance, with minimal logical rules and insufficient representation of structured task process. To address these gaps, we introduce PhysLab, the first video dataset that captures students conducting complex physics experiments. The dataset includes four representative experiments that feature diverse scientific instruments and rich human-object interaction (HOI) patterns. PhysLab comprises 620 long-form videos and provides multilevel annotations that support a variety of vision tasks, including action recognition, object detection, HOI analysis, etc. We establish strong baselines and perform extensive evaluations to highlight key challenges in the parsing of procedural educational videos. We expect PhysLab to serve as a valuable resource for advancing fine-grained visual parsing, facilitating intelligent classroom systems, and fostering closer integration between computer vision and educational technologies. The dataset and the evaluation toolkit are publicly available at https://github.com/ZMH-SDUST/PhysLab.

[13] Dark Channel-Assisted Depth-from-Defocus from a Single Image

Moushumi Medhi,Rajiv Ranjan Sahay

Main category: cs.CV

TL;DR: The paper uses the dark channel as a complementary cue to estimate scene depth from a single image with space-variant defocus blur.

Details

Motivation: Existing depth estimation techniques usually rely on multiple images with different aperture or focus settings, and estimating depth from a single blurred image is an underconstrained problem that this paper aims to address.

Method: The relationship between local defocus blur and contrast variation serves as the depth cue, combined with the dark channel prior; the pipeline is trained adversarially, end to end.

Result: Experiments on real data show that the method recovers meaningful depth from a single defocused image.

Conclusion: Single-image depth estimation with the dark channel prior is effective and offers a new direction for the problem.

Abstract: In this paper, we utilize the dark channel as a complementary cue to estimate the depth of a scene from a single space-variant defocus blurred image due to its effectiveness in implicitly capturing the local statistics of blurred images and the scene structure. Existing depth-from-defocus (DFD) techniques typically rely on multiple images with varying apertures or focus settings to recover depth information. Very few attempts have focused on DFD from a single defocused image due to the underconstrained nature of the problem. Our method capitalizes on the relationship between local defocus blur and contrast variations as key depth cues to enhance the overall performance in estimating the scene's structure. The entire pipeline is trained adversarially in a fully end-to-end fashion. Experiments conducted on real data with realistic depth-induced defocus blur demonstrate that incorporating dark channel prior into single image DFD yields meaningful depth estimation results, validating the effectiveness of our approach.
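The entry above relies on the dark channel of an image. For readers unfamiliar with the term, the sketch below computes it in the usual way: take the minimum over the color channels at each pixel, then the minimum over a local square patch. How the paper feeds this cue into its network is not described in the abstract, so this shows only the cue itself; the function name `dark_channel` and the patch size are illustrative choices.

```python
import numpy as np


def dark_channel(image, patch=3):
    """Dark channel of an H x W x 3 image with values in [0, 1].

    Assumes an odd patch size; borders are handled by edge replication.
    """
    per_pixel_min = image.min(axis=2)                 # minimum over the color channels
    pad = patch // 2
    padded = np.pad(per_pixel_min, pad, mode="edge")
    h, w = per_pixel_min.shape
    out = np.empty((h, w), dtype=per_pixel_min.dtype)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + patch, j:j + patch].min()  # minimum over the local patch
    return out


img = np.ones((5, 5, 3))
img[2, 2, 1] = 0.2          # one dark value in the green channel
print(dark_channel(img, patch=3))
```

In the toy image, the single dark value spreads to every pixel whose 3 x 3 patch contains it, which shows how the dark channel summarizes local intensity statistics.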

[14] Parametric Gaussian Human Model: Generalizable Prior for Efficient and Realistic Human Avatar Modeling

Cheng Peng,Jingxiang Sun,Yushuo Chen,Zhaoqi Su,Zhuo Su,Yebin Liu

Main category: cs.CV

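TL;DR: PGHM is a generalizable, efficient framework based on 3D Gaussian Splatting (3DGS) for fast reconstruction of high-fidelity human avatars from monocular video, addressing the long optimization times and poor generalization under sparse inputs of existing methods.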

Details

Motivation: A method that reconstructs high-quality human avatars quickly is needed for virtual/augmented reality, telepresence, and digital entertainment.

Method: PGHM introduces a UV-aligned latent identity map and a Multi-Head U-Net that predicts Gaussian attributes by decomposing static, pose-dependent, and view-dependent components.

Result: PGHM is markedly more efficient, needing only about 20 minutes per subject to produce avatars of comparable visual quality, and is robust to challenging poses and viewpoints.

Conclusion: PGHM is a practical, efficient solution for monocular avatar creation with broad application potential.

Abstract: Photorealistic and animatable human avatars are a key enabler for virtual/augmented reality, telepresence, and digital entertainment. While recent advances in 3D Gaussian Splatting (3DGS) have greatly improved rendering quality and efficiency, existing methods still face fundamental challenges, including time-consuming per-subject optimization and poor generalization under sparse monocular inputs. In this work, we present the Parametric Gaussian Human Model (PGHM), a generalizable and efficient framework that integrates human priors into 3DGS for fast and high-fidelity avatar reconstruction from monocular videos. PGHM introduces two core components: (1) a UV-aligned latent identity map that compactly encodes subject-specific geometry and appearance into a learnable feature tensor; and (2) a disentangled Multi-Head U-Net that predicts Gaussian attributes by decomposing static, pose-dependent, and view-dependent components via conditioned decoders. This design enables robust rendering quality under challenging poses and viewpoints, while allowing efficient subject adaptation without requiring multi-view capture or long optimization time. Experiments show that PGHM is significantly more efficient than optimization-from-scratch methods, requiring only approximately 20 minutes per subject to produce avatars with comparable visual quality, thereby demonstrating its practical applicability for real-world monocular avatar creation.

[15] Flood-DamageSense: Multimodal Mamba with Multitask Learning for Building Flood Damage Assessment using SAR Remote Sensing Imagery

Yu-Hsuan Ho,Ali Mostafavi

Main category: cs.CV

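TL;DR: Flood-DamageSense is a deep learning framework built for post-flood building damage assessment; fusing multimodal data substantially improves classification performance.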

Details

Motivation: Existing models classify post-flood building damage poorly, especially when damage signatures are subtle.

Method: The model combines SAR/InSAR data, very-high-resolution optical basemaps, and a long-term flood-risk layer, using a multimodal Mamba architecture with a semi-Siamese encoder.

Result: On Hurricane Harvey data, mean F1 improves by up to 19 percentage points over existing methods, with the largest gains in the "minor" and "moderate" damage categories.

Conclusion: By combining risk-aware modeling with SAR's all-weather capability, Flood-DamageSense gives faster, finer-grained damage assessment for post-disaster decisions.

Abstract: Most post-disaster damage classifiers succeed only when destructive forces leave clear spectral or structural signatures -- conditions rarely present after inundation. Consequently, existing models perform poorly at identifying flood-related building damages. The model presented in this study, Flood-DamageSense, addresses this gap as the first deep-learning framework purpose-built for building-level flood-damage assessment. The architecture fuses pre- and post-event SAR/InSAR scenes with very-high-resolution optical basemaps and an inherent flood-risk layer that encodes long-term exposure probabilities, guiding the network toward plausibly affected structures even when compositional change is minimal. A multimodal Mamba backbone with a semi-Siamese encoder and task-specific decoders jointly predicts (1) graded building-damage states, (2) floodwater extent, and (3) building footprints. Training and evaluation on Hurricane Harvey (2017) imagery from Harris County, Texas -- supported by insurance-derived property-damage extents -- show a mean F1 improvement of up to 19 percentage points over state-of-the-art baselines, with the largest gains in the frequently misclassified "minor" and "moderate" damage categories. Ablation studies identify the inherent-risk feature as the single most significant contributor to this performance boost. An end-to-end post-processing pipeline converts pixel-level outputs to actionable, building-scale damage maps within minutes of image acquisition. By combining risk-aware modeling with SAR's all-weather capability, Flood-DamageSense delivers faster, finer-grained, and more reliable flood-damage intelligence to support post-disaster decision-making and resource allocation.

[16] Interpretation of Deep Learning Model in Embryo Selection for In Vitro Fertilization (IVF) Treatment

Radha Kodali,Venkata Rao Dhulipalla,Venkata Siva Kishor Tatavarty,Madhavi Nadakuditi,Bharadwaj Thiruveedhula,Suryanarayana Gunnam,Durga Prasad Bavirisetti

Main category: cs.CV

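TL;DR: The paper proposes an explainable AI framework based on CNN-LSTM for classifying embryos, addressing the slow, inefficient conventional embryo grading process.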

Details

Motivation: Infertility significantly affects quality of life; in vitro fertilization (IVF) is a primary treatment, but conventional embryo grading is inefficient and needs improvement.

Method: A fused CNN-LSTM architecture, combining deep learning with explainable AI (XAI), classifies embryo images.

Result: The model achieves high accuracy in embryo classification while remaining interpretable.

Conclusion: The CNN-LSTM framework offers an efficient, interpretable approach to embryo classification that may improve IVF success rates.

Abstract: Infertility has a considerable impact on individuals' quality of life, affecting them socially and psychologically, with projections indicating a rise in the upcoming years. In vitro fertilization (IVF) emerges as one of the primary techniques within economically developed nations, employed to address the rising problem of low fertility. Expert embryologists conventionally grade embryos by reviewing blastocyst images to select the most optimal for transfer, yet this process is time-consuming and lacks efficiency. Blastocyst images provide a valuable resource for assessing embryo viability. In this study, we introduce an explainable artificial intelligence (XAI) framework for classifying embryos, employing a fusion of convolutional neural network (CNN) and long short-term memory (LSTM) architecture, referred to as CNN-LSTM. Utilizing deep learning, our model achieves high accuracy in embryo classification while maintaining interpretability through XAI.

[17] A Systematic Investigation on Deep Learning-Based Omnidirectional Image and Video Super-Resolution

Qianqian Zhao,Chunle Guo,Tianyi Zhang,Junpei Zhang,Peiyang Jia,Tan Su,Wenjie Jiang,Chongyi Li

Main category: cs.CV

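TL;DR: A systematic review of deep learning-based omnidirectional image and video super-resolution, together with a new dataset, 360Insta, that addresses shortcomings of existing datasets.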

Details

Motivation: Omnidirectional image and video super-resolution is important for virtual and augmented reality, but existing datasets mostly use synthetic degradation and do not capture real-world complexity.

Method: Existing deep learning methods are systematically reviewed, and the authentically degraded 360Insta dataset is introduced for qualitative and quantitative evaluation.

Result: 360Insta fills a gap in existing datasets and allows more robust evaluation of how well methods generalize.

Conclusion: The paper gives a comprehensive summary of omnidirectional super-resolution research and identifies future directions.

Abstract: Omnidirectional image and video super-resolution is a crucial research topic in low-level vision, playing an essential role in virtual reality and augmented reality applications. Its goal is to reconstruct high-resolution images or video frames from low-resolution inputs, thereby enhancing detail preservation and enabling more accurate scene analysis and interpretation. In recent years, numerous innovative and effective approaches have been proposed, predominantly based on deep learning techniques, involving diverse network architectures, loss functions, projection strategies, and training datasets. This paper presents a systematic review of recent progress in omnidirectional image and video super-resolution, focusing on deep learning-based methods. Given that existing datasets predominantly rely on synthetic degradation and fall short in capturing real-world distortions, we introduce a new dataset, 360Insta, that comprises authentically degraded omnidirectional images and videos collected under diverse conditions, including varying lighting, motion, and exposure settings. This dataset addresses a critical gap in current omnidirectional benchmarks and enables more robust evaluation of the generalization capabilities of omnidirectional super-resolution methods. We conduct comprehensive qualitative and quantitative evaluations of existing methods on both public datasets and our proposed dataset. Furthermore, we provide a systematic overview of the current status of research and discuss promising directions for future exploration. All datasets, methods, and evaluation metrics introduced in this work are publicly available and will be regularly updated. Project page: https://github.com/nqian1/Survey-on-ODISR-and-ODVSR.

[18] Active Contour Models Driven by Hyperbolic Mean Curvature Flow for Image Segmentation

Saiyu Hu,Chunlei He,Jianfeng Zhang,Dexing Kong,Shoujun Huang

Main category: cs.CV

TL;DR: The paper proposes active contour models driven by hyperbolic mean curvature flow (HMCF-ACMs) and by hyperbolic dual-mode regularized flow (HDRF-ACMs); tunable initial velocity fields and edge-aware force modulation improve segmentation accuracy and noise robustness.

Details

Motivation: Active contour models driven by traditional parabolic mean curvature flow (PMCF-ACMs) depend heavily on the initial curve configuration, which limits their adaptability. The paper aims to address this by introducing hyperbolic mean curvature flow and regularization. Method: Proposes HMCF-ACMs and HDRF-ACMs, which use tunable initial velocity fields and smooth Heaviside functions for edge-aware force modulation. Numerical equivalence is established through the level set method with a signed distance function, and a weighted fourth-order Runge-Kutta algorithm is optimized. Result: Experiments show that HMCF-ACMs and HDRF-ACMs perform well in noise suppression and numerical stability and achieve more precise segmentations. Conclusion: Through adaptive initial configurations and regularization, HMCF-ACMs and HDRF-ACMs markedly improve segmentation performance and suit diverse scenarios. Abstract: Parabolic mean curvature flow-driven active contour models (PMCF-ACMs) are widely used in image segmentation, but they depend heavily on the selection of initial curve configurations. In this paper, we first propose several hyperbolic mean curvature flow-driven ACMs (HMCF-ACMs), which introduce tunable initial velocity fields, enabling adaptive optimization for diverse segmentation scenarios. We prove that HMCF-ACMs are indeed normal flows and establish the numerical equivalence between dissipative HMCF formulations and certain wave equations using the level set method with signed distance function. Building on this framework, we further develop hyperbolic dual-mode regularized flow-driven ACMs (HDRF-ACMs), which utilize smooth Heaviside functions for edge-aware force modulation to suppress over-diffusion near weak boundaries. Then, we optimize a weighted fourth-order Runge-Kutta algorithm with nine-point stencil spatial discretization when solving the above-mentioned wave equations. Experiments show that both HMCF-ACMs and HDRF-ACMs could achieve more precise segmentations with superior noise resistance and numerical stability due to task-adaptive configurations of initial velocities and initial contours.

[19] Improving Wildlife Out-of-Distribution Detection: Africas Big Five

Mufhumudzi Muthivhi,Jiahao Huo,Fredrik Gustafsson,Terence L. van Zyl

Main category: cs.CV

TL;DR: The paper studies out-of-distribution detection for wildlife, specifically Africa's Big Five, and by comparing parametric and non-parametric approaches finds that feature-based methods generalize better.

Details

Motivation: Computer vision solutions for human-wildlife conflict must accurately identify individuals that may cause conflict, but existing models are overconfident on unknown classes, which motivates the study of out-of-distribution detection. Method: Uses a parametric Nearest Class Mean (NCM) and a non-parametric contrastive learning approach as baselines, built on pretrained and projected features, and compares them with common out-of-distribution detection methods from the literature. Result: Feature-based methods generalize better, and NCM outperforms other out-of-distribution detection methods on several metrics, by up to 22%. Conclusion: Feature-based methods show promise for wildlife out-of-distribution detection, with NCM on pretrained features performing best. Abstract: Mitigating human-wildlife conflict seeks to resolve unwanted encounters between these parties. Computer Vision provides a solution to identifying individuals that might escalate into conflict, such as members of the Big Five African animals. However, environments often contain several varied species. The current state-of-the-art animal classification models are trained under a closed-world assumption. They almost always remain overconfident in their predictions even when presented with unknown classes. This study investigates out-of-distribution (OOD) detection of wildlife, specifically the Big Five. To this end, we select a parametric Nearest Class Mean (NCM) and a non-parametric contrastive learning approach as baselines to take advantage of pretrained and projected features from popular classification encoders. Moreover, we compare our baselines to various common OOD methods in the literature. The results show feature-based methods reflect stronger generalisation capability across varying classification thresholds. Specifically, NCM with ImageNet pre-trained features achieves a 2%, 4% and 22% improvement on AUPR-IN, AUPR-OUT and AUTC over the best OOD methods, respectively. The code can be found at https://github.com/pxpana/BIG5OOD
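
As a rough illustration of how a Nearest Class Mean scorer can be turned into an OOD detector, the sketch below averages feature vectors per known class and rejects a sample whose distance to the nearest mean exceeds a threshold. It is not the authors' code: the Euclidean distance, the threshold value, the "unknown" label, and the toy 2-D features are all assumptions made for illustration.

```python
import math

def class_means(features, labels):
    """Average the feature vectors of each known class."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(x, means, threshold):
    """Return the nearest class, or "unknown" if the sample is too far away.

    The distance to the nearest class mean serves as the OOD score
    (assumed here to be Euclidean; other metrics are possible).
    """
    nearest = min(means, key=lambda y: math.dist(x, means[y]))
    distance = math.dist(x, means[nearest])
    return nearest if distance <= threshold else "unknown"

# Toy 2-D features standing in for encoder embeddings (hypothetical data).
train_x = [[0, 0], [0, 2], [10, 10], [10, 12]]
train_y = ["lion", "lion", "rhino", "rhino"]
means = class_means(train_x, train_y)
```

Sweeping the threshold over a validation set gives the in-distribution versus out-of-distribution trade-off that threshold-free metrics such as AUPR summarise.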

[20] Mitigating Object Hallucination via Robust Local Perception Search

Zixian Gao,Chao Yang,Zhanhui Zhou,Xing Xu,Chaochao Lu

Main category: cs.CV

TL;DR: The paper proposes a decoding method called LPS that reduces hallucinations in multimodal large language models and is especially effective in noisy settings.

Details

Motivation: Although multimodal large language models have succeeded in combining vision and language, their outputs still contain hallucinations that do not match the image content. Method: Introduces LPS, a simple, training-free decoding method applied at inference that uses local visual prior information to correct the decoding process. Result: Experiments show that LPS significantly reduces hallucinations on hallucination benchmarks and noisy data, with particularly strong results in noisy settings. Conclusion: LPS is a plug-and-play method that works with a variety of models and effectively suppresses hallucinations. Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have enabled them to effectively integrate vision and language, addressing a variety of downstream tasks. However, despite their significant success, these models still exhibit hallucination phenomena, where the outputs appear plausible but do not align with the content of the images. To mitigate this issue, we introduce Local Perception Search (LPS), a decoding method during inference that is both simple and training-free, yet effectively suppresses hallucinations. This method leverages local visual prior information as a value function to correct the decoding process. Additionally, we observe that the impact of the local visual prior on model performance is more pronounced in scenarios with high levels of image noise. Notably, LPS is a plug-and-play approach that is compatible with various models. Extensive experiments on widely used hallucination benchmarks and noisy data demonstrate that LPS significantly reduces the incidence of hallucinations compared to the baseline, showing exceptional performance, particularly in noisy settings.

[21] RecipeGen: A Step-Aligned Multimodal Benchmark for Real-World Recipe Generation

Ruoxuan Zhang,Jidong Gao,Bin Wen,Hongxia Xie,Chenming Zhang,Honghan-shuai,Wen-Huang Cheng

Main category: cs.CV

TL;DR: RecipeGen is a large-scale, real-world benchmark dataset for recipe-based text-to-image (T2I), image-to-video (I2V), and text-to-video (T2V) generation that fills the gap in fine-grained alignment between recipe goals, step-wise instructions, and visual content.

Details

Motivation: Existing datasets lack fine-grained alignment between recipe goals, step-wise instructions, and visual content, which limits food computing applications such as culinary education and multimodal recipe assistants. Method: Presents the RecipeGen dataset, with 26,453 recipes, 196,724 images, and 4,491 videos covering diverse ingredients, cooking procedures, styles, and dish types, and proposes domain-specific evaluation metrics for ingredient fidelity and interaction modeling. Result: RecipeGen provides a benchmark for T2I, I2V, and T2V models and offers insights for future recipe generation models. Conclusion: RecipeGen fills a gap in existing datasets and is an important resource for further progress in food computing. Abstract: Creating recipe images is a key challenge in food computing, with applications in culinary education and multimodal recipe assistants. However, existing datasets lack fine-grained alignment between recipe goals, step-wise instructions, and visual content. We present RecipeGen, the first large-scale, real-world benchmark for recipe-based Text-to-Image (T2I), Image-to-Video (I2V), and Text-to-Video (T2V) generation. RecipeGen contains 26,453 recipes, 196,724 images, and 4,491 videos, covering diverse ingredients, cooking procedures, styles, and dish types. We further propose domain-specific evaluation metrics to assess ingredient fidelity and interaction modeling, benchmark representative T2I, I2V, and T2V models, and provide insights for future recipe generation models. Project page is available now.

[22] THU-Warwick Submission for EPIC-KITCHEN Challenge 2025: Semi-Supervised Video Object Segmentation

Mingqi Gao,Haoran Duan,Tianlu Zhang,Jungong Han

Main category: cs.CV

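TL;DR: Proposes an egocentric video object segmentation method that combines visual pretraining with depth-based geometric cues and reaches a J&F score of 90.1% on the VISOR test set.

Details

Motivation: Egocentric video object segmentation must handle complex scenes and long-term tracking. Method: Combines large-scale visual pretraining from SAM2 with depth-based geometric cues in a unified framework. Result: Achieves a J&F score of 90.1% on the VISOR test set. Conclusion: The unified framework effectively improves segmentation performance. Abstract: In this report, we describe our approach to egocentric video object segmentation. Our method combines large-scale visual pretraining from SAM2 with depth-based geometric cues to handle complex scenes and long-term tracking. By integrating these signals in a unified framework, we achieve strong segmentation performance. On the VISOR test set, our method reaches a J&F score of 90.1%.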
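
For readers unfamiliar with the metric, J&F is commonly defined as the mean of region similarity J (the Jaccard index, or IoU, between predicted and ground-truth masks) and contour accuracy F (a boundary F-measure). The sketch below covers only the J half on binary masks stored as nested lists; the function name, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions, and the boundary term F is omitted because it needs contour extraction.

```python
def jaccard(pred_mask, gt_mask):
    """Region similarity J: intersection over union of two binary masks."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                intersection += 1
            if p or g:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union

# Toy 2x2 masks (hypothetical): one overlapping pixel out of three covered.
pred = [[1, 1], [0, 0]]
gt = [[1, 0], [1, 0]]
j_score = jaccard(pred, gt)
```

In a full evaluation, J and F would be averaged over all annotated objects and frames before taking their mean.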

[23] SAR2Struct: Extracting 3D Semantic Structural Representation of Aircraft Targets from Single-View SAR Image

Ziyu Yue,Ruixi You,Feng Xu

Main category: cs.CV

TL;DR: The paper proposes a new task, recovering target structure from a single-view SAR image, and with a two-step algorithmic framework based on structural descriptors infers the 3D semantic structural representation of aircraft targets directly from SAR images for the first time.

Details

Motivation: Existing methods focus mainly on 3D surface reconstruction or local geometric feature extraction and neglect the role of structural modeling in capturing semantic information. The paper aims to fill this gap with a structure recovery task. Method: Develops a two-step algorithmic framework based on structural descriptors: in training, it detects 2D keypoints from real SAR images and learns the mapping from these keypoints to 3D hierarchical structures using simulated data; in testing, the two steps are combined to infer 3D structure from real SAR images. Result: Experiments validate each step and show for the first time that the 3D semantic structural representation of aircraft targets can be derived directly from a single-view SAR image. Conclusion: The proposed method offers a new approach to advanced information retrieval from SAR images and shows the potential of structural modeling for extracting semantic information. Abstract: Translating synthetic aperture radar (SAR) images into interpretable forms for human understanding is the ultimate goal of SAR advanced information retrieval. Existing methods mainly focus on 3D surface reconstruction or local geometric feature extraction of targets, neglecting the role of structural modeling in capturing semantic information. This paper proposes a novel task: SAR target structure recovery, which aims to infer the components of a target and the structural relationships between its components, specifically symmetry and adjacency, from a single-view SAR image. Through learning the structural consistency and geometric diversity across the same type of targets as observed in different SAR images, it aims to derive the semantic representation of target directly from its 2D SAR image. To solve this challenging task, a two-step algorithmic framework based on structural descriptors is developed. Specifically, in the training phase, it first detects 2D keypoints from real SAR images, and then learns the mapping from these keypoints to 3D hierarchical structures using simulated data. During the testing phase, these two steps are integrated to infer the 3D structure from real SAR images. Experimental results validated the effectiveness of each step and demonstrated, for the first time, that 3D semantic structural representation of aircraft targets can be directly derived from a single-view SAR image.

Nidheesh Gorthi,Kartik Thakral,Rishabh Ranjan,Richa Singh,Mayank Vatsa

Main category: cs.CV

TL;DR: LitMAS is a lightweight, generalizable multi-modal anti-spoofing framework that detects spoofing attacks on speech, face, iris, and fingerprint biometric systems and outperforms existing methods.

Details

Motivation: Biometric authentication systems are vulnerable to spoofing attacks, and most existing research targets a single modality, leaving no unified cross-modal solution. Method: Proposes the LitMAS framework, which uses a Modality-Aligned Concentration Loss to strengthen inter-class separability and cross-modal consistency. Result: With only 6M parameters, LitMAS improves average EER across seven datasets by 1.36% over existing methods and is efficient enough for edge deployment. Conclusion: LitMAS shows high efficiency, strong generalization, and cross-modal consistency, offering a practical solution for multi-modal anti-spoofing. Abstract: Biometric authentication systems are increasingly being deployed in critical applications, but they remain susceptible to spoofing. Since most of the research efforts focus on modality-specific anti-spoofing techniques, building a unified, resource-efficient solution across multiple biometric modalities remains a challenge. To address this, we propose LitMAS, a Lightweight and generalizable Multi-modal Anti-Spoofing framework designed to detect spoofing attacks in speech, face, iris, and fingerprint-based biometric systems. At the core of LitMAS is a Modality-Aligned Concentration Loss, which enhances inter-class separability while preserving cross-modal consistency and enabling robust spoof detection across diverse biometric traits. With just 6M parameters, LitMAS surpasses state-of-the-art methods by 1.36% in average EER across seven datasets, demonstrating high efficiency, strong generalizability, and suitability for edge deployment. Code and trained models are available at https://github.com/IAB-IITJ/LitMAS.
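
The equal error rate (EER) reported above is the operating point where the false acceptance rate (spoofs accepted) equals the false rejection rate (bona fide samples rejected); lower is better. The sketch below is a minimal way to estimate it by sweeping thresholds over observed scores. It assumes that a higher score means "more likely bona fide" and uses made-up scores; it is not taken from the LitMAS code base.

```python
def equal_error_rate(bona_fide_scores, spoof_scores):
    """Estimate the EER by sweeping a decision threshold over all scores.

    A sample is accepted when its score is >= the threshold (assumed
    convention: higher score means more likely bona fide).
    """
    thresholds = sorted(set(bona_fide_scores) | set(spoof_scores))
    best_gap = float("inf")
    eer = None
    for t in thresholds:
        # False acceptance rate: spoofs that pass the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection rate: bona fide samples that fail it.
        frr = sum(s < t for s in bona_fide_scores) / len(bona_fide_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2
    return eer

# Hypothetical scores: one bona fide sample and one spoof are misranked.
bona_fide = [0.9, 0.8, 0.7, 0.3]
spoof = [0.1, 0.2, 0.4, 0.6]
eer = equal_error_rate(bona_fide, spoof)
```

With few samples the two error curves may not cross exactly, which is why the sketch returns the average of the two rates at the closest threshold.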

[25] LoopDB: A Loop Closure Dataset for Large Scale Simultaneous Localization and Mapping

Mohammad-Maher Nakshbandi,Ziad Sharawy,Dorian Cojocaru,Sorin Grigorescu

Main category: cs.CV

TL;DR: LoopDB is a loop closure dataset of over 1000 images from diverse environments, intended for evaluating and training loop closure algorithms in SLAM.

Details

Motivation: To provide a high-quality, diverse loop closure dataset that supports evaluating the accuracy and robustness of loop closure algorithms in SLAM and training deep learning methods. Method: Images are captured with a high-resolution camera; each scene consists of five consecutive images, with rotations and translations provided as ground truth. Result: The dataset is suitable for benchmarking loop closure algorithms and for training and fine-tuning deep learning methods. Conclusion: LoopDB is a publicly available loop closure dataset that supports algorithm development and evaluation in SLAM. Abstract: In this study, we introduce LoopDB, which is a challenging loop closure dataset comprising over 1000 images captured across diverse environments, including parks, indoor scenes, parking spaces, as well as scenes centered around individual objects. Each scene is represented by a sequence of five consecutive images. The dataset was collected using a high resolution camera, providing suitable imagery for benchmarking the accuracy of loop closure algorithms, typically used in simultaneous localization and mapping. As ground truth information, we provide computed rotations and translations between consecutive images. In addition to its benchmarking goal, the dataset can be used to train and fine-tune loop closure methods based on deep neural networks. LoopDB is publicly available at https://github.com/RovisLab/LoopDB.

[26] Continuous-Time SO(3) Forecasting with Savitzky--Golay Neural Controlled Differential Equations

Lennart Bastian,Mohammad Rashed,Nassir Navab,Tolga Birdal

Main category: cs.CV

TL;DR: Proposes a method for modeling continuous-time rotational object dynamics with Neural Controlled Differential Equations and Savitzky-Golay paths, addressing noise, sparse observations, and complex dynamics in SO(3) extrapolation.

Details

Motivation: SO(3) extrapolation is fundamental in computer vision and robotics but faces noisy and sparse observations, complex dynamics, and the need for long-term forecasting. Method: Models continuous-time rotational object dynamics with Neural Controlled Differential Equations guided by Savitzky-Golay paths while preserving the geometric structure of rotations. Result: Experiments on real-world data show better forecasting than existing methods. Conclusion: By learning a latent dynamical system, the method forecasts complex rotational trajectories effectively. Abstract: Tracking and forecasting the rotation of objects is fundamental in computer vision and robotics, yet SO(3) extrapolation remains challenging as (1) sensor observations can be noisy and sparse, (2) motion patterns can be governed by complex dynamics, and (3) application settings can demand long-term forecasting. This work proposes modeling continuous-time rotational object dynamics on SO(3) using Neural Controlled Differential Equations guided by Savitzky-Golay paths. Unlike existing methods that rely on simplified motion assumptions, our method learns a general latent dynamical system of the underlying object trajectory while respecting the geometric structure of rotations. Experimental results on real-world data demonstrate compelling forecasting capabilities compared to existing approaches.

[27] Training-Free Identity Preservation in Stylized Image Generation Using Diffusion Models

Mohammad Ali Rezaei,Helia Hajikazem,Saeed Khanehgir,Mahdi Javanmardi

Main category: cs.CV

TL;DR: Proposes a training-free, diffusion-based framework for identity-preserving stylized image synthesis that addresses poor identity retention when faces are small or far from the camera.

Details

Motivation: Existing style transfer techniques preserve identity poorly, especially when faces are small or the camera is far away. Method: Uses a "Mosaic Restored Content Image" technique and a training-free content consistency loss to strengthen identity retention and detail preservation. Result: Experiments show the method substantially outperforms the baseline model in maintaining both high stylistic fidelity and identity integrity. Conclusion: High-quality stylization is achieved for small faces or distant subjects without retraining or fine-tuning. Abstract: While diffusion models have demonstrated remarkable generative capabilities, existing style transfer techniques often struggle to maintain identity while achieving high-quality stylization. This limitation is particularly acute for images where faces are small or exhibit significant camera-to-face distances, frequently leading to inadequate identity preservation. To address this, we introduce a novel, training-free framework for identity-preserved stylized image synthesis using diffusion models. Key contributions include: (1) the "Mosaic Restored Content Image" technique, significantly enhancing identity retention, especially in complex scenes; and (2) a training-free content consistency loss that enhances the preservation of fine-grained content details by directing more attention to the original image during stylization. Our experiments reveal that the proposed approach substantially surpasses the baseline model in concurrently maintaining high stylistic fidelity and robust identity integrity, particularly under conditions of small facial regions or significant camera-to-face distances, all without necessitating model retraining or fine-tuning.

[28] Stepwise Decomposition and Dual-stream Focus: A Novel Approach for Training-free Camouflaged Object Segmentation

Chao Yin,Hao Li,Kequan Yang,Jide Li,Pinpin Zhu,Xiaoqiang Li

Main category: cs.CV

TL;DR: The paper proposes the RDVP-MSD framework, which combines a Multimodal Stepwise Decomposition Chain of Thought (MSD-CoT) with Region-constrained Dual-stream Visual Prompting (RDVP) to resolve semantic ambiguity and spatial separation in camouflaged object segmentation, achieving efficient segmentation without training.

Details

Motivation: Current task-generic promptable segmentation methods suffer from semantic ambiguity and spatial separation when applied to Camouflaged Object Segmentation (COS), which degrades segmentation quality. Method: MSD-CoT progressively decomposes image captions to remove semantic ambiguity, while RDVP addresses spatial separation through region constraints and dual-stream visual prompts. Result: Achieves state-of-the-art segmentation on multiple COS benchmarks with faster inference. Conclusion: RDVP-MSD significantly improves segmentation accuracy and efficiency without training, providing an efficient solution for COS. Abstract: While promptable segmentation (e.g., SAM) has shown promise for various segmentation tasks, it still requires manual visual prompts for each object to be segmented. In contrast, task-generic promptable segmentation aims to reduce the need for such detailed prompts by employing only a task-generic prompt to guide segmentation across all test samples. However, when applied to Camouflaged Object Segmentation (COS), current methods still face two critical issues: 1) semantic ambiguity in getting instance-specific text prompts, which arises from insufficient discriminative cues in holistic captions, leading to foreground-background confusion; 2) semantic discrepancy combined with spatial separation in getting instance-specific visual prompts, which results from global background sampling far from object boundaries with low feature correlation, causing SAM to segment irrelevant regions. To address the issues above, we propose RDVP-MSD, a novel training-free test-time adaptation framework that synergizes Region-constrained Dual-stream Visual Prompting (RDVP) via Multimodal Stepwise Decomposition Chain of Thought (MSD-CoT). MSD-CoT progressively disentangles image captions to eliminate semantic ambiguity, while RDVP injects spatial constraints into visual prompting and independently samples visual prompts for foreground and background points, effectively mitigating semantic discrepancy and spatial separation. Without requiring any training or supervision, RDVP-MSD achieves a state-of-the-art segmentation result on multiple COS benchmarks and delivers a faster inference speed than previous methods, demonstrating significantly improved accuracy and efficiency. The code will be available at https://github.com/ycyinchao/RDVP-MSD

[29] Hi-LSplat: Hierarchical 3D Language Gaussian Splatting

Chenlu Zhan,Yufei Zhang,Gaoang Wang,Hongwei Wang

Main category: cs.CV

TL;DR: Hi-LSplat proposes a hierarchical language model based on 3D Gaussian Splatting that addresses the view inconsistency and open-vocabulary querying problems of existing methods.

Details

Motivation: Existing 3DGS models rely on 2D foundation models, which causes view inconsistency and open-vocabulary difficulties and hinders hierarchical semantic understanding. Method: Builds a 3D hierarchical semantic tree and introduces instance-wise and part-wise contrastive losses to improve the view consistency and hierarchical semantic representation of 3D features. Result: Experiments show that Hi-LSplat performs strongly in 3D open-vocabulary segmentation and localization and captures complex hierarchical semantics. Conclusion: Through a unified 3D representation and hierarchical semantic modeling, Hi-LSplat significantly improves view consistency and semantic understanding for 3D language queries. Abstract: Modeling 3D language fields with Gaussian Splatting for open-ended language queries has recently garnered increasing attention. However, recent 3DGS-based models leverage view-dependent 2D foundation models to refine 3D semantics but lack a unified 3D representation, leading to view inconsistencies. Additionally, inherent open-vocabulary challenges cause inconsistencies in object and relational descriptions, impeding hierarchical semantic understanding. In this paper, we propose Hi-LSplat, a view-consistent Hierarchical Language Gaussian Splatting framework for 3D open-vocabulary querying. To achieve view-consistent 3D hierarchical semantics, we first lift 2D features to 3D features by constructing a 3D hierarchical semantic tree with layered instance clustering, which addresses the view inconsistency issue caused by 2D semantic features. Besides, we introduce instance-wise and part-wise contrastive losses to capture all-sided hierarchical semantic representations. Notably, we construct two hierarchical semantic datasets to better assess the model's ability to distinguish different semantic levels. Extensive experiments highlight our method's superiority in 3D open-vocabulary segmentation and localization. Its strong performance on hierarchical semantic datasets underscores its ability to capture complex hierarchical semantics within 3D scenes.

[30] Exploring Visual Prompting: Robustness Inheritance and Beyond

Qi Li,Liangzhi Li,Zhouqiang Jiang,Bowen Wang,Keke Tang

Main category: cs.CV

TL;DR: The paper examines how Visual Prompting (VP) behaves with a robust source model and proposes the Prompt Boundary Loosening (PBL) strategy to ease the trade-off between robustness and generalization.

Details

Motivation: To study VP when the source model is robust: whether robustness can be inherited, whether a trade-off between robustness and generalization exists, and how to address it. Method: Proposes PBL, a lightweight, plug-and-play strategy compatible with VP that aims to inherit robustness while improving generalization. Result: Experiments show that PBL effectively inherits robustness and significantly improves the generalization of VP, and that the findings hold across datasets. Conclusion: PBL resolves the trade-off VP faces with robust source models and offers a new direction for related research. Abstract: Visual Prompting (VP), an efficient method for transfer learning, has shown its potential in vision tasks. However, previous works focus exclusively on VP from standard source models; it is still unknown how it performs under the scenario of a robust source model: Can the robustness of the source model be successfully inherited? Does VP also encounter the same trade-off between robustness and generalization ability as the source model during this process? If such a trade-off exists, is there a strategy specifically tailored to VP to mitigate this limitation? In this paper, we thoroughly explore these three questions for the first time and provide affirmative answers to them. To mitigate the trade-off faced by VP, we propose a strategy called Prompt Boundary Loosening (PBL). As a lightweight, plug-and-play strategy naturally compatible with VP, PBL effectively ensures the successful inheritance of robustness when the source model is a robust model, while significantly enhancing VP's generalization ability across various downstream datasets. Extensive experiments across various datasets show that our findings are universal and demonstrate the significant benefits of the proposed strategy.

[31] Controllable Coupled Image Generation via Diffusion Models

Chenfei Yuan,Nanshan Jia,Hangqi Li,Peter W. Glynn,Zeyu Zheng

Main category: cs.CV

TL;DR: Proposes an attention control method for generating coupled images whose backgrounds are similar while the centered objects remain flexible.

Details

Motivation: Multi-image generation sometimes requires coupled backgrounds together with flexible objects. Method: Disentangles the background and entity components in the attention modules and introduces time-step-dependent weight control parameters. Result: Outperforms existing methods in background coupling, text alignment, and visual quality. Conclusion: The method effectively balances background coupling with object flexibility. Abstract: We provide an attention-level control method for the task of coupled image generation, where "coupled" means that multiple simultaneously generated images are expected to have the same or very similar backgrounds. While the backgrounds are coupled, the centered objects in the generated images are still expected to enjoy the flexibility arising from different text prompts. The proposed method disentangles the background and entity components in the model's cross-attention modules, attached with a sequence of time-varying weight control parameters depending on the time step of sampling. We optimize this sequence of weight control parameters with a combined objective that assesses how coupled the backgrounds are as well as text-to-image alignment and overall visual quality. Empirical results demonstrate that our method outperforms existing approaches across these criteria.

[32] EndoARSS: Adapting Spatially-Aware Foundation Model for Efficient Activity Recognition and Semantic Segmentation in Endoscopic Surgery

Guankun Wang,Rui Tang,Mengya Xu,Long Bai,Huxin Gao,Hongliang Ren

Main category: cs.CV

TL;DR: EndoARSS is a multi-task learning framework built on DINOv2 for activity recognition and semantic segmentation in endoscopic surgery, improving performance through low-rank adaptation and spatially-aware multi-scale attention.

Details

Motivation: Endoscopic surgical scenes are complex, and traditional deep learning models perform poorly under cross-activity interference, so multi-task learning is needed to improve performance. Method: Combines Low-Rank Adaptation with task-shared low-rank adapters and introduces Spatially-Aware Multi-Scale Attention to strengthen feature representation. Result: Performs strongly on multiple benchmarks, with significant gains in accuracy and robustness. Conclusion: EndoARSS could advance AI-driven endoscopic surgical systems and improve surgical safety and efficiency. Abstract: Endoscopic surgery is the gold standard for robotic-assisted minimally invasive surgery, offering significant advantages in early disease detection and precise interventions. However, the complexity of surgical scenes, characterized by high variability in different surgical activity scenarios and confused image features between targets and the background, presents challenges for surgical environment understanding. Traditional deep learning models often struggle with cross-activity interference, leading to suboptimal performance in each downstream task. To address this limitation, we explore multi-task learning, which utilizes the interrelated features between tasks to enhance overall task performance. In this paper, we propose EndoARSS, a novel multi-task learning framework specifically designed for endoscopic surgery activity recognition and semantic segmentation. Built upon the DINOv2 foundation model, our approach integrates Low-Rank Adaptation to facilitate efficient fine-tuning while incorporating Task Efficient Shared Low-Rank Adapters to mitigate gradient conflicts across diverse tasks. Additionally, we introduce the Spatially-Aware Multi-Scale Attention that enhances feature representation discrimination by enabling cross-spatial learning of global information. In order to evaluate the effectiveness of our framework, we present three novel datasets, MTLESD, MTLEndovis and MTLEndovis-Gen, tailored for endoscopic surgery scenarios with detailed annotations for both activity recognition and semantic segmentation tasks. Extensive experiments demonstrate that EndoARSS achieves remarkable performance across multiple benchmarks, significantly improving both accuracy and robustness in comparison to existing models. These results underscore the potential of EndoARSS to advance AI-driven endoscopic surgical systems, offering valuable insights for enhancing surgical safety and efficiency.

[33] Harnessing Vision-Language Models for Time Series Anomaly Detection

Zelin He,Sarah Alnegheimish,Matthew Reimherr

Main category: cs.CV

TL;DR: The paper proposes a two-stage time-series anomaly detection method based on vision language models (VLMs) that combines a lightweight vision encoder with VLM reasoning and significantly improves detection performance.

Details

Motivation: Existing time-series anomaly detection methods lack visual-temporal reasoning and cannot identify contextual anomalies the way human experts do. Method: Proposes a two-stage approach: ViT4TS (a lightweight vision encoder that localizes candidate anomalies) and VLM4TS (a VLM that integrates global temporal context to refine detection). Result: Without any time-series training, VLM4TS improves the F1-max score by 24.6% over the best baseline and is more efficient. Conclusion: The method outperforms existing approaches in both accuracy and efficiency, showing the potential of VLMs for time-series anomaly detection. Abstract: Time-series anomaly detection (TSAD) has played a vital role in a variety of fields, including healthcare, finance, and industrial monitoring. Prior methods, which mainly focus on training domain-specific models on numerical data, lack the visual-temporal reasoning capacity that human experts have to identify contextual anomalies. To fill this gap, we explore a solution based on vision language models (VLMs). Recent studies have shown the ability of VLMs for visual reasoning tasks, yet their direct application to time series has fallen short on both accuracy and efficiency. To harness the power of VLMs for TSAD, we propose a two-stage solution, with (1) ViT4TS, a vision-screening stage built on a relatively lightweight pretrained vision encoder, which leverages 2-D time-series representations to accurately localize candidate anomalies; (2) VLM4TS, a VLM-based stage that integrates global temporal context and VLM reasoning capacity to refine the detection upon the candidates provided by ViT4TS. We show that without any time-series training, VLM4TS outperforms time-series pretrained and from-scratch baselines in most cases, yielding a 24.6 percent improvement in F1-max score over the best baseline. Moreover, VLM4TS also consistently outperforms existing language-model-based TSAD methods and is on average 36 times more efficient in token usage.

[34] Multi-StyleGS: Stylizing Gaussian Splatting with Multiple Styles

Yangkai Lin,Jiabao Lei,Kui jia

Main category: cs.CV

TL;DR: Proposes Multi-StyleGS, a new method for matching multiple styles in 3D Gaussian Splatting (GS) scenes through automatic local style transfer or manual designation while keeping training memory-efficient.

Details

Motivation: Demand for 3D scene stylization has grown in recent years, but existing methods struggle with multi-style matching and memory efficiency. Method: Uses a bipartite matching mechanism to automatically identify correspondences between style images and local regions of rendered images, and proposes a semantic style loss function and local-global feature matching to improve multi-view consistency. Result: Experiments show the method outperforms existing ones in stylization quality, memory efficiency, and editing flexibility. Conclusion: Multi-StyleGS offers an efficient and flexible solution for 3D scene stylization. Abstract: In recent years, there has been a growing demand to stylize a given 3D scene to align with the artistic style of reference images for creative purposes. While 3D Gaussian Splatting (GS) has emerged as a promising and efficient method for realistic 3D scene modeling, there remains a challenge in adapting it to stylize 3D GS to match with multiple styles through automatic local style transfer or manual designation, while maintaining memory efficiency for stylization training. In this paper, we introduce a novel 3D GS stylization solution termed Multi-StyleGS to tackle these challenges. In particular, we employ a bipartite matching mechanism to automatically identify correspondences between the style images and the local regions of the rendered images. To facilitate local style transfer, we introduce a novel semantic style loss function that employs a segmentation network to apply distinct styles to various objects of the scene and propose a local-global feature matching to enhance the multi-view consistency. Furthermore, this technique can achieve memory efficient training, more texture details and better color match. To better assign a robust semantic label to each Gaussian, we propose several techniques to regularize the segmentation network. As demonstrated by our comprehensive experiments, our approach outperforms existing ones in producing plausible stylization results and offering flexible editing.

[35] Deep Inertial Pose: A deep learning approach for human pose estimation

Sara M. Cerqueira,Manuel Palermo,Cristina P. Santos

Main category: cs.CV

TL;DR: The paper studies neural-network-based inertial motion capture, compares different architectures and methodologies, and finds that a hybrid LSTM-Madgwick approach works best, with an error of 7.96.

Details

Motivation: Traditional inertial motion capture systems depend on complex models and expensive software; the paper aims to simplify this process with neural networks. Method: Compares different neural network architectures and methodologies for pose estimation using both low-cost and high-end MARG sensors. Result: The hybrid LSTM-Madgwick approach performs best, with an error of 7.96. Conclusion: Neural networks can be used for pose estimation, with results comparable to existing fusion filters. Abstract: Inertial-based motion capture systems have been attracting growing attention due to their wearability and unconstrained use. However, accurate human joint estimation demands several complex and expertise-demanding steps, which leads to expensive software such as the state-of-the-art MVN Awinda from Xsens Technologies. This work aims to study the use of Neural Networks to abstract the complex biomechanical models and analytical mathematics required for pose estimation. Thus, it presents a comparison of different Neural Network architectures and methodologies to understand how accurately these methods can estimate human pose, using both low cost (MPU9250) and high end (Mtw Awinda) Magnetic, Angular Rate, and Gravity (MARG) sensors. The most efficient method was the Hybrid LSTM-Madgwick detached, which achieved a Quaternion Angle distance error of 7.96, using Mtw Awinda data. Also, an ablation study was conducted to study the impact of data augmentation, output representation, window size, loss function and magnetometer data on the pose estimation error. This work indicates that Neural Networks can be trained to estimate human pose, with results comparable to the state-of-the-art fusion filters.
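
The abstract does not define its "Quaternion Angle distance" or state a unit. A common choice, assumed here, is the geodesic angle between two orientations, 2·arccos(|q1·q2|) for unit quaternions, reported in degrees. The sketch below implements that assumed definition with (w, x, y, z) ordering; the function name and example values are illustrative only.

```python
import math

def quaternion_angle_deg(q1, q2):
    """Angle in degrees of the rotation taking orientation q1 to q2.

    Quaternions are (w, x, y, z) tuples; they are normalised here so
    slightly non-unit inputs are tolerated.
    """
    norm1 = math.sqrt(sum(c * c for c in q1))
    norm2 = math.sqrt(sum(c * c for c in q2))
    # abs() handles the double cover: q and -q are the same rotation.
    dot = abs(sum(a * b for a, b in zip(q1, q2))) / (norm1 * norm2)
    dot = min(1.0, dot)  # guard against rounding just above 1
    return math.degrees(2.0 * math.acos(dot))

identity = (1.0, 0.0, 0.0, 0.0)
# 90-degree rotation about the z axis.
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
angle = quaternion_angle_deg(identity, quarter_turn_z)
```

Averaging this angle over joints and frames would give a single pose error figure of the kind quoted in the abstract, under the stated assumption about the metric.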

[36] Position Prediction Self-Supervised Learning for Multimodal Satellite Imagery Semantic Segmentation

John Waithaka,Moise Busogi

Main category: cs.CV

TL;DR: The paper adapts LOCA, a position-prediction self-supervised learning method, to semantic segmentation of multimodal satellite imagery and significantly outperforms existing reconstruction-based methods.

Details

Motivation: Semantic segmentation of satellite imagery is crucial for Earth observation but is limited by scarce labelled data. Existing self-supervised pretraining methods such as MAE focus on reconstruction rather than localisation, which is central to segmentation. Method: Extends SatMAE's channel grouping to multimodal data, introduces same-group attention masking to encourage cross-modal interaction, and uses a relative patch position prediction task to strengthen spatial reasoning. Result: On the Sen1Floods11 flood mapping dataset, the method significantly outperforms existing reconstruction-based self-supervised learning methods. Conclusion: Position prediction tasks, when properly adapted, learn more effective representations for satellite image semantic segmentation than reconstruction-based approaches. Abstract: Semantic segmentation of satellite imagery is crucial for Earth observation applications, but remains constrained by limited labelled training data. While self-supervised pretraining methods like Masked Autoencoders (MAE) have shown promise, they focus on reconstruction rather than localisation, a fundamental aspect of segmentation tasks. We propose adapting LOCA (Location-aware), a position prediction self-supervised learning method, for multimodal satellite imagery semantic segmentation. Our approach addresses the unique challenges of satellite data by extending SatMAE's channel grouping from multispectral to multimodal data, enabling effective handling of multiple modalities, and introducing same-group attention masking to encourage cross-modal interaction during pretraining. The method uses relative patch position prediction, encouraging spatial reasoning for localisation rather than reconstruction. We evaluate our approach on the Sen1Floods11 flood mapping dataset, where it significantly outperforms existing reconstruction-based self-supervised learning methods for satellite imagery. Our results demonstrate that position prediction tasks, when properly adapted for multimodal satellite imagery, learn representations more effective for satellite image semantic segmentation than reconstruction-based approaches.

[37] DONUT: A Decoder-Only Model for Trajectory Prediction

Markus Knoche,Daan de Geus,Bastian Leibe

Main category: cs.CV

TL;DR: DONUT is a decoder-only trajectory prediction model that forecasts future trajectories autoregressively, outperforms an encoder-decoder baseline, and achieves state-of-the-art results on the Argoverse 2 benchmark.

Details

Motivation: Autonomous driving requires predicting the trajectories of other agents; existing encoder-decoder models can work from out-of-date information, which DONUT addresses with a decoder-only design. Method: Uses a single autoregressive model to predict future trajectories directly and introduces an "overprediction" strategy to improve long-horizon forecasting. Result: Outperforms the baseline on the Argoverse 2 single-agent motion forecasting benchmark and reaches the state of the art. Conclusion: Decoder-only models have an advantage in trajectory prediction; multi-agent scenarios could be improved further in future work. Abstract: Predicting the motion of other agents in a scene is highly relevant for autonomous driving, as it allows a self-driving car to anticipate. Inspired by the success of decoder-only models for language modeling, we propose DONUT, a Decoder-Only Network for Unrolling Trajectories. Different from existing encoder-decoder forecasting models, we encode historical trajectories and predict future trajectories with a single autoregressive model. This allows the model to make iterative predictions in a consistent manner, and ensures that the model is always provided with up-to-date information, enhancing the performance. Furthermore, inspired by multi-token prediction for language modeling, we introduce an 'overprediction' strategy that gives the network the auxiliary task of predicting trajectories at longer temporal horizons. This allows the model to better anticipate the future, and further improves the performance. With experiments, we demonstrate that our decoder-only approach outperforms the encoder-decoder baseline, and achieves new state-of-the-art results on the Argoverse 2 single-agent motion forecasting benchmark.

[38] Vision-EKIPL: External Knowledge-Infused Policy Learning for Visual Reasoning

Chaoyang Wang,Zeyu Zhang,Haiyun Jiang

Main category: cs.CV

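TL;DR: The paper proposes Vision-EKIPL, a new reinforcement learning framework that optimizes the policy model with high-quality actions generated by external auxiliary models, improving the visual reasoning ability of multimodal large language models.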

Details

Motivation: Existing reinforcement learning methods sample action groups only from the policy model itself, which caps the model's reasoning ability and makes training inefficient. Method: Proposes the Vision-EKIPL framework, which introduces high-quality actions generated by external auxiliary models during RL training to guide optimization of the policy model. Result: On the Reason-RFT-CoT Benchmark, Vision-EKIPL improves on the previous best method by up to 5%. Conclusion: Vision-EKIPL overcomes the limitations of traditional RL methods, significantly improves visual reasoning performance, and offers a new paradigm for research in this field. Abstract: Visual reasoning is crucial for understanding complex multimodal data and advancing Artificial General Intelligence. Existing methods enhance the reasoning capability of Multimodal Large Language Models (MLLMs) through Reinforcement Learning (RL) fine-tuning (e.g., GRPO). However, current RL approaches sample action groups solely from the policy model itself, which limits the upper boundary of the model's reasoning capability and leads to inefficient training. To address these limitations, this paper proposes a novel RL framework called Vision-EKIPL. The core of this framework lies in introducing high-quality actions generated by external auxiliary models during the RL training process to guide the optimization of the policy model. The policy learning with knowledge infusion from external models significantly expands the model's exploration space, effectively improves the reasoning boundary, and substantially accelerates training convergence speed and efficiency. Experimental results demonstrate that our proposed Vision-EKIPL achieved up to a 5% performance improvement on the Reason-RFT-CoT Benchmark compared to the state-of-the-art (SOTA). It reveals that Vision-EKIPL can overcome the limitations of traditional RL methods, significantly enhance the visual reasoning performance of MLLMs, and provide a new effective paradigm for research in this field.

[39] Face recognition on point cloud with cgan-top for denoising

Junyu Liu,Jianfeng Ren,Sunhong Liang,Xudong Jiang

Main category: cs.CV

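TL;DR: Proposes an end-to-end 3D face recognition method that combines denoising and recognition modules and significantly improves recognition accuracy on noisy point clouds.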

Details

Motivation: Raw point clouds often contain noise because of imperfect sensors, which degrades recognition. Method: Designs the cGAN-TOP denoising model and adapts the LDGCNN recognition model so that the two work together. Result: Validated on the Bosphorus dataset, with a maximum recognition accuracy gain of 14.81%. Conclusion: The method effectively addresses 3D face recognition on noisy point clouds. Abstract: Face recognition using 3D point clouds is gaining growing interest, while raw point clouds often contain a significant amount of noise due to imperfect sensors. In this paper, an end-to-end 3D face recognition on a noisy point cloud is proposed, which synergistically integrates the denoising and recognition modules. Specifically, a Conditional Generative Adversarial Network on Three Orthogonal Planes (cGAN-TOP) is designed to effectively remove the noise in the point cloud, and recover the underlying features for subsequent recognition. A Linked Dynamic Graph Convolutional Neural Network (LDGCNN) is then adapted to recognize faces from the processed point cloud, which hierarchically links both the local point features and neighboring features of multiple scales. The proposed method is validated on the Bosphorus dataset. It significantly improves the recognition accuracy under all noise settings, with a maximum gain of 14.81%.

[40] Hybrid Vision Transformer-Mamba Framework for Autism Diagnosis via Eye-Tracking Analysis

Wafaa Kasri,Yassine Himeur,Abigail Copiaco,Wathiq Mansoor,Ammar Albanna,Valsamma Eapen

Main category: cs.CV

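TL;DR: Proposes a hybrid deep learning framework combining Vision Transformers and Vision Mamba to detect autism spectrum disorder (ASD) from eye-tracking data, significantly improving diagnostic accuracy and interpretability.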

Details

Motivation: Early ASD diagnosis is critical for intervention, but traditional methods rely on handcrafted features and lack transparency, so more efficient and interpretable solutions are needed. Method: A hybrid Vision Transformer and Vision Mamba framework that uses attention-based fusion to integrate visual, speech, and facial cues and capture spatio-temporal dynamics. Result: On the Saliency4ASD dataset, the model achieves 0.96 accuracy, 0.95 F1-score, 0.97 sensitivity, and 0.94 specificity. Conclusion: The model provides a scalable, interpretable ASD screening tool for resource-constrained or remote clinical settings. Abstract: Accurate Autism Spectrum Disorder (ASD) diagnosis is vital for early intervention. This study presents a hybrid deep learning framework combining Vision Transformers (ViT) and Vision Mamba to detect ASD using eye-tracking data. The model uses attention-based fusion to integrate visual, speech, and facial cues, capturing both spatial and temporal dynamics. Unlike traditional handcrafted methods, it applies state-of-the-art deep learning and explainable AI techniques to enhance diagnostic accuracy and transparency. Tested on the Saliency4ASD dataset, the proposed ViT-Mamba model outperformed existing methods, achieving 0.96 accuracy, 0.95 F1-score, 0.97 sensitivity, and 0.94 specificity. These findings show the model's promise for scalable, interpretable ASD screening, especially in resource-constrained or remote clinical settings where access to expert diagnosis is limited.
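
The four metrics reported for entry [40] are all derived from a binary confusion matrix. The sketch below shows how they relate to one another. The function name `binary_metrics` and the counts in the example are hypothetical and are not taken from the paper.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, F1, sensitivity, and specificity from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "f1": f1,
        "sensitivity": sensitivity,
        "specificity": specificity,
    }


# Hypothetical counts chosen for illustration only; not results from the paper.
metrics = binary_metrics(tp=97, fp=6, tn=94, fn=3)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting sensitivity and specificity alongside accuracy matters for screening tasks, because accuracy alone can hide a high miss rate on the positive class.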

[41] NSD-Imagery: A benchmark dataset for extending fMRI vision decoding methods to mental imagery

Reese Kneeland,Paul S. Scotti,Ghislain St-Yves,Jesse Breedlove,Kendrick Kay,Thomas Naselaris

Main category: cs.CV

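TL;DR: NSD-Imagery is a newly released benchmark dataset for evaluating the reconstruction of mental images from fMRI data, complementing the existing NSD dataset. The study finds that existing models' performance on mental image reconstruction is decoupled from their performance on vision reconstruction, and that models with simple architectures perform better.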

Details

Motivation: Existing models have been evaluated only on seen image reconstruction, whereas mental image reconstruction is critical for medical and brain-computer interface applications. NSD-Imagery fills this gap. Method: Uses NSD-Imagery to benchmark several NSD-trained visual decoding models (such as MindEye1 and Brain Diffuser) and compares their performance on mental image reconstruction. Result: Performance on mental image reconstruction is decoupled from performance on vision reconstruction; simple linear decoding architectures and multimodal feature decoding generalize better. Conclusion: Mental imagery datasets are critical for practical applications, and NSD-Imagery is an important resource for improving visual decoding methods. Abstract: We release NSD-Imagery, a benchmark dataset of human fMRI activity paired with mental images, to complement the existing Natural Scenes Dataset (NSD), a large-scale dataset of fMRI activity paired with seen images that enabled unprecedented improvements in fMRI-to-image reconstruction efforts. Recent models trained on NSD have been evaluated only on seen image reconstruction. Using NSD-Imagery, it is possible to assess how well these models perform on mental image reconstruction. This is a challenging generalization requirement because mental images are encoded in human brain activity with relatively lower signal-to-noise and spatial resolution; however, generalization from seen to mental imagery is critical for real-world applications in medical domains and brain-computer interfaces, where the desired information is always internally generated. We provide benchmarks for a suite of recent NSD-trained open-source visual decoding models (MindEye1, MindEye2, Brain Diffuser, iCNN, Takagi et al.) on NSD-Imagery, and show that the performance of decoding methods on mental images is largely decoupled from performance on vision reconstruction. We further demonstrate that architectural choices significantly impact cross-decoding performance: models employing simple linear decoding architectures and multimodal feature decoding generalize better to mental imagery, while complex architectures tend to overfit visual training data. Our findings indicate that mental imagery datasets are critical for the development of practical applications, and establish NSD-Imagery as a useful resource for better aligning visual decoding methods with this goal.

[42] KNN-Defense: Defense against 3D Adversarial Point Clouds using Nearest-Neighbor Search

Nima Jamali,Matina Mahdizadeh Sani,Hanieh Naderi,Shohreh Kasaei

Main category: cs.CV

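TL;DR: The paper proposes KNN-Defense, a defense strategy that restores perturbed 3D point cloud data by exploiting the semantic similarity of neighboring samples in the training set, significantly improving robustness under adversarial attacks.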

Details

Motivation: Although deep neural networks perform well on 3D point cloud analysis, their vulnerability to adversarial attacks (such as point dropping, shifting, and adding) threatens the reliability of 3D vision systems. Existing defense mechanisms have limited effect against such attacks. Method: KNN-Defense builds on the manifold assumption and nearest-neighbor search in feature space. It restores perturbed inputs rather than reconstructing surface geometry or enforcing uniform point distributions, which keeps it lightweight and computationally efficient. Result: Experiments on the ModelNet40 dataset show that KNN-Defense significantly improves robustness across attack types; under point-dropping attacks in particular, accuracy on PointNet, PointNet++, DGCNN, and PCT improves by 20.1%, 3.6%, 3.44%, and 7.74%, respectively. Conclusion: KNN-Defense offers a scalable and effective solution that strengthens the adversarial robustness of 3D point cloud classifiers and suits real-time and practical applications. Abstract: Deep neural networks (DNNs) have demonstrated remarkable performance in analyzing 3D point cloud data. However, their vulnerability to adversarial attacks (such as point dropping, shifting, and adding) poses a critical challenge to the reliability of 3D vision systems. These attacks can compromise the semantic and structural integrity of point clouds, rendering many existing defense mechanisms ineffective. To address this issue, a defense strategy named KNN-Defense is proposed, grounded in the manifold assumption and nearest-neighbor search in feature space. Instead of reconstructing surface geometry or enforcing uniform point distributions, the method restores perturbed inputs by leveraging the semantic similarity of neighboring samples from the training set. KNN-Defense is lightweight and computationally efficient, enabling fast inference and making it suitable for real-time and practical applications. Empirical results on the ModelNet40 dataset demonstrate that KNN-Defense significantly improves robustness across various attack types. In particular, under point-dropping attacks, where many existing methods underperform due to the targeted removal of critical points, the proposed method achieves accuracy gains of 20.1%, 3.6%, 3.44%, and 7.74% on PointNet, PointNet++, DGCNN, and PCT, respectively. These findings suggest that KNN-Defense offers a scalable and effective solution for enhancing the adversarial resilience of 3D point cloud classifiers. (An open-source implementation of the method, including code and data, is available at https://github.com/nimajam41/3d-knn-defense).
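
The abstract for entry [42] names nearest-neighbor search in feature space as the core operation but does not spell out how the retrieved neighbors are used. The sketch below illustrates only the generic search step, with a majority vote over neighbor labels as an assumed way of using the result. The function `knn_predict`, the toy 2D features, and the vote are hypothetical and are not the authors' implementation.

```python
import math
from collections import Counter


def knn_predict(query, train_features, train_labels, k=3):
    """Return the majority label among the k training features closest to query."""
    # Euclidean distance from the query feature to every training feature.
    scored = sorted(
        ((math.dist(query, feat), label)
         for feat, label in zip(train_features, train_labels)),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Toy 2D "features" standing in for network embeddings (illustrative only).
train_features = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
                  (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
train_labels = ["chair", "chair", "chair", "table", "table", "table"]

print(knn_predict((0.3, 0.3), train_features, train_labels))
```

A brute-force scan like this is linear in the training set size; a real-time system would more plausibly use an approximate or tree-based index, which is also an assumption here.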

[43] Gaussian Mapping for Evolving Scenes

Vladimir Yugay,Thies Kersten,Luca Carlone,Theo Gevers,Martin R. Oswald,Lukas Schmid

Main category: cs.CV

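TL;DR: The paper proposes a dynamic scene adaptation mechanism and a keyframe management method to address the limitations of 3D Gaussian Splatting in long-term dynamic scenes, and validates them on synthetic and real-world datasets.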

Details

Motivation: Current 3D Gaussian Splatting methods perform well in static scenes but fall short under long-term dynamics, where the scene changes out of view. This paper aims to address that problem. Method: Introduces a dynamic scene adaptation mechanism that continuously updates the 3D representation to reflect the latest changes, and a keyframe management mechanism that discards outdated observations to maintain geometric and semantic consistency. Result: On synthetic and real-world datasets, the proposed GaME method is more accurate than the state of the art. Conclusion: GaME effectively addresses the challenges of long-term dynamic scenes and makes 3D Gaussian Splatting more practical. Abstract: Mapping systems with novel view synthesis (NVS) capabilities are widely used in computer vision, with augmented reality, robotics, and autonomous driving applications. Most notably, 3D Gaussian Splatting-based systems show high NVS performance; however, many current approaches are limited to static scenes. While recent works have started addressing short-term dynamics (motion within the view of the camera), long-term dynamics (the scene evolving through changes out of view) remain less explored. To overcome this limitation, we introduce a dynamic scene adaptation mechanism that continuously updates the 3D representation to reflect the latest changes. In addition, since maintaining geometric and semantic consistency remains challenging due to stale observations disrupting the reconstruction process, we propose a novel keyframe management mechanism that discards outdated observations while preserving as much information as possible. We evaluate Gaussian Mapping for Evolving Scenes (GaME) on both synthetic and real-world datasets and find it to be more accurate than the state of the art.

[44] Sleep Stage Classification using Multimodal Embedding Fusion from EOG and PSM

Olivier Papillon,Rafik Goubran,James Green,Julien Larivière-Chartier,Caitlin Higginson,Frank Knoefel,Rébecca Robillard

Main category: cs.CV

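TL;DR: The study proposes a deep learning model based on ImageBind multimodal embeddings that combines EOG and PSM data for sleep stage classification, significantly improving accuracy over existing single-modality methods.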

Details

Motivation: Traditional PSG relies on EEG, whose complex equipment is ill-suited to home monitoring, so the study explores EOG and PSM as alternatives. Method: Uses the ImageBind model to integrate dual-channel EOG signals with PSM data, the first reported fusion of these two data types for sleep classification. Result: The fine-tuned ImageBind model significantly outperforms existing methods based on single-channel EOG or PSM alone, and also performs well without fine-tuning. Conclusion: Pre-trained multimodal embedding models can be used effectively for sleep classification, approaching the accuracy of systems that depend on complex EEG data. Abstract: Accurate sleep stage classification is essential for diagnosing sleep disorders, particularly in aging populations. While traditional polysomnography (PSG) relies on electroencephalography (EEG) as the gold standard, its complexity and need for specialized equipment make home-based sleep monitoring challenging. To address this limitation, we investigate the use of electrooculography (EOG) and pressure-sensitive mats (PSM) as less obtrusive alternatives for five-stage sleep-wake classification. This study introduces a novel approach that leverages ImageBind, a multimodal embedding deep learning model, to integrate PSM data with dual-channel EOG signals for sleep stage classification. Our method is the first reported approach that fuses PSM and EOG data for sleep stage classification with ImageBind. Our results demonstrate that fine-tuning ImageBind significantly improves classification accuracy, outperforming existing models based on single-channel EOG (DeepSleepNet), exclusively PSM data (ViViT), and other multimodal deep learning approaches (MBT). Notably, the model also achieved strong performance without fine-tuning, highlighting its adaptability to specific tasks with limited labeled data, making it particularly advantageous for medical applications. We evaluated our method using 85 nights of patient recordings from a sleep clinic. Our findings suggest that pre-trained multimodal embedding models, even those originally developed for non-medical domains, can be effectively adapted for sleep staging, with accuracies approaching systems that require complex EEG data.

[45] Reading in the Dark with Foveated Event Vision

Carl Brander,Giovanni Cioffi,Nico Messikommer,Davide Scaramuzza

Main category: cs.CV

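TL;DR: Proposes an event-camera-based OCR method for smart glasses that uses the user's eye gaze to foveate the event stream and reduce bandwidth, and that outperforms traditional RGB cameras in low-light and high-dynamic scenes.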

Details

Motivation: Addresses the poor performance of RGB cameras in low-light and high-dynamic scenes and their high bandwidth consumption. Method: Foveates the event stream using the user's eye gaze, then combines deep binary reconstruction with a multimodal LLM for OCR. Result: Successfully reads text in low-light environments while using up to 2400 times less bandwidth than an RGB camera. Conclusion: Event cameras combined with gaze-based foveation are an efficient, low-power OCR solution for smart glasses. Abstract: Current smart glasses equipped with RGB cameras struggle to perceive the environment in low-light and high-speed motion scenarios due to motion blur and the limited dynamic range of frame cameras. Additionally, capturing dense images with a frame camera requires large bandwidth and power consumption, consequently draining the battery faster. These challenges are especially relevant for developing algorithms that can read text from images. In this work, we propose a novel event-based Optical Character Recognition (OCR) approach for smart glasses. By using the eye gaze of the user, we foveate the event stream to significantly reduce bandwidth by around 98% while exploiting the benefits of event cameras in high-dynamic and fast scenes. Our proposed method performs deep binary reconstruction trained on synthetic data and leverages multimodal LLMs for OCR, outperforming traditional OCR solutions. Our results demonstrate the ability to read text in low light environments where RGB cameras struggle while using up to 2400 times less bandwidth than a wearable RGB camera.
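
The foveation step in entry [45] can be pictured as discarding events that fall outside a region around the gaze point. The sketch below assumes a circular window and an event layout of `(x, y, timestamp, polarity)`; both are illustrative assumptions, and the paper's actual foveation scheme may differ.

```python
def foveate_events(events, gaze_xy, radius):
    """Keep only events within `radius` pixels of the gaze point.

    Each event is assumed to be a tuple (x, y, timestamp, polarity).
    """
    gx, gy = gaze_xy
    r2 = radius * radius
    return [e for e in events if (e[0] - gx) ** 2 + (e[1] - gy) ** 2 <= r2]


# Toy event stream for illustration only.
events = [
    (10, 10, 0.0, 1),
    (12, 11, 0.1, -1),
    (200, 150, 0.2, 1),
    (11, 9, 0.3, 1),
]
kept = foveate_events(events, gaze_xy=(10, 10), radius=5)
saved = 1 - len(kept) / len(events)
print(f"kept {len(kept)} of {len(events)} events ({saved:.0%} fewer)")
```

Because events are sparse and asynchronous, dropping those outside the window reduces the data rate roughly in proportion to the activity that falls outside it.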

[46] How Important are Videos for Training Video LLMs?

George Lydakis,Alexander Hermans,Ali Athar,Daan de Geus,Bastian Leibe

Main category: cs.CV

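TL;DR: The study finds that Video LLMs trained only on images are more capable of temporal reasoning than expected, and that the improvement from video training is small.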

Details

Motivation: Investigates how Video LLMs perform on temporal reasoning tasks, in particular the difference between image training and video training. Method: Trains LLMs with the LongVU algorithm and evaluates temporal reasoning on TVBench; introduces a finetuning scheme based on sequences of annotated images. Result: Image-trained LLMs score significantly above chance on TVBench, and after finetuning their performance is close to, and occasionally higher than, that of video-trained LLMs. Conclusion: Current video training schemes do not fully exploit the rich temporal features of video; further research is needed into the temporal reasoning mechanisms of image-trained LLMs and the bottlenecks of video training. Abstract: Research into Video Large Language Models (LLMs) has progressed rapidly, with numerous models and benchmarks emerging in just a few years. Typically, these models are initialized with a pretrained text-only LLM and finetuned on both image- and video-caption datasets. In this paper, we present findings indicating that Video LLMs are more capable of temporal reasoning after image-only training than one would assume, and that improvements from video-specific training are surprisingly small. Specifically, we show that image-trained versions of two LLMs trained with the recent LongVU algorithm perform significantly above chance level on TVBench, a temporal reasoning benchmark. Additionally, we introduce a simple finetuning scheme involving sequences of annotated images and questions targeting temporal capabilities. This baseline results in temporal reasoning performance close to, and occasionally higher than, what is achieved by video-trained LLMs. This suggests suboptimal utilization of rich temporal features found in real video by current models. Our analysis motivates further research into the mechanisms that allow image-trained LLMs to perform temporal reasoning, as well as into the bottlenecks that render current video training schemes inefficient.

[47] Polar Hierarchical Mamba: Towards Streaming LiDAR Object Detection with Point Clouds as Egocentric Sequences

Mellon M. Zhang,Glen Chou,Saibal Mukhopadhyay

Main category: cs.CV

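TL;DR: PHiM is a new SSM architecture designed for polar-coordinate streaming LiDAR. It replaces conventional convolutions and positional encodings with local bidirectional Mamba blocks and a global forward Mamba, significantly improving streaming detection performance.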

Details

Motivation: Autonomous driving requires low-latency, high-throughput real-time perception. Conventional methods incur high latency when processing full LiDAR scans, and streaming methods reduce it but suffer degraded performance or need complex correction. Method: PHiM uses local bidirectional Mamba blocks for intra-sector spatial encoding and a global forward Mamba for inter-sector temporal modeling, avoiding the geometric distortion issues of convolutions and positional encodings. Result: On the Waymo Open Dataset, PHiM outperforms the previous best streaming detector by 10% and matches full-scan baselines at twice the throughput. Conclusion: PHiM provides an efficient, high-performance solution for polar-coordinate streaming LiDAR that clearly outperforms existing methods. Abstract: Accurate and efficient object detection is essential for autonomous vehicles, where real-time perception requires low latency and high throughput. LiDAR sensors provide robust depth information, but conventional methods process full 360° scans in a single pass, introducing significant delay. Streaming approaches address this by sequentially processing partial scans in the native polar coordinate system, yet they rely on translation-invariant convolutions that are misaligned with polar geometry, resulting in degraded performance or requiring complex distortion mitigation. Recent Mamba-based state space models (SSMs) have shown promise for LiDAR perception, but only in the full-scan setting, relying on geometric serialization and positional embeddings that are memory-intensive and ill-suited to streaming. We propose Polar Hierarchical Mamba (PHiM), a novel SSM architecture designed for polar-coordinate streaming LiDAR. PHiM uses local bidirectional Mamba blocks for intra-sector spatial encoding and a global forward Mamba for inter-sector temporal modeling, replacing convolutions and positional encodings with distortion-aware, dimensionally-decomposed operations. PHiM sets a new state-of-the-art among streaming detectors on the Waymo Open Dataset, outperforming the previous best by 10% and matching full-scan baselines at twice the throughput. Code will be available at https://github.com/meilongzhang/Polar-Hierarchical-Mamba.
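
Streaming LiDAR detection, as described in entry [47], processes a scan as a sequence of angular sectors in polar coordinates. The sketch below shows one plausible way to partition Cartesian points into equal-width azimuth sectors. The function `split_into_sectors`, the equal-width partition, and the sector count are assumptions for illustration, not details taken from the paper.

```python
import math


def split_into_sectors(points, num_sectors):
    """Group (x, y, z) points into equal-width azimuth sectors.

    Returns a list of sectors, each holding (range, azimuth, z) tuples.
    """
    sectors = [[] for _ in range(num_sectors)]
    width = 2 * math.pi / num_sectors
    for x, y, z in points:
        azimuth = math.atan2(y, x) % (2 * math.pi)  # map to [0, 2*pi)
        # Clamp guards against floating-point results landing on 2*pi.
        idx = min(int(azimuth / width), num_sectors - 1)
        sectors[idx].append((math.hypot(x, y), azimuth, z))
    return sectors


# One toy point per quadrant (illustrative only).
points = [(1.0, 1.0, 0.0), (-1.0, 1.0, 0.2), (-1.0, -1.0, 0.1), (1.0, -1.0, 0.3)]
for i, sector in enumerate(split_into_sectors(points, num_sectors=4)):
    print(i, sector)
```

Each sector can then be handed to a detector as soon as it arrives, which is what lets a streaming pipeline start work before the full rotation completes.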

[48] LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer

Ying Shen,Zhiyang Xu,Jiuhai Chen,Shizhe Diao,Jiaxin Zhang,Yuguang Yao,Joy Rimchala,Ismini Lourentzou,Lifu Huang

Main category: cs.CV

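TL;DR: LaTtE-Flow is a new, efficient architecture that unifies image understanding and generation; its layerwise timestep-expert flow design and residual attention mechanism significantly improve generation speed and performance.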

Details

Motivation: Existing unified models need extensive pretraining, underperform task-specific models, and generate images slowly, which limits practical use. Method: Builds on pretrained vision-language models and introduces a Layerwise Timestep Experts flow-based architecture and a residual attention mechanism to improve generation efficiency. Result: Performs strongly on understanding tasks, with generation quality competitive with existing models and roughly 6x faster inference. Conclusion: LaTtE-Flow efficiently unifies image understanding and generation and has potential for practical deployment. Abstract: Recent advances in multimodal foundation models unifying image understanding and generation have opened exciting avenues for tackling a wide range of vision-language tasks within a single framework. Despite progress, existing unified models typically require extensive pretraining and struggle to achieve the same level of performance compared to models dedicated to each task. Additionally, many of these models suffer from slow image generation speeds, limiting their practical deployment in real-time or resource-constrained settings. In this work, we propose Layerwise Timestep-Expert Flow-based Transformer (LaTtE-Flow), a novel and efficient architecture that unifies image understanding and generation within a single multimodal model. LaTtE-Flow builds upon powerful pretrained Vision-Language Models (VLMs) to inherit strong multimodal understanding capabilities, and extends them with a novel Layerwise Timestep Experts flow-based architecture for efficient image generation. LaTtE-Flow distributes the flow-matching process across specialized groups of Transformer layers, each responsible for a distinct subset of timesteps. This design significantly improves sampling efficiency by activating only a small subset of layers at each sampling timestep. To further enhance performance, we propose a Timestep-Conditioned Residual Attention mechanism for efficient information reuse across layers. Experiments demonstrate that LaTtE-Flow achieves strong performance on multimodal understanding tasks, while achieving competitive image generation quality with around 6x faster inference speed compared to recent unified multimodal models.
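
The idea in entry [48] of assigning each group of Transformer layers to a distinct subset of timesteps can be sketched as a lookup from flow time to a layer range. The version below assumes a uniform, contiguous split of time in [0, 1] and a layer count divisible by the group count; the paper's actual assignment may differ, and `layer_group_for_timestep` is a hypothetical name.

```python
def layer_group_for_timestep(t, num_layers, num_groups):
    """Return the range of layer indices active at flow time t in [0, 1].

    Assumes num_layers is divisible by num_groups and that groups cover
    equal, contiguous slices of the time interval.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if num_layers % num_groups != 0:
        raise ValueError("num_layers must be divisible by num_groups")
    layers_per_group = num_layers // num_groups
    group = min(int(t * num_groups), num_groups - 1)  # t == 1.0 maps to last group
    start = group * layers_per_group
    return range(start, start + layers_per_group)


# With 24 layers in 4 groups, each sampling step touches only 6 layers.
for t in (0.1, 0.4, 0.9):
    active = layer_group_for_timestep(t, num_layers=24, num_groups=4)
    print(t, list(active))
```

Under this assumed layout, each sampling step runs a quarter of the layers, which is the kind of saving the abstract attributes to activating only a small subset of layers per timestep.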

[49] Task-driven real-world super-resolution of document scans

Maciej Zyrek,Tomasz Tarasiewicz,Jakub Sadel,Aleksandra Krzywon,Michal Kawulok

Main category: cs.CV

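TL;DR: This paper proposes a super-resolution network optimized for optical character recognition; a multi-task learning framework with several auxiliary loss functions improves text detection in real-world scenarios.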

Details

Motivation: Existing deep learning methods perform well on simulated datasets but poorly in real-world settings such as document scans, because of complex degradations and semantic variability. Method: A multi-task learning framework with auxiliary losses for text detection, text recognition, keypoint localization, and hue consistency, balanced by a dynamic weight averaging mechanism. Result: On simulated and real-world document datasets, the method improves the IoU of text detection while preserving image fidelity. Conclusion: Multi-objective optimization helps close the gap between simulated training and real-world deployment, making super-resolution models more practical. Abstract: Single-image super-resolution refers to the reconstruction of a high-resolution image from a single low-resolution observation. Although recent deep learning-based methods have demonstrated notable success on simulated datasets (with low-resolution images obtained by degrading and downsampling high-resolution ones), they frequently fail to generalize to real-world settings, such as document scans, which are affected by complex degradations and semantic variability. In this study, we introduce a task-driven, multi-task learning framework for training a super-resolution network specifically optimized for optical character recognition tasks. We propose to incorporate auxiliary loss functions derived from high-level vision tasks, including text detection using the connectionist text proposal network, text recognition via a convolutional recurrent neural network, keypoint localization using Key.Net, and hue consistency. To balance these diverse objectives, we employ a dynamic weight averaging mechanism, which adaptively adjusts the relative importance of each loss term based on its convergence behavior. We validate our approach on the SRResNet architecture, which is a well-established technique for single-image super-resolution. Experimental evaluations on both simulated and real-world scanned document datasets demonstrate that the proposed approach improves text detection, measured with intersection over union, while preserving overall image fidelity. These findings underscore the value of multi-objective optimization in super-resolution models for bridging the gap between simulated training regimes and practical deployment in real-world scenarios.
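
Dynamic weight averaging, mentioned in entry [49], is commonly formulated as a softmax over each task's recent loss ratio, so that tasks whose loss is falling more slowly receive more weight. The sketch below follows that common formulation; the temperature of 2.0, the function name `dwa_weights`, and the loss values are assumptions, and the exact variant used in the paper is not stated in the abstract.

```python
import math


def dwa_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Compute per-task weights from the two most recent epochs' losses.

    ratio_k = L_k(t-1) / L_k(t-2); weights are K * softmax(ratio / T),
    so they always sum to the number of tasks K.
    """
    ratios = [l1 / l2 for l1, l2 in zip(prev_losses, prev_prev_losses)]
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    num_tasks = len(ratios)
    return [num_tasks * e / total for e in exps]


# Hypothetical losses for four auxiliary terms (detection, recognition,
# keypoints, hue); values are illustrative only.
weights = dwa_weights(
    prev_losses=[0.90, 0.50, 0.70, 0.95],
    prev_prev_losses=[1.00, 1.00, 1.00, 1.00],
)
print([round(w, 3) for w in weights])
```

A higher temperature flattens the weights toward uniform, while a lower one concentrates weight on the slowest-converging task.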

[50] AR-RAG: Autoregressive Retrieval Augmentation for Image Generation

Jingyuan Qi,Zhiyang Xu,Qifan Wang,Lifu Huang

Main category: cs.CV

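TL;DR: AR-RAG is a new method that autoregressively retrieves and incorporates relevant visual references during image generation, addressing the over-copying and stylistic bias caused by static retrieval in existing methods.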

Details

Motivation: Existing image generation methods usually rely on static retrieval, so they cannot adapt to needs that change during generation and are prone to over-copying and stylistic bias. AR-RAG aims to improve generation quality through dynamic retrieval. Method: Proposes two parallel frameworks: DAiD, a training-free strategy that directly merges distributions, and FAiD, a parameter-efficient method that smooths features at multiple scales. Result: On several benchmarks (including Midjourney-30K, GenEval, and DPG-Bench), AR-RAG significantly outperforms existing image generation models. Conclusion: Through dynamic retrieval and fusion, AR-RAG significantly improves the quality and flexibility of image generation. Abstract: We introduce Autoregressive Retrieval Augmentation (AR-RAG), a novel paradigm that enhances image generation by autoregressively incorporating k-nearest neighbor retrievals at the patch level. Unlike prior methods that perform a single, static retrieval before generation and condition the entire generation on fixed reference images, AR-RAG performs context-aware retrievals at each generation step, using prior-generated patches as queries to retrieve and incorporate the most relevant patch-level visual references, enabling the model to respond to evolving generation needs while avoiding limitations (e.g., over-copying, stylistic bias, etc.) prevalent in existing methods. To realize AR-RAG, we propose two parallel frameworks: (1) Distribution-Augmentation in Decoding (DAiD), a training-free plug-and-use decoding strategy that directly merges the distribution of model-predicted patches with the distribution of retrieved patches, and (2) Feature-Augmentation in Decoding (FAiD), a parameter-efficient fine-tuning method that progressively smooths the features of retrieved patches via multi-scale convolution operations and leverages them to augment the image generation process. We validate the effectiveness of AR-RAG on widely adopted benchmarks, including Midjourney-30K, GenEval and DPG-Bench, demonstrating significant performance gains over state-of-the-art image generation models.
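
The DAiD strategy in entry [50] is described as merging the model's predicted patch distribution with a distribution built from retrieved patches. One simple way to picture this is a convex mixture, with the retrieval distribution formed by a softmax over negative neighbor distances. Both choices, along with the mixing weight `lam` and every name in the sketch, are assumptions for illustration rather than the paper's formulation.

```python
import math


def retrieval_distribution(neighbors, vocab_size, temperature=1.0):
    """Turn retrieved (token_id, distance) pairs into a distribution over tokens."""
    probs = [0.0] * vocab_size
    weights = [math.exp(-dist / temperature) for _, dist in neighbors]
    total = sum(weights)
    for (token_id, _), w in zip(neighbors, weights):
        probs[token_id] += w / total  # closer neighbors contribute more mass
    return probs


def merge_distributions(model_probs, retrieved_probs, lam=0.3):
    """Convex mixture of the model's and the retrieval distribution."""
    return [(1 - lam) * p + lam * q for p, q in zip(model_probs, retrieved_probs)]


# Toy 4-token patch vocabulary (illustrative only).
model_probs = [0.7, 0.2, 0.1, 0.0]
neighbors = [(1, 0.1), (1, 0.2), (2, 0.9)]  # hypothetical retrieved patches
retrieved = retrieval_distribution(neighbors, vocab_size=4)
merged = merge_distributions(model_probs, retrieved)
print([round(p, 3) for p in merged])
```

Because the mixture needs no gradient updates, a scheme like this is training-free in the same sense as the abstract's description of DAiD; repeating it at every generation step is what would make the retrieval autoregressive.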

[51] Dual-view Spatio-Temporal Feature Fusion with CNN-Transformer Hybrid Network for Chinese Isolated Sign Language Recognition

Siyuan Jing,Guangxue Wang,Haoyang Zhai,Qin Tao,Jun Yang,Bing Wang,Peng Jin

Main category: cs.CV

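TL;DR: This paper presents NationalCSL-DP, a dual-view Chinese sign language dataset, and designs a CNN-Transformer network as a baseline; experiments validate the dataset and the fusion strategy.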

Details

Motivation: Existing sign language datasets have incomplete vocabulary coverage and are mostly single-view, which makes hand occlusions hard to handle, so a more comprehensive dual-view dataset is needed to make isolated sign language recognition (ISLR) more practical. Method: Builds NationalCSL-DP, a dual-view dataset covering the Chinese national sign language vocabulary, and proposes a CNN-Transformer network with a simple but effective fusion strategy. Result: Experiments show that the proposed fusion strategy significantly improves ISLR performance, but that sequence-to-sequence models struggle to learn complementary features from dual-view videos. Conclusion: The dual-view dataset and fusion strategy open a new direction for ISLR, but how to make better use of dual-view information requires further study. Abstract: Due to the emergence of many sign language datasets, isolated sign language recognition (ISLR) has made significant progress in recent years. In addition, the development of various advanced deep neural networks is another reason for this breakthrough. However, challenges remain in applying the technique in the real world. First, existing sign language datasets do not cover the whole sign vocabulary. Second, most of the sign language datasets provide only single view RGB videos, which makes it difficult to handle hand occlusions when performing ISLR. To fill this gap, this paper presents a dual-view sign language dataset for ISLR named NationalCSL-DP, which fully covers the Chinese national sign language vocabulary. The dataset consists of 134140 sign videos recorded by ten signers with respect to two vertical views, namely, the front side and the left side. Furthermore, a CNN-Transformer network is proposed as a strong baseline, together with an extremely simple but effective fusion strategy for prediction. Extensive experiments were conducted to prove the effectiveness of the dataset as well as the baseline. The results show that the proposed fusion strategy can significantly increase the performance of the ISLR, but it is not easy for the sequence-to-sequence model, regardless of whether the early-fusion or late-fusion strategy is applied, to learn the complementary features from the sign videos of two vertical views.

Pengfei Zhao,Rongbo Luan,Wei Zhang,Peng Wu,Sifeng He

Main category: cs.CV

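TL;DR: The paper proposes the MAPLE framework, which uses the fine-grained alignment priors of MLLMs to guide cross-modal representation learning; with reinforcement learning and a new loss function it substantially improves fine-grained cross-modal retrieval.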

Details

Motivation: Although CLIP performs well in multimodal retrieval, a modality gap remains. MLLMs show strong modality alignment, but existing methods rely on coarse alignment mechanisms, which limits their potential. Method: Proposes the MAPLE framework, which uses an MLLM to construct preference data automatically and designs the RPA loss, adapting DPO to embedding learning. Result: Experiments show that MAPLE achieves substantial gains in fine-grained cross-modal retrieval. Conclusion: Through fine-grained alignment, MAPLE handles subtle semantic differences effectively and offers a new direction for cross-modal representation learning. Abstract: Despite Contrastive Language-Image Pretraining (CLIP)'s remarkable capability to retrieve content across modalities, a substantial modality gap persists in its feature space. Intriguingly, we discover that off-the-shelf MLLMs (Multimodal Large Language Models) demonstrate powerful inherent modality alignment properties. While recent MLLM-based retrievers with unified architectures partially mitigate this gap, their reliance on coarse modality alignment mechanisms fundamentally limits their potential. In this work, we introduce MAPLE (Modality-Aligned Preference Learning for Embeddings), a novel framework that leverages the fine-grained alignment priors inherent in MLLM to guide cross-modal representation learning. MAPLE formulates the learning process as reinforcement learning with two key components: (1) automatic preference data construction using off-the-shelf MLLM, and (2) a new Relative Preference Alignment (RPA) loss, which adapts Direct Preference Optimization (DPO) to the embedding learning setting. Experimental results show that our preference-guided alignment achieves substantial gains in fine-grained cross-modal retrieval, underscoring its effectiveness in handling nuanced semantic distinctions.

[53] Hybrid Mesh-Gaussian Representation for Efficient Indoor Scene Reconstruction

Binxiao Huang,Zhihao Li,Shiyong Liu,Xiao Tang,Jiajun Tang,Jiaqi Lin,Yuxin Cheng,Zhenyu Chen,Xiaofei Wu,Ngai Wong

Main category: cs.CV

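TL;DR: The paper proposes a hybrid representation that combines 3D Gaussian splatting (3DGS) with textured meshes to improve rendering efficiency in regions with complex textures.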

Details

Motivation: Regions with complex textures need many Gaussians to capture color variation accurately, which slows rendering. Method: Prunes and refines the extracted mesh, then jointly optimizes 3DGS and the textured mesh, using a warm-up strategy and transmittance-aware supervision to balance the two. Result: Experiments show that the hybrid representation maintains rendering quality while significantly raising frame rate and reducing the number of Gaussians. Conclusion: The method effectively addresses rendering efficiency in texture-rich regions and provides an efficient solution for real-time rendering of indoor scenes. Abstract: 3D Gaussian splatting (3DGS) has demonstrated exceptional performance in image-based 3D reconstruction and real-time rendering. However, regions with complex textures require numerous Gaussians to capture significant color variations accurately, leading to inefficiencies in rendering speed. To address this challenge, we introduce a hybrid representation for indoor scenes that combines 3DGS with textured meshes. Our approach uses textured meshes to handle texture-rich flat areas, while retaining Gaussians to model intricate geometries. The proposed method begins by pruning and refining the extracted mesh to eliminate geometrically complex regions. We then employ a joint optimization for 3DGS and mesh, incorporating a warm-up strategy and transmittance-aware supervision to balance their contributions seamlessly. Extensive experiments demonstrate that the hybrid representation maintains comparable rendering quality and achieves higher frames per second (FPS) with fewer Gaussian primitives.

[54] Boosting Adversarial Transferability via Commonality-Oriented Gradient Optimization

Yanting Gao,Yepeng Liu,Junming Liu,Qi Zhang,Hongyun Zhang,Duoqian Miao,Cairong Zhao

Main category: cs.CV

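TL;DR: The paper proposes COGO, a commonality-oriented gradient optimization strategy that enhances shared features and suppresses individual ones, significantly improving the transferability of adversarial examples in black-box settings.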

Details

Motivation: Understanding the characteristics and mechanisms of Vision Transformers (ViTs) requires effective adversarial examples, but existing methods transfer poorly because of overfitting. Method: COGO has two components: Commonality Enhancement (CE), which perturbs mid-to-low frequency regions, and Individuality Suppression (IS), which uses adaptive thresholds to evaluate the correlation of gradients with model individuality and weights them accordingly. Result: Experiments show that COGO significantly improves the success rate of adversarial attacks and outperforms existing methods. Conclusion: By enhancing shared features and suppressing individual ones, COGO offers an effective way to improve the transferability of adversarial examples. Abstract: Exploring effective and transferable adversarial examples is vital for understanding the characteristics and mechanisms of Vision Transformers (ViTs). However, adversarial examples generated from surrogate models often exhibit weak transferability in black-box settings due to overfitting. Existing methods improve transferability by diversifying perturbation inputs or applying uniform gradient regularization within surrogate models, yet they have not fully leveraged the shared and unique features of surrogate models trained on the same task, leading to suboptimal transfer performance. Therefore, enhancing perturbations of common information shared by surrogate models and suppressing those tied to individual characteristics offers an effective way to improve transferability. Accordingly, we propose a commonality-oriented gradient optimization strategy (COGO) consisting of two components: Commonality Enhancement (CE) and Individuality Suppression (IS). CE perturbs the mid-to-low frequency regions, leveraging the fact that ViTs trained on the same dataset tend to rely more on mid-to-low frequency information for classification. IS employs adaptive thresholds to evaluate the correlation between backpropagated gradients and model individuality, assigning weights to gradients accordingly. Extensive experiments demonstrate that COGO significantly improves the transfer success rates of adversarial attacks, outperforming current state-of-the-art methods.

[55] DM$^3$Net: Dual-Camera Super-Resolution via Domain Modulation and Multi-scale Matching

Cong Guan,Jiacheng Ying,Yuya Ieiri,Osamu Yoshie

Main category: cs.CV

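TL;DR: DM$^3$Net is a dual-camera super-resolution network based on domain modulation and multi-scale matching that uses a reference image to raise the resolution of wide-angle images.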

Details

Motivation: In smartphone photography, using the telephoto image as a reference to raise the resolution of the wide-angle image is of practical value. Method: Learns two compressed global representations through domain modulation, designs a multi-scale matching module for feature matching across scales, and introduces Key Pruning to reduce memory use and inference time. Result: On three real-world datasets, DM$^3$Net outperforms existing methods. Conclusion: DM$^3$Net performs well in both accuracy and efficiency and offers an effective solution for dual-camera super-resolution. Abstract: Dual-camera super-resolution is highly practical for smartphone photography, where the wide-angle image is super-resolved using the telephoto image as a reference. In this paper, we propose DM$^3$Net, a novel dual-camera super-resolution network based on Domain Modulation and Multi-scale Matching. To bridge the domain gap between the high-resolution domain and the degraded domain, we learn two compressed global representations from image pairs corresponding to the two domains. To enable reliable transfer of high-frequency structural details from the reference image, we design a multi-scale matching module that conducts patch-level feature matching and retrieval across multiple receptive fields to improve matching accuracy and robustness. Moreover, we also introduce Key Pruning to achieve a significant reduction in memory usage and inference time with little loss in model performance. Experimental results on three real-world datasets demonstrate that our DM$^3$Net outperforms the state-of-the-art approaches.

[56] Technical Report for ICRA 2025 GOOSE 3D Semantic Segmentation Challenge: Adaptive Point Cloud Understanding for Heterogeneous Robotic Systems

Xiaoya Zhang

Main category: cs.CV

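TL;DR: This report describes the winning solution of the ICRA 2025 GOOSE 3D Semantic Segmentation Challenge, which combines Point Prompt Tuning with Point Transformer v3 to process heterogeneous LiDAR data adaptively and yields substantial performance gains.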

Details

Motivation: Address semantic segmentation of heterogeneous 3D point clouds collected from multiple platforms and improve model adaptability in complex environments. Method: Point Prompt Tuning (PPT) is combined with a Point Transformer v3 (PTv3) backbone, and the data are handled through platform-specific conditioning and cross-dataset class alignment strategies. Result: Without any additional data, performance improves substantially, with mIoU gains of up to 22.59%. Conclusion: The approach demonstrates the effectiveness of adaptive point cloud understanding for field robotics applications. Abstract: This technical report presents the implementation details of the winning solution for the ICRA 2025 GOOSE 3D Semantic Segmentation Challenge. This challenge focuses on semantic segmentation of 3D point clouds from diverse unstructured outdoor environments collected from multiple robotic platforms. This problem was addressed by implementing Point Prompt Tuning (PPT) integrated with a Point Transformer v3 (PTv3) backbone, enabling adaptive processing of heterogeneous LiDAR data through platform-specific conditioning and cross-dataset class alignment strategies. The model is trained without requiring additional external data. As a result, this approach achieved substantial performance improvements with mIoU increases of up to 22.59% on challenging platforms compared to the baseline PTv3 model, demonstrating the effectiveness of adaptive point cloud understanding for field robotics applications.
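
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how mean intersection-over-union (mIoU) is commonly computed from per-point class labels. The function name `mean_iou` and the toy labels are illustrative assumptions; the challenge's official evaluation script may differ in details such as how absent classes are handled.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that occur in either the prediction or the target."""
    ious = []
    for c in range(num_classes):
        # Points where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Points where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: classes absent from both are skipped
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six points, three classes (hypothetical labels).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, 3))  # per-class IoUs are 1/3, 2/3 and 1/2
```

Averaging per-class IoU rather than per-point accuracy keeps rare classes from being swamped by frequent ones, which matters for unstructured outdoor scenes.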

[57] BePo: Leveraging Birds Eye View and Sparse Points for Efficient and Accurate 3D Occupancy Prediction

Yunxiao Shi,Hong Cai,Jisoo Jeong,Yinhao Zhu,Shizhong Han,Amin Ansari,Fatih Porikli

Main category: cs.CV

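TL;DR: Proposes BePo, a new 3D occupancy prediction method that combines BEV and sparse-point representations, addresses the limitations of existing methods on small objects and flat surfaces, and shows superior results in experiments.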

Details

Motivation: Existing 3D occupancy prediction methods fall short in either compute cost or scene representation: BEV loses significant information on small objects, and sparse points are inefficient on flat surfaces. Method: A dual-branch design combines a query-based sparse points branch with a BEV branch, shares information through cross-attention, and fuses the outputs to predict 3D occupancy. Result: Superior results on the Occ3D-nuScenes and Occ3D-Waymo benchmarks, with inference speed comparable to the latest efficient methods. Conclusion: BePo effectively addresses the limitations of BEV and sparse points and improves both the accuracy and efficiency of 3D occupancy prediction. Abstract: 3D occupancy provides fine-grained 3D geometry and semantics for scene understanding which is critical for autonomous driving. Most existing methods, however, carry high compute costs, requiring dense 3D feature volume and cross-attention to effectively aggregate information. More recent works have adopted Bird's Eye View (BEV) or sparse points as scene representation with much reduced cost, but still suffer from their respective shortcomings. More concretely, BEV struggles with small objects that often experience significant information loss after being projected to the ground plane. On the other hand, points can flexibly model small objects in 3D, but are inefficient at capturing flat surfaces or large objects. To address these challenges, in this paper, we present a novel 3D occupancy prediction approach, BePo, which combines BEV and sparse points based representations. We propose a dual-branch design: a query-based sparse points branch and a BEV branch. The 3D information learned in the sparse points branch is shared with the BEV stream via cross-attention, which enriches the weakened signals of difficult objects on the BEV plane. The outputs of both branches are finally fused to generate predicted 3D occupancy. We conduct extensive experiments on the Occ3D-nuScenes and Occ3D-Waymo benchmarks that demonstrate the superiority of our proposed BePo. Moreover, BePo also delivers competitive inference speed when compared to the latest efficient approaches.

[58] UNO: Unified Self-Supervised Monocular Odometry for Platform-Agnostic Deployment

Wentao Zhao,Yihe Niu,Yanbo Wang,Tianchen Deng,Shenghai Yuan,Zhenli Wang,Rui Guo,Jingchuan Wang

Main category: cs.CV

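TL;DR: UNO is a unified monocular visual odometry framework that achieves robust and adaptable pose estimation across diverse environments without deployment-specific tuning or predefined motion priors.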

Details

Motivation: Traditional methods require deployment-specific tuning or rely on predefined motion priors, which limits their generality. UNO aims to remove this limitation and apply across many scenarios and devices. Method: A Mixture-of-Experts strategy handles different motion patterns with several specialized decoders; a Gumbel-Softmax module builds an inter-frame correlation graph and selects the optimal decoder; pre-trained depth priors combined with lightweight bundle adjustment enforce geometric consistency. Result: State-of-the-art performance on the KITTI, EuRoC-MAV, and TUM-RGBD benchmarks. Conclusion: UNO achieves robust and adaptable pose estimation across diverse environments, demonstrating broad applicability and strong performance. Abstract: This work presents UNO, a unified monocular visual odometry framework that enables robust and adaptable pose estimation across diverse environments, platforms, and motion patterns. Unlike traditional methods that rely on deployment-specific tuning or predefined motion priors, our approach generalizes effectively across a wide range of real-world scenarios, including autonomous vehicles, aerial drones, mobile robots, and handheld devices. To this end, we introduce a Mixture-of-Experts strategy for local state estimation, with several specialized decoders that each handle a distinct class of ego-motion patterns. Moreover, we introduce a fully differentiable Gumbel-Softmax module that constructs a robust inter-frame correlation graph, selects the optimal expert decoder, and prunes erroneous estimates. These cues are then fed into a unified back-end that combines pre-trained, scale-independent depth priors with a lightweight bundle adjustment to enforce geometric consistency. We extensively evaluate our method on three major benchmark datasets: KITTI (outdoor/autonomous driving), EuRoC-MAV (indoor/aerial drones), and TUM-RGBD (indoor/handheld), demonstrating state-of-the-art performance.
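
The expert-selection idea can be illustrated with a minimal Gumbel-Softmax sampler. This is a sketch under stated assumptions: the function name `gumbel_softmax`, the three example logits, and the temperature `tau` are hypothetical, and UNO's actual module (which also builds the inter-frame correlation graph and prunes estimates) is not reproduced here.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return a relaxed one-hot sample over len(logits) choices."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three expert decoders.
weights = gumbel_softmax([2.0, 0.5, -1.0], tau=0.5, rng=random.Random(0))
chosen_expert = weights.index(max(weights))
print(weights, chosen_expert)
```

Lower values of `tau` push the weights toward a hard one-hot choice, while the noise keeps the selection stochastic; in a deep learning framework the same relaxation lets gradients flow through the selection.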

[59] TABLET: Table Structure Recognition using Encoder-only Transformers

Qiyu Hou,Jun Wang

Main category: cs.CV

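TL;DR: Proposes a Split-Merge-based table structure recognition method that uses sequence labeling and Transformer encoders and is suited to large, densely populated tables.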

Details

Motivation: Address the challenges of table structure recognition, particularly accuracy and efficiency on large, densely populated tables. Method: Within a Split-Merge framework, row and column splitting are treated as sequence labeling tasks using dual Transformer encoders; merging is treated as a grid cell classification task using an additional Transformer encoder. Result: Strong results on FinTabNet and PubTabNet, with reduced resolution loss and computational complexity, combining high accuracy with fast processing. Conclusion: The method offers a robust, scalable, and efficient solution for large-scale table recognition and is well suited to industrial deployment. Abstract: To address the challenges of table structure recognition, we propose a novel Split-Merge-based top-down model optimized for large, densely populated tables. Our approach formulates row and column splitting as sequence labeling tasks, utilizing dual Transformer encoders to capture feature interactions. The merging process is framed as a grid cell classification task, leveraging an additional Transformer encoder to ensure accurate and coherent merging. By eliminating unstable bounding box predictions, our method reduces resolution loss and computational complexity, achieving high accuracy while maintaining fast processing speed. Extensive experiments on FinTabNet and PubTabNet demonstrate the superiority of our model over existing approaches, particularly in real-world applications. Our method offers a robust, scalable, and efficient solution for large-scale table recognition, making it well-suited for industrial deployment.

[60] MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks

Sanjoy Chowdhury,Mohamed Elmoghany,Yohan Abeysinghe,Junjie Fei,Sayan Nag,Salman Khan,Mohamed Elhoseiny,Dinesh Manocha

Main category: cs.CV

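TL;DR: The paper proposes a new task, AV-HaystacksQA, which aims to generate informative answers through multi-video retrieval and temporal grounding, and introduces the AVHaystacks benchmark and the MAGNET framework, which substantially improves performance.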

Details

Motivation: Existing video question answering benchmarks are limited to single-video queries and do not reflect the large-scale audio-visual retrieval and reasoning needed in practice. Method: Introduces the AVHaystacks benchmark and MAGNET, a multi-agent framework for multi-video retrieval and temporal grounding. Result: MAGNET achieves relative improvements of up to 89% on BLEU@4 and 65% on GPT evaluation scores. Conclusion: The AV-HaystacksQA task and the MAGNET framework offer an effective solution for multi-video retrieval and reasoning, and two new evaluation metrics, STEM and MTGS, are introduced. Abstract: Large multimodal models (LMMs) have shown remarkable progress in audio-visual understanding, yet they struggle with real-world scenarios that require complex reasoning across extensive video collections. Existing benchmarks for video question answering remain limited in scope, typically involving one clip per query, which falls short of representing the challenges of large-scale, audio-visual retrieval and reasoning encountered in practical applications. To bridge this gap, we introduce a novel task named AV-HaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer. To this end, we present AVHaystacks, an audio-visual benchmark comprising 3100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding tasks. Additionally, we propose a model-agnostic, multi-agent framework MAGNET to address this challenge, achieving up to 89% and 65% relative improvements over baseline methods on BLEU@4 and GPT evaluation scores on the QA task on our proposed AVHaystacks. To enable robust evaluation of multi-video retrieval and temporal grounding for optimal response generation, we introduce two new metrics: STEM, which captures alignment errors between a ground truth and a predicted step sequence, and MTGS, which facilitates balanced and interpretable evaluation of segment-level grounding performance. Project: https://schowdhury671.github.io/magnet_project/

[61] Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs

Yikun Ji,Hong Yan,Jun Lan,Huijia Zhu,Weiqiang Wang,Qi Fan,Liqing Zhang,Jianfu Zhang

Main category: cs.CV

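TL;DR: This paper proposes a method based on multimodal large language models (MLLMs) for detecting AI-generated images while providing interpretable visual and textual evidence. With an annotated dataset and a multi-stage optimization strategy, the model performs strongly at both detection and localization of visual flaws.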

Details

Motivation: Most existing detectors are black boxes that lack interpretability; MLLMs have analytical ability but fall short in aligning visual and textual content. Method: Build an annotated dataset and fine-tune MLLMs with a multi-stage optimization strategy that balances detection, localization, and explanation objectives. Result: The model significantly outperforms baselines at detecting AI-generated images and localizing visual flaws. Conclusion: The method achieves high detection performance together with interpretability, offering a new approach to identifying AI-generated images. Abstract: The rapid advancement of image generation technologies intensifies the demand for interpretable and robust detection methods. Although existing approaches often attain high accuracy, they typically operate as black boxes without providing human-understandable justifications. Multi-modal Large Language Models (MLLMs), while not originally intended for forgery detection, exhibit strong analytical and reasoning capabilities. When properly fine-tuned, they can effectively identify AI-generated images and offer meaningful explanations. However, existing MLLMs still struggle with hallucination and often fail to align their visual interpretations with actual image content and human reasoning. To bridge this gap, we construct a dataset of AI-generated images annotated with bounding boxes and descriptive captions that highlight synthesis artifacts, establishing a foundation for human-aligned visual-textual grounded reasoning. We then fine-tune MLLMs through a multi-stage optimization strategy that progressively balances the objectives of accurate detection, visual localization, and coherent textual explanation. The resulting model achieves superior performance in both detecting AI-generated images and localizing visual flaws, significantly outperforming baseline methods.

[62] From Swath to Full-Disc: Advancing Precipitation Retrieval with Multimodal Knowledge Expansion

Zheng Wang,Kai Ying,Bin Xu,Chunjiao Wang,Cong Bai

Main category: cs.CV

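TL;DR: PRE-Net improves the accuracy of infrared precipitation retrieval through multimodal knowledge expansion and significantly outperforms existing methods.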

Details

Motivation: Infrared precipitation retrieval has low accuracy, while microwave and radar methods are limited in range, so infrared retrieval capability needs to be extended. Method: A two-stage pipeline: the Swath-Distilling stage transfers knowledge via CoMWE, and the Full-Disc Adaptation stage refines predictions via Self-MaskTune. Result: PRE-Net performs strongly on the PRE benchmark, outperforming PERSIANN-CCS, PDIR, and IMERG. Conclusion: PRE-Net offers an effective solution for infrared precipitation retrieval; the code will be released. Abstract: Accurate near-real-time precipitation retrieval has been enhanced by satellite-based technologies. However, infrared-based algorithms have low accuracy due to weak relations with surface precipitation, whereas passive microwave and radar-based methods are more accurate but limited in range. This challenge motivates the Precipitation Retrieval Expansion (PRE) task, which aims to enable accurate, infrared-based full-disc precipitation retrievals beyond the scanning swath. We introduce Multimodal Knowledge Expansion, a two-stage pipeline with the proposed PRE-Net model. In the Swath-Distilling stage, PRE-Net transfers knowledge from a multimodal data integration model to an infrared-based model within the scanning swath via Coordinated Masking and Wavelet Enhancement (CoMWE). In the Full-Disc Adaptation stage, Self-MaskTune refines predictions across the full disc by balancing multimodal and full-disc infrared knowledge. Experiments on the introduced PRE benchmark demonstrate that PRE-Net significantly advances precipitation retrieval performance, outperforming leading products like PERSIANN-CCS, PDIR, and IMERG. The code will be available at https://github.com/Zjut-MultimediaPlus/PRE-Net.

[63] A Layered Self-Supervised Knowledge Distillation Framework for Efficient Multimodal Learning on the Edge

Tarique Dahri,Zulfiqar Ali Memon,Zhenyu Yu,Mohd. Yamani Idna Idris,Sheheryar Khan,Sadiq Ahmad,Maged Shoman,Saddam Aziz,Rizwan Qureshi

Main category: cs.CV

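TL;DR: The LSSKD framework performs self-supervised knowledge distillation through auxiliary classifiers on intermediate feature maps, without relying on a pre-trained teacher network, improving model performance and reducing computational cost.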

Details

Motivation: Traditional methods rely on pre-trained teacher networks, which is computationally expensive and inflexible. LSSKD aims to improve performance via self-supervised knowledge distillation while reducing dependence on a teacher network. Method: Auxiliary classifiers are appended to intermediate feature maps to generate diverse self-supervised knowledge and enable one-to-one knowledge transfer across network stages. Result: An average improvement of 4.54% on CIFAR-100 and 0.32% on ImageNet, with state-of-the-art results in few-shot learning scenarios. Conclusion: LSSKD improves generalization and performance at no extra inference cost and suits deployment on low-compute devices and in multimodal sensing environments. Abstract: We introduce the Layered Self-Supervised Knowledge Distillation (LSSKD) framework for training compact deep learning models. Unlike traditional methods that rely on pre-trained teacher networks, our approach appends auxiliary classifiers to intermediate feature maps, generating diverse self-supervised knowledge and enabling one-to-one transfer across different network stages. Our method achieves an average improvement of 4.54% over the state-of-the-art PS-KD method and a 1.14% gain over SSKD on CIFAR-100, with a 0.32% improvement on ImageNet compared to HASSKD. Experiments on Tiny ImageNet and CIFAR-100 under few-shot learning scenarios also achieve state-of-the-art results. These findings demonstrate the effectiveness of our approach in enhancing model generalization and performance without the need for large over-parameterized teacher networks. Importantly, at the inference stage, all auxiliary classifiers can be removed, yielding no extra computational cost. This makes our model suitable for deploying small language models on affordable low-computing devices. Owing to its lightweight design and adaptability, our framework is particularly suitable for multimodal sensing and cyber-physical environments that require efficient and responsive inference. LSSKD facilitates the development of intelligent agents capable of learning from limited sensory data under weak supervision.

[64] D2R: dual regularization loss with collaborative adversarial generation for model robustness

Zhenyu Liu,Huizhi Liang,Rajiv Ranjan,Zhanxing Zhu,Vaclav Snasel,Varun Ojha

Main category: cs.CV

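TL;DR: Proposes a dual regularization loss (D2R Loss) and a collaborative adversarial generation (CAG) strategy that strengthen model robustness by optimizing the adversarial and clean distributions; experiments confirm their effectiveness.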

Details

Motivation: Existing defenses are limited by insufficient guidance of the target model through loss functions and by non-collaborative adversarial generation, so model robustness needs improvement. Method: D2R Loss (two optimization steps) and the CAG strategy (gradient-based collaborative generation of adversarial samples) are used together, combining different loss functions to optimize the target model's distribution. Result: On CIFAR-10, CIFAR-100, and Tiny ImageNet, D2R Loss with CAG significantly improves model robustness. Conclusion: D2R Loss and the CAG strategy address the shortcomings of existing methods and significantly strengthen robustness against adversarial attacks. Abstract: The robustness of Deep Neural Network models is crucial for defending models against adversarial attacks. Recent defense methods have employed collaborative learning frameworks to enhance model robustness. Two key limitations of existing methods are (i) insufficient guidance of the target model via loss functions and (ii) non-collaborative adversarial generation. We, therefore, propose a dual regularization loss (D2R Loss) method and a collaborative adversarial generation (CAG) strategy for adversarial training. D2R loss includes two optimization steps. The adversarial distribution and clean distribution optimizations enhance the target model's robustness by leveraging the strengths of different loss functions obtained via a suitable function space exploration to focus more precisely on the target model's distribution. CAG generates adversarial samples using a gradient-based collaboration between guidance and target models. We conducted extensive experiments on three benchmark datasets (CIFAR-10, CIFAR-100, and Tiny ImageNet) and two popular target models, WideResNet34-10 and PreActResNet18. Our results show that D2R loss with CAG produces highly robust models.

[65] FLAIR-HUB: Large-scale Multimodal Dataset for Land Cover and Crop Mapping

Anatol Garioud,Sébastien Giordano,Nicolas David,Nicolas Gonthier

Main category: cs.CV

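TL;DR: FLAIR-HUB is a multi-sensor land cover dataset released by France's IGN that combines six data modalities for land cover and crop classification and supports supervised and multimodal pretraining.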

Details

Motivation: Address the processing and annotation challenges posed by the volume and heterogeneity of Earth Observation data, and advance global land cover and crop type monitoring. Method: Six aligned modalities (such as aerial imagery and Sentinel-1/2 time series) are combined, and multimodal fusion with deep learning models (CNNs, transformers) is used for land cover and crop classification. Result: The best land cover performance is 78.2% accuracy and 65.8% mIoU, achieved using nearly all modalities. Conclusion: FLAIR-HUB highlights the complexity of multimodal fusion while providing a high-quality dataset and tools for land cover and crop monitoring. Abstract: The growing availability of high-quality Earth Observation (EO) data enables accurate global land cover and crop type monitoring. However, the volume and heterogeneity of these datasets pose major processing and annotation challenges. To address this, the French National Institute of Geographical and Forest Information (IGN) is actively exploring innovative strategies to exploit diverse EO data, which require large annotated datasets. IGN introduces FLAIR-HUB, the largest multi-sensor land cover dataset with very-high-resolution (20 cm) annotations, covering 2528 km$^2$ of France. It combines six aligned modalities: aerial imagery, Sentinel-1/2 time series, SPOT imagery, topographic data, and historical aerial images. Extensive benchmarks evaluate multimodal fusion and deep learning models (CNNs, transformers) for land cover or crop mapping and also explore multi-task learning. Results underscore the complexity of multimodal fusion and fine-grained classification, with best land cover performance (78.2% accuracy, 65.8% mIoU) achieved using nearly all modalities. FLAIR-HUB supports supervised and multimodal pretraining, with data and code available at https://ignf.github.io/FLAIR/flairhub.

[66] UCOD-DPL: Unsupervised Camouflaged Object Detection via Dynamic Pseudo-label Learning

Weiqi Yan,Lvhai Chen,Huaijia Kou,Shengchuan Zhang,Yan Zhang,Liujuan Cao

Main category: cs.CV

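TL;DR: Proposes UCOD-DPL, an unsupervised camouflaged object detection method based on dynamic pseudo-label learning, which significantly improves performance through an adaptive pseudo-label module, a dual-branch adversarial decoder, and a secondary refinement mechanism.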

Details

Motivation: Existing unsupervised camouflaged object detection methods perform poorly because of fixed pseudo-label strategies and simple decoders, and they are sensitive to noise. Method: A teacher-student framework combined with an Adaptive Pseudo-label Module (APM), a Dual-Branch Adversarial (DBA) decoder, and a Look-Twice secondary refinement mechanism. Result: Experiments show strong performance, even surpassing some fully supervised methods. Conclusion: Through dynamic pseudo-label learning and cooperating modules, UCOD-DPL addresses key problems in unsupervised camouflaged object detection. Abstract: Unsupervised Camouflaged Object Detection (UCOD) has gained attention since it doesn't need to rely on extensive pixel-level labels. Existing UCOD methods typically generate pseudo-labels using fixed strategies and train 1x1 convolutional layers as a simple decoder, leading to low performance compared to fully-supervised methods. We emphasize two drawbacks in these approaches: (1) The model is prone to fitting incorrect knowledge due to the pseudo-label containing substantial noise. (2) The simple decoder fails to capture and learn the semantic features of camouflaged objects, especially for small-sized objects, due to the low-resolution pseudo-labels and severe confusion between foreground and background pixels. To this end, we propose a UCOD method with a teacher-student framework via Dynamic Pseudo-label Learning called UCOD-DPL, which contains an Adaptive Pseudo-label Module (APM), a Dual-Branch Adversarial (DBA) decoder, and a Look-Twice mechanism. The APM module adaptively combines pseudo-labels generated by fixed strategies and the teacher model to prevent the model from overfitting incorrect knowledge while preserving the ability for self-correction; the DBA decoder uses adversarial learning over different segmentation objectives and guides the model to overcome the foreground-background confusion of camouflaged objects; and the Look-Twice mechanism mimics the human tendency to zoom in on camouflaged objects and performs secondary refinement on small-sized objects. Extensive experiments show that our method demonstrates outstanding performance, even surpassing some existing fully supervised methods. The code is available now.

[67] SceneLCM: End-to-End Layout-Guided Interactive Indoor Scene Generation with Latent Consistency Model

Yangkai Lin,Jiabao Lei,Kui Jia

Main category: cs.CV

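TL;DR: SceneLCM is an end-to-end framework that combines an LLM with an LCM and generates and optimizes complex indoor scenes through four modular pipelines, addressing the limitations of existing methods.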

Details

Motivation: Existing indoor scene generation methods suffer from rigid editing constraints, physical incoherence, heavy manual effort, single-room limitations, and poor material quality, so a more efficient solution is needed. Method: SceneLCM proceeds through four pipelines: LLM-guided layout generation, LCM-based furniture generation, environment optimization, and physical editing. Result: Experiments show that SceneLCM outperforms existing techniques and has broad application potential. Conclusion: Through modular design and optimization, SceneLCM significantly improves the efficiency and quality of indoor scene generation. Abstract: Our project page: https://scutyklin.github.io/SceneLCM/. Automated generation of complex, interactive indoor scenes tailored to user prompts remains a formidable challenge. While existing methods achieve indoor scene synthesis, they struggle with rigid editing constraints, physical incoherence, excessive human effort, single-room limitations, and suboptimal material quality. To address these limitations, we propose SceneLCM, an end-to-end framework that synergizes a Large Language Model (LLM) for layout design with a Latent Consistency Model (LCM) for scene optimization. Our approach decomposes scene generation into four modular pipelines: (1) Layout Generation. We employ LLM-guided 3D spatial reasoning to convert textual descriptions into parametric blueprints (3D layouts), and an iterative programmatic validation mechanism refines layout parameters through LLM-mediated dialogue loops; (2) Furniture Generation. SceneLCM employs Consistency Trajectory Sampling (CTS), a consistency distillation sampling loss guided by LCM, to form fast, semantically rich, and high-quality representations. We also offer two theoretical justifications to demonstrate that our CTS loss is equivalent to consistency loss and that its distillation error is bounded by the truncation error of the Euler solver; (3) Environment Optimization. We use a multiresolution texture field to encode the appearance of the scene, and optimize via CTS loss. To maintain cross-geometric texture coherence, we introduce a normal-aware cross-attention decoder to predict RGB by cross-attending to the anchor locations in geometrically heterogeneous instances; (4) Physical Editing. SceneLCM supports physical editing by integrating physical simulation, achieving persistent physical realism. Extensive experiments validate SceneLCM's superiority over state-of-the-art techniques, showing its wide-ranging potential for diverse applications.

[68] EdgeSpotter: Multi-Scale Dense Text Spotting for Industrial Panel Monitoring

Changhong Fu,Hua Lin,Haobo Zuo,Liangliang Yao,Liguo Zhang

Main category: cs.CV

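TL;DR: Proposes EdgeSpotter, a multi-scale dense text spotting method for complex text in industrial panels, targeting cross-scale localization and ambiguous boundaries in dense text regions.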

Details

Motivation: Text spotting in industrial panels is essential for intelligent monitoring, but existing methods perform poorly in complex scenes, especially with cross-scale and dense text regions. Method: A new Transformer with an efficient mixer learns the interdependencies among multi-level features, and a feature sampling scheme based on Catmull-Rom splines encodes the shape, position, and semantic information of text. Result: Extensive evaluation on a newly built industrial panel monitoring benchmark (IPM) confirms the method's superior performance, and tests on a real edge AI vision system demonstrate its practicality. Conclusion: EdgeSpotter performs well on complex industrial panel text spotting, with high accuracy and robustness, and is suitable for practical use. Abstract: Text spotting for industrial panels is a key task for intelligent monitoring. However, achieving efficient and accurate text spotting for complex industrial panels remains challenging due to issues such as cross-scale localization and ambiguous boundaries in dense text regions. Moreover, most existing methods primarily focus on representing a single text shape, neglecting a comprehensive exploration of multi-scale feature information across different texts. To address these issues, this work proposes a novel multi-scale dense text spotter for an edge AI-based vision system (EdgeSpotter) to achieve accurate and robust industrial panel monitoring. Specifically, a novel Transformer with an efficient mixer is developed to learn the interdependencies among multi-level features, integrating multi-layer spatial and semantic cues. In addition, a new feature sampling with Catmull-Rom splines is designed, which explicitly encodes the shape, position, and semantic information of text, thereby alleviating missed detections and reducing recognition errors caused by multi-scale or dense text regions. Furthermore, a new benchmark dataset for industrial panel monitoring (IPM) is constructed. Extensive qualitative and quantitative evaluations on this challenging benchmark dataset validate the superior performance of the proposed method in different challenging panel monitoring tasks. Finally, practical tests based on the self-designed edge AI-based vision system demonstrate the practicality of the method. The code and demo will be available at https://github.com/vision4robotics/EdgeSpotter.
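
To make the spline-based sampling idea concrete, here is a minimal sketch of evaluating a uniform Catmull-Rom spline through 2D control points. The function names, the control points, and the number of samples per segment are illustrative assumptions; how EdgeSpotter places control points along a text instance and samples features at them is not described in the abstract and is not reproduced here.

```python
def catmull_rom_point(p0, p1, p2, p3, t):
    """Point on the uniform Catmull-Rom segment between p1 and p2, for t in [0, 1]."""
    t2, t3 = t * t, t * t * t
    return tuple(
        0.5 * (2 * b
               + (-a + c) * t
               + (2 * a - 5 * b + 4 * c - d) * t2
               + (-a + 3 * b - 3 * c + d) * t3)
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def sample_spline(points, samples_per_segment=4):
    """Sample the curve from points[1] to points[-2]; needs at least 4 control points."""
    out = []
    for i in range(len(points) - 3):
        for k in range(samples_per_segment):
            out.append(catmull_rom_point(*points[i:i + 4], k / samples_per_segment))
    out.append(tuple(float(v) for v in points[-2]))  # close the last segment
    return out


# Hypothetical control points tracing a wavy text centerline.
control = [(0, 0), (1, 1), (2, 0), (3, 1), (4, 0)]
curve = sample_spline(control)
print(len(curve), curve[0], curve[-1])
```

A Catmull-Rom spline passes through its interior control points, so sampling locations stay on the annotated shape rather than merely near it, which is one plausible reason to prefer it for curved or densely packed text.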

[69] Image segmentation and classification of E-waste for waste segregation

Prakriti Tripathi,Theertha Biju,Maniram Thota,Rakesh Lingam

Main category: cs.CV

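TL;DR: YOLOv11 and Mask-RCNN models are used to classify electronic waste, reaching 70 and 41 mAP respectively; the models will later be integrated into sorting robots.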

Details

Motivation: Address electronic waste classification to support automation in pick-and-place sorting robots. Method: Create a custom dataset and train YOLOv11 and Mask-RCNN models. Result: YOLOv11 reaches 70 mAP and Mask-RCNN reaches 41 mAP. Conclusion: The model will be integrated into sorting robots to automate e-waste segregation. Abstract: Industry partners provided a problem statement that involves classifying electronic waste using machine learning models that will be used by pick-and-place robots for waste segregation. We started by taking common electronic waste items, such as a mouse and charger, unsoldering them, and taking pictures to create a custom dataset. Then the state-of-the-art YOLOv11 model was trained and run to achieve 70 mAP in real-time. A Mask-RCNN model was also trained and achieved 41 mAP. The model will be further integrated with pick-and-place robots to perform segregation of e-waste.

[70] Hi-VAE: Efficient Video Autoencoding with Global and Detailed Motion

Huaize Liu,Wenzhang Sun,Qiyuan Zhang,Donglin Di,Biao Gong,Hao Li,Chen Wei,Changqing Zou

Main category: cs.CV

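TL;DR: Hi-VAE is an efficient video autoencoding framework that hierarchically encodes coarse-to-fine motion representations of video dynamics and significantly improves compression efficiency.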

Details

Motivation: Existing video autoencoding methods fail to model spatio-temporal redundancy efficiently, leading to insufficient compression and excessive training cost. Method: Hi-VAE decomposes video dynamics into two latent spaces, Global Motion and Detailed Motion, and reconstructs videos with a conditional diffusion decoder. Result: Experiments show a compression factor as high as 1428$\times$, far above baseline methods, while maintaining high reconstruction quality and downstream task performance. Conclusion: Hi-VAE performs well in both efficient compression and high-quality reconstruction and offers new perspectives on video latent representation and generation. Abstract: Recent breakthroughs in video autoencoders (Video AEs) have advanced video generation, but existing methods fail to efficiently model spatio-temporal redundancies in dynamics, resulting in suboptimal compression factors. This shortfall leads to excessive training costs for downstream tasks. To address this, we introduce Hi-VAE, an efficient video autoencoding framework that hierarchically encodes coarse-to-fine motion representations of video dynamics and formulates the decoding process as a conditional generation task. Specifically, Hi-VAE decomposes video dynamics into two latent spaces: Global Motion, capturing overarching motion patterns, and Detailed Motion, encoding high-frequency spatial details. Using separate self-supervised motion encoders, we compress video latents into compact motion representations to reduce redundancy significantly. A conditional diffusion decoder then reconstructs videos by combining hierarchical global and detailed motions, enabling high-fidelity video reconstructions. Extensive experiments demonstrate that Hi-VAE achieves a high compression factor of 1428$\times$, almost 30$\times$ higher than baseline methods (e.g., Cosmos-VAE at 48$\times$), validating the efficiency of our approach. Meanwhile, Hi-VAE maintains high reconstruction quality at such high compression rates and performs effectively in downstream generative tasks. Moreover, Hi-VAE exhibits interpretability and scalability, providing new perspectives for future exploration in video latent representation and generation.
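
As a quick check, the "almost 30$\times$" comparison follows directly from the two compression factors quoted in the abstract; no figures beyond those two are assumed.

```latex
\frac{1428}{48} = 29.75 \approx 30
```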

[71] Learning Compact Vision Tokens for Efficient Large Multimodal Models

Hao Tang,Chengchao Shen

Main category: cs.CV

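TL;DR: The paper proposes a Spatial Token Fusion (STF) method and a Multi-Block Token Fusion (MBTF) module to shorten vision token sequences and improve inference efficiency while preserving multimodal reasoning capability.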

Details

Motivation: Large multimodal models (LMMs) are costly because they process long vision token sequences, so a way to reduce the token count without losing performance is needed. Method: STF fuses spatially adjacent tokens to shorten the sequence, and MBTF supplements multi-granularity features, balancing token reduction and information preservation. Result: On 8 popular vision-language benchmarks, performance is comparable or superior to the baseline while using only 25% of its vision tokens. Conclusion: Combining STF and MBTF improves inference efficiency while preserving multimodal reasoning capability. Abstract: Large multimodal models (LMMs) face significant computational challenges due to the high cost of Large Language Models (LLMs) and the quadratic complexity of processing long vision token sequences. In this paper, we explore the spatial redundancy among vision tokens and shorten the length of vision token sequences for inference acceleration. Specifically, we propose a Spatial Token Fusion (STF) method to learn compact vision tokens for a short vision token sequence, where spatial-adjacent tokens are fused into one. Meanwhile, a weight-frozen vision encoder cannot adapt well to the demands of extensive downstream vision-language tasks. To this end, we further introduce a Multi-Block Token Fusion (MBTF) module to supplement multi-granularity features for the reduced token sequence. Overall, we combine the STF and MBTF modules to balance token reduction and information preservation, thereby improving inference efficiency without sacrificing multimodal reasoning capabilities. Experimental results demonstrate that our method based on LLaVA-1.5 achieves comparable or even superior performance to the baseline on 8 popular vision-language benchmarks with only $25\%$ of the baseline's vision tokens. The source code and trained weights are available at https://github.com/visresearch/LLaVA-STF.
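
The token-count arithmetic behind the 25% figure can be sketched by fusing each 2x2 block of spatially adjacent tokens into one. Several things here are assumptions rather than the paper's method: the 24x24 input grid, the 2x2 window, and plain averaging as the fusion operator (STF learns its fusion), as well as the function name `fuse_tokens`.

```python
def fuse_tokens(grid, window=2):
    """Fuse each window x window block of token vectors into one by averaging.

    grid is an H x W nested list of feature vectors (lists of floats).
    Averaging is a stand-in for a learned fusion layer.
    """
    height, width = len(grid), len(grid[0])
    assert height % window == 0 and width % window == 0
    dim = len(grid[0][0])
    fused = []
    for r in range(0, height, window):
        row = []
        for c in range(0, width, window):
            block = [grid[r + dr][c + dc]
                     for dr in range(window) for dc in range(window)]
            row.append([sum(vec[i] for vec in block) / len(block)
                        for i in range(dim)])
        fused.append(row)
    return fused


# Assumed 24 x 24 grid of 2-dimensional toy features (row index, column index).
grid = [[[float(r), float(c)] for c in range(24)] for r in range(24)]
fused = fuse_tokens(grid)
n_before = len(grid) * len(grid[0])
n_after = len(fused) * len(fused[0])
print(n_before, n_after, n_after / n_before)  # 576 144 0.25
```

Because self-attention cost grows quadratically with sequence length, cutting the vision tokens to a quarter reduces the attention cost over those tokens by considerably more than a factor of four.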

Van Nguyen Nguyen,Christian Forster,Sindi Shkodrani,Vincent Lepetit,Bugra Tekin,Cem Keskin,Tomas Hodan

Main category: cs.CV

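TL;DR: GoTrack is an efficient CAD-based method for 6DoF object pose refinement and tracking that requires no object-specific training and combines model-to-frame and frame-to-frame registration, both realized through optical flow estimation.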

Details

Motivation: Existing tracking methods rely solely on analysis-by-synthesis for model-to-frame registration; GoTrack adds frame-to-frame registration to save compute and stabilize tracking. Method: GoTrack simplifies model-to-frame registration using standard neural network blocks (a transformer on top of DINOv2) and handles frame-to-frame registration with a lightweight off-the-shelf optical flow model. Result: Combined with existing coarse pose estimation methods, GoTrack reaches state-of-the-art RGB-only results on standard benchmarks for 6DoF object pose estimation and tracking. Conclusion: GoTrack offers an efficient and accurate solution for 6DoF object pose refinement and tracking, and the code and models are publicly available. Abstract: We introduce GoTrack, an efficient and accurate CAD-based method for 6DoF object pose refinement and tracking, which can handle diverse objects without any object-specific training. Unlike existing tracking methods that rely solely on an analysis-by-synthesis approach for model-to-frame registration, GoTrack additionally integrates frame-to-frame registration, which saves compute and stabilizes tracking. Both types of registration are realized by optical flow estimation. The model-to-frame registration is noticeably simpler than in existing methods, relying only on standard neural network blocks (a transformer is trained on top of DINOv2) and producing reliable pose confidence scores without a scoring network. For the frame-to-frame registration, which is an easier problem as consecutive video frames are typically nearly identical, we employ a light off-the-shelf optical flow model. We demonstrate that GoTrack can be seamlessly combined with existing coarse pose estimation methods to create a minimal pipeline that reaches state-of-the-art RGB-only results on standard benchmarks for 6DoF object pose estimation and tracking. Our source code and trained models are publicly available at https://github.com/facebookresearch/gotrack

[73] Faster than Fast: Accelerating Oriented FAST Feature Detection on Low-end Embedded GPUs

Qiong Chang,Xinyuan Chen,Xiang Li,Weimin Wang,Jun Miyazaki

Main category: cs.CV

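TL;DR: The paper proposes two methods for accelerating Oriented FAST feature detection on low-end embedded GPUs, significantly improving the real-time capability of SLAM systems.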

Details

Motivation: Current ORB-based SLAM systems cannot meet real-time requirements on mobile platforms, mainly because the Oriented FAST computation is time-consuming. Method: A binary-level encoding strategy quickly determines candidate points, and a separable Harris detection strategy with efficient low-level GPU hardware instructions optimizes FAST feature point detection and Harris corner detection. Result: Experiments on a Jetson TX2 embedded GPU show an average speedup of over 7.3 times compared with the widely used OpenCV with GPU support. Conclusion: The method significantly improves the real-time capability of SLAM systems and suits mobile and resource-constrained environments. Abstract: Visual SLAM (Simultaneous Localization and Mapping) is a technology widely used in applications such as robotic navigation and virtual reality. It detects feature points in images to build a map of an unknown environment while simultaneously determining the camera's own location, and it usually imposes stringent requirements on hardware power consumption, processing speed and accuracy. Currently, ORB (Oriented FAST and Rotated BRIEF)-based SLAM systems exhibit superior performance in terms of processing speed and robustness. However, they still fall short of meeting the demands for real-time processing on mobile platforms. This limitation is primarily due to the time-consuming Oriented FAST calculations, which account for approximately half of the entire SLAM system's runtime. This paper presents two methods to accelerate the Oriented FAST feature detection on low-end embedded GPUs. These methods optimize the most time-consuming steps in Oriented FAST feature detection: FAST feature point detection and Harris corner detection, which is achieved by implementing a binary-level encoding strategy to determine candidate points quickly and a separable Harris detection strategy with efficient low-level GPU hardware-specific instructions. Extensive experiments on a Jetson TX2 embedded GPU demonstrate an average speedup of over 7.3 times compared to widely used OpenCV with GPU support. This significant improvement highlights its effectiveness and potential for real-time applications in mobile and resource-constrained environments.
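
One way a bit-level encoding can speed up the FAST segment test is sketched below: the 16 pixels on the circle around a candidate are packed into two 16-bit masks (brighter and darker than the center by a threshold), and a contiguous arc of at least 9 set bits is found with shifts and ANDs. This is an illustrative CPU sketch of the standard FAST-9 test, not the paper's GPU kernel; the function names and pixel values are hypothetical.

```python
def has_contiguous_run(mask, n=9, bits=16):
    """True if the circular `bits`-bit mask has at least n consecutive set bits (n <= bits)."""
    doubled = mask | (mask << bits)  # unroll the circle so wrap-around arcs become linear
    run = doubled
    for shift in range(1, n):
        run &= doubled >> shift  # bit j survives only if bits j..j+shift are all set
    return run != 0


def is_fast_candidate(center, ring, threshold, n=9):
    """Segment test on the 16 ring pixels surrounding a candidate pixel."""
    brighter = darker = 0
    for i, value in enumerate(ring):
        if value > center + threshold:
            brighter |= 1 << i
        elif value < center - threshold:
            darker |= 1 << i
    return has_contiguous_run(brighter, n) or has_contiguous_run(darker, n)


# Nine bright pixels in a row once the circle wraps around (indices 12..15, 0..4).
ring_corner = [200] * 5 + [100] * 7 + [200] * 4
# Only eight bright pixels in a row (indices 12..15, 0..3).
ring_flat = [200] * 4 + [100] * 8 + [200] * 4
print(is_fast_candidate(100, ring_corner, 20), is_fast_candidate(100, ring_flat, 20))
```

Replacing per-pixel branching with a handful of integer operations is the kind of transformation that maps well onto GPU threads, where divergent branches are costly.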

[74] Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models

Sangwon Jang,Taekyung Ki,Jaehyeong Jo,Jaehong Yoon,Soo Ye Kim,Zhe Lin,Sung Ju Hwang

Main category: cs.CV

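TL;DR: Proposes Frame Guidance, a training-free frame-level guidance method for controllable video generation that supports tasks such as keyframe guidance, stylization, and looping.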

Details

Motivation: Existing methods rely on fine-tuning large-scale video models, which becomes impractical as model sizes grow. Method: A simple latent processing method reduces memory usage, and a new latent optimization strategy is designed for globally coherent video generation. Result: Experiments show that Frame Guidance efficiently produces high-quality controlled videos across a variety of tasks and input signals. Conclusion: Frame Guidance is a training-free, broadly compatible method that markedly improves the controllability of video generation. Abstract: Advancements in diffusion models have significantly improved video quality, directing attention to fine-grained controllability. However, many existing methods depend on fine-tuning large-scale video models for specific tasks, which becomes increasingly impractical as model sizes continue to grow. In this work, we present Frame Guidance, a training-free guidance for controllable video generation based on frame-level signals, such as keyframes, style reference images, sketches, or depth maps. For practical training-free guidance, we propose a simple latent processing method that dramatically reduces memory usage, and apply a novel latent optimization strategy designed for globally coherent video generation. Frame Guidance enables effective control across diverse tasks, including keyframe guidance, stylization, and looping, without any training, and is compatible with any video model. Experimental results show that Frame Guidance can produce high-quality controlled videos for a wide range of tasks and input signals.

[75] Hierarchical Feature-level Reverse Propagation for Post-Training Neural Networks

Ni Ding,Lei He,Shengbo Eben Li,Keqiang Li

Main category: cs.CV

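TL;DR: This paper proposes a hierarchical, decoupled post-training framework that introduces surrogate supervisory signals by reconstructing intermediate feature maps, improving model transparency and training flexibility.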

Details

Motivation: Black-box models for end-to-end autonomous driving raise interpretability and safety concerns, so model transparency and training flexibility need to improve. Method: Intermediate feature maps are reconstructed from ground-truth labels to introduce surrogate supervisory signals, and feature-level reverse computation is formalized as an optimization problem. Result: The method shows superior generalization performance and computational efficiency on multiple standard image classification benchmarks. Conclusion: The method offers a novel and efficient training paradigm for neural networks, and the experiments validate its effectiveness and potential. Abstract: End-to-end autonomous driving has emerged as a dominant paradigm, yet its highly entangled black-box models pose significant challenges in terms of interpretability and safety assurance. To improve model transparency and training flexibility, this paper proposes a hierarchical and decoupled post-training framework tailored for pretrained neural networks. By reconstructing intermediate feature maps from ground-truth labels, surrogate supervisory signals are introduced at transitional layers to enable independent training of specific components, thereby avoiding the complexity and coupling of conventional end-to-end backpropagation and providing interpretable insights into networks' internal mechanisms. To the best of our knowledge, this is the first method to formalize feature-level reverse computation as well-posed optimization problems, which we rigorously reformulate as systems of linear equations or least squares problems. This establishes a novel and efficient training paradigm that extends gradient backpropagation to feature backpropagation. Extensive experiments on multiple standard image classification benchmarks demonstrate that the proposed method achieves superior generalization performance and computational efficiency compared to traditional training approaches, validating its effectiveness and potential.

[76] SAP-Bench: Benchmarking Multimodal Large Language Models in Surgical Action Planning

Mengya Xu,Zhongzhen Huang,Dillan Imans,Yiru Ye,Xiaofan Zhang,Qi Dou

Main category: cs.CV

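TL;DR: The paper introduces the SAP-Bench dataset and the MLLM-SAP framework for evaluating multimodal large language models on surgical action planning, and reveals performance gaps in current models.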

Details

Motivation: Existing benchmarks cannot adequately evaluate the complex capabilities that surgical decision-making requires, especially in life-critical domains. Method: The paper proposes the SAP-Bench dataset, which contains clinically validated surgical action annotations, and develops the MLLM-SAP framework, which combines domain knowledge to generate action recommendations. Result: Seven state-of-the-art MLLMs are evaluated and show significant performance gaps on next action prediction. Conclusion: SAP-Bench provides a high-quality evaluation tool for surgical action planning and exposes the limitations of current models. Abstract: Effective evaluation is critical for driving advancements in MLLM research. The surgical action planning (SAP) task, which aims to generate future action sequences from visual inputs, demands precise and sophisticated analytical capabilities. Unlike mathematical reasoning, surgical decision-making operates in life-critical domains and requires meticulous, verifiable processes to ensure reliability and patient safety. This task demands the ability to distinguish between atomic visual actions and coordinate complex, long-horizon procedures, capabilities that are inadequately evaluated by current benchmarks. To address this gap, we introduce SAP-Bench, a large-scale, high-quality dataset designed to enable multimodal large language models (MLLMs) to perform interpretable surgical action planning. SAP-Bench is derived from cholecystectomy procedures with a mean duration of 1137.5s and introduces temporally-grounded surgical action annotations, comprising 1,226 clinically validated action clips (mean duration: 68.7s) that capture five fundamental surgical actions across 74 procedures. The dataset provides 1,152 strategically sampled current frames, each paired with the corresponding next action as multimodal analysis anchors. We propose the MLLM-SAP framework that leverages MLLMs to generate next action recommendations from the current surgical scene and natural language instructions, enhanced with injected surgical domain knowledge. To assess our dataset's effectiveness and the broader capabilities of current models, we evaluate seven state-of-the-art MLLMs (e.g., OpenAI-o1, GPT-4o, QwenVL2.5-72B, Claude-3.5-Sonnet, GeminiPro2.5, Step-1o, and GLM-4v) and reveal critical gaps in next action prediction performance.

[77] TV-LiVE: Training-Free, Text-Guided Video Editing via Layer Informed Vitality Exploitation

Min-Jung Kim,Dongjin Kim,Seokju Yun,Jaegul Choo

Main category: cs.CV

TL;DR: TV-LiVE is a training-free, text-guided video editing framework that exploits the vitality of key layers to enable object addition and non-rigid editing.

Details

Motivation: Current video editing methods mainly address simpler tasks such as style transfer, while complex tasks such as object addition and non-rigid editing remain under-studied. Method: The method identifies vital layers in the video generation model (associated with RoPE), selectively injects features to perform the edit, and extracts mask regions from prominent layers. Result: TV-LiVE outperforms existing methods on object addition and non-rigid editing. Conclusion: TV-LiVE offers an efficient solution for complex video editing tasks. Abstract: Video editing has garnered increasing attention alongside the rapid progress of diffusion-based video generation models. As part of these advancements, there is a growing demand for more accessible and controllable forms of video editing, such as prompt-based editing. Previous studies have primarily focused on tasks such as style transfer, background replacement, object substitution, and attribute modification, while maintaining the content structure of the source video. However, more complex tasks, including the addition of novel objects and non-rigid transformations, remain relatively unexplored. In this paper, we present TV-LiVE, a Training-free and text-guided Video editing framework via Layer-informed Vitality Exploitation. We empirically identify vital layers within the video generation model that significantly influence the quality of generated outputs. Notably, these layers are closely associated with Rotary Position Embeddings (RoPE). Based on this observation, our method enables both object addition and non-rigid video editing by selectively injecting key and value features from the source model into the corresponding layers of the target model guided by the layer vitality. For object addition, we further identify prominent layers to extract the mask regions corresponding to the newly added target prompt. We found that the extracted masks from the prominent layers faithfully indicate the region to be edited. Experimental results demonstrate that TV-LiVE outperforms existing approaches for both object addition and non-rigid video editing. Project Page: https://emjay73.github.io/TV_LiVE/

[78] Backdoor Attack on Vision Language Models with Stealthy Semantic Manipulation

Zhiyuan Zhong,Zhen Sun,Yepang Liu,Xinlei He,Guanhong Tao

Main category: cs.CV

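TL;DR: The paper proposes BadSem, a new backdoor attack on vision language models (VLMs) that uses cross-modal semantic mismatches as implicit triggers; it achieves a high attack success rate and is hard to defend against.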

Details

Motivation: Existing backdoor attacks mainly rely on single-modality triggers and leave the cross-modal fusion nature of VLMs largely unexplored, so the attack surface created by cross-modal semantic mismatches deserves study. Method: A data poisoning attack deliberately misaligns image-text pairs during training (for example, color and object attributes); the authors construct the SIMBad dataset and design the BadSem attack. Result: Across four widely used VLMs, BadSem achieves an average attack success rate (ASR) above 98%, generalizes to out-of-distribution datasets, and transfers across poisoning modalities. Neither defense strategy (system prompt or supervised fine-tuning) mitigates the attack effectively. Conclusion: The study exposes semantic-level security vulnerabilities in VLMs that must be addressed for safer deployment. Abstract: Vision Language Models (VLMs) have shown remarkable performance, but are also vulnerable to backdoor attacks whereby the adversary can manipulate the model's outputs through hidden triggers. Prior attacks primarily rely on single-modality triggers, leaving the crucial cross-modal fusion nature of VLMs largely unexplored. Unlike prior work, we identify a novel attack surface that leverages cross-modal semantic mismatches as implicit triggers. Based on this insight, we propose BadSem (Backdoor Attack with Semantic Manipulation), a data poisoning attack that injects stealthy backdoors by deliberately misaligning image-text pairs during training. To perform the attack, we construct SIMBad, a dataset tailored for semantic manipulation involving color and object attributes. Extensive experiments across four widely used VLMs show that BadSem achieves over 98% average ASR, generalizes well to out-of-distribution datasets, and can transfer across poisoning modalities. Our detailed analysis using attention visualization shows that backdoored models focus on semantically sensitive regions under mismatched conditions while maintaining normal behavior on clean inputs. To mitigate the attack, we try two defense strategies based on system prompt and supervised fine-tuning but find that both of them fail to mitigate the semantic backdoor. Our findings highlight the urgent need to address semantic vulnerabilities in VLMs for their safer deployment.

[79] AugmentGest: Can Random Data Cropping Augmentation Boost Gesture Recognition Performance?

Nada Aboudeshish,Dmitry Ignatov,Radu Timofte

Main category: cs.CV

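TL;DR: This paper proposes a comprehensive data augmentation framework that increases the diversity of skeleton datasets through geometric transformations, random cropping, and related methods, significantly improving gesture and action recognition performance.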

Details

Motivation: Skeleton datasets have limited data diversity, so the paper proposes a data augmentation method that simulates real-world variations and improves model generalization. Method: The framework integrates geometric transformations, random cropping, rotation, zooming, and intensity transformations, and generates three augmented versions of each sample to enlarge the dataset. Result: The framework is validated on several models (e2eET, FPPR-PCD, DD-Net) and datasets (DHG14/28, SHREC'17, JHMDB) and significantly improves performance. Conclusion: The framework achieves state-of-the-art performance and offers a scalable solution for gesture and action recognition in real-world applications. Abstract: Data augmentation is a crucial technique in deep learning, particularly for tasks with limited dataset diversity, such as skeleton-based datasets. This paper proposes a comprehensive data augmentation framework that integrates geometric transformations (random cropping, rotation, zooming) and intensity-based transformations (brightness and contrast adjustments) to simulate real-world variations. Random cropping ensures the preservation of spatio-temporal integrity while addressing challenges such as viewpoint bias and occlusions. The augmentation pipeline generates three augmented versions for each sample in addition to the data set sample, thus quadrupling the data set size and enriching the diversity of gesture representations. The proposed augmentation strategy is evaluated on three models: multi-stream e2eET, FPPR point cloud-based hand gesture recognition (HGR), and DD-Network. Experiments are conducted on benchmark datasets including DHG14/28, SHREC'17, and JHMDB. The e2eET model is recognized as the state-of-the-art for hand gesture recognition on DHG14/28 and SHREC'17. The FPPR-PCD model, the second-best performing model on SHREC'17, excels in point cloud-based gesture recognition. DD-Net, a lightweight and efficient architecture for skeleton-based action recognition, is evaluated on SHREC'17 and the Human Motion Data Base (JHMDB). The results underline the effectiveness and versatility of the proposed augmentation strategy, significantly improving model generalization and robustness across diverse datasets and architectures. This framework not only establishes state-of-the-art results on all three evaluated models but also offers a scalable solution to advance HGR and action recognition applications in real-world scenarios. The framework is available at https://github.com/NadaAbodeshish/Random-Cropping-augmentation-HGR
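The "three augmented versions per sample, quadrupling the data set" step can be sketched as follows. The abstract does not say how the crop is applied to skeleton data, so this example assumes a random contiguous temporal window over the frame sequence; `random_temporal_crop`, `augment_dataset`, and `crop_ratio` are hypothetical names, and the linked repository should be consulted for the actual procedure.

```python
import random


def random_temporal_crop(sequence, crop_ratio=0.9, rng=random):
    """Keep a random contiguous window of frames (assumed crop strategy).
    Joints inside each kept frame are left untouched."""
    n = len(sequence)
    keep = max(1, int(n * crop_ratio))
    start = rng.randint(0, n - keep)
    return sequence[start:start + keep]


def augment_dataset(samples, copies=3, crop_ratio=0.9, seed=0):
    """Return each original (sequence, label) pair plus `copies` cropped
    versions of it, so copies=3 yields four times as many samples."""
    rng = random.Random(seed)
    augmented = []
    for sequence, label in samples:
        augmented.append((sequence, label))
        for _ in range(copies):
            cropped = random_temporal_crop(sequence, crop_ratio, rng)
            augmented.append((cropped, label))
    return augmented
```

Seeding a dedicated `random.Random` instance keeps the augmented set reproducible across runs without touching the global random state.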

[80] Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning

Tianyi Bai,Yuxuan Fan,Jiantao Qiu,Fupeng Sun,Jiayi Song,Junlin Han,Zichen Liu,Conghui He,Wentao Zhang,Binhang Yuan

Main category: cs.CV

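TL;DR: The paper addresses the poor performance of multimodal large language models (MLLMs) on fine-grained visual difference tasks by generating the Micro Edit Dataset (MED) and introducing a supervised fine-tuning framework with a feature-level consistency loss, significantly improving model performance.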

Details

Motivation: MLLMs perform well on vision-language tasks but tend to hallucinate or miss semantic changes on fine-grained visual difference tasks, which the authors attribute to limitations in training data and learning objectives. Method: A controlled data generation pipeline is used to build the MED dataset of over 50K image-text pairs, and a supervised fine-tuning framework with a feature-level consistency loss is designed. Result: On the Micro Edit Detection benchmark the method improves difference detection accuracy and reduces hallucinations, and it also performs well on standard vision-language tasks. Conclusion: Combining targeted data with consistency objectives effectively strengthens MLLMs on fine-grained visual reasoning. Abstract: Multimodal large language models (MLLMs) have achieved strong performance on vision-language tasks but still struggle with fine-grained visual differences, leading to hallucinations or missed semantic shifts. We attribute this to limitations in both training data and learning objectives. To address these issues, we propose a controlled data generation pipeline that produces minimally edited image pairs with semantically aligned captions. Using this pipeline, we construct the Micro Edit Dataset (MED), containing over 50K image-text pairs spanning 11 fine-grained edit categories, including attribute, count, position, and object presence changes. Building on MED, we introduce a supervised fine-tuning (SFT) framework with a feature-level consistency loss that promotes stable visual embeddings under small edits. We evaluate our approach on the Micro Edit Detection benchmark, which includes carefully balanced evaluation pairs designed to test sensitivity to subtle visual variations across the same edit categories. Our method improves difference detection accuracy and reduces hallucinations compared to strong baselines, including GPT-4o. Moreover, it yields consistent gains on standard vision-language tasks such as image captioning and visual question answering. These results demonstrate the effectiveness of combining targeted data and alignment objectives for enhancing fine-grained visual reasoning in MLLMs.

[81] Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification

Tianyi Bai,Zengjie Hu,Fupeng Sun,Jiantao Qiu,Yizhen Jiang,Guangxin He,Bohan Zeng,Conghui He,Binhang Yuan,Wentao Zhang

Main category: cs.CV

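TL;DR: The paper proposes a dynamic inference framework that improves the visual reasoning of multimodal large language models (MLLMs) through iterative, verifier-guided visual token scaling.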

Details

Motivation: Existing MLLMs use a static inference paradigm and cannot adjust their visual understanding dynamically, which limits adaptability and iterative refinement. Method: The problem is modeled as a Markov Decision Process that combines a reasoner with a verifier (trained via multi-step Direct Preference Optimization) to scale visual tokens dynamically. Result: The method significantly outperforms existing approaches on multiple visual reasoning benchmarks, improving accuracy and interpretability. Conclusion: Dynamic inference mechanisms give next-generation MLLMs fine-grained, context-aware visual reasoning. Abstract: Multi-modal large language models (MLLMs) have achieved remarkable capabilities by integrating visual perception with language understanding, enabling applications such as image-grounded dialogue, visual question answering, and scientific analysis. However, most MLLMs adopt a static inference paradigm, encoding the entire image into fixed visual tokens upfront, which limits their ability to iteratively refine understanding or adapt to context during inference. This contrasts sharply with human perception, which is dynamic, selective, and feedback-driven. In this work, we introduce a novel framework for inference-time visual token scaling that enables MLLMs to perform iterative, verifier-guided reasoning over visual content. We formulate the problem as a Markov Decision Process, involving a reasoner that proposes visual actions and a verifier, which is trained via multi-step Direct Preference Optimization (DPO), that evaluates these actions and determines when reasoning should terminate. To support this, we present a new dataset, VTS, comprising supervised reasoning trajectories (VTS-SFT) and preference-labeled reasoning comparisons (VTS-DPO). Our method significantly outperforms existing approaches across diverse visual reasoning benchmarks, offering not only improved accuracy but also more interpretable and grounded reasoning processes. These results demonstrate the promise of dynamic inference mechanisms for enabling fine-grained, context-aware visual reasoning in next-generation MLLMs.

[82] From Generation to Generalization: Emergent Few-Shot Learning in Video Diffusion Models

Pablo Acuaviva,Aram Davtyan,Mariam Hassan,Sebastian Stapf,Ahmad Rahimi,Alexandre Alahi,Paolo Favaro

Main category: cs.CV

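TL;DR: Video diffusion models (VDMs) are not only powerful video generators; their training dynamics also lead them to learn structured representations and an implicit understanding of the visual world. With a few-shot fine-tuning framework, VDMs adapt to many tasks and show broad generalization.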

Details

Motivation: To explore the potential of VDMs beyond video generation and test whether their training dynamics lead them to learn structured representations and implicit knowledge of the visual world. Method: A few-shot fine-tuning framework recasts each task as a visual transition and trains LoRA weights to adapt to new tasks without changing the VDM's generative interface. Result: The model generalizes strongly from low-level vision (for example, segmentation and pose estimation) to high-level reasoning (for example, ARC-AGI). Conclusion: VDMs are not just generative engines but adaptable visual learners that could become the backbone of future vision foundation models. Abstract: Video Diffusion Models (VDMs) have emerged as powerful generative tools, capable of synthesizing high-quality spatiotemporal content. Yet, their potential goes far beyond mere video generation. We argue that the training dynamics of VDMs, driven by the need to model coherent sequences, naturally push them to internalize structured representations and an implicit understanding of the visual world. To probe the extent of this internal knowledge, we introduce a few-shot fine-tuning framework that repurposes VDMs for new tasks using only a handful of examples. Our method transforms each task into a visual transition, enabling the training of LoRA weights on short input-output sequences without altering the generative interface of a frozen VDM. Despite minimal supervision, the model exhibits strong generalization across diverse tasks, from low-level vision (for example, segmentation and pose estimation) to high-level reasoning (for example, on ARC-AGI). These results reframe VDMs as more than generative engines. They are adaptable visual learners with the potential to serve as the backbone for future foundation models in vision.

[83] Multi-Step Guided Diffusion for Image Restoration on Edge Devices: Toward Lightweight Perception in Embodied AI

Aditya Chakravarty

Main category: cs.CV

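TL;DR: The paper proposes a multi-step optimization strategy within each denoising step, significantly improving image quality and generalization.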

Details

Motivation: Existing methods such as MPGD apply only a single gradient update per denoising step, which limits restoration quality and robustness. Method: A multi-step optimization strategy is introduced within each denoising timestep. Result: Experiments show that increasing the number of gradient updates improves LPIPS and PSNR, and the approach is validated on an embedded device (Jetson Orin Nano). Conclusion: MPGD has potential as a lightweight, plug-and-play restoration module for real-time visual perception tasks on drones and mobile robots. Abstract: Diffusion models have shown remarkable flexibility for solving inverse problems without task-specific retraining. However, existing approaches such as Manifold Preserving Guided Diffusion (MPGD) apply only a single gradient update per denoising step, limiting restoration fidelity and robustness, especially in embedded or out-of-distribution settings. In this work, we introduce a multistep optimization strategy within each denoising timestep, significantly enhancing image quality, perceptual accuracy, and generalization. Our experiments on super-resolution and Gaussian deblurring demonstrate that increasing the number of gradient updates per step improves LPIPS and PSNR with minimal latency overhead. Notably, we validate this approach on a Jetson Orin Nano using degraded ImageNet and a UAV dataset, showing that MPGD, originally trained on face datasets, generalizes effectively to natural and aerial scenes. Our findings highlight MPGD's potential as a lightweight, plug-and-play restoration module for real-time visual perception in embodied AI agents such as drones and mobile robots.
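The difference between one and several gradient updates per denoising step can be illustrated with a toy inner loop. This is a sketch under strong assumptions, not the MPGD sampler: the degradation is a hypothetical elementwise scale `a`, the loss is a plain squared residual, and the denoiser and manifold projection are omitted entirely.

```python
def data_fidelity_grad(x0, y, a):
    """Gradient of sum((y - a * x0)^2) with respect to x0, for an assumed
    elementwise degradation y = a * x."""
    return [-2.0 * a * (yi - a * xi) for xi, yi in zip(x0, y)]


def guided_refine(x0, y, a, step_size=0.1, num_updates=1):
    """Apply num_updates gradient steps to the clean-image estimate x0
    within a single denoising timestep (num_updates=1 is the baseline)."""
    for _ in range(num_updates):
        grad = data_fidelity_grad(x0, y, a)
        x0 = [xi - step_size * gi for xi, gi in zip(x0, grad)]
    return x0


def residual(x0, y, a):
    """Squared measurement residual, used here only to compare settings."""
    return sum((yi - a * xi) ** 2 for xi, yi in zip(x0, y))


if __name__ == "__main__":
    y, a, start = [1.0, 2.0], 0.5, [0.0, 0.0]
    for k in (1, 4):
        refined = guided_refine(start, y, a, num_updates=k)
        print(k, residual(refined, y, a))
```

In this toy setting each extra update shrinks the residual by a constant factor, which mirrors the abstract's claim that more updates per step improve fidelity at a small additional cost.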

[84] FANVID: A Benchmark for Face and License Plate Recognition in Low-Resolution Videos

Kavitha Viswanathan,Vrinda Goel,Shlesh Gholap,Devayan Ghosh,Madhav Gupta,Dhruvi Ganatra,Sanket Potdar,Amit Sethi

Main category: cs.CV

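TL;DR: FANVID is a new video benchmark dataset of low-resolution (LR) clips intended to advance temporal recognition models; it supports face matching and license plate recognition tasks.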

Details

Motivation: To address the difficulty of recognizing faces and license plates in low-resolution real-world surveillance video and to advance temporal modeling techniques. Method: The authors build the FANVID dataset of nearly 1,463 LR video clips, define face matching and license plate recognition tasks, and introduce evaluation metrics. Result: A baseline method scores 0.58 and 0.42 on the two tasks respectively, showing that the tasks are feasible but challenging. Conclusion: FANVID aims to spur innovation in temporal modeling for low-resolution recognition, with applications in surveillance, forensics, and autonomous driving. Abstract: Real-world surveillance often renders faces and license plates unrecognizable in individual low-resolution (LR) frames, hindering reliable identification. To advance temporal recognition models, we present FANVID, a novel video-based benchmark comprising nearly 1,463 LR clips (180 x 320, 20--60 FPS) featuring 63 identities and 49 license plates from three English-speaking countries. Each video includes distractor faces and plates, increasing task difficulty and realism. The dataset contains 31,096 manually verified bounding boxes and labels. FANVID defines two tasks: (1) face matching -- detecting LR faces and matching them to high-resolution mugshots, and (2) license plate recognition -- extracting text from LR plates without a predefined database. Videos are downsampled from high-resolution sources to ensure that faces and text are indecipherable in single frames, requiring models to exploit temporal information. We introduce evaluation metrics adapted from mean Average Precision at IoU > 0.5, prioritizing identity correctness for faces and character-level accuracy for text. A baseline method with pre-trained video super-resolution, detection, and recognition achieved performance scores of 0.58 (face matching) and 0.42 (plate recognition), highlighting both the feasibility and challenge of the tasks. FANVID's selection of faces and plates balances diversity with recognition challenge. We release the software for data access, evaluation, baseline, and annotation to support reproducibility and extension. FANVID aims to catalyze innovation in temporal modeling for LR recognition, with applications in surveillance, forensics, and autonomous vehicles.
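The metrics build on the usual IoU > 0.5 localization criterion, which can be sketched as below. The `iou` function is the standard definition; `is_correct_face_match` is a hypothetical rule showing how identity correctness might be combined with localization, since the abstract does not give the exact adapted metric.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_correct_face_match(pred_box, pred_id, gt_box, gt_id, threshold=0.5):
    """Assumed rule: a detection counts only if it is localized well enough
    (IoU above the threshold) and the predicted identity is correct."""
    return iou(pred_box, gt_box) > threshold and pred_id == gt_id
```

For example, two 10 x 10 boxes offset by half their width overlap with IoU 1/3 and would be rejected by the 0.5 threshold regardless of identity.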

[85] AllTracker: Efficient Dense Point Tracking at High Resolution

Adam W. Harley,Yang You,Xinglong Sun,Yang Zheng,Nikhil Raghuraman,Yunqi Gu,Sheldon Liang,Wen-Hsuan Chu,Achal Dave,Pavel Tokmakov,Suya You,Rares Ambrus,Katerina Fragkiadaki,Leonidas J. Guibas

Main category: cs.CV

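TL;DR: AllTracker is a model that estimates long-range point tracks by estimating the flow field between a query frame and every other frame of a video, providing high-resolution, dense (all-pixel) correspondences.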

Details

Motivation: Existing point tracking and optical flow methods are limited and cannot deliver high resolution, dense correspondence, and long-range tracking at the same time. Method: A new architecture combines optical flow and point tracking techniques, performing iterative inference on low-resolution grids and propagating information through 2D convolution layers and pixel-aligned attention layers. Result: The model is parameter-efficient (16 million parameters) and achieves state-of-the-art point tracking accuracy at 768x1024 resolution. Conclusion: AllTracker performs strongly on long-range point tracking, and the diversity of the training datasets is crucial to its performance. Abstract: We introduce AllTracker: a model that estimates long-range point tracks by way of estimating the flow field between a query frame and every other frame of a video. Unlike existing point tracking methods, our approach delivers high-resolution and dense (all-pixel) correspondence fields, which can be visualized as flow maps. Unlike existing optical flow methods, our approach corresponds one frame to hundreds of subsequent frames, rather than just the next frame. We develop a new architecture for this task, blending techniques from existing work in optical flow and point tracking: the model performs iterative inference on low-resolution grids of correspondence estimates, propagating information spatially via 2D convolution layers, and propagating information temporally via pixel-aligned attention layers. The model is fast and parameter-efficient (16 million parameters), and delivers state-of-the-art point tracking accuracy at high resolution (i.e., tracking 768x1024 pixels, on a 40G GPU). A benefit of our design is that we can train on a wider set of datasets, and we find that doing so is crucial for top performance. We provide an extensive ablation study on our architecture details and training recipe, making it clear which details matter most. Our code and model weights are available at https://alltracker.github.io .

[86] CASE: Contrastive Activation for Saliency Estimation

Dane Williamson,Yangfeng Ji,Matthew Dwyer

Main category: cs.CV

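TL;DR: This paper proposes a diagnostic test (class sensitivity) for assessing the reliability of saliency methods, finds that many methods are insensitive to the class label, and introduces CASE, a new contrastive explanation method.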

Details

Motivation: Saliency methods are widely used to visualize model predictions, but their visual plausibility can hide critical limitations. The paper asks whether these methods can distinguish between different class labels for the same input. Method: A diagnostic test for class sensitivity is proposed, and experiments show that many saliency methods are insensitive to the class label. The authors then introduce CASE, which uses contrastive explanation to highlight features unique to the predicted class. Result: Many saliency methods prove class-insensitive, while CASE performs better on both the diagnostic test and a perturbation-based fidelity test. Conclusion: The reliability of saliency methods is questionable, and CASE provides more faithful and class-specific explanations. Abstract: Saliency methods are widely used to visualize which input features are deemed relevant to a model's prediction. However, their visual plausibility can obscure critical limitations. In this work, we propose a diagnostic test for class sensitivity: a method's ability to distinguish between competing class labels on the same input. Through extensive experiments, we show that many widely used saliency methods produce nearly identical explanations regardless of the class label, calling into question their reliability. We find that class-insensitive behavior persists across architectures and datasets, suggesting the failure mode is structural rather than model-specific. Motivated by these findings, we introduce CASE, a contrastive explanation method that isolates features uniquely discriminative for the predicted class. We evaluate CASE using the proposed diagnostic and a perturbation-based fidelity test, and show that it produces faithful and more class-specific explanations than existing methods.
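One simple way to picture a class-sensitivity check is to compare the saliency maps a method produces for two different labels on the same input. The sketch below uses one minus cosine similarity as the score; this statistic, and the `saliency_fn(x, class_id)` callable returning a flattened map, are assumptions for illustration, as the abstract does not specify how the paper's diagnostic is computed.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def class_sensitivity(saliency_fn, x, class_a, class_b):
    """Assumed score: 1 - cosine similarity between the flattened saliency
    maps for two competing labels on the same input. Values near 0 mean
    the explanation barely changes with the label."""
    map_a = saliency_fn(x, class_a)
    map_b = saliency_fn(x, class_b)
    return 1.0 - cosine_similarity(map_a, map_b)
```

A method that returns the same map for every label scores close to 0 under this measure, which is the failure mode the abstract describes.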

Yijie Deng,Shuaihang Yuan,Geeta Chandra Raju Bethala,Anthony Tzes,Yu-Shen Liu,Yi Fang

Main category: cs.CV

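TL;DR: Proposes a new IIN framework based on a hierarchical scoring paradigm that reduces redundancy through optimized viewpoint selection and improves target matching efficiency.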

Details

Motivation: Existing methods rely on randomly sampled viewpoints or trajectories, which causes redundancy and high computational overhead and lacks principled view selection. Method: Cross-level semantic scoring (CLIP-derived relevancy fields) is combined with local geometric scoring to estimate optimal viewpoints. Result: The method achieves state-of-the-art performance on simulated IIN benchmarks and is applicable in the real world. Conclusion: The hierarchical scoring paradigm significantly improves the efficiency and accuracy of IIN. Abstract: Instance Image-Goal Navigation (IIN) requires autonomous agents to identify and navigate to a target object or location depicted in a reference image captured from any viewpoint. While recent methods leverage powerful novel view synthesis (NVS) techniques, such as three-dimensional Gaussian splatting (3DGS), they typically rely on randomly sampling multiple viewpoints or trajectories to ensure comprehensive coverage of discriminative visual cues. This approach, however, creates significant redundancy through overlapping image samples and lacks principled view selection, substantially increasing both rendering and comparison overhead. In this paper, we introduce a novel IIN framework with a hierarchical scoring paradigm that estimates optimal viewpoints for target matching. Our approach integrates cross-level semantic scoring, utilizing CLIP-derived relevancy fields to identify regions with high semantic similarity to the target object class, with fine-grained local geometric scoring that performs precise pose estimation within promising regions. Extensive evaluations demonstrate that our method achieves state-of-the-art performance on simulated IIN benchmarks and real-world applicability.

[88] CBAM-STN-TPS-YOLO: Enhancing Agricultural Object Detection through Spatially Adaptive Attention Mechanisms

Satvik Praveen,Yoonsung Jung

Main category: cs.CV

TL;DR: CBAM-STN-TPS-YOLO combines TPS with STN and uses CBAM to strengthen feature alignment and attention, improving the accuracy and efficiency of plant detection.

Details

Motivation: Conventional YOLO models lose accuracy in plant detection because of occlusions, irregular structures, and background noise, and the affine transformations in STNs cannot handle non-rigid deformations. Method: The proposed CBAM-STN-TPS-YOLO integrates TPS for non-rigid spatial transformations and uses CBAM to suppress noise and emphasize relevant features. Result: On the PGP dataset the model outperforms STN-YOLO in precision, recall, and mAP, with a 12% reduction in false positives. Conclusion: The model is lightweight and suited to real-time edge deployment, providing efficient and accurate monitoring for smart farming. Abstract: Object detection is vital in precision agriculture for plant monitoring, disease detection, and yield estimation. However, models like YOLO struggle with occlusions, irregular structures, and background noise, reducing detection accuracy. While Spatial Transformer Networks (STNs) improve spatial invariance through learned transformations, affine mappings are insufficient for non-rigid deformations such as bent leaves and overlaps. We propose CBAM-STN-TPS-YOLO, a model integrating Thin-Plate Splines (TPS) into STNs for flexible, non-rigid spatial transformations that better align features. Performance is further enhanced by the Convolutional Block Attention Module (CBAM), which suppresses background noise and emphasizes relevant spatial and channel-wise features. On the occlusion-heavy Plant Growth and Phenotyping (PGP) dataset, our model outperforms STN-YOLO in precision, recall, and mAP. It achieves a 12% reduction in false positives, highlighting the benefits of improved spatial flexibility and attention-guided refinement. We also examine the impact of the TPS regularization parameter in balancing transformation smoothness and detection performance. This lightweight model improves spatial awareness and supports real-time edge deployment, making it ideal for smart farming applications requiring accurate and efficient monitoring.

[89] Multiple Object Stitching for Unsupervised Representation Learning

Chengchao Shen,Dawei Liu,Jianxin Wang

Main category: cs.CV

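TL;DR: Proposes a method called MOS that generates multi-object images by stitching single-object images, improving unsupervised representation performance.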

Details

Motivation: Existing contrastive learning methods perform poorly on multi-object images and need improvement. Method: Multi-object images are generated by stitching single-object images, and the predetermined object correspondences are used to refine the representation. Result: The method achieves leading performance on the ImageNet, CIFAR, and COCO datasets. Conclusion: MOS significantly improves unsupervised representation performance on multi-object images. Abstract: Contrastive learning for single object centric images has achieved remarkable progress on unsupervised representation, but suffers from inferior performance on the widespread images with multiple objects. In this paper, we propose a simple but effective method, Multiple Object Stitching (MOS), to refine the unsupervised representation for multi-object images. Specifically, we construct the multi-object images by stitching the single object centric ones, where the objects in the synthesized multi-object images are predetermined. Hence, compared to the existing contrastive methods, our method provides additional object correspondences between multi-object images without human annotations. In this manner, our method pays more attention to the representations of each object in a multi-object image, thus providing more detailed representations for complicated downstream tasks, such as object detection and semantic segmentation. Experimental results on ImageNet, CIFAR and COCO datasets demonstrate that our proposed method achieves the leading unsupervised representation performance on both single object centric images and multi-object ones. The source code is available at https://github.com/visresearch/MultipleObjectStitching.
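The stitching step, and why it yields object correspondences for free, can be sketched with plain nested lists. The 2x2 grid layout and the helper names `stitch_2x2` and `object_regions` are assumptions for illustration; the abstract does not state the layout, so the linked repository is the authority on the actual construction.

```python
def stitch_2x2(images):
    """Stitch four equally sized images (each a list of pixel rows) into a
    single 2x2 mosaic: images[0] top-left, [1] top-right, [2] bottom-left,
    [3] bottom-right."""
    top_left, top_right, bottom_left, bottom_right = images
    top = [ra + rb for ra, rb in zip(top_left, top_right)]
    bottom = [rc + rd for rc, rd in zip(bottom_left, bottom_right)]
    return top + bottom


def object_regions(height, width):
    """Known (row0, col0, row1, col1) region of each source image in the
    mosaic. Because the layout is fixed, these come without annotation."""
    return [
        (0, 0, height, width),
        (0, width, height, 2 * width),
        (height, 0, 2 * height, width),
        (height, width, 2 * height, 2 * width),
    ]
```

Because every source image lands in a known cell, two mosaics that share a source image share an object at a known location, which is the kind of annotation-free correspondence the abstract refers to.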

[90] C3S3: Complementary Competition and Contrastive Selection for Semi-Supervised Medical Image Segmentation

Jiaying He,Yitong Lin,Jiahe Chen,Honghui Xu,Jianwei Zheng

Main category: cs.CV

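TL;DR: Proposes the C3S3 model, which uses complementary competition and contrastive selection to improve boundary detail capture in semi-supervised medical image segmentation, significantly raising segmentation accuracy.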

Details

Motivation: Annotated samples are scarce in the medical field, and existing methods struggle to capture boundary details precisely, which leads to diagnostic errors. Method: An outcome-driven contrastive learning module is combined with a dynamic complementary competition module to improve boundary localization and pseudo-label generation. Result: The model outperforms existing methods on two public datasets, with improvements of at least 6% on the 95HD and ASD metrics. Conclusion: C3S3 significantly improves boundary accuracy in medical image segmentation and has practical value. Abstract: For the immanent challenge of insufficiently annotated samples in the medical field, semi-supervised medical image segmentation (SSMIS) offers a promising solution. Despite achieving impressive results in delineating primary target areas, most current methodologies struggle to precisely capture the subtle details of boundaries. This deficiency often leads to significant diagnostic inaccuracies. To tackle this issue, we introduce C3S3, a novel semi-supervised segmentation model that synergistically integrates complementary competition and contrastive selection. This design significantly sharpens boundary delineation and enhances overall precision. Specifically, we develop an $\textit{Outcome-Driven Contrastive Learning}$ module dedicated to refining boundary localization. Additionally, we incorporate a $\textit{Dynamic Complementary Competition}$ module that leverages two high-performing sub-networks to generate pseudo-labels, thereby further improving segmentation quality. The proposed C3S3 undergoes rigorous validation on two publicly accessible datasets, encompassing the practices of both MRI and CT scans. The results demonstrate that our method achieves superior performance compared to previous cutting-edge competitors. In particular, on the 95HD and ASD metrics, our approach achieves a notable improvement of at least $6\%$, highlighting the significant advancements. The code is available at https://github.com/Y-TARL/C3S3.

[91] Generative Models at the Frontier of Compression: A Survey on Generative Face Video Coding

Bolin Chen,Shanzhi Yin,Goluck Konuko,Giuseppe Valenzise,Zihan Zhang,Shiqi Wang,Yan Ye

Main category: cs.CV

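TL;DR: This paper surveys generative face video coding (GFVC), analyzes its advantages for high-fidelity, low-bitrate video communication, and outlines future directions for standardization and practical deployment.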

Details

Motivation: Explore the potential of GFVC for video compression and bridge the gap between theory and industrial standardization. Method: Surveys existing GFVC methods, builds a large-scale database with subjective quality assessment, and proposes a standardized syntax and a low-complexity system. Result: GFVC outperforms the VVC standard at ultra-low bitrates, and directions for future applications and standardization are proposed. Conclusion: GFVC has great potential, but challenges remain before practical deployment and standardization. Abstract: The rise of deep generative models has greatly advanced video compression, reshaping the paradigm of face video coding through their powerful capability for semantic-aware representation and lifelike synthesis. Generative Face Video Coding (GFVC) stands at the forefront of this revolution: it characterizes complex facial dynamics as compact latent codes for bitstream compactness at the encoder side and leverages powerful deep generative models to reconstruct a high-fidelity face signal from the compressed latent codes at the decoder side. As such, this well-designed GFVC paradigm could enable high-fidelity face video communication at ultra-low bitrate ranges, far surpassing the capabilities of the latest Versatile Video Coding (VVC) standard. To pioneer foundational research and accelerate the evolution of GFVC, this paper presents the first comprehensive survey of GFVC technologies, systematically bridging critical gaps between theoretical innovation and industrial standardization. In particular, we first review a broad range of existing GFVC methods with different feature representations and optimization strategies, and conduct a thorough benchmarking analysis. In addition, we construct a large-scale GFVC-compressed face video database with subjective Mean Opinion Scores (MOSs) based on human perception, aiming to identify the most appropriate quality metrics tailored to GFVC. Moreover, we summarize the GFVC standardization potentials with a unified high-level syntax and develop a low-complexity GFVC system, both of which are expected to push forward future practical deployments and applications. Finally, we envision the potential of GFVC in industrial applications and deliberate on the current challenges and future opportunities.

[92] ARGUS: Hallucination and Omission Evaluation in Video-LLMs

Ruchit Rawal,Reza Shirkavand,Heng Huang,Gowthami Somepalli,Tom Goldstein

Main category: cs.CV

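TL;DR: ARGUS is a new VideoLLM benchmark focused on freeform video captioning that evaluates performance by quantifying the hallucination rate and the rate of omitted details.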

Details

Motivation: Existing VideoLLMs hallucinate severely on freeform text generation tasks such as video captioning, and conventional multiple-choice tests cannot adequately measure this problem. Method: Proposes the ARGUS benchmark, which compares VideoLLM outputs against human-annotated ground-truth captions to quantify two metrics: hallucination (incorrect content) and omission (missing important details). Result: ARGUS provides a comprehensive evaluation of video captioning performance and exposes the main weaknesses of VideoLLMs on freeform tasks. Conclusion: ARGUS offers a more accurate tool for evaluating the freeform text generation ability of VideoLLMs and can help improve model performance. Abstract: Video large language models have not yet been widely deployed, largely due to their tendency to hallucinate. Typical benchmarks for Video-LLMs rely simply on multiple-choice questions. Unfortunately, VideoLLMs hallucinate far more aggressively on freeform text generation tasks like video captioning than they do on multiple choice verification tasks. To address this weakness, we propose ARGUS, a VideoLLM benchmark that measures freeform video captioning performance. By comparing VideoLLM outputs to human ground truth captions, ARGUS quantifies dual metrics. First, we measure the rate of hallucinations in the form of incorrect statements about video content or temporal relationships. Second, we measure the rate at which the model omits important descriptive details. Together, these dual metrics form a comprehensive view of video captioning performance.
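
The two rates described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration: captions are taken to be pre-split into short statements, and `exact_match_judge` is a toy stand-in for whatever entailment check the benchmark actually uses, which the abstract does not specify.

```python
from typing import Callable, List, Tuple


def normalize(statement: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(statement.lower().split())


def exact_match_judge(claim: str, evidence: List[str]) -> bool:
    """Toy judge (hypothetical): a claim is supported only if it appears
    verbatim, after normalization, among the evidence statements."""
    return normalize(claim) in {normalize(e) for e in evidence}


def caption_rates(
    generated: List[str],
    reference: List[str],
    judge: Callable[[str, List[str]], bool] = exact_match_judge,
) -> Tuple[float, float]:
    """Return (hallucination_rate, omission_rate).

    hallucination_rate: share of generated statements the reference does not support.
    omission_rate: share of reference statements the generated caption does not cover.
    """
    unsupported = sum(1 for g in generated if not judge(g, reference))
    missed = sum(1 for r in reference if not judge(r, generated))
    hallucination_rate = unsupported / len(generated) if generated else 0.0
    omission_rate = missed / len(reference) if reference else 0.0
    return hallucination_rate, omission_rate
```

In practice the judge would need to handle paraphrase and temporal order, which exact matching cannot; the sketch only shows how the two rates are counted in opposite directions over the same pair of captions.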

[93] DINO-CoDT: Multi-class Collaborative Detection and Tracking with Vision Foundation Models

Xunjie He,Christina Dao Wen Lee,Meiling Wang,Chengran Yuan,Zefan Huang,Yufeng Yue,Marcelo H. Ang Jr

Main category: cs.CV

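TL;DR: Proposes a multi-class collaborative detection and tracking framework that significantly improves detection and tracking accuracy through a global spatial attention fusion module and a visual-semantic re-identification module.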

Details

Motivation: Existing collaborative perception research focuses mainly on the vehicle class and lacks effective solutions for multi-class objects, which limits real-world use. Method: Designs a global spatial attention fusion (GSAF) module to strengthen multi-scale feature learning, introduces a visual-semantic re-identification (REID) module to reduce ID switch errors, and develops a velocity-based adaptive tracklet management (VATM) module. Result: On the V2X-Real and OPV2V datasets, the method significantly outperforms existing state-of-the-art methods. Conclusion: The proposed framework effectively addresses multi-class collaborative detection and tracking and improves applicability to real-world scenarios. Abstract: Collaborative perception plays a crucial role in enhancing environmental understanding by expanding the perceptual range and improving robustness against sensor failures, which primarily involves collaborative 3D detection and tracking tasks. The former focuses on object recognition in individual frames, while the latter captures continuous instance tracklets over time. However, existing works in both areas predominantly focus on the vehicle superclass, lacking effective solutions for both multi-class collaborative detection and tracking. This limitation hinders their applicability in real-world scenarios, which involve diverse object classes with varying appearances and motion patterns. To overcome these limitations, we propose a multi-class collaborative detection and tracking framework tailored for diverse road users. We first present a detector with a global spatial attention fusion (GSAF) module, enhancing multi-scale feature learning for objects of varying sizes. Next, we introduce a tracklet RE-IDentification (REID) module that leverages visual semantics with a vision foundation model to effectively reduce ID SWitch (IDSW) errors, in cases of erroneous mismatches involving small objects like pedestrians. We further design a velocity-based adaptive tracklet management (VATM) module that adjusts the tracking interval dynamically based on object motion. Extensive experiments on the V2X-Real and OPV2V datasets show that our approach significantly outperforms existing state-of-the-art methods in both detection and tracking accuracy.

[94] Adapter Naturally Serves as Decoupler for Cross-Domain Few-Shot Semantic Segmentation

Jintao Tong,Ran Ma,Yixiong Zou,Guangyao Chen,Yuhua Li,Ruixuan Li

Main category: cs.CV

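TL;DR: The paper proposes a method for cross-domain few-shot segmentation (CD-FSS) that addresses the domain gap and data scarcity through pre-training and fine-tuning, and designs the Domain Feature Navigator (DFN) and the SAM-SVN method to improve performance.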

Details

Motivation: Address the domain gap and data scarcity in cross-domain few-shot segmentation. Method: Proposes the Domain Feature Navigator (DFN) as a structure-based decoupler, combined with the SAM-SVN method to prevent overfitting. Result: Surpasses existing methods by 2.69% and 4.68% MIoU in the 1-shot and 5-shot settings, respectively. Conclusion: DFN and SAM-SVN effectively improve cross-domain few-shot segmentation performance. Abstract: Cross-domain few-shot segmentation (CD-FSS) is proposed to pre-train the model on a source-domain dataset with sufficient samples, and then transfer the model to target-domain datasets where only a few samples are available for efficient fine-tuning. There are two major challenges in this task: (1) the domain gap and (2) fine-tuning with scarce data. To solve these challenges, we revisit the adapter-based methods, and discover an intriguing insight not explored in previous works: the adapter not only helps the fine-tuning of downstream tasks but also naturally serves as a domain information decoupler. Then, we delve into this finding for an interpretation, and find the model's inherent structure could lead to a natural decoupling of domain information. Building upon this insight, we propose the Domain Feature Navigator (DFN), a structure-based decoupler rather than a loss-based one as in current works, to capture domain-specific information, thereby directing the model's attention towards domain-agnostic knowledge. Moreover, to prevent the potential excessive overfitting of DFN during the source-domain training, we further design the SAM-SVN method to constrain DFN from learning sample-specific knowledge. On target domains, we freeze the model and fine-tune the DFN to learn target-specific knowledge. Extensive experiments demonstrate that our method surpasses the state-of-the-art method in CD-FSS significantly by 2.69% and 4.68% MIoU in 1-shot and 5-shot scenarios, respectively.

[95] MrM: Black-Box Membership Inference Attacks against Multimodal RAG Systems

Peiru Yang,Jinhua Yin,Haoran Zheng,Xueying Bai,Huili Wang,Yufei Sun,Xintian Li,Shangguang Wang,Yongfeng Huang,Tao Qi

Main category: cs.CV

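TL;DR: The paper proposes MrM, a black-box membership inference attack framework targeting multimodal RAG systems, which uses multi-object data perturbation and counterfactual attacks to leak sensitive information effectively.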

Details

Motivation: Multimodal RAG systems enhance vision-language models but may leak sensitive information; existing attacks focus mainly on the textual modality, and the visual modality is underexplored. Method: Proposes the MrM framework, which combines object-aware data perturbation with a counterfact-informed mask selection strategy and extracts membership information through statistical inference. Result: Evaluated on two visual datasets and eight mainstream models, MrM performs strongly in both sample-level and set-level evaluations and is robust to adaptive defenses. Conclusion: MrM fills the gap in membership inference attacks on the visual modality and offers a new perspective on privacy protection for multimodal RAG systems. Abstract: Multimodal retrieval-augmented generation (RAG) systems enhance large vision-language models by integrating cross-modal knowledge, enabling their increasing adoption across real-world multimodal tasks. These knowledge databases may contain sensitive information that requires privacy protection. However, multimodal RAG systems inherently grant external users indirect access to such data, making them potentially vulnerable to privacy attacks, particularly membership inference attacks (MIAs). Existing MIA methods targeting RAG systems predominantly focus on the textual modality, while the visual modality remains relatively underexplored. To bridge this gap, we propose MrM, the first black-box MIA framework targeted at multimodal RAG systems. It utilizes a multi-object data perturbation framework constrained by counterfactual attacks, which can concurrently induce the RAG systems to retrieve the target data and generate information that leaks the membership information. Our method first employs an object-aware data perturbation method to constrain the perturbation to key semantics and ensure successful retrieval. Building on this, we design a counterfact-informed mask selection strategy to prioritize the most informative masked regions, aiming to eliminate the interference of model self-knowledge and amplify attack efficacy. Finally, we perform statistical membership inference by modeling query trials to extract features that reflect the reconstruction of masked semantics from response patterns. Experiments on two visual datasets and eight mainstream commercial visual-language models (e.g., GPT-4o, Gemini-2) demonstrate that MrM achieves consistently strong performance across both sample-level and set-level evaluations, and remains robust under adaptive defenses.

[96] Compressed Feature Quality Assessment: Dataset and Baselines

Changsheng Gao,Wei Zhou,Guosheng Lin,Weisi Lin

Main category: cs.CV

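TL;DR: The paper introduces the Compressed Feature Quality Assessment (CFQA) problem, builds a benchmark dataset of 300 original features and 12000 compressed features, evaluates three commonly used metrics, and stresses the need for finer-grained metrics to quantify semantic distortion.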

Details

Motivation: In resource-constrained environments, the semantic degradation caused by feature coding is hard to quantify, so methods for assessing the quality of compressed features are needed. Method: Introduces the CFQA problem, builds a benchmark dataset, and evaluates three metrics: MSE, cosine similarity, and Centered Kernel Alignment. Result: The dataset is representative, but existing metrics cannot fully capture semantic distortion, so finer-grained metrics are needed. Conclusion: The paper provides a foundational resource for CFQA research and advances the field. Abstract: The widespread deployment of large models in resource-constrained environments has underscored the need for efficient transmission of intermediate feature representations. In this context, feature coding, which compresses features into compact bitstreams, becomes a critical component for scenarios involving feature transmission, storage, and reuse. However, this compression process introduces inherent semantic degradation that is notoriously difficult to quantify with traditional metrics. To address this, this paper introduces the research problem of Compressed Feature Quality Assessment (CFQA), which seeks to evaluate the semantic fidelity of compressed features. To advance CFQA research, we propose the first benchmark dataset, comprising 300 original features and 12000 compressed features derived from three vision tasks and four feature codecs. Task-specific performance drops are provided as true semantic distortion for the evaluation of CFQA metrics. We assess the performance of three widely used metrics (MSE, cosine similarity, and Centered Kernel Alignment) in capturing semantic degradation. The results underscore the representativeness of the dataset and highlight the need for more refined metrics capable of addressing the nuances of semantic distortion in compressed features. To facilitate the ongoing development of CFQA research, we release the dataset and all accompanying source code at https://github.com/chansongoal/Compressed-Feature-Quality-Assessment. This contribution aims to advance the field and provide a foundational resource for the community to explore CFQA.
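
For readers unfamiliar with the three baseline metrics named in the abstract, here is a minimal pure-Python sketch. It assumes each feature is a small matrix (rows as tokens or samples, columns as channels); the function names and the linear-kernel form of CKA are choices made for this illustration and are not taken from the released code.

```python
import math
from typing import List

Matrix = List[List[float]]


def flatten(m: Matrix) -> List[float]:
    return [v for row in m for v in row]


def mse(x: Matrix, y: Matrix) -> float:
    """Mean squared error over all entries."""
    a, b = flatten(x), flatten(y)
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def cosine_similarity(x: Matrix, y: Matrix) -> float:
    """Cosine of the angle between the flattened features."""
    a, b = flatten(x), flatten(y)
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))


def center_columns(m: Matrix) -> Matrix:
    """Subtract each column's mean, as linear CKA requires."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def gram(m: Matrix) -> Matrix:
    """Row-by-row inner products (linear kernel)."""
    return [[sum(p * q for p, q in zip(r1, r2)) for r2 in m] for r1 in m]


def linear_cka(x: Matrix, y: Matrix) -> float:
    """Linear CKA: normalized Frobenius inner product of centered Gram matrices."""
    k = flatten(gram(center_columns(x)))
    l = flatten(gram(center_columns(y)))
    inner = sum(p * q for p, q in zip(k, l))
    return inner / (math.sqrt(sum(p * p for p in k)) * math.sqrt(sum(q * q for q in l)))
```

One property worth noting: uniformly rescaling a feature changes MSE but leaves cosine similarity and linear CKA at 1, so the three metrics can disagree about the same distortion.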

[97] DPFormer: Dynamic Prompt Transformer for Continual Learning

Sheng-Kai Huang,Jiun-Feng Chang,Chun-Rong Huang

Main category: cs.CV

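TL;DR: Proposes the dynamic prompt transformer (DPFormer) to address catastrophic forgetting and inter-task confusion in continual learning, achieving strong performance through prompt schemes and a unified classification module.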

Details

Motivation: Address catastrophic forgetting and inter-task confusion in continual learning. Method: Proposes the dynamic prompt transformer (DPFormer) with prompt schemes, trained end to end with a unified classification module (binary cross-entropy loss, knowledge distillation loss, and auxiliary loss). Result: Outperforms existing methods on the CIFAR-100, ImageNet100, and ImageNet1K datasets. Conclusion: Through prompt schemes and a unified classification module, DPFormer effectively addresses key problems in continual learning and achieves superior performance. Abstract: In continual learning, solving the catastrophic forgetting problem may make the models fall into the stability-plasticity dilemma. Moreover, inter-task confusion will also occur due to the lack of knowledge exchanges between different tasks. In order to solve the aforementioned problems, we propose a novel dynamic prompt transformer (DPFormer) with prompt schemes. The prompt schemes help the DPFormer memorize learned knowledge of previous classes and tasks, and keep on learning new knowledge from new classes and tasks under a single network structure with a nearly fixed number of model parameters. Moreover, they also provide discrepant information to represent different tasks to solve the inter-task confusion problem. Based on prompt schemes, a unified classification module with the binary cross entropy loss, the knowledge distillation loss and the auxiliary loss is proposed to train the whole model in an end-to-end manner. Compared with state-of-the-art methods, our method achieves the best performance on the CIFAR-100, ImageNet100 and ImageNet1K datasets under different class-incremental settings in continual learning. The source code will be available at our GitHub after acceptance.

[98] FAMSeg: Fetal Femur and Cranial Ultrasound Segmentation Using Feature-Aware Attention and Mamba Enhancement

Jie He,Minglang Chen,Minying Lu,Bocheng Liang,Junming Wei,Guiyan Peng,Jiaxi Chen,Ying Tan

Main category: cs.CV

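TL;DR: Proposes a fetal femur and cranial ultrasound image segmentation model based on feature perception and Mamba enhancement, addressing the challenges of high noise and highly similar objects.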

Details

Motivation: Ultrasound image segmentation is essential for precise biometry and assessment, but existing models adapt poorly to objects with high noise and high similarity, and show a pronounced jagged effect when segmenting small objects. Method: Designs a longitudinal and transverse independent viewpoint scanning convolution block, a feature perception module, and a Mamba-optimized residual structure, trained with a combination of different optimizers. Result: In experiments, the FAMSeg network achieved the fastest loss reduction and the best segmentation performance. Conclusion: The model effectively suppresses noise interference, enhances local multi-dimensional scanning, and improves segmentation accuracy. Abstract: Accurate ultrasound image segmentation is a prerequisite for precise biometrics and accurate assessment. Relying on manual delineation introduces significant errors and is time-consuming. However, existing segmentation models are designed based on objects in natural scenes, making them difficult to adapt to ultrasound objects with high noise and high similarity. This is particularly evident in small object segmentation, where a pronounced jagged effect occurs. Therefore, this paper proposes a fetal femur and cranial ultrasound image segmentation model based on feature perception and Mamba enhancement to address these challenges. Specifically, a longitudinal and transverse independent viewpoint scanning convolution block and a feature perception module were designed to enhance the ability to capture local detail information and improve the fusion of contextual information. Combined with the Mamba-optimized residual structure, this design suppresses the interference of raw noise and enhances local multi-dimensional scanning. The system builds global information and local feature dependencies, and is trained with a combination of different optimizers to achieve the optimal solution. After extensive experimental validation, the FAMSeg network achieved the fastest loss reduction and the best segmentation performance across images of varying sizes and orientations.

[99] Prompt to Protection: A Comparative Study of Multimodal LLMs in Construction Hazard Recognition

Nishi Chaudhary,S M Jamil Uddin,Sathvik Sharath Chandra,Anto Ovid,Alex Albert

Main category: cs.CV

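TL;DR: The study compares five multimodal large language models (LLMs) on visual hazard recognition at construction sites and finds that prompting strategy, especially chain-of-thought, significantly affects performance, with GPT-4.5 and GPT-o3 performing best.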

Details

Motivation: Explore how multimodal LLMs perform on safety-critical visual tasks in construction and fill a gap in existing research. Method: Compares five LLMs (Claude-3 Opus, GPT-4.5, GPT-4o, GPT-o3, Gemini 2.0 Pro) under three prompting strategies (zero-shot, few-shot, and chain-of-thought), with quantitative analysis using precision, recall, and F1-score. Result: Chain-of-thought prompting works best, and GPT-4.5 and GPT-o3 perform best in most settings. Conclusion: Prompt design is critical to the accuracy and consistency of multimodal LLMs in construction safety applications, and the study offers practical insights for building more reliable AI-assisted safety systems. Abstract: The recent emergence of multimodal large language models (LLMs) has introduced new opportunities for improving visual hazard recognition on construction sites. Unlike traditional computer vision models that rely on domain-specific training and extensive datasets, modern LLMs can interpret and describe complex visual scenes using simple natural language prompts. However, despite growing interest in their applications, there has been limited investigation into how different LLMs perform in safety-critical visual tasks within the construction domain. To address this gap, this study conducts a comparative evaluation of five state-of-the-art LLMs: Claude-3 Opus, GPT-4.5, GPT-4o, GPT-o3, and Gemini 2.0 Pro, to assess their ability to identify potential hazards from real-world construction images. Each model was tested under three prompting strategies: zero-shot, few-shot, and chain-of-thought (CoT). Zero-shot prompting involved minimal instruction, few-shot incorporated basic safety context and a hazard source mnemonic, and CoT provided step-by-step reasoning examples to scaffold model thinking. Quantitative analysis was performed using precision, recall, and F1-score metrics across all conditions. Results reveal that prompting strategy significantly influenced performance, with CoT prompting consistently producing higher accuracy across models. Additionally, LLM performance varied under different conditions, with GPT-4.5 and GPT-o3 outperforming others in most settings. The findings also demonstrate the critical role of prompt design in enhancing the accuracy and consistency of multimodal LLMs for construction safety applications. This study offers actionable insights into the integration of prompt engineering and LLMs for practical hazard recognition, contributing to the development of more reliable AI-assisted safety systems.
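
As a reminder of how precision, recall, and F1 apply to hazard recognition, the sketch below scores one image by comparing the set of hazards a model names against the annotated set. The set-based matching and the hazard labels are assumptions for illustration; the abstract does not describe how model answers were matched to ground truth.

```python
from typing import Set, Tuple


def precision_recall_f1(predicted: Set[str], actual: Set[str]) -> Tuple[float, float, float]:
    """Score one image: predicted hazards versus annotated hazards."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For example, a model that names three hazards, two of them among four annotated ones, has precision 2/3, recall 1/2, and F1 of 4/7.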

[100] PhysiInter: Integrating Physical Mapping for High-Fidelity Human Interaction Generation

Wei Yao,Yunlian Sun,Chang Liu,Hongwen Zhang,Jinhui Tang

Main category: cs.CV

TL;DR: The paper proposes a physical mapping method to improve physical realism and motion quality in multi-person motion generation.

Details

Motivation: Existing motion-capture techniques and generative models often ignore physical constraints, causing interpenetration, sliding, and floating artifacts, which are especially severe in multi-person interaction scenarios. Method: Uses motion imitation in a physics-based simulation environment to map target motions into a physically valid space, and introduces Motion Consistency (MC) and Marker-based Interaction (MI) loss functions. Result: Experiments show a clear improvement in generated motion quality, with physical fidelity improved by 3%-89%. Conclusion: Physical mapping effectively handles physical constraints in motion generation and improves motion realism and quality in multi-person interaction scenarios. Abstract: Driven by advancements in motion capture and generative artificial intelligence, leveraging large-scale MoCap datasets to train generative models for synthesizing diverse, realistic human motions has become a promising research direction. However, existing motion-capture techniques and generative models often neglect physical constraints, leading to artifacts such as interpenetration, sliding, and floating. These issues are exacerbated in multi-person motion generation, where complex interactions are involved. To address these limitations, we introduce physical mapping, integrated throughout the human interaction generation pipeline. Specifically, motion imitation within a physics-based simulation environment is used to project target motions into a physically valid space. The resulting motions are adjusted to adhere to real-world physics constraints while retaining their original semantic meaning. This mapping not only improves MoCap data quality but also directly informs post-processing of generated motions. Given the unique interactivity of multi-person scenarios, we propose a tailored motion representation framework. Motion Consistency (MC) and Marker-based Interaction (MI) loss functions are introduced to improve model performance. Experiments show our method achieves impressive results in generated human motion quality, with a 3%-89% improvement in physical fidelity. Project page: http://yw0208.github.io/physiinter

[101] GLOS: Sign Language Generation with Temporally Aligned Gloss-Level Conditioning

Taeryung Lee,Hyeongjin Nam,Gyeongsik Moon,Kyoung Mu Lee

Main category: cs.CV

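TL;DR: The paper proposes GLOS, a framework that improves sign language generation (SLG) through temporally aligned gloss-level conditions and a temporal alignment conditioning (TAC) module, addressing the incorrect lexical ordering and low semantic accuracy of existing methods.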

Details

Motivation: Existing sign language generation methods rely on sentence-level conditions and therefore cannot capture the temporal structure of sign language or word-level semantics, leading to disordered sign sequences and ambiguous motions. Method: Proposes the GLOS framework, comprising gloss-level conditions and the TAC module, which generates signs from temporally aligned word-level semantic and structural information. Result: Outperforms existing methods on the CSL-Daily and Phoenix-2014T datasets, generating signs with correct lexical order and high semantic accuracy. Conclusion: Through temporally aligned gloss-level conditioning, GLOS significantly improves sign language generation quality and overcomes the limitations of existing methods. Abstract: Sign language generation (SLG), or text-to-sign generation, bridges the gap between signers and non-signers. Despite recent progress in SLG, existing methods still often suffer from incorrect lexical ordering and low semantic accuracy. This is primarily due to sentence-level conditioning, which encodes the entire sentence of the input text into a single feature vector as a condition for SLG. This approach fails to capture the temporal structure of sign language and lacks the granularity of word-level semantics, often leading to disordered sign sequences and ambiguous motions. To overcome these limitations, we propose GLOS, a sign language generation framework with temporally aligned gloss-level conditioning. First, we employ gloss-level conditions, which we define as sequences of gloss embeddings temporally aligned with the motion sequence. This enables the model to access both the temporal structure of sign language and word-level semantics at each timestep. As a result, this allows for fine-grained control of signs and better preservation of lexical order. Second, we introduce a condition fusion module, temporal alignment conditioning (TAC), to efficiently deliver the word-level semantic and temporal structure provided by the gloss-level condition to the corresponding motion timesteps. Our method, which is composed of gloss-level conditions and TAC, generates signs with correct lexical order and high semantic accuracy, outperforming prior methods on CSL-Daily and Phoenix-2014T.

[102] DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO

Jinyoung Park,Jeehye Na,Jinyoung Kim,Hyunwoo J. Kim

Main category: cs.CV

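TL;DR: The paper examines the problems of applying GRPO to video LLMs and proposes DeepVideo-R1, trained with Reg-GRPO and a difficulty-aware data augmentation strategy, which significantly improves video reasoning performance.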

Details

Motivation: Study the application of GRPO to video LLMs and address its reliance on safeguards and the vanishing advantage problem. Method: Proposes Reg-GRPO, which reformulates the objective as a regression task, and designs a difficulty-aware data augmentation strategy. Result: DeepVideo-R1 shows significant gains on multiple video reasoning benchmarks. Conclusion: Reg-GRPO and the data augmentation strategy effectively address the learning problems of GRPO in video LLMs. Abstract: Recent works have demonstrated the effectiveness of reinforcement learning (RL)-based post-training in enhancing the reasoning capabilities of large language models (LLMs). In particular, Group Relative Policy Optimization (GRPO) has shown impressive success by employing a PPO-style reinforcement algorithm with group-based normalized rewards. However, the application of GRPO to Video Large Language Models (Video LLMs) has been less studied. In this paper, we explore GRPO for video LLMs and identify two primary issues that impede its effective learning: (1) reliance on safeguards, and (2) the vanishing advantage problem. To mitigate these challenges, we propose DeepVideo-R1, a video large language model trained with our proposed Reg-GRPO (Regressive GRPO) and difficulty-aware data augmentation strategy. Reg-GRPO reformulates the GRPO objective as a regression task, directly predicting the advantage in GRPO. This design eliminates the need for safeguards like clipping and min functions, thereby facilitating more direct policy guidance by aligning the model with the advantage values. We also design the difficulty-aware data augmentation strategy that dynamically augments training samples at solvable difficulty levels, fostering diverse and informative reward signals. Our comprehensive experiments show that DeepVideo-R1 significantly improves video reasoning performance across multiple video reasoning benchmarks.
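
The two issues named in the abstract can be made concrete with a small sketch of standard GRPO ingredients. This is not the paper's Reg-GRPO, whose regression objective is not spelled out here; it only shows group-normalized advantages and a PPO-style clipped term, with the population standard deviation and the `eps` and `clip_eps` defaults chosen as assumptions.

```python
import statistics
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO-style safeguard: take the min of the raw and the clipped objective."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

When every response in a group gets the same reward (all correct or all wrong), `group_advantages` returns all zeros and the group contributes no learning signal, which is the vanishing advantage problem. The clipping in `clipped_surrogate` caps the objective once the probability ratio leaves the trust region, which is the kind of safeguard the abstract says Reg-GRPO removes.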

[103] Ambiguity-Restrained Text-Video Representation Learning for Partially Relevant Video Retrieval

CH Cho,WJ Moon,W Jun,MS Jung,JP Heo

Main category: cs.CV

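TL;DR: The paper proposes ARL, a framework that handles ambiguous text-video pairs in partially relevant video retrieval (PRVR) and improves model performance through multi-positive contrastive learning and a dual triplet margin loss.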

Details

Motivation: Conventional PRVR training assumes a one-to-one correspondence between text queries and videos, ignoring the ambiguity between text and video content; this paper aims to incorporate that ambiguity into model learning. Method: Proposes the ARL framework, which detects ambiguous pairs using uncertainty and similarity, learns hierarchically with multi-positive contrastive learning and a dual triplet margin loss, and explores fine-grained relationships within video instances. Result: ARL performs well on PRVR and effectively handles ambiguity in text-video pairs. Conclusion: By detecting ambiguous pairs and learning hierarchically, ARL significantly improves PRVR performance while reducing error propagation. Abstract: Partially Relevant Video Retrieval (PRVR) aims to retrieve a video where a specific segment is relevant to a given text query. Typical training processes of PRVR assume a one-to-one relationship where each text query is relevant to only one video. However, we point out the inherent ambiguity between text and video content based on their conceptual scope and propose a framework that incorporates this ambiguity into the model learning process. Specifically, we propose Ambiguity-Restrained representation Learning (ARL) to address ambiguous text-video pairs. Initially, ARL detects ambiguous pairs based on two criteria: uncertainty and similarity. Uncertainty represents whether instances include commonly shared context across the dataset, while similarity indicates pair-wise semantic overlap. Then, with the detected ambiguous pairs, our ARL hierarchically learns the semantic relationship via multi-positive contrastive learning and dual triplet margin loss. Additionally, we delve into fine-grained relationships within the video instances. Unlike typical training at the text-video level, where pairwise information is provided, we address the inherent ambiguity within frames of the same untrimmed video, which often contains multiple contexts. This allows us to further enhance learning at the text-frame level. Lastly, we propose cross-model ambiguity detection to mitigate the error propagation that occurs when a single model is employed to detect ambiguous pairs for its training. With all components combined, our proposed method demonstrates its effectiveness in PRVR.
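
To make "multi-positive contrastive learning" concrete, the sketch below shows one common formulation in which several candidates may count as positives for the same query. It is not claimed to be ARL's exact loss: the abstract gives no formula, and the function name, the temperature default, and the choice to sum positives inside the logarithm are all assumptions.

```python
import math
from typing import List, Sequence, Set


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def multi_positive_contrastive_loss(
    similarities: List[float], positives: Set[int], temperature: float = 0.07
) -> float:
    """-log( sum over positives of exp(s/t) / sum over all candidates of exp(s/t) )."""
    if not positives:
        raise ValueError("at least one positive index is required")
    logits = [s / temperature for s in similarities]
    positive_logits = [logits[i] for i in sorted(positives)]
    return logsumexp(logits) - logsumexp(positive_logits)
```

With a single positive index this reduces to the usual InfoNCE loss; adding more indices to `positives` stops the loss from pushing those candidates away from the query.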

[104] CoCoA-Mix: Confusion-and-Confidence-Aware Mixture Model for Context Optimization

Dasol Hong,Wooju Lee,Hyun Myung

Main category: cs.CV

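TL;DR: The paper proposes a confusion-aware loss (CoA-loss) and confidence-aware weights (CoA-weights), which refine decision boundaries and combine predictions in a mixture model to improve both specialization and generalization of vision-language models.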

Details

Motivation: Frozen encoders often produce misaligned features, causing class confusion and limiting specialization. Method: Proposes CoA-loss to refine decision boundaries and uses CoA-weights to adjust the prediction weights in the mixture model. Result: CoCoA-Mix outperforms existing methods in both specialization and generalization. Conclusion: Through confusion-aware and confidence-aware mechanisms, CoCoA-Mix effectively balances specialization and generalization. Abstract: Prompt tuning, which adapts vision-language models by freezing model parameters and optimizing only the prompt, has proven effective for task-specific adaptations. The core challenge in prompt tuning is improving specialization for a specific task and generalization for unseen domains. However, frozen encoders often produce misaligned features, leading to confusion between classes and limiting specialization. To overcome this issue, we propose a confusion-aware loss (CoA-loss) that improves specialization by refining the decision boundaries between confusing classes. Additionally, we mathematically demonstrate that a mixture model can enhance generalization without compromising specialization. This is achieved using confidence-aware weights (CoA-weights), which adjust the weights of each prediction in the mixture model based on its confidence within the class domains. Extensive experiments show that CoCoA-Mix, a mixture model with CoA-loss and CoA-weights, outperforms state-of-the-art methods by enhancing specialization and generalization. Our code is publicly available at https://github.com/url-kaist/CoCoA-Mix.

[105] Drive Any Mesh: 4D Latent Diffusion for Mesh Deformation from Video

Yahao Shi,Yang Liu,Yanmin Wu,Xing Liu,Chen Zhao,Jie Luo,Bin Zhou

Main category: cs.CV

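TL;DR: DriveAnyMesh is a method for driving a mesh from monocular video that addresses the efficiency and compatibility problems of current 4D generation techniques in rendering engines.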

Details

Motivation: Existing 4D generation techniques suffer from low rendering efficiency, and skeletal methods require substantial manual work and lack cross-category generalization. Method: Uses a 4D diffusion model to denoise latent sequences, with a transformer-based variational autoencoder capturing 3D shape and motion information, to generate mesh animations. Result: Experiments show that DriveAnyMesh quickly generates high-quality animations and is compatible with modern rendering engines. Conclusion: The method has potential applications in the gaming and film industries. Abstract: We propose DriveAnyMesh, a method for driving a mesh guided by monocular video. Current 4D generation techniques encounter challenges with modern rendering engines. Implicit methods have low rendering efficiency and are unfriendly to rasterization-based engines, while skeletal methods demand significant manual effort and lack cross-category generalization. Animating existing 3D assets, instead of creating 4D assets from scratch, demands a deep understanding of the input's 3D structure. To tackle these challenges, we present a 4D diffusion model that denoises sequences of latent sets, which are then decoded to produce mesh animations from point cloud trajectory sequences. These latent sets leverage a transformer-based variational autoencoder, simultaneously capturing 3D shape and motion information. By employing a spatiotemporal, transformer-based diffusion model, information is exchanged across multiple latent frames, enhancing the efficiency and generalization of the generated results. Our experimental results demonstrate that DriveAnyMesh can rapidly produce high-quality animations for complex motions and is compatible with modern rendering engines. This method holds potential for applications in both the gaming and filming industries.

[106] SpatialLM: Training Large Language Models for Structured Indoor Modeling

Yongsen Mao,Junhao Zhong,Chuan Fang,Jia Zheng,Rui Tang,Hao Zhu,Ping Tan,Zihan Zhou

Main category: cs.CV

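TL;DR: SpatialLM is a large language model that processes 3D point cloud data and produces structured 3D scene understanding outputs such as walls, doors, windows, and other architectural elements. It follows the standard multimodal LLM architecture and performs strongly on public benchmarks.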

Details

Motivation: Improve the spatial understanding of modern LLMs to support applications such as augmented reality and robotics. Method: Fine-tunes an open-source LLM on a large-scale synthetic dataset (12,328 indoor scenes). Result: Achieves state-of-the-art layout estimation and competitive 3D object detection results. Conclusion: SpatialLM shows a feasible path for enhancing the spatial understanding of LLMs. Abstract: SpatialLM is a large language model designed to process 3D point cloud data and generate structured 3D scene understanding outputs. These outputs include architectural elements like walls, doors, windows, and oriented object boxes with their semantic categories. Unlike previous methods which exploit task-specific network designs, our model adheres to the standard multimodal LLM architecture and is fine-tuned directly from open-source LLMs. To train SpatialLM, we collect a large-scale, high-quality synthetic dataset consisting of the point clouds of 12,328 indoor scenes (54,778 rooms) with ground-truth 3D annotations, and conduct a careful study on various modeling and training decisions. On public benchmarks, our model gives state-of-the-art performance in layout estimation and competitive results in 3D object detection. With that, we show a feasible path for enhancing the spatial understanding capabilities of modern LLMs for applications in augmented reality, embodied robotics, and more.

Xiangyu Guo,Zhanqian Wu,Kaixin Xiong,Ziyang Xu,Lijun Zhou,Gangwei Xu,Shaoqing Xu,Haiyang Sun,Bing Wang,Guang Chen,Hangjun Ye,Wenyu Liu,Xinggang Wang

Main category: cs.CV

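TL;DR: Genesis is a unified framework for jointly generating multi-view driving videos and LiDAR sequences with spatio-temporal and cross-modal consistency.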

Details

Motivation: Address consistency problems in multimodal data generation and improve the semantic fidelity and practical utility of generated data. Method: Uses a two-stage architecture that combines a DiT-based video diffusion model with 3D-VAE encoding and a BEV-aware LiDAR generator with NeRF-based rendering and adaptive sampling; a shared latent space keeps the visual and geometric domains consistent. Result: Performs strongly on the nuScenes benchmark (FVD 16.95, FID 4.24, Chamfer 0.611) and improves downstream tasks such as segmentation and 3D detection. Conclusion: Genesis achieves state-of-the-art performance in multimodal data generation, validating its semantic fidelity and practical value. Abstract: We present Genesis, a unified framework for joint generation of multi-view driving videos and LiDAR sequences with spatio-temporal and cross-modal consistency. Genesis employs a two-stage architecture that integrates a DiT-based video diffusion model with 3D-VAE encoding, and a BEV-aware LiDAR generator with NeRF-based rendering and adaptive sampling. Both modalities are directly coupled through a shared latent space, enabling coherent evolution across visual and geometric domains. To guide the generation with structured semantics, we introduce DataCrafter, a captioning module built on vision-language models that provides scene-level and instance-level supervision. Extensive experiments on the nuScenes benchmark demonstrate that Genesis achieves state-of-the-art performance across video and LiDAR metrics (FVD 16.95, FID 4.24, Chamfer 0.611), and benefits downstream tasks including segmentation and 3D detection, validating the semantic fidelity and practical utility of the generated data.

[108] MoQAE: Mixed-Precision Quantization for Long-Context LLM Inference via Mixture of Quantization-Aware Experts

Wei Tao,Haocheng Lu,Xiaoyang Qu,Bin Zhang,Kai Lu,Jiguang Wan,Jianzong Wang

Main category: cs.CV

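TL;DR: Proposes MoQAE, a mixed-precision quantization method that uses a mixture of quantization-aware experts to reduce KV cache memory consumption in long-context LLM inference.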

Details

Motivation: High KV cache memory consumption is the main challenge when optimizing LLMs for long-context inference, and existing quantization methods cannot balance efficiency and effectiveness. Method: (1) Treat different quantization bit-width configurations as experts and select the best configuration with a mixture-of-experts (MoE) approach; (2) feed input tokens to the router chunk by chunk to avoid the inefficiency of per-token routing in traditional MoE; (3) design a lightweight router-only fine-tuning process that balances model accuracy against memory usage; (4) introduce routing freezing and routing sharing mechanisms to reduce inference overhead. Result: Experiments on multiple benchmark datasets show that MoQAE outperforms existing KV cache quantization methods in both efficiency and effectiveness. Conclusion: Through mixed-precision quantization and an optimized routing mechanism, MoQAE substantially improves KV cache memory efficiency and model performance. Abstract: One of the primary challenges in optimizing large language models (LLMs) for long-context inference lies in the high memory consumption of the Key-Value (KV) cache. Existing approaches, such as quantization, have demonstrated promising results in reducing memory usage. However, current quantization methods cannot take both effectiveness and efficiency into account. In this paper, we propose MoQAE, a novel mixed-precision quantization method via mixture of quantization-aware experts. First, we view different quantization bit-width configurations as experts and use the traditional mixture of experts (MoE) method to select the optimal configuration. To avoid the inefficiency caused by inputting tokens one by one into the router in the traditional MoE method, we input the tokens into the router chunk by chunk. Second, we design a lightweight router-only fine-tuning process to train MoQAE with a comprehensive loss to learn the trade-off between model accuracy and memory usage. Finally, we introduce a routing freezing (RF) and a routing sharing (RS) mechanism to further reduce the inference overhead. Extensive experiments on multiple benchmark datasets demonstrate that our method outperforms state-of-the-art KV cache quantization approaches in both efficiency and effectiveness.
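
The idea of routing tokens chunk by chunk to a bit-width "expert" can be pictured with a toy example. The sketch below is not MoQAE's implementation: the bit-width set, the mean pooling of token features, the linear router, and all names (`BIT_WIDTHS`, `route_chunk`, `assign_bit_widths`) are hypothetical choices made only to show one router call per chunk instead of one per token.

```python
BIT_WIDTHS = [2, 4, 8]  # hypothetical expert set: one KV-cache bit-width per expert


def chunks(tokens, size):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def route_chunk(chunk, router_weights):
    """Pick one bit-width for a whole chunk (assumed: mean-pool, then linear scores)."""
    dim = len(chunk[0])
    pooled = [sum(tok[d] for tok in chunk) / len(chunk) for d in range(dim)]
    scores = [sum(w * x for w, x in zip(row, pooled)) for row in router_weights]
    return BIT_WIDTHS[scores.index(max(scores))]


def assign_bit_widths(tokens, router_weights, chunk_size):
    """One router decision per chunk rather than per token."""
    return [route_chunk(c, router_weights) for c in chunks(tokens, chunk_size)]


# Toy token features and router weights (illustrative values only).
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.5, 0.5]]
router_weights = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.6]]
print(assign_bit_widths(tokens, router_weights, chunk_size=2))  # [2, 4, 8]
```

With a chunk size of 2, five tokens need three router calls instead of five, which is the efficiency argument the abstract makes for chunk-level routing.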

[109] Domain Randomization for Object Detection in Manufacturing Applications using Synthetic Data: A Comprehensive Study

Xiaomeng Zhu,Jacob Henningsson,Duruo Li,Pär Mårtensson,Lars Hanson,Mårten Björkman,Atsuto Maki

Main category: cs.CV

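TL;DR: Proposes a synthetic data generation approach for object detection in manufacturing that uses domain randomization to produce diverse data, and validates it on a public dataset.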

Details

Motivation: Real data is hard to obtain for manufacturing object detection, so synthetic data is used to simulate real scenes. Method: Build a comprehensive data generation pipeline that covers object characteristics, background, illumination, and other factors, and introduce the SIP15-OD dataset as a test bed. Result: YOLOv8 models trained only on synthetic data perform strongly on the public dataset, with an mAP@50 of 96.4%. Conclusion: Domain randomization can generate data whose distribution is close to that of real data, offering a viable solution for manufacturing object detection. Abstract: This paper addresses key aspects of domain randomization in generating synthetic data for manufacturing object detection applications. To this end, we present a comprehensive data generation pipeline that reflects different factors: object characteristics, background, illumination, camera settings, and post-processing. We also introduce the Synthetic Industrial Parts Object Detection dataset (SIP15-OD) consisting of 15 objects from three industrial use cases under varying environments as a test bed for the study, while also employing an industrial dataset publicly available for robotic applications. In our experiments, we present extensive results and insights into the feasibility as well as challenges of sim-to-real object detection. In particular, we identified material properties, rendering methods, post-processing, and distractors as important factors. Our method, leveraging these, achieves top performance on the public dataset with YOLOv8 models trained exclusively on synthetic data; mAP@50 scores of 96.4% for the robotics dataset, and 94.1%, 99.5%, and 95.3% across three of the SIP15-OD use cases, respectively. The results showcase the effectiveness of the proposed domain randomization, potentially covering the distribution close to real data for the applications.

[110] APTOS-2024 challenge report: Generation of synthetic 3D OCT images from fundus photographs

Bowen Liu,Weiyi Zhang,Peranut Chotcomwongse,Xiaolan Chen,Ruoyu Chen,Pawin Pakaymaskul,Niracha Arjkongharn,Nattaporn Vongsa,Xuelian Cheng,Zongyuan Ge,Kun Huang,Xiaohui Li,Yiru Duan,Zhenbang Wang,BaoYe Xie,Qiang Chen,Huazhu Fu,Michael A. Mahr,Jiaqi Qu,Wangyiyang Chen,Shiye Wang,Yubo Tan,Yongjie Li,Mingguang He,Danli Shi,Paisan Ruamviboonsuk

Main category: cs.CV

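TL;DR: The APTOS-2024 challenge explored the feasibility of generating 3D OCT images from 2D fundus photographs; it attracted 342 teams and showed the potential of generative models to improve access to ophthalmic care.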

Details

Motivation: OCT equipment is expensive and depends on specialized operators, which limits its adoption, whereas 2D fundus photography is more accessible. The study aims to use generative models to translate 2D fundus images into 3D OCT images and so improve access to care. Method: The challenge provided a benchmark dataset and an evaluation methodology (image-based and video-based fidelity metrics); participants used hybrid data preprocessing, pre-training, integration of vision foundation models, and architecture improvements. Result: 42 teams made preliminary submissions and 9 reached the final, demonstrating the feasibility of generating 3D OCT images. Conclusion: The APTOS-2024 challenge is the first to validate the potential of generating OCT from fundus images, offering a new approach for ophthalmic care in under-resourced settings. Abstract: Optical Coherence Tomography (OCT) provides high-resolution, 3D, and non-invasive visualization of retinal layers in vivo, serving as a critical tool for lesion localization and disease diagnosis. However, its widespread adoption is limited by equipment costs and the need for specialized operators. In comparison, 2D color fundus photography offers faster acquisition and greater accessibility with less dependence on expensive devices. Although generative artificial intelligence has demonstrated promising results in medical image synthesis, translating 2D fundus images into 3D OCT images presents unique challenges due to inherent differences in data dimensionality and biological information between modalities. To advance generative models in the fundus-to-3D-OCT setting, the Asia Pacific Tele-Ophthalmology Society (APTOS-2024) organized a challenge titled Artificial Intelligence-based OCT Generation from Fundus Images. This paper details the challenge framework (referred to as APTOS-2024 Challenge), including: the benchmark dataset, evaluation methodology featuring two fidelity metrics: image-based distance (pixel-level OCT B-scan similarity) and video-based distance (semantic-level volumetric consistency), and analysis of top-performing solutions. The challenge attracted 342 participating teams, with 42 preliminary submissions and 9 finalists. Leading methodologies incorporated innovations in hybrid data preprocessing or augmentation (cross-modality collaborative paradigms), pre-training on external ophthalmic imaging datasets, integration of vision foundation models, and model architecture improvement. The APTOS-2024 Challenge is the first benchmark demonstrating the feasibility of fundus-to-3D-OCT synthesis as a potential solution for improving ophthalmic care accessibility in under-resourced healthcare settings, while helping to expedite medical research and clinical applications.

[111] Synthesize Privacy-Preserving High-Resolution Images via Private Textual Intermediaries

Haoxiang Wang,Zinan Lin,Da Yu,Huishuai Zhang

Main category: cs.CV

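TL;DR: SPTI shifts DP image synthesis to the text domain and uses off-the-shelf text generation and image reconstruction models to produce high-quality DP synthetic images without any training, substantially outperforming existing methods.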

Details

Motivation: Existing DP image synthesis methods struggle to produce high-resolution images that stay faithful to the original data; SPTI aims to solve this through a textual intermediary. Method: SPTI converts images into text descriptions, generates DP text with a modified Private Evolution algorithm, and then reconstructs images with text-to-image models. Result: On the LSUN Bedroom and MM-CelebA-HQ datasets, SPTI achieves substantially better FID than existing methods (e.g., 26.71 vs. 40.36). Conclusion: SPTI provides a resource-efficient framework, compatible with existing models, for generating high-resolution DP images, expanding access to private visual data. Abstract: Generating high-fidelity, differentially private (DP) synthetic images offers a promising route to share and analyze sensitive visual data without compromising individual privacy. However, existing DP image synthesis methods struggle to produce high-resolution outputs that faithfully capture the structure of the original data. In this paper, we introduce a novel method, referred to as Synthesis via Private Textual Intermediaries (SPTI), that can generate high-resolution DP images with easy adoption. The key idea is to shift the challenge of DP image synthesis from the image domain to the text domain by leveraging state-of-the-art DP text generation methods. SPTI first summarizes each private image into a concise textual description using image-to-text models, then applies a modified Private Evolution algorithm to generate DP text, and finally reconstructs images using text-to-image models. Notably, SPTI requires no model training, only inference with off-the-shelf models. Given a private dataset, SPTI produces synthetic images of substantially higher quality than prior DP approaches. On the LSUN Bedroom dataset, SPTI attains an FID ≤ 26.71 at ε = 1.0, improving over the Private Evolution FID of 40.36. Similarly, on MM-CelebA-HQ, SPTI achieves an FID ≤ 33.27 at ε = 1.0, compared to 57.01 from DP fine-tuning baselines. Overall, our results demonstrate that Synthesis via Private Textual Intermediaries provides a resource-efficient and proprietary-model-compatible framework for generating high-resolution DP synthetic images, greatly expanding access to private visual datasets.

[112] Cross-channel Perception Learning for H&E-to-IHC Virtual Staining

Hao Yang,JianYu Wu,Run Fang,Xuelian Zhao,Yuan Ji,Zhiyu Chen,Guibin He,Junceng Guo,Yang Liu,Xinhua Zeng

Main category: cs.CV

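TL;DR: Proposes a Cross-Channel Perception Learning (CCPL) strategy that addresses the cross-channel correlation between cell nuclei and cell membranes overlooked in H&E-to-IHC studies, using dual-channel feature extraction and statistical analysis to generate high-quality virtual stained images.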

Details

Motivation: Existing H&E-to-IHC studies often ignore the cross-channel correlation between cell nuclei and cell membranes, which limits the accuracy of pathological image analysis and diagnosis. Method: CCPL decomposes HER2 immunohistochemical staining into Hematoxylin and DAB channels, extracts dual-channel features with Gigapath's Tile Encoder, computes cross-channel correlation and feature distillation losses, and performs statistical analysis to keep staining distribution and intensity consistent. Result: CCPL performs well on quantitative metrics such as PSNR, SSIM, PCC, and FID, and pathologist evaluations confirm that it preserves pathological features and generates high-quality virtual stained images. Conclusion: CCPL provides strong technical support for automated pathological diagnosis based on multimedia medical data. Abstract: With the rapid development of digital pathology, virtual staining has become a key technology in multimedia medical information systems, offering new possibilities for the analysis and diagnosis of pathological images. However, existing H&E-to-IHC studies often overlook the cross-channel correlations between cell nuclei and cell membranes. To address this issue, we propose a novel Cross-Channel Perception Learning (CCPL) strategy. Specifically, CCPL first decomposes HER2 immunohistochemical staining into Hematoxylin and DAB staining channels, corresponding to cell nuclei and cell membranes, respectively. Using the pathology foundation model Gigapath's Tile Encoder, CCPL extracts dual-channel features from both the generated and real images and measures cross-channel correlations between nuclei and membranes. The features of the generated and real stained images, obtained through the Tile Encoder, are also used to calculate feature distillation loss, enhancing the model's feature extraction capabilities without increasing the inference burden. Additionally, CCPL performs statistical analysis on the focal optical density maps of both single channels to ensure consistency in staining distribution and intensity. Experimental results, based on quantitative metrics such as PSNR, SSIM, PCC, and FID, along with professional evaluations from pathologists, demonstrate that CCPL effectively preserves pathological features, generates high-quality virtual stained images, and provides robust support for automated pathological diagnosis using multimedia medical data.

[113] OpenDance: Multimodal Controllable 3D Dance Generation Using Large-scale Internet Data

Jinlu Zhang,Zixi Kang,Yizhou Wang

Main category: cs.CV

TL;DR: Presents the OpenDance5D dataset and the OpenDanceNet framework to address diversity and controllability in music-driven dance generation.

Details

Motivation: Existing methods lack fine-grained multimodal data and flexible multi-condition generation, which limits the diversity and controllability of generated dance. Method: Build the OpenDance5D dataset (101 hours of multimodal data across 14 dance genres) and propose OpenDanceNet, a masked modeling framework for multi-condition dance generation. Result: Experiments show that OpenDanceNet achieves high-fidelity and flexibly controllable dance generation. Conclusion: OpenDance5D and OpenDanceNet provide richer multimodal data and more flexible control for music-driven dance generation. Abstract: Music-driven dance generation offers significant creative potential yet faces considerable challenges. The absence of fine-grained multimodal data and the difficulty of flexible multi-conditional generation limit previous works on generation controllability and diversity in practice. In this paper, we build OpenDance5D, an extensive human dance dataset comprising over 101 hours across 14 distinct genres. Each sample has five modalities to facilitate robust cross-modal learning: RGB video, audio, 2D keypoints, 3D motion, and fine-grained textual descriptions from human arts. Furthermore, we propose OpenDanceNet, a unified masked modeling framework for controllable dance generation conditioned on music and arbitrary combinations of text prompts, keypoints, or character positioning. Comprehensive experiments demonstrate that OpenDanceNet achieves high-fidelity and flexible controllability.

[114] Towards the Influence of Text Quantity on Writer Retrieval

Marco Peer,Robert Sablatnig,Florian Kleber

Main category: cs.CV

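TL;DR: Studies writer retrieval based on handwriting similarity and examines how text quantity affects retrieval performance; a small amount of text (e.g., four lines) still yields high accuracy, and deep learning methods outperform traditional ones.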

Details

Motivation: Existing work focuses mainly on page-level retrieval; this paper explores how text quantity (line level and word level) affects writer retrieval performance. Method: Evaluate three state-of-the-art writer retrieval systems (covering handcrafted features and deep learning) and test them with varying amounts of text on the CVL and IAM datasets. Result: Performance drops by 20-30% with only one line of text, but four lines still reach more than 90% of full-page performance; deep learning methods do better in low-text scenarios. Conclusion: Text-dependent retrieval remains effective in low-text scenarios, and deep learning methods such as NetVLAD outperform traditional methods. Abstract: This paper investigates the task of writer retrieval, which identifies documents authored by the same individual within a dataset based on handwriting similarities. While existing datasets and methodologies primarily focus on page-level retrieval, we explore the impact of text quantity on writer retrieval performance by evaluating line- and word-level retrieval. We examine three state-of-the-art writer retrieval systems, including both handcrafted and deep learning-based approaches, and analyze their performance using varying amounts of text. Our experiments on the CVL and IAM datasets demonstrate that while performance decreases by 20-30% when only one line of text is used as query and gallery, retrieval accuracy remains above 90% of full-page performance when at least four lines are included. We further show that text-dependent retrieval can maintain strong performance in low-text scenarios. Our findings also highlight the limitations of handcrafted features in low-text scenarios, with deep learning-based methods like NetVLAD outperforming traditional VLAD encoding.

[115] LLM-driven Indoor Scene Layout Generation via Scaled Human-aligned Data Synthesis and Multi-Stage Preference Optimization

Yixuan Yang,Zhen Luo,Tongsheng Ding,Junru Lu,Mingqi Gao,Jinyu Yang,Victor Sanchez,Feng Zheng

Main category: cs.CV

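TL;DR: Proposes an LLM-based indoor layout generation method that combines a synthetic dataset with an optimized model, substantially improving layout quality and generation success rate.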

Details

Motivation: Existing methods suffer from spatial inconsistency, high computational cost, or limited generalization, so a more efficient and general solution is needed. Method: Introduce the 3D-SynthPlace dataset and the OptiScene model, trained in two stages (SFT and DPO) to optimize layout generation. Result: OptiScene outperforms traditional methods in layout quality and generation success rate and shows potential in interactive tasks. Conclusion: 3D-SynthPlace and OptiScene offer an efficient and general solution for indoor layout generation. Abstract: Automatic indoor layout generation has attracted increasing attention due to its potential in interior design, virtual environment construction, and embodied AI. Existing methods fall into two categories: prompt-driven approaches that leverage proprietary LLM services (e.g., GPT APIs) and learning-based methods trained on layout data upon diffusion-based models. Prompt-driven methods often suffer from spatial inconsistency and high computational costs, while learning-based methods are typically constrained by coarse relational graphs and limited datasets, restricting their generalization to diverse room categories. In this paper, we revisit LLM-based indoor layout generation and present 3D-SynthPlace, a large-scale dataset that combines synthetic layouts generated via a 'GPT synthesize, Human inspect' pipeline, upgraded from the 3D-Front dataset. 3D-SynthPlace contains nearly 17,000 scenes, covering four common room types -- bedroom, living room, kitchen, and bathroom -- enriched with diverse objects and high-level spatial annotations. We further introduce OptiScene, a strong open-source LLM optimized for indoor layout generation, fine-tuned based on our 3D-SynthPlace dataset through our two-stage training. For the warm-up stage I, we adopt supervised fine-tuning (SFT), in which the model is taught to first generate high-level spatial descriptions and then conditionally predict concrete object placements. For the reinforcing stage II, to better align the generated layouts with human design preferences, we apply multi-turn direct preference optimization (DPO), which significantly improves layout quality and generation success rates. Extensive experiments demonstrate that OptiScene outperforms traditional prompt-driven and learning-based baselines. Moreover, OptiScene shows promising potential in interactive tasks such as scene editing and robot navigation.
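
For readers unfamiliar with the DPO stage mentioned in the OptiScene abstract, the sketch below evaluates the standard single-pair DPO objective from summed log-probabilities. It illustrates the general technique only: the abstract's multi-turn variant is not described, and the function name, the `beta` value, and the example log-probabilities are assumptions.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response (here, a
    layout) under the trained policy or the frozen reference model.
    """
    # How much more the policy favours the chosen response than the reference does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)), written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))


# Hypothetical log-probabilities for a preferred and a dispreferred layout.
no_preference = dpo_loss(-11.0, -11.0, -11.0, -11.0)
learned_preference = dpo_loss(-10.0, -12.0, -11.0, -11.0)
print(no_preference, learned_preference)
```

When the policy ranks the two responses exactly as the reference does, the loss equals log 2; it falls below that as the policy shifts probability toward the preferred layout.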

[116] Learning Speaker-Invariant Visual Features for Lipreading

Yu Li,Feng Xue,Shujie Li,Jinrui Zhang,Shuang Yang,Dan Guo,Richang Hong

Main category: cs.CV

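TL;DR: The SIFLip framework separates speaker-specific features through implicit and explicit disentanglement modules, improving the generalization of lipreading models.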

Details

Motivation: Visual features extracted by existing lipreading methods contain speaker-specific attributes (e.g., lip shape, color, texture), which cause spurious correlations and hurt accuracy and generalization. Method: SIFLip uses an implicit disentanglement module (with stable text embeddings as the supervisory signal) and an explicit disentanglement module (a speaker recognition sub-task with gradient reversal) to separate speaker-specific features. Result: Experiments show that SIFLip substantially improves generalization on multiple public datasets and outperforms existing methods. Conclusion: By disentangling speaker-specific features, SIFLip effectively improves the generalization and accuracy of lipreading models. Abstract: Lipreading is a challenging cross-modal task that aims to convert visual lip movements into spoken text. Existing lipreading methods often extract visual features that include speaker-specific lip attributes (e.g., shape, color, texture), which introduce spurious correlations between vision and text. These correlations lead to suboptimal lipreading accuracy and restrict model generalization. To address this challenge, we introduce SIFLip, a speaker-invariant visual feature learning framework that disentangles speaker-specific attributes using two complementary disentanglement modules (Implicit Disentanglement and Explicit Disentanglement) to improve generalization. Specifically, since different speakers exhibit semantic consistency between lip movements and phonetic text when pronouncing the same words, our implicit disentanglement module leverages stable text embeddings as supervisory signals to learn common visual representations across speakers, implicitly decoupling speaker-specific features. Additionally, we design a speaker recognition sub-task within the main lipreading pipeline to filter speaker-specific features, then further explicitly disentangle these personalized visual features from the backbone network via gradient reversal. Experimental results demonstrate that SIFLip significantly improves generalization performance across multiple public datasets, outperforming state-of-the-art methods.

[117] Uncertainty-o: One Model-agnostic Framework for Unveiling Uncertainty in Large Multimodal Models

Ruiyang Zhang,Hu Zhang,Hao Fei,Zhedong Zheng

Main category: cs.CV

TL;DR: Proposes Uncertainty-o, a framework for evaluating and quantifying uncertainty in large multimodal models (LMMs), and validates it across a range of tasks.

Details

Motivation: Although LMMs are considered more robust than language-only models, three key questions about their uncertainty remain open: how to evaluate it in a unified way, how to prompt models to reveal it, and how to quantify it. Method: Propose the Uncertainty-o framework, comprising a model-agnostic evaluation approach, experiments with multimodal prompt perturbations, and a formulation for quantifying multimodal semantic uncertainty. Result: Uncertainty-o is validated on 18 benchmarks and 10 LMMs, improving downstream tasks such as hallucination detection, hallucination mitigation, and uncertainty-aware reasoning. Conclusion: Uncertainty-o offers a unified and effective solution for assessing LMM uncertainty and supports the practical use of multimodal models. Abstract: Large Multimodal Models (LMMs), harnessing the complementarity among diverse modalities, are often considered more robust than pure Large Language Models (LLMs); yet do LMMs know what they do not know? There are three key open questions remaining: (1) how to evaluate the uncertainty of diverse LMMs in a unified manner, (2) how to prompt LMMs to show their uncertainty, and (3) how to quantify uncertainty for downstream tasks. In an attempt to address these challenges, we introduce Uncertainty-o: (1) a model-agnostic framework designed to reveal uncertainty in LMMs regardless of their modalities, architectures, or capabilities, (2) an empirical exploration of multimodal prompt perturbations to uncover LMM uncertainty, offering insights and findings, and (3) a formulation of multimodal semantic uncertainty, which enables quantifying uncertainty from multimodal responses. Experiments across 18 benchmarks spanning various modalities and 10 LMMs (both open- and closed-source) demonstrate the effectiveness of Uncertainty-o in reliably estimating LMM uncertainty, thereby enhancing downstream tasks such as hallucination detection, hallucination mitigation, and uncertainty-aware Chain-of-Thought reasoning.

Boyu Chen,Siran Chen,Kunchang Li,Qinglin Xu,Yu Qiao,Yali Wang

Main category: cs.CV

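TL;DR: Proposes a unified Super Encoding Network (SEN) that recursively associates multi-modal encoders to improve performance on video understanding tasks.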

Details

Motivation: Existing multi-modal foundation models only align the encoders of different modalities through contrastive learning and lack deeper multi-modal interaction, making it hard to understand target motion in complex video scenes. Method: Treat pre-trained encoders as "super neurons" and design a Recursive Association (RA) block that progressively fuses multi-modal information through knowledge integration, distribution, and prompting. Result: Clear gains on tracking, recognition, chatting, and editing; for example, the Jaccard index for tracking improves by 2.7% and textual alignment for editing improves by 6.4%. Conclusion: Through deeper multi-modal interaction, SEN effectively improves video understanding performance and offers a new direction for multi-modal foundation models. Abstract: Video understanding has been considered as one critical step towards world modeling, which is an important long-term problem in AI research. Recently, multi-modal foundation models have shown such potential via large-scale pretraining. However, these models simply align encoders of different modalities via contrastive learning, while lacking deeper multi-modal interactions, which is critical for understanding complex target movements with diversified video scenes. To fill this gap, we propose a unified Super Encoding Network (SEN) for video understanding, which builds up such distinct interactions through recursive association of multi-modal encoders in the foundation models. Specifically, we creatively treat those well-trained encoders as "super neurons" in our SEN. Via designing a Recursive Association (RA) block, we progressively fuse multi-modalities with the input video, based on knowledge integrating, distributing, and prompting of super neurons in a recursive manner. In this way, our SEN can effectively encode deeper multi-modal interactions for prompting various downstream video understanding tasks. Extensive experiments show that our SEN can remarkably boost the four most representative video tasks, including tracking, recognition, chatting, and editing. For pixel-level tracking, the average Jaccard index improves by 2.7% and temporal coherence (TC) drops by 8.8% compared to the popular CaDeX++ approach. For one-shot video editing, textual alignment improves by 6.4% and frame consistency increases by 4.1% compared to the popular Tune-A-Video approach.

[119] Explore the vulnerability of black-box models via diffusion models

Jiacheng Shi,Yanfu Zhang,Huajie Shao,Ashley Gao

Main category: cs.CV

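TL;DR: Diffusion model APIs are used to generate synthetic images for training a substitute model, enabling efficient model extraction and adversarial attacks that substantially outperform existing methods.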

Details

Motivation: The high-fidelity image generation of diffusion models can be exploited maliciously, creating security and privacy risks such as copyright infringement and leakage of sensitive information. Method: Generate synthetic images through diffusion model APIs and train a substitute model, achieving model extraction and adversarial attacks with minimal queries. Result: Across multiple benchmarks, the method improves performance by 27.37% while using only 0.01 times the query budget, with an adversarial attack success rate of 98.68%. Conclusion: The method exposes a security threat posed by diffusion models and provides a basis for research on defending against similar attacks. Abstract: Recent advancements in diffusion models have enabled high-fidelity and photorealistic image generation across diverse applications. However, these models also present security and privacy risks, including copyright violations, sensitive information leakage, and the creation of harmful or offensive content that could be exploited maliciously. In this study, we uncover a novel security threat where an attacker leverages diffusion model APIs to generate synthetic images, which are then used to train a high-performing substitute model. This enables the attacker to execute model extraction and transfer-based adversarial attacks on black-box classification models with minimal queries, without needing access to the original training data. The generated images are sufficiently high-resolution and diverse to train a substitute model whose outputs closely match those of the target model. Across seven benchmarks, including CIFAR and ImageNet subsets, our method shows an average improvement of 27.37% over state-of-the-art methods while using just 0.01 times the query budget, achieving a 98.68% success rate in adversarial attacks on the target model.

[120] SceneRAG: Scene-level Retrieval-Augmented Generation for Video Understanding

Nianbo Zeng,Haowen Hou,Fei Richard Yu,Si Shi,Ying Tiffany He

Main category: cs.CV

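TL;DR: SceneRAG is an LLM-based video understanding framework that segments videos into narrative-consistent scenes and builds a knowledge graph from visual and textual information, substantially improving the understanding of long-form video content.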

Details

Motivation: Current RAG methods based on fixed-length chunking break the contextual continuity of videos and miss real scene boundaries. SceneRAG is inspired by the way humans naturally organize continuous experience into coherent scenes. Method: SceneRAG uses ASR transcripts and temporal metadata to segment videos into narrative-consistent scenes and refines the boundaries with lightweight heuristics and iterative correction. It then combines visual and textual information to build a dynamic knowledge graph that supports multi-hop retrieval and generation. Result: On the LongerVideos benchmark, SceneRAG substantially outperforms existing methods, with a win rate of up to 72.5% on generation tasks. Conclusion: Through scene segmentation and multimodal information fusion, SceneRAG effectively addresses the challenge of understanding long-form video and opens a new direction for future research. Abstract: Despite recent advances in retrieval-augmented generation (RAG) for video understanding, effectively understanding long-form video content remains underexplored due to the vast scale and high complexity of video data. Current RAG approaches typically segment videos into fixed-length chunks, which often disrupts the continuity of contextual information and fails to capture authentic scene boundaries. Inspired by the human ability to naturally organize continuous experiences into coherent scenes, we present SceneRAG, a unified framework that leverages large language models to segment videos into narrative-consistent scenes by processing ASR transcripts alongside temporal metadata. SceneRAG further sharpens these initial boundaries through lightweight heuristics and iterative correction. For each scene, the framework fuses information from both visual and textual modalities to extract entity relations and dynamically builds a knowledge graph, enabling robust multi-hop retrieval and generation that account for long-range dependencies. Experiments on the LongerVideos benchmark, featuring over 134 hours of diverse content, confirm that SceneRAG substantially outperforms prior baselines, achieving a win rate of up to 72.5 percent on generation tasks.

[121] SurgBench: A Unified Large-Scale Benchmark for Surgical Video Analysis

Jianhui Wei,Zikai Xiao,Danyu Sun,Luqi Gong,Zongxin Yang,Zuozhu Liu,Jian Wu

Main category: cs.CV

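TL;DR: SurgBench is a unified surgical video benchmarking framework, comprising the pretraining dataset SurgBench-P and the evaluation benchmark SurgBench-E, that aims to address data scarcity in the development of surgical video foundation models.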

Details

Motivation: Surgical video understanding is essential for automated intraoperative decision-making, skill assessment, and postoperative quality improvement, but the lack of large-scale, diverse datasets has held back progress. Method: Propose the SurgBench framework, comprising the pretraining dataset SurgBench-P (53 million frames, 22 surgical procedures) and the evaluation benchmark SurgBench-E (72 tasks). Result: Existing video foundation models generalize poorly across diverse tasks, whereas pretraining on SurgBench-P substantially improves performance and cross-domain generalization. Conclusion: SurgBench provides a standardized tool for surgical video analysis and advances the development of foundation models. Abstract: Surgical video understanding is pivotal for enabling automated intraoperative decision-making, skill assessment, and postoperative quality improvement. However, progress in developing surgical video foundation models (FMs) remains hindered by the scarcity of large-scale, diverse datasets for pretraining and systematic evaluation. In this paper, we introduce SurgBench, a unified surgical video benchmarking framework comprising a pretraining dataset, SurgBench-P, and an evaluation benchmark, SurgBench-E. SurgBench offers extensive coverage of diverse surgical scenarios, with SurgBench-P encompassing 53 million frames across 22 surgical procedures and 11 specialties, and SurgBench-E providing robust evaluation across six categories (phase classification, camera motion, tool recognition, disease diagnosis, action classification, and organ detection) spanning 72 fine-grained tasks. Extensive experiments reveal that existing video FMs struggle to generalize across varied surgical video analysis tasks, whereas pretraining on SurgBench-P yields substantial performance improvements and superior cross-domain generalization to unseen procedures and modalities. Our dataset and code are available upon request.

[122] DragNeXt: Rethinking Drag-Based Image Editing

Yuan Zhou,Junbao Zhou,Qingshan Xu,Kesen Zhao,Yuxuan Wang,Hao Fei,Richang Hong,Hanwang Zhang

Main category: cs.CV

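TL;DR: Proposes DragNeXt, a new drag-based image editing method that resolves ambiguity by having users explicitly specify drag regions and types, and simplifies the editing procedure.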

Details

Motivation: Current drag-based image editing methods are ambiguous and cumbersome and cannot complete editing tasks with high quality. Method: Redefine drag editing as deformation, rotation, and translation of user-specified regions, formulate it as a Latent Region Optimization (LRO) problem, and solve it through Progressive Backward Self-Intervention (PBSI). Result: DragNeXt performs strongly on NextBench and substantially outperforms existing methods. Conclusion: With explicit region specification and its optimization framework, DragNeXt effectively addresses the ambiguity and quality problems of drag-based editing. Abstract: Drag-Based Image Editing (DBIE), which allows users to manipulate images by directly dragging objects within them, has recently attracted much attention from the community. However, it faces two key challenges: (i) point-based drag is often highly ambiguous and difficult to align with users' intentions; (ii) current DBIE methods primarily rely on alternating between motion supervision and point tracking, which is not only cumbersome but also fails to produce high-quality results. These limitations motivate us to explore DBIE from a new perspective -- redefining it as deformation, rotation, and translation of user-specified handle regions. Thereby, by requiring users to explicitly specify both drag areas and types, we can effectively address the ambiguity issue. Furthermore, we propose a simple-yet-effective editing framework, dubbed DragNeXt. It unifies DBIE as a Latent Region Optimization (LRO) problem and solves it through Progressive Backward Self-Intervention (PBSI), simplifying the overall procedure of DBIE while further enhancing quality by fully leveraging region-level structure information and progressive guidance from intermediate drag states. We validate DragNeXt on our NextBench, and extensive experiments demonstrate that our proposed method can significantly outperform existing approaches. Code will be released on GitHub.

[123] Scaling Human Activity Recognition: A Comparative Evaluation of Synthetic Data Generation and Augmentation Techniques

Zikang Leng,Archith Iyer,Thomas Plötz

Main category: cs.CV

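TL;DR: Compares two virtual IMU data generation approaches (video-based and language-based) with traditional data augmentation and finds that virtual IMU data substantially improves performance when data is limited.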

Details

Motivation: Address the scarcity of labeled data in human activity recognition (HAR) and examine how effective cross-modality generation of virtual IMU data is. Method: Build a large-scale virtual IMU dataset, compare video-based and language-based generation with traditional data augmentation, and evaluate on benchmark datasets. Result: Virtual IMU data substantially outperforms using real or augmented data alone, especially when data is limited. Conclusion: The paper offers practical guidance on choosing a data generation strategy and summarizes the strengths and weaknesses of each approach. Abstract: Human activity recognition (HAR) is often limited by the scarcity of labeled datasets due to the high cost and complexity of real-world data collection. To mitigate this, recent work has explored generating virtual inertial measurement unit (IMU) data via cross-modality transfer. While video-based and language-based pipelines have each shown promise, they differ in assumptions and computational cost. Moreover, their effectiveness relative to traditional sensor-level data augmentation remains unclear. In this paper, we present a direct comparison of these two virtual IMU generation approaches against classical data augmentation techniques. We construct a large-scale virtual IMU dataset spanning 100 diverse activities from Kinetics-400 and simulate sensor signals at 22 body locations. The three data generation strategies are evaluated on benchmark HAR datasets (UTD-MHAD, PAMAP2, HAD-AW) using four popular models. Results show that virtual IMU data significantly improves performance over real or augmented data alone, particularly under limited-data conditions. We offer practical guidance on choosing data generation strategies and highlight the distinct advantages and disadvantages of each approach.

[124] Event-Priori-Based Vision-Language Model for Efficient Visual Understanding

Haotong Qin,Cheng Hu,Michele Magno

Main category: cs.CV

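TL;DR: EP-VLM uses motion priors from dynamic event vision, with event data guiding the sparsification of RGB visual inputs, to substantially improve the efficiency of vision-language models while maintaining high accuracy.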

Details

Motivation: Existing vision-language models (VLMs) have high computational demands, are hard to deploy on resource-constrained edge devices, and process a large amount of redundant visual information. Method: Propose EP-VLM, which uses event data to guide the sparsification of visual inputs and a position-preserving tokenization strategy to handle the sparse inputs. Result: Experiments show that EP-VLM saves 50% of the computation on Qwen2-VL-2B while retaining 98% of the original accuracy. Conclusion: Event vision priors can substantially improve VLM efficiency and offer a new approach to sustainable visual understanding on edge devices. Abstract: Large Language Model (LLM)-based Vision-Language Models (VLMs) have substantially extended the boundaries of visual understanding capabilities. However, their high computational demands hinder deployment on resource-constrained edge devices. A key source of inefficiency stems from the VLM's need to process dense and redundant visual information. Visual inputs contain significant regions irrelevant to text semantics, rendering the associated computations ineffective for inference. This paper introduces a novel Event-Priori-Based Vision-Language Model, termed EP-VLM. Its core contribution is a novel mechanism leveraging motion priors derived from dynamic event vision to enhance VLM efficiency. Inspired by human visual cognition, EP-VLM first employs event data to guide the patch-wise sparsification of RGB visual inputs, progressively concentrating VLM computation on salient regions of the visual input. Subsequently, we construct a position-preserving tokenization strategy for the visual encoder within the VLM architecture. This strategy processes the event-guided, unstructured, sparse visual input while accurately preserving positional understanding within the visual input. Experimental results demonstrate that EP-VLM achieves significant efficiency improvements while maintaining nearly lossless accuracy compared to baseline models from the Qwen2-VL series. For instance, against the original Qwen2-VL-2B, EP-VLM achieves 50% FLOPs savings while retaining 98% of the original accuracy on the RealWorldQA dataset. This work demonstrates the potential of event-based vision priors for improving VLM inference efficiency, paving the way for creating more efficient and deployable VLMs for sustainable visual understanding at the edge.
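
A toy picture of event-guided patch sparsification with preserved positions may help here. The sketch below is an assumption-laden illustration, not EP-VLM's method: it keeps the patches with the most events (top-k by a hypothetical `keep_ratio`) and returns their original raster indices, so that a downstream encoder could still look up the position embedding of each surviving patch.

```python
def select_patches(event_counts, keep_ratio):
    """Return the original flat indices of the patches to keep, in raster order.

    event_counts: 2D list (rows x cols) of per-patch event counts.
    keep_ratio:   fraction of patches to keep (hypothetical parameter).
    """
    cols = len(event_counts[0])
    flat = [(count, r * cols + c)
            for r, row in enumerate(event_counts)
            for c, count in enumerate(row)]
    k = max(1, round(keep_ratio * len(flat)))
    # Most active patches first; ties broken by the smaller index.
    top = sorted(flat, key=lambda item: (-item[0], item[1]))[:k]
    # Sorting the kept indices restores raster order and preserves positions.
    return sorted(idx for _, idx in top)


# Toy 3 x 4 grid of per-patch event counts (illustrative values only).
event_counts = [
    [0, 5, 1, 0],
    [2, 9, 7, 0],
    [0, 3, 0, 0],
]
kept = select_patches(event_counts, keep_ratio=0.5)
print(kept)  # [1, 2, 4, 5, 6, 9]
```

Keeping the original indices, rather than renumbering the surviving patches from zero, is what lets a sparse input retain its spatial layout in this sketch.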

[125] HuSc3D: Human Sculpture dataset for 3D object reconstruction

Weronika Smolak-Dyżewska,Dawid Malarz,Grzegorz Wilczyński,Rafał Tobiasz,Joanna Waczyńska,Piotr Borycki,Przemysław Spurek

Main category: cs.CV

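TL;DR: HuSc3D is a new dataset designed for rigorous benchmarking of 3D reconstruction models under realistic acquisition challenges, covering real-world issues such as dynamic backgrounds and color discrepancies that existing datasets leave out.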

Details

Motivation: Existing 3D reconstruction datasets are mostly built from idealized synthetic data or meticulously captured real data, so they do not reflect the complexities of real scenes such as dynamic backgrounds and color discrepancies. Method: The authors present the HuSc3D dataset, which contains six highly detailed, fully white sculptures with intricate perforations and minimal variation in texture and color; the number of images per scene varies significantly. Result: Evaluating popular 3D reconstruction methods shows that HuSc3D effectively differentiates model performance and exposes each method's sensitivity to fine geometric detail, color ambiguity, and varying data availability. Conclusion: HuSc3D fills a gap left by existing datasets and provides a distinctive tool for evaluating 3D reconstruction models under realistic challenges. Abstract: 3D scene reconstruction from 2D images is one of the most important tasks in computer graphics. Unfortunately, existing datasets and benchmarks concentrate on idealized synthetic or meticulously captured realistic data. Such benchmarks fail to convey the inherent complexities encountered in newly acquired real-world scenes. In such scenes, especially those acquired outside, the background is often dynamic, and with the widespread use of cell phone cameras, there might be discrepancies in, e.g., white balance. To address this gap, we present HuSc3D, a novel dataset specifically designed for rigorous benchmarking of 3D reconstruction models under realistic acquisition challenges. Our dataset uniquely features six highly detailed, fully white sculptures characterized by intricate perforations and minimal textural and color variation. Furthermore, the number of images per scene varies significantly, introducing the additional challenge of limited training data for some instances alongside scenes with a standard number of views. By evaluating popular 3D reconstruction methods on this diverse dataset, we demonstrate the distinctiveness of HuSc3D in effectively differentiating model performance, particularly highlighting the sensitivity of methods to fine geometric details, color ambiguity, and varying data availability, limitations often masked by more conventional datasets.

[126] HieraEdgeNet: A Multi-Scale Edge-Enhanced Framework for Automated Pollen Recognition

Yuchong Long,Wen Sun,Ningxiao Sun,Wenxiao Wang,Chao Li,Shan Yin

Main category: cs.CV

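TL;DR: HieraEdgeNet is a multi-scale edge-enhancement framework that substantially improves recognition accuracy for tiny targets under the microscope (such as pollen), outperforming existing baseline models.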

Details

Motivation: Conventional pollen recognition is inefficient and subjective, and existing deep learning models lack the localization accuracy needed for tiny targets. Method: The HieraEdgeNet framework comprises three modules: HEM (multi-scale edge feature extraction), SEF (fusion of edge and semantic information), and CSPOKM (detail refinement). Result: On a dataset of 120 pollen classes, mAP@.5 reaches 0.9501, outperforming YOLOv12n and RT-DETR. Conclusion: By systematically integrating edge information, HieraEdgeNet offers a robust solution for high-precision, high-efficiency detection of tiny targets. Abstract: Automated pollen recognition is vital to paleoclimatology, biodiversity monitoring, and public health, yet conventional methods are hampered by inefficiency and subjectivity. Existing deep learning models often struggle to achieve the requisite localization accuracy for microscopic targets like pollen, which are characterized by their minute size, indistinct edges, and complex backgrounds. To overcome this limitation, we introduce HieraEdgeNet, a multi-scale edge-enhancement framework. The framework's core innovation is the introduction of three synergistic modules: the Hierarchical Edge Module (HEM), which explicitly extracts a multi-scale pyramid of edge features that corresponds to the semantic hierarchy at early network stages; the Synergistic Edge Fusion (SEF) module, for deeply fusing these edge priors with semantic information at each respective scale; and the Cross Stage Partial Omni-Kernel Module (CSPOKM), which maximally refines the most detail-rich feature layers using an Omni-Kernel operator - comprising anisotropic large-kernel convolutions and mixed-domain attention - all within a computationally efficient Cross-Stage Partial (CSP) framework. On a large-scale dataset comprising 120 pollen classes, HieraEdgeNet achieves a mean Average Precision (mAP@.5) of 0.9501, significantly outperforming state-of-the-art baseline models such as YOLOv12n and RT-DETR. Furthermore, qualitative analysis confirms that our approach generates feature representations that are more precisely focused on object boundaries. By systematically integrating edge information, HieraEdgeNet provides a robust and powerful solution for high-precision, high-efficiency automated detection of microscopic objects.
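
To make the idea of a "multi-scale pyramid of edge features" concrete, here is a minimal sketch. It assumes a fixed Sobel operator and 2x average pooling between levels; the abstract does not say how HEM computes its edges, so `sobel_magnitude`, `downsample2`, and `edge_pyramid` are hypothetical stand-ins rather than the paper's module.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)

def sobel_magnitude(img):
    """Gradient magnitude of a 2D float image using 3x3 Sobel filters."""
    h, w = img.shape
    p = np.pad(img, 1, mode="edge")
    gx = np.zeros((h, w), dtype=np.float64)
    gy = np.zeros((h, w), dtype=np.float64)
    for i in range(3):
        for j in range(3):
            win = p[i:i + h, j:j + w]
            gx += SOBEL_X[i, j] * win      # horizontal gradient
            gy += SOBEL_X.T[i, j] * win    # vertical gradient
    return np.hypot(gx, gy)

def downsample2(img):
    """Halve the resolution with 2x2 average pooling."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    return img[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def edge_pyramid(img, levels=3):
    """Return one edge map per scale, finest first."""
    cur = np.asarray(img, dtype=np.float64)
    maps = []
    for _ in range(levels):
        maps.append(sobel_magnitude(cur))
        cur = downsample2(cur)
    return maps
```

In a detector, each level of such a pyramid would be paired with the backbone feature map of matching resolution, which is the role the abstract assigns to the SEF module.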

[127] Synthetic Visual Genome

Jae Sung Park,Zixian Ma,Linjie Li,Chenhao Zheng,Cheng-Yu Hsieh,Ximing Lu,Khyathi Chandu,Quan Kong,Norimasa Kobori,Ali Farhadi,Yejin Choi,Ranjay Krishna

Main category: cs.CV

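TL;DR: The paper introduces ROBIN, a multimodal language model (MLM) instruction-tuned on densely annotated relationships to build high-quality dense scene graphs. Using the synthetic dataset SVG and the self-distillation framework SG-EDIT, ROBIN performs strongly on relationship understanding tasks.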

Details

Motivation: Despite progress in visual understanding by multimodal language models, precise reasoning over relationships and their generation remains challenging. Method: ROBIN is trained with the synthetic dataset SVG and the self-distillation framework SG-EDIT to generate high-quality scene graphs. Result: ROBIN-3B outperforms similar-size models on relationship understanding benchmarks and scores 88.9 on referring expression comprehension, surpassing the previous best of 87.4. Conclusion: Training on refined scene graph data is important for high performance across diverse visual reasoning tasks. Abstract: Reasoning over visual relationships (spatial, functional, interactional, social, etc.) is considered to be a fundamental component of human cognition. Yet, despite the major advances in visual comprehension in multimodal language models (MLMs), precise reasoning over relationships and their generations remains a challenge. We introduce ROBIN: an MLM instruction-tuned with densely annotated relationships capable of constructing high-quality dense scene graphs at scale. To train ROBIN, we curate SVG, a synthetic scene graph dataset by completing the missing relations of selected objects in existing scene graphs using a teacher MLM and a carefully designed filtering process to ensure high quality. To generate more accurate and rich scene graphs at scale for any image, we introduce SG-EDIT: a self-distillation framework where GPT-4o further refines ROBIN's predicted scene graphs by removing unlikely relations and/or suggesting relevant ones. In total, our dataset contains 146K images and 5.6M relationships for 2.6M objects. Results show that our ROBIN-3B model, despite being trained on less than 3 million instances, outperforms similar-size models trained on over 300 million instances on relationship understanding benchmarks, and even surpasses larger models up to 13B parameters. Notably, it achieves state-of-the-art performance in referring expression comprehension with a score of 88.9, surpassing the previous best of 87.4. Our results suggest that training on the refined scene graph data is crucial to maintaining high performance across diverse visual reasoning tasks.

[128] FMaMIL: Frequency-Driven Mamba Multi-Instance Learning for Weakly Supervised Lesion Segmentation in Medical Images

Hangbei Cheng,Xiaorong Dong,Xueyu Liu,Jianan Zhang,Xuetao Ma,Mingqiang Wei,Liansheng Wang,Junxin Chen,Yongfei Wu

Main category: cs.CV

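TL;DR: FMaMIL is a weakly supervised lesion segmentation framework based on image-level labels; its two-stage approach segments lesions without pixel-level annotations.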

Details

Motivation: Lesion segmentation in histopathology images is challenging, largely because pixel-level annotations are costly. Method: A two-stage framework: (1) a Mamba-based encoder captures long-range dependencies and is combined with a frequency-domain encoding module; (2) segmentation is improved through pseudo-label refinement and a self-correction mechanism. Result: FMaMIL outperforms existing weakly supervised methods on public and private datasets. Conclusion: FMaMIL offers an efficient, low-cost solution for digital pathology. Abstract: Accurate lesion segmentation in histopathology images is essential for diagnostic interpretation and quantitative analysis, yet it remains challenging due to the limited availability of costly pixel-level annotations. To address this, we propose FMaMIL, a novel two-stage framework for weakly supervised lesion segmentation based solely on image-level labels. In the first stage, a lightweight Mamba-based encoder is introduced to capture long-range dependencies across image patches under the MIL paradigm. To enhance spatial sensitivity and structural awareness, we design a learnable frequency-domain encoding module that supplements spatial-domain features with spectrum-based information. CAMs generated in this stage are used to guide segmentation training. In the second stage, we refine the initial pseudo labels via a CAM-guided soft-label supervision and a self-correction mechanism, enabling robust training even under label noise. Extensive experiments on both public and private histopathology datasets demonstrate that FMaMIL outperforms state-of-the-art weakly supervised methods without relying on pixel-level annotations, validating its effectiveness and potential for digital pathology applications.

[129] ProSplat: Improved Feed-Forward 3D Gaussian Splatting for Wide-Baseline Sparse Views

Xiaohan Lu,Jiaye Fu,Jiaqi Zhang,Zetian Song,Chuanmin Jia,Siwei Ma

Main category: cs.CV

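TL;DR: ProSplat is a two-stage feed-forward framework for high-fidelity rendering under wide-baseline conditions, combining a 3D Gaussian generator with an improvement model.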

Details

Motivation: 3D Gaussian Splatting degrades in wide-baseline scenarios because of limited texture detail and geometric inconsistency across views. Method: A two-stage framework: first generate 3D Gaussian primitives, then enhance the rendered views with a diffusion-based improvement model, optimized with the MORI and DWEA techniques. Result: On the RealEstate10K and DL3DV-10K datasets, PSNR improves by 1 dB on average. Conclusion: ProSplat markedly improves rendering quality under wide-baseline conditions and outperforms existing methods. Abstract: Feed-forward 3D Gaussian Splatting (3DGS) has recently demonstrated promising results for novel view synthesis (NVS) from sparse input views, particularly under narrow-baseline conditions. However, its performance significantly degrades in wide-baseline scenarios due to limited texture details and geometric inconsistencies across views. To address these challenges, in this paper, we propose ProSplat, a two-stage feed-forward framework designed for high-fidelity rendering under wide-baseline conditions. The first stage involves generating 3D Gaussian primitives via a 3DGS generator. In the second stage, rendered views from these primitives are enhanced through an improvement model. Specifically, this improvement model is based on a one-step diffusion model, further optimized by our proposed Maximum Overlap Reference view Injection (MORI) and Distance-Weighted Epipolar Attention (DWEA). MORI supplements missing texture and color by strategically selecting a reference view with maximum viewpoint overlap, while DWEA enforces geometric consistency using epipolar constraints. Additionally, we introduce a divide-and-conquer training strategy that aligns data distributions between the two stages through joint optimization. We evaluate ProSplat on the RealEstate10K and DL3DV-10K datasets under wide-baseline settings. Experimental results demonstrate that ProSplat achieves an average improvement of 1 dB in PSNR compared to recent SOTA methods.
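
For readers who want to relate the reported 1 dB gain to pixel error, the standard PSNR definition is sketched below. The function name `psnr` and the `max_val=1.0` convention (images scaled to [0, 1]) are choices made here, not details taken from the paper.

```python
import numpy as np

def psnr(reference, rendered, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    ref = np.asarray(reference, dtype=np.float64)
    out = np.asarray(rendered, dtype=np.float64)
    mse = np.mean((ref - out) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Because PSNR is logarithmic in the mean squared error, a 1 dB improvement corresponds to multiplying the MSE by 10^(-0.1), roughly 0.79, that is, about a 21% reduction.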

[130] OpenSplat3D: Open-Vocabulary 3D Instance Segmentation using Gaussian Splatting

Jens Piekenbrinck,Christian Schmidt,Alexander Hermans,Narunas Vaskevicius,Timm Linder,Bastian Leibe

Main category: cs.CV

TL;DR: OpenSplat3D extends 3D Gaussian Splatting (3DGS) to open-vocabulary 3D instance segmentation without manual labeling.

Details

Motivation: Extend 3DGS from pure scene representation to open-vocabulary instance segmentation, enabling finer-grained scene understanding. Method: Combines feature splatting, Segment Anything Model instance masks, and a contrastive loss, and uses language embeddings from a vision-language model for text-driven instance identification. Result: The method is shown to be effective on LERF-mask, LERF-OVS, and the ScanNet++ validation set. Conclusion: OpenSplat3D can flexibly identify and segment arbitrary objects in 3D scenes from natural language descriptions. Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful representation for neural scene reconstruction, offering high-quality novel view synthesis while maintaining computational efficiency. In this paper, we extend the capabilities of 3DGS beyond pure scene representation by introducing an approach for open-vocabulary 3D instance segmentation without requiring manual labeling, termed OpenSplat3D. Our method leverages feature-splatting techniques to associate semantic information with individual Gaussians, enabling fine-grained scene understanding. We incorporate Segment Anything Model instance masks with a contrastive loss formulation as guidance for the instance features to achieve accurate instance-level segmentation. Furthermore, we utilize language embeddings of a vision-language model, allowing for flexible, text-driven instance identification. This combination enables our system to identify and segment arbitrary objects in 3D scenes based on natural language descriptions. We show results on LERF-mask and LERF-OVS as well as the full ScanNet++ validation set, demonstrating the effectiveness of our approach.
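
Text-driven instance identification can be pictured as a similarity lookup between per-instance embeddings and a text embedding. The sketch below assumes cosine similarity with a fixed threshold, which is a common choice with CLIP-style embeddings; the abstract does not specify the matching rule, and `query_instances` and `threshold` are hypothetical names.

```python
import numpy as np

def query_instances(instance_feats, text_feat, threshold=0.5):
    """Return indices of instances whose embedding matches the text query.

    instance_feats: (N, D) one language-aligned embedding per 3D instance.
    text_feat:      (D,)   embedding of the query text.
    """
    f = np.asarray(instance_feats, dtype=np.float64)
    t = np.asarray(text_feat, dtype=np.float64)
    f = f / np.linalg.norm(f, axis=1, keepdims=True)
    t = t / np.linalg.norm(t)
    sims = f @ t  # cosine similarity per instance
    return np.flatnonzero(sims >= threshold), sims
```

In a full pipeline, the selected instance indices would then be mapped back to the Gaussians that belong to those instances to produce a 3D mask.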

[131] NOVA3D: Normal Aligned Video Diffusion Model for Single Image to 3D Generation

Yuxiao Yang,Peihao Li,Yuhong Zhang,Junzhe Lu,Xianglong He,Minghan Qin,Weitao Wang,Haoqian Wang

Main category: cs.CV

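TL;DR: NOVA3D is a single-image-to-3D generation framework that exploits the 3D priors and geometric information of a pretrained video diffusion model. It proposes the GTA attention mechanism and a de-conflict geometry fusion algorithm, improving multi-view consistency and texture fidelity.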

Details

Motivation: Existing methods suffer from poor multi-view consistency because their 3D priors are inadequate. Method: Leverages the 3D priors of a pretrained video diffusion model and introduces the GTA attention mechanism and a de-conflict geometry fusion algorithm. Result: Experiments show that NOVA3D outperforms existing baselines. Conclusion: By improving the integration of 3D priors and geometric information, NOVA3D raises the quality of single-image-to-3D generation. Abstract: 3D AI-generated content (AIGC) has made it increasingly accessible for anyone to become a 3D content creator. While recent methods leverage Score Distillation Sampling to distill 3D objects from pretrained image diffusion models, they often suffer from inadequate 3D priors, leading to insufficient multi-view consistency. In this work, we introduce NOVA3D, an innovative single-image-to-3D generation framework. Our key insight lies in leveraging strong 3D priors from a pretrained video diffusion model and integrating geometric information during multi-view video fine-tuning. To facilitate information exchange between color and geometric domains, we propose the Geometry-Temporal Alignment (GTA) attention mechanism, thereby improving generalization and multi-view consistency. Moreover, we introduce the de-conflict geometry fusion algorithm, which improves texture fidelity by addressing multi-view inaccuracies and resolving discrepancies in pose alignment. Extensive experiments validate the superiority of NOVA3D over existing baselines.

Weilei Wen,Chunle Guo,Wenqi Ren,Hongpeng Wang,Xiuli Shao

Main category: cs.CV

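TL;DR: The paper proposes a dynamic filter network whose global and local branches handle spatial-agnostic and spatial-specific image degradations respectively, improving image reconstruction performance.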

Details

Motivation: Existing methods ignore the diversity of degradation types and use a single network model for multiple degradations, which gives poor results. Method: A dynamic filter network with a global dynamic filtering layer (for spatial-agnostic degradations) and a local dynamic filtering layer (for spatial-specific degradations). Result: The method outperforms existing blind super-resolution algorithms on synthetic and real image datasets. Conclusion: The dynamic filter network can distinguish and handle different types of image degradation, improving reconstruction quality. Abstract: Prior methodologies have disregarded the diversities among distinct degradation types during image reconstruction, employing a uniform network model to handle multiple deteriorations. Nevertheless, we discover that prevalent degradation modalities, including sampling, blurring, and noise, can be roughly categorized into two classes. We classify the first class as spatial-agnostic dominant degradations, less affected by regional changes in image space, such as downsampling and noise degradation. The second class of degradation is intimately associated with the spatial position of the image, such as blurring, and we identify these as spatial-specific dominant degradations. We introduce a dynamic filter network integrating global and local branches to address these two degradation types. This network can greatly alleviate the practical degradation problem. Specifically, the global dynamic filtering layer can perceive the spatial-agnostic dominant degradation in different images by applying weights generated by the attention mechanism to multiple parallel standard convolution kernels, enhancing the network's representation ability. Meanwhile, the local dynamic filtering layer converts feature maps of the image into a spatially specific dynamic filtering operator, which performs spatially specific convolution operations on the image features to handle spatial-specific dominant degradations. By effectively integrating both global and local dynamic filtering operators, our proposed method outperforms state-of-the-art blind super-resolution algorithms in both synthetic and real image datasets.
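
The global dynamic filtering layer applies attention-generated weights to several parallel convolution kernels. A minimal sketch of that aggregation step follows, assuming a softmax over per-image logits and a plain weighted sum of kernels; the names `softmax` and `aggregate_kernels` and the source of the logits are assumptions, not the authors' code.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def aggregate_kernels(kernels, logits):
    """Blend K parallel kernels into one image-specific kernel.

    kernels: (K, kh, kw) parallel standard convolution kernels.
    logits:  (K,) attention logits computed from the input image
             (how they are computed is outside this sketch).
    """
    w = softmax(np.asarray(logits, dtype=np.float64))
    # Weighted sum over the K axis -> a single (kh, kw) kernel.
    return np.tensordot(w, np.asarray(kernels, dtype=np.float64), axes=1)
```

The blended kernel is the same at every spatial position of a given image, which matches the spatial-agnostic role of the global branch; the local branch would instead predict a different filter per position.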

[133] Consistent Video Editing as Flow-Driven Image-to-Video Generation

Ge Wang,Songlin Fan,Hangxu Liu,Quanjian Song,Hewei Wang,Jinfeng Xu

Main category: cs.CV

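TL;DR: FlowV2V is an optical-flow-based video editing method that decomposes the task into first-frame editing and conditional I2V generation to address the challenge of modeling complex motion.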

Details

Motivation: Existing methods struggle with complex motion patterns, such as multi-object and portrait editing, whereas optical flow offers a new way to model complex motion. Method: FlowV2V decomposes the task into first-frame editing and conditional I2V generation, and simulates a pseudo flow sequence to keep the edit consistent. Result: On DAVIS-EDIT, FlowV2V improves DOVER and warping error by 13.67% and 50.66% respectively. Conclusion: FlowV2V outperforms existing methods in temporal consistency and sample quality, and ablation studies confirm the contribution of its components. Abstract: With the rise of video diffusion models, downstream applications like video editing have been significantly promoted without consuming much computational cost. One particular challenge in this task lies in the motion transfer process from the source video to the edited one, which requires considering the shape deformation in between while maintaining temporal consistency in the generated video sequence. However, existing methods fail to model complicated motion patterns for video editing, and are fundamentally limited to object replacement, where tasks with non-rigid object motions like multi-object and portrait editing are largely neglected. In this paper, we observe that optical flows offer a promising alternative in complex motion modeling, and present FlowV2V to re-investigate video editing as a task of flow-driven Image-to-Video (I2V) generation. Specifically, FlowV2V decomposes the entire pipeline into first-frame editing and conditional I2V generation, and simulates a pseudo flow sequence that aligns with the deformed shape, thus ensuring consistency during editing. Experimental results on DAVIS-EDIT with improvements of 13.67% and 50.66% on DOVER and warping error illustrate the superior temporal consistency and sample quality of FlowV2V compared to existing state-of-the-art ones. Furthermore, we conduct comprehensive ablation studies to analyze the internal functionalities of the first-frame paradigm and flow alignment in the proposed method.

[134] ReverB-SNN: Reversing Bit of the Weight and Activation for Spiking Neural Networks

Yufei Guo,Yuhan Zhang,Zhou Jie,Xiaode Liu,Xin Tong,Yuanpei Chen,Weihang Peng,Zhe Ma

Main category: cs.CV

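TL;DR: ReverB-SNN reverses the bit widths of weights and activations, pairing real-valued activations with binary weights, to raise the information capacity and accuracy of SNNs while keeping their event-driven, multiplication-free advantages.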

Details

Motivation: Binary activation maps in SNNs carry too little information, which reduces accuracy. Method: Uses real-valued activations with binary weights, introduces a trainable factor that adaptively learns weight amplitudes, and restores the standard form at inference through re-parameterization. Result: The approach outperforms existing methods across a range of network architectures and datasets. Conclusion: ReverB-SNN improves SNN accuracy while remaining efficient. Abstract: The Spiking Neural Network (SNN), a biologically inspired neural network infrastructure, has garnered significant attention recently. SNNs utilize binary spike activations for efficient information transmission, replacing multiplications with additions, thereby enhancing energy efficiency. However, binary spike activation maps often fail to capture sufficient data information, resulting in reduced accuracy. To address this challenge, we advocate reversing the bit of the weight and activation for SNNs, called ReverB-SNN, inspired by recent findings that highlight greater accuracy degradation from quantizing activations compared to weights. Specifically, our method employs real-valued spike activations alongside binary weights in SNNs. This preserves the event-driven and multiplication-free advantages of standard SNNs while enhancing the information capacity of activations. Additionally, we introduce a trainable factor within binary weights to adaptively learn suitable weight amplitudes during training, thereby increasing network capacity. To maintain efficiency akin to vanilla ReverB-SNN, our trainable binary weight SNNs are converted back to standard form using a re-parameterization technique during inference. Extensive experiments across various network architectures and datasets, both static and dynamic, demonstrate that our approach consistently outperforms state-of-the-art methods.
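
One way to see why a trainable amplitude on binary weights need not cost multiplications inside the weight matrix at inference is the identity below. This is only a sketch under assumptions: it takes the factor to be a per-output-channel scale `alpha` applied to sign-binarized weights and folds it out of the matrix product. The abstract does not describe the actual re-parameterization, so `binarize`, `forward_train`, and `forward_inference` are hypothetical.

```python
import numpy as np

def binarize(w):
    """Map latent real weights to {-1, +1}."""
    return np.where(w >= 0, 1.0, -1.0)

def forward_train(w, alpha, x):
    """Training-time view: scaled binary weights times real-valued activations."""
    return (alpha[:, None] * binarize(w)) @ x

def forward_inference(w, alpha, x):
    """Inference-time view: accumulate with pure +/-1 weights, which needs
    only additions and subtractions of the activations, then apply one
    scale per output channel (assumed here to be foldable into the next
    stage, e.g. a threshold)."""
    acc = binarize(w) @ x
    return alpha * acc
```

Both functions compute the same output, since (alpha * sign(W)) x = alpha * (sign(W) x) row by row; the second form simply moves the scale outside the accumulation.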

[135] ETA: Efficiency through Thinking Ahead, A Dual Approach to Self-Driving with Large Models

Shadi Hamdan,Chonghao Sima,Zetong Yang,Hongyang Li,Fatma Güney

Main category: cs.CV

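TL;DR: The paper proposes ETA, an asynchronous system that shifts the intensive computation for the current frame to earlier time steps and runs batched inference, so that a large model can respond promptly at every time step while keeping real-time inference speed.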

Details

Motivation: Self-driving systems need the benefits of large models without losing inference speed, and existing parallel architectures cannot make a large model respond in time to every frame. Method: ETA propagates informative features from the large model's future predictions, extracts current-frame features with a small model, and integrates the two through an action mask mechanism. Result: On the Bench2Drive CARLA Leaderboard-v2 benchmark, ETA raises the driving score to 69.53, an 8% improvement, while keeping near-real-time inference at 50 ms. Conclusion: Through asynchronous design and feature integration, ETA makes efficient use of large models in self-driving systems while preserving real-time performance. Abstract: How can we benefit from large models without sacrificing inference speed, a common dilemma in self-driving systems? A prevalent solution is a dual-system architecture, employing a small model for rapid, reactive decisions and a larger model for slower but more informative analyses. Existing dual-system designs often implement parallel architectures where inference is either directly conducted using the large model at each current frame or retrieved from previously stored inference results. However, these works still struggle to enable large models for a timely response to every online frame. Our key insight is to shift intensive computations of the current frame to previous time steps and perform a batch inference of multiple time steps to make large models respond promptly to each time step. To achieve the shifting, we introduce Efficiency through Thinking Ahead (ETA), an asynchronous system designed to: (1) propagate informative features from the past to the current frame using future predictions from the large model, (2) extract current frame features using a small model for real-time responsiveness, and (3) integrate these dual features via an action mask mechanism that emphasizes action-critical image regions. Evaluated on the Bench2Drive CARLA Leaderboard-v2 benchmark, ETA advances state-of-the-art performance by 8% with a driving score of 69.53 while maintaining a near-real-time inference speed at 50 ms.

[136] SpikeSMOKE: Spiking Neural Networks for Monocular 3D Object Detection with Cross-Scale Gated Coding

Xuemei Chen,Huamin Wang,Hangchi Shen,Shukai Duan,Shiping Wen,Tingwen Huang

Main category: cs.CV

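TL;DR: The paper proposes SpikeSMOKE, a low-power monocular 3D object detection method based on spiking neural networks (SNNs). It strengthens feature representation with a cross-scale gated coding mechanism (CSGC) and uses a lightweight residual block to reduce computation.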

Details

Motivation: As 3D object detection spreads through fields such as autonomous driving, its high energy consumption becomes a growing problem. SNNs are a potential solution because of their low power consumption, but their discrete signals cause information loss and limit feature expressiveness. Method: The SpikeSMOKE architecture introduces a cross-scale gated coding mechanism (CSGC) that combines attention methods with gated filtering to strengthen feature representation, and a lightweight residual block to reduce computation. Result: On KITTI, SpikeSMOKE with CSGC improves detection over the baseline SpikeSMOKE (AP|R11 metric); compared with SMOKE, energy consumption falls by 72.2% on the hard category with only a 4% drop in detection performance. The lightweight SpikeSMOKE-L cuts parameters by 3 times and computation by 10 times relative to SMOKE. Conclusion: SpikeSMOKE offers an effective route to low-power 3D object detection, balancing performance and energy use through CSGC and a lightweight design. Abstract: Low energy consumption for 3D object detection is an important research area because energy consumption rises with the wide application of such detectors in fields such as autonomous driving. Spiking neural networks (SNNs), with their low power consumption, can provide a novel solution for this research. Therefore, we apply SNNs to monocular 3D object detection and propose the SpikeSMOKE architecture in this paper, which is a new attempt at low-power monocular 3D object detection. The discrete signals of SNNs cause information loss and limit their feature expression ability compared with artificial neural networks (ANNs). In order to address this issue, inspired by the filtering mechanism of biological neuronal synapses, we propose a cross-scale gated coding mechanism (CSGC), which can enhance feature representation by combining cross-scale fusion of attentional methods and gated filtering mechanisms. In addition, to reduce the computation and increase the speed of training, we present a novel lightweight residual block that can maintain the spiking computing paradigm and the highest possible detection performance. Compared to the baseline SpikeSMOKE under the 3D Object Detection task, the proposed SpikeSMOKE with CSGC can achieve 11.78 (+2.82, Easy), 10.69 (+3.2, Moderate), and 10.48 (+3.17, Hard) on the KITTI autonomous driving dataset by AP|R11 at 0.7 IoU threshold, respectively. It is important to note that SpikeSMOKE can significantly reduce energy consumption compared to SMOKE. For example, the energy consumption can be reduced by 72.2% on the hard category, while the detection performance is reduced by only 4%. SpikeSMOKE-L (lightweight) can further reduce the amount of parameters by 3 times and computation by 10 times compared to SMOKE.

[137] AssetDropper: Asset Extraction via Diffusion Models with Reward-Driven Optimization

Lanjiong Li,Guanhua Zhao,Lingting Zhu,Zeyu Cai,Lequan Yu,Jian Zhang,Zeyu Wang

Main category: cs.CV

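TL;DR: AssetDropper is a framework for extracting standardized assets from reference images, addressing the difficulty designers face in efficiently extracting high-quality assets from open-world scenes.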

Details

Motivation: Designers need standardized asset libraries, but existing generative models do little to support this. Open-world scenes offer plenty of raw material, yet extracting high-quality assets from them remains hard. Method: AssetDropper extracts a front view of the selected subject and handles perspective distortion and occlusion. It is evaluated with a synthetic dataset and a real-world benchmark, and uses a pre-trained reward model for closed-loop feedback. Result: With reward-driven optimization, AssetDropper achieves state-of-the-art results in asset extraction. Conclusion: AssetDropper gives designers an efficient tool for open-world asset extraction and supports research on related downstream tasks. Abstract: Recent research on generative models has primarily focused on creating product-ready visual outputs; however, designers often favor access to standardized asset libraries, a domain that has yet to be significantly enhanced by generative capabilities. Although open-world scenes provide ample raw materials for designers, efficiently extracting high-quality, standardized assets remains a challenge. To address this, we introduce AssetDropper, the first framework designed to extract assets from reference images, providing artists with an open-world asset palette. Our model adeptly extracts a front view of selected subjects from input images, effectively handling complex scenarios such as perspective distortion and subject occlusion. We establish a synthetic dataset of more than 200,000 image-subject pairs and a real-world benchmark with thousands more for evaluation, facilitating the exploration of future research in downstream tasks. Furthermore, to ensure precise asset extraction that aligns well with the image prompts, we employ a pre-trained reward model to close the loop with feedback. We design the reward model to perform an inverse task that pastes the extracted assets back into the reference sources, which assists training with additional consistency and mitigates hallucination. Extensive experiments show that, with the aid of reward-driven optimization, AssetDropper achieves state-of-the-art results in asset extraction. Project page: AssetDropper.github.io.

[138] ArchiLense: A Framework for Quantitative Analysis of Architectural Styles Based on Vision Large Language Models

Jing Zhong,Jun Yin,Peilin Li,Pengyu Zeng,Miao Zhang,Shuai Lu,Ran Luo

Main category: cs.CV

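TL;DR: The study builds an architectural style dataset, ArchDiffBench, and proposes ArchiLense, a framework based on vision-language models that automatically recognizes and classifies architectural images, addressing the subjectivity and regional bias of traditional methods.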

Details

Motivation: Traditional studies of architectural culture rely on subjective expert interpretation and historical literature, and suffer from regional bias and limited explanatory scope. Method: Build the ArchDiffBench dataset (1,765 high-quality architectural images with style annotations) and develop the ArchiLense framework, which combines computer vision and deep learning techniques for automatic recognition and classification. Result: ArchiLense performs strongly on style recognition, with a 92.4% consistency rate with expert annotations and 84.5% classification accuracy. Conclusion: The approach moves beyond the subjectivity of traditional analysis and gives a more objective and accurate basis for comparative studies of architectural culture. Abstract: Architectural cultures across regions are characterized by stylistic diversity, shaped by historical, social, and technological contexts in addition to geographical conditions. Understanding architectural styles requires the ability to describe and analyze the stylistic features of different architects from various regions through visual observations of architectural imagery. However, traditional studies of architectural culture have largely relied on subjective expert interpretations and historical literature reviews, often suffering from regional biases and limited explanatory scope. To address these challenges, this study proposes three core contributions: (1) We construct a professional architectural style dataset named ArchDiffBench, which comprises 1,765 high-quality architectural images and their corresponding style annotations, collected from different regions and historical periods. (2) We propose ArchiLense, an analytical framework grounded in Vision-Language Models and constructed using the ArchDiffBench dataset. By integrating advanced computer vision techniques, deep learning, and machine learning algorithms, ArchiLense enables automatic recognition, comparison, and precise classification of architectural imagery, producing descriptive language outputs that articulate stylistic differences. (3) Extensive evaluations show that ArchiLense achieves strong performance in architectural style recognition, with a 92.4% consistency rate with expert annotations and 84.5% classification accuracy, effectively capturing stylistic distinctions across images. The proposed approach transcends the subjectivity inherent in traditional analyses and offers a more objective and accurate perspective for comparative studies of architectural culture.

[139] Flow-Anything: Learning Real-World Optical Flow Estimation from Large-Scale Single-view Images

Yingping Liang,Ying Fu,Yutao Hu,Wenqi Shao,Jiaming Liu,Debing Zhang

Main category: cs.CV

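TL;DR: The Flow-Anything framework generates large-scale realistic optical flow data from single-view images, addressing the domain gap of synthetic datasets and outperforming existing methods.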

Details

Motivation: Optical flow estimators trained on synthetic datasets suffer from a domain gap, which limits robustness in real-world applications. Method: (1) Convert single-view images into a 3D representation with a monocular depth estimation network; (2) develop an Object-Independent Volume Rendering module and a Depth-Aware Inpainting module to generate optical flow data for dynamic objects. Result: Models trained on the generated FA-Flow Dataset outperform existing unsupervised and supervised methods and improve downstream video tasks. Conclusion: Flow-Anything offers a scalable way to generate realistic training data for optical flow estimation, with broad application potential. Abstract: Optical flow estimation is a crucial subfield of computer vision, serving as a foundation for video tasks. However, its real-world robustness is limited by the animated synthetic datasets used for training. This introduces domain gaps when applied to real-world applications and limits the benefits of scaling up datasets. To address these challenges, we propose Flow-Anything, a large-scale data generation framework designed to learn optical flow estimation from any single-view images in the real world. We employ two effective steps to make data scaling-up promising. First, we convert a single-view image into a 3D representation using advanced monocular depth estimation networks. This allows us to render optical flow and novel view images under a virtual camera. Second, we develop an Object-Independent Volume Rendering module and a Depth-Aware Inpainting module to model the dynamic objects in the 3D representation. These two steps allow us to generate realistic datasets for training from large-scale single-view images, namely the FA-Flow Dataset. For the first time, we demonstrate the benefits of generating optical flow training data from large-scale real-world images, outperforming the most advanced unsupervised methods and supervised methods on synthetic datasets. Moreover, our models serve as a foundation model and enhance the performance of various downstream video tasks.

[140] Difference Inversion: Interpolate and Isolate the Difference with Token Consistency for Image Analogy Generation

Hyunsoo Kim,Donghyun Kim,Suhyun Kim

Main category: cs.CV

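TL;DR: The paper proposes Difference Inversion, which extracts the difference between A and A' and applies it to B to generate B', removing the dependence of existing methods on specific models.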

Details

Motivation: Existing methods are usually tied to specific models (such as InstructPix2Pix), which can lead to inherited biases or limited editing capability, so a more general method is needed. Method: Extract the difference via Delta Interpolation, and combine it with a Token Consistency Loss and Zero Initialization of Token Embeddings to produce a Full Prompt suitable for Stable Diffusion models. Result: Experiments show that Difference Inversion outperforms existing baselines both quantitatively and qualitatively and generates more feasible B' in a model-agnostic manner. Conclusion: Difference Inversion is an effective model-agnostic method that accurately extracts and applies image differences. Abstract: How can we generate an image B' that satisfies A:A'::B:B', given the input images A, A' and B? Recent works have tackled this challenge through approaches like visual in-context learning or visual instruction. However, these methods are typically limited to specific models (e.g., InstructPix2Pix, inpainting models) rather than general diffusion models (e.g., Stable Diffusion, SDXL). This dependency may lead to inherited biases or lower editing capabilities. In this paper, we propose Difference Inversion, a method that isolates only the difference from A and A' and applies it to B to generate a plausible B'. To address model dependency, it is crucial to structure prompts in the form of a "Full Prompt" suitable for input to stable diffusion models, rather than using an "Instruction Prompt". To this end, we accurately extract the Difference between A and A' and combine it with the prompt of B, enabling a plug-and-play application of the difference. To extract a precise difference, we first identify it through 1) Delta Interpolation. Additionally, to ensure accurate training, we propose the 2) Token Consistency Loss and 3) Zero Initialization of Token Embeddings. Our extensive experiments demonstrate that Difference Inversion outperforms existing baselines both quantitatively and qualitatively, indicating its ability to generate more feasible B' in a model-agnostic manner.

[141] Trend-Aware Fashion Recommendation with Visual Segmentation and Semantic Similarity

Mohamed Djilani,Nassim Ali Ousalah,Nidhal Eddine Chenni

Main category: cs.CV

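TL;DR: The paper proposes a fashion recommendation system that combines visual features, semantic segmentation, and user behavior simulation, generating recommendations with a weighted scoring function; experiments show it outperforms baseline methods.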

Details

Motivation: To fuse visual features with user behavior simulation in fashion recommendation, balancing personal style against popular trends. Method: Uses semantic segmentation to isolate garment regions, extracts visual embeddings with pretrained CNNs, and computes recommendation scores with the help of simulated user behavior. Result: On the DeepFashion dataset, ResNet-50 achieves 64.95% category similarity and the lowest popularity MAE. Conclusion: The method provides a scalable framework for personalized fashion recommendation that balances individual style with popular trends. Abstract: We introduce a trend-aware and visually-grounded fashion recommendation system that integrates deep visual representations, garment-aware segmentation, semantic category similarity and user behavior simulation. Our pipeline extracts focused visual embeddings by masking non-garment regions via semantic segmentation followed by feature extraction using pretrained CNN backbones (ResNet-50, DenseNet-121, VGG16). To simulate realistic shopping behavior, we generate synthetic purchase histories influenced by user-specific trendiness and item popularity. Recommendations are computed using a weighted scoring function that fuses visual similarity, semantic coherence and popularity alignment. Experiments on the DeepFashion dataset demonstrate consistent gender alignment and improved category relevance, with ResNet-50 achieving 64.95% category similarity and lowest popularity MAE. An ablation study confirms the complementary roles of visual and popularity cues. Our method provides a scalable framework for personalized fashion recommendations that balances individual style with emerging trends. Our implementation is available at https://github.com/meddjilani/FashionRecommender
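
The weighted scoring function mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the weights (0.5, 0.3, 0.2), and the definition of popularity alignment as one minus the absolute gap between user trendiness and item popularity are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommendation_score(query_emb, item_emb, category_sim,
                         user_trendiness, item_popularity,
                         weights=(0.5, 0.3, 0.2)):
    """Fuse visual, semantic, and popularity cues into one score.

    The weights and the popularity-alignment formula are hypothetical;
    trendiness and popularity are assumed to lie in [0, 1].
    """
    w_vis, w_sem, w_pop = weights
    visual = cosine(query_emb, item_emb)
    popularity_alignment = 1.0 - abs(user_trendiness - item_popularity)
    return w_vis * visual + w_sem * category_sim + w_pop * popularity_alignment


# Rank two candidate items for a moderately trend-following user.
query = [1.0, 0.0]
similar_item = recommendation_score(query, [1.0, 0.0], 1.0, 0.5, 0.5)
dissimilar_item = recommendation_score(query, [0.0, 1.0], 0.2, 0.5, 0.9)
```

Keeping the three cues as separate terms makes an ablation (setting one weight to zero) straightforward, which matches the kind of ablation study the abstract describes.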

[142] Language-Vision Planner and Executor for Text-to-Visual Reasoning

Yichang Xu,Gaowen Liu,Ramana Rao Kompella,Sihao Hu,Tiansheng Huang,Fatih Ilhan,Selim Furkan Tekin,Zachary Yahn,Ling Liu

Main category: cs.CV

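TL;DR: VLAgent is a multimodal visual-text reasoning system that uses in-context learning to have an LLM generate step-by-step reasoning plans and combines this with neuro-symbolic modules for execution verification, significantly improving reasoning performance.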

Details

Motivation: Existing vision-language models (VLMs) generalize poorly; VLAgent aims to address this by combining planning with execution verification. Method: VLAgent works in two phases: in the task planning phase, it fine-tunes an LLM through in-context learning to generate a step-by-step plan; in the execution phase, it refines reasoning results through neuro-symbolic modules. Its distinctive design elements are in-context-learning-based plan optimization, a syntax-semantics parser, and an ensemble method. Result: On four visual reasoning benchmarks (GQA, MME, NLVR2, VQAv2), VLAgent outperforms existing VLMs and LLM-based visual composition approaches such as ViperGPT and VisProg. Conclusion: VLAgent significantly improves multimodal reasoning performance through its optimization modules (SS-Parser, Plan Repairer, Output Verifiers); code and data will be released upon paper acceptance. Abstract: The advancement in large language models (LLMs) and large vision models has fueled the rapid progress in multi-modal visual-text reasoning capabilities. However, existing vision-language models (VLMs) to date suffer from poor generalization performance. Inspired by recent development in LLMs for visual reasoning, this paper presents VLAgent, an AI system that can create a step-by-step visual reasoning plan with an easy-to-understand script and execute each step of the plan in real time by integrating planning script with execution verifications via an automated process supported by VLAgent. In the task planning phase, VLAgent fine-tunes an LLM through in-context learning to generate a step-by-step planner for each user-submitted text-visual reasoning task. During the plan execution phase, VLAgent progressively refines the composition of neuro-symbolic executable modules to generate high-confidence reasoning results. VLAgent has three unique design characteristics: First, we improve the quality of plan generation through in-context learning, improving logic reasoning by reducing erroneous logic steps, incorrect programs, and LLM hallucinations. Second, we design a syntax-semantics parser to identify and correct additional logic errors of the LLM-generated planning script prior to launching the plan executor. Finally, we employ the ensemble method to improve the generalization performance of our step-executor. 
Extensive experiments with four visual reasoning benchmarks (GQA, MME, NLVR2, VQAv2) show that VLAgent achieves significant performance enhancement for multimodal text-visual reasoning applications, compared to the existing representative VLMs and LLM-based visual composition approaches like ViperGPT and VisProg, thanks to the novel optimization modules of VLAgent back-engine (SS-Parser, Plan Repairer, Output Verifiers). Code and data will be made available upon paper acceptance.

[143] Design and Evaluation of Deep Learning-Based Dual-Spectrum Image Fusion Methods

Beining Xu,Junxian Li

Main category: cs.CV

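TL;DR: This paper builds a high-quality dual-spectrum dataset and proposes a comprehensive evaluation framework for fusing visible and infrared images. Experiments show that fusion models optimized for downstream tasks perform better in object detection.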

Details

Motivation: Current fusion methods lack standardized evaluation and high-quality datasets, which hinders progress. Method: Builds a dataset of 1,369 visible-infrared image pairs and proposes a comprehensive evaluation framework that combines fusion speed, general metrics, and object detection performance. Result: Experiments show that fusion models optimized for downstream tasks perform better in low-light and occluded scenes, while algorithms that score well on general metrics are not necessarily suitable for downstream tasks. Conclusion: The dataset and evaluation framework provide an important reference for future research and expose the limitations of current evaluation practices. Abstract: Visible images offer rich texture details, while infrared images emphasize salient targets. Fusing these complementary modalities enhances scene understanding, particularly for advanced vision tasks under challenging conditions. Recently, deep learning-based fusion methods have gained attention, but current evaluations primarily rely on general-purpose metrics without standardized benchmarks or downstream task performance. Additionally, the lack of well-developed dual-spectrum datasets and fair algorithm comparisons hinders progress. To address these gaps, we construct a high-quality dual-spectrum dataset captured in campus environments, comprising 1,369 well-aligned visible-infrared image pairs across four representative scenarios: daytime, nighttime, smoke occlusion, and underpasses. We also propose a comprehensive and fair evaluation framework that integrates fusion speed, general metrics, and object detection performance using the lang-segment-anything model to ensure fairness in downstream evaluation. Extensive experiments benchmark several state-of-the-art fusion algorithms under this framework. Results demonstrate that fusion models optimized for downstream tasks achieve superior performance in target detection, especially in low-light and occluded scenes. Notably, some algorithms that perform well on general metrics do not translate to strong downstream performance, highlighting limitations of current evaluation practices and validating the necessity of our proposed framework. 
The main contributions of this work are: (1) a campus-oriented dual-spectrum dataset with diverse and challenging scenes; (2) a task-aware, comprehensive evaluation framework; and (3) thorough comparative analysis of leading fusion methods across multiple datasets, offering insights for future development.

[144] Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger

Qi Yang,Chenghao Zhang,Lubin Fan,Kun Ding,Jieping Ye,Shiming Xiang

Main category: cs.CV

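TL;DR: This paper proposes RCTS, a multimodal RAG framework that builds a reasoning-context-enriched knowledge base and applies a tree-search re-ranking method, addressing the knowledge scarcity and unstable responses of existing LVLMs on VQA tasks.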

Details

Motivation: In visual question answering, existing methods face a scarcity of reasoning examples in the knowledge base and erratic responses from retrieved knowledge. Method: Proposes the RCTS framework, which builds a reasoning-context-enriched knowledge base and re-ranks retrieved examples with Monte Carlo Tree Search with Heuristic Rewards (MCTS-HR). Result: Achieves state-of-the-art performance on multiple VQA datasets, significantly outperforming in-context learning and vanilla RAG methods. Conclusion: Through high-quality reasoning context and re-ranking, RCTS significantly improves the performance and consistency of LVLMs. Abstract: Recent advancements in Large Vision Language Models (LVLMs) have significantly improved performance in Visual Question Answering (VQA) tasks through multimodal Retrieval-Augmented Generation (RAG). However, existing methods still face challenges, such as the scarcity of knowledge with reasoning examples and erratic responses from retrieved knowledge. To address these issues, in this study, we propose a multimodal RAG framework, termed RCTS, which enhances LVLMs by constructing a Reasoning Context-enriched knowledge base and a Tree Search re-ranking method. Specifically, we introduce a self-consistent evaluation mechanism to enrich the knowledge base with intrinsic reasoning patterns. We further propose a Monte Carlo Tree Search with Heuristic Rewards (MCTS-HR) to prioritize the most relevant examples. This ensures that LVLMs can leverage high-quality contextual reasoning for better and more consistent responses. Extensive experiments demonstrate that our framework achieves state-of-the-art performance on multiple VQA datasets, significantly outperforming In-Context Learning (ICL) and Vanilla-RAG methods. It highlights the effectiveness of our knowledge base and re-ranking method in improving LVLMs. Our code is available at https://github.com/yannqi/RCTS-RAG.

[145] Image Reconstruction as a Tool for Feature Analysis

Eduard Allakhverdov,Dmitrii Tarasov,Elizaveta Goncharova,Andrey Kuznetsov

Main category: cs.CV

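TL;DR: This paper proposes a new method for interpreting vision features through image reconstruction. Comparing the SigLIP and SigLIP2 models, it finds that encoders pre-trained on image-based tasks retain more image information. The method applies to any vision encoder and reveals the internal structure of its feature space.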

Details

Motivation: Although vision encoders perform well in applications, how they represent features internally remains unclear, so a method for interpreting these features is needed. Method: Uses image reconstruction to compare vision encoders trained with different objectives (such as SigLIP and SigLIP2) and analyzes the information content and structure of their feature spaces. Result: Encoders pre-trained on image-based tasks retain more image information; rotations in feature space affect color encoding. Conclusion: The method offers a new perspective on the feature representations of vision encoders; the code and models are open-sourced. Abstract: Vision encoders are increasingly used in modern applications, from vision-only models to multimodal systems such as vision-language models. Despite their remarkable success, it remains unclear how these architectures represent features internally. Here, we propose a novel approach for interpreting vision features via image reconstruction. We compare two related model families, SigLIP and SigLIP2, which differ only in their training objective, and show that encoders pre-trained on image-based tasks retain significantly more image information than those trained on non-image tasks such as contrastive learning. We further apply our method to a range of vision encoders, ranking them by the informativeness of their feature representations. Finally, we demonstrate that manipulating the feature space yields predictable changes in reconstructed images, revealing that orthogonal rotations (rather than spatial transformations) control color encoding. Our approach can be applied to any vision encoder, shedding light on the inner structure of its feature space. The code and model weights to reproduce the experiments are available on GitHub.

Weilei Wen,Tianyi Zhang,Qianqian Zhao,Zhaohui Zheng,Chunle Guo,Xiuli Shao,Chongyi Li

Main category: cs.CV

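TL;DR: Proposes UGTSR, a super-resolution framework based on uncertainty guidance and Top-k codebook matching, addressing the inaccurate feature matching and poor texture detail reconstruction of existing methods.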

Details

Motivation: Existing codebook-based super-resolution methods fall short in feature matching and texture detail reconstruction, which degrades image quality. Method: The UGTSR framework has three key components: an uncertainty learning mechanism, a Top-k feature matching strategy, and an Align-Attention module. Result: Experiments show that UGTSR significantly outperforms existing methods in texture realism and reconstruction fidelity. Conclusion: By improving feature matching and texture reconstruction, UGTSR raises the quality of super-resolved images. Abstract: Recent advancements in codebook-based real image super-resolution (SR) have shown promising results in real-world applications. The core idea involves matching high-quality image features from a codebook based on low-resolution (LR) image features. However, existing methods face two major challenges: inaccurate feature matching with the codebook and poor texture detail reconstruction. To address these issues, we propose a novel Uncertainty-Guided and Top-k Codebook Matching SR (UGTSR) framework, which incorporates three key components: (1) an uncertainty learning mechanism that guides the model to focus on texture-rich regions, (2) a Top-k feature matching strategy that enhances feature matching accuracy by fusing multiple candidate features, and (3) an Align-Attention module that enhances the alignment of information between LR and HR features. Experimental results demonstrate significant improvements in texture realism and reconstruction fidelity compared to existing methods. We will release the code upon formal publication.
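
The Top-k feature matching idea in this abstract (fusing several candidate codebook entries instead of taking only the nearest one) can be sketched as below. This is an illustrative sketch under assumptions: the squared Euclidean distance, the softmax weighting over negative distances, and the `temperature` parameter are choices made for the example, and the abstract does not say how UGTSR fuses its candidates.

```python
import math


def topk_codebook_match(feature, codebook, k=2, temperature=1.0):
    """Return (indices, weights, fused) for the k nearest codebook entries.

    Distances are squared Euclidean; weights are a softmax over negative
    distances. Both choices are assumptions made for this sketch.
    """
    dists = [sum((f - c) ** 2 for f, c in zip(feature, code))
             for code in codebook]
    # Indices of the k closest entries (ties keep codebook order).
    nearest = sorted(range(len(codebook)), key=lambda i: dists[i])[:k]
    logits = [-dists[i] / temperature for i in nearest]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = [sum(w * codebook[i][d] for w, i in zip(weights, nearest))
             for d in range(len(feature))]
    return nearest, weights, fused


codebook = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]]
indices, weights, fused = topk_codebook_match([0.5, 0.0], codebook, k=2)
```

With `k=1` the function reduces to ordinary nearest-neighbor codebook lookup, which is the baseline behavior the Top-k strategy is meant to improve on.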

[147] Looking Beyond Visible Cues: Implicit Video Question Answering via Dual-Clue Reasoning

Tieyuan Chen,Huabin Liu,Yi Wang,Chaofan Gan,Mingxi Lyu,Gui Zou,Weiyao Lin

Main category: cs.CV

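TL;DR: This paper introduces a new task and dataset, I-VQA, focused on answering questions for which explicit visual evidence is not directly available, and proposes IRM, a dual-stream reasoning framework that outperforms existing methods.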

Details

Motivation: Existing VideoQA methods rely on explicit visual evidence and perform poorly on questions about symbolic meaning or deeper intentions, so the case of implicit visual evidence needs to be addressed. Method: Proposes the IRM framework, consisting of an Action-Intent Module (AIM) and a Visual Enhancement Module (VEM), which reasons through dual-stream modeling of contextual action and intent clues. Result: On I-VQA, IRM outperforms GPT-4o, OpenAI-o3, and fine-tuned VideoChat2 by 0.76%, 1.37%, and 4.87%, respectively. Conclusion: IRM offers an effective solution for questions with implicit visual evidence and achieves SOTA performance on related tasks. Abstract: Video Question Answering (VideoQA) aims to answer natural language questions based on the given video, with prior work primarily focusing on identifying the duration of relevant segments, referred to as explicit visual evidence. However, explicit visual evidence is not always directly available, particularly when questions target symbolic meanings or deeper intentions, leading to significant performance degradation. To fill this gap, we introduce a novel task and dataset, Implicit Video Question Answering (I-VQA), which focuses on answering questions in scenarios where explicit visual evidence is inaccessible. Given an implicit question and its corresponding video, I-VQA requires answering based on the contextual visual cues present within the video. To tackle I-VQA, we propose a novel reasoning framework, IRM (Implicit Reasoning Model), incorporating dual-stream modeling of contextual actions and intent clues as implicit reasoning chains. IRM comprises the Action-Intent Module (AIM) and the Visual Enhancement Module (VEM). AIM deduces and preserves question-related dual clues by generating clue candidates and performing relation deduction. VEM enhances contextual visual representation by leveraging key contextual clues. Extensive experiments validate the effectiveness of our IRM in I-VQA tasks, outperforming GPT-4o, OpenAI-o3, and fine-tuned VideoChat2 by 0.76%, 1.37%, and 4.87%, respectively. Additionally, IRM achieves SOTA performance on similar implicit advertisement understanding and future prediction in traffic-VQA. Datasets and code are available for double-blind review in an anonymous repo: https://github.com/tychen-SJTU/Implicit-VideoQA.

[148] Self-Cascaded Diffusion Models for Arbitrary-Scale Image Super-Resolution

Junseo Bang,Joonhee Lee,Kyeonghyun Lee,Haechang Lee,Dong Un Kang,Se Young Chun

Main category: cs.CV

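TL;DR: CasArbi is a novel self-cascaded diffusion framework for arbitrary-scale image super-resolution that raises resolution progressively in stages using a coordinate-guided residual diffusion model, with strong results on multiple benchmarks.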

Details

Motivation: Traditional fixed-scale super-resolution lacks flexibility, and existing single-stage upsampling methods struggle to cover a wide, continuous distribution of scales. Method: CasArbi uses a self-cascaded diffusion framework that breaks an arbitrary scaling demand into smaller sequential factors and raises resolution step by step with a coordinate-guided residual diffusion model. Result: On multiple arbitrary-scale super-resolution benchmarks, CasArbi outperforms existing methods on both perceptual and distortion metrics. Conclusion: Through staged diffusion and coordinate-guided residual learning, CasArbi achieves efficient arbitrary-scale super-resolution and suggests a new direction for future research. Abstract: Arbitrary-scale image super-resolution aims to upsample images to any desired resolution, offering greater flexibility than traditional fixed-scale super-resolution. Recent approaches in this domain utilize regression-based or generative models, but many of them use a single-stage upsampling process, which may be challenging to learn across a wide, continuous distribution of scaling factors. Progressive upsampling strategies have shown promise in mitigating this issue, yet their integration with diffusion models for flexible upscaling remains underexplored. Here, we present CasArbi, a novel self-cascaded diffusion framework for arbitrary-scale image super-resolution. CasArbi meets the varying scaling demands by breaking them down into smaller sequential factors and progressively enhancing the image resolution at each step with seamless transitions for arbitrary scales. Our novel coordinate-guided residual diffusion model allows for the learning of continuous image representations while enabling efficient diffusion sampling. Extensive experiments demonstrate that our CasArbi outperforms prior art in both perceptual and distortion performance metrics across diverse arbitrary-scale super-resolution benchmarks.
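
The idea of breaking a scaling demand into smaller sequential factors can be illustrated with a short sketch. The abstract does not state how CasArbi chooses its factors, so the greedy rule below and the per-step cap `max_step` are assumptions made purely for illustration.

```python
def cascade_factors(scale, max_step=2.0):
    """Split an overall upscaling factor into sequential per-step factors.

    Hypothetical rule: apply `max_step` repeatedly, then one final step
    with whatever factor remains. The product of the factors equals `scale`.
    """
    if scale < 1.0:
        raise ValueError("scale must be >= 1 for upsampling")
    if max_step <= 1.0:
        raise ValueError("max_step must be > 1")
    factors = []
    remaining = scale
    while remaining > max_step:
        factors.append(max_step)
        remaining /= max_step
    # Keep the final step unless it would be a no-op after earlier steps.
    if remaining > 1.0 or not factors:
        factors.append(remaining)
    return factors


steps = cascade_factors(5.0)   # a x5 request becomes three smaller steps
```

Each stage then only has to learn upsampling within a narrow range of factors, which is the motivation the abstract gives for progressive over single-stage upsampling.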

[149] M2Restore: Mixture-of-Experts-based Mamba-CNN Fusion Framework for All-in-One Image Restoration

Yongzhen Wang,Yongjun Li,Zhuoran Zheng,Xiao-Ping Zhang,Mingqiang Wei

Main category: cs.CV

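TL;DR: M2Restore is a Mixture-of-Experts-based Mamba-CNN fusion framework for efficient, robust all-in-one image restoration, addressing existing methods' weak generalization across dynamic degradation scenarios and their poor balance between local detail and global dependencies.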

Details

Motivation: Natural images are often affected by composite degradations (such as rain, snow, and haze); existing methods generalize poorly across dynamic degradation scenarios and struggle to balance local detail with global dependencies. Method: Proposes the M2Restore framework, comprising a CLIP-guided MoE gating mechanism, a dual-stream architecture (fusing CNN and Mamba), and an edge-aware dynamic gating mechanism. Result: On multiple image restoration benchmarks, M2Restore is superior in both visual quality and quantitative performance. Conclusion: Through its architectural design, M2Restore significantly improves generalization and detail preservation in image restoration. Abstract: Natural images are often degraded by complex, composite degradations such as rain, snow, and haze, which adversely impact downstream vision applications. While existing image restoration efforts have achieved notable success, they are still hindered by two critical challenges: limited generalization across dynamically varying degradation scenarios and a suboptimal balance between preserving local details and modeling global dependencies. To overcome these challenges, we propose M2Restore, a novel Mixture-of-Experts (MoE)-based Mamba-CNN fusion framework for efficient and robust all-in-one image restoration. M2Restore introduces three key contributions: First, to boost the model's generalization across diverse degradation conditions, we exploit a CLIP-guided MoE gating mechanism that fuses task-conditioned prompts with CLIP-derived semantic priors. This mechanism is further refined via cross-modal feature calibration, which enables precise expert selection for various degradation types. Second, to jointly capture global contextual dependencies and fine-grained local details, we design a dual-stream architecture that integrates the localized representational strength of CNNs with the long-range modeling efficiency of Mamba. This integration enables collaborative optimization of global semantic relationships and local structural fidelity, preserving global coherence while enhancing detail restoration. Third, we introduce an edge-aware dynamic gating mechanism that adaptively balances global modeling and local enhancement by reallocating computational attention to degradation-sensitive regions. This targeted focus leads to more efficient and precise restoration. 
Extensive experiments across multiple image restoration benchmarks validate the superiority of M2Restore in both visual quality and quantitative performance.

[150] R3D2: Realistic 3D Asset Insertion via Diffusion for Autonomous Driving Simulation

William Ljungbergh,Bernardo Taveira,Wenzhao Zheng,Adam Tonderski,Chensheng Peng,Fredrik Kahl,Christoffer Petersson,Michael Felsberg,Kurt Keutzer,Masayoshi Tomizuka,Wei Zhan

Main category: cs.CV

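TL;DR: R3D2 is a lightweight diffusion model for realistic 3D asset insertion in autonomous driving validation, addressing the dynamic object manipulation and reusability limitations of neural reconstruction methods.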

Details

Motivation: Validating autonomous driving systems requires diverse, safety-critical testing; traditional simulation platforms are resource-intensive and suffer from a domain gap, while neural reconstruction methods handle dynamic object manipulation and reusability poorly. Method: R3D2 is trained on a novel dataset (3D assets generated as 3D Gaussian Splatting objects) and produces realistic rendering effects, such as shadows and consistent lighting, in real time, supporting text-to-3D asset insertion and cross-scene object transfer. Result: Quantitative and qualitative evaluations show that R3D2 significantly improves the realism of inserted assets and supports scalable autonomous driving validation. Conclusion: R3D2 provides a scalable, realistic simulation solution for autonomous driving validation; the dataset and code will be released to promote further research. Abstract: Validating autonomous driving (AD) systems requires diverse and safety-critical testing, making photorealistic virtual environments essential. Traditional simulation platforms, while controllable, are resource-intensive to scale and often suffer from a domain gap with real-world data. In contrast, neural reconstruction methods like 3D Gaussian Splatting (3DGS) offer a scalable solution for creating photorealistic digital twins of real-world driving scenes. However, they struggle with dynamic object manipulation and reusability as their per-scene optimization-based methodology tends to result in incomplete object models with integrated illumination effects. This paper introduces R3D2, a lightweight, one-step diffusion model designed to overcome these limitations and enable realistic insertion of complete 3D assets into existing scenes by generating plausible rendering effects, such as shadows and consistent lighting, in real time. This is achieved by training R3D2 on a novel dataset: 3DGS object assets are generated from in-the-wild AD data using an image-conditioned 3D generative model, and then synthetically placed into neural rendering-based virtual environments, allowing R3D2 to learn realistic integration. Quantitative and qualitative evaluations demonstrate that R3D2 significantly enhances the realism of inserted assets, enabling use-cases like text-to-3D asset insertion and cross-scene/dataset object transfer, allowing for true scalability in AD validation. To promote further research in scalable and realistic AD simulation, we will release our dataset and code, see https://research.zenseact.com/publications/R3D2/.

[151] Diffusion models under low-noise regime

Elizabeth Pavlova,Xue-Xin Wei

Main category: cs.CV

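TL;DR: A study of diffusion model behavior under low-noise conditions, showing how training set size, data geometry, and model objective shape denoising trajectories.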

Details

Motivation: To explore how diffusion models behave under small noise and fill a gap in understanding model reliability in practical applications. Method: Uses CelebA subsets and Gaussian mixture benchmarks to systematically study model behavior under low-noise diffusion dynamics. Result: Models diverge near the data manifold, and training set size and data geometry affect denoising trajectories and score accuracy. Conclusion: The study offers new insight into the reliability and learning mechanisms of diffusion models in practical applications. Abstract: Recent work on diffusion models proposed that they operate in two regimes: memorization, in which models reproduce their training data, and generalization, in which they generate novel samples. While this has been tested in high-noise settings, the behavior of diffusion models as effective denoisers when the corruption level is small remains unclear. To address this gap, we systematically investigated the behavior of diffusion models under low-noise diffusion dynamics, with implications for model robustness and interpretability. Using (i) CelebA subsets of varying sample sizes and (ii) analytic Gaussian mixture benchmarks, we reveal that models trained on disjoint data diverge near the data manifold even when their high-noise outputs converge. We quantify how training set size, data geometry, and model objective choice shape denoising trajectories and affect score accuracy, providing insights into how these models actually learn representations of data distributions. This work starts to address gaps in our understanding of generative model reliability in practical applications where small perturbations are common.

[152] F2Net: A Frequency-Fused Network for Ultra-High Resolution Remote Sensing Segmentation

Hengzhi Chen,Liqian Feng,Wenhua Wu,Xiaogang Zhu,Shawn Leo,Kun Hu

Main category: cs.CV

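TL;DR: F2Net is a frequency-aware framework that decomposes ultra-high-resolution remote sensing images into high- and low-frequency components for specialized processing, resolving the trade-off in conventional methods between losing detail and fragmenting global context.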

Details

Motivation: Semantic segmentation of ultra-high-resolution remote sensing imagery is critical for environmental monitoring and urban planning, but conventional methods lose detail or fragment global context, and multi-branch networks suffer from computational inefficiency and gradient conflicts. Method: F2Net decomposes images into high- and low-frequency components; the high-frequency branch preserves full-resolution structural detail, the low-frequency branch captures short- and long-range dependencies through dual sub-branches, and a Hybrid-Frequency Fusion module integrates the results. Result: On the DeepGlobe and Inria Aerial benchmarks, F2Net achieves mIoU of 80.22 and 83.39, respectively, the best reported results. Conclusion: Through frequency decomposition and its fusion module, F2Net addresses the challenges of ultra-high-resolution image segmentation, achieving high performance and stable training. Abstract: Semantic segmentation of ultra-high-resolution (UHR) remote sensing imagery is critical for applications like environmental monitoring and urban planning but faces computational and optimization challenges. Conventional methods either lose fine details through downsampling or fragment global context via patch processing. While multi-branch networks address this trade-off, they suffer from computational inefficiency and conflicting gradient dynamics during training. We propose F2Net, a frequency-aware framework that decomposes UHR images into high- and low-frequency components for specialized processing. The high-frequency branch preserves full-resolution structural details, while the low-frequency branch processes downsampled inputs through dual sub-branches capturing short- and long-range dependencies. A Hybrid-Frequency Fusion module integrates these observations, guided by two novel objectives: Cross-Frequency Alignment Loss ensures semantic consistency between frequency components, and Cross-Frequency Balance Loss regulates gradient magnitudes across branches to stabilize training. Evaluated on DeepGlobe and Inria Aerial benchmarks, F2Net achieves state-of-the-art performance with mIoU of 80.22 and 83.39, respectively. Our code will be publicly available.
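
A generic way to split a signal into low- and high-frequency components is to low-pass filter it and keep the residual. The sketch below does this on a 1D signal with a box (moving-average) filter; the abstract does not specify which decomposition F2Net uses, so the filter choice and the `radius` parameter are assumptions for illustration only.

```python
def split_frequencies(signal, radius=1):
    """Split a 1D signal into (low, high) with low + high == signal.

    `low` is a moving average over a window of up to 2*radius + 1 samples
    (truncated at the borders); `high` is the residual detail.
    """
    n = len(signal)
    low = []
    for i in range(n):
        start, stop = max(0, i - radius), min(n, i + radius + 1)
        window = signal[start:stop]
        low.append(sum(window) / len(window))
    high = [s - l for s, l in zip(signal, low)]
    return low, high


# A flat region followed by a sharp edge: the edge shows up in `high`.
row = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
low, high = split_frequencies(row)
```

Because the high-frequency part is defined as a residual, the two components always sum back to the input, so no information is lost by routing them to separate branches.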

Teng Hu,Zhentao Yu,Zhengguang Zhou,Jiangning Zhang,Yuan Zhou,Qinglin Lu,Ran Yi

Main category: cs.CV

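TL;DR: PolyVivid is a multi-subject video customization framework that achieves identity consistency and subject interaction through text-image fusion and a 3D-RoPE enhancement module, outperforming existing methods.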

Details

Motivation: Existing video generation models lack fine-grained control and identity consistency in multi-subject customization. Method: Designs a VLLM-based text-image fusion module, a 3D-RoPE-based enhancement module, and an attention-inherited identity injection module, and builds an MLLM-based data pipeline. Result: Experiments show that PolyVivid performs strongly in identity fidelity, video realism, and subject alignment. Conclusion: PolyVivid achieves high-quality multi-subject video generation through the cooperation of its modules. Abstract: Despite recent advances in video generation, existing models still lack fine-grained controllability, especially for multi-subject customization with consistent identity and interaction. In this paper, we propose PolyVivid, a multi-subject video customization framework that enables flexible and identity-consistent generation. To establish accurate correspondences between subject images and textual entities, we design a VLLM-based text-image fusion module that embeds visual identities into the textual space for precise grounding. To further enhance identity preservation and subject interaction, we propose a 3D-RoPE-based enhancement module that enables structured bidirectional fusion between text and image embeddings. Moreover, we develop an attention-inherited identity injection module to effectively inject fused identity features into the video generation process, mitigating identity drift. Finally, we construct an MLLM-based data pipeline that combines MLLM-based grounding, segmentation, and a clique-based subject consolidation strategy to produce high-quality multi-subject data, effectively enhancing subject distinction and reducing ambiguity in downstream video generation. Extensive experiments demonstrate that PolyVivid achieves superior performance in identity fidelity, video realism, and subject alignment, outperforming existing open-source and commercial baselines.

[154] SAM2Auto: Auto Annotation Using FLASH

Arash Rocky,Q. M. Jonathan Wu

Main category: cs.CV

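TL;DR: SAM2Auto is a fully automated video dataset annotation system that requires no human intervention or dataset-specific training; by combining object detection with video instance segmentation, it substantially reduces annotation time and cost.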

Details

Motivation: To address the lag in vision-language model (VLM) development caused by the scarcity of annotated datasets, given that traditional annotation is slow and expensive. Method: Uses SMART-OD (combining automatic mask generation with open-world object detection) and FLASH (multi-object real-time video instance segmentation) to keep object identification consistent across frames. Result: Experiments show that SAM2Auto matches manual annotation in accuracy while greatly reducing annotation time and cost, without retraining or parameter tuning. Conclusion: SAM2Auto sets a new baseline for automated video annotation, addresses the dataset bottleneck in vision-language understanding, and accelerates VLM development. Abstract: Vision-Language Models (VLMs) lag behind Large Language Models due to the scarcity of annotated datasets, as creating paired visual-textual annotations is labor-intensive and expensive. To address this bottleneck, we introduce SAM2Auto, the first fully automated annotation pipeline for video datasets requiring no human intervention or dataset-specific training. Our approach consists of two key components: SMART-OD, a robust object detection system that combines automatic mask generation with open-world object detection capabilities, and FLASH (Frame-Level Annotation and Segmentation Handler), a multi-object real-time video instance segmentation (VIS) that maintains consistent object identification across video frames even with intermittent detection gaps. Unlike existing open-world detection methods that require frame-specific hyperparameter tuning and suffer from numerous false positives, our system employs statistical approaches to minimize detection errors while ensuring consistent object tracking throughout entire video sequences. Extensive experimental validation demonstrates that SAM2Auto achieves comparable accuracy to manual annotation while dramatically reducing annotation time and eliminating labor costs. The system successfully handles diverse datasets without requiring retraining or extensive parameter adjustments, making it a practical solution for large-scale dataset creation. Our work establishes a new baseline for automated video annotation and provides a pathway for accelerating VLM development by addressing the fundamental dataset bottleneck that has constrained progress in vision-language understanding.

[155] LogoSP: Local-global Grouping of Superpoints for Unsupervised Semantic Segmentation of 3D Point Clouds

Zihui Zhang,Weisheng Dai,Hongtao Wen,Bo Yang

Main category: cs.CV

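TL;DR: LogoSP is an unsupervised 3D semantic segmentation method that learns semantics from both local and global point features, using global patterns in the frequency domain to generate highly accurate pseudo-labels, and significantly outperforms existing methods.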

Details

Motivation: Existing unsupervised methods rely only on local features and do not explore richer semantic priors. Method: Proposes LogoSP, which groups superpoints by their global patterns in the frequency domain to generate semantic pseudo-labels for training a segmentation network. Result: On two indoor datasets and one outdoor dataset, LogoSP surpasses existing methods by large margins and achieves state-of-the-art performance. Conclusion: LogoSP captures meaningful 3D semantics without supervision, validating the usefulness of global patterns. Abstract: We study the problem of unsupervised 3D semantic segmentation on raw point clouds without needing human labels in training. Existing methods usually formulate this problem into learning per-point local features followed by a simple grouping strategy, lacking the ability to discover additional and possibly richer semantic priors beyond local features. In this paper, we introduce LogoSP to learn 3D semantics from both local and global point features. The key to our approach is to discover 3D semantic information by grouping superpoints according to their global patterns in the frequency domain, thus generating highly accurate semantic pseudo-labels for training a segmentation network. Extensive experiments on two indoor datasets and one outdoor dataset show that our LogoSP surpasses all existing unsupervised methods by large margins, achieving the state-of-the-art performance for unsupervised 3D semantic segmentation. Notably, our investigation into the learned global patterns reveals that they truly represent meaningful 3D semantics in the absence of human labels during training.

[156] Egocentric Event-Based Vision for Ping Pong Ball Trajectory Prediction

Ivan Alberico,Marco Cannici,Giovanni Cioffi,Davide Scaramuzza

Main category: cs.CV

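TL;DR: Proposes a real-time ping pong ball trajectory prediction system based on event cameras, using high temporal resolution and eye-gaze data to improve detection and prediction performance.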

Details

Motivation: Conventional cameras suffer from latency and motion blur at high ball speeds, whereas event cameras offer higher temporal resolution and suit real-time trajectory prediction. Method: Combines an event camera with eye-gaze data from Meta Project Aria glasses and uses foveated vision to allocate resources efficiently and reduce computational latency. Result: The worst-case total latency is only 4.5 ms, more than 14 times lower than the 66 ms worst case of a conventional 30 FPS system, with accurate and efficient trajectory prediction. Conclusion: This is the first egocentric ping pong trajectory prediction approach based on event cameras, offering a new method for real-time motion analysis. Abstract: In this paper, we present a real-time egocentric trajectory prediction system for table tennis using event cameras. Unlike standard cameras, which suffer from high latency and motion blur at fast ball speeds, event cameras provide higher temporal resolution, allowing more frequent state updates, greater robustness to outliers, and accurate trajectory predictions using just a short time window after the opponent's impact. We collect a dataset of ping-pong game sequences, including 3D ground-truth trajectories of the ball, synchronized with sensor data from the Meta Project Aria glasses and event streams. Our system leverages foveated vision, using eye-gaze data from the glasses to process only events in the viewer's fovea. This biologically inspired approach improves ball detection performance and significantly reduces computational latency, as it efficiently allocates resources to the most perceptually relevant regions, achieving a reduction factor of 10.81 on the collected trajectories. Our detection pipeline has a worst-case total latency of 4.5 ms, including computation and perception, significantly lower than a frame-based 30 FPS system, which, in the worst case, takes 66 ms solely for perception. Finally, we fit a trajectory prediction model to the estimated states of the ball, enabling 3D trajectory forecasting in the future. To the best of our knowledge, this is the first approach to predict table tennis trajectories from an egocentric perspective using event cameras.
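
The latency comparison in this abstract can be checked with a few lines of arithmetic. The 4.5 ms and 66 ms figures come from the abstract; reading 66 ms as roughly two frame periods at 30 FPS is an assumption made here, since the abstract does not explain how the worst case is derived.

```python
def frame_period_ms(fps):
    """Time between consecutive frames, in milliseconds."""
    return 1000.0 / fps


event_latency_ms = 4.5      # worst-case total latency reported in the abstract
frame_worst_case_ms = 66.0  # worst-case perception latency at 30 FPS (abstract)

# Ratio between the two reported worst cases.
speedup = frame_worst_case_ms / event_latency_ms

# Assumed interpretation: the worst case is about two frame periods.
two_frames_ms = 2 * frame_period_ms(30)

print(f"speedup: {speedup:.2f}x, two frame periods: {two_frames_ms:.1f} ms")
```

The ratio is about 14.7, which is where the "more than 14 times" figure in the summary comes from.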

[157] VIVAT: Virtuous Improving VAE Training through Artifact Mitigation

Lev Novitskiy,Viacheslav Vasilev,Maria Kovaleva,Vladimir Arkhipkin,Denis Dimitrov

Main category: cs.CV

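TL;DR: VIVAT uses simple modifications (such as loss weight adjustments, padding strategies, and Spatially Conditional Normalization) to substantially reduce common artifacts in KL-VAE training, improving reconstruction and generation quality.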

Details

Motivation: Common artifacts in KL-VAE training (such as color shift and grid patterns) degrade reconstruction and generation quality, calling for a solution that does not depend on complex architectural changes. Method: Proposes VIVAT, which systematically addresses five common artifacts by adjusting loss weights, improving padding strategies, and introducing Spatially Conditional Normalization. Result: Across multiple benchmarks, VIVAT achieves state-of-the-art image reconstruction metrics (PSNR and SSIM) and improves CLIP scores for text-to-image generation. Conclusion: VIVAT addresses the practical challenges of the KL-VAE framework while preserving its simplicity, offering a practical recipe for optimizing VAE training. Abstract: Variational Autoencoders (VAEs) remain a cornerstone of generative computer vision, yet their training is often plagued by artifacts that degrade reconstruction and generation quality. This paper introduces VIVAT, a systematic approach to mitigating common artifacts in KL-VAE training without requiring radical architectural changes. We present a detailed taxonomy of five prevalent artifacts (color shift, grid patterns, blur, corner artifacts, and droplet artifacts) and analyze their root causes. Through straightforward modifications, including adjustments to loss weights, padding strategies, and the integration of Spatially Conditional Normalization, we demonstrate significant improvements in VAE performance. Our method achieves state-of-the-art results in image reconstruction metrics (PSNR and SSIM) across multiple benchmarks and enhances text-to-image generation quality, as evidenced by superior CLIP scores. By preserving the simplicity of the KL-VAE framework while addressing its practical challenges, VIVAT offers actionable insights for researchers and practitioners aiming to optimize VAE training.

[158] FreeGave: 3D Physics Learning from Dynamic Videos by Gaussian Velocity

Jinxi Li,Ziyang Song,Siyuan Zhou,Bo Yang

Main category: cs.CV

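TL;DR: FreeGave introduces a physics code and a divergence-free module to learn the physics of complex dynamic 3D scenes without object priors, outperforming existing methods.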

Details

Motivation: Existing methods for modeling the physics of 3D scenes rely on object priors or inefficient PINN losses and struggle with complex motion at boundaries. Method: Proposes FreeGave, which combines a physics code with a divergence-free module to estimate a per-Gaussian velocity field without PINN losses. Result: Experiments on several datasets show superior performance, especially on future frame extrapolation and motion segmentation. Conclusion: FreeGave learns meaningful 3D physical motion patterns without human labels and has broad application potential. Abstract: In this paper, we aim to model 3D scene geometry, appearance, and the underlying physics purely from multi-view videos. By applying various governing PDEs as PINN losses or incorporating physics simulation into neural networks, existing works often fail to learn complex physical motions at boundaries or require object priors such as masks or types. We propose FreeGave to learn the physics of complex dynamic 3D scenes without needing any object priors. The key to our approach is to introduce a physics code followed by a carefully designed divergence-free module for estimating a per-Gaussian velocity field, without relying on the inefficient PINN losses. Extensive experiments on three public datasets and a newly collected challenging real-world dataset demonstrate the superior performance of our method for future frame extrapolation and motion segmentation. Most notably, our investigation into the learned physics codes reveals that they truly learn meaningful 3D physical motion patterns in the absence of any human labels in training.

[159] Spatio-Temporal State Space Model For Efficient Event-Based Optical Flow

Muhammad Ahmed Humais,Xiaoqian Huang,Hussain Sajwani,Sajid Javed,Yahya Zweiri

Main category: cs.CV

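TL;DR: Proposes an efficient event-camera optical flow method based on a Spatio-Temporal State Space Model (STSSM), substantially reducing computation while keeping performance competitive.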

Details

Motivation: Event cameras are promising for low-latency motion estimation (optical flow), but existing deep learning methods are not computationally efficient enough, while asynchronous event-based methods capture too little spatio-temporal information. Method: Introduces an STSSM module and a new network architecture that use state-space models to capture spatio-temporal correlations in event data efficiently. Result: The model is competitive on the DSEC benchmark, with 4.5x faster inference and 8x lower computation than TMA, and 2x lower computation than EV-FlowNet. Conclusion: STSSM provides an efficient solution for event-based optical flow, with lower cost than existing methods at competitive accuracy. Abstract: Event cameras unlock new frontiers that were previously unthinkable with standard frame-based cameras. One notable example is low-latency motion estimation (optical flow), which is critical for many real-time applications. In such applications, the computational efficiency of algorithms is paramount. Although recent deep learning paradigms such as CNN, RNN, or ViT have shown remarkable performance, they often lack the desired computational efficiency. Conversely, asynchronous event-based methods including SNNs and GNNs are computationally efficient; however, these approaches fail to capture sufficient spatio-temporal information, a powerful feature required to achieve better performance for optical flow estimation. In this work, we introduce a Spatio-Temporal State Space Model (STSSM) module along with a novel network architecture to develop an extremely efficient solution with competitive performance. Our STSSM module leverages state-space models to effectively capture spatio-temporal correlations in event data, offering higher performance with lower complexity compared to ViT- and CNN-based architectures in similar settings. Our model achieves 4.5x faster inference and 8x lower computations compared to TMA and 2x lower computations compared to EV-FlowNet with competitive performance on the DSEC benchmark. Our code will be available at https://github.com/AhmedHumais/E-STMFlow
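
For readers unfamiliar with state-space models, the recurrence they build on can be sketched in a few lines. This is a generic scalar linear SSM, not the STSSM module itself: the abstract does not specify its internals, and the parameter names `a`, `b`, and `c` are assumptions chosen for illustration.

```python
def ssm_scan(xs, a, b, c):
    """Run a scalar linear state-space recurrence over a sequence.

    State update:  h_t = a * h_{t-1} + b * x_t
    Readout:       y_t = c * h_t
    """
    h = 0.0          # initial hidden state
    ys = []
    for x in xs:
        h = a * h + b * x   # fold the new input into the running state
        ys.append(c * h)    # emit an output from the current state
    return ys


# A constant input accumulates in the state with geometric decay.
outputs = ssm_scan([1.0, 1.0, 1.0], a=0.5, b=1.0, c=1.0)
print(outputs)
```

Each step costs a fixed amount of work regardless of sequence length, which is the usual efficiency argument for state-space layers over attention.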

[160] CrosswalkNet: An Optimized Deep Learning Framework for Pedestrian Crosswalk Detection in Aerial Images with High-Performance Computing

Zubin Bhuyan,Yuanchang Xie,AngkeaReach Rith,Xintong Yan,Nasko Apostolov,Jimi Oke,Chengbo Ai

Main category: cs.CV

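TL;DR: CrosswalkNet is an efficient deep learning framework for detecting pedestrian crosswalks in high-resolution aerial images; it uses oriented bounding boxes (OBB) to improve detection precision and performs well on datasets from multiple states.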

Details

Motivation: With aerial and satellite imagery increasingly available, deep learning has great potential for transportation asset management, safety analysis, and urban planning. Method: Proposes the CrosswalkNet framework, combining OBB, attention mechanisms, and optimization techniques, trained on more than 23,000 annotated instances. Result: Achieves 96.5% precision and 93.3% recall on Massachusetts imagery and performs well in other states without fine-tuning. Conclusion: CrosswalkNet is an effective tool for improving pedestrian safety and urban mobility. Abstract: With the increasing availability of aerial and satellite imagery, deep learning presents significant potential for transportation asset management, safety analysis, and urban planning. This study introduces CrosswalkNet, a robust and efficient deep learning framework designed to detect various types of pedestrian crosswalks from 15-cm resolution aerial images. CrosswalkNet incorporates a novel detection approach that improves upon traditional object detection strategies by utilizing oriented bounding boxes (OBB), enhancing detection precision by accurately capturing crosswalks regardless of their orientation. Several optimization techniques, including Convolutional Block Attention, a dual-branch Spatial Pyramid Pooling-Fast module, and cosine annealing, are implemented to maximize performance and efficiency. A comprehensive dataset comprising over 23,000 annotated crosswalk instances is utilized to train and validate the proposed framework. The best-performing model achieves an impressive precision of 96.5% and a recall of 93.3% on aerial imagery from Massachusetts, demonstrating its accuracy and effectiveness. CrosswalkNet has also been successfully applied to datasets from New Hampshire, Virginia, and Maine without transfer learning or fine-tuning, showcasing its robustness and strong generalization capability. Additionally, the crosswalk detection results, processed using High-Performance Computing (HPC) platforms and provided in polygon shapefile format, have been shown to accelerate data processing and detection, supporting real-time analysis for safety and mobility applications. This integration offers policymakers, transportation engineers, and urban planners an effective instrument to enhance pedestrian safety and improve urban mobility.
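
The abstract reports precision and recall separately. As a rough single-number summary, the two can be combined into an F1 score (their harmonic mean); note that this derived value is simple arithmetic on the quoted figures and is not reported in the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall quoted for the Massachusetts imagery.
f1 = f1_score(0.965, 0.933)
print(f"F1 = {f1:.4f}")
```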

[161] EgoM2P: Egocentric Multimodal Multitask Pretraining

Gen Li,Yutong Chen,Yiqian Wu,Kaifeng Zhao,Marc Pollefeys,Siyu Tang

Main category: cs.CV

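TL;DR: The paper proposes EgoM2P, a framework that uses efficient temporal tokenizers and masked modeling to tackle multimodal egocentric vision tasks; it supports multitasking and matches or outperforms specialist models.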

Details

Motivation: Understanding multimodal signals in egocentric vision is essential for augmented reality, robotics, and related fields, but data heterogeneity and the difficulty of generating pseudo-labels for missing modalities limit how far existing methods scale. Method: Introduces efficient temporal tokenizers and the EgoM2P framework, which learns from multimodal temporal tokens through masked modeling and supports multitasking. Result: EgoM2P matches or outperforms specialist models on tasks such as gaze prediction and camera tracking while running faster. Conclusion: EgoM2P offers an efficient, general-purpose solution for egocentric vision research and will be open-sourced to support the community. Abstract: Understanding multimodal signals in egocentric vision, such as RGB video, depth, camera poses, and gaze, is essential for applications in augmented reality, robotics, and human-computer interaction. These capabilities enable systems to better interpret the camera wearer's actions, intentions, and surrounding environment. However, building large-scale egocentric multimodal and multitask models presents unique challenges. Egocentric data are inherently heterogeneous, with large variations in modality coverage across devices and settings. Generating pseudo-labels for missing modalities, such as gaze or head-mounted camera trajectories, is often infeasible, making standard supervised learning approaches difficult to scale. Furthermore, dynamic camera motion and the complex temporal and spatial structure of first-person video pose additional challenges for the direct application of existing multimodal foundation models. To address these challenges, we introduce a set of efficient temporal tokenizers and propose EgoM2P, a masked modeling framework that learns from temporally aware multimodal tokens to train a large, general-purpose model for egocentric 4D understanding. This unified design supports multitasking across diverse egocentric perception and synthesis tasks, including gaze prediction, egocentric camera tracking, and monocular depth estimation from egocentric video. EgoM2P also serves as a generative model for conditional egocentric video synthesis. Across these tasks, EgoM2P matches or outperforms specialist models while being an order of magnitude faster. We will fully open-source EgoM2P to support the community and advance egocentric vision research. Project page: https://egom2p.github.io/

[162] Video Unlearning via Low-Rank Refusal Vector

Simone Facchiano,Stefano Saravalle,Matteo Migliarini,Edoardo De Matteis,Alessio Sampieri,Andrea Pilzer,Emanuele Rodolà,Indro Spinelli,Luca Franco,Fabio Galasso

Main category: cs.CV

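TL;DR: Proposes an unlearning technique for video diffusion models that needs only 5 multi-modal prompt pairs to produce a refusal vector for removing harmful concepts from the model.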

Details

Motivation: Video generative models can inherit biases and harmful content from their training data, letting users generate undesirable or illegal content, so a remedy is urgently needed. Method: Generates a refusal vector from multi-modal prompt pairs, uses a low-rank factorization to make the vector more robust, and embeds it directly in the model weights. Result: The method neutralizes a range of harmful content (nudity, violence, copyrighted material, and more) while preserving the visual quality of generated video. Conclusion: The technique needs no retraining or original training data; embedding the refusal direction in the weights improves robustness against bypass attempts. Abstract: Video generative models democratize the creation of visual content through intuitive instruction following, but they also inherit the biases and harmful concepts embedded within their web-scale training data. This inheritance creates a significant risk, as users can readily generate undesirable and even illegal content. This work introduces the first unlearning technique tailored explicitly for video diffusion models to address this critical issue. Our method requires only 5 multi-modal prompt pairs. Each pair contains a "safe" and an "unsafe" example that differ only by the target concept. Averaging their per-layer latent differences produces a "refusal vector", which, once subtracted from the model parameters, neutralizes the unsafe concept. We introduce a novel low-rank factorization approach on the covariance difference of embeddings that yields robust refusal vectors. This isolates the target concept while minimizing collateral unlearning of other semantics, thus preserving the visual quality of the generated video. Our method preserves the model's generation quality while operating without retraining or access to the original training data. By embedding the refusal direction directly into the model's weights, the suppression mechanism becomes inherently more robust against adversarial bypass attempts compared to surface-level input-output filters. In a thorough qualitative and quantitative evaluation, we show that we can neutralize a variety of harmful contents, including explicit nudity, graphic violence, copyrights, and trademarks. Project page: https://www.pinlab.org/video-unlearning.
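
The averaging step described in this abstract can be sketched for a single layer as follows. This is a minimal illustration under stated assumptions: latents are plain lists of floats, `mean_difference` and `project_out` are hypothetical helper names, and the projection shown acts on an activation vector rather than reproducing the paper's weight edit or its low-rank factorization.

```python
def mean_difference(safe, unsafe):
    """Average (unsafe - safe) over prompt pairs for one layer.

    safe, unsafe: lists of equal-length latent vectors, one per prompt pair.
    Returns the averaged difference, i.e. a candidate refusal vector.
    """
    n = len(safe)
    dim = len(safe[0])
    return [sum(u[i] - s[i] for s, u in zip(safe, unsafe)) / n
            for i in range(dim)]


def project_out(x, v):
    """Remove from x its component along direction v."""
    norm_sq = sum(c * c for c in v)
    if norm_sq == 0.0:
        return list(x)
    scale = sum(a * b for a, b in zip(x, v)) / norm_sq
    return [a - scale * b for a, b in zip(x, v)]


# Two toy prompt pairs whose latents differ only along the first axis.
safe = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
unsafe = [[2.0, 0.0, 1.0], [3.0, 0.0, 0.0]]
refusal = mean_difference(safe, unsafe)
cleaned = project_out([5.0, 1.0, 1.0], refusal)
print(refusal, cleaned)
```

In the toy data the pairs differ only along one axis, so the averaged difference isolates that axis and the projection zeroes it while leaving the other coordinates untouched.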

[163] WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learning

Jie Yang,Feipeng Ma,Zitian Wang,Dacheng Yin,Kang Rong,Fengyun Rao,Ruimao Zhang

Main category: cs.CV

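TL;DR: The paper pursues general-purpose vision-language reasoning through reinforcement learning, contributing a novel QA synthesis pipeline, the open-source WeThink dataset, and a hybrid reward mechanism that together significantly improve multimodal large language models.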

Details

Motivation: Extend text-based reasoning models to the multimodal domain and address the challenge of general-purpose vision-language reasoning. Method: (1) Develops a scalable multimodal QA synthesis pipeline; (2) builds the WeThink dataset of over 120K QA pairs; (3) explores reinforcement learning with a hybrid reward mechanism. Result: Performance improves significantly across 14 MLLM benchmarks, from mathematical reasoning to general multimodal tasks. Conclusion: The WeThink dataset and the automated data pipeline can keep improving model performance, offering an effective path to general-purpose multimodal reasoning. Abstract: Building on the success of text-based reasoning models like DeepSeek-R1, extending these capabilities to multimodal reasoning holds great promise. While recent works have attempted to adapt DeepSeek-R1-style reinforcement learning (RL) training paradigms to multimodal large language models (MLLM), focusing on domain-specific tasks like math and visual perception, a critical question remains: How can we achieve general-purpose visual-language reasoning through RL? To address this challenge, we make three key efforts: (1) A novel Scalable Multimodal QA Synthesis pipeline that autonomously generates context-aware, reasoning-centric question-answer (QA) pairs directly from the given images. (2) The open-source WeThink dataset containing over 120K multimodal QA pairs with annotated reasoning paths, curated from 18 diverse dataset sources and covering various question domains. (3) A comprehensive exploration of RL on our dataset, incorporating a hybrid reward mechanism that combines rule-based verification with model-based assessment to optimize RL training efficiency across various task domains. Across 14 diverse MLLM benchmarks, we demonstrate that our WeThink dataset significantly enhances performance, from mathematical reasoning to diverse general multimodal tasks. Moreover, we show that our automated data pipeline can continuously increase data diversity to further improve model performance.

[164] A Comparative Study of U-Net Architectures for Change Detection in Satellite Images

Yaxita Amin,Naimisha S Trivedi,Rashmi Bhattad

Main category: cs.CV

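TL;DR: Drawing on an analysis of 34 papers, this study compares 18 U-Net variants for remote sensing change detection, highlights architectures built specifically for change detection such as Siamese Swin-U-Net, and discusses their strengths and weaknesses.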

Details

Motivation: Fill the research gap on U-Net in remote sensing change detection and assess the suitability of different U-Net variants. Method: Conducts a comprehensive analysis of 34 papers and compares 18 U-Net variants, with particular attention to architectures designed for change detection. Result: Handling multi-temporal data and capturing long-range relationships are found to be crucial for improving change detection accuracy. Conclusion: The study offers a useful reference for choosing U-Net variants for remote sensing change detection. Abstract: Remote sensing change detection is essential for monitoring the ever-changing landscapes of the Earth. The U-Net architecture has gained popularity for its capability to capture spatial information and perform pixel-wise classification. However, its application in the remote sensing field remains largely unexplored. Therefore, this paper fills the gap by conducting a comprehensive analysis of 34 papers. This study conducts a comparison and analysis of 18 different U-Net variations, assessing their potential for detecting changes in remote sensing. We evaluate both the benefits and the drawbacks of each variation within the framework of this particular application. We emphasize variations that are explicitly built for change detection, such as Siamese Swin-U-Net, which utilizes a Siamese architecture. The analysis highlights the significance of aspects such as managing data from different time periods and capturing relationships over a long distance to enhance the precision of change detection. This study provides valuable insights for researchers and practitioners who choose U-Net versions for remote sensing change detection tasks.

Chengyue Huang,Yuchen Zhu,Sichen Zhu,Jingyun Xiao,Moises Andrade,Shivang Chopra,Zsolt Kira

Main category: cs.CV

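TL;DR: The paper re-evaluates the multimodal in-context learning (MM-ICL) ability of vision-language models (VLMs), finds that models rely on shallow heuristics rather than true task understanding, and proposes a new reasoning-augmented pipeline.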

Details

Motivation: Investigate whether VLMs truly possess multimodal in-context learning ability rather than relying on shallow heuristics. Method: Proposes an MM-ICL with Reasoning pipeline that generates an answer and a rationale for each demonstration, and runs experiments across datasets and models. Result: Current VLMs fail to use demonstration information effectively; performance is largely insensitive to shot count, retrieval method, and similar factors. Conclusion: Current VLMs show limited multimodal in-context learning and need further improvement in task understanding. Abstract: Vision-language models (VLMs) are widely assumed to exhibit in-context learning (ICL), a property similar to that of their language-only counterparts. While recent work suggests VLMs can perform multimodal ICL (MM-ICL), studies show they often rely on shallow heuristics -- such as copying or majority voting -- rather than true task understanding. We revisit this assumption by evaluating VLMs under distribution shifts, where support examples come from a dataset different from the query. Surprisingly, performance often degrades with more demonstrations, and models tend to copy answers rather than learn from them. To investigate further, we propose a new MM-ICL with Reasoning pipeline that augments each demonstration with a generated rationale alongside the answer. We conduct extensive and comprehensive experiments on both perception- and reasoning-required datasets with open-source VLMs ranging from 3B to 72B and proprietary models such as Gemini 2.0. We conduct controlled studies varying shot count, retrieval method, rationale quality, and distribution. Our results show limited performance sensitivity across these factors, suggesting that current VLMs do not effectively utilize demonstration-level information as intended in MM-ICL.

[166] Decoupling the Image Perception and Multimodal Reasoning for Reasoning Segmentation with Digital Twin Representations

Yizhen Li,Dell Zhang,Xuelong Li,Yiqing Shen

Main category: cs.CV

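TL;DR: DTwinSeger is a new Reasoning Segmentation method that uses a Digital Twin representation to decouple perception from reasoning and lets an LLM reason explicitly over that representation to handle multimodal tasks.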

Details

Motivation: Current vision-language-model approaches to segmentation disrupt the spatial continuity between objects; a method is needed that preserves spatial relationships and supports complex reasoning. Method: DTwinSeger splits the task into two stages: (1) convert the image into a structured DT representation; (2) use an LLM to reason explicitly over the DT representation. Result: Achieves state-of-the-art performance on multiple benchmarks, showing that the DT representation is an effective bridge between vision and text. Conclusion: By decoupling perception from reasoning, DTwinSeger demonstrates the potential of LLMs in multimodal tasks and suggests a new direction for complex reasoning tasks. Abstract: Reasoning Segmentation (RS) is a multimodal vision-text task that requires segmenting objects based on implicit text queries, demanding both precise visual perception and vision-text reasoning capabilities. Current RS approaches rely on fine-tuning vision-language models (VLMs) for both perception and reasoning, but their tokenization of images fundamentally disrupts continuous spatial relationships between objects. We introduce DTwinSeger, a novel RS approach that leverages Digital Twin (DT) representation as an intermediate layer to decouple perception from reasoning. DTwinSeger reformulates RS as a two-stage process: the first stage transforms the image into a structured DT representation that preserves spatial relationships and semantic properties, and the second employs a Large Language Model (LLM) to perform explicit reasoning over this representation to identify target objects. We propose a supervised fine-tuning method specifically for LLM with DT representation, together with a corresponding fine-tuning dataset Seg-DT, to enhance the LLM's reasoning capabilities with DT representations. Experiments show that our method can achieve state-of-the-art performance on two image RS benchmarks and three image referring segmentation benchmarks. The results indicate that the DT representation functions as an effective bridge between vision and text, enabling complex multimodal reasoning tasks to be accomplished solely with an LLM.

[167] Creating a Historical Migration Dataset from Finnish Church Records, 1800-1920

Ari Vesalainen,Jenna Kanerva,Aida Nitsch,Kiia Korsu,Ilari Larkiola,Laura Ruotsalainen,Filip Ginter

Main category: cs.CV

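TL;DR: This article describes a large-scale effort that uses deep learning to extract structured data from Finnish church migration records from 1800 to 1920; the resulting dataset holds over six million entries and supports historical demographic research.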

Details

Motivation: Digitize church migration records to give historical demography structured data for analyzing internal migration, urbanization, and related questions. Method: Automates data extraction with a deep learning pipeline covering layout analysis, table detection, cell classification, and handwriting recognition. Result: Builds a structured dataset of over six million entries and shows its value through a case study. Conclusion: The work shows how large volumes of handwritten archives can be turned into structured data to support historical and demographic research. Abstract: This article presents a large-scale effort to create a structured dataset of internal migration in Finland between 1800 and 1920 using digitized church moving records. These records, maintained by Evangelical-Lutheran parishes, document the migration of individuals and families and offer a valuable source for studying historical demographic patterns. The dataset includes over six million entries extracted from approximately 200,000 images of handwritten migration records. The data extraction process was automated using a deep learning pipeline that included layout analysis, table detection, cell classification, and handwriting recognition. The complete pipeline was applied to all images, resulting in a structured dataset suitable for research. The dataset can be used to study internal migration, urbanization, family migration, and the spread of disease in preindustrial Finland. A case study from the Elimäki parish shows how local migration histories can be reconstructed. The work demonstrates how large volumes of handwritten archival material can be transformed into structured data to support historical and demographic research.

[168] SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design

Wenxin Tang,Jingyu Xiao,Wenxuan Jiang,Xi Xiao,Yuhang Wang,Xuxin Tang,Qing Li,Yuehe Ma,Junliang Liu,Shisong Tang,Michael R. Lyu

Main category: cs.CV

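TL;DR: The paper proposes the Slide2Code benchmark and the SlideCoder framework for generating editable slides from reference images, addressing the weaknesses of existing methods in visual and structural design.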

Details

Motivation: Making slides by hand is time-consuming and requires expertise, and existing natural-language-based LLM generation methods struggle to capture the visual and structural details of slide design. Method: Proposes the SlideCoder framework, which combines a Color Gradient-based Segmentation algorithm with Hierarchical Retrieval-Augmented Generation, and releases the open-source SlideMaster model. Result: SlideCoder outperforms existing methods on layout fidelity, execution accuracy, and visual consistency, by up to 40.5 points. Conclusion: SlideCoder performs strongly on slide generation and offers an effective solution for automated slide design. Abstract: Manual slide creation is labor-intensive and requires expert prior knowledge. Existing natural language-based LLM generation methods struggle to capture the visual and structural nuances of slide designs. To address this, we formalize the Reference Image to Slide Generation task and propose Slide2Code, the first benchmark with difficulty-tiered samples based on a novel Slide Complexity Metric. We introduce SlideCoder, a layout-aware, retrieval-augmented framework for generating editable slides from reference images. SlideCoder integrates a Color Gradient-based Segmentation algorithm and a Hierarchical Retrieval-Augmented Generation method to decompose complex tasks and enhance code generation. We also release SlideMaster, a 7B open-source model fine-tuned with improved reverse-engineered data. Experiments show that SlideCoder outperforms state-of-the-art baselines by up to 40.5 points, demonstrating strong performance across layout fidelity, execution accuracy, and visual consistency. Our code is available at https://github.com/vinsontang1/SlideCoder.

[169] SpaCE-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence

Ziyang Gong,Wenhao Li,Oliver Ma,Songyuan Li,Jiayi Ji,Xue Yang,Gen Luo,Junchi Yan,Rongrong Ji

Main category: cs.CV

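TL;DR: SpaCE-10 is a comprehensive benchmark for evaluating the spatial intelligence of multimodal large language models (MLLMs), covering 10 atomic spatial capabilities and 8 compositional capabilities tested with high-quality QA pairs.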

Details

Motivation: Existing benchmarks struggle to evaluate the spatial intelligence of MLLMs comprehensively, especially across the atomic and compositional levels. Method: Defines 10 atomic spatial capabilities and 8 compositional capabilities, and uses a hierarchical annotation pipeline to produce over 5k QA pairs covering several evaluation settings. Result: State-of-the-art MLLMs still lag well behind humans on SpaCE-10, with weak counting ability a major limiting factor. Conclusion: SpaCE-10 fills a gap in evaluating the spatial intelligence of MLLMs, exposes the shortcomings of current models, and gives the community a useful reference. Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable progress in various multimodal tasks. To pursue higher intelligence in space, MLLMs require integrating multiple atomic spatial capabilities to handle complex and dynamic tasks. However, existing benchmarks struggle to comprehensively evaluate the spatial intelligence of common MLLMs from the atomic level to the compositional level. To fill this gap, we present SpaCE-10, a comprehensive benchmark for compositional spatial evaluations. In SpaCE-10, we define 10 atomic spatial capabilities, which are combined to form 8 compositional capabilities. Based on these definitions, we propose a novel hierarchical annotation pipeline to generate high-quality and diverse question-answer (QA) pairs. With over 150 hours of human expert effort, we obtain over 5k QA pairs for 811 real indoor scenes in SpaCE-10, which covers various evaluation settings like point cloud input and multi-choice QA. We conduct an extensive evaluation of common MLLMs on SpaCE-10 and find that even the most advanced MLLM still lags behind humans by large margins. Through our careful study, we also draw several significant findings that benefit the MLLM community. For example, we reveal that the shortcoming of counting capability greatly limits the compositional spatial capabilities of existing MLLMs. The evaluation code and benchmark datasets are available at https://github.com/Cuzyoung/SpaCE-10.

[170] CyberV: Cybernetics for Test-time Scaling in Video Understanding

Jiahao Meng,Shuyang Sun,Yue Tan,Lu Qi,Yunhai Tong,Xiangtai Li,Longyin Wen

Main category: cs.CV

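TL;DR: CyberV is a framework based on cybernetic principles that improves multimodal large language models (MLLMs) on video understanding through self-monitoring, self-correction, and dynamic resource allocation, significantly raising accuracy and robustness.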

Details

Motivation: Current MLLMs face high computational demands, insufficient robustness, and limited accuracy on long or complex videos, especially models with fewer parameters. Method: Proposes the CyberV framework, consisting of an MLLM Inference System, a Sensor, and a Controller, which adapts through real-time monitoring and feedback. Result: CyberV improves Qwen2.5-VL-7B by 8.3% and InternVL3-8B by 5.5% on VideoMMMU, and a 10.0% gain on Qwen2.5-VL-72B brings performance close to that of human experts. Conclusion: CyberV's dynamic adjustment mechanism effectively improves the video understanding of MLLMs and is broadly applicable. Abstract: Current Multimodal Large Language Models (MLLMs) may struggle with understanding long or complex videos due to computational demands at test time, lack of robustness, and limited accuracy, primarily stemming from their feed-forward processing nature. These limitations could be more severe for models with fewer parameters. To address these limitations, we propose a novel framework inspired by cybernetic principles, redesigning video MLLMs as adaptive systems capable of self-monitoring, self-correction, and dynamic resource allocation during inference. Our approach, CyberV, introduces a cybernetic loop consisting of an MLLM Inference System, a Sensor, and a Controller. Specifically, the sensor monitors forward processes of the MLLM and collects intermediate interpretations, such as attention drift, then the controller determines when and how to trigger self-correction and generate feedback to guide the next round. This test-time adaptive scaling framework enhances frozen MLLMs without requiring retraining or additional components. Experiments demonstrate significant improvements: CyberV boosts Qwen2.5-VL-7B by 8.3% and InternVL3-8B by 5.5% on VideoMMMU, surpassing the competitive proprietary model GPT-4o. When applied to Qwen2.5-VL-72B, it yields a 10.0% improvement, achieving performance even comparable to human experts. Furthermore, our method demonstrates consistent gains on general-purpose benchmarks, such as VideoMME and WorldSense, highlighting its effectiveness and generalization capabilities in making MLLMs more robust and accurate for dynamic video understanding. The code is released at https://github.com/marinero4972/CyberV.

[171] OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation

Jingjing Chang,Yixiao Fang,Peng Xing,Shuhan Wu,Wei Cheng,Rui Wang,Xianfang Zeng,Gang Yu,Hai-Bao Chen

Main category: cs.CV

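TL;DR: OneIG-Bench is a comprehensive evaluation framework for text-to-image (T2I) models, focused on fine-grained assessment along multiple dimensions including reasoning, text rendering, and stylization.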

Details

Motivation: Existing T2I evaluation standards are incomplete and cannot adequately assess frontier issues such as reasoning ability. Method: Designs the OneIG-Bench framework, which supports flexible evaluation along multiple dimensions (prompt-image alignment, text rendering, reasoning-generated content, and others). Result: OneIG-Bench releases its code and dataset publicly, supporting reproducible evaluation studies and cross-model comparison. Conclusion: OneIG-Bench fills a gap in T2I evaluation and helps researchers analyze model performance comprehensively. Abstract: Text-to-image (T2I) models have garnered significant attention for generating high-quality images aligned with text prompts. However, rapid T2I model advancements reveal limitations in early benchmarks, which lack comprehensive evaluation of, for example, reasoning, text rendering, and style. Notably, recent state-of-the-art models, with their rich knowledge modeling capabilities, show promising results on the image generation problems requiring strong reasoning ability, yet existing evaluation systems have not adequately addressed this frontier. To systematically address these gaps, we introduce OneIG-Bench, a meticulously designed comprehensive benchmark framework for fine-grained evaluation of T2I models across multiple dimensions, including prompt-image alignment, text rendering precision, reasoning-generated content, stylization, and diversity. By structuring the evaluation, this benchmark enables in-depth analysis of model performance, helping researchers and practitioners pinpoint strengths and bottlenecks in the full pipeline of image generation. Specifically, OneIG-Bench enables flexible evaluation by allowing users to focus on a particular evaluation subset. Instead of generating images for the entire set of prompts, users can generate images only for the prompts associated with the selected dimension and complete the corresponding evaluation accordingly. Our codebase and dataset are now publicly available to facilitate reproducible evaluation studies and cross-model comparisons within the T2I research community.

[172] Real-time Localization of a Soccer Ball from a Single Camera

Dmitrii Vorobev,Artem Prosvetov,Karim Elhadji Daou

Main category: cs.CV

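TL;DR: Proposes an efficient method for real-time 3D football trajectory reconstruction from a single camera, using a multi-mode state model to speed up optimization while keeping centimeter-level accuracy.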

Details

Motivation: Address the performance problems of existing methods under occlusion, motion blur, and complex backgrounds while reducing the need for multiple cameras and costly equipment. Method: Uses a multi-mode state model with $W$ discrete modes to accelerate optimization; it runs on standard CPUs with low latency. Result: Evaluated on a dataset of 6K-resolution Russian Premier League matches, with performance comparable to multi-camera systems. Conclusion: Provides a practical, low-cost method for accurate 3D ball tracking in professional football. Abstract: We propose a computationally efficient method for real-time three-dimensional football trajectory reconstruction from a single broadcast camera. In contrast to previous work, our approach introduces a multi-mode state model with $W$ discrete modes to significantly accelerate optimization while preserving centimeter-level accuracy -- even in cases of severe occlusion, motion blur, and complex backgrounds. The system operates on standard CPUs and achieves low latency suitable for live broadcast settings. Extensive evaluation on a proprietary dataset of 6K-resolution Russian Premier League matches demonstrates performance comparable to multi-camera systems, without the need for specialized or costly infrastructure. This work provides a practical method for accessible and accurate 3D ball tracking in professional football environments.

[173] CXR-LT 2024: A MICCAI challenge on long-tailed, multi-label, and zero-shot disease classification from chest X-ray

Mingquan Lin,Gregory Holste,Song Wang,Yiliang Zhou,Yishu Wei,Imon Banerjee,Pengyi Chen,Tianjie Dai,Yuexi Du,Nicha C. Dvornek,Yuyan Ge,Zuowei Guo,Shouhei Hanaoka,Dongkyun Kim,Pablo Messina,Yang Lu,Denis Parra,Donghyun Son,Álvaro Soto,Aisha Urooj,René Vidal,Yosuke Yamagishi,Zefan Yang,Ruichi Zhang,Yang Zhou,Leo Anthony Celi,Ronald M. Summers,Zhiyong Lu,Hao Chen,Adam Flanders,George Shih,Zhangyang Wang,Yifan Peng

Main category: cs.CV

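TL;DR: CXR-LT 2024 is a community-driven initiative that aims to improve lung disease classification from chest X-rays (CXR) by expanding the dataset and adding techniques such as zero-shot learning.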

Details

Motivation: Tackle the challenges of open long-tailed lung disease classification and make state-of-the-art techniques easier to measure. Method: Expands the dataset to 377,110 CXRs and 45 disease labels, introduces zero-shot learning, and defines three tasks (long-tailed classification, classification on a gold-standard subset, and zero-shot generalization). Result: Provides high-quality benchmark data, consolidates multimodal models, generative methods, and zero-shot learning strategies, and broadens disease coverage. Conclusion: CXR-LT 2024 is a valuable resource for future research and advances clinically realistic, generalizable diagnostic models. Abstract: The CXR-LT series is a community-driven initiative designed to enhance lung disease classification using chest X-rays (CXR). It tackles challenges in open long-tailed lung disease classification and enhances the measurability of state-of-the-art techniques. The first event, CXR-LT 2023, aimed to achieve these goals by providing high-quality benchmark CXR data for model development and conducting comprehensive evaluations to identify ongoing issues impacting lung disease classification performance. Building on the success of CXR-LT 2023, CXR-LT 2024 expands the dataset to 377,110 chest X-rays (CXRs) and 45 disease labels, including 19 new rare disease findings. It also introduces a new focus on zero-shot learning to address limitations identified in the previous event. Specifically, CXR-LT 2024 features three tasks: (i) long-tailed classification on a large, noisy test set, (ii) long-tailed classification on a manually annotated "gold standard" subset, and (iii) zero-shot generalization to five previously unseen disease findings. This paper provides an overview of CXR-LT 2024, detailing the data curation process and consolidating state-of-the-art solutions, including the use of multimodal models for rare disease detection, advanced generative approaches to handle noisy labels, and zero-shot learning strategies for unseen diseases. Additionally, the expanded dataset enhances disease coverage to better represent real-world clinical settings, offering a valuable resource for future research. By synthesizing the insights and innovations of participating teams, we aim to advance the development of clinically realistic and generalizable diagnostic models for chest radiography.

[174] Rethinking Crowd-Sourced Evaluation of Neuron Explanations

Tuomas Oikarinen,Ge Yan,Akshay Kulkarni,Tsui-Wei Weng

Main category: cs.CV

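TL;DR: The paper proposes an efficient and accurate crowd-sourced evaluation strategy for assessing the reliability of neuron explanations, using importance sampling and a Bayesian aggregation method to cut costs substantially.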

Details

Motivation: Reliability evaluation of neuron explanation methods depends on crowd-sourcing, which is expensive and noisy, so a more efficient and accurate evaluation strategy is needed. Method: Uses importance sampling to select the most valuable inputs to show raters, and proposes a Bayesian method for aggregating multiple ratings that greatly reduces the number of ratings needed. Result: Achieves roughly a 30x cost reduction compared to uniform sampling and a further ~5x reduction in the number of ratings required for the same accuracy. Conclusion: The proposed methods make large-scale comparison of neuron explanation quality practical, combining cost-effectiveness with accuracy. Abstract: Interpreting individual neurons or directions in activation space is an important component of mechanistic interpretability. As such, many algorithms have been proposed to automatically produce neuron explanations, but it is often not clear how reliable these explanations are, or which methods produce the best explanations. This can be measured via crowd-sourced evaluations, but they can often be noisy and expensive, leading to unreliable results. In this paper, we carefully analyze the evaluation pipeline and develop a cost-effective and highly accurate crowdsourced evaluation strategy. In contrast to previous human studies that only rate whether the explanation matches the most highly activating inputs, we estimate whether the explanation describes neuron activations across all inputs. To estimate this effectively, we introduce a novel application of importance sampling to determine which inputs are the most valuable to show to raters, leading to around 30x cost reduction compared to uniform sampling. We also analyze the label noise present in crowd-sourced evaluations and propose a Bayesian method to aggregate multiple ratings leading to a further ~5x reduction in number of ratings required for the same accuracy. Finally, we use these methods to conduct a large-scale study comparing the quality of neuron explanations produced by the most popular methods for two different vision models.
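
To illustrate why importance sampling can reduce the number of inputs that must be rated, here is a generic sketch of the estimator for a mean over a finite set of inputs. It is not the paper's procedure: the abstract does not describe the proposal distribution, so the `proposal` weights and the function name are assumptions made for illustration.

```python
import random


def importance_sampling_mean(values, proposal, n_samples, seed=0):
    """Estimate the uniform mean of `values` by sampling from `proposal`.

    values:   per-input quantities whose average over all inputs is wanted.
    proposal: non-negative sampling weights, positive wherever values != 0.
    Each sampled term is reweighted by p/q, with p uniform and q the
    normalized proposal, which keeps the estimator unbiased.
    """
    rng = random.Random(seed)
    n = len(values)
    total = sum(proposal)
    q = [w / total for w in proposal]
    picks = rng.choices(range(n), weights=q, k=n_samples)
    return sum(values[i] * (1.0 / n) / q[i] for i in picks) / n_samples


values = [0.1, 0.2, 5.0, 0.7]
true_mean = sum(values) / len(values)
# Sampling in proportion to the values themselves makes every term equal.
estimate = importance_sampling_mean(values, proposal=values, n_samples=50)
print(true_mean, estimate)
```

When the proposal is proportional to the quantity being averaged, every reweighted term equals the true mean, so the estimator has zero variance. In practice that ideal proposal is unknown, and a proxy that concentrates samples on informative inputs gives a partial version of the same benefit.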

Zhengyao Lv,Tianlin Pan,Chenyang Si,Zhaoxi Chen,Wangmeng Zuo,Ziwei Liu,Kwan-Yee K. Wong

Main category: cs.CV

TL;DR: The paper proposes TACA, a method that dynamically rebalances cross-modal attention through temperature scaling and timestep-dependent weighting, significantly improving alignment in text-to-image generation.

Details

Motivation: Existing MM-DiT models for text-driven visual generation suffer from imbalanced cross-modal attention and a lack of timestep awareness, which leads to imprecise alignment between the text and the generated content. Method: TACA combines temperature scaling with timestep-dependent adjustment to dynamically balance multimodal interactions, and is paired with LoRA fine-tuning. Result: On the T2I-CompBench benchmark, TACA significantly improves text-image alignment, including object appearance, attribute binding, and spatial relationships. Conclusion: By better balancing cross-modal attention, TACA improves the semantic fidelity of text-to-image diffusion models. Abstract: Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models like FLUX struggle with achieving precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT, namely 1) the suppression of cross-modal attention due to token imbalance between visual and textual modalities and 2) the lack of timestep-aware attention weighting, which hinder the alignment. To address these issues, we propose Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We tested TACA on state-of-the-art models like FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention in improving semantic fidelity in text-to-image diffusion models. Our code is publicly available at https://github.com/Vchitect/TACA
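
The token-imbalance effect described in the abstract above can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not the released TACA code: the helper names, the step-function schedule `gamma_for_timestep`, and the values `gamma0=1.5` and `t_threshold=0.5` are all hypothetical. The sketch computes how much softmax attention a single query gives to a few text keys competing against many visual keys, before and after the text logits are scaled by a timestep-dependent factor.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def gamma_for_timestep(t, gamma0=1.5, t_threshold=0.5):
    """Hypothetical schedule: amplify text logits only at noisy timesteps."""
    return gamma0 if t >= t_threshold else 1.0


def text_attention_mass(visual_logits, text_logits, gamma):
    """Share of one query's attention that lands on the text keys.

    Text logits are multiplied by gamma before the joint softmax. In this toy
    setup the text logits are positive, so gamma > 1 raises their share.
    """
    scaled = list(visual_logits) + [gamma * x for x in text_logits]
    weights = softmax(scaled)
    return sum(weights[len(visual_logits):])


# 64 visual keys versus 4 text keys, all with the same raw logit.
visual_logits = [1.0] * 64
text_logits = [1.0] * 4
baseline = text_attention_mass(visual_logits, text_logits, gamma_for_timestep(0.1))
boosted = text_attention_mass(visual_logits, text_logits, gamma_for_timestep(0.9))
```

With equal raw logits, the text keys receive only 4/68 of the attention simply because they are outnumbered; scaling their logits raises that share without touching the visual keys.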

[176] PairEdit: Learning Semantic Variations for Exemplar-based Image Editing

Haoguang Lu,Jiacheng Chen,Zhenguo Yang,Aurele Tohokantche Gnanha,Fu Lee Wang,Li Qing,Xudong Mao

Main category: cs.CV

TL;DR: PairEdit is a visual editing method that needs no textual guidance; it learns complex editing semantics from a small number of image pairs and significantly improves content consistency.

Details

Motivation: Existing exemplar-based editing methods depend on text descriptions or implicit text-based instructions, which makes some editing semantics hard to specify precisely. Method: A target noise prediction and a content-preserving noise schedule are proposed, and distinct LoRAs are optimized to disentangle the learning of semantic variations from content. Result: PairEdit successfully learns intricate semantics, with content consistency significantly better than baseline methods. Conclusion: PairEdit offers an effective solution for visual editing without text. Abstract: Recent advancements in text-guided image editing have achieved notable success by leveraging natural language prompts for fine-grained semantic control. However, certain editing semantics are challenging to specify precisely using textual descriptions alone. A practical alternative involves learning editing semantics from paired source-target examples. Existing exemplar-based editing methods still rely on text prompts describing the change within paired examples or learning implicit text-based editing instructions. In this paper, we introduce PairEdit, a novel visual editing method designed to effectively learn complex editing semantics from a limited number of image pairs or even a single image pair, without using any textual guidance. We propose a target noise prediction that explicitly models semantic variations within paired images through a guidance direction term. Moreover, we introduce a content-preserving noise schedule to facilitate more effective semantic learning. We also propose optimizing distinct LoRAs to disentangle the learning of semantic variations from content. Extensive qualitative and quantitative evaluations demonstrate that PairEdit successfully learns intricate semantics while significantly improving content consistency compared to baseline methods. Code will be available at https://github.com/xudonmao/PairEdit.

[177] UA-Pose: Uncertainty-Aware 6D Object Pose Estimation and Online Object Completion with Partial References

Ming-Feng Li,Xin Yang,Fu-En Wang,Hritam Basak,Yuyin Sun,Shreekant Gayaka,Min Sun,Cheng-Hao Kuo

Main category: cs.CV

TL;DR: UA-Pose is an uncertainty-aware 6D object pose estimation method designed for partial references, and it significantly improves performance under incomplete observations.

Details

Motivation: Existing methods usually require a complete 3D model or many reference images, whereas UA-Pose targets 6D pose estimation from partial references only (for example a few RGBD images or a single 2D image). Method: A partial 3D model is initialized (from RGBD images, or generated from a 2D image), uncertainty is introduced to distinguish seen from unseen regions, and an uncertainty-aware sampling strategy drives online object completion. Result: Experiments on the YCB-Video, YCBInEOAT, and HO3D datasets show that UA-Pose significantly outperforms existing methods under incomplete observations. Conclusion: Through uncertainty awareness and online completion, UA-Pose effectively improves 6D pose estimation from partial references. Abstract: 6D object pose estimation has shown strong generalizability to novel objects. However, existing methods often require either a complete, well-reconstructed 3D model or numerous reference images that fully cover the object. Estimating 6D poses from partial references, which capture only fragments of an object's appearance and geometry, remains challenging. To address this, we propose UA-Pose, an uncertainty-aware approach for 6D object pose estimation and online object completion specifically designed for partial references. We assume access to either (1) a limited set of RGBD images with known poses or (2) a single 2D image. For the first case, we initialize a partial object 3D model based on the provided images and poses, while for the second, we use image-to-3D techniques to generate an initial object 3D model. Our method integrates uncertainty into the incomplete 3D model, distinguishing between seen and unseen regions. This uncertainty enables confidence assessment in pose estimation and guides an uncertainty-aware sampling strategy for online object completion, enhancing robustness in pose estimation accuracy and improving object completeness. We evaluate our method on the YCB-Video, YCBInEOAT, and HO3D datasets, including RGBD sequences of YCB objects manipulated by robots and human hands. Experimental results demonstrate significant performance improvements over existing methods, particularly when object observations are incomplete or partially captured. Project page: https://minfenli.github.io/UA-Pose/

[178] MADFormer: Mixed Autoregressive and Diffusion Transformers for Continuous Image Generation

Junhao Chen,Yulia Tsvetkov,Xiaochuang Han

Main category: cs.CV

TL;DR: The paper proposes MADFormer, a hybrid Transformer that mixes autoregressive (AR) and diffusion layers and serves as a testbed for analyzing AR-diffusion trade-offs. Images are generated block by block, with AR layers providing global conditioning and diffusion layers performing local refinement; experiments show strong performance on high-resolution images.

Details

Motivation: Existing hybrid models lack systematic guidance on how to allocate capacity between AR and diffusion components, so a way to optimize their combination is needed. Method: MADFormer partitions image generation into spatial blocks, using AR layers for global conditioning and diffusion layers for local refinement. Result: Block-wise generation significantly improves performance on high-resolution images, and mixing AR and diffusion layers gives a better balance of quality and efficiency, improving FID by up to 75%. Conclusion: The study offers practical design principles for future hybrid generative models. Abstract: Recent progress in multimodal generation has increasingly combined autoregressive (AR) and diffusion-based approaches, leveraging their complementary strengths: AR models capture long-range dependencies and produce fluent, context-aware outputs, while diffusion models operate in continuous latent spaces to refine high-fidelity visual details. However, existing hybrids often lack systematic guidance on how and why to allocate model capacity between these paradigms. In this work, we introduce MADFormer, a Mixed Autoregressive and Diffusion Transformer that serves as a testbed for analyzing AR-diffusion trade-offs. MADFormer partitions image generation into spatial blocks, using AR layers for one-pass global conditioning across blocks and diffusion layers for iterative local refinement within each block. Through controlled experiments on FFHQ-1024 and ImageNet, we identify two key insights: (1) block-wise partitioning significantly improves performance on high-resolution images, and (2) vertically mixing AR and diffusion layers yields better quality-efficiency balances--improving FID by up to 75% under constrained inference compute. Our findings offer practical design principles for future hybrid generative models.

[179] Aligning Text, Images, and 3D Structure Token-by-Token

Aadarsh Sahoo,Vansh Tibrewal,Georgia Gkioxari

Main category: cs.CV

TL;DR: The paper proposes a unified LLM framework that aligns language, images, and 3D scenes, and provides a detailed guide to the key design choices.

Details

Motivation: To create machines that understand the 3D world, in order to assist designers who build and edit 3D environments and to help robots navigate and interact in three-dimensional space. Method: The potential of autoregressive models for structured 3D scenes is investigated, and a unified LLM framework is proposed. Result: Performance is evaluated on four core 3D tasks (rendering, recognition, instruction-following, and question-answering) and four 3D datasets, and the model is shown to be effective on real-world 3D object recognition tasks. Conclusion: Enriching the 3D modality with quantized shape encodings extends the approach to reconstructing complex 3D object shapes. Abstract: Creating machines capable of understanding the world in 3D is essential in assisting designers that build and edit 3D environments and robots navigating and interacting within a three-dimensional space. Inspired by advances in language and image modeling, we investigate the potential of autoregressive models for a new modality: structured 3D scenes. To this end, we propose a unified LLM framework that aligns language, images, and 3D scenes and provide a detailed "cookbook" outlining critical design choices for achieving optimal training and performance, addressing key questions related to data representation, modality-specific objectives, and more. We evaluate performance across four core 3D tasks -- rendering, recognition, instruction-following, and question-answering -- and four 3D datasets, synthetic and real-world. We extend our approach to reconstruct complex 3D object shapes by enriching our 3D modality with quantized shape encodings, and show our model's effectiveness on real-world 3D object recognition tasks. Project webpage: https://glab-caltech.github.io/kyvo/

[180] Audio-Sync Video Generation with Multi-Stream Temporal Control

Shuchen Weng,Haojie Zheng,Zheng Chang,Si Li,Boxin Shi,Xinlong Wang

Main category: cs.CV

TL;DR: MTV is a framework for audio-synchronized video generation that separates audio into speech, effects, and music tracks to give fine-grained control over lip motion, event timing, and visual mood.

Details

Motivation: Audio is tightly synchronized with the visual world and is a natural control signal for video generation, but existing methods fall short on high-quality video generation with precise audio-visual synchronization. Method: The MTV framework separates audio into speech, effects, and music tracks, which control lip motion, event timing, and visual mood respectively, and introduces the DEMIX dataset to support multi-stage training. Result: Experiments show that MTV achieves state-of-the-art performance on six standard metrics covering video quality, text-video consistency, and audio-video alignment. Conclusion: Through audio demixing and multi-stage training, MTV achieves high-quality, semantically aligned video generation and offers a new approach to audio-visual synchronization. Abstract: Audio is inherently temporal and closely synchronized with the visual world, making it a naturally aligned and expressive control signal for controllable video generation (e.g., movies). Beyond control, directly translating audio into video is essential for understanding and visualizing rich audio narratives (e.g., Podcasts or historical recordings). However, existing approaches fall short in generating high-quality videos with precise audio-visual synchronization, especially across diverse and complex audio types. In this work, we introduce MTV, a versatile framework for audio-sync video generation. MTV explicitly separates audio into speech, effects, and music tracks, enabling disentangled control over lip motion, event timing, and visual mood, respectively -- resulting in fine-grained and semantically aligned video generation. To support the framework, we additionally present DEMIX, a dataset comprising high-quality cinematic videos and demixed audio tracks. DEMIX is structured into five overlapped subsets, enabling scalable multi-stage training for diverse generation scenarios. Extensive experiments demonstrate that MTV achieves state-of-the-art performance across six standard metrics spanning video quality, text-video consistency, and audio-video alignment. Project page: https://hjzheng.net/projects/MTV/.

[181] Dynamic View Synthesis as an Inverse Problem

Hidir Yesiltepe,Pinar Yanardag

Main category: cs.CV

TL;DR: A training-free method for dynamic view synthesis that achieves high-quality results by redesigning the noise initialization phase of a pre-trained video diffusion model.

Details

Motivation: To treat dynamic view synthesis from monocular video as an inverse problem, without requiring weight updates or auxiliary modules. Method: A K-order Recursive Noise Representation resolves the problem caused by zero-terminal SNR schedules, and Stochastic Latent Modulation completes occluded regions. Result: Experiments show that dynamic view synthesis can be performed effectively through latent manipulation in the noise initialization phase. Conclusion: The method achieves high-fidelity dynamic view synthesis without any training. Abstract: In this work, we address dynamic view synthesis from monocular videos as an inverse problem in a training-free setting. By redesigning the noise initialization phase of a pre-trained video diffusion model, we enable high-fidelity dynamic view synthesis without any weight updates or auxiliary modules. We begin by identifying a fundamental obstacle to deterministic inversion arising from zero-terminal signal-to-noise ratio (SNR) schedules and resolve it by introducing a novel noise representation, termed K-order Recursive Noise Representation. We derive a closed form expression for this representation, enabling precise and efficient alignment between the VAE-encoded and the DDIM inverted latents. To synthesize newly visible regions resulting from camera motion, we introduce Stochastic Latent Modulation, which performs visibility aware sampling over the latent space to complete occluded regions. Comprehensive experiments demonstrate that dynamic view synthesis can be effectively performed through structured latent manipulation in the noise initialization phase.

[182] ZeroVO: Visual Odometry with Minimal Assumptions

Lei Lai,Zekai Yin,Eshed Ohn-Bar

Main category: cs.CV

TL;DR: ZeroVO is a new visual odometry algorithm that generalizes zero-shot across cameras and environments, without predefined or static camera calibration.

Details

Motivation: To overcome the reliance of existing methods on predefined or static camera calibration and to improve the generalization of visual odometry. Method: 1. A calibration-free, geometry-aware network structure; 2. a language-based prior to strengthen feature extraction; 3. a semi-supervised training paradigm. Result: Over 30% improvement against prior methods on the KITTI, nuScenes, and Argoverse 2 benchmarks, with additional validation on a synthetic dataset derived from GTA. Conclusion: ZeroVO requires neither fine-tuning nor camera calibration, providing a versatile solution for real-world use. Abstract: We introduce ZeroVO, a novel visual odometry (VO) algorithm that achieves zero-shot generalization across diverse cameras and environments, overcoming limitations in existing methods that depend on predefined or static camera calibration setups. Our approach incorporates three main innovations. First, we design a calibration-free, geometry-aware network structure capable of handling noise in estimated depth and camera parameters. Second, we introduce a language-based prior that infuses semantic information to enhance robust feature extraction and generalization to previously unseen domains. Third, we develop a flexible, semi-supervised training paradigm that iteratively adapts to new scenes using unlabeled data, further boosting the model's ability to generalize across diverse real-world scenarios. We analyze complex autonomous driving contexts, demonstrating over 30% improvement against prior methods on three standard benchmarks, KITTI, nuScenes, and Argoverse 2, as well as a newly introduced, high-fidelity synthetic dataset derived from Grand Theft Auto (GTA). By not requiring fine-tuning or camera calibration, our work broadens the applicability of VO, providing a versatile solution for real-world deployment at scale.

[183] Dreamland: Controllable World Creation with Simulator and Generative Models

Sicheng Mo,Ziyang Leng,Leon Liu,Weizhen Wang,Honglin He,Bolei Zhou

Main category: cs.CV

TL;DR: Dreamland is a hybrid world generation framework that combines a physics-based simulator with generative models and uses a layered abstraction to improve both image quality and controllability.

Details

Motivation: Existing large-scale video generative models lack element-wise controllability, which limits their use in scene editing and in training AI agents. Method: A layered world abstraction bridges a physics-based simulator and a generative model, using an intermediate representation that encodes pixel-level and object-level semantics and geometry. Result: Experiments show that Dreamland improves image quality by 50.8% and controllability by 17.9%, and is suitable for training AI agents. Conclusion: Through its hybrid design, Dreamland substantially improves the practicality and controllability of generative models and has broad application potential. Abstract: Large-scale video generative models can synthesize diverse and realistic visual content for dynamic world creation, but they often lack element-wise controllability, hindering their use in editing scenes and training embodied AI agents. We propose Dreamland, a hybrid world generation framework combining the granular control of a physics-based simulator and the photorealistic content output of large-scale pretrained generative models. In particular, we design a layered world abstraction that encodes both pixel-level and object-level semantics and geometry as an intermediate representation to bridge the simulator and the generative model. This approach enhances controllability, minimizes adaptation cost through early alignment with real-world distributions, and supports off-the-shelf use of existing and future pretrained generative models. We further construct a D3Sim dataset to facilitate the training and evaluation of hybrid generation pipelines. Experiments demonstrate that Dreamland outperforms existing baselines with 50.8% improved image quality and 17.9% stronger controllability, and has great potential to enhance embodied agent training. Code and data will be made available.

[184] Hidden in plain sight: VLMs overlook their visual representations

Stephanie Fu,Tyler Bonnen,Devin Guillory,Trevor Darrell

Main category: cs.CV

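TL;DR: The paper compares vision language models (VLMs) with their own visual encoders and finds that the VLMs perform substantially worse on vision-centric tasks, dropping to near-chance. It analyzes the degradation of visual representations, brittleness to task prompts, and the role of the language model, and locates the bottleneck in the VLMs' failure to use visual information effectively.

Details

Motivation: To examine how well vision language models (VLMs) integrate visual and linguistic information, in order to understand their performance on visual tasks. Method: VLMs are compared with a direct readout of their visual encoders on a series of vision-centric benchmarks (e.g., depth estimation, correspondence), followed by analyses of visual representation degradation, prompt brittleness, and the language model's role. Result: VLMs perform substantially worse than their visual encoders, at near-chance level, mainly because they do not use available visual information effectively and inherit the language priors of the LLM. Conclusion: The study diagnoses failure modes of open-source VLMs and presents a set of evaluations useful for future work on visual understanding in VLMs. Abstract: Language provides a natural interface to specify and evaluate performance on visual tasks. To realize this possibility, vision language models (VLMs) must successfully integrate visual and linguistic information. Our work compares VLMs to a direct readout of their visual encoders to understand their ability to integrate across these modalities. Across a series of vision-centric benchmarks (e.g., depth estimation, correspondence), we find that VLMs perform substantially worse than their visual encoders, dropping to near-chance performance. We investigate these results through a series of analyses across the entire VLM: namely 1) the degradation of vision representations, 2) brittleness to task prompt, and 3) the language model's role in solving the task. We find that the bottleneck in performing these vision-centric tasks lies in this third category; VLMs are not effectively using visual information easily accessible throughout the entire model, and they inherit the language priors present in the LLM. Our work helps diagnose the failure modes of open-source VLMs, and presents a series of evaluations useful for future investigations into visual understanding within VLMs.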

[185] Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Xun Huang,Zhengqi Li,Guande He,Mingyuan Zhou,Eli Shechtman

Main category: cs.CV

TL;DR: Self Forcing is a new training method for autoregressive video diffusion models that conditions on self-generated outputs to address exposure bias, enabling efficient, high-quality video generation.

Details

Motivation: To address exposure bias in autoregressive video diffusion models, where at inference time the model must generate sequences conditioned on its own imperfect outputs. Method: Autoregressive rollout with KV caching is performed during training so that each frame is generated from self-generated context, with supervision from a video-level loss. A few-step diffusion model and a gradient truncation strategy balance computational cost and performance. Result: Real-time video generation on a single GPU with sub-second latency, with quality matching or surpassing slower non-causal diffusion models. Conclusion: Self Forcing provides an efficient, high-quality approach to video generation that addresses exposure bias and suits real-time applications. Abstract: We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must generate sequences conditioned on their own imperfect outputs during inference. Unlike prior methods that denoise future frames based on ground-truth context frames, Self Forcing conditions each frame's generation on previously self-generated outputs by performing autoregressive rollout with key-value (KV) caching during training. This strategy enables supervision through a holistic loss at the video level that directly evaluates the quality of the entire generated sequence, rather than relying solely on traditional frame-wise objectives. To ensure training efficiency, we employ a few-step diffusion model along with a stochastic gradient truncation strategy, effectively balancing computational cost and performance. We further introduce a rolling KV cache mechanism that enables efficient autoregressive video extrapolation. Extensive experiments demonstrate that our approach achieves real-time streaming video generation with sub-second latency on a single GPU, while matching or even surpassing the generation quality of significantly slower and non-causal diffusion models. Project website: http://self-forcing.github.io/
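
The abstract above mentions a rolling KV cache without giving details. As a rough illustration of the general idea only, the Python sketch below keeps keys and values for a fixed number of most recent frames and silently drops the oldest entry when a new one arrives. The class name `RollingKVCache`, the `window` parameter, and the string placeholders used in place of tensors are assumptions for this example and may not match the paper's actual mechanism.

```python
from collections import deque


class RollingKVCache:
    """Fixed-size cache holding keys and values for the most recent frames."""

    def __init__(self, window):
        # deque(maxlen=...) discards the oldest item automatically when full
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value):
        """Store the key/value pair produced for a newly generated frame."""
        self.keys.append(key)
        self.values.append(value)

    def context(self):
        """Return the cached keys and values, oldest first."""
        return list(self.keys), list(self.values)


# Generate five frames with a three-frame window; strings stand in for tensors.
cache = RollingKVCache(window=3)
for frame in range(5):
    cache.append(f"k{frame}", f"v{frame}")
keys, values = cache.context()
```

Because the window size is fixed, memory and per-frame attention cost stay constant however long the rollout runs, which is what makes extrapolation beyond the training length cheap in this kind of design.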

[186] Vision Transformers Don't Need Trained Registers

Nick Jiang,Amil Dravid,Alexei Efros,Yossi Gandelsman

Main category: cs.CV

TL;DR: The study finds that high-norm tokens in Vision Transformers cause noisy attention maps and proposes a training-free fix: shifting the high-norm activations into an additional token, which cleans up attention and feature maps and improves model performance.

Details

Motivation: To understand the mechanism by which high-norm tokens cause attention noise in Vision Transformers, and to propose a solution that does not require retraining. Method: High-norm activations are shifted from specific neurons into an additional untrained token, mimicking the effect of register tokens. Result: The method produces cleaner attention and feature maps and improves performance on multiple downstream visual tasks, with results comparable to models explicitly trained with register tokens. Conclusion: Test-time registers offer a training-free solution for models released without register tokens and improve model interpretability. Abstract: We investigate the mechanism underlying a previously identified phenomenon in Vision Transformers -- the emergence of high-norm tokens that lead to noisy attention maps. We observe that in multiple models (e.g., CLIP, DINOv2), a sparse set of neurons is responsible for concentrating high-norm activations on outlier tokens, leading to irregular attention patterns and degrading downstream visual processing. While the existing solution for removing these outliers involves retraining models from scratch with additional learned register tokens, we use our findings to create a training-free approach to mitigate these artifacts. By shifting the high-norm activations from our discovered register neurons into an additional untrained token, we can mimic the effect of register tokens on a model already trained without registers. We demonstrate that our method produces cleaner attention and feature maps, enhances performance over base models across multiple downstream visual tasks, and achieves results comparable to models explicitly trained with register tokens. We then extend test-time registers to off-the-shelf vision-language models to improve their interpretability. Our results suggest that test-time registers effectively take on the role of register tokens at test-time, offering a training-free solution for any pre-trained model released without them.

[187] Play to Generalize: Learning to Reason Through Game Play

Yunfei Xie,Yinsong Ma,Shiyi Lan,Alan Yuille,Junfei Xiao,Chen Wei

Main category: cs.CV

TL;DR: The paper proposes ViGaL, a post-training paradigm in which multimodal large language models (MLLMs) play arcade-like games to improve out-of-domain multimodal reasoning. Experiments show significant gains on math and multi-discipline questions, surpassing models trained specifically on multimodal reasoning data.

Details

Motivation: Inspired by cognitive science findings that gameplay promotes transferable cognitive skills, the work explores using games to improve the general reasoning ability of MLLMs. Method: The ViGaL post-training paradigm trains a 7B-parameter MLLM with reinforcement learning (RL) on simple arcade-like games (such as Snake), with no exposure to worked solutions. Result: The model performs strongly on benchmarks such as MathVista and MMMU, outperforming specialist models, while preserving the base model's performance on general visual benchmarks. Conclusion: Rule-based games can serve as controllable and scalable pretext tasks that unlock generalizable multimodal reasoning in MLLMs. Abstract: Developing generalizable reasoning capabilities in multimodal large language models (MLLMs) remains challenging. Motivated by cognitive science literature suggesting that gameplay promotes transferable cognitive skills, we propose a novel post-training paradigm, Visual Game Learning, or ViGaL, where MLLMs develop out-of-domain generalization of multimodal reasoning through playing arcade-like games. Specifically, we show that post-training a 7B-parameter MLLM via reinforcement learning (RL) on simple arcade-like games, e.g. Snake, significantly enhances its downstream performance on multimodal math benchmarks like MathVista, and on multi-discipline questions like MMMU, without seeing any worked solutions, equations, or diagrams during RL, suggesting the capture of transferable reasoning skills. Remarkably, our model outperforms specialist models tuned on multimodal reasoning data in multimodal reasoning benchmarks, while preserving the base model's performance on general visual benchmarks, a challenge where specialist models often fall short. Our findings suggest a new post-training paradigm: synthetic, rule-based games can serve as controllable and scalable pre-text tasks that unlock generalizable multimodal reasoning abilities in MLLMs.

[188] StableMTL: Repurposing Latent Diffusion Models for Multi-Task Learning from Partially Annotated Synthetic Datasets

Anh-Quan Cao,Ivan Lopes,Raoul de Charette

Main category: cs.CV

TL;DR: StableMTL uses diffusion models for multi-task learning in a zero-shot setting, achieving inter-task synergy through latent regression and task encoding without the need to balance per-task losses.

Details

Motivation: Multi-task dense prediction needs extensive annotation, and approaches that train with partial task labels are limited in performance. The paper aims to achieve zero-shot multi-task learning by exploiting the generalization ability of diffusion models. Method: A denoising framework is adapted for latent regression with task encoding; a task-attention mechanism and a multi-stream model are introduced; and a unified latent loss simplifies training. Result: StableMTL outperforms baselines on 7 tasks across 8 benchmarks. Conclusion: Through inter-task synergy and simplified training, StableMTL achieves efficient multi-task learning. Abstract: Multi-task learning for dense prediction is limited by the need for extensive annotation for every task, though recent works have explored training with partial task labels. Leveraging the generalization power of diffusion models, we extend the partial learning setup to a zero-shot setting, training a multi-task model on multiple synthetic datasets, each labeled for only a subset of tasks. Our method, StableMTL, repurposes image generators for latent regression, adapting a denoising framework with task encoding, per-task conditioning, and a tailored training scheme. Instead of per-task losses requiring careful balancing, a unified latent loss is adopted, enabling seamless scaling to more tasks. To encourage inter-task synergy, we introduce a multi-stream model with a task-attention mechanism that converts N-to-N task interactions into efficient 1-to-N attention, promoting effective cross-task sharing. StableMTL outperforms baselines on 7 tasks across 8 benchmarks.

[189] 4DGT: Learning a 4D Gaussian Transformer Using Real-World Monocular Videos

Zhen Xu,Zhengqin Li,Zhao Dong,Xiaowei Zhou,Richard Newcombe,Zhaoyang Lv

Main category: cs.CV

TL;DR: 4DGT is a 4D Gaussian-based Transformer model for dynamic scene reconstruction, trained entirely on real-world monocular videos; by unifying static and dynamic components it models complex, time-varying environments efficiently.

Details

Motivation: Dynamic scene reconstruction must handle complex time-varying environments and objects with different lifespans, and existing methods are inefficient and hard to scale. Method: 4D Gaussians serve as an inductive bias, combined with a density control strategy; 64 consecutive frames are processed in a rolling-window fashion with purely feed-forward inference. Result: 4DGT significantly outperforms other Gaussian-based networks on real-world videos, matches the accuracy of optimization-based methods on cross-domain videos, and reduces reconstruction time from hours to seconds. Conclusion: Through efficient modeling and fast inference, 4DGT provides a scalable solution for dynamic scene reconstruction. Abstract: We propose 4DGT, a 4D Gaussian-based Transformer model for dynamic scene reconstruction, trained entirely on real-world monocular posed videos. Using 4D Gaussians as an inductive bias, 4DGT unifies static and dynamic components, enabling the modeling of complex, time-varying environments with varying object lifespans. We propose a novel density control strategy in training, which enables 4DGT to handle longer space-time input while keeping rendering efficient at runtime. Our model processes 64 consecutive posed frames in a rolling-window fashion, predicting consistent 4D Gaussians in the scene. Unlike optimization-based methods, 4DGT performs purely feed-forward inference, reducing reconstruction time from hours to seconds and scaling effectively to long video sequences. Trained only on large-scale monocular posed video datasets, 4DGT can outperform prior Gaussian-based networks significantly in real-world videos and achieve on-par accuracy with optimization-based methods on cross-domain videos. Project page: https://4dgt.github.io

cs.GR [Back]

[190] Vid2Sim: Generalizable, Video-based Reconstruction of Appearance, Geometry and Physics for Mesh-free Simulation

Chuhao Chen,Zhiyang Dou,Chen Wang,Yiming Huang,Anjun Chen,Qiao Feng,Jiatao Gu,Lingjie Liu

Main category: cs.GR

TL;DR: Vid2Sim is a generalizable video-based framework that efficiently recovers geometry and physical properties through a mesh-free reduced simulation, avoiding the heavy optimization of traditional methods.

Details

Motivation: Faithfully reconstructing textured shapes and physical properties from video is challenging, and traditional methods rely on heavy optimization pipelines that are inefficient and generalize poorly. Method: Vid2Sim combines a feed-forward neural network with a lightweight optimization pipeline and uses Linear Blend Skinning (LBS) for efficient mesh-free simulation. Result: Experiments show that Vid2Sim reconstructs geometry and physical properties with higher accuracy and efficiency. Conclusion: Vid2Sim offers an efficient, generalizable solution that substantially improves the reconstruction of physical systems from video. Abstract: Faithfully reconstructing textured shapes and physical properties from videos presents an intriguing yet challenging problem. Significant efforts have been dedicated to advancing such a system identification problem in this area. Previous methods often rely on heavy optimization pipelines with a differentiable simulator and renderer to estimate physical parameters. However, these approaches frequently necessitate extensive hyperparameter tuning for each scene and involve a costly optimization process, which limits both their practicality and generalizability. In this work, we propose a novel framework, Vid2Sim, a generalizable video-based approach for recovering geometry and physical properties through a mesh-free reduced simulation based on Linear Blend Skinning (LBS), offering high computational efficiency and versatile representation capability. Specifically, Vid2Sim first reconstructs the observed configuration of the physical system from video using a feed-forward neural network trained to capture physical world knowledge. A lightweight optimization pipeline then refines the estimated appearance, geometry, and physical properties to closely align with video observations within just a few minutes. Additionally, after the reconstruction, Vid2Sim enables high-quality, mesh-free simulation with high efficiency. Extensive experiments demonstrate that our method achieves superior accuracy and efficiency in reconstructing geometry and physical properties from video data.
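
Linear Blend Skinning, which the abstract above uses as the basis of its reduced simulation, deforms each point as a weighted sum of a few handle transforms. The Python sketch below shows the textbook formula in 2D for brevity; the function names, the 2x3 tuple layout for affine transforms, and the example handles are assumptions for illustration and are not taken from Vid2Sim.

```python
def apply_affine(transform, point):
    """Apply a 2D affine transform ((a, b, tx), (c, d, ty)) to a point (x, y)."""
    (a, b, tx), (c, d, ty) = transform
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)


def linear_blend_skinning(point, weights, transforms):
    """Deform `point` as the weighted sum of its image under each handle.

    `weights` are per-handle skinning weights, assumed to sum to 1.
    """
    out_x, out_y = 0.0, 0.0
    for w, transform in zip(weights, transforms):
        px, py = apply_affine(transform, point)
        out_x += w * px
        out_y += w * py
    return (out_x, out_y)


# Two handles: one stays put, the other translates by 2 units along x.
identity = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
shift_right = ((1.0, 0.0, 2.0), (0.0, 1.0, 0.0))
moved = linear_blend_skinning((1.0, 1.0), [0.5, 0.5], [identity, shift_right])
```

A point influenced equally by both handles moves half the distance of the translating handle. The appeal for simulation is that the state reduces to a handful of handle transforms rather than every point's position.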

[191] Splat and Replace: 3D Reconstruction with Repetitive Elements

Nicolás Violante,Andreas Meuleman,Alban Gauthier,Frédo Durand,Thibault Groueix,George Drettakis

Main category: cs.GR

TL;DR: Repetitive elements in 3D scenes are leveraged to improve the quality of novel view synthesis.

Details

Motivation: NeRF and 3DGS perform well at novel view synthesis, but renderings of unseen or occluded parts remain limited by insufficient coverage in the training views. The repetitive elements that are common in real environments offer a way to address this. Method: Repeated instances in a 3DGS reconstruction are segmented and registered, and information is shared among instances while accounting for appearance variations. Result: The method is validated on synthetic and real scenes and substantially improves the quality of novel view synthesis. Conclusion: Exploiting repetitive elements effectively improves low-quality regions caused by poor coverage or occlusion. Abstract: We leverage repetitive elements in 3D scenes to improve novel view synthesis. Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have greatly improved novel view synthesis but renderings of unseen and occluded parts remain low-quality if the training views are not exhaustive enough. Our key observation is that our environment is often full of repetitive elements. We propose to leverage those repetitions to improve the reconstruction of low-quality parts of the scene due to poor coverage and occlusions. We propose a method that segments each repeated instance in a 3DGS reconstruction, registers them together, and allows information to be shared among instances. Our method improves the geometry while also accounting for appearance variations across instances. We demonstrate our method on a variety of synthetic and real scenes with typical repetitive elements, leading to a substantial improvement in the quality of novel view synthesis.

[192] Noise Consistency Regularization for Improved Subject-Driven Image Synthesis

Yao Ni,Song Wen,Piotr Koniusz,Anoop Cherian

Main category: cs.GR

TL;DR: The paper proposes two auxiliary consistency losses for fine-tuning Stable Diffusion that address underfitting and overfitting, improving both the diversity of generated images and subject consistency.

Details

Motivation: Existing fine-tuning methods for subject-driven generation suffer from underfitting (failing to capture subject identity reliably) and overfitting (memorizing the subject image and reducing background diversity). Method: Two consistency losses are proposed: a prior consistency regularization loss (keeping the predicted diffusion noise for non-subject images consistent with the pretrained model) and a subject consistency regularization loss (improving robustness to latent codes modulated by multiplicative noise). Result: With these losses, the model outperforms DreamBooth in CLIP scores, background variation, and visual quality while preserving subject identity. Conclusion: The proposed method effectively addresses underfitting and overfitting in fine-tuning and improves image diversity and subject consistency. Abstract: Fine-tuning Stable Diffusion enables subject-driven image synthesis by adapting the model to generate images containing specific subjects. However, existing fine-tuning methods suffer from two key issues: underfitting, where the model fails to reliably capture subject identity, and overfitting, where it memorizes the subject image and reduces background diversity. To address these challenges, we propose two auxiliary consistency losses for diffusion fine-tuning. First, a prior consistency regularization loss ensures that the predicted diffusion noise for prior (non-subject) images remains consistent with that of the pretrained model, improving fidelity. Second, a subject consistency regularization loss enhances the fine-tuned model's robustness to multiplicative noise modulated latent code, helping to preserve subject identity while improving diversity. Our experimental results demonstrate that incorporating these losses into fine-tuning not only preserves subject identity but also enhances image diversity, outperforming DreamBooth in terms of CLIP scores, background variation, and overall visual quality.

[193] JGS2: Near Second-order Converging Jacobi/Gauss-Seidel for GPU Elastodynamics

Lei Lan,Zixuan Lu,Chun Yuan,Weiwei Xu,Hao Su,Huamin Wang,Chenfanfu Jiang,Yin Yang

Main category: cs.GR

TL;DR: A new GPU algorithm achieves convergence comparable to the full-space Newton's method while retaining good parallelism. By mitigating overshoot, it runs at a cost only marginally higher than the Jacobi method yet converges almost as fast as Newton's method.

Details

Motivation: In parallel simulation, convergence and parallelism are often seen as conflicting goals, and conventional methods sacrifice convergence speed to gain parallelism. The paper aims to resolve this conflict. Method: Based on an analysis of overshoot, a theoretically second-order optimal solution is derived and adapted into a pre-computable form. Cubature sampling and a full-coordinate formulation make the pre-computation efficient. Result: The method achieves second-order convergence for both stiff and soft materials and converges 50 to 100 times better than existing GPU methods. Conclusion: The algorithm substantially improves convergence while keeping high parallelism, offering an efficient solution for parallel simulation. Abstract: In parallel simulation, convergence and parallelism are often seen as inherently conflicting objectives. Improved parallelism typically entails lighter local computation and weaker coupling, which unavoidably slow the global convergence. This paper presents a novel GPU algorithm that achieves convergence rates comparable to full-space Newton's method while maintaining good parallelizability just like the Jacobi method. Our approach is built on a key insight into the phenomenon of overshoot. Overshoot occurs when a local solver aggressively minimizes its local energy without accounting for the global context, resulting in a local update that undermines global convergence. To address this, we derive a theoretically second-order optimal solution to mitigate overshoot. Furthermore, we adapt this solution into a pre-computable form. Leveraging Cubature sampling, our runtime cost is only marginally higher than the Jacobi method, yet our algorithm converges nearly quadratically, like Newton's method. We also introduce a novel full-coordinate formulation for more efficient pre-computation. Our method integrates seamlessly with the incremental potential contact method and achieves second-order convergence for both stiff and soft materials. Experimental results demonstrate that our approach delivers high-quality simulations and outperforms state-of-the-art GPU methods with 50 to 100 times better convergence.
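
The trade-off between parallelism and convergence that motivates the abstract above is already visible in the classic linear setting. The Python sketch below contrasts textbook Jacobi and Gauss-Seidel iterations on a small diagonally dominant system; it is background illustration only, not the JGS2 algorithm, and the matrix, right-hand side, and helper names are made up for the example.

```python
def jacobi(A, b, iterations):
    """Jacobi iteration: each component is updated from the previous iterate."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        # No component depends on another update from the same sweep,
        # so all n updates could run in parallel.
        x = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
    return x


def gauss_seidel(A, b, iterations):
    """Gauss-Seidel iteration: updates are consumed immediately, in order."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        for i in range(n):
            # Uses components already refreshed in this sweep, which speeds
            # up convergence but makes the sweep sequential.
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / A[i][i]
    return x


def residual_norm(A, b, x):
    """Largest absolute entry of A x - b."""
    n = len(b)
    return max(abs(sum(A[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n))


# Diagonally dominant system with exact solution (1/11, 7/11).
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_jacobi = jacobi(A, b, 10)
x_gs = gauss_seidel(A, b, 10)
```

After the same number of sweeps, the Gauss-Seidel residual is smaller than the Jacobi residual on this system, while only the Jacobi sweep is trivially parallel. That tension is what the paper sets out to reduce.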

[194] CrossGen: Learning and Generating Cross Fields for Quad Meshing

Qiujie Dong,Jiepeng Wang,Rui Xu,Cheng Lin,Yuan Liu,Shiqing Xin,Zichun Zhong,Xin Li,Changhe Tu,Taku Komura,Leif Kobbelt,Scott Schaefer,Wenping Wang

Main category: cs.GR

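TL;DR: CrossGen is a novel framework that unifies geometry and cross field representations in a joint latent space, enabling fast generation of high-quality cross fields for quad mesh generation.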

Details

Motivation: Existing cross field generation methods struggle to balance computational efficiency with generation quality and typically rely on slow per-shape optimization. Method: An auto-encoder architecture encodes point-cloud surfaces into a sparse voxel grid and decodes them into SDF-based geometry and cross fields; a diffusion model is added to generate cross fields for new shapes. Result: High-quality cross fields are produced typically within one second, with high geometric fidelity, noise resilience, and rapid inference. Conclusion: CrossGen performs well on quad mesh generation and applies to a wide variety of surface shapes. Abstract: Cross fields play a critical role in various geometry processing tasks, especially for quad mesh generation. Existing methods for cross field generation often struggle to balance computational efficiency with generation quality, using slow per-shape optimization. We introduce CrossGen, a novel framework that supports both feed-forward prediction and latent generative modeling of cross fields for quad meshing by unifying geometry and cross field representations within a joint latent space. Our method enables extremely fast computation of high-quality cross fields of general input shapes, typically within one second without per-shape optimization. Our method takes a point-sampled surface, also called a point-cloud surface, as input, so we can accommodate various surface representations by a straightforward point sampling process. Using an auto-encoder network architecture, we encode input point-cloud surfaces into a sparse voxel grid with fine-grained latent spaces, which are decoded into both SDF-based surface geometry and cross fields. We also contribute a dataset of models with both high-quality signed distance field (SDF) representations and their corresponding cross fields, and use it to train our network. Once trained, the network is capable of computing a cross field of an input surface in a feed-forward manner, ensuring high geometric fidelity, noise resilience, and rapid inference. Furthermore, leveraging the same unified latent representation, we incorporate a diffusion model for computing cross fields of new shapes generated from partial input, such as sketches. To demonstrate its practical applications, we validate CrossGen on the quad mesh generation task for a large variety of surface shapes. Experimental results...
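For readers unfamiliar with cross fields: a cross at a surface point is a set of four tangent directions that is unchanged by 90-degree rotation, so it can be described by an angle taken modulo π/2. The sketch below shows one standard way to handle that symmetry in a local 2D tangent frame. It is an illustration only; the abstract does not state which representation CrossGen uses, and the function names are hypothetical.

```python
import math


def cross_to_vec(theta):
    """Map a cross angle to a vector that is identical for theta + k*pi/2."""
    return (math.cos(4.0 * theta), math.sin(4.0 * theta))


def vec_to_cross(v):
    """Recover a representative cross angle in [0, pi/2) from the vector."""
    return (math.atan2(v[1], v[0]) / 4.0) % (math.pi / 2.0)


def cross_difference(theta_a, theta_b):
    """Smallest rotation (radians) aligning two crosses; lies in [0, pi/4]."""
    d = (theta_a - theta_b) % (math.pi / 2.0)
    return min(d, math.pi / 2.0 - d)


# Rotating a cross by 90 degrees gives the same cross.
print(cross_difference(0.1, 0.1 + math.pi / 2.0))
```

Averaging or regressing the vector form rather than the raw angle avoids the discontinuity at the π/2 wrap-around.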

[195] Accelerating 3D Gaussian Splatting with Neural Sorting and Axis-Oriented Rasterization

Zhican Wang,Guanghui He,Dantong Liu,Lingjun Gao,Shell Xu Hu,Chen Zhang,Zhuoran Song,Nicholas Lane,Wayne Luk,Hongxiang Fan

Main category: cs.GR

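TL;DR: This paper proposes an architecture-algorithm co-design that uses axis-oriented rasterization and neural sorting to substantially improve real-time rendering of 3D Gaussian Splatting (3DGS) on resource-constrained devices.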

Details

Motivation: Although 3DGS delivers high-quality and efficient view synthesis, real-time rendering on resource-constrained devices remains challenging, mainly because of tight power and area budgets. Method: (1) Axis-oriented rasterization pre-computes and reuses shared terms to cut redundant computation; (2) neural sorting predicts order-independent blending weights; (3) a reconfigurable processing array and a $\pi$-trajectory tile schedule improve hardware utilization. Result: Experiments show that the design preserves rendering quality while achieving a speedup of $23.4\sim27.8\times$ and energy savings of $28.8\sim51.4\times$. Conclusion: The co-design substantially improves 3DGS performance on resource-constrained devices, and the authors plan to open-source it. Abstract: 3D Gaussian Splatting (3DGS) has recently gained significant attention for high-quality and efficient view synthesis, making it widely adopted in fields such as AR/VR, robotics, and autonomous driving. Despite its impressive algorithmic performance, real-time rendering on resource-constrained devices remains a major challenge due to tight power and area budgets. This paper presents an architecture-algorithm co-design to address these inefficiencies. First, we reveal substantial redundancy caused by repeated computation of common terms/expressions during the conventional rasterization. To resolve this, we propose axis-oriented rasterization, which pre-computes and reuses shared terms along both the X and Y axes through a dedicated hardware design, effectively reducing multiply-and-add (MAC) operations by up to 63%. Second, by identifying the resource and performance inefficiency of the sorting process, we introduce a novel neural sorting approach that predicts order-independent blending weights using an efficient neural network, eliminating the need for costly hardware sorters. A dedicated training framework is also proposed to improve its algorithmic stability. Third, to uniformly support rasterization and neural network inference, we design an efficient reconfigurable processing array that maximizes hardware utilization and throughput. Furthermore, we introduce a $\pi$-trajectory tile schedule, inspired by Morton encoding and Hilbert curve, to optimize Gaussian reuse and reduce memory access overhead. Comprehensive experiments demonstrate that the proposed design preserves rendering quality while achieving a speedup of $23.4\sim27.8\times$ and energy savings of $28.8\sim51.4\times$ compared to edge GPUs for real-world scenes. We plan to open-source our design to foster further development in this field.
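The abstract says the tile schedule is inspired by Morton encoding. As background, here is a minimal sketch of plain Morton (Z-order) encoding and the tile visiting order it induces. This is not the paper's $\pi$-trajectory schedule, whose details are not given in the abstract; `morton_tile_order` is a hypothetical helper.

```python
def part1by1(n):
    """Spread the lower 16 bits of n so a zero bit sits between each original bit."""
    n &= 0x0000FFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n


def morton_encode(x, y):
    """Interleave the bits of x (even positions) and y (odd positions)."""
    return part1by1(x) | (part1by1(y) << 1)


def morton_tile_order(width, height):
    """Return tile coordinates sorted by their Morton code."""
    tiles = [(x, y) for y in range(height) for x in range(width)]
    return sorted(tiles, key=lambda t: morton_encode(t[0], t[1]))


print(morton_tile_order(4, 4)[:8])
```

Visiting tiles in this order keeps consecutive tiles spatially close, which is why space-filling curves are attractive when neighboring tiles share data.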

[196] HOI-PAGE: Zero-Shot Human-Object Interaction Generation with Part Affordance Guidance

Lei Li,Angela Dai

Main category: cs.GR

TL;DR: HOI-PAGE is a new method for zero-shot synthesis of 4D human-object interactions (HOIs) from text prompts, driven by part-level affordance reasoning.

Details

Motivation: Prior methods focus on global, whole-body-object motion, whereas generating realistic and diverse HOIs requires a finer-grained understanding of how human body parts engage with object parts. Method: Part Affordance Graphs (PAGs), a structured HOI representation distilled from large language models, guide a three-stage synthesis: decomposing 3D objects into parts, generating reference videos and extracting motion constraints, and optimizing 4D HOI sequences. Result: Experiments show that the method flexibly generates complex multi-object or multi-person interaction sequences, with significantly improved realism and text alignment for zero-shot 4D HOI generation. Conclusion: Through part-level affordance reasoning, HOI-PAGE achieves more realistic and diverse 4D HOI synthesis. Abstract: We present HOI-PAGE, a new approach to synthesizing 4D human-object interactions (HOIs) from text prompts in a zero-shot fashion, driven by part-level affordance reasoning. In contrast to prior works that focus on global, whole body-object motion for 4D HOI synthesis, we observe that generating realistic and diverse HOIs requires a finer-grained understanding, at the level of how human body parts engage with object parts. We thus introduce Part Affordance Graphs (PAGs), a structured HOI representation distilled from large language models (LLMs) that encodes fine-grained part information along with contact relations. We then use these PAGs to guide a three-stage synthesis: first, decomposing input 3D objects into geometric parts; then, generating reference HOI videos from text prompts, from which we extract part-based motion constraints; finally, optimizing for 4D HOI motion sequences that not only mimic the reference dynamics but also satisfy part-level contact constraints. Extensive experiments show that our approach is flexible and capable of generating complex multi-object or multi-person interaction sequences, with significantly improved realism and text alignment for zero-shot 4D HOI generation.

[197] Immersive Visualization of Flat Surfaces Using Ray Marching

Fabian Lander,Diaaeldin Taha

Main category: cs.GR

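TL;DR: An efficient ray-marching-based method for visualizing flat surfaces is presented, suited to exploring translation surfaces, mirror rooms, unfolded polyhedra, and translation prisms.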

Details

Motivation: To provide an intuitive and computationally efficient way to visualize complex flat structures. Method: Ray marching is used, with concrete examples demonstrating the method's utility and with implementation details provided. Result: A variety of complex structures are visualized successfully, and the simulations and code are made public. Conclusion: The method is efficient and practical, and is also suitable for outreach. Abstract: We present an effective method for visualizing flat surfaces using ray marching. Our approach provides an intuitive way to explore translation surfaces, mirror rooms, unfolded polyhedra, and translation prisms while maintaining computational efficiency. We demonstrate the utility of the method through various examples and provide implementation insights for programmers. Finally, we discuss the use of our visualizations in outreach. We make our simulations and code available online.
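To make the idea concrete, the sketch below marches a ray on the simplest translation surface, the square flat torus, where opposite edges of the unit square are glued by translation. It assumes fixed-size steps and a single disc-shaped target for simplicity; it is not the authors' implementation, and `march_on_flat_torus` is a hypothetical name.

```python
import math


def march_on_flat_torus(origin, direction, target, radius, step=0.01, max_dist=10.0):
    """March a ray on the unit-square flat torus.

    Returns the distance travelled when the ray first comes within `radius`
    of `target`, or None if nothing is hit within `max_dist`.
    """
    length = math.hypot(direction[0], direction[1])
    dx, dy = direction[0] / length, direction[1] / length
    x, y = origin
    travelled = 0.0
    while travelled <= max_dist:
        # Distance on the torus: use the shorter wrapped offset along each axis.
        ox = abs(x - target[0])
        ox = min(ox, 1.0 - ox)
        oy = abs(y - target[1])
        oy = min(oy, 1.0 - oy)
        if math.hypot(ox, oy) <= radius:
            return travelled
        # Leaving through one edge re-enters through the opposite edge.
        x = (x + dx * step) % 1.0
        y = (y + dy * step) % 1.0
        travelled += step
    return None


# A ray heading right from x = 0.9 wraps around and reaches a target at x = 0.2.
hit = march_on_flat_torus((0.9, 0.5), (1.0, 0.0), (0.2, 0.5), 0.05)
print(hit)
```

The only change from ordinary ray marching is the wrap-around step, which is what lets a viewer "see" repeated copies of the scene through the glued edges.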

[198] PIG: Physically-based Multi-Material Interaction with 3D Gaussians

Zeyu Xiao,Zhenyi Wu,Mingyang Sun,Qipeng Yan,Yufan Guo,Zhuoer Liang,Lihua Zhang

Main category: cs.GR

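TL;DR: PIG combines 3D Gaussian segmentation with multi-material physical interaction to address the accuracy and rendering problems of object interaction in conventional 3D Gaussian scenes.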

Details

Motivation: Object interaction in conventional 3D Gaussian scenes suffers from inaccurate segmentation, imprecise deformation, and rendering artifacts. Method: (1) Fast, accurate mapping from 2D pixels to 3D Gaussians; (2) assigning physical properties to the segmented objects; (3) embedding constraint scales into deformation gradients. Result: Experiments show that PIG surpasses the state of the art in visual quality and opens new directions for physically realistic scene generation. Conclusion: PIG substantially improves interaction accuracy and rendering quality in 3D Gaussian scenes. Abstract: 3D Gaussian Splatting has achieved remarkable success in reconstructing both static and dynamic 3D scenes. However, in a scene represented by 3D Gaussian primitives, interactions between objects suffer from inaccurate 3D segmentation, imprecise deformation among different materials, and severe rendering artifacts. To address these challenges, we introduce PIG: Physically-Based Multi-Material Interaction with 3D Gaussians, a novel approach that combines 3D object segmentation with high-precision simulation of interacting objects. First, our method facilitates fast and accurate mapping from 2D pixels to 3D Gaussians, enabling precise 3D object-level segmentation. Second, we assign unique physical properties to the corresponding segmented objects within the scene for multi-material coupled interactions. Finally, we embed constraint scales into deformation gradients, specifically clamping the scaling and rotation properties of the Gaussian primitives to eliminate artifacts and achieve geometric fidelity and visual consistency. Experimental results demonstrate that our method not only outperforms the state-of-the-art (SOTA) in terms of visual quality, but also opens up new directions and pipelines for the field of physically realistic scene generation.

[199] GaussianVAE: Adaptive Learning Dynamics of 3D Gaussians for High-Fidelity Super-Resolution

Shuja Khalid,Mohamed Ibrahim,Yang Liu

Main category: cs.GR

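TL;DR: A lightweight generative model with a Hessian-assisted sampling strategy is proposed to raise the resolution and geometric fidelity of 3D Gaussian Splatting beyond the limit set by the input resolution, with efficient real-time enhancement.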

Details

Motivation: Existing 3DGS methods are limited by their input resolution and cannot produce finer details than the training views contain, so an efficient, lightweight solution is needed. Method: A lightweight generative model predicts and refines additional 3D Gaussians, combined with a Hessian-assisted sampling strategy that identifies regions needing densification. Result: Real-time inference on a single consumer-grade GPU (0.015 s per inference), with geometric accuracy and rendering quality significantly better than existing methods. Conclusion: The method offers a new paradigm for resolution-independent 3D scene enhancement and suits interactive applications. Abstract: We present a novel approach for enhancing the resolution and geometric fidelity of 3D Gaussian Splatting (3DGS) beyond native training resolution. Current 3DGS methods are fundamentally limited by their input resolution, producing reconstructions that cannot extrapolate finer details than are present in the training views. Our work breaks this limitation through a lightweight generative model that predicts and refines additional 3D Gaussians where needed most. The key innovation is our Hessian-assisted sampling strategy, which intelligently identifies regions that are likely to benefit from densification, ensuring computational efficiency. Unlike computationally intensive GANs or diffusion approaches, our method operates in real-time (0.015s per inference on a single consumer-grade GPU), making it practical for interactive applications. Comprehensive experiments demonstrate significant improvements in both geometric accuracy and rendering quality compared to state-of-the-art methods, establishing a new paradigm for resolution-free 3D scene enhancement.

[200] Speedy Deformable 3D Gaussian Splatting: Fast Rendering and Compression of Dynamic Scenes

Allen Tu,Haiyang Ying,Alex Hanson,Yonghan Lee,Tom Goldstein,Matthias Zwicker

Main category: cs.GR

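TL;DR: SpeeDe3DGS uses temporal sensitivity pruning and GroupFlow to substantially speed up rendering of dynamic 3DGS and 4DGS while reducing model size and training time.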

Details

Motivation: In dynamic 3DGS, running neural inference for every Gaussian at every frame slows rendering and raises memory and compute requirements, so optimization is needed. Method: (1) Temporal sensitivity pruning removes low-contribution Gaussians; (2) GroupFlow clusters Gaussians by trajectory and predicts one rigid transformation per group. Result: On the NeRF-DS dataset, rendering is 10.37 times faster, model size is 7.71 times smaller, and training time is 2.71 times shorter. Conclusion: The modular design of SpeeDe3DGS can be integrated into any dynamic 3DGS or 4DGS framework and substantially improves efficiency. Abstract: Recent extensions of 3D Gaussian Splatting (3DGS) to dynamic scenes achieve high-quality novel view synthesis by using neural networks to predict the time-varying deformation of each Gaussian. However, performing per-Gaussian neural inference at every frame poses a significant bottleneck, limiting rendering speed and increasing memory and compute requirements. In this paper, we present Speedy Deformable 3D Gaussian Splatting (SpeeDe3DGS), a general pipeline for accelerating the rendering speed of dynamic 3DGS and 4DGS representations by reducing neural inference through two complementary techniques. First, we propose a temporal sensitivity pruning score that identifies and removes Gaussians with low contribution to the dynamic scene reconstruction. We also introduce an annealing smooth pruning mechanism that improves pruning robustness in real-world scenes with imprecise camera poses. Second, we propose GroupFlow, a motion analysis technique that clusters Gaussians by trajectory similarity and predicts a single rigid transformation per group instead of separate deformations for each Gaussian. Together, our techniques accelerate rendering by $10.37\times$, reduce model size by $7.71\times$, and shorten training time by $2.71\times$ on the NeRF-DS dataset. SpeeDe3DGS also improves rendering speed by $4.20\times$ and $58.23\times$ on the D-NeRF and HyperNeRF vrig datasets. Our methods are modular and can be integrated into any deformable 3DGS or 4DGS framework.

[201] Squeeze3D: Your 3D Generation Model is Secretly an Extreme Neural Compressor

Rishit Dagli,Yushi Guan,Sankeerth Durvasula,Mohammadreza Mofayezi,Nandita Vijaykumar

Main category: cs.GR

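TL;DR: Squeeze3D is a 3D data compression framework with high compression ratios that exploits the implicit prior knowledge of pre-trained 3D generative models; it supports multiple 3D data formats and needs no real datasets for training.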

Details

Motivation: Existing 3D data compression methods usually require large real datasets for training and offer limited compression ratios. Squeeze3D aims for higher compression ratios and more flexibility by exploiting the implicit knowledge of pre-trained models. Method: Trainable mapping networks connect the latent spaces of a pre-trained encoder and a generative model; 3D data is compressed into a compact latent code and then decompressed by the generative model. Result: Experiments show compression ratios of up to 2187x for textured meshes, 55x for point clouds, and 619x for radiance fields, with visual quality comparable to existing methods. Conclusion: Squeeze3D offers an efficient, flexible 3D compression scheme that needs no real training datasets, supports multiple formats, and has low compression and decompression latency. Abstract: We propose Squeeze3D, a novel framework that leverages implicit prior knowledge learnt by existing pre-trained 3D generative models to compress 3D data at extremely high compression ratios. Our approach bridges the latent spaces between a pre-trained encoder and a pre-trained generation model through trainable mapping networks. Any 3D model represented as a mesh, point cloud, or a radiance field is first encoded by the pre-trained encoder and then transformed (i.e. compressed) into a highly compact latent code. This latent code can effectively be used as an extremely compressed representation of the mesh or point cloud. A mapping network transforms the compressed latent code into the latent space of a powerful generative model, which is then conditioned to recreate the original 3D model (i.e. decompression). Squeeze3D is trained entirely on generated synthetic data and does not require any 3D datasets. The Squeeze3D architecture can be flexibly used with existing pre-trained 3D encoders and existing generative models. It can flexibly support different formats, including meshes, point clouds, and radiance fields. Our experiments demonstrate that Squeeze3D achieves compression ratios of up to 2187x for textured meshes, 55x for point clouds, and 619x for radiance fields while maintaining visual quality comparable to many existing methods. Squeeze3D only incurs a small compression and decompression latency since it does not involve training object-specific networks to compress an object.

cs.CL [Back]

[202] How Significant Are the Real Performance Gains? An Unbiased Evaluation Framework for GraphRAG

Qiming Zeng,Xiao Yan,Hao Luo,Yuhao Lin,Yuxiang Wang,Fangcheng Fu,Bo Du,Quanqing Xu,Jiawei Jiang

Main category: cs.CL

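TL;DR: The paper proposes an unbiased evaluation framework that addresses two key flaws in existing GraphRAG evaluation (unrelated questions and evaluation biases), and finds experimentally that the performance gains of existing methods are more moderate than previously reported.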

Details

Motivation: The current evaluation framework for GraphRAG methods suffers from unrelated questions and evaluation biases, which may lead to biased or even wrong conclusions about performance. Method: Graph-text-grounded question generation produces more relevant questions, and an unbiased evaluation procedure removes biases in LLM-based answer assessment. Result: Applying the framework to 3 representative GraphRAG methods shows that their performance gains are more moderate than previously reported. Conclusion: Although the framework may still have flaws, it calls for scientific evaluation to lay a solid foundation for GraphRAG research. Abstract: By retrieving contexts from knowledge graphs, graph-based retrieval-augmented generation (GraphRAG) enhances large language models (LLMs) to generate quality answers for user questions. Many GraphRAG methods have been proposed and have reported inspiring performance in answer quality. However, we observe that the current answer evaluation framework for GraphRAG has two critical flaws, i.e., unrelated questions and evaluation biases, which may lead to biased or even wrong conclusions on performance. To tackle the two flaws, we propose an unbiased evaluation framework that uses graph-text-grounded question generation to produce questions that are more related to the underlying dataset and an unbiased evaluation procedure to eliminate the biases in LLM-based answer assessment. We apply our unbiased framework to evaluate 3 representative GraphRAG methods and find that their performance gains are much more moderate than reported previously. Although our evaluation framework may still have flaws, it calls for scientific evaluations to lay solid foundations for GraphRAG research.

[203] TESU-LLM: Training Speech-LLMs Without Speech via Unified Encoder Alignment

Taesoo Kim,Jong Hwan Ko

Main category: cs.CL

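TL;DR: TESU-LLM is a new framework for training speech-capable language models using only text data; a unified encoder and a lightweight projection network let it generalize from text supervision to speech-based inference.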

Details

Motivation: Existing speech language models rely on large-scale paired speech-text data and heavy computation, which limits scalability and accessibility. Method: A unified encoder maps semantically equivalent text and speech inputs to a shared latent space, and a lightweight projection network aligns its output with the LLM's embedding space. Result: Trained only on text, TESU-LLM performs strongly on several speech-related benchmarks, comparable to baselines trained on multimodal data. Conclusion: TESU-LLM offers an efficient, scalable approach that needs no speech data, opening a new path to building speech LLMs. Abstract: Recent advances in speech-enabled language models have shown promising results in building intelligent voice assistants. However, most existing approaches rely on large-scale paired speech-text data and extensive computational resources, which pose challenges in terms of scalability and accessibility. In this paper, we present TESU-LLM, a novel framework that enables training speech-capable language models using only text data. Our key insight is to leverage a unified encoder that maps semantically equivalent text and speech inputs to a shared latent space. By aligning the encoder output with the embedding space of an LLM via a lightweight projection network, we enable the model to generalize from text-only supervision to speech-based inference. Despite being trained exclusively on text, TESU-LLM achieves strong performance on various speech-related benchmarks, comparable to baseline methods trained with large-scale multimodal datasets and substantial computational resources. These results highlight the effectiveness and efficiency of our approach, offering a scalable path toward building speech LLMs without speech data.

[204] Unified Game Moderation: Soft-Prompting and LLM-Assisted Label Transfer for Resource-Efficient Toxicity Detection

Zachary Yang,Domenico Tullo,Reihaneh Rabbany

Main category: cs.CL

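TL;DR: The paper proposes a soft-prompting approach and an LLM-assisted label transfer framework to scale toxicity detection across multiple games and languages, significantly reducing computational resources and maintenance costs.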

Details

Motivation: Toxicity detection in gaming communities faces scaling and real-time challenges across multiple games and languages, and needs an efficient, scalable solution. Method: (1) A soft-prompting approach lets a single model handle multiple games; (2) an LLM-assisted label transfer framework based on GPT-4o-mini extends support to seven additional languages. Result: Macro F1-scores on French, German, Portuguese, and Russian range from 32.96% to 58.88%, with German surpassing the English benchmark (45.39%). Resource consumption in production is significantly reduced. Conclusion: The approach achieves efficient toxicity detection across games and languages and successfully identifies sanctionable behavior in real deployment. Abstract: Toxicity detection in gaming communities faces significant scaling challenges when expanding across multiple games and languages, particularly in real-time environments where computational efficiency is crucial. We present two key findings to address these challenges while building upon our previous work on ToxBuster, a BERT-based real-time toxicity detection system. First, we introduce a soft-prompting approach that enables a single model to effectively handle multiple games by incorporating game-context tokens, matching the performance of more complex methods like curriculum learning while offering superior scalability. Second, we develop an LLM-assisted label transfer framework using GPT-4o-mini to extend support to seven additional languages. Evaluations on real game chat data across French, German, Portuguese, and Russian achieve macro F1-scores ranging from 32.96% to 58.88%, with particularly strong performance in German, surpassing the English benchmark of 45.39%. In production, this unified approach significantly reduces computational resources and maintenance overhead compared to maintaining separate models for each game and language combination. At Ubisoft, this model successfully identifies an average of 50 players per game per day engaging in sanctionable behavior.
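Since the results above are reported as macro F1-scores, a short worked example may help: macro F1 averages the per-class F1 with equal weight, so errors on a rare class count as much as errors on a common one. The labels and predictions below are made-up illustration data, not from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical chat lines: 8 benign, 2 toxic.
y_true = ["ok"] * 8 + ["toxic"] * 2
y_pred = ["ok"] * 7 + ["toxic"] + ["toxic", "ok"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, macro_f1(y_true, y_pred))
```

Here accuracy is 0.8, but macro F1 is 0.6875, because the "toxic" class has an F1 of only 0.5 and the "ok" class 0.875.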

[205] Relationship Detection on Tabular Data Using Statistical Analysis and Large Language Models

Panagiotis Koletsis,Christos Panagiotopoulos,Georgios Th. Papadopoulos,Vasilis Efthymiou

Main category: cs.CL

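TL;DR: The paper proposes a hybrid approach that combines a knowledge graph with statistical analysis to detect relationships between columns of unlabeled tabular data (the CPA task), and validates it on benchmark datasets.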

Details

Motivation: Table interpretation tasks are increasingly important, but existing methods struggle with unlabeled data, motivating a hybrid approach that combines a knowledge graph with statistical analysis. Method: A knowledge graph serves as the reference, and large language models are combined with statistical analysis (domain and range constraint detection, relation co-appearance analysis) to narrow the search space of candidate relations. Result: On the benchmark datasets provided by the SemTab challenge, the method is competitive with state-of-the-art approaches. Conclusion: The hybrid approach is effective for the CPA task, and the code is publicly available for further research and use. Abstract: Over the past few years, table interpretation tasks have made significant progress due to their importance and the introduction of new technologies and benchmarks in the field. This work experiments with a hybrid approach for detecting relationships among columns of unlabeled tabular data, using a Knowledge Graph (KG) as a reference point, a task known as CPA. This approach leverages large language models (LLMs) while employing statistical analysis to reduce the search space of potential KG relations. The main modules of this approach for reducing the search space are domain and range constraints detection, as well as relation co-appearance analysis. The experimental evaluation on two benchmark datasets provided by the SemTab challenge assesses the influence of each module and the effectiveness of different state-of-the-art LLMs at various levels of quantization. The experiments were also performed with different prompting techniques. The proposed methodology, which is publicly available on GitHub, proved to be competitive with state-of-the-art approaches on these datasets.

[206] Enhancing Decision-Making of Large Language Models via Actor-Critic

Heng Dong,Kefei Duan,Chongjie Zhang

Main category: cs.CL

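TL;DR: The paper proposes LAC, an LLM-based Actor-Critic framework that improves LLM policies through long-term action evaluation, addresses long-term reasoning in complex decision-making, and performs strongly across several environments.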

Details

Motivation: Existing methods suffer from insufficient long-term reasoning and inaccurate outcome assessment in complex decision-making scenarios, leading to sub-optimal decisions. Method: The LAC framework extracts action evaluations by computing Q-values associated with positive and negative outcomes, enhanced by future trajectory rollouts and reasoning, and improves the policy through a gradient-free mechanism. Result: LAC outperforms existing methods on ALFWorld, BabyAI-Text, and WebShop, and even surpasses GPT-4-based baselines while using smaller LLMs. Conclusion: LAC shows the potential of combining structured policy optimization with the intrinsic knowledge of LLMs to improve decision-making in multi-step environments. Abstract: Large Language Models (LLMs) have achieved remarkable advancements in natural language processing tasks, yet they encounter challenges in complex decision-making scenarios that require long-term reasoning and alignment with high-level objectives. Existing methods either rely on short-term auto-regressive action generation or face limitations in accurately simulating rollouts and assessing outcomes, leading to sub-optimal decisions. This paper introduces a novel LLM-based Actor-Critic framework, termed LAC, that effectively improves LLM policies with long-term action evaluations in a principled and scalable way. Our approach addresses two key challenges: (1) extracting robust action evaluations by computing Q-values via token logits associated with positive/negative outcomes, enhanced by future trajectory rollouts and reasoning; and (2) enabling efficient policy improvement through a gradient-free mechanism. Experiments across diverse environments, including high-level decision-making (ALFWorld), low-level action spaces (BabyAI-Text), and large action spaces (WebShop), demonstrate the framework's generality and superiority over state-of-the-art methods. Notably, our approach achieves competitive performance using 7B/8B parameter LLMs, even outperforming baseline methods employing GPT-4 in complex tasks. These results underscore the potential of integrating structured policy optimization with LLMs' intrinsic knowledge to advance decision-making capabilities in multi-step environments.
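The two ingredients named in this abstract, Q-values derived from positive/negative outcome logits and a gradient-free policy update, can be illustrated with a small sketch. The specific formulas here are assumptions chosen for illustration (log-odds as the Q-value, exponential re-weighting of the prior policy); the abstract does not give the exact form used by LAC, and the action names and logit values are made up.

```python
import math


def q_from_logits(logit_pos, logit_neg):
    """Assumed Q-value: log-odds of a positive vs. negative outcome.

    With a softmax over the two outcome tokens, log(p_pos / p_neg)
    reduces to the difference of the two logits.
    """
    return logit_pos - logit_neg


def improve_policy(prior_probs, q_values, alpha=1.0):
    """Gradient-free update: re-weight the prior by exp(alpha * Q) and renormalize."""
    weights = {a: p * math.exp(alpha * q_values[a]) for a, p in prior_probs.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


# Hypothetical candidate actions proposed by the actor, initially equally likely.
prior = {"open drawer": 0.5, "go to desk": 0.5}
q = {
    "open drawer": q_from_logits(2.0, 0.0),   # critic leans towards success
    "go to desk": q_from_logits(0.0, 1.0),    # critic leans towards failure
}
posterior = improve_policy(prior, q)
print(posterior)
```

No model weights change in this update; only the action distribution is adjusted at decision time, which is what makes the mechanism gradient-free.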

[207] Detection Method for Prompt Injection by Integrating Pre-trained Model and Heuristic Feature Engineering

Yi Ji,Runzhi Li,Baolei Mao

Main category: cs.CL

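TL;DR: The paper proposes DMPI-PMHFE, a dual-channel feature fusion framework for detecting prompt injection attacks on large language models; it combines a pre-trained model with heuristic feature engineering and significantly improves detection.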

Details

Motivation: With the wide adoption of large language models, prompt injection attacks have become a major security threat. Existing defenses trade off effectiveness against generalizability, so efficient and general detection methods are urgently needed. Method: The DMPI-PMHFE framework uses DeBERTa-v3-base to extract semantic vectors and heuristic rules to extract structural features, then fuses both through a fully connected neural network for prediction. Result: Experiments show that DMPI-PMHFE outperforms existing methods in accuracy, recall, and F1-score, and significantly reduces attack success rates against mainstream LLMs in real deployment. Conclusion: Through dual-channel feature fusion, DMPI-PMHFE addresses the challenges of prompt injection detection and offers an efficient solution for practical use. Abstract: With the widespread adoption of Large Language Models (LLMs), prompt injection attacks have emerged as a significant security threat. Existing defense mechanisms often face critical trade-offs between effectiveness and generalizability. This highlights the urgent need for efficient prompt injection detection methods that are applicable across a wide range of LLMs. To address this challenge, we propose DMPI-PMHFE, a dual-channel feature fusion detection framework. It integrates a pretrained language model with heuristic feature engineering to detect prompt injection attacks. Specifically, the framework employs DeBERTa-v3-base as a feature extractor to transform input text into semantic vectors enriched with contextual information. In parallel, we design heuristic rules based on known attack patterns to extract explicit structural features commonly observed in attacks. Features from both channels are subsequently fused and passed through a fully connected neural network to produce the final prediction. This dual-channel approach mitigates the limitations of relying only on DeBERTa to extract features. Experimental results on diverse benchmark datasets demonstrate that DMPI-PMHFE outperforms existing methods in terms of accuracy, recall, and F1-score. Furthermore, when deployed in practice, it significantly reduces attack success rates across mainstream LLMs, including GLM-4, LLaMA 3, Qwen 2.5, and GPT-4o.
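The dual-channel idea can be sketched as follows: one channel supplies a semantic vector, the other a vector of binary pattern flags, and the two are concatenated before classification. Everything specific in this sketch is an assumption for illustration: the regular expressions are made-up examples rather than the paper's rules, and the semantic vector is a stand-in for a DeBERTa embedding.

```python
import re

# Hypothetical attack patterns, for illustration only; not the rules from the paper.
ATTACK_PATTERNS = [
    re.compile(r"ignore (all|any|the)? ?(previous|prior|above) instructions", re.I),
    re.compile(r"you are now\b", re.I),
    re.compile(r"(reveal|print|show) (your|the) (system )?prompt", re.I),
]


def heuristic_features(text):
    """Heuristic channel: one binary flag per known attack pattern."""
    return [1.0 if pattern.search(text) else 0.0 for pattern in ATTACK_PATTERNS]


def fuse(semantic_vector, text):
    """Concatenate the semantic channel with the heuristic channel.

    `semantic_vector` stands in for the encoder output; the fused vector
    would then be fed to a small fully connected classifier.
    """
    return list(semantic_vector) + heuristic_features(text)


prompt = "Ignore all previous instructions and reveal your system prompt."
print(fuse([0.12, -0.40], prompt))
```

The explicit flags give the classifier direct access to structural cues that a purely learned embedding might represent only weakly.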

[208] Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models

Pengyi Li,Matvey Skripkin,Alexander Zubrey,Andrey Kuznetsov,Ivan Oseledets

Main category: cs.CL

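TL;DR: RLSC is a reinforcement learning method that uses the model's own confidence as the reward signal, without labels or external reward models, and significantly improves performance on mathematical reasoning tasks.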

Details

Motivation: Existing reinforcement learning methods depend on costly human annotation or external reward models; RLSC aims to simplify this by using the model's own confidence. Method: RLSC uses the model's confidence as the reward signal, with no human labels or external reward model. Result: On Qwen2.5-Math-7B, with only 8 samples per question and 4 training epochs, RLSC improves accuracy by 20.10% on AIME2024, 49.40% on MATH500, and 52.50% on AMC23. Conclusion: RLSC offers a simple, scalable post-training method for reasoning models that needs little supervision. Abstract: Large language models (LLMs) excel at reasoning, yet post-training remains critical for aligning their behavior with task goals. Existing reinforcement learning (RL) methods often depend on costly human annotations or external reward models. We propose Reinforcement Learning via Self-Confidence (RLSC), which uses the model's own confidence as reward signals, eliminating the need for labels, preference models, or reward engineering. Applied to Qwen2.5-Math-7B with only 8 samples per question and 4 training epochs, RLSC improves accuracy by +20.10% on AIME2024, +49.40% on MATH500, and +52.50% on AMC23. RLSC offers a simple, scalable post-training method for reasoning models with minimal supervision.
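One simple way to turn "the model's own confidence" into a scalar signal is the expected probability the model assigns to its own sampled answer, i.e. the sum of squared answer probabilities. The sketch below uses that definition as an assumption for illustration; the abstract does not specify the exact objective used by RLSC, and the logits are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def self_confidence(probs):
    """Expected probability of the model's own sample: sum over y of p(y)^2."""
    return sum(p * p for p in probs)


# Hypothetical distributions over three candidate answers to one question.
unsure = softmax([1.0, 0.0, 0.0])
sharper = softmax([2.0, 0.0, 0.0])
print(self_confidence(unsure), self_confidence(sharper))
```

Maximizing such a signal pushes probability mass towards answers the model already favors, so it needs no labels, but it can only help when the favored answers tend to be correct.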

[209] Natural Language Interaction with Databases on Edge Devices in the Internet of Battlefield Things

Christopher D. Molek,Roberto Fronteddu,K. Brent Venable,Niranjan Suri

Main category: cs.CL

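TL;DR: The paper proposes a workflow that uses natural language processing (NLP) and large language models (LLMs) to enhance situational awareness in the Internet of Battlefield Things (IoBT), querying a database and summarizing its output in natural language.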

Details

Motivation: To improve the situational awareness that the IoBT provides for critical decision making by turning device data into consumer-ready information objects. Method: LLMs and a graph database are used to map natural language to Cypher queries and to summarize the database output in natural language. Result: Llama 3.1 (8 billion parameters) performs best, and the two-step approach raises accuracy by 19.4%. Conclusion: The workflow lays the groundwork for deploying LLMs on edge devices to enable natural language interaction. Abstract: The expansion of the Internet of Things (IoT) in the battlefield, Internet of Battlefield Things (IoBT), gives rise to new opportunities for enhancing situational awareness. To increase the potential of IoBT for situational awareness in critical decision making, the data from these devices must be processed into consumer-ready information objects, and made available to consumers on demand. To address this challenge we propose a workflow that makes use of natural language processing (NLP) to query a database technology and return a response in natural language. Our solution utilizes Large Language Models (LLMs) that are sized for edge devices to perform NLP, as well as graph databases, which are well suited to the dynamic connected networks that are pervasive in the IoBT. Our architecture employs LLMs both for mapping questions in natural language to Cypher database queries and for summarizing the database output back to the user in natural language. We evaluate several medium-sized LLMs for both of these tasks on a database representing publicly available data from the US Army's Multipurpose Sensing Area (MSA) at the Jornada Range in Las Cruces, NM. We observe that Llama 3.1 (8 billion parameters) outperforms the other models across all the considered metrics. Most importantly, we note that, unlike current methods, our two-step approach allows the relaxation of the Exact Match (EM) requirement of the produced Cypher queries with ground truth code and, in this way, it achieves a 19.4% increase in accuracy. Our workflow lays the groundwork for deploying LLMs on edge devices to enable natural language interactions with databases containing information objects for critical decision making.
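To see why relaxing Exact Match can raise measured accuracy, compare scoring a generated query by its text against scoring it by what it returns. The sketch below is one possible relaxed criterion, assumed for illustration; the abstract does not specify the paper's exact scoring rule. The `Sensor` label, its properties, and the result rows are hypothetical.

```python
def exact_match(predicted_query, reference_query):
    """Strict criterion: query strings must be identical up to whitespace."""
    normalize = lambda q: " ".join(q.split())
    return normalize(predicted_query) == normalize(reference_query)


def result_match(predicted_rows, reference_rows):
    """Relaxed criterion: both queries return the same rows, in any order."""
    canon = lambda rows: sorted(tuple(row) for row in rows)
    return canon(predicted_rows) == canon(reference_rows)


# Two hypothetical Cypher queries that differ as text but ask for the same data.
reference = "MATCH (s:Sensor) WHERE s.type = 'acoustic' RETURN s.name"
predicted = "MATCH (n:Sensor {type: 'acoustic'}) RETURN n.name AS name"

# Rows each query might return from the database (made-up values).
reference_rows = [("A-12",), ("B-07",)]
predicted_rows = [("B-07",), ("A-12",)]

print(exact_match(predicted, reference), result_match(predicted_rows, reference_rows))
```

Under the strict criterion the prediction counts as wrong, while under the relaxed one it counts as correct, since the downstream summary would be built from the same rows.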

[210] Direct Behavior Optimization: Unlocking the Potential of Lightweight LLMs

Hongming Yang,Shi Lin,Jun Shao,Changting Lin,Donghai Zhu,Meng Han,Qinglei Kong

Main category: cs.CL

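TL;DR: DeBoP is a direct behavior optimization paradigm that optimizes the behavior of lightweight large language models (LwLLMs) with a gradient-free Monte Carlo Tree Search, significantly improving their performance and surpassing GPT-3.5 on most tasks.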

Details

Motivation: Lightweight large language models (LwLLMs) offer advantages in resource efficiency and privacy but have limited reasoning ability, and existing prompt optimization methods work poorly for them. Method: DeBoP turns complex prompt optimization into the optimization of discrete, quantifiable execution sequences, using a gradient-free Monte Carlo Tree Search. Result: DeBoP significantly outperforms recent prompt optimization methods on most tasks; the optimized LwLLMs surpass GPT-3.5 on most tasks, and computational time is reduced by about 60%. Conclusion: DeBoP provides an efficient automatic optimization method for LwLLMs and substantially improves their practical usefulness. Abstract: Lightweight Large Language Models (LwLLMs) are reduced-parameter, optimized models designed to run efficiently on consumer-grade hardware, offering significant advantages in resource efficiency, cost-effectiveness, and data privacy. However, these models often struggle with limited inference and reasoning capabilities, which restrict their performance on complex tasks and limit their practical applicability. Moreover, existing prompt optimization methods typically rely on extensive manual effort or the meta-cognitive abilities of state-of-the-art LLMs, making them less effective for LwLLMs. To address these challenges, we introduce DeBoP, a new Direct Behavior Optimization Paradigm, derived from the Chain-of-Thought (CoT) prompting technique. Unlike CoT Prompting, DeBoP is an automatic optimization method, which focuses on optimizing the behavior of LwLLMs directly. In particular, DeBoP transforms the optimization of complex prompts into the optimization of discrete, quantifiable execution sequences using a gradient-free Monte Carlo Tree Search. We evaluate DeBoP on seven challenging tasks where state-of-the-art LLMs excel but LwLLMs generally underperform. Experimental results demonstrate that DeBoP significantly outperforms recent prompt optimization methods on most tasks. In particular, DeBoP-optimized LwLLMs surpass GPT-3.5 on most tasks while reducing computational time by approximately 60% compared to other automatic prompt optimization methods.
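Monte Carlo Tree Search needs a rule for choosing which branch to explore next, and it needs no gradients because it only uses observed scores. A common choice is UCB1, sketched below. Whether DeBoP uses UCB1 specifically is not stated in the abstract, so treat the rule, the step names, and the statistics as illustrative assumptions.

```python
import math


def ucb1(total_value, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score: average value plus an exploration bonus for rarely tried nodes."""
    if visits == 0:
        return math.inf  # always try an unvisited child first
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children):
    """Pick the child with the highest UCB1 score.

    `children` is a list of dicts with keys 'name', 'value', and 'visits'.
    """
    parent_visits = sum(child["visits"] for child in children)
    return max(children, key=lambda ch: ucb1(ch["value"], ch["visits"], parent_visits))


# Hypothetical candidate execution steps with accumulated task scores.
children = [
    {"name": "decompose", "value": 6.0, "visits": 10},
    {"name": "verify", "value": 0.9, "visits": 1},
]
print(select_child(children)["name"])
```

In this example "verify" is selected even though it has been tried only once, because the exploration bonus outweighs the better-established average of "decompose".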

[211] Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights

Sooyung Choi,Jaehyeok Lee,Xiaoyuan Yi,Jing Yao,Xing Xie,JinYeong Bak

Main category: cs.CL

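TL;DR: The study finds that large language models (LLMs) aligned with human values are more prone to harmful behavior and carry higher safety risks, because the content they generate in line with those values can amplify harmful outcomes.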

Details

Motivation: As LLM applications expand, demand grows for personalized models aligned with human values, but this also raises safety risks whose psychological mechanisms need study. Method: Dataset analysis, combined with psychological hypotheses, is used to examine the correlation between value alignment and safety risks. Result: Value-aligned LLMs are more prone to harmful behavior than non-fine-tuned models and show slightly higher risk than other fine-tuned models. Conclusion: The study sheds light on the "black box" of value alignment and proposes in-context alignment methods to improve safety. Abstract: The application scope of Large Language Models (LLMs) continues to expand, leading to increasing interest in personalized LLMs that align with human values. However, aligning these models with individual values raises significant safety concerns, as certain values may correlate with harmful information. In this paper, we identify specific safety risks associated with value-aligned LLMs and investigate the psychological principles behind these challenges. Our findings reveal two key insights. (1) Value-aligned LLMs are more prone to harmful behavior compared to non-fine-tuned models and exhibit slightly higher risks in traditional safety evaluations than other fine-tuned models. (2) These safety issues arise because value-aligned LLMs genuinely generate text according to the aligned values, which can amplify harmful outcomes. Using a dataset with detailed safety categories, we find significant correlations between value alignment and safety risks, supported by psychological hypotheses. This study offers insights into the "black box" of value alignment and proposes in-context alignment methods to enhance the safety of value-aligned LLMs.

[212] SMAR: Soft Modality-Aware Routing Strategy for MoE-based Multimodal Large Language Models Preserving Language Capabilities

Guoyang Xia,Yifeng Ding,Fengfa Li,Lei Ren,Chen Wei,Fangxiang Feng,Xiaojie Wang

Main category: cs.CL

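TL;DR: Proposes SMAR, a regularization technique that uses KL divergence to control routing probability distributions across modalities, balancing expert specialization against language capability and outperforming baselines in experiments.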

Details

Motivation: Existing approaches to building multimodal MoE models either incur high training costs or degrade language capabilities. Method: SMAR, a KL-divergence regularizer that controls routing probability distributions across modalities without modifying the model architecture or relying heavily on text data. Result: In visual instruction tuning experiments, SMAR retains 86.6% of language ability with only 2.5% pure text, outperforming baselines while maintaining multimodal performance. Conclusion: SMAR offers an efficient way for multimodal MoE models to balance modality differentiation and language capability. Abstract: Mixture of Experts (MoE) architectures have become a key approach for scaling large language models, with growing interest in extending them to multimodal tasks. Existing methods to build multimodal MoE models either incur high training costs or suffer from degraded language capabilities when adapting pretrained models. To address this, we propose Soft Modality-Aware Routing (SMAR), a novel regularization technique that uses Kullback-Leibler divergence to control routing probability distributions across modalities, encouraging expert specialization without modifying model architecture or heavily relying on textual data. Experiments on visual instruction tuning show that SMAR preserves language ability at 86.6% retention with only 2.5% pure text, outperforming baselines while maintaining strong multimodal performance. Our approach offers a practical and efficient solution to balance modality differentiation and language capabilities in multimodal MoE models.
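
The abstract does not spell out the SMAR loss, so the following is only a rough sketch of the general idea: measure the KL divergence between the average expert-routing distribution of text tokens and that of image tokens, and penalize it when it leaves a target band. The band `[low, high]`, the function names, and the linear penalty shape are all assumptions made for illustration, not the authors' formulation.

```python
import math

def mean_distribution(routing_probs):
    """Average per-token routing probabilities into one distribution over experts."""
    n_experts = len(routing_probs[0])
    return [sum(tok[e] for tok in routing_probs) / len(routing_probs)
            for e in range(n_experts)]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same experts."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def soft_modality_penalty(text_probs, image_probs, low, high):
    """Hypothetical regularizer: zero while the KL stays in [low, high], linear outside."""
    kl = kl_divergence(mean_distribution(text_probs),
                       mean_distribution(image_probs))
    if kl < low:      # modalities routed too similarly: push them apart
        return low - kl
    if kl > high:     # modalities routed too differently: pull them together
        return kl - high
    return 0.0

# Toy routing probabilities over three experts (made-up numbers).
text_tokens = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
image_tokens = [[0.1, 0.2, 0.7]]
penalty = soft_modality_penalty(text_tokens, image_tokens, low=0.1, high=2.0)
```

In a real training loop a term like this would be added to the task loss with a small weight and computed on differentiable tensors; plain lists are used here only to keep the sketch dependency-free.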

[213] Canonical Autoregressive Generation

Ivi Chatzi,Nina Corvelo Benz,Stratis Tsirtsis,Manuel Gomez-Rodriguez

Main category: cs.CL

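TL;DR: Proposes "canonical sampling," a method that prevents large language models from generating non-canonical token sequences, and proves that the resulting token sequence distribution is closer to the true distribution of the training data.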

Details

Motivation: Large language models can generate non-canonical token sequences at inference time, which has several negative consequences; the paper sets out to fix this. Method: Building on a theoretical analysis, introduces canonical sampling, a simple and efficient sampling method that guarantees the model generates canonical token sequences. Result: The distribution of token sequences produced by canonical sampling is provably closer to the true training distribution than that of standard sampling. Conclusion: Canonical sampling is an effective remedy that makes generated sequences canonical and brings their distribution closer to the training data. Abstract: State-of-the-art large language models are trained using large amounts of tokens derived from raw text using what is called a tokenizer. Crucially, the tokenizer determines the (token) vocabulary a model will use during inference as well as, in principle, the (token) language. This is because, while the token vocabulary may allow for different tokenizations of a string, the tokenizer always maps the string to only one of these tokenizations--the canonical tokenization. However, multiple lines of empirical evidence suggest that large language models do not always generate canonical token sequences, and this comes with several negative consequences. In this work, we first show that, to generate a canonical token sequence, a model needs to generate (partial) canonical token sequences at each step of the autoregressive generation process underpinning its functioning. Building upon this theoretical result, we introduce canonical sampling, a simple and efficient sampling method that precludes a given model from generating non-canonical token sequences. Further, we also show that, in comparison with standard sampling, the distribution of token sequences generated using canonical sampling is provably closer to the true distribution of token sequences used during training.
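
To make the notion of a canonical token sequence concrete, here is a minimal sketch. It assumes a toy greedy longest-match tokenizer in place of a real BPE tokenizer, and it assumes that canonical sampling amounts to masking next tokens that would make the partial sequence non-canonical and renormalizing the rest; the abstract does not give the exact procedure, so treat both as illustrative assumptions.

```python
def tokenize(text, vocab):
    """Toy greedy longest-match tokenizer (an assumption standing in for BPE)."""
    tokens, i = [], 0
    max_len = max(len(t) for t in vocab)
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            raise ValueError(f"cannot tokenize {text[i:]!r}")
    return tokens

def is_canonical(tokens, vocab):
    """A sequence is canonical if re-tokenizing its decoded string reproduces it."""
    return tokenize("".join(tokens), vocab) == list(tokens)

def canonical_next_tokens(prefix, vocab):
    """Tokens that keep the partial sequence canonical when appended to prefix."""
    return sorted(t for t in vocab if is_canonical(list(prefix) + [t], vocab))

def canonical_renormalize(prefix, next_probs, vocab):
    """Mask non-canonical continuations and renormalize the remaining mass."""
    allowed = canonical_next_tokens(prefix, vocab)
    total = sum(next_probs[t] for t in allowed)
    return {t: next_probs[t] / total for t in allowed}

vocab = {"a", "b", "ab"}
# After "a", emitting "b" would spell "ab", which the tokenizer encodes as the
# single token "ab", so ["a", "b"] is non-canonical and "b" gets masked.
allowed = canonical_next_tokens(["a"], vocab)          # ['a', 'ab']
probs = canonical_renormalize(["a"], {"a": 0.5, "b": 0.3, "ab": 0.2}, vocab)
```

The check re-tokenizes the whole prefix at every step, which is quadratic in sequence length; the paper describes its method as efficient, so it presumably avoids this cost in ways the sketch does not attempt.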

[214] What Is Seen Cannot Be Unseen: The Disruptive Effect of Knowledge Conflict on Large Language Models

Kaiser Sun,Fan Bai,Mark Dredze

Main category: cs.CL

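TL;DR: Proposes a diagnostic framework for evaluating LLMs when contextual input conflicts with parametric knowledge. Conflict has minimal impact on tasks that do not require knowledge, models cannot fully suppress their internal knowledge, and providing rationales increases reliance on context.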

Details

Motivation: Study how large language models behave when contextual information conflicts with their parametric knowledge, in order to assess their performance and potential problems. Method: Constructs diagnostic data that elicit such conflicts and analyzes model performance across multiple task types. Result: Conflict has little impact on tasks that do not require knowledge; performance is higher when context and parametric knowledge agree; models cannot fully suppress internal knowledge even when instructed; providing rationales increases reliance on context. Conclusion: Knowledge conflict challenges the validity of model-based evaluation and must be accounted for in deployment. Abstract: Large language models frequently rely on both contextual input and parametric knowledge to perform tasks. However, these sources can come into conflict, especially when retrieved documents contradict the model's parametric knowledge. We propose a diagnostic framework to systematically evaluate LLM behavior under context-memory conflict, where the contextual information diverges from their parametric beliefs. We construct diagnostic data that elicit these conflicts and analyze model performance across multiple task types. Our findings reveal that (1) knowledge conflict has minimal impact on tasks that do not require knowledge utilization, (2) model performance is consistently higher when contextual and parametric knowledge are aligned, (3) models are unable to fully suppress their internal knowledge even when instructed, and (4) providing rationales that explain the conflict increases reliance on contexts. These insights raise concerns about the validity of model-based evaluation and underscore the need to account for knowledge conflict in the deployment of LLMs.

[215] Improving LLM-Powered EDA Assistants with RAFT

Luyao Shi,Michael Kazda,Charles Schmitter,Hemlata Gupta

Main category: cs.CL

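TL;DR: Proposes enhancing LLMs with Retrieval-Augmented Fine-Tuning (RAFT) on synthetic Q/A datasets to address the scarcity of labeled data in EDA, significantly improving performance on RAG tasks, and examines data security and leakage risks.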

Details

Motivation: Electronic design engineers struggle to access relevant information efficiently in EDA tasks, while pre-trained open-source LLMs lack domain knowledge and RAG can still produce inaccurate answers. Method: Applies RAFT with synthetic Q/A datasets to improve LLM performance, studies the effect of using real user questions as Retrieval-Augmented Few-Shot (RAFS) examples for synthetic data generation, and implements secure access control. Result: RAFT with synthetic data significantly boosts LLM performance on RAG-based EDA tasks, and the study yields practical insights on data leakage and memorization risks. Conclusion: Synthetic data combined with RAFT is an effective way to improve LLMs in the EDA domain, provided data security and leakage risks are kept in view. Abstract: Electronic design engineers often struggle to efficiently access relevant information for tasks like design verification and technology development. While large language models (LLMs) can enhance productivity as conversational agents, pre-trained open-source LLMs lack domain-specific knowledge for Electronic Design Automation (EDA). In a Retrieval-Augmented Generation (RAG) context, LLMs rely on external context but may still produce inaccurate responses. Retrieval-Augmented Fine-Tuning (RAFT) improves LLM performance, but acquiring labeled question/answer (Q/A) data in EDA is difficult. To address this, we propose using synthetic Q/A datasets to enhance LLMs with RAFT. Our results show that RAFT with synthetic data significantly boosts LLM performance for RAG-based EDA tasks. We also investigate the impact of using real user questions as Retrieval-Augmented Few-Shot (RAFS) examples for synthetic data generation. Additionally, we implement secure access control to ensure sensitive information is only accessible to authorized personnel. Finally, we assess the risk of data leakage and unintended memorization during fine-tuning with synthetic data, providing practical insights.

[216] Biases Propagate in Encoder-based Vision-Language Models: A Systematic Analysis From Intrinsic Measures to Zero-shot Retrieval Outcomes

Kshitish Ghate,Tessa Charlesworth,Mona Diab,Aylin Caliskan

Main category: cs.CL

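TL;DR: The study finds that social-group biases in vision-language models (VLMs) propagate systematically into zero-shot retrieval tasks, and that larger, better-performing models show stronger bias propagation.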

Details

Motivation: Understand how social-group biases intrinsic to VLMs affect downstream tasks, in order to build fairer AI systems. Method: Introduces a framework that measures bias propagation by correlating intrinsic bias in the VLM representation space with extrinsic bias in zero-shot retrieval. Result: Intrinsic and extrinsic bias are strongly correlated (average ρ = 0.83 ± 0.10), and larger, better-performing models exhibit greater bias propagation. Conclusion: The study reveals systematic propagation of bias in VLMs and calls for more rigorous bias evaluation of complex AI models. Abstract: To build fair AI systems we need to understand how social-group biases intrinsic to foundational encoder-based vision-language models (VLMs) manifest in biases in downstream tasks. In this study, we demonstrate that intrinsic biases in VLM representations systematically "carry over" or propagate into zero-shot retrieval tasks, revealing how deeply rooted biases shape a model's outputs. We introduce a controlled framework to measure this propagation by correlating (a) intrinsic measures of bias in the representational space with (b) extrinsic measures of bias in zero-shot text-to-image (TTI) and image-to-text (ITT) retrieval. Results show substantial correlations between intrinsic and extrinsic bias, with an average $\rho = 0.83 \pm 0.10$. This pattern is consistent across 114 analyses, both retrieval directions, six social groups, and three distinct VLMs. Notably, we find that larger/better-performing models exhibit greater bias propagation, a finding that raises concerns given the trend towards increasingly complex AI models. Our framework introduces baseline evaluation tasks to measure the propagation of group and valence signals. Investigations reveal that underrepresented groups experience less robust propagation, further skewing their model-related outcomes.
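
The propagation measure above is a correlation between paired intrinsic and extrinsic bias scores. The sketch below computes a rank correlation for such pairs, assuming the reported ρ is Spearman's coefficient (the abstract does not say which coefficient is used) and using made-up scores purely for illustration.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical bias scores, one pair per (group, attribute) analysis.
intrinsic = [0.10, 0.40, 0.20, 0.90]
extrinsic = [0.20, 0.50, 0.30, 0.80]
rho = spearman_rho(intrinsic, extrinsic)  # same ordering in both lists, so rho is 1
```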

[217] Fixing It in Post: A Comparative Study of LLM Post-Training Data Quality and Model Performance

Aladin Djuhera,Swanand Ravindra Kadhe,Syed Zawad,Farhan Ahmed,Heiko Ludwig,Holger Boche

Main category: cs.CL

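TL;DR: Presents the first comprehensive side-by-side analysis of two open post-training datasets, Tulu-3-SFT-Mix and SmolTalk, and proposes a new data mixture, TuluTalk, that matches or exceeds the source datasets.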

Details

Motivation: Post-training datasets behind leading LLMs lack transparency, and although open alternatives achieve comparable performance, systematic comparisons are missing. Method: Uses the Magpie framework to annotate each sample with quality metrics, analyzes structural differences between the datasets, and designs a new curation recipe. Result: TuluTalk contains 14% fewer samples than either source dataset while matching or exceeding their performance on key benchmarks. Conclusion: The study offers practical guidance for building post-training datasets and releases the annotated data and TuluTalk. Abstract: Recent work on large language models (LLMs) has increasingly focused on post-training and alignment with datasets curated to enhance instruction following, world knowledge, and specialized skills. However, most post-training datasets used in leading open- and closed-source LLMs remain inaccessible to the public, with limited information about their construction process. This lack of transparency has motivated the recent development of open-source post-training corpora. While training on these open alternatives can yield performance comparable to that of leading models, systematic comparisons remain challenging due to the significant computational cost of conducting them rigorously at scale, and are therefore largely absent. As a result, it remains unclear how specific samples, task types, or curation strategies influence downstream performance when assessing data quality. In this work, we conduct the first comprehensive side-by-side analysis of two prominent open post-training datasets: Tulu-3-SFT-Mix and SmolTalk. Using the Magpie framework, we annotate each sample with detailed quality metrics, including turn structure (single-turn vs. multi-turn), task category, input quality, and response quality, and we derive statistics that reveal structural and qualitative similarities and differences between the two datasets. Based on these insights, we design a principled curation recipe that produces a new data mixture, TuluTalk, which contains 14% fewer samples than either source dataset while matching or exceeding their performance on key benchmarks. Our findings offer actionable insights for constructing more effective post-training datasets that improve model performance within practical resource limits. To support future research, we publicly release both the annotated source datasets and our curated TuluTalk mixture.

[218] Beyond Facts: Evaluating Intent Hallucination in Large Language Models

Yijie Hao,Haofei Yu,Jiaxuan You

Main category: cs.CL

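TL;DR: Introduces the concept of intent hallucination, where large language models (LLMs) omit or misinterpret some conditions of a complex query, along with the FAITHQA benchmark and the CONSTRAINT SCORE evaluation metric.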

Details

Motivation: LLMs often omit or misinterpret some conditions in complex queries, producing intent hallucination that calls for systematic evaluation. Method: Introduces the FAITHQA benchmark (20,068 problems) and the CONSTRAINT SCORE metric, covering query-only and retrieval-augmented generation (RAG) setups. Result: Intent hallucination is common even in state-of-the-art LLMs and stems from omitted or misinterpreted conditions; CONSTRAINT SCORE is closer to human judgment than baselines. Conclusion: FAITHQA and CONSTRAINT SCORE provide tools for studying intent hallucination and expose a limitation of current LLMs. Abstract: When exposed to complex queries containing multiple conditions, today's large language models (LLMs) tend to produce responses that only partially satisfy the query while neglecting certain conditions. We therefore introduce the concept of Intent Hallucination. In this phenomenon, LLMs either omit (neglecting to address certain parts) or misinterpret (responding to invented query parts) elements of the given query, leading to intent hallucinated generation. To systematically evaluate intent hallucination, we introduce FAITHQA, a novel benchmark for intent hallucination that contains 20,068 problems, covering both query-only and retrieval-augmented generation (RAG) setups with varying topics and difficulty. FAITHQA is the first hallucination benchmark that goes beyond factual verification, tailored to identify the fundamental cause of intent hallucination. By evaluating various LLMs on FAITHQA, we find that (1) intent hallucination is a common issue even for state-of-the-art models, and (2) the phenomenon stems from omission or misinterpretation by LLMs. To facilitate future research, we introduce an automatic LLM generation evaluation metric, CONSTRAINT SCORE, for detecting intent hallucination. Human evaluation results demonstrate that CONSTRAINT SCORE is closer to human performance for intent hallucination compared to baselines.

[219] LaMP-Cap: Personalized Figure Caption Generation With Multimodal Figure Profiles

Ho Yin 'Sam' Ng,Ting-Yao Hsu,Aashish Anantha Ramakrishnan,Branislav Kveton,Nedim Lipka,Franck Dernoncourt,Dongwon Lee,Tong Yu,Sungchul Kim,Ryan A. Rossi,Ting-Hao 'Kenneth' Huang

Main category: cs.CL

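TL;DR: LaMP-Cap is a dataset for personalized figure caption generation with multimodal figure profiles; combining image and text context helps generate captions closer to the author's own.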

Details

Motivation: Authors usually have to revise AI-generated captions to match their style and domain conventions, which highlights the need for personalization; yet existing techniques focus on text-only settings and overlook multimodal inputs and profiles. Method: For each target figure, LaMP-Cap provides a multimodal profile (images, captions, and figure-mentioning paragraphs of other figures from the same document); experiments use four LLMs. Result: Profile information consistently improves caption quality, with profile images helping more than figure-mentioning paragraphs. Conclusion: Multimodal profiles outperform text-only profiles, opening a new direction for personalized caption generation. Abstract: Figure captions are crucial for helping readers understand and remember a figure's key message. Many models have been developed to generate these captions, helping authors compose better quality captions more easily. Yet, authors almost always need to revise generic AI-generated captions to match their writing style and the domain's style, highlighting the need for personalization. Despite language models' personalization (LaMP) advances, these technologies often focus on text-only settings and rarely address scenarios where both inputs and profiles are multimodal. This paper introduces LaMP-Cap, a dataset for personalized figure caption generation with multimodal figure profiles. For each target figure, LaMP-Cap provides not only the needed inputs, such as figure images, but also up to three other figures from the same document--each with its image, caption, and figure-mentioning paragraphs--as a profile to characterize the context. Experiments with four LLMs show that using profile information consistently helps generate captions closer to the original author-written ones. Ablation studies reveal that images in the profile are more helpful than figure-mentioning paragraphs, highlighting the advantage of using multimodal profiles over text-only ones.

[220] Precise Information Control in Long-Form Text Generation

Jacqueline He,Howard Yen,Margaret Li,Shuyue Stella Li,Zhiyuan Zeng,Weijia Shi,Yulia Tsvetkov,Danqi Chen,Pang Wei Koh,Luke Zettlemoyer

Main category: cs.CL

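TL;DR: Proposes the Precise Information Control (PIC) task to study intrinsic hallucination in language models, evaluates models on PIC-Bench, and introduces a post-training framework that yields PIC-LM, which substantially improves generation faithfulness.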

Details

Motivation: Modern language models suffer from intrinsic hallucination, generating information that is plausible but unsupported by the input; the work evaluates this with the PIC task and PIC-Bench and proposes a remedy. Method: Defines the PIC task, which requires long-form output grounded in a set of short verifiable claims; builds the PIC-Bench benchmark; develops a post-training framework that trains PIC-LM on weakly supervised preference data. Result: State-of-the-art models hallucinate in over 70% of outputs on PIC-Bench; PIC-LM raises F1 in the full PIC setting from 69.1% to 91.0% and improves accuracy in downstream factual pipelines. Conclusion: PIC-LM reduces hallucination and improves the faithfulness of generated text, showing the potential of precisely grounded generation. Abstract: A central challenge in modern language models (LMs) is intrinsic hallucination: the generation of information that is plausible but unsubstantiated relative to input context. To study this problem, we propose Precise Information Control (PIC), a new task formulation that requires models to generate long-form outputs grounded in a provided set of short self-contained statements, known as verifiable claims, without adding any unsupported ones. For comprehensiveness, PIC includes a full setting that tests a model's ability to include exactly all input claims, and a partial setting that requires the model to selectively incorporate only relevant claims. We present PIC-Bench, a benchmark of eight long-form generation tasks (e.g., summarization, biography generation) adapted to the PIC setting, where LMs are supplied with well-formed, verifiable input claims. Our evaluation of a range of open and proprietary LMs on PIC-Bench reveals that, surprisingly, state-of-the-art LMs still intrinsically hallucinate in over 70% of outputs. To alleviate this lack of faithfulness, we introduce a post-training framework, using a weakly supervised preference data construction method, to train an 8B PIC-LM with stronger PIC ability--improving from 69.1% to 91.0% F1 in the full PIC setting. When integrated into end-to-end factual generation pipelines, PIC-LM improves exact match recall by 17.1% on ambiguous QA with retrieval, and factual precision by 30.5% on a birthplace verification task, underscoring the potential of precisely grounded generation.
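
One plausible reading of the F1 score in the full PIC setting is a claim-level F1: precision is the share of output claims supported by the input, and recall is the share of input claims that appear in the output. The sketch below assumes that reading and replaces the claim verifier with exact string matching; both are simplifications for illustration, not the benchmark's actual scoring code.

```python
def pic_scores(input_claims, output_claims):
    """Claim-level precision, recall, and F1 (exact match stands in for a verifier)."""
    given, produced = set(input_claims), set(output_claims)
    supported = given & produced
    precision = len(supported) / len(produced) if produced else 0.0
    recall = len(supported) / len(given) if given else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical claims: the output drops one input claim and adds an unsupported one.
given = ["born in 1950", "studied physics", "moved to Paris", "won a prize"]
produced = ["born in 1950", "studied physics", "moved to Paris", "had two children"]
precision, recall, f1 = pic_scores(given, produced)  # 0.75, 0.75, 0.75
```

In the full setting both errors count: the unsupported claim lowers precision (an intrinsic hallucination), and the dropped claim lowers recall.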

[221] MedCite: Can Language Models Generate Verifiable Text for Medicine?

Xiao Wang,Mengjue Tan,Qiao Jin,Guangzhi Xiong,Yu Hu,Aidong Zhang,Zhiyong Lu,Minjia Zhang

Main category: cs.CL

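TL;DR: Presents the first end-to-end framework for designing and evaluating citation generation with LLMs for medical tasks, together with a multi-pass retrieval-citation method that markedly improves citation quality.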

Details

Motivation: Existing LLM-based medical question-answering systems lack citation generation and evaluation capabilities, which limits their practical adoption. Method: Proposes a multi-pass retrieval-citation method and an end-to-end framework supporting citation generation and evaluation. Result: The method improves citation precision and recall over strong baselines, and the evaluation results correlate well with expert annotations. Conclusion: The study highlights the challenges and opportunities of citation generation for medical tasks and shows that design choices significantly affect final citation quality. Abstract: Existing LLM-based medical question-answering systems lack citation generation and evaluation capabilities, raising concerns about their adoption in practice. In this work, we introduce MedCite, the first end-to-end framework that facilitates the design and evaluation of citation generation with LLMs for medical tasks. Meanwhile, we introduce a novel multi-pass retrieval-citation method that generates high-quality citations. Our evaluation highlights the challenges and opportunities of citation generation for medical tasks, while identifying important design choices that have a significant impact on the final citation quality. Our proposed method achieves superior citation precision and recall improvements compared to strong baseline methods, and we show that evaluation results correlate well with annotation results from professional experts.

[222] Training-Free Tokenizer Transplantation via Orthogonal Matching Pursuit

Charles Goddard,Fernando Fernandes Neto

Main category: cs.CL

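TL;DR: Proposes a training-free method for transplanting tokenizers in pretrained large language models (LLMs) by reconstructing unseen token embeddings with Orthogonal Matching Pursuit (OMP).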

Details

Motivation: Bridge discrepancies between tokenizers without gradient updates, so pretrained model weights can be reused directly. Method: Uses Orthogonal Matching Pursuit (OMP) to express each new token as a sparse linear combination of shared tokens, transplanting embeddings in two phases. Result: On cross-tokenizer tasks, OMP performs best and clearly outperforms other zero-shot methods, with no gradient updates. Conclusion: OMP effectively bridges tokenizer discrepancies and supports applications such as cross-tokenizer knowledge distillation. Abstract: We present a training-free method to transplant tokenizers in pretrained large language models (LLMs) by reconstructing unseen token embeddings via Orthogonal Matching Pursuit (OMP). Specifically, we approximate each out-of-vocabulary token as a sparse linear combination of shared tokens, in two phases: first, compute each new token's representation in the donor embedding space with a small dictionary of shared anchor tokens, then transfer these same sparse coefficients back into the base model's embedding space. On two challenging cross-tokenizer tasks--Llama$\to$Mistral NeMo (12B) and Qwen$\to$Llama (1B)--we show that OMP achieves best zero-shot preservation of the base model's performance across multiple benchmarks, while other zero-shot approaches degrade significantly. Compared to baselines (zero-init, mean-init, and existing approaches like WECHSEL, FOCUS, ZETT), OMP consistently achieves the best overall performance, effectively bridging large tokenizer discrepancies without gradient updates. Our analysis further identifies mismatched numerical tokenization schemes as a critical challenge for preserving mathematical reasoning capabilities. This technique enables direct reuse of pretrained model weights with new tokenizers, facilitating cross-tokenizer knowledge distillation, speculative decoding, ensembling, merging, and domain-specific vocabulary adaptations. We integrate our method into the open-source mergekit-tokensurgeon tool for post hoc vocabulary realignment.
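
The two-phase procedure in the abstract can be sketched in a few lines: run OMP in the donor embedding space to express a new token over shared anchor tokens, then reuse the same coefficients on the anchors' base-model embeddings. This is a dependency-free toy version with tiny made-up vectors; the helper names and the normal-equation solver are illustrative choices and do not reflect the mergekit-tokensurgeon implementation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve_least_squares(columns, target):
    """Least-squares coefficients via normal equations and Gauss-Jordan elimination."""
    k = len(columns)
    aug = [[dot(columns[i], columns[j]) for j in range(k)] + [dot(columns[i], target)]
           for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]

def omp(dictionary, target, sparsity):
    """Greedy OMP: returns {atom_index: coefficient} with at most `sparsity` entries."""
    residual, chosen, coeffs = list(target), [], []
    for _ in range(sparsity):
        # Score each unused atom by its normalized correlation with the residual.
        scores = [-1.0 if i in chosen else abs(dot(atom, residual)) / math.sqrt(dot(atom, atom))
                  for i, atom in enumerate(dictionary)]
        best = max(range(len(dictionary)), key=scores.__getitem__)
        if scores[best] <= 1e-12:
            break  # residual is already explained
        chosen.append(best)
        # Re-fit all chosen coefficients jointly (the "orthogonal" step).
        coeffs = solve_least_squares([dictionary[i] for i in chosen], target)
        approx = [sum(c * dictionary[i][d] for c, i in zip(coeffs, chosen))
                  for d in range(len(target))]
        residual = [t - a for t, a in zip(target, approx)]
    return dict(zip(chosen, coeffs))

def transplant(donor_anchors, base_anchors, donor_new_token, sparsity):
    """Phase 1: sparse code in donor space. Phase 2: same coefficients in base space."""
    coeffs = omp(donor_anchors, donor_new_token, sparsity)
    dim = len(base_anchors[0])
    return [sum(c * base_anchors[i][d] for i, c in coeffs.items()) for d in range(dim)]

# Three shared anchor tokens with toy embeddings in each model.
donor_anchors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
base_anchors = [[1.0, 1.0], [0.0, 5.0], [2.0, 0.0]]
new_embedding = transplant(donor_anchors, base_anchors, [2.0, 0.0, -1.0], sparsity=2)
```

The two embedding spaces may have different dimensions (3 and 2 here); only the anchor indices and coefficients cross between them.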

[223] Transferring Features Across Language Models With Model Stitching

Alan Chen,Jack Merullo,Alessandro Stolfo,Ellie Pavlick

Main category: cs.CL

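TL;DR: Shows that affine mappings between the residual streams of language models transfer features cheaply, applies this to move Sparse Autoencoder (SAE) weights across models, and finds that small and large models learn highly similar representation spaces, which saves compute.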

Details

Motivation: Study how similar the representation spaces of small and large models are, and whether transferring features can cut the compute cost of training Sparse Autoencoders (SAEs). Method: Uses affine mappings to transfer SAE weights, compares the representation spaces of models of different sizes, and analyzes feature-level transferability. Result: Small and large models learn highly similar representation spaces; initializing from a transferred SAE can make training runs 50% cheaper; semantic and structural features transfer differently, while the roles of certain functional features are mapped faithfully. Conclusion: The study reveals similarities and differences in the linear representation spaces of small and large models and offers a method for more efficient SAE training. Abstract: In this work, we demonstrate that affine mappings between residual streams of language models are a cheap way to effectively transfer represented features between models. We apply this technique to transfer the weights of Sparse Autoencoders (SAEs) between models of different sizes to compare their representations. We find that small and large models learn highly similar representation spaces, which motivates training expensive components like SAEs on a smaller model and transferring to a larger model at a FLOPs savings. For example, using a small-to-large transferred SAE as initialization can lead to 50% cheaper training runs when training SAEs on larger models. Next, we show that transferred probes and steering vectors can effectively recover ground truth performance. Finally, we dive deeper into feature-level transferability, finding that semantic and structural features transfer noticeably differently while specific classes of functional features have their roles faithfully mapped. Overall, our findings illustrate similarities and differences in the linear representation spaces of small and large models and demonstrate a method for improving the training efficiency of SAEs.

Samuel Kim,Oghenemaro Imieye,Yunting Yin

Main category: cs.CL

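TL;DR: Compares large language models (LLMs) and traditional machine learning classifiers for detecting depressive language in social media data. Zero-shot LLMs do well on binary classification but struggle with fine-grained classification, while classifiers trained on LLM-generated summary embeddings perform better.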

Details

Motivation: Accurate and interpretable detection of depressive language matters for early mental health intervention and can support clinical practice and public health. Method: Compares zero-shot LLMs with supervised classifiers on three tasks: binary depression classification, depression severity classification, and differential diagnosis among depression, PTSD, and anxiety. Result: Zero-shot LLMs generalize well in binary classification but perform poorly on fine-grained ordinal classification; classifiers trained on LLM summary embeddings are competitive and in some cases superior. Conclusion: LLMs show promise for mental health prediction, with room to improve through better use of zero-shot capabilities and context-aware summarization. Abstract: Accurate and interpretable detection of depressive language in social media is useful for early interventions of mental health conditions, and has important implications for both clinical practice and broader public health efforts. In this paper, we investigate the performance of large language models (LLMs) and traditional machine learning classifiers across three classification tasks involving social media data: binary depression classification, depression severity classification, and differential diagnosis classification among depression, PTSD, and anxiety. Our study compares zero-shot LLMs with supervised classifiers trained on both conventional text embeddings and LLM-generated summary embeddings. Our experiments reveal that while zero-shot LLMs demonstrate strong generalization capabilities in binary classification, they struggle with fine-grained ordinal classifications. In contrast, classifiers trained on summary embeddings generated by LLMs demonstrate competitive, and in some cases superior, performance on the classification tasks, particularly when compared to models using traditional text embeddings. Our findings demonstrate the strengths of LLMs in mental health prediction, and suggest promising directions for better utilization of their zero-shot capabilities and context-aware summarization techniques.

[225] BriefMe: A Legal NLP Benchmark for Assisting with Legal Briefs

Jesse Woo,Fateme Hashemi Chaleshtori,Ana Marasović,Kenneth Marino

Main category: cs.CL

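TL;DR: Introduces BRIEFME, a dataset for evaluating language models on legal brief writing through summarization, completion, and case retrieval tasks. Current models do well on some tasks but fall short on others.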

Details

Motivation: Explore the under-studied area of writing and editing legal briefs, and assess how useful language models are in real legal work. Method: Builds the BRIEFME dataset with three tasks (argument summarization, argument completion, and case retrieval) and analyzes how current language models perform. Result: Current LLMs do well on summarization and guided completion but poorly on realistic argument completion and case retrieval. Conclusion: BRIEFME should encourage progress in Legal NLP and help legal professionals work more efficiently. Abstract: A core part of legal work that has been under-explored in Legal NLP is the writing and editing of legal briefs. This requires not only a thorough understanding of the law of a jurisdiction, from judgments to statutes, but also the ability to make new arguments to try to expand the law in a new direction and make novel and creative arguments that are persuasive to judges. To capture and evaluate these legal skills in language models, we introduce BRIEFME, a new dataset focused on legal briefs. It contains three tasks for language models to assist legal professionals in writing briefs: argument summarization, argument completion, and case retrieval. In this work, we describe the creation of these tasks, analyze them, and show how current models perform. We see that today's large language models (LLMs) are already quite good at the summarization and guided completion tasks, even beating human-generated headings. Yet, they perform poorly on other tasks in our benchmark: realistic argument completion and retrieving relevant legal cases. We hope this dataset encourages more development in Legal NLP in ways that will specifically aid people in performing legal work.

[226] Psychological Counseling Cannot Be Achieved Overnight: Automated Psychological Counseling Through Multi-Session Conversations

Junzhe Wang,Bichen Wang,Xing Fu,Yixin Sun,Yanyan Zhao,Bing Qin

Main category: cs.CL

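TL;DR: Introduces MusPsy-Dataset, a multi-session psychological counseling dataset, and MusPsy-Model, addressing the single-session limitation of current LLM research; experiments show the model outperforms baselines across multiple sessions.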

Details

Motivation: Current LLM work on psychological counseling focuses on single sessions, whereas real counseling is a multi-session process that requires tracking and adapting to a client's progress. Method: Builds MusPsy-Dataset, a multi-session counseling dataset based on real client profiles, and develops MusPsy-Model, which tracks client progress over time. Result: MusPsy-Model outperforms baseline models across multiple sessions. Conclusion: The multi-session dataset and model offer a more realistic basis for applying LLMs to psychological counseling. Abstract: In recent years, Large Language Models (LLMs) have made significant progress in automated psychological counseling. However, current research focuses on single-session counseling, which doesn't represent real-world scenarios. In practice, psychological counseling is a process, not a one-time event, requiring sustained, multi-session engagement to progressively address clients' issues. To overcome this limitation, we introduce the Multi-Session Psychological Counseling Conversation Dataset (MusPsy-Dataset). Our MusPsy-Dataset is constructed using real client profiles from publicly available psychological case reports. It captures the dynamic arc of counseling, encompassing multiple progressive counseling conversations from the same client across different sessions. Leveraging our dataset, we also developed our MusPsy-Model, which aims to track client progress and adapt its counseling direction over time. Experiments show that our model performs better than baseline models across multiple sessions.

[227] SafeLawBench: Towards Safe Alignment of Large Language Models

Chuxue Cao,Han Zhu,Jiaming Ji,Qichao Sun,Zhenghao Zhu,Yinyu Wu,Juntao Dai,Yaodong Yang,Sirui Han,Yike Guo

Main category: cs.CL

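TL;DR: Proposes SafeLawBench, a benchmark that evaluates the safety of large language models from a legal perspective, and finds that current models perform poorly on its safety tasks.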

Details

Motivation: Existing safety benchmarks are subjective and lack definitive standards; the paper addresses this gap from a legal perspective. Method: Proposes the SafeLawBench benchmark, with multi-choice and open-domain QA tasks, and evaluates the safety and reasoning stability of 20 LLMs. Result: Leading models such as Claude-3.5-Sonnet and GPT-4o do not exceed 80.5% accuracy on the multi-choice tasks, and the average accuracy of the 20 models is 68.8%. Conclusion: Urges the community to prioritize LLM safety research and notes that a majority voting mechanism can improve model performance. Abstract: With the growing prevalence of large language models (LLMs), the safety of LLMs has raised significant concerns. However, there is still a lack of definitive standards for evaluating their safety due to the subjective nature of current safety benchmarks. To address this gap, we conducted the first exploration of LLMs' safety evaluation from a legal perspective by proposing the SafeLawBench benchmark. SafeLawBench categorizes safety risks into three levels based on legal standards, providing a systematic and comprehensive framework for evaluation. It comprises 24,860 multi-choice questions and 1,106 open-domain question-answering (QA) tasks. Our evaluation included 2 closed-source LLMs and 18 open-source LLMs using zero-shot and few-shot prompting, highlighting the safety features of each model. We also evaluated the LLMs' safety-related reasoning stability and refusal behavior. Additionally, we found that a majority voting mechanism can enhance model performance. Notably, even leading SOTA models like Claude-3.5-Sonnet and GPT-4o have not exceeded 80.5% accuracy in multi-choice tasks on SafeLawBench, while the average accuracy of 20 LLMs remains at 68.8%. We urge the community to prioritize research on the safety of LLMs.
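
The abstract mentions that majority voting can improve performance on multi-choice questions but does not describe the setup. A minimal sketch, assuming several sampled answers per question and a first-seen tie-break (both assumptions), might look like this:

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    top = max(counts.values())
    for answer in answers:
        if counts[answer] == top:
            return answer

def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# Hypothetical sampled answers for three multi-choice questions.
samples = [["B", "A", "B"], ["C", "C", "D"], ["A", "D", "D"]]
gold = ["B", "C", "A"]
voted = [majority_vote(s) for s in samples]   # ['B', 'C', 'D']
score = accuracy(voted, gold)                 # 2 of 3 correct
```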

[228] Quantile Regression with Large Language Models for Price Prediction

Nikhita Vedula,Dushyanta Dhyani,Laleh Jalali,Boris Oreshkin,Mohsen Bayati,Shervin Malmasi

Main category: cs.CL

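TL;DR: Investigates probabilistic regression with large language models (LLMs) and proposes a novel quantile regression approach that significantly improves prediction accuracy and distributional calibration.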

Details

Motivation: Existing approaches focus on point estimates and lack systematic comparison across methods, while text-to-distribution tasks such as price estimation need both nuanced text understanding and uncertainty quantification. Method: Proposes a quantile regression approach that lets LLMs produce full predictive distributions, evaluated on three price prediction datasets. Result: A Mistral-7B model fine-tuned with quantile heads outperforms traditional approaches in both prediction accuracy and distributional calibration. Conclusion: LLMs perform well on probabilistic regression, quantile regression markedly improves results, and the datasets are released to support future research. Abstract: Large Language Models (LLMs) have shown promise in structured prediction tasks, including regression, but existing approaches primarily focus on point estimates and lack systematic comparison across different methods. We investigate probabilistic regression using LLMs for unstructured inputs, addressing challenging text-to-distribution prediction tasks such as price estimation where both nuanced text understanding and uncertainty quantification are critical. We propose a novel quantile regression approach that enables LLMs to produce full predictive distributions, improving upon traditional point estimates. Through extensive experiments across three diverse price prediction datasets, we demonstrate that a Mistral-7B model fine-tuned with quantile heads significantly outperforms traditional approaches for both point and distributional estimations, as measured by three established metrics each for prediction accuracy and distributional calibration. Our systematic comparison of LLM approaches, model architectures, training approaches, and data scaling reveals that Mistral-7B consistently outperforms encoder architectures, embedding-based methods, and few-shot learning methods. Our experiments also reveal the effectiveness of LLM-assisted label correction in achieving human-level accuracy without systematic bias. Our curated datasets are made available at https://github.com/vnik18/llm-price-quantile-reg/ to support future research.
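
Quantile heads are usually trained with the pinball (quantile) loss, which penalizes under- and over-prediction asymmetrically. The abstract does not state the training objective, so the sketch below simply shows the standard loss under the assumption that something similar is used; the prices and quantile levels are made up.

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball loss for quantile level q in (0, 1)."""
    diff = y_true - y_pred
    # Under-prediction (diff > 0) costs q per unit, over-prediction costs 1 - q.
    return max(q * diff, (q - 1) * diff)

def mean_quantile_loss(y_true, quantile_preds):
    """Average pinball loss over a dict {quantile_level: predicted_value}."""
    losses = [pinball_loss(y_true, pred, q) for q, pred in quantile_preds.items()]
    return sum(losses) / len(losses)

# A hypothetical item priced at 100 with three predicted quantiles.
heads = {0.1: 80.0, 0.5: 95.0, 0.9: 130.0}
loss = mean_quantile_loss(100.0, heads)

# For q = 0.9, predicting too low is penalized nine times more than too high.
low = pinball_loss(100.0, 90.0, 0.9)    # about 9.0
high = pinball_loss(100.0, 110.0, 0.9)  # about 1.0
```

Minimizing this loss for each level pushes the corresponding head toward that quantile of the price distribution, so a set of heads together describes a predictive distribution, not just a single point estimate.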

[229] Learning Distribution-Wise Control in Representation Space for Language Models

Chunyuan Deng,Ruidi Chang,Hanjie Chen

Main category: cs.CL

TL;DR: Proposes distribution-wise interventions, which extend pointwise interventions and give better control over language model behavior.

Details

Motivation: Pointwise interventions (representation fine-tuning) are effective at controlling language model behavior but ignore the region surrounding the concept subspace. Method: With distribution-wise interventions, the model learns not only pointwise transformations but also the surrounding regions of the concept subspace. Result: Across eight commonsense reasoning and seven arithmetic reasoning benchmarks, distribution-wise interventions outperform pointwise interventions in controllability and robustness. Conclusion: Distribution-wise interventions provide a more comprehensive method for steering language models and enable finer-grained control. Abstract: Interventions in language models (LMs) are applied strategically to steer model behavior during the forward pass. Learnable interventions, also known as representation fine-tuning, aim to apply pointwise control within the concept subspace and have proven effective in altering high-level behaviors. In this work, we extend this approach to the distribution level, enabling the model to learn not only pointwise transformations but also the surrounding regions of the concept subspace. We demonstrate that these methods perform effectively in early layers, with larger standard deviations correlating strongly with improved performance. Across eight commonsense reasoning and seven arithmetic reasoning benchmarks, our distribution-wise interventions consistently outperform pointwise interventions in controllability and robustness. These results illustrate that distribution-wise interventions provide a more comprehensive method for steering model behavior and enabling finer-grained control over language models. The code is at: https://github.com/chili-lab/D-Intervention.

[230] Dynamic and Parametric Retrieval-Augmented Generation

Weihang Su,Qingyao Ai,Jingtao Zhan,Qian Dong,Yiqun Liu

Main category: cs.CL

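TL;DR: A tutorial on two emerging research directions, Dynamic RAG and Parametric RAG, which aim to overcome the limitations of conventional RAG on complex tasks.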

Details

Motivation: Conventional RAG systems use static retrieval and in-context knowledge injection, which falls short for multihop reasoning and adaptive information access. Method: Dynamic RAG adapts when and what to retrieve during generation; Parametric RAG moves from input-level to parameter-level knowledge injection. Result: Dynamic RAG enables real-time adaptation to the LLM's evolving information needs, and Parametric RAG targets greater efficiency and effectiveness. Conclusion: The tutorial provides theoretical foundations and practical insights for future RAG research. Abstract: Retrieval-Augmented Generation (RAG) has become a foundational paradigm for equipping large language models (LLMs) with external knowledge, playing a critical role in information retrieval and knowledge-intensive applications. However, conventional RAG systems typically adopt a static retrieve-then-generate pipeline and rely on in-context knowledge injection, which can be suboptimal for complex tasks that require multihop reasoning, adaptive information access, and deeper integration of external knowledge. Motivated by these limitations, the research community has moved beyond static retrieval and in-context knowledge injection. Among the emerging directions, this tutorial delves into two rapidly growing and complementary research areas on RAG: Dynamic RAG and Parametric RAG. Dynamic RAG adaptively determines when and what to retrieve during the LLM's generation process, enabling real-time adaptation to the LLM's evolving information needs. Parametric RAG rethinks how retrieved knowledge should be injected into LLMs, transitioning from input-level to parameter-level knowledge injection for enhanced efficiency and effectiveness. This tutorial offers a comprehensive overview of recent advances in these emerging research areas. It also shares theoretical foundations and practical insights to support and inspire further research in RAG.

[231] DivScore: Zero-Shot Detection of LLM-Generated Text in Specialized Domains

Zhihui Chen,Kai He,Yucheng Huang,Yunxiao Zhu,Mengling Feng

Main category: cs.CL

TL;DR: The paper proposes DivScore, a zero-shot detection framework that uses entropy-based scoring and domain knowledge distillation to identify LLM-generated text in specialized domains such as medicine and law, substantially outperforming existing methods.

Details

Motivation: Detecting LLM-generated text in specialized domains such as medicine and law is essential for preventing misinformation and ensuring authenticity, but existing zero-shot detectors perform poorly under domain shift. Method: DivScore combines normalized entropy-based scoring with domain knowledge distillation; the authors also release a benchmark covering the medical and legal domains. Result: DivScore achieves 14.4% higher AUROC and 64.0% higher recall (at a 0.1% false positive rate threshold) than state-of-the-art detectors, and it is more robust in adversarial settings. Conclusion: DivScore clearly outperforms existing detectors in specialized domains and offers an effective approach to detecting LLM-generated text. Abstract: Detecting LLM-generated text in specialized and high-stakes domains like medicine and law is crucial for combating misinformation and ensuring authenticity. However, current zero-shot detectors, while effective on general text, often fail when applied to specialized content due to domain shift. We provide a theoretical analysis showing this failure is fundamentally linked to the KL divergence between human, detector, and source text distributions. To address this, we propose DivScore, a zero-shot detection framework using normalized entropy-based scoring and domain knowledge distillation to robustly identify LLM-generated text in specialized domains. We also release a domain-specific benchmark for LLM-generated text detection in the medical and legal domains. Experiments on our benchmark show that DivScore consistently outperforms state-of-the-art detectors, with 14.4% higher AUROC and 64.0% higher recall (0.1% false positive rate threshold). In adversarial settings, DivScore demonstrates greater robustness than other baselines, achieving an average advantage of 22.8% in AUROC and 29.5% in recall. Code and data are publicly available.
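
The recall figure above is reported at a fixed 0.1% false positive rate. As a minimal sketch of how such a metric can be computed (this is not the DivScore scoring function itself, and the name `recall_at_fpr` and the convention that higher scores mean "more likely LLM-generated" are assumptions made here for illustration):

```python
def recall_at_fpr(human_scores, llm_scores, max_fpr=0.001):
    """Recall on LLM-written text at the strictest threshold whose false
    positive rate on human-written text does not exceed max_fpr.
    Assumes higher scores mean "more likely LLM-generated"."""
    # How many human-written texts may be flagged by mistake.
    allowed_false_positives = int(max_fpr * len(human_scores))
    ranked_human = sorted(human_scores, reverse=True)
    if allowed_false_positives >= len(ranked_human):
        threshold = float("-inf")  # every text may be flagged
    else:
        # A text is flagged only if its score is strictly above this value,
        # so at most `allowed_false_positives` human texts are flagged.
        threshold = ranked_human[allowed_false_positives]
    detected = sum(1 for score in llm_scores if score > threshold)
    return detected / len(llm_scores)


human = [0.1, 0.2, 0.3, 0.4]
llm = [0.35, 0.5, 0.9, 0.45]
print(recall_at_fpr(human, llm))  # 0.75
```

With only four human-written samples, a 0.1% budget allows zero false positives, so the threshold is the highest human score (0.4) and three of the four LLM samples are detected.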

[232] A Survey of Retentive Network

Haiqi Yang,Zhiyuan Li,Yi Chang,Yuan Wu

Main category: cs.CL

TL;DR: RetNet is a neural network architecture whose retention mechanism addresses the high memory cost and limited scalability of Transformers, enabling linear-time inference and efficient context modeling.

Details

Motivation: Transformers incur high memory costs and scale poorly on long sequences; the goal is to remove these limits while keeping global dependency modeling. Method: RetNet's retention mechanism combines the inductive bias of recurrence with the global dependency modeling of attention and remains compatible with parallel training. Result: RetNet performs well across natural language processing, speech recognition, and time-series analysis. Conclusion: This paper offers the first comprehensive survey of the RetNet architecture and its applications and discusses future research directions. Abstract: Retentive Network (RetNet) represents a significant advancement in neural network architecture, offering an efficient alternative to the Transformer. While Transformers rely on self-attention to model dependencies, they suffer from high memory costs and limited scalability when handling long sequences due to their quadratic complexity. To mitigate these limitations, RetNet introduces a retention mechanism that unifies the inductive bias of recurrence with the global dependency modeling of attention. This mechanism enables linear-time inference, facilitates efficient modeling of extended contexts, and remains compatible with fully parallelizable training pipelines. RetNet has garnered significant research interest due to its consistently demonstrated cross-domain effectiveness, achieving robust performance across machine learning paradigms including natural language processing, speech recognition, and time-series analysis. However, a comprehensive review of RetNet is still missing from the current literature. This paper aims to fill that gap by offering the first detailed survey of the RetNet architecture, its key innovations, and its diverse applications. We also explore the main challenges associated with RetNet and propose future research directions to support its continued advancement in both academic research and practical deployment.
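
The claim that retention supports both parallel training and recurrent inference can be illustrated with a simplified single-head sketch. The formulas below follow the commonly cited RetNet formulation (a decay factor `gamma` applied to a running key-value state) rather than anything stated in this listing, and the sketch omits the position rotation, multi-scale decay, normalization, and gating of the full architecture; the function names are chosen here for illustration.

```python
def retention_parallel(Q, K, V, gamma):
    """Parallel form: o_n = sum over m <= n of gamma**(n - m) * (Q_n . K_m) * V_m."""
    d_v = len(V[0])
    outputs = []
    for n in range(len(Q)):
        row = [0.0] * d_v
        for m in range(n + 1):  # causal: only positions m <= n contribute
            weight = gamma ** (n - m) * sum(q * k for q, k in zip(Q[n], K[m]))
            for j in range(d_v):
                row[j] += weight * V[m][j]
        outputs.append(row)
    return outputs


def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + K_n^T V_n, then o_n = Q_n S_n."""
    d_k, d_v = len(K[0]), len(V[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # fixed-size state, independent of length
    outputs = []
    for q, k, v in zip(Q, K, V):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = gamma * state[i][j] + k[i] * v[j]
        outputs.append([sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return outputs


Q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
K = [[1.0, 2.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0], [2.0], [3.0]]
parallel = retention_parallel(Q, K, V, gamma=0.9)
recurrent = retention_recurrent(Q, K, V, gamma=0.9)
print(all(abs(a - b) < 1e-9 for p, r in zip(parallel, recurrent) for a, b in zip(p, r)))
```

The recurrent form touches only a `d_k` by `d_v` state per token, which is where the constant per-token cost at inference comes from; the parallel form computes the same outputs for all positions at once.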

[233] C-PATH: Conversational Patient Assistance and Triage in Healthcare System

Qi Shi,Qiwei Han,Cláudia Soares

Main category: cs.CL

TL;DR: C-PATH is an LLM-based conversational AI system that helps patients recognize symptoms and recommends appropriate medical departments through multi-turn dialogue.

Details

Motivation: Healthcare systems are complex, and patients struggle to obtain timely and appropriate medical attention; C-PATH aims to address this. Method: Built on the LLaMA3 architecture and fine-tuned on medical knowledge, dialogue data, and clinical summaries, with a GPT-based data augmentation framework and a conversation history management strategy. Result: The system performs strongly on clarity, informativeness, and recommendation accuracy, significantly outperforming domain-specific baselines. Conclusion: C-PATH is a step forward for digital health assistance and triage. Abstract: Navigating healthcare systems can be complex and overwhelming, creating barriers for patients seeking timely and appropriate medical attention. In this paper, we introduce C-PATH (Conversational Patient Assistance and Triage in Healthcare), a novel conversational AI system powered by large language models (LLMs) designed to assist patients in recognizing symptoms and recommending appropriate medical departments through natural, multi-turn dialogues. C-PATH is fine-tuned on medical knowledge, dialogue data, and clinical summaries using a multi-stage pipeline built on the LLaMA3 architecture. A core contribution of this work is a GPT-based data augmentation framework that transforms structured clinical knowledge from DDXPlus into lay-person-friendly conversations, allowing alignment with patient communication norms. We also implement a scalable conversation history management strategy to ensure long-range coherence. Evaluation with GPTScore demonstrates strong performance across dimensions such as clarity, informativeness, and recommendation accuracy. Quantitative benchmarks show that C-PATH achieves superior performance in GPT-rewritten conversational datasets, significantly outperforming domain-specific baselines. C-PATH represents a step forward in the development of user-centric, accessible, and accurate AI tools for digital health assistance and triage.

[234] Geopolitical biases in LLMs: what are the "good" and the "bad" countries according to contemporary language models

Mikhail Salnikov,Dmitrii Korzh,Ivan Lazichny,Elvir Karimov,Artyom Iudin,Ivan Oseledets,Oleg Y. Rogov,Alexander Panchenko,Natalia Loukachevitch,Elena Tutubalina

Main category: cs.CL

TL;DR: The paper evaluates geopolitical bias in LLMs by analyzing how they interpret historical events from conflicting national perspectives (USA, UK, USSR, and China); it finds significant bias and shows that simple debiasing has limited effect.

Details

Motivation: To determine whether LLMs exhibit geopolitical bias when handling historical events with conflicting national perspectives, and to assess how well debiasing methods work. Method: The authors build a dataset of neutral event descriptions paired with contrasting national viewpoints and test the models' sensitivity to different country labels. Result: LLMs show significant geopolitical bias, simple debiasing prompts have limited effect, and the models are sensitive to label manipulation. Conclusion: LLMs carry national narrative biases that simple debiasing does not remove; the work provides a framework and dataset for future research. Abstract: This paper evaluates geopolitical biases in LLMs with respect to various countries through an analysis of their interpretation of historical events with conflicting national perspectives (USA, UK, USSR, and China). We introduce a novel dataset with neutral event descriptions and contrasting viewpoints from different countries. Our findings show significant geopolitical biases, with models favoring specific national narratives. Additionally, simple debiasing prompts had a limited effect in reducing these biases. Experiments with manipulated participant labels reveal models' sensitivity to attribution, sometimes amplifying biases or recognizing inconsistencies, especially with swapped labels. This work highlights national narrative biases in LLMs, challenges the effectiveness of simple debiasing methods, and offers a framework and dataset for future geopolitical bias research.

[235] They want to pretend not to understand: The Limits of Current LLMs in Interpreting Implicit Content of Political Discourse

Walter Paci,Alessandro Panunzi,Sandro Pezzelle

Main category: cs.CL

TL;DR: The paper examines how well large language models (LLMs) detect and explain implicit content in political discourse and finds that current models struggle with presuppositions and implicatures.

Details

Motivation: Implicit content plays an important role in political discourse and in shaping audiences, yet the potential of LLMs in this area has not been fully explored. Method: Using the IMPAQTS corpus, the authors test LLMs on a multiple-choice task and an open-ended generation task. Result: All tested models struggle to interpret presuppositions and implicatures, indicating that current LLMs lack key pragmatic capabilities. Conclusion: Current LLMs have limited ability to interpret highly implicit language in political discourse, although there are promising directions for improvement. Abstract: Implicit content plays a crucial role in political discourse, where speakers systematically employ pragmatic strategies such as implicatures and presuppositions to influence their audiences. Large Language Models (LLMs) have demonstrated strong performance in tasks requiring complex semantic and pragmatic understanding, highlighting their potential for detecting and explaining the meaning of implicit content. However, their ability to do this within political discourse remains largely underexplored. Leveraging, for the first time, the large IMPAQTS corpus, which comprises Italian political speeches with the annotation of manipulative implicit content, we propose methods to test the effectiveness of LLMs in this challenging problem. Through a multiple-choice task and an open-ended generation task, we demonstrate that all tested models struggle to interpret presuppositions and implicatures. We conclude that current LLMs lack the key pragmatic capabilities necessary for accurately interpreting highly implicit language, such as that found in political discourse. At the same time, we highlight promising trends and future directions for enhancing model performance. We release our data and code at https://github.com/WalterPaci/IMPAQTS-PID

[236] Extending dependencies to the taggedPBC: Word order in transitive clauses

Hiram Ring

Main category: cs.CL

TL;DR: This paper presents a dependency-annotated version of the taggedPBC dataset and demonstrates its usefulness for crosslinguistic research despite concerns about annotation quality.

Details

Motivation: The taggedPBC dataset is large but was not annotated for dependencies; this paper fills that gap and tests the linguistic value of the result. Method: Dependency information is transferred along with POS tags to all languages in the taggedPBC, producing a CoNLLU-formatted dataset. Result: Word order information derived from the dataset correlates with expert determinations of word order, indicating its potential for corpus-based typology. Conclusion: Even with noisy annotation, dependency-annotated corpora can yield important crosslinguistic insights; the data is openly available for research. Abstract: The taggedPBC (Ring 2025a) contains more than 1,800 sentences of POS-tagged parallel text data from over 1,500 languages, representing 133 language families and 111 isolates. While this dwarfs previously available resources, and the POS tags achieve decent accuracy, allowing for predictive crosslinguistic insights (Ring 2025b), the dataset was not initially annotated for dependencies. This paper reports on a CoNLLU-formatted version of the dataset which transfers dependency information along with POS tags to all languages in the taggedPBC. Although there are various concerns regarding the quality of the tags and the dependencies, word order information derived from this dataset regarding the position of arguments and predicates in transitive clauses correlates with expert determinations of word order in three typological databases (WALS, Grambank, Autotyp). This highlights the usefulness of corpus-based typological approaches (as per Baylor et al. 2023; Bjerva 2024) for extending comparisons of discrete linguistic categories, and suggests that important insights can be gained even from noisy data, given sufficient annotation. The dependency-annotated corpora are also made available for research and collaboration via GitHub.
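
Deriving transitive word order from dependency annotation can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: it reads the standard ten-column CoNLL-U layout, uses the Universal Dependencies relations `nsubj` and `obj`, and the function name `transitive_order` is hypothetical.

```python
def transitive_order(conllu_sentence):
    """Return the order of subject, verb, and object (e.g. 'SVO') for the first
    verb that has both an nsubj and an obj dependent, or None if there is none."""
    tokens = []
    for line in conllu_sentence.strip().splitlines():
        if line.startswith("#"):
            continue  # sentence-level comment lines
        cols = line.split("\t")
        # Keep plain word lines only: skip multiword ranges (1-2), empty
        # nodes (1.1), and lines with a missing head.
        if len(cols) != 10 or not cols[0].isdigit() or not cols[6].isdigit():
            continue
        tokens.append({"id": int(cols[0]), "upos": cols[3],
                       "head": int(cols[6]), "deprel": cols[7]})
    for verb in tokens:
        if verb["upos"] != "VERB":
            continue
        dependents = {t["deprel"]: t["id"] for t in tokens if t["head"] == verb["id"]}
        if "nsubj" in dependents and "obj" in dependents:
            positions = {"S": dependents["nsubj"], "V": verb["id"], "O": dependents["obj"]}
            return "".join(sorted(positions, key=positions.get))
    return None


example = "\n".join("\t".join(row) for row in [
    ["1", "The", "the", "DET", "_", "_", "2", "det", "_", "_"],
    ["2", "dog", "dog", "NOUN", "_", "_", "3", "nsubj", "_", "_"],
    ["3", "chased", "chase", "VERB", "_", "_", "0", "root", "_", "_"],
    ["4", "the", "the", "DET", "_", "_", "5", "det", "_", "_"],
    ["5", "cat", "cat", "NOUN", "_", "_", "3", "obj", "_", "_"],
])
print(transitive_order(example))  # SVO
```

Counting these labels over all transitive clauses of a language gives a distribution that could then be compared with the categorical word order values recorded in typological databases.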

[237] On the Adaptive Psychological Persuasion of Large Language Models

Tianjie Ju,Yujia Chen,Hao Fei,Mong-Li Lee,Wynne Hsu,Pengzhou Cheng,Zongru Wu,Zhuosheng Zhang,Gongshen Liu

Main category: cs.CL

TL;DR: This paper studies the dual ability of large language models (LLMs) to persuade and to resist persuasion, and proposes an adaptive framework based on psychological strategies that significantly raises persuasion success rates.

Details

Motivation: To explore how LLMs persuade and resist persuasion in contexts involving psychological rhetoric, a gap in existing research. Method: Four LLMs are evaluated in adversarial dialogues, eleven psychological persuasion strategies are introduced, and an adaptive framework is proposed to optimize strategy selection. Result: Specific strategies such as the Fluency Effect and the Repetition Effect significantly improve persuasion success, but their effectiveness depends on context; the adaptive framework improves success rates further. Conclusion: The adaptive psychological persuasion method improves strategy selection and persuasion success while maintaining general capabilities. Abstract: Previous work has showcased the intriguing capabilities of Large Language Models (LLMs) in instruction-following and rhetorical fluency. However, their dual capabilities to autonomously persuade and resist persuasion, particularly in contexts involving psychological rhetoric, have not been systematically explored. In this paper, we first evaluate four commonly adopted LLMs by tasking them to alternately act as persuaders and listeners in adversarial dialogues. Empirical results show that persuader LLMs predominantly employ repetitive strategies, leading to low success rates. Then we introduce eleven comprehensive psychological persuasion strategies, finding that explicitly instructing LLMs to adopt specific strategies such as Fluency Effect and Repetition Effect significantly improves persuasion success rates. However, no "one-size-fits-all" strategy proves universally effective, with performance heavily dependent on contextual counterfactuals. Motivated by these observations, we propose an adaptive framework based on direct preference optimization that trains LLMs to autonomously select optimal strategies by leveraging persuasion results from strategy-specific responses as preference pairs. Experiments on three open-source LLMs confirm that the proposed adaptive psychological persuasion method effectively enables persuader LLMs to select optimal strategies, significantly enhancing their success rates while maintaining general capabilities. Our code is available at https://github.com/KalinaEine/PsychologicalPersuasion.
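
Turning persuasion outcomes into preference pairs and scoring them with direct preference optimization (DPO) can be sketched as below. The record layout (`strategy`, `text`, `persuaded`) and both function names are hypothetical; the loss is the standard per-pair DPO objective, and how the authors actually construct pairs may differ.

```python
import math


def build_preference_pairs(responses):
    """Pair every response that persuaded the listener (chosen) with every
    response that failed (rejected). `responses` is a hypothetical list of
    dicts for one prompt: {"strategy": str, "text": str, "persuaded": bool}."""
    chosen = [r for r in responses if r["persuaded"]]
    rejected = [r for r in responses if not r["persuaded"]]
    return [(c["text"], r["text"]) for c in chosen for r in rejected]


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities under
    the trained policy and the frozen reference model."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) equals softplus(-margin); computed stably below.
    x = -margin
    return x + math.log1p(math.exp(-x)) if x > 0 else math.log1p(math.exp(x))


responses = [
    {"strategy": "Fluency Effect", "text": "response A", "persuaded": True},
    {"strategy": "Repetition Effect", "text": "response B", "persuaded": False},
]
print(build_preference_pairs(responses))  # [('response A', 'response B')]
```

The loss shrinks as the policy assigns relatively more probability to the successful response than the reference model does, which is what pushes the persuader toward strategies that worked in a given context.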

[238] Label-semantics Aware Generative Approach for Domain-Agnostic Multilabel Classification

Subhendu Khatuya,Shashwat Naidu,Saptarshi Ghosh,Pawan Goyal,Niloy Ganguly

Main category: cs.CL

TL;DR: The paper proposes LAGAMC, a generative framework for multi-label text classification that generates label descriptions, matches them to predefined labels, and trains with a dual-objective loss, yielding efficient and domain-agnostic classification.

Details

Motivation: The explosion of textual data makes manual classification difficult, so an efficient and general automatic method is needed. Method: The model is trained to generate predefined label descriptions with a dual-objective loss (cross-entropy plus cosine similarity); at inference, a fine-tuned sentence transformer matches generated descriptions to the predefined labels. Result: New state-of-the-art performance on all evaluated datasets, with improvements of 13.94% in Micro-F1 and 24.85% in Macro-F1. Conclusion: LAGAMC is parameter-efficient and versatile, making it suitable for practical applications. Abstract: The explosion of textual data has made manual document classification increasingly challenging. To address this, we introduce a robust, efficient domain-agnostic generative model framework for multi-label text classification. Instead of treating labels as mere atomic symbols, our approach utilizes predefined label descriptions and is trained to generate these descriptions based on the input text. During inference, the generated descriptions are matched to the pre-defined labels using a finetuned sentence transformer. We integrate this with a dual-objective loss function, combining cross-entropy loss and cosine similarity of the generated sentences with the predefined target descriptions, ensuring both semantic alignment and accuracy. Our proposed model LAGAMC stands out for its parameter efficiency and versatility across diverse datasets, making it well-suited for practical applications. We demonstrate the effectiveness of our proposed model by achieving new state-of-the-art performances across all evaluated datasets, surpassing several strong baselines. We achieve improvements of 13.94% in Micro-F1 and 24.85% in Macro-F1 compared to the closest baseline across all datasets.
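
A dual-objective loss that combines cross-entropy with cosine similarity can be sketched as follows. The additive form, the `weight` parameter, and the use of `1 - cosine` as the similarity penalty are assumptions made for illustration; the abstract does not specify how the two terms are combined.

```python
import math


def cross_entropy(reference_token_probs):
    """Mean negative log-likelihood the model assigns to the reference tokens."""
    return -sum(math.log(p) for p in reference_token_probs) / len(reference_token_probs)


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def dual_objective_loss(reference_token_probs, generated_emb, target_emb, weight=1.0):
    """Token-level accuracy term plus a sentence-level semantic alignment term.
    `weight` is a hypothetical balancing coefficient."""
    alignment_penalty = 1.0 - cosine_similarity(generated_emb, target_emb)
    return cross_entropy(reference_token_probs) + weight * alignment_penalty


# Confident tokens and a generated description whose embedding points the
# same way as the target description give a loss near zero.
print(dual_objective_loss([0.99, 0.98], [1.0, 2.0], [2.0, 4.0]))
```

The cross-entropy term rewards reproducing the exact label description, while the cosine term still gives credit when the generated wording differs but the meaning is close.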

[239] Not quite Sherlock Holmes: Language model predictions do not reliably differentiate impossible from improbable events

James A. Michaelov,Reeka Estacio,Zhien Zhang,Benjamin K. Bergen

Main category: cs.CL

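TL;DR: Language models do not robustly distinguish possible events from impossible ones, and under some conditions they perform worse than chance.

Details

Motivation: To test whether language models can reliably tell possible events from impossible ones and to understand why their performance is unstable. Method: Possibility, typicality, and contextual relatedness are teased apart, and several language models (including Llama 3, Gemma 2, and Mistral NeMo) are tested. Result: Under certain conditions, all tested models perform below chance, assigning higher probabilities to impossible sentences than to merely unlikely ones. Conclusion: Language models are not yet robust at this task and need further improvement. Abstract: Can language models reliably predict that possible events are more likely than impossible ones? By teasing apart possibility, typicality, and contextual relatedness, we show that despite the results of previous work, language models' ability to do this is far from robust. In fact, under certain conditions, all models tested - including Llama 3, Gemma 2, and Mistral NeMo - perform at worse-than-chance level, assigning higher probabilities to impossible sentences such as 'the car was given a parking ticket by the brake' than to merely unlikely sentences such as 'the car was given a parking ticket by the explorer'.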

[240] Advancing Question Generation with Joint Narrative and Difficulty Control

Bernardo Leite,Henrique Lopes Cardoso

Main category: cs.CL

TL;DR: This paper proposes a method for jointly controlling narrative and difficulty when generating reading comprehension questions, filling a gap in existing research.

Details

Motivation: Existing research lacks methods that control question difficulty and narrative at the same time, which matters for question generation in education. Method: A strategy for joint narrative and difficulty control in the generation of reading comprehension questions. Result: Preliminary evaluation indicates the approach is feasible, though not effective in all cases, and identifies the conditions under which it works and the trade-offs involved. Conclusion: The approach offers a new direction for educational question generation but needs further refinement. Abstract: Question Generation (QG), the task of automatically generating questions from a source input, has seen significant progress in recent years. Difficulty-controllable QG (DCQG) enables control over the difficulty level of generated questions while considering the learner's ability. Additionally, narrative-controllable QG (NCQG) allows control over the narrative aspects embedded in the questions. However, research in QG lacks a focus on combining these two types of control, which is important for generating questions tailored to educational purposes. To address this gap, we propose a strategy for Joint Narrative and Difficulty Control, enabling simultaneous control over these two attributes in the generation of reading comprehension questions. Our evaluation provides preliminary evidence that this approach is feasible, though it is not effective across all instances. Our findings highlight the conditions under which the strategy performs well and discuss the trade-offs associated with its application.

[241] BTPD: A Multilingual Hand-curated Dataset of Bengali Transnational Political Discourse Across Online Communities

Dipto Das,Syed Ishtiaque Ahmed,Shion Guha

Main category: cs.CL

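TL;DR: The paper presents BTPD, a multilingual dataset of Bengali transnational political discourse that addresses the lack of resources for an under-resourced language, and describes how it was collected and what it contains.

Details

Motivation: Studying online political discourse is essential for analyzing public opinion and ideological polarization, but such research is limited for under-resourced languages like Bengali. Method: The dataset was hand-curated through community-informed keyword-based retrieval from three online platforms with distinct community structures and interaction dynamics. Result: The BTPD dataset, together with an overview of its multilingual content and topics. Conclusion: The dataset is a valuable resource for studying Bengali transnational political discourse and fills a gap in existing work. Abstract: Understanding political discourse in online spaces is crucial for analyzing public opinion and ideological polarization. While social computing and computational linguistics have explored such discussions in English, such research efforts are significantly limited in major yet under-resourced languages like Bengali due to the unavailability of datasets. In this paper, we present a multilingual dataset of Bengali transnational political discourse (BTPD) collected from three online platforms, each representing distinct community structures and interaction dynamics. Besides describing how we hand-curated the dataset through community-informed keyword-based retrieval, this paper also provides a general overview of its topics and multilingual content.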

[242] How do datasets, developers, and models affect biases in a low-resourced language?

Dipto Das,Shion Guha,Bryan Semaan

Main category: cs.CL

TL;DR: The paper studies identity-based bias related to gender, religion, and nationality in Bengali sentiment analysis models, finds that the models are biased even when semantic content and structure are similar, and examines the inconsistencies and uncertainties that arise from combining pre-trained models with datasets.

Details

Motivation: To address the understudied problem of identity bias in low-resourced languages such as Bengali and to test whether multilingual or language-specific models and datasets mitigate it. Method: An algorithmic audit of sentiment analysis models built on mBERT and BanglaBERT, fine-tuned on all Bengali sentiment analysis datasets from Google Dataset Search. Result: The models exhibit biases across identity categories despite similar semantic content and structure, and combining pre-trained models with datasets introduces inconsistencies and uncertainties. Conclusion: The findings are connected to broader discussions of epistemic injustice, AI alignment, and methodological decisions in algorithmic audits. Abstract: Sociotechnical systems, such as language technologies, frequently exhibit identity-based biases. These biases exacerbate the experiences of historically marginalized communities and remain understudied in low-resource contexts. While models and datasets specific to a language or with multilingual support are commonly recommended to address these biases, this paper empirically tests the effectiveness of such approaches in the context of gender, religion, and nationality-based identities in Bengali, a widely spoken but low-resourced language. We conducted an algorithmic audit of sentiment analysis models built on mBERT and BanglaBERT, which were fine-tuned using all Bengali sentiment analysis (BSA) datasets from Google Dataset Search. Our analyses showed that BSA models exhibit biases across different identity categories despite having similar semantic content and structure. We also examined the inconsistencies and uncertainties arising from combining pre-trained models and datasets created by individuals from diverse demographic backgrounds. We connected these findings to the broader discussions on epistemic injustice, AI alignment, and methodological decisions in algorithmic audits.

[243] Beyond Classification: Towards Speech Emotion Reasoning with Multitask AudioLLMs

Wenyu Zhang,Yingxu He,Geyu Lin,Zhuohan Liu,Shuo Sun,Bin Wang,Xunlong Zou,Jeremy H. M. Wong,Qiongqiong Wang,Hardik B. Sailor,Nancy F. Chen,Ai Ti Aw

Main category: cs.CL

TL;DR: The paper proposes enhancing AudioLLMs with emotion reasoning in a multitask framework, improving both emotion prediction accuracy and the coherence of the generated explanations.

Details

Motivation: Existing AudioLLMs are limited in emotion understanding; they usually treat it as a classification problem and offer little explanation of the rationale behind predictions. Method: A unified framework combining reasoning-augmented data supervision, a dual-encoder architecture, and task-alternating training to support emotion reasoning. Result: On IEMOCAP and MELD, the method improves emotion prediction accuracy and enhances the coherence and evidential grounding of generated responses. Conclusion: With emotion reasoning and a multitask framework, AudioLLMs perform better on emotion understanding and provide better-grounded explanations. Abstract: Audio Large Language Models (AudioLLMs) have achieved strong results in semantic tasks like speech recognition and translation, but remain limited in modeling paralinguistic cues such as emotion. Existing approaches often treat emotion understanding as a classification problem, offering little insight into the underlying rationale behind predictions. In this work, we explore emotion reasoning, a strategy that leverages the generative capabilities of AudioLLMs to enhance emotion recognition by producing semantically aligned, evidence-grounded explanations. To support this in multitask AudioLLMs, we introduce a unified framework combining reasoning-augmented data supervision, a dual-encoder architecture, and task-alternating training. This approach enables AudioLLMs to effectively learn different tasks while incorporating emotional reasoning. Experiments on IEMOCAP and MELD show that our approach not only improves emotion prediction accuracy but also enhances the coherence and evidential grounding of the generated responses.

[244] Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems

Yuhan Cao,Zian Chen,Kun Quan,Ziliang Zhang,Yu Wang,Xiaoning Dong,Yeqi Feng,Guanzhong He,Jingcheng Huang,Jianhao Li,Yixuan Tan,Jiafu Tang,Yilin Tang,Junlei Wu,Qianyu Xiao,Can Zheng,Shouchen Zhou,Yuxiang Zhu,Yiming Huang,Tian Xie,Tianxing He

Main category: cs.CL

TL;DR: The study examines how well large language models (LLMs) generate test cases for checking or debugging code, introduces the TCGBench benchmark, and finds that LLMs struggle to generate targeted test cases, although a high-quality dataset improves performance.

Details

Motivation: To explore the potential of LLMs for code checking and debugging, in particular their ability to generate test cases for competition-level programming (CP). Method: TCGBench comprises two tasks, generating valid test case generators and generating targeted test case generators, on which LLMs are evaluated experimentally. Result: LLMs can generate valid test case generators in most cases but struggle with targeted ones; even advanced reasoning models fall well short of human performance. A high-quality dataset improves performance. Conclusion: LLMs remain limited at generating targeted test cases, but data-driven approaches (prompting and fine-tuning) can improve them. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, capable of tackling complex tasks during inference. However, the extent to which LLMs can be utilized for code checking or debugging through test case generation remains largely unexplored. We investigate this problem from the perspective of competition-level programming (CP) programs and propose TCGBench, a Benchmark for (LLM generation of) Test Case Generators. This benchmark comprises two tasks, aimed at studying the capabilities of LLMs in (1) generating valid test case generators for a given CP problem, and further (2) generating targeted test case generators that expose bugs in human-written code. Experimental results indicate that while state-of-the-art LLMs can generate valid test case generators in most cases, most LLMs struggle to generate targeted test cases that reveal flaws in human code effectively. In particular, even advanced reasoning models (e.g., o3-mini) fall significantly short of human performance in the task of generating targeted generators. Furthermore, we construct a high-quality, manually curated dataset of instructions for generating targeted generators. Analysis demonstrates that the performance of LLMs can be enhanced with the aid of this dataset, by both prompting and fine-tuning.
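
The difference between a valid test case generator and a targeted one can be made concrete with a small sketch. Everything here is hypothetical and not taken from TCGBench: the problem (maximum sum of a non-empty subarray), the bug (allowing an empty subarray), and all function names are chosen for illustration.

```python
import random


def reference_solution(values):
    """Kadane's algorithm; the chosen subarray must be non-empty."""
    best = current = values[0]
    for x in values[1:]:
        current = max(x, current + x)
        best = max(best, current)
    return best


def buggy_solution(values):
    """Hypothetical human bug: starting from 0 silently allows an empty subarray."""
    best = current = 0
    for x in values:
        current = max(0, current + x)
        best = max(best, current)
    return best


def valid_generator(rng, max_n=8, max_abs=10):
    """Any input within the constraints 1 <= n <= max_n and |x| <= max_abs."""
    return [rng.randint(-max_abs, max_abs) for _ in range(rng.randint(1, max_n))]


def targeted_generator(rng, max_n=8, max_abs=10):
    """Still within the constraints, but aimed at the bug: all values negative."""
    return [rng.randint(-max_abs, -1) for _ in range(rng.randint(1, max_n))]


def bug_hit_rate(generator, trials=50, seed=0):
    """Fraction of generated cases on which the two solutions disagree."""
    rng = random.Random(seed)
    cases = [generator(rng) for _ in range(trials)]
    return sum(reference_solution(c) != buggy_solution(c) for c in cases) / trials


print(bug_hit_rate(targeted_generator))  # 1.0
```

A random valid generator exposes this bug only when it happens to produce an all-negative array, whereas the targeted generator triggers it on every case; producing the second kind requires understanding why the code is wrong, which is the harder task the benchmark measures.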

Arkadiusz Modzelewski,Witold Sosnowski,Tiziano Labruna,Adam Wierzbicki,Giovanni Da San Martino

Main category: cs.CL

TL;DR: The paper proposes PCoT, a zero-shot disinformation detection method based on persuasion knowledge, and shows on several datasets that it outperforms existing methods by 15% on average.

Details

Motivation: Psychological studies show that knowledge of persuasive fallacies helps people detect disinformation, so the authors test whether infusing persuasion knowledge into large language models improves detection. Method: Persuasion-Augmented Chain of Thought (PCoT) uses persuasion knowledge to strengthen zero-shot classification and is evaluated on online news and social media posts. Result: PCoT outperforms competitive methods by 15% on average across five LLMs and five datasets; two new datasets, EUDisinfo and MultiDis, are released. Conclusion: Persuasion knowledge substantially improves zero-shot disinformation detection, and PCoT has practical value. Abstract: Disinformation detection is a key aspect of media literacy. Psychological studies have shown that knowledge of persuasive fallacies helps individuals detect disinformation. Inspired by these findings, we experimented with large language models (LLMs) to test whether infusing persuasion knowledge enhances disinformation detection. As a result, we introduce the Persuasion-Augmented Chain of Thought (PCoT), a novel approach that leverages persuasion to improve disinformation detection in zero-shot classification. We extensively evaluate PCoT on online news and social media posts. Moreover, we publish two novel, up-to-date disinformation datasets: EUDisinfo and MultiDis. These datasets enable the evaluation of PCoT on content entirely unseen by the LLMs used in our experiments, as the content was published after the models' knowledge cutoffs. We show that, on average, PCoT outperforms competitive methods by 15% across five LLMs and five datasets. These findings highlight the value of persuasion in strengthening zero-shot disinformation detection.

[246] Adapt Once, Thrive with Updates: Transferable Parameter-Efficient Fine-Tuning on Evolving Base Models

Naibin Gu,Peng Fu,Xiyu Liu,Ke Ma,Zheng Lin,Weiping Wang

Main category: cs.CL

TL;DR: Trans-PEFT is a new method that keeps PEFT modules effective after the base model is updated by focusing on task-specific patterns and reducing dependence on certain knowledge in the base model.

Details

Motivation: After a base model update, PEFT modules suffer substantial performance degradation, and re-tuning them is computationally expensive. Method: An analysis of base model updates shows that continual training mainly affects task-specific knowledge stored in the FFN and has less impact on task-specific patterns in the attention mechanism; Trans-PEFT is built on this observation. Result: Experiments across 7 base models and 12 datasets show that Trans-PEFT modules maintain performance on updated base models without re-tuning. Conclusion: Trans-PEFT significantly reduces maintenance overhead in real-world applications. Abstract: Parameter-efficient fine-tuning (PEFT) has become a common method for fine-tuning large language models, where a base model can serve multiple users through PEFT module switching. To enhance user experience, base models require periodic updates. However, once updated, PEFT modules fine-tuned on previous versions often suffer substantial performance degradation on newer versions. Re-tuning these numerous modules to restore performance would incur significant computational costs. Through a comprehensive analysis of the changes that occur during base model updates, we uncover an interesting phenomenon: continual training primarily affects task-specific knowledge stored in Feed-Forward Networks (FFN), while having less impact on the task-specific pattern in the Attention mechanism. Based on these findings, we introduce Trans-PEFT, a novel approach that enhances the PEFT module by focusing on the task-specific pattern while reducing its dependence on certain knowledge in the base model. Further theoretical analysis supports our approach. Extensive experiments across 7 base models and 12 datasets demonstrate that Trans-PEFT trained modules can maintain performance on updated base models without re-tuning, significantly reducing maintenance overhead in real-world applications.

[247] Right Is Not Enough: The Pitfalls of Outcome Supervision in Training LLMs for Math Reasoning

Jiaxing Guo,Wenjie Yang,Shengzhong Zhang,Tongshan Xu,Lun Du,Da Zheng,Zengfeng Huang

Main category: cs.CL

TL;DR: The paper introduces the MathOlympiadEval dataset and the ParaStepVerifier method, shows that LLMs often reach correct answers to math problems through flawed reasoning, and substantially improves the detection of such flaws.

Details

Motivation: LLMs often arrive at correct answers to math problems through unsound reasoning (a form of reward hacking), and existing methods cannot reliably detect these flaws. Method: ParaStepVerifier verifies mathematical solutions step by step at a fine-grained level and identifies incorrect reasoning steps. Result: ParaStepVerifier substantially improves the accuracy of identifying flawed solutions, especially for complex multi-step problems. Conclusion: The method offers a more reliable path toward evaluating and training LLMs with genuine mathematical reasoning ability. Abstract: Outcome-rewarded Large Language Models (LLMs) have demonstrated remarkable success in mathematical problem-solving. However, this success often masks a critical issue: models frequently achieve correct answers through fundamentally unsound reasoning processes, a phenomenon indicative of reward hacking. We introduce MathOlympiadEval, a new dataset with fine-grained annotations, which reveals a significant gap between LLMs' answer correctness and their low process correctness. Existing automated methods like LLM-as-a-judge struggle to reliably detect these reasoning flaws. To address this, we propose ParaStepVerifier, a novel methodology for meticulous, step-by-step verification of mathematical solutions. ParaStepVerifier identifies incorrect reasoning steps. Empirical results demonstrate that ParaStepVerifier substantially improves the accuracy of identifying flawed solutions compared to baselines, especially for complex, multi-step problems. This offers a more robust path towards evaluating and training LLMs with genuine mathematical reasoning.

[248] Mixture of Small and Large Models for Chinese Spelling Check

Ziheng Qiao,Houquan Zhou,Zhenghua Li

Main category: cs.CL

TL;DR: This paper proposes a dynamic mixture approach that combines the probability distributions of a small model and an LLM to improve Chinese Spelling Check, without fine-tuning the LLM and therefore saving resources.

Details

Motivation: Existing LLM methods perform unsatisfactorily on Chinese Spelling Check, while BERT-based models perform well but overfit to edit patterns; the strengths of both need to be combined. Method: A dynamic mixture approach that combines the probability distributions of the small model and the LLM during beam search decoding, balancing precision and fluency. Result: Experiments show a significant boost in error correction, with state-of-the-art results across multiple datasets. Conclusion: The dynamic mixture approach effectively combines the strengths of small models and LLMs, improves performance while saving resources, and facilitates domain adaptation. Abstract: In the era of large language models (LLMs), the Chinese Spelling Check (CSC) task has seen various LLM methods developed, yet their performance remains unsatisfactory. In contrast, fine-tuned BERT-based models, relying on high-quality in-domain data, show excellent performance but suffer from edit pattern overfitting. This paper proposes a novel dynamic mixture approach that effectively combines the probability distributions of small models and LLMs during the beam search decoding phase, achieving a balanced enhancement of precise corrections from small models and the fluency of LLMs. This approach also eliminates the need for fine-tuning LLMs, saving significant time and resources, and facilitating domain adaptation. Comprehensive experiments demonstrate that our mixture approach significantly boosts error correction capabilities, achieving state-of-the-art results across multiple datasets. Our code is available at https://github.com/zhqiao-nlp/MSLLM.
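
Combining two models' probability distributions at a single decoding step can be sketched as a weighted mixture. This is an illustration only: the linear interpolation, the weight `alpha`, and both function names are assumptions, and the paper's rule for choosing the weight dynamically is not reproduced here.

```python
def mix_distributions(p_small, p_llm, alpha):
    """Interpolate two next-character distributions given as {char: prob}.
    `alpha` is the (hypothetical) weight on the small model."""
    vocab = set(p_small) | set(p_llm)
    mixed = {ch: alpha * p_small.get(ch, 0.0) + (1.0 - alpha) * p_llm.get(ch, 0.0)
             for ch in vocab}
    total = sum(mixed.values())
    return {ch: p / total for ch, p in mixed.items()}  # renormalize


def top_candidates(distribution, beam_size):
    """Characters a beam search step would keep, most probable first."""
    return sorted(distribution.items(), key=lambda kv: kv[1], reverse=True)[:beam_size]


# The small model strongly prefers one correction; the LLM leans the other way.
p_small = {"在": 0.8, "再": 0.2}
p_llm = {"在": 0.4, "再": 0.6}
mixed = mix_distributions(p_small, p_llm, alpha=0.5)
print(top_candidates(mixed, beam_size=1)[0][0])  # 在
```

Because the mixing happens on output probabilities during decoding, neither model's parameters change, which is consistent with the abstract's point that the LLM needs no fine-tuning.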

[249] Automatic Speech Recognition of African American English: Lexical and Contextual Effects

Hamid Mojarad,Kevin Tang

Main category: cs.CL

TL;DR: The study examines how automatic speech recognition (ASR) models perform on African American English (AAE); it finds that consonant cluster reduction (CCR) and ING-reduction increase misrecognition and that end-to-end ASR systems without an external language model (LM) are more affected by lexical neighborhood effects.

Details

Motivation: ASR models struggle with the phonetic and grammatical features of AAE, especially CCR and ING-reduction, so the effect of these variables on ASR performance needs to be studied. Method: The CORAAL corpus is transcribed using wav2vec 2.0 with and without an LM, and CCR and ING-reduction are detected with the Montreal Forced Aligner. Result: CCR and ING-reduction have a small but significant effect on word error rate (WER); ASR systems without an LM are more affected by lexical neighborhood effects. Conclusion: The phonological features of AAE affect ASR performance, and systems without an LM are more sensitive to lexical neighborhood effects. Abstract: Automatic Speech Recognition (ASR) models often struggle with the phonetic, phonological, and morphosyntactic features found in African American English (AAE). This study focuses on two key AAE variables: Consonant Cluster Reduction (CCR) and ING-reduction. It examines whether the presence of CCR and ING-reduction increases ASR misrecognition. Subsequently, it investigates whether end-to-end ASR systems without an external Language Model (LM) are more influenced by the lexical neighborhood effect and less by contextual predictability compared to systems with an LM. The Corpus of Regional African American Language (CORAAL) was transcribed using wav2vec 2.0 with and without an LM. CCR and ING-reduction were detected using the Montreal Forced Aligner (MFA) with pronunciation expansion. The analysis reveals a small but significant effect of CCR and ING on Word Error Rate (WER) and indicates a stronger presence of the lexical neighborhood effect in ASR systems without LMs.
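
Word Error Rate, the metric used above, is the word-level edit distance between the reference transcript and the ASR output, divided by the number of reference words. A minimal sketch follows; the function name and the whitespace tokenization are assumptions made here, and the example sentence is invented to mimic consonant cluster reduction.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Two reduced clusters ("left" -> "lef", "west" -> "wes") count as two substitutions.
print(word_error_rate("she left the west coast", "she lef the wes coast"))  # 0.4
```

Because WER is computed on whole words, a single reduced consonant cluster that changes the transcribed word counts as a full word error.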

[250] Hybrid Extractive Abstractive Summarization for Multilingual Sentiment Analysis

Mikhail Krasitskii,Grigori Sidorov,Olga Kolesnikova,Liliana Chanona Hernandez,Alexander Gelbukh

Main category: cs.CL

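TL;DR: The paper proposes a multilingual sentiment analysis approach that combines extractive and abstractive summarization, significantly improving accuracy and computational efficiency.

Details

Motivation: To address the limitations of standalone methods in multilingual sentiment analysis. Method: TF-IDF-based extraction is combined with a fine-tuned XLM-R abstractive module, enhanced by dynamic thresholding and cultural adaptation. Result: The approach outperforms baselines across 10 languages, reaching 0.90 accuracy for English and 0.84 for low-resource languages, with 22% greater computational efficiency. Conclusion: The approach suits real-time brand monitoring and cross-cultural discourse analysis; future work will focus on optimization for low-resource languages. Abstract: We propose a hybrid approach for multilingual sentiment analysis that combines extractive and abstractive summarization to address the limitations of standalone methods. The model integrates TF-IDF-based extraction with a fine-tuned XLM-R abstractive module, enhanced by dynamic thresholding and cultural adaptation. Experiments across 10 languages show significant improvements over baselines, achieving 0.90 accuracy for English and 0.84 for low-resource languages. The approach also demonstrates 22% greater computational efficiency than traditional methods. Practical applications include real-time brand monitoring and cross-cultural discourse analysis. Future work will focus on optimization for low-resource languages via 8-bit quantization.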

[251] DiscoSum: Discourse-aware News Summarization

Alexander Spangher,Tenghao Huang,Jialiang Gu,Jiatong Shi,Muhao Chen

Main category: cs.CL

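TL;DR: The paper proposes a summarization method that incorporates news discourse structure, using the DiscoSum algorithm for structure-aware summary generation, and validates it on a multi-platform dataset.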

Details

Motivation: Existing large language models struggle to maintain long-term discourse structure in summarization, and the organizational structure of news articles strongly influences reader engagement. Method: Proposes a news discourse schema and develops the DiscoSum algorithm, which uses beam search for structure-aware summarization. Result: Human and automatic evaluations confirm the method's effectiveness in maintaining narrative fidelity and meeting structural requirements. Conclusion: The method successfully integrates discourse structure into summarization, offering a new approach to news summarization. Abstract: Recent advances in text summarization have predominantly leveraged large language models to generate concise summaries. However, language models often do not maintain long-term discourse structure, especially in news articles, where organizational flow significantly influences reader engagement. We introduce a novel approach to integrating discourse structure into summarization processes, focusing specifically on news articles across various media. We present a novel summarization dataset where news articles are summarized multiple times in different ways across different social media platforms (e.g. LinkedIn, Facebook, etc.). We develop a novel news discourse schema to describe summarization structures and a novel algorithm, DiscoSum, which employs a beam search technique for structure-aware summarization, enabling the transformation of news stories to meet different stylistic and structural demands. Both human and automatic evaluation results demonstrate the efficacy of our approach in maintaining narrative fidelity and meeting structural requirements.

[252] What Makes a Good Natural Language Prompt?

Do Xuan Long,Duy Dinh,Ngoc-Hai Nguyen,Kenji Kawaguchi,Nancy F. Chen,Shafiq Joty,Min-Yen Kan

Main category: cs.CL

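TL;DR: By analyzing more than 150 prompting-related papers and blogs, the paper proposes a property- and human-centric framework for evaluating prompt quality and reveals gaps in existing research. It finds that single-property enhancements often have the greatest impact on reasoning tasks and that instruction-tuning on property-enhanced prompts improves model reasoning.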

Details

Motivation: As large language models advance, prompting has become central to human-AI interaction, yet there is little consensus on how to quantify natural language prompts. Method: Conducts a meta-analysis of more than 150 papers and blogs, proposes a framework of 21 properties in six dimensions, analyzes their impact on LLMs, and explores multi-property prompt enhancements. Result: Single-property enhancements often have the greatest impact on reasoning tasks, and instruction-tuning on property-enhanced prompts improves model reasoning. Conclusion: The study lays a foundation for prompt evaluation and optimization and opens new directions for human-AI communication and prompting research. Abstract: As large language models (LLMs) have become more human-like and human-AI communication has become prevalent, prompting has emerged as a decisive component. However, there is limited conceptual consensus on what exactly quantifies natural language prompts. We attempt to address this question by conducting a meta-analysis surveying more than 150 prompting-related papers from leading NLP and AI conferences from 2022 to 2025 and blogs. We propose a property- and human-centric framework for evaluating prompt quality, encompassing 21 properties categorized into six dimensions. We then examine how existing studies assess their impact on LLMs, revealing their imbalanced support across models and tasks, and substantial research gaps. Further, we analyze correlations among properties in high-quality natural language prompts, deriving prompting recommendations. We then empirically explore multi-property prompt enhancements in reasoning tasks, observing that single-property enhancements often have the greatest impact. Finally, we discover that instruction-tuning on property-enhanced prompts can result in better reasoning models. Our findings establish a foundation for property-centric prompt evaluation and optimization, bridging the gaps between human-AI communication and opening new prompting research directions.

[253] BIS Reasoning 1.0: The First Large-Scale Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning

Ha-Thanh Nguyen,Chaoran Liu,Hirokazu Kiyomaru,Koichi Takeda,Yusuke Miyao,Maki Matsuda,Yusuke Oda,Pontus Stenetorp,Qianying Liu,Su Myat Noe,Hideyuki Tachibana,Kouta Nakayama,Sadao Kurohashi

Main category: cs.CL

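TL;DR: BIS Reasoning 1.0 is the first large-scale Japanese syllogistic reasoning dataset, designed to evaluate how large language models (LLMs) perform on belief-inconsistent reasoning.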

Details

Motivation: Existing datasets such as NeuBAROCO and JFLD focus on general or belief-aligned reasoning, whereas BIS Reasoning 1.0 introduces logically valid yet belief-inconsistent syllogisms to expose reasoning biases in LLMs trained on human-aligned corpora. Method: Designs logically valid but belief-conflicting syllogistic problems and benchmarks models including GPT models, Claude models, and leading Japanese LLMs. Result: GPT-4o performs best at 79.54% accuracy, but current LLMs show significant weaknesses on logically valid but belief-conflicting inputs. Conclusion: The findings matter for deploying LLMs in high-stakes domains such as law, healthcare, and scientific literature, where logical truth must override intuitive belief to ensure integrity and safety. Abstract: We present BIS Reasoning 1.0, the first large-scale Japanese dataset of syllogistic reasoning problems explicitly designed to evaluate belief-inconsistent reasoning in large language models (LLMs). Unlike prior datasets such as NeuBAROCO and JFLD, which focus on general or belief-aligned reasoning, BIS Reasoning 1.0 introduces logically valid yet belief-inconsistent syllogisms to uncover reasoning biases in LLMs trained on human-aligned corpora. We benchmark state-of-the-art models - including GPT models, Claude models, and leading Japanese LLMs - revealing significant variance in performance, with GPT-4o achieving 79.54% accuracy. Our analysis identifies critical weaknesses in current LLMs when handling logically valid but belief-conflicting inputs. These findings have important implications for deploying LLMs in high-stakes domains such as law, healthcare, and scientific literature, where truth must override intuitive belief to ensure integrity and safety.

[254] Learning to Clarify by Reinforcement Learning Through Reward-Weighted Fine-Tuning

Subhojyoti Mukherjee,Viet Dac Lai,Raghavendra Addanki,Ryan Rossi,Seunghyun Yoon,Trung Bui,Anup Rao,Jayakumar Subramanian,Branislav Kveton

Main category: cs.CL

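TL;DR: The paper trains question answering (QA) agents to ask clarifying questions via reinforcement learning (RL), and proposes offline RL objectives that can be viewed as reward-weighted supervised fine-tuning (SFT).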

Details

Motivation: Existing methods based on supervised fine-tuning and direct preference optimization have additional hyper-parameters and do not directly optimize rewards, so a more efficient method is needed. Method: Simulates conversations that contain clarifying questions, trains QA agents on them with reinforcement learning, and proposes offline RL objectives (reward-weighted SFT). Result: Experiments show gains over existing methods in both optimized rewards and language quality. Conclusion: The offline RL objectives are an efficient way to optimize rewards directly and outperform existing techniques. Abstract: Question answering (QA) agents automatically answer questions posed in natural language. In this work, we learn to ask clarifying questions in QA agents. The key idea in our method is to simulate conversations that contain clarifying questions and learn from them using reinforcement learning (RL). To make RL practical, we propose and analyze offline RL objectives that can be viewed as reward-weighted supervised fine-tuning (SFT) and easily optimized in large language models. Our work stands in stark contrast to recently proposed methods, based on SFT and direct preference optimization, which have additional hyper-parameters and do not directly optimize rewards. We compare to these methods empirically and report gains in both optimized rewards and language quality.
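
As a rough illustration of what "reward-weighted SFT" can mean, the sketch below weights each simulated conversation's negative log-likelihood by its reward. The abstract does not give the exact objective, so this generic form, the function name `reward_weighted_sft_loss`, and the per-sequence inputs are all assumptions.

```python
from typing import Sequence


def reward_weighted_sft_loss(
    sequence_logprobs: Sequence[float],
    rewards: Sequence[float],
) -> float:
    """Mean of -reward * log p(response | prompt) over logged conversations.

    sequence_logprobs[i] is the model's summed token log-probability for
    conversation i; rewards[i] is the scalar reward logged for it.
    """
    if len(sequence_logprobs) != len(rewards) or not rewards:
        raise ValueError("need equally sized, non-empty inputs")
    total = sum(-r * lp for lp, r in zip(sequence_logprobs, rewards))
    return total / len(rewards)
```

Setting every reward to 1 recovers ordinary SFT, which is one way to see why such an objective needs no extra hyper-parameters beyond those of SFT itself.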

[255] A dependently-typed calculus of event telicity and culminativity

Pavel Kovalev,Carlo Angiuli

Main category: cs.CL

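TL;DR: Proposes a dependently-typed cross-linguistic framework for analyzing the telicity and culminativity of events, and demonstrates it on English sentence examples.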

Details

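Motivation: Studies the cross-linguistic analysis of event telicity and culminativity, filling a gap in existing theories regarding dependently-typed and cross-linguistic modeling. Method: The framework has a nominal and a verbal part. The nominal domain models the boundedness of noun phrases and its relationship to subtyping, delimited quantities, and adjectival modification. The verbal domain defines a dependent event calculus that models telic events as those whose undergoer is bounded and culminating events as telic events that achieve their inherent endpoint, and it also considers adverbial modification. Result: The framework is an extension of Martin-Löf dependent type theory, and its rules and examples have been formalized in the Agda proof assistant. Conclusion: The framework offers a new, formalized, and extensible theoretical tool for the cross-linguistic analysis of telicity and culminativity. Abstract: We present a dependently-typed cross-linguistic framework for analyzing the telicity and culminativity of events, accompanied by examples of using our framework to model English sentences. Our framework consists of two parts. In the nominal domain, we model the boundedness of noun phrases and its relationship to subtyping, delimited quantities, and adjectival modification. In the verbal domain we define a dependent event calculus, modeling telic events as those whose undergoer is bounded, culminating events as telic events that achieve their inherent endpoint, and consider adverbial modification. In both domains we pay particular attention to associated entailments. Our framework is defined as an extension of intensional Martin-Löf dependent type theory, and the rules and examples in this paper have been formalized in the Agda proof assistant.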

[256] Break-The-Chain: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation

Jaechul Roh,Varun Gandhi,Shivani Anilkumar,Arin Garg

Main category: cs.CL

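TL;DR: The paper studies the robustness of large language models (LLMs) on complex reasoning tasks by introducing semantically faithful yet adversarially structured prompt perturbations, and finds that model performance is sensitive to surface-level prompt dynamics, with both degradations and improvements.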

Details

Motivation: Asks whether LLMs truly reason or merely exploit shallow statistical patterns. Method: Introduces a suite of semantically faithful yet adversarially structured prompt perturbations and evaluates 700 perturbed code generations derived from LeetCode-style problems. Result: Some perturbations sharply degrade performance (accuracy drops of up to 42.1%), while others unexpectedly improve it (by up to 35.3%). Conclusion: Current reasoning systems are fragile and unpredictable, and more principled approaches to reasoning alignment and prompt robustness are needed. Abstract: Large Language Models (LLMs) have achieved remarkable success in tasks requiring complex reasoning, such as code generation, mathematical problem solving, and algorithmic synthesis -- especially when aided by reasoning tokens and Chain-of-Thought prompting. Yet, a core question remains: do these models truly reason, or do they merely exploit shallow statistical patterns? In this paper, we systematically investigate the robustness of reasoning LLMs by introducing a suite of semantically faithful yet adversarially structured prompt perturbations. Our evaluation -- spanning 700 perturbed code generations derived from LeetCode-style problems -- applies transformations such as storytelling reframing, irrelevant constraint injection, example reordering, and numeric perturbation. We observe that while certain modifications severely degrade performance (with accuracy drops of up to 42.1%), others surprisingly improve model accuracy by up to 35.3%, suggesting sensitivity not only to semantics but also to surface-level prompt dynamics. These findings expose the fragility and unpredictability of current reasoning systems, underscoring the need for more principled approaches to reasoning alignment and prompting robustness. We release our perturbation datasets and evaluation framework to promote further research in trustworthy and resilient LLM reasoning.

[257] Atomic Reasoning for Scientific Table Claim Verification

Yuji Zhang,Qingyun Wang,Cheng Qian,Jiateng Liu,Chenkai Sun,Denghui Zhang,Tarek Abdelzaher,Chengxiang Zhai,Preslav Nakov,Heng Ji

Main category: cs.CL

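TL;DR: Proposes a modular reasoning method inspired by Cognitive Load Theory that chains atomic skills to improve the accuracy of scientific table claim verification, outperforming existing large language models.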

Details

Motivation: The complexity and high information density of scientific text easily mislead non-experts, and existing models fall short on fine-grained reasoning. Method: Introduces a skill-chaining schema over atomic skills that dynamically composes modular reasoning components to reduce cognitive load. Result: With only 350 fine-tuning examples, the model outperforms GPT-4o's chain-of-thought method and achieves state-of-the-art results. Conclusion: The modular reasoning method markedly improves the accuracy and generalization of scientific claim verification. Abstract: Scientific texts often convey authority due to their technical language and complex data. However, this complexity can sometimes lead to the spread of misinformation. Non-experts are particularly susceptible to misleading claims based on scientific tables due to their high information density and perceived credibility. Existing table claim verification models, including state-of-the-art large language models (LLMs), often struggle with precise fine-grained reasoning, resulting in errors and a lack of precision in verifying scientific claims. Inspired by Cognitive Load Theory, we propose that enhancing a model's ability to interpret table-based claims involves reducing cognitive load by developing modular, reusable reasoning components (i.e., atomic skills). We introduce a skill-chaining schema that dynamically composes these skills to facilitate more accurate and generalizable reasoning with a reduced cognitive load. To evaluate this, we create SciAtomicBench, a cross-domain benchmark with fine-grained reasoning annotations. With only 350 fine-tuning examples, our model trained by atomic reasoning outperforms GPT-4o's chain-of-thought method, achieving state-of-the-art results with far less training data.

[258] Chain of Methodologies: Scaling Test Time Computation without Training

Cong Liu,Jie Wu,Weigang Wu,Xu Chen,Liang Lin,Wei-Shi Zheng

Main category: cs.CL

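TL;DR: The paper proposes the Chain of Methodologies (CoM) framework, which strengthens the complex reasoning of LLMs by integrating human methodological insights, enabling systematic thinking without fine-tuning.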

Details

Motivation: LLMs perform poorly on complex reasoning tasks, mainly because their training data lacks in-depth insights. Method: The CoM framework uses user-defined methodologies to activate systematic reasoning in LLMs. Result: Experiments show that CoM surpasses baseline methods, demonstrating the potential of training-free prompting. Conclusion: CoM offers an efficient solution for complex reasoning tasks and narrows the gap to human-level reasoning. Abstract: Large Language Models (LLMs) often struggle with complex reasoning tasks due to insufficient in-depth insights in their training data, which are typically absent in publicly available documents. This paper introduces the Chain of Methodologies (CoM), an innovative and intuitive prompting framework that enhances structured thinking by integrating human methodological insights, enabling LLMs to tackle complex tasks with extended reasoning. CoM leverages the metacognitive abilities of advanced LLMs, activating systematic reasoning through user-defined methodologies without explicit fine-tuning. Experiments show that CoM surpasses competitive baselines, demonstrating the potential of training-free prompting methods as robust solutions for complex reasoning tasks and bridging the gap toward human-level reasoning through human-like methodological insights.

[259] Cultural Bias Matters: A Cross-Cultural Benchmark Dataset and Sentiment-Enriched Model for Understanding Multimodal Metaphors

Senqi Yang,Dongyu Zhang,Jing Ren,Ziqi Xu,Xiuzhen Zhang,Yiliao Song,Hongfei Lin,Feng Xia

Main category: cs.CL

TL;DR: The paper introduces the MultiMM dataset and the SEMD model for studying cross-cultural multimodal metaphors, aiming to address cultural bias in NLP.

Details

Motivation: Existing metaphor processing research relies mostly on English data and carries a Western cultural bias, which distorts model performance evaluation and NLP progress. Method: Builds the MultiMM dataset (8,461 Chinese and English text-image advertisement pairs) and develops the SEMD model, which integrates sentiment embeddings to improve cross-cultural metaphor comprehension. Result: Experiments show that SEMD is effective on metaphor detection and sentiment analysis tasks. Conclusion: The study raises awareness of cultural bias in NLP and supports the development of fairer, more inclusive language models. Abstract: Metaphors are pervasive in communication, making them crucial for natural language processing (NLP). Previous research on automatic metaphor processing predominantly relies on training data consisting of English samples, which often reflect Western European or North American biases. This cultural skew can lead to an overestimation of model performance and contributions to NLP progress. However, the impact of cultural bias on metaphor processing, particularly in multimodal contexts, remains largely unexplored. To address this gap, we introduce MultiMM, a Multicultural Multimodal Metaphor dataset designed for cross-cultural studies of metaphor in Chinese and English. MultiMM consists of 8,461 text-image advertisement pairs, each accompanied by fine-grained annotations, providing a deeper understanding of multimodal metaphors beyond a single cultural domain. Additionally, we propose Sentiment-Enriched Metaphor Detection (SEMD), a baseline model that integrates sentiment embeddings to enhance metaphor comprehension across cultural backgrounds. Experimental results validate the effectiveness of SEMD on metaphor detection and sentiment analysis tasks. We hope this work increases awareness of cultural bias in NLP research and contributes to the development of fairer and more inclusive language models. Our dataset and code are available at https://github.com/DUTIR-YSQ/MultiMM.

[260] What makes Reasoning Models Different? Follow the Reasoning Leader for Efficient Decoding

Ming Li,Zhengyuan Yang,Xiyao Wang,Dianqi Li,Kevin Lin,Tianyi Zhou,Lijuan Wang

Main category: cs.CL

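TL;DR: The paper proposes FoReaL-Decoding, a collaborative fast-slow thinking decoding method that improves the efficiency of large reasoning models, cutting compute cost while largely preserving performance.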

Details

Motivation: Large reasoning models (LRMs) produce verbose reasoning and tend to overthink, which makes them inefficient. The study finds two key token-level phenomena that distinguish them from non-reasoning models: Global Misalignment Rebound and Local Misalignment Diminish. Method: Proposes FoReaL-Decoding, in which a Leading model generates the first few tokens of each sentence, a draft model completes the rest, and a stochastic gate smoothly interpolates between the two models. Result: On four math-reasoning benchmarks, FoReaL-Decoding reduces theoretical FLOPs by 30 to 50% and trims chain-of-thought length by up to 40%, while preserving 86 to 100% of performance. Conclusion: FoReaL-Decoding is a simple, plug-and-play method for balancing cost and quality in reasoning tasks. Abstract: Large reasoning models (LRMs) achieve strong reasoning performance by emitting long chains of thought. Yet, these verbose traces slow down inference and often drift into unnecessary detail, known as the overthinking phenomenon. To better understand LRMs' behavior, we systematically analyze the token-level misalignment between reasoning and non-reasoning models. While it is expected that their primary difference lies in the stylistic "thinking cues", LRMs uniquely exhibit two pivotal, previously under-explored phenomena: a Global Misalignment Rebound, where their divergence from non-reasoning models persists or even grows as response length increases, and more critically, a Local Misalignment Diminish, where the misalignment concentrates at the "thinking cues" each sentence starts with but rapidly declines in the remainder of the sentence. Motivated by the Local Misalignment Diminish, we propose FoReaL-Decoding, a collaborative fast-slow thinking decoding method for cost-quality trade-off. In FoReaL-Decoding, a Leading model leads the first few tokens for each sentence, and then a weaker draft model completes the following tokens to the end of each sentence. FoReaL-Decoding adopts a stochastic gate to smoothly interpolate between the small and the large model. On four popular math-reasoning benchmarks (AIME24, GPQA-Diamond, MATH500, AMC23), FoReaL-Decoding reduces theoretical FLOPs by 30 to 50% and trims CoT length by up to 40%, while preserving 86 to 100% of model performance. These results establish FoReaL-Decoding as a simple, plug-and-play route to controllable cost-quality trade-offs in reasoning-centric tasks.
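
The leader/draft hand-off described in this abstract can be sketched as a toy decoding loop. Everything below is illustrative: the models are stand-in callables, `lead_count` and `lead_prob` are hypothetical parameter names, and the gate is assumed to be a per-sentence coin flip, which may differ from the paper's actual gating rule.

```python
import random
from typing import Callable, List, Tuple

# Toy stand-in for a model call: maps the context so far to the next token.
NextToken = Callable[[List[str]], str]


def foreal_decode_sketch(
    leader: NextToken,
    drafter: NextToken,
    lead_count: int,
    lead_prob: float,
    max_tokens: int,
    rng: random.Random,
    sentence_end: str = ".",
) -> Tuple[List[str], List[str]]:
    """Return generated tokens and, per token, which model produced it."""
    tokens: List[str] = []
    sources: List[str] = []
    position = 0           # index of the next token within the current sentence
    leader_active = False
    while len(tokens) < max_tokens:
        if position == 0:
            # Assumed stochastic gate: decide once per sentence whether the
            # leader writes the opening tokens.
            leader_active = rng.random() < lead_prob
        use_leader = leader_active and position < lead_count
        token = (leader if use_leader else drafter)(tokens)
        tokens.append(token)
        sources.append("L" if use_leader else "D")
        position = 0 if token == sentence_end else position + 1
    return tokens, sources


def toy_leader(context: List[str]) -> str:
    return "wait"  # stands in for a "thinking cue" token


def toy_drafter(context: List[str]) -> str:
    return "." if len(context) % 4 == 3 else "x"
```

With `lead_prob=1.0` the leader opens every sentence, and with `lead_prob=0.0` the drafter generates everything, so the gate probability interpolates between the two extremes.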

[261] Adversarial Paraphrasing: A Universal Attack for Humanizing AI-Generated Text

Yize Cheng,Vinu Sankar Sadasivan,Mehrdad Saberi,Shoumik Saha,Soheil Feizi

Main category: cs.CL

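TL;DR: The paper proposes Adversarial Paraphrasing, a training-free attack framework that uses an off-the-shelf instruction-following LLM to paraphrase AI-generated content under the guidance of an AI text detector so that it evades detection more effectively. Experiments show the attack is effective and transferable across multiple detection systems.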

Details

Motivation: As large language models (LLMs) grow more capable, the risk of their misuse for AI-generated plagiarism and social engineering increases. Many AI-generated text detectors exist, but a number of them are still easily evaded by simple paraphrasing. The paper proposes a more effective attack to expose the fragility of existing detectors. Method: Proposes the Adversarial Paraphrasing framework, which uses an off-the-shelf instruction-following LLM to paraphrase AI-generated content under the guidance of an AI text detector, producing adversarial examples that bypass detection. Result: The attack sharply lowers detection rates across multiple systems (T@1%F drops by 64.49% on RADAR and 98.96% on Fast-DetectGPT), with an average reduction of 87.88%, and text quality degrades only slightly. Conclusion: The success of Adversarial Paraphrasing highlights the fragility of current detection strategies against increasingly sophisticated evasion techniques and calls for more robust and resilient detection methods. Abstract: The increasing capabilities of Large Language Models (LLMs) have raised concerns about their misuse in AI-generated plagiarism and social engineering. While various AI-generated text detectors have been proposed to mitigate these risks, many remain vulnerable to simple evasion techniques such as paraphrasing. However, recent detectors have shown greater robustness against such basic attacks. In this work, we introduce Adversarial Paraphrasing, a training-free attack framework that universally humanizes any AI-generated text to evade detection more effectively. Our approach leverages an off-the-shelf instruction-following LLM to paraphrase AI-generated content under the guidance of an AI text detector, producing adversarial examples that are specifically optimized to bypass detection. Extensive experiments show that our attack is both broadly effective and highly transferable across several detection systems. For instance, compared to a simple paraphrasing attack--which, ironically, increases the true positive rate at 1% false positive rate (T@1%F) by 8.57% on RADAR and 15.03% on Fast-DetectGPT--adversarial paraphrasing, guided by OpenAI-RoBERTa-Large, reduces T@1%F by 64.49% on RADAR and a striking 98.96% on Fast-DetectGPT. Across a diverse set of detectors--including neural network-based, watermark-based, and zero-shot approaches--our attack achieves an average T@1%F reduction of 87.88% under the guidance of OpenAI-RoBERTa-Large.
We also analyze the tradeoff between text quality and attack success to find that our method can significantly reduce detection rates, with mostly a slight degradation in text quality. Our adversarial setup highlights the need for more robust and resilient detection strategies in the light of increasingly sophisticated evasion techniques.
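
The T@1%F metric used above (true positive rate at a 1% false positive rate) can be computed from detector scores roughly as follows. This is a generic sketch, not the paper's evaluation code: the function name `tpr_at_fpr` is hypothetical, and it assumes that higher scores mean "more likely AI-generated".

```python
import math
from typing import Sequence


def tpr_at_fpr(
    human_scores: Sequence[float],
    ai_scores: Sequence[float],
    max_fpr: float = 0.01,
) -> float:
    """Fraction of AI texts flagged when at most max_fpr of human texts are."""
    if not human_scores or not ai_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(human_scores, reverse=True)
    # Number of human texts we are allowed to flag by mistake.
    allowed = math.floor(max_fpr * len(ranked) + 1e-9)
    # Flag only scores strictly above the threshold, so at most `allowed`
    # human texts exceed it even when scores are tied.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    flagged = sum(1 for score in ai_scores if score > threshold)
    return flagged / len(ai_scores)
```

An attack that lowers the detector scores of AI-written text reduces the count of flagged AI texts at this fixed threshold, which is what a drop in T@1%F measures.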

[262] A Culturally-diverse Multilingual Multimodal Video Benchmark & Model

Bhuiyan Sanjid Shafique,Ashmal Vayani,Muhammad Maaz,Hanoona Abdul Rasheed,Dinura Dissanayake,Mohammed Irfan Kurpath,Yahya Hmaiti,Go Inoue,Jean Lahoud,Md. Safirur Rashid,Shadid Intisar Quasem,Maheen Fatima,Franco Vidal,Mykola Maslych,Ketan Pravin More,Sanoojan Baliah,Hasindri Watawana,Yuhao Li,Fabian Farestam,Leon Schaller,Roman Tymtsiv,Simon Weber,Hisham Cholakkal,Ivan Laptev,Shin'ichi Satoh,Michael Felsberg,Mubarak Shah,Salman Khan,Fahad Shahbaz Khan

Main category: cs.CL

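TL;DR: The paper introduces ViMUL-Bench, a multilingual video LMM benchmark for evaluating video understanding across 14 languages, and develops ViMUL, a simple multilingual video LMM, to advance culturally and linguistically inclusive research.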

Details

Motivation: Existing large multimodal models (LMMs) focus mainly on English, and multilingual, culturally diverse video understanding is largely unstudied, so more inclusive video LMMs are needed. Method: Builds the ViMUL-Bench benchmark (8k manually verified samples) and a machine-translated multilingual video training set of 1.2 million samples, and develops the ViMUL model to balance video understanding across high- and low-resource languages. Result: ViMUL achieves a better tradeoff between high- and low-resource languages, and ViMUL-Bench provides an evaluation standard for multilingual video LMMs. Conclusion: ViMUL-Bench and ViMUL are important tools for developing culturally and linguistically inclusive multilingual video LMMs; the data and model will be publicly released. Abstract: Large multimodal models (LMMs) have recently gained attention due to their effectiveness in understanding and generating descriptions of visual content. Most existing LMMs are in the English language. While a few recent works explore multilingual image LMMs, to the best of our knowledge, moving beyond the English language for cultural and linguistic inclusivity is yet to be investigated in the context of video LMMs. In pursuit of more inclusive video LMMs, we introduce a multilingual Video LMM benchmark, named ViMUL-Bench, to evaluate Video LMMs across 14 languages, including both low- and high-resource languages: English, Chinese, Spanish, French, German, Hindi, Arabic, Russian, Bengali, Urdu, Sinhala, Tamil, Swedish, and Japanese. Our ViMUL-Bench is designed to rigorously test video LMMs across 15 categories including eight culturally diverse categories, ranging from lifestyles and festivals to foods and rituals and from local landmarks to prominent cultural personalities. ViMUL-Bench comprises both open-ended (short and long-form) and multiple-choice questions spanning various video durations (short, medium, and long) with 8k samples that are manually verified by native language speakers. In addition, we also introduce a machine translated multilingual video training set comprising 1.2 million samples and develop a simple multilingual video LMM, named ViMUL, that is shown to provide a better tradeoff between high- and low-resource languages for video understanding.
We hope our ViMUL-Bench and multilingual video LMM, along with a large-scale multilingual video training set, will help ease future research in developing culturally and linguistically inclusive multilingual video LMMs. Our proposed benchmark, video LMM and training data will be publicly released at https://mbzuai-oryx.github.io/ViMUL/.

[263] KG2QA: Knowledge Graph-enhanced Retrieval-Augmented Generation for Communication Standards Question Answering

Zhongze Luo,Weixuan Wan,Qizhi Zheng,Yanhong Bai,Jingyun Sun,Jian Wang,Dan Wang

Main category: cs.CL

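TL;DR: The paper combines large language model fine-tuning with knowledge graph construction to build an intelligent consultation and question-answering system for communication standards, markedly improving answer quality.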

Details

Motivation: The traditional consulting model has a long cycle and depends on expert experience, making it hard to keep up with fast-moving technology. Method: Fine-tunes Qwen2.5-7B-Instruct with LoRA, builds a knowledge graph of 13,906 entities and 13,524 relations, and combines them in a RAG framework with graph retrieval. Result: BLEU-4 rises from 18.8564 to 66.8993, ROUGE and other metrics improve significantly, and results exceed those of the comparison model Llama-3-8B-Instruct. Conclusion: The system performs well in interaction experience and back-end access and has high practical value. Abstract: There are many types of standards in the field of communication. The traditional consulting model has a long cycle and relies on the knowledge and experience of experts, making it difficult to meet the rapidly developing technological demands. This paper combines the fine-tuning of large language models with the construction of knowledge graphs to implement an intelligent consultation and question-answering system for communication standards. The experimental results show that after LoRA tuning on the constructed dataset of 6,587 questions and answers in the field of communication standards, Qwen2.5-7B-Instruct demonstrates outstanding professional capabilities in the field of communication standards on the test set. BLEU-4 rose from 18.8564 to 66.8993, and evaluation indicators such as ROUGE also increased significantly, outperforming the fine-tuning effect of the comparison model Llama-3-8B-Instruct. Based on the ontology framework containing 6 entity attributes and 10 relation attributes, a knowledge graph of the communication standard domain containing 13,906 entities and 13,524 relations was constructed, showing a relatively good query accuracy rate. The intelligent consultation and question-answering system enables the fine-tuned model on the server side to access the locally constructed knowledge graph and conduct graphical retrieval of key information first, which is conducive to improving the question-answering effect. The evaluation using DeepSeek as the Judge on the test set shows that our RAG framework enables the fine-tuned model to improve the scores at all five angles, with an average score increase of 2.26%.
Combined with web services and API interfaces, the system achieves very good results in interaction experience and back-end access, and has strong practical application value.

[264] Reasoning with RAGged events: RAG-Enhanced Event Knowledge Base Construction and reasoning with proof-assistants

Stergios Chatzikyriakidis

Main category: cs.CL

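TL;DR: The paper proposes an automatic method for extracting historical events that combines several LLMs with enhancement strategies, and validates the resulting semantic structures in Coq.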

Details

Motivation: Manually constructing computational representations of historical events is costly, and existing RDF/OWL reasoners cannot support deeper temporal and semantic analysis. Method: Uses three LLMs (GPT-4, Claude, and Llama 3.2) with three strategies (pure base generation, knowledge graph enhancement, and RAG) to extract historical events automatically. Result: The enhancement strategies optimize different performance dimensions: base generation is best for coverage, while RAG is better for precision. Model architecture affects how much enhancement helps. Conclusion: The Coq formalization validates that RAG-discovered event types have legitimate domain-specific semantic structure and support higher-order reasoning. Abstract: Extracting structured computational representations of historical events from narrative text remains computationally expensive when constructed manually. While RDF/OWL reasoners enable graph-based reasoning, they are limited to fragments of first-order logic, preventing deeper temporal and semantic analysis. This paper addresses both challenges by developing automatic historical event extraction models using multiple LLMs (GPT-4, Claude, Llama 3.2) with three enhancement strategies: pure base generation, knowledge graph enhancement, and Retrieval-Augmented Generation (RAG). We conducted comprehensive evaluations using historical texts from Thucydides. Our findings reveal that enhancement strategies optimize different performance dimensions rather than providing universal improvements. For coverage and historical breadth, base generation achieves optimal performance with Claude and GPT-4 extracting comprehensive events. However, for precision, RAG enhancement improves coordinate accuracy and metadata completeness. Model architecture fundamentally determines enhancement sensitivity: larger models demonstrate robust baseline performance with incremental RAG improvements, while Llama 3.2 shows extreme variance from competitive performance to complete failure. We then developed an automated translation pipeline converting extracted RDF representations into Coq proof assistant specifications, enabling higher-order reasoning beyond RDF capabilities including multi-step causal verification, temporal arithmetic with BC dates, and formal proofs about historical causation. The Coq formalization validates that RAG-discovered event types represent legitimate domain-specific semantic structures rather than ontological violations.

[265] Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

LASA Team,Weiwen Xu,Hou Pong Chan,Long Li,Mahani Aljunied,Ruifeng Yuan,Jianyu Wang,Chenghao Xiao,Guizhen Chen,Chaoqun Liu,Zhaodonghui Li,Yu Sun,Junao Shen,Chaojun Wang,Jie Tan,Deli Zhao,Tingyang Xu,Hao Zhang,Yu Rong

Main category: cs.CL

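TL;DR: The paper presents Lingshu, a medical-specialized multimodal large language model that uses improved data curation and training strategies to address the limitations of existing medical MLLMs in knowledge coverage, hallucination, and reasoning ability.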

Details

Motivation: MLLMs perform well in the general domain but remain limited in medical applications because of data and task discrepancies; the challenges are insufficient knowledge coverage, hallucinations, and weak reasoning. Method: Proposes a comprehensive data curation procedure that integrates medical imaging, medical text, and general-domain data into a multimodal dataset; develops the Lingshu model with multi-stage training and reinforcement learning to strengthen reasoning; and designs the MedEvalKit evaluation framework. Result: Lingshu outperforms existing open-source multimodal models on most medical tasks, including multimodal QA, text-based QA, and report generation. Conclusion: Through data optimization and improved training strategies, Lingshu markedly improves the performance of medical MLLMs and offers a new direction for medical AI applications. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in understanding common visual elements, largely due to their large-scale datasets and advanced training strategies. However, their effectiveness in medical applications remains limited due to the inherent discrepancies between data and tasks in medical scenarios and those in the general domain. Concretely, existing medical MLLMs face the following critical limitations: (1) limited coverage of medical knowledge beyond imaging, (2) heightened susceptibility to hallucinations due to suboptimal data curation processes, (3) lack of reasoning capabilities tailored for complex medical scenarios. To address these challenges, we first propose a comprehensive data curation procedure that (1) efficiently acquires rich medical knowledge data not only from medical imaging but also from extensive medical texts and general-domain data; and (2) synthesizes accurate medical captions, visual question answering (VQA), and reasoning samples. As a result, we build a multimodal dataset enriched with extensive medical knowledge. Building on the curated data, we introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities progressively. Besides, we preliminarily explore the potential of applying the reinforcement learning with verifiable rewards paradigm to enhance Lingshu's medical reasoning ability. Additionally, we develop MedEvalKit, a unified evaluation framework that consolidates leading multimodal and textual medical benchmarks for standardized, fair, and efficient model assessment.
We evaluate the performance of Lingshu on three fundamental medical tasks, multimodal QA, text-based QA, and medical report generation. The results show that Lingshu consistently outperforms the existing open-source multimodal models on most tasks ...

[266] Com$^2$: A Causal-Guided Benchmark for Exploring Complex Commonsense Reasoning in Large Language Models

Kai Xiong,Xiao Ding,Yixin Cao,Yuxiong Yan,Li Du,Yufei Zhang,Jinglong Gao,Jiaqian Liu,Bing Qin,Ting Liu

Main category: cs.CL

TL;DR: The paper proposes Com$^2$, a benchmark for evaluating large language models (LLMs) on complex commonsense reasoning, filling a gap in existing research.

Details

Motivation: LLMs do well on simple commonsense reasoning but poorly on complex, implicit commonsense reasoning (such as the long-term effects of events), which humans care about more. Existing work concentrates on complex tasks such as math and code, while complex commonsense reasoning remains underexplored because of its uncertainty and lack of structure. Method: Uses causal event graphs as structured complex commonsense, applies causal theory (e.g., intervention) to modify the graphs and obtain scenarios that match human concerns, and employs an LLM with slow thinking to synthesize examples. A more challenging subset is built from detective stories. Result: Experiments show that LLMs struggle with reasoning depth and breadth, though post-training and slow thinking can alleviate this. Conclusion: The Com$^2$ benchmark opens a new research direction for complex commonsense reasoning and shows both the limits of LLMs in this area and their room for improvement. Abstract: Large language models (LLMs) have mastered abundant simple and explicit commonsense knowledge through pre-training, enabling them to achieve human-like performance in simple commonsense reasoning. Nevertheless, LLMs struggle to reason with complex and implicit commonsense knowledge that is derived from simple ones (such as understanding the long-term effects of certain events), an aspect humans tend to focus on more. Existing works focus on complex tasks like math and code, while complex commonsense reasoning remains underexplored due to its uncertainty and lack of structure. To fill this gap and align with real-world concerns, we propose a benchmark Com$^2$ focusing on complex commonsense reasoning. We first incorporate causal event graphs to serve as structured complex commonsense. Then we adopt causal theory (e.g., intervention) to modify the causal event graphs and obtain different scenarios that meet human concerns. Finally, an LLM is employed to synthesize examples with slow thinking, which is guided by the logical relationships in the modified causal graphs. Furthermore, we use detective stories to construct a more challenging subset. Experiments show that LLMs struggle in reasoning depth and breadth, while post-training and slow thinking can alleviate this. The code and data are available at https://github.com/Waste-Wood/Com2.

[267] Representation Decomposition for Learning Similarity and Contrastness Across Modalities for Affective Computing

Yuanhe Tian,Pengsen Cheng,Guoqing Jin,Lei Zhang,Yan Song

Main category: cs.CL

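TL;DR: Proposes an LLM-based multi-modal affective computing method that decomposes representations into shared and modality-specific components, significantly improving performance on affective analysis tasks.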

Details

Motivation: Existing methods fail to handle complex and conflicting information across modalities, which limits the accuracy of affective computing. Method: Aligns inputs with pre-trained multi-modal encoders, decomposes them into shared and modality-specific information, and integrates these via an attention mechanism into a dynamic soft prompt for the LLM. Result: Outperforms strong baselines and state-of-the-art models on three representative tasks. Conclusion: The method effectively addresses information fusion in multi-modal affective computing and has broad application potential. Abstract: Multi-modal affective computing aims to automatically recognize and interpret human attitudes from diverse data sources such as images and text, thereby enhancing human-computer interaction and emotion understanding. Existing approaches typically rely on unimodal analysis or straightforward fusion of cross-modal information that fail to capture complex and conflicting evidence presented across different modalities. In this paper, we propose a novel LLM-based approach for affective computing that explicitly deconstructs visual and textual representations into shared (modality-invariant) and modality-specific components. Specifically, our approach firstly encodes and aligns input modalities using pre-trained multi-modal encoders, then employs a representation decomposition framework to separate common emotional content from unique cues, and finally integrates these decomposed signals via an attention mechanism to form a dynamic soft prompt for a multi-modal LLM. Extensive experiments on three representative tasks for affective computing, namely, multi-modal aspect-based sentiment analysis, multi-modal emotion analysis, and hateful meme detection, demonstrate the effectiveness of our approach, which consistently outperforms strong baselines and state-of-the-art models.

[268] How Far Are We from Optimal Reasoning Efficiency?

Jiaxuan Gao,Shu Yan,Qixin Tan,Lu Yang,Shusheng Xu,Wei Fu,Zhiyu Mei,Kaifeng Lyu,Yi Wu

Main category: cs.CL

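TL;DR: Proposes REG, a new metric for measuring the efficiency of large reasoning models (LRMs), and REO-RL, an algorithm that significantly improves reasoning efficiency.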

Details

Motivation: Existing large reasoning models produce redundant content during reasoning, which drives up inference cost, and there is no unified standard for evaluating efficiency. Method: Introduces reasoning efficiency frontiers and the REG metric, and proposes REO-RL, a reinforcement learning algorithm that optimizes reasoning efficiency. Result: REO-RL substantially shortens reasoning while preserving accuracy, reduces REG by more than 50%, and approaches the Qwen3-4B/8B efficiency frontiers under a 16K token budget. Conclusion: REG and REO-RL can effectively improve the reasoning efficiency of LRMs, but fully aligning with the efficiency frontiers remains challenging. Abstract: Large Reasoning Models (LRMs) demonstrate remarkable problem-solving capabilities through extended Chain-of-Thought (CoT) reasoning but often produce excessively verbose and redundant reasoning traces. This inefficiency incurs high inference costs and limits practical deployment. While existing fine-tuning methods aim to improve reasoning efficiency, assessing their efficiency gains remains challenging due to inconsistent evaluations. In this work, we introduce the reasoning efficiency frontiers, empirical upper bounds derived from fine-tuning base LRMs across diverse approaches and training configurations. Based on these frontiers, we propose the Reasoning Efficiency Gap (REG), a unified metric quantifying deviations of any fine-tuned LRMs from these frontiers. Systematic evaluation on challenging mathematical benchmarks reveals significant gaps in current methods: they either sacrifice accuracy for short length or still remain inefficient under tight token budgets. To reduce the efficiency gap, we propose REO-RL, a class of Reinforcement Learning algorithms that minimizes REG by targeting a sparse set of token budgets. Leveraging numerical integration over strategically selected budgets, REO-RL approximates the full efficiency objective with low error using a small set of token budgets. Through systematic benchmarking, we demonstrate that our efficiency metric, REG, effectively captures the accuracy-length trade-off, with low-REG methods reducing length while maintaining accuracy. Our approach, REO-RL, consistently reduces REG by >=50% across all evaluated LRMs and matches Qwen3-4B/8B efficiency frontiers under a 16K token budget with minimal accuracy loss. Ablation studies confirm the effectiveness of our exponential token budget strategy. Finally, our findings highlight that fine-tuning LRMs to perfectly align with the efficiency frontiers remains an open challenge.
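
The abstract describes REG as a deviation from an efficiency frontier and mentions numerical integration over a sparse, exponentially spaced set of token budgets. The sketch below illustrates that idea only; the exact REG definition is not given in the abstract, so the area-between-curves formula, the function names, and all accuracy values here are assumptions for illustration.

```python
def trapezoid_area(xs, ys):
    """Trapezoid-rule integral of y over x for sorted sample points."""
    return sum(
        (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2.0
        for i in range(len(xs) - 1)
    )


def efficiency_gap(budgets, frontier_acc, model_acc):
    """Assumed gap: area between frontier and model accuracy over the budgets.

    Negative differences are clipped to zero, since the frontier is meant
    to be an upper bound on accuracy at each budget.
    """
    gaps = [max(f - m, 0.0) for f, m in zip(frontier_acc, model_acc)]
    return trapezoid_area(budgets, gaps)


# Hypothetical, exponentially spaced token budgets: 512, 1024, ..., 16384.
budgets = [2 ** k for k in range(9, 15)]
# Made-up accuracy curves, for illustration only.
frontier = [0.30, 0.45, 0.60, 0.70, 0.75, 0.78]
model = [0.10, 0.25, 0.45, 0.62, 0.72, 0.78]

print(efficiency_gap(budgets, frontier, model))
```

Sampling only a handful of budgets keeps the number of evaluations small, at the price of integration error between sample points.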

[269] Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models

Samir Abdaljalil,Hasan Kurban,Khalid Qaraqe,Erchin Serpedin

Main category: cs.CL

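TL;DR: Proposes Theorem-of-Thought (ToTh), a framework in which parallel agents simulating three reasoning modes (abductive, deductive, and inductive) produce structured reasoning graphs whose consistency is evaluated with Bayesian belief propagation, significantly improving the reasoning performance and interpretability of large language models (LLMs).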

Details

Motivation: Although large language models perform well on natural language reasoning tasks, their reasoning processes are brittle and hard to interpret. Existing prompting techniques such as Chain-of-Thought neither enforce logical structure nor assess internal coherence. Method: ToTh runs three parallel agents (simulating abductive, deductive, and inductive reasoning) that each produce a reasoning trace, which is structured into a formal reasoning graph. Consistency is evaluated with Bayesian belief propagation guided by natural language inference (NLI), and the most coherent graph is used to derive the final answer. Result: On symbolic (WebOfLies) and numerical (MultiArith) reasoning benchmarks, ToTh outperforms Chain-of-Thought, Self-Consistency, and CoT-Decoding across multiple LLMs while producing interpretable, logically grounded reasoning chains. Conclusion: ToTh offers a promising path toward more robust, cognitively inspired LLM reasoning; the implementation is open source. Abstract: Large language models (LLMs) have shown strong performance across natural language reasoning tasks, yet their reasoning processes remain brittle and difficult to interpret. Prompting techniques like Chain-of-Thought (CoT) enhance reliability by eliciting intermediate reasoning steps or aggregating multiple outputs. However, they lack mechanisms for enforcing logical structure and assessing internal coherence. We introduce Theorem-of-Thought (ToTh), a novel framework that models reasoning as collaboration among three parallel agents, each simulating a distinct mode of inference: abductive, deductive, and inductive. Each agent produces a reasoning trace, which is structured into a formal reasoning graph. To evaluate consistency, we apply Bayesian belief propagation guided by natural language inference (NLI), assigning confidence scores to each step. The most coherent graph is selected to derive the final answer. Experiments on symbolic (WebOfLies) and numerical (MultiArith) reasoning benchmarks show that ToTh consistently outperforms CoT, Self-Consistency, and CoT-Decoding across multiple LLMs, while producing interpretable and logically grounded reasoning chains. Our findings suggest a promising direction for building more robust and cognitively inspired LLM reasoning. The implementation is available at https://github.com/KurbanIntelligenceLab/theorem-of-thought.

[270] Prompting Science Report 2: The Decreasing Value of Chain of Thought in Prompting

Lennart Meincke,Ethan Mollick,Lilach Mollick,Dan Shapiro

Main category: cs.CL

TL;DR: Examines the effectiveness of Chain-of-Thought (CoT) prompting and finds that its benefit varies by task and model type and that it can increase time and cost.

Details

Motivation: To help business, education, and policy leaders understand the technical details of working with AI through rigorous testing. Method: Studies how CoT prompting performs across tasks and models, and analyzes its effect on performance, time, and cost. Result: For non-reasoning models, CoT yields a small improvement but can introduce errors; for reasoning models, gains are marginal while time and cost increase significantly. Conclusion: The effect of CoT prompting depends on the model and the task, so its potential benefit must be weighed against the extra cost. Abstract: This is the second in a series of short reports that seek to help business, education, and policy leaders understand the technical details of working with AI through rigorous testing. In this report, we investigate Chain-of-Thought (CoT) prompting, a technique that encourages a large language model (LLM) to "think step by step" (Wei et al., 2022). CoT is a widely adopted method for improving reasoning tasks; however, our findings reveal a more nuanced picture of its effectiveness. We demonstrate two things: (1) The effectiveness of Chain-of-Thought prompting can vary greatly depending on the type of task and model. For non-reasoning models, CoT generally improves average performance by a small amount, particularly if the model does not inherently engage in step-by-step processing by default. However, CoT can introduce more variability in answers, sometimes triggering occasional errors in questions the model would otherwise get right. We also found that many recent models perform some form of CoT reasoning even if not asked; for these models, a request to perform CoT had little impact. Performing CoT generally requires far more tokens (increasing cost and time) than direct answers. (2) For models designed with explicit reasoning capabilities, CoT prompting often results in only marginal, if any, gains in answer accuracy. However, it significantly increases the time and tokens needed to generate a response.

[271] Semantic-preserved Augmentation with Confidence-weighted Fine-tuning for Aspect Category Sentiment Analysis

Yaping Chai,Haoran Xie,Joe S. Qin

Main category: cs.CL

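TL;DR: Proposes a large language model (LLM)-based data augmentation strategy for data scarcity in low-resource scenarios, using a structured prompt template and semantic-consistency post-processing to improve model performance.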

Details

Motivation: To address data scarcity in low-resource scenarios and improve performance on the aspect category sentiment analysis (ACSA) task. Method: Designs a structured prompt template to guide LLM data generation, applies post-processing to ensure semantic consistency, and uses a confidence-weighted fine-tuning strategy to improve predictions. Result: Outperforms existing methods on four benchmark datasets. Conclusion: The method increases the semantic coverage of the data and strengthens model inference, offering a practical solution for low-resource scenarios. Abstract: Large language models (LLMs) are an effective approach to addressing data scarcity in low-resource scenarios. Recent research designs hand-crafted prompts to guide LLMs for data augmentation. We introduce a data augmentation strategy for the aspect category sentiment analysis (ACSA) task that preserves the original sentence semantics and offers linguistic diversity, specifically by providing a structured prompt template for an LLM to generate predefined content. In addition, we employ a post-processing technique to further ensure semantic consistency between the generated sentence and the original sentence. The augmented data increases the semantic coverage of the training distribution, enabling the model to better understand the relationship between aspect categories and sentiment polarities, enhancing its inference capabilities. Furthermore, we propose a confidence-weighted fine-tuning strategy to encourage the model to generate more confident and accurate sentiment polarity predictions. Compared with strong recent work, our method consistently achieves the best performance over all baselines on four benchmark datasets.

[272] Syntactic Control of Language Models by Posterior Inference

Vicky Xefteri,Tim Vieira,Ryan Cotterell,Afra Amini

Main category: cs.CL

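TL;DR: A sampling algorithm based on posterior inference, combining sequential Monte Carlo with a syntactic tagger, effectively controls the syntactic structure of generated text and significantly improves syntactic accuracy.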

Details

Motivation: Controlling the syntactic structure of text generated by language models is valuable for applications that require clarity, stylistic consistency, or interpretability, but it remains challenging for existing methods. Method: Combines sequential Monte Carlo (to estimate the posterior distribution) with a syntactic tagger that keeps generated tokens aligned with the target syntactic structure. Result: Experiments show that the method substantially improves the syntactic accuracy of GPT2 and Llama3-8B (F1 rises from 12.31 and 35.33 to about 93) while preserving fluency. Conclusion: The method offers an effective solution for applications that need precise syntactic control and highlights the potential of sampling algorithms. Abstract: Controlling the syntactic structure of text generated by language models is valuable for applications requiring clarity, stylistic consistency, or interpretability, yet it remains a challenging task. In this paper, we argue that sampling algorithms based on posterior inference can effectively enforce a target constituency structure during generation. Our approach combines sequential Monte Carlo, which estimates the posterior distribution by sampling from a proposal distribution, with a syntactic tagger that ensures that each generated token aligns with the desired syntactic structure. Our experiments with GPT2 and Llama3-8B models show that with an appropriate proposal distribution, we can improve syntactic accuracy, increasing the F1 score from $12.31$ (GPT2-large) and $35.33$ (Llama3-8B) to about $93$ in both cases without compromising the language model's fluency. These results underscore both the complexity of syntactic control and the effectiveness of sampling algorithms, offering a promising approach for applications where precise control over syntax is essential.

[273] GeometryZero: Improving Geometry Solving for LLM with Group Contrastive Policy Optimization

Yikun Wang,Yibin Wang,Dianyi Wang,Zimian Peng,Qipeng Guo,Dacheng Tao,Jiaqi Wang

Main category: cs.CL

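TL;DR: Proposes GCPO, a new reinforcement learning framework used to train GeometryZero, a family of small geometric reasoning models, with adaptive reward signals and a length reward that significantly improve performance.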

Details

Motivation: Existing approaches to geometry problem solving either perform poorly or are computationally expensive, so a more efficient way to combine auxiliary construction with geometric reasoning is needed. Method: Proposes the GCPO framework, comprising Group Contrastive Masking and a length reward, to optimize decisions about auxiliary construction. Result: GeometryZero models improve by 4.29% on average across benchmarks and outperform baseline methods. Conclusion: GCPO provides an efficient and affordable solution for geometric reasoning with a clear performance gain. Abstract: Recent advances in large language models (LLMs) have demonstrated remarkable capabilities across diverse domains, particularly in mathematical reasoning, amid which geometry problem solving remains a challenging area where auxiliary construction plays an essential role. Existing approaches either achieve suboptimal performance or rely on massive LLMs (e.g., GPT-4o), incurring massive computational costs. We posit that reinforcement learning with verifiable reward (e.g., GRPO) offers a promising direction for training smaller models that effectively combine auxiliary construction with robust geometric reasoning. However, directly applying GRPO to geometric reasoning presents fundamental limitations due to its dependence on unconditional rewards, which leads to indiscriminate and counterproductive auxiliary constructions. To address these challenges, we propose Group Contrastive Policy Optimization (GCPO), a novel reinforcement learning framework featuring two key innovations: (1) Group Contrastive Masking, which adaptively provides positive or negative reward signals for auxiliary construction based on contextual utility, and (2) a length reward that promotes longer reasoning chains. Building on GCPO, we develop GeometryZero, a family of affordable-size geometric reasoning models that judiciously determine when to employ auxiliary construction. Our extensive empirical evaluation across popular geometric benchmarks (Geometry3K, MathVista) demonstrates that GeometryZero models consistently outperform baselines (e.g., GRPO), achieving an average improvement of 4.29% across all benchmarks.

[274] CTDGSI: A comprehensive exploitation of instance selection methods for automatic text classification. VII Concurso de Teses, Dissertações e Trabalhos de Graduação em SI -- XXI Simpósio Brasileiro de Sistemas de Informação

Washington Cunha,Leonardo Rocha,Marcos André Gonçalves

Main category: cs.CL

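TL;DR: This Ph.D. dissertation studies instance selection (IS) techniques in natural language processing, which aim to cut training cost by removing noisy and redundant training instances while preserving model effectiveness.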

Details

Motivation: NLP currently depends on large amounts of data and compute, and training large dense models is expensive, while instance selection has great potential but is under-studied. Method: Comprehensively compares IS methods on the automatic text classification (ATC) task and proposes two new IS solutions designed for large datasets and transformer architectures. Result: The final solution reduces training sets by 41% on average while maintaining effectiveness, with training speedups of 1.67x (up to 2.46x). Conclusion: Instance selection has considerable potential in NLP and can substantially lower training cost and improve efficiency. Abstract: Progress in Natural Language Processing (NLP) has been dictated by the rule of more: more data, more computing power and more complexity, best exemplified by the Large Language Models. However, training (or fine-tuning) large dense models for specific applications usually requires significant amounts of computing resources. This Ph.D. dissertation focuses on an under-investigated NLP data engineering technique known as Instance Selection (IS), whose potential is enormous in the current scenario. The goal of IS is to reduce the training set size by removing noisy or redundant instances while maintaining the effectiveness of the trained models and reducing the training process cost. We provide a comprehensive and scientifically sound comparison of IS methods applied to an essential NLP task -- Automatic Text Classification (ATC), considering several classification solutions and many datasets. Our findings reveal a significant untapped potential for IS solutions. We also propose two novel IS solutions that are noise-oriented and redundancy-aware, specifically designed for large datasets and transformer architectures. Our final solution achieved an average reduction of 41% in training sets, while maintaining the same levels of effectiveness in all datasets. Importantly, our solutions demonstrated speedup improvements of 1.67x (up to 2.46x), making them scalable for datasets with hundreds of thousands of documents.

[275] RULE: Reinforcement UnLEarning Achieves Forget-Retain Pareto Optimality

Chenlong Zhang,Zhuoran Jin,Hongbang Yuan,Jiaheng Wei,Tong Zhou,Kang Liu,Jun Zhao,Yubo Chen

Main category: cs.CL

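TL;DR: Proposes RULE, an efficient framework that uses reinforcement learning for targeted unlearning in LLMs; with only a small amount of forget data and synthesized boundary queries, it markedly improves forget quality and response naturalness while preserving model utility.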

Details

Motivation: Large language models (LLMs) may contain sensitive, copyrighted, or illegal content; existing unlearning methods depend on large datasets and perform poorly, so an efficient solution is needed. Method: Proposes the Reinforcement UnLearning (RULE) framework, which formulates unlearning as a refusal boundary optimization task, trains on a small portion of the forget set plus synthesized boundary queries, and uses a verifiable reward function to encourage safe refusal. Result: Experiments show that with only 12% of the forget set and 8% synthesized data, RULE improves forget quality and response naturalness by up to 17.5% and 16.3%, respectively, while maintaining model utility. Conclusion: RULE is an efficient and general method for targeted LLM unlearning that clearly outperforms existing baselines and shows good generalization and training efficiency. Abstract: The widespread deployment of Large Language Models (LLMs) trained on massive, uncurated corpora has raised growing concerns about the inclusion of sensitive, copyrighted, or illegal content. This has led to increasing interest in LLM unlearning: the task of selectively removing specific information from a model without retraining from scratch or degrading overall utility. However, existing methods often rely on large-scale forget and retain datasets, and suffer from unnatural responses, poor generalization, or catastrophic utility loss. In this work, we propose Reinforcement UnLearning (RULE), an efficient framework that formulates unlearning as a refusal boundary optimization problem. RULE is trained with a small portion of the forget set and synthesized boundary queries, using a verifiable reward function that encourages safe refusal on forget-related queries while preserving helpful responses on permissible inputs. We provide both theoretical and empirical evidence demonstrating the effectiveness of RULE in achieving targeted unlearning without compromising model utility. Experimental results show that, with only 12% forget set and 8% synthesized boundary data, RULE outperforms existing baselines by up to 17.5% in forget quality and 16.3% in response naturalness while maintaining general utility, achieving forget-retain Pareto optimality. Remarkably, we further observe that RULE improves the naturalness of model outputs, enhances training efficiency, and exhibits strong generalization ability, generalizing refusal behavior to semantically related but unseen queries.

[276] Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs

Wenrui Zhou,Shu Yang,Qingsong Yang,Zikun Guo,Lijie Hu,Di Wang

Main category: cs.CL

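TL;DR: Proposes the VISE benchmark for evaluating sycophantic behavior of video large language models (Video-LLMs) under misleading user input, filling a gap in systematic evaluation for this domain.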

Details

Motivation: Video large language models must be factually consistent and reliable in real-world applications, but their tendency to agree with user input, even when it contradicts the visual evidence, undermines trust. Systematic evaluation of this behavior in the video-language domain has been missing. Method: Proposes the VISE benchmark, which combines diverse question formats, prompt biases, and visual reasoning tasks to evaluate sycophancy in Video-LLMs, and explores key-frame selection as a mitigation strategy. Result: VISE is the first to bring linguistic analyses of sycophancy into the visual domain and supports fine-grained analysis across multiple sycophancy types. Key-frame selection indicates that stronger visual grounding can reduce sycophantic bias. Conclusion: VISE fills the gap in sycophancy evaluation for the video-language domain and points to new directions for improving model reliability. Abstract: As video large language models (Video-LLMs) become increasingly integrated into real-world applications that demand grounded multimodal reasoning, ensuring their factual consistency and reliability is of critical importance. However, sycophancy, the tendency of these models to align with user input even when it contradicts the visual evidence, undermines their trustworthiness in such contexts. Current sycophancy research has largely overlooked its specific manifestations in the video-language domain, resulting in a notable absence of systematic benchmarks and targeted evaluations to understand how Video-LLMs respond under misleading user input. To fill this gap, we propose VISE (Video-LLM Sycophancy Benchmarking and Evaluation), the first dedicated benchmark designed to evaluate sycophantic behavior in state-of-the-art Video-LLMs across diverse question formats, prompt biases, and visual reasoning tasks. Specifically, VISE is the first to bring linguistic perspectives on sycophancy into the visual domain, enabling fine-grained analysis across multiple sycophancy types and interaction patterns. In addition, we explore key-frame selection as an interpretable, training-free mitigation strategy, which reveals potential paths for reducing sycophantic bias by strengthening visual grounding.

[277] SDE-SQL: Enhancing Text-to-SQL Generation in Large Language Models via Self-Driven Exploration with SQL Probes

Wenxuan Xie,Yaxun Dai,Wenhao Jiang

Main category: cs.CL

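TL;DR: The SDE-SQL framework uses dynamic SQL probes to let large language models explore the database on their own at inference time, significantly improving Text-to-SQL performance.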

Details

Motivation: Existing methods rely on static, pre-processed database information, which limits the model's understanding of the data. Method: Proposes the SDE-SQL framework, which generates and executes SQL probes to retrieve database information dynamically. Result: On the BIRD benchmark, SDE-SQL achieves an 8.02% relative improvement in execution accuracy over the baseline model. Conclusion: SDE-SQL is a zero-shot method that reaches state-of-the-art performance among open-source-model methods without supervised fine-tuning. Abstract: Recent advancements in large language models (LLMs) have significantly improved performance on the Text-to-SQL task. However, prior approaches typically rely on static, pre-processed database information provided at inference time, which limits the model's ability to fully understand the database contents. Without dynamic interaction, LLMs are constrained to fixed, human-provided context and cannot autonomously explore the underlying data. To address this limitation, we propose SDE-SQL, a framework that enables large language models to perform self-driven exploration of databases during inference. This is accomplished by generating and executing SQL probes, which allow the model to actively retrieve information from the database and iteratively update its understanding of the data. Unlike prior methods, SDE-SQL operates in a zero-shot setting, without relying on any question-SQL pairs as in-context demonstrations. When evaluated on the BIRD benchmark with Qwen2.5-72B-Instruct, SDE-SQL achieves an 8.02% relative improvement in execution accuracy over the vanilla Qwen2.5-72B-Instruct baseline, establishing a new state-of-the-art among methods based on open-source models without supervised fine-tuning (SFT) or model ensembling. Moreover, with SFT, the performance of SDE-SQL can be further enhanced, yielding an additional 0.52% improvement.
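
To make the idea of a SQL probe concrete, here is a minimal sketch using Python's standard `sqlite3` module. In SDE-SQL the probes are generated by the LLM; in this sketch the probe is hard-coded, and the table, its contents, and the helper name `run_probe` are hypothetical.

```python
import sqlite3


def run_probe(conn, sql, max_rows=5):
    """Execute a probe query and return up to max_rows rows, or the error text.

    Returning the error as a string lets a caller feed it back to the model
    as an observation instead of aborting the exploration loop.
    """
    try:
        cursor = conn.execute(sql)
        return cursor.fetchmany(max_rows)
    except sqlite3.Error as exc:
        return f"error: {exc}"


# Hypothetical toy database standing in for a benchmark database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schools (name TEXT, county TEXT)")
conn.executemany(
    "INSERT INTO schools VALUES (?, ?)",
    [("A", "Alameda"), ("B", "Fresno"), ("C", "Alameda")],
)

# A probe that reveals which values a column actually contains.
probe = "SELECT DISTINCT county FROM schools ORDER BY county"
observations = run_probe(conn, probe)
print(observations)
```

The observed values would then be added to the model's context before it writes the final query, which is the part this sketch leaves out.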

[278] Improving the Efficiency of Long Document Classification using Sentence Ranking Approach

Prathamesh Kokate,Mitali Sarnaik,Manavi Khopade,Raviraj Joshi

Main category: cs.CL

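TL;DR: Proposes a TF-IDF-based sentence ranking method for long document classification that substantially reduces input size and inference latency while preserving classification accuracy.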

Details

Motivation: To address the computational limits and redundancy that transformer-based models such as BERT face in long document classification. Method: Ranks sentences with TF-IDF, selects a fixed number or a percentage of them, and uses an enhanced scoring strategy that combines normalized TF-IDF scores with sentence length. Result: On the MahaNews LDC dataset the method outperforms baseline selection methods, reducing input size by over 50% and inference latency by 43% with only a 0.33% drop in classification accuracy. Conclusion: The method shows that context can be reduced substantially without sacrificing performance, making it suitable for practical long document classification. Abstract: Long document classification poses challenges due to the computational limitations of transformer-based models, particularly BERT, which are constrained by fixed input lengths and quadratic attention complexity. Moreover, using the full document for classification is often redundant, as only a subset of sentences typically carries the necessary information. To address this, we propose a TF-IDF-based sentence ranking method that improves efficiency by selecting the most informative content. Our approach explores fixed-count and percentage-based sentence selection, along with an enhanced scoring strategy combining normalized TF-IDF scores and sentence length. Evaluated on the MahaNews LDC dataset of long Marathi news articles, the method consistently outperforms baselines such as first, last, and random sentence selection. With MahaBERT-v2, we achieve near-identical classification accuracy with just a 0.33 percent drop compared to the full-context baseline, while reducing input size by over 50 percent and inference latency by 43 percent. This demonstrates that significant context reduction is possible without sacrificing performance, making the method practical for real-world long document classification tasks.
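
A minimal sketch of percentage-based sentence selection by TF-IDF score follows. It is not the authors' implementation: it treats each sentence as its own "document" when computing IDF, scores a sentence by the mean TF-IDF weight of its distinct tokens, and omits the paper's length-aware scoring. The function name and the regex tokenizer are assumptions, and the tokenizer is a simplification that would not handle Marathi script properly.

```python
import math
import re
from collections import Counter


def rank_sentences(sentences, keep_ratio=0.5):
    """Keep the top-scoring fraction of sentences, in their original order."""
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    n = len(sentences)
    # Document frequency: number of sentences that contain each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))

    scores = []
    for toks in tokenized:
        if not toks:
            scores.append(0.0)
            continue
        tf = Counter(toks)
        # Smoothed IDF: a token present in every sentence gets weight 0.
        total = sum(
            (count / len(toks)) * math.log((1 + n) / (1 + df[tok]))
            for tok, count in tf.items()
        )
        scores.append(total / len(tf))

    k = max(1, math.ceil(keep_ratio * n))
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


doc = ["the cat sat", "the the the", "quantum flux capacitor the"]
print(rank_sentences(doc, keep_ratio=0.6))
```

Restoring the original order after selection keeps the shortened document readable for the downstream classifier.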

[279] Bias Attribution in Filipino Language Models: Extending a Bias Interpretability Metric for Application on Agglutinative Languages

Lance Calvin Lim Gamboa,Yue Feng,Mark Lee

Main category: cs.CL

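TL;DR: Studies the sources of bias in language models processing Filipino and finds that, unlike in English models, bias in Filipino models is tied more to people, objects, and relationships.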

Details

Motivation: To explore the sources of bias in non-English language models, Filipino in particular, and compare them with the bias patterns of English models. Method: Adapts an information-theoretic bias attribution score to Filipino models and multilingual models. Result: Bias-contributing themes in Filipino models center on people, objects, and relationships, whereas in English models they relate more to behaviors (such as criminal and sexual behavior). Conclusion: English and non-English language models differ notably in how they process inputs linked to sociodemographic groups. Abstract: Emerging research on bias attribution and interpretability has revealed how tokens contribute to biased behavior in language models processing English texts. We build on this line of inquiry by adapting the information-theoretic bias attribution score metric for implementation on models handling agglutinative languages, particularly Filipino. We then demonstrate the effectiveness of our adapted method by using it on a purely Filipino model and on three multilingual models: one trained on languages worldwide and two on Southeast Asian data. Our results show that Filipino models are driven towards bias by words pertaining to people, objects, and relationships, entity-based themes that stand in contrast to the action-heavy nature of bias-contributing themes in English (i.e., criminal, sexual, and prosocial behaviors). These findings point to differences in how English and non-English models process inputs linked to sociodemographic groups and bias.

[280] Question Answering under Temporal Conflict: Evaluating and Organizing Evolving Knowledge with LLMs

Atahan Özer,Çağatay Yıldız

Main category: cs.CL

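TL;DR: Examines the limitations of large language models (LLMs) when handling text that evolves over time and proposes a lightweight framework to improve their performance.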

Details

Motivation: LLM knowledge is bounded by pre-training data while real-world information keeps changing, and conventional update methods (re-training or in-context learning) are costly and impractical. Method: Introduces two new benchmarks (Temporal Wiki and Unified Clark) and proposes a lightweight framework that incrementally stores knowledge in a structured external memory. Result: The method outperforms in-context learning and RAG baselines on both benchmarks, especially on questions involving complex reasoning or conflicting facts. Conclusion: The proposed framework addresses the limitations of LLMs with dynamic knowledge and improves performance without re-training. Abstract: Large language models (LLMs) exhibit remarkable capabilities in question answering and reasoning thanks to their extensive parametric memory. However, their knowledge is inherently limited by the scope of their pre-training data, while real-world information evolves continuously. Updating this knowledge typically requires costly and brittle re-training, or in-context learning (ICL), which becomes impractical at scale given the volume and volatility of modern information. Motivated by these limitations, we investigate how LLMs perform when exposed to temporal text corpora, or documents that reflect evolving knowledge over time, such as sports biographies where facts like a player's "current team" change year by year. To this end, we introduce two new benchmarks: Temporal Wiki, which captures factual drift across historical Wikipedia snapshots, and Unified Clark, which aggregates timestamped news articles to simulate real-world information accumulation. Our analysis reveals that LLMs often struggle to reconcile conflicting or outdated facts and can be misled when multiple versions of a fact appear in context. To address these issues, we propose a lightweight, agentic framework that incrementally builds a structured, external memory from source documents without requiring re-training. This knowledge organization strategy enables models to retrieve and reason over temporally filtered, relevant information at inference time. Empirically, our method outperforms ICL and RAG baselines across both benchmarks, especially on questions requiring more complex reasoning or integration of conflicting facts.
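
The abstract's "current team" example suggests what temporally filtered retrieval from a structured memory could look like. The sketch below is one possible reading, not the paper's design: the class name `TemporalMemory`, the (entity, attribute) keying, and the year granularity are all assumptions.

```python
from bisect import bisect_right
from collections import defaultdict


class TemporalMemory:
    """Stores timestamped facts and answers 'what was true as of year Y?'."""

    def __init__(self):
        # (entity, attribute) -> list of (year, value), kept sorted by year.
        self._facts = defaultdict(list)

    def add(self, entity, attribute, year, value):
        facts = self._facts[(entity, attribute)]
        facts.append((year, value))
        facts.sort()

    def lookup(self, entity, attribute, as_of_year):
        """Return the latest value recorded at or before as_of_year, else None."""
        facts = self._facts.get((entity, attribute), [])
        years = [year for year, _ in facts]
        idx = bisect_right(years, as_of_year)
        return facts[idx - 1][1] if idx else None


# Hypothetical facts extracted incrementally from timestamped documents.
memory = TemporalMemory()
memory.add("player_x", "current team", 2019, "Team A")
memory.add("player_x", "current team", 2022, "Team B")

print(memory.lookup("player_x", "current team", 2020))
```

Because older values are kept rather than overwritten, conflicting versions of a fact can coexist and be resolved by the time filter at query time.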

[281] Parsing the Switch: LLM-Based UD Annotation for Complex Code-Switched and Low-Resource Languages

Olga Kellert,Nemika Tyagi,Muhammad Imran,Nelvin Licona-Guevara,Carlos Gómez-Rodríguez

Main category: cs.CL

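TL;DR: Proposes BiLingua Parser, an LLM-based annotation tool that produces Universal Dependencies annotations for code-switched text and significantly outperforms existing methods.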

Details

Motivation: To address the challenge of syntactic analysis for code-switched text in low-resource language settings, in particular the weakness of existing methods on multilingual and mixed-language input. Method: Develops a prompt-based framework that combines few-shot LLM prompting with expert review, and releases two annotated datasets. Result: BiLingua Parser reaches up to 95.29% LAS after expert revision, significantly outperforming baseline methods. Conclusion: With careful guidance, LLMs can serve as practical tools for building syntactic resources in low-resource, code-switched settings. Abstract: Code-switching presents a complex challenge for syntactic analysis, especially in low-resource language settings where annotated data is scarce. While recent work has explored the use of large language models (LLMs) for sequence-level tagging, few approaches systematically investigate how well these models capture syntactic structure in code-switched contexts. Moreover, existing parsers trained on monolingual treebanks often fail to generalize to multilingual and mixed-language input. To address this gap, we introduce the BiLingua Parser, an LLM-based annotation pipeline designed to produce Universal Dependencies (UD) annotations for code-switched text. First, we develop a prompt-based framework for Spanish-English and Spanish-Guaraní data, combining few-shot LLM prompting with expert review. Second, we release two annotated datasets, including the first Spanish-Guaraní UD-parsed corpus. Third, we conduct a detailed syntactic analysis of switch points across language pairs and communicative contexts. Experimental results show that BiLingua Parser achieves up to 95.29% LAS after expert revision, significantly outperforming prior baselines and multilingual parsers. These results show that LLMs, when carefully guided, can serve as practical tools for bootstrapping syntactic resources in under-resourced, code-switched environments. Data and source code are available at https://github.com/N3mika/ParsingProject

[282] Exploring the Impact of Temperature on Large Language Models: Hot or Cold?

Lujun Li,Lama Sleem,Niccolo' Gentile,Geoffrey Nichil,Radu State

Main category: cs.CL

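TL;DR: Studies how sampling temperature affects the performance of large language models, shows that its effect is specific to each capability, and proposes a BERT-based temperature selector to optimize performance.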

Details

Motivation: Recent work challenges the "Stochastic Parrots" analogy by showing that LLMs can understand semantics rather than merely memorize data, and that randomness, modulated by temperature, plays a key role in inference. Method: Systematically evaluates the effect of temperature (0 to 2) on six capabilities, analyzes open-source models of three sizes, and proposes a BERT-based temperature selector. Result: Temperature has skill-specific effects on model performance; the BERT-based selector significantly improves small and medium models on SuperGLUE; temperature effects are consistent between FP16 and 4-bit quantized models. Conclusion: Choosing the temperature is complex but important; the BERT-based selector offers a practical way to optimize it, and temperature effects vary with model size. Abstract: The sampling temperature, a critical hyperparameter in large language models (LLMs), modifies the logits before the softmax layer, thereby reshaping the distribution of output tokens. Recent studies have challenged the Stochastic Parrots analogy by demonstrating that LLMs are capable of understanding semantics rather than merely memorizing data and that randomness, modulated by sampling temperature, plays a crucial role in model inference. In this study, we systematically evaluated the impact of temperature in the range of 0 to 2 on data sets designed to assess six different capabilities, conducting statistical analyses on open source models of three different sizes: small (1B--4B), medium (6B--13B), and large (40B--80B). Our findings reveal distinct skill-specific effects of temperature on model performance, highlighting the complexity of optimal temperature selection in practical applications. To address this challenge, we propose a BERT-based temperature selector that takes advantage of these observed effects to identify the optimal temperature for a given prompt. We demonstrate that this approach can significantly improve the performance of small and medium models on the SuperGLUE datasets. Furthermore, our study extends to FP16 precision inference, revealing that temperature effects are consistent with those observed in 4-bit quantized models. By evaluating temperature effects up to 4.0 in three quantized models, we find that the Mutation Temperature -- the point at which significant performance changes occur -- increases with model size.
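
The abstract notes that temperature modifies the logits before the softmax. In the standard formulation the probability of token i is exp(z_i / T) / sum_j exp(z_j / T), where z are the logits and T is the temperature. The short sketch below implements that formula; the function name and the example logits are illustrative, and treating T = 0 as an error (rather than as greedy decoding) is a choice made here for simplicity.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        # T -> 0 approaches greedy (argmax) decoding, handled separately in practice.
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtract the maximum for numerical stability; the result is unchanged.
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up values
for t in (0.5, 1.0, 2.0):
    print(t, softmax_with_temperature(logits, t))
```

Lower temperatures concentrate probability on the highest-logit token, while higher temperatures flatten the distribution toward uniform.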

[283] Subjectivity in the Annotation of Bridging Anaphora

Lauren Levine,Amir Zeldes

Main category: cs.CL

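TL;DR: Examines subjectivity in bridging annotation, proposes a new classification system, and finds that existing resources may be under-annotated.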

Details

Motivation: The subjectivity of bridging annotation makes consistent agreement hard to reach, so its manifestation at different levels needs to be examined. Method: Runs an annotation pilot on the test set of the GUM corpus and proposes a new classification system for bridging subtypes. Result: Existing resources appear under-annotated; agreement on bridging subtype is moderate, but agreement on identifying bridging instances is low. Conclusion: Bridging annotation is markedly subjective, and further work is needed to improve annotation consistency. Abstract: Bridging refers to the associative relationship between inferable entities in a discourse and the antecedents which allow us to understand them, such as understanding what "the door" means with respect to an aforementioned "house". As identifying associative relations between entities is an inherently subjective task, it is difficult to achieve consistent agreement in the annotation of bridging anaphora and their antecedents. In this paper, we explore the subjectivity involved in the annotation of bridging instances at three levels: anaphor recognition, antecedent resolution, and bridging subtype selection. To do this, we conduct an annotation pilot on the test set of the existing GUM corpus, and propose a newly developed classification system for bridging subtypes, which we compare to previously proposed schemes. Our results suggest that some previous resources are likely to be severely under-annotated. We also find that while agreement on the bridging subtype category was moderate, annotator overlap for exhaustively identifying instances of bridging is low, and that many disagreements resulted from subjective understanding of the entities involved.

[284] ConfQA: Answer Only If You Are Confident

Yin Huang,Yifan Ethan Xu,Kai Sun,Vera Yan,Alicia Sun,Haidar Khan,Jimmy Nguyen,Mohammad Kachuee,Zhaojiang Lin,Yue Liu,Aaron Colak,Anuj Kumar,Wen-tau Yih,Xin Luna Dong

Main category: cs.CL

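TL;DR: ConfQA is a fine-tuning strategy that trains an LLM to admit "I am unsure" when it is uncertain; combined with an "answer only if you are confident" prompt and knowledge-graph-based confidence calibration, it reduces the hallucination rate from 20-40% to under 5%.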

Details

Motivation: To address hallucination when LLMs generate factual statements and improve the accuracy and reliability of answers. Method: Introduces the ConfQA strategy, which combines an "answer only if you are confident" prompt with knowledge-graph-based confidence calibration, and proposes the Dual Neural Knowledge framework, which dynamically selects between internal parameterized knowledge and external symbolic knowledge. Result: The hallucination rate falls below 5%, potential accuracy exceeds 95%, and external retrievals drop by over 30%. Conclusion: ConfQA and the Dual Neural Knowledge framework effectively reduce hallucination and improve LLM accuracy and efficiency. Abstract: Can we teach Large Language Models (LLMs) to refrain from hallucinating factual statements? In this paper we present a fine-tuning strategy that we call ConfQA, which can reduce hallucination rate from 20-40% to under 5% across multiple factuality benchmarks. The core idea is simple: when the LLM answers a question correctly, it is trained to continue with the answer; otherwise, it is trained to admit "I am unsure". But there are two key factors that make the training highly effective. First, we introduce a dampening prompt "answer only if you are confident" to explicitly guide the behavior, without which hallucination remains as high as 15%-25%. Second, we leverage simple factual statements, specifically attribute values from knowledge graphs, to help LLMs calibrate the confidence, resulting in robust generalization across domains and question types. Building on this insight, we propose the Dual Neural Knowledge framework, which seamlessly selects between internally parameterized neural knowledge and externally recorded symbolic knowledge based on ConfQA's confidence. The framework enables potential accuracy gains to beyond 95%, while reducing unnecessary external retrievals by over 30%.
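
The core labeling rule in the abstract (keep the answer when it is correct, otherwise train toward "I am unsure") can be sketched as a small data-preparation function. This is an illustration under assumptions, not the authors' pipeline: the exact-match correctness check, the prompt layout, and the name `build_example` are hypothetical; only the two quoted phrases come from the abstract.

```python
# Phrases quoted in the abstract.
DAMPENER = "Answer only if you are confident."
UNSURE = "I am unsure"


def build_example(question, model_answer, gold_answer):
    """Build one fine-tuning pair from the model's own answer and the gold answer.

    Correctness is decided by case-insensitive exact match here, which is an
    assumption; a real setup would likely need a more tolerant comparison.
    """
    is_correct = model_answer.strip().lower() == gold_answer.strip().lower()
    return {
        "prompt": f"{DAMPENER}\n{question}",
        "target": model_answer if is_correct else UNSURE,
    }


# Hypothetical question whose gold answer would come from a knowledge graph.
print(build_example("What is the capital of France?", "Paris", "paris"))
print(build_example("What is the capital of France?", "Lyon", "Paris"))
```

Because the target depends on what the model itself answered, the resulting training set reflects that model's own knowledge boundary rather than a fixed list of hard questions.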

[285] Reward Model Interpretability via Optimal and Pessimal Tokens

Brian Christian,Hannah Rose Kirk,Jessica A. F. Thompson,Christopher Summerfield,Tsvetomira Dumbalska

Main category: cs.CL

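TL;DR: The paper studies reward model interpretability and finds substantial heterogeneity across models, overvaluation of high-frequency tokens, and potentially harmful encoded biases.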

Details

Motivation: Reward models play a key role in aligning large language models with human values, yet their own interpretability and potential biases remain understudied. Method: Analyze how reward models score single-token responses across the entire vocabulary space to expose their behavior and potential problems. Result: The analysis finds heterogeneity between models, a preference for high-frequency tokens, sensitivity to prompt framing, and potential encoding of bias. Conclusion: The suitability of reward models as proxies for human values is called into question, and their potential negative effects on downstream models warrant caution. Abstract: Reward modeling has emerged as a crucial component in aligning large language models with human values. Significant attention has focused on using reward models as a means for fine-tuning generative models. However, the reward models themselves -- which directly encode human value judgments by turning prompt-response pairs into scalar rewards -- remain relatively understudied. We present a novel approach to reward model interpretability through exhaustive analysis of their responses across their entire vocabulary space. By examining how different reward models score every possible single-token response to value-laden prompts, we uncover several striking findings: (i) substantial heterogeneity between models trained on similar objectives, (ii) systematic asymmetries in how models encode high- vs low-scoring tokens, (iii) significant sensitivity to prompt framing that mirrors human cognitive biases, and (iv) overvaluation of more frequent tokens. We demonstrate these effects across ten recent open-source reward models of varying parameter counts and architectures. Our results challenge assumptions about the interchangeability of reward models, as well as their suitability as proxies of complex and context-dependent human values. We find that these models can encode concerning biases toward certain identity groups, which may emerge as unintended consequences of harmlessness training -- distortions that risk propagating through the downstream large language models now deployed to millions.

[286] Improving LLM Reasoning through Interpretable Role-Playing Steering

Anyi Wang,Dong Shu,Yifan Wang,Yunpu Ma,Mengnan Du

Main category: cs.CL

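TL;DR: SRPS is a new framework that improves the reasoning ability of large language models by identifying and manipulating internal model features associated with role-playing behavior.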

Details

Motivation: Existing methods rely mainly on prompt engineering, which lacks stability and interpretability. Method: Extract latent representations of role-play prompts, select the most relevant features based on activation patterns, and build a steering vector with controllable intensity. Result: Consistent gains across multiple reasoning benchmarks; for example, the accuracy of Llama3.1-8B on CSQA rises from 31.86% to 39.80%. Conclusion: SRPS improves reasoning ability while offering better interpretability and stability than traditional prompting. Abstract: Role-playing has emerged as an effective technique for enhancing the reasoning capabilities of large language models (LLMs). However, existing methods primarily rely on prompt engineering, which often lacks stability and interpretability. In this paper, we introduce Sparse Autoencoder Role-Playing Steering (SRPS), a novel framework that identifies and manipulates internal model features associated with role-playing behavior. Our approach extracts latent representations from role-play prompts, selects the most relevant features based on activation patterns, and constructs a steering vector that can be injected into the model's residual stream with controllable intensity. Our method enables fine-grained control over role-specific behavior and offers insights into how role information influences internal model activations. Extensive experiments across various reasoning benchmarks and model sizes demonstrate consistent performance gains. Notably, in the zero-shot chain-of-thought (CoT) setting, the accuracy of Llama3.1-8B on CSQA improves from 31.86% to 39.80%, while Gemma2-9B on SVAMP increases from 37.50% to 45.10%. These results highlight the potential of SRPS to enhance reasoning ability in LLMs, providing better interpretability and stability compared to traditional prompt-based role-playing.
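
Illustration (not from the paper): a rough sketch of the general steering-vector idea, written with plain Python lists. It assumes that "most relevant features" means the sparse-autoencoder features whose mean activation rises most between a baseline prompt and a role-play prompt, and that the steering vector is the activation-weighted sum of those features' decoder directions. The function names and this selection rule are assumptions; the paper's actual criterion may differ.

```python
def build_steering_vector(decoder, role_acts, base_acts, top_k):
    """Combine the decoder directions of the top_k features that react most to role-play.

    decoder:   list of n_features rows, each a d_model-long direction (assumed SAE decoder).
    role_acts: mean feature activations on role-play prompts (length n_features).
    base_acts: mean feature activations on baseline prompts (length n_features).
    """
    diffs = [r - b for r, b in zip(role_acts, base_acts)]
    # Indices of the features with the largest activation increase.
    top = sorted(range(len(diffs)), key=lambda i: diffs[i], reverse=True)[:top_k]
    d_model = len(decoder[0])
    vector = [0.0] * d_model
    for i in top:
        for j in range(d_model):
            vector[j] += diffs[i] * decoder[i][j]
    return vector


def steer(residual, vector, alpha):
    """Add the steering vector to one residual-stream state, scaled by intensity alpha."""
    return [h + alpha * v for h, v in zip(residual, vector)]
```

Here `alpha` plays the role of the controllable intensity mentioned in the abstract: `alpha = 0` leaves the hidden state unchanged.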

[287] Refusal-Feature-guided Teacher for Safe Finetuning via Data Filtering and Alignment Distillation

Seokil Ham,Yubin Choi,Seungju Cho,Yujin Yang,Younghun Kim,Changick Kim

Main category: cs.CL

TL;DR: The paper proposes a Refusal-Feature-guided Teacher (ReFT) that filters harmful prompts in Finetuning-as-a-Service to preserve the safety alignment of LLMs.

Details

Motivation: In current Finetuning-as-a-Service, user data may contain harmful prompts that degrade LLM safety alignment, and existing methods do not address the problem at its root. Method: Use the refusal feature of safety-aligned LLMs to train a ReFT model that identifies harmful prompts, then filter harmful data and distill alignment knowledge during finetuning. Result: Experiments show that the ReFT strategy effectively reduces harmful outputs and improves finetuning accuracy. Conclusion: ReFT offers a safe and reliable solution for Finetuning-as-a-Service. Abstract: Recently, major AI service providers such as Google and OpenAI have introduced Finetuning-as-a-Service, which enables users to customize Large Language Models (LLMs) for specific downstream tasks using their own data. However, this service is vulnerable to degradation of LLM safety-alignment when user data contains harmful prompts. While some prior works address this issue, fundamentally filtering harmful data from user data remains unexplored. Motivated by our observation that a directional representation reflecting refusal behavior (called the refusal feature) obtained from safety-aligned LLMs can inherently distinguish between harmful and harmless prompts, we propose the Refusal-Feature-guided Teacher (ReFT). Our ReFT model is trained to identify harmful prompts based on the similarity between input prompt features and its refusal feature. During finetuning, the ReFT model serves as a teacher that filters harmful prompts from user data and distills alignment knowledge into the base model. Extensive experiments demonstrate that our ReFT-based finetuning strategy effectively minimizes harmful outputs and enhances finetuning accuracy for user-specific tasks, offering a practical solution for secure and reliable deployment of LLMs in Finetuning-as-a-Service.

[288] SEED: Enhancing Text-to-SQL Performance and Practical Usability Through Automatic Evidence Generation

Janghyeon Yun,Sang-goo Lee

Main category: cs.CL

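TL;DR: SEED automatically generates evidence to improve the performance and practical usability of text-to-SQL, addressing the defects of human-generated evidence in the BIRD dataset.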

Details

Motivation: Existing text-to-SQL research relies on the BIRD dataset, which assumes users have expertise and domain knowledge, and its human-generated evidence contains defects that hurt model performance. Method: Propose SEED, which automatically generates evidence by analyzing the database schema, description files, and values. Result: Evaluated on BIRD and Spider, SEED significantly improves SQL generation accuracy in the no-evidence scenario and in some cases even outperforms the setting where BIRD evidence is provided. Conclusion: SEED-generated evidence narrows the gap between research and real-world deployment and improves the adaptability and robustness of text-to-SQL models. Abstract: Text-to-SQL enables non-experts to retrieve data from databases by converting natural language queries into SQL. However, state-of-the-art text-to-SQL studies rely on the BIRD dataset, which assumes that evidence is provided along with questions. Although BIRD facilitates research advancements, it assumes that users have expertise and domain knowledge, contradicting the fundamental goal of text-to-SQL. In addition, human-generated evidence in BIRD contains defects, including missing or erroneous evidence, which affects model performance. To address this issue, we propose SEED (System for Evidence Extraction and Domain knowledge generation), an approach that automatically generates evidence to improve performance and practical usability in real-world scenarios. SEED systematically analyzes database schema, description files, and values to extract relevant information. We evaluated SEED on BIRD and Spider, demonstrating that it significantly improves SQL generation accuracy in the no-evidence scenario, and in some cases, even outperforms the setting where BIRD evidence is provided. Our results highlight that SEED-generated evidence not only bridges the gap between research and real-world deployment but also improves the adaptability and robustness of text-to-SQL models. Our code is available at https://github.com/felix01189/SEED

[289] Plug-in and Fine-tuning: Bridging the Gap between Small Language Models and Large Language Models

Kyeonghyun Kim,Jinhee Jang,Juhwan Choi,Yoonji Lee,Kyohoon Jin,YoungBin Kim

Main category: cs.CL

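TL;DR: The PiFi framework integrates a frozen layer from a large language model (LLM) into a small language model (SLM), combining the strengths of both to stay efficient while improving performance.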

Details

Motivation: Address the high computational demands of large language models (LLMs) and the limited generalization of small language models (SLMs). Method: Propose the PiFi framework, which integrates a frozen LLM layer into an SLM and fine-tunes the combined model for specific tasks. Result: PiFi performs well across a range of natural language processing tasks, improving generalization and the transfer of linguistic abilities. Conclusion: PiFi combines the strengths of LLMs and SLMs and offers an efficient solution for resource-constrained environments. Abstract: Large language models (LLMs) are renowned for their extensive linguistic knowledge and strong generalization capabilities, but their high computational demands make them unsuitable for resource-constrained environments. In contrast, small language models (SLMs) are computationally efficient but often lack the broad generalization capacity of LLMs. To bridge this gap, we propose PiFi, a novel framework that combines the strengths of both LLMs and SLMs to achieve high performance while maintaining efficiency. PiFi integrates a single frozen layer from an LLM into an SLM and fine-tunes the combined model for specific tasks, boosting performance without a significant increase in computational cost. We show that PiFi delivers consistent performance improvements across a range of natural language processing tasks, including both natural language understanding and generation. Moreover, our findings demonstrate PiFi's ability to effectively leverage LLM knowledge, enhancing generalization to unseen domains and facilitating the transfer of linguistic abilities.

[290] Conjoined Predication and Scalar Implicature

Ratna Kandala

Main category: cs.CL

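TL;DR: This paper analyzes the first, still unresolved puzzle raised by Magri (2016), examining the hidden interaction among quantification, collective/concurrent interpretation, and contextual updating, and argues that the oddness of certain conjoined sentences stems from an indirect contextual contradiction.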

Details

Motivation: Magri (2016) raised two puzzles about conjunction; the second has been solved, but the first remains open. This paper analyzes the first puzzle within a theoretical framework to uncover the mechanism behind it. Method: Through conceptual analysis, situate Magri's first puzzle in its original theoretical framework and examine the oddness of conjoined sentences and its causes. Result: The oddness of conjoined sentences arises from an indirect contextual contradiction produced by the collective or concurrent reading, and the pragmatic mechanisms of scalar implicature extend beyond exhaustification-based grammatical accounts. Conclusion: The paper broadens the understanding of how scalar implicatures are generated and reveals a hidden pragmatic contradiction in conjoined sentences. Abstract: Magri (2016) investigates two puzzles arising from conjunction. Although Magri has proposed a solution to the second puzzle, the first remains unresolved. This first puzzle reveals a hidden interaction among quantification, collective/concurrent interpretation, and contextual updating, dimensions that have yet to be explored. In essence, the problem is that certain forms of sentences like "Some Italians come from a warm country," when conjoined as in "(Only) Some Italians come from a warm country and are blond," sound infelicitous, even though no obvious alternative triggers a conflicting scalar implicature. In this paper, we offer a conceptual analysis of Magri's first puzzle by situating it within its original theoretical framework. We argue that the oddness arises from the collective or concurrent reading of the conjunctive predicate: in examples such as "(Only) Some Italians come from a warm country and are blond," this interpretation generates an indirect contextual contradiction. Moreover, we suggest that the pragmatic mechanisms governing scalar implicature generation extend beyond what is captured by exhaustification-based grammatical licensing accounts.

[291] Well Begun is Half Done: Low-resource Preference Alignment by Weak-to-Strong Decoding

Feifan Song,Shaohang Wei,Wen Luo,Yuxuan Fan,Tianyu Liu,Guoyin Wang,Houfeng Wang

Main category: cs.CL

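TL;DR: The paper proposes Weak-to-Strong Decoding (WSD), a framework in which a small aligned model guides a base model to improve its alignment and avoid generating low-quality content.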

Details

Motivation: Large language models (LLMs) must be aligned with human preferences to avoid generating offensive, false, or meaningless content, but existing low-resource alignment methods still struggle to obtain content that is both high-quality and aligned. Method: Propose the WSD framework, in which a small aligned model drafts a well-aligned beginning and a large base model continues the rest, controlled by an auto-switch mechanism. Result: Experiments show that WSD significantly improves the alignment of base models, outperforms all baselines, and does not degrade downstream tasks. Conclusion: WSD addresses LLM alignment through small-model guidance while avoiding the alignment tax of traditional methods. Abstract: Large Language Models (LLMs) require alignment with human preferences to avoid generating offensive, false, or meaningless content. Recently, low-resource methods for LLM alignment have been popular, while still facing challenges in obtaining both high-quality and aligned content. Motivated by the observation that the difficulty of generating aligned responses is concentrated at the beginning of decoding, we propose a novel framework, Weak-to-Strong Decoding (WSD), to enhance the alignment ability of base models by the guidance of a small aligned model. The small model first drafts well-aligned beginnings, and the large base model then continues the rest, controlled by a well-designed auto-switch mechanism. We also collect a new dataset, GenerAlign, to fine-tune a small-sized Pilot-3B as the draft model, which effectively enhances different base models under the WSD framework to outperform all baseline methods, while avoiding degradation on downstream tasks (the so-called alignment tax). Extensive experiments are further conducted to examine the impact of different settings and time efficiency, as well as analyses on the intrinsic mechanisms of WSD in depth.

[292] LG-ANNA-Embedding technical report

Jooyoung Choi,Hyun Kim,Hansol Jang,Changwook Jun,Kyunghoon Bae,Hyewon Choi,Stanley Jungkyu Choi,Honglak Lee,Chulmin Yun

Main category: cs.CL

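TL;DR: The report proposes a unified instruction-based framework for learning general-purpose text embeddings for both information retrieval (IR) and non-IR tasks, combining in-context learning, soft supervision, and adaptive hard-negative mining without task-specific fine-tuning.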

Details

Motivation: Develop a general-purpose text embedding method that works across many tasks while avoiding task-specific fine-tuning, improving generalization and efficiency. Method: Build on Mistral-7B and combine in-context learning, soft supervision (continuous relevance scores), and adaptive hard-negative mining to produce context-aware embeddings. Result: Strong performance and generalization on the 41 tasks of the MTEB (English, v2) benchmark, outperforming several larger or fully fine-tuned baselines. Conclusion: Combining in-context prompting, soft supervision, and adaptive sampling yields high-quality embeddings efficiently and has broad application potential. Abstract: This report presents a unified instruction-based framework for learning generalized text embeddings optimized for both information retrieval (IR) and non-IR tasks. Built upon a decoder-only large language model (Mistral-7B), our approach combines in-context learning, soft supervision, and adaptive hard-negative mining to generate context-aware embeddings without task-specific fine-tuning. Structured instructions and few-shot examples are used to guide the model across diverse tasks, enabling strong performance on classification, semantic similarity, clustering, and reranking benchmarks. To improve semantic discrimination, we employ a soft labeling framework where continuous relevance scores, distilled from a high-performance dense retriever and reranker, serve as fine-grained supervision signals. In addition, we introduce adaptive margin-based hard-negative mining, which filters out semantically ambiguous negatives based on their similarity to positive examples, thereby enhancing training stability and retrieval robustness. Our model is evaluated on the newly introduced MTEB (English, v2) benchmark, covering 41 tasks across seven categories. Results show that our method achieves strong generalization and ranks among the top-performing models by Borda score, outperforming several larger or fully fine-tuned baselines. These findings highlight the effectiveness of combining in-context prompting, soft supervision, and adaptive sampling for scalable, high-quality embedding generation.
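
Illustration (not from the report): one simple way margin-based filtering of hard negatives can work. The sketch assumes each candidate negative carries a retriever similarity score for the query, and that a candidate is "semantically ambiguous" when its score comes within a fixed margin of the positive's score. The function name, the default margin, and the exact rule are assumptions; the report's adaptive criterion is not spelled out in the abstract.

```python
def filter_hard_negatives(pos_score, candidates, margin=0.05):
    """Keep only candidates that score clearly below the positive example.

    pos_score:  similarity between the query and its positive passage.
    candidates: list of (text, score) pairs for mined negatives.
    margin:     hypothetical safety gap; candidates scoring at or above
                pos_score - margin are treated as possible false negatives.
    """
    # The threshold adapts per query because it moves with that query's positive score.
    threshold = pos_score - margin
    return [text for text, score in candidates if score < threshold]
```

For example, with a positive score of 0.9, a candidate scoring 0.88 would be dropped as too close to the positive, while one scoring 0.6 would be kept as a usable hard negative.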

[293] Understanding Cross-Domain Adaptation in Low-Resource Topic Modeling

Pritom Saha Akash,Kevin Chen-Chuan Chang

Main category: cs.CL

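TL;DR: The DALTA framework uses domain alignment and adversarial training to substantially improve topic coherence and stability in low-resource topic modeling.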

Details

Motivation: Existing topic models perform poorly in low-resource settings, lacking stability and coherence, so a method is needed that can make effective use of knowledge from a high-resource domain. Method: Propose the DALTA framework, which combines a shared encoder, specialized decoders, and adversarial alignment to transfer cross-domain knowledge selectively. Result: On multiple low-resource datasets, DALTA outperforms existing methods in topic coherence, stability, and transferability. Conclusion: DALTA offers an effective domain adaptation solution for low-resource topic modeling. Abstract: Topic modeling plays a vital role in uncovering hidden semantic structures within text corpora, but existing models struggle in low-resource settings where limited target-domain data leads to unstable and incoherent topic inference. We address this challenge by formally introducing domain adaptation for low-resource topic modeling, where a high-resource source domain informs a low-resource target domain without overwhelming it with irrelevant content. We establish a finite-sample generalization bound showing that effective knowledge transfer depends on robust performance in both domains, minimizing latent-space discrepancy, and preventing overfitting to the data. Guided by these insights, we propose DALTA (Domain-Aligned Latent Topic Adaptation), a new framework that employs a shared encoder for domain-invariant features, specialized decoders for domain-specific nuances, and adversarial alignment to selectively transfer relevant information. Experiments on diverse low-resource datasets demonstrate that DALTA consistently outperforms state-of-the-art methods in terms of topic coherence, stability, and transferability.

[294] KScope: A Framework for Characterizing the Knowledge Status of Language Models

Yuxin Xiao,Shan Chen,Jack Gallifant,Danielle Bitterman,Thomas Hartvigsen,Marzyeh Ghassemi

Main category: cs.CL

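TL;DR: The paper proposes KScope, a framework that uses hierarchical statistical tests to characterize an LLM's knowledge status as one of five types. It finds that supporting context narrows knowledge gaps and that specific context features drive successful knowledge updates.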

Details

Motivation: Prior work has focused mainly on LLM behavior under knowledge conflicts and does not fully capture how well a model knows the answer to a question. Method: Propose KScope, a hierarchical framework of statistical tests that assigns LLM knowledge to one of five statuses, and apply it to nine LLMs across four datasets. Result: Supporting context narrows knowledge gaps; features related to difficulty, relevance, and familiarity drive knowledge updates; and LLM behavior differs markedly across knowledge statuses. Conclusion: KScope effectively characterizes LLM knowledge status, and feature-constrained context summarization together with enhanced credibility further improves knowledge updates. Abstract: Characterizing a large language model's (LLM's) knowledge of a given question is challenging. As a result, prior work has primarily examined LLM behavior under knowledge conflicts, where the model's internal parametric memory contradicts information in the external context. However, this does not fully reflect how well the model knows the answer to the question. In this paper, we first introduce a taxonomy of five knowledge statuses based on the consistency and correctness of LLM knowledge modes. We then propose KScope, a hierarchical framework of statistical tests that progressively refines hypotheses about knowledge modes and characterizes LLM knowledge into one of these five statuses. We apply KScope to nine LLMs across four datasets and systematically establish: (1) Supporting context narrows knowledge gaps across models. (2) Context features related to difficulty, relevance, and familiarity drive successful knowledge updates. (3) LLMs exhibit similar feature preferences when partially correct or conflicted, but diverge sharply when consistently wrong. (4) Context summarization constrained by our feature analysis, together with enhanced credibility, further improves update effectiveness and generalizes across LLMs.

[295] From Calibration to Collaboration: LLM Uncertainty Quantification Should Be More Human-Centered

Siddartha Devic,Tejas Srinivasan,Jesse Thomason,Willie Neiswanger,Vatsal Sharan

Main category: cs.CL

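TL;DR: The paper argues that current LLM uncertainty quantification practices suffer from low ecological validity, consider only epistemic uncertainty, and optimize metrics that do not reflect downstream utility, and it proposes user-centered directions for improvement.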

Details

Motivation: Improve the reliability of LLMs on real-world tasks and support human-LLM collaboration. Method: Analyze 40 LLM uncertainty quantification methods, identify three problematic practices, and propose improvements. Result: Current practice falls short in ecological validity, in the types of uncertainty considered, and in the metrics optimized. Conclusion: The paper calls for a more user-centered approach to uncertainty quantification. Abstract: Large Language Models (LLMs) are increasingly assisting users in the real world, yet their reliability remains a concern. Uncertainty quantification (UQ) has been heralded as a tool to enhance human-LLM collaboration by enabling users to know when to trust LLM predictions. We argue that current practices for uncertainty quantification in LLMs are not optimal for developing useful UQ for human users making decisions in real-world tasks. Through an analysis of 40 LLM UQ methods, we identify three prevalent practices hindering the community's progress toward its goal of benefiting downstream users: 1) evaluating on benchmarks with low ecological validity; 2) considering only epistemic uncertainty; and 3) optimizing metrics that are not necessarily indicative of downstream utility. For each issue, we propose concrete user-centric practices and research directions that LLM UQ researchers should consider. Instead of hill-climbing on unrepresentative tasks using imperfect metrics, we argue that the community should adopt a more human-centered approach to LLM uncertainty quantification.

[296] CCI4.0: A Bilingual Pretraining Dataset for Enhancing Reasoning in Large Language Models

Guang Liu,Liangdong Wang,Jijie Li,Yang Yu,Yao Xu,Jiabei Chen,Yu Bai,Feng Liao,Yonghua Lin

Main category: cs.CL

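TL;DR: CCI4.0 is a large-scale bilingual pre-training dataset with high-quality data and diverse reasoning trajectories, built with a rigorous data processing pipeline to improve model performance.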

Details

Motivation: Existing pre-training data varies in quality, requires expert experience to process, and lacks diverse reasoning trajectories, which limits model performance. Method: Propose a data processing pipeline with two-stage deduplication, multi-classifier quality scoring, and domain-aware fluency filtering, and extract 4.5 billion CoT templates. Result: Experiments show that models pre-trained on CCI4.0 perform better on math and code tasks. Conclusion: Rigorous data curation and diverse reasoning templates are critical to LLM performance and point toward automated processing of pre-training data. Abstract: We introduce CCI4.0, a large-scale bilingual pre-training dataset engineered for superior data quality and diverse human-like reasoning trajectory. CCI4.0 occupies roughly 35 TB of disk space and comprises two sub-datasets: CCI4.0-M2-Base and CCI4.0-M2-CoT. CCI4.0-M2-Base combines a 5.2 TB carefully curated Chinese web corpus, a 22.5 TB English subset from Nemotron-CC, and diverse sources from math, wiki, arxiv, and code. Although these data are mostly sourced from well-processed datasets, the quality standards of various domains are dynamic and require extensive expert experience and labor to process. So, we propose a novel pipeline justifying data quality mainly based on models through two-stage deduplication, multiclassifier quality scoring, and domain-aware fluency filtering. We extract 4.5 billion pieces of CoT (Chain-of-Thought) templates, named CCI4.0-M2-CoT. Differing from the distillation of CoT from larger models, our proposed staged CoT extraction exemplifies diverse reasoning patterns and significantly decreases the possibility of hallucination. Empirical evaluations demonstrate that LLMs pre-trained on CCI4.0 benefit from cleaner, more reliable training signals, yielding consistent improvements in downstream tasks, especially in math and code reflection tasks. Our results underscore the critical role of rigorous data curation and human thinking templates in advancing LLM performance, shedding some light on automatically processing pretraining corpora.

[297] Improving Fairness of Large Language Models in Multi-document Summarization

Haoyuan Li,Yusen Zhang,Snigdha Chaturvedi

Main category: cs.CL

TL;DR: FairPO is a preference tuning method that aims to improve both summary-level and corpus-level fairness in multi-document summarization (MDS).

Details

Motivation: Fairness in multi-document summarization matters for decision-making, but existing methods focus mainly on summary-level fairness and neglect corpus-level fairness. Method: Generate preference pairs by perturbing document sets to improve summary-level fairness, and dynamically adjust the weights of preference pairs to improve corpus-level fairness. Result: Experiments show that FairPO outperforms baselines while maintaining summary quality. Conclusion: FairPO balances summary-level and corpus-level fairness, providing a more comprehensive fairness solution for multi-document summarization. Abstract: Fairness in multi-document summarization (MDS) is crucial for providing comprehensive views across documents with diverse social attribute values, which can significantly impact decision-making. For example, a summarization system that tends to overrepresent negative reviews of products can mislead customers into disregarding good products. Previous works measure fairness in MDS at two levels: summary-level and corpus-level. While summary-level fairness focuses on individual summaries, corpus-level fairness focuses on a corpus of summaries. Recent methods primarily focus on summary-level fairness. We propose FairPO, a preference tuning method that focuses on both summary-level and corpus-level fairness in MDS. To improve summary-level fairness, we propose to generate preference pairs by perturbing document sets. To improve corpus-level fairness, we propose fairness-aware preference tuning by dynamically adjusting the weights of preference pairs. Our experiments show that FairPO outperforms strong baselines while maintaining the critical qualities of summaries. The code is available at https://github.com/leehaoyuan/coverage_fairnes.

[298] A Hybrid GA LLM Framework for Structured Task Optimization

Berry Feng,Jonas Lin,Patrick Lau

Main category: cs.CL

TL;DR: GA LLM combines genetic algorithms with large language models, iteratively optimizing outputs for structured tasks.

Details

Motivation: Address the shortcomings of a language model used alone on generation tasks with strict constraints by adding the global optimization ability of genetic algorithms. Method: Treat each output as a gene and use the language model to guide selection, crossover, and mutation, iteratively improving solutions. Result: Performs well on tasks such as itinerary planning, academic outlining, and business reporting, satisfying constraints with high quality. Conclusion: By combining the strengths of both methods, GA LLM markedly improves constraint satisfaction and generation quality. Abstract: GA LLM is a hybrid framework that combines Genetic Algorithms with Large Language Models to handle structured generation tasks under strict constraints. Each output, such as a plan or report, is treated as a gene, and evolutionary operations like selection, crossover, and mutation are guided by the language model to iteratively improve solutions. The language model provides domain knowledge and creative variation, while the genetic algorithm ensures structural integrity and global optimization. GA LLM has proven effective in tasks such as itinerary planning, academic outlining, and business reporting, consistently producing well-structured and requirement-satisfying results. Its modular design also makes it easy to adapt to new tasks. Compared to using a language model alone, GA LLM achieves better constraint satisfaction and higher quality solutions by combining the strengths of both components.
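
Illustration (not from the paper): a generic genetic-algorithm loop with elitism, sketched to show where language-model calls could plug in. The `crossover` and `mutate` arguments are hypothetical callables standing in for LLM-guided operators, `fitness` stands in for a constraint checker, and the function name `evolve` and the parent-selection rule are assumptions; nothing here reproduces the authors' implementation.

```python
import random


def evolve(population, fitness, crossover, mutate, generations=5, seed=0):
    """Run a small GA over candidate outputs (for example, plans stored as lists of items).

    population: list of at least two candidate outputs.
    fitness:    callable scoring a candidate, higher is better (e.g. constraints met).
    crossover:  callable (a, b) -> child; in GA LLM this step would be LLM-guided.
    mutate:     callable (child, rng) -> child; likewise a stand-in for an LLM edit.
    """
    rng = random.Random(seed)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        # Selection: the better half (at least two candidates) become parents.
        parents = ranked[: max(2, len(ranked) // 2)]
        # Elitism: carry the current best forward unchanged.
        children = [ranked[0]]
        while len(children) < len(population):
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b), rng))
        population = children
    return max(population, key=fitness)
```

Because the best candidate is copied into every new generation, the returned solution is never worse than the best member of the initial population under the given `fitness`.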

[299] DEBATE: A Dataset for Disentangling Textual Ambiguity in Mandarin Through Speech

Haotian Guo,Jing Han,Yongfeng Tu,Shihao Gao,Shengfan Shen,Wulong Xiang,Weihao Gan,Zixing Zhang

Main category: cs.CL

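TL;DR: The paper introduces the DEBATE dataset for studying how speech cues help resolve textual ambiguity, filling a gap in research on disambiguation through speech.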

Details

Motivation: Disambiguation through speech (DTS) is underexplored and lacks high-quality datasets; DEBATE fills this gap. Method: Build the DEBATE dataset of 1,001 ambiguous utterances, each recorded by 10 native speakers, and analyze speech cues such as pronunciation, pause, stress, and intonation. Result: The dataset is of high quality, and experiments reveal a significant gap between machines and humans in understanding spoken intent. Conclusion: DEBATE lays a foundation for DTS research and supports building similar datasets across languages and cultures. Abstract: Despite extensive research on textual and visual disambiguation, disambiguation through speech (DTS) remains underexplored. This is largely due to the lack of high-quality datasets that pair spoken sentences with richly ambiguous text. To address this gap, we present DEBATE, a unique public Chinese speech-text dataset designed to study how speech cues and patterns (pronunciation, pause, stress and intonation) can help resolve textual ambiguity and reveal a speaker's true intent. DEBATE contains 1,001 carefully selected ambiguous utterances, each recorded by 10 native speakers, capturing diverse linguistic ambiguities and their disambiguation through speech. We detail the data collection pipeline and provide rigorous quality analysis. Additionally, we benchmark three state-of-the-art large speech and audio-language models, illustrating clear and huge performance gaps between machine and human understanding of spoken intent. DEBATE represents the first effort of its kind and offers a foundation for building similar DTS datasets across languages and cultures. The dataset and associated code are available at: https://github.com/SmileHnu/DEBATE.

[300] What Do Indonesians Really Need from Language Technology? A Nationwide Survey

Muhammad Dehan Al Kautsar,Lucky Susanto,Derry Wijaya,Fajri Koto

Main category: cs.CL

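TL;DR: The study surveys what speakers of Indonesia's 700+ local languages actually need from language technology, finding that machine translation and information retrieval are the top priorities, while privacy and bias concerns must be addressed.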

Details

Motivation: Understand what Indonesia's local language communities actually need from language technology, to guide resource allocation and technology development. Method: Assess the needs of Indonesia's local language communities through a nationwide survey. Result: Machine translation and information retrieval are the top priorities, but concerns about privacy, bias, and transparency must be addressed. Conclusion: Transparent communication and clear policies are needed to support broad adoption of language technology. Abstract: There is an emerging effort to develop NLP for Indonesia's 700+ local languages, but progress remains costly due to the need for direct engagement with native speakers. However, it is unclear what these language communities truly need from language technology. To address this, we conduct a nationwide survey to assess the actual needs of native speakers in Indonesia. Our findings indicate that addressing language barriers, particularly through machine translation and information retrieval, is the most critical priority. Although there is strong enthusiasm for advancements in language technology, concerns around privacy, bias, and the use of public data for AI training highlight the need for greater transparency and clear communication to support broader AI adoption.

[301] DeRAGEC: Denoising Named Entity Candidates with Synthetic Rationale for ASR Error Correction

Solee Im,Wonjun Lee,Jinmyeong An,Yunsu Kim,Jungseul Ok,Gary Geunbae Lee

Main category: cs.CL

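TL;DR: DeRAGEC improves named entity (NE) correction in automatic speech recognition (ASR) systems by using synthetic denoising rationales to filter noisy NE candidates and leveraging phonetic similarity and augmented definitions to refine the correction, with no additional training.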

Details

Motivation: Improve named entity correction in ASR systems, reducing noise and increasing accuracy. Method: Extend the RAGEC framework, using synthetic denoising rationales to filter noisy NE candidates and combining phonetic similarity with augmented definitions for correction. Result: Significantly lower word error rate (WER) and higher NE hit ratio on the CommonVoice and STOP datasets, with a 28% relative WER reduction compared with ASR without postprocessing. Conclusion: DeRAGEC performs strongly on NE correction and substantially improves ASR performance without additional training. Abstract: We present DeRAGEC, a method for improving Named Entity (NE) correction in Automatic Speech Recognition (ASR) systems. By extending the Retrieval-Augmented Generative Error Correction (RAGEC) framework, DeRAGEC employs synthetic denoising rationales to filter out noisy NE candidates before correction. By leveraging phonetic similarity and augmented definitions, it refines noisy retrieved NEs using in-context learning, requiring no additional training. Experimental results on CommonVoice and STOP datasets show significant improvements in Word Error Rate (WER) and NE hit ratio, outperforming baseline ASR and RAGEC methods. Specifically, we achieved a 28% relative reduction in WER compared to ASR without postprocessing. Our source code is publicly available at: https://github.com/solee0022/deragec

[302] Towards Large Language Models with Self-Consistent Natural Language Explanations

Sahar Admoni,Ofra Amir,Assaf Hallak,Yftah Ziser

Main category: cs.CL

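TL;DR: The paper introduces a new benchmark (PSCB) for evaluating and improving the self-consistency of explanations generated by large language models (LLMs), and uses a new metric and an optimization method to better align explanations with decision-relevant features.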

Details

Motivation: Post-hoc explanations generated by LLMs are often inconsistent with the true decision process, yet no systematic solution exists. Method: Introduce the Post-hoc Self-Consistency Bank (PSCB) as a large-scale benchmark, together with a new metric and Direct Preference Optimization (DPO) to fine-tune LLMs. Result: Self-consistency scores barely differ between correct and incorrect predictions; the new metric distinguishes explanation quality more effectively; and after optimization, LLM explanations align better with decision-relevant features. Conclusion: The work offers a scalable path toward more trustworthy, self-consistent LLM explanations. Abstract: Large language models (LLMs) seem to offer an easy path to interpretability: just ask them to explain their decisions. Yet, studies show that these post-hoc explanations often misrepresent the true decision process, as revealed by mismatches in feature importance. Despite growing evidence of this inconsistency, no systematic solutions have emerged, partly due to the high cost of estimating feature importance, which limits evaluations to small datasets. To address this, we introduce the Post-hoc Self-Consistency Bank (PSCB) - a large-scale benchmark of decisions spanning diverse tasks and models, each paired with LLM-generated explanations and corresponding feature importance scores. Analysis of PSCB reveals that self-consistency scores barely differ between correct and incorrect predictions. We also show that the standard metric fails to meaningfully distinguish between explanations. To overcome this limitation, we propose an alternative metric that more effectively captures variation in explanation quality. We use it to fine-tune LLMs via Direct Preference Optimization (DPO), leading to significantly better alignment between explanations and decision-relevant features, even under domain shift. Our findings point to a scalable path toward more trustworthy, self-consistent LLMs.

[303] Bit-level BPE: Below the byte boundary

Sangwhan Moon,Tatsuya Hiraoka,Naoaki Okazaki

Main category: cs.CL

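TL;DR: The paper proposes a lossless compression technique that reduces the sequence-length increase caused by byte-level fallback in subword tokenization.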

Details

Motivation: Byte-level fallback is effective at preventing OOV, but it significantly increases sequence length for CJK languages and emoji, reducing computational efficiency. Method: Propose a simple lossless compression technique. Result: The method reduces sequence length while preserving the information. Conclusion: The technique offers an efficient solution for character-diverse languages. Abstract: Byte-level fallbacks for subword tokenization have become a common practice in large language models. In particular, it has been demonstrated to be incredibly effective as a pragmatic solution for preventing OOV, especially in the context of larger models. However, breaking a character down to individual bytes significantly increases the sequence length for long-tail tokens in languages such as Chinese, Japanese, and Korean (CJK) and other character-diverse contexts such as emoji. The increased sequence length results in longer computation during both training and inference. In this work, we propose a simple compression technique that reduces the sequence length losslessly.
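
Illustration (not from the paper): a small sketch of the problem the abstract describes, namely that falling back to UTF-8 bytes turns one out-of-vocabulary character into several tokens. The function name `byte_fallback_tokens` is hypothetical, and the `<0xNN>` token spelling is assumed here only as a readable convention. The sketch shows the length increase, not the paper's compression technique.

```python
def byte_fallback_tokens(ch):
    """Return the byte-level fallback tokens for one character.

    Each UTF-8 byte becomes its own token, so the token count equals the
    number of bytes in the character's UTF-8 encoding.
    """
    return [f"<0x{b:02X}>" for b in ch.encode("utf-8")]
```

An ASCII letter encodes to one byte and so stays one token, a typical CJK character encodes to three bytes, and many emoji encode to four, which is the sequence-length growth the paper targets.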

[304] SELT: Self-Evaluation Tree Search for LLMs with Task Decomposition

Mengsong Wu,Di Zhang,Yuqiang Li,Dongzhan Zhou,Wenliang Chen

Main category: cs.CL

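TL;DR: The SELT framework strengthens LLM reasoning with a modified Monte Carlo Tree Search (MCTS) that does not rely on external reward models, significantly improving performance on complex reasoning tasks.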

Details

Motivation: Large language models (LLMs) perform poorly on complex reasoning tasks and need a way to improve reasoning without external reward models. Method: SELT redefines the Upper Confidence Bound (UCB) score to match the LLM's self-evaluation ability and decomposes inference into atomic subtasks with semantic clustering, balancing exploration and exploitation. Result: On benchmarks such as MMLU and Seal-Tools, SELT significantly improves answer accuracy and reasoning robustness without task-specific fine-tuning. Conclusion: SELT is a general LLM reasoning framework that needs no external reward model and applies to diverse reasoning tasks. Abstract: While Large Language Models (LLMs) have achieved remarkable success in a wide range of applications, their performance often degrades in complex reasoning tasks. In this work, we introduce SELT (Self-Evaluation LLM Tree Search), a novel framework that leverages a modified Monte Carlo Tree Search (MCTS) to enhance LLM reasoning without relying on external reward models. By redefining the Upper Confidence Bound scoring to align with intrinsic self-evaluation capabilities of LLMs and decomposing the inference process into atomic subtasks augmented with semantic clustering at each node, SELT effectively balances exploration and exploitation, reduces redundant reasoning paths, and mitigates hallucination. We validate our approach on challenging benchmarks, including the knowledge-based MMLU and the Tool Learning dataset Seal-Tools, where SELT achieves significant improvements in answer accuracy and reasoning robustness compared to baseline methods. Notably, our framework operates without task-specific fine-tuning, demonstrating strong generalizability across diverse reasoning tasks. Relevant results and code are available at https://github.com/fairyshine/SELT .
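
Illustration (not from the paper): the standard UCB1 score used for node selection in textbook MCTS, given as background for the redefinition the abstract mentions. The function name `ucb_score` and the default exploration constant are assumptions. In a SELT-like setting the accumulated value would come from the LLM's own self-evaluation rather than an external reward model, but the paper's modified formula is not given in the abstract and is not reproduced here.

```python
import math


def ucb_score(value_sum, visits, parent_visits, c=1.4):
    """Standard UCB1: mean value plus an exploration bonus.

    value_sum:     total evaluation score accumulated at this node.
    visits:        number of times this node has been visited.
    parent_visits: number of times the parent has been visited (>= 1).
    c:             exploration constant (hypothetical default).
    """
    if visits == 0:
        # Unvisited nodes are explored first.
        return math.inf
    exploit = value_sum / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore
```

A larger `c` favors rarely visited branches, while a smaller `c` favors branches that already have a high mean score.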

[305] Beyond the Sentence: A Survey on Context-Aware Machine Translation with Large Language Models

Ramakrishna Appicharla,Baban Gain,Santanu Pal,Asif Ekbal

Main category: cs.CL

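TL;DR: This survey reviews the use of large language models (LLMs) for context-aware machine translation, finds that commercial LLMs (such as ChatGPT) outperform open-source LLMs (such as Llama), and outlines future research directions.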

Details

Motivation: Despite the popularity of large language models (LLMs), their application to context-aware machine translation remains underexplored. Method: A literature review that analyzes prompting and fine-tuning approaches and discusses automatic post-editing and translation agents. Result: Commercial LLMs outperform open-source LLMs, and prompt-based approaches serve as good baselines for assessing translation quality. Conclusion: The potential of context-aware machine translation merits further exploration. Abstract: Despite the popularity of the large language models (LLMs), their application to machine translation is relatively underexplored, especially in context-aware settings. This work presents a literature review of context-aware translation with LLMs. The existing works utilise prompting and fine-tuning approaches, with few focusing on automatic post-editing and creating translation agents for context-aware machine translation. We observed that the commercial LLMs (such as ChatGPT and Tower LLM) achieved better results than the open-source LLMs (such as Llama and Bloom LLMs), and prompt-based approaches serve as good baselines to assess the quality of translations. Finally, we present some interesting future directions to explore.

[306] Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque

Oscar Sainz,Naiara Perez,Julen Etxaniz,Joseba Fernandez de Landa,Itziar Aldabe,Iker García-Ferrero,Aimar Zabala,Ekhi Azurmendi,German Rigau,Eneko Agirre,Mikel Artetxe,Aitor Soroa

Main category: cs.CL

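TL;DR: The study explores how target-language corpora, multilingual base models, and synthetic instructions can replace conventional instruction adaptation pipelines in low-resource settings, and shows experimentally that an instruction-tuned backbone outperforms a non-instructed base model.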

Details

Motivation: Low-resource languages lack large instruction datasets, which makes it hard to train instructed language models for them. Method: Systematic experiments over combinations of target-language corpora, multilingual base and instructed models, and synthetic instructions, evaluated on benchmarks and human preferences. Result: Target-language corpora are essential, synthetic instructions yield robust models, an instruction-tuned backbone outperforms a non-instructed base model, and results improve further with scale. Conclusion: For low-resource languages, combining target-language corpora with an instruction-tuned backbone substantially improves performance and approaches much larger frontier models. Abstract: Instructing language models with user intent requires large instruction datasets, which are only available for a limited set of languages. In this paper, we explore alternatives to conventional instruction adaptation pipelines in low-resource scenarios. We assume a realistic scenario for low-resource languages, where only the following are available: corpora in the target language, existing open-weight multilingual base and instructed backbone LLMs, and synthetically generated instructions sampled from the instructed backbone. We present a comprehensive set of experiments for Basque that systematically study different combinations of these components evaluated on benchmarks and human preferences from 1,680 participants. Our conclusions show that target language corpora are essential, with synthetic instructions yielding robust models, and, most importantly, that using as backbone an instruction-tuned model outperforms using a base non-instructed model, with results improving further when scaling up. Using Llama 3.1 instruct 70B as backbone, our model comes near frontier models of much larger sizes for Basque, without using any Basque data apart from the 1.2B word corpora. We release code, models, instruction datasets, and human preferences to support full reproducibility in future research on low-resource language adaptation.

[307] PolitiSky24: U.S. Political Bluesky Dataset with User Stance Labels

Peyman Rostami,Vahid Rahimzadeh,Ali Adibi,Azadeh Shakery

Main category: cs.CL

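TL;DR: The paper presents PolitiSky24, the first stance detection dataset for the 2024 U.S. presidential election, collected from Bluesky, centered on Kamala Harris and Donald Trump, and comprising 16,044 user-target stance pairs.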

Details

Motivation: Existing stance detection datasets focus mostly on the tweet level, while user-level data is scarce, especially on emerging platforms such as Bluesky. User-level stance detection gives a more complete picture of a user's views. Method: An annotation pipeline combining advanced information retrieval and large language models, which produces stance labels with supporting rationales and text spans and reaches 81% labeling accuracy. Result: The dataset includes user posting histories, interaction graphs, and engagement metadata, filling gaps in political stance analysis in terms of timeliness, openness, and user-level perspective. Conclusion: PolitiSky24 is a timely, open, user-level resource for political stance analysis, and the dataset is publicly available. Abstract: Stance detection identifies the viewpoint expressed in text toward a specific target, such as a political figure. While previous datasets have focused primarily on tweet-level stances from established platforms, user-level stance resources, especially on emerging platforms like Bluesky, remain scarce. User-level stance detection provides a more holistic view by considering a user's complete posting history rather than isolated posts. We present the first stance detection dataset for the 2024 U.S. presidential election, collected from Bluesky and centered on Kamala Harris and Donald Trump. The dataset comprises 16,044 user-target stance pairs enriched with engagement metadata, interaction graphs, and user posting histories. PolitiSky24 was created using a carefully evaluated pipeline combining advanced information retrieval and large language models, which generates stance labels with supporting rationales and text spans for transparency. The labeling approach achieves 81% accuracy with scalable LLMs. This resource addresses gaps in political stance analysis through its timeliness, open-data nature, and user-level perspective. The dataset is available at https://doi.org/10.5281/zenodo.15616911

[308] Vuyko Mistral: Adapting LLMs for Low-Resource Dialectal Translation

Roman Kyslyi,Yuliia Maksymiuk,Ihor Pysmennyi

Main category: cs.CL

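TL;DR: This paper is the first effort to adapt large language models (LLMs) to a Ukrainian dialect (Hutsul). It builds a parallel corpus and a dictionary, uses RAG to generate synthetic data, and fine-tunes models that outperform GPT-4o on the translation task.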

Details

Motivation: The Hutsul dialect is low-resource and morphologically complex, and no LLMs have been adapted to it, so data must be built and models optimized to support dialect translation. Method: Creates a parallel corpus and a dictionary, proposes a RAG pipeline to generate synthetic data, fine-tunes LLMs with LoRA, and adopts a multi-metric evaluation strategy. Result: Small (7B) fine-tuned models outperform the GPT-4o baseline on both automatic and LLM-judged metrics. Conclusion: Data augmentation plus fine-tuning adapts models effectively to a low-resource dialect and offers a workable recipe for similar tasks. Abstract: In this paper we introduce the first effort to adapt large language models (LLMs) to the Ukrainian dialect (in our case Hutsul), a low-resource and morphologically complex dialect spoken in the Carpathian Highlands. We created a parallel corpus of 9852 dialect-to-standard Ukrainian sentence pairs and a dictionary of 7320 dialectal word mappings. We also addressed data shortage by proposing an advanced Retrieval-Augmented Generation (RAG) pipeline to generate synthetic parallel translation pairs, expanding the corpus with 52142 examples. We have fine-tuned multiple open-source LLMs using LoRA and evaluated them on a standard-to-dialect translation task, also comparing with few-shot GPT-4o translation. In the absence of human annotators, we adopt a multi-metric evaluation strategy combining BLEU, chrF++, TER, and LLM-based judgment (GPT-4o). The results show that even small (7B) fine-tuned models outperform zero-shot baselines such as GPT-4o across both automatic and LLM-evaluated metrics. All data, models, and code are publicly released at: https://github.com/woters/vuyko-hutsul

[309] LoRMA: Low-Rank Multiplicative Adaptation for LLMs

Harsh Bihany,Shubham Patel,Ashutosh Modi

Main category: cs.CL

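TL;DR: The paper proposes LoRMA, which replaces the usual additive low-rank update with a multiplicative matrix transformation and addresses the resulting computational-complexity and rank-bottleneck problems.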

Details

Motivation: Low-rank adaptation methods such as LoRA improve efficiency, but the limits of additive updates motivate exploring a richer space of transformations. Method: Proposes LoRMA, which applies multiplicative matrix transformations and uses operation re-ordering and rank inflation strategies to keep computation efficient and performance strong. Result: Experiments show that LoRMA performs well across a range of evaluation metrics. Conclusion: LoRMA is an efficient alternative for low-rank adaptation with broad potential for application. Abstract: Large Language Models have shown remarkable capabilities in the NLP domain. Their effectiveness can mainly be attributed to their ability to adapt to an array of downstream tasks. However, generally, full fine-tuning is a computationally expensive job. To mitigate this, many techniques have been developed that prioritize efficiency, a prominent one being Low-Rank Adaptation (LoRA). However, LoRA and its variants employ re-parametrized additive updates. In this paper, we propose Low-Rank Multiplicative Adaptation (LoRMA), which shifts the paradigm of additive updates to a richer space of matrix multiplicative transformations. We tackle challenges such as computational complexity and rank bottleneck of matrix multiplication by effectively re-ordering operations and introducing rank inflation strategies. We conduct extensive experiments to demonstrate the effectiveness of our approach in terms of various evaluation metrics.
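The contrast between additive and multiplicative low-rank updates, and the rank bottleneck the abstract mentions, can be seen in a small NumPy sketch. This is not the authors' implementation: the shapes, the random matrices, and the specific form `(B @ A) @ W` are assumptions for illustration, and the paper's rank inflation strategies are not specified in the abstract and are not sketched here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                      # hypothetical model width and adapter rank

W = rng.standard_normal((d, d))  # frozen pretrained weight
B = rng.standard_normal((d, r))  # trainable low-rank factors
A = rng.standard_normal((r, d))
x = rng.standard_normal(d)

# Additive update (LoRA-style): the result is generically full rank.
W_add = W + B @ A
rank_add = np.linalg.matrix_rank(W_add)

# Naive multiplicative update: rank(B A W) <= r, the "rank bottleneck".
W_mul = (B @ A) @ W
rank_mul = np.linalg.matrix_rank(W_mul)

# Re-ordering operations: apply W, then A, then B to the input, which avoids
# ever forming the d x d product B @ A.
y_naive = W_mul @ x
y_reordered = B @ (A @ (W @ x))
```

Both orderings give the same output, but the re-ordered form only performs matrix-vector products, two of them against the thin factors `A` and `B`. The collapse of `rank_mul` to at most `r` is what a rank inflation step would need to counteract.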

[310] Intent Matters: Enhancing AI Tutoring with Fine-Grained Pedagogical Intent Annotation

Kseniia Petukhova,Ekaterina Kochmar

Main category: cs.CL

TL;DR: Fine-grained annotation of teacher intents improves the quality of LLM-generated responses in educational settings.

Details

Motivation: Current LLMs are not aligned with pedagogical strategies in educational applications and need task-specific adaptation to be effective. Method: Re-annotates part of the MathDial dataset with a fine-grained taxonomy of eleven pedagogical intents and fine-tunes an LLM on the new annotations. Result: The fine-grained model produces responses that are more pedagogically aligned and more effective than those of models trained on the original four-category taxonomy. Conclusion: Fine-grained intent annotation is valuable for controlled text generation in educational settings. Abstract: Large language models (LLMs) hold great promise for educational applications, particularly in intelligent tutoring systems. However, effective tutoring requires alignment with pedagogical strategies - something current LLMs lack without task-specific adaptation. In this work, we explore whether fine-grained annotation of teacher intents can improve the quality of LLM-generated tutoring responses. We focus on MathDial, a dialog dataset for math instruction, and apply an automated annotation framework to re-annotate a portion of the dataset using a detailed taxonomy of eleven pedagogical intents. We then fine-tune an LLM using these new annotations and compare its performance to models trained on the original four-category taxonomy. Both automatic and qualitative evaluations show that the fine-grained model produces more pedagogically aligned and effective responses. Our findings highlight the value of intent specificity for controlled text generation in educational settings, and we release our annotated data and code to facilitate further research.

[311] Unblocking Fine-Grained Evaluation of Detailed Captions: An Explaining AutoRater and Critic-and-Revise Pipeline

Brian Gordon,Yonatan Bitton,Andreea Marzoca,Yasumasa Onoe,Xiao Wang,Daniel Cohen-Or,Idan Szpektor

Main category: cs.CL

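TL;DR: The paper introduces the DOCCI-Critique benchmark and the VNLI-Critique model for evaluating and improving the factual accuracy of paragraph-length captions generated by vision-language models (VLMs), and demonstrates their effectiveness in several applications.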

Details

Motivation: Current methods for evaluating the factual accuracy of VLM-generated paragraphs fall short: they miss fine-grained errors and lack datasets with verified inaccuracies. Method: Introduces the DOCCI-Critique benchmark (1,400 VLM-generated captions with sentence-level human annotations) and the VNLI-Critique model for automated sentence-level factuality classification and critique generation. Result: VNLI-Critique performs strongly on several benchmarks, the AutoRater aligns closely with human judgments, and the Critic-and-Revise pipeline substantially improves caption factuality. Conclusion: The work provides key tools for fine-grained evaluation and helps improve VLM image understanding. Abstract: Large Vision-Language Models (VLMs) now generate highly detailed, paragraph-length image captions, yet evaluating their factual accuracy remains challenging. Current methods often miss fine-grained errors, being designed for shorter texts or lacking datasets with verified inaccuracies. We introduce DOCCI-Critique, a benchmark with 1,400 VLM-generated paragraph captions (100 images, 14 VLMs) featuring over 10,216 sentence-level human annotations of factual correctness and explanatory rationales for errors, all within paragraph context. Building on this, we develop VNLI-Critique, a model for automated sentence-level factuality classification and critique generation. We highlight three key applications: (1) VNLI-Critique demonstrates robust generalization, validated by state-of-the-art performance on the M-HalDetect benchmark and strong results in CHOCOLATE claim verification. (2) The VNLI-Critique driven AutoRater for DOCCI-Critique provides reliable VLM rankings, showing excellent alignment with human factuality judgments (e.g., 0.98 Spearman). (3) An innovative Critic-and-Revise pipeline, where critiques from VNLI-Critique guide LLM-based corrections, achieves substantial improvements in caption factuality (e.g., a 46% gain on DetailCaps-4870). Our work offers a crucial benchmark alongside practical tools, designed to significantly elevate the standards for fine-grained evaluation and foster the improvement of VLM image understanding. Project page: https://google.github.io/unblocking-detail-caption

[312] TreeReview: A Dynamic Tree of Questions Framework for Deep and Efficient LLM-based Scientific Peer Review

Yuan Chang,Ziyue Li,Hengyuan Zhang,Yuanbo Kong,Yanru Wu,Zhijiang Guo,Ngai Wong

Main category: cs.CL

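TL;DR: TreeReview is a hierarchical, bidirectional question-answering framework that generates comprehensive, in-depth paper reviews while substantially reducing compute cost.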

Details

Motivation: Current LLM-assisted peer review struggles to be both thorough and efficient, so a more efficient and in-depth review method is needed. Method: Recursively decomposes high-level questions into fine-grained sub-questions to build a question tree, generates follow-up questions through a dynamic question expansion mechanism, and aggregates answers from leaf to root to produce the review. Result: Experiments show that TreeReview outperforms baselines in producing comprehensive, in-depth, expert-aligned review feedback while reducing LLM token usage by up to 80%. Conclusion: TreeReview offers an efficient and in-depth approach to paper review that is suitable for practical use. Abstract: While Large Language Models (LLMs) have shown significant potential in assisting peer review, current methods often struggle to generate thorough and insightful reviews while maintaining efficiency. In this paper, we propose TreeReview, a novel framework that models paper review as a hierarchical and bidirectional question-answering process. TreeReview first constructs a tree of review questions by recursively decomposing high-level questions into fine-grained sub-questions and then resolves the question tree by iteratively aggregating answers from leaf to root to get the final review. Crucially, we incorporate a dynamic question expansion mechanism to enable deeper probing by generating follow-up questions when needed. We construct a benchmark derived from ICLR and NeurIPS venues to evaluate our method on full review generation and actionable feedback comments generation tasks. Experimental results of both LLM-based and human evaluation show that TreeReview outperforms strong baselines in providing comprehensive, in-depth, and expert-aligned review feedback, while reducing LLM token usage by up to 80% compared to computationally intensive approaches. Our code and benchmark dataset are available at https://github.com/YuanChang98/tree-review.
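The leaf-to-root aggregation described in the abstract can be sketched as a simple recursion over a question tree. This is a structural illustration only: `QuestionNode`, `resolve`, and the stub callbacks are hypothetical names, the LLM calls are replaced by string-formatting stand-ins, and the dynamic question expansion step is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class QuestionNode:
    """One review question; leaves have no sub-questions."""
    question: str
    children: List["QuestionNode"] = field(default_factory=list)


def resolve(
    node: QuestionNode,
    answer_leaf: Callable[[str], str],
    aggregate: Callable[[str, List[str]], str],
) -> str:
    """Answer leaves directly, then merge child answers upward to the root."""
    if not node.children:
        return answer_leaf(node.question)
    child_answers = [resolve(child, answer_leaf, aggregate) for child in node.children]
    return aggregate(node.question, child_answers)


# Stand-ins for LLM calls (assumptions for illustration only).
def stub_answer(question: str) -> str:
    return f"<{question}>"


def stub_aggregate(question: str, answers: List[str]) -> str:
    return f"{question}(" + ", ".join(answers) + ")"


tree = QuestionNode(
    "overall",
    [
        QuestionNode("method", [QuestionNode("novelty"), QuestionNode("clarity")]),
        QuestionNode("experiments"),
    ],
)
review = resolve(tree, stub_answer, stub_aggregate)
```

In a real system each callback would be a model call, and only the aggregated answers travel upward, which is one plausible reason such a design can save tokens compared with feeding everything to the model at once.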

[313] Evaluating LLMs Robustness in Less Resourced Languages with Proxy Models

Maciej Chrabąszcz,Katarzyna Lorenc,Karolina Seweryn

Main category: cs.CL

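TL;DR: The study finds that large language models (LLMs) are vulnerable to character-level and word-level attacks in low-resource languages such as Polish, which drastically change their predictions and may circumvent their internal safety mechanisms.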

Details

Motivation: LLMs perform well on multilingual tasks, but safety training data is concentrated in high-resource languages such as English, which leaves potential vulnerabilities in low-resource languages. Method: Builds low-cost attacks by altering a few characters and using a small proxy model to compute word importance, and validates them on Polish. Result: The attacks drastically change LLM predictions, exposing potential vulnerabilities in low-resource languages. Conclusion: The study highlights LLM safety issues in low-resource languages and releases datasets and code for further research. Abstract: Large language models (LLMs) have demonstrated impressive capabilities across various natural language processing (NLP) tasks in recent years. However, their susceptibility to jailbreaks and perturbations necessitates additional evaluations. Many LLMs are multilingual, but safety-related training data contains mainly high-resource languages like English. This can leave them vulnerable to perturbations in low-resource languages such as Polish. We show how surprisingly strong attacks can be cheaply created by altering just a few characters and using a small proxy model for word importance calculation. We find that these character and word-level attacks drastically alter the predictions of different LLMs, suggesting a potential vulnerability that can be used to circumvent their internal safety mechanisms. We validate our attack construction methodology on Polish, a low-resource language, and find potential vulnerabilities of LLMs in this language. Additionally, we show how it can be extended to other languages. We release the created datasets and code for further research.

[314] Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation

Rui Hu,Xiaolong Lin,Jiawang Liu,Shixi Huang,Zhenpeng Zhan

Main category: cs.CL

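TL;DR: Proposes an annotation method for Japanese TTS datasets based on a pre-trained ASR model, with dictionary prior knowledge used to refine phonemic labels; it outperforms text-only and audio-only approaches.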

Details

Motivation: Building high-quality Japanese TTS datasets requires accurate phonemic and prosodic labels, and traditional methods that rely on text or audio alone have limited accuracy. Method: Fine-tunes a pre-trained ASR model, conditioned on ground-truth transcripts, to output phrase-level graphemes and annotation labels simultaneously, with a decoding strategy that uses dictionary prior knowledge. Result: Objective evaluation shows gains over previous methods, and subjective evaluation indicates quality close to manual annotation. Conclusion: The method efficiently produces high-quality annotations suitable for TTS dataset construction. Abstract: In this paper, we propose a method for annotating phonemic and prosodic labels on a given audio-transcript pair, aimed at constructing Japanese text-to-speech (TTS) datasets. Our approach involves fine-tuning a large-scale pre-trained automatic speech recognition (ASR) model, conditioned on ground truth transcripts, to simultaneously output phrase-level graphemes and annotation labels. To further correct errors in phonemic labeling, we employ a decoding strategy that utilizes dictionary prior knowledge. The objective evaluation results demonstrate that our proposed method outperforms previous approaches relying solely on text or audio. The subjective evaluation results indicate that the naturalness of speech synthesized by the TTS model, trained with labels annotated using our method, is comparable to that of a model trained with manual annotations.

[315] Beyond Benchmarks: A Novel Framework for Domain-Specific LLM Evaluation and Knowledge Mapping

Nitin Sharma,Thomas Wolfers,Çağatay Yıldız

Main category: cs.CL

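TL;DR: The paper proposes a deterministic pipeline that converts raw domain corpora into completion-type benchmarks, addressing the reliability of domain-specific evaluation and the question of knowledge representation, and validates it experimentally.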

Details

Motivation: Addresses two key challenges in language model evaluation: creating reliable domain-specific benchmarks and understanding knowledge representation during domain adaptation. Method: Introduces a deterministic pipeline that uses TF and Term TF-IDF methods to generate domain keywords and prompt-target pairs, builds a benchmark from them, and evaluates how well models complete the prompts. Result: Experiments show that the benchmark correlates strongly with expert-generated benchmarks and is more accurate than traditional perplexity metrics; they also reveal rapid adaptation in smaller models and a layered structure of knowledge representation. Conclusion: The work provides a practical evaluation method for domain-specific language models and new insights into knowledge representation and efficient fine-tuning strategies during domain adaptation. Abstract: The paper addresses two critical challenges in language model (LM) evaluation: creating reliable domain-specific benchmarks and understanding knowledge representation during domain adaptation. We introduce a deterministic pipeline that converts raw domain corpora into completion-type benchmarks without relying on LMs or human curation, eliminating benchmark contamination issues while enabling evaluation on the latest domain data. Our approach generates domain-specific keywords and related word lists using TF and Term TF-IDF methods and constructs prompt-target pairs. We evaluate models by measuring their ability to complete these prompts with the correct domain-specific targets, providing a direct assessment of domain knowledge with low computational cost. Through comprehensive experiments across multiple models (GPT-2 medium/XL, Llama-2/3.1, OLMo-2, Qwen-2, Mistral) and domains, we demonstrate that our benchmark strongly correlates with expert-generated benchmarks while providing a more accurate measure of domain knowledge than traditional perplexity metrics. We reveal that domain adaptation happens rapidly in smaller models (within 500 steps) and illustrate a new approach to domain knowledge evaluation in base models during training for early stopping. By extending mechanistic analysis to domain adaptation, we discover that initial-to-mid layers are primarily responsible for attribute extraction, while later layers focus on next token prediction. Furthermore, we show that during adaptation, forgetting begins in the middle layers, where attribute extraction happens, and is amplified in later layers. Our work provides both a practical evaluation methodology for domain-specific LMs and novel insights into knowledge representation during adaptation, with implications for more efficient fine-tuning strategies and targeted approaches to mitigate catastrophic forgetting.
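As a rough illustration of deterministic keyword extraction, the sketch below ranks the words of each document by a textbook TF-IDF score using only the standard library. It is not the paper's pipeline: the abstract's "Term TF-IDF" variant is not specified, so plain TF-IDF is swapped in, and the whitespace tokenizer, the function name `tfidf_keywords`, and the toy documents are assumptions.

```python
import math
from collections import Counter
from typing import List


def tfidf_keywords(docs: List[str], top_k: int = 3) -> List[List[str]]:
    """Return the top_k highest-scoring TF-IDF terms for each document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score, breaking ties alphabetically so the
        # output is fully deterministic.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:top_k])
    return results


# Hypothetical two-document "corpus" from two domains.
docs = ["the cell divides the cell", "the court rules"]
keywords = tfidf_keywords(docs, top_k=1)
```

Words shared by every document (here "the") score zero, so the top-ranked terms are the ones that distinguish a domain; such terms could then serve as targets in prompt-target pairs.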

[316] Synthesis by Design: Controlled Data Generation via Structural Guidance

Lei Xu,Sirui Chen,Yuxuan Huang,Chaochao Lu

Main category: cs.CL

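TL;DR: The paper proposes extracting structural information from generated problem-solving code to strengthen LLM mathematical reasoning, and produces a high-quality dataset and benchmark.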

Details

Motivation: Existing data synthesis methods suffer from problems with generation quality and problem complexity, which limits improvements in LLM mathematical reasoning. Method: Extracts structural information via generated problem-solving code and uses it to generate a dataset with labeled intermediate steps. Result: Produces 39K problems and a 6.1K-problem benchmark of higher difficulty, and validates the effectiveness of the dataset. Conclusion: The method is an effective tool for improving LLM reasoning and may support future research. Abstract: Mathematical reasoning remains challenging for LLMs due to complex logic and the need for precise computation. Existing methods enhance LLM reasoning by synthesizing datasets through problem rephrasing, but face issues with generation quality and problem complexity. To address this, we propose to extract structural information with generated problem-solving code from mathematical reasoning and guide data generation with structured solutions. Applied to MATH and GSM8K, our approach produces 39K problems with labeled intermediate steps and a 6.1K-problem benchmark of higher difficulty. Results on our benchmark show that model performance declines as reasoning length increases. Additionally, we conducted fine-tuning experiments using the proposed training data on a range of LLMs, and the results validate the effectiveness of our dataset. We hope the proposed method and dataset will contribute to future research in enhancing LLM reasoning capabilities.

[317] Silencing Empowerment, Allowing Bigotry: Auditing the Moderation of Hate Speech on Twitch

Prarabdh Shukla,Wei Yin Chong,Yash Patel,Brennan Schaffner,Danish Pruthi,Arjun Bhagoji

Main category: cs.CL

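TL;DR: The paper audits AutoMod, Twitch's automated moderation tool, and finds that it has significant gaps in flagging hateful content while also blocking a large share of benign content.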

Details

Motivation: With the rise of real-time engagement platforms such as Twitch, the effectiveness of automated moderation systems has become a key question, yet little is known about how well they perform. Method: Creates test accounts and uses Twitch's APIs to send over 107,000 comments, measuring how well AutoMod flags hateful content. Result: On some datasets, up to 94% of hateful messages bypass AutoMod, while it blocks up to 89.5% of benign examples that contain sensitive words. Conclusion: The audit shows major gaps in AutoMod and underscores the importance of contextual understanding for automated moderation systems. Abstract: To meet the demands of content moderation, online platforms have resorted to automated systems. Newer forms of real-time engagement (e.g., users commenting on live streams) on platforms like Twitch exert additional pressures on the latency expected of such moderation systems. Despite their prevalence, relatively little is known about the effectiveness of these systems. In this paper, we conduct an audit of Twitch's automated moderation tool (AutoMod) to investigate its effectiveness in flagging hateful content. For our audit, we create streaming accounts to act as siloed test beds, and interface with the live chat using Twitch's APIs to send over 107,000 comments collated from 4 datasets. We measure AutoMod's accuracy in flagging blatantly hateful content containing misogyny, racism, ableism and homophobia. Our experiments reveal that a large fraction of hateful messages, up to 94% on some datasets, bypass moderation. Contextual addition of slurs to these messages results in 100% removal, revealing AutoMod's reliance on slurs as a moderation signal. We also find that contrary to Twitch's community guidelines, AutoMod blocks up to 89.5% of benign examples that use sensitive words in pedagogical or empowering contexts. Overall, our audit points to large gaps in AutoMod's capabilities and underscores the importance for such systems to understand context effectively.

[318] GaRAGe: A Benchmark with Grounding Annotations for RAG Evaluation

Ionut-Teodor Sorodoc,Leonardo F. R. Ribeiro,Rexhina Blloshmi,Christopher Davis,Adrià de Gispert

Main category: cs.CL

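TL;DR: GaRAGe is a large RAG benchmark with human-curated long-form answers and annotated grounding passages, used to evaluate whether LLMs can identify relevant grounding when generating RAG answers.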

Details

Motivation: In RAG tasks, current LLMs tend to over-summarise rather than ground answers strictly in the relevant passages or deflect when no relevant information is available, so a finer-grained evaluation tool is needed. Method: Builds a benchmark of 2366 diverse questions with over 35K annotated passages drawn from private document sets and the Web to reflect real-world RAG use. Result: LLMs perform poorly at grounding strictly in relevant passages (Relevance-Aware Factuality Score of at most 60%) and at deflecting (true positive rate of at most 31%), with an attribution F1 of at most 58.9%. Conclusion: GaRAGe is a suitable test bed for evaluating LLMs on RAG and reveals weaknesses on time-sensitive questions and sparse grounding sources. Abstract: We present GaRAGe, a large RAG benchmark with human-curated long-form answers and annotations of each grounding passage, allowing a fine-grained evaluation of whether LLMs can identify relevant grounding when generating RAG answers. Our benchmark contains 2366 questions of diverse complexity, dynamism, and topics, and includes over 35K annotated passages retrieved from both private document sets and the Web, to reflect real-world RAG use cases. This makes it an ideal test bed to evaluate an LLM's ability to identify only the relevant information necessary to compose a response, or provide a deflective response when there is insufficient information. Evaluations of multiple state-of-the-art LLMs on GaRAGe show that the models tend to over-summarise rather than (a) ground their answers strictly on the annotated relevant passages (reaching at most a Relevance-Aware Factuality Score of 60%), or (b) deflect when no relevant grounding is available (reaching at most 31% true positive rate in deflections). The F1 in attribution to relevant sources is at most 58.9%, and we show that performance is particularly reduced when answering time-sensitive questions and when having to draw knowledge from sparser private grounding sources.
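To show what an attribution F1 of this kind measures, here is a minimal sketch that compares the set of passages a model cites with the set annotated as relevant. The abstract does not define the benchmark's exact metric, so this is the generic set-based F1; the function name `attribution_f1` and the passage identifiers are hypothetical.

```python
from typing import Set


def attribution_f1(cited: Set[str], relevant: Set[str]) -> float:
    """Set-based F1 between cited passages and annotated relevant passages."""
    true_positives = len(cited & relevant)
    if true_positives == 0:
        return 0.0  # also covers empty cited or relevant sets
    precision = true_positives / len(cited)     # how many citations were relevant
    recall = true_positives / len(relevant)     # how many relevant passages were cited
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 3 passages cited, 4 annotated as relevant, 2 in common.
cited = {"p1", "p2", "p3"}
relevant = {"p2", "p3", "p4", "p5"}
f1 = attribution_f1(cited, relevant)  # precision 2/3, recall 1/2, F1 = 4/7
```

A model that cites irrelevant passages loses precision, and one that ignores relevant passages loses recall, so the score rewards citing exactly the annotated grounding.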

[319] Training Superior Sparse Autoencoders for Instruct Models

Jiaming Li,Haoran Ye,Yukun Chen,Xinyue Li,Lei Zhang,Hamid Alinejad-Rokny,Jimmy Chih-Hsien Peng,Min Yang

Main category: cs.CL

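TL;DR: The paper proposes FAST, a new training method that tailors sparse autoencoders (SAEs) to instruct models and significantly improves reconstruction quality and feature interpretability.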

Details

Motivation: Existing SAE training methods are designed mainly for base models and perform poorly on instruct models, so a method tailored to instruct models is needed. Method: Proposes FAST, which adjusts the training process to match the data distribution and activation patterns of instruct models. Result: On Qwen2.5-7B-Instruct, FAST achieves much lower reconstruction error than baseline methods; on Llama3.2-3B-Instruct, it yields a higher proportion of high-quality features. Conclusion: FAST improves performance and also reveals new opportunities to improve output quality by intervening on special-token activations. Abstract: As large language models (LLMs) grow in scale and capability, understanding their internal mechanisms becomes increasingly critical. Sparse autoencoders (SAEs) have emerged as a key tool in mechanistic interpretability, enabling the extraction of human-interpretable features from LLMs. However, existing SAE training methods are primarily designed for base models, resulting in reduced reconstruction quality and interpretability when applied to instruct models. To bridge this gap, we propose Finetuning-aligned Sequential Training (FAST), a novel training method specifically tailored for instruct models. FAST aligns the training process with the data distribution and activation patterns characteristic of instruct models, resulting in substantial improvements in both reconstruction and feature interpretability. On Qwen2.5-7B-Instruct, FAST achieves a mean squared error of 0.6468 in token reconstruction, significantly outperforming baseline methods with errors of 5.1985 and 1.5096. In feature interpretability, FAST yields a higher proportion of high-quality features: for Llama3.2-3B-Instruct, 21.1% scored in the top range, compared to 7.0% and 10.2% for BT(P) and BT(F). Surprisingly, we discover that intervening on the activations of special tokens via the SAEs leads to improvements in output quality, suggesting new opportunities for fine-grained control of model behavior. Code, data, and 240 trained SAEs are available at https://github.com/Geaming2002/FAST.
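For context on the reconstruction error quoted above, the following NumPy sketch shows the forward pass of a generic ReLU sparse autoencoder and the mean squared error between activations and their reconstruction. It is a textbook SAE with random, untrained weights, not the FAST training method; all shapes, names, and the weight scale are assumptions.

```python
import numpy as np


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into non-negative features, then reconstruct them."""
    features = np.maximum(0.0, x @ W_enc + b_enc)  # ReLU keeps features non-negative
    x_hat = features @ W_dec + b_dec
    return features, x_hat


def mse(x, x_hat) -> float:
    """Mean squared reconstruction error over all tokens and dimensions."""
    return float(np.mean((x - x_hat) ** 2))


rng = np.random.default_rng(0)
d_model, d_sae = 4, 16                      # hypothetical sizes; real SAEs are far wider
x = rng.standard_normal((3, d_model))       # stand-in for 3 token activations

W_enc = rng.standard_normal((d_model, d_sae)) * 0.1
W_dec = rng.standard_normal((d_sae, d_model)) * 0.1
features, x_hat = sae_forward(x, W_enc, np.zeros(d_sae), W_dec, np.zeros(d_model))
reconstruction_error = mse(x, x_hat)
```

Training would minimize this error together with a sparsity penalty on `features`; the abstract's point is that which data the SAE is trained on matters for how low the error can go on an instruct model.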

[320] Through the Valley: Path to Effective Long CoT Training for Small Language Models

Renjie Luo,Jiaxi Li,Chen Huang,Wei Lu

Main category: cs.CL

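TL;DR: The study finds that small language models (SLMs) suffer significant performance deterioration under long chain-of-thought (CoT) supervision, a phenomenon the authors call Long CoT Degradation.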

Details

Motivation: Investigates the performance degradation of small language models under long CoT training and its causes. Method: Experiments on the Qwen2.5, LLaMA3, and Gemma3 model families analyze how long CoT training affects SLMs. Result: SLMs lose up to 75% of their performance under long CoT training, and some models fail to recover their original performance. Conclusion: Long CoT training has limited benefit for small models and should be used with care; the paper offers practical guidance for improvement. Abstract: Long chain-of-thought (CoT) supervision has become a common strategy to enhance reasoning in language models. While effective for large models, we identify a phenomenon we call Long CoT Degradation, in which small language models (SLMs; <=3B parameters) trained on limited long CoT data experience significant performance deterioration. Through extensive experiments on the Qwen2.5, LLaMA3 and Gemma3 families, we demonstrate that this degradation is widespread across SLMs. In some settings, models trained on only 8k long CoT examples lose up to 75% of their original performance before fine-tuning. Strikingly, we further observe that for some particularly small models, even training on 220k long CoT examples fails to recover or surpass their original performance prior to fine-tuning. Our analysis attributes this effect to error accumulation: while longer responses increase the capacity for multi-step reasoning, they also amplify the risk of compounding mistakes. Furthermore, we find that Long CoT Degradation may negatively impact downstream reinforcement learning (RL), although this can be alleviated by sufficiently scaled supervised fine-tuning (SFT). Our findings challenge common assumptions about the benefits of long CoT training for SLMs and offer practical guidance for building more effective small-scale reasoning models.

[321] Multilingual Grammatical Error Annotation: Combining Language-Agnostic Framework with Language-Specific Flexibility

Mengyang Qiu,Tran Minh Nguyen,Zihao Huang,Zelong Li,Yang Gu,Qingyu Gao,Siliang Liu,Jungyeul Park

Main category: cs.CL

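TL;DR: Proposes a standardized, modular framework for multilingual grammatical error annotation that supports consistent annotation across languages and flexible language-specific extension.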

Details

Motivation: Existing grammatical error annotation frameworks such as errant face limitations when extended to typologically diverse languages, so a more flexible and consistent multilingual solution is needed. Method: Combines a language-agnostic foundation with structured language-specific extensions and reimplements errant to support broader multilingual coverage. Result: Demonstrates the framework's adaptability on English, German, Czech, Korean, and Chinese, ranging from general-purpose annotation to customized linguistic refinements. Conclusion: The framework supports scalable and interpretable grammatical error annotation across languages and promotes more consistent evaluation in multilingual settings. Abstract: Grammatical Error Correction (GEC) relies on accurate error annotation and evaluation, yet existing frameworks, such as errant, face limitations when extended to typologically diverse languages. In this paper, we introduce a standardized, modular framework for multilingual grammatical error annotation. Our approach combines a language-agnostic foundation with structured language-specific extensions, enabling both consistency and flexibility across languages. We reimplement errant using stanza to support broader multilingual coverage, and demonstrate the framework's adaptability through applications to English, German, Czech, Korean, and Chinese, ranging from general-purpose annotation to more customized linguistic refinements. This work supports scalable and interpretable GEC annotation across languages and promotes more consistent evaluation in multilingual settings. The complete codebase and annotation tools can be accessed at https://github.com/open-writing-evaluation/jp_errant_bea.

[322] Swiss Parliaments Corpus Re-Imagined (SPC_R): Enhanced Transcription with RAG-based Correction and Predicted BLEU

Vincenzo Timmel,Manfred Vogel,Daniel Perruchoud,Reza Kakooee

Main category: cs.CL

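TL;DR: The paper presents a method for building a long-form corpus that converts Swiss parliamentary debate audio into high-quality speech-text pairs, combining Whisper Large-v3 with GPT-4o correction to substantially improve data quality.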

Details

Motivation: Addresses the construction of low-resource, domain-specific speech corpora, in particular the transcription and alignment of Swiss German debate audio. Method: Transcribes with Whisper Large-v3, then applies a two-step GPT-4o correction (named-entity refinement and a semantic completeness check), combined with data-driven filtering. Result: The final corpus contains 801 hours of audio, of which 751 hours pass quality control, with a 6-point BLEU improvement over the original release. Conclusion: Combining ASR, LLM-based correction, and data filtering substantially improves the quality of long-form speech corpora. Abstract: This paper presents a new long-form release of the Swiss Parliaments Corpus, converting entire multi-hour Swiss German debate sessions (each aligned with the official session protocols) into high-quality speech-text pairs. Our pipeline starts by transcribing all session audio into Standard German using Whisper Large-v3 under high-compute settings. We then apply a two-step GPT-4o correction process: first, GPT-4o ingests the raw Whisper output alongside the official protocols to refine misrecognitions, mainly named entities. Second, a separate GPT-4o pass evaluates each refined segment for semantic completeness. We filter out any segments whose Predicted BLEU score (derived from Whisper's average token log-probability) and GPT-4o evaluation score fall below a certain threshold. The final corpus contains 801 hours of audio, of which 751 hours pass our quality control. Compared to the original sentence-level SPC release, our long-form dataset achieves a 6-point BLEU improvement, demonstrating the power of combining robust ASR, LLM-based correction, and data-driven filtering for low-resource, domain-specific speech corpora.

[323] Augmenting LLMs' Reasoning by Reinforcing Abstract Thinking

Silin Gao,Antoine Bosselut,Samy Bengio,Emmanuel Abbe

Main category: cs.CL

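TL;DR: The paper proposes AbstraL, a method that uses reinforcement learning to promote abstract reasoning and counter the performance drop of large language models under distribution shift.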

Details

Motivation: Studies show that language models, especially smaller ones, reason poorly under distribution shift. Existing approaches respond by generating synthetic data, whereas this paper proposes abstracting the reasoning problems instead. Method: Trains models to reason abstractly with reinforcement learning (RL) on granular abstraction data, rather than with conventional supervised fine-tuning alone (AbstraL). Result: AbstraL significantly reduces performance degradation on GSM perturbation benchmarks. Conclusion: Abstraction combined with reinforcement learning effectively improves language model robustness, particularly under distribution shift. Abstract: Recent studies have shown that large language models (LLMs), especially smaller ones, often lack robustness in their reasoning. That is, they tend to experience performance drops when faced with distribution shifts, such as changes to numerical or nominal variables, or insertions of distracting clauses. A possible strategy to address this involves generating synthetic data to further "instantiate" reasoning problems on potential variations. In contrast, our approach focuses on "abstracting" reasoning problems. This not only helps counteract distribution shifts but also facilitates the connection to symbolic tools for deriving solutions. We find that this abstraction process is better acquired through reinforcement learning (RL) than just supervised fine-tuning, which often fails to produce faithful abstractions. Our method, AbstraL -- which promotes abstract reasoning in LLMs using RL on granular abstraction data -- significantly mitigates performance degradation on recent GSM perturbation benchmarks.

[324] LLM Unlearning Should Be Form-Independent

Xiaotian Ye,Mengqi Zhang,Shu Wu

Main category: cs.CL

TL;DR: The paper studies Form-Dependent Bias in LLM unlearning and introduces a new benchmark, ORT, and a new method, ROCR, to improve unlearning effectiveness.

Details

Motivation: Existing unlearning methods have limited efficacy in real-world scenarios, mainly because their effect depends on the form of the training samples and fails to generalize to other expressions of the same knowledge. Method: The ORT benchmark evaluates the robustness of unlearning methods to variations in knowledge expression; ROCR achieves form-independent unlearning by redirecting activated dangerous concepts. Result: Experiments show that ROCR significantly outperforms traditional methods, modifies model parameters within seconds, and generates natural outputs. Conclusion: LLM unlearning should be form-independent, and ROCR offers a promising path toward that goal. Abstract: Large Language Model (LLM) unlearning aims to erase or suppress undesirable knowledge within the model, offering promise for controlling harmful or private information to prevent misuse. However, recent studies highlight its limited efficacy in real-world scenarios, hindering practical adoption. In this study, we identify a pervasive issue underlying many downstream failures: the effectiveness of existing unlearning methods heavily depends on the form of training samples and frequently fails to generalize to alternate expressions of the same knowledge. We formally characterize this problem as Form-Dependent Bias and systematically investigate its specific manifestation patterns across various downstream tasks. To quantify its prevalence and support future research, we introduce ORT, a novel benchmark designed to evaluate the robustness of unlearning methods against variations in knowledge expression. Results reveal that Form-Dependent Bias is both widespread and severe among current techniques. We argue that LLM unlearning should be form-independent to address the endless forms of downstream tasks encountered in real-world security-critical scenarios. Towards this goal, we introduce Rank-one Concept Redirection (ROCR), a novel training-free method, as a promising solution path. ROCR performs unlearning by targeting the invariants in downstream tasks, specifically the activated dangerous concepts. It is capable of modifying model parameters within seconds to redirect the model's perception of a specific unlearning target concept to another harmless concept. Extensive experiments demonstrate that ROCR significantly improves unlearning effectiveness compared to traditional methods while generating highly natural outputs.

[325] MultiMatch: Multihead Consistency Regularization Matching for Semi-Supervised Text Classification

Iustin Sirbu,Robert-Adrian Popovici,Cornelia Caragea,Stefan Trausan-Matu,Traian Rebedea

Main category: cs.CL

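TL;DR: MultiMatch is a semi-supervised learning algorithm that combines co-training and consistency regularization, improving performance through a three-fold pseudo-label weighting module.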

Details

Motivation: Selecting and weighting pseudo-labels is a challenge in semi-supervised learning, and models need to be more robust under data imbalance. Method: MultiMatch combines co-training with consistency regularization and adds a three-fold pseudo-label weighting module that unifies several existing techniques. Result: It achieves state-of-the-art results on 9 of 10 setups across 5 NLP datasets and performs strongly on imbalanced data. Conclusion: MultiMatch clearly outperforms existing methods in performance and robustness, especially on imbalanced tasks. Abstract: We introduce MultiMatch, a novel semi-supervised learning (SSL) algorithm combining the paradigms of co-training and consistency regularization with pseudo-labeling. At its core, MultiMatch features a three-fold pseudo-label weighting module designed for three key purposes: selecting and filtering pseudo-labels based on head agreement and model confidence, and weighting them according to the perceived classification difficulty. This novel module enhances and unifies three existing techniques -- heads agreement from Multihead Co-training, self-adaptive thresholds from FreeMatch, and Average Pseudo-Margins from MarginMatch -- resulting in a holistic approach that improves robustness and performance in SSL settings. Experimental results on benchmark datasets highlight the superior performance of MultiMatch, achieving state-of-the-art results on 9 out of 10 setups from 5 natural language processing datasets and ranking first according to the Friedman test among 19 methods. Furthermore, MultiMatch demonstrates exceptional robustness in highly imbalanced settings, outperforming the second-best approach by 3.26% -- and data imbalance is a key factor for many text classification tasks.

[326] WebUIBench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in WebUI-to-Code

Zhiyu Lin,Zhengda Zhou,Zhiyuan Zhao,Tianrui Wan,Yilun Ma,Junyu Gao,Xuelong Li

Main category: cs.CL

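TL;DR: The paper proposes WebUIBench, a benchmark that systematically evaluates multimodal large language models (MLLMs) in four key areas of web development, addressing the gap left by existing evaluations that look only at webpage generation outcomes.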

Details

Motivation: As generative AI advances, MLLMs could act as AI software engineers, but existing benchmarks do not assess their multidimensional sub-capabilities and so cannot fully guide improvements in development efficiency. Method: Drawing on software engineering principles, the authors design WebUIBench, which contains 21K high-quality question-answer pairs covering four areas: WebUI Perception, HTML Programming, WebUI-HTML Understanding, and WebUI-to-Code. Result: An evaluation of 29 mainstream MLLMs reveals the models' skill characteristics and various weaknesses during the development process. Conclusion: WebUIBench provides a systematic tool for assessing MLLM capabilities in web development and helps guide model improvement. Abstract: With the rapid advancement of Generative AI technology, Multimodal Large Language Models (MLLMs) have the potential to act as AI software engineers capable of executing complex web application development. Considering that the model requires a confluence of multidimensional sub-capabilities to address the challenges of various development phases, constructing a multi-view evaluation framework is crucial for accurately guiding the enhancement of development efficiency. However, existing benchmarks usually fail to provide an assessment of sub-capabilities and focus solely on webpage generation outcomes. In this work, we draw inspiration from the principles of software engineering and further propose WebUIBench, a benchmark systematically designed to evaluate MLLMs in four key areas: WebUI Perception, HTML Programming, WebUI-HTML Understanding, and WebUI-to-Code. WebUIBench comprises 21K high-quality question-answer pairs derived from over 0.7K real-world websites. The extensive evaluation of 29 mainstream MLLMs uncovers the skill characteristics and various weaknesses that models exhibit during the development process.

[327] Learning to Focus: Causal Attention Distillation via Gradient-Guided Token Pruning

Yiju Guo,Wenkai Yang,Zexu Sun,Ning Ding,Zhiyuan Liu,Yankai Lin

Main category: cs.CL

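TL;DR: The paper proposes LeaF, a two-stage framework that uses intervention-based inference to address attention distraction in long-context reasoning with large language models, significantly improving reasoning accuracy and generation quality.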

Details

Motivation: LLMs are easily misled by distracting patterns during long-context reasoning, which scatters attention and lowers reasoning accuracy. The authors aim to improve reasoning by removing these patterns. Method: LeaF has two stages: (1) identify confounding tokens through gradient-based comparison with an advanced teacher model; (2) prune those tokens during distillation so that the student's attention distribution aligns with the teacher's. Result: Experiments show that LeaF significantly improves performance on mathematical reasoning and code generation tasks and effectively suppresses attention to confounding tokens. Conclusion: By applying intervention-based inference to attention distraction, LeaF yields a more reliable and interpretable model. Abstract: Large language models (LLMs) have demonstrated significant improvements in contextual understanding. However, their ability to attend to truly critical information during long-context reasoning and generation still lags behind. Specifically, our preliminary experiments reveal that certain distracting patterns can misdirect the model's attention during inference, and removing these patterns substantially improves reasoning accuracy and generation quality. We attribute this phenomenon to spurious correlations in the training data, which obstruct the model's capacity to infer authentic causal instruction-response relationships. This phenomenon may induce redundant reasoning processes, potentially resulting in significant inference overhead and, more critically, the generation of erroneous or suboptimal responses. To mitigate this, we introduce a two-stage framework called Learning to Focus (LeaF) leveraging intervention-based inference to disentangle confounding factors. In the first stage, LeaF employs gradient-based comparisons with an advanced teacher to automatically identify confounding tokens based on causal relationships in the training corpus. Then, in the second stage, it prunes these tokens during distillation to enact intervention, aligning the student's attention with the teacher's focus distribution on truly critical context tokens. Experimental results demonstrate that LeaF not only achieves an absolute improvement in various mathematical reasoning and code generation benchmarks but also effectively suppresses attention to confounding tokens during inference, yielding a more interpretable and reliable reasoning model.

[328] MEMOIR: Lifelong Model Editing with Minimal Overwrite and Informed Retention for LLMs

Ke Wang,Yiming Qin,Nikolaos Dimitriadis,Alessandro Favero,Pascal Frossard

Main category: cs.CL

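TL;DR: MEMOIR is a new scalable framework that injects knowledge through a residual memory module, avoiding interference while preserving the core capabilities of the pre-trained model.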

Details

Motivation: Deployed language models need to have their knowledge updated efficiently and reliably without harming existing capabilities. Method: A residual memory module and sparse activation masks confine each edit to a distinct subset of the memory parameters, reducing interference. At inference, relevant edits are identified by matching sparse activation patterns. Result: MEMOIR performs strongly on question answering, hallucination correction, and out-of-distribution generalization, supporting thousands of sequential edits with minimal forgetting. Conclusion: MEMOIR reaches state-of-the-art reliability, generalization, and locality, and suits large-scale knowledge updates. Abstract: Language models deployed in real-world systems often require post-hoc updates to incorporate new or corrected knowledge. However, editing such models efficiently and reliably - without retraining or forgetting previous information - remains a major challenge. Existing methods for lifelong model editing either compromise generalization, interfere with past edits, or fail to scale to long editing sequences. We propose MEMOIR, a novel scalable framework that injects knowledge through a residual memory, i.e., a dedicated parameter module, while preserving the core capabilities of the pre-trained model. By sparsifying input activations through sample-dependent masks, MEMOIR confines each edit to a distinct subset of the memory parameters, minimizing interference among edits. At inference, it identifies relevant edits by comparing the sparse activation patterns of new queries to those stored during editing. This enables generalization to rephrased queries by activating only the relevant knowledge while suppressing unnecessary memory activation for unrelated prompts. Experiments on question answering, hallucination correction, and out-of-distribution generalization benchmarks across LLaMA-3 and Mistral demonstrate that MEMOIR achieves state-of-the-art performance across reliability, generalization, and locality metrics, scaling to thousands of sequential edits with minimal forgetting.
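
The masking-and-matching idea described in the abstract above can be illustrated with a toy sketch. This is not the authors' implementation: the top-k masking rule, the overlap score, the 0.5 threshold, and every function name below are assumptions chosen only for illustration.

```python
from typing import Dict, List, Optional, Set


def top_k_mask(activations: List[float], k: int) -> Set[int]:
    """Indices of the k largest-magnitude activations (assumed masking rule)."""
    order = sorted(range(len(activations)), key=lambda i: abs(activations[i]), reverse=True)
    return set(order[:k])


def overlap(a: Set[int], b: Set[int]) -> float:
    """Fraction of indices in mask a that also appear in mask b."""
    return len(a & b) / len(a) if a else 0.0


def match_edit(query_mask: Set[int], stored: Dict[str, Set[int]],
               threshold: float = 0.5) -> Optional[str]:
    """Return the stored edit with the highest mask overlap, or None if below threshold."""
    best_name, best_score = None, 0.0
    for name, mask in stored.items():
        score = overlap(query_mask, mask)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


# Hypothetical activation vectors; each edit keeps its own sparse index set.
stored_masks = {
    "edit_A": top_k_mask([0.9, 0.1, 0.8, 0.0, 0.2, 0.7, 0.0, 0.1], k=3),
    "edit_B": top_k_mask([0.0, 0.9, 0.1, 0.8, 0.7, 0.0, 0.2, 0.1], k=3),
}
rephrased = top_k_mask([0.8, 0.2, 0.9, 0.1, 0.0, 0.6, 0.1, 0.3], k=3)
unrelated = top_k_mask([0.5, 0.0, 0.1, 0.0, 0.2, 0.1, 0.9, 0.8], k=3)

matched_rephrased = match_edit(rephrased, stored_masks)
matched_unrelated = match_edit(unrelated, stored_masks)
```

In this toy setup the rephrased query shares its active indices with `edit_A` and retrieves it, while the unrelated query overlaps too little with either stored mask and retrieves nothing, which mirrors the "activate only the relevant knowledge" behaviour the abstract describes.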

[329] MiniCPM4: Ultra-Efficient LLMs on End Devices

MiniCPM Team,Chaojun Xiao,Yuxuan Li,Xu Han,Yuzhuo Bai,Jie Cai,Haotian Chen,Wentong Chen,Xin Cong,Ganqu Cui,Ning Ding,Shengdan Fan,Yewei Fang,Zixuan Fu,Wenyu Guan,Yitong Guan,Junshao Guo,Yufeng Han,Bingxiang He,Yuxiang Huang,Cunliang Kong,Qiuzuo Li,Siyuan Li,Wenhao Li,Yanghao Li,Yishan Li,Zhen Li,Dan Liu,Biyuan Lin,Yankai Lin,Xiang Long,Quanyu Lu,Yaxi Lu,Peiyan Luo,Hongya Lyu,Litu Ou,Yinxu Pan,Zekai Qu,Qundong Shi,Zijun Song,Jiayuan Su,Zhou Su,Ao Sun,Xianghui Sun,Peijun Tang,Fangzheng Wang,Feng Wang,Shuo Wang,Yudong Wang,Yesai Wu,Zhenyu Xiao,Jie Xie,Zihao Xie,Yukun Yan,Jiarui Yuan,Kaihuo Zhang,Lei Zhang,Linyue Zhang,Xueren Zhang,Yudi Zhang,Hengyu Zhao,Weilin Zhao,Weilun Zhao,Yuanqian Zhao,Zhi Zheng,Ge Zhou,Jie Zhou,Wei Zhou,Zihan Zhou,Zixuan Zhou,Zhiyuan Liu,Guoyang Zeng,Chao Jia,Dahai Li,Maosong Sun

Main category: cs.CL

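TL;DR: MiniCPM4 is an efficient large language model for end-side devices. It achieves its efficiency through innovations in model architecture, training data, training algorithms, and inference systems, and outperforms comparable open-source models on multiple benchmarks.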

Details

Motivation: Design an efficient large language model for end-side devices that meets diverse requirements. Method: The paper proposes the InfLLM v2 sparse attention mechanism, the UltraClean data strategy and UltraChat v2 dataset, the ModelTunnel v2 pre-training strategy search, and the CPM.cu inference system. Result: MiniCPM4 outperforms similar-sized models on multiple benchmarks and is markedly faster on long sequences. Conclusion: Through these design innovations MiniCPM4 achieves efficiency and broad usability, and it has been applied successfully in a variety of scenarios. Abstract: This paper introduces MiniCPM4, a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems. Specifically, in terms of model architecture, we propose InfLLM v2, a trainable sparse attention mechanism that accelerates both prefilling and decoding phases for long-context processing. Regarding training data, we propose UltraClean, an efficient and accurate pre-training data filtering and generation strategy, and UltraChat v2, a comprehensive supervised fine-tuning dataset. These datasets enable satisfactory model performance to be achieved using just 8 trillion training tokens. Regarding training algorithms, we propose ModelTunnel v2 for efficient pre-training strategy search, and improve existing post-training methods by introducing chunk-wise rollout for load-balanced reinforcement learning and data-efficient ternary LLM, BitCPM. Regarding inference systems, we propose CPM.cu that integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding. To meet diverse on-device requirements, MiniCPM4 is available in two versions, with 0.5B and 8B parameters, respectively. Sufficient evaluation results show that MiniCPM4 outperforms open-source models of similar size across multiple benchmarks, highlighting both its efficiency and effectiveness. Notably, MiniCPM4-8B demonstrates significant speed improvements over Qwen3-8B when processing long sequences. Through further adaptation, MiniCPM4 successfully powers diverse applications, including trustworthy survey generation and tool use with model context protocol, clearly showcasing its broad usability.

[330] Quantum Graph Transformer for NLP Sentiment Classification

Shamminuj Aktar,Andreas Bärtschi,Abdel-Hameed A. Badawy,Stephan Eidenbenz

Main category: cs.CL

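TL;DR: The paper proposes the Quantum Graph Transformer (QGT), which integrates a quantum self-attention mechanism into a message-passing framework for structured language modeling, significantly reducing the number of trainable parameters and performing well on sentiment classification.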

Details

Motivation: Quantum machine learning shows promise for modeling complex structured data, but existing methods leave room for improvement in parameter efficiency and performance. Method: QGT implements quantum self-attention with parameterized quantum circuits (PQCs) and integrates it into a graph message-passing framework. Result: On five sentiment classification benchmarks QGT matches or exceeds existing QNLP models. Compared with an equivalent classical graph transformer, average accuracy improves by 5.42% on real-world datasets and 4.76% on synthetic datasets, and QGT needs nearly 50% fewer labeled samples to reach comparable performance on the Yelp dataset. Conclusion: QGT demonstrates the potential of graph-based QNLP techniques for efficient and scalable language understanding. Abstract: Quantum machine learning is a promising direction for building more efficient and expressive models, particularly in domains where understanding complex, structured data is critical. We present the Quantum Graph Transformer (QGT), a hybrid graph-based architecture that integrates a quantum self-attention mechanism into the message-passing framework for structured language modeling. The attention mechanism is implemented using parameterized quantum circuits (PQCs), which enable the model to capture rich contextual relationships while significantly reducing the number of trainable parameters compared to classical attention mechanisms. We evaluate QGT on five sentiment classification benchmarks. Experimental results show that QGT consistently achieves higher or comparable accuracy than existing quantum natural language processing (QNLP) models, including both attention-based and non-attention-based approaches. When compared with an equivalent classical graph transformer, QGT yields an average accuracy improvement of 5.42% on real-world datasets and 4.76% on synthetic datasets. Additionally, QGT demonstrates improved sample efficiency, requiring nearly 50% fewer labeled samples to reach comparable performance on the Yelp dataset. These results highlight the potential of graph-based QNLP techniques for advancing efficient and scalable language understanding.

[331] Statistical Hypothesis Testing for Auditing Robustness in Language Models

Paulius Rauba,Qiyao Wei,Mihaela van der Schaar

Main category: cs.CL

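TL;DR: The paper proposes distribution-based perturbation analysis, a framework for testing whether the outputs of a large language model (LLM) change under an intervention, a problem that existing methods do not handle.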

Details

Motivation: Existing methods cannot reliably compare changes in LLM outputs, particularly given output stochasticity and computational intractability. Method: Empirical distributions are built by Monte Carlo sampling in a low-dimensional semantic similarity space, turning perturbation analysis into a frequentist hypothesis testing problem. Result: The framework is model-agnostic, supports arbitrary input perturbations, yields interpretable p-values, controls error rates, and provides scalar effect sizes. Conclusion: The framework offers a reliable frequentist hypothesis testing approach to LLM auditing that applies to a range of practical scenarios. Abstract: Consider the problem of testing whether the outputs of a large language model (LLM) system change under an arbitrary intervention, such as an input perturbation or changing the model variant. We cannot simply compare two LLM outputs since they might differ due to the stochastic nature of the system, nor can we compare the entire output distribution due to computational intractability. While methods for analyzing text-based outputs exist, they focus on fundamentally different problems, such as measuring bias or fairness. To this end, we introduce distribution-based perturbation analysis, a framework that reformulates LLM perturbation analysis as a frequentist hypothesis testing problem. We construct empirical null and alternative output distributions within a low-dimensional semantic similarity space via Monte Carlo sampling, enabling tractable inference without restrictive distributional assumptions. The framework (i) is model-agnostic, (ii) supports the evaluation of arbitrary input perturbations on any black-box LLM, (iii) yields interpretable p-values, (iv) supports multiple perturbations via controlled error rates, and (v) provides scalar effect sizes. We demonstrate the usefulness of the framework across multiple case studies, showing how we can quantify response changes, measure true/false positive rates, and evaluate alignment with reference models. Above all, we see this as a reliable frequentist hypothesis testing framework for LLM auditing.
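
As a rough illustration of the abstract above, the sketch below compares a null sample of similarity scores (baseline responses against each other) with an alternative sample (baseline against perturbed responses). The abstract does not name a specific test, so the one-sided permutation test on the difference of means is an assumption, as are the function name and the made-up similarity values; embedding the responses and computing similarities is taken as done elsewhere.

```python
import random
from statistics import mean
from typing import List, Sequence


def permutation_p_value(null_sims: Sequence[float], alt_sims: Sequence[float],
                        n_permutations: int = 2000, seed: int = 0) -> float:
    """One-sided permutation test: is mean similarity lower under the perturbation?"""
    rng = random.Random(seed)
    observed = mean(null_sims) - mean(alt_sims)  # scalar effect size: drop in mean similarity
    pooled: List[float] = list(null_sims) + list(alt_sims)
    n_null = len(null_sims)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_null]) - mean(pooled[n_null:])
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical similarity scores in [0, 1].
null_sims = [0.91, 0.88, 0.93, 0.90, 0.89, 0.92, 0.87, 0.94]
alt_sims = [0.62, 0.70, 0.58, 0.66, 0.71, 0.60, 0.65, 0.68]

p_shifted = permutation_p_value(null_sims, alt_sims)      # perturbation changed the outputs
p_same = permutation_p_value(null_sims, list(null_sims))  # no change at all
```

A small `p_shifted` suggests the perturbation moved the output distribution, while `p_same` stays large because the two samples are identical. Because the test only needs sampled responses and a similarity function, it treats the model as a black box, in line with the model-agnostic claim.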

[332] Language Models over Canonical Byte-Pair Encodings

Tim Vieira,Tianyu Liu,Clemente Pasti,Yahya Emara,Brian DuSell,Benjamin LeBrun,Mario Giulianelli,Juan Luis Gastaldi,Timothy J. O'Donnell,Ryan Cotterell

Main category: cs.CL

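TL;DR: The paper proposes methods that fix noncanonical token encodings in language models, ensuring that the model assigns probability only to canonical token strings.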

Details

Motivation: Current language models place probability mass on noncanonical token encodings of character strings, which is both erroneous and wasteful. Method: Two approaches are proposed: (1) canonicality by conditioning, enforced at test time through inference; (2) canonicality by construction, a model parameterization that guarantees canonical outputs and requires training. Result: Experiments show that fixing canonicality mistakes improves held-out likelihood for several models and corpora. Conclusion: Enforcing canonical token encodings improves the efficiency and accuracy of language models. Abstract: Modern language models represent probability distributions over character strings as distributions over (shorter) token strings derived via a deterministic tokenizer, such as byte-pair encoding. While this approach is highly effective at scaling up language models to large corpora, its current incarnations have a concerning property: the model assigns nonzero probability mass to an exponential number of $\it{noncanonical}$ token encodings of each character string -- these are token strings that decode to valid character strings but are impossible under the deterministic tokenizer (i.e., they will never be seen in any training corpus, no matter how large). This misallocation is both erroneous, as noncanonical strings never appear in training data, and wasteful, diverting probability mass away from plausible outputs. These are avoidable mistakes! In this work, we propose methods to enforce canonicality in token-level language models, ensuring that only canonical token strings are assigned positive probability. We present two approaches: (1) canonicality by conditioning, leveraging test-time inference strategies without additional training, and (2) canonicality by construction, a model parameterization that guarantees canonical outputs but requires training. We demonstrate that fixing canonicality mistakes improves the likelihood of held-out data for several models and corpora.
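
The distinction between canonical and noncanonical encodings can be made concrete with a toy tokenizer. The merge rules and vocabulary below are hypothetical and far smaller than any real BPE vocabulary; the sketch only shows that several token strings decode to the same text while the deterministic tokenizer emits exactly one of them.

```python
from typing import List, Tuple

# Hypothetical toy tokenizer: ordered merge rules over single characters.
MERGES: List[Tuple[str, str]] = [("a", "b"), ("ab", "c")]
VOCAB = {"a", "b", "c", "ab", "abc"}


def encode(text: str) -> List[str]:
    """Deterministic BPE-style encoding: apply each merge rule in order, left to right."""
    tokens = list(text)
    for left, right in MERGES:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


def all_encodings(text: str) -> List[List[str]]:
    """Every token string over VOCAB that decodes (by concatenation) to text."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in VOCAB:
            results.extend([piece] + rest for rest in all_encodings(text[end:]))
    return results


def is_canonical(tokens: List[str]) -> bool:
    """A token string is canonical iff the tokenizer produces it from its own decoding."""
    return encode("".join(tokens)) == tokens


encodings = all_encodings("abc")
canonical = [t for t in encodings if is_canonical(t)]
```

Here `"abc"` has three encodings (`a b c`, `ab c`, `abc`) but only `abc` is canonical. A token-level model that gives any probability to the other two is spending mass on strings the tokenizer can never produce, which is the misallocation the paper targets.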

[333] Correlated Errors in Large Language Models

Elliot Kim,Avi Garg,Kenny Peng,Nikhil Garg

Main category: cs.CL

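TL;DR: Despite diversity in data, architectures, and providers, LLM errors are highly correlated, especially among larger and more accurate models. This correlation has measurable effects on downstream tasks.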

Details

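Motivation: Examine whether different LLMs actually differ meaningfully as a result of their diversity, filling a gap in empirical evidence. Method: A large-scale evaluation of over 350 LLMs on two popular leaderboards and a resume-screening task analyzes how correlated model errors are and what drives the correlation. Result: Model errors are substantially correlated (on one leaderboard dataset, models agree 60% of the time when both err), especially among models that share an architecture or provider; larger and more accurate models have highly correlated errors even with distinct architectures and providers. Conclusion: Error correlation among LLMs is substantial and has real effects in downstream tasks, consistent with theoretical predictions about algorithmic monoculture. Abstract: Diversity in training data, architecture, and providers is assumed to mitigate homogeneity in LLMs. However, we lack empirical evidence on whether different LLMs differ meaningfully. We conduct a large-scale empirical evaluation on over 350 LLMs overall, using two popular leaderboards and a resume-screening task. We find substantial correlation in model errors -- on one leaderboard dataset, models agree 60% of the time when both models err. We identify factors driving model correlation, including shared architectures and providers. Crucially, however, larger and more accurate models have highly correlated errors, even with distinct architectures and providers. Finally, we show the effects of correlation in two downstream tasks: LLM-as-judge evaluation and hiring -- the latter reflecting theoretical predictions regarding algorithmic monoculture.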
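
One plausible reading of "models agree 60% of the time when both models err" is the share of jointly missed items on which two models pick the same wrong answer. The sketch below computes that quantity; the function name and the toy multiple-choice data are hypothetical, and the paper's exact definition may differ.

```python
from typing import Hashable, Sequence


def agreement_when_both_err(preds_a: Sequence[Hashable], preds_b: Sequence[Hashable],
                            truth: Sequence[Hashable]) -> float:
    """Among items both models get wrong, the fraction where they give the same wrong answer."""
    both_wrong = [(a, b) for a, b, t in zip(preds_a, preds_b, truth) if a != t and b != t]
    if not both_wrong:
        return float("nan")  # undefined when the models never err together
    return sum(a == b for a, b in both_wrong) / len(both_wrong)


# Toy multiple-choice answers for six questions.
truth = ["A", "B", "C", "D", "A", "B"]
model_1 = ["A", "C", "D", "D", "B", "C"]
model_2 = ["A", "C", "A", "B", "B", "D"]

rate = agreement_when_both_err(model_1, model_2, truth)
```

Both models miss four of the six questions together and choose the same wrong option on two of them, so `rate` is 0.5. On a four-option question, independent uniform guessing among the three wrong options would give roughly one third, which is the kind of baseline such a number is usually read against.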

[334] Reinforcement Pre-Training

Qingxiu Dong,Li Dong,Yao Tang,Tianzhu Ye,Yutao Sun,Zhifang Sui,Furu Wei

Main category: cs.CL

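TL;DR: The paper proposes Reinforcement Pre-Training (RPT), a new scaling paradigm that reframes next-token prediction as a reasoning task trained with reinforcement learning, significantly improving language modeling accuracy.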

Details

Motivation: Conventional reinforcement learning for large language models relies on domain-specific annotated data; RPT aims to use large amounts of text for general-purpose RL and strengthen the model's reasoning ability. Method: RPT recasts next-token prediction as an RL-trained reasoning task in which the model receives verifiable rewards for correctly predicting the next token. Result: Experiments show that RPT significantly improves next-token prediction accuracy and provides a strong pre-trained foundation for further reinforcement fine-tuning. Accuracy keeps improving as training compute increases. Conclusion: RPT is an effective and promising scaling paradigm for language model pre-training. Abstract: In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a reasoning task trained using RL, where it receives verifiable rewards for correctly predicting the next token for a given context. RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. By incentivizing the capability of next-token reasoning, RPT significantly improves the language modeling accuracy of predicting the next tokens. Moreover, RPT provides a strong pre-trained foundation for further reinforcement fine-tuning. The scaling curves show that increased training compute consistently improves the next-token prediction accuracy. The results position RPT as an effective and promising scaling paradigm to advance language model pre-training.

cs.MM [Back]

[335] Experimental Evaluation of Static Image Sub-Region-Based Search Models Using CLIP

Bastian Jäckl,Vojtěch Kloda,Daniel A. Keim,Jakub Lokoč

Main category: cs.MM

TL;DR: The study examines whether adding location-based prompts improves retrieval for vague text queries in highly homogeneous, specialized domains.

Details

Motivation: In specialized domains users can often give only vague textual descriptions because they lack the expert knowledge to tell homogeneous entities apart, so retrieval performs poorly. Method: The authors collect a dataset of 741 human annotations, each with short and long textual descriptions and bounding boxes, and evaluate CLIP retrieval on various static image sub-regions. Result: A simple 3-by-3 partitioning and a 5-grid overlap both significantly improve retrieval effectiveness and remain robust to perturbations of the annotation box. Conclusion: Location-based prompts effectively improve retrieval for vague text queries. Abstract: Advances in multimodal text-image models have enabled effective text-based querying in extensive image collections. While these models show convincing performance for everyday life scenes, querying in highly homogeneous, specialized domains remains challenging. The primary problem is that users can often provide only vague textual descriptions as they lack expert knowledge to discriminate between homogeneous entities. This work investigates whether adding location-based prompts to complement these vague text queries can enhance retrieval performance. Specifically, we collected a dataset of 741 human annotations, each containing short and long textual descriptions and bounding boxes indicating regions of interest in challenging underwater scenes. Using these annotations, we evaluate the performance of CLIP when queried on various static sub-regions of images compared to the full image. Our results show that both a simple 3-by-3 partitioning and a 5-grid overlap significantly improve retrieval effectiveness and remain robust to perturbations of the annotation box.
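
A static 3-by-3 partitioning like the one mentioned above can be sketched as a list of crop boxes. The function name and the pixel-box convention are assumptions; cropping each box and embedding it with CLIP is left out, and the 5-grid overlap layout is not sketched because the abstract does not specify it.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def grid_boxes(width: int, height: int, rows: int = 3, cols: int = 3) -> List[Box]:
    """Split an image of the given size into rows x cols non-overlapping crop boxes."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            top = r * height // rows
            bottom = (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes


boxes = grid_boxes(640, 480)
total_area = sum((right - left) * (bottom - top) for left, top, right, bottom in boxes)
```

Integer division on cumulative edges keeps the boxes contiguous, so the nine cells tile the image exactly even when the width is not divisible by three. A location-based prompt would then restrict the search to the embeddings of the cell or cells the user points at.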

stat.ML [Back]

[336] On the Fundamental Impossibility of Hallucination Control in Large Language Models

Michał P. Karpowicz

Main category: stat.ML

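TL;DR: The paper proves that it is impossible to build a large language model that never hallucinates and analyzes four fundamental properties that must be traded off.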

Details

Motivation: Explain why large language models cannot avoid hallucination and make the underlying limits explicit. Method: LLM inference is modeled as an "auction of ideas", and the impossibility is proved with the Green-Laffont theorem. Result: No inference mechanism can satisfy all four fundamental properties simultaneously. Conclusion: The result provides a theoretical foundation for model architecture, training objectives, and evaluation methods. Abstract: This paper explains why it is impossible to create large language models that do not hallucinate and which trade-offs we should be looking for. It presents a formal impossibility theorem demonstrating that no inference mechanism can simultaneously satisfy four fundamental properties: truthful (non-hallucinatory) generation, semantic information conservation, relevant knowledge revelation, and knowledge-constrained optimality. By modeling LLM inference as an auction of ideas where neural components compete to contribute to responses, we prove the impossibility using the Green-Laffont theorem. That mathematical framework provides a rigorous foundation for understanding the nature of the inference process, with implications for model architecture, training objectives, and evaluation methods.

eess.SP [Back]

[337] Benchmarking Early Agitation Prediction in Community-Dwelling People with Dementia Using Multimodal Sensors and Machine Learning

Ali Abedi,Charlene H. Chu,Shehroz S. Khan

Main category: eess.SP

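TL;DR: The study develops machine learning methods based on multimodal sensor data for early prediction of agitation in community-dwelling people with dementia, and improves prediction by introducing new contextual features.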

Details

Motivation: Agitation is a common responsive behavior in people with dementia; predicting it early can reduce caregiver burden and improve quality of life. Method: A range of machine learning and deep learning models is evaluated on the activity, physiology, and sleep data in the TIHM dataset across several problem formulations. Result: The best model, a light gradient boosting machine, reaches an AUC-ROC of 0.9720 and an AUC-PR of 0.4320 on the binary classification task. Conclusion: The study provides the first comprehensive benchmark for agitation prediction from privacy-preserving sensor data, supporting proactive care and aging in place. Abstract: Agitation is one of the most common responsive behaviors in people living with dementia, particularly among those residing in community settings without continuous clinical supervision. Timely prediction of agitation can enable early intervention, reduce caregiver burden, and improve the quality of life for both patients and caregivers. This study aimed to develop and benchmark machine learning approaches for the early prediction of agitation in community-dwelling older adults with dementia using multimodal sensor data. A new set of agitation-related contextual features derived from activity data was introduced and employed for agitation prediction. A wide range of machine learning and deep learning models was evaluated across multiple problem formulations, including binary classification for single-timestamp tabular sensor data and multi-timestamp sequential sensor data, as well as anomaly detection for single-timestamp tabular sensor data. The study utilized the Technology Integrated Health Management (TIHM) dataset, the largest publicly available dataset for remote monitoring of people living with dementia, comprising 2,803 days of in-home activity, physiology, and sleep data. The most effective setting involved binary classification of sensor data using the current 6-hour timestamp to predict agitation at the subsequent timestamp. Incorporating additional information, such as time of day and agitation history, further improved model performance, with the highest AUC-ROC of 0.9720 and AUC-PR of 0.4320 achieved by the light gradient boosting machine. This work presents the first comprehensive benchmarking of state-of-the-art techniques for agitation prediction in community-based dementia care using privacy-preserving sensor data. The approach enables accurate, explainable, and efficient agitation prediction, supporting proactive dementia care and aging in place.

[338] An Open-Source Python Framework and Synthetic ECG Image Datasets for Digitization, Lead and Lead Name Detection, and Overlapping Signal Segmentation

Masoud Rahimi,Reza Karbasi,Abdol-Hossein Vahabie

Main category: eess.SP

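TL;DR: The paper introduces an open-source Python framework for generating synthetic ECG image datasets that support deep learning tasks such as ECG digitization, lead region and lead name detection, and waveform segmentation.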

Details

Motivation: Provide high-quality synthetic datasets to advance deep learning tasks in ECG analysis such as digitization, lead detection, and waveform segmentation. Method: Using the PTB-XL signal dataset, the framework generates four open datasets: ECG images paired with time-series signals, images with YOLO-format annotations for lead region and lead name detection, and single-lead segmentation masks suitable for U-Net models. Result: Four publicly available datasets supporting multiple ECG analysis tasks are produced, together with links to the open-source framework and datasets. Conclusion: The framework and datasets are an important resource for deep learning research in ECG analysis. Abstract: We introduce an open-source Python framework for generating synthetic ECG image datasets to advance critical deep learning-based tasks in ECG analysis, including ECG digitization, lead region and lead name detection, and pixel-level waveform segmentation. Using the PTB-XL signal dataset, our proposed framework produces four open-access datasets: (1) ECG images in various lead configurations paired with time-series signals for ECG digitization, (2) ECG images annotated with YOLO-format bounding boxes for detection of lead region and lead name, (3)-(4) cropped single-lead images with segmentation masks compatible with U-Net-based models in normal and overlapping versions. In the overlapping case, waveforms from neighboring leads are superimposed onto the target lead image, while the segmentation masks remain clean. The open-source Python framework and datasets are publicly available at https://github.com/rezakarbasi/ecg-image-and-signal-dataset and https://doi.org/10.5281/zenodo.15484519, respectively.

[339] Heart Rate Classification in ECG Signals Using Machine Learning and Deep Learning

Thien Nhan Vo,Thanh Xuan Truong

Main category: eess.SP

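TL;DR: The study classifies heartbeats in ECG signals with traditional machine learning and deep learning, and finds that a LightGBM model on hand-crafted features performs best.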

Details

Motivation: Classify heartbeats in ECG signals and compare hand-crafted feature extraction with deep learning on image-transformed beats. Method: (1) Traditional machine learning: extract features such as HRV, mean, and variance, and train classifiers such as SVM and Random Forest. (2) Deep learning: transform ECG signals into images (GAF, MTF, RP) and classify them with CNNs. Result: LightGBM reaches 99% accuracy and an F1 score of 0.94, outperforming the CNN approach (F1 score 0.85). Hand-crafted features capture variations in ECG signals better. Conclusion: Hand-crafted features perform better for ECG classification; future work could add multi-lead signals and temporal dependencies to improve accuracy. Abstract: This study addresses the classification of heartbeats from ECG signals through two distinct approaches: traditional machine learning utilizing hand-crafted features and deep learning via transformed images of ECG beats. The dataset underwent preprocessing steps, including downsampling, filtering, and normalization, to ensure consistency and relevance for subsequent analysis. In the first approach, features such as heart rate variability (HRV), mean, variance, and RR intervals were extracted to train various classifiers, including SVM, Random Forest, AdaBoost, LSTM, Bi-directional LSTM, and LightGBM. The second approach involved transforming ECG signals into images using Gramian Angular Field (GAF), Markov Transition Field (MTF), and Recurrence Plots (RP), with these images subsequently classified using CNN architectures like VGG and Inception. Experimental results demonstrate that the LightGBM model achieved the highest performance, with an accuracy of 99% and an F1 score of 0.94, outperforming the image-based CNN approach (F1 score of 0.85). Models such as SVM and AdaBoost yielded significantly lower scores, indicating limited suitability for this task. The findings underscore the superior ability of hand-crafted features to capture temporal and morphological variations in ECG signals compared to image-based representations of individual beats. Future investigations may benefit from incorporating multi-lead ECG signals and temporal dependencies across successive beats to enhance classification accuracy further.
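
Of the image transforms named above, the Gramian Angular Field is the simplest to write down. The sketch below uses the common summation variant (entries cos(φ_i + φ_j) after rescaling the signal to [-1, 1]); whether the study used this variant or the difference variant is not stated in the abstract, and the sample values are made up.

```python
import math
from typing import List, Sequence


def gramian_angular_summation_field(signal: Sequence[float]) -> List[List[float]]:
    """Encode a 1-D signal as a 2-D matrix with entries cos(phi_i + phi_j)."""
    lo, hi = min(signal), max(signal)
    if hi == lo:
        raise ValueError("signal must not be constant")
    # Rescale to [-1, 1] so that arccos is defined for every sample.
    scaled = [2.0 * (x - lo) / (hi - lo) - 1.0 for x in signal]
    # Clamp to guard against floating-point values slightly outside [-1, 1].
    phi = [math.acos(max(-1.0, min(1.0, s))) for s in scaled]
    return [[math.cos(a + b) for b in phi] for a in phi]


beat = [0.0, 0.1, 1.0, -0.4, 0.05, 0.0]  # hypothetical samples from one heartbeat
image = gramian_angular_summation_field(beat)
```

The result is a symmetric n-by-n matrix with values in [-1, 1] that can be fed to a CNN as a single-channel image. Each entry mixes two time points, which is one reason such images can blur the sharp morphological cues that hand-crafted features keep.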

cs.MA [Back]

[340] G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems

Guibin Zhang,Muxin Fu,Guancheng Wan,Miao Yu,Kun Wang,Shuicheng Yan

Main category: cs.MA

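TL;DR: The paper proposes G-Memory, a hierarchical, agentic memory system that addresses the overly simplistic memory mechanisms of multi-agent systems (MAS) and significantly improves task success rates.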

Details

Motivation: Existing MAS memory mechanisms are overly simplistic, ignore the complexity of inter-agent collaboration, and lack cross-trial and agent-specific customization, which limits the system's capacity for self-evolution. Method: G-Memory uses a three-tier graph hierarchy inspired by organizational memory theory (insight, query, and interaction graphs) and retrieves high-level insights and fine-grained interaction trajectories through bi-directional memory traversal. Result: Across five benchmarks, G-Memory improves success rates in embodied action by up to 20.89% and accuracy in knowledge QA by up to 10.12%. Conclusion: By improving the memory architecture, G-Memory significantly enhances MAS performance and capacity for evolution without modifying the original frameworks. Abstract: Large language model (LLM)-powered multi-agent systems (MAS) have demonstrated cognitive and execution capabilities that far exceed those of single LLM agents, yet their capacity for self-evolution remains hampered by underdeveloped memory architectures. Upon close inspection, we are alarmed to discover that prevailing MAS memory mechanisms (1) are overly simplistic, completely disregarding the nuanced inter-agent collaboration trajectories, and (2) lack cross-trial and agent-specific customization, in stark contrast to the expressive memory developed for single agents. To bridge this gap, we introduce G-Memory, a hierarchical, agentic memory system for MAS inspired by organizational memory theory, which manages the lengthy MAS interaction via a three-tier graph hierarchy: insight, query, and interaction graphs. Upon receiving a new user query, G-Memory performs bi-directional memory traversal to retrieve both $\textit{high-level, generalizable insights}$ that enable the system to leverage cross-trial knowledge, and $\textit{fine-grained, condensed interaction trajectories}$ that compactly encode prior collaboration experiences. Upon task execution, the entire hierarchy evolves by assimilating new collaborative trajectories, nurturing the progressive evolution of agent teams. Extensive experiments across five benchmarks, three LLM backbones, and three popular MAS frameworks demonstrate that G-Memory improves success rates in embodied action and accuracy in knowledge QA by up to $20.89\%$ and $10.12\%$, respectively, without any modifications to the original frameworks. Our code is available at https://github.com/bingreeky/GMemory.

[341] MedChat: A Multi-Agent Framework for Multimodal Diagnosis with Large Language Models

Philip Liu,Sparsh Bansal,Jimmy Dinh,Aditya Pawar,Ramani Satishkumar,Shail Desai,Neeraj Gupta,Xin Wang,Shu Hu

Main category: cs.MA

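TL;DR: MedChat is a multi-agent diagnostic framework that combines specialized vision models with role-specific LLM agents to improve the reliability of glaucoma detection and the efficiency of clinical reporting.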

Details

Motivation: Address the hallucinations, limited interpretability, and insufficient domain knowledge of general-purpose LLMs in medical imaging, while emulating the complex reasoning of multidisciplinary medical teams. Method: Proposes the MedChat framework, which coordinates multiple role-specific LLM agents and specialized vision models under a director agent. Result: Improves reliability, reduces hallucination risk, and supports interactive diagnostic reporting for clinical and educational use. Conclusion: Through its multi-agent design, MedChat addresses the limitations of general-purpose LLMs in the medical domain and offers a more reliable approach to automated diagnosis. Abstract: The integration of deep learning-based glaucoma detection with large language models (LLMs) presents an automated strategy to mitigate ophthalmologist shortages and improve clinical reporting efficiency. However, applying general LLMs to medical imaging remains challenging due to hallucinations, limited interpretability, and insufficient domain-specific medical knowledge, which can potentially reduce clinical accuracy. Although recent approaches combining imaging models with LLM reasoning have improved reporting, they typically rely on a single generalist agent, restricting their capacity to emulate the diverse and complex reasoning found in multidisciplinary medical teams. To address these limitations, we propose MedChat, a multi-agent diagnostic framework and platform that combines specialized vision models with multiple role-specific LLM agents, all coordinated by a director agent. This design enhances reliability, reduces hallucination risk, and enables interactive diagnostic reporting through an interface tailored for clinical review and educational use. Code available at https://github.com/Purdue-M2/MedChat.

cs.CY

[342] How Malicious AI Swarms Can Threaten Democracy

Daniel Thilo Schroeder,Meeyoung Cha,Andrea Baronchelli,Nick Bostrom,Nicholas A. Christakis,David Garcia,Amit Goldenberg,Yara Kyrychenko,Kevin Leyton-Brown,Nina Lutz,Gary Marcus,Filippo Menczer,Gordon Pennycook,David G. Rand,Frank Schweitzer,Christopher Summerfield,Audrey Tang,Jay Van Bavel,Sander van der Linden,Dawn Song,Jonas R. Kunst

Main category: cs.CY

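TL;DR: Malicious AI swarms could enable sophisticated disinformation operations that threaten democratic processes, calling for defenses at the platform, model, and system levels.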

Details

Motivation: Advances in AI may give rise to malicious AI swarms that can coordinate covertly, infiltrate communities, and undermine social consensus, threatening democratic institutions. Method: Proposes a three-pronged response: platform-side defenses (e.g., swarm-detection dashboards, transparency audits), model-side safeguards (e.g., persuasion-risk tests, watermarking), and system-level oversight (e.g., a UN-backed AI Influence Observatory). Result: Malicious AI swarms can produce fabricated consensus, fragmented shared reality, and voter manipulation, among other harms, and call for multi-layered defenses. Conclusion: Urges urgent global action at the platform, model, and system levels to counter the threat posed by malicious AI swarms. Abstract: Advances in AI portend a new era of sophisticated disinformation operations. While individual AI systems already create convincing -- and at times misleading -- information, an imminent development is the emergence of malicious AI swarms. These systems can coordinate covertly, infiltrate communities, evade traditional detectors, and run continuous A/B tests, with round-the-clock persistence. The result can include fabricated grassroots consensus, fragmented shared reality, mass harassment, voter micro-suppression or mobilization, contamination of AI training data, and erosion of institutional trust. With democratic processes worldwide increasingly vulnerable, we urge a three-pronged response: (1) platform-side defenses -- always-on swarm-detection dashboards, pre-election high-fidelity swarm-simulation stress-tests, transparency audits, and optional client-side "AI shields" for users; (2) model-side safeguards -- standardized persuasion-risk tests, provenance-authenticating passkeys, and watermarking; and (3) system-level oversight -- a UN-backed AI Influence Observatory.

[343] LLMs as World Models: Data-Driven and Human-Centered Pre-Event Simulation for Disaster Impact Assessment

Lingyao Li,Dawei Li,Zhenhui Ou,Xiaoran Xu,Jingxiao Liu,Zihui Ma,Runlong Yu,Min Deng

Main category: cs.CY

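TL;DR: This study explores the potential of large language models (LLMs) to simulate earthquake impacts, generating MMI predictions from multimodal datasets and validating them against real earthquake reports with high correlation (0.88) and low error (RMSE 0.77).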

Details

Motivation: Improving proactive preparedness for sudden-onset disasters such as earthquakes requires efficient simulation methods. Method: Combines multimodal datasets (geospatial, socioeconomic, building, and street-level imagery data) with LLMs to generate MMI predictions, and uses RAG and ICL to improve performance. Result: For the 2014 Napa and 2019 Ridgecrest earthquakes, predictions align closely with real reports (correlation 0.88, RMSE 0.77). Conclusion: LLMs show promise for simulating disaster impacts and can help strengthen pre-event planning. Abstract: Efficient simulation is essential for enhancing proactive preparedness for sudden-onset disasters such as earthquakes. Recent advancements in large language models (LLMs) as world models show promise in simulating complex scenarios. This study examines multiple LLMs to proactively estimate perceived earthquake impacts. Leveraging multimodal datasets including geospatial, socioeconomic, building, and street-level imagery data, our framework generates Modified Mercalli Intensity (MMI) predictions at zip code and county scales. Evaluations on the 2014 Napa and 2019 Ridgecrest earthquakes using USGS "Did You Feel It?" (DYFI) reports demonstrate significant alignment, as evidenced by a high correlation of 0.88 and a low RMSE of 0.77 as compared to real reports at the zip code level. Techniques such as RAG and ICL can improve simulation performance, while visual inputs notably enhance accuracy compared to structured numerical data alone. These findings show the promise of LLMs in simulating disaster impacts that can help strengthen pre-event planning.
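The two agreement metrics quoted above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names and the four MMI values are hypothetical, chosen only to show how a Pearson correlation and an RMSE between predicted and reported intensities would be computed.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def rmse(predicted, observed):
    """Root-mean-square error between predictions and observations."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical zip-code-level MMI values (illustrative only, not from the paper).
predicted_mmi = [3.0, 4.5, 5.0, 6.5]
reported_mmi = [3.5, 4.0, 5.5, 6.0]

r = pearson_r(predicted_mmi, reported_mmi)
error = rmse(predicted_mmi, reported_mmi)
```

A high correlation with a low RMSE means the predictions both rank locations correctly and stay close to the reported intensities in absolute terms; either metric alone can hide a failure of the other.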

[344] From Rogue to Safe AI: The Role of Explicit Refusals in Aligning LLMs with International Humanitarian Law

John Mavi,Diana Teodora Găitan,Sergio Coronado

Main category: cs.CY

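TL;DR: This study evaluates eight leading large language models (LLMs) on compliance with International Humanitarian Law (IHL). Most models refused unlawful requests, but the clarity and consistency of their responses varied. A standardised safety prompt significantly improved the quality of refusal explanations, though complex requests still exposed vulnerabilities.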

Details

Motivation: Assess LLM compliance with IHL and examine how clearer, better-explained refusals can reduce the risk of misuse. Method: Tests eight LLMs on their ability to refuse unlawful requests, analyzes the clarity and explanatory quality of their responses, and introduces a standardised safety prompt to measure its effect. Result: Most models refused unlawful requests, but response quality varied. The standardised prompt significantly improved the quality of refusal explanations, yet complex requests still exposed vulnerabilities. Conclusion: The study provides a benchmark for developing safer, more transparent AI systems and shows that lightweight interventions are effective, although vulnerabilities to complex requests remain to be addressed. Abstract: Large Language Models (LLMs) are widely used across sectors, yet their alignment with International Humanitarian Law (IHL) is not well understood. This study evaluates eight leading LLMs on their ability to refuse prompts that explicitly violate these legal frameworks, focusing also on helpfulness - how clearly and constructively refusals are communicated. While most models rejected unlawful requests, the clarity and consistency of their responses varied. By revealing the model's rationale and referencing relevant legal or safety principles, explanatory refusals clarify the system's boundaries, reduce ambiguity, and help prevent misuse. A standardised system-level safety prompt significantly improved the quality of the explanations expressed within refusals in most models, highlighting the effectiveness of lightweight interventions. However, more complex prompts involving technical language or requests for code revealed ongoing vulnerabilities. These findings contribute to the development of safer, more transparent AI systems and propose a benchmark to evaluate the compliance of LLMs with IHL.

[345] Large Language Models Can Be a Viable Substitute for Expert Political Surveys When a Shock Disrupts Traditional Measurement Approaches

Patrick Y. Wu

Main category: cs.CY

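TL;DR: The paper argues that large language models (LLMs) can substitute for expert political surveys when studying the factors associated with a shock, and tests feasibility using the 2025 DOGE federal layoffs as a case study.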

Details

Motivation: After a shock, expert judgments are colored by knowledge of the outcome, which makes pre-event perceptions hard to reconstruct. An alternative is needed when traditional measurement fails. Method: Uses pairwise comparison prompts with LLMs to derive ideology scores for federal agencies and relates them to which agencies were targeted for layoffs. Result: The LLM-derived scores replicate pre-layoff expert measures and predict layoff targets, and they also reveal the effect of perceiving agencies as knowledge institutions. Conclusion: LLMs can serve as a substitute for expert surveys, and the paper proposes a two-part criterion for when to use them. Abstract: After a disruptive event or shock, such as the Department of Government Efficiency (DOGE) federal layoffs of 2025, expert judgments are colored by knowledge of the outcome. This can make it difficult or impossible to reconstruct the pre-event perceptions needed to study the factors associated with the event. This position paper argues that large language models (LLMs), trained on vast amounts of digital media data, can be a viable substitute for expert political surveys when a shock disrupts traditional measurement. We analyze the DOGE layoffs as a specific case study for this position. We use pairwise comparison prompts with LLMs and derive ideology scores for federal executive agencies. These scores replicate pre-layoff expert measures and predict which agencies were targeted by DOGE. We also use this same approach and find that the perceptions of certain federal agencies as knowledge institutions predict which agencies were targeted by DOGE, even when controlling for ideology. This case study demonstrates that using LLMs allows us to rapidly and easily test the associated factors hypothesized behind the shock. More broadly, our case study of this recent event exemplifies how LLMs offer insights into the correlational factors of the shock when traditional measurement techniques fail. We conclude by proposing a two-part criterion for when researchers can turn to LLMs as a substitute for expert political surveys.
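One common way to turn pairwise judgments into a single score per item is a Bradley-Terry model. The sketch below is an assumption for illustration: the abstract does not say which scaling model the author uses, and the item names and win counts are hypothetical placeholders rather than real agencies or data.

```python
def bradley_terry(wins, items, iterations=200):
    """Estimate latent strengths from pairwise outcomes.

    wins[(a, b)] is how often `a` was ranked above `b`.
    Uses the standard minorization-maximization update; this is an
    illustrative choice, not necessarily the paper's method.
    """
    strength = {item: 1.0 for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            total_wins = sum(c for (a, _), c in wins.items() if a == i)
            denom = 0.0
            for j in items:
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = total_wins / denom if denom else strength[i]
        # Rescale so strengths average 1.0; only ratios are identified.
        norm = sum(updated.values())
        strength = {i: v * len(items) / norm for i, v in updated.items()}
    return strength

# Hypothetical comparison counts among three placeholder items.
items = ["A", "B", "C"]
wins = {
    ("A", "B"): 3, ("B", "A"): 1,
    ("B", "C"): 3, ("C", "B"): 1,
    ("A", "C"): 3, ("C", "A"): 1,
}
scores = bradley_terry(wins, items)
```

The resulting strengths are only meaningful relative to one another, so they would typically be log-transformed or standardized before being compared with an external expert measure.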

[346] Future of Work with AI Agents: Auditing Automation and Augmentation Potential across the U.S. Workforce

Yijia Shao,Humishka Zope,Yucheng Jiang,Jiaxin Pei,David Nguyen,Erik Brynjolfsson,Diyi Yang

Main category: cs.CY

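TL;DR: The paper proposes a new auditing framework to assess which occupational tasks workers want AI agents to automate or augment, and how those desires align with current technological capabilities.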

Details

Motivation: The rapid rise of compound AI systems is reshaping the labor market and raising concerns about job displacement, diminished human agency, and overreliance on automation, yet a systematic understanding of this evolving landscape is lacking. Method: Introduces an audio-enhanced mini-interview to capture nuanced worker desires and proposes the Human Agency Scale (HAS) as a shared language for quantifying the preferred level of human involvement. Using this framework, the authors build the WORKBank database, which combines worker preferences with AI expert assessments and divides tasks into four zones. Result: Human agency requirements differ across occupations, revealing critical mismatches and opportunities for AI agent development. AI agent integration may also shift core human competencies from information-focused skills to interpersonal ones. Conclusion: The study underscores the importance of aligning AI agent development with human desires and offers early signals for workers adapting to a changing workplace. Abstract: The rapid rise of compound AI systems (a.k.a., AI agents) is reshaping the labor market, raising concerns about job displacement, diminished human agency, and overreliance on automation. Yet, we lack a systematic understanding of the evolving landscape. In this paper, we address this gap by introducing a novel auditing framework to assess which occupational tasks workers want AI agents to automate or augment, and how those desires align with the current technological capabilities. Our framework features an audio-enhanced mini-interview to capture nuanced worker desires and introduces the Human Agency Scale (HAS) as a shared language to quantify the preferred level of human involvement. Using this framework, we construct the WORKBank database, building on the U.S. Department of Labor's O*NET database, to capture preferences from 1,500 domain workers and capability assessments from AI experts across over 844 tasks spanning 104 occupations. Jointly considering the desire and technological capability divides tasks in WORKBank into four zones: Automation "Green Light" Zone, Automation "Red Light" Zone, R&D Opportunity Zone, Low Priority Zone. This highlights critical mismatches and opportunities for AI agent development. Moving beyond a simple automate-or-not dichotomy, our results reveal diverse HAS profiles across occupations, reflecting heterogeneous expectations for human involvement. Moreover, our study offers early signals of how AI agent integration may reshape the core human competencies, shifting from information-focused skills to interpersonal ones. These findings underscore the importance of aligning AI agent development with human desires and preparing workers for evolving workplace dynamics.
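The four-zone split can be pictured as a two-by-two grid over worker desire and technological capability. The sketch below is a reading of the zone names rather than the paper's stated procedure: the abstract does not give the rating scale, the cut-off, or the exact mapping, so the 1-5 scale, the threshold of 3.0, and the assignment of quadrants to names are all assumptions.

```python
def classify_task(desire, capability, threshold=3.0):
    """Place a task in one of four zones from two ratings.

    `desire` is how much workers want automation, `capability` is the
    expert-rated ability of AI agents to do the task. Scale and
    threshold are hypothetical.
    """
    wants_automation = desire >= threshold
    is_feasible = capability >= threshold
    if wants_automation and is_feasible:
        return 'Automation "Green Light" Zone'
    if is_feasible:
        # Technically feasible, but workers do not want it automated.
        return 'Automation "Red Light" Zone'
    if wants_automation:
        # Wanted by workers, but beyond current capability.
        return "R&D Opportunity Zone"
    return "Low Priority Zone"

# Hypothetical ratings for four illustrative tasks.
examples = {
    "task_a": classify_task(4.2, 4.5),
    "task_b": classify_task(1.8, 4.1),
    "task_c": classify_task(4.0, 2.0),
    "task_d": classify_task(2.0, 1.5),
}
```

Under this reading, the two off-diagonal cells are where desire and capability disagree, which is where the mismatches highlighted in the abstract would appear.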

cs.RO

[347] SMaRCSim: Maritime Robotics Simulation Modules

Mart Kartašev,David Dörner,Özer Özkahraman,Petter Ögren,Ivan Stenius,John Folkesson

Main category: cs.RO

TL;DR: SMaRCSim is a set of simulation packages intended to support rapid testing and multi-vehicle team development for underwater robots.

Details

Motivation: Existing simulation tools do not cover learning-based development for underwater vehicles, teams of heterogeneous autonomous vehicles, or integration with mission planning. Method: Develops the SMaRCSim simulation packages, which support teams of underwater, surface, and aerial vehicles and integration with mission planning. Result: SMaRCSim offers an efficient testing and multi-vehicle teaming solution for developing underwater robot functionality. Conclusion: SMaRCSim fills gaps in existing tools and shows potential for bringing new functionality into the underwater domain. Abstract: Developing new functionality for underwater robots and testing them in the real world is time-consuming and resource-intensive. Simulation environments allow for rapid testing before field deployment. However, existing tools lack certain functionality for use cases in our project: i) developing learning-based methods for underwater vehicles; ii) creating teams of autonomous underwater, surface, and aerial vehicles; iii) integrating the simulation with mission planning for field experiments. A holistic solution to these problems presents great potential for bringing novel functionality into the underwater domain. In this paper we present SMaRCSim, a set of simulation packages that we have developed to help us address these issues.

[348] Active Illumination Control in Low-Light Environments using NightHawk

Yash Turkar,Youngjin Kim,Karthik Dantu

Main category: cs.RO

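TL;DR: The NightHawk framework combines active illumination with exposure control to optimize image quality in subterranean environments, improving feature detection and matching.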

Details

Motivation: Subterranean environments such as culverts are dimly lit and lack distinctive features, and existing onboard illumination introduces specular reflections, overexposure, and higher power consumption. Method: Proposes an online Bayesian optimization method that dynamically adjusts light intensity and exposure time, using a feature detector-based metric as the optimization objective. Result: Experiments show improvements of 47-197% in feature detection and matching, with more reliable visual estimation. Conclusion: NightHawk effectively addresses robot vision challenges in subterranean environments. Abstract: Subterranean environments such as culverts present significant challenges to robot vision due to dim lighting and lack of distinctive features. Although onboard illumination can help, it introduces issues such as specular reflections, overexposure, and increased power consumption. We propose NightHawk, a framework that combines active illumination with exposure control to optimize image quality in these settings. NightHawk formulates an online Bayesian optimization problem to determine the best light intensity and exposure-time for a given scene. We propose a novel feature detector-based metric to quantify image utility and use it as the cost function for the optimizer. We built NightHawk as an event-triggered recursive optimization pipeline and deployed it on a legged robot navigating a culvert beneath the Erie Canal. Results from field experiments demonstrate improvements in feature detection and matching by 47-197%, enabling more reliable visual estimation in challenging lighting conditions.

[349] Edge-Enabled Collaborative Object Detection for Real-Time Multi-Vehicle Perception

Everett Richards,Bipul Thapa,Lena Mashayekhy

Main category: cs.RO

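TL;DR: The paper proposes ECOD, a real-time object detection framework based on edge computing and multi-vehicle collaboration, which uses the PACE and VOTE algorithms to improve CAV perception. Experiments show classification accuracy up to 75% higher than traditional approaches.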

Details

Motivation: Traditional on-board perception has limited accuracy because of occlusions and blind spots, while cloud-based solutions have high latency and cannot meet real-time demands. Method: The ECOD framework combines the PACE (multi-vehicle data aggregation) and VOTE (consensus-based voting) algorithms and uses edge computing for low-latency processing. Result: Experiments show that ECOD improves object classification accuracy by up to 75% over traditional approaches while meeting real-time requirements. Conclusion: Edge computing can significantly enhance collaborative perception and suits latency-sensitive autonomous driving systems. Abstract: Accurate and reliable object detection is critical for ensuring the safety and efficiency of Connected Autonomous Vehicles (CAVs). Traditional on-board perception systems have limited accuracy due to occlusions and blind spots, while cloud-based solutions introduce significant latency, making them unsuitable for real-time processing demands required for autonomous driving in dynamic environments. To address these challenges, we introduce an innovative framework, Edge-Enabled Collaborative Object Detection (ECOD) for CAVs, that leverages edge computing and multi-CAV collaboration for real-time, multi-perspective object detection. Our ECOD framework integrates two key algorithms: Perceptive Aggregation and Collaborative Estimation (PACE) and Variable Object Tally and Evaluation (VOTE). PACE aggregates detection data from multiple CAVs on an edge server to enhance perception in scenarios where individual CAVs have limited visibility. VOTE utilizes a consensus-based voting mechanism to improve the accuracy of object classification by integrating data from multiple CAVs. Both algorithms are designed at the edge to operate in real-time, ensuring low-latency and reliable decision-making for CAVs. We develop a hardware-based controlled testbed consisting of camera-equipped robotic CAVs and an edge server to evaluate the efficacy of our framework. Our experimental results demonstrate the significant benefits of ECOD in terms of improved object classification accuracy, outperforming traditional single-perspective onboard approaches by up to 75%, while ensuring low-latency, edge-driven real-time processing. This research highlights the potential of edge computing to enhance collaborative perception for latency-sensitive autonomous systems.
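Consensus-based voting over per-vehicle classifications can be illustrated with a plain plurality vote. This is a simplified stand-in, not the VOTE algorithm itself: the abstract does not describe its weighting or tie-breaking, and the function name and labels below are hypothetical.

```python
from collections import Counter

def plurality_vote(labels):
    """Return the most frequent label and its share of the votes.

    Each entry in `labels` is one vehicle's classification of the same
    object. Ties go to the label seen first (Counter ordering).
    """
    counts = Counter(labels)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(labels)

# Hypothetical reports from three vehicles viewing one object;
# the second vehicle has a partially occluded view.
reports = ["car", "truck", "car"]
label, agreement = plurality_vote(reports)
```

The agreement share gives the edge server a rough confidence signal: a low share suggests the object should be re-observed or flagged rather than acted on immediately.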

[350] DriveSuprim: Towards Precise Trajectory Selection for End-to-End Planning

Wenhao Yao,Zhenxin Li,Shiyi Lan,Zi Wang,Xinglong Sun,Jose M. Alvarez,Zuxuan Wu

Main category: cs.RO

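TL;DR: DriveSuprim improves the safety and performance of trajectory selection for autonomous driving through progressive candidate filtering, rotation-based augmentation, and a self-distillation framework.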

Details

Motivation: Address the difficulty selection-based trajectory methods have in optimization and in distinguishing safety-critical differences. Method: Uses coarse-to-fine progressive candidate filtering, rotation-based augmentation, and a self-distillation framework. Result: Reaches 93.5% PDMS on NAVSIM v1 and 87.1% EPDMS on NAVSIM v2. Conclusion: DriveSuprim significantly improves safety and trajectory quality in autonomous driving. Abstract: In complex driving environments, autonomous vehicles must navigate safely. Relying on a single predicted path, as in regression-based approaches, usually does not explicitly assess the safety of the predicted trajectory. Selection-based methods address this by generating and scoring multiple trajectory candidates and predicting the safety score for each, but face optimization challenges in precisely selecting the best option from thousands of possibilities and distinguishing subtle but safety-critical differences, especially in rare or underrepresented scenarios. We propose DriveSuprim to overcome these challenges and advance the selection-based paradigm through a coarse-to-fine paradigm for progressive candidate filtering, a rotation-based augmentation method to improve robustness in out-of-distribution scenarios, and a self-distillation framework to stabilize training. DriveSuprim achieves state-of-the-art performance, reaching 93.5% PDMS in NAVSIM v1 and 87.1% EPDMS in NAVSIM v2 without extra data, demonstrating superior safety-critical capabilities, including collision avoidance and compliance with rules, while maintaining high trajectory quality in various driving scenarios.

[351] Generalized Trajectory Scoring for End-to-end Multimodal Planning

Zhenxin Li,Wenhao Yao,Zi Wang,Xinglong Sun,Joshua Chen,Nadine Chang,Maying Shen,Zuxuan Wu,Shiyi Lan,Jose M. Alvarez

Main category: cs.RO

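TL;DR: GTRS is an end-to-end multimodal planning framework that combines coarse and fine-grained trajectory evaluation to address the limitations of static and dynamic trajectory scoring methods.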

Details

Motivation: Existing trajectory scoring methods generalize poorly: static approaches struggle with fine-grained adaptation, and dynamic approaches fail to cover broad trajectory distributions. Method: GTRS builds a unified scoring framework from a diffusion-based generator of diverse trajectories, a vocabulary generalization technique, and a sensor augmentation strategy. Result: GTRS performs strongly in the Navsim v2 Challenge, approaching privileged methods that rely on ground-truth perception. Conclusion: GTRS offers an efficient solution for multimodal planning that generalizes well. Abstract: End-to-end multi-modal planning is a promising paradigm in autonomous driving, enabling decision-making with diverse trajectory candidates. A key component is a robust trajectory scorer capable of selecting the optimal trajectory from these candidates. While recent trajectory scorers focus on scoring either large sets of static trajectories or small sets of dynamically generated ones, both approaches face significant limitations in generalization. Static vocabularies provide effective coarse discretization but struggle to make fine-grained adaptation, while dynamic proposals offer detailed precision but fail to capture broader trajectory distributions. To overcome these challenges, we propose GTRS (Generalized Trajectory Scoring), a unified framework for end-to-end multi-modal planning that combines coarse and fine-grained trajectory evaluation. GTRS consists of three complementary innovations: (1) a diffusion-based trajectory generator that produces diverse fine-grained proposals; (2) a vocabulary generalization technique that trains a scorer on super-dense trajectory sets with dropout regularization, enabling its robust inference on smaller subsets; and (3) a sensor augmentation strategy that enhances out-of-domain generalization while incorporating refinement training for critical trajectory discrimination. As the winning solution of the Navsim v2 Challenge, GTRS demonstrates superior performance even with sub-optimal sensor inputs, approaching privileged methods that rely on ground-truth perception. Code will be available at https://github.com/NVlabs/GTRS.

[352] RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation

Songhao Han,Boxiang Qiu,Yue Liao,Siyuan Huang,Chen Gao,Shuicheng Yan,Si Liu

Main category: cs.RO

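TL;DR: RoboCerebra is a benchmark for evaluating high-level reasoning in long-horizon robotic manipulation, combining a large-scale simulation dataset, a hierarchical framework, and a structured evaluation protocol.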

Details

Motivation: Most existing work focuses on reactive policies and underuses the semantic reasoning and long-horizon planning strengths of vision-language models (VLMs). Method: Proposes the RoboCerebra benchmark, comprising a large-scale simulation dataset, a hierarchical framework (a high-level VLM planner paired with a low-level VLA controller), and a structured evaluation protocol. Result: The dataset contains longer action sequences and denser annotations, and the authors evaluate state-of-the-art VLMs as System 2 modules. Conclusion: RoboCerebra advances the development of more capable and generalizable robotic planners. Abstract: Recent advances in vision-language models (VLMs) have enabled instruction-conditioned robotic systems with improved generalization. However, most existing work focuses on reactive System 1 policies, underutilizing VLMs' strengths in semantic reasoning and long-horizon planning. These System 2 capabilities, characterized by deliberative, goal-directed thinking, remain underexplored due to the limited temporal scale and structural complexity of current benchmarks. To address this gap, we introduce RoboCerebra, a benchmark for evaluating high-level reasoning in long-horizon robotic manipulation. RoboCerebra includes: (1) a large-scale simulation dataset with extended task horizons and diverse subtask sequences in household environments; (2) a hierarchical framework combining a high-level VLM planner with a low-level vision-language-action (VLA) controller; and (3) an evaluation protocol targeting planning, reflection, and memory through structured System 1-System 2 interaction. The dataset is constructed via a top-down pipeline, where GPT generates task instructions and decomposes them into subtask sequences. Human operators execute the subtasks in simulation, yielding high-quality trajectories with dynamic object variations. Compared to prior benchmarks, RoboCerebra features significantly longer action sequences and denser annotations. We further benchmark state-of-the-art VLMs as System 2 modules and analyze their performance across key cognitive dimensions, advancing the development of more capable and generalizable robotic planners.

[353] SpikePingpong: High-Frequency Spike Vision-based Robot Learning for Precise Striking in Table Tennis Game

Hao Wang,Chengkai Hou,Xianglong Li,Yankai Fu,Chenxuan Li,Ning Chen,Gaole Dai,Jiaming Liu,Tiejun Huang,Shanghang Zhang

Main category: cs.RO

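TL;DR: The SpikePingpong system combines spike-based vision with imitation learning, using its SONIC and IMPACT modules to address high-precision control in robotic table tennis and achieving success rates of 91% and 71%.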

Details

Motivation: Controlling high-speed objects is challenging in robotics, and table tennis is a testbed that demands rapid interception and precise trajectory adjustment. Method: Uses a 20 kHz spike camera for high-temporal-resolution ball tracking, with the SONIC module compensating for real-world uncertainties and the IMPACT module planning precise ball placement. Result: Achieves success rates of 91% and 71% on the 30 cm and 20 cm accuracy tasks, surpassing prior methods by 38% and 37% respectively. Conclusion: SpikePingpong offers a new perspective on robotic control in high-speed dynamic tasks. Abstract: Learning to control high-speed objects in the real world remains a challenging frontier in robotics. Table tennis serves as an ideal testbed for this problem, demanding both rapid interception of fast-moving balls and precise adjustment of their trajectories. This task presents two fundamental challenges: it requires a high-precision vision system capable of accurately predicting ball trajectories, and it necessitates intelligent strategic planning to ensure precise ball placement to target regions. The dynamic nature of table tennis, coupled with its real-time response requirements, makes it particularly well-suited for advancing robotic control capabilities in fast-paced, precision-critical domains. In this paper, we present SpikePingpong, a novel system that integrates spike-based vision with imitation learning for high-precision robotic table tennis. Our approach introduces two key attempts that directly address the aforementioned challenges: SONIC, a spike camera-based module that achieves millimeter-level precision in ball-racket contact prediction by compensating for real-world uncertainties such as air resistance and friction; and IMPACT, a strategic planning module that enables accurate ball placement to targeted table regions. The system harnesses a 20 kHz spike camera for high-temporal resolution ball tracking, combined with efficient neural network models for real-time trajectory correction and stroke planning. Experimental results demonstrate that SpikePingpong achieves a remarkable 91% success rate for 30 cm accuracy target area and 71% in the more challenging 20 cm accuracy task, surpassing previous state-of-the-art approaches by 38% and 37% respectively. These significant performance improvements enable the robust implementation of sophisticated tactical gameplay strategies, providing a new research perspective for robotic control in high-speed dynamic tasks.

Chenguang Huang,Oier Mees,Andy Zeng,Wolfram Burgard

Main category: cs.RO

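TL;DR: Proposes multimodal spatial language maps (VLMaps and AVLMaps) that fuse pretrained multimodal features with a 3D reconstruction of the environment, support translating natural language commands into spatial goals and sharing across robot embodiments, and improve goal navigation and disambiguation.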

Details

Motivation: Address the shortcomings of existing methods in environment mapping, spatial precision, and use of multimodal information. Method: Builds maps autonomously through standard exploration, fusing pretrained multimodal features with a 3D reconstruction, and extends visual-language maps (VLMaps) to audio-visual-language maps (AVLMaps). Result: Experiments show that the approach enables zero-shot spatial and multimodal goal navigation and improves recall by 50% in ambiguous scenarios. Conclusion: Multimodal spatial language maps give robots unified support for navigation and interaction guided by visual, audio, and spatial cues. Abstract: Grounding language to a navigating agent's observations can leverage pretrained multimodal foundation models to match perceptions to object or event descriptions. However, previous approaches remain disconnected from environment mapping, lack the spatial precision of geometric maps, or neglect additional modality information beyond vision. To address this, we propose multimodal spatial language maps as a spatial map representation that fuses pretrained multimodal features with a 3D reconstruction of the environment. We build these maps autonomously using standard exploration. We present two instances of our maps, which are visual-language maps (VLMaps) and their extension to audio-visual-language maps (AVLMaps) obtained by adding audio information. When combined with large language models (LLMs), VLMaps can (i) translate natural language commands into open-vocabulary spatial goals (e.g., "in between the sofa and TV") directly localized in the map, and (ii) be shared across different robot embodiments to generate tailored obstacle maps on demand. Building upon the capabilities above, AVLMaps extend VLMaps by introducing a unified 3D spatial representation integrating audio, visual, and language cues through the fusion of features from pretrained multimodal foundation models. This enables robots to ground multimodal goal queries (e.g., text, images, or audio snippets) to spatial locations for navigation. Additionally, the incorporation of diverse sensory inputs significantly enhances goal disambiguation in ambiguous environments. Experiments in simulation and real-world settings demonstrate that our multimodal spatial language maps enable zero-shot spatial and multimodal goal navigation and improve recall by 50% in ambiguous scenarios. These capabilities extend to mobile robots and tabletop manipulators, supporting navigation and interaction guided by visual, audio, and spatial cues.

[355] MapBERT: Bitwise Masked Modeling for Real-Time Semantic Mapping Generation

Yijie Deng,Shuaihang Yuan,Congcong Wen,Hao Huang,Anthony Tzes,Geeta Chandra Raju Bethala,Yi Fang

Main category: cs.RO

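TL;DR: MapBERT is a novel framework that uses a BitVAE and a masked transformer to generate semantic maps of unobserved regions, with an object-aware masking strategy to strengthen reasoning. It achieves state-of-the-art results on the Gibson benchmarks.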

Details

Motivation: Existing methods struggle to generate unobserved regions in real time and generalize poorly, so a new approach is needed to model the distribution of unobserved spaces. Method: Proposes MapBERT, which uses a BitVAE to encode semantic maps into compact bitwise tokens, a masked transformer to infer missing regions, and an object-aware masking strategy to strengthen reasoning. Result: On the Gibson benchmarks, MapBERT achieves state-of-the-art semantic map generation, balancing computational efficiency with accurate reconstruction of unobserved regions. Conclusion: With its bitwise encoding and object-aware masking strategy, MapBERT significantly improves the performance and practicality of semantic map generation. Abstract: Spatial awareness is a critical capability for embodied agents, as it enables them to anticipate and reason about unobserved regions. The primary challenge arises from learning the distribution of indoor semantics, complicated by sparse, imbalanced object categories and diverse spatial scales. Existing methods struggle to robustly generate unobserved areas in real time and do not generalize well to new environments. To this end, we propose MapBERT, a novel framework designed to effectively model the distribution of unseen spaces. Motivated by the observation that the one-hot encoding of semantic maps aligns naturally with the binary structure of bit encoding, we, for the first time, leverage a lookup-free BitVAE to encode semantic maps into compact bitwise tokens. Building on this, a masked transformer is employed to infer missing regions and generate complete semantic maps from limited observations. To enhance object-centric reasoning, we propose an object-aware masking strategy that masks entire object categories concurrently and pairs them with learnable embeddings, capturing implicit relationships between object embeddings and spatial tokens. By learning these relationships, the model more effectively captures indoor semantic distributions crucial for practical robotic tasks. Experiments on Gibson benchmarks show that MapBERT achieves state-of-the-art semantic map generation, balancing computational efficiency with accurate reconstruction of unobserved regions.

[356] BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation

Hongyu Wang,Chuyan Xiong,Ruiping Wang,Xilin Chen

Main category: cs.RO

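TL;DR: BitVLA is the first 1-bit vision-language-action (VLA) model with ternary parameters {-1, 0, 1}. A distillation-aware training strategy compresses the vision encoder to 1.58-bit weights, making the model promising for memory-constrained devices.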

Details

Motivation: Address the deployment challenges that growing VLA model size poses for resource-constrained robotic systems, and explore 1-bit pretraining for VLA models. Method: Proposes BitVLA, which uses ternary parameters and a distillation-aware training strategy to compress the vision encoder, with a full-precision encoder serving as the teacher to align latent representations. Result: On the LIBERO benchmark, BitVLA performs comparably to OpenVLA-OFT with 4-bit post-training quantization while using only 29.8% of the memory. Conclusion: BitVLA shows promise for deployment on memory-constrained edge devices, and the code and model weights are released. Abstract: Vision-Language-Action (VLA) models have shown impressive capabilities across a wide range of robotics manipulation tasks. However, their growing model size poses significant challenges for deployment on resource-constrained robotic systems. While 1-bit pretraining has proven effective for enhancing the inference efficiency of large language models with minimal performance loss, its application to VLA models remains underexplored. In this work, we present BitVLA, the first 1-bit VLA model for robotics manipulation, in which every parameter is ternary, i.e., {-1, 0, 1}. To further reduce the memory footprint of the vision encoder, we propose the distillation-aware training strategy that compresses the full-precision encoder to 1.58-bit weights. During this process, a full-precision encoder serves as a teacher model to better align latent representations. Despite the lack of large-scale robotics pretraining, BitVLA achieves performance comparable to the state-of-the-art model OpenVLA-OFT with 4-bit post-training quantization on the LIBERO benchmark, while consuming only 29.8% of the memory. These results highlight BitVLA's promise for deployment on memory-constrained edge devices. We release the code and model weights at https://github.com/ustcwhy/BitVLA.
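The "1.58-bit" figure follows from the ternary alphabet: three possible values carry log2(3) ≈ 1.58 bits of information per weight. The sketch below shows one common way to map real weights onto {-1, 0, 1}, using a mean-absolute-value scale followed by rounding and clipping. That rule is an assumption borrowed from descriptions of ternary LLM quantization; the abstract does not state which quantizer BitVLA uses, and the weights are made-up values.

```python
import math

def ternarize(weights, eps=1e-8):
    """Quantize weights to {-1, 0, 1} with a single shared scale.

    Scale is the mean absolute weight; each weight is divided by it,
    rounded, and clipped to [-1, 1]. Illustrative rule only.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    quantized = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return quantized, scale

# Information content of one ternary weight, in bits.
bits_per_weight = math.log2(3)

# Hypothetical full-precision weights.
weights = [0.9, -0.05, -1.2, 0.3]
quantized, scale = ternarize(weights)

# Approximate reconstruction: ternary value times the shared scale.
dequantized = [q * scale for q in quantized]
```

Small weights collapse to zero and large ones saturate at plus or minus the scale, which is why such models are typically trained with the quantizer in the loop rather than quantized after the fact.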

cs.AR

[357] ProtocolLLM: RTL Benchmark for SystemVerilog Generation of Communication Protocols

Arnav Sheth,Ivaxi Sheth,Mario Fritz

Main category: cs.AR

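TL;DR: The paper examines the ability of large language models (LLMs) to generate hardware description language (HDL) code, specifically SystemVerilog implementations of standard communication protocols, and introduces the first benchmark suite for this task.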

Details

Motivation: Although LLMs perform well at generating code in general-purpose programming languages, their use for HDLs such as SystemVerilog, particularly for synthesizable and functionally correct designs, remains underexplored. Method: Introduces a benchmark suite targeting four widely used protocols (SPI, I2C, UART, and AXI) and defines code generation tasks at varying levels of design abstraction and prompt specificity. Result: Generated code is assessed for syntactic correctness, synthesizability, and functional fidelity via waveform simulation and test benches. Conclusion: The paper provides an initial analysis of LLMs applied to HDL and shows their potential for generating implementations of standard communication protocols. Abstract: Recent advances in Large Language Models (LLMs) have shown promising capabilities in generating code for general-purpose programming languages. In contrast, their applicability for hardware description languages, particularly for generating synthesizable and functionally correct designs, remains significantly underexplored. HDLs such as SystemVerilog are logic-oriented and demand strict adherence to timing semantics, concurrency, and synthesizability constraints. Moreover, HDL-based design flows encompass a broad set of tasks beyond structural code generation, including testbench development, assertion-based verification, timing closure, and protocol-level integration for on-chip communication. The objective of our paper is to analyze the capabilities of state-of-the-art LLMs in generating SystemVerilog implementations of standard communication protocols, a core component of embedded and System-on-Chip (SoC) architectures. This paper introduces the first benchmark suite targeting four widely used protocols: SPI, I2C, UART, and AXI. We define code generation tasks that capture varying levels of design abstraction and prompt specificity. The generated designs are assessed for syntactic correctness, synthesizability, and functional fidelity via waveform simulation and test benches.

[358] QForce-RL: Quantized FPGA-Optimized Reinforcement Learning Compute Engine

Anushka Jha,Tanushree Dewangan,Mukul Lokhande,Santosh Kumar Vishvakarma

Main category: cs.AR

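TL;DR: QForce-RL uses quantization and a lightweight architecture to improve RL performance on FPGAs, reducing resource usage and energy consumption while maintaining performance.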

Details

Motivation: Deploying RL on FPGAs is resource-intensive; QForce-RL aims to improve performance and energy efficiency through quantization. Method: Combines E2HRL, to reduce the action space, with QuaRL, for quantization-based acceleration, supporting hardware acceleration and flexible deployment. Result: A 2.3x performance improvement and a 2.6x FPS improvement; suitable for resource-constrained devices. Conclusion: QForce-RL offers a scalable and flexible solution for efficient RL deployment. Abstract: Reinforcement Learning (RL) has outperformed its counterparts in sequential decision-making and dynamic environment control. However, FPGA deployment is significantly resource-expensive, owing to the large number of computations involved in training agents on high-quality images, and poses new challenges. In this work, we propose QForce-RL, which uses quantization to enhance throughput and reduce energy footprint with a lightweight RL architecture, without significant performance degradation. QForce-RL draws on E2HRL to reduce the overall RL actions needed to learn the desired policy, and on QuaRL for quantization-based SIMD hardware acceleration. We also provide a detailed analysis for different RL environments, with emphasis on model size, parameters, and accelerated compute ops. The architecture is scalable to resource-constrained devices and provides parametrized, efficient deployment with flexibility in latency, throughput, power, and energy efficiency. The proposed QForce-RL provides a performance enhancement of up to 2.3x and 2.6x better FPS compared to SoTA works.

eess.IV [Back]

[359] ResPF: Residual Poisson Flow for Efficient and Physically Consistent Sparse-View CT Reconstruction

Changsheng Fang,Yongtong Liu,Bahareh Morovati,Shuo Han,Yu Shi,Li Zhou,Shuyi Fan,Hengyong Yu

Main category: eess.IV

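TL;DR: Proposes Residual Poisson Flow (ResPF), a method based on Poisson Flow Generative Models (PFGM) for efficient and accurate sparse-view CT reconstruction, using conditional guidance and a residual fusion module to improve reconstruction quality and computational efficiency.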

Details

Motivation: Sparse-view CT reduces radiation dose, but the reconstruction problem is hard, and existing deep learning and diffusion models either lack physical interpretability or are computationally expensive. Method: Building on PFGM++, ResPF adds conditional guidance and a residual fusion module, skips redundant initial sampling steps, and embeds data consistency to improve reconstruction stability and quality. Result: On synthetic and clinical datasets, ResPF outperforms existing methods in reconstruction quality, inference speed, and robustness. Conclusion: ResPF is the first method to apply Poisson flow models to sparse-view CT, and it substantially improves reconstruction efficiency and accuracy. Abstract: Sparse-view computed tomography (CT) is a practical solution to reduce radiation dose, but the resulting ill-posed inverse problem poses significant challenges for accurate image reconstruction. Although deep learning and diffusion-based methods have shown promising results, they often lack physical interpretability or suffer from high computational costs due to iterative sampling starting from random noise. Recent advances in generative modeling, particularly Poisson Flow Generative Models (PFGM), enable high-fidelity image synthesis by modeling the full data distribution. In this work, we propose Residual Poisson Flow (ResPF) Generative Models for efficient and accurate sparse-view CT reconstruction. Based on PFGM++, ResPF integrates conditional guidance from sparse measurements and employs a hijacking strategy to significantly reduce sampling cost by skipping redundant initial steps. However, skipping early stages can degrade reconstruction quality and introduce unrealistic structures. To address this, we embed a data-consistency step into each iteration, ensuring fidelity to sparse-view measurements. Yet, PFGM sampling relies on a fixed ordinary differential equation (ODE) trajectory induced by electrostatic fields, which can be disrupted by step-wise data consistency, resulting in unstable or degraded reconstructions. Inspired by ResNet, we introduce a residual fusion module to linearly combine generative outputs with data-consistent reconstructions, effectively preserving trajectory continuity. To the best of our knowledge, this is the first application of Poisson flow models to sparse-view CT. Extensive experiments on synthetic and clinical datasets demonstrate that ResPF achieves superior reconstruction quality, faster inference, and stronger robustness compared to state-of-the-art iterative, learning-based, and diffusion models.
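
The residual fusion idea, "linearly combine generative outputs with data-consistent reconstructions", can be illustrated with a convex blend of two images. This is only a sketch: the blending weight `alpha`, the function name, and the flat-list image representation are assumptions for illustration, and the abstract does not say how ResPF chooses or learns its combination weights.

```python
def residual_fuse(x_gen, x_dc, alpha=0.5):
    """Blend a generative sample with a data-consistent reconstruction.

    x_gen: output of one generative (ODE) step, as a flat list of pixels.
    x_dc:  the same image after a data-consistency correction.
    alpha: hypothetical weight; 0 keeps x_gen, 1 keeps x_dc.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(x_gen) != len(x_dc):
        raise ValueError("inputs must have the same number of pixels")
    # Linear combination: stays near the generative trajectory while
    # pulling each pixel toward the measurement-consistent value.
    return [(1.0 - alpha) * g + alpha * d for g, d in zip(x_gen, x_dc)]


fused = residual_fuse([1.0, 2.0], [3.0, 4.0], alpha=0.25)
```

A small `alpha` perturbs the sampling trajectory only slightly, which is the intuition behind using a blend instead of replacing the generative output outright.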

[360] SPC to 3D: Novel View Synthesis from Binary SPC via I2I translation

Sumit Sharma,Gopi Raju Matta,Kaushik Mitra

Main category: eess.IV

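TL;DR: Proposes a two-stage framework that converts binary SPC images into high-quality colorized novel views, addressing the failure of traditional 3D synthesis techniques under severe information loss.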

Details

Motivation: The binary nature of SPC images causes severe loss of texture and color information, which traditional 3D synthesis techniques cannot handle effectively. Method: A modular two-stage framework: (1) image-to-image translation with a generative model such as Pix2PixHD; (2) novel view generation with a 3D reconstruction technique such as NeRF or 3DGS. Result: Experiments show that the framework significantly outperforms the baseline in perceptual quality and geometric consistency. Conclusion: The two-stage framework effectively addresses information loss in SPC images and offers a new approach to 3D reconstruction and radiance field recovery. Abstract: Single Photon Avalanche Diodes (SPADs) represent a cutting-edge imaging technology, capable of detecting individual photons with remarkable timing precision. Building on this sensitivity, Single Photon Cameras (SPCs) enable image capture at exceptionally high speeds under both low and high illumination. Enabling 3D reconstruction and radiance field recovery from such SPC data holds significant promise. However, the binary nature of SPC images leads to severe information loss, particularly in texture and color, making traditional 3D synthesis techniques ineffective. To address this challenge, we propose a modular two-stage framework that converts binary SPC images into high-quality colorized novel views. The first stage performs image-to-image (I2I) translation using generative models such as Pix2PixHD, converting binary SPC inputs into plausible RGB representations. The second stage employs 3D scene reconstruction techniques like Neural Radiance Fields (NeRF) or Gaussian Splatting (3DGS) to generate novel views. We validate our two-stage pipeline (Pix2PixHD + NeRF/3DGS) through extensive qualitative and quantitative experiments, demonstrating significant improvements in perceptual quality and geometric consistency over the alternative baseline.

[361] Optimal Transport Driven Asymmetric Image-to-Image Translation for Nuclei Segmentation of Histological Images

Suman Mahapatra,Pradipta Maji

Main category: eess.IV

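TL;DR: The paper proposes a new deep generative model for segmenting nuclei structures from histological images; it addresses information asymmetry between domains and improves performance through an invertible generator and a spatially constrained optimization framework.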

Details

Motivation: Segmenting nuclei regions in histological images supports disease detection and diagnosis, but existing image-to-image translation models perform poorly when the two domains carry asymmetric information. Method: Proposes a deep generative model that combines optimal transport and measure theory, with an invertible generator and a spatially-constrained squeeze operation, to balance network complexity and performance. Result: The model outperforms existing methods on public datasets and achieves a better trade-off between network complexity and performance. Conclusion: The model provides an efficient, high-performing solution for nuclei segmentation in histological images. Abstract: Segmentation of nuclei regions from histological images enables morphometric analysis of nuclei structures, which in turn helps in the detection and diagnosis of diseases under consideration. To develop a nuclei segmentation algorithm, applicable to different types of target domain representations, image-to-image translation networks can be considered as they are invariant to target domain image representations. One of the important issues with image-to-image translation models is that they fail miserably when the information content between two image domains is asymmetric in nature. In this regard, the paper introduces a new deep generative model for segmenting nuclei structures from histological images. The proposed model considers an embedding space for handling information-disparity between information-rich histological image space and information-poor segmentation map domain. Integrating judiciously the concepts of optimal transport and measure theory, the model develops an invertible generator, which provides an efficient optimization framework with lower network complexity. The concept of invertible generator automatically eliminates the need for any explicit cycle-consistency loss. The proposed model also introduces a spatially-constrained squeeze operation within the framework of invertible generator to maintain spatial continuity within the image patches. The model provides a better trade-off between network complexity and model performance compared to other existing models having complex network architectures. The performance of the proposed deep generative model, along with a comparison with state-of-the-art nuclei segmentation methods, is demonstrated on publicly available histological image data sets.

[362] SiliCoN: Simultaneous Nuclei Segmentation and Color Normalization of Histological Images

Suman Mahapatra,Pradipta Maji

Main category: eess.IV

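TL;DR: The paper proposes a new deep generative model that simultaneously segments nuclei structures and normalizes the color appearance of stained histological images, combining the merits of the truncated normal distribution and spatial attention.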

Details

Motivation: Address the effect of color variation in stained tissue images on nuclei segmentation while improving the accuracy of both segmentation and color normalization. Method: A deep generative model that assumes color appearance information is independent of the nuclei segmentation map, uses a mixture of truncated normal distributions as the prior, and introduces a spatial attention mechanism. Result: Model performance is validated on public standard datasets and compared against existing algorithms. Conclusion: The model is generalizable and adaptable; modifying or losing color information does not affect the nuclei segmentation result. Abstract: Segmentation of nuclei regions from histological images is an important task for automated computer-aided analysis of histological images, particularly in the presence of impermissible color variation in the color appearance of stained tissue images. While color normalization enables better nuclei segmentation, accurate segmentation of nuclei structures makes color normalization rather trivial. In this respect, the paper proposes a novel deep generative model for simultaneously segmenting nuclei structures and normalizing color appearance of stained histological images. This model judiciously integrates the merits of truncated normal distribution and spatial attention. The model assumes that the latent color appearance information, corresponding to a particular histological image, is independent of respective nuclei segmentation map as well as embedding map information. The disentangled representation makes the model generalizable and adaptable, as the modification or loss of color appearance information cannot affect the nuclei segmentation map or the embedding information. Also, for dealing with the stain overlap of associated histochemical reagents, the prior for latent color appearance code is assumed to be a mixture of truncated normal distributions. The proposed model incorporates the concept of spatial attention for segmentation of nuclei regions from histological images. The performance of the proposed approach, along with a comparative analysis with related state-of-the-art algorithms, has been demonstrated on publicly available standard histological image data sets.

[363] Transfer Learning and Explainable AI for Brain Tumor Classification: A Study Using MRI Data from Bangladesh

Shuvashis Sarker

Main category: eess.IV

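TL;DR: The study develops an automated brain tumor classification system based on deep learning and explainable AI for MRI data analysis, aiming to improve diagnostic efficiency in resource-limited regions such as Bangladesh.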

Details

Motivation: Early diagnosis of brain tumors, especially malignant ones, is critical for patient outcomes, but manual MRI analysis is slow and error-prone, particularly where medical resources are limited. Method: Classifies brain tumors with deep learning models including VGG16, VGG19, and ResNet50, and applies XAI methods such as Grad-CAM and Grad-CAM++ to improve interpretability. Result: VGG16 performs best with 99.17% accuracy, and XAI improves the system's transparency and stability. Conclusion: Combining deep learning with XAI can effectively improve brain tumor detection and is suited to regions with limited medical resources. Abstract: Brain tumors, regardless of being benign or malignant, pose considerable health risks, with malignant tumors being more perilous due to their swift and uncontrolled proliferation, resulting in malignancy. Timely identification is crucial for enhancing patient outcomes, particularly in nations such as Bangladesh, where healthcare infrastructure is constrained. Manual MRI analysis is arduous and susceptible to inaccuracies, rendering it inefficient for prompt diagnosis. This research sought to tackle these problems by creating an automated brain tumor classification system utilizing MRI data obtained from many hospitals in Bangladesh. Advanced deep learning models, including VGG16, VGG19, and ResNet50, were utilized to classify glioma, meningioma, and various brain cancers. Explainable AI (XAI) methodologies, such as Grad-CAM and Grad-CAM++, were employed to improve model interpretability by emphasizing the critical areas in MRI scans that influenced the categorization. VGG16 achieved the highest accuracy, attaining 99.17%. The integration of XAI enhanced the system's transparency and stability, rendering it more appropriate for clinical application in resource-limited environments such as Bangladesh. This study highlights the capability of deep learning models, in conjunction with explainable artificial intelligence (XAI), to enhance brain tumor detection and identification in areas with restricted access to advanced medical technologies.

[364] A Comprehensive Analysis of COVID-19 Detection Using Bangladeshi Data and Explainable AI

Shuvashis Sarker

Main category: eess.IV

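TL;DR: The paper proposes a VGG19-based COVID-19 detection method that achieves high-accuracy classification of CXR images with deep learning and uses XAI techniques to improve model transparency and reliability.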

Details

Motivation: The COVID-19 pandemic has had a severe impact on public health and the economy, creating an urgent need for efficient detection methods. The study aims to strengthen the pandemic response by improving the classification accuracy of CXR images. Method: Uses a dataset of 4,350 CXR images in four classes. ML, DL, and TL models are applied, with VGG19 performing best. LIME is used to explain model predictions, and SMOTE is applied to address class imbalance. Result: VGG19 reaches 98% accuracy, LIME reveals the key regions that influence classification, and SMOTE effectively mitigates class imbalance. Conclusion: The study highlights the importance of XAI for model transparency and reliability and offers an efficient, interpretable solution for COVID-19 detection. Abstract: COVID-19 is a rapidly spreading and highly infectious virus which has triggered a global pandemic, profoundly affecting millions across the world. The pandemic has introduced unprecedented challenges in public health, economic stability, and societal structures, necessitating the implementation of extensive and multifaceted health interventions globally. It had a tremendous impact on Bangladesh by April 2024, with around 29,495 fatalities and more than 2 million confirmed cases. This study focuses on improving COVID-19 detection in CXR images by utilizing a dataset of 4,350 images from Bangladesh categorized into four classes: Normal, Lung-Opacity, COVID-19 and Viral-Pneumonia. ML, DL and TL models are employed with the VGG19 model achieving an impressive 98% accuracy. LIME is used to explain model predictions, highlighting the regions and features influencing classification decisions. SMOTE is applied to address class imbalances. By providing insight into both correct and incorrect classifications, the study emphasizes the importance of XAI in enhancing the transparency and reliability of models, ultimately improving the effectiveness of detection from CXR images.
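
For readers unfamiliar with SMOTE, the core step is interpolating between a minority-class sample and a nearby minority-class neighbor. The sketch below is a simplified illustration, not the paper's pipeline: it uses a single nearest neighbor instead of a random pick among k neighbors, works on toy 2-D feature vectors, and all names and data are hypothetical.

```python
import random


def nearest_neighbor(point, others):
    """Return the candidate closest to `point` (squared Euclidean distance)."""
    return min(others, key=lambda o: sum((a - b) ** 2 for a, b in zip(point, o)))


def smote_sample(point, neighbor, rng):
    """Create one synthetic sample on the segment between two minority points."""
    u = rng.random()  # interpolation factor in [0, 1)
    return [p + u * (n - p) for p, n in zip(point, neighbor)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
minority = [[1.0, 1.0], [1.2, 0.8], [3.0, 3.0]]  # toy minority-class features

base = minority[0]
neighbor = nearest_neighbor(base, [m for m in minority if m is not base])
synthetic = smote_sample(base, neighbor, rng)
```

Repeating the last three lines for different base points yields as many synthetic minority samples as needed to balance the classes.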

[365] A Narrative Review on Large AI Models in Lung Cancer Screening, Diagnosis, and Treatment Planning

Jiachen Zhong,Yiting Wang,Di Zhu,Ziwei Wang

Main category: eess.IV

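TL;DR: This review covers the use of large AI models in lung cancer screening, diagnosis, prognosis, and treatment, summarizing the taxonomy, performance, and clinical potential of existing models and discussing current limitations and future directions.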

Details

Motivation: Lung cancer is highly prevalent and deadly worldwide and demands accurate, timely diagnosis and treatment. Advances in large AI models provide new tools for medical image understanding and clinical decision-making. Method: Systematically reviews existing large AI models, grouping them into modality-specific encoders, encoder-decoder frameworks, and joint encoder architectures, and evaluates their performance on multimodal learning tasks. Result: Models perform well in pulmonary nodule detection, gene mutation prediction, multi-omics integration, and personalized treatment planning, and some have reached clinical validation. Conclusion: Large AI models have transformative potential for lung cancer care, but challenges in generalizability, interpretability, and regulatory compliance must be resolved before clinical integration. Abstract: Lung cancer remains one of the most prevalent and fatal diseases worldwide, demanding accurate and timely diagnosis and treatment. Recent advancements in large AI models have significantly enhanced medical image understanding and clinical decision-making. This review systematically surveys the state-of-the-art in applying large AI models to lung cancer screening, diagnosis, prognosis, and treatment. We categorize existing models into modality-specific encoders, encoder-decoder frameworks, and joint encoder architectures, highlighting key examples such as CLIP, BLIP, Flamingo, BioViL-T, and GLoRIA. We further examine their performance in multimodal learning tasks using benchmark datasets like LIDC-IDRI, NLST, and MIMIC-CXR. Applications span pulmonary nodule detection, gene mutation prediction, multi-omics integration, and personalized treatment planning, with emerging evidence of clinical deployment and validation. Finally, we discuss current limitations in generalizability, interpretability, and regulatory compliance, proposing future directions for building scalable, explainable, and clinically integrated AI systems. Our review underscores the transformative potential of large AI models to personalize and optimize lung cancer care.

[366] Text-guided multi-stage cross-perception network for medical image segmentation

Gaoyu Chen

Main category: eess.IV

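TL;DR: Proposes a Text-guided Multi-stage Cross-perception network (TMC) that improves medical image segmentation through a multi-stage cross-attention module and a multi-stage alignment loss.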

Details

Motivation: Existing medical image segmentation methods suffer from weak semantic expression because of low contrast between target and non-target regions, and existing text-guided methods have insufficient cross-modal interaction and inadequate feature expression. Method: Introduces a multi-stage cross-attention module to strengthen understanding of semantic details and a multi-stage alignment loss to improve cross-modal semantic consistency. Result: On three public datasets (QaTa-COV19, MosMedData, and Breast), the Dice scores are 84.77%, 78.50%, and 88.73% respectively, outperforming UNet-based and existing text-guided methods. Conclusion: By improving cross-modal interaction and feature expression, TMC significantly improves medical image segmentation performance. Abstract: Medical image segmentation plays a crucial role in clinical medicine, serving as a tool for auxiliary diagnosis, treatment planning, and disease monitoring, thus facilitating physicians in the study and treatment of diseases. However, existing medical image segmentation methods are limited by the weak semantic expression of the target segmentation regions, which is caused by the low contrast between the target and non-target segmentation regions. To address this limitation, text prompt information has great potential to capture the lesion location. However, existing text-guided methods suffer from insufficient cross-modal interaction and inadequate cross-modal feature expression. To resolve these issues, we propose the Text-guided Multi-stage Cross-perception network (TMC). In TMC, we introduce a multi-stage cross-attention module to enhance the model's understanding of semantic details and a multi-stage alignment loss to improve the consistency of cross-modal semantics. The results of the experiments demonstrate that our TMC achieves superior performance with Dice scores of 84.77%, 78.50%, and 88.73% on three public datasets (QaTa-COV19, MosMedData and Breast), outperforming UNet-based networks and text-guided methods.
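
The Dice scores reported above follow the standard definition, Dice = 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a ground-truth mask B. A minimal sketch on flattened binary masks (the function name and the toy masks are illustrative assumptions):

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 lists."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Both masks empty: treat as perfect agreement by convention.
        return 1.0
    return 2.0 * intersection / total


pred = [1, 1, 0, 0, 1]    # toy predicted mask
target = [1, 0, 0, 1, 1]  # toy ground-truth mask
score = dice_score(pred, target)  # 2 overlapping pixels, 3 + 3 foreground pixels
```

Multiplying the result by 100 gives the percentage form used in the abstract.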

[367] Fine-Grained Motion Compression and Selective Temporal Fusion for Neural B-Frame Video Coding

Xihua Sheng,Peilin Chen,Meng Wang,Li Zhang,Shiqi Wang,Dapeng Oliver Wu

Main category: eess.IV

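TL;DR: The paper proposes improvements for neural B-frame coding, namely fine-grained motion compression and selective temporal fusion, which significantly improve compression performance.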

Details

Motivation: Existing neural B-frame codecs directly reuse P-frame tools and fail to address the unique challenges of B-frame compression, which leads to suboptimal performance. Method: Designs a fine-grained motion compression method (an interactive dual-branch motion auto-encoder) and a selective temporal fusion method (predicting bi-directional fusion weights). Result: Experiments show that the method outperforms existing neural B-frame codecs and approaches the performance of the H.266/VVC reference software. Conclusion: The proposed method effectively addresses the challenges of B-frame compression and delivers significant performance gains. Abstract: With the remarkable progress in neural P-frame video coding, neural B-frame coding has recently emerged as a critical research direction. However, most existing neural B-frame codecs directly adopt P-frame coding tools without adequately addressing the unique challenges of B-frame compression, leading to suboptimal performance. To bridge this gap, we propose novel enhancements for motion compression and temporal fusion for neural B-frame coding. First, we design a fine-grained motion compression method. This method incorporates an interactive dual-branch motion auto-encoder with per-branch adaptive quantization steps, which enables fine-grained compression of bi-directional motion vectors while accommodating their asymmetric bitrate allocation and reconstruction quality requirements. Furthermore, this method involves an interactive motion entropy model that exploits correlations between bi-directional motion latent representations by interactively leveraging partitioned latent segments as directional priors. Second, we propose a selective temporal fusion method that predicts bi-directional fusion weights to achieve discriminative utilization of bi-directional multi-scale temporal contexts with varying qualities. Additionally, this method introduces a hyperprior-based implicit alignment mechanism for contextual entropy modeling. By treating the hyperprior as a surrogate for the contextual latent representation, this mechanism implicitly mitigates the misalignment in the fused bi-directional temporal priors. Extensive experiments demonstrate that our proposed codec outperforms state-of-the-art neural B-frame codecs and achieves comparable or even superior compression performance to the H.266/VVC reference software under random-access configurations.

physics.ed-ph [Back]

[368] Pendulum Tracker -- SimuFísica: A Web-based Tool for Real-time Measurement of Oscillatory Motion

Marco P. M. de Souza,Juciane G. Maia,Lilian N. de Andrade

Main category: physics.ed-ph

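TL;DR: Pendulum Tracker is a computer vision-based tool for real-time measurement of physical pendulum motion, integrated into the educational platform SimuFísica and usable on multiple device types.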

Details

Motivation: Provide a real-time, accurate, and easy-to-use pendulum measurement tool for experimental physics education. Method: Uses the OpenCV.js library and the browser camera to detect the pendulum's position in real time, plot angle versus time, and estimate the period. Result: Experimental results agree closely with theoretical predictions; the tool is suitable for measuring the period, determining gravitational acceleration, and analyzing damped oscillations. Conclusion: With a friendly interface and data export, Pendulum Tracker is a practical tool for teaching experimental physics. Abstract: We present Pendulum Tracker, a computer vision-based application that enables real-time measurement of the oscillatory motion of a physical pendulum. Integrated into the educational platform SimuFísica, the system uses the OpenCV.js library and runs directly in the browser, working on computers, tablets, and smartphones. The application automatically detects the pendulum's position via the device's camera, displaying in real time the angle-versus-time graph and estimates of the oscillation period. Experimental case studies demonstrate its effectiveness in measuring the period, determining gravitational acceleration, and analyzing damped oscillations. The results show excellent agreement with theoretical predictions, confirming the system's accuracy and its applicability in educational contexts. The accessible interface and the ability to export raw data make Pendulum Tracker a versatile tool for experimental physics teaching.
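
As an illustration of what can be done with exported angle-versus-time data, the sketch below estimates the period from upward zero crossings and then derives g from the small-angle simple-pendulum relation T = 2π·sqrt(L/g). The abstract does not describe the tool's own estimation method, so the zero-crossing approach, the simple-pendulum approximation, the function names, and the synthetic data are all assumptions.

```python
import math


def estimate_period(times, angles):
    """Average time between successive upward zero crossings of the angle."""
    crossings = []
    for i in range(1, len(angles)):
        a0, a1 = angles[i - 1], angles[i]
        if a0 < 0.0 <= a1:
            # Linear interpolation to locate the crossing between two samples.
            frac = -a0 / (a1 - a0)
            crossings.append(times[i - 1] + frac * (times[i] - times[i - 1]))
    if len(crossings) < 2:
        raise ValueError("need at least two upward zero crossings")
    return (crossings[-1] - crossings[0]) / (len(crossings) - 1)


def gravity_from_period(length_m, period_s):
    """Small-angle simple pendulum: g = 4 * pi^2 * L / T^2."""
    return 4.0 * math.pi ** 2 * length_m / period_s ** 2


# Synthetic stand-in for exported data: 10 s sampled at 100 Hz, 2 s period.
true_period = 2.0
times = [i * 0.01 for i in range(1000)]
angles = [0.1 * math.cos(2.0 * math.pi * t / true_period) for t in times]

period = estimate_period(times, angles)
g = gravity_from_period(0.9936, period)  # hypothetical length in meters
```

Averaging over several crossings reduces the effect of per-frame detection noise compared with timing a single swing.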

cs.IR [Back]

[369] DISRetrieval: Harnessing Discourse Structure for Long Document Retrieval

Huiyao Chen,Yi Yang,Yinghui Li,Meishan Zhang,Min Zhang

Main category: cs.IR

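TL;DR: DISRetrieval is a hierarchical retrieval framework built on linguistic discourse structure; by combining rhetorical structure theory with adaptive summarization, it significantly improves long document understanding.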

Details

Motivation: Existing methods fail to capture a document's inherent discourse structure, which limits long document understanding. Method: Introduces three innovations: a discourse-aware document organization framework, an LLM-enhanced node representation technique, and a hierarchical evidence retrieval mechanism. Result: Outperforms existing methods on the QASPER and QuALITY datasets, confirming the importance of discourse structure. Conclusion: Linguistically informed document representation is essential for long-text understanding; the code and datasets are open source. Abstract: Long document understanding has become increasingly crucial in natural language processing, with retrieval-based methods emerging as a promising solution to address the context length limitations of large language models (LLMs). However, existing approaches either treat documents as flat sequences or employ arbitrary chunking strategies, failing to capture the inherent discourse structure that guides human comprehension. We present DISRetrieval, a novel hierarchical retrieval framework that leverages linguistic discourse structure to enhance long document understanding. Our approach introduces three key innovations: (1) a discourse-aware document organization framework that utilizes rhetorical structure theory (RST) to create sentence-level hierarchical representations, preserving both semantic relationships and natural document flow; (2) an LLM-enhanced node representation technique that combines discourse structure with adaptive summarization to enrich tree nodes with contextual information; and (3) a hierarchical evidence retrieval mechanism that effectively selects relevant content while maintaining discourse coherence. Through comprehensive experiments on QASPER and QuALITY datasets, DISRetrieval demonstrates substantial improvements over existing methods in both token-level retrieval metrics and downstream question answering tasks. Our ablation studies confirm that incorporating discourse structure significantly enhances retrieval effectiveness across different document lengths and query types, validating the importance of linguistically-informed document representation in long-text understanding. Our code and datasets are publicly available at github/DreamH1gh/DISRetrieval to facilitate future research.

[370] Is BERTopic Better than PLSA for Extracting Key Topics in Aviation Safety Reports?

Aziida Nanyonga,Joiner Keith,Turhan Ugur,Wild Graham

Main category: cs.IR

TL;DR: Compares BERTopic and PLSA for topic extraction from aviation safety reports; BERTopic performs better.

Details

Motivation: Improve understanding of patterns in aviation incident data to support better-informed safety decisions. Method: Uses more than 36,000 NTSB reports (2000-2020); BERTopic relies on transformer embeddings and hierarchical clustering, while PLSA relies on the EM algorithm. Result: BERTopic outperforms PLSA in topic coherence (Cv score of 0.41 vs. 0.37) and interpretability. Conclusion: Modern transformer-based methods have an advantage in analyzing complex aviation data; future work will explore hybrid models and multilingual datasets. Abstract: This study compares the effectiveness of BERTopic and Probabilistic Latent Semantic Analysis (PLSA) in extracting meaningful topics from aviation safety reports, aiming to enhance the understanding of patterns in aviation incident data. Using a dataset of over 36,000 National Transportation Safety Board (NTSB) reports from 2000 to 2020, BERTopic employed transformer-based embeddings and hierarchical clustering, while PLSA utilized probabilistic modelling through the Expectation-Maximization (EM) algorithm. Results showed that BERTopic outperformed PLSA in topic coherence, achieving a Cv score of 0.41 compared to PLSA's 0.37, while also demonstrating superior interpretability as validated by aviation safety experts. These findings underscore the advantages of modern transformer-based approaches in analyzing complex aviation datasets, paving the way for enhanced insights and informed decision-making in aviation safety. Future work will explore hybrid models, multilingual datasets, and advanced clustering techniques to further improve topic modelling in this domain.

[371] FinBERT2: A Specialized Bidirectional Encoder for Bridging the Gap in Finance-Specific Deployment of Large Language Models

Xuan Xu,Fufang Wen,Beilin Chu,Zhibing Fu,Qinhong Lin,Jiaqi Liu,Binjie Fei,Zhongliang Yang,Linna Zhou,Yu Li

Main category: cs.IR

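TL;DR: FinBERT2 is a bidirectional encoder specialized for finance that addresses the limitations of LLMs in financial applications, including weak performance on discriminative tasks, the reliance of generative tasks on RAG, and shortcomings in topic modeling.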

Details

Motivation: In practical financial applications, LLMs show inadequate performance and high resource consumption, particularly for discriminative tasks, generative tasks, and topic modeling. Method: Proposes FinBERT2, a bidirectional encoder pretrained on a 32b-token financial corpus, and develops the variants Fin-Labelers, Fin-Retrievers, and Fin-TopicModel. Result: FinBERT2 outperforms other BERT variants and LLMs on financial classification tasks, surpasses open-source and proprietary embedding models on retrieval tasks, and delivers superior topic modeling. Conclusion: FinBERT2 provides an efficient solution for the financial domain and fills the gap in domain-specific deployment of LLMs. Abstract: In natural language processing (NLP), the focus has shifted from encoder-only tiny language models like BERT to decoder-only large language models (LLMs) such as GPT-3. However, LLMs' practical application in the financial sector has revealed three limitations: (1) LLMs often perform worse than fine-tuned BERT on discriminative tasks, such as market sentiment analysis in financial reports, despite requiring far more computational resources; (2) Application on generative tasks heavily relies on retrieval augmented generation (RAG) methods to provide current and specialized information, with general retrievers showing suboptimal performance on domain-specific retrieval tasks; (3) There are additional inadequacies in other feature-based scenarios, such as topic modeling. We introduce FinBERT2, a specialized bidirectional encoder pretrained on a high-quality, financial-specific corpus of 32b tokens. This represents the largest known Chinese financial pretraining corpus for models of this parameter size. As a better backbone, FinBERT2 can bridge the gap in the financial-specific deployment of LLMs through the following achievements: (1) Discriminative fine-tuned models (Fin-Labelers) outperform other (Fin)BERT variants by 0.4%-3.3% and leading LLMs by 9.7%-12.3% on average across five financial classification tasks; (2) Contrastive fine-tuned models (Fin-Retrievers) outperform both open-source (e.g., +6.8% avg improvement over BGE-base-zh) and proprietary (e.g., +4.2% avg improvement over OpenAI's text-embedding-3-large) embedders across five financial retrieval tasks; (3) Building on FinBERT2 variants, we construct the Fin-TopicModel, which enables superior clustering and topic representation for financial titles. Our work revisits financial BERT models through comparative analysis with contemporary LLMs and offers practical insights for effectively utilizing FinBERT in the LLM era.

[372] Optimizing RAG Pipelines for Arabic: A Systematic Analysis of Core Components

Jumana Alsubhi,Mohammad D. Alahmadi,Ahmed Alhusayni,Ibrahim Aldailami,Israa Hamdine,Ahmad Shabana,Yazeed Iskandar,Suhayb Khayyat

Main category: cs.IR

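TL;DR: This paper studies the optimization of Retrieval-Augmented Generation (RAG) for Arabic, evaluating several components (chunking strategies, embedding models, rerankers, and language models) and finding that sentence-aware chunking and specific embedding models perform best.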

Details

Motivation: RAG has been studied for high-resource languages, but its optimization for Arabic remains underexplored. Method: Uses the RAGAS framework to systematically evaluate RAG components on Arabic datasets, including chunking strategies, embedding models, rerankers, and language models. Result: Sentence-aware chunking works best, BGE-M3 and Multilingual-E5-large are the best embedding models, a reranker significantly improves faithfulness on complex datasets, and Aya-8B surpasses StableLM in generation quality. Conclusion: The study offers key insights for building high-quality Arabic RAG pipelines and practical guidance for selecting components for different document types. Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful architecture for combining the precision of retrieval systems with the fluency of large language models. While several studies have investigated RAG pipelines for high-resource languages, the optimization of RAG components for Arabic remains underexplored. This study presents a comprehensive empirical evaluation of state-of-the-art RAG components (including chunking strategies, embedding models, rerankers, and language models) across a diverse set of Arabic datasets. Using the RAGAS framework, we systematically compare performance across four core metrics: context precision, context recall, answer faithfulness, and answer relevancy. Our experiments demonstrate that sentence-aware chunking outperforms all other segmentation methods, while BGE-M3 and Multilingual-E5-large emerge as the most effective embedding models. The inclusion of a reranker (bge-reranker-v2-m3) significantly boosts faithfulness in complex datasets, and Aya-8B surpasses StableLM in generation quality. These findings provide critical insights for building high-quality Arabic RAG pipelines and offer practical guidelines for selecting optimal components across different document types.
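
Sentence-aware chunking means packing whole sentences into chunks instead of cutting at a fixed character offset. The abstract does not specify the paper's implementation, so the following is a minimal sketch under stated assumptions: sentences are split with a simple regular expression that also recognizes the Arabic question mark, and `max_chars` is a hypothetical size limit.

```python
import re

# Split after a sentence terminator (including the Arabic question mark)
# followed by whitespace. A production system would likely use a proper
# sentence segmenter; this pattern is an illustrative assumption.
SENTENCE_END = re.compile(r"(?<=[.!?؟])\s+")


def sentence_aware_chunks(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars."""
    sentences = [s.strip() for s in SENTENCE_END.split(text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


chunks = sentence_aware_chunks("One. Two two! Three three three? Four.", max_chars=15)
```

A sentence longer than `max_chars` is kept intact as its own chunk, which trades a strict size bound for never breaking a sentence.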

[373] LlamaRec-LKG-RAG: A Single-Pass, Learnable Knowledge Graph-RAG Framework for LLM-Based Ranking

Vahid Azizi,Fatemeh Koochaki

Main category: cs.IR

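TL;DR: LlamaRec-LKG-RAG is a new end-to-end trainable framework that integrates personalized knowledge graph context into LLM-based recommendation ranking and significantly improves recommendation quality.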

Details

Motivation: Existing RAG methods rely mainly on similarity-based retrieval and fail to exploit the rich relational structure in user-item interactions. Method: Extends the LlamaRec architecture with a lightweight user preference module that dynamically identifies salient relation paths in a heterogeneous knowledge graph and integrates them into the prompts of a Llama-2 model. Result: On the ML-100K and Amazon Beauty datasets, LlamaRec-LKG-RAG significantly outperforms LlamaRec on key metrics such as MRR, NDCG, and Recall. Conclusion: LlamaRec-LKG-RAG demonstrates the importance of structured reasoning in LLM-based recommendation and lays a foundation for next-generation knowledge-aware personalized recommender systems. Abstract: Recent advances in Large Language Models (LLMs) have driven their adoption in recommender systems through Retrieval-Augmented Generation (RAG) frameworks. However, existing RAG approaches predominantly rely on flat, similarity-based retrieval that fails to leverage the rich relational structure inherent in user-item interactions. We introduce LlamaRec-LKG-RAG, a novel single-pass, end-to-end trainable framework that integrates personalized knowledge graph context into LLM-based recommendation ranking. Our approach extends the LlamaRec architecture by incorporating a lightweight user preference module that dynamically identifies salient relation paths within a heterogeneous knowledge graph constructed from user behavior and item metadata. These personalized subgraphs are seamlessly integrated into prompts for a fine-tuned Llama-2 model, enabling efficient and interpretable recommendations through a unified inference step. Comprehensive experiments on ML-100K and Amazon Beauty datasets demonstrate consistent and significant improvements over LlamaRec across key ranking metrics (MRR, NDCG, Recall). LlamaRec-LKG-RAG demonstrates the critical value of structured reasoning in LLM-based recommendations and establishes a foundation for scalable, knowledge-aware personalization in next-generation recommender systems. Code is available at https://github.com/VahidAz/LlamaRec-LKG-RAG.
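
The ranking metrics named in the abstract (MRR, NDCG, Recall) have simple closed forms when each test case has exactly one held-out relevant item, which is a common setup in sequential recommendation. The sketch below assumes that single-target setting; the abstract does not state the paper's exact evaluation protocol, and the function names are illustrative.

```python
import math


def reciprocal_rank(ranked, target):
    """1 / rank of the target item, or 0 if it is not in the list."""
    for pos, item in enumerate(ranked, start=1):
        if item == target:
            return 1.0 / pos
    return 0.0


def recall_at_k(ranked, target, k):
    """1 if the target appears in the top k, else 0 (single relevant item)."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked, target, k):
    """With one relevant item the ideal DCG is 1, so NDCG = 1 / log2(rank + 1)."""
    for pos, item in enumerate(ranked[:k], start=1):
        if item == target:
            return 1.0 / math.log2(pos + 1)
    return 0.0


ranked_items = ["a", "b", "c", "d"]  # model's ranking for one user
held_out = "c"                       # the item the user actually chose

rr = reciprocal_rank(ranked_items, held_out)
hit_at_2 = recall_at_k(ranked_items, held_out, k=2)
ndcg_at_3 = ndcg_at_k(ranked_items, held_out, k=3)
```

MRR is the mean of `reciprocal_rank` over all test users, and the reported Recall@k and NDCG@k are likewise per-user values averaged over the test set.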

[374] HotelMatch-LLM: Joint Multi-Task Training of Small and Large Language Models for Efficient Multimodal Hotel Retrieval

Arian Askari,Emmanouil Stergiadis,Ilya Gusev,Moran Beladev

Main category: cs.IR

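TL;DR: HotelMatch-LLM is a multimodal dense retrieval model for natural language property search in the travel domain, removing the limitation of traditional search engines that require users to enter a destination first and then adjust search parameters.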

Details

Motivation: The limitations of traditional travel search engines motivate a more flexible and efficient search model that supports natural language queries and multimodal data. Method: The model combines domain-specific multi-task optimization, an asymmetrical dense retrieval architecture, and extensive image processing, with a small language model (SLM) and a large language model (LLM) working together. Result: On four test sets, HotelMatch-LLM significantly outperforms state-of-the-art models such as VISTA and MARVEL, scoring 0.681 on the main-query-type test set versus 0.603 for MARVEL. Conclusion: HotelMatch-LLM performs well in multi-task optimization, generalizes across LLM architectures, and scales to large image galleries. Abstract: We present HotelMatch-LLM, a multimodal dense retrieval model for the travel domain that enables natural language property search, addressing the limitations of traditional travel search engines which require users to start with a destination and edit search parameters. HotelMatch-LLM features three key innovations: (1) Domain-specific multi-task optimization with three novel retrieval, visual, and language modeling objectives; (2) Asymmetrical dense retrieval architecture combining a small language model (SLM) for efficient online query processing and a large language model (LLM) for embedding hotel data; and (3) Extensive image processing to handle all property image galleries. Experiments on four diverse test sets show HotelMatch-LLM significantly outperforms state-of-the-art models, including VISTA and MARVEL. Specifically, on the main-query-type test set, we achieve 0.681 for HotelMatch-LLM compared to 0.603 for the most effective baseline, MARVEL. Our analysis highlights the impact of our multi-task optimization, the generalizability of HotelMatch-LLM across LLM architectures, and its scalability for processing large image galleries.

eess.AS [Back]

[375] Reducing Object Hallucination in Large Audio-Language Models via Audio-Aware Decoding

Tzu-wen Hsu,Ke-Han Lu,Cheng-Han Chiang,Hung-yi Lee

Main category: eess.AS

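TL;DR: The paper proposes Audio-Aware Decoding (AAD), which uses contrastive decoding to reduce hallucination in large audio-language models (LALMs) and yields clear gains in experiments.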

Details

Motivation: Existing LALMs perform well on standard benchmarks but hallucinate badly, producing answers that do not match the audio content. Method: AAD is a lightweight inference-time strategy that contrasts token predictions with and without the audio context and promotes tokens whose probability increases when the audio is present. Result: On object hallucination datasets, AAD improves the F1 score by 0.046 to 0.428; on general audio QA datasets, it improves accuracy by 5.4% to 10.3%. Conclusion: AAD reduces hallucination in LALMs, with gains confirmed across multiple datasets. Abstract: Large Audio-Language Models (LALMs) can take audio and text as the inputs and answer questions about the audio. While prior LALMs have shown strong performance on standard benchmarks, there has been alarming evidence that LALMs can hallucinate what is presented in the audio. To mitigate the hallucination of LALMs, we introduce Audio-Aware Decoding (AAD), a lightweight inference-time strategy that uses contrastive decoding to compare the token prediction logits with and without the audio context. By contrastive decoding, AAD promotes the tokens whose probability increases when the audio is present. We conduct our experiment on object hallucination datasets with three LALMs and show that AAD improves the F1 score by 0.046 to 0.428. We also show that AAD can improve the accuracy on general audio QA datasets like Clotho-AQA by 5.4% to 10.3%. We conduct thorough ablation studies to understand the effectiveness of each component in AAD.
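The contrast between predictions with and without audio can be sketched as follows. This is a minimal illustration of generic contrastive decoding, not the paper's implementation: the combination rule `(1 + alpha) * with - alpha * without`, the parameter `alpha`, and all function names are assumptions, and the toy logits stand in for two forward passes of a real LALM.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def audio_aware_scores(logits_with_audio, logits_without_audio, alpha=1.0):
    """Assumed contrastive rule: boost tokens that the audio makes more likely.

    alpha = 0 recovers ordinary decoding on the audio-conditioned logits.
    """
    with_audio = log_softmax(logits_with_audio)
    without_audio = log_softmax(logits_without_audio)
    return [(1.0 + alpha) * a - alpha * b for a, b in zip(with_audio, without_audio)]


def pick_token(scores):
    """Greedy choice: index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy vocabulary of three tokens. Token 0 is likely even without audio
# (a text prior); token 1 becomes likely only once the audio is present.
with_audio = [2.0, 1.9, 0.0]
without_audio = [2.0, 0.0, 0.0]
chosen = pick_token(audio_aware_scores(with_audio, without_audio, alpha=1.0))
```

In this toy case plain greedy decoding on the audio-conditioned logits would pick token 0, while the contrastive scores favor token 1, the token whose probability rose when audio was added.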

[376] Speaker-Distinguishable CTC: Learning Speaker Distinction Using CTC for Multi-Talker Speech Recognition

Asahi Sakuma,Hiroaki Sato,Ryuga Sugano,Tadashi Kumano,Yoshihiko Kawai,Tetsuji Ogawa

Main category: eess.AS

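TL;DR: Proposes SD-CTC, a multi-talker speech recognition framework that needs no auxiliary information and, combined with the SOT framework, substantially reduces the error rate.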

Details

Motivation: Address the recognition errors that SOT suffers from speaker assignment failures, without relying on auxiliary information that is hard to extract. Method: Extends CTC to SD-CTC, which jointly assigns a token and a speaker label to each frame, and combines it with the SOT framework for multi-task learning. Result: Experiments show that combining SD-CTC with SOT reduces the error rate by 26%, reaching performance comparable to state-of-the-art methods that rely on auxiliary information. Conclusion: SD-CTC improves multi-talker speech recognition without auxiliary information. Abstract: This paper presents a novel framework for multi-talker automatic speech recognition without the need for auxiliary information. Serialized Output Training (SOT), a widely used approach, suffers from recognition errors due to speaker assignment failures. Although incorporating auxiliary information, such as token-level timestamps, can improve recognition accuracy, extracting such information from natural conversational speech remains challenging. To address this limitation, we propose Speaker-Distinguishable CTC (SD-CTC), an extension of CTC that jointly assigns a token and its corresponding speaker label to each frame. We further integrate SD-CTC into the SOT framework, enabling the SOT model to learn speaker distinction using only overlapping speech and transcriptions. Experimental comparisons show that multi-task learning with SD-CTC and SOT reduces the error rate of the SOT model by 26% and achieves performance comparable to state-of-the-art methods relying on auxiliary information.

cs.AI [Back]

[377] Contextual Experience Replay for Self-Improvement of Language Agents

Yitao Liu,Chenglei Si,Karthik Narasimhan,Shunyu Yao

Main category: cs.AI

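TL;DR: The paper proposes CER, a training-free framework that accumulates and synthesizes past experiences in a dynamic memory buffer so that LLM agents can self-improve on complex tasks.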

Details

Motivation: Current LLM agents lack environment-specific experience on complex tasks and cannot learn continually at inference time, which limits their adaptability. Method: CER accumulates environment dynamics and decision-making patterns in a dynamic memory buffer, from which the agent retrieves relevant knowledge to improve on new tasks. Result: CER achieves success rates of 36.7% on WebArena and 31.9% on VisualWebArena; the WebArena result is a 51.0% relative improvement over the GPT-4o agent baseline. Conclusion: CER improves the adaptability and performance of LLM agents in complex environments. Abstract: Large language model (LLM) agents have been applied to sequential decision-making tasks such as web navigation, but without any environment-specific experiences, they often fail in these complex tasks. Moreover, current LLM agents are not designed to continually learn from past experiences during inference time, which could be crucial for them to gain these environment-specific experiences. To address this, we propose Contextual Experience Replay (CER), a training-free framework to enable efficient self-improvement for language agents in their context window. Specifically, CER accumulates and synthesizes past experiences into a dynamic memory buffer. These experiences encompass environment dynamics and common decision-making patterns, allowing the agents to retrieve and augment themselves with relevant knowledge in new tasks, enhancing their adaptability in complex environments. We evaluate CER on the challenging WebArena and VisualWebArena benchmarks. On VisualWebArena, CER achieves a competitive performance of 31.9%. On WebArena, CER also gets a competitive average success rate of 36.7%, relatively improving the success rate of the GPT-4o agent baseline by 51.0%. We also conduct a comprehensive analysis of CER to demonstrate its efficiency and validity and to understand it better.
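The relative-improvement figure for WebArena can be unpacked with a short calculation. The baseline value below is only what the two reported numbers (36.7% and 51.0%) imply; it is not stated in the abstract and is rounded.

```latex
\text{baseline} \approx \frac{36.7\%}{1 + 0.510} \approx 24.3\%,
\qquad
36.7\% - 24.3\% \approx 12.4 \text{ percentage points}
```

So a 51.0% relative gain corresponds to roughly 12 percentage points of absolute success rate, under the assumption that the relative improvement is computed as (new - old) / old.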

[378] Cross-Entropy Games for Language Models: From Implicit Knowledge to General Capability Measures

Clément Hongler,Andrew Emil

Main category: cs.AI

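TL;DR: The paper proposes a family of tasks built on the probability measures defined by large language models (LLMs), called Cross-Entropy (Xent) Games, as a way to evaluate LLM capabilities.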

Details

Motivation: Examine what it means for an LLM to know its own probability measure, and extend the range of tasks beyond generative sampling. Method: Constructs Xent Games from cross-entropy scores and constraints, covering single-player and multi-player tasks that can be expressed as computational graphs and programs. Result: The Xent Game space is rich and constructible, and can be used to build capability benchmarks (Xent Game measures). Conclusion: Proposes exploring the Xent Game space with ideas inspired by evolutionary dynamics to address the unbounded scope problem of measuring general abilities. Abstract: Large Language Models (LLMs) define probability measures on text. By considering the implicit knowledge question of what it means for an LLM to know such a measure and what it entails algorithmically, we are naturally led to formulate a series of tasks that go beyond generative sampling, involving forms of summarization, counterfactual thinking, anomaly detection, originality search, reverse prompting, debating, creative solving, etc. These tasks can be formulated as games based on LLM measures, which we call Cross-Entropy (Xent) Games. Xent Games can be single-player or multi-player. They involve cross-entropy scores and cross-entropy constraints, and can be expressed as simple computational graphs and programs. We show the Xent Game space is large enough to contain a wealth of interesting examples, while being constructible from basic game-theoretic consistency axioms. We then discuss how the Xent Game space can be used to measure the abilities of LLMs. This leads to the construction of Xent Game measures: finite families of Xent Games that can be used as capability benchmarks, built from a given scope, by extracting a covering measure. To address the unbounded scope problem associated with the challenge of measuring general abilities, we propose to explore the space of Xent Games in a coherent fashion, using ideas inspired by evolutionary dynamics.

[379] Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering

Akash Gupta,Amos Storkey,Mirella Lapata

Main category: cs.AI

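TL;DR: The paper proposes a meta-learning approach that distills task-relevant image features into soft prompts, addressing the inconsistent in-context learning performance of large multimodal models (LMMs).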

Details

Motivation: In-context learning in LMMs is unstable, especially in smaller models; the authors attribute this to superfluous information in the image embeddings interfering with the task. Method: A meta-learning approach that distills task-relevant features through an attention-mapper module and soft prompts, supporting task adaptation in low-data regimes. Result: Evaluation on VL-ICL Bench shows the method outperforms in-context learning and related prompt-tuning approaches, and improves task induction and reasoning even under image perturbations. Conclusion: The method offers an effective route to task adaptation for LMMs in low-data regimes. Abstract: Large Multimodal Models (LMMs) often rely on in-context learning (ICL) to perform new tasks with minimal supervision. However, ICL performance, especially in smaller LMMs, is inconsistent and does not always improve monotonically with increasing examples. We hypothesize that this occurs due to the LMM being overwhelmed by additional information present in the image embeddings, which is not required for the downstream task. To address this, we propose a meta-learning approach that provides an alternative for inducing few-shot capabilities in LMMs, using a fixed set of soft prompts that are distilled from task-relevant image features and can be adapted at test time using a few examples. To facilitate this distillation, we introduce an attention-mapper module that can be easily integrated with the popular LLaVA v1.5 architecture and is jointly learned with soft prompts, enabling task adaptation in LMMs under low-data regimes with just a few gradient steps. Evaluation on the VL-ICL Bench shows that our method consistently outperforms ICL and related prompt-tuning approaches, even under image perturbations, improving task induction and reasoning across visual question answering tasks.

[380] The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Parshin Shojaee,Iman Mirzadeh,Keivan Alizadeh,Maxwell Horton,Samy Bengio,Mehrdad Farajtabar

Main category: cs.AI

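TL;DR: The paper studies how large reasoning models (LRMs) behave as task complexity grows and finds limits to their reasoning: accuracy collapses on high-complexity tasks, and reasoning effort does not scale monotonically with problem complexity.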

Details

Motivation: Current work on LRMs focuses on math and coding benchmarks and offers little insight into their reasoning process and capabilities, especially across tasks of differing complexity. Method: Studies LRMs systematically in controllable puzzle environments, analyzing both reasoning traces and final answers, and compares them with standard LLMs under the same inference compute. Result: LRMs are outperformed by standard models on low-complexity tasks, hold an advantage on medium-complexity tasks, and collapse completely (as do standard models) on high-complexity tasks; their reasoning effort rises with complexity up to a point and then declines. Conclusion: LRMs have limitations in exact computation and consistent reasoning, and their reasoning capabilities need further study. Abstract: Recent generations of language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations primarily focus on established math and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from contamination and does not provide insights into the reasoning traces. In this work, we systematically investigate these gaps with the help of controllable puzzle environments that allow precise manipulation of complexity while maintaining consistent logical structures. This setup enables the analysis of not only final answers but also the internal reasoning traces, offering insights into how LRMs think. Through extensive experiments, we show that LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having remaining token budget. By comparing LRMs with their standard LLM counterparts under the same inference compute, we identify three performance regimes: (1) low-complexity tasks where standard models outperform LRMs, (2) medium-complexity tasks where LRMs demonstrate an advantage, and (3) high-complexity tasks where both models face complete collapse. We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across scales. We also investigate the reasoning traces in more depth, studying the patterns of explored solutions and analyzing the models' computational behavior, shedding light on their strengths, limitations, and raising questions about their reasoning capabilities.

[381] Mitigating Behavioral Hallucination in Multimodal Large Language Models for Sequential Images

Liangliang You,Junchi Yao,Shu Yang,Guimin Hu,Lijie Hu,Di Wang

Main category: cs.AI

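TL;DR: The paper proposes SHE, a two-stage framework that detects and mitigates behavioral hallucination, along with a new metric, BEACH, that quantifies its severity.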

Details

Motivation: Multimodal large language models suffer from behavioral hallucination, which harms their reliability and scalability, yet the problem has received little study. Method: Proposes the SHE framework, which detects hallucinations with a visual-textual alignment check over an adaptive temporal window and mitigates them via orthogonal projection onto the joint embedding space. Result: On standard benchmarks, SHE reduces behavioral hallucination by over 10% on BEACH while maintaining descriptive accuracy. Conclusion: SHE fills a gap in the study of behavioral hallucination and offers a practical remedy. Abstract: While multimodal large language models excel at various tasks, they still suffer from hallucinations, which limit their reliability and scalability for broader domain applications. To address this issue, recent research mainly focuses on objective hallucination. However, for sequential images, besides objective hallucination, there is also behavioral hallucination, which is less studied. This work aims to fill in the gap. We first reveal that behavioral hallucinations mainly arise from two key factors: prior-driven bias and the snowball effect. Based on these observations, we introduce SHE (Sequence Hallucination Eradication), a lightweight, two-stage framework that (1) detects hallucinations via visual-textual alignment check using our proposed adaptive temporal window and (2) mitigates them via orthogonal projection onto the joint embedding space. We also propose a new metric (BEACH) to quantify behavioral hallucination severity. Empirical results on standard benchmarks demonstrate that SHE reduces behavioral hallucination by over 10% on BEACH while maintaining descriptive accuracy.

[382] SAFEFLOW: A Principled Protocol for Trustworthy and Transactional Autonomous Agent Systems

Peiran Li,Xinkai Zou,Zhuohang Wu,Ruifeng Li,Shuo Xing,Hanwen Zheng,Zhikai Hu,Yuping Wang,Haoxi Li,Qin Yuan,Yingmo Zhang,Zhengzhong Tu

Main category: cs.AI

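TL;DR: SAFEFLOW is a new protocol-level framework for building trustworthy LLM/VLM-based agents, improving security and reliability through fine-grained information flow control and multi-agent coordination mechanisms.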

Details

Motivation: Current LLM/VLM-based agent frameworks lack mechanisms for secure information flow, reliability, and multi-agent coordination, which leaves them fragile. Method: SAFEFLOW introduces fine-grained information flow control (IFC), transactional execution, conflict resolution, and secure scheduling, and adds write-ahead logging, rollback, and secure caches for robustness. Result: Experiments show that SAFEFLOW substantially outperforms the state of the art under adversarial, noisy, and concurrent conditions while maintaining task performance and security guarantees. Conclusion: SAFEFLOW lays the groundwork for reliable, secure, and robust agent ecosystems. Abstract: Recent advances in large language models (LLMs) and vision-language models (VLMs) have enabled powerful autonomous agents capable of complex reasoning and multi-modal tool use. Despite their growing capabilities, today's agent frameworks remain fragile, lacking principled mechanisms for secure information flow, reliability, and multi-agent coordination. In this work, we introduce SAFEFLOW, a new protocol-level framework for building trustworthy LLM/VLM-based agents. SAFEFLOW enforces fine-grained information flow control (IFC), precisely tracking provenance, integrity, and confidentiality of all the data exchanged between agents, tools, users, and environments. By constraining LLM reasoning to respect these security labels, SAFEFLOW prevents untrusted or adversarial inputs from contaminating high-integrity decisions. To ensure robustness in concurrent multi-agent settings, SAFEFLOW introduces transactional execution, conflict resolution, and secure scheduling over shared state, preserving global consistency across agents. We also introduce mechanisms, including write-ahead logging, rollback, and secure caches, that further enhance resilience against runtime errors and policy violations. To validate performance, we built SAFEFLOWBENCH, a comprehensive benchmark suite designed to evaluate agent reliability under adversarial, noisy, and concurrent operational conditions. Extensive experiments demonstrate that agents built with SAFEFLOW maintain impressive task performance and security guarantees even in hostile environments, substantially outperforming state-of-the-art. Together, SAFEFLOW and SAFEFLOWBENCH lay the groundwork for principled, robust, and secure agent ecosystems, advancing the frontier of reliable autonomy.

[383] Evaluating Large Language Models on the Frame and Symbol Grounding Problems: A Zero-shot Benchmark

Shoko Oka

Main category: cs.AI

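TL;DR: Can modern large language models (LLMs) address the Frame Problem and the Symbol Grounding Problem of traditional AI? The study tests 13 LLMs on purpose-built benchmark tasks and finds that several closed models perform well.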

Details

Motivation: Investigate whether modern LLMs can address the Frame Problem and the Symbol Grounding Problem, long regarded as unsolvable within traditional symbolic AI. Method: Designs two benchmark tasks, administers them under zero-shot conditions to 13 LLMs (open-source and closed), and scores the quality of their outputs. Result: Open-source models vary in performance with model size, quantization, and instruction tuning, while several closed models score consistently high. Conclusion: Some modern LLMs may be acquiring the capacity to respond meaningfully to these long-standing theoretical challenges. Abstract: Recent advancements in large language models (LLMs) have revitalized philosophical debates surrounding artificial intelligence. Two of the most fundamental challenges - namely, the Frame Problem and the Symbol Grounding Problem - have historically been viewed as unsolvable within traditional symbolic AI systems. This study investigates whether modern LLMs possess the cognitive capacities required to address these problems. To do so, I designed two benchmark tasks reflecting the philosophical core of each problem, administered them under zero-shot conditions to 13 prominent LLMs (both closed and open-source), and assessed the quality of the models' outputs across five trials each. Responses were scored along multiple criteria, including contextual reasoning, semantic coherence, and information filtering. The results demonstrate that while open-source models showed variability in performance due to differences in model size, quantization, and instruction tuning, several closed models consistently achieved high scores. These findings suggest that select modern LLMs may be acquiring capacities sufficient to produce meaningful and stable responses to these long-standing theoretical challenges.

[384] LUCIFER: Language Understanding and Context-Infused Framework for Exploration and Behavior Refinement

Dimitris Panagopoulos,Adolfo Perrusquia,Weisi Guo

Main category: cs.AI

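TL;DR: The LUCIFER framework combines a hierarchical decision-making architecture, reinforcement learning, and large language models, using human contextual knowledge to improve the efficiency and quality of autonomous decision-making.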

Details

Motivation: In dynamic environments, existing knowledge goes stale quickly and limits autonomous decision-making, so contextual knowledge from humans observing in real time needs to be integrated. Method: Proposes the LUCIFER framework, which combines hierarchical decision-making, reinforcement learning, and LLMs; the LLMs serve both as context extractors and as zero-shot exploration facilitators. Result: LUCIFER improves exploration efficiency and decision quality, outperforming flat, goal-conditioned policies. Conclusion: LUCIFER shows the value of integrating human contextual knowledge for the operational success of autonomous systems. Abstract: In dynamic environments, the rapid obsolescence of pre-existing environmental knowledge creates a gap between an agent's internal model and the evolving reality of its operational context. This disparity between prior and updated environmental valuations fundamentally limits the effectiveness of autonomous decision-making. To bridge this gap, the contextual bias of human domain stakeholders, who naturally accumulate insights through direct, real-time observation, becomes indispensable. However, translating their nuanced and context-rich input into actionable intelligence for autonomous systems remains an open challenge. To address this, we propose LUCIFER (Language Understanding and Context-Infused Framework for Exploration and Behavior Refinement), a domain-agnostic framework that integrates a hierarchical decision-making architecture with reinforcement learning (RL) and large language models (LLMs) into a unified system. This architecture mirrors how humans decompose complex tasks, enabling a high-level planner to coordinate specialised sub-agents, each focused on distinct objectives and temporally interdependent actions. Unlike traditional applications where LLMs are limited to a single role, LUCIFER integrates them in two synergistic roles: as context extractors, structuring verbal stakeholder input into domain-aware representations that influence decision-making through an attention space mechanism aligning LLM-derived insights with the agent's learning process, and as zero-shot exploration facilitators guiding the agent's action selection process during exploration. We benchmark various LLMs in both roles and demonstrate that LUCIFER improves exploration efficiency and decision quality, outperforming flat, goal-conditioned policies. Our findings show the potential of context-driven decision-making, where autonomous systems leverage human contextual knowledge for operational success.

[385] Solving Inequality Proofs with Large Language Models

Jiayi Sheng,Luna Lyu,Jikai Jin,Tony Xia,Alex Gu,James Zou,Pan Lu

Main category: cs.AI

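TL;DR: The paper proposes a new task formulation for inequality proving and releases the IneqMath dataset, exposing the limits of current large language models on rigorous proofs.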

Details

Motivation: Existing datasets are scarce, synthetic, or rigidly formal, which holds back research on inequality proving. Method: Recasts inequality proving as two automatically checkable subtasks, bound estimation and relation prediction, and develops an LLM-as-judge evaluation framework. Result: A systematic evaluation of 29 leading LLMs on IneqMath shows poor performance: even top models achieve under 10% overall accuracy under step-wise scrutiny, exposing fragile reasoning in rigorous proofs. Conclusion: The study points to promising directions such as theorem-guided reasoning and self-refinement, and releases its data and code. Abstract: Inequality proving, crucial across diverse scientific and mathematical fields, tests advanced reasoning skills such as discovering tight bounds and strategic theorem application. This makes it a distinct, demanding frontier for large language models (LLMs), offering insights beyond general mathematical problem-solving. Progress in this area is hampered by existing datasets that are often scarce, synthetic, or rigidly formal. We address this by proposing an informal yet verifiable task formulation, recasting inequality proving into two automatically checkable subtasks: bound estimation and relation prediction. Building on this, we release IneqMath, an expert-curated dataset of Olympiad-level inequalities, including a test set and training corpus enriched with step-wise solutions and theorem annotations. We also develop a novel LLM-as-judge evaluation framework, combining a final-answer judge with four step-wise judges designed to detect common reasoning flaws. A systematic evaluation of 29 leading LLMs on IneqMath reveals a surprising reality: even top models like o1 achieve less than 10% overall accuracy under step-wise scrutiny; this is a drop of up to 65.5% from their accuracy considering only final answer equivalence. This discrepancy exposes fragile deductive chains and a critical gap for current LLMs between merely finding an answer and constructing a rigorous proof. Scaling model size and increasing test-time computation yield limited gains in overall proof correctness. Instead, our findings highlight promising research directions such as theorem-guided reasoning and self-refinement. Code and data are available at https://ineqmath.github.io/.

[386] Reinforcing Multimodal Understanding and Generation with Dual Self-rewards

Jixiang Hong,Yiran Zhang,Guanzhong Wang,Yi Liu,Ji-Rong Wen,Rui Yan

Main category: cs.AI

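TL;DR: The paper proposes a self-supervised dual reward mechanism that exploits understanding and generation as inverse dual tasks to improve multimodal models without external supervision.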

Details

Motivation: Existing large multimodal models (LMMs) struggle with image-text alignment and depend on external supervision; because understanding and generation are inverse dual tasks, self-supervised optimization becomes possible. Method: Proposes a dual reward mechanism that samples multiple outputs, reverses the input-output pairs, and uses the model's dual likelihood as a self-reward for optimization. Result: Experiments show clear gains on visual understanding and generation benchmarks, most notably on text-to-image tasks. Conclusion: The self-supervised dual reward mechanism improves multimodal models without external supervision and suggests a direction for future work. Abstract: Building upon large language models (LLMs), recent large multimodal models (LMMs) unify cross-model understanding and generation into a single framework. However, LMMs still struggle to achieve accurate image-text alignment, prone to generating text responses contradicting the visual input or failing to follow the text-to-image prompts. Current solutions require external supervision (e.g., human feedback or reward models) and only address unidirectional tasks, either understanding or generation. In this work, based on the observation that understanding and generation are inverse dual tasks, we introduce a self-supervised dual reward mechanism to reinforce the understanding and generation capabilities of LMMs. Specifically, we sample multiple outputs for a given input in one task domain, then reverse the input-output pairs to compute the dual likelihood of the model as self-rewards for optimization. Extensive experimental results on visual understanding and generation benchmarks demonstrate that our method can effectively enhance the performance of the model without any external supervision, especially achieving remarkable improvements in text-to-image tasks.

[387] $τ^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

Victor Barres,Honghua Dong,Soham Ray,Xujie Si,Karthik Narasimhan

Main category: cs.AI

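TL;DR: The paper proposes $\tau^2$-bench, a conversational AI benchmark with a dual-control environment that simulates real-world settings in which both the user and the AI agent use tools in a shared, dynamic world.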

Details

Motivation: Existing conversational AI benchmarks simulate only single-control environments and ignore real scenarios, such as technical support, where the user actively participates, so a more realistic test environment is needed. Method: 1) introduces a Telecom dual-control domain modeled as a Dec-POMDP; 2) designs a task generator that composes diverse tasks from atomic components; 3) develops a reliable user simulator; 4) evaluates performance through fine-grained, multi-ablation analysis. Result: Experiments show that performance drops significantly when agents move from the no-user setting to dual control, highlighting the challenge of guiding users. Conclusion: $\tau^2$-bench provides a controlled testbed for agents that must both reason effectively and guide user actions. Abstract: Existing benchmarks for conversational AI agents simulate single-control environments, where only the AI agent can use tools to interact with the world, while the user remains a passive information provider. This differs from real-world scenarios like technical support, where users need to actively participate in modifying the state of the (shared) world. In order to address this gap, we introduce $\tau^2$-bench, with four key contributions: 1) A novel Telecom dual-control domain modeled as a Dec-POMDP, where both agent and user make use of tools to act in a shared, dynamic environment that tests both agent coordination and communication, 2) A compositional task generator that programmatically creates diverse, verifiable tasks from atomic components, ensuring domain coverage and controlled complexity, 3) A reliable user simulator tightly coupled with the environment, whose behavior is constrained by tools and observable states, improving simulation fidelity, 4) Fine-grained analysis of agent performance through multiple ablations including separating errors arising from reasoning vs communication/coordination. In particular, our experiments show significant performance drops when agents shift from no-user to dual-control, highlighting the challenges of guiding users. Overall, $\tau^2$-bench provides a controlled testbed for agents that must both reason effectively and guide user actions.

[388] VisioMath: Benchmarking Figure-based Mathematical Reasoning in LMMs

Can Li,Ting Zhang,Mei Wang,Hua Huang

Main category: cs.AI

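TL;DR: VisioMath is a benchmark for multimodal mathematical reasoning that focuses on math problems whose answer options are images; existing large multimodal models (LMMs) perform poorly on it.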

Details

Motivation: The ability of large multimodal models to reason mathematically over image-based answer options is underexplored; VisioMath fills this gap. Method: VisioMath contains 8,070 images and 1,800 multiple-choice questions in which every answer option is an image, and is used to evaluate the mathematical reasoning of LMMs. Result: Even state-of-the-art LMMs struggle on VisioMath; GPT-4o reaches only 45.9% accuracy. Conclusion: VisioMath provides an important benchmark for multimodal mathematical reasoning, exposes the limits of current models, and should drive future research. Abstract: Large Multimodal Models (LMMs) have demonstrated remarkable problem-solving capabilities across various domains. However, their ability to perform mathematical reasoning when answer options are represented as images--an essential aspect of multi-image comprehension--remains underexplored. To bridge this gap, we introduce VisioMath, a benchmark designed to evaluate mathematical reasoning in multimodal contexts involving image-based answer choices. VisioMath comprises 8,070 images and 1,800 multiple-choice questions, where each answer option is an image, presenting unique challenges to existing LMMs. To the best of our knowledge, VisioMath is the first dataset specifically tailored for mathematical reasoning in image-based-option scenarios, where fine-grained distinctions between answer choices are critical for accurate problem-solving. We systematically evaluate state-of-the-art LMMs on VisioMath and find that even the most advanced models struggle with this task. Notably, GPT-4o achieves only 45.9% accuracy, underscoring the limitations of current models in reasoning over visually similar answer choices. By addressing a crucial gap in existing benchmarks, VisioMath establishes a rigorous testbed for future research, driving advancements in multimodal reasoning.

[389] Long-Tailed Learning for Generalized Category Discovery

Cuong Manh Hoang

Main category: cs.AI

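TL;DR: Proposes a new framework for generalized category discovery under long-tailed distributions, improving performance through self-guided labeling and representation balancing.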

Details

Motivation: Real-world datasets are typically imbalanced; existing methods work well on balanced datasets but degrade on imbalanced data. Method: Uses a self-guided labeling technique to generate pseudo-labels with less bias, and a representation balancing process that mines sample neighborhoods so the model attends to tail classes. Result: Experiments on public datasets show the framework outperforms previous state-of-the-art methods. Conclusion: The proposed framework handles generalized category discovery under long-tailed distributions effectively. Abstract: Generalized Category Discovery (GCD) utilizes labeled samples of known classes to discover novel classes in unlabeled samples. Existing methods show effective performance on artificial datasets with balanced distributions. However, real-world datasets are always imbalanced, significantly affecting the effectiveness of these methods. To solve this problem, we propose a novel framework that performs generalized category discovery in long-tailed distributions. We first present a self-guided labeling technique that uses a learnable distribution to generate pseudo-labels, resulting in less biased classifiers. We then introduce a representation balancing process to derive discriminative representations. By mining sample neighborhoods, this process encourages the model to focus more on tail classes. We conduct experiments on public datasets to demonstrate the effectiveness of the proposed framework. The results show that our model exceeds previous state-of-the-art methods.

[390] GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior

Penghao Wu,Shengnan Ma,Bo Wang,Jiaheng Yu,Lewei Lu,Ziwei Liu

Main category: cs.AI

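TL;DR: The GUI-Reflection framework equips multimodal GUI models with self-reflection and error correction through three stages (pre-training, offline fine-tuning, and online reflection tuning), with fully automated data generation and learning.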

Details

Motivation: Existing GUI models rely on error-free offline trajectories and lack reflection and error recovery, which limits their robustness and adaptability. Method: Proposes the GUI-Reflection framework, comprising automated data generation, the GUI-Reflection Task Suite, an online training environment, and an iterative reflection tuning algorithm. Result: The framework gives GUI agents self-reflection and correction capabilities, improving the robustness and intelligence of GUI automation. Conclusion: GUI-Reflection lays the groundwork for more capable and adaptable GUI automation, and all resources will be released publicly. Abstract: Multimodal Large Language Models (MLLMs) have shown great potential in revolutionizing Graphical User Interface (GUI) automation. However, existing GUI models mostly rely on learning from nearly error-free offline trajectories, thus lacking reflection and error recovery capabilities. To bridge this gap, we propose GUI-Reflection, a novel framework that explicitly integrates self-reflection and error correction capabilities into end-to-end multimodal GUI models throughout dedicated training stages: GUI-specific pre-training, offline supervised fine-tuning (SFT), and online reflection tuning. GUI-Reflection enables self-reflection behavior emergence with fully automated data generation and learning processes without requiring any human annotation. Specifically, 1) we first propose scalable data pipelines to automatically construct reflection and error correction data from existing successful trajectories. While existing GUI models mainly focus on grounding and UI understanding ability, we propose the GUI-Reflection Task Suite to learn and evaluate reflection-oriented abilities explicitly. 2) Furthermore, we built a diverse and efficient environment for online training and data collection of GUI models on mobile devices. 3) We also present an iterative online reflection tuning algorithm leveraging the proposed environment, enabling the model to continuously enhance its reflection and error correction abilities. Our framework equips GUI agents with self-reflection and correction capabilities, paving the way for more robust, adaptable, and intelligent GUI automation, with all data, models, environments, and tools to be released publicly.

cs.CR [Back]

[391] HeavyWater and SimplexWater: Watermarking Low-Entropy Text Distributions

Dor Tsur,Carol Xuan Long,Claudio Mayrink Verdun,Hsiang Hsu,Chen-Fu Chen,Haim Permuter,Sajani Vithana,Flavio P. Calmon

Main category: cs.CR

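TL;DR: The paper proposes an optimization framework for watermark design and derives two new watermarks, HeavyWater and SimplexWater, that achieve high detection accuracy with minimal text distortion on low-entropy tasks.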

Details

Motivation: Address the difficulty of LLM watermarking in low-entropy generation tasks such as coding, using random side information so as to maximize detection probability while minimizing text distortion. Method: Proposes an optimization framework and designs two tunable watermarks, HeavyWater and SimplexWater, that apply to any LLM and are agnostic to how the side information is generated. Result: Experiments show both watermarks achieve high detection accuracy with minimal impact on text quality in the low-entropy regime, and the analysis reveals new connections between watermarking and coding theory. Conclusion: HeavyWater and SimplexWater are effective designs for LLM watermarking, particularly on low-entropy tasks. Abstract: Large language model (LLM) watermarks enable authentication of text provenance, curb misuse of machine-generated text, and promote trust in AI systems. Current watermarks operate by changing the next-token predictions output by an LLM. The updated (i.e., watermarked) predictions depend on random side information produced, for example, by hashing previously generated tokens. LLM watermarking is particularly challenging in low-entropy generation tasks - such as coding - where next-token predictions are near-deterministic. In this paper, we propose an optimization framework for watermark design. Our goal is to understand how to most effectively use random side information in order to maximize the likelihood of watermark detection and minimize the distortion of generated text. Our analysis informs the design of two new watermarks: HeavyWater and SimplexWater. Both watermarks are tunable, gracefully trading off between detection accuracy and text distortion. They can also be applied to any LLM and are agnostic to side information generation. We examine the performance of HeavyWater and SimplexWater through several benchmarks, demonstrating that they can achieve high watermark detection accuracy with minimal compromise of text generation quality, particularly in the low-entropy regime. Our theoretical analysis also reveals surprising new connections between LLM watermarking and coding theory. The code implementation can be found at https://github.com/DorTsur/HeavyWater_SimplexWater
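The idea of side information obtained by hashing previously generated tokens can be sketched as below. This is explicitly not HeavyWater or SimplexWater, whose constructions are not described in this entry; it is the widely known "green list" style of logit biasing, used only to show where hashed side information enters. The key, window size, `delta`, and all function names are hypothetical.

```python
import hashlib
import random


def side_information_seed(previous_tokens, secret_key, window=2):
    """Hash the last `window` token ids with a secret key into an integer seed."""
    context = ",".join(str(token) for token in previous_tokens[-window:])
    digest = hashlib.sha256(f"{secret_key}|{context}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def green_list(seed, vocab_size, fraction=0.5):
    """Pseudo-randomly select a subset of the vocabulary from the seed."""
    rng = random.Random(seed)
    token_ids = list(range(vocab_size))
    rng.shuffle(token_ids)
    return set(token_ids[: int(vocab_size * fraction)])


def watermark_logits(logits, previous_tokens, secret_key, delta=2.0):
    """Add `delta` to the logits of tokens selected by the side information."""
    seed = side_information_seed(previous_tokens, secret_key)
    favored = green_list(seed, len(logits))
    return [x + delta if i in favored else x for i, x in enumerate(logits)]


# Toy example: a 10-token vocabulary with flat logits.
biased = watermark_logits([0.0] * 10, previous_tokens=[4, 7, 1], secret_key="demo-key")
```

A detector holding the same key can recompute the favored set at each position and count how often the generated token falls in it. When the next-token distribution is near-deterministic, a fixed bias like this either has little effect or distorts the text, which is the low-entropy difficulty the abstract describes.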

[392] Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test

Xiaoyuan Zhu,Yaowen Ye,Tianyi Qiu,Hanlin Zhu,Sijun Tan,Ajraf Mannan,Jonathan Michala,Raluca Ada Popa,Willie Neiswanger

Main category: cs.CR

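TL;DR: Proposes a rank-based uniformity test that verifies whether a black-box LLM behaves identically to a locally deployed authentic model, addressing the risk that API providers silently substitute the model.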

Details

Motivation: To cut costs or maliciously alter model behavior, API providers may covertly serve quantized or fine-tuned variants, degrading performance and safety, and users have little means of detecting this. Method: A rank-based uniformity test that is query-efficient and avoids detectable query patterns, so adversarial providers have difficulty noticing that they are being tested. Result: Across threat scenarios including quantization, harmful fine-tuning, jailbreak prompts, and full model substitution, the method shows superior statistical power under limited query budgets. Conclusion: The method effectively verifies the behavioral equality of black-box LLMs, outperforms prior methods, and suits adversarial settings. Abstract: As API access becomes a primary interface to large language models (LLMs), users often interact with black-box systems that offer little transparency into the deployed model. To reduce costs or maliciously alter model behaviors, API providers may discreetly serve quantized or fine-tuned variants, which can degrade performance and compromise safety. Detecting such substitutions is difficult, as users lack access to model weights and, in most cases, even output logits. To tackle this problem, we propose a rank-based uniformity test that can verify the behavioral equality of a black-box LLM to a locally deployed authentic model. Our method is accurate, query-efficient, and avoids detectable query patterns, making it robust to adversarial providers that reroute or mix responses upon the detection of testing attempts. We evaluate the approach across diverse threat scenarios, including quantization, harmful fine-tuning, jailbreak prompts, and full model substitution, showing that it consistently achieves superior statistical power over prior methods under constrained query budgets.
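
The abstract does not spell out the test statistic, so the following is only a generic sketch of how a rank-based uniformity check can work, not the authors' procedure. It assumes each response can be reduced to a scalar score (hypothetical here, simulated with Gaussian draws): if the API response comes from the same distribution as the local samples, its randomized rank among them is uniform on [0, 1), and a Kolmogorov-Smirnov distance with a Monte Carlo p-value can flag departures. All function names are illustrative.

```python
import random

def randomized_rank(api_score, local_scores, rng):
    """Map the API score to [0, 1) via its rank among the local scores."""
    below = sum(s < api_score for s in local_scores)
    ties = sum(s == api_score for s in local_scores)
    # Break ties uniformly at random, then jitter inside the rank's bin.
    rank = below + rng.randint(0, ties)
    return (rank + rng.random()) / (len(local_scores) + 1)

def ks_statistic(values):
    """Kolmogorov-Smirnov distance between the sample and Uniform(0, 1)."""
    xs = sorted(values)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

def uniformity_p_value(values, rng, n_sim=2000):
    """Monte Carlo p-value: how often do truly uniform samples look this extreme?"""
    observed = ks_statistic(values)
    n = len(values)
    hits = sum(
        ks_statistic([rng.random() for _ in range(n)]) >= observed
        for _ in range(n_sim)
    )
    return (hits + 1) / (n_sim + 1)

# Toy demo: scores are simulated, standing in for a hypothetical per-response score.
rng = random.Random(0)

def simulate(api_mean, n_prompts=100, m=20):
    return [
        randomized_rank(
            rng.gauss(api_mean, 1.0),
            [rng.gauss(0.0, 1.0) for _ in range(m)],
            rng,
        )
        for _ in range(n_prompts)
    ]

p_same = uniformity_p_value(simulate(0.0), rng)     # API matches the local model
p_shifted = uniformity_p_value(simulate(3.0), rng)  # API serves a different model
```

In this toy setup a small `p_shifted` signals that the served model's scores do not rank uniformly among the local samples. Which score to use, and how to choose prompts so the queries stay inconspicuous, are the parts a real audit would need to take from the paper itself.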

[393] HauntAttack: When Attack Follows Reasoning as a Shadow

Jingyuan Ma,Rui Li,Zheng Li,Junfeng Liu,Lei Sha,Zhifang Sui

Main category: cs.CR

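TL;DR: Proposes HauntAttack, a framework that embeds harmful instructions into reasoning questions to expose safety vulnerabilities in large reasoning models.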

Details

Motivation: Study the safety risks that arise as large reasoning models (LRMs) gain stronger reasoning abilities and expose their internal reasoning, in particular the safety-reasoning trade-off when reasoning is entangled with harmfulness. Method: Introduces HauntAttack, a general-purpose black-box attack framework that replaces a condition in a reasoning question with a harmful instruction, guiding the model step by step toward unsafe outputs. Result: Experiments show that even the most advanced LRMs have significant safety vulnerabilities; the paper also analyzes different models, types of harmful instructions, and output patterns in detail. Conclusion: HauntAttack exposes the safety fragility of LRMs and offers a useful reference for future model safety design. Abstract: Emerging Large Reasoning Models (LRMs) consistently excel in mathematical and reasoning tasks, showcasing exceptional capabilities. However, the enhancement of reasoning abilities and the exposure of their internal reasoning processes introduce new safety vulnerabilities. One intriguing concern is: when reasoning is strongly entangled with harmfulness, what safety-reasoning trade-off do LRMs exhibit? To address this issue, we introduce HauntAttack, a novel and general-purpose black-box attack framework that systematically embeds harmful instructions into reasoning questions. Specifically, we treat reasoning questions as carriers and substitute one of their original conditions with a harmful instruction. This process creates a reasoning pathway in which the model is guided step by step toward generating unsafe outputs. Based on HauntAttack, we conduct comprehensive experiments on multiple LRMs. Our results reveal that even the most advanced LRMs exhibit significant safety vulnerabilities. Additionally, we perform a detailed analysis of different models, various types of harmful instructions, and model output patterns, providing valuable insights into the security of LRMs.

[394] Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment Failures

Yukai Zhou,Sibei Yang,Wenjie Wang

Main category: cs.CR

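TL;DR: Reframes LLM risk through a structured quadrant view, proposes the JailFlipBench benchmark to evaluate implicit harm, and develops attack methods that reveal its real-world risk.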

Details

Motivation: Safety concerns about LLMs in real-world applications are growing, and implicit harm in particular has received too little attention. Method: Proposes the JailFlipBench benchmark, covering single-modal, multimodal, and factual extension scenarios, and develops JailFlip attack methods. Result: Evaluations show that implicit harm poses real-world risks for both open-source and black-box LLMs. Conclusion: Calls for broadening LLM safety evaluation beyond the conventional jailbreak paradigm. Abstract: Large language models (LLMs) are increasingly deployed in real-world applications, raising concerns about their security. While jailbreak attacks highlight failures under overtly harmful queries, they overlook a critical risk: incorrectly answering harmless-looking inputs can be dangerous and cause real-world harm (Implicit Harm). We systematically reformulate the LLM risk landscape through a structured quadrant perspective based on output factuality and input harmlessness, uncovering an overlooked high-risk region. To investigate this gap, we propose JailFlipBench, a benchmark that aims to capture implicit harm, spanning single-modal, multimodal, and factual extension scenarios with diverse evaluation metrics. We further develop initial JailFlip attack methodologies and conduct comprehensive evaluations across multiple open-source and black-box LLMs, showing that implicit harm presents immediate and urgent real-world risks and calling for broader LLM safety assessments and alignment beyond conventional jailbreak paradigms.

cs.LG [Back]

[395] GLProtein: Global-and-Local Structure Aware Protein Representation Learning

Yunqing Liu,Wenqi Fan,Xiaoyong Wei,Qing Li

Main category: cs.LG

TL;DR: GLProtein is a protein pre-training framework that integrates global structural similarity with local amino acid details to improve prediction accuracy.

Details

Motivation: Protein sequence analysis has advanced, but integrating protein structural information still has untapped potential. Structural information is not limited to 3D coordinates; it spans multiple levels, from amino acid molecules to structural similarity between proteins. Method: GLProtein combines protein masked modeling, triplet structure similarity scoring, protein 3D distance encoding, and substructure-based amino acid molecule encoding. Result: Experiments show that GLProtein outperforms existing methods on tasks such as protein-protein interaction prediction and contact prediction. Conclusion: By integrating multi-level structural information, GLProtein provides a more comprehensive framework for protein function prediction. Abstract: Proteins are central to biological systems, participating as building blocks across all forms of life. Despite advancements in understanding protein functions through protein sequence analysis, there remains potential for further exploration in integrating protein structural information. We argue that the structural information of proteins is not only limited to their 3D information but also encompasses information from amino acid molecules (local information) to protein-protein structure similarity (global information). To address this, we propose GLProtein, the first framework in protein pre-training that incorporates both global structural similarity and local amino acid details to enhance prediction accuracy and functional insights. GLProtein innovatively combines protein-masked modelling with triplet structure similarity scoring, protein 3D distance encoding and substructure-based amino acid molecule encoding. Experimental results demonstrate that GLProtein outperforms previous methods in several bioinformatics tasks, including protein-protein interaction prediction and contact prediction, among others.

[396] dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching

Zhiyuan Liu,Yicun Yang,Yaojie Zhang,Junjie Chen,Chang Zou,Qingyuan Wei,Shaobo Wang,Linfeng Zhang

Main category: cs.LG

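TL;DR: Proposes dLLM-Cache, a training-free adaptive caching framework that addresses the high inference latency of diffusion-based large language models (dLLMs) and substantially speeds up inference.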

Details

Motivation: Acceleration techniques for autoregressive models (ARMs), such as key-value caching, do not carry over to dLLMs because of their bidirectional attention, which leaves dLLMs with high latency. The paper sets out to close this gap. Method: Building on the observation that dLLM inference involves a static prompt and a partially dynamic response, proposes dLLM-Cache, which combines long-interval prompt caching with partial response updates guided by feature similarity. Result: On dLLMs including LLaDA 8B and Dream 7B, dLLM-Cache achieves up to 9.1x faster inference without degrading output quality. Conclusion: dLLM-Cache markedly reduces dLLM inference latency, bringing it close to that of ARMs in many settings, and offers an efficient route to practical dLLM deployment. Abstract: Autoregressive Models (ARMs) have long dominated the landscape of Large Language Models. Recently, a new paradigm has emerged in the form of diffusion-based Large Language Models (dLLMs), which generate text by iteratively denoising masked segments. This approach has shown significant advantages and potential. However, dLLMs suffer from high inference latency. Traditional ARM acceleration techniques, such as Key-Value caching, are incompatible with dLLMs due to their bidirectional attention mechanism. To address this specific challenge, our work begins with a key observation that dLLM inference involves a static prompt and a partially dynamic response, where most tokens remain stable across adjacent denoising steps. Based on this, we propose dLLM-Cache, a training-free adaptive caching framework that combines long-interval prompt caching with partial response updates guided by feature similarity. This design enables efficient reuse of intermediate computations without compromising model performance. Extensive experiments on representative dLLMs, including LLaDA 8B and Dream 7B, show that dLLM-Cache achieves up to 9.1x speedup over standard inference without compromising output quality. Notably, our method brings dLLM inference latency close to that of ARMs under many settings. Codes are provided in the supplementary material and will be released publicly on GitHub.
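
The two ingredients named in the abstract, long-interval prompt caching and similarity-guided partial response updates, can be illustrated with a small sketch. This is an assumption-laden toy, not the released implementation: the feature vectors, the refresh `interval`, the update `ratio`, and both function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; zero vectors are treated as unchanged (similarity 1)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 1.0

def tokens_to_refresh(cached, current, ratio=0.25):
    """Indices of the response tokens whose features drifted most since caching."""
    k = max(1, math.ceil(ratio * len(cached)))
    by_similarity = sorted(
        range(len(cached)), key=lambda i: cosine(cached[i], current[i])
    )
    return sorted(by_similarity[:k])

def should_refresh_prompt(step, interval=50):
    """The prompt is static, so its cached features are recomputed only rarely."""
    return step % interval == 0

# Toy features for four response tokens at two adjacent denoising steps.
cached = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
current = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [1.0, 0.0]]
stale = tokens_to_refresh(cached, current, ratio=0.5)
```

Only the tokens in `stale` would be recomputed at this step; the rest reuse their cached intermediate results, which is where the savings would come from under these assumptions.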

[397] Reward Is Enough: LLMs Are In-Context Reinforcement Learners

Kefan Song,Amir Moeini,Peng Wang,Lei Gong,Rohan Chandra,Yanjun Qi,Shangtong Zhang

Main category: cs.LG

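TL;DR: Proposes ICRL prompting, a multi-round prompting framework, and finds that LLMs can improve their responses at inference time from in-context feedback, behaving much like a reinforcement learning algorithm.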

Details

Motivation: Explore whether LLMs can improve task performance at inference time from feedback signals (rewards), in the manner of reinforcement learning. Method: Proposes the ICRL prompting framework, which improves the LLM's responses step by step through multiple rounds of prompting and reward feedback. Result: On three benchmarks, ICRL prompting significantly outperforms baseline methods, and even reward signals generated by the LLM itself yield gains. Conclusion: ICRL prompting offers a new paradigm for improving LLM task performance at inference time and shows RL-like potential. Abstract: Reinforcement learning (RL) is a human-designed framework for solving sequential decision making problems. In this work, we demonstrate that, surprisingly, RL emerges at inference time in large language models (LLMs) -- a phenomenon known as in-context RL (ICRL). Specifically, we propose a novel multi-round prompting framework called ICRL prompting. The goal is to prompt the LLM to complete a task. After the LLM generates a response at the current round, we give numerical scalar feedback for the response, called the reward. At the next round, we prompt the LLM again with the same task and a context consisting of all previous responses and rewards. We observe that the quality of the LLM's response increases as the context grows. In other words, the LLM is able to maximize the scalar reward signal at inference time, just like an RL algorithm. We evaluate ICRL prompting on three benchmarks (Game of 24, creative writing, and ScienceWorld) and demonstrate significant performance improvements over baseline methods such as Self-Refine and Reflexion. Surprisingly, in some experiments the reward signals are generated by the LLM itself, yet performance improvements are still observed from ICRL prompting, offering a promising paradigm for scaling test-time compute.
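
The multi-round loop described in the abstract can be sketched as follows. The prompt wording, `toy_llm`, and `toy_reward` are hypothetical stand-ins so the loop runs without a model; a real setup would call an actual LLM and a task-specific reward.

```python
def build_prompt(task, history):
    """Same task every round, plus all previous responses and their rewards."""
    lines = [f"Task: {task}", "Previous attempts:"]
    lines += [f"Response: {resp} | Reward: {rew}" for resp, rew in history]
    lines.append("Give a new response that earns a higher reward.")
    return "\n".join(lines)

def icrl_prompting(task, llm, reward_fn, rounds=5):
    """Run the multi-round loop and return the (response, reward) history."""
    history = []
    for _ in range(rounds):
        response = llm(build_prompt(task, history))
        history.append((response, reward_fn(response)))
    return history

def toy_llm(prompt):
    """Stand-in for an LLM: read the context and step past the best attempt."""
    best_value, best_reward = None, None
    for line in prompt.splitlines():
        if line.startswith("Response: "):
            resp, rew = line[len("Response: "):].split(" | Reward: ")
            if best_reward is None or float(rew) > best_reward:
                best_value, best_reward = int(resp), float(rew)
    return "0" if best_value is None else str(best_value + 1)

def toy_reward(response):
    """Scalar feedback: closer to the hidden target (7) is better."""
    return -abs(int(response) - 7)

history = icrl_prompting("Guess the hidden integer.", toy_llm, toy_reward, rounds=10)
best_response, best_reward = max(history, key=lambda pair: pair[1])
```

The only state carried between rounds is the growing context; no weights are updated, which is what distinguishes this setting from conventional RL fine-tuning.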

[398] Towards Efficient Multi-LLM Inference: Characterization and Analysis of LLM Routing and Hierarchical Techniques

Adarsh Prasad Behera,Jaya Prakash Champati,Roberto Morabito,Sasu Tarkoma,James Gross

Main category: cs.LG

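TL;DR: Examines how dynamic model selection and hierarchical inference strategies can make language model inference more efficient, lowering compute cost and energy use.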

Details

Motivation: The high compute and energy cost of current language models limits their deployment in mobile, edge, or cost-sensitive environments. Method: Surveys two strategies: (i) routing, which selects the most suitable model for each query, and (ii) hierarchical inference, which passes a query through a sequence of models. Result: Both approaches aim to reduce computational resource use while preserving performance; the survey compares them across key performance metrics. Conclusion: Future research directions include faster response times, adaptive model selection, and deployment across heterogeneous environments, to make language models more efficient and accessible. Abstract: Recent progress in Language Models (LMs) has dramatically advanced the field of natural language processing (NLP), excelling at tasks like text generation, summarization, and question answering. However, their inference remains computationally expensive and energy intensive, especially in settings with limited hardware, power, or bandwidth. This makes it difficult to deploy LMs in mobile, edge, or cost-sensitive environments. To address these challenges, recent approaches have introduced multi-LLM intelligent model selection strategies that dynamically allocate computational resources based on query complexity -- using lightweight models for simpler queries and escalating to larger models only when necessary. This survey explores two complementary strategies for efficient LLM inference: (i) routing, which selects the most suitable model based on the query, and (ii) cascading or hierarchical inference (HI), which escalates queries through a sequence of models until a confident response is found. Both approaches aim to reduce computation by using lightweight models for simpler tasks while offloading only when needed. We provide a comparative analysis of these techniques across key performance metrics, discuss benchmarking efforts, and outline open challenges. Finally, we outline future research directions to enable faster response times, adaptive model selection based on task complexity, and scalable deployment across heterogeneous environments, making LLM-based systems more efficient and accessible for real-world applications.
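
The difference between the two strategies in the abstract is easiest to see side by side. In this minimal sketch, a "model" is any callable returning `(answer, confidence)`; the toy models, the word-count router, and the 0.8 threshold are all illustrative assumptions rather than anything prescribed by the survey.

```python
def route(query, models, router):
    """Routing: choose one model up front from the query alone."""
    answer, _ = models[router(query)](query)
    return answer

def cascade(query, models, threshold=0.8):
    """Cascading / hierarchical inference: escalate until a model is confident."""
    answer = None
    for model in models:  # ordered from cheapest to most expensive
        answer, confidence = model(query)
        if confidence >= threshold:
            break
    return answer  # falls back to the last model's answer

def small_model(query):
    # Toy rule: the small model is only confident on short queries.
    confidence = 0.9 if len(query.split()) <= 5 else 0.4
    return f"small: {query}", confidence

def large_model(query):
    return f"large: {query}", 0.95

def length_router(query):
    """Hypothetical router: send long queries straight to the large model."""
    return 1 if len(query.split()) > 5 else 0

models = [small_model, large_model]
easy = "What is 2 + 2?"
hard = "Summarize the trade-offs between routing and cascading for edge deployment."
```

Routing pays for exactly one model call but depends on the router guessing difficulty correctly; cascading may pay for several calls but decides from the cheaper model's actual output.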

[399] Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning

Shubham Parashar,Shurui Gui,Xiner Li,Hongyi Ling,Sushil Vemuri,Blake Olson,Eric Li,Yu Zhang,James Caverlee,Dileep Kalathil,Shuiwang Ji

Main category: cs.LG

TL;DR: E2H Reasoner schedules tasks from easy to hard during reinforcement learning to improve language model reasoning, with notable gains for small LLMs.

Details

Motivation: Prior studies suggest that reinforcement learning alone is of limited use for improving language model reasoning on inherently difficult tasks, which motivates an easy-to-hard task schedule. Method: Proposes E2H Reasoner, which builds reasoning ability gradually by scheduling tasks from easy to hard, and proves convergence guarantees for it. Result: Experiments show that E2H Reasoner significantly improves the reasoning ability of small LLMs (1.5B to 3B) with better sample efficiency. Conclusion: Easy-to-hard task scheduling is an effective way to improve language model reasoning, especially for small models. Abstract: We aim to improve the reasoning capabilities of language models via reinforcement learning (RL). Recent RL post-trained models like DeepSeek-R1 have demonstrated reasoning abilities on mathematical and coding tasks. However, prior studies suggest that using RL alone to improve reasoning on inherently difficult tasks is less effective. Here, we draw inspiration from curriculum learning and propose to schedule tasks from easy to hard (E2H), allowing LLMs to build reasoning skills gradually. Our method is termed E2H Reasoner. Empirically, we observe that, although easy tasks are important initially, fading them out through appropriate scheduling is essential in preventing overfitting. Theoretically, we establish convergence guarantees for E2H Reasoner within an approximate policy iteration framework. We derive finite-sample complexity bounds and show that when tasks are appropriately decomposed and conditioned, learning through curriculum stages requires fewer total samples than direct learning. Experiments across multiple domains show that E2H Reasoner significantly improves the reasoning ability of small LLMs (1.5B to 3B), which otherwise struggle when trained with vanilla RL alone, highlighting the effectiveness of our method.
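
One way to picture an easy-to-hard schedule that fades out easy tasks is a sampling distribution over difficulty levels whose mass slides toward harder levels as training progresses. The Gaussian-shaped weighting, `sigma`, and the function names below are assumptions made for illustration; the abstract does not specify the schedule the authors use.

```python
import math
import random

def e2h_weights(progress, num_levels, sigma=0.75):
    """Sampling weights over difficulty levels (0 = easiest).

    progress runs from 0.0 (start of training) to 1.0 (end); the peak of the
    distribution moves from the easiest level to the hardest one.
    """
    center = progress * (num_levels - 1)
    raw = [
        math.exp(-((level - center) ** 2) / (2 * sigma ** 2))
        for level in range(num_levels)
    ]
    total = sum(raw)
    return [w / total for w in raw]

def sample_level(progress, num_levels, rng):
    """Draw the difficulty level for the next training task."""
    weights = e2h_weights(progress, num_levels)
    return rng.choices(range(num_levels), weights=weights, k=1)[0]

rng = random.Random(0)
start = e2h_weights(0.0, num_levels=3)   # mostly easy tasks
middle = e2h_weights(0.5, num_levels=3)  # peak at the medium level
end = e2h_weights(1.0, num_levels=3)     # easy tasks largely faded out
level = sample_level(0.5, 3, rng)
```

A narrower `sigma` makes the hand-off between levels sharper; a wider one keeps more easy tasks in the mix for longer.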

[400] MarginSel: Max-Margin Demonstration Selection for LLMs

Rajeev Bhatt Ambati,James Lester,Shashank Srivastava,Snigdha Chaturvedi

Main category: cs.LG

TL;DR: MarginSel selects hard demonstration examples to improve LLM in-context learning, raising performance on classification tasks.

Details

Motivation: In-context learning performance is sensitive to which demonstration examples are selected and how they are ordered. Method: Proposes MarginSel, a two-step method that selects hard examples and adapts to each test instance. Result: An absolute F1 improvement of 2-7% on classification tasks. Conclusion: By increasing the margin for hard examples, MarginSel shifts the LLM's decision boundary in a beneficial direction. Abstract: Large Language Models (LLMs) excel at few-shot learning via in-context learning (ICL). However, the effectiveness of ICL is often sensitive to the selection and ordering of demonstration examples. To address this, we present MarginSel: Max-Margin Demonstration Selection for LLMs, a two-step method that selects hard demonstration examples for the ICL prompt, adapting to each test instance. Our approach achieves 2-7% absolute improvement in F1-score across classification tasks, compared to a random selection of examples. We also provide theoretical insights and empirical evidence showing that MarginSel induces max-margin behavior in LLMs by effectively increasing the margin for hard examples, analogous to support vectors, thereby shifting the decision boundary in a beneficial direction.

[401] Efficient Text-Attributed Graph Learning through Selective Annotation and Graph Alignment

Huanyi Xie,Lijie Hu,Lu Yu,Tianhao Huang,Longfei Li,Meng Li,Jun Zhou,Huan Wang,Di Wang

Main category: cs.LG

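TL;DR: GAGA is an efficient framework for TAG representation learning that cuts time and cost by annotating only representative nodes and edges, and integrates the resulting annotation graph with the TAG through a two-level alignment module; experiments show strong performance at high efficiency.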

Details

Motivation: Traditional GNNs underperform on TAGs because of the complex text attached to nodes, and existing methods need extensive annotation or fine-tuning, which is costly. Method: GAGA annotates representative nodes and edges to build an annotation graph, then integrates it with the TAG through a two-level alignment module. Result: With only 1% of the data annotated, GAGA matches or surpasses the classification accuracy of existing methods. Conclusion: GAGA is an efficient and high-performing framework for TAG representation learning. Abstract: In the realm of Text-attributed Graphs (TAGs), traditional graph neural networks (GNNs) often fall short due to the complex textual information associated with each node. Recent methods have improved node representations by leveraging large language models (LLMs) to enhance node text features, but these approaches typically require extensive annotations or fine-tuning across all nodes, which is both time-consuming and costly. To overcome these challenges, we introduce GAGA, an efficient framework for TAG representation learning. GAGA reduces annotation time and cost by focusing on annotating only representative nodes and edges. It constructs an annotation graph that captures the topological relationships among these annotations. Furthermore, GAGA employs a two-level alignment module to effectively integrate the annotation graph with the TAG, aligning their underlying structures. Experiments show that GAGA achieves classification accuracies on par with or surpassing state-of-the-art methods while requiring only 1% of the data to be annotated, demonstrating its high efficiency.

[402] When Style Breaks Safety: Defending Language Models Against Superficial Style Alignment

Yuxin Xiao,Sana Tonekaboni,Walter Gerych,Vinith Suriyakumar,Marzyeh Ghassemi

Main category: cs.LG

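TL;DR: Finds that prompts with specific style patterns (such as list formatting) raise the success rate of jailbreak attacks on large language models (LLMs), and that the increase correlates with the length of the style pattern and the attention the model pays to it. The authors propose SafeStyle, a defense strategy that effectively reduces this risk.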

Details

Motivation: Investigate whether style patterns in prompts affect LLM safety and how these risks can be mitigated during alignment. Method: Evaluates 32 LLMs on seven jailbreak benchmarks, analyzes how style alignment affects model vulnerability, and proposes the SafeStyle defense strategy. Result: Style patterns markedly raise jailbreak attack success rates, with the increase correlated with style length and model attention; SafeStyle outperforms baselines across multiple style settings. Conclusion: Style patterns threaten LLM safety, and SafeStyle is an effective defense. Abstract: Large language models (LLMs) can be prompted with specific styles (e.g., formatting responses as lists), including in jailbreak queries. Although these style patterns are semantically unrelated to the malicious intents behind jailbreak queries, their safety impact remains unclear. In this work, we seek to understand whether style patterns compromise LLM safety, how superficial style alignment increases model vulnerability, and how best to mitigate these risks during alignment. We evaluate 32 LLMs across seven jailbreak benchmarks, and find that malicious queries with style patterns inflate the attack success rate (ASR) for nearly all models. Notably, ASR inflation correlates with both the length of style patterns and the relative attention an LLM exhibits on them. We then investigate superficial style alignment, and find that fine-tuning with specific styles makes LLMs more vulnerable to jailbreaks of those same styles. Finally, we propose SafeStyle, a defense strategy that incorporates a small amount of safety training data augmented to match the distribution of style patterns in the fine-tuning data. Across three LLMs and five fine-tuning style settings, SafeStyle consistently outperforms baselines in maintaining LLM safety.

[403] Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models

Mickel Liu,Liwei Jiang,Yancheng Liang,Simon Shaolei Du,Yejin Choi,Tim Althoff,Natasha Jaques

Main category: cs.LG

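TL;DR: Proposes Self-RedTeam, an online self-play reinforcement learning algorithm that improves language model safety through dynamic interaction between attacker and defender roles.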

Details

Motivation: Conventional safety alignment for language models lags behind: attackers overfit to a static defender, while defenders struggle to keep up with new threats. Method: Casts alignment as a two-player zero-sum game in which one model alternates between attacker and defender roles and the two co-evolve through self-play. Result: Experiments show that Self-RedTeam uncovers more diverse attacks (+21.8% SBERT) and performs better on safety benchmarks (e.g., +65.5% on WildJailBreak). Conclusion: The study argues for a shift from reactive patching to proactive co-evolution, using multi-agent reinforcement learning for autonomous and robust improvement of language models. Abstract: Conventional language model (LM) safety alignment relies on a reactive, disjoint procedure: attackers exploit a static model, followed by defensive fine-tuning to patch exposed vulnerabilities. This sequential approach creates a mismatch -- attackers overfit to obsolete defenses, while defenders perpetually lag behind emerging threats. To address this, we propose Self-RedTeam, an online self-play reinforcement learning algorithm where an attacker and defender agent co-evolve through continuous interaction. We cast safety alignment as a two-player zero-sum game, where a single model alternates between attacker and defender roles -- generating adversarial prompts and safeguarding against them -- while a reward LM adjudicates outcomes. This enables dynamic co-adaptation. Grounded in the game-theoretic framework of zero-sum games, we establish a theoretical safety guarantee which motivates the design of our method: if self-play converges to a Nash Equilibrium, the defender will reliably produce safe responses to any adversarial input. Empirically, Self-RedTeam uncovers more diverse attacks (+21.8% SBERT) compared to attackers trained against static defenders and achieves higher robustness on safety benchmarks (e.g., +65.5% on WildJailBreak) than defenders trained against static attackers. We further propose hidden Chain-of-Thought, allowing agents to plan privately, which boosts adversarial diversity and reduces over-refusals. Our results motivate a shift from reactive patching to proactive co-evolution in LM safety training, enabling scalable, autonomous, and robust self-improvement of LMs via multi-agent reinforcement learning (MARL).

[404] Graph-of-Causal Evolution: Challenging Chain-of-Model for Reasoning

Libo Wang

Main category: cs.LG

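TL;DR: Proposes graph of causal evolution (GoCE) to address the loss of long-range dependencies in chain-of-model (CoM), where each subchain relies only on preceding information. GoCE maps implicit token representations to a differentiable sparse causal adjacency matrix and combines causal-masked attention with causal-MoE to balance causal structure learning against adaptive updating of the transformer architecture. Experiments show that GoCE captures long-range causal dependencies better than baseline LLMs.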

Details

Motivation: Address the loss of global context caused by the causal mask in chain-of-model, and improve the model's ability to capture long-range causal dependencies. Method: Proposes GoCE, which maps implicit token representations to a differentiable sparse causal adjacency matrix, combines causal-masked attention with causal-MoE, and reaches a dynamic balance through an intervention consistency loss test and a self-evolution gate. Result: Experiments show that GoCE outperforms baseline LLMs on several public datasets, with clear gains in capturing long-range causal dependencies and in self-evolution. Conclusion: GoCE goes beyond the design principles of CoM and provides experience for future work on causal learning and continuous adaptive improvement. Abstract: In the chain-of-model (CoM), each subchain relies only on the information of the previous subchain and may lose long-range dependencies, because the causal mask blocks the flow of global context between multi-level subchains. To address this problem, this work proposes a graph of causal evolution (GoCE). Its core principle is to map the implicit token representation into a differentiable and sparse causal adjacency matrix, then permeate causal constraints through each layer of calculation using causal-masked attention and causal-MoE. By combining intervention consistency loss test and self-evolution gate, the dynamic balance between causal structure learning and adaptive updating of transformer architecture is realized. The researcher set up experimental environments in sandboxes built with Claude Sonnet 4, o4-mini-high, and DeepSeek R1 respectively, using the transformer variant architecture introduced in GoCE. It is evaluated on publicly available datasets including CLUTRR, CLADDER, EX-FEVER, and CausalQA and compared with the baseline LLMs. The findings show that GoCE strengthens the transformer's ability to capture long-range causal dependencies, while the ability to self-evolve is improved. It not only surpasses CoM in terms of design principles, but also provides experience for future research on causal learning and continuous adaptive improvement.

[405] ChemAgent: Enhancing LLMs for Chemistry and Materials Science through Tree-Search Based Tool Learning

Mengsong Wu,YaFei Wang,Yidong Ming,Yuqi An,Yuwei Wan,Wenliang Chen,Binbin Lin,Yuqiang Li,Tong Xie,Dongzhan Zhou

Main category: cs.LG

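TL;DR: Proposes an LLM-based chemistry agent that integrates external chemical tools and the ChemToolBench dataset, and optimizes tool planning and execution with the HE-MCTS framework, significantly improving performance on chemistry QA and discovery tasks.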

Details

Motivation: Address the challenges LLMs face in chemistry tasks due to outdated pretraining knowledge and the difficulty of incorporating specialized chemical expertise. Method: Proposes an LLM agent that integrates 137 external chemical tools, creates the ChemToolBench dataset, and uses the HE-MCTS framework to optimize tool planning and execution. Result: Experiments show that the approach significantly improves performance on chemistry QA and discovery tasks, surpassing GPT-4o. Conclusion: The approach offers an effective way to integrate specialized chemical tools with LLMs for advanced chemical applications. Abstract: Large language models (LLMs) have recently demonstrated promising capabilities in chemistry tasks while still facing challenges due to outdated pretraining knowledge and the difficulty of incorporating specialized chemical expertise. To address these issues, we propose an LLM-based agent that synergistically integrates 137 external chemical tools, ranging from basic information retrieval to complex reaction predictions, and a dataset curation pipeline to generate the dataset ChemToolBench, which facilitates both effective tool selection and precise parameter filling during fine-tuning and evaluation. We introduce a Hierarchical Evolutionary Monte Carlo Tree Search (HE-MCTS) framework, enabling independent optimization of tool planning and execution. By leveraging self-generated data, our approach supports step-level fine-tuning (FT) of the policy model and training task-adaptive PRM and ORM that surpass GPT-4o. Experimental evaluations demonstrate that our approach significantly improves performance in Chemistry QA and discovery tasks, offering a robust solution to integrate specialized tools with LLMs for advanced chemical applications. All datasets and code are available at https://github.com/AI4Chem/ChemistryAgent.

[406] E-LDA: Toward Interpretable LDA Topic Models with Strong Guarantees in Logarithmic Parallel Time

Adam Breuer

Main category: cs.LG

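TL;DR: Presents a novel non-gradient-based, combinatorial approach for inferring the topics assigned to each document in an LDA topic model, with logarithmic parallel computation time and interpretability guarantees.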

Details

Motivation: Solve the primary inference problem for LDA topic models in applications such as social science, data exploration, and causal inference, with algorithms that are both efficient and interpretable. Method: Estimates topic models with a non-gradient-based, combinatorial approach that converges in logarithmic parallel computation time. Result: The algorithms return solutions of higher semantic quality than existing LDA algorithms, neural topic models, and LLM-based topic models, and can maintain the independence assumptions needed for causal inference. Conclusion: The approach gives LDA topic models an efficient, interpretable solution suited to downstream causal inference. Abstract: In this paper, we provide the first practical algorithms with provable guarantees for the problem of inferring the topics assigned to each document in an LDA topic model. This is the primary inference problem for many applications of topic models in social science, data exploration, and causal inference settings. We obtain this result by showing a novel non-gradient-based, combinatorial approach to estimating topic models. This yields algorithms that converge to near-optimal posterior probability in logarithmic parallel computation time (adaptivity) -- exponentially faster than any known LDA algorithm. We also show that our approach can provide interpretability guarantees such that each learned topic is formally associated with a known keyword. Finally, we show that unlike alternatives, our approach can maintain the independence assumptions necessary to use the learned topic model for downstream causal inference methods that allow researchers to study topics as treatments. In terms of practical performance, our approach consistently returns solutions of higher semantic quality than solutions from state-of-the-art LDA algorithms, neural topic models, and LLM-based topic models across a diverse range of text datasets and evaluation parameters.

[407] Improving large language models with concept-aware fine-tuning

Michael K. Chen,Xikun Zhang,Jiaxing Huang,Dacheng Tao

Main category: cs.LG

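TL;DR: Proposes Concept-Aware Fine-Tuning (CAFT), a multi-token training method that improves the concept understanding of large language models (LLMs) and significantly outperforms conventional next-token fine-tuning.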

Details

Motivation: The next-token prediction paradigm limits how coherently LLMs grasp high-level concepts, which hinders the development of truly intelligent systems. Method: Introduces Concept-Aware Fine-Tuning (CAFT), a multi-token training method that lets the model learn sequences spanning multiple tokens. Result: Experiments show that CAFT outperforms conventional methods on tasks such as text summarization and protein design, and it is the first to bring multi-token prediction to the post-training phase. Conclusion: CAFT improves model performance and has broader implications for the machine learning research community. Abstract: Large language models (LLMs) have become the cornerstone of modern AI. However, the existing paradigm of next-token prediction fundamentally limits their ability to form coherent, high-level concepts, making it a critical barrier to human-like understanding and reasoning. Take the phrase "ribonucleic acid" as an example: an LLM will first decompose it into tokens, i.e., artificial text fragments ("rib", "on", ...), then learn each token sequentially, rather than grasping the phrase as a unified, coherent semantic entity. This fragmented representation hinders deeper conceptual understanding and, ultimately, the development of truly intelligent systems. In response, we introduce Concept-Aware Fine-Tuning (CAFT), a novel multi-token training method that redefines how LLMs are fine-tuned. By enabling the learning of sequences that span multiple tokens, this method fosters stronger concept-aware learning. Our experiments demonstrate significant improvements compared to conventional next-token fine-tuning methods across diverse tasks, including traditional applications like text summarization and domain-specific ones like de novo protein design. Multi-token prediction was previously only possible in the prohibitively expensive pretraining phase; CAFT, to our knowledge, is the first to bring the multi-token setting to the post-training phase, thus effectively democratizing its benefits for the broader community of practitioners and researchers. Finally, the unexpected effectiveness of our proposed method suggests wider implications for the machine learning research community. All code and data are available at https://github.com/michaelchen-lab/caft-llm

[408] Uncovering the Functional Roles of Nonlinearity in Memory

Manuel Brenner,Georgia Koppe

Main category: cs.LG

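TL;DR: Finds that in sequence modeling tasks, minimal nonlinearity is often both sufficient and optimal, yielding simpler models that are more robust and more interpretable.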

Details

Motivation: Examine the functional role of nonlinearity in recurrent networks, identifying when it is computationally necessary and what mechanisms it enables. Method: Uses Almost Linear Recurrent Neural Networks (AL-RNNs) as a modeling tool that allows fine-grained control over nonlinearity. Result: Across a range of tasks, minimal nonlinearity is not only sufficient but often optimal, and the resulting models are simpler, more robust, and more interpretable. Conclusion: Provides a principled framework for selectively introducing nonlinearity, connecting dynamical systems theory with the functional demands of recurrent networks. Abstract: Memory and long-range temporal processing are core requirements for sequence modeling tasks across natural language processing, time-series forecasting, speech recognition, and control. While nonlinear recurrence has long been viewed as essential for enabling such mechanisms, recent work suggests that linear dynamics may often suffice. In this study, we go beyond performance comparisons to systematically dissect the functional role of nonlinearity in recurrent networks, identifying both when it is computationally necessary and what mechanisms it enables. We use Almost Linear Recurrent Neural Networks (AL-RNNs), which allow fine-grained control over nonlinearity, as both a flexible modeling tool and a probe into the internal mechanisms of memory. Across a range of classic sequence modeling tasks and a real-world stimulus selection task, we find that minimal nonlinearity is not only sufficient but often optimal, yielding models that are simpler, more robust, and more interpretable than their fully nonlinear or linear counterparts. Our results provide a principled framework for selectively introducing nonlinearity, bridging dynamical systems theory with the functional demands of long-range memory and structured computation in recurrent neural networks, with implications for both artificial and biological neural systems.

[409] HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization

Hongzheng Chen,Yingheng Wang,Yaohui Cai,Hins Hu,Jiajie Li,Shirley Huang,Chenhui Deng,Rongjian Liang,Shufeng Kong,Haoxing Ren,Samitha Samaranayake,Carla P. Gomes,Zhiru Zhang

Main category: cs.LG

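TL;DR: HeuriGym is a framework for evaluating how well LLMs generate heuristic algorithms for combinatorial optimization problems. Models refine their heuristics iteratively from code execution feedback, performance is quantified with the proposed QYI metric, and current models still fall well short of expert level.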

Details

Motivation: Existing evaluation methods do not adequately measure LLM capabilities in reasoning and problem solving, so a more rigorous evaluation framework is needed. Method: Introduces HeuriGym, in which LLMs propose heuristic algorithms and refine them iteratively from code execution feedback, with performance quantified by the QYI metric. Result: Testing nine state-of-the-art models on nine problems exposes limitations in tool use, planning, and adaptive reasoning; the top QYI score is only 0.6. Conclusion: HeuriGym provides a more effective benchmark for evaluating how well LLMs solve practical problems in science and engineering. Abstract: While Large Language Models (LLMs) have demonstrated significant advancements in reasoning and agent-based problem-solving, current evaluation methodologies fail to adequately assess their capabilities: existing benchmarks either rely on closed-ended questions prone to saturation and memorization, or subjective comparisons that lack consistency and rigor. In this work, we introduce HeuriGym, an agentic framework designed for evaluating heuristic algorithms generated by LLMs for combinatorial optimization problems, characterized by clearly defined objectives and expansive solution spaces. HeuriGym empowers LLMs to propose heuristics, receive evaluative feedback via code execution, and iteratively refine their solutions. We evaluate nine state-of-the-art models on nine problems across domains such as computer systems, logistics, and biology, exposing persistent limitations in tool use, planning, and adaptive reasoning. To quantify performance, we propose the Quality-Yield Index (QYI), a metric that captures both solution pass rate and quality. Even top models like GPT-o4-mini-high and Gemini-2.5-Pro attain QYI scores of only 0.6, well below the expert baseline of 1. Our open-source benchmark aims to guide the development of LLMs toward more effective and realistic problem-solving in scientific and engineering domains.

[410] Reparameterized LLM Training via Orthogonal Equivalence Transformation

Zeju Qiu,Simon Buchholz,Tim Z. Xiao,Maximilian Dax,Bernhard Schölkopf,Weiyang Liu

Main category: cs.LG

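TL;DR: POET is a new reparameterized training algorithm that optimizes neurons through an orthogonal equivalence transformation, improving the training and generalization of large language models.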

Details

Motivation: Large language models (LLMs) are difficult to train, and more effective and stable methods are needed. Method: POET reparameterizes each neuron with two learnable orthogonal matrices and a fixed random weight matrix, preserving the spectral properties of the weight matrices. Result: Experiments validate the effectiveness and scalability of POET for training LLMs. Conclusion: POET offers a stable and efficient approach to training large language models. Abstract: While large language models (LLMs) are driving the rapid advancement of artificial intelligence, effectively and reliably training these large models remains one of the field's most significant challenges. To address this challenge, we propose POET, a novel reParameterized training algorithm that uses Orthogonal Equivalence Transformation to optimize neurons. Specifically, POET reparameterizes each neuron with two learnable orthogonal matrices and a fixed random weight matrix. Because of its provable preservation of spectral properties of weight matrices, POET can stably optimize the objective function with improved generalization. We further develop efficient approximations that make POET flexible and scalable for training large-scale neural networks. Extensive experiments validate the effectiveness and scalability of POET in training LLMs.
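
The spectral-preservation claim can be checked numerically on a toy case. The sketch below assumes the reparameterization takes the form W = R · W0 · P, with R and P orthogonal and W0 fixed; the abstract does not spell out the multiplication order, so that form is an assumption. It uses 2x2 rotations as stand-ins for the learnable orthogonal matrices and is not the authors' implementation.

```python
import math


def rotation(theta):
    """2x2 rotation matrix, a simple example of an orthogonal matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def singular_values_2x2(m):
    """Closed-form singular values of a 2x2 matrix, largest first."""
    (a, b), (c, d) = m
    frob_sq = a * a + b * b + c * c + d * d   # equals s1^2 + s2^2
    det = abs(a * d - b * c)                  # equals s1 * s2
    disc = math.sqrt(max(frob_sq ** 2 - 4 * det ** 2, 0.0))
    s1 = math.sqrt((frob_sq + disc) / 2)
    s2 = math.sqrt(max((frob_sq - disc) / 2, 0.0))
    return s1, s2


W0 = [[0.8, -1.3], [0.4, 2.1]]   # fixed weight matrix (arbitrary values)
R = rotation(0.7)                # stand-in for one learnable orthogonal matrix
P = rotation(-1.9)               # stand-in for the other
W = matmul(matmul(R, W0), P)     # assumed form of the reparameterized weight

before = singular_values_2x2(W0)
after = singular_values_2x2(W)
```

Whatever angles are chosen for `R` and `P`, `before` and `after` agree up to floating-point error, which is the property the abstract credits for stable optimization.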

[411] CellCLIP -- Learning Perturbation Effects in Cell Painting via Text-Guided Contrastive Learning

Mingyu Lu,Ethan Weinberger,Chanwoo Kim,Su-In Lee

Main category: cs.LG

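TL;DR: CellCLIP is a cross-modal contrastive learning framework for high-content screening data. Using pre-trained image encoders and a novel channel encoding scheme, it markedly improves performance on cross-modal retrieval and biologically relevant tasks.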

Details

Motivation: High-content screening techniques such as Cell Painting make it possible to study cells' morphological responses to perturbations at scale, but existing methods struggle with the semantic gap between Cell Painting images and natural images and with the diversity of perturbation types. Method: CellCLIP combines pre-trained image encoders, a novel channel encoding scheme, and natural language encoders to build a unified latent space that aligns perturbations with their morphological effects. Result: CellCLIP performs best on cross-modal retrieval and downstream biological tasks while significantly reducing computation time. Conclusion: CellCLIP provides an efficient cross-modal learning method for high-content screening data that addresses existing challenges. Abstract: High-content screening (HCS) assays based on high-throughput microscopy techniques such as Cell Painting have enabled the interrogation of cells' morphological responses to perturbations at an unprecedented scale. The collection of such data promises to facilitate a better understanding of the relationships between different perturbations and their effects on cellular state. Towards achieving this goal, recent advances in cross-modal contrastive learning could, in theory, be leveraged to learn a unified latent space that aligns perturbations with their corresponding morphological effects. However, the application of such methods to HCS data is not straightforward due to substantial differences in the semantics of Cell Painting images compared to natural images, and the difficulty of representing different classes of perturbations (e.g., small molecule vs CRISPR gene knockout) in a single latent space. In response to these challenges, here we introduce CellCLIP, a cross-modal contrastive learning framework for HCS data. CellCLIP leverages pre-trained image encoders coupled with a novel channel encoding scheme to better capture relationships between different microscopy channels in image embeddings, along with natural language encoders for representing perturbations. Our framework outperforms current open-source models, demonstrating the best performance in both cross-modal retrieval and biologically meaningful downstream tasks while also achieving significant reductions in computation time.
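
For readers unfamiliar with cross-modal contrastive learning, the sketch below shows a standard CLIP-style symmetric loss over paired image and text embeddings. It illustrates the general technique only: the exact objective used by CellCLIP is not stated in the abstract, and the function names and the temperature value are assumptions.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy(logits, target):
    """Numerically stable cross-entropy of one logit row against one index."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i of each list is a matching pair."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    text_to_image = sum(cross_entropy([logits[i][j] for i in range(n)], j)
                        for j in range(n)) / n
    return (image_to_text + text_to_image) / 2


aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each perturbation description sits closest to its own image embedding (`aligned`) and large when the pairing is wrong (`shuffled`), which is what pulls matching pairs together in the shared latent space.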

[412] NeurNCD: Novel Class Discovery via Implicit Neural Representation

Junming Wang,Yi Shi

Main category: cs.LG

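TL;DR: NeurNCD is a novel framework that combines an Embedding-NeRF model with KL divergence to tackle novel class discovery in open-world settings, achieving superior segmentation performance without densely labelled data.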

Details

Motivation: Traditional explicit representations (such as 3D segmentation maps) are discrete and noise-prone, which limits the accuracy of novel class discovery. Method: Replaces traditional 3D segmentation maps with an Embedding-NeRF model and KL divergence, and combines feature query, feature modulation, and clustering for efficient feature augmentation and information exchange. Result: Significantly outperforms existing methods on the NYUv2 and Replica datasets, in both open- and closed-world settings. Conclusion: NeurNCD offers an efficient, data-friendly solution for novel class discovery in open-world settings. Abstract: Discovering novel classes in open-world settings is crucial for real-world applications. Traditional explicit representations, such as object descriptors or 3D segmentation maps, are constrained by their discrete, hole-prone, and noisy nature, which hinders accurate novel class discovery. To address these challenges, we introduce NeurNCD, the first versatile and data-efficient framework for novel class discovery that employs the meticulously designed Embedding-NeRF model combined with KL divergence as a substitute for traditional explicit 3D segmentation maps to aggregate semantic embedding and entropy in visual embedding space. NeurNCD also integrates several key components, including feature query, feature modulation and clustering, facilitating efficient feature augmentation and information exchange between the pre-trained semantic segmentation network and implicit neural representations. As a result, our framework achieves superior segmentation performance in both open and closed-world settings without relying on densely labelled datasets for supervised training or human interaction to generate sparse label supervision. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches on the NYUv2 and Replica datasets.

[413] Vision-QRWKV: Exploring Quantum-Enhanced RWKV Models for Image Classification

Chi-Sheng Chen

Main category: cs.LG

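TL;DR: The paper proposes Vision-QRWKV, a hybrid quantum-classical architecture for image classification that integrates a variational quantum circuit (VQC) to improve nonlinear feature transformation. Experiments show that the quantum-enhanced model outperforms its classical counterpart on most datasets.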

Details

Motivation: Quantum machine learning shows promise for complex, high-dimensional data; this work explores quantum-enhanced RWKV architectures for vision tasks. Method: A VQC is integrated into the channel mixing component of RWKV to build Vision-QRWKV, which is evaluated on 14 image classification benchmarks. Result: The quantum-enhanced model outperforms the classical model on most datasets, particularly those with noisy or subtle class distinctions. Conclusion: This is the first systematic application of quantum-enhanced RWKV in the visual domain, pointing to future directions for quantum models in lightweight, efficient vision tasks. Abstract: Recent advancements in quantum machine learning have shown promise in enhancing classical neural network architectures, particularly in domains involving complex, high-dimensional data. Building upon prior work in temporal sequence modeling, this paper introduces Vision-QRWKV, a hybrid quantum-classical extension of the Receptance Weighted Key Value (RWKV) architecture, applied for the first time to image classification tasks. By integrating a variational quantum circuit (VQC) into the channel mixing component of RWKV, our model aims to improve nonlinear feature transformation and enhance the expressive capacity of visual representations. We evaluate both classical and quantum RWKV models on a diverse collection of 14 medical and standard image classification benchmarks, including MedMNIST datasets, MNIST, and FashionMNIST. Our results demonstrate that the quantum-enhanced model outperforms its classical counterpart on a majority of datasets, particularly those with subtle or noisy class distinctions (e.g., ChestMNIST, RetinaMNIST, BloodMNIST). This study represents the first systematic application of quantum-enhanced RWKV in the visual domain, offering insights into the architectural trade-offs and future potential of quantum models for lightweight and efficient vision tasks.

[414] Non-Intrusive Load Monitoring Based on Image Load Signatures and Continual Learning

Olimjon Toirov,Wei Yu

Main category: cs.LG

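TL;DR: Proposes a non-intrusive load monitoring method that combines image load signatures with continual learning, significantly improving recognition accuracy.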

Details

Motivation: Traditional NILM methods suffer from poor feature robustness and insufficient model generalization under complex load combinations and application environments. Method: Multi-dimensional power signals are converted into image load signatures and classified with deep convolutional neural networks; self-supervised pre-training and continual online learning strategies are also introduced. Result: Experiments on high-sampling-rate load datasets show significant improvements in recognition accuracy. Conclusion: By combining image signatures with continual learning, the method improves NILM performance and adapts to newly emerging loads. Abstract: Non-Intrusive Load Monitoring (NILM) identifies the operating status and energy consumption of each electrical device in the circuit by analyzing the electrical signals at the bus, which is of great significance for smart power management. However, the complex and changeable load combinations and application environments lead to the challenges of poor feature robustness and insufficient model generalization of traditional NILM methods. To this end, this paper proposes a new non-intrusive load monitoring method that integrates "image load signature" and continual learning. This method converts multi-dimensional power signals such as current, voltage, and power factor into visual image load feature signatures, and combines deep convolutional neural networks to realize the identification and classification of multiple devices; at the same time, self-supervised pre-training is introduced to improve feature generalization, and continual online learning strategies are used to overcome model forgetting to adapt to the emergence of new loads. This paper conducts a large number of experiments on high-sampling-rate load datasets, and compares a variety of existing methods and model variants. The results show that the proposed method has achieved significant improvements in recognition accuracy.

[415] The OCR Quest for Generalization: Learning to recognize low-resource alphabets with model editing

Adrià Molina Rodríguez,Oriol Ramos Terrades,Josep Lladós

Main category: cs.LG

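TL;DR: The paper proposes using model editing to improve recognition of low-resource languages, markedly improving performance on new alphabets and in out-of-domain evaluation.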

Details

Motivation: Low-resource languages (such as ancient manuscripts and non-Western languages) are under-represented in recognition systems; the goal is to improve model generalization. Method: Uses model editing together with a domain merging strategy to improve low-resource learning, without relying on large-scale pretraining or prototyping. Result: Experiments show strong performance in transfer learning and out-of-domain evaluation, especially on historical ciphered texts and non-Latin scripts. Conclusion: The work offers a new approach to building models that adapt quickly to low-resource alphabets, broadening the reach of document recognition. Abstract: Achieving robustness in recognition systems across diverse domains is crucial for their practical utility. While ample data availability is usually assumed, low-resource languages, such as ancient manuscripts and non-western languages, tend to be kept out of the equations of massive pretraining and foundational techniques due to underrepresentation. In this work, we aim to build models which can generalize to new distributions of data, such as alphabets, faster than centralized fine-tuning strategies. To do so, we take advantage of the recent advancements in model editing to enhance the incorporation of unseen scripts (low-resource learning). In contrast to state-of-the-art meta-learning, we showcase the effectiveness of domain merging in sparse distributions of data, with agnosticity of its relation to the overall distribution or any other prototyping necessity. Even when using the same exact training data, our experiments showcase significant performance boosts in transfer learning to new alphabets and out-of-domain evaluation in challenging domain shifts, including historical ciphered texts and non-Latin scripts. This research contributes a novel approach to building models that can easily adopt under-represented alphabets and, therefore, extend document recognition to a wider set of contexts and cultures.

[416] Feature-Based Instance Neighbor Discovery: Advanced Stable Test-Time Adaptation in Dynamic World

Qinting Jiang,Chuyang Ye,Dongyan Wei,Bingli Wang,Yuan Xue,Jingyan Jiang,Zhi Wang

Main category: cs.LG

TL;DR: FIND uses feature disentanglement and adaptive normalization to markedly improve model performance under dynamic distribution shifts.

Details

Motivation: Deep neural networks degrade when training and test distributions differ, and existing TTA methods struggle with dynamic, multi-distribution scenarios. Method: Proposes FIND, comprising LFD, FABN, and S-FABN, which provide feature disentanglement, adaptive normalization, and selective optimization, respectively. Result: FIND improves accuracy by 30% in dynamic scenarios while maintaining computational efficiency. Conclusion: Through feature clustering and adaptive normalization, FIND effectively handles dynamic distribution shifts. Abstract: Despite progress, deep neural networks still suffer performance declines under distribution shifts between training and test domains, leading to a substantial decrease in Quality of Experience (QoE) for applications. Existing test-time adaptation (TTA) methods are challenged by dynamic, multiple test distributions within batches. We observe that feature distributions across different domains inherently cluster into distinct groups with varying means and variances. This divergence reveals a critical limitation of previous global normalization strategies in TTA, which inevitably distort the original data characteristics. Based on this insight, we propose Feature-based Instance Neighbor Discovery (FIND), which comprises three key components: Layer-wise Feature Disentanglement (LFD), Feature Aware Batch Normalization (FABN) and Selective FABN (S-FABN). LFD stably captures features with similar distributions at each layer by constructing graph structures. FABN optimally combines source statistics with test-time distribution-specific statistics for robust feature representation. Finally, S-FABN determines which layers require feature partitioning and which can remain unified, thereby enhancing inference efficiency. Extensive experiments demonstrate that FIND significantly outperforms existing methods, achieving a 30% accuracy improvement in dynamic scenarios while maintaining computational efficiency.
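
The idea of combining source statistics with group-specific test-time statistics can be sketched for a single feature channel as below. This is a simplified illustration, not the paper's FABN: the fixed mixing weight `alpha`, the function names, and the assumption that the feature group has already been found (the role LFD plays in the paper) are all choices made here.

```python
def mean_var(values):
    """Population mean and variance of a list of floats."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n
    return mu, var


def blended_normalize(group, source_mean, source_var, alpha=0.5, eps=1e-5):
    """Normalize one group of test features with mixed statistics.

    alpha = 1 uses only the stored source statistics (plain inference-time
    batch norm); alpha = 0 uses only the statistics of this test group.
    """
    test_mean, test_var = mean_var(group)
    mean = alpha * source_mean + (1 - alpha) * test_mean
    var = alpha * source_var + (1 - alpha) * test_var
    return [(v - mean) / (var + eps) ** 0.5 for v in group]


# Features from one discovered group whose mean has drifted away from the source.
shifted_group = [4.0, 5.0, 6.0]
source_only = blended_normalize(shifted_group, 0.0, 1.0, alpha=1.0)
group_only = blended_normalize(shifted_group, 0.0, 1.0, alpha=0.0)
mixed = blended_normalize(shifted_group, 0.0, 1.0, alpha=0.5)
```

With `alpha=1.0` the drifted group stays far from zero after normalization, while `alpha=0.0` re-centres it completely; intermediate values trade the stability of source statistics against adaptation to the group at hand.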

[417] FREE: Fast and Robust Vision Language Models with Early Exits

Divya Jyoti Bajpai,Manjesh Kumar Hanawal

Main category: cs.LG

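TL;DR: The paper proposes FREE, an adversarial training approach for implementing early exit (EE) strategies in vision-language models (VLMs), speeding up inference with minimal loss in performance.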

Details

Motivation: VLMs perform well on vision-language tasks, but their large size causes inference latency that limits real-world use. Method: Adversarial training within a GAN-based framework; each exit consists of a transformer layer and a classifier, trained so that its feature representations resemble those of the final layer, with input-adaptive inference to balance speed and performance. Result: Experiments show an inference speedup of more than 1.51x with comparable performance, while mitigating overthinking and the mid-crisis phenomenon. Conclusion: FREE improves the inference efficiency and robustness of VLMs and is suited to real-world applications. Abstract: In recent years, Vision-Language Models (VLMs) have shown remarkable performance improvements in Vision-Language tasks. However, their large size poses challenges for real-world applications where inference latency is a concern. To tackle this issue, we propose employing Early Exit (EE) strategies in VLMs. However, training exit classifiers in VLMs is challenging, particularly with limited labeled training data. To address this, we introduce FREE, an adversarial training approach within a GAN-based framework. Here, each exit consists of a transformer layer and a classifier. The transformer layer is adversarially trained to produce feature representations similar to the final layer, while a feature classifier serves as the discriminator. Our method focuses on performing input-adaptive inference that increases inference speed with minimal drop in performance. Experimental results demonstrate the effectiveness of our approach in enhancing accuracy and model robustness by mitigating overthinking and the phenomenon of mid-crisis that we highlight. We experimentally validate that our method speeds up the inference process by more than 1.51x while retaining comparable performance. The source code is available at https://github.com/Div290/FREE.
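
Input-adaptive inference with early exits is commonly implemented as a confidence check at each exit. The sketch below shows that generic pattern under stated assumptions: the softmax-confidence criterion and the 0.9 threshold are illustrative choices, and the abstract does not say which exit rule FREE uses.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(exit_logits, threshold=0.9):
    """Return (predicted class, index of the exit that was used).

    exit_logits is an iterable of logit vectors in depth order. Passing a
    generator means deeper exits are only computed when they are needed.
    """
    prediction, depth = None, -1
    for depth, logits in enumerate(exit_logits):
        probs = softmax(logits)
        confidence = max(probs)
        prediction = probs.index(confidence)
        if confidence >= threshold:
            break  # confident enough, skip the remaining layers
    return prediction, depth


# Logits from three exits for one input (made-up numbers).
easy_input = early_exit_predict([[0.1, 0.2], [5.0, 0.0], [0.0, 9.0]])
strict = early_exit_predict([[0.1, 0.2], [5.0, 0.0], [0.0, 9.0]], threshold=0.999)
```

With the default threshold the input leaves at the second exit; with a stricter threshold it runs through to the last one. Lowering the threshold buys speed at the risk of exiting on a wrong but confident intermediate prediction.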

[418] Rewriting the Budget: A General Framework for Black-Box Attacks Under Cost Asymmetry

Mahdi Salmani,Alireza Abdollahpoorrostam,Seyed-Mohsen Moosavi-Dezfooli

Main category: cs.LG

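TL;DR: This paper proposes a framework for decision-based black-box adversarial attacks under asymmetric query costs; by modifying the search strategy and the gradient estimation process, it significantly reduces total query cost and perturbation size.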

Details

Motivation: Existing methods assume that all queries cost the same, but in practice some queries may trigger additional review or penalties, making costs asymmetric. Effective algorithms for this setting are lacking. Method: Proposes Asymmetric Search (AS) and Asymmetric Gradient Estimation (AGREST), which reduce reliance on high-cost queries and shift the sampling distribution to improve attack efficiency. Result: Across various cost regimes, the method achieves lower total query cost and smaller perturbations than existing methods, with improvements of up to 40% in some settings. Conclusion: The framework can be integrated easily into existing black-box attacks and offers an efficient solution for asymmetric-cost settings. Abstract: Traditional decision-based black-box adversarial attacks on image classifiers aim to generate adversarial examples by slightly modifying input images while keeping the number of queries low, where each query involves sending an input to the model and observing its output. Most existing methods assume that all queries have equal cost. However, in practice, queries may incur asymmetric costs; for example, in content moderation systems, certain output classes may trigger additional review, enforcement, or penalties, making them more costly than others. While prior work has considered such asymmetric cost settings, effective algorithms for this scenario remain underdeveloped. In this paper, we propose a general framework for decision-based attacks under asymmetric query costs, which we refer to as asymmetric black-box attacks. We modify two core components of existing attacks: the search strategy and the gradient estimation process. Specifically, we propose Asymmetric Search (AS), a more conservative variant of binary search that reduces reliance on high-cost queries, and Asymmetric Gradient Estimation (AGREST), which shifts the sampling distribution to favor low-cost queries. We design efficient algorithms that minimize total attack cost by balancing different query types, in contrast to earlier methods such as stealthy attacks that focus only on limiting expensive (high-cost) queries. Our method can be integrated into a range of existing black-box attacks with minimal changes. We perform both theoretical analysis and empirical evaluation on standard image classification benchmarks. Across various cost regimes, our method consistently achieves lower total query cost and smaller perturbations than existing approaches, with improvements of up to 40% in some settings.
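
The notion of a "more conservative variant of binary search" can be illustrated with a one-dimensional boundary search whose split point is biased toward the cheap side. The sketch below is a guess at the flavour of the idea, not the paper's AS algorithm: the split rule `high_cost / (low_cost + high_cost)`, the function name, and the monotone `is_high_cost` oracle are all assumptions.

```python
def asymmetric_search(is_high_cost, lo, hi, low_cost=1.0, high_cost=5.0, tol=1e-3):
    """Locate a decision boundary in [lo, hi] while tracking total query cost.

    Assumes is_high_cost(t) is True for every t below the boundary and False
    at or above it, with lo on the high-cost side and hi on the low-cost side.
    """
    # Heuristic split: with equal costs this is 0.5, i.e. ordinary binary
    # search; the more expensive the costly outcome, the closer each query
    # sits to the known low-cost end.
    split = high_cost / (low_cost + high_cost)
    total_cost, expensive_queries = 0.0, 0
    while hi - lo > tol:
        t = lo + split * (hi - lo)
        if is_high_cost(t):
            lo = t
            total_cost += high_cost
            expensive_queries += 1
        else:
            hi = t
            total_cost += low_cost
    return hi, total_cost, expensive_queries


def boundary_oracle(t):
    """Toy oracle with the true boundary at 0.3 (made-up value)."""
    return t < 0.3


estimate, cost, n_expensive = asymmetric_search(boundary_oracle, 0.0, 1.0)
```

Each query shrinks the interval less than a midpoint query would on average, so more queries are needed overall, but a larger share of them land on the cheap side. Whether that lowers total cost depends on the cost ratio, which is the balance the paper sets out to optimize.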

[419] Towards Physics-informed Diffusion for Anomaly Detection in Trajectories

Arun Sharma,Mingzhou Yang,Majid Farhadloo,Subhankar Ghosh,Bharat Jayaprakash,Shashi Shekhar

Main category: cs.LG

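TL;DR: Proposes a physics-constrained diffusion model for detecting anomalous trajectories to counter GPS spoofing; experiments show higher accuracy and lower error rates than existing methods.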

Details

Motivation: Address GPS spoofing tied to illegal activities in international waters (such as unauthorized fishing and illicit oil transfers), while coping with AI-generated fake trajectories and scarce labeled data. Method: Proposes a physics-informed diffusion model that incorporates kinematic constraints to identify trajectories that violate physical laws. Result: Validated on real-world datasets in the maritime and urban domains, showing higher anomaly detection accuracy and lower trajectory generation error rates. Conclusion: The method addresses the neglect of fine-scale spatiotemporal dependencies and physical knowledge in existing approaches, and has practical potential. Abstract: Given trajectory data, a domain-specific study area, and a user-defined threshold, we aim to find anomalous trajectories indicative of possible GPS spoofing (e.g., fake trajectory). The problem is societally important to curb illegal activities in international waters, such as unauthorized fishing and illicit oil transfers. The problem is challenging due to advances in AI-generated deep fakes (e.g., additive noise, fake trajectories) and the lack of an adequate number of labeled samples for ground-truth verification. Recent literature shows promising results for anomalous trajectory detection using generative models despite data sparsity. However, they do not consider fine-scale spatiotemporal dependencies and prior physical knowledge, resulting in higher false-positive rates. To address these limitations, we propose a physics-informed diffusion model that integrates kinematic constraints to identify trajectories that do not adhere to physical laws. Experimental results on real-world datasets in the maritime and urban domains show that the proposed framework results in higher prediction accuracy and lower estimation error rate for anomaly detection and trajectory generation methods, respectively. Our implementation is available at https://github.com/arunshar/Physics-Informed-Diffusion-Probabilistic-Model.
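
A kinematic constraint of the kind mentioned above can be as simple as a speed limit between consecutive position fixes. The sketch below is a standalone rule check meant only to show what "not adhering to physical laws" can mean for a trajectory. It is not the paper's diffusion model, and the planar coordinates, the function name, and the 15 m/s limit are assumptions.

```python
import math


def kinematic_violations(track, max_speed):
    """Indices of fixes that imply an impossible move from the previous fix.

    track is a list of (time_s, x_m, y_m) tuples in planar coordinates.
    A fix is flagged if its timestamp does not advance or if the implied
    speed from the previous fix exceeds max_speed (metres per second).
    """
    flagged = []
    for i in range(1, len(track)):
        t0, x0, y0 = track[i - 1]
        t1, x1, y1 = track[i]
        dt = t1 - t0
        if dt <= 0:
            flagged.append(i)
            continue
        speed = math.hypot(x1 - x0, y1 - y0) / dt
        if speed > max_speed:
            flagged.append(i)
    return flagged


# A vessel moving at 10 m/s that "jumps" almost 5 km in 10 seconds.
track = [(0, 0.0, 0.0), (10, 100.0, 0.0), (20, 5000.0, 0.0), (30, 5100.0, 0.0)]
suspicious = kinematic_violations(track, max_speed=15.0)
```

Here only the third fix is flagged, since it would require roughly 490 m/s. A learned model can go further by scoring how plausible a whole trajectory is, but hard limits like this one are the prior physical knowledge that purely data-driven detectors ignore.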

[420] Advancing Multimodal Reasoning Capabilities of Multimodal Large Language Models via Visual Perception Reward

Tong Xiao,Xin Xu,Zhenya Huang,Hongyu Gao,Quan Liu,Qi Liu,Enhong Chen

Main category: cs.LG

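TL;DR: The paper proposes Perception-R1, which introduces a visual perception reward to strengthen the perception and reasoning abilities of multimodal large language models (MLLMs), addressing the failure of existing RLVR methods to improve multimodal perception.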

Details

Motivation: Existing RLVR methods applied to the multimodal domain overlook multimodal perception, a core prerequisite for complex multimodal reasoning. Method: Proposes Perception-R1, which collects visual annotations as reward references and uses a judging LLM to assess the consistency between MLLM responses and the annotations, assigning a visual perception reward accordingly. Result: On several multimodal reasoning benchmarks, Perception-R1 achieves state-of-the-art performance on most of them using only 1,442 training examples. Conclusion: Perception-R1 effectively improves the multimodal perception and reasoning abilities of MLLMs, filling a gap left by existing methods. Abstract: Enhancing the multimodal reasoning capabilities of Multimodal Large Language Models (MLLMs) is a challenging task that has attracted increasing attention in the community. Recently, several studies have applied Reinforcement Learning with Verifiable Rewards (RLVR) to the multimodal domain in order to enhance the reasoning abilities of MLLMs. However, these works largely overlook the enhancement of multimodal perception capabilities in MLLMs, which serve as a core prerequisite and foundational component of complex multimodal reasoning. Through McNemar's test, we find that existing RLVR methods fail to effectively enhance the multimodal perception capabilities of MLLMs, thereby limiting their further improvement in multimodal reasoning. To address this limitation, we propose Perception-R1, which introduces a novel visual perception reward that explicitly encourages MLLMs to perceive the visual content accurately, thereby effectively incentivizing both their multimodal perception and reasoning capabilities. Specifically, we first collect textual visual annotations from the CoT trajectories of multimodal problems, which will serve as visual references for reward assignment. During RLVR training, we employ a judging LLM to assess the consistency between the visual annotations and the responses generated by the MLLM, and assign the visual perception reward based on these consistency judgments. Extensive experiments on several multimodal reasoning benchmarks demonstrate the effectiveness of our Perception-R1, which achieves state-of-the-art performance on most benchmarks using only 1,442 training examples.

[421] Variational Supervised Contrastive Learning

Ziwen Wang,Jiajun Fan,Thao Nguyen,Heng Ji,Ge Liu

Main category: cs.LG

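TL;DR: VarCon improves contrastive learning through variational inference, addressing the lack of explicit regulation of the embedding distribution and the reliance on large numbers of negatives, and achieves SOTA performance on several datasets.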

Details

Motivation: Contrastive learning lacks explicit regulation of the embedding distribution and relies on large numbers of negatives. Method: Proposes VarCon, which reformulates supervised contrastive learning as variational inference over latent class variables and maximizes a posterior-weighted ELBO. Result: Achieves SOTA performance on CIFAR-10, CIFAR-100, ImageNet, and other datasets, with Top-1 accuracy of 79.36% on ImageNet-1K and 78.29% on CIFAR-100. Conclusion: VarCon excels in performance, semantic organization of the embedding space, and few-shot learning, and is robust to augmentation strategies. Abstract: Contrastive learning has proven to be highly efficient and adaptable in shaping representation spaces across diverse modalities by pulling similar samples together and pushing dissimilar ones apart. However, two key limitations persist: (1) Without explicit regulation of the embedding distribution, semantically related instances can inadvertently be pushed apart unless complementary signals guide pair selection, and (2) excessive reliance on large in-batch negatives and tailored augmentations hinders generalization. To address these limitations, we propose Variational Supervised Contrastive Learning (VarCon), which reformulates supervised contrastive learning as variational inference over latent class variables and maximizes a posterior-weighted evidence lower bound (ELBO) that replaces exhaustive pair-wise comparisons for efficient class-aware matching and grants fine-grained control over intra-class dispersion in the embedding space. Trained exclusively on image data, our experiments on CIFAR-10, CIFAR-100, ImageNet-100, and ImageNet-1K show that VarCon (1) achieves state-of-the-art performance for contrastive learning frameworks, reaching 79.36% Top-1 accuracy on ImageNet-1K and 78.29% on CIFAR-100 with a ResNet-50 encoder while converging in just 200 epochs; (2) yields substantially clearer decision boundaries and semantic organization in the embedding space, as evidenced by KNN classification, hierarchical clustering results, and transfer-learning assessments; and (3) demonstrates better few-shot learning performance than the supervised baseline and superior robustness across various augmentation strategies.

[422] Language Embedding Meets Dynamic Graph: A New Exploration for Neural Architecture Representation Learning

Haizhao Jing,Haokui Zhang,Zhenhao Shang,Rong Xiao,Peng Wang,Yanning Zhang

Main category: cs.LG

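TL;DR: LeDG-Former is a framework that combines language embedding with dynamic graph representation learning. It addresses the neglect of hardware attributes and the reliance on static topology in existing methods, enabling zero-shot prediction across hardware platforms and better neural architecture modeling.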

Details

Motivation: Current methods ignore hardware attribute information and rely on static adjacency matrices, limiting practical applicability and encoding effectiveness. Method: Proposes the LeDG-Former framework, which combines language embedding (processed by an LLM) with dynamic graph representation learning to give hardware and architectures a unified semantic representation. Result: Surpasses existing methods on the NNLQP benchmark, achieves cross-hardware latency prediction for the first time, and performs strongly on the NAS-Bench datasets. Conclusion: By combining language and dynamic graphs, LeDG-Former markedly improves the performance and scope of neural architecture representation learning. Abstract: Neural Architecture Representation Learning aims to transform network models into feature representations for predicting network attributes, playing a crucial role in deploying and designing networks for real-world applications. Recently, inspired by the success of transformers, transformer-based models integrated with Graph Neural Networks (GNNs) have achieved significant progress in representation learning. However, current methods still have some limitations. First, existing methods overlook hardware attribute information, which conflicts with the current trend of diversified deep learning hardware and limits the practical applicability of models. Second, current encoding approaches rely on static adjacency matrices to represent topological structures, failing to capture the structural differences between computational nodes, which ultimately compromises encoding effectiveness. In this paper, we introduce LeDG-Former, an innovative framework that addresses these limitations through the synergistic integration of language-based semantic embedding and dynamic graph representation learning. Specifically, inspired by large language models (LLMs), we propose a language embedding framework where both neural architectures and hardware platform specifications are projected into a unified semantic space through tokenization and LLM processing, enabling zero-shot prediction across different hardware platforms for the first time. Then, we propose a dynamic graph-based transformer for modeling neural architectures, resulting in improved neural architecture modeling performance. On the NNLQP benchmark, LeDG-Former surpasses previous methods, establishing a new SOTA while demonstrating the first successful cross-hardware latency prediction capability. Furthermore, our framework achieves superior performance on the cell-structured NAS-Bench-101 and NAS-Bench-201 datasets.

[423] Identifiable Object Representations under Spatial Ambiguities

Avinash Kori,Francesca Toni,Ben Glocker

Main category: cs.LG

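TL;DR: Proposes a multi-view probabilistic approach that aggregates view-specific slots to capture invariant content information while learning disentangled global viewpoint information, resolving spatial ambiguities.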

Details

Motivation: Modular object-centric representations are essential for human-like reasoning but are hard to obtain under spatial ambiguities such as occlusions and view ambiguities. Method: Introduces a multi-view probabilistic approach that aggregates view-specific slots to capture invariant content information and learns disentangled global viewpoint information, without viewpoint annotations. Result: Experiments on standard benchmarks and novel complex datasets validate the method's robustness and scalability. Conclusion: The method resolves spatial ambiguities, provides theoretical guarantees for identifiability, and requires no viewpoint annotations. Abstract: Modular object-centric representations are essential for human-like reasoning but are challenging to obtain under spatial ambiguities, e.g. due to occlusions and view ambiguities. However, addressing these challenges presents both theoretical and practical difficulties. We introduce a novel multi-view probabilistic approach that aggregates view-specific slots to capture invariant content information while simultaneously learning disentangled global viewpoint-level information. Unlike prior single-view methods, our approach resolves spatial ambiguities, provides theoretical guarantees for identifiability, and requires no viewpoint annotations. Extensive experiments on standard benchmarks and novel complex datasets validate our method's robustness and scalability.

[424] Diffusion Counterfactual Generation with Semantic Abduction

Rajat Rasal,Avinash Kori,Fabio De Sousa Ribeiro,Tian Xia,Ben Glocker

Main category: cs.LG

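TL;DR: The paper proposes a diffusion-based framework for causal image editing that uses spatial, semantic, and dynamic abduction to generate high-quality counterfactual images, addressing the scalability and fidelity limits of existing methods.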

Details

Motivation: Existing auto-encoding frameworks struggle with scalability and fidelity in counterfactual image generation; diffusion models, with their strong visual quality and perceptual capabilities, offer room for improvement. Method: Proposes a suite of diffusion-based causal mechanisms grounded in Pearlian causality, editing images via spatial, semantic, and dynamic abduction. Result: The first work to consider high-level semantic identity preservation for diffusion counterfactuals, showing how semantic control trades off causal control against identity preservation. Conclusion: The framework offers a new solution for counterfactual image editing and demonstrates the potential of diffusion models for causal reasoning. Abstract: Counterfactual image generation presents significant challenges, including preserving identity, maintaining perceptual quality, and ensuring faithfulness to an underlying causal model. While existing auto-encoding frameworks admit semantic latent spaces which can be manipulated for causal control, they struggle with scalability and fidelity. Advancements in diffusion models present opportunities for improving counterfactual image editing, having demonstrated state-of-the-art visual quality, human-aligned perception and representation learning capabilities. Here, we present a suite of diffusion-based causal mechanisms, introducing the notions of spatial, semantic and dynamic abduction. We propose a general framework that integrates semantic representations into diffusion models through the lens of Pearlian causality to edit images via a counterfactual reasoning process. To our knowledge, this is the first work to consider high-level semantic identity preservation for diffusion counterfactuals and to demonstrate how semantic control enables principled trade-offs between faithful causal control and identity preservation.

[425] Diffuse Everything: Multimodal Diffusion Models on Arbitrary State Spaces

Kevin Rojas,Yuchen Zhu,Sichen Zhu,Felix X. -F. Ye,Molei Tao

Main category: cs.LG

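TL;DR: Proposes a new multimodal diffusion framework that natively generates coupled data across modalities without relying on external preprocessing protocols.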

Details

Motivation: Existing methods rely on external preprocessing protocols (such as tokenizers and variational autoencoders) to unify multimodal data representations, which is problematic for applications with limited data. Method: Introduces a decoupled noise schedule for each modality, enabling both unconditional and modality-conditioned generation. Result: Achieves competitive performance on text-image generation and mixed-type tabular data synthesis. Conclusion: The framework offers a flexible and efficient solution for multimodal data generation. Abstract: Diffusion models have demonstrated remarkable performance in generating unimodal data across various tasks, including image, video, and text generation. In contrast, the joint generation of multimodal data through diffusion models is still in the early stages of exploration. Existing approaches heavily rely on external preprocessing protocols, such as tokenizers and variational autoencoders, to harmonize varied data representations into a unified, unimodal format. This process places heavy demands on the accuracy of encoders and decoders, which can be problematic for applications with limited data. To lift this restriction, we propose a novel framework for building multimodal diffusion models on arbitrary state spaces, enabling native generation of coupled data across different modalities. By introducing an innovative decoupled noise schedule for each modality, we enable both unconditional and modality-conditioned generation within a single model simultaneously. We empirically validate our approach for text-image generation and mixed-type tabular data synthesis, demonstrating that it achieves competitive performance.

[426] Generative Modeling of Weights: Generalization or Memorization?

Boya Zeng,Yida Yin,Zhiqiu Xu,Zhuang Liu

Main category: cs.LG

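TL;DR: Generative models excel at image and video generation, but when used to generate neural network weights, current methods largely memorize training checkpoints and fail to produce novel, high-performing weights.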

Details

Motivation: Explore the potential of generative models for synthesizing neural network weights and assess whether they can produce novel, high-performing weights. Method: Examines four representative methods, analyzes their ability to generate weights, and compares them with simple baselines such as adding noise or weight ensembling. Result: Existing methods generate weights mainly by memorizing training checkpoints and do not outperform the simple baselines; changing modeling factors or applying data augmentation does not effectively mitigate the memorization. Conclusion: The study exposes the limits of generative models in a new domain and underlines the importance of careful evaluation. Abstract: Generative models, with their success in image and video generation, have recently been explored for synthesizing effective neural network weights. These approaches take trained neural network checkpoints as training data, and aim to generate high-performing neural network weights during inference. In this work, we examine four representative methods on their ability to generate novel model weights, i.e., weights that are different from the checkpoints seen during training. Surprisingly, we find that these methods synthesize weights largely by memorization: they produce either replicas, or at best simple interpolations, of the training checkpoints. Current methods fail to outperform simple baselines, such as adding noise to the weights or taking a simple weight ensemble, in obtaining different and simultaneously high-performing models. We further show that this memorization cannot be effectively mitigated by modifying modeling factors commonly associated with memorization in image diffusion models, or applying data augmentations. Our findings provide a realistic assessment of what types of data current generative models can model, and highlight the need for more careful evaluation of generative models in new domains. Our code is available at https://github.com/boyazeng/weight_memorization.

q-fin.ST [Back]

[427] The Hype Index: an NLP-driven Measure of Market News Attention

Zheng Cao,Wanchaloem Wunkaew,Helyette Geman

Main category: q-fin.ST

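TL;DR: This paper proposes the Hype Index, a new metric that quantifies media attention toward large-cap stocks, using NLP to extract predictive signals from financial news. Two versions of the index are built from S&P 100 data, and their usefulness for market analysis is validated.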

Details

Motivation: Quantify media attention toward stocks to provide a new tool for financial analysis and market prediction. Method: Constructs the News Count-Based Hype Index and the Capitalization Adjusted Hype Index, and evaluates them along several dimensions, including classification, association with returns, and volatility analysis. Result: The Hype Index family provides valuable tools for stock volatility analysis, market signaling, and NLP applications in finance. Conclusion: The Hype Index is an effective way to quantify media attention and has broad potential for application. Abstract: This paper introduces the Hype Index as a novel metric to quantify media attention toward large-cap equities, leveraging advances in Natural Language Processing (NLP) for extracting predictive signals from financial news. Using the S&P 100 as the focus universe, we first construct a News Count-Based Hype Index, which measures relative media exposure by computing the share of news articles referencing each stock or sector. We then extend it to the Capitalization Adjusted Hype Index, which adjusts for economic size by taking the ratio of a stock's or sector's media weight to its market capitalization weight within its industry or sector. We compute both versions of the Hype Index at the stock and sector levels, and evaluate them through multiple lenses: (1) their classification into different hype groups, (2) their associations with returns, volatility, and VIX index at various lags, (3) their signaling power for short-term market movements, and (4) their empirical properties including correlations, samplings, and trends. Our findings suggest that the Hype Index family provides a valuable set of tools for stock volatility analysis, market signaling, and NLP extensions in Finance.
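
The two index definitions in the abstract translate directly into a few lines of arithmetic. The sketch below follows the wording of the abstract (share of articles, then media weight divided by market-cap weight); the function names, the placeholder tickers, and the choice to normalize over whatever universe is passed in are assumptions, and details such as time windows or article de-duplication are not covered.

```python
def news_count_hype(article_counts):
    """News Count-Based Hype Index: each name's share of all articles."""
    total = sum(article_counts.values())
    return {name: count / total for name, count in article_counts.items()}


def cap_adjusted_hype(article_counts, market_caps):
    """Capitalization Adjusted Hype Index: media weight / market-cap weight.

    Both weights are computed within the universe given by article_counts,
    which the caller would restrict to one industry or sector.
    """
    media_weight = news_count_hype(article_counts)
    total_cap = sum(market_caps[name] for name in article_counts)
    return {name: media_weight[name] / (market_caps[name] / total_cap)
            for name in article_counts}


# Hypothetical three-stock sector: article counts and market caps (any unit).
counts = {"AAA": 60, "BBB": 30, "CCC": 10}
caps = {"AAA": 500.0, "BBB": 300.0, "CCC": 200.0}

share = news_count_hype(counts)
adjusted = cap_adjusted_hype(counts, caps)
```

In this toy sector AAA gets 60% of the coverage against 50% of the capitalization, so its adjusted index is about 1.2, while CCC is under-covered at about 0.5. A value of 1 means media attention is in line with economic size.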

Table of Contents

cs.CV [Back]

[1] Facial Foundational Model Advances Early Warning of Coronary Artery Disease from Live Videos with DigitalShadow

[2] Exploring Adversarial Watermarking in Transformer-Based Models: Transferability and Robustness Against Defense Mechanism for Medical Images

[3] (LiFT) Lightweight Fitness Transformer: A language-vision model for Remote Monitoring of Physical Training

[4] GS4: Generalizable Sparse Splatting Semantic SLAM

[5] Bridging Audio and Vision: Zero-Shot Audiovisual Segmentation by Connecting Pretrained Models

[6] Securing Traffic Sign Recognition Systems in Autonomous Vehicles

[7] Textile Analysis for Recycling Automation using Transfer Learning and Zero-Shot Foundation Models

[8] A Deep Learning Approach for Facial Attribute Manipulation and Reconstruction in Surveillance and Reconnaissance

[9] EV-LayerSegNet: Self-supervised Motion Segmentation using Event Cameras

[10] RARL: Improving Medical VLM Reasoning and Generalization with Reinforcement Learning and LoRA under Data and Hardware Constraints

[11] Zero Shot Composed Image Retrieval

[12] PhysLab: A Benchmark Dataset for Multi-Granularity Visual Parsing of Physics Experiments

[13] Dark Channel-Assisted Depth-from-Defocus from a Single Image

[14] Parametric Gaussian Human Model: Generalizable Prior for Efficient and Realistic Human Avatar Modeling

[15] Flood-DamageSense: Multimodal Mamba with Multitask Learning for Building Flood Damage Assessment using SAR Remote Sensing Imagery

[16] Interpretation of Deep Learning Model in Embryo Selection for In Vitro Fertilization (IVF) Treatment

[17] A Systematic Investigation on Deep Learning-Based Omnidirectional Image and Video Super-Resolution

[18] Active Contour Models Driven by Hyperbolic Mean Curvature Flow for Image Segmentation

[19] Improving Wildlife Out-of-Distribution Detection: Africas Big Five

[20] Mitigating Object Hallucination via Robust Local Perception Search

[21] RecipeGen: A Step-Aligned Multimodal Benchmark for Real-World Recipe Generation

[22] THU-Warwick Submission for EPIC-KITCHEN Challenge 2025: Semi-Supervised Video Object Segmentation

[23] SAR2Struct: Extracting 3D Semantic Structural Representation of Aircraft Targets from Single-View SAR Image

[24] LitMAS: A Lightweight and Generalized Multi-Modal Anti-Spoofing Framework for Biometric Security

[25] LoopDB: A Loop Closure Dataset for Large Scale Simultaneous Localization and Mapping

[26] Continuous-Time SO(3) Forecasting with Savitzky--Golay Neural Controlled Differential Equations

[27] Training-Free Identity Preservation in Stylized Image Generation Using Diffusion Models

[28] Stepwise Decomposition and Dual-stream Focus: A Novel Approach for Training-free Camouflaged Object Segmentation

[29] Hi-LSplat: Hierarchical 3D Language Gaussian Splatting

[30] Exploring Visual Prompting: Robustness Inheritance and Beyond

[31] Controllable Coupled Image Generation via Diffusion Models

[32] EndoARSS: Adapting Spatially-Aware Foundation Model for Efficient Activity Recognition and Semantic Segmentation in Endoscopic Surgery

[33] Harnessing Vision-Language Models for Time Series Anomaly Detection

[34] Multi-StyleGS: Stylizing Gaussian Splatting with Multiple Styles

[35] Deep Inertial Pose: A deep learning approach for human pose estimation

[36] Position Prediction Self-Supervised Learning for Multimodal Satellite Imagery Semantic Segmentation

[37] DONUT: A Decoder-Only Model for Trajectory Prediction

[38] Vision-EKIPL: External Knowledge-Infused Policy Learning for Visual Reasoning

[39] Face recognition on point cloud with cgan-top for denoising

[40] Hybrid Vision Transformer-Mamba Framework for Autism Diagnosis via Eye-Tracking Analysis

[41] NSD-Imagery: A benchmark dataset for extending fMRI vision decoding methods to mental imagery

[42] KNN-Defense: Defense against 3D Adversarial Point Clouds using Nearest-Neighbor Search

[43] Gaussian Mapping for Evolving Scenes

[44] Sleep Stage Classification using Multimodal Embedding Fusion from EOG and PSM

[45] Reading in the Dark with Foveated Event Vision

[46] How Important are Videos for Training Video LLMs?

[47] Polar Hierarchical Mamba: Towards Streaming LiDAR Object Detection with Point Clouds as Egocentric Sequences

[48] LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer

[49] Task-driven real-world super-resolution of document scans

[50] AR-RAG: Autoregressive Retrieval Augmentation for Image Generation

[51] Dual-view Spatio-Temporal Feature Fusion with CNN-Transformer Hybrid Network for Chinese Isolated Sign Language Recognition

[52] Guiding Cross-Modal Representations with MLLM Priors via Preference Alignment

[53] Hybrid Mesh-Gaussian Representation for Efficient Indoor Scene Reconstruction

[54] Boosting Adversarial Transferability via Commonality-Oriented Gradient Optimization

[55] DM$^3$Net: Dual-Camera Super-Resolution via Domain Modulation and Multi-scale Matching

[56] Technical Report for ICRA 2025 GOOSE 3D Semantic Segmentation Challenge: Adaptive Point Cloud Understanding for Heterogeneous Robotic Systems

[57] BePo: Leveraging Birds Eye View and Sparse Points for Efficient and Accurate 3D Occupancy Prediction

[58] UNO: Unified Self-Supervised Monocular Odometry for Platform-Agnostic Deployment

[59] TABLET: Table Structure Recognition using Encoder-only Transformers

[60] MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks

[61] Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs

[62] From Swath to Full-Disc: Advancing Precipitation Retrieval with Multimodal Knowledge Expansion

[63] A Layered Self-Supervised Knowledge Distillation Framework for Efficient Multimodal Learning on the Edge

[64] D2R: dual regularization loss with collaborative adversarial generation for model robustness

[65] FLAIR-HUB: Large-scale Multimodal Dataset for Land Cover and Crop Mapping

[66] UCOD-DPL: Unsupervised Camouflaged Object Detection via Dynamic Pseudo-label Learning

[67] SceneLCM: End-to-End Layout-Guided Interactive Indoor Scene Generation with Latent Consistency Model

[68] EdgeSpotter: Multi-Scale Dense Text Spotting for Industrial Panel Monitoring

[69] Image segmentation and classification of E-waste for waste segregation

[70] Hi-VAE: Efficient Video Autoencoding with Global and Detailed Motion

[71] Learning Compact Vision Tokens for Efficient Large Multimodal Models

[72] GoTrack: Generic 6DoF Object Pose Refinement and Tracking

[73] Faster than Fast: Accelerating Oriented FAST Feature Detection on Low-end Embedded GPUs

[74] Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models

[75] Hierarchical Feature-level Reverse Propagation for Post-Training Neural Networks

[76] SAP-Bench: Benchmarking Multimodal Large Language Models in Surgical Action Planning

[77] TV-LiVE: Training-Free, Text-Guided Video Editing via Layer Informed Vitality Exploitation

[78] Backdoor Attack on Vision Language Models with Stealthy Semantic Manipulation