cs.CV [Total: 199]
cs.GR [Total: 8]
cs.CL [Total: 193]
math.OC [Total: 1]
econ.GN [Total: 1]
cs.MA [Total: 1]
cs.HC [Total: 1]
cs.RO [Total: 7]
physics.med-ph [Total: 1]
cs.SD [Total: 5]
cs.LG [Total: 41]
quant-ph [Total: 1]
q-bio.NC [Total: 1]
cs.MM [Total: 1]
cs.IR [Total: 3]
eess.AS [Total: 1]
cs.AI [Total: 25]
cs.CR [Total: 2]
eess.IV [Total: 19]
cs.SE [Total: 3]

cs.CV

[1] Improving Open-Set Semantic Segmentation in 3D Point Clouds by Conditional Channel Capacity Maximization: Preliminary Results

Wang Fang,Shirin Rahimi,Olivia Bennett,Sophie Carter,Mitra Hassani,Xu Lan,Omid Javadi,Lucas Mitchell

Main category: cs.CV

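TL;DR: Proposes a plug-and-play framework for open-set semantic segmentation (O3S) that models the segmentation pipeline as a conditional Markov chain and introduces a Conditional Channel Capacity Maximization (3CM) regularizer to improve recognition of unknown classes.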

Details

Motivation: Current point-cloud semantic segmentation models perform well in closed-set settings but struggle to recognize or segment unknown classes in open-set scenarios, so models that handle both known and unknown classes are needed. Method: Proposes a framework based on a conditional Markov chain and designs the 3CM regularizer, which preserves richer label-dependent features by maximizing the conditional mutual information between features and predictions. Result: Experiments indicate that the method is effective at detecting unknown objects. Conclusion: The method offers a new approach to open-set semantic segmentation, and the authors outline future directions in dynamic open-world adaptation and efficient information-theoretic estimation. Abstract: Point-cloud semantic segmentation underpins a wide range of critical applications. Although recent deep architectures and large-scale datasets have driven impressive closed-set performance, these models struggle to recognize or properly segment objects outside their training classes. This gap has sparked interest in Open-Set Semantic Segmentation (O3S), where models must both correctly label known categories and detect novel, unseen classes. In this paper, we propose a plug-and-play framework for O3S. By modeling the segmentation pipeline as a conditional Markov chain, we derive a novel regularizer term, dubbed Conditional Channel Capacity Maximization (3CM), that maximizes the mutual information between features and predictions conditioned on each class. When incorporated into standard loss functions, 3CM encourages the encoder to retain richer, label-dependent features, thereby enhancing the network's ability to distinguish and segment previously unseen categories. Experimental results demonstrate the effectiveness of the proposed method on detecting unseen objects. We further outline future directions for dynamic open-world adaptation and efficient information-theoretic estimation.
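The quantity that 3CM targets, the mutual information between features and predictions conditioned on the class, can be illustrated on a small discrete example. The sketch below is not the paper's estimator (the abstract does not specify one); it assumes a hypothetical joint probability table over a discretized feature Z, a prediction Yhat, and the true class Y, and evaluates I(Z; Yhat | Y) exactly.

```python
import math

def conditional_mutual_information(joint):
    """Return I(Z; Yhat | Y) in nats.

    joint[y][z][k] is the joint probability p(Y=y, Z=z, Yhat=k);
    all entries together are assumed to sum to 1.
    """
    cmi = 0.0
    for table in joint:  # one (Z x Yhat) table per class y
        p_y = sum(sum(row) for row in table)
        if p_y == 0:
            continue
        # Conditional marginals p(z | y) and p(yhat | y).
        p_z = [sum(row) / p_y for row in table]
        p_yhat = [sum(col) / p_y for col in zip(*table)]
        for z, row in enumerate(table):
            for k, p in enumerate(row):
                if p > 0:
                    cond = p / p_y  # p(z, yhat | y)
                    cmi += p * math.log(cond / (p_z[z] * p_yhat[k]))
    return cmi
```

With a single class, a table in which the feature fully determines the prediction gives log 2 nats, while a table in which they are independent gives zero. A regularizer in this spirit would reward the former.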

[2] Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis

Akarsh Kumar,Jeff Clune,Joel Lehman,Kenneth O. Stanley

Main category: cs.CV

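TL;DR: The paper asks whether better AI performance necessarily comes with better internal representations. Comparing evolved networks with SGD-trained networks on the task of generating a single image, it finds that the two produce the same output but differ markedly in their internal representations.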

Details

Motivation: Challenges the view that performance gains necessarily imply better representations, and examines how different training methods affect a network's internal representations. Method: Compares evolved networks with SGD-trained networks on a single-image generation task, analyzing internal representations by visualizing each neuron's behavior. Result: SGD-trained networks exhibit fractured entangled representation (FER), whereas evolved networks approach a unified factored representation (UFR), showing that identical performance can hide very different representations. Conclusion: FER may degrade core model capacities; future representation learning needs to understand and mitigate FER. Abstract: Much of the excitement in modern AI is driven by the observation that scaling up existing systems leads to better performance. But does better performance necessarily imply better internal representations? While the representational optimist assumes it must, this position paper challenges that view. We compare neural networks evolved through an open-ended search process to networks trained via conventional stochastic gradient descent (SGD) on the simple task of generating a single image. This minimal setup offers a unique advantage: each hidden neuron's full functional behavior can be easily visualized as an image, thus revealing how the network's output behavior is internally constructed neuron by neuron. The result is striking: while both networks produce the same output behavior, their internal representations differ dramatically. The SGD-trained networks exhibit a form of disorganization that we term fractured entangled representation (FER). Interestingly, the evolved networks largely lack FER, even approaching a unified factored representation (UFR). In large models, FER may be degrading core model capacities like generalization, creativity, and (continual) learning. Therefore, understanding and mitigating FER could be critical to the future of representation learning.

[3] Improved Bag-of-Words Image Retrieval with Geometric Constraints for Ground Texture Localization

Aaron Wilhelm,Nils Napp

Main category: cs.CV

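TL;DR: Presents an improved BoW image retrieval system for ground texture localization that substantially improves global localization accuracy and loop closure detection in SLAM.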

Details

Motivation: Ground texture localization is a low-cost, high-precision localization approach, but existing BoW systems leave room for improvement in global localization and loop closure detection. Method: Uses an approximate k-means (AKM) vocabulary with soft assignment, exploits the consistent orientation and constant scale constraints of ground texture localization, and provides high-accuracy and high-speed versions of the algorithm. Result: An ablation study validates each improvement; the method yields higher accuracy for global localization and higher precision and recall for loop closure detection. Conclusion: The method can serve as a drop-in replacement for existing BoW systems and improves localization without requiring environmental modification. Abstract: Ground texture localization using a downward-facing camera offers a low-cost, high-precision localization solution that is robust to dynamic environments and requires no environmental modification. We present a significantly improved bag-of-words (BoW) image retrieval system for ground texture localization, achieving substantially higher accuracy for global localization and higher precision and recall for loop closure detection in SLAM. Our approach leverages an approximate $k$-means (AKM) vocabulary with soft assignment, and exploits the consistent orientation and constant scale constraints inherent to ground texture localization. Identifying the different needs of global localization vs. loop closure detection for SLAM, we present both high-accuracy and high-speed versions of our algorithm. We test the effect of each of our proposed improvements through an ablation study and demonstrate our method's effectiveness for both global localization and loop closure detection. With numerous ground texture localization systems already using BoW, our method can readily replace other generic BoW systems in their pipeline and immediately improve their results.
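Soft assignment, as mentioned in the method, spreads each local descriptor over several nearby visual words instead of voting for only the nearest one. The following is a minimal sketch under stated assumptions: the function name, the Gaussian weighting, and the parameters `k` and `sigma` are illustrative choices rather than the authors' settings, and the nearest words are found by brute force instead of an approximate k-means index.

```python
import math

def soft_assign_histogram(descriptors, vocabulary, k=3, sigma=1.0):
    """Build an L1-normalized BoW histogram with soft assignment.

    Each descriptor votes for its k nearest visual words, with Gaussian
    weights that are normalized to sum to 1 per descriptor.
    """
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        # Squared Euclidean distance from this descriptor to every word.
        dists = [sum((a - b) ** 2 for a, b in zip(d, w)) for w in vocabulary]
        nearest = sorted(range(len(vocabulary)), key=dists.__getitem__)[:k]
        # Subtract the smallest distance so the largest weight is exactly 1,
        # which avoids underflow when all distances are large.
        d_min = dists[nearest[0]]
        weights = [math.exp(-(dists[i] - d_min) / (2 * sigma ** 2)) for i in nearest]
        total = sum(weights)
        for i, wgt in zip(nearest, weights):
            hist[i] += wgt / total
    n = sum(hist)
    return [h / n for h in hist] if n else hist
```

Two images can then be compared by any histogram similarity, for example the dot product of their normalized histograms.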

[4] BandRC: Band Shifted Raised Cosine Activated Implicit Neural Representations

Pandula Thennakoon,Avishka Ranasinghe,Mario De Silva,Buwaneka Epakanda,Roshan Godaliyadda,Parakrama Ekanayake,Vijitha Herath

Main category: cs.CV

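TL;DR: Proposes a new activation function, BandRC, to strengthen the signal representation capacity of implicit neural representations (INRs), addressing the challenges existing activation functions face with spectral bias, noise robustness, and capturing local and global features.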

Details

Motivation: Existing INR activation functions suffer from spectral bias, poor noise robustness, difficulty capturing local and global features simultaneously, and the need for manual parameter tuning. Method: Proposes the BandRC activation function and incorporates deep prior knowledge extracted from the signal, adjusting the activation function through a task-specific model. Result: On image reconstruction, denoising, super-resolution, inpainting, and 3D shape reconstruction, BandRC outperforms existing SOTA methods, with PSNR gains of up to 8.93 dB. Conclusion: BandRC is an effective activation function that improves INR performance across a range of computer vision tasks. Abstract: In recent years, implicit neural representations (INRs) have gained popularity in the computer vision community. This is mainly due to the strong performance of INRs in many computer vision tasks. These networks can extract a continuous signal representation given a discrete signal representation. In previous studies, it has been repeatedly shown that INR performance has a strong correlation with the activation functions used in its multilayer perceptrons. Although numerous activation functions have been proposed that are competitive with one another, they share a common set of challenges, such as spectral bias (lack of sensitivity to high-frequency content in signals), limited robustness to signal noise, difficulty in simultaneously capturing both local and global features, and the requirement for manual parameter tuning. To address these issues, we introduce a novel activation function, Band Shifted Raised Cosine Activated Implicit Neural Networks (BandRC), tailored to enhance signal representation capacity further. We also incorporate deep prior knowledge extracted from the signal to adjust the activation functions through a task-specific model. Through a mathematical analysis and a series of experiments which include image reconstruction (with a +8.93 dB PSNR improvement over the nearest counterpart), denoising (with a +0.46 dB increase in PSNR), super-resolution (with a +1.03 dB improvement over the nearest State-Of-The-Art (SOTA) method for 6X super-resolution), inpainting, and 3D shape reconstruction, we demonstrate the dominance of BandRC over existing state-of-the-art activation functions.
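To put the reported PSNR gains in perspective, the sketch below uses the standard PSNR definition and converts a gain in dB into the corresponding reduction in mean squared error. The function names are illustrative and not taken from the paper; signals are assumed to be flat lists of values in [0, peak].

```python
import math

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length signals."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10 * math.log10(peak ** 2 / mse)

def mse_reduction_factor(psnr_gain_db):
    """Factor by which MSE shrinks for a given PSNR gain at a fixed peak."""
    return 10 ** (psnr_gain_db / 10)
```

Because PSNR is logarithmic in the error, a +8.93 dB improvement corresponds to a mean squared error roughly 7.8 times smaller, whereas +0.46 dB corresponds to a reduction of about 10%.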

[5] DPSeg: Dual-Prompt Cost Volume Learning for Open-Vocabulary Semantic Segmentation

Ziyu Zhao,Xiaoguang Li,Linjia Shi,Nasrin Imanpour,Song Wang

Main category: cs.CV

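TL;DR: Proposes DPSeg, a dual-prompt framework for open-vocabulary semantic segmentation that combines visual and text embeddings to reduce the domain gap and uses multi-level features to improve segmentation accuracy.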

Details

Motivation: Existing methods rely on text embeddings from pre-trained vision-language models such as CLIP, but suffer from a domain gap between image and text embeddings and lack shallow-level feature guidance, which hurts segmentation of small objects and fine details. Method: Proposes DPSeg, comprising dual-prompt cost volume generation, a cost volume-guided decoder, and a semantic-guided prompt refinement strategy; visual embeddings reduce the domain gap and provide multi-level feature guidance. Result: Experiments on multiple public datasets show that the method significantly outperforms existing state-of-the-art approaches. Conclusion: DPSeg's dual-prompt framework addresses the domain gap and feature guidance problems in open-vocabulary semantic segmentation and improves segmentation performance. Abstract: Open-vocabulary semantic segmentation aims to segment images into distinct semantic regions for both seen and unseen categories at the pixel level. Current methods utilize text embeddings from pre-trained vision-language models like CLIP but struggle with the inherent domain gap between image and text embeddings, even after extensive alignment during training. Additionally, relying solely on deep text-aligned features limits shallow-level feature guidance, which is crucial for detecting small objects and fine details, ultimately reducing segmentation accuracy. To address these limitations, we propose a dual prompting framework, DPSeg, for this task. Our approach combines dual-prompt cost volume generation, a cost volume-guided decoder, and a semantic-guided prompt refinement strategy that leverages our dual prompting scheme to mitigate alignment issues in visual prompt generation. By incorporating visual embeddings from a visual prompt encoder, our approach reduces the domain gap between text and image embeddings while providing multi-level guidance through shallow features. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches on multiple public datasets.

[6] LoFT: LoRA-fused Training Dataset Generation with Few-shot Guidance

Jae Myung Kim,Stephan Alaniz,Cordelia Schmid,Zeynep Akata

Main category: cs.CV

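TL;DR: LoFT is a new dataset generation framework that fine-tunes LoRA weights on a few real images and fuses them to generate synthetic data with high fidelity and diversity, improving supervised learning performance.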

Details

Motivation: Existing synthetic data methods fail to reproduce the real data distribution faithfully; the lack of diversity and fidelity limits performance gains. Method: LoFT fine-tunes LoRA weights on individual real images and fuses them at inference time to generate synthetic data with improved diversity and fidelity. Result: Experiments show that training on LoFT-generated data outperforms other synthetic dataset methods across 10 datasets, with significant accuracy gains. Conclusion: LoFT generates synthetic data with high fidelity and diversity, which improves downstream model performance. Abstract: Despite recent advances in text-to-image generation, using synthetically generated data seldom brings a significant boost in performance for supervised learning. Oftentimes, synthetic datasets do not faithfully recreate the data distribution of real data, i.e., they lack the fidelity or diversity needed for effective downstream model training. While previous work has employed few-shot guidance to address this issue, existing methods still fail to capture and generate features unique to specific real images. In this paper, we introduce a novel dataset generation framework named LoFT, LoRA-Fused Training-data Generation with Few-shot Guidance. Our method fine-tunes LoRA weights on individual real images and fuses them at inference time, producing synthetic images that combine the features of real images for improved diversity and fidelity of generated data. We evaluate the synthetic data produced by LoFT on 10 datasets, using 8 to 64 real images per class as guidance and scaling up to 1000 images per class. Our experiments show that training on LoFT-generated data consistently outperforms other synthetic dataset methods, significantly increasing accuracy as the dataset size increases. Additionally, our analysis demonstrates that LoFT generates datasets with high fidelity and sufficient diversity, which contribute to the performance improvement. The code is available at https://github.com/ExplainableML/LoFT.
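One simple way to picture fusing several LoRA adapters is to add a weighted sum of their low-rank updates to a frozen base weight matrix. The sketch below assumes plain averaging by default; the abstract does not state how LoFT combines adapters, so the fusion rule, the function names, and the tiny matrices are all illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def fuse_lora(base, adapters, weights=None):
    """Return base + sum_i weights[i] * (B_i @ A_i).

    base:     d_out x d_in weight matrix (left unmodified).
    adapters: list of (B, A) pairs with B of shape d_out x r and A of
              shape r x d_in, e.g. one pair per real guidance image.
    weights:  per-adapter mixing weights; defaults to a uniform average
              (an assumption, not the paper's rule).
    """
    if weights is None:
        weights = [1.0 / len(adapters)] * len(adapters)
    fused = [row[:] for row in base]
    for (B, A), w in zip(adapters, weights):
        delta = matmul(B, A)  # low-rank update contributed by this adapter
        for i, row in enumerate(delta):
            for j, v in enumerate(row):
                fused[i][j] += w * v
    return fused
```

Sampling different subsets or mixing weights at inference time would yield different fused models, which is one plausible source of the diversity the summary describes.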

[7] Attend to Not Attended: Structure-then-Detail Token Merging for Post-training DiT Acceleration

Haipeng Fang,Sheng Tang,Juan Cao,Enshuo Zhang,Fan Tang,Tong-Yee Lee

Main category: cs.CV

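TL;DR: Proposes SDTM, a method that accelerates visual generation with diffusion transformers by dynamically compressing feature redundancy while preserving image quality.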

Details

Motivation: Diffusion transformers excel at visual generation but are computationally expensive. Existing methods neglect the denoising priors of diffusion models, leading to suboptimal acceleration and degraded image quality. Method: Analyzes feature redundancy based on the structure-then-detail denoising prior and proposes SDTM, which includes dynamic visual token merging, compression ratio adjustment, and prompt reweighting. Result: Experiments show that SDTM performs well across architectures, schedulers, and datasets, for example achieving a 1.55x speedup with negligible impact on image quality. Conclusion: SDTM is an efficient post-training method that integrates seamlessly into any diffusion transformer architecture and improves computational efficiency. Abstract: Diffusion transformers have shown exceptional performance in visual generation but incur high computational costs. Token reduction techniques that compress models by sharing the denoising process among similar tokens have been introduced. However, existing approaches neglect the denoising priors of the diffusion models, leading to suboptimal acceleration and diminished image quality. This study proposes a novel concept: attend to prune feature redundancies in areas not attended by the diffusion process. We analyze the location and degree of feature redundancies based on the structure-then-detail denoising priors. Subsequently, we introduce SDTM, a structure-then-detail token merging approach that dynamically compresses feature redundancies. Specifically, we design dynamic visual token merging, compression ratio adjusting, and prompt reweighting for different stages. Applied in a post-training manner, the proposed method can be integrated seamlessly into any DiT architecture. Extensive experiments across various backbones, schedulers, and datasets showcase the superiority of our method; for example, it achieves 1.55 times acceleration with negligible impact on image quality. Project page: https://github.com/ICTMCG/SDTM.
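Token merging in general reduces the number of tokens a transformer processes by combining tokens that carry nearly the same information. The sketch below shows a generic greedy variant based on cosine similarity; it is not SDTM's stage-dependent scheme, and the function names and the `keep` parameter are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def merge_tokens(tokens, keep):
    """Greedily merge the most similar pair until `keep` tokens remain.

    Merged tokens are size-weighted averages, so a token that already
    represents several originals counts proportionally more.
    """
    tokens = [list(t) for t in tokens]
    sizes = [1] * len(tokens)
    while len(tokens) > max(keep, 1):
        # Find the most similar pair (i < j) among the current tokens.
        _, i, j = max(
            (cosine(tokens[i], tokens[j]), i, j)
            for i in range(len(tokens))
            for j in range(i + 1, len(tokens))
        )
        total = sizes[i] + sizes[j]
        tokens[i] = [
            (a * sizes[i] + b * sizes[j]) / total
            for a, b in zip(tokens[i], tokens[j])
        ]
        sizes[i] = total
        del tokens[j], sizes[j]
    return tokens
```

A schedule that changes `keep` across denoising steps would be one way to mimic the compression ratio adjustment mentioned in the method, though how SDTM sets that ratio is not described here.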

[8] EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video

Ryan Hoque,Peide Huang,David J. Yoon,Mouli Sivapurapu,Jian Zhang

Main category: cs.CV

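TL;DR: Introduces the EgoDex dataset, which addresses data scarcity in imitation learning by providing 829 hours of egocentric video with 3D hand tracking for robot imitation learning.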

Details

Motivation: Imitation learning for manipulation suffers from data scarcity, and existing datasets lack hand pose annotations and a focus on manipulation tasks. Method: Uses Apple Vision Pro to collect EgoDex, comprising 829 hours of video with 3D hand tracking across 194 everyday manipulation tasks. Result: The dataset supports imitation learning policies for hand trajectory prediction, and the authors provide evaluation metrics and benchmarks. Conclusion: EgoDex is released to advance robotics, computer vision, and foundation models. Abstract: Imitation learning for manipulation has a well-known data scarcity problem. Unlike natural language and 2D computer vision, there is no Internet-scale corpus of data for dexterous manipulation. One appealing option is egocentric human video, a passively scalable data source. However, existing large-scale datasets such as Ego4D do not have native hand pose annotations and do not focus on object manipulation. To this end, we use Apple Vision Pro to collect EgoDex: the largest and most diverse dataset of dexterous human manipulation to date. EgoDex has 829 hours of egocentric video with paired 3D hand and finger tracking data collected at the time of recording, where multiple calibrated cameras and on-device SLAM can be used to precisely track the pose of every joint of each hand. The dataset covers a wide range of diverse manipulation behaviors with everyday household objects in 194 different tabletop tasks ranging from tying shoelaces to folding laundry. Furthermore, we train and systematically evaluate imitation learning policies for hand trajectory prediction on the dataset, introducing metrics and benchmarks for measuring progress in this increasingly important area. By releasing this large-scale dataset, we hope to push the frontier of robotics, computer vision, and foundation models.

[9] UGoDIT: Unsupervised Group Deep Image Prior Via Transferable Weights

Shijun Liang,Ismail R. Alkhouri,Siddhant Gautam,Qing Qu,Saiprasad Ravishankar

Main category: cs.CV

TL;DR: UGoDIT is an unsupervised Group DIP method for low-data settings that achieves efficient image reconstruction through transferable weights.

Details

Motivation: In data-scarce settings such as dynamic imaging, conventional approaches such as diffusion models (DMs) need large amounts of clean data, while DIP methods overfit to noise and are computationally expensive. Method: Proposes UGoDIT, which learns transferable weights with a shared encoder and M disentangled decoders; at test time, part of the parameters are fixed and the rest are optimized to enforce measurement consistency. Result: On medical and natural image recovery tasks, UGoDIT accelerates convergence and improves reconstruction quality, with performance competitive with SOTA methods. Conclusion: UGoDIT performs well in the low-data regime and reconstructs efficiently without large amounts of clean data. Abstract: Recent advances in data-centric deep generative models have led to significant progress in solving inverse imaging problems. However, these models (e.g., diffusion models (DMs)) typically require large amounts of fully sampled (clean) training data, which is often impractical in medical and scientific settings such as dynamic imaging. On the other hand, training-data-free approaches like the Deep Image Prior (DIP) do not require clean ground-truth images but suffer from noise overfitting and can be computationally expensive as the network parameters need to be optimized for each measurement set independently. Moreover, DIP-based methods often overlook the potential of learning a prior using a small number of sub-sampled measurements (or degraded images) available during training. In this paper, we propose UGoDIT, an Unsupervised Group DIP via Transferable weights, designed for the low-data regime where only a very small number, M, of sub-sampled measurement vectors are available during training. Our method learns a set of transferable weights by optimizing a shared encoder and M disentangled decoders. At test time, we reconstruct the unseen degraded image using a DIP network, where part of the parameters are fixed to the learned weights, while the remaining are optimized to enforce measurement consistency. We evaluate UGoDIT on both medical (multi-coil MRI) and natural (super resolution and non-linear deblurring) image recovery tasks under various settings. Compared to recent standalone DIP methods, UGoDIT provides accelerated convergence and notable improvement in reconstruction quality. Furthermore, our method achieves performance competitive with SOTA DM-based and supervised approaches, despite not requiring large amounts of clean training data.

[10] Semantically-Aware Game Image Quality Assessment

Kai Zhu,Vignesh Edithal,Le Zhang,Ilia Blank,Imran Junejo

Main category: cs.CV

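TL;DR: Proposes a no-reference image quality assessment (NR-IQA) model for game graphics that uses knowledge distillation and semantic gating to detect and quantify game-specific distortions.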

Details

Motivation: Assessing the visual quality of game graphics is challenging because reference images are absent and the distortion types (such as aliasing and texture blur) are distinctive; existing NR-IQA/VQA methods do not transfer directly. Method: The model uses a knowledge-distilled Game distortion feature extractor (GDFE) to extract game-specific distortion features, combined with semantic gating via CLIP embeddings to dynamically weight feature importance. Result: The model generalizes well to intermediate distortion levels unseen during training; semantic gating improves contextual relevance and reduces prediction variance; the model outperforms out-of-domain methods. Conclusion: The work lays a foundation for automated quality assessment of game graphics and advances NR-IQA methods in the gaming domain. Abstract: Assessing the visual quality of video game graphics presents unique challenges due to the absence of reference images and the distinct types of distortions, such as aliasing, texture blur, and geometry level of detail (LOD) issues, which differ from those in natural images or user-generated content. Existing no-reference image and video quality assessment (NR-IQA/VQA) methods fail to generalize to gaming environments as they are primarily designed for distortions like compression artifacts. This study introduces a semantically-aware NR-IQA model tailored to gaming. The model employs a knowledge-distilled Game distortion feature extractor (GDFE) to detect and quantify game-specific distortions, while integrating semantic gating via CLIP embeddings to dynamically weight feature importance based on scene content. Training on gameplay data recorded across graphical quality presets enables the model to produce quality scores that align with human perception. Our results demonstrate that the GDFE, trained through knowledge distillation from binary classifiers, generalizes effectively to intermediate distortion levels unseen during training. Semantic gating further improves contextual relevance and reduces prediction variance. In the absence of in-domain NR-IQA baselines, our model outperforms out-of-domain methods and exhibits robust, monotonic quality trends across unseen games in the same genre. This work establishes a foundation for automated graphical quality assessment in gaming, advancing NR-IQA methods in this domain.

[11] X-Edit: Detecting and Localizing Edits in Images Altered by Text-Guided Diffusion Models

Valentina Bazyleva,Nicolo Bonettini,Gaurav Bharaj

Main category: cs.CV

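TL;DR: X-Edit is a new method for localizing image regions edited by diffusion models. It inverts the image with a pretrained diffusion model and feeds the inverted features to a segmentation network with attention, improving the accuracy of edit localization.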

Details

Motivation: With text-guided diffusion models widely used for image editing, their potential malicious use challenges detection techniques and calls for a method that precisely localizes edited regions. Method: X-Edit inverts the image with a pretrained diffusion model, feeds the inverted features into a segmentation network with channel and spatial attention, and fine-tunes with a combined segmentation and relevance loss to localize edited regions. Result: X-Edit outperforms baselines on PSNR and SSIM and accurately localizes edits introduced by diffusion models. Conclusion: X-Edit is a robust forensic tool that offers an effective means of detecting manipulations introduced by advanced image editing techniques. Abstract: Text-guided diffusion models have significantly advanced image editing, enabling highly realistic and local modifications based on textual prompts. While these developments expand creative possibilities, their malicious use poses substantial challenges for detection of such subtle deepfake edits. To this end, we introduce Explain Edit (X-Edit), a novel method for localizing diffusion-based edits in images. To localize the edits for an image, we invert the image using a pretrained diffusion model, then use these inverted features as input to a segmentation network that explicitly predicts the edited masked regions via channel and spatial attention. Further, we finetune the model using a combined segmentation and relevance loss. The segmentation loss ensures accurate mask prediction by balancing pixel-wise errors and perceptual similarity, while the relevance loss guides the model to focus on low-frequency regions and mitigate high-frequency artifacts, enhancing the localization of subtle edits. To the best of our knowledge, we are the first to address and model the problem of localizing diffusion-based modified regions in images. We additionally contribute a new dataset of paired original and edited images addressing the current lack of resources for this task. Experimental results demonstrate that X-Edit accurately localizes edits in images altered by text-guided diffusion models, outperforming baselines in PSNR and SSIM metrics. This highlights X-Edit's potential as a robust forensic tool for detecting and pinpointing manipulations introduced by advanced image editing techniques.

[12] Generalizable Vision-Language Few-Shot Adaptation with Predictive Prompts and Negative Learning

Sriram Mandalika

Main category: cs.CV

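TL;DR: PromptFuseNL is a unified framework combining predictive prompt tuning with dual-branch positive and negative learning. It strengthens few-shot generalization through task-conditioned residuals and multi-stage cross-modal coordination, and performs strongly on 15 benchmarks.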

Details

Motivation: Addresses core challenges in few-shot adaptation of vision-language models, especially under limited supervision and noisy support samples. Method: Combines predictive prompt tuning with dual-branch positive and negative learning; refines class prototypes through task-conditioned residuals, multi-stage cross-modal coordination, and semantic hard negative mining; and uses an unsupervised instance reweighting strategy to handle label noise. Result: Outperforms existing methods on 15 benchmarks, with up to 300x faster training and 1000x lower FLOPs than full prompt tuning, setting a new SOTA. Conclusion: PromptFuseNL offers an efficient, robust, and scalable solution for few-shot vision-language adaptation. Abstract: Few-shot adaptation remains a core challenge for vision-language models (VLMs), especially under limited supervision and noisy support samples. We propose PromptFuseNL, a unified framework that enhances few-shot generalization by combining predictive prompt tuning with dual-branch positive and negative learning. The method refines class prototypes through task-conditioned residuals, multi-stage cross-modal coordination, and semantic hard negative mining. To address label noise, we introduce an unsupervised instance reweighting strategy that downweights unreliable support examples without requiring additional labels or structural changes. PromptFuseNL fuses visual and textual cues through lightweight modules for efficient and discriminative prediction. Evaluated across 15 benchmarks, it consistently surpasses existing prompt- and adapter-based methods in all shot settings while remaining highly efficient, achieving up to 300x faster training and 1000x lower FLOPs compared to full prompt tuning and setting a new state of the art for robust and scalable few-shot vision-language adaptation.

[13] Technical Report for ICRA 2025 GOOSE 2D Semantic Segmentation Challenge: Boosting Off-Road Segmentation via Photometric Distortion and Exponential Moving Average

Wonjune Kim,Lae-kyoung Lee,Su-Yong An

Main category: cs.CV

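TL;DR: Applies a high-capacity semantic segmentation pipeline with a FlashInternImage-B backbone and a UPerNet decoder to the GOOSE challenge on unstructured off-road scenes. Combining strong photometric distortion augmentation with an EMA of weights, it reaches 88.8% mIoU.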

Details

Motivation: Tackles semantic segmentation in unstructured off-road environments and adapts to their distinctive scene conditions. Method: Uses a FlashInternImage-B backbone with a UPerNet decoder, combined with strong photometric distortion augmentation and an EMA of weights. Result: Achieves 88.8% mIoU on the validation set. Conclusion: Adapting established techniques, rather than designing new ones, works well for off-road semantic segmentation. Abstract: We report on the application of a high-capacity semantic segmentation pipeline to the GOOSE 2D Semantic Segmentation Challenge for unstructured off-road environments. Using a FlashInternImage-B backbone together with a UPerNet decoder, we adapt established techniques, rather than designing new ones, to the distinctive conditions of off-road scenes. Our training recipe couples strong photometric distortion augmentation (to emulate the wide lighting variations of outdoor terrain) with an Exponential Moving Average (EMA) of weights for better generalization. Using only the GOOSE training dataset, we achieve 88.8% mIoU on the validation set.
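An exponential moving average of weights keeps a slowly changing shadow copy of the model parameters and evaluates with that copy instead of the latest optimizer iterate. The sketch below shows the update rule on a flat list of floats; the decay value and the toy training loop are assumptions, since the report does not state its settings.

```python
def ema_update(ema_weights, model_weights, decay=0.999):
    """Return the new shadow weights: decay * ema + (1 - decay) * model."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, model_weights)]

# Toy usage: the "model" weight jumps around a target of 1.0, and the
# shadow copy follows it smoothly.
shadow = [0.0]
for step in range(200):
    noisy_weight = [1.0 + (0.2 if step % 2 else -0.2)]  # stand-in for an optimizer step
    shadow = ema_update(shadow, noisy_weight, decay=0.9)
```

After enough steps the shadow weight settles close to the average of the noisy iterates, which is the smoothing effect that tends to help generalization.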

[14] Self-NPO: Negative Preference Optimization of Diffusion Models by Simply Learning from Itself without Explicit Preference Annotations

Fu-Yun Wang,Keqiang Sun,Yao Teng,Xihui Liu,Jiaming Song,Hongsheng Li

Main category: cs.CV

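TL;DR: Self-NPO is a negative preference optimization method that needs no manual annotation or reward model training; by learning from the model itself, it improves generation quality and alignment with human preferences.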

Details

Motivation: Existing negative preference optimization methods depend on explicit preference annotations, which are costly and impractical, especially in data-scarce domains. Method: Proposes Self-NPO, which learns only from the model itself, avoiding manual annotation and reward model training; it is efficient and does not require exhaustive data sampling. Result: Self-NPO integrates seamlessly into a range of diffusion models and improves generation quality and alignment with human preferences. Conclusion: Self-NPO offers an efficient and practical solution for negative preference optimization. Abstract: Diffusion models have demonstrated remarkable success in various visual generation tasks, including image, video, and 3D content generation. Preference optimization (PO) is a prominent and growing area of research that aims to align these models with human preferences. While existing PO methods primarily concentrate on producing favorable outputs, they often overlook the significance of classifier-free guidance (CFG) in mitigating undesirable results. Diffusion-NPO addresses this gap by introducing negative preference optimization (NPO), training models to generate outputs opposite to human preferences and thereby steering them away from unfavorable outcomes. However, prior NPO approaches, including Diffusion-NPO, rely on costly and fragile procedures for obtaining explicit preference annotations (e.g., manual pairwise labeling or reward model training), limiting their practicality in domains where such data are scarce or difficult to acquire. In this work, we introduce Self-NPO, a Negative Preference Optimization approach that learns exclusively from the model itself, thereby eliminating the need for manual data labeling or reward model training. Moreover, our method is highly efficient and does not require exhaustive data sampling. We demonstrate that Self-NPO integrates seamlessly into widely used diffusion models, including SD1.5, SDXL, and CogVideoX, as well as models already optimized for human preferences, consistently enhancing both their generation quality and alignment with human preferences.

[15] CL-CaGAN: Capsule differential adversarial continuous learning for cross-domain hyperspectral anomaly detection

Jianing Wang,Siying Guo,Zheng Hua,Runhu Huang,Jinyu Hu,Maoguo Gong

Main category: cs.CV

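TL;DR: Proposes a continual-learning-based capsule differential generative adversarial network (CL-CaGAN) to improve cross-scenario learning for hyperspectral image (HSI) anomaly detection.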

Details

Motivation: Existing deep learning (DL) methods face limited prior information and catastrophic forgetting in open-scenario cross-domain detection, which restricts their practical use. Method: Combines a modified capsule structure with an adversarial learning network to estimate the background distribution, and uses a clustering-based sample replay strategy and self-distillation regularization to mitigate catastrophic forgetting. Result: Experiments show that CL-CaGAN achieves higher detection performance and continual learning capacity in cross-domain scenarios. Conclusion: CL-CaGAN addresses the lack of prior information and catastrophic forgetting, making hyperspectral anomaly detection more practical. Abstract: Anomaly detection (AD) has attracted remarkable attention in hyperspectral image (HSI) processing fields, and most existing deep learning (DL)-based algorithms indicate dramatic potential for detecting anomaly samples through specific training process under current scenario. However, the limited prior information and the catastrophic forgetting problem indicate crucial challenges for existing DL structure in open scenarios cross-domain detection. In order to improve the detection performance, a novel continual learning-based capsule differential generative adversarial network (CL-CaGAN) is proposed to elevate the cross-scenario learning performance for facilitating the real application of DL-based structure in hyperspectral AD (HAD) task. First, a modified capsule structure with adversarial learning network is constructed to estimate the background distribution for surmounting the deficiency of prior information. To mitigate the catastrophic forgetting phenomenon, clustering-based sample replay strategy and a designed extra self-distillation regularization are integrated for merging the history and future knowledge in continual AD task, while the discriminative learning ability from previous detection scenario to current scenario is retained by the elaborately designed structure with continual learning (CL) strategy. In addition, the differentiable enhancement is enforced to augment the generation performance of the training data. This further stabilizes the training process with better convergence and efficiently consolidates the reconstruction ability of background samples. To verify the effectiveness of our proposed CL-CaGAN, we conduct experiments on several real HSIs, and the results indicate that the proposed CL-CaGAN demonstrates higher detection performance and continuous learning capacity for mitigating the catastrophic forgetting under cross-domain scenarios.

[16] CL-BioGAN: Biologically-Inspired Cross-Domain Continual Learning for Hyperspectral Anomaly Detection

Jianing Wang,Zheng Hua,Wan Zhang,Shengjia Hao,Yuqiong Yao,Maoguo Gong

Main category: cs.CV

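TL;DR: Proposes a biologically inspired continual learning generative adversarial network (CL-BioGAN) for cross-scene hyperspectral anomaly detection (HAD) that improves stability and flexibility by actively forgetting history knowledge and introducing a replay strategy.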

Details

Motivation: Biological neural networks can actively forget history knowledge that conflicts with new experiences by regulating synaptic expansion and convergence. Inspired by this, the work targets the core continual learning challenge of balancing memory stability and learning flexibility. Method: Proposes CL-BioGAN, which combines a Continual Learning Bio-inspired Loss (CL-Bio Loss) with a self-attention generative adversarial network (BioGAN); the bio-inspired loss consists of an Active Forgetting Loss (AF Loss) and a CL loss, realizing parameter releasing and enhancing from a Bayesian perspective. Result: Experiments show that CL-BioGAN achieves more robust and satisfactory detection accuracy on cross-domain HAD with fewer parameters and lower computation cost. Conclusion: CL-BioGAN improves continual learning performance and offers new insights into neural adaptation mechanisms for open-scenario HAD. Abstract: Memory stability and learning flexibility in continual learning (CL) is a core challenge for cross-scene Hyperspectral Anomaly Detection (HAD) task. Biological neural networks can actively forget history knowledge that conflicts with the learning of new experiences by regulating learning-triggered synaptic expansion and synaptic convergence. Inspired by this phenomenon, we propose a novel Biologically-Inspired Continual Learning Generative Adversarial Network (CL-BioGAN) for augmenting continuous distribution fitting ability for cross-domain HAD task, where Continual Learning Bio-inspired Loss (CL-Bio Loss) and self-attention Generative Adversarial Network (BioGAN) are incorporated to realize forgetting history knowledge as well as involving replay strategy in the proposed BioGAN. Specifically, a novel Bio-Inspired Loss composed with an Active Forgetting Loss (AF Loss) and a CL loss is designed to realize parameters releasing and enhancing between new task and history tasks from a Bayesian perspective. Meanwhile, BioGAN loss with L2-Norm enhances self-attention (SA) to further balance the stability and flexibility for better fitting background distribution for open scenario HAD (OHAD) tasks. Experiment results underscore that the proposed CL-BioGAN can achieve more robust and satisfying accuracy for cross-domain HAD with fewer parameters and computation cost. This dual contribution not only elevates CL performance but also offers new insights into neural adaptation mechanisms in OHAD task.

[17] Self-Learning Hyperspectral and Multispectral Image Fusion via Adaptive Residual Guided Subspace Diffusion Model

Jian Zhu,He Wang,Yang Xu,Zebin Wu,Zhihui Wei

Main category: cs.CV

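TL;DR: Proposes a self-learning Adaptive Residual Guided Subspace Diffusion Model (ARGS-Diff) that uses only the observed images, with no extra training data. Lightweight spectral and spatial diffusion models learn the respective distributions, and an adaptive residual module refines the reconstruction of the high-resolution hyperspectral image.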

Details

Motivation: Existing deep learning methods depend on large amounts of hyperspectral training data, which is scarce in practice, so a solution that needs no extra data is required. Method: Designs spectral and spatial diffusion models that learn the distributions of the LR-HSI and HR-MSI respectively, reconstructs the HR-HSI during the reverse diffusion process, and introduces an adaptive residual module to refine sampling. Result: ARGS-Diff outperforms existing HSI-MSI fusion methods in both performance and computational efficiency. Conclusion: ARGS-Diff offers an efficient solution for data-scarce settings and improves fusion quality. Abstract: Hyperspectral and multispectral image (HSI-MSI) fusion involves combining a low-resolution hyperspectral image (LR-HSI) with a high-resolution multispectral image (HR-MSI) to generate a high-resolution hyperspectral image (HR-HSI). Most deep learning-based methods for HSI-MSI fusion rely on large amounts of hyperspectral data for supervised training, which is often scarce in practical applications. In this paper, we propose a self-learning Adaptive Residual Guided Subspace Diffusion Model (ARGS-Diff), which only utilizes the observed images without any extra training data. Specifically, as the LR-HSI contains spectral information and the HR-MSI contains spatial information, we design two lightweight spectral and spatial diffusion models to separately learn the spectral and spatial distributions from them. Then, we use these two models to reconstruct HR-HSI from two low-dimensional components, i.e., the spectral basis and the reduced coefficient, during the reverse diffusion process. Furthermore, we introduce an Adaptive Residual Guided Module (ARGM), which refines the two components through a residual guided function at each sampling step, thereby stabilizing the sampling process. Extensive experimental results demonstrate that ARGS-Diff outperforms existing state-of-the-art methods in terms of both performance and computational efficiency in the field of HSI-MSI fusion. Code is available at https://github.com/Zhu1116/ARGS-Diff.
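The subspace idea behind the two low-dimensional components can be sketched as a low-rank product: a tall spectral basis times a small coefficient map yields the full image. The sketch below assumes the common formulation X = E A with pixels flattened into columns; the names and toy sizes are illustrative, and the diffusion models that would produce E and A are omitted.

```python
def reconstruct_hr_hsi(spectral_basis, coefficients):
    """Return X = E @ A.

    spectral_basis: bands x r matrix E (spectral information).
    coefficients:   r x pixels matrix A (spatial information).
    """
    return [
        [sum(e * a for e, a in zip(row, col)) for col in zip(*coefficients)]
        for row in spectral_basis
    ]

def component_size(bands, pixels, rank):
    """Number of values stored in the two components versus the full image."""
    return bands * rank + rank * pixels, bands * pixels
```

For example, with 100 bands, 10,000 pixels, and rank 5, the two components hold 50,500 values against 1,000,000 for the full image, which is why working in the subspace keeps the models lightweight.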

[18] Are vision language models robust to uncertain inputs?

Xi Wang,Eric Nalisnick

Main category: cs.CV

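TL;DR: Large vision language models (e.g., GPT4o) are more robust to uncertain and ambiguous inputs than earlier models, but they remain overconfident. Prompting models to abstain from uncertain predictions markedly improves reliability, though limitations remain on domain-specific tasks.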

Details

Motivation: Study the robustness of deep vision language models to uncertain and ambiguous inputs, and explore their limitations and possible remedies. Method: Evaluates models on anomaly detection and ambiguous classification tasks, and proposes a new mechanism based on caption diversity to reveal a model's internal uncertainty. Result: Large models perform well on natural-image tasks but are unreliable on domain-specific tasks; the new mechanism effectively predicts model uncertainty. Conclusion: Models need further improvement to handle uncertain inputs; the new mechanism offers a workable way to predict uncertainty without labeled data. Abstract: Robustness against uncertain and ambiguous inputs is a critical challenge for deep learning models. While recent advancements in large-scale vision language models (VLMs, e.g. GPT4o) might suggest that increasing model and training dataset size would mitigate this issue, our empirical evaluation shows a more complicated picture. Testing models using two classic uncertainty quantification tasks, anomaly detection and classification under inherently ambiguous conditions, we find that newer and larger VLMs indeed exhibit improved robustness compared to earlier models, but still suffer from a tendency to strictly follow instructions, often causing them to hallucinate confident responses even when faced with unclear or anomalous inputs. Remarkably, for natural images such as ImageNet, this limitation can be overcome without pipeline modifications: simply prompting models to abstain from uncertain predictions enables significant reliability gains, achieving near-perfect robustness in several settings. However, for domain-specific tasks such as galaxy morphology classification, a lack of specialized knowledge prevents reliable uncertainty estimation. Finally, we propose a novel mechanism based on caption diversity to reveal a model's internal uncertainty, enabling practitioners to predict when models will successfully abstain without relying on labeled data.

[19] Image-based Visibility Analysis Replacing Line-of-Sight Simulation: An Urban Landmark Perspective

Zicheng Fan,Kunihiko Fujiwara,Pengyuan Liu,Fan Zhang,Filip Biljecki

Main category: cs.CV

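TL;DR: The paper proposes a new image-based method that uses a Vision Language Model (VLM) to analyze the visibility of urban landmarks, complementing traditional Line-of-Sight (LoS) analysis.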

Details

Motivation: Traditional geometry-based visibility analysis cannot capture the contextual and perceptual dimensions of visibility in the real world, so a more comprehensive method is needed. Method: Applies a VLM to detect target objects in street view images, and builds a heterogeneous visibility graph to analyze the complex interaction between observers and target objects. Result: The method reaches 87% accuracy in detecting six landmarks in global cities, and reveals the form and strength of connections among landmarks along the River Thames in London. Conclusion: The method enhances traditional LoS analysis and offers new perspectives for urban planning, heritage conservation, and related fields. Abstract: Visibility analysis is one of the fundamental analytical methods in urban planning and landscape research, traditionally conducted through computational simulations based on the Line-of-Sight (LoS) principle. However, when assessing the visibility of named urban objects such as landmarks, geometric intersection alone fails to capture the contextual and perceptual dimensions of visibility as experienced in the real world. The study challenges the traditional LoS-based approaches by introducing a new, image-based visibility analysis method. Specifically, a Vision Language Model (VLM) is applied to detect the target object within a direction-zoomed Street View Image (SVI). Successful detection represents the object's visibility at the corresponding SVI location. Further, a heterogeneous visibility graph is constructed to address the complex interaction between observers and target objects. In the first case study, the method proves its reliability in detecting the visibility of six tall landmark constructions in global cities, with an overall accuracy of 87%. Furthermore, it reveals broader contextual differences when the landmarks are perceived and experienced. In the second case, the proposed visibility graph uncovers the form and strength of connections for multiple landmarks along the River Thames in London, as well as the places where these connections occur. Notably, bridges on the River Thames account for approximately 30% of total connections. Our method complements and enhances traditional LoS-based visibility analysis, and showcases the possibility of revealing the prevalent connection of any visual objects in the urban environment. It opens up new research perspectives for urban planning, heritage conservation, and computational social science.

[20] SGD-Mix: Enhancing Domain-Specific Image Classification with Label-Preserving Data Augmentation

Yixuan Dong,Fang-Yi Su,Jung-Hsien Chiang

Main category: cs.CV

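TL;DR: Proposes a new framework that improves data augmentation for domain-specific image classification by explicitly integrating diversity, faithfulness, and label clarity.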

Details

Motivation: Existing generative diffusion model-based methods fail to address diversity, faithfulness, and label clarity together, and overlook intrinsic challenges of diffusion models. Method: Uses saliency-guided mixing and a fine-tuned diffusion model to preserve foreground semantics, enrich background diversity, and ensure label consistency. Result: Outperforms existing methods on fine-grained, long-tail, few-shot, and background robustness tasks. Conclusion: The new framework addresses key problems in data augmentation and improves performance. Abstract: Data augmentation for domain-specific image classification tasks often struggles to simultaneously address diversity, faithfulness, and label clarity of generated data, leading to suboptimal performance in downstream tasks. While existing generative diffusion model-based methods aim to enhance augmentation, they fail to cohesively tackle these three critical aspects and often overlook intrinsic challenges of diffusion models, such as sensitivity to model characteristics and stochasticity under strong transformations. In this paper, we propose a novel framework that explicitly integrates diversity, faithfulness, and label clarity into the augmentation process. Our approach employs saliency-guided mixing and a fine-tuned diffusion model to preserve foreground semantics, enrich background diversity, and ensure label consistency, while mitigating diffusion model limitations. Extensive experiments across fine-grained, long-tail, few-shot, and background robustness tasks demonstrate our method's superior performance over state-of-the-art approaches.

Jiajun Qin,Yuan Pu,Zhuolun He,Seunggeun Kim,David Z. Pan,Bei Yu

Main category: cs.CV

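TL;DR: UniMoCo is a new vision-language model that uses a modality-completion module and a specialized training strategy to handle the diverse modality combinations found in multi-modal embedding tasks, outperforming existing methods.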

Details

Motivation: Existing models struggle to handle diverse modality combinations (such as text to image, or text to text and image) within a unified embedding space, which degrades performance. Method: Proposes the UniMoCo architecture, which introduces a modality-completion module that generates visual features, and designs a training strategy that aligns the embeddings of original and modality-completed inputs. Result: Experiments show that UniMoCo outperforms existing methods and mitigates the bias caused by imbalanced modality combinations in the training data. Conclusion: Through modality completion and embedding alignment, UniMoCo markedly improves robustness and performance on multi-modal embedding tasks. Abstract: Current research has explored vision-language models for multi-modal embedding tasks, such as information retrieval, visual grounding, and classification. However, real-world scenarios often involve diverse modality combinations between queries and targets, such as text and image to text, text and image to text and image, and text to text and image. These diverse combinations pose significant challenges for existing models, as they struggle to align all modality combinations within a unified embedding space during training, which degrades performance at inference. To address this limitation, we propose UniMoCo, a novel vision-language model architecture designed for multi-modal embedding tasks. UniMoCo introduces a modality-completion module that generates visual features from textual inputs, ensuring modality completeness for both queries and targets. Additionally, we develop a specialized training strategy to align embeddings from both original and modality-completed inputs, ensuring consistency within the embedding space. This enables the model to robustly handle a wide range of modality combinations across embedding tasks. Experiments show that UniMoCo outperforms previous methods while demonstrating consistent robustness across diverse settings. More importantly, we identify and quantify the inherent bias in conventional approaches caused by imbalance of modality combinations in training data, which can be mitigated through our modality-completion paradigm. The code is available at https://github.com/HobbitQia/UniMoCo.

[22] Continuous Subspace Optimization for Continual Learning

Quan Cheng,Yuanyu Wan,Lingyu Wu,Chenping Hou,Lijun Zhang

Main category: cs.CV

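TL;DR: The paper proposes Continuous Subspace Optimization (CoSO) to address catastrophic forgetting in continual learning, improving model performance by dynamically adjusting the optimization subspace.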

Details

Motivation: Continual learning tends to forget prior knowledge when learning multiple tasks sequentially; existing methods adapt pre-trained models through low-rank adaptation, which limits learning capacity. Method: CoSO determines the optimization subspaces dynamically through the singular value decomposition of gradients, and maintains a task-specific component during task learning to capture critical update directions. Result: Experiments on multiple datasets show that CoSO significantly outperforms existing methods, especially in challenging scenarios with long task sequences. Conclusion: Dynamic subspace optimization lets CoSO mitigate catastrophic forgetting and improve continual-learning performance. Abstract: Continual learning aims to learn multiple tasks sequentially while preserving prior knowledge, but faces the challenge of catastrophic forgetting when acquiring new knowledge. Recently, approaches leveraging pre-trained models have gained increasing popularity to mitigate this issue, due to the strong generalization ability of foundation models. To adjust pre-trained models for new tasks, existing methods usually employ low-rank adaptation, which restricts parameter updates to a fixed low-rank subspace. However, constraining the optimization space inherently compromises the model's learning capacity, resulting in inferior performance. To address this limitation, we propose Continuous Subspace Optimization for Continual Learning (CoSO) to fine-tune the model in a series of subspaces rather than a single one. These sequential subspaces are dynamically determined through the singular value decomposition of gradients. CoSO updates the model by projecting gradients into these subspaces, ensuring memory-efficient optimization. To mitigate forgetting, the optimization subspaces of each task are set to be orthogonal to the historical task subspace. During task learning, CoSO maintains a task-specific component that captures the critical update directions associated with the current task. Upon completing a task, this component is used to update the historical task subspace, laying the groundwork for subsequent learning. Extensive experiments on multiple datasets demonstrate that CoSO significantly outperforms state-of-the-art methods, especially in challenging scenarios with long task sequences.
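
The projection step described in this abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a rank-1 simplification in plain Python that uses power iteration in place of a full singular value decomposition, and every name in it (`remove_history`, `top_left_singular_vector`, `project_gradient`, the `history` list) is hypothetical. It assumes the historical task subspace is stored as a list of orthonormal vectors.

```python
import math

def transpose(M):
    return [list(col) for col in zip(*M)]

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def remove_history(G, history):
    """Remove from each column of G its component in span(history).

    Assumption: history is a list of orthonormal vectors whose length
    equals the number of rows of G.
    """
    cleaned_cols = []
    for col in transpose(G):
        c = list(col)
        for h in history:
            coef = sum(hi * ci for hi, ci in zip(h, col))
            c = [ci - coef * hi for ci, hi in zip(c, h)]
        cleaned_cols.append(c)
    return transpose(cleaned_cols)

def top_left_singular_vector(G, iters=100):
    """Power iteration on G G^T (stand-in for an SVD). Returns None if G is ~0."""
    m = len(G)
    Gt = transpose(G)
    u = [1.0 / math.sqrt(m)] * m
    for _ in range(iters):
        w = matvec(G, matvec(Gt, u))
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:
            return None
        u = [x / norm for x in w]
    return u

def project_gradient(G, history):
    """Project gradient G onto its top direction orthogonal to history (rank 1)."""
    G_perp = remove_history(G, history)
    u = top_left_singular_vector(G_perp)
    if u is None:
        return [[0.0] * len(G[0]) for _ in G], None
    coeffs = matvec(transpose(G_perp), u)  # u^T G_perp
    update = [[ui * cj for cj in coeffs] for ui in u]
    return update, u

# Example: the third coordinate direction already belongs to earlier tasks.
G = [[3.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
history = [[0.0, 0.0, 1.0]]
update, u = project_gradient(G, history)
```

In this sketch, the returned direction `u` plays the role of the task-specific component: once a task finishes, appending it to `history` would keep later updates orthogonal to it. A practical version would keep several singular directions per step rather than one.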

[23] Robust Cross-View Geo-Localization via Content-Viewpoint Disentanglement

Ke Li,Di Wang,Xiaowei Wang,Zhihong Wu,Yiming Zhang,Yifeng Wang,Quan Wang

Main category: cs.CV

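TL;DR: The paper proposes CVD, a new cross-view geo-localization framework that explicitly disentangles content and viewpoint factors, addressing the feature inconsistency that viewpoint discrepancies cause in existing methods.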

Details

Motivation: Cross-view geo-localization suffers from appearance changes and spatial distortions caused by viewpoint variation; existing methods assume features can be aligned directly but overlook the inherent conflicts induced by viewpoint discrepancies. Method: Proposes the CVD framework, which models the feature space as a composite manifold jointly governed by content and viewpoint, and introduces two constraints: an intra-view independence constraint and an inter-view reconstruction constraint. Result: On four benchmarks (University-1652, SUES-200, CVUSA, and CVACT), CVD significantly improves localization accuracy and generalization. Conclusion: By disentangling content and viewpoint factors, CVD improves cross-view geo-localization performance and integrates seamlessly into existing pipelines. Abstract: Cross-view geo-localization (CVGL) aims to match images of the same geographic location captured from different perspectives, such as drones and satellites. Despite recent advances, CVGL remains highly challenging due to significant appearance changes and spatial distortions caused by viewpoint variations. Existing methods typically assume that cross-view images can be directly aligned within a shared feature space by maximizing feature similarity through contrastive learning. Nonetheless, this assumption overlooks the inherent conflicts induced by viewpoint discrepancies, resulting in extracted features containing inconsistent information that hinders precise localization. In this study, we take a manifold learning perspective and model the feature space of cross-view images as a composite manifold jointly governed by content and viewpoint information. Building upon this insight, we propose CVD, a new CVGL framework that explicitly disentangles content and viewpoint factors. To promote effective disentanglement, we introduce two constraints: (i) an intra-view independence constraint, which encourages statistical independence between the two factors by minimizing their mutual information; (ii) an inter-view reconstruction constraint, which reconstructs each view by cross-combining content and viewpoint from paired images, ensuring factor-specific semantics are preserved. As a plug-and-play module, CVD can be seamlessly integrated into existing geo-localization pipelines. Extensive experiments on four benchmarks, i.e., University-1652, SUES-200, CVUSA, and CVACT, demonstrate that CVD consistently improves both localization accuracy and generalization across multiple baselines.

[24] Bootstrapping Diffusion: Diffusion Model Training Leveraging Partial and Corrupted Data

Xudong Ma

Main category: cs.CV

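TL;DR: The paper proposes training diffusion models on partial data (e.g., low-resolution images, short videos) through per-view training and residual score function prediction, and proves that the approach achieves near first-order optimal data efficiency.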

Details

Motivation: Large volumes of high-quality data (e.g., high-resolution images, long videos) are hard to acquire, while partial data (e.g., low-resolution images, short videos) is usually regarded as corrupted or incomplete. The study asks whether such partial data can be used to train diffusion models. Method: Proposes per-view training of diffusion models: (1) train a separate diffusion model for each partial data view; (2) train a model that predicts the residual score function. Theoretical analysis shows that the difficulty of training the residual score function scales proportionally with the signal correlations not captured by the partial data views. Result: Proves that, with proper regularization, the approach achieves lower generalization error and near first-order optimal data efficiency. Conclusion: Training diffusion models on partial data is feasible, and the per-view and residual score function approach improves data efficiency. Abstract: Training diffusion models requires large datasets. However, acquiring large volumes of high-quality data can be challenging, for example, collecting large numbers of high-resolution images and long videos. On the other hand, there are many complementary data that are usually considered corrupted or partial, such as low-resolution images and short videos. Other examples of corrupted data include videos that contain subtitles, watermarks, and logos. In this study, we investigate the theoretical problem of whether the above partial data can be utilized to train conventional diffusion models. Motivated by our theoretical analysis in this study, we propose a straightforward approach of training diffusion models utilizing partial data views, where we consider each form of complementary data as a view of conventional data. Our proposed approach first trains one separate diffusion model for each individual view, and then trains a model for predicting the residual score function. We prove generalization error bounds, which show that the proposed diffusion model training approach can achieve lower generalization errors if proper regularizations are adopted in the residual score function training. In particular, we prove that the difficulty in training the residual score function scales proportionally with the signal correlations not captured by partial data views. Consequently, the proposed approach achieves near first-order optimal data efficiency.

[25] CoT-Vid: Dynamic Chain-of-Thought Routing with Self Verification for Training-Free Video Reasoning

Hongbo Jin,Ruyang Liu,Wenhao Zhang,Guibo Luo,Ge Li

Main category: cs.CV

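TL;DR: CoT-Vid is a new training-free paradigm for video reasoning whose multistage complex reasoning design markedly improves performance. It outperforms existing video LLMs and performs strongly on multiple benchmarks.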

Details

Motivation: Research on complex video reasoning lags behind; existing video LLMs rely heavily on perceptual abilities and lack explicit reasoning mechanisms. Method: Proposes the CoT-Vid paradigm, which has three components (dynamic inference path routing, a problem decoupling strategy, and video self-consistency verification), and designs a new standard for categorizing video questions. Result: CoT-Vid improves on its base model by 9.3% on EgoSchema and 5.6% on VideoEspresso, rivalling or surpassing larger proprietary models such as GPT-4V. Conclusion: CoT-Vid offers an efficient, training-free approach to video reasoning with broad application potential. Abstract: System2 reasoning is developing rapidly these days with the emergence of Deep-Thinking Models and chain-of-thought technology, which has become a central discussion point in the AI community. However, there is a relative gap in the research on complex video reasoning at present. In this work, we propose CoT-Vid, a novel training-free paradigm for the video domain with a multistage complex reasoning design. In contrast to existing video LLMs, which rely heavily on perceptual abilities, it achieves a surprising performance gain with an explicit reasoning mechanism. The paradigm consists of three main components: dynamic inference path routing, problem decoupling strategy, and video self-consistency verification. In addition, we propose a new standard for categorization of video questions. CoT-Vid shows outstanding results on a wide range of benchmarks, and outperforms its base model by 9.3% on EgoSchema and 5.6% on VideoEspresso, rivalling or even surpassing larger and proprietary models, such as GPT-4V, GPT-4o and Gemini-1.5-flash. Our codebase will be publicly available soon.

[26] RVTBench: A Benchmark for Visual Reasoning Tasks

Yiqing Shen,Chenjia Li,Chenxiao Fan,Mathias Unberath

Main category: cs.CV

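TL;DR: The paper proposes a unified formulation of reasoning visual tasks (RVTs) that extends traditional video reasoning segmentation and supports multiple output formats, builds the RVTBench benchmark, and proposes RVTagent, a framework that needs no task-specific fine-tuning.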

Details

Motivation: Visual reasoning lacks relevant benchmarks, and existing methods rely on large language models (LLMs), which cannot adequately capture complex spatial-temporal relationships and multi-step reasoning chains. Method: Proposes an automated RVT benchmark construction pipeline based on digital twin (DT) representations, builds the RVTBench benchmark, and designs the RVTagent framework for zero-shot generalization. Result: Builds RVTBench, with 3,896 queries covering four RVT types and three reasoning categories, and validates RVTagent's zero-shot generalization. Conclusion: The RVT formulation and RVTBench provide a more comprehensive benchmark for visual reasoning, and RVTagent shows the potential of zero-shot generalization across tasks. Abstract: Visual reasoning, the capability to interpret visual input in response to implicit text query through multi-step reasoning, remains a challenge for deep learning models due to the lack of relevant benchmarks. Previous work in visual reasoning has primarily focused on reasoning segmentation, where models aim to segment objects based on implicit text queries. This paper introduces reasoning visual tasks (RVTs), a unified formulation that extends beyond traditional video reasoning segmentation to a diverse family of visual language reasoning problems, which can therefore accommodate multiple output formats including bounding boxes, natural language descriptions, and question-answer pairs. Correspondingly, we identify the limitations in current benchmark construction methods that rely solely on large language models (LLMs), which inadequately capture complex spatial-temporal relationships and multi-step reasoning chains in video due to their reliance on token representation, resulting in benchmarks with artificially limited reasoning complexity. To address this limitation, we propose a novel automated RVT benchmark construction pipeline that leverages digital twin (DT) representations as structured intermediaries between perception and the generation of implicit text queries. Based on this method, we construct RVTBench, an RVT benchmark containing 3,896 queries of over 1.2 million tokens across four types of RVT (segmentation, grounding, VQA and summary), three reasoning categories (semantic, spatial, and temporal), and four increasing difficulty levels, derived from 200 video sequences. Finally, we propose RVTagent, an agent framework for RVT that allows for zero-shot generalization across various types of RVT without task-specific fine-tuning.

[27] Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs

Xuannan Liu,Zekun Li,Zheqi He,Peipei Li,Shuhan Xia,Xing Cui,Huaibo Huang,Xi Yang,Ran He

Main category: cs.CV

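TL;DR: Video-SafetyBench is the first comprehensive benchmark for evaluating the safety of Large Vision-Language Models (LVLMs) under video-text attacks. It comprises 2,264 video-text pairs spanning 48 fine-grained unsafe categories.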

Details

Motivation: Existing safety evaluations focus mainly on static image inputs and ignore the distinct safety risks that the temporal dynamics of video may induce. Method: Designs a controllable video generation pipeline that decomposes video semantics into subject images and motion text, and proposes RJScore to evaluate uncertain or borderline harmful outputs. Result: Experiments show that benign-query video composition achieves an average attack success rate of 67.2%, revealing consistent vulnerabilities to video-induced attacks. Conclusion: Video-SafetyBench is expected to catalyze future research on video-based safety evaluation and defense strategies. Abstract: The increasing deployment of Large Vision-Language Models (LVLMs) raises safety concerns under potential malicious inputs. However, existing multimodal safety evaluations primarily focus on model vulnerabilities exposed by static image inputs, ignoring the temporal dynamics of video that may induce distinct safety risks. To bridge this gap, we introduce Video-SafetyBench, the first comprehensive benchmark designed to evaluate the safety of LVLMs under video-text attacks. It comprises 2,264 video-text pairs spanning 48 fine-grained unsafe categories, each pairing a synthesized video with either a harmful query, which contains explicit malice, or a benign query, which appears harmless but triggers harmful behavior when interpreted alongside the video. To generate semantically accurate videos for safety evaluation, we design a controllable pipeline that decomposes video semantics into subject images (what is shown) and motion text (how it moves), which jointly guide the synthesis of query-relevant videos. To effectively evaluate uncertain or borderline harmful outputs, we propose RJScore, a novel LLM-based metric that incorporates the confidence of judge models and human-aligned decision threshold calibration. Extensive experiments show that benign-query video composition achieves an average attack success rate of 67.2%, revealing consistent vulnerabilities to video-induced attacks. We believe Video-SafetyBench will catalyze future research into video-based safety evaluation and defense strategies.

[28] ElderFallGuard: Real-Time IoT and Computer Vision-Based Fall Detection System for Elderly Safety

Tasrifur Riahi,Md. Azizul Hakim Bappy,Md. Mehedi Islam

Main category: cs.CV

TL;DR: ElderFallGuard is a computer vision-based IoT system that detects falls among the elderly in real time and notifies caregivers via Telegram. It reports 100% accuracy.

Details

Motivation: Falls among the elderly can cause serious injury and loss of independence, creating an urgent need for a non-invasive, real-time detection solution. Method: Uses MediaPipe for human pose estimation, trains a Random Forest classifier on a custom dataset, and detects falls with specific logic (a designated pose held for a set duration together with a drop in motion). Result: The system reports 100% accuracy, precision, recall, and F1-score in testing. Conclusion: ElderFallGuard offers an efficient and reliable solution for elderly safety, and its intelligent alerts reduce the burden on caregivers. Abstract: For the elderly population, falls pose a serious and increasing risk of severe injury and loss of independence. To address this problem, we present ElderFallGuard: A Computer Vision Based IoT Solution for Elderly Fall Detection and Notification, a cutting-edge, non-invasive system intended for quick caregiver alerts and real-time fall detection. Our approach leverages the power of computer vision, utilizing MediaPipe for accurate human pose estimation from standard video streams. We developed a custom dataset comprising 7200 samples across 12 distinct human poses to train and evaluate various machine learning classifiers, with Random Forest ultimately selected for its superior performance. ElderFallGuard employs a specific detection logic, identifying a fall when a designated prone pose ("Pose6") is held for over 3 seconds coupled with a significant drop in motion detected for more than 2 seconds. Upon confirmation, the system instantly dispatches an alert, including a snapshot of the event, to a designated Telegram group via a custom bot, incorporating cooldown logic to prevent notification overload. Rigorous testing on our dataset demonstrated exceptional results, achieving 100% accuracy, precision, recall, and F1-score. ElderFallGuard offers a promising, vision-based IoT solution to enhance elderly safety and provide peace of mind for caregivers through intelligent, timely alerts.
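
The detection rule quoted in this abstract (the "Pose6" label held for over 3 seconds, low motion for more than 2 seconds, plus a cooldown) can be sketched as a small state machine. This is only an illustration under stated assumptions, not the authors' code: the class name `FallDetector`, the `motion_threshold` value, the 60-second cooldown, and the reading of "significant drop in motion" as "motion score below a threshold" are all hypothetical. Pose labels and motion scores are assumed to come from an upstream classifier.

```python
class FallDetector:
    """Sketch of the fall rule: prone pose held > 3 s and low motion > 2 s."""

    def __init__(self, pose_hold_s=3.0, low_motion_s=2.0,
                 motion_threshold=0.05, cooldown_s=60.0):
        self.pose_hold_s = pose_hold_s            # from the abstract
        self.low_motion_s = low_motion_s          # from the abstract
        self.motion_threshold = motion_threshold  # assumed value
        self.cooldown_s = cooldown_s              # assumed value
        self._pose_since = None    # time the prone pose was first seen
        self._still_since = None   # time low motion was first seen
        self._last_alert = None    # time of the most recent alert

    def update(self, t, pose_label, motion):
        """Feed one frame; return True when an alert should be sent."""
        if pose_label == "Pose6":
            if self._pose_since is None:
                self._pose_since = t
        else:
            self._pose_since = None

        if motion < self.motion_threshold:
            if self._still_since is None:
                self._still_since = t
        else:
            self._still_since = None

        if self._pose_since is None or self._still_since is None:
            return False

        fall = (t - self._pose_since > self.pose_hold_s
                and t - self._still_since > self.low_motion_s)
        in_cooldown = (self._last_alert is not None
                       and t - self._last_alert < self.cooldown_s)
        if fall and not in_cooldown:
            self._last_alert = t
            return True
        return False

# Example: a prone, motionless subject sampled every 0.5 s for 5 s.
detector = FallDetector()
alerts = [detector.update(0.5 * i, "Pose6", 0.0) for i in range(11)]
```

With these inputs a single alert fires on the first frame after the 3-second mark, and the cooldown suppresses repeats. In a deployment, the `True` return would trigger the notification step.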

[29] MedSG-Bench: A Benchmark for Medical Image Sequences Grounding

Jingkun Yue,Siqi Zhang,Zinan Jia,Huihuan Xu,Zongbo Han,Xiaohong Liu,Guangyu Wang

Main category: cs.CV

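TL;DR: Proposes MedSG-Bench, the first visual grounding benchmark for medical image sequences, which covers two task paradigms, and builds MedSG-188K, a large-scale instruction-tuning dataset, together with the MedSeq-Grounder model.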

Details

Motivation: Existing medical visual grounding benchmarks mostly focus on single-image scenarios, whereas clinical practice requires fine-grained semantic alignment and context-aware reasoning across modalities and over time. Method: Designs MedSG-Bench with eight VQA-style tasks, covering 76 public datasets and 10 medical imaging modalities, and develops the MedSeq-Grounder model. Result: Existing MLLMs show limited performance on medical sequential grounding tasks; MedSeq-Grounder supports future research. Conclusion: MedSG-Bench and MedSeq-Grounder fill a gap in visual grounding for medical image sequences and advance the field. Abstract: Visual grounding is essential for precise perception and reasoning in multimodal large language models (MLLMs), especially in medical imaging domains. While existing medical visual grounding benchmarks primarily focus on single-image scenarios, real-world clinical applications often involve sequential images, where accurate lesion localization across different modalities and temporal tracking of disease progression (e.g., pre- vs. post-treatment comparison) require fine-grained cross-image semantic alignment and context-aware reasoning. To remedy the underrepresentation of image sequences in existing medical visual grounding benchmarks, we propose MedSG-Bench, the first benchmark tailored for Medical Image Sequences Grounding. It comprises eight VQA-style tasks, formulated into two paradigms of the grounding tasks, including 1) Image Difference Grounding, which focuses on detecting change regions across images, and 2) Image Consistency Grounding, which emphasizes detection of consistent or shared semantics across sequential images. MedSG-Bench covers 76 public datasets, 10 medical imaging modalities, and a wide spectrum of anatomical structures and diseases, totaling 9,630 question-answer pairs. We benchmark both general-purpose MLLMs (e.g., Qwen2.5-VL) and medical-domain specialized MLLMs (e.g., HuatuoGPT-vision), observing that even the advanced models exhibit substantial limitations in medical sequential grounding tasks. To advance this field, we construct MedSG-188K, a large-scale instruction-tuning dataset tailored for sequential visual grounding, and further develop MedSeq-Grounder, an MLLM designed to facilitate future research on fine-grained understanding across medical sequential images. The benchmark, dataset, and model are available at https://huggingface.co/MedSG-Bench

[30] MonoMobility: Zero-Shot 3D Mobility Analysis from Monocular Videos

Hongyi Zhou,Xiaogang Wang,Yulan Guo,Kai Xu

Main category: cs.CV

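TL;DR: Proposes a new framework for zero-shot 3D mobility analysis from monocular videos that needs no annotated data and precisely parses motion parts and their attributes.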

Details

Motivation: Existing methods rely on dense multi-view images or detailed part-level annotations, which limits the flexibility and practicality of dynamic environment analysis. Method: Combines depth estimation, optical flow analysis, and point cloud registration to construct the scene geometry and make an initial analysis of motion parts and their attributes; uses 2D Gaussian splatting for scene representation; and proposes an end-to-end dynamic scene optimization algorithm to refine the results. Result: Experiments validate that the framework analyzes articulated object motion without annotations, showing its potential for embodied intelligence. Conclusion: The framework is flexible and versatile, suits complex motion analysis, and provides a useful tool for future embodied intelligence applications. Abstract: Accurately analyzing the motion parts and their motion attributes in dynamic environments is crucial for advancing key areas such as embodied intelligence. Addressing the limitations of existing methods that rely on dense multi-view images or detailed part-level annotations, we propose an innovative framework that can analyze 3D mobility from monocular videos in a zero-shot manner. This framework can precisely parse motion parts and motion attributes using only a monocular video, completely eliminating the need for annotated training data. Specifically, our method first constructs the scene geometry and roughly analyzes the motion parts and their initial motion attributes by combining depth estimation, optical flow analysis and point cloud registration, then employs 2D Gaussian splatting for scene representation. Building on this, we introduce an end-to-end dynamic scene optimization algorithm specifically designed for articulated objects, refining the initial analysis results to ensure the system can handle 'rotation', 'translation', and even complex movements ('rotation+translation'), demonstrating high flexibility and versatility. To validate the robustness and wide applicability of our method, we created a comprehensive dataset comprising both simulated and real-world scenarios. Experimental results show that our framework can effectively analyze articulated object motions in an annotation-free manner, showcasing its significant potential in future embodied intelligence applications.

[31] PRS-Med: Position Reasoning Segmentation with Vision-Language Model in Medical Imaging

Quoc-Huy Trinh,Minh-Van Nguyen,Jung Peng,Ulas Bagci,Debesh Jha

Main category: cs.CV

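TL;DR: PRS-Med is a framework that combines vision-language models with segmentation capabilities to generate accurate segmentation masks and spatial reasoning outputs, significantly outperforming existing methods across multiple medical imaging modalities.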

Details

Motivation: Existing methods struggle when doctors need to interact through natural language or when position reasoning is required, and position reasoning data is lacking. Method: PRS-Med integrates vision-language models with segmentation capabilities and introduces the MMRS dataset to supply spatial reasoning data. Result: PRS-Med performs strongly across six imaging modalities and significantly outperforms existing methods in both segmentation accuracy and position reasoning. Conclusion: PRS-Med improves diagnostic efficiency through natural language interaction; its dataset and model will be released to advance research on spatially-aware multimodal reasoning for medical applications. Abstract: Recent advancements in prompt-based medical image segmentation have enabled clinicians to identify tumors using simple input like bounding boxes or text prompts. However, existing methods face challenges when doctors need to interact through natural language or when position reasoning is required, that is, understanding spatial relationships between anatomical structures and pathologies. We present PRS-Med, a framework that integrates vision-language models with segmentation capabilities to generate both accurate segmentation masks and corresponding spatial reasoning outputs. Additionally, we introduce the MMRS dataset (Multimodal Medical in Positional Reasoning Segmentation), which provides diverse, spatially-grounded question-answer pairs to address the lack of position reasoning data in medical imaging. PRS-Med demonstrates superior performance across six imaging modalities (CT, MRI, X-ray, ultrasound, endoscopy, RGB), significantly outperforming state-of-the-art methods in both segmentation accuracy and position reasoning. Our approach enables intuitive doctor-system interaction through natural language, facilitating more efficient diagnoses. Our dataset pipeline, model, and codebase will be released to foster further research in spatially-aware multimodal reasoning for medical applications.

[32] Revisiting Residual Connections: Orthogonal Updates for Stable and Efficient Deep Networks

Giyeong Oh,Woohyun Cho,Siyeol Kim,Suhwan Choi,Younjae Yu

Main category: cs.CV

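TL;DR: Proposes an orthogonal residual update that decomposes the module output and adds only the component orthogonal to the input stream, to improve feature learning and training efficiency.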

Details

Motivation: Standard residual updates may merely reinforce or modulate the existing stream direction, underusing the module's capacity to learn new features. Method: Introduces the Orthogonal Residual Update, which decomposes the module output and adds only the orthogonal component. Result: Improves generalization accuracy and training stability across architectures and datasets, e.g., a +4.3 percentage-point top-1 accuracy gain for ViT-B on ImageNet-1k. Conclusion: The orthogonal residual update strategy promotes the learning of new features and improves model performance. Abstract: Residual connections are pivotal for deep neural networks, enabling greater depth by mitigating vanishing gradients. However, in standard residual updates, the module's output is directly added to the input stream. This can lead to updates that predominantly reinforce or modulate the existing stream direction, potentially underutilizing the module's capacity for learning entirely novel features. In this work, we introduce Orthogonal Residual Update: we decompose the module's output relative to the input stream and add only the component orthogonal to this stream. This design aims to guide modules to contribute primarily new representational directions, fostering richer feature learning while promoting more efficient training. We demonstrate that our orthogonal update strategy improves generalization accuracy and training stability across diverse architectures (ResNetV2, Vision Transformers) and datasets (CIFARs, TinyImageNet, ImageNet-1k), achieving, for instance, a +4.3 percentage-point top-1 accuracy gain for ViT-B on ImageNet-1k.
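
The decomposition described in this abstract can be illustrated with a minimal sketch for a single feature vector. It assumes the standard vector projection formula; the function name `orthogonal_residual_update` and the stabilizer `eps` are hypothetical choices, not details taken from the paper.

```python
def orthogonal_residual_update(x, f_x, eps=1e-6):
    """Return x + f_perp, where f_perp is the part of f_x orthogonal to x.

    f_perp = f_x - (<f_x, x> / (<x, x> + eps)) * x
    eps is an assumed stabilizer that avoids dividing by zero when x is ~0.
    """
    dot_fx = sum(f * xi for f, xi in zip(f_x, x))
    dot_xx = sum(xi * xi for xi in x)
    scale = dot_fx / (dot_xx + eps)
    f_perp = [f - scale * xi for f, xi in zip(f_x, x)]
    return [xi + fp for xi, fp in zip(x, f_perp)]

# Example: the module output has a component along x and one orthogonal to it.
x = [1.0, 0.0]
f_x = [2.0, 3.0]
out = orthogonal_residual_update(x, f_x)
```

In this example a standard residual update would give `[3.0, 3.0]`, whereas the orthogonal variant drops the part of `f_x` parallel to `x` and keeps only the new direction, leaving the first coordinate approximately unchanged.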

[33] GenZSL: Generative Zero-Shot Learning Via Inductive Variational Autoencoder

Shiming Chen,Dingjie Fu,Salman Khan,Fahad Shahbaz Khan

Main category: cs.CV

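TL;DR: GenZSL is a variational autoencoder-based generative zero-shot learning method that induces samples of new classes from similar seen classes and uses weak class semantic vectors to improve generation.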

Details

Motivation: Existing generative ZSL methods depend on strong class semantic vectors annotated by experts, and their generative performance and scene generalization are limited. Method: GenZSL adopts class diversity promotion and target class-guided information boosting strategies to improve the diversity and informativeness of generated samples. Result: On three benchmark datasets, GenZSL significantly outperforms f-VAEGAN, e.g., a 24.7% performance gain and more than 60× faster training on AWA2. Conclusion: With weak semantic vectors and these optimization strategies, GenZSL markedly improves the effectiveness and efficiency of generative ZSL. Abstract: Remarkable progress in zero-shot learning (ZSL) has been achieved using generative models. However, existing generative ZSL methods merely generate (imagine) the visual features from scratch guided by the strong class semantic vectors annotated by experts, resulting in suboptimal generative performance and limited scene generalization. To address these and advance ZSL, we propose an inductive variational autoencoder for generative zero-shot learning, dubbed GenZSL. Mimicking human-level concept learning, GenZSL operates by inducting new class samples from similar seen classes using weak class semantic vectors derived from target class names (i.e., CLIP text embedding). To ensure the generation of informative samples for training an effective ZSL classifier, our GenZSL incorporates two key strategies. Firstly, it employs class diversity promotion to enhance the diversity of class semantic vectors. Secondly, it utilizes target class-guided information boosting criteria to optimize the model. Extensive experiments conducted on three popular benchmark datasets showcase the superiority and potential of our GenZSL with significant efficacy and efficiency over f-VAEGAN, e.g., 24.7% performance gains and more than 60× faster training speed on AWA2. Codes are available at https://github.com/shiming-chen/GenZSL.

[34] Facial Recognition Leveraging Generative Adversarial Networks

Zhongwen Li,Zongwei Li,Xiaoqi Li

Main category: cs.CV

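TL;DR: Proposes a GAN-based data augmentation method that significantly improves face recognition performance through improved generator and discriminator designs.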

Details

Motivation: Address the difficulty of acquiring large-scale training data for deep learning-based face recognition. Method: Uses a residual-embedded generator and an Inception ResNet-V1 discriminator, and designs an end-to-end framework that jointly optimizes data generation and recognition performance. Result: On the LFW benchmark, recognition accuracy improves by 12.7%, with good generalization from limited training samples. Conclusion: The method addresses data scarcity and significantly improves face recognition performance. Abstract: Face recognition performance based on deep learning heavily relies on large-scale training data, which is often difficult to acquire in practical applications. To address this challenge, this paper proposes a GAN-based data augmentation method with three key contributions: (1) a residual-embedded generator to alleviate gradient vanishing/exploding problems, (2) an Inception ResNet-V1 based FaceNet discriminator for improved adversarial training, and (3) an end-to-end framework that jointly optimizes data generation and recognition performance. Experimental results demonstrate that our approach achieves stable training dynamics and significantly improves face recognition accuracy by 12.7% on the LFW benchmark compared to baseline methods, while maintaining good generalization capability with limited training samples.

Chih-Ting Liao,Bin Ren,Guofeng Mei,Xu Zheng

Main category: cs.CV

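TL;DR: Presents the first comprehensive study of the vulnerability of unified multi-modal encoders to adversarial perturbations and proposes an efficient adversarial calibration framework that substantially improves robustness.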

Details

Motivation: Unified multi-modal encoders perform well on multi-modal tasks, but their robustness under adversarial perturbations is underexplored, which matters for safety-sensitive applications. Method: Proposes an efficient adversarial calibration framework that improves robustness through modality-specific projection heads trained only on adversarial examples, while keeping the pretrained encoders and semantic centers unchanged. Result: Across six modalities and three Bind-style models, adversarial robustness improves by up to 47.3% without degrading clean-data performance. Conclusion: The method significantly improves the adversarial robustness of multi-modal encoders without modifying the pretrained model, making it compatible with existing foundation models. Abstract: Recent unified multi-modal encoders align a wide range of modalities into a shared representation space, enabling diverse cross-modal tasks. Despite their impressive capabilities, the robustness of these models under adversarial perturbations remains underexplored, which is a critical concern for safety-sensitive applications. In this work, we present the first comprehensive study of adversarial vulnerability in unified multi-modal encoders. We find that even mild adversarial perturbations lead to substantial performance drops across all modalities. Non-visual inputs, such as audio and point clouds, are especially fragile, while visual inputs like images and videos also degrade significantly. To address this, we propose an efficient adversarial calibration framework that improves robustness across modalities without modifying pretrained encoders or semantic centers, ensuring compatibility with existing foundation models. Our method introduces modality-specific projection heads trained solely on adversarial examples, while keeping the backbone and embeddings frozen. We explore three training objectives: fixed-center cross-entropy, clean-to-adversarial L2 alignment, and clean-adversarial InfoNCE, and we introduce a regularization strategy to ensure modality-consistent alignment under attack. Experiments on six modalities and three Bind-style models show that our method improves adversarial robustness by up to 47.3 percent at epsilon = 4/255, while preserving or even improving clean zero-shot and retrieval performance with less than 1 percent trainable parameters.
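As a rough illustration of the clean-adversarial InfoNCE objective named in the abstract, the sketch below treats each adversarial embedding's own clean embedding as the positive and the other clean embeddings in the batch as negatives. The function names, the use of cosine similarity, and the temperature value are assumptions for illustration, not details taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(adv_embeddings, clean_embeddings, temperature=0.07):
    """Average InfoNCE loss; row i of each list is assumed to be the same sample."""
    total = 0.0
    for i, adv in enumerate(adv_embeddings):
        logits = [cosine(adv, clean) / temperature for clean in clean_embeddings]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(adv_embeddings)
```

The loss is small when each adversarial embedding stays closest to its own clean counterpart and grows when it drifts toward another sample's embedding.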

[36] FiGKD: Fine-Grained Knowledge Distillation via High-Frequency Detail Transfer

Seonghak Kim

Main category: cs.CV

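TL;DR: Proposes FiGKD, a frequency-aware knowledge distillation method that decomposes the teacher's outputs with the discrete wavelet transform and selectively transfers high-frequency detail to improve fine-grained visual recognition.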

Details

Motivation: Conventional knowledge distillation underperforms on fine-grained visual tasks because it treats the teacher's output as a single signal and ignores differences in detail information. Method: Uses the discrete wavelet transform (DWT) to decompose the teacher's output into low-frequency (content) and high-frequency (detail) components and transfers only the high-frequency components. Result: On CIFAR-100, TinyImageNet, and multiple fine-grained benchmarks, FiGKD outperforms existing logit-based and feature-based distillation methods. Conclusion: Frequency-aware decomposition transfers knowledge more efficiently, particularly in resource-constrained settings. Abstract: Knowledge distillation (KD) is a widely adopted technique for transferring knowledge from a high-capacity teacher model to a smaller student model by aligning their output distributions. However, existing methods often underperform in fine-grained visual recognition tasks, where distinguishing subtle differences between visually similar classes is essential. This performance gap stems from the fact that conventional approaches treat the teacher's output logits as a single, undifferentiated signal, assuming all contained information is equally beneficial to the student. Consequently, student models may become overloaded with redundant signals and fail to capture the teacher's nuanced decision boundaries. To address this issue, we propose Fine-Grained Knowledge Distillation (FiGKD), a novel frequency-aware framework that decomposes a model's logits into low-frequency (content) and high-frequency (detail) components using the discrete wavelet transform (DWT). FiGKD selectively transfers only the high-frequency components, which encode the teacher's semantic decision patterns, while discarding redundant low-frequency content already conveyed through ground-truth supervision. Our approach is simple, architecture-agnostic, and requires no access to intermediate feature maps. Extensive experiments on CIFAR-100, TinyImageNet, and multiple fine-grained recognition benchmarks show that FiGKD consistently outperforms state-of-the-art logit-based and feature-based distillation methods across a variety of teacher-student configurations. These findings confirm that frequency-aware logit decomposition enables more efficient and effective knowledge transfer, particularly in resource-constrained settings.
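The frequency split described in the abstract can be illustrated with a minimal sketch. It assumes a single-level Haar wavelet, an even number of logits, and a mean-squared-error penalty on the detail coefficients. The abstract does not specify the wavelet or the loss, so `haar_dwt` and `detail_distillation_loss` are hypothetical names and choices.

```python
import math

def haar_dwt(logits):
    """Single-level Haar DWT: returns (low-frequency, high-frequency) coefficients."""
    if len(logits) % 2 != 0:
        raise ValueError("this sketch assumes an even number of logits")
    s = math.sqrt(2.0)
    low = [(logits[i] + logits[i + 1]) / s for i in range(0, len(logits), 2)]
    high = [(logits[i] - logits[i + 1]) / s for i in range(0, len(logits), 2)]
    return low, high

def detail_distillation_loss(teacher_logits, student_logits):
    """Mean squared error between high-frequency (detail) coefficients only."""
    _, teacher_high = haar_dwt(teacher_logits)
    _, student_high = haar_dwt(student_logits)
    return sum((t - s) ** 2 for t, s in zip(teacher_high, student_high)) / len(teacher_high)
```

Because only the detail coefficients are compared, shifting every student logit by the same constant leaves this loss unchanged. That matches the idea of discarding low-frequency content.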

[37] GTR: Gaussian Splatting Tracking and Reconstruction of Unknown Objects Based on Appearance and Geometric Complexity

Takuya Ikeda,Sergey Zakharov,Muhammad Zubair Irshad,Istvan Balazs Opra,Shun Iwase,Dian Chen,Mark Tjersland,Robert Lee,Alexandre Dilly,Rares Ambrus,Koichi Nishiwaki

Main category: cs.CV

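TL;DR: Proposes a new method for 6-DoF object tracking and high-fidelity 3D reconstruction from monocular RGBD video that handles challenging objects (e.g., symmetric, geometrically intricate, or visually complex).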

Details

Motivation: Existing methods struggle with complex objects, especially those with symmetry, intricate geometry, or complex appearance. Method: An adaptive approach combining 3D Gaussian Splatting, hybrid geometry/appearance tracking, and key frame selection. Result: Achieves robust tracking and accurate reconstruction, setting a new standard for single-sensor 3D reconstruction in open-world environments. Conclusion: The method performs well on complex objects and provides a high-quality benchmark and reconstruction results for the field. Abstract: We present a novel method for 6-DoF object tracking and high-quality 3D reconstruction from monocular RGBD video. Existing methods, while achieving impressive results, often struggle with complex objects, particularly those exhibiting symmetry, intricate geometry or complex appearance. To bridge these gaps, we introduce an adaptive method that combines 3D Gaussian Splatting, hybrid geometry/appearance tracking, and key frame selection to achieve robust tracking and accurate reconstructions across a diverse range of objects. Additionally, we present a benchmark covering these challenging object classes, providing high-quality annotations for evaluating both tracking and reconstruction performance. Our approach demonstrates strong capabilities in recovering high-fidelity object meshes, setting a new standard for single-sensor 3D reconstruction in open-world environments.

[38] Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning?

Zihao Dongfang,Xu Zheng,Ziqiao Weng,Yuanhuiyi Lyu,Danda Pani Paudel,Luc Van Gool,Kailun Yang,Xuming Hu

Main category: cs.CV

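TL;DR: Introduces OSR-Bench, the first benchmark designed for omnidirectional spatial reasoning, and evaluates multimodal large language models on panoramic images, finding that current models perform poorly on this task.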

Details

Motivation: 360-degree cameras are widely used in embodied AI and virtual reality, but omnidirectional perception in multimodal large language models is little studied; the authors ask whether these models can reason spatially over panoramas. Method: Proposes the OSR-Bench benchmark with over 153,000 diverse question-answer pairs covering object counting, relative distance, and direction, together with a two-stage evaluation framework. Result: Eight state-of-the-art models (including GPT-4o and Gemini 1.5 Pro) are evaluated and found to struggle with panoramic spatial reasoning. Conclusion: Current models fall short on omnidirectional spatial reasoning and need further improvement; OSR-Bench will support future research. Abstract: The 180x360 omnidirectional field of view captured by 360-degree cameras enables their use in a wide range of applications such as embodied AI and virtual reality. Although recent advances in multimodal large language models (MLLMs) have shown promise in visual-spatial reasoning, most studies focus on standard pinhole-view images, leaving omnidirectional perception largely unexplored. In this paper, we ask: Are MLLMs ready for omnidirectional spatial reasoning? To investigate this, we introduce OSR-Bench, the first benchmark specifically designed for this setting. OSR-Bench includes over 153,000 diverse question-answer pairs grounded in high-fidelity panoramic indoor scene maps. It covers key reasoning types including object counting, relative distance, and direction. We also propose a negative sampling strategy that inserts non-existent objects into prompts to evaluate hallucination and grounding robustness. For fine-grained analysis, we design a two-stage evaluation framework assessing both cognitive map generation and QA accuracy using rotation-invariant matching and a combination of rule-based and LLM-based metrics. We evaluate eight state-of-the-art MLLMs, including GPT-4o, Gemini 1.5 Pro, and leading open-source models under zero-shot settings. Results show that current models struggle with spatial reasoning in panoramic contexts, highlighting the need for more perceptually grounded MLLMs. OSR-Bench and code will be released at: https://huggingface.co/datasets/UUUserna/OSR-Bench

[39] DC-Seg: Disentangled Contrastive Learning for Brain Tumor Segmentation with Missing Modalities

Haitao Li,Ziyu Li,Yiheng Mao,Zhengyao Ding,Zhengxing Huang

Main category: cs.CV

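TL;DR: DC-Seg is a new method that improves the robustness of multimodal brain image segmentation by disentangling modality-invariant and modality-specific representations.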

Details

Motivation: Some image modalities may be missing in clinical data, and existing methods do not fully exploit the distinct information in each modality. Method: Uses anatomical contrastive learning and modality contrastive learning to disentangle images into modality-invariant and modality-specific representations, and adds a segmentation-based regularizer. Result: Outperforms existing methods on BraTS 2020 and a WMH dataset, especially when modalities are missing. Conclusion: DC-Seg is robust and generalizes well in multimodal brain image segmentation. Abstract: Accurate segmentation of brain images typically requires the integration of complementary information from multiple image modalities. However, clinical data for all modalities may not be available for every patient, creating a significant challenge. To address this, previous studies encode multiple modalities into a shared latent space. While somewhat effective, this approach remains suboptimal, as each modality contains distinct and valuable information. In this study, we propose DC-Seg (Disentangled Contrastive Learning for Segmentation), a new method that explicitly disentangles images into modality-invariant anatomical representation and modality-specific representation, by using anatomical contrastive learning and modality contrastive learning respectively. This solution improves the separation of anatomical and modality-specific features by considering the modality gaps, leading to more robust representations. Furthermore, we introduce a segmentation-based regularizer that enhances the model's robustness to missing modalities. Extensive experiments on the BraTS 2020 dataset and a private white matter hyperintensity (WMH) segmentation dataset demonstrate that DC-Seg outperforms state-of-the-art methods in handling incomplete multimodal brain tumor segmentation tasks with varying missing modalities, while also demonstrating strong generalizability in WMH segmentation. The code is available at https://github.com/CuCl-2/DC-Seg.

[40] SafeVid: Toward Safety Aligned Video Large Multimodal Models

Yixu Wang,Jiaxin Song,Yifeng Gao,Xin Wang,Yang Yao,Yan Teng,Xingjun Ma,Yingchun Wang,Yu-Gang Jiang

Main category: cs.CV

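TL;DR: The SafeVid framework uses textual video descriptions as a bridge to transfer textual safety alignment to the video domain, significantly improving the safety of Video Large Multimodal Models (VLMMs).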

Details

Motivation: The complexity of Video Large Multimodal Models (VLMMs) causes static safety alignment to fail in dynamic video contexts, so video-specific safety alignment is needed. Method: SafeVid generates a 350,000-pair video safety preference dataset (SafeVid-350K), aligns VLMMs with Direct Preference Optimization (DPO), and evaluates them comprehensively with SafeVidBench. Result: SafeVid significantly improves VLMM safety; for example, LLaVA-NeXT-Video improves by up to 42.39% on SafeVidBench. Conclusion: By using textual descriptions as a bridge for safety reasoning, SafeVid provides an effective safety alignment method and resources for VLMMs, and its dataset is publicly available. Abstract: As Video Large Multimodal Models (VLMMs) rapidly advance, their inherent complexity introduces significant safety challenges, particularly the issue of mismatched generalization where static safety alignments fail to transfer to dynamic video contexts. We introduce SafeVid, a framework designed to instill video-specific safety principles in VLMMs. SafeVid uniquely transfers robust textual safety alignment capabilities to the video domain by employing detailed textual video descriptions as an interpretive bridge, facilitating LLM-based rule-driven safety reasoning. This is achieved through a closed-loop system comprising: 1) generation of SafeVid-350K, a novel 350,000-pair video-specific safety preference dataset; 2) targeted alignment of VLMMs using Direct Preference Optimization (DPO); and 3) comprehensive evaluation via our new SafeVidBench benchmark. Alignment with SafeVid-350K significantly enhances VLMM safety, with models like LLaVA-NeXT-Video demonstrating substantial improvements (e.g., up to 42.39%) on SafeVidBench. SafeVid provides critical resources and a structured approach, demonstrating that leveraging textual descriptions as a conduit for safety reasoning markedly improves the safety alignment of VLMMs. We have made the SafeVid-350K dataset (https://huggingface.co/datasets/yxwang/SafeVid-350K) publicly available.
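For readers unfamiliar with Direct Preference Optimization, the sketch below computes the standard per-pair DPO loss from sequence log-probabilities under the policy being trained and a frozen reference model. How SafeVid obtains those log-probabilities from video inputs is not covered here; the argument names and the `beta` value are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin))."""
    # Margin: how much more the policy prefers the chosen response over the
    # rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy and reference agree, the loss equals log 2; it falls as the policy assigns relatively more probability to the preferred (safer) response.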

[41] iSegMan: Interactive Segment-and-Manipulate 3D Gaussians

Yian Zhao,Wanshi Xu,Ruochong Zheng,Pengchong Qiao,Chang Liu,Jie Chen

Main category: cs.CV

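TL;DR: iSegMan is an interactive segmentation and manipulation framework that enables efficient 3D scene manipulation from 2D user interactions without scene-specific training.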

Details

Motivation: Existing 3D scene manipulation methods struggle to control the manipulation region and to provide interactive feedback, and existing segmentation frameworks require scene-specific training, which limits efficiency and flexibility. Method: Proposes Epipolar-guided Interaction Propagation (EIP) and Visibility-based Gaussian Voting (VGV), combining 2D user interactions with 3D Gaussians for efficient region control. Result: iSegMan shows significant advantages on 3D scene manipulation and segmentation tasks, improving controllability, flexibility, and practicality. Conclusion: Through interactive segmentation and efficient region control, iSegMan offers a more flexible and practical solution for 3D scene manipulation. Abstract: The efficient rendering and explicit nature of 3DGS promote the advancement of 3D scene manipulation. However, existing methods typically encounter challenges in controlling the manipulation region and are unable to furnish the user with interactive feedback, which inevitably leads to unexpected results. Intuitively, incorporating interactive 3D segmentation tools can compensate for this deficiency. Nevertheless, existing segmentation frameworks impose a pre-processing step of scene-specific parameter training, which limits the efficiency and flexibility of scene manipulation. To deliver a 3D region control module that is well-suited for scene manipulation with reliable efficiency, we propose interactive Segment-and-Manipulate 3D Gaussians (iSegMan), an interactive segmentation and manipulation framework that only requires simple 2D user interactions in any view. To propagate user interactions to other views, we propose Epipolar-guided Interaction Propagation (EIP), which innovatively exploits epipolar constraint for efficient and robust interaction matching. To avoid scene-specific training to maintain efficiency, we further propose the novel Visibility-based Gaussian Voting (VGV), which obtains 2D segmentations from SAM and models the region extraction as a voting game between 2D Pixels and 3D Gaussians based on Gaussian visibility. Taking advantage of the efficient and precise region control of EIP and VGV, we put forth a Manipulation Toolbox to implement various functions on selected regions, enhancing the controllability, flexibility and practicality of scene manipulation. Extensive results on 3D scene manipulation and segmentation tasks fully demonstrate the significant advantages of iSegMan. Project page is available at https://zhao-yian.github.io/iSegMan.

[42] Top-Down Compression: Revisit Efficient Vision Token Projection for Visual Instruction Tuning

Bonan Li,Zicheng Zhang,Songhua Liu,Weihao Yu,Xinchao Wang

Main category: cs.CV

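TL;DR: LLaVA-Meteor uses a Top-Down Compression paradigm and a Flash Global Fusion module to compress visual tokens efficiently while preserving performance, with strong results across 12 benchmarks.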

Details

Motivation: Visual instruction tuning faces a trade-off between accuracy and efficiency. Method: Adopts a Top-Down Compression paradigm, a Flash Global Fusion module, and a Visual-Native Selection mechanism to compress visual tokens efficiently while capturing local dependencies. Result: Visual tokens are reduced by 75-95% with comparable or better performance across 12 benchmarks. Conclusion: LLaVA-Meteor balances efficiency and performance in visual instruction tuning. Abstract: Visual instruction tuning aims to enable large language models to comprehend the visual world, with a pivotal challenge lying in establishing an effective vision-to-language projection. However, existing methods often grapple with the intractable trade-off between accuracy and efficiency. In this paper, we present LLaVA-Meteor, a novel approach designed to break this deadlock, equipped with a novel Top-Down Compression paradigm that strategically compresses visual tokens without compromising core information. Specifically, we construct a trainable Flash Global Fusion module based on efficient selective state space operators, which aligns the feature space while enabling each token to perceive holistic visual context and instruction preference at low cost. Furthermore, a local-to-single scanning manner is employed to effectively capture local dependencies, thereby enhancing the model's capability in vision modeling. To alleviate computational overhead, we explore a Visual-Native Selection mechanism that independently assesses token significance by both the visual and native experts, followed by aggregation to retain the most critical subset. Extensive experiments show that our approach reduces visual tokens by 75-95% while achieving comparable or superior performance across 12 benchmarks, significantly improving efficiency.
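The selection step described in the abstract (two experts score each token independently, the scores are aggregated, and only the most critical subset is kept) might look roughly like the sketch below. Averaging as the aggregation rule, the `keep_ratio` default, and the function name are assumptions; the abstract does not give these details.

```python
def select_tokens(visual_scores, native_scores, keep_ratio=0.25):
    """Return the indices of the tokens to keep, in their original order."""
    if len(visual_scores) != len(native_scores):
        raise ValueError("both experts must score every token")
    # Aggregate the two independent significance scores (assumed: simple mean).
    combined = [(v + n) / 2.0 for v, n in zip(visual_scores, native_scores)]
    k = max(1, round(len(combined) * keep_ratio))
    ranked = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    # Keep the top-k tokens but preserve their spatial/sequence order.
    return sorted(ranked[:k])
```

With `keep_ratio=0.25` this corresponds to a 75% token reduction, the lower end of the range reported in the abstract.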

[43] Advanced Integration of Discrete Line Segments in Digitized P&ID for Continuous Instrument Connectivity

Soumya Swarup Prusty,Astha Agarwal,Srinivasan Iyenger

Main category: cs.CV

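TL;DR: Proposes an automated method that detects and merges line segments in P&IDs with a computer vision model in order to digitize P&ID information, reducing manual errors and time.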

Details

Motivation: Manually extracting information from P&ID sheets is slow, error-prone, and dependent on expert knowledge, so an automated solution is needed. Method: Detects line segments with a computer vision model, merges them, and builds the connections between equipment and lines to produce a digitized P&ID. Result: The P&ID is digitized and its information can be stored in a knowledge graph, supporting tasks such as finding optimal routes and detecting system cycles. Conclusion: The method markedly improves the efficiency and accuracy of P&ID information extraction and lays the groundwork for more advanced analysis. Abstract: Piping and Instrumentation Diagrams (P&IDs) constitute the foundational blueprint of a plant, depicting the interconnections among process equipment, instrumentation for process control, and the flow of fluids and control signals. In the existing setup, manually mapping information from P&ID sheets poses a significant challenge. It is a time-consuming process, taking around 3-6 months, and is susceptible to errors. It also depends on the expertise of domain experts and often requires multiple rounds of review. The digitization of P&IDs entails merging detected line segments, which is essential for linking the various detected instruments and thereby creating a comprehensive digitized P&ID. This paper explains how line segments detected by a computer vision model are merged and how connections between equipment and the merged lines are then built. The result is a digitized representation of the interconnections among process equipment, instrumentation, fluid flow, and control signals. This information can be stored in a knowledge graph and, with the help of advanced algorithms, leveraged for tasks such as finding optimal routes, detecting system cycles, computing transitive closures, and more.
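To make the segment-merging step concrete, here is a minimal sketch for the simplest case: horizontal segments that have already been snapped to the same integer row. The input format `(x_start, x_end, y)`, the `gap_tol` threshold, and the function name are assumptions for illustration; the paper's actual merging rules are not given in the abstract.

```python
from collections import defaultdict

def merge_horizontal_segments(segments, gap_tol=5):
    """Merge collinear horizontal segments whose gap is at most gap_tol pixels."""
    rows = defaultdict(list)
    for x1, x2, y in segments:
        # Normalise direction so each segment runs left to right.
        rows[y].append((min(x1, x2), max(x1, x2)))
    merged = []
    for y in sorted(rows):
        current_start, current_end = None, None
        for start, end in sorted(rows[y]):
            if current_start is not None and start - current_end <= gap_tol:
                # Close enough to the running segment: extend it.
                current_end = max(current_end, end)
            else:
                if current_start is not None:
                    merged.append((current_start, current_end, y))
                current_start, current_end = start, end
        merged.append((current_start, current_end, y))
    return merged
```

For example, `merge_horizontal_segments([(0, 10, 5), (12, 30, 5), (50, 60, 5)])` joins the first two segments into `(0, 30, 5)` and leaves `(50, 60, 5)` separate, because the 20-pixel gap exceeds the tolerance.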

[44] AoP-SAM: Automation of Prompts for Efficient Segmentation

Yi Chen,Mu-Young Son,Chuanbo Hua,Joo-Young Kim

Main category: cs.CV

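TL;DR: AoP-SAM automatically generates prompts, improving the efficiency and usability of SAM without manual input and making it better suited to real-world tasks.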

Details

Motivation: Manual prompts are impractical in real-world applications, especially where rapid prompting and resource efficiency are required. Method: AoP-SAM uses a lightweight Prompt Predictor model to detect key entities and generate prompts automatically, combined with an Adaptive Sampling and Filtering mechanism that streamlines prompt and mask generation. Result: Evaluations on three datasets show that AoP-SAM substantially improves prompt generation efficiency and mask generation accuracy. Conclusion: AoP-SAM makes SAM better suited to automated segmentation and more effective in practical applications. Abstract: The Segment Anything Model (SAM) is a powerful foundation model for image segmentation, showing robust zero-shot generalization through prompt engineering. However, relying on manual prompts is impractical for real-world applications, particularly in scenarios where rapid prompt provision and resource efficiency are crucial. In this paper, we propose the Automation of Prompts for SAM (AoP-SAM), a novel approach that learns to generate essential prompts in optimal locations automatically. AoP-SAM enhances SAM's efficiency and usability by eliminating manual input, making it better suited for real-world tasks. Our approach employs a lightweight yet efficient Prompt Predictor model that detects key entities across images and identifies the optimal regions for placing prompt candidates. This method leverages SAM's image embeddings, preserving its zero-shot generalization capabilities without requiring fine-tuning. Additionally, we introduce a test-time instance-level Adaptive Sampling and Filtering mechanism that generates prompts in a coarse-to-fine manner. This notably enhances both prompt and mask generation efficiency by reducing computational overhead and minimizing redundant mask refinements. Evaluations on three datasets demonstrate that AoP-SAM substantially improves both prompt generation efficiency and mask generation accuracy, making SAM more effective for automated segmentation tasks.

[45] Online Iterative Self-Alignment for Radiology Report Generation

Ting Xiao,Lei Shi,Yang Zhang,HaoFeng Yang,Zhe Wang,Chenjia Bai

Main category: cs.CV

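TL;DR: Proposes OISA, a new method that improves radiology report generation models through four stages: self-generation of data, self-evaluation, self-alignment, and self-iteration.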

Details

Motivation: Existing radiology report generation models depend on limited high-quality annotated data, which makes them prone to overfitting and poor generalization. Method: Uses Online Iterative Self-Alignment (OISA), comprising self-generation of data, self-evaluation for multi-objective preferences, self-alignment through multi-objective optimization, and further self-iteration. Result: Experiments show the method outperforms existing approaches on multiple evaluation metrics, reaching state-of-the-art performance. Conclusion: Through iterative multi-objective optimization, OISA significantly improves data quality and model performance. Abstract: Radiology Report Generation (RRG) is an important research topic for relieving radiologists' heavy workload. Existing RRG models mainly rely on supervised fine-tuning (SFT) based on different model architectures using data pairs of radiological images and corresponding radiologist-annotated reports. Recent research has shifted focus to post-training improvements, aligning RRG model outputs with human preferences using reinforcement learning (RL). However, the limited coverage of high-quality annotated data poses risks of overfitting and poor generalization. This paper proposes a novel Online Iterative Self-Alignment (OISA) method for RRG that consists of four stages: self-generation of diverse data, self-evaluation for multi-objective preference data, self-alignment for multi-objective optimization, and self-iteration for further improvement. Our approach allows for generating varied reports tailored to specific clinical objectives, enhancing the overall performance of the RRG model iteratively. Unlike existing methods, our framework significantly increases data quality and optimizes performance through iterative multi-objective optimization. Experimental results demonstrate that our method surpasses previous approaches, achieving state-of-the-art performance across multiple evaluation metrics.

[46] SpatialCrafter: Unleashing the Imagination of Video Diffusion Models for Scene Reconstruction from Limited Observations

Songchun Zhang,Huiyao Xu,Sitong Guo,Zhongwei Xie,Pengwei Liu,Hujun Bao,Weiwei Xu,Changqing Zou

Main category: cs.CV

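TL;DR: Proposes SpatialCrafter, a framework that uses video diffusion models to generate multi-view observations for 3D scene reconstruction from sparse or single-view inputs.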

Details

Motivation: Existing methods rely on dense multi-view observations, which restricts their application; sparse or single-view inputs remain a challenge. Method: Combines a trainable camera encoder with an epipolar attention mechanism, introduces a unified scale estimation strategy, and integrates monocular depth priors with semantic features to regress 3D Gaussian primitives directly. Result: Experiments show the method improves sparse-view reconstruction quality and restores the realistic appearance of 3D scenes. Conclusion: With geometric constraints and efficient feature processing, SpatialCrafter achieves photorealistic 3D reconstruction from sparse inputs. Abstract: Novel view synthesis (NVS) boosts immersive experiences in computer vision and graphics. Existing techniques, despite their progress, rely on dense multi-view observations, restricting their application. This work takes on the challenge of reconstructing photorealistic 3D scenes from sparse or single-view inputs. We introduce SpatialCrafter, a framework that leverages the rich knowledge in video diffusion models to generate plausible additional observations, thereby alleviating reconstruction ambiguity. Through a trainable camera encoder and an epipolar attention mechanism for explicit geometric constraints, we achieve precise camera control and 3D consistency, further reinforced by a unified scale estimation strategy to handle scale discrepancies across datasets. Furthermore, by integrating monocular depth priors with semantic features in the video latent space, our framework directly regresses 3D Gaussian primitives and efficiently processes long-sequence features using a hybrid network structure. Extensive experiments show our method enhances sparse view reconstruction and restores the realistic appearance of 3D scenes.

[47] Multimodal Cancer Survival Analysis via Hypergraph Learning with Cross-Modality Rebalance

Mingcheng Qu,Guang Yang,Donglin,Tonghua Su,Yue Gao,Yang Song,Lei Fan

Main category: cs.CV

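TL;DR: Proposes a multimodal survival prediction framework that combines hypergraph learning with a modality rebalance mechanism, significantly improving the integration of pathology and genomic data.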

Details

Motivation: Existing studies mostly use multi-instance learning to aggregate pathology image features, losing contextual and hierarchical detail, and the gap in granularity and dimensionality between pathology and genomic data causes modality imbalance. Method: Uses hypergraph learning to capture contextual and hierarchical details in pathology images, with a modality rebalance mechanism and an interactive alignment fusion strategy that dynamically reweight the two modalities' contributions. Result: Experiments on five TCGA datasets show the model outperforms advanced methods by over 3.4% in C-Index. Conclusion: The framework effectively mitigates the pathology-genomics imbalance and improves survival prediction. Abstract: Multimodal pathology-genomic analysis has become increasingly prominent in cancer survival prediction. However, existing studies mainly utilize multi-instance learning to aggregate patch-level features, neglecting the information loss of contextual and hierarchical details within pathology images. Furthermore, the disparity in data granularity and dimensionality between pathology and genomics leads to a significant modality imbalance. The high spatial resolution inherent in pathology data gives it a dominant role, overshadowing genomics in multimodal integration. In this paper, we propose a multimodal survival prediction framework that incorporates hypergraph learning to effectively capture both contextual and hierarchical details from pathology images. Moreover, it employs a modality rebalance mechanism and an interactive alignment fusion strategy to dynamically reweight the contributions of the two modalities, thereby mitigating the pathology-genomics imbalance. Quantitative and qualitative experiments are conducted on five TCGA datasets, demonstrating that our model outperforms advanced methods by over 3.4% in C-Index performance.
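The C-Index used as the evaluation metric above can be computed as in the following sketch of Harrell's concordance index. A pair of patients is comparable when the one with the shorter observed time actually experienced the event; the pair is concordant when that patient also has the higher predicted risk. The function and argument names are illustrative, and the paper's exact evaluation code may differ (for example in tie handling).

```python
def concordance_index(times, events, risks):
    """Harrell's C-index; events[i] is 1 if the event was observed, 0 if censored."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Patient i must have an observed event strictly before time j.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so a gain of 3.4 percentage points is measured on this scale.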

[48] IQBench: How "Smart" Are Vision-Language Models? A Study with Human IQ Tests

Tan-Hanh Pham,Phu-Vinh Nguyen,Dang The Hung,Bui Trong Duong,Vu Nguyen Thanh,Chris Ngo,Tri Quang Truong,Truong-Son Hy

Main category: cs.CV

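TL;DR: Introduces IQBench, a new benchmark that evaluates the reasoning ability of vision-language models (VLMs) on standardized visual IQ tests, emphasizing the reasoning process rather than final-answer accuracy alone.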

Details

Motivation: To explore the true reasoning capabilities of VLMs on human IQ tests, which remain underexplored. Method: Manually collects and annotates 500 visual IQ questions to build the IQBench benchmark, and evaluates models' explanations, solution patterns, and final prediction accuracy. Result: Models perform poorly on 3D spatial and anagram reasoning tasks, and their reasoning processes are inconsistent with their final answers. Conclusion: The work stresses the importance of evaluating reasoning accuracy and reveals the limitations of current VLMs in general reasoning. Abstract: Although large Vision-Language Models (VLMs) have demonstrated remarkable performance in a wide range of multimodal tasks, their true reasoning capabilities on human IQ tests remain underexplored. To advance research on the fluid intelligence of VLMs, we introduce **IQBench**, a new benchmark designed to evaluate VLMs on standardized visual IQ tests. We focus on evaluating the reasoning capabilities of VLMs, which we argue are more important than the accuracy of the final prediction. **Our benchmark is visually centric, minimizing the dependence on unnecessary textual content**, thus encouraging models to derive answers primarily from image-based information rather than learned textual knowledge. To this end, we manually collected and annotated 500 visual IQ questions to **prevent unintentional data leakage during training**. Unlike prior work that focuses primarily on the accuracy of the final answer, we evaluate the reasoning ability of the models by assessing their explanations and the patterns used to solve each problem, along with the accuracy of the final prediction and human evaluation. Our experiments show that there are substantial performance disparities between tasks, with models such as `o4-mini`, `gemini-2.5-flash`, and `claude-3.7-sonnet` achieving the highest average accuracies of 0.615, 0.578, and 0.548, respectively. However, all models struggle with 3D spatial and anagram reasoning tasks, highlighting significant limitations in current VLMs' general reasoning abilities. In terms of reasoning scores, `o4-mini`, `gemini-2.5-flash`, and `claude-3.7-sonnet` achieved top averages of 0.696, 0.586, and 0.516, respectively. These results highlight inconsistencies between the reasoning processes of the models and their final answers, emphasizing the importance of evaluating the accuracy of the reasoning in addition to the final predictions.

[49] CHRIS: Clothed Human Reconstruction with Side View Consistency

Dong Liu,Yifan Yang,Zixiong Huang,Yuxin Gao,Mingkui Tan

Main category: cs.CV

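TL;DR: CHRIS improves the quality of clothed human models reconstructed from single-view RGB images through side-view consistency, addressing unrealistic global topology and local surface inconsistency.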

Details

Motivation: A single-view image contains only front-view information, which leads to unrealistic global topology and inconsistent local surfaces in side views and limits reconstruction quality. Method: Proposes CHRIS, comprising a Side-View Normal Discriminator and a Multi-to-One Gradient Computation (M2O), which improve global plausibility and local consistency respectively. Result: Experiments show CHRIS achieves state-of-the-art performance on public benchmarks and outperforms existing methods. Conclusion: Through side-view consistency and local smoothing, CHRIS markedly improves the realism of clothed human reconstruction. Abstract: Creating a realistic clothed human from a single-view RGB image is crucial for applications like mixed reality and filmmaking. Despite some progress in recent years, mainstream methods often fail to fully utilize side-view information, as the input single-view image contains front-view information only. This leads to globally unrealistic topology and local surface inconsistency in side views. To address these issues, we introduce Clothed Human Reconstruction with Side View Consistency, namely CHRIS, which consists of 1) a Side-View Normal Discriminator that enhances global visual reasonability by distinguishing the generated side-view normals from the ground truth ones; 2) a Multi-to-One Gradient Computation (M2O) that ensures local surface consistency. M2O calculates the gradient of a sampling point by integrating the gradients of the nearby points, effectively acting as a smoothing operation. Experimental results demonstrate that CHRIS achieves state-of-the-art performance on public benchmarks and outperforms the prior work.

Runduo Han,Xiuping Liu,Shangxuan Yi,Yi Zhang,Hongchen Tan

Main category: cs.CV

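TL;DR: Proposes the Multi-modal Collaborative Optimization and Expansion Network (MCO-E Net) to address low light, high exposure, and high dynamic range in single-eye expression recognition.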

Details

Motivation: Single-eye expression recognition performs poorly under low light, high exposure, and high dynamic range, and needs multi-modal collaborative optimization to improve. Method: MCO-E Net includes two designs: MCO-Mamba (Mamba-based dual-modal collaborative optimization) for fusing modal semantics, and HCE-MoE (a heterogeneous collaborative and expansion mixture-of-experts) for learning complementary semantics. Result: Experiments show the network achieves competitive performance in single-eye expression recognition, especially under poor lighting. Conclusion: Through multi-modal collaboration and heterogeneous expert design, MCO-E Net improves the robustness and performance of single-eye expression recognition. Abstract: In this paper, we propose a Multi-modal Collaborative Optimization and Expansion Network (MCO-E Net) that uses event modalities to address challenges such as low light, high exposure, and high dynamic range in single-eye expression recognition tasks. The MCO-E Net introduces two innovative designs: Multi-modal Collaborative Optimization Mamba (MCO-Mamba) and Heterogeneous Collaborative and Expansion Mixture-of-Experts (HCE-MoE). MCO-Mamba, building upon Mamba, leverages dual-modal information to jointly optimize the model, facilitating collaborative interaction and fusion of modal semantics. This approach encourages the model to balance the learning of both modalities and harness their respective strengths. HCE-MoE, on the other hand, employs a dynamic routing mechanism to distribute structurally varied experts (deep, attention, and focal), fostering collaborative learning of complementary semantics. This heterogeneous architecture systematically integrates diverse feature extraction paradigms to comprehensively capture expression semantics. Extensive experiments demonstrate that our proposed network achieves competitive performance in the task of single-eye expression recognition, especially under poor lighting conditions.

[51] Black-box Adversaries from Latent Space: Unnoticeable Attacks on Human Pose and Shape Estimation

Zhiying Li,Guanggang Geng,Yeying Jin,Zhizhi Guo,Bruce Gu,Jidong Huo,Zhaoxin Fan,Wenjun Wu

Main category: cs.CV

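TL;DR: Proposes a novel Unnoticeable Black-Box Attack (UBA) on EHPS models that generates adversarial noise from latent-space representations and significantly increases estimation error without any knowledge of model internals.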

Details

Motivation: Existing EHPS models focus on estimation accuracy and overlook security vulnerabilities, while existing attacks need white-box access or produce conspicuous perturbations, which limits their practicality. Method: Generates adversarial noise from the latent-space representations of natural images and iteratively refines attack strength in digital space, relying only on queries to the model's output. Result: UBA increases the pose estimation errors of EHPS models by 17.27%-58.21% on average, revealing serious vulnerabilities. Conclusion: The study highlights the urgency of security risks in digital human generation systems and the need to mitigate them. Abstract: Expressive human pose and shape (EHPS) estimation is vital for digital human generation, particularly in live-streaming applications. However, most existing EHPS models focus primarily on minimizing estimation errors, with limited attention to potential security vulnerabilities. Current adversarial attacks on EHPS models often require white-box access (e.g., model details or gradients) or generate visually conspicuous perturbations, limiting their practicality and ability to expose real-world security threats. To address these limitations, we propose a novel Unnoticeable Black-Box Attack (UBA) against EHPS models. UBA leverages the latent-space representations of natural images to generate an optimal adversarial noise pattern and iteratively refine its attack potency along an optimized direction in digital space. Crucially, this process relies solely on querying the model's output, requiring no internal knowledge of the EHPS architecture, while guiding the noise optimization toward greater stealth and effectiveness. Extensive experiments and visual analyses demonstrate the superiority of UBA. Notably, UBA increases the pose estimation errors of EHPS models by 17.27%-58.21% on average, revealing critical vulnerabilities. These findings underscore the urgent need to address and mitigate security risks associated with digital human generation systems.

[52] Cross-Model Transfer of Task Vectors via Few-Shot Orthogonal Alignment

Kazuhiko Kawamoto,Atsuhiro Endo,Hiroshi Kera

Main category: cs.CV

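TL;DR: Proposes a few-shot orthogonal alignment method that addresses the limits of task arithmetic in cross-model transfer and improves the transferability of task vectors.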

Details

Motivation: Task arithmetic typically assumes that the source and target models are initialized from the same pre-trained parameters, which limits its applicability to cross-model transfer. Method: Few-shot orthogonal alignment maps task vectors into the parameter space of a differently pre-trained target model while preserving key properties of the task vectors. Result: Experiments on eight classification datasets show improved transfer accuracy, with performance close to few-shot fine-tuning, while keeping task vectors modular and reusable. Conclusion: The method effectively addresses task-vector alignment in cross-model transfer and broadens the applicability of task arithmetic. Abstract: Task arithmetic enables efficient model editing by representing task-specific changes as vectors in parameter space. Task arithmetic typically assumes that the source and target models are initialized from the same pre-trained parameters. This assumption limits its applicability in cross-model transfer settings, where models are independently pre-trained on different datasets. To address this challenge, we propose a method based on few-shot orthogonal alignment, which aligns task vectors to the parameter space of a differently pre-trained target model. These transformations preserve key properties of task vectors, such as norm and rank, and are learned using only a small number of labeled examples. We evaluate the method using two Vision Transformers pre-trained on YFCC100M and LAION400M, and test on eight classification datasets. Experimental results show that our method improves transfer accuracy over direct task vector application and achieves performance comparable to few-shot fine-tuning, while maintaining the modularity and reusability of task vectors. Our code is available at https://github.com/kawakera-lab/CrossModelTransfer.
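The claim that orthogonal transformations preserve the norm and rank of a task vector can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the task vector of a single weight matrix is transformed as U τ Vᵀ, and it draws U and V at random via QR decomposition, whereas the paper learns them from a few labeled examples. The helper `random_orthogonal` and the matrix sizes are hypothetical.

```python
import numpy as np

def random_orthogonal(n, rng):
    # The Q factor of a QR decomposition of a Gaussian matrix is orthogonal.
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

rng = np.random.default_rng(0)

# Hypothetical rank-2 task vector for a 6x4 weight matrix
# (in task arithmetic: fine-tuned weights minus pre-trained weights).
tau = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))

# Assumed two-sided orthogonal alignment; random here, learned in the paper.
U = random_orthogonal(6, rng)
V = random_orthogonal(4, rng)
aligned = U @ tau @ V.T

# Orthogonal maps leave the Frobenius norm and the rank unchanged.
norm_preserved = bool(np.isclose(np.linalg.norm(tau), np.linalg.norm(aligned)))
rank_preserved = np.linalg.matrix_rank(tau) == np.linalg.matrix_rank(aligned)
```

Because the transformation only rotates the task vector, its magnitude and low-rank structure carry over to the target model's parameter space.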

[53] FIGhost: Fluorescent Ink-based Stealthy and Flexible Backdoor Attacks on Physical Traffic Sign Recognition

Shuai Yuan,Guowen Xu,Hongwei Li,Rui Zhang,Xinyuan Qian,Wenbo Jiang,Hangcheng Cao,Qingchuan Zhao

Main category: cs.CV

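TL;DR: FIGhost is a physical-world backdoor attack that uses fluorescent ink as the trigger; it is stealthy, flexible, and untraceable, and it is effective against state-of-the-art detectors and vision-large-language models.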

Details

Motivation: Existing physical backdoor attacks fall short in stealth, in flexibility of attack control, or in coverage of emerging vision-large-language models. Method: Uses fluorescent-ink triggers (activated under ultraviolet light), an interpolation-based fluorescence simulation algorithm to improve robustness, and an automated backdoor sample generation method that supports three attack objectives. Result: Physical-world evaluations show that FIGhost is effective against state-of-the-art detectors and vision-large-language models, and that it withstands environmental variations and existing defenses. Conclusion: FIGhost offers a stealthy, flexible, and robust approach to physical-world backdoor attacks. Abstract: Traffic sign recognition (TSR) systems are crucial for autonomous driving but are vulnerable to backdoor attacks. Existing physical backdoor attacks either lack stealth, provide inflexible attack control, or ignore emerging Vision-Large-Language-Models (VLMs). In this paper, we introduce FIGhost, the first physical-world backdoor attack leveraging fluorescent ink as triggers. Fluorescent triggers are invisible under normal conditions and activated stealthily by ultraviolet light, providing superior stealthiness, flexibility, and untraceability. Inspired by real-world graffiti, we derive realistic trigger shapes and enhance their robustness via an interpolation-based fluorescence simulation algorithm. Furthermore, we develop an automated backdoor sample generation method to support three attack objectives. Extensive evaluations in the physical world demonstrate FIGhost's effectiveness against state-of-the-art detectors and VLMs, maintaining robustness under environmental variations and effectively evading existing defenses.

[54] Accelerating Diffusion-based Super-Resolution with Dynamic Time-Spatial Sampling

Rui Qin,Qijie Wang,Ming Sun,Haowei Zhu,Chao Zhou,Bin Wang

Main category: cs.CV

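TL;DR: Proposes a Time-Spatial-aware Sampling strategy (TSS) that accelerates diffusion super-resolution (SR) models at no extra training cost, substantially reducing the number of iterations while improving performance.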

Details

Motivation: Existing diffusion SR methods are computationally expensive, and general-purpose acceleration techniques do not fully exploit the characteristics of low-level tasks. Method: Analyzes the frequency- and spatial-domain properties of diffusion SR and proposes TSS, which combines Time Dynamic Sampling (TDS) and Spatial Dynamic Sampling (SDS). Result: TSS performs strongly on multiple benchmarks, outperforming existing acceleration methods with only half the number of steps and improving MUSIQ scores by 0.2-3.0. Conclusion: By sampling dynamically in time and space, TSS accelerates diffusion SR and improves both efficiency and quality. Abstract: Diffusion models have gained attention for their success in modeling complex distributions, achieving impressive perceptual quality in SR tasks. However, existing diffusion-based SR methods often suffer from high computational costs, requiring numerous iterative steps for training and inference. Existing acceleration techniques, such as distillation and solver optimization, are generally task-agnostic and do not fully leverage the specific characteristics of low-level tasks like super-resolution (SR). In this study, we analyze the frequency- and spatial-domain properties of diffusion-based SR methods, revealing key insights into the temporal and spatial dependencies of high-frequency signal recovery. Specifically, high-frequency details benefit from concentrated optimization during early and late diffusion iterations, while spatially textured regions demand adaptive denoising strategies. Building on these observations, we propose the Time-Spatial-aware Sampling strategy (TSS) for the acceleration of Diffusion SR without any extra training cost. TSS combines Time Dynamic Sampling (TDS), which allocates more iterations to refining textures, and Spatial Dynamic Sampling (SDS), which dynamically adjusts strategies based on image content. Extensive evaluations across multiple benchmarks demonstrate that TSS achieves state-of-the-art (SOTA) performance with significantly fewer iterations, improving MUSIQ scores by 0.2 - 3.0 and outperforming the current acceleration methods with only half the number of steps.
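The observation that high-frequency details benefit from concentrated optimization in early and late iterations suggests a non-uniform timestep schedule. The abstract does not give the actual TDS schedule, so the sketch below is only an illustration under stated assumptions: it uses a cosine warp (an assumption, not the paper's rule) to place sampling steps more densely near both ends of a 1000-step training schedule. The function name `dense_at_ends_timesteps` is hypothetical.

```python
import numpy as np

def dense_at_ends_timesteps(num_steps, num_train_steps=1000):
    """Pick `num_steps` descending timesteps, denser near the first and last ones.

    Assumption: a cosine warp stands in for the unspecified TDS allocation.
    """
    u = np.linspace(0.0, 1.0, num_steps)
    # The warp moves slowly near u=0 and u=1, so consecutive timesteps
    # are close together at both ends and far apart in the middle.
    warped = 0.5 * (1.0 - np.cos(np.pi * u))
    return np.round((num_train_steps - 1) * (1.0 - warped)).astype(int)

timesteps = dense_at_ends_timesteps(10)
gaps = -np.diff(timesteps)  # spacing between consecutive sampling steps
```

With a uniform schedule every gap would be equal; here the gaps shrink toward both ends, which spends more of a fixed step budget where the abstract says texture recovery benefits most.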

[55] VFRTok: Variable Frame Rates Video Tokenizer with Duration-Proportional Information Assumption

Tianxiong Zhong,Xingye Tian,Boyuan Jiang,Xuebo Wang,Xin Tao,Pengfei Wan,Zhiwei Zhang

Main category: cs.CV

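TL;DR: The paper proposes VFRTok, a Transformer-based video tokenizer whose variable-frame-rate encoding and decoding addresses the computational cost that fixed temporal compression rates impose on existing video generation frameworks.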

Details

Motivation: Under the Frame-Proportional Information Assumption, the computational cost of existing video generation frameworks grows linearly with frame rate, which is inefficient. Method: Proposes the Duration-Proportional Information Assumption and designs VFRTok for variable-frame-rate encoding and decoding; introduces Partial Rotary Position Embeddings (RoPE) to decouple position and content modeling. Result: VFRTok achieves competitive reconstruction quality and state-of-the-art generation fidelity while using only 1/8 of the tokens. Conclusion: With a compact, continuous spatio-temporal representation, VFRTok substantially improves the efficiency and quality of video generation. Abstract: Modern video generation frameworks based on Latent Diffusion Models suffer from inefficiencies in tokenization due to the Frame-Proportional Information Assumption. Existing tokenizers provide fixed temporal compression rates, causing the computational cost of the diffusion model to scale linearly with the frame rate. The paper proposes the Duration-Proportional Information Assumption: the upper bound on the information capacity of a video is proportional to the duration rather than the number of frames. Based on this insight, the paper introduces VFRTok, a Transformer-based video tokenizer that enables variable frame rate encoding and decoding through asymmetric frame rate training between the encoder and decoder. Furthermore, the paper proposes Partial Rotary Position Embeddings (RoPE) to decouple position and content modeling, which groups correlated patches into unified tokens. The Partial RoPE effectively improves content-awareness, enhancing the video generation capability. Benefiting from the compact and continuous spatio-temporal representation, VFRTok achieves competitive reconstruction quality and state-of-the-art generation fidelity while using only 1/8 of the tokens compared to existing tokenizers.

[56] Beluga Whale Detection from Satellite Imagery with Point Labels

Yijie Zheng,Jinxuan Yang,Yu Chen,Yaxuan Wang,Yihang Lu,Guoqing Li

Main category: cs.CV

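TL;DR: The paper proposes an automated pipeline that uses point annotations and the Segment Anything Model (SAM) to generate precise bounding-box annotations for training YOLOv8 on multiclass detection (certain whales, uncertain whales, and harp seals), which markedly improves detection performance and reduces manual annotation effort.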

Details

Motivation: Existing deep-learning whale detection methods depend on manually created, high-quality bounding boxes, which are time-consuming to produce, and they often ignore uncertain whales, which limits real-world use. Method: Combines point annotations with SAM to generate precise bounding boxes and trains YOLOv8 for multiclass detection. Result: SAM-generated annotations markedly improve detection performance; YOLOv8 reaches F1-scores of 72.2% for whales and 70.3% for harp seals. Conclusion: The method reduces the manual annotation burden, improves detection of uncertain whales, and could extend to other species and platforms, advancing ecological monitoring and conservation. Abstract: Very high-resolution (VHR) satellite imagery has emerged as a powerful tool for monitoring marine animals on a large scale. However, existing deep learning-based whale detection methods usually require manually created, high-quality bounding box annotations, which are labor-intensive to produce. Moreover, existing studies often exclude "uncertain whales", individuals that have ambiguous appearances in satellite imagery, limiting the applicability of these models in real-world scenarios. To address these limitations, this study introduces an automated pipeline for detecting beluga whales and harp seals in VHR satellite imagery. The pipeline leverages point annotations and the Segment Anything Model (SAM) to generate precise bounding box annotations, which are used to train YOLOv8 for multiclass detection of certain whales, uncertain whales, and harp seals. Experimental results demonstrated that SAM-generated annotations significantly improved detection performance, achieving higher F1-scores compared to traditional buffer-based annotations. YOLOv8 trained on SAM-labeled boxes achieved an F1-score of 72.2% for whales overall and 70.3% for harp seals, with superior performance in dense scenes. The proposed approach not only reduces the manual effort required for annotation but also enhances the detection of uncertain whales, offering a more comprehensive solution for marine animal monitoring. This method holds great potential for extending to other species, habitats, and remote sensing platforms, as well as for estimating whale biometrics, thereby advancing ecological monitoring and conservation efforts. The codes for our label and detection pipeline are publicly available at http://github.com/voyagerxvoyagerx/beluga-seeker.
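The step from a point-prompted segmentation mask to a detector training label can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes a boolean mask has already been produced (for example by SAM from a point prompt, which is not called here) and converts it to a tight bounding box and then to the normalized center/size tuple used in YOLO-style label files. The function names are hypothetical.

```python
import numpy as np

def mask_to_bbox(mask):
    """Return the tight box (x_min, y_min, x_max, y_max) of a boolean mask, or None if empty."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def bbox_to_yolo(bbox, width, height):
    """Convert an inclusive pixel box to normalized (cx, cy, w, h)."""
    x0, y0, x1, y1 = bbox
    # Pixel coordinates are inclusive, so a box from x0 to x1 spans x1 - x0 + 1 pixels.
    w = (x1 - x0 + 1) / width
    h = (y1 - y0 + 1) / height
    cx = (x0 + x1 + 1) / 2 / width
    cy = (y0 + y1 + 1) / 2 / height
    return cx, cy, w, h

# Hypothetical 10x10 tile with one segmented animal.
mask = np.zeros((10, 10), dtype=bool)
mask[2:5, 3:7] = True
bbox = mask_to_bbox(mask)
yolo_box = bbox_to_yolo(bbox, width=10, height=10)
```

A mask-derived box follows the animal's actual extent, whereas a buffer-based box is a fixed-size square around the point regardless of body size or orientation.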

[57] MT-CYP-Net: Multi-Task Network for Pixel-Level Crop Yield Prediction Under Very Few Samples

Shenzhou Liu,Di Wang,Haonan Guo,Chengxi Han,Wenzhi Zeng

Main category: cs.CV

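TL;DR: Proposes the Multi-Task Crop Yield Prediction Network (MT-CYP-Net), which shares features and fuses information across tasks to overcome the scarcity of ground-truth data in satellite remote sensing and produce accurate pixel-level yield predictions.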

Details

Motivation: Pixel-level yield prediction from satellite remote sensing data is limited by the scarcity of ground-truth data. Method: A multi-task feature-sharing strategy in which features from a shared backbone feed both yield prediction and crop classification decoders, with information fused between them. Result: On a dataset from eight farms in Heilongjiang, MT-CYP-Net outperforms classical methods and shows the potential for accurate prediction with limited labels. Conclusion: MT-CYP-Net performs well across multiple crop types and offers a new approach to pixel-level yield prediction. Abstract: Accurate and fine-grained crop yield prediction plays a crucial role in advancing global agriculture. However, the accuracy of pixel-level yield estimation based on satellite remote sensing data has been constrained by the scarcity of ground truth data. To address this challenge, we propose a novel approach called the Multi-Task Crop Yield Prediction Network (MT-CYP-Net). This framework introduces an effective multi-task feature-sharing strategy, where features extracted from a shared backbone network are simultaneously utilized by both crop yield prediction decoders and crop classification decoders with the ability to fuse information between them. This design allows MT-CYP-Net to be trained with extremely sparse crop yield point labels and crop type labels, while still generating detailed pixel-level crop yield maps. Concretely, we collected 1,859 yield point labels along with corresponding crop type labels and satellite images from eight farms in Heilongjiang Province, China, in 2023, covering soybean, maize, and rice crops, and constructed a sparse crop yield label dataset. MT-CYP-Net is compared with three classical machine learning and deep learning benchmark methods on this dataset. Experimental results not only indicate the superiority of MT-CYP-Net compared to previous methods on multiple types of crops but also demonstrate the potential of deep networks on precise pixel-level crop yield prediction, especially with limited data labels.
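Training a dense yield map from sparse point labels means the loss can only be evaluated at the labeled pixels. The abstract does not state the loss used, so the following is a sketch under the assumption of a mean squared error restricted to labeled locations; `sparse_point_mse` and the label format `(row, col, yield_value)` are hypothetical.

```python
import numpy as np

def sparse_point_mse(pred_map, labels):
    """Mean squared error of a dense prediction at sparsely labeled pixels.

    pred_map: 2-D array of predicted yields, one value per pixel.
    labels:   list of (row, col, yield_value) point labels (assumed format).
    """
    rows, cols, values = zip(*labels)
    # Fancy indexing picks out only the labeled pixels; all others are ignored.
    preds = pred_map[list(rows), list(cols)]
    return float(np.mean((preds - np.asarray(values, dtype=float)) ** 2))

# Hypothetical 4x4 prediction with two labeled points.
pred_map = np.full((4, 4), 5.0)
labels = [(0, 0, 5.0), (3, 2, 7.0)]
loss = sparse_point_mse(pred_map, labels)
```

Unlabeled pixels contribute nothing to this term, which is why a second, more densely supervised task such as crop classification can help shape the shared features.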

[58] Denoising Mutual Knowledge Distillation in Bi-Directional Multiple Instance Learning

Chen Shu,Boyu Fu,Yiman Li,Ting Yin,Wenchuan Zhang,Jie Chen,Yuhao Yi,Hong Bu

Main category: cs.CV

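TL;DR: The paper proposes a method that incorporates pseudo-label correction to improve multiple instance learning (MIL) in digital pathology, narrowing the gap between MIL and fully supervised learning.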

Details

Motivation: Although MIL removes the need for fine-grained annotation, its accuracy at bag- and instance-level classification remains in question, and existing remedies may introduce noisy labels, so a more robust approach is needed. Method: Augments both bag- and instance-level learning with pseudo-label correction derived from weak-to-strong generalization techniques. Result: Experiments on public pathology datasets show improved bag- and instance-level predictions. Conclusion: The proposed algorithm narrows the gap between MIL and fully supervised learning and improves classification performance. Abstract: Multiple Instance Learning (MIL) is the predominant method for Whole Slide Image classification in digital pathology, enabling the use of slide-level labels to supervise model training. Although MIL eliminates the tedious fine-grained annotation process for supervised learning, whether it can learn accurate bag- and instance-level classifiers remains a question. To address the issue, instance-level classifiers and instance masks were incorporated to ground the prediction on supporting patches. These methods, while practically improving the performance of MIL methods, may potentially introduce noisy labels. We propose to bridge the gap between commonly used MIL and fully supervised learning by augmenting both the bag- and instance-level learning processes with pseudo-label correction capabilities elicited from weak to strong generalization techniques. The proposed algorithm improves the performance of dual-level MIL algorithms on both bag- and instance-level predictions. Experiments on public pathology datasets showcase the advantage of the proposed methods.

[59] VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning

Yuqi Liu,Tianyuan Qu,Zhisheng Zhong,Bohao Peng,Shu Liu,Bei Yu,Jiaya Jia

Main category: cs.CV

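TL;DR: VisionReasoner is a unified visual perception framework that uses multi-object cognitive learning strategies and task reformulation to solve multiple visual tasks, with strong results on detection, segmentation, and counting.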

Details

Motivation: Large vision-language models show promise on diverse visual perception tasks, but a unified framework that integrates these capabilities is lacking. Method: Designs multi-object cognitive learning strategies and systematic task reformulation, and generates a structured reasoning process to solve multiple tasks. Result: Outperforms Qwen2.5VL by relative margins of 29.1% on COCO (detection), 22.1% on ReasonSeg (segmentation), and 15.3% on CountBench (counting). Conclusion: VisionReasoner performs strongly as a unified model across diverse visual tasks and offers an efficient solution for visual perception. Abstract: Large vision-language models exhibit inherent capabilities to handle diverse visual perception tasks. In this paper, we introduce VisionReasoner, a unified framework capable of reasoning and solving multiple visual perception tasks within a shared model. Specifically, by designing novel multi-object cognitive learning strategies and systematic task reformulation, VisionReasoner enhances its reasoning capabilities to analyze visual inputs, and addresses diverse perception tasks in a unified framework. The model generates a structured reasoning process before delivering the desired outputs responding to user queries. To rigorously assess unified visual perception capabilities, we evaluate VisionReasoner on ten diverse tasks spanning three critical domains: detection, segmentation, and counting. Experimental results show that VisionReasoner achieves superior performance as a unified model, outperforming Qwen2.5VL by relative margins of 29.1% on COCO (detection), 22.1% on ReasonSeg (segmentation), and 15.3% on CountBench (counting).

[60] LOVE: Benchmarking and Evaluating Text-to-Video Generation and Video-to-Text Interpretation

Jiarui Wang,Huiyu Duan,Ziheng Jia,Yu Zhao,Woo Yi Yang,Zicheng Zhang,Zijian Chen,Juntong Wang,Yuke Xing,Guangtao Zhai,Xiongkuo Min

Main category: cs.CV

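TL;DR: The paper presents the AIGVE-60K dataset and the LOVE metric for evaluating the quality and text-video alignment of AI-generated videos, achieving state-of-the-art performance across multiple dimensions.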

Details

Motivation: Current AI-generated videos (AIGVs) fall short in perceptual quality and text-video alignment, so a reliable and scalable automatic evaluation model is needed. Method: Builds the AIGVE-60K dataset with extensive human annotations (MOSs and QA pairs) and proposes LOVE, an LMM-based evaluation metric. Result: LOVE performs strongly on AIGVE-60K and generalizes to other AIGV evaluation benchmarks. Conclusion: The AIGVE-60K dataset and the LOVE metric provide important tools for AIGV evaluation and advance the field. Abstract: Recent advancements in large multimodal models (LMMs) have driven substantial progress in both text-to-video (T2V) generation and video-to-text (V2T) interpretation tasks. However, current AI-generated videos (AIGVs) still exhibit limitations in terms of perceptual quality and text-video alignment. Therefore, a reliable and scalable automatic model for AIGV evaluation is desirable, which heavily relies on the scale and quality of human annotations. To this end, we present AIGVE-60K, a comprehensive dataset and benchmark for AI-Generated Video Evaluation, which features (i) comprehensive tasks, encompassing 3,050 extensive prompts across 20 fine-grained task dimensions, (ii) the largest human annotations, including 120K mean-opinion scores (MOSs) and 60K question-answering (QA) pairs annotated on 58,500 videos generated from 30 T2V models, and (iii) bidirectional benchmarking and evaluating for both T2V generation and V2T interpretation capabilities. Based on AIGVE-60K, we propose LOVE, an LMM-based metric for AIGV Evaluation from multiple dimensions including perceptual preference, text-video correspondence, and task-specific accuracy in terms of both instance level and model level. Comprehensive experiments demonstrate that LOVE not only achieves state-of-the-art performance on the AIGVE-60K dataset, but also generalizes effectively to a wide range of other AIGV evaluation benchmarks. These findings highlight the significance of the AIGVE-60K dataset. Database and codes are anonymously available at https://github.com/IntMeGroup/LOVE.

[61] TinyRS-R1: Compact Multimodal Language Model for Remote Sensing

Aybora Koksal,A. Aydin Alatan

Main category: cs.CV

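TL;DR: TinyRS and TinyRS-R1 are 2B-parameter multimodal small language models optimized for remote sensing; they match the performance of 7B-parameter models with lower memory use and latency.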

Details

Motivation: Edge hardware cannot run large multimodal language models, so remote sensing tasks need an efficient alternative. Method: A four-stage training pipeline: pre-training, instruction tuning, CoT fine-tuning, and GRPO alignment. Result: TinyRS-R1 matches or surpasses 7B-parameter models on classification, VQA, and other tasks while cutting memory and latency by two-thirds. Conclusion: TinyRS-R1 is the first GRPO-aligned CoT reasoning model designed specifically for remote sensing and applies to a range of scenarios. Abstract: Remote-sensing applications often run on edge hardware that cannot host today's 7B-parameter multimodal language models. This paper introduces TinyRS, the first 2B-parameter multimodal small language model (MSLM) optimized for remote sensing tasks, and TinyRS-R1, its reasoning-augmented variant. Built upon Qwen2-VL-2B, TinyRS is trained through a four-stage pipeline: pre-training on million satellite images, instruction tuning on visual instruction examples, fine-tuning with Chain-of-Thought (CoT) annotations from the proposed reasoning dataset, and alignment via Group Relative Policy Optimization (GRPO). TinyRS-R1 achieves or surpasses the performance of recent 7B-parameter remote sensing models across classification, VQA, visual grounding, and open-ended question answering, while requiring just one-third of the memory and latency. Our analysis shows that CoT reasoning substantially benefits spatial grounding and scene understanding, while the non-reasoning TinyRS excels in concise, latency-sensitive VQA tasks. TinyRS-R1 represents the first domain-specialized MSLM with GRPO-aligned CoT reasoning for general-purpose remote sensing.
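The final alignment stage uses Group Relative Policy Optimization (GRPO). As GRPO is commonly described, several responses are sampled for the same prompt and each response's reward is normalized against the group's mean and standard deviation, which removes the need for a learned value function. The sketch below shows only that normalization step; the reward values are hypothetical, and nothing here reflects TinyRS-R1's actual reward design.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize per-response rewards within one prompt's group of samples."""
    rewards = np.asarray(rewards, dtype=float)
    # eps guards against division by zero when all rewards in a group are equal.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical rewards for four sampled answers to one remote-sensing question
# (1.0 = correct, 0.0 = incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat the group average receive positive advantages and are reinforced; when every response in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.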

[62] EarthSynth: Generating Informative Earth Observation with Diffusion Models

Jiancheng Pan,Shiye Lei,Yuqian Fu,Jiahao Li,Yanxing Liu,Yuze Sun,Xiao He,Long Peng,Xiaomeng Huang,Bo Zhao

Main category: cs.CV

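TL;DR: EarthSynth is a diffusion-based generative foundation model that synthesizes multi-category, cross-satellite labeled Earth observation data to address the scarcity of labeled data in remote sensing image interpretation.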

Details

Motivation: Remote sensing image interpretation is limited by scarce labeled data, so a way to generate diverse labeled data is needed. Method: EarthSynth uses the Counterfactual Composition training strategy and the rule-based R-Filter to generate and select high-quality synthetic data. Result: EarthSynth performs well on scene classification, object detection, and semantic segmentation, offering a practical solution for remote sensing image interpretation. Conclusion: EarthSynth is the first remote sensing model to explore multi-task generation, and its synthetic data improves performance on interpretation tasks. Abstract: Remote sensing image (RSI) interpretation typically faces challenges due to the scarcity of labeled data, which limits the performance of RSI interpretation tasks. To tackle this challenge, we propose EarthSynth, a diffusion-based generative foundation model that enables synthesizing multi-category, cross-satellite labeled Earth observation for downstream RSI interpretation tasks. To the best of our knowledge, EarthSynth is the first to explore multi-task generation for remote sensing. EarthSynth, trained on the EarthSynth-180K dataset, employs the Counterfactual Composition training strategy to improve training data diversity and enhance category control. Furthermore, a rule-based method, R-Filter, is proposed to filter more informative synthetic data for downstream tasks. We evaluate our EarthSynth on scene classification, object detection, and semantic segmentation in open-world scenarios, offering a practical solution for advancing RSI interpretation.

[63] Keypoints as Dynamic Centroids for Unified Human Pose and Segmentation

Niaz Ahmad,Jawad Khan,Kang G. Shin,Youngmoon Lee,Guanghui Wang

Main category: cs.CV

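TL;DR: The paper proposes KDC, a dynamic-centroid representation that unifies human pose estimation and instance-level segmentation and is particularly suited to scenes with rapid motion or overlapping joints.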

Details

Motivation: Existing methods perform poorly with overlapping joints or rapid motion, so a more robust solution is needed. Method: Adopts a bottom-up paradigm to generate keypoint heatmaps and introduces the KeyCentroids and MaskCentroids dynamic-centroid representations for fast pixel clustering. Result: Performs well on the CrowdPose, OCHuman, and COCO benchmarks in both accuracy and runtime. Conclusion: KDC is efficient and generalizes well in complex scenes, offering a new approach for related tasks. Abstract: The dynamic movement of the human body presents a fundamental challenge for human pose estimation and body segmentation. State-of-the-art approaches primarily rely on combining keypoint heatmaps with segmentation masks but often struggle in scenarios involving overlapping joints or rapidly changing poses during instance-level segmentation. To address these limitations, we propose Keypoints as Dynamic Centroid (KDC), a new centroid-based representation for unified human pose estimation and instance-level segmentation. KDC adopts a bottom-up paradigm to generate keypoint heatmaps for both easily distinguishable and complex keypoints and improves keypoint detection and confidence scores by introducing KeyCentroids using a keypoint disk. It leverages high-confidence keypoints as dynamic centroids in the embedding space to generate MaskCentroids, allowing for swift clustering of pixels to specific human instances during rapid body movements in live environments. Our experimental evaluations on the CrowdPose, OCHuman, and COCO benchmarks demonstrate KDC's effectiveness and generalizability in challenging scenarios in terms of both accuracy and runtime performance. The implementation is available at: https://sites.google.com/view/niazahmad/projects/kdc.

[64] Learning to Highlight Audio by Watching Movies

Chao Huang,Ruohan Gao,J. M. F. Tsang,Jan Kurcius,Cagdas Bilen,Chenliang Xu,Anurag Kumar,Sanjeel Parekh

Main category: cs.CV

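TL;DR: The paper introduces the task of visually-guided acoustic highlighting, improving the audio-visual experience with a multimodal framework, and presents a new dataset and a pseudo-data generation method.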

Details

Motivation: In video content creation, visual and audio elements are often poorly coordinated, producing a disjointed audio-visual experience; this gap needs to be filled. Method: A transformer-based multimodal framework, trained on pseudo-data generated from movies to simulate real-world scenarios. Result: The method outperforms baselines in both quantitative and subjective evaluation; the effects of different contextual guidance and dataset difficulty levels are also studied. Conclusion: Visually-guided acoustic highlighting is feasible and effective, and the multimodal framework and pseudo-data generation offer a new approach to audio-visual coordination. Abstract: Recent years have seen a significant increase in video content creation and consumption. Crafting engaging content requires the careful curation of both visual and audio elements. While visual cue curation, through techniques like optimal viewpoint selection or post-editing, has been central to media production, its natural counterpart, audio, has not undergone equivalent advancements. This often results in a disconnect between visual and acoustic saliency. To bridge this gap, we introduce a novel task: visually-guided acoustic highlighting, which aims to transform audio to deliver appropriate highlighting effects guided by the accompanying video, ultimately creating a more harmonious audio-visual experience. We propose a flexible, transformer-based multimodal framework to solve this task. To train our model, we also introduce a new dataset, the muddy mix dataset, leveraging the meticulous audio and video crafting found in movies, which provides a form of free supervision. We develop a pseudo-data generation process to simulate poorly mixed audio, mimicking real-world scenarios through a three-step process: separation, adjustment, and remixing. Our approach consistently outperforms several baselines in both quantitative and subjective evaluation. We also systematically study the impact of different types of contextual guidance and difficulty levels of the dataset. Our project page is here: https://wikichao.github.io/VisAH/.
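The three-step pseudo-data process (separation, adjustment, remixing) can be sketched on toy signals. This is an illustration under stated assumptions, not the authors' pipeline: separation is taken as already done (the stems are given), the adjustment is a random per-stem gain drawn uniformly in decibels, and the stem names, gain range, and function `make_muddy_mix` are hypothetical.

```python
import numpy as np

def make_muddy_mix(stems, rng, gain_db_range=(-12.0, 12.0)):
    """Re-balance separated stems with random gains and sum them into a degraded mix.

    stems: dict mapping a stem name to a 1-D waveform (the output of a separator).
    Returns the remixed waveform and the gain in dB applied to each stem.
    """
    # Adjustment: draw one random gain per stem (assumed uniform in dB).
    gains_db = {name: rng.uniform(*gain_db_range) for name in stems}
    # Remixing: convert dB to linear amplitude and sum the scaled stems.
    mix = sum(wave * 10 ** (gains_db[name] / 20) for name, wave in stems.items())
    return mix, gains_db

rng = np.random.default_rng(0)
# Hypothetical separated stems of a movie clip (random noise as stand-ins).
stems = {name: rng.standard_normal(100) for name in ("speech", "music", "effects")}
well_mixed = sum(stems.values())            # stands in for the original movie mix
muddy, gains_db = make_muddy_mix(stems, rng)
```

Each (muddy, well_mixed) pair then serves as an input/target example, which is how the carefully mixed movie audio provides supervision without manual labels.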

[65] SoftPQ: Robust Instance Segmentation Evaluation via Soft Matching and Tunable Thresholds

Ranit Karmakar,Simon F. Nørrelykke

Main category: cs.CV

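TL;DR: SoftPQ is a flexible instance segmentation metric that introduces tunable thresholds and a partial matching region, avoiding the limits of binary matching and giving smoother scores and more meaningful feedback.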

Details

Motivation: Traditional segmentation metrics (such as IoU and PQ) rely on binary decision logic, which ignores qualitative differences between errors and cannot reflect gradual model improvement. Method: Proposes SoftPQ, which defines a partial matching region with tunable upper and lower IoU thresholds and applies a sublinear penalty function to ambiguous or fragmented predictions. Result: In experiments, SoftPQ shows smoother score behavior, is more robust to structural segmentation errors, and captures quality differences that existing metrics miss. Conclusion: SoftPQ is a practical and principled alternative for benchmarking and iterative model refinement. Abstract: Segmentation evaluation metrics traditionally rely on binary decision logic: predictions are either correct or incorrect, based on rigid IoU thresholds. Detection-based metrics such as F1 and mAP determine correctness at the object level using fixed overlap cutoffs, while overlap-based metrics like Intersection over Union (IoU) and Dice operate at the pixel level, often overlooking instance-level structure. Panoptic Quality (PQ) attempts to unify detection and segmentation assessment, but it remains dependent on hard-threshold matching, treating predictions below the threshold as entirely incorrect. This binary framing obscures important distinctions between qualitatively different errors and fails to reward gradual model improvements. We propose SoftPQ, a flexible and interpretable instance segmentation metric that redefines evaluation as a graded continuum rather than a binary classification. SoftPQ introduces tunable upper and lower IoU thresholds to define a partial matching region and applies a sublinear penalty function to ambiguous or fragmented predictions. These extensions allow SoftPQ to exhibit smoother score behavior, greater robustness to structural segmentation errors, and more informative feedback for model development and evaluation. Through controlled perturbation experiments, we show that SoftPQ captures meaningful differences in segmentation quality that existing metrics overlook, making it a practical and principled alternative for both benchmarking and iterative model refinement.
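The hard-threshold behavior that motivates SoftPQ is easy to reproduce with standard Panoptic Quality. The sketch below implements the usual PQ definition (sum of true-positive IoUs divided by TP + FP/2 + FN/2, with a match requiring IoU above 0.5); it does not implement SoftPQ itself, whose exact formula is not given in the abstract. The input format, one best IoU per candidate prediction/ground-truth pair, is a simplifying assumption.

```python
def panoptic_quality(matched_ious, num_pred, num_gt, threshold=0.5):
    """Standard PQ for one class, given the IoU of each candidate one-to-one match."""
    # Only pairs above the hard threshold count as true positives.
    tp_ious = [iou for iou in matched_ious if iou > threshold]
    tp = len(tp_ious)
    fp = num_pred - tp   # predictions left without a match
    fn = num_gt - tp     # ground-truth instances left without a match
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(tp_ious) / denom if denom else 0.0

# One predicted instance against one ground-truth instance, just above
# and just below the threshold.
pq_above = panoptic_quality([0.51], num_pred=1, num_gt=1)
pq_below = panoptic_quality([0.49], num_pred=1, num_gt=1)
```

A two-point change in IoU moves the score from 0.51 to 0, because the near miss is counted as both a false positive and a false negative. This cliff is the behavior a graded partial-matching region is meant to smooth out.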

[66] Ditch the Denoiser: Emergence of Noise Robustness in Self-Supervised Learning from Data Curriculum

Wenquan Lu,Jiaqi Zhang,Hugues Van Assel,Randall Balestriero

Main category: cs.CV

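TL;DR: Proposes a self-supervised learning framework that learns robust representations from noisy data without requiring a denoiser at inference or downstream fine-tuning.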

Details

Motivation: Current self-supervised learning research mostly targets high-quality data, while applications with noisy data (such as astrophysics and medical imaging) remain challenging. Method: Trains a self-supervised denoiser to build a denoised-to-noisy data curriculum, combined with teacher-guided regularization, so that the model internalizes noise robustness. Result: Under extreme Gaussian noise, the method improves linear probing accuracy by 4.8% over DINOv2. Conclusion: The method shows that denoiser-free robustness can emerge from noise-aware pretraining. Abstract: Self-Supervised Learning (SSL) has become a powerful solution to extract rich representations from unlabeled data. Yet, SSL research is mostly focused on clean, curated and high-quality datasets. As a result, applying SSL on noisy data remains a challenge, despite being crucial to applications such as astrophysics, medical imaging, geophysics or finance. In this work, we present a fully self-supervised framework that enables noise-robust representation learning without requiring a denoiser at inference or downstream fine-tuning. Our method first trains an SSL denoiser on noisy data, then uses it to construct a denoised-to-noisy data curriculum (i.e., training first on denoised, then noisy samples) for pretraining an SSL backbone (e.g., DINOv2), combined with a teacher-guided regularization that anchors noisy embeddings to their denoised counterparts. This process encourages the model to internalize noise robustness. Notably, the denoiser can be discarded after pretraining, simplifying deployment. On ImageNet-1k with ViT-B under extreme Gaussian noise ($\sigma=255$, SNR = 0.72 dB), our method improves linear probing accuracy by 4.8% over DINOv2, demonstrating that denoiser-free robustness can emerge from noise-aware pretraining. The code is available at https://github.com/wenquanlu/noisy_dinov2.

[67] Always Clear Depth: Robust Monocular Depth Estimation under Adverse Weather

Kui Jiang,Jing Cao,Zhaocheng Yu,Junjun Jiang,Jingchun Zhou

Main category: cs.CV

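TL;DR: ACDepth is a robust monocular depth estimation method that improves performance in adverse weather through high-quality training data generation and domain adaptation.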

Details

Motivation: Existing methods degrade in adverse weather; ACDepth aims to address this. Method: Uses a diffusion model to generate samples that simulate adverse weather, fine-tunes it with LoRA adapters, and adds a cycle-consistency loss and adversarial training; also adopts a multi-granularity knowledge distillation strategy (MKD) and an ordinal guidance distillation mechanism (OGD). Result: On the nuScenes dataset, ACDepth surpasses md4all-DD on the absRel metric by 2.50% for night scenes and 2.61% for rainy scenes. Conclusion: Through data generation and knowledge distillation, ACDepth markedly improves depth estimation in adverse weather. Abstract: Monocular depth estimation is critical for applications such as autonomous driving and scene reconstruction. While existing methods perform well under normal scenarios, their performance declines in adverse weather, due to challenging domain shifts and difficulties in extracting scene information. To address this issue, we present a robust monocular depth estimation method called ACDepth from the perspective of high-quality training data generation and domain adaptation. Specifically, we introduce a one-step diffusion model for generating samples that simulate adverse weather conditions, constructing a multi-tuple degradation dataset during training. To ensure the quality of the generated degradation samples, we employ LoRA adapters to fine-tune the generation weights of diffusion model. Additionally, we integrate circular consistency loss and adversarial training to guarantee the fidelity and naturalness of the scene contents. Furthermore, we elaborate on a multi-granularity knowledge distillation strategy (MKD) that encourages the student network to absorb knowledge from both the teacher model and pretrained Depth Anything V2. This strategy guides the student model in learning degradation-agnostic scene information from various degradation inputs. In particular, we introduce an ordinal guidance distillation mechanism (OGD) that encourages the network to focus on uncertain regions through differential ranking, leading to a more precise depth estimation. Experimental results demonstrate that our ACDepth surpasses md4all-DD by 2.50% for night scene and 2.61% for rainy scene on the nuScenes dataset in terms of the absRel metric.
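The absRel metric quoted above is the standard absolute relative error for depth estimation: the mean of |predicted − ground truth| / ground truth over pixels with valid ground truth. A minimal sketch follows; treating non-positive ground-truth depth as invalid is a common convention and an assumption here, not something the abstract specifies.

```python
import numpy as np

def abs_rel(pred, gt):
    """Absolute relative depth error over pixels with valid (positive) ground truth."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = gt > 0  # assumed convention: zero depth marks missing LiDAR returns
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

# Hypothetical depths in meters; the last pixel has no ground truth.
error = abs_rel(pred=[11.0, 18.0, 3.0], gt=[10.0, 20.0, 0.0])
```

Lower is better, and because the error is divided by the true depth, a one-meter mistake on a nearby object costs more than the same mistake on a distant one.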

[68] CompBench: Benchmarking Complex Instruction-guided Image Editing

Bohan Jia,Wenxuan Huang,Yuntian Tang,Junbo Qiao,Jincheng Liao,Shaosheng Cao,Fei Zhao,Zhaopeng Feng,Zhouhong Gu,Zhenfei Yin,Lei Bai,Wanli Ouyang,Lin Chen,Fei Zhao,Zihan Wang,Yuan Xie,Shaohui Lin

Main category: cs.CV

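TL;DR: Presents CompBench, a large-scale benchmark for complex instruction-guided image editing that addresses the limited task complexity and coarse instructions of existing benchmarks.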

Details

Motivation: Real-world applications require complex scene editing, but existing benchmarks oversimplify tasks and lack fine-grained instructions. Method: Uses an MLLM-human collaborative framework with tailored task pipelines and proposes an instruction decoupling strategy that separates editing intent into four dimensions: location, appearance, dynamics, and objects. Result: CompBench exposes fundamental limitations of current image editing models. Conclusion: CompBench provides key insights for developing next-generation instruction-guided image editing systems. Abstract: While real-world applications increasingly demand intricate scene manipulation, existing instruction-guided image editing benchmarks often oversimplify task complexity and lack comprehensive, fine-grained instructions. To bridge this gap, we introduce CompBench, a large-scale benchmark specifically designed for complex instruction-guided image editing. CompBench features challenging editing scenarios that incorporate fine-grained instruction following, spatial and contextual reasoning, thereby enabling comprehensive evaluation of image editing models' precise manipulation capabilities. To construct CompBench, we propose an MLLM-human collaborative framework with tailored task pipelines. Furthermore, we propose an instruction decoupling strategy that disentangles editing intents into four key dimensions: location, appearance, dynamics, and objects, ensuring closer alignment between instructions and complex editing requirements. Extensive evaluations reveal that CompBench exposes fundamental limitations of current image editing models and provides critical insights for the development of next-generation instruction-guided image editing systems.

[69] Road Segmentation for ADAS/AD Applications

Mathanesh Vellingiri Ramasamy,Dimas Rizky Kurniasalim

Main category: cs.CV

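TL;DR: The study examines how model architecture and dataset choice affect road segmentation by training VGG-16 and U-Net on different datasets, and finds that VGG-16 performs better.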

Details

Motivation: Accurate road segmentation is essential for autonomous driving and ADAS, so the effects of model architecture and dataset need to be studied. Method: Trains a modified VGG-16 on the Comma10k dataset and a modified U-Net on the KITTI Road dataset, then tests across datasets. Result: VGG-16 outperforms U-Net even though U-Net was trained for more epochs. Conclusion: Model architecture and dataset choice significantly affect road segmentation performance, and VGG-16 does better in cross-dataset testing. Abstract: Accurate road segmentation is essential for autonomous driving and ADAS, enabling effective navigation in complex environments. This study examines how model architecture and dataset choice affect segmentation by training a modified VGG-16 on the Comma10k dataset and a modified U-Net on the KITTI Road dataset. Both models achieved high accuracy, with cross-dataset testing showing VGG-16 outperforming U-Net despite U-Net being trained for more epochs. We analyze model performance using metrics such as F1-score, mean intersection over union, and precision, discussing how architecture and dataset impact results.
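The metrics named in the abstract (F1-score, intersection over union, and precision) all derive from the same pixel-level confusion counts. The sketch below computes them for a single binary road mask; the function name and toy masks are hypothetical, and mean IoU as reported in the paper would average the per-class IoU over classes (for example road and background), which is not shown here.

```python
import numpy as np

def binary_seg_metrics(pred, target):
    """Precision, recall, F1, and IoU for boolean road masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = int(np.sum(pred & target))    # road pixels correctly predicted
    fp = int(np.sum(pred & ~target))   # background predicted as road
    fn = int(np.sum(~pred & target))   # road pixels missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}

# Toy four-pixel example: one hit, one false alarm, one miss, one true negative.
metrics = binary_seg_metrics(pred=[1, 1, 0, 0], target=[1, 0, 1, 0])
```

For the same prediction IoU is never higher than F1, so scores reported under the two metrics are not directly comparable across papers.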

[70] Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind

Qingmei Li,Yang Zhang,Zurong Mai,Yuhang Chen,Shuohong Lou,Henglian Huang,Jiarui Zhang,Zhiwei Zhang,Yibin Wen,Weijia Li,Haohuan Fu,Jianxi Huang,Juepeng Zheng

Main category: cs.CV

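TL;DR: AgroMind is a comprehensive agricultural remote sensing benchmark covering four task dimensions; it evaluates 18 open-source and 3 closed-source LMMs and reveals performance gaps in spatial reasoning and fine-grained recognition.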

Details

Motivation: Existing agricultural remote sensing benchmarks fall short in scene diversity and task design, so a more comprehensive evaluation framework is needed. Method: Multi-source data are collected and preprocessed, diverse agriculture-related questions are defined, and LMMs are used for inference and evaluation. Result: Experiments show that LMMs perform poorly on spatial reasoning and fine-grained recognition, yet several models outperform humans. Conclusion: AgroMind provides a standardized evaluation framework for agricultural remote sensing, reveals the domain-knowledge limitations of LMMs, and points to directions for future research. Abstract: Large Multimodal Models (LMMs) have demonstrated capabilities across various domains, but comprehensive benchmarks for agricultural remote sensing (RS) remain scarce. Existing benchmarks designed for agricultural RS scenarios exhibit notable limitations, primarily in terms of insufficient scene diversity in the dataset and oversimplified task design. To bridge this gap, we introduce AgroMind, a comprehensive agricultural remote sensing benchmark covering four task dimensions: spatial perception, object understanding, scene understanding, and scene reasoning, with a total of 13 task types, ranging from crop identification and health monitoring to environmental analysis. We curate a high-quality evaluation set by integrating eight public datasets and one private farmland plot dataset, containing 25,026 QA pairs and 15,556 images. The pipeline begins with multi-source data preprocessing, including collection, format standardization, and annotation refinement. We then generate a diverse set of agriculturally relevant questions through the systematic definition of tasks. Finally, we employ LMMs for inference, generating responses, and performing detailed examinations. We evaluated 18 open-source LMMs and 3 closed-source models on AgroMind. Experiments reveal significant performance gaps, particularly in spatial reasoning and fine-grained recognition; notably, human performance lags behind several leading LMMs. By establishing a standardized evaluation framework for agricultural RS, AgroMind reveals the limitations of LMMs in domain knowledge and highlights critical challenges for future work. Data and code can be accessed at https://rssysu.github.io/AgroMind/.

[71] Hyperspectral Image Land Cover Captioning Dataset for Vision Language Models

Aryan Das,Tanishq Rachamalla,Pravendra Singh,Koushik Biswas,Vinay Kumar Verma,Swalpa Kumar Roy

Main category: cs.CV

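TL;DR: HyperCap is a large-scale hyperspectral captioning dataset designed to improve model performance and effectiveness in remote sensing applications.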

Details

Motivation: Traditional hyperspectral imaging datasets focus only on classification, whereas HyperCap pairs spectral data with pixel-wise textual annotations to enable deeper semantic understanding of hyperspectral imagery. Method: The dataset is built from four benchmark datasets and annotated with a hybrid of automated and manual methods to ensure accuracy. Result: Empirical evaluations with state-of-the-art encoders and several fusion techniques show significant gains in classification performance. Conclusion: HyperCap demonstrates the potential of vision-language learning in hyperspectral imaging and provides a foundational dataset for future research. Abstract: We introduce HyperCap, the first large-scale hyperspectral captioning dataset designed to enhance model performance and effectiveness in remote sensing applications. Unlike traditional hyperspectral imaging (HSI) datasets that focus solely on classification tasks, HyperCap integrates spectral data with pixel-wise textual annotations, enabling deeper semantic understanding of hyperspectral imagery. This dataset enhances model performance in tasks like classification and feature extraction, providing a valuable resource for advanced remote sensing applications. HyperCap is constructed from four benchmark datasets and annotated through a hybrid approach combining automated and manual methods to ensure accuracy and consistency. Empirical evaluations using state-of-the-art encoders and diverse fusion techniques demonstrate significant improvements in classification performance. These results underscore the potential of vision-language learning in HSI and position HyperCap as a foundational dataset for future research in the field.

[72] From Low Field to High Value: Robust Cortical Mapping from Low-Field MRI

Karthik Gopinath,Annabel Sorby-Adams,Jonathan W. Ramirez,Dina Zemlyanker,Jennifer Guo,David Hunt,Christine L. Mac Donald,C. Dirk Keene,Timothy Coalson,Matthew F. Glasser,David Van Essen,Matthew S. Rosen,Oula Puonti,W. Taylor Kimberly,Juan Eugenio Iglesias

Main category: cs.CV

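TL;DR: Proposes a machine-learning method for 3D cortical surface reconstruction from low-field MRI (LF-MRI) that suits portable scanners, works without retraining, and is validated across multiple contrasts and resolutions.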

Details

Motivation: High-field MRI (HF-MRI) is the standard but its availability is limited; low-field MRI (LF-MRI) is cheap and portable, but existing tools cope poorly with its low signal-to-noise ratio and resolution. Method: A 3D U-Net trained on synthetic LF-MRI predicts signed distance functions of cortical surfaces, followed by geometric processing to ensure topological accuracy. Result: On 3mm isotropic T2-weighted scans, results agree closely with HF-MRI reconstructions (surface area r=0.96, cortical parcellation Dice=0.98, gray matter volume r=0.93), although the cortical thickness correlation is lower (r=0.70). Conclusion: The method offers a feasible route to cortical surface analysis on portable LF-MRI, and the code is open source. Abstract: Three-dimensional reconstruction of cortical surfaces from MRI for morphometric analysis is fundamental for understanding brain structure. While high-field MRI (HF-MRI) is standard in research and clinical settings, its limited availability hinders widespread use. Low-field MRI (LF-MRI), particularly portable systems, offers a cost-effective and accessible alternative. However, existing cortical surface analysis tools are optimized for high-resolution HF-MRI and struggle with the lower signal-to-noise ratio and resolution of LF-MRI. In this work, we present a machine learning method for 3D reconstruction and analysis of portable LF-MRI across a range of contrasts and resolutions. Our method works "out of the box" without retraining. It uses a 3D U-Net trained on synthetic LF-MRI to predict signed distance functions of cortical surfaces, followed by geometric processing to ensure topological accuracy. We evaluate our method using paired HF/LF-MRI scans of the same subjects, showing that LF-MRI surface reconstruction accuracy depends on acquisition parameters, including contrast type (T1 vs T2), orientation (axial vs isotropic), and resolution. A 3mm isotropic T2-weighted scan acquired in under 4 minutes yields strong agreement with HF-derived surfaces: surface area correlates at r=0.96, cortical parcellations reach Dice=0.98, and gray matter volume achieves r=0.93. Cortical thickness remains more challenging with correlations up to r=0.70, reflecting the difficulty of sub-mm precision with 3mm voxels. We further validate our method on challenging postmortem LF-MRI, demonstrating its robustness. Our method represents a step toward enabling cortical surface analysis on portable LF-MRI. Code is available at https://surfer.nmr.mgh.harvard.edu/fswiki/ReconAny

[73] NOFT: Test-Time Noise Finetune via Information Bottleneck for Highly Correlated Asset Creation

Jia Li,Nan Gao,Huaibo Huang,Ran He

Main category: cs.CV

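TL;DR: Proposes the NOFT module, which optimizes noise latents to generate highly correlated yet diverse images with only a small number of parameters and little training time.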

Details

Motivation: Existing diffusion models do not fully exploit the information in noise latents; NOFT explores their potential for high-fidelity, controllable image generation. Method: Seed noise or inverse noise is fine-tuned through an optimal-transport information bottleneck (OT-IB), requiring only about 14K trainable parameters and 10 minutes of training. Result: NOFT efficiently generates high-fidelity, diverse images with topology and texture alignment. Conclusion: NOFT is an efficient general-purpose method for fine-tuning 2D/3D AIGC assets under text or image guidance. Abstract: The diffusion model has provided a strong tool for implementing text-to-image (T2I) and image-to-image (I2I) generation. Recently, topology and texture control are popular explorations, e.g., ControlNet, IP-Adapter, Ctrl-X, and DSG. These methods explicitly consider high-fidelity controllable editing based on external signals or diffusion feature manipulations. As for diversity, they directly choose different noise latents. However, the diffused noise is capable of implicitly representing the topological and textural manifold of the corresponding image. Moreover, it's an effective workbench to conduct the trade-off between content preservation and controllable variations. Previous T2I and I2I diffusion works do not explore the information within the compressed contextual latent. In this paper, we first propose a plug-and-play noise fine-tune (NOFT) module employed by Stable Diffusion to generate highly correlated and diverse images. We fine-tune seed noise or inverse noise through an optimal-transported (OT) information bottleneck (IB) with only around 14K trainable parameters and 10 minutes of training. Our test-time NOFT is good at producing high-fidelity image variations considering topology and texture alignments. Comprehensive experiments demonstrate that NOFT is a powerful general reimagine approach to efficiently fine-tune the 2D/3D AIGC assets with text or image guidance.

[74] From Shots to Stories: LLM-Assisted Video Editing with Unified Language Representations

Yuzhi Li,Haojun Xu,Feng Tian

Main category: cs.CV

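TL;DR: The paper presents the first systematic study of LLMs in video editing and proposes L-Storyboard and StoryFlow, which significantly improve task accuracy and logical coherence.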

Details

Motivation: Although LLMs and VLMs perform well in video understanding, their application to video editing remains underexplored. Method: L-Storyboard is introduced as an intermediate representation that converts video into structured language descriptions, and the StoryFlow strategy is proposed to address the instability of divergent tasks. Result: Experiments show that L-Storyboard strengthens the mapping between visual information and language descriptions, and that StoryFlow improves logical consistency and output stability. Conclusion: LLMs have great potential in intelligent video editing, and L-Storyboard and StoryFlow offer effective solutions for the related tasks. Abstract: Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable reasoning and generalization capabilities in video understanding; however, their application in video editing remains largely underexplored. This paper presents the first systematic study of LLMs in the context of video editing. To bridge the gap between visual information and language-based reasoning, we introduce L-Storyboard, an intermediate representation that transforms discrete video shots into structured language descriptions suitable for LLM processing. We categorize video editing tasks into Convergent Tasks and Divergent Tasks, focusing on three core tasks: Shot Attributes Classification, Next Shot Selection, and Shot Sequence Ordering. To address the inherent instability of divergent task outputs, we propose the StoryFlow strategy, which converts the divergent multi-path reasoning process into a convergent selection mechanism, effectively enhancing task accuracy and logical coherence. Experimental results demonstrate that L-Storyboard facilitates a more robust mapping between visual information and language descriptions, significantly improving the interpretability and privacy protection of video editing tasks. Furthermore, StoryFlow enhances the logical consistency and output stability in Shot Sequence Ordering, underscoring the substantial potential of LLMs in intelligent video editing.

[75] SEPT: Standard-Definition Map Enhanced Scene Perception and Topology Reasoning for Autonomous Driving

Muleilan Pei,Jiayao Shan,Peiliang Li,Jieqi Shi,Jing Huo,Yang Gao,Shaojie Shen

Main category: cs.CV

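TL;DR: The paper proposes SEPT, a framework that uses Standard-Definition (SD) maps as prior knowledge to improve scene perception and topology reasoning for autonomous vehicles in mapless driving systems.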

Details

Motivation: Existing online scene understanding methods are limited in long-range or occluded scenarios, mainly because of the inherent constraints of onboard sensors. Method: A hybrid feature fusion strategy combines SD maps with Bird's-Eye-View (BEV) features, and an auxiliary intersection-aware keypoint detection task is added. Result: Experiments on the OpenLane-V2 dataset show that SEPT significantly improves scene perception and topology reasoning and outperforms existing methods. Conclusion: By effectively integrating SD map priors, SEPT significantly improves scene understanding for autonomous driving systems. Abstract: Online scene perception and topology reasoning are critical for autonomous vehicles to understand their driving environments, particularly for mapless driving systems that endeavor to reduce reliance on costly High-Definition (HD) maps. However, recent advances in online scene understanding still face limitations, especially in long-range or occluded scenarios, due to the inherent constraints of onboard sensors. To address this challenge, we propose a Standard-Definition (SD) Map Enhanced scene Perception and Topology reasoning (SEPT) framework, which explores how to effectively incorporate the SD map as prior knowledge into existing perception and reasoning pipelines. Specifically, we introduce a novel hybrid feature fusion strategy that combines SD maps with Bird's-Eye-View (BEV) features, considering both rasterized and vectorized representations, while mitigating potential misalignment between SD maps and BEV feature spaces. Additionally, we leverage the SD map characteristics to design an auxiliary intersection-aware keypoint detection task, which further enhances the overall scene understanding performance. Experimental results on the large-scale OpenLane-V2 dataset demonstrate that by effectively integrating SD map priors, our framework significantly improves both scene perception and topology reasoning, outperforming existing methods by a substantial margin.

[76] SMFusion: Semantic-Preserving Fusion of Multimodal Medical Images for Enhanced Clinical Diagnosis

Haozhe Xiang,Han Zhang,Yu Cheng,Xiongwen Quan,Wanwan Huang

Main category: cs.CV

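TL;DR: Proposes a novel semantic-guided multimodal medical image fusion method that, for the first time, incorporates medical prior knowledge into the fusion process, improves fusion through semantic alignment and text-injection modules, and generates diagnostic reports to assess information preservation.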

Details

Motivation: Existing medical image fusion methods mostly follow computer vision standards and overlook the rich semantic information in medical images, so key medical information is lost. Method: A multimodal medical image-text dataset is constructed, BiomedGPT generates text descriptions, a semantic interaction alignment module aligns text and image features in a high-dimensional space, and a medical semantic loss function strengthens information retention. Result: Experiments show that the method performs well in both qualitative and quantitative evaluations and preserves more critical medical information. Conclusion: The proposed semantic-guided method significantly improves medical image fusion and provides a more reliable aid for medical diagnosis. Abstract: Multimodal medical image fusion plays a crucial role in medical diagnosis by integrating complementary information from different modalities to enhance image readability and clinical applicability. However, existing methods mainly follow computer vision standards for feature extraction and fusion strategy formulation, overlooking the rich semantic information inherent in medical images. To address this limitation, we propose a novel semantic-guided medical image fusion approach that, for the first time, incorporates medical prior knowledge into the fusion process. Specifically, we construct a publicly available multimodal medical image-text dataset, upon which text descriptions generated by BiomedGPT are encoded and semantically aligned with image features in a high-dimensional space via a semantic interaction alignment module. During this process, a cross-attention-based linear transformation automatically maps the relationship between textual and visual features to facilitate comprehensive learning. The aligned features are then embedded into a text-injection module for further feature-level fusion. Unlike traditional methods, we further generate diagnostic reports from the fused images to assess the preservation of medical information. Additionally, we design a medical semantic loss function to enhance the retention of textual cues from the source images. Experimental results on test datasets demonstrate that the proposed method achieves superior performance in both qualitative and quantitative evaluations while preserving more critical medical information.

[77] LLaVA-4D: Embedding SpatioTemporal Prompt into LMMs for 4D Scene Understanding

Hanyu Zhou,Gim Hee Lee

Main category: cs.CV

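TL;DR: LLaVA-4D proposes a new spatiotemporal prompt that encodes 3D position and 1D time into a dynamic-aware 4D coordinate embedding, improving large multimodal models' understanding of 4D scenes.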

Details

Motivation: Existing 3D large multimodal models understand only the static background and cannot capture how dynamic objects change over time, which limits their use in the physical world. Method: The LLaVA-4D framework generates spatiotemporal prompts and embeds them into visual features; a 4D vision-language dataset with spatiotemporal coordinate annotations is also built for fine-tuning. Result: Experiments show that the method is effective on 4D scene understanding tasks and can distinguish the background from dynamic objects. Conclusion: Through spatiotemporal prompts and feature alignment, LLaVA-4D significantly improves understanding of dynamic scenes. Abstract: Despite achieving significant progress in 2D image understanding, large multimodal models (LMMs) struggle in the physical world due to the lack of spatial representation. Typically, existing 3D LMMs mainly embed 3D positions as fixed spatial prompts within visual features to represent the scene. However, these methods are limited to understanding the static background and fail to capture temporally varying dynamic objects. In this paper, we propose LLaVA-4D, a general LMM framework with a novel spatiotemporal prompt for visual representation in 4D scene understanding. The spatiotemporal prompt is generated by encoding 3D position and 1D time into a dynamic-aware 4D coordinate embedding. Moreover, we demonstrate that spatial and temporal components disentangled from visual features are more effective in distinguishing the background from objects. This motivates embedding the 4D spatiotemporal prompt into these features to enhance the dynamic scene representation. By aligning visual spatiotemporal embeddings with language embeddings, LMMs gain the ability to understand both spatial and temporal characteristics of static background and dynamic objects in the physical world. Additionally, we construct a 4D vision-language dataset with spatiotemporal coordinate annotations for instruction fine-tuning LMMs. Extensive experiments have been conducted to demonstrate the effectiveness of our method across different tasks in 4D scene understanding.
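
One generic way to encode a 3D position plus a 1D timestamp into a single 4D coordinate embedding is to apply a sinusoidal encoding per axis and concatenate the results. The sketch below illustrates only that generic idea; the abstract does not specify the encoding LLaVA-4D uses, and the names `sinusoidal` and `spatiotemporal_embedding`, the `base` constant, and the per-axis width are all assumptions.

```python
import math


def sinusoidal(value, dim, base=10000.0):
    """Encode one scalar coordinate as `dim` interleaved sin/cos features."""
    features = []
    for i in range(dim // 2):
        freq = 1.0 / (base ** (2 * i / dim))
        features.extend((math.sin(value * freq), math.cos(value * freq)))
    return features


def spatiotemporal_embedding(x, y, z, t, dim_per_axis=8):
    """Concatenate per-axis encodings of (x, y, z, t) into one 4D embedding.

    Hypothetical layout: the first 3 * dim_per_axis values carry position,
    the last dim_per_axis values carry time.
    """
    embedding = []
    for coordinate in (x, y, z, t):
        embedding.extend(sinusoidal(coordinate, dim_per_axis))
    return embedding
```

With this layout, two observations of the same point at different times share their spatial features and differ only in the temporal slice, which is one simple way to keep static and dynamic content separable.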

[78] MMS-VPR: Multimodal Street-Level Visual Place Recognition Dataset and Benchmark

Yiwei Ou,Xiaobin Ren,Ronggui Sun,Guansong Gao,Ziyi Jiang,Kaiqi Zhao,Manfredo Manfredini

Main category: cs.CV

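TL;DR: MMS-VPR is a large-scale multimodal dataset for street-level place recognition in complex pedestrian environments, addressing gaps in existing datasets.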

Details

Motivation: Existing VPR datasets rely mainly on vehicle-mounted imagery, lack multimodal diversity, and underrepresent dense, mixed-use street scenes in non-Western urban environments. Method: Following a systematic data collection protocol, 78,575 annotated images and 2,512 video clips were captured in a commercial district in Chengdu under varied lighting, viewpoints, and times, and organized into a spatial graph structure. Result: Experiments show that models exploiting multimodal and structural cues improve substantially. Conclusion: MMS-VPR provides a new resource for research at the intersection of computer vision, geospatial understanding, and multimodal reasoning. Abstract: Existing visual place recognition (VPR) datasets predominantly rely on vehicle-mounted imagery, lack multimodal diversity, and underrepresent dense, mixed-use street-level spaces, especially in non-Western urban contexts. To address these gaps, we introduce MMS-VPR, a large-scale multimodal dataset for street-level place recognition in complex, pedestrian-only environments. The dataset comprises 78,575 annotated images and 2,512 video clips captured across 207 locations in a ~70,800 $\mathrm{m}^2$ open-air commercial district in Chengdu, China. Each image is labeled with precise GPS coordinates, timestamp, and textual metadata, and covers varied lighting conditions, viewpoints, and timeframes. MMS-VPR follows a systematic and replicable data collection protocol with minimal device requirements, lowering the barrier for scalable dataset creation. Importantly, the dataset forms an inherent spatial graph with 125 edges, 81 nodes, and 1 subgraph, enabling structure-aware place recognition. We further define two application-specific subsets -- Dataset_Edges and Dataset_Points -- to support fine-grained and graph-based evaluation tasks. Extensive benchmarks using conventional VPR models, graph neural networks, and multimodal baselines show substantial improvements when leveraging multimodal and structural cues. MMS-VPR facilitates future research at the intersection of computer vision, geospatial understanding, and multimodal reasoning. The dataset is publicly available at https://huggingface.co/datasets/Yiwei-Ou/MMS-VPR.

[79] PMQ-VE: Progressive Multi-Frame Quantization for Video Enhancement

ZhanFeng Feng,Long Peng,Xin Di,Yong Guo,Wenbo Li,Yulun Zhang,Renjing Pei,Yang Wang,Yang Cao,Zheng-Jun Zha

Main category: cs.CV

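TL;DR: The paper proposes PMQ-VE, a new quantization method for video enhancement that uses a two-stage progressive strategy to counter the performance degradation existing quantization methods suffer on video enhancement tasks.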

Details

Motivation: Transformer-based video enhancement methods have high compute and memory demands; quantization improves efficiency, but applying it directly causes performance degradation and loss of detail. Method: The PMQ-VE framework uses a two-stage strategy consisting of BMFQ (Backtracking-based Multi-Frame Quantization) and PMTD (Progressive Multi-Teacher Distillation). Result: Experiments show that PMQ-VE outperforms existing methods across multiple tasks and benchmarks, reaching state-of-the-art performance. Conclusion: PMQ-VE offers an efficient quantization solution for video enhancement, and the code will be open-sourced. Abstract: Multi-frame video enhancement tasks aim to improve the spatial and temporal resolution and quality of video sequences by leveraging temporal information from multiple frames, which are widely used in streaming video processing, surveillance, and generation. Although numerous Transformer-based enhancement methods have achieved impressive performance, their computational and memory demands hinder deployment on edge devices. Quantization offers a practical solution by reducing the bit-width of weights and activations to improve efficiency. However, directly applying existing quantization methods to video enhancement tasks often leads to significant performance degradation and loss of fine details. This stems from two limitations: (a) inability to allocate varying representational capacity across frames, which results in suboptimal dynamic range adaptation; (b) over-reliance on full-precision teachers, which limits the learning of low-bit student models. To tackle these challenges, we propose a novel quantization method for video enhancement: Progressive Multi-Frame Quantization for Video Enhancement (PMQ-VE). This framework features a coarse-to-fine two-stage process: Backtracking-based Multi-Frame Quantization (BMFQ) and Progressive Multi-Teacher Distillation (PMTD). BMFQ utilizes a percentile-based initialization and iterative search with pruning and backtracking for robust clipping bounds. PMTD employs a progressive distillation strategy with both full-precision and multiple high-bit (INT) teachers to enhance low-bit models' capacity and quality. Extensive experiments demonstrate that our method outperforms existing approaches, achieving state-of-the-art performance across multiple tasks and benchmarks. The code will be made publicly available at: https://github.com/xiaoBIGfeng/PMQ-VE.
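
The percentile-based initialization of clipping bounds mentioned for BMFQ can be illustrated with a plain uniform quantizer. This is a minimal sketch of that one ingredient only; it does not implement the iterative search, pruning, or backtracking, and the function names and default values (`bits=4`, `q=99.0`) are assumptions, not taken from the paper.

```python
def percentile(values, q):
    """Linearly interpolated percentile of `values`, with q in [0, 100]."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)  # floor, since position is non-negative
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] * (1 - fraction) + ordered[upper] * fraction


def quantize_with_percentile_clipping(values, bits=4, q=99.0):
    """Clip to the [100 - q, q] percentile range, then quantize uniformly.

    Returns the dequantized values and the (low, high) clipping bounds.
    """
    low = percentile(values, 100.0 - q)
    high = percentile(values, q)
    levels = 2 ** bits - 1
    scale = (high - low) / levels if high > low else 1.0
    dequantized = []
    for v in values:
        clipped = min(max(v, low), high)
        code = round((clipped - low) / scale)  # integer code in [0, levels]
        dequantized.append(low + code * scale)
    return dequantized, (low, high)
```

Clipping at percentiles instead of the raw minimum and maximum keeps a few outlier activations from stretching the quantization grid, which matters most at low bit-widths.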

[80] Context-Aware Autoregressive Models for Multi-Conditional Image Generation

Yixiao Chen,Zhiyuan Ma,Guoli Jia,Che Jiang,Jianjun Li,Bowen Zhou

Main category: cs.CV

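TL;DR: ContextAR is a multi-conditional image generation framework that embeds different conditions into a unified token sequence and combines hybrid positional encodings with a condition-aware attention mechanism to achieve efficient, flexible multi-conditional control.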

Details

Motivation: To handle conditions of different modalities in a unified way for multi-conditional image generation while maintaining spatial alignment and distinguishing condition types. Method: The ContextAR framework embeds diverse conditions into the token sequence, uses hybrid positional encodings (rotary and learnable), and designs a condition-aware attention mechanism to reduce computational complexity. Result: Experiments show strong controllability and versatility on multi-conditional control tasks, with performance comparable to the existing autoregressive baseline. Conclusion: ContextAR offers a concise and efficient solution for multi-conditional image generation and supports arbitrary combinations of conditions at inference time. Abstract: Autoregressive transformers have recently shown impressive image generation quality and efficiency on par with state-of-the-art diffusion models. Unlike diffusion architectures, autoregressive models can naturally incorporate arbitrary modalities into a single, unified token sequence--offering a concise solution for multi-conditional image generation tasks. In this work, we propose ContextAR, a flexible and effective framework for multi-conditional image generation. ContextAR embeds diverse conditions (e.g., canny edges, depth maps, poses) directly into the token sequence, preserving modality-specific semantics. To maintain spatial alignment while enhancing discrimination among different condition types, we introduce hybrid positional encodings that fuse Rotary Position Embedding with Learnable Positional Embedding. We design Conditional Context-aware Attention to reduce computational complexity while preserving effective intra-condition perception. Without any fine-tuning, ContextAR supports arbitrary combinations of conditions during inference time. Experimental results demonstrate the powerful controllability and versatility of our approach, and show performance competitive with diffusion-based multi-conditional control approaches and the existing autoregressive baseline across diverse multi-condition driven scenarios. Project page: https://context-ar.github.io/
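
The hybrid positional encoding described here pairs a rotary embedding, which carries position, with a learnable embedding, which can mark the condition type. The sketch below shows one plausible way to combine the two: add a per-type vector, then rotate. The abstract does not state how ContextAR fuses them, so the function `hybrid_encode` and the add-then-rotate order are assumptions; `rotary` follows the standard Rotary Position Embedding formulation.

```python
import math


def rotary(vec, pos, base=10000.0):
    """Apply rotary position embedding: rotate each (even, odd) pair of
    `vec` by an angle proportional to `pos`."""
    if len(vec) % 2:
        raise ValueError("rotary embedding needs an even dimension")
    dim = len(vec)
    rotated = []
    for i in range(0, dim, 2):
        angle = pos / (base ** (i / dim))
        c, s = math.cos(angle), math.sin(angle)
        a, b = vec[i], vec[i + 1]
        rotated.extend((a * c - b * s, a * s + b * c))
    return rotated


def hybrid_encode(token, pos, type_embedding):
    """Add a (learnable) condition-type embedding, then apply rotary encoding.

    `type_embedding` stands in for a trained parameter vector, one per
    condition type such as edges, depth, or pose (hypothetical setup).
    """
    shifted = [v + e for v, e in zip(token, type_embedding)]
    return rotary(shifted, pos)
```

Because rotation preserves dot products up to relative offset, tokens from different condition maps at the same spatial location stay aligned, while the added type vector lets attention tell the condition types apart.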

[81] Temporal-Spectral-Spatial Unified Remote Sensing Dense Prediction

Sijie Zhao,Feng Liu,Xueliang Zhang,Hao Chen,Pengfeng Xiao,Lei Bai

Main category: cs.CV

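TL;DR: This paper proposes TSSUN, a new network architecture that handles the heterogeneity of remote sensing data across temporal, spectral, and spatial (TSS) dimensions in a unified way, and validates its efficiency and generality experimentally.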

Details

Motivation: Remote sensing data are highly heterogeneous across temporal, spectral, and spatial dimensions, so existing deep learning models degrade or need extensive adjustment when input-output configurations change. Method: TSSUN adopts a TSS unified strategy that uses meta-information to standardize inputs and outputs, and proposes a Local-Global Window Attention mechanism to strengthen feature extraction. Result: Experiments show that TSSUN adapts to heterogeneous inputs and unifies multiple dense prediction tasks, matching or surpassing state-of-the-art methods. Conclusion: TSSUN demonstrates robustness and generality in complex remote sensing applications without task-specific modifications. Abstract: The proliferation of diverse remote sensing data has spurred advancements in dense prediction tasks, yet significant challenges remain in handling data heterogeneity. Remote sensing imagery exhibits substantial variability across temporal, spectral, and spatial (TSS) dimensions, complicating unified data processing. Current deep learning models for dense prediction tasks, such as semantic segmentation and change detection, are typically tailored to specific input-output configurations. Consequently, variations in data dimensionality or task requirements often lead to significant performance degradation or model incompatibility, necessitating costly retraining or fine-tuning efforts for different application scenarios. This paper introduces the Temporal-Spectral-Spatial Unified Network (TSSUN), a novel architecture designed for unified representation and modeling of remote sensing data across diverse TSS characteristics and task types. TSSUN employs a Temporal-Spectral-Spatial Unified Strategy that leverages meta-information to decouple and standardize input representations from varied temporal, spectral, and spatial configurations, and similarly unifies output structures for different dense prediction tasks and class numbers. Furthermore, a Local-Global Window Attention mechanism is proposed to efficiently capture both local contextual details and global dependencies, enhancing the model's adaptability and feature extraction capabilities. Extensive experiments on multiple datasets demonstrate that a single TSSUN model effectively adapts to heterogeneous inputs and unifies various dense prediction tasks. The proposed approach consistently achieves or surpasses state-of-the-art performance, highlighting its robustness and generalizability for complex remote sensing applications without requiring task-specific modifications.

[82] LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images?

Maoyuan Ye,Jing Zhang,Juhua Liu,Bo Du,Dacheng Tao

Main category: cs.CV

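TL;DR: LogicOCR is a benchmark of 1,100 multiple-choice questions for evaluating the logical reasoning of large multimodal models (LMMs) on text-rich images, designed to minimize reliance on domain knowledge.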

Details

Motivation: Although LMMs have made significant progress in reasoning and OCR, their performance on complex logical reasoning over text-rich images remains underexplored. Method: LogicOCR is built from a text corpus drawn from the Chinese National Civil Servant Examination, with an automated pipeline that generates multimodal samples: prompt templates produce diverse images, which are then manually verified for quality. Result: Evaluation shows that LMMs still lag in multimodal reasoning compared with text-only inputs and are sensitive to visual-text orientation. Conclusion: LogicOCR is a valuable resource for advancing multimodal reasoning research, and the dataset is open source. Abstract: Recent advances in Large Multimodal Models (LMMs) have significantly improved their reasoning and Optical Character Recognition (OCR) capabilities. However, their performance on complex logical reasoning tasks involving text-rich images remains underexplored. To bridge this gap, we introduce LogicOCR, a benchmark comprising 1,100 multiple-choice questions designed to evaluate LMMs' logical reasoning abilities on text-rich images, while minimizing reliance on domain-specific knowledge (e.g., mathematics). We construct LogicOCR by curating a text corpus from the Chinese National Civil Servant Examination and develop a scalable, automated pipeline to convert it into multimodal samples. First, we design prompt templates to steer GPT-Image-1 to generate images with diverse backgrounds, interleaved text-illustration layouts, and varied fonts, ensuring contextual relevance and visual realism. Then, the generated images are manually verified, with low-quality examples discarded. We evaluate a range of representative open-source and proprietary LMMs under both Chain-of-Thought (CoT) and direct-answer settings. Our multi-dimensional analysis reveals key insights, such as the impact of test-time scaling, input modality differences, and sensitivity to visual-text orientation. Notably, LMMs still lag in multimodal reasoning compared to text-only inputs, indicating that they have not fully bridged visual reading with reasoning. We hope LogicOCR will serve as a valuable resource for advancing multimodal reasoning research. The dataset is available at https://github.com/MiliLab/LogicOCR.

[83] DNOI-4DRO: Deep 4D Radar Odometry with Differentiable Neural-Optimization Iterations

Shouyi Lu,Huanyu Zhou,Guirong Zhuo

Main category: cs.CV

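TL;DR: Proposes DNOI-4DRO, a 4D radar odometry model that combines learning with optimization by coupling a neural network with geometric optimization, significantly improving performance.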

Details

Motivation: Traditional methods perform poorly on sparse 4D radar point clouds, so neural networks need to be combined with optimization to improve accuracy. Method: A neural network estimates point-wise motion flow, a cost function is built from the relationship between point motion and pose, and the radar pose is refined with Gauss-Newton updates; a dual-stream 4D radar backbone is also designed. Result: The model performs strongly on the VoD and Snail-Radar datasets and even approaches the LiDAR-based A-LOAM method. Conclusion: By combining learning and optimization, DNOI-4DRO significantly improves the accuracy of 4D radar odometry, and the code will be open-sourced. Abstract: A novel learning-optimization-combined 4D radar odometry model, named DNOI-4DRO, is proposed in this paper. The proposed model seamlessly integrates traditional geometric optimization with end-to-end neural network training, leveraging an innovative differentiable neural-optimization iteration operator. In this framework, point-wise motion flow is first estimated using a neural network, followed by the construction of a cost function based on the relationship between point motion and pose in 3D space. The radar pose is then refined using Gauss-Newton updates. Additionally, we design a dual-stream 4D radar backbone that integrates multi-scale geometric features and clustering-based class-aware features to enhance the representation of sparse 4D radar point clouds. Extensive experiments on the VoD and Snail-Radar datasets demonstrate the superior performance of our model, which outperforms recent classical and learning-based approaches. Notably, our method even achieves results comparable to A-LOAM with mapping optimization using LiDAR point clouds as input. Our models and code will be publicly released.
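
The step of refining a pose from estimated point-wise motion flow with Gauss-Newton updates can be sketched in a simplified 2D setting. This is an illustrative reduction, not the paper's operator: the paper works in 3D with learned flow inside a differentiable pipeline, whereas the sketch below assumes a planar pose (rotation angle plus translation), noise-free flow, and hypothetical helper names `solve3` and `gauss_newton_pose`.

```python
import math


def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(3):
        pivot = max(range(c, 3), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, 3):
            factor = M[r][c] / M[c][c]
            for k in range(c, 4):
                M[r][k] -= factor * M[c][k]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (M[r][3] - sum(M[r][k] * x[k] for k in range(r + 1, 3))) / M[r][r]
    return x


def gauss_newton_pose(points, flows, iters=10):
    """Estimate (theta, tx, ty) so that R(theta) p + t matches p + flow."""
    theta, tx, ty = 0.0, 0.0, 0.0
    targets = [(px + fx, py + fy) for (px, py), (fx, fy) in zip(points, flows)]
    for _ in range(iters):
        c, s = math.cos(theta), math.sin(theta)
        JTJ = [[0.0] * 3 for _ in range(3)]
        JTr = [0.0] * 3
        for (px, py), (qx, qy) in zip(points, targets):
            # Residual of the current pose and its Jacobian rows (x and y).
            rows = (
                ((-s * px - c * py, 1.0, 0.0), c * px - s * py + tx - qx),
                ((c * px - s * py, 0.0, 1.0), s * px + c * py + ty - qy),
            )
            for J, r in rows:
                for a in range(3):
                    JTr[a] += J[a] * r
                    for b in range(3):
                        JTJ[a][b] += J[a] * J[b]
        # Normal equations: (J^T J) delta = -J^T r.
        delta = solve3(JTJ, [-v for v in JTr])
        theta, tx, ty = theta + delta[0], tx + delta[1], ty + delta[2]
    return theta, tx, ty
```

Each iteration involves only differentiable arithmetic and a linear solve, which is what allows an update of this kind to sit inside an end-to-end trained network.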

[84] Visuospatial Cognitive Assistant

Qi Feng,Hidetoshi Shimodaira

Main category: cs.CV

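TL;DR: The paper introduces the ViCA-322K dataset and the ViCA-7B model to improve video-based spatial cognition, achieving the best performance on multiple tasks.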

Details

Motivation: Video-based spatial cognition is vital for robotics and AI, but it challenges existing vision-language models (VLMs). Method: The ViCA-322K dataset is introduced, the ViCA-7B model is developed, and it is further fine-tuned into ViCA-7B-Thinking to provide explicit reasoning chains. Result: ViCA-7B achieves the best results on all eight VSI-Bench tasks, for example a gain of 26.1 points on Absolute Distance. Conclusion: The study highlights the importance of targeted data, suggests paths toward better temporal-spatial modeling, and releases all resources. Abstract: Video-based spatial cognition is vital for robotics and embodied AI but challenges current Vision-Language Models (VLMs). This paper makes two key contributions. First, we introduce ViCA (Visuospatial Cognitive Assistant)-322K, a diverse dataset of 322,003 QA pairs from real-world indoor videos (ARKitScenes, ScanNet, ScanNet++), offering supervision for 3D metadata-grounded queries and video-based complex reasoning. Second, we develop ViCA-7B, fine-tuned on ViCA-322K, which achieves new state-of-the-art on all eight VSI-Bench tasks, outperforming existing models, including larger ones (e.g., +26.1 on Absolute Distance). For interpretability, we present ViCA-Thinking-2.68K, a dataset with explicit reasoning chains, and fine-tune ViCA-7B to create ViCA-7B-Thinking, a model that articulates its spatial reasoning. Our work highlights the importance of targeted data and suggests paths for improved temporal-spatial modeling. We release all resources to foster research in robust visuospatial intelligence.

[85] Improving Out-of-Domain Robustness with Targeted Augmentation in Frequency and Pixel Spaces

Ruoqi Wang,Haitao Wang,Shaojie Guo,Qiong Luo

Main category: cs.CV

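TL;DR: The paper proposes Frequency-Pixel Connect, a domain-adaptation framework that introduces targeted augmentation in both frequency space and pixel space and significantly improves generalization under distribution shift.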

Details

Motivation: In real-world applications, out-of-domain (OOD) distribution shift under domain adaptation is a key challenge. Generic data augmentation brings limited gains, while dataset-specific augmentation requires expert knowledge. The paper aims for a general augmentation method that needs no prior knowledge of the dataset. Method: Augmented samples are generated by mixing the amplitude spectrum and pixel content of a source image and a target image, introducing domain diversity while preserving the semantic structure of the source image. Result: On four real-world benchmarks from different domains, Frequency-Pixel Connect significantly improves cross-domain connectivity and outperforms other generic and dataset-specific augmentation methods. Conclusion: Frequency-Pixel Connect is a dataset-agnostic augmentation method with broad applicability that significantly improves robustness under distribution shift. Abstract: Out-of-domain (OOD) robustness under domain adaptation settings, where labeled source data and unlabeled target data come from different distributions, is a key challenge in real-world applications. A common approach to improving OOD robustness is through data augmentations. However, in real-world scenarios, models trained with generic augmentations can only improve marginally when generalized under distribution shifts toward unlabeled target domains. While dataset-specific targeted augmentations can address this issue, they typically require expert knowledge and extensive prior data analysis to identify the nature of the datasets and domain shift. To address these challenges, we propose Frequency-Pixel Connect, a domain-adaptation framework that enhances OOD robustness by introducing a targeted augmentation in both the frequency space and pixel space. Specifically, we mix the amplitude spectrum and pixel content of a source image and a target image to generate augmented samples that introduce domain diversity while preserving the semantic structure of the source image. Unlike previous targeted augmentation methods that are both dataset-specific and limited to the pixel space, Frequency-Pixel Connect is dataset-agnostic, enabling broader and more flexible applicability beyond natural image datasets. We further analyze the effectiveness of Frequency-Pixel Connect by evaluating the performance of our method connecting same-class cross-domain samples while separating different-class examples. We demonstrate that Frequency-Pixel Connect significantly improves cross-domain connectivity and outperforms previous generic methods on four diverse real-world benchmarks across vision, medical, audio, and astronomical domains, and it also outperforms other dataset-specific targeted augmentation methods.
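
Mixing the amplitude spectrum and the pixel content of a source and a target sample can be sketched on a 1D signal. The sketch below is an assumption-laden illustration rather than the paper's exact recipe: it interpolates Fourier amplitudes with weight `lam` while keeping the source phase, then blends the result with the target in signal space with weight `mu`. Both weights and all function names are hypothetical, and real images would use a 2D FFT instead of this naive DFT.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def frequency_pixel_mix(source, target, lam=0.5, mu=0.2):
    """Blend target amplitude into the source spectrum, then blend pixels.

    lam: weight of the target amplitude (0 keeps the source spectrum).
    mu:  weight of the target in signal space (0 keeps the frequency result).
    """
    mixed_spectrum = []
    for fs, ft in zip(dft(source), dft(target)):
        amplitude = (1 - lam) * abs(fs) + lam * abs(ft)
        # Keep the source phase, which carries most of the structure.
        mixed_spectrum.append(cmath.rect(amplitude, cmath.phase(fs)))
    frequency_augmented = idft(mixed_spectrum)
    return [(1 - mu) * a + mu * t for a, t in zip(frequency_augmented, target)]
```

Keeping the source phase is a common choice in Fourier-based augmentation because phase tends to encode layout and shape, while amplitude tends to encode style-like statistics that differ between domains.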

[86] Is Artificial Intelligence Generated Image Detection a Solved Problem?

Ziqiang Li,Jiazhen Yan,Ziwen He,Kai Zeng,Weiwei Jiang,Lizhi Xiong,Zhangjie Fu

Main category: cs.CV

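TL;DR: AIGIBench is a benchmark for evaluating the robustness and generalization of AIGI detectors; it reveals that existing detectors lose performance in real-world scenarios.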

Details

Motivation: Generative models such as GANs and diffusion models produce highly realistic synthetic images, raising concerns about misinformation, deepfakes, and copyright infringement. The effectiveness of existing AIGI detectors in real-world scenarios is questionable. Method: AIGIBench evaluates 11 advanced detectors on four core tasks (multi-source generalization, robustness to image degradation, sensitivity to data augmentation, and impact of test-time pre-processing), covering 23 diverse fake image subsets and real-world samples. Result: Although the detectors perform well in controlled settings, their performance drops significantly on real-world data; common augmentations bring limited benefit, and pre-processing has nuanced effects. Conclusion: AIGIBench provides a unified and realistic evaluation framework for future research, encouraging more robust and generalizable AIGI detection strategies. Abstract: The rapid advancement of generative models, such as GANs and Diffusion models, has enabled the creation of highly realistic synthetic images, raising serious concerns about misinformation, deepfakes, and copyright infringement. Although numerous Artificial Intelligence Generated Image (AIGI) detectors have been proposed, often reporting high accuracy, their effectiveness in real-world scenarios remains questionable. To bridge this gap, we introduce AIGIBench, a comprehensive benchmark designed to rigorously evaluate the robustness and generalization capabilities of state-of-the-art AIGI detectors. AIGIBench simulates real-world challenges through four core tasks: multi-source generalization, robustness to image degradation, sensitivity to data augmentation, and impact of test-time pre-processing. It includes 23 diverse fake image subsets that span both advanced and widely adopted image generation techniques, along with real-world samples collected from social media and AI art platforms. Extensive experiments on 11 advanced detectors demonstrate that, despite their high reported accuracy in controlled settings, these detectors suffer significant performance drops on real-world data, limited benefits from common augmentations, and nuanced effects of pre-processing, highlighting the need for more robust detection strategies. By providing a unified and realistic evaluation framework, AIGIBench offers valuable insights to guide future research toward dependable and generalizable AIGI detection.

[87] Towards Open-world Generalized Deepfake Detection: General Feature Extraction via Unsupervised Domain Adaptation

Midou Guo,Qilin Yin,Wei Lu,Xiangyang Luo

Main category: cs.CV

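TL;DR: The paper proposes a new Open-World Deepfake Detection Generalization Enhancement Training Strategy (OWG-DS) that improves generalization to unlabeled data by optimizing domain distances and separating class boundaries.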

Details

Motivation: As generative AI develops, forgery methods emerge rapidly and social platforms mix large amounts of unlabeled synthetic and authentic data, so existing supervised detectors struggle with unknown forgery methods. Method: The OWG-DS strategy combines a Domain Distance Optimization (DDO) module and a Similarity-based Class Boundary Separation (SCBS) module with adversarial training to learn domain-invariant features. Result: Experiments show that the method excels in cross-method and cross-dataset scenarios and improves model generalization. Conclusion: OWG-DS addresses the generalization problem of deepfake detection in open-world scenarios and offers a new approach to detection on unlabeled data. Abstract: With the development of generative artificial intelligence, new forgery methods are rapidly emerging. Social platforms are flooded with vast amounts of unlabeled synthetic data and authentic data, making it increasingly challenging to distinguish real from fake. Due to the lack of labels, existing supervised detection methods struggle to effectively address the detection of unknown deepfake methods. Moreover, in open-world scenarios, the amount of unlabeled data greatly exceeds that of labeled data. Therefore, we define a new deepfake detection generalization task which focuses on how to achieve efficient detection of large amounts of unlabeled data based on limited labeled data to simulate an open-world scenario. To solve the above-mentioned task, we propose a novel Open-World Deepfake Detection Generalization Enhancement Training Strategy (OWG-DS) to improve the generalization ability of existing methods. Our approach aims to transfer deepfake detection knowledge from a small amount of labeled source domain data to large-scale unlabeled target domain data. Specifically, we introduce the Domain Distance Optimization (DDO) module to align different domain features by optimizing both inter-domain and intra-domain distances. Additionally, the Similarity-based Class Boundary Separation (SCBS) module is used to enhance the aggregation of similar samples to ensure clearer class boundaries, while an adversarial training mechanism is adopted to learn the domain-invariant features. Extensive experiments show that the proposed deepfake detection generalization enhancement training strategy excels in cross-method and cross-dataset scenarios, improving the model's generalization.

[88] DIMM: Decoupled Multi-hierarchy Kalman Filter for 3D Object Tracking

Jirong Zha,Yuxuan Fan,Kai Li,Han Li,Chen Gao,Xinlei Chen,Yong Li

Main category: cs.CV

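TL;DR: The DIMM framework significantly improves 3D object tracking accuracy through a decoupled multi-hierarchy filter bank and an adaptive fusion network.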

Details

Motivation: Conventional IMM methods are limited in both the model-combination solution space and the weight computation, so they cannot fully handle the target's direction-dependent motion or measurement uncertainty. Method: DIMM designs a decoupled multi-hierarchy filter bank to enlarge the model-combination solution space and uses an adaptive fusion network to improve weight allocation. Result: Experiments show that DIMM improves the tracking accuracy of existing state estimation methods by 31.61% to 99.23%. Conclusion: DIMM addresses the shortcomings of conventional IMM and significantly improves 3D object tracking performance. Abstract: State estimation is challenging for 3D object tracking with high maneuverability, as the target's state transition function changes rapidly and irregularly and is unknown to the estimator. Existing work based on interacting multiple model (IMM) achieves more accurate estimation than single-filter approaches through model combination, aligning appropriate models for different motion modes of the target object over time. However, two limitations of conventional IMM remain unsolved. First, the solution space of the model combination is constrained as the target's diverse kinematic properties in different directions are ignored. Second, the model combination weights calculated by the observation likelihood are not accurate enough due to the measurement uncertainty. In this paper, we propose a novel framework, DIMM, to effectively combine estimates from different motion models in each direction, thus increasing the 3D object tracking accuracy. First, DIMM extends the model combination solution space of conventional IMM from a hyperplane to a hypercube by designing a 3D-decoupled multi-hierarchy filter bank, which describes the target's motion with various-order linear models. Second, DIMM generates more reliable combination weight matrices through a differentiable adaptive fusion network for importance allocation rather than solely relying on the observation likelihood; it contains an attention-based twin delayed deep deterministic policy gradient (TD3) method with a hierarchical reward. Experiments demonstrate that DIMM significantly improves the tracking accuracy of existing state estimation methods by 31.61% to 99.23%.
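
For context, the likelihood-based weighting of conventional IMM that the abstract criticizes can be sketched as follows. This is the textbook IMM model-probability update and weighted combination, not DIMM itself; the function names are chosen here for illustration. Because the weights are normalized to sum to one and shared across all axes, the combined estimate is confined to a hyperplane, which is the constraint DIMM is said to relax.

```python
import numpy as np

def imm_update_weights(mu_prev, transition, likelihoods):
    """Textbook IMM model-probability update.

    mu_prev:     (M,) previous model probabilities.
    transition:  (M, M) Markov matrix, transition[i, j] = P(model j | model i).
    likelihoods: (M,) observation likelihood of each model's filter.
    """
    predicted = transition.T @ mu_prev   # c_j = sum_i p_ij * mu_i
    unnormalized = likelihoods * predicted
    return unnormalized / unnormalized.sum()

def imm_combine(weights, estimates):
    """Combine per-model state estimates (M, D) with shared weights (M,)."""
    return weights @ estimates
```

A noisy measurement distorts `likelihoods` directly, and therefore the weights, which is the second limitation the abstract names.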

[89] Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts

Qi Feng,Hidetoshi Shimodaira

Main category: cs.CV

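TL;DR: ViCA2 is a new multimodal large language model focused on spatial reasoning; with a dual vision encoder and the dedicated ViCA-322K dataset, it performs strongly on the VSI-Bench benchmark.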

Details

Motivation: Existing models fall short on spatial cognition tasks because they lack the architecture and training data needed for fine-grained spatial understanding. Method: ViCA2 combines a dual vision encoder (SigLIP and Hiera) with a token ratio control mechanism and is instruction-tuned on the ViCA-322K dataset. Result: ViCA2-7B reaches an average score of 56.8 on VSI-Bench, surpassing other open-source and proprietary models. Conclusion: ViCA2 shows that a compact model can achieve strong spatial intelligence; the model and dataset are released to support further research. Abstract: While Multimodal Large Language Models (MLLMs) excel at general vision-language tasks, visuospatial cognition (reasoning about spatial layouts, relations, and dynamics) remains a significant challenge. Existing models often lack the necessary architectural components and specialized training data for fine-grained spatial understanding. We introduce ViCA2 (Visuospatial Cognitive Assistant 2), a novel MLLM designed to enhance spatial reasoning. ViCA2 features a dual vision encoder architecture integrating SigLIP for semantics and Hiera for spatial structure, coupled with a token ratio control mechanism for efficiency. We also developed ViCA-322K, a new large-scale dataset with over 322,000 spatially grounded question-answer pairs for targeted instruction tuning. On the challenging VSI-Bench benchmark, our ViCA2-7B model achieves a state-of-the-art average score of 56.8, significantly surpassing larger open-source models (e.g., LLaVA-NeXT-Video-72B, 40.9) and leading proprietary models (Gemini-1.5 Pro, 45.4). This demonstrates the effectiveness of our approach in achieving strong visuospatial intelligence with a compact model. We release ViCA2, its codebase, and the ViCA-322K dataset to facilitate further research.

[90] CLIP-aware Domain-Adaptive Super-Resolution

Zhengyang Lu,Qian Xia,Weifan Wang,Feng Wang

Main category: cs.CV

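TL;DR: CDASR uses the semantic capabilities of CLIP, together with domain adaptation and a meta-learning strategy, to improve single image super-resolution across domains and at extreme scaling factors.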

Details

Motivation: Address the domain generalization challenge in single image super-resolution and improve performance across domains and at extreme scaling factors. Method: Combine a CLIP-guided feature alignment mechanism with a meta-learning-inspired few-shot adaptation strategy, fusing semantic information through a multi-stage transformation and a custom loss function. Result: On the Urban100 dataset, CDASR gains 0.15 dB PSNR at ×8 scaling and up to 0.30 dB at ×16 scaling. Conclusion: CDASR performs well across domains and at extreme scaling factors, offering an efficient solution for super-resolution. Abstract: This work introduces CLIP-aware Domain-Adaptive Super-Resolution (CDASR), a novel framework that addresses the critical challenge of domain generalization in single image super-resolution. By leveraging the semantic capabilities of CLIP (Contrastive Language-Image Pre-training), CDASR achieves unprecedented performance across diverse domains and extreme scaling factors. The proposed method integrates a CLIP-guided feature alignment mechanism with a meta-learning-inspired few-shot adaptation strategy, enabling efficient knowledge transfer and rapid adaptation to target domains. A custom domain-adaptive module processes CLIP features alongside super-resolution features through a multi-stage transformation process, including CLIP feature processing, spatial feature generation, and feature fusion. This intricate process ensures effective incorporation of semantic information into the super-resolution pipeline. Additionally, CDASR employs a multi-component loss function that combines pixel-wise reconstruction, perceptual similarity, and semantic consistency. Extensive experiments on benchmark datasets demonstrate CDASR's superiority, particularly in challenging scenarios. On the Urban100 dataset at $\times$8 scaling, CDASR achieves a significant PSNR gain of 0.15dB over existing methods, with even larger improvements of up to 0.30dB observed at $\times$16 scaling.

Minxu Liu,Donghai Guan,Chuhang Zheng,Chunwei Tian,Jie Wen,Qi Zhu

Main category: cs.CV

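TL;DR: ViEEG is a biologically inspired hierarchical EEG decoding framework that significantly improves EEG visual decoding by mimicking the hierarchy of visual processing.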

Details

Motivation: Existing EEG visual decoding methods rely on flat neural representations and overlook the brain's visual hierarchy; ViEEG aims to address this. Method: ViEEG decomposes each visual stimulus into three biologically aligned components, simulates cortical information flow with a three-stream EEG encoder and cross-attention routing, and uses hierarchical contrastive learning to align with CLIP embeddings. Result: On the THINGS-EEG dataset, ViEEG reaches 40.9% and 22.9% Top-1 accuracy in the subject-dependent and cross-subject settings respectively, surpassing existing methods by over 45%. Conclusion: ViEEG not only improves performance but also sets a new paradigm for biologically grounded brain decoding. Abstract: Understanding and decoding brain activity into visual representations is a fundamental challenge at the intersection of neuroscience and artificial intelligence. While EEG-based visual decoding has shown promise due to its non-invasive, low-cost nature and millisecond-level temporal resolution, existing methods are limited by their reliance on flat neural representations that overlook the brain's inherent visual hierarchy. In this paper, we introduce ViEEG, a biologically inspired hierarchical EEG decoding framework that aligns with the Hubel-Wiesel theory of visual processing. ViEEG decomposes each visual stimulus into three biologically aligned components (contour, foreground object, and contextual scene) that serve as anchors for a three-stream EEG encoder. These EEG features are progressively integrated via cross-attention routing, simulating cortical information flow from V1 to IT to the association cortex. We further adopt hierarchical contrastive learning to align EEG representations with CLIP embeddings, enabling zero-shot object recognition. Extensive experiments on the THINGS-EEG dataset demonstrate that ViEEG achieves state-of-the-art performance, with 40.9% Top-1 accuracy in subject-dependent and 22.9% Top-1 accuracy in cross-subject settings, surpassing existing methods by over 45%. Our framework not only advances the performance frontier but also sets a new paradigm for biologically grounded brain decoding in AI.

[92] Kornia-rs: A Low-Level 3D Computer Vision Library In Rust

Edgar Riba,Jian Shi,Aditya Kumar,Andrew Shen,Gary Bradski

Main category: cs.CV

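TL;DR: kornia-rs is a high-performance 3D computer vision library written entirely in Rust and aimed at safety-critical and real-time applications. It relies on Rust's ownership model and type system for memory and thread safety, and provides efficient image processing and 3D operations.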

Details

Motivation: Fill the lack of a high-performance 3D computer vision library in the Rust ecosystem while offering better performance and safety than existing options. Method: Adopt a statically-typed tensor system and a modular design, with cross-platform compatibility and Python bindings. Result: kornia-rs is 3 to 5 times faster than native Rust alternatives on image transformation tasks, performs comparably to C++ wrapper-based libraries, and fills the 3D vision gap in the Rust ecosystem. Conclusion: kornia-rs is efficient and practical in real-world computer vision applications and is an important tool for the Rust ecosystem. Abstract: We present kornia-rs, a high-performance 3D computer vision library written entirely in native Rust, designed for safety-critical and real-time applications. Unlike C++-based libraries like OpenCV or wrapper-based solutions like OpenCV-Rust, kornia-rs is built from the ground up to leverage Rust's ownership model and type system for memory and thread safety. kornia-rs adopts a statically-typed tensor system and a modular set of crates, providing efficient image I/O, image processing and 3D operations. To aid cross-platform compatibility, kornia-rs offers Python bindings, enabling seamless and efficient integration with Rust code. Empirical results show that kornia-rs achieves a 3 to 5 times speedup in image transformation tasks over native Rust alternatives, while offering comparable performance to C++ wrapper-based libraries. In addition to 2D vision capabilities, kornia-rs addresses a significant gap in the Rust ecosystem by providing a set of 3D computer vision operators. This paper presents the architecture and performance characteristics of kornia-rs, demonstrating its effectiveness in real-world computer vision applications.

[93] DragLoRA: Online Optimization of LoRA Adapters for Drag-based Image Editing in Diffusion Model

Siwei Xia,Li Sun,Tiantian Sun,Qingli Li

Main category: cs.CV

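TL;DR: DragLoRA improves the precision and efficiency of drag-based image editing by integrating LoRA adapters and introducing a denoising score distillation loss.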

Details

Motivation: Traditional drag-based editing methods suffer from low accuracy and poor efficiency because of limited feature representation ability and a large search space. Method: DragLoRA integrates LoRA adapters, introduces a denoising score distillation loss, and designs an adaptive optimization scheme. Result: Experiments show that DragLoRA significantly improves control precision and computational efficiency. Conclusion: DragLoRA offers a more efficient and precise solution for drag-based image editing. Abstract: Drag-based editing within a pretrained diffusion model provides a precise and flexible way to manipulate foreground objects. Traditional methods optimize the input feature obtained from DDIM inversion directly, adjusting it iteratively to guide handle points towards target locations. However, these approaches often suffer from limited accuracy due to the low representation ability of the feature in motion supervision, as well as inefficiencies caused by the large search space required for point tracking. To address these limitations, we present DragLoRA, a novel framework that integrates LoRA (Low-Rank Adaptation) adapters into the drag-based editing pipeline. To enhance the training of LoRA adapters, we introduce an additional denoising score distillation loss which regularizes the online model by aligning its output with that of the original model. Additionally, we improve the consistency of motion supervision by adapting the input features using the updated LoRA, giving a more stable and accurate input feature for subsequent operations. Building on this, we design an adaptive optimization scheme that dynamically toggles between two modes, prioritizing efficiency without compromising precision. Extensive experiments demonstrate that DragLoRA significantly enhances the control precision and computational efficiency for drag-based image editing. The code of DragLoRA is available at: https://github.com/Sylvie-X/DragLoRA.

[94] DPCD: A Quality Assessment Database for Dynamic Point Clouds

Yating Liu,Yujie Zhang,Qi Yang,Yiling Xu,Zhu Li,Ye-Kui Wang

Main category: cs.CV

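TL;DR: The paper presents DPCD, a large-scale dynamic point cloud quality assessment database with 15 reference and 525 distorted dynamic point clouds; a subjective experiment validates its heterogeneity and reliability, and several objective metrics are evaluated on it.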

Details

Motivation: Dynamic point clouds (DPCs) capture temporal changes, but dynamic point cloud quality assessment (DPCQA) has received little study, which hinders quality-oriented applications. Method: Build the DPCD database of dynamic point clouds with multiple distortion types, collect Mean Opinion Scores (MOS) through a subjective experiment, and evaluate objective metrics. Result: Experiments show that DPCQA is more challenging than static point cloud quality assessment, and the analysis validates the heterogeneity and reliability of DPCD. Conclusion: DPCD provides a foundation for DPCQA research and is publicly available. Abstract: Recently, the advancements in Virtual/Augmented Reality (VR/AR) have driven the demand for Dynamic Point Clouds (DPC). Unlike static point clouds, DPCs are capable of capturing temporal changes within objects or scenes, offering a more accurate simulation of the real world. While significant progress has been made in the quality assessment research of static point clouds, little study has been done on Dynamic Point Cloud Quality Assessment (DPCQA), which hinders the development of quality-oriented applications, such as interframe compression and transmission in practical scenarios. In this paper, we introduce a large-scale DPCQA database, named DPCD, which includes 15 reference DPCs and 525 distorted DPCs from seven types of lossy compression and noise distortion. By rendering these samples to Processed Video Sequences (PVS), a comprehensive subjective experiment is conducted to obtain Mean Opinion Scores (MOS) from 21 viewers for analysis. The characteristics of the contents, the impact of various distortions, and the accuracy of the MOSs are presented to validate the heterogeneity and reliability of the proposed database. Furthermore, we evaluate the performance of several objective metrics on DPCD. The experimental results show that DPCQA is more challenging than static point cloud quality assessment. The DPCD, which serves as a catalyst for new research endeavors on DPCQA, is publicly available at https://huggingface.co/datasets/Olivialyt/DPCD.

[95] SRLoRA: Subspace Recomposition in Low-Rank Adaptation via Importance-Based Fusion and Reinitialization

Haodong Yang,Lei Wang,Md Zakir Hossain

Main category: cs.CV

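TL;DR: SRLoRA increases the expressiveness of LoRA by dynamically recomposing the low-rank subspace without adding parameters, and outperforms standard LoRA on language and vision tasks.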

Details

Motivation: LoRA's low-rank update limits representational capacity and hurts downstream performance, so a way to increase expressiveness without adding parameters is needed. Method: SRLoRA uses importance scores to dynamically fuse and reinitialize LoRA pairs, freeing capacity and reinitializing along unused principal directions. Result: On the GLUE benchmark and image classification tasks, SRLoRA converges faster and reaches higher accuracy. Conclusion: SRLoRA is an efficient and general PEFT method with broad application potential. Abstract: Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning (PEFT) method that injects two trainable low-rank matrices (A and B) into frozen pretrained models. While efficient, LoRA constrains updates to a fixed low-rank subspace (Delta W = BA), which can limit representational capacity and hinder downstream performance. We introduce Subspace Recomposition in Low-Rank Adaptation (SRLoRA) via importance-based fusion and reinitialization, a novel approach that enhances LoRA's expressiveness without compromising its lightweight structure. SRLoRA assigns importance scores to each LoRA pair (a column of B and the corresponding row of A), and dynamically recomposes the subspace during training. Less important pairs are fused into the frozen backbone, freeing capacity to reinitialize new pairs along unused principal directions derived from the pretrained weight's singular value decomposition. This mechanism enables continual subspace refreshment and richer adaptation over time, without increasing the number of trainable parameters. We evaluate SRLoRA on both language and vision tasks, including the GLUE benchmark and various image classification datasets. SRLoRA consistently achieves faster convergence and improved accuracy over standard LoRA, demonstrating its generality, efficiency, and potential for broader PEFT applications.
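
The fuse-and-reinitialize step can be sketched in a few lines. Everything beyond "fuse a pair into the frozen weight, then reinitialize along a singular direction of the pretrained weight" is an assumption here: the abstract does not define the importance score or the exact initialization, so the norm-product score and the zeroed column of B below are hypothetical stand-ins chosen so that the effective weight W + BA is unchanged by the step.

```python
import numpy as np

def fuse_and_reinit(W, B, A, k, new_row):
    """Fuse LoRA pair k into the frozen weight and reinitialize it.

    W: (d_out, d_in) frozen weight; B: (d_out, r); A: (r, d_in).
    new_row: (d_in,) direction for the fresh pair, e.g. a right singular
             vector of the pretrained weight (assumed initialization).
    """
    W = W + np.outer(B[:, k], A[k, :])   # absorb the pair's contribution
    B, A = B.copy(), A.copy()
    B[:, k] = 0.0                        # zero column keeps W + B @ A unchanged
    A[k, :] = new_row
    return W, B, A

rng = np.random.default_rng(0)
W0 = rng.standard_normal((6, 5))
B = rng.standard_normal((6, 2))
A = rng.standard_normal((2, 5))
# Hypothetical importance score: product of column and row norms.
scores = np.linalg.norm(B, axis=0) * np.linalg.norm(A, axis=1)
k = int(np.argmin(scores))
_, _, Vt = np.linalg.svd(W0)
W1, B1, A1 = fuse_and_reinit(W0, B, A, k, Vt[-1])
```

The number of trainable entries in `B1` and `A1` equals that of `B` and `A`, matching the claim that recomposition does not add trainable parameters.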

[96] VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning

Qi Wang,Yanrui Yu,Ye Yuan,Rui Mao,Tianfei Zhou

Main category: cs.CV

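TL;DR: VIDEORFT extends the RFT paradigm to improve the video reasoning ability of multimodal large language models (MLLMs); using automatically generated chain-of-thought (CoT) datasets and reinforcement learning, it achieves state-of-the-art results on six video reasoning benchmarks.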

Details

Motivation: Video reasoning is an important challenge for AI, and existing methods struggle with the complex logical, temporal, and causal structure of video data. Method: A two-stage approach: (1) supervised fine-tuning (SFT) on an automatically generated CoT dataset; (2) reinforcement learning (RL) with a semantic-consistency reward. Result: State-of-the-art performance on six video reasoning benchmarks. Conclusion: Through its dataset generation pipeline and reinforcement learning strategy, VIDEORFT significantly improves the video reasoning ability of MLLMs. Abstract: Reinforcement fine-tuning (RFT) has shown great promise in achieving human-level reasoning capabilities of Large Language Models (LLMs), and has recently been extended to MLLMs. Nevertheless, reasoning about videos, which is a fundamental aspect of human intelligence, remains a persistent challenge due to the complex logic, temporal and causal structures inherent in video data. To fill this gap, we propose VIDEORFT, a novel approach that extends the RFT paradigm to cultivate human-like video reasoning capabilities in MLLMs. VIDEORFT follows the standard two-stage scheme in RFT: supervised fine-tuning (SFT) with chain-of-thought (CoT) annotations, followed by reinforcement learning (RL) to improve generalization. A central challenge to achieve this in the video domain lies in the scarcity of large-scale, high-quality video CoT datasets. We address this by building a fully automatic CoT curation pipeline. First, we devise a cognition-inspired prompting strategy to elicit a reasoning LLM to generate preliminary CoTs based solely on rich, structured, and literal representations of video content. Subsequently, these CoTs are revised by a visual-language model conditioned on the actual video, ensuring visual consistency and reducing visual hallucinations. This pipeline results in two new datasets: VideoRFT-CoT-102K for SFT and VideoRFT-RL-310K for RL. To further strengthen the RL phase, we introduce a novel semantic-consistency reward that explicitly promotes the alignment between textual reasoning and visual evidence. This reward encourages the model to produce coherent, context-aware reasoning outputs grounded in visual input. Extensive experiments show that VIDEORFT achieves state-of-the-art performance on six video reasoning benchmarks.

[97] SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning

Yang Liu,Ming Ma,Xiaomin Yu,Pengxiang Ding,Han Zhao,Mingyang Sun,Siteng Huang,Donglin Wang

Main category: cs.CV

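TL;DR: The paper proposes SSR, a method that converts depth data into structured textual rationales to strengthen the spatial reasoning of vision-language models; the rationales are compressed into compact embeddings by knowledge distillation and can be integrated into existing models without retraining.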

Details

Motivation: Although vision-language models have advanced considerably on multi-modal tasks, their reliance on RGB inputs limits precise spatial understanding. Existing methods either require specialized sensors or fail to exploit depth information for higher-order reasoning. Method: The SSR framework converts raw depth data into structured textual rationales and uses knowledge distillation to produce compact latent embeddings for efficient integration. Result: Experiments show that SSR substantially improves depth utilization and spatial reasoning, moving vision-language models toward more human-like multi-modal understanding. Conclusion: With structured textual rationales and knowledge distillation, SSR improves the spatial reasoning of vision-language models and offers a new approach for multi-modal tasks. Abstract: Despite impressive advancements in Visual-Language Models (VLMs) for multi-modal tasks, their reliance on RGB inputs limits precise spatial understanding. Existing methods for integrating spatial cues, such as point clouds or depth, either require specialized sensors or fail to effectively exploit depth information for higher-order reasoning. To this end, we propose Spatial Sense and Reasoning (SSR), a novel framework that transforms raw depth data into structured, interpretable textual rationales. These textual rationales serve as meaningful intermediate representations to significantly enhance spatial reasoning capabilities. Additionally, we leverage knowledge distillation to compress the generated rationales into compact latent embeddings, which facilitate resource-efficient and plug-and-play integration into existing VLMs without retraining. To enable comprehensive evaluation, we introduce a new dataset named SSR-CoT, a million-scale visual-language reasoning dataset enriched with intermediate spatial reasoning annotations, and present SSRBench, a comprehensive multi-task benchmark. Extensive experiments on multiple benchmarks demonstrate SSR substantially improves depth utilization and enhances spatial reasoning, thereby advancing VLMs toward more human-like multi-modal understanding. Our project page is at https://yliu-cs.github.io/SSR.

[98] Spectral-Spatial Self-Supervised Learning for Few-Shot Hyperspectral Image Classification

Wenchen Chen,Yanmei Zhang,Zhongwei Xiao,Jianping Chu,Xingbo Wang

Main category: cs.CV

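TL;DR: The paper proposes S4L-FSC, which combines self-supervised learning and few-shot learning; pretraining on spatial and spectral features significantly improves few-shot classification of hyperspectral images.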

Details

Motivation: Few-shot classification of hyperspectral images suffers from scarce labeled samples, and existing methods struggle to adapt to spatial geometric diversity and lack spectral prior knowledge. Method: Pretrain a spatial feature extractor with Rotation-Mirror Self-Supervised Learning (RM-SSL) combined with few-shot learning (FSL) to acquire spatial meta-knowledge; pretrain a spectral feature extractor with Masked Reconstruction Self-Supervised Learning (MR-SSL) and FSL to learn spectral dependencies. Result: Experiments on four hyperspectral datasets show that S4L-FSC performs well on few-shot classification. Conclusion: By combining spatial and spectral pretraining, S4L-FSC addresses the challenges of few-shot hyperspectral image classification. Abstract: Few-shot classification of hyperspectral images (HSI) faces the challenge of scarce labeled samples. Self-Supervised Learning (SSL) and Few-Shot Learning (FSL) offer promising avenues to address this issue. However, existing methods often struggle to adapt to the spatial geometric diversity of HSIs and lack sufficient spectral prior knowledge. To tackle these challenges, we propose a method, Spectral-Spatial Self-Supervised Learning for Few-Shot Hyperspectral Image Classification (S4L-FSC), aimed at improving the performance of few-shot HSI classification. Specifically, we first leverage heterogeneous datasets to pretrain a spatial feature extractor using a designed Rotation-Mirror Self-Supervised Learning (RM-SSL) method, combined with FSL. This approach enables the model to learn the spatial geometric diversity of HSIs using rotation and mirroring labels as supervisory signals, while acquiring transferable spatial meta-knowledge through few-shot learning. Subsequently, homogeneous datasets are utilized to pretrain a spectral feature extractor via a combination of FSL and Masked Reconstruction Self-Supervised Learning (MR-SSL). The model learns to reconstruct original spectral information from randomly masked spectral vectors, inferring spectral dependencies. In parallel, FSL guides the model to extract pixel-level discriminative features, thereby embedding rich spectral priors into the model. This spectral-spatial pretraining method, along with the integration of knowledge from heterogeneous and homogeneous sources, significantly enhances model performance.
Extensive experiments on four HSI datasets demonstrate the effectiveness and superiority of the proposed S4L-FSC approach for few-shot HSI classification.
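
The idea of using rotation and mirroring labels as supervisory signals can be illustrated with a short sketch. The abstract does not specify the label set, so the eight-class scheme below (four rotations, each with or without a horizontal mirror) and the function name `rotation_mirror_views` are assumptions for illustration only.

```python
import numpy as np

def rotation_mirror_views(patch):
    """Generate transformed views of a square patch with pretext labels.

    patch: (H, H, bands) hyperspectral patch.
    Returns 8 views and labels 0..7 (assumed scheme: label = 4 * mirror + rotation).
    """
    views, labels = [], []
    for mirror in (0, 1):
        base = patch[:, ::-1] if mirror else patch   # horizontal flip
        for k in range(4):                           # 0, 90, 180, 270 degrees
            views.append(np.rot90(base, k, axes=(0, 1)))
            labels.append(4 * mirror + k)
    return views, labels
```

A spatial feature extractor trained to predict these labels from the views receives supervision without any class annotation; only the spatial axes are transformed, so each pixel's spectrum is left intact.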

[99] Guiding Diffusion with Deep Geometric Moments: Balancing Fidelity and Variation

Sangmin Jung,Utkarsh Nath,Yezhou Yang,Giulia Pedrielli,Joydeep Biswas,Amy Zhang,Hassan Ghasemzadeh,Pavan Turaga

Main category: cs.CV

TL;DR: The paper proposes Deep Geometric Moments (DGM), a new form of guidance that gives fine-grained control in text-to-image generation models while preserving diversity.

Details

Motivation: Existing guidance such as segmentation maps and depth maps introduces spatial rigidity that limits the diversity of diffusion models. DGM instead focuses on the subject's visual features and details through a learned geometric prior. Method: DGM captures the subject's visual features through a learned geometric prior, avoiding the overemphasis on global features or semantics seen with DINO or CLIP features. Result: Experiments show that DGM balances control and diversity in diffusion-based image generation and offers a flexible control mechanism. Conclusion: DGM is a novel and effective form of guidance that achieves fine-grained control over generated images while preserving diversity. Abstract: Text-to-image generation models have achieved remarkable capabilities in synthesizing images, but often struggle to provide fine-grained control over the output. Existing guidance approaches, such as segmentation maps and depth maps, introduce spatial rigidity that restricts the inherent diversity of diffusion models. In this work, we introduce Deep Geometric Moments (DGM) as a novel form of guidance that encapsulates the subject's visual features and nuances through a learned geometric prior. DGMs focus specifically on the subject itself compared to DINO or CLIP features, which suffer from overemphasis on global image features or semantics. Unlike ResNets, which are sensitive to pixel-wise perturbations, DGMs rely on robust geometric moments. Our experiments demonstrate that DGM effectively balances control and diversity in diffusion-based image generation, allowing a flexible control mechanism for steering the diffusion process.

[100] Video-GPT via Next Clip Diffusion

Shaobin Zhuang,Zhipeng Huang,Ying Zhang,Fangyikang Wang,Canmiao Fu,Binxin Yang,Chong Sun,Chen Li,Yali Wang

Main category: cs.CV

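TL;DR: Video-GPT treats video as a new language for modeling and is pretrained with a novel next clip diffusion paradigm, which enables both short-term generation and long-term prediction and achieves SOTA performance on video prediction.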

Details

Motivation: Language sequences cannot adequately describe the spatial-temporal details of the visual world, whereas video sequences capture them well; this motivates Video-GPT. Method: Pretrain with the next clip diffusion paradigm, autoregressively denoising a noisy clip conditioned on the clean clips in the history. Result: Best performance on video prediction (Physics-IQ Benchmark: Video-GPT 34.97 vs. Kling 23.64 vs. Wan 20.89) and strong generalization on 6 mainstream video tasks. Conclusion: Video-GPT performs well in video modeling and adapts broadly to downstream tasks. Abstract: GPT has shown remarkable success in natural language processing. However, the language sequence is not sufficient to describe spatial-temporal details in the visual world. Alternatively, the video sequence is good at capturing such details. Motivated by this fact, we propose a concise Video-GPT in this paper by treating video as a new language for visual world modeling. By analogy to next token prediction in GPT, we introduce a novel next clip diffusion paradigm for pretraining Video-GPT. Different from previous works, this distinct paradigm allows Video-GPT to tackle both short-term generation and long-term prediction, by autoregressively denoising the noisy clip according to the clean clips in the history. Extensive experiments show our Video-GPT achieves state-of-the-art performance on video prediction, which is the key factor towards world modeling (Physics-IQ Benchmark: Video-GPT 34.97 vs. Kling 23.64 vs. Wan 20.89). Moreover, it can be well adapted to 6 mainstream video tasks in both video generation and understanding, showing its great generalization capacity in downstream tasks. The project page is at https://Video-GPT.github.io.

[101] Rebalancing Contrastive Alignment with Learnable Semantic Gaps in Text-Video Retrieval

Jian Xiao,Zijie Song,Jialong Hu,Hao Cheng,Zhenzhen Hu,Jia Li,Richang Hong

Main category: cs.CV

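TL;DR: GARE is a framework that addresses the modality gap and false negatives in text-video retrieval by introducing a learnable increment Delta_ij that relieves gradient conflicts and stabilizes alignment.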

Details

Motivation: Existing methods overlook the modality gap between text and video distributions and the false negatives in batch sampling, which cause conflicting gradients under the InfoNCE loss and hurt alignment. Method: GARE mitigates gradient conflicts with a learnable increment Delta_ij, produced by a lightweight neural module and regularized by a trust-region constraint, a directional diversity term, and an information bottleneck. Result: On four retrieval benchmarks, GARE consistently improves alignment accuracy and robustness to noisy supervision. Conclusion: By explicitly handling the modality gap and gradient conflicts, GARE improves the performance and stability of text-video retrieval. Abstract: Recent advances in text-video retrieval have been largely driven by contrastive learning frameworks. However, existing methods overlook a key source of optimization tension: the separation between text and video distributions in the representation space (referred to as the modality gap), and the prevalence of false negatives in batch sampling. These factors lead to conflicting gradients under the InfoNCE loss, impeding stable alignment. To mitigate this, we propose GARE, a Gap-Aware Retrieval framework that introduces a learnable, pair-specific increment Delta_ij between text t_i and video v_j to offload the tension from the global anchor representation. We first derive the ideal form of Delta_ij via a coupled multivariate first-order Taylor approximation of the InfoNCE loss under a trust-region constraint, revealing it as a mechanism for resolving gradient conflicts by guiding updates along a locally optimal descent direction. Due to the high cost of directly computing Delta_ij, we introduce a lightweight neural module conditioned on the semantic gap between each video-text pair, enabling structure-aware correction guided by gradient supervision. To further stabilize learning and promote interpretability, we regularize Delta using three components: a trust-region constraint to prevent oscillation, a directional diversity term to promote semantic coverage, and an information bottleneck to limit redundancy. Experiments across four retrieval benchmarks show that GARE consistently improves alignment accuracy and robustness to noisy supervision, confirming the effectiveness of gap-aware tension mitigation.

[102] GlobalGeoTree: A Multi-Granular Vision-Language Dataset for Global Tree Species Classification

Yang Mu,Zhitong Xiong,Yi Wang,Muhammad Shahzad,Franz Essl,Mark van Kleunen,Xiao Xiang Zhu

Main category: cs.CV

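TL;DR: GlobalGeoTree is a global tree species dataset with 6.3 million geolocated occurrences for remote sensing classification. The GeoTreeCLIP model performs strongly on this dataset.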

Details

Motivation: Address the scarcity of large-scale labeled data for global tree species classification. Method: Build the GlobalGeoTree dataset and develop the GeoTreeCLIP model, which pairs remote sensing data with taxonomic text labels. Result: GeoTreeCLIP substantially outperforms existing models on zero-shot and few-shot classification. Conclusion: Releasing the dataset and models will advance tree species classification and biodiversity research. Abstract: Global tree species mapping using remote sensing data is vital for biodiversity monitoring, forest management, and ecological research. However, progress in this field has been constrained by the scarcity of large-scale, labeled datasets. To address this, we introduce GlobalGeoTree, a comprehensive global dataset for tree species classification. GlobalGeoTree comprises 6.3 million geolocated tree occurrences, spanning 275 families, 2,734 genera, and 21,001 species across the hierarchical taxonomic levels. Each sample is paired with Sentinel-2 image time series and 27 auxiliary environmental variables, encompassing bioclimatic, geographic, and soil data. The dataset is partitioned into GlobalGeoTree-6M for model pretraining and curated evaluation subsets, primarily GlobalGeoTree-10kEval for zero-shot and few-shot benchmarking. To demonstrate the utility of the dataset, we introduce a baseline model, GeoTreeCLIP, which leverages paired remote sensing data and taxonomic text labels within a vision-language framework pretrained on GlobalGeoTree-6M. Experimental results show that GeoTreeCLIP achieves substantial improvements in zero- and few-shot classification on GlobalGeoTree-10kEval over existing advanced models. By making the dataset, models, and code publicly available, we aim to establish a benchmark to advance tree species classification and foster innovation in biodiversity research and ecological applications.

[103] Exploring Sparsity for Parameter Efficient Fine Tuning Using Wavelets

Ahmet Bilican,M. Akın Yılmaz,A. Murat Tekalp,R. Gökberk Cinbiş

Main category: cs.CV

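TL;DR: WaveFT is a novel parameter-efficient fine-tuning method that learns sparse updates in the wavelet domain and significantly outperforms LoRA and similar methods, especially at extremely low parameter counts.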

Details

Motivation: Efficiently adapting large foundation models under tight compute and memory budgets is critical, and existing PEFT methods such as LoRA are of limited effectiveness at low parameter counts. Method: WaveFT learns sparse updates to residual matrices in the wavelet domain, giving fine-grained control over the number of trainable parameters. Result: On personalized text-to-image generation, WaveFT significantly outperforms LoRA and other PEFT methods at low parameter counts, achieving higher subject fidelity, prompt alignment, and image diversity. Conclusion: WaveFT offers an effective solution for extreme parameter-efficient scenarios and shows the advantage of the wavelet transform for sparse updates. Abstract: Efficiently adapting large foundation models is critical, especially with tight compute and memory budgets. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA offer limited granularity and effectiveness in few-parameter regimes. We propose Wavelet Fine-Tuning (WaveFT), a novel PEFT method that learns highly sparse updates in the wavelet domain of residual matrices. WaveFT allows precise control of trainable parameters, offering fine-grained capacity adjustment and excelling with remarkably low parameter count, potentially far fewer than LoRA's minimum -- ideal for extreme parameter-efficient scenarios. In order to demonstrate the effect of the wavelet transform, we compare WaveFT with a special case, called SHiRA, that entails applying sparse updates directly in the weight domain. Evaluated on personalized text-to-image generation using Stable Diffusion XL as baseline, WaveFT significantly outperforms LoRA and other PEFT methods, especially at low parameter counts, achieving superior subject fidelity, prompt alignment, and image diversity.
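
The core idea, a handful of trainable wavelet-domain coefficients that map to a dense weight update, can be sketched as follows. This is an illustration under stated assumptions: it uses a single-level orthonormal 2D Haar transform (the abstract does not say which wavelet or how many levels WaveFT uses), and all function names are hypothetical rather than the authors' API.

```python
import math

SQRT2 = math.sqrt(2.0)

def inverse_haar_1d(coeffs):
    """Invert one level of the orthonormal Haar transform.

    Layout assumption: first half = approximation, second half = detail.
    """
    half = len(coeffs) // 2
    out = []
    for k in range(half):
        a, d = coeffs[k], coeffs[half + k]
        out.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return out

def inverse_haar_2d(coeffs):
    """Apply the 1D inverse along every row, then along every column."""
    rows = [inverse_haar_1d(r) for r in coeffs]
    cols = [inverse_haar_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def wavelet_sparse_update(n_rows, n_cols, trainable):
    """Build a dense weight update from a few wavelet-domain coefficients.

    `trainable` maps (row, col) positions in the coefficient matrix to
    values; every other coefficient stays at zero and is never trained.
    """
    coeffs = [[0.0] * n_cols for _ in range(n_rows)]
    for (r, c), value in trainable.items():
        coeffs[r][c] = value
    return inverse_haar_2d(coeffs)

# One trainable coefficient in the approximation band.
delta_w = wavelet_sparse_update(4, 4, {(0, 0): 2.0})
nonzero = sum(1 for row in delta_w for v in row if abs(v) > 1e-12)
```

Here one trainable coefficient changes a 2x2 block of weights, whereas a sparse update applied directly in the weight domain (the SHiRA special case named in the abstract) would change exactly one weight per trainable parameter.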

[104] ProMi: An Efficient Prototype-Mixture Baseline for Few-Shot Segmentation with Bounding-Box Annotations

Florent Chiaroni,Ali Ayub,Ola Ahmad

Main category: cs.CV

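TL;DR: Proposes ProMi, a training-free few-shot binary segmentation method based on bounding-box annotations that treats the background class as a mixture of distributions and significantly outperforms existing methods.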

Details

Motivation: In robotics applications, pixel-level annotation is time-consuming and costly, while few-shot segmentation reduces the amount of training data needed. Method: A prototype-mixture approach that uses bounding-box annotations instead of pixel-level labels and is simple and training-free. Result: Achieves the best results on several datasets, with significant gains over baselines, and is applicable to mobile robot tasks. Conclusion: ProMi is efficient and practical for robot tasks in real-world scenarios. Abstract: In robotics applications, few-shot segmentation is crucial because it allows robots to perform complex tasks with minimal training data, facilitating their adaptation to diverse, real-world environments. However, pixel-level annotation of even a small number of images is highly time-consuming and costly. In this paper, we present a novel few-shot binary segmentation method based on bounding-box annotations instead of pixel-level labels. We introduce ProMi, an efficient prototype-mixture-based method that treats the background class as a mixture of distributions. Our approach is simple, training-free, and effective, accommodating coarse annotations with ease. Compared to existing baselines, ProMi achieves the best results across different datasets with significant gains, demonstrating its effectiveness. Furthermore, we present qualitative experiments tailored to real-world mobile robot tasks, demonstrating the applicability of our approach in such scenarios. Our code: https://github.com/ThalesGroup/promi.

[105] VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold

Dominic Maggio,Hyungtae Lim,Luca Carlone

Main category: cs.CV

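TL;DR: VGGT-SLAM is a dense RGB SLAM system for uncalibrated monocular cameras that globally aligns submaps by optimizing on the SL(4) manifold, addressing the shortcomings of conventional similarity-transform alignment.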

Details

Motivation: Aligning submaps with similarity transforms (translation, rotation, and scale) is inadequate for uncalibrated cameras, so the reconstruction ambiguity must be addressed. Method: Optimizes 15-degrees-of-freedom homography transforms on the SL(4) manifold, together with loop closure constraints, to align submaps globally. Result: Experiments show that VGGT-SLAM improves map quality on long video sequences and overcomes the limitation imposed by VGGT's high GPU requirements. Conclusion: Through global alignment and optimization, VGGT-SLAM significantly improves reconstruction with uncalibrated cameras. Abstract: We present VGGT-SLAM, a dense RGB SLAM system constructed by incrementally and globally aligning submaps created from the feed-forward scene reconstruction approach VGGT using only uncalibrated monocular cameras. While related works align submaps using similarity transforms (i.e., translation, rotation, and scale), we show that such approaches are inadequate in the case of uncalibrated cameras. In particular, we revisit the idea of reconstruction ambiguity, where given a set of uncalibrated cameras with no assumption on the camera motion or scene structure, the scene can only be reconstructed up to a 15-degrees-of-freedom projective transformation of the true geometry. This inspires us to recover a consistent scene reconstruction across submaps by optimizing over the SL(4) manifold, thus estimating 15-degrees-of-freedom homography transforms between sequential submaps while accounting for potential loop closure constraints. As verified by extensive experiments, we demonstrate that VGGT-SLAM achieves improved map quality using long video sequences that are infeasible for VGGT due to its high GPU requirements.
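
The degrees-of-freedom count behind this choice can be written out explicitly. The following is only a counting sketch based on standard definitions; the symbol H for the transform between two submaps is notation introduced here, not taken from the paper.

```latex
\[
\mathrm{SL}(4) = \{\, H \in \mathbb{R}^{4 \times 4} : \det H = 1 \,\}, \qquad
\dim \mathrm{SL}(4) = 4^2 - 1 = 15 .
\]
\[
\underbrace{3}_{\text{rotation}} + \underbrace{3}_{\text{translation}} + \underbrace{1}_{\text{scale}} = 7
\quad \text{(similarity transform)} .
\]
```

A similarity transform therefore leaves 8 of the 15 projective degrees of freedom unmodeled, which is consistent with the abstract's claim that similarity alignment is inadequate for uncalibrated cameras.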

[106] Coarse Attribute Prediction with Task Agnostic Distillation for Real World Clothes Changing ReID

Priyank Pathak,Yogesh S Rawat

Main category: cs.CV

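TL;DR: The paper proposes RLQ, a new framework that improves clothes-changing re-identification (CC-ReID) on low-quality images, using CAP and TAD to improve external attributes and internal feature representations, respectively.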

Details

Motivation: Existing methods perform well on high-quality images but poorly on low-quality ones (pixelated, blurred, and so on), which leads to confused features and incorrect matches. Method: The RLQ framework combines CAP (Coarse Attributes Prediction) and TAD (Task Agnostic Distillation) in an alternating training mechanism. CAP reduces the effect of noisy inputs, and TAD narrows the gap between high- and low-quality features through task-agnostic self-supervision and distillation. Result: RLQ improves Top-1 accuracy by 1.6%-2.9% on real-world datasets such as LaST and DeepChange and by 5.3%-6% on PRCC, with competitive performance on LTCC. Conclusion: RLQ effectively counters the impact of low-quality images on CC-ReID and significantly improves model performance; the code will be open-sourced. Abstract: This work focuses on Clothes Changing Re-IDentification (CC-ReID) for the real world. Existing works perform well with high-quality (HQ) images, but struggle with low-quality (LQ) images, which can contain artifacts like pixelation, out-of-focus blur, and motion blur. These artifacts not only introduce noise to external biometric attributes (e.g. pose, body shape, etc.) but also corrupt the model's internal feature representation. Models usually cluster LQ image features together, making it difficult to distinguish between them, leading to incorrect matches. We propose a novel framework Robustness against Low-Quality (RLQ) to improve CC-ReID models on real-world data. RLQ relies on Coarse Attributes Prediction (CAP) and Task Agnostic Distillation (TAD) operating in alternate steps in a novel training mechanism. CAP enriches the model with external fine-grained attributes via coarse predictions, thereby reducing the effect of noisy inputs. On the other hand, TAD enhances the model's internal feature representation by bridging the gap between HQ and LQ features, via an external dataset through task-agnostic self-supervision and distillation. RLQ outperforms the existing approaches by 1.6%-2.9% Top-1 on real-world datasets like LaST and DeepChange, while showing consistent improvement of 5.3%-6% Top-1 on PRCC with competitive performance on LTCC. *The code will be made public soon.*

[107] Event-based Star Tracking under Spacecraft Jitter: the e-STURT Dataset

Samya Bagchi,Peter Anastasiou,Matthew Tetlow,Tat-Jun Chin,Yasir Latif

Main category: cs.CV

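TL;DR: The paper introduces e-STURT, the first event-camera dataset of star observations, for studying spacecraft jitter, and proposes a high-frequency jitter estimation algorithm.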

Details

Motivation: Spacecraft jitter degrades the fine-pointing ability needed for tasks such as optical communication, and developing compensation algorithms requires high-fidelity sensor data. Method: Jitter is emulated with a piezoelectric actuator to produce an event-camera dataset of 200 sequences, and a high-frequency jitter estimation algorithm that operates directly on the event stream is proposed. Result: Produces the first public dataset of star observations under jitter conditions and demonstrates the feasibility of the algorithm. Conclusion: The e-STURT dataset will support the development of jitter-aware algorithms and advance the use of event cameras in space missions. Abstract: Jitter degrades a spacecraft's fine-pointing ability required for optical communication, earth observation, and space domain awareness. Development of jitter estimation and compensation algorithms requires high-fidelity sensor observations representative of on-board jitter. In this work, we present the Event-based Star Tracking Under Jitter (e-STURT) dataset -- the first event camera based dataset of star observations under controlled jitter conditions. Specialized hardware employed for the dataset emulates an event-camera undergoing on-board jitter. While the event camera provides asynchronous, high temporal resolution star observations, systematic and repeatable jitter is introduced using a micrometer accurate piezoelectric actuator. Various jitter sources are simulated using distinct frequency bands and utilizing both axes of motion. Ground-truth jitter is captured in hardware from the piezoelectric actuator. The resulting dataset consists of 200 sequences and is made publicly available. This work highlights the dataset generation process, technical challenges and the resulting limitations. To serve as a baseline, we propose a high-frequency jitter estimation algorithm that operates directly on the event stream. The e-STURT dataset will enable the development of jitter aware algorithms for mission critical event-based space sensing applications.

[108] SurveillanceVQA-589K: A Benchmark for Comprehensive Surveillance Video-Language Understanding with Large Models

Bo Liu,Pengfei Qiao,Minhan Ma,Xuange Zhang,Yinan Tang,Peng Xu,Kun Liu,Tongtong Yuan

Main category: cs.CV

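TL;DR: The paper presents SurveillanceVQA-589K, a large-scale open-ended video question answering benchmark for the surveillance domain with 589,380 QA pairs covering 12 cognitively diverse question types. A hybrid annotation pipeline and an evaluation protocol reveal the limitations of current models in real surveillance scenarios.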

Details

Motivation: Understanding surveillance video content remains a critical but underexplored challenge in vision-language research, particularly because of its real-world complexity, irregular event dynamics, and safety-critical nature. Method: Designs a hybrid annotation pipeline that combines temporally aligned human annotations with prompt-based QA generation assisted by large vision-language models, and proposes a multi-dimensional evaluation protocol. Result: Evaluates eight large vision-language models and finds significant performance gaps on causal and anomaly-related tasks. Conclusion: The benchmark provides a practical and comprehensive resource for video-language understanding in safety-critical applications. Abstract: Understanding surveillance video content remains a critical yet underexplored challenge in vision-language research, particularly due to its real-world complexity, irregular event dynamics, and safety-critical implications. In this work, we introduce SurveillanceVQA-589K, the largest open-ended video question answering benchmark tailored to the surveillance domain. The dataset comprises 589,380 QA pairs spanning 12 cognitively diverse question types, including temporal reasoning, causal inference, spatial understanding, and anomaly interpretation, across both normal and abnormal video scenarios. To construct the benchmark at scale, we design a hybrid annotation pipeline that combines temporally aligned human-written captions with Large Vision-Language Model-assisted QA generation using prompt-based techniques. We also propose a multi-dimensional evaluation protocol to assess contextual, temporal, and causal comprehension. We evaluate eight LVLMs under this framework, revealing significant performance gaps, especially in causal and anomaly-related tasks, underscoring the limitations of current models in real-world surveillance contexts. Our benchmark provides a practical and comprehensive resource for advancing video-language understanding in safety-critical applications such as intelligent monitoring, incident analysis, and autonomous decision-making.

[109] Learning Cross-Spectral Point Features with Task-Oriented Training

Mia Thomas,Trevor Ablett,Jonathan Kelly

Main category: cs.CV

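TL;DR: The paper proposes learned cross-spectral (thermal-visible) point features for integrating thermal imagery into UAV navigation systems. The feature network is trained on matching and registration tasks, which significantly improves performance in low-visibility conditions.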

Details

Motivation: UAV navigation systems that rely on visible-light cameras perform poorly in low visibility, whereas thermal cameras work effectively in darkness and smoke, so a way to combine thermal imagery with existing navigation systems is needed. Method: Trains the feature network on matching and registration tasks by feeding the network response into a differentiable registration pipeline and applying losses to its matching and registration estimates. Result: On the MultiPoint dataset, the trained model achieves a registration error (corner error) below 10 pixels for more than 75% of estimates. Conclusion: The method works not only within a deep learning pipeline but also with classical matching and registration methods, offering an effective solution for UAV navigation in low visibility. Abstract: Unmanned aerial vehicles (UAVs) enable operations in remote and hazardous environments, yet the visible-spectrum, camera-based navigation systems often relied upon by UAVs struggle in low-visibility conditions. Thermal cameras, which capture long-wave infrared radiation, are able to function effectively in darkness and smoke, where visible-light cameras fail. This work explores learned cross-spectral (thermal-visible) point features as a means to integrate thermal imagery into established camera-based navigation systems. Existing methods typically train a feature network's detection and description outputs directly, which often focuses training on image regions where thermal and visible-spectrum images exhibit similar appearance. Aiming to more fully utilize the available data, we propose a method to train the feature network on the tasks of matching and registration. We run our feature network on thermal-visible image pairs, then feed the network response into a differentiable registration pipeline. Losses are applied to the matching and registration estimates of this pipeline. Our selected model, trained on the task of matching, achieves a registration error (corner error) below 10 pixels for more than 75% of estimates on the MultiPoint dataset. We further demonstrate that our model can also be used with a classical pipeline for matching and registration.
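
For readers unfamiliar with the reported metric, the sketch below shows one common way to compute a corner error between an estimated and a ground-truth homography, and the fraction of estimates under a pixel threshold. It assumes the error is the mean Euclidean distance over the four image corners; the paper's exact definition may differ, and the function names are hypothetical.

```python
import math

def warp_point(H, x, y):
    """Apply a 3x3 homography (nested lists) to a pixel coordinate."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w

def corner_error(H_est, H_gt, width, height):
    """Mean distance between the image corners warped by H_est and by H_gt.

    Assumption: averaging over the four corners; other conventions exist.
    """
    corners = [(0, 0), (width - 1, 0), (width - 1, height - 1), (0, height - 1)]
    total = 0.0
    for x, y in corners:
        xe, ye = warp_point(H_est, x, y)
        xg, yg = warp_point(H_gt, x, y)
        total += math.hypot(xe - xg, ye - yg)
    return total / len(corners)

def fraction_below(errors, threshold=10.0):
    """Share of estimates whose corner error is under the threshold (pixels)."""
    return sum(e < threshold for e in errors) / len(errors)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shifted = [[1.0, 0.0, 3.0], [0.0, 1.0, 4.0], [0.0, 0.0, 1.0]]
err = corner_error(shifted, identity, 640, 480)
rate = fraction_below([5.0, 12.0, 8.0, 3.0])
```

In this toy case the estimate is off by a pure translation of (3, 4) pixels, so every corner moves by the same distance, and three of the four example errors fall under the 10-pixel threshold.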

[110] Temporal-Oriented Recipe for Transferring Large Vision-Language Model to Video Understanding

Thong Nguyen,Zhiyuan Hu,Xu Lin,Cong-Duy Nguyen,See-Kiong Ng,Luu Anh Tuan

Main category: cs.CV

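TL;DR: Through an empirical study, this paper identifies the key components that affect the temporal understanding of large vision-language models (LVLMs) and proposes a temporal-oriented recipe that significantly improves performance on video understanding tasks.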

Details

Motivation: Current large vision-language models rely on their implicit temporal understanding capacity for video understanding, but the key components behind it have not been identified, which limits their potential. Method: Analyzes the key components through an empirical study and proposes temporal-oriented training schemes and an upscaled interface. Result: The final model significantly outperforms previous LVLMs on standard video understanding tasks. Conclusion: The temporal-oriented recipe effectively improves the video understanding ability of LVLMs. Abstract: Recent years have witnessed outstanding advances in large vision-language models (LVLMs). In order to tackle video understanding, most of them depend upon their implicit temporal understanding capacity. As such, they have not deciphered important components that contribute to temporal understanding ability, which might limit the potential of these LVLMs for video understanding. In this work, we conduct a thorough empirical study to demystify crucial components that influence the temporal understanding of LVLMs. Our empirical study reveals that significant impacts are centered around the intermediate interface between the visual encoder and the large language model. Building on these insights, we propose a temporal-oriented recipe that encompasses temporal-oriented training schemes and an upscaled interface. Our final model developed using our recipe significantly improves on previous LVLMs on standard video understanding tasks.

Shiyu Xuan,Zechao Li,Jinhui Tang

Main category: cs.CV

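TL;DR: Diff-MM is a multi-modal object tracking method that extracts features with a pre-trained text-to-image generation model (such as Stable Diffusion) and improves tracking in complex scenes through a parallel feature extraction pipeline and multi-modal sub-module tuning.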

Details

Motivation: Existing methods are constrained by limited multi-modal training data and perform poorly. Diff-MM aims to address this by exploiting the multi-modal understanding capability of a pre-trained generation model. Method: Diff-MM uses the UNet of pre-trained Stable Diffusion as a feature extractor, processes pairwise image inputs through a parallel feature extraction pipeline, and introduces a multi-modal sub-module tuning method to learn complementary information between modalities. Result: Experiments show that Diff-MM outperforms recently proposed trackers; for example, its AUC is 8.3% higher than OneTracker's on the TNL2K dataset. Conclusion: By exploiting the prior knowledge in the generation model, Diff-MM achieves a unified RGB-N/D/T/E tracker with significantly improved performance. Abstract: Multi-modal object tracking integrates auxiliary modalities such as depth, thermal infrared, event flow, and language to provide additional information beyond RGB images, showing great potential in improving tracking stability in complex scenarios. Existing methods typically start from an RGB-based tracker and learn to understand auxiliary modalities only from training data. Constrained by the limited multi-modal training data, the performance of these methods is unsatisfactory. To alleviate this limitation, this work proposes a unified multi-modal tracker Diff-MM by exploiting the multi-modal understanding capability of the pre-trained text-to-image generation model. Diff-MM leverages the UNet of pre-trained Stable Diffusion as a tracking feature extractor through the proposed parallel feature extraction pipeline, which enables pairwise image inputs for object tracking. We further introduce a multi-modal sub-module tuning method that learns to gain complementary information between different modalities. By harnessing the extensive prior knowledge in the generation model, we achieve a unified tracker with uniform parameters for RGB-N/D/T/E tracking. Experimental results demonstrate the promising performance of our method compared with recently proposed trackers, e.g., its AUC outperforms OneTracker by 8.3% on TNL2K.

[112] BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation

Haiquan Wen,Yiwei He,Zhenglin Huang,Tianxiao Li,Zihan YU,Xingru Huang,Lu Qi,Baoyuan Wu,Xiangtai Li,Guangliang Cheng

Main category: cs.CV

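TL;DR: The paper presents GenBuster-200K, the first large-scale, high-quality dataset of AI-generated videos, and BusterX, the first explainable detection framework that combines an MLLM with reinforcement learning, to counter the misinformation risk posed by AI-generated video.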

Details

Motivation: AI video generation is advancing rapidly (for example, Sora and WanX), but large-scale, high-quality datasets and explainable detection methods are lacking, which heightens misinformation risks. Method: Proposes the GenBuster-200K dataset (200K high-resolution video clips) and the BusterX framework, which combines an MLLM with reinforcement learning for detection and explanation. Result: Experiments show that BusterX outperforms existing methods in detection effectiveness and generalizability. Conclusion: GenBuster-200K and BusterX fill a gap in AI-generated video detection and provide new tools for combating misinformation. Abstract: Advances in AI generative models facilitate super-realistic video synthesis, amplifying misinformation risks via social media and eroding trust in digital content. Several research works have explored new deepfake detection methods on AI-generated images to alleviate these risks. However, with the fast development of video generation models, such as Sora and WanX, there is currently a lack of large-scale, high-quality AI-generated video datasets for forgery detection. In addition, existing detection approaches predominantly treat the task as binary classification, lacking explainability in model decision-making and failing to provide actionable insights or guidance for the public. To address these challenges, we propose GenBuster-200K, a large-scale AI-generated video dataset featuring 200K high-resolution video clips, diverse latest generative techniques, and real-world scenes. We further introduce BusterX, a novel AI-generated video detection and explanation framework leveraging a multimodal large language model (MLLM) and reinforcement learning for authenticity determination and explainable rationale. To our knowledge, GenBuster-200K is the first large-scale, high-quality AI-generated video dataset that incorporates the latest generative techniques for real-world scenarios. BusterX is the first framework to integrate MLLM with reinforcement learning for explainable AI-generated video detection. Extensive comparisons with state-of-the-art methods and ablation studies validate the effectiveness and generalizability of BusterX. The code, models, and datasets will be released.

[113] Degradation-Aware Feature Perturbation for All-in-One Image Restoration

Xiangpeng Tian,Xiangyu Liao,Xiao Liu,Meng Li,Chao Ren

Main category: cs.CV

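TL;DR: DFPIR is a novel all-in-one image restoration method that introduces degradation-aware feature perturbations (DFP) to adjust the feature space, addressing multi-task interference and achieving leading performance on several image restoration tasks.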

Details

Motivation: Address the conflicting gradient update directions and task interference that differences between degradation types cause in all-in-one image restoration. Method: Proposes DFPIR, which adjusts the feature space through channel-wise and attention-wise perturbations, and designs a Degradation-Guided Perturbation Block (DGPB) to implement them. Result: Achieves state-of-the-art performance on denoising, dehazing, deraining, motion deblurring, and low-light enhancement. Conclusion: DFPIR effectively resolves multi-task interference through feature perturbation and provides an efficient solution for all-in-one image restoration. Abstract: All-in-one image restoration aims to recover clear images from various degradation types and levels with a unified model. Nonetheless, the significant variations among degradation types present challenges for training a universal model, often resulting in task interference, where the gradient update directions of different tasks may diverge due to shared parameters. To address this issue, motivated by the routing strategy, we propose DFPIR, a novel all-in-one image restorer that introduces Degradation-aware Feature Perturbations (DFP) to adjust the feature space to align with the unified parameter space. In this paper, the feature perturbations primarily include channel-wise perturbations and attention-wise perturbations. Specifically, channel-wise perturbations are implemented by shuffling the channels in high-dimensional space guided by degradation types, while attention-wise perturbations are achieved through selective masking in the attention space. To achieve these goals, we propose a Degradation-Guided Perturbation Block (DGPB) to implement these two functions, positioned between the encoding and decoding stages of the encoder-decoder architecture. Extensive experimental results demonstrate that DFPIR achieves state-of-the-art performance on several all-in-one image restoration tasks including image denoising, image dehazing, image deraining, motion deblurring, and low-light image enhancement. Our codes are available at https://github.com/TxpHome/DFPIR.

[114] Multi-Resolution Haar Network: Enhancing human motion prediction via Haar transform

Li Lin

Main category: cs.CV

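TL;DR: The HaarMoDic network uses the 2D Haar transform to project joints to higher-resolution coordinates, capturing spatial and temporal information simultaneously and significantly improving the accuracy of 3D human pose prediction.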

Details

Motivation: Existing methods ignore the arbitrariness of human motion sequences along the temporal and spatial axes and therefore struggle to predict complex motions. Method: Proposes the HaarMoDic network, whose MR-Haar block uses the 2D Haar transform to project motion sequences to mixed higher-resolution coordinates and extracts spatial and temporal information simultaneously. Result: On the Human3.6M dataset, HaarMoDic surpasses existing methods on the MPJPE metric at every testing interval. Conclusion: The MR-Haar block significantly improves prediction performance through multi-resolution mixed coordinates and offers a new approach to predicting complex motions. Abstract: The 3D human pose is vital for modern computer vision and computer graphics, and its prediction has drawn attention in recent years. 3D human pose prediction aims at forecasting a human's future motion from the previous sequence. Ignoring that the arbitrariness of human motion sequences has a firm origin in transition in both temporal and spatial axes limits the performance of state-of-the-art methods, leading them to struggle with making precise predictions on complex cases, e.g., arbitrarily posing or greeting. To alleviate this problem, a network called HaarMoDic is proposed in this paper, which utilizes the 2D Haar transform to project joints to higher resolution coordinates where the network can access spatial and temporal information simultaneously. An ablation study proves that the significant contributing module within the HaarMoDic network is the Multi-Resolution Haar (MR-Haar) block. Instead of mining in one of two axes or extracting separately, the MR-Haar block projects whole motion sequences to a mixed-up coordinate in higher resolution with the 2D Haar transform, allowing the network to give scope to information from both axes in different resolutions. With the MR-Haar block, the HaarMoDic network can make predictions referring to a broader range of information. Experimental results demonstrate that HaarMoDic surpasses state-of-the-art methods in every testing interval on the Human3.6M dataset in the Mean Per Joint Position Error (MPJPE) metric.

[115] Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents

Yunseok Jang,Yeda Song,Sungryull Sohn,Lajanugen Logeswaran,Tiange Luo,Dong-Ki Kim,Kyunghoon Bae,Honglak Lee

Main category: cs.CV

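TL;DR: MONDAY is a large-scale dataset of 313K annotated frames from 20K instructional videos for training GUI visual agents, and it significantly improves cross-platform generalization.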

Details

Motivation: Demand for GUI visual agents is growing, but existing datasets are limited to a single operating system and lack diversity. Method: Builds the dataset automatically from publicly available video content, combining OCR, UI element detection, and multi-step action identification. Result: Models pre-trained with MONDAY gain 18.11%p on average on an unseen mobile OS platform. Conclusion: The MONDAY dataset and the automated framework are an important resource for research on mobile OS navigation. Abstract: Recent advancements in Large Language Models (LLMs) and Vision-Language Models (VLMs) have sparked significant interest in developing GUI visual agents. We introduce MONDAY (Mobile OS Navigation Task Dataset for Agents from YouTube), a large-scale dataset of 313K annotated frames from 20K instructional videos capturing diverse real-world mobile OS navigation across multiple platforms. Models that include MONDAY in their pre-training phases demonstrate robust cross-platform generalization capabilities, consistently outperforming models trained on existing single OS datasets while achieving an average performance gain of 18.11%p on an unseen mobile OS platform. To enable continuous dataset expansion as mobile platforms evolve, we present an automated framework that leverages publicly available video content to create comprehensive task datasets without manual annotation. Our framework comprises robust OCR-based scene detection (95.04% F1 score), near-perfect UI element detection (99.87% hit ratio), and novel multi-step action identification to extract reliable action sequences across diverse interface configurations. We contribute both the MONDAY dataset and our automated collection framework to facilitate future research in mobile OS navigation.

[116] MVPainter: Accurate and Detailed 3D Texture Generation via Multi-View Diffusion with Geometric Control

Mingqi Shao,Feng Xiong,Zhaoxu Sun,Mu Xu

Main category: cs.CV

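TL;DR: MVPainter improves 3D texture generation through data filtering, augmentation, and geometric conditioning, which raise texture quality and alignment, and it supports PBR rendering.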

Details

Motivation: 3D texture generation remains underexplored, with open challenges in texture alignment, geometric consistency, and local quality. Method: Applies data filtering and augmentation strategies, introduces ControlNet-based geometric conditioning, and extracts PBR attributes to produce meshes suitable for rendering. Result: Achieves state-of-the-art results in texture alignment, geometric consistency, and local quality, as confirmed by human-aligned evaluations. Conclusion: MVPainter provides an efficient solution for 3D texture generation, and the full pipeline is open-sourced to support further research. Abstract: Recently, significant advances have been made in 3D object generation. Building upon the generated geometry, current pipelines typically employ image diffusion models to generate multi-view RGB images, followed by UV texture reconstruction through texture baking. While 3D geometry generation has improved significantly, supported by multiple open-source frameworks, 3D texture generation remains underexplored. In this work, we systematically investigate 3D texture generation through the lens of three core dimensions: reference-texture alignment, geometry-texture consistency, and local texture quality. To tackle these issues, we propose MVPainter, which employs data filtering and augmentation strategies to enhance texture fidelity and detail, and introduces ControlNet-based geometric conditioning to improve texture-geometry alignment. Furthermore, we extract physically-based rendering (PBR) attributes from the generated views to produce PBR meshes suitable for real-world rendering applications. MVPainter achieves state-of-the-art results across all three dimensions, as demonstrated by human-aligned evaluations. To facilitate further research and reproducibility, we also release our full pipeline as an open-source system, including data construction, model architecture, and evaluation tools.

[117] Single Image Reflection Removal via inter-layer Complementarity

Yue Huang,Zi'ang Li,Tianle Hu,Jie Wen,Guanbin Li,Jinglin Zhang,Guoxu Zhou,Xiaozhao Fang

Main category: cs.CV

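TL;DR: The paper proposes an improved dual-stream architecture that strengthens the inter-layer complementarity model and introduces an efficient attention mechanism, significantly improving the quality and efficiency of single image reflection removal.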

Details

Motivation: Existing dual-stream architectures fail to fully exploit inter-layer complementarity, which limits the quality of image separation. Method: 1. Proposes a novel inter-layer complementarity model that enhances complementarity through the interaction of low-frequency and high-frequency components; 2. Designs an efficient inter-layer complementarity attention mechanism that improves separation through channel-level reorganization and attention computation. Result: Experiments show that the method achieves state-of-the-art separation quality on multiple public datasets while significantly reducing computational cost and model complexity. Conclusion: The improved dual-stream architecture, through inter-layer complementarity and the attention mechanism, significantly improves the performance and efficiency of image reflection removal. Abstract: Although dual-stream architectures have achieved remarkable success in single image reflection removal, they fail to fully exploit inter-layer complementarity in their physical modeling and network design, which limits the quality of image separation. To address this fundamental limitation, we propose two targeted improvements to enhance dual-stream architectures: First, we introduce a novel inter-layer complementarity model where low-frequency components extracted from the residual layer interact with the transmission layer through dual-stream architecture to enhance inter-layer complementarity. Meanwhile, high-frequency components from the residual layer provide inverse modulation to both streams, improving the detail quality of the transmission layer. Second, we propose an efficient inter-layer complementarity attention mechanism which first cross-reorganizes dual streams at the channel level to obtain reorganized streams with inter-layer complementary structures, then performs attention computation on the reorganized streams to achieve better inter-layer separation, and finally restores the original stream structure for output. Experimental results demonstrate that our method achieves state-of-the-art separation quality on multiple public datasets while significantly reducing both computational cost and model complexity.

[118] Use as Many Surrogates as You Want: Selective Ensemble Attack to Unleash Transferability without Sacrificing Resource Efficiency

Bo Yang,Hengwei Zhang,Jindong Wang,Yuchen Ren,Chenhao Lin,Chao Shen,Zhengyu Zhao

Main category: cs.CV

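TL;DR: The paper proposes Selective Ensemble Attack (SEA), an attack that dynamically selects diverse models. By decoupling within-iteration and cross-iteration model diversity, it improves transferability while preserving efficiency.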

Details

Motivation: Existing attacks face a trade-off between transferability and efficiency that limits their practical use. The paper argues that this trade-off stems from an unnecessary assumption (that all models must be identical across iterations) and proposes a way around it. Method: Proposes Selective Ensemble Attack (SEA), which dynamically selects diverse models, fixing the number of within-iteration models to maintain efficiency and increasing cross-iteration diversity to improve transferability. Result: Experiments on ImageNet show that SEA yields 8.5% higher transferability than existing attacks under the same efficiency, and that it also applies to commercial APIs and large vision-language models. Conclusion: SEA opens up the possibility of adaptively balancing transferability and efficiency according to resource requirements. Abstract: In surrogate ensemble attacks, using more surrogate models yields higher transferability but lower resource efficiency. This practical trade-off between transferability and efficiency has largely limited existing attacks even though many pre-trained models are easily accessible online. In this paper, we argue that such a trade-off is caused by an unnecessary common assumption, i.e., all models should be identical across iterations. By lifting this assumption, we can use as many surrogates as we want to unleash transferability without sacrificing efficiency. Concretely, we propose Selective Ensemble Attack (SEA), which dynamically selects diverse models (from easily accessible pre-trained models) across iterations based on our new interpretation of decoupling within-iteration and cross-iteration model diversity. In this way, the number of within-iteration models is fixed for maintaining efficiency, while only cross-iteration model diversity is increased for higher transferability. Experiments on ImageNet demonstrate the superiority of SEA in various scenarios. For example, when dynamically selecting 4 from 20 accessible models, SEA yields 8.5% higher transferability than existing attacks under the same efficiency. The superiority of SEA also generalizes to real-world systems, such as commercial vision APIs and large vision-language models. Overall, SEA opens up the possibility of adaptively balancing transferability and efficiency according to specific resource requirements.
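
The decoupling of within-iteration cost from cross-iteration diversity can be illustrated with a toy loop. This is a sketch under stated assumptions, not the authors' algorithm: the round-robin rule below is a stand-in for SEA's selection strategy, which the abstract does not describe, and the surrogates are hypothetical linear models whose gradient is a constant vector.

```python
def sign(x):
    return (x > 0) - (x < 0)

def selective_ensemble_attack(x, grad_fns, k, steps, step_size, eps):
    """Iterative sign-gradient attack that queries only k surrogates per step.

    Returns the perturbed input, the number of gradient calls, and the
    number of distinct surrogates used across all iterations.
    """
    original = list(x)
    adv = list(x)
    used = set()
    calls = 0
    n = len(grad_fns)
    for t in range(steps):
        # Stand-in selection rule (assumption): rotate through the pool.
        chosen = [(t * k + m) % n for m in range(k)]
        used.update(chosen)
        avg = [0.0] * len(adv)
        for idx in chosen:
            grad = grad_fns[idx](adv)
            calls += 1
            avg = [a + g / k for a, g in zip(avg, grad)]
        # Sign step, then projection back into the eps-ball around the input.
        adv = [xi + step_size * sign(g) for xi, g in zip(adv, avg)]
        adv = [min(max(xi, oi - eps), oi + eps) for xi, oi in zip(adv, original)]
    return adv, calls, len(used)

def make_linear_grad(weights):
    """Gradient of a hypothetical linear loss w . x, which is just w."""
    return lambda x: list(weights)

pool = [make_linear_grad([i + 1.0, -(i + 1.0)]) for i in range(20)]
adv, calls, distinct = selective_ensemble_attack(
    [0.0, 0.0], pool, k=4, steps=5, step_size=0.1, eps=0.3
)
```

Each iteration costs four gradient evaluations regardless of the pool size, yet all twenty surrogates contribute over the five iterations, which mirrors the "4 from 20" setting mentioned in the abstract.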

[119] AutoMat: Enabling Automated Crystal Structure Reconstruction from Microscopy via Agentic Tool Use

Yaotian Yang,Yiwen Tang,Yizhe Chen,Xiao Chen,Jiangjie Qiu,Hao Xiong,Haoyu Yin,Zhiyao Luo,Yifei Zhang,Sijia Tao,Wentao Li,Qinghua Zhang,Yuqiang Li,Wanli Ouyang,Bin Zhao,Xiaonan Wang,Fei Wei

Main category: cs.CV

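TL;DR: AutoMat is an end-to-end automated pipeline that converts STEM images into atomic crystal structures and predicts their physical properties, substantially outperforming existing tools.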

Details

Motivation: Experimentally resolved atomic structure data are scarce, and converting electron microscopy images into simulation-ready formats is laborious and error-prone, which hinders the training and validation of machine learning models. Method: AutoMat combines pattern-adaptive denoising, physics-guided template retrieval, symmetry-aware atomic reconstruction, fast relaxation, and property prediction via MatterSim, and coordinates all stages to achieve closed-loop reasoning. Result: In large-scale experiments over 450 structure samples, AutoMat substantially outperforms existing multimodal large language models and tools on lattice RMSD, formation energy MAE, and structure-matching success rate. Conclusion: AutoMat and STEM2Mat-Bench mark a key step toward bridging microscopy and atomistic simulation in materials science. Abstract: Machine learning-based interatomic potentials and force fields depend critically on accurate atomic structures, yet such data are scarce due to the limited availability of experimentally resolved crystals. Although atomic-resolution electron microscopy offers a potential source of structural data, converting these images into simulation-ready formats remains labor-intensive and error-prone, creating a bottleneck for model training and validation. We introduce AutoMat, an end-to-end, agent-assisted pipeline that automatically transforms scanning transmission electron microscopy (STEM) images into atomic crystal structures and predicts their physical properties. AutoMat combines pattern-adaptive denoising, physics-guided template retrieval, symmetry-aware atomic reconstruction, fast relaxation and property prediction via MatterSim, and coordinated orchestration across all stages. We propose the first dedicated STEM2Mat-Bench for this task and evaluate performance using lattice RMSD, formation energy MAE, and structure-matching success rate. By orchestrating external tool calls, AutoMat enables a text-only LLM to outperform vision-language models in this domain, achieving closed-loop reasoning throughout the pipeline. In large-scale experiments over 450 structure samples, AutoMat substantially outperforms existing multimodal large language models and tools. These results validate both AutoMat and STEM2Mat-Bench, marking a key step toward bridging microscopy and atomistic simulation in materials science. The code and dataset are publicly available at https://github.com/yyt-2378/AutoMat and https://huggingface.co/datasets/yaotianvector/STEM2Mat.

[120] SPKLIP: Aligning Spike Video Streams with Natural Language

Yongchang Gao,Meiling Jin,Zhaofei Yu,Tiejun Huang,Guozhang Chen

Main category: cs.CV

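TL;DR: SPKLIP is an architecture designed specifically for Spike-VLA that achieves effective semantic alignment through hierarchical feature extraction and contrastive learning, with strong results in both energy efficiency and performance.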

Details

Motivation: Address the modality mismatch between the sparse, asynchronous output of spike cameras and semantic understanding, and improve Spike-VLA performance. Method: Uses a hierarchical spike feature extractor to model multi-scale temporal dynamics, combined with spike-text contrastive learning, and introduces a full-spiking visual encoder to improve energy efficiency. Result: Achieves state-of-the-art performance on benchmark datasets and strong generalization on a newly contributed real-world dataset. Conclusion: SPKLIP's advantages in energy efficiency and performance open a new direction for event-based multimodal research. Abstract: Spike cameras offer unique sensing capabilities but their sparse, asynchronous output challenges semantic understanding, especially for Spike Video-Language Alignment (Spike-VLA) where models like CLIP underperform due to modality mismatch. We introduce SPKLIP, the first architecture specifically for Spike-VLA. SPKLIP employs a hierarchical spike feature extractor that adaptively models multi-scale temporal dynamics in event streams, and uses spike-text contrastive learning to directly align spike video with language, enabling effective few-shot learning. A full-spiking visual encoder variant, integrating SNN components into our pipeline, demonstrates enhanced energy efficiency. Experiments show state-of-the-art performance on benchmark spike datasets and strong few-shot generalization on a newly contributed real-world dataset. SPKLIP's energy efficiency highlights its potential for neuromorphic deployment, advancing event-based multimodal research. The source code and dataset are available at [link removed for anonymity].

[121] Predicting Reaction Time to Comprehend Scenes with Foveated Scene Understanding Maps

Ziqi Wen,Jonathan Skaza,Shravan Murlidaran,William Y. Wang,Miguel P. Eckstein

Main category: cs.CV

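TL;DR: Proposes F-SUM, a novel image-computable model based on vision-language models and foveated vision that predicts human scene understanding time and outperforms conventional image metrics.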

Details

Motivation: Existing models struggle to predict human response times in scene understanding tasks, while advances in vision-language models offer a new opportunity for modeling. Method: Combines foveated vision with vision-language models to produce a spatially resolved scene understanding map (F-SUM) and computes an aggregate score. Result: The F-SUM score correlates significantly with human response time (r=0.47), number of saccades (r=0.51), and description accuracy (r=-0.56), outperforming conventional metrics. Conclusion: F-SUM is an effective image-computable metric that underscores the importance of foveated visual processing for scene comprehension difficulty. Abstract: Although models exist that predict human response times (RTs) in tasks such as target search and visual discrimination, the development of image-computable predictors for scene understanding time remains an open challenge. Recent advances in vision-language models (VLMs), which can generate scene descriptions for arbitrary images, combined with the availability of quantitative metrics for comparing linguistic descriptions, offer a new opportunity to model human scene understanding. We hypothesize that the primary bottleneck in human scene understanding and the driving source of variability in response times across scenes is the interaction between the foveated nature of the human visual system and the spatial distribution of task-relevant visual information within an image. Based on this assumption, we propose a novel image-computable model that integrates foveated vision with VLMs to produce a spatially resolved map of scene understanding as a function of fixation location (Foveated Scene Understanding Map, or F-SUM), along with an aggregate F-SUM score. This metric correlates with average (N=17) human RTs (r=0.47) and number of saccades (r=0.51) required to comprehend a scene (across 277 scenes). The F-SUM score also correlates with average (N=16) human description accuracy (r=-0.56) in time-limited presentations. These correlations significantly exceed those of standard image-based metrics such as clutter, visual complexity, and scene ambiguity based on language entropy. Together, our work introduces a new image-computable metric for predicting human response times in scene understanding and demonstrates the importance of foveated visual processing in shaping comprehension difficulty.

[122] Safe-Sora: Safe Text-to-Video Generation via Graphical Watermarking

Zihan Su,Xuerui Qiu,Hongbin Xu,Tangyu Jiang,Junhao Zhuang,Chun Yuan,Ming Li,Shengfeng He,Fei Richard Yu

Main category: cs.CV

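TL;DR: Safe-Sora is the first framework to embed graphical watermarks during video generation. A hierarchical adaptive matching mechanism and a 3D wavelet transform-enhanced Mamba architecture deliver high-quality, high-fidelity, and robust watermark protection.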

Details

Motivation: Address the underexplored problem of copyright protection for AI-generated video content, in particular the use of invisible watermarking in video generation. Method: Uses a hierarchical coarse-to-fine adaptive matching mechanism that divides the watermark image into patches and assigns each to the most similar video frame, combined with a 3D wavelet transform-enhanced Mamba architecture for spatiotemporal fusion. Result: Experiments show that Safe-Sora achieves state-of-the-art video quality, watermark fidelity, and robustness. Conclusion: Safe-Sora opens new avenues for efficient and robust video watermark protection and is the first to apply state space models to watermarking. Abstract: The explosive growth of generative video models has amplified the demand for reliable copyright preservation of AI-generated content. Despite its popularity in image synthesis, invisible generative watermarking remains largely underexplored in video generation. To address this gap, we propose Safe-Sora, the first framework to embed graphical watermarks directly into the video generation process. Motivated by the observation that watermarking performance is closely tied to the visual similarity between the watermark and cover content, we introduce a hierarchical coarse-to-fine adaptive matching mechanism. Specifically, the watermark image is divided into patches, each assigned to the most visually similar video frame, and further localized to the optimal spatial region for seamless embedding. To enable spatiotemporal fusion of watermark patches across video frames, we develop a 3D wavelet transform-enhanced Mamba architecture with a novel spatiotemporal local scanning strategy, effectively modeling long-range dependencies during watermark embedding and retrieval. To the best of our knowledge, this is the first attempt to apply state space models to watermarking, opening new avenues for efficient and robust watermark protection. Extensive experiments demonstrate that Safe-Sora achieves state-of-the-art performance in terms of video quality, watermark fidelity, and robustness, which is largely attributed to our proposals. We will release our code upon publication.

[123] TS-VLM: Text-Guided SoftSort Pooling for Vision-Language Models in Multi-View Driving Reasoning

Lihong Chen,Hossein Hassani,Soodeh Nikan

Main category: cs.CV

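TL;DR: TS-VLM is a lightweight vision-language model that dynamically fuses multi-view visual features through a Text-Guided SoftSort Pooling (TGSSP) module, substantially reducing computational cost while improving multi-view reasoning accuracy.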

Details

Motivation: Existing vision-language models are hard to deploy in real time for autonomous driving because of high computational overhead and inefficient fusion of multi-view data. Method: Proposes the TS-VLM model with a TGSSP module that dynamically ranks and fuses multi-view visual features based on the semantics of the input query, avoiding costly attention mechanisms. Result: On the DriveLM benchmark, TS-VLM outperforms existing models (BLEU-4: 56.82, METEOR: 41.91, ROUGE-L: 74.64, CIDEr: 3.39) while reducing computational cost by up to 90%; the smallest version has only 20.1M parameters. Conclusion: With efficient multi-view fusion and a lightweight design, TS-VLM offers a practical solution for real-time deployment in autonomous driving. Abstract: Vision-Language Models (VLMs) have shown remarkable potential in advancing autonomous driving by leveraging multi-modal fusion in order to enhance scene perception, reasoning, and decision-making. Despite their potential, existing models suffer from computational overhead and inefficient integration of multi-view sensor data that make them impractical for real-time deployment in safety-critical autonomous driving applications. To address these shortcomings, this paper is devoted to designing a lightweight VLM called TS-VLM, which incorporates a novel Text-Guided SoftSort Pooling (TGSSP) module. By resorting to semantics of the input queries, TGSSP ranks and fuses visual features from multiple views, enabling dynamic and query-aware multi-view aggregation without reliance on costly attention mechanisms. This design ensures the query-adaptive prioritization of semantically related views, which leads to improved contextual accuracy in multi-view reasoning for autonomous driving. Extensive evaluations on the DriveLM benchmark demonstrate that, on the one hand, TS-VLM outperforms state-of-the-art models with a BLEU-4 score of 56.82, METEOR of 41.91, ROUGE-L of 74.64, and CIDEr of 3.39. On the other hand, TS-VLM reduces computational cost by up to 90%, where the smallest version contains only 20.1 million parameters, making it more practical for real-time deployment in autonomous vehicles.
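
The abstract describes ranking and fusing per-view features by their relevance to the query without attention. One way to picture this is the SoftSort relaxation, which replaces a hard sort with a row-stochastic "soft permutation" matrix. The sketch below is illustrative only: the dot-product relevance score, the `rank_weights` vector, and the function names are assumptions, not the TGSSP module from the paper.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softsort(scores, tau=1.0):
    # SoftSort relaxation: row i is a soft one-hot vector that points at
    # the item holding the i-th largest score. Smaller tau -> harder sort.
    ordered = sorted(scores, reverse=True)
    return [softmax([-abs(s_sorted - s) / tau for s in scores])
            for s_sorted in ordered]

def text_guided_pool(text_vec, view_feats, rank_weights, tau=0.1):
    # Assumption: relevance of a view is its dot product with the text vector.
    scores = [sum(t * v for t, v in zip(text_vec, feat)) for feat in view_feats]
    perm = softsort(scores, tau)
    n, dim = len(view_feats), len(view_feats[0])
    # Softly sorted features: row i approximates the i-th most relevant view.
    sorted_feats = [[sum(perm[i][j] * view_feats[j][d] for j in range(n))
                     for d in range(dim)] for i in range(n)]
    # Assumption: ranks are combined with a fixed weight per rank position.
    return [sum(rank_weights[i] * sorted_feats[i][d] for i in range(n))
            for d in range(dim)]

views = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three toy camera views
query = [2.0, 1.0]                              # toy text embedding
pooled = text_guided_pool(query, views, rank_weights=[1.0, 0.0, 0.0], tau=0.01)
```

With a low temperature and all weight on the first rank, `pooled` is close to the single most query-relevant view; spreading `rank_weights` across ranks blends several views in relevance order.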

[124] Few-Step Diffusion via Score identity Distillation

Mingyuan Zhou,Yi Gu,Zhendong Wang

Main category: cs.CV

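TL;DR: This paper proposes SiD, a data-free one-step distillation framework for accelerating high-resolution text-to-image diffusion models, and uses theoretical analysis and new guidance strategies to address the trade-off between text-image alignment and generation diversity.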

Details

Motivation: Existing methods for distilling high-resolution T2I diffusion models rely on real or teacher-synthesized images, and their use of CFG creates a trade-off between text-image alignment and diversity. This paper aims to address both problems. Method: Proposes the Score identity Distillation (SiD) framework, optimizes it for few-step generation with supporting theoretical analysis, and introduces a Diffusion GAN adversarial loss and two new guidance strategies (Zero-CFG and Anti-CFG). Result: Achieves state-of-the-art performance on SD1.5 and SDXL, supports few-step generation, and remains robust when no real images are available. Conclusion: The SiD framework accelerates T2I diffusion models while effectively resolving the alignment-diversity trade-off, offering efficiency and flexibility. Abstract: Diffusion distillation has emerged as a promising strategy for accelerating text-to-image (T2I) diffusion models by distilling a pretrained score network into a one- or few-step generator. While existing methods have made notable progress, they often rely on real or teacher-synthesized images to perform well when distilling high-resolution T2I diffusion models such as Stable Diffusion XL (SDXL), and their use of classifier-free guidance (CFG) introduces a persistent trade-off between text-image alignment and generation diversity. We address these challenges by optimizing Score identity Distillation (SiD) -- a data-free, one-step distillation framework -- for few-step generation. Backed by theoretical analysis that justifies matching a uniform mixture of outputs from all generation steps to the data distribution, our few-step distillation algorithm avoids step-specific networks and integrates seamlessly into existing pipelines, achieving state-of-the-art performance on SDXL at 1024x1024 resolution. To mitigate the alignment-diversity trade-off when real text-image pairs are available, we introduce a Diffusion GAN-based adversarial loss applied to the uniform mixture and propose two new guidance strategies: Zero-CFG, which disables CFG in the teacher and removes text conditioning in the fake score network, and Anti-CFG, which applies negative CFG in the fake score network. This flexible setup improves diversity without sacrificing alignment. Comprehensive experiments on SD1.5 and SDXL demonstrate state-of-the-art performance in both one-step and few-step generation settings, along with robustness to the absence of real images.
Our efficient PyTorch implementation, along with the resulting one- and few-step distilled generators, will be released publicly as a separate branch at https://github.com/mingyuanzhou/SiD-LSG.

[125] CURE: Concept Unlearning via Orthogonal Representation Editing in Diffusion Models

Shristi Das Biswas,Arani Roy,Kaushik Roy

Main category: cs.CV

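TL;DR: CURE is a training-free framework that efficiently suppresses undesired concepts by operating directly in the weight space of pre-trained diffusion models.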

Details

Motivation: Existing safety interventions suffer from incomplete concept removal, susceptibility to circumvention, computational inefficiency, or damage to unrelated capabilities. Method: Uses the Spectral Eraser module, which identifies and isolates features unique to the undesired concept via SVD and updates the model in a single step. Result: CURE removes target concepts (such as artistic styles, objects, identities, or explicit content) efficiently and thoroughly, with minor damage to the original generation ability and strong robustness to attacks. Conclusion: CURE provides a fast, interpretable, and efficient approach to concept unlearning that outperforms prior techniques. Abstract: As Text-to-Image models continue to evolve, so does the risk of generating unsafe, copyrighted, or privacy-violating content. Existing safety interventions - ranging from training data curation and model fine-tuning to inference-time filtering and guidance - often suffer from incomplete concept removal, susceptibility to jail-breaking, computational inefficiency, or collateral damage to unrelated capabilities. In this paper, we introduce CURE, a training-free concept unlearning framework that operates directly in the weight space of pre-trained diffusion models, enabling fast, interpretable, and highly specific suppression of undesired concepts. At the core of our method is the Spectral Eraser, a closed-form, orthogonal projection module that identifies discriminative subspaces using Singular Value Decomposition over token embeddings associated with the concepts to forget and retain. Intuitively, the Spectral Eraser identifies and isolates features unique to the undesired concept while preserving safe attributes. This operator is then applied in a single step update to yield an edited model in which the target concept is effectively unlearned - without retraining, supervision, or iterative optimization. To balance the trade-off between filtering toxicity and preserving unrelated concepts, we further introduce an Expansion Mechanism for spectral regularization which selectively modulates singular vectors based on their relative significance to control the strength of forgetting. All the processes above are in closed-form, guaranteeing extremely efficient erasure in only 2 seconds.
Benchmarking against prior approaches, CURE achieves a more efficient and thorough removal for targeted artistic styles, objects, identities, or explicit content, with minor damage to original generation ability and demonstrates enhanced robustness against red-teaming.
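
The idea of erasing only the directions unique to a concept while leaving retained concepts untouched can be illustrated with plain orthogonal projection. The sketch below is a simplified stand-in, not the Spectral Eraser: it uses Gram-Schmidt instead of SVD, applies the projector to embedding vectors rather than model weights, and omits the Expansion Mechanism. All function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_out(vec, basis):
    # Remove the components of vec that lie in span(basis).
    # basis must be orthonormal.
    out = list(vec)
    for b in basis:
        c = dot(out, b)
        out = [o - c * bi for o, bi in zip(out, b)]
    return out

def orthonormal_basis(vectors, eps=1e-10):
    # Gram-Schmidt; near-zero residuals (linearly dependent inputs) are dropped.
    basis = []
    for v in vectors:
        w = project_out(v, basis)
        norm = dot(w, w) ** 0.5
        if norm > eps:
            basis.append([wi / norm for wi in w])
    return basis

def build_eraser(forget, retain):
    # Keep only the part of each "forget" embedding that is orthogonal to
    # the "retain" subspace, so shared (safe) directions are preserved.
    retain_basis = orthonormal_basis(retain)
    unique = [project_out(f, retain_basis) for f in forget]
    return orthonormal_basis(unique)

retain = [[1.0, 0.0, 0.0]]      # toy embedding of a concept to keep
forget = [[1.0, 1.0, 0.0]]      # toy embedding of a concept to erase
erase_basis = build_eraser(forget, retain)

kept = project_out([1.0, 0.0, 0.0], erase_basis)    # retained concept
erased = project_out([1.0, 1.0, 0.0], erase_basis)  # forgotten concept
```

Because the erase subspace is orthogonal to the retain subspace by construction, retained embeddings pass through unchanged, while the forgotten embedding loses only the component that distinguished it.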

[126] Mamba-Adaptor: State Space Model Adaptor for Visual Recognition

Fei Xie,Jiahao Nie,Yujin Tang,Wenkang Zhang,Hongshen Zhao

Main category: cs.CV

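TL;DR: Mamba-Adaptor addresses the performance problems of Mamba models on vision tasks with two modules (Adaptor-T and Adaptor-S), improving global context access, long-range memory, and spatial modeling. Its effectiveness is validated on multiple tasks.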

Details

Motivation: Mamba models underperform on vision tasks, mainly because they cannot access global context, suffer from long-range forgetting, and have weak spatial modeling. Method: Proposes Mamba-Adaptor, consisting of Adaptor-T (to ease long-range forgetting) and Adaptor-S (to strengthen spatial modeling), and explores three ways of applying it. Result: Achieves state-of-the-art performance on benchmarks such as ImageNet and COCO. Conclusion: Mamba-Adaptor effectively addresses the limitations of Mamba models and markedly improves performance on vision tasks. Abstract: Recent State Space Models (SSMs), especially Mamba, have demonstrated impressive performance in visual modeling and possess superior model efficiency. However, the application of Mamba to visual tasks suffers inferior performance due to three main constraints existing in the sequential model: 1) Causal computing is incapable of accessing global context; 2) Long-range forgetting when computing the current hidden states; 3) Weak spatial structural modeling due to the transformed sequential input. To address these issues, we investigate a simple yet powerful vision task Adaptor for Mamba models, which consists of two functional modules: Adaptor-T and Adaptor-S. When solving the hidden states for SSM, we apply a lightweight prediction module Adaptor-T to select a set of learnable locations as memory augmentations to ease long-range forgetting issues. Moreover, we leverage Adaptor-S, composed of multi-scale dilated convolutional kernels, to enhance the spatial modeling and introduce the image inductive bias into the feature output. Both modules can enlarge the context modeling in causal computing, as the output is enhanced by the inaccessible features. We explore three usages of Mamba-Adaptor: a general visual backbone for various vision tasks; a booster module to raise the performance of pretrained backbones; and a highly efficient fine-tuning module that adapts the base model for transfer learning tasks. Extensive experiments verify the effectiveness of Mamba-Adaptor in three settings. Notably, our Mamba-Adaptor achieves state-of-the-art performance on the ImageNet and COCO benchmarks.

Luyao Lei,Shuo Xu,Yifan Bai,Xing Wei

Main category: cs.CV

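TL;DR: Proposes TACOcc, an adaptive multi-modal fusion framework that uses a bidirectional symmetric retrieval mechanism and volume rendering supervision to address geometry-semantics mismatch and surface detail loss in 3D semantic occupancy prediction.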

Details

Motivation: The performance of multi-modal 3D occupancy prediction is limited mainly by geometry-semantics mismatch caused by fixed fusion strategies and by surface detail loss caused by sparse annotations. Method: Proposes a target-scale adaptive, bidirectional symmetric retrieval mechanism to improve cross-modal feature alignment, and introduces a volume rendering pipeline based on 3D Gaussian Splatting to enhance surface detail reconstruction. Result: Effectiveness is validated on the nuScenes and SemanticKITTI benchmarks. Conclusion: Through adaptive fusion and volume rendering supervision, the TACOcc framework markedly improves 3D semantic occupancy prediction. Abstract: The performance of multi-modal 3D occupancy prediction is limited by ineffective fusion, mainly due to geometry-semantics mismatch from fixed fusion strategies and surface detail loss caused by sparse, noisy annotations. The mismatch stems from the heterogeneous scale and distribution of point cloud and image features, leading to biased matching under fixed neighborhood fusion. To address this, we propose a target-scale adaptive, bidirectional symmetric retrieval mechanism. It expands the neighborhood for large targets to enhance context awareness and shrinks it for small ones to improve efficiency and suppress noise, enabling accurate cross-modal feature alignment. This mechanism explicitly establishes spatial correspondences and improves fusion accuracy. For surface detail loss, sparse labels provide limited supervision, resulting in poor predictions for small objects. We introduce an improved volume rendering pipeline based on 3D Gaussian Splatting, which takes fused features as input to render images, applies photometric consistency supervision, and jointly optimizes 2D-3D consistency. This enhances surface detail reconstruction while suppressing noise propagation. In summary, we propose TACOcc, an adaptive multi-modal fusion framework for 3D semantic occupancy prediction, enhanced by volume rendering supervision. Experiments on the nuScenes and SemanticKITTI benchmarks validate its effectiveness.

[128] Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation

Tianming Liang,Haichao Jiang,Yuting Yang,Chaolei Tan,Shuai Li,Wei-Shi Zheng,Jian-Fang Hu

Main category: cs.CV

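TL;DR: Long-RVOS is a new large-scale benchmark for long-term referring video object segmentation (RVOS), containing 2,000+ videos with an average duration of over 60 seconds and evaluating static attributes, motion patterns, and spatiotemporal relationships. Existing methods perform poorly on long videos; the proposed ReferMo method improves performance significantly by integrating motion information with a local-to-global architecture.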

Details

Motivation: Existing RVOS datasets focus on short video clips and do not cover practical scenarios such as object occlusion and disappearance-reappearance in long videos, so Long-RVOS is needed to move research in a more practical direction. Method: Introduces the Long-RVOS dataset, with long videos and diverse object dynamics, and proposes ReferMo, which integrates motion information to expand the temporal receptive field and uses a local-to-global architecture to capture short- and long-term dependencies. Result: Existing methods perform poorly on long videos, and ReferMo significantly outperforms them in long-term scenarios. Conclusion: Long-RVOS and ReferMo provide a more realistic long-video benchmark and baseline for future RVOS research. Abstract: Referring video object segmentation (RVOS) aims to identify, track and segment the objects in a video based on language descriptions, which has received great attention in recent years. However, existing datasets remain focused on short video clips within several seconds, with salient objects visible in most frames. To advance the task towards more practical scenarios, we introduce Long-RVOS, a large-scale benchmark for long-term referring video object segmentation. Long-RVOS contains 2,000+ videos of an average duration exceeding 60 seconds, covering a variety of objects that undergo occlusion, disappearance-reappearance and shot changing. The objects are manually annotated with three different types of descriptions to individually evaluate the understanding of static attributes, motion patterns and spatiotemporal relationships. Moreover, unlike previous benchmarks that rely solely on the per-frame spatial evaluation, we introduce two new metrics to assess the temporal and spatiotemporal consistency. We benchmark 6 state-of-the-art methods on Long-RVOS. The results show that current approaches struggle severely with the long-video challenges. To address this, we further propose ReferMo, a promising baseline method that integrates motion information to expand the temporal receptive field, and employs a local-to-global architecture to capture both short-term dynamics and long-term dependencies. Despite its simplicity, ReferMo achieves significant improvements over current methods in long-term scenarios. We hope that Long-RVOS and our baseline can drive future RVOS research towards tackling more realistic and long-form videos.

[129] SpatialLLM: From Multi-modality Data to Urban Spatial Intelligence

Jiabin Chen,Haiping Wang,Jinpeng Li,Yuan Liu,Zhen Dong,Bisheng Yang

Main category: cs.CV

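TL;DR: SpatialLLM is a novel language-model approach that directly handles spatial intelligence tasks in complex urban scenes without training or expert intervention.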

Details

Motivation: Conventional methods require geographic analysis tools or domain experts, whereas SpatialLLM aims to perform zero-shot spatial intelligence tasks with pre-trained LLMs. Method: Constructs detailed, structured scene descriptions and uses them to prompt pre-trained LLMs directly for scene analysis. Result: Experiments show that pre-trained LLMs can accurately perceive spatial distribution information and perform advanced spatial intelligence tasks zero-shot. Conclusion: SpatialLLM offers a new perspective for urban intelligent analysis; multi-field knowledge, context length, and reasoning ability are the key factors. Abstract: We propose SpatialLLM, a novel approach advancing spatial intelligence tasks in complex urban scenes. Unlike previous methods requiring geographic analysis tools or domain expertise, SpatialLLM is a unified language model directly addressing various spatial intelligence tasks without any training, fine-tuning, or expert intervention. The core of SpatialLLM lies in constructing detailed and structured scene descriptions from raw spatial data to prompt pre-trained LLMs for scene-based analysis. Extensive experiments show that, with our designs, pretrained LLMs can accurately perceive spatial distribution information and enable zero-shot execution of advanced spatial intelligence tasks, including urban planning, ecological analysis, traffic management, etc. We argue that multi-field knowledge, context length, and reasoning ability are key factors influencing LLM performances in urban analysis. We hope that SpatialLLM will provide a novel viable perspective for urban intelligent analysis and management. The code and dataset are available at https://github.com/WHU-USI3DV/SpatialLLM.

[130] Any-to-Any Learning in Computational Pathology via Triplet Multimodal Pretraining

Qichen Sun,Zhengrui Guo,Rui Peng,Hao Chen,Jinzhuo Wang

Main category: cs.CV

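TL;DR: ALTER is a multimodal pretraining framework that integrates WSIs, genomics, and pathology reports, addressing the challenges of data fusion, missing modalities, and task diversity in pathological diagnosis.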

Details

Motivation: Address the high computational cost of multimodal data fusion, the need for flexibility when modalities are missing, and the diversity of downstream tasks in pathological diagnosis. Method: Proposes the ALTER framework, which supports pretraining with any subset of modalities and learns cross-modal representations. Result: Performs strongly on tasks including survival prediction, cancer subtyping, gene mutation prediction, and report generation. Conclusion: ALTER provides a flexible and powerful multimodal solution for pathological diagnosis. Abstract: Recent advances in computational pathology and artificial intelligence have significantly enhanced the utilization of gigapixel whole-slide images and additional modalities (e.g., genomics) for pathological diagnosis. Although deep learning has demonstrated strong potential in pathology, several key challenges persist: (1) fusing heterogeneous data types requires sophisticated strategies beyond simple concatenation due to high computational costs; (2) common scenarios of missing modalities necessitate flexible strategies that allow the model to learn robustly in the absence of certain modalities; (3) the downstream tasks in CPath are diverse, ranging from unimodal to multimodal, necessitating a unified model capable of handling all modalities. To address these challenges, we propose ALTER, an any-to-any tri-modal pretraining framework that integrates WSIs, genomics, and pathology reports. The term "any" emphasizes ALTER's modality-adaptive design, enabling flexible pretraining with any subset of modalities, and its capacity to learn robust, cross-modal representations beyond WSI-centric approaches. We evaluate ALTER across extensive clinical tasks including survival prediction, cancer subtyping, gene mutation prediction, and report generation, achieving superior or comparable performance to state-of-the-art baselines.

[131] IA-MVS: Instance-Focused Adaptive Depth Sampling for Multi-View Stereo

Yinzhe Wang,Yiwen Xiao,Hu Wang,Yiping Xu,Yan Tian

Main category: cs.CV

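TL;DR: This paper proposes Instance-Adaptive Multi-View Stereo (IA-MVS), which improves depth estimation precision by narrowing the depth hypothesis range and refining it on each instance.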

Details

Motivation: Existing methods do not fully exploit the fact that the depth coverage of an individual instance is smaller than that of the whole scene, and deviations from the initial stage accumulate as the process advances, limiting further gains in depth estimation precision. Method: Proposes IA-MVS, which adaptively narrows and refines the depth hypothesis range per instance, adds a filtering mechanism based on intra-instance depth continuity priors, and develops a mathematical model of confidence estimation based on conditional probability. Result: Achieves state-of-the-art performance on the DTU benchmark. Conclusion: IA-MVS markedly improves the precision and robustness of depth estimation without adding any training burden. Abstract: Multi-view stereo (MVS) models based on progressive depth hypothesis narrowing have made remarkable advancements. However, existing methods have not fully utilized the potential that the depth coverage of individual instances is smaller than that of the entire scene, which restricts further improvements in depth estimation precision. Moreover, inevitable deviations in the initial stage accumulate as the process advances. In this paper, we propose Instance-Adaptive MVS (IA-MVS). It enhances the precision of depth estimation by narrowing the depth hypothesis range and conducting refinement on each instance. Additionally, a filtering mechanism based on intra-instance depth continuity priors is incorporated to boost robustness. Furthermore, recognizing that existing confidence estimation can degrade IA-MVS performance on point clouds, we have developed a detailed mathematical model for confidence estimation based on conditional probability. The proposed method can be widely applied in models based on MVSNet without imposing extra training burdens. Our method achieves state-of-the-art performance on the DTU benchmark. The source code is available at https://github.com/KevinWang73106/IA-MVS.
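
The core observation, that one instance spans a much smaller depth range than the whole scene, can be shown with a few lines of code. This is a minimal sketch under stated assumptions: it takes a coarse depth map and an instance label map as given (how IA-MVS obtains instances is not described in the abstract), and the function names and the `margin` parameter are hypothetical.

```python
def instance_depth_ranges(depth, labels, margin=0.0):
    # depth and labels are equally sized 2D lists; labels holds instance ids.
    # Returns {instance_id: (min_depth - margin, max_depth + margin)}.
    ranges = {}
    for d_row, l_row in zip(depth, labels):
        for d, inst in zip(d_row, l_row):
            lo, hi = ranges.get(inst, (d, d))
            ranges[inst] = (min(lo, d), max(hi, d))
    return {inst: (lo - margin, hi + margin) for inst, (lo, hi) in ranges.items()}

def depth_hypotheses(lo, hi, num):
    # Uniformly spaced depth hypotheses covering [lo, hi].
    if num == 1:
        return [(lo + hi) / 2]
    step = (hi - lo) / (num - 1)
    return [lo + i * step for i in range(num)]

coarse_depth = [[1.0, 1.2],
                [5.0, 5.4]]
instance_ids = [[0, 0],
                [1, 1]]
ranges = instance_depth_ranges(coarse_depth, instance_ids)
per_instance = {inst: depth_hypotheses(lo, hi, 5) for inst, (lo, hi) in ranges.items()}
```

In this toy scene the depth spans 1.0 to 5.4 overall, but each instance spans at most 0.4, so the same number of hypotheses samples each instance far more densely than a scene-wide range would.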

[132] VLC Fusion: Vision-Language Conditioned Sensor Fusion for Robust Object Detection

Aditya Taparia,Noel Ngu,Mario Leiva,Joshua Shay Kricheli,John Corcoran,Nathaniel D. Bastian,Gerardo Simari,Paulo Shakarian,Ransalu Senanayake

Main category: cs.CV

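TL;DR: VLC Fusion is a novel multimodal fusion framework that uses a vision-language model to dynamically adjust sensor weights, improving object detection performance.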

Details

Motivation: Existing fusion methods overlook subtle variations in environmental conditions and sensor inputs, and struggle to adapt modality weights accordingly. Method: Proposes Vision-Language Conditioned Fusion (VLC Fusion), which uses a vision-language model to capture high-level environmental context (such as darkness, rain, and blur) and dynamically adjusts modality weights. Result: On autonomous driving and military target detection datasets, VLC Fusion outperforms conventional fusion methods with higher detection accuracy. Conclusion: By dynamically adjusting modality weights, VLC Fusion improves the adaptability and accuracy of multimodal object detection. Abstract: Although fusing multiple sensor modalities can enhance object detection performance, existing fusion approaches often overlook subtle variations in environmental conditions and sensor inputs. As a result, they struggle to adaptively weight each modality under such variations. To address this challenge, we introduce Vision-Language Conditioned Fusion (VLC Fusion), a novel fusion framework that leverages a Vision-Language Model (VLM) to condition the fusion process on nuanced environmental cues. By capturing high-level environmental context such as darkness, rain, and camera blurring, the VLM guides the model to dynamically adjust modality weights based on the current scene. We evaluate VLC Fusion on real-world autonomous driving and military target detection datasets that include image, LIDAR, and mid-wave infrared modalities. Our experiments show that VLC Fusion consistently outperforms conventional fusion baselines, achieving improved detection accuracy in both seen and unseen scenarios.

[133] FLASH: Latent-Aware Semi-Autoregressive Speculative Decoding for Multimodal Tasks

Zihua Wang,Ruibo Li,Haozhe Du,Joey Tianyi Zhou,Yu Zhang,Xu Yang

Main category: cs.CV

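TL;DR: FLASH is a speculative decoding framework designed for multimodal models that substantially speeds up decoding through lightweight latent-aware token compression and a semi-autoregressive decoding strategy.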

Details

Motivation: Large multimodal models (LMMs) decode slowly, especially because visual input tokens are redundant and low in information density, and existing methods do not exploit the properties of visual inputs. Method: Proposes the FLASH framework, which combines latent-aware token compression with a semi-autoregressive decoding strategy to design a lightweight draft model. Result: Experiments show speed-ups of up to 2.68x on video captioning and 2.55x on visual instruction tuning tasks. Conclusion: By optimizing visual token handling and the generation strategy, FLASH markedly improves the decoding efficiency of multimodal models. Abstract: Large language and multimodal models (LLMs and LMMs) exhibit strong inference capabilities but are often limited by slow decoding speeds. This challenge is especially acute in LMMs, where visual inputs typically comprise more tokens with lower information density than text -- an issue exacerbated by recent trends toward finer-grained visual tokenizations to boost performance. Speculative decoding has been effective in accelerating LLM inference by using a smaller draft model to generate candidate tokens, which are then selectively verified by the target model, improving speed without sacrificing output quality. While this strategy has been extended to LMMs, existing methods largely overlook the unique properties of visual inputs and depend solely on text-based draft models. In this work, we propose FLASH (Fast Latent-Aware Semi-Autoregressive Heuristics), a speculative decoding framework designed specifically for LMMs, which leverages two key properties of multimodal data to design the draft model. First, to address redundancy in visual tokens, we propose a lightweight latent-aware token compression mechanism. Second, recognizing that visual objects often co-occur within a scene, we employ a semi-autoregressive decoding strategy to generate multiple tokens per forward pass. These innovations accelerate draft decoding while maintaining high acceptance rates, resulting in faster overall inference. Experiments show that FLASH significantly outperforms prior speculative decoding approaches in both unimodal and multimodal settings, achieving up to 2.68$\times$ speed-up on video captioning and 2.55$\times$ on visual instruction tuning tasks compared to the original LMM.
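
The draft-then-verify loop that speculative decoding builds on can be sketched with toy models. The code below shows only the generic greedy variant: it does not include FLASH's token compression or its semi-autoregressive draft, and `target_next`, `draft_next`, and their arithmetic rules are made-up stand-ins for real networks.

```python
def target_next(tokens):
    # Hypothetical "large" model: a deterministic toy rule over token ids.
    return (sum(tokens) * 31 + len(tokens)) % 50

def draft_next(tokens):
    # Hypothetical "small" model: agrees with the target most of the time
    # and is deliberately wrong whenever the prefix length is a multiple of 3.
    guess = target_next(tokens)
    return guess if len(tokens) % 3 else (guess + 1) % 50

def speculative_decode(prompt, target_next, draft_next, num_new, k=4):
    tokens = list(prompt)
    goal = len(prompt) + num_new
    while len(tokens) < goal:
        # 1) The draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2) The target model checks the proposals in order. A real system
        #    scores all k positions in one batched forward pass.
        accepted = []
        for tok in proposal:
            expected = target_next(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first mismatch
                break
        else:
            # Every proposal matched: the target contributes one bonus token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal]

def greedy_decode(prompt, next_fn, num_new):
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(next_fn(tokens))
    return tokens

fast = speculative_decode([1, 2, 3], target_next, draft_next, num_new=10)
slow = greedy_decode([1, 2, 3], target_next, num_new=10)
```

Every emitted token is one the target model would have produced for that prefix, so `fast` matches `slow` exactly; the speed-up in practice comes from accepting several draft tokens per target pass.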

[134] MVAR: Visual Autoregressive Modeling with Scale and Spatial Markovian Conditioning

Jinhua Zhang,Wei Long,Minghao Han,Weiyi You,Shuhang Gu

Main category: cs.CV

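TL;DR: MVAR introduces scale and spatial Markov assumptions to reduce the complexity of conditional probability modeling, substantially lowering computation and memory consumption while maintaining or improving generation performance.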

Details

Motivation: Conventional methods exhibit scale and spatial redundancy when modeling visual data priors, leading to high computational complexity and memory consumption. Method: Proposes the MVAR framework, comprising a scale-Markov trajectory and a spatial-Markov attention mechanism that reduce input dependencies and attention range, respectively. Result: MVAR reduces the complexity of attention computation from O(N^2) to O(Nk) and cuts the memory footprint by 3.0x, with performance comparable or superior to existing methods. Conclusion: By modeling visual priors efficiently, MVAR markedly improves generation efficiency and suits resource-constrained settings. Abstract: Essential to visual generation is efficient modeling of visual data priors. Conventional next-token prediction methods define the process as learning the conditional probability distribution of successive tokens. Recently, next-scale prediction methods redefine the process to learn the distribution over multi-scale representations, significantly reducing generation latency. However, these methods condition each scale on all previous scales and require each token to consider all preceding tokens, exhibiting scale and spatial redundancy. To better model the distribution by mitigating redundancy, we propose Markovian Visual AutoRegressive modeling (MVAR), a novel autoregressive framework that introduces scale and spatial Markov assumptions to reduce the complexity of conditional probability modeling. Specifically, we introduce a scale-Markov trajectory that only takes as input the features of adjacent preceding scale for next-scale prediction, enabling the adoption of a parallel training strategy that significantly reduces GPU memory consumption. Furthermore, we propose spatial-Markov attention, which restricts the attention of each token to a localized neighborhood of size k at corresponding positions on adjacent scales, rather than attending to every token across these scales, for the pursuit of reduced modeling complexity. Building on these improvements, we reduce the computational complexity of attention calculation from O(N^2) to O(Nk), enabling training with just eight NVIDIA RTX 4090 GPUs and eliminating the need for KV cache during inference.
Extensive experiments on ImageNet demonstrate that MVAR achieves comparable or superior performance with both small model trained from scratch and large fine-tuned models, while reducing the average GPU memory footprint by 3.0x.
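
The drop from O(N^2) to O(Nk) follows from counting query-key pairs: full attention scores every token against every token, while a size-k neighborhood scores each token against k others. The sketch below counts those pairs for a 1D token sequence. The 1D layout and the boundary handling (shifting the window to stay in range) are simplifying assumptions; the paper defines neighborhoods at corresponding positions on adjacent scales.

```python
def local_neighbors(i, n, k):
    # Indices of the k tokens nearest to position i, with the window
    # shifted at the borders so it always stays inside [0, n).
    k = min(k, n)
    start = min(max(i - k // 2, 0), n - k)
    return list(range(start, start + k))

def attention_pairs(n, k=None):
    # Number of query-key score computations for n tokens.
    if k is None:
        return n * n  # full attention: every token attends to every token
    return sum(len(local_neighbors(i, n, k)) for i in range(n))

full = attention_pairs(256)        # 256 * 256 = 65536 pairs
local = attention_pairs(256, k=9)  # 256 * 9   = 2304 pairs
```

For 256 tokens and k = 9 the local variant computes about 3.5% of the pairs, and the gap widens linearly as N grows with k held fixed.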

[135] LiDAR MOT-DETR: A LiDAR-based Two-Stage Transformer for 3D Multiple Object Tracking

Martha Teiko Teye,Ori Maoz,Matthias Rottmann

Main category: cs.CV

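TL;DR: Proposes a LiDAR-based two-stage DETR transformer for multi-object tracking, with a smoother stage and a tracker stage that together improve performance; it performs strongly on the nuScenes and KITTI datasets.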

Details

Motivation: LiDAR point clouds are sparse and irregular, and traditional tracking systems rely on hand-crafted features and motion models, making it hard to keep object identities consistent in crowded or fast-moving scenes. Method: Uses a two-stage DETR transformer: the smoother stage refines detections, and the tracker stage associates objects through an attention mechanism. Result: The online mode outperforms baselines on the nuScenes dataset, with an aMOTA of 0.722 and an aMOTP of 0.475; the offline mode further improves aMOTP by 3 percentage points. Conclusion: The method performs well for multi-object tracking, especially on sparse LiDAR point clouds, and the two-stage design markedly improves performance. Abstract: Multi-object tracking from LiDAR point clouds presents unique challenges due to the sparse and irregular nature of the data, compounded by the need for temporal coherence across frames. Traditional tracking systems often rely on hand-crafted features and motion models, which can struggle to maintain consistent object identities in crowded or fast-moving scenes. We present a LiDAR-based, two-stage, DETR-inspired transformer consisting of a smoother and a tracker. The smoother stage refines LiDAR object detections, from any off-the-shelf detector, across a moving temporal window. The tracker stage uses a DETR-based attention block to maintain tracks across time by associating tracked objects with the refined detections using the point cloud as context. The model is trained on the nuScenes and KITTI datasets in both online and offline (forward peeking) modes, demonstrating strong performance across metrics such as ID-switch and multiple object tracking accuracy (MOTA). The numerical results indicate that the online mode outperforms the LiDAR-only baseline and SOTA models on the nuScenes dataset, with an aMOTA of 0.722 and an aMOTP of 0.475, while the offline mode provides an additional 3 pp aMOTP.

[136] It's not you, it's me -- Global urban visual perception varies across demographics and personalities

Matias Quintana,Youlong Gu,Xiucheng Liang,Yujun Hou,Koichi Ito,Yihan Zhu,Mahmoud Abdelrahman,Filip Biljecki

Main category: cs.CV

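TL;DR: Through a global survey of streetscape visual perception, the paper shows how demographics and personality traits affect perception of urban streetscapes, and proposes new evaluation indicators.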

Details

Motivation: Current urban planning approaches often ignore demographic differences, which can introduce bias. The study aims to fill this gap by examining how demographics and personality traits affect streetscape perception. Method: Conducted a large-scale global streetscape visual perception survey, collecting data from 1,000 participants from five countries and 45 nationalities, and analyzed six traditional indicators and four new ones. Result: Demographics and personality traits significantly affect perception scores, and an existing machine learning model deviates from human perception. Conclusion: The study calls for more careful consideration of demographics and personality traits in urban planning so that it reflects the needs of local residents more accurately. Abstract: Understanding people's preferences and needs is crucial for urban planning decisions, yet current approaches often combine them from multi-cultural and multi-city populations, obscuring important demographic differences and risking amplifying biases. We conducted a large-scale urban visual perception survey of streetscapes worldwide using street view imagery, examining how demographics -- including gender, age, income, education, race and ethnicity, and, for the first time, personality traits -- shape perceptions among 1,000 participants, with balanced demographics, from five countries and 45 nationalities. This dataset, introduced as Street Perception Evaluation Considering Socioeconomics (SPECS), exhibits statistically significant differences in perception scores in six traditionally used indicators (safe, lively, wealthy, beautiful, boring, and depressing) and four new ones we propose (live nearby, walk, cycle, green) among demographics and personalities. We revealed that location-based sentiments are carried over in people's preferences when comparing urban streetscapes with other cities. Further, we compared the perception scores based on where participants and streetscapes are from. We found that an off-the-shelf machine learning model trained on an existing global perception dataset tends to overestimate positive indicators and underestimate negative ones compared to human responses, suggesting that targeted intervention should consider locals' perception. Our study aspires to rectify the myopic treatment of street perception, which rarely considers demographics or personality traits.

[137] Reasoning-OCR: Can Large Multimodal Models Solve Complex Logical Reasoning Problems from OCR Cues?

Haibin He,Maoyuan Ye,Jing Zhang,Xiantao Cai,Juhua Liu,Bo Du,Dacheng Tao

Main category: cs.CV

TL;DR: The paper introduces Reasoning-OCR, a benchmark for evaluating how well large multimodal models (LMMs) solve complex logical reasoning problems based on OCR cues.

Details

Motivation: Existing OCR-related benchmarks focus on simple tasks, while the ability of LMMs to handle complex logical reasoning problems remains underexplored. Method: The Reasoning-OCR benchmark covers six visual scenarios and 150 questions, and minimizes the influence of field-specialized knowledge. Result: The evaluation shows how proprietary and open-source LMMs perform across different reasoning challenges and highlights the urgent need to improve reasoning ability. Conclusion: Reasoning-OCR aims to inspire and facilitate future research on complex reasoning based on OCR cues. Abstract: Large Multimodal Models (LMMs) have become increasingly versatile, accompanied by impressive Optical Character Recognition (OCR) related capabilities. Existing OCR-related benchmarks emphasize evaluating LMMs' abilities of relatively simple visual question answering, visual-text parsing, etc. However, the extent to which LMMs can deal with complex logical reasoning problems based on OCR cues is relatively unexplored. To this end, we introduce the Reasoning-OCR benchmark, which challenges LMMs to solve complex reasoning problems based on the cues that can be extracted from rich visual-text. Reasoning-OCR covers six visual scenarios and encompasses 150 meticulously designed questions categorized into six reasoning challenges. Additionally, Reasoning-OCR minimizes the impact of field-specialized knowledge. Our evaluation offers some insights for proprietary and open-source LMMs in different reasoning challenges, underscoring the urgent need to improve the reasoning performance. We hope Reasoning-OCR can inspire and facilitate future research on enhancing complex reasoning ability based on OCR cues. Reasoning-OCR is publicly available at https://github.com/Hxyz-123/ReasoningOCR.

[138] Pyramid Sparse Transformer: Enhancing Multi-Scale Feature Fusion with Dynamic Token Selection

Junyi Hu,Tian Bai,Fengyi Wu,Zhengming Peng,Yi Zhang

Main category: cs.CV

TL;DR: The paper proposes the Pyramid Sparse Transformer (PST), a lightweight module for efficient feature fusion that improves detection and classification performance.

Details

Motivation: Existing attention-based feature fusion methods are computationally complex and hard to implement, which limits their efficiency in resource-constrained environments. Method: PST reduces computation through coarse-to-fine token selection and shared attention parameters while preserving spatial detail. Result: On models such as YOLOv11 and ResNet, PST improves mAP and ImageNet accuracy with little latency impact. Conclusion: PST is a simple, hardware-friendly enhancement module for detection and classification tasks. Abstract: Feature fusion is critical for high-performance vision models but often incurs prohibitive complexity. Prevailing attention-based fusion methods often involve significant computational complexity and implementation challenges, limiting their efficiency in resource-constrained environments. To address these issues, we introduce the Pyramid Sparse Transformer (PST), a lightweight, plug-and-play module that integrates coarse-to-fine token selection and shared attention parameters to reduce computation while preserving spatial detail. PST can be trained using only coarse attention and seamlessly activated at inference for further accuracy gains without retraining. When added to state-of-the-art real-time detection models, such as YOLOv11-N/S/M, PST yields mAP improvements of 0.9%, 0.5%, and 0.4% on MS COCO with minimal latency impact. Likewise, embedding PST into ResNet-18/50/101 backbones boosts ImageNet top-1 accuracy by 6.5%, 1.7%, and 1.0%, respectively. These results demonstrate PST's effectiveness as a simple, hardware-friendly enhancement for both detection and classification tasks.

[139] Enhancing Transformers Through Conditioned Embedded Tokens

Hemanth Saratchandran,Simon Lucey

Main category: cs.CV

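TL;DR: The paper proposes a method that improves the conditioning of the attention mechanism in Transformers by modifying the embedded tokens, leading to more stable and efficient training.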

Details

Motivation: The attention mechanism in Transformers suffers from inherent ill-conditioning, which hampers gradient-based optimization and training efficiency. Method: The authors develop a theoretical framework relating the conditioning of the attention block to that of the embedded tokens, and propose conditioned embedded tokens. Result: The method significantly mitigates ill-conditioning, improves training stability and efficiency, and is validated on a range of tasks. Conclusion: Conditioned embedded tokens are broadly applicable and effectively improve Transformer training. Abstract: Transformers have transformed modern machine learning, driving breakthroughs in computer vision, natural language processing, and robotics. At the core of their success lies the attention mechanism, which enables the modeling of global dependencies among input tokens. However, we reveal that the attention block in transformers suffers from inherent ill-conditioning, which hampers gradient-based optimization and leads to inefficient training. To address this, we develop a theoretical framework that establishes a direct relationship between the conditioning of the attention block and that of the embedded tokenized data. Building on this insight, we introduce conditioned embedded tokens, a method that systematically modifies the embedded tokens to improve the conditioning of the attention mechanism. Our analysis demonstrates that this approach significantly mitigates ill-conditioning, leading to more stable and efficient training. We validate our methodology across various transformer architectures, achieving consistent improvements in image classification, object detection, instance segmentation, and natural language processing, highlighting its broad applicability and effectiveness.

[140] Informed Mixing -- Improving Open Set Recognition via Attribution-based Augmentation

Jiawen Xu,Odej Kao,Margret Keuper

Main category: cs.CV

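TL;DR: GradMix is a data augmentation method that uses gradient-based attribution maps to dynamically mask out already learned concepts, encouraging the model to learn more diverse features and improving open set recognition.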

Details

Motivation: Open set recognition (OSR) requires detecting novel classes during model inference, but the available data may not provide sufficiently discriminative features. Method: The authors propose GradMix, which uses gradient-based attribution maps during training to dynamically mask out already learned concepts, encouraging the model to learn a more complete set of representative features. Result: The method often outperforms existing approaches on open set recognition, closed-set classification, and out-of-distribution detection, and improves model robustness and self-supervised learning performance. Conclusion: GradMix improves model generalization and applies to a variety of tasks. Abstract: Open set recognition (OSR) is devised to address the problem of detecting novel classes during model inference. Even in recent vision models, this remains an open issue which is receiving increasing attention. Thereby, a crucial challenge is to learn features that are relevant for unseen categories from given data, for which these features might not be discriminative. To facilitate this process and "optimize to learn" more diverse features, we propose GradMix, a data augmentation method that dynamically leverages gradient-based attribution maps of the model during training to mask out already learned concepts. Thus GradMix encourages the model to learn a more complete set of representative features from the same data source. Extensive experiments on open set recognition, closed-set classification, and out-of-distribution detection reveal that our method can often outperform the state-of-the-art. GradMix can further increase model robustness to corruptions as well as downstream classification performance for self-supervised learning, indicating its benefit for model generalization.

[141] Rethinking Features-Fused-Pyramid-Neck for Object Detection

Hulin Li

Main category: cs.CV

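TL;DR: The paper proposes an independent hierarchy pyramid (IHP) architecture to address feature misalignment between levels of the feature pyramid in multi-scale detection, introduces soft nearest neighbor interpolation (SNI) and an adaptive feature selection method for downsampling (ESD), and reports state-of-the-art results for real-time detection.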

Details

Motivation: In multi-scale detection, forcing point-to-point fusion between levels of the feature pyramid causes feature misalignment and hurts detection quality. Method: The authors design the IHP architecture and introduce SNI and ESD, combined with the GSConvE technique, to improve feature alignment and retain spatial features. Result: The approach achieves state-of-the-art detection results on Pascal VOC and MS COCO. Conclusion: The proposed method addresses feature misalignment and improves the accuracy and efficiency of real-time detection. Abstract: Multi-head detectors typically employ a features-fused-pyramid-neck for multi-scale detection and are widely adopted in the industry. However, this approach faces feature misalignment when representations from different hierarchical levels of the feature pyramid are forcibly fused point-to-point. To address this issue, we designed an independent hierarchy pyramid (IHP) architecture to evaluate the effectiveness of the features-unfused-pyramid-neck for multi-head detectors. Subsequently, we introduced soft nearest neighbor interpolation (SNI) with a weight downscaling factor to mitigate the impact of feature fusion at different hierarchies while preserving key textures. Furthermore, we present a features adaptive selection method for down sampling in extended spatial windows (ESD) to retain spatial features and enhance lightweight convolutional techniques (GSConvE). These advancements culminate in our secondary features alignment solution (SA) for real-time detection, achieving state-of-the-art results on Pascal VOC and MS COCO. Code will be released at https://github.com/AlanLi1997/rethinking-fpn. This paper has been accepted by ECCV2024 and published on Springer Nature.

[142] Mitigating Hallucination in VideoLLMs via Temporal-Aware Activation Engineering

Jianfeng Cai,Wengang Zhou,Zongmeng Zhang,Jiale Hong,Nianji Zhan,Houqiang Li

Main category: cs.CV

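TL;DR: This paper studies the effectiveness of activation engineering for mitigating hallucination in video multimodal large language models (VideoLLMs) and proposes a framework based on temporal variation characteristics that substantially reduces hallucination.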

Details

Motivation: Although multimodal large language models have made remarkable progress in video understanding, hallucination (plausible but incorrect model output) remains under-addressed. Activation engineering has proven effective for LLMs and ImageLLMs, but its use in VideoLLMs has not been explored. Method: By analyzing the key factors behind activation engineering, the authors find that a model's sensitivity to hallucination is tied to temporal variation. Based on this, they propose a temporal-aware activation engineering framework that adaptively identifies and manipulates hallucination-sensitive modules. Result: Experiments show that the method markedly reduces hallucination in VideoLLMs across multiple models and benchmarks. Conclusion: The temporal-aware activation engineering framework effectively reduces hallucination in VideoLLMs, supporting the robustness of the findings. Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress in video understanding. However, hallucination, where the model generates plausible yet incorrect outputs, persists as a significant and under-addressed challenge in the video domain. Among existing solutions, activation engineering has proven successful in mitigating hallucinations in LLMs and ImageLLMs, yet its applicability to VideoLLMs remains largely unexplored. In this work, we are the first to systematically investigate the effectiveness and underlying mechanisms of activation engineering for mitigating hallucinations in VideoLLMs. We initially conduct an investigation of the key factors affecting the performance of activation engineering and find that a model's sensitivity to hallucination depends on temporal variation rather than task type. Moreover, selecting appropriate internal modules and dataset for activation engineering is critical for reducing hallucination. Guided by these findings, we propose a temporal-aware activation engineering framework for VideoLLMs, which adaptively identifies and manipulates hallucination-sensitive modules based on the temporal variation characteristic, substantially mitigating hallucinations without additional LLM fine-tuning. Experiments across multiple models and benchmarks demonstrate that our method markedly reduces hallucination in VideoLLMs, thereby validating the robustness of our findings.

[143] A Study on the Refining Handwritten Font by Mixing Font Styles

Avinash Kumar,Kyeolhee Kang,Ammar ul Hassan,Jaeyoung Choi

Main category: cs.CV

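TL;DR: FontFusionGAN (FFGAN) uses a generative adversarial network (GAN) to combine handwritten and printed fonts, generating fonts that are both legible and visually appealing.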

Details

Motivation: Handwritten fonts are distinctively expressive but are often hard to read because the handwriting is unclear or inconsistent. FFGAN aims to improve the readability of handwritten fonts while preserving their aesthetic character. Method: A GAN is trained on a dataset of handwritten and printed fonts to generate fonts that combine the strengths of both. Result: Experiments show that FFGAN significantly improves the readability of handwritten fonts while keeping their unique style. Conclusion: Beyond languages with complex character sets, FFGAN can be applied to other text-image tasks such as font attribute control and multilingual font style transfer. Abstract: Handwritten fonts have a distinct expressive character, but they are often difficult to read due to unclear or inconsistent handwriting. FontFusionGAN (FFGAN) is a novel method for improving handwritten fonts by combining them with printed fonts. Our method uses a generative adversarial network (GAN) to generate fonts that mix the desirable features of handwritten and printed fonts. By training the GAN on a dataset of handwritten and printed fonts, it can generate legible and visually appealing font images. We apply our method to a dataset of handwritten fonts and demonstrate that it significantly enhances the readability of the original fonts while preserving their unique aesthetic. Our method has the potential to improve the readability of handwritten fonts, which would be helpful for a variety of applications including document creation, letter writing, and assisting individuals with reading and writing difficulties. In addition to addressing the difficulties of font creation for languages with complex character sets, our method is applicable to other text-image-related tasks, such as font attribute control and multilingual font style transfer.

[144] Accelerate TarFlow Sampling with GS-Jacobi Iteration

Ben Liu,Zhen Qin

Main category: cs.CV

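TL;DR: The paper accelerates TarFlow sampling with GS-Jacobi iteration, using the CRM and IGM metrics to distinguish blocks by importance, which speeds up sampling substantially without degrading generation quality.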

Details

Motivation: TarFlow sampling is slow because its causal form of attention requires sequential computation. The paper aims to improve sampling efficiency through a series of optimization strategies. Method: The authors propose the CRM and IGM metrics and combine them with GS-Jacobi iteration, distinguishing blocks by convergence behavior and sensitivity to initial values to speed up sampling. Result: In experiments on four TarFlow models, sampling is 2.51x to 5.32x faster with no degradation in FID scores or sample quality. Conclusion: GS-Jacobi effectively accelerates TarFlow sampling and offers a new direction for efficient image generation. Abstract: Image generation models have achieved widespread applications. For instance, the TarFlow model combines the transformer architecture with Normalizing Flow models, achieving state-of-the-art results on multiple benchmarks. However, due to the causal form of attention requiring sequential computation, TarFlow's sampling process is extremely slow. In this paper, we demonstrate that through a series of optimization strategies, TarFlow sampling can be greatly accelerated by using the Gauss-Seidel-Jacobi (abbreviated as GS-Jacobi) iteration method. Specifically, we find that blocks in the TarFlow model have varying importance: a small number of blocks play a major role in image generation tasks, while other blocks contribute relatively little; some blocks are sensitive to initial values and prone to numerical overflow, while others are relatively robust. Based on these two characteristics, we propose the Convergence Ranking Metric (CRM) and the Initial Guessing Metric (IGM): CRM is used to identify whether a TarFlow block is "simple" (converges in few iterations) or "tough" (requires more iterations); IGM is used to evaluate whether the initial value of the iteration is good. Experiments on four TarFlow models demonstrate that GS-Jacobi sampling can significantly enhance sampling efficiency while maintaining the quality of generated images (measured by FID), achieving speed-ups of 4.53x in Img128cond, 5.32x in AFHQ, 2.96x in Img64uncond, and 2.51x in Img64cond without degrading FID scores or sample quality. Code and checkpoints are accessible on https://github.com/encoreus/GS-Jacobi_for_TarFlow
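
Why a fixed-point iteration can replace sequential sampling in a causal model can be illustrated on a toy example. The sketch below is not the paper's implementation: it uses plain Jacobi iteration on a hypothetical scalar causal map (`shift` is an invented stand-in for a TarFlow block), and it does not reproduce the Gauss-Seidel grouping, CRM, or IGM. It only shows that updating all positions in parallel reaches the same result as the token-by-token inverse, in at most as many iterations as there are tokens.

```python
import math


def shift(prefix):
    """Hypothetical causal conditioner: depends only on earlier tokens."""
    return 0.5 * math.tanh(sum(prefix))


def forward(x):
    """Toy autoregressive transform: z_i = x_i - shift(x_0 .. x_{i-1})."""
    return [x[i] - shift(x[:i]) for i in range(len(x))]


def sequential_inverse(z):
    """Baseline sampler: recover x one token at a time."""
    x = []
    for i in range(len(z)):
        x.append(z[i] + shift(x))
    return x


def jacobi_inverse(z, tol=1e-10):
    """Jacobi fixed-point iteration: update every position from the previous iterate.

    Because the map is causal, position i is exact after i + 1 iterations,
    so the loop needs at most len(z) iterations and often stops earlier.
    Returns the recovered sequence and the number of iterations used.
    """
    n = len(z)
    x = list(z)  # initial guess; the quality of this guess affects convergence speed
    for k in range(n):
        new_x = [z[i] + shift(x[:i]) for i in range(n)]
        delta = max(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if delta < tol:
            return x, k + 1
    return x, n
```

In a real model each Jacobi update is one parallel pass over all tokens, so the speed-up depends on how few iterations a block needs, which is what the abstract's distinction between "simple" and "tough" blocks is about.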

[145] The Way Up: A Dataset for Hold Usage Detection in Sport Climbing

Anna Maschek,David C. Schedl

Main category: cs.CV

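TL;DR: The paper presents what is, to the authors' knowledge, the first dataset of climbing videos annotated with hold locations, usage order, and time of use, and explores keypoint-based 2D pose-estimation models for sport climbing.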

Details

Motivation: No dataset with detailed annotations of climbing hold usage currently exists, which limits related applications. Method: Hold usage is detected with 2D pose-estimation models by analyzing the overlap between joint keypoints and holds. Result: Several state-of-the-art models are evaluated for accuracy on the dataset, and climbing-specific challenges are identified. Conclusion: The dataset and results lay a foundation for research on climbing pose estimation and support future AI-assisted climbing systems. Abstract: Detecting an athlete's position on a route and identifying hold usage are crucial in various climbing-related applications. However, no climbing dataset with detailed hold usage annotations exists to our knowledge. To address this issue, we introduce a dataset of 22 annotated climbing videos, providing ground-truth labels for hold locations, usage order, and time of use. Furthermore, we explore the application of keypoint-based 2D pose-estimation models for detecting hold usage in sport climbing. We determine usage by analyzing the key points of certain joints and the corresponding overlap with climbing holds. We evaluate multiple state-of-the-art models and analyze their accuracy on our dataset, identifying and highlighting climbing-specific challenges. Our dataset and results highlight key challenges in climbing-specific pose estimation and establish a foundation for future research toward AI-assisted systems for sports climbing.
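
The keypoint-overlap rule described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: holds are assumed to be axis-aligned boxes `(x_min, y_min, x_max, y_max)`, keypoints are assumed to be `(x, y, confidence)` tuples keyed by joint name, and the joint list and the 0.5 confidence threshold are hypothetical choices.

```python
def point_in_box(x, y, box):
    """Return True if point (x, y) lies inside the axis-aligned box."""
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def holds_in_use(keypoints, holds,
                 joints=("left_wrist", "right_wrist", "left_ankle", "right_ankle"),
                 min_conf=0.5):
    """Return the ids of holds overlapped by a confident keypoint of a tracked joint."""
    used = set()
    for joint in joints:
        if joint not in keypoints:
            continue
        x, y, conf = keypoints[joint]
        if conf < min_conf:
            continue  # ignore low-confidence detections
        for hold_id, box in holds.items():
            if point_in_box(x, y, box):
                used.add(hold_id)
    return used


def usage_order(frames, holds, **kwargs):
    """Return (hold_id, first_frame_index) pairs sorted by first time of use."""
    first_use = {}
    for frame_idx, keypoints in enumerate(frames):
        for hold_id in sorted(holds_in_use(keypoints, holds, **kwargs)):
            first_use.setdefault(hold_id, frame_idx)
    return sorted(first_use.items(), key=lambda item: item[1])
```

A pure overlap test like this will also fire when a limb merely passes in front of a hold, which is one plausible source of the climbing-specific challenges the abstract mentions.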

[146] Towards a Universal Image Degradation Model via Content-Degradation Disentanglement

Wenbo Yang,Zhongling Wang,Zhou Wang

Main category: cs.CV

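TL;DR: Proposes the first universal degradation model, which can synthesize a broad range of complex and realistic degradations with both global and spatially varying components, without user intervention.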

Details

Motivation: Existing models can only generate one specific or a narrow set of degradations and lack generalizability and adaptability. Method: A disentangle-by-compression method separates degradation information from images, and new modules extract and incorporate spatially varying degradation components. Result: The model's accuracy and adaptability are demonstrated on film-grain simulation and blind image restoration tasks. Conclusion: The model provides a general solution for image degradation synthesis with broad application potential. Abstract: Image degradation synthesis is highly desirable in a wide variety of applications ranging from image restoration to simulating artistic effects. Existing models are designed to generate one specific or a narrow set of degradations, which often require user-provided degradation parameters. As a result, they lack the generalizability to synthesize degradations beyond their initial design or adapt to other applications. Here we propose the first universal degradation model that can synthesize a broad spectrum of complex and realistic degradations containing both homogeneous (global) and inhomogeneous (spatially varying) components. Our model automatically extracts and disentangles homogeneous and inhomogeneous degradation features, which are later used for degradation synthesis without user intervention. A disentangle-by-compression method is proposed to separate degradation information from images. Two novel modules for extracting and incorporating inhomogeneous degradations are created to model inhomogeneous components in complex degradations. We demonstrate the model's accuracy and adaptability in film-grain simulation and blind image restoration tasks. The demo video, code, and dataset of this project will be released upon publication at github.com/yangwenbo99/content-degradation-disentanglement.

[147] Robust Multimodal Segmentation with Representation Regularization and Hybrid Prototype Distillation

Jiaqi Tan,Xu Zheng,Yang Liu

Main category: cs.CV

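TL;DR: The paper proposes RobustSeg, a framework that improves the robustness of multi-modal semantic segmentation through its HPDM and RRM modules; experiments show it outperforms existing methods.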

Details

Motivation: Multi-modal semantic segmentation degrades under dynamic environments, sensor failures, and noise interference, so robustness in real-world scenarios needs to improve. Method: A two-stage framework first pre-trains a multi-modal teacher model and then trains a student model via HPDM and RRM with random modality dropout. Result: On three public benchmarks, performance improves by 2.76%, 4.56%, and 0.98%, respectively. Conclusion: RobustSeg effectively improves the robustness of multi-modal semantic segmentation and outperforms existing methods. Abstract: Multi-modal semantic segmentation (MMSS) faces significant challenges in real-world scenarios due to dynamic environments, sensor failures, and noise interference, creating a gap between theoretical models and practical performance. To address this, we propose a two-stage framework called RobustSeg, which enhances multi-modal robustness through two key components: the Hybrid Prototype Distillation Module (HPDM) and the Representation Regularization Module (RRM). In the first stage, RobustSeg pre-trains a multi-modal teacher model using complete modalities. In the second stage, a student model is trained with random modality dropout while learning from the teacher via HPDM and RRM. HPDM transforms features into compact prototypes, enabling cross-modal hybrid knowledge distillation and mitigating bias from missing modalities. RRM reduces representation discrepancies between the teacher and student by optimizing functional entropy through the log-Sobolev inequality. Extensive experiments on three public benchmarks demonstrate that RobustSeg outperforms previous state-of-the-art methods, achieving improvements of +2.76%, +4.56%, and +0.98%, respectively. Code is available at: https://github.com/RobustSeg/RobustSeg.
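
The "random modality dropout" used to train the student can be pictured with a small helper. The abstract does not specify the dropout scheme, so everything here is an assumption: the function name `drop_modalities`, the independent per-modality drop probability, the convention of marking a dropped modality with `None`, and the rule that at least one modality always survives.

```python
import random


def drop_modalities(batch, drop_prob=0.5, rng=random):
    """Randomly drop input modalities, always keeping at least one.

    `batch` maps modality names (for example "rgb", "depth") to their inputs.
    Dropped modalities are returned as None so downstream code can skip or
    zero-fill them; input values themselves are assumed not to be None.
    """
    names = list(batch)
    kept = [name for name in names if rng.random() >= drop_prob]
    if not kept:
        # Never feed the student an empty input: keep one modality at random.
        kept = [rng.choice(names)]
    return {name: (batch[name] if name in kept else None) for name in names}
```

Passing a seeded `random.Random` instance as `rng` makes the dropout pattern reproducible, which is convenient when comparing a student trained with dropout against the full-modality teacher.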

[148] ORQA: A Benchmark and Foundation Model for Holistic Operating Room Modeling

Ege Özsoy,Chantal Pellegrini,David Bani-Harouni,Kun Yuan,Matthias Keicher,Nassir Navab

Main category: cs.CV

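TL;DR: ORQA is a new operating room question answering benchmark and multimodal foundation model intended to advance operating room intelligence; it integrates diverse data signals and proposes a progressive knowledge distillation method for efficient, unified surgical modeling.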

Details

Motivation: The complexity of surgery requires both surgeons and computational systems to have holistic comprehension, and existing single-task methods lack generality. Method: The four public operating room datasets are unified into a single benchmark; a multimodal large language model fuses visual, auditory, and structured data; and progressive knowledge distillation is used to optimize the models. Result: ORQA performs strongly on the benchmark and shows zero-shot generalization, advancing multimodal surgical intelligence. Conclusion: ORQA provides a scalable, unified modeling approach for operating room intelligence and significantly advances multimodal surgical intelligence. Abstract: The real-world complexity of surgeries requires surgeons to have deep and holistic comprehension to ensure precision, safety, and effective interventions. Computational systems are required to have a similar level of comprehension within the operating room. Prior works, limited to single-task efforts like phase recognition or scene graph generation, lack scope and generalizability. In this work, we introduce ORQA, a novel OR question answering benchmark and foundational multimodal model to advance OR intelligence. By unifying all four public OR datasets into a comprehensive benchmark, we enable our approach to concurrently address a diverse range of OR challenges. The proposed multimodal large language model fuses diverse OR signals such as visual, auditory, and structured data, for a holistic modeling of the OR. Finally, we propose a novel, progressive knowledge distillation paradigm, to generate a family of models optimized for different speed and memory requirements. We show the strong performance of ORQA on our proposed benchmark, and its zero-shot generalization, paving the way for scalable, unified OR modeling and significantly advancing multimodal surgical intelligence. We will release our code and data upon acceptance.

[149] EPIC: Explanation of Pretrained Image Classification Networks via Prototype

Piotr Borycki,Magdalena Trędowicz,Szymon Janusz,Jacek Tabor,Przemysław Spurek,Arkadiusz Lewicki,Łukasz Struski

Main category: cs.CV

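TL;DR: EPIC is a new XAI method that combines the advantages of post-hoc and prototype-based explanations, providing intuitive prototype explanations without modifying the pre-trained model.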

Details

Motivation: Among existing XAI methods, post-hoc explanations are coarse, while prototype-based explanations require dedicated architectures and have limited applicability. EPIC aims to bridge this gap. Method: EPIC operates on pre-trained models without architectural modifications and generates intuitive explanations through prototypes, across a variety of datasets. Result: EPIC's explanatory ability is evaluated on datasets such as CUB-200-2011, Stanford Cars, and ImageNet. Conclusion: According to the authors, EPIC is the first post-hoc method to fully replicate the core capability of prototype-based explanations, giving XAI a flexible and easy-to-understand tool. Abstract: Explainable AI (XAI) methods generally fall into two categories. Post-hoc approaches generate explanations for pre-trained models and are compatible with various neural network architectures. These methods often use feature importance visualizations, such as saliency maps, to indicate which input regions influenced the model's prediction. Unfortunately, they typically offer a coarse understanding of the model's decision-making process. In contrast, ante-hoc (inherently explainable) methods rely on specially designed model architectures trained from scratch. A notable subclass of these methods provides explanations through prototypes, representative patches extracted from the training data. However, prototype-based approaches have limitations: they require dedicated architectures, involve specialized training procedures, and perform well only on specific datasets. In this work, we propose EPIC (Explanation of Pretrained Image Classification), a novel approach that bridges the gap between these two paradigms. Like post-hoc methods, EPIC operates on pre-trained models without architectural modifications. Simultaneously, it delivers intuitive, prototype-based explanations inspired by ante-hoc techniques. To the best of our knowledge, EPIC is the first post-hoc method capable of fully replicating the core explanatory power of inherently interpretable models. We evaluate EPIC on benchmark datasets commonly used in prototype-based explanations, such as CUB-200-2011 and Stanford Cars, alongside large-scale datasets like ImageNet, typically employed by post-hoc methods. EPIC uses prototypes to explain model decisions, providing a flexible and easy-to-understand tool for creating clear, high-quality explanations.

[150] Towards Low-Latency Event Stream-based Visual Object Tracking: A Slow-Fast Approach

Shiao Wang,Xiao Wang,Liye Jin,Bo Jiang,Lin Zhu,Lan Chen,Yonghong Tian,Bin Luo

Main category: cs.CV

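TL;DR: The paper proposes SFTrack, a slow-fast tracking paradigm that pairs a high-precision slow tracker with an efficient fast tracker to suit different resource settings, and validates it experimentally.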

Details

Motivation: Existing tracking algorithms rely on low-frame-rate RGB cameras and computationally intensive deep networks, which makes low latency hard to achieve and leads to poor results in resource-constrained environments. Event cameras offer a new direction for low-latency applications. Method: The proposed slow-fast tracking paradigm comprises a high-precision slow tracker and an efficient fast tracker. Graph-based representation learning is performed on event streams and integrated into FlashAttention-based vision backbones, and performance is further improved through knowledge distillation and fine-tuning. Result: The method's effectiveness and efficiency are validated on public benchmarks including FE240, COESOT, and EventVOT. Conclusion: SFTrack performs well across different real-world scenarios and offers a flexible, efficient solution for event camera tracking. Abstract: Existing tracking algorithms typically rely on low-frame-rate RGB cameras coupled with computationally intensive deep neural network architectures to achieve effective tracking. However, such frame-based methods inherently face challenges in achieving low-latency performance and often fail in resource-constrained environments. Visual object tracking using bio-inspired event cameras has emerged as a promising research direction in recent years, offering distinct advantages for low-latency applications. In this paper, we propose a novel Slow-Fast Tracking paradigm that flexibly adapts to different operational requirements, termed SFTrack. The proposed framework supports two complementary modes, i.e., a high-precision slow tracker for scenarios with sufficient computational resources, and an efficient fast tracker tailored for latency-aware, resource-constrained environments. Specifically, our framework first performs graph-based representation learning from high-temporal-resolution event streams, and then integrates the learned graph-structured information into two FlashAttention-based vision backbones, yielding the slow and fast trackers, respectively. The fast tracker achieves low latency through a lightweight network design and by producing multiple bounding box outputs in a single forward pass. Finally, we seamlessly combine both trackers via supervised fine-tuning and further enhance the fast tracker's performance through a knowledge distillation strategy. Extensive experiments on public benchmarks, including FE240, COESOT, and EventVOT, demonstrate the effectiveness and efficiency of our proposed method across different real-world scenarios. The source code has been released on https://github.com/Event-AHU/SlowFast_Event_Track.

[151] Dynamic Graph Induced Contour-aware Heat Conduction Network for Event-based Object Detection

Xiao Wang,Yu Jin,Lan Chen,Bo Jiang,Lin Zhu,Yonghong Tian,Jin Tang,Bin Luo

Main category: cs.CV

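TL;DR: Proposes CvHeat-DET, a new event stream-based object detection method that models contour information with a dynamic graph and exploits multi-scale features, addressing the weaknesses of existing methods in contour modeling and multi-scale feature use.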

Details

Motivation: Event-based Vision Sensors (EVS) have advantages in low light, high-speed motion capture, and low latency, but existing object detection algorithms are weak at contour modeling and at exploiting multi-scale features. Method: The proposed dynamic graph induced contour-aware heat conduction network (CvHeat-DET) uses the clear contour information in event streams to predict the thermal diffusivity coefficients of the heat conduction model, and integrates multi-scale graph features. Result: Experiments on three benchmark datasets validate the effectiveness of the model. Conclusion: CvHeat-DET performs well on event stream-based object detection, and the code will be open-sourced. Abstract: Event-based Vision Sensors (EVS) have demonstrated significant advantages over traditional RGB frame-based cameras in low-light conditions, high-speed motion capture, and low latency. Consequently, object detection based on EVS has attracted increasing attention from researchers. Current event stream object detection algorithms are typically built upon Convolutional Neural Networks (CNNs) or Transformers, which either capture limited local features using convolutional filters or incur high computational costs due to the utilization of self-attention. Recently proposed vision heat conduction backbone networks have shown a good balance between efficiency and accuracy; however, these models are not specifically designed for event stream data. They exhibit weak capability in modeling object contour information and fail to exploit the benefits of multi-scale features. To address these issues, this paper proposes a novel dynamic graph induced contour-aware heat conduction network for event stream based object detection, termed CvHeat-DET. The proposed model effectively leverages the clear contour information inherent in event streams to predict the thermal diffusivity coefficients within the heat conduction model, and integrates hierarchical structural graph features to enhance feature learning across multiple scales. Extensive experiments on three benchmark datasets for event stream-based object detection fully validated the effectiveness of the proposed model. The source code of this paper will be released on https://github.com/Event-AHU/OpenEvDET.

[152] HiERO: understanding the hierarchy of human behavior enhances reasoning on egocentric videos

Simone Alberto Peirone,Francesca Pistilli,Giuseppe Averta

Main category: cs.CV

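TL;DR: HiERO is a weakly-supervised method that enriches video segment features with hierarchical activity threads; by aligning videos with their descriptions it supports contextual, semantic, and temporal reasoning and performs strongly on multiple benchmarks.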

Details

Motivation: Human activities are complex and variable, but that variability has a hierarchical structure that can emerge naturally from unscripted videos and be used for reasoning. Method: HiERO aligns video clips with their narrated descriptions and uses a hierarchical architecture to infer contextual, semantic, and temporal relations. Result: It performs strongly on benchmarks such as EgoMCQ and EgoNLQ, and in zero-shot tasks it even outperforms fully-supervised methods (+12.5% F1 on EgoProceL). Conclusion: Knowledge of the hierarchy of human activities is valuable for multiple reasoning tasks in egocentric vision. Abstract: Human activities are particularly complex and variable, and this makes it challenging for deep learning models to reason about them. However, we note that such variability does have an underlying structure, composed of a hierarchy of patterns of related actions. We argue that such structure can emerge naturally from unscripted videos of human activities, and can be leveraged to better reason about their content. We present HiERO, a weakly-supervised method to enrich video segment features with the corresponding hierarchical activity threads. By aligning video clips with their narrated descriptions, HiERO infers contextual, semantic and temporal reasoning with a hierarchical architecture. We prove the potential of our enriched features with multiple video-text alignment benchmarks (EgoMCQ, EgoNLQ) with minimal additional training, and in zero-shot for procedure learning tasks (EgoProceL and Ego4D Goal-Step). Notably, HiERO achieves state-of-the-art performance in all the benchmarks, and for procedure learning tasks it outperforms fully-supervised methods by a large margin (+12.5% F1 on EgoProceL) in zero shot. Our results prove the relevance of using knowledge of the hierarchy of human activities for multiple reasoning tasks in egocentric vision.

[153] Uniformity First: Uniformity-aware Test-time Adaptation of Vision-language Models against Image Corruption

Kazuki Adachi,Shin'ya Yamaguchi,Tomoki Hamagami

Main category: cs.CV

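TL;DR: The paper proposes UnInfo, a uniformity-aware information-balanced test-time adaptation (TTA) method that addresses the distribution shift CLIP models face under sensor degradation.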

Details

Motivation: Pre-trained vision-language models such as CLIP perform poorly under distribution shifts such as sensor degradation, and collecting new data is costly. The paper therefore studies how TTA can adapt to the test distribution on the fly using unlabeled test data. Method: The proposed UnInfo method comprises uniformity-aware confidence maximization, information-aware loss balancing, and knowledge distillation from an EMA teacher, and targets the corruption of image embeddings in terms of uniformity. Result: Experiments show that UnInfo improves classification accuracy under sensor degradation by retaining information in terms of uniformity. Conclusion: UnInfo effectively addresses the distribution shift CLIP faces under sensor degradation and improves model robustness. Abstract: Pre-trained vision-language models such as contrastive language-image pre-training (CLIP) have demonstrated a remarkable generalizability, which has enabled a wide range of applications represented by zero-shot classification. However, vision-language models still suffer when they face datasets with large gaps from training ones, i.e., distribution shifts. We found that CLIP is especially vulnerable to sensor degradation, a type of realistic distribution shift caused by sensor conditions such as weather, light, or noise. Collecting a new dataset from a test distribution for fine-tuning is highly costly since sensor degradation occurs unexpectedly and has a range of variety. Thus, we investigate test-time adaptation (TTA) of zero-shot classification, which enables on-the-fly adaptation to the test distribution with unlabeled test data. Existing TTA methods for CLIP mainly focus on modifying image and text embeddings or predictions to address distribution shifts. Although these methods can adapt to domain shifts, such as fine-grained label spaces or different renditions in input images, they fail to adapt to distribution shifts caused by sensor degradation. We found that this is because image embeddings are "corrupted" in terms of uniformity, a measure related to the amount of information. To make models robust to sensor degradation, we propose a novel method called uniformity-aware information-balanced TTA (UnInfo). To address the corruption of image embeddings, we introduce uniformity-aware confidence maximization, information-aware loss balancing, and knowledge distillation from the exponential moving average (EMA) teacher. Through experiments, we demonstrate that our UnInfo improves accuracy under sensor degradation by retaining information in terms of uniformity.
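
The abstract does not define its uniformity measure. A commonly used definition in the contrastive learning literature is the log of the mean Gaussian potential between pairs of L2-normalized embeddings, and the sketch below assumes that definition; whether UnInfo uses exactly this form, and the temperature `t=2.0`, are assumptions. Lower (more negative) values mean the embeddings are spread more evenly over the unit sphere, while a value of 0 means they have collapsed to a single direction.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in vector))
    return [c / norm for c in vector]


def uniformity(embeddings, t=2.0):
    """Log of the mean pairwise Gaussian potential exp(-t * ||u - v||^2).

    Expects at least two non-zero embedding vectors of equal dimension.
    """
    points = [l2_normalize(v) for v in embeddings]
    total, count = 0.0, 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            total += math.exp(-t * sq_dist)
            count += 1
    return math.log(total / count)
```

Under this definition, embeddings "corrupted in terms of uniformity" would show a value closer to 0 on degraded images than on clean ones, which is the kind of signal a test-time adaptation objective could try to counteract.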

[154] LatentINDIGO: An INN-Guided Latent Diffusion Algorithm for Image Restoration

Di You,Daniel Siromani,Pier Luigi Dragotti

Main category: cs.CV

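TL;DR: The paper proposes a wavelet-inspired invertible neural network (INN) for image restoration, addressing existing methods' problems with complex degradations, latent-space stability, and computational overhead.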

Details

Motivation: Existing latent diffusion models (LDMs) for image restoration struggle with complex degradations, unstable guidance in the latent space, and high computational overhead. Method: A wavelet-inspired INN simulates degradation with its forward transform and recovers details with its inverse transform; it is integrated into a latent diffusion pipeline through two approaches, LatentINDIGO-PixelINN and LatentINDIGO-LatentINN. Result: Experiments show state-of-the-art performance on synthetic and real-world low-quality images, with adaptation to arbitrary output sizes. Conclusion: The method addresses the limitations of existing LDMs in image restoration and offers an efficient solution for complex degradation scenarios. Abstract: There is a growing interest in the use of latent diffusion models (LDMs) for image restoration (IR) tasks due to their ability to model effectively the distribution of natural images. While significant progress has been made, there are still key challenges that need to be addressed. First, many approaches depend on a predefined degradation operator, making them ill-suited for complex or unknown degradations that deviate from standard analytical models. Second, many methods struggle to provide stable guidance in the latent space. Finally, most methods convert latent representations back to the pixel domain for guidance at every sampling iteration, which significantly increases computational and memory overhead. To overcome these limitations, we introduce a wavelet-inspired invertible neural network (INN) that simulates degradations through a forward transform and reconstructs lost details via the inverse transform. We further integrate this design into a latent diffusion pipeline through two proposed approaches: LatentINDIGO-PixelINN, which operates in the pixel domain, and LatentINDIGO-LatentINN, which stays fully in the latent space to reduce complexity. Both approaches alternate between updating intermediate latent variables under the guidance of our INN and refining the INN forward model to handle unknown degradations. In addition, a regularization step preserves the proximity of latent variables to the natural image manifold. Experiments demonstrate that our algorithm achieves state-of-the-art performance on synthetic and real-world low-quality images, and can be readily adapted to arbitrary output sizes.

[155] Multiscale Adaptive Conflict-Balancing Model For Multimedia Deepfake Detection

Zihan Xiong,Xiaohua Wu,Lei Chen,Fangqi Lou

Main category: cs.CV

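TL;DR: Proposes an audio-visual joint learning method (MACB-DF) that uses contrastive learning and multimodal fusion to address modality conflict and neglect, significantly improving deepfake detection performance.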

Details

Motivation: Deepfake technology has blurred the line between authentic and forged media, and existing detection methods are limited by unbalanced learning across modalities. Method: Contrastive learning assists multi-level and cross-modal fusion, and an orthogonalization-multimodal Pareto module resolves gradient conflicts. Result: Average accuracy reaches 95.5% on mainstream datasets, with markedly better cross-dataset generalization. Conclusion: MACB-DF effectively balances and exploits multimodal information, significantly improving detection performance. Abstract: Advances in computer vision and deep learning have blurred the line between deepfakes and authentic media, undermining multimedia credibility through audio-visual forgery. Current multimodal detection methods remain limited by unbalanced learning between modalities. To tackle this issue, we propose an Audio-Visual Joint Learning Method (MACB-DF) to better mitigate modality conflicts and neglect by leveraging contrastive learning to assist in multi-level and cross-modal fusion, thereby fully balancing and exploiting information from each modality. Additionally, we designed an orthogonalization-multimodal Pareto module that preserves unimodal information while addressing gradient conflicts in audio-video encoders caused by differing optimization targets of the loss functions. Extensive experiments and ablation studies conducted on mainstream deepfake datasets demonstrate consistent performance gains of our model across key evaluation metrics, achieving an average accuracy of 95.5% across multiple datasets. Notably, our method exhibits superior cross-dataset generalization capabilities, with absolute improvements of 8.0% and 7.7% in ACC scores over the previous best-performing approach when trained on DFDC and tested on DefakeAVMiT and FakeAVCeleb datasets.
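
Illustrative sketch: the abstract does not spell out how the orthogonalization-multimodal Pareto module resolves gradient conflicts, so the code below substitutes a generic projection step (in the spirit of gradient-surgery methods) purely to show what "removing a conflicting gradient component" can mean. The function `project_out_conflict` and the toy gradients are assumptions, not the paper's module.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out_conflict(grad_a, grad_b):
    """Return grad_a with its component along grad_b removed, but only
    when the two gradients conflict (negative inner product)."""
    overlap = dot(grad_a, grad_b)
    if overlap >= 0:
        return list(grad_a)  # no conflict: leave the gradient untouched
    scale = overlap / dot(grad_b, grad_b)
    return [x - scale * y for x, y in zip(grad_a, grad_b)]


# Hypothetical gradients from an audio loss and a video loss.
audio_grad = [1.0, -1.0]
video_grad = [0.0, 1.0]
adjusted = project_out_conflict(audio_grad, video_grad)
```

After the projection, the adjusted audio gradient is orthogonal to the video gradient, so a step along it no longer increases the video loss to first order.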

[156] A Skull-Adaptive Framework for AI-Based 3D Transcranial Focused Ultrasound Simulation

Vinkle Srivastav,Juliette Puel,Jonathan Vappou,Elijah Van Houten,Paolo Cabras,Nicolas Padoy

Main category: cs.CV

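TL;DR: TFUScapes is a large-scale, high-resolution dataset of tFUS simulations; paired with the DeepTFUS deep learning model, it is used to estimate ultrasound pressure fields and addresses the wavefront distortion caused by skull heterogeneity.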

Details

Motivation: Address the ultrasound wavefront distortion caused by the heterogeneity and anisotropy of the human skull, and reduce the time cost of patient-specific planning. Method: A simulation dataset is generated with the k-Wave pseudo-spectral solver, and the DeepTFUS model (built on a U-Net architecture) combines Fourier-encoded position embeddings and MLP layers, refining pressure-field estimation through feature-wise modulation and dynamic convolutions. Result: The TFUScapes dataset is publicly released, and DeepTFUS estimates pressure fields with high fidelity, supporting research in computational acoustics, neurotechnology, and deep learning. Conclusion: TFUScapes and DeepTFUS provide a data-driven approach to tFUS research and are expected to accelerate progress in related fields. Abstract: Transcranial focused ultrasound (tFUS) is an emerging modality for non-invasive brain stimulation and therapeutic intervention, offering millimeter-scale spatial precision and the ability to target deep brain structures. However, the heterogeneous and anisotropic nature of the human skull introduces significant distortions to the propagating ultrasound wavefront, which require time-consuming patient-specific planning and corrections using numerical solvers for accurate targeting. To enable data-driven approaches in this domain, we introduce TFUScapes, the first large-scale, high-resolution dataset of tFUS simulations through anatomically realistic human skulls derived from T1-weighted MRI images. We have developed a scalable simulation engine pipeline using the k-Wave pseudo-spectral solver, where each simulation returns a steady-state pressure field generated by a focused ultrasound transducer placed at realistic scalp locations. In addition to the dataset, we present DeepTFUS, a deep learning model that estimates normalized pressure fields directly from input 3D CT volumes and transducer position. The model extends a U-Net backbone with transducer-aware conditioning, incorporating Fourier-encoded position embeddings and MLP layers to create global transducer embeddings. These embeddings are fused with U-Net encoder features via feature-wise modulation, dynamic convolutions, and cross-attention mechanisms. The model is trained using a combination of spatially weighted and gradient-sensitive loss functions, enabling it to approximate high-fidelity wavefields. The TFUScapes dataset is publicly released to accelerate research at the intersection of computational acoustics, neurotechnology, and deep learning. The project page is available at https://github.com/CAMMA-public/TFUScapes.
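
Illustrative sketch: the abstract mentions Fourier-encoded position embeddings for the transducer location. A common way to build such features is to apply sines and cosines at geometrically spaced frequencies to each coordinate. The frequency schedule, the `num_frequencies` parameter, and the function name `fourier_encode` are assumptions; DeepTFUS may use a different encoding.

```python
import math


def fourier_encode(position, num_frequencies=4):
    """Map each coordinate to [sin(2^k * pi * x), cos(2^k * pi * x)] for
    k = 0 .. num_frequencies - 1 and concatenate the results."""
    features = []
    for coord in position:
        for k in range(num_frequencies):
            freq = (2.0 ** k) * math.pi
            features.append(math.sin(freq * coord))
            features.append(math.cos(freq * coord))
    return features


# Hypothetical normalized transducer position (x, y, z).
embedding = fourier_encode((0.1, 0.2, 0.3), num_frequencies=4)
```

The output has 2 * num_frequencies entries per coordinate (24 here); in a model like the one described, a vector of this kind would typically be passed through MLP layers before being fused with the encoder features.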

[157] Anti-Inpainting: A Proactive Defense against Malicious Diffusion-based Inpainters under Unknown Conditions

Yimao Guo,Zuomin Qu,Wei Lu,Xiangyang Luo

Main category: cs.CV

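TL;DR: The paper proposes Anti-Inpainting, a proactive defense method that uses a triple mechanism to protect images against malicious tampering under unknown conditions.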

Details

Motivation: Existing proactive defenses protect images only under known conditions and cannot cope with tampering conditions crafted by malicious users, so a more general defense is needed. Method: A multi-level deep feature extractor, multi-scale semantic-preserving data augmentation, and a selection-based distribution deviation optimization strategy strengthen the protection provided by adversarial perturbations. Result: Experiments on InpaintGuardBench and CelebA-HQ show that Anti-Inpainting defends effectively against tampering under unknown conditions, is robust, and transfers across models. Conclusion: Anti-Inpainting offers an effective proactive defense for image protection against diffusion models, applicable to unknown conditions and diverse attacks. Abstract: As diffusion-based malicious image manipulation becomes increasingly prevalent, multiple proactive defense methods have been developed to safeguard images against unauthorized tampering. However, most proactive defense methods can only safeguard images against manipulation under known conditions, and fail to protect images from manipulations guided by tampering conditions crafted by malicious users. To tackle this issue, we propose Anti-Inpainting, a proactive defense method that achieves adequate protection under unknown conditions through a triple mechanism. Specifically, a multi-level deep feature extractor is presented to obtain intricate features during the diffusion denoising process to improve protective effectiveness. We design multi-scale semantic-preserving data augmentation to enhance the transferability of adversarial perturbations across unknown conditions by multi-scale transformations while preserving semantic integrity. In addition, we propose a selection-based distribution deviation optimization strategy to improve the protection of adversarial perturbation against manipulation under diverse random seeds. Extensive experiments indicate the proactive defensive performance of Anti-Inpainting against diffusion-based inpainters guided by unknown conditions in InpaintGuardBench and CelebA-HQ. At the same time, we also demonstrate the proposed approach's robustness under various image purification methods and its transferability across different versions of diffusion models.

[158] Expert-Like Reparameterization of Heterogeneous Pyramid Receptive Fields in Efficient CNNs for Fair Medical Image Classification

Xiao Wu,Xiaoqing Zhang,Zunjie Xiao,Lingxi Hu,Risa Higashita,Jiang Liu

Main category: cs.CV

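TL;DR: The paper proposes ERoHPRF, a new method that uses heterogeneous pyramid receptive fields and a multi-expert consultation mode to improve both the performance and the fairness of medical image classification.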

Details

Motivation: Existing CNN architectures have two main problems in medical image classification: 1) they struggle to capture diverse lesion characteristics efficiently; 2) their predictions are biased. Method: ERoHPRF combines heterogeneous pyramid receptive fields with an expert-like structural reparameterization technique, capturing different lesion characteristics through convolutions with multiple kernel sizes. Result: Experiments show that ERoHPRF offers a better trade-off among classification performance, fairness, and computational overhead than existing methods. Conclusion: ERoHPRF provides an efficient and fair solution for medical image classification; the code will be open-sourced. Abstract: Efficient convolutional neural network (CNN) architecture designs have attracted growing research interest. However, they usually apply a single receptive field (RF), small asymmetric RFs, or pyramid RFs to learn different feature representations, and still encounter two significant challenges in medical image classification tasks: 1) They have limitations in capturing diverse lesion characteristics efficiently, e.g., tiny, coordination, small and salient, which have unique roles on results, especially in imbalanced medical image classification. 2) The predictions generated by those CNNs are often unfair/biased, bringing a high risk when employing them in real-world medical diagnosis conditions. To tackle these issues, we develop a new concept, Expert-Like Reparameterization of Heterogeneous Pyramid Receptive Fields (ERoHPRF), to simultaneously boost medical image classification performance and fairness. This concept aims to mimic the multi-expert consultation mode by applying the well-designed heterogeneous pyramid RF bags to capture different lesion characteristics effectively via convolution operations with multiple heterogeneous kernel sizes. Additionally, ERoHPRF introduces an expert-like structural reparameterization technique to merge its parameters with the two-stage strategy, ensuring competitive computation cost and inference speed through comparisons to a single RF. To manifest the effectiveness and generalization ability of ERoHPRF, we incorporate it into mainstream efficient CNN architectures. The extensive experiments show that our method maintains a better trade-off than state-of-the-art methods in terms of medical image classification, fairness, and computation overhead. The codes of this paper will be released soon.

[159] A Generalized Label Shift Perspective for Cross-Domain Gaze Estimation

Hao-Ran Yang,Xiaohui Chen,Chuan-Xian Ren

Main category: cs.CV

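TL;DR: This paper brings a new perspective based on Generalized Label Shift (GLS) theory to cross-domain gaze estimation (CDGE), modeling the problem as label shift plus conditional shift and introducing a reweighting strategy based on a truncated Gaussian distribution, which significantly improves generalization.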

Details

Motivation: Existing CDGE methods typically extract domain-invariant features to mitigate domain shift in feature space, which GLS theory shows to be insufficient. This paper aims to improve cross-domain gaze estimation from the GLS perspective. Method: A GLS correction framework is proposed that uses an importance reweighting strategy based on a truncated Gaussian distribution to handle label shift, and a probability-aware estimate of conditional operator discrepancy is derived. Result: Experiments on standard CDGE tasks show strong cross-domain generalization and applicability across different backbone models. Conclusion: Through the GLS perspective and the reweighting strategy, the method significantly improves cross-domain gaze estimation and applies to a variety of models. Abstract: Aiming to generalize the well-trained gaze estimation model to new target domains, Cross-domain Gaze Estimation (CDGE) is developed for real-world application scenarios. Existing CDGE methods typically extract the domain-invariant features to mitigate domain shift in feature space, which is proved insufficient by Generalized Label Shift (GLS) theory. In this paper, we introduce a novel GLS perspective to CDGE and model the cross-domain problem as a label and conditional shift problem. A GLS correction framework is presented and a feasible realization is proposed, in which an importance reweighting strategy based on truncated Gaussian distribution is introduced to overcome the continuity challenges in label shift correction. To embed the reweighted source distribution to conditional invariant learning, we further derive a probability-aware estimation of conditional operator discrepancy. Extensive experiments on standard CDGE tasks with different backbone models validate the proposed method's superior cross-domain generalization capability and its applicability to various models.
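
Illustrative sketch: the abstract describes importance reweighting based on a truncated Gaussian distribution for continuous labels. One plausible reading, assumed here, is that source and target label densities are each modeled as a Gaussian truncated to the valid label range, and each source sample is weighted by the density ratio p_target(y) / p_source(y). The parameter values and function names are hypothetical, and the paper's estimator may differ.

```python
import math


def truncated_gaussian_pdf(y, mean, std, low, high):
    """Density of a Gaussian restricted to [low, high] and renormalized."""
    if y < low or y > high:
        return 0.0

    def cdf(v):
        return 0.5 * (1.0 + math.erf((v - mean) / (std * math.sqrt(2.0))))

    density = math.exp(-0.5 * ((y - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))
    return density / (cdf(high) - cdf(low))


def importance_weight(y, source, target, low, high):
    """Weight for a source sample with label y; source and target are
    (mean, std) pairs of the assumed truncated Gaussian label models."""
    p_source = truncated_gaussian_pdf(y, source[0], source[1], low, high)
    p_target = truncated_gaussian_pdf(y, target[0], target[1], low, high)
    return p_target / p_source if p_source > 0 else 0.0


# Hypothetical gaze-angle label models (degrees): the target domain is shifted.
source_model = (0.0, 5.0)
target_model = (10.0, 5.0)
w_near_target = importance_weight(10.0, source_model, target_model, -30.0, 30.0)
w_near_source = importance_weight(0.0, source_model, target_model, -30.0, 30.0)
```

Source samples whose labels are common in the target domain receive weights above 1, and samples whose labels are rare there are down-weighted; truncation keeps the densities well defined on a bounded label range.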

[160] RGB-to-Polarization Estimation: A New Task and Benchmark Study

Beibei Lin,Zifeng Yuan,Tingting Chen

Main category: cs.CV

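TL;DR: The paper introduces the new task of estimating polarization information from RGB images and establishes the first comprehensive benchmark, evaluating a range of deep learning models, exposing their strengths and weaknesses, and suggesting directions for future research.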

Details

Motivation: Polarization images carry more physical information than standard RGB images, but acquiring them is costly and complex. To bridge this gap, the study aims to infer polarization information directly from RGB images. Method: Using existing polarization datasets, a range of deep learning models (both restoration-oriented and generative architectures) are evaluated, and a performance benchmark is established through quantitative and qualitative analysis. Result: The benchmark establishes the current performance ceiling for RGB-to-polarization estimation and systematically reveals the strengths and weaknesses of different model families. Conclusion: The benchmark is a foundational resource for designing and evaluating future polarization estimation methods and points to potential research directions. Abstract: Polarization images provide rich physical information that is fundamentally absent from standard RGB images, benefiting a wide range of computer vision applications such as reflection separation and material classification. However, the acquisition of polarization images typically requires additional optical components, which increases both the cost and the complexity of the applications. To bridge this gap, we introduce a new task: RGB-to-polarization image estimation, which aims to infer polarization information directly from RGB images. In this work, we establish the first comprehensive benchmark for this task by leveraging existing polarization datasets and evaluating a diverse set of state-of-the-art deep learning models, including both restoration-oriented and generative architectures. Through extensive quantitative and qualitative analysis, our benchmark not only establishes the current performance ceiling of RGB-to-polarization estimation, but also systematically reveals the respective strengths and limitations of different model families -- such as direct reconstruction versus generative synthesis, and task-specific training versus large-scale pre-training. In addition, we provide some potential directions for future research on polarization estimation. This benchmark is intended to serve as a foundational resource to facilitate the design and evaluation of future methods for polarization estimation from standard RGB inputs.

[161] 3D Visual Illusion Depth Estimation

CHengtang Yao,Zhidan Liu,Jiaxi Zeng,Lidong Yu,Yuwei Wu,Yunde Jia

Main category: cs.CV

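TL;DR: The paper studies how 3D visual illusions affect depth estimation in machine vision systems, proposes a robust depth estimation framework, and validates its advantage experimentally.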

Details

Motivation: Explore how 3D visual illusions affect depth estimation in machine vision systems, in particular monocular and binocular depth estimation methods. Method: A large dataset (about 3k scenes and 200k images) is collected to train and evaluate SOTA depth estimation methods, and a robust framework that incorporates a vision-language model is proposed. Result: Experiments show that existing depth estimation methods are easily fooled by 3D visual illusions, while the proposed framework outperforms SOTA methods. Conclusion: 3D visual illusions significantly affect machine vision systems, and the proposed framework effectively improves the reliability of depth estimation. Abstract: 3D visual illusion is a perceptual phenomenon where a two-dimensional plane is manipulated to simulate three-dimensional spatial relationships, making a flat artwork or object look three-dimensional in the human visual system. In this paper, we reveal that the machine visual system is also seriously fooled by 3D visual illusions, including monocular and binocular depth estimation. In order to explore and analyze the impact of 3D visual illusion on depth estimation, we collect a large dataset containing almost 3k scenes and 200k images to train and evaluate SOTA monocular and binocular depth estimation methods. We also propose a robust depth estimation framework that uses common sense from a vision-language model to adaptively select reliable depth from binocular disparity and monocular depth. Experiments show that SOTA monocular, binocular, and multi-view depth estimation approaches are all fooled by various 3D visual illusions, while our method achieves SOTA performance.

Zhaoyi Wang,Shengyu Huang,Jemil Avers Butt,Yuanzhou Cai,Matej Varga,Andreas Wieser

Main category: cs.CV

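TL;DR: CoFF is a novel cross-modal feature fusion method that combines point cloud geometry with RGB images for point cloud registration, significantly improving registration performance.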

Details

Motivation: Existing methods often ignore the radiometric information in RGB images, which leads to poor registration in regions where geometric data is insufficient. CoFF aims to solve this by fusing geometric and image features. Method: CoFF uses a two-stage fusion strategy: 1) pixel-wise image features are fused with 3D point cloud features; 2) patch-wise image features are fused with superpoint features; a coarse-to-fine matching module then establishes correspondences. Result: On datasets including 3DMatch and 3DLoMatch, CoFF achieves registration recalls of 95.9% and 81.6%, respectively, outperforming existing methods. Conclusion: By effectively fusing cross-modal features, CoFF significantly improves point cloud registration, especially in geometrically ambiguous regions. Abstract: Point cloud registration has seen significant advancements with the application of deep learning techniques. However, existing approaches often overlook the potential of integrating radiometric information from RGB images. This limitation reduces their effectiveness in aligning point cloud pairs, especially in regions where geometric data alone is insufficient. When used effectively, radiometric information can enhance the registration process by providing context that is missing from purely geometric data. In this paper, we propose CoFF, a novel Cross-modal Feature Fusion method that utilizes both point cloud geometry and RGB images for pairwise point cloud registration. Assuming that the co-registration between point clouds and RGB images is available, CoFF explicitly addresses the challenges where geometric information alone is unclear, such as in regions with symmetric similarity or planar structures, through a two-stage fusion of 3D point cloud features and 2D image features. It incorporates a cross-modal feature fusion module that assigns pixel-wise image features to 3D input point clouds to enhance learned 3D point features, and integrates patch-wise image features with superpoint features to improve the quality of coarse matching. This is followed by a coarse-to-fine matching module that accurately establishes correspondences using the fused features. We extensively evaluate CoFF on four common datasets: 3DMatch, 3DLoMatch, IndoorLRS, and the recently released ScanNet++ datasets. In addition, we assess CoFF on specific subset datasets containing geometrically ambiguous cases. Our experimental results demonstrate that CoFF achieves state-of-the-art registration performance across all benchmarks, including remarkable registration recalls of 95.9% and 81.6% on the widely-used 3DMatch and 3DLoMatch datasets, respectively...(Truncated to fit arXiv abstract length)

[163] Touch2Shape: Touch-Conditioned 3D Diffusion for Shape Exploration and Reconstruction

Yuanbo Wang,Zhaoxuan Zhang,Jiajin Qiu,Dilong Sun,Zhengyu Meng,Xiaopeng Wei,Xin Yang

Main category: cs.CV

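TL;DR: The paper proposes Touch2Shape, which uses tactile images and a diffusion model to address the poor capture of local detail in 3D generation tasks, and combines it with reinforcement learning to improve reconstruction performance.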

Details

Motivation: Current 3D diffusion models excel at global context understanding but are limited in capturing the local details of complex shapes and are constrained by occlusion and lighting conditions. Method: Touch2Shape comprises a touch embedding module and a touch shape fusion module, and combines the diffusion model with reinforcement learning to train an exploration policy. Result: Experiments validate the reconstruction quality, and the touch exploration policy further improves reconstruction performance. Conclusion: Touch2Shape uses tactile information to address the local-detail problem in 3D generation and demonstrates the supporting role of reinforcement learning. Abstract: Diffusion models have made breakthroughs in 3D generation tasks. Current 3D diffusion models focus on reconstructing target shape from images or a set of partial observations. While excelling in global context understanding, they struggle to capture the local details of complex shapes and are limited by occlusion and lighting conditions. To overcome these limitations, we utilize tactile images to capture the local 3D information and propose a Touch2Shape model, which leverages a touch-conditioned diffusion model to explore and reconstruct the target shape from touch. For shape reconstruction, we have developed a touch embedding module to condition the diffusion model in creating a compact representation and a touch shape fusion module to refine the reconstructed shape. For shape exploration, we combine the diffusion model with reinforcement learning to train a policy. This involves using the generated latent vector from the diffusion model to guide the touch exploration policy training through a novel reward design. Experiments validate the reconstruction quality through both qualitative and quantitative analysis, and our touch exploration policy further boosts reconstruction performance.

[164] Industry-focused Synthetic Segmentation Pre-training

Shinichi Mae,Ryosuke Yamada,Hirokatsu Kataoka

Main category: cs.CV

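TL;DR: Proposes InsCore, a synthetic pre-training dataset for industrial instance segmentation that needs no real images or human annotations and outperforms COCO, ImageNet-21k, and fine-tuned SAM.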

Details

Motivation: Address the problem that industrial applications cannot pre-train on real images because of legal restrictions and the domain gap. Method: Formula-driven supervised learning (FDSL) is used to generate the synthetic dataset InsCore, which mimics the characteristics of industrial data. Result: On five industrial datasets, models pre-trained with InsCore improve by 6.2 points on average, with high data efficiency (only 100k images). Conclusion: InsCore is a practical, license-free vision foundation model for industrial applications. Abstract: Pre-training on real-image datasets has been widely proven effective for improving instance segmentation. However, industrial applications face two key challenges: (1) legal and ethical restrictions, such as ImageNet's prohibition of commercial use, and (2) limited transferability due to the domain gap between web images and industrial imagery. Even recent vision foundation models, including the segment anything model (SAM), show notable performance degradation in industrial settings. These challenges raise critical questions: Can we build a vision foundation model for industrial applications without relying on real images or manual annotations? And can such models outperform even fine-tuned SAM on industrial datasets? To address these questions, we propose the Instance Core Segmentation Dataset (InsCore), a synthetic pre-training dataset based on formula-driven supervised learning (FDSL). InsCore generates fully annotated instance segmentation images that reflect key characteristics of industrial data, including complex occlusions, dense hierarchical masks, and diverse non-rigid shapes, distinct from typical web imagery. Unlike previous methods, InsCore requires neither real images nor human annotations. Experiments on five industrial datasets show that models pre-trained with InsCore outperform those trained on COCO and ImageNet-21k, as well as fine-tuned SAM, achieving an average improvement of 6.2 points in instance segmentation performance. This result is achieved using only 100k synthetic images, more than 100 times fewer than the 11 million images in SAM's SA-1B dataset, demonstrating the data efficiency of our approach. These findings position InsCore as a practical and license-free vision foundation model for industrial applications.

[165] ARIW-Framework: Adaptive Robust Iterative Watermarking Framework

Shaowu Wu,Liting Zeng,Wei Lu,Xiangyang Luo

Main category: cs.CV

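TL;DR: This paper proposes an adaptive robust iterative watermarking framework (ARIW-Framework) that, through encoder optimization and a parallel optimization strategy, significantly improves the visual quality, robustness, and generalization of watermarked images.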

Details

Motivation: With the rapid development of large models, copyright protection for generated image content has become a key security challenge. Existing deep learning watermarking techniques are limited in visual quality, robustness, and generalization. Method: ARIW-Framework iteratively optimizes the encoder to generate robust residuals, uses noise layers and a decoder to compute robustness weights, and uses image gradients to determine embedding strength. Result: Experiments show strong visual quality, robustness, and generalization, with effective resistance to multiple noise attacks. Conclusion: ARIW-Framework offers an efficient solution for image copyright protection with broad application potential. Abstract: With the rapid rise of large models, copyright protection for generated image content has become a critical security challenge. Although deep learning watermarking techniques offer an effective solution for digital image copyright protection, they still face limitations in terms of visual quality, robustness and generalization. To address these issues, this paper proposes an adaptive robust iterative watermarking framework (ARIW-Framework) that achieves high-quality watermarked images while maintaining exceptional robustness and generalization performance. Specifically, we introduce an iterative approach to optimize the encoder for generating robust residuals. The encoder incorporates noise layers and a decoder to compute robustness weights for residuals under various noise attacks. By employing a parallel optimization strategy, the framework enhances robustness against multiple types of noise attacks. Furthermore, we leverage image gradients to determine the embedding strength at each pixel location, significantly improving the visual quality of the watermarked images. Extensive experiments demonstrate that the proposed method achieves superior visual quality while exhibiting remarkable robustness and generalization against noise attacks.
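
Illustrative sketch: the abstract says image gradients determine the embedding strength at each pixel. The usual rationale is that changes are less visible in textured or edged regions, so a watermark can be embedded more strongly there. The code below assumes a forward-difference gradient and a linear mapping from gradient magnitude to a strength range; `min_strength`, `max_strength`, and both function names are hypothetical, and the paper's mapping is not given in the abstract.

```python
def gradient_magnitude(image):
    """Forward-difference gradient magnitude of a 2D grayscale image
    (list of rows); the last row and column reuse their own value."""
    height, width = len(image), len(image[0])
    magnitude = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            gx = image[y][min(x + 1, width - 1)] - image[y][x]
            gy = image[min(y + 1, height - 1)][x] - image[y][x]
            magnitude[y][x] = (gx * gx + gy * gy) ** 0.5
    return magnitude


def embedding_strength(image, min_strength=0.5, max_strength=2.0):
    """Scale gradient magnitude linearly into [min_strength, max_strength]."""
    magnitude = gradient_magnitude(image)
    peak = max(max(row) for row in magnitude)
    if peak == 0:
        return [[min_strength] * len(row) for row in magnitude]
    span = max_strength - min_strength
    return [[min_strength + span * m / peak for m in row] for row in magnitude]


edge_image = [[0.0, 10.0], [0.0, 10.0]]  # vertical edge between the two columns
strength = embedding_strength(edge_image)
```

Pixels on the edge get the maximum strength, while flat regions fall back to the minimum, which is the behavior one would want if visual quality is the goal.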

Snehashis Majhi,Giacomo D'Amicantonio,Antitza Dantcheva,Quan Kong,Lorenzo Garattoni,Gianpiero Francesca,Egor Bondarev,Francois Bremond

Main category: cs.CV

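TL;DR: PI-VAD is a weakly supervised video anomaly detection method based on multimodal augmentation. It introduces five additional modalities (such as pose, depth, and panoptic masks) to make RGB features more discriminative, uses pseudo-modality generation and cross-modal induction modules during training, and achieves state-of-the-art performance on three datasets.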

Details

Motivation: RGB features are not discriminative enough to separate visually similar events (such as shoplifting), so multimodal information is needed to make video anomaly detection more robust. Method: PI-VAD augments RGB features with five additional modalities (pose, depth, panoptic masks, optical flow, and language cues) and introduces a Pseudo-modality Generation module and a Cross Modal Induction module, which generate modality-specific prototypical representations during training. Result: PI-VAD achieves state-of-the-art performance on three real-world VAD datasets, with no additional computational overhead at inference. Conclusion: Multimodal augmentation significantly improves the reliability of video anomaly detection, and PI-VAD offers an efficient solution for VAD in complex scenes. Abstract: Weakly-supervised methods for video anomaly detection (VAD) are conventionally based merely on RGB spatio-temporal features, which continues to limit their reliability in real-world scenarios. This is due to the fact that RGB-features are not sufficiently distinctive in setting apart categories such as shoplifting from visually similar events. Therefore, towards robust complex real-world VAD, it is essential to augment RGB spatio-temporal features by additional modalities. Motivated by this, we introduce the Poly-modal Induced framework for VAD: "PI-VAD", a novel approach that augments RGB representations by five additional modalities. Specifically, the modalities include sensitivity to fine-grained motion (Pose), three dimensional scene and entity representation (Depth), surrounding objects (Panoptic masks), global motion (optical flow), as well as language cues (VLM). Each modality represents an axis of a polygon, streamlined to add salient cues to RGB. PI-VAD includes two plug-in modules, namely Pseudo-modality Generation module and Cross Modal Induction module, which generate modality-specific prototypical representation and, thereby, induce multi-modal information into RGB cues. These modules operate by performing anomaly-aware auxiliary tasks and necessitate five modality backbones -- only during training. Notably, PI-VAD achieves state-of-the-art accuracy on three prominent VAD datasets encompassing real-world scenarios, without requiring the computational overhead of five modality backbones at inference.

[167] Adaptive Image Restoration for Video Surveillance: A Real-Time Approach

Muhammad Awais Amin,Adama Ilboudo,Abdul Samad bin Shahid,Amjad Ali,Waqas Haider Khan Bangyal

Main category: cs.CV

TL;DR: Develops a ResNet_50-based real-time image restoration model for video surveillance that identifies and restores multiple degradation types.

Details

Motivation: Image degradation (rain, fog, lighting, etc.) hurts the performance of computer vision tasks, and existing restoration models cannot meet real-time processing requirements. Method: Transfer learning with ResNet_50 is used to automatically identify the types of degradation in an image and select the corresponding restoration method. Result: The model is flexible and scalable and is suited to real-time video surveillance. Conclusion: The study provides an efficient solution for real-time image restoration. Abstract: One of the major challenges in the field of computer vision, especially for detection, segmentation, recognition, monitoring, and automated solutions, is the quality of images. Image degradation, often caused by factors such as rain, fog, lighting, etc., has a negative impact on automated decision-making. Furthermore, several image restoration solutions exist, including restoration models for single degradation and restoration models for multiple degradations. However, these solutions are not suitable for real-time processing. In this study, the aim was to develop a real-time image restoration solution for video surveillance. To achieve this, using transfer learning with ResNet_50, we developed a model for automatically identifying the types of degradation present in an image to reference the necessary treatment(s) for image restoration. Our solution has the advantage of being flexible and scalable.

[168] Learning to Adapt to Position Bias in Vision Transformer Classifiers

Robert-Jan Bruintjes,Jan van Gemert

Main category: cs.CV

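TL;DR: The paper examines how important position information is for image classification, proposes Position-SHAP as a measure of position bias, and develops the Auto-PE embedding technique to improve classification performance.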

Details

Motivation: Study the role of position information in image classification, since its importance varies by dataset: some settings call for translation invariance, others benefit from exploiting position bias. Method: Position-SHAP is proposed to measure position bias, and the Auto-PE embedding technique is developed to control the use of position information by adjusting the embedding norm. Result: Different datasets exhibit different degrees of position bias, and Auto-PE combines with existing embedding techniques to match or improve classification accuracy. Conclusion: Position bias is critical to Vision Transformer performance, and Auto-PE adapts flexibly to the needs of different datasets. Abstract: How discriminative position information is for image classification depends on the data. On the one hand, the camera position is arbitrary and objects can appear anywhere in the image, arguing for translation invariance. At the same time, position information is key for exploiting capture/center bias, and scene layout, e.g.: the sky is up. We show that position bias, the level to which a dataset is more easily solved when positional information on input features is used, plays a crucial role in the performance of Vision Transformers image classifiers. To investigate, we propose Position-SHAP, a direct measure of position bias by extending SHAP to work with position embeddings. We show various levels of position bias in different datasets, and find that the optimal choice of position embedding depends on the position bias apparent in the dataset. We therefore propose Auto-PE, a single-parameter position embedding extension, which allows the position embedding to modulate its norm, enabling the unlearning of position information. Auto-PE combines with existing PEs to match or improve accuracy on classification datasets.
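
Illustrative sketch: the abstract describes Auto-PE as a single-parameter extension that lets the position embedding modulate its norm, so position information can be unlearned. One way to read this, assumed here, is to normalize each position embedding and rescale it by one learnable scalar before adding it to the token; driving the scalar to zero removes position information. The function name and the exact parameterization are hypothetical.

```python
import math


def apply_scaled_position_embedding(tokens, position_embeddings, scale):
    """Add unit-normalized position embeddings, rescaled by a single
    scalar, to the token features. scale = 0 drops position information."""
    output = []
    for token, pe in zip(tokens, position_embeddings):
        norm = math.sqrt(sum(v * v for v in pe))
        unit = [v / norm for v in pe] if norm > 0 else list(pe)
        output.append([t + scale * u for t, u in zip(token, unit)])
    return output


tokens = [[1.0, 1.0]]
position_embeddings = [[3.0, 4.0]]  # norm 5, unit direction (0.6, 0.8)
without_position = apply_scaled_position_embedding(tokens, position_embeddings, 0.0)
with_position = apply_scaled_position_embedding(tokens, position_embeddings, 5.0)
```

In a trained model the scalar would be a learnable parameter, so a dataset with little position bias could push it toward zero while a strongly biased dataset could keep it large.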

[169] CacheFlow: Fast Human Motion Prediction by Cached Normalizing Flow

Takahiro Maeda,Jinkun Cao,Norimichi Ukita,Kris Kitani

Main category: cs.CV

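TL;DR: CacheFlow is a fast flow-based method for 3D human motion prediction that uses precomputation and caching to greatly speed up inference while preserving prediction accuracy.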

Details

Motivation: Existing 3D human motion prediction methods take too long at inference to meet real-time needs, so a more efficient density estimation technique is required. Method: A two-stage approach: 1) an unconditional flow model precomputes the density of future motions; 2) a lightweight model maps historical trajectories to samples in a Gaussian mixture. Result: Inference takes about 1 millisecond, 4 times faster than VAE methods and 30 times faster than diffusion-based methods, while maintaining accuracy on the Human3.6M and AMASS datasets. Conclusion: CacheFlow is markedly faster than existing methods without sacrificing prediction accuracy, making it suitable for real-time 3D human motion prediction. Abstract: Many density estimation techniques for 3D human motion prediction require a significant amount of inference time, often exceeding the duration of the predicted time horizon. To address the need for faster density estimation for 3D human motion prediction, we introduce a novel flow-based method for human motion prediction called CacheFlow. Unlike previous conditional generative models that suffer from poor time efficiency, CacheFlow takes advantage of an unconditional flow-based generative model that transforms a Gaussian mixture into the density of future motions. The results of the computation of the flow-based generative model can be precomputed and cached. Then, for conditional prediction, we seek a mapping from historical trajectories to samples in the Gaussian mixture. This mapping can be done by a much more lightweight model, thus saving significant computation overhead compared to a typical conditional flow model. In such a two-stage fashion and by caching results from the slow flow model computation, we build our CacheFlow without loss of prediction accuracy and model expressiveness. This inference process is completed in approximately one millisecond, making it 4 times faster than previous VAE methods and 30 times faster than previous diffusion-based methods on standard benchmarks such as Human3.6M and AMASS datasets. Furthermore, our method demonstrates improved density estimation accuracy and comparable prediction accuracy to a SOTA method on Human3.6M. Our code and models will be publicly available.

[170] FlowCut: Unsupervised Video Instance Segmentation via Temporal Mask Matching

Alp Eren Sari,Paolo Favaro

Main category: cs.CV

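TL;DR: FlowCut is a three-stage framework for unsupervised video instance segmentation that builds a high-quality video dataset with pseudo labels and achieves state-of-the-art performance on several benchmarks.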

Details

Motivation: Unsupervised video instance segmentation currently lacks a high-quality pseudo-labeled dataset, and FlowCut aims to fill this gap. Method: A three-stage framework: 1) generate pseudo-instance masks from image and optical flow features; 2) build short video segments with high-quality, consistent pseudo-instance masks; 3) train a video segmentation model using YouTubeVIS-2021. Result: State-of-the-art performance on the YouTubeVIS-2019, YouTubeVIS-2021, and DAVIS-2017 benchmarks, among others. Conclusion: By building a high-quality pseudo-labeled dataset, FlowCut provides an effective solution for unsupervised video instance segmentation. Abstract: We propose FlowCut, a simple and capable method for unsupervised video instance segmentation consisting of a three-stage framework to construct a high-quality video dataset with pseudo labels. To our knowledge, our work is the first attempt to curate a video dataset with pseudo-labels for unsupervised video instance segmentation. In the first stage, we generate pseudo-instance masks by exploiting the affinities of features from both images and optical flows. In the second stage, we construct short video segments containing high-quality, consistent pseudo-instance masks by temporally matching them across the frames. In the third stage, we use the YouTubeVIS-2021 video dataset to extract our training instance segmentation set, and then train a video segmentation model. FlowCut achieves state-of-the-art performance on the YouTubeVIS-2019, YouTubeVIS-2021, DAVIS-2017, and DAVIS-2017 Motion benchmarks.
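
Illustrative sketch: the second stage matches pseudo-instance masks temporally across frames. The abstract does not state the matching criterion, so the code below assumes a simple greedy match on intersection-over-union (IoU) between masks in consecutive frames; masks are represented as sets of pixel coordinates, and `min_iou` and both function names are hypothetical.

```python
def mask_iou(mask_a, mask_b):
    """Intersection-over-union of two masks given as sets of (row, col)."""
    union = len(mask_a | mask_b)
    return len(mask_a & mask_b) / union if union else 0.0


def match_masks(prev_masks, next_masks, min_iou=0.5):
    """Greedily pair each mask in the previous frame with the unused mask
    in the next frame that overlaps it most, if the IoU reaches min_iou."""
    matches = {}
    used = set()
    for i, prev_mask in enumerate(prev_masks):
        best_index, best_score = None, min_iou
        for j, next_mask in enumerate(next_masks):
            if j in used:
                continue
            score = mask_iou(prev_mask, next_mask)
            if score >= best_score:
                best_index, best_score = j, score
        if best_index is not None:
            matches[i] = best_index
            used.add(best_index)
    return matches


frame_t = [{(0, 0), (0, 1)}, {(5, 5)}]
frame_t1 = [{(5, 5)}, {(0, 0), (0, 1), (0, 2)}]
matches = match_masks(frame_t, frame_t1)
```

Masks that find no partner above the threshold are left unmatched, which is one simple way to keep only temporally consistent instances when assembling short video segments.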

[171] Emergence of Fixational and Saccadic Movements in a Multi-Level Recurrent Attention Model for Vision

Pengcheng Pan,Yonekura Shogo,Yasuo Kuniyoshi

Main category: cs.CV

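TL;DR: Proposes a Multi-Level Recurrent Attention Model (MRAM) that mimics the neural hierarchy of the human visual system, addressing the shortcomings of existing hard attention models in visual exploration dynamics, producing attention behavior closer to human eye movements, and performing well on image classification.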

Details

Motivation: Existing hard attention models (such as RAM and DRAM) do not model the hierarchy of the human visual system, so their attention is either overly fixational or excessively saccadic, unlike human eye movements. Method: The MRAM framework decouples glimpse location generation and task execution into two recurrent layers, mimicking the neural hierarchy of human visual processing. Result: MRAM achieves attention dynamics closer to human eye movements and outperforms CNN, RAM, and DRAM on standard image classification benchmarks. Conclusion: Through hierarchical modeling, MRAM achieves more natural attention behavior and outperforms existing models. Abstract: Inspired by foveal vision, hard attention models promise interpretability and parameter economy. However, existing models like the Recurrent Model of Visual Attention (RAM) and Deep Recurrent Attention Model (DRAM) fail to model the hierarchy of the human visual system, which compromises their visual exploration dynamics. As a result, they tend to produce attention that is either overly fixational or excessively saccadic, diverging from human eye movement behavior. In this paper, we propose a Multi-Level Recurrent Attention Model (MRAM), a novel hard attention framework that explicitly models the neural hierarchy of human visual processing. By decoupling the functions of glimpse location generation and task execution into two recurrent layers, MRAM exhibits an emergent, balanced behavior between fixation and saccadic movement. Our results show that MRAM not only achieves more human-like attention dynamics, but also consistently outperforms CNN, RAM and DRAM baselines on standard image classification benchmarks.

[172] MatPredict: a dataset and benchmark for learning material properties of diverse indoor objects

Yuzhen Chen,Hojun Son,Arpan Kusari

Main category: cs.CV

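TL;DR: The MatPredict dataset combines data from Replica and MatSynth to generate 3D objects with diverse material properties for inferring material properties from visual images, and comes with a benchmark and code.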

Details

Motivation: Improve the ability of consumer robots to identify complex objects by determining material properties from camera images. Method: 3D meshes of specific foreground objects are selected and rendered with different material properties, yielding 18 common objects and 14 materials, with variability in lighting and camera placement. Result: A benchmark for inferring material properties from visual images is provided, along with a discussion of neural network model performance. Conclusion: By simulating light-material interactions to enhance realism, MatPredict aims to advance perception in consumer robotics. Abstract: Determining material properties from camera images can expand the ability to identify complex objects in indoor environments, which is valuable for consumer robotics applications. To support this, we introduce MatPredict, a dataset that combines the high-quality synthetic objects from the Replica dataset with the material property classes of the MatSynth dataset to create objects with diverse material properties. We select 3D meshes of specific foreground objects and render them with different material properties. In total, we generate 18 commonly occurring objects with 14 different materials. We showcase how we provide variability in terms of lighting and camera placement for these objects. Next, we provide a benchmark for inferring material properties from visual images using these perturbed models in the scene, discussing the specific neural network models involved and their performance based on different image comparison metrics. By accurately simulating light interactions with different materials, we can enhance realism, which is crucial for training models effectively through large-scale simulations. This research aims to revolutionize perception in consumer robotics. The dataset is available at https://huggingface.co/datasets/UMTRI/MatPredict and the code at https://github.com/arpan-kusari/MatPredict.

[173] MAGI-1: Autoregressive Video Generation at Scale

Sand. ai,Hansi Teng,Hongyu Jia,Lei Sun,Lingzhi Li,Maolin Li,Mingqiu Tang,Shuai Han,Tianning Zhang,W. Q. Zhang,Weifeng Luo,Xiaoyang Kang,Yuchen Sun,Yue Cao,Yunpeng Huang,Yutong Lin,Yuxin Fang,Zewei Tao,Zheng Zhang,Zhongshu Wang,Zixun Liu,Dai Shi,Guoli Su,Hanwen Sun,Hong Pan,Jie Wang,Jiexin Sheng,Min Cui,Min Hu,Ming Yan,Shucheng Yin,Siran Zhang,Tingting Liu,Xianping Yin,Xiaoyu Yang,Xin Song,Xuan Hu,Yankai Zhang,Yuqiao Li

Main category: cs.CV

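TL;DR: MAGI-1 is a world model that generates video by autoregressively predicting a sequence of video chunks; it supports streaming generation and performs strongly on image-to-video tasks.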

Details

Motivation: Addresses temporal consistency and scalability in video generation while supporting real-time, memory-efficient deployment. Method: Autoregressively predicts a sequence of video chunks, trained to denoise per-chunk noise that increases monotonically over time, with chunk-wise prompting and constant peak inference cost. Result: MAGI-1 shows high temporal consistency and scalability on image-to-video tasks; the largest model has 24 billion parameters and supports context lengths of up to 4 million tokens. Conclusion: Through algorithmic innovations and a dedicated infrastructure stack, MAGI-1 achieves controllable generation and efficient deployment, demonstrating scalability and robustness. Abstract: We present MAGI-1, a world model that generates videos by autoregressively predicting a sequence of video chunks, defined as fixed-length segments of consecutive frames. Trained to denoise per-chunk noise that increases monotonically over time, MAGI-1 enables causal temporal modeling and naturally supports streaming generation. It achieves strong performance on image-to-video (I2V) tasks conditioned on text instructions, providing high temporal consistency and scalability, which are made possible by several algorithmic innovations and a dedicated infrastructure stack. MAGI-1 facilitates controllable generation via chunk-wise prompting and supports real-time, memory-efficient deployment by maintaining constant peak inference cost, regardless of video length. The largest variant of MAGI-1 comprises 24 billion parameters and supports context lengths of up to 4 million tokens, demonstrating the scalability and robustness of our approach. The code and models are available at https://github.com/SandAI-org/MAGI-1 and https://github.com/SandAI-org/MagiAttention. The product can be accessed at https://sand.ai.
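
The phrase "per-chunk noise that increases monotonically over time" can be illustrated with a toy schedule. This is only a sketch: the abstract does not give the actual schedule, so the linear ramp, the function name `chunk_noise_levels`, and the parameters `first_level` and `step` are all assumptions made for illustration.

```python
def chunk_noise_levels(num_chunks: int, first_level: float = 0.0,
                       step: float = 0.25) -> list[float]:
    """Assign one noise level in [0, 1] to each chunk (hypothetical linear ramp).

    Earlier chunks get less noise than later ones, so a chunk can be nearly
    clean while the chunks after it are still being denoised.
    """
    if num_chunks < 0:
        raise ValueError("num_chunks must be non-negative")
    # Clamp at 1.0 (pure noise); the ramp never decreases.
    return [min(1.0, first_level + i * step) for i in range(num_chunks)]


def is_monotonic(levels: list[float]) -> bool:
    """Check that the noise level never decreases from one chunk to the next."""
    return all(a <= b for a, b in zip(levels, levels[1:]))
```

With such a schedule, the earliest chunk finishes denoising first, which is one way to read the streaming property the abstract describes.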

[174] RB-SCD: A New Benchmark for Semantic Change Detection of Roads and Bridges in Traffic Scenes

Qingling Shu,Sibao Chen,Zhihui You,Wei Lu,Jin Tang,Bin Luo

Main category: cs.CV

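TL;DR: The paper introduces the RB-SCD dataset and the MFDCD framework for fine-grained semantic change detection, improving recognition of changes in roads and bridges.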

Details

Motivation: Existing methods struggle to extract fine-grained semantic change information because high-quality annotated datasets are lacking, which affects urban planning and traffic management. Method: Introduces the RB-SCD dataset and proposes the MFDCD framework, which combines multimodal features in the frequency domain through a Dynamic Frequency Coupler and a Textual Frequency Filter. Result: The method is validated on RB-SCD and three public benchmarks. Conclusion: RB-SCD and MFDCD provide new tools for semantic change detection of roads and bridges and have practical value. Abstract: Accurate detection of changes in roads and bridges, such as construction, renovation, and demolition, is essential for urban planning and traffic management. However, existing methods often struggle to extract fine-grained semantic change information due to the lack of high-quality annotated datasets in traffic scenarios. To address this, we introduce the Road and Bridge Semantic Change Detection (RB-SCD) dataset, a comprehensive benchmark comprising 260 pairs of high-resolution remote sensing images from diverse cities and countries. RB-SCD captures 11 types of semantic changes across varied road and bridge structures, enabling detailed structural and functional analysis. Building on this dataset, we propose a novel framework, Multimodal Frequency-Driven Change Detector (MFDCD), which integrates multimodal features in the frequency domain. MFDCD includes a Dynamic Frequency Coupler (DFC) that fuses hierarchical visual features with wavelet-based frequency components, and a Textual Frequency Filter (TFF) that transforms CLIP-derived textual features into the frequency domain and applies graph-based filtering. Experimental results on RB-SCD and three public benchmarks demonstrate the effectiveness of our approach.

[175] Hybrid 3D-4D Gaussian Splatting for Fast Dynamic Scene Representation

Seungjun Oh,Younggeun Lee,Hyejin Jeon,Eunbyung Park

Main category: cs.CV

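TL;DR: Proposes hybrid 3D-4D Gaussian Splatting (3D-4DGS), which adaptively represents static regions with 3D Gaussians and keeps 4D Gaussians for dynamic regions, substantially reducing computation and memory overhead while maintaining or improving visual quality.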

Details

Motivation: Existing 4D Gaussian Splatting methods redundantly allocate 4D Gaussians to static regions, causing large computational and memory overhead and possibly degrading image quality. Method: Starts from a fully 4D Gaussian representation and iteratively converts temporally invariant Gaussians into 3D, while dynamic Gaussians keep their 4D representation. Result: Substantially fewer parameters, better computational efficiency, and faster training, with visual quality maintained or improved. Conclusion: 3D-4DGS is an efficient, high-quality method for dynamic 3D scene reconstruction. Abstract: Recent advancements in dynamic 3D scene reconstruction have shown promising results, enabling high-fidelity 3D novel view synthesis with improved temporal consistency. Among these, 4D Gaussian Splatting (4DGS) has emerged as an appealing approach due to its ability to model high-fidelity spatial and temporal variations. However, existing methods suffer from substantial computational and memory overhead due to the redundant allocation of 4D Gaussians to static regions, which can also degrade image quality. In this work, we introduce hybrid 3D-4D Gaussian Splatting (3D-4DGS), a novel framework that adaptively represents static regions with 3D Gaussians while reserving 4D Gaussians for dynamic elements. Our method begins with a fully 4D Gaussian representation and iteratively converts temporally invariant Gaussians into 3D, significantly reducing the number of parameters and improving computational efficiency. Meanwhile, dynamic Gaussians retain their full 4D representation, capturing complex motions with high fidelity. Our approach achieves significantly faster training times compared to baseline 4D Gaussian Splatting methods while maintaining or improving the visual quality.
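
The conversion step described above can be sketched as a simple partition. The abstract only says that "temporally invariant" Gaussians are converted to 3D; the test used here (a temporal scale at or above a threshold), the class names `Gaussian4D` and `Gaussian3D`, and the field names are assumptions for illustration, not the authors' criterion.

```python
from dataclasses import dataclass


@dataclass
class Gaussian4D:
    mean_xyz: tuple[float, float, float]  # spatial centre
    mean_t: float                         # temporal centre
    scale_t: float                        # temporal extent (hypothetical field)


@dataclass
class Gaussian3D:
    mean_xyz: tuple[float, float, float]  # time parameters dropped


def split_static_dynamic(gaussians: list[Gaussian4D], static_threshold: float):
    """Partition Gaussians into static (3D) and dynamic (4D) sets.

    Assumption: a Gaussian whose temporal extent is at least
    `static_threshold` is active for the whole sequence and is treated
    as temporally invariant.
    """
    static, dynamic = [], []
    for g in gaussians:
        if g.scale_t >= static_threshold:
            static.append(Gaussian3D(g.mean_xyz))  # drop the time dimension
        else:
            dynamic.append(g)                      # keep the full 4D form
    return static, dynamic
```

Each converted Gaussian stores fewer parameters, which is where the reduction in parameter count reported in the abstract would come from under this reading.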

[176] Swin DiT: Diffusion Transformer using Pseudo Shifted Windows

Jiafu Wu,Yabiao Wang,Jian Li,Jinlong Peng,Yun Cao,Chengjie Wang,Jiangning Zhang

Main category: cs.CV

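TL;DR: The paper proposes an attention mechanism called PSWA and the PCCA strategy, which substantially reduce redundant global computation in DiTs and improve image generation performance.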

Details

Motivation: Conventional DiTs are computationally expensive on high-resolution images, and their dependence on global information is overestimated, leading to redundant computation. Method: Proposes the PSWA mechanism and the PCCA strategy, which use window attention for local-global information interaction and supplement high-frequency information. Result: Swin-DiT-L improves FID by 54% over DiT-XL/2 at lower computational cost. Conclusion: PSWA and PCCA effectively address the redundancy in DiTs and substantially improve performance. Abstract: Diffusion Transformers (DiTs) achieve remarkable performance within the domain of image generation through the incorporation of the transformer architecture. Conventionally, DiTs are constructed by stacking serial isotropic global information modeling transformers, which face significant computational cost when processing high-resolution images. We empirically find that latent space image generation does not exhibit as strong a dependence on global information as traditionally assumed. Most of the layers in the model demonstrate redundancy in global computation. In addition, conventional attention mechanisms exhibit low-frequency inertia issues. To address these issues, we propose Pseudo Shifted Window Attention (PSWA), which fundamentally mitigates global model redundancy. PSWA achieves intermediate global-local information interaction through window attention, while employing a high-frequency bridging branch to simulate shifted window operations, supplementing appropriate global and high-frequency information. Furthermore, we propose the Progressive Coverage Channel Allocation (PCCA) strategy that captures high-order attention similarity without additional computational cost. Building upon all of them, we propose a series of Pseudo Shifted Window DiTs (Swin DiT), accompanied by extensive experiments demonstrating their superior performance. For example, our proposed Swin-DiT-L achieves a 54% FID improvement over DiT-XL/2 while requiring less computation. Code: https://github.com/wujiafu007/Swin-DiT

[177] Automatic Complementary Separation Pruning Toward Lightweight CNNs

David Levin,Gonen Singer

Main category: cs.CV

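TL;DR: ACSP is a fully automated pruning method for convolutional neural networks that combines structured pruning with activation-based pruning; it builds a graph space and applies complementary selection principles to remove redundant components efficiently while preserving network performance.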

Details

Motivation: Existing pruning methods require the pruning volume to be defined manually, so they lack automation and efficiency. ACSP aims to improve pruning efficiency and practicality through full automation and complementary selection. Method: ACSP builds a graph space that encodes each component's ability to separate classes, then uses a clustering algorithm and complementary selection principles to determine the optimal subset of components in each layer automatically. Result: On architectures such as VGG-16 and ResNet-50 and datasets such as CIFAR-10 and ImageNet, ACSP achieves competitive accuracy while significantly reducing computational cost. Conclusion: ACSP is an efficient, fully automated pruning method suited to real-world deployment, with no need to define the pruning volume manually. Abstract: In this paper, we present Automatic Complementary Separation Pruning (ACSP), a novel and fully automated pruning method for convolutional neural networks. ACSP integrates the strengths of both structured pruning and activation-based pruning, enabling the efficient removal of entire components such as neurons and channels while leveraging activations to identify and retain the most relevant components. Our approach is designed specifically for supervised learning tasks, where we construct a graph space that encodes the separation capabilities of each component with respect to all class pairs. By employing complementary selection principles and utilizing a clustering algorithm, ACSP ensures that the selected components maintain diverse and complementary separation capabilities, reducing redundancy and maintaining high network performance. The method automatically determines the optimal subset of components in each layer, utilizing a knee-finding algorithm to select the minimal subset that preserves performance without requiring user-defined pruning volumes. Extensive experiments on multiple architectures, including VGG-16, ResNet-50, and MobileNet-V2, across datasets like CIFAR-10, CIFAR-100, and ImageNet-1K, demonstrate that ACSP achieves competitive accuracy compared to other methods while significantly reducing computational costs. This fully automated approach not only enhances scalability but also makes ACSP especially practical for real-world deployment by eliminating the need for manually defining the pruning volume.
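
The abstract mentions a knee-finding algorithm without naming one. As a minimal sketch, a common generic choice is to pick the point farthest from the straight line joining the first and last points of the curve; whether ACSP uses this particular rule is an assumption, and the function name `find_knee` and the example scores are hypothetical.

```python
def find_knee(scores: list[float]) -> int:
    """Return the index of the knee of a curve sampled at x = 0, 1, 2, ...

    Uses the maximum distance to the chord between the first and last
    points. Axes are not normalised, so the result depends on the scale
    of `scores` relative to the index spacing.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    x1 = len(scores) - 1
    dx, dy = x1, scores[-1] - scores[0]
    best_index, best_dist = 0, -1.0
    for i, y in enumerate(scores):
        # Proportional to the perpendicular distance from (i, y) to the chord.
        dist = abs(dy * i - dx * (y - scores[0]))
        if dist > best_dist:
            best_index, best_dist = i, dist
    return best_index


# Hypothetical accuracy when keeping 1, 2, 3, ... components of a layer.
accuracy_by_subset_size = [0.50, 0.80, 0.90, 0.92, 0.93, 0.94]
knee = find_knee(accuracy_by_subset_size)  # index 2, i.e. keep 3 components
```

Past the knee, extra components add little accuracy, which matches the stated goal of selecting the minimal subset that preserves performance.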

[178] From Local Details to Global Context: Advancing Vision-Language Models with Attention-Based Selection

Lincan Cai,Jingxuan Kang,Shuang Li,Wenxuan Ma,Binhui Xie,Zhida Qin,Jian Liang

Main category: cs.CV

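TL;DR: The paper proposes Attention-Based Selection (ABS), which uses attention-guided cropping and feature selection to counter the background artifacts and over-focus on local details introduced by visual augmentation, substantially improving zero-shot classification performance.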

Details

Motivation: Pretrained vision-language models such as CLIP perform well on zero-shot tasks, but random visual augmentation can introduce background artifacts and make the model over-focus on local details, harming global semantic understanding. Method: Proposes ABS, which applies attention-guided cropping in both image and feature space, supplements global semantic information through strategic feature selection, and introduces soft matching to improve alignment with LLM-generated descriptions. Result: ABS achieves state-of-the-art performance on out-of-distribution generalization and zero-shot classification, requires no training, and even rivals few-shot and test-time adaptation methods. Conclusion: ABS effectively addresses the limitations of visual augmentation, substantially improves zero-shot performance, and has broad application potential. Abstract: Pretrained vision-language models (VLMs), e.g., CLIP, demonstrate impressive zero-shot capabilities on downstream tasks. Prior research highlights the crucial role of visual augmentation techniques, like random cropping, in alignment with fine-grained class descriptions generated by large language models (LLMs), significantly enhancing zero-shot performance by incorporating multi-view information. However, the inherent randomness of these augmentations can inevitably introduce background artifacts and cause models to overly focus on local details, compromising global semantic understanding. To address these issues, we propose an Attention-Based Selection (ABS) method from local details to global context, which applies attention-guided cropping in both raw images and feature space and supplements global semantic information through strategic feature selection. Additionally, we introduce a soft matching technique to effectively filter LLM descriptions for better alignment. ABS achieves state-of-the-art performance on out-of-distribution generalization and zero-shot classification tasks. Notably, ABS is training-free and even rivals few-shot and test-time adaptation methods. Our code is available at https://github.com/BIT-DA/ABS.

[179] WriteViT: Handwritten Text Generation with Vision Transformer

Dang Hoai Nam,Huynh Tong Dang Khoa,Vo Nguyen Le Duy

Main category: cs.CV

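TL;DR: WriteViT is a one-shot handwritten text synthesis framework based on Vision Transformers that targets the difficulty machines have in capturing handwriting style from little data, and it performs well on Vietnamese and English datasets.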

Details

Motivation: Humans can quickly generalize a handwriting style from a single example, whereas machines struggle to capture subtle spatial and stylistic cues in low-data settings. WriteViT aims to close this gap. Method: WriteViT combines a ViT-based Writer Identifier that extracts style embeddings, a multi-scale generator (Transformer encoder-decoder blocks with conditional positional encoding), and a lightweight ViT-based recognizer. Result: On Vietnamese and English datasets, WriteViT produces high-quality, style-consistent handwriting and maintains strong recognition performance in low-resource scenarios. Conclusion: Transformer-based designs show promise for multilingual handwriting generation and efficient style adaptation. Abstract: Humans can quickly generalize handwriting styles from a single example by intuitively separating content from style. Machines, however, struggle with this task, especially in low-data settings, often missing subtle spatial and stylistic cues. Motivated by this gap, we introduce WriteViT, a one-shot handwritten text synthesis framework that incorporates Vision Transformers (ViT), a family of models that have shown strong performance across various computer vision tasks. WriteViT integrates a ViT-based Writer Identifier for extracting style embeddings, a multi-scale generator built with Transformer encoder-decoder blocks enhanced by conditional positional encoding (CPE), and a lightweight ViT-based recognizer. While previous methods typically rely on CNNs or CRNNs, our design leverages transformers in key components to better capture both fine-grained stroke details and higher-level style information. Although handwritten text synthesis has been widely explored, its application to Vietnamese -- a language rich in diacritics and complex typography -- remains limited. Experiments on Vietnamese and English datasets demonstrate that WriteViT produces high-quality, style-consistent handwriting while maintaining strong recognition performance in low-resource scenarios. These results highlight the promise of transformer-based designs for multilingual handwriting generation and efficient style adaptation.

[180] Joint Depth and Reflectivity Estimation using Single-Photon LiDAR

Hashan K. Weerasooriya,Prateek Chennuri,Weijian Zhang,Istvan Gyongy,Stanley H. Chan

Main category: cs.CV

TL;DR: The paper proposes SPLiDER, a new method that jointly recovers depth and reflectivity in fast-moving scenes and outperforms existing approaches.

Details

Motivation: Existing SP-LiDAR methods usually recover depth and reflectivity separately or sequentially and are inefficient in dynamic scenes. This paper aims to address that. Method: Analyzes the mutual correlation between depth and reflectivity theoretically and proposes the joint estimation method SPLiDER, which exploits shared information to enhance signal recovery. Result: On both synthetic and real SP-LiDAR data, SPLiDER outperforms existing methods and achieves better joint reconstruction quality. Conclusion: SPLiDER is efficient and effective in dynamic scenes and offers a new solution for SP-LiDAR. Abstract: Single-Photon Light Detection and Ranging (SP-LiDAR) is emerging as a leading technology for long-range, high-precision 3D vision tasks. In SP-LiDAR, timestamps encode two complementary pieces of information: pulse travel time (depth) and the number of photons reflected by the object (reflectivity). Existing SP-LiDAR reconstruction methods typically recover depth and reflectivity separately, or sequentially use one modality to estimate the other. Moreover, the conventional 3D histogram construction is effective mainly for slow-moving or stationary scenes. In dynamic scenes, however, it is more efficient and effective to directly process the timestamps. In this paper, we introduce an estimation method to simultaneously recover both depth and reflectivity in fast-moving scenes. We offer two contributions: (1) A theoretical analysis demonstrating the mutual correlation between depth and reflectivity and the conditions under which joint estimation becomes beneficial. (2) A novel reconstruction method, "SPLiDER", which exploits the shared information to enhance signal recovery. On both synthetic and real SP-LiDAR data, our method outperforms existing approaches, achieving superior joint reconstruction quality.
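
To make concrete how timestamps carry both quantities, the sketch below computes the kind of naive, separate per-pixel estimates that the abstract contrasts with joint estimation. It is not SPLiDER. It assumes noise-free timestamps (no background photons), and the function name `naive_depth_and_reflectivity` and the photons-per-pulse reflectivity proxy are illustrative choices.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def naive_depth_and_reflectivity(timestamps_s: list[float], num_pulses: int):
    """Estimate depth and a reflectivity proxy for one pixel, separately.

    depth        = c * (mean round-trip time) / 2
    reflectivity ~ detected photons per emitted pulse (unnormalised proxy)
    """
    if num_pulses <= 0:
        raise ValueError("num_pulses must be positive")
    if not timestamps_s:
        return None, 0.0  # no detections: depth is undefined
    mean_time = sum(timestamps_s) / len(timestamps_s)
    depth_m = SPEED_OF_LIGHT_M_PER_S * mean_time / 2.0  # halve the round trip
    reflectivity = len(timestamps_s) / num_pulses
    return depth_m, reflectivity
```

Note that the number of timestamps affects both outputs: it sets the reflectivity proxy directly and determines how well the mean time, and therefore the depth, is estimated.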

[181] Unlocking the Potential of Difficulty Prior in RL-based Multimodal Reasoning

Mingrui Chen,Haogeng Liu,Hao Liang,Huaibo Huang,Wentao Zhang,Ran He

Main category: cs.CV

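TL;DR: The paper studies how explicitly modeling problem difficulty improves reinforcement-learning-based fine-tuning for multimodal reasoning, proposing a three-part method (offline data curation, online advantage differentiation, and difficulty hints) that yields substantial gains.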

Details

Motivation: Explores how problem difficulty information affects reinforcement learning fine-tuning for multimodal reasoning, with the aim of improving model performance by optimizing the training data and method. Method: 1. Offline data curation: analyze the U-shaped difficulty distribution of the datasets and filter out problems that are too easy or too hard; 2. Online advantage differentiation: adaptively reweight advantage estimates using group-wise accuracy; 3. Difficulty hints: give explicit prompts for complex samples in the second training stage. Result: With only 2K+0.6K two-stage training data, the method shows substantial gains on multimodal mathematical reasoning benchmarks. Conclusion: Explicitly modeling problem difficulty effectively improves multimodal reasoning performance, and the method is efficient and scalable. Abstract: In this work, we investigate how explicitly modeling prior information about a problem's difficulty shapes the effectiveness of reinforcement learning based fine-tuning for multimodal reasoning. Our exploration mainly comprises the following three perspectives: First, through offline data curation, we analyze the U-shaped difficulty distribution of two given datasets using the base model by multi-round sampling, and then filter out prompts that are either too simple or too difficult to provide meaningful gradients, and perform subsequent two-stage training. Second, we implement an online advantage differentiation, computing group-wise empirical accuracy as a difficulty proxy to adaptively reweight advantage estimation, providing stronger learning signals for more challenging problems. Finally, we introduce difficulty hints as explicit prompts for more complex samples in the second training stage, encouraging the model to calibrate its reasoning depth and perform reflective validation checks. Our comprehensive approach demonstrates significant performance across various multi-modal mathematical reasoning benchmarks with only 2K+0.6K two-stage training data.
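
The second step, using group-wise accuracy as a difficulty proxy to reweight advantages, can be sketched as follows. The abstract does not state the weighting function, so the linear weight `1 + alpha * (1 - accuracy)`, the mean-centred advantage, and the name `reweighted_advantages` are assumptions for illustration.

```python
def reweighted_advantages(rewards: list[float], alpha: float = 1.0) -> list[float]:
    """Scale mean-centred advantages by a difficulty weight for one prompt.

    `rewards` holds 0/1 correctness for several sampled answers to the same
    prompt. The group accuracy acts as a difficulty proxy: the lower it is,
    the harder the prompt and the larger the (hypothetical) weight.
    """
    if not rewards:
        raise ValueError("rewards must not be empty")
    accuracy = sum(rewards) / len(rewards)
    weight = 1.0 + alpha * (1.0 - accuracy)  # harder prompt -> larger weight
    return [weight * (r - accuracy) for r in rewards]


hard = reweighted_advantages([1, 0, 0, 0])  # accuracy 0.25, weight 1.75
easy = reweighted_advantages([1, 1, 1, 0])  # accuracy 0.75, weight 1.25
```

Under this weighting, a correct answer on the hard prompt receives a larger advantage than a correct answer on the easy one, which is the "stronger learning signal for more challenging problems" the abstract refers to.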

[182] DB3D-L: Depth-aware BEV Feature Transformation for Accurate 3D Lane Detection

Yehao Liu,Xiaosu Xu,Zijian Wang,Yiqing Yao

Main category: cs.CV

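TL;DR: Proposes a 3D lane detection method based on depth-aware BEV feature transformation, which obtains depth information from a depth network, simplifies view transformation, and performs comparably with state-of-the-art methods on synthetic and real datasets.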

Details

Motivation: Existing methods are limited when building BEV features from front-view images because depth information is missing; they rely on a flat-ground assumption, and depth estimation is poorly integrated with lane detection. Method: Designs a feature extraction module (with an integrated depth network), a feature reduction module, and a fusion module to combine front-view features and depth information into BEV features. Result: Performs comparably with state-of-the-art methods on the synthetic Apollo and real OpenLane datasets. Conclusion: Depth-aware BEV feature transformation addresses the missing depth information in 3D lane detection and improves detection performance. Abstract: 3D lane detection plays an important role in autonomous driving. Recent advances primarily build Bird's-Eye-View (BEV) features from front-view (FV) images to perceive 3D lane information more effectively. However, constructing accurate BEV information from FV images is limited by the lack of depth information, so previous works often rely heavily on the assumption of a flat ground plane. Leveraging monocular depth estimation to assist in constructing BEV features is less constrained, but existing methods struggle to effectively integrate the two tasks. To address this issue, this paper proposes an accurate 3D lane detection method based on depth-aware BEV feature transformation. In detail, an effective feature extraction module is designed, in which a Depth Net is integrated to obtain the vital depth information for 3D perception, thereby simplifying the complexity of view transformation. Subsequently, a feature reduction module is proposed to reduce the height dimension of FV features and depth features, thereby enabling effective fusion of crucial FV features and depth features. Then a fusion module is designed to build BEV features from the prime FV features and depth information. The proposed method performs comparably with state-of-the-art methods on both the synthetic Apollo and the realistic OpenLane datasets.

[183] Event-Driven Dynamic Scene Depth Completion

Zhiqiang Yan,Jianhao Jiao,Zhengxue Wang,Gim Hee Lee

Main category: cs.CV

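TL;DR: EventDC is an event-driven depth completion framework that uses Event-Modulated Alignment and Local Depth Filtering modules to exploit the high temporal resolution of event cameras and improve depth completion in dynamic scenes.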

Details

Motivation: In dynamic scenes, rapid ego-motion and object motion degrade RGB images and LiDAR measurements, and conventional RGB-D sensors struggle to align precisely and capture reliable depth. The high temporal resolution of event cameras provides complementary cues. Method: Proposes the EventDC framework with Event-Modulated Alignment (EMA) and Local Depth Filtering (LDF) modules, which learn convolution offsets and weights conditioned on event streams to improve RGB-D feature alignment and depth estimation. Result: Experiments show that EventDC clearly outperforms conventional methods on depth completion in dynamic scenes, and the first event-based depth completion benchmark is established. Conclusion: EventDC uses the high temporal resolution of event cameras to address depth completion in dynamic scenes and provides a new direction and benchmark for future research. Abstract: Depth completion in dynamic scenes poses significant challenges due to rapid ego-motion and object motion, which can severely degrade the quality of input modalities such as RGB images and LiDAR measurements. Conventional RGB-D sensors often struggle to align precisely and capture reliable depth under such conditions. In contrast, event cameras with their high temporal resolution and sensitivity to motion at the pixel level provide complementary cues that are particularly beneficial in dynamic environments. To this end, we propose EventDC, the first event-driven depth completion framework. It consists of two key components: Event-Modulated Alignment (EMA) and Local Depth Filtering (LDF). Both modules adaptively learn the two fundamental components of convolution operations: offsets and weights conditioned on motion-sensitive event streams. In the encoder, EMA leverages events to modulate the sampling positions of RGB-D features to achieve pixel redistribution for improved alignment and fusion. In the decoder, LDF refines depth estimations around moving objects by learning motion-aware masks from events. Additionally, EventDC incorporates two loss terms to further benefit global alignment and enhance local depth recovery. Moreover, we establish the first benchmark for event-based depth completion comprising one real-world and two synthetic datasets to facilitate future research. Extensive experiments on this benchmark demonstrate the superiority of our EventDC.

[184] Computer Vision Models Show Human-Like Sensitivity to Geometric and Topological Concepts

Zekun Wang,Sashank Varma

Main category: cs.CV

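TL;DR: The paper examines how well computer vision models align with human sensitivity to geometric and topological (GT) concepts, finding that Transformer models perform best and match children's performance, while vision-language models perform worse.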

Details

Motivation: Investigates whether GT concepts are learned through everyday interaction with the environment rather than being innate core knowledge. Method: Tests three classes of models (CNNs, Transformers, and vision-language models) for performance and human alignment on an odd-one-out task covering 43 GT concepts. Result: Transformer models perform best, surpassing young children, and align closely with children's performance; vision-language models perform worse and deviate from human profiles. Conclusion: The results support a learning account of sensitivity to GT concepts, but multimodal integration may harm geometric sensitivity. Abstract: With the rapid improvement of machine learning (ML) models, cognitive scientists are increasingly asking about their alignment with how humans think. Here, we ask this question for computer vision models and human sensitivity to geometric and topological (GT) concepts. Under the core knowledge account, these concepts are innate and supported by dedicated neural circuitry. In this work, we investigate an alternative explanation, that GT concepts are learned "for free" through everyday interaction with the environment. We do so using computer vision models, which are trained on large image datasets. We build on prior studies to investigate the overall performance and human alignment of three classes of models -- convolutional neural networks (CNNs), transformer-based models, and vision-language models -- on an odd-one-out task testing 43 GT concepts spanning seven classes. Transformer-based models achieve the highest overall accuracy, surpassing that of young children. They also show strong alignment with children's performance, finding the same classes of concepts easy vs. difficult. By contrast, vision-language models underperform their vision-only counterparts and deviate further from human profiles, indicating that naive multimodality might compromise abstract geometric sensitivity. These findings support the use of computer vision models to evaluate the sufficiency of the learning account for explaining human sensitivity to GT concepts, while also suggesting that integrating linguistic and visual representations might have unpredicted deleterious consequences.

[185] DD-Ranking: Rethinking the Evaluation of Dataset Distillation

Zekai Li,Xinhao Zhong,Samir Khaki,Zhiyuan Liang,Yuhao Zhou,Mingjia Shi,Ziqiao Wang,Xuanlei Zhao,Wangbo Zhao,Ziheng Qin,Mengxuan Wu,Pengfei Zhou,Haonan Wang,David Junhao Zhang,Jia-Wei Liu,Shaobo Wang,Dai Liu,Linfeng Zhang,Guang Li,Kun Wang,Zheng Zhu,Zhiheng Ma,Joey Tianyi Zhou,Jiancheng Lv,Yaochu Jin,Peihao Wang,Kaipeng Zhang,Lingjuan Lyu,Yiran Huang,Zeynep Akata,Zhiwei Deng,Xindi Wu,George Cazenavette,Yuzhang Shang,Justin Cui,Jindong Gu,Qian Zheng,Hao Ye,Shuo Wang,Xiaobo Wang,Yan Yan,Angela Yao,Mike Zheng Shou,Tianlong Chen,Hakan Bilen,Baharan Mirzasoleiman,Manolis Kellis,Konstantinos N. Plataniotis,Zhangyang Wang,Bo Zhao,Yang You,Kai Wang

Main category: cs.CV

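TL;DR: The paper examines evaluation problems in dataset distillation and proposes DD-Ranking, a new evaluation framework that measures the performance gains of different methods more fairly.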

Details

Motivation: Performance gains of current dataset distillation methods may come from additional techniques rather than the quality of the images themselves, so existing evaluation criteria such as accuracy may be unfair. Method: Proposes the DD-Ranking framework and new evaluation metrics that focus on the actual information enhancement of distilled datasets. Result: Empirical findings show that the gains of existing methods can depend on additional techniques, and that even randomly sampled images can achieve superior results. Conclusion: DD-Ranking provides a more comprehensive and fair evaluation standard for future research. Abstract: In recent years, dataset distillation has provided a reliable solution for data compression, where models trained on the resulting smaller synthetic datasets achieve performance comparable to those trained on the original datasets. To further improve the performance of synthetic datasets, various training pipelines and optimization objectives have been proposed, greatly advancing the field of dataset distillation. Recent decoupled dataset distillation methods introduce soft labels and stronger data augmentation during the post-evaluation phase and scale dataset distillation up to larger datasets (e.g., ImageNet-1K). However, this raises a question: Is accuracy still a reliable metric to fairly evaluate dataset distillation methods? Our empirical findings suggest that the performance improvements of these methods often stem from additional techniques rather than the inherent quality of the images themselves, with even randomly sampled images achieving superior results. Such misaligned evaluation settings severely hinder the development of DD. Therefore, we propose DD-Ranking, a unified evaluation framework, along with new general evaluation metrics to uncover the true performance improvements achieved by different methods. By refocusing on the actual information enhancement of distilled datasets, DD-Ranking provides a more comprehensive and fair evaluation standard for future research advancements.

Chengsong Sun,Weiping Li,Xiang Li,Yuankun Liu,Lianlei Shan

Main category: cs.CV

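TL;DR: The paper proposes GCRDP, a new method that addresses the multi-peak distribution problem in few-shot cross-modal retrieval, using a Gaussian Mixture Model and a multi-positive sample contrastive learning mechanism to improve retrieval accuracy.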

Details

Motivation: Traditional cross-modal retrieval assumes that training and test data share the same class distribution, whereas few-shot retrieval involves sparsely represented data; existing methods struggle to model the multi-peak distribution, which biases the latent semantic space. Method: Proposes GCRDP, which captures the multi-peak data distribution with a Gaussian Mixture Model, adds a multi-positive sample contrastive learning mechanism, and introduces a cross-modal semantic alignment strategy to refine feature representations. Result: Experiments on four benchmark datasets show that GCRDP outperforms six existing methods. Conclusion: By modeling the multi-peak distribution and improving semantic alignment, GCRDP substantially improves the accuracy of few-shot cross-modal retrieval. Abstract: Few-shot cross-modal retrieval focuses on learning cross-modal representations with limited training samples, enabling the model to handle unseen classes during inference. Unlike traditional cross-modal retrieval tasks, which assume that both training and testing data share the same class distribution, few-shot retrieval involves data with sparse representations across modalities. Existing methods often fail to adequately model the multi-peak distribution of few-shot cross-modal data, resulting in two main biases in the latent semantic space: intra-modal bias, where sparse samples fail to capture intra-class diversity, and inter-modal bias, where misalignments between image and text distributions exacerbate the semantic gap. These biases hinder retrieval accuracy. To address these issues, we propose a novel method, GCRDP, for few-shot cross-modal retrieval. This approach effectively captures the complex multi-peak distribution of data using a Gaussian Mixture Model (GMM) and incorporates a multi-positive sample contrastive learning mechanism for comprehensive feature modeling. Additionally, we introduce a new strategy for cross-modal semantic alignment, which constrains the relative distances between image and text feature distributions, thereby improving the accuracy of cross-modal representations. We validate our approach through extensive experiments on four benchmark datasets, demonstrating superior performance over six state-of-the-art methods.

[187] eStonefish-scenes: A synthetically generated dataset for underwater event-based optical flow prediction tasks

Jad Mansour,Sebastian Realpe,Hayat Rajani,Michele Grimaldi,Rafael Garcia,Nuno Gracias

Main category: cs.CV

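TL;DR: The paper introduces eStonefish-scenes, a synthetic event-based optical flow dataset built on the Stonefish simulator that fills a gap in underwater event vision datasets, along with the data processing library eWiz.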

Details

Motivation: Existing event vision datasets lack diversity and scalability, and labelled datasets for underwater applications are especially scarce, which hinders combining event-based vision with autonomous underwater vehicles. Method: Generates the synthetic dataset eStonefish-scenes with the Stonefish simulator, including dynamic scene simulation and coral reef terrain generation, and develops the data processing library eWiz. Result: Delivers a customizable underwater dataset pipeline and data processing tools that support dynamic scene simulation and reef terrain generation. Conclusion: The synthetic dataset and tools fill the gap in underwater event vision data and support combining event-based vision with autonomous underwater vehicles. Abstract: The combined use of event-based vision and Spiking Neural Networks (SNNs) is expected to significantly impact robotics, particularly in tasks like visual odometry and obstacle avoidance. While existing real-world event-based datasets for optical flow prediction, typically captured with Unmanned Aerial Vehicles (UAVs), offer valuable insights, they are limited in diversity and scalability and are challenging to collect. Moreover, there is a notable lack of labelled datasets for underwater applications, which hinders the integration of event-based vision with Autonomous Underwater Vehicles (AUVs). To address this, synthetic datasets could provide a scalable solution while bridging the gap between simulation and reality. In this work, we introduce eStonefish-scenes, a synthetic event-based optical flow dataset based on the Stonefish simulator. Along with the dataset, we present a data generation pipeline that enables the creation of customizable underwater environments. This pipeline allows for simulating dynamic scenarios, such as biologically inspired schools of fish exhibiting realistic motion patterns, including obstacle avoidance and reactive navigation around corals. Additionally, we introduce a scene generator that can build realistic reef seabeds by randomly distributing coral across the terrain. To streamline data accessibility, we present eWiz, a comprehensive library designed for processing event-based data, offering tools for data loading, augmentation, visualization, encoding, and training data generation, along with loss functions and performance metrics.

[188] Denoising Diffusion Probabilistic Model for Point Cloud Compression at Low Bit-Rates

Gabriele Spadaro,Alberto Presta,Jhony H. Giraldo,Marco Grangetto,Wei Hu,Giuseppe Valenzise,Attilio Fiandrotti,Enzo Tartaglione

Main category: cs.CV

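TL;DR: Proposes DDPM-PCC, a DDPM-based method for low-bit-rate point cloud compression that uses a PointNet encoder and a learnable vector quantizer; experiments show it outperforms existing methods at low bit-rates.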

Details

Motivation: Existing techniques focus on high-fidelity reconstruction and need many bits, while low-bit-rate point cloud compression is critical for bandwidth-constrained applications. Method: Uses a DDPM architecture in which a PointNet encoder produces a condition vector that is quantized by a learnable vector quantizer, enabling low-bit-rate compression. Result: Experiments on ShapeNet and ModelNet40 show that DDPM-PCC achieves better rate-distortion performance at low bit-rates than standardized and state-of-the-art approaches. Conclusion: DDPM-PCC is an effective low-bit-rate point cloud compression method, and the code is public. Abstract: Efficient compression of low-bit-rate point clouds is critical for bandwidth-constrained applications. However, existing techniques mainly focus on high-fidelity reconstruction, requiring many bits for compression. This paper proposes a "Denoising Diffusion Probabilistic Model" (DDPM) architecture for point cloud compression (DDPM-PCC) at low bit-rates. A PointNet encoder produces the condition vector for the generation, which is then quantized via a learnable vector quantizer. This configuration allows low bitrates to be achieved while preserving quality. Experiments on ShapeNet and ModelNet40 show improved rate-distortion at low rates compared to standardized and state-of-the-art approaches. We publicly released the code at https://github.com/EIDOSLAB/DDPM-PCC.
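
The quantization step can be illustrated with a plain nearest-codeword lookup. This is a sketch of generic vector quantization, not the DDPM-PCC implementation: the codebook values below are made up, in the paper the codebook is learned, and the names `quantize` and `bits_per_index` are hypothetical.

```python
import math


def quantize(vector: list[float], codebook: list[list[float]]):
    """Return (index, codeword) of the codebook entry nearest to `vector`."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)),
                key=lambda i: squared_distance(vector, codebook[i]))
    return index, codebook[index]


def bits_per_index(codebook_size: int) -> int:
    """Bits needed to transmit one codebook index with a fixed-length code."""
    return math.ceil(math.log2(codebook_size))


# Toy 2-D codebook with four entries (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [-2.0, 3.0]]
index, codeword = quantize([0.9, 1.2], codebook)  # nearest entry is [1.0, 1.0]
```

Only the index needs to be sent to the decoder, so the bit cost depends on the codebook size rather than on the dimensionality of the condition vector, which is why this kind of quantizer suits low bit-rates.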

[189] VesselGPT: Autoregressive Modeling of Vascular Geometry

Paula Feldman,Martin Sinnona,Viviana Siless,Claudio Delrieux,Emmanuel Iarussi

Main category: cs.CV

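TL;DR: Proposes an autoregressive technique for synthesizing anatomical trees that uses a VQ-VAE and a GPT-2 model to achieve high-fidelity vascular tree reconstruction.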

Details

Motivation: The complex geometry of anatomical trees makes them hard to represent accurately, and recent advances in large language models suggest a new approach. Method: Embeds vessel structures into a discrete vocabulary with a VQ-VAE, generates vascular trees autoregressively with a GPT-2 model, and represents vessel cross-sections with B-splines. Result: The method captures complex geometries and branching patterns, achieves high-fidelity reconstruction, and preserves morphological details that earlier methods overlook. Conclusion: This is the first work to generate vascular trees autoregressively; code and data will be released. Abstract: Anatomical trees are critical for clinical diagnosis and treatment planning, yet their complex and diverse geometry makes accurate representation a significant challenge. Motivated by the latest advances in large language models, we introduce an autoregressive method for synthesizing anatomical trees. Our approach first embeds vessel structures into a learned discrete vocabulary using a VQ-VAE architecture, then models their generation autoregressively with a GPT-2 model. This method effectively captures intricate geometries and branching patterns, enabling realistic vascular tree synthesis. Comprehensive qualitative and quantitative evaluations reveal that our technique achieves high-fidelity tree reconstruction with compact discrete representations. Moreover, our B-spline representation of vessel cross-sections preserves critical morphological details that are often overlooked in the parameterizations of previous methods. To the best of our knowledge, this work is the first to generate blood vessels in an autoregressive manner. Code, data, and trained models will be made available.

[190] Benchmarking Unified Face Attack Detection via Hierarchical Prompt Tuning

Ajian Liu,Haocheng Yuan,Xiao Guo,Hui Ma,Wanyi Zhuang,Changtao Miao,Yan Hong,Chuanbiao Song,Jun Lan,Qi Chu,Tao Gong,Yanyan Liang,Weiqiang Wang,Jun Wan,Xiaoming Liu,Zhen Lei

Main category: cs.CV

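TL;DR: The paper proposes the UniAttackData+ dataset and the HiPTune framework to address the data and method shortcomings of existing unified face attack detection models.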

Details

Motivation: Existing models are trained separately, which leaves them vulnerable to unknown attacks and burdensome to deploy, and there is no unified model that handles both physical and digital attacks. Method: Proposes the UniAttackData+ dataset and the HiPTune framework, which improves detection through a Visual Prompt Tree and classification criteria drawn from multiple semantic spaces. Result: Experiments on 12 datasets validate the method and show its potential for unified face attack detection. Conclusion: UniAttackData+ and HiPTune offer a new approach to unified face attack detection and encourage further innovation in the field. Abstract: Presentation Attack Detection and Face Forgery Detection are designed to protect face data from physical media-based Presentation Attacks and digital editing-based DeepFakes, respectively. But separate training of these two models makes them vulnerable to unknown attacks and burdens deployment environments. The lack of a Unified Face Attack Detection (UAD) model to handle both types of attacks is mainly due to two factors. First, there is a lack of adequate benchmarks for models to explore. Existing UAD datasets have limited attack types and samples, restricting the model's ability to address advanced threats. To address this, we propose UniAttackDataPlus (UniAttackData+), the most extensive and sophisticated collection of forgery techniques to date. It includes 2,875 identities and their 54 kinds of falsified samples, totaling 697,347 videos. Second, there is a lack of a reliable classification criterion. Current methods try to find an arbitrary criterion within the same semantic space, which fails when encountering diverse attacks. So, we present a novel Visual-Language Model-based Hierarchical Prompt Tuning Framework (HiPTune) that adaptively explores multiple classification criteria from different semantic spaces. We build a Visual Prompt Tree to explore various classification rules hierarchically. Then, by adaptively pruning the prompts, the model can select the most suitable prompts to guide the encoder to extract discriminative features at different levels in a coarse-to-fine way. Finally, to help the model understand the classification criteria in visual space, we propose a Dynamically Prompt Integration module to project the visual prompts to the text encoder for more accurate semantics. Experiments on 12 datasets have shown the potential to inspire further innovations in the UAD field.

[191] RoPECraft: Training-Free Motion Transfer with Trajectory-Guided RoPE Optimization on Diffusion Transformers

Ahmet Berke Gokmen,Yigit Ekin,Bahri Batuhan Bilecen,Aysegul Dundar

Main category: cs.CV

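TL;DR: RoPECraft is a training-free video motion transfer method that works by modifying the rotary positional embeddings (RoPE) of diffusion transformers; it encodes motion from optical flow and optimizes the generation process with a flow-matching objective.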

Details

Motivation: To achieve efficient motion transfer by directly manipulating RoPE embeddings, avoiding the complexity of training-based methods. Method: Extracts optical flow from a reference video, uses the resulting motion offsets to warp RoPE's complex-exponential tensors, and optimizes trajectory alignment with a flow-matching objective. Result: On benchmarks, RoPECraft outperforms existing methods both qualitatively and quantitatively. Conclusion: RoPECraft offers an efficient, training-free motion transfer solution with strong performance. Abstract: We propose RoPECraft, a training-free video motion transfer method for diffusion transformers that operates solely by modifying their rotary positional embeddings (RoPE). We first extract dense optical flow from a reference video, and utilize the resulting motion offsets to warp the complex-exponential tensors of RoPE, effectively encoding motion into the generation process. These embeddings are then further optimized during denoising time steps via trajectory alignment between the predicted and target velocities using a flow-matching objective. To keep the output faithful to the text prompt and prevent duplicate generations, we incorporate a regularization term based on the phase components of the reference video's Fourier transform, projecting the phase angles onto a smooth manifold to suppress high-frequency artifacts. Experiments on benchmarks reveal that RoPECraft outperforms all recently published methods, both qualitatively and quantitatively.
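To make the idea of warping RoPE's complex exponentials concrete, here is a minimal one-dimensional sketch. It assumes the commonly used RoPE frequency schedule (base 10000) and a scalar position offset; the paper applies flow-derived offsets inside a video diffusion transformer, which is not reproduced here. The function names `rope_phases` and `warp_phases` are hypothetical.

```python
import cmath


def rope_frequencies(dim=8, base=10000.0):
    """One rotation frequency per pair of channels (assumed standard schedule)."""
    half = dim // 2
    return [base ** (-k / half) for k in range(half)]


def rope_phases(position, dim=8, base=10000.0):
    """Complex exponentials exp(i * position * theta_k) for a single position."""
    return [cmath.exp(1j * position * theta)
            for theta in rope_frequencies(dim, base)]


def warp_phases(phases, offset, dim=8, base=10000.0):
    """Shift existing phases by `offset` positions via complex multiplication."""
    return [phase * cmath.exp(1j * offset * theta)
            for phase, theta in zip(phases, rope_frequencies(dim, base))]


# Warping the phases of position 3.0 by a (hypothetical) motion offset of 1.5
# gives the same phases as encoding position 4.5 directly.
warped = warp_phases(rope_phases(3.0), 1.5)
direct = rope_phases(4.5)
max_error = max(abs(a - b) for a, b in zip(warped, direct))
print(max_error)
```

Because a position shift is just a multiplication by another unit-magnitude complex number, motion offsets can be injected without retraining any weights, which is consistent with the training-free framing of the abstract.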

[192] Faster Video Diffusion with Trainable Sparse Attention

Peiyuan Zhang,Haofeng Huang,Yongqi Chen,Will Lin,Zhengzhong Liu,Ion Stoica,Eric P. Xing,Hao Zhang

Main category: cs.CV

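TL;DR: VSA is a trainable sparse attention mechanism that uses a two-stage, coarse-to-fine process to cut computation substantially while preserving model performance, making video diffusion models markedly more efficient.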

Details

Motivation: The 3D attention in video diffusion transformers (DiTs) is computationally expensive, which limits scaling. Because attention mass concentrates on a small number of positions, VSA is proposed to improve computational efficiency. Method: VSA has a coarse stage (pooling tokens into tiles and identifying critical tokens) and a fine stage (computing token-level attention only inside the critical tiles), which together form a single differentiable kernel. Result: VSA cuts training FLOPS by 2.53×, speeds up attention time by 6×, and lowers generation time from 31 s to 18 s, with no drop in diffusion loss and comparable quality. Conclusion: Trainable sparse attention is a practical alternative to full attention and a key enabler for further scaling of video diffusion models. Abstract: Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at both training and inference. In VSA, a lightweight coarse stage pools tokens into tiles and identifies high-weight critical tokens; a fine stage computes token-level attention only inside those tiles, subject to a block computing layout to ensure hardware efficiency. This leads to a single differentiable kernel that trains end-to-end, requires no post-hoc profiling, and sustains 85% of FlashAttention3 MFU. We perform a large sweep of ablation studies and scaling-law experiments by pretraining DiTs from 60M to 1.4B parameters. VSA reaches a Pareto point that cuts training FLOPS by 2.53× with no drop in diffusion loss. Retrofitting the open-source Wan-2.1 model speeds up attention time by 6× and lowers end-to-end generation time from 31s to 18s with comparable quality. These results establish trainable sparse attention as a practical alternative to full attention and a key enabler for further scaling of video diffusion models.
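The coarse-then-fine structure described in this abstract can be sketched in a few lines of NumPy. This is not the authors' fused GPU kernel: it assumes a 1D token sequence, mean pooling for the coarse stage, and a fixed `top_k` of key tiles per query tile, all of which are illustrative choices. The function name `tile_sparse_attention` is hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def tile_sparse_attention(q, k, v, tile=4, top_k=2):
    """Two-stage sparse attention over a 1D sequence (illustrative sketch)."""
    n, d = q.shape
    assert n % tile == 0, "sequence length must be a multiple of the tile size"
    n_tiles = n // tile

    # Coarse stage: mean-pool tokens into tiles and score every tile pair.
    q_tiles = q.reshape(n_tiles, tile, d).mean(axis=1)
    k_tiles = k.reshape(n_tiles, tile, d).mean(axis=1)
    coarse = q_tiles @ k_tiles.T / np.sqrt(d)

    # Keep the top_k highest-scoring key tiles for each query tile.
    keep = np.argsort(-coarse, axis=1)[:, :top_k]

    # Fine stage: token-level attention restricted to the selected tiles.
    out = np.zeros((n, v.shape[1]))
    for qt in range(n_tiles):
        rows = slice(qt * tile, (qt + 1) * tile)
        cols = np.concatenate(
            [np.arange(kt * tile, (kt + 1) * tile) for kt in keep[qt]]
        )
        scores = q[rows] @ k[cols].T / np.sqrt(d)
        out[rows] = softmax(scores) @ v[cols]
    return out


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
sparse_out = tile_sparse_attention(q, k, v, tile=4, top_k=2)
dense_out = softmax(q @ k.T / np.sqrt(8)) @ v
```

With `top_k` equal to the number of tiles, the sketch reduces to ordinary full attention, which is a convenient sanity check; smaller `top_k` values trade accuracy for fewer token-level score computations.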

[193] FEALLM: Advancing Facial Emotion Analysis in Multimodal Large Language Models with Emotional Synergy and Reasoning

Zhuozhao Hu,Kaishen Yuan,Xin Liu,Zitong Yu,Yuan Zong,Jingang Shi,Huanjing Yue,Jingyu Yang

Main category: cs.CV

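TL;DR: The paper proposes a new FEA instruction dataset and the FEALLM model, addressing the limitations of traditional facial emotion analysis methods and showing strong performance on multiple datasets.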

Details

Motivation: Traditional facial emotion analysis methods fall short in interpretability, generalization, and reasoning, and multimodal large language models (MLLMs) perform poorly on FEA tasks because they lack specialized datasets and struggle to capture the complex relationships between facial expressions and action units (AUs). Method: Builds an FEA instruction dataset and the FEABench benchmark, and designs the FEALLM model to capture more detailed facial information. Result: FEALLM performs strongly on FEABench and shows zero-shot generalization on RAF-DB, AffectNet, BP4D, and DISFA. Conclusion: With its new dataset and architecture, FEALLM improves facial emotion analysis performance and has broad application potential. Abstract: Facial Emotion Analysis (FEA) plays a crucial role in visual affective computing, aiming to infer a person's emotional state based on facial data. Scientifically, facial expressions (FEs) result from the coordinated movement of facial muscles, which can be decomposed into specific action units (AUs) that provide detailed emotional insights. However, traditional methods often struggle with limited interpretability and constrained generalization and reasoning abilities. Recently, Multimodal Large Language Models (MLLMs) have shown exceptional performance in various visual tasks, but they still face significant challenges in FEA due to the lack of specialized datasets and their inability to capture the intricate relationships between FEs and AUs. To address these issues, we introduce a novel FEA Instruction Dataset that provides accurate and aligned FE and AU descriptions and establishes causal reasoning relationships between them, followed by constructing a new benchmark, FEABench. Moreover, we propose FEALLM, a novel MLLM architecture designed to capture more detailed facial information, enhancing its capability in FEA tasks. Our model demonstrates strong performance on FEABench and impressive generalization capability through zero-shot evaluation on various datasets, including RAF-DB, AffectNet, BP4D, and DISFA, showcasing its robustness and effectiveness in FEA tasks. The dataset and code will be available at https://github.com/953206211/FEALLM.

[194] G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning

Liang Chen,Hongcheng Gao,Tianyu Liu,Zhiqi Huang,Flood Sung,Xinyu Zhou,Yuxin Wu,Baobao Chang

Main category: cs.CV

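TL;DR: VLM-Gym is a reinforcement learning environment designed for parallel multi-game training, built to address the weak decision-making of vision-language models (VLMs) in interactive environments. G0 models are trained through pure RL-driven self-evolution, and the perception-enhanced G1 models improve performance substantially further.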

Details

Motivation: Vision-language models excel at direct multimodal tasks but make poor decisions in interactive visual environments such as games, which limits their potential as autonomous agents. Method: Introduces the VLM-Gym environment, trains G0 models with reinforcement learning, and develops perception-enhanced G1 models that combine a cold start with RL fine-tuning. Result: G1 models surpass their teacher across all games and outperform leading proprietary models such as Claude-3.7-Sonnet-Thinking; perception and reasoning abilities reinforce each other during training. Conclusion: VLM-Gym and the G1 models offer an effective way to improve VLMs in interactive environments, and the code is open-sourced to support future research. Abstract: Vision-Language Models (VLMs) excel in many direct multimodal tasks but struggle to translate this prowess into effective decision-making within interactive, visually rich environments like games. This "knowing-doing" gap significantly limits their potential as autonomous agents, as leading VLMs often perform badly in simple games. To address this, we introduce VLM-Gym, a curated reinforcement learning (RL) environment featuring diverse visual games with unified interfaces and adjustable, compositional difficulty, specifically designed for scalable multi-game parallel training. Leveraging VLM-Gym, we train G0 models using pure RL-driven self-evolution, which demonstrate emergent perception and reasoning patterns. To further mitigate challenges arising from game diversity, we develop G1 models. G1 incorporates a perception-enhanced cold start prior to RL fine-tuning. Our resulting G1 models consistently surpass their teacher across all games and outperform leading proprietary models like Claude-3.7-Sonnet-Thinking. Systematic analysis reveals an intriguing finding: perception and reasoning abilities mutually bootstrap each other throughout the RL training process. Source code including VLM-Gym and RL training is released at https://github.com/chenllliang/G1 to foster future research in advancing VLMs as capable interactive agents.

[195] Understanding Complexity in VideoQA via Visual Program Generation

Cristobal Eyzaguirre,Igor Vasiljevic,Achal Dave,Jiajun Wu,Rares Andrei Ambrus,Thomas Kollar,Juan Carlos Niebles,Pavel Tokmakov

Main category: cs.CV

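TL;DR: Proposes a data-driven approach to analyzing query complexity in video question answering that uses code generation to estimate question difficulty automatically, outperforms human predictions, and yields a harder benchmark.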

Details

Motivation: Existing benchmarks rely on humans to design hard questions, but experiments show that humans struggle to predict which questions models find difficult. Method: Uses code generation, treating the complexity of the generated code as a proxy for question difficulty, and proposes an algorithm that estimates question complexity from code. Result: The measure correlates significantly better with model performance than human estimates and can be used to generate more complex questions automatically. Conclusion: The method scales well, and the new benchmark it constructs is 1.9 times harder than NExT-QA, demonstrating its practical value. Abstract: We propose a data-driven approach to analyzing query complexity in Video Question Answering (VideoQA). Previous efforts in benchmark design have relied on human expertise to design challenging questions, yet we experimentally show that humans struggle to predict which questions are difficult for machine learning models. Our automatic approach leverages recent advances in code generation for visual question answering, using the complexity of generated code as a proxy for question difficulty. We demonstrate that this measure correlates significantly better with model performance than human estimates. To operationalize this insight, we propose an algorithm for estimating question complexity from code. It identifies fine-grained primitives that correlate with the hardest questions for any given set of models, making it easy to scale to new approaches in the future. Finally, to further illustrate the utility of our method, we extend it to automatically generate complex questions, constructing a new benchmark that is 1.9 times harder than the popular NExT-QA.
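As a rough illustration of using code complexity as a difficulty proxy, the sketch below counts structural features of a generated program with Python's standard `ast` module. The statistics chosen here (node, call, and branch counts) are an assumption made for illustration, not the paper's algorithm, and the visual primitives `query`, `find`, `trim`, and `exists` in the sample programs are hypothetical names that are parsed but never executed.

```python
import ast


def program_complexity(source: str) -> dict:
    """Return simple structural statistics for a generated program."""
    tree = ast.parse(source)
    nodes = list(ast.walk(tree))
    calls = [n for n in nodes if isinstance(n, ast.Call)]
    branches = [n for n in nodes if isinstance(n, (ast.If, ast.For, ast.While))]
    return {"nodes": len(nodes), "calls": len(calls), "branches": len(branches)}


# Hypothetical program for a simple question: one call, no control flow.
easy = "answer = query(video, 'what color is the car?')"

# Hypothetical program for a temporally grounded question: several calls
# plus a loop and a conditional.
hard = """
clips = find(video, 'person opens door')
answer = 'unknown'
for clip in clips:
    after = trim(video, start=clip.end)
    if exists(after, 'dog'):
        answer = query(after, 'what does the dog do?')
"""

print(program_complexity(easy))
print(program_complexity(hard))
```

A question whose program needs more calls and control flow is scored as harder under this proxy; the abstract's stronger claim is that such a measure tracks model performance better than human difficulty estimates.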

[196] KinTwin: Imitation Learning with Torque and Muscle Driven Biomechanical Models Enables Precise Replication of Able-Bodied and Impaired Movement from Markerless Motion Capture

R. James Cotton

Main category: cs.CV

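TL;DR: The paper explores applying imitation learning to biomechanical models, learning to compute inverse dynamics from a large dataset of movements from able-bodied and impaired individuals, in order to improve the quality of movement analysis and rehabilitation.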

Details

Motivation: To improve the quality of movement analysis and rehabilitation by characterizing movement impairments and responses to interventions in more detail, and even detecting new neurological conditions or fall risk early. Method: Applies imitation learning to a biomechanical model, tests it on a dataset that includes participants with impaired movement, and reports detailed tracking metrics relevant to clinical movement measurement. Result: The imitation learning policy, KinTwin, accurately replicates the kinematics of a wide range of movements and infers clinically meaningful differences in joint torques and muscle activations. Conclusion: Imitation learning has the potential to enable high-quality movement analysis in clinical practice. Abstract: Broader access to high-quality movement analysis could greatly benefit movement science and rehabilitation, such as allowing more detailed characterization of movement impairments and responses to interventions, or even enabling early detection of new neurological conditions or fall risk. While emerging technologies are making it easier to capture kinematics with biomechanical models, or how joint angles change over time, inferring the underlying physics that give rise to these movements, including ground reaction forces, joint torques, or even muscle activations, is still challenging. Here we explore whether imitation learning applied to a biomechanical model from a large dataset of movements from able-bodied and impaired individuals can learn to compute these inverse dynamics. Although imitation learning in human pose estimation has seen great interest in recent years, our work differs in several ways: we focus on using an accurate biomechanical model instead of models adopted for computer vision, we test it on a dataset that contains participants with impaired movements, we report detailed tracking metrics relevant for the clinical measurement of movement including joint angles and ground contact events, and finally we apply imitation learning to a muscle-driven neuromusculoskeletal model. We show that our imitation learning policy, KinTwin, can accurately replicate the kinematics of a wide range of movements, including those with assistive devices or therapist assistance, and that it can infer clinically meaningful differences in joint torques and muscle activations. Our work demonstrates the potential for using imitation learning to enable high-quality movement analysis in clinical practice.

[197] FinePhys: Fine-grained Human Action Generation by Explicitly Incorporating Physical Laws for Effective Skeletal Guidance

Dian Shao,Mingfei Shi,Shengda Xu,Haodong Chen,Yongle Huang,Binglu Wang

Main category: cs.CV

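TL;DR: FinePhys is a fine-grained human action generation framework that incorporates physical modeling; it combines 2D pose estimation, 2D-to-3D lifting, and physics-based motion re-estimation to markedly improve the realism of generated actions.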

Details

Motivation: Current video generation methods model fine-grained semantics and complex temporal dynamics poorly, and perform especially badly on difficult actions such as gymnastics routines. Method: FinePhys first estimates 2D poses online, lifts them to 3D via in-context learning, then applies a physics-based motion re-estimation module that computes joint accelerations through bidirectional temporal updating, and finally fuses the data-driven and physically predicted 3D poses. Result: On three FineGym subsets, FinePhys significantly outperforms baselines and generates more natural and plausible fine-grained actions. Conclusion: By combining physical modeling with data-driven methods, FinePhys addresses the challenges of fine-grained human action generation. Abstract: Despite significant advances in video generation, synthesizing physically plausible human actions remains a persistent challenge, particularly in modeling fine-grained semantics and complex temporal dynamics. For instance, generating gymnastics routines such as "switch leap with 0.5 turn" poses substantial difficulties for current methods, often yielding unsatisfactory results. To bridge this gap, we propose FinePhys, a Fine-grained human action generation framework that incorporates Physics to obtain effective skeletal guidance. Specifically, FinePhys first estimates 2D poses in an online manner and then performs 2D-to-3D dimension lifting via in-context learning. To mitigate the instability and limited interpretability of purely data-driven 3D poses, we further introduce a physics-based motion re-estimation module governed by Euler-Lagrange equations, calculating joint accelerations via bidirectional temporal updating. The physically predicted 3D poses are then fused with data-driven ones, offering multi-scale 2D heatmap guidance for the diffusion process. Evaluated on three fine-grained action subsets from FineGym (FX-JUMP, FX-TURN, and FX-SALTO), FinePhys significantly outperforms competitive baselines. Comprehensive qualitative results further demonstrate FinePhys's ability to generate more natural and plausible fine-grained human actions.

[198] VTBench: Evaluating Visual Tokenizers for Autoregressive Image Generation

Huawei Lin,Tong Geng,Zhaozhuo Xu,Weijie Zhao

Main category: cs.CV

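TL;DR: The paper presents VTBench, a benchmark for evaluating visual tokenizers (VTs), finds that continuous variational autoencoders (VAEs) outperform discrete VTs, and discusses the potential autoregressive nature of GPT-4o.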

Details

Motivation: Current discrete VTs perform poorly at image reconstruction and detail preservation, yet no benchmark evaluates VT performance in isolation. Method: Introduces VTBench, which systematically evaluates VTs on three core tasks (image reconstruction, detail preservation, and text preservation) using multiple metrics. Result: Continuous VAEs produce better visual representations than discrete VTs, whose reconstructions are prone to distortion and loss of detail. Conclusion: Calls for stronger open-source VTs and releases the benchmark and code to support research. Abstract: Autoregressive (AR) models have recently shown strong performance in image generation, where a critical component is the visual tokenizer (VT) that maps continuous pixel inputs to discrete token sequences. The quality of the VT largely defines the upper bound of AR model performance. However, current discrete VTs fall significantly behind continuous variational autoencoders (VAEs), leading to degraded image reconstructions and poor preservation of details and text. Existing benchmarks focus on end-to-end generation quality, without isolating VT performance. To address this gap, we introduce VTBench, a comprehensive benchmark that systematically evaluates VTs across three core tasks: Image Reconstruction, Detail Preservation, and Text Preservation, and covers a diverse range of evaluation scenarios. We systematically assess state-of-the-art VTs using a set of metrics to evaluate the quality of reconstructed images. Our findings reveal that continuous VAEs produce superior visual representations compared to discrete VTs, particularly in retaining spatial structure and semantic detail. In contrast, the degraded representations produced by discrete VTs often lead to distorted reconstructions, loss of fine-grained textures, and failures in preserving text and object integrity. Furthermore, we conduct experiments on GPT-4o image generation and discuss its potential AR nature, offering new insights into the role of visual tokenization. We release our benchmark and codebase publicly to support further research and call on the community to develop strong, general-purpose open-source VTs.

[199] Recollection from Pensieve: Novel View Synthesis via Learning from Uncalibrated Videos

Ruoyu Wang,Yi Ma,Shenghua Gao

Main category: cs.CV

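TL;DR: Proposes a two-stage strategy that trains a view synthesis model from only raw video frames or multi-view images, with no camera parameters or other priors.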

Details

Motivation: Existing methods depend on camera calibration or geometric priors, which limits their use on large-scale uncalibrated data. Method: Stage 1 reconstructs the scene in an implicit latent space; Stage 2 narrows the gap to the real 3D world by explicitly predicting 3D Gaussian primitives. Result: Experiments show that the method outperforms approaches that rely on calibration or depth information in both view synthesis and camera pose estimation. Conclusion: The two stages are complementary and together yield high-quality view synthesis and accurate camera pose estimation. Abstract: Currently, almost all state-of-the-art novel view synthesis and reconstruction models rely on calibrated cameras or additional geometric priors for training. These prerequisites significantly limit their applicability to massive uncalibrated data. To alleviate this requirement and unlock the potential for self-supervised training on large-scale uncalibrated videos, we propose a novel two-stage strategy to train a view synthesis model from only raw video frames or multi-view images, without providing camera parameters or other priors. In the first stage, we learn to reconstruct the scene implicitly in a latent space without relying on any explicit 3D representation. Specifically, we predict per-frame latent camera and scene context features, and employ a view synthesis model as a proxy for explicit rendering. This pretraining stage substantially reduces the optimization complexity and encourages the network to learn the underlying 3D consistency in a self-supervised manner. The learned latent camera and implicit scene representation have a large gap compared with the real 3D world. To reduce this gap, we introduce the second stage training by explicitly predicting 3D Gaussian primitives. We additionally apply explicit Gaussian Splatting rendering loss and depth projection loss to align the learned latent representations with physically grounded 3D geometry. In this way, Stage 1 provides a strong initialization and Stage 2 enforces 3D consistency; the two stages are complementary and mutually beneficial. Extensive experiments demonstrate the effectiveness of our approach, achieving high-quality novel view synthesis and accurate camera pose estimation, compared to methods that employ supervision with calibration, pose, or depth information. The code is available at https://github.com/Dwawayu/Pensieve.

cs.GR

[200] Neural Importance Sampling of Many Lights

Pedro Figueiredo,Qihao He,Steve Bako,Nima Khademi Kalantari

Main category: cs.GR

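TL;DR: Proposes a neural-network method for estimating spatially varying light selection distributions to improve importance sampling in Monte Carlo rendering, aimed at complex scenes with many lights.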

Details

Motivation: Importance sampling of many lights in complex scenes is inefficient, and traditional methods struggle to handle it. Method: A neural network predicts the light selection distribution at each shading point, combined with light hierarchy techniques and a residual learning strategy. Result: The method performs strongly across diverse and challenging scenes. Conclusion: The method substantially improves rendering efficiency and performance. Abstract: We propose a neural approach for estimating spatially varying light selection distributions to improve importance sampling in Monte Carlo rendering, particularly for complex scenes with many light sources. Our method uses a neural network to predict the light selection distribution at each shading point based on local information, trained by minimizing the KL-divergence between the learned and target distributions in an online manner. To efficiently manage hundreds or thousands of lights, we integrate our neural approach with light hierarchy techniques, where the network predicts cluster-level distributions and existing methods sample lights within clusters. Additionally, we introduce a residual learning strategy that leverages initial distributions from existing techniques, accelerating convergence during training. Our method achieves superior performance across diverse and challenging scenes.
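To show why the quality of a light selection distribution matters, the following sketch computes the KL divergence between a target and a learned distribution and evaluates a one-sample Monte Carlo estimator of total direct lighting. The per-light contributions are made-up numbers, no neural network is involved, and the function names are hypothetical; it only illustrates the sampling principle the abstract builds on.

```python
import math
import random


def kl_divergence(target, learned, eps=1e-12):
    """KL(target || learned) for two discrete distributions over lights."""
    return sum(t * math.log(t / max(q, eps))
               for t, q in zip(target, learned) if t > 0)


def estimate_direct_light(contributions, selection_probs, rng):
    """One-sample estimator: pick light i with probability p_i, return c_i / p_i."""
    i = rng.choices(range(len(contributions)), weights=selection_probs)[0]
    return contributions[i] / selection_probs[i]


# Made-up unshadowed contributions of three lights at one shading point.
contributions = [1.0, 3.0, 6.0]
total = sum(contributions)

ideal = [c / total for c in contributions]  # proportional to contribution
uniform = [1.0 / 3.0] * 3                   # ignores contribution

rng = random.Random(0)
ideal_samples = [estimate_direct_light(contributions, ideal, rng) for _ in range(5)]
uniform_samples = [estimate_direct_light(contributions, uniform, rng) for _ in range(5)]

print(kl_divergence(ideal, ideal), kl_divergence(ideal, uniform))
print(ideal_samples)
print(uniform_samples)
```

When the selection probabilities are proportional to the contributions, every sample equals the true total and the estimator has zero variance; driving the KL divergence toward zero pushes a learned distribution toward that ideal.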

[201] Generating Digital Models Using Text-to-3D and Image-to-3D Prompts: Critical Case Study

Rushan Ziatdinov,Rifkat Nabiyev

Main category: cs.GR

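TL;DR: The paper examines AI tools for automated 3D model generation, analyzing several online 3D model generators with the aim of obtaining higher-quality results from different prompts.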

Details

Motivation: With the growth of AI tools, automated 3D model generation has become possible and can save designers, artists, and game developers time. Method: Reviews and critically analyzes several online 3D model generators. Result: Analyzes the outputs produced by different prompts in search of higher-quality generation. Conclusion: Automated 3D model generation tools show promise but still need to improve output quality. Abstract: In the world of technology and AI, digital models play an important role in our lives and are an essential part of the digital twins of real-world objects. They can be created by designers, artists, or game developers using spline curves and surfaces, meshes, and voxels, but making such models is too time-consuming. With the growth of AI tools, there is interest in the automated generation of 3D models, such as generative design approaches, which can save creators valuable time. This paper reviews several online 3D model generators and critically analyses the results, hoping to see higher-quality results from different prompts.

[202] Modeling Aesthetic Preferences in 3D Shapes: A Large-Scale Paired Comparison Study Across Object Categories

Kapil Dev

Main category: cs.GR

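TL;DR: Using a large-scale human preference study combined with non-linear modeling and cross-category analysis, the work uncovers the geometric drivers of 3D shape aesthetics and offers practical guidance for design.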

Details

Motivation: Aesthetic preferences for 3D shapes matter in fields such as industrial design, but existing computational models lack empirical grounding in large-scale human judgments, which limits their practical use. Method: Collects 22,301 pairwise comparisons and applies the Bradley-Terry model together with Random Forests and SHAP analysis to identify key geometric features such as symmetry and curvature. Result: Reveals both universal principles and domain-specific trends in aesthetic preference, and provides interpretable geometric features and a public dataset. Conclusion: A human-centric, data-driven framework advances the understanding of 3D shape aesthetics and gives designers practical tools. Abstract: Human aesthetic preferences for 3D shapes are central to industrial design, virtual reality, and consumer product development. However, most computational models of 3D aesthetics lack empirical grounding in large-scale human judgments, limiting their practical relevance. We present a large-scale study of human preferences. We collected 22,301 pairwise comparisons across five object categories (chairs, tables, mugs, lamps, and dining chairs) via Amazon Mechanical Turk. Building on a previously published dataset \cite{dev2020learning}, we introduce new non-linear modeling and cross-category analysis to uncover the geometric drivers of aesthetic preference. We apply the Bradley-Terry model to infer latent aesthetic scores and use Random Forests with SHAP analysis to identify and interpret the most influential geometric features (e.g., symmetry, curvature, compactness). Our cross-category analysis reveals both universal principles and domain-specific trends in aesthetic preferences. We focus on human-interpretable geometric features to ensure model transparency and actionable design insights, rather than relying on black-box deep learning approaches. Our findings bridge computational aesthetics and cognitive science, providing practical guidance for designers and a publicly available dataset to support reproducibility. This work advances the understanding of 3D shape aesthetics through a human-centric, data-driven framework.
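The Bradley-Terry step mentioned in this abstract can be sketched with the classic minorization-maximization (MM) update, which repeatedly sets each item's strength to its win count divided by a sum over its comparisons. The comparison data below are invented for illustration, and the function name `bradley_terry` is hypothetical; the study's own fitting procedure is not described in the abstract and may differ.

```python
def bradley_terry(n_items, comparisons, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs via MM updates.

    Returns strengths normalized to sum to 1. Assumes every item wins at
    least once; otherwise its strength collapses to zero.
    """
    wins = [0] * n_items
    pair_counts = {}
    for winner, loser in comparisons:
        wins[winner] += 1
        key = (min(winner, loser), max(winner, loser))
        pair_counts[key] = pair_counts.get(key, 0) + 1

    p = [1.0 / n_items] * n_items
    for _ in range(iters):
        denom = [0.0] * n_items
        for (i, j), n_ij in pair_counts.items():
            share = n_ij / (p[i] + p[j])
            denom[i] += share
            denom[j] += share
        new_p = [wins[i] / denom[i] if denom[i] > 0 else p[i]
                 for i in range(n_items)]
        total = sum(new_p)
        p = [x / total for x in new_p]
    return p


# Invented votes over three shapes: shape 0 is usually preferred to 1 and 2,
# and shape 1 is usually preferred to 2.
comparisons = (
    [(0, 1)] * 3 + [(1, 0)]
    + [(1, 2)] * 3 + [(2, 1)]
    + [(0, 2)] * 3 + [(2, 0)]
)
scores = bradley_terry(3, comparisons)
print(scores)
```

Under the model, the probability that shape i is preferred to shape j is `scores[i] / (scores[i] + scores[j])`, so the fitted strengths serve as latent aesthetic scores that downstream models can try to explain from geometric features.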

[203] Penetration-free Solid-Fluid Interaction on Shells and Rods

Jinyuan Liu,Yuchen Sun,Yin Yang,Chenfanfu Jiang,Minchen Li,Bo Zhu

Main category: cs.GR

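TL;DR: Presents a novel method for simulating penetration-free interaction between fluids and thin elastic solids, built on an optimization system augmented with barriers.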

Details

Motivation: To overcome the limitations of existing methods, which focus on velocity coherence at the fluid-solid interface, by explicitly resolving positional constraints. Method: Uses an optimization system augmented with barriers, combines an explicit representation of solid positions with an implicit level-set representation of the fluid, adjusts level-set values to preserve fluid volume, and develops a distance metric. Result: Robustly simulates many kinds of fluid interaction with lower-dimensional objects such as shells and rods, including topology changes, bouncing, and splashing. Conclusion: The method is effective and flexible for simulating interaction between fluids and thin elastic solids and applies to a variety of complex scenarios. Abstract: We introduce a novel approach to simulate the interaction between fluids and thin elastic solids without any penetration. Our approach is centered around an optimization system augmented with barriers, which aims to find a configuration that ensures the absence of penetration while enforcing incompressibility for the fluids and minimizing elastic potentials for the solids. Unlike previous methods that primarily focus on velocity coherence at the fluid-solid interfaces, we demonstrate the effectiveness and flexibility of explicitly resolving positional constraints, including both explicit representation of solid positions and the implicit representation of fluid level-set interface. To preserve the volume of the fluid, we propose a simple yet efficient approach that adjusts the associated level-set values. Additionally, we develop a distance metric capable of measuring the separation between an implicitly represented surface and a Lagrangian object of arbitrary codimension. By integrating the inertia, solid elastic potential, damping, barrier potential, and fluid incompressibility within a unified system, we are able to robustly simulate a wide range of processes involving fluid interactions with lower-dimensional objects such as shells and rods. These processes include topology changes, bouncing, splashing, sliding, rolling, floating, and more.

[204] HIL: Hybrid Imitation Learning of Diverse Parkour Skills from Videos

Jiashun Wang,Yifeng Jiang,Haotian Zhang,Chen Tessler,Davis Rempe,Jessica Hodgins,Xue Bin Peng

Main category: cs.GR

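TL;DR: Proposes a hybrid imitation learning (HIL) framework that combines motion tracking with adversarial imitation learning to improve the adaptability and skill composition of simulated characters.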

Details

Motivation: Existing data-driven methods struggle to adapt to new environments and to compose diverse skills, so a more effective solution is needed. Method: Uses parallel multi-task environments and a unified observation space, combining motion tracking with adversarial imitation learning. Result: In challenging parkour environments, the method improves motion quality and skill diversity while achieving competitive task completion. Conclusion: The HIL framework addresses the limitations of data-driven methods and offers a better approach to simulated character behavior. Abstract: Recent data-driven methods leveraging deep reinforcement learning have been an effective paradigm for developing controllers that enable physically simulated characters to produce natural human-like behaviors. However, these data-driven methods often struggle to adapt to novel environments and compose diverse skills coherently to perform more complex tasks. To address these challenges, we propose a hybrid imitation learning (HIL) framework that combines motion tracking, for precise skill replication, with adversarial imitation learning, to enhance adaptability and skill composition. This hybrid learning framework is implemented through parallel multi-task environments and a unified observation space, featuring an agent-centric scene representation to facilitate effective learning from the hybrid parallel environments. Our framework trains a unified controller on parkour data sourced from Internet videos, enabling a simulated character to traverse through new environments using diverse and life-like parkour skills. Evaluations across challenging parkour environments demonstrate that our method improves motion quality, increases skill diversity, and achieves competitive task completion compared to previous learning-based methods.

[205] UniHM: Universal Human Motion Generation with Object Interactions in Indoor Scenes

Zichen Geng,Zeeshan Hayder,Wei Liu,Ajmal Mian

Main category: cs.GR

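TL;DR: UniHM is a unified motion language model that uses diffusion-based generation to synthesize scene-aware human motion, supporting both text-to-motion and text-to-human-object-interaction tasks.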

Details

Motivation: Existing language-conditioned motion models are limited in scene-aware motion generation, for example because motion tokenization loses information and fails to capture the continuous, context-dependent nature of 3D human movement. Method: Proposes the UniHM framework, comprising a mixed-motion representation, an LFQ-VAE, and an enriched version of the Lingo dataset. Result: Achieves competitive results on the OMOMO and HumanML3D benchmarks. Conclusion: UniHM offers clear advantages for scene-aware motion generation and provides a new approach to motion synthesis in complex 3D scenes. Abstract: Human motion synthesis in complex scenes presents a fundamental challenge, extending beyond conventional Text-to-Motion tasks by requiring the integration of diverse modalities such as static environments, movable objects, natural language prompts, and spatial waypoints. Existing language-conditioned motion models often struggle with scene-aware motion generation due to limitations in motion tokenization, which leads to information loss and fails to capture the continuous, context-dependent nature of 3D human movement. To address these issues, we propose UniHM, a unified motion language model that leverages diffusion-based generation for synthesizing scene-aware human motion. UniHM is the first framework to support both Text-to-Motion and Text-to-Human-Object Interaction (HOI) in complex 3D scenes. Our approach introduces three key contributions: (1) a mixed-motion representation that fuses continuous 6DoF motion with discrete local motion tokens to improve motion realism; (2) a novel Look-Up-Free Quantization VAE (LFQ-VAE) that surpasses traditional VQ-VAEs in both reconstruction accuracy and generative performance; and (3) an enriched version of the Lingo dataset augmented with HumanML3D annotations, providing stronger supervision for scene-specific motion learning. Experimental results demonstrate that UniHM achieves comparable performance on the OMOMO benchmark for text-to-HOI synthesis and yields competitive results on HumanML3D for general text-conditioned motion generation.

[206] AdaToken-3D: Dynamic Spatial Gating for Efficient 3D Large Multimodal-Models Reasoning

Kai Zhang,Xingyu Chen,Xiaofeng Zhang

Main category: cs.GR

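TL;DR: AdaToken-3D is an adaptive spatial token optimization framework that improves the efficiency of 3D large multimodal models by dynamically pruning redundant tokens.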

Details

Motivation: Current 3D multimodal models are computationally inefficient because of redundant spatial tokens, so token utilization needs to be optimized. Method: Proposes the AdaToken-3D framework, which dynamically prunes redundant tokens through spatial contribution analysis and attention pattern mining. Result: On LLaVA-3D, achieves 21% faster inference and a 63% reduction in FLOPs while maintaining task accuracy. Conclusion: Shows that over 60% of spatial tokens contribute minimally, laying a foundation for efficient 3D multimodal learning. Abstract: Large Multimodal Models (LMMs) have become a pivotal research focus in deep learning, demonstrating remarkable capabilities in 3D scene understanding. However, current 3D LMMs employing thousands of spatial tokens for multimodal reasoning suffer from critical inefficiencies: excessive computational overhead and redundant information flows. Unlike 2D VLMs processing single images, 3D LMMs exhibit inherent architectural redundancy due to the heterogeneous mechanisms between spatial tokens and visual tokens. To address this challenge, we propose AdaToken-3D, an adaptive spatial token optimization framework that dynamically prunes redundant tokens through spatial contribution analysis. Our method automatically tailors pruning strategies to different 3D LMM architectures by quantifying token-level information flows via attention pattern mining. Extensive experiments on LLaVA-3D (a 7B parameter 3D-LMM) demonstrate that AdaToken-3D achieves 21% faster inference speed and 63% FLOPs reduction while maintaining original task accuracy. Beyond efficiency gains, this work systematically investigates redundancy patterns in multimodal spatial information flows through quantitative token interaction analysis. Our findings reveal that over 60% of spatial tokens contribute minimally (<5%) to the final predictions, establishing theoretical foundations for efficient 3D multimodal learning.
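A simple way to picture attention-based token pruning is to rank spatial tokens by the attention they receive and keep only the top fraction. The sketch below does exactly that on a made-up attention matrix; the contribution score (mean attention received) and the fixed `keep_ratio` are assumptions for illustration, whereas the abstract describes an adaptive, architecture-specific strategy. The function name `prune_tokens` is hypothetical.

```python
def prune_tokens(attn, keep_ratio=0.4):
    """Return sorted indices of the spatial tokens to keep.

    attn: list of rows, one per query; each row holds the attention weights
    that query assigns to every spatial token (rows sum to 1).
    """
    n_tokens = len(attn[0])
    # Contribution of a token = mean attention it receives across queries.
    contribution = [sum(row[j] for row in attn) / len(attn)
                    for j in range(n_tokens)]
    n_keep = max(1, round(keep_ratio * n_tokens))
    ranked = sorted(range(n_tokens), key=lambda j: contribution[j], reverse=True)
    return sorted(ranked[:n_keep])


# Made-up attention from two queries over five spatial tokens.
attn = [
    [0.50, 0.30, 0.10, 0.05, 0.05],
    [0.40, 0.40, 0.10, 0.05, 0.05],
]
kept = prune_tokens(attn, keep_ratio=0.4)
print(kept)
```

Later layers would then process only the kept tokens, which is where FLOPs savings of the kind reported in the abstract would come from.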

[207] MGPBD: A Multigrid Accelerated Global XPBD Solver

Chunlei Li,Peng Yu,Tiantian Liu,Siyuan Yu,Yuting Xiao,Shuai Li,Aimin Hao,Yang Gao,Qinping Zhao

Main category: cs.GR

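TL;DR: Proposes a new method that combines Unsmoothed Aggregation Algebraic Multigrid (UA-AMG) with Preconditioned Conjugate Gradient (PCG) to overcome XPBD's limitations in high-resolution, high-stiffness simulations.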

Details

Motivation: XPBD is fast and simple for simulating deformable objects, but its nonlinear Gauss-Seidel (GS) solver handles low-frequency errors poorly, causing instability and stalling in high-resolution, high-stiffness simulations. Method: Uses an AMG approach, proposes a lazy setup strategy to reduce computational overhead, and simplifies the construction of near-kernel components with an iterative method. Result: Experiments show that the method significantly improves convergence rates and numerical stability, enabling efficient and stable high-resolution simulation. Conclusion: By combining UA-AMG and PCG, the method resolves XPBD's problems in high-resolution, high-stiffness simulations at a lower computational cost. Abstract: We introduce a novel Unsmoothed Aggregation (UA) Algebraic Multigrid (AMG) method combined with Preconditioned Conjugate Gradient (PCG) to overcome the limitations of Extended Position-Based Dynamics (XPBD) in high-resolution and high-stiffness simulations. While XPBD excels in simulating deformable objects due to its speed and simplicity, its nonlinear Gauss-Seidel (GS) solver often struggles with low-frequency errors, leading to instability and stalling issues, especially in high-resolution, high-stiffness simulations. Our multigrid approach addresses these issues efficiently by leveraging AMG. To reduce the computational overhead of traditional AMG, where prolongator construction can consume up to two-thirds of the runtime, we propose a lazy setup strategy that reuses prolongators across iterations based on matrix structure and physical significance. Furthermore, we introduce a simplified method for constructing near-kernel components by applying a few sweeps of iterative methods to the homogeneous equation, achieving convergence rates comparable to adaptive smoothed aggregation (adaptive-SA) at a lower computational cost. Experimental results demonstrate that our method significantly improves convergence rates and numerical stability, enabling efficient and stable high-resolution simulations of deformable objects.
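For readers unfamiliar with PCG, the sketch below shows the preconditioned conjugate gradient loop on a tiny symmetric positive-definite system. It deliberately swaps in a Jacobi (diagonal) preconditioner where the paper uses an unsmoothed-aggregation AMG cycle, because building a multigrid hierarchy is beyond a short example; the matrix, right-hand side, and function names are all made up for illustration.

```python
def pcg(A, b, apply_preconditioner, tol=1e-10, max_iter=100):
    """Solve A x = b for a symmetric positive-definite A with preconditioned CG."""
    n = len(b)

    def matvec(x):
        return [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(a * c for a, c in zip(u, v))

    x = [0.0] * n
    r = list(b)                      # residual b - A x for x = 0
    z = apply_preconditioner(r)      # preconditioned residual
    p = list(z)                      # initial search direction
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            break
        Ap = matvec(p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = apply_preconditioner(r)
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x


# Made-up SPD system standing in for a global constraint solve.
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]


def jacobi(r):
    """Diagonal preconditioner: a stand-in for the paper's AMG cycle."""
    return [r[i] / A[i][i] for i in range(len(r))]


solution = pcg(A, b, jacobi)
print(solution)
```

Only `apply_preconditioner` would change if a multigrid V-cycle were plugged in, which is why reusing the expensive multigrid setup across iterations, as the abstract's lazy setup strategy does, can pay off.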

cs.CL

[208] A Data Synthesis Method Driven by Large Language Models for Proactive Mining of Implicit User Intentions in Tourism

Jinqiang Wang,Huansheng Ning,Tao Zhu,Jianguo Ding

Main category: cs.CL

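TL;DR: Proposes SynPT, which uses LLM-driven user and assistant agents to simulate dialogues and generate SynPT-Dialog, a high-quality dataset for proactively mining implicit user intentions in the tourism domain.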

Details

Motivation: Existing methods struggle to mine implicit user intentions in the tourism domain, and high-quality training data is scarce; current data synthesis approaches also suffer from poor domain adaptation and skewed data distributions. Method: Builds LLM-driven user and assistant agents that simulate dialogues from seed data, producing SynPT-Dialog, a dataset containing explicit reasoning, which is then used to fine-tune a general LLM. Result: Experiments show that SynPT outperforms existing methods; the paper also analyzes key hyperparameters and presents case studies of practical use. Conclusion: SynPT effectively addresses implicit intention mining in the tourism domain, and its adaptability to English-language scenarios is discussed. Abstract: In the tourism domain, Large Language Models (LLMs) often struggle to mine implicit user intentions from tourists' ambiguous inquiries and lack the capacity to proactively guide users toward clarifying their needs. A critical bottleneck is the scarcity of high-quality training datasets that facilitate proactive questioning and implicit intention mining. While recent advances leverage LLM-driven data synthesis to generate such datasets and transfer specialized knowledge to downstream models, existing approaches suffer from several shortcomings: (1) lack of adaptation to the tourism domain, (2) skewed distributions of detail levels in initial inquiries, (3) contextual redundancy in the implicit intention mining module, and (4) lack of explicit thinking about tourists' emotions and intention values. Therefore, we propose SynPT (A Data Synthesis Method Driven by LLMs for Proactive Mining of Implicit User Intentions in the Tourism), which constructs an LLM-driven user agent and assistant agent to simulate dialogues based on seed data collected from Chinese tourism websites. This approach addresses the aforementioned limitations and generates SynPT-Dialog, a training dataset containing explicit reasoning. The dataset is utilized to fine-tune a general LLM, enabling it to proactively mine implicit user intentions. Experimental evaluations, conducted from both human and LLM perspectives, demonstrate the superiority of SynPT compared to existing methods. Furthermore, we analyze key hyperparameters and present case studies to illustrate the practical applicability of our method, including discussions on its adaptability to English-language scenarios. All code and data are publicly available.

[209] AI-generated Text Detection: A Multifaceted Approach to Binary and Multiclass Classification

Harika Abburi,Sanmitra Bhattacharya,Edward Bowen,Nirmala Pudota

Main category: cs.CL

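TL;DR: Studies how to accurately detect AI-generated text and identify its source model, proposing two neural architectures per task; the submitted systems ranked fifth both in distinguishing human from AI-generated text (Task A) and in identifying the generating model (Task B).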

Details

Motivation: The text generation capabilities of Large Language Models (LLMs) can be misused, for example to produce fake news or spam, so AI-generated text and its source model must be detected accurately to support responsible use. Method: For Task A (distinguishing human from AI-generated text) and Task B (identifying the generating model), proposes two neural architectures each: an optimized model and a simpler variant. Result: The optimized model for Task A scored an F1 of 0.994 and ranked fifth; the simpler model for Task B scored an F1 of 0.627 and also ranked fifth. Conclusion: The proposed methods perform well at detecting AI-generated text and attributing it to a source model, providing technical support for the responsible use of LLMs. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in generating text that closely resembles human writing across a wide range of styles and genres. However, such capabilities are prone to potential misuse, such as fake news generation, spam email creation, and misuse in academic assignments. As a result, accurate detection of AI-generated text and identification of the model that generated it are crucial for maintaining the responsible use of LLMs. In this work, we addressed two sub-tasks put forward by the Defactify workshop under the AI-Generated Text Detection shared task at the Association for the Advancement of Artificial Intelligence (AAAI 2025): Task A involved distinguishing between human-authored and AI-generated text, while Task B focused on attributing text to its originating language model. For each task, we proposed two neural architectures: an optimized model and a simpler variant. For Task A, the optimized neural architecture achieved fifth place with an F1 score of 0.994, and for Task B, the simpler neural architecture also ranked fifth with an F1 score of 0.627.

[210] Assessing Collective Reasoning in Multi-Agent LLMs via Hidden Profile Tasks

Yuxuan Li,Aoi Naito,Hirokazu Shirado

Main category: cs.CL

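TL;DR: Introduces the Hidden Profile paradigm from social psychology to evaluate collective reasoning failures in multi-agent LLM systems, and finds that, across all models tested, multi-agent systems fail to match the accuracy of a single agent given complete information.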

Details

Motivation: No theory-grounded benchmark currently exists for systematically evaluating collective reasoning failures in multi-agent LLM systems, so a diagnostic tool is needed. Method: Adopts the Hidden Profile paradigm, distributing critical information asymmetrically across agents, to build a benchmark of nine tasks, and runs experiments on six LLMs including GPT-4.1. Result: Across all models, multi-agent systems underperform a single agent given complete information; their performance is broadly comparable to that of human groups, with behavioral differences such as increased sensitivity to social desirability. Conclusion: The study provides a reproducible evaluation framework, reveals a cooperation-contradiction trade-off in multi-agent LLM systems, and points to directions for future research on artificial collective intelligence and human-AI interaction. Abstract: Multi-agent systems built on large language models (LLMs) promise enhanced problem-solving through distributed information integration, but also risk replicating collective reasoning failures observed in human groups. Yet, no theory-grounded benchmark exists to systematically evaluate such failures. In this paper, we introduce the Hidden Profile paradigm from social psychology as a diagnostic testbed for multi-agent LLM systems. By distributing critical information asymmetrically across agents, the paradigm reveals how inter-agent dynamics support or hinder collective reasoning. We first formalize the paradigm for multi-agent decision-making under distributed knowledge and instantiate it as a benchmark with nine tasks spanning diverse scenarios, including adaptations from prior human studies. We then conduct experiments with GPT-4.1 and five other leading LLMs, including reasoning-enhanced variants, showing that multi-agent systems across all models fail to match the accuracy of single agents given complete information. While agents' collective performance is broadly comparable to that of human groups, nuanced behavioral differences emerge, such as increased sensitivity to social desirability. Finally, we demonstrate the paradigm's diagnostic utility by exploring a cooperation-contradiction trade-off in multi-agent LLM systems. We find that while cooperative agents are prone to over-coordination in collective settings, increased contradiction impairs group convergence. This work contributes a reproducible framework for evaluating multi-agent LLM systems and motivates future research on artificial collective intelligence and human-AI interaction.

[211] Talk to Your Slides: Efficient Slide Editing Agent with Large Language Models

Kyudan Jung,Hojun Cho,Jooyeol Yun,Jaehyeok Jang,Jagul Choo

Main category: cs.CL

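TL;DR: Introduces Talk-to-Your-Slides, an LLM-based agent that edits slides directly inside PowerPoint through COM communication, addressing the slide-editing task that existing research has overlooked.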

Details

Motivation: Existing research focuses mainly on slide generation and overlooks the tedious task of editing existing slides. Method: Uses a two-level approach: at the high level, an LLM agent interprets instructions and formulates an editing plan; at the low level, Python scripts directly manipulate PowerPoint objects. Result: Experiments show that Talk-to-Your-Slides significantly outperforms baseline methods in execution success rate, instruction fidelity, and editing efficiency. Conclusion: The method offers more flexible and context-aware slide editing, and supports further research through the TSBench dataset and open-source code. Abstract: Existing research on large language models (LLMs) for PowerPoint predominantly focuses on slide generation, overlooking the common yet tedious task of editing existing slides. We introduce Talk-to-Your-Slides, an LLM-powered agent that directly edits slides within active PowerPoint sessions through COM communication. Our system employs a two-level approach: (1) high-level processing where an LLM agent interprets instructions and formulates editing plans, and (2) low-level execution where Python scripts directly manipulate PowerPoint objects. Unlike previous methods relying on predefined operations, our approach enables more flexible and contextually-aware editing. To facilitate evaluation, we present TSBench, a human-annotated dataset of 379 diverse editing instructions with corresponding slide variations. Experimental results demonstrate that Talk-to-Your-Slides significantly outperforms baseline methods in execution success rate, instruction fidelity, and editing efficiency. Our code and benchmark are available at https://anonymous.4open.science/r/talk-to-your-slides/

[212] MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models

Xiaomin Li,Mingye Gao,Yuexing Hao,Taoran Li,Guangya Wan,Zihan Wang,Yijun Wang

Main category: cs.CL

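TL;DR: MedGUIDE is a new benchmark for evaluating how well large language models (LLMs) make decisions that follow clinical guidelines; it finds that even medically specialized models often underperform.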

Details

Motivation: To assess whether LLMs can reliably follow structured clinical guidelines, which matters for safe and accurate medical decision-making. Method: Builds MedGUIDE from 55 NCCN decision trees, selects 7,747 high-quality samples through a two-stage quality selection process, and evaluates 25 LLMs. Result: Even domain-specific LLMs often underperform on tasks requiring structured guideline adherence; the paper also tests whether in-context guideline inclusion or continued pretraining improves performance. Conclusion: MedGUIDE is important for assessing whether LLMs are suitable for real-world clinical settings and shows that they still need improvement. Abstract: Clinical guidelines, typically structured as decision trees, are central to evidence-based medical practice and critical for ensuring safe and accurate diagnostic decision-making. However, it remains unclear whether Large Language Models (LLMs) can reliably follow such structured protocols. In this work, we introduce MedGUIDE, a new benchmark for evaluating LLMs on their ability to make guideline-consistent clinical decisions. MedGUIDE is constructed from 55 curated NCCN decision trees across 17 cancer types and uses clinical scenarios generated by LLMs to create a large pool of multiple-choice diagnostic questions. We apply a two-stage quality selection process, combining expert-labeled reward models and LLM-as-a-judge ensembles across ten clinical and linguistic criteria, to select 7,747 high-quality samples. We evaluate 25 LLMs spanning general-purpose, open-source, and medically specialized models, and find that even domain-specific LLMs often underperform on tasks requiring structured guideline adherence. We also test whether performance can be improved via in-context guideline inclusion or continued pretraining. Our findings underscore the importance of MedGUIDE in assessing whether LLMs can operate safely within the procedural frameworks expected in real-world clinical settings.

[213] Steering Risk Preferences in Large Language Models by Aligning Behavioral and Neural Representations

Jian-Qiao Zhu,Haijiang Yan,Thomas L. Griffiths

Main category: cs.CL

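TL;DR: Editing the Transformer's residual streams with "steering vectors" can effectively change the behavior of large language models without retraining or fine-tuning.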

Details

Motivation: To study how steering vectors can be identified systematically, so that model behavior can be controlled more precisely. Method: Proposes a principled approach that extracts steering vectors by aligning representations elicited through behavioral methods (such as Markov chain Monte Carlo with LLMs) with their neural counterparts. Result: Experiments show that the extracted steering vectors reliably modulate the model's risk-related outputs. Conclusion: Steering vectors are an effective way to steer model behavior in a targeted manner without retraining. Abstract: Changing the behavior of large language models (LLMs) can be as straightforward as editing the Transformer's residual streams using appropriately constructed "steering vectors." These modifications to internal neural activations, a form of representation engineering, offer an effective and targeted means of influencing model behavior without retraining or fine-tuning the model. But how can such steering vectors be systematically identified? We propose a principled approach for uncovering steering vectors by aligning latent representations elicited through behavioral methods (specifically, Markov chain Monte Carlo with LLMs) with their neural counterparts. To evaluate this approach, we focus on extracting latent risk preferences from LLMs and steering their risk-related outputs using the aligned representations as steering vectors. We show that the resulting steering vectors successfully and reliably modulate LLM outputs in line with the targeted behavior.
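
The basic steering operation, adding a scaled vector to a residual-stream activation, can be sketched as below. This shows only how a steering vector is applied; how the vector is obtained (the paper's behavioral and neural alignment) is not reproduced here, and the function name `steer` and the scale `alpha` are illustrative assumptions.

```python
def steer(hidden, steering_vector, alpha=1.0):
    """Return hidden + alpha * steering_vector, elementwise.

    `hidden` stands in for one token's residual-stream activation at one
    layer; `alpha` controls the strength and sign of the intervention.
    """
    if len(hidden) != len(steering_vector):
        raise ValueError("hidden and steering_vector must have the same length")
    return [h + alpha * v for h, v in zip(hidden, steering_vector)]
```

A negative `alpha` pushes the activation in the opposite direction, which is how a single vector could, in principle, steer toward either more or less risk-seeking outputs.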

[214] THELMA: Task Based Holistic Evaluation of Large Language Model Applications-RAG Question Answering

Udita Patel,Rutu Mulkar,Jay Roberts,Cibi Chakravarthy Senthilkumar,Sujay Gandhi,Xiaofei Zheng,Naumaan Nayyar,Rafael Castrillo

Main category: cs.CL

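TL;DR: THELMA is a reference-free framework for evaluating RAG QA applications; it comprises six interdependent metrics and supports end-to-end evaluation and improvement.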

Details

Motivation: To provide a holistic, fine-grained evaluation method for RAG QA applications that requires neither labelled data nor reference responses. Method: Designs six interdependent metrics for evaluating RAG QA applications. Result: The THELMA framework helps developers identify which RAG component needs improvement. Conclusion: THELMA offers an effective tool for evaluating and improving RAG QA applications. Abstract: We propose THELMA (Task Based Holistic Evaluation of Large Language Model Applications), a reference-free framework for RAG (Retrieval-Augmented Generation) based question answering (QA) applications. THELMA consists of six interdependent metrics specifically designed for holistic, fine-grained evaluation of RAG QA applications. The THELMA framework helps developers and application owners evaluate, monitor, and improve end-to-end RAG QA pipelines without requiring labelled sources or reference responses. We also present our findings on the interplay of the proposed THELMA metrics, which can be interpreted to identify the specific RAG component needing improvement in QA applications.

[215] Critique-Guided Distillation: Improving Supervised Fine-tuning via Better Distillation

Berkcan Kapusuzoglu,Supriyo Chakraborty,Chia-Hsuan Lee,Sambit Sahu

Main category: cs.CL

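TL;DR: Proposes CGD, a multi-stage framework that integrates a teacher model's explanatory critiques and refined responses to address the imitation problem in supervised fine-tuning, yielding significant gains on math and language understanding tasks.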

Details

Motivation: Supervised fine-tuning (SFT) often suffers from the imitation problem, where the model merely reproduces responses without understanding them. CGD aims to improve understanding through teacher critiques and refined responses. Method: CGD trains the student to map the prompt, the teacher critique, and the student's own initial response to the refined teacher response; an entropy-based analysis shows that this reduces refinement uncertainty and can be interpreted as a Bayesian posterior update. Result: Significant gains on math (AMC23 +17.5%) and language understanding (MMLU-Pro +6.3%) tasks, while mitigating format drift. Conclusion: CGD effectively addresses the imitation problem and improves model understanding, with gains validated across multiple tasks. Abstract: Supervised fine-tuning (SFT) using expert demonstrations often suffers from the imitation problem, where the model learns to reproduce the correct responses without understanding the underlying rationale. To address this limitation, we propose Critique-Guided Distillation (CGD), a novel multi-stage framework that integrates teacher-model-generated explanatory critiques and refined responses into the SFT process. A student model is then trained to map the triplet of prompt, teacher critique, and its own initial response to the corresponding refined teacher response, thereby learning both what to imitate and why. Using entropy-based analysis, we show that CGD reduces refinement uncertainty and can be interpreted as a Bayesian posterior update. We perform extensive empirical evaluation of CGD on a variety of benchmark tasks, and demonstrate significant gains on both math (AMC23 +17.5%) and language understanding tasks (MMLU-Pro +6.3%), while successfully mitigating the format drift issues observed in previous critique fine-tuning (CFT) techniques.
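
The triplet-to-target mapping described above can be pictured as a data-formatting step. The sketch below is hypothetical: the `CGDExample` class, the field names, and the prompt template are assumptions for illustration, since the abstract does not give the actual input format.

```python
from dataclasses import dataclass


@dataclass
class CGDExample:
    prompt: str
    student_response: str   # the student's own initial answer
    teacher_critique: str   # explanatory critique from the teacher
    refined_response: str   # teacher's refined answer (training target)


def to_training_pair(ex):
    """Build (source, target) text for one fine-tuning example.

    The section labels in the template are assumed, not taken from the paper.
    """
    source = (
        f"Prompt:\n{ex.prompt}\n\n"
        f"Initial response:\n{ex.student_response}\n\n"
        f"Critique:\n{ex.teacher_critique}\n\n"
        "Refined response:\n"
    )
    return source, ex.refined_response
```

The point of the layout is that the critique sits in the input, so the student conditions on the explanation, while only the refined response is the target.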

[216] Can an Easy-to-Hard Curriculum Make Reasoning Emerge in Small Language Models? Evidence from a Four-Stage Curriculum on GPT-2

Xiang Fu

Main category: cs.CL

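TL;DR: Training a small language model (SLM) with a staged curriculum markedly improves reasoning transparency and sample efficiency, but final-answer success still lags a conventional run by about 30%.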

Details

Motivation: To explore whether staged curriculum training can improve the reasoning ability and efficiency of small language models. Method: Trains Cognivolve, a 124M-parameter GPT-2 model, on a four-stage curriculum that progresses from lexical matching to multi-step symbolic inference, then evaluates it. Result: Cognivolve reaches target accuracy in half the optimization steps, activates more reasoning heads, and shows more balanced attention; applying the curriculum out of order or with optimizer resets fails to reproduce these gains. Conclusion: Staged curriculum training improves reasoning transparency and sample efficiency, but lower final-answer success and an under-detecting saliency probe remain open problems. Abstract: We demonstrate that a developmentally ordered curriculum markedly improves reasoning transparency and sample-efficiency in small language models (SLMs). Concretely, we train Cognivolve, a 124M-parameter GPT-2 model, on a four-stage syllabus that ascends from lexical matching to multi-step symbolic inference and then evaluate it without any task-specific fine-tuning. Cognivolve reaches target accuracy in half the optimization steps of a single-phase baseline, activates an order-of-magnitude more gradient-salient reasoning heads, and shifts those heads toward deeper layers, yielding higher-entropy attention that balances local and long-range context. The same curriculum applied out of order or with optimizer resets fails to reproduce these gains, confirming that progression, not extra compute, drives the effect. We also identify open challenges: final-answer success still lags a conventional run by about 30%, and our saliency probe under-detects verbal-knowledge heads in the hardest stage, suggesting directions for mixed-stage fine-tuning and probe expansion.

[217] Multilingual Prompt Engineering in Large Language Models: A Survey Across NLP Tasks

Shubham Vatsal,Harsh Dubey,Aditi Singh

Main category: cs.CL

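TL;DR: Surveys multilingual prompt engineering techniques, analyzes how they are used to improve the multilingual performance of large language models (LLMs), and presents a taxonomy along with potential state-of-the-art methods.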

Details

Motivation: Although LLMs perform well across many NLP tasks, their effectiveness in multilingual settings remains a challenge; multilingual prompt engineering has emerged as a solution that avoids extensive parameter re-training. Method: Reviews how structured natural language prompts are used to extract knowledge from LLMs across languages, and categorizes 39 prompting techniques from 36 research papers. Result: The survey covers 30 multilingual NLP tasks spanning around 250 languages and analyzes the distribution across language families and resource levels. Conclusion: Multilingual prompt engineering gives a broad audience a way to use LLMs and shows potential across languages and tasks. Abstract: Large language models (LLMs) have demonstrated impressive performance across a wide range of Natural Language Processing (NLP) tasks. However, ensuring their effectiveness across multiple languages presents unique challenges. Multilingual prompt engineering has emerged as a key approach to enhance LLMs' capabilities in diverse linguistic settings without requiring extensive parameter re-training or fine-tuning. With growing interest in multilingual prompt engineering over the past two to three years, researchers have explored various strategies to improve LLMs' performance across languages and NLP tasks. By crafting structured natural language prompts, researchers have successfully extracted knowledge from LLMs across different languages, making these techniques an accessible pathway for a broader audience, including those without deep expertise in machine learning, to harness the capabilities of LLMs. In this paper, we survey and categorize different multilingual prompting techniques based on the NLP tasks they address across a diverse set of datasets that collectively span around 250 languages. We further highlight the LLMs employed, present a taxonomy of approaches and discuss potential state-of-the-art (SoTA) methods for specific multilingual datasets. Additionally, we derive a range of insights across language families and resource levels (high-resource vs. low-resource), including analyses such as the distribution of NLP tasks by language resource type and the frequency of prompting methods across different language families. Our survey reviews 36 research papers covering 39 prompting techniques applied to 30 multilingual NLP tasks, with the majority of these studies published in the last two years.

[218] Ambiguity Resolution in Text-to-Structured Data Mapping

Zhibo Hu,Chen Wang,Yanfeng Shu,Hye-Young Paik,Liming Zhu

Main category: cs.CL

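TL;DR: Proposes a new method that identifies ambiguity by analyzing how the representations of ambiguous text differ in latent space, and designs a new distance measure based on a path kernel over sparse-autoencoder (SAE) gradients to improve LLM performance on ambiguous tasks.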

Details

Motivation: Ambiguity in natural language is a major obstacle for LLMs in text-to-structured-data mapping, and existing methods (such as the ReACT framework or supervised fine-tuning) do not resolve it effectively. Method: Analyzes the relationship between ambiguous questions and their interpretations, and designs a new distance measure based on a path kernel over sparse-autoencoder (SAE) gradients to detect ambiguity. Result: Proposes a new framework that improves LLM performance on ambiguous tasks (such as tool calling) by predicting missing concepts. Conclusion: Latent-space analysis identifies ambiguity effectively and offers a new route to improving LLM performance on ambiguous tasks. Abstract: Ambiguity in natural language is a significant obstacle to achieving accurate text-to-structured-data mapping through large language models (LLMs), which affects the performance of tasks such as mapping text to agentic tool calls and text-to-SQL queries. Existing methods of ambiguity handling either exploit the ReACT framework to produce the correct mapping through trial and error, or use supervised fine-tuning to guide models to produce a biased mapping that improves certain tasks. In this paper, we adopt a different approach that characterizes the representation difference of ambiguous text in the latent space and leverages the difference to identify ambiguity before mapping the text to structured data. To detect the ambiguity of a sentence, we focus on the relationship between ambiguous questions and their interpretations and on what causes the LLM to ignore multiple interpretations. Unlike the distance calculated from dense embedding vectors, we utilize the observation that ambiguity is caused by concepts missing in the latent space of the LLM to design a new distance measurement, computed through the path kernel by the integral of gradient values for each concept from a sparse autoencoder (SAE) under each state. We identify patterns that distinguish ambiguous questions with this measurement. Based on our observation, we propose a new framework to improve the performance of LLMs on ambiguous agentic tool calling through missing-concept prediction.

[219] Evaluating Design Decisions for Dual Encoder-based Entity Disambiguation

Susanna Rücker,Alan Akbik

Main category: cs.CL

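TL;DR: Proposes VerbalizED, a Dual Encoder model for entity disambiguation, focusing on the evaluation of key design decisions such as the loss function and similarity metric, and validates it on the AIDA-Yago dataset.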

Details

Motivation: To study the key design decisions for Dual Encoders in entity disambiguation, such as the loss function and similarity metric, in order to improve performance. Method: Proposes VerbalizED, which combines contextual label verbalizations with efficient hard negative sampling, and explores an iterative prediction variant. Result: Experiments on AIDA-Yago validate the model's effectiveness, and it sets a new state of the art on the ZELDA benchmark. Conclusion: By optimizing its design decisions, VerbalizED markedly improves entity disambiguation performance. Abstract: Entity disambiguation (ED) is the task of linking mentions in text to corresponding entries in a knowledge base. Dual Encoders address this by embedding mentions and label candidates in a shared embedding space and applying a similarity metric to predict the correct label. In this work, we focus on evaluating key design decisions for Dual Encoder-based ED, such as its loss function, similarity metric, label verbalization format, and negative sampling strategy. We present the resulting model VerbalizED, a document-level Dual Encoder model that includes contextual label verbalizations and efficient hard negative sampling. Additionally, we explore an iterative prediction variant that aims to improve the disambiguation of challenging data points. Comprehensive experiments on AIDA-Yago validate the effectiveness of our approach, offering insights into impactful design choices that result in a new State-of-the-Art system on the ZELDA benchmark.
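
The prediction step of a Dual Encoder, scoring a mention embedding against candidate label embeddings and taking the best match, can be sketched as follows. The embeddings are toy vectors, the encoders themselves are omitted, and cosine similarity is just one of the similarity metrics such a study might compare; the function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def disambiguate(mention_vec, label_vecs, similarity=cosine):
    """Return the label whose embedding is most similar to the mention.

    `label_vecs` maps label names to embeddings in the shared space.
    """
    return max(label_vecs, key=lambda name: similarity(mention_vec, label_vecs[name]))


if __name__ == "__main__":
    labels = {"Paris_(city)": [1.0, 0.0], "Paris_Hilton": [0.0, 1.0]}
    print(disambiguate([0.9, 0.1], labels))
```

Passing a different `similarity` function (for example, a plain dot product) is the kind of design choice the paper evaluates.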

[220] Automatic Speech Recognition for African Low-Resource Languages: Challenges and Future Directions

Sukairaj Hafiz Imam,Babangida Sani,Dawit Ketema Gete,Bedru Yimam Ahamed,Ibrahim Said Ahmad,Idris Abdulmumin,Seid Muhie Yimam,Muhammad Yahuza Bello,Shamsuddeen Hassan Muhammad

Main category: cs.CL

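TL;DR: Examines the main challenges facing automatic speech recognition (ASR) for African low-resource languages and proposes strategies to address them, emphasizing community-driven data collection, multilingual learning, and privacy-preserving techniques.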

Details

Motivation: African low-resource languages are underrepresented in ASR research and face problems such as data scarcity and linguistic complexity, so solutions are needed to make the technology more inclusive. Method: Analyzes the challenges (such as data scarcity and limited computational resources) and, drawing on case studies, proposes strategies such as community-driven data collection and lightweight models. Result: Pilot projects demonstrate that customized solutions (such as morpheme-based modeling) are feasible and can improve ASR for African languages. Conclusion: Interdisciplinary collaboration and sustained investment are key; ethical and efficient ASR systems are needed to safeguard linguistic diversity and promote socioeconomic participation. Abstract: Automatic Speech Recognition (ASR) technologies have transformed human-computer interaction; however, low-resource languages in Africa remain significantly underrepresented in both research and practical applications. This study investigates the major challenges hindering the development of ASR systems for these languages, which include data scarcity, linguistic complexity, limited computational resources, acoustic variability, and ethical concerns surrounding bias and privacy. The primary goal is to critically analyze these barriers and identify practical, inclusive strategies to advance ASR technologies within the African context. Recent advances and case studies emphasize promising strategies such as community-driven data collection, self-supervised and multilingual learning, lightweight model architectures, and techniques that prioritize privacy. Evidence from pilot projects involving various African languages showcases the feasibility and impact of customized solutions, which encompass morpheme-based modeling and domain-specific ASR applications in sectors like healthcare and education. The findings highlight the importance of interdisciplinary collaboration and sustained investment to tackle the distinct linguistic and infrastructural challenges faced by the continent. This study offers a progressive roadmap for creating ethical, efficient, and inclusive ASR systems that not only safeguard linguistic diversity but also improve digital accessibility and promote socioeconomic participation for speakers of African languages.

[221] Hierarchical Bracketing Encodings for Dependency Parsing as Tagging

Ana Ezquerro,David Vilares,Anssi Yli-Jyrä,Carlos Gómez-Rodríguez

Main category: cs.CL

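TL;DR: Presents a family of encodings for dependency parsing as sequence labeling, based on hierarchical bracketing. Proves that the existing 4-bit projective encoding belongs to this family but is suboptimal in its number of labels, and derives an optimal hierarchical bracketing that reduces the labels needed for projective trees from 16 to 12. Also extends the encoding to support non-projectivity more compactly. The new encodings yield competitive accuracy on a diverse set of treebanks.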

Details

Motivation: The existing 4-bit projective encoding is suboptimal in its number of labels, and previous encodings support non-projectivity less compactly. Method: Proposes a family of encodings based on hierarchical bracketing, derives the optimal hierarchical bracketing, and extends it to support non-projectivity. Result: The optimal encoding reduces the number of labels for projective trees from 16 to 12 and handles non-projectivity more compactly. Conclusion: The new encodings achieve competitive accuracy on a diverse set of treebanks. Abstract: We present a family of encodings for sequence labeling dependency parsing, based on the concept of hierarchical bracketing. We prove that the existing 4-bit projective encoding belongs to this family, but it is suboptimal in the number of labels used to encode a tree. We derive an optimal hierarchical bracketing, which minimizes the number of symbols used and encodes projective trees using only 12 distinct labels (vs. 16 for the 4-bit encoding). We also extend optimal hierarchical bracketing to support arbitrary non-projectivity in a more compact way than previous encodings. Our new encodings yield competitive accuracy on a diverse set of treebanks.

[222] Disambiguating Reference in Visually Grounded Dialogues through Joint Modeling of Textual and Multimodal Semantic Structures

Shun Inadumi,Nobuhiro Ueda,Koichiro Yoshino

Main category: cs.CL

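TL;DR: Proposes a framework that unifies textual and multimodal reference resolution by mapping mention embeddings to object embeddings and selecting mentions or objects by similarity. Experiments show that learning textual reference resolution (such as coreference resolution) benefits multimodal reference resolution; in particular, the model outperforms MDETR and GLIP on pronoun phrase grounding.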

Details

Motivation: Resolving ambiguity in visually grounded dialogue, especially ambiguity caused by pronouns and ellipses, requires integrating textual and multimodal reference resolution. Method: Proposes a framework that maps mention embeddings to object embeddings and selects mentions or objects by similarity, incorporating textual reference resolution such as coreference resolution. Result: The model outperforms MDETR and GLIP on pronoun phrase grounding, and incorporating textual reference relations strengthens the confidence scores between mentions and objects. Conclusion: Integrating textual reference resolution helps reduce ambiguity in visually grounded dialogue and improves multimodal reference resolution. Abstract: Multimodal reference resolution, including phrase grounding, aims to understand the semantic relations between mentions and real-world objects. Phrase grounding between images and their captions is a well-established task. In contrast, for real-world applications, it is essential to integrate textual and multimodal reference resolution to unravel the reference relations within dialogue, especially in handling ambiguities caused by pronouns and ellipses. This paper presents a framework that unifies textual and multimodal reference resolution by mapping mention embeddings to object embeddings and selecting mentions or objects based on their similarity. Our experiments show that learning textual reference resolution, such as coreference resolution and predicate-argument structure analysis, positively affects performance in multimodal reference resolution. In particular, our model with coreference resolution performs better in pronoun phrase grounding than representative models for this task, MDETR and GLIP. Our qualitative analysis demonstrates that incorporating textual reference relations strengthens the confidence scores between mentions, including pronouns and predicates, and objects, which can reduce the ambiguities that arise in visually grounded dialogues.

[223] MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports

Kevin Wu,Eric Wu,Rahul Thapa,Kevin Wei,Angela Zhang,Arvind Suresh,Jacqueline J. Tao,Min Woo Sun,Alejandro Lozano,James Zou

Main category: cs.CL

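TL;DR: MedCaseReasoning is the first open-access dataset for evaluating how well LLMs align with clinician-authored diagnostic reasoning; existing models show significant shortcomings in diagnostic accuracy and reasoning recall, but fine-tuning improves both substantially.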

Details

Motivation: Existing medical benchmarks (such as MedQA and MMLU) assess only final-answer accuracy and overlook the quality and faithfulness of the clinical reasoning process, so a new evaluation method is needed. Method: Introduces the MedCaseReasoning dataset of 14,489 diagnostic question-and-answer cases, each paired with detailed clinical reasoning statements, and evaluates current leading LLMs on it. Result: Existing models perform poorly (for example, DeepSeek-R1 reaches 48% diagnostic accuracy and 64% reasoning-statement recall), but fine-tuning improves diagnostic accuracy and reasoning recall by average relative gains of 29% and 41%, respectively. Conclusion: MedCaseReasoning fills a gap in medical LLM evaluation; fine-tuning markedly improves model performance and advances LLMs for clinical diagnosis. Abstract: Doctors and patients alike increasingly use Large Language Models (LLMs) to diagnose clinical cases. However, unlike domains such as math or coding, where correctness can be objectively defined by the final answer, medical diagnosis requires both the outcome and the reasoning process to be accurate. Currently, widely used medical benchmarks like MedQA and MMLU assess only accuracy in the final answer, overlooking the quality and faithfulness of the clinical reasoning process. To address this limitation, we introduce MedCaseReasoning, the first open-access dataset for evaluating LLMs on their ability to align with clinician-authored diagnostic reasoning. The dataset includes 14,489 diagnostic question-and-answer cases, each paired with detailed reasoning statements derived from open-access medical case reports. We evaluate state-of-the-art reasoning LLMs on MedCaseReasoning and find significant shortcomings in their diagnoses and reasoning: for instance, the top-performing open-source model, DeepSeek-R1, achieves only 48% 10-shot diagnostic accuracy and mentions only 64% of the clinician reasoning statements (recall). However, we demonstrate that fine-tuning LLMs on the reasoning traces derived from MedCaseReasoning significantly improves diagnostic accuracy and clinical reasoning recall by an average relative gain of 29% and 41%, respectively. The open-source dataset, code, and models are available at https://github.com/kevinwu23/Stanford-MedCaseReasoning.

[224] ZeroTuning: Unlocking the Initial Token's Power to Enhance Large Language Models Without Training

Feijiang Han,Xiaodong Yu,Jianheng Tang,Lyle Ungar

Main category: cs.CL

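TL;DR: Proposes ZeroTuning, a training-free method that improves large language model (LLM) performance by adjusting the attention paid to the semantically empty initial token.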

Details

Motivation: Existing methods rely on auxiliary mechanisms to identify task-relevant tokens, which can introduce bias and limit applicability. This paper finds that the semantically empty initial token is an underexplored control point. Method: Theoretical analysis shows that tuning the initial token's attention affects the attention distribution over subsequent tokens. ZeroTuning applies head-specific attention adjustments to this token. Result: ZeroTuning improves performance on text classification, multiple-choice, and multi-turn conversation tasks (for example, +11.71% on classification for Llama-3.1-8B) and is robust across settings. Conclusion: The semantically empty initial token is an overlooked control point in LLMs, and ZeroTuning offers new insights into inference-time tuning and model interpretability. Abstract: Recently, training-free methods for improving large language models (LLMs) have attracted growing interest, with token-level attention tuning emerging as a promising and interpretable direction. However, existing methods typically rely on auxiliary mechanisms to identify important or irrelevant task-specific tokens, introducing potential bias and limiting applicability. In this paper, we uncover a surprising and elegant alternative: the semantically empty initial token is a powerful and underexplored control point for optimizing model behavior. Through theoretical analysis, we show that tuning the initial token's attention sharpens or flattens the attention distribution over subsequent tokens, and its role as an attention sink amplifies this effect. Empirically, we find that: (1) tuning its attention improves LLM performance more effectively than tuning other task-specific tokens; (2) the effect follows a consistent trend across layers, with earlier layers having greater impact, but varies across attention heads, with different heads showing distinct preferences in how they attend to this token. Based on these findings, we propose ZeroTuning, a training-free approach that improves LLM performance by applying head-specific attention adjustments to this special token. Despite tuning only one token, ZeroTuning achieves higher performance on text classification, multiple-choice, and multi-turn conversation tasks across models such as Llama, Qwen, and DeepSeek. For example, ZeroTuning improves Llama-3.1-8B by 11.71% on classification, 2.64% on QA tasks, and raises its multi-turn score from 7.804 to 7.966. The method is also robust to limited resources, few-shot settings, long contexts, quantization, decoding strategies, and prompt variations. Our work sheds light on a previously overlooked control point in LLMs, offering new insights into both inference-time tuning and model interpretability.
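
One plausible way to picture the adjustment is to rescale the attention weight on the initial token and renormalize the row. The sketch below assumes that form; the abstract does not state the exact update rule, and the function name and the factor `gamma` are illustrative.

```python
def rescale_initial_attention(weights, gamma):
    """Scale the first attention weight by gamma and renormalize to sum to 1.

    `weights` is one row of post-softmax attention (one query, one head).
    gamma < 1 moves mass from the initial token to later tokens;
    gamma > 1 does the opposite.
    """
    if not weights:
        raise ValueError("weights must be non-empty")
    scaled = [gamma * weights[0]] + list(weights[1:])
    total = sum(scaled)
    return [w / total for w in scaled]


if __name__ == "__main__":
    print(rescale_initial_attention([0.5, 0.3, 0.2], gamma=0.5))
```

Under this form the later tokens keep their relative proportions, but the absolute gaps between them grow when `gamma < 1` and shrink when `gamma > 1`, which is one reading of "sharpens or flattens."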

[225] Token Masking Improves Transformer-Based Text Classification

Xianglong Xu,John Bowen,Rojin Taheri

Main category: cs.CL

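TL;DR: Proposes an input regularization method based on random masking, which replaces input tokens with [MASK] at random to improve model performance; experiments show it outperforms standard regularization techniques on language identification and sentiment analysis.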

Details

Motivation: To explore whether masking input tokens can further improve Transformer-based text classification models. Method: Proposes token masking regularization, which randomly replaces input tokens with [MASK] with probability p, introducing stochastic perturbations during training. Result: Across several models (mBERT, Qwen2.5-0.5B, TinyLlama-1.1B), the method outperforms standard regularization techniques on language identification and sentiment analysis, with p = 0.1 as a strong general default. Conclusion: Masking regularization improves performance by reducing overfitting and through implicit gradient smoothing, and applies to multiple tasks. Abstract: While transformer-based models achieve strong performance on text classification, we explore whether masking input tokens can further enhance their effectiveness. We propose token masking regularization, a simple yet theoretically motivated method that randomly replaces input tokens with a special [MASK] token at probability p. This introduces stochastic perturbations during training, leading to implicit gradient averaging that encourages the model to capture deeper inter-token dependencies. Experiments on language identification and sentiment analysis, across diverse models (mBERT, Qwen2.5-0.5B, TinyLlama-1.1B), show consistent improvements over standard regularization techniques. We identify task-specific optimal masking rates, with p = 0.1 as a strong general default. We attribute the gains to two key effects: (1) input perturbation reduces overfitting, and (2) gradient-level smoothing acts as implicit ensembling.
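
The masking step itself is simple enough to sketch directly. The code below operates on a list of token strings for readability; a real pipeline would operate on token IDs and would likely leave special tokens untouched, which is an assumption not covered by the abstract. The function name and the `rng` parameter are illustrative.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed mask symbol; real tokenizers expose their own


def mask_tokens(tokens, p=0.1, rng=None):
    """Return a copy of `tokens` with each token replaced by MASK_TOKEN with probability p.

    Intended for training-time use only; evaluation inputs stay unmasked.
    """
    rng = rng or random.Random()
    return [MASK_TOKEN if rng.random() < p else tok for tok in tokens]


if __name__ == "__main__":
    sentence = "the movie was surprisingly good".split()
    print(mask_tokens(sentence, p=0.1, rng=random.Random(0)))
```

Because a fresh mask is sampled on every call, the same training example is seen under different perturbations across epochs, which is the source of the stochasticity the abstract describes.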

[226] Masking in Multi-hop QA: An Analysis of How Language Models Perform with Context Permutation

Wenyu Huang,Pavlos Vougiouklis,Mirella Lapata,Jeff Z. Pan

Main category: cs.CL

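TL;DR: Studies how language models perform on multi-hop question answering, finding that encoder-decoder models outperform causal decoder-only models, that document order affects performance, and that modifying the causal mask improves results.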

Details

Motivation: To examine how language models perform on multi-hop QA, analyze why their reasoning is limited, and propose improvements. Method: Permutes the order of search results, compares the performance of different language models, and modifies the causal mask to add bi-directional attention. Result: Encoder-decoder models perform better; performance is best when document order matches the reasoning chain order; modifying the mask improves causal decoder-only models. Conclusion: Model architecture and document order are key factors in multi-hop QA, and bi-directional attention can substantially improve performance. Abstract: Multi-hop Question Answering (MHQA) adds layers of complexity to question answering, making it more challenging. When Language Models (LMs) are prompted with multiple search results, they are tasked not only with retrieving relevant information but also employing multi-hop reasoning across the information sources. Although LMs perform well on traditional question-answering tasks, the causal mask can hinder their capacity to reason across complex contexts. In this paper, we explore how LMs respond to multi-hop questions by permuting search results (retrieved documents) under various configurations. Our study reveals interesting findings as follows: 1) Encoder-decoder models, such as the ones in the Flan-T5 family, generally outperform causal decoder-only LMs in MHQA tasks, despite being significantly smaller in size; 2) altering the order of gold documents reveals distinct trends in both Flan T5 models and fine-tuned decoder-only models, with optimal performance observed when the document order aligns with the reasoning chain order; 3) enhancing causal decoder-only models with bi-directional attention by modifying the causal mask can effectively boost their end performance. In addition to the above, we conduct a thorough investigation of the distribution of LM attention weights in the context of MHQA. Our experiments reveal that attention weights tend to peak at higher values when the resulting answer is correct. We leverage this finding to heuristically improve LMs' performance on this task. Our code is publicly available at https://github.com/hwy9855/MultiHopQA-Reasoning.
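
The third finding, adding bi-directional attention by modifying the causal mask, can be illustrated with a small sketch. The abstract does not say which positions are made bi-directional, so the code assumes a prefix-style layout in which the retrieved documents occupy the first `n_context` positions; both function names are hypothetical.

```python
def causal_mask(n):
    """Standard causal mask: position i may attend to position j only if j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def prefix_bidirectional_mask(n, n_context):
    """Causal mask whose first n_context positions attend to each other freely.

    Assumption of this sketch: the retrieved documents form a prefix of
    length n_context, and only that prefix becomes bi-directional. Positions
    after the prefix keep ordinary causal attention.
    """
    if not 0 <= n_context <= n:
        raise ValueError("n_context must be between 0 and n")
    return [[(j <= i) or (i < n_context and j < n_context)
             for j in range(n)]
            for i in range(n)]


# Five positions, the first three belong to the documents.
for row in prefix_bidirectional_mask(5, 3):
    print("".join("1" if allowed else "0" for allowed in row))
```

With such a mask, a token in an early document can see a later document, which plain causal attention forbids and which may matter when the documents are not given in reasoning-chain order.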

[227] Towards Universal Semantics With Large Language Models

Raymond Baartmans,Matthew Raffel,Rahul Vikram,Aiden Deringer,Lizhong Chen

Main category: cs.CL

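TL;DR: Uses large language models (LLMs) to generate Natural Semantic Metalanguage (NSM) explications, introducing automatic evaluation methods and a tailored dataset; the 1B and 8B models outperform GPT-4o.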

Details

Motivation: Producing NSM explications has traditionally been slow and manual; LLMs could improve efficiency and broaden the potential of semantic analysis. Method: Introduces automatic evaluation methods and a tailored dataset, and fine-tunes LLMs to generate NSM explications. Result: The 1B and 8B models outperform GPT-4o at producing accurate, cross-translatable explications. Conclusion: Applying LLMs to NSM opens new possibilities for universal semantic representation and for tasks such as semantic analysis and translation. Abstract: The Natural Semantic Metalanguage (NSM) is a linguistic theory based on a universal set of semantic primes: simple, primitive word-meanings that have been shown to exist in most, if not all, languages of the world. According to this framework, any word, regardless of complexity, can be paraphrased using these primes, revealing a clear and universally translatable meaning. These paraphrases, known as explications, can offer valuable applications for many natural language processing (NLP) tasks, but producing them has traditionally been a slow, manual process. In this work, we present the first study of using large language models (LLMs) to generate NSM explications. We introduce automatic evaluation methods, a tailored dataset for training and evaluation, and fine-tuned models for this task. Our 1B and 8B models outperform GPT-4o in producing accurate, cross-translatable explications, marking a significant step toward universal semantic representation with LLMs and opening up new possibilities for applications in semantic analysis, translation, and beyond.

[228] Retrospex: Language Agent Meets Offline Reinforcement Learning Critic

Yufei Xiang,Yiqun Shen,Yeqin Zhang,Cam-Tu Nguyen

Main category: cs.CL

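TL;DR: Retrospex is an LLM-based agent framework that improves task performance by analyzing past experiences in depth and combining them with evaluations from an RL Critic.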

Details

Motivation: Existing LLM agent frameworks have not fully used past experiences for improvement; Retrospex aims to address this. Method: Combines the LLM's action likelihood with action values estimated by an RL Critic trained through an offline retrospection process, and applies a dynamic action rescoring mechanism. Result: Outperforms existing baselines in the ScienceWorld, ALFWorld, and Webshop environments. Conclusion: Retrospex improves LLM agent performance through experience analysis and dynamic adjustment. Abstract: Large Language Models (LLMs) possess extensive knowledge and commonsense reasoning capabilities, making them valuable for creating powerful agents. However, existing LLM agent frameworks have not fully utilized past experiences for improvement. This work introduces a new LLM-based agent framework called Retrospex, which addresses this challenge by analyzing past experiences in depth. Unlike previous approaches, Retrospex does not directly integrate experiences into the LLM's context. Instead, it combines the LLM's action likelihood with action values estimated by a Reinforcement Learning (RL) Critic, which is trained on past experiences through an offline "retrospection" process. Additionally, Retrospex employs a dynamic action rescoring mechanism that increases the importance of experience-based values for tasks that require more interaction with the environment. We evaluate Retrospex in ScienceWorld, ALFWorld and Webshop environments, demonstrating its advantages over strong, contemporary baselines.
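
The rescoring idea, blending the LLM's action likelihood with critic values and giving the critic more weight as an episode grows longer, can be sketched as follows. The abstract does not give the weighting schedule, so the exponential decay with a floor used here, along with the names `llm_weight`, `rescore`, `decay`, and `floor`, is purely an illustrative assumption. The sketch also assumes both scores have already been normalized to a comparable range.

```python
def llm_weight(step, decay=0.97, floor=0.5):
    """Weight on the LLM score at a given step (hypothetical schedule).

    Starts at 1.0 and decays with the number of environment steps, so the
    critic's experience-based values matter more in longer episodes.
    """
    return max(floor, decay ** step)


def rescore(llm_scores, critic_values, step, decay=0.97, floor=0.5):
    """Blend LLM likelihoods with critic values and pick the best action.

    llm_scores:    dict mapping action -> normalized LLM likelihood.
    critic_values: dict mapping action -> normalized critic value.
    Returns (best_action, combined_scores).
    """
    w = llm_weight(step, decay, floor)
    combined = {a: w * llm_scores[a] + (1 - w) * critic_values[a]
                for a in llm_scores}
    best = max(combined, key=combined.get)
    return best, combined


llm = {"open door": 0.6, "look around": 0.4}
critic = {"open door": 0.1, "look around": 0.9}
print(rescore(llm, critic, step=0)[0])    # LLM preference dominates early
print(rescore(llm, critic, step=100)[0])  # critic has more say later
```

The point of the sketch is only the shape of the mechanism: experiences enter through the critic's values at decision time rather than through the LLM's context.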

[229] Efficiently Building a Domain-Specific Large Language Model from Scratch: A Case Study of a Classical Chinese Large Language Model

Shen Li,Renfen Hu,Lijun Wang

Main category: cs.CL

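TL;DR: AI Taiyan is a language model designed specifically for Classical Chinese; with a reasonable design and training process, it outperforms general-purpose models and traditional domain-specific models with only 1.8 billion parameters.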

Details

Motivation: General-purpose LLMs perform poorly on Classical Chinese, so a dedicated design is needed to incorporate domain knowledge. Method: Develops the AI Taiyan model, combining data processing, foundational training, and fine-tuning. Result: Outperforms other models on punctuation, allusion identification, word-meaning explanation, and translation between ancient and modern Chinese, approaching or surpassing human baselines. Conclusion: Provides a reference for efficiently building domain-specific LLMs and shows application potential in areas such as the collation of ancient texts and dictionary editing. Abstract: General-purpose large language models demonstrate notable capabilities in language comprehension and generation, achieving results that are comparable to, or even surpass, human performance in many language information processing tasks. Nevertheless, when general models are applied to some specific domains, e.g., Classical Chinese texts, their effectiveness is often unsatisfactory, and fine-tuning open-source foundational models similarly struggles to adequately incorporate domain-specific knowledge. To address this challenge, this study developed a large language model, AI Taiyan, specifically designed for understanding and generating Classical Chinese. Experiments show that with a reasonable model design, data processing, foundational training, and fine-tuning, satisfactory results can be achieved with only 1.8 billion parameters. In key tasks related to Classical Chinese information processing such as punctuation, identification of allusions, explanation of word meanings, and translation between ancient and modern Chinese, this model exhibits a clear advantage over both general-purpose large models and domain-specific traditional models, achieving levels close to or surpassing human baselines. This research provides a reference for the efficient construction of specialized domain-specific large language models. Furthermore, the paper discusses the application of this model in fields such as the collation of ancient texts, dictionary editing, and language research, combined with case studies.

[230] BELLE: A Bi-Level Multi-Agent Reasoning Framework for Multi-Hop Question Answering

Taolin Zhang,Dongyang Li,Qizhou Chen,Chengyu Wang,Xiaofeng He

Main category: cs.CL

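TL;DR: Proposes BELLE, a bi-level multi-agent reasoning framework that improves multi-hop QA performance by analyzing the correspondence between multi-hop question types and methods.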

Details

Motivation: Multi-hop QA benefits from combining several methods, but existing approaches do not adequately consider how well a method matches the question type. The authors therefore propose BELLE, which uses multi-agent debate to dynamically select the best combination of methods. Method: BELLE has two levels: at the first level, multiple agents debate to produce an executive plan; at the second level, fast and slow debaters monitor whether changes in viewpoints are reasonable. Result: Experiments show that BELLE significantly outperforms baselines on multiple datasets and is more cost-effective in complex scenarios. Conclusion: By dynamically matching question types to methods, BELLE offers an efficient and flexible solution for multi-hop QA. Abstract: Multi-hop question answering (QA) involves finding multiple relevant passages and performing step-by-step reasoning to answer complex questions. Previous works on multi-hop QA employ specific methods from different modeling perspectives based on large language models (LLMs), regardless of the question types. In this paper, we first conduct an in-depth analysis of public multi-hop QA benchmarks, dividing the questions into four types and evaluating five types of cutting-edge methods for multi-hop QA: Chain-of-Thought (CoT), Single-step, Iterative-step, Sub-step, and Adaptive-step. We find that different types of multi-hop questions have varying degrees of sensitivity to different types of methods. Thus, we propose a Bi-levEL muLti-agEnt reasoning (BELLE) framework to address multi-hop QA by specifically focusing on the correspondence between question types and methods, where each type of method is regarded as an "operator" by prompting LLMs differently. The first level of BELLE includes multiple agents that debate to obtain an executive plan of combined "operators" to address the multi-hop QA task comprehensively. During the debate, in addition to the basic roles of affirmative debater, negative debater, and judge, at the second level, we further leverage fast and slow debaters to monitor whether changes in viewpoints are reasonable. Extensive experiments demonstrate that BELLE significantly outperforms strong baselines in various datasets. Additionally, BELLE's model consumption is more cost-effective than that of single models in more complex multi-hop QA scenarios.

[231] Chain-of-Model Learning for Language Model

Kaitao Song,Xiaohua Wang,Xu Tan,Huiqiang Jiang,Chengruidong Zhang,Yongliang Shen,Cen LU,Zihao Li,Zifan Song,Caihua Shan,Yansen Wang,Kan Ren,Xiaoqing Zheng,Tao Qin,Yuqing Yang,Dongsheng Li,Lili Qiu

Main category: cs.CL

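TL;DR: Proposes a new learning paradigm called Chain-of-Model (CoM), which introduces causal relationships into the hidden states as a chain-style structure, improving the scalability of training and the flexibility of inference.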

Details

Motivation: To address the limits of conventional models in scalability and flexibility, the paper proposes a framework based on chained hidden states. Method: Introduces the concept of Chain-of-Representation (CoR), which splits hidden states into multiple chains of sub-representations, and designs Chain-of-Language-Model (CoLM) and its variant CoLM-Air. Result: Experiments show that the CoLM family performs comparably to the standard Transformer while offering greater flexibility and scalability. Conclusion: The CoM framework offers a new direction for building language models, with advantages in training efficiency and inference flexibility. Abstract: In this paper, we propose a novel learning paradigm, termed Chain-of-Model (CoM), which incorporates the causal relationship into the hidden states of each layer in a chain style, thereby introducing great scaling efficiency in model training and inference flexibility in deployment. We introduce the concept of Chain-of-Representation (CoR), which formulates the hidden states at each layer as a combination of multiple sub-representations (i.e., chains) at the hidden dimension level. In each layer, each chain from the output representations can only view all of its preceding chains in the input representations. Consequently, the model built upon CoM framework can progressively scale up the model size by increasing the chains based on the previous models (i.e., chains), and offer multiple sub-models at varying sizes for elastic inference by using different chain numbers. Based on this principle, we devise Chain-of-Language-Model (CoLM), which incorporates the idea of CoM into each layer of Transformer architecture. Based on CoLM, we further introduce CoLM-Air by introducing a KV sharing mechanism, that computes all keys and values within the first chain and then shares them across all chains. This design demonstrates additional extensibility, such as enabling seamless LM switching, prefilling acceleration and so on. Experimental results demonstrate our CoLM family can achieve comparable performance to the standard Transformer, while simultaneously enabling greater flexibility, such as progressive scaling to improve training efficiency and offering multiple varying model sizes for elastic inference, paving a new way toward building language models.
Our code will be released in the future at: https://github.com/microsoft/CoLM.
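
The rule that each output chain can only view its preceding input chains amounts to a block lower-triangular constraint on a layer's weights. The sketch below shows this for a single linear map in plain Python. It assumes that the input and output share the same chain partition and that an output chain may also read the input chain at its own index; `chain_mask` and `chain_linear` are hypothetical names, and this is not the released CoLM code.

```python
def chain_mask(chain_sizes):
    """mask[o][i] is True when output unit o may read input unit i.

    A unit in chain k may read input units from chains 0..k (assumed to
    include its own chain), which gives a block lower-triangular pattern.
    """
    chain_of = [k for k, size in enumerate(chain_sizes) for _ in range(size)]
    n = len(chain_of)
    return [[chain_of[i] <= chain_of[o] for i in range(n)] for o in range(n)]


def chain_linear(weight, x, chain_sizes):
    """Apply a linear map while ignoring weights that violate the chain rule."""
    mask = chain_mask(chain_sizes)
    return [sum(w * xi for w, xi, allowed in zip(w_row, x, m_row) if allowed)
            for w_row, m_row in zip(weight, mask)]


# Hidden size 4 split into two chains of size 2.
weight = [[1.0] * 4 for _ in range(4)]
print(chain_linear(weight, [1.0, 2.0, 3.0, 4.0], [2, 2]))
```

Because the first chain's outputs never depend on later chains, the first chain alone behaves as a smaller self-contained sub-model, which is the property the abstract relies on for progressive scaling and elastic inference.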

[232] Not All Thoughts are Generated Equal: Efficient LLM Reasoning via Multi-Turn Reinforcement Learning

Yansong Ning,Wei Li,Jun Fang,Naiqiang Tan,Hao Liu

Main category: cs.CL

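TL;DR: Proposes Long⊗Short, an efficient reasoning framework in which a long-thought LLM and a short-thought LLM collaborate, substantially reducing token length during reasoning while maintaining performance.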

Details

Motivation: Existing methods compress all thoughts in a long chain-of-thought equally, which limits how concise and effective reasoning can be. The paper therefore studies the importance of different thoughts and proposes a more efficient compression strategy. Method: Assesses the importance of thoughts through automatic long-CoT chunking and Monte Carlo rollouts, proposes a theoretically bounded metric, and designs the Long⊗Short framework, in which long-thought and short-thought LLMs collaborate. Result: Experiments show the method reduces token length by over 80% across multiple benchmarks while achieving performance comparable to the DeepSeek-R1-Distill baselines. Conclusion: Through collaborative optimization, the Long⊗Short framework improves reasoning efficiency while maintaining performance. Abstract: Compressing long chain-of-thought (CoT) from large language models (LLMs) is an emerging strategy to improve the reasoning efficiency of LLMs. Despite its promising benefits, existing studies equally compress all thoughts within a long CoT, hindering more concise and effective reasoning. To this end, we first investigate the importance of different thoughts by examining their effectiveness and efficiency in contributing to reasoning through automatic long CoT chunking and Monte Carlo rollouts. Building upon the insights, we propose a theoretically bounded metric to jointly measure the effectiveness and efficiency of different thoughts. We then propose Long⊗Short, an efficient reasoning framework that enables two LLMs to collaboratively solve the problem: a long-thought LLM for more effectively generating important thoughts, while a short-thought LLM for efficiently generating remaining thoughts. Specifically, we begin by synthesizing a small amount of cold-start data to fine-tune LLMs for long-thought and short-thought reasoning styles, respectively. Furthermore, we propose a synergizing-oriented multi-turn reinforcement learning, focusing on the model self-evolution and collaboration between long-thought and short-thought LLMs. Experimental results show that our method enables Qwen2.5-7B and Llama3.1-8B to achieve performance comparable to DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B, while reducing token length by over 80% across the MATH500, AIME24/25, AMC23, and GPQA Diamond benchmarks. Our data and code are available at https://github.com/yasNing/Long-otimes-Short/.

[233] Class Distillation with Mahalanobis Contrast: An Efficient Training Paradigm for Pragmatic Language Understanding Tasks

Chenlu Wang,Weimin Lyu,Ritwik Banerjee

Main category: cs.CL

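TL;DR: Proposes ClaD, a new training paradigm for distilling a small target class from a diverse background; using a structure-aware loss function and an interpretable decision algorithm, it performs well on sexism, metaphor, and sarcasm detection.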

Details

Motivation: Existing classifiers for deviant or nuanced language are computationally expensive and data-hungry, so a more efficient approach is needed. Method: ClaD combines a structure-aware loss function based on Mahalanobis distance with an interpretable decision algorithm optimized for class separation. Result: On sexism, metaphor, and sarcasm detection, ClaD outperforms baselines and, with smaller language models and far fewer parameters, achieves performance comparable to several LLMs. Conclusion: ClaD is an efficient tool for language understanding tasks that require extracting a small target class from a complex background. Abstract: Detecting deviant language such as sexism, or nuanced language such as metaphors or sarcasm, is crucial for enhancing the safety, clarity, and interpretation of online social discourse. While existing classifiers deliver strong results on these tasks, they often come with significant computational cost and high data demands. In this work, we propose Class Distillation (ClaD), a novel training paradigm that targets the core challenge: distilling a small, well-defined target class from a highly diverse and heterogeneous background. ClaD integrates two key innovations: (i) a loss function informed by the structural properties of class distributions, based on Mahalanobis distance, and (ii) an interpretable decision algorithm optimized for class separation. Across three benchmark detection tasks -- sexism, metaphor, and sarcasm -- ClaD outperforms competitive baselines, and even with smaller language models and orders of magnitude fewer parameters, achieves performance comparable to several large language models (LLMs). These results demonstrate ClaD as an efficient tool for pragmatic language understanding tasks that require gleaning a small target class from a larger heterogeneous background.
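
The abstract names Mahalanobis distance as the basis of the loss but does not give the loss itself. The sketch below therefore shows only the distance: how far an embedding lies from a target class once that class's mean and covariance are taken into account. The helper names are hypothetical, and the linear solve uses plain Gaussian elimination to keep the example dependency-free.

```python
import math


def mean_and_cov(points):
    """Sample mean and (unbiased) covariance matrix of a list of vectors."""
    n, d = len(points), len(points[0])
    mu = [sum(p[k] for p in points) / n for k in range(d)]
    cov = [[sum((p[a] - mu[a]) * (p[b] - mu[b]) for p in points) / (n - 1)
            for b in range(d)] for a in range(d)]
    return mu, cov


def solve(matrix, rhs):
    """Solve matrix @ z = rhs by Gaussian elimination with partial pivoting."""
    d = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, d):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, d + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * d
    for r in range(d - 1, -1, -1):  # back substitution
        s = sum(aug[r][c] * z[c] for c in range(r + 1, d))
        z[r] = (aug[r][d] - s) / aug[r][r]
    return z


def mahalanobis(x, mu, cov):
    """sqrt((x - mu)^T cov^{-1} (x - mu)), computed via a linear solve."""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    z = solve(cov, diff)
    return math.sqrt(sum(di * zi for di, zi in zip(diff, z)))


# Toy 2-D "embeddings" of a target class.
target = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
mu, cov = mean_and_cov(target)
print(mahalanobis([3.0, 1.0], mu, cov))
```

Unlike Euclidean distance, this measure shrinks along directions in which the target class naturally varies, which is presumably why it suits a small, well-defined class set against a heterogeneous background.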

[234] Multilingual Collaborative Defense for Large Language Models

Hongliang Li,Jinan Xu,Gengping Cui,Changhao Guan,Fengran Mo,Kaiyu Huang

Main category: cs.CL

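TL;DR: Proposes Multilingual Collaborative Defense (MCD), which optimizes a continuous soft safety prompt to improve the safety of large language models (LLMs) in multilingual scenarios.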

Details

Motivation: The safety of LLMs in multilingual settings is under-studied, particularly the vulnerability of bypassing safeguards through translation, and a solution is urgently needed. Method: Proposes MCD, which automatically optimizes a safety prompt to improve multilingual safeguarding while maintaining generalization and reducing false refusal rates. Result: Experiments show that MCD outperforms existing methods in multilingual safeguarding and language transfer. Conclusion: MCD offers an effective solution for the multilingual safety of LLMs, with potential for practical use. Abstract: The robustness and security of large language models (LLMs) have become a prominent research area. One notable vulnerability is the ability to bypass LLM safeguards by translating harmful queries into rare or underrepresented languages, a simple yet effective method of "jailbreaking" these models. Despite the growing concern, there has been limited research addressing the safeguarding of LLMs in multilingual scenarios, highlighting an urgent need to enhance multilingual safety. In this work, we investigate the correlation between various attack features across different languages and propose Multilingual Collaborative Defense (MCD), a novel learning method that optimizes a continuous, soft safety prompt automatically to facilitate multilingual safeguarding of LLMs. The MCD approach offers three advantages: First, it effectively improves safeguarding performance across multiple languages. Second, MCD maintains strong generalization capabilities while minimizing false refusal rates. Third, MCD mitigates the language safety misalignment caused by imbalances in LLM training corpora. To evaluate the effectiveness of MCD, we manually construct multilingual versions of commonly used jailbreak benchmarks, such as MaliciousInstruct and AdvBench, to assess various safeguarding methods. Additionally, we introduce these datasets in underrepresented (zero-shot) languages to verify the language transferability of MCD. The results demonstrate that MCD outperforms existing approaches in safeguarding against multilingual jailbreak attempts while also exhibiting strong language transfer capabilities. Our code is available at https://github.com/HLiang-Lee/MCD.

[235] When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification of Scientific Research

Guijin Son,Jiwoo Hong,Honglu Fan,Heejeong Nam,Hyunwoo Ko,Seungwon Lim,Jinyeop Song,Jinha Choi,Gonçalo Paulo,Youngjae Yu,Stella Biderman

Main category: cs.CL

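TL;DR: Explores using large language models (LLMs) as verifiers to automate the academic verification of scientific manuscripts, and finds that current models fall far short of reliable performance.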

Details

Motivation: To explore a complementary application of LLMs in academic verification, beyond the generative tasks studied in prior work. Method: Introduces SPOT, a dataset of 83 papers paired with 91 errors, and evaluates LLMs on error detection. Result: No model exceeds 21.1% recall or 6.1% precision; models are inconsistent across runs, and their mistakes resemble student-level misconceptions. Conclusion: There is a substantial gap between current LLM capabilities and the requirements for dependable academic verification. Abstract: Recent advances in large language models (LLMs) have fueled the vision of automated scientific discovery, often called AI Co-Scientists. To date, prior work casts these systems as generative co-authors responsible for crafting hypotheses, synthesizing code, or drafting manuscripts. In this work, we explore a complementary application: using LLMs as verifiers to automate the academic verification of scientific manuscripts. To that end, we introduce SPOT, a dataset of 83 published papers paired with 91 errors significant enough to prompt errata or retraction, cross-validated with actual authors and human annotators. Evaluating state-of-the-art LLMs on SPOT, we find that none surpasses 21.1% recall or 6.1% precision (o3 achieves the best scores, with all others near zero). Furthermore, confidence estimates are uniformly low, and across eight independent runs, models rarely rediscover the same errors, undermining their reliability. Finally, qualitative analysis with domain experts reveals that even the strongest models make mistakes resembling student-level misconceptions derived from misunderstandings. These findings highlight the substantial gap between current LLM capabilities and the requirements for dependable AI-assisted academic verification.

[236] NAMET: Robust Massive Model Editing via Noise-Aware Memory Optimization

Yanbo Dai,Zhenlan Ji,Zongjie Li,Shuai Wang

Main category: cs.CL

TL;DR: NAMET is a method that improves massive knowledge editing by introducing noise, and it outperforms existing methods.

Details

Motivation: Existing model editing methods degrade in massive editing scenarios, mainly because of embedding collisions among knowledge items. Method: Proposes NAMET, which introduces noise during memory extraction (a one-line modification to MEMIT). Result: Experiments on six LLMs and three datasets show that NAMET outperforms existing methods when editing thousands of facts. Conclusion: NAMET's noise-aware mechanism effectively addresses embedding collisions in massive knowledge editing. Abstract: Model editing techniques are essential for efficiently updating knowledge in large language models (LLMs). However, the effectiveness of existing approaches degrades in massive editing scenarios, particularly when evaluated with practical metrics or in context-rich settings. We attribute these failures to embedding collisions among knowledge items, which undermine editing reliability at scale. To address this, we propose NAMET (Noise-aware Model Editing in Transformers), a simple yet effective method that introduces noise during memory extraction via a one-line modification to MEMIT. Extensive experiments across six LLMs and three datasets demonstrate that NAMET consistently outperforms existing methods when editing thousands of facts.

[237] AutoMedEval: Harnessing Language Models for Automatic Medical Capability Evaluation

Xiechi Zhang,Zetian Ouyang,Linlin Wang,Gerard de Melo,Zhu Cao,Xiaoling Wang,Ya Zhang,Yanfeng Wang,Liang He

Main category: cs.CL

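TL;DR: AutoMedEval is an open-source automatic evaluation model designed to assess the question-answering ability of medical large language models (LLMs), with the aim of reducing reliance on human evaluation.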

Details

Motivation: Traditional metrics such as F1 and ROUGE overlook the importance of medical terminology, while human evaluation is costly and can be inaccurate. Existing LLM-based evaluation methods have limited applicability in the medical domain. Method: Proposes AutoMedEval, trained with a hierarchical method that includes curriculum instruction tuning and an iterative knowledge introspection mechanism, so that it acquires professional medical assessment ability from limited instruction data. Result: In human evaluations, AutoMedEval outperforms other baselines and correlates more closely with human judgments. Conclusion: AutoMedEval offers an efficient and reliable automatic evaluation solution for medical LLMs and reduces reliance on human evaluation. Abstract: With the proliferation of large language models (LLMs) in the medical domain, there is increasing demand for improved evaluation techniques to assess their capabilities. However, traditional metrics like F1 and ROUGE, which rely on token overlaps to measure quality, significantly overlook the importance of medical terminology. While human evaluation tends to be more reliable, it can be very costly and may also suffer from inaccuracies due to limits in human expertise and motivation. Although there are some evaluation methods based on LLMs, their usability in the medical field is limited due to their proprietary nature or lack of expertise. To tackle these challenges, we present AutoMedEval, an open-sourced automatic evaluation model with 13B parameters specifically engineered to measure the question-answering proficiency of medical LLMs. The overarching objective of AutoMedEval is to assess the quality of responses produced by diverse models, aspiring to significantly reduce the dependence on human evaluation. Specifically, we propose a hierarchical training method involving curriculum instruction tuning and an iterative knowledge introspection mechanism, enabling AutoMedEval to acquire professional medical assessment capabilities with limited instructional data. Human evaluations indicate that AutoMedEval surpasses other baselines in terms of correlation with human judgments.

[238] Mobile-Bench-v2: A More Realistic and Comprehensive Benchmark for VLM-based Mobile Agents

Weikai Xu,Zhizheng Jiang,Yuxuan Liu,Wei Liu,Jian Luan,Yuanchun Li,Yunxin Liu,Bin Wang,Bo An

Main category: cs.CL

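TL;DR: Proposes Mobile-Bench-v2, a more comprehensive and realistic benchmark for evaluating VLM-based mobile agents on multi-path tasks, noisy environments, and proactive interaction.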

Details

Motivation: Existing benchmarks struggle to provide stable reward signals under dynamic environmental changes and cannot assess agents in noisy environments or in proactive interaction. Method: Uses a slot-based instruction generation method to build Mobile-Bench-v2, which covers multi-path evaluation, noisy environments, and interaction under ambiguous instructions. Result: Evaluates agents such as AppAgent-v1 and Mobile-Agent-v2 on multi-path tasks, noisy environments, and proactive interaction. Conclusion: Mobile-Bench-v2 provides a more comprehensive benchmark for evaluating mobile agents and addresses limitations of existing approaches. Abstract: VLM-based mobile agents are increasingly popular due to their capabilities to interact with smartphone GUIs and XML-structured texts and to complete daily tasks. However, existing online benchmarks struggle with obtaining stable reward signals due to dynamic environmental changes. Offline benchmarks evaluate the agents through single-path trajectories, which stands in contrast to the inherently multi-solution characteristics of GUI tasks. Additionally, both types of benchmarks fail to assess whether mobile agents can handle noise or engage in proactive interactions due to a lack of noisy apps or overly full instructions during the evaluation process. To address these limitations, we use a slot-based instruction generation method to construct a more realistic and comprehensive benchmark named Mobile-Bench-v2. Mobile-Bench-v2 includes a common task split, with offline multi-path evaluation to assess the agent's ability to obtain step rewards during task execution. It contains a noisy split based on pop-ups and ads apps, and a contaminated split named AITZ-Noise to formulate a real noisy environment. Furthermore, an ambiguous instruction split with preset Q&A interactions is released to evaluate the agent's proactive interaction capabilities. We conduct evaluations on these splits using the single-agent framework AppAgent-v1, the multi-agent framework Mobile-Agent-v2, as well as other mobile agents such as UI-Tars and OS-Atlas. Code and data are available at https://huggingface.co/datasets/xwk123/MobileBench-v2.

[239] RLAP: A Reinforcement Learning Enhanced Adaptive Planning Framework for Multi-step NLP Task Solving

Zepeng Ding,Dixuan Wang,Ziqin Luo,Guochao Jiang,Deqing Yang,Jiaqing Liang

Main category: cs.CL

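TL;DR: Proposes a reinforcement-learning-enhanced adaptive planning framework (RLAP) to improve the performance of large language models on multi-step natural language processing tasks.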

Details

Motivation: Existing multi-step planning methods overlook the linguistic features of instances and rely only on the intrinsic planning ability of LLMs, leading to suboptimal results. Method: Models an NLP task as a Markov decision process (MDP) and trains a lightweight Actor model through reinforcement learning to estimate Q-values for natural language sequences, dynamically determining the order of subtasks. Result: Experiments on several NLP tasks and datasets verify RLAP's effectiveness and robustness. Conclusion: By combining linguistic features with reinforcement learning, RLAP improves the performance of LLMs on multi-step tasks. Abstract: Multi-step planning has been widely employed to enhance the performance of large language models (LLMs) on downstream natural language processing (NLP) tasks, which decomposes the original task into multiple subtasks and guides LLMs to solve them sequentially without additional training. When addressing task instances, existing methods either preset the order of steps or attempt multiple paths at each step. However, these methods overlook instances' linguistic features and rely on the intrinsic planning capabilities of LLMs to evaluate intermediate feedback and then select subtasks, resulting in suboptimal outcomes. To better solve multi-step NLP tasks with LLMs, in this paper we propose a Reinforcement Learning enhanced Adaptive Planning framework (RLAP). In our framework, we model an NLP task as a Markov decision process (MDP) and incorporate an LLM directly into the environment. In particular, a lightweight Actor model is trained to estimate Q-values for natural language sequences consisting of states and actions through reinforcement learning. Therefore, during sequential planning, the linguistic features of each sequence in the MDP can be taken into account, and the Actor model interacts with the LLM to determine the optimal order of subtasks for each task instance. We apply RLAP on three different types of NLP tasks and conduct extensive experiments on multiple datasets to verify RLAP's effectiveness and robustness.

[240] Recursive Question Understanding for Complex Question Answering over Heterogeneous Personal Data

Philipp Christmann,Gerhard Weikum

Main category: cs.CL

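TL;DR: ReQAP builds an executable operator tree through recursive decomposition for question answering over mixed sources (such as text and tables), and supports on-device processing of personal data.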

Details

Motivation: Personal devices generate large amounts of heterogeneous data every day (calendar entries, shopping records, and so on), calling for a lightweight method that gives users convenient access while keeping data private. Method: Recursively decomposes the question into an operator tree whose operators integrate structured and unstructured data; executing the tree yields a traceable answer. Result: Presents the ReQAP method and releases the PerQA benchmark, which covers a diverse range of user needs. Conclusion: ReQAP offers an efficient, privacy-friendly solution for question answering over mixed sources, and the PerQA benchmark further supports its practicality. Abstract: Question answering over mixed sources, like text and tables, has been advanced by verbalizing all contents and encoding them with a language model. A prominent case of such heterogeneous data is personal information: user devices log vast amounts of data every day, such as calendar entries, workout statistics, shopping records, streaming history, and more. Information needs range from simple look-ups to queries of analytical nature. The challenge is to provide humans with convenient access with a small footprint, so that all personal data stays on the user devices. We present ReQAP, a novel method that creates an executable operator tree for a given question, via recursive decomposition. Operators are designed to enable seamless integration of structured and unstructured sources, and the execution of the operator tree yields a traceable answer. We further release the PerQA benchmark, with persona-based data and questions, covering a diverse spectrum of realistic user needs.

[241] ELITE: Embedding-Less retrieval with Iterative Text Exploration

Zhangyu Wang,Siyuan Gao,Rong Zhou,Hao Wang,Li Ning

Main category: cs.CL

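TL;DR: Proposes an embedding-free retrieval framework that uses the logical reasoning ability of LLMs to improve retrieval while reducing storage and computation overhead.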

Details

Motivation: Existing RAG systems rely on embedding-based retrieval, which can return content that is semantically similar but misaligned with the question's intent, and graph- or hierarchy-based variants add high overhead. Method: Retrieves with the logical reasoning ability of LLMs, using iterative search space refinement guided by an importance measure and avoiding explicit graph construction. Result: Outperforms baselines on long-context QA benchmarks while reducing storage and runtime by over an order of magnitude. Conclusion: The proposed framework is efficient and performs well, addressing limitations of existing RAG systems. Abstract: Large Language Models (LLMs) have achieved impressive progress in natural language processing, but their limited ability to retain long-term context constrains performance on document-level or multi-turn tasks. Retrieval-Augmented Generation (RAG) mitigates this by retrieving relevant information from an external corpus. However, existing RAG systems often rely on embedding-based retrieval trained on corpus-level semantic similarity, which can lead to retrieving content that is semantically similar in form but misaligned with the question's true intent. Furthermore, recent RAG variants construct graph- or hierarchy-based structures to improve retrieval accuracy, resulting in significant computation and storage overhead. In this paper, we propose an embedding-free retrieval framework. Our method leverages the logical inferencing ability of LLMs in retrieval using iterative search space refinement guided by our novel importance measure and extends our retrieval results with logically related information without explicit graph construction. Experiments on long-context QA benchmarks, including NovelQA and Marathon, show that our approach outperforms strong baselines while reducing storage and runtime by over an order of magnitude.

[242] Enhancing Complex Instruction Following for Large Language Models with Mixture-of-Contexts Fine-tuning

Yuheng Lu,ZiMeng Bai,Caixia Yuan,Huixing Jiang,Xiaojie Wang

Main category: cs.CL

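TL;DR: Proposes MISO, which converts sequential instructions into parallel sub-instructions to improve the performance of large language models on complex instruction following.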

Details

Motivation: LLMs follow complex instructions inconsistently, and existing methods pay too little attention to crucial sub-contexts, which reduces the effectiveness of supervised fine-tuning. Method: Proposes the MISO framework, which decomposes a sequential instruction into parallel sub-instructions and uses a mixture-of-contexts paradigm to improve supervised fine-tuning. Result: Experiments show that MISO is effective on complex instruction-following tasks and has potential for higher training efficiency. Conclusion: MISO is an effective supervised fine-tuning method that improves the ability of LLMs to handle complex instructions. Abstract: Large language models (LLMs) exhibit remarkable capabilities in handling natural language tasks; however, they may struggle to consistently follow complex instructions, including those that involve multiple constraints. Post-training LLMs using supervised fine-tuning (SFT) is a standard approach to improve their ability to follow instructions. In addressing complex instruction following, existing efforts primarily focus on data-driven methods that synthesize complex instruction-output pairs for SFT. However, insufficient attention allocated to crucial sub-contexts may reduce the effectiveness of SFT. In this work, we propose transforming sequentially structured input instruction into multiple parallel instructions containing sub-contexts. To support processing this multi-input, we propose MISO (Multi-Input Single-Output), an extension to currently dominant decoder-only transformer-based LLMs. MISO introduces a mixture-of-contexts paradigm that jointly considers the overall instruction-output alignment and the influence of individual sub-contexts to enhance SFT effectiveness. We apply MISO fine-tuning to complex instruction-following datasets and evaluate it with standard LLM inference. Empirical results demonstrate the superiority of MISO as a fine-tuning method for LLMs, both in terms of effectiveness in complex instruction-following scenarios and its potential for training efficiency.

[243] An Explanation of Intrinsic Self-Correction via Linear Representations and Latent Concepts

Yu-Ting Lee,Hui-Ying Shih,Fu-Chieh Chang,Pei-Yuan Wu

Main category: cs.CL

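TL;DR: Studies the mechanism by which language models improve through self-correction, proposing that prompt-induced changes in hidden states affect the output distribution, and supports this with a mathematical formulation and experiments.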

Details

Motivation: To explain why self-correction improves language model performance and to show how prompts affect the output distribution through changes in hidden states. Method: Hypothesizes that prompt-induced shifts lie in the span of linear representation vectors, gives a mathematical formulation of self-correction, and tests it with text detoxification experiments. Result: Experiments show that prompt-induced shifts clearly separate toxic from non-toxic tokens, suggesting that self-correction strengthens latent concept recognition. Conclusion: The study sheds light on the mechanism of self-correction and offers new insight into how prompting works in an interpretable way. Abstract: We provide an explanation for the performance gains of intrinsic self-correction, a process where a language model iteratively refines its outputs without external feedback. More precisely, we investigate how prompting induces interpretable changes in hidden states and thus affects the output distributions. We hypothesize that each prompt-induced shift lies in a linear span of some linear representation vectors, naturally separating tokens based on individual concept alignment. Building around this idea, we give a mathematical formulation of self-correction and derive a concentration result for output tokens based on alignment magnitudes. Our experiments on text detoxification with zephyr-7b-sft reveal a substantial gap in the inner products of the prompt-induced shifts and the unembeddings of the top-100 most toxic tokens vs. those of the unembeddings of the bottom-100 least toxic tokens, under toxic instructions. This suggests that self-correction prompts enhance a language model's capability of latent concept recognition. Our analysis offers insights into the underlying mechanism of self-correction by characterizing how prompting works explainably. For reproducibility, our code is available.

[244] Neuro-Symbolic Query Compiler

Yuyao Zhang,Zhicheng Dou,Xiaoxi Li,Jiajie Jin,Yongkang Wu,Zhonghua Li,Qi Ye,Ji-Rong Wen

Main category: cs.CL

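TL;DR: QCompiler is a neuro-symbolic framework that improves intent recognition for complex queries in RAG systems by designing a BNF grammar and a compilation pipeline.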

Details

Motivation: To address imprecise search-intent recognition in RAG systems under resource constraints and for complex queries. Method: Designs a minimally redundant BNF grammar and combines a Query Expression Translator, a Lexical Syntax Parser, and a Recursive Descent Processor to compile queries into ASTs. Result: Atomic sub-queries make document retrieval and response generation more precise, significantly improving the handling of complex queries. Conclusion: QCompiler gives RAG systems an efficient way to handle complex queries. Abstract: Precise recognition of search intent in Retrieval-Augmented Generation (RAG) systems remains a challenging goal, especially under resource constraints and for complex queries with nested structures and dependencies. This paper presents QCompiler, a neuro-symbolic framework inspired by linguistic grammar rules and compiler design, to bridge this gap. It theoretically designs a minimal yet sufficient Backus-Naur Form (BNF) grammar $G[q]$ to formalize complex queries. Unlike previous methods, this grammar maintains completeness while minimizing redundancy. Based on this, QCompiler includes a Query Expression Translator, a Lexical Syntax Parser, and a Recursive Descent Processor to compile queries into Abstract Syntax Trees (ASTs) for execution. The atomicity of the sub-queries in the leaf nodes ensures more precise document retrieval and response generation, significantly improving the RAG system's ability to address complex queries.
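
To make the "grammar to AST" idea concrete, here is a minimal recursive-descent sketch. The grammar is a toy assumed for illustration, not the paper's $G[q]$: `query ::= term { ("AND" | "THEN") term }` and `term ::= "(" query ")" | ATOM`, where an ATOM is a run of plain words. The operator names and the tuple-based AST are likewise hypothetical.

```python
import re

OPERATORS = ("AND", "THEN")  # hypothetical operators for the toy grammar


def tokenize(text):
    """Split a query into parentheses and whitespace-delimited words."""
    return re.findall(r"\(|\)|[^\s()]+", text)


def parse(tokens):
    """Parse tokens into a nested-tuple AST; leaves are atomic sub-queries."""
    pos = 0

    def parse_query():
        nonlocal pos
        node = parse_term()
        # Left-associative chain of binary operators.
        while pos < len(tokens) and tokens[pos] in OPERATORS:
            op = tokens[pos]
            pos += 1
            node = (op, node, parse_term())
        return node

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of query")
        if tokens[pos] == "(":
            pos += 1
            node = parse_query()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise SyntaxError("expected ')'")
            pos += 1
            return node
        words = []
        while pos < len(tokens) and tokens[pos] not in ("(", ")") + OPERATORS:
            words.append(tokens[pos])
            pos += 1
        if not words:
            raise SyntaxError("expected an atomic sub-query")
        return ("ATOM", " ".join(words))

    tree = parse_query()
    if pos != len(tokens):
        raise SyntaxError("unexpected token: " + tokens[pos])
    return tree


ast = parse(tokenize("(who wrote Dune AND when was it published) THEN summarize"))
print(ast)
```

Each `ATOM` leaf is a sub-query that could be sent to a retriever on its own, which is the property the abstract attributes to the leaf nodes.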

[245] ChartEdit: How Far Are MLLMs From Automating Chart Analysis? Evaluating MLLMs' Capability via Chart Editing

Xuanle Zhao,Xuexin Liu,Haoyue Yang,Xianzhen Luo,Fanhu Zeng,Jianling Li,Qi Shi,Chi Chen

Main category: cs.CL

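TL;DR: The paper proposes ChartEdit, a new benchmark for evaluating multimodal large language models (MLLMs) on chart editing tasks, and finds that current models still struggle with precise modifications.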

Details

Motivation: Chart editing is labor-intensive for humans and demanding for MLLMs, yet existing evaluation methods are inadequate, so a comprehensive evaluation framework is urgently needed. Method: Proposes the ChartEdit benchmark, comprising 1,405 diverse editing instructions and 233 real-world charts with manual annotation and validation, and evaluates 10 mainstream MLLMs at the code and chart levels. Result: Large-scale models can generate code whose output partially matches the reference images, but their precise editing ability is limited (the SOTA model scores only 59.96); small-scale models perform worse. Conclusion: Chart editing remains challenging, and further research is needed to improve model performance. Abstract: Although multimodal large language models (MLLMs) show promise in generating chart rendering code, chart editing presents a greater challenge. This difficulty stems from its nature as a labor-intensive task for humans that also requires MLLMs to integrate chart understanding, complex reasoning, and precise intent interpretation. While many MLLMs claim such editing capabilities, current assessments typically rely on limited case studies rather than robust evaluation methodologies, highlighting the urgent need for a comprehensive evaluation framework. In this work, we propose ChartEdit, a new high-quality benchmark designed for chart editing tasks. This benchmark comprises 1,405 diverse editing instructions applied to 233 real-world charts, with each instruction-chart instance having been manually annotated and validated for accuracy. Utilizing ChartEdit, we evaluate the performance of 10 mainstream MLLMs across two types of experiments, assessing them at both the code and chart levels. The results suggest that large-scale models can generate code to produce images that partially match the reference images. However, their ability to generate accurate edits according to the instructions remains limited. The state-of-the-art (SOTA) model achieves a score of only 59.96, highlighting significant challenges in precise modification. In contrast, small-scale models, including chart-domain models, struggle both with following editing instructions and generating overall chart images, underscoring the need for further development in this area. Code is available at https://github.com/xxlllz/ChartEdit.

[246] Counterspeech the ultimate shield! Multi-Conditioned Counterspeech Generation through Attributed Prefix Learning

Aswini Kumar Padhi,Anil Bandhakavi,Tanmoy Chakraborty

Main category: cs.CL

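TL;DR: HiPPrO proposes a new two-stage framework that combines hierarchical prefix learning and preference optimization with multi-attribute conditioning to generate more effective counterspeech.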

Details

Motivation: Existing work conditions counterspeech generation on a single attribute only, whereas combining multiple attributes can produce more nuanced and effective responses. Method: HiPPrO uses a two-stage approach, hierarchical prefix embedding-space optimization followed by preference optimization, and extends the IntentCONANv2 dataset with emotion labels. Result: HiPPrO improves intent conformity by about 38%, with small gains in Rouge scores; human evaluation finds its counterspeech more relevant and appropriate. Conclusion: Multi-attribute conditioning can significantly improve counterspeech generation systems. Abstract: Counterspeech has proven to be a powerful tool to combat hate speech online. Previous studies have focused on generating counterspeech conditioned only on specific intents (single attributed). However, a holistic approach considering multiple attributes simultaneously can yield more nuanced and effective responses. Here, we introduce HiPPrO, Hierarchical Prefix learning with Preference Optimization, a novel two-stage framework that utilizes the effectiveness of attribute-specific prefix embedding spaces hierarchically optimized during the counterspeech generation process in the first phase. Thereafter, we incorporate both reference and reward-free preference optimization to generate more constructive counterspeech. Furthermore, we extend IntentCONANv2 by annotating all 13,973 counterspeech instances with emotion labels by five annotators. HiPPrO leverages hierarchical prefix optimization to integrate these dual attributes effectively. An extensive evaluation demonstrates that HiPPrO achieves a ~38% improvement in intent conformity and ~3%, ~2%, and ~3% improvements in Rouge-1, Rouge-2, and Rouge-L, respectively, compared to several baseline models. Human evaluations further substantiate the superiority of our approach, highlighting the enhanced relevance and appropriateness of the generated counterspeech. This work underscores the potential of multi-attribute conditioning in advancing the efficacy of counterspeech generation systems.

[247] EmoHopeSpeech: An Annotated Dataset of Emotions and Hope Speech in English

Md. Rafiul Biswas,Wajdi Zaghouani

Main category: cs.CL

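TL;DR: The study presents a bilingual (Arabic and English) dataset of emotions and hope speech, filling the gap in multi-emotion datasets, and validates the data with a baseline model.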

Details

Motivation: To address the scarcity of multi-emotion (emotion and hope) datasets and advance natural language processing for low-resource languages. Method: Builds a bilingual dataset annotated with emotion intensity, complexity, and causes, and uses Fleiss' Kappa to assess annotation agreement. Result: Annotation agreement is high (0.75-0.85), and the baseline model reaches a micro-F1 score of 0.67, supporting the value of the data. Conclusion: The dataset is a valuable resource for natural language processing, supporting cross-linguistic analysis of emotions and hope speech. Abstract: This research introduces a bilingual dataset comprising 23,456 entries for Arabic and 10,036 entries for English, annotated for emotions and hope speech, addressing the scarcity of multi-emotion (emotion and hope) datasets. The dataset provides comprehensive annotations capturing emotion intensity, complexity, and causes, alongside detailed classifications and subcategories for hope speech. To ensure annotation reliability, Fleiss' Kappa was employed, revealing 0.75-0.85 agreement among annotators for both Arabic and English. The evaluation metrics (micro-F1 score = 0.67) obtained from the baseline model (i.e., a machine learning model) support the quality of the data annotations. This dataset offers a valuable resource for advancing natural language processing in underrepresented languages, fostering better cross-linguistic analysis of emotions and hope speech.
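
For readers unfamiliar with the agreement statistic cited above, the following is a small sketch of the standard Fleiss' Kappa formula. The input layout (one row per item, one column per category, each cell the number of annotators choosing that category) and the toy counts are assumptions for illustration; they are not the dataset's annotations.

```python
def fleiss_kappa(counts):
    """Fleiss' Kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every row must sum to the same
    number of raters (at least 2)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Observed agreement: fraction of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    return (p_observed - p_expected) / (1 - p_expected)


# Toy example: 4 items, 3 annotators, 2 categories.
table = [[3, 0], [0, 3], [2, 1], [1, 2]]
kappa = fleiss_kappa(table)
print(round(kappa, 3))
```

Values near 1 indicate near-perfect agreement and values near 0 indicate agreement no better than chance, which gives a scale for reading the reported 0.75-0.85 range.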

[248] CCNU at SemEval-2025 Task 3: Leveraging Internal and External Knowledge of Large Language Models for Multilingual Hallucination Annotation

Xu Liu,Guanyi Chen

Main category: cs.CL

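TL;DR: The CCNU team's hallucination detection method for multilingual question-answering systems uses multiple LLMs annotating in parallel and integrates internal and external knowledge; it performs well across 14 languages and ranks first for Hindi.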

Details

Motivation: To address hallucinations in multilingual question-answering systems and improve detection accuracy and efficiency. Method: Runs multiple LLMs in parallel to simulate crowdsourced annotation, with each annotator drawing on internal and external knowledge. Result: Ranks first for Hindi and in the top five for seven other languages. Conclusion: Parallel multi-LLM annotation is effective; the paper also shares unsuccessful approaches and key insights. Abstract: We present the system developed by the Central China Normal University (CCNU) team for the Mu-SHROOM shared task, which focuses on identifying hallucinations in question-answering systems across 14 different languages. Our approach leverages multiple Large Language Models (LLMs) with distinct areas of expertise, employing them in parallel to annotate hallucinations, effectively simulating a crowdsourcing annotation process. Furthermore, each LLM-based annotator integrates both internal and external knowledge related to the input during the annotation process. Using the open-source LLM DeepSeek-V3, our system achieves the top ranking (#1) for Hindi data and secures a Top-5 position in seven other languages. In this paper, we also discuss unsuccessful approaches explored during our development process and share key insights gained from participating in this shared task.

[249] An Annotated Corpus of Arabic Tweets for Hate Speech Analysis

Md. Rafiul Biswas,Wajdi Zaghouani

Main category: cs.CL

TL;DR: The study builds a multilabel Arabic hate speech dataset of 10,000 annotated tweets and evaluates several Transformer models, of which AraBERTv2 performs best.

Details

Motivation: Dialectal diversity in Arabic makes hate speech hard to identify, so a high-quality dataset is needed to support research. Method: Collects and annotates 10,000 Arabic tweets, computes inter-annotator agreement, and evaluates the dataset with Transformer models. Result: Inter-annotator agreement is 0.86 (offensive content) and 0.71 (hate speech targets). AraBERTv2 achieves a micro-F1 of 0.7865 and an accuracy of 0.786. Conclusion: The study successfully builds an Arabic hate speech dataset and confirms the effectiveness of AraBERTv2. Abstract: Identifying hate speech content in the Arabic language is challenging due to its rich dialectal variation. This study introduces a multilabel hate speech dataset in the Arabic language. We have collected 10,000 Arabic tweets and annotated each tweet for whether it contains offensive content or not. If a text contains offensive content, we further classify it into different hate speech targets such as religion, gender, politics, ethnicity, origin, and others. A text can contain either single or multiple targets. Multiple annotators are involved in the data annotation task. We calculated the inter-annotator agreement, which was 0.86 for offensive content and 0.71 for multiple hate speech targets. Finally, we evaluated the data annotation task by employing different transformer-based models, among which AraBERTv2 performed best with a micro-F1 score of 0.7865 and an accuracy of 0.786.

[250] Unveiling Knowledge Utilization Mechanisms in LLM-based Retrieval-Augmented Generation

Yuhao Wang,Ruiyang Ren,Yucheng Wang,Wayne Xin Zhao,Jing Liu,Hua Wu,Haifeng Wang

Main category: cs.CL

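TL;DR: This paper systematically studies how LLMs integrate internal and external knowledge in RAG, proposes a knowledge stream analysis framework and the KAPE method, and reveals four stages of knowledge utilization and the roles of individual modules.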

Details

Motivation: To explore the intrinsic mechanisms of knowledge utilization by LLMs in RAG in order to improve interpretability and reliability. Method: Applies knowledge stream analysis at the macroscopic level and, at the microscopic level, proposes KAPE to identify neurons, validated by selective deactivation experiments. Result: Reveals four stages of knowledge utilization, finds complementary roles between modules, and achieves targeted control over reliance on each knowledge source. Conclusion: The study provides a foundation for improving the transparency and robustness of LLMs in RAG and advances knowledge-intensive domains. Abstract: Considering the inherent limitations of parametric knowledge in large language models (LLMs), retrieval-augmented generation (RAG) is widely employed to expand their knowledge scope. Since RAG has shown promise in knowledge-intensive tasks like open-domain question answering, its broader application to complex tasks and intelligent assistants has further advanced its utility. Despite this progress, the underlying knowledge utilization mechanisms of LLM-based RAG remain underexplored. In this paper, we present a systematic investigation of the intrinsic mechanisms by which LLMs integrate internal (parametric) and external (retrieved) knowledge in RAG scenarios. Specifically, we employ knowledge stream analysis at the macroscopic level, and investigate the function of individual modules at the microscopic level. Drawing on knowledge streaming analyses, we decompose the knowledge utilization process into four distinct stages within LLM layers: knowledge refinement, knowledge elicitation, knowledge expression, and knowledge contestation. We further demonstrate that the relevance of passages guides the streaming of knowledge through these stages. At the module level, we introduce a new method, knowledge activation probability entropy (KAPE), for identifying neurons associated with either internal or external knowledge. By selectively deactivating these neurons, we achieve targeted shifts in the LLM's reliance on one knowledge source over the other. Moreover, we discern complementary roles for multi-head attention and multi-layer perceptron layers during knowledge formation. These insights offer a foundation for improving interpretability and reliability in retrieval-augmented LLMs, paving the way for more robust and transparent generative solutions in knowledge-intensive domains.

[251] Towards Comprehensive Argument Analysis in Education: Dataset, Tasks, and Method

Yupei Ren,Xinyi Zhou,Ning Zhang,Shangqing Zhao,Man Lan,Xiaopeng Bai

Main category: cs.CL

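TL;DR: The paper proposes 14 fine-grained argument relation types to address the oversimplification of existing argument relations, and runs experiments on three tasks.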

Details

Motivation: Current argument relations are too basic to capture complex argument structures in real-world scenarios, so finer-grained relation types are needed. Method: Proposes 14 fine-grained argument relation types that capture the complex interactions between argument components along vertical and horizontal dimensions, and runs experiments on three tasks (argument component detection, relation prediction, and automated essay grading). Result: Results show that fine-grained argument annotation matters for assessing argumentative writing quality and encourage multi-dimensional argument analysis. Conclusion: Fine-grained argument relations give a better understanding of argument structure and are important for writing quality assessment and multi-dimensional analysis. Abstract: Argument mining has garnered increasing attention over the years, with the recent advancement of Large Language Models (LLMs) further propelling this trend. However, current argument relations remain relatively simplistic and foundational, struggling to capture the full scope of argument information, particularly when it comes to representing complex argument structures in real-world scenarios. To address this limitation, we propose 14 fine-grained relation types from both vertical and horizontal dimensions, thereby capturing the intricate interplay between argument components for a thorough understanding of argument structure. On this basis, we conducted extensive experiments on three tasks: argument component detection, relation prediction, and automated essay grading. Additionally, we explored the impact of writing quality on argument component detection and relation prediction, as well as the connections between discourse relations and argumentative features. The findings highlight the importance of fine-grained argumentative annotations for argumentative writing quality assessment and encourage multi-dimensional argument analysis.

[252] MoL for LLMs: Dual-Loss Optimization to Enhance Domain Expertise While Preserving General Capabilities

Jingxue Chen,Qingkun Tang,Qianchun Lu,Siyuan Fang

Main category: cs.CL

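TL;DR: The paper proposes a new framework called Mixture of Losses (MoL) that decouples the optimization objectives for domain-specific and general corpora, addressing domain-biased data and improper corpus-mixture ratios and significantly improving model performance.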

Details

Motivation: Although LLMs perform well on general tasks, domain-specific applications suffer from hallucinations and limited accuracy. Traditional CPT approaches face two problems: domain-biased data damages general language ability, and corpus-mixture ratios are poorly chosen. Method: Proposes the MoL framework, which applies cross-entropy (CE) loss to domain data to ensure knowledge acquisition and uses KL divergence to align general-corpus training with the base model's core capabilities. Result: Experiments show that a 1:1 domain-to-general corpus ratio best balances training and overfitting. The model is 27.9% more accurate on the Math-500 benchmark in non-think reasoning mode and improves by 83.3% on the AIME25 subset in think mode. Conclusion: MoL preserves general capabilities while strengthening domain expertise, avoids catastrophic forgetting, and clearly outperforms traditional CPT approaches. Abstract: Although LLMs perform well in general tasks, domain-specific applications suffer from hallucinations and accuracy limitations. CPT approaches encounter two key issues: (1) domain-biased data degrades general language skills, and (2) improper corpus-mixture ratios limit effective adaptation. To address these, we propose a novel framework, Mixture of Losses (MoL), which decouples optimization objectives for domain-specific and general corpora. Specifically, cross-entropy (CE) loss is applied to domain data to ensure knowledge acquisition, while Kullback-Leibler (KL) divergence aligns general-corpus training with the base model's foundational capabilities. This dual-loss architecture preserves universal skills while enhancing domain expertise, avoiding catastrophic forgetting. Empirically, we validate that a 1:1 domain-to-general corpus ratio optimally balances training and overfitting without the need for extensive tuning or resource-intensive experiments. Furthermore, our experiments demonstrate significant performance gains compared to traditional CPT approaches, which often suffer from degradation in general language capabilities; our model achieves 27.9% higher accuracy on the Math-500 benchmark in the non-think reasoning mode, and an impressive 83.3% improvement on the challenging AIME25 subset in the think mode, underscoring the effectiveness of our approach.
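
The dual-loss idea can be sketched on toy probability vectors. This is an illustrative reading of the abstract, not the authors' implementation: the example layout (`source`, `model_probs`, `target`, `base_probs`), the unweighted averaging, and the KL direction (base model distribution against the trained model's distribution) are all assumptions.

```python
import math


def cross_entropy(model_probs, target):
    """Negative log-likelihood of the target token under the model."""
    return -math.log(model_probs[target])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mol_loss(batch):
    """Route each example to a loss by corpus type:
    domain examples use cross-entropy against the label,
    general examples use KL against the frozen base model's distribution."""
    total = 0.0
    for example in batch:
        if example["source"] == "domain":
            total += cross_entropy(example["model_probs"], example["target"])
        else:
            total += kl_divergence(example["base_probs"], example["model_probs"])
    return total / len(batch)


# A 1:1 domain-to-general toy batch over a two-token vocabulary.
batch = [
    {"source": "domain", "model_probs": [0.5, 0.5], "target": 0},
    {"source": "general", "model_probs": [0.5, 0.5], "base_probs": [0.5, 0.5]},
]
loss = mol_loss(batch)
print(loss)
```

In this toy batch the general example contributes zero loss because the model still matches the base distribution, so only the domain example pushes the parameters; that is the sense in which the KL term protects general behavior.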

[253] ABoN: Adaptive Best-of-N Alignment

Vinod Raman,Hilal Asi,Satyen Kale

Main category: cs.CL

TL;DR: Proposes a prompt-adaptive Best-of-N alignment strategy that uses a two-stage algorithm to allocate inference-time compute more efficiently.

Details

Motivation: To address the fact that existing test-time alignment methods such as Best-of-N sampling are computationally expensive and ignore differences in alignment difficulty across prompts. Method: Develops a two-stage algorithm: an initial exploratory phase estimates each prompt's reward distribution, and a second stage adaptively allocates the remaining compute. Result: On AlpacaEval, with 12 LM/RM pairs and 50 batches of prompts, the adaptive strategy outperforms uniform allocation, and performance improves with larger batches. Conclusion: The adaptive strategy is efficient and broadly compatible, improving alignment while reducing compute cost. Abstract: Recent advances in test-time alignment methods, such as Best-of-N sampling, offer a simple and effective way to steer language models (LMs) toward preferred behaviors using reward models (RM). However, these approaches can be computationally expensive, especially when applied uniformly across prompts without accounting for differences in alignment difficulty. In this work, we propose a prompt-adaptive strategy for Best-of-N alignment that allocates inference-time compute more efficiently. Motivated by latency concerns, we develop a two-stage algorithm: an initial exploratory phase estimates the reward distribution for each prompt using a small exploration budget, and a second stage adaptively allocates the remaining budget using these estimates. Our method is simple, practical, and compatible with any LM/RM combination. Empirical results on the AlpacaEval dataset for 12 LM/RM pairs and 50 different batches of prompts show that our adaptive strategy consistently outperforms the uniform allocation with the same inference budget. Moreover, our experiments show that our adaptive strategy remains competitive against uniform allocations with 20% larger inference budgets and even improves in performance as the batch size grows.
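
The two-stage structure can be sketched as follows. The abstract does not state the allocation rule, so this sketch assumes one: the remaining budget is split in proportion to the spread (standard deviation) of each prompt's exploratory rewards. `sample_reward` is a hypothetical stand-in for "generate one response with the LM and score it with the RM".

```python
import random
import statistics


def allocate_budget(spreads, remaining):
    """Split `remaining` samples across prompts in proportion to `spreads`.
    Falls back to an even split when every spread is zero."""
    if sum(spreads) == 0:
        spreads = [1.0] * len(spreads)
    total = sum(spreads)
    shares = [int(remaining * s / total) for s in spreads]
    # Hand out samples lost to rounding, largest spread first.
    leftover = remaining - sum(shares)
    order = sorted(range(len(spreads)), key=lambda i: spreads[i], reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares


def adaptive_best_of_n(prompts, sample_reward, total_budget, explore_per_prompt=2, seed=0):
    """Stage 1: a few exploratory samples per prompt.
    Stage 2: spend the rest where the reward estimates vary most.
    Returns the best reward seen per prompt and the stage-2 allocation."""
    remaining = total_budget - explore_per_prompt * len(prompts)
    if remaining < 0:
        raise ValueError("budget too small for the exploration stage")
    rng = random.Random(seed)
    rewards = {p: [sample_reward(p, rng) for _ in range(explore_per_prompt)] for p in prompts}
    spreads = [statistics.pstdev(rewards[p]) for p in prompts]
    extra = allocate_budget(spreads, remaining)
    for p, n in zip(prompts, extra):
        rewards[p].extend(sample_reward(p, rng) for _ in range(n))
    return {p: max(r) for p, r in rewards.items()}, extra


# Toy reward model: "easy" always scores the same, "hard" is noisy.
noise = {"easy": 0.0, "hard": 1.0}
best, extra = adaptive_best_of_n(
    ["easy", "hard"], lambda p, rng: rng.gauss(0.0, noise[p]), total_budget=10
)
print(extra)
```

With this toy reward model the prompt whose rewards never vary receives no further samples, which is the intuition behind not spending compute uniformly.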

[254] GenderBench: Evaluation Suite for Gender Biases in LLMs

Matúš Pikuliak

Main category: cs.CL

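TL;DR: GenderBench is a comprehensive evaluation suite for gender biases in LLMs, with 14 probes that quantify 19 gender-related harmful behaviors. It is open-source and extensible; an evaluation of 12 LLMs reveals weaknesses in stereotypical reasoning, equitable gender representation, and other areas.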

Details

Motivation: To address gender bias in LLMs by providing a standardized, reproducible evaluation tool that promotes fairness and robustness across the field. Method: Develops the GenderBench suite of 14 probes quantifying 19 gender-related harmful behaviors and evaluates 12 LLMs. Result: The evaluation finds consistent problems in stereotypical reasoning, equitable gender representation, and high-stakes scenarios such as hiring. Conclusion: GenderBench is an effective tool for evaluating gender bias in LLMs; it exposes weaknesses in current models and calls for further fairness improvements. Abstract: We present GenderBench -- a comprehensive evaluation suite designed to measure gender biases in LLMs. GenderBench includes 14 probes that quantify 19 gender-related harmful behaviors exhibited by LLMs. We release GenderBench as an open-source and extensible library to improve the reproducibility and robustness of benchmarking across the field. We also publish our evaluation of 12 LLMs. Our measurements reveal consistent patterns in their behavior. We show that LLMs struggle with stereotypical reasoning, equitable gender representation in generated texts, and occasionally also with discriminatory behavior in high-stakes scenarios, such as hiring.

[255] Why Not Act on What You Know? Unleashing Safety Potential of LLMs via Self-Aware Guard Enhancement

Peng Ding,Jun Kuang,Zongyu Wang,Xuezhi Cao,Xunliang Cai,Jiajun Chen,Shujian Huang

Main category: cs.CL

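TL;DR: The paper proposes SAGE, a training-free defense strategy that strengthens LLMs' safety discrimination to resist sophisticated jailbreak attacks; experiments demonstrate its effectiveness and robustness.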

Details

Motivation: Although LLMs can detect jailbreak prompts, they may still produce unsafe responses when processing those inputs directly, revealing a gap between safety discrimination and generation. Method: SAGE has two core modules, a Discriminative Analysis Module and a Discriminative Response Module, and strengthens defense through flexible safety discrimination instructions. Result: Across a range of open-source and closed-source LLMs, SAGE achieves an average defense success rate of 99% while remaining helpful on general benchmarks. Conclusion: SAGE contributes to future LLMs whose safety awareness and generation behavior are consistent; code and datasets are public. Abstract: Large Language Models (LLMs) have shown impressive capabilities across various tasks but remain vulnerable to meticulously crafted jailbreak attacks. In this paper, we identify a critical safety gap: while LLMs are adept at detecting jailbreak prompts, they often produce unsafe responses when directly processing these inputs. Inspired by this insight, we propose SAGE (Self-Aware Guard Enhancement), a training-free defense strategy designed to align LLMs' strong safety discrimination performance with their relatively weaker safety generation ability. SAGE consists of two core components: a Discriminative Analysis Module and a Discriminative Response Module, enhancing resilience against sophisticated jailbreak attempts through flexible safety discrimination instructions. Extensive experiments demonstrate SAGE's effectiveness and robustness across various open-source and closed-source LLMs of different sizes and architectures, achieving an average 99% defense success rate against numerous complex and covert jailbreak methods while maintaining helpfulness on general benchmarks. We further conduct mechanistic interpretability analysis through hidden states and attention distributions, revealing the underlying mechanisms of this detection-generation discrepancy. Our work thus contributes to developing future LLMs with coherent safety awareness and generation behavior. Our code and datasets are publicly available at https://github.com/NJUNLP/SAGE.

[256] Historical and psycholinguistic perspectives on morphological productivity: A sketch of an integrative approach

Harald Baayen,Kristian Berg,Maziyah Mohamed

Main category: cs.CL

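TL;DR: This paper studies morphological productivity from cognitive-computational and diachronic perspectives, using the Discriminative Lexicon Model to analyze morphological patterns in Finnish, Malay, and English, and examining lexical innovation in Thomas Mann's writing.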

Details

Motivation: To examine the cognitive-computational basis of morphological productivity and diachronic change in an individual's language use, shedding light on the generalizability of language patterns and on individual innovation. Method: Uses the Discriminative Lexicon Model (DLM) to analyze morphological patterns in Finnish, Malay, and English, and studies lexical innovation through data on what Thomas Mann read and wrote. Result: The DLM associates affixes with semantic centroids; Thomas Mann's rate of lexical innovation is extremely low, with far more novel words in his input than in his output. Conclusion: Morphological productivity depends on systematic links between form and meaning; individual innovation is affected by input and semantic distance. Abstract: In this study, we approach morphological productivity from two perspectives: a cognitive-computational perspective, and a diachronic perspective zooming in on an actual speaker, Thomas Mann. For developing the first perspective, we make use of a cognitive computational model of the mental lexicon, the discriminative lexicon model. For computational mappings between form and meaning to be productive, in the sense that novel, previously unencountered words can be understood and produced, there must be systematicities between the form space and the semantic space. If the relation between form and meaning were truly arbitrary, a model could memorize form and meaning pairings, but there is no way in which the model would be able to generalize to novel test data. For Finnish nominal inflection, Malay derivation, and English compounding, we use the Discriminative Lexicon Model (DLM) as a computational tool to trace differences in the degree to which inflectional and word formation patterns are productive. We show that the DLM tends to associate affix-like sublexical units with the centroids of the embeddings of the words with a given affix. For developing the second perspective, we study how the intake and output of one prolific writer, Thomas Mann, changes over time. We show by means of an examination of what Thomas Mann is likely to have read, and what he wrote, that the rate at which Mann produces novel derived words is extremely low. There are far more novel words in his input than in his output. We show that Thomas Mann is less likely to produce a novel derived word with a given suffix the greater the average distance is of the embeddings of all derived words to the corresponding centroid, and discuss the challenges of using speaker-specific embeddings for low-frequency and novel words.

[257] Do different prompting methods yield a common task representation in language models?

Guy Davidson,Todd M. Gureckis,Brenden M. Lake,Adina Williams

Main category: cs.CL

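TL;DR: The paper studies how language models represent tasks under different prompting methods (demonstrations versus instructions) and finds that the two rely on different model components, supporting the practice of combining demonstrations and instructions.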

Details

Motivation: To determine whether different ways of prompting a task lead to similar task representations, with the aim of improving interpretability and control. Method: Extracts task representations with function vectors, generalizes them to short textual instruction prompts, and analyzes the differences between demonstration and instruction function vectors. Result: Demonstration and instruction function vectors rely on different model components, and the task representations only partly overlap, supporting the practice of combining the two. Conclusion: Different prompting methods elicit different task representation mechanisms; LLM task inference mechanisms need further study. Abstract: Demonstrations and instructions are two primary approaches for prompting language models to perform in-context learning (ICL) tasks. Do identical tasks elicited in different ways result in similar representations of the task? An improved understanding of task representation mechanisms would offer interpretability insights and may aid in steering models. We study this through function vectors, recently proposed as a mechanism to extract few-shot ICL task representations. We generalize function vectors to alternative task presentations, focusing on short textual instruction prompts, and successfully extract instruction function vectors that promote zero-shot task accuracy. We find evidence that demonstration- and instruction-based function vectors leverage different model components, and offer several controls to dissociate their contributions to task performance. Our results suggest that different task presentations do not induce a common task representation but elicit different, partly overlapping mechanisms. Our findings offer principled support to the practice of combining textual instructions and task demonstrations, imply challenges in universally monitoring task inference across presentation forms, and encourage further examinations of LLM task inference mechanisms.

[258] Model Merging in Pre-training of Large Language Models

Yunshui Li,Yiyuan Ma,Shen Yan,Chaoyi Zhang,Jing Liu,Jianqiao Lu,Ziwen Xu,Mengzhao Chen,Minrui Wang,Shiyi Zhan,Jin Ma,Xunhao Lai,Yao Luo,Xingyan Bin,Hongbin Ren,Mingji Han,Wenhao Hao,Bairen Yi,LingJun Liu,Bole Ma,Xiaoying Jia,Zhou Xun,Liang Xiang,Yonghui Wu

Main category: cs.CL

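TL;DR: This paper studies model merging during pre-training, demonstrates its potential to improve performance and reduce cost, and offers practical guidelines for the open-source community.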

Details

Motivation: To explore model merging in large-scale pre-training as a way to improve model performance and reduce training cost. Method: Runs experiments on dense and MoE architectures to study how merging checkpoints affects performance, and analyzes merging strategies and hyperparameters. Result: Merging checkpoints significantly improves performance, predicts annealing behavior, and lowers training cost. Conclusion: Model merging has practical value in pre-training, and the paper offers practical guidelines for the open-source community. Abstract: Model merging has emerged as a promising technique for enhancing large language models, though its application in large-scale pre-training remains relatively unexplored. In this paper, we present a comprehensive investigation of model merging techniques during the pre-training process. Through extensive experiments with both dense and Mixture-of-Experts (MoE) architectures ranging from millions to over 100 billion parameters, we demonstrate that merging checkpoints trained with constant learning rates not only achieves significant performance improvements but also enables accurate prediction of annealing behavior. These improvements lead to both more efficient model development and significantly lower training costs. Our detailed ablation studies on merging strategies and hyperparameters provide new insights into the underlying mechanisms while uncovering novel applications. Through comprehensive experimental analysis, we offer the open-source community practical pre-training guidelines for effective model merging.
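
As a rough illustration of what "merging checkpoints" can mean, the sketch below averages parameters across checkpoints. The abstract does not say which merging strategy the authors found best, so plain (optionally weighted) parameter averaging is an assumption here, and the checkpoint layout (a dict mapping parameter names to flat lists of floats) is a simplification for illustration.

```python
def merge_checkpoints(checkpoints, weights=None):
    """Weighted average of parameter values across checkpoints.
    `checkpoints` is a list of dicts with identical keys and list lengths;
    `weights` defaults to a uniform average and must sum to 1."""
    if weights is None:
        weights = [1.0 / len(checkpoints)] * len(checkpoints)
    if len(weights) != len(checkpoints):
        raise ValueError("one weight per checkpoint is required")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")

    merged = {}
    for name, reference in checkpoints[0].items():
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints))
            for i in range(len(reference))
        ]
    return merged


# Two toy checkpoints taken at different points of a constant-learning-rate run.
step_1000 = {"layer.weight": [1.0, 3.0], "layer.bias": [0.0]}
step_2000 = {"layer.weight": [3.0, 5.0], "layer.bias": [1.0]}
merged = merge_checkpoints([step_1000, step_2000])
print(merged)
```

Non-uniform `weights` would let later checkpoints count more, which is one of the knobs an ablation over merging strategies might vary.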

[259] Personalized Author Obfuscation with Large Language Models

Mohammad Shokri,Sarah Ita Levitan,Rivka Levitan

Main category: cs.CL

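TL;DR: Studies how well large language models (LLMs) obfuscate authorship by paraphrasing and altering writing style, finds that effectiveness varies by author, and proposes a personalized prompting method to improve results.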

Details

Motivation: To examine how effective LLMs are at obfuscating authorship, especially how results differ across individual authors. Method: Analyzes performance at the user level and proposes a personalized prompting method to improve results. Result: LLMs are effective overall, but efficacy follows a bimodal distribution; personalized prompting outperforms standard prompting. Conclusion: Personalized prompting partially mitigates the bimodality and improves authorship obfuscation. Abstract: In this paper, we investigate the efficacy of large language models (LLMs) in obfuscating authorship by paraphrasing and altering writing styles. Rather than adopting a holistic approach that evaluates performance across the entire dataset, we focus on user-wise performance to analyze how obfuscation effectiveness varies across individual authors. While LLMs are generally effective, we observe a bimodal distribution of efficacy, with performance varying significantly across users. To address this, we propose a personalized prompting method that outperforms standard prompting techniques and partially mitigates the bimodality issue.

[260] Improving Fairness in LLMs Through Testing-Time Adversaries

Isabela Pereira Gregio,Ian Pons,Anna Helena Reali Costa,Artur Jordão

Main category: cs.CL

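TL;DR: Proposes a simple, user-friendly, and practical method that mitigates bias in LLMs by modifying sentence attributes to generate multiple variations and evaluating the resulting prediction behavior, with no training or fine-tuning.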

Details

Motivation: Bias in LLMs hinders their use in ethically sensitive tasks, so a solution is needed that requires neither training data nor parameter tuning. Method: Generates variations of a sentence and compares prediction behavior to detect and reduce bias, using forward passes only. Result: Experiments on Llama3 show fairness metrics improving by up to 27 percentage points. Conclusion: The method significantly improves the fairness and reliability of LLMs and suits ethically sensitive tasks. Abstract: Large Language Models (LLMs) push the boundaries in natural language processing and generative AI, driving progress across various aspects of modern society. Unfortunately, the pervasive issue of bias in LLM responses (i.e., predictions) poses a significant and open challenge, hindering their application in tasks involving ethical sensitivity and responsible decision-making. In this work, we propose a straightforward, user-friendly and practical method to mitigate such biases, enhancing the reliability and trustworthiness of LLMs. Our method creates multiple variations of a given sentence by modifying specific attributes and evaluates the corresponding prediction behavior compared to the original, unaltered prediction/sentence. The idea behind this process is that critical ethical predictions often exhibit notable inconsistencies, indicating the presence of bias. Unlike previous approaches, our method relies solely on forward passes (i.e., testing-time adversaries), eliminating the need for training, fine-tuning, or prior knowledge of the training data distribution. Through extensive experiments on the popular Llama family, we demonstrate the effectiveness of our method in improving various fairness metrics, focusing on the reduction of disparities in how the model treats individuals from different racial groups. Specifically, using standard metrics, we improve fairness in Llama3 by up to 27 percentage points. Overall, our approach significantly enhances fairness, equity, and reliability in LLM-generated results without parameter tuning or training data modifications, confirming its effectiveness in practical scenarios. We believe our work establishes an important step toward enabling the use of LLMs in tasks that require ethical considerations and responsible decision-making.
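
The variation-and-compare procedure can be sketched with a toy classifier. This is a hedged illustration, not the authors' method: the abstract does not say how the predictions on the variations are combined, so the majority vote below is an assumption, and `predict`, the attribute values, and the example sentence are all hypothetical.

```python
from collections import Counter


def make_variations(sentence, attribute_value, alternatives):
    """Create copies of the sentence with the attribute value swapped out."""
    return [sentence.replace(attribute_value, alt) for alt in alternatives]


def consistent_prediction(sentence, attribute_value, alternatives, predict):
    """Run the model on the original and on each variation (forward passes only).
    Returns the majority prediction and whether all variations agreed with the original."""
    original = predict(sentence)
    variant_predictions = [
        predict(v) for v in make_variations(sentence, attribute_value, alternatives)
    ]
    votes = Counter([original] + variant_predictions)
    majority, _ = votes.most_common(1)[0]
    is_consistent = all(p == original for p in variant_predictions)
    return majority, is_consistent


def biased_toy_model(text):
    """Stand-in for an LLM call; deliberately biased against one group."""
    return "reject" if "Group A" in text else "accept"


decision, consistent = consistent_prediction(
    "Applicant from Group A requests a loan.",
    "Group A",
    ["Group B", "Group C"],
    biased_toy_model,
)
print(decision, consistent)
```

An inconsistency across variations flags a prediction that depends on the protected attribute, and the vote replaces it without any training or fine-tuning.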

[261] A Multi-Task Benchmark for Abusive Language Detection in Low-Resource Settings

Fitsum Gaim,Hoyun Song,Huije Lee,Changgeon Ko,Eui Jun Hwang,Jong C. Park

Main category: cs.CL

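TL;DR: The study presents a large-scale multi-task benchmark dataset for Tigrinya social media, targeting abusive language detection together with sentiment and topic classification. The dataset contains 13,717 YouTube comments annotated by nine native speakers. Experiments show that small multi-task models outperform frontier models in the low-resource setting.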

Details

Motivation: Content moderation research fails to cover most languages for lack of resources, leaving millions of users exposed to online hostility. Method: Selects data with an iterative term clustering approach, covers both writing systems (Romanized transliteration and native Ge'ez script), and establishes a multi-task benchmark. Result: Small multi-task models perform best in the low-resource setting, reaching 86% accuracy in abusiveness detection (a gain of 7 points). Conclusion: The dataset and models are an important resource for online safety research on low-resource languages such as Tigrinya. Abstract: Content moderation research has recently made significant advances, but still fails to serve the majority of the world's languages due to the lack of resources, leaving millions of vulnerable users to online hostility. This work presents a large-scale human-annotated multi-task benchmark dataset for abusive language detection in Tigrinya social media with joint annotations for three tasks: abusiveness, sentiment, and topic classification. The dataset comprises 13,717 YouTube comments annotated by nine native speakers, collected from 7,373 videos with a total of over 1.2 billion views across 51 channels. We developed an iterative term clustering approach for effective data selection. Recognizing that around 64% of Tigrinya social media content uses Romanized transliterations rather than native Ge'ez script, our dataset accommodates both writing systems to reflect actual language use. We establish strong baselines across the tasks in the benchmark, while leaving significant challenges for future contributions. Our experiments reveal that small, specialized multi-task models outperform the current frontier models in the low-resource setting, achieving up to 86% accuracy (+7 points) in abusiveness detection. We make the resources publicly available to promote research on online safety.

[262] The AI Gap: How Socioeconomic Status Affects Language Technology Interactions

Elisa Bassignana,Amanda Cercas Curry,Dirk Hovy

Main category: cs.CL

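TL;DR: The study finds that socioeconomic status (SES) significantly shapes how people interact with language technology: higher-SES groups favor abstract expression and concise requests, while lower-SES groups tend toward anthropomorphization and concrete language.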

Details

Motivation: To examine how SES affects people's interactions with language technology, addressing earlier work's reliance on proxy metrics and synthetic data. Method: Surveys 1,000 individuals from diverse SES backgrounds and collects 6,482 prompts from their interactions with LLMs. Result: Higher-SES groups use more abstract language and concise requests and raise topics such as 'inclusivity' and 'travel'; lower-SES groups anthropomorphize LLMs more and use more concrete language. Conclusion: Language technology adoption must still account for SES differences to narrow the digital divide and meet the linguistic needs of different groups. Abstract: Socioeconomic status (SES) fundamentally influences how people interact with each other and, more recently, with digital technologies like Large Language Models (LLMs). While previous research has highlighted the interaction between SES and language technology, it was limited by reliance on proxy metrics and synthetic data. We survey 1,000 individuals from diverse socioeconomic backgrounds about their use of language technologies and generative AI, and collect 6,482 prompts from their previous interactions with LLMs. We find systematic differences across SES groups in language technology usage (i.e., frequency, performed tasks), interaction styles, and topics. Higher-SES users employ a higher level of abstraction, convey requests more concisely, and raise topics like 'inclusivity' and 'travel'. Lower SES correlates with higher anthropomorphization of LLMs (using "hello" and "thank you") and more concrete language. Our findings suggest that while generative language technologies are becoming more accessible to everyone, socioeconomic linguistic differences still stratify their use to exacerbate the digital divide. These differences underscore the importance of considering SES in developing language technologies to accommodate varying linguistic needs rooted in socioeconomic factors and limit the AI Gap across SES groups.

[263] Emotion Recognition for Low-Resource Turkish: Fine-Tuning BERTurk on TREMO and Testing on Xenophobic Political Discourse

Darmawan Wicaksono,Hasri Akbar Awal Rozaq,Nevfel Boz

Main category: cs.CL

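TL;DR: The study develops a BERTurk-based Emotion Recognition Model (ERM) for Turkish that reaches 92.62% accuracy, and uses it to analyze emotions around 'Sessiz Istila' (Silent Invasion) discourse about Syrian refugees on Turkish social media.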

Details

Motivation: To reveal the emotional dynamics of anti-refugee sentiment on Turkish social media and to advance sentiment analysis for underrepresented languages in computational social science. Method: Builds the ERM from BERTurk and the TREMO dataset, then classifies emotions in large-scale X (formerly Twitter) data. Result: The model reaches 92.62% accuracy on Turkish emotion classification and surfaces linguistic challenges and emotional expressions specific to Turkish. Conclusion: Localized NLP tools such as the ERM have practical value for real-time sentiment analysis, particularly in marketing, public relations, and crisis management, which underscores the importance of accounting for regional and linguistic nuances. Abstract: Social media platforms like X (formerly Twitter) play a crucial role in shaping public discourse and societal norms. This study examines the term Sessiz Istila (Silent Invasion) on Turkish social media, highlighting the rise of anti-refugee sentiment amidst the Syrian refugee influx. Using BERTurk and the TREMO dataset, we developed an advanced Emotion Recognition Model (ERM) tailored for Turkish, achieving 92.62% accuracy in categorizing emotions such as happiness, fear, anger, sadness, disgust, and surprise. By applying this model to large-scale X data, the study uncovers emotional nuances in Turkish discourse, contributing to computational social science by advancing sentiment analysis in underrepresented languages and enhancing our understanding of global digital discourse and the unique linguistic challenges of Turkish. The findings underscore the transformative potential of localized NLP tools, with our ERM model offering practical applications for real-time sentiment analysis in Turkish-language contexts. By addressing critical areas, including marketing, public relations, and crisis management, these models facilitate improved decision-making through timely and accurate sentiment tracking. This highlights the significance of advancing research that accounts for regional and linguistic nuances.

[264] Truth Neurons

Haohang Li,Yupeng Cao,Yangyang Yu,Jordan W. Suchow,Zining Zhu

Main category: cs.CL

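TL;DR: The paper proposes a method for identifying representations of truthfulness in language models at the neuron level, finds subject-agnostic "truth neurons", and validates experimentally that they are widespread.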

Details

Motivation: Language models sometimes produce untruthful responses, and the limited understanding of how truthfulness is encoded undermines their reliability and safety. Method: Proposes a method for identifying neuron-level representations of truthfulness and validates experimentally the existence of "truth neurons" and their distribution patterns. Result: Experiments confirm that truth neurons appear across models; their distribution over layers is consistent with prior findings on the geometry of truthfulness, and suppressing them degrades model performance. Conclusion: The study sheds light on the mechanisms of truthfulness in language models and points to new directions for improving their trustworthiness and reliability. Abstract: Despite their remarkable success and deployment across diverse workflows, language models sometimes produce untruthful responses. Our limited understanding of how truthfulness is mechanistically encoded within these models jeopardizes their reliability and safety. In this paper, we propose a method for identifying representations of truthfulness at the neuron level. We show that language models contain truth neurons, which encode truthfulness in a subject-agnostic manner. Experiments conducted across models of varying scales validate the existence of truth neurons, confirming that the encoding of truthfulness at the neuron level is a property shared by many language models. The distribution patterns of truth neurons over layers align with prior findings on the geometry of truthfulness. Selectively suppressing the activations of truth neurons found through the TruthfulQA dataset degrades performance both on TruthfulQA and on other benchmarks, showing that the truthfulness mechanisms are not tied to a specific dataset. Our results offer novel insights into the mechanisms underlying truthfulness in language models and highlight potential directions toward improving their trustworthiness and reliability.

[265] Decoding the Mind of Large Language Models: A Quantitative Evaluation of Ideology and Biases

Manari Hirose,Masato Uchida

Main category: cs.CL

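TL;DR: The study proposes a framework for evaluating ideological bias in large language models (LLMs) through a quantitative analysis of 436 binary-choice questions, finds differences across models and languages, and surfaces potential ethical issues.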

Details

Motivation: As LLMs are adopted across many sectors, understanding their biases and societal impact is key to ensuring ethical and effective use. Method: Proposes a quantitative analysis framework that evaluates the ideological biases of ChatGPT and Gemini using binary-choice questions. Result: LLM ideologies differ across models and languages; ChatGPT tends to shift toward the questioner's opinion, and both models show ethically problematic outputs. Conclusion: Evaluations of LLMs need to consider both ideological and ethical issues, and the framework offers a method for developing more socially aligned AI. Abstract: The widespread integration of Large Language Models (LLMs) across various sectors has highlighted the need for empirical research to understand their biases, thought patterns, and societal implications to ensure ethical and effective use. In this study, we propose a novel framework for evaluating LLMs, focusing on uncovering their ideological biases through a quantitative analysis of 436 binary-choice questions, many of which have no definitive answer. By applying our framework to ChatGPT and Gemini, findings revealed that while LLMs generally maintain consistent opinions on many topics, their ideologies differ across models and languages. Notably, ChatGPT exhibits a tendency to change its opinion to match the questioner's opinion. Both models also exhibited problematic biases, unethical or unfair claims, which might have negative societal impacts. These results underscore the importance of addressing both ideological and ethical considerations when evaluating LLMs. The proposed framework offers a flexible, quantitative method for assessing LLM behavior, providing valuable insights for the development of more socially aligned AI systems.

[266] Vectors from Larger Language Models Predict Human Reading Time and fMRI Data More Poorly when Dimensionality Expansion is Controlled

Yi-Chien Lin,Hongao Zhu,William Schuler

Main category: cs.CL

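TL;DR: The study finds an inverse scaling effect when large language models (LLMs) are used to predict human reading times and brain imaging data, indicating a substantial misalignment between LLMs and human sentence processing.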

Details

Motivation: To examine how suitable LLMs are as models of human sentence processing, in particular how their fit to psychometric data changes as model size grows. Method: Evaluates scaling using entire LLM vectors while controlling for the larger number of predictors in larger models. Result: Inverse scaling is observed: the mismatch with human sentence processing grows as models get larger. Conclusion: The shortcomings of LLMs likely stem from a substantive misalignment with human sentence processing rather than from insufficient model complexity. Abstract: The impressive linguistic abilities of large language models (LLMs) have recommended them as models of human sentence processing, with some conjecturing a positive 'quality-power' relationship (Wilcox et al., 2023), in which language models' (LMs') fit to psychometric data continues to improve as their ability to predict words in context increases. This is important because it suggests that elements of LLM architecture, such as veridical attention to context and a unique objective of predicting upcoming words, reflect the architecture of the human sentence processing faculty, and that any inadequacies in predicting human reading time and brain imaging data may be attributed to insufficient model complexity, which recedes as larger models become available. Recent studies (Oh and Schuler, 2023) have shown this scaling inverts after a point, as LMs become excessively large and accurate, when word prediction probability (as information-theoretic surprisal) is used as a predictor. Other studies propose the use of entire vectors from differently sized LLMs, still showing positive scaling (Schrimpf et al., 2021), casting doubt on the value of surprisal as a predictor, but do not control for the larger number of predictors in vectors from larger LMs. This study evaluates LLM scaling using entire LLM vectors, while controlling for the larger number of predictors in vectors from larger LLMs. Results show that inverse scaling obtains, suggesting that inadequacies in predicting human reading time and brain imaging data may be due to substantial misalignment between LLMs and human sentence processing, which worsens as larger models are used.

[267] How Reliable is Multilingual LLM-as-a-Judge?

Xiyan Fu,Wei Liu

Main category: cs.CL

TL;DR: LLM-as-a-Judge is not sufficiently reliable for cross-lingual evaluation: judgment consistency is poor, and it is worst in low-resource languages.

Details

Motivation: To examine how reliable large language models are as judges in multilingual evaluation, filling a gap in existing research. Method: Analyzes five models from different families on five tasks covering 25 languages and measures the consistency of their judgments. Result: Consistency in multilingual evaluation is low (average Fleiss' Kappa of about 0.3), and neither multilingual training data nor model scale directly improves it. Conclusion: LLM-as-a-Judge is not yet reliable for multilingual evaluation; an ensemble strategy is proposed to improve consistency in real-world applications. Abstract: LLM-as-a-Judge has emerged as a popular evaluation strategy, where advanced large language models assess generation results in alignment with human instructions. While these models serve as a promising alternative to human annotators, their reliability in multilingual evaluation remains uncertain. To bridge this gap, we conduct a comprehensive analysis of multilingual LLM-as-a-Judge. Specifically, we evaluate five models from different model families across five diverse tasks involving 25 languages. Our findings reveal that LLMs struggle to achieve consistent judgment results across languages, with an average Fleiss' Kappa of approximately 0.3, and some models performing even worse. To investigate the cause of inconsistency, we analyze various influencing factors. We observe that consistency varies significantly across languages, with particularly poor performance in low-resource languages. Additionally, we find that neither training on multilingual data nor increasing model scale directly improves judgment consistency. These findings suggest that LLMs are not yet reliable for evaluating multilingual predictions. We finally propose an ensemble strategy which improves the consistency of the multilingual judge in real-world applications.
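For readers unfamiliar with the consistency metric quoted in entry [267], the following is a minimal sketch of the standard Fleiss' Kappa computation. It is not the authors' code: the function name `fleiss_kappa` and the idea of treating each language as one "rater" of the same item are assumptions made for illustration, since the abstract does not describe the exact setup.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' Kappa for a list of items, each rated by the same number of raters.

    `ratings[i]` is the list of category labels assigned to item i
    (assumption for illustration: one label per language for the same judge).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Observed agreement: fraction of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    shares = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_chance = sum(s ** 2 for s in shares)

    if p_chance == 1.0:  # only one category was ever used
        return 1.0
    return (p_observed - p_chance) / (1.0 - p_chance)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, so the reported average of about 0.3 sits much closer to chance than to perfect agreement.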

[268] Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning

Shaobo Wang,Ziming Wang,Xiangqi Jin,Jize Wang,Jiajun Zhang,Kaixin Li,Zichen Wen,Zhong Li,Conghui He,Xuming Hu,Linfeng Zhang

Main category: cs.CL

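TL;DR: Data Whisperer is a training-free, attention-based data selection method that uses few-shot in-context learning to choose data subsets for LLM fine-tuning, improving performance while reducing computational cost.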

Details

Motivation: As datasets grow, traditional data selection methods are inefficient and resource-intensive, so an efficient, training-free way to select data subsets is needed. Method: Proposes Data Whisperer, which uses few-shot in-context learning and attention with the model to be fine-tuned, requiring no additional training. Result: On Llama-3-8B-Instruct, using only 10% of the data outperforms training on the full GSM8K dataset; the method beats existing approaches by 3.1 points with a 7.4x speedup. Conclusion: Data Whisperer offers an efficient, low-cost data selection solution for LLM fine-tuning that outperforms existing methods. Abstract: Fine-tuning large language models (LLMs) on task-specific data is essential for their effective deployment. As dataset sizes grow, efficiently selecting optimal subsets for training becomes crucial to balancing performance and computational costs. Traditional data selection methods often require fine-tuning a scoring model on the target dataset, which is time-consuming and resource-intensive, or rely on heuristics that fail to fully leverage the model's predictive capabilities. To address these challenges, we propose Data Whisperer, an efficient, training-free, attention-based method that leverages few-shot in-context learning with the model to be fine-tuned. Comprehensive evaluations were conducted on both raw and synthetic datasets across diverse tasks and models. Notably, Data Whisperer achieves superior performance compared to the full GSM8K dataset on the Llama-3-8B-Instruct model, using just 10% of the data, and outperforms existing methods with a 3.1-point improvement and a 7.4$\times$ speedup.

[269] GMSA: Enhancing Context Compression via Group Merging and Layer Semantic Alignment

Jiwei Tang,Zhicheng Zhang,Shunlong Wu,Jingheng Ye,Lichen Bai,Zitai Wang,Tingwei Lu,Jiaqi Chen,Lin Hai,Hai-Tao Zheng,Hong-Gee Kim

Main category: cs.CL

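TL;DR: GMSA is a context compression framework based on the encoder-decoder architecture that reduces input sequence length and redundant information, addressing the low computational efficiency and information redundancy of large language models in long-context scenarios.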

Details

Motivation: Large language models face low computational efficiency and redundant information in long-context scenarios, so an efficient context compression method is needed. Method: GMSA has two key components, Group Merging and Layer Semantic Alignment (LSA); it learns soft tokens through autoencoder training and adapts to downstream tasks with Knowledge Extraction Fine-tuning (KEFT). Result: GMSA significantly outperforms the traditional compression paradigm and existing SOTA methods on context restoration and downstream question answering, with roughly a 2x end-to-end inference speedup. Conclusion: GMSA is an efficient context compression framework that improves both performance and computational efficiency in long-context scenarios. Abstract: Large language models (LLMs) have achieved impressive performance in a variety of natural language processing (NLP) tasks. However, when applied to long-context scenarios, they face two challenges, i.e., low computational efficiency and much redundant information. This paper introduces GMSA, a context compression framework based on the encoder-decoder architecture, which addresses these challenges by reducing input sequence length and redundant information. Structurally, GMSA has two key components: Group Merging and Layer Semantic Alignment (LSA). Group merging is used to effectively and efficiently extract summary vectors from the original context. Layer semantic alignment, on the other hand, aligns the high-level summary vectors with the low-level primary input semantics, thus bridging the semantic gap between different layers. In the training process, GMSA first learns soft tokens that contain complete semantics through autoencoder training. To further adapt GMSA to downstream tasks, we propose Knowledge Extraction Fine-tuning (KEFT) to extract knowledge from the soft tokens for downstream tasks. We train GMSA by randomly sampling the compression rate for each sample in the dataset. Under this condition, GMSA not only significantly outperforms the traditional compression paradigm in context restoration but also achieves stable and significantly faster convergence with only a few encoder layers. In downstream question-answering (QA) tasks, GMSA can achieve approximately a 2x speedup in end-to-end inference while outperforming both the original input prompts and various state-of-the-art (SOTA) methods by a large margin.

[270] One-for-All Pruning: A Universal Model for Customized Compression of Large Language Models

Rongguang Ye,Ming Tang

Main category: cs.CL

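TL;DR: The paper proposes UniCuCo, a universal model for efficiently handling multiple pruning requests for large language models (LLMs); it introduces a StratNet that learns the optimal pruning strategy, avoiding the linear growth of processing time with the number of requests seen in existing methods.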

Details

Motivation: Existing pruning methods are inefficient when serving multiple user requests: processing time grows linearly with the number of requests, which does not meet practical needs. Method: Proposes UniCuCo, in which a StratNet learns the mapping from requests to optimal pruning strategies; a Gaussian process approximates the gradient of the non-differentiable pruning process so that StratNet can be updated. Result: Experiments show UniCuCo is 28 times faster than baselines when processing 64 requests while keeping comparable accuracy. Conclusion: UniCuCo substantially improves the efficiency of multi-request pruning and offers a practical solution for real-world use. Abstract: Existing pruning methods for large language models (LLMs) focus on achieving high compression rates while maintaining model performance. Although these methods have demonstrated satisfactory performance in handling a single user's compression request, their processing time increases linearly with the number of requests, making them inefficient for real-world scenarios with multiple simultaneous requests. To address this limitation, we propose a Universal Model for Customized Compression (UniCuCo) for LLMs, which introduces a StratNet that learns to map arbitrary requests to their optimal pruning strategy. The challenge in training StratNet lies in the high computational cost of evaluating pruning strategies and the non-differentiable nature of the pruning process, which hinders gradient backpropagation for StratNet updates. To overcome these challenges, we leverage a Gaussian process to approximate the evaluation process. Since the gradient of the Gaussian process is computable, we can use it to approximate the gradient of the non-differentiable pruning process, thereby enabling StratNet updates. Experimental results show that UniCuCo is 28 times faster than baselines in processing 64 requests, while maintaining comparable accuracy to baselines.

[271] Examining Linguistic Shifts in Academic Writing Before and After the Launch of ChatGPT: A Study on Preprint Papers

Tong Bao,Yi Zhao,Jin Mao,Chengzhi Zhang

Main category: cs.CL

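TL;DR: By analyzing 823,798 abstracts from the arXiv dataset, the study finds a widespread influence of large language models (LLMs) on academic writing, including higher lexical complexity, lower syntactic complexity, and reduced cohesion and readability.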

Details

Motivation: Existing studies mostly analyze LLM usage in academic writing with quantitative approaches, but few examine systematically how LLMs may affect its linguistic characteristics. Method: Linguistic analysis of arXiv abstracts covering the frequency of LLM-preferred words, lexical complexity, syntactic complexity, cohesion, readability, and sentiment. Result: The proportion of LLM-preferred words in abstracts rises significantly, lexical complexity and sentiment increase, and syntactic complexity, cohesion, and readability decrease. Conclusion: LLMs have had a marked effect on academic writing style, especially among researchers with weaker English proficiency and in Computer Science. Abstract: Large Language Models (LLMs), such as ChatGPT, have prompted academic concerns about their impact on academic writing. Existing studies have primarily examined LLM usage in academic writing through quantitative approaches, such as word frequency statistics and probability-based analyses. However, few have systematically examined the potential impact of LLMs on the linguistic characteristics of academic writing. To address this gap, we conducted a large-scale analysis across 823,798 abstracts published in the last decade from the arXiv dataset. Through the linguistic analysis of features such as the frequency of LLM-preferred words, lexical complexity, syntactic complexity, cohesion, readability and sentiment, the results indicate a significant increase in the proportion of LLM-preferred words in abstracts, revealing the widespread influence of LLMs on academic writing. Additionally, we observed an increase in lexical complexity and sentiment in the abstracts, but a decrease in syntactic complexity, suggesting that LLMs introduce more new vocabulary and simplify sentence structure. However, the significant decrease in cohesion and readability indicates that abstracts have fewer connecting words and are becoming more difficult to read. Moreover, our analysis reveals that scholars with weaker English proficiency were more likely to use the LLMs for academic writing, and focused on improving the overall logic and fluency of the abstracts. Finally, at the discipline level, we found that scholars in Computer Science showed more pronounced changes in writing style, while the changes in Mathematics were minimal.

[272] Bridging Generative and Discriminative Learning: Few-Shot Relation Extraction via Two-Stage Knowledge-Guided Pre-training

Quanjiang Guo,Jinchuan Zhang,Sijie Wang,Ling Tian,Zhao Kang,Bin Yan,Weidong Xiao

Main category: cs.CL

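TL;DR: The TKRE framework combines LLMs with traditional relation extraction models and proposes a two-stage pre-training strategy to address data scarcity and limited generalization in few-shot relation extraction.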

Details

Motivation: Few-shot relation extraction (FSRE) is challenging because annotated data are scarce and existing models generalize poorly. Method: TKRE uses LLMs to generate explanation-driven knowledge and schema-constrained synthetic data, and proposes a two-stage pre-training strategy (MSLM and SCL). Result: TKRE achieves state-of-the-art performance on benchmark datasets. Conclusion: TKRE has potential for broader application in low-resource scenarios. Abstract: Few-Shot Relation Extraction (FSRE) remains a challenging task due to the scarcity of annotated data and the limited generalization capabilities of existing models. Although large language models (LLMs) have demonstrated potential in FSRE through in-context learning (ICL), their general-purpose training objectives often result in suboptimal performance for task-specific relation extraction. To overcome these challenges, we propose TKRE (Two-Stage Knowledge-Guided Pre-training for Relation Extraction), a novel framework that synergistically integrates LLMs with traditional relation extraction models, bridging generative and discriminative learning paradigms. TKRE introduces two key innovations: (1) leveraging LLMs to generate explanation-driven knowledge and schema-constrained synthetic data, addressing the issue of data scarcity; and (2) a two-stage pre-training strategy combining Masked Span Language Modeling (MSLM) and Span-Level Contrastive Learning (SCL) to enhance relational reasoning and generalization. Together, these components enable TKRE to effectively tackle FSRE tasks. Comprehensive experiments on benchmark datasets demonstrate the efficacy of TKRE, achieving new state-of-the-art performance in FSRE and underscoring its potential for broader application in low-resource scenarios. The code and data are released at https://github.com/UESTC-GQJ/TKRE.

[273] PANORAMA: A synthetic PII-laced dataset for studying sensitive data memorization in LLMs

Sriram Selvam,Anneswa Ghosh

Main category: cs.CL

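TL;DR: PANORAMA is a large-scale synthetic dataset for studying how large language models memorize sensitive and personally identifiable information (PII); it fills a gap in existing datasets and shows how PII memorization rates vary with data repetition and content type.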

Details

Motivation: As large language models scale up and are deployed more widely, their memorization of sensitive and PII data creates privacy risks, yet comprehensive, realistic, and ethically sourced datasets for studying the problem are lacking. Method: Builds internally consistent, multi-attribute synthetic human profiles, then uses zero-shot prompting with OpenAI o3-mini to generate diverse content types (such as wiki-style articles and social media posts) that embed realistic PII, forming the PANORAMA dataset. Result: Experiments on Mistral-7B show that PII memorization rates increase with data repetition and that memorization risk differs across content types. Conclusion: PANORAMA provides an important resource for privacy risk assessment, model auditing, and the development of privacy-preserving LLMs. Abstract: The memorization of sensitive and personally identifiable information (PII) by large language models (LLMs) poses growing privacy risks as models scale and are increasingly deployed in real-world applications. Existing efforts to study sensitive and PII data memorization and develop mitigation strategies are hampered by the absence of comprehensive, realistic, and ethically sourced datasets reflecting the diversity of sensitive information found on the web. We introduce PANORAMA - Profile-based Assemblage for Naturalistic Online Representation and Attribute Memorization Analysis, a large-scale synthetic corpus of 384,789 samples derived from 9,674 synthetic profiles designed to closely emulate the distribution, variety, and context of PII and sensitive data as it naturally occurs in online environments. Our data generation pipeline begins with the construction of internally consistent, multi-attribute human profiles using constrained selection to reflect real-world demographics such as education, health attributes, financial status, etc. Using a combination of zero-shot prompting and OpenAI o3-mini, we generate diverse content types - including wiki-style articles, social media posts, forum discussions, online reviews, comments, and marketplace listings - each embedding realistic, contextually appropriate PII and other sensitive information.
We validate the utility of PANORAMA by fine-tuning the Mistral-7B model on 1x, 5x, 10x, and 25x data replication rates with a subset of data and measure PII memorization rates - revealing not only consistent increases with repetition but also variation across content types, highlighting PANORAMA's ability to model how memorization risks differ by context. Our dataset and code are publicly available, providing a much-needed resource for privacy risk assessment, model auditing, and the development of privacy-preserving LLMs.

[274] Distribution Prompting: Understanding the Expressivity of Language Models Through the Next-Token Distributions They Can Produce

Haojin Wang,Zining Zhu,Freda Shi

Main category: cs.CL

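TL;DR: The paper studies the probability distributions that autoregressive language models can produce, finds that some distributions are harder to elicit than others, and analyzes the factors that make a distribution easy or hard to elicit.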

Details

Motivation: To understand systematically which probability distributions language models can produce and which are harder or easier to elicit through prompting. Method: Uses gradient-based prompt tuning to search for prompts that make the model's output distribution as close as possible to a target distribution. Result: Distributions with very low or very high entropy are easier to approximate than those with moderate entropy; distributions containing outlier tokens are easier to approximate; LM-generated distributions are easier to approximate than random targets. Conclusion: The study sheds light on the expressiveness of language models and on the challenges of using them as probability distribution proposers. Abstract: Autoregressive neural language models (LMs) generate a probability distribution over tokens at each time step given a prompt. In this work, we attempt to systematically understand the probability distributions that LMs can produce, showing that some distributions are significantly harder to elicit than others. Specifically, for any target next-token distribution over the vocabulary, we attempt to find a prompt that induces the LM to output a distribution as close as possible to the target, using either soft or hard gradient-based prompt tuning. We find that (1) in general, distributions with very low or very high entropy are easier to approximate than those with moderate entropy; (2) among distributions with the same entropy, those containing "outlier tokens" are easier to approximate; (3) target distributions generated by LMs -- even LMs with different tokenizers -- are easier to approximate than randomly chosen targets. These results offer insights into the expressiveness of LMs and the challenges of using them as probability distribution proposers.
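The two quantities that entry [274] turns on, the entropy of a target distribution and how close a model's output distribution is to it, can be illustrated with a short sketch. The abstract does not say which closeness measure the authors use; KL divergence is assumed here purely for illustration, and the function names are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution given as a list."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted); `eps` guards against log(0) in `predicted`.

    Assumption: KL is only one possible closeness measure, chosen for
    illustration, not necessarily the one used in the paper.
    """
    return sum(
        t * math.log(t / max(q, eps))
        for t, q in zip(target, predicted)
        if t > 0
    )


# Toy 4-token vocabulary: a peaked (low-entropy) and a uniform (high-entropy) target.
peaked = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]
print(entropy(peaked), entropy(uniform))
print(kl_divergence(peaked, uniform))
```

In a prompt-tuning setup, a divergence like this would serve as the loss between the target and the model's next-token distribution; the sketch only covers the measurement side.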

[275] Not All Documents Are What You Need for Extracting Instruction Tuning Data

Chi Zhang,Huaping Zhong,Hongtao Li,Chengliang Chai,Jiawei Hong,Yuhao Deng,Jiacheng Wang,Tian Tan,Yizhou Yan,Jiantao Qiu,Ye Yuan,Guoren Wang,Conghui He,Lei Cao

Main category: cs.CL

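TL;DR: The EQUAL framework alternates iteratively between document selection and high-quality QA pair extraction, substantially reducing computational cost while improving model performance.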

Details

Motivation: Instruction data synthesized by existing methods lack diversity and stay close to the seed data, which limits real-world applicability. Method: Proposes the EQUAL framework, which combines document clustering with a multi-armed bandit strategy to extract high-quality QA pairs iteratively. Result: Across several downstream tasks, EQUAL reduces computational cost by 5-10x and improves accuracy by 2.5%. Conclusion: EQUAL is an efficient and scalable data extraction framework for improving instruction tuning. Abstract: Instruction tuning improves the performance of large language models (LLMs), but it heavily relies on high-quality training data. Recently, LLMs have been used to synthesize instruction data using seed question-answer (QA) pairs. However, these synthesized instructions often lack diversity and tend to be similar to the input seeds, limiting their applicability in real-world scenarios. To address this, we propose extracting instruction tuning data from web corpora that contain rich and diverse knowledge. A naive solution is to retrieve domain-specific documents and extract all QA pairs from them, but this faces two key challenges: (1) extracting all QA pairs using LLMs is prohibitively expensive, and (2) many extracted QA pairs may be irrelevant to the downstream tasks, potentially degrading model performance. To tackle these issues, we introduce EQUAL, an effective and scalable data extraction framework that iteratively alternates between document selection and high-quality QA pair extraction to enhance instruction tuning. EQUAL first clusters the document corpus based on embeddings derived from contrastive learning, then uses a multi-armed bandit strategy to efficiently identify clusters that are likely to contain valuable QA pairs. This iterative approach significantly reduces computational cost while boosting model performance. Experiments on AutoMathText and StackOverflow across four downstream tasks show that EQUAL reduces computational costs by 5-10x and improves accuracy by 2.5 percent on LLaMA-3.1-8B and Mistral-7B.
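The cluster-selection idea in entry [275] can be sketched with a generic bandit loop. The abstract only says "a multi-armed bandit strategy"; UCB1 is swapped in here as a well-known example and is not claimed to be the authors' algorithm. The reward function, which in practice might score the QA pairs extracted from a sampled document, is a hypothetical stand-in.

```python
import math


def ucb1_select(counts, totals, c=2.0):
    """Pick the next cluster (arm) using the UCB1 rule.

    counts[a] = times cluster a was sampled, totals[a] = summed reward so far.
    """
    for arm, n in enumerate(counts):
        if n == 0:  # sample every cluster once before comparing scores
            return arm
    t = sum(counts)
    scores = [
        totals[a] / counts[a] + math.sqrt(c * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def run_bandit(reward_fn, n_clusters, budget):
    """Spend `budget` extraction calls across clusters; return per-cluster pulls."""
    counts = [0] * n_clusters
    totals = [0.0] * n_clusters
    for _ in range(budget):
        arm = ucb1_select(counts, totals)
        counts[arm] += 1
        # Hypothetical reward: usefulness of QA pairs from one sampled document.
        totals[arm] += reward_fn(arm)
    return counts


# Toy example with fixed per-cluster usefulness values (illustrative only).
usefulness = [0.1, 0.9, 0.3]
pulls = run_bandit(lambda arm: usefulness[arm], n_clusters=3, budget=200)
print(pulls)
```

The point of the design is that most of the extraction budget flows to the clusters that keep yielding useful pairs, while low-value clusters are sampled only often enough to confirm they are low-value.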

[276] Teach2Eval: An Indirect Evaluation Method for LLM by Judging How It Teaches

Yuhang Zhou,Xutian Chen,Yixin Cao,Yuchen Ni,Yu He,Siyu Tian,Xiang Liu,Jian Zhang,Chuanjun Ji,Guangnan Ye,Xipeng Qiu

Main category: cs.CL

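TL;DR: Teach2Eval is an indirect evaluation framework that assesses large language models (LLMs) by how well they teach student models, enabling multi-dimensional, scalable, automated evaluation.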

Details

Motivation: Traditional evaluation methods suffer from fairness issues, limited scalability, and data contamination, so a more effective evaluation framework is needed. Method: Teach2Eval converts open-ended tasks into standardized multiple-choice questions (MCQs) and evaluates several LLM abilities indirectly through teacher-generated feedback. Result: Experiments on 26 leading LLMs show that Teach2Eval aligns with existing dynamic rankings while offering additional interpretability. Conclusion: Teach2Eval provides a multi-dimensional evaluation method that avoids data leakage and memorization and covers a broad range of LLM cognitive abilities. Abstract: Recent progress in large language models (LLMs) has outpaced the development of effective evaluation methods. Traditional benchmarks rely on task-specific metrics and static datasets, which often suffer from fairness issues, limited scalability, and contamination risks. In this paper, we introduce Teach2Eval, an indirect evaluation framework inspired by the Feynman Technique. Instead of directly testing LLMs on predefined tasks, our method evaluates a model's multiple abilities to teach weaker student models to perform tasks effectively. By converting open-ended tasks into standardized multiple-choice questions (MCQs) through teacher-generated feedback, Teach2Eval enables scalable, automated, and multi-dimensional assessment. Our approach not only avoids data leakage and memorization but also captures a broad range of cognitive abilities that are orthogonal to current benchmarks. Experimental results across 26 leading LLMs show strong alignment with existing human and model-based dynamic rankings, while offering additional interpretability for training guidance.

[277] Learning Auxiliary Tasks Improves Reference-Free Hallucination Detection in Open-Domain Long-Form Generation

Chengwei Qin,Wenxuan Zhou,Karthik Abinav Sankararaman,Nanshu Wang,Tengyu Xu,Alexander Radovic,Eryk Helenowski,Arya Talebzadeh,Aditya Tayade,Sinong Wang,Shafiq Joty,Han Fang,Hao Ma

Main category: cs.CL

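TL;DR: The paper proposes RATE-FT, a new method for detecting hallucinations (factual errors) in open-domain long-form generation that combines fine-tuning with an auxiliary task and improves detection performance.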

Details

Motivation: Existing approaches to hallucination detection in open-domain long-form tasks are either limited to specific domains or rely on external fact-checking tools that may not be available, so a reliable reference-free detection method is needed. Method: Studies detection based on internal states (such as output probability and entropy) and finds it of limited use; then explores prompting, probing, and fine-tuning, and finally proposes RATE-FT, which augments fine-tuning with a jointly learned auxiliary task. Result: RATE-FT performs well across model families and datasets, for example +3% over general fine-tuning on LongFact. Conclusion: RATE-FT is an effective and generalizable hallucination detection method for open-domain long-form generation. Abstract: Hallucination, the generation of factually incorrect information, remains a significant challenge for large language models (LLMs), especially in open-domain long-form generation. Existing approaches for detecting hallucination in long-form tasks either focus on limited domains or rely heavily on external fact-checking tools, which may not always be available. In this work, we systematically investigate reference-free hallucination detection in open-domain long-form responses. Our findings reveal that internal states (e.g., model's output probability and entropy) alone are insufficient for reliably (i.e., better than random guessing) distinguishing between factual and hallucinated content. To enhance detection, we explore various existing approaches, including prompting-based methods, probing, and fine-tuning, with fine-tuning proving the most effective. To further improve the accuracy, we introduce a new paradigm, named RATE-FT, that augments fine-tuning with an auxiliary task for the model to jointly learn with the main task of hallucination detection. With extensive experiments and analysis using a variety of model families & datasets, we demonstrate the effectiveness and generalizability of our method, e.g., +3% over general fine-tuning methods on LongFact.

[278] $K$-MSHC: Unmasking Minimally Sufficient Head Circuits in Large Language Models with Experiments on Syntactic Classification Tasks

Pratim Chowdhary

Main category: cs.CL

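TL;DR: The paper proposes $K$-MSHC, a method for identifying minimal sets of critical attention heads in mid-sized language models, and validates it on Gemma-9B with the Search-K-MSHC algorithm. Different tasks rely on different head circuits, with partial overlap.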

Details

Motivation: Understanding which neural components drive specific capabilities in mid-sized language models ($\leq$10B parameters) is a key challenge. Method: Introduces the $K$-MSHC methodology and the Search-K-MSHC algorithm for identifying minimal sets of attention heads that are crucial for classification tasks. Result: An analysis of three syntactic task families on Gemma-9B reveals task-specific head circuits and non-linear circuit overlap patterns. Conclusion: Syntactic and numerical competencies emerge from specialized yet partially reusable head circuits. Abstract: Understanding which neural components drive specific capabilities in mid-sized language models ($\leq$10B parameters) remains a key challenge. We introduce the $(\bm{K}, \epsilon)$-Minimum Sufficient Head Circuit ($K$-MSHC), a methodology to identify minimal sets of attention heads crucial for classification tasks as well as Search-K-MSHC, an efficient algorithm for discovering these circuits. Applying our Search-K-MSHC algorithm to Gemma-9B, we analyze three syntactic task families: grammar acceptability, arithmetic verification, and arithmetic word problems. Our findings reveal distinct task-specific head circuits, with grammar tasks predominantly utilizing early layers, word problems showing pronounced activity in both shallow and deep regions, and arithmetic verification demonstrating a more distributed pattern across the network. We discover non-linear circuit overlap patterns, where different task pairs share computational components at varying levels of importance. While grammar and arithmetic share many "weak" heads, arithmetic and word problems share more consistently critical "strong" heads. Importantly, we find that each task maintains dedicated "super-heads" with minimal cross-task overlap, suggesting that syntactic and numerical competencies emerge from specialized yet partially reusable head circuits.

[279] LLM-Based Evaluation of Low-Resource Machine Translation: A Reference-less Dialect Guided Approach with a Refined Sylheti-English Benchmark

Md. Atiqur Rahman,Sabrina Islam,Mushfiqul Haque Omi

Main category: cs.CL

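TL;DR: The paper proposes a dialect-guided framework for improving the evaluation of low-resource machine translation (MT), strengthening LLM-based evaluation with dialect-specific content.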

Details

Motivation: MT evaluation for low-resource languages is hampered by scarce reference translations and dialect diversity, and LLMs perform poorly without dialect context. Method: Extends the ONUBAD dataset, augments the tokenizer vocabulary with dialect-specific terms, adds a regression head, and designs a dialect-guided prompting strategy. Result: The pipeline outperforms existing methods across multiple LLMs, with a highest gain of +0.1083 in Spearman correlation. Conclusion: The dialect-guided framework improves MT evaluation for low-resource languages; the dataset and code are publicly available. Abstract: Evaluating machine translation (MT) for low-resource languages poses a persistent challenge, primarily due to the limited availability of high quality reference translations. This issue is further exacerbated in languages with multiple dialects, where linguistic diversity and data scarcity hinder robust evaluation. Large Language Models (LLMs) present a promising solution through reference-free evaluation techniques; however, their effectiveness diminishes in the absence of dialect-specific context and tailored guidance. In this work, we propose a comprehensive framework that enhances LLM-based MT evaluation using a dialect guided approach. We extend the ONUBAD dataset by incorporating Sylheti-English sentence pairs, corresponding machine translations, and Direct Assessment (DA) scores annotated by native speakers. To address the vocabulary gap, we augment the tokenizer vocabulary with dialect-specific terms. We further introduce a regression head to enable scalar score prediction and design a dialect-guided (DG) prompting strategy. Our evaluation across multiple LLMs shows that the proposed pipeline consistently outperforms existing methods, achieving the highest gain of +0.1083 in Spearman correlation, along with improvements across other evaluation settings. The dataset and the code are available at https://github.com/180041123-Atiq/MTEonLowResourceLanguage.
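The headline number in entry [279] is a gain in Spearman correlation between predicted scores and human Direct Assessment scores. As a reminder of what that metric measures, here is a minimal standard-library sketch; the function names and the toy scores are illustrative assumptions, not data from the paper.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy example: made-up human DA scores vs. made-up model-predicted scores.
human = [55.0, 70.0, 82.0, 40.0, 91.0]
model = [0.61, 0.58, 0.80, 0.35, 0.95]
print(spearman(human, model))
```

Because only ranks matter, a metric can score well here even if its raw values are on a different scale from the human scores, which suits a regression head whose outputs are not calibrated to the DA range.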

[280] The Tower of Babel Revisited: Multilingual Jailbreak Prompts on Closed-Source Large Language Models

Linghan Huang,Haolin Jin,Zhaoge Bi,Pengyue Yang,Peizhou Zhao,Taozhao Chen,Xiongfei Wu,Lei Ma,Huaming Chen

Main category: cs.CL

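TL;DR: This paper studies the vulnerability of closed-source large language models under multilingual attack scenarios, presents an integrated adversarial framework, and evaluates the safety of several models, including GPT-4o. It finds that Chinese prompts achieve higher attack success rates and calls for stronger language-aware alignment and cross-lingual defenses.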

Details

Motivation: Existing research focuses mainly on open-source models, while the safety of closed-source LLMs under multilingual attack scenarios remains underexplored. Method: Presents an integrated adversarial framework that combines multiple attack techniques to systematically evaluate several closed-source models on six categories of security content, generating 38,400 responses. Result: Qwen-Max is the most vulnerable and GPT-4o has the strongest defense; Chinese prompts yield higher attack success rates, and the novel Two-Sides attack technique is the most effective. Conclusion: Stronger language-aware alignment and cross-lingual defenses are needed to build more robust and inclusive AI systems. Abstract: Large language models (LLMs) have seen widespread applications across various domains, yet remain vulnerable to adversarial prompt injections. While most existing research on jailbreak attacks and hallucination phenomena has focused primarily on open-source models, we investigate the frontier of closed-source LLMs under multilingual attack scenarios. We present a first-of-its-kind integrated adversarial framework that leverages diverse attack techniques to systematically evaluate frontier proprietary solutions, including GPT-4o, DeepSeek-R1, Gemini-1.5-Pro, and Qwen-Max. Our evaluation spans six categories of security content in both English and Chinese, generating 38,400 responses across 32 types of jailbreak attacks. Attack success rate (ASR) is used as the quantitative metric to assess performance along three dimensions: prompt design, model architecture, and language environment. Our findings suggest that Qwen-Max is the most vulnerable, while GPT-4o shows the strongest defense. Notably, prompts in Chinese consistently yield higher ASRs than their English counterparts, and our novel Two-Sides attack technique proves to be the most effective across all models. This work highlights a dire need for language-aware alignment and robust cross-lingual defenses in LLMs, and we hope it will inspire researchers, developers, and policymakers toward more robust and inclusive AI systems.
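
Attack success rate (ASR) is the fraction of responses judged to be successful jailbreaks, reported per slice (model, language, or attack type). A minimal sketch of that bookkeeping follows; the record layout and the `jailbroken` field are assumptions made for illustration, and how a response is judged successful is outside the scope of this sketch:

```python
from collections import defaultdict


def attack_success_rate(records, group_by):
    """Compute ASR per group.

    records:  iterable of dicts, each with a boolean "jailbroken" field
              (hypothetical layout) and the field named by `group_by`.
    group_by: name of the field to slice on, e.g. "language" or "model".
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for record in records:
        group = record[group_by]
        totals[group] += 1
        successes[group] += int(record["jailbroken"])
    return {group: successes[group] / totals[group] for group in totals}
```

Calling the function once with `group_by="language"` and once with `group_by="model"` gives the two comparisons the abstract describes, provided each record carries both fields.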

[281] Enhance Mobile Agents Thinking Process Via Iterative Preference Learning

Kun Huang,Weikai Xu,Yuxuan Liu,Quandong Wang,Pengzhi Gao,Wei Liu,Jian Luan,Bin Wang,Bo An

Main category: cs.CL

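TL;DR: The paper proposes Iterative Preference Learning (IPL), which builds a CoaT-tree and scores it with rule-based rewards to improve the reasoning performance of VLM-based mobile agents on GUI tasks.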

Details

Motivation: CoaT trajectory data is scarce, which limits agents' expressiveness and generalization, and existing self-training methods either overlook the correctness of intermediate reasoning steps or rely on expensive process-level annotations. Method: Proposes IPL, which builds a CoaT-tree through iterative sampling, scores it with rule-based rewards, and optimizes reasoning steps with T-DPO pairs. A three-stage instruction evolution uses GPT-4o to generate diverse Q&A pairs. Result: On three standard mobile GUI-agent benchmarks, MobileIPL outperforms the baselines, reaches SOTA performance, and shows strong out-of-domain generalization. Conclusion: IPL addresses data scarcity and reasoning-step optimization, significantly improving the performance and generalization of mobile agents. Abstract: The Chain of Action-Planning Thoughts (CoaT) paradigm has been shown to improve the reasoning performance of VLM-based mobile agents in GUI tasks. However, the scarcity of diverse CoaT trajectories limits the expressiveness and generalization ability of such agents. While self-training is commonly employed to address data scarcity, existing approaches either overlook the correctness of intermediate reasoning steps or depend on expensive process-level annotations to construct process reward models (PRM). To address the above problems, we propose Iterative Preference Learning (IPL), which constructs a CoaT-tree through iterative sampling, scores leaf nodes using rule-based reward, and backpropagates feedback to derive Thinking-level Direct Preference Optimization (T-DPO) pairs. To prevent overfitting during warm-up supervised fine-tuning, we further introduce a three-stage instruction evolution, which leverages GPT-4o to generate diverse Q&A pairs based on real mobile UI screenshots, enhancing both generality and layout understanding. Experiments on three standard Mobile GUI-agent benchmarks demonstrate that our agent MobileIPL outperforms strong baselines, including continual pretraining models such as OS-ATLAS and UI-TARS. It achieves state-of-the-art performance across these benchmarks and shows strong generalization to out-of-domain scenarios.

[282] HBO: Hierarchical Balancing Optimization for Fine-Tuning Large Language Models

Weixuan Wang,Minghao Wu,Barry Haddow,Alexandra Birch

Main category: cs.CL

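TL;DR: The paper proposes HBO, a new method that uses a bilevel optimization strategy to address data imbalance and heterogeneity in LLM fine-tuning, significantly improving model performance.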

Details

Motivation: Existing methods usually handle data imbalance and heterogeneity only globally and overlook local issues, which limits their effectiveness. Method: HBO uses a bilevel optimization strategy with global and local actors that dynamically adjust data allocation through reward functions. Result: Across a range of tasks and language settings, HBO significantly outperforms the baselines and improves model accuracy. Conclusion: HBO offers a comprehensive solution to data imbalance and heterogeneity in LLM fine-tuning and improves training effectiveness. Abstract: Fine-tuning large language models (LLMs) on a mixture of diverse datasets poses challenges due to data imbalance and heterogeneity. Existing methods often address these issues across datasets (globally) but overlook the imbalance and heterogeneity within individual datasets (locally), which limits their effectiveness. We introduce Hierarchical Balancing Optimization (HBO), a novel method that enables LLMs to autonomously adjust data allocation during fine-tuning both across datasets (globally) and within each individual dataset (locally). HBO employs a bilevel optimization strategy with two types of actors: a Global Actor, which balances data sampling across different subsets of the training mixture, and several Local Actors, which optimize data usage within each subset based on difficulty levels. These actors are guided by reward functions derived from the LLM's training state, which measure learning progress and relative performance improvement. We evaluate HBO on three LLM backbones across nine diverse tasks in multilingual and multitask setups. Results show that HBO consistently outperforms existing baselines, achieving significant accuracy gains. Our in-depth analysis further demonstrates that both the global actor and local actors of HBO effectively adjust data usage during fine-tuning. HBO provides a comprehensive solution to the challenges of data imbalance and heterogeneity in LLM fine-tuning, enabling more effective training across diverse datasets.

[283] Bidirectional LMs are Better Knowledge Memorizers? A Benchmark for Real-world Knowledge Injection

Yuwei Zhang,Wenhao Yu,Shangbin Feng,Yifan Zhu,Letian Peng,Jayanth Srinivasa,Gaowen Liu,Jingbo Shang

Main category: cs.CL

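TL;DR: The paper introduces WikiDYK, a continuously updated knowledge-injection benchmark built from Wikipedia, for testing the knowledge memorization ability of large language models. Experiments show that causal language models (CLMs) memorize knowledge less well than bidirectional language models (BiLMs). The authors also propose a modular collaborative framework that further improves accuracy.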

Details

Motivation: There is no standardized, high-quality benchmark for the knowledge memorization ability of current LLMs, so a continuously updated evaluation method is needed. Method: Builds the WikiDYK benchmark from Wikipedia's "Did You Know..." entries, generates diverse question-answer pairs, and compares CLMs with BiLMs experimentally. Proposes a modular collaborative framework that uses BiLMs as external knowledge repositories. Result: CLMs score 23% lower than BiLMs in reliability accuracy. The collaborative framework improves reliability accuracy by up to 29.1%. Conclusion: WikiDYK is an effective, continuously evolving knowledge benchmark; BiLMs memorize knowledge better than CLMs, and the modular collaborative framework significantly improves model performance. Abstract: Despite significant advances in large language models (LLMs), their knowledge memorization capabilities remain underexplored, due to the lack of a standardized and high-quality test ground. In this paper, we introduce a novel, real-world and large-scale knowledge injection benchmark that evolves continuously over time without requiring human intervention. Specifically, we propose WikiDYK, which leverages recently-added and human-written facts from Wikipedia's "Did You Know..." entries. These entries are carefully selected by expert Wikipedia editors based on criteria such as verifiability and clarity. Each entry is converted into multiple question-answer pairs spanning diverse task formats from easy cloze prompts to complex multi-hop questions. WikiDYK contains 12,290 facts and 77,180 questions, and it is seamlessly extensible with future updates from Wikipedia editors. Extensive experiments using continued pre-training reveal a surprising insight: despite their prevalence in modern LLMs, Causal Language Models (CLMs) demonstrate significantly weaker knowledge memorization capabilities compared to Bidirectional Language Models (BiLMs), exhibiting a 23% lower accuracy in terms of reliability. To compensate for the smaller scales of current BiLMs, we introduce a modular collaborative framework utilizing ensembles of BiLMs as external knowledge repositories to integrate with LLMs. Experiments show that our framework further improves the reliability accuracy by up to 29.1%.

[284] ExpertSteer: Intervening in LLMs through Expert Knowledge

Weixuan Wang,Minghao Wu,Barry Haddow,Alexandra Birch

Main category: cs.CL

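TL;DR: ExpertSteer is a new method that uses external expert models to generate steering vectors for controlling the behavior of large language models (LLMs), significantly outperforming existing methods.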

Details

Motivation: Existing methods rely on steering vectors generated by the model itself, which limits their effectiveness and flexibility and prevents them from using the knowledge of external expert models. Method: A four-step process (dimension alignment, intervention layer pairing, steering vector generation, and steering vector application) transfers knowledge across models. Result: On 15 benchmarks, ExpertSteer significantly outperforms the baselines at minimal cost. Conclusion: ExpertSteer offers an efficient and flexible new way to control LLM behavior. Abstract: Large Language Models (LLMs) exhibit remarkable capabilities across various tasks, yet guiding them to follow desired behaviours during inference remains a significant challenge. Activation steering offers a promising method to control the generation process of LLMs by modifying their internal activations. However, existing methods commonly intervene in the model's behaviour using steering vectors generated by the model itself, which constrains their effectiveness to that specific model and excludes the possibility of leveraging powerful external expert models for steering. To address these limitations, we propose ExpertSteer, a novel approach that leverages arbitrary specialized expert models to generate steering vectors, enabling intervention in any LLM. ExpertSteer transfers the knowledge from an expert model to a target LLM through a cohesive four-step process: first aligning representation dimensions with auto-encoders to enable cross-model transfer, then identifying intervention layer pairs based on mutual information analysis, next generating steering vectors from the expert model using Recursive Feature Machines, and finally applying these vectors on the identified layers during inference to selectively guide the target LLM without updating model parameters. We conduct comprehensive experiments using three LLMs on 15 popular benchmarks across four distinct domains. Experiments demonstrate that ExpertSteer significantly outperforms established baselines across diverse tasks at minimal cost.

[285] LLMSR@XLLM25: An Empirical Study of LLM for Structural Reasoning

Xinye Li,Mingqi Wan,Dianbo Sui

Main category: cs.CL

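TL;DR: Team asdfo123's submission to the LLMSR@XLLM25 shared task uses Meta-Llama-3-8B-Instruct with few-shot, multi-turn prompting for structured reasoning; it needs no fine-tuning or external resources and ranks 5th.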

Details

Motivation: Evaluate how well large language models can produce fine-grained, controllable, and interpretable reasoning processes. Method: Few-shot, multi-turn prompting guides the model to extract conditions, decompose the reasoning chain into statement-evidence pairs, and verify their logical validity; a regular-expression post-processor normalizes the output. Result: Without fine-tuning or external resources, the system ranks 5th, with F1 scores on par with more complex methods. Conclusion: The paper analyzes the strengths and limitations of the approach and outlines future research directions for structural reasoning with LLMs. Abstract: We present Team asdfo123's submission to the LLMSR@XLLM25 shared task, which evaluates large language models on producing fine-grained, controllable, and interpretable reasoning processes. Systems must extract all problem conditions, decompose a chain of thought into statement-evidence pairs, and verify the logical validity of each pair. Leveraging only the off-the-shelf Meta-Llama-3-8B-Instruct, we craft a concise few-shot, multi-turn prompt that first enumerates all conditions and then guides the model to label, cite, and adjudicate every reasoning step. A lightweight post-processor based on regular expressions normalises spans and enforces the official JSON schema. Without fine-tuning, external retrieval, or ensembling, our method ranks 5th overall, achieving macro F1 scores on par with substantially more complex and resource-consuming pipelines. We conclude by analysing the strengths and limitations of our approach and outlining directions for future research in structural reasoning with LLMs. Our code is available at https://github.com/asdfo123/LLMSR-asdfo123.

[286] UniEdit: A Unified Knowledge Editing Benchmark for Large Language Models

Qizhou Chen,Dakan Wang,Taolin Zhang,Zaoming Yan,Chengsong You,Chengyu Wang,Xiaofeng He

Main category: cs.CL

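TL;DR: UniEdit is a unified benchmark grounded in open-domain knowledge for evaluating the comprehensiveness and diversity of large language model (LLM) editing.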

Details

Motivation: Existing LLM editing datasets are confined to narrow knowledge domains and do not comprehensively assess the breadth of editing demands or the ripple effects of edits. Method: Builds editing samples by selecting entities from 25 common domains, designs the NMCS algorithm to sample subgraphs for evaluating ripple effects, and uses proprietary LLMs to convert knowledge subgraphs into natural language text. Result: Statistical analysis confirms the scale, comprehensiveness, and diversity of UniEdit, and experiments characterize the editing performance of different LLMs and editors on open-domain knowledge. Conclusion: UniEdit provides a valuable benchmark and insights for future research, highlighting the challenges and opportunities of editing in open knowledge domains. Abstract: Model editing aims to enhance the accuracy and reliability of large language models (LLMs) by efficiently adjusting their internal parameters. Currently, most LLM editing datasets are confined to narrow knowledge domains and cover a limited range of editing evaluation. They often overlook the broad scope of editing demands and the diversity of ripple effects resulting from edits. In this context, we introduce UniEdit, a unified benchmark for LLM editing grounded in open-domain knowledge. First, we construct editing samples by selecting entities from 25 common domains across five major categories, utilizing the extensive triple knowledge available in open-domain knowledge graphs to ensure comprehensive coverage of the knowledge domains. To address the issues of generality and locality in editing, we design a Neighborhood Multi-hop Chain Sampling (NMCS) algorithm that samples subgraphs around a given piece of knowledge so that comprehensive ripple effects can be evaluated. Finally, we employ proprietary LLMs to convert the sampled knowledge subgraphs into natural language text, guaranteeing grammatical accuracy and syntactical diversity. Extensive statistical analysis confirms the scale, comprehensiveness, and diversity of our UniEdit benchmark. We conduct comprehensive experiments across multiple LLMs and editors, analyzing their performance to highlight strengths and weaknesses in editing across open knowledge domains and various evaluation criteria, thereby offering valuable insights for future research endeavors.

[287] Wisdom from Diversity: Bias Mitigation Through Hybrid Human-LLM Crowds

Axel Abels,Tom Lenaerts

Main category: cs.CL

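TL;DR: The study finds that large language models (LLMs) mirror the biases in their data, and that crowd-based strategies (such as weighted aggregation) and hybrid crowds (LLMs plus humans) can effectively reduce bias and improve accuracy.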

Details

Motivation: Despite their strong performance, LLMs can inadvertently perpetuate biases in their training data, so methods for reducing bias are needed. Method: Analyzes LLM responses to bias-eliciting headlines, compares simple averaging with weighted aggregation, and introduces a hybrid crowd (LLMs plus humans) strategy. Result: Weighted aggregation reduces bias more than simple averaging, and the hybrid crowd strategy further improves performance and lowers bias. Conclusion: By combining the accuracy of LLMs with the diversity of humans, hybrid crowds are an effective way to reduce bias and improve performance. Abstract: Despite their performance, large language models (LLMs) can inadvertently perpetuate biases found in the data they are trained on. By analyzing LLM responses to bias-eliciting headlines, we find that these models often mirror human biases. To address this, we explore crowd-based strategies for mitigating bias through response aggregation. We first demonstrate that simply averaging responses from multiple LLMs, intended to leverage the "wisdom of the crowd", can exacerbate existing biases due to the limited diversity within LLM crowds. In contrast, we show that locally weighted aggregation methods more effectively leverage the wisdom of the LLM crowd, achieving both bias mitigation and improved accuracy. Finally, recognizing the complementary strengths of LLMs (accuracy) and humans (diversity), we demonstrate that hybrid crowds containing both significantly enhance performance and further reduce biases across ethnic and gender-related contexts.

[288] CAPTURE: Context-Aware Prompt Injection Testing and Robustness Enhancement

Gauri Kholkar,Ratinder Ahuja

Main category: cs.CL

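TL;DR: CAPTURE is a new context-aware benchmark that evaluates prompt injection guardrail models on attack detection and over-defense tendencies, revealing significant weaknesses in both highly adversarial and benign scenarios.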

Details

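Motivation: Prompt injection is a major security risk for large language models; the effectiveness of existing guardrail models in context-aware settings is underexplored, and they tend to over-defend. Method: Introduces the CAPTURE benchmark, which assesses attack detection and over-defense tendencies with minimal in-domain examples. Result: Experiments show that current guardrail models have high false-negative rates on adversarial cases and high false-positive rates in benign scenarios. Conclusion: Existing prompt injection guardrail models have serious limitations and need improvement to meet the challenges of real-world use. Abstract: Prompt injection remains a major security risk for large language models. However, the efficacy of existing guardrail models in context-aware settings remains underexplored, as they often rely on static attack benchmarks. Additionally, they have over-defense tendencies. We introduce CAPTURE, a novel context-aware benchmark assessing both attack detection and over-defense tendencies with minimal in-domain examples. Our experiments reveal that current prompt injection guardrail models suffer from high false negatives in adversarial cases and excessive false positives in benign scenarios, highlighting critical limitations.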

[289] From n-gram to Attention: How Model Architectures Learn and Propagate Bias in Language Modeling

Mohsinul Kabir,Tasfia Tahsin,Sophia Ananiadou

Main category: cs.CL

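TL;DR: The paper proposes a methodology grounded in comparative behavioral theory to study how data and model architecture interact in bias propagation in language models, finding that the temporal provenance of data, model design, and architecture all significantly affect bias propagation.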

Details

Motivation: Current research focuses mostly on data quality and neglects the influence of model architecture and the temporal provenance of data on bias; systematic studies of the origins of bias are lacking. Method: Grounded in comparative behavioral theory, the study contrasts n-gram models with transformers to analyze how data, model design, and temporal dynamics affect bias propagation. Result: n-gram models are sensitive to context window size while transformer architectures are more robust; the temporal provenance of data significantly affects bias; and different architectures respond differently to bias injection. Conclusion: Bias should be traced to its origins across both data and model dimensions, not only at the level of symptoms, to reduce the potential harm of language models. Abstract: Current research on bias in language models (LMs) predominantly focuses on data quality, with significantly less attention paid to model architecture and temporal influences of data. Even more critically, few studies systematically investigate the origins of bias. We propose a methodology grounded in comparative behavioral theory to interpret the complex interaction between training data and model architecture in bias propagation during language modeling. Building on recent work that relates transformers to n-gram LMs, we evaluate how data, model design choices, and temporal dynamics affect bias propagation. Our findings reveal that: (1) n-gram LMs are highly sensitive to context window size in bias propagation, while transformers demonstrate architectural robustness; (2) the temporal provenance of training data significantly affects bias; and (3) different model architectures respond differentially to controlled bias injection, with certain biases (e.g. sexual orientation) being disproportionately amplified. As language models become ubiquitous, our findings highlight the need for a holistic approach -- tracing bias to its origins across both data and model dimensions, not just symptoms, to mitigate harm.

[290] SLOT: Sample-specific Language Model Optimization at Test-time

Yang Hu,Xingyu Zhang,Xueji Fang,Zhiyang Chen,Xiao Wang,Huatian Zhang,Guojun Qi

Main category: cs.CL

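TL;DR: SLOT is a parameter-efficient test-time inference method that improves a language model's responses to complex instructions by optimizing a sample-specific parameter vector.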

Details

Motivation: Existing large language models handle complex instructions poorly; SLOT aims to improve adaptation to each individual prompt through test-time optimization. Method: At test time, SLOT runs a few optimization steps to update a lightweight sample-specific parameter vector, adds it to the final hidden layer before the output head, and caches the last-layer features for efficient adaptation. Result: Experiments show strong results across multiple benchmarks and models; for example, Qwen2.5-7B gains 8.6% accuracy on GSM8K, and DeepSeek-R1-Distill-Llama-70B reaches SOTA on GPQA. Conclusion: Test-time optimization with SLOT significantly improves how language models respond to complex instructions, and the method is efficient and broadly applicable. Abstract: We propose SLOT (Sample-specific Language Model Optimization at Test-time), a novel and parameter-efficient test-time inference approach that enhances a language model's ability to more accurately respond to individual prompts. Existing Large Language Models (LLMs) often struggle with complex instructions, leading to poor performance on those not well represented among general samples. To address this, SLOT conducts a few optimization steps at test-time to update a light-weight sample-specific parameter vector. It is added to the final hidden layer before the output head, and enables efficient adaptation by caching the last layer features during per-sample optimization. By minimizing the cross-entropy loss on the input prompt only, SLOT helps the model better align with and follow each given instruction. In experiments, we demonstrate that our method outperforms the compared models across multiple benchmarks and LLMs. For example, Qwen2.5-7B with SLOT achieves an accuracy gain of 8.6% on GSM8K, from 57.54% to 66.19%, while DeepSeek-R1-Distill-Llama-70B with SLOT achieves a SOTA accuracy of 68.69% on GPQA among 70B-level models. Our code is available at https://github.com/maple-research-lab/SLOT.
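
The mechanism in the abstract (a per-sample vector added to the cached final hidden states, tuned for a few steps on the prompt's own cross-entropy loss) can be sketched in plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the output head is a small dense matrix, the optimizer is plain gradient descent (the abstract does not name one), and `slot_adapt` and its parameters are hypothetical names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prompt_loss_and_grad(features, targets, head, delta):
    """Mean next-token cross-entropy on the prompt and its gradient w.r.t. delta.

    features: cached final hidden states, one d-dim list per prompt position
    targets:  vocabulary index of the next prompt token at each position
    head:     frozen output head, one d-dim row per vocabulary entry
    delta:    sample-specific vector added to every cached hidden state
    """
    dim = len(delta)
    loss = 0.0
    grad = [0.0] * dim
    for hidden, target in zip(features, targets):
        shifted = [h + d for h, d in zip(hidden, delta)]
        logits = [sum(w * x for w, x in zip(row, shifted)) for row in head]
        probs = softmax(logits)
        loss -= math.log(probs[target])
        # d(loss)/d(delta) = head^T (probs - one_hot(target))
        for vocab_id, row in enumerate(head):
            err = probs[vocab_id] - (1.0 if vocab_id == target else 0.0)
            for k in range(dim):
                grad[k] += err * row[k]
    count = len(features)
    return loss / count, [g / count for g in grad]


def slot_adapt(features, targets, head, steps=3, lr=0.1):
    """Run a few gradient steps on delta only; features and head stay frozen."""
    delta = [0.0] * len(features[0])
    for _ in range(steps):
        _, grad = prompt_loss_and_grad(features, targets, head, delta)
        delta = [d - lr * g for d, g in zip(delta, grad)]
    return delta
```

Because the hidden states are cached, each optimization step only re-runs the output head, which is what keeps per-sample adaptation cheap; at generation time the tuned `delta` would be added to the final hidden state before computing logits.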

[291] Traversal Verification for Speculative Tree Decoding

Yepeng Weng,Qiao Hu,Xujie Chen,Li Liu,Dianwen Mei,Huishi Qiu,Jiang Tian,Zhongchao Shi

Main category: cs.CL

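TL;DR: Proposes Traversal Verification, a new speculative decoding algorithm that verifies candidates through leaf-to-root traversal, significantly improving acceptance length and throughput.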

Details

Motivation: Existing speculative decoding methods rely on layer-by-layer verification, which yields suboptimal acceptance length and poor utilization of candidate sequences. Method: Verifies entire token sequences through leaf-to-root traversal and preserves potentially valid subsequences. Result: Experiments show that the method improves acceptance length and throughput across a variety of large language models and tasks. Conclusion: Traversal Verification achieves substantial acceleration while guaranteeing lossless inference. Abstract: Speculative decoding is a promising approach for accelerating large language models. The primary idea is to use a lightweight draft model to speculate the output of the target model for multiple subsequent timesteps, and then verify them in parallel to determine whether the drafted tokens should be accepted or rejected. To enhance acceptance rates, existing frameworks typically construct token trees containing multiple candidates in each timestep. However, their reliance on token-level verification mechanisms introduces two critical limitations: First, the probability distribution of a sequence differs from that of individual tokens, leading to suboptimal acceptance length. Second, current verification schemes begin from the root node and proceed layer by layer in a top-down manner. Once a parent node is rejected, all its child nodes should be discarded, resulting in inefficient utilization of speculative candidates. This paper introduces Traversal Verification, a novel speculative decoding algorithm that fundamentally rethinks the verification paradigm through leaf-to-root traversal. Our approach considers the acceptance of the entire token sequence from the current node to the root, and preserves potentially valid subsequences that would be prematurely discarded by existing methods. We theoretically prove that the probability distribution obtained through Traversal Verification is identical to that of the target model, guaranteeing lossless inference while achieving substantial acceleration gains. Experimental results across different large language models and multiple tasks show that our method consistently improves acceptance length and throughput over existing methods.
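
To make the second limitation concrete, here is a minimal sketch of the token-level, top-down baseline that the abstract criticizes, not of Traversal Verification itself. It assumes the standard speculative sampling rule (accept a drafted token with probability min(1, p/q), where p and q are the target and draft probabilities of that token), checks a single root-to-leaf chain, and omits the residual resampling step that follows a rejection. The function name and the injected `uniform_draws` argument are hypothetical, chosen so the example is deterministic.

```python
def accepted_prefix_length(target_probs, draft_probs, uniform_draws):
    """Top-down token-level verification of one drafted chain.

    target_probs[i]:  target-model probability of the i-th drafted token
    draft_probs[i]:   draft-model probability of the same token (must be > 0)
    uniform_draws[i]: a number in [0, 1) standing in for a random draw
    Returns how many leading tokens are accepted before the first rejection.
    """
    accepted = 0
    for p, q, u in zip(target_probs, draft_probs, uniform_draws):
        if u < min(1.0, p / q):
            accepted += 1
        else:
            # A rejected parent discards every descendant, even ones the
            # target model would have accepted on their own.
            break
    return accepted
```

With target probabilities `[0.9, 0.1, 0.8]`, draft probabilities `[0.6, 0.5, 0.4]`, and all draws equal to `0.5`, the second token is rejected, so the third is never considered even though its own acceptance ratio exceeds 1. That wasted candidate is the kind of subsequence the leaf-to-root scheme aims to preserve.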

[292] The power of text similarity in identifying AI-LLM paraphrased documents: The case of BBC news articles and ChatGPT

Konstantinos Xylogiannopoulos,Petros Xanthopoulos,Panagiotis Karampelas,Georgios Bakamitsos

Main category: cs.CL

TL;DR: The paper proposes a pattern-similarity-based method for detecting news articles paraphrased by ChatGPT, reaching 96.23% accuracy.

Details

Motivation: Text paraphrased by generative AI can infringe copyright and reduce the revenue of original content creators, yet little research addresses this threat. Method: Proposes an algorithmic scheme that does not rely on deep learning and detects ChatGPT-paraphrased news through pattern similarity. Result: On a benchmark dataset of 2,224 real news articles and 2,224 paraphrased articles, the method scores above 96% on accuracy, precision, sensitivity, specificity, and F1. Conclusion: The method identifies ChatGPT-paraphrased content efficiently and provides a practical tool for copyright protection. Abstract: Generative AI paraphrased text can be used for copyright infringement, and AI paraphrased content can deprive original content creators of substantial revenue. Despite this recent surge of malicious use of generative AI, there are few academic publications that research this threat. In this article, we demonstrate the ability of pattern-based similarity detection for AI paraphrased news recognition. We propose an algorithmic scheme that is not limited to detecting whether an article is an AI paraphrase but, more importantly, identifies that the source of infringement is ChatGPT. The proposed method is tested with a benchmark dataset specifically created for this task that comprises 2,224 real BBC articles across five different news categories, as well as 2,224 paraphrased articles created with ChatGPT. Results show that our pattern similarity-based method, which makes no use of deep learning, can detect ChatGPT-assisted paraphrased articles with 96.23% accuracy, 96.25% precision, 96.21% sensitivity, 96.25% specificity, and a 96.23% F1 score.
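
The five figures reported above all derive from one binary confusion matrix. As a minimal sketch of those definitions (the function name is hypothetical, and "positive" is assumed here to mean "ChatGPT-paraphrased"; the abstract does not publish the underlying counts, so none are reproduced):

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts.

    tp: paraphrased articles flagged as paraphrased
    fp: original articles wrongly flagged as paraphrased
    tn: original articles correctly left unflagged
    fn: paraphrased articles that were missed
    """
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # also called recall
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }
```

On a balanced dataset such as the one described (equal numbers of original and paraphrased articles), accuracy is the mean of sensitivity and specificity, which is consistent with the reported 96.21%, 96.25%, and 96.23%.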

[293] Table-R1: Region-based Reinforcement Learning for Table Understanding

Zhenhe Wu,Jian Yang,Jiaheng Liu,Xianjie Wu,Changzai Pan,Jie Zhang,Yu Zhao,Shuangyong Song,Yongxiang Li,Zhoujun Li

Main category: cs.CL

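TL;DR: Table-R1 uses reinforcement learning to improve language models' table understanding, combining region evidence with a mixed reward system to significantly improve the performance and efficiency of table question answering.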

Details

Motivation: The structured nature of tables poses unique challenges for language models, and existing methods do little to optimize table question answering performance. Method: Proposes Region-Enhanced Supervised Fine-Tuning (RE-SFT) and Table-Aware Group Relative Policy Optimization (TARPO), combining textual, symbolic, and program-based reasoning. Result: Table-R1 improves performance by an average of 14.36 points on multiple benchmark datasets, and TARPO reduces response token consumption by 67.5%. Conclusion: Table-R1 and TARPO significantly improve the performance and efficiency of language models in table reasoning. Abstract: Tables present unique challenges for language models due to their structured row-column interactions, necessitating specialized approaches for effective comprehension. While large language models (LLMs) have demonstrated potential in table reasoning through prompting and techniques like chain-of-thought (CoT) and program-of-thought (PoT), optimizing their performance for table question answering remains underexplored. In this paper, we introduce region-based Table-R1, a novel reinforcement learning approach that enhances LLM table understanding by integrating region evidence into reasoning steps. Our method employs Region-Enhanced Supervised Fine-Tuning (RE-SFT) to guide models in identifying relevant table regions before generating answers, incorporating textual, symbolic, and program-based reasoning. Additionally, Table-Aware Group Relative Policy Optimization (TARPO) introduces a mixed reward system to dynamically balance region accuracy and answer correctness, with decaying region rewards and consistency penalties to align reasoning steps. Experiments show that Table-R1 achieves an average performance improvement of 14.36 points across multiple base models on three benchmark datasets, even outperforming baseline models with ten times the parameters, while TARPO reduces response token consumption by 67.5% compared to GRPO, significantly advancing LLM capabilities in efficient tabular reasoning.

[294] PSC: Extending Context Window of Large Language Models via Phase Shift Calibration

Wenqiao Zhu,Chao Xu,Lulu Wang,Jun Wu

Main category: cs.CL

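TL;DR: PSC (Phase Shift Calibration) is a small module that calibrates the predefined RoPE frequencies and significantly improves existing methods (such as PI, YaRN, and LongRoPE), especially when extending the context window.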

Details

Motivation: Existing methods struggle to predefine the optimal frequency scaling factors for RoPE because the search space grows exponentially, so a calibration mechanism is needed. Method: Proposes the PSC module, which calibrates the predefined RoPE frequencies to strengthen the context-extension ability of existing methods. Experiments cover multiple models and tasks. Result: With PSC enabled, perplexity drops significantly, and the reduction grows as the context window expands from 16k to 64k; the method is broadly applicable and robust. Conclusion: PSC is an efficient and general calibration tool that significantly improves RoPE performance when extending the context window. Abstract: Rotary Position Embedding (RoPE) is an efficient position encoding approach and is widely utilized in numerous large language models (LLMs). Recently, many methods have been put forward to further expand the context window based on RoPE. The core concept of those methods is to predefine or search for a set of factors to rescale the base frequencies of RoPE. Nevertheless, it is quite a challenge for existing methods to predefine an optimal factor due to the exponential search space. In view of this, we introduce PSC (Phase Shift Calibration), a small module for calibrating the frequencies predefined by existing methods. With the employment of PSC, we demonstrate that many existing methods, such as PI, YaRN, and LongRoPE, can be further enhanced. We conducted extensive experiments across multiple models and tasks. The results demonstrate that (1) when PSC is enabled, the comparative reductions in perplexity increase as the context window size is varied from 16k to 32k and up to 64k, and (2) our approach is broadly applicable and exhibits robustness across a variety of models and tasks. The code can be found at https://github.com/WNQzhu/PSC.
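
For context on what "rescaling the base frequencies of RoPE" means, the sketch below shows the standard RoPE frequencies and a uniform rescaling in the style of Position Interpolation (PI). It illustrates only the predefined factors that PSC is said to calibrate, not PSC itself, whose internals the abstract does not describe. The function names and the example window lengths in the note after the code are assumptions for illustration.

```python
def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE frequencies: theta_i = base^(-2i / head_dim)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def rescale_frequencies(frequencies, factors):
    """Divide each frequency by its scaling factor.

    PI uses one shared factor (target_length / training_length) for every
    dimension; other methods choose a different factor per dimension.
    """
    return [freq / factor for freq, factor in zip(frequencies, factors)]


def rotation_angles(position, frequencies):
    """Rotation angle applied to each 2-D feature pair at a given position."""
    return [position * freq for freq in frequencies]
```

With a hypothetical training window of 4,096 tokens and a target window of 16,384, PI's shared factor is 4, so position 16,384 under the rescaled frequencies receives the same rotation angles as position 4,096 did originally. The difficulty the abstract points to is choosing good per-dimension factors when they are no longer shared.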

[295] Learning to Play Like Humans: A Framework for LLM Adaptation in Interactive Fiction Games

Jinming Zhang,Yunfei Long

Main category: cs.CL

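TL;DR: Proposes LPLH, a cognitively inspired framework that guides large language models (LLMs) to learn and play interactive fiction games (IF games) systematically, emulating human players' narrative understanding and decision-making.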

Details

Motivation: Existing methods focus too heavily on task performance metrics and neglect human-like understanding of narrative context and game logic. LPLH draws on cognitive science principles to bring LLM agent behavior closer to that of human players. Method: The LPLH framework has three key components (structured map building, action learning, and feedback-driven experience analysis) that refine the decision-making process. Result: LPLH produces more interpretable, human-like gameplay and goes beyond purely exploratory strategies. Conclusion: LPLH reframes the IF game challenge as a learning problem for LLM agents, offering a new path toward context-aware gameplay in complex text-based environments. Abstract: Interactive Fiction games (IF games) are games in which players interact through natural language commands. While recent advances in Artificial Intelligence agents have reignited interest in IF games as a domain for studying decision-making, existing approaches prioritize task-specific performance metrics over human-like comprehension of narrative context and gameplay logic. This work presents a cognitively inspired framework that guides Large Language Models (LLMs) to learn and play IF games systematically. Our proposed **L**earning to **P**lay **L**ike **H**umans (LPLH) framework integrates three key components: (1) structured map building to capture spatial and narrative relationships, (2) action learning to identify context-appropriate commands, and (3) feedback-driven experience analysis to refine decision-making over time. By aligning LLM-based agents' behavior with narrative intent and commonsense constraints, LPLH moves beyond purely exploratory strategies to deliver more interpretable, human-like performance. Crucially, this approach draws on cognitive science principles to more closely simulate how human players read, interpret, and respond within narrative worlds. As a result, LPLH reframes the IF games challenge as a learning problem for LLM-based agents, offering a new path toward robust, context-aware gameplay in complex text-based environments.

[296] Introspective Growth: Automatically Advancing LLM Expertise in Technology Judgment

Siyang Wu,Honglin Bao,Nadav Kunievsky,James A. Evans

Main category: cs.CL

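TL;DR: The paper proposes self-questioning as a way to improve the understanding of large language models (LLMs) and validates it on a dataset of computer science patents.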

Details

Motivation: Although LLMs show signs of conceptual understanding, their internal knowledge is hard to access and evaluate, especially in domains that require fine-grained semantic distinctions. Method: Uses a self-questioning strategy in which the LLM generates and answers task-relevant questions, combined with retrieval from external scientific texts. Result: Self-questioning significantly improves model performance, outperforms chain-of-thought prompting particularly on understanding technical concepts, and reveals a new strategy for cross-model collaboration. Conclusion: Self-questioning is a practical mechanism for automatically improving LLM comprehension and a diagnostic tool for how internal and external knowledge are organized. Abstract: Large language models (LLMs) increasingly demonstrate signs of conceptual understanding, yet much of their internal knowledge remains latent, loosely structured, and difficult to access or evaluate. We propose self-questioning as a lightweight and scalable strategy to improve LLMs' understanding, particularly in domains where success depends on fine-grained semantic distinctions. To evaluate this approach, we introduce a challenging new benchmark of 1.3 million post-2015 computer science patent pairs, characterized by dense technical jargon and strategically complex writing. The benchmark centers on a pairwise differentiation task: can a model distinguish between closely related but substantively different inventions? We show that prompting LLMs to generate and answer their own questions - targeting the background knowledge required for the task - significantly improves performance. These self-generated questions and answers activate otherwise underutilized internal knowledge. Allowing LLMs to retrieve answers from external scientific texts further enhances performance, suggesting that model knowledge is compressed and lacks the full richness of the training data. We also find that chain-of-thought prompting and self-questioning converge, though self-questioning remains more effective for improving understanding of technical concepts. Notably, we uncover an asymmetry in prompting: smaller models often generate more fundamental, more open-ended, better-aligned questions for mid-sized models than large models with better understanding do, revealing a new strategy for cross-model collaboration.
Altogether, our findings establish self-questioning as both a practical mechanism for automatically improving LLM comprehension, especially in domains with sparse and underrepresented knowledge, and a diagnostic probe of how internal and external knowledge are organized.

[297] Towards DS-NER: Unveiling and Addressing Latent Noise in Distant Annotations

Yuyang Ding,Dan Qiao,Juntao Li,Jiajie Xu,Pingfu Chao,Xiaofang Zhou,Min Zhang

Main category: cs.CL

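TL;DR: This paper examines the effectiveness and robustness of distantly supervised named entity recognition (DS-NER) by analyzing the latent noise distribution of different distant annotation methods, and proposes a new noise assessment framework.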

Details

Motivation: Distantly supervised NER is a cheap and convenient alternative to human annotation, but the noise distribution across different annotation methods has not been studied thoroughly. Method: The study proceeds along two lines: (1) comparing traditional rule-based methods with large language model supervision; and (2) proposing a new framework that splits the noise problem into the unlabeled-entity problem (UEP) and the noisy-entity problem (NEP) and provides a solution for each. Result: Validated on eight real-world datasets, the method significantly outperforms existing approaches. Conclusion: The proposed method performs well across multiple data sources and annotation techniques and offers an effective solution to noise in DS-NER. Abstract: Distantly supervised named entity recognition (DS-NER) has emerged as a cheap and convenient alternative to traditional human annotation methods, enabling the automatic generation of training data by aligning text with external resources. Despite the many efforts in noise measurement methods, few works focus on the latent noise distribution between different distant annotation methods. In this work, we explore the effectiveness and robustness of DS-NER from two aspects: (1) distant annotation techniques, which encompass both traditional rule-based methods and the innovative large language model supervision approach, and (2) noise assessment, for which we introduce a novel framework. This framework addresses the challenges by distinctly categorizing them into the unlabeled-entity problem (UEP) and the noisy-entity problem (NEP), subsequently providing specialized solutions for each. Our proposed method achieves significant improvements on eight real-world distant supervision datasets originating from three different data sources and involving four distinct annotation techniques, confirming its superiority over current state-of-the-art methods.

[298] What are they talking about? Benchmarking Large Language Models for Knowledge-Grounded Discussion Summarization

Weixiao Zhou,Junnan Zhu,Gengyao Li,Xianfu Cheng,Xinnian Liang,Feifei Zhai,Zhoujun Li

Main category: cs.CL

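TL;DR: The paper studies how LLMs perform on a new task that requires combining a discussion with background knowledge to produce a summary, addressing the limitation that existing dialogue summarization systems rely on discussion information alone.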

Details

Motivation: Existing dialogue summarization systems rely only on discussion information, which easily confuses outside observers, so summary generation needs to incorporate background knowledge. Method: The task output is modeled as background and opinion summaries, two standardized summarization patterns are defined, and the first high-quality human-annotated benchmark and a hierarchical evaluation framework are introduced. Result: An evaluation of 12 LLMs shows that (1) LLMs struggle with background summary retrieval, generation, and opinion summary integration; (2) even top LLMs average below 69%; (3) current LLMs lack adequate self-evaluation and self-correction capabilities. Conclusion: Current LLMs perform poorly on summarization that must integrate background knowledge, and their self-evaluation and self-correction abilities need further improvement. Abstract: In this work, we investigate the performance of LLMs on a new task that requires combining discussion with background knowledge for summarization. This aims to address the limitation of outside observer confusion in existing dialogue summarization systems due to their reliance solely on discussion information. To achieve this, we model the task output as background and opinion summaries and define two standardized summarization patterns. To support assessment, we introduce the first benchmark comprising high-quality samples consistently annotated by human experts and propose a novel hierarchical evaluation framework with fine-grained, interpretable metrics. We evaluate 12 LLMs under structured-prompt and self-reflection paradigms. Our findings reveal: (1) LLMs struggle with background summary retrieval, generation, and opinion summary integration. (2) Even top LLMs achieve less than 69% average performance across both patterns. (3) Current LLMs lack adequate self-evaluation and self-correction capabilities for this task.

[299] Enhancing Large Language Models with Reward-guided Tree Search for Knowledge Graph Question and Answering

Xiao Long,Liansheng Zhuang,Chen Shen,Shaotian Yan,Yifei Li,Shafei Wang

Main category: cs.CL

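TL;DR: The paper proposes RTSoG, a training-free framework that decomposes complex questions into sub-questions and combines them with reward-guided tree search to improve reasoning-path selection in knowledge graph question answering.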

Details

Motivation: Existing LLM-based KGQA methods neglect the exploitation of historical reasoning paths, and complex question semantics can lead to inaccurate reasoning paths. Method: RTSoG decomposes a question into sub-questions, uses SC-MCTS to retrieve weighted reasoning paths, and stacks them to generate the final answer. Result: Experiments on four datasets confirm the effectiveness of RTSoG, with notable gains (8.7% on GrailQA and 7.0% on WebQSP). Conclusion: By combining sub-question decomposition with reward-guided tree search, RTSoG significantly improves performance on KGQA tasks. Abstract: Recently, large language models (LLMs) have demonstrated impressive performance in Knowledge Graph Question Answering (KGQA) tasks, which aim to find answers based on knowledge graphs (KGs) for natural language questions. Existing LLM-based KGQA methods typically follow the Graph Retrieval-Augmented Generation (GraphRAG) paradigm, which first retrieves reasoning paths from the large KGs, and then generates the answers based on them. However, these methods emphasize the exploration of new optimal reasoning paths in KGs while ignoring the exploitation of historical reasoning paths, which may lead to sub-optimal reasoning paths. Additionally, the complex semantics contained in questions may lead to the retrieval of inaccurate reasoning paths. To address these issues, this paper proposes a novel and training-free framework for KGQA tasks called Reward-guided Tree Search on Graph (RTSoG). RTSoG decomposes an original question into a series of simpler and well-defined sub-questions to handle the complex semantics. Then, a Self-Critic Monte Carlo Tree Search (SC-MCTS) guided by a reward model is introduced to iteratively retrieve weighted reasoning paths as contextual knowledge. Finally, it stacks the weighted reasoning paths according to their weights to generate the final answers. Extensive experiments on four datasets demonstrate the effectiveness of RTSoG. Notably, it achieves 8.7% and 7.0% performance improvement over the state-of-the-art method on GrailQA and WebQSP, respectively.

[300] KG-QAGen: A Knowledge-Graph-Based Framework for Systematic Question Generation and Long-Context LLM Evaluation

Nikita Tatarinov,Vidhyakshaya Kannan,Haricharana Srinivasa,Arnav Raj,Harpreet Singh Anand,Varun Singh,Aditya Luthra,Ravij Lade,Agam Shah,Sudheer Chava

Main category: cs.CL

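TL;DR: The KG-QAGen framework uses knowledge graphs to generate QA pairs at multiple complexity levels and evaluates how well language models retrieve and process information in long texts, finding that current models perform poorly on multi-hop reasoning and set operations.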

Details

Motivation: As the context length of modern language models grows, their ability to process information in long documents needs to be evaluated, yet existing benchmarks lack a systematic way to vary question complexity. Method: Structured representations of financial agreements are used to generate QA pairs at multiple complexity levels along three dimensions: multi-hop retrieval, set operations, and answer plurality. Result: A dataset of 20,139 QA pairs is built and 13 models are evaluated; they perform poorly on set-based comparisons and multi-hop logical inference. Conclusion: Models show systematic failure modes tied to semantic misinterpretation and the handling of implicit relations, and need further improvement. Abstract: The increasing context length of modern language models has created a need for evaluating their ability to retrieve and process information across extensive documents. While existing benchmarks test long-context capabilities, they often lack a structured way to systematically vary question complexity. We introduce KG-QAGen (Knowledge-Graph-based Question-Answer Generation), a framework that (1) extracts QA pairs at multiple complexity levels (2) by leveraging structured representations of financial agreements (3) along three key dimensions -- multi-hop retrieval, set operations, and answer plurality -- enabling fine-grained assessment of model performance across controlled difficulty levels. Using this framework, we construct a dataset of 20,139 QA pairs (the largest number among the long-context benchmarks) and open-source a part of it. We evaluate 13 proprietary and open-source LLMs and observe that even the best-performing models struggle with set-based comparisons and multi-hop logical inference. Our analysis reveals systematic failure modes tied to semantic misinterpretation and inability to handle implicit relations.

[301] LM$^2$otifs : An Explainable Framework for Machine-Generated Texts Detection

Xu Zheng,Zhuomin Chen,Esteban Schafir,Sipeng Chen,Hojat Allah Salehi,Haifeng Chen,Farhad Shirani,Wei Cheng,Dongsheng Luo

Main category: cs.CL

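TL;DR: LM$^2$otifs is a new explainable framework for detecting machine-generated text (MGT) that uses graph neural networks and explainability techniques to achieve accurate detection with multi-level explanations.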

Details

Motivation: Existing machine-generated text detection methods fall short on explainability; in particular, traditional techniques struggle to capture complex word relationships. Method: eXplainable Graph Neural Networks are applied in three stages: converting text into graphs that represent lexical dependencies, predicting with graph neural networks, and extracting interpretable motifs with a post-hoc method. Result: The framework achieves comparable performance on multiple benchmark datasets; the extracted motifs effectively separate human-generated from machine-generated text and reveal distinctive linguistic features of MGT. Conclusion: Beyond its detection performance, LM$^2$otifs provides multi-level explanations from individual words to sentence structures, offering a new approach to MGT detection. Abstract: The impressive ability of large language models to generate natural text across various tasks has led to critical challenges in authorship authentication. Although numerous detection methods have been developed to differentiate between machine-generated texts (MGT) and human-generated texts (HGT), the explainability of these methods remains a significant gap. Traditional explainability techniques often fall short in capturing the complex word relationships that distinguish HGT from MGT. To address this limitation, we present LM$^2$otifs, a novel explainable framework for MGT detection. Inspired by probabilistic graphical models, we provide a theoretical rationale for its effectiveness. LM$^2$otifs utilizes eXplainable Graph Neural Networks to achieve both accurate detection and interpretability. The LM$^2$otifs pipeline operates in three key stages: first, it transforms text into graphs based on word co-occurrence to represent lexical dependencies; second, graph neural networks are used for prediction; and third, a post-hoc explainability method extracts interpretable motifs, offering multi-level explanations from individual words to sentence structures. Extensive experiments on multiple benchmark datasets demonstrate the comparable performance of LM$^2$otifs. The empirical evaluation of the extracted explainable motifs confirms their effectiveness in differentiating HGT and MGT. Furthermore, qualitative analysis reveals distinct and visible linguistic fingerprints characteristic of MGT.
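The first pipeline stage (turning text into a word co-occurrence graph) can be illustrated with a minimal sketch. The sliding-window definition of co-occurrence, the whitespace tokenizer, and the function name `cooccurrence_graph` are assumptions made for illustration; the abstract does not specify how the paper builds its graphs.

```python
from collections import defaultdict


def cooccurrence_graph(text, window=2):
    """Build an undirected, weighted word co-occurrence graph.

    Nodes are lowercase whitespace tokens (an assumed tokenizer). The weight
    of edge (u, v) counts how often u and v appear within `window` tokens
    of each other.
    """
    tokens = text.lower().split()
    edges = defaultdict(int)
    for i, u in enumerate(tokens):
        # Pair each token with the next `window` tokens to its right.
        for v in tokens[i + 1 : i + 1 + window]:
            if u != v:
                edges[tuple(sorted((u, v)))] += 1
    return dict(edges)


graph = cooccurrence_graph("the cat sat on the mat", window=2)
```

The resulting edge dictionary could be passed to any graph library as an edge list; a graph neural network would then operate on that structure.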

[302] DS-ProGen: A Dual-Structure Deep Language Model for Functional Protein Design

Yanting Li,Jiyue Jiang,Zikang Wang,Ziqian Lin,Dongchen He,Yuheng Shan,Yanruisheng Shao,Jiayi Li,Xiangyu Shi,Jiuming Wang,Yanyu Chen,Yimin Fan,Han Li,Yu Li

Main category: cs.CL

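TL;DR: DS-ProGen is a dual-structure deep language model for protein design that combines backbone geometry with surface features, significantly improving the accuracy and functional relevance of sequence prediction.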

Details

Motivation: Existing methods rely on either backbone coordinates or surface features alone and cannot fully capture the complex chemical and geometric constraints of protein design. Method: DS-ProGen integrates backbone geometry with surface chemical and geometric descriptors in a next-amino-acid prediction paradigm. Result: It reaches a 61.47% recovery rate on the PRIDE dataset and performs well at predicting interactions with biological partners. Conclusion: Multi-modal structural encoding offers a synergistic advantage in protein design, and DS-ProGen shows robust functional retention. Abstract: Inverse Protein Folding (IPF) is a critical subtask in the field of protein design, aiming to engineer amino acid sequences capable of folding correctly into a specified three-dimensional (3D) conformation. Although substantial progress has been achieved in recent years, existing methods generally rely on either backbone coordinates or molecular surface features alone, which restricts their ability to fully capture the complex chemical and geometric constraints necessary for precise sequence prediction. To address this limitation, we present DS-ProGen, a dual-structure deep language model for functional protein design, which integrates both backbone geometry and surface-level representations. By incorporating backbone coordinates as well as surface chemical and geometric descriptors into a next-amino-acid prediction paradigm, DS-ProGen is able to generate functionally relevant and structurally stable sequences while satisfying both global and local conformational constraints. On the PRIDE dataset, DS-ProGen attains the current state-of-the-art recovery rate of 61.47%, demonstrating the synergistic advantage of multi-modal structural encoding in protein design. Furthermore, DS-ProGen excels in predicting interactions with a variety of biological partners, including ligands, ions, and RNA, confirming its robust functional retention capabilities.
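For readers unfamiliar with the recovery-rate metric, the sketch below shows the usual per-sequence definition: the fraction of positions at which the designed residue matches the native one. This is an assumed, common definition; the abstract does not state how the paper aggregates the metric across proteins, and `sequence_recovery` is a hypothetical helper name.

```python
def sequence_recovery(predicted, native):
    """Fraction of aligned positions where the predicted residue equals the native one.

    Both arguments are amino-acid strings of equal length (one letter per residue).
    """
    if len(predicted) != len(native):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(p == n for p, n in zip(predicted, native))
    return matches / len(native)


# One mismatch out of eight positions.
rate = sequence_recovery("MKTAYIAK", "MKTAYLAK")
```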

[303] ESC-Judge: A Framework for Comparing Emotional Support Conversational Agents

Navid Madani,Rohini Srihari

Main category: cs.CL

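TL;DR: ESC-Judge is an end-to-end evaluation framework for comparing emotional-support large language models (LLMs); it is grounded in Clara Hill's counseling model and fully automates the evaluation.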

Details

Motivation: The field lacks a scalable, theory-grounded way to evaluate emotional-support LLMs, and ESC-Judge aims to fill this gap. Method: ESC-Judge works in three stages: synthesizing help-seeker roles, isolating model-specific strategies, and using a specialized judge LLM for pairwise preference evaluation. Result: ESC-Judge agrees with PhD-level annotators on 85%, 83%, and 86% of Exploration, Insight, and Action decisions, respectively. Conclusion: ESC-Judge achieves human-level reliability at low cost and provides a tool for transparent progress in emotionally supportive AI. Abstract: Large language models (LLMs) increasingly power mental-health chatbots, yet the field still lacks a scalable, theory-grounded way to decide which model is most effective to deploy. We present ESC-Judge, the first end-to-end evaluation framework that (i) grounds head-to-head comparisons of emotional-support LLMs in Clara Hill's established Exploration-Insight-Action counseling model, providing a structured and interpretable view of performance, and (ii) fully automates the evaluation pipeline at scale. ESC-Judge operates in three stages: first, it synthesizes realistic help-seeker roles by sampling empirically salient attributes such as stressors, personality, and life history; second, it has two candidate support agents conduct separate sessions with the same role, isolating model-specific strategies; and third, it asks a specialized judge LLM to express pairwise preferences across rubric-anchored skills that span the Exploration, Insight, and Action spectrum. In our study, ESC-Judge matched PhD-level annotators on 85 percent of Exploration, 83 percent of Insight, and 86 percent of Action decisions, demonstrating human-level reliability at a fraction of the cost. All code, prompts, synthetic roles, transcripts, and judgment scripts are released to promote transparent progress in emotionally supportive AI.
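The per-stage agreement figures quoted above can be read as simple match rates between the judge LLM and human annotators. The sketch below computes such rates under that assumption; the tuple layout and the function name `agreement_by_stage` are hypothetical, and the paper may use a different agreement statistic.

```python
from collections import defaultdict


def agreement_by_stage(decisions):
    """Return the fraction of matching pairwise decisions per counseling stage.

    `decisions` is an iterable of (stage, judge_choice, human_choice) tuples,
    where each choice names the preferred agent, for example "A" or "B".
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for stage, judge_choice, human_choice in decisions:
        totals[stage] += 1
        matches[stage] += judge_choice == human_choice
    return {stage: matches[stage] / totals[stage] for stage in totals}


# Toy data, not taken from the paper.
toy = [
    ("Exploration", "A", "A"),
    ("Exploration", "B", "A"),
    ("Insight", "B", "B"),
    ("Action", "A", "A"),
]
rates = agreement_by_stage(toy)
```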

[304] Relation Extraction or Pattern Matching? Unravelling the Generalisation Limits of Language Models for Biographical RE

Varvara Arzt,Allan Hanbury,Michael Wiegand,Gábor Recski,Terra Blevins

Main category: cs.CL

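TL;DR: An analysis of the generalisation of relation extraction (RE) models shows that they perform poorly on unseen data, that high intra-dataset performance does not imply transferability, and that data quality matters more than lexical similarity.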

Details

Motivation: To assess whether relation extraction models learn robust relational patterns or rely on spurious correlations. Method: Cross-dataset experiments compare the effect of different data quality levels and adaptation strategies, such as fine-tuning and few-shot in-context learning. Result: Models transfer poorly across datasets and data quality is the key factor; fine-tuning works best with high-quality data, while few-shot learning works better with noisy data. Conclusion: Structural issues in RE benchmarks, such as single-relation samples and non-standardised negative class definitions, limit model transferability. Abstract: Analysing the generalisation capabilities of relation extraction (RE) models is crucial for assessing whether they learn robust relational patterns or rely on spurious correlations. Our cross-dataset experiments find that RE models struggle with unseen data, even within similar domains. Notably, higher intra-dataset performance does not indicate better transferability, instead often signaling overfitting to dataset-specific artefacts. Our results also show that data quality, rather than lexical similarity, is key to robust transfer, and the choice of optimal adaptation strategy depends on the quality of data available: while fine-tuning yields the best cross-dataset performance with high-quality data, few-shot in-context learning (ICL) is more effective with noisier data. However, even in these cases, zero-shot baselines occasionally outperform all cross-dataset results. Structural issues in RE benchmarks, such as single-relation per sample constraints and non-standardised negative class definitions, further hinder model transferability.

[305] Disambiguation in Conversational Question Answering in the Era of LLM: A Survey

Md Mehrab Tanjim,Yeonjun In,Xiang Chen,Victor S. Bursztyn,Ryan A. Rossi,Sungchul Kim,Guang-Jie Ren,Vaishnavi Muppala,Shun Jiang,Yongsung Kim,Chanyoung Park

Main category: cs.CL

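TL;DR: This survey examines ambiguity in natural language processing (NLP), covering its definition, forms, and implications in the context of large language models (LLMs) and conversational question answering (CQA), and presents a taxonomy and comparative analysis of disambiguation approaches.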

Details

Motivation: Resolving ambiguity in NLP is essential for robust and reliable language systems, especially now that LLMs are widely deployed. Method: The paper defines key terms and concepts, categorizes LLM-enabled disambiguation approaches, and compares them. Result: It surveys publicly available datasets for benchmarking ambiguity detection and resolution techniques and identifies future research directions. Conclusion: By comprehensively reviewing research on ambiguity and disambiguation with LLMs, the paper provides a reference for building more robust language systems. Abstract: Ambiguity remains a fundamental challenge in Natural Language Processing (NLP) due to the inherent complexity and flexibility of human language. With the advent of Large Language Models (LLMs), addressing ambiguity has become even more critical due to their expanded capabilities and applications. In the context of Conversational Question Answering (CQA), this paper explores the definition, forms, and implications of ambiguity for language driven systems, particularly in the context of LLMs. We define key terms and concepts, categorize various disambiguation approaches enabled by LLMs, and provide a comparative analysis of their advantages and disadvantages. We also explore publicly available datasets for benchmarking ambiguity detection and resolution techniques and highlight their relevance for ongoing research. Finally, we identify open problems and future research directions, proposing areas for further investigation. By offering a comprehensive review of current research on ambiguities and disambiguation with LLMs, we aim to contribute to the development of more robust and reliable language systems.

[306] Towards Reliable and Interpretable Traffic Crash Pattern Prediction and Safety Interventions Using Customized Large Language Models

Yang Zhao,Pu Wang,Yibo Zhao,Hongru Du,Hao,Yang

Main category: cs.CL

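TL;DR: The TrafficSafe framework uses LLMs to improve traffic crash prediction; by integrating and textualizing multi-modal data it substantially improves prediction performance and reveals key risk factors.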

Details

Motivation: Existing methods struggle to interpret the complex interplay within traffic crash data and fail to capture the semantic information and relationships across multiple data sources, which limits their ability to identify critical risk factors. Method: The TrafficSafe framework textualizes multi-modal data (such as textual reports, imagery, and environmental data) and customizes and fine-tunes LLMs for prediction and feature attribution. Result: The TrafficSafe LLM improves F1-score by 42% on average over baselines; alcohol-impaired driving is the leading factor in severe crashes, and aggressive and impairment-related behaviors contribute nearly twice as much to severe crashes as other driver behaviors. Conclusion: TrafficSafe is a major advance for traffic safety research and shows how advanced AI techniques can be translated into actionable, life-saving outcomes. Abstract: Predicting crash events is crucial for understanding crash distributions and their contributing factors, thereby enabling the design of proactive traffic safety policy interventions. However, existing methods struggle to interpret the complex interplay among various sources of traffic crash data, including numeric characteristics, textual reports, crash imagery, environmental conditions, and driver behavior records. As a result, they often fail to capture the rich semantic information and intricate interrelationships embedded in these diverse data sources, limiting their ability to identify critical crash risk factors. In this research, we propose TrafficSafe, a framework that adapts LLMs to reframe crash prediction and feature attribution as text-based reasoning. A multi-modal crash dataset including 58,903 real-world reports together with the associated infrastructure, environmental, driver, and vehicle information is collected and textualized into the TrafficSafe Event Dataset. By customizing and fine-tuning LLMs on this dataset, the TrafficSafe LLM achieves a 42% average improvement in F1-score over baselines. To interpret these predictions and uncover contributing factors, we introduce TrafficSafe Attribution, a sentence-level feature attribution framework enabling conditional risk analysis. Findings show that alcohol-impaired driving is the leading factor in severe crashes, with aggressive and impairment-related behaviors having nearly twice the contribution for severe crashes compared to other driver behaviors. Furthermore, TrafficSafe Attribution highlights pivotal features during model training, guiding strategic crash data collection for iterative performance improvements. The proposed TrafficSafe offers a transformative leap in traffic safety research, providing a blueprint for translating advanced AI technologies into responsible, actionable, and life-saving outcomes.

[307] Extracting memorized pieces of (copyrighted) books from open-weight language models

A. Feder Cooper,Aaron Gokaslan,Amy B. Cyphert,Christopher De Sa,Mark A. Lemley,Daniel E. Ho,Percy Liang

Main category: cs.CL

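TL;DR: The paper addresses the dispute in generative-AI copyright lawsuits over whether LLMs memorize protected content, and shows experimentally that the extent of memorization varies by model and by book, with complicated implications for copyright cases.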

Details

Motivation: To address the polarized claims in copyright lawsuits about whether LLMs memorize protected content and to show that the relationship between memorization and copyright is more complex. Method: A probabilistic extraction technique is used to extract pieces of the Books3 dataset from 13 open-weight LLMs, and experiments measure the extent of memorization. Result: Some LLMs (such as Llama 3.1 70B) memorize certain books (such as Harry Potter) almost entirely, but most books are not extensively memorized. Conclusion: The extent of memorization varies by model and by book; the results have significant implications for copyright cases but do not unambiguously favor either side. Abstract: Plaintiffs and defendants in copyright lawsuits over generative AI often make sweeping, opposing claims about the extent to which large language models (LLMs) have memorized plaintiffs' protected expression. Drawing on adversarial ML and copyright law, we show that these polarized positions dramatically oversimplify the relationship between memorization and copyright. To do so, we leverage a recent probabilistic extraction technique to extract pieces of the Books3 dataset from 13 open-weight LLMs. Through numerous experiments, we show that it's possible to extract substantial parts of at least some books from different LLMs. This is evidence that the LLMs have memorized the extracted text; this memorized content is copied inside the model parameters. But the results are complicated: the extent of memorization varies both by model and by book. With our specific experiments, we find that the largest LLMs don't memorize most books -- either in whole or in part. However, we also find that Llama 3.1 70B memorizes some books, like Harry Potter and 1984, almost entirely. We discuss why our results have significant implications for copyright cases, though not ones that unambiguously favor either side.
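The abstract does not describe the probabilistic extraction technique in detail. As a hedged illustration of the general idea, the sketch below assumes a common formulation: the probability that a model emits a specific continuation is the product of its per-token conditional probabilities, and the chance of seeing that continuation at least once in `n` independent samples follows from it. Both helper names are hypothetical, and the log-probabilities would come from whatever model is being probed.

```python
import math


def sequence_probability(token_logprobs):
    """Probability of an exact continuation, given per-token log-probabilities.

    Summing log-probabilities avoids numerical underflow for long sequences.
    """
    return math.exp(sum(token_logprobs))


def prob_extracted_within(p, n):
    """Chance that a continuation with probability p appears at least once in n samples.

    Assumes the n samples are drawn independently.
    """
    return 1.0 - (1.0 - p) ** n


# Toy numbers: two tokens, each generated with probability 0.5.
p = sequence_probability([math.log(0.5), math.log(0.5)])
```

Under this view, a passage counts as memorized when its probability is far higher than chance, without having to sample the text many times to observe it.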

[308] The taggedPBC: Annotating a massive parallel corpus for crosslinguistic investigations

Hiram Ring

Main category: cs.CL

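TL;DR: The paper introduces taggedPBC, a large automatically tagged parallel dataset intended to address the limitations of existing datasets for crosslinguistic research.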

Details

Motivation: Existing datasets cover either large amounts of data for a few languages or small amounts of data for many languages, which limits what they can reveal about universal properties of human language. Method: A dataset of more than 1,800 sentences of POS-tagged parallel text covering over 1,500 languages was developed and its tagging accuracy was validated. Result: Tag accuracy correlates well with existing SOTA taggers and hand-tagged corpora, and a new measure, the N1 ratio, correlates with expert determinations of word order. Conclusion: taggedPBC is an important step toward corpus-based crosslinguistic research, and the dataset is openly available for research and collaboration. Abstract: Existing datasets available for crosslinguistic investigations have tended to focus on large amounts of data for a small group of languages or a small amount of data for a large number of languages. This means that claims based on these datasets are limited in what they reveal about universal properties of the human language faculty. While this has begun to change through the efforts of projects seeking to develop tagged corpora for a large number of languages, such efforts are still constrained by limits on resources. The current paper reports on a large automatically tagged parallel dataset which has been developed to partially address this issue. The taggedPBC contains more than 1,800 sentences of POS-tagged parallel text data from over 1,500 languages, representing 133 language families and 111 isolates, dwarfing previously available resources. The accuracy of tags in this dataset is shown to correlate well with both existing SOTA taggers for high-resource languages (SpaCy, Trankit) as well as hand-tagged corpora (Universal Dependencies Treebanks). Additionally, a novel measure derived from this dataset, the N1 ratio, correlates with expert determinations of word order in three typological databases (WALS, Grambank, Autotyp) such that a Gaussian Naive Bayes classifier trained on this feature can accurately identify basic word order for languages not in those databases. While much work is still needed to expand and develop this dataset, the taggedPBC is an important step to enable corpus-based crosslinguistic investigations, and is made available for research and collaboration via GitHub.

[309] Enriching Patent Claim Generation with European Patent Dataset

Lekang Jiang,Chengzu Li,Stephan Goetz

Main category: cs.CL

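TL;DR: The paper introduces EPD, a European patent dataset that broadens the scope of patent research, supports multiple tasks such as claim generation, and significantly improves model performance.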

Details

Motivation: Existing research relies mainly on US patent data and lacks coverage of other jurisdictions and legal standards; EPD fills this gap. Method: The EPD dataset, containing high-quality European patent texts and structured metadata, is introduced and used to fine-tune large language models (LLMs). Result: Experiments show that LLMs fine-tuned on EPD significantly outperform models trained on previous datasets, and even GPT-4o, in claim quality and cross-domain generalization. Conclusion: EPD provides a more comprehensive benchmark for patent research and exposes real-world challenges that future work needs to address. Abstract: Drafting patent claims is time-intensive, costly, and requires professional skill. Therefore, researchers have investigated large language models (LLMs) to assist inventors in writing claims. However, existing work has largely relied on datasets from the United States Patent and Trademark Office (USPTO). To enlarge research scope regarding various jurisdictions, drafting conventions, and legal standards, we introduce EPD, a European patent dataset. EPD presents rich textual data and structured metadata to support multiple patent-related tasks, including claim generation. This dataset enriches the field in three critical aspects: (1) Jurisdictional diversity: Patents from different offices vary in legal and drafting conventions. EPD fills a critical gap by providing a benchmark for European patents to enable more comprehensive evaluation. (2) Quality improvement: EPD offers high-quality granted patents with finalized and legally approved texts, whereas others consist of patent applications that are unexamined or provisional. Experiments show that LLMs fine-tuned on EPD significantly outperform those trained on previous datasets and even GPT-4o in claim quality and cross-domain generalization. (3) Real-world simulation: We propose a difficult subset of EPD to better reflect real-world challenges of claim generation. Results reveal that all tested LLMs perform substantially worse on these challenging samples, which highlights the need for future research.

[310] Measuring Information Distortion in Hierarchical Ultra long Novel Generation: The Optimal Expansion Ratio

Hanwen Shen,Ting Ying

Main category: cs.CL

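TL;DR: The study examines how much a human-provided outline affects quality when generating million-word novels and proposes a two-stage hierarchical outline method that reduces semantic distortion.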

Details

Motivation: Existing frameworks mainly target shorter novels, ultra-long novel generation is little studied, and the distortion introduced when LLMs compress and reconstruct text needs to be quantified. Method: An information-theoretic analysis is combined with a two-stage hierarchical generation pipeline (outline -> detailed outline -> manuscript), and experiments determine the optimal outline length. Result: The two-stage approach significantly reduces semantic distortion and yields practical guidance for authors and researchers. Conclusion: The two-stage hierarchical outline approach is more effective for generating ultra-long novels and balances information preservation against human effort. Abstract: Writing novels with Large Language Models (LLMs) raises a critical question: how much human-authored outline is necessary to generate high-quality million-word novels? While frameworks such as DOME, Plan&Write, and Long Writer have improved stylistic coherence and logical consistency, they primarily target shorter novels (10k--100k words), leaving ultra-long generation largely unexplored. Drawing on insights from recent text compression methods like LLMZip and LLM2Vec, we conduct an information-theoretic analysis that quantifies distortion occurring when LLMs compress and reconstruct ultra-long novels under varying compression-expansion ratios. We introduce a hierarchical two-stage generation pipeline (outline -> detailed outline -> manuscript) and find an optimal outline length that balances information preservation with human effort. Through extensive experimentation with Chinese novels, we establish that a two-stage hierarchical outline approach significantly reduces semantic distortion compared to single-stage methods. Our findings provide empirically-grounded guidance for authors and researchers collaborating with LLMs to create million-word novels.

[311] Improving Multilingual Language Models by Aligning Representations through Steering

Omar Mahmoud,Buddhika Laknath Semage,Thommen George Karimpanal,Santu Rana

Main category: cs.CL

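TL;DR: With representation steering, adjusting the activations of a single model layer notably improves performance on non-English tokens, matching translation baselines and outperforming prompt optimization methods.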

Details

Motivation: How large language models (LLMs) process non-English tokens remains an open question despite significant progress in the field. Method: Representation steering adds a learned vector to the activations of a single model layer; the paper also analyzes how supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) change representation spaces. Result: The approach performs comparably to translation baselines and surpasses state-of-the-art prompt optimization methods. Conclusion: Steering the layer representations of LLMs can notably improve multilingual performance, and SFT and RLHF improve multilingual capabilities in a way that aligns with this reshaping of representations. Abstract: In this paper, we investigate how large language models (LLMs) process non-English tokens within their layer representations, an open question despite significant advancements in the field. Using representation steering, specifically by adding a learned vector to a single model layer's activations, we demonstrate that steering a single model layer can notably enhance performance. Our analysis shows that this approach achieves results comparable to translation baselines and surpasses state-of-the-art prompt optimization methods. Additionally, we highlight how advanced techniques like supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) improve multilingual capabilities by altering representation spaces. We further illustrate how these methods align with our approach to reshaping LLMs' layer representations.
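The core operation described above, adding a vector to one layer's activations, can be sketched without any deep-learning framework. The toy `forward` loop, the list-based hidden states, and the scaling factor `alpha` are assumptions for illustration only; in the paper the vector is learned, whereas here it is simply supplied by the caller.

```python
def steer(activations, steering_vector, alpha=1.0):
    """Add alpha * steering_vector to every token's hidden state."""
    return [
        [h + alpha * s for h, s in zip(token, steering_vector)]
        for token in activations
    ]


def forward(layers, x, steer_layer, steering_vector, alpha=1.0):
    """Run toy layers in order, steering only the output of layer `steer_layer`.

    `layers` is a list of functions mapping one token vector to another;
    `x` is a list of token vectors (lists of floats).
    """
    for idx, layer in enumerate(layers):
        x = [layer(token) for token in x]
        if idx == steer_layer:
            x = steer(x, steering_vector, alpha)
    return x


# Two toy layers: doubling, then adding one.
toy_layers = [lambda t: [2 * v for v in t], lambda t: [v + 1 for v in t]]
steered = forward(toy_layers, [[1.0, 2.0]], steer_layer=0, steering_vector=[0.5, -0.5])
```

In a real model the same effect is typically achieved by intercepting one layer's output during the forward pass, which leaves all model weights unchanged.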

[312] CMLFormer: A Dual Decoder Transformer with Switching Point Learning for Code-Mixed Language Modeling

Aditeya Baral,Allen George Ajith,Roshan Nayak,Mrityunjay Abhijeet Bhanja

Main category: cs.CL

TL;DR: CMLFormer is an enhanced dual-decoder Transformer for code-mixed language that improves performance through a multi-task pre-training strategy.

Details

Motivation: Standard language models struggle with the structural challenges of code-mixed language, so a purpose-built model is needed. Method: CMLFormer uses a shared encoder and synchronized decoder cross-attention, and adds new pre-training objectives based on switching-point and translation annotations. Result: On the HASOC-2021 benchmark, CMLFormer outperforms other approaches in F1 score, precision, and accuracy under select pre-training setups, and it can identify switching points. Conclusion: CMLFormer's architecture and multi-task pre-training strategy effectively improve the modeling of code-mixed languages. Abstract: Code-mixed languages, characterized by frequent within-sentence language transitions, present structural challenges that standard language models fail to address. In this work, we propose CMLFormer, an enhanced multi-layer dual-decoder Transformer with a shared encoder and synchronized decoder cross-attention, designed to model the linguistic and semantic dynamics of code-mixed text. CMLFormer is pre-trained on an augmented Hinglish corpus with switching point and translation annotations with multiple new objectives specifically aimed at capturing switching behavior, cross-lingual structure, and code-mixing complexity. Our experiments show that CMLFormer improves F1 score, precision, and accuracy over other approaches on the HASOC-2021 benchmark under select pre-training setups. Attention analyses further show that it can identify and attend to switching points, validating its sensitivity to code-mixed structure. These results demonstrate the effectiveness of CMLFormer's architecture and multi-task pre-training strategy for modeling code-mixed languages.

[313] PromptPrism: A Linguistically-Inspired Taxonomy for Prompts

Sullam Jeoung,Yueyan Chen,Yi Zhang,Shuai Wang,Haibo Ding,Lin Lee Cheong

Main category: cs.CL

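TL;DR: The paper proposes PromptPrism, a linguistically-inspired taxonomy for systematically analyzing the structure and components of prompts, and validates it through three applications.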

Details

Motivation: The field lacks a framework for systematic prompt analysis, even though understanding the structure and components of prompts is critical for optimizing large language model (LLM) performance. Method: The PromptPrism taxonomy analyzes prompts at three levels (functional structure, semantic component, and syntactic pattern) and is applied to prompt refinement, dataset profiling, and sensitivity experiments. Result: Experiments show that PromptPrism is effective for prompt refinement, dataset profiling, and sensitivity analysis, providing a foundation for improving and analyzing prompts. Conclusion: PromptPrism offers a practical framework for systematically analyzing and optimizing prompts, which helps improve LLM performance. Abstract: Prompts are the interface for eliciting the capabilities of large language models (LLMs). Understanding their structure and components is critical for analyzing LLM behavior and optimizing performance. However, the field lacks a comprehensive framework for systematic prompt analysis and understanding. We introduce PromptPrism, a linguistically-inspired taxonomy that enables prompt analysis across three hierarchical levels: functional structure, semantic component, and syntactic pattern. We show the practical utility of PromptPrism by applying it to three applications: (1) a taxonomy-guided prompt refinement approach that automatically improves prompt quality and enhances model performance across a range of tasks; (2) a multi-dimensional dataset profiling method that extracts and aggregates structural, semantic, and syntactic characteristics from prompt datasets, enabling comprehensive analysis of prompt distributions and patterns; (3) a controlled experimental framework for prompt sensitivity analysis by quantifying the impact of semantic reordering and delimiter modifications on LLM performance. Our experimental results validate the effectiveness of our taxonomy across these applications, demonstrating that PromptPrism provides a foundation for refining, profiling, and analyzing prompts.

[314] AD-AGENT: A Multi-agent Framework for End-to-end Anomaly Detection

Tiankai Yang,Junjun Liu,Wingchun Siu,Jiahang Wang,Zhuangzhuang Qian,Chanjuan Song,Cheng Cheng,Xiyang Hu,Yue Zhao

Main category: cs.CL

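TL;DR: AD-AGENT is an LLM-based multi-agent framework that turns natural-language instructions into executable anomaly detection pipelines and integrates several popular libraries, making anomaly detection accessible to non-expert users.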

Details

Motivation: Anomaly detection is essential in many fields, but the diversity of data and the complexity of specialized libraries pose challenges for non-expert users. Method: AD-AGENT coordinates multiple agents (intent parsing, data preparation, library selection, and others) through a shared workspace and integrates popular libraries such as PyOD. Result: Experiments show that AD-AGENT produces reliable scripts and recommends competitive models. Conclusion: AD-AGENT is open-sourced to support further research and practical applications. Abstract: Anomaly detection (AD) is essential in areas such as fraud detection, network monitoring, and scientific research. However, the diversity of data modalities and the increasing number of specialized AD libraries pose challenges for non-expert users who lack in-depth library-specific knowledge and advanced programming skills. To tackle this, we present AD-AGENT, an LLM-driven multi-agent framework that turns natural-language instructions into fully executable AD pipelines. AD-AGENT coordinates specialized agents for intent parsing, data preparation, library and model selection, documentation mining, and iterative code generation and debugging. Using a shared short-term workspace and a long-term cache, the agents integrate popular AD libraries like PyOD, PyGOD, and TSLib into a unified workflow. Experiments demonstrate that AD-AGENT produces reliable scripts and recommends competitive models across libraries. The system is open-sourced to support further research and practical applications in AD.

[315] Duluth at SemEval-2025 Task 7: TF-IDF with Optimized Vector Dimensions for Multilingual Fact-Checked Claim Retrieval

Shujauddin Syed,Ted Pedersen

Main category: cs.CL

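TL;DR: This paper describes the Duluth approach to multilingual and crosslingual fact-checked claim retrieval in SemEval-2025 Task 7; its TF-IDF retrieval system performs best with word-level tokenization and a 15,000-feature vocabulary but lags behind the top-ranked system.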

Details

Motivation: To explore how competitive a traditional TF-IDF method is for multilingual retrieval, especially when compute resources are limited. Method: A TF-IDF retrieval system is evaluated with different vector dimensions and tokenization strategies; the best configuration uses word-level tokenization and a 15,000-feature vocabulary. Result: Average success@10 is 0.78 on the development set and 0.69 on the test set; the system is stronger on higher-resource languages but still lags significantly behind the top-ranked system (0.96). Conclusion: Although neural architectures dominate multilingual retrieval, properly optimized traditional methods such as TF-IDF remain competitive baselines, especially when resources are limited. Abstract: This paper presents the Duluth approach to the SemEval-2025 Task 7 on Multilingual and Crosslingual Fact-Checked Claim Retrieval. We implemented a TF-IDF-based retrieval system with experimentation on vector dimensions and tokenization strategies. Our best-performing configuration used word-level tokenization with a vocabulary size of 15,000 features, achieving an average success@10 score of 0.78 on the development set and 0.69 on the test set across ten languages. Our system showed stronger performance on higher-resource languages but still lagged significantly behind the top-ranked system, which achieved 0.96 average success@10. Our findings suggest that though advanced neural architectures are increasingly dominant in multilingual retrieval tasks, properly optimized traditional methods like TF-IDF remain competitive baselines, especially in limited compute resource scenarios.
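A TF-IDF retriever of the kind described above, together with the success@k metric, can be sketched in plain Python. This is a minimal illustration rather than the Duluth system: the whitespace tokenizer, the smoothed IDF formula, capping the vocabulary by document frequency, and all function names are assumptions. Success@k is taken here to mean the fraction of queries with at least one relevant document in the top k results.

```python
import math
from collections import Counter


def tokenize(text):
    """Word-level tokenization: lowercase and split on whitespace."""
    return text.lower().split()


def vectorize(tokens, idf):
    """Return an L2-normalized sparse TF-IDF vector as a dict."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def build_tfidf(docs, max_features=15000):
    """Fit a vocabulary capped at max_features; return (idf, doc_vectors)."""
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vocab = [tok for tok, _ in doc_freq.most_common(max_features)]
    n = len(docs)
    idf = {tok: math.log((1 + n) / (1 + doc_freq[tok])) + 1.0 for tok in vocab}
    return idf, [vectorize(toks, idf) for toks in tokenized]


def retrieve(query, idf, doc_vectors, k=10):
    """Rank documents by cosine similarity to the query; return top-k indices."""
    q = vectorize(tokenize(query), idf)
    scored = [
        (sum(w * d.get(t, 0.0) for t, w in q.items()), i)
        for i, d in enumerate(doc_vectors)
    ]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:k]]


def success_at_k(ranked_lists, gold_sets, k=10):
    """Fraction of queries whose top-k results contain a relevant document."""
    hits = sum(
        1 for ranked, gold in zip(ranked_lists, gold_sets)
        if set(ranked[:k]) & set(gold)
    )
    return hits / len(ranked_lists)
```

On a real corpus the vectors would normally be held in a sparse matrix for speed; the dictionary representation is used here only to keep the example dependency-free.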

[316] Think Before You Attribute: Improving the Performance of LLMs Attribution Systems

João Eduardo Batista,Emil Vatai,Mohamed Wahib

Main category: cs.CL

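TL;DR: The paper proposes a sentence-level pre-attribution step to improve the trustworthiness and verifiability of large language models (LLMs) in scientific domains.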

Details

Motivation: Current LLMs generate answers without reliable source attribution and may even give incorrect attributions, which is unacceptable in scientific and high-stakes settings. Method: A sentence-level pre-attribution step classifies each sentence as not attributable, attributable to a single quote, or attributable to multiple quotes, so that a suitable attribution method can be chosen or attribution skipped. Result: The results indicate that classifiers are well suited to this task; the authors also provide a cleaned version of the HAGRID dataset and a ready-to-use end-to-end attribution system. Conclusion: The pre-attribution step lowers the computational complexity of attribution and improves the trustworthiness and usefulness of LLM outputs. Abstract: Large Language Models (LLMs) are increasingly applied in various science domains, yet their broader adoption remains constrained by a critical challenge: the lack of trustworthy, verifiable outputs. Current LLMs often generate answers without reliable source attribution, or worse, with incorrect attributions, posing a barrier to their use in scientific and high-stakes settings, where traceability and accountability are non-negotiable. To be reliable, attribution systems need high accuracy and must retrieve short spans, i.e., attribute to a sentence within a document rather than a whole document. We propose a sentence-level pre-attribution step for Retrieval-Augmented Generation (RAG) systems that classifies sentences into three categories: not attributable, attributable to a single quote, and attributable to multiple quotes. By separating sentences before attribution, a proper attribution method can be selected for the type of sentence, or the attribution can be skipped altogether. Our results indicate that classifiers are well-suited for this task. In this work, we propose a pre-attribution step to reduce the computational complexity of attribution, provide a clean version of the HAGRID dataset, and provide an end-to-end attribution system that works out of the box.

[317] R1dacted: Investigating Local Censorship in DeepSeek's R1 Language Model

Ali Naseh,Harsh Chaudhari,Jaechul Roh,Mingshi Wu,Alina Oprea,Amir Houmansadr

Main category: cs.CL

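TL;DR: DeepSeek's R1 model performs strongly on reasoning tasks but exhibits censorship-like behavior on politically sensitive topics in China. The paper analyzes its censorship patterns, triggers, and cross-lingual behavior, pointing to design choices made during training or alignment as the likely source.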

Details

Motivation: To study R1's censorship behavior on politically sensitive topics and examine the design choices behind it and their potential impact on transparency and bias. Method: Build a large-scale set of censored prompts, analyze R1's censorship patterns, triggers, and cross-lingual behavior, and study whether censorship transfers to models distilled from R1. Result: R1's censorship behavior is consistent and likely stems from design choices during training or alignment, raising concerns about transparency and governance. Conclusion: R1's censorship behavior exposes potential problems in model design and calls for stronger transparency and governance. Abstract: DeepSeek recently released R1, a high-performing large language model (LLM) optimized for reasoning tasks. Despite its efficient training pipeline, R1 achieves competitive performance, even surpassing leading reasoning models like OpenAI's o1 on several benchmarks. However, emerging reports suggest that R1 refuses to answer certain prompts related to politically sensitive topics in China. While existing LLMs often implement safeguards to avoid generating harmful or offensive outputs, R1 represents a notable shift - exhibiting censorship-like behavior on politically charged queries. In this paper, we investigate this phenomenon by first introducing a large-scale set of heavily curated prompts that get censored by R1, covering a range of politically sensitive topics, but are not censored by other models. We then conduct a comprehensive analysis of R1's censorship patterns, examining their consistency, triggers, and variations across topics, prompt phrasing, and context. Beyond English-language queries, we explore censorship behavior in other languages. We also investigate the transferability of censorship to models distilled from the R1 language model. Finally, we propose techniques for bypassing or removing this censorship. Our findings reveal possible additional censorship integration likely shaped by design choices during training or alignment, raising concerns about transparency, bias, and governance in language model deployment.

[318] Revealing the Deceptiveness of Knowledge Editing: A Mechanistic Analysis of Superficial Editing

Jiakuan Xie,Pengfei Cao,Yubo Chen,Kang Liu,Jun Zhao

Main category: cs.CL

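TL;DR: The paper introduces the concept of "superficial editing", in which a language model remains prone to generating the original knowledge after knowledge editing, and traces the phenomenon to specific attention modules and the residual stream.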

Details

Motivation: Existing knowledge editing algorithms score well on conventional metrics, yet edited models still tend to generate the original knowledge; the cause needs closer investigation. Method: Systematically evaluate how the residual stream and specific attention modules contribute to superficial editing, and validate the causal relationship. Result: Specific attention heads and their left singular vectors are linked to superficial editing, and the same patterns appear in the superficial unlearning task. Conclusion: The study exposes the superficial editing problem in knowledge editing, points to directions for improving algorithms, and uses a broadly applicable methodology. Abstract: Knowledge editing, which aims to update the knowledge encoded in language models, can be deceptive. Despite the fact that many existing knowledge editing algorithms achieve near-perfect performance on conventional metrics, the models edited by them are still prone to generating original knowledge. This paper introduces the concept of "superficial editing" to describe this phenomenon. Our comprehensive evaluation reveals that this issue presents a significant challenge to existing algorithms. Through systematic investigation, we identify and validate two key factors contributing to this issue: (1) the residual stream at the last subject position in earlier layers and (2) specific attention modules in later layers. Notably, certain attention heads in later layers, along with specific left singular vectors in their output matrices, encapsulate the original knowledge and exhibit a causal relationship with superficial editing. Furthermore, we extend our analysis to the task of superficial unlearning, where we observe consistent patterns in the behavior of specific attention heads and their corresponding left singular vectors, thereby demonstrating the robustness and broader applicability of our methodology and conclusions. Our code is available here.

[319] Predicting Turn-Taking and Backchannel in Human-Machine Conversations Using Linguistic, Acoustic, and Visual Signals

Yuxin Lin,Yinglin Zheng,Ming Zeng,Wangzheng Shi

Main category: cs.CL

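TL;DR: This paper predicts turn-taking and backchannel actions in human-machine conversations from multi-modal signals (linguistic, acoustic, and visual), and contributes a large-scale dataset (MM-F2F) and an end-to-end framework that substantially improve prediction performance.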

Details

Motivation: To overcome the limitations of existing datasets and fill the gap in predicting turn-taking and backchannel actions in human-machine conversations. Method: Build the MM-F2F dataset with an automatic data collection pipeline, and propose an end-to-end framework that predicts turn-taking and backchannel actions from multi-modal signals. Result: F1-score improves by 10% on turn-taking and by 33% on backchannel prediction, reaching state-of-the-art performance. Conclusion: The dataset and framework facilitate follow-up research and demonstrate the potential of multi-modal signals for conversation prediction. Abstract: This paper addresses the gap in predicting turn-taking and backchannel actions in human-machine conversations using multi-modal signals (linguistic, acoustic, and visual). To overcome the limitation of existing datasets, we propose an automatic data collection pipeline that allows us to collect and annotate over 210 hours of human conversation videos. From this, we construct a Multi-Modal Face-to-Face (MM-F2F) human conversation dataset, including over 1.5M words and corresponding turn-taking and backchannel annotations from approximately 20M frames. Additionally, we present an end-to-end framework that predicts the probability of turn-taking and backchannel actions from multi-modal signals. The proposed model emphasizes the interrelation between modalities and supports any combination of text, audio, and video inputs, making it adaptable to a variety of realistic scenarios. Our experiments show that our approach achieves state-of-the-art performance on turn-taking and backchannel prediction tasks, achieving a 10% increase in F1-score on turn-taking and a 33% increase on backchannel prediction. Our dataset and code are publicly available online to ease subsequent research.

[320] Know3-RAG: A Knowledge-aware RAG Framework with Adaptive Retrieval, Generation, and Filtering

Xukai Liu,Ye Liu,Shiwen Wu,Yanghai Zhang,Yihao Yuan,Kai Zhang,Qi Liu

Main category: cs.CL

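TL;DR: The Know3-RAG framework uses structured knowledge from knowledge graphs to improve the three core stages of retrieval-augmented generation (RAG), significantly reducing hallucinated content and improving answer reliability.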

Details

Motivation: Existing RAG systems suffer from unreliable adaptive control and from hallucinations, so more effective ways to improve factual reliability are needed. Method: Know3-RAG uses knowledge graph embeddings to guide the retrieval, generation, and filtering stages, through knowledge-aware adaptive retrieval, knowledge-enhanced reference generation, and knowledge-driven reference filtering. Result: On multiple open-domain QA benchmarks, Know3-RAG outperforms baseline methods, significantly reducing hallucinations and improving answer reliability. Conclusion: By integrating structured knowledge, Know3-RAG addresses the limitations of RAG systems and improves the accuracy of generated content. Abstract: Recent advances in large language models (LLMs) have led to impressive progress in natural language generation, yet their tendency to produce hallucinated or unsubstantiated content remains a critical concern. To improve factual reliability, Retrieval-Augmented Generation (RAG) integrates external knowledge during inference. However, existing RAG systems face two major limitations: (1) unreliable adaptive control due to limited external knowledge supervision, and (2) hallucinations caused by inaccurate or irrelevant references. To address these issues, we propose Know3-RAG, a knowledge-aware RAG framework that leverages structured knowledge from knowledge graphs (KGs) to guide three core stages of the RAG process, including retrieval, generation, and filtering. Specifically, we introduce a knowledge-aware adaptive retrieval module that employs KG embedding to assess the confidence of the generated answer and determine retrieval necessity, a knowledge-enhanced reference generation strategy that enriches queries with KG-derived entities to improve generated reference relevance, and a knowledge-driven reference filtering mechanism that ensures semantic alignment and factual accuracy of references. Experiments on multiple open-domain QA benchmarks demonstrate that Know3-RAG consistently outperforms strong baselines, significantly reducing hallucinations and enhancing answer reliability.

[321] Shadow-FT: Tuning Instruct via Base

Taiqiang Wu,Runming Yang,Jiayi Li,Pengfei Hu,Ngai Wong,Yujiu Yang

Main category: cs.CL

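TL;DR: The paper proposes the Shadow-FT framework, which tunes INSTRUCT models by way of their BASE models, significantly improving performance without additional parameters.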

Details

Motivation: Directly fine-tuning INSTRUCT models yields limited gains or even degradation, while the paired BASE models have highly similar weights, which suggests tuning the INSTRUCT model via the BASE model. Method: Fine-tune the BASE model and graft the learned weight updates directly onto the INSTRUCT model; this is the Shadow-FT framework. Result: Across 19 benchmarks, Shadow-FT outperforms conventional full-parameter and parameter-efficient tuning, and extends to MLLMs and to combination with DPO. Conclusion: Shadow-FT is an efficient, easy-to-implement tuning method that significantly improves INSTRUCT model performance and applies broadly. Abstract: Large language models (LLMs) consistently benefit from further fine-tuning on various tasks. However, we observe that directly tuning the INSTRUCT (i.e., instruction tuned) models often leads to marginal improvements and even performance degeneration. Notably, paired BASE models, the foundation for these INSTRUCT variants, contain highly similar weight values (i.e., differing by less than 2% on average for Llama 3.1 8B). Therefore, we propose a novel Shadow-FT framework to tune the INSTRUCT models by leveraging the corresponding BASE models. The key insight is to fine-tune the BASE model, and then directly graft the learned weight updates to the INSTRUCT model. Our proposed Shadow-FT introduces no additional parameters, is easy to implement, and significantly improves performance. We conduct extensive experiments on tuning mainstream LLMs, such as Qwen 3 and Llama 3 series, and evaluate them across 19 benchmarks covering coding, reasoning, and mathematical tasks. Experimental results demonstrate that Shadow-FT consistently outperforms conventional full-parameter and parameter-efficient tuning approaches. Further analyses indicate that Shadow-FT can be applied to multimodal large language models (MLLMs) and combined with direct preference optimization (DPO). Codes and weights are available on GitHub at https://github.com/wutaiqiang/Shadow-FT.
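
The grafting step described in the abstract amounts to adding the BASE model's fine-tuning delta to the INSTRUCT weights. A minimal sketch follows, assuming weights are stored as flat lists keyed by parameter name; the `graft` helper, the parameter name, and the numbers are hypothetical and not taken from the released code.

```python
def graft(base, tuned_base, instruct):
    """Return instruct + (tuned_base - base), computed per parameter.

    All three arguments map parameter names to equally shaped flat lists.
    """
    grafted = {}
    for name, w_inst in instruct.items():
        # Weight update learned by fine-tuning the BASE model
        delta = [t - b for t, b in zip(tuned_base[name], base[name])]
        # Apply the same update to the INSTRUCT weights
        grafted[name] = [w + d for w, d in zip(w_inst, delta)]
    return grafted


# Toy example with a single three-element parameter (made-up values).
base = {"layer.weight": [1.0, 2.0, 3.0]}
tuned_base = {"layer.weight": [1.5, 2.0, 2.0]}
instruct = {"layer.weight": [1.25, 2.25, 3.25]}

new_instruct = graft(base, tuned_base, instruct)
```

Because the update is a plain addition, the grafted model has exactly as many parameters as the original INSTRUCT model, consistent with the claim that no additional parameters are introduced.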

[322] ToTRL: Unlock LLM Tree-of-Thoughts Reasoning Potential through Puzzles Solving

Haoyuan Wu,Xueyi Chen,Rui Ming,Jilong Gao,Shoubo Hu,Zhuolun He,Bei Yu

Main category: cs.CL

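TL;DR: The paper proposes ToTRL, a reinforcement learning framework based on tree-of-thoughts (ToT), to improve the reasoning ability of large language models (LLMs), reduce redundant output, and increase efficiency.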

Details

Motivation: Long chain-of-thought (CoT) reasoning is verbose and unsystematic, whereas tree-of-thoughts (ToT) can evaluate multiple reasoning paths in parallel and improve efficiency. Method: The ToTRL framework combines a rule-based reward with the LLM's CoT capability and trains the model to build and explore thought trees by solving puzzle games. Result: The ToTQwen3-8B model shows significant gains in performance and efficiency on complex reasoning tasks. Conclusion: ToTRL effectively improves LLM reasoning and offers a more efficient approach to complex tasks. Abstract: Large language models (LLMs) demonstrate significant reasoning capabilities, particularly through long chain-of-thought (CoT) processes, which can be elicited by reinforcement learning (RL). However, prolonged CoT reasoning presents limitations, primarily verbose outputs due to excessive introspection. The reasoning process in these LLMs often appears to follow a trial-and-error methodology rather than a systematic, logical deduction. In contrast, tree-of-thoughts (ToT) offers a conceptually more advanced approach by modeling reasoning as an exploration within a tree structure. This reasoning structure facilitates the parallel generation and evaluation of multiple reasoning branches, allowing for the active identification, assessment, and pruning of unproductive paths. This process can potentially lead to improved performance and reduced token costs. Building upon the long CoT capability of LLMs, we introduce tree-of-thoughts RL (ToTRL), a novel on-policy RL framework with a rule-based reward. ToTRL is designed to guide LLMs in developing the parallel ToT strategy based on the sequential CoT strategy. Furthermore, we employ LLMs as players in a puzzle game during the ToTRL training process. Solving puzzle games inherently necessitates exploring interdependent choices and managing multiple constraints, which requires the construction and exploration of a thought tree, providing challenging tasks for cultivating the ToT reasoning capability. Our empirical evaluations demonstrate that our ToTQwen3-8B model, trained with our ToTRL, achieves significant improvement in performance and reasoning efficiency on complex reasoning tasks.

[323] Automated Bias Assessment in AI-Generated Educational Content Using CEAT Framework

Jingyang Peng,Wenyuan Shen,Jiarui Rao,Jionghao Lin

Main category: cs.CL

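TL;DR: This paper proposes an automated method for assessing bias in educational content produced by generative AI (GenAI), combining the Contextualized Embedding Association Test with prompt-engineered word extraction, and validates its reliability and consistency.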

Details

Motivation: Generative AI is increasingly used to create educational content, but potential gender, racial, or national biases raise ethical and educational concerns, and systematic bias detection methods are lacking. Method: An automated bias assessment approach that integrates the Contextualized Embedding Association Test with a prompt-engineered word extraction method, applied within a Retrieval-Augmented Generation framework. Result: The automated word sets align closely with manually curated ones (Pearson correlation coefficient r = 0.993), demonstrating reliability and consistency. Conclusion: The method reduces human subjectivity and improves fairness, scalability, and reproducibility when auditing GenAI-produced educational content. Abstract: Recent advances in Generative Artificial Intelligence (GenAI) have transformed educational content creation, particularly in developing tutor training materials. However, biases embedded in AI-generated content--such as gender, racial, or national stereotypes--raise significant ethical and educational concerns. Despite the growing use of GenAI, systematic methods for detecting and evaluating such biases in educational materials remain limited. This study proposes an automated bias assessment approach that integrates the Contextualized Embedding Association Test with a prompt-engineered word extraction method within a Retrieval-Augmented Generation framework. We applied this method to AI-generated texts used in tutor training lessons. Results show a high alignment between the automated and manually curated word sets, with a Pearson correlation coefficient of r = 0.993, indicating reliable and consistent bias assessment. Our method reduces human subjectivity and enhances fairness, scalability, and reproducibility in auditing GenAI-produced educational content.
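
For readers unfamiliar with the agreement statistic quoted above, the following sketch computes a Pearson correlation coefficient between two score lists. The numbers are invented placeholders standing in for bias scores obtained from automated and manually curated word sets; they are not data from the study, and `pearson_r` is an illustrative helper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical scores per bias test (placeholders, not study data).
automated = [0.10, 0.35, 0.52, 0.80]
manual = [0.12, 0.30, 0.55, 0.78]

r = pearson_r(automated, manual)
```

A value of r close to 1 means the two procedures rank and scale the bias tests almost identically, which is the sense in which the abstract reports high alignment.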

[324] On-Policy Optimization with Group Equivalent Preference for Multi-Programming Language Understanding

Haoyuan Wu,Rui Ming,Jilong Gao,Hangyu Zhao,Xueyi Chen,Yikai Yang,Haisheng Zheng,Zhuolun He,Bei Yu

Main category: cs.CL

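TL;DR: Using a code translation task and a new reinforcement learning framework, OORL, the paper improves the code generation ability of large language models across multiple programming languages.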

Details

Motivation: To address the performance disparity of large language models across programming languages. Method: Combine the code translation task with the OORL framework (which integrates on-policy and off-policy reinforcement learning) and introduce GEPO to optimize model preferences. Result: Significant performance improvements on code benchmarks across multiple programming languages. Conclusion: The OORL framework and GEPO effectively improve the model's understanding of code functionality and its cross-language ability. Abstract: Large language models (LLMs) achieve remarkable performance in code generation tasks. However, a significant performance disparity persists between popular programming languages (e.g., Python, C++) and others. To address this capability gap, we leverage the code translation task to train LLMs, thereby facilitating the transfer of coding proficiency across diverse programming languages. Moreover, we introduce OORL for training, a novel reinforcement learning (RL) framework that integrates on-policy and off-policy strategies. Within OORL, on-policy RL is applied during code translation, guided by a rule-based reward signal derived from unit tests. Complementing this coarse-grained rule-based reward, we propose Group Equivalent Preference Optimization (GEPO), a novel preference optimization method. Specifically, GEPO trains the LLM using groups of intermediate representations (IRs). LLMs can be guided to discern IRs equivalent to the source code from inequivalent ones, while also utilizing signals about the mutual equivalence between IRs within the group. This process allows LLMs to capture nuanced aspects of code functionality. By employing OORL for training with code translation tasks, LLMs improve their recognition of code functionality and their understanding of the relationships between code implemented in different languages. Extensive experiments demonstrate that training LLMs with OORL on code translation tasks achieves significant performance improvements on code benchmarks across multiple programming languages.

[325] What is Stigma Attributed to? A Theory-Grounded, Expert-Annotated Interview Corpus for Demystifying Mental-Health Stigma

Han Meng,Yancan Chen,Yunan Li,Yitian Yang,Jungup Lee,Renwen Zhang,Yi-Chieh Lee

Main category: cs.CL

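TL;DR: The paper presents an expert-annotated, theory-informed dataset for training neural models to classify mental-health stigma at a fine-grained level, and benchmarks existing models on it.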

Details

Motivation: Mental-health stigma is a pervasive social problem, but resources for training neural models are limited and lack theoretical grounding. Method: Construct an expert-annotated dataset of 4,141 snippets from 684 participants and benchmark existing neural models on it. Result: Experiments reveal the challenges of stigma detection and demonstrate the dataset's potential. Conclusion: The dataset can advance research on computationally detecting, neutralizing, and counteracting mental-health stigma. Abstract: Mental-health stigma remains a pervasive social problem that hampers treatment-seeking and recovery. Existing resources for training neural models to finely classify such stigma are limited, relying primarily on social-media or synthetic data without theoretical underpinnings. To remedy this gap, we present an expert-annotated, theory-informed corpus of human-chatbot interviews, comprising 4,141 snippets from 684 participants with documented socio-cultural backgrounds. Our experiments benchmark state-of-the-art neural models and empirically unpack the challenges of stigma detection. This dataset can facilitate research on computationally detecting, neutralizing, and counteracting mental-health stigma.

[326] ReEx-SQL: Reasoning with Execution-Aware Reinforcement Learning for Text-to-SQL

Yaxun Dai,Wenxuan Xie,Xialie Zhuang,Tianyu Yang,Yiying Yang,Haiqin Yang,Yuhang Zhao,Pingfu Chao,Wenhao Jiang

Main category: cs.CL

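TL;DR: ReEx-SQL is a Text-to-SQL framework based on execution-aware reinforcement learning that dynamically integrates execution feedback to improve the accuracy and robustness of SQL generation.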

Details

Motivation: Existing methods use execution feedback only as a post-hoc correction signal rather than integrating it into generation, which limits the model's ability to handle reasoning errors. Method: ReEx-SQL introduces an execution-aware reasoning paradigm that integrates execution feedback into generation through structured prompts and a stepwise rollout strategy, and supervises learning with a composite reward function. Result: It reaches 88.8% on Spider and 64.9% on BIRD, and reduces inference time by 51.9%. Conclusion: By dynamically integrating execution feedback and using tree-based decoding, ReEx-SQL markedly improves Text-to-SQL performance and efficiency. Abstract: In Text-to-SQL, execution feedback is essential for guiding large language models (LLMs) to reason accurately and generate reliable SQL queries. However, existing methods treat execution feedback solely as a post-hoc signal for correction or selection, failing to integrate it into the generation process. This limitation hinders their ability to address reasoning errors as they occur, ultimately reducing query accuracy and robustness. To address this issue, we propose ReEx-SQL (Reasoning with Execution-Aware Reinforcement Learning), a framework for Text-to-SQL that enables models to interact with the database during decoding and dynamically adjust their reasoning based on execution feedback. ReEx-SQL introduces an execution-aware reasoning paradigm that interleaves intermediate SQL execution into reasoning paths, facilitating context-sensitive revisions. It achieves this through structured prompts with markup tags and a stepwise rollout strategy that integrates execution feedback into each stage of generation. To supervise policy learning, we develop a composite reward function that includes an exploration reward, explicitly encouraging effective database interaction. Additionally, ReEx-SQL adopts a tree-based decoding strategy to support exploratory reasoning, enabling dynamic expansion of alternative reasoning paths. Notably, ReEx-SQL achieves 88.8% on Spider and 64.9% on BIRD at the 7B scale, surpassing the standard reasoning baseline by 2.7% and 2.6%, respectively. It also shows robustness, achieving 85.2% on Spider-Realistic with leading performance. In addition, its tree-structured decoding improves efficiency and performance over linear decoding, reducing inference time by 51.9% on the BIRD development set.
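
To make the idea of execution feedback concrete, here is a small sketch of the kind of signal a model could receive after running an intermediate query. It uses Python's built-in `sqlite3` module and a made-up `singer` table; the `execute_with_feedback` helper and its return format are assumptions for illustration, not the ReEx-SQL implementation, and no reinforcement learning or decoding logic is shown.

```python
import sqlite3


def execute_with_feedback(conn, sql, max_rows=3):
    """Run an intermediate query and return a compact feedback record.

    On success the first few rows are returned; on failure the database
    error message is returned so a model could revise its reasoning.
    """
    try:
        rows = conn.execute(sql).fetchmany(max_rows)
        return {"ok": True, "feedback": rows}
    except sqlite3.Error as exc:
        return {"ok": False, "feedback": str(exc)}


# Toy in-memory database (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ann", 30), ("Bo", 25)])

# A draft with a misspelled column produces an error message as feedback.
bad = execute_with_feedback(conn, "SELECT nam FROM singer")

# The revised query executes and returns a preview of its rows.
good = execute_with_feedback(conn, "SELECT name FROM singer WHERE age > 26")
```

In an execution-aware setup, a record like `bad` would be fed back into the reasoning path so the next step can correct the column name before the final query is emitted.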

[327] A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone

Jitai Hao,Qiang Huang,Hao Liu,Xinyan Xiao,Zhaochun Ren,Jun Yu

Main category: cs.CL

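TL;DR: LRC is an efficient pre-training method that uses low-rank projection matrices for soft pruning and activation alignment, markedly improving the training efficiency of small language models.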

Details

Motivation: To address information loss, inefficient representation alignment, and underuse of FFN activations in existing small language model training. Method: Use low-rank projection matrices for soft pruning and activation alignment, with no explicit alignment module. Result: Using only 20B tokens, LRC matches or surpasses existing models, achieving over 1,000x training efficiency. Conclusion: LRC offers a new and clearly advantageous way to train high-performing small language models efficiently. Abstract: Training high-performing Small Language Models (SLMs) remains costly, even with knowledge distillation and pruning from larger teacher models. Existing work often faces three key challenges: (1) information loss from hard pruning, (2) inefficient alignment of representations, and (3) underutilization of informative activations, particularly from Feed-Forward Networks (FFNs). To address these challenges, we introduce Low-Rank Clone (LRC), an efficient pre-training method that constructs SLMs aspiring to behavioral equivalence with strong teacher models. LRC trains a set of low-rank projection matrices that jointly enable soft pruning by compressing teacher weights, and activation clone by aligning student activations, including FFN signals, with those of the teacher. This unified design maximizes knowledge transfer while removing the need for explicit alignment modules. Extensive experiments with open-source teachers (e.g., Llama-3.2-3B-Instruct, Qwen2.5-3B/7B-Instruct) show that LRC matches or surpasses state-of-the-art models trained on trillions of tokens--while using only 20B tokens, achieving over 1,000x training efficiency. Our codes and model checkpoints are available at https://github.com/CURRENTF/LowRankClone and https://huggingface.co/collections/JitaiHao/low-rank-clone-lrc-6828389e96a93f1d4219dfaf.

[328] EAVIT: Efficient and Accurate Human Value Identification from Text data via LLMs

Wenhao Zhu,Yuhang Xie,Guojie Song,Xin Zhang

Main category: cs.CL

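TL;DR: The paper proposes EAVIT, a framework that combines locally fine-tunable and online black-box LLMs to identify human values efficiently and accurately. A small local model produces initial estimates, which are used to build concise prompts for online LLMs, sharply reducing input tokens while improving performance.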

Details

Motivation: Traditional NLP models such as BERT underperform emerging LLMs such as GPTs on human value identification, but online LLMs degrade on long contexts and are computationally expensive. Method: The EAVIT framework combines a small local model with online LLMs and uses explanation-based training and data generation techniques to keep input prompts concise. Result: EAVIT reduces the number of input tokens by up to 1/6 and outperforms traditional NLP methods and other LLM-based strategies. Conclusion: EAVIT lowers computational cost while markedly improving the accuracy and efficiency of human value identification. Abstract: The rapid evolution of large language models (LLMs) has revolutionized various fields, including the identification and discovery of human values within text data. While traditional NLP models, such as BERT, have been employed for this task, their ability to represent textual data is significantly outperformed by emerging LLMs like GPTs. However, the performance of online LLMs often degrades when handling long contexts required for value identification, which also incurs substantial computational costs. To address these challenges, we propose EAVIT, an efficient and accurate framework for human value identification that combines the strengths of both locally fine-tunable and online black-box LLMs. Our framework employs a value detector - a small, local language model - to generate initial value estimations. These estimations are then used to construct concise input prompts for online LLMs, enabling accurate final value identification. To train the value detector, we introduce explanation-based training and data generation techniques specifically tailored for value identification, alongside sampling strategies to optimize the brevity of LLM input prompts. Our approach effectively reduces the number of input tokens by up to 1/6 compared to directly querying online LLMs, while consistently outperforming traditional NLP methods and other LLM-based strategies.

[329] Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models

Yanbin Yin,Kun Zhou,Zhen Wang,Xiangdong Zhang,Yifei Shao,Shibo Hao,Yi Gu,Jieyuan Liu,Somanshu Singla,Tianyang Liu,Eric P. Xing,Zhengzhong Liu,Haojian Jin,Zhiting Hu

Main category: cs.CL

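TL;DR: The paper proposes Decentralized Arena (dearena), an automated framework in which all LLMs evaluate one another using their collective intelligence, addressing the limitations of existing benchmarks.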

Details

Motivation: The rapid growth of large language models (LLMs) makes reliable, scalable benchmarking urgent. Existing approaches suffer from saturation of closed-ended questions, costly human evaluation, or bias from relying on a single judge model. Method: dearena reduces single-model bias through democratic pairwise evaluation, and scales efficiently with a coarse-to-fine ranking algorithm and an automatic question selection strategy. Result: In experiments on 66 LLMs, dearena reaches up to 97% correlation with human judgements while significantly reducing cost. Conclusion: dearena provides an efficient, low-cost, and less biased way to evaluate LLMs. Abstract: The recent explosion of large language models (LLMs), each with its own general or specialized strengths, makes scalable, reliable benchmarking more urgent than ever. Standard practices nowadays face fundamental trade-offs: closed-ended question-based benchmarks (e.g., MMLU) struggle with saturation as newer models emerge, while crowd-sourced leaderboards (e.g., Chatbot Arena) rely on costly and slow human judges. Recently, automated methods (e.g., LLM-as-a-judge) shed light on the scalability, but risk bias by relying on one or a few "authority" models. To tackle these issues, we propose Decentralized Arena (dearena), a fully automated framework leveraging collective intelligence from all LLMs to evaluate each other. It mitigates single-model judge bias by democratic, pairwise evaluation, and remains efficient at scale through two key components: (1) a coarse-to-fine ranking algorithm for fast incremental insertion of new models with sub-quadratic complexity, and (2) an automatic question selection strategy for the construction of new evaluation dimensions. In extensive experiments across 66 LLMs, dearena attains up to 97% correlation with human judgements, while significantly reducing the cost. Our code and data will be publicly released on https://github.com/maitrix-org/de-arena.
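
One way to see why incremental insertion can avoid comparing a new model against every existing one is a binary-search insertion driven by pairwise judgements. The sketch below shows only that generic idea: the abstract does not spell out the coarse-to-fine algorithm, so `insert_model`, the `beats` judge, and the skill numbers are hypothetical stand-ins.

```python
def insert_model(ranking, new_model, beats):
    """Insert new_model into ranking (best first) using binary search.

    beats(a, b) returns True when a is judged stronger than b; in a
    decentralized setting this could aggregate votes from many judge
    models. Returns the number of pairwise comparisons used.
    """
    lo, hi = 0, len(ranking)
    comparisons = 0
    while lo < hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if beats(new_model, ranking[mid]):
            hi = mid          # new model belongs above position mid
        else:
            lo = mid + 1      # new model belongs below position mid
    ranking.insert(lo, new_model)
    return comparisons


# Hypothetical latent skill values standing in for aggregated judge votes.
skill = {"A": 0.9, "B": 0.7, "C": 0.4, "D": 0.2, "new": 0.5}


def beats(a, b):
    return skill[a] > skill[b]


ranking = ["A", "B", "C", "D"]
n_comparisons = insert_model(ranking, "new", beats)
```

Each insertion needs about log2(n) comparisons instead of n, so ranking n models this way stays well below the quadratic cost of exhaustive pairwise evaluation.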

[330] PsyMem: Fine-grained psychological alignment and Explicit Memory Control for Advanced Role-Playing LLMs

Xilong Cheng,Yunxiao Qin,Yuting Tan,Zhengnan Li,Ye Wang,Hongjiang Xiao,Yuan Zhang

Main category: cs.CL

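TL;DR: PsyMem is a role-playing framework that combines fine-grained psychological attributes with explicit memory control, addressing the shortcomings of existing methods in character modeling and memory consistency.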

Details

Motivation: Existing LLM-based role-playing methods rely on shallow textual descriptions or simplistic metrics, fail to model a character's intrinsic and extrinsic dimensions adequately, and simulate memory inconsistently, which hurts reliability. Method: PsyMem supplements character descriptions with 26 psychological indicators and uses explicit memory alignment training to control character responses dynamically. It is trained from Qwen2.5-7B-Instruct on a custom dataset. Result: PsyMem-Qwen outperforms baseline models in role-playing and achieves the best performance in human-likeness and character fidelity. Conclusion: Through psychological attributes and explicit memory control, PsyMem improves the reliability and consistency of role-playing and suits applications such as social simulation. Abstract: Existing LLM-based role-playing methods often rely on superficial textual descriptions or simplistic metrics, inadequately modeling both intrinsic and extrinsic character dimensions. Additionally, they typically simulate character memory with implicit model knowledge or basic retrieval-augmented generation without explicit memory alignment, compromising memory consistency. The two issues weaken the reliability of role-playing LLMs in several applications, such as trustworthy social simulation. To address these limitations, we propose PsyMem, a novel framework integrating fine-grained psychological attributes and explicit memory control for role-playing. PsyMem supplements textual descriptions with 26 psychological indicators to model characters in detail. Additionally, PsyMem implements memory alignment training, which explicitly trains the model to align a character's response with memory, thereby enabling dynamic memory-controlled responding during inference. By training Qwen2.5-7B-Instruct on our specially designed dataset (including 5,414 characters and 38,962 dialogues extracted from novels), the resulting model, termed PsyMem-Qwen, outperforms baseline models in role-playing, achieving the best performance in human-likeness and character fidelity.

[331] SynDec: A Synthesize-then-Decode Approach for Arbitrary Textual Style Transfer via Large Language Models

Han Sun,Zhen Sun,Zongmin Zhang,Linzhao Jia,Wei Shao,Min Zhang

Main category: cs.CL

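TL;DR: SynDec automatically synthesizes high-quality prompts and amplifies their effect during decoding, addressing LLMs' reliance on manual prompts and their inherent stylistic biases in arbitrary style transfer.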

Details

Motivation: In textual style transfer, LLMs depend on manually constructed prompts and carry inherent stylistic biases, so a more efficient solution is needed. Method: SynDec synthesizes prompts by selecting representative samples, performing a four-dimensional style analysis, and reranking candidates, then amplifies their effect at decoding time by contrasting output probabilities. Result: It outperforms existing methods on five of six benchmarks, for example with up to a 9% accuracy gain on modern-to-Elizabethan English transfer. Conclusion: By automatically synthesizing and amplifying prompts, SynDec markedly improves LLM performance on style transfer. Abstract: Large Language Models (LLMs) are emerging as dominant forces for textual style transfer. However, for arbitrary style transfer, LLMs face two key challenges: (1) considerable reliance on manually-constructed prompts and (2) rigid stylistic biases inherent in LLMs. In this paper, we propose a novel Synthesize-then-Decode (SynDec) approach, which automatically synthesizes high-quality prompts and amplifies their roles during the decoding process. Specifically, our approach synthesizes prompts by selecting representative few-shot samples, conducting a four-dimensional style analysis, and reranking the candidates. At the LLM decoding stage, the TST effect is amplified by maximizing the contrast in output probabilities between scenarios with and without the synthesized prompt, as well as between prompts and negative samples. We conduct extensive experiments and the results show that SynDec outperforms existing state-of-the-art LLM-based methods on five out of six benchmarks (e.g., achieving up to a 9% increase in accuracy for modern-to-Elizabethan English transfer). Detailed ablation studies further validate the effectiveness of SynDec.
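
The decoding-time contrast mentioned in the abstract can be pictured with a generic contrastive scoring rule: tokens whose probability rises when the synthesized prompt is present get boosted. The exact objective used by SynDec is not given here, so the formula, the `alpha` weight, and the toy distributions below are assumptions for illustration only.

```python
import math


def contrastive_scores(logp_with, logp_without, alpha=1.0):
    """Score each candidate token by its log-probability with the prompt,
    plus alpha times the gain over its log-probability without the prompt."""
    return {
        tok: lp + alpha * (lp - logp_without[tok])
        for tok, lp in logp_with.items()
    }


# Toy next-token distributions (made up): with the style prompt the two
# candidates are tied, but without it the model strongly prefers "you".
logp_with = {"thou": math.log(0.5), "you": math.log(0.5)}
logp_without = {"thou": math.log(0.1), "you": math.log(0.9)}

scores = contrastive_scores(logp_with, logp_without, alpha=1.0)
best = max(scores, key=scores.get)
```

In this toy case plain greedy decoding would be indifferent between the two tokens, while the contrast favors the one that the prompt made more likely.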

[332] Contrastive Prompting Enhances Sentence Embeddings in LLMs through Inference-Time Steering

Zifeng Cheng,Zhonghui Wang,Yuchen Fu,Zhiwei Jiang,Yafeng Yin,Cong Wang,Qing Gu

Main category: cs.CL

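TL;DR: The paper proposes Contrastive Prompting (CP), which introduces an auxiliary prompt to improve sentence embeddings and strengthen semantic encoding.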

Details

Motivation: Existing methods use prompt engineering to guide LLMs to encode a sentence's core semantics, but the last token still carries too much non-essential information (such as stop words), limiting its encoding capacity. Method: Contrastive Prompting (CP) contrasts the existing prompt with an auxiliary prompt to steer it toward core semantics rather than non-essential information; it is a plug-and-play inference-time intervention. Result: On Semantic Textual Similarity (STS) tasks and downstream classification tasks, CP improves the performance of existing prompt-based methods. Conclusion: CP is an effective, general method that improves prompt-based methods across different LLMs. Abstract: Extracting sentence embeddings from large language models (LLMs) is a practical direction, as it requires neither additional data nor fine-tuning. Previous studies usually focus on prompt engineering to guide LLMs to encode the core semantic information of the sentence into the embedding of the last token. However, the last token in these methods still encodes an excess of non-essential information, such as stop words, limiting its encoding capacity. To this end, we propose a Contrastive Prompting (CP) method that introduces an extra auxiliary prompt to elicit better sentence embedding. By contrasting with the auxiliary prompt, CP can steer existing prompts to encode the core semantics of the sentence, rather than non-essential information. CP is a plug-and-play inference-time intervention method that can be combined with various prompt-based methods. Extensive experiments on Semantic Textual Similarity (STS) tasks and downstream classification tasks demonstrate that our method can improve the performance of existing prompt-based methods across different LLMs. Our code will be released at https://github.com/zifengcheng/CP.

Hengxing Cai,Jinhan Dong,Jingjun Tan,Jingcheng Deng,Sihang Li,Zhifeng Gao,Haidong Wang,Zicheng Su,Agachai Sumalee,Renxin Zhong

Main category: cs.CL

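TL;DR: FlightGPT is a new UAV vision-and-language navigation framework built on vision-language models. With two-stage training and a chain-of-thought reasoning mechanism, it addresses the insufficient multimodal fusion, weak generalization, and poor interpretability of existing methods.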

Details

Motivation: UAV vision-and-language navigation is vital for applications such as disaster response, logistics delivery, and urban inspection, but existing methods perform poorly on multimodal fusion, generalization, and interpretability. Method: A two-stage training pipeline: supervised fine-tuning (SFT) on high-quality demonstrations for initialization, followed by the Group Relative Policy Optimization (GRPO) algorithm to strengthen generalization, plus a chain-of-thought (CoT) reasoning mechanism for more interpretable decisions. Result: On the city-scale CityNav dataset, FlightGPT achieves state-of-the-art performance in all scenarios, with a 9.22% higher success rate than the strongest baseline in unseen environments. Conclusion: Through its training method and reasoning mechanism, FlightGPT markedly improves the performance and interpretability of UAV vision-and-language navigation and has broad application potential. Abstract: Unmanned Aerial Vehicle (UAV) Vision-and-Language Navigation (VLN) is vital for applications such as disaster response, logistics delivery, and urban inspection. However, existing methods often struggle with insufficient multimodal fusion, weak generalization, and poor interpretability. To address these challenges, we propose FlightGPT, a novel UAV VLN framework built upon Vision-Language Models (VLMs) with powerful multimodal perception capabilities. We design a two-stage training pipeline: first, Supervised Fine-Tuning (SFT) using high-quality demonstrations to improve initialization and structured reasoning; then, the Group Relative Policy Optimization (GRPO) algorithm, guided by a composite reward that considers goal accuracy, reasoning quality, and format compliance, to enhance generalization and adaptability. Furthermore, FlightGPT introduces a Chain-of-Thought (CoT)-based reasoning mechanism to improve decision interpretability. Extensive experiments on the city-scale dataset CityNav demonstrate that FlightGPT achieves state-of-the-art performance across all scenarios, with a 9.22% higher success rate than the strongest baseline in unseen environments. Our implementation is publicly available.

[334] The Hidden Structure -- Improving Legal Document Understanding Through Explicit Text Formatting

Christian Braun,Alexander Lilienbeck,Daniel Mentjukov

Main category: cs.CL

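TL;DR: The paper studies how input text structure and prompt engineering affect GPT-4o and GPT-4.1 on a legal question-answering task, and finds that both structural optimization and prompt design are critical to performance.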

Details

Motivation: Examines how the structure of legal contracts affects LLM processing, particularly in high-stakes legal applications. Method: Compares the effect of different input formats (structured text, OCR-extracted text, Markdown, and others) and of prompt engineering on model performance. Result: GPT-4o is more robust to changes in input structure but performs worse overall; GPT-4.1 is sensitive to structure, and optimizing structure and prompts markedly improves its accuracy. Conclusion: Input structuring and prompt design are critical to LLM performance, especially in demanding domains such as law. Abstract: Legal contracts possess an inherent, semantically vital structure (e.g., sections, clauses) that is crucial for human comprehension but whose impact on LLM processing remains under-explored. This paper investigates the effects of explicit input text structure and prompt engineering on the performance of GPT-4o and GPT-4.1 on a legal question-answering task using an excerpt of the CUAD. We compare model exact-match accuracy across various input formats: well-structured plain-text (human-generated from CUAD), plain-text cleaned of line breaks, extracted plain-text from Azure OCR, plain-text extracted by GPT-4o Vision, and extracted (and interpreted) Markdown (MD) from GPT-4o Vision. To give an indication of the impact of possible prompt engineering, we assess the impact of shifting task instructions to the system prompt and explicitly informing the model about the structured nature of the input. Our findings reveal that GPT-4o demonstrates considerable robustness to variations in input structure, but lacks in overall performance. Conversely, GPT-4.1's performance is markedly sensitive; poorly structured inputs yield suboptimal results (but identical with GPT-4o), while well-structured formats (original CUAD text, GPT-4o Vision text and GPT-4o MD) improve exact-match accuracy by ~20 percentage points. Optimizing the system prompt to include task details and an advisory about structured input further elevates GPT-4.1's accuracy by an additional ~10-13 percentage points, with Markdown ultimately achieving the highest performance under these conditions (79% overall exact-match accuracy). This research empirically demonstrates that while newer models exhibit greater resilience, careful input structuring and strategic prompt design remain critical for optimizing the performance of LLMs, and can significantly affect outcomes in high-stakes legal applications.
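The comparison above is reported in exact-match accuracy. As an illustration of that metric only, here is a minimal sketch; the normalization (lowercasing and collapsing whitespace) is an assumption, since the abstract does not say how answers were normalized before comparison.

```python
def normalize(text):
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    # Hypothetical answers to two contract questions.
    preds = ["Delaware", "30 days"]
    refs = ["delaware", "thirty days"]
    print(exact_match_accuracy(preds, refs))
```

Exact match gives no credit for paraphrases, so a gain of some 20 percentage points reflects answers that moved from wrong or differently worded to literally identical.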

[335] Re-identification of De-identified Documents with Autoregressive Infilling

Lucas Georges Gabriel Charpentier,Pierre Lison

Main category: cs.CL

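TL;DR: The paper presents a RAG-inspired re-identification method that uses a background knowledge base to recover masked personal information; experiments show that as many as 80% of masked spans can be recovered.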

Details

Motivation: Investigates the robustness of de-identification methods and how background knowledge affects the re-identification of masked information. Method: A two-step approach: a retriever selects relevant passages from a background knowledge base, and an infilling model infers the masked content; the process repeats until all masks are replaced. Result: Across three datasets, as many as 80% of masked spans can be recovered, and re-identification accuracy rises with the amount of background knowledge. Conclusion: De-identification methods may not be robust enough, and background knowledge is key to re-identifying masked information. Abstract: Documents revealing sensitive information about individuals must typically be de-identified. This de-identification is often done by masking all mentions of personally identifiable information (PII), thereby making it more difficult to uncover the identity of the person(s) in question. To investigate the robustness of de-identification methods, we present a novel, RAG-inspired approach that attempts the reverse process of re-identification based on a database of documents representing background knowledge. Given a text in which personal identifiers have been masked, the re-identification proceeds in two steps. A retriever first selects from the background knowledge passages deemed relevant for the re-identification. Those passages are then provided to an infilling model which seeks to infer the original content of each text span. This process is repeated until all masked spans are replaced. We evaluate the re-identification on three datasets (Wikipedia biographies, court rulings and clinical notes). Results show that (1) as many as 80% of de-identified text spans can be successfully recovered and (2) the re-identification accuracy increases along with the level of background knowledge.

[336] LEXam: Benchmarking Legal Reasoning on 340 Law Exams

Yu Fan,Jingwei Ni,Jakob Merane,Etienne Salimbeni,Yang Tian,Yoan Hermstrüwer,Yinya Huang,Mubashara Akhtar,Florian Geering,Oliver Dreyer,Daniel Brunner,Markus Leippold,Mrinmaya Sachan,Alexander Stremitzer,Christoph Engel,Elliott Ash,Joel Niklaus

Main category: cs.CL

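TL;DR: LEXam is a new legal reasoning benchmark of 4,886 questions drawn from 340 law exams; evaluation shows that current large language models perform poorly on structured, multi-step legal reasoning.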

Details

Motivation: Addresses the challenge that long-form legal reasoning poses for large language models and provides a more comprehensive evaluation. Method: Collects 4,886 questions (open-ended and multiple-choice) from 340 law exams spanning 116 courses, and evaluates answers with an LLM-as-a-Judge paradigm. Result: Models perform poorly on open questions that require multi-step reasoning, but the dataset effectively differentiates models of varying capability. Conclusion: LEXam offers a scalable way to evaluate legal reasoning and exposes the limitations of current models. Abstract: Long-form legal reasoning remains a key challenge for large language models (LLMs) in spite of recent advances in test-time scaling. We introduce LEXam, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels. The dataset comprises 4,886 law exam questions in English and German, including 2,841 long-form, open-ended questions and 2,045 multiple-choice questions. Besides reference answers, the open questions are also accompanied by explicit guidance outlining the expected legal reasoning approach such as issue spotting, rule recall, or rule application. Our evaluation shows that both open-ended and multiple-choice questions present significant challenges for current LLMs; in particular, they notably struggle with open questions that require structured, multi-step legal reasoning. Moreover, our results underscore the effectiveness of the dataset in differentiating between models with varying capabilities. Adopting an LLM-as-a-Judge paradigm with rigorous human expert validation, we demonstrate how model-generated reasoning steps can be evaluated consistently and accurately. Our evaluation setup provides a scalable method to assess legal reasoning quality beyond simple accuracy metrics. Project page: https://lexam-benchmark.github.io/

[337] GAP: Graph-Assisted Prompts for Dialogue-based Medication Recommendation

Jialun Zhong,Yanzeng Li,Sen Hu,Yang Zhang,Teng Xu,Lei Zou

Main category: cs.CL

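TL;DR: The paper proposes Graph-Assisted Prompts (GAP), a framework for dialogue-based medication recommendation that addresses the tendency of large language models to overlook details and generate non-factual responses in medical dialogue.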

Details

Motivation: Medication recommendation in medical dialogue systems depends on the details of patient-doctor interaction, which methods based on electronic health records cannot supply. Large language models can offer suggestions but may overlook key information or produce inaccurate responses. Method: GAP extracts medical concepts and their states from the dialogue to build a patient-centric graph, then combines it with external medical knowledge graphs to generate queries and prompts that reduce non-factual responses. Result: Experiments show that GAP performs competitively on a dialogue-based medication recommendation dataset and shows promise in dynamic diagnostic interviewing. Conclusion: By explicitly modeling patient information and incorporating external knowledge, GAP improves the accuracy and safety of medication recommendation. Abstract: Medication recommendations have become an important task in the healthcare domain, especially in measuring the accuracy and safety of medical dialogue systems (MDS). Different from the recommendation task based on electronic health records (EHRs), dialogue-based medication recommendations require research on the interaction details between patients and doctors, which is crucial but may not exist in EHRs. Recent advancements in large language models (LLM) have extended the medical dialogue domain. These LLMs can interpret patients' intent and provide medical suggestions including medication recommendations, but some challenges are still worth attention. During a multi-turn dialogue, LLMs may ignore the fine-grained medical information or connections across the dialogue turns, which is vital for providing accurate suggestions. Besides, LLMs may generate non-factual responses when there is a lack of domain-specific knowledge, which is more risky in the medical domain. To address these challenges, we propose a Graph-Assisted Prompts (GAP) framework for dialogue-based medication recommendation. It extracts medical concepts and corresponding states from dialogue to construct an explicitly patient-centric graph, which can describe the neglected but important information. Further, combined with external medical knowledge graphs, GAP can generate abundant queries and prompts, thus retrieving information from multiple sources to reduce the non-factual responses. We evaluate GAP on a dialogue-based medication recommendation dataset and further explore its potential in a more difficult scenario, dynamically diagnostic interviewing. Extensive experiments demonstrate its competitive performance when compared with strong baselines.

[338] On the Thinking-Language Modeling Gap in Large Language Models

Chenxi Liu,Yongqiang Chen,Tongliang Liu,James Cheng,Bo Han,Kun Zhang

Main category: cs.CL

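TL;DR: The paper argues that large language models (LLMs) absorb language biases when modeling human language, which pulls their reasoning away from the underlying chain of thoughts. The authors propose a prompting technique, LoT, that adjusts the order and wording of the relevant information to reduce these biases and improve reasoning performance.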

Details

Motivation: Humans carry out System 2 reasoning in a language of thoughts, but LLMs that model natural language readily absorb language biases, so their reasoning deviates from the true chain of thoughts. Method: Proposes Language-of-Thoughts (LoT), which adjusts the order and wording in which information is expressed to reduce language-modeling biases. Result: LoT significantly reduces language-modeling biases in LLMs and improves performance across a variety of reasoning tasks. Conclusion: LoT effectively narrows the gap between language and thought and points to a way of improving LLM reasoning. Abstract: System 2 reasoning is one of the defining characteristics of intelligence, which requires slow and logical thinking. Humans conduct System 2 reasoning via the language of thoughts that organizes the reasoning process as a causal sequence of mental language, or thoughts. Recently, it has been observed that System 2 reasoning can be elicited from Large Language Models (LLMs) pre-trained on large-scale natural languages. However, in this work, we show that there is a significant gap between the modeling of languages and thoughts. As language is primarily a tool for humans to share knowledge and thinking, modeling human language can easily absorb language biases into LLMs deviated from the chain of thoughts in minds. Furthermore, we show that the biases will mislead the eliciting of "thoughts" in LLMs to focus only on a biased part of the premise. To this end, we propose a new prompt technique termed Language-of-Thoughts (LoT) to demonstrate and alleviate this gap. Instead of directly eliciting the chain of thoughts from partial information, LoT instructs LLMs to adjust the order and token used for the expressions of all the relevant information. We show that the simple strategy significantly reduces the language modeling biases in LLMs and improves the performance of LLMs across a variety of reasoning tasks.

[339] PyFCG: Fluid Construction Grammar in Python

Paul Van Eecke,Katrien Beuls

Main category: cs.CL

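TL;DR: PyFCG is an open-source library that ports Fluid Construction Grammar (FCG) to Python, integrates with the wider Python ecosystem, and comes with three walkthrough tutorials for typical use cases.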

Details

Motivation: Bring FCG functionality into the Python ecosystem so that users can combine FCG with other libraries. Method: Develops the PyFCG library and demonstrates it in three tutorials: construction grammar analysis, corpus-based learning, and emergent communication experiments. Result: PyFCG implements FCG functionality and is shown to be useful in several scenarios. Conclusion: PyFCG gives FCG research a flexible tool and widens its potential within the Python ecosystem. Abstract: We present PyFCG, an open source software library that ports Fluid Construction Grammar (FCG) to the Python programming language. PyFCG enables its users to seamlessly integrate FCG functionality into Python programs, and to use FCG in combination with other libraries within Python's rich ecosystem. Apart from a general description of the library, this paper provides three walkthrough tutorials that demonstrate example usage of PyFCG in typical use cases of FCG: (i) formalising and testing construction grammar analyses, (ii) learning usage-based construction grammars from corpora, and (iii) implementing agent-based experiments on emergent communication.

[340] Do Not Let Low-Probability Tokens Over-Dominate in RL for LLMs

Zhihe Yang,Xufang Luo,Zilong Wang,Dongqi Han,Zhiyuan He,Dongsheng Li,Yunjian Xu

Main category: cs.CL

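TL;DR: The paper proposes two methods, Advantage Reweighting and Lopti, to counter the gradient interference caused by low-probability tokens in RL training, and reports substantial gains in LLM performance.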

Details

Motivation: Gradients from low-probability tokens interfere with the learning of high-probability tokens and hurt LLM performance. Method: Proposes Advantage Reweighting and Lopti, which balance gradient updates across tokens of different probabilities. Result: Experiments show improvements of up to 46.2% on the K&K Logic Puzzle task. Conclusion: The methods make RL training more efficient and markedly improve LLM reasoning. Abstract: Reinforcement learning (RL) has become a cornerstone for enhancing the reasoning capabilities of large language models (LLMs), with recent innovations such as Group Relative Policy Optimization (GRPO) demonstrating exceptional effectiveness. In this study, we identify a critical yet underexplored issue in RL training: low-probability tokens disproportionately influence model updates due to their large gradient magnitudes. This dominance hinders the effective learning of high-probability tokens, whose gradients are essential for LLMs' performance but are substantially suppressed. To mitigate this interference, we propose two novel methods: Advantage Reweighting and Low-Probability Token Isolation (Lopti), both of which effectively attenuate gradients from low-probability tokens while emphasizing parameter updates driven by high-probability tokens. Our approaches promote balanced updates across tokens with varying probabilities, thereby enhancing the efficiency of RL training. Experimental results demonstrate that they substantially improve the performance of GRPO-trained LLMs, achieving up to a 46.2% improvement in K&K Logic Puzzle reasoning tasks. Our implementation is available at https://github.com/zhyang2226/AR-Lopti.
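The claim that low-probability tokens carry larger gradients can be checked at the logit level: for a softmax output, the gradient of log p[token] with respect to the logits is onehot(token) - p, whose norm shrinks as p[token] approaches 1. The sketch below illustrates this, together with a probability-dependent advantage weight. The linear form in `reweight_advantage` and the value of `alpha` are assumptions made for illustration; the abstract does not give the exact formula.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logprob_grad_norm(logits, token):
    """L2 norm of d log p[token] / d logits, which equals onehot(token) - p."""
    probs = softmax(logits)
    grad = [(1.0 if i == token else 0.0) - p for i, p in enumerate(probs)]
    return math.sqrt(sum(g * g for g in grad))


def reweight_advantage(advantage, token_prob, alpha=0.3):
    """Scale the advantage by a weight that grows with the token probability.

    Assumed form: weight = alpha * p + (1 - alpha), so a token with p near 0
    keeps (1 - alpha) of its advantage and a token with p near 1 keeps all of it.
    """
    return (alpha * token_prob + (1.0 - alpha)) * advantage


if __name__ == "__main__":
    logits = [4.0, 0.0, 0.0]  # token 0 is likely, tokens 1 and 2 are unlikely
    probs = softmax(logits)
    for tok in range(3):
        print(tok, round(probs[tok], 4),
              round(logprob_grad_norm(logits, tok), 4),
              round(reweight_advantage(1.0, probs[tok]), 4))
```

With equal advantages, the unlikely tokens in this toy distribution have a far larger gradient norm than the likely one, which is the imbalance that a reweighting of this kind is meant to damp.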

[341] A3 : an Analytical Low-Rank Approximation Framework for Attention

Jeffrey T. H. Wong,Cheng Zhang,Xinye Cao,Pedro Gimenes,George A. Constantinides,Wayne Luk,Yiren Zhao

Main category: cs.CL

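TL;DR: The paper proposes A³, a post-training low-rank approximation framework that splits a Transformer layer into three functional components and reduces the hidden dimension inside each, cutting model size and compute while preserving performance.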

Details

Motivation: Existing low-rank approximation methods ignore the architectural characteristics of Transformers, and the small matrices they produce add runtime overhead; as a result they fall short of pruning and quantization. Method: A³ splits a Transformer layer into three components (QK, OV, and MLP) and gives an analytical solution for each that reduces its hidden dimension while minimizing functional loss. Result: Under the same compute and memory budget, A³ outperforms existing methods; for example, LLaMA 3.1-70B reaches a perplexity of 4.69 on WikiText-2, compared with 7.87 for the previous best. Conclusion: A³ is an efficient low-rank approximation method that directly reduces model size and compute with no runtime overhead and applies to several compression scenarios. Abstract: Large language models have demonstrated remarkable performance; however, their massive parameter counts make deployment highly expensive. Low-rank approximation offers a promising compression solution, yet existing approaches have two main limitations: (1) They focus on minimizing the output error of individual linear layers, without considering the architectural characteristics of Transformers, and (2) they decompose a large weight matrix into two small low-rank matrices. Consequently, these methods often fall short compared to other compression techniques like pruning and quantization, and introduce runtime overhead such as the extra GEMM kernel launches for decomposed small matrices. To address these limitations, we propose A³, a post-training low-rank approximation framework. A³ splits a Transformer layer into three functional components, namely QK, OV, and MLP. For each component, A³ provides an analytical solution that reduces the hidden dimension size inside each component while minimizing the component's functional loss (i.e., error in attention scores, attention outputs, and MLP outputs). This approach directly reduces model sizes, KV cache sizes, and FLOPs without introducing any runtime overheads. In addition, it provides a new narrative in advancing the optimization problem from singular linear layer loss optimization toward improved end-to-end performance. Through extensive experiments, we show that A³ maintains superior performance compared to SoTAs. For example, under the same reduction budget in computation and memory, our low-rank approximated LLaMA 3.1-70B achieves a perplexity of 4.69 on WikiText-2, outperforming the previous SoTA's 7.87 by 3.18. We also demonstrate the versatility of A³, including KV cache compression, quantization, and mixed-rank assignments for enhanced performance.
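The abstract contrasts A³ with the conventional scheme that replaces one weight matrix with two low-rank factors. The arithmetic behind that conventional scheme is sketched below; it is not A³ itself, and the layer size and rank are hypothetical values chosen for illustration.

```python
def dense_params(d_out, d_in):
    """Parameter count of a single d_out x d_in weight matrix."""
    return d_out * d_in


def factored_params(d_out, d_in, rank):
    """Parameter count of the factors (d_out x rank) and (rank x d_in)."""
    return rank * (d_out + d_in)


def break_even_rank(d_out, d_in):
    """Rank at which the two factors hold as many parameters as the dense matrix."""
    return (d_out * d_in) / (d_out + d_in)


if __name__ == "__main__":
    d_out, d_in, rank = 4096, 4096, 512  # hypothetical layer size and rank
    print("dense:", dense_params(d_out, d_in))
    print("factored:", factored_params(d_out, d_in, rank))
    print("break-even rank:", break_even_rank(d_out, d_in))
```

The factored layer saves parameters only below the break-even rank, and it still executes two matrix multiplications where the dense layer executes one, which is the extra kernel-launch overhead the abstract refers to.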

[342] Neural Morphological Tagging for Nguni Languages

Cael Marquard,Simbarashe Mawere,Francois Meyer

Main category: cs.CL

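TL;DR: The paper studies neural methods for building morphological taggers for the Nguni languages of South Africa, comparing neural sequence labellers trained from scratch with fine-tuned pretrained language models, and finds that neural taggers clearly outperform a traditional rule-based parser.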

Details

Motivation: Morphological parsing is especially challenging for agglutinative languages such as the Nguni languages; the paper explores how well neural methods handle the task. Method: Two neural approaches, sequence labellers trained from scratch (LSTMs and neural CRFs) and fine-tuned pretrained language models, are compared with each other and with a rule-based parser. Result: Neural taggers clearly outperform the rule-based baseline, and models trained from scratch tend to outperform pretrained models. Conclusion: Neural taggers built on existing morphological segmenters are viable for the Nguni languages. Abstract: Morphological parsing is the task of decomposing words into morphemes, the smallest units of meaning in a language, and labelling their grammatical roles. It is a particularly challenging task for agglutinative languages, such as the Nguni languages of South Africa, which construct words by concatenating multiple morphemes. A morphological parsing system can be framed as a pipeline with two separate components, a segmenter followed by a tagger. This paper investigates the use of neural methods to build morphological taggers for the four Nguni languages. We compare two classes of approaches: training neural sequence labellers (LSTMs and neural CRFs) from scratch and finetuning pretrained language models. We compare performance across these two categories, as well as to a traditional rule-based morphological parser. Neural taggers comfortably outperform the rule-based baseline and models trained from scratch tend to outperform pretrained models. We also compare parsing results across different upstream segmenters and with varying linguistic input features. Our findings confirm the viability of employing neural taggers based on pre-existing morphological segmenters for the Nguni languages.

[343] GuRE:Generative Query REwriter for Legal Passage Retrieval

Daehee Kim,Deokhyung Kang,Jonghwi Kim,Sangwon Ryu,Gary Geunbae Lee

Main category: cs.CL

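TL;DR: GuRE uses LLMs to generate rewritten queries, addressing vocabulary mismatch in legal passage retrieval and significantly improving retrieval performance.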

Details

Motivation: Vocabulary mismatch between queries and target passages is a serious problem in legal passage retrieval, and existing work has not explored it sufficiently. Method: Proposes GuRE, which trains an LLM to generate rewritten queries that mitigate vocabulary mismatch. Result: Experiments show that GuRE significantly improves retrieval performance, outperforms the baselines, and works with different retrievers. Conclusion: GuRE is better suited to real-world use than direct retriever fine-tuning, and different training objectives lead to different retrieval behaviors. Abstract: Legal Passage Retrieval (LPR) systems are crucial as they help practitioners save time when drafting legal arguments. However, it remains an underexplored avenue. One primary reason is the significant vocabulary mismatch between the query and the target passage. To address this, we propose a simple yet effective method, the Generative query REwriter (GuRE). We leverage the generative capabilities of Large Language Models (LLMs) by training the LLM for query rewriting. "Rewritten queries" help retrievers to retrieve target passages by mitigating vocabulary mismatch. Experimental results show that GuRE significantly improves performance in a retriever-agnostic manner, outperforming all baseline methods. Further analysis reveals that different training objectives lead to distinct retrieval behaviors, making GuRE more suitable than direct retriever fine-tuning for real-world applications. Code is available at github.com/daehuikim/GuRE.

[344] MA-COIR: Leveraging Semantic Search Index and Generative Models for Ontology-Driven Biomedical Concept Recognition

Shanshan Liu,Noriki Nishida,Rumana Ferdous Munne,Narumi Tokunaga,Yuki Yamagata,Kouji Kozaki,Yuji Matsumoto

Main category: cs.CL

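TL;DR: The MA-COIR framework improves biomedical concept recognition with semantic search indexes (ssIDs), capturing implicit concepts that traditional methods miss and performing well in low-resource settings.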

Details

Motivation: Traditional concept recognition relies on explicit mentions and struggles with complex or implicit concepts, which limits knowledge graph construction and ontology refinement in the biomedical domain. Method: MA-COIR reformulates concept recognition as an indexing-recognition task, fine-tunes a pretrained BART model on small datasets, and adds LLM-generated queries and synthetic data. Result: In three scenarios (CDR, HPO, and HOIP), MA-COIR recognizes both explicit and implicit concepts without mention-level annotations at inference time. Conclusion: MA-COIR offers an efficient, low-resource solution for ontology-driven concept recognition in the biomedical domain. Abstract: Recognizing biomedical concepts in the text is vital for ontology refinement, knowledge graph construction, and concept relationship discovery. However, traditional concept recognition methods, relying on explicit mention identification, often fail to capture complex concepts not explicitly stated in the text. To overcome this limitation, we introduce MA-COIR, a framework that reformulates concept recognition as an indexing-recognition task. By assigning semantic search indexes (ssIDs) to concepts, MA-COIR resolves ambiguities in ontology entries and enhances recognition efficiency. Using a pretrained BART-based model fine-tuned on small datasets, our approach reduces computational requirements to facilitate adoption by domain experts. Furthermore, we incorporate large language models (LLMs)-generated queries and synthetic data to improve recognition in low-resource settings. Experimental results on three scenarios (CDR, HPO, and HOIP) highlight the effectiveness of MA-COIR in recognizing both explicit and implicit concepts without the need for mention-level annotations during inference, advancing ontology-driven concept recognition in biomedical domain applications. Our code and constructed data are available at https://github.com/sl-633/macoir-master.

[345] Calm-Whisper: Reduce Whisper Hallucination On Non-Speech By Calming Crazy Heads Down

Yingzhi Wang,Anas Alhmoud,Saad Alsahly,Muhammad Alqurishi,Mirco Ravanelli

Main category: cs.CL

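TL;DR: The paper proposes a way to reduce Whisper's hallucinations on non-speech segments by identifying and fine-tuning specific self-attention heads, which sharply reduces hallucination.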

Details

Motivation: Whisper hallucinates on non-speech segments, which limits its use in complex industrial settings. Method: Uses head-wise masking to measure how much each self-attention head in the Whisper-large-v3 decoder contributes to hallucination, then fine-tunes the key heads. Result: The fine-tuned Calm-Whisper model reduces non-speech hallucination by more than 80%, with less than 0.1% increase in word error rate. Conclusion: The method effectively addresses Whisper's hallucination problem with minimal impact on speech recognition performance. Abstract: OpenAI's Whisper has achieved significant success in Automatic Speech Recognition. However, it has consistently been found to exhibit hallucination issues, particularly in non-speech segments, which limits its broader application in complex industrial settings. In this paper, we introduce a novel method to reduce Whisper's hallucination on non-speech segments without using any pre- or post-processing techniques. Specifically, we benchmark the contribution of each self-attentional head in the Whisper-large-v3 decoder to the hallucination problem by performing a head-wise mask. Our findings reveal that only 3 of the 20 heads account for over 75% of the hallucinations on the UrbanSound dataset. We then fine-tune these three crazy heads using a collection of non-speech data. The results show that our best fine-tuned model, namely Calm-Whisper, achieves over 80% reduction in non-speech hallucination with less than 0.1% WER degradation on LibriSpeech test-clean and test-other.
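The head-wise masking procedure can be sketched as an ablation loop: zero out one head at a time, re-measure the hallucination rate, and rank heads by how much the rate drops. The sketch below is framework-free and is not the authors' code; `hallucination_rate` is a hypothetical callable standing in for a full decoding run over non-speech audio, and `toy_rate` is a made-up stand-in used only to exercise the loop.

```python
def mask_heads(head_outputs, masked):
    """Zero the output vectors of the heads whose indices are in `masked`."""
    return [[0.0] * len(vec) if h in masked else list(vec)
            for h, vec in enumerate(head_outputs)]


def rank_heads_by_hallucination_drop(num_heads, hallucination_rate):
    """Rank heads by the drop in hallucination rate when each is masked alone.

    `hallucination_rate` is a hypothetical callable that takes a set of masked
    head indices and returns the measured rate with those heads disabled.
    """
    baseline = hallucination_rate(set())
    drops = {h: baseline - hallucination_rate({h}) for h in range(num_heads)}
    # Largest drop first; ties keep ascending head order (stable sort).
    return sorted(range(num_heads), key=lambda h: -drops[h])


def toy_rate(masked):
    """Made-up rate in which heads 2 and 5 cause most hallucinations."""
    return 0.9 - sum(0.3 for h in masked if h in {2, 5})


if __name__ == "__main__":
    print(mask_heads([[1.0, 2.0], [3.0, 4.0]], {1}))
    print(rank_heads_by_hallucination_drop(8, toy_rate)[:2])
```

Masking heads one at a time measures each head's individual contribution; it does not capture interactions between heads, which a joint ablation would need to test.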

[346] A Structured Literature Review on Traditional Approaches in Current Natural Language Processing

Robin Jegan,Andreas Henrich

Main category: cs.CL

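TL;DR: The paper surveys how traditional techniques are used alongside modern large language models in five natural language processing application scenarios and finds that traditional methods persist in various forms.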

Details

Motivation: Despite the success of large language models, many shortcomings remain; the paper assesses the current state and future prospects of traditional techniques in five application scenarios. Method: Defines what characterizes traditional techniques, then surveys their use in five scenarios (classification, information and relation extraction, text simplification, and text summarization). Result: Traditional methods are still present in all five scenarios in some form: as part of a processing pipeline, as a baseline, or as the main model. Conclusion: Traditional methods retain value in modern natural language processing, and future work should combine traditional and modern techniques sensibly. Abstract: The continued rise of neural networks and large language models in the more recent past has altered the natural language processing landscape, enabling new approaches towards typical language tasks and achieving mainstream success. Despite the huge success of large language models, many disadvantages still remain and through this work we assess the state of the art in five application scenarios with a particular focus on the future perspectives and sensible application scenarios of traditional and older approaches and techniques. In this paper we survey recent publications in the application scenarios classification, information and relation extraction, text simplification as well as text summarization. After defining our terminology, i.e., which features are characteristic for traditional techniques in our interpretation for the five scenarios, we survey if such traditional approaches are still being used, and if so, in what way they are used. It turns out that all five application scenarios still exhibit traditional models in one way or another, as part of a processing pipeline, as a comparison/baseline to the core model of the respective paper, or as the main model(s) of the paper. For the complete statistics, see https://zenodo.org/records/13683801

[347] Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models

Mahta Fetrat Qharabagh,Zahra Dehghanian,Hamid R. Rabiee

Main category: cs.CL

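TL;DR: The paper proposes a semi-automated pipeline for building the homograph dataset HomoRich and uses it to improve both a deep learning G2P system and the rule-based eSpeak system, substantially raising homograph disambiguation accuracy.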

Details

Motivation: Addresses two challenges of homograph disambiguation in low-resource languages: the high cost of building datasets and the latency of disambiguation in real-time applications. Method: (1) Proposes a semi-automated pipeline that produces the HomoRich dataset; (2) improves a deep learning G2P system and the rule-based eSpeak system (HomoFast eSpeak). Result: Homograph disambiguation accuracy improves by about 30%. Conclusion: Using offline datasets to inform fast rule-based methods gives an efficient solution for real-time applications. Abstract: Homograph disambiguation remains a significant challenge in grapheme-to-phoneme (G2P) conversion, especially for low-resource languages. This challenge is twofold: (1) creating balanced and comprehensive homograph datasets is labor-intensive and costly, and (2) specific disambiguation strategies introduce additional latency, making them unsuitable for real-time applications such as screen readers and other accessibility tools. In this paper, we address both issues. First, we propose a semi-automated pipeline for constructing homograph-focused datasets, introduce the HomoRich dataset generated through this pipeline, and demonstrate its effectiveness by applying it to enhance a state-of-the-art deep learning-based G2P system for Persian. Second, we advocate for a paradigm shift - utilizing rich offline datasets to inform the development of fast, rule-based methods suitable for latency-sensitive accessibility applications like screen readers. To this end, we improve one of the most well-known rule-based G2P systems, eSpeak, into a fast homograph-aware version, HomoFast eSpeak. Our results show an approximate 30% improvement in homograph disambiguation accuracy for the deep learning-based and eSpeak systems.

[348] An Empirical Study of Many-to-Many Summarization with Large Language Models

Jiaan Wang,Fandong Meng,Zengkui Sun,Yunlong Liang,Yuxuan Cao,Jiarong Xu,Haoxiang Shi,Jie Zhou

Main category: cs.CL

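TL;DR: The paper studies large language models (LLMs) on many-to-many summarization (M2MS). Using a reorganized dataset, it finds that zero-shot LLMs are competitive with traditional fine-tuned models and that instruction-tuned open-source LLMs do better still, although factuality problems remain.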

Details

Motivation: Explores the potential of LLMs for multilingual, multi-domain summarization and evaluates their practical usefulness. Method: Reorganizes a dataset of 47.8K samples covering five domains and six languages, benchmarks 18 LLMs in zero-shot and instruction-tuned settings, and compares them with traditional fine-tuned models. Result: Zero-shot LLMs are competitive with traditional fine-tuned models; instruction-tuned open-source LLMs outperform zero-shot LLMs (including GPT-4), but factuality problems become worse. Conclusion: LLMs show promise for many-to-many summarization, but factuality must be addressed, and future work should focus on controlling errors. Abstract: Many-to-many summarization (M2MS) aims to process documents in any language and generate the corresponding summaries also in any language. Recently, large language models (LLMs) have shown strong multi-lingual abilities, giving them the potential to perform M2MS in real applications. This work presents a systematic empirical study on LLMs' M2MS ability. Specifically, we first reorganize M2MS data based on eight previous domain-specific datasets. The reorganized data contains 47.8K samples spanning five domains and six languages, which could be used to train and evaluate LLMs. Then, we benchmark 18 LLMs in a zero-shot manner and an instruction-tuning manner. Fine-tuned traditional models (e.g., mBART) are also evaluated for comparison. Our experiments reveal that zero-shot LLMs achieve competitive results with fine-tuned traditional models. After instruction tuning, open-source LLMs can significantly improve their M2MS ability, and outperform zero-shot LLMs (including GPT-4) in terms of automatic evaluations. In addition, we demonstrate that this task-specific improvement does not sacrifice the LLMs' general task-solving abilities. However, as revealed by our human evaluation, LLMs still face the factuality issue, and the instruction tuning might intensify the issue. Thus, how to control factual errors becomes the key when building LLM summarizers in real applications, and is worth noting in future research.

[349] ExTrans: Multilingual Deep Reasoning Translation via Exemplar-Enhanced Reinforcement Learning

Jiaan Wang,Fandong Meng,Jie Zhou

Main category: cs.CL

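TL;DR: The paper proposes a reward modeling method that compares the translations of the policy MT model with those of a strong LRM (DeepSeek-R1-671B) and quantifies the comparison as a reward. Experiments show the method's advantage, and its extension to multilingual settings yields strong multilingual MT performance.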

Details

Motivation: Existing work focuses mainly on high-resource languages (such as English and Chinese), and its reward modeling does not fully exploit the potential of reinforcement learning for MT. Method: Designs a reward modeling method that quantifies the comparison between the policy MT model's translations and those of a strong LRM, and extends it to multilingual settings with lightweight reward modeling. Result: With Qwen2.5-7B-Instruct as the backbone, the model sets a new state of the art in literary translation and performs well across 90 translation directions in 11 languages. Conclusion: The reward modeling method improves MT performance and extends to multilingual settings. Abstract: In recent years, the emergence of large reasoning models (LRMs), such as OpenAI-o1 and DeepSeek-R1, has shown impressive capabilities in complex problems, e.g., mathematics and coding. Some pioneering studies attempt to bring the success of LRMs to neural machine translation (MT). They try to build LRMs with deep reasoning MT ability via reinforcement learning (RL). Despite some progress that has been made, these attempts generally focus on several high-resource languages, e.g., English and Chinese, leaving the performance on other languages unclear. Besides, the reward modeling methods in previous work do not fully unleash the potential of reinforcement learning in MT. In this work, we first design a new reward modeling method that compares the translation results of the policy MT model with a strong LRM (i.e., DeepSeek-R1-671B), and quantifies the comparisons to provide rewards. Experimental results demonstrate the superiority of the reward modeling method. Using Qwen2.5-7B-Instruct as the backbone, the trained model achieves the new state-of-the-art performance in literary translation, and outperforms strong LRMs including OpenAI-o1 and DeepSeek-R1. Furthermore, we extend our method to the multilingual settings with 11 languages. With a carefully designed lightweight reward modeling in RL, we can simply transfer the strong MT ability from a single direction into multiple (i.e., 90) translation directions and achieve impressive multilingual MT performance.

[350] EffiBench-X: A Multi-Language Benchmark for Measuring Efficiency of LLM-Generated Code

Yuhao Qing,Boyu Zhu,Mingzhe Du,Zhijiang Guo,Terry Yue Zhuo,Qianru Zhang,Jie M. Zhang,Heming Cui,Siu-Ming Yiu,Dong Huang,See-Kiong Ng,Luu Anh Tuan

Main category: cs.CL

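TL;DR: EffiBench-X is the first multi-language benchmark for measuring the efficiency of LLM-generated code, covering Python, C++, Java, and other languages. LLM-generated code is consistently less efficient than human-expert code: even the most efficient solutions reach only about 62% of human efficiency on average, with marked differences between languages.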

Details

Motivation: Existing code generation benchmarks focus on functional correctness, pay little attention to efficiency, and are often limited to a single language such as Python; EffiBench-X aims to fill this gap. Method: EffiBench-X comprises competitive programming tasks in multiple languages (Python, C++, and others) and uses human-expert solutions as the efficiency baseline for LLM-generated code. Result: LLM-generated code is functionally correct, but even the most efficient solutions reach only about 62% of human-expert efficiency on average, and efficiency is better in Python, Ruby, and JavaScript than in Java, C++, and Golang. Conclusion: The study underlines the need to improve the efficiency of LLM-generated code, especially across languages. The EffiBench-X dataset and evaluation tools are open source. Abstract: Existing code generation benchmarks primarily evaluate functional correctness, with limited focus on code efficiency and often restricted to a single language like Python. To address this gap, we introduce EffiBench-X, the first multi-language benchmark designed to measure the efficiency of LLM-generated code. EffiBench-X supports Python, C++, Java, JavaScript, Ruby, and Golang. It comprises competitive programming tasks with human-expert solutions as efficiency baselines. Evaluating state-of-the-art LLMs on EffiBench-X reveals that while models generate functionally correct code, they consistently underperform human experts in efficiency. Even the most efficient LLM-generated solutions (Qwen3-32B) achieve only around 62% of human efficiency on average, with significant language-specific variations. LLMs show better efficiency in Python, Ruby, and JavaScript than in Java, C++, and Golang. For instance, DeepSeek-R1's Python code is significantly more efficient than its Java code. These results highlight the critical need for research into LLM optimization techniques to improve code efficiency across diverse languages. The dataset and evaluation infrastructure are submitted and available at https://github.com/EffiBench/EffiBench-X.git and https://huggingface.co/datasets/EffiBench/effibench-x.

[351] Evaluating the Performance of RAG Methods for Conversational AI in the Airport Domain

Yuyang Li,Philip J. M. Kerbusch,Raimon H. R. Pruim,Tobias Käfer

Main category: cs.CL

TL;DR: The paper evaluates three RAG methods in an airport setting and finds that Graph RAG performs best, with high accuracy and few hallucinations.

Details

Motivation: Increase the degree of automation at airports and handle airport terminology and dynamic questions. Method: Implements three RAG methods (traditional RAG, SQL RAG, and Graph RAG) and compares them experimentally. Result: Graph RAG has the highest accuracy (91.49%) and the fewest hallucinations, and is well suited to questions that require reasoning. Conclusion: SQL RAG and Graph RAG are recommended for airport environments because they hallucinate less and can handle dynamic questions. Abstract: Airports from the top 20 in terms of annual passengers are highly dynamic environments with thousands of flights daily, and they aim to increase the degree of automation. To contribute to this, we implemented a Conversational AI system that enables staff in an airport to communicate with flight information systems. This system not only answers standard airport queries but also resolves airport terminology, jargon, abbreviations, and dynamic questions involving reasoning. In this paper, we built three different Retrieval-Augmented Generation (RAG) methods, including traditional RAG, SQL RAG, and Knowledge Graph-based RAG (Graph RAG). Experiments showed that traditional RAG achieved 84.84% accuracy using BM25 + GPT-4 but occasionally produced hallucinations, which is a risk to airport safety. In contrast, SQL RAG and Graph RAG achieved 80.85% and 91.49% accuracy respectively, with significantly fewer hallucinations. Moreover, Graph RAG was especially effective for questions that involved reasoning. Based on our observations, we recommend SQL RAG and Graph RAG for airport environments, due to fewer hallucinations and the ability to handle dynamic questions.
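The traditional RAG baseline above retrieves with BM25 before generation. A minimal sketch of standard Okapi BM25 scoring follows; the whitespace tokenizer, the `k1` and `b` defaults, and the airport sentences are assumptions for illustration, not the paper's data or configuration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25.

    Tokenization is plain lowercase whitespace splitting (an assumption).
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avgdl = sum(len(d) for d in tokenized) / n_docs or 1.0
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1.0 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + length_norm)
        scores.append(score)
    return scores


if __name__ == "__main__":
    # Hypothetical flight-information snippets.
    docs = [
        "gate b12 is closed",
        "flight kl1001 departs from gate d7",
        "baggage belt 4 is delayed",
    ]
    print(bm25_scores("kl1001 gate", docs))
```

BM25 matches surface terms only, so a query phrased in jargon or abbreviations that do not occur in the documents scores zero, which is one reason a structured lookup (SQL or a knowledge graph) can be attractive in this domain.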

[352] To Bias or Not to Bias: Detecting bias in News with bias-detector

Himel Ghosh,Ahmed Mosharafa,Georg Groh

Main category: cs.CL

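TL;DR: The paper presents a RoBERTa-based approach to sentence-level media bias detection. Fine-tuning on the expert-annotated BABE dataset yields statistically significant gains, and the model shows good interpretability and generalization.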

Details

Motivation: Media bias detection is essential for fair information dissemination, but it is difficult because bias is subjective and high-quality annotated data is scarce. Method: Fine-tune a RoBERTa model on the BABE dataset and verify the performance gains with McNemar's test and the 5x2 cross-validation paired t-test. Result: The model significantly outperforms the baseline, attends more to contextually relevant tokens, and avoids oversensitivity to politically charged terms. Conclusion: The approach offers a more robust and interpretable solution for media bias detection and points to future directions such as context-aware modeling and bias-type classification. Abstract: Media bias detection is a critical task in ensuring fair and balanced information dissemination, yet it remains challenging due to the subjectivity of bias and the scarcity of high-quality annotated data. In this work, we perform sentence-level bias classification by fine-tuning a RoBERTa-based model on the expert-annotated BABE dataset. Using McNemar's test and the 5x2 cross-validation paired t-test, we show statistically significant improvements in performance when comparing our model to a domain-adaptively pre-trained DA-RoBERTa baseline. Furthermore, attention-based analysis shows that our model avoids common pitfalls like oversensitivity to politically charged terms and instead attends more meaningfully to contextually relevant tokens. For a comprehensive examination of media bias, we present a pipeline that combines our model with an already-existing bias-type classifier. Our method exhibits good generalization and interpretability, despite being constrained by sentence-level analysis and dataset size because of a lack of larger and more advanced bias corpora. We discuss context-aware modeling, bias neutralization, and advanced bias type classification as potential future directions. Our findings contribute to building more robust, explainable, and socially responsible NLP systems for media bias detection.
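
The abstract relies on McNemar's test to compare two classifiers on the same test sentences. As a rough illustration only, the sketch below computes the exact (binomial) two-sided variant from the discordant pairs. The function names and toy labels are hypothetical and not taken from the paper, and the abstract does not say which variant of the test the authors used.

```python
from math import comb


def discordant_counts(y_true, pred_a, pred_b):
    """Count sentences where exactly one of the two models is correct."""
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence of a difference
    # Under the null hypothesis each discordant pair favours either model with probability 0.5.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy labels (hypothetical): 1 = biased, 0 = not biased.
y_true = [1, 1, 0, 0, 1, 0]
model_a = [1, 1, 0, 0, 1, 1]
model_b = [1, 0, 1, 0, 0, 1]
b, c = discordant_counts(y_true, model_a, model_b)
p_value = mcnemar_exact(b, c)
```

Only the sentences on which the two models disagree enter the statistic, which is why the test suits paired predictions on a shared test set.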

[353] topicwizard -- a Modern, Model-agnostic Framework for Topic Model Visualization and Interpretation

Márton Kardos,Kenneth C. Enevoldsen,Kristoffer Laigaard Nielbo

Main category: cs.CL

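TL;DR: The paper introduces topicwizard, a framework for model-agnostic topic model interpretation that provides intuitive, interactive tools to help users understand the semantic relations captured by topic models.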

Details

Motivation: Interpreting topic model parameters is challenging for users, and the conventional approach of reading the top 10 terms per topic gives a limited and biased picture, so more comprehensive visualization tools are needed. Method: Propose topicwizard, a framework that provides model-agnostic interactive tools for examining the complex semantic relations between documents, words, and topics. Result: topicwizard helps users understand topic model output more completely and accurately. Conclusion: topicwizard provides an effective visualization solution for model-agnostic topic model interpretation. Abstract: Topic models are statistical tools that allow their users to gain qualitative and quantitative insights into the contents of textual corpora without the need for close reading. They can be applied in a wide range of settings from discourse analysis, through pretraining data curation, to text filtering. Topic models are typically parameter-rich, complex models, and interpreting these parameters can be challenging for their users. It is typical practice for users to interpret topics based on the top 10 highest ranking terms on a given topic. This list-of-words approach, however, gives users a limited and biased picture of the content of topics. Thoughtful user interface design and visualizations can help users gain a more complete and accurate understanding of topic models' output. While some visualization utilities do exist for topic models, these are typically limited to a certain type of topic model. We introduce topicwizard, a framework for model-agnostic topic model interpretation that provides intuitive and interactive tools that help users examine the complex semantic relations between documents, words and topics learned by topic models.

[354] KIT's Offline Speech Translation and Instruction Following Submission for IWSLT 2025

Sai Koneru,Maike Züfle,Thai-Binh Nguyen,Seymanur Akti,Jan Niehues,Alexander Waibel

Main category: cs.CL

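TL;DR: The paper describes the Karlsruhe Institute of Technology's approach to the offline speech translation and instruction following tasks, using large language models to improve performance.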

Details

Motivation: The growing capabilities of modern systems, and in particular the success of large language models, have broadened the scope of speech translation tasks. Method: For offline speech translation, fuse the outputs of multiple speech recognition systems and apply a two-step LLM-based translation; for instruction following, use an end-to-end model that combines a speech encoder with an LLM. Result: Refinement with document-level context improves output quality for both translation and instruction following. Conclusion: Large language models perform well on speech translation and instruction following, with room for further optimization. Abstract: The scope of the International Workshop on Spoken Language Translation (IWSLT) has recently broadened beyond traditional Speech Translation (ST) to encompass a wider array of tasks, including Speech Question Answering and Summarization. This shift is partly driven by the growing capabilities of modern systems, particularly with the success of Large Language Models (LLMs). In this paper, we present the Karlsruhe Institute of Technology's submissions for the Offline ST and Instruction Following (IF) tracks, where we leverage LLMs to enhance performance across all tasks. For the Offline ST track, we propose a pipeline that employs multiple automatic speech recognition systems, whose outputs are fused using an LLM with document-level context. This is followed by a two-step translation process, incorporating an additional refinement step to improve translation quality. For the IF track, we develop an end-to-end model that integrates a speech encoder with an LLM to perform a wide range of instruction-following tasks. We complement it with a final document-level refinement stage to further enhance output quality by using contextual information.

[355] SNAPE-PM: Building and Utilizing Dynamic Partner Models for Adaptive Explanation Generation

Amelie S. Robrecht,Christoph R. Kowalski,Stefan Kopp

Main category: cs.CL

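TL;DR: The paper proposes a framework based on Bayesian inference and a non-stationary Markov Decision Process that dynamically adjusts the explanation strategy of a dialog system to suit different users.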

Details

Motivation: Adapting to the addressee is key to successful explanations but is challenging for dialog systems. The paper addresses how to track the interaction context and user features and how to adjust the explanation strategy dynamically. Method: Update the partner model with Bayesian inference and adjust the explanation strategy with a non-stationary Markov Decision Process. Result: Experiments show that the framework adapts effectively to different users, even when their feedback behavior changes, and produces tailored explanation strategies. Conclusion: The approach offers a potential path to improving explainable AI and dialog systems, showing high adaptivity and flexibility. Abstract: Adapting to the addressee is crucial for successful explanations, yet poses significant challenges for dialog systems. We adopt the approach of treating explanation generation as a non-stationary decision process, where the optimal strategy varies according to changing beliefs about the explainee and the interaction context. In this paper we address the questions of (1) how to track the interaction context and the relevant listener features in a formally defined computational partner model, and (2) how to utilize this model in the dynamically adjusted, rational decision process that determines the currently best explanation strategy. We propose a Bayesian inference-based approach to continuously update the partner model based on user feedback, and a non-stationary Markov Decision Process to adjust decision-making based on the partner model values. We evaluate an implementation of this framework with five simulated interlocutors, demonstrating its effectiveness in adapting to different partners with constant and even changing feedback behavior. The results show high adaptivity with distinct explanation strategies emerging for different partners, highlighting the potential of our approach to improve explainable AI systems and dialog systems in general.

[356] Suicide Risk Assessment Using Multimodal Speech Features: A Study on the SW1 Challenge Dataset

Ambre Marie,Ilias Maoudj,Guillaume Dardenne,Gwenolé Quellec

Main category: cs.CL

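TL;DR: The study explores multimodal methods for adolescent suicide risk assessment, combining speech transcription, linguistic and audio embeddings, and handcrafted acoustic features. A weighted attention fusion strategy works best, but generalization still needs improvement.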

Details

Motivation: The need for suicide risk assessment in adolescents motivates research on multimodal methods that use speech and language features to improve assessment accuracy. Method: Combine automatic transcription with WhisperX, linguistic embeddings from Chinese RoBERTa, audio embeddings from WavLM, and handcrafted acoustic features, and explore three fusion strategies. Result: The weighted attention fusion strategy reaches 69% accuracy on the development set, but test-set performance reveals a generalization gap. Conclusion: Embedding representations and fusion mechanisms need refinement to improve classification reliability; the findings are tied to the MINI-KID framework. Abstract: The 1st SpeechWellness Challenge conveys the need for speech-based suicide risk assessment in adolescents. This study investigates a multimodal approach for this challenge, integrating automatic transcription with WhisperX, linguistic embeddings from Chinese RoBERTa, and audio embeddings from WavLM. Additionally, handcrafted acoustic features -- including MFCCs, spectral contrast, and pitch-related statistics -- were incorporated. We explored three fusion strategies: early concatenation, modality-specific processing, and weighted attention with mixup regularization. Results show that weighted attention provided the best generalization, achieving 69% accuracy on the development set, though a performance gap between development and test sets highlights generalization challenges. Our findings, strictly tied to the MINI-KID framework, emphasize the importance of refining embedding representations and fusion mechanisms to enhance classification reliability.

[357] Advancing Sequential Numerical Prediction in Autoregressive Models

Xiang Fei,Jinghui Lu,Qi Sun,Hao Feng,Yanjie Wang,Wei Shi,An-Lan Wang,Jingqun Tang,Can Huang

Main category: cs.CL

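TL;DR: The paper proposes a loss function called NTIL that improves autoregressive models on numerical sequence generation by combining token-level and sequence-level loss terms, significantly improving prediction.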

Details

Motivation: Standard autoregressive models ignore the coherent structure of numerical sequences when processing digits, which degrades prediction. Method: Propose the NTIL loss, which combines a token-level Earth Mover's Distance (EMD) term with a sequence-level penalty on the overall discrepancy. Result: Experiments show that NTIL significantly improves numerical sequence generation. Conclusion: NTIL is an effective loss function for improving autoregressive models on numerical sequence generation tasks. Abstract: Autoregressive models have become the de facto choice for sequence generation tasks, but standard approaches treat digits as independent tokens and apply cross-entropy loss, overlooking the coherent structure of numerical sequences. This paper introduces Numerical Token Integrity Loss (NTIL) to address this gap. NTIL operates at two levels: (1) token-level, where it extends the Earth Mover's Distance (EMD) to preserve ordinal relationships between numerical values, and (2) sequence-level, where it penalizes the overall discrepancy between the predicted and actual sequences. This dual approach improves numerical prediction and integrates effectively with LLMs/MLLMs. Extensive experiments show significant performance improvements with NTIL.
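
To make the token-level idea concrete, the sketch below contrasts cross-entropy with the standard one-dimensional Earth Mover's Distance over the ordered digit classes 0 to 9. It is an illustration under stated assumptions, not the NTIL formulation itself: the abstract says NTIL extends EMD, and the helper names and the unit spacing between digits are choices made here.

```python
import math


def emd_1d(pred_probs, target_probs):
    """EMD between two distributions over ordered classes with unit spacing.

    In one dimension this equals the summed absolute difference of the two CDFs.
    """
    emd = 0.0
    cdf_gap = 0.0
    for p, q in zip(pred_probs, target_probs):
        cdf_gap += p - q
        emd += abs(cdf_gap)
    return emd


def one_hot(digit, size=10):
    """Probability vector that puts all mass on a single digit."""
    return [1.0 if i == digit else 0.0 for i in range(size)]


def cross_entropy(pred_probs, target_digit, eps=1e-12):
    """Cross-entropy of a prediction against a single target digit."""
    return -math.log(pred_probs[target_digit] + eps)


target = one_hot(5)
near_miss = one_hot(4)  # off by one
far_miss = one_hot(9)   # off by four

near_emd = emd_1d(near_miss, target)
far_emd = emd_1d(far_miss, target)
near_ce = cross_entropy(near_miss, 5)
far_ce = cross_entropy(far_miss, 5)
```

Cross-entropy assigns both wrong predictions the same loss, while the EMD grows with the numerical distance from the target, which is the ordinal relationship the abstract refers to.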

[358] Systematic Generalization in Language Models Scales with Information Entropy

Sondre Wold,Lucas Georges Gabriel Charpentier,Étienne Simon

Main category: cs.CL

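TL;DR: The paper examines the challenge of systematic generalization in language models, proposes measuring problem difficulty by the entropy of component parts in the training data, and finds that model performance scales with this entropy.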

Details

Motivation: Current language models struggle with systematic generalization, especially with known concepts in novel contexts, and there is no standard way to measure how difficult such a problem is. Method: Formalize a framework for measuring entropy in a sequence-to-sequence task and analyze how the performance of different model architectures relates to entropy. Result: Model performance scales with entropy; at high entropy, models succeed even without built-in priors, and success at low entropy can serve as a target for assessing robust systematic generalization. Conclusion: Systematic generalization is connected to information efficiency, and the high-entropy results and low-entropy target indicate directions for future research. Abstract: Systematic generalization remains challenging for current language models, which are known to be both sensitive to semantically similar permutations of the input and to struggle with known concepts presented in novel contexts. Although benchmarks exist for assessing compositional behavior, it is unclear how to measure the difficulty of a systematic generalization problem. In this work, we show how one aspect of systematic generalization can be described by the entropy of the distribution of component parts in the training data. We formalize a framework for measuring entropy in a sequence-to-sequence task and find that the performance of popular model architectures scales with the entropy. Our work connects systematic generalization to information efficiency, and our results indicate that success at high entropy can be achieved even without built-in priors, and that success at low entropy can serve as a target for assessing progress towards robust systematic generalization.
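
As a minimal sketch of the quantity involved, the code below computes the Shannon entropy (in bits) of an empirical distribution of component parts. What counts as a component and how the paper's framework conditions on context are not specified in the abstract, so the verb lists here are purely hypothetical.

```python
from collections import Counter
from math import log2


def component_entropy(components):
    """Shannon entropy in bits of the empirical distribution of components."""
    counts = Counter(components)
    total = sum(counts.values())
    # Written as p * log2(1 / p) so a single-component distribution yields exactly 0.0.
    return sum((n / total) * log2(total / n) for n in counts.values())


# Hypothetical training sets: one dominated by a single verb, one balanced.
skewed = ["jump"] * 7 + ["walk"]
uniform = ["jump", "walk", "run", "look"] * 2

skewed_bits = component_entropy(skewed)
uniform_bits = component_entropy(uniform)
```

A balanced distribution over four components gives the maximum of 2 bits, while the skewed one gives far less, so the skewed set corresponds to the harder low-entropy regime described in the abstract.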

[359] The Effect of Language Diversity When Fine-Tuning Large Language Models for Translation

David Stap,Christof Monz

Main category: cs.CL

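TL;DR: The study finds that greater language diversity during LLM fine-tuning improves translation quality, but the benefit plateaus or declines beyond a certain threshold.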

Details

Motivation: Resolve the inconsistent findings of prior work on the effect of language diversity in LLM fine-tuning. Method: Systematically analyze the effect of language diversity through controlled fine-tuning experiments across 132 translation directions. Result: Language diversity improves translation quality, but the benefit plateaus or decreases beyond a threshold; diversity also encourages more language-agnostic representations. Conclusion: Language diversity is beneficial in fine-tuning, provided the threshold is taken into account. Abstract: Prior research diverges on language diversity in LLM fine-tuning: some studies report benefits while others find no advantages. Through controlled fine-tuning experiments across 132 translation directions, we systematically resolve these disparities. We find that expanding language diversity during fine-tuning improves translation quality for both unsupervised and -- surprisingly -- supervised pairs, despite less diverse models being fine-tuned exclusively on these supervised pairs. However, benefits plateau or decrease beyond a certain diversity threshold. We show that increased language diversity creates more language-agnostic representations. These representational adaptations help explain the improved performance in models fine-tuned with greater diversity.

[360] Benchmarking and Confidence Evaluation of LALMs For Temporal Reasoning

Debarpan Bhattacharya,Apoorva Kulkarni,Sriram Ganapathy

Main category: cs.CL

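TL;DR: The paper proposes TREA, a new dataset for evaluating large audio language models (LALMs) on temporal reasoning tasks, and finds that existing models fall short of human performance. It also proposes an uncertainty metric and argues for more holistic evaluation.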

Details

Motivation: The multimodal community seeks to combine modalities such as audio with text to reach capabilities similar to those of large language models, which requires evaluating LALMs on reasoning tasks. Method: Propose the TREA dataset for evaluating LALMs on temporal reasoning tasks and introduce an uncertainty metric. Result: Experiments show that existing LALMs fall behind humans on TREA tasks and that accuracy and the uncertainty metric are not necessarily correlated. Conclusion: More holistic evaluation is needed to ensure that LALMs are reliable in high-stakes applications. Abstract: The popular success of text-based large language models (LLMs) has steered the attention of the multimodal community toward combining other modalities like vision and audio along with text to achieve similar multimodal capabilities. In this quest, large audio language models (LALMs) have to be evaluated on reasoning-related tasks which are different from traditional classification or generation tasks. Towards this goal, we propose a novel dataset called temporal reasoning evaluation of audio (TREA). We benchmark open-source LALMs and observe that they are consistently behind human capabilities on the tasks in the TREA dataset. While evaluating LALMs, we also propose an uncertainty metric, which computes the invariance of the model to semantically identical perturbations of the input. Our analysis shows that the accuracy and uncertainty metrics are not necessarily correlated and thus points to a need for holistic evaluation of LALMs for high-stakes applications.

[361] ModernGBERT: German-only 1B Encoder Model Trained from Scratch

Anton Ehrmanntraut,Julia Wunderle,Jan Pfister,Fotis Jannidis,Andreas Hotho

Main category: cs.CL

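TL;DR: The paper introduces two families of German encoder models, ModernGBERT and LLäMmlein2Vec, and compares training from scratch with conversion from decoders; ModernGBERT 1B performs best in terms of performance and parameter efficiency.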

Details

Motivation: Encoders remain important for resource-constrained applications, and the paper aims to provide transparent, high-performance German encoder models. Method: Present ModernGBERT (trained from scratch) and LLäMmlein2Vec (converted from decoders) and evaluate them on natural language understanding, text embedding, and long-context reasoning tasks. Result: ModernGBERT 1B outperforms existing German encoders and converted models in performance and parameter efficiency. Conclusion: All models and resources are publicly released, advancing the German NLP ecosystem. Abstract: Despite the prominence of decoder-only language models, encoders remain crucial for resource-constrained applications. We introduce ModernGBERT (134M, 1B), a fully transparent family of German encoder models trained from scratch, incorporating architectural innovations from ModernBERT. To evaluate the practical trade-offs of training encoders from scratch, we also present LLäMmlein2Vec (120M, 1B, 7B), a family of encoders derived from German decoder-only models via LLM2Vec. We benchmark all models on natural language understanding, text embedding, and long-context reasoning tasks, enabling a controlled comparison between dedicated encoders and converted decoders. Our results show that ModernGBERT 1B outperforms prior state-of-the-art German encoders as well as encoders adapted via LLM2Vec, with regard to performance and parameter-efficiency. All models, training data, checkpoints and code are publicly available, advancing the German NLP ecosystem with transparent, high-performance encoder models.

[362] Understanding Cross-Lingual Inconsistency in Large Language Models

Zheng Wei Lim,Alham Fikri Aji,Trevor Cohn

Main category: cs.CL

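TL;DR: The study finds that large language models (LLMs) produce inconsistent output in multilingual reasoning because they rely on language-specific subspaces rather than a shared semantic space. Steering the models toward the shared space improves multilingual reasoning performance.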

Details

Motivation: Understand how LLMs generalize knowledge across languages and address output inconsistency in multilingual reasoning. Method: Apply the logit lens to analyze the implicit steps LLMs take when solving multilingual multi-choice reasoning questions, and modulate knowledge sharing by steering the models' processing toward the shared semantic space. Result: Reliance on language-specific subspaces leads to inconsistent predictions and lower accuracy; larger models are more likely to dissociate from the shared representation yet are better at retrieving knowledge across languages; steering toward the shared space improves multilingual reasoning performance. Conclusion: Reinforcing the use of the shared semantic space improves the multilingual reasoning ability and output consistency of LLMs. Abstract: Large language models (LLMs) are demonstrably capable of cross-lingual transfer, but can produce inconsistent output when prompted with the same queries written in different languages. To understand how language models are able to generalize knowledge from one language to the others, we apply the logit lens to interpret the implicit steps taken by LLMs to solve multilingual multi-choice reasoning questions. We find LLMs predict inconsistently and are less accurate because they rely on subspaces of individual languages, rather than working in a shared semantic space. While larger models are more multilingual, we show their hidden states are more likely to dissociate from the shared representation compared to smaller models, but are nevertheless more capable of retrieving knowledge embedded across different languages. Finally, we demonstrate that knowledge sharing can be modulated by steering the models' latent processing towards the shared semantic space. We find reinforcing utilization of the shared space improves the models' multilingual reasoning performance, as a result of more knowledge transfer from, and better output consistency with, English.

[363] What if Deception Cannot be Detected? A Cross-Linguistic Study on the Limits of Deception Detection from Text

Aswathy Velutharambath,Roman Klinger,Kai Sassenberg

Main category: cs.CL

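TL;DR: The paper examines whether deception can be detected from text alone, questions the reliability of prior studies, and proposes a belief-based deception framework. Using the newly built DeFaBel corpora, it finds that traditional deception cues correlate negligibly with deception labels and that models perform inconsistently across datasets, challenging current approaches to deception research in NLP.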

Details

Motivation: Prior studies claim that deception can be detected automatically from text, but the authors argue that these results may be driven by artifacts of data collection and do not generalize. They therefore propose a belief-based deception framework to re-evaluate the reliability of deception cues. Method: Propose the belief-based deception framework, build the DeFaBel corpora (including German and multilingual versions), evaluate the reliability of traditional deception cues, and test several model types (feature-based models, pretrained language models, and others) on different datasets. Result: Traditional deception cues show negligible, statistically insignificant correlations with deception labels; models perform near chance on DeFaBel, and effects on other datasets are small and inconsistent. Conclusion: The study questions whether deception can be reliably detected from text alone and calls for rethinking the methods and data collection practices of deception research in NLP. Abstract: Can deception be detected solely from written text? Cues of deceptive communication are inherently subtle, even more so in text-only communication. Yet, prior studies have reported considerable success in automatic deception detection. We hypothesize that such findings are largely driven by artifacts introduced during data collection and do not generalize beyond specific datasets. We revisit this assumption by introducing a belief-based deception framework, which defines deception as a misalignment between an author's claims and true beliefs, irrespective of factual accuracy, allowing deception cues to be studied in isolation. Based on this framework, we construct three corpora, collectively referred to as DeFaBel, including a German-language corpus of deceptive and non-deceptive arguments and a multilingual version in German and English, each collected under varying conditions to account for belief change and enable cross-linguistic analysis. Using these corpora, we evaluate commonly reported linguistic cues of deception. Across all three DeFaBel variants, these cues show negligible, statistically insignificant correlations with deception labels, contrary to prior work that treats such cues as reliable indicators. We further benchmark against other English deception datasets following similar data collection protocols. While some show statistically significant correlations, effect sizes remain low and, critically, the set of predictive cues is inconsistent across datasets. We also evaluate deception detection using feature-based models, pretrained language models, and instruction-tuned large language models. While some models perform well on established deception datasets, they consistently perform near chance on DeFaBel. Our findings challenge the assumption that deception can be reliably inferred from linguistic cues and call for rethinking how deception is studied and modeled in NLP.

[364] Tianyi: A Traditional Chinese Medicine all-rounder language model and its Real-World Clinical Practice

Zhi Liu,Tao Yang,Jing Wang,Yexin Chen,Zhan Gao,Jiaxi Yang,Kui Chen,Bingji Lu,Xiaochen Li,Changyong Luo,Yan Li,Xiaohong Gu,Peng Cao

Main category: cs.CL

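TL;DR: The paper introduces Tianyi, a 7.6-billion-parameter large language model designed specifically for Traditional Chinese Medicine (TCM). It addresses the shortcomings of existing models in the TCM domain, and its potential is assessed with the TCMEval benchmark.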

Details

Motivation: TCM is gaining global recognition, but existing AI models are limited by data, single-objective designs, and lack of specialization, and cannot meet the needs of TCM practice. Method: Develop the Tianyi model by pre-training and fine-tuning on diverse TCM corpora, integrate TCM knowledge in a progressive learning manner, and establish the TCMEval evaluation benchmark. Result: Tianyi performs well on TCM examinations, clinical tasks, and question answering, showing its potential for TCM practice and research. Conclusion: Tianyi bridges the gap between TCM knowledge and practical application and provides important support for the development of AI assistants for TCM. Abstract: Natural medicines, particularly Traditional Chinese Medicine (TCM), are gaining global recognition for their therapeutic potential in addressing human symptoms and diseases. TCM, with its systematic theories and extensive practical experience, provides abundant resources for healthcare. However, the effective application of TCM requires precise syndrome diagnosis, determination of treatment principles, and prescription formulation, which demand decades of clinical expertise. Despite advancements in TCM-based decision systems, machine learning, and deep learning research, limitations in data and single-objective constraints hinder their practical application. In recent years, large language models (LLMs) have demonstrated potential in complex tasks, but lack specialization in TCM and face significant challenges, such as model scales too large to deploy and issues with hallucination. To address these challenges, we introduce Tianyi, a 7.6-billion-parameter LLM of suitable scale designed specifically for TCM, pre-trained and fine-tuned on diverse TCM corpora, including classical texts, expert treatises, clinical records, and knowledge graphs. Tianyi is designed to assimilate interconnected and systematic TCM knowledge in a progressive learning manner. Additionally, we establish TCMEval, a comprehensive evaluation benchmark, to assess LLMs in TCM examinations, clinical tasks, domain-specific question-answering, and real-world trials. The extensive evaluations demonstrate the significant potential of Tianyi as an AI assistant in TCM clinical practice and research, bridging the gap between TCM knowledge and practical application.

[365] Role-Playing Evaluation for Large Language Models

Yassine El Boudouri,Walter Nuninger,Julian Alvarez,Yvan Peter

Main category: cs.CL

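TL;DR: RPEval is a new benchmark for evaluating the role-playing ability of large language models across four dimensions: emotional understanding, decision-making, moral alignment, and in-character consistency.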

Details

Motivation: Existing evaluation methods are either resource-intensive or biased, so a more effective evaluation tool is needed. Method: Construct the RPEval benchmark and run baseline evaluations. Result: The paper presents RPEval and releases its code and dataset. Conclusion: RPEval provides a standardized tool for evaluating the role-playing ability of LLMs. Abstract: Large Language Models (LLMs) demonstrate a notable capacity for adopting personas and engaging in role-playing. However, evaluating this ability presents significant challenges, as human assessments are resource-intensive and automated evaluations can be biased. To address this, we introduce Role-Playing Eval (RPEval), a novel benchmark designed to assess LLM role-playing capabilities across four key dimensions: emotional understanding, decision-making, moral alignment, and in-character consistency. This article details the construction of RPEval and presents baseline evaluations. Our code and dataset are available at https://github.com/yelboudouri/RPEval

[366] Positional Fragility in LLMs: How Offset Effects Reshape Our Understanding of Memorization Risks

Yixuan Xu,Antoine Bosselut,Imanol Schlag

Main category: cs.CL

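TL;DR: The study finds that large language models risk memorizing training data and identifies a positional offset effect: shifting sensitive data deeper into the context window suppresses both memorization and degeneration.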

Details

Motivation: Examine the copyright risk posed by large language models memorizing their training data. Method: Pretrain models of different sizes from scratch, simulate copyrighted content at controlled frequencies, and analyze memorization. Result: The study identifies an offset effect: short prefixes from the start of the context window trigger memorization most strongly, and recall drops sharply once the prefix is offset. Conclusion: Positional offset is a key factor in evaluating memorization risk, and exploiting it can suppress memorization and degeneration. Abstract: Large language models are known to memorize parts of their training data, posing a risk of copyright violations. To systematically examine this risk, we pretrain language models (1B/3B/8B) from scratch on 83B tokens, mixing web-scale data with public domain books used to simulate copyrighted content at controlled frequencies and at lengths at least ten times longer than prior work. We thereby identified the offset effect, a phenomenon characterized by two key findings: (1) verbatim memorization is most strongly triggered by short prefixes drawn from the beginning of the context window, with memorization decreasing counterintuitively as prefix length increases; and (2) a sharp decline in verbatim recall when the prefix begins offset from the initial tokens of the context window. We attribute this to positional fragility: models rely disproportionately on the earliest tokens in their context window as retrieval anchors, making them sensitive to even slight shifts. We further observe that when the model fails to retrieve memorized content, it often produces degenerated text. Leveraging these findings, we show that shifting sensitive data deeper into the context window suppresses both extractable memorization and degeneration. Our results suggest that positional offset is a critical and previously overlooked axis for evaluating memorization risks, since prior work implicitly assumed uniformity by probing only from the beginning of training sequences.
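
The probing setup described in the abstract can be pictured with a small slicing helper: a prefix is cut from a training sequence starting at some offset, and the tokens that follow serve as the reference continuation. This is a sketch under assumptions; the function names, the token-level match rate, and the toy token ids are illustrative and are not the paper's evaluation code or metric.

```python
def offset_probe(sequence_tokens, offset, prefix_len, continuation_len):
    """Split a training sequence into a probe prefix and its reference continuation.

    offset = 0 probes from the start of the context window; larger offsets
    start the prefix deeper into the sequence.
    """
    mid = offset + prefix_len
    end = mid + continuation_len
    if end > len(sequence_tokens):
        raise ValueError("probe runs past the end of the sequence")
    return sequence_tokens[offset:mid], sequence_tokens[mid:end]


def verbatim_match_rate(generated, reference):
    """Fraction of reference positions that the generation reproduces exactly."""
    if not reference:
        return 0.0
    hits = sum(1 for g, r in zip(generated, reference) if g == r)
    return hits / len(reference)


# Toy token ids standing in for one tokenized training sequence.
tokens = list(range(20))
prefix_at_start, ref_at_start = offset_probe(tokens, offset=0, prefix_len=4, continuation_len=3)
prefix_shifted, ref_shifted = offset_probe(tokens, offset=5, prefix_len=4, continuation_len=3)
```

Feeding each prefix to a model and scoring its output against the reference with `verbatim_match_rate` would give one point per offset, which is the axis the abstract argues prior work did not vary.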

[367] A Case Study of Cross-Lingual Zero-Shot Generalization for Classical Languages in LLMs

V. S. D. S. Mahesh Akavarapu,Hrishikesh Terdalkar,Pramit Bhattacharyya,Shubhangi Agarwal,Vishakha Deulgaonkar,Pralay Manna,Chaitali Dangarikar,Arnab Bhattacharya

Main category: cs.CL

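TL;DR: The study investigates the cross-lingual zero-shot generalization of large language models (LLMs) in classical languages (Sanskrit, Ancient Greek, and Latin) and finds that model scale is a key factor.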

Details

Motivation: Understand the factors that affect cross-lingual zero-shot generalization of LLMs in classical languages in order to assess their usefulness for classical studies. Method: Compare LLMs of different sizes on named entity recognition, machine translation, and Sanskrit factoid question answering, and use retrieval-augmented generation to improve performance. Result: LLMs match or outperform fine-tuned baselines on out-of-domain data, but smaller models struggle with abstract entity types and QA tasks; retrieval-augmented generation significantly improves Sanskrit QA performance. Conclusion: Model scale is an important factor in cross-lingual generalization, and LLMs have potential practical value for research on classical languages. Abstract: Large Language Models (LLMs) have demonstrated remarkable generalization capabilities across diverse tasks and languages. In this study, we focus on natural language understanding in three classical languages -- Sanskrit, Ancient Greek and Latin -- to investigate the factors affecting cross-lingual zero-shot generalization. First, we explore named entity recognition and machine translation into English. While LLMs perform equal to or better than fine-tuned baselines on out-of-domain data, smaller models often struggle, especially with niche or abstract entity types. In addition, we concentrate on Sanskrit by presenting a factoid question-answering (QA) dataset and show that incorporating context via a retrieval-augmented generation approach significantly boosts performance. In contrast, we observe pronounced performance drops for smaller LLMs across these QA tasks. These results suggest model scale as an important factor influencing cross-lingual generalization. Assuming that the models used, such as GPT-4o and Llama-3.1, are not instruction fine-tuned on classical languages, our findings provide insights into how LLMs may generalize on these languages and their consequent utility in classical studies.

[368] ToolSpectrum : Towards Personalized Tool Utilization for Large Language Models

Zihao Cheng,Hongru Wang,Zeming Liu,Yuhang Guo,Yuanfang Guo,Yunhong Wang,Haifeng Wang

Main category: cs.CL

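TL;DR: ToolSpectrum is a benchmark for evaluating the personalized tool utilization ability of large language models, emphasizing how user profiles and environmental factors affect tool selection.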

Details

Motivation: Existing approaches overlook context-aware personalization in tool selection, which leads to low user satisfaction and inefficient tool use. Method: Introduce the ToolSpectrum benchmark, formalize two key dimensions (user profile and environmental factors), and analyze their individual and synergistic effects on tool utilization. Result: Experiments show that personalized tool utilization significantly improves user experience, but current models have limited ability to reason jointly about user profiles and environmental factors. Conclusion: Context-aware personalization is essential for tool-augmented LLMs, and current models have clear limitations. Abstract: While integrating external tools into large language models (LLMs) enhances their ability to access real-time information and domain-specific services, existing approaches focus narrowly on functional tool selection following user instructions, overlooking context-aware personalization in tool selection. This oversight leads to suboptimal user satisfaction and inefficient tool utilization, particularly when overlapping toolsets require nuanced selection based on contextual factors. To bridge this gap, we introduce ToolSpectrum, a benchmark designed to evaluate LLMs' capabilities in personalized tool utilization. Specifically, we formalize two key dimensions of personalization, user profile and environmental factors, and analyze their individual and synergistic impacts on tool utilization. Through extensive experiments on ToolSpectrum, we demonstrate that personalized tool utilization significantly improves user experience across diverse scenarios. However, even state-of-the-art LLMs exhibit limited ability to reason jointly about user profiles and environmental factors, often prioritizing one dimension at the expense of the other. Our findings underscore the necessity of context-aware personalization in tool-augmented LLMs and reveal critical limitations of current models. Our data and code are available at https://github.com/Chengziha0/ToolSpectrum.

[369] Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space

Zhengrui Ma,Yang Feng,Chenze Shao,Fandong Meng,Jie Zhou,Min Zhang

Main category: cs.CL

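TL;DR: SLED is a new approach to speech language modeling that trains efficiently with continuous latent representations and an energy distance objective, avoiding discretization errors and complex architectures.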

Details

Motivation: Conventional speech language models rely on residual vector quantization and complicated hierarchical architectures; SLED aims to simplify the pipeline while preserving the richness of speech information. Method: SLED encodes speech waveforms into continuous latent representations and models them autoregressively with an energy distance objective. Result: Experiments show that SLED performs strongly in both zero-shot and streaming speech synthesis. Conclusion: SLED shows potential for broad application in general-purpose speech language models. Abstract: We introduce SLED, an alternative approach to speech language modeling by encoding speech waveforms into sequences of continuous latent representations and modeling them autoregressively using an energy distance objective. The energy distance offers an analytical measure of the distributional gap by contrasting simulated and target samples, enabling efficient training to capture the underlying continuous autoregressive distribution. By bypassing reliance on residual vector quantization, SLED avoids discretization errors and eliminates the need for the complicated hierarchical architectures common in existing speech language models. It simplifies the overall modeling pipeline while preserving the richness of speech information and maintaining inference efficiency. Empirical results demonstrate that SLED achieves strong performance in both zero-shot and streaming speech synthesis, showing its potential for broader applications in general-purpose speech language models.
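
For readers unfamiliar with the statistic, the sketch below estimates the squared energy distance between two sample sets, D^2 = 2 E||X - Y|| - E||X - X'|| - E||Y - Y'||, by contrasting simulated and target samples. It is a plain sample-based estimate that includes self-pairs, offered as an assumption-laden illustration rather than SLED's training loss; the function names and toy vectors are hypothetical.

```python
from math import dist


def mean_pairwise_distance(xs, ys):
    """Average Euclidean distance over all pairs (x, y) with x in xs and y in ys."""
    return sum(dist(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def energy_distance_sq(simulated, target):
    """Sample estimate of the squared energy distance between two sample sets.

    Self-pairs are included in the within-set terms, so identical sets score exactly zero.
    """
    return (2 * mean_pairwise_distance(simulated, target)
            - mean_pairwise_distance(simulated, simulated)
            - mean_pairwise_distance(target, target))


# Toy latent vectors (hypothetical), given as tuples of floats.
target_latents = [(0.0, 0.0), (3.0, 4.0)]
matching_samples = [(0.0, 0.0), (3.0, 4.0)]
shifted_samples = [(10.0, 0.0), (13.0, 4.0)]

gap_matching = energy_distance_sq(matching_samples, target_latents)
gap_shifted = energy_distance_sq(shifted_samples, target_latents)
```

Because the estimate needs only samples and pairwise distances, no density or codebook is required, which is consistent with the abstract's point about avoiding quantization.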

[370] Alignment-Augmented Speculative Decoding with Alignment Sampling and Conditional Verification

Jikai Wang,Zhenxu Tian,Juntao Li,Qingrong Xia,Xinyu Duan,Zhefeng Wang,Baoxing Huai,Min Zhang

Main category: cs.CL

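TL;DR: The paper proposes a training-free alignment-augmented speculative decoding algorithm that uses alignment sampling and a flexible verification strategy to improve both generation efficiency and quality.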

Details

Motivation: Existing speculative decoding methods rely on training to achieve draft-target alignment, which is costly; the paper aims to provide an efficient training-free alternative. Method: Use alignment sampling, which exploits the output distribution from the prefilling phase to produce aligned draft candidates, and introduce a flexible verification strategy with an adaptive probability threshold to improve generation quality. Result: Experiments on 8 datasets show an average generation score gain of 3.3 points, a mean acceptance length of up to 2.39, and a 2.23x generation speedup. Conclusion: The training-free approach improves both generation efficiency and quality while avoiding the training cost of existing methods. Abstract: Recent works have revealed the great potential of speculative decoding in accelerating the autoregressive generation process of large language models. The success of these methods relies on the alignment between draft candidates and the sampled outputs of the target model. Existing methods mainly achieve draft-target alignment with training-based methods, e.g., EAGLE, Medusa, involving considerable training costs. In this paper, we present a training-free alignment-augmented speculative decoding algorithm. We propose alignment sampling, which leverages the output distribution obtained in the prefilling phase to provide more aligned draft candidates. To further benefit from high-quality but non-aligned draft candidates, we also introduce a simple yet effective flexible verification strategy. Through an adaptive probability threshold, our approach can improve generation accuracy while further improving inference efficiency. Experiments on 8 datasets (including question answering, summarization and code completion tasks) show that our approach increases the average generation score by 3.3 points for the LLaMA3 model. Our method achieves a mean acceptance length of up to 2.39 and speeds up generation by 2.23x.
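
The abstract does not spell out the verification rule, so the following is only a hypothetical sketch of threshold-based draft verification: a draft token is accepted when the target model assigns it at least a fixed fraction (`ratio`) of the probability of its own top token. The rule, the `ratio` parameter, and the toy distributions are assumptions made here for illustration and should not be read as the paper's algorithm.

```python
def accepted_length(draft_tokens, target_dists, ratio=1.0):
    """Number of leading draft tokens accepted under a relative probability threshold.

    target_dists[i] maps each candidate token to the target model's probability
    at position i. With ratio = 1.0 only the target's top token passes; a smaller
    ratio also accepts plausible tokens that are not the top choice.
    """
    accepted = 0
    for token, token_probs in zip(draft_tokens, target_dists):
        threshold = ratio * max(token_probs.values())
        if token_probs.get(token, 0.0) < threshold:
            break  # the first rejection ends the accepted prefix
        accepted += 1
    return accepted


# Toy target distributions (hypothetical) for a three-token draft.
draft = ["a", "y", "q"]
dists = [
    {"a": 0.75, "b": 0.25},
    {"x": 0.5, "y": 0.25, "z": 0.25},
    {"q": 0.125, "r": 0.875},
]
strict_len = accepted_length(draft, dists, ratio=1.0)
relaxed_len = accepted_length(draft, dists, ratio=0.5)
```

Relaxing the threshold lengthens the accepted prefix, which is the kind of trade-off between acceptance length and output fidelity that a flexible verification strategy has to manage.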

[371] Picturized and Recited with Dialects: A Multimodal Chinese Representation Framework for Sentiment Analysis of Classical Chinese Poetry

Xiaocong Du,Haoyu Pei,Haipeng Zhang

Main category: cs.CL

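TL;DR: The paper proposes a dialect-enhanced multimodal framework for sentiment analysis of classical Chinese poetry that combines audio, visual, and textual features and outperforms existing methods.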

Details

Motivation: Existing studies analyze sentiment from text alone and overlook the distinctive rhythmic and visual features of poetry, in particular its recitation and its association with paintings. Method: Extract sentence-level audio features (including multiple dialects), generate visual features, and fuse them with LLM-enhanced textual features through multimodal contrastive representation learning. Result: The framework outperforms existing methods on two public datasets, improving accuracy by at least 2.51% and macro F1 by 1.63%. Conclusion: The framework improves sentiment analysis of classical poetry, and the code is open-sourced to support research on multimodal Chinese representation. Abstract: Classical Chinese poetry is a vital and enduring part of Chinese literature, conveying profound emotional resonance. Existing studies analyze sentiment based on textual meanings, overlooking the unique rhythmic and visual features inherent in poetry, especially since it is often recited and accompanied by Chinese paintings. In this work, we propose a dialect-enhanced multimodal framework for classical Chinese poetry sentiment analysis. We extract sentence-level audio features from the poetry and incorporate audio from multiple dialects, which may retain regional ancient Chinese phonetic features, enriching the phonetic representation. Additionally, we generate sentence-level visual features, and the multimodal features are fused with textual features enhanced by LLM translation through multimodal contrastive representation learning. Our framework outperforms state-of-the-art methods on two public datasets, achieving at least 2.51% improvement in accuracy and 1.63% in macro F1. We open-source the code to facilitate research in this area and provide insights for general multimodal Chinese representation.

[372] SeedBench: A Multi-task Benchmark for Evaluating Large Language Models in Seed Science

Jie Ying,Zihong Chen,Zhefan Wang,Wanli Jiang,Chenyang Wang,Zhonghang Yuan,Haoyang Su,Huanjun Kong,Fan Yang,Nanqing Dong

Main category: cs.CL

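TL;DR: SeedBench is the first multi-task benchmark for seed science. It evaluates large language models (LLMs) on seed breeding and reveals a gap between LLM capabilities and real-world needs.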

Details

Motivation: Seed science is vital to modern agriculture but faces challenges such as interdisciplinary complexity, high costs, and insufficient resources; applying LLMs is limited by the lack of digital resources and standardized benchmarks. Method: Develops the SeedBench benchmark, which simulates modern breeding processes, and evaluates 26 leading LLMs (proprietary, open-source, and domain-specific fine-tuned models). Result: The evaluation shows substantial gaps in LLMs' ability to solve seed science problems. Conclusion: SeedBench lays a foundation for research on LLMs for seed design and advances the field. Abstract: Seed science is essential for modern agriculture, directly influencing crop yields and global food security. However, challenges such as interdisciplinary complexity and high costs with limited returns hinder progress, leading to a shortage of experts and insufficient technological support. While large language models (LLMs) have shown promise across various fields, their application in seed science remains limited due to the scarcity of digital resources, complex gene-trait relationships, and the lack of standardized benchmarks. To address this gap, we introduce SeedBench -- the first multi-task benchmark specifically designed for seed science. Developed in collaboration with domain experts, SeedBench focuses on seed breeding and simulates key aspects of modern breeding processes. We conduct a comprehensive evaluation of 26 leading LLMs, encompassing proprietary, open-source, and domain-specific fine-tuned models. Our findings not only highlight the substantial gaps between the power of LLMs and real-world seed science problems, but also take a foundational step for research on LLMs for seed design.

[373] JNLP at SemEval-2025 Task 11: Cross-Lingual Multi-Label Emotion Detection Using Generative Models

Jieying Xue,Phuong Minh Nguyen,Minh Le Nguyen,Xin Liu

Main category: cs.CL

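TL;DR: The paper studies multilingual multi-label emotion detection, proposes methods based on pre-trained models, and achieves strong results in the SemEval-2025 task.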

Details

Motivation: With growing global digitalization, multilingual emotion detection has become an important research direction, aiming to address the challenges of cross-lingual emotion recognition. Method: Uses pre-trained multilingual models, including a fine-tuned BERT-based classifier and an instruction-tuned generative LLM, and proposes two methods for multi-label classification. Result: Ranks in the top 4 across 10 languages, including 1st in Hindi; ranks in the top 5 in 7 languages on the emotion intensity task. Conclusion: The approach shows strong generalization and effectiveness in multilingual emotion detection. Abstract: With the rapid advancement of global digitalization, users from different countries increasingly rely on social media for information exchange. In this context, multilingual multi-label emotion detection has emerged as a critical research area. This study addresses SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection. Our paper focuses on two sub-tracks of this task: (1) Track A: Multi-label emotion detection, and (2) Track B: Emotion intensity. To tackle multilingual challenges, we leverage pre-trained multilingual models and focus on two architectures: (1) a fine-tuned BERT-based classification model and (2) an instruction-tuned generative LLM. Additionally, we propose two methods for handling multi-label classification: the base method, which maps an input directly to all its corresponding emotion labels, and the pairwise method, which models the relationship between the input text and each emotion category individually. Experimental results demonstrate the strong generalization ability of our approach in multilingual emotion recognition. In Track A, our method achieved Top 4 performance across 10 languages, ranking 1st in Hindi. In Track B, our approach also secured Top 5 performance in 7 languages, highlighting its simplicity and effectiveness. Our code is available at https://github.com/yingjie7/mlingual_multilabel_emo_detection.
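The difference between the base and pairwise formulations described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the label set `EMOTIONS`, the keyword table, and the toy stand-in predictors are hypothetical and replace the actual fine-tuned or instruction-tuned models.

```python
from typing import Callable, List

# Hypothetical label set, used only for illustration.
EMOTIONS = ["anger", "fear", "joy", "sadness", "surprise"]


def base_predict(text: str, label_all: Callable[[str], List[str]]) -> List[str]:
    """Base method: one model call maps the input directly to all its labels."""
    return [e for e in label_all(text) if e in EMOTIONS]


def pairwise_predict(text: str, has_emotion: Callable[[str, str], bool]) -> List[str]:
    """Pairwise method: one yes/no decision per (text, emotion) pair."""
    return [e for e in EMOTIONS if has_emotion(text, e)]


# Toy keyword-based stand-ins for the real models (assumptions).
KEYWORDS = {"anger": "furious", "fear": "scared", "joy": "happy",
            "sadness": "sad", "surprise": "wow"}


def toy_label_all(text: str) -> List[str]:
    return [e for e in EMOTIONS if KEYWORDS[e] in text.lower()]


def toy_has_emotion(text: str, emotion: str) -> bool:
    return KEYWORDS[emotion] in text.lower()


base_labels = base_predict("I am happy but scared", toy_label_all)
pair_labels = pairwise_predict("I am happy but scared", toy_has_emotion)
```

The pairwise form costs one model call per emotion instead of one per input, but each decision is an isolated binary question.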

Sidney Wong

Main category: cs.CL

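TL;DR: The internet is both an opportunity and a challenge for marginalised communities; online hate speech should be addressed from a social perspective rather than with purely technical solutions.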

Details

Motivation: The internet can connect communities but can also spread hate and misinformation; this social problem calls for social methods. Method: Draws on the experience of linguistics research, applying knowledge about language and society to mitigate the risks of anti-social behaviour in digital spaces. Result: Linguists and NLP researchers can promote digital inclusion and narrow the digital divide by working with communities and policymakers. Conclusion: Linguistic and social methods play a key role in addressing online hate speech, and multi-party collaboration is needed to achieve social impact. Abstract: The advent of the internet has been both a blessing and a curse for once marginalised communities. When used well, the internet can be used to connect and establish communities crossing different intersections; however, it can also be used as a tool to alienate people and communities as well as perpetuate hate, misinformation, and disinformation especially on social media platforms. We propose steering hate speech research and researchers away from pre-existing computational solutions and consider social methods to inform social solutions to address this social problem. In a similar way linguistics research can inform language planning policy, linguists should apply what we know about language and society to mitigate some of the emergent risks and dangers of anti-social behaviour in digital spaces. We argue linguists and NLP researchers can play a principal role in unleashing the social impact potential of linguistics research working alongside communities, advocates, activists, and policymakers to enable equitable digital inclusion and to close the digital divide.

[375] Natural Language Planning via Coding and Inference Scaling

Rikhil Amonkar,Ronan Le Bras,Li Zhang

Main category: cs.CL

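TL;DR: The paper studies LLMs on complex textual planning tasks such as meeting scheduling and compares how well closed- and open-source models generate programs. Programming usually outperforms planning, but the generated code has robustness and efficiency problems.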

Details

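Motivation: Real-life textual planning tasks such as meeting scheduling challenge LLMs, especially at high complexity. Prior work mainly studied auto-regressive plan generation with closed-source models, whereas this paper systematically evaluates both closed- and open-source models, including those that scale output length with complexity at inference time. Method: Plans are produced by generating programs (standard Python code and code for a constraint satisfaction problem solver), and programming is compared against planning. Result: Programming usually outperforms planning, but the generated code shows robustness and efficiency problems that hurt generalization. Conclusion: Although programming performs better in some cases, the generated code needs better robustness and efficiency to improve generalization. Abstract: Real-life textual planning tasks such as meeting scheduling have posed much challenge to LLMs especially when the complexity is high. While previous work primarily studied auto-regressive generation of plans with closed-source models, we systematically evaluate both closed- and open-source models, including those that scale output length with complexity during inference, in generating programs, which are executed to output the plan. We consider not only standard Python code, but also the code to a constraint satisfaction problem solver. Despite the algorithmic nature of the task, we show that programming often but not always outperforms planning. Our detailed error analysis also indicates a lack of robustness and efficiency in the generated code that hinders generalization.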

[376] HeteroSpec: Leveraging Contextual Heterogeneity for Efficient Speculative Decoding

Siran Liu,Yang Ye,Qianchao Zhu,Zheng Cao,Yongchao He

Main category: cs.CL

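TL;DR: HeteroSpec is a heterogeneity-adaptive speculative decoding framework based on linguistic complexity that dynamically optimizes computational resource allocation and significantly speeds up large language model inference.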

Details

Motivation: Autoregressive decoding is the main bottleneck in large language model inference, and existing speculative decoding algorithms do not exploit the heterogeneity of linguistic complexity, leading to suboptimal resource allocation. Method: Proposes the HeteroSpec framework, comprising a mechanism for identifying predictable contexts based on cumulative meta-path Top-K entropy and a dynamic resource allocation strategy based on data-driven entropy partitioning. Result: Across five public benchmarks and four models, HeteroSpec achieves an average speedup of 4.26x, outperforms the existing EAGLE-3 method, and requires no draft model retraining. Conclusion: HeteroSpec offers a new paradigm for context-aware LLM inference acceleration, is compatible with other acceleration techniques, and performs better with stronger draft models. Abstract: Autoregressive decoding, the standard approach for Large Language Model (LLM) inference, remains a significant bottleneck due to its sequential nature. While speculative decoding algorithms mitigate this inefficiency through parallel verification, they fail to exploit the inherent heterogeneity in linguistic complexity, a key factor leading to suboptimal resource allocation. We address this by proposing HeteroSpec, a heterogeneity-adaptive speculative decoding framework that dynamically optimizes computational resource allocation based on linguistic context complexity. HeteroSpec introduces two key mechanisms: (1) A novel cumulative meta-path Top-$K$ entropy metric for efficiently identifying predictable contexts. (2) A dynamic resource allocation strategy based on data-driven entropy partitioning, enabling adaptive speculative expansion and pruning tailored to local context difficulty. Evaluated on five public benchmarks and four models, HeteroSpec achieves an average speedup of 4.26$\times$. It consistently outperforms state-of-the-art EAGLE-3 across speedup rates, average acceptance length, and verification cost. Notably, HeteroSpec requires no draft model retraining, incurs minimal overhead, and is orthogonal to other acceleration techniques. It demonstrates enhanced acceleration with stronger draft models, establishing a new paradigm for context-aware LLM inference acceleration.
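The abstract does not define the cumulative meta-path Top-$K$ entropy or the partitioning rule precisely. As a rough illustration under stated assumptions, the sketch below takes the entropy of the renormalized top-K probabilities at each draft step, sums it along a draft path, and maps the result to a speculative expansion budget through sorted bin edges. The function names, the bin edges, and the budgets are all hypothetical.

```python
import bisect
import math
from typing import List, Sequence


def top_k_entropy(probs: Sequence[float], k: int) -> float:
    """Shannon entropy (in nats) of the renormalized top-k probabilities."""
    top = sorted(probs, reverse=True)[:k]
    total = sum(top)
    if total <= 0:
        raise ValueError("top-k probabilities must have positive mass")
    return -sum((p / total) * math.log(p / total) for p in top if p > 0)


def cumulative_path_entropy(step_dists: List[Sequence[float]], k: int) -> float:
    """Assumed 'cumulative' form: sum of per-step top-k entropies on a path."""
    return sum(top_k_entropy(d, k) for d in step_dists)


def expansion_budget(entropy: float, bin_edges: List[float],
                     budgets: List[int]) -> int:
    """Map an entropy value to a draft budget via sorted bin edges.

    `budgets` needs len(bin_edges) + 1 entries; lower entropy (a more
    predictable context) is assumed to deserve a larger budget.
    """
    return budgets[bisect.bisect_right(bin_edges, entropy)]


path = [[0.5, 0.5, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
path_entropy = cumulative_path_entropy(path, k=2)
budget = expansion_budget(path_entropy, bin_edges=[0.5, 1.5], budgets=[8, 4, 2])
```

In a data-driven variant, the bin edges would be fitted from observed entropies and acceptance statistics rather than fixed by hand as they are here.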

[377] WikiPersonas: What Can We Learn From Personalized Alignment to Famous People?

Zilu Tang,Afra Feyza Akyürek,Ekin Akyürek,Derry Wijaya

Main category: cs.CL

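TL;DR: The paper introduces the WikiPersona dataset for personalized preference alignment and achieves efficient personalization by inferring personal preference prefixes.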

Details

Motivation: Existing research mostly targets generic human preferences, overlooking the diverse and conflicting preferences of individuals, and lacks datasets covering fine-grained individual-level preferences. Method: Introduces the WikiPersona dataset and evaluates different personalization approaches through verifiable textual persona descriptions and preference alignment. Result: Inferring personal preference prefixes is more effective when preferences conflict and generalizes more equitably to unseen individuals. Conclusion: WikiPersona opens a new direction for personalized preference alignment, and inferring preference prefixes is an efficient approach. Abstract: Preference alignment has become a standard pipeline in finetuning models to follow generic human preferences. The majority of work seeks to optimize models to produce responses that would be preferable on average, simplifying the diverse and often contradicting space of human preferences. While research has increasingly focused on personalized alignment: adapting models to individual user preferences, there is a lack of personalized preference datasets that focus on nuanced individual-level preferences. To address this, we introduce WikiPersona: the first fine-grained personalization using well-documented, famous individuals. Our dataset challenges models to align with these personas through an interpretable process: generating verifiable textual descriptions of a persona's background and preferences in addition to alignment. We systematically evaluate different personalization approaches and find that as few-shot prompting with preferences and fine-tuning fail to simultaneously ensure effectiveness and efficiency, using inferred personal preferences as prefixes enables effective personalization, especially in topics where preferences clash while leading to more equitable generalization across unseen personas.

[378] Effective and Transparent RAG: Adaptive-Reward Reinforcement Learning for Decision Traceability

Jingyi Ren,Yekun Xu,Xiaolong Wang,Weitao Li,Weizhi Ma,Yang Liu

Main category: cs.CL

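TL;DR: The ARENA framework uses reinforcement learning to make RAG generators more transparent and effective, and performs well on multi-hop QA tasks.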

Details

Motivation: Addresses the generator's underuse of retrieved information in RAG and the lack of transparency. Method: Proposes ARENA, a reinforcement-learning-based framework that improves performance through adaptive reward calculation and structured generation. Result: Achieves 10-30% improvements on multi-hop QA datasets, comparable to commercial LLMs. Conclusion: ARENA is highly adaptable and transparent; the code and models are open-sourced. Abstract: Retrieval-Augmented Generation (RAG) has significantly improved the performance of large language models (LLMs) on knowledge-intensive domains. However, although RAG achieved successes across distinct domains, there are still some unsolved challenges: 1) Effectiveness. Existing research mainly focuses on developing more powerful RAG retrievers, but how to enhance the generator's (LLM's) ability to utilize the retrieved information for reasoning and generation? 2) Transparency. Most RAG methods ignore which retrieved content actually contributes to the reasoning process, resulting in a lack of interpretability and visibility. To address this, we propose ARENA (Adaptive-Rewarded Evidence Navigation Agent), a transparent RAG generator framework trained via reinforcement learning (RL) with our proposed rewards. Based on the structured generation and adaptive reward calculation, our RL-based training enables the model to identify key evidence, perform structured reasoning, and generate answers with interpretable decision traces. Applied to Qwen2.5-7B-Instruct and Llama3.1-8B-Instruct, abundant experiments with various RAG baselines demonstrate that our model achieves 10-30% improvements on all multi-hop QA datasets, which is comparable with the SOTA commercially developed LLMs (e.g., OpenAI-o1, DeepSeek-R1). Further analyses show that ARENA has strong flexibility to be adopted on new datasets without extra training. Our models and codes are publicly released.

[379] From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery

Tianshi Zheng,Zheye Deng,Hong Ting Tsang,Weiqi Wang,Jiaxin Bai,Zihao Wang,Yangqiu Song

Main category: cs.CL

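TL;DR: The paper examines the changing role of large language models (LLMs) in scientific discovery, from tools to autonomous agents, and proposes a three-level taxonomy (Tool, Analyst, Scientist) to describe their autonomy and responsibilities. It also identifies future challenges and research directions.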

Details

Motivation: Studies how LLMs evolve from task automation tools into autonomous agents, redefining research processes and human-AI collaboration and driving a paradigm shift in scientific discovery. Method: Through the lens of the scientific method, proposes a three-level taxonomy (Tool, Analyst, Scientist) and systematically surveys how the roles and capabilities of LLMs change across the research lifecycle. Result: Clarifies the potential and challenges of LLMs in scientific discovery, such as robotic automation, self-improvement, and ethical governance, and offers future research directions. Conclusion: The paper provides a conceptual framework and strategic outlook for AI-driven scientific discovery, aiming to foster rapid innovation and responsible development. Abstract: Large Language Models (LLMs) are catalyzing a paradigm shift in scientific discovery, evolving from task-specific automation tools into increasingly autonomous agents and fundamentally redefining research processes and human-AI collaboration. This survey systematically charts this burgeoning field, placing a central focus on the changing roles and escalating capabilities of LLMs in science. Through the lens of the scientific method, we introduce a foundational three-level taxonomy (Tool, Analyst, and Scientist) to delineate their escalating autonomy and evolving responsibilities within the research lifecycle. We further identify pivotal challenges and future research trajectories such as robotic automation, self-improvement, and ethical governance. Overall, this survey provides a conceptual architecture and strategic foresight to navigate and shape the future of AI-driven scientific discovery, fostering both rapid innovation and responsible advancement. Github Repository: https://github.com/HKUST-KnowComp/Awesome-LLM-Scientific-Discovery.

[380] Representation of perceived prosodic similarity of conversational feedback

Livia Qian,Carol Figueroa,Gabriel Skantze

Main category: cs.CL

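TL;DR: The study examines the perceived prosodic similarity of vocal feedback and how well existing speech representations reflect it. Spectral and self-supervised representations outperform pitch features and can be further refined through contrastive learning.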

Details

Motivation: Vocal feedback is essential in dialogue, but its prosodic similarity and how well existing representations reflect it remain unclear. Method: Uses a triadic comparison task to measure participants' perceived similarity of vocal feedback taken from different datasets. Result: Spectral and self-supervised representations encode prosody better than pitch features, especially for feedback from the same speaker; contrastive learning can further refine the representations. Conclusion: The prosodic similarity of vocal feedback can be captured effectively by suitable representations, and contrastive learning improves their alignment with human perception. Abstract: Vocal feedback (e.g., 'mhm', 'yeah', 'okay') is an important component of spoken dialogue and is crucial to ensuring common ground in conversational systems. The exact meaning of such feedback is conveyed through both lexical and prosodic form. In this work, we investigate the perceived prosodic similarity of vocal feedback with the same lexical form, and to what extent existing speech representations reflect such similarities. A triadic comparison task with recruited participants is used to measure perceived similarity of feedback responses taken from two different datasets. We find that spectral and self-supervised speech representations encode prosody better than extracted pitch features, especially in the case of feedback from the same speaker. We also find that it is possible to further condense and align the representations to human perception through contrastive learning.
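One common way to check how well a representation reflects triadic human judgments is to count how often its similarity ordering matches the human choice. The sketch below assumes each judgment is a triple (anchor, chosen, other), meaning the participant judged `chosen` to be more similar to `anchor` than `other` was. The exact task design and scoring used in the paper are not given in the abstract, so this format and the toy embeddings are assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def triplet_agreement(emb: Dict[str, Sequence[float]],
                      triplets: List[Tuple[str, str, str]]) -> float:
    """Fraction of (anchor, chosen, other) judgments the embedding agrees with."""
    hits = 0
    for anchor, chosen, other in triplets:
        if cosine(emb[anchor], emb[chosen]) > cosine(emb[anchor], emb[other]):
            hits += 1
    return hits / len(triplets)


# Toy 2-D embeddings of three feedback tokens (illustrative values only).
emb = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
agreement = triplet_agreement(emb, [("a", "b", "c"), ("a", "c", "b")])
```

Comparing this score across pitch, spectral, and self-supervised features would give a like-for-like ranking of the representations.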

[381] CSC-SQL: Corrective Self-Consistency in Text-to-SQL via Reinforcement Learning

Lei Sheng,Shuai-Shuai Xu

Main category: cs.CL

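TL;DR: CSC-SQL combines Self-Consistency and Self-Correction, improving SQL generation quality through parallel sampling and a merge revision model, and fine-tunes the models with the GRPO algorithm. Experiments show strong performance on the BIRD dataset.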

Details

Motivation: Existing methods such as Self-Consistency and Self-Correction each have limitations: the former may select suboptimal outputs, and the latter only addresses syntactic errors. CSC-SQL aims to combine the strengths of both. Method: Proposes CSC-SQL, which selects the two most frequent outputs from parallel sampling, corrects them with a merge revision model, and optimizes the models with the GRPO algorithm. Result: On the BIRD development set, the 3B model reaches 65.28% execution accuracy and the 7B model reaches 69.19%. Conclusion: CSC-SQL is effective and generalizable; the code will be open-sourced. Abstract: Large language models (LLMs) have demonstrated strong capabilities in translating natural language questions about relational databases into SQL queries. In particular, test-time scaling techniques such as Self-Consistency and Self-Correction can enhance SQL generation accuracy by increasing computational effort during inference. However, these methods have notable limitations: Self-Consistency may select suboptimal outputs despite majority votes, while Self-Correction typically addresses only syntactic errors. To leverage the strengths of both approaches, we propose CSC-SQL, a novel method that integrates Self-Consistency and Self-Correction. CSC-SQL selects the two most frequently occurring outputs from parallel sampling and feeds them into a merge revision model for correction. Additionally, we employ the Group Relative Policy Optimization (GRPO) algorithm to fine-tune both the SQL generation and revision models via reinforcement learning, significantly enhancing output quality. Experimental results confirm the effectiveness and generalizability of CSC-SQL. On the BIRD development set, our 3B model achieves 65.28% execution accuracy, while the 7B model achieves 69.19%. The code will be open sourced at https://github.com/CycloneBoy/csc_sql.
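The selection step described in the abstract (take the two most frequent outputs from parallel sampling and pass them to a revision model) can be sketched as below. This is a minimal illustration with assumptions: samples are grouped by a crude whitespace- and case-normalized string key (a real system might group by execution result instead), `revise` is a hypothetical stand-in for the merge revision model, and the shortcut for unanimous samples is a design guess the abstract does not confirm.

```python
from collections import Counter
from typing import Callable, Dict, List


def normalize(sql: str) -> str:
    """Crude grouping key (assumption): collapse whitespace, ignore case."""
    return " ".join(sql.lower().split())


def top_two_candidates(samples: List[str],
                       key: Callable[[str], str] = normalize) -> List[str]:
    """Return one representative for each of the two most frequent groups."""
    counts = Counter(key(s) for s in samples)
    first_seen: Dict[str, str] = {}
    for s in samples:
        first_seen.setdefault(key(s), s)  # keep the first sample per group
    return [first_seen[k] for k, _ in counts.most_common(2)]


def csc_select(samples: List[str], revise: Callable[[str, str], str]) -> str:
    """Feed the top two candidates to a (hypothetical) merge revision model."""
    candidates = top_two_candidates(samples)
    if len(candidates) == 1:  # all samples agree; nothing to merge (assumption)
        return candidates[0]
    return revise(candidates[0], candidates[1])


samples = ["SELECT a FROM t", "select a  from t", "SELECT b FROM t",
           "SELECT c FROM t", "SELECT b FROM t", "SELECT a FROM t"]
top_two = top_two_candidates(samples)
```

Note that lowercasing the key would also merge queries that differ only in the case of string literals, which is one reason grouping by execution result may be preferable in practice.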

[382] Rank, Chunk and Expand: Lineage-Oriented Reasoning for Taxonomy Expansion

Sahil Mishra,Kumar Arjun,Tanmoy Chakraborty

Main category: cs.CL

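TL;DR: LORex is a framework that combines discriminative ranking and generative reasoning for efficient taxonomy expansion, significantly improving accuracy and similarity.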

Details

Motivation: Existing taxonomy expansion methods face representation limits, noise, and poor context efficiency. Method: LORex ranks and chunks candidate terms, filters noise, and iteratively reasons over the candidates' hierarchy. Result: Across four benchmarks, LORex improves accuracy by 12% and Wu & Palmer similarity by 5%. Conclusion: LORex offers an efficient and accurate solution for taxonomy expansion. Abstract: Taxonomies are hierarchical knowledge graphs crucial for recommendation systems and web applications. As data grows, expanding taxonomies is essential, but existing methods face key challenges: (1) discriminative models struggle with representation limits and generalization, while (2) generative methods either process all candidates at once, introducing noise and exceeding context limits, or discard relevant entities by selecting noisy candidates. We propose LORex (Lineage-Oriented Reasoning for Taxonomy Expansion), a plug-and-play framework that combines discriminative ranking and generative reasoning for efficient taxonomy expansion. Unlike prior methods, LORex ranks and chunks candidate terms into batches, filtering noise and iteratively refining selections by reasoning candidates' hierarchy to ensure contextual efficiency. Extensive experiments across four benchmarks and twelve baselines show that LORex improves accuracy by 12% and Wu & Palmer similarity by 5% over state-of-the-art methods.
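The rank-then-chunk idea in the abstract can be illustrated with a short sketch: score candidate parent terms with a discriminative ranker, split the ranked list into fixed-size batches, and let a generative reasoner inspect one batch at a time. Everything named here (`score`, `pick_parent`, the early-exit loop, the toy scores) is a hypothetical stand-in; the paper's actual ranker, prompts, and refinement procedure are not described in the abstract.

```python
from typing import Callable, List, Optional


def rank_and_chunk(candidates: List[str], score: Callable[[str], float],
                   chunk_size: int) -> List[List[str]]:
    """Sort candidates by descending score and split into fixed-size batches."""
    ranked = sorted(candidates, key=score, reverse=True)
    return [ranked[i:i + chunk_size] for i in range(0, len(ranked), chunk_size)]


def expand(query: str, candidates: List[str], score: Callable[[str], float],
           chunk_size: int,
           pick_parent: Callable[[str, List[str]], Optional[str]]) -> Optional[str]:
    """Show the reasoner one batch at a time, best-ranked batch first."""
    for chunk in rank_and_chunk(candidates, score, chunk_size):
        parent = pick_parent(query, chunk)  # stand-in for the generative step
        if parent is not None:
            return parent
    return None


# Toy ranker scores and a toy reasoner (assumptions for illustration).
toy_scores = {"fruit": 0.9, "animal": 0.5, "tool": 0.2, "vehicle": 0.1}
chunks = rank_and_chunk(list(toy_scores), toy_scores.get, chunk_size=2)
parent = expand("apple", list(toy_scores), toy_scores.get, 2,
                lambda q, chunk: "fruit" if "fruit" in chunk else None)
```

Because each call sees only `chunk_size` candidates, the prompt length stays bounded regardless of taxonomy size.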

Alice Plebe,Timothy Douglas,Diana Riazi,R. Maria del Rio-Chanona

Main category: cs.CL

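TL;DR: The study finds that images significantly increase the propensity of vision-language models (VLMs) to reshare news content, especially false news. Model family, persona, and content attributes all influence this behavior.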

Details

Motivation: Examines how images affect VLMs' propensity to reshare news and how model family, persona, and content attributes modulate this behavior. Method: Uses a jailbreaking-inspired prompting strategy that simulates users with antisocial traits and political alignments, together with a multimodal dataset of political news with images and veracity labels. Result: Images increase resharing rates by 4.8% for true news and 15% for false news. Personas (such as Dark Triad traits or Republican alignment) further modulate the effect. Conclusion: The study reveals potential risks in multimodal model behavior and calls for tailored evaluation frameworks and mitigation strategies. Abstract: Large language models are increasingly integrated into news recommendation systems, raising concerns about their role in spreading misinformation. In humans, visual content is known to boost credibility and shareability of information, yet its effect on vision-language models (VLMs) remains unclear. We present the first study examining how images influence VLMs' propensity to reshare news content, whether this effect varies across model families, and how persona conditioning and content attributes modulate this behavior. To support this analysis, we introduce two methodological contributions: a jailbreaking-inspired prompting strategy that elicits resharing decisions from VLMs while simulating users with antisocial traits and political alignments; and a multimodal dataset of fact-checked political news from PolitiFact, paired with corresponding images and ground-truth veracity labels. Experiments across model families reveal that image presence increases resharing rates by 4.8% for true news and 15.0% for false news. Persona conditioning further modulates this effect: Dark Triad traits amplify resharing of false news, whereas Republican-aligned profiles exhibit reduced veracity sensitivity. Of all the tested models, only Claude-3-Haiku demonstrates robustness to visual misinformation. These findings highlight emerging risks in multimodal model behavior and motivate the development of tailored evaluation frameworks and mitigation strategies for personalized AI systems. Code and dataset are available at: https://github.com/3lis/misinfo_vlm

[384] RBF++: Quantifying and Optimizing Reasoning Boundaries across Measurable and Unmeasurable Capabilities for Chain-of-Thought Reasoning

Qiguang Chen,Libo Qin,Jinhao Liu,Yue Liao,Jiaqi Wang,Jingxuan Zhou,Wanxiang Che

Main category: cs.CL

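TL;DR: The paper proposes Reasoning Boundary Framework++ (RBF++) to address two challenges of Chain-of-Thought (CoT) reasoning in real-world applications: quantitatively evaluating and optimizing measurable boundaries, and handling unmeasurable boundaries such as multimodal perception.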

Details

Motivation: CoT reasoning currently lacks quantitative metrics and optimization methods, and unmeasurable capabilities (such as multimodal perception) cannot be assessed, which limits practical use. Method: Proposes the RBF++ framework, which defines the reasoning boundary (RB) as the upper limit of performance and introduces a combination law and a constant assumption to handle measurable and unmeasurable boundaries respectively. It also introduces a reasoning boundary division mechanism. Result: Experiments on 13 tasks and 38 models validate the framework's feasibility; 10 CoT strategies are evaluated, and the evaluation benchmarks for LLM reasoning are expanded. Conclusion: RBF++ provides a new way to understand and optimize reasoning boundaries in LLMs and advances the practical application of CoT reasoning. Abstract: Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models (LLMs) on complex tasks, spurring research into its underlying mechanisms. However, two primary challenges remain for real-world applications: (1) the lack of quantitative metrics and actionable guidelines for evaluating and optimizing measurable boundaries of CoT capability, and (2) the absence of methods to assess boundaries of unmeasurable CoT capability, such as multimodal perception. To address these gaps, we introduce the Reasoning Boundary Framework++ (RBF++). To tackle the first challenge, we define the reasoning boundary (RB) as the maximum limit of CoT performance. We also propose a combination law for RBs, enabling quantitative analysis and offering actionable guidance across various CoT tasks. For the second challenge, particularly in multimodal scenarios, we introduce a constant assumption, which replaces unmeasurable RBs with scenario-specific constants. Additionally, we propose the reasoning boundary division mechanism, which divides unmeasurable RBs into two sub-boundaries, facilitating the quantification and optimization of both unmeasurable domain knowledge and multimodal perception capabilities. Extensive experiments involving 38 models across 13 tasks validate the feasibility of our framework in cross-modal settings. Additionally, we evaluate 10 CoT strategies, offer insights into optimization and decay from two complementary perspectives, and expand evaluation benchmarks for measuring RBs in LLM reasoning. We hope this work advances the understanding of RBs and optimization strategies in LLMs. Code and data are available at https://github.com/LightChen233/reasoning-boundary.

[385] GUARD: Generation-time LLM Unlearning via Adaptive Restriction and Detection

Zhijie Deng,Chris Yuhao Liu,Zirui Pang,Xinlei He,Lei Feng,Qi Xuan,Zhaowei Zhu,Jiaheng Wei

Main category: cs.CL

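TL;DR: The GUARD framework achieves selective unlearning by dynamically adjusting the generation process, avoiding the performance loss caused by fine-tuning.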

Details

Motivation: Large language models need to selectively forget specific knowledge for safety and compliance, but existing methods usually rely on fine-tuning, which can hurt overall performance. Method: Proposes the GUARD framework, which combines a prompt classifier with a dynamic penalty mechanism to detect and filter forget targets at generation time. Result: Validated on multiple datasets, GUARD achieves high-quality forgetting while preserving the model's general capabilities. Conclusion: GUARD strikes a good balance between forgetting and model performance and offers an effective solution for dynamic unlearning. Abstract: Large Language Models (LLMs) have demonstrated strong capabilities in memorizing vast amounts of knowledge across diverse domains. However, the ability to selectively forget specific knowledge is critical for ensuring the safety and compliance of deployed models. Existing unlearning efforts typically fine-tune the model with resources such as forget data, retain data, and a calibration model. These additional gradient steps blur the decision boundary between forget and retain knowledge, making unlearning often at the expense of overall performance. To avoid the negative impact of fine-tuning, it would be better to unlearn solely at inference time by safely guarding the model against generating responses related to the forget target, without destroying the fluency of text generation. In this work, we propose Generation-time Unlearning via Adaptive Restriction and Detection (GUARD), a framework that enables dynamic unlearning during LLM generation. Specifically, we first employ a prompt classifier to detect unlearning targets and extract the corresponding forbidden token. We then dynamically penalize and filter candidate tokens during generation using a combination of token matching and semantic matching, effectively preventing the model from leaking the forgotten content. Experimental results on copyright content unlearning tasks over the Harry Potter dataset and the MUSE benchmark, as well as entity unlearning tasks on the TOFU dataset, demonstrate that GUARD achieves strong forget quality across various tasks while causing almost no degradation to the LLM's general capabilities, striking an excellent trade-off between forgetting and utility.
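The token-matching half of the generation-time penalty can be sketched as follows; the semantic-matching component and the prompt classifier are omitted. The rule used here (subtract a penalty from any next token that would extend a forbidden sequence already partly produced, scaled by how much of the sequence has been matched) is an assumption made for illustration, as are the function name, the dictionary of logits, and the default penalty value.

```python
from typing import Dict, List


def penalize_logits(logits: Dict[str, float], generated: List[str],
                    forbidden: List[List[str]],
                    penalty: float = 10.0) -> Dict[str, float]:
    """Lower the logits of tokens that would continue a forbidden sequence.

    For each forbidden token sequence, if the last `matched` generated tokens
    equal its first `matched` tokens, the next token of that sequence is
    penalized in proportion to (matched + 1) / len(sequence).
    The input `logits` mapping is left unmodified.
    """
    out = dict(logits)
    for seq in forbidden:
        for matched in range(len(seq)):
            prefix_ok = matched == 0 or generated[-matched:] == seq[:matched]
            if prefix_ok and seq[matched] in out:
                out[seq[matched]] -= penalty * (matched + 1) / len(seq)
    return out


logits = {"Potter": 5.0, "went": 4.0}
adjusted = penalize_logits(logits, generated=["Harry"],
                           forbidden=[["Harry", "Potter"]])
best = max(adjusted, key=adjusted.get)
```

Scaling the penalty by match progress keeps early, ambiguous tokens mostly available while making completion of a forbidden phrase increasingly unlikely.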

[386] Rethinking Stateful Tool Use in Multi-Turn Dialogues: Benchmarks and Challenges

Hongru Wang,Wenyu Huang,Yufei Wang,Yuanhao Xi,Jianqiao Lu,Huan Zhang,Nan Hu,Zeming Liu,Jeff Z. Pan,Kam-Fai Wong

Main category: cs.CL

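TL;DR: The paper introduces DialogTool, a multi-turn dialogue dataset, and VirtualMobile, a virtual mobile evaluation environment, to assess language models on multi-turn tool interactions, and finds that existing models still fall short on long-horizon tool use.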

Details

Motivation: Existing benchmarks that evaluate language models as language agents focus on single-turn or stateless interactions and overlook the stateful interactions required in multi-turn applications. Method: Proposes the DialogTool dataset and the VirtualMobile evaluation environment, covering three stages (tool creation, tool utilization, and role-consistent response), and evaluates 13 language models. Result: Current state-of-the-art language models perform poorly on multi-turn tool interactions. Conclusion: Language models need further improvement in multi-turn stateful tool interaction. Abstract: Existing benchmarks that assess Language Models (LMs) as Language Agents (LAs) for tool use primarily focus on stateless, single-turn interactions or partial evaluations, such as tool selection in a single turn, overlooking the inherent stateful nature of interactions in multi-turn applications. To fill this gap, we propose DialogTool, a multi-turn dialogue dataset with stateful tool interactions considering the whole life cycle of tool use, across six key tasks in three stages: 1) tool creation; 2) tool utilization: tool awareness, tool selection, tool execution; and 3) role-consistent response: response generation and role play. Furthermore, we build VirtualMobile -- an embodied virtual mobile evaluation environment to simulate API calls and assess the robustness of the created APIs (we use the terms tools and APIs interchangeably; there are no significant differences between them in this paper). Taking advantage of these artifacts, we conduct comprehensive evaluation on 13 distinct open- and closed-source LLMs and provide detailed analysis at each stage, revealing that the existing state-of-the-art LLMs still cannot perform well to use tools over long horizons.

Qiongqiong Wang,Hardik B. Sailor,Tianchi Liu,Ai Ti Aw

Main category: cs.CL

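TL;DR: The paper proposes a new dataset generation framework that combines contextual reasoning with paralinguistic information to strengthen the paralinguistic understanding of speech-LLMs.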

Details

Motivation: Current speech-LLMs have limited ability in contextual reasoning and paralinguistic understanding, and QA datasets covering both aspects are lacking. Method: Proposes a framework for generating datasets from in-the-wild speech data, comprising pseudo paralinguistic label-based data condensation and LLM-based contextual paralinguistic QA generation. Result: Evaluations of the Qwen2-Audio-7B-Instruct model show a strong correlation between the generated dataset and a human-generated CPQA dataset, and also reveal the model's limitations on empathetic reasoning tasks. Conclusion: The framework is the first of its kind and could be used to train more robust speech-LLMs with better paralinguistic reasoning. Abstract: Current speech-LLMs exhibit limited capability in contextual reasoning alongside paralinguistic understanding, primarily due to the lack of Question-Answer (QA) datasets that cover both aspects. We propose a novel framework for dataset generation from in-the-wild speech data that integrates contextual reasoning with paralinguistic information. It consists of a pseudo paralinguistic label-based data condensation of in-the-wild speech and LLM-based Contextual Paralinguistic QA (CPQA) generation. The effectiveness is validated by a strong correlation in evaluations of the Qwen2-Audio-7B-Instruct model on a dataset created by our framework and a human-generated CPQA dataset. The results also reveal the speech-LLM's limitations in handling empathetic reasoning tasks, highlighting the need for such datasets and more robust models. The proposed framework is the first of its kind and has potential in training more robust speech-LLMs with paralinguistic reasoning capabilities.

[388] J4R: Learning to Judge with Equivalent Initial State Group Relative Preference Optimization

Austin Xu,Yilun Zhou,Xuan-Phi Nguyen,Caiming Xiong,Shafiq Joty

Main category: cs.CL

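TL;DR: The paper proposes a reinforcement-learning-based approach to LLM evaluation, using the EIS-GRPO algorithm to address positional bias, and introduces a new benchmark, ReasoningJudgeBench. The trained J4R model outperforms GPT-4o and other small judges on reasoning tasks.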

Details

Motivation: As large language models (LLMs) advance, demand for automatic evaluation grows, but existing evaluation methods perform poorly on complex reasoning tasks. Method: Proposes the EIS-GRPO algorithm for training judge models and develops the ReasoningJudgeBench benchmark. Result: The trained J4R model outperforms GPT-4o and the next best small judge on reasoning tasks, by 6.7% and 9% respectively. Conclusion: Judges trained with reinforcement learning perform well on complex reasoning tasks and address the limitations of existing methods. Abstract: To keep pace with the increasing pace of large language model (LLM) development, model output evaluation has transitioned away from time-consuming human evaluation to automatic evaluation, where LLMs themselves are tasked with assessing and critiquing other model outputs. LLM-as-judge models are a class of generative evaluators that excel in evaluating relatively simple domains, like chat quality, but struggle in reasoning-intensive domains where model responses contain more substantive and challenging content. To remedy existing judge shortcomings, we explore training judges with reinforcement learning (RL). We make three key contributions: (1) We propose the Equivalent Initial State Group Relative Policy Optimization (EIS-GRPO) algorithm, which allows us to train our judge to be robust to positional biases that arise in more complex evaluation settings. (2) We introduce ReasoningJudgeBench, a benchmark that evaluates judges in diverse reasoning settings not covered by prior work. (3) We train Judge for Reasoning (J4R), a 7B judge trained with EIS-GRPO that outperforms GPT-4o and the next best small judge by 6.7% and 9%, matching or exceeding the performance of larger GRPO-trained judges on both JudgeBench and ReasoningJudgeBench.

[389] Investigating the Vulnerability of LLM-as-a-Judge Architectures to Prompt-Injection Attacks

Narek Maloyan,Bislan Ashinov,Dmitry Namiot

Main category: cs.CL

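TL;DR: The paper examines the vulnerability of LLM evaluators (LLM-as-a-Judge) to adversarial attacks, proposes two attack strategies, and demonstrates their effectiveness.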

Details

Motivation: As LLMs are widely used to evaluate machine-generated text, their reliability and security become critical, especially under the threat of adversarial attacks. Method: Uses the Greedy Coordinate Gradient (GCG) method to generate adversarial suffixes and tests two attack strategies (CUA and JMA) on open-source LLMs. Result: CUA achieves an attack success rate above 30%, and JMA is also notably effective, indicating serious vulnerabilities in current LLM-as-a-Judge systems. Conclusion: The study highlights the fragility of current LLM evaluation systems and calls for stronger defense mechanisms and further research on adversarial evaluation and trustworthiness. Abstract: Large Language Models (LLMs) are increasingly employed as evaluators (LLM-as-a-Judge) for assessing the quality of machine-generated text. This paradigm offers scalability and cost-effectiveness compared to human annotation. However, the reliability and security of such systems, particularly their robustness against adversarial manipulations, remain critical concerns. This paper investigates the vulnerability of LLM-as-a-Judge architectures to prompt-injection attacks, where malicious inputs are designed to compromise the judge's decision-making process. We formalize two primary attack strategies: Comparative Undermining Attack (CUA), which directly targets the final decision output, and Justification Manipulation Attack (JMA), which aims to alter the model's generated reasoning. Using the Greedy Coordinate Gradient (GCG) optimization method, we craft adversarial suffixes appended to one of the responses being compared. Experiments conducted on the MT-Bench Human Judgments dataset with open-source instruction-tuned LLMs (Qwen2.5-3B-Instruct and Falcon3-3B-Instruct) demonstrate significant susceptibility. The CUA achieves an Attack Success Rate (ASR) exceeding 30%, while JMA also shows notable effectiveness. These findings highlight substantial vulnerabilities in current LLM-as-a-Judge systems, underscoring the need for robust defense mechanisms and further research into adversarial evaluation and trustworthiness in LLM-based assessment frameworks.

[390] Sense and Sensitivity: Examining the Influence of Semantic Recall on Long Context Code Reasoning

Adam Štorek,Mukur Gupta,Samira Hajizadeh,Prashast Srivastava,Suman Jana

Main category: cs.CL

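TL;DR: The paper studies the reasoning ability of large language models (LLMs) over long code contexts, distinguishes lexical code recall from semantic code recall, and proposes the SemTrace technique to measure semantic recall. Reasoning accuracy drops significantly as a code snippet approaches the middle of the input context, and lexical and semantic recall rely on different mechanisms.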

Details

Motivation: Investigates LLM reasoning over long code contexts and its relation to recall ability, in particular the difference between lexical and semantic recall. Method: Proposes the SemTrace technique to measure semantic recall, quantifies semantic recall sensitivity in existing benchmarks, and evaluates state-of-the-art LLMs on code reasoning. Result: Code reasoning accuracy drops significantly when the snippet is near the middle of the context; lexical and semantic recall rely on different mechanisms; existing benchmarks may underestimate the challenges LLMs face. Conclusion: LLMs have limited semantic recall over long code contexts, and existing benchmarks need improvement to assess this ability more accurately. Abstract: Although modern Large Language Models (LLMs) support extremely large contexts, their effectiveness in utilizing long context for code reasoning remains unclear. This paper investigates LLM reasoning ability over code snippets within large repositories and how it relates to their recall ability. Specifically, we differentiate between lexical code recall (verbatim retrieval) and semantic code recall (remembering what the code does). To measure semantic recall, we propose SemTrace, a code reasoning technique where the impact of specific statements on output is attributable and unpredictable. We also present a method to quantify semantic recall sensitivity in existing benchmarks. Our evaluation of state-of-the-art LLMs reveals a significant drop in code reasoning accuracy as a code snippet approaches the middle of the input context, particularly with techniques requiring high semantic recall like SemTrace. Moreover, we find that lexical recall varies by granularity, with models excelling at function retrieval but struggling with line-by-line recall. Notably, a disconnect exists between lexical and semantic recall, suggesting different underlying mechanisms. Finally, our findings indicate that current code reasoning benchmarks may exhibit low semantic recall sensitivity, potentially underestimating LLM challenges in leveraging in-context information.

[391] What Prompts Don't Say: Understanding and Managing Underspecification in LLM Prompts

Chenyang Yang,Yike Shi,Qianou Ma,Michael Xieyang Liu,Christian Kästner,Tongshuang Wu

Main category: cs.CL

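TL;DR: The paper analyzes underspecification in LLM prompts, proposes a new requirements-aware prompt optimization method that improves performance by 4.8%, and argues for a broader requirements management process.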

Details

Motivation: When developers communicate with LLMs in natural language, prompts are often underspecified, which makes performance unstable, and existing methods do not address this effectively. Method: Analyzes the impact of prompt underspecification, proposes requirements-aware prompt optimization mechanisms, and tests their effect. Result: By default, LLMs correctly guess 41.1% of unspecified requirements, but not robustly; the new method improves performance by 4.8% on average. Conclusion: Managing prompt underspecification systematically requires combining requirements discovery, evaluation, and monitoring. Abstract: Building LLM-powered software requires developers to communicate their requirements through natural language, but developer prompts are frequently underspecified, failing to fully capture many user-important requirements. In this paper, we present an in-depth analysis of prompt underspecification, showing that while LLMs can often (41.1%) guess unspecified requirements by default, such behavior is less robust: Underspecified prompts are 2x more likely to regress over model or prompt changes, sometimes with accuracy drops by more than 20%. We then demonstrate that simply adding more requirements to a prompt does not reliably improve performance, due to LLMs' limited instruction-following capabilities and competing constraints, and standard prompt optimizers do not offer much help. To address this, we introduce novel requirements-aware prompt optimization mechanisms that can improve performance by 4.8% on average over baselines that naively specify everything in the prompt. Beyond prompt optimization, we envision that effectively managing prompt underspecification requires a broader process, including proactive requirements discovery, evaluation, and monitoring.

[392] Thinkless: LLM Learns When to Think

Gongfan Fang,Xinyin Ma,Xinchao Wang

Main category: cs.CL

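TL;DR: The Thinkless framework improves language model efficiency by adaptively choosing between short-form and long-form reasoning, reducing the use of long-chain reasoning by 50%-90%.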

Details

Motivation: Applying elaborate reasoning to every query makes language models inefficient; the paper asks whether a model can learn when to think. Method: Proposes Thinkless, a framework trained with reinforcement learning to select a reasoning mode, using the DeGRPO algorithm to decompose the learning objective. Result: Significantly reduces the use of long-chain reasoning on several benchmarks, improving efficiency. Conclusion: Thinkless effectively balances reasoning efficiency and accuracy and offers a new direction for optimizing language models. Abstract: Reasoning Language Models, capable of extended chain-of-thought reasoning, have demonstrated remarkable performance on tasks requiring complex logical inference. However, applying elaborate reasoning for all queries often results in substantial computational inefficiencies, particularly when many problems admit straightforward solutions. This motivates an open question: Can LLMs learn when to think? To answer this, we propose Thinkless, a learnable framework that empowers an LLM to adaptively select between short-form and long-form reasoning, based on both task complexity and the model's ability. Thinkless is trained under a reinforcement learning paradigm and employs two control tokens, one for concise responses and one for detailed reasoning. At the core of our method is a Decoupled Group Relative Policy Optimization (DeGRPO) algorithm, which decomposes the learning objective of hybrid reasoning into two components: (1) a control token loss that governs the selection of the reasoning mode, and (2) a response loss that improves the accuracy of the generated answers. This decoupled formulation enables fine-grained control over the contributions of each objective, stabilizing training and effectively preventing collapse observed in vanilla GRPO. Empirically, on several benchmarks such as Minerva Algebra, MATH-500, and GSM8K, Thinkless is able to reduce the usage of long-chain thinking by 50% - 90%, significantly improving the efficiency of Reasoning Language Models. The code is available at https://github.com/VainF/Thinkless
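
The decoupling described in the abstract can be illustrated with a minimal sketch. This is not the authors' DeGRPO implementation: it is a plain policy-gradient surrogate without clipping or group normalization, and the function name `decoupled_loss` and the weight `alpha` are hypothetical. It only shows the idea of scoring the control token and the response tokens separately so that their contributions can be balanced.

```python
def decoupled_loss(token_logps, advantage, alpha=1.0):
    """Split a sampled sequence's loss into control-token and response parts.

    token_logps[0] is assumed to be the log-probability of the control token
    (the reasoning-mode choice); the remaining entries belong to the response.
    `alpha` is a hypothetical weight on the control-token term.
    """
    if len(token_logps) < 2:
        raise ValueError("need a control token plus at least one response token")
    control_logp = token_logps[0]
    response_logps = token_logps[1:]
    # Mode selection: a single token, weighted independently of response length.
    control_loss = -advantage * control_logp
    # Answer quality: averaged over the response tokens only.
    response_loss = -advantage * sum(response_logps) / len(response_logps)
    total = alpha * control_loss + response_loss
    return total, control_loss, response_loss


total, control, response = decoupled_loss([-0.5, -1.0, -3.0], advantage=2.0, alpha=0.1)
print(total, control, response)
```

Because the control term is not averaged together with the response tokens, its weight does not shrink as responses grow longer; whether this matches the paper's exact normalization is an assumption.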

[393] R3: Robust Rubric-Agnostic Reward Models

David Anugraha,Zilu Tang,Lester James V. Miranda,Hanyang Zhao,Mohammad Rifqi Farhansyah,Garry Kuwanto,Derry Wijaya,Genta Indra Winata

Main category: cs.CL

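TL;DR: R3 is a novel reward modeling framework that addresses the lack of controllability and interpretability in existing approaches, supporting evaluation across dimensions and transparent scoring.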

Details

Motivation: Existing reward models lack controllability and interpretability, and their limited generalization makes them hard to adapt to diverse tasks. Method: Proposes the R3 framework, which is rubric-agnostic and supports multi-dimensional evaluation with interpretable score assignments. Result: R3 enables more transparent and flexible evaluation of language models and can align with diverse human values and use cases. Conclusion: R3 offers a more general and interpretable approach to reward modeling; the associated resources are open source. Abstract: Reward models are essential for aligning language model outputs with human preferences, yet existing approaches often lack both controllability and interpretability. These models are typically optimized for narrow objectives, limiting their generalizability to broader downstream tasks. Moreover, their scalar outputs are difficult to interpret without contextual reasoning. To address these limitations, we introduce R3, a novel reward modeling framework that is rubric-agnostic, generalizable across evaluation dimensions, and provides interpretable, reasoned score assignments. R3 enables more transparent and flexible evaluation of language models, supporting robust alignment with diverse human values and use cases. Our models, data, and code are available as open source at https://github.com/rubricreward/r3

[394] MR. Judge: Multimodal Reasoner as a Judge

Renjie Pi,Felix Bai,Qibin Chen,Simon Wang,Jiulong Shan,Kieran Liu,Meng Cao

Main category: cs.CL

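TL;DR: Proposes MR. Judge, a new paradigm that strengthens the judging ability of MLLMs through multimodal reasoning, significantly improving performance and interpretability.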

Details

Motivation: Using MLLMs as judges shows promise in RLHF and inference-time scaling, but such judges lack reasoning ability and annotated data. Method: Formulates judging as a multiple-choice reasoning problem and annotates data automatically through reverse response candidate synthesis and text-based reasoning extraction. Result: MR. Judge-7B surpasses GPT-4o by 9.9% on VL-RewardBench and improves performance on MM-Vet by up to 7.7%. Conclusion: Reasoning-enhanced judging significantly improves MLLM judges and applies across a wide range of tasks. Abstract: The paradigm of using Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) as evaluative judges has emerged as an effective approach in RLHF and inference-time scaling. In this work, we propose Multimodal Reasoner as a Judge (MR. Judge), a paradigm for empowering general-purpose MLLM judges with strong reasoning capabilities. Instead of directly assigning scores for each response, we formulate the judgement process as a reasoning-inspired multiple-choice problem. Specifically, the judge model first conducts deliberate reasoning covering different aspects of the responses and eventually selects the best response from them. This reasoning process not only improves the interpretability of the judgement, but also greatly enhances the performance of MLLM judges. To cope with the lack of questions with scored responses, we propose the following strategy to achieve automatic annotation: 1) Reverse Response Candidates Synthesis: starting from a supervised fine-tuning (SFT) dataset, we treat the original response as the best candidate and prompt the MLLM to generate plausible but flawed negative candidates. 2) Text-based reasoning extraction: we carefully design a data synthesis pipeline for distilling the reasoning capability from a text-based reasoning model, which is adopted to enable the MLLM judges to regain complex reasoning ability via warm-up supervised fine-tuning. Experiments demonstrate that our MR. Judge is effective across a wide range of tasks. Specifically, our MR. Judge-7B surpasses GPT-4o by 9.9% on VL-RewardBench, and improves performance on MM-Vet during inference-time scaling by up to 7.7%.

[395] Granary: Speech Recognition and Translation Dataset in 25 European Languages

Nithin Rao Koluguri,Monica Sekoyan,George Zelenfroynd,Sasha Meister,Shuoyang Ding,Sofia Kostandian,He Huang,Nikolay Karpov,Jagadeesh Balam,Vitaly Lavrukhin,Yifan Peng,Sara Papi,Marco Gaido,Alessio Brutti,Boris Ginsburg

Main category: cs.CL

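TL;DR: Granary is a large-scale speech dataset covering 25 European languages for recognition and translation. Pseudo-labeling and data filtering improve its quality, and models trained on it are data-efficient while performing close to conventional approaches.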

Details

Motivation: To address the scarcity of speech processing data for low-resource languages. Method: Builds a high-quality dataset using a pseudo-labeling pipeline (segmentation, two-pass inference, hallucination filtering, and punctuation restoration) followed by data filtering. Result: Models reach similar performance with about 50% less data. Conclusion: Granary provides an efficient, high-quality resource for low-resource speech processing. Abstract: Multi-task and multilingual approaches benefit large models, yet speech processing for low-resource languages remains underexplored due to data scarcity. To address this, we present Granary, a large-scale collection of speech datasets for recognition and translation across 25 European languages. This is the first open-source effort at this scale for both transcription and translation. We enhance data quality using a pseudo-labeling pipeline with segmentation, two-pass inference, hallucination filtering, and punctuation restoration. We further generate translation pairs from pseudo-labeled transcriptions using EuroLLM, followed by a data filtration pipeline. Designed for efficiency, our pipeline processes vast amounts of data within hours. We assess models trained on processed data by comparing their performance on previously curated datasets for both high- and low-resource languages. Our findings show that these models achieve similar performance using approx. 50% less data. The dataset will be made available at https://hf.co/datasets/nvidia/Granary

[396] AdaptThink: Reasoning Models Can Learn When to Think

Jiajie Zhang,Nianyi Lin,Lei Hou,Ling Feng,Juanzi Li

Main category: cs.CL

TL;DR: AdaptThink is a novel RL algorithm that optimizes the efficiency and performance of reasoning models by adaptively choosing a thinking mode (Thinking or NoThinking).

Details

Motivation: The lengthy thinking process of existing large reasoning models makes inference expensive, so efficiency becomes a bottleneck. The authors find that for simple tasks, generating the solution directly (NoThinking) is better in both performance and efficiency. Method: Proposes the AdaptThink algorithm, which combines a constrained optimization objective (encouraging NoThinking while maintaining performance) with an importance sampling strategy (balancing samples from the two thinking modes). Result: Experiments show that AdaptThink significantly reduces inference cost while improving performance, for example cutting response length by 53% and raising accuracy by 2.4%. Conclusion: Adaptive thinking-mode selection effectively balances reasoning quality and efficiency and has broad potential. Abstract: Recently, large reasoning models have achieved impressive performance on various tasks by employing human-like deep thinking. However, the lengthy thinking process substantially increases inference overhead, making efficiency a critical bottleneck. In this work, we first demonstrate that NoThinking, which prompts the reasoning model to skip thinking and directly generate the final solution, is a better choice for relatively simple tasks in terms of both performance and efficiency. Motivated by this, we propose AdaptThink, a novel RL algorithm to teach reasoning models to choose the optimal thinking mode adaptively based on problem difficulty. Specifically, AdaptThink features two core components: (1) a constrained optimization objective that encourages the model to choose NoThinking while maintaining the overall performance; (2) an importance sampling strategy that balances Thinking and NoThinking samples during on-policy training, thereby enabling cold start and allowing the model to explore and exploit both thinking modes throughout the training process. Our experiments indicate that AdaptThink significantly reduces the inference costs while further enhancing performance. Notably, on three math datasets, AdaptThink reduces the average response length of DeepSeek-R1-Distill-Qwen-1.5B by 53% and improves its accuracy by 2.4%, highlighting the promise of adaptive thinking-mode selection for optimizing the balance between reasoning quality and efficiency. Our code and models are available at https://github.com/THU-KEG/AdaptThink.

[397] Dementia Through Different Eyes: Explainable Modeling of Human and LLM Perceptions for Early Awareness

Lotem Peled-Cohen,Maya Zadok,Nitay Calderon,Hila Gonen,Roi Reichart

Main category: cs.CL

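TL;DR: The study examines how non-experts and LLMs perceive dementia through language, finding that humans rely on limited and potentially misleading cues while LLMs use richer features that align more closely with clinical patterns.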

Details

Motivation: Early language changes in dementia are often noticed first by non-experts, but the basis for their judgments may be inaccurate, so it is worth studying how to improve them. Method: Non-experts and LLMs make intuitive judgments on transcribed picture descriptions; an explainable method extracts features, and the perceptions are then modeled. Result: Human judgments are inconsistent and rely on limited cues; LLMs are closer to clinical patterns; both tend to miss dementia cases. Conclusion: The explainable framework can help non-experts recognize the linguistic signs of dementia more accurately. Abstract: Cognitive decline often surfaces in language years before diagnosis. It is frequently non-experts, such as those closest to the patient, who first sense a change and raise concern. As LLMs become integrated into daily communication and used over prolonged periods, it may even be an LLM that notices something is off. But what exactly do they notice--and should be noticing--when making that judgment? This paper investigates how dementia is perceived through language by non-experts. We presented transcribed picture descriptions to non-expert humans and LLMs, asking them to intuitively judge whether each text was produced by someone healthy or with dementia. We introduce an explainable method that uses LLMs to extract high-level, expert-guided features representing these picture descriptions, and use logistic regression to model human and LLM perceptions and compare with clinical diagnoses. Our analysis reveals that human perception of dementia is inconsistent and relies on a narrow, and sometimes misleading, set of cues. LLMs, by contrast, draw on a richer, more nuanced feature set that aligns more closely with clinical patterns. Still, both groups show a tendency toward false negatives, frequently overlooking dementia cases. Through our interpretable framework and the insights it provides, we hope to help non-experts better recognize the linguistic signs that matter.

[398] SMOTExT: SMOTE meets Large Language Models

Mateusz Bystroński,Mikołaj Hołysz,Grzegorz Piotrowski,Nitesh V. Chawla,Tomasz Kajdanowicz

Main category: cs.CL

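TL;DR: SMOTExT is a novel text data augmentation method that generates synthetic text by interpolating BERT embeddings and decoding them with the xRAG architecture, addressing data scarcity and class imbalance and showing potential for privacy-preserving machine learning.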

Details

Motivation: To address data scarcity and class imbalance in NLP, especially in specialized domains or low-resource settings. Method: Following the idea of SMOTE, generates new synthetic examples by interpolating between BERT embeddings and decodes the resulting latent vectors into coherent text with the xRAG architecture. Result: Preliminary qualitative results suggest potential for knowledge distillation and data augmentation in few-shot settings, and promising behavior in privacy-preserving machine learning. Conclusion: SMOTExT offers a viable approach to data scarcity and privacy protection; its effectiveness remains to be validated further. Abstract: Data scarcity and class imbalance are persistent challenges in training robust NLP models, especially in specialized domains or low-resource settings. We propose a novel technique, SMOTExT, that adapts the idea of Synthetic Minority Over-sampling (SMOTE) to textual data. Our method generates new synthetic examples by interpolating between BERT-based embeddings of two existing examples and then decoding the resulting latent point into text with the xRAG architecture. By leveraging xRAG's cross-modal retrieval-generation framework, we can effectively turn interpolated vectors into coherent text. While this is preliminary work supported by qualitative outputs only, the method shows strong potential for knowledge distillation and data augmentation in few-shot settings. Notably, our approach also shows promise for privacy-preserving machine learning: in early experiments, training models solely on generated data achieved comparable performance to models trained on the original dataset. This suggests a viable path toward safe and effective learning under data protection constraints.
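
The interpolation step can be sketched in a few lines. The sketch below is only the SMOTE-style part, written under assumptions: embeddings are plain lists of floats, pairs are drawn uniformly at random from the minority class (classic SMOTE instead picks one of the k nearest neighbors), and the function names are hypothetical. Decoding the interpolated vector back into text with xRAG is outside its scope.

```python
import random


def interpolate_embeddings(a, b, lam):
    """Return the point a + lam * (b - a) on the segment between a and b."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [x + lam * (y - x) for x, y in zip(a, b)]


def synthesize(minority_embeddings, n_new, seed=0):
    """Create n_new synthetic embeddings from random pairs of minority examples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority_embeddings, 2)  # two distinct examples
        synthetic.append(interpolate_embeddings(a, b, rng.random()))
    return synthetic


print(interpolate_embeddings([0.0, 2.0], [4.0, 6.0], 0.25))  # [1.0, 3.0]
```

Each synthetic vector lies on the line segment between two real examples, which is what keeps the generated points inside the region already occupied by the minority class.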

[399] ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models

Liyan Tang,Grace Kim,Xinyu Zhao,Thom Lake,Wenxuan Ding,Fangcong Yin,Prasann Singhal,Manya Wadhwa,Zeyu Leo Liu,Zayne Sprague,Ramya Namuduri,Bodun Hu,Juan Diego Rodriguez,Puyuan Peng,Greg Durrett

Main category: cs.CL

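TL;DR: Current large vision-language models (LVLMs) perform poorly on chart understanding, especially on visual reasoning. Using a synthetic dataset and the new ChartMuseum benchmark, the authors reveal a significant gap between model and human performance.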

Details

Motivation: Chart understanding requires combining sophisticated textual and visual reasoning, but current LVLMs are unbalanced across the two, with a clear weakness in visual reasoning. Method: Evaluates models on a synthetic dataset and builds the ChartMuseum benchmark, which contains 1,162 expert-annotated questions spanning multiple reasoning types. Result: Humans reach 93% accuracy, while the best model, Gemini-2.5-Pro, reaches only 63.0% and the open-source Qwen2.5-VL-72B-Instruct only 38.5%. On visual reasoning questions, model performance drops by 35%-55%. Conclusion: Current LVLMs fall well short on visual reasoning, and ChartMuseum provides an important tool for future research. Abstract: Chart understanding presents a unique challenge for large vision-language models (LVLMs), as it requires the integration of sophisticated textual and visual reasoning capabilities. However, current LVLMs exhibit a notable imbalance between these skills, falling short on visual reasoning that is difficult to perform in text. We conduct a case study using a synthetic dataset solvable only through visual reasoning and show that model performance degrades significantly with increasing visual complexity, while human performance remains robust. We then introduce ChartMuseum, a new Chart Question Answering (QA) benchmark containing 1,162 expert-annotated questions spanning multiple reasoning types, curated from real-world charts across 184 sources, specifically built to evaluate complex visual and textual reasoning. Unlike prior chart understanding benchmarks -- where frontier models perform similarly and near saturation -- our benchmark exposes a substantial gap between model and human performance, while effectively differentiating model capabilities: although humans achieve 93% accuracy, the best-performing model Gemini-2.5-Pro attains only 63.0%, and the leading open-source LVLM Qwen2.5-VL-72B-Instruct achieves only 38.5%. Moreover, on questions requiring primarily visual reasoning, all models experience a 35%-55% performance drop from text-reasoning-heavy question performance. Lastly, our qualitative error analysis reveals specific categories of visual reasoning that are challenging for current LVLMs.

[400] CIE: Controlling Language Model Text Generations Using Continuous Signals

Vinay Samuel,Harshita Diddee,Yiming Zhang,Daphne Ippolito

Main category: cs.CL

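TL;DR: The paper proposes controlling language model generations precisely through continuous control signals (such as interpolated vectors), particularly for response length, and reports better results than existing discrete-signal or in-context learning methods.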

Details

Motivation: Existing methods control language model generations through natural language prompts or discrete signals, which are inflexible and hard to scale. A continuous control signal is therefore needed to regulate attributes of the generated text (such as length, complexity, and sentiment) more reliably. Method: Fine-tunes a language model so that continuous control signals (such as interpolated vectors) steer its generation behavior, with response length as the case study. Result: Experiments show that the method controls response length more reliably than in-context learning or discrete-signal methods. Conclusion: Continuous control signals offer a more flexible and scalable way to control language model generations precisely. Abstract: Aligning language models with user intent is becoming increasingly relevant to enhance user experience. This calls for designing methods that can allow users to control the properties of the language that LMs generate. For example, controlling the length of the generation, the complexity of the language that gets chosen, the sentiment, tone, etc. Most existing work attempts to integrate users' control by conditioning LM generations on natural language prompts or discrete control signals, which are often brittle and hard to scale. In this work, we are interested in continuous control signals, ones that exist along a spectrum that can't easily be captured in a natural language prompt or via existing techniques in conditional generation. Through a case study in controlling the precise response-length of generations produced by LMs, we demonstrate how after fine-tuning, behaviors of language models can be controlled via continuous signals -- as vectors that are interpolated between a "low" and a "high" token embedding. Our method more reliably exerts response-length control than in-context learning methods or fine-tuning methods that represent the control signal as a discrete signal. Our full open-sourced code and datasets are available at https://github.com/vsamuel2003/CIE.
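
A minimal sketch of such a continuous signal, assuming the control vector is a linear blend of a "low" and a "high" embedding with the blend weight taken from the requested length. The length range (`min_len`, `max_len`), the clamping, and the function name are hypothetical choices made for illustration; the released code may map lengths to weights differently.

```python
def control_embedding(target_len, low_emb, high_emb, min_len=0, max_len=1000):
    """Blend two embeddings according to where target_len falls in [min_len, max_len]."""
    if max_len <= min_len:
        raise ValueError("max_len must exceed min_len")
    if len(low_emb) != len(high_emb):
        raise ValueError("embeddings must have the same dimension")
    t = (target_len - min_len) / (max_len - min_len)
    t = min(1.0, max(0.0, t))  # clamp out-of-range requests to the endpoints
    return [(1.0 - t) * lo + t * hi for lo, hi in zip(low_emb, high_emb)]


low, high = [0.0, 4.0], [8.0, 0.0]
print(control_embedding(250, low, high))  # [2.0, 3.0]
```

In a fine-tuned model, a vector like this would be supplied alongside the prompt's token embeddings, so that any length in the range maps to its own point on the segment rather than to one of a few discrete buckets.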

math.OC [Back]

[401] Deep Unrolled Meta-Learning for Multi-Coil and Multi-Modality MRI with Adaptive Optimization

Merham Fouladvand,Peuroly Batra

Main category: math.OC

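TL;DR: Proposes a unified deep meta-learning framework for accelerated MRI that jointly addresses multi-coil reconstruction and cross-modality synthesis.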

Details

Motivation: Conventional methods are limited in handling undersampled data and missing modalities, so a more effective approach is needed. Method: Unrolls a provably convergent optimization algorithm into a structured neural network architecture and combines it with meta-learning to adapt quickly to different sampling patterns and modality combinations. Result: On open-source datasets, PSNR and SSIM improve significantly over conventional supervised learning, especially under aggressive undersampling and domain shift. Conclusion: The method combines unrolled optimization, task-aware meta-learning, and modality fusion, offering a scalable and generalizable solution for clinical MRI reconstruction. Abstract: We propose a unified deep meta-learning framework for accelerated magnetic resonance imaging (MRI) that jointly addresses multi-coil reconstruction and cross-modality synthesis. Motivated by the limitations of conventional methods in handling undersampled data and missing modalities, our approach unrolls a provably convergent optimization algorithm into a structured neural network architecture. Each phase of the network mimics a step of an adaptive forward-backward scheme with extrapolation, enabling the model to incorporate both data fidelity and nonconvex regularization in a principled manner. To enhance generalization across different acquisition settings, we integrate meta-learning, which enables the model to rapidly adapt to unseen sampling patterns and modality combinations using task-specific meta-knowledge. The proposed method is evaluated on open-source datasets, showing significant improvements in PSNR and SSIM over conventional supervised learning, especially under aggressive undersampling and domain shifts. Our results demonstrate the synergy of unrolled optimization, task-aware meta-learning, and modality fusion, offering a scalable and generalizable solution for real-world clinical MRI reconstruction.
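
For readers unfamiliar with the scheme being unrolled, here is a minimal sketch of a forward-backward (proximal gradient) iteration with extrapolation. It is a stand-in under stated assumptions, not the paper's algorithm: the step size and momentum `beta` are fixed rather than adaptive, the regularizer is a convex L1 term instead of a learned nonconvex one, and the problem is a tiny dense least-squares system rather than multi-coil MRI. All names are hypothetical.

```python
def soft_threshold(v, tau):
    """Proximal operator of tau * ||x||_1, applied element-wise."""
    return [max(abs(x) - tau, 0.0) * (1.0 if x > 0 else -1.0) for x in v]


def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def forward_backward(A, y, lam, step, beta=0.5, n_iter=200):
    """Minimize 0.5 * ||A x - y||^2 + lam * ||x||_1."""
    At = [list(col) for col in zip(*A)]  # transpose of A
    x = [0.0] * len(A[0])
    x_prev = x
    for _ in range(n_iter):
        # Extrapolation: move past x along the previous direction of travel.
        z = [xi + beta * (xi - pi) for xi, pi in zip(x, x_prev)]
        # Forward step: gradient of the data-fidelity term at z.
        residual = [r - yi for r, yi in zip(matvec(A, z), y)]
        grad = matvec(At, residual)
        # Backward step: proximal map of the regularizer.
        x_prev = x
        x = soft_threshold([zi - step * gi for zi, gi in zip(z, grad)], step * lam)
    return x


A = [[2.0, 0.0], [0.0, 1.0]]
print(forward_backward(A, y=[4.0, 0.2], lam=1.0, step=0.25))  # [1.75, 0.0]
```

In an unrolled network, each pass through the loop body would become one phase with its own learnable step size, momentum, and regularizer; the sketch keeps them constant to show only the structure of a single step.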

econ.GN [Back]

[402] Vague Knowledge: Evidence from Analyst Reports

Kerry Xiao,Amy Zang

Main category: econ.GN

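TL;DR: The paper examines the role of language in conveying vague information and finds that the textual tone of analyst reports predicts forecast errors and subsequent revisions of numerical forecasts.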

Details

Motivation: In the real world, people's vague knowledge of future payoffs is hard to quantify, and language plays an important role in conveying it. Method: Analyzes the relation between the textual tone of analyst reports and their numerical forecasts to study the role of language in expectations. Result: Textual tone predicts forecast errors and revisions, and the relation is stronger when the language is vaguer, uncertainty is higher, or analysts are busier. Conclusion: Some useful information is communicated only through language, which gives language a distinct value in conveying vague knowledge. Abstract: People in the real world often possess vague knowledge of future payoffs, for which quantification is not feasible or desirable. We argue that language, with differing ability to convey vague information, plays an important but lesser-known role in subjective expectations. Empirically, we find that in their reports, analysts include useful information in linguistic expressions but not numerical forecasts. Specifically, the textual tone of analyst reports has predictive power for forecast errors and subsequent revisions in numerical forecasts, and this relation becomes stronger when analysts' language is vaguer, when uncertainty is higher, and when analysts are busier. Overall, our theory and evidence suggest that some useful information is vaguely known and only communicated through language.

cs.MA [Back]

[403] IG Parser: A Software Package for the Encoding of Institutional Statements using the Institutional Grammar

Christopher K. Frantz

Main category: cs.MA

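TL;DR: IG Parser is software that supports qualitative content analysis by parsing formal or informal rules and strategies and converting them into various formats for downstream analysis.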

Details

Motivation: To give social science researchers a tool for systematically analyzing and encoding the institutions that govern social systems. Method: Uses a distinctive syntax (IG Script) and automated transformation, based on the conceptual framework of Institutional Grammar 2.0. Result: IG Parser encodes natural language efficiently and supports a range of analytical techniques. Conclusion: IG Parser is a powerful tool for institutional analysis, and its syntax and architecture give it broad potential in social science research. Abstract: This article provides an overview of IG Parser, software that facilitates qualitative content analysis of formal (e.g., legal) rules or informal (e.g., socio-normative) norms, and strategies (such as conventions) -- referred to as institutions -- that govern social systems and operate configurally to describe institutional systems. To this end, the IG Parser employs a distinctive syntax that ensures rigorous encoding of natural language, while automating the transformation into various formats that support the downstream analysis using diverse analytical techniques. The conceptual core of the IG Parser is an associated syntax, IG Script, that operationalizes the conceptual foundations of the Institutional Grammar, and more specifically Institutional Grammar 2.0, an analytical paradigm for institutional analysis. This article presents the IG Parser, including its conceptual foundations, syntactic specification of IG Script, alongside architectural principles. This introduction is augmented with selected illustrative examples that highlight the use and benefit associated with the tool.

cs.HC [Back]

[404] Behind the Screens: Uncovering Bias in AI-Driven Video Interview Assessments Using Counterfactuals

Dena F. Mujtaba,Nihar R. Mahapatra

Main category: cs.HC

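TL;DR: The paper proposes a counterfactual-based framework for evaluating and quantifying bias in AI-driven personality assessments, using GANs to generate counterfactual representations and supporting multimodal fairness analysis.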

Details

Motivation: AI-based personality assessment can amplify biases in training data and lead to discriminatory outcomes based on gender, ethnicity, and age, so a method is needed to evaluate and address these biases systematically. Method: Uses generative adversarial networks (GANs) to generate counterfactual representations of job applicants by altering protected attributes, enabling fairness analysis across modalities (visual, audio, and textual). Result: Applied to a state-of-the-art personality prediction model, the method reveals significant disparities across demographic groups; a protected attribute classifier confirms the effectiveness of the counterfactual generation. Conclusion: The framework provides a scalable tool for fairness auditing of commercial AI hiring platforms, especially in black-box settings, and underlines the importance of counterfactual approaches for ethical transparency in affective computing. Abstract: AI-enhanced personality assessments are increasingly shaping hiring decisions, using affective computing to predict traits from the Big Five (OCEAN) model. However, integrating AI into these assessments raises ethical concerns, especially around bias amplification rooted in training data. These biases can lead to discriminatory outcomes based on protected attributes like gender, ethnicity, and age. To address this, we introduce a counterfactual-based framework to systematically evaluate and quantify bias in AI-driven personality assessments. Our approach employs generative adversarial networks (GANs) to generate counterfactual representations of job applicants by altering protected attributes, enabling fairness analysis without access to the underlying model. Unlike traditional bias assessments that focus on unimodal or static data, our method supports multimodal evaluation spanning visual, audio, and textual features. This comprehensive approach is particularly important in high-stakes applications like hiring, where third-party vendors often provide AI systems as black boxes. Applied to a state-of-the-art personality prediction model, our method reveals significant disparities across demographic groups. We also validate our framework using a protected attribute classifier to confirm the effectiveness of our counterfactual generation. This work provides a scalable tool for fairness auditing of commercial AI hiring platforms, especially in black-box settings where training data and model internals are inaccessible. Our results highlight the importance of counterfactual approaches in improving ethical transparency in affective computing.

cs.RO [Back]

Pengdi Huang,Mingyang Wang,Huan Tian,Minglun Gong,Hao Zhang,Hui Huang

Main category: cs.RO

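TL;DR: Proposes a LiDAR-based real-time spatial reasoning framework for surface reconstruction and navigation in dynamic environments, modeled on the biological positioning system.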

Details

Motivation: To study how mobile robots can achieve biological-like spatial sensing and navigation in dynamic environments. Method: Combines real-time single-frame mesh reconstruction with robot navigation assistance, using LiDAR line-of-sight vectors for surface normal estimation and free-space updates. Result: Experiments on synthetic and real scenes confirm advantages in speed and quality, and the paper demonstrates real-time 3D reconstruction and autonomous navigation. Conclusion: The framework performs well in dynamic environments and provides an efficient solution for robot navigation and scene reconstruction. Abstract: Our brain has an inner global positioning system which enables us to sense and navigate 3D spaces in real time. Can mobile robots replicate such a biological feat in a dynamic environment? We introduce the first spatial reasoning framework for real-time surface reconstruction and navigation that is designed for outdoor LiDAR scanning data captured by ground mobile robots and capable of handling moving objects such as pedestrians. Our reconstruction-based approach is well aligned with the critical cellular functions performed by the border vector cells (BVCs) over all layers of the medial entorhinal cortex (MEC) for surface sensing and tracking. To address the challenges arising from blurred boundaries resulting from sparse single-frame LiDAR points and outdated data due to object movements, we integrate real-time single-frame mesh reconstruction, via visibility reasoning, with robot navigation assistance through on-the-fly 3D free space determination. This enables continuous and incremental updates of the scene and free space across multiple frames. Key to our method is the utilization of line-of-sight (LoS) vectors from LiDAR, which enable real-time surface normal estimation, as well as robust and instantaneous per-voxel free space updates. We showcase two practical applications: real-time 3D scene reconstruction and autonomous outdoor robot navigation in real-world conditions. Comprehensive experiments on both synthetic and real scenes highlight our method's superiority in speed and quality over existing real-time LiDAR processing approaches.
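
The idea of a per-voxel free-space update from a line-of-sight vector can be sketched as follows. This is an illustrative assumption about how such an update might look, not the paper's method: the ray is sampled at half-voxel intervals instead of traversed exactly, the grid is a plain dictionary, and the names `voxel_of` and `update_free_space` are hypothetical.

```python
import math


def voxel_of(point, voxel_size):
    """Integer index of the voxel that contains a 3D point."""
    return tuple(math.floor(c / voxel_size) for c in point)


def update_free_space(grid, sensor, hit, voxel_size=1.0):
    """Mark voxels between the sensor and a LiDAR return as free, the return as occupied."""
    hit_voxel = voxel_of(hit, voxel_size)
    length = math.dist(sensor, hit)
    # Sample at roughly half-voxel spacing; corner voxels can be skipped,
    # so an exact grid traversal would be needed in practice.
    steps = max(1, int(length / (0.5 * voxel_size)))
    for i in range(steps):
        t = i / steps
        p = tuple(s + t * (h - s) for s, h in zip(sensor, hit))
        v = voxel_of(p, voxel_size)
        if v != hit_voxel:
            grid[v] = "free"  # the beam passed through, so nothing is there now
    grid[hit_voxel] = "occupied"
    return grid


# A voxel left "occupied" by a pedestrian who has since moved is cleared
# as soon as a later beam passes through it.
grid = {(1, 0, 0): "occupied"}
update_free_space(grid, sensor=(0.25, 0.5, 0.5), hit=(3.25, 0.5, 0.5))
print(grid)
```

Clearing stale occupancy in this way is one reading of how line-of-sight reasoning addresses outdated data from moving objects; the paper's actual update rule may differ.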

[406] Bridging Human Oversight and Black-box Driver Assistance: Vision-Language Models for Predictive Alerting in Lane Keeping Assist Systems

Yuhang Wang,Hao Zhou

Main category: cs.RO

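TL;DR: LKAlert is a novel VLM-based supervisory alert system that forecasts LKA risk and provides natural language explanations, strengthening driver trust in and situational awareness of LKA systems.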

Details

Motivation: To address the unpredictable failures that stem from the black-box nature of LKA systems and to improve driver trust in, and effective oversight of, automated assistance. Method: Combines a VLM with an interpretable model to process multimodal data (video and CAN data) and generate predictive alerts with explanatory text. Result: Predicts LKA failures with 69.8% accuracy and a 58.6% F1-score, produces high-quality explanations (71.7 ROUGE-L), and runs in real time (about 2 Hz). Conclusion: LKAlert improves the safety and usability of ADAS and offers a scalable paradigm for human oversight of black-box automation. Abstract: Lane Keeping Assist systems, while increasingly prevalent, often suffer from unpredictable real-world failures, largely due to their opaque, black-box nature, which limits driver anticipation and trust. To bridge the gap between automated assistance and effective human oversight, we present LKAlert, a novel supervisory alert system that leverages a VLM to forecast potential LKA risk 1-3 seconds in advance. LKAlert processes dash-cam video and CAN data, integrating surrogate lane segmentation features from a parallel interpretable model as automated guiding attention. Unlike traditional binary classifiers, LKAlert issues both a predictive alert and a concise natural language explanation, enhancing driver situational awareness and trust. To support the development and evaluation of such systems, we introduce OpenLKA-Alert, the first benchmark dataset designed for predictive and explainable LKA failure warnings. It contains synchronized multimodal inputs and human-authored justifications across annotated temporal windows. We further contribute a generalizable methodological framework for VLM-based black-box behavior prediction, combining surrogate feature guidance with LoRA. This framework enables the VLM to reason over structured visual context without altering its vision backbone, making it broadly applicable to other complex, opaque systems requiring interpretable oversight. Empirically, the system correctly predicts upcoming LKA failures with 69.8% accuracy and a 58.6% F1-score. It also generates high-quality textual explanations for drivers (71.7 ROUGE-L) and operates efficiently at approximately 2 Hz, confirming its suitability for real-time, in-vehicle use. Our findings establish LKAlert as a practical solution for enhancing the safety and usability of current ADAS and offer a scalable paradigm for applying VLMs to human-centered supervision of black-box automation.

[407] GLOVER++: Unleashing the Potential of Affordance Learning from Human Behaviors for Robotic Manipulation

Teli Ma,Jia Zheng,Zifan Wang,Ziyao Gao,Jiaming Zhou,Junwei Liang

Main category: cs.RO

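TL;DR: The paper introduces the HOVA-500K dataset and the GLOVER++ framework for learning robotic manipulation skills from human demonstration videos, addressing the shortage and limited diversity of data and performing strongly on open-vocabulary reasoning tasks.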

Details

Motivation: Learning robotic manipulation skills from human demonstration videos is promising, but large-scale datasets with precise annotations are lacking and diverse manipulation contexts remain underexplored. Method: Introduces the HOVA-500K dataset (500,000 images, 1,726 object categories, 675 actions) and the GLOVER++ framework (global-to-local affordance training). Result: GLOVER++ achieves state-of-the-art results on the HOVA-500K benchmark and generalizes strongly across diverse robotic manipulation tasks. Conclusion: HOVA-500K and GLOVER++ are valuable resources for bridging human demonstrations and robotic manipulation capabilities. Abstract: Learning manipulation skills from human demonstration videos offers a promising path toward generalizable and interpretable robotic intelligence, particularly through the lens of actionable affordances. However, transferring such knowledge remains challenging due to: 1) a lack of large-scale datasets with precise affordance annotations, and 2) insufficient exploration of affordances in diverse manipulation contexts. To address these gaps, we introduce HOVA-500K, a large-scale, affordance-annotated dataset comprising 500,000 images across 1,726 object categories and 675 actions. We also release a standardized benchmarking suite for multi-modal affordance reasoning. Built upon HOVA-500K, we present GLOVER++, a global-to-local affordance training framework that effectively transfers actionable affordance knowledge from human demonstrations to downstream open-vocabulary reasoning tasks. GLOVER++ achieves state-of-the-art results on the HOVA-500K benchmark and demonstrates strong generalization across diverse downstream robotic manipulation tasks. By explicitly modeling actionable affordances, GLOVER++ facilitates robust transfer across scenes, modalities, and tasks. We hope that HOVA-500K and the GLOVER++ framework will serve as valuable resources for bridging the gap between human demonstrations and robotic manipulation capabilities.

[408] Experimental Study on Automatically Assembling Custom Catering Packages With a 3-DOF Delta Robot Using Deep Learning Methods

Reihaneh Yourdkhani,Arash Tavoosian,Navid Asadi Khomami,Mehdi Tale Masouleh

Main category: cs.RO

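TL;DR: The paper presents a deep-learning-based method for automated catering packaging using a Delta parallel robot with a two-fingered gripper. YOLOV5 and FastSAM handle object detection and segmentation, a geometric method computes grasp points, and the experimental success rate exceeds 80%.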

Details

Motivation: To solve object detection and grasping in automated catering packaging and make the robotic system more practical and efficient. Method: Uses YOLOV5 for object detection and FastSAM for segmentation, computes grasp points with a geometric method, and passes the information to a Delta parallel robot that completes the packing. Result: Experiments validate the algorithm, which achieves real-time detection and fully autonomous packing with a grasp success rate above 80%. Conclusion: The study advances robotic applications in packaging automation and shows the potential of combining deep learning with robotics. Abstract: This paper introduces a pioneering experimental study on the automated packing of a catering package using a two-fingered gripper affixed to a 3-degree-of-freedom Delta parallel robot. A distinctive contribution lies in the application of a deep learning approach to tackle this challenge. A custom dataset, comprising 1,500 images, is meticulously curated for this endeavor, representing a noteworthy initiative as the first dataset focusing on Persian-manufactured products. The study employs the YOLOV5 model for object detection, followed by segmentation using the FastSAM model. Subsequently, rotation angle calculation is facilitated with segmentation masks, and a rotated rectangle encapsulating the object is generated. This rectangle forms the basis for calculating two grasp points using a novel geometrical approach involving eigenvectors. An extensive experimental study validates the proposed model, where all pertinent information is seamlessly transmitted to the 3-DOF Delta parallel robot. The proposed algorithm ensures real-time detection, calibration, and the fully autonomous packing process of a catering package, with a success rate of over 80% in automatic grasping. This study marks a significant stride in advancing the capabilities of robotic systems for practical applications in packaging automation.
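
The abstract does not spell out the geometric construction, so the sketch below shows only one plausible reading and should not be taken as the authors' method. It assumes the mask is given as a list of pixel coordinates, takes the object's orientation from the principal axis of the coordinate covariance (the eigenvector with the largest eigenvalue), and places the two grasp points at the extremes of the object along the perpendicular minor axis through the centroid, so the gripper closes across the narrow side. The function name is hypothetical.

```python
import math


def grasp_from_mask(points):
    """Return (angle, grasp_point_1, grasp_point_2) for a list of (x, y) mask pixels."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Entries of the 2x2 covariance matrix of the pixel coordinates.
    sxx = sum((x - cx) ** 2 for x, _ in points) / n
    syy = sum((y - cy) ** 2 for _, y in points) / n
    sxy = sum((x - cx) * (y - cy) for x, y in points) / n
    # Closed-form angle of the major eigenvector of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    minor = (-math.sin(theta), math.cos(theta))  # unit vector of the minor axis
    # Extent of the object along the minor axis, measured from the centroid.
    proj = [(x - cx) * minor[0] + (y - cy) * minor[1] for x, y in points]
    lo, hi = min(proj), max(proj)
    p1 = (cx + lo * minor[0], cy + lo * minor[1])
    p2 = (cx + hi * minor[0], cy + hi * minor[1])
    return theta, p1, p2


# A 5 x 3 axis-aligned block of pixels: the long side lies along x.
mask = [(x, y) for x in range(5) for y in range(3)]
theta, p1, p2 = grasp_from_mask(mask)
print(theta, p1, p2)
```

For the block above, the major axis is horizontal, so the two grasp points sit directly below and above the centroid on the short side of the object.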

[409] Emergent Active Perception and Dexterity of Simulated Humanoids from Visual Reinforcement Learning

Zhengyi Luo,Chen Tessler,Toru Lin,Ye Yuan,Tairan He,Wenli Xiao,Yunrong Guo,Gal Chechik,Kris Kitani,Linxi Fan,Yuke Zhu

Main category: cs.RO

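TL;DR: The PDC framework performs vision-driven dexterous whole-body control, using reinforcement learning to train a single policy that completes multiple household tasks without privileged state information.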

Details

Motivation: Human behavior is driven by visual perception, which motivates a vision-based dexterous control system that mimics human behavior. Method: Proposes the Perceptive Dexterous Control (PDC) framework, which specifies tasks through egocentric vision alone and trains the policy with reinforcement learning. Result: PDC completes multiple tasks, including search, grasping, and placement, and exhibits emergent behaviors such as active search. Conclusion: Vision-driven control is key to closing the perception-action loop, with applications in animation, robotics, and embodied AI. Abstract: Human behavior is fundamentally shaped by visual perception -- our ability to interact with the world depends on actively gathering relevant information and adapting our movements accordingly. Behaviors like searching for objects, reaching, and hand-eye coordination naturally emerge from the structure of our sensory system. Inspired by these principles, we introduce Perceptive Dexterous Control (PDC), a framework for vision-driven dexterous whole-body control with simulated humanoids. PDC operates solely on egocentric vision for task specification, enabling object search, target placement, and skill selection through visual cues, without relying on privileged state information (e.g., 3D object positions and geometries). This perception-as-interface paradigm enables learning a single policy to perform multiple household tasks, including reaching, grasping, placing, and articulated object manipulation. We also show that training from scratch with reinforcement learning can produce emergent behaviors such as active search. These results demonstrate how vision-driven control and complex tasks induce human-like behaviors and can serve as the key ingredients in closing the perception-action loop for animation, robotics, and embodied AI.

[410] Structureless VIO

Junlin Song,Miguel Olivares-Mendez

Main category: cs.RO

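TL;DR: Proposes a structureless visual-inertial odometry (VIO) that removes the visual map module of conventional VIO, substantially improving computational efficiency and also improving accuracy.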

Details

Motivation: In conventional VIO, localization and mapping are tightly coupled, so each depends on accurate output from the other; efficient localization that needs no map has not been fully investigated. Method: Proposes a structureless VIO framework that removes the visual map from the odometry framework, simplifying the system design. Result: Experiments show that the structureless VIO outperforms the structure-based VIO baseline in both computational efficiency and accuracy. Conclusion: Structureless VIO is an efficient and accurate alternative that avoids the coupling problem of conventional VIO. Abstract: Visual odometry (VO) is typically considered as a chicken-and-egg problem, as the localization and mapping modules are tightly-coupled. The estimation of the visual map relies on accurate localization information. Meanwhile, localization requires precise map points to provide motion constraints. This classical design principle is naturally inherited by visual-inertial odometry (VIO). An efficient localization solution that does not require a map has not been fully investigated. To this end, we propose a novel structureless VIO, where the visual map is removed from the odometry framework. Experimental results demonstrated that, compared to the structure-based VIO baseline, our structureless VIO not only substantially improves computational efficiency but also has advantages in accuracy.

[411] TeleOpBench: A Simulator-Centric Benchmark for Dual-Arm Dexterous Teleoperation

Hangyu Li,Qin Zhao,Haoran Xu,Xinyu Jiang,Qingwei Ben,Feiyu Jia,Haoyu Zhao,Liang Xu,Jia Zeng,Hanqing Wang,Bo Dai,Junting Dong,Jiangmiao Pang

Main category: cs.RO

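TL;DR: TeleOpBench is a simulator-centric benchmark for bimanual dexterous teleoperation with 30 high-fidelity task environments spanning a range of manipulation difficulty; it also verifies that simulator performance correlates with performance on real hardware.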

Details

Motivation: Teleoperation is an important source of demonstrations for robot learning, but there is no unified benchmark for fair comparison of different hardware systems. Method: Proposes the TeleOpBench benchmark, with 30 task environments and four teleoperation modalities, and validates it through mirrored simulation and hardware experiments. Result: Simulator performance correlates strongly with real hardware performance, supporting the external validity of TeleOpBench. Conclusion: TeleOpBench provides a common standard for teleoperation research and an extensible platform for future algorithmic and hardware innovation. Abstract: Teleoperation is a cornerstone of embodied-robot learning, and bimanual dexterous teleoperation in particular provides rich demonstrations that are difficult to obtain with fully autonomous systems. While recent studies have proposed diverse hardware pipelines, ranging from inertial motion-capture gloves to exoskeletons and vision-based interfaces, there is still no unified benchmark that enables fair, reproducible comparison of these systems. In this paper, we introduce TeleOpBench, a simulator-centric benchmark tailored to bimanual dexterous teleoperation. TeleOpBench contains 30 high-fidelity task environments that span pick-and-place, tool use, and collaborative manipulation, covering a broad spectrum of kinematic and force-interaction difficulty. Within this benchmark we implement four representative teleoperation modalities, (i) MoCap, (ii) VR device, (iii) arm-hand exoskeletons, and (iv) monocular vision tracking, and evaluate them with a common protocol and metric suite. To validate that performance in simulation is predictive of real-world behavior, we conduct mirrored experiments on a physical dual-arm platform equipped with two 6-DoF dexterous hands. Across 10 held-out tasks we observe a strong correlation between simulator and hardware performance, confirming the external validity of TeleOpBench. TeleOpBench establishes a common yardstick for teleoperation research and provides an extensible platform for future algorithmic and hardware innovation.

physics.med-ph [Back]

[412] OpenPros: A Large-Scale Dataset for Limited View Prostate Ultrasound Computed Tomography

Hanchen Wang,Yixuan Wu,Yinan Feng,Peng Jin,Shihang Feng,Yiming Mao,James Wiskin,Baris Turkbey,Peter A. Pinto,Bradford J. Wood,Songting Luo,Yinpeng Chen,Emad Boctor,Youzuo Lin

Main category: physics.med-ph

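TL;DR: OpenPros is a large-scale benchmark dataset for limited-view prostate ultrasound computed tomography (USCT), intended to improve the accuracy of early prostate cancer detection.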

Details

Motivation: Early detection of prostate cancer is an urgent need; conventional ultrasound lacks sensitivity, and clinical implementation of USCT faces challenges. Method: Builds the OpenPros dataset of over 280,000 paired samples of simulated 2D speed-of-sound phantoms and ultrasound full-waveform data, generated from real MRI/CT scans and ex vivo ultrasound measurements. Result: Deep learning methods outperform traditional physics-based methods in inference efficiency and reconstruction accuracy, but still fall short of clinical high-resolution requirements. Conclusion: OpenPros is released publicly to drive machine learning algorithms toward clinically usable high-resolution prostate ultrasound images. Abstract: Prostate cancer is one of the most common and lethal cancers among men, making its early detection critically important. Although ultrasound imaging offers greater accessibility and cost-effectiveness compared to MRI, traditional transrectal ultrasound methods suffer from low sensitivity, especially in detecting anteriorly located tumors. Ultrasound computed tomography provides quantitative tissue characterization, but its clinical implementation faces significant challenges, particularly under anatomically constrained limited-angle acquisition conditions specific to prostate imaging. To address these unmet needs, we introduce OpenPros, the first large-scale benchmark dataset explicitly developed for limited-view prostate USCT. Our dataset includes over 280,000 paired samples of realistic 2D speed-of-sound (SOS) phantoms and corresponding ultrasound full-waveform data, generated from anatomically accurate 3D digital prostate models derived from real clinical MRI/CT scans and ex vivo ultrasound measurements, annotated by medical experts. Simulations are conducted under clinically realistic configurations using advanced finite-difference time-domain and Runge-Kutta acoustic wave solvers, both provided as open-source components. Through comprehensive baseline experiments, we demonstrate that state-of-the-art deep learning methods surpass traditional physics-based approaches in both inference efficiency and reconstruction accuracy. Nevertheless, current deep learning models still fall short of delivering clinically acceptable high-resolution images with sufficient accuracy. By publicly releasing OpenPros, we aim to encourage the development of advanced machine learning algorithms capable of bridging this performance gap and producing clinically usable, high-resolution, and highly accurate prostate ultrasound images. The dataset is publicly accessible at https://open-pros.github.io/.

cs.SD [Back]

[413] SounDiT: Geo-Contextual Soundscape-to-Landscape Generation

Junbo Wang,Haofeng Tan,Bowen Liao,Albert Jiang,Teng Fei,Qixing Huang,Zhengzhong Tu,Shan Ye,Yuhao Kang

Main category: cs.SD

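TL;DR: Proposes Geo-Contextual Soundscape-to-Landscape (GeoS2L) generation, which incorporates geographic knowledge to generate more realistic landscape images from soundscapes, and introduces new datasets and an evaluation framework.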

Details

Motivation: Existing audio-to-image generation methods ignore geographic and environmental context, so the generated images do not match real-world environments. Method: Proposes a geo-contextual computational framework, constructs two large-scale datasets (SoundingSVI and SonicUrban), and develops SounDiT, a model based on the Diffusion Transformer (DiT). Result: SounDiT outperforms existing baselines in visual fidelity and geographic consistency. Conclusion: The work lays a foundation for GeoS2L generation and underscores the importance of geographic knowledge in generative models. Abstract: We present a novel and practically significant problem, Geo-Contextual Soundscape-to-Landscape (GeoS2L) generation, which aims to synthesize geographically realistic landscape images from environmental soundscapes. Prior audio-to-image generation methods typically rely on general-purpose datasets and overlook geographic and environmental contexts, resulting in unrealistic images that are misaligned with real-world environmental settings. To address this limitation, we introduce a novel geo-contextual computational framework that explicitly integrates geographic knowledge into multimodal generative modeling. We construct two large-scale geo-contextual multimodal datasets, SoundingSVI and SonicUrban, pairing diverse soundscapes with real-world landscape images. We propose SounDiT, a novel Diffusion Transformer (DiT)-based model that incorporates geo-contextual scene conditioning to synthesize geographically coherent landscape images. Furthermore, we propose a practically-informed geo-contextual evaluation framework, the Place Similarity Score (PSS), across element-, scene-, and human perception-levels to measure consistency between input soundscapes and generated landscape images. Extensive experiments demonstrate that SounDiT outperforms existing baselines in both visual fidelity and geographic settings. Our work not only establishes foundational benchmarks for GeoS2L generation but also highlights the importance of incorporating geographic domain knowledge in advancing multimodal generative models, opening new directions at the intersection of generative AI, geography, urban planning, and environmental sciences.

[414] ASR-FAIRBENCH: Measuring and Benchmarking Equity Across Speech Recognition Systems

Anand Rai,Satyam Rahangdale,Utkarsh Anand,Animesh Mukherjee

Main category: cs.SD

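TL;DR: ASR-FAIRBENCH is a leaderboard that evaluates both the accuracy and the fairness of ASR models, combining a fairness score with traditional metrics such as WER, and reveals performance disparities across demographic groups.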

Details

Motivation: Address the significant performance disparities of ASR systems across demographic groups and promote more inclusive technology. Method: Uses Meta's Fair-Speech dataset and a mixed-effects Poisson regression model to compute a fairness score, which is combined with WER to produce the FAAS score. Result: SOTA ASR models show significant performance disparities across demographic groups. Conclusion: ASR-FAIRBENCH provides an evaluation benchmark for developing fairer ASR technology. Abstract: Automatic Speech Recognition (ASR) systems have become ubiquitous in everyday applications, yet significant disparities in performance across diverse demographic groups persist. In this work, we introduce the ASR-FAIRBENCH leaderboard which is designed to assess both the accuracy and equity of ASR models in real-time. Leveraging Meta's Fair-Speech dataset, which captures diverse demographic characteristics, we employ a mixed-effects Poisson regression model to derive an overall fairness score. This score is integrated with traditional metrics like Word Error Rate (WER) to compute the Fairness Adjusted ASR Score (FAAS), providing a comprehensive evaluation framework. Our approach reveals significant performance disparities in SOTA ASR models across demographic groups and offers a benchmark to drive the development of more inclusive ASR technologies.

[415] VoiceCloak: A Multi-Dimensional Defense Framework against Unauthorized Diffusion-based Voice Cloning

Qianyue Hu,Junyan Wu,Wei Lu,Xiangyang Luo

Main category: cs.SD

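TL;DR: VoiceCloak is a multi-dimensional proactive defense framework against diffusion models (DMs) that uses adversarial perturbations to disrupt unauthorized voice cloning (VC), obfuscating speaker identity and degrading perceptual quality.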

Details

Motivation: Diffusion models excel at voice cloning but also raise the risk of malicious misuse. Existing defenses do not transfer to diffusion models, so a new defense framework is needed. Method: VoiceCloak analyzes the vulnerabilities of diffusion models and adds adversarial perturbations that distort speaker identity representations and disrupt conditional guidance, while noise-guided semantic corruption degrades output quality. Result: Experiments show that VoiceCloak defends effectively against unauthorized diffusion-based voice cloning. Conclusion: VoiceCloak is an effective proactive defense that protects voice data against malicious cloning by diffusion models. Abstract: Diffusion Models (DMs) have achieved remarkable success in realistic voice cloning (VC), while they also increase the risk of malicious misuse. Existing proactive defenses designed for traditional VC models aim to disrupt the forgery process, but they have been proven incompatible with DMs due to the intricate generative mechanisms of diffusion. To bridge this gap, we introduce VoiceCloak, a multi-dimensional proactive defense framework with the goal of obfuscating speaker identity and degrading perceptual quality in potential unauthorized VC. To achieve these goals, we conduct a focused analysis to identify specific vulnerabilities within DMs, allowing VoiceCloak to disrupt the cloning process by introducing adversarial perturbations into the reference audio. Specifically, to obfuscate speaker identity, VoiceCloak first targets speaker identity by distorting representation learning embeddings to maximize identity variation, which is guided by auditory perception principles. Additionally, VoiceCloak disrupts crucial conditional guidance processes, particularly attention context, thereby preventing the alignment of vocal characteristics that are essential for achieving convincing cloning. Then, to address the second objective, VoiceCloak introduces score magnitude amplification to actively steer the reverse trajectory away from the generation of high-quality speech. Noise-guided semantic corruption is further employed to disrupt structural speech semantics captured by DMs, degrading output quality. Extensive experiments highlight VoiceCloak's outstanding defense success rate against unauthorized diffusion-based voice cloning. Audio samples of VoiceCloak are available at https://voice-cloak.github.io/VoiceCloak/.

[416] MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix

Ziyang Ma,Yinghao Ma,Yanqiao Zhu,Chen Yang,Yi-Wen Chao,Ruiyang Xu,Wenxi Chen,Yuanzhe Chen,Zhuo Chen,Jian Cong,Kai Li,Keliang Li,Siyou Li,Xinfeng Li,Xiquan Li,Zheng Lian,Yuzhe Liang,Minghao Liu,Zhikang Niu,Tianrui Wang,Yuping Wang,Yuxuan Wang,Yihao Wu,Guanrou Yang,Jianwei Yu,Ruibin Yuan,Zhisheng Zheng,Ziya Zhou,Haina Zhu,Wei Xue,Emmanouil Benetos,Kai Yu,Eng-Siong Chng,Xie Chen

Main category: cs.SD

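TL;DR: MMAR is a new benchmark for audio-language models that spans multi-disciplinary tasks and contains 1,000 high-quality audio-question-answer triplets, designed to evaluate deep reasoning.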

Details

Motivation: Existing benchmarks are confined to specific audio domains and do not assess mixed-modality audio or deep reasoning; MMAR fills this gap. Method: MMAR builds a complex task set using hierarchical categorization (Signal, Perception, Semantic, Cultural) and Chain-of-Thought annotations, and evaluates a range of models. Result: Current models perform poorly on MMAR, exposing limits in their understanding and reasoning abilities. Conclusion: MMAR offers challenges and directions for future research on audio reasoning. Abstract: We introduce MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs) across massive multi-disciplinary tasks. MMAR comprises 1,000 meticulously curated audio-question-answer triplets, collected from real-world internet videos and refined through iterative error corrections and quality checks to ensure high quality. Unlike existing benchmarks that are limited to specific domains of sound, music, or speech, MMAR extends them to a broad spectrum of real-world audio scenarios, including mixed-modality combinations of sound, music, and speech. Each question in MMAR is hierarchically categorized across four reasoning layers: Signal, Perception, Semantic, and Cultural, with additional sub-categories within each layer to reflect task diversity and complexity. To further foster research in this area, we annotate every question with a Chain-of-Thought (CoT) rationale to promote future advancements in audio reasoning. Each item in the benchmark demands multi-step deep reasoning beyond surface-level understanding. Moreover, a part of the questions requires graduate-level perceptual and domain-specific knowledge, elevating the benchmark's difficulty and depth. We evaluate MMAR using a broad set of models, including Large Audio-Language Models (LALMs), Large Audio Reasoning Models (LARMs), Omni Language Models (OLMs), Large Language Models (LLMs), and Large Reasoning Models (LRMs), with audio caption inputs. The performance of these models on MMAR highlights the benchmark's challenging nature, and our analysis further reveals critical limitations of understanding and reasoning capabilities among current models. We hope MMAR will serve as a catalyst for future advances in this important but little-explored area.

Jongmin Jung,Dongmin Kim,Sihun Lee,Seola Cho,Hyungjoon Soh,Irmak Bukey,Chris Donahue,Dasaem Jeong

Main category: cs.SD

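TL;DR: Proposes a unified approach to multimodal music translation that, using a large-scale dataset and a unified tokenization framework, substantially improves performance on cross-modal translation tasks.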

Details

Motivation: Past work on multimodal music translation typically trains a specialized model per task, which lacks unification and efficiency. Method: Uses a new dataset of more than 1,300 hours of paired audio and score-image data, and a unified tokenization framework that discretizes each modality into token sequences so that a single encoder-decoder Transformer handles multiple tasks. Result: The unified model outperforms single-task baselines on several tasks, for example reducing the symbol error rate for optical music recognition from 24.58% to 13.67%, and achieves the first score-image-conditioned audio generation. Conclusion: The unified approach performs well on multimodal music translation and opens new directions for cross-modal music generation. Abstract: Music exists in various modalities, such as score images, symbolic scores, MIDI, and audio. Translations between each modality are established as core tasks of music information retrieval, such as automatic music transcription (audio-to-MIDI) and optical music recognition (score image to symbolic score). However, most past work on multimodal translation trains specialized models on individual translation tasks. In this paper, we propose a unified approach, where we train a general-purpose model on many translation tasks simultaneously. Two key factors make this unified approach viable: a new large-scale dataset and the tokenization of each modality. Firstly, we propose a new dataset that consists of more than 1,300 hours of paired audio-score image data collected from YouTube videos, which is an order of magnitude larger than any existing music modal translation datasets. Secondly, our unified tokenization framework discretizes score images, audio, MIDI, and MusicXML into a sequence of tokens, enabling a single encoder-decoder Transformer to tackle multiple cross-modal translation as one coherent sequence-to-sequence task. Experimental results confirm that our unified multitask model improves upon single-task baselines in several key areas, notably reducing the symbol error rate for optical music recognition from 24.58% to a state-of-the-art 13.67%, while similarly substantial improvements are observed across the other translation tasks. Notably, our approach achieves the first successful score-image-conditioned audio generation, marking a significant breakthrough in cross-modal music generation.

cs.LG [Back]

[418] Spectral Policy Optimization: Coloring your Incorrect Reasoning in GRPO

Peter Chen,Xiaopeng Li,Ziniu Li,Xi Chen,Tianyi Lin

Main category: cs.LG

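TL;DR: Proposes introducing response diversity to address GRPO's learning stall on all-negative-sample groups, and validates the approach both theoretically and empirically.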

Details

Motivation: GRPO cannot update the policy on all-negative-sample groups, which stalls learning; this paper aims to solve that problem. Method: Proposes a framework that uses AI feedback to introduce response diversity within all-negative-sample groups, and analyzes the resulting learning dynamics with a theoretical model. Result: Experiments confirm performance gains across model sizes (7B, 14B, 32B) and multiple learning settings. Conclusion: Learning from all-negative-sample groups is both feasible and beneficial, offering new insight for related research. Abstract: Reinforcement learning (RL) has demonstrated significant success in enhancing reasoning capabilities in large language models (LLMs). One of the most widely used RL methods is Group Relative Policy Optimization (GRPO) (Shao et al., 2024), known for its memory efficiency and success in training DeepSeek-R1 (Guo et al., 2025). However, GRPO stalls when all sampled responses in a group are incorrect -- referred to as an all-negative-sample group -- as it fails to update the policy, hindering learning progress. The contributions of this paper are two-fold. First, we propose a simple yet effective framework that introduces response diversity within all-negative-sample groups in GRPO using AI feedback. We also provide a theoretical analysis, via a stylized model, showing how this diversification improves learning dynamics. Second, we empirically validate our approach, showing the improved performance across various model sizes (7B, 14B, 32B) in both offline and online learning settings with 10 benchmarks, including base and distilled variants. Our findings highlight that learning from all-negative-sample groups is not only feasible but beneficial, advancing recent insights from Xiong et al. (2025).
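
Illustrative sketch (not from the paper): the stall described in the abstract can be seen in the group-relative advantage itself. Assuming the common GRPO formulation in which each response's advantage is its reward standardized within the sampled group, a group whose rewards are all equal yields zero advantages and therefore no policy-gradient signal. The graded scores in the last example are hypothetical partial-credit values standing in for any diversified feedback; they are not the paper's scoring scheme.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (GRPO-style advantage)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

mixed_group = [1.0, 0.0, 0.0, 1.0]         # some responses correct
all_negative_group = [0.0, 0.0, 0.0, 0.0]  # every response incorrect
graded_group = [0.0, 0.3, 0.1, 0.6]        # hypothetical graded scores

mixed_adv = group_relative_advantages(mixed_group)
stalled_adv = group_relative_advantages(all_negative_group)
graded_adv = group_relative_advantages(graded_group)
```

With binary rewards the all-negative group produces advantages that are all zero, while any scoring that distinguishes among incorrect responses restores a nonzero signal.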

Xilong Wang,John Bloch,Zedian Shao,Yuepeng Hu,Shuyan Zhou,Neil Zhenqiang Gong

Main category: cs.LG

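TL;DR: EnvInjection is an attack on multimodal large language model (MLLM) web agents that perturbs webpage pixel values to induce the agent to perform a target action, addressing the limited effectiveness and stealthiness of existing attacks.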

Details

Motivation: Existing attacks fall short in effectiveness and stealthiness and are hard to carry out in real-world settings, so a more effective, stealthy, and practical attack is needed. Method: Formulates the perturbation search as an optimization problem, trains a neural network to approximate the mapping from pixel values to screenshots, and solves the problem with projected gradient descent. Result: Evaluation on multiple webpage datasets shows that EnvInjection significantly outperforms existing baselines. Conclusion: EnvInjection is an effective and stealthy attack that overcomes the limitations of existing methods. Abstract: Multi-modal large language model (MLLM)-based web agents interact with webpage environments by generating actions based on screenshots of the webpages. Environmental prompt injection attacks manipulate the environment to induce the web agent to perform a specific, attacker-chosen action--referred to as the target action. However, existing attacks suffer from limited effectiveness or stealthiness, or are impractical in real-world settings. In this work, we propose EnvInjection, a new attack that addresses these limitations. Our attack adds a perturbation to the raw pixel values of the rendered webpage, which can be implemented by modifying the webpage's source code. After these perturbed pixels are mapped into a screenshot, the perturbation induces the web agent to perform the target action. We formulate the task of finding the perturbation as an optimization problem. A key challenge in solving this problem is that the mapping between raw pixel values and screenshot is non-differentiable, making it difficult to backpropagate gradients to the perturbation. To overcome this, we train a neural network to approximate the mapping and apply projected gradient descent to solve the reformulated optimization problem. Extensive evaluation on multiple webpage datasets shows that EnvInjection is highly effective and significantly outperforms existing baselines.
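
Illustrative sketch (not from the paper): projected gradient descent, which the abstract names as the solver, alternates a gradient step on the perturbation with a projection back onto a constraint set. The toy below assumes an L-infinity bound `eps` on the perturbation and replaces the paper's learned surrogate and agent loss with a simple quadratic loss whose gradient is known in closed form; all names and values are hypothetical.

```python
def loss(x, delta, target):
    """Toy differentiable objective standing in for the surrogate loss."""
    return sum((xi + di - ti) ** 2 for xi, di, ti in zip(x, delta, target))

def pgd(x, target, eps=0.1, alpha=0.02, steps=50):
    """Minimize loss over delta subject to |delta_i| <= eps (L-infinity ball)."""
    delta = [0.0] * len(x)
    for _ in range(steps):
        # Closed-form gradient of the quadratic toy loss w.r.t. delta.
        grad = [2.0 * (xi + di - ti) for xi, di, ti in zip(x, delta, target)]
        # Gradient step followed by projection onto the L-infinity ball.
        delta = [max(-eps, min(eps, di - alpha * gi)) for di, gi in zip(delta, grad)]
    return delta

x = [0.2, 0.5, 0.9]         # toy "pixel" values
target = [0.25, 0.1, 0.95]  # values that would minimize the toy loss
eps = 0.1
delta = pgd(x, target, eps=eps)
loss_before = loss(x, [0.0] * len(x), target)
loss_after = loss(x, delta, target)
```

The second coordinate would need a change of -0.4 to reach its target, so the projection pins it at the bound, which is the role the constraint plays in keeping a perturbation small.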

[420] Concept-Guided Interpretability via Neural Chunking

Shuchen Wu,Stephan Alaniz,Shyamgopal Karthik,Peter Dayan,Eric Schulz,Zeynep Akata

Main category: cs.LG

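TL;DR: Proposes the Reflection Hypothesis, that patterns in a neural network's activity mirror regularities in its training data, and extracts interpretable units with cognitively inspired chunking methods.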

Details

Motivation: Neural networks are hard to understand as black boxes; the paper proposes that their activity patterns may reflect regularities in the data. Method: Proposes three methods (DSC, PA, UCD) for extracting interpretable units from high-dimensional neural activity. Result: The methods extract entities effectively across model sizes, and the entities correspond to concrete or abstract concepts. Conclusion: Interpretability methods that combine cognitive principles with the structure of naturalistic data offer a new direction for understanding complex learning systems. Abstract: Neural networks are often black boxes, reflecting the significant challenge of understanding their internal workings. We propose a different perspective that challenges the prevailing view: rather than being inscrutable, neural networks exhibit patterns in their raw population activity that mirror regularities in the training data. We refer to this as the Reflection Hypothesis and provide evidence for this phenomenon in both simple recurrent neural networks (RNNs) and complex large language models (LLMs). Building on this insight, we propose to leverage cognitively-inspired methods of chunking to segment high-dimensional neural population dynamics into interpretable units that reflect underlying concepts. We propose three methods to extract these emerging entities, complementing each other based on label availability and dimensionality. Discrete sequence chunking (DSC) creates a dictionary of entities; population averaging (PA) extracts recurring entities that correspond to known labels; and unsupervised chunk discovery (UCD) can be used when labels are absent. We demonstrate the effectiveness of these methods in extracting entities across varying model sizes, ranging from inducing compositionality in RNNs to uncovering recurring neural population states in large models with diverse architectures, and illustrate their advantage over other methods. Throughout, we observe a robust correspondence between the extracted entities and concrete or abstract concepts. Artificially inducing the extracted entities in neural populations effectively alters the network's generation of associated concepts. Our work points to a new direction for interpretability, one that harnesses both cognitive principles and the structure of naturalistic data to reveal the hidden computations of complex learning systems, gradually transforming them from black boxes into systems we can begin to understand.

[421] Efficient Uncertainty Estimation via Distillation of Bayesian Large Language Models

Harshil Vejendla,Haizhou Shi,Yibin Wang,Tunyu Zhang,Huan Zhang,Hao Wang

Main category: cs.LG

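TL;DR: Proposes an LLM uncertainty estimation method that needs no test-time sampling: the confidence of a Bayesian LLM is distilled into a non-Bayesian student LLM, substantially improving efficiency.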

Details

Motivation: Existing Bayesian methods require multiple sampling iterations at inference, which is inefficient and limits practical deployment. Method: Distills the confidence of a Bayesian LLM into a non-Bayesian student LLM by minimizing the divergence between their predictive distributions, using only the training dataset. Result: Test-time uncertainty estimation becomes N times more efficient (N is the number of samples a conventional Bayesian LLM requires), with performance comparable to or better than existing Bayesian LLMs. Conclusion: The method is efficient and effective, needs no additional validation dataset, and its uncertainty estimation generalizes to unseen data. Abstract: Recent advances in uncertainty estimation for Large Language Models (LLMs) during downstream adaptation have addressed key challenges of reliability and simplicity. However, existing Bayesian methods typically require multiple sampling iterations during inference, creating significant efficiency issues that limit practical deployment. In this paper, we investigate the possibility of eliminating the need for test-time sampling for LLM uncertainty estimation. Specifically, when given an off-the-shelf Bayesian LLM, we distill its aligned confidence into a non-Bayesian student LLM by minimizing the divergence between their predictive distributions. Unlike typical calibration methods, our distillation is carried out solely on the training dataset without the need of an additional validation dataset. This simple yet effective approach achieves N-times more efficient uncertainty estimation during testing, where N is the number of samples traditionally required by Bayesian LLMs. Our extensive experiments demonstrate that uncertainty estimation capabilities on training data can successfully generalize to unseen test data through our distillation technique, consistently producing results comparable to (or even better than) state-of-the-art Bayesian LLMs.
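
Illustrative sketch (not from the paper): the distillation objective in the abstract can be pictured as follows. A Bayesian teacher yields N sampled predictive distributions per input; their average is the teacher's predictive distribution, and the student is trained so that its single forward pass matches it. The abstract does not say which divergence or direction is used, so the KL(teacher || student) below, the three-class toy distributions, and all function names are assumptions.

```python
import math

def bayesian_model_average(sampled_probs):
    """Average N sampled predictive distributions into one teacher target."""
    n = len(sampled_probs)
    k = len(sampled_probs[0])
    return [sum(p[j] for p in sampled_probs) / n for j in range(k)]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

# N = 3 sampled teacher predictions for one input (toy values).
teacher_samples = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.6, 0.3, 0.1]]
teacher_target = bayesian_model_average(teacher_samples)

overconfident_student = [0.98, 0.01, 0.01]
matched_student = list(teacher_target)

loss_overconfident = kl_divergence(teacher_target, overconfident_student)
loss_matched = kl_divergence(teacher_target, matched_student)
```

Once the student matches the averaged distribution, a single forward pass replaces the N teacher samples at test time, which is the source of the N-fold efficiency gain the abstract reports.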

[422] SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training

Jintao Zhang,Jia Wei,Pengle Zhang,Xiaoming Xu,Haofeng Huang,Haoxu Wang,Kai Jiang,Jun Zhu,Jianfei Chen

Main category: cs.LG

TL;DR: Uses the FP4 Tensor Cores of Blackwell GPUs to accelerate attention, achieving a 5x speedup, and applies low-bit attention to training tasks for the first time.

Details

Motivation: The quadratic time complexity of attention limits efficiency, so its computation needs optimizing. Method: (1) Accelerate attention with FP4 Tensor Cores; (2) design an 8-bit attention for training tasks. Result: FP4 attention achieves a 5x speedup; 8-bit attention is lossless in fine-tuning but converges more slowly in pretraining. Conclusion: Low-bit attention is effective for both inference and training, but pretraining needs further optimization. Abstract: The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions: First, we leverage the new FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves 1038 TOPS on RTX5090, which is a 5x speedup over the fastest FlashAttention on RTX5090. Experiments show that our FP4 attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer the application of low-bit attention to training tasks. Existing low-bit attention works like FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient 8-bit attention for both forward and backward propagation. Experiments indicate that 8-bit attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code will be available at https://github.com/thu-ml/SageAttention.
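
Illustrative sketch (not from the paper): "microscaling" in the title refers to quantizing small blocks of values with one shared scale per block. The toy below maps each block onto the magnitudes representable by a 4-bit E2M1 float (0, 0.5, 1, 1.5, 2, 3, 4, 6) after dividing by a per-block scale. The block contents, the full-precision scale, and the function names are assumptions; real microscaling formats also store the scale itself in a low-precision format, which this sketch omits, and it says nothing about how SageAttention3 implements its kernels.

```python
import math

# Non-negative magnitudes representable by a 4-bit E2M1 float.
FP4_E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize one block with a shared scale; returns (scale, fp4_values)."""
    amax = max(abs(v) for v in block)
    # Map the largest magnitude in the block to the largest FP4 value (6.0).
    scale = amax / 6.0 if amax > 0 else 1.0
    quantized = []
    for v in block:
        # Round the scaled magnitude to the nearest representable FP4 value.
        mag = min(FP4_E2M1_VALUES, key=lambda g: abs(abs(v) / scale - g))
        quantized.append(math.copysign(mag, v))
    return scale, quantized

def dequantize_block(scale, quantized):
    """Recover approximate values by multiplying back by the shared scale."""
    return [scale * q for q in quantized]

block = [0.1, -0.6, 0.3, 1.2]  # toy block; real block sizes are larger
scale, q = quantize_block(block)
restored = dequantize_block(scale, q)
```

Because the scale is chosen per block, a block with small values keeps fine resolution even when other blocks in the same tensor contain large values.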

[423] Token-Level Uncertainty Estimation for Large Language Model Reasoning

Tunyu Zhang,Haizhou Shi,Yibin Wang,Hengyi Wang,Xiaoxiao He,Zhuowei Li,Haoxian Chen,Ligong Han,Kai Xu,Huan Zhang,Dimitris Metaxas,Hao Wang

Main category: cs.LG

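TL;DR: Proposes a token-level uncertainty estimation framework that helps large language models (LLMs) self-assess and improve their generation quality on mathematical reasoning tasks.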

Details

Motivation: LLM output quality is inconsistent across application scenarios, and trustworthy responses are hard to identify, especially in complex tasks that require multi-step reasoning. Method: Applies low-rank random weight perturbation to generate predictive distributions, estimates token-level uncertainty from them, and aggregates these estimates to reflect the semantic uncertainty of the generated sequence. Result: Token-level uncertainty metrics correlate strongly with answer correctness and model robustness, and a particle filtering algorithm that uses them directly improves reasoning performance. Conclusion: The framework outperforms existing uncertainty estimation methods and is an effective tool for evaluating and improving LLM reasoning. Abstract: While Large Language Models (LLMs) have demonstrated impressive capabilities, their output quality remains inconsistent across various application scenarios, making it difficult to identify trustworthy responses, especially in complex tasks requiring multi-step reasoning. In this paper, we propose a token-level uncertainty estimation framework to enable LLMs to self-assess and self-improve their generation quality in mathematical reasoning. Specifically, we introduce low-rank random weight perturbation to LLM decoding, generating predictive distributions that we use to estimate token-level uncertainties. We then aggregate these uncertainties to reflect semantic uncertainty of the generated sequences. Experiments on mathematical reasoning datasets of varying difficulty demonstrate that our token-level uncertainty metrics strongly correlate with answer correctness and model robustness. Additionally, we explore using uncertainty to directly enhance the model's reasoning performance through multiple generations and the particle filtering algorithm. Our approach consistently outperforms existing uncertainty estimation methods, establishing effective uncertainty estimation as a valuable tool for both evaluating and improving reasoning generation in LLMs.
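
Illustrative sketch (not from the paper): given several predictive distributions for the same token, one from each weight perturbation, a standard way to quantify uncertainty is the entropy decomposition into total, aleatoric, and epistemic parts. The abstract does not state which metrics or aggregation the authors use, so the decomposition, the mean-over-tokens aggregation, and the toy two-class distributions below are assumptions.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def token_uncertainty(sampled_probs):
    """Return (total, aleatoric, epistemic) for one token position."""
    n = len(sampled_probs)
    k = len(sampled_probs[0])
    mean = [sum(p[j] for p in sampled_probs) / n for j in range(k)]
    total = entropy(mean)                                # entropy of the average
    aleatoric = sum(entropy(p) for p in sampled_probs) / n  # average entropy
    return total, aleatoric, total - aleatoric           # gap = disagreement

# Two perturbed models that agree vs. two that disagree on one token.
agree = [[0.9, 0.1], [0.9, 0.1]]
disagree = [[0.9, 0.1], [0.1, 0.9]]

_, _, epistemic_agree = token_uncertainty(agree)
_, _, epistemic_disagree = token_uncertainty(disagree)

# Hypothetical sequence-level aggregation: mean epistemic term over tokens.
sequence_tokens = [agree, disagree]
sequence_uncertainty = sum(token_uncertainty(t)[2] for t in sequence_tokens) / len(sequence_tokens)
```

The epistemic term is zero when the perturbed models agree and grows as their predictions diverge, which is why it can flag tokens where the model's answer depends on small changes to its weights.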

[424] Urban Representation Learning for Fine-grained Economic Mapping: A Semi-supervised Graph-based Approach

Jinzhou Cao,Xiangxu Wang,Jiashi Chen,Wei Tu,Zhenhui Li,Xindong Yang,Tianhong Zhao,Qingquan Li

Main category: cs.LG

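TL;DR: SemiGTX is an explainable semi-supervised graph learning framework for sectoral economic mapping that uses multi-task learning and fuses geospatial data modalities to improve the accuracy and explainability of economic prediction.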

Details

Motivation: Existing methods overlook semi-supervised learning in data-scarce scenarios and lack a unified multi-task framework for comprehensive sectoral economic analysis. Method: Proposes the SemiGTX framework, which combines spatial self-supervision with locally masked supervised regression and maps the GDP of all three sectors jointly through multi-task learning. Result: In experiments on the Pearl River Delta region, R² scores are 0.93, 0.96, and 0.94 for the primary, secondary, and tertiary sectors; cross-regional experiments confirm generalization. Conclusion: By integrating diverse urban data, SemiGTX provides a solid foundation for precise economic forecasting and advances regional economic monitoring. Abstract: Fine-grained economic mapping through urban representation learning has emerged as a crucial tool for evidence-based economic decisions. While existing methods primarily rely on supervised or unsupervised approaches, they often overlook semi-supervised learning in data-scarce scenarios and lack unified multi-task frameworks for comprehensive sectoral economic analysis. To address these gaps, we propose SemiGTX, an explainable semi-supervised graph learning framework for sectoral economic mapping. The framework is designed with dedicated fusion encoding modules for various geospatial data modalities, seamlessly integrating them into a cohesive graph structure. It introduces a semi-information loss function that combines spatial self-supervision with locally masked supervised regression, enabling more informative and effective region representations. Through multi-task learning, SemiGTX concurrently maps GDP across primary, secondary, and tertiary sectors within a unified model. Extensive experiments conducted in the Pearl River Delta region of China demonstrate the model's superior performance compared to existing methods, achieving R² scores of 0.93, 0.96, and 0.94 for the primary, secondary and tertiary sectors, respectively. Cross-regional experiments in Beijing and Chengdu further illustrate its generality. Systematic analysis reveals how different data modalities influence model predictions, enhancing explainability while providing valuable insights for regional development planning. This representation learning framework advances regional economic monitoring through diverse urban data integration, providing a robust foundation for precise economic forecasting.

[425] Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders

David Chanin,Tomáš Dulka,Adrià Garriga-Alonso

Main category: cs.LG

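TL;DR: When features are correlated and outnumber the width of a sparse autoencoder (SAE), the SAE merges correlated features and loses monosemanticity, a phenomenon the authors call feature hedging. The paper studies it theoretically and empirically and proposes an improved method.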

Details

Motivation: Investigate why sparse autoencoders (SAEs) perform poorly when features are correlated and outnumber the SAE width, i.e., the feature hedging phenomenon. Method: Studies feature hedging through theoretical analysis and experiments, and proposes an improved matryoshka SAE variant. Result: Feature hedging is identified as a likely core reason for SAE underperformance, and an improved method is proposed. Conclusion: Feature hedging is a fundamental problem for SAEs, but studying it may help SAEs interpret LLMs at scale. Abstract: It is assumed that sparse autoencoders (SAEs) decompose polysemantic activations into interpretable linear directions, as long as the activations are composed of sparse linear combinations of underlying features. However, we find that if an SAE is narrower than the number of underlying "true features" on which it is trained, and there is correlation between features, the SAE will merge components of correlated features together, thus destroying monosemanticity. In LLM SAEs, these two conditions are almost certainly true. This phenomenon, which we call feature hedging, is caused by SAE reconstruction loss, and is more severe the narrower the SAE. In this work, we introduce the problem of feature hedging and study it both theoretically in toy models and empirically in SAEs trained on LLMs. We suspect that feature hedging may be one of the core reasons that SAEs consistently underperform supervised baselines. Finally, we use our understanding of feature hedging to propose an improved variant of matryoshka SAEs. Our work shows there remain fundamental issues with SAEs, but we are hopeful that highlighting feature hedging will catalyze future advances that allow SAEs to achieve their full potential of interpreting LLMs at scale.

[426] Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors

Jing Huang,Junyi Tao,Thomas Icard,Diyi Yang,Christopher Potts

Main category: cs.LG

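TL;DR: Causal-mechanism analysis can predict how a neural network behaves on out-of-distribution data; the two proposed methods, counterfactual simulation and value probing, outperform causal-agnostic approaches.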

Details

Motivation: Explore how internal causal mechanisms can be used to predict model behavior on out-of-distribution data, improving interpretability and reliability. Method: Proposes two methods: counterfactual simulation (checking whether key causal variables are realized) and value probing (using the values of those variables to predict outputs). Result: Both methods achieve high AUC-ROC in distribution and outperform causal-agnostic methods out of distribution. Conclusion: Predicting language model behavior is a novel and valuable application of internal causal analysis. Abstract: Interpretability research now offers a variety of techniques for identifying abstract internal mechanisms in neural networks. Can such techniques be used to predict how models will behave on out-of-distribution examples? In this work, we provide a positive answer to this question. Through a diverse set of language modeling tasks--including symbol manipulation, knowledge retrieval, and instruction following--we show that the most robust features for correctness prediction are those that play a distinctive causal role in the model's behavior. Specifically, we propose two methods that leverage causal mechanisms to predict the correctness of model outputs: counterfactual simulation (checking whether key causal variables are realized) and value probing (using the values of those variables to make predictions). Both achieve high AUC-ROC in distribution and outperform methods that rely on causal-agnostic features in out-of-distribution settings, where predicting model behaviors is more crucial. Our work thus highlights a novel and significant application for internal causal analysis of language models.

[427] VenusX: Unlocking Fine-Grained Functional Understanding of Proteins

Yang Tan,Wenrui Gou,Bozitao Zhong,Liang Hong,Huiqun Yu,Bingxin Zhou

Main category: cs.LG

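TL;DR: VenusX is a large-scale benchmark for fine-grained protein function annotation and function-based pairing at the residue, fragment, and domain levels, offering diverse tasks and evaluation settings.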

Details

Motivation: Although deep learning has advanced protein function prediction, a finer-grained view is needed to understand functional mechanisms and to assess the biological knowledge that models capture. Method: VenusX comprises three task categories (residue-level classification, fragment-level classification, and functional similarity scoring) built on over 878,000 samples, with cross-family and mixed-family evaluation. Result: A range of models, such as pre-trained protein language models and sequence-structure hybrids, are evaluated, giving a thorough performance comparison. Conclusion: VenusX offers a comprehensive benchmark with publicly available data and code for future research. Abstract: Deep learning models have driven significant progress in predicting protein function and interactions at the protein level. While these advancements have been invaluable for many biological applications such as enzyme engineering and function annotation, a more detailed perspective is essential for understanding protein functional mechanisms and evaluating the biological knowledge captured by models. To address this demand, we introduce VenusX, the first large-scale benchmark for fine-grained functional annotation and function-based protein pairing at the residue, fragment, and domain levels. VenusX comprises three major task categories across six types of annotations, including residue-level binary classification, fragment-level multi-class classification, and pairwise functional similarity scoring for identifying critical active sites, binding sites, conserved sites, motifs, domains, and epitopes. The benchmark features over 878,000 samples curated from major open-source databases such as InterPro, BioLiP, and SAbDab. By providing mixed-family and cross-family splits at three sequence identity thresholds, our benchmark enables a comprehensive assessment of model performance on both in-distribution and out-of-distribution scenarios. For baseline evaluation, we assess a diverse set of popular and open-source models, including pre-trained protein language models, sequence-structure hybrids, structure-based methods, and alignment-based techniques. Their performance is reported across all benchmark datasets and evaluation settings using multiple metrics, offering a thorough comparison and a strong foundation for future research. Code and data are publicly available at https://github.com/ai4protein/VenusX.

[428] J1: Exploring Simple Test-Time Scaling for LLM-as-a-Judge

Chi-Min Chan,Chunpu Xu,Jiaming Ji,Zhen Ye,Pengcheng Wen,Chunyang Jiang,Yaodong Yang,Wei Xue,Sirui Han,Yike Guo

Main category: cs.LG

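TL;DR: The paper introduces J1-7B, trained by supervised fine-tuning on reflection-enhanced datasets followed by reinforcement learning (RL), and applies Simple Test-Time Scaling (STTS) at inference; it surpasses the previous state-of-the-art LLM-as-a-Judge by 4.8% and shows a stronger scaling trend.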

Details

Motivation: Traditional evaluation methods lack interpretability; LLM-as-a-Judge offers more scalable and interpretable supervision, and combining it with large reasoning models can further improve performance and interpretability. Method: J1-7B is supervised fine-tuned on reflection-enhanced datasets, then trained with RL using verifiable rewards, and applies STTS strategies at inference. Result: J1-7B improves on the previous state of the art by 4.8% and shows a 5.1% stronger scaling trend under STTS; RL training is found to be key to STTS capability. Conclusion: RL training is central to effective test-time scaling, and J1-7B offers better interpretability and performance for AI evaluation. Abstract: The current focus of AI research is shifting from emphasizing model training towards enhancing evaluation quality, a transition that is crucial for driving further advancements in AI systems. Traditional evaluation methods typically rely on reward models assigning scalar preference scores to outputs. Although effective, such approaches lack interpretability, leaving users often uncertain about why a reward model rates a particular response as high or low. The advent of LLM-as-a-Judge provides a more scalable and interpretable method of supervision, offering insights into the decision-making process. Moreover, with the emergence of large reasoning models, which consume more tokens for deeper thinking and answer refinement, scaling test-time computation in the LLM-as-a-Judge paradigm presents an avenue for further boosting performance and providing more interpretability through reasoning traces. In this paper, we introduce J1-7B, which is first supervised fine-tuned on reflection-enhanced datasets collected via rejection-sampling and subsequently trained using Reinforcement Learning (RL) with verifiable rewards. At inference time, we apply Simple Test-Time Scaling (STTS) strategies for additional performance improvement. Experimental results demonstrate that J1-7B surpasses the previous state-of-the-art LLM-as-a-Judge by 4.8% and exhibits a 5.1% stronger scaling trend under STTS. Additionally, we present three key findings: (1) Existing LLM-as-a-Judge does not inherently exhibit such a scaling trend. (2) A model simply fine-tuned on reflection-enhanced datasets continues to demonstrate similarly weak scaling behavior. (3) A significant scaling trend emerges primarily during the RL phase, suggesting that effective STTS capability is acquired predominantly through RL training.

[429] MINGLE: Mixtures of Null-Space Gated Low-Rank Experts for Test-Time Continual Model Merging

Zihuan Qiu,Yi Xu,Chiyuan He,Fanman Meng,Linfeng Xu,Qingbo Wu,Hongliang Li

Main category: cs.LG

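TL;DR: MINGLE is a new framework for test-time continual model merging that dynamically adjusts the merging process to address parameter interference and adaptation to shifting test distributions.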

Details

Motivation: Current continual model merging methods suffer from parameter interference and limited adaptability to test distributions, causing forgetting of earlier tasks and difficulty adapting to new ones. Method: MINGLE uses a mixture-of-experts architecture with low-rank experts, combined with null-space constrained gating and an adaptive relaxation strategy, to adjust the merging process dynamically. Result: Experiments show that MINGLE significantly reduces forgetting and surpasses existing methods by 7-9% on average. Conclusion: Through dynamic adaptation and constraint strategies, MINGLE addresses key challenges in continual model merging. Abstract: Continual model merging integrates independently fine-tuned models sequentially without access to original training data, providing a scalable and efficient solution to continual learning. However, current methods still face critical challenges, notably parameter interference among tasks and limited adaptability to evolving test distributions. The former causes catastrophic forgetting of integrated tasks, while the latter hinders effective adaptation to new tasks. To address these, we propose MINGLE, a novel framework for test-time continual model merging, which leverages test-time adaptation using a small set of unlabeled test samples from the current task to dynamically guide the merging process. MINGLE employs a mixture-of-experts architecture composed of parameter-efficient, low-rank experts, enabling efficient adaptation and improving robustness to distribution shifts. To mitigate catastrophic forgetting, we propose Null-Space Constrained Gating, which restricts gating updates to subspaces orthogonal to prior task representations. This suppresses activations on old task inputs and preserves model behavior on past tasks. To further balance stability and adaptability, we design an Adaptive Relaxation Strategy, which dynamically adjusts the constraint strength based on interference signals captured during test-time adaptation. Extensive experiments on standard continual merging benchmarks demonstrate that MINGLE achieves robust generalization, reduces forgetting significantly, and consistently surpasses previous state-of-the-art methods by 7-9% on average across diverse task orders.
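
The null-space constraint described in the abstract can be illustrated with a small projection routine. This is a minimal sketch under stated assumptions, not the authors' implementation: the gate is treated as a single weight vector, prior-task representations are given as plain feature vectors, and the helper names (`orthonormal_basis`, `project_to_null_space`) are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors: Sequence[Sequence[float]], tol: float = 1e-10) -> List[Vector]:
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis: List[Vector] = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def project_to_null_space(update: Sequence[float],
                          prior_features: Sequence[Sequence[float]]) -> Vector:
    """Remove from `update` every component lying in the span of prior features.

    The result is orthogonal to each prior feature x, so a gate score w . x
    on old-task inputs is unchanged when the projected update is added to w.
    """
    projected = list(update)
    for b in orthonormal_basis(prior_features):
        c = dot(projected, b)
        projected = [p - c * bi for p, bi in zip(projected, b)]
    return projected
```

Because the projected update is orthogonal to every stored prior-task feature, adding it to the gate weights leaves the gate's score on those features unchanged. The paper's adaptive relaxation of this constraint is not modeled here.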

[430] Parameter Efficient Continual Learning with Dynamic Low-Rank Adaptation

Prashant Shivaram Bhat,Shakib Yazdani,Elahe Arani,Bahram Zonooz

Main category: cs.LG

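TL;DR: PEARL is a rehearsal-free continual learning framework that tackles catastrophic forgetting by adaptively allocating the rank of low-rank adapter (LoRA) components.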

Details

Motivation: Catastrophic forgetting is a key challenge in continual learning, and existing low-rank adapter methods are sensitive to rank selection, leading to sub-optimal resource allocation and performance. Method: PEARL uses reference task weights and dynamically allocates the rank of LoRA components according to how close the current task is to the reference task in parameter space. Result: Across several vision architectures and continual learning scenarios, PEARL outperforms the baselines by a large margin. Conclusion: Dynamic rank allocation lets PEARL mitigate catastrophic forgetting and improve continual learning performance. Abstract: Catastrophic forgetting has remained a critical challenge for deep neural networks in Continual Learning (CL) as it undermines consolidated knowledge when learning new tasks. Parameter efficient fine tuning CL techniques are gaining traction for their effectiveness in addressing catastrophic forgetting with a lightweight training schedule while avoiding degradation of consolidated knowledge in pre-trained models. However, low rank adapters (LoRA) in these approaches are highly sensitive to rank selection which can lead to sub-optimal resource allocation and performance. To this end, we introduce PEARL, a rehearsal-free CL framework that entails dynamic rank allocation for LoRA components during CL training. Specifically, PEARL leverages reference task weights and adaptively determines the rank of task-specific LoRA components based on the current tasks' proximity to reference task weights in parameter space. To demonstrate the versatility of PEARL, we evaluate it across three vision architectures (ResNet, Separable Convolutional Network and Vision Transformer) and a multitude of CL scenarios, and show that PEARL outperforms all considered baselines by a large margin.

[431] Reward Inside the Model: A Lightweight Hidden-State Reward Model for LLM's Best-of-N sampling

Jizhou Guo,Zhaomin Wu,Philip S. Yu

Main category: cs.LG

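TL;DR: ELHSR is an efficient, parameter-light reward model that uses LLM hidden-state information, clearly outperforming baselines at low computational cost.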

Details

Motivation: Current reward models are computationally expensive and parameter-heavy, which limits practical use, so a more efficient solution is needed. Method: Proposes the ELHSR model, which uses LLM hidden-state information, has very few parameters, and needs few training samples. Result: ELHSR clearly outperforms baselines, is far more computationally efficient, and also applies to some closed-source LLMs. Conclusion: ELHSR is an efficient, general reward model that can be used alone or combined with traditional reward models for further gains. Abstract: High-quality reward models are crucial for unlocking the reasoning potential of large language models (LLMs), with best-of-N voting demonstrating significant performance gains. However, current reward models, which typically operate on the textual output of LLMs, are computationally expensive and parameter-heavy, limiting their real-world applications. We introduce the Efficient Linear Hidden State Reward (ELHSR) model - a novel, highly parameter-efficient approach that leverages the rich information embedded in LLM hidden states to address these issues. ELHSR systematically outperforms baselines with less than 0.005% of the parameters of baselines, requiring only a few samples for training. ELHSR also achieves orders-of-magnitude efficiency improvement with significantly less time and fewer FLOPs per sample than baseline reward models. Moreover, ELHSR exhibits robust performance even when trained only on logits, extending its applicability to some closed-source LLMs. In addition, ELHSR can also be combined with traditional reward models to achieve additional performance gains.
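
To make the idea of a linear reward over hidden states concrete, here is a minimal best-of-N sketch. It rests on assumptions the abstract does not confirm: each candidate is represented by a list of per-token hidden-state vectors, the reward is a per-token linear score averaged over tokens, and the weights are already trained. The names `linear_reward` and `best_of_n` are hypothetical.

```python
from typing import List, Sequence

HiddenStates = Sequence[Sequence[float]]  # one vector per generated token


def linear_reward(hidden_states: HiddenStates,
                  weights: Sequence[float],
                  bias: float = 0.0) -> float:
    """Average of per-token linear scores w . h + b (assumed aggregation)."""
    token_scores = [
        sum(w * h for w, h in zip(weights, state)) + bias
        for state in hidden_states
    ]
    return sum(token_scores) / len(token_scores)


def best_of_n(candidates: List[HiddenStates],
              weights: Sequence[float],
              bias: float = 0.0) -> int:
    """Index of the candidate with the highest linear reward."""
    return max(
        range(len(candidates)),
        key=lambda i: linear_reward(candidates[i], weights, bias),
    )
```

The scorer here has only one weight per hidden dimension plus a bias, which is the kind of parameter budget a linear hidden-state reward implies. How the weights are trained is outside this sketch.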

Ali Gholamzadeh,Noor Sajid

Main category: cs.LG

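TL;DR: Proposes a semi-supervised approach that aligns models across modalities via conditional flow matching; it is data-efficient and needs minimal supervision.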

Details

Motivation: Model reuse across modalities is limited by the difficulty of aligning internal representations; existing methods need large amounts of paired data or are restricted to specific domains. Method: Uses conditional flow matching to learn a conditional flow between the latent spaces of different modalities, either by solving an optimal transport problem or by aligning with labelled exemplars. Result: On datasets such as MNIST and ImageNet, downstream task performance matches that of end-to-end trained models, particularly when labelled data is scarce. Conclusion: The method offers a data-efficient solution for inter-modal model alignment when supervision is limited. Abstract: Foundation models have demonstrated remarkable performance across modalities such as language and vision. However, model reuse across distinct modalities (e.g., text and vision) remains limited due to the difficulty of aligning internal representations. Existing methods require extensive paired training data or are constrained to specific domains. We introduce a semi-supervised approach for model alignment via conditional flow matching. The conditional flow between latent spaces of different modalities (e.g., text-to-image or biological-to-artificial neuronal activity) can be learned in two settings: (1) solving a (balanced or unbalanced) optimal transport problem with an inter-space bridge cost, and (2) performing memory-efficient alignment using labelled exemplars. Despite being constrained by the original models' capacity, our method--under both settings--matches downstream task performance of end-to-end trained models on object recognition and image generation tasks across MNIST, ImageNet, and Majaj et al. (2015) datasets, particularly when labelled training data is scarce (<20%). Our method provides a data-efficient solution for inter-modal model alignment with minimal supervision.

[433] UFO-RL: Uncertainty-Focused Optimization for Efficient Reinforcement Learning Data Selection

Yang Zhao,Kai Xiong,Xiao Ding,Li Du,YangouOuyang,Zhouhao Sun,Jiannan Guan,Wenbin Zhang,Bin Liu,Dong Hu,Bing Qin,Ting Liu

Main category: cs.LG

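TL;DR: UFO-RL is an efficient data selection framework that uses single-pass uncertainty estimation to quickly identify valuable data, substantially improving the efficiency of reinforcement learning for LLMs.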

Details

Motivation: Conventional multi-sampling methods are computationally expensive, and LLMs learn best from data within their potential comprehension zone (the Zone of Proximal Development, ZPD). Method: Introduces the UFO-RL framework, which uses single-pass uncertainty estimation to efficiently select data within the ZPD. Result: Training on just 10% of the data matches or surpasses full-data training while cutting training time by up to 16x. Conclusion: UFO-RL provides an efficient and practical data selection strategy for RL with LLMs. Abstract: Scaling RL for LLMs is computationally expensive, largely due to multi-sampling for policy optimization and evaluation, making efficient data selection crucial. Inspired by the Zone of Proximal Development (ZPD) theory, we hypothesize LLMs learn best from data within their potential comprehension zone. Addressing the limitation of conventional, computationally intensive multi-sampling methods for data assessment, we introduce UFO-RL. This novel framework uses a computationally efficient single-pass uncertainty estimation to identify informative data instances, achieving up to 185x faster data evaluation. UFO-RL leverages this metric to select data within the estimated ZPD for training. Experiments show that training with just 10% of data selected by UFO-RL yields performance comparable to or surpassing full-data training, reducing overall training time by up to 16x while enhancing stability and generalization. UFO-RL offers a practical and highly efficient strategy for scaling RL fine-tuning of LLMs by focusing learning on valuable data.
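
The selection step can be sketched as follows. The abstract does not specify the uncertainty metric or how the ZPD is delimited, so both are assumptions here: confidence is taken as the mean token probability from one forward pass, and the ZPD is approximated as the samples whose confidence lies closest to a mid-range target. The names `mean_confidence` and `select_zpd` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def mean_confidence(token_logprobs: Sequence[float]) -> float:
    """Mean token probability from a single forward pass (assumed metric)."""
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)


def select_zpd(samples: Dict[str, Sequence[float]],
               fraction: float = 0.1,
               target: float = 0.5) -> List[str]:
    """Keep the `fraction` of samples whose confidence is nearest `target`.

    Samples the model finds trivially easy (confidence near 1) or far too
    hard (confidence near 0) are ranked last and dropped first.
    """
    k = max(1, round(len(samples) * fraction))
    ranked = sorted(
        samples,
        key=lambda name: abs(mean_confidence(samples[name]) - target),
    )
    return ranked[:k]
```

One forward pass per sample is all this needs, in contrast to multi-sampling schemes that generate several rollouts per prompt before scoring it.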

[434] Mitigating Hallucinations via Inter-Layer Consistency Aggregation in Large Vision-Language Models

Kai Tang,Jinhao You,Xiuqi Ge,Hanze Li,Yichen Guo,Xiande Huang

Main category: cs.LG

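TL;DR: Proposes DCLA, a decoding mechanism that requires no retraining and strengthens inter-layer consistency through layer aggregation, reducing hallucinations in large vision-language models.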

Details

Motivation: Despite their strong capabilities, large vision-language models (LVLMs) still tend to hallucinate content inconsistent with the input image, and existing methods are unstable and sensitive to hyperparameters. Method: Proposes the DCLA decoding mechanism, which builds a dynamic semantic reference by aggregating representations from previous layers and corrects semantically deviated layers to enforce inter-layer consistency. Result: On benchmarks such as MME and POPE, DCLA reduces hallucinations and improves model reliability and performance. Conclusion: DCLA is a practical method that reduces LVLM hallucinations stably without extra training or external knowledge. Abstract: Despite the impressive capabilities of Large Vision-Language Models (LVLMs), they remain susceptible to hallucinations, that is, generating content that is inconsistent with the input image. Existing training-free hallucination mitigation methods often suffer from unstable performance and high sensitivity to hyperparameter settings, limiting their practicality and broader adoption. In this paper, we propose a novel decoding mechanism, Decoding with Inter-layer Consistency via Layer Aggregation (DCLA), which requires no retraining, fine-tuning, or access to external knowledge bases. Specifically, our approach constructs a dynamic semantic reference by aggregating representations from previous layers, and corrects semantically deviated layers to enforce inter-layer consistency. The method allows DCLA to robustly mitigate hallucinations across multiple LVLMs. Experiments on hallucination benchmarks such as MME and POPE demonstrate that DCLA effectively reduces hallucinations while enhancing the reliability and performance of LVLMs.

[435] STAR: Stage-Wise Attention-Guided Token Reduction for Efficient Large Vision-Language Models Inference

Yichen Guo,Hanze Li,Zonghao Zhang,Jinhao You,Kai Tang,Xiande Huang

Main category: cs.LG

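TL;DR: STAR is a training-free, plug-and-play two-stage visual token pruning framework that takes a global view to cut computational cost substantially while preserving task performance.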

Details

Motivation: Existing single-stage token pruning methods overlook the global information flow, causing marked performance drops at high pruning ratios. Method: STAR prunes in two stages: an early stage based on visual self-attention removes redundant low-level features, and a later stage based on cross-modal attention discards task-irrelevant tokens. Result: Across multiple LVLM architectures and benchmarks, STAR gives strong acceleration with performance comparable to, and in some cases better than, the original model. Conclusion: By pruning in two stages from a global perspective, STAR balances computational efficiency and task performance. Abstract: Although large vision-language models (LVLMs) leverage rich visual token representations to achieve strong performance on multimodal tasks, these tokens also introduce significant computational overhead during inference. Existing training-free token pruning methods typically adopt a single-stage strategy, focusing either on visual self-attention or visual-textual cross-attention. However, such localized perspectives often overlook the broader information flow across the model, leading to substantial performance degradation, especially under high pruning ratios. In this work, we propose STAR (Stage-wise Attention-guided token Reduction), a training-free, plug-and-play framework that approaches token pruning from a global perspective. Instead of pruning at a single point, STAR performs attention-guided reduction in two complementary stages: an early-stage pruning based on visual self-attention to remove redundant low-level features, and a later-stage pruning guided by cross-modal attention to discard task-irrelevant tokens. This holistic approach allows STAR to significantly reduce computational cost while better preserving task-critical information. Extensive experiments across multiple LVLM architectures and benchmarks show that STAR achieves strong acceleration while maintaining comparable, and in some cases even improved performance.
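
The two-stage structure described above can be sketched in a few lines. This is an illustration under assumptions, not the STAR implementation: attention is reduced to one precomputed importance score per visual token for each stage, pruning is a plain top-k cut, and the names `keep_top` and `two_stage_prune` plus the default ratios are hypothetical.

```python
from typing import Dict, List, Sequence


def keep_top(token_ids: Sequence[int],
             scores: Dict[int, float],
             keep_ratio: float) -> List[int]:
    """Keep the highest-scoring tokens, returned in original (sorted) order."""
    k = max(1, int(len(token_ids) * keep_ratio))
    ranked = sorted(token_ids, key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:k])


def two_stage_prune(token_ids: Sequence[int],
                    self_attn: Dict[int, float],
                    cross_attn: Dict[int, float],
                    early_ratio: float = 0.5,
                    late_ratio: float = 0.5) -> List[int]:
    """Stage 1: drop redundant tokens by visual self-attention score.
    Stage 2: drop task-irrelevant survivors by cross-modal attention score.
    """
    survivors = keep_top(token_ids, self_attn, early_ratio)
    return keep_top(survivors, cross_attn, late_ratio)
```

In a real model the two stages would run at different depths, with the second set of scores computed only after the text prompt has interacted with the surviving visual tokens. Here both score tables are supplied up front for simplicity.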

[436] Enhancing Latent Computation in Transformers with Latent Tokens

Yuchang Sun,Yanxi Chen,Yaliang Li,Bolin Ding

Main category: cs.LG

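TL;DR: Proposes latent tokens, a lightweight method that improves large language models through the attention mechanism, and verifies its effectiveness for out-of-distribution generalization.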

Details

Motivation: Improve the performance of large language models (LLMs), in particular their adaptability and generalization, by means of auxiliary tokens. Method: Introduces latent tokens that need not be interpretable in natural language and that steer autoregressive decoding via the attention mechanism; they are trained parameter-efficiently and applied flexibly at inference time. Result: Latent tokens noticeably outperform the baselines, especially in out-of-distribution generalization scenarios. Conclusion: Latent tokens are an effective, lightweight way to improve the adaptability and performance of LLMs. Abstract: Augmenting large language models (LLMs) with auxiliary tokens has emerged as a promising strategy for enhancing model performance. In this work, we introduce a lightweight method termed latent tokens; these are dummy tokens that may be non-interpretable in natural language but steer the autoregressive decoding process of a Transformer-based LLM via the attention mechanism. The proposed latent tokens can be seamlessly integrated with a pre-trained Transformer, trained in a parameter-efficient manner, and applied flexibly at inference time, while adding minimal complexity overhead to the existing infrastructure of standard Transformers. We propose several hypotheses about the underlying mechanisms of latent tokens and design synthetic tasks accordingly to verify them. Numerical results confirm that the proposed method noticeably outperforms the baselines, particularly in the out-of-distribution generalization scenarios, highlighting its potential in improving the adaptability of LLMs.

[437] Observe-R1: Unlocking Reasoning Abilities of MLLMs with Dynamic Progressive Reinforcement Learning

Zirun Guo,Minjie Hong,Tao Jin

Main category: cs.LG

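TL;DR: Observe-R1 is a novel framework that uses reinforcement learning to improve the reasoning of multimodal large language models, adopting a gradual learning paradigm together with a dataset and a reward system designed to improve training.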

Details

Motivation: Reinforcement learning shows promise for improving language model reasoning, but adapting it to multimodal data and formats remains underexplored. Method: Proposes the Observe-R1 framework, builds the NeuraLadder dataset, adopts a gradual learning paradigm, and introduces a multimodal format constraint and a reward system. Result: Experiments on Qwen2.5-VL models show that Observe-R1 outperforms other models on reasoning and general benchmarks, with clearer and more concise reasoning chains. Conclusion: Through gradual learning and its optimization strategies, Observe-R1 improves the reasoning of multimodal models and is robust and generalizable. Abstract: Reinforcement Learning (RL) has shown promise in improving the reasoning abilities of Large Language Models (LLMs). However, the specific challenges of adapting RL to multimodal data and formats remain relatively unexplored. In this work, we present Observe-R1, a novel framework aimed at enhancing the reasoning capabilities of multimodal large language models (MLLMs). We draw inspiration from human learning progression--from simple to complex and easy to difficult, and propose a gradual learning paradigm for MLLMs. To this end, we construct the NeuraLadder dataset, which is organized and sampled according to the difficulty and complexity of data samples for RL training. To tackle multimodal tasks, we introduce a multimodal format constraint that encourages careful observation of images, resulting in enhanced visual abilities and clearer and more structured responses. Additionally, we implement a bonus reward system that favors concise, correct answers within a length constraint, alongside a dynamic weighting mechanism that prioritizes uncertain and medium-difficulty problems, ensuring that more informative samples have a greater impact on training. Our experiments with the Qwen2.5-VL-3B and Qwen2.5-VL-7B models on 20k samples from the NeuraLadder dataset show that Observe-R1 outperforms a series of larger reasoning models on both reasoning and general benchmarks, achieving superior clarity and conciseness in reasoning chains. Ablation studies validate the effectiveness of our strategies, highlighting the robustness and generalization of our approach. The dataset and code will be released at https://github.com/zrguo/Observe-R1.

[438] Joint Embedding vs Reconstruction: Provable Benefits of Latent Space Prediction for Self Supervised Learning

Hugues Van Assel,Mark Ibrahim,Tommaso Biancalani,Aviv Regev,Randall Balestriero

Main category: cs.LG

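TL;DR: The paper compares the reconstruction and joint-embedding paradigms in self-supervised learning, exposes their core mechanisms and where each applies, and shows that joint embedding is preferable under specific conditions.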

Details

Motivation: Provide clear guidance for choosing between the reconstruction and joint-embedding paradigms in self-supervised learning and expose their core differences. Method: Analyzes both paradigms through closed-form solutions and studies how data augmentation affects the learned representations. Result: Both paradigms require a minimal alignment between augmentations and irrelevant features; joint embedding is preferable when the irrelevant features are large in magnitude. Conclusion: The results substantiate the empirical success of joint embedding on challenging real-world datasets and give theoretical support for choosing between the paradigms. Abstract: Reconstruction and joint embedding have emerged as two leading paradigms in Self Supervised Learning (SSL). Reconstruction methods focus on recovering the original sample from a different view in input space. On the other hand, joint embedding methods align the representations of different views in latent space. Both approaches offer compelling advantages, yet practitioners lack clear guidelines for choosing between them. In this work, we unveil the core mechanisms that distinguish each paradigm. By leveraging closed form solutions for both approaches, we precisely characterize how the view generation process, e.g. data augmentation, impacts the learned representations. We then demonstrate that, unlike supervised learning, both SSL paradigms require a minimal alignment between augmentations and irrelevant features to achieve asymptotic optimality with increasing sample size. Our findings indicate that in scenarios where these irrelevant features have a large magnitude, joint embedding methods are preferable because they impose a strictly weaker alignment condition compared to reconstruction based methods. These results not only clarify the trade offs between the two paradigms but also substantiate the empirical success of joint embedding approaches on real world challenging datasets.

[439] Rethinking Reward Model Evaluation Through the Lens of Reward Overoptimization

Sunghwan Kim,Dongjin Kang,Taeyoon Kwon,Hyungjoo Chae,Dongha Lee,Jinyoung Yeo

Main category: cs.LG

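TL;DR: The paper examines how to build reliable reward model (RM) benchmarks, notes that existing benchmarks correlate weakly with optimized-policy performance, and uses the reward overoptimization phenomenon to improve evaluation design.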

Details

Motivation: Existing reward model benchmarks correlate weakly with the performance of optimized policies and fail to assess the true capabilities of RMs, so evaluation needs to improve. Method: Designs evaluations through the lens of reward overoptimization and analyzes how to build a reliable benchmark, including minimizing differences between chosen and rejected responses, comparing across a wide range of responses, and sourcing responses from diverse models. Result: Identifies three key points for building a reliable benchmark, and notes that an extremely high correlation with the degree of overoptimization can lower the correlation with downstream performance. Conclusion: The degree of overoptimization is a useful tool for benchmark design but should not be the end goal. Abstract: Reward models (RMs) play a crucial role in reinforcement learning from human feedback (RLHF), aligning model behavior with human preferences. However, existing benchmarks for reward models show a weak correlation with the performance of optimized policies, suggesting that they fail to accurately assess the true capabilities of RMs. To bridge this gap, we explore several evaluation designs through the lens of reward overoptimization, a phenomenon that captures both how well the reward model aligns with human preferences and the dynamics of the learning signal it provides to the policy. The results highlight three key findings on how to construct a reliable benchmark: (i) it is important to minimize differences between chosen and rejected responses beyond correctness, (ii) evaluating reward models requires multiple comparisons across a wide range of chosen and rejected responses, and (iii) given that reward models encounter responses with diverse representations, responses should be sourced from a variety of models. However, we also observe that an extremely high correlation with the degree of overoptimization leads to comparatively lower correlation with certain downstream performance. Thus, when designing a benchmark, it is desirable to use the degree of overoptimization as a useful tool, rather than the end goal.

[440] Scalable Strategies for Continual Learning with Replay

Truman Hickok

Main category: cs.LG

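TL;DR: The paper addresses key challenges in continual learning and combines low-rank adaptation, phasic replay, and sequential merging, improving scalability and performance.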

Details

Motivation: Future deep learning models will need to learn continually, but existing methods such as replay are costly and have not been fully combined with multi-task fine-tuning techniques. Method: Applies low-rank adaptation to continual learning, and proposes phasic replay (needing up to 55% fewer samples) and sequential merging. Result: The developed methods work synergistically, forming an efficient toolset that outperforms standalone variants. Conclusion: Combining low-rank adaptation, replay, and merging yields a scalable solution for continual learning. Abstract: Future deep learning models will be distinguished by systems that perpetually learn through interaction, imagination, and cooperation, blurring the line between training and inference. This makes continual learning a critical challenge, as methods that efficiently maximize bidirectional transfer across learning trajectories will be essential. Replay is on track to play a foundational role in continual learning, allowing models to directly reconcile new information with past knowledge. In practice, however, replay is quite unscalable, doubling the cost of continual learning when applied naively. Moreover, the continual learning literature has not fully synchronized with the multi-task fine-tuning literature, having not fully integrated highly scalable techniques like model merging and low rank adaptation into a replay-enabled toolset that can produce a unified model in the face of many sequential tasks. In this paper, we begin by applying and analyzing low rank adaptation in a continual learning setting. Next, we introduce consolidation, a phasic approach to replay which leads to up to 55% fewer replay samples being needed for a given performance target. Then, we propose sequential merging, an offshoot of task arithmetic which is tailored to the continual learning setting and is shown to work well in combination with replay. Finally, we demonstrate that the developed strategies can operate synergistically, resulting in a highly scalable toolset that outperforms standalone variants.

[441] GEM: Gaussian Embedding Modeling for Out-of-Distribution Detection in GUI Agents

Zheng Wu,Pengzhou Cheng,Zongru Wu,Lingzhong Dong,Zhuosheng Zhang

Main category: cs.LG

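TL;DR: The paper proposes GEM, a new method for detecting out-of-distribution (OOD) instructions in GUI agents by fitting a Gaussian mixture model over input embedding distances, markedly improving detection accuracy.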

Details

Motivation: GUI agents may fail tasks or pose security threats when executing instructions that exceed their capabilities or violate environmental constraints, so effective OOD detection is needed. Method: Observing that the input semantic space of GUI agents shows a clustering pattern, the paper proposes GEM, which fits a Gaussian mixture model over input embedding distances to delimit the agent's capability boundary. Result: Experiments on eight datasets show an average accuracy improvement of 23.70% over the best baseline, and generalization is verified on nine different backbones. Conclusion: GEM performs well for OOD detection in GUI agents and has broad application potential. Abstract: Graphical user interface (GUI) agents have recently emerged as an intriguing paradigm for human-computer interaction, capable of automatically executing user instructions to operate intelligent terminal devices. However, when encountering out-of-distribution (OOD) instructions that violate environmental constraints or exceed the current capabilities of agents, GUI agents may suffer task breakdowns or even pose security threats. Therefore, effective OOD detection for GUI agents is essential. Traditional OOD detection methods perform suboptimally in this domain due to the complex embedding space and evolving GUI environments. In this work, we observe that the in-distribution input semantic space of GUI agents exhibits a clustering pattern with respect to the distance from the centroid. Based on this finding, we propose GEM, a novel method based on fitting a Gaussian mixture model over input embedding distances extracted from the GUI Agent that reflect its capability boundary. Evaluated on eight datasets spanning smartphones, computers, and web browsers, our method achieves an average accuracy improvement of 23.70% over the best-performing baseline. Analysis verifies the generalization ability of our method through experiments on nine different backbones. The code is available at https://github.com/Wuzheng02/GEM-OODforGUIagents.

[442] Does Low Rank Adaptation Lead to Lower Robustness against Training-Time Attacks?

Zi Liang,Haibo Hu,Qingqing Ye,Yaxin Xiao,Ronghua Li

Main category: cs.LG

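TL;DR: LoRA is efficient for fine-tuning large language models, but its security under training-time attacks is underexplored. A theoretical analysis finds that LoRA is more robust to backdoor attacks yet more vulnerable to untargeted data poisoning.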

Details

Motivation: Study the security of LoRA during fine-tuning, in particular its robustness to data poisoning and backdoor attacks. Method: Proposes an analytical framework that combines the neural tangent kernel with information theory to relate LoRA's low-rank structure to its security. Result: LoRA is more robust to backdoor attacks but more vulnerable to untargeted data poisoning. Conclusion: LoRA's low-rank structure cuts both ways for security, so efficiency and security must be traded off in design. Abstract: Low rank adaptation (LoRA) has emerged as a prominent technique for fine-tuning large language models (LLMs) thanks to its superb efficiency gains over previous methods. While extensive studies have examined the performance and structural properties of LoRA, its behavior upon training-time attacks remains underexplored, posing significant security risks. In this paper, we theoretically investigate the security implications of LoRA's low-rank structure during fine-tuning, in the context of its robustness against data poisoning and backdoor attacks. We propose an analytical framework that models LoRA's training dynamics, employs the neural tangent kernel to simplify the analysis of the training process, and applies information theory to establish connections between LoRA's low rank structure and its vulnerability against training-time attacks. Our analysis indicates that LoRA exhibits better robustness to backdoor attacks than full fine-tuning, while becoming more vulnerable to untargeted data poisoning due to its over-simplified information geometry. Extensive experimental evaluations have corroborated our theoretical findings.

[443] An approach based on class activation maps for investigating the effects of data augmentation on neural networks for image classification

Lucas M. Dorneles,Luan Fonseca Garcia,Joel Luís Carbonera

Main category: cs.LG

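TL;DR: The paper proposes a methodology for analyzing how data augmentation affects the patterns learned by convolutional neural networks in image classification, using class activation maps and metrics derived from them to quantify the effects of different augmentation strategies.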

Details

Motivation: Although data augmentation is widely used, little research examines how it affects the patterns neural networks learn; this work aims to fill that gap. Method: Uses class activation maps to identify and measure the importance a model assigns to image pixels, and analyzes the similarities and differences between augmentation strategies through metrics extracted from those maps. Result: Experiments suggest the method can analyze the effects of data augmentation and identify distinct impact profiles on trained models. Conclusion: The proposed methodology offers a new tool for quantifying how data augmentation affects learned patterns, which can help refine augmentation strategies. Abstract: Neural networks have become increasingly popular in the last few years as an effective tool for the task of image classification due to the impressive performance they have achieved on this task. In image classification tasks, it is common to use data augmentation strategies to increase the robustness of trained networks to changes in the input images and to avoid overfitting. Although data augmentation is a widely adopted technique, the literature lacks a body of research analyzing the effects data augmentation methods have on the patterns learned by neural network models working on complex datasets. The primary objective of this work is to propose a methodology and set of metrics that may allow a quantitative approach to analyzing the effects of data augmentation in convolutional networks applied to image classification. An important tool used in the proposed approach lies in the concept of class activation maps for said models, which allow us to identify and measure the importance these models assign to each individual pixel in an image when executing the classification task. From these maps, we may then extract metrics over the similarities and differences between maps generated by these models trained on a given dataset with different data augmentation strategies. Experiments made using this methodology suggest that the effects of these data augmentation techniques not only can be analyzed in this way but also allow us to identify different impact profiles over the trained models.
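
Comparing class activation maps from two models can be sketched with two simple metrics. The abstract does not name the paper's metrics, so the choices here are assumptions for illustration: cosine similarity over the flattened maps, and intersection-over-union of the regions above a threshold. Maps are plain nested lists, and the names `cosine_similarity` and `hot_region_iou` are hypothetical.

```python
import math
from typing import Sequence

ActivationMap = Sequence[Sequence[float]]  # rows of per-pixel importance


def cosine_similarity(map_a: ActivationMap, map_b: ActivationMap) -> float:
    """Cosine similarity between two flattened activation maps."""
    flat_a = [v for row in map_a for v in row]
    flat_b = [v for row in map_b for v in row]
    num = sum(a * b for a, b in zip(flat_a, flat_b))
    den = math.sqrt(sum(a * a for a in flat_a)) * math.sqrt(sum(b * b for b in flat_b))
    return num / den if den else 0.0


def hot_region_iou(map_a: ActivationMap,
                   map_b: ActivationMap,
                   threshold: float = 0.5) -> float:
    """IoU of the pixel sets whose importance is at least `threshold`."""
    hot_a = {(i, j) for i, row in enumerate(map_a)
             for j, v in enumerate(row) if v >= threshold}
    hot_b = {(i, j) for i, row in enumerate(map_b)
             for j, v in enumerate(row) if v >= threshold}
    union = hot_a | hot_b
    # Two maps with no hot pixels are treated as identical.
    return len(hot_a & hot_b) / len(union) if union else 1.0
```

Applied to the same image under models trained with different augmentation strategies, a low score on either metric would indicate that the models attend to different regions.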

[444] Two out of Three (ToT): using self-consistency to make robust predictions

Jung Hoon Lee,Sujith Vijayan

Main category: cs.LG

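TL;DR: The paper proposes an algorithm named 'Two out of Three (ToT)' that aims to make the decisions of deep learning models more robust by letting them abstain from answering when they are uncertain.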

Details

Motivation: The operating principles of deep learning remain poorly understood, which can lead to serious consequences in high-stakes domains, so models need to make more robust decisions. Method: ToT generates two alternative predictions in addition to the original model prediction and uses them to decide whether to provide an answer. Result: The approach lets a model abstain when it is uncertain, with the aim of reducing the risk of erroneous decisions. Conclusion: ToT is a potential way to make deep learning models safer to deploy in high-stakes domains. Abstract: Deep learning (DL) can automatically construct intelligent agents, deep neural networks (alternatively, DL models), that can outperform humans in certain tasks. However, the operating principles of DL remain poorly understood, making its decisions incomprehensible. As a result, it poses a great risk to deploy DL in high-stakes domains in which mistakes or errors may lead to critical consequences. Here, we aim to develop an algorithm that can help DL models make more robust decisions by allowing them to abstain from answering when they are uncertain. Our algorithm, named 'Two out of Three (ToT)', is inspired by the sensitivity of the human brain to conflicting information. ToT creates two alternative predictions in addition to the original model prediction and uses the alternative predictions to decide whether it should provide an answer or not.
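
The abstention idea in [444] can be illustrated with a minimal sketch. The abstract does not state the exact decision rule, so the rule below (answer only when at least one of the two alternative predictions agrees with the original one) is an assumption suggested by the name, and `two_out_of_three` is a hypothetical helper rather than the authors' code.

```python
from typing import Hashable, Optional, Sequence


def two_out_of_three(
    original: Hashable,
    alternatives: Sequence[Hashable],
) -> Optional[Hashable]:
    """Return the original prediction, or None to abstain.

    Assumed rule (not taken from the paper): answer only if at least
    one of the two alternative predictions matches the original one.
    """
    if len(alternatives) != 2:
        raise ValueError("expected exactly two alternative predictions")
    if any(alt == original for alt in alternatives):
        return original
    return None  # conflicting evidence: abstain


if __name__ == "__main__":
    print(two_out_of_three("cat", ["cat", "dog"]))  # agreement: answers
    print(two_out_of_three("cat", ["dog", "bird"]))  # conflict: abstains
```

How the two alternative predictions are produced is left open here; in this sketch they are simply passed in by the caller.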

[445] On the Mechanisms of Adversarial Data Augmentation for Robust and Adaptive Transfer Learning

Hana Satou,Alan Mitkiy

Main category: cs.LG

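TL;DR: The paper studies the role of adversarial data augmentation (ADA) in improving robustness and adaptivity in transfer learning, and proposes a unified framework combining ADA, consistency regularization, and domain-invariant representation learning; experiments show consistent gains in target-domain performance.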

Details

Motivation: Address the challenge of transfer learning under domain distribution shift and explore the potential of adversarial perturbations as a data augmentation tool. Method: A unified framework integrates adversarial data augmentation, consistency regularization, and domain-invariant representation learning. Result: Target-domain performance improves consistently on multiple benchmark datasets, including VisDA, DomainNet, and Office-Home. Conclusion: Adversarial learning can be turned from a destructive attack into a regularizer for cross-domain transferability. Abstract: Transfer learning across domains with distribution shift remains a fundamental challenge in building robust and adaptable machine learning systems. While adversarial perturbations are traditionally viewed as threats that expose model vulnerabilities, recent studies suggest that they can also serve as constructive tools for data augmentation. In this work, we systematically investigate the role of adversarial data augmentation (ADA) in enhancing both robustness and adaptivity in transfer learning settings. We analyze how adversarial examples, when used strategically during training, improve domain generalization by enriching decision boundaries and reducing overfitting to source-domain-specific features. We further propose a unified framework that integrates ADA with consistency regularization and domain-invariant representation learning. Extensive experiments across multiple benchmark datasets -- including VisDA, DomainNet, and Office-Home -- demonstrate that our method consistently improves target-domain performance under both unsupervised and few-shot domain adaptation settings. Our results highlight a constructive perspective of adversarial learning, transforming perturbation from a destructive attack into a regularizing force for cross-domain transferability.

[446] Leveraging LLM Inconsistency to Boost Pass@k Performance

Uri Dalal,Meirav Segal,Zvika Ben-Haim,Dan Lahav,Omer Nevo

Main category: cs.LG

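TL;DR: The paper proposes a new method that exploits the inconsistency of large language models (LLMs) to boost Pass@k performance: a "Variator" agent generates variants of a task and submits one candidate solution per variant.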

Details

Motivation: LLMs perform impressively in many domains but respond inconsistently to minor input changes; the paper treats this inconsistency as a potential advantage rather than a drawback. Method: A "Variator" agent generates k variants of a task and submits one candidate solution for each; the approach is task-agnostic and compatible with free-form inputs. Result: A probabilistic analysis and experiments show that the method outperforms the baseline on the APPS dataset, and that inconsistency persists in frontier reasoning models across coding and cybersecurity domains. Conclusion: Exploiting model inconsistency is broadly applicable and is likely to remain relevant for future model generations. Abstract: Large language models (LLMs) achieve impressive abilities in numerous domains, but exhibit inconsistent performance in response to minor input changes. Rather than view this as a drawback, in this paper we introduce a novel method for leveraging models' inconsistency to boost Pass@k performance. Specifically, we present a "Variator" agent that generates k variants of a given task and submits one candidate solution for each one. Our variant generation approach is applicable to a wide range of domains as it is task agnostic and compatible with free-form inputs. We demonstrate the efficacy of our agent theoretically using a probabilistic model of the inconsistency effect, and show empirically that it outperforms the baseline on the APPS dataset. Furthermore, we establish that inconsistency persists even in frontier reasoning models across coding and cybersecurity domains, suggesting our method is likely to remain relevant for future model generations.
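
The abstract of [446] uses Pass@k without defining it. As background, the sketch below shows the unbiased Pass@k estimator that is widely used in code-generation evaluation: given `n` sampled solutions of which `c` are correct, it estimates the probability that at least one of `k` randomly drawn samples is correct. Whether the paper computes its metric in exactly this way is an assumption, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of Pass@k from n samples with c correct ones.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    k samples drawn without replacement are all incorrect.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the result is 1.0 there.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 samples, 3 correct: chance that a single draw is correct.
    print(pass_at_k(n=10, c=3, k=1))
    # 4 samples, 1 correct, 2 draws.
    print(pass_at_k(n=4, c=1, k=2))
```

The estimator assumes the k attempts are drawn from one pool of samples for the same prompt; the Variator setting, with one attempt per task variant, is modeled separately in the paper.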

[447] Structure-based Anomaly Detection and Clustering

Filippo Leveni

Main category: cs.LG

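TL;DR: The thesis proposes several unsupervised anomaly detection methods, including the structure-based PIF and MultiLink and the streaming-oriented Online-iForest, and reports strong performance across a range of settings.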

Details

Motivation: Anomaly detection is essential in domains such as healthcare, manufacturing, and cybersecurity, but existing methods fall short in structured and streaming data settings. Method: The thesis proposes Preference Isolation Forest (PIF) with its variants Voronoi-iForest and RuzHash-iForest, plus Sliding-PIF for streaming data; MultiLink for structure-based clustering; Online-iForest for anomaly detection in data streams; and a Gradient Boosting classifier enhanced with MaxLogit for cybersecurity. Result: PIF and MultiLink outperform existing methods on synthetic and real datasets; Online-iForest is efficient on streaming data; the enhanced Gradient Boosting classifier has been integrated into a production system. Conclusion: The proposed methods are efficient and robust across several areas of anomaly detection and have practical value. Abstract: Anomaly detection is a fundamental problem in domains such as healthcare, manufacturing, and cybersecurity. This thesis proposes new unsupervised methods for anomaly detection in both structured and streaming data settings. In the first part, we focus on structure-based anomaly detection, where normal data follows low-dimensional manifolds while anomalies deviate from them. We introduce Preference Isolation Forest (PIF), which embeds data into a high-dimensional preference space via manifold fitting, and isolates outliers using two variants: Voronoi-iForest, based on geometric distances, and RuzHash-iForest, leveraging Locality Sensitive Hashing for scalability. We also propose Sliding-PIF, which captures local manifold information for streaming scenarios. Our methods outperform existing techniques on synthetic and real datasets. We extend this to structure-based clustering with MultiLink, a novel method for recovering multiple geometric model families in noisy data. MultiLink merges clusters via a model-aware linkage strategy, enabling robust multi-class structure recovery. It offers key advantages over existing approaches, such as speed, reduced sensitivity to thresholds, and improved robustness to poor initial sampling. The second part of the thesis addresses online anomaly detection in evolving data streams. We propose Online Isolation Forest (Online-iForest), which uses adaptive, multi-resolution histograms and dynamically updates tree structures to track changes over time. It avoids retraining while achieving accuracy comparable to offline models, with superior efficiency for real-time applications. Finally, we tackle anomaly detection in cybersecurity via open-set recognition for malware classification. We enhance a Gradient Boosting classifier with MaxLogit to detect unseen malware families, a method now integrated into Cleafy's production system.

[448] Fractured Chain-of-Thought Reasoning

Baohao Liao,Hanze Dong,Yuhui Xu,Doyen Sahoo,Christof Monz,Junnan Li,Caiming Xiong

Main category: cs.LG

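TL;DR: Fractured Sampling is an inference-time strategy that truncates reasoning trajectories and optimizes how compute is allocated, substantially reducing token cost while maintaining high accuracy.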

Details

Motivation: Chain-of-Thought (CoT) prompting incurs high token costs that make it hard to deploy in latency-sensitive settings. Method: Fractured Sampling interpolates between full CoT and solution-only sampling along three orthogonal axes: the number of reasoning trajectories, the number of final solutions per trajectory, and the truncation depth. Result: On five reasoning benchmarks, Fractured Sampling achieves better accuracy-cost trade-offs, with steep log-linear scaling gains in Pass@k versus token budget. Conclusion: Fractured Sampling offers a more efficient and scalable approach to LLM reasoning by showing how to allocate compute to maximize performance. Abstract: Inference-time scaling techniques have significantly bolstered the reasoning capabilities of large language models (LLMs) by harnessing additional computational effort at inference without retraining. Similarly, Chain-of-Thought (CoT) prompting and its extension, Long CoT, improve accuracy by generating rich intermediate reasoning trajectories, but these approaches incur substantial token costs that impede their deployment in latency-sensitive settings. In this work, we first show that truncated CoT, which stops reasoning before completion and directly generates the final answer, often matches full CoT sampling while using dramatically fewer tokens. Building on this insight, we introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling along three orthogonal axes: (1) the number of reasoning trajectories, (2) the number of final solutions per trajectory, and (3) the depth at which reasoning traces are truncated. Through extensive experiments on five diverse reasoning benchmarks and several model scales, we demonstrate that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget. Our analysis reveals how to allocate computation across these dimensions to maximize performance, paving the way for more efficient and scalable LLM reasoning.

[449] FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference

Guangda Liu,Chengwei Li,Zhenyu Ning,Jing Lin,Yiwu Yao,Danning Ke,Minyi Guo,Jieru Zhao

Main category: cs.LG

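TL;DR: FreeKV is an algorithm-system co-optimization framework that improves KV retrieval efficiency while preserving accuracy through speculative retrieval and fine-grained correction; experiments show up to a 13x speedup over state-of-the-art KV retrieval methods.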

Details

Motivation: When LLMs are deployed with long context windows, the KV cache grows with context length; existing compression methods either lose accuracy or hit efficiency bottlenecks, so an efficient and accurate solution is needed. Method: On the algorithm side, FreeKV combines speculative retrieval with fine-grained correction; on the system side, it uses hybrid KV layouts and double-buffered streamed recall to improve efficiency. Result: FreeKV achieves near-lossless accuracy across various scenarios and models, with up to a 13x speedup over state-of-the-art KV retrieval methods. Conclusion: By co-optimizing algorithm and system, FreeKV addresses both the efficiency and the accuracy of KV retrieval and has broad application potential. Abstract: Large language models (LLMs) have been widely deployed with rapidly expanding context windows to support increasingly demanding applications. However, long contexts pose significant deployment challenges, primarily due to the KV cache whose size grows proportionally with context length. While KV cache compression methods are proposed to address this issue, KV dropping methods incur considerable accuracy loss, and KV retrieval methods suffer from significant efficiency bottlenecks. We propose FreeKV, an algorithm-system co-optimization framework to enhance KV retrieval efficiency while preserving accuracy. On the algorithm side, FreeKV introduces speculative retrieval to shift the KV selection and recall processes out of the critical path, combined with fine-grained correction to ensure accuracy. On the system side, FreeKV employs hybrid KV layouts across CPU and GPU memory to eliminate fragmented data transfers, and leverages double-buffered streamed recall to further improve efficiency. Experiments demonstrate that FreeKV achieves near-lossless accuracy across various scenarios and models, delivering up to 13$\times$ speedup compared to SOTA KV retrieval methods.

Yuanze Hu,Zhaoxin Fan,Xinyu Wang,Gen Li,Ye Qiu,Zhichao Yang,Wenjun Wu,Kejian Wu,Yifan Sun,Xiaotie Deng,Jin Dong

Main category: cs.LG

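TL;DR: The TinyAlign framework uses retrieval-augmented generation techniques to improve lightweight vision-language models, significantly reducing training loss and improving data efficiency.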

Details

Motivation: Prevailing alignment approaches freeze the vision encoder and the language model, but lightweight models have limited capacity, which leads to poor alignment quality. Method: TinyAlign, inspired by Retrieval-Augmented Generation, retrieves relevant context from a memory bank to enrich multimodal inputs and improve their alignment. Result: Experiments show that TinyAlign significantly reduces training loss, accelerates convergence, and reaches baseline-level performance with only 40% of the fine-tuning data. Conclusion: TinyAlign provides an efficient alignment method for lightweight vision-language models and introduces a new theoretical perspective on alignment bottlenecks. Abstract: Lightweight Vision-Language Models (VLMs) are indispensable for resource-constrained applications. The prevailing approach to aligning vision and language models involves freezing both the vision encoder and the language model while training small connector modules. However, this strategy heavily depends on the intrinsic capabilities of the language model, which can be suboptimal for lightweight models with limited representational capacity. In this work, we investigate this alignment bottleneck through the lens of mutual information, demonstrating that the constrained capacity of the language model inherently limits the Effective Mutual Information (EMI) between multimodal inputs and outputs, thereby compromising alignment quality. To address this challenge, we propose TinyAlign, a novel framework inspired by Retrieval-Augmented Generation, which strategically retrieves relevant context from a memory bank to enrich multimodal inputs and enhance their alignment. Extensive empirical evaluations reveal that TinyAlign significantly reduces training loss, accelerates convergence, and enhances task performance. Remarkably, it allows models to achieve baseline-level performance with only 40\% of the fine-tuning data, highlighting exceptional data efficiency. Our work thus offers a practical pathway for developing more capable lightweight VLMs while introducing a fresh theoretical lens to better understand and address alignment bottlenecks in constrained multimodal systems.

[451] CALM-PDE: Continuous and Adaptive Convolutions for Latent Space Modeling of Time-dependent PDEs

Jan Hagnberger,Daniel Musekamp,Mathias Niepert

Main category: cs.LG

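TL;DR: CALM-PDE proposes a novel continuous convolution-based encoder-decoder architecture for efficiently solving arbitrarily discretized PDEs, with significant improvements in memory and inference-time efficiency.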

Details

Motivation: Solving PDEs directly in physical space is computationally expensive, and existing neural surrogate models reduce complexity but consume a lot of memory; CALM-PDE aims to address this. Method: A continuous convolution-based encoder-decoder architecture uses an epsilon-neighborhood-constrained kernel and adaptively optimizes the query points. Result: CALM-PDE is competitive with or outperforms baselines on a diverse set of PDEs, with significantly better memory and inference-time efficiency than Transformer-based methods. Conclusion: CALM-PDE offers a new way to solve arbitrarily discretized PDEs efficiently, combining performance with efficiency. Abstract: Solving time-dependent Partial Differential Equations (PDEs) using a densely discretized spatial domain is a fundamental problem in various scientific and engineering disciplines, including modeling climate phenomena and fluid dynamics. However, performing these computations directly in the physical space often incurs significant computational costs. To address this issue, several neural surrogate models have been developed that operate in a compressed latent space to solve the PDE. While these approaches reduce computational complexity, they often use Transformer-based attention mechanisms to handle irregularly sampled domains, resulting in increased memory consumption. In contrast, convolutional neural networks allow memory-efficient encoding and decoding but are limited to regular discretizations. Motivated by these considerations, we propose CALM-PDE, a model class that efficiently solves arbitrarily discretized PDEs in a compressed latent space. We introduce a novel continuous convolution-based encoder-decoder architecture that uses an epsilon-neighborhood-constrained kernel and learns to apply the convolution operator to adaptive and optimized query points. We demonstrate the effectiveness of CALM-PDE on a diverse set of PDEs with both regularly and irregularly sampled spatial domains. CALM-PDE is competitive with or outperforms existing baseline methods while offering significant improvements in memory and inference time efficiency compared to Transformer-based methods.

[452] Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space

Hengli Li,Chenxi Li,Tong Wu,Xuekai Zhu,Yuxuan Wang,Zhaoxin Yu,Eric Hanchen Jiang,Song-Chun Zhu,Zixia Jia,Ying Nian Wu,Zilong Zheng

Main category: cs.LG

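TL;DR: LatentSeek improves LLM reasoning through Test-Time Instance-level Adaptation (TTIA) in latent space and outperforms strong baselines such as Chain-of-Thought prompting and fine-tuning-based methods.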

Details

Motivation: Although LLM performance has improved under the training scaling law, challenges remain in reasoning ability and in training algorithms (for example, catastrophic forgetting). Method: The LatentSeek framework uses policy gradient to iteratively update latent representations, guided by self-generated reward signals. Result: On benchmarks including GSM8K and MATH-500, LatentSeek outperforms CoT prompting and fine-tuning-based methods and converges efficiently. Conclusion: LatentSeek is a lightweight, scalable, and effective solution that shows the potential of test-time scaling in latent space. Abstract: Reasoning ability, a core component of human intelligence, continues to pose a significant challenge for Large Language Models (LLMs) in the pursuit of AGI. Although model performance has improved under the training scaling law, significant challenges remain, particularly with respect to training algorithms, such as catastrophic forgetting, and the limited availability of novel training data. As an alternative, test-time scaling enhances reasoning performance by increasing test-time computation without parameter updating. Unlike prior methods in this paradigm focused on token space, we propose leveraging latent space for more effective reasoning and better adherence to the test-time scaling law. We introduce LatentSeek, a novel framework that enhances LLM reasoning through Test-Time Instance-level Adaptation (TTIA) within the model's latent space. Specifically, LatentSeek leverages policy gradient to iteratively update latent representations, guided by self-generated reward signals. LatentSeek is evaluated on a range of reasoning benchmarks, including GSM8K, MATH-500, and AIME2024, across multiple LLM architectures. Results show that LatentSeek consistently outperforms strong baselines, such as Chain-of-Thought prompting and fine-tuning-based methods. Furthermore, our analysis demonstrates that LatentSeek is highly efficient, typically converging within a few iterations for problems of average complexity, while also benefiting from additional iterations, thereby highlighting the potential of test-time scaling in the latent space. These findings position LatentSeek as a lightweight, scalable, and effective solution for enhancing the reasoning capabilities of LLMs.

[453] Walking the Tightrope: Disentangling Beneficial and Detrimental Drifts in Non-Stationary Custom-Tuning

Xiaoyu Yang,Jie Lu,En Yu

Main category: cs.LG

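TL;DR: The paper identifies detrimental concept drift in the chain-of-thought (CoT) reasoning of multi-modal large language models (MLLMs) during non-stationary reinforcement fine-tuning (RFT), and proposes Counterfactual Preference Optimization (CPO) to address it.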

Details

Motivation: Address the concept drift that non-stationary RFT causes in CoT reasoning, which otherwise introduces significant bias into final predictions. Method: CoT's autoregressive token streams are formalized as non-stationary distributions, and a counterfact-aware RFT method is proposed in which concept graph-empowered LLM experts generate counterfactual reasoning trajectories. Result: Experiments show that CPO offers superior robustness, generalization, and coordination in non-stationary environments; the work also contributes a large-scale dataset, CXR-CounterFact (CCF). Conclusion: CPO achieves stable RFT through counterfactual-aware preference alignment, particularly in the medical domain, and the code and data are public. Abstract: This paper uncovers a critical yet overlooked phenomenon in multi-modal large language models (MLLMs): detrimental concept drift within chain-of-thought (CoT) reasoning during non-stationary reinforcement fine-tuning (RFT), where reasoning token distributions evolve unpredictably, thereby introducing significant biases in final predictions. To address this, we are pioneers in establishing the theoretical bridge between concept drift theory and RFT processes by formalizing CoT's autoregressive token streams as non-stationary distributions undergoing arbitrary temporal shifts. Leveraging this framework, we propose a novel counterfact-aware RFT that systematically decouples beneficial distribution adaptation from harmful concept drift through concept graph-empowered LLM experts generating counterfactual reasoning trajectories. Our solution, Counterfactual Preference Optimization (CPO), enables stable RFT in non-stationary environments, particularly within the medical domain, through custom-tuning of counterfactual-aware preference alignment. Extensive experiments demonstrate our superior performance of robustness, generalization and coordination within RFT. Besides, we also contributed a large-scale dataset CXR-CounterFact (CCF), comprising 320,416 meticulously curated counterfactual reasoning trajectories derived from MIMIC-CXR. Our code and data are public.

[454] A Minimum Description Length Approach to Regularization in Neural Networks

Matan Abudy,Orr Well,Emmanuel Chemla,Roni Katzir,Nur Lan

Main category: cs.LG

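TL;DR: The study finds that standard regularization (L1, L2, or none) prevents neural networks from converging to perfect solutions during training, whereas regularization based on Minimum Description Length (MDL) selects perfect solutions and improves generalization.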

Details

Motivation: Examine how regularization methods affect whether neural networks converge to perfect solutions, and look for a more effective regularization strategy. Method: Standard regularization (L1, L2, or none) is compared with MDL-based regularization when training neural networks, specifically on formal language tasks. Result: MDL regularization selects perfect solutions, whereas standard regularization prevents convergence and even pushes networks away from perfect initializations. Conclusion: MDL provides a theoretically grounded regularization method that counteracts overfitting and promotes generalization better than existing techniques. Abstract: State-of-the-art neural networks can be trained to become remarkable solutions to many problems. But while these architectures can express symbolic, perfect solutions, trained models often arrive at approximations instead. We show that the choice of regularization method plays a crucial role: when trained on formal languages with standard regularization ($L_1$, $L_2$, or none), expressive architectures not only fail to converge to correct solutions but are actively pushed away from perfect initializations. In contrast, applying the Minimum Description Length (MDL) principle to balance model complexity with data fit provides a theoretically grounded regularization method. Using MDL, perfect solutions are selected over approximations, independently of the optimization algorithm. We propose that unlike existing regularization techniques, MDL introduces the appropriate inductive bias to effectively counteract overfitting and promote generalization.
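
The trade-off described in [454], balancing model complexity against data fit, can be made concrete with the classic two-part MDL score: bits needed to describe the model plus bits needed to encode the data given the model. The sketch below is a generic illustration under that textbook formulation; how the paper encodes network weights is not stated in the abstract, and all numbers in the example are made up.

```python
from math import log2
from typing import Sequence


def mdl_score(model_bits: float, data_probs: Sequence[float]) -> float:
    """Two-part code length in bits: L(model) + L(data | model).

    data_probs holds the probability the model assigns to each observed
    sample; each sample costs -log2(p) bits to encode.
    """
    if model_bits < 0:
        raise ValueError("model_bits must be non-negative")
    if any(p <= 0.0 or p > 1.0 for p in data_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    data_bits = -sum(log2(p) for p in data_probs)
    return model_bits + data_bits


if __name__ == "__main__":
    n_samples = 1000
    # Hypothetical exact model: costlier to describe, fits every sample.
    exact = mdl_score(model_bits=40.0, data_probs=[1.0] * n_samples)
    # Hypothetical approximation: cheaper to describe, slightly wrong.
    approx = mdl_score(model_bits=10.0, data_probs=[0.9] * n_samples)
    print(exact, approx)  # the exact model wins once data is plentiful
```

With few samples the cheaper approximation can score better; as the data grows, the per-sample penalty dominates and the exact model is preferred, which is the inductive bias the abstract attributes to MDL.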

[455] Fine-tuning Quantized Neural Networks with Zeroth-order Optimization

Sifeng Shang,Jiayi Zhou,Chenyu Lin,Minxian Li,Kaiyang Zhou

Main category: cs.LG

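TL;DR: The paper proposes QZO, a method that combines zeroth-order optimization with model quantization to substantially reduce GPU memory usage when fine-tuning large language models and diffusion models.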

Details

Motivation: As large language models grow exponentially in size, GPU memory has become the bottleneck for adapting them to downstream tasks, so memory usage must be reduced efficiently. Method: QZO combines zeroth-order optimization (which eliminates gradients and optimizer states) with model quantization (for example, bfloat16 to int4); it estimates gradients by perturbing the continuous quantization scale and stabilizes training with directional derivative clipping. Result: QZO reduces the total memory cost by more than 18x for 4-bit LLMs and enables fine-tuning Llama-2-13B and Stable Diffusion 3.5 Large within a single 24GB GPU. Conclusion: QZO is a feasible approach to memory-efficient training in resource-constrained environments. Abstract: As the size of large language models grows exponentially, GPU memory has become a bottleneck for adapting these models to downstream tasks. In this paper, we aim to push the limits of memory-efficient training by minimizing memory usage on model weights, gradients, and optimizer states, within a unified framework. Our idea is to eliminate both gradients and optimizer states using zeroth-order optimization, which approximates gradients by perturbing weights during forward passes to identify gradient directions. To minimize memory usage on weights, we employ model quantization, e.g., converting from bfloat16 to int4. However, directly applying zeroth-order optimization to quantized weights is infeasible due to the precision gap between discrete weights and continuous gradients, which would otherwise require de-quantization and re-quantization. To overcome this challenge, we propose Quantized Zeroth-order Optimization (QZO), a novel approach that perturbs the continuous quantization scale for gradient estimation and uses a directional derivative clipping method to stabilize training. QZO is orthogonal to both scalar-based and codebook-based post-training quantization methods. Compared to full-parameter fine-tuning in bfloat16, QZO can reduce the total memory cost by more than 18$\times$ for 4-bit LLMs, and enables fine-tuning Llama-2-13B and Stable Diffusion 3.5 Large within a single 24GB GPU.
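
To make the mechanism in [455] more tangible, here is a toy sketch of a two-point zeroth-order update applied to quantization scales while the integer codes stay fixed, with the directional derivative clipped before the update. It is an illustration of the general technique under stated assumptions, not the authors' implementation: the loss, the codes, the target scales, and every hyperparameter below are invented for the example.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical toy setup: two weight groups with fixed integer codes.
CODES = [[3, -2], [7, 1]]
TARGET_SCALES = [0.1, 0.05]  # scales that reproduce the "true" weights


def toy_loss(scales: Sequence[float]) -> float:
    """Squared error between dequantized weights and target weights."""
    return sum(
        ((s - t) * q) ** 2
        for s, t, group in zip(scales, TARGET_SCALES, CODES)
        for q in group
    )


def zo_step(
    loss: Callable[[Sequence[float]], float],
    scales: Sequence[float],
    rng: random.Random,
    lr: float = 2e-3,
    eps: float = 1e-3,
    clip: float = 10.0,
) -> List[float]:
    """One zeroth-order step: two forward passes, no backpropagation."""
    z = [rng.gauss(0.0, 1.0) for _ in scales]  # random direction
    plus = [s + eps * zi for s, zi in zip(scales, z)]
    minus = [s - eps * zi for s, zi in zip(scales, z)]
    # Central-difference estimate of the directional derivative along z.
    d = (loss(plus) - loss(minus)) / (2.0 * eps)
    d = max(-clip, min(clip, d))  # clip to limit the step size
    return [s - lr * d * zi for s, zi in zip(scales, z)]


def run(steps: int = 500, seed: int = 0) -> List[float]:
    rng = random.Random(seed)
    scales = [0.2, 0.2]
    for _ in range(steps):
        scales = zo_step(toy_loss, scales, rng)
    return scales


if __name__ == "__main__":
    final = run()
    print(toy_loss([0.2, 0.2]), toy_loss(final))
```

Only the continuous scales are perturbed and updated, so the discrete codes never need to be de-quantized and re-quantized, which is the point the abstract makes about the precision gap.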

[456] Optimizing Anytime Reasoning via Budget Relative Policy Optimization

Penghui Qi,Zichen Liu,Tianyu Pang,Chao Du,Wee Sun Lee,Min Lin

Main category: cs.LG

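TL;DR: The paper proposes the AnytimeReasoner framework to optimize the anytime reasoning performance of LLMs: it truncates the thinking process to fit sampled token budgets and generates verifiable rewards, improving training and token efficiency.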

Details

Motivation: Existing approaches typically use reinforcement learning to maximize a final reward, which neglects efficiency and flexibility; AnytimeReasoner aims to optimize anytime reasoning performance under varying budget constraints. Method: The complete thinking process is truncated to fit sampled token budgets, producing verifiable rewards; the thinking and summary policies are optimized in a decoupled manner; and Budget Relative Policy Optimization (BRPO) is introduced to reduce variance. Result: On mathematical reasoning tasks, AnytimeReasoner outperforms GRPO across all thinking budgets and improves both training and token efficiency. Conclusion: With verifiable dense rewards and BRPO, AnytimeReasoner improves the efficiency and flexibility of LLM reasoning. Abstract: Scaling test-time compute is crucial for enhancing the reasoning capabilities of large language models (LLMs). Existing approaches typically employ reinforcement learning (RL) to maximize a verifiable reward obtained at the end of reasoning traces. However, such methods optimize only the final performance under a large and fixed token budget, which hinders efficiency in both training and deployment. In this work, we present a novel framework, AnytimeReasoner, to optimize anytime reasoning performance, which aims to improve token efficiency and the flexibility of reasoning under varying token budget constraints. To achieve this, we truncate the complete thinking process to fit within sampled token budgets from a prior distribution, compelling the model to summarize the optimal answer for each truncated thinking for verification. This introduces verifiable dense rewards into the reasoning process, facilitating more effective credit assignment in RL optimization. We then optimize the thinking and summary policies in a decoupled manner to maximize the cumulative reward. Additionally, we introduce a novel variance reduction technique, Budget Relative Policy Optimization (BRPO), to enhance the robustness and efficiency of the learning process when reinforcing the thinking policy. Empirical results in mathematical reasoning tasks demonstrate that our method consistently outperforms GRPO across all thinking budgets under various prior distributions, enhancing both training and token efficiency.

[457] RECON: Robust symmetry discovery via Explicit Canonical Orientation Normalization

Alonso Urbano,David W. Romero,Max Zimmer,Sebastian Pokutta

Main category: cs.LG

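TL;DR: The RECON framework discovers the intrinsic symmetry distribution of input data in a data-driven way, without fixing a transformation group in advance, making symmetry modeling more flexible.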

Details

Motivation: Real-world data often has unknown or approximate symmetries, but existing equivariant networks must fix a transformation group in advance (for example, SO(2) rotations), which degrades performance when the actual symmetries differ. Method: RECON uses class-pose decompositions and a data-driven normalization to align arbitrary reference frames into a common natural pose, yielding comparable symmetry descriptors. Result: RECON discovers symmetries effectively on 2D image benchmarks and, for the first time, extends symmetry discovery to 3D transformation groups. Conclusion: RECON opens a path toward more flexible equivariant modeling. Abstract: Real-world data often exhibits unknown or approximate symmetries, yet existing equivariant networks must commit to a fixed transformation group prior to training, e.g., continuous $SO(2)$ rotations. This mismatch degrades performance when the actual data symmetries differ from those in the transformation group. We introduce RECON, a framework to discover each input's intrinsic symmetry distribution from unlabeled data. RECON leverages class-pose decompositions and applies a data-driven normalization to align arbitrary reference frames into a common natural pose, yielding directly comparable and interpretable symmetry descriptors. We demonstrate effective symmetry discovery on 2D image benchmarks and -- for the first time -- extend it to 3D transformation groups, paving the way towards more flexible equivariant modeling.

[458] Mean Flows for One-step Generative Modeling

Zhengyang Geng,Mingyang Deng,Xingjian Bai,J. Zico Kolter,Kaiming He

Main category: cs.LG

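TL;DR: The paper proposes MeanFlow, a one-step generative modeling framework built on an identity relating average velocity to instantaneous velocity, and reports a substantial performance improvement.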

Details

Motivation: Narrow the performance gap between one-step diffusion/flow models and multi-step models, and revisit the foundations of generative models. Method: The notion of average velocity is introduced, and an identity between average and instantaneous velocities is derived and used to guide neural network training, with no pre-training, distillation, or curriculum learning. Result: On ImageNet 256x256, MeanFlow reaches an FID of 3.43 with a single function evaluation (1-NFE), outperforming previous one-step models. Conclusion: MeanFlow substantially narrows the gap between one-step and multi-step models and suggests new directions for future research. Abstract: We propose a principled and effective framework for one-step generative modeling. We introduce the notion of average velocity to characterize flow fields, in contrast to instantaneous velocity modeled by Flow Matching methods. A well-defined identity between average and instantaneous velocities is derived and used to guide neural network training. Our method, termed the MeanFlow model, is self-contained and requires no pre-training, distillation, or curriculum learning. MeanFlow demonstrates strong empirical performance: it achieves an FID of 3.43 with a single function evaluation (1-NFE) on ImageNet 256x256 trained from scratch, significantly outperforming previous state-of-the-art one-step diffusion/flow models. Our study substantially narrows the gap between one-step diffusion/flow models and their multi-step predecessors, and we hope it will motivate future research to revisit the foundations of these powerful models.
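
The abstract of [458] mentions an identity between average and instantaneous velocities without writing it out. The derivation below sketches how such an identity follows if the average velocity is defined as the time average of the instantaneous velocity along a trajectory. The symbols $u$, $v$, $z_t$, $r$, and $t$ are notation chosen here for illustration, and the definition in the first line is an assumption rather than a quotation from the paper.

```latex
% Assumed definition: average velocity over the interval [r, t]
u(z_t, r, t) \;=\; \frac{1}{t - r} \int_{r}^{t} v(z_\tau, \tau)\, d\tau

% Multiply both sides by (t - r):
(t - r)\, u(z_t, r, t) \;=\; \int_{r}^{t} v(z_\tau, \tau)\, d\tau

% Differentiate with respect to t, holding r fixed:
u(z_t, r, t) + (t - r)\, \frac{d}{dt} u(z_t, r, t) \;=\; v(z_t, t)

% Rearranged identity linking average and instantaneous velocity:
u(z_t, r, t) \;=\; v(z_t, t) - (t - r)\, \frac{d}{dt} u(z_t, r, t)

% The total derivative expands along the trajectory, where dz_t/dt = v:
\frac{d}{dt} u(z_t, r, t) \;=\; v(z_t, t)\, \partial_z u + \partial_t u
```

Under this reading, the right-hand side of the rearranged identity involves no integral, which would make it usable as a regression target for a network that predicts $u$ directly; a single evaluation of $u$ over the full interval then yields a one-step sample.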

quant-ph [Back]

[459] Efficient Generation of Parameterised Quantum Circuits from Large Texts

Colin Krawchuk,Nikhil Khatri,Neil John Ortega,Dimitri Kartsaklis

Main category: quant-ph

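TL;DR: The paper proposes an efficient method for converting large-scale texts into quantum circuits, using tree-like representations and symmetric monoidal categories to encode the syntactic and discourse relationships of long texts.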

Details

Motivation: Traditional hybrid quantum-classical models rely on classical neural networks, whereas the DisCoCirc framework can encode entire documents directly as parameterised quantum circuits, with better interpretability and compositionality. Method: Texts are converted into quantum circuits using tree-like representations of pregroup diagrams, exploiting the compositional parallels between language and quantum mechanics. Result: Experiments show that the method efficiently encodes complex texts of up to 6410 words. Conclusion: The system is provided to the community as part of the open-source quantum NLP package lambeq Gen II. Abstract: Quantum approaches to natural language processing (NLP) are redefining how linguistic information is represented and processed. While traditional hybrid quantum-classical models rely heavily on classical neural networks, recent advancements propose a novel framework, DisCoCirc, capable of directly encoding entire documents as parameterised quantum circuits (PQCs), besides enjoying some additional interpretability and compositionality benefits. Following these ideas, this paper introduces an efficient methodology for converting large-scale texts into quantum circuits using tree-like representations of pregroup diagrams. Exploiting the compositional parallels between language and quantum mechanics, grounded in symmetric monoidal categories, our approach enables faithful and efficient encoding of syntactic and discourse relationships in long and complex texts (up to 6410 words in our experiments) to quantum circuits. The developed system is provided to the community as part of the augmented open-source quantum NLP package lambeq Gen II.

q-bio.NC [Back]

[460] BrainNetMLP: An Efficient and Effective Baseline for Functional Brain Network Classification

Jiacheng Hou,Zhenjie Song,Ercan Engin Kuruoglu

Main category: q-bio.NC

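TL;DR: The paper proposes BrainNetMLP, a pure MLP-based method for functional brain network classification, showing that a simple model can reach state-of-the-art performance.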

Details

Motivation: Deep learning models have grown more complex without a matching gain in performance, so the paper revisits the simplest architecture, the MLP, to explore its potential. Method: BrainNetMLP uses a dual-branch structure to capture both spatial connectivity and spectral information, enabling precise spatiotemporal feature fusion. Result: BrainNetMLP achieves state-of-the-art classification performance on the HCP and ABIDE datasets. Conclusion: MLP-based models can serve as efficient and effective alternatives for functional brain network classification. Abstract: Recent studies have made great progress in functional brain network classification by modeling the brain as a network of Regions of Interest (ROIs) and leveraging their connections to understand brain functionality and diagnose mental disorders. Various deep learning architectures, including Convolutional Neural Networks, Graph Neural Networks, and the recent Transformer, have been developed. However, despite the increasing complexity of these models, the performance gain has not been as salient. This raises a question: Does increasing model complexity necessarily lead to higher classification accuracy? In this paper, we revisit the simplest deep learning architecture, the Multi-Layer Perceptron (MLP), and propose a pure MLP-based method, named BrainNetMLP, for functional brain network classification, which capitalizes on the advantages of MLP, including efficient computation and fewer parameters. Moreover, BrainNetMLP incorporates a dual-branch structure to jointly capture both spatial connectivity and spectral information, enabling precise spatiotemporal feature fusion. We evaluate our proposed BrainNetMLP on two public and popular brain network classification datasets, the Human Connectome Project (HCP) and the Autism Brain Imaging Data Exchange (ABIDE). Experimental results demonstrate pure MLP-based methods can achieve state-of-the-art performance, revealing the potential of MLP-based models as more efficient yet effective alternatives in functional brain network classification. The code will be available at https://github.com/JayceonHo/BrainNetMLP.

cs.MM [Back]

[461] Enhanced Multimodal Hate Video Detection via Channel-wise and Modality-wise Fusion

Yinghui Zhang,Tailin Chen,Yuchen Zhang,Zeyu Fu

Main category: cs.MM

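TL;DR: CMFusion is a new multimodal hate video detection model that significantly improves detection performance through a channel-wise and modality-wise fusion mechanism.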

Details

Motivation: Hate content is spreading on video platforms, but existing unimodal detection methods have limited effectiveness, and multimodal methods fail to integrate temporal dynamics and modality-wise interactions effectively. Method: CMFusion extracts text, audio, and video features, uses a temporal cross-attention mechanism to capture dependencies between modalities, and generates video representations through channel-wise and modality-wise fusion modules. Result: Experiments show that CMFusion significantly outperforms five baselines in accuracy, precision, recall, and F1 score. Conclusion: By integrating multimodal features and temporal dynamics effectively, CMFusion significantly improves hate video detection. Abstract: The rapid rise of video content on platforms such as TikTok and YouTube has transformed information dissemination, but it has also facilitated the spread of harmful content, particularly hate videos. Despite significant efforts to combat hate speech, detecting these videos remains challenging due to their often implicit nature. Current detection methods primarily rely on unimodal approaches, which inadequately capture the complementary features across different modalities. While multimodal techniques offer a broader perspective, many fail to effectively integrate temporal dynamics and modality-wise interactions essential for identifying nuanced hate content. In this paper, we present CMFusion, an enhanced multimodal hate video detection model utilizing a novel Channel-wise and Modality-wise Fusion Mechanism. CMFusion first extracts features from text, audio, and video modalities using pre-trained models and then incorporates a temporal cross-attention mechanism to capture dependencies between video and audio streams. The learned features are then processed by channel-wise and modality-wise fusion modules to obtain informative representations of videos. Our extensive experiments on a real-world dataset demonstrate that CMFusion significantly outperforms five widely used baselines in terms of accuracy, precision, recall, and F1 score. Comprehensive ablation studies and parameter analyses further validate our design choices, highlighting the model's effectiveness in detecting hate videos. The source codes will be made publicly available at https://github.com/EvelynZ10/cmfusion.

cs.IR [Back]

[462] TARGET: Benchmarking Table Retrieval for Generative Tasks

Xingyu Ji,Parker Glenn,Aditya G. Parameswaran,Madelon Hulsebos

Main category: cs.IR

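TL;DR: TARGET is a benchmark for evaluating table retrieval for generative tasks; it finds that dense embedding-based retrievers outperform a BM25 baseline and shows that retrieval performance is sensitive to metadata.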

Details

Motivation: Structured data is highly valuable in data analysis and machine learning, but retrieving the right table(s) to support an analytical query or task remains a key question. Method: Introduces the TARGET benchmark, which evaluates different retrievers, including dense embedding-based retrievers and a BM25 baseline, both in isolation and by their impact on downstream tasks. Result: Dense embedding-based retrievers far outperform BM25, retrieval performance is sensitive to metadata (e.g., missing table titles), and performance varies starkly across datasets and tasks. Conclusion: TARGET provides an evaluation tool for table retrieval and reveals performance differences among retrievers and the importance of metadata. Abstract: The data landscape is rich with structured data, often of high value to organizations, driving important applications in data analysis and machine learning. Recent progress in representation learning and generative models for such data has led to the development of natural language interfaces to structured data, including those leveraging text-to-SQL. Contextualizing interactions, either through conversational interfaces or agentic components, in structured data through retrieval-augmented generation can provide substantial benefits in the form of freshness, accuracy, and comprehensiveness of answers. The key question is: how do we retrieve the right table(s) for the analytical query or task at hand? To this end, we introduce TARGET: a benchmark for evaluating TAble Retrieval for GEnerative Tasks. With TARGET we analyze the retrieval performance of different retrievers in isolation, as well as their impact on downstream tasks. We find that dense embedding-based retrievers far outperform a BM25 baseline, which is less effective than it is for retrieval over unstructured text. We also surface the sensitivity of retrievers across various metadata (e.g., missing table titles), and demonstrate a stark variation of retrieval performance across datasets and tasks. TARGET is available at https://target-benchmark.github.io.
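
To make the BM25 baseline concrete, here is a small sketch of Okapi BM25 scoring over tables flattened to their title and column names. The toy corpus, the flattening scheme, and the parameter values `k1=1.5` and `b=0.75` are assumptions made for illustration; this is not TARGET's evaluation code.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical corpus: each table is represented by its title and column names.
tables = {
    "sales_2023": "quarterly sales revenue region product".split(),
    "employees": "employee name department salary hire_date".split(),
    "untitled": "col1 col2 col3 revenue".split(),  # table with missing title metadata
}
query = "revenue by region".split()
scores = bm25_scores(query, list(tables.values()))
ranking = sorted(zip(tables, scores), key=lambda pair: pair[1], reverse=True)
```

The `untitled` table matches only on a single column name, which hints at why lexical retrievers can be sensitive to missing table metadata.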

[463] MIRACL-VISION: A Large, multilingual, visual document retrieval benchmark

Radek Osmulsk,Gabriel de Souza P. Moreira,Ronay Ak,Mengyao Xu,Benedikt Schifferer,Even Oldridge

Main category: cs.IR

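TL;DR: MIRACL-VISION is a multilingual visual document retrieval evaluation benchmark that extends the MIRACL dataset, covers 18 languages, and aims to address the limitations of existing benchmarks.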

Details

Motivation: Existing visual document retrieval benchmarks are mostly limited to English, rely on synthetic questions, and have small corpora, so they fall short of multilingual, real-world needs. Method: Designs a more compute-friendly yet still challenging evaluation by eliminating "easy" negatives from the corpus, and runs extensive comparative experiments. Result: Experiments show a significant gap in the multilingual capabilities of vision-based retrieval models, whose retrieval accuracy is up to 59.7% lower than that of text-based models, and 12.1% lower even for English. Conclusion: MIRACL-VISION is a challenging, representative multilingual benchmark for visual retrieval that will help drive more robust document retrieval models. Abstract: Document retrieval is an important task for search and Retrieval-Augmented Generation (RAG) applications. Large Language Models (LLMs) have contributed to improving the accuracy of text-based document retrieval. However, documents with complex layout and visual elements like tables, charts and infographics are not perfectly represented in textual format. Recently, image-based document retrieval pipelines have become popular, which use visual large language models (VLMs) to retrieve relevant page images given a query. Current evaluation benchmarks on visual document retrieval are limited, as they focus primarily on the English language, rely on synthetically generated questions and offer a small corpus size. Therefore, we introduce MIRACL-VISION, a multilingual visual document retrieval evaluation benchmark. MIRACL-VISION covers 18 languages, and is an extension of the MIRACL dataset, a popular benchmark to evaluate text-based multilingual retrieval pipelines. MIRACL was built using a human-intensive annotation process to generate high-quality questions. In order to reduce MIRACL-VISION corpus size to make evaluation more compute friendly while keeping the datasets challenging, we have designed a method for eliminating the "easy" negatives from the corpus. We conducted extensive experiments comparing MIRACL-VISION with other benchmarks, using popular public text and image models. We observe a gap in state-of-the-art VLM-based embedding models on multilingual capabilities, with up to 59.7% lower retrieval accuracy than text-based retrieval models. Even for the English language, the visual models' retrieval accuracy is 12.1% lower compared to text-based models. MIRACL-VISION is a challenging, representative, multilingual evaluation benchmark for visual retrieval pipelines and will help the community build robust models for document retrieval.

[464] LightRetriever: A LLM-based Hybrid Retrieval Architecture with 1000x Faster Query Inference

Guangyuan Ma,Yongliang Ma,Xuanrui Gou,Zhenpeng Su,Ming Zhou,Songlin Hu

Main category: cs.IR

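TL;DR: LightRetriever is an LLM-based hybrid retrieval method that uses an extremely lightweight query encoder to achieve efficient retrieval, greatly speeding up query inference while maintaining strong performance.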

Details

Motivation: Although LLMs significantly enhance retrieval capabilities, their deep parameterization slows query inference and raises the resource demands of online deployment. Method: Retains a full-sized LLM for document encoding but reduces the workload of query encoding to no more than an embedding lookup. Result: Compared to serving a full-sized LLM on an H800 GPU, query inference is over 1000x faster with GPU acceleration and 20x faster without a GPU. On large-scale retrieval benchmarks, the method retains 95% of full-sized performance on average. Conclusion: LightRetriever balances efficiency and performance and is suitable for diverse retrieval tasks. Abstract: Large Language Models (LLMs)-based hybrid retrieval uses LLMs to encode queries and documents into low-dimensional dense or high-dimensional sparse vectors. It retrieves documents relevant to search queries based on vector similarities. Documents are pre-encoded offline, while queries arrive in real-time, necessitating an efficient online query encoder. Although LLMs significantly enhance retrieval capabilities, serving deeply parameterized LLMs slows down query inference throughput and increases demands for online deployment resources. In this paper, we propose LightRetriever, a novel LLM-based hybrid retriever with extremely lightweight query encoders. Our method retains a full-sized LLM for document encoding, but reduces the workload of query encoding to no more than an embedding lookup. Compared to serving a full-sized LLM on an H800 GPU, our approach achieves over a 1000x speedup for query inference with GPU acceleration, and even a 20x speedup without GPU. Experiments on large-scale retrieval benchmarks demonstrate that our method generalizes well across diverse retrieval tasks, retaining an average of 95% full-sized performance.
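
The phrase "no more than an embedding lookup" can be pictured with the sketch below. It assumes a token-to-vector table prepared offline and averages the looked-up vectors to form the query embedding; the table contents, the averaging rule, and the document vectors are hypothetical and are not taken from the LightRetriever paper.

```python
# Hypothetical token-embedding table, assumed to have been precomputed offline.
TOKEN_EMBEDDINGS = {
    "table": [0.9, 0.1, 0.0],
    "retrieval": [0.7, 0.3, 0.0],
    "audio": [0.0, 0.2, 0.9],
}
DIM = 3

# Hypothetical document vectors, assumed to come from a full-sized encoder.
DOC_VECTORS = {
    "doc_tables": [1.0, 0.2, 0.0],
    "doc_speech": [0.0, 0.1, 1.0],
}

def encode_query(query):
    """Encode a query by table lookup and averaging; no neural forward pass."""
    tokens = query.lower().split()
    if not tokens:
        return [0.0] * DIM
    # Unknown tokens fall back to a zero vector.
    vectors = [TOKEN_EMBEDDINGS.get(t, [0.0] * DIM) for t in tokens]
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(DIM)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def search(query):
    """Return the name of the document with the highest dot-product score."""
    q = encode_query(query)
    return max(DOC_VECTORS, key=lambda name: dot(q, DOC_VECTORS[name]))

best = search("table retrieval")
```

The online cost here is a dictionary lookup plus a few additions per token, which is the kind of asymmetry between document and query encoding that the abstract describes.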

eess.AS [Back]

[465] SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information

Chih-Kai Yang,Neo Ho,Yen-Ting Piao,Hung-yi Lee

Main category: eess.AS

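TL;DR: The paper introduces SAKURA, a benchmark for evaluating the multi-hop reasoning of large audio-language models (LALMs), and finds that they struggle to integrate speech/audio information.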

Details

Motivation: The performance of LALMs on speech and audio-processing tasks has been studied extensively, but their multi-hop reasoning (the ability to recall and integrate multiple facts) has not been systematically evaluated. Method: The authors propose SAKURA, a benchmark designed specifically to evaluate LALMs' multi-hop reasoning based on speech and audio information. Result: Experiments show that LALMs perform poorly when integrating speech/audio information for multi-hop reasoning, even when they extract the relevant information correctly. Conclusion: The study exposes a critical limitation of LALMs in multimodal reasoning and points to directions for future research. Abstract: Large audio-language models (LALMs) extend large language models with multimodal understanding in speech, audio, etc. While their performances on speech and audio-processing tasks are extensively studied, their reasoning abilities remain underexplored. Particularly, their multi-hop reasoning, the ability to recall and integrate multiple facts, lacks systematic evaluation. Existing benchmarks focus on general speech and audio-processing tasks, conversational abilities, and fairness but overlook this aspect. To bridge this gap, we introduce SAKURA, a benchmark assessing LALMs' multi-hop reasoning based on speech and audio information. Results show that LALMs struggle to integrate speech/audio representations for multi-hop reasoning, even when they extract the relevant information correctly, highlighting a fundamental challenge in multimodal reasoning. Our findings expose a critical limitation in LALMs, offering insights and resources for future research.

cs.AI [Back]

[466] Probing the Vulnerability of Large Language Models to Polysemantic Interventions

Bofan Gong,Shiyang Lai,Dawn Song

Main category: cs.AI

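TL;DR: The paper studies polysemanticity in neural networks, uses sparse autoencoders to analyze the polysemantic structure of two small models, and finds that this structure can be exploited to mount effective interventions on larger black-box models.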

Details

Motivation: Polysemanticity is a prominent characteristic of large neural networks, but its implications for model safety remain unclear. The study aims to explore polysemantic structure and its potential safety risks. Method: Uses sparse autoencoders to analyze the polysemantic structure of Pythia-70M and GPT-2-Small and evaluates interventions at the prompt, feature, token, and neuron levels. Result: The two small models share a consistent polysemantic topology, and this structure is successfully exploited to intervene on LLaMA3.1-8B-Instruct and Gemma-2-9B-Instruct. Conclusion: Polysemantic structure is stable and transferable, may persist across architectures and training regimes, and poses new challenges for model safety. Abstract: Polysemanticity -- where individual neurons encode multiple unrelated features -- is a well-known characteristic of large neural networks and remains a central challenge in the interpretability of language models. At the same time, its implications for model safety are also poorly understood. Leveraging recent advances in sparse autoencoders, we investigate the polysemantic structure of two small models (Pythia-70M and GPT-2-Small) and evaluate their vulnerability to targeted, covert interventions at the prompt, feature, token, and neuron levels. Our analysis reveals a consistent polysemantic topology shared across both models. Strikingly, we demonstrate that this structure can be exploited to mount effective interventions on two larger, black-box instruction-tuned models (LLaMA3.1-8B-Instruct and Gemma-2-9B-Instruct). These findings suggest not only the generalizability of the interventions but also point to a stable and transferable polysemantic structure that could potentially persist across architectures and training regimes.

[467] Using Reinforcement Learning to Train Large Language Models to Explain Human Decisions

Jian-Qiao Zhu,Hanbo Xie,Dilip Arumugam,Robert C. Wilson,Thomas L. Griffiths

Main category: cs.AI

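TL;DR: The paper explores how pretrained large language models (LLMs) can serve as dual-purpose cognitive models that both predict human behavior accurately and provide interpretable explanations in natural language.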

Details

Motivation: Existing neural network models predict human behavior well but offer little interpretability of the underlying cognitive processes. Method: Uses reinforcement learning with outcome-based rewards to guide LLMs toward generating explicit reasoning traces that explain human risky choices. Result: The approach produces high-quality explanations alongside strong quantitative predictions of human decisions. Conclusion: Pretrained LLMs have dual potential in cognitive modeling: they can both predict behavior and provide explanations. Abstract: A central goal of cognitive modeling is to develop models that not only predict human behavior but also provide insight into the underlying cognitive mechanisms. While neural network models trained on large-scale behavioral data often achieve strong predictive performance, they typically fall short in offering interpretable explanations of the cognitive processes they capture. In this work, we explore the potential of pretrained large language models (LLMs) to serve as dual-purpose cognitive models--capable of both accurate prediction and interpretable explanation in natural language. Specifically, we employ reinforcement learning with outcome-based rewards to guide LLMs toward generating explicit reasoning traces for explaining human risky choices. Our findings demonstrate that this approach produces high-quality explanations alongside strong quantitative predictions of human decisions.

Qi Zhou,Jie Zhang,Dongxia Wang,Qiang Liu,Tianlin Li,Jin Song Dong,Wenhai Wang,Qing Guo

Main category: cs.AI

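TL;DR: Fair-PP is a synthetic dataset of personalized preferences targeting social equity; it generates preference records through role-playing and contributes an automated framework, an analysis of mainstream LLMs, and a preference alignment method.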

Details

Motivation: Collecting human preference feedback is costly, and existing datasets neglect the correlation between personalization and preferences. Method: Uses GPT-4o-mini for role-playing to generate personalized preference data, and develops an automated framework and a sample reweighting method. Result: 238,623 preference records were generated, and the method outperforms the baselines in experiments. Conclusion: Fair-PP provides a new tool for personalized preference research, and experiments validate its effectiveness. Abstract: Human preference plays a crucial role in the refinement of large language models (LLMs). However, collecting human preference feedback is costly and most existing datasets neglect the correlation between personalization and preferences. To address this issue, we introduce Fair-PP, a synthetic dataset of personalized preferences targeting social equity, derived from real-world social survey data, which includes 28 social groups, 98 equity topics, and 5 personal preference dimensions. Leveraging GPT-4o-mini, we engage in role-playing based on seven representative persona portrayals guided by existing social survey data, yielding a total of 238,623 preference records. Through Fair-PP, we also contribute (i) an automated framework for generating preference data, along with a more fine-grained dataset of personalized preferences; (ii) analysis of the positioning of the existing mainstream LLMs across five major global regions within the personalized preference space; and (iii) a sample reweighting method for personalized preference alignment, enabling alignment with a target persona while maximizing the divergence from other personas. Empirical experiments show our method outperforms the baselines.

[469] AI-Driven Automation Can Become the Foundation of Next-Era Science of Science Research

Renqi Chen,Haoyang Su,Shixiang Tang,Zhenfei Yin,Qi Wu,Hui Li,Ye Sun,Nanqing Dong,Wanli Ouyang,Philip Torr

Main category: cs.AI

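TL;DR: This paper discusses how artificial intelligence (AI) can transform the Science of Science (SoS) by automating large-scale pattern discovery to improve research efficiency, and outlines future challenges and possible solutions.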

Details

Motivation: Traditional Science of Science methods rely on simplistic assumptions and statistical tools and struggle with the complexity of modern research ecosystems. AI opens new possibilities for the field. Method: Proposes integrating AI with the Science of Science and develops a multi-agent system that simulates research societies to showcase AI's potential for pattern discovery. Result: AI can automate the discovery of research patterns, go beyond traditional methods, and accelerate progress in Science of Science research. Conclusion: Combining AI with the Science of Science has great potential, but open challenges must still be resolved to realize its full value. Abstract: The Science of Science (SoS) explores the mechanisms underlying scientific discovery, and offers valuable insights for enhancing scientific efficiency and fostering innovation. Traditional approaches often rely on simplistic assumptions and basic statistical tools, such as linear regression and rule-based simulations, which struggle to capture the complexity and scale of modern research ecosystems. The advent of artificial intelligence (AI) presents a transformative opportunity for the next generation of SoS, enabling the automation of large-scale pattern discovery and uncovering insights previously unattainable. This paper offers a forward-looking perspective on the integration of Science of Science with AI for automated research pattern discovery and highlights key open challenges that could greatly benefit from AI. We outline the advantages of AI over traditional methods, discuss potential limitations, and propose pathways to overcome them. Additionally, we present a preliminary multi-agent system as an illustrative example to simulate research societies, showcasing AI's ability to replicate real-world research patterns and accelerate progress in Science of Science research.

[470] Tiny QA Benchmark++: Ultra-Lightweight, Synthetic Multilingual Dataset Generation & Smoke-Tests for Continuous LLM Evaluation

Vincent Koc

Main category: cs.AI

TL;DR: TQB++ is an ultra-lightweight multilingual QA test suite for quickly catching errors in LLM pipelines; it is cheap and fast to run.

Details

Motivation: To meet developers' need for fast feedback while building the Comet Opik prompt-optimization SDK, without waiting on heavyweight benchmarks. Method: Couples a 52-item English gold set with a LiteLLM-based synthetic data generator that supports multiple languages and custom test packs. Result: TQB++ completes in a few seconds and reliably detects prompt-template errors, tokenizer drift, and similar issues. Conclusion: TQB++ provides an efficient, continuous quality assurance tool for the generative-AI ecosystem. Abstract: Tiny QA Benchmark++ (TQB++) presents an ultra-lightweight, multilingual smoke-test suite designed to give large-language-model (LLM) pipelines a unit-test style safety net dataset that runs in seconds with minimal cost. It was born out of the tight feedback-loop demands of building the Comet Opik prompt-optimization SDK, where waiting on heavyweight benchmarks breaks developer flow. TQB++ couples a 52-item English gold set (less than 20 kB) with a tiny synthetic-data generator pypi package built on provider-agnostic LiteLLM. The generator lets practitioners mint their own tiny packs in any language, domain, or difficulty, while ten ready-made packs already cover Arabic, Chinese, French, German, Japanese, Korean, Portuguese, Russian, Spanish, and Turkish. Every dataset ships with Croissant metadata and plug-and-play files for OpenAI-Evals, LangChain, and standard CI tools, so teams can drop deterministic micro-benchmarks directly into pull-request gates, prompt-engineering loops, and production dashboards without touching GPU budgets. A complete TQB++ run adds only a few seconds to pipeline latency yet reliably flags prompt-template errors, tokenizer drift, and fine-tuning side-effects long before full-scale suites like MMLU or BIG-Bench would finish configuring. The entire framework is released to accelerate continuous, resource-efficient quality assurance across the generative-AI ecosystem.

[471] Demystifying and Enhancing the Efficiency of Large Language Model Based Search Agents

Tiannuo Yang,Zebin Yao,Bowen Jin,Lixiao Cui,Yusen Li,Gang Wang,Xiaoguang Liu

Main category: cs.AI

TL;DR: SearchAgent-X is a high-efficiency inference framework that addresses the efficiency bottlenecks of LLM-based search agents in retrieval and reasoning.

Details

Motivation: Existing LLM-based search agents face efficiency bottlenecks when they dynamically decompose problems and interleave reasoning with retrieval, including high retrieval overhead and inefficient system design. Method: Proposes SearchAgent-X, which uses high-recall approximate retrieval combined with priority-aware scheduling and non-stall retrieval. Result: Experiments show that SearchAgent-X clearly outperforms state-of-the-art systems (such as vLLM and HNSW-based retrieval) in throughput and latency without compromising generation quality. Conclusion: By optimizing retrieval and scheduling, SearchAgent-X significantly improves the efficiency of LLM-based search agents. Abstract: Large Language Model (LLM)-based search agents have shown remarkable capabilities in solving complex tasks by dynamically decomposing problems and addressing them through interleaved reasoning and retrieval. However, this interleaved paradigm introduces substantial efficiency bottlenecks. First, we observe that both highly accurate and overly approximate retrieval methods degrade system efficiency: exact search incurs significant retrieval overhead, while coarse retrieval requires additional reasoning steps during generation. Second, we identify inefficiencies in system design, including improper scheduling and frequent retrieval stalls, which lead to cascading latency -- where even minor delays in retrieval amplify end-to-end inference time. To address these challenges, we introduce SearchAgent-X, a high-efficiency inference framework for LLM-based search agents. SearchAgent-X leverages high-recall approximate retrieval and incorporates two key techniques: priority-aware scheduling and non-stall retrieval. Extensive experiments demonstrate that SearchAgent-X consistently outperforms state-of-the-art systems such as vLLM and HNSW-based retrieval across diverse tasks, achieving up to 3.4x higher throughput and 5x lower latency, without compromising generation quality. SearchAgent-X is available at https://github.com/tiannuo-yang/SearchAgent-X.

[472] LLM-BABYBENCH: Understanding and Evaluating Grounded Planning and Reasoning in LLMs

Omar Choukrani,Idriss Malek,Daniil Orel,Zhuohan Xie,Zangir Iklassov,Martin Takáč,Salem Lahlou

Main category: cs.AI

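TL;DR: The paper introduces LLM-BabyBench, a benchmark suite for evaluating the planning and reasoning abilities of large language models (LLMs) in interactive environments, comprising three tasks: Predict, Plan, and Decompose.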

Details

Motivation: Assessing how well LLMs plan and reason in interactive environments is crucial for developing capable AI agents. Method: Builds three datasets (Predict, Plan, Decompose) on a textual adaptation of the BabyAI grid world, generates structured data with an expert agent, and provides a standardized evaluation harness and metrics. Result: Initial baseline results indicate that these tasks are challenging. Conclusion: LLM-BabyBench and its associated resources are publicly available to support reproducible evaluation of LLMs on grounded reasoning tasks. Abstract: Assessing the capacity of Large Language Models (LLMs) to plan and reason within the constraints of interactive environments is crucial for developing capable AI agents. We introduce LLM-BabyBench, a new benchmark suite designed specifically for this purpose. Built upon a textual adaptation of the procedurally generated BabyAI grid world, this suite evaluates LLMs on three fundamental aspects of grounded intelligence: (1) predicting the consequences of actions on the environment state (Predict task), (2) generating sequences of low-level actions to achieve specified objectives (Plan task), and (3) decomposing high-level instructions into coherent subgoal sequences (Decompose task). We detail the methodology for generating the three corresponding datasets (LLM-BabyBench-Predict, -Plan, -Decompose) by extracting structured information from an expert agent operating within the text-based environment. Furthermore, we provide a standardized evaluation harness and metrics, including environment interaction for validating generated plans, to facilitate reproducible assessment of diverse LLMs. Initial baseline results highlight the challenges posed by these grounded reasoning tasks. The benchmark suite, datasets, data generation code, and evaluation code are made publicly available (GitHub: https://github.com/choukrani/llm-babybench, HuggingFace: https://huggingface.co/datasets/salem-mbzuai/LLM-BabyBench).

[473] Mitigating Content Effects on Reasoning in Language Models through Fine-Grained Activation Steering

Marco Valentino,Geonhee Kim,Dhairya Dalal,Zhixue Zhao,André Freitas

Main category: cs.AI

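TL;DR: The paper uses activation steering to reduce content bias in the formal reasoning of large language models and proposes a dynamic conditional steering method (K-CAST) that significantly improves reasoning accuracy.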

Details

Motivation: Large language models often conflate content plausibility with logical validity, which leads to biased reasoning and undermines their trustworthiness and generalizability. Method: Curates a controlled syllogistic reasoning dataset, localizes the layers responsible for formal and material inference, studies contrastive activation steering methods, and introduces dynamic conditional steering (K-CAST). Result: Contrastive steering supports linear control over content bias, but a static approach is not sufficient for all tested models; K-CAST improves formal reasoning accuracy by up to 15% (absolute) on unresponsive models and is robust to prompt variations. Conclusion: Activation steering can enhance the robustness of large language models and offers a scalable strategy for more systematic, unbiased formal reasoning. Abstract: Large language models (LLMs) frequently demonstrate reasoning limitations, often conflating content plausibility (i.e., material inference) with logical validity (i.e., formal inference). This can result in biased inferences, where plausible arguments are incorrectly deemed logically valid or vice versa. Mitigating this limitation is critical, as it undermines the trustworthiness and generalizability of LLMs in applications that demand rigorous logical consistency. This paper investigates the problem of mitigating content biases on formal reasoning through activation steering. Specifically, we curate a controlled syllogistic reasoning dataset to disentangle formal validity from content plausibility. After localising the layers responsible for formal and material inference, we investigate contrastive activation steering methods for test-time interventions. An extensive empirical analysis on different LLMs reveals that contrastive steering consistently supports linear control over content biases. However, we observe that a static approach is insufficient for improving all the tested models. We then leverage the possibility to control content effects by dynamically determining the value of the steering parameters via fine-grained conditional methods. We found that conditional steering is effective on unresponsive models, achieving up to 15% absolute improvement in formal reasoning accuracy with a newly introduced kNN-based method (K-CAST). Finally, additional experiments reveal that steering for content effects is robust to prompt variations, incurs minimal side effects on language modeling capabilities, and can partially generalize to out-of-distribution reasoning tasks. Practically, this paper demonstrates that activation-level interventions can offer a scalable strategy for enhancing the robustness of LLMs, contributing towards more systematic and unbiased formal reasoning.
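
As a rough illustration of the two ideas named in this abstract, the sketch below builds a contrastive steering direction as a difference of mean activations and then chooses the steering strength from the nearest stored activations. The abstract does not specify how K-CAST sets its parameters, so the neighbor-averaging rule, the function names, and the toy activations are all assumptions made here for illustration.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def steering_vector(positive_acts, negative_acts):
    """Contrastive direction: mean(positive) minus mean(negative)."""
    pos, neg = mean_vector(positive_acts), mean_vector(negative_acts)
    return [p - q for p, q in zip(pos, neg)]

def steer(hidden, direction, alpha):
    """Shift a hidden state along the steering direction by strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def knn_alpha(hidden, memory, k=3):
    """Hypothetical conditional rule: average the strengths of the k nearest stored activations."""
    nearest = sorted(memory, key=lambda item: math.dist(hidden, item[0]))[:k]
    return sum(alpha for _, alpha in nearest) / len(nearest)

# Toy activations standing in for one layer's hidden states (hypothetical values).
valid_acts = [[1.0, 1.0], [3.0, 1.0]]
invalid_acts = [[0.0, 0.0], [0.0, 2.0]]
direction = steering_vector(valid_acts, invalid_acts)

# Stored (activation, steering strength) pairs, also hypothetical.
memory = [([0.0, 0.0], 1.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([10.0, 10.0], -1.0)]
hidden = [0.5, 0.5]
alpha = knn_alpha(hidden, memory)
steered = steer(hidden, direction, alpha)
```

A static method would apply one fixed `alpha` to every input; the conditional variant lets the strength, and even the sign, depend on where the current activation lies.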

[474] Efficient RL Training for Reasoning Models via Length-Aware Optimization

Danlong Yuan,Tian Xie,Shaohan Huang,Zhuocheng Gong,Huishuai Zhang,Chong Luo,Furu Wei,Dongyan Zhao

Main category: cs.AI

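TL;DR: Proposes a method that optimizes reasoning path length directly within reinforcement learning, with no extra training stage, substantially reducing response length while maintaining or improving performance.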

Details

Motivation: Large reasoning models perform well on reasoning tasks, but their reasoning paths are long and resource-intensive, and existing methods need additional training data or stages. Method: Integrates three key reward designs into the reinforcement learning process to optimize reasoning path length directly. Result: In a logic reasoning setting, response length drops by 40% alongside a 14% performance gain; for math problems, it drops by 33% while performance is preserved. Conclusion: The method effectively shortens the responses of reasoning models while maintaining or improving performance, without extra training stages. Abstract: Large reasoning models, such as OpenAI o1 or DeepSeek R1, have demonstrated remarkable performance on reasoning tasks but often incur a long reasoning path with significant memory and time costs. Existing methods primarily aim to shorten reasoning paths by introducing additional training data and stages. In this paper, we propose three critical reward designs integrated directly into the reinforcement learning process of large reasoning models, which reduce the response length without extra training stages. Experiments on four settings show that our method significantly decreases response length while maintaining or even improving performance. Specifically, in a logic reasoning setting, we achieve a 40% reduction in response length averaged by steps alongside a 14% gain in performance. For math problems, we reduce response length averaged by steps by 33% while preserving performance.

[475] Beyond Single-Point Judgment: Distribution Alignment for LLM-as-a-Judge

Luyu Chen,Zeyu Zhang,Haoran Tan,Quanyu Dai,Hao Yang,Zhenhua Dong,Xu Chen

Main category: cs.AI

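TL;DR: Proposes a new training framework that uses distribution alignment to improve the diversity and reliability of LLM-based evaluation, outperforming existing methods.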

Details

Motivation: Existing LLM evaluation methods rely on single-point judgments and overlook the diversity and uncertainty of human evaluations, which causes information loss and lowers reliability. Method: Proposes a distributional alignment objective based on KL divergence, adds cross-entropy regularization to stabilize training, and introduces adversarial training to make the model more robust. Result: Across various LLMs and tasks, the framework significantly outperforms closed-source LLMs and conventional single-point alignment methods, improving alignment quality, evaluation accuracy, and robustness. Conclusion: The new framework addresses the limitations of single-point evaluation and improves the diversity and reliability of LLM-based evaluation. Abstract: LLMs have emerged as powerful evaluators in the LLM-as-a-Judge paradigm, offering significant efficiency and flexibility compared to human judgments. However, previous methods primarily rely on single-point evaluations, overlooking the inherent diversity and uncertainty in human evaluations. This approach leads to information loss and decreases the reliability of evaluations. To address this limitation, we propose a novel training framework that explicitly aligns the LLM-generated judgment distribution with empirical human distributions. Specifically, we propose a distributional alignment objective based on KL divergence, combined with an auxiliary cross-entropy regularization to stabilize the training process. Furthermore, considering that empirical distributions may derive from limited human annotations, we incorporate adversarial training to enhance model robustness against distribution perturbations. Extensive experiments across various LLM backbones and evaluation tasks demonstrate that our framework significantly outperforms existing closed-source LLMs and conventional single-point alignment methods, with improved alignment quality, evaluation accuracy, and robustness.
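
One way to read the objective described here is as a KL term plus a weighted cross-entropy term. The sketch below assumes the KL direction is human-to-model, uses the majority human label as the cross-entropy target, and picks an arbitrary weight `lam`; the abstract fixes none of these, and the adversarial training component is omitted.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same rating scale."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(target_index, q, eps=1e-12):
    """Negative log-probability the model assigns to a single target rating."""
    return -math.log(q[target_index] + eps)

def alignment_loss(human_dist, model_dist, lam=0.1):
    """KL alignment term plus a cross-entropy regularizer (assumed form)."""
    # Assumption: the regularizer targets the rating most humans chose.
    majority = max(range(len(human_dist)), key=human_dist.__getitem__)
    return kl_divergence(human_dist, model_dist) + lam * cross_entropy(majority, model_dist)

# Hypothetical rating distributions over a 3-point scale.
human = [0.2, 0.5, 0.3]
close_model = [0.25, 0.45, 0.30]
single_point_model = [0.01, 0.98, 0.01]  # nearly all mass on one rating

loss_close = alignment_loss(human, close_model)
loss_single = alignment_loss(human, single_point_model)
```

The single-point judge picks the right majority rating yet receives the larger loss, because it ignores the spread of human opinion that the KL term measures.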

[476] MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks

Yinghao Zhu,Ziyi He,Haoran Hu,Xiaochen Zheng,Xichen Zhang,Zixiang Wang,Junyi Gao,Liantao Ma,Lequan Yu

Main category: cs.AI

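TL;DR: MedAgentBoard is a benchmark for evaluating multi-agent collaboration, single-LLM, and conventional methods on medical tasks, and it reveals the limitations of multi-agent collaboration.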

Details

Motivation: Existing evaluations lack generalizability and rigor: they do not cover the full range of real clinical tasks and rarely compare against conventional methods. Method: Introduces the MedAgentBoard benchmark, covering four categories of medical tasks, and runs comparative experiments across multi-agent collaboration, single-LLM, and conventional methods. Result: Multi-agent collaboration performs well in specific scenarios (such as clinical workflow automation) but falls behind single LLMs or conventional methods on other tasks. Conclusion: AI solutions should be chosen according to task characteristics, and the complexity of multi-agent collaboration must be weighed against its performance gains. Abstract: The rapid advancement of Large Language Models (LLMs) has stimulated interest in multi-agent collaboration for addressing complex medical tasks. However, the practical advantages of multi-agent collaboration approaches remain insufficiently understood. Existing evaluations often lack generalizability, failing to cover diverse tasks reflective of real-world clinical practice, and frequently omit rigorous comparisons against both single-LLM-based and established conventional methods. To address this critical gap, we introduce MedAgentBoard, a comprehensive benchmark for the systematic evaluation of multi-agent collaboration, single-LLM, and conventional approaches. MedAgentBoard encompasses four diverse medical task categories: (1) medical (visual) question answering, (2) lay summary generation, (3) structured Electronic Health Record (EHR) predictive modeling, and (4) clinical workflow automation, across text, medical images, and structured EHR data. Our extensive experiments reveal a nuanced landscape: while multi-agent collaboration demonstrates benefits in specific scenarios, such as enhancing task completeness in clinical workflow automation, it does not consistently outperform advanced single LLMs (e.g., in textual medical QA) or, critically, specialized conventional methods that generally maintain better performance in tasks like medical VQA and EHR-based prediction. MedAgentBoard offers a vital resource and actionable insights, emphasizing the necessity of a task-specific, evidence-based approach to selecting and developing AI solutions in medicine. It underscores that the inherent complexity and overhead of multi-agent collaboration must be carefully weighed against tangible performance gains. All code, datasets, detailed prompts, and experimental results are open-sourced at https://medagentboard.netlify.app/.

[477] mCLM: A Function-Infused and Synthesis-Friendly Modular Chemical Language Model

Carl Edwards,Chi Han,Gawon Lee,Thao Nguyen,Bowen Jin,Chetan Kumar Prasad,Sara Szymkuć,Bartosz A. Grzybowski,Ying Diao,Jiawei Han,Ge Liu,Hao Peng,Martin D. Burke,Heng Ji

Main category: cs.AI

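TL;DR: The paper proposes mCLM, a modular chemical language model that decomposes molecules into functional building blocks to generate synthesizable, function-optimized molecules.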

Details

Motivation: Existing large language models (LLMs) have limited ability to propose novel molecules with drug-like properties, and the molecules they propose are often hard to synthesize in the lab. A method that decomposes molecules into functional building blocks is therefore needed to improve synthesizability and function. Method: Proposes mCLM, which tokenizes molecules into functional building blocks and learns a bilingual language model over natural language descriptions and molecular building blocks. Result: In experiments on 430 FDA-approved drugs, mCLM significantly improves 5 out of 6 key chemical functions and can improve FDA-rejected drugs over multiple iterations. Conclusion: By working at the level of functional building blocks, mCLM markedly improves the synthesizability and function of molecules and offers a new approach to drug discovery. Abstract: Despite their ability to understand chemical knowledge and accurately generate sequential representations, large language models (LLMs) remain limited in their capacity to propose novel molecules with drug-like properties. In addition, the molecules that LLMs propose can often be challenging to make in the lab. To more effectively enable the discovery of functional small molecules, LLMs need to learn a molecular language. However, LLMs are currently limited by encoding molecules from atoms. In this paper, we argue that just like tokenizing texts into (sub-)word tokens instead of characters, molecules should be decomposed and reassembled at the level of functional building blocks, i.e., parts of molecules that bring unique functions and serve as effective building blocks for real-world automated laboratory synthesis. This motivates us to propose mCLM, a modular Chemical-Language Model tokenizing molecules into building blocks and learning a bilingual language model of both natural language descriptions of functions and molecule building blocks. By reasoning on such functional building blocks, mCLM is guaranteed to generate efficiently synthesizable molecules thanks to recent progress in block-based chemistry, while also improving the functions of molecules in a principled manner. In experiments on 430 FDA-approved drugs, we find mCLM capable of significantly improving 5 out of 6 chemical functions critical to determining drug potentials. More importantly, mCLM can reason on multiple functions and improve the FDA-rejected drugs ("fallen angels") over multiple iterations to greatly improve their shortcomings.

[478] Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving on Inequalities

Haoyu Zhao,Yihan Geng,Shange Tang,Yong Lin,Bohan Lyu,Hongzhou Lin,Chi Jin,Sanjeev Arora

Main category: cs.AI

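TL;DR: The paper examines whether LLM-based formal proof assistants (e.g., in Lean) truly understand mathematical structure, using mathematical inequalities to test their compositional reasoning.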

Details

Motivation: To investigate whether AI proof assistants have human-like mathematical intuition and compositional reasoning ability. Method: Introduces the Ineq-Comp benchmark, which generates problems through systematic transformations (such as variable duplication, algebraic rewriting, and multi-step composition) and tests several provers on them. Result: Most provers perform poorly; DeepSeek-Prover-V2-7B is relatively robust but still drops by 20%, and compositional reasoning is weak across the board. Conclusion: Current AI provers show a significant gap relative to human mathematical intuition in compositional reasoning. Abstract: LLM-based formal proof assistants (e.g., in Lean) hold great promise for automating mathematical discovery. But beyond syntactic correctness, do these systems truly understand mathematical structure as humans do? We investigate this question through the lens of mathematical inequalities -- a fundamental tool across many domains. While modern provers can solve basic inequalities, we probe their ability to handle human-intuitive compositionality. We introduce Ineq-Comp, a benchmark built from elementary inequalities through systematic transformations, including variable duplication, algebraic rewriting, and multi-step composition. Although these problems remain easy for humans, we find that most provers -- including Goedel, STP, and Kimina-7B -- struggle significantly. DeepSeek-Prover-V2-7B shows relative robustness -- possibly because it is trained to decompose the problems into sub-problems -- but still suffers a 20% performance drop (pass@32). Strikingly, performance remains poor for all models even when formal proofs of the constituent parts are provided in context, revealing that the source of weakness is indeed in compositional reasoning. Our results expose a persisting gap between the generalization behavior of current AI provers and human mathematical intuition.
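
To give a feel for what "variable duplication" could look like, the Lean 4 sketch below states a seed inequality and a composed version obtained by applying the seed to two disjoint pairs of variables and adding the results. The two statements are illustrative examples chosen here, not problems taken from Ineq-Comp, and the proofs assume a recent Mathlib in which `nlinarith` and `linarith` behave as usual.

```lean
import Mathlib

-- Seed inequality (illustrative): x^2 + y^2 ≥ 2xy, which follows from (x - y)^2 ≥ 0.
theorem seed (x y : ℝ) : x ^ 2 + y ^ 2 ≥ 2 * x * y := by
  nlinarith [sq_nonneg (x - y)]

-- Composed problem (illustrative): duplicate the variables and add both instances.
theorem composed (a b c d : ℝ) :
    (a ^ 2 + b ^ 2) + (c ^ 2 + d ^ 2) ≥ 2 * a * b + 2 * c * d := by
  have h1 := seed a b
  have h2 := seed c d
  linarith
```

A human sees at once that the second statement is two copies of the first; the abstract reports that provers often fail on this kind of step even when the constituent proofs are supplied in context.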

[479] Bullying the Machine: How Personas Increase LLM Vulnerability

Ziwei Xu,Udit Sanghi,Mohan Kankanhalli

Main category: cs.AI

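TL;DR: The study finds that when large language models (LLMs) adopt specific personas, their safety is affected by bullying, and certain persona configurations increase the risk of unsafe outputs.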

Details

Motivation: To examine whether persona conditioning affects the safety of LLMs under bullying and to expose the new safety risks that persona-driven interaction brings. Method: Introduces a simulation framework in which an attacker LLM uses psychological bullying tactics against a victim LLM that adopts personas based on the Big Five personality traits. Result: Certain persona configurations (such as weakened agreeableness or conscientiousness) significantly increase the victim's risk of unsafe outputs, and emotional or sarcastic bullying tactics are especially effective. Conclusion: Persona-driven interaction opens a new avenue for safety risks in LLMs and calls for persona-aware safety evaluation and alignment strategies. Abstract: Large Language Models (LLMs) are increasingly deployed in interactions where they are prompted to adopt personas. This paper investigates whether such persona conditioning affects model safety under bullying, an adversarial manipulation that applies psychological pressures in order to force the victim to comply with the attacker. We introduce a simulation framework in which an attacker LLM engages a victim LLM using psychologically grounded bullying tactics, while the victim adopts personas aligned with the Big Five personality traits. Experiments using multiple open-source LLMs and a wide range of adversarial goals reveal that certain persona configurations -- such as weakened agreeableness or conscientiousness -- significantly increase the victim's susceptibility to unsafe outputs. Bullying tactics involving emotional or sarcastic manipulation, such as gaslighting and ridicule, are particularly effective. These findings suggest that persona-driven interaction introduces a novel vector for safety risks in LLMs and highlight the need for persona-aware safety evaluation and alignment strategies.

[480] Detection and Mitigation of Hallucination in Large Reasoning Models: A Mechanistic Perspective

Zhongxiang Sun,Qipeng Wang,Haoyu Wang,Xiao Zhang,Jun Xu

Main category: cs.AI

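TL;DR: The paper studies reasoning hallucination in large reasoning models (LRMs), proposes a score that quantifies reasoning depth, and develops a detection framework and a reinforcement learning algorithm to reduce hallucinations.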

Details

Motivation: Large reasoning models excel at multi-step reasoning tasks, but this comes with reasoning hallucination: logically coherent yet factually incorrect reasoning that leads to wrong conclusions. Such errors are harder to detect and more harmful. Method: Proposes the Reasoning Score to quantify reasoning depth, develops the Reasoning Hallucination Detection (RHD) framework, and introduces the reinforcement learning algorithm GRPO-R to reduce hallucinations. Result: Two key reasoning hallucination patterns are identified on the ReTruthQA dataset; RHD achieves state-of-the-art performance across multiple domains, and GRPO-R improves reasoning quality and lowers hallucination rates. Conclusion: By quantifying reasoning depth and introducing a reinforcement learning algorithm, the paper detects and reduces reasoning hallucinations, improving the reliability of large reasoning models. Abstract: Large Reasoning Models (LRMs) have shown impressive capabilities in multi-step reasoning tasks. However, alongside these successes, a more deceptive form of model error has emerged -- Reasoning Hallucination -- where logically coherent but factually incorrect reasoning traces lead to persuasive yet faulty conclusions. Unlike traditional hallucinations, these errors are embedded within structured reasoning, making them more difficult to detect and potentially more harmful. In this work, we investigate reasoning hallucinations from a mechanistic perspective. We propose the Reasoning Score, which quantifies the depth of reasoning by measuring the divergence between logits obtained from projecting late layers of LRMs to the vocabulary space, effectively distinguishing shallow pattern-matching from genuine deep reasoning. Using this score, we conduct an in-depth analysis on the ReTruthQA dataset and identify two key reasoning hallucination patterns: early-stage fluctuation in reasoning depth and incorrect backtracking to flawed prior steps. These insights motivate our Reasoning Hallucination Detection (RHD) framework, which achieves state-of-the-art performance across multiple domains. To mitigate reasoning hallucinations, we further introduce GRPO-R, an enhanced reinforcement learning algorithm that incorporates step-level deep reasoning rewards via potential-based shaping. Our theoretical analysis establishes stronger generalization guarantees, and experiments demonstrate improved reasoning quality and reduced hallucination rates.
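
The abstract describes the Reasoning Score only as a divergence between logits obtained by projecting late layers to the vocabulary space. The sketch below illustrates that general idea and is not the authors' implementation: the choice of Jensen-Shannon divergence, the comparison of each late layer against the final layer, and the names `layer_divergence_score`, `hidden_states`, and `unembed` are all assumptions made for illustration.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two distributions (natural log)."""
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * (np.log(p + eps) - np.log(m + eps)), axis=-1)
    kl_qm = np.sum(q * (np.log(q + eps) - np.log(m + eps)), axis=-1)
    return 0.5 * (kl_pm + kl_qm)


def layer_divergence_score(hidden_states, unembed):
    """Mean divergence between late-layer and final-layer vocabulary distributions.

    hidden_states: (num_layers, d_model) hidden states of one token position,
                   ordered from earlier to later layers (hypothetical input).
    unembed:       (d_model, vocab_size) projection to the vocabulary space.
    """
    if hidden_states.shape[0] < 2:
        raise ValueError("need at least two layers to compare")
    probs = softmax(hidden_states @ unembed)  # (num_layers, vocab_size)
    final = probs[-1]
    return float(np.mean([js_divergence(p, final) for p in probs[:-1]]))
```

Under these assumptions, a score near zero means the late layers already agree with the final prediction, while a larger score means the prediction is still changing across late layers. How such a quantity maps onto "shallow pattern-matching" versus "deep reasoning" is defined in the paper, not here.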

[481] TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios

Shaohang Wei,Wei Li,Feifan Song,Wen Luo,Tianyi Zhuang,Haochen Tan,Zhijiang Guo,Houfeng Wang

Main category: cs.AI

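TL;DR: The paper proposes TIME, a multi-level benchmark for temporal reasoning under real-world challenges, containing 38,522 QA pairs that cover 3 levels and 11 sub-tasks.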

Details

Motivation: Existing work overlooks the real-world challenges of temporal reasoning, such as dense temporal information, fast-changing event dynamics, and complex temporal dependencies. Method: Designs the TIME benchmark, comprising three sub-datasets (TIME-Wiki, TIME-News, TIME-Dial), and runs extensive experiments and analysis. Result: The experiments analyze temporal reasoning performance across different real-world scenarios and tasks and summarize the impact of test-time scaling on temporal reasoning ability. Conclusion: TIME fills a gap in temporal reasoning research, and the TIME-Lite subset is released to support future work. Abstract: Temporal reasoning is pivotal for Large Language Models (LLMs) to comprehend the real world. However, existing works neglect the real-world challenges for temporal reasoning: (1) intensive temporal information, (2) fast-changing event dynamics, and (3) complex temporal dependencies in social interactions. To bridge this gap, we propose a multi-level benchmark TIME, designed for temporal reasoning in real-world scenarios. TIME consists of 38,522 QA pairs, covering 3 levels with 11 fine-grained sub-tasks. This benchmark encompasses 3 sub-datasets reflecting different real-world challenges: TIME-Wiki, TIME-News, and TIME-Dial. We conduct extensive experiments on reasoning models and non-reasoning models. We also conduct an in-depth analysis of temporal reasoning performance across diverse real-world scenarios and tasks, and summarize the impact of test-time scaling on temporal reasoning capabilities. Additionally, we release TIME-Lite, a human-annotated subset to foster future research and standardized evaluation in temporal reasoning. The code is available at https://github.com/sylvain-wei/TIME , and the dataset is available at https://huggingface.co/datasets/SylvainWei/TIME .

[482] LLM-KG-Bench 3.0: A Compass for Semantic Technology Capabilities in the Ocean of LLMs

Lars-Peter Meyer,Johannes Frey,Desiree Heim,Felix Brei,Claus Stadler,Kurt Junghanns,Michael Martin

Main category: cs.AI

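TL;DR: Version 3.0 of the LLM-KG-Bench framework evaluates large language models (LLMs) on knowledge graph (KG) and Semantic Web tasks, supports automated task evaluation, and comes with a dataset and example model cards.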

Details

Motivation: To examine whether LLMs can support Knowledge Graph Engineering (KGE) and Semantic Web tasks, and to determine which model is best without manual checking. Method: LLM-KG-Bench 3.0 provides an extensible set of tasks for automated evaluation of LLM answers, covering a variety of semantic technology tasks. Result: A dataset generated with more than 30 contemporary LLMs shows the models' capabilities on RDF and SPARQL tasks and compares their performance. Conclusion: LLM-KG-Bench 3.0 is an efficient tool for evaluating LLMs in the KGE domain, supporting many tasks and models. Abstract: Current Large Language Models (LLMs) can assist in developing program code, among many other things, but can they support working with Knowledge Graphs (KGs) as well? Which LLM offers the best capabilities in the field of Semantic Web and Knowledge Graph Engineering (KGE)? Is this possible to determine without checking many answers manually? The LLM-KG-Bench framework in Version 3.0 is designed to answer these questions. It consists of an extensible set of tasks for automated evaluation of LLM answers and covers different aspects of working with semantic technologies. In this paper the LLM-KG-Bench framework is presented in Version 3 along with a dataset of prompts, answers and evaluations generated with it and several state-of-the-art LLMs. Significant enhancements have been made to the framework since its initial release, including an updated task API that offers greater flexibility in handling evaluation tasks, revised tasks, and extended support for various open models through the vllm library, among other improvements. A comprehensive dataset has been generated using more than 30 contemporary open and proprietary LLMs, enabling the creation of exemplary model cards that demonstrate the models' capabilities in working with RDF and SPARQL, as well as comparing their performance on Turtle and JSON-LD RDF serialization tasks.

[483] Zero-Shot Iterative Formalization and Planning in Partially Observable Environments

Liancheng Gong,Wang Zhu,Jesse Thomason,Li Zhang

Main category: cs.AI

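TL;DR: The paper proposes PDDLego+, a framework that iteratively formalizes PDDL representations in a zero-shot manner in partially observable environments, improving planning performance.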

Details

Motivation: Existing methods fail in partially observable environments because information is incomplete. Method: Proposes PDDLego+, which iteratively formalizes, plans, grows, and refines PDDL representations without needing existing trajectories. Result: It performs strongly on two textual simulated environments, is robust to problem complexity, and the captured domain knowledge is interpretable and benefits future tasks. Conclusion: PDDLego+ is effective in partially observable environments and is both scalable and interpretable. Abstract: In planning, using LLMs not to predict plans but to formalize an environment into the Planning Domain Definition Language (PDDL) has been shown to greatly improve performance and control. While most work has focused on fully observable environments, we tackle the more realistic and challenging partially observable environments where existing methods are incapacitated by the lack of complete information. We propose PDDLego+, a framework to iteratively formalize, plan, grow, and refine PDDL representations in a zero-shot manner, without needing access to any existing trajectories. On two textual simulated environments, we show that PDDLego+ not only achieves superior performance, but also shows robustness against problem complexity. We also show that the domain knowledge captured after a successful trial is interpretable and benefits future tasks.

[484] Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis

Tianbao Xie,Jiaqi Deng,Xiaochuan Li,Junlin Yang,Haoyuan Wu,Jixuan Chen,Wenjing Hu,Xinyuan Wang,Yuhui Xu,Zekun Wang,Yiheng Xu,Junli Wang,Doyen Sahoo,Tao Yu,Caiming Xiong

Main category: cs.AI

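TL;DR: The paper introduces the OSWorld-G benchmark and the Jedi dataset to improve the mapping of natural language instructions to GUI actions, and validates their effectiveness experimentally.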

Details

Motivation: Current GUI benchmarks oversimplify the task and fail to capture the complexity of real interactions, such as software commonsense, layout understanding, and fine-grained manipulation. Method: Introduces the OSWorld-G benchmark and the Jedi dataset, and validates them by training multi-scale models. Result: The models outperform existing approaches on several benchmarks and raise the agentic performance of foundation models on OSWorld from 5% to 27%. Conclusion: The Jedi dataset and OSWorld-G benchmark substantially improve performance on GUI tasks and support compositional generalization to novel interfaces. Abstract: Graphical user interface (GUI) grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer use agent development. Current benchmarks oversimplify grounding tasks as short referring expressions, failing to capture the complexity of real-world interactions that require software commonsense, layout understanding, and fine-grained manipulation capabilities. To address these limitations, we introduce OSWorld-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types including text matching, element recognition, layout understanding, and precise manipulation. Additionally, we synthesize and release the largest computer use grounding dataset Jedi, which contains 4 million examples through multi-perspective decoupling of tasks. Our multi-scale models trained on Jedi demonstrate its effectiveness by outperforming existing approaches on ScreenSpot-v2, ScreenSpot-Pro, and our OSWorld-G. Furthermore, we demonstrate that improved grounding with Jedi directly enhances agentic capabilities of general foundation models on complex computer tasks, improving from 5% to 27% on OSWorld. Through detailed ablation studies, we identify key factors contributing to grounding performance and verify that combining specialized data for different interface elements enables compositional generalization to novel interfaces. All benchmark, data, checkpoints, and code are open-sourced and available at https://osworld-grounding.github.io.

[485] CompeteSMoE -- Statistically Guaranteed Mixture of Experts Training via Competition

Nam V. Nguyen,Huy Nguyen,Quang Pham,Van Nguyen,Savitha Ramasamy,Nhat Ho

Main category: cs.AI

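TL;DR: The paper proposes CompeteSMoE, a mechanism that improves the training efficiency of sparse mixture of experts (SMoE) through a competition-based routing strategy, and shows experimentally that it outperforms existing methods.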

Details

Motivation: Routing in conventional SMoE training is inefficient: expert computation is decoupled from the routing process, which hurts performance. Method: Proposes a competition mechanism that routes tokens to the experts with the highest neural response, and develops the CompeteSMoE algorithm, in which a router learns the competition policy. Result: Experiments show that CompeteSMoE is effective, robust, and scalable on visual instruction tuning and language pre-training tasks. Conclusion: The competition mechanism significantly improves SMoE training efficiency and performance, offering a better solution for large language models. Abstract: Sparse mixture of experts (SMoE) offers an appealing solution to scale up the model complexity beyond the means of increasing the network's depth or width. However, we argue that effective SMoE training remains challenging because of the suboptimal routing process where experts that perform computation do not directly contribute to the routing process. In this work, we propose competition, a novel mechanism to route tokens to experts with the highest neural response. Theoretically, we show that the competition mechanism enjoys a better sample efficiency than the traditional softmax routing. Furthermore, we develop CompeteSMoE, a simple yet effective algorithm to train large language models by deploying a router to learn the competition policy, thus enjoying strong performance at a low training overhead. Our extensive empirical evaluations on both the visual instruction tuning and language pre-training tasks demonstrate the efficacy, robustness, and scalability of CompeteSMoE compared to state-of-the-art SMoE strategies. We have made the implementation available at: https://github.com/Fsoft-AIC/CompeteSMoE. This work is an improved version of the previous study at arXiv:2402.02526.
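
The competition idea in the abstract, routing a token to the experts with the highest neural response, can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the CompeteSMoE implementation: each expert is a single `tanh` layer, the "neural response" is taken to be the L2 norm of the expert output, and the names `competition_route` and `expert_weights` are hypothetical. The learned router that imitates the competition outcome is omitted.

```python
import numpy as np


def competition_route(x, expert_weights, k=2):
    """Pick the k experts whose outputs respond most strongly to token x.

    x:              (d_in,) token representation.
    expert_weights: (num_experts, d_in, d_out), one toy linear expert each.
    Returns (selected expert indices, gate weights, combined output).
    """
    # Competition: every expert computes its output for the token.
    outputs = np.tanh(np.einsum("eio,i->eo", expert_weights, x))
    # Assumed response measure: L2 norm of each expert's output.
    responses = np.linalg.norm(outputs, axis=-1)
    top = np.argsort(-responses)[:k]
    # Softmax over the winning responses gives the mixing weights.
    z = responses[top] - responses[top].max()
    gates = np.exp(z) / np.exp(z).sum()
    combined = (gates[:, None] * outputs[top]).sum(axis=0)
    return top, gates, combined
```

Running every expert for every token is exactly the cost that sparse routing is meant to avoid, which is presumably why the abstract pairs competition with a router trained to learn the competition policy at low overhead.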

[486] CoT-Kinetics: A Theoretical Modeling Assessing LRM Reasoning Process

Jinhe Bi,Danqi Yan,Yifan Wang,Wenke Huang,Haokun Chen,Guancheng Wan,Mang Ye,Xun Xiao,Hinrich Schuetze,Volker Tresp,Yunpu Ma

Main category: cs.AI

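TL;DR: The paper proposes the CoT-Kinetics energy equation, inspired by classical mechanics, to assess the soundness of the reasoning trajectories produced by large reasoning models (LRMs) and thereby measure the quality of their answers more accurately.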

Details

Motivation: Existing methods for assessing LRM answers do not adequately reflect the causal relationship between the reasoning and the answer, so their judgment of answer quality is imprecise. Method: Inspired by classical mechanics, proposes the CoT-Kinetics energy equation, which treats the token state transformations regulated by the LRM's internal transformer layers as particle dynamics in a mechanical field and assigns a scalar score to the reasoning phase. Result: The CoT-Kinetics energy equation specifically evaluates the soundness of the reasoning phase and gives a fine-grained measure of answer confidence, going beyond a binary judgment (correct or incorrect). Conclusion: The method improves the assessment of LRM output quality and provides a more reliable basis for solving complex reasoning tasks. Abstract: Recent Large Reasoning Models (LRMs) significantly improve the reasoning ability of Large Language Models by learning to reason, exhibiting promising performance in solving complex tasks. LRMs solve tasks that require complex reasoning by explicitly generating reasoning trajectories together with answers. Nevertheless, judging the quality of such an output answer is not easy, because considering only the correctness of the answer is not enough; the soundness of the reasoning trajectory matters as well. Logically, if the soundness of the reasoning part is poor, even if the answer is correct, the confidence in the derived answer should be low. Existing methods do consider jointly assessing the overall output answer by taking the reasoning part into account; however, their capability is still not satisfactory, as the causal relationship between the reasoning and the concluded answer cannot be properly reflected. In this paper, inspired by classical mechanics, we present a novel approach towards establishing a CoT-Kinetics energy equation. Specifically, our CoT-Kinetics energy equation formulates the token state transformation process, which is regulated by the LRM's internal transformer layers, as particle dynamics governed by a mechanical field. Our CoT-Kinetics energy assigns a scalar score that evaluates specifically the soundness of the reasoning phase, indicating how confident the derived answer can be given the evaluated reasoning. As such, the LRM's overall output quality can be measured accurately, rather than through a coarse judgment (e.g., correct or incorrect).

[487] StarFT: Robust Fine-tuning of Zero-shot Models via Spuriosity Alignment

Younghyun Kim,Jongheon Jeong,Sangkyung Kwak,Kyungmin Lee,Juho Lee,Jinwoo Shin

Main category: cs.AI

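TL;DR: The paper proposes StarFT, a framework that uses regularization to keep zero-shot models from learning spurious features during fine-tuning, improving robustness.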

Details

Motivation: Zero-shot models such as CLIP tend to learn spurious features (e.g., background or texture) when fine-tuned on downstream tasks, which degrades robustness. Method: Proposes the StarFT framework, which introduces a regularization that uses language-model-generated spuriosity-injected labels to keep the model from learning irrelevant features. Result: In the Waterbirds scenario, StarFT improves worst-group and average accuracy by 14.30% and 3.02%, respectively. Conclusion: StarFT effectively prevents the model from learning spurious features and improves the robustness and classification performance of zero-shot models. Abstract: Learning robust representations from data often requires scale, which has led to the success of recent zero-shot models such as CLIP. However, the obtained robustness can easily deteriorate when these models are fine-tuned on other downstream tasks (e.g., of smaller scales). Previous works often interpret this phenomenon in the context of domain shift, developing fine-tuning methods that aim to preserve the original domain as much as possible. However, in a different context, fine-tuned models with limited data are also prone to learning features that are spurious to humans, such as background or texture. In this paper, we propose StarFT (Spurious Textual Alignment Regularization), a novel framework for fine-tuning zero-shot models to enhance robustness by preventing them from learning spuriosity. We introduce a regularization that aligns the output distribution for spuriosity-injected labels with the original zero-shot model, ensuring that the model is not induced to extract irrelevant features further from these descriptions. We leverage recent language models to get such spuriosity-injected labels by generating alternative textual descriptions that highlight potentially confounding features. Extensive experiments validate the robust generalization of StarFT and its emerging properties: zero-shot group robustness and improved zero-shot classification. Notably, StarFT boosts both worst-group and average accuracy by 14.30% and 3.02%, respectively, in the Waterbirds group shift scenario, where other robust fine-tuning baselines show even degraded performance.

[488] Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards

Xiaoyuan Liu,Tian Liang,Zhiwei He,Jiahao Xu,Wenxuan Wang,Pinjia He,Zhaopeng Tu,Haitao Mi,Dong Yu

Main category: cs.AI

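TL;DR: RISE is a new online reinforcement learning framework that trains an LLM's problem-solving and self-verification abilities at the same time, markedly improving both reasoning accuracy and self-verification.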

Details

Motivation: To address "superficial self-reflection" in complex LLM reasoning, where models cannot effectively verify their own outputs. Method: RISE uses online reinforcement learning with verifiable rewards to optimize problem solving and self-verification together; the model generates solutions and critiques them, and both trajectories update the policy. Result: On diverse mathematical reasoning benchmarks, RISE significantly improves problem-solving accuracy and self-verification ability, and the models verify themselves more frequently and more accurately. Conclusion: RISE offers a flexible and effective path towards more robust and self-aware reasoning models. Abstract: Large Language Models (LLMs) show great promise in complex reasoning, with Reinforcement Learning with Verifiable Rewards (RLVR) being a key enhancement strategy. However, a prevalent issue is "superficial self-reflection", where models fail to robustly verify their own outputs. We introduce RISE (Reinforcing Reasoning with Self-Verification), a novel online RL framework designed to tackle this. RISE explicitly and simultaneously trains an LLM to improve both its problem-solving and self-verification abilities within a single, integrated RL process. The core mechanism involves leveraging verifiable rewards from an outcome verifier to provide on-the-fly feedback for both solution generation and self-verification tasks. In each iteration, the model generates solutions, then critiques its own on-policy generated solutions, with both trajectories contributing to the policy update. Extensive experiments on diverse mathematical reasoning benchmarks show that RISE consistently improves the model's problem-solving accuracy while concurrently fostering strong self-verification skills. Our analyses highlight the advantages of online verification and the benefits of increased verification compute. Additionally, RISE models exhibit more frequent and accurate self-verification behaviors during reasoning. These advantages reinforce RISE as a flexible and effective path towards developing more robust and self-aware reasoners.

[489] Advancing Generalization Across a Variety of Abstract Visual Reasoning Tasks

Mikołaj Małkiński,Jacek Mańdziuk

Main category: cs.AI

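TL;DR: The paper proposes PoNG, a new neural architecture for improving model generalization on abstract visual reasoning tasks, which performs strongly on several benchmarks.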

Details

Motivation: In abstract visual reasoning (AVR), models perform well in independent and identically distributed (i.i.d.) settings, but their generalization in out-of-distribution (o.o.d.) settings is still insufficient. Method: Proposes the PoNG model, which combines group convolution, normalization, and a parallel design. Result: Experiments show that PoNG generalizes strongly on several AVR benchmarks and outperforms existing methods in some settings. Conclusion: PoNG shows potential for improving generalization in AVR tasks and points to new directions for future research. Abstract: The abstract visual reasoning (AVR) domain presents a diverse suite of analogy-based tasks devoted to studying model generalization. Recent years have brought dynamic progress in the field, particularly in i.i.d. scenarios, in which models are trained and evaluated on the same data distributions. Nevertheless, o.o.d. setups that assess model generalization to new test distributions remain challenging even for the most recent models. To advance generalization in AVR tasks, we present the Pathways of Normalized Group Convolution model (PoNG), a novel neural architecture that features group convolution, normalization, and a parallel design. We consider a wide set of AVR benchmarks, including Raven's Progressive Matrices and visual analogy problems with both synthetic and real-world images. The experiments demonstrate strong generalization capabilities of the proposed model, which in several settings outperforms the existing literature methods.

[490] MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision

Lingxiao Du,Fanqing Meng,Zongkai Liu,Zhixiang Zhou,Ping Luo,Qiaosheng Zhang,Wenqi Shao

Main category: cs.AI

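TL;DR: MM-PRM uses an automatically trained process reward model (PRM) to improve the logical consistency of multimodal large language models (MLLMs) in complex multi-step reasoning, yielding significant performance gains.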

Details

Motivation: Existing MLLMs perform poorly at multi-step reasoning and lack fine-grained supervision over intermediate reasoning steps. Method: Proposes MM-PRM; building on MM-Policy and the MM-K12 dataset, it uses MCTS to generate a large number of step-level annotations without human labeling and trains a PRM to score reasoning paths. Result: Significant improvements on several benchmarks (MM-K12, OlympiadBench, MathVista, etc.). Conclusion: Process supervision is an effective tool for improving the logical robustness of multimodal reasoning systems. Abstract: While Multimodal Large Language Models (MLLMs) have achieved impressive progress in vision-language understanding, they still struggle with complex multi-step reasoning, often producing logically inconsistent or partially correct solutions. A key limitation lies in the lack of fine-grained supervision over intermediate reasoning steps. To address this, we propose MM-PRM, a process reward model trained within a fully automated, scalable framework. We first build MM-Policy, a strong multimodal model trained on diverse mathematical reasoning data. Then, we construct MM-K12, a curated dataset of 10,000 multimodal math problems with verifiable answers, which serves as seed data. Leveraging a Monte Carlo Tree Search (MCTS)-based pipeline, we generate over 700k step-level annotations without human labeling. The resulting PRM is used to score candidate reasoning paths in the Best-of-N inference setup and achieves significant improvements across both in-domain (MM-K12 test set) and out-of-domain (OlympiadBench, MathVista, etc.) benchmarks. Further analysis confirms the effectiveness of soft labels, smaller learning rates, and path diversity in optimizing PRM performance. MM-PRM demonstrates that process supervision is a powerful tool for enhancing the logical robustness of multimodal reasoning systems. We release all our code and data at https://github.com/ModalMinds/MM-PRM.
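
The Best-of-N setup mentioned in the abstract can be sketched as below: a process reward model scores each step of every candidate reasoning path, the step scores are aggregated, and the highest-scoring path is kept. This is a minimal sketch under assumptions, not MM-PRM's pipeline: `step_scorer` stands in for a trained PRM, and aggregating with `min` (a path is only as good as its weakest step) is one common choice that the abstract does not confirm.

```python
from typing import Callable, Iterable, List, Sequence

StepScorer = Callable[[Sequence[str]], float]


def score_path(
    steps: Sequence[str],
    step_scorer: StepScorer,
    aggregate: Callable[[Iterable[float]], float] = min,
) -> float:
    """Score a reasoning path by aggregating per-step scores.

    step_scorer receives the prefix of steps up to and including the
    current one, so it can judge each step in context.
    """
    if not steps:
        raise ValueError("a reasoning path needs at least one step")
    return aggregate(step_scorer(steps[: i + 1]) for i in range(len(steps)))


def best_of_n(
    candidates: Sequence[List[str]],
    step_scorer: StepScorer,
    aggregate: Callable[[Iterable[float]], float] = min,
) -> List[str]:
    """Return the candidate path with the highest aggregated score."""
    return max(candidates, key=lambda steps: score_path(steps, step_scorer, aggregate))
```

Swapping `aggregate` for a product or a mean changes how harshly a single weak step is penalized; which aggregation works best is an empirical question for the actual PRM.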

cs.CR [Back]

[491] IP Leakage Attacks Targeting LLM-Based Multi-Agent Systems

Liwen Wang,Wenxuan Wang,Shuai Wang,Zongjie Li,Zhenlan Ji,Zongyi Lyu,Daoyuan Wu,Shing-Chi Cheung

Main category: cs.CR

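TL;DR: MASLEAK is a black-box attack framework against multi-agent systems (MAS) that extracts sensitive information, such as system topology and task instructions, through the public API, with attack success rates of 87% to 92%.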

Details

Motivation: With the rapid development of large language models (LLMs) and multi-agent systems (MAS), protecting intellectual property (IP) has become an important concern. MASLEAK aims to expose the latent security risks of MAS. Method: MASLEAK interacts with the MAS through its public API using carefully crafted attack queries and progressively extracts system information without knowledge of the internal architecture. Result: On a synthetic dataset and real-world MAS applications, MASLEAK extracts system prompts, task instructions, and architecture information, with success rates of 87% (system prompts and task instructions) and 92% (system architecture). Conclusion: The study reveals the IP leakage risk of MAS and calls for defenses against such attacks. Abstract: The rapid advancement of Large Language Models (LLMs) has led to the emergence of Multi-Agent Systems (MAS) to perform complex tasks through collaboration. However, the intricate nature of MAS, including their architecture and agent interactions, raises significant concerns regarding intellectual property (IP) protection. In this paper, we introduce MASLEAK, a novel attack framework designed to extract sensitive information from MAS applications. MASLEAK targets a practical, black-box setting, where the adversary has no prior knowledge of the MAS architecture or agent configurations. The adversary can only interact with the MAS through its public API, submitting attack query $q$ and observing outputs from the final agent. Inspired by how computer worms propagate and infect vulnerable network hosts, MASLEAK carefully crafts adversarial query $q$ to elicit, propagate, and retain responses from each MAS agent that reveal a full set of proprietary components, including the number of agents, system topology, system prompts, task instructions, and tool usages. We construct the first synthetic dataset of MAS applications with 810 applications and also evaluate MASLEAK against real-world MAS applications, including Coze and CrewAI. MASLEAK achieves high accuracy in extracting MAS IP, with an average attack success rate of 87% for system prompts and task instructions, and 92% for system architecture in most cases. We conclude by discussing the implications of our findings and the potential defenses.

[492] Evaluating the efficacy of LLM Safety Solutions: The Palit Benchmark Dataset

Sayon Palit,Daniel Woods

Main category: cs.CR

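TL;DR: The study compares security tools for large language models (LLMs) and finds that the baseline model (ChatGPT-3.5-Turbo) has a high false-positive rate, while Lakera Guard and ProtectAI LLM Guard perform best overall on performance and usability.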

Details

Motivation: LLMs are increasingly used in critical industries such as healthcare and finance, but malicious queries can cause data leaks or legal liability, and existing security tools lack formal evaluation. Method: Builds a benchmark dataset of malicious prompts, evaluates 7 security tools (out of 13 identified), and compares them with a baseline model. Result: The baseline model has a high false-positive rate; Lakera Guard and ProtectAI LLM Guard perform best. Conclusion: Recommends greater transparency from closed-source tools, improved context-aware detection, stronger open-source engagement, increased user awareness, and more representative performance metrics. Abstract: Large Language Models (LLMs) are increasingly integrated into critical systems in industries like healthcare and finance. Users can often submit queries to LLM-enabled chatbots, some of which can enrich responses with information retrieved from internal databases storing sensitive data. This gives rise to a range of attacks in which a user submits a malicious query and the LLM system outputs a response that creates harm to the owner, such as leaking internal data or creating legal liability by harming a third party. While security tools are being developed to counter these threats, there is little formal evaluation of their effectiveness and usability. This study addresses this gap by conducting a thorough comparative analysis of LLM security tools. We identified 13 solutions (9 closed-source, 4 open-source), but only 7 were evaluated due to a lack of participation by proprietary model owners. To evaluate them, we built a benchmark dataset of malicious prompts and evaluated these tools' performance against a baseline LLM (ChatGPT-3.5-Turbo). Our results show that the baseline model has too many false positives to be used for this task. Lakera Guard and ProtectAI LLM Guard emerged as the best overall tools, showcasing the tradeoff between usability and performance. The study concluded with recommendations for greater transparency among closed-source providers, improved context-aware detections, enhanced open-source engagement, increased user awareness, and the adoption of more representative performance metrics.

eess.IV [Back]

[493] MedVKAN: Efficient Feature Extraction with Mamba and KAN for Medical Image Segmentation

Hancan Zhu,Jinhao Chen,Guanghua He

Main category: eess.IV

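TL;DR: MedVKAN combines Mamba and KAN into an efficient medical image segmentation model that addresses the limitations of CNNs and Transformers and achieves leading performance on several datasets.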

Details

Motivation: To address the limited receptive field of CNNs and the high computational complexity of Transformers, and to explore the potential of Mamba and KAN for medical image segmentation. Method: Proposes MedVKAN, which integrates Mamba and KAN, designs the EFC-KAN module to strengthen local pixel interaction, and replaces Transformer modules with the VKAN module. Result: On five public medical image segmentation datasets, MedVKAN achieves the best performance on four and ranks second on the fifth. Conclusion: MedVKAN demonstrates the potential of Mamba and KAN for medical image segmentation and provides an innovative, efficient feature extraction framework. Abstract: Medical image segmentation relies heavily on convolutional neural networks (CNNs) and Transformer-based models. However, CNNs are constrained by limited receptive fields, while Transformers suffer from scalability challenges due to their quadratic computational complexity. To address these limitations, recent advances have explored alternative architectures. The state-space model Mamba offers near-linear complexity while capturing long-range dependencies, and the Kolmogorov-Arnold Network (KAN) enhances nonlinear expressiveness by replacing fixed activation functions with learnable ones. Building on these strengths, we propose MedVKAN, an efficient feature extraction model integrating Mamba and KAN. Specifically, we introduce the EFC-KAN module, which enhances KAN with convolutional operations to improve local pixel interaction. We further design the VKAN module, integrating Mamba with EFC-KAN as a replacement for Transformer modules, significantly improving feature extraction. Extensive experiments on five public medical image segmentation datasets show that MedVKAN achieves state-of-the-art performance on four datasets and ranks second on the remaining one. These results validate the potential of Mamba and KAN for medical image segmentation while introducing an innovative and computationally efficient feature extraction framework. The code is available at: https://github.com/beginner-cjh/MedVKAN.

[494] Patient-Specific Autoregressive Models for Organ Motion Prediction in Radiotherapy

Yuxiang Lai,Jike Zhong,Vanessa Su,Xiaofeng Yang

Main category: eess.IV

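TL;DR: The paper proposes an autoregressive-model-based method for predicting organ motion accurately before radiotherapy, outperforming existing methods.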

Details

Motivation: Organ motion during radiotherapy affects treatment precision, and existing methods rely on principal component analysis (PCA), which struggles to capture periodic dynamics. Method: Models organ motion prediction as an autoregressive process and uses 4D CT scan data to predict future motion phases. Result: Tested on 50 patients and a public dataset, the method predicts lung and heart motion better than existing benchmarks. Conclusion: The method can improve the precision of pre-treatment planning and enable more accurate radiotherapy. Abstract: Radiotherapy often involves a prolonged treatment period. During this time, patients may experience organ motion due to breathing and other physiological factors. Predicting and modeling this motion before treatment is crucial for ensuring precise radiation delivery. However, existing pre-treatment organ motion prediction methods primarily rely on deformation analysis using principal component analysis (PCA), which is highly dependent on registration quality and struggles to capture periodic temporal dynamics for motion modeling. In this paper, we observe that organ motion prediction closely resembles an autoregressive process, a technique widely used in natural language processing (NLP). Autoregressive models predict the next token based on previous inputs, naturally aligning with our objective of predicting future organ motion phases. Building on this insight, we reformulate organ motion prediction as an autoregressive process to better capture patient-specific motion patterns. Specifically, we acquire 4D CT scans for each patient before treatment, with each sequence comprising multiple 3D CT phases. These phases are fed into the autoregressive model to predict future phases based on prior phase motion patterns. We evaluate our method on a real-world test set of 4D CT scans from 50 patients who underwent radiotherapy at our institution and a public dataset containing 4D CT scans from 20 patients (some with multiple scans), totaling over 1,300 3D CT phases. The performance in predicting the motion of the lung and heart surpasses existing benchmarks, demonstrating its effectiveness in capturing motion dynamics from CT images. These results highlight the potential of our method to improve pre-treatment planning in radiotherapy, enabling more precise and adaptive radiation delivery.

Pengfei Lyu,Pak-Hei Yeung,Xiaosheng Yu,Jing Xia,Jianning Chi,Chengdong Wu,Jagath C. Rajapakse

Main category: eess.IV

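TL;DR: The paper proposes LowBridge, a model-agnostic unsupervised domain adaptation framework that uses low-level features shared across modalities (such as edges) for medical image segmentation; experiments show it outperforms existing methods.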

Details

Motivation: To address domain adaptation in cross-modal medical image segmentation by exploiting the similarity of low-level features to improve segmentation. Method: A generative model is trained to recover source images from edge features, and a segmentation model is then trained; at test time, edge features of target images are used to generate source-style images for segmentation. Result: It outperforms 11 existing methods on multiple datasets, and the framework is agnostic to the type of generative and segmentation model. Conclusion: LowBridge is simple and effective, can be combined with more advanced models, and may improve further in the future. Abstract: This paper addresses the task of cross-modal medical image segmentation by exploring unsupervised domain adaptation (UDA) approaches. We propose a model-agnostic UDA framework, LowBridge, which builds on a simple observation that cross-modal images share some similar low-level features (e.g., edges) as they are depicting the same structures. Specifically, we first train a generative model to recover the source images from their edge features, followed by training a segmentation model on the generated source images, separately. At test time, edge features from the target images are input to the pretrained generative model to generate source-style target domain images, which are then segmented using the pretrained segmentation network. Despite its simplicity, extensive experiments on various publicly available datasets demonstrate that LowBridge achieves state-of-the-art performance, outperforming eleven existing UDA approaches under different settings. Notably, further ablation studies show that LowBridge is agnostic to different types of generative and segmentation models, suggesting its potential to be seamlessly plugged with the most advanced models to achieve even more outstanding results in the future. The code is available at https://github.com/JoshuaLPF/LowBridge.

[496] Joint Manifold Learning and Optimal Transport for Dynamic Imaging

Sven Dummer,Puru Vaish,Christoph Brune

Main category: eess.IV

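TL;DR: The paper proposes combining a low-dimensional image manifold assumption with optimal transport (OT) regularization for dynamic biological imaging, to cope with scarce time series data and time points.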

Details

Motivation: In dynamic biological imaging, time series data and time points are limited, which makes it hard to learn meaningful patterns. Existing methods either ignore the temporal prior or ignore information from other time series. Method: Proposes a latent model representation of the image manifold and enforces its consistency with the time series data and the OT prior. Result: Combining the low-dimensional manifold assumption with OT regularization extracts relevant information from dynamic biological imaging more effectively. Conclusion: Integrating the two regularization strategies improves the efficiency and accuracy of information extraction in dynamic imaging. Abstract: Dynamic imaging is critical for understanding and visualizing dynamic biological processes in medicine and cell biology. These applications often encounter the challenge of a limited amount of time series data and time points, which hinders learning meaningful patterns. Regularization methods provide valuable prior knowledge to address this challenge, enabling the extraction of relevant information despite the scarcity of time-series data and time points. In particular, low-dimensionality assumptions on the image manifold address sample scarcity, while time progression models, such as optimal transport (OT), provide priors on image development to mitigate the lack of time points. Existing approaches using low-dimensionality assumptions disregard a temporal prior but leverage information from multiple time series. OT-prior methods, however, incorporate the temporal prior but regularize only individual time series, ignoring information from other time series of the same image modality. In this work, we investigate the effect of integrating a low-dimensionality assumption of the underlying image manifold with an OT regularizer for time-evolving images. In particular, we propose a latent model representation of the underlying image manifold and promote consistency between this representation, the time series data, and the OT prior on the time-evolving images. We discuss the advantages of enriching OT interpolations with latent models and integrating OT priors into latent models.

[497] Bayesian Deep Learning Approaches for Uncertainty-Aware Retinal OCT Image Segmentation for Multiple Sclerosis

Samuel T. M. Ball

Main category: eess.IV

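TL;DR: The study uses Bayesian convolutional neural networks (BCNNs) to segment retinal layers in OCT images, providing uncertainty estimates and improving segmentation performance and clinical applicability.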

Details

Motivation: Conventional retinal layer segmentation in OCT is time-consuming and prone to human bias, and existing deep learning methods lack uncertainty estimation, so models can make wrong but confident predictions. Method: BCNNs are applied to segment a public dataset of 35 human retinal OCTs and to produce uncertainty maps. Result: The model reaches a Dice score of 95.65%, identifies highly uncertain samples (e.g., with noise or miscalibration), and supports uncertainty estimates for medically relevant measurements such as layer thickness. Conclusion: The Bayesian approach improves the clinical applicability, statistical robustness, and performance of OCT segmentation. Abstract: Optical Coherence Tomography (OCT) provides valuable insights in ophthalmology, cardiology, and neurology due to high-resolution, cross-sectional images of the retina. One critical task for ophthalmologists using OCT is delineation of retinal layers within scans. This process is time-consuming and prone to human bias, affecting the accuracy and reliability of diagnoses. Previous efforts to automate delineation using deep learning face challenges in uptake from clinicians and statisticians due to the absence of uncertainty estimation, leading to "confidently wrong" models via hallucinations. In this study, we address these challenges by applying Bayesian convolutional neural networks (BCNNs) to segment an openly available OCT imaging dataset containing 35 human retina OCTs split between healthy controls and patients with multiple sclerosis. Our findings demonstrate that Bayesian models can be used to provide uncertainty maps of the segmentation, which can further be used to identify highly uncertain samples that exhibit recording artefacts such as noise or miscalibration at inference time. Our method also allows for uncertainty estimation for important secondary measurements such as layer thicknesses, which are medically relevant for patients. We show that these features come in addition to greater performance compared to similar work over all delineations, with an overall Dice score of 95.65%. Our work brings greater clinical applicability, statistical robustness, and performance to retinal OCT segmentation.
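
The uncertainty maps described in this abstract can be illustrated with a small sketch. It assumes, which the source does not state, that the Bayesian model is sampled several times per image and that per-pixel uncertainty is summarised as the entropy of the averaged class probabilities. The function name `predictive_entropy` and the toy numbers are hypothetical.

```python
import math

def predictive_entropy(samples):
    """Per-pixel entropy of the mean class probabilities.

    samples: list of T stochastic forward passes (assumed sampling scheme);
    each pass is a list of pixels, each pixel a list of class probabilities.
    """
    n_passes = len(samples)
    n_pixels = len(samples[0])
    n_classes = len(samples[0][0])
    entropies = []
    for p in range(n_pixels):
        # Average the class probabilities over the stochastic passes.
        mean = [sum(s[p][c] for s in samples) / n_passes for c in range(n_classes)]
        # Shannon entropy (in nats) of the averaged prediction.
        entropies.append(-sum(q * math.log(q) for q in mean if q > 0.0))
    return entropies

# Toy input: pixel 0 is predicted consistently, pixel 1 flips between passes.
passes = [
    [[0.99, 0.01], [0.9, 0.1]],
    [[0.98, 0.02], [0.1, 0.9]],
]
uncertainty = predictive_entropy(passes)
```

In this toy case the second pixel receives the higher entropy, which is the kind of signal that could be thresholded to flag noisy or miscalibrated scans for review.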

[498] NTIRE 2025 Challenge on Efficient Burst HDR and Restoration: Datasets, Methods, and Results

Sangmin Lee,Eunpil Park,Angel Canelo,Hyunhee Park,Youngjo Kim,Hyung-Ju Chun,Xin Jin,Chongyi Li,Chun-Le Guo,Radu Timofte,Qi Wu,Tianheng Qiu,Yuchun Dong,Shenglin Ding,Guanghua Pan,Weiyu Zhou,Tao Hu,Yixu Feng,Duwei Dai,Yu Cao,Peng Wu,Wei Dong,Yanning Zhang,Qingsen Yan,Simon J. Larsen,Ruixuan Jiang,Senyan Xu,Xingbo Wang,Xin Lu,Marcos V. Conde,Javier Abad-Hernandez,Alvaro García-Lara,Daniel Feijoo,Alvaro García,Zeyu Xiao,Zhuoyuan Li

Main category: eess.IV

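TL;DR: The NTIRE 2025 challenge targets efficient multi-frame HDR and restoration on a new RAW dataset; participants had to fuse multiple frames under strict efficiency constraints, six teams submitted valid solutions, and the best PSNR was 43.22 dB.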

Details

Motivation: Advance efficient multi-frame HDR and restoration, and address the fusion of noisy, misaligned RAW frames. Method: A new dataset with nine RAW frames at different exposure levels per scene is used; participants develop fusion solutions under parameter and compute limits. Result: Six teams submitted valid solutions; the best PSNR was 43.22 dB. Conclusion: The challenge demonstrates the potential of efficient HDR and restoration techniques and serves as a reference for researchers. Abstract: This paper reviews the NTIRE 2025 Efficient Burst HDR and Restoration Challenge, which aims to advance efficient multi-frame high dynamic range (HDR) and restoration techniques. The challenge is based on a novel RAW multi-frame fusion dataset, comprising nine noisy and misaligned RAW frames with various exposure levels per scene. Participants were tasked with developing solutions capable of effectively fusing these frames while adhering to strict efficiency constraints: fewer than 30 million model parameters and a computational budget under 4.0 trillion FLOPs. A total of 217 participants registered, with six teams finally submitting valid solutions. The top-performing approach achieved a PSNR of 43.22 dB, showcasing the potential of novel methods in this domain. This paper provides a comprehensive overview of the challenge, compares the proposed solutions, and serves as a valuable reference for researchers and practitioners in efficient burst HDR and restoration.
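
As a rough illustration of the evaluation criteria named in this abstract, the sketch below computes PSNR for intensities in [0, 1] and checks the two stated efficiency limits. The function names, the flat-list image representation, and the assumption of a peak value of 1.0 are illustrative choices; the challenge's actual scoring code is not described in the source.

```python
import math

MAX_PARAMS = 30_000_000  # "fewer than 30 million model parameters"
MAX_FLOPS = 4.0e12       # "under 4.0 trillion FLOPs"

def within_budget(n_params, n_flops):
    """True if a model satisfies both efficiency constraints from the abstract."""
    return n_params < MAX_PARAMS and n_flops < MAX_FLOPS

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat images."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

# Toy three-pixel example with one pixel off by 0.1.
score = psnr([0.0, 0.5, 1.0], [0.0, 0.5, 0.9])
```

Because PSNR is logarithmic in the mean squared error, each additional 10 dB corresponds to a tenfold reduction in MSE.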

[499] HISTAI: An Open-Source, Large-Scale Whole Slide Image Dataset for Computational Pathology

Dmitry Nechaev,Alexey Pchelnikov,Ekaterina Ivanova

Main category: eess.IV

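TL;DR: HISTAI is a large-scale, multimodal, open-access WSI collection intended to address the shortcomings of existing datasets and to advance AI in digital pathology.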

Details

Motivation: Existing public WSI datasets are small, lack diversity, and lack clinical metadata, which limits the robustness and generalizability of AI models. Method: More than 60,000 WSIs from various tissue types are collected, with rich clinical metadata and pathological annotations. Result: HISTAI fills gaps in existing resources and supports innovation and reproducible research. Conclusion: HISTAI is an important resource for developing clinically relevant computational pathology solutions. Abstract: Recent advancements in Digital Pathology (DP), particularly through artificial intelligence and Foundation Models, have underscored the importance of large-scale, diverse, and richly annotated datasets. Despite their critical role, publicly available Whole Slide Image (WSI) datasets often lack sufficient scale, tissue diversity, and comprehensive clinical metadata, limiting the robustness and generalizability of AI models. In response, we introduce the HISTAI dataset, a large, multimodal, open-access WSI collection comprising over 60,000 slides from various tissue types. Each case in the HISTAI dataset is accompanied by extensive clinical metadata, including diagnosis, demographic information, detailed pathological annotations, and standardized diagnostic coding. The dataset aims to fill gaps identified in existing resources, promoting innovation, reproducibility, and the development of clinically relevant computational pathology solutions. The dataset can be accessed at https://github.com/HistAI/HISTAI.

[500] CTLformer: A Hybrid Denoising Model Combining Convolutional Layers and Self-Attention for Enhanced CT Image Reconstruction

Zhiting Zheng,Shuqi Wu,Wen Ding

Main category: eess.IV

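TL;DR: The paper proposes CTLformer, a model that combines convolutional structures with a transformer architecture for low-dose CT (LDCT) image denoising; a multi-scale attention mechanism and a dynamic attention control mechanism improve denoising performance.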

Details

Motivation: LDCT images often carry significant noise, which degrades image quality and diagnostic accuracy. Existing methods struggle with multi-scale feature fusion and with diverse noise distributions. Method: CTLformer combines convolutional structures with a transformer, introduces a multi-scale attention mechanism and a dynamic attention control mechanism, and uses overlapping inference to reduce boundary artifacts. Result: On the 2016 NIH AAPM Mayo Clinic LDCT dataset, CTLformer significantly outperforms existing methods in both denoising performance and model efficiency. Conclusion: CTLformer offers an efficient solution for LDCT denoising and shows broad potential for medical image analysis involving complex noise patterns. Abstract: Low-dose CT (LDCT) images are often accompanied by significant noise, which negatively impacts image quality and subsequent diagnostic accuracy. To address the challenges of multi-scale feature fusion and diverse noise distribution patterns in LDCT denoising, this paper introduces an innovative model, CTLformer, which combines convolutional structures with transformer architecture. Two key innovations are proposed: a multi-scale attention mechanism and a dynamic attention control mechanism. The multi-scale attention mechanism, implemented through the Token2Token mechanism and self-attention interaction modules, effectively captures both fine details and global structures at different scales, enhancing relevant features and suppressing noise. The dynamic attention control mechanism adapts the attention distribution based on the noise characteristics of the input image, focusing on high-noise regions while preserving details in low-noise areas, thereby enhancing robustness and improving denoising performance. Furthermore, CTLformer integrates convolutional layers for efficient feature extraction and uses overlapping inference to mitigate boundary artifacts, further strengthening its denoising capability. Experimental results on the 2016 National Institutes of Health AAPM Mayo Clinic LDCT Challenge dataset demonstrate that CTLformer significantly outperforms existing methods in both denoising performance and model efficiency, greatly improving the quality of LDCT images. The proposed CTLformer not only provides an efficient solution for LDCT denoising but also shows broad potential in medical image analysis, especially for clinical applications dealing with complex noise patterns.
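
The overlapping inference mentioned in this abstract can be sketched in one dimension. The version below assumes that patches are processed independently and that outputs are averaged wherever patches overlap; the abstract does not specify the blending rule, so this is an illustrative choice. `overlapping_inference` and `denoise_patch` are hypothetical names, and the identity function stands in for a trained denoiser.

```python
def overlapping_inference(signal, denoise_patch, patch=4, stride=2):
    """Apply denoise_patch to overlapping windows and average the overlaps."""
    n = len(signal)
    total = [0.0] * n
    count = [0] * n
    starts = list(range(0, max(n - patch, 0) + 1, stride))
    if starts[-1] + patch < n:
        starts.append(n - patch)  # extra window so the tail is covered
    for s in starts:
        out = denoise_patch(signal[s:s + patch])
        for i, v in enumerate(out):
            total[s + i] += v
            count[s + i] += 1
    # Every position is covered at least once, so count is never zero.
    return [t / c for t, c in zip(total, count)]

# With an identity "denoiser" the input should be reproduced.
signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
restored = overlapping_inference(signal, lambda window: list(window))
```

Averaging overlapping predictions smooths the seams that would otherwise appear where independently processed patches meet.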

[501] PRETI: Patient-Aware Retinal Foundation Model via Metadata-Guided Representation Learning

Yeonkyung Lee,Woojung Han,Youngjun Jun,Hyeonmin Kim,Jungkyung Cho,Seong Jae Hwang

Main category: eess.IV

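TL;DR: PRETI is a retinal foundation model that combines metadata-aware learning with self-supervised learning; it uses learnable metadata embeddings and patient-level data pairs, refines representations with a retina-aware adaptive masking strategy, and reaches state-of-the-art results across diverse disease and biomarker predictions.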

Details

Motivation: Reduce dependence on labeled data and use widely available metadata (e.g., age, gender) to analyze disease progression and improve retinal image understanding. Method: Learnable Metadata Embedding (LME) dynamically refines metadata representations; patient-level data pairs improve robustness; Retina-Aware Adaptive Masking (RAAM) selectively masks within the retinal region. Result: On in-house and public datasets, PRETI achieves the best results across diverse disease and biomarker predictions. Conclusion: Metadata-guided foundation models matter for retinal disease analysis, and PRETI demonstrates strong performance. Abstract: Retinal foundation models have significantly advanced retinal image analysis by leveraging self-supervised learning to reduce dependence on labeled data while achieving strong generalization. Many recent approaches enhance retinal image understanding using report supervision, but obtaining clinical reports is often costly and challenging. In contrast, metadata (e.g., age, gender) is widely available and serves as a valuable resource for analyzing disease progression. To effectively incorporate patient-specific information, we propose PRETI, a retinal foundation model that integrates metadata-aware learning with robust self-supervised representation learning. We introduce Learnable Metadata Embedding (LME), which dynamically refines metadata representations. Additionally, we construct patient-level data pairs, associating images from the same individual to improve robustness against non-clinical variations. To further optimize retinal image representation, we propose Retina-Aware Adaptive Masking (RAAM), a strategy that selectively applies masking within the retinal region and dynamically adjusts the masking ratio during training. PRETI captures both global structures and fine-grained pathological details, resulting in superior diagnostic performance. Extensive experiments demonstrate that PRETI achieves state-of-the-art results across diverse diseases and biomarker predictions using in-house and public data, indicating the importance of metadata-guided foundation models in retinal disease analysis. Our code and pretrained model are available at https://github.com/MICV-yonsei/PRETI

[502] Attention-Enhanced U-Net for Accurate Segmentation of COVID-19 Infected Lung Regions in CT Scans

Amal Lahchim,Lazar Davic

Main category: eess.IV

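TL;DR: An automatic segmentation method based on a modified U-Net architecture is proposed for infected regions in COVID-19 CT scans; it is reported to outperform other methods.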

Details

Motivation: Automate the segmentation of infected regions in COVID-19 CT scans and improve segmentation accuracy. Method: A modified U-Net architecture is combined with attention mechanisms, data augmentation, and postprocessing. Result: Dice coefficient of 0.8658 and mean IoU of 0.8316, outperforming other methods. Conclusion: The method performs well; future work will expand the dataset, explore 3D segmentation, and prepare for clinical deployment. Abstract: In this study, we propose a robust methodology for automatic segmentation of infected lung regions in COVID-19 CT scans using convolutional neural networks. The approach is based on a modified U-Net architecture enhanced with attention mechanisms, data augmentation, and postprocessing techniques. It achieved a Dice coefficient of 0.8658 and mean IoU of 0.8316, outperforming other methods. The dataset was sourced from public repositories and augmented for diversity. Results demonstrate superior segmentation performance. Future work includes expanding the dataset, exploring 3D segmentation, and preparing the model for clinical deployment.
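
For readers unfamiliar with the two metrics reported here, the following sketch computes the Dice coefficient and IoU for a pair of binary masks. The flat-list mask representation and the convention of returning 1.0 when both masks are empty are assumptions made for illustration; the paper's own evaluation code is not given in the source.

```python
def dice_and_iou(pred, target):
    """Dice coefficient and intersection-over-union for two binary masks."""
    tp = sum(1 for p, t in zip(pred, target) if p and t)  # overlapping pixels
    pred_sum = sum(1 for p in pred if p)
    target_sum = sum(1 for t in target if t)
    union = pred_sum + target_sum - tp
    # Convention (assumed): two empty masks count as a perfect match.
    dice = 2 * tp / (pred_sum + target_sum) if (pred_sum + target_sum) else 1.0
    iou = tp / union if union else 1.0
    return dice, iou

# Toy masks: two of the three predicted pixels overlap the target.
dice, iou = dice_and_iou([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

For a single binary mask the two metrics are linked by Dice = 2·IoU / (1 + IoU); values averaged over classes or images need not satisfy this relation.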

[503] Mutual Evidential Deep Learning for Medical Image Segmentation

Yuanpeng He,Yali Bi,Lijian Li,Chi-Man Pun,Wenpin Jiao,Zhi Jin

Main category: eess.IV

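TL;DR: A mutual evidential deep learning (MEDL) framework is proposed that improves semi-supervised medical segmentation through better pseudo-label generation and an asymptotic learning strategy.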

Details

Motivation: Existing semi-supervised medical segmentation frameworks lose performance because of low-quality pseudo-labels, and they do not fully explore the reliability of pseudo-labels from different sources. Method: Networks with different architectures generate complementary evidence, combined through an improved class-aware evidential fusion and an asymptotic Fisher information-based learning strategy. Result: Validated on five mainstream datasets, where MEDL reaches state-of-the-art performance. Conclusion: By improving pseudo-label generation and learning asymptotically, MEDL improves the accuracy and robustness of semi-supervised medical segmentation. Abstract: Existing semi-supervised medical segmentation co-learning frameworks have realized that model performance can be diminished by the biases in model recognition caused by low-quality pseudo-labels. Due to the averaging nature of their pseudo-label integration strategy, they fail to explore the reliability of pseudo-labels from different sources. In this paper, we propose a mutual evidential deep learning (MEDL) framework that offers a potentially viable solution for pseudo-label generation in semi-supervised learning from two perspectives. First, we introduce networks with different architectures to generate complementary evidence for unlabeled samples and adopt an improved class-aware evidential fusion to guide the confident synthesis of evidential predictions sourced from diverse architectural networks. Second, utilizing the uncertainty in the fused evidence, we design an asymptotic Fisher information-based evidential learning strategy. This strategy enables the model to initially focus on unlabeled samples with more reliable pseudo-labels, gradually shifting attention to samples with lower-quality pseudo-labels while avoiding over-penalization of mislabeled classes in high data uncertainty samples. Additionally, for labeled data, we continue to adopt an uncertainty-driven asymptotic learning strategy, gradually guiding the model to focus on challenging voxels. Extensive experiments on five mainstream datasets have demonstrated that MEDL achieves state-of-the-art performance.

[504] FreqSelect: Frequency-Aware fMRI-to-Image Reconstruction

Junliang Ye,Lei Wang,Md Zakir Hossain

Main category: eess.IV

TL;DR: FreqSelect is a lightweight, adaptive module that improves fMRI-to-image reconstruction quality by selectively filtering spatial-frequency bands.

Details

Motivation: Existing methods treat all spatial-frequency components equally, which limits reconstruction quality. Method: The FreqSelect module's dynamic frequency selection is integrated into VAE-diffusion pipelines. Result: Reconstruction quality improves significantly on the Natural Scenes dataset, and the learned frequency selection admits a neuroscientific interpretation. Conclusion: FreqSelect improves decoding accuracy and offers a new perspective for neuroscience research. Abstract: Reconstructing natural images from functional magnetic resonance imaging (fMRI) data remains a core challenge in neural decoding due to the mismatch between the richness of visual stimuli and the noisy, low-resolution nature of fMRI signals. While recent two-stage models, combining deep variational autoencoders (VAEs) with diffusion models, have advanced this task, they treat all spatial-frequency components of the input equally. This uniform treatment forces the model to extract meaningful features and suppress irrelevant noise simultaneously, limiting its effectiveness. We introduce FreqSelect, a lightweight, adaptive module that selectively filters spatial-frequency bands before encoding. By dynamically emphasizing frequencies that are most predictive of brain activity and suppressing those that are uninformative, FreqSelect acts as a content-aware gate between image features and neural data. It integrates seamlessly into standard very deep VAE-diffusion pipelines and requires no additional supervision. Evaluated on the Natural Scenes dataset, FreqSelect consistently improves reconstruction quality across both low- and high-level metrics. Beyond performance gains, the learned frequency-selection patterns offer interpretable insights into how different visual frequencies are represented in the brain. Our method generalizes across subjects and scenes, and holds promise for extension to other neuroimaging modalities, offering a principled approach to enhancing both decoding accuracy and neuroscientific interpretability.

[505] The Gaussian Latent Machine: Efficient Prior and Posterior Sampling for Inverse Problems

Muhamed Kuric,Martin Zach,Andreas Habring,Michael Unser,Thomas Pock

Main category: eess.IV

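TL;DR: A general sampling approach based on Gaussian latent variables is proposed; it unifies and generalizes existing sampling algorithms and is particularly suited to prior and posterior distributions in Bayesian imaging.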

Details

Motivation: Sample efficiently from the prior and posterior distributions commonly found in Bayesian imaging. Method: The model is lifted into a Gaussian latent machine, which yields a two-block Gibbs sampling approach. Result: Experiments show the approach is efficient and effective across a range of Bayesian imaging problems. Conclusion: The Gaussian latent machine provides a general and efficient sampling framework for a wide range of Bayesian imaging problems. Abstract: We consider the problem of sampling from a product-of-experts-type model that encompasses many standard prior and posterior distributions commonly found in Bayesian imaging. We show that this model can be easily lifted into a novel latent variable model, which we refer to as a Gaussian latent machine. This leads to a general sampling approach that unifies and generalizes many existing sampling algorithms in the literature. Most notably, it yields a highly efficient and effective two-block Gibbs sampling approach in the general case, while also specializing to direct sampling algorithms in particular cases. Finally, we present detailed numerical experiments that demonstrate the efficiency and effectiveness of our proposed sampling approach across a wide range of prior and posterior sampling problems from Bayesian imaging.

[506] RetinaLogos: Fine-Grained Synthesis of High-Resolution Retinal Images Through Captions

Junzhi Ning,Cheng Tang,Kaijin Zhou,Diping Song,Lihao Liu,Ming Hu,Wei Li,Yanzhou Su,Tianbing Li,Jiyao Liu,Yejin,Sheng Zhang,Yuanfeng Ji,Junjun He

Main category: eess.IV

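TL;DR: The paper proposes a pipeline that synthesises a large-scale captioned retinal image dataset, RetinaLogos-1400k, to address the scarcity of high-quality labelled data, and improves machine learning performance in ophthalmology.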

Details

Motivation: High-quality labelled retinal imaging data is scarce, which holds back machine learning in ophthalmology, and existing methods cannot generate diverse, fine-grained images. Method: The RetinaLogos-1400k dataset is introduced, with descriptions generated by large language models, together with a three-step training framework that enables fine-grained semantic control. Result: 62.07% of synthetic images are hard to distinguish from real ones, and accuracy improves by 10%-25% in diabetic retinopathy grading and glaucoma detection. Conclusion: The method offers a scalable way to augment ophthalmic datasets and significantly improves model performance. Abstract: The scarcity of high-quality, labelled retinal imaging data, which presents a significant challenge in the development of machine learning models for ophthalmology, hinders progress in the field. To synthesise Colour Fundus Photographs (CFPs), existing methods primarily relying on predefined disease labels face significant limitations. However, current methods remain limited, thus failing to generate images for broader categories with diverse and fine-grained anatomical structures. To overcome these challenges, we first introduce an innovative pipeline that creates a large-scale, synthetic Caption-CFP dataset comprising 1.4 million entries, called RetinaLogos-1400k. Specifically, RetinaLogos-1400k uses large language models (LLMs) to describe retinal conditions and key structures, such as optic disc configuration, vascular distribution, nerve fibre layers, and pathological features. Furthermore, based on this dataset, we employ a novel three-step training framework, called RetinaLogos, which enables fine-grained semantic control over retinal images and accurately captures different stages of disease progression, subtle anatomical variations, and specific lesion types. Extensive experiments demonstrate state-of-the-art performance across multiple datasets, with 62.07% of text-driven synthetic images indistinguishable from real ones by ophthalmologists. Moreover, the synthetic data improves accuracy by 10%-25% in diabetic retinopathy grading and glaucoma detection, thereby providing a scalable solution to augment ophthalmic datasets.

[507] Segmentation of temporomandibular joint structures on mri images using neural networks for diagnosis of pathologies

Maksim I. Ivanov,Olga E. Mendybaeva,Yuri E. Karyakin,Igor N. Glukhikh,Aleksey V. Lebedev

Main category: eess.IV

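TL;DR: The article explores the use of artificial intelligence for diagnosing temporomandibular joint (TMJ) pathologies, in particular segmenting the articular disc on MRI images. Among the models compared, the Roboflow model is reported as the most promising for the segmentation task.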

Details

Motivation: TMJ pathologies are highly prevalent, and medical diagnosis needs to be both accurate and fast. Existing solutions (e.g., Diagnocat, MandSeg) focus on bone structures and therefore do not meet the needs of articular disc studies. Method: A dataset of 94 images with the classes "temporomandibular joint" and "jaw" was collected and expanded with data augmentation. U-Net, YOLOv8n, YOLOv11n, and Roboflow neural network models were then trained and compared using Dice Score, Precision, Sensitivity, Specificity, and Mean Average Precision. Result: The experiments indicate that the Roboflow model has potential for temporomandibular joint segmentation. Conclusion: Future work will develop algorithms for measuring the distance between the jaws and determining the position of the articular disc, to further improve the diagnosis of TMJ pathologies. Abstract: This article explores the use of artificial intelligence for the diagnosis of pathologies of the temporomandibular joint (TMJ), in particular, for the segmentation of the articular disc on MRI images. The relevance of the work is due to the high prevalence of TMJ pathologies, as well as the need to improve the accuracy and speed of diagnosis in medical institutions. During the study, the existing solutions (Diagnocat, MandSeg) were analyzed and found unsuitable for studying the articular disc because they are oriented towards bone structures. To solve the problem, an original dataset was collected from 94 images with the classes "temporomandibular joint" and "jaw". To increase the amount of data, augmentation methods were used. After that, the models of U-Net, YOLOv8n, YOLOv11n and Roboflow neural networks were trained and compared. The evaluation was carried out according to the Dice Score, Precision, Sensitivity, Specificity, and Mean Average Precision metrics. The results confirm the potential of using the Roboflow model for segmentation of the temporomandibular joint. In the future, it is planned to develop an algorithm for measuring the distance between the jaws and determining the position of the articular disc, which will improve the diagnosis of TMJ pathologies.

[508] Enhancing Diffusion-Weighted Images (DWI) for Diffusion MRI: Is it Enough without Non-Diffusion-Weighted B=0 Reference?

Yinzhe Wu,Jiahao Huang,Fanwen Wang,Mengze Gao,Congyu Liao,Guang Yang,Kawin Setsompop

Main category: eess.IV

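TL;DR: The paper proposes a new ratio loss that improves the accuracy of the DWI-to-b=0 ratio in diffusion MRI (dMRI) super-resolution, and thereby the quality of the derived diffusion metrics.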

Details

Motivation: Conventional methods optimize only the diffusion-weighted images (DWIs) and ignore their relationship to the non-diffusion-weighted (b=0) reference image, which leads to large ratio errors when computing diffusion metrics such as ADC, FA, and MD. Method: A ratio loss is proposed, defined as the mean squared error (MSE) between the logarithms of the predicted and ground-truth DWI/b=0 ratios. Result: The ratio loss significantly reduces the ratio error and slightly improves the PSNR of the generated DWIs, improving dMRI super-resolution and the accuracy of diffusion metrics. Conclusion: The ratio loss better preserves ratio-based features in dMRI super-resolution, giving more reliable diffusion metrics for clinical diagnosis. Abstract: Diffusion MRI (dMRI) is essential for studying brain microstructure, but high-resolution imaging remains challenging due to the inherent trade-offs between acquisition time and signal-to-noise ratio (SNR). Conventional methods often optimize only the diffusion-weighted images (DWIs) without considering their relationship with the non-diffusion-weighted (b=0) reference images. However, calculating diffusion metrics, such as the apparent diffusion coefficient (ADC) and diffusion tensor with its derived metrics like fractional anisotropy (FA) and mean diffusivity (MD), relies on the ratio between each DWI and the b=0 image, which is crucial for clinical observation and diagnostics. In this study, we demonstrate that solely enhancing DWIs using a conventional pixel-wise mean squared error (MSE) loss is insufficient, as the error in ratio between generated DWIs and b=0 diverges. We propose a novel ratio loss, defined as the MSE loss between the predicted and ground-truth log of DWI/b=0 ratios. Our results show that incorporating the ratio loss significantly improves the convergence of this ratio error, achieving lower ratio MSE and slightly enhancing the peak signal-to-noise ratio (PSNR) of generated DWIs. This leads to improved dMRI super-resolution and better preservation of b=0 ratio-based features for the derivation of diffusion metrics.
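
The ratio loss defined in this abstract can be written down directly. The sketch below is a minimal version under stated assumptions: images are flat lists of positive intensities, both the predicted and ground-truth b=0 values are available, and a small `eps` (an illustrative addition, not from the source) guards against division by zero. The function name `ratio_loss` is hypothetical.

```python
import math

def ratio_loss(pred_dwi, pred_b0, true_dwi, true_b0, eps=1e-6):
    """MSE between predicted and ground-truth log(DWI / b0), per voxel."""
    terms = []
    for pd, pb, td, tb in zip(pred_dwi, pred_b0, true_dwi, true_b0):
        pred_log = math.log((pd + eps) / (pb + eps))
        true_log = math.log((td + eps) / (tb + eps))
        terms.append((pred_log - true_log) ** 2)
    return sum(terms) / len(terms)

# Same ratio (0.5) at different absolute intensities: loss is close to zero.
same_ratio = ratio_loss([2.0], [4.0], [1.0], [2.0])
# Ratio of 4 predicted where the true ratio is 1: loss is about (ln 4)^2.
wrong_ratio = ratio_loss([4.0], [1.0], [1.0], [1.0])
```

Under the standard mono-exponential model, ADC = -ln(S_b / S_0) / b, so an error in the log-ratio carries over linearly into the ADC estimate, which is why penalising the log-ratio targets the downstream metric more directly than a pixel-wise MSE on the DWI alone.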

[509] A generalisable head MRI defacing pipeline: Evaluation on 2,566 meningioma scans

Lorena Garcia-Foncillas Macias,Aaron Kujawa,Aya Elshalakany,Jonathan Shapey,Tom Vercauteren

Main category: eess.IV

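TL;DR: A reliable MRI defacing method is proposed that combines atlas-based registration with brain masking; it has a high success rate and preserves anatomy well.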

Details

Motivation: Protecting patient privacy while preserving brain anatomy is key to research collaboration, and existing methods often deface incompletely or degrade brain tissue regions. Method: Atlas-based registration is integrated with brain masking to build a generalisable defacing pipeline for high-resolution MRI. Result: On 2,566 clinical scans, the success rate on visual inspection was 99.92%, with good anatomical preservation (Dice similarity coefficient 0.9975 ± 0.0023). Conclusion: The method is efficient and reliable, is suitable for clinical research, and its source code is public. Abstract: Reliable MRI defacing techniques to safeguard patient privacy while preserving brain anatomy are critical for research collaboration. Existing methods often struggle with incomplete defacing or degradation of brain tissue regions. We present a robust, generalisable defacing pipeline for high-resolution MRI that integrates atlas-based registration with brain masking. Our method was evaluated on 2,566 heterogeneous clinical scans for meningioma and achieved a 99.92% success rate (2,564/2,566) upon visual inspection. Excellent anatomical preservation is demonstrated with a Dice similarity coefficient of 0.9975 ± 0.0023 between brain masks automatically extracted from the original and defaced volumes. Source code is available at https://github.com/cai4cai/defacing_pipeline.

[510] Higher fidelity perceptual image and video compression with a latent conditioned residual denoising diffusion model

Jonas Brenig,Radu Timofte

Main category: eess.IV

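TL;DR: A hybrid compression scheme is proposed that combines a decoder network with a diffusion model, improving fidelity while maintaining perceptual quality.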

Details

Motivation: Diffusion models excel at image generation but give lower fidelity in compression, which calls for improvement. Method: A decoder network generates an initial image, and a diffusion model then predicts the residual to refine perceptual quality. Result: On standard benchmarks, PSNR improves by up to 2 dB while perceptual scores are maintained. Conclusion: The method balances fidelity and perceptual quality and can be extended to video compression. Abstract: Denoising diffusion models achieved impressive results on several image generation tasks, often outperforming GAN-based models. Recently, the generative capabilities of diffusion models have been employed for perceptual image compression, such as in CDC. A major drawback of these diffusion-based methods is that, while producing images of impressive perceptual quality, they lose fidelity (that is, they increase distortion relative to the original uncompressed images) when compared with other traditional or learned image compression schemes aiming for fidelity. In this paper, we propose a hybrid compression scheme optimized for perceptual quality, extending the approach of the CDC model with a decoder network in order to reduce the impact on distortion metrics such as PSNR. After using the decoder network to generate an initial image, optimized for distortion, the latent conditioned diffusion model refines the reconstruction for perceptual quality by predicting the residual. On standard benchmarks, we achieve up to +2 dB PSNR fidelity improvements while maintaining comparable LPIPS and FID perceptual scores when compared with CDC. Additionally, the approach is easily extensible to video compression, where we achieve similar results.

[511] GuidedMorph: Two-Stage Deformable Registration for Breast MRI

Yaqian Chen,Hanxue Gu,Haoyu Dong,Qihang Li,Yuwen Chen,Nicholas Konz,Lin Li,Maciej A. Mazurowski

Main category: eess.IV

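TL;DR: GuidedMorph, a two-stage registration framework, is proposed to register dense tissue in breast MR images more accurately, and it significantly improves registration accuracy.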

Details

Motivation: Registering breast MR images across time points is essential for breast cancer detection and treatment planning, but conventional methods struggle with the complexity and non-rigid deformation of dense tissue. Method: A two-stage framework combines a single-scale network with dense-tissue tracking, introduces the Dual Spatial Transformer Network (DSTN) to fuse transformation fields, and proposes a warping method based on the Euclidean distance transform (EDT). Result: Validated on ISPY2 and an internal dataset, with gains of 13.01% in dense tissue Dice, 3.13% in breast Dice, and 1.21% in breast SSIM. Conclusion: GuidedMorph performs well in breast image registration, supports multiple paradigms, and is broadly applicable. Abstract: Accurately registering breast MR images from different time points enables the alignment of anatomical structures and tracking of tumor progression, supporting more effective breast cancer detection, diagnosis, and treatment planning. However, the complexity of dense tissue and its highly non-rigid nature pose challenges for conventional registration methods, which primarily focus on aligning general structures while overlooking intricate internal details. To address this, we propose GuidedMorph, a novel two-stage registration framework designed to better align dense tissue. In addition to a single-scale network for global structure alignment, we introduce a framework that utilizes dense tissue information to track breast movement. The learned transformation fields are fused by introducing the Dual Spatial Transformer Network (DSTN), improving overall alignment accuracy. A novel warping method based on the Euclidean distance transform (EDT) is also proposed to accurately warp the registered dense tissue and breast masks, preserving fine structural details during deformation. The framework supports both paradigms that require external segmentation models and those that use image data only. It also operates effectively with the VoxelMorph and TransMorph backbones, offering a versatile solution for breast registration. We validate our method on ISPY2 and an internal dataset, demonstrating superior performance in dense tissue, overall breast alignment, and breast structural similarity index measure (SSIM), with notable improvements of over 13.01% in dense tissue Dice, 3.13% in breast Dice, and 1.21% in breast SSIM compared to the best learning-based baseline.

cs.SE [Back]

[512] Introduction to Analytical Software Engineering Design Paradigm

Tarik Houichime,Younes El Amrani

Main category: cs.SE

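TL;DR: The paper proposes Analytical Software Engineering (ASE), a new design paradigm that addresses complex software engineering problems through the BSS and ODR frameworks, balancing abstraction, tool accessibility, compatibility, and scalability.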

Details

Motivation: Traditional approaches fall short when facing the scale and complexity of modern software systems, especially in design pattern detection and code refactoring, so a new paradigm is needed. Method: The ASE paradigm is proposed and evaluated through two frameworks: BSS (a language-agnostic code representation) and ODR (code refactoring optimization based on heuristic algorithms). Result: ASE can effectively model and solve complex software engineering problems; BSS supports precise design pattern detection, and ODR optimizes code refactoring while reducing computational overhead. Conclusion: ASE lays the groundwork for future research on encoding and analyzing complex software metrics. Abstract: As modern software systems expand in scale and complexity, the challenges associated with their modeling and formulation grow increasingly intricate. Traditional approaches often fall short in effectively addressing these complexities, particularly in tasks such as design pattern detection for maintenance and assessment, as well as code refactoring for optimization and long-term sustainability. This growing inadequacy underscores the need for a paradigm shift in how such challenges are approached and resolved. This paper presents Analytical Software Engineering (ASE), a novel design paradigm aimed at balancing abstraction, tool accessibility, compatibility, and scalability. ASE enables effective modeling and resolution of complex software engineering problems. The paradigm is evaluated through two frameworks, Behavioral-Structural Sequences (BSS) and Optimized Design Refactoring (ODR), both developed in accordance with ASE principles. BSS offers a compact, language-agnostic representation of codebases to facilitate precise design pattern detection. ODR unifies artifact and solution representations to optimize code refactoring via heuristic algorithms while eliminating iterative computational overhead. By providing a structured approach to software design challenges, ASE lays the groundwork for future research in encoding and analyzing complex software metrics.

[513] EVALOOP: Assessing LLM Robustness in Programming from a Self-consistency Perspective

Sen Fang,Weiyuan Ding,Bowen Xu

Main category: cs.SE

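TL;DR: EVALOOP is a new evaluation framework that assesses the robustness of large language models (LLMs) on programming tasks through a self-consistency loop, without relying on external attack setups.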

Details

Motivation: Current evaluations mainly measure code generation accuracy on static benchmarks and neglect model robustness in programming tasks. Existing adversarial attack methods have limited effectiveness and give inconsistent results. Method: EVALOOP evaluates robustness through a self-feedback loop (e.g., alternating code generation and code summarization) and provides a unified metric. Result: Across 16 prominent LLMs, EVALOOP induces a 5.01%-19.31% absolute drop in pass@1 within ten loops. Initial performance does not necessarily correlate with robustness. Conclusion: EVALOOP provides an effective, unified framework for evaluating LLM robustness and reveals differences in model stability over repeated loops. Abstract: Assessing the programming capabilities of Large Language Models (LLMs) is crucial for their effective use in software engineering. Current evaluations, however, predominantly measure the accuracy of generated code on static benchmarks, neglecting the critical aspect of model robustness during programming tasks. While adversarial attacks offer insights on model robustness, their effectiveness is limited and evaluation could be constrained. Current adversarial attack methods for robustness evaluation yield inconsistent results, struggling to provide a unified evaluation across different LLMs. We introduce EVALOOP, a novel assessment framework that evaluates robustness from a self-consistency perspective, i.e., leveraging the natural duality inherent in popular software engineering tasks, e.g., code generation and code summarization. EVALOOP initiates a self-contained feedback loop: an LLM generates output (e.g., code) from an input (e.g., natural language specification), and then uses the generated output as the input to produce a new output (e.g., summarizes that code into a new specification). EVALOOP repeats the process to assess the effectiveness of EVALOOP in each loop. This cyclical strategy intrinsically evaluates robustness without relying on any external attack setups, providing a unified metric to evaluate LLMs' robustness in programming. We evaluate 16 prominent LLMs (e.g., GPT-4.1, O4-mini) on EVALOOP and find that EVALOOP typically induces a 5.01%-19.31% absolute drop in pass@1 performance within ten loops. Intriguingly, robustness does not always align with initial performance (i.e., one-time query); for instance, GPT-3.5-Turbo, despite superior initial code generation compared to DeepSeek-V2, demonstrated lower robustness over repeated evaluation loops.
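
The feedback loop described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `self_consistency_loop`, `generate_code`, `summarize_code`, and `passes_tests` are hypothetical names, and the toy stand-ins below replace real LLM calls and a real test harness so the loop can run on its own.

```python
def self_consistency_loop(spec, generate_code, summarize_code, passes_tests, max_loops=10):
    """Alternate generation and summarization, recording pass/fail per loop."""
    results = []
    for _ in range(max_loops):
        code = generate_code(spec)          # specification -> code
        results.append(passes_tests(code))  # functional check of this loop's code
        spec = summarize_code(code)         # code -> new specification
    return results

# Toy stand-ins (assumptions): each summarization loses the last four
# characters of the specification, mimicking gradual information loss.
def toy_generate(spec):
    return "code:" + spec

def toy_summarize(code):
    return code[len("code:"):][:-4]

def toy_passes(code):
    return "add" in code

outcomes = self_consistency_loop(
    "add two numbers", toy_generate, toy_summarize, toy_passes, max_loops=6
)
```

Averaging the per-loop outcomes over a benchmark would give a pass@1 value for each loop, and the drop between the first and last loop is the kind of robustness signal the abstract reports.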

[514] AutoGEEval: A Multimodal and Automated Framework for Geospatial Code Generation on GEE with Large Language Models

Shuyang Hou,Zhangxiao Shen,Huayi Wu,Jianyuan Liang,Haoyue Jiao,Yaxian Qing,Xiaopu Zhang,Xu Li,Zhipeng Gui,Xuefeng Guan,Longgang Xiang

Main category: cs.SE

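TL;DR: AutoGEEval is the first multimodal, unit-level automated evaluation framework for geospatial code generation on Google Earth Engine (GEE), supporting multidimensional quantitative analysis.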

Details

Motivation: Geospatial code generation lacks standardized tools for automatic evaluation, and AutoGEEval aims to fill this gap. Method: Built on the GEE Python API, a benchmark suite of 1325 test cases (AutoGEEval-Bench) is constructed, with integrated question generation and answer verification components for end-to-end evaluation. Result: Eighteen state-of-the-art large language models (LLMs) are evaluated, revealing their performance characteristics and optimization pathways in GEE code generation. Conclusion: AutoGEEval provides a unified protocol and foundational resource for developing and assessing geospatial code generation models, advancing automated translation from natural language to domain-specific code. Abstract: Geospatial code generation is emerging as a key direction in the integration of artificial intelligence and geoscientific analysis. However, there remains a lack of standardized tools for automatic evaluation in this domain. To address this gap, we propose AutoGEEval, the first multimodal, unit-level automated evaluation framework for geospatial code generation tasks on the Google Earth Engine (GEE) platform powered by large language models (LLMs). Built upon the GEE Python API, AutoGEEval establishes a benchmark suite (AutoGEEval-Bench) comprising 1325 test cases that span 26 GEE data types. The framework integrates both question generation and answer verification components to enable an end-to-end automated evaluation pipeline, from function invocation to execution validation. AutoGEEval supports multidimensional quantitative analysis of model outputs in terms of accuracy, resource consumption, execution efficiency, and error types. We evaluate 18 state-of-the-art LLMs (including general-purpose, reasoning-augmented, code-centric, and geoscience-specialized models), revealing their performance characteristics and potential optimization pathways in GEE code generation. This work provides a unified protocol and foundational resource for the development and assessment of geospatial code generation models, advancing the frontier of automated natural language to domain-specific code translation.

Table of Contents

cs.CV

[1] Improving Open-Set Semantic Segmentation in 3D Point Clouds by Conditional Channel Capacity Maximization: Preliminary Results

[2] Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis

[3] Improved Bag-of-Words Image Retrieval with Geometric Constraints for Ground Texture Localization

[4] BandRC: Band Shifted Raised Cosine Activated Implicit Neural Representations

[5] DPSeg: Dual-Prompt Cost Volume Learning for Open-Vocabulary Semantic Segmentation

[6] LoFT: LoRA-fused Training Dataset Generation with Few-shot Guidance

[7] Attend to Not Attended: Structure-then-Detail Token Merging for Post-training DiT Acceleration

[8] EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video

[9] UGoDIT: Unsupervised Group Deep Image Prior Via Transferable Weights

[10] Semantically-Aware Game Image Quality Assessment

[11] X-Edit: Detecting and Localizing Edits in Images Altered by Text-Guided Diffusion Models

[12] Generalizable Vision-Language Few-Shot Adaptation with Predictive Prompts and Negative Learning

[13] Technical Report for ICRA 2025 GOOSE 2D Semantic Segmentation Challenge: Boosting Off-Road Segmentation via Photometric Distortion and Exponential Moving Average

[14] Self-NPO: Negative Preference Optimization of Diffusion Models by Simply Learning from Itself without Explicit Preference Annotations

[15] CL-CaGAN: Capsule differential adversarial continuous learning for cross-domain hyperspectral anomaly detection

[16] CL-BioGAN: Biologically-Inspired Cross-Domain Continual Learning for Hyperspectral Anomaly Detection

[17] Self-Learning Hyperspectral and Multispectral Image Fusion via Adaptive Residual Guided Subspace Diffusion Model

[18] Are vision language models robust to uncertain inputs?

[19] Image-based Visibility Analysis Replacing Line-of-Sight Simulation: An Urban Landmark Perspective

[20] SGD-Mix: Enhancing Domain-Specific Image Classification with Label-Preserving Data Augmentation

[21] UniMoCo: Unified Modality Completion for Robust Multi-Modal Embeddings

[22] Continuous Subspace Optimization for Continual Learning

[23] Robust Cross-View Geo-Localization via Content-Viewpoint Disentanglement

[24] Bootstrapping Diffusion: Diffusion Model Training Leveraging Partial and Corrupted Data

[25] CoT-Vid: Dynamic Chain-of-Thought Routing with Self Verification for Training-Free Video Reasoning

[26] RVTBench: A Benchmark for Visual Reasoning Tasks

[27] Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs

[28] ElderFallGuard: Real-Time IoT and Computer Vision-Based Fall Detection System for Elderly Safety

[29] MedSG-Bench: A Benchmark for Medical Image Sequences Grounding

[30] MonoMobility: Zero-Shot 3D Mobility Analysis from Monocular Videos

[31] PRS-Med: Position Reasoning Segmentation with Vision-Language Model in Medical Imaging

[32] Revisiting Residual Connections: Orthogonal Updates for Stable and Efficient Deep Networks

[33] GenZSL: Generative Zero-Shot Learning Via Inductive Variational Autoencoder

[34] Facial Recognition Leveraging Generative Adversarial Networks

[35] Adversarial Robustness for Unified Multi-Modal Encoders via Efficient Calibration

[36] FiGKD: Fine-Grained Knowledge Distillation via High-Frequency Detail Transfer

[37] GTR: Gaussian Splatting Tracking and Reconstruction of Unknown Objects Based on Appearance and Geometric Complexity

[38] Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning?

[39] DC-Seg: Disentangled Contrastive Learning for Brain Tumor Segmentation with Missing Modalities

[40] SafeVid: Toward Safety Aligned Video Large Multimodal Models

[41] iSegMan: Interactive Segment-and-Manipulate 3D Gaussians

[42] Top-Down Compression: Revisit Efficient Vision Token Projection for Visual Instruction Tuning

[43] Advanced Integration of Discrete Line Segments in Digitized P&ID for Continuous Instrument Connectivity

[44] AoP-SAM: Automation of Prompts for Efficient Segmentation

[45] Online Iterative Self-Alignment for Radiology Report Generation

[46] SpatialCrafter: Unleashing the Imagination of Video Diffusion Models for Scene Reconstruction from Limited Observations

[47] Multimodal Cancer Survival Analysis via Hypergraph Learning with Cross-Modality Rebalance

[48] IQBench: How "Smart" Are Vision-Language Models? A Study with Human IQ Tests

[49] CHRIS: Clothed Human Reconstruction with Side View Consistency

[50] Multi-modal Collaborative Optimization and Expansion Network for Event-assisted Single-eye Expression Recognition

[51] Black-box Adversaries from Latent Space: Unnoticeable Attacks on Human Pose and Shape Estimation

[52] Cross-Model Transfer of Task Vectors via Few-Shot Orthogonal Alignment

[53] FIGhost: Fluorescent Ink-based Stealthy and Flexible Backdoor Attacks on Physical Traffic Sign Recognition

[54] Accelerating Diffusion-based Super-Resolution with Dynamic Time-Spatial Sampling

[55] VFRTok: Variable Frame Rates Video Tokenizer with Duration-Proportional Information Assumption

[56] Beluga Whale Detection from Satellite Imagery with Point Labels

[57] MT-CYP-Net: Multi-Task Network for Pixel-Level Crop Yield Prediction Under Very Few Samples

[58] Denoising Mutual Knowledge Distillation in Bi-Directional Multiple Instance Learning

[59] VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning

[60] LOVE: Benchmarking and Evaluating Text-to-Video Generation and Video-to-Text Interpretation

[61] TinyRS-R1: Compact Multimodal Language Model for Remote Sensing

[62] EarthSynth: Generating Informative Earth Observation with Diffusion Models

[63] Keypoints as Dynamic Centroids for Unified Human Pose and Segmentation

[64] Learning to Highlight Audio by Watching Movies

[65] SoftPQ: Robust Instance Segmentation Evaluation via Soft Matching and Tunable Thresholds

[66] Ditch the Denoiser: Emergence of Noise Robustness in Self-Supervised Learning from Data Curriculum

[67] Always Clear Depth: Robust Monocular Depth Estimation under Adverse Weather

[68] CompBench: Benchmarking Complex Instruction-guided Image Editing

[69] Road Segmentation for ADAS/AD Applications

[70] Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind

[71] Hyperspectral Image Land Cover Captioning Dataset for Vision Language Models

[72] From Low Field to High Value: Robust Cortical Mapping from Low-Field MRI

[73] NOFT: Test-Time Noise Finetune via Information Bottleneck for Highly Correlated Asset Creation

[74] From Shots to Stories: LLM-Assisted Video Editing with Unified Language Representations

[75] SEPT: Standard-Definition Map Enhanced Scene Perception and Topology Reasoning for Autonomous Driving

[76] SMFusion: Semantic-Preserving Fusion of Multimodal Medical Images for Enhanced Clinical Diagnosis

[77] LLaVA-4D: Embedding SpatioTemporal Prompt into LMMs for 4D Scene Understanding

[78] MMS-VPR: Multimodal Street-Level Visual Place Recognition Dataset and Benchmark