cs.CV [Total: 362]
cs.CL [Total: 194]
cs.CR [Total: 6]
cs.AI [Total: 32]
cs.GR [Total: 8]
cs.HC [Total: 6]
eess.AS [Total: 4]
cs.CY [Total: 10]
cs.SD [Total: 6]
eess.SY [Total: 2]
cs.RO [Total: 6]
cond-mat.mtrl-sci [Total: 2]
physics.flu-dyn [Total: 2]
cs.LG [Total: 30]
cs.IR [Total: 18]
q-bio.NC [Total: 2]
eess.IV [Total: 12]

cs.CV [Back]

[1] HAL-NeRF: High Accuracy Localization Leveraging Neural Radiance Fields

Asterios Reppas,Grigorios-Aris Cheimariotis,Panos K. Papadopoulos,Panagiotis Frasiolas,Dimitrios Zarpalas

Main category: cs.CV

TLDR: HAL-NeRF结合CNN姿态回归器和基于蒙特卡洛粒子滤波的细化模块，显著提升了相机定位精度，在7-Scenes和Cambridge Landmarks数据集上达到0.025m和0.59度的误差。

Details

Motivation: 在XR和机器人应用中，仅依赖相机输入的定位方法成本低但精度不足，需提升单目相机定位的准确性。 Method: 结合CNN姿态回归器和基于NeRF的蒙特卡洛粒子滤波细化模块，利用Nerfacto模型增强训练数据和测量光度损失。 Result: 在7-Scenes和Cambridge Landmarks数据集上分别达到0.025m/0.59度和0.04m/0.58度的误差，计算时间增加。 Conclusion: 结合APR与NeRF细化技术可显著提升单目相机定位精度，展示了其潜力。 Abstract: Precise camera localization is a critical task in XR applications and robotics. Using only the camera captures as input to a system is an inexpensive option that enables localization in large indoor and outdoor environments, but it presents challenges in achieving high accuracy. Specifically, camera relocalization methods, such as Absolute Pose Regression (APR), can localize cameras with a median translation error of more than $0.5m$ in outdoor scenes. This paper presents HAL-NeRF, a high-accuracy localization method that combines a CNN pose regressor with a refinement module based on a Monte Carlo particle filter. The Nerfacto model, an implementation of Neural Radiance Fields (NeRFs), is used to augment the data for training the pose regressor and to measure photometric loss in the particle filter refinement module. HAL-NeRF leverages Nerfacto's ability to synthesize high-quality novel views, significantly improving the performance of the localization pipeline. HAL-NeRF achieves state-of-the-art results that are conventionally measured as the average of the median per scene errors. The translation error was $0.025m$ and the rotation error was $0.59$ degrees and 0.04m and 0.58 degrees on the 7-Scenes dataset and Cambridge Landmarks datasets respectively, with the trade-off of increased computational time. This work highlights the potential of combining APR with NeRF-based refinement techniques to advance monocular camera relocalization accuracy.

[2] LookingGlass: Generative Anamorphoses via Laplacian Pyramid Warping

Pascal Chang,Sergio Sancho,Jingwei Tang,Markus Gross,Vinicius C. Azevedo

Main category: cs.CV

TLDR: 本文提出了一种生成变形图像的方法，利用潜在整流流模型和拉普拉斯金字塔变形技术，使图像在直接观看时仍保持可识别性。

Details

Motivation: 变形图像通常只能在特定视角下识别，限制了其应用。本文旨在通过生成方法扩展变形图像的可用性，使其在直接观看时仍有意义。 Method: 采用潜在整流流模型和拉普拉斯金字塔变形技术，生成高质量变形图像，并扩展了视觉字谜（Visual Anagrams）的方法。 Result: 成功生成了在直接观看时仍可识别的变形图像，扩展了生成感知幻觉的可能性。 Conclusion: 本文方法为变形图像的生成提供了新思路，扩展了其应用范围，并为生成感知幻觉提供了更多可能性。 Abstract: Anamorphosis refers to a category of images that are intentionally distorted, making them unrecognizable when viewed directly. Their true form only reveals itself when seen from a specific viewpoint, which can be through some catadioptric device like a mirror or a lens. While the construction of these mathematical devices can be traced back to as early as the 17th century, they are only interpretable when viewed from a specific vantage point and tend to lose meaning when seen normally. In this paper, we revisit these famous optical illusions with a generative twist. With the help of latent rectified flow models, we propose a method to create anamorphic images that still retain a valid interpretation when viewed directly. To this end, we introduce Laplacian Pyramid Warping, a frequency-aware image warping technique key to generating high-quality visuals. Our work extends Visual Anagrams (arXiv:2311.17919) to latent space models and to a wider range of spatial transforms, enabling the creation of novel generative perceptual illusions.

[3] Robust SAM: On the Adversarial Robustness of Vision Foundation Models

Jiahuan Long,Zhengqin Xu,Tingsong Jiang,Wen Yao,Shuai Jia,Chao Ma,Xiaoqian Chen

Main category: cs.CV

TLDR: 本文提出了一种对抗性鲁棒性框架，用于评估和增强SAM模型的鲁棒性，包括跨提示攻击方法和少参数适应防御策略。

Details

Motivation: SAM模型广泛应用，但其对抗攻击鲁棒性研究不足，现有攻击和防御方法存在缺陷。 Method: 提出跨提示攻击方法增强攻击可转移性，使用SVD约束参数空间进行防御。 Result: 跨提示攻击方法优于现有方法，防御策略仅需512参数即可提升15%的mIoU。 Conclusion: 该框架显著提升SAM的鲁棒性，同时保持其原始性能。 Abstract: The Segment Anything Model (SAM) is a widely used vision foundation model with diverse applications, including image segmentation, detection, and tracking. Given SAM's wide applications, understanding its robustness against adversarial attacks is crucial for real-world deployment. However, research on SAM's robustness is still in its early stages. Existing attacks often overlook the role of prompts in evaluating SAM's robustness, and there has been insufficient exploration of defense methods to balance the robustness and accuracy. To address these gaps, this paper proposes an adversarial robustness framework designed to evaluate and enhance the robustness of SAM. Specifically, we introduce a cross-prompt attack method to enhance the attack transferability across different prompt types. Besides attacking, we propose a few-parameter adaptation strategy to defend SAM against various adversarial attacks. To balance robustness and accuracy, we use the singular value decomposition (SVD) to constrain the space of trainable parameters, where only singular values are adaptable. Experiments demonstrate that our cross-prompt attack method outperforms previous approaches in terms of attack success rate on both SAM and SAM 2. By adapting only 512 parameters, we achieve at least a 15\% improvement in mean intersection over union (mIoU) against various adversarial attacks. Compared to previous defense methods, our approach enhances the robustness of SAM while maximally maintaining its original performance.

[4] Parameter-Free Fine-tuning via Redundancy Elimination for Vision Foundation Models

Jiahuan Long,Tingsong Jiang,Wen Yao,Yizhe Xiong,Zhengqin Xu,Shuai Jia,Chao Ma

Main category: cs.CV

TLDR: 本文提出了一种参数无关的微调方法，通过选择和重用预训练特征来解决视觉基础模型（VFMs）中的特征冗余问题，提升下游任务性能。

Details

Motivation: 视觉基础模型（VFMs）存在显著的特征冗余，限制了其对新任务的适应性，因此需要一种高效的微调方法。 Method: 提出基于模型输出差异的通道选择算法，识别冗余和有效通道，选择性替换冗余通道以增强任务特定特征表示。 Result: 实验表明该方法在域内外数据集上均高效有效，并能与现有微调策略（如LoRA、Adapter）无缝结合，进一步优化性能。 Conclusion: 该方法显著降低了计算和GPU内存开销，为模型微调提供了新视角。 Abstract: Vision foundation models (VFMs) are large pre-trained models that form the backbone of various vision tasks. Fine-tuning VFMs can further unlock their potential for downstream tasks or scenarios. However, VFMs often contain significant feature redundancy, which may limit their adaptability to new tasks. In this paper, we investigate the redundancies in the segment anything model (SAM) and then propose a parameter-free fine-tuning method to address this issue. Unlike traditional fine-tuning methods that adjust parameters, our method emphasizes selecting, reusing, and enhancing pre-trained features, offering a new perspective on model fine-tuning. Specifically, we introduce a channel selection algorithm based on the model's output difference to identify redundant and effective channels. By selectively replacing the redundant channels with more effective ones, we filter out less useful features and reuse the more relevant features to downstream tasks, thereby enhancing the task-specific feature representation. Experiments on both out-of-domain and in-domain datasets demonstrate the efficiency and effectiveness of our method. Notably, our approach can seamlessly integrate with existing fine-tuning strategies (e.g., LoRA, Adapter), further boosting the performance of already fine-tuned models. Moreover, since our channel selection involves only model inference, our method significantly reduces computational and GPU memory overhead.

[5] MotionDreamer: One-to-Many Motion Synthesis with Localized Generative Masked Transformer

Yilin Wang,Chuan Guo,Yuxuan Mu,Muhammad Gohar Javed,Xinxin Zuo,Juwei Lu,Hai Jiang,Li Cheng

Main category: cs.CV

TLDR: MotionDreamer提出了一种局部掩码建模方法，用于从单一MoCap参考生成多样化的动画，解决了现有方法在动画领域可能导致的过拟合问题。

Details

Motivation: 在动画领域，大规模数据集通常不可用，而现有生成掩码建模方法在单一参考下容易过拟合。 Method: 通过分布正则化方法将运动嵌入量化标记，构建局部运动模式的代码库，并引入滑动窗口局部注意力机制。 Result: MotionDreamer在忠实性和多样性上优于基于GAN或Diffusion的现有方法，并能有效执行下游任务。 Conclusion: MotionDreamer通过量化方法实现了高质量动画生成，适用于多种下游任务。 Abstract: Generative masked transformers have demonstrated remarkable success across various content generation tasks, primarily due to their ability to effectively model large-scale dataset distributions with high consistency. However, in the animation domain, large datasets are not always available. Applying generative masked modeling to generate diverse instances from a single MoCap reference may lead to overfitting, a challenge that remains unexplored. In this work, we present MotionDreamer, a localized masked modeling paradigm designed to learn internal motion patterns from a given motion with arbitrary topology and duration. By embedding the given motion into quantized tokens with a novel distribution regularization method, MotionDreamer constructs a robust and informative codebook for local motion patterns. Moreover, a sliding window local attention is introduced in our masked transformer, enabling the generation of natural yet diverse animations that closely resemble the reference motion patterns. As demonstrated through comprehensive experiments, MotionDreamer outperforms the state-of-the-art methods that are typically GAN or Diffusion-based in both faithfulness and diversity. Thanks to the consistency and robustness of the quantization-based approach, MotionDreamer can also effectively perform downstream tasks such as temporal motion editing, \textcolor{update}{crowd animation}, and beat-aligned dance generation, all using a single reference motion. Visit our project page: https://motiondreamer.github.io/

[6] PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models

Mohamed Dhouib,Davide Buscaldi,Sonia Vanier,Aymen Shabou

Main category: cs.CV

TLDR: PACT是一种通过剪枝无关令牌和合并冗余视觉令牌来减少视觉语言模型推理时间和内存使用的方法。

Details

Motivation: 视觉令牌通常包含冗余和不重要信息，导致不必要的计算资源浪费。 Method: 提出PACT方法，使用重要性度量剪枝无关令牌，并采用Distance Bounded Density Peak Clustering算法合并冗余令牌。 Result: 实验证明PACT能有效减少推理时间和内存使用。 Conclusion: PACT为视觉语言模型的高效推理提供了可行解决方案。 Abstract: Visual Language Models require substantial computational resources for inference due to the additional input tokens needed to represent visual information. However, these visual tokens often contain redundant and unimportant information, resulting in an unnecessarily high number of tokens. To address this, we introduce PACT, a method that reduces inference time and memory usage by pruning irrelevant tokens and merging visually redundant ones at an early layer of the language model. Our approach uses a novel importance metric to identify unimportant tokens without relying on attention scores, making it compatible with FlashAttention. We also propose a novel clustering algorithm, called Distance Bounded Density Peak Clustering, which efficiently clusters visual tokens while constraining the distances between elements within a cluster by a predefined threshold. We demonstrate the effectiveness of PACT through extensive experiments.

[7] Adaptive Additive Parameter Updates of Vision Transformers for Few-Shot Continual Learning

Kyle Stein,Andrew Arash Mahyari,Guillermo Francia III,Eman El-Sheikh

Main category: cs.CV

TLDR: 提出了一种基于冻结Vision Transformer（ViT）的小样本增量学习框架，通过参数高效的加法更新机制，有效减少过拟合和灾难性遗忘。

Details

Motivation: 解决小样本增量学习（FSCIL）中因数据量少导致的过拟合和灾难性遗忘问题。 Method: 冻结预训练ViT参数，通过加法更新机制在自注意力模块中注入可训练权重，仅更新少量参数以适应新类别。 Result: 在基准数据集上实现了优于基线FSCIL方法的性能。 Conclusion: 该方法通过冻结大部分参数和选择性更新，平衡了新类别学习和旧知识保留。 Abstract: Integrating new class information without losing previously acquired knowledge remains a central challenge in artificial intelligence, often referred to as catastrophic forgetting. Few-shot class incremental learning (FSCIL) addresses this by first training a model on a robust dataset of base classes and then incrementally adapting it in successive sessions using only a few labeled examples per novel class. However, this approach is prone to overfitting on the limited new data, which can compromise overall performance and exacerbate forgetting. In this work, we propose a simple yet effective novel FSCIL framework that leverages a frozen Vision Transformer (ViT) backbone augmented with parameter-efficient additive updates. Our approach freezes the pre-trained ViT parameters and selectively injects trainable weights into the self-attention modules via an additive update mechanism. This design updates only a small subset of parameters to accommodate new classes without sacrificing the representations learned during the base session. By fine-tuning a limited number of parameters, our method preserves the generalizable features in the frozen ViT while reducing the risk of overfitting. Furthermore, as most parameters remain fixed, the model avoids overwriting previously learned knowledge when small novel data batches are introduced. Extensive experiments on benchmark datasets demonstrate that our approach yields state-of-the-art performance compared to baseline FSCIL methods.

[8] Chest X-ray Classification using Deep Convolution Models on Low-resolution images with Uncertain Labels

Snigdha Agarwal,Neelam Sinha

Main category: cs.CV

TLDR: 论文研究了在低分辨率胸部X光图像上使用深度卷积神经网络（CNN）进行分类的可行性，并提出了一种随机翻转标签技术处理噪声标签。通过实验，模型在部分病理分类上比高分辨率图像的原始结果提高了3%的准确率。

Details

Motivation: 远程医疗中低分辨率图像是更经济的解决方案，但医学诊断中关键细节可能难以识别，因此需要研究低分辨率图像的分类可行性。 Method: 使用不同尺寸的胸部X光图像训练深度CNN模型，提出随机翻转标签技术处理噪声标签，并采用数据增强、正则化和类别激活图等技术优化模型。 Result: 在CheXpert数据集的5种病理分类中，模型对Cardiomegaly、Consolidation和Edema的准确率比高分辨率图像的原始结果提高了3%。 Conclusion: 低分辨率图像在特定病理分类中具有可行性，且通过技术优化可以提升模型性能。 Abstract: Deep Convolutional Neural Networks have consistently proven to achieve state-of-the-art results on a lot of imaging tasks over the past years' majority of which comprise of high-quality data. However, it is important to work on low-resolution images since it could be a cheaper alternative for remote healthcare access where the primary need of automated pathology identification models occurs. Medical diagnosis using low-resolution images is challenging since critical details may not be easily identifiable. In this paper, we report classification results by experimenting on different input image sizes of Chest X-rays to deep CNN models and discuss the feasibility of classification on varying image sizes. We also leverage the noisy labels in the dataset by proposing a Randomized Flipping of labels techniques. We use an ensemble of multi-label classification models on frontal and lateral studies. Our models are trained on 5 out of the 14 chest pathologies of the publicly available CheXpert dataset. We incorporate techniques such as augmentation, regularization for model improvement and use class activation maps to visualize the neural network's decision making. Comparison with classification results on data from 200 subjects, obtained on the corresponding high-resolution images, reported in the original CheXpert paper, has been presented. For pathologies Cardiomegaly, Consolidation and Edema, we obtain 3% higher accuracy with our model architecture.

[9] Sculpting Memory: Multi-Concept Forgetting in Diffusion Models via Dynamic Mask and Concept-Aware Optimization

Gen Li,Yang Xiao,Jie Ji,Kaiyuan Deng,Bo Hui,Linke Guo,Xiaolong Ma

Main category: cs.CV

TLDR: 本文提出了一种名为“动态掩码与概念感知损失”的新框架，用于解决扩散模型中多概念遗忘的问题，显著提升了遗忘效果、输出质量和语义一致性。

Details

Motivation: 现有的遗忘方法在多概念遗忘时存在不稳定性、残留知识和生成质量下降的问题，因此需要一种更有效的解决方案。 Method: 结合动态掩码机制和概念感知损失，动态更新梯度掩码并通过语义一致性指导遗忘过程。 Result: 实验表明，该方法在多概念遗忘场景中优于现有技术，具有更高的遗忘效果和生成质量。 Conclusion: 该框架为生成模型提供了一种稳定且高质量的遗忘方法，代码将公开。 Abstract: Text-to-image (T2I) diffusion models have achieved remarkable success in generating high-quality images from textual prompts. However, their ability to store vast amounts of knowledge raises concerns in scenarios where selective forgetting is necessary, such as removing copyrighted content, reducing biases, or eliminating harmful concepts. While existing unlearning methods can remove certain concepts, they struggle with multi-concept forgetting due to instability, residual knowledge persistence, and generation quality degradation. To address these challenges, we propose \textbf{Dynamic Mask coupled with Concept-Aware Loss}, a novel unlearning framework designed for multi-concept forgetting in diffusion models. Our \textbf{Dynamic Mask} mechanism adaptively updates gradient masks based on current optimization states, allowing selective weight modifications that prevent interference with unrelated knowledge. Additionally, our \textbf{Concept-Aware Loss} explicitly guides the unlearning process by enforcing semantic consistency through superclass alignment, while a regularization loss based on knowledge distillation ensures that previously unlearned concepts remain forgotten during sequential unlearning. We conduct extensive experiments to evaluate our approach. Results demonstrate that our method outperforms existing unlearning techniques in forgetting effectiveness, output fidelity, and semantic coherence, particularly in multi-concept scenarios. Our work provides a principled and flexible framework for stable and high-fidelity unlearning in generative models. The code will be released publicly.

[10] BlockGaussian: Efficient Large-Scale Scene NovelView Synthesis via Adaptive Block-Based Gaussian Splatting

Yongchang Wu,Zipeng Qi,Zhenwei Shi,Zhengxia Zou

Main category: cs.CV

TLDR: BlockGaussian提出了一种基于内容感知的场景分区策略和可见性感知的块优化方法，实现了高效且高质量的大规模场景重建。

Details

Motivation: 3D高斯泼溅（3DGS）在新视角合成任务中表现出巨大潜力，但大规模场景重建仍面临分区、优化和合并的挑战。 Method: 采用内容感知分区策略平衡计算负载，引入辅助点解决独立块优化的监督不匹配问题，并提出伪视图几何约束减少合并时的渲染退化。 Result: 实验表明，BlockGaussian在优化速度上提升5倍，PSNR平均提高1.21 dB，并显著降低计算需求。 Conclusion: BlockGaussian在大规模场景重建中实现了高效和高渲染质量，适用于单24GB VRAM设备。 Abstract: The recent advancements in 3D Gaussian Splatting (3DGS) have demonstrated remarkable potential in novel view synthesis tasks. The divide-and-conquer paradigm has enabled large-scale scene reconstruction, but significant challenges remain in scene partitioning, optimization, and merging processes. This paper introduces BlockGaussian, a novel framework incorporating a content-aware scene partition strategy and visibility-aware block optimization to achieve efficient and high-quality large-scale scene reconstruction. Specifically, our approach considers the content-complexity variation across different regions and balances computational load during scene partitioning, enabling efficient scene reconstruction. To tackle the supervision mismatch issue during independent block optimization, we introduce auxiliary points during individual block optimization to align the ground-truth supervision, which enhances the reconstruction quality. Furthermore, we propose a pseudo-view geometry constraint that effectively mitigates rendering degradation caused by airspace floaters during block merging. Extensive experiments on large-scale scenes demonstrate that our approach achieves state-of-the-art performance in both reconstruction efficiency and rendering quality, with a 5x speedup in optimization and an average PSNR improvement of 1.21 dB on multiple benchmarks. Notably, BlockGaussian significantly reduces computational requirements, enabling large-scale scene reconstruction on a single 24GB VRAM device. The project page is available at https://github.com/SunshineWYC/BlockGaussian

[11] You Need a Transition Plane: Bridging Continuous Panoramic 3D Reconstruction with Perspective Gaussian Splatting

Zhijie Shen,Chunyu Lin,Shujuan Huang,Lang Nie,Kang Liao,Yao Zhao

Main category: cs.CV

TLDR: 提出了一种名为TPGS的新框架，用于从全景图像重建3D场景，解决了直接渲染3D高斯到2D等距柱状空间时的失真问题。

Details

Motivation: 全景图像具有360×180的视野，但直接渲染3D高斯到等距柱状空间会引入严重失真，而转换为立方体贴图投影又会带来新的挑战。 Method: 引入过渡平面和优化策略，先在立方体面内优化3D高斯，再在全景空间中微调，并采用球形采样技术消除接缝。 Result: 在室内外、第一人称和漫游基准数据集上的实验表明，TPGS优于现有方法。 Conclusion: TPGS通过过渡平面和优化策略，有效解决了全景图像重建中的失真和边界问题。 Abstract: Recently, reconstructing scenes from a single panoramic image using advanced 3D Gaussian Splatting (3DGS) techniques has attracted growing interest. Panoramic images offer a 360$\times$ 180 field of view (FoV), capturing the entire scene in a single shot. However, panoramic images introduce severe distortion, making it challenging to render 3D Gaussians into 2D distorted equirectangular space directly. Converting equirectangular images to cubemap projections partially alleviates this problem but introduces new challenges, such as projection distortion and discontinuities across cube-face boundaries. To address these limitations, we present a novel framework, named TPGS, to bridge continuous panoramic 3D scene reconstruction with perspective Gaussian splatting. Firstly, we introduce a Transition Plane between adjacent cube faces to enable smoother transitions in splatting directions and mitigate optimization ambiguity in the boundary region. Moreover, an intra-to-inter face optimization strategy is proposed to enhance local details and restore visual consistency across cube-face boundaries. Specifically, we optimize 3D Gaussians within individual cube faces and then fine-tune them in the stitched panoramic space. Additionally, we introduce a spherical sampling technique to eliminate visible stitching seams. Extensive experiments on indoor and outdoor, egocentric, and roaming benchmark datasets demonstrate that our approach outperforms existing state-of-the-art methods. Code and models will be available at https://github.com/zhijieshen-bjtu/TPGS.

[12] Hyperlocal disaster damage assessment using bi-temporal street-view imagery and pre-trained vision models

Yifan Yang,Lei Zou,Bing Zhou,Daoyang Li,Binbin Lin,Joynal Abedin,Mingzheng Yang

Main category: cs.CV

TLDR: 该研究利用双时态街景图像和预训练视觉模型，通过结合灾前图像和双通道算法，显著提高了灾害损害评估的准确性。

Details

Motivation: 现有研究主要关注灾后图像，而时间序列街景图像的潜力尚未充分挖掘。灾前图像可为建筑和街道级别的损害评估提供基准，提高标注数据的可靠性和模型性能。 Method: 收集2024年飓风Milton前后的街景图像，通过微调预训练模型（如Swin Transformer和ConvNeXt）和设计双通道算法进行损害评估。 Result: 结合灾前图像和双通道处理框架后，损害评估准确率从66.14%（Swin Transformer基线）提升至77.11%（双通道Feature-Fusion ConvNeXt模型）。 Conclusion: 该方法能实现超局部空间分辨率的快速损害评估，为灾害管理和韧性规划提供有力支持。 Abstract: Street-view images offer unique advantages for disaster damage estimation as they capture impacts from a visual perspective and provide detailed, on-the-ground insights. Despite several investigations attempting to analyze street-view images for damage estimation, they mainly focus on post-disaster images. The potential of time-series street-view images remains underexplored. Pre-disaster images provide valuable benchmarks for accurate damage estimations at building and street levels. These images could aid annotators in objectively labeling post-disaster impacts, improving the reliability of labeled data sets for model training, and potentially enhancing the model performance in damage evaluation. The goal of this study is to estimate hyperlocal, on-the-ground disaster damages using bi-temporal street-view images and advanced pre-trained vision models. Street-view images before and after 2024 Hurricane Milton in Horseshoe Beach, Florida, were collected for experiments. The objectives are: (1) to assess the performance gains of incorporating pre-disaster street-view images as a no-damage category in fine-tuning pre-trained models, including Swin Transformer and ConvNeXt, for damage level classification; (2) to design and evaluate a dual-channel algorithm that reads pair-wise pre- and post-disaster street-view images for hyperlocal damage assessment. The results indicate that incorporating pre-disaster street-view images and employing a dual-channel processing framework can significantly enhance damage assessment accuracy. The accuracy improves from 66.14% with the Swin Transformer baseline to 77.11% with the dual-channel Feature-Fusion ConvNeXt model. This research enables rapid, operational damage assessments at hyperlocal spatial resolutions, providing valuable insights to support effective decision-making in disaster management and resilience planning.

[13] UniFlowRestore: A General Video Restoration Framework via Flow Matching and Prompt Guidance

Shuning Sun,Yu Zhang,Chen Wu,Dianjie Lu,Dianjie Lu,Guijuan Zhan,Yang Weng,Zhuoran Zheng

Main category: cs.CV

TLDR: UniFlowRestore提出了一种通用的视频修复框架，通过物理感知的向量场统一处理多种退化问题，实现了高效且泛化性强的修复效果。

Details

Motivation: 传统视频修复方法采用“单任务单模型”范式，泛化性差且计算成本高，难以应对现实场景中的多样化退化问题。 Method: UniFlowRestore将修复建模为时间连续的演化过程，结合物理感知的向量场和任务相关提示，通过哈密顿系统优化实现统一修复。 Result: 实验表明，UniFlowRestore在视频去噪任务中达到最高PSNR（33.89 dB）和SSIM（0.97），并在所有评估任务中表现优异。 Conclusion: UniFlowRestore通过物理感知和提示引导的框架，实现了高效、泛化性强的视频修复，为复杂退化问题提供了统一解决方案。 Abstract: Video imaging is often affected by complex degradations such as blur, noise, and compression artifacts. Traditional restoration methods follow a "single-task single-model" paradigm, resulting in poor generalization and high computational cost, limiting their applicability in real-world scenarios with diverse degradation types. We propose UniFlowRestore, a general video restoration framework that models restoration as a time-continuous evolution under a prompt-guided and physics-informed vector field. A physics-aware backbone PhysicsUNet encodes degradation priors as potential energy, while PromptGenerator produces task-relevant prompts as momentum. These components define a Hamiltonian system whose vector field integrates inertial dynamics, decaying physical gradients, and prompt-based guidance. The system is optimized via a fixed-step ODE solver to achieve efficient and unified restoration across tasks. Experiments show that UniFlowRestore delivers stateof-the-art performance with strong generalization and efficiency. Quantitative results demonstrate that UniFlowRestore achieves state-of-the-art performance, attaining the highest PSNR (33.89 dB) and SSIM (0.97) on the video denoising task, while maintaining top or second-best scores across all evaluated tasks.

[14] Exploring Synergistic Ensemble Learning: Uniting CNNs, MLP-Mixers, and Vision Transformers to Enhance Image Classification

Mk Bashar,Ocean Monjur,Samia Islam,Mohammad Galib Shams,Niamul Quader

Main category: cs.CV

TLDR: 论文提出了一种通过集成技术结合不同神经网络架构的方法，以探索其互补性，并展示了基础集成方法的有效性，提升了图像分类性能。

Details

Motivation: 研究旨在更系统地探索不同神经网络架构（如CNN、MLP-mixer和Vision Transformer）的互补性，避免启发式合并模块的局限性。 Method: 通过保持各架构的完整性，使用集成技术结合不同架构的模型，而非启发式合并模块。 Result: 基础集成方法优于相似架构的集成，创建的分类网络在ImageNet上超越了之前的单网络最佳性能，且延迟更低。 Conclusion: 该方法为探索不同架构的互补性提供了系统性框架，并展示了集成技术在提升性能方面的潜力。 Abstract: In recent years, Convolutional Neural Networks (CNNs), MLP-mixers, and Vision Transformers have risen to prominence as leading neural architectures in image classification. Prior research has underscored the distinct advantages of each architecture, and there is growing evidence that combining modules from different architectures can boost performance. In this study, we build upon and improve previous work exploring the complementarity between different architectures. Instead of heuristically merging modules from various architectures through trial and error, we preserve the integrity of each architecture and combine them using ensemble techniques. By maintaining the distinctiveness of each architecture, we aim to explore their inherent complementarity more deeply and with implicit isolation. This approach provides a more systematic understanding of their individual strengths. In addition to uncovering insights into architectural complementarity, we showcase the effectiveness of even basic ensemble methods that combine models from diverse architectures. These methods outperform ensembles comprised of similar architectures. Our straightforward ensemble framework serves as a foundational strategy for blending complementary architectures, offering a solid starting point for further investigations into the unique strengths and synergies among different architectures and their ensembles in image classification. A direct outcome of this work is the creation of an ensemble of classification networks that surpasses the accuracy of the previous state-of-the-art single classification network on ImageNet, setting a new benchmark, all while requiring less overall latency.

[15] A Visual Self-attention Mechanism Facial Expression Recognition Network beyond Convnext

Bingyu Nan,Feng Liu,Xuezhong Qian,Wei Song

Main category: cs.CV

TLDR: 提出了一种基于截断ConvNeXt的视觉面部表情信号特征处理网络（Conv-cut），用于提升在挑战性条件下的面部表情识别（FER）准确性。

Details

Motivation: 面部表情识别（FER）在人工智能领域具有重要意义，但数据分布不均、不同类别表情相似以及同一类别内不同个体间的差异仍是挑战。 Method: 使用截断的ConvNeXt-Base作为特征提取器，设计了细节提取块（Detail Extraction Block）提取细节特征，并引入自注意力机制（Self-Attention）以更有效地学习特征。 Result: 在RAF-DB和FERPlus数据集上的实验表明，该模型达到了最先进的性能。 Conclusion: Conv-cut方法在FER任务中表现出色，代码已开源。 Abstract: Facial expression recognition is an important research direction in the field of artificial intelligence. Although new breakthroughs have been made in recent years, the uneven distribution of datasets and the similarity between different categories of facial expressions, as well as the differences within the same category among different subjects, remain challenges. This paper proposes a visual facial expression signal feature processing network based on truncated ConvNeXt approach(Conv-cut), to improve the accuracy of FER under challenging conditions. The network uses a truncated ConvNeXt-Base as the feature extractor, and then we designed a Detail Extraction Block to extract detailed features, and introduced a Self-Attention mechanism to enable the network to learn the extracted features more effectively. To evaluate the proposed Conv-cut approach, we conducted experiments on the RAF-DB and FERPlus datasets, and the results show that our model has achieved state-of-the-art performance. Our code could be accessed at Github.

[16] Using Vision Language Models for Safety Hazard Identification in Construction

Muhammad Adil,Gaang Lee,Vicente A. Gonzalez,Qipei Mei

Main category: cs.CV

TLDR: 论文提出了一种基于视觉语言模型（VLM）的框架，用于识别建筑工地安全隐患，解决了现有方法在上下文特定风险识别和适应性上的不足。

Details

Motivation: 现有计算机视觉方法难以识别上下文特定的安全隐患，且适应性有限，导致安全漏洞。 Method: 提出并实验验证了一种VLM框架，结合提示工程模块，将安全指南转化为上下文查询，利用VLM处理视觉信息并生成符合规范的评估。 Result: 实验表明，GPT-4o和Gemini 1.5 Pro表现最佳，BERTScore分别为0.906和0.888，但处理时间仍是挑战。 Conclusion: VLM框架为建筑工地安全隐患检测提供了实用方案，有助于提升主动安全管理。 Abstract: Safety hazard identification and prevention are the key elements of proactive safety management. Previous research has extensively explored the applications of computer vision to automatically identify hazards from image clips collected from construction sites. However, these methods struggle to identify context-specific hazards, as they focus on detecting predefined individual entities without understanding their spatial relationships and interactions. Furthermore, their limited adaptability to varying construction site guidelines and conditions hinders their generalization across different projects. These limitations reduce their ability to assess hazards in complex construction environments and adaptability to unseen risks, leading to potential safety gaps. To address these challenges, we proposed and experimentally validated a Vision Language Model (VLM)-based framework for the identification of construction hazards. The framework incorporates a prompt engineering module that structures safety guidelines into contextual queries, allowing VLM to process visual information and generate hazard assessments aligned with the regulation guide. Within this framework, we evaluated state-of-the-art VLMs, including GPT-4o, Gemini, Llama 3.2, and InternVL2, using a custom dataset of 1100 construction site images. Experimental results show that GPT-4o and Gemini 1.5 Pro outperformed alternatives and displayed promising BERTScore of 0.906 and 0.888 respectively, highlighting their ability to identify both general and context-specific hazards. However, processing times remain a significant challenge, impacting real-time feasibility. These findings offer insights into the practical deployment of VLMs for construction site hazard detection, thereby contributing to the enhancement of proactive safety management.

[17] RICCARDO: Radar Hit Prediction and Convolution for Camera-Radar 3D Object Detection

Yunfei Long,Abhinav Kumar,Xiaoming Liu,Daniel Morris

Main category: cs.CV

TLDR: 论文提出了一种利用雷达命中分布模型辅助雷达-相机融合的方法，通过预测雷达命中分布并结合实际雷达点匹配，提升了检测性能。

Details

Motivation: 当前雷达-相机融合方法通过黑盒神经网络隐式处理雷达命中分布，缺乏显式建模。本文旨在显式利用雷达命中分布模型优化融合效果。 Method: 1. 构建雷达命中分布模型，基于单目检测器获取的目标属性预测分布；2. 使用预测分布作为核函数匹配实际雷达点，生成匹配分数；3. 结合上下文信息优化匹配分数。 Result: 在nuScenes数据集上实现了最先进的雷达-相机检测性能。 Conclusion: 显式建模雷达命中分布显著提升了融合效果，代码已开源。 Abstract: Radar hits reflect from points on both the boundary and internal to object outlines. This results in a complex distribution of radar hits that depends on factors including object category, size, and orientation. Current radar-camera fusion methods implicitly account for this with a black-box neural network. In this paper, we explicitly utilize a radar hit distribution model to assist fusion. First, we build a model to predict radar hit distributions conditioned on object properties obtained from a monocular detector. Second, we use the predicted distribution as a kernel to match actual measured radar points in the neighborhood of the monocular detections, generating matching scores at nearby positions. Finally, a fusion stage combines context with the kernel detector to refine the matching scores. Our method achieves the state-of-the-art radar-camera detection performance on nuScenes. Our source code is available at https://github.com/longyunf/riccardo.

[18] BIGS: Bimanual Category-agnostic Interaction Reconstruction from Monocular Videos via 3D Gaussian Splatting

Jeongwan On,Kyeonghwan Gwak,Gunyoung Kang,Junuk Cha,Soohyun Hwang,Hyein Hwang,Seungryul Baek

Main category: cs.CV

TLDR: 本文提出了一种名为BIGS的方法，用于从单目RGB视频中重建双手与未知物体的3D高斯分布，解决了复杂交互中的遮挡问题，并在多个指标上达到最优性能。

Details

Motivation: 当前缺乏从单目视频中重建双手与未知物体交互的完整方法，且复杂交互导致严重遮挡。本文旨在填补这一空白。 Method: 结合预训练扩散模型（SDS损失）重建物体部分，利用MANO手部模型先验共享高斯分布，并通过交互主体优化步骤对齐手与物体。 Result: 在两个数据集上，3D手部姿态估计、物体重建和渲染质量均达到最优。 Conclusion: BIGS方法有效解决了复杂交互中的遮挡问题，实现了高精度的3D重建。 Abstract: Reconstructing 3Ds of hand-object interaction (HOI) is a fundamental problem that can find numerous applications. Despite recent advances, there is no comprehensive pipeline yet for bimanual class-agnostic interaction reconstruction from a monocular RGB video, where two hands and an unknown object are interacting with each other. Previous works tackled the limited hand-object interaction case, where object templates are pre-known or only one hand is involved in the interaction. The bimanual interaction reconstruction exhibits severe occlusions introduced by complex interactions between two hands and an object. To solve this, we first introduce BIGS (Bimanual Interaction 3D Gaussian Splatting), a method that reconstructs 3D Gaussians of hands and an unknown object from a monocular video. To robustly obtain object Gaussians avoiding severe occlusions, we leverage prior knowledge of pre-trained diffusion model with score distillation sampling (SDS) loss, to reconstruct unseen object parts. For hand Gaussians, we exploit the 3D priors of hand model (i.e., MANO) and share a single Gaussian for two hands to effectively accumulate hand 3D information, given limited views. To further consider the 3D alignment between hands and objects, we include the interacting-subjects optimization step during Gaussian optimization. Our method achieves the state-of-the-art accuracy on two challenging datasets, in terms of 3D hand pose estimation (MPJPE), 3D object reconstruction (CDh, CDo, F10), and rendering quality (PSNR, SSIM, LPIPS), respectively.

Yonghao Huang,Leiting Chen,Chuan Zhou

Main category: cs.CV

TLDR: 提出了一种基于多尺度交叉注意力和移位窗口自注意力的多模态多视角眼底图像融合方法，用于提升视网膜病变诊断的准确性和效率。

Details

Motivation: 多模态和多视角眼底图像的联合解读对预防视网膜病变至关重要，但现有方法在长程依赖性和计算复杂度上存在局限。 Method: 设计了基于多尺度交叉注意力的多模态融合方法和基于移位窗口自注意力的多视角融合方法，并结合为多任务诊断框架。 Result: 实验结果显示分类准确率为82.53%，报告生成BLEU-1得分为0.543。 Conclusion: 该方法能有效提升临床诊断效率和可靠性。 Abstract: The joint interpretation of multi-modal and multi-view fundus images is critical for retinopathy prevention, as different views can show the complete 3D eyeball field and different modalities can provide complementary lesion areas. Compared with single images, the sequence relationships in multi-modal and multi-view fundus images contain long-range dependencies in lesion features. By modeling the long-range dependencies in these sequences, lesion areas can be more comprehensively mined, and modality-specific lesions can be detected. To learn the long-range dependency relationship and fuse complementary multi-scale lesion features between different fundus modalities, we design a multi-modal fundus image fusion method based on multi-scale cross-attention, which solves the static receptive field problem in previous multi-modal medical fusion methods based on attention. To capture multi-view relative positional relationships between different views and fuse comprehensive lesion features between different views, we design a multi-view fundus image fusion method based on shifted window self-attention, which also solves the computational complexity of the multi-view fundus fusion method based on self-attention is quadratic to the size and number of multi-view fundus images. Finally, we design a multi-task retinopathy diagnosis framework to help ophthalmologists reduce workload and improve diagnostic accuracy by combining the proposed two fusion methods. The experimental results of retinopathy classification and report generation tasks indicate our method's potential to improve the efficiency and reliability of retinopathy diagnosis in clinical practice, achieving a classification accuracy of 82.53\% and a report generation BlEU-1 of 0.543.

[20] Probability Distribution Alignment and Low-Rank Weight Decomposition for Source-Free Domain Adaptive Brain Decoding

Ganxi Xu,Jinyi Long,Hanrui Wu,Jia Zhang

Main category: cs.CV

TLDR: 提出了一种基于无源域适应的大脑解码框架，解决个体差异、模态对齐和高维嵌入问题。

Details

Motivation: 当前大脑解码面临个体差异、模态对齐和高维嵌入的挑战，现有方法存在隐私泄漏、数据存储负担重、模态未完全对齐及计算成本高等问题。 Method: 采用无源域适应框架，避免使用源主体数据，同时优化模态对齐和高维嵌入问题。 Result: 解决了隐私和数据存储问题，改进了模态对齐效果，并降低了计算成本。 Conclusion: 该框架为大脑解码提供了一种高效且隐私安全的解决方案。 Abstract: Brain decoding currently faces significant challenges in individual differences, modality alignment, and high-dimensional embeddings. To address individual differences, researchers often use source subject data, which leads to issues such as privacy leakage and heavy data storage burdens. In modality alignment, current works focus on aligning the softmax probability distribution but neglect the alignment of marginal probability distributions, resulting in modality misalignment. Additionally, images and text are aligned separately with fMRI without considering the complex interplay between images and text, leading to poor image reconstruction. Finally, the enormous dimensionality of CLIP embeddings causes significant computational costs. Although the dimensionality of CLIP embeddings can be reduced by ignoring the number of patches obtained from images and the number of tokens acquired from text, this comes at the cost of a significant drop in model performance, creating a dilemma. To overcome these limitations, we propose a source-free domain adaptation-based brain decoding framework

[21] A Constrained Optimization Approach for Gaussian Splatting from Coarsely-posed Images and Noisy Lidar Point Clouds

Jizong Peng,Tze Ho Elden Tse,Kai Xu,Wenchao Gao,Angela Yao

Main category: cs.CV

TLDR: 提出了一种无需SfM支持的相机姿态估计与3D重建联合优化方法，显著提升了3DGS的性能。

Details

Motivation: 3DGS需要准确的相机姿态和高保真点云初始化，但SfM耗时且限制应用范围。 Method: 通过分解相机姿态为相机到设备中心和设备中心到世界的优化序列，并引入参数敏感性和搜索空间约束。 Result: 在自收集数据集和公开基准测试中，性能显著优于现有3DGS基线和COLMAP补充方法。 Conclusion: 该方法为3DGS在现实场景和大规模重建中的应用提供了更高效的解决方案。 Abstract: 3D Gaussian Splatting (3DGS) is a powerful reconstruction technique, but it needs to be initialized from accurate camera poses and high-fidelity point clouds. Typically, the initialization is taken from Structure-from-Motion (SfM) algorithms; however, SfM is time-consuming and restricts the application of 3DGS in real-world scenarios and large-scale scene reconstruction. We introduce a constrained optimization method for simultaneous camera pose estimation and 3D reconstruction that does not require SfM support. Core to our approach is decomposing a camera pose into a sequence of camera-to-(device-)center and (device-)center-to-world optimizations. To facilitate, we propose two optimization constraints conditioned to the sensitivity of each parameter group and restricts each parameter's search space. In addition, as we learn the scene geometry directly from the noisy point clouds, we propose geometric constraints to improve the reconstruction quality. Experiments demonstrate that the proposed method significantly outperforms the existing (multi-modal) 3DGS baseline and methods supplemented by COLMAP on both our collected dataset and two public benchmarks.

[22] MASH: Masked Anchored SpHerical Distances for 3D Shape Representation and Generation

Changhao Li,Yu Xin,Xiaowei Zhou,Ariel Shamir,Hao Zhang,Ligang Liu,Ruizhen Hu

Main category: cs.CV

TLDR: MASH是一种新颖的多视角参数化3D形状表示方法，通过局部表面块和球面距离函数捕捉形状特征，结合球谐函数和视图锥实现高效编码，适用于多种3D任务。

Details

Motivation: 受多视角几何启发，MASH旨在通过感知形状理解提升3D形状学习效果，捕捉局部表面特征。 Method: MASH将3D形状表示为局部表面块的集合，每个块由球面距离函数定义，利用球谐函数编码，并通过参数化视图锥实现局部性。 Result: 实验表明，MASH在表面重建、形状生成、补全和混合等任务中表现优异，结合了隐式和显式特征。 Conclusion: MASH通过独特的表示方法在多任务中展现出优越性能，为3D形状处理提供了新思路。 Abstract: We introduce Masked Anchored SpHerical Distances (MASH), a novel multi-view and parametrized representation of 3D shapes. Inspired by multi-view geometry and motivated by the importance of perceptual shape understanding for learning 3D shapes, MASH represents a 3D shape as a collection of observable local surface patches, each defined by a spherical distance function emanating from an anchor point. We further leverage the compactness of spherical harmonics to encode the MASH functions, combined with a generalized view cone with a parameterized base that masks the spatial extent of the spherical function to attain locality. We develop a differentiable optimization algorithm capable of converting any point cloud into a MASH representation accurately approximating ground-truth surfaces with arbitrary geometry and topology. Extensive experiments demonstrate that MASH is versatile for multiple applications including surface reconstruction, shape generation, completion, and blending, achieving superior performance thanks to its unique representation encompassing both implicit and explicit features.

[23] Evolved Hierarchical Masking for Self-Supervised Learning

Zhanzhou Feng,Shiliang Zhang

Main category: cs.CV

TLDR: 提出了一种层次化掩码方法，通过动态调整掩码模式提升自监督学习中的视觉线索建模能力。

Details

Motivation: 现有固定掩码模式限制了视觉线索建模能力，需动态适应不同训练阶段的需求。 Method: 利用训练中的视觉模型解析输入线索为层次结构，动态生成掩码，从低到高层次逐步演化。 Result: 在七项下游任务中表现优异，如图像分类和语义分割，超越MAE方法1.1%和1.4%。 Conclusion: 该方法无需额外预训练模型或标注，动态调整训练难度，显著提升任务性能。 Abstract: Existing Masked Image Modeling methods apply fixed mask patterns to guide the self-supervised training. As those mask patterns resort to different criteria to depict image contents, sticking to a fixed pattern leads to a limited vision cues modeling capability.This paper introduces an evolved hierarchical masking method to pursue general visual cues modeling in self-supervised learning. The proposed method leverages the vision model being trained to parse the input visual cues into a hierarchy structure, which is hence adopted to generate masks accordingly. The accuracy of hierarchy is on par with the capability of the model being trained, leading to evolved mask patterns at different training stages. Initially, generated masks focus on low-level visual cues to grasp basic textures, then gradually evolve to depict higher-level cues to reinforce the learning of more complicated object semantics and contexts. Our method does not require extra pre-trained models or annotations and ensures training efficiency by evolving the training difficulty. We conduct extensive experiments on seven downstream tasks including partial-duplicate image retrieval relying on low-level details, as well as image classification and semantic segmentation that require semantic parsing capability. Experimental results demonstrate that it substantially boosts performance across these tasks. For instance, it surpasses the recent MAE by 1.1\% in imageNet-1K classification and 1.4\% in ADE20K segmentation with the same training epochs. We also align the proposed method with the current research focus on LLMs. The proposed approach bridges the gap with large-scale pre-training on semantic demanding tasks and enhances intricate detail perception in tasks requiring low-level feature recognition.

[24] LEREL: Lipschitz Continuity-Constrained Emotion Recognition Ensemble Learning For Electroencephalography

Shengyu Gong,Yueyang Li,Zijian Kang,Weiming Zeng,Hongjie Yan,Wai Ting Siok,Nizhuan Wang

Main category: cs.CV

TLDR: LEREL框架通过Lipschitz连续性约束和集成学习提升EEG情感识别的准确性和鲁棒性。

Details

Motivation: 情感障碍与严重的心理社会障碍相关，现有EEG情感识别方法存在模型稳定性不足、高维非线性信号处理精度有限等问题。 Method: 提出LEREL框架，利用Lipschitz连续性约束增强模型稳定性和泛化能力，结合集成学习减少单模型偏差。 Result: 在三个公开数据集（EAV、FACED、SEED）上平均识别准确率分别为76.43%、83.00%和89.22%。 Conclusion: LEREL显著提升了EEG情感识别的性能和鲁棒性，适用于小样本数据集。 Abstract: Accurate and efficient perception of emotional states in oneself and others is crucial, as emotion-related disorders are associated with severe psychosocial impairments. While electroencephalography (EEG) offers a powerful tool for emotion detection, current EEG-based emotion recognition (EER) methods face key limitations: insufficient model stability, limited accuracy in processing high-dimensional nonlinear EEG signals, and poor robustness against intra-subject variability and signal noise. To address these challenges, we propose LEREL (Lipschitz continuity-constrained Emotion Recognition Ensemble Learning), a novel framework that significantly enhances both the accuracy and robustness of emotion recognition performance. The LEREL framework employs Lipschitz continuity constraints to enhance model stability and generalization in EEG emotion recognition, reducing signal variability and noise susceptibility while maintaining strong performance on small-sample datasets. The ensemble learning strategy reduces single-model bias and variance through multi-classifier decision fusion, further optimizing overall performance. Experimental results on three public benchmark datasets (EAV, FACED and SEED) demonstrate LEREL's effectiveness, achieving average recognition accuracies of 76.43%, 83.00% and 89.22%, respectively.

[25] SCFlow2: Plug-and-Play Object Pose Refiner with Shape-Constraint Scene Flow

Qingyuan Wang,Rui Song,Jiaojiao Li,Kerui Cheng,David Ferstl,Yinlin Hu

Main category: cs.CV

TLDR: SCFlow2是一个即插即用的6D物体姿态估计细化框架，通过引入几何约束和3D场景流，显著提升现有方法的精度，无需重新训练。

Details

Motivation: 现有6D姿态估计方法在细化时存在对应噪声或需重新训练的问题，SCFlow2旨在解决这些限制。 Method: 基于SCFlow模型，通过3D场景流引入几何约束，结合刚性运动嵌入和3D形状先验训练循环匹配网络。 Result: 在BOP数据集上评估，显著提升现有方法的精度，无需重新训练或微调。 Conclusion: SCFlow2是一种高效且通用的细化框架，适用于多种6D姿态估计方法。 Abstract: We introduce SCFlow2, a plug-and-play refinement framework for 6D object pose estimation. Most recent 6D object pose methods rely on refinement to get accurate results. However, most existing refinement methods either suffer from noises in establishing correspondences, or rely on retraining for novel objects. SCFlow2 is based on the SCFlow model designed for refinement with shape constraint, but formulates the additional depth as a regularization in the iteration via 3D scene flow for RGBD frames. The key design of SCFlow2 is an introduction of geometry constraints into the training of recurrent matching network, by combining the rigid-motion embeddings in 3D scene flow and 3D shape prior of the target. We train SCFlow2 on a combination of dataset Objaverse, GSO and ShapeNet, and evaluate on BOP datasets with novel objects. After using our method as a post-processing, most state-of-the-art methods produce significantly better results, without any retraining or fine-tuning. The source code is available at https://scflow2.github.io.

[26] ReferGPT: Towards Zero-Shot Referring Multi-Object Tracking

Tzoulio Chamiti,Leandro Di Bella,Adrian Munteanu,Nikos Deligiannis

Main category: cs.CV

TLDR: ReferGPT是一种零样本的多目标跟踪框架，利用多模态大语言模型（MLLM）生成3D感知的文本描述，并通过CLIP语义编码实现灵活的查询匹配。

Details

Motivation: 解决现有方法在开放集查询中泛化能力不足的问题，避免监督训练的需求。 Method: 结合MLLM生成3D感知描述，使用CLIP语义编码和模糊匹配策略关联用户查询与生成描述。 Result: 在Refer-KITTI等数据集上表现优异，展示了零样本能力。 Conclusion: ReferGPT在自动驾驶场景中具有鲁棒性和零样本优势，代码已开源。 Abstract: Tracking multiple objects based on textual queries is a challenging task that requires linking language understanding with object association across frames. Previous works typically train the whole process end-to-end or integrate an additional referring text module into a multi-object tracker, but they both require supervised training and potentially struggle with generalization to open-set queries. In this work, we introduce ReferGPT, a novel zero-shot referring multi-object tracking framework. We provide a multi-modal large language model (MLLM) with spatial knowledge enabling it to generate 3D-aware captions. This enhances its descriptive capabilities and supports a more flexible referring vocabulary without training. We also propose a robust query-matching strategy, leveraging CLIP-based semantic encoding and fuzzy matching to associate MLLM generated captions with user queries. Extensive experiments on Refer-KITTI, Refer-KITTIv2 and Refer-KITTI+ demonstrate that ReferGPT achieves competitive performance against trained methods, showcasing its robustness and zero-shot capabilities in autonomous driving. The codes are available on https://github.com/Tzoulio/ReferGPT

[27] RT-DATR:Real-time Unsupervised Domain Adaptive Detection Transformer with Adversarial Feature Learning

Feng Lv,Chunlong Xia,Shuo Wang,Huo Cao

Main category: cs.CV

TLDR: 论文提出了一种实时域自适应检测变换器RT-DATR，通过局部对象级特征对齐和场景语义特征对齐模块提升跨域检测性能。

Details

Motivation: 尽管基于CNN和变换器的域自适应目标检测器在跨域检测任务中取得了进展，但实时变换器检测器的域自适应尚未被探索，现有方法直接应用效果不佳。 Method: 基于RT-DETR，引入局部对象级特征对齐模块和场景语义特征对齐模块，并设计域查询以进一步对齐实例特征分布。 Result: 实验表明，该方法在多个基准测试中优于当前最先进方法。 Conclusion: RT-DATR是一种简单高效的实时域自适应检测变换器，显著提升了跨域检测性能。 Abstract: Despite domain-adaptive object detectors based on CNN and transformers have made significant progress in cross-domain detection tasks, it is regrettable that domain adaptation for real-time transformer-based detectors has not yet been explored. Directly applying existing domain adaptation algorithms has proven to be suboptimal. In this paper, we propose RT-DATR, a simple and efficient real-time domain adaptive detection transformer. Building on RT-DETR as our base detector, we first introduce a local object-level feature alignment module to significantly enhance the feature representation of domain invariance during object transfer. Additionally, we introduce a scene semantic feature alignment module designed to boost cross-domain detection performance by aligning scene semantic features. Finally, we introduced a domain query and decoupled it from the object query to further align the instance feature distribution within the decoder layer, reduce the domain gap, and maintain discriminative ability. Experimental results on various benchmarks demonstrate that our method outperforms current state-of-the-art approaches. Our code will be released soon.

[28] From Visual Explanations to Counterfactual Explanations with Latent Diffusion

Tung Luu,Nam Le,Duc Le,Bac Le

Main category: cs.CV

TLDR: 提出了一种新方法，通过视觉解释算法确定关键区域，并结合对抗攻击和潜在扩散模型生成逼真的反事实解释，解决了现有方法的两大挑战。

Details

Motivation: 解决现有方法中难以确定区分目标类与原始类的关键特征，以及为非鲁棒分类器提供有价值解释的问题。 Method: 通过视觉解释算法识别关键修改区域，结合对抗攻击和潜在扩散模型生成逼真反事实解释。 Result: 在ImageNet和CelebA-HQ数据集上优于现有方法。 Conclusion: 该方法适用于任意分类器，强调了视觉与反事实解释的强关联，并能生成语义有意义的变化。 Abstract: Visual counterfactual explanations are ideal hypothetical images that change the decision-making of the classifier with high confidence toward the desired class while remaining visually plausible and close to the initial image. In this paper, we propose a new approach to tackle two key challenges in recent prominent works: i) determining which specific counterfactual features are crucial for distinguishing the "concept" of the target class from the original class, and ii) supplying valuable explanations for the non-robust classifier without relying on the support of an adversarially robust model. Our method identifies the essential region for modification through algorithms that provide visual explanations, and then our framework generates realistic counterfactual explanations by combining adversarial attacks based on pruning the adversarial gradient of the target classifier and the latent diffusion model. The proposed method outperforms previous state-of-the-art results on various evaluation criteria on ImageNet and CelebA-HQ datasets. In general, our method can be applied to arbitrary classifiers, highlight the strong association between visual and counterfactual explanations, make semantically meaningful changes from the target classifier, and provide observers with subtle counterfactual images.

[29] AerOSeg: Harnessing SAM for Open-Vocabulary Segmentation in Remote Sensing Images

Saikat Dutta,Akhil Vasim,Siddhant Gole,Hamid Rezatofighi,Biplab Banerjee

Main category: cs.CV

TLDR: AerOSeg是一种针对遥感数据的开放词汇分割方法，通过多尺度旋转图像和领域特定提示生成特征，结合SAM模型优化分割效果，显著优于现有方法。

Details

Motivation: 解决遥感图像分割中未见类别的泛化问题，减少对昂贵像素级标注的依赖。 Method: 利用多旋转图像和领域提示生成特征，结合SAM模型进行空间和类别优化，引入语义反向传播模块和多尺度解码器。 Result: 在三个遥感数据集上平均提升2.54 h-mIoU，优于现有方法。 Conclusion: AerOSeg为遥感开放词汇分割提供了高效解决方案，显著提升了性能。 Abstract: Image segmentation beyond predefined categories is a key challenge in remote sensing, where novel and unseen classes often emerge during inference. Open-vocabulary image Segmentation addresses these generalization issues in traditional supervised segmentation models while reducing reliance on extensive per-pixel annotations, which are both expensive and labor-intensive to obtain. Most Open-Vocabulary Segmentation (OVS) methods are designed for natural images but struggle with remote sensing data due to scale variations, orientation changes, and complex scene compositions. This necessitates the development of OVS approaches specifically tailored for remote sensing. In this context, we propose AerOSeg, a novel OVS approach for remote sensing data. First, we compute robust image-text correlation features using multiple rotated versions of the input image and domain-specific prompts. These features are then refined through spatial and class refinement blocks. Inspired by the success of the Segment Anything Model (SAM) in diverse domains, we leverage SAM features to guide the spatial refinement of correlation features. Additionally, we introduce a semantic back-projection module and loss to ensure the seamless propagation of SAM's semantic information throughout the segmentation pipeline. Finally, we enhance the refined correlation features using a multi-scale attention-aware decoder to produce the final segmentation map. We validate our SAM-guided Open-Vocabulary Remote Sensing Segmentation model on three benchmark remote sensing datasets: iSAID, DLRSD, and OpenEarthMap. Our model outperforms state-of-the-art open-vocabulary segmentation methods, achieving an average improvement of 2.54 h-mIoU.

Zhicheng Zhang,Hao Tang,Jinhui Tang

Main category: cs.CV

TLDR: 论文提出了一种名为MDCM的多尺度多样化线索建模框架，用于细粒度鸟类识别（FGBR），通过增强ViT模型的多尺度能力，显著提升了识别性能。

Details

Motivation: 现有ViT模型在FGBR中因有限的感受野和尺度变化敏感性而表现受限，需增强其多尺度能力以提升识别效果。 Method: 提出MDCM框架，包含多尺度线索激活模块、多尺度令牌选择机制和多尺度动态聚合机制，分阶段提取并融合多样化线索。 Result: MDCM在多个FGBR基准测试中优于CNN和ViT模型，验证了其有效性。 Conclusion: MDCM通过多尺度线索建模显著提升了FGBR性能，为ViT模型在多尺度任务中的应用提供了新思路。 Abstract: Given the critical role of birds in ecosystems, Fine-Grained Bird Recognition (FGBR) has gained increasing attention, particularly in distinguishing birds within similar subcategories. Although Vision Transformer (ViT)-based methods often outperform Convolutional Neural Network (CNN)-based methods in FGBR, recent studies reveal that the limited receptive field of plain ViT model hinders representational richness and makes them vulnerable to scale variance. Thus, enhancing the multi-scale capabilities of existing ViT-based models to overcome this bottleneck in FGBR is a worthwhile pursuit. In this paper, we propose a novel framework for FGBR, namely Multi-scale Diverse Cues Modeling (MDCM), which explores diverse cues at different scales across various stages of a multi-scale Vision Transformer (MS-ViT) in an "Activation-Selection-Aggregation" paradigm. Specifically, we first propose a multi-scale cue activation module to ensure the discriminative cues learned at different stage are mutually different. Subsequently, a multi-scale token selection mechanism is proposed to remove redundant noise and highlight discriminative, scale-specific cues at each stage. Finally, the selected tokens from each stage are independently utilized for bird recognition, and the recognition results from multiple stages are adaptively fused through a multi-scale dynamic aggregation mechanism for final model decisions. Both qualitative and quantitative results demonstrate the effectiveness of our proposed MDCM, which outperforms CNN- and ViT-based models on several widely-used FGBR benchmarks.

[31] DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models

Wenjin Ke,Zhe Li,Dong Li,Lu Tian,Emad Barsoum

Main category: cs.CV

TLDR: 提出了一种名为DL-QAT的方法，结合了量化感知训练（QAT）的优势，同时仅训练不到1%的参数，显著提升了低比特量化下的性能。

Details

Motivation: 解决后训练量化（PTQ）在低比特量化时下游任务表现不佳的问题，同时避免QAT的高计算资源需求。 Method: 引入组特定量化幅度调整每组量化范围，并在每组内使用LoRA矩阵更新量化空间中的权重大小和方向。 Result: 在LLaMA和LLaMA2模型上验证，3位LLaMA-7B模型在MMLU任务上比现有方法提升4.2%。 Conclusion: DL-QAT在性能和效率上均优于现有方法，适用于预训练模型的量化。 Abstract: Improving the efficiency of inference in Large Language Models (LLMs) is a critical area of research. Post-training Quantization (PTQ) is a popular technique, but it often faces challenges at low-bit levels, particularly in downstream tasks. Quantization-aware Training (QAT) can alleviate this problem, but it requires significantly more computational resources. To tackle this, we introduced Weight-Decomposed Low-Rank Quantization-Aware Training (DL-QAT), which merges the advantages of QAT while training only less than 1% of the total parameters. Specifically, we introduce a group-specific quantization magnitude to adjust the overall scale of each quantization group. Within each quantization group, we use LoRA matrices to update the weight size and direction in the quantization space. We validated the effectiveness of our method on the LLaMA and LLaMA2 model families. The results show significant improvements over our baseline method across different quantization granularities. For instance, for LLaMA-7B, our approach outperforms the previous state-of-the-art method by 4.2% in MMLU on 3-bit LLaMA-7B model. Additionally, our quantization results on pre-trained models also surpass previous QAT methods, demonstrating the superior performance and efficiency of our approach.

[32] Learning Occlusion-Robust Vision Transformers for Real-Time UAV Tracking

You Wu,Xucheng Wang,Xiangyang Yang,Mengyuan Liu,Dan Zeng,Hengzhou Ye,Shuiwang Li

Main category: cs.CV

TLDR: 提出了一种基于ViT的单流架构ORR方法（ORTrack），通过随机掩码模拟遮挡，增强无人机跟踪的遮挡鲁棒性，并设计了AFKD方法（ORTrack-D）提升实时性。

Details

Motivation: 单流ViT架构在无人机跟踪中潜力大，但缺乏有效处理遮挡的策略，需增强遮挡鲁棒性。 Method: 提出ORR方法，通过空间Cox过程随机掩码模拟遮挡，学习遮挡鲁棒特征；设计AFKD方法，自适应知识蒸馏生成高效学生模型ORTrack-D。 Result: 在多个基准测试中验证了方法的有效性，性能达到SOTA。 Conclusion: ORTrack和ORTrack-D在遮挡鲁棒性和实时性上表现优异，代码已开源。 Abstract: Single-stream architectures using Vision Transformer (ViT) backbones show great potential for real-time UAV tracking recently. However, frequent occlusions from obstacles like buildings and trees expose a major drawback: these models often lack strategies to handle occlusions effectively. New methods are needed to enhance the occlusion resilience of single-stream ViT models in aerial tracking. In this work, we propose to learn Occlusion-Robust Representations (ORR) based on ViTs for UAV tracking by enforcing an invariance of the feature representation of a target with respect to random masking operations modeled by a spatial Cox process. Hopefully, this random masking approximately simulates target occlusions, thereby enabling us to learn ViTs that are robust to target occlusion for UAV tracking. This framework is termed ORTrack. Additionally, to facilitate real-time applications, we propose an Adaptive Feature-Based Knowledge Distillation (AFKD) method to create a more compact tracker, which adaptively mimics the behavior of the teacher model ORTrack according to the task's difficulty. This student model, dubbed ORTrack-D, retains much of ORTrack's performance while offering higher efficiency. Extensive experiments on multiple benchmarks validate the effectiveness of our method, demonstrating its state-of-the-art performance. Codes is available at https://github.com/wuyou3474/ORTrack.

[33] NoTeS-Bank: Benchmarking Neural Transcription and Search for Scientific Notes Understanding

Aniket Pal,Sanket Biswas,Alloy Das,Ayush Lodh,Priyanka Banerjee,Soumitri Chattopadhyay,Dimosthenis Karatzas,Josep Llados,C. V. Jawahar

Main category: cs.CV

TLDR: NoTeS-Bank是一个用于评估手写笔记问答的基准，包含多领域复杂笔记，支持基于证据和开放领域的视觉问答任务，挑战现有模型的视觉-语言融合与推理能力。

Details

Motivation: 解决现有视觉问答基准在真实手写笔记（如数学公式、图表）上的局限性，推动文档AI的发展。 Method: 引入NoTeS-Bank基准，定义两项任务：基于证据的VQA和开放领域VQA，评估模型在非结构化、多模态内容上的表现。 Result: 通过NDCG@5、MRR等指标，揭示了现有视觉-语言模型在转录和推理上的不足。 Conclusion: NoTeS-Bank为视觉文档理解与推理设立了新标准，推动了多模态模型的发展。 Abstract: Understanding and reasoning over academic handwritten notes remains a challenge in document AI, particularly for mathematical equations, diagrams, and scientific notations. Existing visual question answering (VQA) benchmarks focus on printed or structured handwritten text, limiting generalization to real-world note-taking. To address this, we introduce NoTeS-Bank, an evaluation benchmark for Neural Transcription and Search in note-based question answering. NoTeS-Bank comprises complex notes across multiple domains, requiring models to process unstructured and multimodal content. The benchmark defines two tasks: (1) Evidence-Based VQA, where models retrieve localized answers with bounding-box evidence, and (2) Open-Domain VQA, where models classify the domain before retrieving relevant documents and answers. Unlike classical Document VQA datasets relying on optical character recognition (OCR) and structured data, NoTeS-BANK demands vision-language fusion, retrieval, and multimodal reasoning. We benchmark state-of-the-art Vision-Language Models (VLMs) and retrieval frameworks, exposing structured transcription and reasoning limitations. NoTeS-Bank provides a rigorous evaluation with NDCG@5, MRR, Recall@K, IoU, and ANLS, establishing a new standard for visual document understanding and reasoning.

[34] FVQ: A Large-Scale Dataset and A LMM-based Method for Face Video Quality Assessment

Sijing Wu,Yunhao Li,Ziwen Xu,Yixuan Gao,Huiyu Duan,Wei Sun,Guangtao Zhai

Main category: cs.CV

TLDR: 本文提出了首个大规模野外人脸视频质量评估数据集FVQ-20K，并开发了FVQ-Rater方法，利用多模态特征和LoRA技术实现高质量评分。

Details

Motivation: 人脸视频在社交媒体中占主导地位，且人类视觉系统对其敏感，但缺乏大规模数据集，因此需要研究FVQA。 Method: 提取空间、时间及人脸特定特征，结合LoRA指令调优技术，开发FVQ-Rater方法。 Result: FVQ-Rater在FVQ-20K和CFVQA数据集上表现优异。 Conclusion: FVQ-20K数据集和FVQ-Rater方法对推动FVQA发展具有重要潜力。 Abstract: Face video quality assessment (FVQA) deserves to be explored in addition to general video quality assessment (VQA), as face videos are the primary content on social media platforms and human visual system (HVS) is particularly sensitive to human faces. However, FVQA is rarely explored due to the lack of large-scale FVQA datasets. To fill this gap, we present the first large-scale in-the-wild FVQA dataset, FVQ-20K, which contains 20,000 in-the-wild face videos together with corresponding mean opinion score (MOS) annotations. Along with the FVQ-20K dataset, we further propose a specialized FVQA method named FVQ-Rater to achieve human-like rating and scoring for face video, which is the first attempt to explore the potential of large multimodal models (LMMs) for the FVQA task. Concretely, we elaborately extract multi-dimensional features including spatial features, temporal features, and face-specific features (i.e., portrait features and face embeddings) to provide comprehensive visual information, and take advantage of the LoRA-based instruction tuning technique to achieve quality-specific fine-tuning, which shows superior performance on both FVQ-20K and CFVQA datasets. Extensive experiments and comprehensive analysis demonstrate the significant potential of the FVQ-20K dataset and FVQ-Rater method in promoting the development of FVQA.

[35] PathVLM-R1: A Reinforcement Learning-Driven Reasoning Model for Pathology Visual-Language Tasks

Jianyu Wu,Hao Yang,Xinhua Zeng,Guibing He,Zhiyu Chen,Zihui Li,Xiaochuan Zhang,Yangyang Ma,Run Fang,Yang Liu

Main category: cs.CV

TLDR: PathVLM-R1是一种针对病理图像的视觉语言模型，通过监督微调和强化学习优化，显著提升了诊断准确性和推理能力。

Details

Motivation: 解决病理图像诊断中专家资源不足和传统多模态模型推理能力弱的问题。 Method: 基于Qwen2.5-VL-7B-Instruct，通过监督微调和GRPO强化学习优化模型。 Result: 在病理图像问答任务中准确率提升14%，跨模态数据迁移性能平均提升17.3%。 Conclusion: PathVLM-R1在准确性和扩展性上表现优异，具有广泛的应用潜力。 Abstract: The diagnosis of pathological images is often limited by expert availability and regional disparities, highlighting the importance of automated diagnosis using Vision-Language Models (VLMs). Traditional multimodal models typically emphasize outcomes over the reasoning process, compromising the reliability of clinical decisions. To address the weak reasoning abilities and lack of supervised processes in pathological VLMs, we have innovatively proposed PathVLM-R1, a visual language model designed specifically for pathological images. We have based our model on Qwen2.5-VL-7B-Instruct and enhanced its performance for pathological tasks through meticulously designed post-training strategies. Firstly, we conduct supervised fine-tuning guided by pathological data to imbue the model with foundational pathological knowledge, forming a new pathological base model. Subsequently, we introduce Group Relative Policy Optimization (GRPO) and propose a dual reward-driven reinforcement learning optimization, ensuring strict constraint on logical supervision of the reasoning process and accuracy of results via cross-modal process reward and outcome accuracy reward. In the pathological image question-answering tasks, the testing results of PathVLM-R1 demonstrate a 14% improvement in accuracy compared to baseline methods, and it demonstrated superior performance compared to the Qwen2.5-VL-32B version despite having a significantly smaller parameter size. Furthermore, in out-domain data evaluation involving four medical imaging modalities: Computed Tomography (CT), dermoscopy, fundus photography, and Optical Coherence Tomography (OCT) images: PathVLM-R1's transfer performance improved by an average of 17.3% compared to traditional SFT methods. These results clearly indicate that PathVLM-R1 not only enhances accuracy but also possesses broad applicability and expansion potential.

[36] Head-Aware KV Cache Compression for Efficient Visual Autoregressive Modeling

Ziran Qin,Youru Lv,Mingbao Lin,Zeren Zhang,Danping Zou,Weiyao Lin

Main category: cs.CV

TLDR: HACK提出了一种针对VAR模型的KV缓存压缩方法，通过区分结构性头和上下文头，采用不对称缓存预算和特定模式压缩策略，显著减少内存使用。

Details

Motivation: VAR模型在推理过程中因KV缓存积累导致内存瓶颈，现有压缩技术因未区分两种注意力头类型而效果不佳。 Method: 提出HACK方法，为结构性头和上下文头分配不对称缓存预算，并采用模式特定的压缩策略。 Result: 在VAR-d30和Infinity-8B上，HACK分别减少50%和70%的缓存，性能损失极小；即使压缩70%和90%，仍保持高质量生成。 Conclusion: HACK通过针对性压缩策略，有效解决了VAR模型的内存瓶颈问题，同时保持了生成质量。 Abstract: Visual Autoregressive (VAR) models have emerged as a powerful approach for multi-modal content creation, offering high efficiency and quality across diverse multimedia applications. However, they face significant memory bottlenecks due to extensive KV cache accumulation during inference. Existing KV cache compression techniques for large language models are suboptimal for VAR models due to, as we identify in this paper, two distinct categories of attention heads in VAR models: Structural Heads, which preserve spatial coherence through diagonal attention patterns, and Contextual Heads, which maintain semantic consistency through vertical attention patterns. These differences render single-strategy KV compression techniques ineffective for VAR models. To address this, we propose HACK, a training-free Head-Aware Compression method for KV cache. HACK allocates asymmetric cache budgets and employs pattern-specific compression strategies tailored to the essential characteristics of each head category. Experiments on Infinity-2B, Infinity-8B, and VAR-d30 demonstrate its effectiveness in text-to-image and class-conditional generation tasks. HACK can hack down up to 50\% and 70\% of cache with minimal performance degradation for VAR-d30 and Infinity-8B, respectively. Even with 70\% and 90\% KV cache compression in VAR-d30 and Infinity-8B, HACK still maintains high-quality generation while reducing memory usage by 44.2\% and 58.9\%, respectively.

[37] VideoAds for Fast-Paced Video Understanding: Where Opensource Foundation Models Beat GPT-4o & Gemini-1.5 Pro

Zheyuan Zhang,Monica Dou,Linkai Peng,Hongyi Pan,Ulas Bagci,Boqing Gong

Main category: cs.CV

TLDR: 论文介绍了VideoAds数据集，用于评估多模态大语言模型（MLLMs）在广告视频上的表现，发现开源模型Qwen2.5-VL-72B优于GPT-4o和Gemini-1.5 Pro，但人类专家表现更优。

Details

Motivation: 广告视频因其复杂的叙事结构和快速场景切换，对MLLMs提出了挑战，需要专门的数据集来评估其性能。 Method: 提出VideoAds数据集，包含手动标注的多样化问题，涵盖视觉查找、视频摘要和视觉推理任务，并设计定量指标衡量视频复杂度。 Result: Qwen2.5-VL-72B在VideoAds上准确率为73.35%，优于GPT-4o（66.82%）和Gemini-1.5 Pro（69.66%），但人类专家达到94.27%。 Conclusion: VideoAds可作为未来视频理解研究的关键基准，需提升MLLMs的时序建模能力。 Abstract: Advertisement videos serve as a rich and valuable source of purpose-driven information, encompassing high-quality visual, textual, and contextual cues designed to engage viewers. They are often more complex than general videos of similar duration due to their structured narratives and rapid scene transitions, posing significant challenges to multi-modal large language models (MLLMs). In this work, we introduce VideoAds, the first dataset tailored for benchmarking the performance of MLLMs on advertisement videos. VideoAds comprises well-curated advertisement videos with complex temporal structures, accompanied by \textbf{manually} annotated diverse questions across three core tasks: visual finding, video summary, and visual reasoning. We propose a quantitative measure to compare VideoAds against existing benchmarks in terms of video complexity. Through extensive experiments, we find that Qwen2.5-VL-72B, an opensource MLLM, achieves 73.35\% accuracy on VideoAds, outperforming GPT-4o (66.82\%) and Gemini-1.5 Pro (69.66\%); the two proprietary models especially fall behind the opensource model in video summarization and reasoning, but perform the best in visual finding. Notably, human experts easily achieve a remarkable accuracy of 94.27\%. These results underscore the necessity of advancing MLLMs' temporal modeling capabilities and highlight VideoAds as a potentially pivotal benchmark for future research in understanding video that requires high FPS sampling. The dataset and evaluation code will be publicly available at https://videoadsbenchmark.netlify.app.

[38] Towards Explainable Partial-AIGC Image Quality Assessment

Jiaying Qian,Ziheng Jia,Zicheng Zhang,Zeyu Zhang,Guangtao Zhai,Xiongkuo Min

Main category: cs.CV

TLDR: 该论文提出了一种针对局部AI生成内容（PAI）的图像质量评估方法，构建了首个大规模数据集EPAIQA-15K，并开发了具有解释性反馈能力的EPAIQA系列模型。

Details

Motivation: 现有研究主要关注完全由AI生成的图像，而局部AI编辑图像的质量评估几乎未被探索。 Method: 构建EPAIQA-15K数据集，利用大型多模态模型（LMM）分三阶段训练：编辑区域定位、定量质量评分和质量解释。 Result: 开发了具有解释性反馈能力的EPAIQA系列模型，填补了局部AIGC图像质量评估的空白。 Conclusion: 该研究为局部AI编辑图像的质量评估提供了开创性解决方案。 Abstract: The rapid advancement of AI-driven visual generation technologies has catalyzed significant breakthroughs in image manipulation, particularly in achieving photorealistic localized editing effects on natural scene images (NSIs). Despite extensive research on image quality assessment (IQA) for AI-generated images (AGIs), most studies focus on fully AI-generated outputs (e.g., text-to-image generation), leaving the quality assessment of partial-AIGC images (PAIs)-images with localized AI-driven edits an almost unprecedented field. Motivated by this gap, we construct the first large-scale PAI dataset towards explainable partial-AIGC image quality assessment (EPAIQA), the EPAIQA-15K, which includes 15K images with localized AI manipulation in different regions and over 300K multi-dimensional human ratings. Based on this, we leverage large multi-modal models (LMMs) and propose a three-stage model training paradigm. This paradigm progressively trains the LMM for editing region grounding, quantitative quality scoring, and quality explanation. Finally, we develop the EPAIQA series models, which possess explainable quality feedback capabilities. Our work represents a pioneering effort in the perceptual IQA field for comprehensive PAI quality assessment.

[39] Cycle Training with Semi-Supervised Domain Adaptation: Bridging Accuracy and Efficiency for Real-Time Mobile Scene Detection

Huu-Phong Phan-Nguyen,Anh Dao,Tien-Huy Nguyen,Tuan Quang,Huu-Loc Tran,Tinh-Anh Nguyen-Nhu,Huy-Thach Pham,Quan Nguyen,Hoang M. Le,Quang-Vinh Dinh

Main category: cs.CV

TLDR: 提出了一种名为Cycle Training的新型训练框架，结合三阶段训练和半监督域适应（SSDA），在移动设备上实现高效准确的图像分类。

Details

Motivation: 智能手机普及但资源有限，如何在移动设备上平衡深度学习模型的准确性和计算效率是一个重要挑战。 Method: 采用三阶段的Cycle Training框架，交替探索和稳定阶段，并结合SSDA利用大模型和未标记数据扩展训练集。 Result: 在CamSSD数据集上，Top-1准确率达94.00%，Top-3达99.17%，CPU推理时间仅1.61ms。 Conclusion: 该方法在移动设备上实现了高效准确的图像分类，适合实际部署。 Abstract: Nowadays, smartphones are ubiquitous, and almost everyone owns one. At the same time, the rapid development of AI has spurred extensive research on applying deep learning techniques to image classification. However, due to the limited resources available on mobile devices, significant challenges remain in balancing accuracy with computational efficiency. In this paper, we propose a novel training framework called Cycle Training, which adopts a three-stage training process that alternates between exploration and stabilization phases to optimize model performance. Additionally, we incorporate Semi-Supervised Domain Adaptation (SSDA) to leverage the power of large models and unlabeled data, thereby effectively expanding the training dataset. Comprehensive experiments on the CamSSD dataset for mobile scene detection demonstrate that our framework not only significantly improves classification accuracy but also ensures real-time inference efficiency. Specifically, our method achieves a 94.00% in Top-1 accuracy and a 99.17% in Top-3 accuracy and runs inference in just 1.61ms using CPU, demonstrating its suitability for real-world mobile deployment.

[40] A Lightweight Moment Retrieval System with Global Re-Ranking and Robust Adaptive Bidirectional Temporal Search

Tinh-Anh Nguyen-Nhu,Huu-Loc Tran,Nguyen-Khang Le,Minh-Nhat Nguyen,Tien-Huy Nguyen,Hoang-Long Nguyen-Huu,Huu-Phong Phan-Nguyen,Huy-Thach Pham,Quan Nguyen,Hoang M. Le,Quang-Vinh Dinh

Main category: cs.CV

TLDR: 提出了一种新的交互式视频语料库时刻检索框架，结合SuperGlobal重排序和自适应双向时间搜索，优化查询相似性、时间稳定性和计算资源。

Details

Motivation: 解决现有视频检索方法在计算效率、时间上下文限制和视频内容复杂性方面的不足。 Method: 采用关键帧提取模型和图像哈希去重技术预处理视频语料库，结合SuperGlobal重排序和自适应双向时间搜索。 Result: 显著减少存储需求，同时在多样化视频库中保持高定位精度。 Conclusion: 该框架为大规模视频语料库的高效时刻检索提供了可扩展的解决方案。 Abstract: The exponential growth of digital video content has posed critical challenges in moment-level video retrieval, where existing methodologies struggle to efficiently localize specific segments within an expansive video corpus. Current retrieval systems are constrained by computational inefficiencies, temporal context limitations, and the intrinsic complexity of navigating video content. In this paper, we address these limitations through a novel Interactive Video Corpus Moment Retrieval framework that integrates a SuperGlobal Reranking mechanism and Adaptive Bidirectional Temporal Search (ABTS), strategically optimizing query similarity, temporal stability, and computational resources. By preprocessing a large corpus of videos using a keyframe extraction model and deduplication technique through image hashing, our approach provides a scalable solution that significantly reduces storage requirements while maintaining high localization precision across diverse video repositories.

[41] MedIL: Implicit Latent Spaces for Generating Heterogeneous Medical Images at Arbitrary Resolutions

Tyler Spears,Shen Zhu,Yinzhu Jin,Aman Shrivastava,P. Thomas Fletcher

Main category: cs.CV

TLDR: MedIL是一种新型自编码器，用于处理不同尺寸和分辨率的医学图像生成，解决了传统方法因固定尺寸和重采样丢失细节的问题。

Details

Motivation: 医学图像尺寸和分辨率差异大，且细节对临床至关重要，现有方法（如LDMs）因固定尺寸和重采样丢失细节，无法满足需求。 Method: MedIL利用隐式神经表示将图像视为连续信号，支持任意分辨率的编码和解码，无需预先重采样。 Result: 实验证明MedIL在多站点、多分辨率数据集（T1w脑MRI和肺部CT）上能压缩并保留临床相关特征，并提升扩散模型生成图像的质量。 Conclusion: MedIL能增强生成模型，使其更接近原始临床采集图像，为医学图像生成提供了新思路。 Abstract: In this work, we introduce MedIL, a first-of-its-kind autoencoder built for encoding medical images with heterogeneous sizes and resolutions for image generation. Medical images are often large and heterogeneous, where fine details are of vital clinical importance. Image properties change drastically when considering acquisition equipment, patient demographics, and pathology, making realistic medical image generation challenging. Recent work in latent diffusion models (LDMs) has shown success in generating images resampled to a fixed-size. However, this is a narrow subset of the resolutions native to image acquisition, and resampling discards fine anatomical details. MedIL utilizes implicit neural representations to treat images as continuous signals, where encoding and decoding can be performed at arbitrary resolutions without prior resampling. We quantitatively and qualitatively show how MedIL compresses and preserves clinically-relevant features over large multi-site, multi-resolution datasets of both T1w brain MRIs and lung CTs. We further demonstrate how MedIL can influence the quality of images generated with a diffusion model, and discuss how MedIL can enhance generative models to resemble raw clinical acquisitions.

[42] Infused Suppression Of Magnification Artefacts For Micro-AU Detection

Huai-Qian Khor,Yante Li,Xingxun Jiang,Guoying Zhao

Main category: cs.CV

TLDR: 论文提出InfuseNet框架，通过层间特征融合和运动上下文约束，解决微表情分析中运动放大带来的伪影问题，提升AU检测性能。

Details

Motivation: 微表情分析中，运动放大技术虽能增强运动幅度，但会引入光照变化和投影误差导致的伪影，影响模型学习真实运动特征。 Method: 提出InfuseNet框架，利用运动上下文约束AU学习区域，并采用放大潜在特征而非重建样本以减少伪影。 Result: InfuseNet在CD6ME协议中超越现有最佳结果，定量研究验证了伪影缓解的有效性。 Conclusion: InfuseNet通过减少运动放大伪影，显著提升了微表情中AU检测的准确性。 Abstract: Facial micro-expressions are spontaneous, brief and subtle facial motions that unveil the underlying, suppressed emotions. Detecting Action Units (AUs) in micro-expressions is crucial because it yields a finer representation of facial motions than categorical emotions, effectively resolving the ambiguity among different expressions. One of the difficulties in micro-expression analysis is that facial motions are subtle and brief, thereby increasing the difficulty in correlating facial motion features to AU occurrence. To bridge the subtlety issue, flow-related features and motion magnification are a few common approaches as they can yield descriptive motion changes and increased motion amplitude respectively. While motion magnification can amplify the motion changes, it also accounts for illumination changes and projection errors during the amplification process, thereby creating motion artefacts that confuse the model to learn inauthentic magnified motion features. The problem is further aggravated in the context of a more complicated task where more AU classes are analyzed in cross-database settings. To address this issue, we propose InfuseNet, a layer-wise unitary feature infusion framework that leverages motion context to constrain the Action Unit (AU) learning within an informative facial movement region, thereby alleviating the influence of magnification artefacts. On top of that, we propose leveraging magnified latent features instead of reconstructing magnified samples to limit the distortion and artefacts caused by the projection inaccuracy in the motion reconstruction process. Via alleviating the magnification artefacts, InfuseNet has surpassed the state-of-the-art results in the CD6ME protocol. Further quantitative studies have also demonstrated the efficacy of motion artefacts alleviation.

[43] Text To 3D Object Generation For Scalable Room Assembly

Sonia Laguna,Alberto Garcia-Garcia,Marie-Julie Rakotosaona,Stylianos Moschoglou,Leonhard Helminger,Sergio Orts-Escolano

Main category: cs.CV

TLDR: 提出了一种端到端的合成数据生成系统，用于解决场景理解任务中的数据稀缺问题，通过文本生成高保真3D对象并集成到预定义场景中。

Details

Motivation: 现代场景理解模型依赖高质量数据集，但现实中数据稀缺且人工制作成本高，需要一种可扩展的合成数据生成方法。 Method: 结合文本到图像和多视角扩散模型与NeRF网格化，通过新损失函数和训练策略生成可定制3D场景。 Result: 系统能够按需生成高质量3D场景，缓解数据稀缺问题。 Conclusion: 该系统提升了合成数据在机器学习训练中的作用，支持更鲁棒和通用的模型开发。 Abstract: Modern machine learning models for scene understanding, such as depth estimation and object tracking, rely on large, high-quality datasets that mimic real-world deployment scenarios. To address data scarcity, we propose an end-to-end system for synthetic data generation for scalable, high-quality, and customizable 3D indoor scenes. By integrating and adapting text-to-image and multi-view diffusion models with Neural Radiance Field-based meshing, this system generates highfidelity 3D object assets from text prompts and incorporates them into pre-defined floor plans using a rendering tool. By introducing novel loss functions and training strategies into existing methods, the system supports on-demand scene generation, aiming to alleviate the scarcity of current available data, generally manually crafted by artists. This system advances the role of synthetic data in addressing machine learning training limitations, enabling more robust and generalizable models for real-world applications.

[44] REMEMBER: Retrieval-based Explainable Multimodal Evidence-guided Modeling for Brain Evaluation and Reasoning in Zero- and Few-shot Neurodegenerative Diagnosis

Duy-Cat Can,Quang-Huy Tang,Huong Ha,Binh T. Nguyen,Oliver Y. Chén

Main category: cs.CV

TLDR: REMEMBER是一种基于检索的多模态机器学习框架，用于阿尔茨海默病的零样本和少样本诊断，通过参考数据和上下文推理提高可解释性。

Details

Motivation: 现有的深度学习模型需要大规模标注数据且缺乏可解释性，而临床数据通常有限或无标注，限制了深度学习的应用。 Method: REMEMBER通过对比对齐的视觉-文本模型训练，结合伪文本模态和检索机制，模仿临床决策过程。 Result: 实验表明，REMEMBER在零样本和少样本任务中表现稳健，并提供可解释的诊断报告。 Conclusion: REMEMBER为神经影像诊断提供了一种高效且可解释的解决方案，尤其适用于数据有限的情况。 Abstract: Timely and accurate diagnosis of neurodegenerative disorders, such as Alzheimer's disease, is central to disease management. Existing deep learning models require large-scale annotated datasets and often function as "black boxes". Additionally, datasets in clinical practice are frequently small or unlabeled, restricting the full potential of deep learning methods. Here, we introduce REMEMBER -- Retrieval-based Explainable Multimodal Evidence-guided Modeling for Brain Evaluation and Reasoning -- a new machine learning framework that facilitates zero- and few-shot Alzheimer's diagnosis using brain MRI scans through a reference-based reasoning process. Specifically, REMEMBER first trains a contrastively aligned vision-text model using expert-annotated reference data and extends pseudo-text modalities that encode abnormality types, diagnosis labels, and composite clinical descriptions. Then, at inference time, REMEMBER retrieves similar, human-validated cases from a curated dataset and integrates their contextual information through a dedicated evidence encoding module and attention-based inference head. Such an evidence-guided design enables REMEMBER to imitate real-world clinical decision-making process by grounding predictions in retrieved imaging and textual context. Specifically, REMEMBER outputs diagnostic predictions alongside an interpretable report, including reference images and explanations aligned with clinical workflows. Experimental results demonstrate that REMEMBER achieves robust zero- and few-shot performance and offers a powerful and explainable framework to neuroimaging-based diagnosis in the real world, especially under limited data.

[45] PapMOT: Exploring Adversarial Patch Attack against Multiple Object Tracking

Jiahuan Long,Tingsong Jiang,Wen Yao,Shuai Jia,Weijia Zhang,Weien Zhou,Chao Ma,Xiaoqian Chen

Main category: cs.CV

TLDR: PapMOT是一种针对多目标跟踪（MOT）的物理对抗攻击方法，通过生成可打印的对抗补丁，攻击检测和身份关联过程，并在真实场景中验证其有效性。

Details

Motivation: 现有MOT方法对数字攻击（如像素级噪声注入）存在漏洞，但缺乏针对物理场景的攻击方法。PapMOT旨在填补这一空白。 Method: PapMOT生成可打印的对抗补丁，攻击检测机制并误导身份关联，同时引入补丁增强策略以降低跟踪结果的时序一致性。 Result: PapMOT在多个数据集上成功攻击了多种MOT跟踪器，并在真实场景中验证了其物理攻击的有效性。 Conclusion: PapMOT为MOT系统在数字和物理场景中的对抗攻击提供了新的研究方向和评估标准。 Abstract: Tracking multiple objects in a continuous video stream is crucial for many computer vision tasks. It involves detecting and associating objects with their respective identities across successive frames. Despite significant progress made in multiple object tracking (MOT), recent studies have revealed the vulnerability of existing MOT methods to adversarial attacks. Nevertheless, all of these attacks belong to digital attacks that inject pixel-level noise into input images, and are therefore ineffective in physical scenarios. To fill this gap, we propose PapMOT, which can generate physical adversarial patches against MOT for both digital and physical scenarios. Besides attacking the detection mechanism, PapMOT also optimizes a printable patch that can be detected as new targets to mislead the identity association process. Moreover, we introduce a patch enhancement strategy to further degrade the temporal consistency of tracking results across video frames, resulting in more aggressive attacks. We further develop new evaluation metrics to assess the robustness of MOT against such attacks. Extensive evaluations on multiple datasets demonstrate that our PapMOT can successfully attack various architectures of MOT trackers in digital scenarios. We also validate the effectiveness of PapMOT for physical attacks by deploying printed adversarial patches in the real world.

[46] Beyond Degradation Conditions: All-in-One Image Restoration via HOG Transformers

Jiawei Wu,Zhifei Yang,Zhe Wang,Zhi Jin

Main category: cs.CV

TLDR: HOGformer是一种基于HOG描述符的全能图像修复框架，通过动态自注意力机制和局部动态范围卷积模块，显著提升了复杂场景下的修复性能。

Details

Motivation: 现有方法依赖预测和整合退化条件，容易在复杂场景中误激活退化特定特征，限制了修复性能。 Method: 提出HOGformer框架，利用HOG描述符的退化判别能力，结合动态自注意力机制和HOG引导的局部动态范围卷积模块，增强退化敏感性。 Result: 在多种基准测试中，HOGformer实现了最先进的性能，并能有效泛化到复杂的真实世界退化场景。 Conclusion: HOGformer通过HOG引导的动态机制，显著提升了全能图像修复的性能和泛化能力。 Abstract: All-in-one image restoration, which aims to address diverse degradations within a unified framework, is critical for practical applications. However, existing methods rely on predicting and integrating degradation conditions, which can misactivate degradation-specific features in complex scenarios, limiting their restoration performance. To address this issue, we propose a novel all-in-one image restoration framework guided by Histograms of Oriented Gradients (HOG), named HOGformer. By leveraging the degradation-discriminative capability of HOG descriptors, HOGformer employs a dynamic self-attention mechanism that adaptively attends to long-range spatial dependencies based on degradation-aware HOG cues. To enhance the degradation sensitivity of attention inputs, we design a HOG-guided local dynamic-range convolution module that captures long-range degradation similarities while maintaining awareness of global structural information. Furthermore, we propose a dynamic interaction feed-forward module, efficiently increasing the model capacity to adapt to different degradations through channel-spatial interactions. Extensive experiments across diverse benchmarks, including adverse weather and natural degradations, demonstrate that HOGformer achieves state-of-the-art performance and generalizes effectively to complex real-world degradations. Code is available at https://github.com/Fire-friend/HOGformer.

[47] Low-Light Image Enhancement using Event-Based Illumination Estimation

Lei Sun,Yuhan Bao,Jiajun Zhai,Jingyun Liang,Yulun Zhang,Kaiwei Wang,Danda Pani Paudel,Luc Van Gool

Main category: cs.CV

TLDR: 论文提出了一种基于事件相机的低光图像增强方法，通过利用时间映射事件估计光照，显著提升了图像质量。

Details

Motivation: 现有方法主要依赖运动事件增强边缘纹理，而忽略了事件相机的高动态范围和低光响应能力。本文探索了时间映射事件在光照估计中的应用。 Method: 提出了一种光照辅助的反射增强模块，利用时间映射事件生成精细的光照线索，并研究了低光条件下时间映射事件的退化模型以合成训练数据。 Result: 在5个合成数据集和真实数据集EvLowLight上，RetinEV方法在图像质量和动态范围上优于现有方法，最高提升6.62 dB，推理速度为35.6 FPS。 Conclusion: RetinEV通过时间映射事件有效提升了低光图像增强的性能，为事件相机在低光环境中的应用提供了新思路。 Abstract: Low-light image enhancement (LLIE) aims to improve the visibility of images captured in poorly lit environments. Prevalent event-based solutions primarily utilize events triggered by motion, i.e., ''motion events'' to strengthen only the edge texture, while leaving the high dynamic range and excellent low-light responsiveness of event cameras largely unexplored. This paper instead opens a new avenue from the perspective of estimating the illumination using ''temporal-mapping'' events, i.e., by converting the timestamps of events triggered by a transmittance modulation into brightness values. The resulting fine-grained illumination cues facilitate a more effective decomposition and enhancement of the reflectance component in low-light images through the proposed Illumination-aided Reflectance Enhancement module. Furthermore, the degradation model of temporal-mapping events under low-light condition is investigated for realistic training data synthesizing. To address the lack of datasets under this regime, we construct a beam-splitter setup and collect EvLowLight dataset that includes images, temporal-mapping events, and motion events. Extensive experiments across 5 synthetic datasets and our real-world EvLowLight dataset substantiate that the devised pipeline, dubbed RetinEV, excels in producing well-illuminated, high dynamic range images, outperforming previous state-of-the-art event-based methods by up to 6.62 dB, while maintaining an efficient inference speed of 35.6 frame-per-second on a 640X480 image.

[48] Contour Flow Constraint: Preserving Global Shape Similarity for Deep Learning based Image Segmentation

Shengzhe Chen,Zhaoxuan Dong,Jun Liu

Main category: cs.CV

TLDR: 论文提出了一种基于轮廓流（CF）的全局形状相似性约束，并将其融入深度学习框架，显著提升了图像分割的准确性和形状相似性。

Details

Motivation: 现有方法主要关注特定属性或形状的先验知识，缺乏从轮廓流角度考虑全局形状相似性，且未探索如何将其自然融入深度卷积网络的激活函数中。 Method: 提出基于轮廓流的全局形状相似性概念，并数学推导出约束条件；通过形状损失和变分分割模型两种方式将其融入深度学习框架。 Result: 实验表明，提出的形状损失显著提升了分割准确性和形状相似性，CFSSnet在噪声图像分割中表现鲁棒且能保持全局形状相似性。 Conclusion: 提出的轮廓流约束和CFSSnet为图像分割提供了新的有效工具，具有广泛的适应性和鲁棒性。 Abstract: For effective image segmentation, it is crucial to employ constraints informed by prior knowledge about the characteristics of the areas to be segmented to yield favorable segmentation outcomes. However, the existing methods have primarily focused on priors of specific properties or shapes, lacking consideration of the general global shape similarity from a Contour Flow (CF) perspective. Furthermore, naturally integrating this contour flow prior image segmentation model into the activation functions of deep convolutional networks through mathematical methods is currently unexplored. In this paper, we establish a concept of global shape similarity based on the premise that two shapes exhibit comparable contours. Furthermore, we mathematically derive a contour flow constraint that ensures the preservation of global shape similarity. We propose two implementations to integrate the constraint with deep neural networks. Firstly, the constraint is converted to a shape loss, which can be seamlessly incorporated into the training phase for any learning-based segmentation framework. Secondly, we add the constraint into a variational segmentation model and derive its iterative schemes for solution. The scheme is then unrolled to get the architecture of the proposed CFSSnet. Validation experiments on diverse datasets are conducted on classic benchmark deep network segmentation models. The results indicate a great improvement in segmentation accuracy and shape similarity for the proposed shape loss, showcasing the general adaptability of the proposed loss term regardless of specific network architectures. CFSSnet shows robustness in segmenting noise-contaminated images, and inherent capability to preserve global shape similarity.

[49] Vision Transformers Exhibit Human-Like Biases: Evidence of Orientation and Color Selectivity, Categorical Perception, and Phase Transitions

Nooshin Bahador

Main category: cs.CV

TLDR: ViTs表现出与人类大脑相似的取向和颜色偏差，通过合成数据集分析发现其行为与人类感知类别一致，并观察到相变现象和注意力头的专业化能力。

Details

Motivation: 探索ViTs是否表现出与人类大脑相似的取向和颜色偏差，并分析其行为背后的机制。 Method: 使用合成数据集控制噪声、角度、长度、宽度和颜色变化，通过LoRA微调ViTs并分析其行为。 Result: ViTs表现出斜向效应、颜色相关的角度预测误差、与人类感知一致的颜色聚类、相变现象和注意力头的专业化能力。 Conclusion: ViTs的偏差和特性主要源于预训练数据集和架构约束，而非下游任务的数据统计。 Abstract: This study explored whether Vision Transformers (ViTs) developed orientation and color biases similar to those observed in the human brain. Using synthetic datasets with controlled variations in noise levels, angles, lengths, widths, and colors, we analyzed the behavior of ViTs fine-tuned with LoRA. Our findings revealed four key insights: First, ViTs exhibited an oblique effect showing the lowest angle prediction errors at 180 deg (horizontal) across all conditions. Second, angle prediction errors varied by color. Errors were highest for bluish hues and lowest for yellowish ones. Additionally, clustering analysis of angle prediction errors showed that ViTs grouped colors in a way that aligned with human perceptual categories. In addition to orientation and color biases, we observed phase transition phenomena. While two phase transitions occurred consistently across all conditions, the training loss curves exhibited delayed transitions when color was incorporated as an additional data attribute. Finally, we observed that attention heads in certain layers inherently develop specialized capabilities, functioning as task-agnostic feature extractors regardless of the downstream task. These observations suggest that biases and properties arise primarily from pre-training on the original dataset which shapes the model's foundational representations and the inherent architectural constraints of the vision transformer, rather than being solely determined by downstream data statistics.

[50] Comparing Performance of Preprocessing Techniques for Traffic Sign Recognition Using a HOG-SVM

Luis Vieira

Main category: cs.CV

TLDR: 比较了多种预处理技术对交通标志识别的性能影响，发现YUV显著提升了HOG-SVM分类器的准确率。

Details

Motivation: 研究不同预处理技术对交通标志识别性能的影响，以优化预处理流程。 Method: 使用HOG和SVM在GTSRB数据集上评估CLAHE、HUE和YUV等预处理技术。 Result: YUV预处理将分类准确率从89.65%提升至91.25%。 Conclusion: YUV预处理对提升交通标志识别性能具有显著效果，为预处理流程优化提供了参考。 Abstract: This study compares the performance of various preprocessing techniques for Traffic Sign Recognition (TSR) using Histogram of Oriented Gradients (HOG) and Support Vector Machine (SVM) on the German Traffic Sign Recognition Benchmark (GTSRB) dataset. Techniques such as CLAHE, HUE, and YUV were evaluated for their impact on classification accuracy. Results indicate that YUV in particular significantly enhance the performance of the HOG-SVM classifier (improving accuracy from 89.65% to 91.25%), providing insights into improvements for preprocessing pipeline of TSR applications.

[51] BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning

Shengao Wang,Arjun Chandra,Aoming Liu,Venkatesh Saligrama,Boqing Gong

Main category: cs.CV

TLDR: BabyVLM是一个结合婴儿发展启发的视觉语言模型框架，通过合成数据集和全面评估基准，提升模型性能和数据效率。

Details

Motivation: 现有评估基准与婴儿学习方式不匹配，且婴儿数据训练忽略了多样性输入，需改进。 Method: 提出BabyVLM框架，包括合成训练数据集和全面评估基准，结合婴儿学习特点。 Result: BabyVLM在任务上表现优于仅用SAYCam或通用数据训练的模型。 Conclusion: BabyVLM为数据高效的视觉语言学习提供了新方向，展示了小模型在精选数据上的潜力。 Abstract: Human infants rapidly develop visual reasoning skills from minimal input, suggesting that developmentally inspired pretraining could significantly enhance the efficiency of vision-language models (VLMs). Although recent efforts have leveraged infant-inspired datasets like SAYCam, existing evaluation benchmarks remain misaligned--they are either too simplistic, narrowly scoped, or tailored for large-scale pretrained models. Additionally, training exclusively on infant data overlooks the broader, diverse input from which infants naturally learn. To address these limitations, we propose BabyVLM, a novel framework comprising comprehensive in-domain evaluation benchmarks and a synthetic training dataset created via child-directed transformations of existing datasets. We demonstrate that VLMs trained with our synthetic dataset achieve superior performance on BabyVLM tasks compared to models trained solely on SAYCam or general-purpose data of the SAYCam size. BabyVLM thus provides a robust, developmentally aligned evaluation tool and illustrates how compact models trained on carefully curated data can generalize effectively, opening pathways toward data-efficient vision-language learning paradigms.

[52] Structure-Accurate Medical Image Translation based on Dynamic Frequency Balance and Knowledge Guidance

Jiahua Xu,Dawei Zhou,Lei Hu,Zaiyi Liu,Nannan Wang,Xinbo Gao

Main category: cs.CV

TLDR: 提出了一种基于动态频率平衡和知识引导的新方法，用于解决多模态医学图像合成中的解剖结构失真问题。

Details

Motivation: 多模态医学图像在临床诊断中至关重要，但现有方法因高频信息过拟合和低频信息弱化导致解剖结构失真。 Method: 通过小波变换分解关键特征，设计动态频率平衡模块自适应调整频率，并结合视觉语言模型的先验知识引导机制。 Result: 在多数据集上的实验表明，该方法在定性和定量评估上均有显著提升。 Conclusion: 该方法有效解决了医学图像合成中的解剖结构失真问题，具有优越性。 Abstract: Multimodal medical images play a crucial role in the precise and comprehensive clinical diagnosis. Diffusion model is a powerful strategy to synthesize the required medical images. However, existing approaches still suffer from the problem of anatomical structure distortion due to the overfitting of high-frequency information and the weakening of low-frequency information. Thus, we propose a novel method based on dynamic frequency balance and knowledge guidance. Specifically, we first extract the low-frequency and high-frequency components by decomposing the critical features of the model using wavelet transform. Then, a dynamic frequency balance module is designed to adaptively adjust frequency for enhancing global low-frequency features and effective high-frequency details as well as suppressing high-frequency noise. To further overcome the challenges posed by the large differences between different medical modalities, we construct a knowledge-guided mechanism that fuses the prior clinical knowledge from a visual language model with visual features, to facilitate the generation of accurate anatomical structures. Experimental evaluations on multiple datasets show the proposed method achieves significant improvements in qualitative and quantitative assessments, verifying its effectiveness and superiority.

[53] Sparse Deformable Mamba for Hyperspectral Image Classification

Lincoln Linlin Xu,Yimin Zhu,Zack Dewis,Zhengsen Xu,Motasem Alkayid,Mabel Heffring,Saeid Taleghanidoozdoozan

Main category: cs.CV

TLDR: 本文提出了一种稀疏可变形Mamba（SDMamba）方法，用于提升高光谱图像分类性能，通过自适应学习最优序列和设计专门的空间与光谱模块，实现了更高的精度和效率。

Details

Motivation: 现有Mamba模型在高光谱图像分类中构建序列的效率与效果不足，需要改进。 Method: 设计了稀疏可变形序列（SDS）方法，并开发了空间模块（SDSpaM）和光谱模块（SDSpeM），最后通过注意力机制融合特征。 Result: 在多个基准数据集上测试，SDMamba表现出更高的分类精度、更快的速度和更好的细节保留能力。 Conclusion: SDMamba通过优化序列构建和特征融合，显著提升了高光谱图像分类的性能。 Abstract: Although the recent Mamba models significantly improve hyperspectral image (HSI) classification, one critical challenge is caused by the difficulty to build the Mamba sequence efficiently and effectively. This paper presents a Sparse Deformable Mamba (SDMamba) approach for enhanced HSI classification, with the following contributions. First, to enhance Mamba sequence, an efficient Sparse Deformable Sequencing (SDS) approach is designed to adaptively learn the "optimal" sequence, leading to sparse and deformable Mamba sequence with increased detail preservation and decreased computations. Second, to boost spatial-spectral feature learning, based on SDS, a Sparse Deformable Spatial Mamba Module (SDSpaM) and a Sparse Deformable Spectral Mamba Module (SDSpeM) are designed for tailored modeling of the spatial information spectral information. Last, to improve the fusion of SDSpaM and SDSpeM, an attention based feature fusion approach is designed to integrate the outputs of the SDSpaM and SDSpeM. The proposed method is tested on several benchmark datasets with many state-of-the-art approaches, demonstrating that the proposed approach can achieve higher accuracy, faster speed, and better detail small-class preservation capability.

[54] InfoBound: A Provable Information-Bounds Inspired Framework for Both OoD Generalization and OoD Detection

Lin Zhu,Yifeng Yang,Zichao Nie,Yuan Gao,Jiarui Li,Qinying Gu,Xinbing Wang,Chenghu Zhou,Nanyang Ye

Main category: cs.CV

TLDR: 论文提出了一种基于信息论的统一方法（MI-Min和CE-Max），同时提升OoD检测和OoD泛化能力，解决了现有方法在两者之间权衡的问题。

Details

Motivation: 现实场景中，分布偏移导致OoD泛化和OoD检测的重要性，但现有方法往往只能解决其中一个问题，且牺牲另一个。 Method: 基于互信息和条件熵的理论边界，提出MI-Min和CE-Max的统一方法。 Result: 在多标签图像分类和目标检测任务中，该方法优于基线，成功缓解了两者之间的权衡。 Conclusion: 信息论视角为同时提升OoD检测和泛化提供了有效途径，且易于应用于现有模型和任务。 Abstract: In real-world scenarios, distribution shifts give rise to the importance of two problems: out-of-distribution (OoD) generalization, which focuses on models' generalization ability against covariate shifts (i.e., the changes of environments), and OoD detection, which aims to be aware of semantic shifts (i.e., test-time unseen classes). Real-world testing environments often involve a combination of both covariate and semantic shifts. While numerous methods have been proposed to address these critical issues, only a few works tackled them simultaneously. Moreover, prior works often improve one problem but sacrifice the other. To overcome these limitations, we delve into boosting OoD detection and OoD generalization from the perspective of information theory, which can be easily applied to existing models and different tasks. Building upon the theoretical bounds for mutual information and conditional entropy, we provide a unified approach, composed of Mutual Information Minimization (MI-Min) and Conditional Entropy Maximizing (CE-Max). Extensive experiments and comprehensive evaluations on multi-label image classification and object detection have demonstrated the superiority of our method. It successfully mitigates trade-offs between the two challenges compared to competitive baselines.

[55] FractalForensics: Proactive Deepfake Detection and Localization via Fractal Watermarks

Tianyi Wang,Harry Cheng,Ming-Hui Liu,Mohan Kankanhalli

Main category: cs.CV

TLDR: 提出了一种名为FractalForensics的新型分形水印方法，用于主动Deepfake检测和定位，解决了现有水印方法缺乏定位功能和检测结果可解释性的问题。

Details

Motivation: 被动Deepfake检测器难以识别高质量合成图像，而现有主动水印方法缺乏定位功能和检测结果的可解释性，且水印的鲁棒性不稳定。 Method: 设计了一种参数驱动的水印生成流程，生成基于分形的水印并进行单向加密；提出了一种半脆弱水印框架，嵌入和恢复水印；采用entry-to-patch策略实现Deepfake操作的定位。 Result: 实验表明，该方法对常见图像处理和Deepfake操作具有满意的鲁棒性和脆弱性，优于现有半脆弱水印算法和被动检测器。 Conclusion: FractalForensics不仅提高了主动Deepfake检测的性能，还通过突出显示被篡改区域提供了检测结果的可解释性。 Abstract: Proactive Deepfake detection via robust watermarks has been raised ever since passive Deepfake detectors encountered challenges in identifying high-quality synthetic images. However, while demonstrating reasonable detection performance, they lack localization functionality and explainability in detection results. Additionally, the unstable robustness of watermarks can significantly affect the detection performance accordingly. In this study, we propose novel fractal watermarks for proactive Deepfake detection and localization, namely FractalForensics. Benefiting from the characteristics of fractals, we devise a parameter-driven watermark generation pipeline that derives fractal-based watermarks and conducts one-way encryption regarding the parameters selected. Subsequently, we propose a semi-fragile watermarking framework for watermark embedding and recovery, trained to be robust against benign image processing operations and fragile when facing Deepfake manipulations in a black-box setting. Meanwhile, we introduce an entry-to-patch strategy that implicitly embeds the watermark matrix entries into image patches at corresponding positions, achieving localization of Deepfake manipulations. Extensive experiments demonstrate satisfactory robustness and fragility of our approach against common image processing operations and Deepfake manipulations, outperforming state-of-the-art semi-fragile watermarking algorithms and passive detectors for Deepfake detection. Furthermore, by highlighting the areas manipulated, our method provides explainability for the proactive Deepfake detection results.

[56] D$^2$iT: Dynamic Diffusion Transformer for Accurate Image Generation

Weinan Jia,Mengqi Huang,Nan Chen,Lei Zhang,Zhendong Mao

Main category: cs.CV

TLDR: 论文提出了一种动态压缩不同图像区域的两阶段框架（DVAE和D$^2$iT），以解决固定压缩导致的局部真实性和全局一致性问题。

Details

Motivation: Diffusion Transformer（DiT）在扩散过程中对不同图像区域采用固定压缩，忽略了区域信息密度的差异，影响了生成图像的质量。 Method: 1. 第一阶段使用动态VAE（DVAE）分层编码不同区域；2. 第二阶段通过动态扩散Transformer（D$^2$iT）预测多粒度噪声。 Result: 实验验证了该方法在全局一致性和局部真实性上的有效性。 Conclusion: 动态压缩策略显著提升了图像生成的质量和效率。 Abstract: Diffusion models are widely recognized for their ability to generate high-fidelity images. Despite the excellent performance and scalability of the Diffusion Transformer (DiT) architecture, it applies fixed compression across different image regions during the diffusion process, disregarding the naturally varying information densities present in these regions. However, large compression leads to limited local realism, while small compression increases computational complexity and compromises global consistency, ultimately impacting the quality of generated images. To address these limitations, we propose dynamically compressing different image regions by recognizing the importance of different regions, and introduce a novel two-stage framework designed to enhance the effectiveness and efficiency of image generation: (1) Dynamic VAE (DVAE) at first stage employs a hierarchical encoder to encode different image regions at different downsampling rates, tailored to their specific information densities, thereby providing more accurate and natural latent codes for the diffusion process. (2) Dynamic Diffusion Transformer (D$^2$iT) at second stage generates images by predicting multi-grained noise, consisting of coarse-grained (less latent code in smooth regions) and fine-grained (more latent codes in detailed regions), through an novel combination of the Dynamic Grain Transformer and the Dynamic Content Transformer. The strategy of combining rough prediction of noise with detailed regions correction achieves a unification of global consistency and local realism. Comprehensive experiments on various generation tasks validate the effectiveness of our approach. Code will be released at https://github.com/jiawn-creator/Dynamic-DiT.

[57] Enhancing Wide-Angle Image Using Narrow-Angle View of the Same Scene

Hussain Md. Safwan,Mahbub Islam Mahim,Fawwaz Mohammed Amin

Main category: cs.CV

TLDR: 提出一种基于GAN的方法，通过结合窄角和广角镜头拍摄的图像，将高质量细节从窄角图像转移到广角图像中。

Details

Motivation: 解决拍摄场景时广角镜头覆盖范围大但细节不足，而窄角镜头细节丰富但覆盖范围有限的问题。 Method: 训练GAN模型，从窄角图像中提取视觉质量参数，并将其转移到对应的广角图像中。 Result: 在多个基准数据集上进行了评估，并与当前领域的最新进展进行了比较。 Conclusion: 该方法成功实现了将窄角图像的高质量细节融入广角图像，提升了广角图像的视觉质量。 Abstract: A common dilemma while photographing a scene is whether to capture it in wider angle, allowing more of the scene to be covered but in lesser details or to click in narrow angle that captures better details but leaves out portions of the scene. We propose a novel method in this paper that infuses wider shots with finer quality details that is usually associated with an image captured by the primary lens by capturing the same scene using both narrow and wide field of view (FoV) lenses. We do so by training a GAN-based model to learn to extract the visual quality parameters from a narrow angle shot and to transfer these to the corresponding wide-angle image of the scene. We have mentioned in details the proposed technique to isolate the visual essence of an image and to transfer it into another image. We have also elaborately discussed our implementation details and have presented the results of evaluation over several benchmark datasets and comparisons with contemporary advancements in the field.

[58] CamMimic: Zero-Shot Image To Camera Motion Personalized Video Generation Using Diffusion Models

Pooja Guhan,Divya Kothandaraman,Tsung-Wei Huang,Guan-Ming Su,Dinesh Manocha

Main category: cs.CV

TLDR: CamMimic是一种创新的算法，用于动态视频编辑，能够零样本地将参考视频中的相机运动无缝转移到用户选择的场景中。

Details

Motivation: 解决动态视频编辑中相机运动转移的挑战，无需额外数据。 Method: 采用两阶段策略：1）多概念学习方法结合LoRA层和正交损失；2）基于单应性的细化策略。 Result: 实验证明方法有效，用户研究中70.31%的参与者偏好场景保留，90.45%偏好运动转移。 Conclusion: 为跨场景相机运动转移的未来研究奠定了基础。 Abstract: We introduce CamMimic, an innovative algorithm tailored for dynamic video editing needs. It is designed to seamlessly transfer the camera motion observed in a given reference video onto any scene of the user's choice in a zero-shot manner without requiring any additional data. Our algorithm achieves this using a two-phase strategy by leveraging a text-to-video diffusion model. In the first phase, we develop a multi-concept learning method using a combination of LoRA layers and an orthogonality loss to capture and understand the underlying spatial-temporal characteristics of the reference video as well as the spatial features of the user's desired scene. The second phase proposes a unique homography-based refinement strategy to enhance the temporal and spatial alignment of the generated video. We demonstrate the efficacy of our method through experiments conducted on a dataset containing combinations of diverse scenes and reference videos containing a variety of camera motions. In the absence of an established metric for assessing camera motion transfer between unrelated scenes, we propose CameraScore, a novel metric that utilizes homography representations to measure camera motion similarity between the reference and generated videos. Extensive quantitative and qualitative evaluations demonstrate that our approach generates high-quality, motion-enhanced videos. Additionally, a user study reveals that 70.31% of participants preferred our method for scene preservation, while 90.45% favored it for motion transfer. We hope this work lays the foundation for future advancements in camera motion transfer across different scenes.

[59] Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation

Yongchao Feng,Yajie Liu,Shuai Yang,Wenrui Cai,Jinqing Zhang,Qiqi Zhan,Ziyue Huang,Hongxi Yan,Qiao Wan,Chenguang Liu,Junzhe Wang,Jiahui Lv,Ziqi Liu,Tengyuan Shi,Qingjie Liu,Yunhong Wang

Main category: cs.CV

TLDR: 本文系统评估了视觉语言模型（VLM）在传统视觉任务中的表现，包括检测和分割任务，并分析了不同微调策略对性能的影响，为未来VLM设计提供了见解。

Details

Motivation: 尽管VLM在开放词汇任务中表现良好，但其在传统视觉任务中的有效性尚未评估。本文旨在填补这一空白。 Method: 通过系统评估VLM在多种检测和分割任务中的表现，包括不同微调策略（零预测、视觉微调、文本提示）的影响。 Result: 揭示了不同VLM架构在不同任务中的优势和局限性，并分析了任务特性、模型架构和训练方法之间的相关性。 Conclusion: 本文为计算机视觉和多模态学习领域的研究者提供了VLM在传统任务中的全面评估，并指出了未来研究方向。 Abstract: Vision-Language Model (VLM) have gained widespread adoption in Open-Vocabulary (OV) object detection and segmentation tasks. Despite they have shown promise on OV-related tasks, their effectiveness in conventional vision tasks has thus far been unevaluated. In this work, we present the systematic review of VLM-based detection and segmentation, view VLM as the foundational model and conduct comprehensive evaluations across multiple downstream tasks for the first time: 1) The evaluation spans eight detection scenarios (closed-set detection, domain adaptation, crowded objects, etc.) and eight segmentation scenarios (few-shot, open-world, small object, etc.), revealing distinct performance advantages and limitations of various VLM architectures across tasks. 2) As for detection tasks, we evaluate VLMs under three finetuning granularities: \textit{zero prediction}, \textit{visual fine-tuning}, and \textit{text prompt}, and further analyze how different finetuning strategies impact performance under varied task. 3) Based on empirical findings, we provide in-depth analysis of the correlations between task characteristics, model architectures, and training methodologies, offering insights for future VLM design. 4) We believe that this work shall be valuable to the pattern recognition experts working in the fields of computer vision, multimodal learning, and vision foundation models by introducing them to the problem, and familiarizing them with the current status of the progress while providing promising directions for future research. A project associated with this review and evaluation has been created at https://github.com/better-chao/perceptual_abilities_evaluation.

[60] DropoutGS: Dropping Out Gaussians for Better Sparse-view Rendering

Yexing Xu,Longguang Wang,Minglin Chen,Sheng Ao,Li Li,Yulan Guo

Main category: cs.CV

TLDR: 3D高斯泼溅（3DGS）在稀疏输入下性能下降且产生伪影，本文提出随机丢弃正则化（RDR）和边缘引导分裂策略（ESS）以提升泛化性能。

Details

Motivation: 稀疏输入下3DGS性能显著下降且易过拟合，低复杂度模型表现更优，启发提出新方法。 Method: 提出RDR利用低复杂度模型缓解过拟合，ESS补充高频细节，形成DropoutGS方法。 Result: 在Blender、LLFF和DTU基准数据集上取得最优性能。 Conclusion: DropoutGS简单有效，显著提升稀疏视图下3DGS的泛化能力。 Abstract: Although 3D Gaussian Splatting (3DGS) has demonstrated promising results in novel view synthesis, its performance degrades dramatically with sparse inputs and generates undesirable artifacts. As the number of training views decreases, the novel view synthesis task degrades to a highly under-determined problem such that existing methods suffer from the notorious overfitting issue. Interestingly, we observe that models with fewer Gaussian primitives exhibit less overfitting under sparse inputs. Inspired by this observation, we propose a Random Dropout Regularization (RDR) to exploit the advantages of low-complexity models to alleviate overfitting. In addition, to remedy the lack of high-frequency details for these models, an Edge-guided Splitting Strategy (ESS) is developed. With these two techniques, our method (termed DropoutGS) provides a simple yet effective plug-in approach to improve the generalization performance of existing 3DGS methods. Extensive experiments show that our DropoutGS produces state-of-the-art performance under sparse views on benchmark datasets including Blender, LLFF, and DTU. The project page is at: https://xuyx55.github.io/DropoutGS/.

[61] EasyREG: Easy Depth-Based Markerless Registration and Tracking using Augmented Reality Device for Surgical Guidance

Yue Yang,Christoph Leuze,Brian Hargreaves,Bruce Daniel,Fred Baik

Main category: cs.CV

TLDR: 论文提出了一种基于AR设备的无标记手术导航框架，包含高精度配准模块和实时跟踪模块，解决了传统标记方法的繁琐问题，并在性能上优于工业解决方案。

Details

Motivation: 传统手术导航依赖外部标记，操作繁琐且难以在临床中部署；现有无标记方案因遮挡和异常值问题导致精度不足。 Method: 采用深度传感器，配准模块结合误差校正、人机交互区域过滤和全局对齐，跟踪模块基于配准结果实时估计目标位姿。 Result: 系统在配准性能上优于工业方案，跟踪性能相当，适用于目标解剖结构动态或静态的手术场景。 Conclusion: 提出的两模块设计为手术导航提供了一站式解决方案，具有高精度和实时性。 Abstract: The use of Augmented Reality (AR) devices for surgical guidance has gained increasing traction in the medical field. Traditional registration methods often rely on external fiducial markers to achieve high accuracy and real-time performance. However, these markers introduce cumbersome calibration procedures and can be challenging to deploy in clinical settings. While commercial solutions have attempted real-time markerless tracking using the native RGB cameras of AR devices, their accuracy remains questionable for medical guidance, primarily due to occlusions and significant outliers between the live sensor data and the preoperative target anatomy point cloud derived from MRI or CT scans. In this work, we present a markerless framework that relies only on the depth sensor of AR devices and consists of two modules: a registration module for high-precision, outlier-robust target anatomy localization, and a tracking module for real-time pose estimation. The registration module integrates depth sensor error correction, a human-in-the-loop region filtering technique, and a robust global alignment with curvature-aware feature sampling, followed by local ICP refinement, for markerless alignment of preoperative models with patient anatomy. The tracking module employs a fast and robust registration algorithm that uses the initial pose from the registration module to estimate the target pose in real-time. We comprehensively evaluated the performance of both modules through simulation and real-world measurements. The results indicate that our markerless system achieves superior performance for registration and comparable performance for tracking to industrial solutions. The two-module design makes our system a one-stop solution for surgical procedures where the target anatomy moves or stays static during surgery.

[62] PCM-SAR: Physics-Driven Contrastive Mutual Learning for SAR Classification

Pengfei Wang,Hao Zheng,Zhigang Hu,Aikun Xu,Meiguang Zheng,Liu Yang

Main category: cs.CV

TLDR: 提出了一种基于物理驱动的对比互学习框架（PCM-SAR），用于SAR图像分类，通过结合领域特定的物理特性改进样本生成和特征提取。

Details

Motivation: 现有基于对比学习的SAR图像分类方法通常依赖为光学图像设计的样本生成策略，未能捕捉SAR数据的独特语义和物理特性。 Method: PCM-SAR利用灰度共生矩阵（GLCM）模拟真实噪声模式，并通过语义检测进行无监督局部采样，同时采用多级特征融合机制进行特征表示优化。 Result: 实验结果表明，PCM-SAR在多种数据集和SAR分类任务中均优于现有方法。 Conclusion: PCM-SAR通过改进SAR特征表示，显著提升了小模型的性能，为SAR图像分类提供了更有效的解决方案。 Abstract: Existing SAR image classification methods based on Contrastive Learning often rely on sample generation strategies designed for optical images, failing to capture the distinct semantic and physical characteristics of SAR data. To address this, we propose Physics-Driven Contrastive Mutual Learning for SAR Classification (PCM-SAR), which incorporates domain-specific physical insights to improve sample generation and feature extraction. PCM-SAR utilizes the gray-level co-occurrence matrix (GLCM) to simulate realistic noise patterns and applies semantic detection for unsupervised local sampling, ensuring generated samples accurately reflect SAR imaging properties. Additionally, a multi-level feature fusion mechanism based on mutual learning enables collaborative refinement of feature representations. Notably, PCM-SAR significantly enhances smaller models by refining SAR feature representations, compensating for their limited capacity. Experimental results show that PCM-SAR consistently outperforms SOTA methods across diverse datasets and SAR classification tasks.

[63] Pillar-Voxel Fusion Network for 3D Object Detection in Airborne Hyperspectral Point Clouds

Yanze Jiang,Yanfeng Gu,Xian Li

Main category: cs.CV

TLDR: 论文提出PiV-AHPC，一种针对机载高光谱点云（HPCs）的3D目标检测网络，通过双分支编码器和多级特征融合机制解决几何-光谱失真问题。

Details

Motivation: 现有HPCs生成方法因融合误差和遮挡导致几何-光谱失真，限制了其在精细任务中的性能。 Method: 采用支柱-体素双分支编码器，分别提取光谱-垂直结构特征和3D空间特征，并通过多级特征融合机制实现异构特征整合。 Result: 在两种机载HPCs数据集上的实验表明，PiV-AHPC具有先进的检测性能和高泛化能力。 Conclusion: PiV-AHPC是首个针对HPCs任务的尝试，有效解决了几何-光谱失真问题，提升了检测性能。 Abstract: Hyperspectral point clouds (HPCs) can simultaneously characterize 3D spatial and spectral information of ground objects, offering excellent 3D perception and target recognition capabilities. Current approaches for generating HPCs often involve fusion techniques with hyperspectral images and LiDAR point clouds, which inevitably lead to geometric-spectral distortions due to fusion errors and obstacle occlusions. These adverse effects limit their performance in downstream fine-grained tasks across multiple scenarios, particularly in airborne applications. To address these issues, we propose PiV-AHPC, a 3D object detection network for airborne HPCs. To the best of our knowledge, this is the first attempt at this HPCs task. Specifically, we first develop a pillar-voxel dual-branch encoder, where the former captures spectral and vertical structural features from HPCs to overcome spectral distortion, while the latter emphasizes extracting accurate 3D spatial features from point clouds. A multi-level feature fusion mechanism is devised to enhance information interaction between the two branches, achieving neighborhood feature alignment and channel-adaptive selection, thereby organically integrating heterogeneous features and mitigating geometric distortion. Extensive experiments on two airborne HPCs datasets demonstrate that PiV-AHPC possesses state-of-the-art detection performance and high generalization capability.

[64] FVOS for MOSE Track of 4th PVUW Challenge: 3rd Place Solution

Mengjiao Wang,Junpei Zhang,Xu Liu,Yuting Yang,Mengru Ma

Main category: cs.CV

TLDR: 本文提出了一种改进的视频对象分割方法（FVOS），通过微调现有模型并引入形态学后处理和投票融合策略，在复杂场景中取得了显著效果。

Details

Motivation: 现有视频对象分割方法在复杂场景中表现不佳，本文旨在解决这一问题。 Method: 提出FVOS方法，包括对现有模型的微调、形态学后处理以及多尺度分割结果的投票融合。 Result: 验证和测试阶段的J&F分数分别为76.81%和83.92%，在PVUW挑战赛中排名第三。 Conclusion: FVOS方法在复杂场景中表现优异，为视频对象分割提供了有效解决方案。 Abstract: Video Object Segmentation (VOS) is one of the most fundamental and challenging tasks in computer vision and has a wide range of applications. Most existing methods rely on spatiotemporal memory networks to extract frame-level features and have achieved promising results on commonly used datasets. However, these methods often struggle in more complex real-world scenarios. This paper addresses this issue, aiming to achieve accurate segmentation of video objects in challenging scenes. We propose fine-tuning VOS (FVOS), optimizing existing methods for specific datasets through tailored training. Additionally, we introduce a morphological post-processing strategy to address the issue of excessively large gaps between adjacent objects in single-model predictions. Finally, we apply a voting-based fusion method on multi-scale segmentation results to generate the final output. Our approach achieves J&F scores of 76.81% and 83.92% during the validation and testing stages, respectively, securing third place overall in the MOSE Track of the 4th PVUW challenge 2025.

[65] DiffuMural: Restoring Dunhuang Murals with Multi-scale Diffusion

Puyu Han,Jiaju Kang,Yuhang Pan,Erting Pan,Zeyu Zhang,Qunchao Jin,Juntao Jiang,Zhichen Liu,Luqi Gong

Main category: cs.CV

TLDR: DiffuMural是一种结合多尺度收敛和协作扩散机制的模型，用于优化古代壁画修复任务，解决了大缺陷区域和训练样本稀缺的问题。

Details

Motivation: 古代壁画修复作为条件图像生成的重要下游任务，面临大缺陷区域和训练样本稀缺的挑战，现有方法难以满足美学标准。 Method: 提出DiffuMural，结合ControlNet和循环一致性损失，优化生成图像与条件控制的匹配。 Result: 在23幅敦煌壁画数据上表现优异，修复细节和整体风格一致，定量评估优于现有方法。 Conclusion: DiffuMural在壁画修复中表现出色，结合人文价值评估，保留文化意义。 Abstract: Large-scale pre-trained diffusion models have produced excellent results in the field of conditional image generation. However, restoration of ancient murals, as an important downstream task in this field, poses significant challenges to diffusion model-based restoration methods due to its large defective area and scarce training samples. Conditional restoration tasks are more concerned with whether the restored part meets the aesthetic standards of mural restoration in terms of overall style and seam detail, and such metrics for evaluating heuristic image complements are lacking in current research. We therefore propose DiffuMural, a combined Multi-scale convergence and Collaborative Diffusion mechanism with ControlNet and cyclic consistency loss to optimise the matching between the generated images and the conditional control. DiffuMural demonstrates outstanding capabilities in mural restoration, leveraging training data from 23 large-scale Dunhuang murals that exhibit consistent visual aesthetics. The model excels in restoring intricate details, achieving a coherent overall appearance, and addressing the unique challenges posed by incomplete murals lacking factual grounding. Our evaluation framework incorporates four key metrics to quantitatively assess incomplete murals: factual accuracy, textural detail, contextual semantics, and holistic visual coherence. Furthermore, we integrate humanistic value assessments to ensure the restored murals retain their cultural and artistic significance. Extensive experiments validate that our method outperforms state-of-the-art (SOTA) approaches in both qualitative and quantitative metrics.

[66] Capturing Longitudinal Changes in Brain Morphology Using Temporally Parameterized Neural Displacement Fields

Aisha L. Shuaibu,Kieran A. Gibb,Peter A. Wijeratne,Ivor J. A. Simpson

Main category: cs.CV

TLDR: 提出了一种基于神经位移场的纵向图像配准方法，用于研究脑形态的时间变化，解决了噪声和小解剖变化量化的问题。

Details

Motivation: 纵向图像配准可用于监测脑结构的生长或萎缩，但数据噪声和小解剖变化量化是挑战。 Method: 使用时间参数化的神经位移场建模结构变化，采用多层感知机实现隐式神经表示（INR），并利用其解析导数设计新的正则化函数。 Result: 方法在4D脑MR配准中表现出色，生成更符合生物学规律的形态变化模式。 Conclusion: 提出的方法有效解决了纵向图像配准的挑战，为脑形态研究提供了新工具。 Abstract: Longitudinal image registration enables studying temporal changes in brain morphology which is useful in applications where monitoring the growth or atrophy of specific structures is important. However this task is challenging due to; noise/artifacts in the data and quantifying small anatomical changes between sequential scans. We propose a novel longitudinal registration method that models structural changes using temporally parameterized neural displacement fields. Specifically, we implement an implicit neural representation (INR) using a multi-layer perceptron that serves as a continuous coordinate-based approximation of the deformation field at any time point. In effect, for any N scans of a particular subject, our model takes as input a 3D spatial coordinate location x, y, z and a corresponding temporal representation t and learns to describe the continuous morphology of structures for both observed and unobserved points in time. Furthermore, we leverage the analytic derivatives of the INR to derive a new regularization function that enforces monotonic rate of change in the trajectory of the voxels, which is shown to provide more biologically plausible patterns. We demonstrate the effectiveness of our method on 4D brain MR registration.

[67] 3D CoCa: Contrastive Learners are 3D Captioners

Ting Huang,Zeyu Zhang,Yemin Wang,Hao Tang

Main category: cs.CV

TLDR: 3D CoCa是一个结合对比学习和3D描述生成的新框架，显著提升了3D场景描述的性能。

Details

Motivation: 解决现有方法中点云稀疏性和跨模态对齐弱的问题。 Method: 结合冻结的CLIP骨干网络、空间感知的3D编码器和多模态解码器，联合优化对比与描述目标。 Result: 在ScanRefer和Nr3D基准上，CIDEr分数分别提升10.2%和5.76%。 Conclusion: 3D CoCa通过联合训练实现了更强的空间推理和语义对齐，性能显著优于现有方法。 Abstract: 3D captioning, which aims to describe the content of 3D scenes in natural language, remains highly challenging due to the inherent sparsity of point clouds and weak cross-modal alignment in existing methods. To address these challenges, we propose 3D CoCa, a novel unified framework that seamlessly combines contrastive vision-language learning with 3D caption generation in a single architecture. Our approach leverages a frozen CLIP vision-language backbone to provide rich semantic priors, a spatially-aware 3D scene encoder to capture geometric context, and a multi-modal decoder to generate descriptive captions. Unlike prior two-stage methods that rely on explicit object proposals, 3D CoCa jointly optimizes contrastive and captioning objectives in a shared feature space, eliminating the need for external detectors or handcrafted proposals. This joint training paradigm yields stronger spatial reasoning and richer semantic grounding by aligning 3D and textual representations. Extensive experiments on the ScanRefer and Nr3D benchmarks demonstrate that 3D CoCa significantly outperforms current state-of-the-arts by 10.2% and 5.76% in CIDEr at 0.5IoU, respectively. Code will be available at https://github.com/AIGeeksGroup/3DCoCa.

[68] AeroLite: Tag-Guided Lightweight Generation of Aerial Image Captions

Xing Zi,Tengjun Ni,Xianjing Fan,Xian Tao,Jun Li,Ali Braytee,Mukesh Prasad

Main category: cs.CV

TLDR: AeroLite是一个轻量级的、基于标签引导的框架，用于自动生成遥感图像的描述，通过结合语义标签和多层感知机架构，显著提升了小型语言模型在遥感图像描述任务中的性能。

Details

Motivation: 遥感图像的自动描述在环境监测、城市规划等领域至关重要，但由于复杂的空间语义和领域变化，这一任务具有挑战性。 Method: AeroLite利用GPT-4o生成伪描述数据集，结合多标签CLIP编码器提取语义标签，并通过多层感知机架构融合视觉和语义信息，采用两阶段LoRA训练方法。 Result: 实验表明，AeroLite在BLEU和METEOR等指标上优于参数更大的模型，同时计算成本更低。 Conclusion: AeroLite为遥感图像描述提供了一种高效、轻量级的解决方案，具有广泛的应用潜力。 Abstract: Accurate and automated captioning of aerial imagery is crucial for applications like environmental monitoring, urban planning, and disaster management. However, this task remains challenging due to complex spatial semantics and domain variability. To address these issues, we introduce \textbf{AeroLite}, a lightweight, tag-guided captioning framework designed to equip small-scale language models (1--3B parameters) with robust and interpretable captioning capabilities specifically for remote sensing images. \textbf{AeroLite} leverages GPT-4o to generate a large-scale, semantically rich pseudo-caption dataset by integrating multiple remote sensing benchmarks, including DLRSD, iSAID, LoveDA, WHU, and RSSCN7. To explicitly capture key semantic elements such as orientation and land-use types, AeroLite employs natural language processing techniques to extract relevant semantic tags. These tags are then learned by a dedicated multi-label CLIP encoder, ensuring precise semantic predictions. To effectively fuse visual and semantic information, we propose a novel bridging multilayer perceptron (MLP) architecture, aligning semantic tags with visual embeddings while maintaining minimal computational overhead. AeroLite's flexible design also enables seamless integration with various pretrained large language models. We adopt a two-stage LoRA-based training approach: the initial stage leverages our pseudo-caption dataset to capture broad remote sensing semantics, followed by fine-tuning on smaller, curated datasets like UCM and Sydney Captions to refine domain-specific alignment. Experimental evaluations demonstrate that AeroLite surpasses significantly larger models (e.g., 13B parameters) in standard captioning metrics, including BLEU and METEOR, while maintaining substantially lower computational costs.

[69] Trajectory-guided Motion Perception for Facial Expression Quality Assessment in Neurological Disorders

Shuchao Duan,Amirhossein Dadashzadeh,Alan Whone,Majid Mirmehdi

Main category: cs.CV

TLDR: 提出了一种基于面部标志点轨迹的自动化面部表情质量评估（FEQA）框架TraMP-Former，结合视觉语义线索，显著提升了在神经疾病数据集上的性能。

Details

Motivation: 神经疾病中面部表情的细微运动难以捕捉，影响诊断准确性，需要一种高效的方法来评估表情质量。 Method: 利用面部标志点轨迹作为紧凑且信息丰富的表示，结合RGB帧的视觉语义线索，通过TraMP-Former框架回归出质量评分。 Result: 在PFED5和Toronto NeuroFace数据集上分别提升了6.51%和7.62%，达到最新技术水平。 Conclusion: TraMP-Former通过融合轨迹特征和视觉语义，显著提升了FEQA性能，验证了标志点轨迹的有效性。 Abstract: Automated facial expression quality assessment (FEQA) in neurological disorders is critical for enhancing diagnostic accuracy and improving patient care, yet effectively capturing the subtle motions and nuances of facial muscle movements remains a challenge. We propose to analyse facial landmark trajectories, a compact yet informative representation, that encodes these subtle motions from a high-level structural perspective. Hence, we introduce Trajectory-guided Motion Perception Transformer (TraMP-Former), a novel FEQA framework that fuses landmark trajectory features for fine-grained motion capture with visual semantic cues from RGB frames, ultimately regressing the combined features into a quality score. Extensive experiments demonstrate that TraMP-Former achieves new state-of-the-art performance on benchmark datasets with neurological disorders, including PFED5 (up by 6.51%) and an augmented Toronto NeuroFace (up by 7.62%). Our ablation studies further validate the efficiency and effectiveness of landmark trajectories in FEQA. Our code is available at https://github.com/shuchaoduan/TraMP-Former.

[70] FastRSR: Efficient and Accurate Road Surface Reconstruction from Bird's Eye View

Yuting Zhao,Yuheng Ji,Xiaoshuai Hao,Shuxiao Li

Main category: cs.CV

TLDR: 论文提出两种高效准确的BEV-based RSR模型（FastRSR-mono和FastRSR-stereo），通过Depth-Aware Projection和Spatial Attention Enhancement等模块解决信息丢失和稀疏性问题，在性能和速度上均优于现有方法。

Details

Motivation: 现有方法在将视角视图转换为BEV时存在信息丢失和表示稀疏的问题，且立体匹配在BEV中难以平衡精度与速度。 Method: 提出Depth-Aware Projection（DAP）策略和Spatial Attention Enhancement（SAE）、Confidence Attention Generation（CAG）模块，优化BEV数据聚合和立体匹配。 Result: FastRSR在RSRD数据集上表现优异，单目方法提升6.0%的绝对高度误差，立体方法速度提升至少3.0倍。 Conclusion: FastRSR通过创新模块解决了BEV-based RSR的关键问题，实现了高效准确的性能。 Abstract: Road Surface Reconstruction (RSR) is crucial for autonomous driving, enabling the understanding of road surface conditions. Recently, RSR from the Bird's Eye View (BEV) has gained attention for its potential to enhance performance. However, existing methods for transforming perspective views to BEV face challenges such as information loss and representation sparsity. Moreover, stereo matching in BEV is limited by the need to balance accuracy with inference speed. To address these challenges, we propose two efficient and accurate BEV-based RSR models: FastRSR-mono and FastRSR-stereo. Specifically, we first introduce Depth-Aware Projection (DAP), an efficient view transformation strategy designed to mitigate information loss and sparsity by querying depth and image features to aggregate BEV data within specific road surface regions using a pre-computed look-up table. To optimize accuracy and speed in stereo matching, we design the Spatial Attention Enhancement (SAE) and Confidence Attention Generation (CAG) modules. SAE adaptively highlights important regions, while CAG focuses on high-confidence predictions and filters out irrelevant information. FastRSR achieves state-of-the-art performance, exceeding monocular competitors by over 6.0% in elevation absolute error and providing at least a 3.0x speedup by stereo methods on the RSRD dataset. The source code will be released.

[71] EmbodiedOcc++: Boosting Embodied 3D Occupancy Prediction with Plane Regularization and Uncertainty Sampler

Hao Wang,Xiaobao Wei,Xiaoan Zhang,Jianing Li,Chengyu Bai,Ying Li,Ming Lu,Wenzhao Zheng,Shanghang Zhang

Main category: cs.CV

TLDR: EmbodiedOcc++通过几何引导的细化模块和语义感知的不确定性采样器，改进了3D占用预测的几何一致性和细节保留。

Details

Motivation: 现有EmbodiedOcc框架忽略了室内环境的几何特性（主要是平面结构），影响了预测的准确性。 Method: 引入GRM模块通过平面正则化约束高斯更新，以及SUS模块优化重叠区域的更新策略。 Result: 在EmbodiedOcc-ScanNet基准测试中达到最优性能，提升了边缘精度和几何细节保留。 Conclusion: EmbodiedOcc++在保持计算效率的同时，显著提升了3D占用预测的几何一致性和细节表现。 Abstract: Online 3D occupancy prediction provides a comprehensive spatial understanding of embodied environments. While the innovative EmbodiedOcc framework utilizes 3D semantic Gaussians for progressive indoor occupancy prediction, it overlooks the geometric characteristics of indoor environments, which are primarily characterized by planar structures. This paper introduces EmbodiedOcc++, enhancing the original framework with two key innovations: a Geometry-guided Refinement Module (GRM) that constrains Gaussian updates through plane regularization, along with a Semantic-aware Uncertainty Sampler (SUS) that enables more effective updates in overlapping regions between consecutive frames. GRM regularizes the position update to align with surface normals. It determines the adaptive regularization weight using curvature-based and depth-based constraints, allowing semantic Gaussians to align accurately with planar surfaces while adapting in complex regions. To effectively improve geometric consistency from different views, SUS adaptively selects proper Gaussians to update. Comprehensive experiments on the EmbodiedOcc-ScanNet benchmark demonstrate that EmbodiedOcc++ achieves state-of-the-art performance across different settings. Our method demonstrates improved edge accuracy and retains more geometric details while ensuring computational efficiency, which is essential for online embodied perception. The code will be released at: https://github.com/PKUHaoWang/EmbodiedOcc2.

[72] SD-ReID: View-aware Stable Diffusion for Aerial-Ground Person Re-Identification

Xiang Hu,Pingping Zhang,Yuhao Wang,Bin Yan,Huchuan Lu

Main category: cs.CV

TLDR: 论文提出了一种名为SD-ReID的两阶段特征学习框架，利用生成模型（如Stable Diffusion）生成跨视角特征，以解决空中-地面行人重识别问题。

Details

Motivation: 现有方法主要关注设计鲁棒的ReID模型以保持身份一致性，但忽略了视角特定特征的贡献。 Method: 两阶段框架：第一阶段训练ViT模型提取粗粒度表示和可控条件；第二阶段微调SD模型学习互补表示，并提出View-Refine Decoder生成缺失的跨视角特征。 Result: 在AG-ReID基准测试中验证了SD-ReID的有效性。 Conclusion: SD-ReID通过生成视角特定特征显著提升了跨视角行人重识别的性能。 Abstract: Aerial-Ground Person Re-IDentification (AG-ReID) aims to retrieve specific persons across cameras with different viewpoints. Previous works focus on designing discriminative ReID models to maintain identity consistency despite drastic changes in camera viewpoints. The core idea behind these methods is quite natural, but designing a view-robust network is a very challenging task. Moreover, they overlook the contribution of view-specific features in enhancing the model's capability to represent persons. To address these issues, we propose a novel two-stage feature learning framework named SD-ReID for AG-ReID, which takes advantage of the powerful understanding capacity of generative models, e.g., Stable Diffusion (SD), to generate view-specific features between different viewpoints. In the first stage, we train a simple ViT-based model to extract coarse-grained representations and controllable conditions. Then, in the second stage, we fine-tune the SD model to learn complementary representations guided by the controllable conditions. Furthermore, we propose the View-Refine Decoder (VRD) to obtain additional controllable conditions to generate missing cross-view features. Finally, we use the coarse-grained representations and all-view features generated by SD to retrieve target persons. Extensive experiments on the AG-ReID benchmarks demonstrate the effectiveness of our proposed SD-ReID. The source code will be available upon acceptance.

[73] Mitigating Long-tail Distribution in Oracle Bone Inscriptions: Dataset, Model, and Benchmark

Jinhao Li,Zijian Chen,Runze Dong,Tingzhu Chen,Changbo Wang,Guangtao Zhai

Main category: cs.CV

TLDR: 论文提出了Oracle-P15K数据集和OBIDiff模型，用于解决甲骨文识别中的长尾分布问题，通过生成式数据增强提升少数类别的样本量。

Details

Motivation: 现有甲骨文数据集存在长尾分布问题，导致识别模型在多数和少数类别上表现不均。生成式数据增强为解决这一问题提供了可能，但缺乏结构对齐的大规模图像对用于训练生成模型。 Method: 1. 构建Oracle-P15K数据集，包含14,542张结构对齐的甲骨文图像；2. 提出基于扩散模型的OBIDiff生成器，实现真实可控的甲骨文生成。 Result: 实验证明Oracle-P15K数据集的有效性，OBIDiff能准确保留字形结构并有效转换拓片风格。 Conclusion: Oracle-P15K和OBIDiff为甲骨文识别提供了高质量的数据增强工具，解决了长尾分布问题。 Abstract: The oracle bone inscription (OBI) recognition plays a significant role in understanding the history and culture of ancient China. However, the existing OBI datasets suffer from a long-tail distribution problem, leading to biased performance of OBI recognition models across majority and minority classes. With recent advancements in generative models, OBI synthesis-based data augmentation has become a promising avenue to expand the sample size of minority classes. Unfortunately, current OBI datasets lack large-scale structure-aligned image pairs for generative model training. To address these problems, we first present the Oracle-P15K, a structure-aligned OBI dataset for OBI generation and denoising, consisting of 14,542 images infused with domain knowledge from OBI experts. Second, we propose a diffusion model-based pseudo OBI generator, called OBIDiff, to achieve realistic and controllable OBI generation. Given a clean glyph image and a target rubbing-style image, it can effectively transfer the noise style of the original rubbing to the glyph image. Extensive experiments on OBI downstream tasks and user preference studies show the effectiveness of the proposed Oracle-P15K dataset and demonstrate that OBIDiff can accurately preserve inherent glyph structures while transferring authentic rubbing styles effectively.

[74] TextSplat: Text-Guided Semantic Fusion for Generalizable Gaussian Splatting

Zhicong Wu,Hongbin Xu,Gang Xu,Ping Nie,Zhixin Yan,Jinkai Zheng,Liangqiong Qu,Ming Li,Liqiang Nie

Main category: cs.CV

TLDR: 论文提出TextSplat，首个文本驱动的通用高斯泼溅框架，通过融合多模态语义信息提升3D重建的几何与语义一致性。

Details

Motivation: 现有方法多关注几何一致性，忽略了文本驱动对语义理解的潜力，导致复杂场景中细节重建不足。 Method: 采用三个并行模块（深度估计、语义分割、多视角交互）提取互补特征，通过文本引导的注意力机制融合。 Result: 实验表明，该方法在多个基准数据集上优于现有方法，验证了其有效性。 Conclusion: TextSplat通过文本引导的语义融合，显著提升了3D重建的细节和语义准确性。 Abstract: Recent advancements in Generalizable Gaussian Splatting have enabled robust 3D reconstruction from sparse input views by utilizing feed-forward Gaussian Splatting models, achieving superior cross-scene generalization. However, while many methods focus on geometric consistency, they often neglect the potential of text-driven guidance to enhance semantic understanding, which is crucial for accurately reconstructing fine-grained details in complex scenes. To address this limitation, we propose TextSplat--the first text-driven Generalizable Gaussian Splatting framework. By employing a text-guided fusion of diverse semantic cues, our framework learns robust cross-modal feature representations that improve the alignment of geometric and semantic information, producing high-fidelity 3D reconstructions. Specifically, our framework employs three parallel modules to obtain complementary representations: the Diffusion Prior Depth Estimator for accurate depth information, the Semantic Aware Segmentation Network for detailed semantic information, and the Multi-View Interaction Network for refined cross-view features. Then, in the Text-Guided Semantic Fusion Module, these representations are integrated via the text-guided and attention-based feature aggregation mechanism, resulting in enhanced 3D Gaussian parameters enriched with detailed semantic cues. Experimental results on various benchmark datasets demonstrate improved performance compared to existing methods across multiple evaluation metrics, validating the effectiveness of our framework. The code will be publicly available.

[75] DualPrompt-MedCap: A Dual-Prompt Enhanced Approach for Medical Image Captioning

Yining Zhao,Ali Braytee,Mukesh Prasad

Main category: cs.CV

TLDR: DualPrompt-MedCap通过双提示增强框架提升医学图像描述生成能力，在模态识别和描述质量上优于基线模型。

Details

Motivation: 医学图像描述生成在临床诊断辅助中潜力巨大，但生成上下文相关且模态识别准确的描述仍具挑战性。 Method: 提出DualPrompt-MedCap框架，包含模态感知提示和问题引导提示，并通过半监督分类模型和生物医学语言模型嵌入增强。 Result: 在多个医学数据集上，DualPrompt-MedCap在模态识别准确率上提升22%，生成更全面且问题对齐的描述。 Conclusion: 该方法能生成临床准确的报告，可作为医学专家先验知识及下游视觉语言任务的自动标注。 Abstract: Medical image captioning via vision-language models has shown promising potential for clinical diagnosis assistance. However, generating contextually relevant descriptions with accurate modality recognition remains challenging. We present DualPrompt-MedCap, a novel dual-prompt enhancement framework that augments Large Vision-Language Models (LVLMs) through two specialized components: (1) a modality-aware prompt derived from a semi-supervised classification model pretrained on medical question-answer pairs, and (2) a question-guided prompt leveraging biomedical language model embeddings. To address the lack of captioning ground truth, we also propose an evaluation framework that jointly considers spatial-semantic relevance and medical narrative quality. Experiments on multiple medical datasets demonstrate that DualPrompt-MedCap outperforms the baseline BLIP-3 by achieving a 22% improvement in modality recognition accuracy while generating more comprehensive and question-aligned descriptions. Our method enables the generation of clinically accurate reports that can serve as medical experts' prior knowledge and automatic annotations for downstream vision-language tasks.

[76] Mixture-of-Shape-Experts (MoSE): End-to-End Shape Dictionary Framework to Prompt SAM for Generalizable Medical Segmentation

Jia Wei,Xiaoqi Zhao,Jonghye Woo,Jinsong Ouyang,Georges El Fakhri,Qingyu Chen,Xiaofeng Liu

Main category: cs.CV

TLDR: 提出了一种新的Mixture-of-Shape-Experts (MoSE)框架，通过结合字典学习和混合专家训练，动态融合形状先验，并利用SAM的强大泛化能力。

Details

Motivation: 解决现有字典学习方法在医学图像分割中形状先验表示能力有限或过拟合的问题，同时兼容SAM等大型基础模型。 Method: 将字典原子视为形状专家，通过门控网络动态融合，利用SAM编码稀疏激活防止过拟合，并双向集成SAM的泛化能力。 Result: 在多个公共数据集上的实验验证了其有效性。 Conclusion: MoSE框架成功整合了形状先验和SAM，提升了医学图像分割的泛化性能。 Abstract: Single domain generalization (SDG) has recently attracted growing attention in medical image segmentation. One promising strategy for SDG is to leverage consistent semantic shape priors across different imaging protocols, scanner vendors, and clinical sites. However, existing dictionary learning methods that encode shape priors often suffer from limited representational power with a small set of offline computed shape elements, or overfitting when the dictionary size grows. Moreover, they are not readily compatible with large foundation models such as the Segment Anything Model (SAM). In this paper, we propose a novel Mixture-of-Shape-Experts (MoSE) framework that seamlessly integrates the idea of mixture-of-experts (MoE) training into dictionary learning to efficiently capture diverse and robust shape priors. Our method conceptualizes each dictionary atom as a shape expert, which specializes in encoding distinct semantic shape information. A gating network dynamically fuses these shape experts into a robust shape map, with sparse activation guided by SAM encoding to prevent overfitting. We further provide this shape map as a prompt to SAM, utilizing the powerful generalization capability of SAM through bidirectional integration. All modules, including the shape dictionary, are trained in an end-to-end manner. Extensive experiments on multiple public datasets demonstrate its effectiveness.

[77] Early-Bird Diffusion: Investigating and Leveraging Timestep-Aware Early-Bird Tickets in Diffusion Models for Efficient Training

Lexington Whalen,Zhenbang Du,Haoran You,Chaojian Li,Sixu Li,Yingyan,Lin

Main category: cs.CV

TLDR: EB-Diff-Train是一种高效训练扩散模型的方法，利用早期稀疏子网络（EB tickets）减少计算资源需求，同时保持生成质量。

Details

Motivation: 扩散模型训练需要大量计算资源，研究旨在通过EB tickets减少训练时间和资源消耗。 Method: 提出传统和基于时间步的EB tickets，动态调整稀疏度，并行训练并在推理时结合。 Result: 实验显示EB-Diff-Train显著减少训练时间（2.9×至5.8×加速），且生成质量不下降。 Conclusion: EB-Diff-Train是一种高效且有效的扩散模型训练方法，适用于资源受限场景。 Abstract: Training diffusion models (DMs) requires substantial computational resources due to multiple forward and backward passes across numerous timesteps, motivating research into efficient training techniques. In this paper, we propose EB-Diff-Train, a new efficient DM training approach that is orthogonal to other methods of accelerating DM training, by investigating and leveraging Early-Bird (EB) tickets -- sparse subnetworks that manifest early in the training process and maintain high generation quality. We first investigate the existence of traditional EB tickets in DMs, enabling competitive generation quality without fully training a dense model. Then, we delve into the concept of diffusion-dedicated EB tickets, drawing on insights from varying importance of different timestep regions. These tickets adapt their sparsity levels according to the importance of corresponding timestep regions, allowing for aggressive sparsity during non-critical regions while conserving computational resources for crucial timestep regions. Building on this, we develop an efficient DM training technique that derives timestep-aware EB tickets, trains them in parallel, and combines them during inference for image generation. Extensive experiments validate the existence of both traditional and timestep-aware EB tickets, as well as the effectiveness of our proposed EB-Diff-Train method. This approach can significantly reduce training time both spatially and temporally -- achieving 2.9$\times$ to 5.8$\times$ speedups over training unpruned dense models, and up to 10.3$\times$ faster training compared to standard train-prune-finetune pipelines -- without compromising generative quality. Our code is available at https://github.com/GATECH-EIC/Early-Bird-Diffusion.

[78] ERL-MPP: Evolutionary Reinforcement Learning with Multi-head Puzzle Perception for Solving Large-scale Jigsaw Puzzles of Eroded Gaps

Xingke Song,Xiaoying Yang,Chenglin Yao,Jianfeng Ren,Ruibin Bai,Xin Chen,Xudong Jiang

Main category: cs.CV

TLDR: 论文提出了一种结合进化强化学习和多头拼图感知的框架（ERL-MPP），用于解决大规模带间隙拼图的挑战，显著优于现有方法。

Details

Motivation: 现有模型主要解决小规模或无间隙拼图，而大规模带间隙拼图在图像理解和组合优化上提出了新挑战。 Method: 设计了多头拼图感知网络（MPPN）和进化强化学习代理（EvoRL），前者通过共享编码器和多个头感知拼图状态，后者通过演员-评论家框架和进化策略优化动作。 Result: 在JPLEG-5和MIT数据集上，ERL-MPP显著优于所有现有模型。 Conclusion: ERL-MPP为大规模带间隙拼图提供了一种高效解决方案，展示了其在图像理解和组合优化上的优势。 Abstract: Solving jigsaw puzzles has been extensively studied. While most existing models focus on solving either small-scale puzzles or puzzles with no gap between fragments, solving large-scale puzzles with gaps presents distinctive challenges in both image understanding and combinatorial optimization. To tackle these challenges, we propose a framework of Evolutionary Reinforcement Learning with Multi-head Puzzle Perception (ERL-MPP) to derive a better set of swapping actions for solving the puzzles. Specifically, to tackle the challenges of perceiving the puzzle with gaps, a Multi-head Puzzle Perception Network (MPPN) with a shared encoder is designed, where multiple puzzlet heads comprehensively perceive the local assembly status, and a discriminator head provides a global assessment of the puzzle. To explore the large swapping action space efficiently, an Evolutionary Reinforcement Learning (EvoRL) agent is designed, where an actor recommends a set of suitable swapping actions from a large action space based on the perceived puzzle status, a critic updates the actor using the estimated rewards and the puzzle status, and an evaluator coupled with evolutionary strategies evolves the actions aligning with the historical assembly experience. The proposed ERL-MPP is comprehensively evaluated on the JPLEG-5 dataset with large gaps and the MIT dataset with large-scale puzzles. It significantly outperforms all state-of-the-art models on both datasets.

[79] Tokenize Image Patches: Global Context Fusion for Effective Haze Removal in Large Images

Jiuchen Chen,Xinyu Yan,Qizhi Xu,Kaiqi Li

Main category: cs.CV

TLDR: DehazeXL提出了一种平衡全局上下文和局部特征的去雾方法，支持大图像端到端建模，并设计了视觉归因方法和8KDehaze数据集。

Details

Motivation: 现有深度学习模型在大分辨率图像去雾任务中因GPU内存限制而表现不佳，常需切片或降采样，导致全局信息或高频细节丢失。 Method: DehazeXL通过平衡全局上下文和局部特征提取，实现大图像端到端建模，并设计了针对去雾任务的视觉归因方法。 Result: DehazeXL在10240×10240像素图像上仅需21GB内存，达到最优性能。 Conclusion: DehazeXL解决了大分辨率图像去雾的挑战，提供了高效的方法和数据集支持。 Abstract: Global contextual information and local detail features are essential for haze removal tasks. Deep learning models perform well on small, low-resolution images, but they encounter difficulties with large, high-resolution ones due to GPU memory limitations. As a compromise, they often resort to image slicing or downsampling. The former diminishes global information, while the latter discards high-frequency details. To address these challenges, we propose DehazeXL, a haze removal method that effectively balances global context and local feature extraction, enabling end-to-end modeling of large images on mainstream GPU hardware. Additionally, to evaluate the efficiency of global context utilization in haze removal performance, we design a visual attribution method tailored to the characteristics of haze removal tasks. Finally, recognizing the lack of benchmark datasets for haze removal in large images, we have developed an ultra-high-resolution haze removal dataset (8KDehaze) to support model training and testing. It includes 10000 pairs of clear and hazy remote sensing images, each sized at 8192 $\times$ 8192 pixels. Extensive experiments demonstrate that DehazeXL can infer images up to 10240 $\times$ 10240 pixels with only 21 GB of memory, achieving state-of-the-art results among all evaluated methods. The source code and experimental dataset are available at https://github.com/CastleChen339/DehazeXL.

[80] Ges3ViG: Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding

Atharv Mahesh Mane,Dulanga Weerakoon,Vigneshwaran Subbaraju,Sougata Sen,Sanjay E. Sarma,Archan Misra

Main category: cs.CV

TLDR: 3D-ERU结合语言描述和指向手势识别3D场景中的目标物体，填补了现有研究的空白。通过数据增强框架Imputer和新的数据集ImputeRefer，以及提出的Ges3ViG模型，实现了显著的性能提升。

Details

Motivation: 现有研究主要集中在纯语言基础的3D定位，而结合指向手势的3D-ERU研究较少，本文旨在填补这一空白。 Method: 提出数据增强框架Imputer，构建新数据集ImputeRefer，并开发Ges3ViG模型。 Result: Ges3ViG模型在3D-ERU任务中准确率提升约30%，相比纯语言模型提升约9%。 Conclusion: 3D-ERU结合指向手势显著提升了3D场景中的目标识别能力，Ges3ViG模型和ImputeRefer数据集为未来研究提供了重要资源。 Abstract: 3-Dimensional Embodied Reference Understanding (3D-ERU) combines a language description and an accompanying pointing gesture to identify the most relevant target object in a 3D scene. Although prior work has explored pure language-based 3D grounding, there has been limited exploration of 3D-ERU, which also incorporates human pointing gestures. To address this gap, we introduce a data augmentation framework-Imputer, and use it to curate a new benchmark dataset-ImputeRefer for 3D-ERU, by incorporating human pointing gestures into existing 3D scene datasets that only contain language instructions. We also propose Ges3ViG, a novel model for 3D-ERU that achieves ~30% improvement in accuracy as compared to other 3D-ERU models and ~9% compared to other purely language-based 3D grounding models. Our code and dataset are available at https://github.com/AtharvMane/Ges3ViG.

[81] TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning

Xingjian Zhang,Siwei Wen,Wenjun Wu,Lei Huang

Main category: cs.CV

TLDR: 论文提出小规模视频推理模型TinyLLaVA-Video-R1，探索有限计算资源下小模型的推理能力，并在通用视频问答数据集上展示其改进的推理和思考能力。

Details

Motivation: 现有研究多基于大规模模型和高推理强度数据集，而小规模模型的推理能力探索对资源有限的研究者仍有价值，且模型解释推理过程的能力同样重要。 Method: 基于TinyLLaVA-Video（参数不超过4B），通过强化学习在通用视频问答数据集上提升推理能力。 Result: 模型在推理和思考能力上显著提升，并展现出“顿悟时刻”的涌现特性。 Conclusion: 研究为小规模模型的视频推理能力探索提供了实用见解，模型已开源。 Abstract: Recently, improving the reasoning ability of large multimodal models (LMMs) through reinforcement learning has made great progress. However, most existing works are based on highly reasoning-intensive datasets such as mathematics and code, and researchers generally choose large-scale models as the foundation. We argue that exploring small-scale models' reasoning capabilities remains valuable for researchers with limited computational resources. Moreover, enabling models to explain their reasoning processes on general question-answering datasets is equally meaningful. Therefore, we present the small-scale video reasoning model TinyLLaVA-Video-R1. Based on TinyLLaVA-Video, a traceably trained video understanding model with no more than 4B parameters, it not only demonstrates significantly improved reasoning and thinking capabilities after using reinforcement learning on general Video-QA datasets, but also exhibits the emergent characteristic of "aha moments". Furthermore, we share a series of experimental findings, aiming to provide practical insights for future exploration of video reasoning (thinking) abilities in small-scale models. It is available at https://github.com/ZhangXJ199/TinyLLaVA-Video-R1.

[82] SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model

Kaiyu Li,Zepeng Xin,Li Pang,Chao Pang,Yupeng Deng,Jing Yao,Guisong Xia,Deyu Meng,Zhi Wang,Xiangyong Cao

Main category: cs.CV

TLDR: 论文提出了一种新的地理空间像素推理任务，并构建了首个大规模基准数据集EarthReason，同时提出了SegEarth-R1模型，该模型在推理和参考分割任务中表现优异。

Details

Motivation: 传统遥感方法难以处理复杂的隐式查询，需要结合空间上下文和领域知识进行推理，因此提出了地理空间像素推理任务。 Method: 提出了SegEarth-R1模型，结合了分层视觉编码器、大型语言模型（LLM）和定制的掩码生成器，并针对遥感图像进行了领域特定优化。 Result: SegEarth-R1在推理和参考分割任务中表现优异，显著优于传统和基于LLM的分割方法。 Conclusion: 地理空间像素推理任务和SegEarth-R1模型为遥感领域提供了新的解决方案，数据集和代码将开源。 Abstract: Remote sensing has become critical for understanding environmental dynamics, urban planning, and disaster management. However, traditional remote sensing workflows often rely on explicit segmentation or detection methods, which struggle to handle complex, implicit queries that require reasoning over spatial context, domain knowledge, and implicit user intent. Motivated by this, we introduce a new task, \ie, geospatial pixel reasoning, which allows implicit querying and reasoning and generates the mask of the target region. To advance this task, we construct and release the first large-scale benchmark dataset called EarthReason, which comprises 5,434 manually annotated image masks with over 30,000 implicit question-answer pairs. Moreover, we propose SegEarth-R1, a simple yet effective language-guided segmentation baseline that integrates a hierarchical visual encoder, a large language model (LLM) for instruction parsing, and a tailored mask generator for spatial correlation. The design of SegEarth-R1 incorporates domain-specific adaptations, including aggressive visual token compression to handle ultra-high-resolution remote sensing images, a description projection module to fuse language and multi-scale features, and a streamlined mask prediction pipeline that directly queries description embeddings. Extensive experiments demonstrate that SegEarth-R1 achieves state-of-the-art performance on both reasoning and referring segmentation tasks, significantly outperforming traditional and LLM-based segmentation methods. Our data and code will be released at https://github.com/earth-insights/SegEarth-R1.

[83] KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation

Xingrui Wang,Jiang Liu,Ze Wang,Xiaodong Yu,Jialian Wu,Ximeng Sun,Yusheng Su,Alan Yuille,Zicheng Liu,Emad Barsoum

Main category: cs.CV

TLDR: KeyVID提出了一种基于关键帧的音频到视觉动画框架，显著提高了动态运动中的生成质量，同时保持计算效率。

Details

Motivation: 当前音频到视觉动画模型使用均匀采样的帧，无法捕捉低帧率下的关键动态时刻，且增加帧数会占用大量内存。 Method: 通过音频定位关键帧时间点，生成视觉关键帧，再通过运动插值生成中间帧。 Result: 实验表明，KeyVID显著提高了音频-视频同步性和视频质量，尤其在高度动态的运动中。 Conclusion: KeyVID在动态运动中表现优异，代码已开源。 Abstract: Generating video from various conditions, such as text, image, and audio, enables both spatial and temporal control, leading to high-quality generation results. Videos with dramatic motions often require a higher frame rate to ensure smooth motion. Currently, most audio-to-visual animation models use uniformly sampled frames from video clips. However, these uniformly sampled frames fail to capture significant key moments in dramatic motions at low frame rates and require significantly more memory when increasing the number of frames directly. In this paper, we propose KeyVID, a keyframe-aware audio-to-visual animation framework that significantly improves the generation quality for key moments in audio signals while maintaining computation efficiency. Given an image and an audio input, we first localize keyframe time steps from the audio. Then, we use a keyframe generator to generate the corresponding visual keyframes. Finally, we generate all intermediate frames using the motion interpolator. Through extensive experiments, we demonstrate that KeyVID significantly improves audio-video synchronization and video quality across multiple datasets, particularly for highly dynamic motions. The code is released in https://github.com/XingruiWang/KeyVID.

Yao Yuan,Pan Gao,Qun Dai,Jie Qin,Wei Xiang

Main category: cs.CV

TLDR: 论文提出了一种基于不确定性引导学习的显著性目标检测方法（UGRAN），通过多级交互注意力（MIA）、尺度空间一致性注意力（SSCA）和不确定性细化注意力（URA）模块，提升模型对不确定区域的感知能力，生成高饱和度的细粒度显著性预测图。

Details

Motivation: 现有显著性目标检测方法预测的显著区域通常包含不饱和区域和阴影，限制了模型的细粒度预测可靠性。 Method: 设计了UGRAN网络，包含MIA、SSCA和URA模块，并通过自适应动态分区（ADP）机制优化计算开销。 Result: 在七个基准数据集上的实验表明，UGRAN优于现有方法。 Conclusion: UGRAN通过不确定性引导学习，显著提升了显著性目标检测的细粒度预测能力。 Abstract: Recently, salient object detection (SOD) methods have achieved impressive performance. However, salient regions predicted by existing methods usually contain unsaturated regions and shadows, which limits the model for reliable fine-grained predictions. To address this, we introduce the uncertainty guidance learning approach to SOD, intended to enhance the model's perception of uncertain regions. Specifically, we design a novel Uncertainty Guided Refinement Attention Network (UGRAN), which incorporates three important components, i.e., the Multilevel Interaction Attention (MIA) module, the Scale Spatial-Consistent Attention (SSCA) module, and the Uncertainty Refinement Attention (URA) module. Unlike conventional methods dedicated to enhancing features, the proposed MIA facilitates the interaction and perception of multilevel features, leveraging the complementary characteristics among multilevel features. Then, through the proposed SSCA, the salient information across diverse scales within the aggregated features can be integrated more comprehensively and integrally. In the subsequent steps, we utilize the uncertainty map generated from the saliency prediction map to enhance the model's perception capability of uncertain regions, generating a highly-saturated fine-grained saliency prediction map. Additionally, we devise an adaptive dynamic partition (ADP) mechanism to minimize the computational overhead of the URA module and improve the utilization of uncertainty guidance. Experiments on seven benchmark datasets demonstrate the superiority of the proposed UGRAN over the state-of-the-art methodologies. Codes will be released at https://github.com/I2-Multimedia-Lab/UGRAN.

[85] LightHeadEd: Relightable & Editable Head Avatars from a Smartphone

Pranav Manu,Astitva Srivastava,Amit Raj,Varun Jampani,Avinash Sharma,P. J. Narayanan

Main category: cs.CV

TLDR: 提出了一种低成本方法，仅需配备偏振滤光片的智能手机，即可创建高质量、可动画、可重光照的3D头部化身。

Details

Motivation: 传统方法依赖昂贵的Lightstage设备，限制了广泛应用。 Method: 通过同时捕获交叉偏振和平行偏振视频流，分离皮肤的漫反射和镜面反射成分，并引入混合表示方法。 Result: 实现了高效实时渲染，同时保留高保真几何细节。 Conclusion: 该方法为创建高质量3D头部化身提供了经济高效的解决方案。 Abstract: Creating photorealistic, animatable, and relightable 3D head avatars traditionally requires expensive Lightstage with multiple calibrated cameras, making it inaccessible for widespread adoption. To bridge this gap, we present a novel, cost-effective approach for creating high-quality relightable head avatars using only a smartphone equipped with polaroid filters. Our approach involves simultaneously capturing cross-polarized and parallel-polarized video streams in a dark room with a single point-light source, separating the skin's diffuse and specular components during dynamic facial performances. We introduce a hybrid representation that embeds 2D Gaussians in the UV space of a parametric head model, facilitating efficient real-time rendering while preserving high-fidelity geometric details. Our learning-based neural analysis-by-synthesis pipeline decouples pose and expression-dependent geometrical offsets from appearance, decomposing the surface into albedo, normal, and specular UV texture maps, along with the environment maps. We collect a unique dataset of various subjects performing diverse facial expressions and head movements.

[86] Computer-Aided Layout Generation for Building Design: A Review

Jiachen Liu,Yuan Xue,Haomiao Ni,Rui Yu,Zihan Zhou,Sharon X. Huang

Main category: cs.CV

TLDR: 综述了建筑布局设计的三大研究方向：平面布局生成、场景布局合成及其他建筑布局格式生成，总结了现有方法的优缺点，并提出了未来研究方向。

Details

Motivation: 传统建筑布局设计方法成本高、耗时长，深度学习生成模型提高了效率和多样性，但仍需系统梳理和未来展望。 Method: 分类综述了三大主题，包括研究方法、数据集和评估指标，并分析了现有方法的局限。 Result: 总结了现有方法的成熟问题和不足，提出了未来研究的潜在方向。 Conclusion: 深度学习在建筑布局生成中表现优异，但仍需进一步研究以解决现有挑战。 Abstract: Generating realistic building layouts for automatic building design has been studied in both the computer vision and architecture domains. Traditional approaches from the architecture domain, which are based on optimization techniques or heuristic design guidelines, can synthesize desirable layouts, but usually require post-processing and involve human interaction in the design pipeline, making them costly and timeconsuming. The advent of deep generative models has significantly improved the fidelity and diversity of the generated architecture layouts, reducing the workload by designers and making the process much more efficient. In this paper, we conduct a comprehensive review of three major research topics of architecture layout design and generation: floorplan layout generation, scene layout synthesis, and generation of some other formats of building layouts. For each topic, we present an overview of the leading paradigms, categorized either by research domains (architecture or machine learning) or by user input conditions or constraints. We then introduce the commonly-adopted benchmark datasets that are used to verify the effectiveness of the methods, as well as the corresponding evaluation metrics. Finally, we identify the well-solved problems and limitations of existing approaches, then propose new perspectives as promising directions for future research in this important research area. A project associated with this survey to maintain the resources is available at awesome-building-layout-generation.

[87] ToolTipNet: A Segmentation-Driven Deep Learning Baseline for Surgical Instrument Tip Detection

Zijian Wu,Shuojue Yang,Yueming Jin,Septimiu E Salcudean

Main category: cs.CV

TLDR: 论文提出了一种基于深度学习的手术器械尖端检测方法，利用分割基础模型（如Segment Anything）生成的器械部分分割掩码作为输入，解决了传统方法中器械尖端定位不准确的问题。

Details

Motivation: 在机器人辅助腹腔镜前列腺切除术（RALP）中，器械尖端位置的准确定位对超声帧与腹腔镜相机帧的配准至关重要。传统方法通过da Vinci API获取的器械尖端位置不准确且需要手眼校准，因此基于视觉的方法成为更优解决方案。 Method: 提出了一种基于深度学习的手术器械尖端检测方法，利用分割基础模型生成的器械部分分割掩码作为输入，避免了传统图像处理方法的局限性。 Result: 在模拟和真实数据集上的对比实验表明，所提方法优于传统手工图像处理方法。 Conclusion: 基于深度学习的手术器械尖端检测方法在准确性和鲁棒性上优于传统方法，为手术技能评估和自动化手术等任务提供了可靠支持。 Abstract: In robot-assisted laparoscopic radical prostatectomy (RALP), the location of the instrument tip is important to register the ultrasound frame with the laparoscopic camera frame. A long-standing limitation is that the instrument tip position obtained from the da Vinci API is inaccurate and requires hand-eye calibration. Thus, directly computing the position of the tool tip in the camera frame using the vision-based method becomes an attractive solution. Besides, surgical instrument tip detection is the key component of other tasks, like surgical skill assessment and surgery automation. However, this task is challenging due to the small size of the tool tip and the articulation of the surgical instrument. Surgical instrument segmentation becomes relatively easy due to the emergence of the Segmentation Foundation Model, i.e., Segment Anything. Based on this advancement, we explore the deep learning-based surgical instrument tip detection approach that takes the part-level instrument segmentation mask as input. Comparison experiments with a hand-crafted image-processing approach demonstrate the superiority of the proposed method on simulated and real datasets.

[88] A Survey on Efficient Vision-Language Models

Gaurav Shinde,Anuradha Ravi,Emon Dey,Shadman Sakib,Milind Rampure,Nirmalya Roy

Main category: cs.CV

TLDR: 综述了高效视觉语言模型（VLM）在边缘设备上的优化技术，包括紧凑架构和性能-内存权衡。

Details

Motivation: 视觉语言模型的高计算需求限制了实时应用，因此需要开发高效模型。 Method: 回顾了优化VLM的关键技术，包括紧凑架构和框架，并分析了性能与内存的权衡。 Result: 建立了GitHub仓库，汇总了所有调查的论文，并计划持续更新。 Conclusion: 旨在推动高效视觉语言模型的深入研究。 Abstract: Vision-language models (VLMs) integrate visual and textual information, enabling a wide range of applications such as image captioning and visual question answering, making them crucial for modern AI systems. However, their high computational demands pose challenges for real-time applications. This has led to a growing focus on developing efficient vision language models. In this survey, we review key techniques for optimizing VLMs on edge and resource-constrained devices. We also explore compact VLM architectures, frameworks and provide detailed insights into the performance-memory trade-offs of efficient VLMs. Furthermore, we establish a GitHub repository at https://github.com/MPSCUMBC/Efficient-Vision-Language-Models-A-Survey to compile all surveyed papers, which we will actively update. Our objective is to foster deeper research in this area.

[89] Automatic Detection of Intro and Credits in Video using CLIP and Multihead Attention

Vasilii Korolkov,Andrey Yanchenko

Main category: cs.CV

TLDR: 论文提出了一种基于深度学习的视频内容分割方法，用于检测视频中片头/片尾与正片的过渡，通过序列分类任务实现高精度和实时性能。

Details

Motivation: 手动标注视频过渡耗时且易错，而启发式方法难以适应多样化的视频风格，因此需要一种更高效且通用的解决方案。 Method: 采用序列到序列分类任务，以1 FPS提取帧，使用CLIP编码，并通过多头注意力模型处理特征，结合学习的位置编码。 Result: 在测试集上达到F1分数91.0%，精确率89.0%，召回率97.0%，实时性能为CPU 11.5 FPS，GPU 107 FPS。 Conclusion: 该方法在自动化内容索引、高光检测和视频摘要中有实际应用，未来将探索多模态学习以进一步提升精度。 Abstract: Detecting transitions between intro/credits and main content in videos is a crucial task for content segmentation, indexing, and recommendation systems. Manual annotation of such transitions is labor-intensive and error-prone, while heuristic-based methods often fail to generalize across diverse video styles. In this work, we introduce a deep learning-based approach that formulates the problem as a sequence-to-sequence classification task, where each second of a video is labeled as either "intro" or "film." Our method extracts frames at a fixed rate of 1 FPS, encodes them using CLIP (Contrastive Language-Image Pretraining), and processes the resulting feature representations with a multihead attention model incorporating learned positional encoding. The system achieves an F1-score of 91.0%, Precision of 89.0%, and Recall of 97.0% on the test set, and is optimized for real-time inference, achieving 11.5 FPS on CPU and 107 FPS on high-end GPUs. This approach has practical applications in automated content indexing, highlight detection, and video summarization. Future work will explore multimodal learning, incorporating audio features and subtitles to further enhance detection accuracy.

[90] Socratic Chart: Cooperating Multiple Agents for Robust SVG Chart Understanding

Yuyang Ji,Haohan Wang

Main category: cs.CV

TLDR: 论文提出了一种新框架Socratic Chart，用于提升多模态大语言模型（MLLMs）在图表推理任务中的视觉理解能力，通过去除文本标签和引入图表扰动来挑战现有模型，并展示了其优越性能。

Details

Motivation: 现有MLLMs在图表推理任务中依赖文本捷径而非真正的视觉理解，ChartQA等基准测试揭示了这一局限性。 Method: 提出Socratic Chart框架，将图表图像转换为SVG表示，通过多代理流程（代理生成器和代理批评家）提取和验证图表属性。 Result: 在去除文本标签和引入扰动的条件下，GPT-4o和Gemini-2.0 Pro性能下降30%，而Socratic Chart在图表原语提取和推理性能上超越现有模型。 Conclusion: Socratic Chart为提升MLLMs的视觉理解能力提供了有效途径，尤其在图表推理任务中表现卓越。 Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable versatility but face challenges in demonstrating true visual understanding, particularly in chart reasoning tasks. Existing benchmarks like ChartQA reveal significant reliance on text-based shortcuts and probabilistic pattern-matching rather than genuine visual reasoning. To rigorously evaluate visual reasoning, we introduce a more challenging test scenario by removing textual labels and introducing chart perturbations in the ChartQA dataset. Under these conditions, models like GPT-4o and Gemini-2.0 Pro experience up to a 30% performance drop, underscoring their limitations. To address these challenges, we propose Socratic Chart, a new framework that transforms chart images into Scalable Vector Graphics (SVG) representations, enabling MLLMs to integrate textual and visual modalities for enhanced chart understanding. Socratic Chart employs a multi-agent pipeline with specialized agent-generators to extract primitive chart attributes (e.g., bar heights, line coordinates) and an agent-critic to validate results, ensuring high-fidelity symbolic representations. Our framework surpasses state-of-the-art models in accurately capturing chart primitives and improving reasoning performance, establishing a robust pathway for advancing MLLM visual understanding.

[91] On the representation of stack operators by mathematical morphology

Diego Marcondes

Main category: cs.CV

TLDR: 本文介绍了灰度图像堆栈算子的类别，这些算子将二值图像映射为二值图像，并在平均意义上与截面操作交换。堆栈算子是集合算子的1-Lipschitz扩展，可通过将特征集合算子应用于图像的截面并求和来表示。它们是堆栈滤波器的推广，其中特征集合算子是递增的。主要结果表明堆栈算子继承了特征集合算子的格性质。

Details

Motivation: 研究灰度图像处理中堆栈算子的性质及其与二值图像算子的关系，为图像处理问题的设计提供理论支持。 Method: 通过推导堆栈算子的特征函数、核和基表示，研究其性质，并证明其继承了特征集合算子的格性质。 Result: 堆栈算子是集合算子的1-Lipschitz扩展，可通过特征集合算子表示，并继承了其格性质。 Conclusion: 堆栈算子为灰度图像处理问题提供了一种设计方法，即通过设计二值图像算子并扩展为堆栈算子。未来可研究其机器学习应用及适用问题的特征化。 Abstract: This paper introduces the class of grey-scale image stack operators as those that (a) map binary-images into binary-images and (b) commute in average with cross-sectioning. We show that stack operators are 1-Lipchitz extensions of set operators which can be represented by applying a characteristic set operator to the cross-sections of the image and summing. In particular, they are a generalisation of stack filters, for which the characteristic set operators are increasing. Our main result is that stack operators inherit lattice properties of the characteristic set operators. We focus on the case of translation-invariant and locally defined stack operators and show the main result by deducing the characteristic function, kernel, and basis representation of stack operators. The results of this paper have implications on the design of image operators, since imply that to solve some grey-scale image processing problems it is enough to design an operator for performing the desired transformation on binary images, and then considering its extension given by a stack operator. We leave many topics for future research regarding the machine learning of stack operators and the characterisation of the image processing problems that can be solved by them.

[92] EquiVDM: Equivariant Video Diffusion Models with Temporally Consistent Noise

Chao Liu,Arash Vahdat

Main category: cs.CV

TLDR: 提出了一种利用时间一致性噪声的视频扩散框架，无需额外模块或约束即可生成连贯视频帧。

Details

Motivation: 解决视频扩散模型在时间一致性方面的挑战，适用于sim-to-real、风格迁移等应用。 Method: 通过时间一致性噪声训练扩散模型，使其对输入视频和噪声的空间变换具有等变性，并扩展到3D一致性视频生成。 Result: 在运动对齐、3D一致性和视频质量上优于现有方法，且采样步骤少。 Conclusion: 该方法在视频生成任务中表现出色，具有高效性和广泛适用性。 Abstract: Temporally consistent video-to-video generation is essential for applications of video diffusion models in areas such as sim-to-real, style-transfer, video upsampling, etc. In this paper, we propose a video diffusion framework that leverages temporally consistent noise to generate coherent video frames without specialized modules or additional constraints. We show that the standard training objective of diffusion models, when applied with temporally consistent noise, encourages the model to be equivariant to spatial transformations in input video and noise. This enables our model to better follow motion patterns from the input video, producing aligned motion and high-fidelity frames. Furthermore, we extend our approach to 3D-consistent video generation by attaching noise as textures on 3D meshes, ensuring 3D consistency in sim-to-real applications. Experimental results demonstrate that our method surpasses state-of-the-art baselines in motion alignment, 3D consistency, and video quality while requiring only a few sampling steps in practice.

[93] IGL-DT: Iterative Global-Local Feature Learning with Dual-Teacher Semantic Segmentation Framework under Limited Annotation Scheme

Dinh Dai Quan Tran,Hoang-Thien Nguyen. Thanh-Huy Nguyen,Gia-Van To,Tien-Huy Nguyen,Quan Nguyen

Main category: cs.CV

TLDR: 提出了一种名为IGL-DT的三分支半监督语义分割框架，结合双教师策略，平衡全局语义与局部特征提取，显著提升分割性能。

Details

Motivation: 现有半监督语义分割方法难以平衡全局语义与局部特征提取，导致性能受限。 Method: 采用SwinUnet进行全局上下文学习，ResUnet进行局部区域学习，并引入差异学习机制以减少对单一教师的依赖。 Result: 在多个基准数据集上，IGL-DT优于现有方法，分割性能显著提升。 Conclusion: IGL-DT通过双教师策略和差异学习，有效解决了全局与局部特征平衡问题，为半监督语义分割提供了新思路。 Abstract: Semi-Supervised Semantic Segmentation (SSSS) aims to improve segmentation accuracy by leveraging a small set of labeled images alongside a larger pool of unlabeled data. Recent advances primarily focus on pseudo-labeling, consistency regularization, and co-training strategies. However, existing methods struggle to balance global semantic representation with fine-grained local feature extraction. To address this challenge, we propose a novel tri-branch semi-supervised segmentation framework incorporating a dual-teacher strategy, named IGL-DT. Our approach employs SwinUnet for high-level semantic guidance through Global Context Learning and ResUnet for detailed feature refinement via Local Regional Learning. Additionally, a Discrepancy Learning mechanism mitigates over-reliance on a single teacher, promoting adaptive feature learning. Extensive experiments on benchmark datasets demonstrate that our method outperforms state-of-the-art approaches, achieving superior segmentation performance across various data regimes.

[94] DUDA: Distilled Unsupervised Domain Adaptation for Lightweight Semantic Segmentation

Beomseok Kang,Niluthpol Chowdhury Mithun,Abhinav Rajvanshi,Han-Pang Chiu,Supun Samarasekera

Main category: cs.CV

TLDR: DUDA提出了一种结合EMA自训练和知识蒸馏的新框架，解决了轻量模型在UDA中性能下降的问题。

Details

Motivation: 现有UDA方法在轻量模型上表现不佳，主要因架构不灵活导致伪标签质量低。 Method: DUDA结合EMA自训练与知识蒸馏，引入辅助学生网络、渐进蒸馏、不一致性损失和多教师学习。 Result: 在四个UDA基准测试中，DUDA表现优于现有方法，轻量模型性能甚至超过其他方法的重量模型。 Conclusion: DUDA通过创新设计显著提升了轻量模型在UDA任务中的性能。 Abstract: Unsupervised Domain Adaptation (UDA) is essential for enabling semantic segmentation in new domains without requiring costly pixel-wise annotations. State-of-the-art (SOTA) UDA methods primarily use self-training with architecturally identical teacher and student networks, relying on Exponential Moving Average (EMA) updates. However, these approaches face substantial performance degradation with lightweight models due to inherent architectural inflexibility leading to low-quality pseudo-labels. To address this, we propose Distilled Unsupervised Domain Adaptation (DUDA), a novel framework that combines EMA-based self-training with knowledge distillation (KD). Our method employs an auxiliary student network to bridge the architectural gap between heavyweight and lightweight models for EMA-based updates, resulting in improved pseudo-label quality. DUDA employs a strategic fusion of UDA and KD, incorporating innovative elements such as gradual distillation from large to small networks, inconsistency loss prioritizing poorly adapted classes, and learning with multiple teachers. Extensive experiments across four UDA benchmarks demonstrate DUDA's superiority in achieving SOTA performance with lightweight models, often surpassing the performance of heavyweight models from other approaches.

[95] Density-based Object Detection in Crowded Scenes

Chenyang Zhao,Jia Wan,Antoni B. Chan

Main category: cs.CV

TLDR: 论文提出两种新策略（DGA和DG-NMS），利用目标密度图优化锚点分配和后处理，解决拥挤场景中目标检测的模糊锚点和误抑制问题。

Details

Motivation: 拥挤场景中目标高度重叠导致训练时锚点模糊和推理时误抑制增多，需针对性解决方案。 Method: 提出密度引导锚点（DGA）和密度引导NMS（DG-NMS），基于不平衡最优传输（UOT）计算锚点分配与重加权，并利用密度图自适应调整NMS阈值。 Result: 在CrowdHuman和Citypersons数据集上验证了方法的有效性和鲁棒性。 Conclusion: 密度引导检测器能有效应对拥挤场景，代码和模型将公开。 Abstract: Compared with the generic scenes, crowded scenes contain highly-overlapped instances, which result in: 1) more ambiguous anchors during training of object detectors, and 2) more predictions are likely to be mistakenly suppressed in post-processing during inference. To address these problems, we propose two new strategies, density-guided anchors (DGA) and density-guided NMS (DG-NMS), which uses object density maps to jointly compute optimal anchor assignments and reweighing, as well as an adaptive NMS. Concretely, based on an unbalanced optimal transport (UOT) problem, the density owned by each ground-truth object is transported to each anchor position at a minimal transport cost. And density on anchors comprises an instance-specific density distribution, from which DGA decodes the optimal anchor assignment and re-weighting strategy. Meanwhile, DG-NMS utilizes the predicted density map to adaptively adjust the NMS threshold to reduce mistaken suppressions. In the UOT, a novel overlap-aware transport cost is specifically designed for ambiguous anchors caused by overlapped neighboring objects. Extensive experiments on the challenging CrowdHuman dataset with Citypersons dataset demonstrate that our proposed density-guided detector is effective and robust to crowdedness. The code and pre-trained models will be made available later.

[96] FATE: A Prompt-Tuning-Based Semi-Supervised Learning Framework for Extremely Limited Labeled Data

Hezhao Liu,Yang Lu,Mengke Li,Yiqun Zhang,Shreyank N Gowda,Chen Gong,Hanzi Wang

Main category: cs.CV

TLDR: FATE是一个针对标记数据极少的半监督学习框架，通过两阶段提示调优范式，先利用无标记数据适应预训练模型，再完成分类任务。

Details

Motivation: 现有半监督学习方法在标记数据极少（如仅一个样本）时表现不佳，FATE旨在解决这一问题。 Method: FATE采用两阶段方法：1）无监督方式利用无标记数据适应预训练模型；2）设计专门方法完成分类任务。 Result: 在七个基准测试中，FATE平均性能提升33.74%，优于现有方法。 Conclusion: FATE有效解决了标记数据稀缺问题，适用于视觉和视觉语言预训练模型。 Abstract: Semi-supervised learning (SSL) has achieved significant progress by leveraging both labeled data and unlabeled data. Existing SSL methods overlook a common real-world scenario when labeled data is extremely scarce, potentially as limited as a single labeled sample in the dataset. General SSL approaches struggle to train effectively from scratch under such constraints, while methods utilizing pre-trained models often fail to find an optimal balance between leveraging limited labeled data and abundant unlabeled data. To address this challenge, we propose Firstly Adapt, Then catEgorize (FATE), a novel SSL framework tailored for scenarios with extremely limited labeled data. At its core, the two-stage prompt tuning paradigm FATE exploits unlabeled data to compensate for scarce supervision signals, then transfers to downstream tasks. Concretely, FATE first adapts a pre-trained model to the feature distribution of downstream data using volumes of unlabeled samples in an unsupervised manner. It then applies an SSL method specifically designed for pre-trained models to complete the final classification task. FATE is designed to be compatible with both vision and vision-language pre-trained models. Extensive experiments demonstrate that FATE effectively mitigates challenges arising from the scarcity of labeled samples in SSL, achieving an average performance improvement of 33.74% across seven benchmarks compared to state-of-the-art SSL methods. Code is available at https://anonymous.4open.science/r/Semi-supervised-learning-BA72.

Lu Yue,Dongliang Zhou,Liang Xie,Erwei Yin,Feitian Zhang

Main category: cs.CV

TLDR: ST-Booster通过多粒度感知和指令感知推理提升连续环境中的视觉与语言导航性能。

Details

Motivation: 解决连续环境中导航的视觉记忆异质性和局部特征感知受损问题。 Method: 提出ST-Booster，包含HSTE、MGAF和VGWG模块，通过多粒度对齐和迭代优化提升导航能力。 Result: ST-Booster在复杂环境中表现优于现有方法。 Conclusion: ST-Booster有效解决了连续环境导航中的核心挑战，提升了性能。 Abstract: Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires agents to navigate unknown, continuous spaces based on natural language instructions. Compared to discrete settings, VLN-CE poses two core perception challenges. First, the absence of predefined observation points leads to heterogeneous visual memories and weakened global spatial correlations. Second, cumulative reconstruction errors in three-dimensional scenes introduce structural noise, impairing local feature perception. To address these challenges, this paper proposes ST-Booster, an iterative spatiotemporal booster that enhances navigation performance through multi-granularity perception and instruction-aware reasoning. ST-Booster consists of three key modules -- Hierarchical SpatioTemporal Encoding (HSTE), Multi-Granularity Aligned Fusion (MGAF), and ValueGuided Waypoint Generation (VGWG). HSTE encodes long-term global memory using topological graphs and captures shortterm local details via grid maps. MGAF aligns these dualmap representations with instructions through geometry-aware knowledge fusion. The resulting representations are iteratively refined through pretraining tasks. During reasoning, VGWG generates Guided Attention Heatmaps (GAHs) to explicitly model environment-instruction relevance and optimize waypoint selection. Extensive comparative experiments and performance analyses are conducted, demonstrating that ST-Booster outperforms existing state-of-the-art methods, particularly in complex, disturbance-prone environments.

[98] GFT: Gradient Focal Transformer

Boris Kriuk,Simranjit Kaur Gill,Shoaib Aslam,Amir Fakhrutdinov

Main category: cs.CV

TLDR: GFT（Gradient Focal Transformer）是一种新型ViT衍生框架，通过GALA机制和PPS策略，动态聚焦区分性特征，提升细粒度图像分类性能。

Details

Motivation: 现有CNN和ViT模型在细粒度图像分类中难以兼顾全局上下文和局部细节，且计算效率低。 Method: GFT结合GALA机制分析注意力梯度流，动态选择区分性特征，并通过PPS策略逐步过滤非关键区域。 Result: 在FGVC Aircraft、Food-101和COCO数据集上达到SOTA性能，参数仅93M，效率优于其他ViT模型。 Conclusion: GFT通过平衡全局与局部特征提取，为细粒度分类设定了新基准，并提供可解释的解决方案。 Abstract: Fine-Grained Image Classification (FGIC) remains a complex task in computer vision, as it requires models to distinguish between categories with subtle localized visual differences. Well-studied CNN-based models, while strong in local feature extraction, often fail to capture the global context required for fine-grained recognition, while more recent ViT-backboned models address FGIC with attention-driven mechanisms but lack the ability to adaptively focus on truly discriminative regions. TransFG and other ViT-based extensions introduced part-aware token selection to enhance attention localization, yet they still struggle with computational efficiency, attention region selection flexibility, and detail-focus narrative in complex environments. This paper introduces GFT (Gradient Focal Transformer), a new ViT-derived framework created for FGIC tasks. GFT integrates the Gradient Attention Learning Alignment (GALA) mechanism to dynamically prioritize class-discriminative features by analyzing attention gradient flow. Coupled with a Progressive Patch Selection (PPS) strategy, the model progressively filters out less informative regions, reducing computational overhead while enhancing sensitivity to fine details. GFT achieves SOTA accuracy on FGVC Aircraft, Food-101, and COCO datasets with 93M parameters, outperforming ViT-based advanced FGIC models in efficiency. By bridging global context and localized detail extraction, GFT sets a new benchmark in fine-grained recognition, offering interpretable solutions for real-world deployment scenarios.

[99] HDC: Hierarchical Distillation for Multi-level Noisy Consistency in Semi-Supervised Fetal Ultrasound Segmentation

Tran Quoc Khanh Le,Nguyen Lan Vi Vu,Ha-Hieu Pham,Xuan-Loc Huynh,Tien-Huy Nguyen,Minh Huu Nhat Le,Quan Nguyen,Hien D. Nguyen

Main category: cs.CV

TLDR: 提出了一种名为HDC的半监督分割框架，通过层次蒸馏和一致性学习解决宫颈超声图像分割中的低对比度和模糊边界问题，显著降低了计算开销。

Details

Motivation: 宫颈超声图像分割因低对比度、阴影伪影和模糊边界而具有挑战性，且大规模标注数据难以获取。半监督学习虽能利用未标注数据，但现有方法存在确认偏差和高计算成本问题。 Method: HDC框架结合层次蒸馏和一致性学习，引入相关性指导损失和互信息损失，优化特征表示，降低模型复杂度。 Result: 在两个胎儿超声数据集（FUGC和PSFH）上，HDC表现优异，计算开销显著低于现有多教师模型。 Conclusion: HDC框架在宫颈超声图像分割中表现出高效性和泛化能力，为半监督学习提供了新思路。 Abstract: Transvaginal ultrasound is a critical imaging modality for evaluating cervical anatomy and detecting physiological changes. However, accurate segmentation of cervical structures remains challenging due to low contrast, shadow artifacts, and fuzzy boundaries. While convolutional neural networks (CNNs) have shown promising results in medical image segmentation, their performance is often limited by the need for large-scale annotated datasets - an impractical requirement in clinical ultrasound imaging. Semi-supervised learning (SSL) offers a compelling solution by leveraging unlabeled data, but existing teacher-student frameworks often suffer from confirmation bias and high computational costs. We propose HDC, a novel semi-supervised segmentation framework that integrates Hierarchical Distillation and Consistency learning within a multi-level noise mean-teacher framework. Unlike conventional approaches that rely solely on pseudo-labeling, we introduce a hierarchical distillation mechanism that guides feature-level learning via two novel objectives: (1) Correlation Guidance Loss to align feature representations between the teacher and main student branch, and (2) Mutual Information Loss to stabilize representations between the main and noisy student branches. Our framework reduces model complexity while improving generalization. Extensive experiments on two fetal ultrasound datasets, FUGC and PSFH, demonstrate that our method achieves competitive performance with significantly lower computational overhead than existing multi-teacher models.

[100] MCBlock: Boosting Neural Radiance Field Training Speed by MCTS-based Dynamic-Resolution Ray Sampling

Yunpeng Tan,Junlin Hao,Jiangkai Wu,Liming Liu,Qingyang Li,Xinggong Zhang

Main category: cs.CV

TLDR: MCBlock是一种基于蒙特卡洛树搜索的动态分辨率光线采样算法，显著提升了NeRF模型的训练速度。

Details

Motivation: 现有NeRF模型（如Gaussian Splatting）训练速度慢，无法满足实时需求，主要原因是采样效率低下。 Method: 提出MCBlock算法，通过蒙特卡洛树搜索动态划分图像块，优化采样粒度。 Result: 在Nerfstudio中实现，训练速度提升2.33倍，优于其他光线采样算法。 Conclusion: MCBlock适用于所有锥形追踪NeRF模型，对多媒体领域有重要贡献。 Abstract: Neural Radiance Field (NeRF) is widely known for high-fidelity novel view synthesis. However, even the state-of-the-art NeRF model, Gaussian Splatting, requires minutes for training, far from the real-time performance required by multimedia scenarios like telemedicine. One of the obstacles is its inefficient sampling, which is only partially addressed by existing works. Existing point-sampling algorithms uniformly sample simple-texture regions (easy to fit) and complex-texture regions (hard to fit), while existing ray-sampling algorithms sample these regions all in the finest granularity (i.e. the pixel level), both wasting GPU training resources. Actually, regions with different texture intensities require different sampling granularities. To this end, we propose a novel dynamic-resolution ray-sampling algorithm, MCBlock, which employs Monte Carlo Tree Search (MCTS) to partition each training image into pixel blocks with different sizes for active block-wise training. Specifically, the trees are initialized according to the texture of training images to boost the initialization speed, and an expansion/pruning module dynamically optimizes the block partition. MCBlock is implemented in Nerfstudio, an open-source toolset, and achieves a training acceleration of up to 2.33x, surpassing other ray-sampling algorithms. We believe MCBlock can apply to any cone-tracing NeRF model and contribute to the multimedia community.

[101] Focus on Local: Finding Reliable Discriminative Regions for Visual Place Recognition

Changwei Wang,Shunpeng Chen,Yukun Song,Rongtao Xu,Zherui Zhang,Jiguang Zhang,Haoran Yang,Yu Zhang,Kexue Fu,Shide Du,Zhiwei Xu,Longxiang Gao,Li Guo,Shibiao Xu

Main category: cs.CV

TLDR: 提出Focus on Local (FoL)方法，通过挖掘和利用图像中的局部判别区域，提升视觉地点识别（VPR）任务中的图像检索和重排序性能。

Details

Motivation: 现有方法未能精确建模和充分利用图像中的局部判别区域，而这些区域对VPR任务至关重要。 Method: 设计两种损失函数（SAL和CEL）建模局部判别区域，提出弱监督局部特征训练策略，并引入高效重排序流程。 Result: 在多个VPR基准测试中达到最优性能，显著优于现有两阶段方法。 Conclusion: FoL方法通过局部判别区域的建模和利用，显著提升了VPR任务的性能和效率。 Abstract: Visual Place Recognition (VPR) is aimed at predicting the location of a query image by referencing a database of geotagged images. For VPR task, often fewer discriminative local regions in an image produce important effects while mundane background regions do not contribute or even cause perceptual aliasing because of easy overlap. However, existing methods lack precisely modeling and full exploitation of these discriminative regions. In this paper, we propose the Focus on Local (FoL) approach to stimulate the performance of image retrieval and re-ranking in VPR simultaneously by mining and exploiting reliable discriminative local regions in images and introducing pseudo-correlation supervision. First, we design two losses, Extraction-Aggregation Spatial Alignment Loss (SAL) and Foreground-Background Contrast Enhancement Loss (CEL), to explicitly model reliable discriminative local regions and use them to guide the generation of global representations and efficient re-ranking. Second, we introduce a weakly-supervised local feature training strategy based on pseudo-correspondences obtained from aggregating global features to alleviate the lack of local correspondences ground truth for the VPR task. Third, we suggest an efficient re-ranking pipeline that is efficiently and precisely based on discriminative region guidance. Finally, experimental results show that our FoL achieves the state-of-the-art on multiple VPR benchmarks in both image retrieval and re-ranking stages and also significantly outperforms existing two-stage VPR methods in terms of computational efficiency. Code and models are available at https://github.com/chenshunpeng/FoL

[102] Enhanced Semantic Extraction and Guidance for UGC Image Super Resolution

Yiwen Wang,Ying Liang,Yuxuan Zhang,Xinning Chai,Zhengxue Cheng,Yingsheng Qin,Yucai Yang,Rong Xie,Li Song

Main category: cs.CV

TLDR: 提出了一种结合语义引导的扩散框架方法，用于用户生成内容（UGC）图像的超分辨率，解决了真实世界与合成退化之间的不一致性问题。

Details

Motivation: 传统超分辨率方法难以泛化到真实世界的退化，需要更鲁棒的方法。 Method: 通过模拟LSDIR数据集的退化过程并结合官方配对训练集，同时引入预训练语义提取模型（SAM2）和优化超参数。 Result: 实验表明方法优于现有技术，并在CVPR NTIRE 2025挑战赛中获第二名。 Conclusion: 该方法有效提升了UGC图像的超分辨率性能，代码已开源。 Abstract: Due to the disparity between real-world degradations in user-generated content(UGC) images and synthetic degradations, traditional super-resolution methods struggle to generalize effectively, necessitating a more robust approach to model real-world distortions. In this paper, we propose a novel approach to UGC image super-resolution by integrating semantic guidance into a diffusion framework. Our method addresses the inconsistency between degradations in wild and synthetic datasets by separately simulating the degradation processes on the LSDIR dataset and combining them with the official paired training set. Furthermore, we enhance degradation removal and detail generation by incorporating a pretrained semantic extraction model (SAM2) and fine-tuning key hyperparameters for improved perceptual fidelity. Extensive experiments demonstrate the superiority of our approach against state-of-the-art methods. Additionally, the proposed model won second place in the CVPR NTIRE 2025 Short-form UGC Image Super-Resolution Challenge, further validating its effectiveness. The code is available at https://github.c10pom/Moonsofang/NTIRE-2025-SRlab.

[103] TAMP: Token-Adaptive Layerwise Pruning in Multimodal Large Language Models

Jaewoo Lee,Keyang Xuan,Chanakya Ekbote,Sandeep Polisetty,Yi R.,Fung,Paul Pu Liang

Main category: cs.CV

TLDR: 论文提出了一种名为TAMP的剪枝框架，专门针对多模态大语言模型（MLLMs），通过多样性感知稀疏性和自适应多模态输入激活，显著提升了剪枝效果。

Details

Motivation: 传统的剪枝方法在多模态大语言模型（MLLMs）中效果有限，因为它们未能考虑到MLLMs中跨层和多模态的独特标记属性。 Method: TAMP框架包含两个关键组件：1）多样性感知稀疏性，根据多模态输出标记的多样性调整每层的稀疏率；2）自适应多模态输入激活，利用注意力分数识别代表性多模态输入标记以指导非结构化权重剪枝。 Result: 在LLaVA-NeXT和VideoLLaMA2两种MLLMs上的实验表明，TAMP的每个组件均显著优于现有剪枝技术。 Conclusion: TAMP是一种简单而有效的剪枝框架，能够显著提升多模态大语言模型的剪枝效果。 Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable versatility in understanding diverse multimodal data and tasks. However, these capabilities come with an increased model scale. While post-training pruning reduces model size in unimodal models, its application to MLLMs often yields limited success. Our analysis discovers that conventional methods fail to account for the unique token attributes across layers and modalities inherent to MLLMs. Inspired by this observation, we propose TAMP, a simple yet effective pruning framework tailored for MLLMs, featuring two key components: (1) Diversity-Aware Sparsity, which adjusts sparsity ratio per layer based on diversities among multimodal output tokens, preserving more parameters in high-diversity layers; and (2) Adaptive Multimodal Input Activation, which identifies representative multimodal input tokens using attention scores to guide unstructured weight pruning. We validate our method on two state-of-the-art MLLMs: LLaVA-NeXT, designed for vision-language tasks, and VideoLLaMA2, capable of processing audio, visual, and language modalities. Empirical experiments across various multimodal evaluation benchmarks demonstrate that each component of our approach substantially outperforms existing pruning techniques.

[104] Digital Staining with Knowledge Distillation: A Unified Framework for Unpaired and Paired-But-Misaligned Data

Ziwang Xu,Lanqing Guo,Satoshi Tsutsui,Shuyan Zhang,Alex C. Kot,Bihan Wen

Main category: cs.CV

TLDR: 提出了一种无监督深度学习框架，通过知识蒸馏减少对配对数据的需求，实现数字细胞染色。

Details

Motivation: 传统染色方法成本高、耗时长且不可逆，而现有监督学习方法需要大量对齐的配对数据，难以获取。 Method: 采用两阶段流程（光增强和着色）作为教师模型，通过知识蒸馏训练学生染色生成器，并扩展到配对但未对齐的情况，加入对齐模块。 Result: 在两种设置下均能生成更准确细胞目标位置和形状的染色图像，定量和定性结果优于竞争方法。 Conclusion: 该方法在医学应用（如白细胞数据集）中展现出潜力，为数字染色提供了高效解决方案。 Abstract: Staining is essential in cell imaging and medical diagnostics but poses significant challenges, including high cost, time consumption, labor intensity, and irreversible tissue alterations. Recent advances in deep learning have enabled digital staining through supervised model training. However, collecting large-scale, perfectly aligned pairs of stained and unstained images remains difficult. In this work, we propose a novel unsupervised deep learning framework for digital cell staining that reduces the need for extensive paired data using knowledge distillation. We explore two training schemes: (1) unpaired and (2) paired-but-misaligned settings. For the unpaired case, we introduce a two-stage pipeline, comprising light enhancement followed by colorization, as a teacher model. Subsequently, we obtain a student staining generator through knowledge distillation with hybrid non-reference losses. To leverage the pixel-wise information between adjacent sections, we further extend to the paired-but-misaligned setting, adding the Learning to Align module to utilize pixel-level information. Experiment results on our dataset demonstrate that our proposed unsupervised deep staining method can generate stained images with more accurate positions and shapes of the cell targets in both settings. Compared with competing methods, our method achieves improved results both qualitatively and quantitatively (e.g., NIQE and PSNR).We applied our digital staining method to the White Blood Cell (WBC) dataset, investigating its potential for medical applications.

[105] Small Object Detection with YOLO: A Performance Analysis Across Model Versions and Hardware

Muhammad Fasih Tariq,Muhammad Azeem Javed

Main category: cs.CV

TLDR: 本文对YOLO目标检测模型（v5至v11）在不同硬件平台和优化库上的性能进行了全面评估，分析了推理速度、检测精度及对小目标的适应性，为实际应用中的模型选择提供了指导。

Details

Motivation: 研究不同YOLO模型在多种硬件和优化库上的性能差异，帮助开发者根据实际需求选择最优模型。 Method: 在Intel和AMD CPU上使用ONNX和OpenVINO，GPU上使用TensorRT等框架，评估模型推理速度和检测精度，并分析模型对小目标的敏感性。 Result: 揭示了不同YOLO模型在效率、精度和小目标适应性上的权衡，为硬件约束和检测需求提供了选择依据。 Conclusion: 研究为实际应用中YOLO模型的部署提供了实用建议，帮助开发者根据具体需求优化模型选择。 Abstract: This paper provides an extensive evaluation of YOLO object detection models (v5, v8, v9, v10, v11) by com- paring their performance across various hardware platforms and optimization libraries. Our study investigates inference speed and detection accuracy on Intel and AMD CPUs using popular libraries such as ONNX and OpenVINO, as well as on GPUs through TensorRT and other GPU-optimized frameworks. Furthermore, we analyze the sensitivity of these YOLO models to object size within the image, examining performance when detecting objects that occupy 1%, 2.5%, and 5% of the total area of the image. By identifying the trade-offs in efficiency, accuracy, and object size adaptability, this paper offers insights for optimal model selection based on specific hardware constraints and detection requirements, aiding practitioners in deploying YOLO models effectively for real-world applications.

[106] LiteTracker: Leveraging Temporal Causality for Accurate Low-latency Tissue Tracking

Mert Asim Karaoglu,Wenbo Ji,Ahmed Abbas,Nassir Navab,Benjamin Busam,Alexander Ladikos

Main category: cs.CV

TLDR: LiteTracker是一种低延迟的组织追踪方法，通过训练无关的运行时优化，显著提升了追踪速度，同时保持了高精度。

Details

Motivation: 当前的组织追踪方法虽然精度高，但无法满足实时手术应用的低延迟需求。 Method: 基于先进的长时点追踪方法，引入训练无关的运行时优化，利用时间内存缓冲和先验运动信息。 Result: LiteTracker比其前身快7倍，比现有技术快2倍，同时在STIR和SuPer数据集上表现优异。 Conclusion: LiteTracker是实现实时手术应用中低延迟组织追踪的重要进展。 Abstract: Tissue tracking plays a critical role in various surgical navigation and extended reality (XR) applications. While current methods trained on large synthetic datasets achieve high tracking accuracy and generalize well to endoscopic scenes, their runtime performances fail to meet the low-latency requirements necessary for real-time surgical applications. To address this limitation, we propose LiteTracker, a low-latency method for tissue tracking in endoscopic video streams. LiteTracker builds on a state-of-the-art long-term point tracking method, and introduces a set of training-free runtime optimizations. These optimizations enable online, frame-by-frame tracking by leveraging a temporal memory buffer for efficient feature reuse and utilizing prior motion for accurate track initialization. LiteTracker demonstrates significant runtime improvements being around 7x faster than its predecessor and 2x than the state-of-the-art. Beyond its primary focus on efficiency, LiteTracker delivers high-accuracy tracking and occlusion prediction, performing competitively on both the STIR and SuPer datasets. We believe LiteTracker is an important step toward low-latency tissue tracking for real-time surgical applications in the operating room.

[107] Improving Multimodal Hateful Meme Detection Exploiting LMM-Generated Knowledge

Maria Tzelepi,Vasileios Mezaris

Main category: cs.CV

TLDR: 提出了一种基于大型多模态模型（LMM）的双重方法，用于检测有害的仇恨表情包，通过提取语义和情感信息，并结合硬挖掘技术，实现了最先进的性能。

Details

Motivation: 社交媒体中表情包传播广泛，但部分包含仇恨言论，危害特定群体。检测仇恨表情包成为重要任务，但由于多模态复杂性，具有挑战性。 Method: 利用LMM提取表情包的语义描述和情感信息，构建强表征；开发硬挖掘技术，将LMM编码知识直接引入训练过程。 Result: 在两个数据集上验证了方法的有效性，达到了最先进的性能。 Conclusion: 提出的双重方法显著提升了仇恨表情包检测的准确性，代码和模型已开源。 Abstract: Memes have become a dominant form of communication in social media in recent years. Memes are typically humorous and harmless, however there are also memes that promote hate speech, being in this way harmful to individuals and groups based on their identity. Therefore, detecting hateful content in memes has emerged as a task of critical importance. The need for understanding the complex interactions of images and their embedded text renders the hateful meme detection a challenging multimodal task. In this paper we propose to address the aforementioned task leveraging knowledge encoded in powerful Large Multimodal Models (LMM). Specifically, we propose to exploit LMMs in a two-fold manner. First, by extracting knowledge oriented to the hateful meme detection task in order to build strong meme representations. Specifically, generic semantic descriptions and emotions that the images along with their embedded texts elicit are extracted, which are then used to train a simple classification head for hateful meme detection. Second, by developing a novel hard mining approach introducing directly LMM-encoded knowledge to the training process, providing further improvements. We perform extensive experiments on two datasets that validate the effectiveness of the proposed method, achieving state-of-the-art performance. Our code and trained models are publicly available at: https://github.com/IDT-ITI/LMM-CLIP-meme.

Zheng Liu,Mengjie Liu,Jingzhou Chen,Jingwei Xu,Bin Cui,Conghui He,Wentao Zhang

Main category: cs.CV

TLDR: FUSION是一种多模态大语言模型（MLLM），通过全视觉-语言对齐和整合范式，实现了深度动态整合。其核心包括文本引导的统一视觉编码、上下文感知递归对齐解码和双监督语义映射损失。FUSION在多个基准测试中显著优于现有方法。

Details

Motivation: 现有方法主要在LLM解码阶段进行模态交互，缺乏深度整合。FUSION旨在实现全流程的动态整合，提升多模态任务的性能。 Method: 提出文本引导的统一视觉编码、上下文感知递归对齐解码和双监督语义映射损失，并构建高质量QA数据集。 Result: FUSION 3B和8B在多个基准测试中表现优异，甚至超越更大规模的模型（如Cambrian-1 8B）。 Conclusion: FUSION通过全模态整合方法显著提升了性能，证明了其有效性。 Abstract: We introduce FUSION, a family of multimodal large language models (MLLMs) with a fully vision-language alignment and integration paradigm. Unlike existing methods that primarily rely on late-stage modality interaction during LLM decoding, our approach achieves deep, dynamic integration throughout the entire processing pipeline. To this end, we propose Text-Guided Unified Vision Encoding, incorporating textual information in vision encoding to achieve pixel-level integration. We further design Context-Aware Recursive Alignment Decoding that recursively aggregates visual features conditioned on textual context during decoding, enabling fine-grained, question-level semantic integration. To guide feature mapping and mitigate modality discrepancies, we develop Dual-Supervised Semantic Mapping Loss. Additionally, we construct a Synthesized Language-Driven Question-Answer (QA) dataset through a new data synthesis method, prioritizing high-quality QA pairs to optimize text-guided feature integration. Building on these foundations, we train FUSION at two scales-3B, 8B-and demonstrate that our full-modality integration approach significantly outperforms existing methods with only 630 vision tokens. Notably, FUSION 3B surpasses Cambrian-1 8B and Florence-VL 8B on most benchmarks. FUSION 3B continues to outperform Cambrian-1 8B even when limited to 300 vision tokens. Our ablation studies show that FUSION outperforms LLaVA-NeXT on over half of the benchmarks under same configuration without dynamic resolution, highlighting the effectiveness of our approach. We release our code, model weights, and dataset. https://github.com/starriver030515/FUSION

[109] Omni-Dish: Photorealistic and Faithful Image Generation and Editing for Arbitrary Chinese Dishes

Huijie Liu,Bingcan Wang,Jie Hu,Xiaoming Wei,Guoliang Kang

Main category: cs.CV

TLDR: Omni-Dish是首个专为中国菜品设计的文本到图像生成模型，通过数据增强和精细训练策略，显著提升了生成图像的细节和多样性。

Details

Motivation: 现有文本到图像生成模型在捕捉特定领域（如中国菜品）的多样性和细节方面表现不佳，亟需针对性解决方案。 Method: 构建大规模菜品数据集，采用重标注策略和粗到细训练方案，结合高质量标注库和大语言模型优化输入。 Result: 实验证明Omni-Dish在生成逼真且细节丰富的中国菜品图像方面表现优越。 Conclusion: Omni-Dish及其扩展的编辑功能为特定领域的图像生成和编辑提供了有效解决方案。 Abstract: Dish images play a crucial role in the digital era, with the demand for culturally distinctive dish images continuously increasing due to the digitization of the food industry and e-commerce. In general cases, existing text-to-image generation models excel in producing high-quality images; however, they struggle to capture diverse characteristics and faithful details of specific domains, particularly Chinese dishes. To address this limitation, we propose Omni-Dish, the first text-to-image generation model specifically tailored for Chinese dishes. We develop a comprehensive dish curation pipeline, building the largest dish dataset to date. Additionally, we introduce a recaption strategy and employ a coarse-to-fine training scheme to help the model better learn fine-grained culinary nuances. During inference, we enhance the user's textual input using a pre-constructed high-quality caption library and a large language model, enabling more photorealistic and faithful image generation. Furthermore, to extend our model's capability for dish editing tasks, we propose Concept-Enhanced P2P. Based on this approach, we build a dish editing dataset and train a specialized editing model. Extensive experiments demonstrate the superiority of our methods.

[110] Efficient 2D to Full 3D Human Pose Uplifting including Joint Rotations

Katja Ludwig,Yuliia Oksymets,Robin Schön,Daniel Kienzle,Rainer Lienhart

Main category: cs.CV

TLDR: 提出了一种新型2D到3D提升模型，直接估计3D人体姿态和关节旋转，速度快且精度高。

Details

Motivation: 现有方法结合3D HPE模型和IK估计关节位置和旋转，但IK计算成本高。 Method: 研究多种旋转表示、损失函数和训练策略，直接估计3D姿态和旋转。 Result: 模型在旋转估计上达到最优精度，速度快150倍，关节定位精度超过HMR模型。 Conclusion: 新方法高效且准确，优于现有技术。 Abstract: In sports analytics, accurately capturing both the 3D locations and rotations of body joints is essential for understanding an athlete's biomechanics. While Human Mesh Recovery (HMR) models can estimate joint rotations, they often exhibit lower accuracy in joint localization compared to 3D Human Pose Estimation (HPE) models. Recent work addressed this limitation by combining a 3D HPE model with inverse kinematics (IK) to estimate both joint locations and rotations. However, IK is computationally expensive. To overcome this, we propose a novel 2D-to-3D uplifting model that directly estimates 3D human poses, including joint rotations, in a single forward pass. We investigate multiple rotation representations, loss functions, and training strategies - both with and without access to ground truth rotations. Our models achieve state-of-the-art accuracy in rotation estimation, are 150 times faster than the IK-based approach, and surpass HMR models in joint localization precision.

[111] Semantic Depth Matters: Explaining Errors of Deep Vision Networks through Perceived Class Similarities

Katarzyna Filus,Michał Romaszewski,Mateusz Żarski

Main category: cs.CV

TLDR: 论文提出了一种新框架，通过相似性深度（SD）指标和基于图的可视化方法，分析深度神经网络（DNN）的语义层次深度与误分类模式的关系。

Details

Motivation: 当前评估方法缺乏透明度，难以解释网络误分类的根本原因。 Method: 引入SD指标量化网络感知的语义层次深度，并提出基于类模板的方法，无需额外数据即可分析已训练网络。 Result: 发现深度视觉网络编码了特定语义层次，高语义深度能提升感知类别相似性与实际错误的一致性。 Conclusion: 该框架为理解DNN行为提供了新视角，揭示了语义深度与错误模式的关系。 Abstract: Understanding deep neural network (DNN) behavior requires more than evaluating classification accuracy alone; analyzing errors and their predictability is equally crucial. Current evaluation methodologies lack transparency, particularly in explaining the underlying causes of network misclassifications. To address this, we introduce a novel framework that investigates the relationship between the semantic hierarchy depth perceived by a network and its real-data misclassification patterns. Central to our framework is the Similarity Depth (SD) metric, which quantifies the semantic hierarchy depth perceived by a network along with a method of evaluation of how closely the network's errors align with its internally perceived similarity structure. We also propose a graph-based visualization of model semantic relationships and misperceptions. A key advantage of our approach is that leveraging class templates -- representations derived from classifier layer weights -- is applicable to already trained networks without requiring additional data or experiments. Our approach reveals that deep vision networks encode specific semantic hierarchies and that high semantic depth improves the compliance between perceived class similarities and actual errors.

[112] Dual-Path Enhancements in Event-Based Eye Tracking: Augmented Robustness and Adaptive Temporal Modeling

Hoang M. Truong,Vinh-Thuan Ly,Huy G. Tran,Thuan-Phat Nguyen,Tram T. Doan

Main category: cs.CV

TLDR: 论文提出了一种基于事件的眼动追踪技术，通过轻量级时空网络和两种关键改进（数据增强和混合架构KnightPupil），在CVPR 2025挑战赛上表现优异。

Details

Motivation: 现有方法在真实场景中（如快速眼动和环境噪声）表现不佳，需要更高效的解决方案。 Method: 1. 数据增强管道（时间位移、空间翻转和事件删除）提升模型鲁棒性；2. 混合架构KnightPupil（EfficientNet-B3、双向GRU和线性时变状态空间模块）动态适应稀疏输入和噪声。 Result: 在3ET+基准测试中，欧几里得距离误差降低12%（1.61 vs 1.70基线），在CVPR 2025挑战赛私有测试集上表现优异。 Conclusion: 该框架为AR/VR系统的实际部署提供了有效解决方案，并为神经形态视觉的未来创新奠定了基础。 Abstract: Event-based eye tracking has become a pivotal technology for augmented reality and human-computer interaction. Yet, existing methods struggle with real-world challenges such as abrupt eye movements and environmental noise. Building on the efficiency of the Lightweight Spatiotemporal Network-a causal architecture optimized for edge devices-we introduce two key advancements. First, a robust data augmentation pipeline incorporating temporal shift, spatial flip, and event deletion improves model resilience, reducing Euclidean distance error by 12% (1.61 vs. 1.70 baseline) on challenging samples. Second, we propose KnightPupil, a hybrid architecture combining an EfficientNet-B3 backbone for spatial feature extraction, a bidirectional GRU for contextual temporal modeling, and a Linear Time-Varying State-Space Module to adapt to sparse inputs and noise dynamically. Evaluated on the 3ET+ benchmark, our framework achieved 1.61 Euclidean distance on the private test set of the Event-based Eye Tracking Challenge at CVPR 2025, demonstrating its effectiveness for practical deployment in AR/VR systems while providing a foundation for future innovations in neuromorphic vision.

[113] SemiETS: Integrating Spatial and Content Consistencies for Semi-Supervised End-to-end Text Spotting

Dongliang Luo,Hanshen Zhu,Ziyang Zhang,Dingkang Liang,Xudong Xie,Yuliang Liu,Xiang Bai

Main category: cs.CV

TLDR: 提出了一种半监督端到端文本检测框架SemiETS，通过生成层次化伪标签和双向流信息提升性能，显著优于现有方法。

Details

Motivation: 减少高质量手动标注的高成本，利用未标注图像中的信息进行半监督文本检测。 Method: 提出SemiETS框架，生成可靠层次化伪标签，利用双向流信息提升检测与识别任务的一致性。 Result: 在多个数据集上表现优异，优于现有SSL方法，并在少量标注数据下超越全监督方法。 Conclusion: SemiETS在减少标注成本的同时提升了性能，具有实际应用潜力。 Abstract: Most previous scene text spotting methods rely on high-quality manual annotations to achieve promising performance. To reduce their expensive costs, we study semi-supervised text spotting (SSTS) to exploit useful information from unlabeled images. However, directly applying existing semi-supervised methods of general scenes to SSTS will face new challenges: 1) inconsistent pseudo labels between detection and recognition tasks, and 2) sub-optimal supervisions caused by inconsistency between teacher/student. Thus, we propose a new Semi-supervised framework for End-to-end Text Spotting, namely SemiETS that leverages the complementarity of text detection and recognition. Specifically, it gradually generates reliable hierarchical pseudo labels for each task, thereby reducing noisy labels. Meanwhile, it extracts important information in locations and transcriptions from bidirectional flows to improve consistency. Extensive experiments on three datasets under various settings demonstrate the effectiveness of SemiETS on arbitrary-shaped text. For example, it outperforms previous state-of-the-art SSL methods by a large margin on end-to-end spotting (+8.7%, +5.6%, and +2.6% H-mean under 0.5%, 1%, and 2% labeled data settings on Total-Text, respectively). More importantly, it still improves upon a strongly supervised text spotter trained with plenty of labeled data by 2.0%. Compelling domain adaptation ability shows practical potential. Moreover, our method demonstrates consistent improvement on different text spotters.

[114] Enhancing Multi-task Learning Capability of Medical Generalist Foundation Model via Image-centric Multi-annotation Data

Xun Zhu,Fanbin Mo,Zheng Zhang,Jiaxi Wang,Yiming Shi,Ming Wu,Chuang Zhang,Miao Li,Ji Wu

Main category: cs.CV

TLDR: 论文提出了一种以数据为中心的多任务学习方法，通过构建IMAX数据集（图像中心的多标注X射线数据集），显著提升了医学多模态大语言模型的多任务性能。

Details

Motivation: 现有医学通用基础模型多关注数据规模或架构改进，而忽视了从数据角度重新审视多任务学习，导致图像-任务对齐分散，无法满足临床需求。 Method: 构建IMAX数据集，包含354K条高质量标注数据，每张X射线图像平均关联4.10个任务和7.46条训练条目，确保多任务表示的丰富性。 Result: IMAX在七种开源医学MLLMs中，多任务平均性能提升3.20%至21.05%，并分析了其与DMAX在统计模式和优化动态上的差异。 Conclusion: IMAX数据构建方法有效提升了多任务学习能力，并提出了一种基于DMAX的优化训练策略，以解决实际场景中高质量数据获取的难题。 Abstract: The emergence of medical generalist foundation models has revolutionized conventional task-specific model development paradigms, aiming to better handle multiple tasks through joint training on large-scale medical datasets. However, recent advances prioritize simple data scaling or architectural component enhancement, while neglecting to re-examine multi-task learning from a data-centric perspective. Critically, simply aggregating existing data resources leads to decentralized image-task alignment, which fails to cultivate comprehensive image understanding or align with clinical needs for multi-dimensional image interpretation. In this paper, we introduce the image-centric multi-annotation X-ray dataset (IMAX), the first attempt to enhance the multi-task learning capabilities of medical multi-modal large language models (MLLMs) from the data construction level. To be specific, IMAX is featured from the following attributes: 1) High-quality data curation. A comprehensive collection of more than 354K entries applicable to seven different medical tasks. 2) Image-centric dense annotation. Each X-ray image is associated with an average of 4.10 tasks and 7.46 training entries, ensuring multi-task representation richness per image. Compared to the general decentralized multi-annotation X-ray dataset (DMAX), IMAX consistently demonstrates significant multi-task average performance gains ranging from 3.20% to 21.05% across seven open-source state-of-the-art medical MLLMs. Moreover, we investigate differences in statistical patterns exhibited by IMAX and DMAX training processes, exploring potential correlations between optimization dynamics and multi-task performance. Finally, leveraging the core concept of IMAX data construction, we propose an optimized DMAX-based training strategy to alleviate the dilemma of obtaining high-quality IMAX data in practical scenarios.

[115] Beyond Degradation Redundancy: Contrastive Prompt Learning for All-in-One Image Restoration

Gang Wu,Junjun Jiang,Kui Jiang,Xianming Liu,Liqiang Nie

Main category: cs.CV

TLDR: 论文提出了一种名为对比提示学习（CPL）的新框架，通过稀疏提示模块（SPM）和对比提示正则化（CPR）改进多任务图像恢复模型的提示-任务对齐。

Details

Motivation: 现有方法在统一模型中设计任务特定提示时，存在任务表示重叠或冗余的问题，同时显式提示可能丢失关键视觉信息。 Method: CPL结合了SPM（高效捕获退化特征并减少冗余）和CPR（通过负提示样本强化任务边界）。 Result: 在五个基准测试中，CPL显著提升了多任务和复合退化场景下的性能，并保持了参数效率。 Conclusion: CPL为统一图像恢复提供了一种高效且性能优越的解决方案。 Abstract: All-in-one image restoration, addressing diverse degradation types with a unified model, presents significant challenges in designing task-specific prompts that effectively guide restoration across multiple degradation scenarios. While adaptive prompt learning enables end-to-end optimization, it often yields overlapping or redundant task representations. Conversely, explicit prompts derived from pretrained classifiers enhance discriminability but may discard critical visual information for reconstruction. To address these limitations, we introduce Contrastive Prompt Learning (CPL), a novel framework that fundamentally enhances prompt-task alignment through two complementary innovations: a \emph{Sparse Prompt Module (SPM)} that efficiently captures degradation-specific features while minimizing redundancy, and a \emph{Contrastive Prompt Regularization (CPR)} that explicitly strengthens task boundaries by incorporating negative prompt samples across different degradation types. Unlike previous approaches that focus primarily on degradation classification, CPL optimizes the critical interaction between prompts and the restoration model itself. Extensive experiments across five comprehensive benchmarks demonstrate that CPL consistently enhances state-of-the-art all-in-one restoration models, achieving significant improvements in both standard multi-task scenarios and challenging composite degradation settings. Our framework establishes new state-of-the-art performance while maintaining parameter efficiency, offering a principled solution for unified image restoration.

[116] Resampling Benchmark for Efficient Comprehensive Evaluation of Large Vision-Language Models

Teppei Suzuki,Keisuke Ozawa

Main category: cs.CV

TLDR: 提出了一种基于最远点采样（FPS）的高效评估协议，用于大规模视觉语言模型（VLMs），仅需1%数据即可保持与完整评估的高相关性（>0.96）。

Details

Motivation: 由于VLMs的广泛知识和推理能力，全面评估需要多个基准测试，计算成本高。 Method: 采用最远点采样（FPS）构建评估子集，实验表明其与完整评估结果高度相关。 Result: FPS基准测试仅使用1%数据，与完整评估相关性>0.96，并能减少数据集偏差。 Conclusion: FPS方法高效且可靠，可用于优化VLMs的评估流程。 Abstract: We propose an efficient evaluation protocol for large vision-language models (VLMs). Given their broad knowledge and reasoning capabilities, multiple benchmarks are needed for comprehensive assessment, making evaluation computationally expensive. To improve efficiency, we construct a subset that yields results comparable to full benchmark evaluations. Our benchmark classification experiments reveal that no single benchmark fully covers all challenges. We then introduce a subset construction method using farthest point sampling (FPS). Our experiments show that FPS-based benchmarks maintain a strong correlation (> 0.96) with full evaluations while using only ~1\% of the data. Additionally, applying FPS to an existing benchmark improves correlation with overall evaluation results, suggesting its potential to reduce unintended dataset biases.

[117] Correlative and Discriminative Label Grouping for Multi-Label Visual Prompt Tuning

LeiLei Ma,Shuo Xu,MingKun Xie,Lei Wang,Dengdi Sun,Haifeng Zhao

Main category: cs.CV

TLDR: 提出了一种多标签视觉提示调优框架，通过平衡标签的相关性和区分性关系，避免过拟合，提升模型性能。

Details

Motivation: 现有研究过度强调标签共现关系，导致过拟合风险，需平衡相关性和区分性关系以优化模型。 Method: 将类别分组，利用多提示令牌和混合专家模型分别建模相关性和区分性关系。 Result: 在多个基准数据集上表现优异，优于现有最优方法。 Conclusion: 该方法有效平衡标签关系，提升分类性能。 Abstract: Modeling label correlations has always played a pivotal role in multi-label image classification (MLC), attracting significant attention from researchers. However, recent studies have overemphasized co-occurrence relationships among labels, which can lead to overfitting risk on this overemphasis, resulting in suboptimal models. To tackle this problem, we advocate for balancing correlative and discriminative relationships among labels to mitigate the risk of overfitting and enhance model performance. To this end, we propose the Multi-Label Visual Prompt Tuning framework, a novel and parameter-efficient method that groups classes into multiple class subsets according to label co-occurrence and mutual exclusivity relationships, and then models them respectively to balance the two relationships. In this work, since each group contains multiple classes, multiple prompt tokens are adopted within Vision Transformer (ViT) to capture the correlation or discriminative label relationship within each group, and effectively learn correlation or discriminative representations for class subsets. On the other hand, each group contains multiple group-aware visual representations that may correspond to multiple classes, and the mixture of experts (MoE) model can cleverly assign them from the group-aware to the label-aware, adaptively obtaining label-aware representation, which is more conducive to classification. Experiments on multiple benchmark datasets show that our proposed approach achieves competitive results and outperforms SOTA methods on multiple pre-trained models.

[118] Metric-Guided Synthesis of Class Activation Mapping

Alejandro Luque-Cerpa,Elizabeth Polgreen,Ajitha Rajan,Hazem Torfah

Main category: cs.CV

TLDR: SyCAM是一种基于度量的方法，用于合成CAM表达式，优化用户定义的评估指标，生成目标热图。

Details

Motivation: 现有CAM方法无法根据用户意图或领域知识灵活生成热图，限制了其适用性。 Method: 提出SyCAM，通过语法引导的合成方法，基于预定义的评估指标和语法约束生成CAM表达式。 Result: 在ResNet50、VGG16和VGG19模型上验证了SyCAM的有效性和灵活性，优于其他CAM方法。 Conclusion: SyCAM能够根据用户需求生成定制化的热图，解决了现有CAM方法的局限性。 Abstract: Class activation mapping (CAM) is a widely adopted class of saliency methods used to explain the behavior of convolutional neural networks (CNNs). These methods generate heatmaps that highlight the parts of the input most relevant to the CNN output. Various CAM methods have been proposed, each distinguished by the expressions used to derive heatmaps. In general, users look for heatmaps with specific properties that reflect different aspects of CNN functionality. These may include similarity to ground truth, robustness, equivariance, and more. Although existing CAM methods implicitly encode some of these properties in their expressions, they do not allow for variability in heatmap generation following the user's intent or domain knowledge. In this paper, we address this limitation by introducing SyCAM, a metric-based approach for synthesizing CAM expressions. Given a predefined evaluation metric for saliency maps, SyCAM automatically generates CAM expressions optimized for that metric. We specifically explore a syntax-guided synthesis instantiation of SyCAM, where CAM expressions are derived based on predefined syntactic constraints and the given metric. Using several established evaluation metrics, we demonstrate the efficacy and flexibility of our approach in generating targeted heatmaps. We compare SyCAM with other well-known CAM methods on three prominent models: ResNet50, VGG16, and VGG19.

[119] GaussVideoDreamer: 3D Scene Generation with Video Diffusion and Inconsistency-Aware Gaussian Splatting

Junlin Hao,Peiheng Wang,Haoyang Wang,Xinggong Zhang,Zongming Guo

Main category: cs.CV

TLDR: GaussVideoDreamer提出了一种结合图像、视频和3D生成的方法，通过渐进式视频修复和3D高斯溅射一致性掩码，显著提升了多视图一致性和计算效率。

Details

Motivation: 单图像3D场景重建因其病态性和输入限制面临挑战，现有方法在多视图一致性或泛化能力上存在不足。 Method: 结合几何感知初始化、不一致感知高斯溅射和渐进式视频修复策略。 Result: 实验显示，该方法在LLaVA-IQA评分上提升32%，速度至少提高2倍。 Conclusion: GaussVideoDreamer在性能和效率上优于现有方法，适用于多样化场景。 Abstract: Single-image 3D scene reconstruction presents significant challenges due to its inherently ill-posed nature and limited input constraints. Recent advances have explored two promising directions: multiview generative models that train on 3D consistent datasets but struggle with out-of-distribution generalization, and 3D scene inpainting and completion frameworks that suffer from cross-view inconsistency and suboptimal error handling, as they depend exclusively on depth data or 3D smoothness, which ultimately degrades output quality and computational performance. Building upon these approaches, we present GaussVideoDreamer, which advances generative multimedia approaches by bridging the gap between image, video, and 3D generation, integrating their strengths through two key innovations: (1) A progressive video inpainting strategy that harnesses temporal coherence for improved multiview consistency and faster convergence. (2) A 3D Gaussian Splatting consistency mask to guide the video diffusion with 3D consistent multiview evidence. Our pipeline combines three core components: a geometry-aware initialization protocol, Inconsistency-Aware Gaussian Splatting, and a progressive video inpainting strategy. Experimental results demonstrate that our approach achieves 32% higher LLaVA-IQA scores and at least 2x speedup compared to existing methods while maintaining robust performance across diverse scenes.

[120] An Image is Worth $K$ Topics: A Visual Structural Topic Model with Pretrained Image Embeddings

Matías Piqueras,Alexandra Segerberg,Matteo Magnani,Måns Magnusson,Nataša Sladoje

Main category: cs.CV

TLDR: 提出了一种结合预训练图像嵌入和结构化主题模型的视觉结构化主题模型（vSTM），用于分析政治视觉内容。

Details

Motivation: 现有计算工具在分析政治和社会视觉内容时缺乏针对性方法，需要更适应政治研究需求的模型。 Method: 结合预训练图像嵌入和结构化主题模型，捕捉图像的语义复杂性，并分析主题与协变量关系。 Result: vSTM能够识别可解释、连贯且与在线政治传播研究相关的主题。 Conclusion: vSTM为政治视觉内容分析提供了有效工具，具有语义复杂性和主题分析优势。 Abstract: Political scientists are increasingly interested in analyzing visual content at scale. However, the existing computational toolbox is still in need of methods and models attuned to the specific challenges and goals of social and political inquiry. In this article, we introduce a visual Structural Topic Model (vSTM) that combines pretrained image embeddings with a structural topic model. This has important advantages compared to existing approaches. First, pretrained embeddings allow the model to capture the semantic complexity of images relevant to political contexts. Second, the structural topic model provides the ability to analyze how topics and covariates are related, while maintaining a nuanced representation of images as a mixture of multiple topics. In our empirical application, we show that the vSTM is able to identify topics that are interpretable, coherent, and substantively relevant to the study of online political communication.

[121] EBAD-Gaussian: Event-driven Bundle Adjusted Deblur Gaussian Splatting

Yufei Deng,Yuanjian Wang,Rong Xiao,Chenwei Tang,Jizhe Zhou,Jiahao Fan,Deng Xiong,Jiancheng Lv,Huajin Tang

Main category: cs.CV

TLDR: EBAD-Gaussian利用事件相机和模糊图像重建清晰的3D高斯模型，通过联合学习高斯参数和相机运动轨迹，显著提升了运动模糊场景下的重建质量。

Details

Motivation: 在快速运动或低光条件下，RGB图像的去模糊方法难以准确建模相机位姿和辐射变化，导致重建精度下降。事件相机能捕捉亮度连续变化，为建模运动模糊提供了新思路。 Method: 提出EBAD-Gaussian方法，通过模糊损失函数和事件流监督，联合优化高斯参数和相机轨迹，并基于事件双积分先验增强中间图像的细节。 Result: 在合成和真实数据集上，EBAD-Gaussian在模糊图像和事件流输入下实现了高质量的3D场景重建。 Conclusion: EBAD-Gaussian有效解决了运动模糊问题，为事件相机与3D高斯重建的结合提供了新方向。 Abstract: While 3D Gaussian Splatting (3D-GS) achieves photorealistic novel view synthesis, its performance degrades with motion blur. In scenarios with rapid motion or low-light conditions, existing RGB-based deblurring methods struggle to model camera pose and radiance changes during exposure, reducing reconstruction accuracy. Event cameras, capturing continuous brightness changes during exposure, can effectively assist in modeling motion blur and improving reconstruction quality. Therefore, we propose Event-driven Bundle Adjusted Deblur Gaussian Splatting (EBAD-Gaussian), which reconstructs sharp 3D Gaussians from event streams and severely blurred images. This method jointly learns the parameters of these Gaussians while recovering camera motion trajectories during exposure time. Specifically, we first construct a blur loss function by synthesizing multiple latent sharp images during the exposure time, minimizing the difference between real and synthesized blurred images. Then we use event stream to supervise the light intensity changes between latent sharp images at any time within the exposure period, supplementing the light intensity dynamic changes lost in RGB images. Furthermore, we optimize the latent sharp images at intermediate exposure times based on the event-based double integral (EDI) prior, applying consistency constraints to enhance the details and texture information of the reconstructed images. Extensive experiments on synthetic and real-world datasets show that EBAD-Gaussian can achieve high-quality 3D scene reconstruction under the condition of blurred images and event stream inputs.

[122] RGB-Event based Pedestrian Attribute Recognition: A Benchmark Dataset and An Asymmetric RWKV Fusion Framework

Xiao Wang,Haiyang Wang,Shiao Wang,Qiang Chen,Jiandong Jin,Haoyu Song,Bo Jiang,Chenglong Li

Main category: cs.CV

TLDR: 提出了一种基于RGB-Event多模态的行人属性识别方法，解决了RGB相机在光照和运动模糊上的限制，并探索了情感维度。

Details

Motivation: 现有方法依赖RGB相机，受限于光照和运动模糊，且缺乏对情感维度的分析。 Method: 提出了首个大规模多模态数据集EventPAR，包含100K样本，覆盖外观和情感属性；设计了基于RWKV的多模态框架。 Result: 在EventPAR及模拟数据集上取得最优性能。 Conclusion: 为未来研究提供了数据和算法基准，代码和数据集将开源。 Abstract: Existing pedestrian attribute recognition methods are generally developed based on RGB frame cameras. However, these approaches are constrained by the limitations of RGB cameras, such as sensitivity to lighting conditions and motion blur, which hinder their performance. Furthermore, current attribute recognition primarily focuses on analyzing pedestrians' external appearance and clothing, lacking an exploration of emotional dimensions. In this paper, we revisit these issues and propose a novel multi-modal RGB-Event attribute recognition task by drawing inspiration from the advantages of event cameras in low-light, high-speed, and low-power consumption. Specifically, we introduce the first large-scale multi-modal pedestrian attribute recognition dataset, termed EventPAR, comprising 100K paired RGB-Event samples that cover 50 attributes related to both appearance and six human emotions, diverse scenes, and various seasons. By retraining and evaluating mainstream PAR models on this dataset, we establish a comprehensive benchmark and provide a solid foundation for future research in terms of data and algorithmic baselines. In addition, we propose a novel RWKV-based multi-modal pedestrian attribute recognition framework, featuring an RWKV visual encoder and an asymmetric RWKV fusion module. Extensive experiments are conducted on our proposed dataset as well as two simulated datasets (MARS-Attribute and DukeMTMC-VID-Attribute), achieving state-of-the-art results. The source code and dataset will be released on https://github.com/Event-AHU/OpenPAR

[123] Masked Autoencoder Self Pre-Training for Defect Detection in Microelectronics

Nikolai Röhrich,Alwin Hoffmann,Richard Nordsieck,Emilio Zarbali,Alireza Javanmardi

Main category: cs.CV

TLDR: 论文提出了一种基于掩码自编码器（MAE）的视觉变换器（ViT）预训练框架，用于微电子缺陷检测，解决了数据稀疏和领域差异问题。

Details

Motivation: 微电子缺陷检测仍依赖卷积神经网络（CNN），而变换器（Transformer）因数据需求高和领域数据稀缺难以应用。预训练自然图像数据集因领域差异效果不佳，因此提出自预训练方法。 Method: 采用掩码自编码器（MAE）对ViT进行自预训练，预训练目标为重建掩码图像块。实验使用少于10,000张扫描声学显微镜（SAM）图像，标注基于瞬态热分析（TTA）。 Result: 自预训练ViT性能显著优于监督ViT、自然图像预训练ViT和文献中CNN模型。可解释性分析显示模型能正确聚焦缺陷相关特征（如焊料裂纹）。 Conclusion: 自预训练方法生成缺陷特异性特征表示，适用于实际微电子缺陷检测。 Abstract: Whereas in general computer vision, transformer-based architectures have quickly become the gold standard, microelectronics defect detection still heavily relies on convolutional neural networks (CNNs). We hypothesize that this is due to the fact that a) transformers have an increased need for data and b) labelled image generation procedures for microelectronics are costly, and labelled data is therefore sparse. Whereas in other domains, pre-training on large natural image datasets can mitigate this problem, in microelectronics transfer learning is hindered due to the dissimilarity of domain data and natural images. Therefore, we evaluate self pre-training, where models are pre-trained on the target dataset, rather than another dataset. We propose a vision transformer (ViT) pre-training framework for defect detection in microelectronics based on masked autoencoders (MAE). In MAE, a large share of image patches is masked and reconstructed by the model during pre-training. We perform pre-training and defect detection using a dataset of less than 10.000 scanning acoustic microscopy (SAM) images labelled using transient thermal analysis (TTA). Our experimental results show that our approach leads to substantial performance gains compared to a) supervised ViT, b) ViT pre-trained on natural image datasets, and c) state-of-the-art CNN-based defect detection models used in the literature. Additionally, interpretability analysis reveals that our self pre-trained models, in comparison to ViT baselines, correctly focus on defect-relevant features such as cracks in the solder material. This demonstrates that our approach yields fault-specific feature representations, making our self pre-trained models viable for real-world defect detection in microelectronics.

[124] Relative Illumination Fields: Learning Medium and Light Independent Underwater Scenes

Mengkun She,Felix Seegräber,David Nakath,Patricia Schöntag,Kevin Köser

Main category: cs.CV

TLDR: 提出一种在非均匀光照和散射环境中构建一致且逼真的神经辐射场的方法，解决了动态光源与静态散射介质交互的挑战。

Details

Motivation: 现有水下场景表示方法多针对静态均匀光照，而忽略了如机器人探索深水区时光照不足的情况。 Method: 提出一种与相机局部关联的照明场，结合体积介质表示，处理动态光照与静态散射介质的交互。 Result: 评估结果表明该方法有效且灵活。 Conclusion: 该方法为动态光照与散射介质交互的场景提供了一种有效的解决方案。 Abstract: We address the challenge of constructing a consistent and photorealistic Neural Radiance Field in inhomogeneously illuminated, scattering environments with unknown, co-moving light sources. While most existing works on underwater scene representation focus on a static homogeneous illumination, limited attention has been paid to scenarios such as when a robot explores water deeper than a few tens of meters, where sunlight becomes insufficient. To address this, we propose a novel illumination field locally attached to the camera, enabling the capture of uneven lighting effects within the viewing frustum. We combine this with a volumetric medium representation to an overall method that effectively handles interaction between dynamic illumination field and static scattering medium. Evaluation results demonstrate the effectiveness and flexibility of our approach.

[125] TT3D: Table Tennis 3D Reconstruction

Thomas Gossard,Andreas Ziegler,Andreas Zell

Main category: cs.CV

TLDR: 提出了一种从乒乓球比赛录像中重建精确3D球轨迹的新方法，结合物理运动模型和自动相机校准，无需依赖人体姿态或球拍跟踪。

Details

Motivation: 传统2D球追踪依赖摄像机视角，无法支持全面的比赛分析，需要更精确的3D重建方法。 Method: 利用球的物理运动模型最小化重投影误差，自动校准相机并跟踪球员动作，实现3D轨迹重建。 Result: 方法能够准确重建3D球轨迹，并推断球的旋转，无需依赖不可靠的人体姿态或球拍跟踪。 Conclusion: 提出的方法实现了乒乓球比赛的完整3D重建，为体育分析提供了更全面的数据支持。 Abstract: Sports analysis requires processing large amounts of data, which is time-consuming and costly. Advancements in neural networks have significantly alleviated this burden, enabling highly accurate ball tracking in sports broadcasts. However, relying solely on 2D ball tracking is limiting, as it depends on the camera's viewpoint and falls short of supporting comprehensive game analysis. To address this limitation, we propose a novel approach for reconstructing precise 3D ball trajectories from online table tennis match recordings. Our method leverages the underlying physics of the ball's motion to identify the bounce state that minimizes the reprojection error of the ball's flying trajectory, hence ensuring an accurate and reliable 3D reconstruction. A key advantage of our approach is its ability to infer ball spin without relying on human pose estimation or racket tracking, which are often unreliable or unavailable in broadcast footage. We developed an automated camera calibration method capable of reliably tracking camera movements. Additionally, we adapted an existing 3D pose estimation model, which lacks depth motion capture, to accurately track player movements. Together, these contributions enable the full 3D reconstruction of a table tennis rally.

[126] Investigating the Role of Bilateral Symmetry for Inpainting Brain MRI

Sergey Kuznetsov,Sanduni Pinnawala,Peter A. Wijeratne,Ivor J. A. Simpson

Main category: cs.CV

TLDR: 论文研究了MRI图像修复中大脑结构的统计关系，重点关注对称性对修复过程的影响。

Details

Motivation: 探索MRI图像修复中大脑结构修复与条件信息（如对称性）的统计关系，以理解模型的信息来源。 Method: 通过分析扩散修复模型，研究不同区域掩蔽对修复结果的影响，特别是对侧结构的对称性作用。 Result: 实验表明某些大脑结构在修复过程中强烈依赖于对称性条件。 Conclusion: 对称性在MRI图像修复中具有重要作用，尤其是在特定大脑结构的修复中。 Abstract: Inpainting has recently emerged as a valuable and interesting technology to employ in the analysis of medical imaging data, in particular brain MRI. A wide variety of methodologies for inpainting MRI have been proposed and demonstrated on tasks including anomaly detection. In this work we investigate the statistical relationship between inpainted brain structures and the amount of subject-specific conditioning information, i.e. the other areas of the image that are masked. In particular, we analyse the distribution of inpainting results when masking additional regions of the image, specifically the contra-lateral structure. This allows us to elucidate where in the brain the model is drawing information from, and in particular, what is the importance of hemispherical symmetry? Our experiments interrogate a diffusion inpainting model through analysing the inpainting of subcortical brain structures based on intensity and estimated area change. We demonstrate that some structures show a strong influence of symmetry in the conditioning of the inpainting process.

[127] Aligning Anime Video Generation with Human Feedback

Bingwen Zhu,Yudong Jiang,Baohan Xu,Siqian Yang,Mingyu Yin,Yidi Wu,Huyang Sun,Zuxuan Wu

Main category: cs.CV

TLDR: 论文提出了一种基于人类反馈的动漫视频生成优化方法，通过构建首个多维度奖励数据集和开发AnimeReward模型，结合GAPO训练方法，显著提升了动漫视频的质量和一致性。

Details

Motivation: 动漫视频生成因数据稀缺和独特运动模式导致质量不佳，现有奖励模型无法满足动漫的特殊需求，需通过人类反馈优化对齐。 Method: 构建包含30k人类标注样本的多维度奖励数据集，开发AnimeReward模型，并引入GAPO训练方法以优化偏好对齐。 Result: AnimeReward优于现有奖励模型，GAPO进一步提升了定量指标和人类评估中的对齐效果。 Conclusion: 提出的方法有效提升了动漫视频生成质量，数据集和代码将公开。 Abstract: Anime video generation faces significant challenges due to the scarcity of anime data and unusual motion patterns, leading to issues such as motion distortion and flickering artifacts, which result in misalignment with human preferences. Existing reward models, designed primarily for real-world videos, fail to capture the unique appearance and consistency requirements of anime. In this work, we propose a pipeline to enhance anime video generation by leveraging human feedback for better alignment. Specifically, we construct the first multi-dimensional reward dataset for anime videos, comprising 30k human-annotated samples that incorporating human preferences for both visual appearance and visual consistency. Based on this, we develop AnimeReward, a powerful reward model that employs specialized vision-language models for different evaluation dimensions to guide preference alignment. Furthermore, we introduce Gap-Aware Preference Optimization (GAPO), a novel training method that explicitly incorporates preference gaps into the optimization process, enhancing alignment performance and efficiency. Extensive experiment results show that AnimeReward outperforms existing reward models, and the inclusion of GAPO leads to superior alignment in both quantitative benchmarks and human evaluations, demonstrating the effectiveness of our pipeline in enhancing anime video quality. Our dataset and code will be publicly available.

[128] Multi-Object Grounding via Hierarchical Contrastive Siamese Transformers

Chengyi Du,Keyan Jin

Main category: cs.CV

TLDR: 提出了一种名为H-COST的方法，通过分层对比Siamese Transformer框架解决3D场景中多目标定位问题，性能提升9.5%。

Details

Motivation: 现实场景中需要定位多个对象，而现有方法主要针对单目标定位，无法满足需求。 Method: 采用分层处理策略逐步优化目标定位，并引入对比Siamese Transformer框架，利用辅助网络增强参考网络的语义理解。 Result: 在复杂多目标定位基准测试中，性能优于现有方法9.5%。 Conclusion: H-COST通过分层和对比机制显著提升了多目标定位的能力，适用于复杂语言指令和点云数据处理。 Abstract: Multi-object grounding in 3D scenes involves localizing multiple objects based on natural language input. While previous work has primarily focused on single-object grounding, real-world scenarios often demand the localization of several objects. To tackle this challenge, we propose Hierarchical Contrastive Siamese Transformers (H-COST), which employs a Hierarchical Processing strategy to progressively refine object localization, enhancing the understanding of complex language instructions. Additionally, we introduce a Contrastive Siamese Transformer framework, where two networks with the identical structure are used: one auxiliary network processes robust object relations from ground-truth labels to guide and enhance the second network, the reference network, which operates on segmented point-cloud data. This contrastive mechanism strengthens the model' s semantic understanding and significantly enhances its ability to process complex point-cloud data. Our approach outperforms previous state-of-the-art methods by 9.5% on challenging multi-object grounding benchmarks.

[129] Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure

Théo Gigant,Camille Guinaudeau,Frédéric Dufaux

Main category: cs.CV

TLDR: 本文研究了如何利用视觉语言模型（VLMs）对多模态演示进行自动摘要，提出了在不同输入长度预算下的成本效益策略，并分析了跨模态交互的特性。

Details

Motivation: 多模态演示（如视频、幻灯片和文本）的自动摘要需求日益增长，但现有方法在处理多模态输入时效率不高。本文旨在探索VLMs在多模态摘要中的表现，并提出优化策略。 Method: 通过定量和定性分析，比较了不同输入表示（如原始视频、幻灯片、文本及其组合）对摘要生成的影响，并提出了结构化表示方法。 Result: 实验表明，使用幻灯片作为输入比原始视频更有效，而幻灯片与文本的混合结构化表示性能最佳。 Conclusion: 本文为多模态摘要提供了实用策略，并指出了VLMs在跨模态理解方面的改进方向。 Abstract: Vision-Language Models (VLMs) can process visual and textual information in multiple formats: texts, images, interleaved texts and images, or even hour-long videos. In this work, we conduct fine-grained quantitative and qualitative analyses of automatic summarization of multimodal presentations using VLMs with various representations as input. From these experiments, we suggest cost-effective strategies for generating summaries from text-heavy multimodal documents under different input-length budgets using VLMs. We show that slides extracted from the video stream can be beneficially used as input against the raw video, and that a structured representation from interleaved slides and transcript provides the best performance. Finally, we reflect and comment on the nature of cross-modal interactions in multimodal presentations and share suggestions to improve the capabilities of VLMs to understand documents of this nature.

[130] Mavors: Multi-granularity Video Representation for Multimodal Large Language Model

Yang Shi,Jiaheng Liu,Yushuo Guan,Zhenhua Wu,Yuanxing Zhang,Zihao Wang,Weihong Lin,Jingyun Hua,Zekun Wang,Xinlong Chen,Bohan Zeng,Wentao Zhang,Fuzheng Zhang,Wenjing Yang,Di Zhang

Main category: cs.CV

TLDR: Mavors是一个多粒度视频表示框架，用于解决长视频理解中的计算效率与细粒度时空模式保留问题。

Details

Motivation: 现有方法在长视频理解中存在信息丢失问题，尤其是在复杂运动或分辨率变化的视频中。 Method: Mavors通过Intra-chunk Vision Encoder（IVE）和Inter-chunk Feature Aggregator（IFA）直接编码原始视频内容，保留高分辨率空间特征并建立时间连贯性。 Result: 实验表明，Mavors在保持空间保真度和时间连续性方面优于现有方法。 Conclusion: Mavors为长视频理解提供了一种高效且保留细节的解决方案。 Abstract: Long-context video understanding in multimodal large language models (MLLMs) faces a critical challenge: balancing computational efficiency with the retention of fine-grained spatio-temporal patterns. Existing approaches (e.g., sparse sampling, dense sampling with low resolution, and token compression) suffer from significant information loss in temporal dynamics, spatial details, or subtle interactions, particularly in videos with complex motion or varying resolutions. To address this, we propose $\mathbf{Mavors}$, a novel framework that introduces $\mathbf{M}$ulti-gr$\mathbf{a}$nularity $\mathbf{v}$ide$\mathbf{o}$ $\mathbf{r}$epre$\mathbf{s}$entation for holistic long-video modeling. Specifically, Mavors directly encodes raw video content into latent representations through two core components: 1) an Intra-chunk Vision Encoder (IVE) that preserves high-resolution spatial features via 3D convolutions and Vision Transformers, and 2) an Inter-chunk Feature Aggregator (IFA) that establishes temporal coherence across chunks using transformer-based dependency modeling with chunk-level rotary position encodings. Moreover, the framework unifies image and video understanding by treating images as single-frame videos via sub-image decomposition. Experiments across diverse benchmarks demonstrate Mavors' superiority in maintaining both spatial fidelity and temporal continuity, significantly outperforming existing methods in tasks requiring fine-grained spatio-temporal reasoning.

[131] DTFSal: Audio-Visual Dynamic Token Fusion for Video Saliency Prediction

Kiana Hoshanfar,Alireza Hosseini,Ahmad Kalhor,Babak Nadjar Araabi

Main category: cs.CV

TLDR: DFTSal是一种新颖的音频-视觉显著性预测框架，通过动态令牌融合和自适应多模态融合，平衡了准确性和计算效率。

Details

Motivation: 现有视觉显著性方法难以有效整合听觉信息，且计算复杂度高。 Method: 提出DFTSal框架，包含多尺度视觉编码器（LTEB和DLTFB模块）、音频分支和自适应多模态融合块（AMFB）。 Result: 在六个基准测试中达到SOTA性能，同时保持计算效率。 Conclusion: DFTSal通过动态令牌融合和多模态融合，显著提升了音频-视觉显著性预测的性能。 Abstract: Audio-visual saliency prediction aims to mimic human visual attention by identifying salient regions in videos through the integration of both visual and auditory information. Although visual-only approaches have significantly advanced, effectively incorporating auditory cues remains challenging due to complex spatio-temporal interactions and high computational demands. To address these challenges, we propose Dynamic Token Fusion Saliency (DFTSal), a novel audio-visual saliency prediction framework designed to balance accuracy with computational efficiency. Our approach features a multi-scale visual encoder equipped with two novel modules: the Learnable Token Enhancement Block (LTEB), which adaptively weights tokens to emphasize crucial saliency cues, and the Dynamic Learnable Token Fusion Block (DLTFB), which employs a shifting operation to reorganize and merge features, effectively capturing long-range dependencies and detailed spatial information. In parallel, an audio branch processes raw audio signals to extract meaningful auditory features. Both visual and audio features are integrated using our Adaptive Multimodal Fusion Block (AMFB), which employs local, global, and adaptive fusion streams for precise cross-modal fusion. The resulting fused features are processed by a hierarchical multi-decoder structure, producing accurate saliency maps. Extensive evaluations on six audio-visual benchmarks demonstrate that DFTSal achieves SOTA performance while maintaining computational efficiency.

[132] Hierarchical Relation-augmented Representation Generalization for Few-shot Action Recognition

Hongyu Qu,Ling Xing,Rui Yan,Yazhou Yao,Guo-Sen Xie,Xiangbo Shu

Main category: cs.CV

TLDR: HR2G-shot是一个用于少样本动作识别（FSAR）的层次化关系增强表示泛化框架，通过统一三种关系建模（帧间、视频间和任务间）来学习任务特定的时间模式。

Details

Motivation: 现有方法通常独立学习视频的帧级表示，忽略了视频和任务之间的显式关系建模，无法捕捉跨视频的共享时间模式或重用历史任务的时间知识。 Method: HR2G-shot包含两个组件：1）视频间语义相关性（ISC）进行细粒度的跨视频帧级交互；2）任务间知识转移（IKT）从历史任务中检索和聚合相关时间知识。 Result: 在五个基准测试中，HR2G-shot优于当前领先的FSAR方法。 Conclusion: HR2G-shot通过层次化关系建模，显著提升了少样本动作识别的性能。 Abstract: Few-shot action recognition (FSAR) aims to recognize novel action categories with few exemplars. Existing methods typically learn frame-level representations independently for each video by designing various inter-frame temporal modeling strategies. However, they neglect explicit relation modeling between videos and tasks, thus failing to capture shared temporal patterns across videos and reuse temporal knowledge from historical tasks. In light of this, we propose HR2G-shot, a Hierarchical Relation-augmented Representation Generalization framework for FSAR, which unifies three types of relation modeling (inter-frame, inter-video, and inter-task) to learn task-specific temporal patterns from a holistic view. In addition to conducting inter-frame temporal interactions, we further devise two components to respectively explore inter-video and inter-task relationships: i) Inter-video Semantic Correlation (ISC) performs cross-video frame-level interactions in a fine-grained manner, thereby capturing task-specific query features and learning intra- and inter-class temporal correlations among support features; ii) Inter-task Knowledge Transfer (IKT) retrieves and aggregates relevant temporal knowledge from the bank, which stores diverse temporal patterns from historical tasks. Extensive experiments on five benchmarks show that HR2G-shot outperforms current top-leading FSAR methods.

[133] Learning to Harmonize Cross-vendor X-ray Images by Non-linear Image Dynamics Correction

Yucheng Lu,Shunxin Wang,Dovile Juodelyte,Veronika Cheplygina

Main category: cs.CV

TLDR: 论文探讨了传统图像增强如何提升医学图像分析的模型鲁棒性，提出了一种称为GDCE的方法来解决领域特定曝光不匹配问题。

Details

Motivation: 研究动机是解决医学图像中领域特定的非线性动态特性无法通过简单线性变换处理的问题。 Method: 方法包括将图像协调任务重新定义为曝光校正问题，并提出GDCE方法，通过预定义多项式函数和领域判别器进行训练。 Result: 结果显示GDCE能有效减少领域特定的曝光不匹配，提升模型在下游任务中的透明度。 Conclusion: 结论是GDCE方法优于现有的黑盒方法，能更好地处理医学图像中的领域差异。 Abstract: In this paper, we explore how conventional image enhancement can improve model robustness in medical image analysis. By applying commonly used normalization methods to images from various vendors and studying their influence on model generalization in transfer learning, we show that the nonlinear characteristics of domain-specific image dynamics cannot be addressed by simple linear transforms. To tackle this issue, we reformulate the image harmonization task as an exposure correction problem and propose a method termed Global Deep Curve Estimation (GDCE) to reduce domain-specific exposure mismatch. GDCE performs enhancement via a pre-defined polynomial function and is trained with the help of a ``domain discriminator'', aiming to improve model transparency in downstream tasks compared to existing black-box methods.

[134] UP-Person: Unified Parameter-Efficient Transfer Learning for Text-based Person Retrieval

Yating Liu,Yaowei Li,Xiangyuan Lan,Wenming Yang,Zimo Liu,Qingmin Liao

Main category: cs.CV

TLDR: UP-Person提出了一种高效的参数迁移学习方法，结合Prefix、LoRA和Adapter三种轻量级组件，用于文本行人检索任务，显著提升了性能且仅需微调少量参数。

Details

Motivation: 现有方法通过完全微调CLIP模型容易过拟合且泛化能力不足，因此需要一种更高效的迁移学习方法。 Method: UP-Person整合了Prefix、LoRA和Adapter三种组件，并优化了S-Prefix和L-Adapter两个子模块，以提升局部和全局特征表示。 Result: 在多个数据集（如CUHK-PEDES、ICFG-PEDES和RSTPReid）上取得了最先进的性能，仅需微调4.7%的参数。 Conclusion: UP-Person通过高效的参数迁移学习方法，显著提升了文本行人检索任务的性能，同时避免了过拟合问题。 Abstract: Text-based Person Retrieval (TPR) as a multi-modal task, which aims to retrieve the target person from a pool of candidate images given a text description, has recently garnered considerable attention due to the progress of contrastive visual-language pre-trained model. Prior works leverage pre-trained CLIP to extract person visual and textual features and fully fine-tune the entire network, which have shown notable performance improvements compared to uni-modal pre-training models. However, full-tuning a large model is prone to overfitting and hinders the generalization ability. In this paper, we propose a novel Unified Parameter-Efficient Transfer Learning (PETL) method for Text-based Person Retrieval (UP-Person) to thoroughly transfer the multi-modal knowledge from CLIP. Specifically, UP-Person simultaneously integrates three lightweight PETL components including Prefix, LoRA and Adapter, where Prefix and LoRA are devised together to mine local information with task-specific information prompts, and Adapter is designed to adjust global feature representations. Additionally, two vanilla submodules are optimized to adapt to the unified architecture of TPR. For one thing, S-Prefix is proposed to boost attention of prefix and enhance the gradient propagation of prefix tokens, which improves the flexibility and performance of the vanilla prefix. For another thing, L-Adapter is designed in parallel with layer normalization to adjust the overall distribution, which can resolve conflicts caused by overlap and interaction among multiple submodules. Extensive experimental results demonstrate that our UP-Person achieves state-of-the-art results across various person retrieval datasets, including CUHK-PEDES, ICFG-PEDES and RSTPReid while merely fine-tuning 4.7\% parameters. Code is available at https://github.com/Liu-Yating/UP-Person.

[135] CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography

I-Sheng Fang,Jun-Cheng Chen

Main category: cs.CV

TLDR: 该论文探讨了多模态大语言模型（MLLMs）在摄影相关任务中的视觉推理能力，特别是从照片中识别相机参数的能力。

Details

Motivation: 视觉推理（结合视觉和文本输入）在人工智能中尚未充分探索，而摄影任务因其涉及物理参数（如光照、模糊程度等）与相机设置的关联，成为验证MLLMs视觉理解能力的理想场景。 Method: 扩展了先前针对视觉语言模型（VLMs）的方法，评估MLLMs在区分与相机设置相关的视觉差异上的表现。 Result: 初步结果显示视觉推理在摄影任务中的重要性，且没有单一MLLM在所有评估任务中表现一致最优。 Conclusion: 研究表明，开发具有更强视觉推理能力的MLLMs仍面临挑战和机遇。 Abstract: Large language models (LLMs) and multimodal large language models (MLLMs) have significantly advanced artificial intelligence. However, visual reasoning, reasoning involving both visual and textual inputs, remains underexplored. Recent advancements, including the reasoning models like OpenAI o1 and Gemini 2.0 Flash Thinking, which incorporate image inputs, have opened this capability. In this ongoing work, we focus specifically on photography-related tasks because a photo is a visual snapshot of the physical world where the underlying physics (i.e., illumination, blur extent, etc.) interplay with the camera parameters. Successfully reasoning from the visual information of a photo to identify these numerical camera settings requires the MLLMs to have a deeper understanding of the underlying physics for precise visual comprehension, representing a challenging and intelligent capability essential for practical applications like photography assistant agents. We aim to evaluate MLLMs on their ability to distinguish visual differences related to numerical camera settings, extending a methodology previously proposed for vision-language models (VLMs). Our preliminary results demonstrate the importance of visual reasoning in photography-related tasks. Moreover, these results show that no single MLLM consistently dominates across all evaluation tasks, demonstrating ongoing challenges and opportunities in developing MLLMs with better visual reasoning.

[136] Global and Local Mamba Network for Multi-Modality Medical Image Super-Resolution

Zexin Ji,Beiji Zou,Xiaoyan Kui,Sebastien Thureau,Su Ruan

Main category: cs.CV

TLDR: 论文提出了一种基于Mamba的全局与局部网络（GLMamba），用于多模态医学图像超分辨率，通过结合全局和局部信息提升性能。

Details

Motivation: 现有方法（如CNN和Transformer）在医学图像超分辨率中要么固定感受野，要么计算负担大，限制了性能提升。 Method: 采用两分支网络（全局和局部Mamba分支），结合可变形块和调制器增强特征表示，并设计多模态特征融合块和对比边缘损失。 Result: GLMamba能高效建模长程依赖并提取局部细节，提升超分辨率性能。 Conclusion: GLMamba通过全局与局部信息的结合，显著提升了多模态医学图像的超分辨率效果。 Abstract: Convolutional neural networks and Transformer have made significant progresses in multi-modality medical image super-resolution. However, these methods either have a fixed receptive field for local learning or significant computational burdens for global learning, limiting the super-resolution performance. To solve this problem, State Space Models, notably Mamba, is introduced to efficiently model long-range dependencies in images with linear computational complexity. Relying on the Mamba and the fact that low-resolution images rely on global information to compensate for missing details, while high-resolution reference images need to provide more local details for accurate super-resolution, we propose a global and local Mamba network (GLMamba) for multi-modality medical image super-resolution. To be specific, our GLMamba is a two-branch network equipped with a global Mamba branch and a local Mamba branch. The global Mamba branch captures long-range relationships in low-resolution inputs, and the local Mamba branch focuses more on short-range details in high-resolution reference images. We also use the deform block to adaptively extract features of both branches to enhance the representation ability. A modulator is designed to further enhance deformable features in both global and local Mamba blocks. To fully integrate the reference image for low-resolution image super-resolution, we further develop a multi-modality feature fusion block to adaptively fuse features by considering similarities, differences, and complementary aspects between modalities. In addition, a contrastive edge loss (CELoss) is developed for sufficient enhancement of edge textures and contrast in medical images.

[137] SoccerNet-v3D: Leveraging Sports Broadcast Replays for 3D Scene Understanding

Marc Gutiérrez-Pérez,Antonio Agudo

Main category: cs.CV

TLDR: 论文介绍了SoccerNet-v3D和ISSIA-3D两个数据集，用于足球转播的3D场景理解，提出了单目3D球定位任务和优化方法，并提供了数据集生成代码。

Details

Motivation: 提升足球转播分析的3D场景理解能力，通过多视角同步和相机标定实现3D物体定位。 Method: 引入基于三角测量的单目3D球定位任务，提出相机标定和重投影指标，优化2D标注框以对齐3D场景。 Result: 建立了3D足球场景理解的新基准，提升了时空分析能力。 Conclusion: 提出的数据集和方法为足球分析提供了新工具，推动了3D场景理解的发展。 Abstract: Sports video analysis is a key domain in computer vision, enabling detailed spatial understanding through multi-view correspondences. In this work, we introduce SoccerNet-v3D and ISSIA-3D, two enhanced and scalable datasets designed for 3D scene understanding in soccer broadcast analysis. These datasets extend SoccerNet-v3 and ISSIA by incorporating field-line-based camera calibration and multi-view synchronization, enabling 3D object localization through triangulation. We propose a monocular 3D ball localization task built upon the triangulation of ground-truth 2D ball annotations, along with several calibration and reprojection metrics to assess annotation quality on demand. Additionally, we present a single-image 3D ball localization method as a baseline, leveraging camera calibration and ball size priors to estimate the ball's position from a monocular viewpoint. To further refine 2D annotations, we introduce a bounding box optimization technique that ensures alignment with the 3D scene representation. Our proposed datasets establish new benchmarks for 3D soccer scene understanding, enhancing both spatial and temporal analysis in sports analytics. Finally, we provide code to facilitate access to our annotations and the generation pipelines for the datasets.

[138] AGO: Adaptive Grounding for Open World 3D Occupancy Prediction

Peizheng Li,Shuxiao Ding,You Zhou,Qingwen Zhang,Onat Inak,Larissa Triess,Niklas Hanselmann,Marius Cordts,Andreas Zell

Main category: cs.CV

TLDR: AGO框架通过自适应接地技术改进开放世界3D语义占用预测，结合视觉语言模型知识，提升未知物体预测能力。

Details

Motivation: 传统方法受限于预定义标签空间和图像-文本表示不一致，无法有效处理开放世界场景。 Method: AGO框架通过3D和文本嵌入的相似性训练，结合模态适配器减少模态差异。 Result: 在Occ3D-nuScenes数据集上，AGO在零样本和少样本迁移中表现优异，mIoU提升4.09。 Conclusion: AGO为开放世界3D语义占用预测提供了高效解决方案，显著提升性能。 Abstract: Open-world 3D semantic occupancy prediction aims to generate a voxelized 3D representation from sensor inputs while recognizing both known and unknown objects. Transferring open-vocabulary knowledge from vision-language models (VLMs) offers a promising direction but remains challenging. However, methods based on VLM-derived 2D pseudo-labels with traditional supervision are limited by a predefined label space and lack general prediction capabilities. Direct alignment with pretrained image embeddings, on the other hand, fails to achieve reliable performance due to often inconsistent image and text representations in VLMs. To address these challenges, we propose AGO, a novel 3D occupancy prediction framework with adaptive grounding to handle diverse open-world scenarios. AGO first encodes surrounding images and class prompts into 3D and text embeddings, respectively, leveraging similarity-based grounding training with 3D pseudo-labels. Additionally, a modality adapter maps 3D embeddings into a space aligned with VLM-derived image embeddings, reducing modality gaps. Experiments on Occ3D-nuScenes show that AGO improves unknown object prediction in zero-shot and few-shot transfer while achieving state-of-the-art closed-world self-supervised performance, surpassing prior methods by 4.09 mIoU.

Tzu-Yun Tseng,Hongyu Lyu,Josephine Li,Julie Stephany Berrio,Mao Shan,Stewart Worrall

Main category: cs.CV

TLDR: 论文介绍了M2S-RoAD数据集，用于农村道路损坏的语义分割，以提升自动驾驶和驾驶辅助系统的安全性。

Details

Motivation: 农村道路损坏检测研究较少，现有研究多集中于城市环境，而农村道路因维护不足更需关注。 Method: 收集并标注了澳大利亚新南威尔士州多个城镇的道路损坏数据，形成M2S-RoAD数据集，支持九种损坏类型的语义分割。 Result: M2S-RoAD数据集将公开发布，填补农村道路损坏检测数据空白。 Conclusion: M2S-RoAD数据集有助于推动农村道路损坏检测研究，提升自动驾驶安全性。 Abstract: Road damage can create safety and comfort challenges for both human drivers and autonomous vehicles (AVs). This damage is particularly prevalent in rural areas due to less frequent surveying and maintenance of roads. Automated detection of pavement deterioration can be used as an input to AVs and driver assistance systems to improve road safety. Current research in this field has predominantly focused on urban environments driven largely by public datasets, while rural areas have received significantly less attention. This paper introduces M2S-RoAD, a dataset for the semantic segmentation of different classes of road damage. M2S-RoAD was collected in various towns across New South Wales, Australia, and labelled for semantic segmentation to identify nine distinct types of road damage. This dataset will be released upon the acceptance of the paper.

[140] Hierarchical and Step-Layer-Wise Tuning of Attention Specialty for Multi-Instance Synthesis in Diffusion Transformers

Chunyang Zhang,Zhenhong Sun,Zhicheng Zhang,Junyan Wang,Yu Zhang,Dong Gong,Huadong Mo,Daoyi Dong

Main category: cs.CV

TLDR: 论文提出了一种无需训练的方法（AST），通过分层和逐层注意力调整，提升基于DiT的文本到图像生成模型在多实例合成（MIS）中的表现。

Details

Motivation: 传统的MIS控制方法不适用于基于DiT的模型（如FLUX和SD v3.5），因为它们依赖图像和文本令牌的集成注意力而非文本-图像交叉注意力。 Method: 通过分析DiT中的混合注意力机制，提出分层和逐层注意力调整（AST），优化多模态交互。 Result: 实验表明，AST方法在复杂布局生成中显著提升了实例定位和属性表示的准确性。 Conclusion: AST是一种有效的训练免费方法，能够显著提升DiT模型在MIS任务中的表现。 Abstract: Text-to-image (T2I) generation models often struggle with multi-instance synthesis (MIS), where they must accurately depict multiple distinct instances in a single image based on complex prompts detailing individual features. Traditional MIS control methods for UNet architectures like SD v1.5/SDXL fail to adapt to DiT-based models like FLUX and SD v3.5, which rely on integrated attention between image and text tokens rather than text-image cross-attention. To enhance MIS in DiT, we first analyze the mixed attention mechanism in DiT. Our token-wise and layer-wise analysis of attention maps reveals a hierarchical response structure: instance tokens dominate early layers, background tokens in middle layers, and attribute tokens in later layers. Building on this observation, we propose a training-free approach for enhancing MIS in DiT-based models with hierarchical and step-layer-wise attention specialty tuning (AST). AST amplifies key regions while suppressing irrelevant areas in distinct attention maps across layers and steps, guided by the hierarchical structure. This optimizes multimodal interactions by hierarchically decoupling the complex prompts with instance-based sketches. We evaluate our approach using upgraded sketch-based layouts for the T2I-CompBench and customized complex scenes. Both quantitative and qualitative results confirm our method enhances complex layout generation, ensuring precise instance placement and attribute representation in MIS.

[141] COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts

Jiansheng Li,Xingxuan Zhang,Hao Zou,Yige Guo,Renzhe Xu,Yilong Liu,Chuzhao Zhu,Yue He,Peng Cui

Main category: cs.CV

TLDR: 论文提出了COUNTS数据集，用于评估目标检测器和多模态大语言模型在分布偏移下的泛化能力，并设计了O(OD)2和OODG两个基准测试。

Details

Motivation: 当前目标检测器在分布偏移下性能下降显著，缺乏大规模、细粒度标注的数据集来评估其OOD泛化能力。 Method: 引入COUNTS数据集，包含14种自然分布偏移、222K样本和1,196K标注框，并设计O(OD)2和OODG两个基准测试。 Result: 实验显示，大模型和预训练数据在IID场景下表现优异，但在OOD场景中仍有显著不足，如GPT-4o和Gemini-1.5在视觉定位任务中仅达到56.7%和28.0%的准确率。 Conclusion: COUNTS数据集有望推动开发更鲁棒的目标检测器和MLLM，以应对分布偏移。 Abstract: Current object detectors often suffer significant perfor-mance degradation in real-world applications when encountering distributional shifts. Consequently, the out-of-distribution (OOD) generalization capability of object detectors has garnered increasing attention from researchers. Despite this growing interest, there remains a lack of a large-scale, comprehensive dataset and evaluation benchmark with fine-grained annotations tailored to assess the OOD generalization on more intricate tasks like object detection and grounding. To address this gap, we introduce COUNTS, a large-scale OOD dataset with object-level annotations. COUNTS encompasses 14 natural distributional shifts, over 222K samples, and more than 1,196K labeled bounding boxes. Leveraging COUNTS, we introduce two novel benchmarks: O(OD)2 and OODG. O(OD)2 is designed to comprehensively evaluate the OOD generalization capabilities of object detectors by utilizing controlled distribution shifts between training and testing data. OODG, on the other hand, aims to assess the OOD generalization of grounding abilities in multimodal large language models (MLLMs). Our findings reveal that, while large models and extensive pre-training data substantially en hance performance in in-distribution (IID) scenarios, significant limitations and opportunities for improvement persist in OOD contexts for both object detectors and MLLMs. In visual grounding tasks, even the advanced GPT-4o and Gemini-1.5 only achieve 56.7% and 28.0% accuracy, respectively. We hope COUNTS facilitates advancements in the development and assessment of robust object detectors and MLLMs capable of maintaining high performance under distributional shifts.

[142] WildLive: Near Real-time Visual Wildlife Tracking onboard UAVs

Nguyen Ngoc Dat,Tom Richardson,Matthew Watson,Kilian Meier,Jenna Kline,Sid Reid,Guy Maalouf,Duncan Hine,Majid Mirmehdi,Tilo Burghardt

Main category: cs.CV

TLDR: WildLive是一个无人机上实时动物检测与跟踪框架，支持高分辨率视频处理，优化了计算资源分配，实现了高精度和高速度的跟踪。

Details

Motivation: 现有解决方案依赖地面站视频流，无法满足自主飞行和特定任务需求，因此开发了WildLive。 Method: 结合稀疏光流跟踪和优化的YOLO目标检测与分割技术，专注于高不确定性时空区域。 Result: 系统在HD和4K视频流上分别达到17fps和7fps，并保持高精度。 Conclusion: WildLive证明了无人机上实时高分辨率野生动物跟踪的可行性，为未来自主导航和任务操作奠定了基础。 Abstract: Live tracking of wildlife via high-resolution video processing directly onboard drones is widely unexplored and most existing solutions rely on streaming video to ground stations to support navigation. Yet, both autonomous animal-reactive flight control beyond visual line of sight and/or mission-specific individual and behaviour recognition tasks rely to some degree on this capability. In response, we introduce WildLive -- a near real-time animal detection and tracking framework for high-resolution imagery running directly onboard uncrewed aerial vehicles (UAVs). The system performs multi-animal detection and tracking at 17fps+ for HD and 7fps+ on 4K video streams suitable for operation during higher altitude flights to minimise animal disturbance. Our system is optimised for Jetson Orin AGX onboard hardware. It integrates the efficiency of sparse optical flow tracking and mission-specific sampling with device-optimised and proven YOLO-driven object detection and segmentation techniques. Essentially, computational resource is focused onto spatio-temporal regions of high uncertainty to significantly improve UAV processing speeds without domain-specific loss of accuracy. Alongside, we introduce our WildLive dataset, which comprises 200k+ annotated animal instances across 19k+ frames from 4K UAV videos collected at the Ol Pejeta Conservancy in Kenya. All frames contain ground truth bounding boxes, segmentation masks, as well as individual tracklets and tracking point trajectories. We compare our system against current object tracking approaches including OC-SORT, ByteTrack, and SORT. Our multi-animal tracking experiments with onboard hardware confirm that near real-time high-resolution wildlife tracking is possible on UAVs whilst maintaining high accuracy levels as needed for future navigational and mission-specific animal-centric operational autonomy.

[143] LLaVA-ReID: Selective Multi-image Questioner for Interactive Person Re-Identification

Yiding Lu,Mouxing Yang,Dezhong Peng,Peng Hu,Yijie Lin,Xi Peng

Main category: cs.CV

TLDR: 论文提出了一种新的交互式行人重识别任务（Inter-ReID），通过对话逐步完善初始描述，并构建了一个对话数据集。提出的LLaVA-ReID模型能生成针对性问题，显著优于基线方法。

Details

Motivation: 传统基于文本的行人重识别假设描述是完整且一次性提供的，而现实中描述往往是部分或模糊的。Inter-ReID通过交互式对话解决这一问题。 Method: 构建对话数据集，分解细粒度属性；提出LLaVA-ReID模型，结合视觉和文本上下文生成问题，采用前瞻策略优先选择信息量大的问题。 Result: LLaVA-ReID在Inter-ReID和文本ReID基准测试中显著优于基线方法。 Conclusion: Inter-ReID和LLaVA-ReID为行人重识别提供了更实用的解决方案，尤其在描述不完整时表现优异。 Abstract: Traditional text-based person ReID assumes that person descriptions from witnesses are complete and provided at once. However, in real-world scenarios, such descriptions are often partial or vague. To address this limitation, we introduce a new task called interactive person re-identification (Inter-ReID). Inter-ReID is a dialogue-based retrieval task that iteratively refines initial descriptions through ongoing interactions with the witnesses. To facilitate the study of this new task, we construct a dialogue dataset that incorporates multiple types of questions by decomposing fine-grained attributes of individuals. We further propose LLaVA-ReID, a question model that generates targeted questions based on visual and textual contexts to elicit additional details about the target person. Leveraging a looking-forward strategy, we prioritize the most informative questions as supervision during training. Experimental results on both Inter-ReID and text-based ReID benchmarks demonstrate that LLaVA-ReID significantly outperforms baselines.

[144] Differentially Private 2D Human Pose Estimation

Kaushik Bhargav Sivangi,Idris Zakariyya,Paul Henderson,Fani Deligianni

Main category: cs.CV

TLDR: 本文提出了一种基于差分隐私的2D人体姿态估计方法，通过改进的DP-SGD（PDP-SGD）和轻量级视觉Transformer TinyViT，在保护隐私的同时提升了性能。

Details

Motivation: 人体姿态估计在医疗、活动识别等领域有广泛应用，但传统隐私保护方法效果有限且可能损害数据效用，差分隐私虽提供保障但会降低模型性能。 Method: 采用PDP-SGD将噪声梯度投影到低维子空间，并结合TinyViT作为轻量级骨干网络。 Result: 在MPII数据集上，PDP-SGD在严格隐私预算（ε=0.2）下PCKh@0.5达到78.48%，优于标准DP-SGD的63.85%。 Conclusion: 该方法为敏感场景下的隐私保护人体姿态估计奠定了基础。 Abstract: Human pose estimation (HPE) has become essential in numerous applications including healthcare, activity recognition, and human-computer interaction. However, the privacy implications of processing sensitive visual data present significant deployment barriers in critical domains. While traditional anonymization techniques offer limited protection and often compromise data utility for broader motion analysis, Differential Privacy (DP) provides formal privacy guarantees but typically degrades model performance when applied naively. In this work, we present the first differentially private 2D human pose estimation (2D-HPE) by applying Differentially Private Stochastic Gradient Descent (DP-SGD) to this task. To effectively balance privacy with performance, we adopt Projected DP-SGD (PDP-SGD), which projects the noisy gradients to a low-dimensional subspace. Additionally, we adapt TinyViT, a compact and efficient vision transformer for coordinate classification in HPE, providing a lightweight yet powerful backbone that enhances privacy-preserving deployment feasibility on resource-limited devices. Our approach is particularly valuable for multimedia interpretation tasks, enabling privacy-safe analysis and understanding of human motion across diverse visual media while preserving the semantic meaning required for downstream applications. Comprehensive experiments on the MPII Human Pose Dataset demonstrate significant performance enhancement with PDP-SGD achieving 78.48% PCKh@0.5 at a strict privacy budget ($\epsilon=0.2$), compared to 63.85% for standard DP-SGD. This work lays foundation for privacy-preserving human pose estimation in real-world, sensitive applications.

[145] VibrantLeaves: A principled parametric image generator for training deep restoration models

Raphael Achddou,Yann Gousseau,Saïd Ladjal,Sabine Süsstrunk

Main category: cs.CV

TLDR: 论文提出了一种基于简单原则的合成图像生成器，用于改进深度神经网络在图像恢复任务中的训练效果，性能接近自然图像数据集，并增强了模型的鲁棒性。

Details

Motivation: 深度神经网络在图像恢复任务中表现强大，但存在训练集偏差和可解释性差的问题。通过合成数据集可以更好地控制训练过程。 Method: 提出一种合成图像生成器，结合几何建模、纹理和图像采集的简单模型，基于经典的Dead Leaves模型生成高效训练集。 Result: 在图像去噪和超分辨率任务中，使用合成数据集训练的模型性能接近自然图像数据集，且对几何和辐射扰动更具鲁棒性。 Conclusion: 合成数据集是改进深度神经网络训练的有效方法，同时为模型可解释性提供了初步分析。 Abstract: Even though Deep Neural Networks are extremely powerful for image restoration tasks, they have several limitations. They are poorly understood and suffer from strong biases inherited from the training sets. One way to address these shortcomings is to have a better control over the training sets, in particular by using synthetic sets. In this paper, we propose a synthetic image generator relying on a few simple principles. In particular, we focus on geometric modeling, textures, and a simple modeling of image acquisition. These properties, integrated in a classical Dead Leaves model, enable the creation of efficient training sets. Standard image denoising and super-resolution networks can be trained on such datasets, reaching performance almost on par with training on natural image datasets. As a first step towards explainability, we provide a careful analysis of the considered principles, identifying which image properties are necessary to obtain good performances. Besides, such training also yields better robustness to various geometric and radiometric perturbations of the test sets.

[146] Balancing Stability and Plasticity in Pretrained Detector: A Dual-Path Framework for Incremental Object Detection

Songze Li,Qixing Xu,Tonghua Su,Xu-Yao Zhang,Zhongjie Wang

Main category: cs.CV

TLDR: 论文提出了一种双路径框架，用于在预训练模型增量目标检测中平衡稳定性和可塑性，通过解耦定位和分类路径实现跨域适应。

Details

Motivation: 现有方法在跨域场景中的可塑性不足，定位模块具有稳定性但分类模块需要增强可塑性。 Method: 基于预训练的DETR检测器，提出双路径框架：定位路径保持稳定性，分类路径通过参数高效微调和伪特征重放增强可塑性。 Result: 在多个基准测试（MS COCO、PASCAL VOC、TT100K）上表现优异，实现了跨域适应和抗遗忘能力的平衡。 Conclusion: 该方法有效解决了预训练模型增量目标检测中稳定性和可塑性的平衡问题，具有鲁棒的跨域适应能力。 Abstract: The balance between stability and plasticity remains a fundamental challenge in pretrained model-based incremental object detection (PTMIOD). While existing PTMIOD methods demonstrate strong performance on in-domain tasks aligned with pretraining data, their plasticity to cross-domain scenarios remains underexplored. Through systematic component-wise analysis of pretrained detectors, we reveal a fundamental discrepancy: the localization modules demonstrate inherent cross-domain stability-preserving precise bounding box estimation across distribution shifts-while the classification components require enhanced plasticity to mitigate discriminability degradation in cross-domain scenarios. Motivated by these findings, we propose a dual-path framework built upon pretrained DETR-based detectors which decouples localization stability and classification plasticity: the localization path maintains stability to preserve pretrained localization knowledge, while the classification path facilitates plasticity via parameter-efficient fine-tuning and resists forgetting with pseudo-feature replay. Extensive evaluations on both in-domain (MS COCO and PASCAL VOC) and cross-domain (TT100K) benchmarks show state-of-the-art performance, demonstrating our method's ability to effectively balance stability and plasticity in PTMIOD, achieving robust cross-domain adaptation and strong retention of anti-forgetting capabilities.

[147] CAT: A Conditional Adaptation Tailor for Efficient and Effective Instance-Specific Pansharpening on Real-World Data

Tianyu Xin,Jin-Liang Xiao,Zeyu Xia,Shan Yin,Liang-Jian Deng

Main category: cs.CV

TLDR: 提出了一种高效的全色锐化框架，通过实例自适应训练和快速推理，解决了跨传感器泛化能力差和计算开销大的问题。

Details

Motivation: 现有深度学习方法在全色锐化中存在跨传感器泛化能力差和计算效率低的问题，限制了实时应用。 Method: 将输入图像分块，选择子集进行无监督CAT训练，通过预训练网络的特征提取和通道变换阶段集成CAT模块，实现高效推理。 Result: 在WorldView-3和WorldView-2数据集上实现了最先进的性能，512×512图像训练和推理仅需0.4秒，4000×4000图像仅需3秒。 Conclusion: 该方法显著提升了跨传感器泛化能力和计算效率，适用于实时全色锐化应用。 Abstract: Pansharpening is a crucial remote sensing technique that fuses low-resolution multispectral (LRMS) images with high-resolution panchromatic (PAN) images to generate high-resolution multispectral (HRMS) imagery. Although deep learning techniques have significantly advanced pansharpening, many existing methods suffer from limited cross-sensor generalization and high computational overhead, restricting their real-time applications. To address these challenges, we propose an efficient framework that quickly adapts to a specific input instance, completing both training and inference in a short time. Our framework splits the input image into multiple patches, selects a subset for unsupervised CAT training, and then performs inference on all patches, stitching them into the final output. The CAT module, integrated between the feature extraction and channel transformation stages of a pre-trained network, tailors the fused features and fixes the parameters for efficient inference, generating improved results. Our approach offers two key advantages: (1) $\textit{Improved Generalization Ability}$: by mitigating cross-sensor degradation, our model--although pre-trained on a specific dataset--achieves superior performance on datasets captured by other sensors; (2) $\textit{Enhanced Computational Efficiency}$: the CAT-enhanced network can swiftly adapt to the test sample using the single LRMS-PAN pair input, without requiring extensive large-scale data retraining. Experiments on the real-world data from WorldView-3 and WorldView-2 datasets demonstrate that our method achieves state-of-the-art performance on cross-sensor real-world data, while achieving both training and inference of $512\times512$ image within $\textit{0.4 seconds}$ and $4000\times4000$ image within $\textit{3 seconds}$ at the fastest setting on a commonly used RTX 3090 GPU.

[148] MASSeg : 2nd Technical Report for 4th PVUW MOSE Track

Xuqiang Cao,Linnan Zhao,Jiaxuan Zhao,Fang Liu,Puhua Chen,Wenping Ma

Main category: cs.CV

TLDR: 该论文提出了一种改进的视频对象分割模型MASSeg，并在MOSE+数据集上验证了其性能，取得了高分。

Details

Motivation: 解决复杂视频对象分割中的小物体识别、遮挡处理和动态场景建模问题。 Method: 基于现有框架改进模型MASSeg，结合帧间一致和不一致的数据增强策略，并设计掩码输出缩放策略。 Result: 在MOSE测试集上，J得分为0.8250，F得分为0.9007，J&F得分为0.8628。 Conclusion: MASSeg在复杂视频对象分割任务中表现出色，尤其在处理小物体和遮挡时效果显著。 Abstract: Complex video object segmentation continues to face significant challenges in small object recognition, occlusion handling, and dynamic scene modeling. This report presents our solution, which ranked second in the MOSE track of CVPR 2025 PVUW Challenge. Based on an existing segmentation framework, we propose an improved model named MASSeg for complex video object segmentation, and construct an enhanced dataset, MOSE+, which includes typical scenarios with occlusions, cluttered backgrounds, and small target instances. During training, we incorporate a combination of inter-frame consistent and inconsistent data augmentation strategies to improve robustness and generalization. During inference, we design a mask output scaling strategy to better adapt to varying object sizes and occlusion levels. As a result, MASSeg achieves a J score of 0.8250, F score of 0.9007, and a J&F score of 0.8628 on the MOSE test set.

[149] XY-Cut++: Advanced Layout Ordering via Hierarchical Mask Mechanism on a Novel Benchmark

Shuai Liu,Youmeng Li,Jizeng Wei

Main category: cs.CV

TLDR: XY-Cut++是一种先进的文档阅读顺序恢复方法，通过预掩码处理、多粒度分割和跨模态匹配显著提升了布局排序的准确性。

Details

Motivation: 现有方法在处理复杂布局（如多栏报纸）、跨模态元素交互和高开销时表现不佳，且缺乏稳健的评估基准。 Method: XY-Cut++结合预掩码处理、多粒度分割和跨模态匹配技术。 Result: 在DocBench-100数据集上，XY-Cut++达到98.8 BLEU的先进性能，比现有基线提升24%。 Conclusion: XY-Cut++为文档结构恢复提供了可靠基础，为布局排序任务设定了新标准，并促进了更高效的RAG和LLM预处理。 Abstract: Document Reading Order Recovery is a fundamental task in document image understanding, playing a pivotal role in enhancing Retrieval-Augmented Generation (RAG) and serving as a critical preprocessing step for large language models (LLMs). Existing methods often struggle with complex layouts(e.g., multi-column newspapers), high-overhead interactions between cross-modal elements (visual regions and textual semantics), and a lack of robust evaluation benchmarks. We introduce XY-Cut++, an advanced layout ordering method that integrates pre-mask processing, multi-granularity segmentation, and cross-modal matching to address these challenges. Our method significantly enhances layout ordering accuracy compared to traditional XY-Cut techniques. Specifically, XY-Cut++ achieves state-of-the-art performance (98.8 BLEU overall) while maintaining simplicity and efficiency. It outperforms existing baselines by up to 24\% and demonstrates consistent accuracy across simple and complex layouts on the newly introduced DocBench-100 dataset. This advancement establishes a reliable foundation for document structure recovery, setting a new standard for layout ordering tasks and facilitating more effective RAG and LLM preprocessing.

[150] Trade-offs in Privacy-Preserving Eye Tracking through Iris Obfuscation: A Benchmarking Study

Mengdi Wang,Efe Bozkir,Enkelejda Kasneci

Main category: cs.CV

TLDR: 论文评估了五种方法（模糊、噪声、降采样、橡胶片模型和虹膜风格迁移）在保护用户隐私的同时保持眼动追踪功能的效果，发现虹膜风格迁移效果最佳，但需权衡计算成本。

Details

Motivation: AR/VR头戴设备中的眼动追踪可能泄露用户虹膜纹理等隐私信息，需在保护隐私的同时保持功能准确性。 Method: 通过模糊、噪声、降采样、橡胶片模型和虹膜风格迁移五种方法处理虹膜纹理，评估其对图像质量、隐私保护、功能任务（眼部分割和注视估计）及攻击风险的影响。 Result: 虹膜风格迁移在功能任务和抗攻击性上表现最佳，但计算成本高；其他方法效果有限。 Conclusion: 无普适最优方法，需根据需求权衡隐私、功能和计算成本，建议结合多种方法以达到最佳平衡。 Abstract: Recent developments in hardware, computer graphics, and AI may soon enable AR/VR head-mounted displays (HMDs) to become everyday devices like smartphones and tablets. Eye trackers within HMDs provide a special opportunity for such setups as it is possible to facilitate gaze-based research and interaction. However, estimating users' gaze information often requires raw eye images and videos that contain iris textures, which are considered a gold standard biometric for user authentication, and this raises privacy concerns. Previous research in the eye-tracking community focused on obfuscating iris textures while keeping utility tasks such as gaze estimation accurate. Despite these attempts, there is no comprehensive benchmark that evaluates state-of-the-art approaches. Considering all, in this paper, we benchmark blurring, noising, downsampling, rubber sheet model, and iris style transfer to obfuscate user identity, and compare their impact on image quality, privacy, utility, and risk of imposter attack on two datasets. We use eye segmentation and gaze estimation as utility tasks, and reduction in iris recognition accuracy as a measure of privacy protection, and false acceptance rate to estimate risk of attack. Our experiments show that canonical image processing methods like blurring and noising cause a marginal impact on deep learning-based tasks. While downsampling, rubber sheet model, and iris style transfer are effective in hiding user identifiers, iris style transfer, with higher computation cost, outperforms others in both utility tasks, and is more resilient against spoof attacks. Our analyses indicate that there is no universal optimal approach to balance privacy, utility, and computation burden. Therefore, we recommend practitioners consider the strengths and weaknesses of each approach, and possible combinations of those to reach an optimal privacy-utility trade-off.

[151] Multimodal Long Video Modeling Based on Temporal Dynamic Context

Haoran Hao,Jiaming Han,Yiyuan Zhang,Xiangyu Yue

Main category: cs.CV

TLDR: 提出了一种动态长视频编码方法（TDC），利用帧间时间关系解决LLMs处理长视频时的上下文长度限制和信息丢失问题，并通过训练无关的链式思维策略处理极长视频。

Details

Motivation: 现有LLMs在长视频处理中因上下文长度限制和信息压缩丢失关键信息，且难以处理音频等多模态数据。 Method: 1. 基于帧间相似性分割视频为语义一致场景；2. 使用视觉-音频编码器编码帧为令牌；3. 提出时间上下文压缩器减少令牌数量；4. 结合静态帧令牌和时间上下文令牌输入LLM。 Result: 在通用视频理解和音视频理解基准测试中表现优异。 Conclusion: TDC方法有效解决了长视频处理中的信息丢失和多模态融合问题，性能显著提升。 Abstract: Recent advances in Large Language Models (LLMs) have led to significant breakthroughs in video understanding. However, existing models still struggle with long video processing due to the context length constraint of LLMs and the vast amount of information within the video. Although some recent methods are designed for long video understanding, they often lose crucial information during token compression and struggle with additional modality like audio. In this work, we propose a dynamic long video encoding method utilizing the temporal relationship between frames, named Temporal Dynamic Context (TDC). Firstly, we segment the video into semantically consistent scenes based on inter-frame similarities, then encode each frame into tokens using visual-audio encoders. Secondly, we propose a novel temporal context compressor to reduce the number of tokens within each segment. Specifically, we employ a query-based Transformer to aggregate video, audio, and instruction text tokens into a limited set of temporal context tokens. Finally, we feed the static frame tokens and the temporal context tokens into the LLM for video understanding. Furthermore, to handle extremely long videos, we propose a training-free chain-of-thought strategy that progressively extracts answers from multiple video segments. These intermediate answers serve as part of the reasoning process and contribute to the final answer. We conduct extensive experiments on general video understanding and audio-video understanding benchmarks, where our method demonstrates strong performance. The code and models are available at https://github.com/Hoar012/TDC-Video.

[152] LMFormer: Lane based Motion Prediction Transformer

Harsh Yadav,Maximilian Schaefer,Kun Zhao,Tobias Meisen

Main category: cs.CV

TLDR: LMFormer是一种基于Transformer的车道感知网络，用于轨迹预测任务，通过动态车道优先级机制和车道连接信息提升性能，并在nuScenes和Deep Scenario数据集上取得SOTA结果。

Details

Motivation: 解决自动驾驶中轨迹预测的挑战，特别是动态车道优先级和长距离车道依赖问题。 Method: 提出LMFormer，结合动态车道优先级机制和车道连接信息，并通过堆叠Transformer层实现迭代优化。 Result: 在nuScenes数据集上表现最优，同时在Deep Scenario数据集上展示了跨数据集训练的统一能力。 Conclusion: LMFormer通过动态车道优先级和长距离依赖学习，显著提升了轨迹预测性能，并具备跨数据集训练的潜力。 Abstract: Motion prediction plays an important role in autonomous driving. This study presents LMFormer, a lane-aware transformer network for trajectory prediction tasks. In contrast to previous studies, our work provides a simple mechanism to dynamically prioritize the lanes and shows that such a mechanism introduces explainability into the learning behavior of the network. Additionally, LMFormer uses the lane connection information at intersections, lane merges, and lane splits, in order to learn long-range dependency in lane structure. Moreover, we also address the issue of refining the predicted trajectories and propose an efficient method for iterative refinement through stacked transformer layers. For benchmarking, we evaluate LMFormer on the nuScenes dataset and demonstrate that it achieves SOTA performance across multiple metrics. Furthermore, the Deep Scenario dataset is used to not only illustrate cross-dataset network performance but also the unification capabilities of LMFormer to train on multiple datasets and achieve better performance.

[153] DiffMOD: Progressive Diffusion Point Denoising for Moving Object Detection in Remote Sensing

Jinyue Zhang,Xiangrong Zhang,Zhongjian Huang,Tianyang Zhang,Yifei Jiang,Licheng Jiao

Main category: cs.CV

TLDR: 提出了一种基于点云的遥感移动目标检测方法，通过扩散模型和渐进去噪过程提升检测能力和时间一致性。

Details

Motivation: 遥感中移动目标检测面临低分辨率、目标极小和复杂噪声干扰的挑战，现有方法缺乏灵活的信息交互。 Method: 采用点云表示，通过扩散模型逐步去噪恢复目标中心，设计空间关系聚合注意力和时间传播模块增强特征交互。 Result: 在RsData数据集上验证了方法的有效性，提升了稀疏移动目标间关系的挖掘能力和检测性能。 Conclusion: 基于点云去噪的方法能更有效地探索稀疏移动目标间关系，提升检测能力和时间一致性。 Abstract: Moving object detection (MOD) in remote sensing is significantly challenged by low resolution, extremely small object sizes, and complex noise interference. Current deep learning-based MOD methods rely on probability density estimation, which restricts flexible information interaction between objects and across temporal frames. To flexibly capture high-order inter-object and temporal relationships, we propose a point-based MOD in remote sensing. Inspired by diffusion models, the network optimization is formulated as a progressive denoising process that iteratively recovers moving object centers from sparse noisy points. Specifically, we sample scattered features from the backbone outputs as atomic units for subsequent processing, while global feature embeddings are aggregated to compensate for the limited coverage of sparse point features. By modeling spatial relative positions and semantic affinities, Spatial Relation Aggregation Attention is designed to enable high-order interactions among point-level features for enhanced object representation. To enhance temporal consistency, the Temporal Propagation and Global Fusion module is designed, which leverages an implicit memory reasoning mechanism for robust cross-frame feature integration. To align with the progressive denoising process, we propose a progressive MinK optimal transport assignment strategy that establishes specialized learning objectives at each denoising level. Additionally, we introduce a missing loss function to counteract the clustering tendency of denoised points around salient objects. Experiments on the RsData remote sensing MOD dataset show that our MOD method based on scattered point denoising can more effectively explore potential relationships between sparse moving objects and improve the detection capability and temporal consistency.

[154] GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

Xiaobo Xia,Run Luo

Main category: cs.CV

TLDR: 提出了一种基于强化学习的框架（\name），通过统一动作空间规则建模，显著提升了大型视觉语言模型（LVLM）在GUI任务中的性能，仅需少量高质量数据。

Details

Motivation: 现有GUI代理方法依赖大量训练数据且泛化能力不足，限制了实际应用。受强化微调（RFT）启发，提出新框架以解决这一问题。 Method: 采用统一动作空间规则建模，结合多平台高质量数据，使用GRPO等策略优化算法更新模型。 Result: 仅用0.02%数据（3K vs. 13M），在8个跨平台基准测试中超越现有最佳方法（如OS-Atlas）。 Conclusion: 强化学习结合统一动作空间规则建模在提升LVLM执行GUI任务能力方面具有巨大潜力。 Abstract: Existing efforts in building Graphical User Interface (GUI) agents largely rely on the training paradigm of supervised fine-tuning on Large Vision-Language Models (LVLMs). However, this approach not only demands extensive amounts of training data but also struggles to effectively understand GUI screenshots and generalize to unseen interfaces. The issue significantly limits its application in real-world scenarios, especially for high-level tasks. Inspired by Reinforcement Fine-Tuning (RFT) in large reasoning models (e.g., DeepSeek-R1), which efficiently enhances the problem-solving capabilities of large language models in real-world settings, we propose \name, the first reinforcement learning framework designed to enhance the GUI capabilities of LVLMs in high-level real-world task scenarios, through unified action space rule modeling. By leveraging a small amount of carefully curated high-quality data across multiple platforms (including Windows, Linux, MacOS, Android, and Web) and employing policy optimization algorithms such as Group Relative Policy Optimization (GRPO) to update the model, \name achieves superior performance using only 0.02\% of the data (3K vs. 13M) compared to previous state-of-the-art methods like OS-Atlas across eight benchmarks spanning three different platforms (mobile, desktop, and web). These results demonstrate the immense potential of reinforcement learning based on unified action space rule modeling in improving the execution capabilities of LVLMs for real-world GUI agent tasks.

[155] Noise2Ghost: Self-supervised deep convolutional reconstruction for ghost imaging

Mathieu Manni,Dmitry Karpov,K. Joost Batenburg,Sharon Shwartz,Nicola Viganò

Main category: cs.CV

TLDR: 提出了一种基于自监督深度学习的鬼成像重建方法，在无监督方法中表现出卓越的噪声抑制性能。

Details

Motivation: 解决低光鬼成像场景中信号噪声比问题，无需干净参考数据。 Method: 自监督深度学习框架，结合数学理论和实际数据验证。 Result: 在理论和实际数据中均表现出优异的噪声抑制和重建性能。 Conclusion: 该方法为低光鬼成像提供了有效工具，适用于生物样本和电池等应用场景。 Abstract: We present a new self-supervised deep-learning-based Ghost Imaging (GI) reconstruction method, which provides unparalleled reconstruction performance for noisy acquisitions among unsupervised methods. We present the supporting mathematical framework and results from theoretical and real data use cases. Self-supervision removes the need for clean reference data while offering strong noise reduction. This provides the necessary tools for addressing signal-to-noise ratio concerns for GI acquisitions in emerging and cutting-edge low-light GI scenarios. Notable examples include micro- and nano-scale x-ray emission imaging, e.g., x-ray fluorescence imaging of dose-sensitive samples. Their applications include in-vivo and in-operando case studies for biological samples and batteries.

[156] MIEB: Massive Image Embedding Benchmark

Chenghao Xiao,Isaac Chung,Imene Kerboua,Jamie Stirling,Xin Zhang,Márton Kardos,Roman Solomatin,Noura Al Moubayed,Kenneth Enevoldsen,Niklas Muennighoff

Main category: cs.CV

TLDR: MIEB是一个大规模图像嵌入基准测试，用于评估图像和图像-文本嵌入模型在广泛任务中的表现，发现没有单一方法在所有任务中占优。

Details

Motivation: 现有图像表示评估方法分散且任务特定，缺乏对模型能力的全面理解。 Method: 引入MIEB，涵盖38种语言的130个任务，分为8大类，并对50个模型进行基准测试。 Result: 发现高级视觉模型在文本视觉表示上表现优异，但在交错编码和存在干扰项时的图像-文本匹配能力有限。 Conclusion: MIEB揭示了模型能力的多样性，并显示视觉编码器在MIEB上的表现与多模态大语言模型中的表现高度相关。 Abstract: Image representations are often evaluated through disjointed, task-specific protocols, leading to a fragmented understanding of model capabilities. For instance, it is unclear whether an image embedding model adept at clustering images is equally good at retrieving relevant images given a piece of text. We introduce the Massive Image Embedding Benchmark (MIEB) to evaluate the performance of image and image-text embedding models across the broadest spectrum to date. MIEB spans 38 languages across 130 individual tasks, which we group into 8 high-level categories. We benchmark 50 models across our benchmark, finding that no single method dominates across all task categories. We reveal hidden capabilities in advanced vision models such as their accurate visual representation of texts, and their yet limited capabilities in interleaved encodings and matching images and texts in the presence of confounders. We also show that the performance of vision encoders on MIEB correlates highly with their performance when used in multimodal large language models. Our code, dataset, and leaderboard are publicly available at https://github.com/embeddings-benchmark/mteb.

[157] ESCT3D: Efficient and Selectively Controllable Text-Driven 3D Content Generation with Gaussian Splatting

Huiqi Wu,Jianbo Mei,Yingjie Huang,Yining Xu,Jingjiao You,Yilong Liu,Li Yao

Main category: cs.CV

TLDR: 论文提出了一种基于GPT-4V的自优化方法，用于提升从简单文本输入生成高质量3D内容的效率和可控性。

Details

Motivation: 当前文本驱动的3D内容生成方法对输入提示质量依赖性强，且生成过程不可控，导致效率低下。 Method: 利用GPT-4V进行自优化，支持多条件输入（如风格、边缘、姿势等），并整合多视图信息以解决Janus问题。 Result: 实验表明，该方法能高效生成高质量3D内容，并具有更强的可控性和泛化能力。 Conclusion: 该方法显著提升了3D内容生成的效率和可控性，为实际应用提供了可行方案。 Abstract: In recent years, significant advancements have been made in text-driven 3D content generation. However, several challenges remain. In practical applications, users often provide extremely simple text inputs while expecting high-quality 3D content. Generating optimal results from such minimal text is a difficult task due to the strong dependency of text-to-3D models on the quality of input prompts. Moreover, the generation process exhibits high variability, making it difficult to control. Consequently, multiple iterations are typically required to produce content that meets user expectations, reducing generation efficiency. To address this issue, we propose GPT-4V for self-optimization, which significantly enhances the efficiency of generating satisfactory content in a single attempt. Furthermore, the controllability of text-to-3D generation methods has not been fully explored. Our approach enables users to not only provide textual descriptions but also specify additional conditions, such as style, edges, scribbles, poses, or combinations of multiple conditions, allowing for more precise control over the generated 3D content. Additionally, during training, we effectively integrate multi-view information, including multi-view depth, masks, features, and images, to address the common Janus problem in 3D content generation. Extensive experiments demonstrate that our method achieves robust generalization, facilitating the efficient and controllable generation of high-quality 3D content.

[158] Analysis of Attention in Video Diffusion Transformers

Yuxin Wen,Jim Wu,Ajay Jain,Tom Goldstein,Ashwinee Panda

Main category: cs.CV

TLDR: 分析了视频扩散变换器（VDiTs）中的注意力机制，发现了三个关键特性：结构、稀疏性和注意力汇点。

Details

Motivation: 研究VDiTs中注意力机制的特性，以提升视频编辑和模型效率。 Method: 通过分析不同VDiTs的注意力模式，研究其结构相似性、稀疏性及注意力汇点现象。 Result: 发现注意力模式具有跨提示的结构相似性，稀疏性方法并非适用于所有VDiTs，并首次研究了VDiTs中的注意力汇点。 Conclusion: 研究结果为优化VDiTs的效率-质量权衡提供了新方向。 Abstract: We conduct an in-depth analysis of attention in video diffusion transformers (VDiTs) and report a number of novel findings. We identify three key properties of attention in VDiTs: Structure, Sparsity, and Sinks. Structure: We observe that attention patterns across different VDiTs exhibit similar structure across different prompts, and that we can make use of the similarity of attention patterns to unlock video editing via self-attention map transfer. Sparse: We study attention sparsity in VDiTs, finding that proposed sparsity methods do not work for all VDiTs, because some layers that are seemingly sparse cannot be sparsified. Sinks: We make the first study of attention sinks in VDiTs, comparing and contrasting them to attention sinks in language models. We propose a number of future directions that can make use of our insights to improve the efficiency-quality Pareto frontier for VDiTs.

[159] SlowFastVAD: Video Anomaly Detection via Integrating Simple Detector and RAG-Enhanced Vision-Language Model

Zongcan Ding,Haodong Zhang,Peng Wu,Guansong Pang,Zhiwei Yang,Peng Wang,Yanning Zhang

Main category: cs.CV

TLDR: SlowFastVAD结合快速和慢速异常检测器，利用视觉语言模型（VLM）提升视频异常检测的准确性和可解释性，同时降低计算成本。

Details

Motivation: 半监督视频异常检测方法存在高误报率和低可解释性问题，而视觉语言模型（VLM）虽能提供多模态推理能力，但计算成本高且缺乏领域适应性。 Method: 提出SlowFastVAD框架，通过快速检测器初步筛选异常片段，慢速检测器（RAG增强的VLM）进一步分析模糊片段，并结合知识库优化领域适应性。 Result: 在四个基准测试中，SlowFastVAD显著提升了检测准确性和可解释性，同时大幅降低了计算开销。 Conclusion: SlowFastVAD结合快速与慢速检测器的优势，适用于高可靠性要求的实际视频异常检测场景。 Abstract: Video anomaly detection (VAD) aims to identify unexpected events in videos and has wide applications in safety-critical domains. While semi-supervised methods trained on only normal samples have gained traction, they often suffer from high false alarm rates and poor interpretability. Recently, vision-language models (VLMs) have demonstrated strong multimodal reasoning capabilities, offering new opportunities for explainable anomaly detection. However, their high computational cost and lack of domain adaptation hinder real-time deployment and reliability. Inspired by dual complementary pathways in human visual perception, we propose SlowFastVAD, a hybrid framework that integrates a fast anomaly detector with a slow anomaly detector (namely a retrieval augmented generation (RAG) enhanced VLM), to address these limitations. Specifically, the fast detector first provides coarse anomaly confidence scores, and only a small subset of ambiguous segments, rather than the entire video, is further analyzed by the slower yet more interpretable VLM for elaborate detection and reasoning. Furthermore, to adapt VLMs to domain-specific VAD scenarios, we construct a knowledge base including normal patterns based on few normal samples and abnormal patterns inferred by VLMs. During inference, relevant patterns are retrieved and used to augment prompts for anomaly reasoning. Finally, we smoothly fuse the anomaly confidence of fast and slow detectors to enhance robustness of anomaly detection. Extensive experiments on four benchmarks demonstrate that SlowFastVAD effectively combines the strengths of both fast and slow detectors, and achieves remarkable detection accuracy and interpretability with significantly reduced computational overhead, making it well-suited for real-world VAD applications with high reliability requirements.

[160] InstructEngine: Instruction-driven Text-to-Image Alignment

Xingyu Lu,Yuhang Hu,YiFan Zhang,Kaiyu Jiang,Changyi Liu,Tianke Zhang,Jinpeng Wang,Bin Wen,Chun Yuan,Fan Yang,Tingting Gao,Di Zhang

Main category: cs.CV

TLDR: InstructEngine框架通过自动化数据构建和跨验证对齐方法，解决了RLHF/RLAIF在文本到图像模型中的数据和算法限制，显著提升了性能。

Details

Motivation: 现有RLHF/RLAIF方法依赖高成本人工标注数据且算法效率低，限制了文本到图像模型的偏好对齐效果。 Method: 提出InstructEngine框架，包括自动化数据构建（基于分类法和多模态模型生成25K偏好对）和跨验证对齐方法。 Result: 在DrawBench上，InstructEngine将SD v1.5和SDXL性能分别提升10.53%和5.30%，人类评测胜率超50%。 Conclusion: InstructEngine有效解决了现有方法的局限性，显著提升了模型与人类偏好的对齐效果。 Abstract: Reinforcement Learning from Human/AI Feedback (RLHF/RLAIF) has been extensively utilized for preference alignment of text-to-image models. Existing methods face certain limitations in terms of both data and algorithm. For training data, most approaches rely on manual annotated preference data, either by directly fine-tuning the generators or by training reward models to provide training signals. However, the high annotation cost makes them difficult to scale up, the reward model consumes extra computation and cannot guarantee accuracy. From an algorithmic perspective, most methods neglect the value of text and only take the image feedback as a comparative signal, which is inefficient and sparse. To alleviate these drawbacks, we propose the InstructEngine framework. Regarding annotation cost, we first construct a taxonomy for text-to-image generation, then develop an automated data construction pipeline based on it. Leveraging advanced large multimodal models and human-defined rules, we generate 25K text-image preference pairs. Finally, we introduce cross-validation alignment method, which refines data efficiency by organizing semantically analogous samples into mutually comparable pairs. Evaluations on DrawBench demonstrate that InstructEngine improves SD v1.5 and SDXL's performance by 10.53% and 5.30%, outperforming state-of-the-art baselines, with ablation study confirming the benefits of InstructEngine's all components. A win rate of over 50% in human reviews also proves that InstructEngine better aligns with human preferences.

[161] LL-Gaussian: Low-Light Scene Reconstruction and Enhancement via Gaussian Splatting for Novel View Synthesis

Hao Sun,Fenggen Yu,Huiyao Xu,Tao Zhang,Changqing Zou

Main category: cs.CV

TLDR: LL-Gaussian是一种针对低光sRGB图像的新型3D重建和增强框架，通过创新的初始化模块、双分支分解模型和无监督优化策略，实现了快速且高质量的伪正常光新视角合成。

Details

Motivation: 解决低光场景下新视角合成的挑战，如噪声、低动态范围和初始化不可靠，同时克服现有方法的高计算成本和数据依赖限制。 Method: 1) 低光高斯初始化模块（LLGIM）；2) 双分支高斯分解模型；3) 基于物理约束和扩散先验的无监督优化策略。 Result: 相比现有NeRF方法，推理速度提升2000倍，训练时间减少至2%，且重建和渲染质量更优。 Conclusion: LL-Gaussian在低光环境下表现出色，为实时高质量新视角合成提供了高效解决方案。 Abstract: Novel view synthesis (NVS) in low-light scenes remains a significant challenge due to degraded inputs characterized by severe noise, low dynamic range (LDR) and unreliable initialization. While recent NeRF-based approaches have shown promising results, most suffer from high computational costs, and some rely on carefully captured or pre-processed data--such as RAW sensor inputs or multi-exposure sequences--which severely limits their practicality. In contrast, 3D Gaussian Splatting (3DGS) enables real-time rendering with competitive visual fidelity; however, existing 3DGS-based methods struggle with low-light sRGB inputs, resulting in unstable Gaussian initialization and ineffective noise suppression. To address these challenges, we propose LL-Gaussian, a novel framework for 3D reconstruction and enhancement from low-light sRGB images, enabling pseudo normal-light novel view synthesis. Our method introduces three key innovations: 1) an end-to-end Low-Light Gaussian Initialization Module (LLGIM) that leverages dense priors from learning-based MVS approach to generate high-quality initial point clouds; 2) a dual-branch Gaussian decomposition model that disentangles intrinsic scene properties (reflectance and illumination) from transient interference, enabling stable and interpretable optimization; 3) an unsupervised optimization strategy guided by both physical constrains and diffusion prior to jointly steer decomposition and enhancement. Additionally, we contribute a challenging dataset collected in extreme low-light environments and demonstrate the effectiveness of LL-Gaussian. Compared to state-of-the-art NeRF-based methods, LL-Gaussian achieves up to 2,000 times faster inference and reduces training time to just 2%, while delivering superior reconstruction and rendering quality.

[162] Benchmarking 3D Human Pose Estimation Models Under Occlusions

Filipa Lino,Carlos Santiago,Manuel Marques

Main category: cs.CV

TLDR: 论文通过分析现有3D人体姿态估计模型对遮挡、相机位置和动作变化的鲁棒性和敏感性，提出了一种新的合成数据集BlendMimic3D，并测试了多个先进模型。研究发现模型对遮挡和相机设置高度敏感，需改进以适应真实场景。

Details

Motivation: 解决3D人体姿态估计模型在遮挡、相机位置和动作变化等复杂环境中的鲁棒性问题。 Method: 使用新合成的数据集BlendMimic3D，测试多个先进模型，分析其对遮挡和相机设置的敏感性。 Result: 模型对遮挡和相机设置表现出显著敏感性，需改进以适应真实场景的多样性。 Conclusion: 研究强调了改进3D人体姿态估计模型以适应复杂环境和遮挡场景的必要性。 Abstract: This paper addresses critical challenges in 3D Human Pose Estimation (HPE) by analyzing the robustness and sensitivity of existing models to occlusions, camera position, and action variability. Using a novel synthetic dataset, BlendMimic3D, which includes diverse scenarios with multi-camera setups and several occlusion types, we conduct specific tests on several state-of-the-art models. Our study focuses on the discrepancy in keypoint formats between common datasets such as Human3.6M, and 2D datasets such as COCO, commonly used for 2D detection models and frequently input of 3D HPE models. Our work explores the impact of occlusions on model performance and the generality of models trained exclusively under standard conditions. The findings suggest significant sensitivity to occlusions and camera settings, revealing a need for models that better adapt to real-world variability and occlusion scenarios. This research contributed to ongoing efforts to improve the fidelity and applicability of 3D HPE systems in complex environments.

[163] Multimodal Representation Learning Techniques for Comprehensive Facial State Analysis

Kaiwen Zheng,Xuri Ge,Junchen Fu,Jun Peng,Joemon M. Jose

Main category: cs.CV

TLDR: 提出了一种多模态面部状态分析的综合框架，包括新数据集MFA、多级多模态基础模型MF^2和解耦微调网络DFN，显著提升了AU和情感识别性能。

Details

Motivation: 多模态基础模型在面部状态（如AU和情感）理解中的应用有限，需一个综合框架来整合视觉和语言模态。 Method: 1. 构建MFA数据集，利用GPT-4o生成多层次语言描述；2. 设计MF^2模型，结合局部和全局视觉特征；3. 开发DFN网络，高效适配不同任务。 Result: 实验表明，该方法在AU和情感检测任务中表现优异。 Conclusion: 提出的框架为多模态面部状态分析提供了高效解决方案，具有广泛适用性。 Abstract: Multimodal foundation models have significantly improved feature representation by integrating information from multiple modalities, making them highly suitable for a broader set of applications. However, the exploration of multimodal facial representation for understanding perception has been limited. Understanding and analyzing facial states, such as Action Units (AUs) and emotions, require a comprehensive and robust framework that bridges visual and linguistic modalities. In this paper, we present a comprehensive pipeline for multimodal facial state analysis. First, we compile a new Multimodal Face Dataset (MFA) by generating detailed multilevel language descriptions of face, incorporating Action Unit (AU) and emotion descriptions, by leveraging GPT-4o. Second, we introduce a novel Multilevel Multimodal Face Foundation model (MF^2) tailored for Action Unit (AU) and emotion recognition. Our model incorporates comprehensive visual feature modeling at both local and global levels of face image, enhancing its ability to represent detailed facial appearances. This design aligns visual representations with structured AU and emotion descriptions, ensuring effective cross-modal integration. Third, we develop a Decoupled Fine-Tuning Network (DFN) that efficiently adapts MF^2 across various tasks and datasets. This approach not only reduces computational overhead but also broadens the applicability of the foundation model to diverse scenarios. Experimentation show superior performance for AU and emotion detection tasks.

[164] Patch and Shuffle: A Preprocessing Technique for Texture Classification in Autonomous Cementitious Fabrication

Jeremiah Giordani

Main category: cs.CV

TLDR: 论文提出了一种名为“patch and shuffle”的预处理技术，通过分割、打乱和重组图像来增强纹理分类性能，显著提升了准确率。

Details

Motivation: 传统纹理分类方法依赖全局图像特征，容易偏向语义内容而非低层次纹理。为了解决这一问题，作者提出了一种新方法。 Method: 采用“patch and shuffle”技术，将输入图像分割为小块、打乱后重组，迫使分类器依赖局部纹理特征。使用ResNet-18架构进行实验验证。 Result: 实验结果显示，新方法在测试集上的准确率达到90.64%，显著高于基线模型的72.46%。 Conclusion: 该方法通过破坏全局结构提升了纹理分类性能，对依赖低层次特征的任务（如制造监控和医学影像）具有广泛意义。 Abstract: Autonomous fabrication systems are transforming construction and manufacturing, yet they remain vulnerable to print errors. Texture classification is a key component of computer vision systems that enable real-time monitoring and adjustment during cementitious fabrication. Traditional classification methods often rely on global image features, which can bias the model toward semantic content rather than low-level textures. In this paper, we introduce a novel preprocessing technique called "patch and shuffle," which segments input images into smaller patches, shuffles them, and reconstructs a jumbled image before classification. This transformation removes semantic context, forcing the classifier to rely on local texture features. We evaluate this approach on a dataset of extruded cement images, using a ResNet-18-based architecture. Our experiments compare the patch and shuffle method to a standard pipeline, holding all other factors constant. Results show a significant improvement in accuracy: the patch and shuffle model achieved 90.64% test accuracy versus 72.46% for the baseline. These findings suggest that disrupting global structure enhances performance in texture-based classification tasks. This method has implications for broader vision tasks where low-level features matter more than high-level semantics. The technique may improve classification in applications ranging from fabrication monitoring to medical imaging.

[165] FingER: Content Aware Fine-grained Evaluation with Reasoning for AI-Generated Videos

Rui Chen,Lei Sun,Jing Tang,Geng Li,Xiangxiang Chu

Main category: cs.CV

TLDR: 论文提出FingER框架，通过细粒度实体级问题生成和推理模型评分，改进AI生成视频的评估方法。

Details

Motivation: 现有视频生成技术的进步使得AI生成内容的评估变得复杂，传统评分方法难以应对不一致性和缺陷。 Method: 利用LLMs生成实体级问题，构建FingER数据集（3.3k视频和60k QA标注），并通过GRPO训练推理模型。 Result: 模型在GenAI-Bench和MonetBench上分别以11.8%和5.5%的相对优势超越现有方法，仅需3.3k训练样本。 Conclusion: FingER框架显著提升了视频评估的细粒度和可解释性，为AI生成内容评估提供了新思路。 Abstract: Recent advances in video generation have posed great challenges in the assessment of AI-generated content, particularly with the emergence of increasingly sophisticated models. The various inconsistencies and defects observed in such videos are inherently complex, making overall scoring notoriously difficult. In this paper, we emphasize the critical importance of integrating fine-grained reasoning into video evaluation, and we propose $\textbf{F}$ing$\textbf{ER}$, a novel entity-level reasoning evaluation framework that first automatically generates $\textbf{F}$ine-grained $\textbf{E}$ntity-level questions, and then answers those questions by a $\textbf{R}$easoning model with scores, which can be subsequently weighted summed to an overall score for different applications. Specifically, we leverage LLMs to derive entity-level questions across five distinct perspectives, which (i) often focus on some specific entities of the content, thereby making answering or scoring much easier by MLLMs, and (ii) are more interpretable. Then we construct a FingER dataset, consisting of approximately 3.3k videos and corresponding 60k fine-grained QA annotations, each with detailed reasons. Based on that, we further investigate various training protocols to best incentivize the reasoning capability of MLLMs for correct answer prediction. Extensive experiments demonstrate that a reasoning model trained using Group Relative Policy Optimization (GRPO) with a cold-start strategy achieves the best performance. Notably, our model surpasses existing methods by a relative margin of $11.8\%$ on GenAI-Bench and $5.5\%$ on MonetBench with only 3.3k training videos, which is at most one-tenth of the training samples utilized by other methods. Our code and dataset will be released soon.

[166] PG-DPIR: An efficient plug-and-play method for high-count Poisson-Gaussian inverse problems

Maud Biquard,Marie Chabert,Florence Genin,Christophe Latry,Thomas Oberlin

Main category: cs.CV

TLDR: PG-DPIR是一种高效的PnP方法，针对高计数泊松-高斯噪声问题，改进了DPIR算法，显著提升了收敛速度。

Details

Motivation: 针对泊松-高斯噪声的图像恢复问题，现有深度学习方法需要传感器特定训练，而PnP方法更具通用性。 Method: 改进DPIR算法，提出高效的梯度下降初始化策略，加速泊松-高斯噪声下的近端算子计算。 Result: 在卫星图像恢复和超分辨率任务中，PG-DPIR实现了最先进的性能，收敛速度显著提升。 Conclusion: PG-DPIR在效率和性能上表现优异，适用于卫星图像处理链。 Abstract: Poisson-Gaussian noise describes the noise of various imaging systems thus the need of efficient algorithms for Poisson-Gaussian image restoration. Deep learning methods offer state-of-the-art performance but often require sensor-specific training when used in a supervised setting. A promising alternative is given by plug-and-play (PnP) methods, which consist in learning only a regularization through a denoiser, allowing to restore images from several sources with the same network. This paper introduces PG-DPIR, an efficient PnP method for high-count Poisson-Gaussian inverse problems, adapted from DPIR. While DPIR is designed for white Gaussian noise, a naive adaptation to Poisson-Gaussian noise leads to prohibitively slow algorithms due to the absence of a closed-form proximal operator. To address this, we adapt DPIR for the specificities of Poisson-Gaussian noise and propose in particular an efficient initialization of the gradient descent required for the proximal step that accelerates convergence by several orders of magnitude. Experiments are conducted on satellite image restoration and super-resolution problems. High-resolution realistic Pleiades images are simulated for the experiments, which demonstrate that PG-DPIR achieves state-of-the-art performance with improved efficiency, which seems promising for on-ground satellite processing chains.

[167] Better Coherence, Better Height: Fusing Physical Models and Deep Learning for Forest Height Estimation from Interferometric SAR Data

Ragini Bal Mahesh,Ronny Hänsch

Main category: cs.CV

TLDR: CoHNet结合深度学习和物理约束，提升SAR图像森林高度估计的准确性和可靠性。

Details

Motivation: 传统物理模型泛化能力有限，深度学习缺乏物理可解释性，需结合两者优势。 Method: 提出CoHNet框架，利用预训练神经代理模型通过物理约束损失优化深度学习。 Result: 实验表明，该方法提高了森林高度估计准确性，并生成增强预测可靠性的特征。 Conclusion: CoHNet成功结合物理与深度学习，为SAR图像森林高度估计提供了更优解决方案。 Abstract: Estimating forest height from Synthetic Aperture Radar (SAR) images often relies on traditional physical models, which, while interpretable and data-efficient, can struggle with generalization. In contrast, Deep Learning (DL) approaches lack physical insight. To address this, we propose CoHNet - an end-to-end framework that combines the best of both worlds: DL optimized with physics-informed constraints. We leverage a pre-trained neural surrogate model to enforce physical plausibility through a unique training loss. Our experiments show that this approach not only improves forest height estimation accuracy but also produces meaningful features that enhance the reliability of predictions.

[168] Towards Low-Latency Event-based Obstacle Avoidance on a FPGA-Drone

Pietro Bonazzi,Christian Vogt,Michael Jost,Lyes Khacef,Federico Paredes-Vallés,Michele Magno

Main category: cs.CV

TLDR: 事件视觉系统（EVS）在FPGA加速器上比传统RGB模型在碰撞避免动作预测中表现更优，具有更高帧率、更低误差和更强鲁棒性。

Details

Motivation: 评估EVS与RGB模型在实时碰撞避免中的性能差异，探索EVS在资源受限环境中的潜力。 Method: 在FPGA加速器上对比EVS和RGB模型的帧率、时空预测误差及鲁棒性，测试不同数据集和状态分类。 Result: EVS帧率达1 kHz，时空误差更低（-20 ms/-20 mm），运动状态分类精度高59%，F1分数显著提升（0.73 vs. 0.06），端到端延迟2.14 ms。 Conclusion: EVS在实时碰撞避免中优于RGB模型，适合资源受限环境部署。 Abstract: This work quantitatively evaluates the performance of event-based vision systems (EVS) against conventional RGB-based models for action prediction in collision avoidance on an FPGA accelerator. Our experiments demonstrate that the EVS model achieves a significantly higher effective frame rate (1 kHz) and lower temporal (-20 ms) and spatial prediction errors (-20 mm) compared to the RGB-based model, particularly when tested on out-of-distribution data. The EVS model also exhibits superior robustness in selecting optimal evasion maneuvers. In particular, in distinguishing between movement and stationary states, it achieves a 59 percentage point advantage in precision (78% vs. 19%) and a substantially higher F1 score (0.73 vs. 0.06), highlighting the susceptibility of the RGB model to overfitting. Further analysis in different combinations of spatial classes confirms the consistent performance of the EVS model in both test data sets. Finally, we evaluated the system end-to-end and achieved a latency of approximately 2.14 ms, with event aggregation (1 ms) and inference on the processing unit (0.94 ms) accounting for the largest components. These results underscore the advantages of event-based vision for real-time collision avoidance and demonstrate its potential for deployment in resource-constrained environments.

[169] GPS: Distilling Compact Memories via Grid-based Patch Sampling for Efficient Online Class-Incremental Learning

Mingchuan Ma,Yuhao Zhou,Jindi Lv,Yuxin Tian,Dan Si,Shujian Li,Qing Ye,Jiancheng Lv

Main category: cs.CV

TLDR: GPS是一种轻量级策略，通过采样像素生成低分辨率表示，提升在线增量学习中的记忆样本信息量，无需可训练模型，显著提升性能。

Details

Motivation: 现有方法在受限存储下通过蒸馏数据增强记忆信息量，但计算开销大，GPS旨在解决这一问题。 Method: GPS从原始图像中采样像素子集，生成紧凑的低分辨率表示，保留语义和结构信息。 Result: GPS在内存受限设置下平均最终准确率提升3%-4%，计算开销低。 Conclusion: GPS是一种高效且轻量的记忆样本蒸馏策略，适用于在线增量学习。 Abstract: Online class-incremental learning aims to enable models to continuously adapt to new classes with limited access to past data, while mitigating catastrophic forgetting. Replay-based methods address this by maintaining a small memory buffer of previous samples, achieving competitive performance. For effective replay under constrained storage, recent approaches leverage distilled data to enhance the informativeness of memory. However, such approaches often involve significant computational overhead due to the use of bi-level optimization. Motivated by these limitations, we introduce Grid-based Patch Sampling (GPS), a lightweight and effective strategy for distilling informative memory samples without relying on a trainable model. GPS generates informative samples by sampling a subset of pixels from the original image, yielding compact low-resolution representations that preserve both semantic content and structural information. During replay, these representations are reassembled to support training and evaluation. Experiments on extensive benchmarks demonstrate that GRS can be seamlessly integrated into existing replay frameworks, leading to 3%-4% improvements in average end accuracy under memory-constrained settings, with limited computational overhead.

[170] HUMOTO: A 4D Dataset of Mocap Human Object Interactions

Jiaxin Lu,Chun-Hao Paul Huang,Uttaran Bhattacharya,Qixing Huang,Yi Zhou

Main category: cs.CV

TLDR: HUMOTO是一个高质量的人类与物体交互数据集，用于运动生成、计算机视觉和机器人应用，包含736个序列和63个精确建模的物体。

Details

Motivation: 解决现有数据集中缺乏高质量、多样化的人类与物体交互数据的问题，以支持动画、机器人和AI系统的研究与应用。 Method: 采用场景驱动的LLM脚本管道生成完整任务，结合动作捕捉和相机记录处理遮挡，并由专业艺术家清理和验证数据。 Result: HUMOTO提供了高保真度的全身运动和同时多物体交互数据，解决了数据捕捉的关键挑战。 Conclusion: HUMOTO为研究领域提供了高质量的人类与物体交互数据，具有广泛的实际应用潜力。 Abstract: We present Human Motions with Objects (HUMOTO), a high-fidelity dataset of human-object interactions for motion generation, computer vision, and robotics applications. Featuring 736 sequences (7,875 seconds at 30 fps), HUMOTO captures interactions with 63 precisely modeled objects and 72 articulated parts. Our innovations include a scene-driven LLM scripting pipeline creating complete, purposeful tasks with natural progression, and a mocap-and-camera recording setup to effectively handle occlusions. Spanning diverse activities from cooking to outdoor picnics, HUMOTO preserves both physical accuracy and logical task flow. Professional artists rigorously clean and verify each sequence, minimizing foot sliding and object penetrations. We also provide benchmarks compared to other datasets. HUMOTO's comprehensive full-body motion and simultaneous multi-object interactions address key data-capturing challenges and provide opportunities to advance realistic human-object interaction modeling across research domains with practical applications in animation, robotics, and embodied AI systems. Project: https://jiaxin-lu.github.io/humoto/ .

[171] MonoDiff9D: Monocular Category-Level 9D Object Pose Estimation via Diffusion Model

Jian Liu,Wei Sun,Hui Yang,Jin Zheng,Zichen Geng,Hossein Rahmani,Ajmal Mian

Main category: cs.CV

TLDR: MonoDiff9D是一种基于扩散模型的单目类别级9D物体姿态估计方法，无需形状先验或CAD模型。

Details

Motivation: 利用扩散模型的概率特性，避免对形状先验、CAD模型或深度传感器的依赖。 Method: 通过DINOv2估计粗深度并转换为点云，融合全局特征与输入图像，使用基于变压器的去噪器恢复物体姿态。 Result: 在两个基准数据集上实现了最先进的单目类别级9D物体姿态估计精度。 Conclusion: MonoDiff9D无需形状先验或CAD模型即可实现高精度姿态估计。 Abstract: Object pose estimation is a core means for robots to understand and interact with their environment. For this task, monocular category-level methods are attractive as they require only a single RGB camera. However, current methods rely on shape priors or CAD models of the intra-class known objects. We propose a diffusion-based monocular category-level 9D object pose generation method, MonoDiff9D. Our motivation is to leverage the probabilistic nature of diffusion models to alleviate the need for shape priors, CAD models, or depth sensors for intra-class unknown object pose estimation. We first estimate coarse depth via DINOv2 from the monocular image in a zero-shot manner and convert it into a point cloud. We then fuse the global features of the point cloud with the input image and use the fused features along with the encoded time step to condition MonoDiff9D. Finally, we design a transformer-based denoiser to recover the object pose from Gaussian noise. Extensive experiments on two popular benchmark datasets show that MonoDiff9D achieves state-of-the-art monocular category-level 9D object pose estimation accuracy without the need for shape priors or CAD models at any stage. Our code will be made public at https://github.com/CNJianLiu/MonoDiff9D.

[172] Anchor Token Matching: Implicit Structure Locking for Training-free AR Image Editing

Taihang Hu,Linxuan Li,Kai Wang,Yaxing Wang,Jian Yang,Ming-Ming Cheng

Main category: cs.CV

TLDR: ISLock是一种无需训练的自回归模型编辑策略，通过隐式结构锁定和锚令牌匹配协议，解决了自回归模型在图像编辑中的结构一致性问题。

Details

Motivation: 自回归模型在图像编辑中因注意力图的空间稀疏性和结构误差的累积而表现不佳，现有扩散模型编辑技术无法直接应用于自回归模型。 Method: 提出ISLock方法，通过动态对齐自注意力模式与参考图像，隐式保持结构一致性，无需显式注意力操作或微调。 Result: 实验表明，ISLock实现了高质量、结构一致的编辑效果，优于或与传统编辑技术相当。 Conclusion: ISLock为自回归模型的图像编辑提供了高效灵活的解决方案，缩小了与扩散模型的性能差距。 Abstract: Text-to-image generation has seen groundbreaking advancements with diffusion models, enabling high-fidelity synthesis and precise image editing through cross-attention manipulation. Recently, autoregressive (AR) models have re-emerged as powerful alternatives, leveraging next-token generation to match diffusion models. However, existing editing techniques designed for diffusion models fail to translate directly to AR models due to fundamental differences in structural control. Specifically, AR models suffer from spatial poverty of attention maps and sequential accumulation of structural errors during image editing, which disrupt object layouts and global consistency. In this work, we introduce Implicit Structure Locking (ISLock), the first training-free editing strategy for AR visual models. Rather than relying on explicit attention manipulation or fine-tuning, ISLock preserves structural blueprints by dynamically aligning self-attention patterns with reference images through the Anchor Token Matching (ATM) protocol. By implicitly enforcing structural consistency in latent space, our method ISLock enables structure-aware editing while maintaining generative autonomy. Extensive experiments demonstrate that ISLock achieves high-quality, structure-consistent edits without additional training and is superior or comparable to conventional editing techniques. Our findings pioneer the way for efficient and flexible AR-based image editing, further bridging the performance gap between diffusion and autoregressive generative models. The code will be publicly available at https://github.com/hutaiHang/ATM

[173] Integrating Vision and Location with Transformers: A Multimodal Deep Learning Framework for Medical Wound Analysis

Ramin Mousa,Hadis Taherinia,Khabiba Abdiyeva,Amir Ali Bengari,Mohammadmahdi Vahediahmar

Main category: cs.CV

TLDR: 该论文提出了一种基于深度学习的多模态分类器，结合伤口图像和位置数据，用于高效分类难愈性伤口类型，并通过优化算法提升模型准确性。

Details

Motivation: 传统机器学习模型在伤口诊断中存在特征选择和模型复杂性问题，深度学习虽表现出潜力，但效率和准确性仍有提升空间。 Method: 使用Vision Transformer提取图像特征，结合DWT层和Transformer提取空间特征，并采用三种群优化算法优化权重向量。 Result: 模型在原始身体地图上的分类准确率达到0.8123（仅图像数据）和0.8007（图像+位置数据），优化后准确率提升至0.8342。 Conclusion: 结合优化算法的多模态深度学习模型显著提高了伤口分类的准确性和效率，为伤口诊断提供了有效工具。 Abstract: Effective recognition of acute and difficult-to-heal wounds is a necessary step in wound diagnosis. An efficient classification model can help wound specialists classify wound types with less financial and time costs and also help in deciding on the optimal treatment method. Traditional machine learning models suffer from feature selection and are usually cumbersome models for accurate recognition. Recently, deep learning (DL) has emerged as a powerful tool in wound diagnosis. Although DL seems promising for wound type recognition, there is still a large scope for improving the efficiency and accuracy of the model. In this study, a DL-based multimodal classifier was developed using wound images and their corresponding locations to classify them into multiple classes, including diabetic, pressure, surgical, and venous ulcers. A body map was also created to provide location data, which can help wound specialists label wound locations more effectively. The model uses a Vision Transformer to extract hierarchical features from input images, a Discrete Wavelet Transform (DWT) layer to capture low and high frequency components, and a Transformer to extract spatial features. The number of neurons and weight vector optimization were performed using three swarm-based optimization techniques (Monster Gorilla Toner (MGTO), Improved Gray Wolf Optimization (IGWO), and Fox Optimization Algorithm). The evaluation results show that weight vector optimization using optimization algorithms can increase diagnostic accuracy and make it a very effective approach for wound detection. In the classification using the original body map, the proposed model was able to achieve an accuracy of 0.8123 using image data and an accuracy of 0.8007 using a combination of image data and wound location. Also, the accuracy of the model in combination with the optimization models varied from 0.7801 to 0.8342.

[174] The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer

Weixian Lei,Jiacong Wang,Haochen Wang,Xiangtai Li,Jun Hao Liew,Jiashi Feng,Zilong Huang

Main category: cs.CV

TLDR: SAIL是一种单变压器统一多模态大语言模型，整合了原始像素编码和语言解码，无需预训练视觉编码器，采用混合注意机制和多模态位置编码。

Details

Motivation: 解决现有模块化多模态大语言模型依赖预训练视觉编码器的问题，提出更简洁的架构设计。 Method: 通过混合注意机制和多模态位置编码，统一处理视觉和文本模态，无需独立视觉编码器。 Result: SAIL在性能和视觉表示能力上与模块化模型相当，且具有更好的可扩展性和不同的跨模态信息流模式。 Conclusion: SAIL展示了统一架构在多模态任务中的潜力，性能与模块化模型相当，同时简化了设计。 Abstract: This paper introduces SAIL, a single transformer unified multimodal large language model (MLLM) that integrates raw pixel encoding and language decoding within a singular architecture. Unlike existing modular MLLMs, which rely on a pre-trained vision transformer (ViT), SAIL eliminates the need for a separate vision encoder, presenting a more minimalist architecture design. Instead of introducing novel architectural components, SAIL adapts mix-attention mechanisms and multimodal positional encodings to better align with the distinct characteristics of visual and textual modalities. We systematically compare SAIL's properties-including scalability, cross-modal information flow patterns, and visual representation capabilities-with those of modular MLLMs. By scaling both training data and model size, SAIL achieves performance comparable to modular MLLMs. Notably, the removal of pretrained ViT components enhances SAIL's scalability and results in significantly different cross-modal information flow patterns. Moreover, SAIL demonstrates strong visual representation capabilities, achieving results on par with ViT-22B in vision tasks such as semantic segmentation. Code and models are available at https://github.com/bytedance/SAIL.

[175] Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding

Tao Zhang,Xiangtai Li,Zilong Huang,Yanwei Li,Weixian Lei,Xueqing Deng,Shihao Chen,Shunping Ji,Jiashi Feng

Main category: cs.CV

TLDR: Pixel-SAIL提出了一种简化的多模态大语言模型（MLLM），通过单一Transformer实现像素级任务，避免了额外组件的依赖，并在多个基准测试中表现优异。

Details

Motivation: 现有MLLM依赖额外组件（如视觉编码器、分割专家），导致系统复杂且难以扩展。本研究旨在探索一种无需额外组件的简化MLLM。 Method: 1. 设计可学习的上采样模块优化视觉令牌特征；2. 提出视觉提示注入策略；3. 引入视觉专家蒸馏策略。 Result: 在多个基准测试中，Pixel-SAIL表现优于或接近现有方法，且系统更简单。 Conclusion: Pixel-SAIL证明了单一Transformer可实现高效像素级任务，为MLLM简化提供了新思路。 Abstract: Multimodal Large Language Models (MLLMs) achieve remarkable performance for fine-grained pixel-level understanding tasks. However, all the works rely heavily on extra components, such as vision encoder (CLIP), segmentation experts, leading to high system complexity and limiting model scaling. In this work, our goal is to explore a highly simplified MLLM without introducing extra components. Our work is motivated by the recent works on Single trAnsformer as a unified vIsion-Language Model (SAIL) design, where these works jointly learn vision tokens and text tokens in transformers. We present Pixel-SAIL, a single transformer for pixel-wise MLLM tasks. In particular, we present three technical improvements on the plain baseline. First, we design a learnable upsampling module to refine visual token features. Secondly, we propose a novel visual prompt injection strategy to enable the single transformer to understand visual prompt inputs and benefit from the early fusion of visual prompt embeddings and vision tokens. Thirdly, we introduce a vision expert distillation strategy to efficiently enhance the single transformer's fine-grained feature extraction capability. In addition, we have collected a comprehensive pixel understanding benchmark (PerBench), using a manual check. It includes three tasks: detailed object description, visual prompt-based question answering, and visual-text referring segmentation. Extensive experiments on four referring segmentation benchmarks, one visual prompt benchmark, and our PerBench show that our Pixel-SAIL achieves comparable or even better results with a much simpler pipeline. Code and model will be released at https://github.com/magic-research/Sa2VA.

[176] Art3D: Training-Free 3D Generation from Flat-Colored Illustration

Xiaoyan Cong,Jiayi Shen,Zekun Li,Rao Fu,Tao Lu,Srinath Sridhar

Main category: cs.CV

TLDR: Art3D是一种无需训练的方法，能将平面彩色2D设计提升为3D，利用预训练的2D图像生成模型和基于VLM的真实性评估，增强参考图像的三维错觉。

Details

Motivation: 现有的大规模预训练图像到3D生成模型在处理平面彩色图像（如手绘图）时表现不佳，缺乏3D错觉，而这类输入在艺术内容创作中非常常见。 Method: Art3D通过结合预训练的2D图像生成模型的结构和语义特征，以及基于VLM的真实性评估，无需额外训练即可实现2D到3D的提升。 Result: 实验结果表明，Art3D在平面彩色图像上表现出色，具有强大的泛化能力和实际应用潜力。 Conclusion: Art3D简化了从2D生成3D的过程，适用于多种绘画风格，并公开了源代码和数据集。 Abstract: Large-scale pre-trained image-to-3D generative models have exhibited remarkable capabilities in diverse shape generations. However, most of them struggle to synthesize plausible 3D assets when the reference image is flat-colored like hand drawings due to the lack of 3D illusion, which are often the most user-friendly input modalities in art content creation. To this end, we propose Art3D, a training-free method that can lift flat-colored 2D designs into 3D. By leveraging structural and semantic features with pre- trained 2D image generation models and a VLM-based realism evaluation, Art3D successfully enhances the three-dimensional illusion in reference images, thus simplifying the process of generating 3D from 2D, and proves adaptable to a wide range of painting styles. To benchmark the generalization performance of existing image-to-3D models on flat-colored images without 3D feeling, we collect a new dataset, Flat-2D, with over 100 samples. Experimental results demonstrate the performance and robustness of Art3D, exhibiting superior generalizable capacity and promising practical applicability. Our source code and dataset will be publicly available on our project page: https://joy-jy11.github.io/ .

[177] InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu,Weiyun Wang,Zhe Chen,Zhaoyang Liu,Shenglong Ye,Lixin Gu,Yuchen Duan,Hao Tian,Weijie Su,Jie Shao,Zhangwei Gao,Erfei Cui,Yue Cao,Yangzhou Liu,Weiye Xu,Hao Li,Jiahao Wang,Han Lv,Dengnian Chen,Songze Li,Yinan He,Tan Jiang,Jiapeng Luo,Yi Wang,Conghui He,Botian Shi,Xingcheng Zhang,Wenqi Shao,Junjun He,Yingtong Xiong,Wenwen Qu,Peng Sun,Penglong Jiao,Lijun Wu,Kaipeng Zhang,Huipeng Deng,Jiaye Ge,Kai Chen,Limin Wang,Min Dou,Lewei Lu,Xizhou Zhu,Tong Lu,Dahua Lin,Yu Qiao,Jifeng Dai,Wenhai Wang

Main category: cs.CV

TLDR: InternVL3是一种原生多模态预训练模型，通过统一训练范式解决传统MLLM的复杂性和对齐问题，支持扩展多模态上下文，并在多模态任务中表现优异。

Details

Motivation: 传统方法通过将纯文本LLM适配为支持视觉输入的MLLM，存在复杂性和对齐挑战，InternVL3旨在通过统一训练范式解决这些问题。 Method: InternVL3采用原生多模态预训练，结合可变视觉位置编码（V2PE）、监督微调（SFT）、混合偏好优化（MPO）等先进技术。 Result: InternVL3在MMMU基准测试中得分72.2，创开源MLLM新纪录，性能与ChatGPT-4o等专有模型竞争，同时保持纯语言能力。 Conclusion: InternVL3通过统一训练和先进技术在多模态任务中表现卓越，未来将公开训练数据和模型权重以推动研究。 Abstract: We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.

[178] REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers

Xingjian Leng,Jaskirat Singh,Yunzhong Hou,Zhenchang Xing,Saining Xie,Liang Zheng

Main category: cs.CV

TLDR: 论文提出了一种名为REPA-E的训练方法，通过表示对齐损失（REPA）实现了VAE和扩散模型的端到端训练，显著提升了性能。

Details

Motivation: 探讨是否可以通过端到端训练联合优化VAE和扩散模型，以提升生成模型的性能。 Method: 使用REPA损失替代传统的扩散损失，实现VAE和扩散模型的联合训练。 Result: REPA-E方法将训练速度提升了17倍和45倍，并在ImageNet 256x256上实现了FID 1.26和1.83的新SOTA性能。 Conclusion: REPA-E方法不仅加速了训练，还改善了VAE的潜在空间结构，为生成模型提供了新的端到端训练方案。 Abstract: In this paper we tackle a fundamental question: "Can we train latent diffusion models together with the variational auto-encoder (VAE) tokenizer in an end-to-end manner?" Traditional deep-learning wisdom dictates that end-to-end training is often preferable when possible. However, for latent diffusion transformers, it is observed that end-to-end training both VAE and diffusion-model using standard diffusion-loss is ineffective, even causing a degradation in final performance. We show that while diffusion loss is ineffective, end-to-end training can be unlocked through the representation-alignment (REPA) loss -- allowing both VAE and diffusion model to be jointly tuned during the training process. Despite its simplicity, the proposed training recipe (REPA-E) shows remarkable performance; speeding up diffusion model training by over 17x and 45x over REPA and vanilla training recipes, respectively. Interestingly, we observe that end-to-end tuning with REPA-E also improves the VAE itself; leading to improved latent space structure and downstream generation performance. In terms of final performance, our approach sets a new state-of-the-art; achieving FID of 1.26 and 1.83 with and without classifier-free guidance on ImageNet 256 x 256. Code is available at https://end2end-diffusion.github.io.

[179] Decoupled Diffusion Sparks Adaptive Scene Generation

Yunsong Zhou,Naisheng Ye,William Ljungbergh,Tianyu Li,Jiazhi Yang,Zetong Yang,Hongzi Zhu,Christoffer Petersson,Hongyang Li

Main category: cs.CV

TLDR: Nexus提出了一种解耦的场景生成框架，通过独立噪声状态的细粒度令牌模拟常规和挑战性场景，提升了反应性和目标导向性。

Details

Motivation: 降低自动驾驶数据收集成本，解决现有方法在反应性和目标状态指导上的不足。 Method: 采用解耦的生成框架，结合部分噪声掩码训练策略和噪声感知调度，生成复杂场景。 Result: 生成真实感提升，位移误差减少40%，闭环规划性能提升20%。 Conclusion: Nexus在提升场景生成质量和安全性方面表现优异，适用于数据增强和安全关键场景生成。 Abstract: Controllable scene generation could reduce the cost of diverse data collection substantially for autonomous driving. Prior works formulate the traffic layout generation as predictive progress, either by denoising entire sequences at once or by iteratively predicting the next frame. However, full sequence denoising hinders online reaction, while the latter's short-sighted next-frame prediction lacks precise goal-state guidance. Further, the learned model struggles to generate complex or challenging scenarios due to a large number of safe and ordinal driving behaviors from open datasets. To overcome these, we introduce Nexus, a decoupled scene generation framework that improves reactivity and goal conditioning by simulating both ordinal and challenging scenarios from fine-grained tokens with independent noise states. At the core of the decoupled pipeline is the integration of a partial noise-masking training strategy and a noise-aware schedule that ensures timely environmental updates throughout the denoising process. To complement challenging scenario generation, we collect a dataset consisting of complex corner cases. It covers 540 hours of simulated data, including high-risk interactions such as cut-in, sudden braking, and collision. Nexus achieves superior generation realism while preserving reactivity and goal orientation, with a 40% reduction in displacement error. We further demonstrate that Nexus improves closed-loop planning by 20% through data augmentation and showcase its capability in safety-critical data generation.

[180] DNF-Avatar: Distilling Neural Fields for Real-time Animatable Avatar Relighting

Zeren Jiang,Shaofei Wang,Siyu Tang

Main category: cs.CV

TLDR: 论文提出了一种从单目视频创建可重光照和可动画化的人体化身的方法，通过将隐式神经场的知识蒸馏到显式2D高斯泼溅表示中，显著提高了渲染速度。

Details

Motivation: 现有方法因依赖蒙特卡洛光线追踪导致渲染速度慢，限制了实际应用。 Method: 使用隐式神经场（教师模型）到显式2D高斯泼溅（学生模型）的知识蒸馏，结合分裂和近似PBR外观和部分环境光遮蔽探针。 Result: 学生模型在推理时速度提升370倍（67 FPS），且重光照效果与教师模型相当或更好。 Conclusion: 该方法实现了高质量实时重光照，适用于虚拟现实、体育和游戏等领域。 Abstract: Creating relightable and animatable human avatars from monocular videos is a rising research topic with a range of applications, e.g. virtual reality, sports, and video games. Previous works utilize neural fields together with physically based rendering (PBR), to estimate geometry and disentangle appearance properties of human avatars. However, one drawback of these methods is the slow rendering speed due to the expensive Monte Carlo ray tracing. To tackle this problem, we proposed to distill the knowledge from implicit neural fields (teacher) to explicit 2D Gaussian splatting (student) representation to take advantage of the fast rasterization property of Gaussian splatting. To avoid ray-tracing, we employ the split-sum approximation for PBR appearance. We also propose novel part-wise ambient occlusion probes for shadow computation. Shadow prediction is achieved by querying these probes only once per pixel, which paves the way for real-time relighting of avatars. These techniques combined give high-quality relighting results with realistic shadow effects. Our experiments demonstrate that the proposed student model achieves comparable or even better relighting results with our teacher model while being 370 times faster at inference time, achieving a 67 FPS rendering speed.

[181] FLOSS: Free Lunch in Open-vocabulary Semantic Segmentation

Yasser Benigmim,Mohammad Fahes,Tuan-Hung Vu,Andrei Bursuc,Raoul de Charette

Main category: cs.CV

TLDR: 论文提出FLOSS方法，通过选择单模板分类器（class-experts）并融合其预测，无需训练即可提升开放词汇语义分割（OVSS）性能。

Details

Motivation: 挑战现有OVSS模型依赖多模板平均文本嵌入的现状，探索单模板分类器的潜力。 Method: 利用未标注图像和预测熵选择单模板分类器（class-experts），并通过新融合方法生成更准确的OVSS预测。 Result: FLOSS在多个OVSS基准测试中显著提升现有方法性能，且专家模板可跨数据集泛化。 Conclusion: FLOSS是一种无需标签和训练的即插即用方法，为OVSS提供系统性改进。 Abstract: Recent Open-Vocabulary Semantic Segmentation (OVSS) models extend the CLIP model to segmentation while maintaining the use of multiple templates (e.g., a photo of , a sketch of a , etc.) for constructing class-wise averaged text embeddings, acting as a classifier. In this paper, we challenge this status quo and investigate the impact of templates for OVSS. Empirically, we observe that for each class, there exist single-template classifiers significantly outperforming the conventional averaged classifier. We refer to them as class-experts. Given access to unlabeled images and without any training involved, we estimate these experts by leveraging the class-wise prediction entropy of single-template classifiers, selecting as class-wise experts those which yield the lowest entropy. All experts, each specializing in a specific class, collaborate in a newly proposed fusion method to generate more accurate OVSS predictions. Our plug-and-play method, coined FLOSS, is orthogonal and complementary to existing OVSS methods, offering a ''free lunch'' to systematically improve OVSS without labels and additional training. Extensive experiments demonstrate that FLOSS consistently boosts state-of-the-art methods on various OVSS benchmarks. Moreover, the selected expert templates can generalize well from one dataset to others sharing the same semantic categories, yet exhibiting distribution shifts. Additionally, we obtain satisfactory improvements under a low-data regime, where only a few unlabeled images are available. Our code is available at https://github.com/yasserben/FLOSS .

[182] HAL-NeRF: High Accuracy Localization Leveraging Neural Radiance Fields

Asterios Reppas,Grigorios-Aris Cheimariotis,Panos K. Papadopoulos,Panagiotis Frasiolas,Dimitrios Zarpalas

Main category: cs.CV

TLDR: HAL-NeRF结合CNN姿态回归器和基于蒙特卡洛粒子滤波的细化模块，显著提升了相机定位精度。

Details

Motivation: 在XR和机器人应用中，仅使用相机输入实现高精度定位具有挑战性，现有方法如APR在室外场景中误差较大。 Method: 结合CNN姿态回归器和NeRF（Nerfacto模型）的细化模块，通过数据增强和光度损失测量提升定位精度。 Result: 在7-Scenes和Cambridge Landmarks数据集上，平移误差分别为0.025m和0.04m，旋转误差为0.59度和0.58度。 Conclusion: HAL-NeRF展示了结合APR与NeRF细化技术的潜力，显著提升了单目相机重定位精度，但计算时间增加。 Abstract: Precise camera localization is a critical task in XR applications and robotics. Using only the camera captures as input to a system is an inexpensive option that enables localization in large indoor and outdoor environments, but it presents challenges in achieving high accuracy. Specifically, camera relocalization methods, such as Absolute Pose Regression (APR), can localize cameras with a median translation error of more than $0.5m$ in outdoor scenes. This paper presents HAL-NeRF, a high-accuracy localization method that combines a CNN pose regressor with a refinement module based on a Monte Carlo particle filter. The Nerfacto model, an implementation of Neural Radiance Fields (NeRFs), is used to augment the data for training the pose regressor and to measure photometric loss in the particle filter refinement module. HAL-NeRF leverages Nerfacto's ability to synthesize high-quality novel views, significantly improving the performance of the localization pipeline. HAL-NeRF achieves state-of-the-art results that are conventionally measured as the average of the median per scene errors. The translation error was $0.025m$ and the rotation error was $0.59$ degrees and 0.04m and 0.58 degrees on the 7-Scenes dataset and Cambridge Landmarks datasets respectively, with the trade-off of increased computational time. This work highlights the potential of combining APR with NeRF-based refinement techniques to advance monocular camera relocalization accuracy.

[183] LookingGlass: Generative Anamorphoses via Laplacian Pyramid Warping

Pascal Chang,Sergio Sancho,Jingwei Tang,Markus Gross,Vinicius C. Azevedo

Main category: cs.CV

TLDR: 本文提出了一种生成变形图像的方法，利用潜在校正流模型和拉普拉斯金字塔扭曲技术，使图像在直接观看时仍能保持有效解释。

Details

Motivation: 传统变形图像只能在特定视角下识别，限制了其应用。本文旨在生成既能保持变形效果又能直接观看的图像。 Method: 结合潜在校正流模型和拉普拉斯金字塔扭曲技术，生成高质量的变形图像。 Result: 成功生成了新型生成感知错觉图像，扩展了视觉变形技术的应用范围。 Conclusion: 本文方法为变形图像的生成提供了新思路，扩展了其在生成模型中的应用。 Abstract: Anamorphosis refers to a category of images that are intentionally distorted, making them unrecognizable when viewed directly. Their true form only reveals itself when seen from a specific viewpoint, which can be through some catadioptric device like a mirror or a lens. While the construction of these mathematical devices can be traced back to as early as the 17th century, they are only interpretable when viewed from a specific vantage point and tend to lose meaning when seen normally. In this paper, we revisit these famous optical illusions with a generative twist. With the help of latent rectified flow models, we propose a method to create anamorphic images that still retain a valid interpretation when viewed directly. To this end, we introduce Laplacian Pyramid Warping, a frequency-aware image warping technique key to generating high-quality visuals. Our work extends Visual Anagrams (arXiv:2311.17919) to latent space models and to a wider range of spatial transforms, enabling the creation of novel generative perceptual illusions.

[184] Robust SAM: On the Adversarial Robustness of Vision Foundation Models

Jiahuan Long,Zhengqin Xu,Tingsong Jiang,Wen Yao,Shuai Jia,Chao Ma,Xiaoqian Chen

Main category: cs.CV

TLDR: 本文提出了一种对抗性鲁棒性框架，用于评估和增强SAM模型的鲁棒性，包括跨提示攻击方法和少参数适应防御策略。

Details

Motivation: SAM模型广泛应用，但其对抗攻击鲁棒性研究不足，现有攻击和防御方法存在缺陷。 Method: 提出跨提示攻击方法增强攻击可转移性，采用SVD约束参数空间的少参数适应策略进行防御。 Result: 跨提示攻击方法在攻击成功率上优于现有方法，防御策略仅需512参数即可提升15%的mIoU。 Conclusion: 该框架显著提升SAM的鲁棒性，同时保持其原始性能。 Abstract: The Segment Anything Model (SAM) is a widely used vision foundation model with diverse applications, including image segmentation, detection, and tracking. Given SAM's wide applications, understanding its robustness against adversarial attacks is crucial for real-world deployment. However, research on SAM's robustness is still in its early stages. Existing attacks often overlook the role of prompts in evaluating SAM's robustness, and there has been insufficient exploration of defense methods to balance the robustness and accuracy. To address these gaps, this paper proposes an adversarial robustness framework designed to evaluate and enhance the robustness of SAM. Specifically, we introduce a cross-prompt attack method to enhance the attack transferability across different prompt types. Besides attacking, we propose a few-parameter adaptation strategy to defend SAM against various adversarial attacks. To balance robustness and accuracy, we use the singular value decomposition (SVD) to constrain the space of trainable parameters, where only singular values are adaptable. Experiments demonstrate that our cross-prompt attack method outperforms previous approaches in terms of attack success rate on both SAM and SAM 2. By adapting only 512 parameters, we achieve at least a 15\% improvement in mean intersection over union (mIoU) against various adversarial attacks. Compared to previous defense methods, our approach enhances the robustness of SAM while maximally maintaining its original performance.

[185] Parameter-Free Fine-tuning via Redundancy Elimination for Vision Foundation Models

Jiahuan Long,Tingsong Jiang,Wen Yao,Yizhe Xiong,Zhengqin Xu,Shuai Jia,Chao Ma

Main category: cs.CV

TLDR: 本文提出了一种参数无关的微调方法，通过选择和增强预训练特征来减少视觉基础模型（VFMs）中的特征冗余，提升下游任务性能。

Details

Motivation: 视觉基础模型（VFMs）存在特征冗余问题，限制了其在新任务中的适应性，因此需要一种高效的微调方法。 Method: 提出基于模型输出差异的通道选择算法，识别冗余和有效通道，选择性替换冗余通道以增强任务相关特征。 Result: 实验表明该方法在域内外数据集上均高效有效，并能与现有微调策略无缝结合，显著降低计算和内存开销。 Conclusion: 该方法为模型微调提供了新视角，通过特征选择和重用提升了任务特定性能，同时减少了资源消耗。 Abstract: Vision foundation models (VFMs) are large pre-trained models that form the backbone of various vision tasks. Fine-tuning VFMs can further unlock their potential for downstream tasks or scenarios. However, VFMs often contain significant feature redundancy, which may limit their adaptability to new tasks. In this paper, we investigate the redundancies in the segment anything model (SAM) and then propose a parameter-free fine-tuning method to address this issue. Unlike traditional fine-tuning methods that adjust parameters, our method emphasizes selecting, reusing, and enhancing pre-trained features, offering a new perspective on model fine-tuning. Specifically, we introduce a channel selection algorithm based on the model's output difference to identify redundant and effective channels. By selectively replacing the redundant channels with more effective ones, we filter out less useful features and reuse the more relevant features to downstream tasks, thereby enhancing the task-specific feature representation. Experiments on both out-of-domain and in-domain datasets demonstrate the efficiency and effectiveness of our method. Notably, our approach can seamlessly integrate with existing fine-tuning strategies (e.g., LoRA, Adapter), further boosting the performance of already fine-tuned models. Moreover, since our channel selection involves only model inference, our method significantly reduces computational and GPU memory overhead.

[186] MotionDreamer: One-to-Many Motion Synthesis with Localized Generative Masked Transformer

Yilin Wang,Chuan Guo,Yuxuan Mu,Muhammad Gohar Javed,Xinxin Zuo,Juwei Lu,Hai Jiang,Li Cheng

Main category: cs.CV

TLDR: MotionDreamer提出了一种局部掩码建模范式，用于从单一MoCap参考生成多样化的动画，解决了传统方法在数据稀缺时容易过拟合的问题。

Details

Motivation: 在动画领域，大型数据集稀缺，传统生成掩码建模方法在单一参考下容易过拟合，缺乏多样性。 Method: 通过分布正则化方法将运动嵌入量化标记，构建局部运动模式的鲁棒代码本，并结合滑动窗口局部注意力机制生成多样化动画。 Result: MotionDreamer在忠实度和多样性上优于基于GAN或Diffusion的现有方法，并能高效完成下游任务。 Conclusion: MotionDreamer通过量化方法和局部注意力机制，实现了从单一参考生成多样化动画，并展示了广泛的应用潜力。 Abstract: Generative masked transformers have demonstrated remarkable success across various content generation tasks, primarily due to their ability to effectively model large-scale dataset distributions with high consistency. However, in the animation domain, large datasets are not always available. Applying generative masked modeling to generate diverse instances from a single MoCap reference may lead to overfitting, a challenge that remains unexplored. In this work, we present MotionDreamer, a localized masked modeling paradigm designed to learn internal motion patterns from a given motion with arbitrary topology and duration. By embedding the given motion into quantized tokens with a novel distribution regularization method, MotionDreamer constructs a robust and informative codebook for local motion patterns. Moreover, a sliding window local attention is introduced in our masked transformer, enabling the generation of natural yet diverse animations that closely resemble the reference motion patterns. As demonstrated through comprehensive experiments, MotionDreamer outperforms the state-of-the-art methods that are typically GAN or Diffusion-based in both faithfulness and diversity. Thanks to the consistency and robustness of the quantization-based approach, MotionDreamer can also effectively perform downstream tasks such as temporal motion editing, \textcolor{update}{crowd animation}, and beat-aligned dance generation, all using a single reference motion. Visit our project page: https://motiondreamer.github.io/

[187] PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models

Mohamed Dhouib,Davide Buscaldi,Sonia Vanier,Aymen Shabou

Main category: cs.CV

TLDR: PACT是一种通过剪枝无关令牌和合并冗余视觉令牌来减少视觉语言模型推理时间和内存使用的方法。

Details

Motivation: 视觉令牌通常包含冗余和不重要信息，导致不必要的计算资源浪费。 Method: 提出PACT方法，使用重要性度量剪枝无关令牌，并提出新的聚类算法Distance Bounded Density Peak Clustering合并冗余令牌。 Result: 实验证明PACT能有效减少推理时间和内存使用。 Conclusion: PACT通过优化视觉令牌处理，显著提升了视觉语言模型的效率。 Abstract: Visual Language Models require substantial computational resources for inference due to the additional input tokens needed to represent visual information. However, these visual tokens often contain redundant and unimportant information, resulting in an unnecessarily high number of tokens. To address this, we introduce PACT, a method that reduces inference time and memory usage by pruning irrelevant tokens and merging visually redundant ones at an early layer of the language model. Our approach uses a novel importance metric to identify unimportant tokens without relying on attention scores, making it compatible with FlashAttention. We also propose a novel clustering algorithm, called Distance Bounded Density Peak Clustering, which efficiently clusters visual tokens while constraining the distances between elements within a cluster by a predefined threshold. We demonstrate the effectiveness of PACT through extensive experiments.

[188] Adaptive Additive Parameter Updates of Vision Transformers for Few-Shot Continual Learning

Kyle Stein,Andrew Arash Mahyari,Guillermo Francia III,Eman El-Sheikh

Main category: cs.CV

TLDR: 提出了一种基于冻结Vision Transformer（ViT）的少样本类增量学习（FSCIL）框架，通过参数高效的加法更新机制，减少过拟合并避免灾难性遗忘。

Details

Motivation: 解决少样本类增量学习中因数据有限导致的过拟合和灾难性遗忘问题。 Method: 冻结预训练ViT参数，通过加法更新机制在自注意力模块中注入可训练权重，仅更新少量参数以适应新类。 Result: 在基准数据集上实现了优于基线FSCIL方法的性能。 Conclusion: 该方法通过冻结大部分参数和选择性更新，有效平衡了新类适应和旧类知识保留。 Abstract: Integrating new class information without losing previously acquired knowledge remains a central challenge in artificial intelligence, often referred to as catastrophic forgetting. Few-shot class incremental learning (FSCIL) addresses this by first training a model on a robust dataset of base classes and then incrementally adapting it in successive sessions using only a few labeled examples per novel class. However, this approach is prone to overfitting on the limited new data, which can compromise overall performance and exacerbate forgetting. In this work, we propose a simple yet effective novel FSCIL framework that leverages a frozen Vision Transformer (ViT) backbone augmented with parameter-efficient additive updates. Our approach freezes the pre-trained ViT parameters and selectively injects trainable weights into the self-attention modules via an additive update mechanism. This design updates only a small subset of parameters to accommodate new classes without sacrificing the representations learned during the base session. By fine-tuning a limited number of parameters, our method preserves the generalizable features in the frozen ViT while reducing the risk of overfitting. Furthermore, as most parameters remain fixed, the model avoids overwriting previously learned knowledge when small novel data batches are introduced. Extensive experiments on benchmark datasets demonstrate that our approach yields state-of-the-art performance compared to baseline FSCIL methods.

[189] Chest X-ray Classification using Deep Convolution Models on Low-resolution images with Uncertain Labels

Snigdha Agarwal,Neelam Sinha

Main category: cs.CV

TLDR: 本文研究了低分辨率胸部X光图像的分类问题，通过改进CNN模型和标签处理技术，提高了对特定病理的识别准确率。

Details

Motivation: 低分辨率图像在远程医疗中具有成本优势，但识别难度大，因此需要开发适用于此类图像的自动化诊断模型。 Method: 使用不同尺寸的输入图像训练CNN模型，提出随机翻转标签技术，并采用多标签分类模型集成。结合数据增强和正则化技术，利用类激活图可视化决策过程。 Result: 在CheXpert数据集上，对特定病理（如心脏肥大、实变和水肿）的分类准确率比原论文高3%。 Conclusion: 研究表明，改进的CNN模型可以有效处理低分辨率图像，为远程医疗提供可行解决方案。 Abstract: Deep Convolutional Neural Networks have consistently proven to achieve state-of-the-art results on a lot of imaging tasks over the past years' majority of which comprise of high-quality data. However, it is important to work on low-resolution images since it could be a cheaper alternative for remote healthcare access where the primary need of automated pathology identification models occurs. Medical diagnosis using low-resolution images is challenging since critical details may not be easily identifiable. In this paper, we report classification results by experimenting on different input image sizes of Chest X-rays to deep CNN models and discuss the feasibility of classification on varying image sizes. We also leverage the noisy labels in the dataset by proposing a Randomized Flipping of labels techniques. We use an ensemble of multi-label classification models on frontal and lateral studies. Our models are trained on 5 out of the 14 chest pathologies of the publicly available CheXpert dataset. We incorporate techniques such as augmentation, regularization for model improvement and use class activation maps to visualize the neural network's decision making. Comparison with classification results on data from 200 subjects, obtained on the corresponding high-resolution images, reported in the original CheXpert paper, has been presented. For pathologies Cardiomegaly, Consolidation and Edema, we obtain 3% higher accuracy with our model architecture.

[190] Sculpting Memory: Multi-Concept Forgetting in Diffusion Models via Dynamic Mask and Concept-Aware Optimization

Gen Li,Yang Xiao,Jie Ji,Kaiyuan Deng,Bo Hui,Linke Guo,Xiaolong Ma

Main category: cs.CV

TLDR: 提出了一种动态掩码与概念感知损失相结合的新框架，用于扩散模型中的多概念遗忘，解决了现有方法的不稳定性、残留知识和生成质量下降问题。

Details

Motivation: 解决扩散模型中多概念遗忘的挑战，如去除版权内容、减少偏见或消除有害概念。 Method: 动态掩码机制根据优化状态自适应更新梯度掩码，概念感知损失通过超类对齐和知识蒸馏正则化指导遗忘过程。 Result: 实验表明，该方法在多概念遗忘效果、输出保真度和语义一致性上优于现有技术。 Conclusion: 提供了一个稳定且高保真的生成模型遗忘框架，代码将公开。 Abstract: Text-to-image (T2I) diffusion models have achieved remarkable success in generating high-quality images from textual prompts. However, their ability to store vast amounts of knowledge raises concerns in scenarios where selective forgetting is necessary, such as removing copyrighted content, reducing biases, or eliminating harmful concepts. While existing unlearning methods can remove certain concepts, they struggle with multi-concept forgetting due to instability, residual knowledge persistence, and generation quality degradation. To address these challenges, we propose \textbf{Dynamic Mask coupled with Concept-Aware Loss}, a novel unlearning framework designed for multi-concept forgetting in diffusion models. Our \textbf{Dynamic Mask} mechanism adaptively updates gradient masks based on current optimization states, allowing selective weight modifications that prevent interference with unrelated knowledge. Additionally, our \textbf{Concept-Aware Loss} explicitly guides the unlearning process by enforcing semantic consistency through superclass alignment, while a regularization loss based on knowledge distillation ensures that previously unlearned concepts remain forgotten during sequential unlearning. We conduct extensive experiments to evaluate our approach. Results demonstrate that our method outperforms existing unlearning techniques in forgetting effectiveness, output fidelity, and semantic coherence, particularly in multi-concept scenarios. Our work provides a principled and flexible framework for stable and high-fidelity unlearning in generative models. The code will be released publicly.

[191] BlockGaussian: Efficient Large-Scale Scene NovelView Synthesis via Adaptive Block-Based Gaussian Splatting

Yongchang Wu,Zipeng Qi,Zhenwei Shi,Zhengxia Zou

Main category: cs.CV

TLDR: BlockGaussian提出了一种基于内容感知的场景分区和可见性感知块优化的新框架，用于高效、高质量的大规模场景重建。

Details

Motivation: 3DGS在大规模场景重建中存在场景分区、优化和合并的挑战，需要更高效的解决方案。 Method: 采用内容感知分区策略和可见性感知块优化，引入辅助点和对齐监督，以及伪视图几何约束。 Result: 在多个基准测试中，优化速度提升5倍，PSNR平均提高1.21 dB，且可在24GB VRAM设备上运行。 Conclusion: BlockGaussian在重建效率和渲染质量上达到最先进水平，显著降低了计算需求。 Abstract: The recent advancements in 3D Gaussian Splatting (3DGS) have demonstrated remarkable potential in novel view synthesis tasks. The divide-and-conquer paradigm has enabled large-scale scene reconstruction, but significant challenges remain in scene partitioning, optimization, and merging processes. This paper introduces BlockGaussian, a novel framework incorporating a content-aware scene partition strategy and visibility-aware block optimization to achieve efficient and high-quality large-scale scene reconstruction. Specifically, our approach considers the content-complexity variation across different regions and balances computational load during scene partitioning, enabling efficient scene reconstruction. To tackle the supervision mismatch issue during independent block optimization, we introduce auxiliary points during individual block optimization to align the ground-truth supervision, which enhances the reconstruction quality. Furthermore, we propose a pseudo-view geometry constraint that effectively mitigates rendering degradation caused by airspace floaters during block merging. Extensive experiments on large-scale scenes demonstrate that our approach achieves state-of-the-art performance in both reconstruction efficiency and rendering quality, with a 5x speedup in optimization and an average PSNR improvement of 1.21 dB on multiple benchmarks. Notably, BlockGaussian significantly reduces computational requirements, enabling large-scale scene reconstruction on a single 24GB VRAM device. The project page is available at https://github.com/SunshineWYC/BlockGaussian

[192] You Need a Transition Plane: Bridging Continuous Panoramic 3D Reconstruction with Perspective Gaussian Splatting

Zhijie Shen,Chunyu Lin,Shujuan Huang,Lang Nie,Kang Liao,Yao Zhao

Main category: cs.CV

TLDR: 提出了一种名为TPGS的新框架，用于通过3D高斯溅射技术从单张全景图像重建3D场景，解决了全景图像直接渲染时的失真问题。

Details

Motivation: 全景图像具有360×180的视野，但直接渲染3D高斯到2D等距柱状空间会引入严重失真，现有方法（如立方体贴图）也存在投影失真和边界不连续问题。 Method: TPGS框架引入过渡平面以平滑溅射方向，提出面内到面间优化策略增强局部细节和视觉一致性，并采用球形采样消除接缝。 Result: 在室内外、第一人称和漫游基准数据集上的实验表明，TPGS优于现有方法。 Conclusion: TPGS有效解决了全景3D场景重建中的失真和边界问题，代码和模型将开源。 Abstract: Recently, reconstructing scenes from a single panoramic image using advanced 3D Gaussian Splatting (3DGS) techniques has attracted growing interest. Panoramic images offer a 360$\times$ 180 field of view (FoV), capturing the entire scene in a single shot. However, panoramic images introduce severe distortion, making it challenging to render 3D Gaussians into 2D distorted equirectangular space directly. Converting equirectangular images to cubemap projections partially alleviates this problem but introduces new challenges, such as projection distortion and discontinuities across cube-face boundaries. To address these limitations, we present a novel framework, named TPGS, to bridge continuous panoramic 3D scene reconstruction with perspective Gaussian splatting. Firstly, we introduce a Transition Plane between adjacent cube faces to enable smoother transitions in splatting directions and mitigate optimization ambiguity in the boundary region. Moreover, an intra-to-inter face optimization strategy is proposed to enhance local details and restore visual consistency across cube-face boundaries. Specifically, we optimize 3D Gaussians within individual cube faces and then fine-tune them in the stitched panoramic space. Additionally, we introduce a spherical sampling technique to eliminate visible stitching seams. Extensive experiments on indoor and outdoor, egocentric, and roaming benchmark datasets demonstrate that our approach outperforms existing state-of-the-art methods. Code and models will be available at https://github.com/zhijieshen-bjtu/TPGS.

[193] Hyperlocal disaster damage assessment using bi-temporal street-view imagery and pre-trained vision models

Yifan Yang,Lei Zou,Bing Zhou,Daoyang Li,Binbin Lin,Joynal Abedin,Mingzheng Yang

Main category: cs.CV

TLDR: 研究利用双时态街景图像和预训练视觉模型，通过结合灾前图像和双通道算法，显著提升了灾害损害评估的准确性。

Details

Motivation: 现有研究主要关注灾后街景图像，而忽略了灾前图像的潜力。灾前图像可作为基准，提高损害评估的可靠性和模型性能。 Method: 收集2024年飓风前后的街景图像，使用Swin Transformer和ConvNeXt等预训练模型进行微调，并设计双通道算法进行损害评估。 Result: 结合灾前图像和双通道框架后，损害评估准确率从66.14%提升至77.11%。 Conclusion: 该方法能快速实现高精度损害评估，为灾害管理和韧性规划提供支持。 Abstract: Street-view images offer unique advantages for disaster damage estimation as they capture impacts from a visual perspective and provide detailed, on-the-ground insights. Despite several investigations attempting to analyze street-view images for damage estimation, they mainly focus on post-disaster images. The potential of time-series street-view images remains underexplored. Pre-disaster images provide valuable benchmarks for accurate damage estimations at building and street levels. These images could aid annotators in objectively labeling post-disaster impacts, improving the reliability of labeled data sets for model training, and potentially enhancing the model performance in damage evaluation. The goal of this study is to estimate hyperlocal, on-the-ground disaster damages using bi-temporal street-view images and advanced pre-trained vision models. Street-view images before and after 2024 Hurricane Milton in Horseshoe Beach, Florida, were collected for experiments. The objectives are: (1) to assess the performance gains of incorporating pre-disaster street-view images as a no-damage category in fine-tuning pre-trained models, including Swin Transformer and ConvNeXt, for damage level classification; (2) to design and evaluate a dual-channel algorithm that reads pair-wise pre- and post-disaster street-view images for hyperlocal damage assessment. The results indicate that incorporating pre-disaster street-view images and employing a dual-channel processing framework can significantly enhance damage assessment accuracy. The accuracy improves from 66.14% with the Swin Transformer baseline to 77.11% with the dual-channel Feature-Fusion ConvNeXt model. This research enables rapid, operational damage assessments at hyperlocal spatial resolutions, providing valuable insights to support effective decision-making in disaster management and resilience planning.

[194] UniFlowRestore: A General Video Restoration Framework via Flow Matching and Prompt Guidance

Shuning Sun,Yu Zhang,Chen Wu,Dianjie Lu,Dianjie Lu,Guijuan Zhan,Yang Weng,Zhuoran Zheng

Main category: cs.CV

TLDR: UniFlowRestore提出了一种通用的视频修复框架，通过物理感知的向量场和提示引导实现多任务统一修复，性能优于传统方法。

Details

Motivation: 传统视频修复方法泛化性差且计算成本高，无法应对现实场景中的多样化退化问题。 Method: 将修复建模为时间连续的演化过程，结合物理感知的向量场和提示生成器，构建哈密顿系统并通过ODE求解器优化。 Result: 在视频去噪任务中取得PSNR 33.89 dB和SSIM 0.97的优异性能，并在所有评估任务中表现最佳或次佳。 Conclusion: UniFlowRestore在泛化性和效率上均优于传统方法，为视频修复提供了统一的解决方案。 Abstract: Video imaging is often affected by complex degradations such as blur, noise, and compression artifacts. Traditional restoration methods follow a "single-task single-model" paradigm, resulting in poor generalization and high computational cost, limiting their applicability in real-world scenarios with diverse degradation types. We propose UniFlowRestore, a general video restoration framework that models restoration as a time-continuous evolution under a prompt-guided and physics-informed vector field. A physics-aware backbone PhysicsUNet encodes degradation priors as potential energy, while PromptGenerator produces task-relevant prompts as momentum. These components define a Hamiltonian system whose vector field integrates inertial dynamics, decaying physical gradients, and prompt-based guidance. The system is optimized via a fixed-step ODE solver to achieve efficient and unified restoration across tasks. Experiments show that UniFlowRestore delivers stateof-the-art performance with strong generalization and efficiency. Quantitative results demonstrate that UniFlowRestore achieves state-of-the-art performance, attaining the highest PSNR (33.89 dB) and SSIM (0.97) on the video denoising task, while maintaining top or second-best scores across all evaluated tasks.

[195] Exploring Synergistic Ensemble Learning: Uniting CNNs, MLP-Mixers, and Vision Transformers to Enhance Image Classification

Mk Bashar,Ocean Monjur,Samia Islam,Mohammad Galib Shams,Niamul Quader

Main category: cs.CV

TLDR: 该论文提出了一种通过集成技术结合不同神经网络架构（如CNN、MLP-mixer和Vision Transformer）的方法，以提升图像分类性能，并展示了其优于单一架构或同类架构集成的效果。

Details

Motivation: 研究不同神经网络架构的互补性，并通过系统化的集成方法探索其优势，而非简单合并模块。 Method: 采用集成技术结合不同架构（CNN、MLP-mixer、Vision Transformer），保持各架构的完整性，而非启发式合并模块。 Result: 提出的集成方法在ImageNet上超越了现有最佳单一分类网络的准确率，同时降低了延迟。 Conclusion: 通过集成不同架构，可以更系统地理解其互补性，并为未来研究提供了基础框架。 Abstract: In recent years, Convolutional Neural Networks (CNNs), MLP-mixers, and Vision Transformers have risen to prominence as leading neural architectures in image classification. Prior research has underscored the distinct advantages of each architecture, and there is growing evidence that combining modules from different architectures can boost performance. In this study, we build upon and improve previous work exploring the complementarity between different architectures. Instead of heuristically merging modules from various architectures through trial and error, we preserve the integrity of each architecture and combine them using ensemble techniques. By maintaining the distinctiveness of each architecture, we aim to explore their inherent complementarity more deeply and with implicit isolation. This approach provides a more systematic understanding of their individual strengths. In addition to uncovering insights into architectural complementarity, we showcase the effectiveness of even basic ensemble methods that combine models from diverse architectures. These methods outperform ensembles comprised of similar architectures. Our straightforward ensemble framework serves as a foundational strategy for blending complementary architectures, offering a solid starting point for further investigations into the unique strengths and synergies among different architectures and their ensembles in image classification. A direct outcome of this work is the creation of an ensemble of classification networks that surpasses the accuracy of the previous state-of-the-art single classification network on ImageNet, setting a new benchmark, all while requiring less overall latency.

[196] A Visual Self-attention Mechanism Facial Expression Recognition Network beyond Convnext

Bingyu Nan,Feng Liu,Xuezhong Qian,Wei Song

Main category: cs.CV

TLDR: 本文提出了一种基于截断ConvNeXt的视觉面部表情信号特征处理网络（Conv-cut），用于提升在挑战性条件下的面部表情识别（FER）准确性。

Details

Motivation: 面部表情识别是人工智能领域的重要研究方向，但数据分布不均、不同类别表情相似以及同一类别内不同受试者的差异仍是挑战。 Method: 网络采用截断的ConvNeXt-Base作为特征提取器，设计了细节提取块（Detail Extraction Block）提取细节特征，并引入自注意力机制（Self-Attention）以更有效地学习特征。 Result: 在RAF-DB和FERPlus数据集上的实验表明，该模型达到了最先进的性能。 Conclusion: Conv-cut方法在面部表情识别任务中表现出色，代码已开源。 Abstract: Facial expression recognition is an important research direction in the field of artificial intelligence. Although new breakthroughs have been made in recent years, the uneven distribution of datasets and the similarity between different categories of facial expressions, as well as the differences within the same category among different subjects, remain challenges. This paper proposes a visual facial expression signal feature processing network based on truncated ConvNeXt approach(Conv-cut), to improve the accuracy of FER under challenging conditions. The network uses a truncated ConvNeXt-Base as the feature extractor, and then we designed a Detail Extraction Block to extract detailed features, and introduced a Self-Attention mechanism to enable the network to learn the extracted features more effectively. To evaluate the proposed Conv-cut approach, we conducted experiments on the RAF-DB and FERPlus datasets, and the results show that our model has achieved state-of-the-art performance. Our code could be accessed at Github.

[197] Using Vision Language Models for Safety Hazard Identification in Construction

Muhammad Adil,Gaang Lee,Vicente A. Gonzalez,Qipei Mei

Main category: cs.CV

TLDR: 论文提出了一种基于视觉语言模型（VLM）的框架，用于建筑工地安全隐患识别，解决了现有方法在上下文特定风险识别和适应性上的不足。

Details

Motivation: 现有计算机视觉方法难以识别上下文特定风险且适应性有限，导致安全隐患评估能力不足。 Method: 提出并验证了一种VLM框架，结合提示工程模块将安全指南转化为上下文查询，评估了多种VLM模型。 Result: GPT-4o和Gemini 1.5 Pro表现最佳，BERTScore分别达0.906和0.888，但处理时间仍是挑战。 Conclusion: VLM框架为建筑工地安全隐患检测提供了实用见解，有助于提升主动安全管理。 Abstract: Safety hazard identification and prevention are the key elements of proactive safety management. Previous research has extensively explored the applications of computer vision to automatically identify hazards from image clips collected from construction sites. However, these methods struggle to identify context-specific hazards, as they focus on detecting predefined individual entities without understanding their spatial relationships and interactions. Furthermore, their limited adaptability to varying construction site guidelines and conditions hinders their generalization across different projects. These limitations reduce their ability to assess hazards in complex construction environments and adaptability to unseen risks, leading to potential safety gaps. To address these challenges, we proposed and experimentally validated a Vision Language Model (VLM)-based framework for the identification of construction hazards. The framework incorporates a prompt engineering module that structures safety guidelines into contextual queries, allowing VLM to process visual information and generate hazard assessments aligned with the regulation guide. Within this framework, we evaluated state-of-the-art VLMs, including GPT-4o, Gemini, Llama 3.2, and InternVL2, using a custom dataset of 1100 construction site images. Experimental results show that GPT-4o and Gemini 1.5 Pro outperformed alternatives and displayed promising BERTScore of 0.906 and 0.888 respectively, highlighting their ability to identify both general and context-specific hazards. However, processing times remain a significant challenge, impacting real-time feasibility. These findings offer insights into the practical deployment of VLMs for construction site hazard detection, thereby contributing to the enhancement of proactive safety management.

[198] RICCARDO: Radar Hit Prediction and Convolution for Camera-Radar 3D Object Detection

Yunfei Long,Abhinav Kumar,Xiaoming Liu,Daniel Morris

Main category: cs.CV

TLDR: 论文提出了一种显式利用雷达命中分布模型的方法，通过预测雷达命中分布并结合实际测量点，改进了雷达-相机融合检测性能。

Details

Motivation: 当前雷达-相机融合方法通过黑盒神经网络隐式处理雷达命中分布，缺乏显式建模。本文旨在通过显式建模雷达命中分布来改进融合效果。 Method: 1. 构建模型预测雷达命中分布；2. 使用预测分布作为核匹配实际雷达点；3. 结合上下文信息优化匹配分数。 Result: 在nuScenes数据集上实现了最先进的雷达-相机检测性能。 Conclusion: 显式建模雷达命中分布能有效提升雷达-相机融合检测性能。 Abstract: Radar hits reflect from points on both the boundary and internal to object outlines. This results in a complex distribution of radar hits that depends on factors including object category, size, and orientation. Current radar-camera fusion methods implicitly account for this with a black-box neural network. In this paper, we explicitly utilize a radar hit distribution model to assist fusion. First, we build a model to predict radar hit distributions conditioned on object properties obtained from a monocular detector. Second, we use the predicted distribution as a kernel to match actual measured radar points in the neighborhood of the monocular detections, generating matching scores at nearby positions. Finally, a fusion stage combines context with the kernel detector to refine the matching scores. Our method achieves the state-of-the-art radar-camera detection performance on nuScenes. Our source code is available at https://github.com/longyunf/riccardo.

[199] BIGS: Bimanual Category-agnostic Interaction Reconstruction from Monocular Videos via 3D Gaussian Splatting

Jeongwan On,Kyeonghwan Gwak,Gunyoung Kang,Junuk Cha,Soohyun Hwang,Hyein Hwang,Seungryul Baek

Main category: cs.CV

TLDR: 提出BIGS方法，从单目RGB视频重建双手与未知物体的3D高斯模型，利用扩散模型和手部先验知识解决遮挡问题，在多个指标上达到最优。

Details

Motivation: 现有方法无法全面处理双手与未知物体的交互重建问题，尤其是复杂遮挡情况。 Method: 结合预训练扩散模型（SDS损失）和手部3D先验（MANO模型），通过高斯优化和交互对齐步骤重建3D高斯模型。 Result: 在两个数据集上，3D手部姿态估计、物体重建和渲染质量均达到最优。 Conclusion: BIGS方法有效解决了双手与未知物体交互的3D重建问题，具有广泛应用潜力。 Abstract: Reconstructing 3Ds of hand-object interaction (HOI) is a fundamental problem that can find numerous applications. Despite recent advances, there is no comprehensive pipeline yet for bimanual class-agnostic interaction reconstruction from a monocular RGB video, where two hands and an unknown object are interacting with each other. Previous works tackled the limited hand-object interaction case, where object templates are pre-known or only one hand is involved in the interaction. The bimanual interaction reconstruction exhibits severe occlusions introduced by complex interactions between two hands and an object. To solve this, we first introduce BIGS (Bimanual Interaction 3D Gaussian Splatting), a method that reconstructs 3D Gaussians of hands and an unknown object from a monocular video. To robustly obtain object Gaussians avoiding severe occlusions, we leverage prior knowledge of pre-trained diffusion model with score distillation sampling (SDS) loss, to reconstruct unseen object parts. For hand Gaussians, we exploit the 3D priors of hand model (i.e., MANO) and share a single Gaussian for two hands to effectively accumulate hand 3D information, given limited views. To further consider the 3D alignment between hands and objects, we include the interacting-subjects optimization step during Gaussian optimization. Our method achieves the state-of-the-art accuracy on two challenging datasets, in terms of 3D hand pose estimation (MPJPE), 3D object reconstruction (CDh, CDo, F10), and rendering quality (PSNR, SSIM, LPIPS), respectively.

Yonghao Huang,Leiting Chen,Chuan Zhou

Main category: cs.CV

TLDR: 提出了一种基于多尺度交叉注意力和移位窗口自注意力的多模态多视角眼底图像融合方法，用于全面挖掘病变特征并提高视网膜病变诊断的效率和准确性。

Details

Motivation: 多模态和多视角眼底图像的联合解读对预防视网膜病变至关重要，但现有方法存在静态感受野问题和计算复杂度高的限制。 Method: 设计了基于多尺度交叉注意力的多模态融合方法和基于移位窗口自注意力的多视角融合方法，并结合为多任务诊断框架。 Result: 在视网膜病变分类和报告生成任务中，分类准确率达到82.53%，报告生成的BLEU-1得分为0.543。 Conclusion: 该方法能有效提升临床实践中视网膜病变诊断的效率和可靠性。 Abstract: The joint interpretation of multi-modal and multi-view fundus images is critical for retinopathy prevention, as different views can show the complete 3D eyeball field and different modalities can provide complementary lesion areas. Compared with single images, the sequence relationships in multi-modal and multi-view fundus images contain long-range dependencies in lesion features. By modeling the long-range dependencies in these sequences, lesion areas can be more comprehensively mined, and modality-specific lesions can be detected. To learn the long-range dependency relationship and fuse complementary multi-scale lesion features between different fundus modalities, we design a multi-modal fundus image fusion method based on multi-scale cross-attention, which solves the static receptive field problem in previous multi-modal medical fusion methods based on attention. To capture multi-view relative positional relationships between different views and fuse comprehensive lesion features between different views, we design a multi-view fundus image fusion method based on shifted window self-attention, which also solves the computational complexity of the multi-view fundus fusion method based on self-attention is quadratic to the size and number of multi-view fundus images. Finally, we design a multi-task retinopathy diagnosis framework to help ophthalmologists reduce workload and improve diagnostic accuracy by combining the proposed two fusion methods. The experimental results of retinopathy classification and report generation tasks indicate our method's potential to improve the efficiency and reliability of retinopathy diagnosis in clinical practice, achieving a classification accuracy of 82.53\% and a report generation BlEU-1 of 0.543.

[201] Probability Distribution Alignment and Low-Rank Weight Decomposition for Source-Free Domain Adaptive Brain Decoding

Ganxi Xu,Jinyi Long,Hanrui Wu,Jia Zhang

Main category: cs.CV

TLDR: 提出了一种基于无源域适应的大脑解码框架，解决个体差异、模态对齐和高维嵌入问题。

Details

Motivation: 当前大脑解码面临个体差异、模态对齐和高维嵌入的挑战，现有方法存在隐私泄露、数据存储负担重、模态对齐不充分以及计算成本高的问题。 Method: 采用无源域适应框架，避免使用源主体数据，同时优化模态对齐和高维嵌入。 Result: 框架解决了隐私和数据存储问题，改进了模态对齐，并降低了计算成本。 Conclusion: 该框架为大脑解码提供了一种高效且隐私保护的解决方案。 Abstract: Brain decoding currently faces significant challenges in individual differences, modality alignment, and high-dimensional embeddings. To address individual differences, researchers often use source subject data, which leads to issues such as privacy leakage and heavy data storage burdens. In modality alignment, current works focus on aligning the softmax probability distribution but neglect the alignment of marginal probability distributions, resulting in modality misalignment. Additionally, images and text are aligned separately with fMRI without considering the complex interplay between images and text, leading to poor image reconstruction. Finally, the enormous dimensionality of CLIP embeddings causes significant computational costs. Although the dimensionality of CLIP embeddings can be reduced by ignoring the number of patches obtained from images and the number of tokens acquired from text, this comes at the cost of a significant drop in model performance, creating a dilemma. To overcome these limitations, we propose a source-free domain adaptation-based brain decoding framework

[202] A Constrained Optimization Approach for Gaussian Splatting from Coarsely-posed Images and Noisy Lidar Point Clouds

Jizong Peng,Tze Ho Elden Tse,Kai Xu,Wenchao Gao,Angela Yao

Main category: cs.CV

TLDR: 提出了一种无需SfM支持的3D高斯泼溅（3DGS）初始化方法，通过约束优化同时估计相机姿态和3D重建。

Details

Motivation: 传统3DGS依赖SfM算法初始化，耗时且限制应用范围，需改进以支持实际场景和大规模重建。 Method: 分解相机姿态为两步优化，引入参数敏感性和搜索空间约束，并利用几何约束提升重建质量。 Result: 实验显示，该方法显著优于现有3DGS基线及COLMAP辅助方法。 Conclusion: 提出的方法高效且适用于实际场景，为3DGS提供了更灵活的初始化方案。 Abstract: 3D Gaussian Splatting (3DGS) is a powerful reconstruction technique, but it needs to be initialized from accurate camera poses and high-fidelity point clouds. Typically, the initialization is taken from Structure-from-Motion (SfM) algorithms; however, SfM is time-consuming and restricts the application of 3DGS in real-world scenarios and large-scale scene reconstruction. We introduce a constrained optimization method for simultaneous camera pose estimation and 3D reconstruction that does not require SfM support. Core to our approach is decomposing a camera pose into a sequence of camera-to-(device-)center and (device-)center-to-world optimizations. To facilitate, we propose two optimization constraints conditioned to the sensitivity of each parameter group and restricts each parameter's search space. In addition, as we learn the scene geometry directly from the noisy point clouds, we propose geometric constraints to improve the reconstruction quality. Experiments demonstrate that the proposed method significantly outperforms the existing (multi-modal) 3DGS baseline and methods supplemented by COLMAP on both our collected dataset and two public benchmarks.

[203] MASH: Masked Anchored SpHerical Distances for 3D Shape Representation and Generation

Changhao Li,Yu Xin,Xiaowei Zhou,Ariel Shamir,Hao Zhang,Ligang Liu,Ruizhen Hu

Main category: cs.CV

TLDR: MASH是一种新颖的多视图参数化3D形状表示方法，通过局部表面块和球面距离函数实现高效形状理解与重建。

Details

Motivation: 受多视图几何启发，MASH旨在通过感知形状理解提升3D形状学习的效果。 Method: MASH将3D形状表示为局部表面块的集合，每个块由球面距离函数定义，并利用球谐函数编码和参数化视图锥实现局部性。 Result: 实验表明，MASH在表面重建、形状生成、补全和混合等任务中表现优异。 Conclusion: MASH结合隐式和显式特征，为3D形状处理提供了多功能且高效的表示方法。 Abstract: We introduce Masked Anchored SpHerical Distances (MASH), a novel multi-view and parametrized representation of 3D shapes. Inspired by multi-view geometry and motivated by the importance of perceptual shape understanding for learning 3D shapes, MASH represents a 3D shape as a collection of observable local surface patches, each defined by a spherical distance function emanating from an anchor point. We further leverage the compactness of spherical harmonics to encode the MASH functions, combined with a generalized view cone with a parameterized base that masks the spatial extent of the spherical function to attain locality. We develop a differentiable optimization algorithm capable of converting any point cloud into a MASH representation accurately approximating ground-truth surfaces with arbitrary geometry and topology. Extensive experiments demonstrate that MASH is versatile for multiple applications including surface reconstruction, shape generation, completion, and blending, achieving superior performance thanks to its unique representation encompassing both implicit and explicit features.

[204] Evolved Hierarchical Masking for Self-Supervised Learning

Zhanzhou Feng,Shiliang Zhang

Main category: cs.CV

TLDR: 提出了一种基于层次化掩码的自监督学习方法，通过动态调整掩码模式提升视觉线索建模能力。

Details

Motivation: 现有掩码图像建模方法使用固定掩码模式，限制了视觉线索建模能力，因此需要一种动态调整掩码的方法。 Method: 利用训练中的视觉模型解析输入视觉线索为层次结构，动态生成掩码，从低层纹理逐步过渡到高层语义。 Result: 在七项下游任务中表现优异，如图像分类（ImageNet-1K提升1.1%）和语义分割（ADE20K提升1.4%）。 Conclusion: 该方法无需额外预训练模型或标注，动态调整训练难度，显著提升任务性能，填补了语义任务与低层特征任务之间的差距。 Abstract: Existing Masked Image Modeling methods apply fixed mask patterns to guide the self-supervised training. As those mask patterns resort to different criteria to depict image contents, sticking to a fixed pattern leads to a limited vision cues modeling capability.This paper introduces an evolved hierarchical masking method to pursue general visual cues modeling in self-supervised learning. The proposed method leverages the vision model being trained to parse the input visual cues into a hierarchy structure, which is hence adopted to generate masks accordingly. The accuracy of hierarchy is on par with the capability of the model being trained, leading to evolved mask patterns at different training stages. Initially, generated masks focus on low-level visual cues to grasp basic textures, then gradually evolve to depict higher-level cues to reinforce the learning of more complicated object semantics and contexts. Our method does not require extra pre-trained models or annotations and ensures training efficiency by evolving the training difficulty. We conduct extensive experiments on seven downstream tasks including partial-duplicate image retrieval relying on low-level details, as well as image classification and semantic segmentation that require semantic parsing capability. Experimental results demonstrate that it substantially boosts performance across these tasks. For instance, it surpasses the recent MAE by 1.1\% in imageNet-1K classification and 1.4\% in ADE20K segmentation with the same training epochs. We also align the proposed method with the current research focus on LLMs. The proposed approach bridges the gap with large-scale pre-training on semantic demanding tasks and enhances intricate detail perception in tasks requiring low-level feature recognition.

[205] LEREL: Lipschitz Continuity-Constrained Emotion Recognition Ensemble Learning For Electroencephalography

Shengyu Gong,Yueyang Li,Zijian Kang,Weiming Zeng,Hongjie Yan,Wai Ting Siok,Nizhuan Wang

Main category: cs.CV

TLDR: 论文提出了一种名为LEREL的新框架，通过Lipschitz连续性约束和集成学习策略，显著提升了基于EEG的情绪识别的准确性和鲁棒性。

Details

Motivation: 当前基于EEG的情绪识别方法存在模型稳定性不足、处理高维非线性信号精度有限以及对噪声和个体差异鲁棒性差的问题。 Method: LEREL框架结合Lipschitz连续性约束增强模型稳定性和泛化能力，并通过集成学习策略减少单模型偏差和方差。 Result: 在三个公开数据集（EAV、FACED和SEED）上的实验表明，LEREL的平均识别准确率分别为76.43%、83.00%和89.22%。 Conclusion: LEREL有效解决了现有方法的局限性，显著提升了情绪识别的性能。 Abstract: Accurate and efficient perception of emotional states in oneself and others is crucial, as emotion-related disorders are associated with severe psychosocial impairments. While electroencephalography (EEG) offers a powerful tool for emotion detection, current EEG-based emotion recognition (EER) methods face key limitations: insufficient model stability, limited accuracy in processing high-dimensional nonlinear EEG signals, and poor robustness against intra-subject variability and signal noise. To address these challenges, we propose LEREL (Lipschitz continuity-constrained Emotion Recognition Ensemble Learning), a novel framework that significantly enhances both the accuracy and robustness of emotion recognition performance. The LEREL framework employs Lipschitz continuity constraints to enhance model stability and generalization in EEG emotion recognition, reducing signal variability and noise susceptibility while maintaining strong performance on small-sample datasets. The ensemble learning strategy reduces single-model bias and variance through multi-classifier decision fusion, further optimizing overall performance. Experimental results on three public benchmark datasets (EAV, FACED and SEED) demonstrate LEREL's effectiveness, achieving average recognition accuracies of 76.43%, 83.00% and 89.22%, respectively.

[206] SCFlow2: Plug-and-Play Object Pose Refiner with Shape-Constraint Scene Flow

Qingyuan Wang,Rui Song,Jiaojiao Li,Kerui Cheng,David Ferstl,Yinlin Hu

Main category: cs.CV

TLDR: SCFlow2是一个即插即用的6D物体姿态估计细化框架，通过引入几何约束和3D场景流，显著提升现有方法的精度，无需重新训练。

Details

Motivation: 现有6D物体姿态细化方法存在对应噪声问题或需重新训练新物体，SCFlow2旨在解决这些问题。 Method: 基于SCFlow模型，通过3D场景流将深度作为迭代正则化，结合刚性运动嵌入和3D形状先验训练循环匹配网络。 Result: 在BOP数据集上评估，SCFlow2显著提升了现有方法的精度，无需重新训练。 Conclusion: SCFlow2是一种高效且通用的细化框架，适用于多种6D姿态估计方法。 Abstract: We introduce SCFlow2, a plug-and-play refinement framework for 6D object pose estimation. Most recent 6D object pose methods rely on refinement to get accurate results. However, most existing refinement methods either suffer from noises in establishing correspondences, or rely on retraining for novel objects. SCFlow2 is based on the SCFlow model designed for refinement with shape constraint, but formulates the additional depth as a regularization in the iteration via 3D scene flow for RGBD frames. The key design of SCFlow2 is an introduction of geometry constraints into the training of recurrent matching network, by combining the rigid-motion embeddings in 3D scene flow and 3D shape prior of the target. We train SCFlow2 on a combination of dataset Objaverse, GSO and ShapeNet, and evaluate on BOP datasets with novel objects. After using our method as a post-processing, most state-of-the-art methods produce significantly better results, without any retraining or fine-tuning. The source code is available at https://scflow2.github.io.

[207] ReferGPT: Towards Zero-Shot Referring Multi-Object Tracking

Tzoulio Chamiti,Leandro Di Bella,Adrian Munteanu,Nikos Deligiannis

Main category: cs.CV

TLDR: ReferGPT是一种零样本的多目标跟踪框架，利用多模态大语言模型（MLLM）生成3D感知的文本描述，无需训练即可支持灵活的查询匹配。

Details

Motivation: 现有方法需要监督训练且难以泛化到开放集查询，因此提出一种无需训练的零样本框架。 Method: 结合MLLM生成3D感知文本描述，并采用基于CLIP的语义编码和模糊匹配策略进行查询匹配。 Result: 在多个数据集上表现优异，与有监督方法竞争，展示了其鲁棒性和零样本能力。 Conclusion: ReferGPT为多目标跟踪提供了一种无需训练且灵活的解决方案，适用于自动驾驶场景。 Abstract: Tracking multiple objects based on textual queries is a challenging task that requires linking language understanding with object association across frames. Previous works typically train the whole process end-to-end or integrate an additional referring text module into a multi-object tracker, but they both require supervised training and potentially struggle with generalization to open-set queries. In this work, we introduce ReferGPT, a novel zero-shot referring multi-object tracking framework. We provide a multi-modal large language model (MLLM) with spatial knowledge enabling it to generate 3D-aware captions. This enhances its descriptive capabilities and supports a more flexible referring vocabulary without training. We also propose a robust query-matching strategy, leveraging CLIP-based semantic encoding and fuzzy matching to associate MLLM generated captions with user queries. Extensive experiments on Refer-KITTI, Refer-KITTIv2 and Refer-KITTI+ demonstrate that ReferGPT achieves competitive performance against trained methods, showcasing its robustness and zero-shot capabilities in autonomous driving. The codes are available on https://github.com/Tzoulio/ReferGPT

[208] RT-DATR:Real-time Unsupervised Domain Adaptive Detection Transformer with Adversarial Feature Learning

Feng Lv,Chunlong Xia,Shuo Wang,Huo Cao

Main category: cs.CV

TLDR: 本文提出了RT-DATR，一种基于实时检测变换器（RT-DETR）的简单高效域自适应检测方法，通过局部对象级特征对齐和场景语义特征对齐模块提升跨域检测性能。

Details

Motivation: 现有基于CNN和变换器的域自适应目标检测器在跨域检测任务中取得进展，但实时变换器检测器的域适应尚未被探索，直接应用现有算法效果不佳。 Method: 基于RT-DETR，引入局部对象级特征对齐模块和场景语义特征对齐模块，并设计域查询以解耦对象查询，进一步对齐实例特征分布。 Result: 在多个基准测试中，RT-DATR优于当前最先进方法。 Conclusion: RT-DATR通过特征对齐和解耦设计，显著提升了实时变换器检测器的跨域检测性能。 Abstract: Despite domain-adaptive object detectors based on CNN and transformers have made significant progress in cross-domain detection tasks, it is regrettable that domain adaptation for real-time transformer-based detectors has not yet been explored. Directly applying existing domain adaptation algorithms has proven to be suboptimal. In this paper, we propose RT-DATR, a simple and efficient real-time domain adaptive detection transformer. Building on RT-DETR as our base detector, we first introduce a local object-level feature alignment module to significantly enhance the feature representation of domain invariance during object transfer. Additionally, we introduce a scene semantic feature alignment module designed to boost cross-domain detection performance by aligning scene semantic features. Finally, we introduced a domain query and decoupled it from the object query to further align the instance feature distribution within the decoder layer, reduce the domain gap, and maintain discriminative ability. Experimental results on various benchmarks demonstrate that our method outperforms current state-of-the-art approaches. Our code will be released soon.

[209] From Visual Explanations to Counterfactual Explanations with Latent Diffusion

Tung Luu,Nam Le,Duc Le,Bac Le

Main category: cs.CV

TLDR: 本文提出了一种新方法，用于生成视觉反事实解释，解决现有方法中的两个关键挑战：确定区分目标类与原类的关键特征，以及为非鲁棒分类器提供有价值的解释。

Details

Motivation: 现有方法在生成视觉反事实解释时，难以确定关键特征且依赖鲁棒模型。本文旨在解决这些问题，提供更有效的解释方法。 Method: 通过视觉解释算法识别关键修改区域，结合基于剪枝的对抗攻击和潜在扩散模型生成逼真的反事实解释。 Result: 在ImageNet和CelebA-HQ数据集上，该方法在多项评估标准上优于现有最优方法。 Conclusion: 该方法适用于任意分类器，展示了视觉与反事实解释的强关联，并生成语义有意义且细微的反事实图像。 Abstract: Visual counterfactual explanations are ideal hypothetical images that change the decision-making of the classifier with high confidence toward the desired class while remaining visually plausible and close to the initial image. In this paper, we propose a new approach to tackle two key challenges in recent prominent works: i) determining which specific counterfactual features are crucial for distinguishing the "concept" of the target class from the original class, and ii) supplying valuable explanations for the non-robust classifier without relying on the support of an adversarially robust model. Our method identifies the essential region for modification through algorithms that provide visual explanations, and then our framework generates realistic counterfactual explanations by combining adversarial attacks based on pruning the adversarial gradient of the target classifier and the latent diffusion model. The proposed method outperforms previous state-of-the-art results on various evaluation criteria on ImageNet and CelebA-HQ datasets. In general, our method can be applied to arbitrary classifiers, highlight the strong association between visual and counterfactual explanations, make semantically meaningful changes from the target classifier, and provide observers with subtle counterfactual images.

[210] AerOSeg: Harnessing SAM for Open-Vocabulary Segmentation in Remote Sensing Images

Saikat Dutta,Akhil Vasim,Siddhant Gole,Hamid Rezatofighi,Biplab Banerjee

Main category: cs.CV

TLDR: AerOSeg是一种针对遥感数据的开放词汇分割方法，通过多尺度旋转图像和领域特定提示提取特征，结合SAM模型优化空间和类别细化，显著提升了分割性能。

Details

Motivation: 遥感图像中常出现未见类别，传统分割模型泛化能力不足且依赖昂贵标注，需开发专门针对遥感的开放词汇分割方法。 Method: 提出AerOSeg，利用旋转图像和领域提示提取特征，结合SAM模型进行空间和语义优化，并通过多尺度解码器生成分割图。 Result: 在三个遥感数据集上超越现有方法，平均提升2.54 h-mIoU。 Conclusion: AerOSeg为遥感开放词汇分割提供了高效解决方案，显著提升性能。 Abstract: Image segmentation beyond predefined categories is a key challenge in remote sensing, where novel and unseen classes often emerge during inference. Open-vocabulary image Segmentation addresses these generalization issues in traditional supervised segmentation models while reducing reliance on extensive per-pixel annotations, which are both expensive and labor-intensive to obtain. Most Open-Vocabulary Segmentation (OVS) methods are designed for natural images but struggle with remote sensing data due to scale variations, orientation changes, and complex scene compositions. This necessitates the development of OVS approaches specifically tailored for remote sensing. In this context, we propose AerOSeg, a novel OVS approach for remote sensing data. First, we compute robust image-text correlation features using multiple rotated versions of the input image and domain-specific prompts. These features are then refined through spatial and class refinement blocks. Inspired by the success of the Segment Anything Model (SAM) in diverse domains, we leverage SAM features to guide the spatial refinement of correlation features. Additionally, we introduce a semantic back-projection module and loss to ensure the seamless propagation of SAM's semantic information throughout the segmentation pipeline. Finally, we enhance the refined correlation features using a multi-scale attention-aware decoder to produce the final segmentation map. We validate our SAM-guided Open-Vocabulary Remote Sensing Segmentation model on three benchmark remote sensing datasets: iSAID, DLRSD, and OpenEarthMap. Our model outperforms state-of-the-art open-vocabulary segmentation methods, achieving an average improvement of 2.54 h-mIoU.

Zhicheng Zhang,Hao Tang,Jinhui Tang

Main category: cs.CV

TLDR: 提出了一种名为MDCM的多尺度视觉Transformer框架，用于细粒度鸟类识别，通过多尺度线索激活、选择和聚合机制提升性能。

Details

Motivation: 现有ViT模型在细粒度鸟类识别中因感受野有限而表现受限，需增强多尺度能力。 Method: 提出MDCM框架，包含多尺度线索激活模块、令牌选择机制和动态聚合机制。 Result: MDCM在多个FGBR基准测试中优于CNN和ViT模型。 Conclusion: MDCM通过多尺度建模有效提升了细粒度鸟类识别的性能。 Abstract: Given the critical role of birds in ecosystems, Fine-Grained Bird Recognition (FGBR) has gained increasing attention, particularly in distinguishing birds within similar subcategories. Although Vision Transformer (ViT)-based methods often outperform Convolutional Neural Network (CNN)-based methods in FGBR, recent studies reveal that the limited receptive field of plain ViT model hinders representational richness and makes them vulnerable to scale variance. Thus, enhancing the multi-scale capabilities of existing ViT-based models to overcome this bottleneck in FGBR is a worthwhile pursuit. In this paper, we propose a novel framework for FGBR, namely Multi-scale Diverse Cues Modeling (MDCM), which explores diverse cues at different scales across various stages of a multi-scale Vision Transformer (MS-ViT) in an "Activation-Selection-Aggregation" paradigm. Specifically, we first propose a multi-scale cue activation module to ensure the discriminative cues learned at different stage are mutually different. Subsequently, a multi-scale token selection mechanism is proposed to remove redundant noise and highlight discriminative, scale-specific cues at each stage. Finally, the selected tokens from each stage are independently utilized for bird recognition, and the recognition results from multiple stages are adaptively fused through a multi-scale dynamic aggregation mechanism for final model decisions. Both qualitative and quantitative results demonstrate the effectiveness of our proposed MDCM, which outperforms CNN- and ViT-based models on several widely-used FGBR benchmarks.

[212] DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models

Wenjin Ke,Zhe Li,Dong Li,Lu Tian,Emad Barsoum

Main category: cs.CV

TLDR: 论文提出了一种名为DL-QAT的方法，结合了QAT的优势，仅训练不到1%的参数，显著提升了低比特量化下LLM的推理效率。

Details

Motivation: 解决Post-training Quantization在低比特量化时下游任务表现不佳的问题，同时避免Quantization-aware Training的高计算资源需求。 Method: 引入Weight-Decomposed Low-Rank Quantization-Aware Training (DL-QAT)，通过分组量化幅度调整和LoRA矩阵更新权重。 Result: 在LLaMA和LLaMA2模型上验证，3-bit量化下性能提升4.2%，且优于现有QAT方法。 Conclusion: DL-QAT在性能和效率上均优于现有方法，为低比特量化提供了有效解决方案。 Abstract: Improving the efficiency of inference in Large Language Models (LLMs) is a critical area of research. Post-training Quantization (PTQ) is a popular technique, but it often faces challenges at low-bit levels, particularly in downstream tasks. Quantization-aware Training (QAT) can alleviate this problem, but it requires significantly more computational resources. To tackle this, we introduced Weight-Decomposed Low-Rank Quantization-Aware Training (DL-QAT), which merges the advantages of QAT while training only less than 1% of the total parameters. Specifically, we introduce a group-specific quantization magnitude to adjust the overall scale of each quantization group. Within each quantization group, we use LoRA matrices to update the weight size and direction in the quantization space. We validated the effectiveness of our method on the LLaMA and LLaMA2 model families. The results show significant improvements over our baseline method across different quantization granularities. For instance, for LLaMA-7B, our approach outperforms the previous state-of-the-art method by 4.2% in MMLU on 3-bit LLaMA-7B model. Additionally, our quantization results on pre-trained models also surpass previous QAT methods, demonstrating the superior performance and efficiency of our approach.

[213] Learning Occlusion-Robust Vision Transformers for Real-Time UAV Tracking

You Wu,Xucheng Wang,Xiangyang Yang,Mengyuan Liu,Dan Zeng,Hengzhou Ye,Shuiwang Li

Main category: cs.CV

TLDR: 论文提出了一种基于ViT的单流架构ORTrack，通过随机掩码模拟遮挡学习鲁棒特征表示，并引入AFKD方法实现高效实时跟踪。

Details

Motivation: 现有单流ViT模型在无人机跟踪中缺乏有效处理遮挡的策略，需提升其遮挡鲁棒性。 Method: 提出ORTrack框架，通过空间Cox过程随机掩码学习遮挡鲁棒特征；采用AFKD方法蒸馏出高效学生模型ORTrack-D。 Result: 在多个基准测试中验证了方法的有效性，性能达到SOTA。 Conclusion: ORTrack和ORTrack-D在遮挡鲁棒性和实时性上表现优异，适用于无人机跟踪。 Abstract: Single-stream architectures using Vision Transformer (ViT) backbones show great potential for real-time UAV tracking recently. However, frequent occlusions from obstacles like buildings and trees expose a major drawback: these models often lack strategies to handle occlusions effectively. New methods are needed to enhance the occlusion resilience of single-stream ViT models in aerial tracking. In this work, we propose to learn Occlusion-Robust Representations (ORR) based on ViTs for UAV tracking by enforcing an invariance of the feature representation of a target with respect to random masking operations modeled by a spatial Cox process. Hopefully, this random masking approximately simulates target occlusions, thereby enabling us to learn ViTs that are robust to target occlusion for UAV tracking. This framework is termed ORTrack. Additionally, to facilitate real-time applications, we propose an Adaptive Feature-Based Knowledge Distillation (AFKD) method to create a more compact tracker, which adaptively mimics the behavior of the teacher model ORTrack according to the task's difficulty. This student model, dubbed ORTrack-D, retains much of ORTrack's performance while offering higher efficiency. Extensive experiments on multiple benchmarks validate the effectiveness of our method, demonstrating its state-of-the-art performance. Codes is available at https://github.com/wuyou3474/ORTrack.

[214] NoTeS-Bank: Benchmarking Neural Transcription and Search for Scientific Notes Understanding

Aniket Pal,Sanket Biswas,Alloy Das,Ayush Lodh,Priyanka Banerjee,Soumitri Chattopadhyay,Dimosthenis Karatzas,Josep Llados,C. V. Jawahar

Main category: cs.CV

TLDR: NoTeS-Bank是一个用于评估手写笔记问答的基准测试，专注于处理非结构化和多模态内容，提出了两种任务并测试了现有模型的性能。

Details

Motivation: 解决现有视觉问答基准在真实手写笔记（如数学公式、图表等）上的局限性，推动文档AI的发展。 Method: 引入NoTeS-Bank基准，包含跨领域复杂笔记，定义两种任务：基于证据的VQA和开放域VQA，测试模型的多模态推理能力。 Result: 通过NDCG@5、MRR等指标评估现有模型，揭示了其在结构化转录和推理上的不足。 Conclusion: NoTeS-Bank为视觉文档理解和推理设立了新标准，推动了多模态处理技术的发展。 Abstract: Understanding and reasoning over academic handwritten notes remains a challenge in document AI, particularly for mathematical equations, diagrams, and scientific notations. Existing visual question answering (VQA) benchmarks focus on printed or structured handwritten text, limiting generalization to real-world note-taking. To address this, we introduce NoTeS-Bank, an evaluation benchmark for Neural Transcription and Search in note-based question answering. NoTeS-Bank comprises complex notes across multiple domains, requiring models to process unstructured and multimodal content. The benchmark defines two tasks: (1) Evidence-Based VQA, where models retrieve localized answers with bounding-box evidence, and (2) Open-Domain VQA, where models classify the domain before retrieving relevant documents and answers. Unlike classical Document VQA datasets relying on optical character recognition (OCR) and structured data, NoTeS-BANK demands vision-language fusion, retrieval, and multimodal reasoning. We benchmark state-of-the-art Vision-Language Models (VLMs) and retrieval frameworks, exposing structured transcription and reasoning limitations. NoTeS-Bank provides a rigorous evaluation with NDCG@5, MRR, Recall@K, IoU, and ANLS, establishing a new standard for visual document understanding and reasoning.

[215] FVQ: A Large-Scale Dataset and A LMM-based Method for Face Video Quality Assessment

Sijing Wu,Yunhao Li,Ziwen Xu,Yixuan Gao,Huiyu Duan,Wei Sun,Guangtao Zhai

Main category: cs.CV

TLDR: 该论文提出了首个大规模野外人脸视频质量评估数据集FVQ-20K，并开发了FVQ-Rater方法，利用多模态特征和LoRA技术实现高质量评分。

Details

Motivation: 人脸视频在社交媒体中占主导地位，且人类视觉系统对人脸敏感，但缺乏大规模FVQA数据集，阻碍了相关研究。 Method: 提出FVQ-20K数据集，包含20,000个野外人脸视频及MOS标注；开发FVQ-Rater方法，提取空间、时序和面部特征，结合LoRA指令调优技术。 Result: FVQ-Rater在FVQ-20K和CFVQA数据集上表现优异，验证了数据集和方法的潜力。 Conclusion: FVQ-20K和FVQ-Rater为FVQA研究提供了重要基础，推动了该领域的发展。 Abstract: Face video quality assessment (FVQA) deserves to be explored in addition to general video quality assessment (VQA), as face videos are the primary content on social media platforms and human visual system (HVS) is particularly sensitive to human faces. However, FVQA is rarely explored due to the lack of large-scale FVQA datasets. To fill this gap, we present the first large-scale in-the-wild FVQA dataset, FVQ-20K, which contains 20,000 in-the-wild face videos together with corresponding mean opinion score (MOS) annotations. Along with the FVQ-20K dataset, we further propose a specialized FVQA method named FVQ-Rater to achieve human-like rating and scoring for face video, which is the first attempt to explore the potential of large multimodal models (LMMs) for the FVQA task. Concretely, we elaborately extract multi-dimensional features including spatial features, temporal features, and face-specific features (i.e., portrait features and face embeddings) to provide comprehensive visual information, and take advantage of the LoRA-based instruction tuning technique to achieve quality-specific fine-tuning, which shows superior performance on both FVQ-20K and CFVQA datasets. Extensive experiments and comprehensive analysis demonstrate the significant potential of the FVQ-20K dataset and FVQ-Rater method in promoting the development of FVQA.

[216] PathVLM-R1: A Reinforcement Learning-Driven Reasoning Model for Pathology Visual-Language Tasks

Jianyu Wu,Hao Yang,Xinhua Zeng,Guibing He,Zhiyu Chen,Zihui Li,Xiaochuan Zhang,Yangyang Ma,Run Fang,Yang Liu

Main category: cs.CV

TLDR: PathVLM-R1是一种针对病理图像的视觉语言模型，通过监督微调和强化学习优化，显著提升了诊断准确性和推理能力。

Details

Motivation: 病理图像诊断受限于专家资源和地区差异，现有模型缺乏推理能力和监督过程，影响临床决策可靠性。 Method: 基于Qwen2.5-VL-7B-Instruct，通过监督微调（SFT）和双奖励驱动的强化学习（GRPO）优化模型。 Result: 在病理图像问答任务中，准确率提升14%，并在跨模态数据中表现优于传统方法。 Conclusion: PathVLM-R1不仅提高了准确性，还具有广泛的适用性和扩展潜力。 Abstract: The diagnosis of pathological images is often limited by expert availability and regional disparities, highlighting the importance of automated diagnosis using Vision-Language Models (VLMs). Traditional multimodal models typically emphasize outcomes over the reasoning process, compromising the reliability of clinical decisions. To address the weak reasoning abilities and lack of supervised processes in pathological VLMs, we have innovatively proposed PathVLM-R1, a visual language model designed specifically for pathological images. We have based our model on Qwen2.5-VL-7B-Instruct and enhanced its performance for pathological tasks through meticulously designed post-training strategies. Firstly, we conduct supervised fine-tuning guided by pathological data to imbue the model with foundational pathological knowledge, forming a new pathological base model. Subsequently, we introduce Group Relative Policy Optimization (GRPO) and propose a dual reward-driven reinforcement learning optimization, ensuring strict constraint on logical supervision of the reasoning process and accuracy of results via cross-modal process reward and outcome accuracy reward. In the pathological image question-answering tasks, the testing results of PathVLM-R1 demonstrate a 14% improvement in accuracy compared to baseline methods, and it demonstrated superior performance compared to the Qwen2.5-VL-32B version despite having a significantly smaller parameter size. Furthermore, in out-domain data evaluation involving four medical imaging modalities: Computed Tomography (CT), dermoscopy, fundus photography, and Optical Coherence Tomography (OCT) images: PathVLM-R1's transfer performance improved by an average of 17.3% compared to traditional SFT methods. These results clearly indicate that PathVLM-R1 not only enhances accuracy but also possesses broad applicability and expansion potential.

[217] Head-Aware KV Cache Compression for Efficient Visual Autoregressive Modeling

Ziran Qin,Youru Lv,Mingbao Lin,Zeren Zhang,Danping Zou,Weiyao Lin

Main category: cs.CV

TLDR: HACK是一种针对VAR模型KV缓存压缩的无训练方法，通过区分结构头和上下文头，采用不对称缓存预算和特定模式压缩策略，显著减少内存使用。

Details

Motivation: VAR模型在推理过程中因KV缓存积累导致内存瓶颈，现有压缩技术因未区分不同类型的注意力头而效果不佳。 Method: 提出HACK方法，根据结构头和上下文头的特性分配不对称缓存预算，并采用模式特定的压缩策略。 Result: 在VAR-d30和Infinity-8B上，HACK分别减少50%和70%的缓存，内存使用降低44.2%和58.9%，同时保持高质量生成。 Conclusion: HACK通过针对性压缩策略有效解决了VAR模型的KV缓存问题，显著提升了效率。 Abstract: Visual Autoregressive (VAR) models have emerged as a powerful approach for multi-modal content creation, offering high efficiency and quality across diverse multimedia applications. However, they face significant memory bottlenecks due to extensive KV cache accumulation during inference. Existing KV cache compression techniques for large language models are suboptimal for VAR models due to, as we identify in this paper, two distinct categories of attention heads in VAR models: Structural Heads, which preserve spatial coherence through diagonal attention patterns, and Contextual Heads, which maintain semantic consistency through vertical attention patterns. These differences render single-strategy KV compression techniques ineffective for VAR models. To address this, we propose HACK, a training-free Head-Aware Compression method for KV cache. HACK allocates asymmetric cache budgets and employs pattern-specific compression strategies tailored to the essential characteristics of each head category. Experiments on Infinity-2B, Infinity-8B, and VAR-d30 demonstrate its effectiveness in text-to-image and class-conditional generation tasks. HACK can hack down up to 50\% and 70\% of cache with minimal performance degradation for VAR-d30 and Infinity-8B, respectively. Even with 70\% and 90\% KV cache compression in VAR-d30 and Infinity-8B, HACK still maintains high-quality generation while reducing memory usage by 44.2\% and 58.9\%, respectively.

[218] VideoAds for Fast-Paced Video Understanding: Where Opensource Foundation Models Beat GPT-4o & Gemini-1.5 Pro

Zheyuan Zhang,Monica Dou,Linkai Peng,Hongyi Pan,Ulas Bagci,Boqing Gong

Main category: cs.CV

TLDR: 论文介绍了首个针对广告视频的多模态大语言模型（MLLM）评测数据集VideoAds，包含复杂时间结构的视频和手动标注的问题。实验显示开源模型Qwen2.5-VL-72B优于GPT-4o和Gemini-1.5 Pro，但人类专家表现更优。

Details

Motivation: 广告视频因其复杂的时间结构和多模态信息，对MLLMs提出了挑战，需要专门的评测数据集。 Method: 构建VideoAds数据集，包含复杂广告视频和手动标注的视觉查找、视频摘要和视觉推理问题，并提出视频复杂度量化指标。 Result: Qwen2.5-VL-72B在VideoAds上准确率为73.35%，优于GPT-4o（66.82%）和Gemini-1.5 Pro（69.66%），但人类专家达到94.27%。 Conclusion: VideoAds是未来视频理解研究的重要基准，需提升MLLMs的时间建模能力。 Abstract: Advertisement videos serve as a rich and valuable source of purpose-driven information, encompassing high-quality visual, textual, and contextual cues designed to engage viewers. They are often more complex than general videos of similar duration due to their structured narratives and rapid scene transitions, posing significant challenges to multi-modal large language models (MLLMs). In this work, we introduce VideoAds, the first dataset tailored for benchmarking the performance of MLLMs on advertisement videos. VideoAds comprises well-curated advertisement videos with complex temporal structures, accompanied by \textbf{manually} annotated diverse questions across three core tasks: visual finding, video summary, and visual reasoning. We propose a quantitative measure to compare VideoAds against existing benchmarks in terms of video complexity. Through extensive experiments, we find that Qwen2.5-VL-72B, an opensource MLLM, achieves 73.35\% accuracy on VideoAds, outperforming GPT-4o (66.82\%) and Gemini-1.5 Pro (69.66\%); the two proprietary models especially fall behind the opensource model in video summarization and reasoning, but perform the best in visual finding. Notably, human experts easily achieve a remarkable accuracy of 94.27\%. These results underscore the necessity of advancing MLLMs' temporal modeling capabilities and highlight VideoAds as a potentially pivotal benchmark for future research in understanding video that requires high FPS sampling. The dataset and evaluation code will be publicly available at https://videoadsbenchmark.netlify.app.

[219] Towards Explainable Partial-AIGC Image Quality Assessment

Jiaying Qian,Ziheng Jia,Zicheng Zhang,Zeyu Zhang,Guangtao Zhai,Xiongkuo Min

Main category: cs.CV

TLDR: 该论文提出了首个大规模部分AI生成内容（PAI）数据集EPAIQA-15K，用于可解释的质量评估，并开发了基于多模态模型的三阶段训练方法。

Details

Motivation: 现有图像质量评估（IQA）研究主要关注完全AI生成的图像，而忽略了局部AI编辑图像（PAI）的质量评估。 Method: 构建EPAIQA-15K数据集，利用多模态模型（LMM）进行三阶段训练（区域定位、质量评分、质量解释），开发EPAIQA系列模型。 Result: 成功开发了具备可解释质量反馈能力的EPAIQA模型，填补了PAI质量评估领域的空白。 Conclusion: 该研究是PAI感知质量评估领域的开创性工作，为局部AI编辑图像的质量评估提供了新方法。 Abstract: The rapid advancement of AI-driven visual generation technologies has catalyzed significant breakthroughs in image manipulation, particularly in achieving photorealistic localized editing effects on natural scene images (NSIs). Despite extensive research on image quality assessment (IQA) for AI-generated images (AGIs), most studies focus on fully AI-generated outputs (e.g., text-to-image generation), leaving the quality assessment of partial-AIGC images (PAIs)-images with localized AI-driven edits an almost unprecedented field. Motivated by this gap, we construct the first large-scale PAI dataset towards explainable partial-AIGC image quality assessment (EPAIQA), the EPAIQA-15K, which includes 15K images with localized AI manipulation in different regions and over 300K multi-dimensional human ratings. Based on this, we leverage large multi-modal models (LMMs) and propose a three-stage model training paradigm. This paradigm progressively trains the LMM for editing region grounding, quantitative quality scoring, and quality explanation. Finally, we develop the EPAIQA series models, which possess explainable quality feedback capabilities. Our work represents a pioneering effort in the perceptual IQA field for comprehensive PAI quality assessment.

[220] Cycle Training with Semi-Supervised Domain Adaptation: Bridging Accuracy and Efficiency for Real-Time Mobile Scene Detection

Huu-Phong Phan-Nguyen,Anh Dao,Tien-Huy Nguyen,Tuan Quang,Huu-Loc Tran,Tinh-Anh Nguyen-Nhu,Huy-Thach Pham,Quan Nguyen,Hoang M. Le,Quang-Vinh Dinh

Main category: cs.CV

TLDR: 提出了一种名为Cycle Training的新型训练框架，通过三阶段训练和半监督域适应技术，在移动设备上实现了高精度和高效能的图像分类。

Details

Motivation: 智能手机普及但资源有限，如何在移动设备上平衡深度学习模型的准确性和计算效率是一个重要挑战。 Method: 采用三阶段训练框架Cycle Training，结合探索和稳定阶段优化模型，并利用半监督域适应（SSDA）扩展训练数据。 Result: 在CamSSD数据集上，Top-1准确率达到94.00%，Top-3准确率达到99.17%，CPU推理时间仅1.61ms。 Conclusion: Cycle Training框架适合实际移动部署，显著提升了分类精度和实时推理效率。 Abstract: Nowadays, smartphones are ubiquitous, and almost everyone owns one. At the same time, the rapid development of AI has spurred extensive research on applying deep learning techniques to image classification. However, due to the limited resources available on mobile devices, significant challenges remain in balancing accuracy with computational efficiency. In this paper, we propose a novel training framework called Cycle Training, which adopts a three-stage training process that alternates between exploration and stabilization phases to optimize model performance. Additionally, we incorporate Semi-Supervised Domain Adaptation (SSDA) to leverage the power of large models and unlabeled data, thereby effectively expanding the training dataset. Comprehensive experiments on the CamSSD dataset for mobile scene detection demonstrate that our framework not only significantly improves classification accuracy but also ensures real-time inference efficiency. Specifically, our method achieves a 94.00% in Top-1 accuracy and a 99.17% in Top-3 accuracy and runs inference in just 1.61ms using CPU, demonstrating its suitability for real-world mobile deployment.

[221] A Lightweight Moment Retrieval System with Global Re-Ranking and Robust Adaptive Bidirectional Temporal Search

Tinh-Anh Nguyen-Nhu,Huu-Loc Tran,Nguyen-Khang Le,Minh-Nhat Nguyen,Tien-Huy Nguyen,Hoang-Long Nguyen-Huu,Huu-Phong Phan-Nguyen,Huy-Thach Pham,Quan Nguyen,Hoang M. Le,Quang-Vinh Dinh

Main category: cs.CV

TLDR: 提出了一种结合SuperGlobal Reranking和Adaptive Bidirectional Temporal Search的新框架，用于高效视频片段检索。

Details

Motivation: 数字视频内容爆炸式增长，现有方法在视频片段检索中存在效率低、上下文限制和计算复杂等问题。 Method: 采用关键帧提取和图像哈希去重预处理视频，结合SuperGlobal Reranking和ABTS优化查询相似性和计算资源。 Result: 显著降低存储需求，同时保持高精度的视频片段定位。 Conclusion: 新框架为大规模视频库提供了高效、可扩展的检索解决方案。 Abstract: The exponential growth of digital video content has posed critical challenges in moment-level video retrieval, where existing methodologies struggle to efficiently localize specific segments within an expansive video corpus. Current retrieval systems are constrained by computational inefficiencies, temporal context limitations, and the intrinsic complexity of navigating video content. In this paper, we address these limitations through a novel Interactive Video Corpus Moment Retrieval framework that integrates a SuperGlobal Reranking mechanism and Adaptive Bidirectional Temporal Search (ABTS), strategically optimizing query similarity, temporal stability, and computational resources. By preprocessing a large corpus of videos using a keyframe extraction model and deduplication technique through image hashing, our approach provides a scalable solution that significantly reduces storage requirements while maintaining high localization precision across diverse video repositories.

[222] MedIL: Implicit Latent Spaces for Generating Heterogeneous Medical Images at Arbitrary Resolutions

Tyler Spears,Shen Zhu,Yinzhu Jin,Aman Shrivastava,P. Thomas Fletcher

Main category: cs.CV

TLDR: MedIL是一种新型自编码器，用于处理不同尺寸和分辨率的医学图像生成，解决了现有方法因固定尺寸和重采样丢失细节的问题。

Details

Motivation: 医学图像尺寸和分辨率差异大，且细节对临床至关重要，现有方法因固定尺寸和重采样无法保留这些细节。 Method: MedIL利用隐式神经表示将图像视为连续信号，支持任意分辨率编码和解码，无需预先重采样。 Result: MedIL在多站点、多分辨率数据集上定量和定性展示了其压缩和保留临床相关特征的能力，并提升了扩散模型生成图像的质量。 Conclusion: MedIL能增强生成模型，使其更接近原始临床采集图像，为医学图像生成提供了新方向。 Abstract: In this work, we introduce MedIL, a first-of-its-kind autoencoder built for encoding medical images with heterogeneous sizes and resolutions for image generation. Medical images are often large and heterogeneous, where fine details are of vital clinical importance. Image properties change drastically when considering acquisition equipment, patient demographics, and pathology, making realistic medical image generation challenging. Recent work in latent diffusion models (LDMs) has shown success in generating images resampled to a fixed-size. However, this is a narrow subset of the resolutions native to image acquisition, and resampling discards fine anatomical details. MedIL utilizes implicit neural representations to treat images as continuous signals, where encoding and decoding can be performed at arbitrary resolutions without prior resampling. We quantitatively and qualitatively show how MedIL compresses and preserves clinically-relevant features over large multi-site, multi-resolution datasets of both T1w brain MRIs and lung CTs. We further demonstrate how MedIL can influence the quality of images generated with a diffusion model, and discuss how MedIL can enhance generative models to resemble raw clinical acquisitions.

[223] Infused Suppression Of Magnification Artefacts For Micro-AU Detection

Huai-Qian Khor,Yante Li,Xingxun Jiang,Guoying Zhao

Main category: cs.CV

TLDR: InfuseNet通过层间单元特征融合框架，利用运动上下文约束AU学习，减少运动放大伪影的影响，在CD6ME协议中达到最先进效果。

Details

Motivation: 微表情分析中，面部动作细微且短暂，运动放大虽能增强动作幅度，但会引入伪影，影响模型学习。 Method: 提出InfuseNet框架，利用运动上下文约束AU学习区域，并使用放大潜在特征而非重构样本以减少伪影。 Result: 在CD6ME协议中超越现有最佳结果，定量研究验证了伪影缓解的有效性。 Conclusion: InfuseNet通过减少运动放大伪影，显著提升了微表情中AU检测的准确性。 Abstract: Facial micro-expressions are spontaneous, brief and subtle facial motions that unveil the underlying, suppressed emotions. Detecting Action Units (AUs) in micro-expressions is crucial because it yields a finer representation of facial motions than categorical emotions, effectively resolving the ambiguity among different expressions. One of the difficulties in micro-expression analysis is that facial motions are subtle and brief, thereby increasing the difficulty in correlating facial motion features to AU occurrence. To bridge the subtlety issue, flow-related features and motion magnification are a few common approaches as they can yield descriptive motion changes and increased motion amplitude respectively. While motion magnification can amplify the motion changes, it also accounts for illumination changes and projection errors during the amplification process, thereby creating motion artefacts that confuse the model to learn inauthentic magnified motion features. The problem is further aggravated in the context of a more complicated task where more AU classes are analyzed in cross-database settings. To address this issue, we propose InfuseNet, a layer-wise unitary feature infusion framework that leverages motion context to constrain the Action Unit (AU) learning within an informative facial movement region, thereby alleviating the influence of magnification artefacts. On top of that, we propose leveraging magnified latent features instead of reconstructing magnified samples to limit the distortion and artefacts caused by the projection inaccuracy in the motion reconstruction process. Via alleviating the magnification artefacts, InfuseNet has surpassed the state-of-the-art results in the CD6ME protocol. Further quantitative studies have also demonstrated the efficacy of motion artefacts alleviation.

[224] Text To 3D Object Generation For Scalable Room Assembly

Sonia Laguna,Alberto Garcia-Garcia,Marie-Julie Rakotosaona,Stylianos Moschoglou,Leonhard Helminger,Sergio Orts-Escolano

Main category: cs.CV

TLDR: 提出了一种端到端的合成数据生成系统，用于解决场景理解任务中的数据稀缺问题。

Details

Motivation: 现代机器学习模型依赖高质量数据集，但现实场景数据稀缺且人工制作成本高。 Method: 结合文本到图像和多视角扩散模型，利用NeRF网格化技术生成高质量3D对象，并通过渲染工具整合到预定义平面图中。 Result: 系统支持按需生成场景，提升了合成数据的质量和可扩展性。 Conclusion: 该系统通过合成数据缓解了机器学习训练中的数据限制，增强了模型的鲁棒性和泛化能力。 Abstract: Modern machine learning models for scene understanding, such as depth estimation and object tracking, rely on large, high-quality datasets that mimic real-world deployment scenarios. To address data scarcity, we propose an end-to-end system for synthetic data generation for scalable, high-quality, and customizable 3D indoor scenes. By integrating and adapting text-to-image and multi-view diffusion models with Neural Radiance Field-based meshing, this system generates highfidelity 3D object assets from text prompts and incorporates them into pre-defined floor plans using a rendering tool. By introducing novel loss functions and training strategies into existing methods, the system supports on-demand scene generation, aiming to alleviate the scarcity of current available data, generally manually crafted by artists. This system advances the role of synthetic data in addressing machine learning training limitations, enabling more robust and generalizable models for real-world applications.

[225] REMEMBER: Retrieval-based Explainable Multimodal Evidence-guided Modeling for Brain Evaluation and Reasoning in Zero- and Few-shot Neurodegenerative Diagnosis

Duy-Cat Can,Quang-Huy Tang,Huong Ha,Binh T. Nguyen,Oliver Y. Chén

Main category: cs.CV

TLDR: REMEMBER是一种基于检索的多模态机器学习框架，用于零样本和少样本阿尔茨海默病诊断，通过参考数据和上下文推理提供可解释的结果。

Details

Motivation: 现有深度学习模型依赖大规模标注数据且缺乏可解释性，临床数据通常小规模或无标注，限制了深度学习的应用。 Method: REMEMBER通过对比对齐的视觉-文本模型训练，结合伪文本模态和检索机制，模仿临床决策过程。 Result: 实验显示REMEMBER在零样本和少样本场景下表现稳健，提供可解释的诊断报告。 Conclusion: REMEMBER为神经影像诊断提供了一种高效且可解释的解决方案，尤其适用于数据有限的情况。 Abstract: Timely and accurate diagnosis of neurodegenerative disorders, such as Alzheimer's disease, is central to disease management. Existing deep learning models require large-scale annotated datasets and often function as "black boxes". Additionally, datasets in clinical practice are frequently small or unlabeled, restricting the full potential of deep learning methods. Here, we introduce REMEMBER -- Retrieval-based Explainable Multimodal Evidence-guided Modeling for Brain Evaluation and Reasoning -- a new machine learning framework that facilitates zero- and few-shot Alzheimer's diagnosis using brain MRI scans through a reference-based reasoning process. Specifically, REMEMBER first trains a contrastively aligned vision-text model using expert-annotated reference data and extends pseudo-text modalities that encode abnormality types, diagnosis labels, and composite clinical descriptions. Then, at inference time, REMEMBER retrieves similar, human-validated cases from a curated dataset and integrates their contextual information through a dedicated evidence encoding module and attention-based inference head. Such an evidence-guided design enables REMEMBER to imitate real-world clinical decision-making process by grounding predictions in retrieved imaging and textual context. Specifically, REMEMBER outputs diagnostic predictions alongside an interpretable report, including reference images and explanations aligned with clinical workflows. Experimental results demonstrate that REMEMBER achieves robust zero- and few-shot performance and offers a powerful and explainable framework to neuroimaging-based diagnosis in the real world, especially under limited data.

[226] PapMOT: Exploring Adversarial Patch Attack against Multiple Object Tracking

Jiahuan Long,Tingsong Jiang,Wen Yao,Shuai Jia,Weijia Zhang,Weien Zhou,Chao Ma,Xiaoqian Chen

Main category: cs.CV

TLDR: PapMOT是一种针对多目标跟踪（MOT）的物理对抗攻击方法，可生成可打印的对抗补丁，用于数字和物理场景的攻击。

Details

Motivation: 现有MOT方法对对抗攻击的脆弱性仅限于数字攻击，无法在物理场景中有效应用，因此需要一种能在物理场景中攻击MOT的方法。 Method: PapMOT通过生成可打印的对抗补丁，不仅攻击检测机制，还误导身份关联过程，并引入补丁增强策略以降低跟踪结果的时序一致性。 Result: PapMOT成功攻击了多种MOT跟踪器架构，并在真实世界中验证了其物理攻击的有效性。 Conclusion: PapMOT填补了MOT对抗攻击在物理场景中的空白，并提出了新的评估指标以衡量MOT的鲁棒性。 Abstract: Tracking multiple objects in a continuous video stream is crucial for many computer vision tasks. It involves detecting and associating objects with their respective identities across successive frames. Despite significant progress made in multiple object tracking (MOT), recent studies have revealed the vulnerability of existing MOT methods to adversarial attacks. Nevertheless, all of these attacks belong to digital attacks that inject pixel-level noise into input images, and are therefore ineffective in physical scenarios. To fill this gap, we propose PapMOT, which can generate physical adversarial patches against MOT for both digital and physical scenarios. Besides attacking the detection mechanism, PapMOT also optimizes a printable patch that can be detected as new targets to mislead the identity association process. Moreover, we introduce a patch enhancement strategy to further degrade the temporal consistency of tracking results across video frames, resulting in more aggressive attacks. We further develop new evaluation metrics to assess the robustness of MOT against such attacks. Extensive evaluations on multiple datasets demonstrate that our PapMOT can successfully attack various architectures of MOT trackers in digital scenarios. We also validate the effectiveness of PapMOT for physical attacks by deploying printed adversarial patches in the real world.

[227] Beyond Degradation Conditions: All-in-One Image Restoration via HOG Transformers

Jiawei Wu,Zhifei Yang,Zhe Wang,Zhi Jin

Main category: cs.CV

TLDR: HOGformer提出了一种基于HOG描述符的全能图像修复框架，通过动态自注意力机制和局部动态范围卷积模块，显著提升了复杂场景下的修复性能。

Details

Motivation: 现有方法依赖预测和整合退化条件，容易在复杂场景中误激活退化特定特征，限制了修复性能。 Method: 利用HOG描述符的退化判别能力，设计动态自注意力机制和HOG引导的局部动态范围卷积模块，增强退化敏感性。 Result: 在多种基准测试中，HOGformer实现了最先进的性能，并能有效泛化到复杂的真实世界退化场景。 Conclusion: HOGformer通过HOG引导的动态机制，显著提升了全能图像修复的性能和泛化能力。 Abstract: All-in-one image restoration, which aims to address diverse degradations within a unified framework, is critical for practical applications. However, existing methods rely on predicting and integrating degradation conditions, which can misactivate degradation-specific features in complex scenarios, limiting their restoration performance. To address this issue, we propose a novel all-in-one image restoration framework guided by Histograms of Oriented Gradients (HOG), named HOGformer. By leveraging the degradation-discriminative capability of HOG descriptors, HOGformer employs a dynamic self-attention mechanism that adaptively attends to long-range spatial dependencies based on degradation-aware HOG cues. To enhance the degradation sensitivity of attention inputs, we design a HOG-guided local dynamic-range convolution module that captures long-range degradation similarities while maintaining awareness of global structural information. Furthermore, we propose a dynamic interaction feed-forward module, efficiently increasing the model capacity to adapt to different degradations through channel-spatial interactions. Extensive experiments across diverse benchmarks, including adverse weather and natural degradations, demonstrate that HOGformer achieves state-of-the-art performance and generalizes effectively to complex real-world degradations. Code is available at https://github.com/Fire-friend/HOGformer.

[228] Low-Light Image Enhancement using Event-Based Illumination Estimation

Lei Sun,Yuhan Bao,Jiajun Zhai,Jingyun Liang,Yulun Zhang,Kaiwei Wang,Danda Pani Paudel,Luc Van Gool

Main category: cs.CV

TLDR: 论文提出了一种基于事件相机的低光图像增强方法RetinEV，利用时间映射事件估计光照，显著提升了图像质量和动态范围。

Details

Motivation: 现有方法主要依赖运动事件增强边缘纹理，忽略了事件相机在高动态范围和低光响应方面的潜力。 Method: 通过将时间映射事件的时间戳转换为亮度值，提出光照辅助反射增强模块，并研究了低光条件下时间映射事件的退化模型。 Result: RetinEV在5个合成数据集和真实数据集EvLowLight上表现优异，PSNR提升达6.62 dB，推理速度为35.6 FPS。 Conclusion: RetinEV为低光图像增强提供了新思路，显著优于现有方法，同时保持了高效性。 Abstract: Low-light image enhancement (LLIE) aims to improve the visibility of images captured in poorly lit environments. Prevalent event-based solutions primarily utilize events triggered by motion, i.e., ''motion events'' to strengthen only the edge texture, while leaving the high dynamic range and excellent low-light responsiveness of event cameras largely unexplored. This paper instead opens a new avenue from the perspective of estimating the illumination using ''temporal-mapping'' events, i.e., by converting the timestamps of events triggered by a transmittance modulation into brightness values. The resulting fine-grained illumination cues facilitate a more effective decomposition and enhancement of the reflectance component in low-light images through the proposed Illumination-aided Reflectance Enhancement module. Furthermore, the degradation model of temporal-mapping events under low-light condition is investigated for realistic training data synthesizing. To address the lack of datasets under this regime, we construct a beam-splitter setup and collect EvLowLight dataset that includes images, temporal-mapping events, and motion events. Extensive experiments across 5 synthetic datasets and our real-world EvLowLight dataset substantiate that the devised pipeline, dubbed RetinEV, excels in producing well-illuminated, high dynamic range images, outperforming previous state-of-the-art event-based methods by up to 6.62 dB, while maintaining an efficient inference speed of 35.6 frame-per-second on a 640X480 image.

[229] Contour Flow Constraint: Preserving Global Shape Similarity for Deep Learning based Image Segmentation

Shengzhe Chen,Zhaoxuan Dong,Jun Liu

Main category: cs.CV

TLDR: 论文提出了一种基于轮廓流（CF）的全局形状相似性约束，并将其融入深度学习分割框架，显著提升了分割精度和形状相似性。

Details

Motivation: 现有方法主要关注特定属性或形状的先验，缺乏从轮廓流角度考虑全局形状相似性，且未探索如何将其自然融入深度卷积网络的激活函数。 Method: 提出基于轮廓流的全局形状相似性概念，并数学推导出约束条件；通过形状损失和变分分割模型两种方式实现约束与深度网络的结合。 Result: 实验表明，提出的形状损失显著提升了分割精度和形状相似性，CFSSnet在噪声图像分割中表现出鲁棒性。 Conclusion: 提出的方法具有通用性，适用于不同网络架构，并能有效保持全局形状相似性。 Abstract: For effective image segmentation, it is crucial to employ constraints informed by prior knowledge about the characteristics of the areas to be segmented to yield favorable segmentation outcomes. However, the existing methods have primarily focused on priors of specific properties or shapes, lacking consideration of the general global shape similarity from a Contour Flow (CF) perspective. Furthermore, naturally integrating this contour flow prior image segmentation model into the activation functions of deep convolutional networks through mathematical methods is currently unexplored. In this paper, we establish a concept of global shape similarity based on the premise that two shapes exhibit comparable contours. Furthermore, we mathematically derive a contour flow constraint that ensures the preservation of global shape similarity. We propose two implementations to integrate the constraint with deep neural networks. Firstly, the constraint is converted to a shape loss, which can be seamlessly incorporated into the training phase for any learning-based segmentation framework. Secondly, we add the constraint into a variational segmentation model and derive its iterative schemes for solution. The scheme is then unrolled to get the architecture of the proposed CFSSnet. Validation experiments on diverse datasets are conducted on classic benchmark deep network segmentation models. The results indicate a great improvement in segmentation accuracy and shape similarity for the proposed shape loss, showcasing the general adaptability of the proposed loss term regardless of specific network architectures. CFSSnet shows robustness in segmenting noise-contaminated images, and inherent capability to preserve global shape similarity.

[230] Vision Transformers Exhibit Human-Like Biases: Evidence of Orientation and Color Selectivity, Categorical Perception, and Phase Transitions

Nooshin Bahador

Main category: cs.CV

TLDR: ViTs表现出与人类大脑相似的方位和颜色偏差，包括斜向效应、颜色相关误差、人类感知类别对齐的聚类分析，以及注意头的任务无关特征提取能力。

Details

Motivation: 探索ViTs是否表现出与人类大脑相似的方位和颜色偏差，以理解其行为是否受预训练数据和架构约束影响。 Method: 使用合成数据集控制噪声、角度、长度、宽度和颜色，分析LoRA微调ViTs的行为。 Result: ViTs表现出斜向效应、颜色相关误差、人类感知类别对齐的聚类分析，以及注意头的任务无关特征提取能力。 Conclusion: ViTs的偏差和特性主要源于预训练数据和架构约束，而非下游数据统计。 Abstract: This study explored whether Vision Transformers (ViTs) developed orientation and color biases similar to those observed in the human brain. Using synthetic datasets with controlled variations in noise levels, angles, lengths, widths, and colors, we analyzed the behavior of ViTs fine-tuned with LoRA. Our findings revealed four key insights: First, ViTs exhibited an oblique effect showing the lowest angle prediction errors at 180 deg (horizontal) across all conditions. Second, angle prediction errors varied by color. Errors were highest for bluish hues and lowest for yellowish ones. Additionally, clustering analysis of angle prediction errors showed that ViTs grouped colors in a way that aligned with human perceptual categories. In addition to orientation and color biases, we observed phase transition phenomena. While two phase transitions occurred consistently across all conditions, the training loss curves exhibited delayed transitions when color was incorporated as an additional data attribute. Finally, we observed that attention heads in certain layers inherently develop specialized capabilities, functioning as task-agnostic feature extractors regardless of the downstream task. These observations suggest that biases and properties arise primarily from pre-training on the original dataset which shapes the model's foundational representations and the inherent architectural constraints of the vision transformer, rather than being solely determined by downstream data statistics.

[231] Comparing Performance of Preprocessing Techniques for Traffic Sign Recognition Using a HOG-SVM

Luis Vieira

Main category: cs.CV

TLDR: 比较了多种预处理技术（CLAHE、HUE、YUV）在HOG-SVM分类器上的性能，发现YUV显著提升了分类准确率。

Details

Motivation: 研究不同预处理技术对交通标志识别性能的影响，以优化预处理流程。 Method: 使用HOG和SVM在GTSRB数据集上评估CLAHE、HUE和YUV的效果。 Result: YUV预处理将准确率从89.65%提升至91.25%。 Conclusion: YUV预处理技术显著提升了HOG-SVM分类器的性能，为TSR应用提供了改进方向。 Abstract: This study compares the performance of various preprocessing techniques for Traffic Sign Recognition (TSR) using Histogram of Oriented Gradients (HOG) and Support Vector Machine (SVM) on the German Traffic Sign Recognition Benchmark (GTSRB) dataset. Techniques such as CLAHE, HUE, and YUV were evaluated for their impact on classification accuracy. Results indicate that YUV in particular significantly enhance the performance of the HOG-SVM classifier (improving accuracy from 89.65% to 91.25%), providing insights into improvements for preprocessing pipeline of TSR applications.

[232] BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning

Shengao Wang,Arjun Chandra,Aoming Liu,Venkatesh Saligrama,Boqing Gong

Main category: cs.CV

TLDR: BabyVLM提出了一种新的框架，结合了婴儿启发的合成数据集和评估基准，显著提升了视觉语言模型的效率和性能。

Details

Motivation: 现有评估基准与婴儿学习方式不匹配，且婴儿数据训练忽略了多样性输入。 Method: 提出BabyVLM框架，包括合成训练数据集和评估基准。 Result: BabyVLM训练的模型在任务上表现优于仅使用SAYCam或通用数据的模型。 Conclusion: BabyVLM为数据高效的视觉语言学习提供了新方向。 Abstract: Human infants rapidly develop visual reasoning skills from minimal input, suggesting that developmentally inspired pretraining could significantly enhance the efficiency of vision-language models (VLMs). Although recent efforts have leveraged infant-inspired datasets like SAYCam, existing evaluation benchmarks remain misaligned--they are either too simplistic, narrowly scoped, or tailored for large-scale pretrained models. Additionally, training exclusively on infant data overlooks the broader, diverse input from which infants naturally learn. To address these limitations, we propose BabyVLM, a novel framework comprising comprehensive in-domain evaluation benchmarks and a synthetic training dataset created via child-directed transformations of existing datasets. We demonstrate that VLMs trained with our synthetic dataset achieve superior performance on BabyVLM tasks compared to models trained solely on SAYCam or general-purpose data of the SAYCam size. BabyVLM thus provides a robust, developmentally aligned evaluation tool and illustrates how compact models trained on carefully curated data can generalize effectively, opening pathways toward data-efficient vision-language learning paradigms.

[233] Structure-Accurate Medical Image Translation based on Dynamic Frequency Balance and Knowledge Guidance

Jiahua Xu,Dawei Zhou,Lei Hu,Zaiyi Liu,Nannan Wang,Xinbo Gao

Main category: cs.CV

TLDR: 提出了一种基于动态频率平衡和知识引导的新方法，用于解决多模态医学图像合成中的解剖结构失真问题。

Details

Motivation: 多模态医学图像在临床诊断中至关重要，但现有方法因高频信息过拟合和低频信息减弱导致解剖结构失真。 Method: 通过小波变换分解模型关键特征，设计动态频率平衡模块自适应调整频率，并结合知识引导机制融合临床先验知识。 Result: 在多数据集上的实验表明，该方法在定性和定量评估中均显著优于现有方法。 Conclusion: 该方法有效解决了医学图像合成中的解剖结构失真问题，具有显著的优势。 Abstract: Multimodal medical images play a crucial role in the precise and comprehensive clinical diagnosis. Diffusion model is a powerful strategy to synthesize the required medical images. However, existing approaches still suffer from the problem of anatomical structure distortion due to the overfitting of high-frequency information and the weakening of low-frequency information. Thus, we propose a novel method based on dynamic frequency balance and knowledge guidance. Specifically, we first extract the low-frequency and high-frequency components by decomposing the critical features of the model using wavelet transform. Then, a dynamic frequency balance module is designed to adaptively adjust frequency for enhancing global low-frequency features and effective high-frequency details as well as suppressing high-frequency noise. To further overcome the challenges posed by the large differences between different medical modalities, we construct a knowledge-guided mechanism that fuses the prior clinical knowledge from a visual language model with visual features, to facilitate the generation of accurate anatomical structures. Experimental evaluations on multiple datasets show the proposed method achieves significant improvements in qualitative and quantitative assessments, verifying its effectiveness and superiority.

[234] Sparse Deformable Mamba for Hyperspectral Image Classification

Lincoln Linlin Xu,Yimin Zhu,Zack Dewis,Zhengsen Xu,Motasem Alkayid,Mabel Heffring,Saeid Taleghanidoozdoozan

Main category: cs.CV

TLDR: 提出了一种稀疏可变形Mamba（SDMamba）方法，通过自适应学习优化序列，提升高光谱图像分类的效率和效果。

Details

Motivation: 解决Mamba模型在高光谱图像分类中序列构建效率低和效果差的问题。 Method: 设计了稀疏可变形序列（SDS）方法优化序列，并开发了空间和光谱Mamba模块（SDSpaM和SDSpeM），结合注意力特征融合提升分类性能。 Result: 在多个基准数据集上验证，SDMamba方法在准确性、速度和细节保留方面优于现有方法。 Conclusion: SDMamba方法显著提升了高光谱图像分类的性能，具有高效和细节保留的优势。 Abstract: Although the recent Mamba models significantly improve hyperspectral image (HSI) classification, one critical challenge is caused by the difficulty to build the Mamba sequence efficiently and effectively. This paper presents a Sparse Deformable Mamba (SDMamba) approach for enhanced HSI classification, with the following contributions. First, to enhance Mamba sequence, an efficient Sparse Deformable Sequencing (SDS) approach is designed to adaptively learn the "optimal" sequence, leading to sparse and deformable Mamba sequence with increased detail preservation and decreased computations. Second, to boost spatial-spectral feature learning, based on SDS, a Sparse Deformable Spatial Mamba Module (SDSpaM) and a Sparse Deformable Spectral Mamba Module (SDSpeM) are designed for tailored modeling of the spatial information spectral information. Last, to improve the fusion of SDSpaM and SDSpeM, an attention based feature fusion approach is designed to integrate the outputs of the SDSpaM and SDSpeM. The proposed method is tested on several benchmark datasets with many state-of-the-art approaches, demonstrating that the proposed approach can achieve higher accuracy, faster speed, and better detail small-class preservation capability.

[235] InfoBound: A Provable Information-Bounds Inspired Framework for Both OoD Generalization and OoD Detection

Lin Zhu,Yifeng Yang,Zichao Nie,Yuan Gao,Jiarui Li,Qinying Gu,Xinbing Wang,Chenghu Zhou,Nanyang Ye

Main category: cs.CV

TLDR: 本文提出了一种基于信息论的统一方法，同时提升分布外（OoD）检测和泛化能力，解决了现有方法在两者之间权衡的问题。

Details

Motivation: 现实测试环境中常同时存在协变量偏移和语义偏移，但现有方法往往只能单独解决其中一个问题，甚至牺牲另一个。 Method: 通过信息论中的互信息最小化（MI-Min）和条件熵最大化（CE-Max）构建统一方法。 Result: 在多标签图像分类和目标检测任务中，该方法显著优于基线，成功缓解了两者之间的权衡。 Conclusion: 该方法为同时提升OoD检测和泛化能力提供了有效解决方案，且易于应用于现有模型和任务。 Abstract: In real-world scenarios, distribution shifts give rise to the importance of two problems: out-of-distribution (OoD) generalization, which focuses on models' generalization ability against covariate shifts (i.e., the changes of environments), and OoD detection, which aims to be aware of semantic shifts (i.e., test-time unseen classes). Real-world testing environments often involve a combination of both covariate and semantic shifts. While numerous methods have been proposed to address these critical issues, only a few works tackled them simultaneously. Moreover, prior works often improve one problem but sacrifice the other. To overcome these limitations, we delve into boosting OoD detection and OoD generalization from the perspective of information theory, which can be easily applied to existing models and different tasks. Building upon the theoretical bounds for mutual information and conditional entropy, we provide a unified approach, composed of Mutual Information Minimization (MI-Min) and Conditional Entropy Maximizing (CE-Max). Extensive experiments and comprehensive evaluations on multi-label image classification and object detection have demonstrated the superiority of our method. It successfully mitigates trade-offs between the two challenges compared to competitive baselines.

[236] FractalForensics: Proactive Deepfake Detection and Localization via Fractal Watermarks

Tianyi Wang,Harry Cheng,Ming-Hui Liu,Mohan Kankanhalli

Main category: cs.CV

TLDR: 提出了一种基于分形水印的主动Deepfake检测与定位方法FractalForensics，解决了现有水印方法缺乏定位功能和检测结果可解释性的问题。

Details

Motivation: 被动Deepfake检测器难以识别高质量合成图像，现有主动检测水印方法缺乏定位功能和检测结果可解释性，且水印鲁棒性不稳定。 Method: 设计参数驱动的分形水印生成流程，提出半脆弱水印框架，采用入口到补丁策略实现水印嵌入与恢复，并定位Deepfake操作。 Result: 实验表明该方法对常见图像处理和Deepfake操作具有满意的鲁棒性和脆弱性，优于现有半脆弱水印算法和被动检测器。 Conclusion: FractalForensics不仅提高了主动Deepfake检测性能，还通过突出操作区域提供了检测结果的可解释性。 Abstract: Proactive Deepfake detection via robust watermarks has been raised ever since passive Deepfake detectors encountered challenges in identifying high-quality synthetic images. However, while demonstrating reasonable detection performance, they lack localization functionality and explainability in detection results. Additionally, the unstable robustness of watermarks can significantly affect the detection performance accordingly. In this study, we propose novel fractal watermarks for proactive Deepfake detection and localization, namely FractalForensics. Benefiting from the characteristics of fractals, we devise a parameter-driven watermark generation pipeline that derives fractal-based watermarks and conducts one-way encryption regarding the parameters selected. Subsequently, we propose a semi-fragile watermarking framework for watermark embedding and recovery, trained to be robust against benign image processing operations and fragile when facing Deepfake manipulations in a black-box setting. Meanwhile, we introduce an entry-to-patch strategy that implicitly embeds the watermark matrix entries into image patches at corresponding positions, achieving localization of Deepfake manipulations. Extensive experiments demonstrate satisfactory robustness and fragility of our approach against common image processing operations and Deepfake manipulations, outperforming state-of-the-art semi-fragile watermarking algorithms and passive detectors for Deepfake detection. Furthermore, by highlighting the areas manipulated, our method provides explainability for the proactive Deepfake detection results.

[237] D$^2$iT: Dynamic Diffusion Transformer for Accurate Image Generation

Weinan Jia,Mengqi Huang,Nan Chen,Lei Zhang,Zhendong Mao

Main category: cs.CV

TLDR: 论文提出了一种动态压缩不同图像区域的两阶段框架（DVAE和D²iT），以提升扩散模型生成图像的质量和效率。

Details

Motivation: 现有Diffusion Transformer（DiT）在扩散过程中对不同区域采用固定压缩，忽略了信息密度的差异，导致局部真实感不足或计算复杂度高。 Method: 1. 第一阶段使用动态VAE（DVAE）分层编码不同区域，适应其信息密度；2. 第二阶段通过动态扩散Transformer（D²iT）预测多粒度噪声，结合粗粒度和细粒度噪声生成图像。 Result: 实验验证了该方法在多种生成任务中的有效性，实现了全局一致性与局部真实感的统一。 Conclusion: 动态压缩策略显著提升了图像生成质量，代码将开源。 Abstract: Diffusion models are widely recognized for their ability to generate high-fidelity images. Despite the excellent performance and scalability of the Diffusion Transformer (DiT) architecture, it applies fixed compression across different image regions during the diffusion process, disregarding the naturally varying information densities present in these regions. However, large compression leads to limited local realism, while small compression increases computational complexity and compromises global consistency, ultimately impacting the quality of generated images. To address these limitations, we propose dynamically compressing different image regions by recognizing the importance of different regions, and introduce a novel two-stage framework designed to enhance the effectiveness and efficiency of image generation: (1) Dynamic VAE (DVAE) at first stage employs a hierarchical encoder to encode different image regions at different downsampling rates, tailored to their specific information densities, thereby providing more accurate and natural latent codes for the diffusion process. (2) Dynamic Diffusion Transformer (D$^2$iT) at second stage generates images by predicting multi-grained noise, consisting of coarse-grained (less latent code in smooth regions) and fine-grained (more latent codes in detailed regions), through an novel combination of the Dynamic Grain Transformer and the Dynamic Content Transformer. The strategy of combining rough prediction of noise with detailed regions correction achieves a unification of global consistency and local realism. Comprehensive experiments on various generation tasks validate the effectiveness of our approach. Code will be released at https://github.com/jiawn-creator/Dynamic-DiT.

[238] Enhancing Wide-Angle Image Using Narrow-Angle View of the Same Scene

Hussain Md. Safwan,Mahbub Islam Mahim,Fawwaz Mohammed Amin

Main category: cs.CV

TLDR: 提出了一种通过GAN模型将窄角镜头的细节质量转移到广角镜头图像中的新方法。

Details

Motivation: 解决拍摄时广角镜头覆盖范围广但细节不足，窄角镜头细节丰富但覆盖范围有限的问题。 Method: 使用GAN模型从窄角镜头图像中提取视觉质量参数，并将其转移到对应的广角图像中。 Result: 在多个基准数据集上进行了评估，并与当前领域的最新进展进行了比较。 Conclusion: 该方法成功地将窄角镜头的细节质量注入广角图像，提升了广角图像的视觉质量。 Abstract: A common dilemma while photographing a scene is whether to capture it in wider angle, allowing more of the scene to be covered but in lesser details or to click in narrow angle that captures better details but leaves out portions of the scene. We propose a novel method in this paper that infuses wider shots with finer quality details that is usually associated with an image captured by the primary lens by capturing the same scene using both narrow and wide field of view (FoV) lenses. We do so by training a GAN-based model to learn to extract the visual quality parameters from a narrow angle shot and to transfer these to the corresponding wide-angle image of the scene. We have mentioned in details the proposed technique to isolate the visual essence of an image and to transfer it into another image. We have also elaborately discussed our implementation details and have presented the results of evaluation over several benchmark datasets and comparisons with contemporary advancements in the field.

[239] CamMimic: Zero-Shot Image To Camera Motion Personalized Video Generation Using Diffusion Models

Pooja Guhan,Divya Kothandaraman,Tsung-Wei Huang,Guan-Ming Su,Dinesh Manocha

Main category: cs.CV

TLDR: CamMimic是一种创新的动态视频编辑算法，能够零样本地将参考视频的相机运动无缝转移到用户选择的场景中。

Details

Motivation: 解决动态视频编辑中相机运动转移的需求，无需额外数据。 Method: 采用两阶段策略：1) 多概念学习方法结合LoRA层和正交损失；2) 基于单应性的细化策略。 Result: 实验表明，该方法生成高质量视频，用户研究中70.31%的参与者偏好其场景保留能力，90.45%偏好其运动转移能力。 Conclusion: CamMimic为跨场景相机运动转移的未来研究奠定了基础。 Abstract: We introduce CamMimic, an innovative algorithm tailored for dynamic video editing needs. It is designed to seamlessly transfer the camera motion observed in a given reference video onto any scene of the user's choice in a zero-shot manner without requiring any additional data. Our algorithm achieves this using a two-phase strategy by leveraging a text-to-video diffusion model. In the first phase, we develop a multi-concept learning method using a combination of LoRA layers and an orthogonality loss to capture and understand the underlying spatial-temporal characteristics of the reference video as well as the spatial features of the user's desired scene. The second phase proposes a unique homography-based refinement strategy to enhance the temporal and spatial alignment of the generated video. We demonstrate the efficacy of our method through experiments conducted on a dataset containing combinations of diverse scenes and reference videos containing a variety of camera motions. In the absence of an established metric for assessing camera motion transfer between unrelated scenes, we propose CameraScore, a novel metric that utilizes homography representations to measure camera motion similarity between the reference and generated videos. Extensive quantitative and qualitative evaluations demonstrate that our approach generates high-quality, motion-enhanced videos. Additionally, a user study reveals that 70.31% of participants preferred our method for scene preservation, while 90.45% favored it for motion transfer. We hope this work lays the foundation for future advancements in camera motion transfer across different scenes.

[240] Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation

Yongchao Feng,Yajie Liu,Shuai Yang,Wenrui Cai,Jinqing Zhang,Qiqi Zhan,Ziyue Huang,Hongxi Yan,Qiao Wan,Chenguang Liu,Junzhe Wang,Jiahui Lv,Ziqi Liu,Tengyuan Shi,Qingjie Liu,Yunhong Wang

Main category: cs.CV

TLDR: 本文系统评估了视觉语言模型（VLM）在传统视觉任务中的表现，涵盖检测和分割的多种场景，并分析了不同微调策略的影响。

Details

Motivation: 尽管VLM在开放词汇任务中表现优异，但其在传统视觉任务中的有效性尚未被评估，本文旨在填补这一空白。 Method: 通过八种检测和八种分割场景的全面评估，分析了不同VLM架构和微调策略（零预测、视觉微调、文本提示）的性能。 Result: 揭示了VLM在不同任务中的优势和局限性，并分析了任务特性、模型架构和训练方法之间的相关性。 Conclusion: 本文为未来VLM设计提供了见解，并有望推动计算机视觉和多模态学习领域的研究。 Abstract: Vision-Language Model (VLM) have gained widespread adoption in Open-Vocabulary (OV) object detection and segmentation tasks. Despite they have shown promise on OV-related tasks, their effectiveness in conventional vision tasks has thus far been unevaluated. In this work, we present the systematic review of VLM-based detection and segmentation, view VLM as the foundational model and conduct comprehensive evaluations across multiple downstream tasks for the first time: 1) The evaluation spans eight detection scenarios (closed-set detection, domain adaptation, crowded objects, etc.) and eight segmentation scenarios (few-shot, open-world, small object, etc.), revealing distinct performance advantages and limitations of various VLM architectures across tasks. 2) As for detection tasks, we evaluate VLMs under three finetuning granularities: \textit{zero prediction}, \textit{visual fine-tuning}, and \textit{text prompt}, and further analyze how different finetuning strategies impact performance under varied task. 3) Based on empirical findings, we provide in-depth analysis of the correlations between task characteristics, model architectures, and training methodologies, offering insights for future VLM design. 4) We believe that this work shall be valuable to the pattern recognition experts working in the fields of computer vision, multimodal learning, and vision foundation models by introducing them to the problem, and familiarizing them with the current status of the progress while providing promising directions for future research. A project associated with this review and evaluation has been created at https://github.com/better-chao/perceptual_abilities_evaluation.

[241] DropoutGS: Dropping Out Gaussians for Better Sparse-view Rendering

Yexing Xu,Longguang Wang,Minglin Chen,Sheng Ao,Li Li,Yulan Guo

Main category: cs.CV

TLDR: 3D高斯泼溅（3DGS）在稀疏输入下性能下降且产生伪影，本文提出随机丢弃正则化（RDR）和边缘引导分割策略（ESS）以提升泛化性能。

Details

Motivation: 稀疏输入下3DGS性能下降且过拟合严重，低复杂度模型表现更优，启发提出新方法。 Method: 提出RDR缓解过拟合，ESS补充高频细节，结合为DropoutGS。 Result: 在Blender、LLFF和DTU数据集上取得SOTA性能。 Conclusion: DropoutGS简单有效，显著提升稀疏视图下的3DGS性能。 Abstract: Although 3D Gaussian Splatting (3DGS) has demonstrated promising results in novel view synthesis, its performance degrades dramatically with sparse inputs and generates undesirable artifacts. As the number of training views decreases, the novel view synthesis task degrades to a highly under-determined problem such that existing methods suffer from the notorious overfitting issue. Interestingly, we observe that models with fewer Gaussian primitives exhibit less overfitting under sparse inputs. Inspired by this observation, we propose a Random Dropout Regularization (RDR) to exploit the advantages of low-complexity models to alleviate overfitting. In addition, to remedy the lack of high-frequency details for these models, an Edge-guided Splitting Strategy (ESS) is developed. With these two techniques, our method (termed DropoutGS) provides a simple yet effective plug-in approach to improve the generalization performance of existing 3DGS methods. Extensive experiments show that our DropoutGS produces state-of-the-art performance under sparse views on benchmark datasets including Blender, LLFF, and DTU. The project page is at: https://xuyx55.github.io/DropoutGS/.

[242] EasyREG: Easy Depth-Based Markerless Registration and Tracking using Augmented Reality Device for Surgical Guidance

Yue Yang,Christoph Leuze,Brian Hargreaves,Bruce Daniel,Fred Baik

Main category: cs.CV

TLDR: 提出了一种基于AR设备的无标记手术导航框架，包含高精度配准模块和实时跟踪模块，性能优于传统方法。

Details

Motivation: 传统手术导航依赖外部标记物，操作繁琐且临床部署困难；现有无标记方案精度不足。 Method: 配准模块结合深度传感器误差校正、人机交互区域过滤和全局对齐，跟踪模块基于配准结果实时估计目标位姿。 Result: 系统在配准性能上优于工业方案，跟踪性能相当，适用于动态或静态手术场景。 Conclusion: 该框架为手术导航提供了一种高效、无标记的解决方案。 Abstract: The use of Augmented Reality (AR) devices for surgical guidance has gained increasing traction in the medical field. Traditional registration methods often rely on external fiducial markers to achieve high accuracy and real-time performance. However, these markers introduce cumbersome calibration procedures and can be challenging to deploy in clinical settings. While commercial solutions have attempted real-time markerless tracking using the native RGB cameras of AR devices, their accuracy remains questionable for medical guidance, primarily due to occlusions and significant outliers between the live sensor data and the preoperative target anatomy point cloud derived from MRI or CT scans. In this work, we present a markerless framework that relies only on the depth sensor of AR devices and consists of two modules: a registration module for high-precision, outlier-robust target anatomy localization, and a tracking module for real-time pose estimation. The registration module integrates depth sensor error correction, a human-in-the-loop region filtering technique, and a robust global alignment with curvature-aware feature sampling, followed by local ICP refinement, for markerless alignment of preoperative models with patient anatomy. The tracking module employs a fast and robust registration algorithm that uses the initial pose from the registration module to estimate the target pose in real-time. We comprehensively evaluated the performance of both modules through simulation and real-world measurements. The results indicate that our markerless system achieves superior performance for registration and comparable performance for tracking to industrial solutions. The two-module design makes our system a one-stop solution for surgical procedures where the target anatomy moves or stays static during surgery.

[243] PCM-SAR: Physics-Driven Contrastive Mutual Learning for SAR Classification

Pengfei Wang,Hao Zheng,Zhigang Hu,Aikun Xu,Meiguang Zheng,Liu Yang

Main category: cs.CV

TLDR: 提出了一种基于物理驱动的对比互学习SAR分类方法（PCM-SAR），通过结合SAR数据的物理特性改进样本生成和特征提取，显著提升了分类性能。

Details

Motivation: 现有基于对比学习的SAR图像分类方法通常依赖为光学图像设计的样本生成策略，未能捕捉SAR数据的独特语义和物理特性。 Method: PCM-SAR利用灰度共生矩阵（GLCM）模拟真实噪声模式，并通过语义检测进行无监督局部采样，同时采用多级特征融合机制进行特征表示优化。 Result: 实验表明，PCM-SAR在多种数据集和SAR分类任务中均优于现有最优方法。 Conclusion: PCM-SAR通过结合物理驱动和互学习机制，显著提升了SAR图像分类的性能，尤其对小模型效果显著。 Abstract: Existing SAR image classification methods based on Contrastive Learning often rely on sample generation strategies designed for optical images, failing to capture the distinct semantic and physical characteristics of SAR data. To address this, we propose Physics-Driven Contrastive Mutual Learning for SAR Classification (PCM-SAR), which incorporates domain-specific physical insights to improve sample generation and feature extraction. PCM-SAR utilizes the gray-level co-occurrence matrix (GLCM) to simulate realistic noise patterns and applies semantic detection for unsupervised local sampling, ensuring generated samples accurately reflect SAR imaging properties. Additionally, a multi-level feature fusion mechanism based on mutual learning enables collaborative refinement of feature representations. Notably, PCM-SAR significantly enhances smaller models by refining SAR feature representations, compensating for their limited capacity. Experimental results show that PCM-SAR consistently outperforms SOTA methods across diverse datasets and SAR classification tasks.

[244] Pillar-Voxel Fusion Network for 3D Object Detection in Airborne Hyperspectral Point Clouds

Yanze Jiang,Yanfeng Gu,Xian Li

Main category: cs.CV

TLDR: PiV-AHPC是一种针对机载高光谱点云（HPCs）的3D目标检测网络，通过双分支编码器和多级特征融合机制解决几何-光谱失真问题，表现出卓越的检测性能和泛化能力。

Details

Motivation: 现有HPCs生成方法因融合误差和遮挡导致几何-光谱失真，限制了其在精细任务中的表现，特别是在机载应用中。 Method: 提出PiV-AHPC网络，包含柱-体素双分支编码器，分别提取光谱和空间特征，并通过多级特征融合机制整合异构特征。 Result: 在两个机载HPCs数据集上的实验表明，PiV-AHPC具有最先进的检测性能和泛化能力。 Conclusion: PiV-AHPC首次解决了HPCs任务中的几何-光谱失真问题，为机载应用提供了高效解决方案。 Abstract: Hyperspectral point clouds (HPCs) can simultaneously characterize 3D spatial and spectral information of ground objects, offering excellent 3D perception and target recognition capabilities. Current approaches for generating HPCs often involve fusion techniques with hyperspectral images and LiDAR point clouds, which inevitably lead to geometric-spectral distortions due to fusion errors and obstacle occlusions. These adverse effects limit their performance in downstream fine-grained tasks across multiple scenarios, particularly in airborne applications. To address these issues, we propose PiV-AHPC, a 3D object detection network for airborne HPCs. To the best of our knowledge, this is the first attempt at this HPCs task. Specifically, we first develop a pillar-voxel dual-branch encoder, where the former captures spectral and vertical structural features from HPCs to overcome spectral distortion, while the latter emphasizes extracting accurate 3D spatial features from point clouds. A multi-level feature fusion mechanism is devised to enhance information interaction between the two branches, achieving neighborhood feature alignment and channel-adaptive selection, thereby organically integrating heterogeneous features and mitigating geometric distortion. Extensive experiments on two airborne HPCs datasets demonstrate that PiV-AHPC possesses state-of-the-art detection performance and high generalization capability.

[245] FVOS for MOSE Track of 4th PVUW Challenge: 3rd Place Solution

Mengjiao Wang,Junpei Zhang,Xu Liu,Yuting Yang,Mengru Ma

Main category: cs.CV

TLDR: 该论文提出了一种针对视频对象分割（VOS）的优化方法，通过微调现有模型和引入形态学后处理策略，提升了在复杂场景中的分割准确性。

Details

Motivation: 现有方法在复杂现实场景中表现不佳，论文旨在解决这一问题，提升视频对象分割的准确性。 Method: 提出FVOS方法，包括对现有模型的微调、形态学后处理策略以及多尺度分割结果的投票融合。 Result: 在验证和测试阶段分别达到76.81%和83.92%的J&F分数，在2025年PVUW挑战赛的MOSE赛道中排名第三。 Conclusion: 论文提出的方法显著提升了复杂场景下的视频对象分割性能，验证了其有效性。 Abstract: Video Object Segmentation (VOS) is one of the most fundamental and challenging tasks in computer vision and has a wide range of applications. Most existing methods rely on spatiotemporal memory networks to extract frame-level features and have achieved promising results on commonly used datasets. However, these methods often struggle in more complex real-world scenarios. This paper addresses this issue, aiming to achieve accurate segmentation of video objects in challenging scenes. We propose fine-tuning VOS (FVOS), optimizing existing methods for specific datasets through tailored training. Additionally, we introduce a morphological post-processing strategy to address the issue of excessively large gaps between adjacent objects in single-model predictions. Finally, we apply a voting-based fusion method on multi-scale segmentation results to generate the final output. Our approach achieves J&F scores of 76.81% and 83.92% during the validation and testing stages, respectively, securing third place overall in the MOSE Track of the 4th PVUW challenge 2025.

[246] DiffuMural: Restoring Dunhuang Murals with Multi-scale Diffusion

Puyu Han,Jiaju Kang,Yuhang Pan,Erting Pan,Zeyu Zhang,Qunchao Jin,Juntao Jiang,Zhichen Liu,Luqi Gong

Main category: cs.CV

TLDR: DiffuMural是一种结合多尺度收敛和协作扩散机制的模型，用于优化古壁画修复任务，解决了大缺损区域和训练样本稀缺的问题，并在定量和定性评估中优于现有方法。

Details

Motivation: 古壁画修复作为条件图像生成的重要下游任务，面临大缺损区域和训练样本稀缺的挑战，且缺乏评估修复效果的启发式指标。 Method: 提出DiffuMural模型，结合多尺度收敛和协作扩散机制，使用ControlNet和循环一致性损失优化生成图像与条件控制的匹配。 Result: 在23幅敦煌壁画数据上训练，模型在细节修复、整体一致性和文化价值保留方面表现优异，定量评估框架验证其优于现有方法。 Conclusion: DiffuMural在古壁画修复中表现出色，解决了现有挑战，并通过人文价值评估确保修复结果的文化艺术意义。 Abstract: Large-scale pre-trained diffusion models have produced excellent results in the field of conditional image generation. However, restoration of ancient murals, as an important downstream task in this field, poses significant challenges to diffusion model-based restoration methods due to its large defective area and scarce training samples. Conditional restoration tasks are more concerned with whether the restored part meets the aesthetic standards of mural restoration in terms of overall style and seam detail, and such metrics for evaluating heuristic image complements are lacking in current research. We therefore propose DiffuMural, a combined Multi-scale convergence and Collaborative Diffusion mechanism with ControlNet and cyclic consistency loss to optimise the matching between the generated images and the conditional control. DiffuMural demonstrates outstanding capabilities in mural restoration, leveraging training data from 23 large-scale Dunhuang murals that exhibit consistent visual aesthetics. The model excels in restoring intricate details, achieving a coherent overall appearance, and addressing the unique challenges posed by incomplete murals lacking factual grounding. Our evaluation framework incorporates four key metrics to quantitatively assess incomplete murals: factual accuracy, textural detail, contextual semantics, and holistic visual coherence. Furthermore, we integrate humanistic value assessments to ensure the restored murals retain their cultural and artistic significance. Extensive experiments validate that our method outperforms state-of-the-art (SOTA) approaches in both qualitative and quantitative metrics.

[247] Capturing Longitudinal Changes in Brain Morphology Using Temporally Parameterized Neural Displacement Fields

Aisha L. Shuaibu,Kieran A. Gibb,Peter A. Wijeratne,Ivor J. A. Simpson

Main category: cs.CV

TLDR: 提出了一种基于神经位移场的纵向图像配准方法，用于连续建模脑部形态变化。

Details

Motivation: 研究脑部形态的时序变化对监测生长或萎缩很重要，但现有方法受噪声和小解剖变化的限制。 Method: 使用多层感知机实现隐式神经表示（INR），建模连续变形场，并通过导数正则化确保生物合理性。 Result: 在4D脑部MR配准中验证了方法的有效性。 Conclusion: 该方法能更准确地建模脑部形态的连续变化。 Abstract: Longitudinal image registration enables studying temporal changes in brain morphology which is useful in applications where monitoring the growth or atrophy of specific structures is important. However this task is challenging due to; noise/artifacts in the data and quantifying small anatomical changes between sequential scans. We propose a novel longitudinal registration method that models structural changes using temporally parameterized neural displacement fields. Specifically, we implement an implicit neural representation (INR) using a multi-layer perceptron that serves as a continuous coordinate-based approximation of the deformation field at any time point. In effect, for any N scans of a particular subject, our model takes as input a 3D spatial coordinate location x, y, z and a corresponding temporal representation t and learns to describe the continuous morphology of structures for both observed and unobserved points in time. Furthermore, we leverage the analytic derivatives of the INR to derive a new regularization function that enforces monotonic rate of change in the trajectory of the voxels, which is shown to provide more biologically plausible patterns. We demonstrate the effectiveness of our method on 4D brain MR registration.

[248] 3D CoCa: Contrastive Learners are 3D Captioners

Ting Huang,Zeyu Zhang,Yemin Wang,Hao Tang

Main category: cs.CV

TLDR: 3D CoCa是一个结合对比学习与3D场景描述的框架，显著提升了3D场景描述的性能。

Details

Motivation: 解决3D场景描述中点云稀疏性和跨模态对齐弱的问题。 Method: 结合CLIP视觉语言模型、空间感知3D编码器和多模态解码器，联合优化对比与描述目标。 Result: 在ScanRefer和Nr3D基准上，CIDEr分数分别提升10.2%和5.76%。 Conclusion: 3D CoCa通过联合训练实现了更强的空间推理和语义对齐，性能显著优于现有方法。 Abstract: 3D captioning, which aims to describe the content of 3D scenes in natural language, remains highly challenging due to the inherent sparsity of point clouds and weak cross-modal alignment in existing methods. To address these challenges, we propose 3D CoCa, a novel unified framework that seamlessly combines contrastive vision-language learning with 3D caption generation in a single architecture. Our approach leverages a frozen CLIP vision-language backbone to provide rich semantic priors, a spatially-aware 3D scene encoder to capture geometric context, and a multi-modal decoder to generate descriptive captions. Unlike prior two-stage methods that rely on explicit object proposals, 3D CoCa jointly optimizes contrastive and captioning objectives in a shared feature space, eliminating the need for external detectors or handcrafted proposals. This joint training paradigm yields stronger spatial reasoning and richer semantic grounding by aligning 3D and textual representations. Extensive experiments on the ScanRefer and Nr3D benchmarks demonstrate that 3D CoCa significantly outperforms current state-of-the-arts by 10.2% and 5.76% in CIDEr at 0.5IoU, respectively. Code will be available at https://github.com/AIGeeksGroup/3DCoCa.

[249] AeroLite: Tag-Guided Lightweight Generation of Aerial Image Captions

Xing Zi,Tengjun Ni,Xianjing Fan,Xian Tao,Jun Li,Ali Braytee,Mukesh Prasad

Main category: cs.CV

TLDR: AeroLite是一个轻量级的标签引导框架，用于为小规模语言模型提供遥感图像的自动标题生成能力，通过结合语义标签和视觉嵌入，显著提升了性能并降低了计算成本。

Details

Motivation: 遥感图像的自动标题生成在环境监测、城市规划等领域至关重要，但由于复杂的空间语义和领域变异性，这一任务仍具挑战性。 Method: AeroLite利用GPT-4o生成伪标题数据集，结合多标签CLIP编码器提取语义标签，并通过多层感知机融合视觉和语义信息，采用两阶段LoRA训练方法。 Result: AeroLite在BLEU和METEOR等指标上优于更大的模型（如13B参数），同时计算成本显著降低。 Conclusion: AeroLite为小规模语言模型提供了一种高效、可解释的遥感图像标题生成解决方案。 Abstract: Accurate and automated captioning of aerial imagery is crucial for applications like environmental monitoring, urban planning, and disaster management. However, this task remains challenging due to complex spatial semantics and domain variability. To address these issues, we introduce \textbf{AeroLite}, a lightweight, tag-guided captioning framework designed to equip small-scale language models (1--3B parameters) with robust and interpretable captioning capabilities specifically for remote sensing images. \textbf{AeroLite} leverages GPT-4o to generate a large-scale, semantically rich pseudo-caption dataset by integrating multiple remote sensing benchmarks, including DLRSD, iSAID, LoveDA, WHU, and RSSCN7. To explicitly capture key semantic elements such as orientation and land-use types, AeroLite employs natural language processing techniques to extract relevant semantic tags. These tags are then learned by a dedicated multi-label CLIP encoder, ensuring precise semantic predictions. To effectively fuse visual and semantic information, we propose a novel bridging multilayer perceptron (MLP) architecture, aligning semantic tags with visual embeddings while maintaining minimal computational overhead. AeroLite's flexible design also enables seamless integration with various pretrained large language models. We adopt a two-stage LoRA-based training approach: the initial stage leverages our pseudo-caption dataset to capture broad remote sensing semantics, followed by fine-tuning on smaller, curated datasets like UCM and Sydney Captions to refine domain-specific alignment. Experimental evaluations demonstrate that AeroLite surpasses significantly larger models (e.g., 13B parameters) in standard captioning metrics, including BLEU and METEOR, while maintaining substantially lower computational costs.

[250] Trajectory-guided Motion Perception for Facial Expression Quality Assessment in Neurological Disorders

Shuchao Duan,Amirhossein Dadashzadeh,Alan Whone,Majid Mirmehdi

Main category: cs.CV

TLDR: 提出TraMP-Former框架，结合面部标志点轨迹和RGB帧语义信息，用于神经疾病中的面部表情质量评估，性能显著提升。

Details

Motivation: 神经疾病中面部表情的细微变化对诊断至关重要，但现有方法难以捕捉这些细微动作。 Method: 利用面部标志点轨迹特征与RGB帧语义信息融合，通过TraMP-Former框架回归为质量评分。 Result: 在PFED5和Toronto NeuroFace数据集上分别提升6.51%和7.62%，达到新SOTA。 Conclusion: TraMP-Former通过轨迹特征和视觉语义融合，显著提升了面部表情质量评估的性能。 Abstract: Automated facial expression quality assessment (FEQA) in neurological disorders is critical for enhancing diagnostic accuracy and improving patient care, yet effectively capturing the subtle motions and nuances of facial muscle movements remains a challenge. We propose to analyse facial landmark trajectories, a compact yet informative representation, that encodes these subtle motions from a high-level structural perspective. Hence, we introduce Trajectory-guided Motion Perception Transformer (TraMP-Former), a novel FEQA framework that fuses landmark trajectory features for fine-grained motion capture with visual semantic cues from RGB frames, ultimately regressing the combined features into a quality score. Extensive experiments demonstrate that TraMP-Former achieves new state-of-the-art performance on benchmark datasets with neurological disorders, including PFED5 (up by 6.51%) and an augmented Toronto NeuroFace (up by 7.62%). Our ablation studies further validate the efficiency and effectiveness of landmark trajectories in FEQA. Our code is available at https://github.com/shuchaoduan/TraMP-Former.

[251] FastRSR: Efficient and Accurate Road Surface Reconstruction from Bird's Eye View

Yuting Zhao,Yuheng Ji,Xiaoshuai Hao,Shuxiao Li

Main category: cs.CV

TLDR: 论文提出了两种高效的BEV（鸟瞰图）道路表面重建模型FastRSR-mono和FastRSR-stereo，通过Depth-Aware Projection（DAP）减少信息丢失和稀疏性，并通过SAE和CAG模块优化立体匹配的精度和速度。

Details

Motivation: 现有方法在将视角视图转换为BEV时存在信息丢失和表示稀疏的问题，且立体匹配在精度和推理速度之间难以平衡。 Method: 提出DAP策略以减少信息丢失和稀疏性，并设计SAE和CAG模块优化立体匹配的精度和速度。 Result: FastRSR在RSRD数据集上表现优异，单目方法提升6.0%的绝对高程误差，立体方法速度提升至少3.0倍。 Conclusion: FastRSR模型在道路表面重建中实现了高效和准确，为自动驾驶提供了重要支持。 Abstract: Road Surface Reconstruction (RSR) is crucial for autonomous driving, enabling the understanding of road surface conditions. Recently, RSR from the Bird's Eye View (BEV) has gained attention for its potential to enhance performance. However, existing methods for transforming perspective views to BEV face challenges such as information loss and representation sparsity. Moreover, stereo matching in BEV is limited by the need to balance accuracy with inference speed. To address these challenges, we propose two efficient and accurate BEV-based RSR models: FastRSR-mono and FastRSR-stereo. Specifically, we first introduce Depth-Aware Projection (DAP), an efficient view transformation strategy designed to mitigate information loss and sparsity by querying depth and image features to aggregate BEV data within specific road surface regions using a pre-computed look-up table. To optimize accuracy and speed in stereo matching, we design the Spatial Attention Enhancement (SAE) and Confidence Attention Generation (CAG) modules. SAE adaptively highlights important regions, while CAG focuses on high-confidence predictions and filters out irrelevant information. FastRSR achieves state-of-the-art performance, exceeding monocular competitors by over 6.0% in elevation absolute error and providing at least a 3.0x speedup by stereo methods on the RSRD dataset. The source code will be released.

[252] EmbodiedOcc++: Boosting Embodied 3D Occupancy Prediction with Plane Regularization and Uncertainty Sampler

Hao Wang,Xiaobao Wei,Xiaoan Zhang,Jianing Li,Chengyu Bai,Ying Li,Ming Lu,Wenzhao Zheng,Shanghang Zhang

Main category: cs.CV

TLDR: EmbodiedOcc++改进原框架，通过几何引导细化模块和语义感知不确定性采样器，提升3D占用预测的几何一致性和准确性。

Details

Motivation: 原框架EmbodiedOcc忽略了室内环境的几何特征，尤其是平面结构，影响了预测的准确性。 Method: 引入GRM模块通过平面正则化约束高斯更新，SUS模块优化重叠区域的高斯更新。 Result: 在EmbodiedOcc-ScanNet基准测试中达到最优性能，边缘精度和几何细节保留显著提升。 Conclusion: EmbodiedOcc++在计算效率和几何一致性上表现优异，适用于在线感知任务。 Abstract: Online 3D occupancy prediction provides a comprehensive spatial understanding of embodied environments. While the innovative EmbodiedOcc framework utilizes 3D semantic Gaussians for progressive indoor occupancy prediction, it overlooks the geometric characteristics of indoor environments, which are primarily characterized by planar structures. This paper introduces EmbodiedOcc++, enhancing the original framework with two key innovations: a Geometry-guided Refinement Module (GRM) that constrains Gaussian updates through plane regularization, along with a Semantic-aware Uncertainty Sampler (SUS) that enables more effective updates in overlapping regions between consecutive frames. GRM regularizes the position update to align with surface normals. It determines the adaptive regularization weight using curvature-based and depth-based constraints, allowing semantic Gaussians to align accurately with planar surfaces while adapting in complex regions. To effectively improve geometric consistency from different views, SUS adaptively selects proper Gaussians to update. Comprehensive experiments on the EmbodiedOcc-ScanNet benchmark demonstrate that EmbodiedOcc++ achieves state-of-the-art performance across different settings. Our method demonstrates improved edge accuracy and retains more geometric details while ensuring computational efficiency, which is essential for online embodied perception. The code will be released at: https://github.com/PKUHaoWang/EmbodiedOcc2.

[253] SD-ReID: View-aware Stable Diffusion for Aerial-Ground Person Re-Identification

Xiang Hu,Pingping Zhang,Yuhao Wang,Bin Yan,Huchuan Lu

Main category: cs.CV

TLDR: 论文提出了一种名为SD-ReID的两阶段特征学习框架，利用生成模型（如Stable Diffusion）生成视角特定特征，以解决空中-地面行人重识别（AG-ReID）中视角变化带来的挑战。

Details

Motivation: 现有方法忽视了视角特定特征对行人表征能力的提升，且设计视角鲁棒网络具有挑战性。 Method: SD-ReID分为两阶段：1）训练ViT模型提取粗粒度表示和可控条件；2）微调SD模型学习互补表示，并引入View-Refine Decoder生成跨视角特征。 Result: 在AG-ReID基准测试中验证了SD-ReID的有效性。 Conclusion: SD-ReID通过生成视角特定特征和跨视角特征，显著提升了行人重识别的性能。 Abstract: Aerial-Ground Person Re-IDentification (AG-ReID) aims to retrieve specific persons across cameras with different viewpoints. Previous works focus on designing discriminative ReID models to maintain identity consistency despite drastic changes in camera viewpoints. The core idea behind these methods is quite natural, but designing a view-robust network is a very challenging task. Moreover, they overlook the contribution of view-specific features in enhancing the model's capability to represent persons. To address these issues, we propose a novel two-stage feature learning framework named SD-ReID for AG-ReID, which takes advantage of the powerful understanding capacity of generative models, e.g., Stable Diffusion (SD), to generate view-specific features between different viewpoints. In the first stage, we train a simple ViT-based model to extract coarse-grained representations and controllable conditions. Then, in the second stage, we fine-tune the SD model to learn complementary representations guided by the controllable conditions. Furthermore, we propose the View-Refine Decoder (VRD) to obtain additional controllable conditions to generate missing cross-view features. Finally, we use the coarse-grained representations and all-view features generated by SD to retrieve target persons. Extensive experiments on the AG-ReID benchmarks demonstrate the effectiveness of our proposed SD-ReID. The source code will be available upon acceptance.

[254] Mitigating Long-tail Distribution in Oracle Bone Inscriptions: Dataset, Model, and Benchmark

Jinhao Li,Zijian Chen,Runze Dong,Tingzhu Chen,Changbo Wang,Guangtao Zhai

Main category: cs.CV

TLDR: 论文提出了一种结构对齐的甲骨文数据集Oracle-P15K和基于扩散模型的伪甲骨文生成器OBIDiff，以解决甲骨文识别中的长尾分布问题。

Details

Motivation: 甲骨文识别对理解中国古代历史和文化至关重要，但现有数据集存在长尾分布问题，导致模型在多数类和少数类上的性能偏差。 Method: 1. 构建了包含14,542张图像的结构对齐甲骨文数据集Oracle-P15K；2. 提出了基于扩散模型的伪甲骨文生成器OBIDiff，用于生成真实且可控的甲骨文图像。 Result: 实验证明Oracle-P15K数据集的有效性，OBIDiff能准确保留字形结构并有效转换真实的拓片风格。 Conclusion: Oracle-P15K和OBIDiff为解决甲骨文识别中的长尾分布问题提供了有效工具。 Abstract: The oracle bone inscription (OBI) recognition plays a significant role in understanding the history and culture of ancient China. However, the existing OBI datasets suffer from a long-tail distribution problem, leading to biased performance of OBI recognition models across majority and minority classes. With recent advancements in generative models, OBI synthesis-based data augmentation has become a promising avenue to expand the sample size of minority classes. Unfortunately, current OBI datasets lack large-scale structure-aligned image pairs for generative model training. To address these problems, we first present the Oracle-P15K, a structure-aligned OBI dataset for OBI generation and denoising, consisting of 14,542 images infused with domain knowledge from OBI experts. Second, we propose a diffusion model-based pseudo OBI generator, called OBIDiff, to achieve realistic and controllable OBI generation. Given a clean glyph image and a target rubbing-style image, it can effectively transfer the noise style of the original rubbing to the glyph image. Extensive experiments on OBI downstream tasks and user preference studies show the effectiveness of the proposed Oracle-P15K dataset and demonstrate that OBIDiff can accurately preserve inherent glyph structures while transferring authentic rubbing styles effectively.

[255] TextSplat: Text-Guided Semantic Fusion for Generalizable Gaussian Splatting

Zhicong Wu,Hongbin Xu,Gang Xu,Ping Nie,Zhixin Yan,Jinkai Zheng,Liangqiong Qu,Ming Li,Liqiang Nie

Main category: cs.CV

TLDR: TextSplat是一种基于文本驱动的通用高斯泼溅框架，通过融合多模态语义线索提升3D重建的几何和语义一致性。

Details

Motivation: 现有方法多关注几何一致性，忽略了文本驱动指导对语义理解的潜力，导致复杂场景中细节重建不准确。 Method: 框架包含三个并行模块（深度估计、语义分割、多视图交互）和一个文本引导的语义融合模块，通过注意力机制整合特征。 Result: 实验表明，TextSplat在多个基准数据集上优于现有方法，验证了其有效性。 Conclusion: TextSplat通过文本驱动和多模态特征融合，显著提升了3D重建的语义和几何质量。 Abstract: Recent advancements in Generalizable Gaussian Splatting have enabled robust 3D reconstruction from sparse input views by utilizing feed-forward Gaussian Splatting models, achieving superior cross-scene generalization. However, while many methods focus on geometric consistency, they often neglect the potential of text-driven guidance to enhance semantic understanding, which is crucial for accurately reconstructing fine-grained details in complex scenes. To address this limitation, we propose TextSplat--the first text-driven Generalizable Gaussian Splatting framework. By employing a text-guided fusion of diverse semantic cues, our framework learns robust cross-modal feature representations that improve the alignment of geometric and semantic information, producing high-fidelity 3D reconstructions. Specifically, our framework employs three parallel modules to obtain complementary representations: the Diffusion Prior Depth Estimator for accurate depth information, the Semantic Aware Segmentation Network for detailed semantic information, and the Multi-View Interaction Network for refined cross-view features. Then, in the Text-Guided Semantic Fusion Module, these representations are integrated via the text-guided and attention-based feature aggregation mechanism, resulting in enhanced 3D Gaussian parameters enriched with detailed semantic cues. Experimental results on various benchmark datasets demonstrate improved performance compared to existing methods across multiple evaluation metrics, validating the effectiveness of our framework. The code will be publicly available.

[256] DualPrompt-MedCap: A Dual-Prompt Enhanced Approach for Medical Image Captioning

Yining Zhao,Ali Braytee,Mukesh Prasad

Main category: cs.CV

TLDR: DualPrompt-MedCap通过双提示增强框架提升医学图像描述生成能力，显著提高模态识别准确性和描述质量。

Details

Motivation: 医学图像描述生成在临床诊断辅助中潜力巨大，但生成上下文相关且模态识别准确的描述仍具挑战性。 Method: 提出DualPrompt-MedCap框架，包含模态感知提示和问题引导提示，结合半监督分类模型和生物医学语言模型嵌入。 Result: 在多医学数据集上，DualPrompt-MedCap比基线BLIP-3模态识别准确率提升22%，生成更全面且问题对齐的描述。 Conclusion: 该方法能生成临床准确的报告，可作为医学专家先验知识和下游视觉语言任务的自动标注。 Abstract: Medical image captioning via vision-language models has shown promising potential for clinical diagnosis assistance. However, generating contextually relevant descriptions with accurate modality recognition remains challenging. We present DualPrompt-MedCap, a novel dual-prompt enhancement framework that augments Large Vision-Language Models (LVLMs) through two specialized components: (1) a modality-aware prompt derived from a semi-supervised classification model pretrained on medical question-answer pairs, and (2) a question-guided prompt leveraging biomedical language model embeddings. To address the lack of captioning ground truth, we also propose an evaluation framework that jointly considers spatial-semantic relevance and medical narrative quality. Experiments on multiple medical datasets demonstrate that DualPrompt-MedCap outperforms the baseline BLIP-3 by achieving a 22% improvement in modality recognition accuracy while generating more comprehensive and question-aligned descriptions. Our method enables the generation of clinically accurate reports that can serve as medical experts' prior knowledge and automatic annotations for downstream vision-language tasks.

[257] Mixture-of-Shape-Experts (MoSE): End-to-End Shape Dictionary Framework to Prompt SAM for Generalizable Medical Segmentation

Jia Wei,Xiaoqi Zhao,Jonghye Woo,Jinsong Ouyang,Georges El Fakhri,Qingyu Chen,Xiaofeng Liu

Main category: cs.CV

TLDR: 提出了一种名为MoSE的新框架，通过混合专家（MoE）训练与字典学习结合，高效捕获多样且鲁棒的形状先验，并利用SAM的泛化能力。

Details

Motivation: 解决现有字典学习方法在医学图像分割中因形状元素有限或过拟合而表现不佳的问题，同时兼容SAM等大型基础模型。 Method: 将字典原子视为形状专家，通过门控网络动态融合这些专家生成鲁棒形状图，并利用SAM编码引导稀疏激活以防止过拟合。 Result: 在多个公开数据集上的实验证明了该方法的有效性。 Conclusion: MoSE框架成功整合了形状先验与SAM的泛化能力，为医学图像分割提供了高效且鲁棒的解决方案。 Abstract: Single domain generalization (SDG) has recently attracted growing attention in medical image segmentation. One promising strategy for SDG is to leverage consistent semantic shape priors across different imaging protocols, scanner vendors, and clinical sites. However, existing dictionary learning methods that encode shape priors often suffer from limited representational power with a small set of offline computed shape elements, or overfitting when the dictionary size grows. Moreover, they are not readily compatible with large foundation models such as the Segment Anything Model (SAM). In this paper, we propose a novel Mixture-of-Shape-Experts (MoSE) framework that seamlessly integrates the idea of mixture-of-experts (MoE) training into dictionary learning to efficiently capture diverse and robust shape priors. Our method conceptualizes each dictionary atom as a shape expert, which specializes in encoding distinct semantic shape information. A gating network dynamically fuses these shape experts into a robust shape map, with sparse activation guided by SAM encoding to prevent overfitting. We further provide this shape map as a prompt to SAM, utilizing the powerful generalization capability of SAM through bidirectional integration. All modules, including the shape dictionary, are trained in an end-to-end manner. Extensive experiments on multiple public datasets demonstrate its effectiveness.

[258] Early-Bird Diffusion: Investigating and Leveraging Timestep-Aware Early-Bird Tickets in Diffusion Models for Efficient Training

Lexington Whalen,Zhenbang Du,Haoran You,Chaojian Li,Sixu Li,Yingyan,Lin

Main category: cs.CV

TLDR: EB-Diff-Train是一种高效的扩散模型训练方法，利用早期稀疏子网络（EB tickets）减少计算资源消耗，同时保持生成质量。

Details

Motivation: 扩散模型训练需要大量计算资源，因此研究高效训练技术至关重要。 Method: 研究传统和扩散专用的EB tickets，根据时间步区域重要性调整稀疏度，并行训练并组合使用。 Result: 实验证明EB-Diff-Train显著减少训练时间（2.9×至5.8×加速），且不损失生成质量。 Conclusion: EB-Diff-Train是一种高效且有效的扩散模型训练方法，适用于资源受限的场景。 Abstract: Training diffusion models (DMs) requires substantial computational resources due to multiple forward and backward passes across numerous timesteps, motivating research into efficient training techniques. In this paper, we propose EB-Diff-Train, a new efficient DM training approach that is orthogonal to other methods of accelerating DM training, by investigating and leveraging Early-Bird (EB) tickets -- sparse subnetworks that manifest early in the training process and maintain high generation quality. We first investigate the existence of traditional EB tickets in DMs, enabling competitive generation quality without fully training a dense model. Then, we delve into the concept of diffusion-dedicated EB tickets, drawing on insights from varying importance of different timestep regions. These tickets adapt their sparsity levels according to the importance of corresponding timestep regions, allowing for aggressive sparsity during non-critical regions while conserving computational resources for crucial timestep regions. Building on this, we develop an efficient DM training technique that derives timestep-aware EB tickets, trains them in parallel, and combines them during inference for image generation. Extensive experiments validate the existence of both traditional and timestep-aware EB tickets, as well as the effectiveness of our proposed EB-Diff-Train method. This approach can significantly reduce training time both spatially and temporally -- achieving 2.9$\times$ to 5.8$\times$ speedups over training unpruned dense models, and up to 10.3$\times$ faster training compared to standard train-prune-finetune pipelines -- without compromising generative quality. Our code is available at https://github.com/GATECH-EIC/Early-Bird-Diffusion.

[259] ERL-MPP: Evolutionary Reinforcement Learning with Multi-head Puzzle Perception for Solving Large-scale Jigsaw Puzzles of Eroded Gaps

Xingke Song,Xiaoying Yang,Chenglin Yao,Jianfeng Ren,Ruibin Bai,Xin Chen,Xudong Jiang

Main category: cs.CV

TLDR: 论文提出了一种结合进化强化学习和多头拼图感知的框架（ERL-MPP），用于解决大规模带间隙拼图的挑战，显著优于现有方法。

Details

Motivation: 现有模型主要解决小规模或无间隙拼图，而大规模带间隙拼图在图像理解和组合优化方面存在独特挑战。 Method: 设计了多头拼图感知网络（MPPN）和进化强化学习（EvoRL）代理，分别用于感知拼图状态和高效探索动作空间。 Result: 在JPLEG-5和MIT数据集上显著优于所有现有模型。 Conclusion: ERL-MPP框架有效解决了大规模带间隙拼图的挑战，具有优越性能。 Abstract: Solving jigsaw puzzles has been extensively studied. While most existing models focus on solving either small-scale puzzles or puzzles with no gap between fragments, solving large-scale puzzles with gaps presents distinctive challenges in both image understanding and combinatorial optimization. To tackle these challenges, we propose a framework of Evolutionary Reinforcement Learning with Multi-head Puzzle Perception (ERL-MPP) to derive a better set of swapping actions for solving the puzzles. Specifically, to tackle the challenges of perceiving the puzzle with gaps, a Multi-head Puzzle Perception Network (MPPN) with a shared encoder is designed, where multiple puzzlet heads comprehensively perceive the local assembly status, and a discriminator head provides a global assessment of the puzzle. To explore the large swapping action space efficiently, an Evolutionary Reinforcement Learning (EvoRL) agent is designed, where an actor recommends a set of suitable swapping actions from a large action space based on the perceived puzzle status, a critic updates the actor using the estimated rewards and the puzzle status, and an evaluator coupled with evolutionary strategies evolves the actions aligning with the historical assembly experience. The proposed ERL-MPP is comprehensively evaluated on the JPLEG-5 dataset with large gaps and the MIT dataset with large-scale puzzles. It significantly outperforms all state-of-the-art models on both datasets.

[260] Tokenize Image Patches: Global Context Fusion for Effective Haze Removal in Large Images

Jiuchen Chen,Xinyu Yan,Qizhi Xu,Kaiqi Li

Main category: cs.CV

TLDR: DehazeXL提出了一种新的去雾方法，有效平衡全局上下文和局部特征提取，支持大图像端到端建模，并创建了超高分辨率去雾数据集8KDehaze。

Details

Motivation: 现有深度学习模型在处理大尺寸高分辨率图像时因GPU内存限制表现不佳，常需切片或降采样，导致全局信息或高频细节丢失。 Method: 提出DehazeXL方法，结合全局上下文和局部特征提取，设计视觉归因方法评估去雾性能，并构建8KDehaze数据集。 Result: DehazeXL在21GB内存下可处理10240×10240像素图像，性能优于其他方法。 Conclusion: DehazeXL解决了大图像去雾问题，提供了高效且性能优越的解决方案，并开源代码和数据集。 Abstract: Global contextual information and local detail features are essential for haze removal tasks. Deep learning models perform well on small, low-resolution images, but they encounter difficulties with large, high-resolution ones due to GPU memory limitations. As a compromise, they often resort to image slicing or downsampling. The former diminishes global information, while the latter discards high-frequency details. To address these challenges, we propose DehazeXL, a haze removal method that effectively balances global context and local feature extraction, enabling end-to-end modeling of large images on mainstream GPU hardware. Additionally, to evaluate the efficiency of global context utilization in haze removal performance, we design a visual attribution method tailored to the characteristics of haze removal tasks. Finally, recognizing the lack of benchmark datasets for haze removal in large images, we have developed an ultra-high-resolution haze removal dataset (8KDehaze) to support model training and testing. It includes 10000 pairs of clear and hazy remote sensing images, each sized at 8192 $\times$ 8192 pixels. Extensive experiments demonstrate that DehazeXL can infer images up to 10240 $\times$ 10240 pixels with only 21 GB of memory, achieving state-of-the-art results among all evaluated methods. The source code and experimental dataset are available at https://github.com/CastleChen339/DehazeXL.

[261] Ges3ViG: Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding

Atharv Mahesh Mane,Dulanga Weerakoon,Vigneshwaran Subbaraju,Sougata Sen,Sanjay E. Sarma,Archan Misra

Main category: cs.CV

TLDR: 3D-ERU结合语言描述和指向手势识别3D场景中的目标物体，填补了现有研究的空白。提出了数据增强框架Imputer和新数据集ImputeRefer，以及模型Ges3ViG，性能显著提升。

Details

Motivation: 现有研究主要关注纯语言3D定位，但结合指向手势的3D-ERU研究较少。 Method: 引入数据增强框架Imputer，创建新数据集ImputeRefer，并提出模型Ges3ViG。 Result: Ges3ViG在3D-ERU任务中准确率提升约30%，在纯语言3D定位任务中提升约9%。 Conclusion: 3D-ERU结合指向手势显著提升性能，新数据集和模型为未来研究提供了基础。 Abstract: 3-Dimensional Embodied Reference Understanding (3D-ERU) combines a language description and an accompanying pointing gesture to identify the most relevant target object in a 3D scene. Although prior work has explored pure language-based 3D grounding, there has been limited exploration of 3D-ERU, which also incorporates human pointing gestures. To address this gap, we introduce a data augmentation framework-Imputer, and use it to curate a new benchmark dataset-ImputeRefer for 3D-ERU, by incorporating human pointing gestures into existing 3D scene datasets that only contain language instructions. We also propose Ges3ViG, a novel model for 3D-ERU that achieves ~30% improvement in accuracy as compared to other 3D-ERU models and ~9% compared to other purely language-based 3D grounding models. Our code and dataset are available at https://github.com/AtharvMane/Ges3ViG.

[262] TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning

Xingjian Zhang,Siwei Wen,Wenjun Wu,Lei Huang

Main category: cs.CV

TLDR: 论文提出了一种小型视频推理模型TinyLLaVA-Video-R1，通过强化学习提升其在通用视频问答数据集上的推理能力，并展示了‘顿悟时刻’的涌现特性。

Details

Motivation: 探索小型模型的推理能力对计算资源有限的研究者具有价值，同时让模型在通用问答数据集上解释其推理过程也具有重要意义。 Method: 基于TinyLLaVA-Video（参数不超过4B的视频理解模型），通过强化学习提升其推理能力。 Result: 模型在通用视频问答数据集上表现出显著提升的推理和思考能力，并展现出‘顿悟时刻’特性。 Conclusion: 研究为未来探索小型模型的视频推理能力提供了实用见解，模型已开源。 Abstract: Recently, improving the reasoning ability of large multimodal models (LMMs) through reinforcement learning has made great progress. However, most existing works are based on highly reasoning-intensive datasets such as mathematics and code, and researchers generally choose large-scale models as the foundation. We argue that exploring small-scale models' reasoning capabilities remains valuable for researchers with limited computational resources. Moreover, enabling models to explain their reasoning processes on general question-answering datasets is equally meaningful. Therefore, we present the small-scale video reasoning model TinyLLaVA-Video-R1. Based on TinyLLaVA-Video, a traceably trained video understanding model with no more than 4B parameters, it not only demonstrates significantly improved reasoning and thinking capabilities after using reinforcement learning on general Video-QA datasets, but also exhibits the emergent characteristic of "aha moments". Furthermore, we share a series of experimental findings, aiming to provide practical insights for future exploration of video reasoning (thinking) abilities in small-scale models. It is available at https://github.com/ZhangXJ199/TinyLLaVA-Video-R1.

[263] SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model

Kaiyu Li,Zepeng Xin,Li Pang,Chao Pang,Yupeng Deng,Jing Yao,Guisong Xia,Deyu Meng,Zhi Wang,Xiangyong Cao

Main category: cs.CV

TLDR: 论文提出了一种新任务——地理空间像素推理，并构建了首个大规模基准数据集EarthReason，同时提出了SegEarth-R1模型，显著优于传统和基于LLM的分割方法。

Details

Motivation: 传统遥感工作流难以处理复杂的隐式查询，需要结合空间上下文、领域知识和用户意图进行推理。 Method: 提出SegEarth-R1模型，整合了分层视觉编码器、大型语言模型（LLM）和定制化的掩码生成器，并针对遥感图像进行了领域特定优化。 Result: SegEarth-R1在推理和参考分割任务上达到最先进性能，显著优于传统方法。 Conclusion: 论文通过新任务和模型推动了遥感图像处理的发展，并开源了数据和代码。 Abstract: Remote sensing has become critical for understanding environmental dynamics, urban planning, and disaster management. However, traditional remote sensing workflows often rely on explicit segmentation or detection methods, which struggle to handle complex, implicit queries that require reasoning over spatial context, domain knowledge, and implicit user intent. Motivated by this, we introduce a new task, \ie, geospatial pixel reasoning, which allows implicit querying and reasoning and generates the mask of the target region. To advance this task, we construct and release the first large-scale benchmark dataset called EarthReason, which comprises 5,434 manually annotated image masks with over 30,000 implicit question-answer pairs. Moreover, we propose SegEarth-R1, a simple yet effective language-guided segmentation baseline that integrates a hierarchical visual encoder, a large language model (LLM) for instruction parsing, and a tailored mask generator for spatial correlation. The design of SegEarth-R1 incorporates domain-specific adaptations, including aggressive visual token compression to handle ultra-high-resolution remote sensing images, a description projection module to fuse language and multi-scale features, and a streamlined mask prediction pipeline that directly queries description embeddings. Extensive experiments demonstrate that SegEarth-R1 achieves state-of-the-art performance on both reasoning and referring segmentation tasks, significantly outperforming traditional and LLM-based segmentation methods. Our data and code will be released at https://github.com/earth-insights/SegEarth-R1.

[264] KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation

Xingrui Wang,Jiang Liu,Ze Wang,Xiaodong Yu,Jialian Wu,Ximeng Sun,Yusheng Su,Alan Yuille,Zicheng Liu,Emad Barsoum

Main category: cs.CV

TLDR: KeyVID是一个关键帧感知的音频到视觉动画框架，通过定位音频中的关键帧时间步长并生成对应的视觉关键帧，显著提高了动态运动中的生成质量。

Details

Motivation: 现有音频到视觉动画模型使用均匀采样的帧，无法在低帧率下捕捉关键动态时刻，且直接增加帧数会占用大量内存。 Method: 首先从音频中定位关键帧时间步长，生成视觉关键帧，然后通过运动插值生成中间帧。 Result: KeyVID显著提高了音频-视频同步性和视频质量，特别是在高动态运动中。 Conclusion: KeyVID在保持计算效率的同时，显著提升了关键动态时刻的生成质量。 Abstract: Generating video from various conditions, such as text, image, and audio, enables both spatial and temporal control, leading to high-quality generation results. Videos with dramatic motions often require a higher frame rate to ensure smooth motion. Currently, most audio-to-visual animation models use uniformly sampled frames from video clips. However, these uniformly sampled frames fail to capture significant key moments in dramatic motions at low frame rates and require significantly more memory when increasing the number of frames directly. In this paper, we propose KeyVID, a keyframe-aware audio-to-visual animation framework that significantly improves the generation quality for key moments in audio signals while maintaining computation efficiency. Given an image and an audio input, we first localize keyframe time steps from the audio. Then, we use a keyframe generator to generate the corresponding visual keyframes. Finally, we generate all intermediate frames using the motion interpolator. Through extensive experiments, we demonstrate that KeyVID significantly improves audio-video synchronization and video quality across multiple datasets, particularly for highly dynamic motions. The code is released in https://github.com/XingruiWang/KeyVID.

Yao Yuan,Pan Gao,Qun Dai,Jie Qin,Wei Xiang

Main category: cs.CV

TLDR: 提出了一种基于不确定性引导的显著目标检测方法UGRAN，通过多模块协作提升对不确定区域的感知能力，生成高饱和度的细粒度显著预测图。

Details

Motivation: 现有显著目标检测方法预测的显著区域常包含不饱和区域和阴影，限制了模型的细粒度预测能力。 Method: 设计了UGRAN网络，包含MIA、SSCA和URA三个模块，分别用于多级特征交互、多尺度信息整合和不确定性引导优化。 Result: 在七个基准数据集上验证了UGRAN的优越性，优于现有方法。 Conclusion: UGRAN通过不确定性引导和多模块协作，显著提升了显著目标检测的细粒度预测能力。 Abstract: Recently, salient object detection (SOD) methods have achieved impressive performance. However, salient regions predicted by existing methods usually contain unsaturated regions and shadows, which limits the model for reliable fine-grained predictions. To address this, we introduce the uncertainty guidance learning approach to SOD, intended to enhance the model's perception of uncertain regions. Specifically, we design a novel Uncertainty Guided Refinement Attention Network (UGRAN), which incorporates three important components, i.e., the Multilevel Interaction Attention (MIA) module, the Scale Spatial-Consistent Attention (SSCA) module, and the Uncertainty Refinement Attention (URA) module. Unlike conventional methods dedicated to enhancing features, the proposed MIA facilitates the interaction and perception of multilevel features, leveraging the complementary characteristics among multilevel features. Then, through the proposed SSCA, the salient information across diverse scales within the aggregated features can be integrated more comprehensively and integrally. In the subsequent steps, we utilize the uncertainty map generated from the saliency prediction map to enhance the model's perception capability of uncertain regions, generating a highly-saturated fine-grained saliency prediction map. Additionally, we devise an adaptive dynamic partition (ADP) mechanism to minimize the computational overhead of the URA module and improve the utilization of uncertainty guidance. Experiments on seven benchmark datasets demonstrate the superiority of the proposed UGRAN over the state-of-the-art methodologies. Codes will be released at https://github.com/I2-Multimedia-Lab/UGRAN.

[266] LightHeadEd: Relightable & Editable Head Avatars from a Smartphone

Pranav Manu,Astitva Srivastava,Amit Raj,Varun Jampani,Avinash Sharma,P. J. Narayanan

Main category: cs.CV

TLDR: 提出了一种低成本方法，利用智能手机和偏振滤镜创建高质量、可动画化、可重光照的3D头部头像。

Details

Motivation: 传统方法依赖昂贵的Lightstage设备，限制了广泛应用。本研究旨在提供一种更经济、易用的替代方案。 Method: 通过同时捕捉交叉偏振和平行偏振视频流，分离皮肤的漫反射和镜面反射成分，并结合混合表示和神经分析-合成流程。 Result: 实现了高质量、实时渲染的3D头部头像，保留了高保真几何细节。 Conclusion: 该方法为创建逼真3D头像提供了一种低成本、高效的解决方案。 Abstract: Creating photorealistic, animatable, and relightable 3D head avatars traditionally requires expensive Lightstage with multiple calibrated cameras, making it inaccessible for widespread adoption. To bridge this gap, we present a novel, cost-effective approach for creating high-quality relightable head avatars using only a smartphone equipped with polaroid filters. Our approach involves simultaneously capturing cross-polarized and parallel-polarized video streams in a dark room with a single point-light source, separating the skin's diffuse and specular components during dynamic facial performances. We introduce a hybrid representation that embeds 2D Gaussians in the UV space of a parametric head model, facilitating efficient real-time rendering while preserving high-fidelity geometric details. Our learning-based neural analysis-by-synthesis pipeline decouples pose and expression-dependent geometrical offsets from appearance, decomposing the surface into albedo, normal, and specular UV texture maps, along with the environment maps. We collect a unique dataset of various subjects performing diverse facial expressions and head movements.

[267] Computer-Aided Layout Generation for Building Design: A Review

Jiachen Liu,Yuan Xue,Haomiao Ni,Rui Yu,Zihan Zhou,Sharon X. Huang

Main category: cs.CV

TLDR: 本文综述了建筑布局设计与生成的三大研究方向，总结了传统方法与深度生成模型的优缺点，并提出了未来研究方向。

Details

Motivation: 传统建筑布局设计方法成本高且耗时，深度生成模型提高了效率和多样性，本文旨在全面回顾该领域的研究进展。 Method: 对三大主题（平面布局生成、场景布局合成、其他建筑布局生成）进行综述，分类研究范式、数据集和评估指标。 Result: 总结了现有方法的优缺点，提出了未来研究方向。 Conclusion: 深度生成模型显著提升了建筑布局设计的效率，但仍需进一步研究以解决现有局限性。 Abstract: Generating realistic building layouts for automatic building design has been studied in both the computer vision and architecture domains. Traditional approaches from the architecture domain, which are based on optimization techniques or heuristic design guidelines, can synthesize desirable layouts, but usually require post-processing and involve human interaction in the design pipeline, making them costly and timeconsuming. The advent of deep generative models has significantly improved the fidelity and diversity of the generated architecture layouts, reducing the workload by designers and making the process much more efficient. In this paper, we conduct a comprehensive review of three major research topics of architecture layout design and generation: floorplan layout generation, scene layout synthesis, and generation of some other formats of building layouts. For each topic, we present an overview of the leading paradigms, categorized either by research domains (architecture or machine learning) or by user input conditions or constraints. We then introduce the commonly-adopted benchmark datasets that are used to verify the effectiveness of the methods, as well as the corresponding evaluation metrics. Finally, we identify the well-solved problems and limitations of existing approaches, then propose new perspectives as promising directions for future research in this important research area. A project associated with this survey to maintain the resources is available at awesome-building-layout-generation.

[268] ToolTipNet: A Segmentation-Driven Deep Learning Baseline for Surgical Instrument Tip Detection

Zijian Wu,Shuojue Yang,Yueming Jin,Septimiu E Salcudean

Main category: cs.CV

TLDR: 论文提出了一种基于深度学习的机器人辅助腹腔镜前列腺切除术（RALP）中手术器械尖端检测方法，利用分割基础模型（Segment Anything）提供的部分级分割掩码作为输入，优于传统手工图像处理方法。

Details

Motivation: 在RALP中，器械尖端位置的准确定位对超声与腹腔镜相机帧的配准至关重要，但现有方法（如da Vinci API）存在不准确问题，需手眼校准。此外，器械尖端检测对手术技能评估和自动化等任务也很重要。 Method: 基于分割基础模型（Segment Anything）提供的部分级器械分割掩码，提出深度学习方法来检测器械尖端。 Result: 在模拟和真实数据集上的对比实验表明，该方法优于传统手工图像处理方法。 Conclusion: 深度学习结合分割基础模型能有效提升手术器械尖端检测的准确性，为手术自动化等任务提供支持。 Abstract: In robot-assisted laparoscopic radical prostatectomy (RALP), the location of the instrument tip is important to register the ultrasound frame with the laparoscopic camera frame. A long-standing limitation is that the instrument tip position obtained from the da Vinci API is inaccurate and requires hand-eye calibration. Thus, directly computing the position of the tool tip in the camera frame using the vision-based method becomes an attractive solution. Besides, surgical instrument tip detection is the key component of other tasks, like surgical skill assessment and surgery automation. However, this task is challenging due to the small size of the tool tip and the articulation of the surgical instrument. Surgical instrument segmentation becomes relatively easy due to the emergence of the Segmentation Foundation Model, i.e., Segment Anything. Based on this advancement, we explore the deep learning-based surgical instrument tip detection approach that takes the part-level instrument segmentation mask as input. Comparison experiments with a hand-crafted image-processing approach demonstrate the superiority of the proposed method on simulated and real datasets.

[269] A Survey on Efficient Vision-Language Models

Gaurav Shinde,Anuradha Ravi,Emon Dey,Shadman Sakib,Milind Rampure,Nirmalya Roy

Main category: cs.CV

TLDR: 综述探讨了高效视觉语言模型（VLMs）的关键技术，重点优化边缘和资源受限设备上的性能与内存权衡。

Details

Motivation: 视觉语言模型的高计算需求限制了实时应用，促使研究高效VLMs。 Method: 回顾了紧凑VLM架构和优化技术，并建立了GitHub仓库整合相关论文。 Result: 提供了性能与内存权衡的详细分析，促进高效VLM研究。 Conclusion: 旨在推动高效视觉语言模型的深入研究。 Abstract: Vision-language models (VLMs) integrate visual and textual information, enabling a wide range of applications such as image captioning and visual question answering, making them crucial for modern AI systems. However, their high computational demands pose challenges for real-time applications. This has led to a growing focus on developing efficient vision language models. In this survey, we review key techniques for optimizing VLMs on edge and resource-constrained devices. We also explore compact VLM architectures, frameworks and provide detailed insights into the performance-memory trade-offs of efficient VLMs. Furthermore, we establish a GitHub repository at https://github.com/MPSCUMBC/Efficient-Vision-Language-Models-A-Survey to compile all surveyed papers, which we will actively update. Our objective is to foster deeper research in this area.

[270] Automatic Detection of Intro and Credits in Video using CLIP and Multihead Attention

Vasilii Korolkov,Andrey Yanchenko

Main category: cs.CV

TLDR: 提出了一种基于深度学习的视频内容分割方法，用于区分片头/片尾与正片内容，通过序列分类任务实现高精度检测。

Details

Motivation: 手动标注视频内容过渡耗时且易错，传统启发式方法泛化能力差，需自动化解决方案。 Method: 以1 FPS提取帧，使用CLIP编码，结合多头注意力模型和位置编码进行序列分类。 Result: 测试集F1分数91.0%，精确率89.0%，召回率97.0%，实时推理性能达11.5 FPS（CPU）和107 FPS（GPU）。 Conclusion: 该方法适用于内容索引、高光检测和视频摘要，未来将探索多模态学习以提升精度。 Abstract: Detecting transitions between intro/credits and main content in videos is a crucial task for content segmentation, indexing, and recommendation systems. Manual annotation of such transitions is labor-intensive and error-prone, while heuristic-based methods often fail to generalize across diverse video styles. In this work, we introduce a deep learning-based approach that formulates the problem as a sequence-to-sequence classification task, where each second of a video is labeled as either "intro" or "film." Our method extracts frames at a fixed rate of 1 FPS, encodes them using CLIP (Contrastive Language-Image Pretraining), and processes the resulting feature representations with a multihead attention model incorporating learned positional encoding. The system achieves an F1-score of 91.0%, Precision of 89.0%, and Recall of 97.0% on the test set, and is optimized for real-time inference, achieving 11.5 FPS on CPU and 107 FPS on high-end GPUs. This approach has practical applications in automated content indexing, highlight detection, and video summarization. Future work will explore multimodal learning, incorporating audio features and subtitles to further enhance detection accuracy.

[271] Socratic Chart: Cooperating Multiple Agents for Robust SVG Chart Understanding

Yuyang Ji,Haohan Wang

Main category: cs.CV

TLDR: 论文提出了一种名为Socratic Chart的新框架，通过将图表转换为SVG表示，提升多模态大语言模型（MLLMs）的视觉推理能力，解决了现有模型在图表理解中的局限性。

Details

Motivation: 现有MLLMs在图表推理任务中依赖文本捷径而非真正的视觉理解，导致性能受限。 Method: 提出Socratic Chart框架，将图表转换为SVG表示，并采用多智能体流程提取图表属性并验证结果。 Result: 在去除文本标签和引入图表扰动的测试中，Socratic Chart显著提升了模型性能，超越了现有最先进模型。 Conclusion: Socratic Chart为提升MLLMs的视觉理解能力提供了有效途径。 Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable versatility but face challenges in demonstrating true visual understanding, particularly in chart reasoning tasks. Existing benchmarks like ChartQA reveal significant reliance on text-based shortcuts and probabilistic pattern-matching rather than genuine visual reasoning. To rigorously evaluate visual reasoning, we introduce a more challenging test scenario by removing textual labels and introducing chart perturbations in the ChartQA dataset. Under these conditions, models like GPT-4o and Gemini-2.0 Pro experience up to a 30% performance drop, underscoring their limitations. To address these challenges, we propose Socratic Chart, a new framework that transforms chart images into Scalable Vector Graphics (SVG) representations, enabling MLLMs to integrate textual and visual modalities for enhanced chart understanding. Socratic Chart employs a multi-agent pipeline with specialized agent-generators to extract primitive chart attributes (e.g., bar heights, line coordinates) and an agent-critic to validate results, ensuring high-fidelity symbolic representations. Our framework surpasses state-of-the-art models in accurately capturing chart primitives and improving reasoning performance, establishing a robust pathway for advancing MLLM visual understanding.

[272] On the representation of stack operators by mathematical morphology

Diego Marcondes

Main category: cs.CV

TLDR: 本文介绍了灰度图像堆栈算子类，其将二值图像映射为二值图像，并与截面平均交换。堆栈算子是集合算子的1-Lipchitz扩展，可通过将特征集合算子应用于图像截面并求和来表示。它们是堆栈滤波器的推广，其中特征集合算子是递增的。主要结果是堆栈算子继承了特征集合算子的格性质。

Details

Motivation: 研究灰度图像堆栈算子，推广堆栈滤波器，并探索其在图像处理中的应用。 Method: 通过推导堆栈算子的特征函数、核和基表示，研究平移不变和局部定义的堆栈算子。 Result: 堆栈算子继承了特征集合算子的格性质，为灰度图像处理问题提供了一种设计思路。 Conclusion: 通过设计二值图像算子并扩展为堆栈算子，可解决某些灰度图像处理问题，未来可研究其机器学习和应用范围。 Abstract: This paper introduces the class of grey-scale image stack operators as those that (a) map binary-images into binary-images and (b) commute in average with cross-sectioning. We show that stack operators are 1-Lipchitz extensions of set operators which can be represented by applying a characteristic set operator to the cross-sections of the image and summing. In particular, they are a generalisation of stack filters, for which the characteristic set operators are increasing. Our main result is that stack operators inherit lattice properties of the characteristic set operators. We focus on the case of translation-invariant and locally defined stack operators and show the main result by deducing the characteristic function, kernel, and basis representation of stack operators. The results of this paper have implications on the design of image operators, since imply that to solve some grey-scale image processing problems it is enough to design an operator for performing the desired transformation on binary images, and then considering its extension given by a stack operator. We leave many topics for future research regarding the machine learning of stack operators and the characterisation of the image processing problems that can be solved by them.

[273] EquiVDM: Equivariant Video Diffusion Models with Temporally Consistent Noise

Chao Liu,Arash Vahdat

Main category: cs.CV

TLDR: 提出了一种利用时间一致性噪声的视频扩散框架，无需额外模块即可生成连贯视频帧，并在3D一致性视频生成中表现优异。

Details

Motivation: 解决视频扩散模型中时间一致性问题，应用于sim-to-real、风格迁移、视频上采样等领域。 Method: 利用时间一致性噪声训练扩散模型，使其对输入视频和噪声的空间变换具有等变性，并扩展至3D一致性生成。 Result: 在运动对齐、3D一致性和视频质量上优于现有方法，且采样步骤少。 Conclusion: 该方法高效且性能优越，适用于多种视频生成任务。 Abstract: Temporally consistent video-to-video generation is essential for applications of video diffusion models in areas such as sim-to-real, style-transfer, video upsampling, etc. In this paper, we propose a video diffusion framework that leverages temporally consistent noise to generate coherent video frames without specialized modules or additional constraints. We show that the standard training objective of diffusion models, when applied with temporally consistent noise, encourages the model to be equivariant to spatial transformations in input video and noise. This enables our model to better follow motion patterns from the input video, producing aligned motion and high-fidelity frames. Furthermore, we extend our approach to 3D-consistent video generation by attaching noise as textures on 3D meshes, ensuring 3D consistency in sim-to-real applications. Experimental results demonstrate that our method surpasses state-of-the-art baselines in motion alignment, 3D consistency, and video quality while requiring only a few sampling steps in practice.

[274] IGL-DT: Iterative Global-Local Feature Learning with Dual-Teacher Semantic Segmentation Framework under Limited Annotation Scheme

Dinh Dai Quan Tran,Hoang-Thien Nguyen. Thanh-Huy Nguyen,Gia-Van To,Tien-Huy Nguyen,Quan Nguyen

Main category: cs.CV

TLDR: 提出了一种名为IGL-DT的三分支半监督语义分割框架，结合双教师策略，平衡全局语义与局部特征提取。

Details

Motivation: 现有方法难以平衡全局语义表示与细粒度局部特征提取。 Method: 采用SwinUnet进行全局上下文学习，ResUnet进行局部区域学习，并通过差异学习机制减少对单一教师的依赖。 Result: 在多个基准数据集上表现优于现有方法，分割性能显著提升。 Conclusion: IGL-DT框架有效解决了半监督语义分割中的全局与局部特征平衡问题。 Abstract: Semi-Supervised Semantic Segmentation (SSSS) aims to improve segmentation accuracy by leveraging a small set of labeled images alongside a larger pool of unlabeled data. Recent advances primarily focus on pseudo-labeling, consistency regularization, and co-training strategies. However, existing methods struggle to balance global semantic representation with fine-grained local feature extraction. To address this challenge, we propose a novel tri-branch semi-supervised segmentation framework incorporating a dual-teacher strategy, named IGL-DT. Our approach employs SwinUnet for high-level semantic guidance through Global Context Learning and ResUnet for detailed feature refinement via Local Regional Learning. Additionally, a Discrepancy Learning mechanism mitigates over-reliance on a single teacher, promoting adaptive feature learning. Extensive experiments on benchmark datasets demonstrate that our method outperforms state-of-the-art approaches, achieving superior segmentation performance across various data regimes.

[275] DUDA: Distilled Unsupervised Domain Adaptation for Lightweight Semantic Segmentation

Beomseok Kang,Niluthpol Chowdhury Mithun,Abhinav Rajvanshi,Han-Pang Chiu,Supun Samarasekera

Main category: cs.CV

TLDR: 论文提出了一种结合EMA自训练和知识蒸馏的新框架DUDA，用于解决轻量级模型在无监督域适应中的性能下降问题。

Details

Motivation: 现有UDA方法在轻量级模型上表现不佳，主要由于架构不灵活导致伪标签质量低。 Method: DUDA通过辅助学生网络弥补架构差距，结合EMA自训练和知识蒸馏，采用渐进蒸馏、不一致损失和多教师学习。 Result: 在四个UDA基准测试中，DUDA表现优异，轻量级模型性能常超越其他方法的重量级模型。 Conclusion: DUDA框架显著提升了轻量级模型在无监督域适应中的性能，为实际应用提供了高效解决方案。 Abstract: Unsupervised Domain Adaptation (UDA) is essential for enabling semantic segmentation in new domains without requiring costly pixel-wise annotations. State-of-the-art (SOTA) UDA methods primarily use self-training with architecturally identical teacher and student networks, relying on Exponential Moving Average (EMA) updates. However, these approaches face substantial performance degradation with lightweight models due to inherent architectural inflexibility leading to low-quality pseudo-labels. To address this, we propose Distilled Unsupervised Domain Adaptation (DUDA), a novel framework that combines EMA-based self-training with knowledge distillation (KD). Our method employs an auxiliary student network to bridge the architectural gap between heavyweight and lightweight models for EMA-based updates, resulting in improved pseudo-label quality. DUDA employs a strategic fusion of UDA and KD, incorporating innovative elements such as gradual distillation from large to small networks, inconsistency loss prioritizing poorly adapted classes, and learning with multiple teachers. Extensive experiments across four UDA benchmarks demonstrate DUDA's superiority in achieving SOTA performance with lightweight models, often surpassing the performance of heavyweight models from other approaches.

[276] Density-based Object Detection in Crowded Scenes

Chenyang Zhao,Jia Wan,Antoni B. Chan

Main category: cs.CV

TLDR: 论文提出两种新策略（DGA和DG-NMS），利用目标密度图优化锚点分配和NMS阈值，解决拥挤场景中目标检测的模糊锚点和误抑制问题。

Details

Motivation: 拥挤场景中目标高度重叠导致训练时锚点模糊和推理时误抑制增多，影响检测效果。 Method: 提出密度引导锚点（DGA）和密度引导NMS（DG-NMS），基于不平衡最优传输（UOT）计算最优锚点分配和自适应NMS阈值。 Result: 在CrowdHuman和Citypersons数据集上验证了方法的有效性和鲁棒性。 Conclusion: 密度引导检测器能有效应对拥挤场景中的目标检测问题。 Abstract: Compared with the generic scenes, crowded scenes contain highly-overlapped instances, which result in: 1) more ambiguous anchors during training of object detectors, and 2) more predictions are likely to be mistakenly suppressed in post-processing during inference. To address these problems, we propose two new strategies, density-guided anchors (DGA) and density-guided NMS (DG-NMS), which uses object density maps to jointly compute optimal anchor assignments and reweighing, as well as an adaptive NMS. Concretely, based on an unbalanced optimal transport (UOT) problem, the density owned by each ground-truth object is transported to each anchor position at a minimal transport cost. And density on anchors comprises an instance-specific density distribution, from which DGA decodes the optimal anchor assignment and re-weighting strategy. Meanwhile, DG-NMS utilizes the predicted density map to adaptively adjust the NMS threshold to reduce mistaken suppressions. In the UOT, a novel overlap-aware transport cost is specifically designed for ambiguous anchors caused by overlapped neighboring objects. Extensive experiments on the challenging CrowdHuman dataset with Citypersons dataset demonstrate that our proposed density-guided detector is effective and robust to crowdedness. The code and pre-trained models will be made available later.

[277] FATE: A Prompt-Tuning-Based Semi-Supervised Learning Framework for Extremely Limited Labeled Data

Hezhao Liu,Yang Lu,Mengke Li,Yiqun Zhang,Shreyank N Gowda,Chen Gong,Hanzi Wang

Main category: cs.CV

TLDR: FATE是一种新颖的半监督学习框架，针对标签数据极少的情况，通过两阶段提示调优范式，利用未标记数据补充监督信号，显著提升性能。

Details

Motivation: 现有半监督学习方法在标签数据极少（如仅一个样本）时表现不佳，且预训练模型难以平衡有限标签数据与大量未标记数据。 Method: FATE采用两阶段方法：首先无监督地利用未标记数据适应预训练模型，再使用专为预训练模型设计的半监督学习方法完成分类任务。 Result: 在七个基准测试中，FATE平均性能提升33.74%，优于现有半监督学习方法。 Conclusion: FATE有效解决了标签数据稀缺问题，适用于视觉和视觉语言预训练模型，性能显著提升。 Abstract: Semi-supervised learning (SSL) has achieved significant progress by leveraging both labeled data and unlabeled data. Existing SSL methods overlook a common real-world scenario when labeled data is extremely scarce, potentially as limited as a single labeled sample in the dataset. General SSL approaches struggle to train effectively from scratch under such constraints, while methods utilizing pre-trained models often fail to find an optimal balance between leveraging limited labeled data and abundant unlabeled data. To address this challenge, we propose Firstly Adapt, Then catEgorize (FATE), a novel SSL framework tailored for scenarios with extremely limited labeled data. At its core, the two-stage prompt tuning paradigm FATE exploits unlabeled data to compensate for scarce supervision signals, then transfers to downstream tasks. Concretely, FATE first adapts a pre-trained model to the feature distribution of downstream data using volumes of unlabeled samples in an unsupervised manner. It then applies an SSL method specifically designed for pre-trained models to complete the final classification task. FATE is designed to be compatible with both vision and vision-language pre-trained models. Extensive experiments demonstrate that FATE effectively mitigates challenges arising from the scarcity of labeled samples in SSL, achieving an average performance improvement of 33.74% across seven benchmarks compared to state-of-the-art SSL methods. Code is available at https://anonymous.4open.science/r/Semi-supervised-learning-BA72.

Lu Yue,Dongliang Zhou,Liang Xie,Erwei Yin,Feitian Zhang

Main category: cs.CV

TLDR: ST-Booster通过多粒度感知和指令感知推理提升连续环境中的视觉与语言导航性能。

Details

Motivation: 解决连续环境中视觉记忆异构和局部特征感知受损的问题。 Method: 提出ST-Booster，包含HSTE、MGAF和VGWG模块，通过双图表示和迭代优化提升导航能力。 Result: 在复杂干扰环境中表现优于现有方法。 Conclusion: ST-Booster有效解决了连续环境中的导航挑战。 Abstract: Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires agents to navigate unknown, continuous spaces based on natural language instructions. Compared to discrete settings, VLN-CE poses two core perception challenges. First, the absence of predefined observation points leads to heterogeneous visual memories and weakened global spatial correlations. Second, cumulative reconstruction errors in three-dimensional scenes introduce structural noise, impairing local feature perception. To address these challenges, this paper proposes ST-Booster, an iterative spatiotemporal booster that enhances navigation performance through multi-granularity perception and instruction-aware reasoning. ST-Booster consists of three key modules -- Hierarchical SpatioTemporal Encoding (HSTE), Multi-Granularity Aligned Fusion (MGAF), and ValueGuided Waypoint Generation (VGWG). HSTE encodes long-term global memory using topological graphs and captures shortterm local details via grid maps. MGAF aligns these dualmap representations with instructions through geometry-aware knowledge fusion. The resulting representations are iteratively refined through pretraining tasks. During reasoning, VGWG generates Guided Attention Heatmaps (GAHs) to explicitly model environment-instruction relevance and optimize waypoint selection. Extensive comparative experiments and performance analyses are conducted, demonstrating that ST-Booster outperforms existing state-of-the-art methods, particularly in complex, disturbance-prone environments.

[279] GFT: Gradient Focal Transformer

Boris Kriuk,Simranjit Kaur Gill,Shoaib Aslam,Amir Fakhrutdinov

Main category: cs.CV

TLDR: GFT（Gradient Focal Transformer）是一种新型ViT衍生框架，通过GALA机制和PPS策略，动态优化注意力区域，提升细粒度图像分类性能。

Details

Motivation: 现有ViT模型在细粒度图像分类中无法自适应聚焦关键区域，且计算效率低。 Method: 结合GALA机制分析注意力梯度流，动态选择判别性特征；采用PPS策略逐步过滤非信息区域。 Result: 在FGVC Aircraft、Food-101和COCO数据集上达到SOTA精度，参数仅93M。 Conclusion: GFT通过全局与局部特征的结合，为细粒度分类设定了新基准，并提供可解释的解决方案。 Abstract: Fine-Grained Image Classification (FGIC) remains a complex task in computer vision, as it requires models to distinguish between categories with subtle localized visual differences. Well-studied CNN-based models, while strong in local feature extraction, often fail to capture the global context required for fine-grained recognition, while more recent ViT-backboned models address FGIC with attention-driven mechanisms but lack the ability to adaptively focus on truly discriminative regions. TransFG and other ViT-based extensions introduced part-aware token selection to enhance attention localization, yet they still struggle with computational efficiency, attention region selection flexibility, and detail-focus narrative in complex environments. This paper introduces GFT (Gradient Focal Transformer), a new ViT-derived framework created for FGIC tasks. GFT integrates the Gradient Attention Learning Alignment (GALA) mechanism to dynamically prioritize class-discriminative features by analyzing attention gradient flow. Coupled with a Progressive Patch Selection (PPS) strategy, the model progressively filters out less informative regions, reducing computational overhead while enhancing sensitivity to fine details. GFT achieves SOTA accuracy on FGVC Aircraft, Food-101, and COCO datasets with 93M parameters, outperforming ViT-based advanced FGIC models in efficiency. By bridging global context and localized detail extraction, GFT sets a new benchmark in fine-grained recognition, offering interpretable solutions for real-world deployment scenarios.

[280] HDC: Hierarchical Distillation for Multi-level Noisy Consistency in Semi-Supervised Fetal Ultrasound Segmentation

Tran Quoc Khanh Le,Nguyen Lan Vi Vu,Ha-Hieu Pham,Xuan-Loc Huynh,Tien-Huy Nguyen,Minh Huu Nhat Le,Quan Nguyen,Hien D. Nguyen

Main category: cs.CV

TLDR: 提出了一种名为HDC的半监督分割框架，结合分层蒸馏和一致性学习，解决了宫颈超声图像分割中的低对比度和模糊边界问题，同时降低了计算成本。

Details

Motivation: 宫颈超声图像分割因低对比度、阴影伪影和模糊边界而具有挑战性，且现有方法需要大量标注数据或存在确认偏差和高计算成本。 Method: HDC框架通过分层蒸馏机制和一致性学习，引入相关指导损失和互信息损失，优化特征表示，减少模型复杂度。 Result: 在两个胎儿超声数据集（FUGC和PSFH）上，HDC表现出色，计算开销显著低于现有多教师模型。 Conclusion: HDC是一种高效且泛化能力强的半监督分割方法，适用于宫颈超声图像分析。 Abstract: Transvaginal ultrasound is a critical imaging modality for evaluating cervical anatomy and detecting physiological changes. However, accurate segmentation of cervical structures remains challenging due to low contrast, shadow artifacts, and fuzzy boundaries. While convolutional neural networks (CNNs) have shown promising results in medical image segmentation, their performance is often limited by the need for large-scale annotated datasets - an impractical requirement in clinical ultrasound imaging. Semi-supervised learning (SSL) offers a compelling solution by leveraging unlabeled data, but existing teacher-student frameworks often suffer from confirmation bias and high computational costs. We propose HDC, a novel semi-supervised segmentation framework that integrates Hierarchical Distillation and Consistency learning within a multi-level noise mean-teacher framework. Unlike conventional approaches that rely solely on pseudo-labeling, we introduce a hierarchical distillation mechanism that guides feature-level learning via two novel objectives: (1) Correlation Guidance Loss to align feature representations between the teacher and main student branch, and (2) Mutual Information Loss to stabilize representations between the main and noisy student branches. Our framework reduces model complexity while improving generalization. Extensive experiments on two fetal ultrasound datasets, FUGC and PSFH, demonstrate that our method achieves competitive performance with significantly lower computational overhead than existing multi-teacher models.

[281] MCBlock: Boosting Neural Radiance Field Training Speed by MCTS-based Dynamic-Resolution Ray Sampling

Yunpeng Tan,Junlin Hao,Jiangkai Wu,Liming Liu,Qingyang Li,Xinggong Zhang

Main category: cs.CV

TLDR: MCBlock是一种动态分辨率光线采样算法，通过蒙特卡洛树搜索（MCTS）优化NeRF训练中的采样效率，提升训练速度2.33倍。

Details

Motivation: 当前NeRF模型（如Gaussian Splatting）训练时间长，难以满足实时需求，现有采样方法效率低下。 Method: 提出MCBlock算法，利用MCTS动态分区训练图像为不同大小的像素块，并通过扩展/剪枝模块优化分区。 Result: 在Nerfstudio中实现，训练速度提升2.33倍，优于其他光线采样算法。 Conclusion: MCBlock适用于任何锥形追踪NeRF模型，对多媒体应用有重要贡献。 Abstract: Neural Radiance Field (NeRF) is widely known for high-fidelity novel view synthesis. However, even the state-of-the-art NeRF model, Gaussian Splatting, requires minutes for training, far from the real-time performance required by multimedia scenarios like telemedicine. One of the obstacles is its inefficient sampling, which is only partially addressed by existing works. Existing point-sampling algorithms uniformly sample simple-texture regions (easy to fit) and complex-texture regions (hard to fit), while existing ray-sampling algorithms sample these regions all in the finest granularity (i.e. the pixel level), both wasting GPU training resources. Actually, regions with different texture intensities require different sampling granularities. To this end, we propose a novel dynamic-resolution ray-sampling algorithm, MCBlock, which employs Monte Carlo Tree Search (MCTS) to partition each training image into pixel blocks with different sizes for active block-wise training. Specifically, the trees are initialized according to the texture of training images to boost the initialization speed, and an expansion/pruning module dynamically optimizes the block partition. MCBlock is implemented in Nerfstudio, an open-source toolset, and achieves a training acceleration of up to 2.33x, surpassing other ray-sampling algorithms. We believe MCBlock can apply to any cone-tracing NeRF model and contribute to the multimedia community.

[282] Focus on Local: Finding Reliable Discriminative Regions for Visual Place Recognition

Changwei Wang,Shunpeng Chen,Yukun Song,Rongtao Xu,Zherui Zhang,Jiguang Zhang,Haoran Yang,Yu Zhang,Kexue Fu,Shide Du,Zhiwei Xu,Longxiang Gao,Li Guo,Shibiao Xu

Main category: cs.CV

TLDR: 论文提出Focus on Local (FoL)方法，通过挖掘和利用图像中的局部判别区域，结合伪相关监督，提升视觉地点识别（VPR）任务中的图像检索和重排序性能。

Details

Motivation: 现有方法未能精确建模和充分利用图像中的局部判别区域，导致性能受限。 Method: 设计两种损失函数（SAL和CEL）建模局部判别区域，提出弱监督局部特征训练策略，并引入高效重排序流程。 Result: 在多个VPR基准测试中达到最优性能，显著优于现有两阶段方法。 Conclusion: FoL方法通过局部判别区域的建模和利用，显著提升了VPR任务的性能与效率。 Abstract: Visual Place Recognition (VPR) is aimed at predicting the location of a query image by referencing a database of geotagged images. For VPR task, often fewer discriminative local regions in an image produce important effects while mundane background regions do not contribute or even cause perceptual aliasing because of easy overlap. However, existing methods lack precisely modeling and full exploitation of these discriminative regions. In this paper, we propose the Focus on Local (FoL) approach to stimulate the performance of image retrieval and re-ranking in VPR simultaneously by mining and exploiting reliable discriminative local regions in images and introducing pseudo-correlation supervision. First, we design two losses, Extraction-Aggregation Spatial Alignment Loss (SAL) and Foreground-Background Contrast Enhancement Loss (CEL), to explicitly model reliable discriminative local regions and use them to guide the generation of global representations and efficient re-ranking. Second, we introduce a weakly-supervised local feature training strategy based on pseudo-correspondences obtained from aggregating global features to alleviate the lack of local correspondences ground truth for the VPR task. Third, we suggest an efficient re-ranking pipeline that is efficiently and precisely based on discriminative region guidance. Finally, experimental results show that our FoL achieves the state-of-the-art on multiple VPR benchmarks in both image retrieval and re-ranking stages and also significantly outperforms existing two-stage VPR methods in terms of computational efficiency. Code and models are available at https://github.com/chenshunpeng/FoL

[283] Enhanced Semantic Extraction and Guidance for UGC Image Super Resolution

Yiwen Wang,Ying Liang,Yuxuan Zhang,Xinning Chai,Zhengxue Cheng,Yingsheng Qin,Yucai Yang,Rong Xie,Li Song

Main category: cs.CV

TLDR: 提出了一种结合语义引导的扩散框架方法，用于提升用户生成内容（UGC）图像的超分辨率，解决了传统方法在真实世界退化与合成退化之间的泛化问题。

Details

Motivation: 传统超分辨率方法在真实世界退化与合成退化之间存在泛化差距，无法有效处理UGC图像的退化问题。 Method: 通过将语义引导集成到扩散框架中，模拟真实世界退化过程，并结合预训练的语义提取模型（SAM2）和微调超参数，提升退化去除和细节生成能力。 Result: 实验表明该方法优于现有技术，并在CVPR NTIRE 2025挑战赛中获第二名。 Conclusion: 该方法有效解决了UGC图像超分辨率中的退化问题，具有较高的感知保真度。 Abstract: Due to the disparity between real-world degradations in user-generated content(UGC) images and synthetic degradations, traditional super-resolution methods struggle to generalize effectively, necessitating a more robust approach to model real-world distortions. In this paper, we propose a novel approach to UGC image super-resolution by integrating semantic guidance into a diffusion framework. Our method addresses the inconsistency between degradations in wild and synthetic datasets by separately simulating the degradation processes on the LSDIR dataset and combining them with the official paired training set. Furthermore, we enhance degradation removal and detail generation by incorporating a pretrained semantic extraction model (SAM2) and fine-tuning key hyperparameters for improved perceptual fidelity. Extensive experiments demonstrate the superiority of our approach against state-of-the-art methods. Additionally, the proposed model won second place in the CVPR NTIRE 2025 Short-form UGC Image Super-Resolution Challenge, further validating its effectiveness. The code is available at https://github.c10pom/Moonsofang/NTIRE-2025-SRlab.

[284] TAMP: Token-Adaptive Layerwise Pruning in Multimodal Large Language Models

Jaewoo Lee,Keyang Xuan,Chanakya Ekbote,Sandeep Polisetty,Yi R.,Fung,Paul Pu Liang

Main category: cs.CV

TLDR: 论文提出了一种针对多模态大语言模型（MLLMs）的剪枝框架TAMP，通过多样性感知稀疏性和自适应多模态输入激活，显著提升了剪枝效果。

Details

Motivation: 现有剪枝方法在多模态大语言模型中效果有限，未能考虑跨层和多模态的独特令牌属性。 Method: TAMP框架包含两个关键组件：多样性感知稀疏性（根据输出令牌多样性调整层稀疏度）和自适应多模态输入激活（利用注意力分数识别代表性输入令牌）。 Result: 在LLaVA-NeXT和VideoLLaMA2上的实验表明，TAMP显著优于现有剪枝技术。 Conclusion: TAMP为多模态大语言模型提供了一种简单有效的剪枝方法，解决了传统方法的局限性。 Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable versatility in understanding diverse multimodal data and tasks. However, these capabilities come with an increased model scale. While post-training pruning reduces model size in unimodal models, its application to MLLMs often yields limited success. Our analysis discovers that conventional methods fail to account for the unique token attributes across layers and modalities inherent to MLLMs. Inspired by this observation, we propose TAMP, a simple yet effective pruning framework tailored for MLLMs, featuring two key components: (1) Diversity-Aware Sparsity, which adjusts sparsity ratio per layer based on diversities among multimodal output tokens, preserving more parameters in high-diversity layers; and (2) Adaptive Multimodal Input Activation, which identifies representative multimodal input tokens using attention scores to guide unstructured weight pruning. We validate our method on two state-of-the-art MLLMs: LLaVA-NeXT, designed for vision-language tasks, and VideoLLaMA2, capable of processing audio, visual, and language modalities. Empirical experiments across various multimodal evaluation benchmarks demonstrate that each component of our approach substantially outperforms existing pruning techniques.

[285] Digital Staining with Knowledge Distillation: A Unified Framework for Unpaired and Paired-But-Misaligned Data

Ziwang Xu,Lanqing Guo,Satoshi Tsutsui,Shuyan Zhang,Alex C. Kot,Bihan Wen

Main category: cs.CV

TLDR: 提出了一种无监督深度学习框架，通过知识蒸馏减少对配对数据的需求，实现数字细胞染色，并在两种训练方案中验证了其有效性。

Details

Motivation: 传统染色方法成本高、耗时长且不可逆，而现有深度学习方法需要大量对齐的配对数据，难以获取。 Method: 采用两阶段管道（光增强和着色）作为教师模型，通过知识蒸馏训练学生染色生成器，并引入学习对齐模块利用像素级信息。 Result: 在两种设置下生成更准确的细胞目标位置和形状的染色图像，定量和定性评估优于竞争方法。 Conclusion: 该方法在无监督条件下有效，适用于医学应用，如白细胞数据集。 Abstract: Staining is essential in cell imaging and medical diagnostics but poses significant challenges, including high cost, time consumption, labor intensity, and irreversible tissue alterations. Recent advances in deep learning have enabled digital staining through supervised model training. However, collecting large-scale, perfectly aligned pairs of stained and unstained images remains difficult. In this work, we propose a novel unsupervised deep learning framework for digital cell staining that reduces the need for extensive paired data using knowledge distillation. We explore two training schemes: (1) unpaired and (2) paired-but-misaligned settings. For the unpaired case, we introduce a two-stage pipeline, comprising light enhancement followed by colorization, as a teacher model. Subsequently, we obtain a student staining generator through knowledge distillation with hybrid non-reference losses. To leverage the pixel-wise information between adjacent sections, we further extend to the paired-but-misaligned setting, adding the Learning to Align module to utilize pixel-level information. Experiment results on our dataset demonstrate that our proposed unsupervised deep staining method can generate stained images with more accurate positions and shapes of the cell targets in both settings. Compared with competing methods, our method achieves improved results both qualitatively and quantitatively (e.g., NIQE and PSNR).We applied our digital staining method to the White Blood Cell (WBC) dataset, investigating its potential for medical applications.

[286] Small Object Detection with YOLO: A Performance Analysis Across Model Versions and Hardware

Muhammad Fasih Tariq,Muhammad Azeem Javed

Main category: cs.CV

TLDR: 本文对不同版本的YOLO模型（v5至v11）在多种硬件平台和优化库上的性能进行了全面评估，分析了其在CPU和GPU上的推理速度和检测精度，并研究了模型对不同大小物体的检测敏感性。

Details

Motivation: 研究旨在帮助从业者根据硬件限制和检测需求选择最优的YOLO模型，以实现高效部署。 Method: 通过比较YOLO模型在Intel和AMD CPU（使用ONNX和OpenVINO）以及GPU（使用TensorRT等框架）上的性能，并分析模型对不同大小物体（1%、2.5%、5%图像面积）的检测能力。 Result: 研究揭示了YOLO模型在效率、精度和物体大小适应性之间的权衡，为模型选择提供了依据。 Conclusion: 本文为实际应用中YOLO模型的部署提供了实用指导，帮助用户根据具体需求选择最佳模型。 Abstract: This paper provides an extensive evaluation of YOLO object detection models (v5, v8, v9, v10, v11) by com- paring their performance across various hardware platforms and optimization libraries. Our study investigates inference speed and detection accuracy on Intel and AMD CPUs using popular libraries such as ONNX and OpenVINO, as well as on GPUs through TensorRT and other GPU-optimized frameworks. Furthermore, we analyze the sensitivity of these YOLO models to object size within the image, examining performance when detecting objects that occupy 1%, 2.5%, and 5% of the total area of the image. By identifying the trade-offs in efficiency, accuracy, and object size adaptability, this paper offers insights for optimal model selection based on specific hardware constraints and detection requirements, aiding practitioners in deploying YOLO models effectively for real-world applications.

[287] LiteTracker: Leveraging Temporal Causality for Accurate Low-latency Tissue Tracking

Mert Asim Karaoglu,Wenbo Ji,Ahmed Abbas,Nassir Navab,Benjamin Busam,Alexander Ladikos

Main category: cs.CV

TLDR: LiteTracker是一种低延迟组织跟踪方法，通过训练优化和特征重用，显著提升运行速度，适用于实时手术应用。

Details

Motivation: 现有方法在运行时性能上无法满足实时手术的低延迟需求，因此需要一种更高效的跟踪方法。 Method: 基于先进的长时点跟踪方法，引入无需训练的运行时优化，利用时间内存缓冲区和先验运动进行高效特征重用和跟踪初始化。 Result: LiteTracker比前代方法快7倍，比现有最优方法快2倍，同时在STIR和SuPer数据集上表现出高精度跟踪和遮挡预测能力。 Conclusion: LiteTracker是实现实时手术应用中低延迟组织跟踪的重要进展。 Abstract: Tissue tracking plays a critical role in various surgical navigation and extended reality (XR) applications. While current methods trained on large synthetic datasets achieve high tracking accuracy and generalize well to endoscopic scenes, their runtime performances fail to meet the low-latency requirements necessary for real-time surgical applications. To address this limitation, we propose LiteTracker, a low-latency method for tissue tracking in endoscopic video streams. LiteTracker builds on a state-of-the-art long-term point tracking method, and introduces a set of training-free runtime optimizations. These optimizations enable online, frame-by-frame tracking by leveraging a temporal memory buffer for efficient feature reuse and utilizing prior motion for accurate track initialization. LiteTracker demonstrates significant runtime improvements being around 7x faster than its predecessor and 2x than the state-of-the-art. Beyond its primary focus on efficiency, LiteTracker delivers high-accuracy tracking and occlusion prediction, performing competitively on both the STIR and SuPer datasets. We believe LiteTracker is an important step toward low-latency tissue tracking for real-time surgical applications in the operating room.

[288] Improving Multimodal Hateful Meme Detection Exploiting LMM-Generated Knowledge

Maria Tzelepi,Vasileios Mezaris

Main category: cs.CV

TLDR: 该论文提出了一种利用大型多模态模型（LMM）检测仇恨表情包的方法，通过提取任务相关知识并开发硬挖掘方法，实现了最先进的性能。

Details

Motivation: 表情包已成为社交媒体中主要的传播形式，但部分包含仇恨言论，对特定群体造成伤害，因此检测仇恨内容至关重要。 Method: 利用LMM提取表情包的语义描述和情感信息，构建强表征，并开发硬挖掘方法直接引入LMM知识以优化训练。 Result: 在两个数据集上的实验验证了方法的有效性，达到了最先进的性能。 Conclusion: 提出的方法通过LMM知识显著提升了仇恨表情包检测的性能，代码和模型已公开。 Abstract: Memes have become a dominant form of communication in social media in recent years. Memes are typically humorous and harmless, however there are also memes that promote hate speech, being in this way harmful to individuals and groups based on their identity. Therefore, detecting hateful content in memes has emerged as a task of critical importance. The need for understanding the complex interactions of images and their embedded text renders the hateful meme detection a challenging multimodal task. In this paper we propose to address the aforementioned task leveraging knowledge encoded in powerful Large Multimodal Models (LMM). Specifically, we propose to exploit LMMs in a two-fold manner. First, by extracting knowledge oriented to the hateful meme detection task in order to build strong meme representations. Specifically, generic semantic descriptions and emotions that the images along with their embedded texts elicit are extracted, which are then used to train a simple classification head for hateful meme detection. Second, by developing a novel hard mining approach introducing directly LMM-encoded knowledge to the training process, providing further improvements. We perform extensive experiments on two datasets that validate the effectiveness of the proposed method, achieving state-of-the-art performance. Our code and trained models are publicly available at: https://github.com/IDT-ITI/LMM-CLIP-meme.

Zheng Liu,Mengjie Liu,Jingzhou Chen,Jingwei Xu,Bin Cui,Conghui He,Wentao Zhang

Main category: cs.CV

TLDR: FUSION是一种多模态大语言模型（MLLM），通过全视觉语言对齐和整合范式，实现了深度动态整合。其方法包括文本引导的统一视觉编码、上下文感知递归对齐解码和双监督语义映射损失，显著优于现有方法。

Details

Motivation: 现有方法主要依赖解码阶段的后期模态交互，而FUSION旨在实现整个处理流程的深度动态整合，以提升多模态任务的性能。 Method: 提出文本引导的统一视觉编码、上下文感知递归对齐解码和双监督语义映射损失，并构建合成语言驱动的QA数据集。 Result: FUSION在3B和8B规模上显著优于现有方法，3B模型在大多数基准测试中超越8B模型，甚至在视觉标记受限时仍表现优异。 Conclusion: FUSION通过全模态整合方法在多模态任务中表现出色，验证了其方法的有效性，并开源了代码、模型权重和数据集。 Abstract: We introduce FUSION, a family of multimodal large language models (MLLMs) with a fully vision-language alignment and integration paradigm. Unlike existing methods that primarily rely on late-stage modality interaction during LLM decoding, our approach achieves deep, dynamic integration throughout the entire processing pipeline. To this end, we propose Text-Guided Unified Vision Encoding, incorporating textual information in vision encoding to achieve pixel-level integration. We further design Context-Aware Recursive Alignment Decoding that recursively aggregates visual features conditioned on textual context during decoding, enabling fine-grained, question-level semantic integration. To guide feature mapping and mitigate modality discrepancies, we develop Dual-Supervised Semantic Mapping Loss. Additionally, we construct a Synthesized Language-Driven Question-Answer (QA) dataset through a new data synthesis method, prioritizing high-quality QA pairs to optimize text-guided feature integration. Building on these foundations, we train FUSION at two scales-3B, 8B-and demonstrate that our full-modality integration approach significantly outperforms existing methods with only 630 vision tokens. Notably, FUSION 3B surpasses Cambrian-1 8B and Florence-VL 8B on most benchmarks. FUSION 3B continues to outperform Cambrian-1 8B even when limited to 300 vision tokens. Our ablation studies show that FUSION outperforms LLaVA-NeXT on over half of the benchmarks under same configuration without dynamic resolution, highlighting the effectiveness of our approach. We release our code, model weights, and dataset. https://github.com/starriver030515/FUSION

[290] Omni-Dish: Photorealistic and Faithful Image Generation and Editing for Arbitrary Chinese Dishes

Huijie Liu,Bingcan Wang,Jie Hu,Xiaoming Wei,Guoliang Kang

Main category: cs.CV

TLDR: Omni-Dish是一个专为中国菜肴设计的文本到图像生成模型，通过数据收集、重新标注和分阶段训练，提升了生成图像的真实性和细节表现。

Details

Motivation: 现有文本到图像生成模型在特定领域（如中国菜肴）中难以捕捉多样性和细节，因此需要专门优化的模型。 Method: 开发了全面的数据收集流程，采用重新标注策略和分阶段训练方案，并利用高质量标注库和大语言模型优化输入。 Result: 实验表明，Omni-Dish在生成中国菜肴图像方面表现优异，细节和真实性显著提升。 Conclusion: Omni-Dish为特定领域的图像生成提供了有效解决方案，并通过扩展支持编辑任务，展示了其多功能性。 Abstract: Dish images play a crucial role in the digital era, with the demand for culturally distinctive dish images continuously increasing due to the digitization of the food industry and e-commerce. In general cases, existing text-to-image generation models excel in producing high-quality images; however, they struggle to capture diverse characteristics and faithful details of specific domains, particularly Chinese dishes. To address this limitation, we propose Omni-Dish, the first text-to-image generation model specifically tailored for Chinese dishes. We develop a comprehensive dish curation pipeline, building the largest dish dataset to date. Additionally, we introduce a recaption strategy and employ a coarse-to-fine training scheme to help the model better learn fine-grained culinary nuances. During inference, we enhance the user's textual input using a pre-constructed high-quality caption library and a large language model, enabling more photorealistic and faithful image generation. Furthermore, to extend our model's capability for dish editing tasks, we propose Concept-Enhanced P2P. Based on this approach, we build a dish editing dataset and train a specialized editing model. Extensive experiments demonstrate the superiority of our methods.

[291] Efficient 2D to Full 3D Human Pose Uplifting including Joint Rotations

Katja Ludwig,Yuliia Oksymets,Robin Schön,Daniel Kienzle,Rainer Lienhart

Main category: cs.CV

TLDR: 提出了一种新型2D到3D提升模型，直接通过单次前向传递估计3D人体姿态和关节旋转，显著提高了计算效率和精度。

Details

Motivation: 现有HMR模型在关节定位精度上不如3D HPE模型，而结合IK的方法计算成本高，需改进。 Method: 研究了多种旋转表示、损失函数和训练策略，提出直接估计3D姿态和旋转的模型。 Result: 模型在旋转估计上达到最优精度，计算速度比IK方法快150倍，关节定位精度超过HMR模型。 Conclusion: 新模型在效率和精度上均优于现有方法，适用于运动生物力学分析。 Abstract: In sports analytics, accurately capturing both the 3D locations and rotations of body joints is essential for understanding an athlete's biomechanics. While Human Mesh Recovery (HMR) models can estimate joint rotations, they often exhibit lower accuracy in joint localization compared to 3D Human Pose Estimation (HPE) models. Recent work addressed this limitation by combining a 3D HPE model with inverse kinematics (IK) to estimate both joint locations and rotations. However, IK is computationally expensive. To overcome this, we propose a novel 2D-to-3D uplifting model that directly estimates 3D human poses, including joint rotations, in a single forward pass. We investigate multiple rotation representations, loss functions, and training strategies - both with and without access to ground truth rotations. Our models achieve state-of-the-art accuracy in rotation estimation, are 150 times faster than the IK-based approach, and surpass HMR models in joint localization precision.

[292] Semantic Depth Matters: Explaining Errors of Deep Vision Networks through Perceived Class Similarities

Katarzyna Filus,Michał Romaszewski,Mateusz Żarski

Main category: cs.CV

TLDR: 论文提出了一种新框架，通过相似性深度（SD）指标和基于图的可视化方法，分析深度神经网络（DNN）的语义层次深度与错误分类模式的关系。

Details

Motivation: 当前评估方法缺乏透明度，难以解释网络误分类的根本原因，因此需要更深入的分析工具。 Method: 引入SD指标量化网络的语义层次深度，并提出基于图的可视化方法，利用分类器层权重的类模板分析已训练网络。 Result: 研究发现深度视觉网络编码了特定的语义层次，高语义深度能提高感知类别相似性与实际错误之间的一致性。 Conclusion: 该框架为理解DNN行为提供了新视角，无需额外数据即可分析网络误分类的语义原因。 Abstract: Understanding deep neural network (DNN) behavior requires more than evaluating classification accuracy alone; analyzing errors and their predictability is equally crucial. Current evaluation methodologies lack transparency, particularly in explaining the underlying causes of network misclassifications. To address this, we introduce a novel framework that investigates the relationship between the semantic hierarchy depth perceived by a network and its real-data misclassification patterns. Central to our framework is the Similarity Depth (SD) metric, which quantifies the semantic hierarchy depth perceived by a network along with a method of evaluation of how closely the network's errors align with its internally perceived similarity structure. We also propose a graph-based visualization of model semantic relationships and misperceptions. A key advantage of our approach is that leveraging class templates -- representations derived from classifier layer weights -- is applicable to already trained networks without requiring additional data or experiments. Our approach reveals that deep vision networks encode specific semantic hierarchies and that high semantic depth improves the compliance between perceived class similarities and actual errors.

[293] Dual-Path Enhancements in Event-Based Eye Tracking: Augmented Robustness and Adaptive Temporal Modeling

Hoang M. Truong,Vinh-Thuan Ly,Huy G. Tran,Thuan-Phat Nguyen,Tram T. Doan

Main category: cs.CV

TLDR: 论文提出了一种基于事件的眼动追踪技术，通过轻量级时空网络和两种关键改进（数据增强和混合架构KnightPupil），在CVPR 2025挑战赛上取得了1.61欧氏距离误差的优异表现。

Details

Motivation: 现有方法在真实场景中（如快速眼动和环境噪声）表现不佳，需要更高效的解决方案。 Method: 1. 引入数据增强管道（时间偏移、空间翻转和事件删除）；2. 提出混合架构KnightPupil（结合EfficientNet-B3、双向GRU和线性时变状态空间模块）。 Result: 在3ET+基准测试中，欧氏距离误差降低12%（1.61 vs. 1.70），在CVPR 2025挑战赛私有测试集上表现优异。 Conclusion: 该框架为AR/VR系统的实际部署提供了有效解决方案，并为神经形态视觉的未来创新奠定了基础。 Abstract: Event-based eye tracking has become a pivotal technology for augmented reality and human-computer interaction. Yet, existing methods struggle with real-world challenges such as abrupt eye movements and environmental noise. Building on the efficiency of the Lightweight Spatiotemporal Network-a causal architecture optimized for edge devices-we introduce two key advancements. First, a robust data augmentation pipeline incorporating temporal shift, spatial flip, and event deletion improves model resilience, reducing Euclidean distance error by 12% (1.61 vs. 1.70 baseline) on challenging samples. Second, we propose KnightPupil, a hybrid architecture combining an EfficientNet-B3 backbone for spatial feature extraction, a bidirectional GRU for contextual temporal modeling, and a Linear Time-Varying State-Space Module to adapt to sparse inputs and noise dynamically. Evaluated on the 3ET+ benchmark, our framework achieved 1.61 Euclidean distance on the private test set of the Event-based Eye Tracking Challenge at CVPR 2025, demonstrating its effectiveness for practical deployment in AR/VR systems while providing a foundation for future innovations in neuromorphic vision.

[294] SemiETS: Integrating Spatial and Content Consistencies for Semi-Supervised End-to-end Text Spotting

Dongliang Luo,Hanshen Zhu,Ziyang Zhang,Dingkang Liang,Xudong Xie,Yuliang Liu,Xiang Bai

Main category: cs.CV

TLDR: 提出了一种半监督端到端文本检测框架SemiETS，通过生成可靠的分层伪标签和利用双向流信息，显著提升了性能。

Details

Motivation: 减少高质量手动标注的高成本，研究半监督文本检测以利用未标注图像中的信息。 Method: 提出SemiETS框架，生成分层伪标签并利用双向流信息提高一致性。 Result: 在多个数据集上表现优异，优于现有半监督方法，甚至超过强监督方法。 Conclusion: SemiETS在减少标注成本的同时提升了性能，展示了实际应用潜力。 Abstract: Most previous scene text spotting methods rely on high-quality manual annotations to achieve promising performance. To reduce their expensive costs, we study semi-supervised text spotting (SSTS) to exploit useful information from unlabeled images. However, directly applying existing semi-supervised methods of general scenes to SSTS will face new challenges: 1) inconsistent pseudo labels between detection and recognition tasks, and 2) sub-optimal supervisions caused by inconsistency between teacher/student. Thus, we propose a new Semi-supervised framework for End-to-end Text Spotting, namely SemiETS that leverages the complementarity of text detection and recognition. Specifically, it gradually generates reliable hierarchical pseudo labels for each task, thereby reducing noisy labels. Meanwhile, it extracts important information in locations and transcriptions from bidirectional flows to improve consistency. Extensive experiments on three datasets under various settings demonstrate the effectiveness of SemiETS on arbitrary-shaped text. For example, it outperforms previous state-of-the-art SSL methods by a large margin on end-to-end spotting (+8.7%, +5.6%, and +2.6% H-mean under 0.5%, 1%, and 2% labeled data settings on Total-Text, respectively). More importantly, it still improves upon a strongly supervised text spotter trained with plenty of labeled data by 2.0%. Compelling domain adaptation ability shows practical potential. Moreover, our method demonstrates consistent improvement on different text spotters.

[295] Enhancing Multi-task Learning Capability of Medical Generalist Foundation Model via Image-centric Multi-annotation Data

Xun Zhu,Fanbin Mo,Zheng Zhang,Jiaxi Wang,Yiming Shi,Ming Wu,Chuang Zhang,Miao Li,Ji Wu

Main category: cs.CV

TLDR: 论文提出了一种图像中心的多注释X射线数据集（IMAX），旨在从数据构建层面提升医学多模态大语言模型（MLLMs）的多任务学习能力，显著优于传统分散数据集（DMAX）。

Details

Motivation: 现有医学通用基础模型多关注数据规模或架构改进，而忽视了从数据中心视角重新审视多任务学习，导致图像与任务对齐分散，无法满足临床多维图像解读需求。 Method: 构建IMAX数据集，包含高质量、图像中心的密集注释，每张X射线图像平均关联4.10个任务和7.46个训练条目，并与DMAX进行性能对比。 Result: IMAX在七种开源医学MLLMs中平均性能提升3.20%至21.05%，并分析了其与DMAX在统计模式和优化动态上的差异。 Conclusion: IMAX显著提升多任务性能，并提出基于DMAX的优化训练策略，以缓解高质量IMAX数据获取的实践难题。 Abstract: The emergence of medical generalist foundation models has revolutionized conventional task-specific model development paradigms, aiming to better handle multiple tasks through joint training on large-scale medical datasets. However, recent advances prioritize simple data scaling or architectural component enhancement, while neglecting to re-examine multi-task learning from a data-centric perspective. Critically, simply aggregating existing data resources leads to decentralized image-task alignment, which fails to cultivate comprehensive image understanding or align with clinical needs for multi-dimensional image interpretation. In this paper, we introduce the image-centric multi-annotation X-ray dataset (IMAX), the first attempt to enhance the multi-task learning capabilities of medical multi-modal large language models (MLLMs) from the data construction level. To be specific, IMAX is featured from the following attributes: 1) High-quality data curation. A comprehensive collection of more than 354K entries applicable to seven different medical tasks. 2) Image-centric dense annotation. Each X-ray image is associated with an average of 4.10 tasks and 7.46 training entries, ensuring multi-task representation richness per image. Compared to the general decentralized multi-annotation X-ray dataset (DMAX), IMAX consistently demonstrates significant multi-task average performance gains ranging from 3.20% to 21.05% across seven open-source state-of-the-art medical MLLMs. Moreover, we investigate differences in statistical patterns exhibited by IMAX and DMAX training processes, exploring potential correlations between optimization dynamics and multi-task performance. Finally, leveraging the core concept of IMAX data construction, we propose an optimized DMAX-based training strategy to alleviate the dilemma of obtaining high-quality IMAX data in practical scenarios.

[296] Beyond Degradation Redundancy: Contrastive Prompt Learning for All-in-One Image Restoration

Gang Wu,Junjun Jiang,Kui Jiang,Xianming Liu,Liqiang Nie

Main category: cs.CV

TLDR: 论文提出了一种名为对比提示学习（CPL）的新框架，通过稀疏提示模块（SPM）和对比提示正则化（CPR）改进多任务图像恢复中的提示-任务对齐。

Details

Motivation: 现有方法在自适应提示学习中存在任务表示重叠或冗余的问题，而显式提示可能丢失重建所需的关键视觉信息。 Method: CPL结合SPM（高效捕捉退化特征）和CPR（通过负样本增强任务边界），优化提示与恢复模型的交互。 Result: 在五个基准测试中，CPL显著提升了多任务和复合退化场景下的性能，同时保持参数高效。 Conclusion: CPL为统一图像恢复提供了新的最优解决方案。 Abstract: All-in-one image restoration, addressing diverse degradation types with a unified model, presents significant challenges in designing task-specific prompts that effectively guide restoration across multiple degradation scenarios. While adaptive prompt learning enables end-to-end optimization, it often yields overlapping or redundant task representations. Conversely, explicit prompts derived from pretrained classifiers enhance discriminability but may discard critical visual information for reconstruction. To address these limitations, we introduce Contrastive Prompt Learning (CPL), a novel framework that fundamentally enhances prompt-task alignment through two complementary innovations: a \emph{Sparse Prompt Module (SPM)} that efficiently captures degradation-specific features while minimizing redundancy, and a \emph{Contrastive Prompt Regularization (CPR)} that explicitly strengthens task boundaries by incorporating negative prompt samples across different degradation types. Unlike previous approaches that focus primarily on degradation classification, CPL optimizes the critical interaction between prompts and the restoration model itself. Extensive experiments across five comprehensive benchmarks demonstrate that CPL consistently enhances state-of-the-art all-in-one restoration models, achieving significant improvements in both standard multi-task scenarios and challenging composite degradation settings. Our framework establishes new state-of-the-art performance while maintaining parameter efficiency, offering a principled solution for unified image restoration.

[297] Resampling Benchmark for Efficient Comprehensive Evaluation of Large Vision-Language Models

Teppei Suzuki,Keisuke Ozawa

Main category: cs.CV

TLDR: 提出了一种高效评估大规模视觉语言模型（VLM）的协议，通过构建子集减少计算成本，同时保持评估结果的准确性。

Details

Motivation: 由于VLM的广泛知识和推理能力，需要多个基准进行全面评估，导致计算成本高昂。 Method: 使用最远点采样（FPS）构建子集，实验表明FPS基准与完整评估结果高度相关（>0.96），且仅需约1%的数据。 Result: FPS方法不仅提高了评估效率，还能减少数据集偏差。 Conclusion: FPS是一种高效且可靠的评估方法，适用于大规模VLM的全面评估。 Abstract: We propose an efficient evaluation protocol for large vision-language models (VLMs). Given their broad knowledge and reasoning capabilities, multiple benchmarks are needed for comprehensive assessment, making evaluation computationally expensive. To improve efficiency, we construct a subset that yields results comparable to full benchmark evaluations. Our benchmark classification experiments reveal that no single benchmark fully covers all challenges. We then introduce a subset construction method using farthest point sampling (FPS). Our experiments show that FPS-based benchmarks maintain a strong correlation (> 0.96) with full evaluations while using only ~1\% of the data. Additionally, applying FPS to an existing benchmark improves correlation with overall evaluation results, suggesting its potential to reduce unintended dataset biases.

[298] Correlative and Discriminative Label Grouping for Multi-Label Visual Prompt Tuning

LeiLei Ma,Shuo Xu,MingKun Xie,Lei Wang,Dengdi Sun,Haifeng Zhao

Main category: cs.CV

TLDR: 论文提出了一种多标签视觉提示调优框架，通过平衡标签的相关性和区分性关系，避免过拟合，提升模型性能。

Details

Motivation: 当前多标签图像分类研究过于强调标签的共现关系，可能导致过拟合，因此需要平衡相关性和区分性关系。 Method: 提出Multi-Label Visual Prompt Tuning框架，将类别分组并分别建模，利用Vision Transformer和混合专家模型学习标签关系。 Result: 在多个基准数据集上表现优异，优于现有最佳方法。 Conclusion: 该方法有效平衡了标签关系，提升了分类性能，具有参数高效性。 Abstract: Modeling label correlations has always played a pivotal role in multi-label image classification (MLC), attracting significant attention from researchers. However, recent studies have overemphasized co-occurrence relationships among labels, which can lead to overfitting risk on this overemphasis, resulting in suboptimal models. To tackle this problem, we advocate for balancing correlative and discriminative relationships among labels to mitigate the risk of overfitting and enhance model performance. To this end, we propose the Multi-Label Visual Prompt Tuning framework, a novel and parameter-efficient method that groups classes into multiple class subsets according to label co-occurrence and mutual exclusivity relationships, and then models them respectively to balance the two relationships. In this work, since each group contains multiple classes, multiple prompt tokens are adopted within Vision Transformer (ViT) to capture the correlation or discriminative label relationship within each group, and effectively learn correlation or discriminative representations for class subsets. On the other hand, each group contains multiple group-aware visual representations that may correspond to multiple classes, and the mixture of experts (MoE) model can cleverly assign them from the group-aware to the label-aware, adaptively obtaining label-aware representation, which is more conducive to classification. Experiments on multiple benchmark datasets show that our proposed approach achieves competitive results and outperforms SOTA methods on multiple pre-trained models.

[299] Metric-Guided Synthesis of Class Activation Mapping

Alejandro Luque-Cerpa,Elizabeth Polgreen,Ajitha Rajan,Hazem Torfah

Main category: cs.CV

TLDR: SyCAM是一种基于度量的方法，用于合成CAM表达式，通过预定义的评估指标自动生成优化的热图，解决了现有CAM方法无法根据用户意图或领域知识调整的局限性。

Details

Motivation: 现有CAM方法生成的热图虽然隐含了一些特性（如与真实数据的相似性、鲁棒性等），但无法根据用户需求或领域知识灵活调整。 Method: SyCAM通过预定义的评估指标和语法约束自动生成优化的CAM表达式，特别探索了语法引导的合成方法。 Result: 实验表明，SyCAM在生成目标热图方面具有高效性和灵活性，并在ResNet50、VGG16和VGG19模型上与其他CAM方法进行了对比。 Conclusion: SyCAM提供了一种灵活且高效的方式，能够根据用户需求生成优化的CAM热图，弥补了现有方法的不足。 Abstract: Class activation mapping (CAM) is a widely adopted class of saliency methods used to explain the behavior of convolutional neural networks (CNNs). These methods generate heatmaps that highlight the parts of the input most relevant to the CNN output. Various CAM methods have been proposed, each distinguished by the expressions used to derive heatmaps. In general, users look for heatmaps with specific properties that reflect different aspects of CNN functionality. These may include similarity to ground truth, robustness, equivariance, and more. Although existing CAM methods implicitly encode some of these properties in their expressions, they do not allow for variability in heatmap generation following the user's intent or domain knowledge. In this paper, we address this limitation by introducing SyCAM, a metric-based approach for synthesizing CAM expressions. Given a predefined evaluation metric for saliency maps, SyCAM automatically generates CAM expressions optimized for that metric. We specifically explore a syntax-guided synthesis instantiation of SyCAM, where CAM expressions are derived based on predefined syntactic constraints and the given metric. Using several established evaluation metrics, we demonstrate the efficacy and flexibility of our approach in generating targeted heatmaps. We compare SyCAM with other well-known CAM methods on three prominent models: ResNet50, VGG16, and VGG19.

[300] GaussVideoDreamer: 3D Scene Generation with Video Diffusion and Inconsistency-Aware Gaussian Splatting

Junlin Hao,Peiheng Wang,Haoyang Wang,Xinggong Zhang,Zongming Guo

Main category: cs.CV

TLDR: GaussVideoDreamer通过结合图像、视频和3D生成的优势，提出了一种新的单图像3D场景重建方法，显著提高了质量和速度。

Details

Motivation: 解决单图像3D场景重建的固有病态性和输入约束问题，同时克服现有方法在泛化性和一致性上的不足。 Method: 结合几何感知初始化协议、不一致感知高斯泼溅和渐进式视频修复策略，利用时间一致性和3D证据引导生成。 Result: 实验显示，该方法在LLaVA-IQA评分上提高了32%，速度至少提升2倍，且在多场景中表现稳健。 Conclusion: GaussVideoDreamer通过创新策略有效提升了3D场景重建的质量和效率，为未来研究提供了新方向。 Abstract: Single-image 3D scene reconstruction presents significant challenges due to its inherently ill-posed nature and limited input constraints. Recent advances have explored two promising directions: multiview generative models that train on 3D consistent datasets but struggle with out-of-distribution generalization, and 3D scene inpainting and completion frameworks that suffer from cross-view inconsistency and suboptimal error handling, as they depend exclusively on depth data or 3D smoothness, which ultimately degrades output quality and computational performance. Building upon these approaches, we present GaussVideoDreamer, which advances generative multimedia approaches by bridging the gap between image, video, and 3D generation, integrating their strengths through two key innovations: (1) A progressive video inpainting strategy that harnesses temporal coherence for improved multiview consistency and faster convergence. (2) A 3D Gaussian Splatting consistency mask to guide the video diffusion with 3D consistent multiview evidence. Our pipeline combines three core components: a geometry-aware initialization protocol, Inconsistency-Aware Gaussian Splatting, and a progressive video inpainting strategy. Experimental results demonstrate that our approach achieves 32% higher LLaVA-IQA scores and at least 2x speedup compared to existing methods while maintaining robust performance across diverse scenes.

[301] An Image is Worth $K$ Topics: A Visual Structural Topic Model with Pretrained Image Embeddings

Matías Piqueras,Alexandra Segerberg,Matteo Magnani,Måns Magnusson,Nataša Sladoje

Main category: cs.CV

TLDR: 本文介绍了一种结合预训练图像嵌入和结构化主题模型的视觉结构化主题模型（vSTM），用于分析政治科学中的视觉内容。

Details

Motivation: 政治科学领域需要更适合社会和政治研究目标的视觉内容分析方法。 Method: 结合预训练图像嵌入和结构化主题模型，捕捉图像语义复杂性并分析主题与协变量关系。 Result: vSTM能够识别可解释、连贯且与在线政治通信研究相关的主题。 Conclusion: vSTM为政治科学中的视觉内容分析提供了有效的工具。 Abstract: Political scientists are increasingly interested in analyzing visual content at scale. However, the existing computational toolbox is still in need of methods and models attuned to the specific challenges and goals of social and political inquiry. In this article, we introduce a visual Structural Topic Model (vSTM) that combines pretrained image embeddings with a structural topic model. This has important advantages compared to existing approaches. First, pretrained embeddings allow the model to capture the semantic complexity of images relevant to political contexts. Second, the structural topic model provides the ability to analyze how topics and covariates are related, while maintaining a nuanced representation of images as a mixture of multiple topics. In our empirical application, we show that the vSTM is able to identify topics that are interpretable, coherent, and substantively relevant to the study of online political communication.

[302] EBAD-Gaussian: Event-driven Bundle Adjusted Deblur Gaussian Splatting

Yufei Deng,Yuanjian Wang,Rong Xiao,Chenwei Tang,Jizhe Zhou,Jiahao Fan,Deng Xiong,Jiancheng Lv,Huajin Tang

Main category: cs.CV

TLDR: EBAD-Gaussian利用事件相机数据改进3D高斯泼溅技术，通过联合学习高斯参数和相机运动轨迹，从模糊图像和事件流中重建清晰3D场景。

Details

Motivation: 现有RGB去模糊方法在快速运动或低光条件下难以建模相机位姿和辐射变化，导致重建精度下降。事件相机能捕捉曝光期间的亮度变化，有助于建模运动模糊。 Method: 提出EBAD-Gaussian，通过合成潜在清晰图像构建模糊损失函数，利用事件流监督光强变化，并基于事件双积分先验优化中间曝光时间的图像。 Result: 在合成和真实数据集上，EBAD-Gaussian能从模糊图像和事件流输入中实现高质量3D场景重建。 Conclusion: EBAD-Gaussian有效解决了运动模糊问题，提升了3D高斯泼溅技术在复杂场景下的重建质量。 Abstract: While 3D Gaussian Splatting (3D-GS) achieves photorealistic novel view synthesis, its performance degrades with motion blur. In scenarios with rapid motion or low-light conditions, existing RGB-based deblurring methods struggle to model camera pose and radiance changes during exposure, reducing reconstruction accuracy. Event cameras, capturing continuous brightness changes during exposure, can effectively assist in modeling motion blur and improving reconstruction quality. Therefore, we propose Event-driven Bundle Adjusted Deblur Gaussian Splatting (EBAD-Gaussian), which reconstructs sharp 3D Gaussians from event streams and severely blurred images. This method jointly learns the parameters of these Gaussians while recovering camera motion trajectories during exposure time. Specifically, we first construct a blur loss function by synthesizing multiple latent sharp images during the exposure time, minimizing the difference between real and synthesized blurred images. Then we use event stream to supervise the light intensity changes between latent sharp images at any time within the exposure period, supplementing the light intensity dynamic changes lost in RGB images. Furthermore, we optimize the latent sharp images at intermediate exposure times based on the event-based double integral (EDI) prior, applying consistency constraints to enhance the details and texture information of the reconstructed images. Extensive experiments on synthetic and real-world datasets show that EBAD-Gaussian can achieve high-quality 3D scene reconstruction under the condition of blurred images and event stream inputs.

[303] RGB-Event based Pedestrian Attribute Recognition: A Benchmark Dataset and An Asymmetric RWKV Fusion Framework

Xiao Wang,Haiyang Wang,Shiao Wang,Qiang Chen,Jiandong Jin,Haoyu Song,Bo Jiang,Chenglong Li

Main category: cs.CV

TLDR: 论文提出了一种基于RGB-Event多模态的行人属性识别任务，并构建了首个大规模数据集EventPAR，同时提出了一种基于RWKV的多模态框架，取得了先进成果。

Details

Motivation: 现有RGB相机方法受限于光照和运动模糊，且缺乏对情感维度的探索，因此提出多模态方法以解决这些问题。 Method: 引入EventPAR数据集（100K样本，50属性），提出RWKV视觉编码器和非对称RWKV融合模块的多模态框架。 Result: 在EventPAR及两个模拟数据集上取得最先进结果。 Conclusion: 多模态方法为行人属性识别提供了新方向，数据集和框架为未来研究奠定基础。 Abstract: Existing pedestrian attribute recognition methods are generally developed based on RGB frame cameras. However, these approaches are constrained by the limitations of RGB cameras, such as sensitivity to lighting conditions and motion blur, which hinder their performance. Furthermore, current attribute recognition primarily focuses on analyzing pedestrians' external appearance and clothing, lacking an exploration of emotional dimensions. In this paper, we revisit these issues and propose a novel multi-modal RGB-Event attribute recognition task by drawing inspiration from the advantages of event cameras in low-light, high-speed, and low-power consumption. Specifically, we introduce the first large-scale multi-modal pedestrian attribute recognition dataset, termed EventPAR, comprising 100K paired RGB-Event samples that cover 50 attributes related to both appearance and six human emotions, diverse scenes, and various seasons. By retraining and evaluating mainstream PAR models on this dataset, we establish a comprehensive benchmark and provide a solid foundation for future research in terms of data and algorithmic baselines. In addition, we propose a novel RWKV-based multi-modal pedestrian attribute recognition framework, featuring an RWKV visual encoder and an asymmetric RWKV fusion module. Extensive experiments are conducted on our proposed dataset as well as two simulated datasets (MARS-Attribute and DukeMTMC-VID-Attribute), achieving state-of-the-art results. The source code and dataset will be released on https://github.com/Event-AHU/OpenPAR

[304] Masked Autoencoder Self Pre-Training for Defect Detection in Microelectronics

Nikolai Röhrich,Alwin Hoffmann,Richard Nordsieck,Emilio Zarbali,Alireza Javanmardi

Main category: cs.CV

TLDR: 论文提出了一种基于掩码自编码器（MAE）的视觉变换器（ViT）预训练框架，用于微电子缺陷检测，解决了数据稀疏和领域差异问题，性能优于现有方法。

Details

Motivation: 微电子缺陷检测仍依赖CNN，而Transformer因数据需求高和标注成本高未被广泛应用。领域数据与自然图像差异大，迁移学习受限。 Method: 采用MAE预训练ViT，通过掩码和重建图像块进行自监督学习，使用少于10,000张SAM图像进行预训练和缺陷检测。 Result: 自预训练ViT性能显著优于监督ViT、自然图像预训练ViT和CNN模型，且能更关注缺陷相关特征（如焊料裂纹）。 Conclusion: 自预训练ViT能生成缺陷特异性特征表示，适用于实际微电子缺陷检测。 Abstract: Whereas in general computer vision, transformer-based architectures have quickly become the gold standard, microelectronics defect detection still heavily relies on convolutional neural networks (CNNs). We hypothesize that this is due to the fact that a) transformers have an increased need for data and b) labelled image generation procedures for microelectronics are costly, and labelled data is therefore sparse. Whereas in other domains, pre-training on large natural image datasets can mitigate this problem, in microelectronics transfer learning is hindered due to the dissimilarity of domain data and natural images. Therefore, we evaluate self pre-training, where models are pre-trained on the target dataset, rather than another dataset. We propose a vision transformer (ViT) pre-training framework for defect detection in microelectronics based on masked autoencoders (MAE). In MAE, a large share of image patches is masked and reconstructed by the model during pre-training. We perform pre-training and defect detection using a dataset of less than 10.000 scanning acoustic microscopy (SAM) images labelled using transient thermal analysis (TTA). Our experimental results show that our approach leads to substantial performance gains compared to a) supervised ViT, b) ViT pre-trained on natural image datasets, and c) state-of-the-art CNN-based defect detection models used in the literature. Additionally, interpretability analysis reveals that our self pre-trained models, in comparison to ViT baselines, correctly focus on defect-relevant features such as cracks in the solder material. This demonstrates that our approach yields fault-specific feature representations, making our self pre-trained models viable for real-world defect detection in microelectronics.

[305] Relative Illumination Fields: Learning Medium and Light Independent Underwater Scenes

Mengkun She,Felix Seegräber,David Nakath,Patricia Schöntag,Kevin Köser

Main category: cs.CV

TLDR: 提出一种在非均匀光照和散射环境中构建一致且逼真的神经辐射场的方法，适用于未知共移动光源场景。

Details

Motivation: 现有水下场景表示方法多针对静态均匀光照，而忽略了如机器人探索深水区时阳光不足的情况。 Method: 提出一种局部附着于相机的光照场，结合体积介质表示，处理动态光照与静态散射介质的交互。 Result: 评估结果表明该方法有效且灵活。 Conclusion: 该方法为非均匀光照和散射环境中的场景表示提供了有效解决方案。 Abstract: We address the challenge of constructing a consistent and photorealistic Neural Radiance Field in inhomogeneously illuminated, scattering environments with unknown, co-moving light sources. While most existing works on underwater scene representation focus on a static homogeneous illumination, limited attention has been paid to scenarios such as when a robot explores water deeper than a few tens of meters, where sunlight becomes insufficient. To address this, we propose a novel illumination field locally attached to the camera, enabling the capture of uneven lighting effects within the viewing frustum. We combine this with a volumetric medium representation to an overall method that effectively handles interaction between dynamic illumination field and static scattering medium. Evaluation results demonstrate the effectiveness and flexibility of our approach.

[306] TT3D: Table Tennis 3D Reconstruction

Thomas Gossard,Andreas Ziegler,Andreas Zell

Main category: cs.CV

TLDR: 提出了一种从乒乓球比赛录像中重建精确3D球轨迹的新方法，结合物理运动模型和自动相机校准，无需依赖人体姿态估计或球拍跟踪。

Details

Motivation: 传统2D球追踪依赖摄像机视角，无法支持全面比赛分析，需解决3D重建问题。 Method: 利用球的物理运动模型最小化重投影误差，自动校准相机，并改进3D姿态估计模型追踪球员动作。 Result: 实现了可靠的3D球轨迹重建，并能推断球的旋转，无需依赖不可靠的球拍或人体姿态数据。 Conclusion: 该方法为乒乓球比赛提供了全面的3D重建能力，解决了传统2D追踪的局限性。 Abstract: Sports analysis requires processing large amounts of data, which is time-consuming and costly. Advancements in neural networks have significantly alleviated this burden, enabling highly accurate ball tracking in sports broadcasts. However, relying solely on 2D ball tracking is limiting, as it depends on the camera's viewpoint and falls short of supporting comprehensive game analysis. To address this limitation, we propose a novel approach for reconstructing precise 3D ball trajectories from online table tennis match recordings. Our method leverages the underlying physics of the ball's motion to identify the bounce state that minimizes the reprojection error of the ball's flying trajectory, hence ensuring an accurate and reliable 3D reconstruction. A key advantage of our approach is its ability to infer ball spin without relying on human pose estimation or racket tracking, which are often unreliable or unavailable in broadcast footage. We developed an automated camera calibration method capable of reliably tracking camera movements. Additionally, we adapted an existing 3D pose estimation model, which lacks depth motion capture, to accurately track player movements. Together, these contributions enable the full 3D reconstruction of a table tennis rally.

[307] Investigating the Role of Bilateral Symmetry for Inpainting Brain MRI

Sergey Kuznetsov,Sanduni Pinnawala,Peter A. Wijeratne,Ivor J. A. Simpson

Main category: cs.CV

TLDR: 该论文研究了医学影像修复（inpainting）技术中，大脑MRI修复结果与受试者特定条件信息（如被掩蔽区域）之间的统计关系，重点关注半球对称性的影响。

Details

Motivation: 探索大脑MRI修复过程中，模型从哪些区域获取信息，以及半球对称性在修复中的重要性。 Method: 通过分析扩散修复模型，研究子皮层结构的修复结果，基于强度和估计面积变化。 Result: 实验表明，某些结构的修复过程受对称性条件强烈影响。 Conclusion: 半球对称性在大脑MRI修复中具有重要作用，某些结构的修复结果显著依赖于对称性信息。 Abstract: Inpainting has recently emerged as a valuable and interesting technology to employ in the analysis of medical imaging data, in particular brain MRI. A wide variety of methodologies for inpainting MRI have been proposed and demonstrated on tasks including anomaly detection. In this work we investigate the statistical relationship between inpainted brain structures and the amount of subject-specific conditioning information, i.e. the other areas of the image that are masked. In particular, we analyse the distribution of inpainting results when masking additional regions of the image, specifically the contra-lateral structure. This allows us to elucidate where in the brain the model is drawing information from, and in particular, what is the importance of hemispherical symmetry? Our experiments interrogate a diffusion inpainting model through analysing the inpainting of subcortical brain structures based on intensity and estimated area change. We demonstrate that some structures show a strong influence of symmetry in the conditioning of the inpainting process.

[308] Aligning Anime Video Generation with Human Feedback

Bingwen Zhu,Yudong Jiang,Baohan Xu,Siqian Yang,Mingyu Yin,Yidi Wu,Huyang Sun,Zuxuan Wu

Main category: cs.CV

TLDR: 提出了一种基于人类反馈的动漫视频生成优化流程，包括构建首个多维奖励数据集AnimeReward和引入GAPO训练方法，显著提升了生成质量。

Details

Motivation: 现有奖励模型针对真实世界视频设计，无法满足动漫视频独特的外观和一致性需求，导致生成质量不佳。 Method: 构建30k人类标注的多维奖励数据集，开发AnimeReward模型，并提出GAPO训练方法。 Result: AnimeReward优于现有模型，GAPO进一步提升了对齐性能，实验验证了流程的有效性。 Conclusion: 提出的流程显著提升了动漫视频生成质量，数据集和代码将公开。 Abstract: Anime video generation faces significant challenges due to the scarcity of anime data and unusual motion patterns, leading to issues such as motion distortion and flickering artifacts, which result in misalignment with human preferences. Existing reward models, designed primarily for real-world videos, fail to capture the unique appearance and consistency requirements of anime. In this work, we propose a pipeline to enhance anime video generation by leveraging human feedback for better alignment. Specifically, we construct the first multi-dimensional reward dataset for anime videos, comprising 30k human-annotated samples that incorporating human preferences for both visual appearance and visual consistency. Based on this, we develop AnimeReward, a powerful reward model that employs specialized vision-language models for different evaluation dimensions to guide preference alignment. Furthermore, we introduce Gap-Aware Preference Optimization (GAPO), a novel training method that explicitly incorporates preference gaps into the optimization process, enhancing alignment performance and efficiency. Extensive experiment results show that AnimeReward outperforms existing reward models, and the inclusion of GAPO leads to superior alignment in both quantitative benchmarks and human evaluations, demonstrating the effectiveness of our pipeline in enhancing anime video quality. Our dataset and code will be publicly available.

[309] Multi-Object Grounding via Hierarchical Contrastive Siamese Transformers

Chengyi Du,Keyan Jin

Main category: cs.CV

TLDR: 论文提出了一种名为H-COST的方法，用于解决3D场景中多目标定位问题，通过分层对比孪生变换器提升复杂语言指令的理解能力，性能优于现有方法9.5%。

Details

Motivation: 现实场景中常需定位多个对象，而现有研究多集中于单目标定位，因此需要一种能处理多目标定位的方法。 Method: 采用分层处理策略和对比孪生变换器框架，通过辅助网络增强参考网络的语义理解能力。 Result: 在复杂多目标定位基准测试中，性能提升9.5%。 Conclusion: H-COST方法在多目标定位任务中表现出色，为复杂场景下的对象定位提供了有效解决方案。 Abstract: Multi-object grounding in 3D scenes involves localizing multiple objects based on natural language input. While previous work has primarily focused on single-object grounding, real-world scenarios often demand the localization of several objects. To tackle this challenge, we propose Hierarchical Contrastive Siamese Transformers (H-COST), which employs a Hierarchical Processing strategy to progressively refine object localization, enhancing the understanding of complex language instructions. Additionally, we introduce a Contrastive Siamese Transformer framework, where two networks with the identical structure are used: one auxiliary network processes robust object relations from ground-truth labels to guide and enhance the second network, the reference network, which operates on segmented point-cloud data. This contrastive mechanism strengthens the model' s semantic understanding and significantly enhances its ability to process complex point-cloud data. Our approach outperforms previous state-of-the-art methods by 9.5% on challenging multi-object grounding benchmarks.

[310] Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure

Théo Gigant,Camille Guinaudeau,Frédéric Dufaux

Main category: cs.CV

TLDR: 本文分析了使用视觉语言模型（VLMs）对多模态演示进行自动摘要的效果，提出了在不同输入长度预算下的高效策略，并探讨了跨模态交互的性质。

Details

Motivation: 研究VLMs在多模态演示自动摘要中的表现，探索不同输入形式（如幻灯片、视频、文本等）对摘要质量的影响。 Method: 通过细粒度的定量和定性分析，比较不同输入形式（如原始视频、幻灯片、幻灯片与文本交替结构）对摘要生成的影响。 Result: 实验表明，使用幻灯片作为输入优于原始视频，而幻灯片与文本交替的结构表现最佳。 Conclusion: 研究为多模态文档摘要提供了高效策略，并提出了改进VLMs跨模态理解能力的建议。 Abstract: Vision-Language Models (VLMs) can process visual and textual information in multiple formats: texts, images, interleaved texts and images, or even hour-long videos. In this work, we conduct fine-grained quantitative and qualitative analyses of automatic summarization of multimodal presentations using VLMs with various representations as input. From these experiments, we suggest cost-effective strategies for generating summaries from text-heavy multimodal documents under different input-length budgets using VLMs. We show that slides extracted from the video stream can be beneficially used as input against the raw video, and that a structured representation from interleaved slides and transcript provides the best performance. Finally, we reflect and comment on the nature of cross-modal interactions in multimodal presentations and share suggestions to improve the capabilities of VLMs to understand documents of this nature.

[311] Mavors: Multi-granularity Video Representation for Multimodal Large Language Model

Yang Shi,Jiaheng Liu,Yushuo Guan,Zhenhua Wu,Yuanxing Zhang,Zihao Wang,Weihong Lin,Jingyun Hua,Zekun Wang,Xinlong Chen,Bohan Zeng,Wentao Zhang,Fuzheng Zhang,Wenjing Yang,Di Zhang

Main category: cs.CV

TLDR: Mavors 是一个新型多粒度视频表示框架，用于解决长视频理解中的计算效率与细粒度时空模式保留问题。

Details

Motivation: 现有方法在复杂运动或多分辨率视频中会丢失时空动态或细节信息，需要一种更高效的解决方案。 Method: Mavors 通过 Intra-chunk Vision Encoder 保留高分辨率空间特征，并通过 Inter-chunk Feature Aggregator 建立时间连贯性。 Result: 实验表明，Mavors 在时空推理任务中显著优于现有方法。 Conclusion: Mavors 提供了一种高效且保留细节的长视频理解框架。 Abstract: Long-context video understanding in multimodal large language models (MLLMs) faces a critical challenge: balancing computational efficiency with the retention of fine-grained spatio-temporal patterns. Existing approaches (e.g., sparse sampling, dense sampling with low resolution, and token compression) suffer from significant information loss in temporal dynamics, spatial details, or subtle interactions, particularly in videos with complex motion or varying resolutions. To address this, we propose $\mathbf{Mavors}$, a novel framework that introduces $\mathbf{M}$ulti-gr$\mathbf{a}$nularity $\mathbf{v}$ide$\mathbf{o}$ $\mathbf{r}$epre$\mathbf{s}$entation for holistic long-video modeling. Specifically, Mavors directly encodes raw video content into latent representations through two core components: 1) an Intra-chunk Vision Encoder (IVE) that preserves high-resolution spatial features via 3D convolutions and Vision Transformers, and 2) an Inter-chunk Feature Aggregator (IFA) that establishes temporal coherence across chunks using transformer-based dependency modeling with chunk-level rotary position encodings. Moreover, the framework unifies image and video understanding by treating images as single-frame videos via sub-image decomposition. Experiments across diverse benchmarks demonstrate Mavors' superiority in maintaining both spatial fidelity and temporal continuity, significantly outperforming existing methods in tasks requiring fine-grained spatio-temporal reasoning.

[312] DTFSal: Audio-Visual Dynamic Token Fusion for Video Saliency Prediction

Kiana Hoshanfar,Alireza Hosseini,Ahmad Kalhor,Babak Nadjar Araabi

Main category: cs.CV

TLDR: DFTSal是一种新颖的音频-视觉显著性预测框架，通过动态令牌融合和自适应多模态融合，平衡了准确性和计算效率。

Details

Motivation: 尽管视觉显著性预测已有显著进展，但有效整合听觉信息仍具挑战性，主要由于复杂的时空交互和高计算需求。 Method: DFTSal采用多尺度视觉编码器，包含LTEB和DLTFB模块，以及音频分支和AMFB多模态融合模块，最终通过多解码器生成显著性图。 Result: 在六个音频-视觉基准测试中，DFTSal实现了SOTA性能，同时保持计算效率。 Conclusion: DFTSal通过创新的动态令牌融合和多模态融合，显著提升了音频-视觉显著性预测的性能和效率。 Abstract: Audio-visual saliency prediction aims to mimic human visual attention by identifying salient regions in videos through the integration of both visual and auditory information. Although visual-only approaches have significantly advanced, effectively incorporating auditory cues remains challenging due to complex spatio-temporal interactions and high computational demands. To address these challenges, we propose Dynamic Token Fusion Saliency (DFTSal), a novel audio-visual saliency prediction framework designed to balance accuracy with computational efficiency. Our approach features a multi-scale visual encoder equipped with two novel modules: the Learnable Token Enhancement Block (LTEB), which adaptively weights tokens to emphasize crucial saliency cues, and the Dynamic Learnable Token Fusion Block (DLTFB), which employs a shifting operation to reorganize and merge features, effectively capturing long-range dependencies and detailed spatial information. In parallel, an audio branch processes raw audio signals to extract meaningful auditory features. Both visual and audio features are integrated using our Adaptive Multimodal Fusion Block (AMFB), which employs local, global, and adaptive fusion streams for precise cross-modal fusion. The resulting fused features are processed by a hierarchical multi-decoder structure, producing accurate saliency maps. Extensive evaluations on six audio-visual benchmarks demonstrate that DFTSal achieves SOTA performance while maintaining computational efficiency.

[313] Hierarchical Relation-augmented Representation Generalization for Few-shot Action Recognition

Hongyu Qu,Ling Xing,Rui Yan,Yazhou Yao,Guo-Sen Xie,Xiangbo Shu

Main category: cs.CV

TLDR: HR2G-shot是一个用于少样本动作识别（FSAR）的框架，通过统一三种关系建模（帧间、视频间和任务间）来学习任务特定的时间模式。

Details

Motivation: 现有方法忽略了视频和任务之间的显式关系建模，无法捕捉共享的时间模式或重用历史任务中的时间知识。 Method: HR2G-shot设计了两个组件：视频间语义关联（ISC）和任务间知识转移（IKT），分别探索视频间和任务间的关系。 Result: 在五个基准测试中，HR2G-shot优于当前领先的FSAR方法。 Conclusion: HR2G-shot通过多层次关系建模，显著提升了少样本动作识别的性能。 Abstract: Few-shot action recognition (FSAR) aims to recognize novel action categories with few exemplars. Existing methods typically learn frame-level representations independently for each video by designing various inter-frame temporal modeling strategies. However, they neglect explicit relation modeling between videos and tasks, thus failing to capture shared temporal patterns across videos and reuse temporal knowledge from historical tasks. In light of this, we propose HR2G-shot, a Hierarchical Relation-augmented Representation Generalization framework for FSAR, which unifies three types of relation modeling (inter-frame, inter-video, and inter-task) to learn task-specific temporal patterns from a holistic view. In addition to conducting inter-frame temporal interactions, we further devise two components to respectively explore inter-video and inter-task relationships: i) Inter-video Semantic Correlation (ISC) performs cross-video frame-level interactions in a fine-grained manner, thereby capturing task-specific query features and learning intra- and inter-class temporal correlations among support features; ii) Inter-task Knowledge Transfer (IKT) retrieves and aggregates relevant temporal knowledge from the bank, which stores diverse temporal patterns from historical tasks. Extensive experiments on five benchmarks show that HR2G-shot outperforms current top-leading FSAR methods.

[314] Learning to Harmonize Cross-vendor X-ray Images by Non-linear Image Dynamics Correction

Yucheng Lu,Shunxin Wang,Dovile Juodelyte,Veronika Cheplygina

Main category: cs.CV

TLDR: 论文探讨了传统图像增强如何提升医学图像分析的模型鲁棒性，提出了一种称为GDCE的方法来解决领域特定曝光不匹配问题。

Details

Motivation: 研究动机在于解决医学图像中因不同厂商设备导致的领域特定动态特性问题，传统线性变换无法有效处理。 Method: 方法是通过将图像协调任务重新定义为曝光校正问题，提出GDCE方法，利用预定义多项式函数和领域判别器进行训练。 Result: 结果表明，GDCE能有效减少领域特定曝光不匹配，提升模型在下游任务中的透明性。 Conclusion: 结论是GDCE优于现有黑盒方法，为医学图像分析提供了更透明的解决方案。 Abstract: In this paper, we explore how conventional image enhancement can improve model robustness in medical image analysis. By applying commonly used normalization methods to images from various vendors and studying their influence on model generalization in transfer learning, we show that the nonlinear characteristics of domain-specific image dynamics cannot be addressed by simple linear transforms. To tackle this issue, we reformulate the image harmonization task as an exposure correction problem and propose a method termed Global Deep Curve Estimation (GDCE) to reduce domain-specific exposure mismatch. GDCE performs enhancement via a pre-defined polynomial function and is trained with the help of a ``domain discriminator'', aiming to improve model transparency in downstream tasks compared to existing black-box methods.

[315] UP-Person: Unified Parameter-Efficient Transfer Learning for Text-based Person Retrieval

Yating Liu,Yaowei Li,Xiangyuan Lan,Wenming Yang,Zimo Liu,Qingmin Liao

Main category: cs.CV

TLDR: 提出了一种名为UP-Person的参数高效迁移学习方法，用于文本驱动的行人检索任务，通过结合Prefix、LoRA和Adapter三种轻量级组件，显著提升了性能。

Details

Motivation: 现有方法通常完全微调预训练模型，容易过拟合且泛化能力不足。UP-Person旨在通过参数高效迁移学习解决这一问题。 Method: UP-Person整合了Prefix、LoRA和Adapter三种组件，并优化了S-Prefix和L-Adapter两个子模块，以提升局部信息挖掘和全局特征调整能力。 Result: 在多个数据集（如CUHK-PEDES、ICFG-PEDES和RSTPReid）上取得了最先进的结果，仅微调了4.7%的参数。 Conclusion: UP-Person通过高效的参数迁移学习方法，显著提升了文本驱动的行人检索任务的性能，同时避免了过拟合问题。 Abstract: Text-based Person Retrieval (TPR) as a multi-modal task, which aims to retrieve the target person from a pool of candidate images given a text description, has recently garnered considerable attention due to the progress of contrastive visual-language pre-trained model. Prior works leverage pre-trained CLIP to extract person visual and textual features and fully fine-tune the entire network, which have shown notable performance improvements compared to uni-modal pre-training models. However, full-tuning a large model is prone to overfitting and hinders the generalization ability. In this paper, we propose a novel Unified Parameter-Efficient Transfer Learning (PETL) method for Text-based Person Retrieval (UP-Person) to thoroughly transfer the multi-modal knowledge from CLIP. Specifically, UP-Person simultaneously integrates three lightweight PETL components including Prefix, LoRA and Adapter, where Prefix and LoRA are devised together to mine local information with task-specific information prompts, and Adapter is designed to adjust global feature representations. Additionally, two vanilla submodules are optimized to adapt to the unified architecture of TPR. For one thing, S-Prefix is proposed to boost attention of prefix and enhance the gradient propagation of prefix tokens, which improves the flexibility and performance of the vanilla prefix. For another thing, L-Adapter is designed in parallel with layer normalization to adjust the overall distribution, which can resolve conflicts caused by overlap and interaction among multiple submodules. Extensive experimental results demonstrate that our UP-Person achieves state-of-the-art results across various person retrieval datasets, including CUHK-PEDES, ICFG-PEDES and RSTPReid while merely fine-tuning 4.7\% parameters. Code is available at https://github.com/Liu-Yating/UP-Person.

[316] CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography

I-Sheng Fang,Jun-Cheng Chen

Main category: cs.CV

TLDR: 论文探讨了多模态大语言模型（MLLMs）在摄影相关任务中的视觉推理能力，尤其是从照片中推断相机参数的能力，并展示了初步结果。

Details

Motivation: 视觉推理（结合视觉和文本输入）在人工智能中尚未充分探索，摄影任务因其物理特性（如光照、模糊程度等）为MLLMs提供了挑战性的测试场景。 Method: 扩展了先前用于视觉语言模型（VLMs）的方法，评估MLLMs在区分与相机设置相关的视觉差异上的能力。 Result: 初步结果表明视觉推理在摄影任务中的重要性，且没有单一MLLM在所有评估任务中表现一致最优。 Conclusion: 开发具有更好视觉推理能力的MLLMs仍面临挑战和机遇。 Abstract: Large language models (LLMs) and multimodal large language models (MLLMs) have significantly advanced artificial intelligence. However, visual reasoning, reasoning involving both visual and textual inputs, remains underexplored. Recent advancements, including the reasoning models like OpenAI o1 and Gemini 2.0 Flash Thinking, which incorporate image inputs, have opened this capability. In this ongoing work, we focus specifically on photography-related tasks because a photo is a visual snapshot of the physical world where the underlying physics (i.e., illumination, blur extent, etc.) interplay with the camera parameters. Successfully reasoning from the visual information of a photo to identify these numerical camera settings requires the MLLMs to have a deeper understanding of the underlying physics for precise visual comprehension, representing a challenging and intelligent capability essential for practical applications like photography assistant agents. We aim to evaluate MLLMs on their ability to distinguish visual differences related to numerical camera settings, extending a methodology previously proposed for vision-language models (VLMs). Our preliminary results demonstrate the importance of visual reasoning in photography-related tasks. Moreover, these results show that no single MLLM consistently dominates across all evaluation tasks, demonstrating ongoing challenges and opportunities in developing MLLMs with better visual reasoning.

[317] Global and Local Mamba Network for Multi-Modality Medical Image Super-Resolution

Zexin Ji,Beiji Zou,Xiaoyan Kui,Sebastien Thureau,Su Ruan

Main category: cs.CV

TLDR: 论文提出了一种基于Mamba的全局与局部网络（GLMamba），用于多模态医学图像超分辨率，通过全局和局部分支分别处理长程和短程依赖，并结合变形块和多模态特征融合提升性能。

Details

Motivation: 现有方法（如CNN和Transformer）在医学图像超分辨率中要么固定感受野，要么计算负担大，限制了性能提升。因此，引入Mamba模型以高效建模长程依赖。 Method: 提出GLMamba网络，包含全局和局部Mamba分支，分别处理低分辨率图像的全局信息和高分辨率参考图像的局部细节，结合变形块、调制器和多模态特征融合块。 Result: 通过对比实验，GLMamba在医学图像超分辨率任务中表现出色，尤其在边缘纹理和对比度增强方面。 Conclusion: GLMamba通过全局与局部Mamba分支的有效结合，显著提升了多模态医学图像超分辨率的性能。 Abstract: Convolutional neural networks and Transformer have made significant progresses in multi-modality medical image super-resolution. However, these methods either have a fixed receptive field for local learning or significant computational burdens for global learning, limiting the super-resolution performance. To solve this problem, State Space Models, notably Mamba, is introduced to efficiently model long-range dependencies in images with linear computational complexity. Relying on the Mamba and the fact that low-resolution images rely on global information to compensate for missing details, while high-resolution reference images need to provide more local details for accurate super-resolution, we propose a global and local Mamba network (GLMamba) for multi-modality medical image super-resolution. To be specific, our GLMamba is a two-branch network equipped with a global Mamba branch and a local Mamba branch. The global Mamba branch captures long-range relationships in low-resolution inputs, and the local Mamba branch focuses more on short-range details in high-resolution reference images. We also use the deform block to adaptively extract features of both branches to enhance the representation ability. A modulator is designed to further enhance deformable features in both global and local Mamba blocks. To fully integrate the reference image for low-resolution image super-resolution, we further develop a multi-modality feature fusion block to adaptively fuse features by considering similarities, differences, and complementary aspects between modalities. In addition, a contrastive edge loss (CELoss) is developed for sufficient enhancement of edge textures and contrast in medical images.

[318] SoccerNet-v3D: Leveraging Sports Broadcast Replays for 3D Scene Understanding

Marc Gutiérrez-Pérez,Antonio Agudo

Main category: cs.CV

TLDR: 论文介绍了SoccerNet-v3D和ISSIA-3D两个改进的数据集，用于足球比赛3D场景理解，并提出了一种基于单目视觉的3D球定位方法。

Details

Motivation: 提升足球比赛视频分析的3D场景理解能力，通过多视角同步和场线校准实现更精确的3D对象定位。 Method: 提出基于三角测量的3D球定位任务，利用2D标注和相机校准；引入单目3D球定位基线方法；优化2D标注的边界框对齐。 Result: 新数据集为3D足球场景理解设定了新基准，提升了时空分析能力。 Conclusion: 论文提出的数据集和方法为体育视频分析提供了更强大的工具，并开源了代码以促进研究。 Abstract: Sports video analysis is a key domain in computer vision, enabling detailed spatial understanding through multi-view correspondences. In this work, we introduce SoccerNet-v3D and ISSIA-3D, two enhanced and scalable datasets designed for 3D scene understanding in soccer broadcast analysis. These datasets extend SoccerNet-v3 and ISSIA by incorporating field-line-based camera calibration and multi-view synchronization, enabling 3D object localization through triangulation. We propose a monocular 3D ball localization task built upon the triangulation of ground-truth 2D ball annotations, along with several calibration and reprojection metrics to assess annotation quality on demand. Additionally, we present a single-image 3D ball localization method as a baseline, leveraging camera calibration and ball size priors to estimate the ball's position from a monocular viewpoint. To further refine 2D annotations, we introduce a bounding box optimization technique that ensures alignment with the 3D scene representation. Our proposed datasets establish new benchmarks for 3D soccer scene understanding, enhancing both spatial and temporal analysis in sports analytics. Finally, we provide code to facilitate access to our annotations and the generation pipelines for the datasets.

[319] AGO: Adaptive Grounding for Open World 3D Occupancy Prediction

Peizheng Li,Shuxiao Ding,You Zhou,Qingwen Zhang,Onat Inak,Larissa Triess,Niklas Hanselmann,Marius Cordts,Andreas Zell

Main category: cs.CV

TLDR: AGO是一个新颖的3D语义占用预测框架，通过自适应接地处理开放世界场景，结合视觉语言模型和3D伪标签，提升未知物体预测能力。

Details

Motivation: 传统方法基于预定义标签空间或直接对齐图像嵌入，无法有效处理开放世界中的未知物体和模态差异。 Method: AGO通过编码图像和类提示为3D和文本嵌入，利用相似性接地训练和模态适配器减少模态差异。 Result: 在Occ3D-nuScenes数据集上，AGO在零样本和少样本转移中提升了未知物体预测，封闭世界自监督性能达到SOTA，领先4.09 mIoU。 Conclusion: AGO通过自适应接地和模态适配器，显著提升了开放世界3D语义占用预测的性能和泛化能力。 Abstract: Open-world 3D semantic occupancy prediction aims to generate a voxelized 3D representation from sensor inputs while recognizing both known and unknown objects. Transferring open-vocabulary knowledge from vision-language models (VLMs) offers a promising direction but remains challenging. However, methods based on VLM-derived 2D pseudo-labels with traditional supervision are limited by a predefined label space and lack general prediction capabilities. Direct alignment with pretrained image embeddings, on the other hand, fails to achieve reliable performance due to often inconsistent image and text representations in VLMs. To address these challenges, we propose AGO, a novel 3D occupancy prediction framework with adaptive grounding to handle diverse open-world scenarios. AGO first encodes surrounding images and class prompts into 3D and text embeddings, respectively, leveraging similarity-based grounding training with 3D pseudo-labels. Additionally, a modality adapter maps 3D embeddings into a space aligned with VLM-derived image embeddings, reducing modality gaps. Experiments on Occ3D-nuScenes show that AGO improves unknown object prediction in zero-shot and few-shot transfer while achieving state-of-the-art closed-world self-supervised performance, surpassing prior methods by 4.09 mIoU.

Tzu-Yun Tseng,Hongyu Lyu,Josephine Li,Julie Stephany Berrio,Mao Shan,Stewart Worrall

Main category: cs.CV

TLDR: 论文提出了M2S-RoAD数据集，用于农村道路损坏的语义分割，以提升自动驾驶和驾驶辅助系统的安全性。

Details

Motivation: 农村道路损坏检测研究较少，现有研究多集中于城市环境。 Method: 收集并标注了澳大利亚新南威尔士州多个城镇的道路损坏数据，包含九种损坏类型。 Result: 构建了M2S-RoAD数据集，支持语义分割任务。 Conclusion: 该数据集将有助于农村道路损坏的自动化检测，提升道路安全。 Abstract: Road damage can create safety and comfort challenges for both human drivers and autonomous vehicles (AVs). This damage is particularly prevalent in rural areas due to less frequent surveying and maintenance of roads. Automated detection of pavement deterioration can be used as an input to AVs and driver assistance systems to improve road safety. Current research in this field has predominantly focused on urban environments driven largely by public datasets, while rural areas have received significantly less attention. This paper introduces M2S-RoAD, a dataset for the semantic segmentation of different classes of road damage. M2S-RoAD was collected in various towns across New South Wales, Australia, and labelled for semantic segmentation to identify nine distinct types of road damage. This dataset will be released upon the acceptance of the paper.

[321] Hierarchical and Step-Layer-Wise Tuning of Attention Specialty for Multi-Instance Synthesis in Diffusion Transformers

Chunyang Zhang,Zhenhong Sun,Zhicheng Zhang,Junyan Wang,Yu Zhang,Dong Gong,Huadong Mo,Daoyi Dong

Main category: cs.CV

TLDR: 论文提出了一种无需训练的层次化注意力调整方法（AST），用于提升基于DiT的文本到图像生成模型在多实例合成（MIS）任务中的表现。

Details

Motivation: 现有的多实例合成控制方法无法适配基于DiT的模型（如FLUX和SD v3.5），这些模型依赖图像与文本令牌的集成注意力机制而非传统的文本-图像交叉注意力。 Method: 通过分析DiT中的混合注意力机制，发现注意力图的层次化响应结构，并据此提出AST方法，在不同层次和步骤中调整注意力以优化多模态交互。 Result: 实验表明，AST方法在复杂布局生成中显著提升了实例位置和属性表示的准确性。 Conclusion: AST是一种有效的训练无关方法，能够显著提升DiT模型在多实例合成任务中的性能。 Abstract: Text-to-image (T2I) generation models often struggle with multi-instance synthesis (MIS), where they must accurately depict multiple distinct instances in a single image based on complex prompts detailing individual features. Traditional MIS control methods for UNet architectures like SD v1.5/SDXL fail to adapt to DiT-based models like FLUX and SD v3.5, which rely on integrated attention between image and text tokens rather than text-image cross-attention. To enhance MIS in DiT, we first analyze the mixed attention mechanism in DiT. Our token-wise and layer-wise analysis of attention maps reveals a hierarchical response structure: instance tokens dominate early layers, background tokens in middle layers, and attribute tokens in later layers. Building on this observation, we propose a training-free approach for enhancing MIS in DiT-based models with hierarchical and step-layer-wise attention specialty tuning (AST). AST amplifies key regions while suppressing irrelevant areas in distinct attention maps across layers and steps, guided by the hierarchical structure. This optimizes multimodal interactions by hierarchically decoupling the complex prompts with instance-based sketches. We evaluate our approach using upgraded sketch-based layouts for the T2I-CompBench and customized complex scenes. Both quantitative and qualitative results confirm our method enhances complex layout generation, ensuring precise instance placement and attribute representation in MIS.

[322] COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts

Jiansheng Li,Xingxuan Zhang,Hao Zou,Yige Guo,Renzhe Xu,Yilong Liu,Chuzhao Zhu,Yue He,Peng Cui

Main category: cs.CV

TLDR: 论文提出了COUNTS数据集，用于评估目标检测器和多模态大语言模型在分布偏移下的泛化能力，并设计了两个新基准O(OD)2和OODG。

Details

Motivation: 当前目标检测器在分布偏移下性能下降显著，缺乏大规模、细粒度标注的数据集来评估其OOD泛化能力。 Method: 引入COUNTS数据集，包含14种自然分布偏移、222K样本和1,196K标注框，并设计O(OD)2和OODG基准。 Result: 大型模型和预训练数据在IID场景表现优异，但在OOD场景仍有显著局限；GPT-4o和Gemini-1.5在视觉定位任务中准确率仅为56.7%和28.0%。 Conclusion: COUNTS数据集有望推动开发更鲁棒的目标检测器和MLLMs，以应对分布偏移。 Abstract: Current object detectors often suffer significant perfor-mance degradation in real-world applications when encountering distributional shifts. Consequently, the out-of-distribution (OOD) generalization capability of object detectors has garnered increasing attention from researchers. Despite this growing interest, there remains a lack of a large-scale, comprehensive dataset and evaluation benchmark with fine-grained annotations tailored to assess the OOD generalization on more intricate tasks like object detection and grounding. To address this gap, we introduce COUNTS, a large-scale OOD dataset with object-level annotations. COUNTS encompasses 14 natural distributional shifts, over 222K samples, and more than 1,196K labeled bounding boxes. Leveraging COUNTS, we introduce two novel benchmarks: O(OD)2 and OODG. O(OD)2 is designed to comprehensively evaluate the OOD generalization capabilities of object detectors by utilizing controlled distribution shifts between training and testing data. OODG, on the other hand, aims to assess the OOD generalization of grounding abilities in multimodal large language models (MLLMs). Our findings reveal that, while large models and extensive pre-training data substantially en hance performance in in-distribution (IID) scenarios, significant limitations and opportunities for improvement persist in OOD contexts for both object detectors and MLLMs. In visual grounding tasks, even the advanced GPT-4o and Gemini-1.5 only achieve 56.7% and 28.0% accuracy, respectively. We hope COUNTS facilitates advancements in the development and assessment of robust object detectors and MLLMs capable of maintaining high performance under distributional shifts.

[323] WildLive: Near Real-time Visual Wildlife Tracking onboard UAVs

Nguyen Ngoc Dat,Tom Richardson,Matthew Watson,Kilian Meier,Jenna Kline,Sid Reid,Guy Maalouf,Duncan Hine,Majid Mirmehdi,Tilo Burghardt

Main category: cs.CV

TLDR: WildLive是一个在无人机上实时运行的动物检测与跟踪框架，支持高分辨率视频处理，优化了计算资源分配，显著提高了处理速度。

Details

Motivation: 现有解决方案依赖地面站视频流，无法满足自主飞行和任务特定识别需求。 Method: 结合稀疏光流跟踪和优化的YOLO目标检测与分割技术，专注于高不确定性时空区域。 Result: 系统在HD和4K视频流上分别达到17fps和7fps，保持高精度。 Conclusion: WildLive证明了无人机上实时高分辨率野生动物跟踪的可行性，为未来自主导航和任务操作奠定了基础。 Abstract: Live tracking of wildlife via high-resolution video processing directly onboard drones is widely unexplored and most existing solutions rely on streaming video to ground stations to support navigation. Yet, both autonomous animal-reactive flight control beyond visual line of sight and/or mission-specific individual and behaviour recognition tasks rely to some degree on this capability. In response, we introduce WildLive -- a near real-time animal detection and tracking framework for high-resolution imagery running directly onboard uncrewed aerial vehicles (UAVs). The system performs multi-animal detection and tracking at 17fps+ for HD and 7fps+ on 4K video streams suitable for operation during higher altitude flights to minimise animal disturbance. Our system is optimised for Jetson Orin AGX onboard hardware. It integrates the efficiency of sparse optical flow tracking and mission-specific sampling with device-optimised and proven YOLO-driven object detection and segmentation techniques. Essentially, computational resource is focused onto spatio-temporal regions of high uncertainty to significantly improve UAV processing speeds without domain-specific loss of accuracy. Alongside, we introduce our WildLive dataset, which comprises 200k+ annotated animal instances across 19k+ frames from 4K UAV videos collected at the Ol Pejeta Conservancy in Kenya. All frames contain ground truth bounding boxes, segmentation masks, as well as individual tracklets and tracking point trajectories. We compare our system against current object tracking approaches including OC-SORT, ByteTrack, and SORT. Our multi-animal tracking experiments with onboard hardware confirm that near real-time high-resolution wildlife tracking is possible on UAVs whilst maintaining high accuracy levels as needed for future navigational and mission-specific animal-centric operational autonomy.

[324] LLaVA-ReID: Selective Multi-image Questioner for Interactive Person Re-Identification

Yiding Lu,Mouxing Yang,Dezhong Peng,Peng Hu,Yijie Lin,Xi Peng

Main category: cs.CV

TLDR: 论文提出了一种交互式行人重识别任务（Inter-ReID），通过对话逐步细化初始描述，并构建了一个包含多类型问题的数据集。提出的LLaVA-ReID模型通过视觉和文本上下文生成针对性问题，显著优于基线方法。

Details

Motivation: 传统基于文本的行人重识别假设描述是完整且一次性提供的，而现实中描述往往是部分或模糊的。为了解决这一问题，提出了交互式行人重识别任务。 Method: 构建了一个对话数据集，分解细粒度属性生成多类型问题；提出了LLaVA-ReID模型，利用视觉和文本上下文生成问题，并通过前瞻策略优先选择信息量最大的问题作为训练监督。 Result: LLaVA-ReID在Inter-ReID和文本ReID基准测试中显著优于基线方法。 Conclusion: 交互式行人重识别任务和LLaVA-ReID模型有效解决了传统方法的局限性，提升了行人重识别的性能。 Abstract: Traditional text-based person ReID assumes that person descriptions from witnesses are complete and provided at once. However, in real-world scenarios, such descriptions are often partial or vague. To address this limitation, we introduce a new task called interactive person re-identification (Inter-ReID). Inter-ReID is a dialogue-based retrieval task that iteratively refines initial descriptions through ongoing interactions with the witnesses. To facilitate the study of this new task, we construct a dialogue dataset that incorporates multiple types of questions by decomposing fine-grained attributes of individuals. We further propose LLaVA-ReID, a question model that generates targeted questions based on visual and textual contexts to elicit additional details about the target person. Leveraging a looking-forward strategy, we prioritize the most informative questions as supervision during training. Experimental results on both Inter-ReID and text-based ReID benchmarks demonstrate that LLaVA-ReID significantly outperforms baselines.

[325] Differentially Private 2D Human Pose Estimation

Kaushik Bhargav Sivangi,Idris Zakariyya,Paul Henderson,Fani Deligianni

Main category: cs.CV

TLDR: 本文提出了一种基于差分隐私的2D人体姿态估计方法，通过改进的DP-SGD和轻量级视觉Transformer，在保护隐私的同时提升了性能。

Details

Motivation: 人体姿态估计在多个领域有广泛应用，但传统隐私保护方法效果有限且可能损害数据效用，差分隐私虽提供保障但会降低模型性能。 Method: 采用改进的差分隐私随机梯度下降（PDP-SGD）和轻量级视觉Transformer（TinyViT）进行坐标分类。 Result: 在MPII数据集上，PDP-SGD在严格隐私预算（ε=0.2）下达到78.48% PCKh@0.5，优于标准DP-SGD的63.85%。 Conclusion: 该方法为敏感场景下的隐私保护人体姿态估计提供了可行方案。 Abstract: Human pose estimation (HPE) has become essential in numerous applications including healthcare, activity recognition, and human-computer interaction. However, the privacy implications of processing sensitive visual data present significant deployment barriers in critical domains. While traditional anonymization techniques offer limited protection and often compromise data utility for broader motion analysis, Differential Privacy (DP) provides formal privacy guarantees but typically degrades model performance when applied naively. In this work, we present the first differentially private 2D human pose estimation (2D-HPE) by applying Differentially Private Stochastic Gradient Descent (DP-SGD) to this task. To effectively balance privacy with performance, we adopt Projected DP-SGD (PDP-SGD), which projects the noisy gradients to a low-dimensional subspace. Additionally, we adapt TinyViT, a compact and efficient vision transformer for coordinate classification in HPE, providing a lightweight yet powerful backbone that enhances privacy-preserving deployment feasibility on resource-limited devices. Our approach is particularly valuable for multimedia interpretation tasks, enabling privacy-safe analysis and understanding of human motion across diverse visual media while preserving the semantic meaning required for downstream applications. Comprehensive experiments on the MPII Human Pose Dataset demonstrate significant performance enhancement with PDP-SGD achieving 78.48% PCKh@0.5 at a strict privacy budget ($\epsilon=0.2$), compared to 63.85% for standard DP-SGD. This work lays foundation for privacy-preserving human pose estimation in real-world, sensitive applications.

[326] VibrantLeaves: A principled parametric image generator for training deep restoration models

Raphael Achddou,Yann Gousseau,Saïd Ladjal,Sabine Süsstrunk

Main category: cs.CV

TLDR: 提出了一种基于几何建模和纹理的合成图像生成器，用于改进深度神经网络在图像恢复任务中的性能。

Details

Motivation: 深度神经网络在图像恢复任务中存在训练集偏差和难以解释的问题，合成数据集可以提供更好的控制。 Method: 结合几何建模、纹理和图像采集的简单模型，基于Dead Leaves模型生成合成训练集。 Result: 在去噪和超分辨率任务中，合成数据集训练的网络性能接近自然图像数据集，且对测试集的几何和辐射扰动更具鲁棒性。 Conclusion: 合成数据集是解决深度神经网络局限性的有效方法，同时为可解释性提供了初步分析。 Abstract: Even though Deep Neural Networks are extremely powerful for image restoration tasks, they have several limitations. They are poorly understood and suffer from strong biases inherited from the training sets. One way to address these shortcomings is to have a better control over the training sets, in particular by using synthetic sets. In this paper, we propose a synthetic image generator relying on a few simple principles. In particular, we focus on geometric modeling, textures, and a simple modeling of image acquisition. These properties, integrated in a classical Dead Leaves model, enable the creation of efficient training sets. Standard image denoising and super-resolution networks can be trained on such datasets, reaching performance almost on par with training on natural image datasets. As a first step towards explainability, we provide a careful analysis of the considered principles, identifying which image properties are necessary to obtain good performances. Besides, such training also yields better robustness to various geometric and radiometric perturbations of the test sets.

[327] Balancing Stability and Plasticity in Pretrained Detector: A Dual-Path Framework for Incremental Object Detection

Songze Li,Qixing Xu,Tonghua Su,Xu-Yao Zhang,Zhongjie Wang

Main category: cs.CV

TLDR: 该论文提出了一种双路径框架，用于在预训练模型增量目标检测（PTMIOD）中平衡稳定性和可塑性，通过解耦定位和分类模块，实现了跨域场景下的高性能。

Details

Motivation: 现有PTMIOD方法在跨域场景中的可塑性不足，论文通过分析预训练检测器的组件，发现定位模块具有跨域稳定性，而分类模块需要增强可塑性。 Method: 提出基于预训练DETR检测器的双路径框架，定位路径保持稳定性，分类路径通过参数高效微调和伪特征回放增强可塑性。 Result: 在MS COCO、PASCAL VOC和TT100K等基准测试中表现优异，实现了跨域适应和抗遗忘能力的平衡。 Conclusion: 该方法有效解决了PTMIOD中稳定性和可塑性的平衡问题，为跨域场景提供了鲁棒的解决方案。 Abstract: The balance between stability and plasticity remains a fundamental challenge in pretrained model-based incremental object detection (PTMIOD). While existing PTMIOD methods demonstrate strong performance on in-domain tasks aligned with pretraining data, their plasticity to cross-domain scenarios remains underexplored. Through systematic component-wise analysis of pretrained detectors, we reveal a fundamental discrepancy: the localization modules demonstrate inherent cross-domain stability-preserving precise bounding box estimation across distribution shifts-while the classification components require enhanced plasticity to mitigate discriminability degradation in cross-domain scenarios. Motivated by these findings, we propose a dual-path framework built upon pretrained DETR-based detectors which decouples localization stability and classification plasticity: the localization path maintains stability to preserve pretrained localization knowledge, while the classification path facilitates plasticity via parameter-efficient fine-tuning and resists forgetting with pseudo-feature replay. Extensive evaluations on both in-domain (MS COCO and PASCAL VOC) and cross-domain (TT100K) benchmarks show state-of-the-art performance, demonstrating our method's ability to effectively balance stability and plasticity in PTMIOD, achieving robust cross-domain adaptation and strong retention of anti-forgetting capabilities.

[328] CAT: A Conditional Adaptation Tailor for Efficient and Effective Instance-Specific Pansharpening on Real-World Data

Tianyu Xin,Jin-Liang Xiao,Zeyu Xia,Shan Yin,Liang-Jian Deng

Main category: cs.CV

TLDR: 提出了一种高效的图像融合框架，通过快速适应输入实例和CAT模块，显著提升了跨传感器泛化能力和计算效率。

Details

Motivation: 解决现有深度学习方法在跨传感器泛化和计算效率上的不足，限制实时应用的问题。 Method: 将输入图像分块，选择子集进行无监督CAT训练，通过预训练网络的特征提取和通道转换阶段生成融合特征。 Result: 在WorldView-3和WorldView-2数据集上实现最佳性能，512×512图像训练和推理仅需0.4秒，4000×4000图像仅需3秒。 Conclusion: 该方法在跨传感器泛化和计算效率上均优于现有方法，适用于实时应用。 Abstract: Pansharpening is a crucial remote sensing technique that fuses low-resolution multispectral (LRMS) images with high-resolution panchromatic (PAN) images to generate high-resolution multispectral (HRMS) imagery. Although deep learning techniques have significantly advanced pansharpening, many existing methods suffer from limited cross-sensor generalization and high computational overhead, restricting their real-time applications. To address these challenges, we propose an efficient framework that quickly adapts to a specific input instance, completing both training and inference in a short time. Our framework splits the input image into multiple patches, selects a subset for unsupervised CAT training, and then performs inference on all patches, stitching them into the final output. The CAT module, integrated between the feature extraction and channel transformation stages of a pre-trained network, tailors the fused features and fixes the parameters for efficient inference, generating improved results. Our approach offers two key advantages: (1) $\textit{Improved Generalization Ability}$: by mitigating cross-sensor degradation, our model--although pre-trained on a specific dataset--achieves superior performance on datasets captured by other sensors; (2) $\textit{Enhanced Computational Efficiency}$: the CAT-enhanced network can swiftly adapt to the test sample using the single LRMS-PAN pair input, without requiring extensive large-scale data retraining. Experiments on the real-world data from WorldView-3 and WorldView-2 datasets demonstrate that our method achieves state-of-the-art performance on cross-sensor real-world data, while achieving both training and inference of $512\times512$ image within $\textit{0.4 seconds}$ and $4000\times4000$ image within $\textit{3 seconds}$ at the fastest setting on a commonly used RTX 3090 GPU.

[329] MASSeg : 2nd Technical Report for 4th PVUW MOSE Track

Xuqiang Cao,Linnan Zhao,Jiaxuan Zhao,Fang Liu,Puhua Chen,Wenping Ma

Main category: cs.CV

TLDR: MASSeg模型在CVPR 2025 PVUW挑战赛MOSE赛道中排名第二，通过改进现有分割框架并构建增强数据集MOSE+，解决了复杂视频对象分割中的小对象识别、遮挡处理和动态场景建模问题。

Details

Motivation: 解决复杂视频对象分割中的小对象识别、遮挡处理和动态场景建模问题。 Method: 提出改进模型MASSeg，构建增强数据集MOSE+，结合帧间一致和不一致数据增强策略，设计掩码输出缩放策略。 Result: 在MOSE测试集上取得J分数0.8250、F分数0.9007和J&F分数0.8628。 Conclusion: MASSeg通过改进模型和数据集，显著提升了复杂视频对象分割的性能。 Abstract: Complex video object segmentation continues to face significant challenges in small object recognition, occlusion handling, and dynamic scene modeling. This report presents our solution, which ranked second in the MOSE track of CVPR 2025 PVUW Challenge. Based on an existing segmentation framework, we propose an improved model named MASSeg for complex video object segmentation, and construct an enhanced dataset, MOSE+, which includes typical scenarios with occlusions, cluttered backgrounds, and small target instances. During training, we incorporate a combination of inter-frame consistent and inconsistent data augmentation strategies to improve robustness and generalization. During inference, we design a mask output scaling strategy to better adapt to varying object sizes and occlusion levels. As a result, MASSeg achieves a J score of 0.8250, F score of 0.9007, and a J&F score of 0.8628 on the MOSE test set.

[330] XY-Cut++: Advanced Layout Ordering via Hierarchical Mask Mechanism on a Novel Benchmark

Shuai Liu,Youmeng Li,Jizeng Wei

Main category: cs.CV

TLDR: XY-Cut++是一种先进的布局排序方法，通过预掩码处理、多粒度分割和跨模态匹配，显著提升了文档阅读顺序恢复的准确性。

Details

Motivation: 解决现有方法在复杂布局（如多栏报纸）、跨模态元素交互和高开销问题上的不足，缺乏鲁棒的评估基准。 Method: 集成预掩码处理、多粒度分割和跨模态匹配的XY-Cut++方法。 Result: 在DocBench-100数据集上达到98.8 BLEU，比现有基线提升24%，在简单和复杂布局中均表现一致。 Conclusion: XY-Cut++为文档结构恢复提供了可靠基础，为布局排序任务设定了新标准，提升了RAG和LLM预处理效果。 Abstract: Document Reading Order Recovery is a fundamental task in document image understanding, playing a pivotal role in enhancing Retrieval-Augmented Generation (RAG) and serving as a critical preprocessing step for large language models (LLMs). Existing methods often struggle with complex layouts(e.g., multi-column newspapers), high-overhead interactions between cross-modal elements (visual regions and textual semantics), and a lack of robust evaluation benchmarks. We introduce XY-Cut++, an advanced layout ordering method that integrates pre-mask processing, multi-granularity segmentation, and cross-modal matching to address these challenges. Our method significantly enhances layout ordering accuracy compared to traditional XY-Cut techniques. Specifically, XY-Cut++ achieves state-of-the-art performance (98.8 BLEU overall) while maintaining simplicity and efficiency. It outperforms existing baselines by up to 24\% and demonstrates consistent accuracy across simple and complex layouts on the newly introduced DocBench-100 dataset. This advancement establishes a reliable foundation for document structure recovery, setting a new standard for layout ordering tasks and facilitating more effective RAG and LLM preprocessing.

[331] Trade-offs in Privacy-Preserving Eye Tracking through Iris Obfuscation: A Benchmarking Study

Mengdi Wang,Efe Bozkir,Enkelejda Kasneci

Main category: cs.CV

TLDR: 论文评估了五种方法（模糊化、噪声化、降采样、橡皮筋模型和虹膜风格迁移）在保护用户隐私的同时保持眼动追踪任务性能的效果，发现虹膜风格迁移表现最佳，但需权衡计算成本。

Details

Motivation: AR/VR头戴设备可能成为日常设备，但眼动追踪中的虹膜纹理数据引发隐私问题，需评估现有隐私保护方法的有效性。 Method: 通过模糊化、噪声化、降采样、橡皮筋模型和虹膜风格迁移五种方法处理虹膜数据，比较其对图像质量、隐私保护、任务性能和攻击风险的影响。 Result: 虹膜风格迁移在隐私保护和任务性能上表现最佳，但计算成本高；其他方法各有优劣，无普适最优解。 Conclusion: 建议根据具体需求选择或组合不同方法，以平衡隐私、性能和计算成本。 Abstract: Recent developments in hardware, computer graphics, and AI may soon enable AR/VR head-mounted displays (HMDs) to become everyday devices like smartphones and tablets. Eye trackers within HMDs provide a special opportunity for such setups as it is possible to facilitate gaze-based research and interaction. However, estimating users' gaze information often requires raw eye images and videos that contain iris textures, which are considered a gold standard biometric for user authentication, and this raises privacy concerns. Previous research in the eye-tracking community focused on obfuscating iris textures while keeping utility tasks such as gaze estimation accurate. Despite these attempts, there is no comprehensive benchmark that evaluates state-of-the-art approaches. Considering all, in this paper, we benchmark blurring, noising, downsampling, rubber sheet model, and iris style transfer to obfuscate user identity, and compare their impact on image quality, privacy, utility, and risk of imposter attack on two datasets. We use eye segmentation and gaze estimation as utility tasks, and reduction in iris recognition accuracy as a measure of privacy protection, and false acceptance rate to estimate risk of attack. Our experiments show that canonical image processing methods like blurring and noising cause a marginal impact on deep learning-based tasks. While downsampling, rubber sheet model, and iris style transfer are effective in hiding user identifiers, iris style transfer, with higher computation cost, outperforms others in both utility tasks, and is more resilient against spoof attacks. Our analyses indicate that there is no universal optimal approach to balance privacy, utility, and computation burden. Therefore, we recommend practitioners consider the strengths and weaknesses of each approach, and possible combinations of those to reach an optimal privacy-utility trade-off.

[332] Multimodal Long Video Modeling Based on Temporal Dynamic Context

Haoran Hao,Jiaming Han,Yiyuan Zhang,Xiangyu Yue

Main category: cs.CV

TLDR: 提出了一种动态长视频编码方法（TDC），通过时间动态上下文处理长视频，结合视觉-音频编码器和查询式Transformer，显著提升了视频理解性能。

Details

Motivation: 现有LLMs在长视频处理中因上下文长度限制和信息量大而表现不佳，且现有方法在令牌压缩和多模态（如音频）处理上存在缺陷。 Method: 1. 基于帧间相似性分割视频为语义一致场景；2. 使用视觉-音频编码器编码帧为令牌；3. 提出时间上下文压缩器减少令牌数量；4. 结合静态帧令牌和时间上下文令牌输入LLM。 Result: 在通用视频理解和音视频理解基准测试中表现优异。 Conclusion: TDC方法有效解决了长视频处理中的信息丢失和多模态融合问题，代码和模型已开源。 Abstract: Recent advances in Large Language Models (LLMs) have led to significant breakthroughs in video understanding. However, existing models still struggle with long video processing due to the context length constraint of LLMs and the vast amount of information within the video. Although some recent methods are designed for long video understanding, they often lose crucial information during token compression and struggle with additional modality like audio. In this work, we propose a dynamic long video encoding method utilizing the temporal relationship between frames, named Temporal Dynamic Context (TDC). Firstly, we segment the video into semantically consistent scenes based on inter-frame similarities, then encode each frame into tokens using visual-audio encoders. Secondly, we propose a novel temporal context compressor to reduce the number of tokens within each segment. Specifically, we employ a query-based Transformer to aggregate video, audio, and instruction text tokens into a limited set of temporal context tokens. Finally, we feed the static frame tokens and the temporal context tokens into the LLM for video understanding. Furthermore, to handle extremely long videos, we propose a training-free chain-of-thought strategy that progressively extracts answers from multiple video segments. These intermediate answers serve as part of the reasoning process and contribute to the final answer. We conduct extensive experiments on general video understanding and audio-video understanding benchmarks, where our method demonstrates strong performance. The code and models are available at https://github.com/Hoar012/TDC-Video.

[333] LMFormer: Lane based Motion Prediction Transformer

Harsh Yadav,Maximilian Schaefer,Kun Zhao,Tobias Meisen

Main category: cs.CV

TLDR: LMFormer是一种基于车道感知的Transformer网络，用于轨迹预测任务，通过动态优先级机制和车道连接信息提升性能，并在多个数据集上实现SOTA表现。

Details

Motivation: 解决自动驾驶中运动预测的关键问题，尤其是车道动态优先级和长距离依赖性的学习，同时引入可解释性。 Method: 提出LMFormer，利用车道连接信息和动态优先级机制，通过堆叠Transformer层进行迭代优化。 Result: 在nuScenes和Deep Scenario数据集上实现SOTA性能，并展示跨数据集训练的统一能力。 Conclusion: LMFormer通过车道感知和动态优化机制，显著提升了轨迹预测的性能和可解释性。 Abstract: Motion prediction plays an important role in autonomous driving. This study presents LMFormer, a lane-aware transformer network for trajectory prediction tasks. In contrast to previous studies, our work provides a simple mechanism to dynamically prioritize the lanes and shows that such a mechanism introduces explainability into the learning behavior of the network. Additionally, LMFormer uses the lane connection information at intersections, lane merges, and lane splits, in order to learn long-range dependency in lane structure. Moreover, we also address the issue of refining the predicted trajectories and propose an efficient method for iterative refinement through stacked transformer layers. For benchmarking, we evaluate LMFormer on the nuScenes dataset and demonstrate that it achieves SOTA performance across multiple metrics. Furthermore, the Deep Scenario dataset is used to not only illustrate cross-dataset network performance but also the unification capabilities of LMFormer to train on multiple datasets and achieve better performance.

[334] DiffMOD: Progressive Diffusion Point Denoising for Moving Object Detection in Remote Sensing

Jinyue Zhang,Xiangrong Zhang,Zhongjian Huang,Tianyang Zhang,Yifei Jiang,Licheng Jiao

Main category: cs.CV

TLDR: 提出了一种基于点云的遥感移动目标检测方法，通过扩散模型和渐进去噪过程优化网络，结合空间关系聚合注意力与时序传播模块，显著提升了检测能力和时序一致性。

Details

Motivation: 遥感中的移动目标检测面临低分辨率、目标极小和复杂噪声干扰的挑战，现有方法依赖概率密度估计，限制了对象间和时序帧间的灵活信息交互。 Method: 采用点云表示，通过扩散模型进行渐进去噪；设计空间关系聚合注意力模块和时序传播与全局融合模块，优化点级特征交互与时序一致性；提出渐进MinK最优传输分配策略和缺失损失函数。 Result: 在RsData数据集上验证了方法的有效性，能够更有效地挖掘稀疏移动目标间的潜在关系，提升检测能力和时序一致性。 Conclusion: 基于点云的去噪方法为遥感移动目标检测提供了新思路，显著提升了性能。 Abstract: Moving object detection (MOD) in remote sensing is significantly challenged by low resolution, extremely small object sizes, and complex noise interference. Current deep learning-based MOD methods rely on probability density estimation, which restricts flexible information interaction between objects and across temporal frames. To flexibly capture high-order inter-object and temporal relationships, we propose a point-based MOD in remote sensing. Inspired by diffusion models, the network optimization is formulated as a progressive denoising process that iteratively recovers moving object centers from sparse noisy points. Specifically, we sample scattered features from the backbone outputs as atomic units for subsequent processing, while global feature embeddings are aggregated to compensate for the limited coverage of sparse point features. By modeling spatial relative positions and semantic affinities, Spatial Relation Aggregation Attention is designed to enable high-order interactions among point-level features for enhanced object representation. To enhance temporal consistency, the Temporal Propagation and Global Fusion module is designed, which leverages an implicit memory reasoning mechanism for robust cross-frame feature integration. To align with the progressive denoising process, we propose a progressive MinK optimal transport assignment strategy that establishes specialized learning objectives at each denoising level. Additionally, we introduce a missing loss function to counteract the clustering tendency of denoised points around salient objects. Experiments on the RsData remote sensing MOD dataset show that our MOD method based on scattered point denoising can more effectively explore potential relationships between sparse moving objects and improve the detection capability and temporal consistency.

[335] GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

Xiaobo Xia,Run Luo

Main category: cs.CV

TLDR: 提出了一种基于强化学习的框架（\name），通过统一动作空间规则建模，显著提升了大型视觉语言模型（LVLMs）在GUI任务中的性能，仅需少量高质量数据。

Details

Motivation: 现有GUI代理方法依赖大量训练数据且泛化能力不足，限制了其在真实场景中的应用。 Method: 采用强化微调（RFT）和策略优化算法（如GRPO），利用少量跨平台高质量数据进行模型更新。 Result: 在八个基准测试中，仅用0.02%的数据（3K vs. 13M）即超越现有最佳方法（如OS-Atlas）。 Conclusion: 强化学习结合统一动作空间规则建模，显著提升了LVLMs在真实GUI任务中的执行能力。 Abstract: Existing efforts in building Graphical User Interface (GUI) agents largely rely on the training paradigm of supervised fine-tuning on Large Vision-Language Models (LVLMs). However, this approach not only demands extensive amounts of training data but also struggles to effectively understand GUI screenshots and generalize to unseen interfaces. The issue significantly limits its application in real-world scenarios, especially for high-level tasks. Inspired by Reinforcement Fine-Tuning (RFT) in large reasoning models (e.g., DeepSeek-R1), which efficiently enhances the problem-solving capabilities of large language models in real-world settings, we propose \name, the first reinforcement learning framework designed to enhance the GUI capabilities of LVLMs in high-level real-world task scenarios, through unified action space rule modeling. By leveraging a small amount of carefully curated high-quality data across multiple platforms (including Windows, Linux, MacOS, Android, and Web) and employing policy optimization algorithms such as Group Relative Policy Optimization (GRPO) to update the model, \name achieves superior performance using only 0.02\% of the data (3K vs. 13M) compared to previous state-of-the-art methods like OS-Atlas across eight benchmarks spanning three different platforms (mobile, desktop, and web). These results demonstrate the immense potential of reinforcement learning based on unified action space rule modeling in improving the execution capabilities of LVLMs for real-world GUI agent tasks.

[336] Noise2Ghost: Self-supervised deep convolutional reconstruction for ghost imaging

Mathieu Manni,Dmitry Karpov,K. Joost Batenburg,Sharon Shwartz,Nicola Viganò

Main category: cs.CV

TLDR: 提出了一种基于自监督深度学习的鬼成像重建方法，无需干净参考数据即可实现强噪声抑制。

Details

Motivation: 解决低光鬼成像场景中的信噪比问题，适用于微纳尺度X射线成像等新兴领域。 Method: 采用自监督深度学习框架，结合数学理论支持，适用于理论和实际数据。 Result: 在无监督方法中表现出卓越的重建性能，尤其在噪声数据中。 Conclusion: 为低光鬼成像场景提供了有效的工具，适用于生物样本和电池等应用。 Abstract: We present a new self-supervised deep-learning-based Ghost Imaging (GI) reconstruction method, which provides unparalleled reconstruction performance for noisy acquisitions among unsupervised methods. We present the supporting mathematical framework and results from theoretical and real data use cases. Self-supervision removes the need for clean reference data while offering strong noise reduction. This provides the necessary tools for addressing signal-to-noise ratio concerns for GI acquisitions in emerging and cutting-edge low-light GI scenarios. Notable examples include micro- and nano-scale x-ray emission imaging, e.g., x-ray fluorescence imaging of dose-sensitive samples. Their applications include in-vivo and in-operando case studies for biological samples and batteries.

[337] MIEB: Massive Image Embedding Benchmark

Chenghao Xiao,Isaac Chung,Imene Kerboua,Jamie Stirling,Xin Zhang,Márton Kardos,Roman Solomatin,Noura Al Moubayed,Kenneth Enevoldsen,Niklas Muennighoff

Main category: cs.CV

TLDR: MIEB是一个大规模图像嵌入基准测试，用于评估图像和图像-文本嵌入模型的性能，涵盖38种语言和130个任务，发现没有单一方法在所有任务中表现最佳。

Details

Motivation: 现有图像表示评估方法分散且任务特定，无法全面理解模型能力，因此需要统一的评估框架。 Method: 引入MIEB基准测试，涵盖8大类130个任务，评估50个模型的性能。 Result: 发现高级视觉模型在文本视觉表示方面表现优异，但在复杂场景下图像-文本匹配能力有限，且性能与多模态大语言模型表现高度相关。 Conclusion: MIEB为图像嵌入模型提供了全面的评估工具，揭示了模型的潜在能力与局限性，并公开了代码、数据集和排行榜。 Abstract: Image representations are often evaluated through disjointed, task-specific protocols, leading to a fragmented understanding of model capabilities. For instance, it is unclear whether an image embedding model adept at clustering images is equally good at retrieving relevant images given a piece of text. We introduce the Massive Image Embedding Benchmark (MIEB) to evaluate the performance of image and image-text embedding models across the broadest spectrum to date. MIEB spans 38 languages across 130 individual tasks, which we group into 8 high-level categories. We benchmark 50 models across our benchmark, finding that no single method dominates across all task categories. We reveal hidden capabilities in advanced vision models such as their accurate visual representation of texts, and their yet limited capabilities in interleaved encodings and matching images and texts in the presence of confounders. We also show that the performance of vision encoders on MIEB correlates highly with their performance when used in multimodal large language models. Our code, dataset, and leaderboard are publicly available at https://github.com/embeddings-benchmark/mteb.

[338] ESCT3D: Efficient and Selectively Controllable Text-Driven 3D Content Generation with Gaussian Splatting

Huiqi Wu,Jianbo Mei,Yingjie Huang,Yining Xu,Jingjiao You,Yilong Liu,Li Yao

Main category: cs.CV

TLDR: 论文提出了一种基于GPT-4V的自优化方法，用于提升从简单文本生成高质量3D内容的效率和可控性。

Details

Motivation: 当前文本驱动的3D内容生成依赖高质量输入提示且生成过程不可控，导致效率低下。 Method: 使用GPT-4V进行自优化，支持多条件输入（如风格、边缘、姿势等），并整合多视图信息解决Janus问题。 Result: 实验表明，该方法能高效生成高质量且可控的3D内容。 Conclusion: 该方法显著提升了3D内容生成的效率和可控性。 Abstract: In recent years, significant advancements have been made in text-driven 3D content generation. However, several challenges remain. In practical applications, users often provide extremely simple text inputs while expecting high-quality 3D content. Generating optimal results from such minimal text is a difficult task due to the strong dependency of text-to-3D models on the quality of input prompts. Moreover, the generation process exhibits high variability, making it difficult to control. Consequently, multiple iterations are typically required to produce content that meets user expectations, reducing generation efficiency. To address this issue, we propose GPT-4V for self-optimization, which significantly enhances the efficiency of generating satisfactory content in a single attempt. Furthermore, the controllability of text-to-3D generation methods has not been fully explored. Our approach enables users to not only provide textual descriptions but also specify additional conditions, such as style, edges, scribbles, poses, or combinations of multiple conditions, allowing for more precise control over the generated 3D content. Additionally, during training, we effectively integrate multi-view information, including multi-view depth, masks, features, and images, to address the common Janus problem in 3D content generation. Extensive experiments demonstrate that our method achieves robust generalization, facilitating the efficient and controllable generation of high-quality 3D content.

[339] Analysis of Attention in Video Diffusion Transformers

Yuxin Wen,Jim Wu,Ajay Jain,Tom Goldstein,Ashwinee Panda

Main category: cs.CV

TLDR: 分析了视频扩散变换器（VDiTs）中的注意力机制，发现了结构、稀疏性和注意力汇三个关键特性，并提出了未来研究方向。

Details

Motivation: 研究VDiTs中注意力机制的特性，以提升其效率与质量的平衡。 Method: 通过分析不同VDiTs的注意力模式，研究其结构、稀疏性和注意力汇现象。 Result: 发现注意力模式具有跨提示的结构相似性，稀疏性方法不适用于所有VDiTs，并首次研究了VDiTs中的注意力汇现象。 Conclusion: 研究结果为优化VDiTs的效率与质量提供了新方向。 Abstract: We conduct an in-depth analysis of attention in video diffusion transformers (VDiTs) and report a number of novel findings. We identify three key properties of attention in VDiTs: Structure, Sparsity, and Sinks. Structure: We observe that attention patterns across different VDiTs exhibit similar structure across different prompts, and that we can make use of the similarity of attention patterns to unlock video editing via self-attention map transfer. Sparse: We study attention sparsity in VDiTs, finding that proposed sparsity methods do not work for all VDiTs, because some layers that are seemingly sparse cannot be sparsified. Sinks: We make the first study of attention sinks in VDiTs, comparing and contrasting them to attention sinks in language models. We propose a number of future directions that can make use of our insights to improve the efficiency-quality Pareto frontier for VDiTs.

[340] SlowFastVAD: Video Anomaly Detection via Integrating Simple Detector and RAG-Enhanced Vision-Language Model

Zongcan Ding,Haodong Zhang,Peng Wu,Guansong Pang,Zhiwei Yang,Peng Wang,Yanning Zhang

Main category: cs.CV

TLDR: SlowFastVAD结合快速异常检测器和慢速异常检测器（增强的VLM），通过双路径方法提升视频异常检测的准确性和可解释性，同时降低计算开销。

Details

Motivation: 半监督视频异常检测方法存在高误报率和可解释性差的问题，而现有视觉语言模型（VLM）计算成本高且缺乏领域适应性。 Method: 提出SlowFastVAD框架，快速检测器提供粗粒度异常分数，慢速检测器（RAG增强的VLM）仅分析模糊片段，并构建知识库以增强领域适应性。 Result: 在四个基准测试中，SlowFastVAD显著提升了检测准确性和可解释性，同时大幅降低计算开销。 Conclusion: SlowFastVAD结合快速和慢速检测器的优势，适用于高可靠性要求的实际视频异常检测应用。 Abstract: Video anomaly detection (VAD) aims to identify unexpected events in videos and has wide applications in safety-critical domains. While semi-supervised methods trained on only normal samples have gained traction, they often suffer from high false alarm rates and poor interpretability. Recently, vision-language models (VLMs) have demonstrated strong multimodal reasoning capabilities, offering new opportunities for explainable anomaly detection. However, their high computational cost and lack of domain adaptation hinder real-time deployment and reliability. Inspired by dual complementary pathways in human visual perception, we propose SlowFastVAD, a hybrid framework that integrates a fast anomaly detector with a slow anomaly detector (namely a retrieval augmented generation (RAG) enhanced VLM), to address these limitations. Specifically, the fast detector first provides coarse anomaly confidence scores, and only a small subset of ambiguous segments, rather than the entire video, is further analyzed by the slower yet more interpretable VLM for elaborate detection and reasoning. Furthermore, to adapt VLMs to domain-specific VAD scenarios, we construct a knowledge base including normal patterns based on few normal samples and abnormal patterns inferred by VLMs. During inference, relevant patterns are retrieved and used to augment prompts for anomaly reasoning. Finally, we smoothly fuse the anomaly confidence of fast and slow detectors to enhance robustness of anomaly detection. Extensive experiments on four benchmarks demonstrate that SlowFastVAD effectively combines the strengths of both fast and slow detectors, and achieves remarkable detection accuracy and interpretability with significantly reduced computational overhead, making it well-suited for real-world VAD applications with high reliability requirements.

[341] InstructEngine: Instruction-driven Text-to-Image Alignment

Xingyu Lu,Yuhang Hu,YiFan Zhang,Kaiyu Jiang,Changyi Liu,Tianke Zhang,Jinpeng Wang,Bin Wen,Chun Yuan,Fan Yang,Tingting Gao,Di Zhang

Main category: cs.CV

TLDR: 论文提出了InstructEngine框架，通过自动化数据构建和跨验证对齐方法，解决了文本到图像模型偏好对齐中的数据与算法限制，显著提升了模型性能。

Details

Motivation: 现有方法依赖高成本人工标注数据且算法效率低，无法规模化。 Method: 构建文本到图像生成的分类法，开发自动化数据管道生成25K偏好对，引入跨验证对齐方法。 Result: 在DrawBench上，SD v1.5和SDXL性能分别提升10.53%和5.30%，人类评审胜率超50%。 Conclusion: InstructEngine有效解决了数据与算法问题，显著提升了模型与人类偏好的对齐效果。 Abstract: Reinforcement Learning from Human/AI Feedback (RLHF/RLAIF) has been extensively utilized for preference alignment of text-to-image models. Existing methods face certain limitations in terms of both data and algorithm. For training data, most approaches rely on manual annotated preference data, either by directly fine-tuning the generators or by training reward models to provide training signals. However, the high annotation cost makes them difficult to scale up, the reward model consumes extra computation and cannot guarantee accuracy. From an algorithmic perspective, most methods neglect the value of text and only take the image feedback as a comparative signal, which is inefficient and sparse. To alleviate these drawbacks, we propose the InstructEngine framework. Regarding annotation cost, we first construct a taxonomy for text-to-image generation, then develop an automated data construction pipeline based on it. Leveraging advanced large multimodal models and human-defined rules, we generate 25K text-image preference pairs. Finally, we introduce cross-validation alignment method, which refines data efficiency by organizing semantically analogous samples into mutually comparable pairs. Evaluations on DrawBench demonstrate that InstructEngine improves SD v1.5 and SDXL's performance by 10.53% and 5.30%, outperforming state-of-the-art baselines, with ablation study confirming the benefits of InstructEngine's all components. A win rate of over 50% in human reviews also proves that InstructEngine better aligns with human preferences.

[342] LL-Gaussian: Low-Light Scene Reconstruction and Enhancement via Gaussian Splatting for Novel View Synthesis

Hao Sun,Fenggen Yu,Huiyao Xu,Tao Zhang,Changqing Zou

Main category: cs.CV

TLDR: LL-Gaussian提出了一种基于3D高斯散射的低光场景新视角合成方法，解决了现有方法在低光sRGB输入下的不稳定性和噪声问题，实现了快速且高质量的渲染。

Details

Motivation: 低光场景下的新视角合成面临输入质量差、噪声严重和动态范围低的问题，现有方法计算成本高或依赖特殊数据，限制了实用性。 Method: LL-Gaussian通过低光高斯初始化模块、双分支高斯分解模型和无监督优化策略，实现了稳定的点云初始化和噪声抑制。 Result: 相比现有方法，LL-Gaussian推理速度快2000倍，训练时间减少至2%，且重建和渲染质量更高。 Conclusion: LL-Gaussian为低光场景的3D重建和增强提供了一种高效且高质量的解决方案。 Abstract: Novel view synthesis (NVS) in low-light scenes remains a significant challenge due to degraded inputs characterized by severe noise, low dynamic range (LDR) and unreliable initialization. While recent NeRF-based approaches have shown promising results, most suffer from high computational costs, and some rely on carefully captured or pre-processed data--such as RAW sensor inputs or multi-exposure sequences--which severely limits their practicality. In contrast, 3D Gaussian Splatting (3DGS) enables real-time rendering with competitive visual fidelity; however, existing 3DGS-based methods struggle with low-light sRGB inputs, resulting in unstable Gaussian initialization and ineffective noise suppression. To address these challenges, we propose LL-Gaussian, a novel framework for 3D reconstruction and enhancement from low-light sRGB images, enabling pseudo normal-light novel view synthesis. Our method introduces three key innovations: 1) an end-to-end Low-Light Gaussian Initialization Module (LLGIM) that leverages dense priors from learning-based MVS approach to generate high-quality initial point clouds; 2) a dual-branch Gaussian decomposition model that disentangles intrinsic scene properties (reflectance and illumination) from transient interference, enabling stable and interpretable optimization; 3) an unsupervised optimization strategy guided by both physical constrains and diffusion prior to jointly steer decomposition and enhancement. Additionally, we contribute a challenging dataset collected in extreme low-light environments and demonstrate the effectiveness of LL-Gaussian. Compared to state-of-the-art NeRF-based methods, LL-Gaussian achieves up to 2,000 times faster inference and reduces training time to just 2%, while delivering superior reconstruction and rendering quality.

[343] Benchmarking 3D Human Pose Estimation Models Under Occlusions

Filipa Lino,Carlos Santiago,Manuel Marques

Main category: cs.CV

TLDR: 论文分析了3D人体姿态估计模型对遮挡、相机位置和动作变化的鲁棒性和敏感性，使用新合成数据集BlendMimic3D测试现有模型，揭示了模型在真实场景中的局限性。

Details

Motivation: 解决3D人体姿态估计模型在遮挡、相机位置和动作变化等复杂环境中的鲁棒性问题。 Method: 使用BlendMimic3D合成数据集，测试多种现有模型，分析其对遮挡和相机设置的敏感性。 Result: 模型对遮挡和相机设置表现敏感，标准条件下训练的模型在真实场景中泛化能力不足。 Conclusion: 需开发更具适应性的模型以应对真实世界中的复杂环境和遮挡情况。 Abstract: This paper addresses critical challenges in 3D Human Pose Estimation (HPE) by analyzing the robustness and sensitivity of existing models to occlusions, camera position, and action variability. Using a novel synthetic dataset, BlendMimic3D, which includes diverse scenarios with multi-camera setups and several occlusion types, we conduct specific tests on several state-of-the-art models. Our study focuses on the discrepancy in keypoint formats between common datasets such as Human3.6M, and 2D datasets such as COCO, commonly used for 2D detection models and frequently input of 3D HPE models. Our work explores the impact of occlusions on model performance and the generality of models trained exclusively under standard conditions. The findings suggest significant sensitivity to occlusions and camera settings, revealing a need for models that better adapt to real-world variability and occlusion scenarios. This research contributed to ongoing efforts to improve the fidelity and applicability of 3D HPE systems in complex environments.

[344] Multimodal Representation Learning Techniques for Comprehensive Facial State Analysis

Kaiwen Zheng,Xuri Ge,Junchen Fu,Jun Peng,Joemon M. Jose

Main category: cs.CV

TLDR: 提出了一种多模态面部状态分析框架，包括新数据集MFA、多级多模态基础模型MF^2和解耦微调网络DFN，显著提升了AU和情感识别性能。

Details

Motivation: 多模态基础模型在面部状态（如AU和情感）理解中的应用有限，需一个综合框架来整合视觉和语言模态。 Method: 1. 构建MFA数据集，利用GPT-4o生成多层次语言描述；2. 设计MF^2模型，结合局部和全局视觉特征；3. 开发DFN网络，高效微调以适应不同任务。 Result: 实验表明，该方法在AU和情感检测任务中表现优异。 Conclusion: 提出的框架为多模态面部状态分析提供了高效解决方案，扩展了基础模型的应用范围。 Abstract: Multimodal foundation models have significantly improved feature representation by integrating information from multiple modalities, making them highly suitable for a broader set of applications. However, the exploration of multimodal facial representation for understanding perception has been limited. Understanding and analyzing facial states, such as Action Units (AUs) and emotions, require a comprehensive and robust framework that bridges visual and linguistic modalities. In this paper, we present a comprehensive pipeline for multimodal facial state analysis. First, we compile a new Multimodal Face Dataset (MFA) by generating detailed multilevel language descriptions of face, incorporating Action Unit (AU) and emotion descriptions, by leveraging GPT-4o. Second, we introduce a novel Multilevel Multimodal Face Foundation model (MF^2) tailored for Action Unit (AU) and emotion recognition. Our model incorporates comprehensive visual feature modeling at both local and global levels of face image, enhancing its ability to represent detailed facial appearances. This design aligns visual representations with structured AU and emotion descriptions, ensuring effective cross-modal integration. Third, we develop a Decoupled Fine-Tuning Network (DFN) that efficiently adapts MF^2 across various tasks and datasets. This approach not only reduces computational overhead but also broadens the applicability of the foundation model to diverse scenarios. Experimentation show superior performance for AU and emotion detection tasks.

[345] Patch and Shuffle: A Preprocessing Technique for Texture Classification in Autonomous Cementitious Fabrication

Jeremiah Giordani

Main category: cs.CV

TLDR: 提出了一种名为“patch and shuffle”的预处理技术，通过分割和打乱图像块来消除语义信息，提升纹理分类的准确性。

Details

Motivation: 传统纹理分类方法依赖全局图像特征，容易偏向语义内容而非低层次纹理。 Method: 将输入图像分割为小块并打乱，重建为混乱图像后再分类，迫使分类器依赖局部纹理特征。 Result: 在水泥挤出图像数据集上，使用ResNet-18架构，新方法测试准确率达90.64%，显著优于基线的72.46%。 Conclusion: 破坏全局结构可提升纹理分类性能，该方法适用于需要低层次特征的视觉任务。 Abstract: Autonomous fabrication systems are transforming construction and manufacturing, yet they remain vulnerable to print errors. Texture classification is a key component of computer vision systems that enable real-time monitoring and adjustment during cementitious fabrication. Traditional classification methods often rely on global image features, which can bias the model toward semantic content rather than low-level textures. In this paper, we introduce a novel preprocessing technique called "patch and shuffle," which segments input images into smaller patches, shuffles them, and reconstructs a jumbled image before classification. This transformation removes semantic context, forcing the classifier to rely on local texture features. We evaluate this approach on a dataset of extruded cement images, using a ResNet-18-based architecture. Our experiments compare the patch and shuffle method to a standard pipeline, holding all other factors constant. Results show a significant improvement in accuracy: the patch and shuffle model achieved 90.64% test accuracy versus 72.46% for the baseline. These findings suggest that disrupting global structure enhances performance in texture-based classification tasks. This method has implications for broader vision tasks where low-level features matter more than high-level semantics. The technique may improve classification in applications ranging from fabrication monitoring to medical imaging.

[346] FingER: Content Aware Fine-grained Evaluation with Reasoning for AI-Generated Videos

Rui Chen,Lei Sun,Jing Tang,Geng Li,Xiangxiang Chu

Main category: cs.CV

TLDR: 论文提出了一种名为FingER的细粒度实体级推理评估框架，用于评估AI生成的视频内容，通过自动生成问题并由推理模型评分，显著提升了评估性能。

Details

Motivation: 随着视频生成技术的进步，AI生成内容的评估变得复杂且困难，需要更细粒度的评估方法。 Method: 利用LLMs生成实体级问题，构建FingER数据集，并通过GRPO训练推理模型。 Result: 模型在GenAI-Bench和MonetBench上分别相对提升11.8%和5.5%，仅需3.3k训练视频。 Conclusion: FingER框架通过细粒度推理显著提升了视频内容评估的准确性和可解释性。 Abstract: Recent advances in video generation have posed great challenges in the assessment of AI-generated content, particularly with the emergence of increasingly sophisticated models. The various inconsistencies and defects observed in such videos are inherently complex, making overall scoring notoriously difficult. In this paper, we emphasize the critical importance of integrating fine-grained reasoning into video evaluation, and we propose $\textbf{F}$ing$\textbf{ER}$, a novel entity-level reasoning evaluation framework that first automatically generates $\textbf{F}$ine-grained $\textbf{E}$ntity-level questions, and then answers those questions by a $\textbf{R}$easoning model with scores, which can be subsequently weighted summed to an overall score for different applications. Specifically, we leverage LLMs to derive entity-level questions across five distinct perspectives, which (i) often focus on some specific entities of the content, thereby making answering or scoring much easier by MLLMs, and (ii) are more interpretable. Then we construct a FingER dataset, consisting of approximately 3.3k videos and corresponding 60k fine-grained QA annotations, each with detailed reasons. Based on that, we further investigate various training protocols to best incentivize the reasoning capability of MLLMs for correct answer prediction. Extensive experiments demonstrate that a reasoning model trained using Group Relative Policy Optimization (GRPO) with a cold-start strategy achieves the best performance. Notably, our model surpasses existing methods by a relative margin of $11.8\%$ on GenAI-Bench and $5.5\%$ on MonetBench with only 3.3k training videos, which is at most one-tenth of the training samples utilized by other methods. Our code and dataset will be released soon.

[347] PG-DPIR: An efficient plug-and-play method for high-count Poisson-Gaussian inverse problems

Maud Biquard,Marie Chabert,Florence Genin,Christophe Latry,Thomas Oberlin

Main category: cs.CV

TLDR: PG-DPIR是一种高效的PnP方法，用于解决高计数泊松-高斯噪声的图像恢复问题，通过改进DPIR算法，显著提升了收敛速度。

Details

Motivation: 现有深度学习方法在泊松-高斯噪声图像恢复中需要传感器特定训练，而PnP方法通过仅学习去噪正则化，可适用于多源图像恢复。 Method: 改进DPIR算法，针对泊松-高斯噪声特性设计高效梯度下降初始化，加速收敛。 Result: 实验表明PG-DPIR在卫星图像恢复和超分辨率任务中达到最优性能，且效率显著提升。 Conclusion: PG-DPIR为地面卫星图像处理链提供了一种高效且性能优越的解决方案。 Abstract: Poisson-Gaussian noise describes the noise of various imaging systems thus the need of efficient algorithms for Poisson-Gaussian image restoration. Deep learning methods offer state-of-the-art performance but often require sensor-specific training when used in a supervised setting. A promising alternative is given by plug-and-play (PnP) methods, which consist in learning only a regularization through a denoiser, allowing to restore images from several sources with the same network. This paper introduces PG-DPIR, an efficient PnP method for high-count Poisson-Gaussian inverse problems, adapted from DPIR. While DPIR is designed for white Gaussian noise, a naive adaptation to Poisson-Gaussian noise leads to prohibitively slow algorithms due to the absence of a closed-form proximal operator. To address this, we adapt DPIR for the specificities of Poisson-Gaussian noise and propose in particular an efficient initialization of the gradient descent required for the proximal step that accelerates convergence by several orders of magnitude. Experiments are conducted on satellite image restoration and super-resolution problems. High-resolution realistic Pleiades images are simulated for the experiments, which demonstrate that PG-DPIR achieves state-of-the-art performance with improved efficiency, which seems promising for on-ground satellite processing chains.

[348] Better Coherence, Better Height: Fusing Physical Models and Deep Learning for Forest Height Estimation from Interferometric SAR Data

Ragini Bal Mahesh,Ronny Hänsch

Main category: cs.CV

TLDR: CoHNet结合深度学习和物理约束，提升SAR图像森林高度估计的准确性和可靠性。

Details

Motivation: 传统物理模型泛化能力有限，而深度学习方法缺乏物理可解释性，需结合两者优势。 Method: 提出CoHNet框架，利用预训练神经代理模型和物理约束损失函数。 Result: 实验表明，该方法提高了森林高度估计的准确性，并生成更具可靠性的特征。 Conclusion: CoHNet成功结合了深度学习和物理模型，为SAR图像分析提供了更优解决方案。 Abstract: Estimating forest height from Synthetic Aperture Radar (SAR) images often relies on traditional physical models, which, while interpretable and data-efficient, can struggle with generalization. In contrast, Deep Learning (DL) approaches lack physical insight. To address this, we propose CoHNet - an end-to-end framework that combines the best of both worlds: DL optimized with physics-informed constraints. We leverage a pre-trained neural surrogate model to enforce physical plausibility through a unique training loss. Our experiments show that this approach not only improves forest height estimation accuracy but also produces meaningful features that enhance the reliability of predictions.

[349] Towards Low-Latency Event-based Obstacle Avoidance on a FPGA-Drone

Pietro Bonazzi,Christian Vogt,Michael Jost,Lyes Khacef,Federico Paredes-Vallés,Michele Magno

Main category: cs.CV

TLDR: 事件视觉系统（EVS）在FPGA加速器上表现优于传统RGB模型，具有更高帧率、更低预测误差和更强鲁棒性。

Details

Motivation: 评估事件视觉系统（EVS）与RGB模型在实时碰撞避免中的性能差异。 Method: 在FPGA加速器上对比EVS和RGB模型的动作预测性能，测试包括帧率、预测误差和鲁棒性。 Result: EVS帧率达1 kHz，时空预测误差更低，鲁棒性更强，F1分数显著优于RGB模型。 Conclusion: EVS在实时碰撞避免中表现优异，适合资源受限环境部署。 Abstract: This work quantitatively evaluates the performance of event-based vision systems (EVS) against conventional RGB-based models for action prediction in collision avoidance on an FPGA accelerator. Our experiments demonstrate that the EVS model achieves a significantly higher effective frame rate (1 kHz) and lower temporal (-20 ms) and spatial prediction errors (-20 mm) compared to the RGB-based model, particularly when tested on out-of-distribution data. The EVS model also exhibits superior robustness in selecting optimal evasion maneuvers. In particular, in distinguishing between movement and stationary states, it achieves a 59 percentage point advantage in precision (78% vs. 19%) and a substantially higher F1 score (0.73 vs. 0.06), highlighting the susceptibility of the RGB model to overfitting. Further analysis in different combinations of spatial classes confirms the consistent performance of the EVS model in both test data sets. Finally, we evaluated the system end-to-end and achieved a latency of approximately 2.14 ms, with event aggregation (1 ms) and inference on the processing unit (0.94 ms) accounting for the largest components. These results underscore the advantages of event-based vision for real-time collision avoidance and demonstrate its potential for deployment in resource-constrained environments.

[350] GPS: Distilling Compact Memories via Grid-based Patch Sampling for Efficient Online Class-Incremental Learning

Mingchuan Ma,Yuhao Zhou,Jindi Lv,Yuxin Tian,Dan Si,Shujian Li,Qing Ye,Jiancheng Lv

Main category: cs.CV

TLDR: 论文提出了一种轻量级的Grid-based Patch Sampling（GPS）方法，用于在在线类增量学习中高效生成信息丰富的记忆样本，无需依赖可训练模型。

Details

Motivation: 现有基于蒸馏数据的方法在受限存储下生成信息丰富的记忆样本时，计算开销较大，因此需要一种更轻量级的解决方案。 Method: GPS通过从原始图像中采样像素子集生成紧凑的低分辨率表示，保留语义内容和结构信息，并在重放时重新组装这些表示。 Result: 实验表明，GPS在内存受限设置下可无缝集成到现有重放框架中，平均最终准确率提升3%-4%，计算开销有限。 Conclusion: GPS是一种高效且轻量级的方法，适用于在线类增量学习中的记忆样本生成。 Abstract: Online class-incremental learning aims to enable models to continuously adapt to new classes with limited access to past data, while mitigating catastrophic forgetting. Replay-based methods address this by maintaining a small memory buffer of previous samples, achieving competitive performance. For effective replay under constrained storage, recent approaches leverage distilled data to enhance the informativeness of memory. However, such approaches often involve significant computational overhead due to the use of bi-level optimization. Motivated by these limitations, we introduce Grid-based Patch Sampling (GPS), a lightweight and effective strategy for distilling informative memory samples without relying on a trainable model. GPS generates informative samples by sampling a subset of pixels from the original image, yielding compact low-resolution representations that preserve both semantic content and structural information. During replay, these representations are reassembled to support training and evaluation. Experiments on extensive benchmarks demonstrate that GRS can be seamlessly integrated into existing replay frameworks, leading to 3%-4% improvements in average end accuracy under memory-constrained settings, with limited computational overhead.

[351] HUMOTO: A 4D Dataset of Mocap Human Object Interactions

Jiaxin Lu,Chun-Hao Paul Huang,Uttaran Bhattacharya,Qixing Huang,Yi Zhou

Main category: cs.CV

TLDR: HUMOTO是一个高保真的人类-物体交互数据集，用于运动生成、计算机视觉和机器人应用，包含736个序列，覆盖63个精确建模的物体和72个铰接部件。

Details

Motivation: 解决现有数据集中缺乏全面、逻辑性强的任务流程和物理准确性的人类-物体交互数据问题。 Method: 采用场景驱动的LLM脚本管道生成完整任务，结合动作捕捉和相机记录处理遮挡，并由专业艺术家清理和验证数据。 Result: HUMOTO提供了多样化的活动数据，物理准确且逻辑性强，并提供了与其他数据集的基准比较。 Conclusion: HUMOTO填补了人类-物体交互数据的空白，为动画、机器人和具身AI系统等领域的应用提供了重要资源。 Abstract: We present Human Motions with Objects (HUMOTO), a high-fidelity dataset of human-object interactions for motion generation, computer vision, and robotics applications. Featuring 736 sequences (7,875 seconds at 30 fps), HUMOTO captures interactions with 63 precisely modeled objects and 72 articulated parts. Our innovations include a scene-driven LLM scripting pipeline creating complete, purposeful tasks with natural progression, and a mocap-and-camera recording setup to effectively handle occlusions. Spanning diverse activities from cooking to outdoor picnics, HUMOTO preserves both physical accuracy and logical task flow. Professional artists rigorously clean and verify each sequence, minimizing foot sliding and object penetrations. We also provide benchmarks compared to other datasets. HUMOTO's comprehensive full-body motion and simultaneous multi-object interactions address key data-capturing challenges and provide opportunities to advance realistic human-object interaction modeling across research domains with practical applications in animation, robotics, and embodied AI systems. Project: https://jiaxin-lu.github.io/humoto/ .

[352] MonoDiff9D: Monocular Category-Level 9D Object Pose Estimation via Diffusion Model

Jian Liu,Wei Sun,Hui Yang,Jin Zheng,Zichen Geng,Hossein Rahmani,Ajmal Mian

Main category: cs.CV

TLDR: MonoDiff9D是一种基于扩散模型的单目类别级9D物体位姿估计方法，无需形状先验或CAD模型。

Details

Motivation: 利用扩散模型的概率特性，减少对形状先验、CAD模型或深度传感器的依赖。 Method: 通过DINOv2从单目图像估计粗略深度并转换为点云，融合点云全局特征与输入图像，使用基于变压器的去噪器恢复物体位姿。 Result: 在两个基准数据集上实现了最先进的单目类别级9D物体位姿估计精度。 Conclusion: MonoDiff9D无需形状先验或CAD模型，展示了扩散模型在物体位姿估计中的潜力。 Abstract: Object pose estimation is a core means for robots to understand and interact with their environment. For this task, monocular category-level methods are attractive as they require only a single RGB camera. However, current methods rely on shape priors or CAD models of the intra-class known objects. We propose a diffusion-based monocular category-level 9D object pose generation method, MonoDiff9D. Our motivation is to leverage the probabilistic nature of diffusion models to alleviate the need for shape priors, CAD models, or depth sensors for intra-class unknown object pose estimation. We first estimate coarse depth via DINOv2 from the monocular image in a zero-shot manner and convert it into a point cloud. We then fuse the global features of the point cloud with the input image and use the fused features along with the encoded time step to condition MonoDiff9D. Finally, we design a transformer-based denoiser to recover the object pose from Gaussian noise. Extensive experiments on two popular benchmark datasets show that MonoDiff9D achieves state-of-the-art monocular category-level 9D object pose estimation accuracy without the need for shape priors or CAD models at any stage. Our code will be made public at https://github.com/CNJianLiu/MonoDiff9D.

[353] Anchor Token Matching: Implicit Structure Locking for Training-free AR Image Editing

Taihang Hu,Linxuan Li,Kai Wang,Yaxing Wang,Jian Yang,Ming-Ming Cheng

Main category: cs.CV

TLDR: 论文提出了一种名为ISLock的训练免费编辑策略，用于自回归视觉模型，通过动态对齐自注意力模式实现结构一致性编辑。

Details

Motivation: 现有编辑技术无法直接用于自回归模型，因其存在注意力图空间贫乏和结构误差累积问题，导致对象布局和全局一致性破坏。 Method: ISLock通过Anchor Token Matching协议动态对齐自注意力模式，隐式保持结构蓝图，无需显式注意力操作或微调。 Result: 实验表明ISLock能实现高质量、结构一致的编辑，无需额外训练，性能优于或媲美传统技术。 Conclusion: ISLock为高效灵活的自回归图像编辑开辟了新途径，缩小了扩散模型与自回归模型的性能差距。 Abstract: Text-to-image generation has seen groundbreaking advancements with diffusion models, enabling high-fidelity synthesis and precise image editing through cross-attention manipulation. Recently, autoregressive (AR) models have re-emerged as powerful alternatives, leveraging next-token generation to match diffusion models. However, existing editing techniques designed for diffusion models fail to translate directly to AR models due to fundamental differences in structural control. Specifically, AR models suffer from spatial poverty of attention maps and sequential accumulation of structural errors during image editing, which disrupt object layouts and global consistency. In this work, we introduce Implicit Structure Locking (ISLock), the first training-free editing strategy for AR visual models. Rather than relying on explicit attention manipulation or fine-tuning, ISLock preserves structural blueprints by dynamically aligning self-attention patterns with reference images through the Anchor Token Matching (ATM) protocol. By implicitly enforcing structural consistency in latent space, our method ISLock enables structure-aware editing while maintaining generative autonomy. Extensive experiments demonstrate that ISLock achieves high-quality, structure-consistent edits without additional training and is superior or comparable to conventional editing techniques. Our findings pioneer the way for efficient and flexible AR-based image editing, further bridging the performance gap between diffusion and autoregressive generative models. The code will be publicly available at https://github.com/hutaiHang/ATM

[354] Integrating Vision and Location with Transformers: A Multimodal Deep Learning Framework for Medical Wound Analysis

Ramin Mousa,Hadis Taherinia,Khabiba Abdiyeva,Amir Ali Bengari,Mohammadmahdi Vahediahmar

Main category: cs.CV

TLDR: 该研究提出了一种基于深度学习的多模态分类器，结合伤口图像和位置数据，通过优化算法提高伤口分类的准确性和效率。

Details

Motivation: 传统机器学习模型在伤口分类中存在特征选择和模型复杂性问题，深度学习虽具潜力但仍有改进空间。 Method: 使用Vision Transformer提取图像特征，DWT层捕获频率成分，Transformer提取空间特征，并结合三种群优化算法优化模型。 Result: 模型在原始身体地图上的分类准确率为0.8123（仅图像）和0.8007（图像+位置），结合优化算法后准确率提升至0.7801-0.8342。 Conclusion: 优化算法显著提高了伤口分类的准确性，为伤口诊断提供了一种高效方法。 Abstract: Effective recognition of acute and difficult-to-heal wounds is a necessary step in wound diagnosis. An efficient classification model can help wound specialists classify wound types with less financial and time costs and also help in deciding on the optimal treatment method. Traditional machine learning models suffer from feature selection and are usually cumbersome models for accurate recognition. Recently, deep learning (DL) has emerged as a powerful tool in wound diagnosis. Although DL seems promising for wound type recognition, there is still a large scope for improving the efficiency and accuracy of the model. In this study, a DL-based multimodal classifier was developed using wound images and their corresponding locations to classify them into multiple classes, including diabetic, pressure, surgical, and venous ulcers. A body map was also created to provide location data, which can help wound specialists label wound locations more effectively. The model uses a Vision Transformer to extract hierarchical features from input images, a Discrete Wavelet Transform (DWT) layer to capture low and high frequency components, and a Transformer to extract spatial features. The number of neurons and weight vector optimization were performed using three swarm-based optimization techniques (Monster Gorilla Toner (MGTO), Improved Gray Wolf Optimization (IGWO), and Fox Optimization Algorithm). The evaluation results show that weight vector optimization using optimization algorithms can increase diagnostic accuracy and make it a very effective approach for wound detection. In the classification using the original body map, the proposed model was able to achieve an accuracy of 0.8123 using image data and an accuracy of 0.8007 using a combination of image data and wound location. Also, the accuracy of the model in combination with the optimization models varied from 0.7801 to 0.8342.

[355] The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer

Weixian Lei,Jiacong Wang,Haochen Wang,Xiangtai Li,Jun Hao Liew,Jiashi Feng,Zilong Huang

Main category: cs.CV

TLDR: SAIL是一种单变压器统一多模态大语言模型（MLLM），通过整合原始像素编码和语言解码于单一架构，简化了现有模块化MLLM的设计。

Details

Motivation: 现有模块化MLLM依赖预训练视觉变换器（ViT），SAIL旨在消除对独立视觉编码器的需求，提供更简洁的架构设计。 Method: SAIL采用混合注意力机制和多模态位置编码，适应视觉和文本模态的差异，无需引入新组件。 Result: SAIL在扩展训练数据和模型规模后，性能与模块化MLLM相当，且视觉表示能力接近ViT-22B。 Conclusion: SAIL展示了统一架构的潜力，简化设计的同时保持了高性能，为多模态模型提供了新思路。 Abstract: This paper introduces SAIL, a single transformer unified multimodal large language model (MLLM) that integrates raw pixel encoding and language decoding within a singular architecture. Unlike existing modular MLLMs, which rely on a pre-trained vision transformer (ViT), SAIL eliminates the need for a separate vision encoder, presenting a more minimalist architecture design. Instead of introducing novel architectural components, SAIL adapts mix-attention mechanisms and multimodal positional encodings to better align with the distinct characteristics of visual and textual modalities. We systematically compare SAIL's properties-including scalability, cross-modal information flow patterns, and visual representation capabilities-with those of modular MLLMs. By scaling both training data and model size, SAIL achieves performance comparable to modular MLLMs. Notably, the removal of pretrained ViT components enhances SAIL's scalability and results in significantly different cross-modal information flow patterns. Moreover, SAIL demonstrates strong visual representation capabilities, achieving results on par with ViT-22B in vision tasks such as semantic segmentation. Code and models are available at https://github.com/bytedance/SAIL.

[356] Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding

Tao Zhang,Xiangtai Li,Zilong Huang,Yanwei Li,Weixian Lei,Xueqing Deng,Shihao Chen,Shunping Ji,Jiashi Feng

Main category: cs.CV

TLDR: Pixel-SAIL提出了一种简化的多模态大语言模型（MLLM），通过单一Transformer实现像素级任务，避免了额外组件依赖，并在多个基准测试中表现优异。

Details

Motivation: 现有MLLM依赖额外组件（如CLIP、分割专家），导致系统复杂且难以扩展，目标是探索一种无需额外组件的简化MLLM。 Method: 设计了可学习的上采样模块、视觉提示注入策略和视觉专家蒸馏策略，提升单一Transformer的性能。 Result: 在多个基准测试中，Pixel-SAIL表现与现有方法相当或更优，且流程更简化。 Conclusion: Pixel-SAIL证明了单一Transformer可实现高效像素级任务，为MLLM简化提供了新思路。 Abstract: Multimodal Large Language Models (MLLMs) achieve remarkable performance for fine-grained pixel-level understanding tasks. However, all the works rely heavily on extra components, such as vision encoder (CLIP), segmentation experts, leading to high system complexity and limiting model scaling. In this work, our goal is to explore a highly simplified MLLM without introducing extra components. Our work is motivated by the recent works on Single trAnsformer as a unified vIsion-Language Model (SAIL) design, where these works jointly learn vision tokens and text tokens in transformers. We present Pixel-SAIL, a single transformer for pixel-wise MLLM tasks. In particular, we present three technical improvements on the plain baseline. First, we design a learnable upsampling module to refine visual token features. Secondly, we propose a novel visual prompt injection strategy to enable the single transformer to understand visual prompt inputs and benefit from the early fusion of visual prompt embeddings and vision tokens. Thirdly, we introduce a vision expert distillation strategy to efficiently enhance the single transformer's fine-grained feature extraction capability. In addition, we have collected a comprehensive pixel understanding benchmark (PerBench), using a manual check. It includes three tasks: detailed object description, visual prompt-based question answering, and visual-text referring segmentation. Extensive experiments on four referring segmentation benchmarks, one visual prompt benchmark, and our PerBench show that our Pixel-SAIL achieves comparable or even better results with a much simpler pipeline. Code and model will be released at https://github.com/magic-research/Sa2VA.

[357] Art3D: Training-Free 3D Generation from Flat-Colored Illustration

Xiaoyan Cong,Jiayi Shen,Zekun Li,Rao Fu,Tao Lu,Srinath Sridhar

Main category: cs.CV

TLDR: Art3D是一种无需训练的方法，可将平面彩色2D设计提升为3D，利用预训练的2D图像生成模型和基于VLM的真实性评估，增强参考图像的三维感。

Details

Motivation: 现有的大规模预训练图像到3D生成模型在处理平面彩色图像（如手绘图）时表现不佳，缺乏三维感，而这类输入在艺术内容创作中最为用户友好。 Method: Art3D结合预训练的2D图像生成模型和基于VLM的真实性评估，无需额外训练即可提升平面彩色图像的三维感。 Result: 实验结果表明，Art3D在性能和鲁棒性上表现优异，具有广泛的泛化能力和实际应用前景。 Conclusion: Art3D简化了从2D生成3D的过程，适用于多种绘画风格，并公开了源代码和数据集。 Abstract: Large-scale pre-trained image-to-3D generative models have exhibited remarkable capabilities in diverse shape generations. However, most of them struggle to synthesize plausible 3D assets when the reference image is flat-colored like hand drawings due to the lack of 3D illusion, which are often the most user-friendly input modalities in art content creation. To this end, we propose Art3D, a training-free method that can lift flat-colored 2D designs into 3D. By leveraging structural and semantic features with pre- trained 2D image generation models and a VLM-based realism evaluation, Art3D successfully enhances the three-dimensional illusion in reference images, thus simplifying the process of generating 3D from 2D, and proves adaptable to a wide range of painting styles. To benchmark the generalization performance of existing image-to-3D models on flat-colored images without 3D feeling, we collect a new dataset, Flat-2D, with over 100 samples. Experimental results demonstrate the performance and robustness of Art3D, exhibiting superior generalizable capacity and promising practical applicability. Our source code and dataset will be publicly available on our project page: https://joy-jy11.github.io/ .

[358] InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu,Weiyun Wang,Zhe Chen,Zhaoyang Liu,Shenglong Ye,Lixin Gu,Yuchen Duan,Hao Tian,Weijie Su,Jie Shao,Zhangwei Gao,Erfei Cui,Yue Cao,Yangzhou Liu,Weiye Xu,Hao Li,Jiahao Wang,Han Lv,Dengnian Chen,Songze Li,Yinan He,Tan Jiang,Jiapeng Luo,Yi Wang,Conghui He,Botian Shi,Xingcheng Zhang,Wenqi Shao,Junjun He,Yingtong Xiong,Wenwen Qu,Peng Sun,Penglong Jiao,Lijun Wu,Kaipeng Zhang,Huipeng Deng,Jiaye Ge,Kai Chen,Limin Wang,Min Dou,Lewei Lu,Xizhou Zhu,Tong Lu,Dahua Lin,Yu Qiao,Jifeng Dai,Wenhai Wang

Main category: cs.CV

TLDR: InternVL3是一种新型多模态预训练模型，通过统一训练范式解决传统MLLM的复杂性和对齐问题，支持扩展多模态上下文，并在多项任务中表现优异。

Details

Motivation: 传统方法通过调整纯文本LLM为支持视觉输入的MLLM，存在复杂性和对齐挑战。InternVL3旨在通过统一训练范式解决这些问题。 Method: 采用原生多模态预训练范式，结合可变视觉位置编码（V2PE）、监督微调（SFT）和混合偏好优化（MPO）等技术。 Result: InternVL3-78B在MMMU基准测试中得分72.2，创开源MLLM新纪录，性能与ChatGPT-4o等专有模型竞争。 Conclusion: InternVL3在多模态任务中表现卓越，未来将公开训练数据和模型权重以推动研究。 Abstract: We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.

[359] REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers

Xingjian Leng,Jaskirat Singh,Yunzhong Hou,Zhenchang Xing,Saining Xie,Liang Zheng

Main category: cs.CV

TLDR: 论文探讨了如何通过表示对齐损失（REPA）实现变分自编码器（VAE）和扩散模型的端到端训练，显著提升了训练速度和生成性能。

Details

Motivation: 传统方法中，端到端训练VAE和扩散模型效果不佳，甚至导致性能下降。本文旨在解决这一问题。 Method: 提出REPA损失，实现VAE和扩散模型的端到端联合训练（REPA-E方法）。 Result: 训练速度提升17倍（相比REPA）和45倍（相比传统方法），生成性能显著提升（FID达1.26）。 Conclusion: REPA-E方法不仅高效，还改善了VAE的潜在空间结构，为端到端训练提供了新思路。 Abstract: In this paper we tackle a fundamental question: "Can we train latent diffusion models together with the variational auto-encoder (VAE) tokenizer in an end-to-end manner?" Traditional deep-learning wisdom dictates that end-to-end training is often preferable when possible. However, for latent diffusion transformers, it is observed that end-to-end training both VAE and diffusion-model using standard diffusion-loss is ineffective, even causing a degradation in final performance. We show that while diffusion loss is ineffective, end-to-end training can be unlocked through the representation-alignment (REPA) loss -- allowing both VAE and diffusion model to be jointly tuned during the training process. Despite its simplicity, the proposed training recipe (REPA-E) shows remarkable performance; speeding up diffusion model training by over 17x and 45x over REPA and vanilla training recipes, respectively. Interestingly, we observe that end-to-end tuning with REPA-E also improves the VAE itself; leading to improved latent space structure and downstream generation performance. In terms of final performance, our approach sets a new state-of-the-art; achieving FID of 1.26 and 1.83 with and without classifier-free guidance on ImageNet 256 x 256. Code is available at https://end2end-diffusion.github.io.

[360] Decoupled Diffusion Sparks Adaptive Scene Generation

Yunsong Zhou,Naisheng Ye,William Ljungbergh,Tianyu Li,Jiazhi Yang,Zetong Yang,Hongzi Zhu,Christoffer Petersson,Hongyang Li

Main category: cs.CV

TLDR: Nexus提出了一种解耦的场景生成框架，通过独立噪声状态和细粒度令牌模拟常规和挑战性场景，提升了反应性和目标导向性。

Details

Motivation: 现有方法在交通布局生成中要么缺乏在线反应能力，要么缺乏精确的目标状态指导，且难以生成复杂或挑战性场景。 Method: 采用解耦的生成框架，结合部分噪声掩码训练策略和噪声感知调度，同时收集复杂角例数据集。 Result: Nexus在生成真实感、反应性和目标导向性上表现优越，位移误差减少40%，闭环规划提升20%。 Conclusion: Nexus在安全关键数据生成和数据增强方面具有显著优势。 Abstract: Controllable scene generation could reduce the cost of diverse data collection substantially for autonomous driving. Prior works formulate the traffic layout generation as predictive progress, either by denoising entire sequences at once or by iteratively predicting the next frame. However, full sequence denoising hinders online reaction, while the latter's short-sighted next-frame prediction lacks precise goal-state guidance. Further, the learned model struggles to generate complex or challenging scenarios due to a large number of safe and ordinal driving behaviors from open datasets. To overcome these, we introduce Nexus, a decoupled scene generation framework that improves reactivity and goal conditioning by simulating both ordinal and challenging scenarios from fine-grained tokens with independent noise states. At the core of the decoupled pipeline is the integration of a partial noise-masking training strategy and a noise-aware schedule that ensures timely environmental updates throughout the denoising process. To complement challenging scenario generation, we collect a dataset consisting of complex corner cases. It covers 540 hours of simulated data, including high-risk interactions such as cut-in, sudden braking, and collision. Nexus achieves superior generation realism while preserving reactivity and goal orientation, with a 40% reduction in displacement error. We further demonstrate that Nexus improves closed-loop planning by 20% through data augmentation and showcase its capability in safety-critical data generation.

[361] DNF-Avatar: Distilling Neural Fields for Real-time Animatable Avatar Relighting

Zeren Jiang,Shaofei Wang,Siyu Tang

Main category: cs.CV

TLDR: 论文提出了一种从单目视频创建可重光照和可动画化的人体化身的快速方法，通过将隐式神经场（教师）的知识蒸馏到显式2D高斯泼溅（学生）表示中，结合PBR外观的分裂和近似和部分环境光遮蔽探针，实现了高质量的重光照效果和实时渲染速度。

Details

Motivation: 解决现有方法因蒙特卡洛光线追踪导致的渲染速度慢的问题，提升人体化身的实时渲染性能。 Method: 1. 使用隐式神经场（教师）蒸馏到显式2D高斯泼溅（学生）表示；2. 采用分裂和近似处理PBR外观；3. 提出部分环境光遮蔽探针用于阴影计算。 Result: 学生模型在推理时比教师模型快370倍，达到67 FPS的渲染速度，且重光照效果相当或更好。 Conclusion: 该方法实现了高质量的重光照效果和实时渲染速度，适用于虚拟现实、体育和视频游戏等应用。 Abstract: Creating relightable and animatable human avatars from monocular videos is a rising research topic with a range of applications, e.g. virtual reality, sports, and video games. Previous works utilize neural fields together with physically based rendering (PBR), to estimate geometry and disentangle appearance properties of human avatars. However, one drawback of these methods is the slow rendering speed due to the expensive Monte Carlo ray tracing. To tackle this problem, we proposed to distill the knowledge from implicit neural fields (teacher) to explicit 2D Gaussian splatting (student) representation to take advantage of the fast rasterization property of Gaussian splatting. To avoid ray-tracing, we employ the split-sum approximation for PBR appearance. We also propose novel part-wise ambient occlusion probes for shadow computation. Shadow prediction is achieved by querying these probes only once per pixel, which paves the way for real-time relighting of avatars. These techniques combined give high-quality relighting results with realistic shadow effects. Our experiments demonstrate that the proposed student model achieves comparable or even better relighting results with our teacher model while being 370 times faster at inference time, achieving a 67 FPS rendering speed.

[362] FLOSS: Free Lunch in Open-vocabulary Semantic Segmentation

Yasser Benigmim,Mohammad Fahes,Tuan-Hung Vu,Andrei Bursuc,Raoul de Charette

Main category: cs.CV

TLDR: 论文提出FLOSS方法，通过选择单模板分类器（class-experts）并融合其预测，无需训练即可提升开放词汇语义分割（OVSS）性能。

Details

Motivation: 挑战现有OVSS模型中多模板平均文本嵌入的常规做法，探索单模板分类器的潜力。 Method: 利用未标注图像和单模板分类器的预测熵选择class-experts，并通过新融合方法生成更准确的OVSS预测。 Result: FLOSS在多个OVSS基准测试中显著提升现有方法性能，且所选模板可泛化至不同分布的数据集。 Conclusion: FLOSS是一种无需标签和训练的即插即用方法，为OVSS提供了系统性改进方案。 Abstract: Recent Open-Vocabulary Semantic Segmentation (OVSS) models extend the CLIP model to segmentation while maintaining the use of multiple templates (e.g., a photo of , a sketch of a , etc.) for constructing class-wise averaged text embeddings, acting as a classifier. In this paper, we challenge this status quo and investigate the impact of templates for OVSS. Empirically, we observe that for each class, there exist single-template classifiers significantly outperforming the conventional averaged classifier. We refer to them as class-experts. Given access to unlabeled images and without any training involved, we estimate these experts by leveraging the class-wise prediction entropy of single-template classifiers, selecting as class-wise experts those which yield the lowest entropy. All experts, each specializing in a specific class, collaborate in a newly proposed fusion method to generate more accurate OVSS predictions. Our plug-and-play method, coined FLOSS, is orthogonal and complementary to existing OVSS methods, offering a ''free lunch'' to systematically improve OVSS without labels and additional training. Extensive experiments demonstrate that FLOSS consistently boosts state-of-the-art methods on various OVSS benchmarks. Moreover, the selected expert templates can generalize well from one dataset to others sharing the same semantic categories, yet exhibiting distribution shifts. Additionally, we obtain satisfactory improvements under a low-data regime, where only a few unlabeled images are available. Our code is available at https://github.com/yasserben/FLOSS .

cs.CL [Back]

[363] Layers at Similar Depths Generate Similar Activations Across LLM Architectures

Christopher Wolfram,Aaron Schein

Main category: cs.CL

TLDR: 研究发现，独立训练的LLMs的潜在空间在层间关系上存在变化，但不同模型的对应层之间共享相似的近邻关系。

Details

Motivation: 探究独立训练的大型语言模型（LLMs）潜在空间之间的关系，以理解其内部结构的共性和差异。 Method: 分析了24个开源LLMs不同层的激活向量，研究了其近邻关系在层内和模型间的表现。 Result: 1) 同一模型内不同层的近邻关系存在变化；2) 不同模型的对应层之间共享近邻关系。 Conclusion: LLMs在层间生成激活几何结构的进展，但这一进展在不同模型间是共享的，仅因架构不同而有所调整。 Abstract: How do the latent spaces used by independently-trained LLMs relate to one another? We study the nearest neighbor relationships induced by activations at different layers of 24 open-weight LLMs, and find that they 1) tend to vary from layer to layer within a model, and 2) are approximately shared between corresponding layers of different models. Claim 2 shows that these nearest neighbor relationships are not arbitrary, as they are shared across models, but Claim 1 shows that they are not "obvious" either, as there is no single set of nearest neighbor relationships that is universally shared. Together, these suggest that LLMs generate a progression of activation geometries from layer to layer, but that this entire progression is largely shared between models, stretched and squeezed to fit into different architectures.

[364] SemCAFE: When Named Entities make the Difference Assessing Web Source Reliability through Entity-level Analytics

Gautam Kishore Shahi,Oshani Seneviratne,Marc Spaniol

Main category: cs.CL

TLDR: 论文提出SemCAFE系统，通过结合实体关联性检测新闻可靠性，优于现有方法12%。

Details

Motivation: 随着数字媒体的普及，不可靠新闻内容难以区分，影响公众意见和政治议程。 Method: 使用自然语言处理技术和YAGO知识库进行语义分析，生成新闻文章的语义指纹。 Result: 在46,020篇可靠和3,407篇不可靠新闻中，SemCAFE表现优于现有方法12%。 Conclusion: SemCAFE能有效区分新闻可靠性，为数字媒体可信度提供解决方案。 Abstract: With the shift from traditional to digital media, the online landscape now hosts not only reliable news articles but also a significant amount of unreliable content. Digital media has faster reachability by significantly influencing public opinion and advancing political agendas. While newspaper readers may be familiar with their preferred outlets political leanings or credibility, determining unreliable news articles is much more challenging. The credibility of many online sources is often opaque, with AI generated content being easily disseminated at minimal cost. Unreliable news articles, particularly those that followed the Russian invasion of Ukraine in 2022, closely mimic the topics and writing styles of credible sources, making them difficult to distinguish. To address this, we introduce SemCAFE, a system designed to detect news reliability by incorporating entity relatedness into its assessment. SemCAFE employs standard Natural Language Processing techniques, such as boilerplate removal and tokenization, alongside entity level semantic analysis using the YAGO knowledge base. By creating a semantic fingerprint for each news article, SemCAFE could assess the credibility of 46,020 reliable and 3,407 unreliable articles on the 2022 Russian invasion of Ukraine. Our approach improved the macro F1 score by 12% over state of the art methods. The sample data and code are available on GitHub

[365] From Tokens to Lattices: Emergent Lattice Structures in Language Models

Bo Xiong,Steffen Staab

Main category: cs.CL

TLDR: 该论文探讨了预训练掩码语言模型（MLMs）如何通过隐式学习形式背景（formal context）来构建概念格（concept lattice），并提出了一种不依赖人工定义概念的框架。

Details

Motivation: 研究MLMs如何从预训练中自然形成概念格结构，以及其背后的归纳偏置。 Method: 基于形式概念分析（FCA）框架，提出了一种从预训练MLMs构建概念格的方法，并验证其有效性。 Result: 实验结果表明，MLMs能够隐式学习形式背景，并成功构建概念格，揭示了其潜在的归纳偏置。 Conclusion: MLMs通过预训练隐式学习概念结构，为概念发现提供了新视角，扩展了人类定义的概念范围。 Abstract: Pretrained masked language models (MLMs) have demonstrated an impressive capability to comprehend and encode conceptual knowledge, revealing a lattice structure among concepts. This raises a critical question: how does this conceptualization emerge from MLM pretraining? In this paper, we explore this problem from the perspective of Formal Concept Analysis (FCA), a mathematical framework that derives concept lattices from the observations of object-attribute relationships. We show that the MLM's objective implicitly learns a \emph{formal context} that describes objects, attributes, and their dependencies, which enables the reconstruction of a concept lattice through FCA. We propose a novel framework for concept lattice construction from pretrained MLMs and investigate the origin of the inductive biases of MLMs in lattice structure learning. Our framework differs from previous work because it does not rely on human-defined concepts and allows for discovering "latent" concepts that extend beyond human definitions. We create three datasets for evaluation, and the empirical results verify our hypothesis.

[366] Can AI Master Construction Management (CM)? Benchmarking State-of-the-Art Large Language Models on CM Certification Exams

Ruoxin Xiong,Yanyu Wang,Suat Gunhan,Yimin Zhu,Charles Berryman

Main category: cs.CL

TLDR: 该研究引入CMExamSet数据集评估LLMs在建筑管理（CM）中的表现，发现GPT-4o和Claude 3.7在单步任务中表现优异，但在多步任务和图表题中表现较差，需进一步优化。

Details

Motivation: 建筑管理项目日益复杂，需专业工具提升效率，但LLMs在CM领域的应用尚未充分探索。 Method: 使用CMExamSet数据集（689道真实选择题）进行零样本评估，分析准确性、主题领域、推理复杂性和问题格式。 Result: GPT-4o和Claude 3.7平均准确率超70%，单步任务表现更优（85.7%和86.7%），图表题表现差（约40%）。 Conclusion: LLMs在CM中有潜力，但需领域优化和人工监督。 Abstract: The growing complexity of construction management (CM) projects, coupled with challenges such as strict regulatory requirements and labor shortages, requires specialized analytical tools that streamline project workflow and enhance performance. Although large language models (LLMs) have demonstrated exceptional performance in general reasoning tasks, their effectiveness in tackling CM-specific challenges, such as precise quantitative analysis and regulatory interpretation, remains inadequately explored. To bridge this gap, this study introduces CMExamSet, a comprehensive benchmarking dataset comprising 689 authentic multiple-choice questions sourced from four nationally accredited CM certification exams. Our zero-shot evaluation assesses overall accuracy, subject areas (e.g., construction safety), reasoning complexity (single-step and multi-step), and question formats (text-only, figure-referenced, and table-referenced). The results indicate that GPT-4o and Claude 3.7 surpass typical human pass thresholds (70%), with average accuracies of 82% and 83%, respectively. Additionally, both models performed better on single-step tasks, with accuracies of 85.7% (GPT-4o) and 86.7% (Claude 3.7). Multi-step tasks were more challenging, reducing performance to 76.5% and 77.6%, respectively. Furthermore, both LLMs show significant limitations on figure-referenced questions, with accuracies dropping to approximately 40%. Our error pattern analysis further reveals that conceptual misunderstandings are the most common (44.4% and 47.9%), underscoring the need for enhanced domain-specific reasoning models. These findings underscore the potential of LLMs as valuable supplementary analytical tools in CM, while highlighting the need for domain-specific refinements and sustained human oversight in complex decision making.

[367] Efficient Evaluation of Large Language Models via Collaborative Filtering

Xu-Xiang Zhong,Chao Yi,Han-Jia Ye

Main category: cs.CL

TLDR: 提出了一种基于推荐系统思想的两阶段方法，用于高效评估大语言模型在少量测试实例上的性能。

Details

Motivation: 评估大语言模型成本高，需要减少测试实例数量并快速预测性能。 Method: 将模型视为用户，测试实例视为物品，分两阶段：1）选择能区分模型性能的实例；2）预测未选实例上的性能。 Result: 实验表明，该方法能准确预测模型性能并显著降低计算开销。 Conclusion: 该方法为高效评估大语言模型提供了一种可行方案。 Abstract: With the development of Large Language Models (LLMs), numerous benchmarks have been proposed to measure and compare the capabilities of different LLMs. However, evaluating LLMs is costly due to the large number of test instances and their slow inference speed. In this paper, we aim to explore how to efficiently estimate a model's real performance on a given benchmark based on its evaluation results on a small number of instances sampled from the benchmark. Inspired by Collaborative Filtering (CF) in Recommendation Systems (RS), we treat LLMs as users and test instances as items and propose a two-stage method. In the first stage, we treat instance selection as recommending products to users to choose instances that can easily distinguish model performance. In the second stage, we see performance prediction as rating prediction problem in RS to predict the target LLM's behavior on unselected instances. Experiments on multiple LLMs and datasets imply that our method can accurately estimate the target model's performance while largely reducing its inference overhead.

[368] Enhancing NER Performance in Low-Resource Pakistani Languages using Cross-Lingual Data Augmentation

Toqeer Ehsan,Thamar Solorio

Main category: cs.CL

TLDR: 论文提出了一种数据增强技术，用于提升低资源巴基斯坦语言的命名实体识别（NER）性能，并通过微调多语言掩码大语言模型（LLMs）在Shahmukhi和Pashto上取得了显著改进。

Details

Motivation: 高资源语言的NER任务已有显著进展，但低资源语言因缺乏标注数据和预训练语言模型（PLMs）中的有限表示而研究不足。 Method: 采用数据增强技术生成文化合理的句子，并微调多语言掩码LLMs，同时探索生成式LLMs在NER和数据增强中的少样本学习能力。 Result: 在Shahmukhi和Pashto两种语言上，NER性能显著提升。 Conclusion: 该方法为低资源语言的NER任务提供了有效解决方案，并展示了生成式LLMs的潜力。 Abstract: Named Entity Recognition (NER), a fundamental task in Natural Language Processing (NLP), has shown significant advancements for high-resource languages. However, due to a lack of annotated datasets and limited representation in Pre-trained Language Models (PLMs), it remains understudied and challenging for low-resource languages. To address these challenges, we propose a data augmentation technique that generates culturally plausible sentences and experiments on four low-resource Pakistani languages; Urdu, Shahmukhi, Sindhi, and Pashto. By fine-tuning multilingual masked Large Language Models (LLMs), our approach demonstrates significant improvements in NER performance for Shahmukhi and Pashto. We further explore the capability of generative LLMs for NER and data augmentation using few-shot learning.

[369] Exploring Gradient-Guided Masked Language Model to Detect Textual Adversarial Attacks

Xiaomei Zhang,Zhaoxi Zhang,Yanjun Zhang,Xufei Zheng,Leo Yu Zhang,Shengshan Hu,Shirui Pan

Main category: cs.CL

TLDR: 论文提出了一种基于掩码语言模型的文本对抗攻击检测方法（MLMD），并通过梯度引导优化（GradMLMD）减少计算开销。

Details

Motivation: 文本对抗样本威胁自然语言处理系统的可靠性，而预训练掩码语言模型能近似正常文本的流形，因此探索其用于检测对抗攻击。 Method: MLMD利用掩码和去掩码操作区分正常与对抗文本的流形变化；GradMLMD通过梯度信息跳过非关键词以减少计算开销。 Result: MLMD检测性能优异但计算开销大；GradMLMD在不影响性能的情况下显著降低资源消耗。 Conclusion: 梯度引导的掩码语言模型检测方法（GradMLMD）高效且有效，适用于文本对抗攻击检测。 Abstract: Textual adversarial examples pose serious threats to the reliability of natural language processing systems. Recent studies suggest that adversarial examples tend to deviate from the underlying manifold of normal texts, whereas pre-trained masked language models can approximate the manifold of normal data. These findings inspire the exploration of masked language models for detecting textual adversarial attacks. We first introduce Masked Language Model-based Detection (MLMD), leveraging the mask and unmask operations of the masked language modeling (MLM) objective to induce the difference in manifold changes between normal and adversarial texts. Although MLMD achieves competitive detection performance, its exhaustive one-by-one masking strategy introduces significant computational overhead. Our posterior analysis reveals that a significant number of non-keywords in the input are not important for detection but consume resources. Building on this, we introduce Gradient-guided MLMD (GradMLMD), which leverages gradient information to identify and skip non-keywords during detection, significantly reducing resource consumption without compromising detection performance.

[370] Exploring the Effectiveness and Interpretability of Texts in LLM-based Time Series Models

Zhengke Sun,Hangwei Qian,Ivor Tsang

Main category: cs.CL

TLDR: 研究探讨了文本信息在时间序列预测中的实际效果和可解释性，发现文本与时间序列的对齐问题，且文本信息并未显著提升预测性能。

Details

Motivation: 探究文本信息是否真正有助于提升时间序列预测的性能和可解释性。 Method: 通过实验分析文本提示和文本原型的有效性，并提出新的度量指标SMI。 Result: 文本信息与时间序列的对齐性不足，且对预测性能提升有限；可视化分析显示文本表示缺乏可解释性。 Conclusion: 当前时间序列LLMs中文本的对齐性和可解释性存在问题，需进一步关注。 Abstract: Large Language Models (LLMs) have been applied to time series forecasting tasks, leveraging pre-trained language models as the backbone and incorporating textual data to purportedly enhance the comprehensive capabilities of LLMs for time series. However, are these texts really helpful for interpretation? This study seeks to investigate the actual efficacy and interpretability of such textual incorporations. Through a series of empirical experiments on textual prompts and textual prototypes, our findings reveal that the misalignment between two modalities exists, and the textual information does not significantly improve time series forecasting performance in many cases. Furthermore, visualization analysis indicates that the textual representations learned by existing frameworks lack sufficient interpretability when applied to time series data. We further propose a novel metric named Semantic Matching Index (SMI) to better evaluate the matching degree between time series and texts during our post hoc interpretability investigation. Our analysis reveals the misalignment and limited interpretability of texts in current time-series LLMs, and we hope this study can raise awareness of the interpretability of texts for time series. The code is available at https://github.com/zachysun/TS-Lang-Exp.

[371] CAReDiO: Cultural Alignment of LLM via Representativeness and Distinctiveness Guided Data Optimization

Jing Yao,Xiaoyuan Yi,Jindong Wang,Zhicheng Dou,Xing Xie

Main category: cs.CL

TLDR: 论文提出CAReDiO框架，通过优化文化数据的代表性和独特性，利用LLMs自动生成文化对话数据，显著提升文化对齐效果。

Details

Motivation: 随着LLMs在人类生活中的深入应用，文化对齐对提升用户体验和减少文化冲突至关重要。现有方法依赖大量文化特定语料库，但存在代表性和独特性不足的问题。 Method: 提出CAReDiO框架，利用LLMs自动生成文化对话数据，并通过优化代表性和独特性来构建高效数据集。 Result: 实验表明，CAReDiO生成的数据更有效，仅需100个训练样本即可实现文化对齐，提升性能和效率。 Conclusion: CAReDiO框架为文化对齐提供了一种高效且数据量需求低的新方法。 Abstract: As Large Language Models (LLMs) more deeply integrate into human life across various regions, aligning them with pluralistic cultures is crucial for improving user experience and mitigating cultural conflicts. Existing approaches develop culturally aligned LLMs primarily through fine-tuning with massive carefully curated culture-specific corpora. Nevertheless, inspired by culture theories, we identify two key challenges faced by these datasets: (1) Representativeness: These corpora fail to fully capture the target culture's core characteristics with redundancy, causing computation waste; (2) Distinctiveness: They struggle to distinguish the unique nuances of a given culture from shared patterns across other relevant ones, hindering precise cultural modeling. To handle these challenges, we introduce CAReDiO, a novel cultural data construction framework. Specifically, CAReDiO utilizes powerful LLMs to automatically generate cultural conversation data, where both the queries and responses are further optimized by maximizing representativeness and distinctiveness. Using CAReDiO, we construct a small yet effective dataset, covering five cultures, and compare it with several recent cultural corpora. Extensive experiments demonstrate that our method generates more effective data and enables cultural alignment with as few as 100 training samples, enhancing both performance and efficiency.

[372] SD$^2$: Self-Distilled Sparse Drafters

Mike Lasby,Nish Sinnadurai,Valavan Manohararajah,Sean Lie,Vithursan Thangarasa

Main category: cs.CL

TLDR: SD$^2$是一种通过自数据蒸馏和细粒度权重稀疏性提升LLM推理效率的新方法，显著提高草稿模型接受率并减少计算量。

Details

Motivation: 减少大型语言模型（LLM）的推理延迟，同时保持与目标模型的对齐。 Method: 利用自数据蒸馏和细粒度权重稀疏性训练高效的草稿模型。 Result: 在Llama-3.1-70B目标模型上，SD$^2$比层剪枝草稿模型平均接受长度提高1.59倍，计算量减少43.87%。 Conclusion: 稀疏感知微调和压缩策略可显著提升LLM推理效率，同时保持模型对齐。 Abstract: Speculative decoding is a powerful technique for reducing the latency of Large Language Models (LLMs), offering a fault-tolerant framework that enables the use of highly compressed draft models. In this work, we introduce Self-Distilled Sparse Drafters (SD$^2$), a novel methodology that leverages self-data distillation and fine-grained weight sparsity to produce highly efficient and well-aligned draft models. SD$^2$ systematically enhances draft token acceptance rates while significantly reducing Multiply-Accumulate operations (MACs), even in the Universal Assisted Generation (UAG) setting, where draft and target models originate from different model families. On a Llama-3.1-70B target model, SD$^2$ provides a $\times$1.59 higher Mean Accepted Length (MAL) compared to layer-pruned draft models and reduces MACs by over 43.87% with a 8.36% reduction in MAL compared to a dense draft models. Our results highlight the potential of sparsity-aware fine-tuning and compression strategies to improve LLM inference efficiency while maintaining alignment with target models.

[373] Forecasting Communication Derailments Through Conversation Generation

Yunfan Zhang,Kathleen McKeown,Smaranda Muresan

Main category: cs.CL

TLDR: 提出了一种基于LLM的方法，通过采样多个未来对话轨迹预测沟通脱轨，结合社会语言学属性，显著提升了预测准确性。

Details

Motivation: 预测沟通脱轨在内容审核、冲突解决等场景中有实际应用价值，但现有语言模型难以准确预测未来脱轨。 Method: 使用微调后的LLM采样多个未来对话轨迹，并基于共识预测结果；同时利用社会语言学属性指导生成。 Result: 在英语沟通脱轨预测基准上超越现有方法，消融实验显示显著准确率提升。 Conclusion: 通过多轨迹采样和社会语言学属性结合，有效提升了沟通脱轨预测的准确性。 Abstract: Forecasting communication derailment can be useful in real-world settings such as online content moderation, conflict resolution, and business negotiations. However, despite language models' success at identifying offensive speech present in conversations, they struggle to forecast future communication derailments. In contrast to prior work that predicts conversation outcomes solely based on the past conversation history, our approach samples multiple future conversation trajectories conditioned on existing conversation history using a fine-tuned LLM. It predicts the communication outcome based on the consensus of these trajectories. We also experimented with leveraging socio-linguistic attributes, which reflect turn-level conversation dynamics, as guidance when generating future conversations. Our method of future conversation trajectories surpasses state-of-the-art results on English communication derailment prediction benchmarks and demonstrates significant accuracy gains in ablation studies.

[374] Generating Planning Feedback for Open-Ended Programming Exercises with LLMs

Mehmet Arif Demirtaş,Claire Zheng,Max Fowler,Kathryn Cunningham

Main category: cs.CL

TLDR: 论文提出了一种利用大型语言模型（LLM）检测学生编程计划的方法，以提供反馈，即使代码存在语法错误。GPT-4o及其小型变体（GPT-4o-mini）在检测计划方面表现出色，小型模型经过微调后性能接近GPT-4o。

Details

Motivation: 当前自动评分系统仅基于最终代码的正确性提供反馈，无法评估学生的编程计划过程。LLM可以填补这一空白，帮助学生在代码实现前获得规划反馈。 Method: 通过LLM（如GPT-4o和GPT-4o-mini）检测学生代码中的高层次目标和模式（编程计划），并与传统代码分析方法对比。 Result: GPT-4o和GPT-4o-mini在检测编程计划方面表现优异，小型模型经过微调后性能接近GPT-4o，适合实时评分。 Conclusion: LLM可用于提供编程计划反馈，小型模型在成本效益和性能上具有潜力，并可推广至数学和物理等其他领域。 Abstract: To complete an open-ended programming exercise, students need to both plan a high-level solution and implement it using the appropriate syntax. However, these problems are often autograded on the correctness of the final submission through test cases, and students cannot get feedback on their planning process. Large language models (LLM) may be able to generate this feedback by detecting the overall code structure even for submissions with syntax errors. To this end, we propose an approach that detects which high-level goals and patterns (i.e. programming plans) exist in a student program with LLMs. We show that both the full GPT-4o model and a small variant (GPT-4o-mini) can detect these plans with remarkable accuracy, outperforming baselines inspired by conventional approaches to code analysis. We further show that the smaller, cost-effective variant (GPT-4o-mini) achieves results on par with state-of-the-art (GPT-4o) after fine-tuning, creating promising implications for smaller models for real-time grading. These smaller models can be incorporated into autograders for open-ended code-writing exercises to provide feedback for students' implicit planning skills, even when their program is syntactically incorrect. Furthermore, LLMs may be useful in providing feedback for problems in other domains where students start with a set of high-level solution steps and iteratively compute the output, such as math and physics problems.

[375] A Fully Automated Pipeline for Conversational Discourse Annotation: Tree Scheme Generation and Labeling with Large Language Models

Kseniia Petukhova,Ekaterina Kochmar

Main category: cs.CL

TLDR: 利用大型语言模型（LLMs）自动构建对话标注方案并完成标注，显著减少时间且性能优于人工设计。

Details

Motivation: 手动设计树状标注方案耗时且需专业知识，LLMs的自动化潜力可解决这一问题。 Method: 提出全自动流程，结合频率引导的决策树和先进LLM进行标注。 Result: 在SF和SWBD-DAMSL分类上，性能优于人工设计方案，甚至接近或超越人工标注。 Conclusion: 自动化标注方案高效且性能优越，释放代码和标注结果以推动未来研究。 Abstract: Recent advances in Large Language Models (LLMs) have shown promise in automating discourse annotation for conversations. While manually designing tree annotation schemes significantly improves annotation quality for humans and models, their creation remains time-consuming and requires expert knowledge. We propose a fully automated pipeline that uses LLMs to construct such schemes and perform annotation. We evaluate our approach on speech functions (SFs) and the Switchboard-DAMSL (SWBD-DAMSL) taxonomies. Our experiments compare various design choices, and we show that frequency-guided decision trees, paired with an advanced LLM for annotation, can outperform previously manually designed trees and even match or surpass human annotators while significantly reducing the time required for annotation. We release all code and resultant schemes and annotations to facilitate future research on discourse annotation.

[376] From Punchlines to Predictions: A Metric to Assess LLM Performance in Identifying Humor in Stand-Up Comedy

Adrianna Romanowski,Pedro H. V. Valois,Kazuhiro Fukui

Main category: cs.CL

TLDR: 研究评估了大型语言模型（LLMs）在识别单口喜剧幽默台词方面的能力，提出了一种新的幽默检测指标，结果显示模型表现优于人类，但仍有提升空间。

Details

Motivation: 随着LLMs的普及，幽默与AI的结合变得重要，研究旨在提升AI对幽默的理解能力，以改善人机交互的自然性。 Method: 使用单口喜剧剧本作为数据集，提出了一种包含模糊字符串匹配、句子嵌入和子空间相似性的模块化幽默检测指标，评估模型表现。 Result: 领先模型（如ChatGPT、Claude和DeepSeek）在幽默检测中最高得分51%，优于人类的41%，但显示出幽默主观性的复杂性。 Conclusion: 研究表明LLMs在幽默检测上有潜力，但需进一步优化以应对幽默的主观性和复杂性。 Abstract: Comedy serves as a profound reflection of the times we live in and is a staple element of human interactions. In light of the widespread adoption of Large Language Models (LLMs), the intersection of humor and AI has become no laughing matter. Advancements in the naturalness of human-computer interaction correlates with improvements in AI systems' abilities to understand humor. In this study, we assess the ability of models in accurately identifying humorous quotes from a stand-up comedy transcript. Stand-up comedy's unique comedic narratives make it an ideal dataset to improve the overall naturalness of comedic understanding. We propose a novel humor detection metric designed to evaluate LLMs amongst various prompts on their capability to extract humorous punchlines. The metric has a modular structure that offers three different scoring methods - fuzzy string matching, sentence embedding, and subspace similarity - to provide an overarching assessment of a model's performance. The model's results are compared against those of human evaluators on the same task. Our metric reveals that regardless of prompt engineering, leading models, ChatGPT, Claude, and DeepSeek, achieve scores of at most 51% in humor detection. Notably, this performance surpasses that of humans who achieve a score of 41%. The analysis of human evaluators and LLMs reveals variability in agreement, highlighting the subjectivity inherent in humor and the complexities involved in extracting humorous quotes from live performance transcripts. Code available at https://github.com/swaggirl9000/humor.

[377] Exploration of Plan-Guided Summarization for Narrative Texts: the Case of Small Language Models

Matt Grenander,Siddharth Varia,Paula Czarnowska,Yogarshi Vyas,Kishaloy Halder,Bonan Min

Main category: cs.CL

TLDR: 研究探讨了计划引导的摘要方法在小型语言模型（SLM）中减少幻觉的效果，发现其对长文档叙事任务无显著改进。

Details

Motivation: 叙事文本的长度和复杂性使其难以忠实摘要，计划引导方法是否有效尚不明确。 Method: 分析了现有针对细粒度细节的计划引导方法，并提出了更高层次的叙事计划方案。 Result: 计划引导方法在摘要质量或忠实度上未显著优于无计划基线，且计划本身同样可能包含幻觉。 Conclusion: 计划引导方法在长复杂叙事文本中效果有限，需谨慎使用。 Abstract: Plan-guided summarization attempts to reduce hallucinations in small language models (SLMs) by grounding generated summaries to the source text, typically by targeting fine-grained details such as dates or named entities. In this work, we investigate whether plan-based approaches in SLMs improve summarization in long document, narrative tasks. Narrative texts' length and complexity often mean they are difficult to summarize faithfully. We analyze existing plan-guided solutions targeting fine-grained details, and also propose our own higher-level, narrative-based plan formulation. Our results show that neither approach significantly improves on a baseline without planning in either summary quality or faithfulness. Human evaluation reveals that while plan-guided approaches are often well grounded to their plan, plans are equally likely to contain hallucinations compared to summaries. As a result, the plan-guided summaries are just as unfaithful as those from models without planning. Our work serves as a cautionary tale to plan-guided approaches to summarization, especially for long, complex domains such as narrative texts.

[378] A Multi-view Discourse Framework for Integrating Semantic and Syntactic Features in Dialog Agents

Akanksha Mehndiratta,Krishna Asawa

Main category: cs.CL

TLDR: 本文提出了一种基于多视角典型相关分析（MCCA）和典型相关分析（CCA）的检索式对话系统框架，通过整合语义和句法特征提升多轮对话的响应选择效果。

Details

Motivation: 现有方法常忽略对话中话语间的交互或均等处理所有话语，缺乏对上下文关系的深入理解。 Method: 采用MCCA编码话语和响应的上下文、位置和句法特征，再通过CCA学习话语间的关系。 Result: 在Ubuntu对话语料库上的实验表明，该模型在自动评估指标上有显著提升。 Conclusion: 该框架通过整合语义和句法特征，有效提升了多轮对话的响应选择能力。 Abstract: Multiturn dialogue models aim to generate human-like responses by leveraging conversational context, consisting of utterances from previous exchanges. Existing methods often neglect the interactions between these utterances or treat all of them as equally significant. This paper introduces a discourse-aware framework for response selection in retrieval-based dialogue systems. The proposed model first encodes each utterance and response with contextual, positional, and syntactic features using Multi-view Canonical Correlation Analysis (MCCA). It then learns discourse tokens that capture relationships between an utterance and its surrounding turns in a shared subspace via Canonical Correlation Analysis (CCA). This two-step approach effectively integrates semantic and syntactic features to build discourse-level understanding. Experiments on the Ubuntu Dialogue Corpus demonstrate that our model achieves significant improvements in automatic evaluation metrics, highlighting its effectiveness in response selection.

[379] Enhancing Dialogue Systems with Discourse-Level Understanding Using Deep Canonical Correlation Analysis

Akanksha Mehndiratta,Krishna Asawa

Main category: cs.CL

TLDR: 提出了一种基于深度典型相关分析（DCCA）的新框架，用于提升对话系统中长期对话历史的捕捉与利用能力。

Details

Motivation: 现有模型在捕捉和利用长期对话历史方面存在局限性，需要更上下文感知的系统。 Method: 通过DCCA学习话语标记，捕捉话语与其上下文的关系，从而理解长期依赖。 Result: 在Ubuntu对话语料库上的实验显示，响应选择性能显著提升，自动评估指标得分提高。 Conclusion: DCCA能有效过滤无关上下文，保留关键话语信息，提升对话系统的准确性。 Abstract: The evolution of conversational agents has been driven by the need for more contextually aware systems that can effectively manage dialogue over extended interactions. To address the limitations of existing models in capturing and utilizing long-term conversational history, we propose a novel framework that integrates Deep Canonical Correlation Analysis (DCCA) for discourse-level understanding. This framework learns discourse tokens to capture relationships between utterances and their surrounding context, enabling a better understanding of long-term dependencies. Experiments on the Ubuntu Dialogue Corpus demonstrate significant enhancement in response selection, based on the improved automatic evaluation metric scores. The results highlight the potential of DCCA in improving dialogue systems by allowing them to filter out irrelevant context and retain critical discourse information for more accurate response retrieval.

[380] Optimizing FDTD Solvers for Electromagnetics: A Compiler-Guided Approach with High-Level Tensor Abstractions

Yifei He,Måns I. Andersson,Stefano Markidis

Main category: cs.CL

TLDR: 本文提出了一种基于MLIR/LLVM基础设施的FDTD仿真领域专用编译器，解决了传统FDTD实现中代码可移植性和性能瓶颈的问题，实现了跨硬件平台的高效优化。

Details

Motivation: 传统FDTD方法依赖于手工编写的平台特定代码，导致开发成本高且性能受限，难以适应现代硬件架构。 Method: 采用MLIR/LLVM基础设施构建领域专用编译器，将三维FDTD核实现为3D张量操作，并自动应用高级优化（如循环分块、融合和向量化）。 Result: 在Intel、AMD和ARM平台上评估，相比基线Python实现（使用NumPy），实现了高达10倍的加速。 Conclusion: 该方法显著提升了FDTD仿真的性能和可移植性，为复杂电磁场模拟提供了高效解决方案。 Abstract: The Finite Difference Time Domain (FDTD) method is a widely used numerical technique for solving Maxwell's equations, particularly in computational electromagnetics and photonics. It enables accurate modeling of wave propagation in complex media and structures but comes with significant computational challenges. Traditional FDTD implementations rely on handwritten, platform-specific code that optimizes certain kernels while underperforming in others. The lack of portability increases development overhead and creates performance bottlenecks, limiting scalability across modern hardware architectures. To address these challenges, we introduce an end-to-end domain-specific compiler based on the MLIR/LLVM infrastructure for FDTD simulations. Our approach generates efficient and portable code optimized for diverse hardware platforms.We implement the three-dimensional FDTD kernel as operations on a 3D tensor abstraction with explicit computational semantics. High-level optimizations such as loop tiling, fusion, and vectorization are automatically applied by the compiler. We evaluate our customized code generation pipeline on Intel, AMD, and ARM platforms, achieving up to $10\times$ speedup over baseline Python implementation using NumPy.

[381] VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search

Yikun Wang,Siyin Wang,Qinyuan Cheng,Zhaoye Fei,Liang Ding,Qipeng Guo,Dacheng Tao,Xipeng Qiu

Main category: cs.CL

TLDR: VisuoThink是一个结合视觉与语言的新型框架，通过渐进式视觉-文本推理和前瞻树搜索提升复杂推理任务的表现。

Details

Motivation: 现有大型视觉语言模型在复杂推理任务中表现不佳，缺乏人类视觉-语言交互的细致过程。 Method: 提出VisuoThink框架，整合视觉空间与语言领域，支持多模态慢思考，并引入前瞻树搜索进行测试时扩展。 Result: 实验表明VisuoThink在几何和空间推理任务中表现优异，无需微调即可实现最先进性能。 Conclusion: VisuoThink通过视觉-语言交互的慢思考机制显著提升了复杂推理能力。 Abstract: Recent advancements in Large Vision-Language Models have showcased remarkable capabilities. However, they often falter when confronted with complex reasoning tasks that humans typically address through visual aids and deliberate, step-by-step thinking. While existing methods have explored text-based slow thinking or rudimentary visual assistance, they fall short of capturing the intricate, interleaved nature of human visual-verbal reasoning processes. To overcome these limitations and inspired by the mechanisms of slow thinking in human cognition, we introduce VisuoThink, a novel framework that seamlessly integrates visuospatial and linguistic domains. VisuoThink facilitates multimodal slow thinking by enabling progressive visual-textual reasoning and incorporates test-time scaling through look-ahead tree search. Extensive experiments demonstrate that VisuoThink significantly enhances reasoning capabilities via inference-time scaling, even without fine-tuning, achieving state-of-the-art performance in tasks involving geometry and spatial reasoning.

[382] Efficient and Asymptotically Unbiased Constrained Decoding for Large Language Models

Haotian Ye,Himanshu Jain,Chong You,Ananda Theertha Suresh,Haowei Lin,James Zou,Felix Yu

Main category: cs.CL

TLDR: 论文提出了一种名为DISC的新算法，结合GPU并行前缀验证（PPV），解决了现有约束解码方法的效率低下和偏差问题，实验证明其优于现有方法。

Details

Motivation: 在大型语言模型的实际应用中，输出常需符合特定约束（如预定义集合或安全标准），现有约束解码方法效率低且引入偏差。 Method: 提出动态重要性采样（DISC）和GPU并行前缀验证（PPV），确保理论上的渐进无偏性并提升效率。 Result: 实验表明DISC在效率和输出质量上均优于现有方法。 Conclusion: DISC方法在需要严格约束的应用中具有潜力，能显著提升生成质量。 Abstract: In real-world applications of large language models, outputs are often required to be confined: selecting items from predefined product or document sets, generating phrases that comply with safety standards, or conforming to specialized formatting styles. To control the generation, constrained decoding has been widely adopted. However, existing prefix-tree-based constrained decoding is inefficient under GPU-based model inference paradigms, and it introduces unintended biases into the output distribution. This paper introduces Dynamic Importance Sampling for Constrained Decoding (DISC) with GPU-based Parallel Prefix-Verification (PPV), a novel algorithm that leverages dynamic importance sampling to achieve theoretically guaranteed asymptotic unbiasedness and overcomes the inefficiency of prefix-tree. Extensive experiments demonstrate the superiority of our method over existing methods in both efficiency and output quality. These results highlight the potential of our methods to improve constrained generation in applications where adherence to specific constraints is essential.

[383] Can postgraduate translation students identify machine-generated text?

Michael Farrell

Main category: cs.CL

TLDR: 研究发现，经过简短培训的翻译研究生仍难以区分人类写作与AI生成文本，仅两人表现突出。

Details

Motivation: 探讨语言学训练者能否辨别AI生成文本与人类写作，以应对生成式AI在多语言内容创作中的广泛应用。 Method: 23名研究生接受培训后，对意大利散文片段进行评分，判断是否为AI生成（ChatGPT-4o）。 Result: 学生平均表现不佳，仅两人准确率高；低爆发性和自相矛盾更常见于AI文本。 Conclusion: 需改进培训方法，并研究AI文本是否需要进一步编辑以更自然。 Abstract: Given the growing use of generative artificial intelligence as a tool for creating multilingual content and bypassing both machine and traditional translation methods, this study explores the ability of linguistically trained individuals to discern machine-generated output from human-written text (HT). After brief training sessions on the textual anomalies typically found in synthetic text (ST), twenty-three postgraduate translation students analysed excerpts of Italian prose and assigned likelihood scores to indicate whether they believed they were human-written or AI-generated (ChatGPT-4o). The results show that, on average, the students struggled to distinguish between HT and ST, with only two participants achieving notable accuracy. Closer analysis revealed that the students often identified the same textual anomalies in both HT and ST, although features such as low burstiness and self-contradiction were more frequently associated with ST. These findings suggest the need for improvements in the preparatory training. Moreover, the study raises questions about the necessity of editing synthetic text to make it sound more human-like and recommends further research to determine whether AI-generated text is already sufficiently natural-sounding not to require further refinement.

[384] Langformers: Unified NLP Pipelines for Language Models

Rabindra Lamsal,Maria Rodriguez Read,Shanika Karunasekera

Main category: cs.CL

TLDR: Langformers是一个开源的Python库，旨在通过统一的工厂接口简化NLP流程，支持多种任务和平台。

Details

Motivation: Transformer语言模型在NLP中表现优异，但使用复杂，需要处理多个框架和重复代码，阻碍了非程序员和初学者的使用。 Method: Langformers提供任务特定的工厂接口，抽象训练、推理和部署的复杂性，内置内存和流式处理，设计轻量模块化。 Result: Langformers整合了多种NLP任务（如对话AI、文本分类等），支持Hugging Face和Ollama等平台。 Conclusion: Langformers通过简化NLP流程，提升了易用性和效率，适合开发者和初学者。 Abstract: Transformer-based language models have revolutionized the field of natural language processing (NLP). However, using these models often involves navigating multiple frameworks and tools, as well as writing repetitive boilerplate code. This complexity can discourage non-programmers and beginners, and even slow down prototyping for experienced developers. To address these challenges, we introduce Langformers, an open-source Python library designed to streamline NLP pipelines through a unified, factory-based interface for large language model (LLM) and masked language model (MLM) tasks. Langformers integrates conversational AI, MLM pretraining, text classification, sentence embedding/reranking, data labelling, semantic search, and knowledge distillation into a cohesive API, supporting popular platforms such as Hugging Face and Ollama. Key innovations include: (1) task-specific factories that abstract training, inference, and deployment complexities; (2) built-in memory and streaming for conversational agents; and (3) lightweight, modular design that prioritizes ease of use. Documentation: https://langformers.com

[385] Parameterized Synthetic Text Generation with SimpleStories

Lennart Finke,Thomas Dooms,Mat Allen,Juan Diego Rodriguez,Noa Nabeshima,Dan Braun

Main category: cs.CL

TLDR: SimpleStories是一个包含200万英语和日语故事的大规模合成数据集，通过多层级抽象特征控制故事特性，确保语法和语义多样性。

Details

Motivation: 解决TinyStories数据集的局限性，证明在合成文本生成中可同时实现简单性和多样性。 Method: 采用多层级抽象特征的提示参数化方法，系统控制故事特性。 Result: 生成了语法和语义多样的大规模简单语言故事数据集。 Conclusion: SimpleStories展示了在合成文本生成中同时实现简单性和多样性的可行性。 Abstract: We present SimpleStories, a large synthetic story dataset in simple language, consisting of 2 million stories each in English and Japanese. Our method employs parametrization of prompts with features at multiple levels of abstraction, allowing for systematic control over story characteristics to ensure broad syntactic and semantic diversity. Building on and addressing limitations in the TinyStories dataset, our approach demonstrates that simplicity and variety can be achieved simultaneously in synthetic text generation at scale.

[386] Feature-Aware Malicious Output Detection and Mitigation

Weilong Dong,Peiguang Li,Yu Tian,Xinyi Zeng,Fengdi Li,Sirui Wang

Main category: cs.CL

TLDR: 论文提出了一种特征感知的有害响应拒绝方法（FMM），通过检测模型特征空间中的恶意特征并自适应调整拒绝机制，有效防御LLMs的越狱攻击。

Details

Motivation: 大型语言模型（LLMs）虽经强化学习微调，但缺乏识别恶意内容的能力，导致防御越狱攻击的能力有限。 Method: FMM通过在解码阶段检测潜在恶意特征，并在检测到有毒特征时重新生成当前令牌，同时通过激活修补在后续令牌生成中引入拒绝向量。 Result: 实验证明该方法在多种语言模型和攻击技术下均有效，且不影响模型的正常生成能力。 Conclusion: FMM是一种高效且不影响模型性能的有害响应拒绝方法。 Abstract: The rapid advancement of large language models (LLMs) has brought significant benefits to various domains while introducing substantial risks. Despite being fine-tuned through reinforcement learning, LLMs lack the capability to discern malicious content, limiting their defense against jailbreak. To address these safety concerns, we propose a feature-aware method for harmful response rejection (FMM), which detects the presence of malicious features within the model's feature space and adaptively adjusts the model's rejection mechanism. By employing a simple discriminator, we detect potential malicious traits during the decoding phase. Upon detecting features indicative of toxic tokens, FMM regenerates the current token. By employing activation patching, an additional rejection vector is incorporated during the subsequent token generation, steering the model towards a refusal response. Experimental results demonstrate the effectiveness of our approach across multiple language models and diverse attack techniques, while crucially maintaining the models' standard generation capabilities.

[387] Enhancing Contrastive Demonstration Selection with Semantic Diversity for Robust In-Context Machine Translation

Owen Patterson,Chee Ng

Main category: cs.CL

TLDR: DiverseConE是一种新的演示选择方法，通过结合多样性和对比选择，提升上下文学习中的机器翻译性能。

Details

Motivation: 现有方法忽略了演示样本的多样性，影响了上下文学习的性能。 Method: 提出DiverseConE，在对比选择基础上加入基于嵌入空间差异的多样性增强步骤。 Result: 在多个语言对和评测指标上，DiverseConE优于基线方法。 Conclusion: 多样性对提升翻译质量至关重要，DiverseConE验证了其有效性。 Abstract: In-Context Learning (ICL) empowers large language models to perform tasks by conditioning on a few input-output examples. However, the performance of ICL is highly sensitive to the selection of these demonstrations. While existing methods focus on similarity or contrastive selection, they often overlook the importance of diversity among the chosen examples. In this paper, we propose DiverseConE (Diversity-Enhanced Contrastive Example Selection), a novel approach for demonstration selection in in-context learning for machine translation. Our method builds upon contrastive selection by incorporating a diversity enhancement step based on embedding space dissimilarity. We conduct extensive experiments on the Llama2-7b model across four language pairs (English-Chinese, Chinese-English, Russian-German, German-Russian) in 1-shot and 3-shot settings, using COMET20 and COMET22 for evaluation. Our results demonstrate that DiverseConE consistently outperforms strong baseline methods, including random selection, BM25, TopK, and a state-of-the-art contrastive selection method. Further analysis, including diversity metrics and human evaluation, validates the effectiveness of our approach and highlights the benefits of considering demonstration diversity for improved translation quality.

[388] Improving the Accuracy and Efficiency of Legal Document Tagging with Large Language Models and Instruction Prompts

Emily Johnson,Xavier Holt,Noah Wilson

Main category: cs.CL

TLDR: 论文提出Legal-LLM，利用大型语言模型（LLM）的指令微调能力，将多标签分类任务重构为结构化生成问题，显著提升了法律文档分类的性能。

Details

Motivation: 法律多标签分类面临法律语言复杂、标签依赖性强和标签不平衡等挑战，需要更高效的解决方案。 Method: 通过指令微调LLM，将其应用于法律文档的多标签分类任务，直接生成相关法律类别。 Result: 在POSTURE50K和EURLEX57K数据集上，Legal-LLM在micro-F1和macro-F1得分上优于传统方法和基于Transformer的基线模型。 Conclusion: Legal-LLM有效解决了标签不平衡问题，生成的法律标签准确且相关，验证了其优越性。 Abstract: Legal multi-label classification is a critical task for organizing and accessing the vast amount of legal documentation. Despite its importance, it faces challenges such as the complexity of legal language, intricate label dependencies, and significant label imbalance. In this paper, we propose Legal-LLM, a novel approach that leverages the instruction-following capabilities of Large Language Models (LLMs) through fine-tuning. We reframe the multi-label classification task as a structured generation problem, instructing the LLM to directly output the relevant legal categories for a given document. We evaluate our method on two benchmark datasets, POSTURE50K and EURLEX57K, using micro-F1 and macro-F1 scores. Our experimental results demonstrate that Legal-LLM outperforms a range of strong baseline models, including traditional methods and other Transformer-based approaches. Furthermore, ablation studies and human evaluations validate the effectiveness of our approach, particularly in handling label imbalance and generating relevant and accurate legal labels.

[389] QUDsim: Quantifying Discourse Similarities in LLM-Generated Text

Ramya Namuduri,Yating Wu,Anshun Asher Zheng,Manya Wadhwa,Greg Durrett,Junyi Jessy Li

Main category: cs.CL

TLDR: 论文提出了一种基于问题讨论（QUD）的相似性度量方法QUDsim，用于量化文本间的结构相似性，发现大语言模型（LLMs）在生成文本时存在结构重复性。

Details

Motivation: 大语言模型在生成文本时表现出重复性和结构单一性，现有相似性度量方法无法捕捉结构相似性，因此需要新的度量方法。 Method: 基于QUD理论和问题语义学，提出QUDsim度量方法，用于检测文本间的结构相似性。 Result: LLMs生成的文本在结构上重复性高，且与人类作者的结构类型存在差异。 Conclusion: QUDsim能有效量化文本结构相似性，揭示LLMs的结构重复性问题。 Abstract: As large language models become increasingly capable at various writing tasks, their weakness at generating unique and creative content becomes a major liability. Although LLMs have the ability to generate text covering diverse topics, there is an overall sense of repetitiveness across texts that we aim to formalize and quantify via a similarity metric. The familiarity between documents arises from the persistence of underlying discourse structures. However, existing similarity metrics dependent on lexical overlap and syntactic patterns largely capture $\textit{content}$ overlap, thus making them unsuitable for detecting $\textit{structural}$ similarities. We introduce an abstraction based on linguistic theories in Questions Under Discussion (QUD) and question semantics to help quantify differences in discourse progression. We then use this framework to build $\textbf{QUDsim}$, a similarity metric that can detect discursive parallels between documents. Using QUDsim, we find that LLMs often reuse discourse structures (more so than humans) across samples, even when content differs. Furthermore, LLMs are not only repetitive and structurally uniform, but are also divergent from human authors in the types of structures they use.

[390] Can you map it to English? The Role of Cross-Lingual Alignment in Multilingual Performance of LLMs

Kartik Ravisankar,Hyojung Han,Marine Carpuat

Main category: cs.CL

TLDR: 研究探讨了多语言大模型（LLMs）中跨语言表示对齐与任务性能的关系，发现语言级别的对齐与任务准确性高度相关，但样本级别的对齐无法区分预测正确与否。

Details

Motivation: 尽管LLMs在英语文本上预训练，却展现出多语言能力，但其跨语言泛化机制尚不明确。本文旨在探究跨语言表示对齐如何影响LLMs在自然语言理解任务和翻译任务中的表现。 Method: 引入跨语言对齐指标（如DALI），在语言和样本级别量化表示对齐，并在三个自然语言理解任务（Belebele、XStoryCloze、XCOPA）和机器翻译任务上进行实验。 Result: 跨语言对齐指标在语言级别与任务准确性高度相关，但样本级别的对齐无法区分预测正确与否，表明对齐是成功必要条件但非充分条件。 Conclusion: 跨语言对齐是LLMs在多语言任务中表现的关键因素，但仅靠对齐不足以保证任务成功，需进一步研究其他机制。 Abstract: Large language models (LLMs) pre-trained predominantly on English text exhibit surprising multilingual capabilities, yet the mechanisms driving cross-lingual generalization remain poorly understood. This work investigates how the alignment of representations for text written in different languages correlates with LLM performance on natural language understanding tasks and translation tasks, both at the language and the instance level. For this purpose, we introduce cross-lingual alignment metrics such as the Discriminative Alignment Index (DALI) to quantify the alignment at an instance level for discriminative tasks. Through experiments on three natural language understanding tasks (Belebele, XStoryCloze, XCOPA), and machine translation, we find that while cross-lingual alignment metrics strongly correlate with task accuracy at the language level, the sample-level alignment often fails to distinguish correct from incorrect predictions, exposing alignment as a necessary but insufficient condition for success.

[391] On Language Models' Sensitivity to Suspicious Coincidences

Sriram Padmanabhan,Kanishka Misra,Kyle Mahowald,Eunsol Choi

Main category: cs.CL

TLDR: 论文研究了语言模型（LMs）是否表现出人类对‘可疑巧合’的敏感性，发现零样本行为中不明显，但在提供假设空间后，LMs开始表现出类似行为。

Details

Motivation: 人类在归纳推理中对数据采样方式敏感，倾向于更具体的假设。论文旨在分析LMs是否表现出类似行为。 Method: 通过‘数字游戏’和‘城市扩展’两个实验，测试LMs在零样本和提示下的行为。 Result: 零样本行为中未发现明显‘可疑巧合’效应，但在提供假设空间后，LMs表现出类似人类的行为。 Conclusion: LMs的归纳推理行为可通过显式访问假设空间增强。 Abstract: Humans are sensitive to suspicious coincidences when generalizing inductively over data, as they make assumptions as to how the data was sampled. This results in smaller, more specific hypotheses being favored over more general ones. For instance, when provided the set {Austin, Dallas, Houston}, one is more likely to think that this is sampled from "Texas Cities" over "US Cities" even though both are compatible. Suspicious coincidence is strongly connected to pragmatic reasoning, and can serve as a testbed to analyze systems on their sensitivity towards the communicative goals of the task (i.e., figuring out the true category underlying the data). In this paper, we analyze whether suspicious coincidence effects are reflected in language models' (LMs) behavior. We do so in the context of two domains: 1) the number game, where humans made judgments of whether a number (e.g., 4) fits a list of given numbers (e.g., 16, 32, 2); and 2) by extending the number game setup to prominent cities. For both domains, the data is compatible with multiple hypotheses and we study which hypothesis is most consistent with the models' behavior. On analyzing five models, we do not find strong evidence for suspicious coincidences in LMs' zero-shot behavior. However, when provided access to the hypotheses space via chain-of-thought or explicit prompting, LMs start to show an effect resembling suspicious coincidences, sometimes even showing effects consistent with humans. Our study suggests that inductive reasoning behavior in LMs can be enhanced with explicit access to the hypothesis landscape.

[392] Beyond Memorization: Mapping the Originality-Quality Frontier of Language Models

Vishakh Padmakumar,Chen Yueh-Han,Jane Pan,Valerie Chen,He He

Main category: cs.CL

TLDR: 论文提出了一种新的LLM生成文本新颖性度量方法，平衡原创性和质量，并通过实验发现LLM生成文本的新颖性低于人类文本。

Details

Motivation: 评估LLM生成新颖输出的能力，现有方法仅关注原创性或质量，但两者需平衡。 Method: 提出新颖性度量方法（原创性和质量的调和平均数），并在三种创意任务上评估OLMo和Pythia模型。 Result: LLM生成文本的新颖性低于人类文本；推理时方法可提升新颖性，但会牺牲质量。 Conclusion: 提升模型规模或后训练是更有效的新颖性改进方法。 Abstract: As large language models (LLMs) are increasingly used for ideation and scientific discovery, it is important to evaluate their ability to generate novel output. Prior work evaluates novelty as the originality with respect to training data, but original outputs can be low quality. In contrast, non-expert judges may favor high-quality but memorized outputs, limiting the reliability of human preference as a metric. We propose a new novelty metric for LLM generations that balances originality and quality -- the harmonic mean of the fraction of \ngrams unseen during training and a task-specific quality score. We evaluate the novelty of generations from two families of open-data models (OLMo and Pythia) on three creative tasks: story completion, poetry writing, and creative tool use. We find that LLM generated text is less novel than human written text. To elicit more novel outputs, we experiment with various inference-time methods, which reveals a trade-off between originality and quality. While these methods can boost novelty, they do so by increasing originality at the expense of quality. In contrast, increasing model size or applying post-training reliably shifts the Pareto frontier, highlighting that starting with a stronger base model is a more effective way to improve novelty.

[393] Evaluation Under Imperfect Benchmarks and Ratings: A Case Study in Text Simplification

Joseph Liu,Yoonsoo Nam,Xinyue Cui,Swabha Swayamdipta

Main category: cs.CL

TLDR: 论文提出了一种新的文本简化评估方法，通过合成数据和LLM评委解决现有评估中的不一致性问题。

Details

Motivation: 现有文本简化评估存在数据质量低和人工评分不一致的问题，导致评估不可靠。 Method: 引入SynthSimpliEval合成基准，使用LLM评委自动评分，并改进现有可学习指标。 Result: 合成数据提高了评分一致性，LLM评委评分与预期趋势一致，改进的指标表现更好。 Conclusion: 高质量合成数据和LLM评委评分可提升文本简化评估的可靠性。 Abstract: Despite the successes of language models, their evaluation remains a daunting challenge for new and existing tasks. We consider the task of text simplification, commonly used to improve information accessibility, where evaluation faces two major challenges. First, the data in existing benchmarks might not reflect the capabilities of current language models on the task, often containing disfluent, incoherent, or simplistic examples. Second, existing human ratings associated with the benchmarks often contain a high degree of disagreement, resulting in inconsistent ratings; nevertheless, existing metrics still have to show higher correlations with these imperfect ratings. As a result, evaluation for the task is not reliable and does not reflect expected trends (e.g., more powerful models being assigned higher scores). We address these challenges for the task of text simplification through three contributions. First, we introduce SynthSimpliEval, a synthetic benchmark for text simplification featuring simplified sentences generated by models of varying sizes. Through a pilot study, we show that human ratings on our benchmark exhibit high inter-annotator agreement and reflect the expected trend: larger models produce higher-quality simplifications. Second, we show that auto-evaluation with a panel of LLM judges (LLMs-as-a-jury) often suffices to obtain consistent ratings for the evaluation of text simplification. Third, we demonstrate that existing learnable metrics for text simplification benefit from training on our LLMs-as-a-jury-rated synthetic data, closing the gap with pure LLMs-as-a-jury for evaluation. Overall, through our case study on text simplification, we show that a reliable evaluation requires higher quality test data, which could be obtained through synthetic data and LLMs-as-a-jury ratings.

[394] Composable NLP Workflows for BERT-based Ranking and QA System

Gaurav Kumar,Murali Mohana Krishna Dandu

Main category: cs.CL

TLDR: 本文介绍了一个基于Forte工具包的端到端排序和问答系统，利用BERT和RoBERTa等先进模型，在MS-MARCO和Covid-19数据集上评估性能，并与基准结果对比。

Details

Motivation: 解决多任务NLP模型中跨任务交互和文本粒度处理的繁琐问题。 Method: 使用Forte工具包构建可组合的NLP流水线，集成BERT和RoBERTa等深度学习模型。 Result: 在MS-MARCO和Covid-19数据集上评估，性能指标（如BLUE、MRR、F1）与基准结果对比。 Conclusion: 模块化流水线和低延迟重排序器便于构建复杂NLP应用。 Abstract: There has been a lot of progress towards building NLP models that scale to multiple tasks. However, real-world systems contain multiple components and it is tedious to handle cross-task interaction with varying levels of text granularity. In this work, we built an end-to-end Ranking and Question-Answering (QA) system using Forte, a toolkit that makes composable NLP pipelines. We utilized state-of-the-art deep learning models such as BERT, RoBERTa in our pipeline, evaluated the performance on MS-MARCO and Covid-19 datasets using metrics such as BLUE, MRR, F1 and compared the results of ranking and QA systems with their corresponding benchmark results. The modular nature of our pipeline and low latency of reranker makes it easy to build complex NLP applications easily.

[395] Question Tokens Deserve More Attention: Enhancing Large Language Models without Training through Step-by-Step Reading and Question Attention Recalibration

Feijiang Han,Licheng Guo,Hengtao Cui,Zhiyuan Lyu

Main category: cs.CL

TLDR: 论文研究了大型语言模型（LLM）在复杂问题理解中的局限性，提出了三种改进方法：重复问题标记、调整注意力机制和动态注意力重新校准。这些方法显著提升了模型性能。

Details

Motivation: 当前LLM在处理复杂问题和长距离依赖时表现不佳，需要改进其问题理解和推理能力。 Method: 提出了基于提示的策略（SSR、SSR+、SSR++）和训练无关的动态注意力重新校准机制。 Result: SSR++在多个基准测试中达到最优性能（如GSM8K 96.66%），动态注意力重新校准使LLaMA 3.1-8B在AQuA上提升5.17%。 Conclusion: 结构化提示设计和注意力优化是提升LLM理解能力的有效工具，适用于多种NLP任务。 Abstract: Large Language Models (LLMs) often struggle with tasks that require a deep understanding of complex questions, especially when faced with long-range dependencies or multi-step reasoning. This work investigates the limitations of current LLMs in question comprehension and identifies three insights: (1) repeating question tokens improves comprehension by increasing attention to question regions, (2) increased backward dependencies negatively affect performance due to unidirectional attentional constraints, and (3) recalibrating attentional mechanisms to prioritize question-relevant regions improves performance. Based on these findings, we first propose a family of prompt-based strategies - Step-by-Step Reading (SSR), SSR+, and SSR++ - that guide LLMs to incrementally process question tokens and align their reasoning with the input structure. These methods significantly improve performance, with SSR++ achieving state-of-the-art results on several benchmarks: 96.66% on GSM8K, 94.61% on ASDiv, and 76.28% on AQuA. Second, we introduce a training-free attention recalibration mechanism that dynamically adjusts attention distributions during inference to emphasize question-relevant regions. This method improves the accuracy of LLaMA 3.1-8B on AQuA by 5.17% without changing model parameters or input prompts. Taken together, our results highlight the importance of structured prompt design and attention optimization in improving LLM comprehension, providing lightweight yet effective tools for improving performance in various NLP tasks.

[396] UXAgent: A System for Simulating Usability Testing of Web Design with LLM Agents

Yuxuan Lu,Bingsheng Yao,Hansu Gu,Jing Huang,Jessie Wang,Yang Li,Jiri Gesi,Qi He,Toby Jia-Jun Li,Dakuo Wang

Main category: cs.CL

TLDR: 论文提出了一种基于LLM Agent的系统UXAgent，用于在真实用户研究前评估和迭代可用性测试设计。

Details

Motivation: 解决如何评估和迭代可用性测试研究设计本身的问题，利用LLM Agent技术提升用户体验研究的效率。 Method: 设计UXAgent系统，包含Persona Generator、LLM Agent和Universal Browser Connector模块，模拟用户测试网站，并提供分析工具。 Result: 通过启发式评估，UX研究人员认可系统创新性，但也对LLM Agent在UX研究中的未来应用表示担忧。 Conclusion: UXAgent为可用性测试设计提供了新工具，但LLM Agent在UX研究中的广泛应用仍需进一步探讨。 Abstract: Usability testing is a fundamental research method that user experience (UX) researchers use to evaluate and iterate a web design, but\textbf{ how to evaluate and iterate the usability testing study design } itself? Recent advances in Large Language Model-simulated Agent (\textbf{LLM Agent}) research inspired us to design \textbf{UXAgent} to support UX researchers in evaluating and reiterating their usability testing study design before they conduct the real human-subject study. Our system features a Persona Generator module, an LLM Agent module, and a Universal Browser Connector module to automatically generate thousands of simulated users to interactively test the target website. The system also provides an Agent Interview Interface and a Video Replay Interface so that the UX researchers can easily review and analyze the generated qualitative and quantitative log data. Through a heuristic evaluation, five UX researcher participants praised the innovation of our system but also expressed concerns about the future of LLM Agent usage in UX studies.

[397] SaRO: Enhancing LLM Safety through Reasoning-based Alignment

Yutao Mou,Yuxiao Luo,Shikun Zhang,Wei Ye

Main category: cs.CL

TLDR: 论文提出SaRO框架，通过两阶段优化解决LLMs安全对齐的欠泛化和过对齐问题。

Details

Motivation: 现有安全对齐技术存在欠泛化和过对齐问题，需更深入的语义理解。 Method: SaRO框架包括推理式预热（RW）和安全导向推理优化（SRPO）两阶段。 Result: 实验证明SaRO优于传统对齐方法。 Conclusion: SaRO通过语义推理优化有效提升LLMs的安全对齐能力。 Abstract: Current safety alignment techniques for large language models (LLMs) face two key challenges: (1) under-generalization, which leaves models vulnerable to novel jailbreak attacks, and (2) over-alignment, which leads to the excessive refusal of benign instructions. Our preliminary investigation reveals semantic overlap between jailbreak/harmful queries and normal prompts in embedding space, suggesting that more effective safety alignment requires a deeper semantic understanding. This motivates us to incorporate safety-policy-driven reasoning into the alignment process. To this end, we propose the Safety-oriented Reasoning Optimization Framework (SaRO), which consists of two stages: (1) Reasoning-style Warmup (RW) that enables LLMs to internalize long-chain reasoning through supervised fine-tuning, and (2) Safety-oriented Reasoning Process Optimization (SRPO) that promotes safety reflection via direct preference optimization (DPO). Extensive experiments demonstrate the superiority of SaRO over traditional alignment methods.

[398] ClinicalGPT-R1: Pushing reasoning capability of generalist disease diagnosis with large language model

Wuyang Lan,Wenzheng Wang,Changwei Ji,Guoxing Yang,Yongbo Zhang,Xiaohong Liu,Song Wu,Guangyu Wang

Main category: cs.CL

TLDR: ClinicalGPT-R1是一个专为疾病诊断设计的增强推理大型语言模型，在中文诊断任务中优于GPT-4o，英文任务中与GPT-4表现相当。

Details

Motivation: 探索大型语言模型在临床诊断中的应用潜力，填补该领域的研究空白。 Method: 基于20,000份真实临床记录训练，采用多样化训练策略增强推理能力，并使用MedBench-Hard数据集进行性能评估。 Result: 在中文诊断任务中优于GPT-4o，英文任务中与GPT-4表现相当。 Conclusion: ClinicalGPT-R1在疾病诊断任务中表现出色，验证了其优越性能。 Abstract: Recent advances in reasoning with large language models (LLMs)has shown remarkable reasoning capabilities in domains such as mathematics and coding, yet their application to clinical diagnosis remains underexplored. Here, we introduce ClinicalGPT-R1, a reasoning enhanced generalist large language model for disease diagnosis. Trained on a dataset of 20,000 real-world clinical records, ClinicalGPT-R1 leverages diverse training strategies to enhance diagnostic reasoning. To benchmark performance, we curated MedBench-Hard, a challenging dataset spanning seven major medical specialties and representative diseases. Experimental results demonstrate that ClinicalGPT-R1 outperforms GPT-4o in Chinese diagnostic tasks and achieves comparable performance to GPT-4 in English settings. This comparative study effectively validates the superior performance of ClinicalGPT-R1 in disease diagnosis tasks. Resources are available at https://github.com/medfound/medfound.

[399] HalluShift: Measuring Distribution Shifts towards Hallucination Detection in LLMs

Sharanya Dasgupta,Sujoy Nath,Arkaprabha Basu,Pourya Shamsolmoali,Swagatam Das

Main category: cs.CL

TLDR: 论文提出HalluShift方法，分析LLM生成内容中的内部状态分布变化，以解决幻觉问题，并在多个基准数据集上表现优异。

Details

Motivation: LLM在生成内容时容易出现幻觉（生成错误信息但保持结构连贯），作者认为这与LLM内部动态有关，类似人类认知中的不确定性。 Method: 提出HalluShift方法，通过分析LLM生成内容中的内部状态空间和词元概率分布变化来研究幻觉现象。 Result: HalluShift在多个基准数据集上表现优于现有基线方法。 Conclusion: HalluShift能有效分析LLM幻觉现象，为理解LLM内部动态提供了新视角。 Abstract: Large Language Models (LLMs) have recently garnered widespread attention due to their adeptness at generating innovative responses to the given prompts across a multitude of domains. However, LLMs often suffer from the inherent limitation of hallucinations and generate incorrect information while maintaining well-structured and coherent responses. In this work, we hypothesize that hallucinations stem from the internal dynamics of LLMs. Our observations indicate that, during passage generation, LLMs tend to deviate from factual accuracy in subtle parts of responses, eventually shifting toward misinformation. This phenomenon bears a resemblance to human cognition, where individuals may hallucinate while maintaining logical coherence, embedding uncertainty within minor segments of their speech. To investigate this further, we introduce an innovative approach, HalluShift, designed to analyze the distribution shifts in the internal state space and token probabilities of the LLM-generated responses. Our method attains superior performance compared to existing baselines across various benchmark datasets. Our codebase is available at https://github.com/sharanya-dasgupta001/hallushift.

[400] Kongzi: A Historical Large Language Model with Fact Enhancement

Jiashu Yang,Ningning Wang,Yian Zhao,Chaoran Feng,Junjia Du,Hao Pang,Zhirui Fang,Xuxin Cheng

Main category: cs.CL

TLDR: Kongzi是一种专为历史分析设计的大型语言模型，通过整合高质量历史数据和事实强化学习策略，显著提升了事实准确性和推理深度。

Details

Motivation: 当前大型语言模型在复杂推理任务中存在事实不准确的问题，尤其在历史研究中需要跨时间关联和模糊信息处理，限制了其潜力。 Method: 整合高质量历史数据，采用新颖的事实强化学习策略。 Result: 在历史问答和叙事生成任务中，Kongzi在事实准确性和推理深度上优于现有模型。 Conclusion: Kongzi为专业领域开发准确可靠的大型语言模型设定了新标准。 Abstract: The capabilities of the latest large language models (LLMs) have been extended from pure natural language understanding to complex reasoning tasks. However, current reasoning models often exhibit factual inaccuracies in longer reasoning chains, which poses challenges for historical reasoning and limits the potential of LLMs in complex, knowledge-intensive tasks. Historical studies require not only the accurate presentation of factual information but also the ability to establish cross-temporal correlations and derive coherent conclusions from fragmentary and often ambiguous sources. To address these challenges, we propose Kongzi, a large language model specifically designed for historical analysis. Through the integration of curated, high-quality historical data and a novel fact-reinforcement learning strategy, Kongzi demonstrates strong factual alignment and sophisticated reasoning depth. Extensive experiments on tasks such as historical question answering and narrative generation demonstrate that Kongzi outperforms existing models in both factual accuracy and reasoning depth. By effectively addressing the unique challenges inherent in historical texts, Kongzi sets a new standard for the development of accurate and reliable LLMs in professional domains.

[401] MADLLM: Multivariate Anomaly Detection via Pre-trained LLMs

Wei Tao,Xiaoyang Qu,Kai Lu,Jiguang Wan,Guokuan Li,Jianzong Wang

Main category: cs.CL

TLDR: MADLLM是一种基于预训练大语言模型（LLMs）的多变量异常检测方法，通过三重编码技术将多变量时间序列（MTS）与LLMs的文本模态对齐，显著提升了检测性能。

Details

Motivation: 现有方法将MTS数据简单转换为单变量时间序列，导致问题频出。本文旨在解决MTS模态与LLMs文本模态不匹配的问题。 Method: 提出三重编码技术，结合传统补丁嵌入与两种新嵌入方法（Skip Embedding和Feature Embedding），以对齐模态并提升模型性能。 Result: 实验表明，MADLLM在多个公共异常检测数据集上优于现有最优方法。 Conclusion: MADLLM通过创新的模态对齐技术，显著提升了LLMs在多变量异常检测任务中的表现。 Abstract: When applying pre-trained large language models (LLMs) to address anomaly detection tasks, the multivariate time series (MTS) modality of anomaly detection does not align with the text modality of LLMs. Existing methods simply transform the MTS data into multiple univariate time series sequences, which can cause many problems. This paper introduces MADLLM, a novel multivariate anomaly detection method via pre-trained LLMs. We design a new triple encoding technique to align the MTS modality with the text modality of LLMs. Specifically, this technique integrates the traditional patch embedding method with two novel embedding approaches: Skip Embedding, which alters the order of patch processing in traditional methods to help LLMs retain knowledge of previous features, and Feature Embedding, which leverages contrastive learning to allow the model to better understand the correlations between different features. Experimental results demonstrate that our method outperforms state-of-the-art methods in various public anomaly detection datasets.

[402] How new data permeates LLM knowledge and how to dilute it

Chen Sun,Renat Aksitov,Andrey Zhmoginov,Nolan Andrew Miller,Max Vladymyrov,Ulrich Rueckert,Been Kim,Mark Sandler

Main category: cs.CL

TLDR: 论文研究了大型语言模型（LLM）学习新信息时的“启动”效应，即新知识可能导致模型在不相关情境中错误应用该知识。作者通过“Outlandish”数据集系统研究此现象，并提出两种新方法来减少不良启动效应。

Details

Motivation: 理解LLM学习新信息时如何影响现有知识，以及如何减少由此产生的错误应用（如幻觉）。 Method: 使用“Outlandish”数据集（1320个文本样本）研究启动效应，并提出两种技术：1）“垫脚石”文本增强策略；2）“ignore-k”更新修剪方法。 Result: 启动效应可通过学习前关键词的token概率预测，且在不同模型架构、规模和训练阶段均稳健。提出的方法将不良启动效应减少50-95%。 Conclusion: 研究揭示了LLM学习机制，并提供了改进知识插入特异性的实用工具。 Abstract: Large language models learn and continually learn through the accumulation of gradient-based updates, but how individual pieces of new information affect existing knowledge, leading to both beneficial generalization and problematic hallucination, remains poorly understood. We demonstrate that when learning new information, LLMs exhibit a "priming" effect: learning a new fact can cause the model to inappropriately apply that knowledge in unrelated contexts. To systematically study this phenomenon, we introduce "Outlandish," a carefully curated dataset of 1320 diverse text samples designed to probe how new knowledge permeates through an LLM's existing knowledge base. Using this dataset, we show that the degree of priming after learning new information can be predicted by measuring the token probability of key words before learning. This relationship holds robustly across different model architectures (PALM-2, Gemma, Llama), sizes, and training stages. Finally, we develop two novel techniques to modulate how new knowledge affects existing model behavior: (1) a ``stepping-stone'' text augmentation strategy and (2) an ``ignore-k'' update pruning method. These approaches reduce undesirable priming effects by 50-95\% while preserving the model's ability to learn new information. Our findings provide both empirical insights into how LLMs learn and practical tools for improving the specificity of knowledge insertion in language models. Further materials: https://sunchipsster1.github.io/projects/outlandish/

[403] Syzygy of Thoughts: Improving LLM CoT with the Minimal Free Resolution

Chenghao Li,Chaoning Zhang,Yi Lu,Jiaquan Zhang,Qigan Sun,Xudong Wang,Jiwei Wei,Guoqing Wang,Yang Yang,Heng Tao Shen

Main category: cs.CL

TLDR: Syzygy of Thoughts (SoT) 是一种扩展 Chain-of-Thought (CoT) 的新框架，通过引入辅助推理路径和代数几何中的 Minimal Free Resolution (MFR) 方法，提升复杂任务的推理能力。

Details

Motivation: 复杂任务中单一推理链的局限性促使研究者提出 SoT，以捕捉更深层次的逻辑依赖关系。 Method: SoT 基于 MFR 方法，将问题分解为逻辑完整的子问题，引入模块、Betti 数等概念，优化推理过程。 Result: 在 GSM8K 和 MATH 等数据集上，SoT 的推理准确率与主流 CoT 相当或更高，同时提升了推理时间的可扩展性。 Conclusion: SoT 通过结构化分解和代数约束，实现了透明推理和高性能，为复杂问题提供了新的解决方案。 Abstract: Chain-of-Thought (CoT) prompting enhances the reasoning of large language models (LLMs) by decomposing problems into sequential steps, mimicking human logic and reducing errors. However, complex tasks with vast solution spaces and vague constraints often exceed the capacity of a single reasoning chain. Inspired by Minimal Free Resolution (MFR) in commutative algebra and algebraic geometry, we propose Syzygy of Thoughts (SoT)-a novel framework that extends CoT by introducing auxiliary, interrelated reasoning paths. SoT captures deeper logical dependencies, enabling more robust and structured problem-solving. MFR decomposes a module into a sequence of free modules with minimal rank, providing a structured analytical approach to complex systems. This method introduces the concepts of "Module", "Betti numbers","Freeness", "Mapping", "Exactness" and "Minimality", enabling the systematic decomposition of the original complex problem into logically complete minimal subproblems while preserving key problem features and reducing reasoning length. We tested SoT across diverse datasets (e.g., GSM8K, MATH) and models (e.g., GPT-4o-mini, Qwen2.5), achieving inference accuracy that matches or surpasses mainstream CoTs standards. Additionally, by aligning the sampling process with algebraic constraints, our approach enhances the scalability of inference time in LLMs, ensuring both transparent reasoning and high performance. Our code will be publicly available at https://github.com/dlMARiA/Syzygy-of-thoughts.

[404] LLMs Can Achieve High-quality Simultaneous Machine Translation as Efficiently as Offline

Biao Fu,Minpeng Liao,Kai Fan,Chengxi Li,Liang Zhang,Yidong Chen,Xiaodong Shi

Main category: cs.CL

TLDR: 论文提出了一种新范式，通过构建监督微调数据和新的训练推理策略，使大语言模型（LLMs）在流式机器翻译（SiMT）中高效且高质量地工作。

Details

Motivation: 在流式场景中，解码器仅LLMs的自回归特性限制了其效率和性能，需要一种方法使其在SiMT中表现与离线翻译相当。 Method: 通过重新排列源和目标标记为交错序列，并引入特殊标记以适应不同延迟需求，使LLMs能自适应学习读写操作。 Result: 实验表明，即使有限监督微调数据，该方法在多个SiMT基准测试中达到最优性能，并保留离线翻译能力。 Conclusion: 该方法不仅适用于句子级SiMT，还能泛化到文档级SiMT，无需额外微调。 Abstract: When the complete source sentence is provided, Large Language Models (LLMs) perform excellently in offline machine translation even with a simple prompt "Translate the following sentence from [src lang] into [tgt lang]:". However, in many real scenarios, the source tokens arrive in a streaming manner and simultaneous machine translation (SiMT) is required, then the efficiency and performance of decoder-only LLMs are significantly limited by their auto-regressive nature. To enable LLMs to achieve high-quality SiMT as efficiently as offline translation, we propose a novel paradigm that includes constructing supervised fine-tuning (SFT) data for SiMT, along with new training and inference strategies. To replicate the token input/output stream in SiMT, the source and target tokens are rearranged into an interleaved sequence, separated by special tokens according to varying latency requirements. This enables powerful LLMs to learn read and write operations adaptively, based on varying latency prompts, while still maintaining efficient auto-regressive decoding. Experimental results show that, even with limited SFT data, our approach achieves state-of-the-art performance across various SiMT benchmarks, and preserves the original abilities of offline translation. Moreover, our approach generalizes well to document-level SiMT setting without requiring specific fine-tuning, even beyond the offline translation model.

[405] Short-Path Prompting in LLMs: Analyzing Reasoning Instability and Solutions for Robust Performance

Zuoli Tang,Junjie Ou,Kaiqin Hu,Chunwei Wu,Zhaoxin Huan,Chilin Fu,Xiaolu Zhang,Jun Zhou,Chenliang Li

Main category: cs.CL

TLDR: 论文探讨了在用户提供简短提示时，大语言模型（LLMs）的推理性能变化，发现短路径提示会显著降低推理能力，并提出了两种解决方法。

Details

Motivation: 研究用户偏好简短提示与链式思维（CoT）推理之间的冲突，及其对LLMs推理性能的影响。 Method: 提出两种方法：指令引导方法和微调方法，以解决短路径提示导致的推理能力下降问题。 Result: 实验表明，两种方法均能有效提升推理准确性，平衡指令遵循与推理性能。 Conclusion: 短路径提示会削弱LLMs的推理能力，但通过指令引导或微调可以有效缓解这一问题。 Abstract: Recent years have witnessed significant progress in large language models' (LLMs) reasoning, which is largely due to the chain-of-thought (CoT) approaches, allowing models to generate intermediate reasoning steps before reaching the final answer. Building on these advances, state-of-the-art LLMs are instruction-tuned to provide long and detailed CoT pathways when responding to reasoning-related questions. However, human beings are naturally cognitive misers and will prompt language models to give rather short responses, thus raising a significant conflict with CoT reasoning. In this paper, we delve into how LLMs' reasoning performance changes when users provide short-path prompts. The results and analysis reveal that language models can reason effectively and robustly without explicit CoT prompts, while under short-path prompting, LLMs' reasoning ability drops significantly and becomes unstable, even on grade-school problems. To address this issue, we propose two approaches: an instruction-guided approach and a fine-tuning approach, both designed to effectively manage the conflict. Experimental results show that both methods achieve high accuracy, providing insights into the trade-off between instruction adherence and reasoning accuracy in current models.

[406] Metropolis-Hastings Captioning Game: Knowledge Fusion of Vision Language Models via Decentralized Bayesian Inference

Yuta Matsui,Ryosuke Yamaki,Ryo Ueda,Seitaro Shinagawa,Tadahiro Taniguchi

Main category: cs.CL

TLDR: MHCG是一种通过语言游戏实现多视觉语言模型知识融合的方法，避免了传统方法的推理成本和架构限制。

Details

Motivation: 解决现有多模型融合方法的高推理成本和架构限制问题。 Method: 通过类似语言游戏的分散贝叶斯推理，实现两个VLM代理之间的知识融合。 Result: 实验表明MHCG在无参考评估指标上表现一致提升，并促进了类别级词汇共享。 Conclusion: MHCG有效融合多模型知识，提升性能并促进词汇共享。 Abstract: We propose the Metropolis-Hastings Captioning Game (MHCG), a method to fuse knowledge of multiple vision-language models (VLMs) by learning from each other. Although existing methods that combine multiple models suffer from inference costs and architectural constraints, MHCG avoids these problems by performing decentralized Bayesian inference through a process resembling a language game. The knowledge fusion process establishes communication between two VLM agents alternately captioning images and learning from each other. We conduct two image-captioning experiments with two VLMs, each pre-trained on a different dataset. The first experiment demonstrates that MHCG achieves consistent improvement in reference-free evaluation metrics. The second experiment investigates how MHCG contributes to sharing VLMs' category-level vocabulary by observing the occurrence of the vocabulary in the generated captions.

[407] Leveraging Reasoning Model Answers to Enhance Non-Reasoning Model Capability

Haotian Wang,Han Zhao,Shuaiting Chen,Xiaoyu Tian,Sitong Zhao,Yunjie Ji,Yiping Peng,Xiangang Li

Main category: cs.CL

TLDR: 利用推理密集型模型的高质量输出来提升非推理模型的性能，通过监督微调实验证明其有效性。

Details

Motivation: 探索如何通过推理模型的输出来提升非推理模型的性能，以降低计算需求。 Method: 使用监督微调（SFT）方法，利用推理模型的答案训练非推理模型。 Result: 在多个基准测试中表现出一致的性能提升。 Conclusion: 该方法为直接提升模型问答能力提供了潜在的有效途径。 Abstract: Recent advancements in large language models (LLMs), such as DeepSeek-R1 and OpenAI-o1, have demonstrated the significant effectiveness of test-time scaling, achieving substantial performance gains across various benchmarks. These advanced models utilize deliberate "thinking" steps to systematically enhance answer quality. In this paper, we propose leveraging these high-quality outputs generated by reasoning-intensive models to improve less computationally demanding, non-reasoning models. We explore and compare methodologies for utilizing the answers produced by reasoning models to train and improve non-reasoning models. Through straightforward Supervised Fine-Tuning (SFT) experiments on established benchmarks, we demonstrate consistent improvements across various benchmarks, underscoring the potential of this approach for advancing the ability of models to answer questions directly.

[408] Iterative Self-Training for Code Generation via Reinforced Re-Ranking

Nikita Sorokin,Ivan Sedykh,Valentin Malykh

Main category: cs.CL

TLDR: 提出了一种基于PPO的迭代自训练方法，用于训练重排序模型，以提高代码生成质量。

Details

Motivation: 当前基于解码器的模型在生成代码时输出高度随机，即使小错误也可能破坏整个解决方案。通过重排序模型选择最佳样本可显著提升质量。 Method: 结合代码生成模型与重排序模型，采用PPO进行迭代自训练，优化重排序模型而非生成模型，并通过重新评估输出和纳入高评分负例提升性能。 Result: 在MultiPL-E数据集上，13.4B参数模型在代码生成质量上优于33B模型，速度快三倍，性能接近GPT-4并在某些语言中超越。 Conclusion: 迭代自训练方法有效提升了代码生成质量，重排序模型在优化过程中发挥了关键作用。 Abstract: Generating high-quality code that solves complex programming tasks is challenging, especially with current decoder-based models that produce highly stochastic outputs. In code generation, even minor errors can easily break the entire solution. Leveraging multiple sampled solutions can significantly improve the overall output quality. One effective way to enhance code generation is by pairing a code generation model with a reranker model, which selects the best solution from the generated samples. We propose a novel iterative self-training approach for self-training reranker models using Proximal Policy Optimization (PPO), aimed at improving both reranking accuracy and the overall code generation process. Unlike traditional PPO approaches, where the focus is on optimizing a generative model with a reward model, our approach emphasizes the development of a robust reward/reranking model. This model improves the quality of generated code through reranking and addresses problems and errors that the reward model might overlook during PPO alignment with the reranker. Our method iteratively refines the training dataset by re-evaluating outputs, identifying high-scoring negative examples, and incorporating them into the training loop, that boosting model performance. Our evaluation on the MultiPL-E dataset demonstrates that our 13.4B parameter model outperforms a 33B model in code generation quality while being three times faster. Moreover, it achieves performance comparable to GPT-4 and surpasses it in one programming language.

[409] Myanmar XNLI: Building a Dataset and Exploring Low-resource Approaches to Natural Language Inference with Myanmar

Aung Kyaw Htet,Mark Dras

Main category: cs.CL

TLDR: 论文扩展了XNLI任务至缅甸语，构建了myXNLI数据集，评估了多语言模型性能，并探索了数据增强方法。

Details

Motivation: 解决低资源语言（如缅甸语）在NLP任务中的挑战，扩展XNLI任务以提升模型跨语言能力。 Method: 通过社区众包和专家验证构建myXNLI数据集，评估多语言模型，并测试数据增强方法。 Result: 数据增强方法使缅甸语模型准确率提升2%，同时对其他语言也有提升。 Conclusion: myXNLI数据集为低资源语言研究提供了新资源，数据增强方法具有普适性。 Abstract: Despite dramatic recent progress in NLP, it is still a major challenge to apply Large Language Models (LLM) to low-resource languages. This is made visible in benchmarks such as Cross-Lingual Natural Language Inference (XNLI), a key task that demonstrates cross-lingual capabilities of NLP systems across a set of 15 languages. In this paper, we extend the XNLI task for one additional low-resource language, Myanmar, as a proxy challenge for broader low-resource languages, and make three core contributions. First, we build a dataset called Myanmar XNLI (myXNLI) using community crowd-sourced methods, as an extension to the existing XNLI corpus. This involves a two-stage process of community-based construction followed by expert verification; through an analysis, we demonstrate and quantify the value of the expert verification stage in the context of community-based construction for low-resource languages. We make the myXNLI dataset available to the community for future research. Second, we carry out evaluations of recent multilingual language models on the myXNLI benchmark, as well as explore data-augmentation methods to improve model performance. Our data-augmentation methods improve model accuracy by up to 2 percentage points for Myanmar, while uplifting other languages at the same time. Third, we investigate how well these data-augmentation methods generalise to other low-resource languages in the XNLI dataset.

[410] CLEAR-KGQA: Clarification-Enhanced Ambiguity Resolution for Knowledge Graph Question Answering

Liqiang Wen,Guanming Xiong,Tong Mo,Bing Li,Weiping Li,Wen Zhao

Main category: cs.CL

TLDR: 该研究提出了一种动态处理知识图谱问答中实体和意图模糊性的框架，通过交互式澄清和贝叶斯推理机制提升性能。

Details

Motivation: 现有知识图谱问答系统假设用户查询无歧义，但实际应用中模糊性普遍存在，需解决实体和意图模糊性问题。 Method: 采用贝叶斯推理机制量化查询模糊性，结合多轮对话框架指导LLMs请求用户澄清，并开发双代理交互框架优化逻辑形式。 Result: 在WebQSP和CWQ数据集上显著提升性能，并贡献了一个基于交互历史的消歧查询数据集。 Conclusion: 该框架有效解决了知识图谱问答中的模糊性问题，为未来研究提供了新方向。 Abstract: This study addresses the challenge of ambiguity in knowledge graph question answering (KGQA). While recent KGQA systems have made significant progress, particularly with the integration of large language models (LLMs), they typically assume user queries are unambiguous, which is an assumption that rarely holds in real-world applications. To address these limitations, we propose a novel framework that dynamically handles both entity ambiguity (e.g., distinguishing between entities with similar names) and intent ambiguity (e.g., clarifying different interpretations of user queries) through interactive clarification. Our approach employs a Bayesian inference mechanism to quantify query ambiguity and guide LLMs in determining when and how to request clarification from users within a multi-turn dialogue framework. We further develop a two-agent interaction framework where an LLM-based user simulator enables iterative refinement of logical forms through simulated user feedback. Experimental results on the WebQSP and CWQ dataset demonstrate that our method significantly improves performance by effectively resolving semantic ambiguities. Additionally, we contribute a refined dataset of disambiguated queries, derived from interaction histories, to facilitate future research in this direction.

[411] Domain-Adaptive Continued Pre-Training of Small Language Models

Salman Faroz

Main category: cs.CL

TLDR: 通过增量预训练小型语言模型（125M参数），在有限计算资源下实现教育领域的高效适应，性能显著提升（如MMLU +8.1%）。

Details

Motivation: 探索在计算资源有限的情况下，通过增量预训练小型语言模型实现领域适应的可行性。 Method: 采用125M参数模型，分阶段增量训练（400M和1B tokens），结合数据预处理和内存优化配置。 Result: 在知识密集型任务（MMLU +8.1%）和上下文理解（HellaSwag +7.6%）上表现显著提升，但存在领域专业化权衡。 Conclusion: 通过优化预处理和训练方法，小型语言模型在有限资源下也能实现显著改进，为领域适应提供新途径。 Abstract: Continued pre-training of small language models offers a promising path for domain adaptation with limited computational resources. I've investigated this approach within educational domains, evaluating it as a resource-efficient alternative to training models from scratch. Using a 125M parameter model, I demonstrate significant performance improvements through incremental training on 400 million tokens, followed by further training to reach 1 billion tokens. My approach includes comprehensive data preprocessing, memory-optimized training configurations, and benchmark-based evaluation. Results show notable gains in knowledge-intensive tasks (MMLU +8.1%) and contextual understanding (HellaSwag +7.6%), while revealing educational domain specialization trade-offs. I analyze token efficiency, catastrophic forgetting mitigation strategies, and scaling patterns. My findings suggest that thoughtful preprocessing and training methodologies enable meaningful improvements in language model capabilities even with constrained computational resources, opening pathways for domain-specific adaptation of smaller language models.

[412] GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models

Jixiao Zhang,Chunsheng Zuo

Main category: cs.CL

TLDR: GRPO-LEAD通过引入长度依赖奖励、错误答案惩罚和难度感知优势重加权策略，显著提升了数学推理任务的性能。

Details

Motivation: 当前GRPO实现存在奖励稀疏性、简洁性激励不足和复杂推理任务关注不够等问题。 Method: GRPO-LEAD引入了长度依赖奖励、错误答案惩罚和难度感知优势重加权策略，并研究了模型规模和SFT策略的影响。 Result: GRPO-LEAD显著缓解了现有问题，语言模型在数学任务中表现更简洁、准确和鲁棒。 Conclusion: GRPO-LEAD通过系统性改进，为数学推理任务提供了更优的解决方案。 Abstract: Recent advances in R1-like reasoning models leveraging Group Relative Policy Optimization (GRPO) have significantly improved the performance of language models on mathematical reasoning tasks. However, current GRPO implementations encounter critical challenges, including reward sparsity due to binary accuracy metrics, limited incentives for conciseness, and insufficient focus on complex reasoning tasks. To address these issues, we propose GRPO-LEAD, a suite of novel enhancements tailored for mathematical reasoning. Specifically, GRPO-LEAD introduces (1) a length-dependent accuracy reward to encourage concise and precise solutions, (2) an explicit penalty mechanism for incorrect answers to sharpen decision boundaries, and (3) a difficulty-aware advantage reweighting strategy that amplifies learning signals for challenging problems. Furthermore, we systematically examine the impact of model scale and supervised fine-tuning (SFT) strategies, demonstrating that larger-scale base models and carefully curated datasets significantly enhance reinforcement learning effectiveness. Extensive empirical evaluations and ablation studies confirm that GRPO-LEAD substantially mitigates previous shortcomings, resulting in language models that produce more concise, accurate, and robust reasoning across diverse mathematical tasks.

[413] Evaluating the Quality of Benchmark Datasets for Low-Resource Languages: A Case Study on Turkish

Ayşe Aysu Cengiz,Ahmet Kaan Sever,Elif Ecem Ümütlü,Naime Şeyma Erdem,Burak Aytan,Büşra Tufan,Abdullah Topraksoy,Esra Darıcı,Cagri Toraman

Main category: cs.CL

TLDR: 研究发现70%的土耳其基准数据集未达到质量标准，LLM评估者表现不如人类，需加强低资源语言数据集的质量控制。

Details

Motivation: 解决翻译或改编数据集在语言和文化适用性上的挑战，为土耳其语提供更可靠的基准。 Method: 评估17个土耳其基准数据集，采用六项标准，结合人类和LLM评估者进行详细分析。 Result: 70%数据集未达标，LLM在文化常识和文本理解上表现较弱，GPT-4o和Llama3.3-70B各有优势。 Conclusion: 需更严格的质量控制以提升低资源语言数据集的质量。 Abstract: The reliance on translated or adapted datasets from English or multilingual resources introduces challenges regarding linguistic and cultural suitability. This study addresses the need for robust and culturally appropriate benchmarks by evaluating the quality of 17 commonly used Turkish benchmark datasets. Using a comprehensive framework that assesses six criteria, both human and LLM-judge annotators provide detailed evaluations to identify dataset strengths and shortcomings. Our results reveal that 70% of the benchmark datasets fail to meet our heuristic quality standards. The correctness of the usage of technical terms is the strongest criterion, but 85% of the criteria are not satisfied in the examined datasets. Although LLM judges demonstrate potential, they are less effective than human annotators, particularly in understanding cultural common sense knowledge and interpreting fluent, unambiguous text. GPT-4o has stronger labeling capabilities for grammatical and technical tasks, while Llama3.3-70B excels at correctness and cultural knowledge evaluation. Our findings emphasize the urgent need for more rigorous quality control in creating and adapting datasets for low-resource languages.

[414] Improving Multilingual Capabilities with Cultural and Local Knowledge in Large Language Models While Enhancing Native Performance

Ram Mohan Rao Kadiyala,Siddartha Pullakhandam,Siddhant Gupta,Drishti Sharma,Jebish Purbey,Kanwal Mehreen,Muhammad Arham,Hamza Farooq

Main category: cs.CL

TLDR: 论文介绍了双语LLM Mantra-14B，通过指令微调在印地语和英语上表现优于更大模型，且无需资源密集型技术。

Details

Motivation: 解决LLM在低资源语言（如印地语）上的不足，提升多语言性能。 Method: 使用485K双语指令数据集微调多种LLM，优化训练数据比例。 Result: Mantra-14B在基准测试中平均提升3%，优于更大模型。 Conclusion: 通过文化本地化数据微调可显著提升多语言性能，且开源资源促进低资源语言研究。 Abstract: Large Language Models (LLMs) have shown remarkable capabilities, but their development has primarily focused on English and other high-resource languages, leaving many languages underserved. We present our latest Hindi-English bi-lingual LLM \textbf{Mantra-14B} with ~3\% average improvement in benchmark scores over both languages, outperforming models twice its size. Using a curated dataset composed of English and Hindi instruction data of 485K samples, we instruction tuned models such as Qwen-2.5-14B-Instruct and Phi-4 to improve performance over both English and Hindi. Our experiments encompassing seven different LLMs of varying parameter sizes and over 140 training attempts with varying English-Hindi training data ratios demonstrated that it is possible to significantly improve multilingual performance without compromising native performance. Further, our approach avoids resource-intensive techniques like vocabulary expansion or architectural modifications, thus keeping the model size small. Our results indicate that modest fine-tuning with culturally and locally informed data can bridge performance gaps without incurring significant computational overhead. We release our training code, datasets, and models under mit and apache licenses to aid further research towards under-represented and low-resource languages.

[415] Executable Functional Abstractions: Inferring Generative Programs for Advanced Math Problems

Zaid Khan,Elias Stengel-Eskin,Archiki Prasad,Jaemin Cho,Mohit Bansal

Main category: cs.CL

TLDR: 论文提出了一种自动生成高级数学问题可执行功能抽象（EFA）的方法EFAGen，通过程序合成任务和LLM生成候选EFA，并利用单元测试验证其有效性。

Details

Motivation: 现有EFA研究局限于小学数学问题，高级数学问题的EFA生成依赖人工工程，因此需要自动化方法。 Method: 将EFA生成任务形式化为程序合成任务，开发EFAGen，利用LLM基于种子问题及其逐步解生成候选EFA，并通过单元测试验证其有效性。 Result: EFAGen生成的EFA忠实于种子问题，能产生可学习的问题变体，并适用于多种竞赛级数学问题。 Conclusion: EFAGen能有效生成高级数学问题的EFA，并可用于问题难度调整和数据生成等下游任务。 Abstract: Scientists often infer abstract procedures from specific instances of problems and use the abstractions to generate new, related instances. For example, programs encoding the formal rules and properties of a system have been useful in fields ranging from RL (procedural environments) to physics (simulation engines). These programs can be seen as functions which execute to different outputs based on their parameterizations (e.g., gridworld configuration or initial physical conditions). We introduce the term EFA (Executable Functional Abstraction) to denote such programs for math problems. EFA-like constructs have been shown to be useful for math reasoning as problem generators for stress-testing models. However, prior work has been limited to abstractions for grade-school math (whose simple rules are easy to encode in programs), while generating EFAs for advanced math has thus far required human engineering. We explore the automatic construction of EFAs for advanced math problems. We operationalize the task of automatically constructing EFAs as a program synthesis task, and develop EFAGen, which conditions an LLM on a seed math problem and its step-by-step solution to generate candidate EFA programs that are faithful to the generalized problem and solution class underlying the seed problem. Furthermore, we formalize properties any valid EFA must possess in terms of executable unit tests, and show how the tests can be used as verifiable rewards to train LLMs to become better writers of EFAs. We demonstrate that EFAs constructed by EFAGen behave rationally by remaining faithful to seed problems, produce learnable problem variations, and that EFAGen can infer EFAs across multiple diverse sources of competition-level math problems. Finally, we show downstream uses of model-written EFAs e.g. finding problem variations that are harder or easier for a learner to solve, as well as data generation.

[416] Reasoning Court: Combining Reasoning, Action, and Judgment for Multi-Hop Reasoning

Jingtian Wu,Claire Cardie

Main category: cs.CL

TLDR: 论文提出了一种名为Reasoning Court（RC）的新框架，通过引入专门的LLM法官来验证中间推理步骤，解决了现有方法在多跳任务中的推理错误问题。

Details

Motivation: 大型语言模型（LLMs）在多跳任务中仍存在幻觉和推理错误，现有方法如ReAct缺乏对中间推理步骤的内部验证。 Method: RC扩展了ReAct等迭代推理-检索方法，通过独立的LLM法官评估候选答案及其推理，选择最优答案或合成新答案。 Result: 在HotpotQA、MuSiQue和FEVER等基准测试中，RC表现优于现有方法，无需任务特定微调。 Conclusion: RC通过引入法官机制显著提升了多跳任务的推理准确性和逻辑一致性。 Abstract: While large language models (LLMs) have demonstrated strong capabilities in tasks like question answering and fact verification, they continue to suffer from hallucinations and reasoning errors, especially in multi-hop tasks that require integration of multiple information sources. Current methods address these issues through retrieval-based techniques (grounding reasoning in external evidence), reasoning-based approaches (enhancing coherence via improved prompting), or hybrid strategies combining both elements. One prominent hybrid method, ReAct, has outperformed purely retrieval-based or reasoning-based approaches; however, it lacks internal verification of intermediate reasoning steps, allowing potential errors to propagate through complex reasoning tasks. In this paper, we introduce Reasoning Court (RC), a novel framework that extends iterative reasoning-and-retrieval methods, such as ReAct, with a dedicated LLM judge. Unlike ReAct, RC employs this judge to independently evaluate multiple candidate answers and their associated reasoning generated by separate LLM agents. The judge is asked to select the answer that it considers the most factually grounded and logically coherent based on the presented reasoning and evidence, or synthesizes a new answer using available evidence and its pre-trained knowledge if all candidates are inadequate, flawed, or invalid. Evaluations on multi-hop benchmarks (HotpotQA, MuSiQue) and fact-verification (FEVER) demonstrate that RC consistently outperforms state-of-the-art few-shot prompting methods without task-specific fine-tuning.

[417] VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents

Ryota Tanaka,Taichi Iki,Taku Hasegawa,Kyosuke Nishida,Kuniko Saito,Jun Suzuki

Main category: cs.CL

TLDR: VDocRAG是一种新的检索增强生成框架，能够直接处理多模态视觉丰富文档，避免传统文本解析导致的信息丢失，并通过自监督预训练任务提升性能。

Details

Motivation: 开发一种能够处理混合模态和多样格式文档的RAG框架，解决传统文本解析方法导致的信息缺失问题。 Method: 提出VDocRAG框架，将文档统一转换为图像格式，并设计自监督预训练任务，压缩视觉信息为密集标记表示，同时与文本内容对齐。 Result: VDocRAG在OpenDocVQA数据集上显著优于传统基于文本的RAG，并展现出强大的泛化能力。 Conclusion: VDocRAG展示了处理真实世界文档的有效RAG范式的潜力。 Abstract: We aim to develop a retrieval-augmented generation (RAG) framework that answers questions over a corpus of visually-rich documents presented in mixed modalities (e.g., charts, tables) and diverse formats (e.g., PDF, PPTX). In this paper, we introduce a new RAG framework, VDocRAG, which can directly understand varied documents and modalities in a unified image format to prevent missing information that occurs by parsing documents to obtain text. To improve the performance, we propose novel self-supervised pre-training tasks that adapt large vision-language models for retrieval by compressing visual information into dense token representations while aligning them with textual content in documents. Furthermore, we introduce OpenDocVQA, the first unified collection of open-domain document visual question answering datasets, encompassing diverse document types and formats. OpenDocVQA provides a comprehensive resource for training and evaluating retrieval and question answering models on visually-rich documents in an open-domain setting. Experiments show that VDocRAG substantially outperforms conventional text-based RAG and has strong generalization capability, highlighting the potential of an effective RAG paradigm for real-world documents.

[418] Training Small Reasoning LLMs with Cognitive Preference Alignment

Wenrui Cai,Chengyu Wang,Junbing Yan,Jun Huang,Xiangzhong Fang

Main category: cs.CL

TLDR: 论文提出了一种名为CRV的新框架，用于训练参数较少但推理能力强的语言模型，并通过CogPO算法优化小模型的推理能力。

Details

Motivation: 大型语言模型（LLM）的推理能力虽强，但资源需求高，因此需要探索如何用更少参数训练高效的推理模型。 Method: 提出CRV框架，包含多个LLM代理，分别负责批判、重新思考和验证链式思维（CoT），并引入CogPO算法优化小模型的推理能力。 Result: 在多个推理基准测试中，CRV和CogPO显著优于其他训练方法。 Conclusion: CRV框架和CogPO算法为训练高效的小型推理模型提供了有效解决方案。 Abstract: The reasoning capabilities of large language models (LLMs), such as OpenAI's o1 and DeepSeek-R1, have seen substantial advancements through deep thinking. However, these enhancements come with significant resource demands, underscoring the need to explore strategies to train effective reasoning LLMs with far fewer parameters. A critical challenge is that smaller models have different capacities and cognitive trajectories than their larger counterparts. Hence, direct distillation of chain-of-thought (CoT) results from large LLMs to smaller ones can be sometimes ineffective and requires a huge amount of annotated data. In this paper, we introduce a novel framework called Critique-Rethink-Verify (CRV), designed for training smaller yet powerful reasoning LLMs. Our CRV framework consists of multiple LLM agents, each specializing in unique abilities: (i) critiquing the CoTs according to the cognitive capabilities of smaller models, (ii) rethinking and refining these CoTs based on the critiques, and (iii) verifying the correctness of the refined results. We further propose the cognitive preference optimization (CogPO) algorithm to enhance the reasoning abilities of smaller models by aligning thoughts of these models with their cognitive capacities. Comprehensive evaluations on challenging reasoning benchmarks demonstrate the efficacy of CRV and CogPO, which outperforms other training methods by a large margin.

[419] Transferable text data distillation by trajectory matching

Rong Yao,Hailin Hu,Yifei Fu,Hanting Chen,Wenyi Fang,Fanyi Du,Kai Han,Yunhe Wang

Main category: cs.CL

TLDR: 提出了一种基于轨迹匹配和最近邻ID学习伪提示数据的方法，用于文本生成任务的数据蒸馏，优于现有数据选择方法LESS，并展示了跨架构的迁移能力。

Details

Motivation: 随着大型语言模型规模的增加，训练成本也随之上升，亟需减少训练数据量。数据蒸馏方法能合成少量数据样本，达到全数据集的训练效果，但文本数据的离散性阻碍了其在NLP中的应用。 Method: 通过学习伪提示数据，基于轨迹匹配和最近邻ID实现跨架构迁移，并在蒸馏过程中引入正则化损失以提高数据鲁棒性。 Result: 在ARC-Easy和MMLU指令调优数据集上的评估表明，该方法优于当前最优的数据选择方法LESS，并展示了在LLM结构（如OPT到Llama）上的良好迁移性。 Conclusion: 这是首个适用于文本生成任务（如指令调优）的数据蒸馏工作，为减少LLM训练数据量提供了有效解决方案。 Abstract: In the realm of large language model (LLM), as the size of large models increases, it also brings higher training costs. There is a urgent need to minimize the data size in LLM training. Compared with data selection method, the data distillation method aims to synthesize a small number of data samples to achieve the training effect of the full data set and has better flexibility. Despite its successes in computer vision, the discreteness of text data has hitherto stymied its exploration in natural language processing (NLP). In this work, we proposed a method that involves learning pseudo prompt data based on trajectory matching and finding its nearest neighbor ID to achieve cross-architecture transfer. During the distillation process, we introduce a regularization loss to improve the robustness of our distilled data. To our best knowledge, this is the first data distillation work suitable for text generation tasks such as instruction tuning. Evaluations on two benchmarks, including ARC-Easy and MMLU instruction tuning datasets, established the superiority of our distillation approach over the SOTA data selection method LESS. Furthermore, our method demonstrates a good transferability over LLM structures (i.e., OPT to Llama).

[420] Abacus-SQL: A Text-to-SQL System Empowering Cross-Domain and Open-Domain Database Retrieval

Keyan Xu,Dingzirui Wang,Xuanliang Zhang,Qingfu Zhu,Wanxiang Che

Main category: cs.CL

TLDR: Abacus-SQL通过数据库检索技术和数据增强方法解决了现有文本到SQL系统在开放域数据库检索和跨域迁移能力上的不足，并采用Pre-SQL和Self-debug方法提升SQL查询准确性。

Details

Motivation: 现有文本到SQL系统在开放域数据库检索和跨域迁移能力上存在不足，用户需手动筛选数据库且系统适应性差。 Method: Abacus-SQL结合数据库检索技术、数据增强方法、Pre-SQL和Self-debug技术。 Result: 实验表明Abacus-SQL在多轮文本到SQL任务中表现优异。 Conclusion: Abacus-SQL有效解决了现有系统的局限性，提升了SQL查询的准确性和适应性。 Abstract: The existing text-to-SQL systems have made significant progress in SQL query generation, but they still face numerous challenges. Existing systems often lack retrieval capabilities for open-domain databases, requiring users to manually filter relevant databases. Additionally, their cross-domain transferability is limited, making it challenging to accommodate diverse query requirements. To address these issues, we propose Abacus-SQL. Abacus-SQL utilizes database retrieval technology to accurately locate the required databases in an open-domain database environment. It also enhances the system cross-domain transfer ability through data augmentation methods. Moreover, Abacus-SQL employs Pre-SQL and Self-debug methods, thereby enhancing the accuracy of SQL queries. Experimental results demonstrate that Abacus-SQL performs excellently in multi-turn text-to-SQL tasks, effectively validating the approach's effectiveness. Abacus-SQL is publicly accessible at https://huozi.8wss.com/abacus-sql/.

[421] PASS-FC: Progressive and Adaptive Search Scheme for Fact Checking of Comprehensive Claims

Ziyu Zhuang

Main category: cs.CL

TLDR: PASS-FC是一个通过增强声明、自适应问题生成和迭代验证来改进自动事实核查的新框架，在多个数据集上表现优异。

Details

Motivation: 解决自动事实核查在处理复杂现实声明时的挑战。 Method: 采用声明增强、自适应问题生成、迭代验证、高级搜索技术和反思机制。 Result: 在六个数据集上表现优于基线模型，尤其在通用知识、科学、现实世界和多语言任务中。 Conclusion: PASS-FC显著提高了事实核查的准确性和跨领域适应性，代码和结果将开源以促进研究。 Abstract: Automated fact-checking faces challenges in handling complex real-world claims. We present PASS-FC, a novel framework that addresses these issues through claim augmentation, adaptive question generation, and iterative verification. PASS-FC enhances atomic claims with temporal and entity context, employs advanced search techniques, and utilizes a reflection mechanism. We evaluate PASS-FC on six diverse datasets, demonstrating superior performance across general knowledge, scientific, real-world, and multilingual fact-checking tasks. Our framework often surpasses stronger baseline models. Hyperparameter analysis reveals optimal settings for evidence quantity and reflection label triggers, while ablation studies highlight the importance of claim augmentation and language-specific adaptations. PASS-FC's performance underscores its effectiveness in improving fact-checking accuracy and adaptability across various domains. We will open-source our code and experimental results to facilitate further research in this area.

[422] Investigating Syntactic Biases in Multilingual Transformers with RC Attachment Ambiguities in Italian and English

Michael Kamerath,Aniello De Santo

Main category: cs.CL

TLDR: 研究探讨单语和多语LLMs在意大利语和英语中处理关系从句歧义时是否表现出类似人类的偏好，并测试词汇因素是否影响这些偏好。结果显示LLMs行为多样，但普遍未能准确捕捉人类偏好。

Details

Motivation: 探究LLMs在处理语言歧义时是否与人类行为一致，以及词汇因素对偏好的影响。 Method: 利用过去句子处理研究，分析LLMs在意大利语和英语中关系从句歧义的表现，并测试词汇因素的调节作用。 Result: LLMs行为在不同模型中表现多样，但普遍未能准确模拟人类偏好。 Conclusion: 关系从句歧义是研究LLMs语言知识和偏见的理想基准。 Abstract: This paper leverages past sentence processing studies to investigate whether monolingual and multilingual LLMs show human-like preferences when presented with examples of relative clause attachment ambiguities in Italian and English. Furthermore, we test whether these preferences can be modulated by lexical factors (the type of verb/noun in the matrix clause) which have been shown to be tied to subtle constraints on syntactic and semantic relations. Our results overall showcase how LLM behavior varies interestingly across models, but also general failings of these models in correctly capturing human-like preferences. In light of these results, we argue that RC attachment is the ideal benchmark for cross-linguistic investigations of LLMs' linguistic knowledge and biases.

[423] Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data

Shuai Zhao,Linchao Zhu,Yi Yang

Main category: cs.CL

TLDR: RefAlign利用生成文本与高质量参考答案的相似性作为奖励函数，替代传统资源密集型的偏好对齐方法，提高了效率。

Details

Motivation: 传统对齐方法（如偏好数据收集和奖励建模）资源密集，RefAlign旨在通过相似性奖励简化这一过程。 Method: 提出RefAlign算法，利用BERTScore衡量生成文本与参考答案的相似性作为奖励，无需训练奖励模型。 Result: RefAlign在多种对齐场景中表现与传统方法相当，且效率更高。 Conclusion: RefAlign为LLM对齐提供了一种高效且通用的替代方案。 Abstract: Large language models~(LLMs) are expected to be helpful, harmless, and honest. In various alignment scenarios, such as general human preference, safety, and confidence alignment, binary preference data collection and reward modeling are resource-intensive but necessary for human preference transferring. In this work, we explore using the similarity between sampled generations and high-quality reference answers as an alternative reward function for LLM alignment. Using similarity as a reward circumvents training reward models, and collecting a single reference answer potentially costs less time than constructing binary preference pairs when multiple candidates are available. Specifically, we develop \textit{RefAlign}, a versatile REINFORCE-style alignment algorithm, which is free of reference and reward models. Instead, RefAlign utilizes BERTScore between sampled generations and high-quality reference answers as the surrogate reward. Beyond general human preference optimization, RefAlign can be readily extended to diverse scenarios, such as safety and confidence alignment, by incorporating the similarity reward with task-related objectives. In various scenarios, {RefAlign} demonstrates comparable performance to previous alignment methods while offering high efficiency.

Aish Albladi,Md Kaosar Uddin,Minarul Islam,Cheryl Seals

Main category: cs.CL

TLDR: 该研究提出了一种结合多种Transformer模型的混合框架（BERT、GPT-2、RoBERTa、XLNet、DistilBERT），以提高情感分类的准确性和鲁棒性，并在Sentiment140和IMDB数据集上取得了94%和95%的准确率。

Details

Motivation: 情感分析在NLP中至关重要，但面临噪声数据、上下文歧义和泛化能力等挑战，需要一种更强大的解决方案。 Method: 采用混合框架，结合多种Transformer模型的优势，辅以文本清理、TF-IDF和BoW特征提取。 Result: 在Sentiment140和IMDB数据集上分别达到94%和95%的准确率，优于单一模型。 Conclusion: 混合框架有效解决了单一模型的局限性，适用于社交媒体监控、客户情感分析等实际任务，为未来NLP混合框架的发展提供了方向。 Abstract: Sentiment analysis is a crucial task in natural language processing (NLP) that enables the extraction of meaningful insights from textual data, particularly from dynamic platforms like Twitter and IMDB. This study explores a hybrid framework combining transformer-based models, specifically BERT, GPT-2, RoBERTa, XLNet, and DistilBERT, to improve sentiment classification accuracy and robustness. The framework addresses challenges such as noisy data, contextual ambiguity, and generalization across diverse datasets by leveraging the unique strengths of these models. BERT captures bidirectional context, GPT-2 enhances generative capabilities, RoBERTa optimizes contextual understanding with larger corpora and dynamic masking, XLNet models dependency through permutation-based learning, and DistilBERT offers efficiency with reduced computational overhead while maintaining high accuracy. We demonstrate text cleaning, tokenization, and feature extraction using Term Frequency Inverse Document Frequency (TF-IDF) and Bag of Words (BoW), ensure high-quality input data for the models. The hybrid approach was evaluated on benchmark datasets Sentiment140 and IMDB, achieving superior accuracy rates of 94\% and 95\%, respectively, outperforming standalone models. The results validate the effectiveness of combining multiple transformer models in ensemble-like setups to address the limitations of individual architectures. This research highlights its applicability to real-world tasks such as social media monitoring, customer sentiment analysis, and public opinion tracking which offers a pathway for future advancements in hybrid NLP frameworks.

[425] Refining Financial Consumer Complaints through Multi-Scale Model Interaction

Bo-Wei Chen,An-Zi Yen,Chung-Chi Chen

Main category: cs.CL

TLDR: 本文提出了一种名为MSMI的多尺度模型交互方法，用于将非正式文本转化为法律文本，并在金融争议数据集FinDR上验证了其有效性。

Details

Motivation: 法律文本需要清晰、正式和专业性，但非法律专业人士撰写的文件往往缺乏这些特质。本文旨在通过技术手段填补这一差距。 Method: 提出Multi-Scale Model Interaction (MSMI)方法，利用轻量级分类器评估输出，并通过大型语言模型(LLMs)进行迭代优化。 Result: 实验表明，MSMI显著优于单次提示策略，并在多个短文本基准测试中表现出更强的对抗鲁棒性。 Conclusion: 多模型协作在提升法律文本生成及其他文本优化任务中具有潜力。 Abstract: Legal writing demands clarity, formality, and domain-specific precision-qualities often lacking in documents authored by individuals without legal training. To bridge this gap, this paper explores the task of legal text refinement that transforms informal, conversational inputs into persuasive legal arguments. We introduce FinDR, a Chinese dataset of financial dispute records, annotated with official judgments on claim reasonableness. Our proposed method, Multi-Scale Model Interaction (MSMI), leverages a lightweight classifier to evaluate outputs and guide iterative refinement by Large Language Models (LLMs). Experimental results demonstrate that MSMI significantly outperforms single-pass prompting strategies. Additionally, we validate the generalizability of MSMI on several short-text benchmarks, showing improved adversarial robustness. Our findings reveal the potential of multi-model collaboration for enhancing legal document generation and broader text refinement tasks.

[426] Quantum Natural Language Processing: A Comprehensive Review of Models, Methods, and Applications

Farha Nausheen,Khandakar Ahmed,M Imad Khan

Main category: cs.CL

TLDR: 论文探讨了量子自然语言处理（QNLP）的潜力，通过量子计算提升NLP任务的效率和准确性，并综述了当前QNLP模型分类、方法及应用现状。

Details

Motivation: 深度学习在NLP中虽表现优异，但需要大量数据和资源，而量子计算可能突破这一限制，实现量子优势。 Method: 分类QNLP模型，综述量子编码技术、QNLP模型及量子优化方法，并统计其应用情况。 Result: QNLP目前仅适用于小数据集，模型探索有限，但量子计算在NLP中的应用兴趣日益增长。 Conclusion: QNLP是一个新兴领域，虽面临挑战，但展现出在NLP中实现量子优势的潜力。 Abstract: In recent developments, deep learning methodologies applied to Natural Language Processing (NLP) have revealed a paradox: They improve performance but demand considerable data and resources for their training. Alternatively, quantum computing exploits the principles of quantum mechanics to overcome the computational limitations of current methodologies, thereby establishing an emerging field known as quantum natural language processing (QNLP). This domain holds the potential to attain a quantum advantage in the processing of linguistic structures, surpassing classical models in both efficiency and accuracy. In this paper, it is proposed to categorise QNLP models based on quantum computing principles, architecture, and computational approaches. This paper attempts to provide a survey on how quantum meets language by mapping state-of-the-art in this area, embracing quantum encoding techniques for classical data, QNLP models for prevalent NLP tasks, and quantum optimisation techniques for hyper parameter tuning. The landscape of quantum computing approaches applied to various NLP tasks is summarised by showcasing the specific QNLP methods used, and the popularity of these methods is indicated by their count. From the findings, it is observed that QNLP approaches are still limited to small data sets, with only a few models explored extensively, and there is increasing interest in the application of quantum computing to natural language processing tasks.

[427] Learning to Erase Private Knowledge from Multi-Documents for Retrieval-Augmented Large Language Models

Yujing Wang,Hainan Zhang,Liang Pang,Yongxin Tong,Binghui Guo,Hongwei Zheng,Zhiming Zheng

Main category: cs.CL

TLDR: Eraser4RAG是一种隐私擦除工具，用于在检索增强生成（RAG）中移除用户定义的隐私知识，同时保留公共知识。

Details

Motivation: RAG技术在处理专有领域时可能泄露敏感信息，传统文本匿名化方法无法满足其多文档推理和场景定制需求。 Method: 构建全局知识图谱识别潜在知识，随机分为隐私和公共子图，用Flan-T5重写文档排除隐私三元组，PPO算法优化模型。 Result: 在四个QA数据集上，Eraser4RAG的隐私擦除性能优于GPT-4o。 Conclusion: Eraser4RAG有效解决了RAG中的隐私擦除问题，兼顾了隐私保护和生成任务的需求。 Abstract: Retrieval-Augmented Generation (RAG) is a promising technique for applying LLMs to proprietary domains. However, retrieved documents may contain sensitive knowledge, posing risks of privacy leakage in generative results. Thus, effectively erasing private information from retrieved documents is a key challenge for RAG. Unlike traditional text anonymization, RAG should consider: (1) the inherent multi-document reasoning may face de-anonymization attacks; (2) private knowledge varies by scenarios, so users should be allowed to customize which information to erase; (3) preserving sufficient publicly available knowledge for generation tasks. This paper introduces the privacy erasure task for RAG and proposes Eraser4RAG, a private knowledge eraser which effectively removes user-defined private knowledge from documents while preserving sufficient public knowledge for generation. Specifically, we first construct a global knowledge graph to identify potential knowledge across documents, aiming to defend against de-anonymization attacks. Then we randomly split it into private and public sub-graphs, and fine-tune Flan-T5 to rewrite the retrieved documents excluding private triples. Finally, PPO algorithm optimizes the rewriting model to minimize private triples and maximize public triples retention. Experiments on four QA datasets demonstrate that Eraser4RAG achieves superior erase performance than GPT-4o.

[428] Guiding Reasoning in Small Language Models with LLM Assistance

Yujin Kim,Euiin Yi,Minu Kim,Se-Young Yun,Taehyeon Kim

Main category: cs.CL

TLDR: SMART框架通过选择性引入LLM的指导，提升小语言模型（SLM）在复杂推理任务中的表现。

Details

Motivation: 小语言模型在需要多步逻辑推理的任务中表现有限，需要外部支持。 Method: SMART通过评分机制识别不确定的推理步骤，仅在必要时注入LLM生成的修正推理，优化推理路径。 Result: 实验表明，SMART显著提升了SLM在数学推理任务中的性能。 Conclusion: SMART为SLM和LLM的协作使用提供了可行方案，解决了SLM单独无法完成的复杂推理任务。 Abstract: The limited reasoning capabilities of small language models (SLMs) cast doubt on their suitability for tasks demanding deep, multi-step logical deduction. This paper introduces a framework called Small Reasons, Large Hints (SMART), which selectively augments SLM reasoning with targeted guidance from large language models (LLMs). Inspired by the concept of cognitive scaffolding, SMART employs a score-based evaluation to identify uncertain reasoning steps and injects corrective LLM-generated reasoning only when necessary. By framing structured reasoning as an optimal policy search, our approach steers the reasoning trajectory toward correct solutions without exhaustive sampling. Our experiments on mathematical reasoning datasets demonstrate that targeted external scaffolding significantly improves performance, paving the way for collaborative use of both SLM and LLM to tackle complex reasoning tasks that are currently unsolvable by SLMs alone.

[429] C-MTCSD: A Chinese Multi-Turn Conversational Stance Detection Dataset

Fuqiang Niu,Yi Yang,Xianghua Fu,Genan Dai,Bowen Zhang

Main category: cs.CL

TLDR: 论文介绍了C-MTCSD，一个大规模中文多轮对话立场检测数据集，揭示了现有模型在零样本和隐式立场检测上的挑战。

Details

Motivation: 解决中文语言处理和多轮对话分析中的立场检测难题。 Method: 构建C-MTCSD数据集，并通过传统方法和大型语言模型进行评估。 Result: 最佳模型在零样本设置下仅达到64.07% F1分数，隐式立场检测表现更差。 Conclusion: C-MTCSD为中文立场检测研究设立了新基准，未来改进空间大。 Abstract: Stance detection has become an essential tool for analyzing public discussions on social media. Current methods face significant challenges, particularly in Chinese language processing and multi-turn conversational analysis. To address these limitations, we introduce C-MTCSD, the largest Chinese multi-turn conversational stance detection dataset, comprising 24,264 carefully annotated instances from Sina Weibo, which is 4.2 times larger than the only prior Chinese conversational stance detection dataset. Our comprehensive evaluation using both traditional approaches and large language models reveals the complexity of C-MTCSD: even state-of-the-art models achieve only 64.07% F1 score in the challenging zero-shot setting, while performance consistently degrades with increasing conversation depth. Traditional models particularly struggle with implicit stance detection, achieving below 50% F1 score. This work establishes a challenging new benchmark for Chinese stance detection research, highlighting significant opportunities for future improvements.

[430] Turn-taking annotation for quantitative and qualitative analyses of conversation

Anneliese Kelterer,Barbara Schuppler

Main category: cs.CL

TLDR: 论文介绍了为GRASS语料库开发的对话转接标注系统，包括标注层（IPU和PCOMP）、标注过程和一致性分析，旨在促进跨学科研究。

Details

Motivation: 开发一个基于对话分析的标注系统，适用于语音学分析和自动分类，促进语言学与技术应用的交叉研究。 Method: 使用Praat进行时间对齐标注，设计了IPU和PCOMP两层标注，详细描述了标注过程和标准。 Result: IPU标注一致性接近完美，PCOMP标注一致性较高，分歧多源于序列分析差异。 Conclusion: 该标注系统适用于多种对话数据，有望推动语言学与技术应用的跨学科合作。 Abstract: This paper has two goals. First, we present the turn-taking annotation layers created for 95 minutes of conversational speech of the Graz Corpus of Read and Spontaneous Speech (GRASS), available to the scientific community. Second, we describe the annotation system and the annotation process in more detail, so other researchers may use it for their own conversational data. The annotation system was developed with an interdisciplinary application in mind. It should be based on sequential criteria according to Conversation Analysis, suitable for subsequent phonetic analysis, thus time-aligned annotations were made Praat, and it should be suitable for automatic classification, which required the continuous annotation of speech and a label inventory that is not too large and results in a high inter-rater agreement. Turn-taking was annotated on two layers, Inter-Pausal Units (IPU) and points of potential completion (PCOMP; similar to transition relevance places). We provide a detailed description of the annotation process and of segmentation and labelling criteria. A detailed analysis of inter-rater agreement and common confusions shows that agreement for IPU annotation is near-perfect, that agreement for PCOMP annotations is substantial, and that disagreements often are either partial or can be explained by a different analysis of a sequence which also has merit. The annotation system can be applied to a variety of conversational data for linguistic studies and technological applications, and we hope that the annotations, as well as the annotation system will contribute to a stronger cross-fertilization between these disciplines.

[431] The Mirage of Performance Gains: Why Contrastive Decoding Fails to Address Multimodal Hallucination

Hao Yin,Gunagzong Si,Zilei Wang

Main category: cs.CL

TLDR: 对比解码策略在减少多模态大语言模型（MLLMs）中的幻觉方面效果有限，其性能提升主要源于误导性因素而非实际改进。

Details

Motivation: 揭示对比解码策略在减少幻觉方面的局限性，并指出其性能提升的误导性原因。 Method: 通过引入虚假改进方法并与对比解码技术对比，分析其实际效果。 Result: 实验表明对比解码的性能提升与其减少幻觉的目标无关。 Conclusion: 研究挑战了对比解码策略的有效性假设，为开发真正有效的解决方案铺平道路。 Abstract: Contrastive decoding strategies are widely used to reduce hallucinations in multimodal large language models (MLLMs). These methods work by constructing contrastive samples to induce hallucinations and then suppressing them in the output distribution. However, this paper demonstrates that such approaches fail to effectively mitigate the hallucination problem. The performance improvements observed on POPE Benchmark are largely driven by two misleading factors: (1) crude, unidirectional adjustments to the model's output distribution and (2) the adaptive plausibility constraint, which reduces the sampling strategy to greedy search. To further illustrate these issues, we introduce a series of spurious improvement methods and evaluate their performance against contrastive decoding techniques. Experimental results reveal that the observed performance gains in contrastive decoding are entirely unrelated to its intended goal of mitigating hallucinations. Our findings challenge common assumptions about the effectiveness of contrastive decoding strategies and pave the way for developing genuinely effective solutions to hallucinations in MLLMs.

Zhengxuan Zhang,Zhuowen Liang,Yin Wu,Teng Lin,Yuyu Luo,Nan Tang

Main category: cs.CL

TLDR: DataMosaic框架通过动态提取任务特定结构，增强LLM分析的可解释性和可验证性。

Details

Motivation: 当前LLM在数据分析中存在不透明和不可验证的问题，尤其在处理多模态数据时。 Method: 采用多代理框架动态提取任务特定结构，提供透明推理路径和中间结果验证。 Result: DataMosaic提升了分析的准确性、一致性和隐私性。 Conclusion: DataMosaic为可解释、可验证的多模态数据分析奠定了基础。 Abstract: Large Language Models (LLMs) are transforming data analytics, but their widespread adoption is hindered by two critical limitations: they are not explainable (opaque reasoning processes) and not verifiable (prone to hallucinations and unchecked errors). While retrieval-augmented generation (RAG) improves accuracy by grounding LLMs in external data, it fails to address the core challenges of trustworthy analytics - especially when processing noisy, inconsistent, or multi-modal data (for example, text, tables, images). We propose DataMosaic, a framework designed to make LLM-powered analytics both explainable and verifiable. By dynamically extracting task-specific structures (for example, tables, graphs, trees) from raw data, DataMosaic provides transparent, step-by-step reasoning traces and enables validation of intermediate results. Built on a multi-agent framework, DataMosaic orchestrates self-adaptive agents that align with downstream task requirements, enhancing consistency, completeness, and privacy. Through this approach, DataMosaic not only tackles the limitations of current LLM-powered analytics systems but also lays the groundwork for a new paradigm of grounded, accurate, and explainable multi-modal data analytics.

[433] Hallucination Detection in LLMs via Topological Divergence on Attention Graphs

Alexandra Bazarova,Aleksandr Yugay,Andrey Shulga,Alina Ermilova,Andrei Volodichev,Konstantin Polev,Julia Belikova,Rauf Parchiev,Dmitry Simakov,Maxim Savchenko,Andrey Savchenko,Serguei Barannikov,Alexey Zaytsev

Main category: cs.CL

TLDR: TOHA是一种基于拓扑结构的幻觉检测器，通过注意力矩阵的拓扑差异量化幻觉现象，在多个任务中表现优异。

Details

Motivation: 解决大语言模型生成内容中存在的幻觉问题，即事实错误。 Method: 利用拓扑差异度量分析提示与响应子图的结构特性，通过注意力矩阵的拓扑差异检测幻觉。 Result: 在问答和数据到文本任务中达到先进水平，并表现出跨领域的迁移能力。 Conclusion: 注意力矩阵的拓扑结构分析可作为大语言模型事实可靠性的高效且鲁棒的指标。 Abstract: Hallucination, i.e., generating factually incorrect content, remains a critical challenge for large language models (LLMs). We introduce TOHA, a TOpology-based HAllucination detector in the RAG setting, which leverages a topological divergence metric to quantify the structural properties of graphs induced by attention matrices. Examining the topological divergence between prompt and response subgraphs reveals consistent patterns: higher divergence values in specific attention heads correlate with hallucinated outputs, independent of the dataset. Extensive experiments, including evaluation on question answering and data-to-text tasks, show that our approach achieves state-of-the-art or competitive results on several benchmarks, two of which were annotated by us and are being publicly released to facilitate further research. Beyond its strong in-domain performance, TOHA maintains remarkable domain transferability across multiple open-source LLMs. Our findings suggest that analyzing the topological structure of attention matrices can serve as an efficient and robust indicator of factual reliability in LLMs.

[434] A Computational Cognitive Model for Processing Repetitions of Hierarchical Relations

Zeng Ren,Xinyi Guan,Martin Rohrmeier

Main category: cs.CL

TLDR: 本文提出了一种基于加权演绎系统的计算模型，用于检测和理解序列数据中的结构性重复模式，并通过音乐和动作规划的短序列验证了其表达能力。

Details

Motivation: 研究人类如何识别和理解序列数据中的结构性重复模式，以揭示人类模式识别的认知机制。 Method: 采用加权演绎系统，推断序列的最小生成过程，并将其表示为带有重复组合子的上下文无关文法（Template程序）。 Result: 模型在音乐和动作规划的短序列上展示了其表达能力，能够高效编码递归性子计算。 Conclusion: 该模型为人类模式识别的心理表征和认知机制提供了更广泛的见解。 Abstract: Patterns are fundamental to human cognition, enabling the recognition of structure and regularity across diverse domains. In this work, we focus on structural repeats, patterns that arise from the repetition of hierarchical relations within sequential data, and develop a candidate computational model of how humans detect and understand such structural repeats. Based on a weighted deduction system, our model infers the minimal generative process of a given sequence in the form of a Template program, a formalism that enriches the context-free grammar with repetition combinators. Such representation efficiently encodes the repetition of sub-computations in a recursive manner. As a proof of concept, we demonstrate the expressiveness of our model on short sequences from music and action planning. The proposed model offers broader insights into the mental representations and cognitive mechanisms underlying human pattern recognition.

[435] Towards Quantifying Commonsense Reasoning with Mechanistic Insights

Abhinav Joshi,Areeb Ahmad,Divyaksh Shukla,Ashutosh Modi

Main category: cs.CL

TLDR: 论文提出了一种图形化结构来评估LLMs的常识推理能力，并通过标注37种日常活动创建资源，支持大量常识查询生成。

Details

Motivation: 评估LLMs在常识推理中的表现，并探索其内部推理机制。 Method: 设计图形化标注方案，捕捉日常活动的隐式知识，生成大量常识查询。 Result: 发现LLMs的推理能力集中在特定组件，对常识查询决策起关键作用。 Conclusion: 图形化结构为LLMs常识推理的严格评估提供了新工具，并揭示了其内部推理机制。 Abstract: Commonsense reasoning deals with the implicit knowledge that is well understood by humans and typically acquired via interactions with the world. In recent times, commonsense reasoning and understanding of various LLMs have been evaluated using text-based tasks. In this work, we argue that a proxy of this understanding can be maintained as a graphical structure that can further help to perform a rigorous evaluation of commonsense reasoning abilities about various real-world activities. We create an annotation scheme for capturing this implicit knowledge in the form of a graphical structure for 37 daily human activities. We find that the created resource can be used to frame an enormous number of commonsense queries (~ 10^{17}), facilitating rigorous evaluation of commonsense reasoning in LLMs. Moreover, recently, the remarkable performance of LLMs has raised questions about whether these models are truly capable of reasoning in the wild and, in general, how reasoning occurs inside these models. In this resource paper, we bridge this gap by proposing design mechanisms that facilitate research in a similar direction. Our findings suggest that the reasoning components are localized in LLMs that play a prominent role in decision-making when prompted with a commonsense query.

Xinnong Zhang,Jiayu Lin,Xinyi Mou,Shiyue Yang,Xiawei Liu,Libo Sun,Hanjia Lyu,Yihang Yang,Weihong Qi,Yue Chen,Guanying Li,Ling Yan,Yao Hu,Siming Chen,Yu Wang,Jingxuan Huang,Jiebo Luo,Shiping Tang,Libo Wu,Baohua Zhou,Zhongyu Wei

Main category: cs.CL

TLDR: SocioVerse是一个基于LLM-agent的社会模拟框架，通过四个对齐组件和1000万真实用户数据，解决了现有方法在环境、用户、交互和行为模式上的对齐问题。实验验证了其在政治、新闻和经济领域的多样性和可信度。

Details

Motivation: 传统社会模拟方法在环境、用户、交互和行为模式上存在对齐问题，而LLM的发展为捕捉个体差异和预测群体行为提供了新机会。 Method: 提出SocioVerse框架，包含四个对齐组件，并利用1000万真实用户数据进行大规模模拟实验。 Result: 实验证明SocioVerse能反映大规模人口动态，同时确保多样性、可信度和代表性。 Conclusion: SocioVerse通过标准化流程和最小手动调整，为LLM驱动的社会模拟提供了有效解决方案。 Abstract: Social simulation is transforming traditional social science research by modeling human behavior through interactions between virtual individuals and their environments. With recent advances in large language models (LLMs), this approach has shown growing potential in capturing individual differences and predicting group behaviors. However, existing methods face alignment challenges related to the environment, target users, interaction mechanisms, and behavioral patterns. To this end, we introduce SocioVerse, an LLM-agent-driven world model for social simulation. Our framework features four powerful alignment components and a user pool of 10 million real individuals. To validate its effectiveness, we conducted large-scale simulation experiments across three distinct domains: politics, news, and economics. Results demonstrate that SocioVerse can reflect large-scale population dynamics while ensuring diversity, credibility, and representativeness through standardized procedures and minimal manual adjustments.

[437] MT-R1-Zero: Advancing LLM-based Machine Translation via R1-Zero-like Reinforcement Learning

Zhaopeng Feng,Shaosheng Cao,Jiahan Ren,Jiayuan Su,Ruizhe Chen,Yan Zhang,Zhe Xu,Yao Hu,Jian Wu,Zuozhu Liu

Main category: cs.CL

TLDR: 论文提出MT-R1-Zero框架，首次将R1-Zero RL应用于机器翻译，通过混合奖励机制提升翻译质量，并在WMT 24基准测试中表现优异。

Details

Motivation: 探索如何将大规模强化学习应用于机器翻译任务，解决其输出灵活且难以自动评估的问题。 Method: 提出MT-R1-Zero框架，结合规则与指标混合奖励机制，无需监督微调或冷启动。 Result: MT-R1-Zero-3B-Mix超越TowerInstruct-7B-v0.2 1.26分，MT-R1-Zero-7B-Mix与GPT-4o等先进模型相当，MT-R1-Zero-7B-Sem在语义指标上达到最优。 Conclusion: MT-R1-Zero框架在机器翻译中表现出色，揭示了奖励设计、LLM适应性和训练动态的关键作用。 Abstract: Large-scale reinforcement learning (RL) methods have proven highly effective in enhancing the reasoning abilities of large language models (LLMs), particularly for tasks with verifiable solutions such as mathematics and coding. However, applying this idea to machine translation (MT), where outputs are flexibly formatted and difficult to automatically evaluate with explicit rules, remains underexplored. In this work, we introduce MT-R1-Zero, the first open-source adaptation of the R1-Zero RL framework for MT without supervised fine-tuning or cold-start. We propose a rule-metric mixed reward mechanism to guide LLMs towards improved translation quality via emergent reasoning. On the WMT 24 English-Chinese benchmark, our MT-R1-Zero-3B-Mix achieves competitive performance, surpassing TowerInstruct-7B-v0.2 by an average of 1.26 points. Meanwhile, our MT-R1-Zero-7B-Mix attains a high average score of 62.25 across all metrics, placing it on par with advanced proprietary models such as GPT-4o and Claude-3.5-Sonnet, while the MT-R1-Zero-7B-Sem variant achieves state-of-the-art scores on semantic metrics. Moreover, our work exhibits strong generalization capabilities on out-of-distribution MT tasks, robustly supporting multilingual and low-resource settings. Extensive analysis of model behavior across different initializations and reward metrics offers pioneering insight into the critical role of reward design, LLM adaptability, training dynamics, and emergent reasoning patterns within the R1-Zero paradigm for MT. Our code is available at https://github.com/fzp0424/MT-R1-Zero.

[438] C-FAITH: A Chinese Fine-Grained Benchmark for Automated Hallucination Evaluation

Xu Zhang,Zhifei Liu,Jiahao Wang,Huixuan Zhang,Fan Xu,Junzhe Zhang,Xiaojun Wan

Main category: cs.CL

TLDR: 论文提出HaluAgent框架，自动构建细粒度QA数据集以评估大语言模型的幻觉问题，并创建了中文基准C-FAITH。

Details

Motivation: 现有幻觉评估基准依赖人工标注，成本高且难以自动化，尤其在中文领域。 Method: 使用HaluAgent框架，基于知识文档自动构建QA数据集，并通过规则和提示优化提升数据质量。 Result: 构建了包含60,702条目的中文基准C-FAITH，并评估了16个主流大语言模型。 Conclusion: HaluAgent和C-FAITH为自动化、低成本评估语言模型幻觉提供了有效工具。 Abstract: Despite the rapid advancement of large language models, they remain highly susceptible to generating hallucinations, which significantly hinders their widespread application. Hallucination research requires dynamic and fine-grained evaluation. However, most existing hallucination benchmarks (especially in Chinese language) rely on human annotations, making automatical and cost-effective hallucination evaluation challenging. To address this, we introduce HaluAgent, an agentic framework that automatically constructs fine-grained QA dataset based on some knowledge documents. Our experiments demonstrate that the manually designed rules and prompt optimization can improve the quality of generated data. Using HaluAgent, we construct C-FAITH, a Chinese QA hallucination benchmark created from 1,399 knowledge documents obtained from web scraping, totaling 60,702 entries. We comprehensively evaluate 16 mainstream LLMs with our proposed C-FAITH, providing detailed experimental results and analysis.

[439] HalluSearch at SemEval-2025 Task 3: A Search-Enhanced RAG Pipeline for Hallucination Detection

Mohamed A. Abdallah,Samhaa R. El-Beltagy

Main category: cs.CL

TLDR: HalluSearch是一个多语言管道，用于检测大型语言模型输出中的虚构文本片段，在十四种语言中表现优异，但在在线覆盖有限的语言中仍有挑战。

Details

Motivation: 开发一个多语言工具（HalluSearch）以检测和定位大型语言模型输出中的虚构内容，作为Mu-SHROOM共享任务的一部分。 Method: 结合检索增强验证和细粒度事实分割技术，在多语言环境中识别和定位虚构内容。 Result: 在英语和捷克语中表现优异（排名前四），但在在线覆盖有限的语言中效果受限。 Conclusion: HalluSearch在多语言环境中具有竞争力，但需进一步研究以提升在资源匮乏语言中的表现。 Abstract: In this paper, we present HalluSearch, a multilingual pipeline designed to detect fabricated text spans in Large Language Model (LLM) outputs. Developed as part of Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes, HalluSearch couples retrieval-augmented verification with fine-grained factual splitting to identify and localize hallucinations in fourteen different languages. Empirical evaluations show that HalluSearch performs competitively, placing fourth in both English (within the top ten percent) and Czech. While the system's retrieval-based strategy generally proves robust, it faces challenges in languages with limited online coverage, underscoring the need for further research to ensure consistent hallucination detection across diverse linguistic contexts.

[440] LLM Unlearning Reveals a Stronger-Than-Expected Coreset Effect in Current Benchmarks

Soumyadeep Pal,Changsheng Wang,James Diffenderfer,Bhavya Kailkhura,Sijia Liu

Main category: cs.CL

TLDR: 研究发现，在大型语言模型（LLM）的遗忘任务中，仅需原始遗忘数据集的5%作为核心集（coreset），即可有效维持遗忘效果，且该现象在不同遗忘方法和数据选择方法中均表现稳健。

Details

Motivation: 探索LLM遗忘任务中核心集效应的存在及其影响，以优化遗忘效率并揭示当前遗忘方法的本质。 Method: 通过实验验证核心集效应，使用不同遗忘方法（如NPO和RMU）和数据选择方法（随机到启发式）进行分析。 Result: 核心集效应显著，仅需少量数据即可实现有效遗忘，且遗忘效果由高影响力关键词驱动。 Conclusion: 当前LLM遗忘任务可能过于依赖少量关键数据，而非整个数据集，为未来遗忘方法的设计提供了新视角。 Abstract: Large language model unlearning has become a critical challenge in ensuring safety and controlled model behavior by removing undesired data-model influences from the pretrained model while preserving general utility. Significant recent efforts have been dedicated to developing LLM unlearning benchmarks such as WMDP (Weapons of Mass Destruction Proxy) and MUSE (Machine Unlearning Six-way Evaluation), facilitating standardized unlearning performance assessment and method comparison. Despite their usefulness, we uncover for the first time a novel coreset effect within these benchmarks. Specifically, we find that LLM unlearning achieved with the original (full) forget set can be effectively maintained using a significantly smaller subset (functioning as a "coreset"), e.g., as little as 5% of the forget set, even when selected at random. This suggests that LLM unlearning in these benchmarks can be performed surprisingly easily, even in an extremely low-data regime. We demonstrate that this coreset effect remains strong, regardless of the LLM unlearning method used, such as NPO (Negative Preference Optimization) and RMU (Representation Misdirection Unlearning), the popular ones in these benchmarks. The surprisingly strong coreset effect is also robust across various data selection methods, ranging from random selection to more sophisticated heuristic approaches. We explain the coreset effect in LLM unlearning through a keyword-based perspective, showing that keywords extracted from the forget set alone contribute significantly to unlearning effectiveness and indicating that current unlearning is driven by a compact set of high-impact tokens rather than the entire dataset. We further justify the faithfulness of coreset-unlearned models along additional dimensions, such as mode connectivity and robustness to jailbreaking attacks. Codes are available at https://github.com/OPTML-Group/MU-Coreset.

[441] Deep Reasoning Translation via Reinforcement Learning

Jiaan Wang,Fandong Meng,Jie Zhou

Main category: cs.CL

TLDR: DeepTrans是一种基于强化学习的深度推理翻译模型，通过奖励模型学习自由翻译，无需标注数据，性能提升显著。

Details

Motivation: 探索深度推理LLM在自由翻译任务中的应用，解决传统翻译模型在文化差异和自由翻译上的不足。 Method: 使用强化学习训练DeepTrans，通过奖励模型评估翻译结果和思维过程，无需标注数据。 Result: DeepTrans在文学翻译中性能提升16.3%，优于其他深度推理模型和基于合成数据的基线。 Conclusion: DeepTrans展示了强化学习在自由翻译中的潜力，为未来研究提供了启发。 Abstract: Recently, deep reasoning LLMs (e.g., OpenAI o1/o3 and DeepSeek-R1) have shown promising performance in various complex tasks. Free translation is an important and interesting task in the multilingual world, which requires going beyond word-for-word translation and taking cultural differences into account. This task is still under-explored in deep reasoning LLMs. In this paper, we introduce DeepTrans, a deep reasoning translation model that learns free translation via reinforcement learning. Specifically, we carefully build a reward model with pre-defined scoring criteria on both the translation results and the thought process. Given the source sentences, the reward model teaches the deep translation model how to think and free-translate them during reinforcement learning. In this way, training DeepTrans does not need any labeled translations, avoiding the human-intensive annotation or resource-intensive data synthesis. Experimental results show the effectiveness of DeepTrans. Using Qwen2.5-7B as the backbone, DeepTrans improves performance by 16.3% in literature translation, and outperforms strong deep reasoning baselines as well as baselines that are fine-tuned with synthesized data. Moreover, we summarize the failures and interesting findings during our RL exploration. We hope this work could inspire other researchers in free translation.

[442] Localized Cultural Knowledge is Conserved and Controllable in Large Language Models

Veniamin Veselovsky,Berke Argin,Benedikt Stroebl,Chris Wendler,Robert West,James Evans,Thomas L. Griffiths,Arvind Narayanan

Main category: cs.CL

TLDR: 论文探讨了LLMs在生成非英语内容时默认英语中心化的问题，提出通过显式文化背景提示激活模型中的文化信息，但发现这会减少多样性和增加刻板印象。同时，发现了一个跨语言的显式文化定制向量，能改善多样性和减少刻板印象。

Details

Motivation: 研究LLMs在跨语言和文化交互中如何保留和激活文化信息，以提升其文化定制能力。 Method: 通过显式提供文化背景提示，分析模型性能差异，并识别跨语言的显式文化定制向量。 Result: 显式文化背景提示显著提升文化定制能力，但减少多样性和增加刻板印象；显式文化定制向量能改善这些问题。 Conclusion: 显式文化定制有助于理解LLMs中文化模型的保留与控制，提升翻译和文化定制潜力。 Abstract: Just as humans display language patterns influenced by their native tongue when speaking new languages, LLMs often default to English-centric responses even when generating in other languages. Nevertheless, we observe that local cultural information persists within the models and can be readily activated for cultural customization. We first demonstrate that explicitly providing cultural context in prompts significantly improves the models' ability to generate culturally localized responses. We term the disparity in model performance with versus without explicit cultural context the explicit-implicit localization gap, indicating that while cultural knowledge exists within LLMs, it may not naturally surface in multilingual interactions if cultural context is not explicitly provided. Despite the explicit prompting benefit, however, the answers reduce in diversity and tend toward stereotypes. Second, we identify an explicit cultural customization vector, conserved across all non-English languages we explore, which enables LLMs to be steered from the synthetic English cultural world-model toward each non-English cultural world. Steered responses retain the diversity of implicit prompting and reduce stereotypes to dramatically improve the potential for customization. We discuss the implications of explicit cultural customization for understanding the conservation of alternative cultural world models within LLMs, and their controllable utility for translation, cultural customization, and the possibility of making the explicit implicit through soft control for expanded LLM function and appeal.

[443] DioR: Adaptive Cognitive Detection and Contextual Retrieval Optimization for Dynamic Retrieval-Augmented Generation

Hanghui Guo,Jia Zhu,Shimin Di,Weijie Shi,Zhangze Chen,Jiajie Xu

Main category: cs.CL

TLDR: DioR方法通过自适应认知检测和上下文检索优化，解决了动态RAG在检索触发和内容筛选上的不足，显著提升了性能。

Details

Motivation: 现有动态RAG方法缺乏有效的检索触发机制和内容筛选能力，限制了其效果。 Method: 提出DioR方法，包含自适应认知检测和上下文检索优化两部分，分别解决何时检索和检索什么的问题。 Result: 实验表明DioR在所有任务上表现优异。 Conclusion: DioR有效解决了动态RAG的局限性，展示了其优越性。 Abstract: Dynamic Retrieval-augmented Generation (RAG) has shown great success in mitigating hallucinations in large language models (LLMs) during generation. However, existing dynamic RAG methods face significant limitations in two key aspects: 1) Lack of an effective mechanism to control retrieval triggers, and 2) Lack of effective scrutiny of retrieval content. To address these limitations, we propose an innovative dynamic RAG method, DioR (Adaptive Cognitive Detection and Contextual Retrieval Optimization), which consists of two main components: adaptive cognitive detection and contextual retrieval optimization, specifically designed to determine when retrieval is needed and what to retrieve for LLMs is useful. Experimental results demonstrate that DioR achieves superior performance on all tasks, demonstrating the effectiveness of our work.

[444] Probing then Editing Response Personality of Large Language Models

Tianjie Ju,Zhenyu Shao,Bowen Wang,Yujia Chen,Zhuosheng Zhang,Hao Fei,Mong-Li Lee,Wynne Hsu,Sufeng Duan,Gongshen Liu

Main category: cs.CL

TLDR: 论文提出了一种分层探测框架，研究LLMs内部如何编码人格特征，并通过扰动方法编辑LLMs的响应人格。

Details

Motivation: 尽管已有研究通过输出评估LLMs的人格表达，但对其内部参数如何编码人格知之甚少。 Method: 采用分层探测框架，在11个开源LLMs上进行实验，并提出分层扰动方法编辑人格。 Result: 发现人格主要编码于LLMs的中上层，指令调优模型的人格分离更清晰；扰动方法能有效编辑人格，且不影响通用能力。 Conclusion: 分层探测和扰动方法为LLMs人格编码和编辑提供了新视角，且具有低训练成本和可接受的推理延迟。 Abstract: Large Language Models (LLMs) have demonstrated promising capabilities to generate responses that exhibit consistent personality traits. Despite the major attempts to analyze personality expression through output-based evaluations, little is known about how such traits are internally encoded within LLM parameters. In this paper, we introduce a layer-wise probing framework to systematically investigate the layer-wise capability of LLMs in encoding personality for responding. We conduct probing experiments on 11 open-source LLMs over the PersonalityEdit benchmark and find that LLMs predominantly encode personality for responding in their middle and upper layers, with instruction-tuned models demonstrating a slightly clearer separation of personality traits. Furthermore, by interpreting the trained probing hyperplane as a layer-wise boundary for each personality category, we propose a layer-wise perturbation method to edit the personality expressed by LLMs during inference. Our results show that even when the prompt explicitly specifies a particular personality, our method can still successfully alter the response personality of LLMs. Interestingly, the difficulty of converting between certain personality traits varies substantially, which aligns with the representational distances in our probing experiments. Finally, we conduct a comprehensive MMLU benchmark evaluation and time overhead analysis, demonstrating that our proposed personality editing method incurs only minimal degradation in general capabilities while maintaining low training costs and acceptable inference latency. Our code is publicly available at https://github.com/universe-sky/probing-then-editing-personality.

[445] Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol

Weiqi Wang,Jiefu Ou,Yangqiu Song,Benjamin Van Durme,Daniel Khashabi

Main category: cs.CL

TLDR: 论文探讨了如何生成满足用户需求的文献综述表格，结合LLM和人工标注解决现实问题，提出了新基准ARXIV2TABLE，并验证了当前LLM在此任务上的局限性。

Details

Motivation: 解决文献综述表格生成中的三大现实挑战：用户提示不明确、候选论文内容不相关、评估方法需改进。 Method: 结合LLM方法和人工标注，提出新基准ARXIV2TABLE，并设计新方法优化表格生成。 Result: 实验表明，当前LLM在此任务上表现不佳，任务难度高，需进一步研究。 Conclusion: 论文提出了更现实的基准和方法，强调了任务挑战性，并开源了数据集和代码。 Abstract: Literature review tables are essential for summarizing and comparing collections of scientific papers. We explore the task of generating tables that best fulfill a user's informational needs given a collection of scientific papers. Building on recent work (Newman et al., 2024), we extend prior approaches to address real-world complexities through a combination of LLM-based methods and human annotations. Our contributions focus on three key challenges encountered in real-world use: (i) User prompts are often under-specified; (ii) Retrieved candidate papers frequently contain irrelevant content; and (iii) Task evaluation should move beyond shallow text similarity techniques and instead assess the utility of inferred tables for information-seeking tasks (e.g., comparing papers). To support reproducible evaluation, we introduce ARXIV2TABLE, a more realistic and challenging benchmark for this task, along with a novel approach to improve literature review table generation in real-world scenarios. Our extensive experiments on this benchmark show that both open-weight and proprietary LLMs struggle with the task, highlighting its difficulty and the need for further advancements. Our dataset and code are available at https://github.com/JHU-CLSP/arXiv2Table.

[446] MorphTok: Morphologically Grounded Tokenization for Indian Languages

Maharaj Brahma,N J Karthika,Atul Singh,Devaraj Adiga,Smruti Bhate,Ganesh Ramakrishnan,Rohit Saluja,Maunendra Sankar Desarkar

Main category: cs.CL

TLDR: 论文提出了一种基于形态学的预分词方法，结合改进的BPE算法（CBPE）和新的评估指标EvalTok，提升了印地语和马拉地语的子词分词效果。

Details

Motivation: 现有BPE算法在子词分词时未考虑语言学意义，导致分词结果不理想，影响下游任务性能。 Method: 提出形态学感知的分词作为BPE的预处理步骤，并引入CBPE算法处理特定脚本约束（如依赖元音）。 Result: 实验表明，形态学分词提升了机器翻译和语言建模性能，CBPE降低了生育率分数。 Conclusion: 形态学分词和CBPE为子词分词提供了更高效且语言学合理的解决方案。 Abstract: Tokenization is a crucial step in NLP, especially with the rise of large language models (LLMs), impacting downstream performance, computational cost, and efficiency. Existing LLMs rely on the classical Byte-pair Encoding (BPE) algorithm for subword tokenization that greedily merges frequent character bigrams. This often leads to segmentation that does not align with linguistically meaningful units. To address this, we propose morphology-aware segmentation as a pre-tokenization step prior to applying BPE. To facilitate morphology-aware segmentation, we create a novel dataset for Hindi and Marathi, incorporating sandhi splitting to enhance the subword tokenization. Experiments on downstream tasks show that morphologically grounded tokenization improves performance for machine translation and language modeling. Additionally, to handle the ambiguity in the Unicode characters for diacritics, particularly dependent vowels in syllable-based writing systems, we introduce Constrained BPE (CBPE), an extension to the traditional BPE algorithm that incorporates script-specific constraints. Specifically, CBPE handles dependent vowels. Our results show that CBPE achieves a 1.68\% reduction in fertility scores while maintaining comparable or improved downstream performance in machine translation, offering a computationally efficient alternative to standard BPE. Moreover, to evaluate segmentation across different tokenization algorithms, we introduce a new human evaluation metric, \textit{EvalTok}, enabling more human-grounded assessment.

[447] Forecasting from Clinical Textual Time Series: Adaptations of the Encoder and Decoder Language Model Families

Shahriar Noroozizadeh,Sayantan Kumar,Jeremy C. Weiss

Main category: cs.CL

TLDR: 论文提出了一种基于文本时间序列的预测方法，利用LLM辅助提取的临床发现作为输入，评估了多种模型在事件预测、时间排序和生存分析中的表现。

Details

Motivation: 传统机器学习方法未能充分利用临床病例报告中丰富的时间序列信息，因此探索文本时间序列预测的潜力。 Method: 通过LLM辅助的标注流程提取时间戳临床发现，评估了包括微调的基于解码器的LLM和基于编码器的Transformer在内的多种模型。 Result: 基于编码器的模型在事件预测和时间排序中表现更优，而微调的掩码方法提升了排名性能；指令调优的解码器模型在生存分析中表现突出。 Conclusion: 时间排序对临床时间序列构建至关重要，凸显了时间有序语料库在LLM广泛应用时代的额外价值。 Abstract: Clinical case reports encode rich, temporal patient trajectories that are often underexploited by traditional machine learning methods relying on structured data. In this work, we introduce the forecasting problem from textual time series, where timestamped clinical findings--extracted via an LLM-assisted annotation pipeline--serve as the primary input for prediction. We systematically evaluate a diverse suite of models, including fine-tuned decoder-based large language models and encoder-based transformers, on tasks of event occurrence prediction, temporal ordering, and survival analysis. Our experiments reveal that encoder-based models consistently achieve higher F1 scores and superior temporal concordance for short- and long-horizon event forecasting, while fine-tuned masking approaches enhance ranking performance. In contrast, instruction-tuned decoder models demonstrate a relative advantage in survival analysis, especially in early prognosis settings. Our sensitivity analyses further demonstrate the importance of time ordering, which requires clinical time series construction, as compared to text ordering, the format of the text inputs that LLMs are classically trained on. This highlights the additional benefit that can be ascertained from time-ordered corpora, with implications for temporal tasks in the era of widespread LLM use.

[448] VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge

Yueqi Song,Tianyue Ou,Yibo Kong,Zecheng Li,Graham Neubig,Xiang Yue

Main category: cs.CL

TLDR: VisualPuzzles是一个专注于视觉推理的基准测试，旨在减少对专业知识的依赖，评估多模态模型的真实推理能力。

Details

Motivation: 现有基准测试常将推理与领域知识混淆，难以评估非专家环境中的通用推理能力。 Method: 通过手动翻译中国公务员考试的逻辑推理题，构建包含五类问题的VisualPuzzles基准。 Result: 多模态大语言模型在VisualPuzzles上表现不及人类，且知识密集型基准的高分未必能转化为推理任务的优秀表现。 Conclusion: VisualPuzzles为评估超越事实记忆和领域知识的推理能力提供了更清晰的视角。 Abstract: Current multimodal benchmarks often conflate reasoning with domain-specific knowledge, making it difficult to isolate and evaluate general reasoning abilities in non-expert settings. To address this, we introduce VisualPuzzles, a benchmark that targets visual reasoning while deliberately minimizing reliance on specialized knowledge. VisualPuzzles consists of diverse questions spanning five categories: algorithmic, analogical, deductive, inductive, and spatial reasoning. One major source of our questions is manually translated logical reasoning questions from the Chinese Civil Service Examination. Experiments show that VisualPuzzles requires significantly less intensive domain-specific knowledge and more complex reasoning compared to benchmarks like MMMU, enabling us to better evaluate genuine multimodal reasoning. Evaluations show that state-of-the-art multimodal large language models consistently lag behind human performance on VisualPuzzles, and that strong performance on knowledge-intensive benchmarks does not necessarily translate to success on reasoning-focused, knowledge-light tasks. Additionally, reasoning enhancements such as scaling up inference compute (with "thinking" modes) yield inconsistent gains across models and task types, and we observe no clear correlation between model size and performance. We also found that models exhibit different reasoning and answering patterns on VisualPuzzles compared to benchmarks with heavier emphasis on knowledge. VisualPuzzles offers a clearer lens through which to evaluate reasoning capabilities beyond factual recall and domain knowledge.

[449] MultiLoKo: a multilingual local knowledge benchmark for LLMs spanning 31 languages

Dieuwke Hupkes,Nikolay Bogoychev

Main category: cs.CL

TLDR: MultiLoKo是一个新的多语言基准测试，覆盖31种语言，用于评估LLMs的多语言能力。它包含本地化问题和翻译问题，并比较人工与机器翻译的效果。测试结果显示当前模型在多语言表现上仍有不足，且语言间知识转移不理想。

Details

Motivation: 研究LLMs在多语言环境下的表现，并探讨多语言基准测试的构建方法。 Method: 构建MultiLoKo基准，包含本地化问题和翻译问题，比较人工与机器翻译的效果，并测试11种多语言模型。 Result: 模型在多语言表现上普遍较差，语言间知识转移不足，且本地化数据与翻译数据对结果影响显著。 Conclusion: 当前多语言模型仍需改进，基准测试的设计对评估结果有重要影响。 Abstract: We present MultiLoKo, a new benchmark for evaluating multilinguality in LLMs covering 31 languages. MultiLoKo consists of three partitions: a main partition consisting of 500 questions per language, separately sourced to be locally relevant to the specific language, and two translated partitions, containing human-authored translations from 30 non-English languages to English and vice versa. For comparison, we also release corresponding machine-authored translations. The data is equally distributed over two splits: a dev split and a blind, out-of-distribution test split. MultiLoKo can be used to study a variety of questions regarding the multilinguality of LLMs as well as meta-questions about multilingual benchmark creation. We compute MultiLoKo scores for 11 base and chat models marketed to be multilingual and study their average performance, their performance parity across languages, how much their ability to answer questions depends on the question language, and which languages are most difficult. None of the models we studied performs well on MultiLoKo, as indicated by low average scores as well as large differences between the best and worst scoring languages. Furthermore, we find a substantial effect of the question language, indicating sub-optimal knowledge transfer between languages. Lastly, we find that using local vs English-translated data can result in differences more than 20 points for the best performing models, drastically change the estimated difficulty of some languages. For using machines instead of human translations, we find a weaker effect on ordering of language difficulty, a larger difference in model rankings, and a substantial drop in estimated performance for all models.

[450] DICE: A Framework for Dimensional and Contextual Evaluation of Language Models

Aryan Shrivastava,Paula Akemi Aoyagui

Main category: cs.CL

TLDR: 论文提出DICE方法，针对语言模型（LMs）现有评估范式的不足，提出基于维度和上下文的评估框架，以更贴合实际应用场景。

Details

Motivation: 现有语言模型评估方法未能充分反映实际应用需求，缺乏对真实场景的适用性。 Method: 提出DICE框架，包含上下文无关参数（如鲁棒性、连贯性）和上下文相关参数，以更细粒度评估LMs。 Result: DICE为语言模型评估提供了更贴合实际应用场景的框架，强调了上下文和利益相关者的需求。 Conclusion: DICE是语言模型评估的实用起点，为上下文相关和利益相关者导向的评估提供了新思路。 Abstract: Language models (LMs) are increasingly being integrated into a wide range of applications, yet the modern evaluation paradigm does not sufficiently reflect how they are actually being used. Current evaluations rely on benchmarks that often lack direct applicability to the real-world contexts in which LMs are being deployed. To address this gap, we propose Dimensional and Contextual Evaluation (DICE), an approach that evaluates LMs on granular, context-dependent dimensions. In this position paper, we begin by examining the insufficiency of existing LM benchmarks, highlighting their limited applicability to real-world use cases. Next, we propose a set of granular evaluation parameters that capture dimensions of LM behavior that are more meaningful to stakeholders across a variety of application domains. Specifically, we introduce the concept of context-agnostic parameters - such as robustness, coherence, and epistemic honesty - and context-specific parameters that must be tailored to the specific contextual constraints and demands of stakeholders choosing to deploy LMs into a particular setting. We then discuss potential approaches to operationalize this evaluation framework, finishing with the opportunities and challenges DICE presents to the LM evaluation landscape. Ultimately, this work serves as a practical and approachable starting point for context-specific and stakeholder-relevant evaluation of LMs.

[451] S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models

Wenyuan Zhang,Shuaiyi Nie,Xinghua Zhang,Zefeng Zhang,Tingwen Liu

Main category: cs.CL

TLDR: S1-Bench是一个新基准，用于评估大型推理模型（LRMs）在简单任务中的表现，重点关注直觉系统1思维而非深度系统2推理。

Details

Motivation: 现有基准缺乏对LRMs在需要直觉思维任务中的评估，而LRMs在复杂任务中的成功可能掩盖了其系统1思维的不足。 Method: S1-Bench提供多领域、多语言的简单问题集，专门测试LRMs的直觉能力。 Result: 评估22个LRMs发现其效率较低，输出冗长且易产生错误，显示其推理模式僵化。 Conclusion: 当前LRMs在平衡双系统思维方面仍需改进，需适应任务复杂性的灵活能力。 Abstract: We introduce S1-Bench, a novel benchmark designed to evaluate Large Reasoning Models' (LRMs) performance on simple tasks that favor intuitive system 1 thinking rather than deliberative system 2 reasoning. While LRMs have achieved significant breakthroughs in complex reasoning tasks through explicit chains of thought, their reliance on deep analytical thinking may limit their system 1 thinking capabilities. Moreover, a lack of benchmark currently exists to evaluate LRMs' performance in tasks that require such capabilities. To fill this gap, S1-Bench presents a set of simple, diverse, and naturally clear questions across multiple domains and languages, specifically designed to assess LRMs' performance in such tasks. Our comprehensive evaluation of 22 LRMs reveals significant lower efficiency tendencies, with outputs averaging 15.5 times longer than those of traditional small LLMs. Additionally, LRMs often identify correct answers early but continue unnecessary deliberation, with some models even producing numerous errors. These findings highlight the rigid reasoning patterns of current LRMs and underscore the substantial development needed to achieve balanced dual-system thinking capabilities that can adapt appropriately to task complexity.

Varun Vasudevan,Faezeh Akhavizadegan,Abhinav Prakash,Yokila Arora,Jason Cho,Tanya Mendiratta,Sushant Kumar,Kannan Achan

Main category: cs.CL

TLDR: 本文提出了一种基于LLM的迭代优化框架，用于生成符合多约束条件的营销文案，显著提升了文案的成功率和点击率。

Details

Motivation: 手动撰写营销文案耗时且昂贵，难以满足个性化需求；现有LLM生成的文案难以一次性满足复杂约束。 Method: 采用基于LLM的端到端框架，通过迭代优化生成符合多约束（如长度、主题、关键词等）的文案。 Result: 迭代优化使文案成功率提升16.25-35.91%，生成的文案在点击率上优于人工撰写的内容（提升38.5-45.21%）。 Conclusion: 迭代优化框架能有效解决多约束文案生成问题，为个性化营销提供高效解决方案。 Abstract: Crafting a marketing message (copy), or copywriting is a challenging generation task, as the copy must adhere to various constraints. Copy creation is inherently iterative for humans, starting with an initial draft followed by successive refinements. However, manual copy creation is time-consuming and expensive, resulting in only a few copies for each use case. This limitation restricts our ability to personalize content to customers. Contrary to the manual approach, LLMs can generate copies quickly, but the generated content does not consistently meet all the constraints on the first attempt (similar to humans). While recent studies have shown promise in improving constrained generation through iterative refinement, they have primarily addressed tasks with only a few simple constraints. Consequently, the effectiveness of iterative refinement for tasks such as copy generation, which involves many intricate constraints, remains unclear. To address this gap, we propose an LLM-based end-to-end framework for scalable copy generation using iterative refinement. To the best of our knowledge, this is the first study to address multiple challenging constraints simultaneously in copy generation. Examples of these constraints include length, topics, keywords, preferred lexical ordering, and tone of voice. We demonstrate the performance of our framework by creating copies for e-commerce banners for three different use cases of varying complexity. Our results show that iterative refinement increases the copy success rate by $16.25-35.91$% across use cases. Furthermore, the copies generated using our approach outperformed manually created content in multiple pilot studies using a multi-armed bandit framework. The winning copy improved the click-through rate by $38.5-45.21$%.

[453] Performance of Large Language Models in Supporting Medical Diagnosis and Treatment

Diogo Sousa,Guilherme Barbosa,Catarina Rocha,Dulce Oliveira

Main category: cs.CL

TLDR: 评估多种LLM在葡萄牙国家医学考试中的表现，发现部分模型在准确性和成本效益上优于人类基准。

Details

Motivation: 探索LLM在医疗领域的潜力，提升诊断准确性和治疗规划支持。 Method: 测试开源和闭源LLM在2024年葡萄牙国家医学考试中的表现，评估准确性和成本。 Result: 部分模型表现优于人类基准，Chain-of-Thought等方法对性能有显著影响。 Conclusion: LLM可作为医疗决策的辅助工具，具有实际应用潜力。 Abstract: The integration of Large Language Models (LLMs) into healthcare holds significant potential to enhance diagnostic accuracy and support medical treatment planning. These AI-driven systems can analyze vast datasets, assisting clinicians in identifying diseases, recommending treatments, and predicting patient outcomes. This study evaluates the performance of a range of contemporary LLMs, including both open-source and closed-source models, on the 2024 Portuguese National Exam for medical specialty access (PNA), a standardized medical knowledge assessment. Our results highlight considerable variation in accuracy and cost-effectiveness, with several models demonstrating performance exceeding human benchmarks for medical students on this specific task. We identify leading models based on a combined score of accuracy and cost, discuss the implications of reasoning methodologies like Chain-of-Thought, and underscore the potential for LLMs to function as valuable complementary tools aiding medical professionals in complex clinical decision-making.

[454] LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models

Parshin Shojaee,Ngoc-Hieu Nguyen,Kazem Meidani,Amir Barati Farimani,Khoa D Doan,Chandan K Reddy

Main category: cs.CL

TLDR: LLM-SRBench是一个新基准，用于评估LLM在科学方程发现任务中的能力，避免因记忆常见方程而高估性能。

Details

Motivation: 现有基准依赖常见方程，易被LLM记忆，无法真实反映其发现能力。 Method: 提出LLM-SRBench，包含239个挑战性问题，分为LSR-Transform和LSR-Synth两类，分别测试推理能力和数据驱动发现。 Result: 最佳系统仅达到31.5%的符号准确性，显示科学方程发现的挑战。 Conclusion: LLM-SRBench为未来研究提供了重要资源，凸显了LLM在科学发现中的局限性。 Abstract: Scientific equation discovery is a fundamental task in the history of scientific progress, enabling the derivation of laws governing natural phenomena. Recently, Large Language Models (LLMs) have gained interest for this task due to their potential to leverage embedded scientific knowledge for hypothesis generation. However, evaluating the true discovery capabilities of these methods remains challenging, as existing benchmarks often rely on common equations that are susceptible to memorization by LLMs, leading to inflated performance metrics that do not reflect discovery. In this paper, we introduce LLM-SRBench, a comprehensive benchmark with 239 challenging problems across four scientific domains specifically designed to evaluate LLM-based scientific equation discovery methods while preventing trivial memorization. Our benchmark comprises two main categories: LSR-Transform, which transforms common physical models into less common mathematical representations to test reasoning beyond memorized forms, and LSR-Synth, which introduces synthetic, discovery-driven problems requiring data-driven reasoning. Through extensive evaluation of several state-of-the-art methods, using both open and closed LLMs, we find that the best-performing system so far achieves only 31.5% symbolic accuracy. These findings highlight the challenges of scientific equation discovery, positioning LLM-SRBench as a valuable resource for future research.

[455] CliniChat: A Multi-Source Knowledge-Driven Framework for Clinical Interview Dialogue Reconstruction and Evaluation

Jing Chen,Zhihua Wei,Wei Zhang,Yingying Hu,Qiong Zhang

Main category: cs.CL

TLDR: CliniChat是一个整合多源知识的框架，旨在提升大型语言模型（LLMs）在临床访谈中的能力，包括对话重构和评估模块，并提出了高质量数据集和专用模型。

Details

Motivation: 由于缺乏高质量的临床访谈对话数据和广泛接受的评估方法，LLMs在临床访谈中的应用受到限制。 Method: CliniChat框架包含Clini-Recon（重构对话）和Clini-Eval（评估对话）模块，整合三种知识源，并采用两阶段自动评估方法。 Result: 实验表明，CliniChatGLM在临床访谈能力上全面提升，尤其在病史采集方面表现优异。 Conclusion: CliniChat为LLMs在临床访谈中的应用提供了有效解决方案，推动了该领域的发展。 Abstract: Large language models (LLMs) hold great promise for assisting clinical interviews due to their fluent interactive capabilities and extensive medical knowledge. However, the lack of high-quality interview dialogue data and widely accepted evaluation methods has significantly impeded this process. So we propose CliniChat, a framework that integrates multi-source knowledge to enable LLMs to simulate real-world clinical interviews. It consists of two modules: Clini-Recon and Clini-Eval, each responsible for reconstructing and evaluating interview dialogues, respectively. By incorporating three sources of knowledge, Clini-Recon transforms clinical notes into systematic, professional, and empathetic interview dialogues. Clini-Eval combines a comprehensive evaluation metric system with a two-phase automatic evaluation approach, enabling LLMs to assess interview performance like experts. We contribute MedQA-Dialog, a high-quality synthetic interview dialogue dataset, and CliniChatGLM, a model specialized for clinical interviews. Experimental results demonstrate that CliniChatGLM's interview capabilities undergo a comprehensive upgrade, particularly in history-taking, achieving state-of-the-art performance.

Michał Turski,Mateusz Chiliński,Łukasz Borchmann

Main category: cs.CL

TLDR: 论文介绍了CheckboxQA数据集，用于评估和改进模型在复选框相关任务中的表现，填补了现有模型在此类任务中的不足。

Details

Motivation: 复选框在文档处理中至关重要，但现有的大规模视觉和语言模型在解释复选框内容时表现不佳，可能导致严重的监管或合同问题。 Method: 提出CheckboxQA数据集，专门用于评估和改进模型在复选框任务中的性能。 Result: 数据集揭示了当前模型的局限性，并为提升文档理解系统提供了工具。 Conclusion: CheckboxQA数据集对法律科技和金融等领域的应用具有重要意义。 Abstract: Checkboxes are critical in real-world document processing where the presence or absence of ticks directly informs data extraction and decision-making processes. Yet, despite the strong performance of Large Vision and Language Models across a wide range of tasks, they struggle with interpreting checkable content. This challenge becomes particularly pressing in industries where a single overlooked checkbox may lead to costly regulatory or contractual oversights. To address this gap, we introduce the CheckboxQA dataset, a targeted resource designed to evaluate and improve model performance on checkbox-related tasks. It reveals the limitations of current models and serves as a valuable tool for advancing document comprehension systems, with significant implications for applications in sectors such as legal tech and finance. The dataset is publicly available at: https://github.com/Snowflake-Labs/CheckboxQA

[457] Can We Edit LLMs for Long-Tail Biomedical Knowledge?

Xinhao Yi,Jake Lever,Kevin Bryson,Zaiqiao Meng

Main category: cs.CL

TLDR: 知识编辑在更新大型语言模型（LLM）内部知识方面有效，但在生物医学领域面临长尾知识分布的挑战。研究发现，现有方法对长尾知识的编辑效果不如高频知识，且一对多知识的高比例限制了编辑效果。

Details

Motivation: 研究知识编辑方法在生物医学长尾知识分布中的有效性，填补该领域的研究空白。 Method: 通过实验评估现有知识编辑方法对长尾生物医学知识的编辑效果，并分析一对多知识的影响。 Result: 现有方法能提升长尾知识表现，但仍不及高频知识；一对多知识的高比例限制了编辑效果。 Conclusion: 需开发针对性策略以提升长尾生物医学知识的编辑效果。 Abstract: Knowledge editing has emerged as an effective approach for updating large language models (LLMs) by modifying their internal knowledge. However, their application to the biomedical domain faces unique challenges due to the long-tailed distribution of biomedical knowledge, where rare and infrequent information is prevalent. In this paper, we conduct the first comprehensive study to investigate the effectiveness of knowledge editing methods for editing long-tail biomedical knowledge. Our results indicate that, while existing editing methods can enhance LLMs' performance on long-tail biomedical knowledge, their performance on long-tail knowledge remains inferior to that on high-frequency popular knowledge, even after editing. Our further analysis reveals that long-tail biomedical knowledge contains a significant amount of one-to-many knowledge, where one subject and relation link to multiple objects. This high prevalence of one-to-many knowledge limits the effectiveness of knowledge editing in improving LLMs' understanding of long-tail biomedical knowledge, highlighting the need for tailored strategies to bridge this performance gap.

[458] LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models

Minqian Liu,Zhiyang Xu,Xinyi Zhang,Heajun An,Sarvech Qadir,Qi Zhang,Pamela J. Wisniewski,Jin-Hee Cho,Sang Won Lee,Ruoxi Jia,Lifu Huang

Main category: cs.CL

TLDR: 论文研究了大型语言模型（LLMs）在说服任务中的安全性问题，提出了PersuSafety框架评估其伦理表现，发现多数LLMs存在安全隐患。

Details

Motivation: LLMs接近人类的说服能力引发了对潜在安全风险的担忧，如操纵、欺骗等不道德行为。 Method: 通过PersuSafety框架（包括场景创建、对话模拟和安全性评估）系统评估LLMs在6个不道德话题和15种策略中的表现。 Result: 实验发现多数LLMs未能识别有害任务并使用了不道德策略，存在显著安全隐患。 Conclusion: 研究呼吁在目标驱动的对话中加强LLMs的安全对齐。 Abstract: Recent advancements in Large Language Models (LLMs) have enabled them to approach human-level persuasion capabilities. However, such potential also raises concerns about the safety risks of LLM-driven persuasion, particularly their potential for unethical influence through manipulation, deception, exploitation of vulnerabilities, and many other harmful tactics. In this work, we present a systematic investigation of LLM persuasion safety through two critical aspects: (1) whether LLMs appropriately reject unethical persuasion tasks and avoid unethical strategies during execution, including cases where the initial persuasion goal appears ethically neutral, and (2) how influencing factors like personality traits and external pressures affect their behavior. To this end, we introduce PersuSafety, the first comprehensive framework for the assessment of persuasion safety which consists of three stages, i.e., persuasion scene creation, persuasive conversation simulation, and persuasion safety assessment. PersuSafety covers 6 diverse unethical persuasion topics and 15 common unethical strategies. Through extensive experiments across 8 widely used LLMs, we observe significant safety concerns in most LLMs, including failing to identify harmful persuasion tasks and leveraging various unethical persuasion strategies. Our study calls for more attention to improve safety alignment in progressive and goal-driven conversations such as persuasion.

[459] xVerify: Efficient Answer Verifier for Reasoning Model Evaluations

Ding Chen,Qingchen Yu,Pengyuan Wang,Wentao Zhang,Bo Tang,Feiyu Xiong,Xinchi Li,Minchuan Yang,Zhiyu Li

Main category: cs.CL

TLDR: 论文提出xVerify，一种用于评估推理模型的高效答案验证器，能有效判断模型输出与参考答案的等价性，并在实验中表现优异。

Details

Motivation: 现有评估方法难以处理推理模型生成的复杂回答，无法准确判断等价性或提取最终答案。 Method: 构建VAR数据集，通过多轮标注训练不同规模的xVerify模型。 Result: xVerify模型在测试集和泛化集上F1分数和准确率均超过95%，部分模型甚至优于GPT-4o。 Conclusion: xVerify在答案验证任务中表现出高效性和泛化能力。 Abstract: With the release of the o1 model by OpenAI, reasoning models adopting slow thinking strategies have gradually emerged. As the responses generated by such models often include complex reasoning, intermediate steps, and self-reflection, existing evaluation methods are often inadequate. They struggle to determine whether the LLM output is truly equivalent to the reference answer, and also have difficulty identifying and extracting the final answer from long, complex responses. To address this issue, we propose xVerify, an efficient answer verifier for reasoning model evaluations. xVerify demonstrates strong capability in equivalence judgment, enabling it to effectively determine whether the answers produced by reasoning models are equivalent to reference answers across various types of objective questions. To train and evaluate xVerify, we construct the VAR dataset by collecting question-answer pairs generated by multiple LLMs across various datasets, leveraging multiple reasoning models and challenging evaluation sets designed specifically for reasoning model assessment. A multi-round annotation process is employed to ensure label accuracy. Based on the VAR dataset, we train multiple xVerify models of different scales. In evaluation experiments conducted on both the test set and generalization set, all xVerify models achieve overall F1 scores and accuracy exceeding 95\%. Notably, the smallest variant, xVerify-0.5B-I, outperforms all evaluation methods except GPT-4o, while xVerify-3B-Ib surpasses GPT-4o in overall performance. These results validate the effectiveness and generalizability of xVerify.

[460] Layers at Similar Depths Generate Similar Activations Across LLM Architectures

Christopher Wolfram,Aaron Schein

Main category: cs.CL

TLDR: 研究发现，独立训练的LLM的潜在空间在层与层之间存在差异，但不同模型的对应层之间共享近邻关系。

Details

Motivation: 探索独立训练的LLM潜在空间之间的关系，以理解其内部表示的一致性。 Method: 分析24个开源LLM不同层的激活近邻关系。 Result: 近邻关系在模型内部层间变化，但在不同模型的对应层间共享。 Conclusion: LLM生成从层到层的激活几何序列，且该序列在不同模型间共享，但会根据架构调整。 Abstract: How do the latent spaces used by independently-trained LLMs relate to one another? We study the nearest neighbor relationships induced by activations at different layers of 24 open-weight LLMs, and find that they 1) tend to vary from layer to layer within a model, and 2) are approximately shared between corresponding layers of different models. Claim 2 shows that these nearest neighbor relationships are not arbitrary, as they are shared across models, but Claim 1 shows that they are not "obvious" either, as there is no single set of nearest neighbor relationships that is universally shared. Together, these suggest that LLMs generate a progression of activation geometries from layer to layer, but that this entire progression is largely shared between models, stretched and squeezed to fit into different architectures.

[461] SemCAFE: When Named Entities make the Difference Assessing Web Source Reliability through Entity-level Analytics

Gautam Kishore Shahi,Oshani Seneviratne,Marc Spaniol

Main category: cs.CL

TLDR: SemCAFE是一个通过实体关联性检测新闻可靠性的系统，利用自然语言处理和YAGO知识库分析，显著提升了识别不可靠新闻的能力。

Details

Motivation: 数字媒体中不可靠内容泛滥，且难以区分，尤其是俄乌战争期间，AI生成的内容加剧了这一问题。 Method: 结合自然语言处理技术（如去噪和分词）与YAGO知识库的实体语义分析，生成新闻的语义指纹。 Result: 在46,020篇可靠和3,407篇不可靠新闻上测试，F1分数比现有方法提升12%。 Conclusion: SemCAFE能有效区分新闻可靠性，为打击虚假信息提供了新工具。 Abstract: With the shift from traditional to digital media, the online landscape now hosts not only reliable news articles but also a significant amount of unreliable content. Digital media has faster reachability by significantly influencing public opinion and advancing political agendas. While newspaper readers may be familiar with their preferred outlets political leanings or credibility, determining unreliable news articles is much more challenging. The credibility of many online sources is often opaque, with AI generated content being easily disseminated at minimal cost. Unreliable news articles, particularly those that followed the Russian invasion of Ukraine in 2022, closely mimic the topics and writing styles of credible sources, making them difficult to distinguish. To address this, we introduce SemCAFE, a system designed to detect news reliability by incorporating entity relatedness into its assessment. SemCAFE employs standard Natural Language Processing techniques, such as boilerplate removal and tokenization, alongside entity level semantic analysis using the YAGO knowledge base. By creating a semantic fingerprint for each news article, SemCAFE could assess the credibility of 46,020 reliable and 3,407 unreliable articles on the 2022 Russian invasion of Ukraine. Our approach improved the macro F1 score by 12% over state of the art methods. The sample data and code are available on GitHub

[462] From Tokens to Lattices: Emergent Lattice Structures in Language Models

Bo Xiong,Steffen Staab

Main category: cs.CL

TLDR: 该论文探讨了预训练掩码语言模型（MLMs）如何通过形式概念分析（FCA）隐式学习概念格结构，并提出了一种无需依赖人工定义概念的新框架。

Details

Motivation: 研究MLMs在预训练过程中如何隐式学习概念格结构，并探索其背后的归纳偏置。 Method: 利用形式概念分析（FCA）从MLMs中提取对象-属性关系，构建概念格，并提出一种新框架以发现潜在概念。 Result: 通过三个数据集验证了框架的有效性，证实了MLMs能够学习并重建概念格结构。 Conclusion: MLMs通过预训练隐式学习概念格结构，新框架为发现潜在概念提供了可能。 Abstract: Pretrained masked language models (MLMs) have demonstrated an impressive capability to comprehend and encode conceptual knowledge, revealing a lattice structure among concepts. This raises a critical question: how does this conceptualization emerge from MLM pretraining? In this paper, we explore this problem from the perspective of Formal Concept Analysis (FCA), a mathematical framework that derives concept lattices from the observations of object-attribute relationships. We show that the MLM's objective implicitly learns a \emph{formal context} that describes objects, attributes, and their dependencies, which enables the reconstruction of a concept lattice through FCA. We propose a novel framework for concept lattice construction from pretrained MLMs and investigate the origin of the inductive biases of MLMs in lattice structure learning. Our framework differs from previous work because it does not rely on human-defined concepts and allows for discovering "latent" concepts that extend beyond human definitions. We create three datasets for evaluation, and the empirical results verify our hypothesis.

[463] Can AI Master Construction Management (CM)? Benchmarking State-of-the-Art Large Language Models on CM Certification Exams

Ruoxin Xiong,Yanyu Wang,Suat Gunhan,Yimin Zhu,Charles Berryman

Main category: cs.CL

TLDR: 论文研究了大型语言模型（LLMs）在建筑管理（CM）领域的应用潜力，通过CMExamSet数据集评估了GPT-4o和Claude 3.7的表现，发现其在单步任务中表现优异，但在多步任务和图形相关问题上存在局限。

Details

Motivation: 建筑管理项目日益复杂，需要专业工具提升效率，而LLMs在CM领域的应用尚未充分探索。 Method: 使用CMExamSet数据集（689道真实选择题）进行零样本评估，分析模型在不同任务类型和问题格式中的表现。 Result: GPT-4o和Claude 3.7平均准确率超过人类及格线（70%），但在多步任务和图形问题上表现较差，概念误解是主要错误类型。 Conclusion: LLMs可作为CM领域的辅助工具，但需进一步优化领域特定推理能力并保持人工监督。 Abstract: The growing complexity of construction management (CM) projects, coupled with challenges such as strict regulatory requirements and labor shortages, requires specialized analytical tools that streamline project workflow and enhance performance. Although large language models (LLMs) have demonstrated exceptional performance in general reasoning tasks, their effectiveness in tackling CM-specific challenges, such as precise quantitative analysis and regulatory interpretation, remains inadequately explored. To bridge this gap, this study introduces CMExamSet, a comprehensive benchmarking dataset comprising 689 authentic multiple-choice questions sourced from four nationally accredited CM certification exams. Our zero-shot evaluation assesses overall accuracy, subject areas (e.g., construction safety), reasoning complexity (single-step and multi-step), and question formats (text-only, figure-referenced, and table-referenced). The results indicate that GPT-4o and Claude 3.7 surpass typical human pass thresholds (70%), with average accuracies of 82% and 83%, respectively. Additionally, both models performed better on single-step tasks, with accuracies of 85.7% (GPT-4o) and 86.7% (Claude 3.7). Multi-step tasks were more challenging, reducing performance to 76.5% and 77.6%, respectively. Furthermore, both LLMs show significant limitations on figure-referenced questions, with accuracies dropping to approximately 40%. Our error pattern analysis further reveals that conceptual misunderstandings are the most common (44.4% and 47.9%), underscoring the need for enhanced domain-specific reasoning models. These findings underscore the potential of LLMs as valuable supplementary analytical tools in CM, while highlighting the need for domain-specific refinements and sustained human oversight in complex decision making.

[464] Efficient Evaluation of Large Language Models via Collaborative Filtering

Xu-Xiang Zhong,Chao Yi,Han-Jia Ye

Main category: cs.CL

TLDR: 提出一种基于协同过滤的两阶段方法，通过少量样本高效评估大语言模型的性能。

Details

Motivation: 评估大语言模型成本高，需减少测试实例数量并保持准确性。 Method: 将模型视为用户，测试实例视为物品，分两阶段：1）选择区分性强的实例；2）预测未选实例上的表现。 Result: 实验表明，该方法能准确估计模型性能，显著降低推理开销。 Conclusion: 该方法为高效评估大语言模型提供了可行方案。 Abstract: With the development of Large Language Models (LLMs), numerous benchmarks have been proposed to measure and compare the capabilities of different LLMs. However, evaluating LLMs is costly due to the large number of test instances and their slow inference speed. In this paper, we aim to explore how to efficiently estimate a model's real performance on a given benchmark based on its evaluation results on a small number of instances sampled from the benchmark. Inspired by Collaborative Filtering (CF) in Recommendation Systems (RS), we treat LLMs as users and test instances as items and propose a two-stage method. In the first stage, we treat instance selection as recommending products to users to choose instances that can easily distinguish model performance. In the second stage, we see performance prediction as rating prediction problem in RS to predict the target LLM's behavior on unselected instances. Experiments on multiple LLMs and datasets imply that our method can accurately estimate the target model's performance while largely reducing its inference overhead.

[465] Enhancing NER Performance in Low-Resource Pakistani Languages using Cross-Lingual Data Augmentation

Toqeer Ehsan,Thamar Solorio

Main category: cs.CL

TLDR: 论文提出了一种数据增强技术，用于提升低资源巴基斯坦语言的命名实体识别（NER）性能，并通过微调多语言掩码大语言模型（LLMs）取得了显著效果。

Details

Motivation: 由于缺乏标注数据集和在预训练语言模型（PLMs）中的有限表示，低资源语言的NER研究较少且具有挑战性。 Method: 采用数据增强技术生成文化上合理的句子，并在四种低资源巴基斯坦语言（乌尔都语、Shahmukhi、信德语和普什图语）上实验，同时微调多语言掩码LLMs。 Result: 方法显著提升了Shahmukhi和普什图语的NER性能。 Conclusion: 研究表明生成式LLMs在NER和数据增强方面具有潜力，尤其是在少样本学习场景中。 Abstract: Named Entity Recognition (NER), a fundamental task in Natural Language Processing (NLP), has shown significant advancements for high-resource languages. However, due to a lack of annotated datasets and limited representation in Pre-trained Language Models (PLMs), it remains understudied and challenging for low-resource languages. To address these challenges, we propose a data augmentation technique that generates culturally plausible sentences and experiments on four low-resource Pakistani languages; Urdu, Shahmukhi, Sindhi, and Pashto. By fine-tuning multilingual masked Large Language Models (LLMs), our approach demonstrates significant improvements in NER performance for Shahmukhi and Pashto. We further explore the capability of generative LLMs for NER and data augmentation using few-shot learning.

[466] Exploring Gradient-Guided Masked Language Model to Detect Textual Adversarial Attacks

Xiaomei Zhang,Zhaoxi Zhang,Yanjun Zhang,Xufei Zheng,Leo Yu Zhang,Shengshan Hu,Shirui Pan

Main category: cs.CL

TLDR: 论文提出了一种基于掩码语言模型的文本对抗攻击检测方法，并通过梯度引导优化减少了计算开销。

Details

Motivation: 文本对抗样本威胁NLP系统的可靠性，掩码语言模型能近似正常文本的流形，因此探索其用于检测对抗攻击。 Method: 提出MLMD方法，利用掩码和去掩码操作区分正常与对抗文本；进一步提出GradMLMD，通过梯度信息跳过非关键词以减少计算开销。 Result: MLMD检测性能优异但计算开销大；GradMLMD在保持性能的同时显著降低资源消耗。 Conclusion: 梯度引导的掩码语言模型检测方法高效且有效，适用于对抗攻击检测。 Abstract: Textual adversarial examples pose serious threats to the reliability of natural language processing systems. Recent studies suggest that adversarial examples tend to deviate from the underlying manifold of normal texts, whereas pre-trained masked language models can approximate the manifold of normal data. These findings inspire the exploration of masked language models for detecting textual adversarial attacks. We first introduce Masked Language Model-based Detection (MLMD), leveraging the mask and unmask operations of the masked language modeling (MLM) objective to induce the difference in manifold changes between normal and adversarial texts. Although MLMD achieves competitive detection performance, its exhaustive one-by-one masking strategy introduces significant computational overhead. Our posterior analysis reveals that a significant number of non-keywords in the input are not important for detection but consume resources. Building on this, we introduce Gradient-guided MLMD (GradMLMD), which leverages gradient information to identify and skip non-keywords during detection, significantly reducing resource consumption without compromising detection performance.

[467] Exploring the Effectiveness and Interpretability of Texts in LLM-based Time Series Models

Zhengke Sun,Hangwei Qian,Ivor Tsang

Main category: cs.CL

TLDR: 研究发现，在时间序列预测任务中，文本信息的加入并未显著提升性能，且现有框架的文本表示缺乏可解释性。

Details

Motivation: 探讨文本信息在时间序列预测中的实际效果和可解释性。 Method: 通过实验分析文本提示和文本原型的有效性，并提出新指标SMI评估匹配度。 Result: 文本与时间序列模态存在不匹配，且文本信息对预测性能提升有限，可解释性不足。 Conclusion: 研究揭示了当前时间序列LLM中文本的局限性，呼吁关注其可解释性问题。 Abstract: Large Language Models (LLMs) have been applied to time series forecasting tasks, leveraging pre-trained language models as the backbone and incorporating textual data to purportedly enhance the comprehensive capabilities of LLMs for time series. However, are these texts really helpful for interpretation? This study seeks to investigate the actual efficacy and interpretability of such textual incorporations. Through a series of empirical experiments on textual prompts and textual prototypes, our findings reveal that the misalignment between two modalities exists, and the textual information does not significantly improve time series forecasting performance in many cases. Furthermore, visualization analysis indicates that the textual representations learned by existing frameworks lack sufficient interpretability when applied to time series data. We further propose a novel metric named Semantic Matching Index (SMI) to better evaluate the matching degree between time series and texts during our post hoc interpretability investigation. Our analysis reveals the misalignment and limited interpretability of texts in current time-series LLMs, and we hope this study can raise awareness of the interpretability of texts for time series. The code is available at https://github.com/zachysun/TS-Lang-Exp.

[468] CAReDiO: Cultural Alignment of LLM via Representativeness and Distinctiveness Guided Data Optimization

Jing Yao,Xiaoyuan Yi,Jindong Wang,Zhicheng Dou,Xing Xie

Main category: cs.CL

TLDR: 论文提出CAReDiO框架，通过优化文化对话数据的代表性和独特性，解决现有文化对齐LLMs的数据集问题，显著提升性能与效率。

Details

Motivation: 随着LLMs在人类生活中的深入应用，文化对齐对提升用户体验和减少文化冲突至关重要，但现有方法存在数据代表性和独特性不足的问题。 Method: 提出CAReDiO框架，利用LLMs自动生成文化对话数据，并通过优化代表性和独特性构建高效数据集。 Result: 实验表明，CAReDiO生成的数据更有效，仅需100个训练样本即可实现文化对齐，显著提升性能与效率。 Conclusion: CAReDiO为文化对齐LLMs提供了一种高效的数据构建方法，解决了现有数据集的局限性。 Abstract: As Large Language Models (LLMs) more deeply integrate into human life across various regions, aligning them with pluralistic cultures is crucial for improving user experience and mitigating cultural conflicts. Existing approaches develop culturally aligned LLMs primarily through fine-tuning with massive carefully curated culture-specific corpora. Nevertheless, inspired by culture theories, we identify two key challenges faced by these datasets: (1) Representativeness: These corpora fail to fully capture the target culture's core characteristics with redundancy, causing computation waste; (2) Distinctiveness: They struggle to distinguish the unique nuances of a given culture from shared patterns across other relevant ones, hindering precise cultural modeling. To handle these challenges, we introduce CAReDiO, a novel cultural data construction framework. Specifically, CAReDiO utilizes powerful LLMs to automatically generate cultural conversation data, where both the queries and responses are further optimized by maximizing representativeness and distinctiveness. Using CAReDiO, we construct a small yet effective dataset, covering five cultures, and compare it with several recent cultural corpora. Extensive experiments demonstrate that our method generates more effective data and enables cultural alignment with as few as 100 training samples, enhancing both performance and efficiency.

[469] SD$^2$: Self-Distilled Sparse Drafters

Mike Lasby,Nish Sinnadurai,Valavan Manohararajah,Sean Lie,Vithursan Thangarasa

Main category: cs.CL

TLDR: SD$^2$是一种自蒸馏稀疏草稿模型方法，通过细粒度权重稀疏化和自数据蒸馏，显著提升草稿模型的效率和目标模型的对齐性，减少计算量并提高令牌接受率。

Details

Motivation: 减少大型语言模型（LLM）的推理延迟，同时保持与目标模型的对齐性。 Method: 采用自数据蒸馏和细粒度权重稀疏化技术，生成高效的草稿模型。 Result: 在Llama-3.1-70B目标模型上，SD$^2$比层剪枝草稿模型平均接受长度（MAL）提高1.59倍，计算量减少43.87%，且比密集草稿模型MAL仅降低8.36%。 Conclusion: 稀疏感知的微调和压缩策略能有效提升LLM推理效率，同时保持与目标模型的对齐性。 Abstract: Speculative decoding is a powerful technique for reducing the latency of Large Language Models (LLMs), offering a fault-tolerant framework that enables the use of highly compressed draft models. In this work, we introduce Self-Distilled Sparse Drafters (SD$^2$), a novel methodology that leverages self-data distillation and fine-grained weight sparsity to produce highly efficient and well-aligned draft models. SD$^2$ systematically enhances draft token acceptance rates while significantly reducing Multiply-Accumulate operations (MACs), even in the Universal Assisted Generation (UAG) setting, where draft and target models originate from different model families. On a Llama-3.1-70B target model, SD$^2$ provides a $\times$1.59 higher Mean Accepted Length (MAL) compared to layer-pruned draft models and reduces MACs by over 43.87% with a 8.36% reduction in MAL compared to a dense draft models. Our results highlight the potential of sparsity-aware fine-tuning and compression strategies to improve LLM inference efficiency while maintaining alignment with target models.

[470] Forecasting Communication Derailments Through Conversation Generation

Yunfan Zhang,Kathleen McKeown,Smaranda Muresan

Main category: cs.CL

TLDR: 本文提出了一种基于微调LLM的方法，通过采样多个未来对话轨迹预测沟通脱轨，并结合社会语言学属性提升预测准确性。

Details

Motivation: 预测沟通脱轨在内容审核、冲突解决和商业谈判等场景中有实际应用价值，但现有语言模型在此任务上表现不佳。 Method: 使用微调LLM基于现有对话历史采样多个未来对话轨迹，并结合社会语言学属性指导生成，通过共识预测沟通结果。 Result: 方法在英语沟通脱轨预测基准上超越现有技术，消融实验显示显著准确性提升。 Conclusion: 通过生成未来对话轨迹并结合社会语言学属性，显著提升了沟通脱轨预测的准确性。 Abstract: Forecasting communication derailment can be useful in real-world settings such as online content moderation, conflict resolution, and business negotiations. However, despite language models' success at identifying offensive speech present in conversations, they struggle to forecast future communication derailments. In contrast to prior work that predicts conversation outcomes solely based on the past conversation history, our approach samples multiple future conversation trajectories conditioned on existing conversation history using a fine-tuned LLM. It predicts the communication outcome based on the consensus of these trajectories. We also experimented with leveraging socio-linguistic attributes, which reflect turn-level conversation dynamics, as guidance when generating future conversations. Our method of future conversation trajectories surpasses state-of-the-art results on English communication derailment prediction benchmarks and demonstrates significant accuracy gains in ablation studies.

[471] Generating Planning Feedback for Open-Ended Programming Exercises with LLMs

Mehmet Arif Demirtaş,Claire Zheng,Max Fowler,Kathryn Cunningham

Main category: cs.CL

TLDR: 论文提出了一种利用大语言模型（LLM）检测学生编程计划的方法，以提供反馈，即使代码存在语法错误。GPT-4o及其小型变体（GPT-4o-mini）在检测计划方面表现优异，且小型模型经过微调后效果接近GPT-4o。

Details

Motivation: 当前自动评分系统仅关注代码最终正确性，无法提供关于学生规划过程的反馈，而LLM可以填补这一空白。 Method: 利用LLM（如GPT-4o和GPT-4o-mini）检测学生代码中的高级目标和模式（编程计划），并通过微调小型模型实现高效实时评分。 Result: GPT-4o和GPT-4o-mini在检测编程计划方面表现优异，小型模型微调后效果接近GPT-4o，适合实时评分。 Conclusion: LLM（尤其是小型变体）可用于开放编程练习的自动评分，为学生提供规划反馈，并可能扩展到数学和物理等其他领域。 Abstract: To complete an open-ended programming exercise, students need to both plan a high-level solution and implement it using the appropriate syntax. However, these problems are often autograded on the correctness of the final submission through test cases, and students cannot get feedback on their planning process. Large language models (LLM) may be able to generate this feedback by detecting the overall code structure even for submissions with syntax errors. To this end, we propose an approach that detects which high-level goals and patterns (i.e. programming plans) exist in a student program with LLMs. We show that both the full GPT-4o model and a small variant (GPT-4o-mini) can detect these plans with remarkable accuracy, outperforming baselines inspired by conventional approaches to code analysis. We further show that the smaller, cost-effective variant (GPT-4o-mini) achieves results on par with state-of-the-art (GPT-4o) after fine-tuning, creating promising implications for smaller models for real-time grading. These smaller models can be incorporated into autograders for open-ended code-writing exercises to provide feedback for students' implicit planning skills, even when their program is syntactically incorrect. Furthermore, LLMs may be useful in providing feedback for problems in other domains where students start with a set of high-level solution steps and iteratively compute the output, such as math and physics problems.

[472] A Fully Automated Pipeline for Conversational Discourse Annotation: Tree Scheme Generation and Labeling with Large Language Models

Kseniia Petukhova,Ekaterina Kochmar

Main category: cs.CL

TLDR: 利用大语言模型（LLMs）自动化构建对话标注方案并进行标注，显著减少时间并匹配或超越人工标注质量。

Details

Motivation: 手动设计标注方案耗时且需专业知识，LLMs的自动化潜力可解决这一问题。 Method: 提出全自动流程，使用LLMs构建标注方案并标注，评估了频率引导决策树与高级LLM的组合效果。 Result: 频率引导决策树与LLM组合优于手动设计方案，匹配或超越人工标注，且大幅减少时间。 Conclusion: 自动化标注方案可行且高效，为未来研究提供了代码和标注资源。 Abstract: Recent advances in Large Language Models (LLMs) have shown promise in automating discourse annotation for conversations. While manually designing tree annotation schemes significantly improves annotation quality for humans and models, their creation remains time-consuming and requires expert knowledge. We propose a fully automated pipeline that uses LLMs to construct such schemes and perform annotation. We evaluate our approach on speech functions (SFs) and the Switchboard-DAMSL (SWBD-DAMSL) taxonomies. Our experiments compare various design choices, and we show that frequency-guided decision trees, paired with an advanced LLM for annotation, can outperform previously manually designed trees and even match or surpass human annotators while significantly reducing the time required for annotation. We release all code and resultant schemes and annotations to facilitate future research on discourse annotation.

[473] From Punchlines to Predictions: A Metric to Assess LLM Performance in Identifying Humor in Stand-Up Comedy

Adrianna Romanowski,Pedro H. V. Valois,Kazuhiro Fukui

Main category: cs.CL

TLDR: 研究评估了LLMs在识别单口喜剧幽默台词中的表现，提出了一种新的幽默检测指标，发现领先模型的准确率最高为51%，略高于人类的41%。

Details

Motivation: 随着LLMs的普及，研究幽默与AI的交集对提升人机交互的自然性至关重要。 Method: 使用单口喜剧剧本作为数据集，提出包含模糊字符串匹配、句子嵌入和子空间相似性的幽默检测指标。 Result: ChatGPT、Claude和DeepSeek等模型的幽默检测准确率最高为51%，略高于人类的41%。 Conclusion: 幽默检测存在主观性，LLMs的表现虽优于人类但仍有限，突显了从现场表演中提取幽默的复杂性。 Abstract: Comedy serves as a profound reflection of the times we live in and is a staple element of human interactions. In light of the widespread adoption of Large Language Models (LLMs), the intersection of humor and AI has become no laughing matter. Advancements in the naturalness of human-computer interaction correlates with improvements in AI systems' abilities to understand humor. In this study, we assess the ability of models in accurately identifying humorous quotes from a stand-up comedy transcript. Stand-up comedy's unique comedic narratives make it an ideal dataset to improve the overall naturalness of comedic understanding. We propose a novel humor detection metric designed to evaluate LLMs amongst various prompts on their capability to extract humorous punchlines. The metric has a modular structure that offers three different scoring methods - fuzzy string matching, sentence embedding, and subspace similarity - to provide an overarching assessment of a model's performance. The model's results are compared against those of human evaluators on the same task. Our metric reveals that regardless of prompt engineering, leading models, ChatGPT, Claude, and DeepSeek, achieve scores of at most 51% in humor detection. Notably, this performance surpasses that of humans who achieve a score of 41%. The analysis of human evaluators and LLMs reveals variability in agreement, highlighting the subjectivity inherent in humor and the complexities involved in extracting humorous quotes from live performance transcripts. Code available at https://github.com/swaggirl9000/humor.

[474] Exploration of Plan-Guided Summarization for Narrative Texts: the Case of Small Language Models

Matt Grenander,Siddharth Varia,Paula Czarnowska,Yogarshi Vyas,Kishaloy Halder,Bonan Min

Main category: cs.CL

TLDR: 研究探讨了基于计划的方法是否能在小型语言模型（SLM）中改善长文档叙事文本的摘要生成，结果发现计划引导的方法并未显著提升摘要质量或忠实度。

Details

Motivation: 叙事文本的长度和复杂性使其难以忠实摘要，研究旨在验证计划引导方法是否有效。 Method: 分析了针对细粒度细节的计划引导方法，并提出了更高层次的叙事计划方案。 Result: 计划引导方法在摘要质量或忠实度上未显著优于无计划基线，且计划本身同样可能包含幻觉。 Conclusion: 研究警示计划引导方法在长复杂文本摘要中的局限性。 Abstract: Plan-guided summarization attempts to reduce hallucinations in small language models (SLMs) by grounding generated summaries to the source text, typically by targeting fine-grained details such as dates or named entities. In this work, we investigate whether plan-based approaches in SLMs improve summarization in long document, narrative tasks. Narrative texts' length and complexity often mean they are difficult to summarize faithfully. We analyze existing plan-guided solutions targeting fine-grained details, and also propose our own higher-level, narrative-based plan formulation. Our results show that neither approach significantly improves on a baseline without planning in either summary quality or faithfulness. Human evaluation reveals that while plan-guided approaches are often well grounded to their plan, plans are equally likely to contain hallucinations compared to summaries. As a result, the plan-guided summaries are just as unfaithful as those from models without planning. Our work serves as a cautionary tale to plan-guided approaches to summarization, especially for long, complex domains such as narrative texts.

[475] A Multi-view Discourse Framework for Integrating Semantic and Syntactic Features in Dialog Agents

Akanksha Mehndiratta,Krishna Asawa

Main category: cs.CL

TLDR: 本文提出了一种基于多视角典型相关分析（MCCA）和典型相关分析（CCA）的对话感知框架，用于检索式对话系统中的响应选择，显著提升了性能。

Details

Motivation: 现有方法常忽略对话中话语间的交互或将其视为同等重要，缺乏对语义和句法特征的整合。 Method: 模型分两步：首先用MCCA编码话语和响应的上下文、位置和句法特征；然后通过CCA学习共享子空间中的话语标记，捕捉话语与周围轮次的关系。 Result: 在Ubuntu对话语料库上的实验表明，模型在自动评估指标上显著提升。 Conclusion: 该框架通过整合语义和句法特征，有效提升了对话系统的响应选择能力。 Abstract: Multiturn dialogue models aim to generate human-like responses by leveraging conversational context, consisting of utterances from previous exchanges. Existing methods often neglect the interactions between these utterances or treat all of them as equally significant. This paper introduces a discourse-aware framework for response selection in retrieval-based dialogue systems. The proposed model first encodes each utterance and response with contextual, positional, and syntactic features using Multi-view Canonical Correlation Analysis (MCCA). It then learns discourse tokens that capture relationships between an utterance and its surrounding turns in a shared subspace via Canonical Correlation Analysis (CCA). This two-step approach effectively integrates semantic and syntactic features to build discourse-level understanding. Experiments on the Ubuntu Dialogue Corpus demonstrate that our model achieves significant improvements in automatic evaluation metrics, highlighting its effectiveness in response selection.

[476] Enhancing Dialogue Systems with Discourse-Level Understanding Using Deep Canonical Correlation Analysis

Akanksha Mehndiratta,Krishna Asawa

Main category: cs.CL

TLDR: 论文提出了一种基于深度典型相关分析（DCCA）的新框架，用于提升对话系统对长期对话历史的理解能力，实验表明其在响应选择任务中表现优异。

Details

Motivation: 现有对话模型在捕捉和利用长期对话历史方面存在局限性，需要更上下文感知的系统。 Method: 提出了一种集成DCCA的框架，通过学习话语标记来捕捉话语与其上下文的关系，从而理解长期依赖。 Result: 在Ubuntu对话语料库上的实验显示，响应选择任务中自动评估指标得分显著提升。 Conclusion: DCCA框架能有效过滤无关上下文并保留关键话语信息，为对话系统的改进提供了潜力。 Abstract: The evolution of conversational agents has been driven by the need for more contextually aware systems that can effectively manage dialogue over extended interactions. To address the limitations of existing models in capturing and utilizing long-term conversational history, we propose a novel framework that integrates Deep Canonical Correlation Analysis (DCCA) for discourse-level understanding. This framework learns discourse tokens to capture relationships between utterances and their surrounding context, enabling a better understanding of long-term dependencies. Experiments on the Ubuntu Dialogue Corpus demonstrate significant enhancement in response selection, based on the improved automatic evaluation metric scores. The results highlight the potential of DCCA in improving dialogue systems by allowing them to filter out irrelevant context and retain critical discourse information for more accurate response retrieval.

[477] Optimizing FDTD Solvers for Electromagnetics: A Compiler-Guided Approach with High-Level Tensor Abstractions

Yifei He,Måns I. Andersson,Stefano Markidis

Main category: cs.CL

TLDR: 本文提出了一种基于MLIR/LLVM的领域特定编译器，用于优化FDTD方法的计算性能，解决了传统实现中的可移植性和效率问题。

Details

Motivation: 传统FDTD方法依赖于手写、平台特定的代码，导致开发成本高且性能受限，难以适应现代硬件架构。 Method: 采用MLIR/LLVM基础设施，构建了一个端到端的领域特定编译器，通过高阶优化（如循环分块、融合和向量化）生成高效且可移植的代码。 Result: 在Intel、AMD和ARM平台上，相比基于NumPy的Python实现，实现了高达10倍的加速。 Conclusion: 该方法显著提升了FDTD模拟的计算效率和可移植性，为复杂电磁场模拟提供了更优的解决方案。 Abstract: The Finite Difference Time Domain (FDTD) method is a widely used numerical technique for solving Maxwell's equations, particularly in computational electromagnetics and photonics. It enables accurate modeling of wave propagation in complex media and structures but comes with significant computational challenges. Traditional FDTD implementations rely on handwritten, platform-specific code that optimizes certain kernels while underperforming in others. The lack of portability increases development overhead and creates performance bottlenecks, limiting scalability across modern hardware architectures. To address these challenges, we introduce an end-to-end domain-specific compiler based on the MLIR/LLVM infrastructure for FDTD simulations. Our approach generates efficient and portable code optimized for diverse hardware platforms.We implement the three-dimensional FDTD kernel as operations on a 3D tensor abstraction with explicit computational semantics. High-level optimizations such as loop tiling, fusion, and vectorization are automatically applied by the compiler. We evaluate our customized code generation pipeline on Intel, AMD, and ARM platforms, achieving up to $10\times$ speedup over baseline Python implementation using NumPy.

[478] VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search

Yikun Wang,Siyin Wang,Qinyuan Cheng,Zhaoye Fei,Liang Ding,Qipeng Guo,Dacheng Tao,Xipeng Qiu

Main category: cs.CL

TLDR: VisuoThink是一个新框架，通过结合视觉和语言领域，提升复杂推理任务的表现。

Details

Motivation: 现有的大视觉语言模型在复杂推理任务中表现不佳，缺乏人类视觉-语言推理的交互性。 Method: VisuoThink整合视觉空间和语言领域，支持渐进式视觉-文本推理，并引入前瞻树搜索进行测试时扩展。 Result: 实验表明，VisuoThink在几何和空间推理任务中达到最先进性能，无需微调即可显著提升推理能力。 Conclusion: VisuoThink通过多模态慢思考机制，有效解决了复杂推理任务中的局限性。 Abstract: Recent advancements in Large Vision-Language Models have showcased remarkable capabilities. However, they often falter when confronted with complex reasoning tasks that humans typically address through visual aids and deliberate, step-by-step thinking. While existing methods have explored text-based slow thinking or rudimentary visual assistance, they fall short of capturing the intricate, interleaved nature of human visual-verbal reasoning processes. To overcome these limitations and inspired by the mechanisms of slow thinking in human cognition, we introduce VisuoThink, a novel framework that seamlessly integrates visuospatial and linguistic domains. VisuoThink facilitates multimodal slow thinking by enabling progressive visual-textual reasoning and incorporates test-time scaling through look-ahead tree search. Extensive experiments demonstrate that VisuoThink significantly enhances reasoning capabilities via inference-time scaling, even without fine-tuning, achieving state-of-the-art performance in tasks involving geometry and spatial reasoning.

[479] Efficient and Asymptotically Unbiased Constrained Decoding for Large Language Models

Haotian Ye,Himanshu Jain,Chong You,Ananda Theertha Suresh,Haowei Lin,James Zou,Felix Yu

Main category: cs.CL

TLDR: 论文提出了一种名为DISC的新算法，结合GPU并行前缀验证（PPV），解决了现有基于前缀树的约束解码方法的效率低下和偏差问题。

Details

Motivation: 在大型语言模型的实际应用中，输出常需符合预定义约束（如产品选择、安全标准或格式要求），但现有约束解码方法效率低且引入偏差。 Method: 提出动态重要性采样（DISC）和GPU并行前缀验证（PPV），确保理论上的渐进无偏性，并提升效率。 Result: 实验表明，DISC在效率和输出质量上均优于现有方法。 Conclusion: DISC方法在需严格遵循约束的应用中具有潜力，能显著提升生成质量与效率。 Abstract: In real-world applications of large language models, outputs are often required to be confined: selecting items from predefined product or document sets, generating phrases that comply with safety standards, or conforming to specialized formatting styles. To control the generation, constrained decoding has been widely adopted. However, existing prefix-tree-based constrained decoding is inefficient under GPU-based model inference paradigms, and it introduces unintended biases into the output distribution. This paper introduces Dynamic Importance Sampling for Constrained Decoding (DISC) with GPU-based Parallel Prefix-Verification (PPV), a novel algorithm that leverages dynamic importance sampling to achieve theoretically guaranteed asymptotic unbiasedness and overcomes the inefficiency of prefix-tree. Extensive experiments demonstrate the superiority of our method over existing methods in both efficiency and output quality. These results highlight the potential of our methods to improve constrained generation in applications where adherence to specific constraints is essential.

[480] Can postgraduate translation students identify machine-generated text?

Michael Farrell

Main category: cs.CL

TLDR: 研究探讨语言学训练者能否区分AI生成文本与人类写作，结果显示参与者难以区分，需改进训练方法。

Details

Motivation: 随着生成式AI在多语言内容创作中的广泛应用，研究人类是否能识别AI生成文本。 Method: 23名翻译研究生接受简短培训后，分析意大利散文片段并评分是否为AI生成。 Result: 参与者平均难以区分AI与人类文本，仅两人表现较好；低爆发性和自相矛盾更常见于AI文本。 Conclusion: 需改进培训方法，并探讨AI文本是否需进一步编辑以更自然。 Abstract: Given the growing use of generative artificial intelligence as a tool for creating multilingual content and bypassing both machine and traditional translation methods, this study explores the ability of linguistically trained individuals to discern machine-generated output from human-written text (HT). After brief training sessions on the textual anomalies typically found in synthetic text (ST), twenty-three postgraduate translation students analysed excerpts of Italian prose and assigned likelihood scores to indicate whether they believed they were human-written or AI-generated (ChatGPT-4o). The results show that, on average, the students struggled to distinguish between HT and ST, with only two participants achieving notable accuracy. Closer analysis revealed that the students often identified the same textual anomalies in both HT and ST, although features such as low burstiness and self-contradiction were more frequently associated with ST. These findings suggest the need for improvements in the preparatory training. Moreover, the study raises questions about the necessity of editing synthetic text to make it sound more human-like and recommends further research to determine whether AI-generated text is already sufficiently natural-sounding not to require further refinement.

[481] Langformers: Unified NLP Pipelines for Language Models

Rabindra Lamsal,Maria Rodriguez Read,Shanika Karunasekera

Main category: cs.CL

TLDR: Langformers是一个开源Python库，旨在通过统一的工厂接口简化NLP任务，支持多种功能，如对话AI、文本分类等，并集成流行平台。

Details

Motivation: Transformer语言模型在NLP领域有重大影响，但使用复杂，涉及多框架和重复代码，阻碍非程序员和初学者，影响开发效率。 Method: Langformers提供任务特定的工厂接口，抽象训练、推理和部署的复杂性，内置内存和流式处理，设计轻量模块化。 Result: Langformers整合多种NLP任务，支持Hugging Face等平台，简化流程，提升易用性。 Conclusion: Langformers通过统一接口和模块化设计，解决了NLP任务中的复杂性，适合广泛用户群体。 Abstract: Transformer-based language models have revolutionized the field of natural language processing (NLP). However, using these models often involves navigating multiple frameworks and tools, as well as writing repetitive boilerplate code. This complexity can discourage non-programmers and beginners, and even slow down prototyping for experienced developers. To address these challenges, we introduce Langformers, an open-source Python library designed to streamline NLP pipelines through a unified, factory-based interface for large language model (LLM) and masked language model (MLM) tasks. Langformers integrates conversational AI, MLM pretraining, text classification, sentence embedding/reranking, data labelling, semantic search, and knowledge distillation into a cohesive API, supporting popular platforms such as Hugging Face and Ollama. Key innovations include: (1) task-specific factories that abstract training, inference, and deployment complexities; (2) built-in memory and streaming for conversational agents; and (3) lightweight, modular design that prioritizes ease of use. Documentation: https://langformers.com

[482] Parameterized Synthetic Text Generation with SimpleStories

Lennart Finke,Thomas Dooms,Mat Allen,Juan Diego Rodriguez,Noa Nabeshima,Dan Braun

Main category: cs.CL

TLDR: SimpleStories是一个包含200万篇英语和日语简单语言故事的大规模合成数据集，通过多级抽象特征参数化控制故事特性，确保语法和语义多样性。

Details

Motivation: 解决TinyStories数据集的局限性，证明在合成文本生成中可以同时实现简单性和多样性。 Method: 采用多级抽象特征参数化的提示方法，系统控制故事特性。 Result: 成功生成了语法和语义多样的大规模简单语言故事数据集。 Conclusion: 研究表明，合成文本生成可以兼顾简单性和多样性，为相关领域提供了新资源。 Abstract: We present SimpleStories, a large synthetic story dataset in simple language, consisting of 2 million stories each in English and Japanese. Our method employs parametrization of prompts with features at multiple levels of abstraction, allowing for systematic control over story characteristics to ensure broad syntactic and semantic diversity. Building on and addressing limitations in the TinyStories dataset, our approach demonstrates that simplicity and variety can be achieved simultaneously in synthetic text generation at scale.

[483] Feature-Aware Malicious Output Detection and Mitigation

Weilong Dong,Peiguang Li,Yu Tian,Xinyi Zeng,Fengdi Li,Sirui Wang

Main category: cs.CL

TLDR: 提出了一种特征感知方法（FMM），用于检测和拒绝大语言模型中的恶意内容，通过调整模型的拒绝机制来增强安全性。

Details

Motivation: 大语言模型（LLMs）在带来便利的同时，存在无法识别恶意内容的风险，需要一种方法来增强其防御能力。 Method: 使用特征感知方法（FMM），在解码阶段检测恶意特征，并通过激活修补技术调整模型生成机制。 Result: 实验证明FMM能有效防御多种攻击技术，同时保持模型的正常生成能力。 Conclusion: FMM是一种有效的方法，可增强LLMs的安全性而不损害其功能。 Abstract: The rapid advancement of large language models (LLMs) has brought significant benefits to various domains while introducing substantial risks. Despite being fine-tuned through reinforcement learning, LLMs lack the capability to discern malicious content, limiting their defense against jailbreak. To address these safety concerns, we propose a feature-aware method for harmful response rejection (FMM), which detects the presence of malicious features within the model's feature space and adaptively adjusts the model's rejection mechanism. By employing a simple discriminator, we detect potential malicious traits during the decoding phase. Upon detecting features indicative of toxic tokens, FMM regenerates the current token. By employing activation patching, an additional rejection vector is incorporated during the subsequent token generation, steering the model towards a refusal response. Experimental results demonstrate the effectiveness of our approach across multiple language models and diverse attack techniques, while crucially maintaining the models' standard generation capabilities.

[484] Enhancing Contrastive Demonstration Selection with Semantic Diversity for Robust In-Context Machine Translation

Owen Patterson,Chee Ng

Main category: cs.CL

TLDR: 论文提出DiverseConE方法，通过增强示例多样性改进上下文学习中的演示选择，显著提升机器翻译性能。

Details

Motivation: 现有方法忽视演示示例的多样性，导致上下文学习性能不稳定。 Method: 结合对比选择和嵌入空间多样性增强，提出DiverseConE方法。 Result: 在多个语言对和设置下，DiverseConE优于基线方法，验证了多样性对翻译质量的重要性。 Conclusion: 多样性增强的演示选择能有效提升机器翻译性能，为上下文学习提供了新思路。 Abstract: In-Context Learning (ICL) empowers large language models to perform tasks by conditioning on a few input-output examples. However, the performance of ICL is highly sensitive to the selection of these demonstrations. While existing methods focus on similarity or contrastive selection, they often overlook the importance of diversity among the chosen examples. In this paper, we propose DiverseConE (Diversity-Enhanced Contrastive Example Selection), a novel approach for demonstration selection in in-context learning for machine translation. Our method builds upon contrastive selection by incorporating a diversity enhancement step based on embedding space dissimilarity. We conduct extensive experiments on the Llama2-7b model across four language pairs (English-Chinese, Chinese-English, Russian-German, German-Russian) in 1-shot and 3-shot settings, using COMET20 and COMET22 for evaluation. Our results demonstrate that DiverseConE consistently outperforms strong baseline methods, including random selection, BM25, TopK, and a state-of-the-art contrastive selection method. Further analysis, including diversity metrics and human evaluation, validates the effectiveness of our approach and highlights the benefits of considering demonstration diversity for improved translation quality.

[485] Improving the Accuracy and Efficiency of Legal Document Tagging with Large Language Models and Instruction Prompts

Emily Johnson,Xavier Holt,Noah Wilson

Main category: cs.CL

TLDR: Legal-LLM利用大型语言模型的指令微调能力，将多标签分类任务重构为结构化生成问题，显著提升了法律文档分类的准确性。

Details

Motivation: 法律多标签分类面临法律语言复杂、标签依赖性强和标签不平衡等挑战，需要更高效的解决方案。 Method: 通过指令微调大型语言模型，将多标签分类任务转化为结构化生成问题，直接输出相关法律类别。 Result: 在POSTURE50K和EURLEX57K数据集上，Legal-LLM在micro-F1和macro-F1得分上优于传统方法和基于Transformer的基线模型。 Conclusion: Legal-LLM能有效处理标签不平衡问题，生成准确的法律标签，验证了其方法的有效性。 Abstract: Legal multi-label classification is a critical task for organizing and accessing the vast amount of legal documentation. Despite its importance, it faces challenges such as the complexity of legal language, intricate label dependencies, and significant label imbalance. In this paper, we propose Legal-LLM, a novel approach that leverages the instruction-following capabilities of Large Language Models (LLMs) through fine-tuning. We reframe the multi-label classification task as a structured generation problem, instructing the LLM to directly output the relevant legal categories for a given document. We evaluate our method on two benchmark datasets, POSTURE50K and EURLEX57K, using micro-F1 and macro-F1 scores. Our experimental results demonstrate that Legal-LLM outperforms a range of strong baseline models, including traditional methods and other Transformer-based approaches. Furthermore, ablation studies and human evaluations validate the effectiveness of our approach, particularly in handling label imbalance and generating relevant and accurate legal labels.

[486] QUDsim: Quantifying Discourse Similarities in LLM-Generated Text

Ramya Namuduri,Yating Wu,Anshun Asher Zheng,Manya Wadhwa,Greg Durrett,Junyi Jessy Li

Main category: cs.CL

TLDR: 论文提出了一种基于问题讨论（QUD）的相似性度量方法QUDsim，用于量化语言模型生成文本的结构相似性，发现LLM在结构上比人类更重复且偏离人类写作模式。

Details

Motivation: 大型语言模型（LLM）在生成独特和创造性内容方面存在不足，现有相似性度量方法主要关注内容重叠，无法捕捉结构相似性。 Method: 基于语言学理论中的问题讨论（QUD）和问题语义学，提出QUDsim度量方法，用于检测文档间的结构相似性。 Result: 发现LLM生成的文本在结构上比人类更重复，且其结构类型与人类作者存在显著差异。 Conclusion: QUDsim能有效量化文本结构相似性，揭示LLM在结构生成上的局限性，为改进模型提供方向。 Abstract: As large language models become increasingly capable at various writing tasks, their weakness at generating unique and creative content becomes a major liability. Although LLMs have the ability to generate text covering diverse topics, there is an overall sense of repetitiveness across texts that we aim to formalize and quantify via a similarity metric. The familiarity between documents arises from the persistence of underlying discourse structures. However, existing similarity metrics dependent on lexical overlap and syntactic patterns largely capture $\textit{content}$ overlap, thus making them unsuitable for detecting $\textit{structural}$ similarities. We introduce an abstraction based on linguistic theories in Questions Under Discussion (QUD) and question semantics to help quantify differences in discourse progression. We then use this framework to build $\textbf{QUDsim}$, a similarity metric that can detect discursive parallels between documents. Using QUDsim, we find that LLMs often reuse discourse structures (more so than humans) across samples, even when content differs. Furthermore, LLMs are not only repetitive and structurally uniform, but are also divergent from human authors in the types of structures they use.

[487] Can you map it to English? The Role of Cross-Lingual Alignment in Multilingual Performance of LLMs

Kartik Ravisankar,Hyojung Han,Marine Carpuat

Main category: cs.CL

TLDR: 研究探讨了多语言大模型（LLMs）中跨语言表征对齐与任务性能的关系，发现语言级别的对齐与任务准确性高度相关，但样本级别的对齐不足以区分预测正确与否。

Details

Motivation: 理解多语言大模型在跨语言任务中表现优异的机制，尤其是表征对齐与任务性能的关系。 Method: 引入跨语言对齐指标（如DALI），在三个自然语言理解任务和机器翻译任务中分析表征对齐与性能的相关性。 Result: 语言级别的对齐与任务准确性高度相关，但样本级别的对齐无法有效区分预测正确与否。 Conclusion: 跨语言对齐是任务成功的必要条件，但非充分条件。 Abstract: Large language models (LLMs) pre-trained predominantly on English text exhibit surprising multilingual capabilities, yet the mechanisms driving cross-lingual generalization remain poorly understood. This work investigates how the alignment of representations for text written in different languages correlates with LLM performance on natural language understanding tasks and translation tasks, both at the language and the instance level. For this purpose, we introduce cross-lingual alignment metrics such as the Discriminative Alignment Index (DALI) to quantify the alignment at an instance level for discriminative tasks. Through experiments on three natural language understanding tasks (Belebele, XStoryCloze, XCOPA), and machine translation, we find that while cross-lingual alignment metrics strongly correlate with task accuracy at the language level, the sample-level alignment often fails to distinguish correct from incorrect predictions, exposing alignment as a necessary but insufficient condition for success.

[488] On Language Models' Sensitivity to Suspicious Coincidences

Sriram Padmanabhan,Kanishka Misra,Kyle Mahowald,Eunsol Choi

Main category: cs.CL

TLDR: 论文研究了语言模型（LMs）是否表现出人类对‘可疑巧合’的敏感性，通过数字游戏和城市名称实验，发现零-shot行为中无明显证据，但在提供假设空间后，LMs开始表现出类似人类的行为。

Details

Motivation: 人类在归纳推理中对数据采样方式敏感，倾向于选择更具体的假设。研究旨在验证LMs是否也表现出这种敏感性。 Method: 通过数字游戏和城市名称实验，分析LMs在零-shot和提供假设空间（如链式思考或显式提示）下的行为。 Result: 零-shot行为中未发现明显证据，但在提供假设空间后，LMs开始表现出类似‘可疑巧合’的效应，甚至与人类行为一致。 Conclusion: LMs的归纳推理行为可以通过显式访问假设空间增强，表明其潜力在更复杂的推理任务中发挥作用。 Abstract: Humans are sensitive to suspicious coincidences when generalizing inductively over data, as they make assumptions as to how the data was sampled. This results in smaller, more specific hypotheses being favored over more general ones. For instance, when provided the set {Austin, Dallas, Houston}, one is more likely to think that this is sampled from "Texas Cities" over "US Cities" even though both are compatible. Suspicious coincidence is strongly connected to pragmatic reasoning, and can serve as a testbed to analyze systems on their sensitivity towards the communicative goals of the task (i.e., figuring out the true category underlying the data). In this paper, we analyze whether suspicious coincidence effects are reflected in language models' (LMs) behavior. We do so in the context of two domains: 1) the number game, where humans made judgments of whether a number (e.g., 4) fits a list of given numbers (e.g., 16, 32, 2); and 2) by extending the number game setup to prominent cities. For both domains, the data is compatible with multiple hypotheses and we study which hypothesis is most consistent with the models' behavior. On analyzing five models, we do not find strong evidence for suspicious coincidences in LMs' zero-shot behavior. However, when provided access to the hypotheses space via chain-of-thought or explicit prompting, LMs start to show an effect resembling suspicious coincidences, sometimes even showing effects consistent with humans. Our study suggests that inductive reasoning behavior in LMs can be enhanced with explicit access to the hypothesis landscape.

[489] Beyond Memorization: Mapping the Originality-Quality Frontier of Language Models

Vishakh Padmakumar,Chen Yueh-Han,Jane Pan,Valerie Chen,He He

Main category: cs.CL

TLDR: 论文提出了一种新的评估大语言模型（LLM）生成内容新颖性的指标，平衡了原创性和质量，并通过实验发现LLM生成文本的新颖性低于人类写作。

Details

Motivation: 随着大语言模型在创意和科学发现中的应用增多，评估其生成内容的新颖性变得重要。现有方法仅关注原创性而忽视质量，或依赖人类偏好但可靠性有限。 Method: 提出了一种新颖性指标，结合未见于训练数据的n-gram比例和任务特定质量分数的调和平均数，并在故事完成、诗歌创作和创意工具使用任务上评估了OLMo和Pythia模型。 Result: LLM生成文本的新颖性低于人类写作，且实验表明提升新颖性的方法往往以牺牲质量为代价，而增加模型规模或后训练能更有效地改善新颖性。 Conclusion: 提升模型基础能力比单纯调整推理方法更能有效提高生成内容的新颖性。 Abstract: As large language models (LLMs) are increasingly used for ideation and scientific discovery, it is important to evaluate their ability to generate novel output. Prior work evaluates novelty as the originality with respect to training data, but original outputs can be low quality. In contrast, non-expert judges may favor high-quality but memorized outputs, limiting the reliability of human preference as a metric. We propose a new novelty metric for LLM generations that balances originality and quality -- the harmonic mean of the fraction of \ngrams unseen during training and a task-specific quality score. We evaluate the novelty of generations from two families of open-data models (OLMo and Pythia) on three creative tasks: story completion, poetry writing, and creative tool use. We find that LLM generated text is less novel than human written text. To elicit more novel outputs, we experiment with various inference-time methods, which reveals a trade-off between originality and quality. While these methods can boost novelty, they do so by increasing originality at the expense of quality. In contrast, increasing model size or applying post-training reliably shifts the Pareto frontier, highlighting that starting with a stronger base model is a more effective way to improve novelty.

[490] Evaluation Under Imperfect Benchmarks and Ratings: A Case Study in Text Simplification

Joseph Liu,Yoonsoo Nam,Xinyue Cui,Swabha Swayamdipta

Main category: cs.CL

TLDR: 论文提出了一种新的文本简化评估方法，通过合成数据和LLM陪审团评分解决现有评估中的不一致性问题。

Details

Motivation: 现有文本简化评估存在数据质量低和人工评分不一致的问题，导致评估不可靠。 Method: 引入SynthSimpliEval合成基准，使用LLM陪审团评分，并改进现有可学习指标。 Result: 合成数据和LLM评分提高了评估的一致性和可靠性，更大模型表现更好。 Conclusion: 高质量测试数据和LLM陪审团评分是实现可靠评估的关键。 Abstract: Despite the successes of language models, their evaluation remains a daunting challenge for new and existing tasks. We consider the task of text simplification, commonly used to improve information accessibility, where evaluation faces two major challenges. First, the data in existing benchmarks might not reflect the capabilities of current language models on the task, often containing disfluent, incoherent, or simplistic examples. Second, existing human ratings associated with the benchmarks often contain a high degree of disagreement, resulting in inconsistent ratings; nevertheless, existing metrics still have to show higher correlations with these imperfect ratings. As a result, evaluation for the task is not reliable and does not reflect expected trends (e.g., more powerful models being assigned higher scores). We address these challenges for the task of text simplification through three contributions. First, we introduce SynthSimpliEval, a synthetic benchmark for text simplification featuring simplified sentences generated by models of varying sizes. Through a pilot study, we show that human ratings on our benchmark exhibit high inter-annotator agreement and reflect the expected trend: larger models produce higher-quality simplifications. Second, we show that auto-evaluation with a panel of LLM judges (LLMs-as-a-jury) often suffices to obtain consistent ratings for the evaluation of text simplification. Third, we demonstrate that existing learnable metrics for text simplification benefit from training on our LLMs-as-a-jury-rated synthetic data, closing the gap with pure LLMs-as-a-jury for evaluation. Overall, through our case study on text simplification, we show that a reliable evaluation requires higher quality test data, which could be obtained through synthetic data and LLMs-as-a-jury ratings.

[491] Composable NLP Workflows for BERT-based Ranking and QA System

Gaurav Kumar,Murali Mohana Krishna Dandu

Main category: cs.CL

TLDR: 本文介绍了一种基于Forte工具包的端到端排序与问答系统，利用BERT和RoBERTa等先进模型，在MS-MARCO和Covid-19数据集上评估性能，并展示了模块化设计和低延迟的优势。

Details

Motivation: 现实系统中的多组件交互和文本粒度差异问题需要解决，以简化复杂NLP应用的构建。 Method: 使用Forte工具包构建模块化NLP流水线，集成BERT和RoBERTa等深度学习模型，评估指标包括BLUE、MRR和F1。 Result: 在MS-MARCO和Covid-19数据集上，排序与问答系统的性能与基准结果进行了对比，展示了模块化设计的有效性。 Conclusion: 模块化流水线和低延迟设计为构建复杂NLP应用提供了便捷性。 Abstract: There has been a lot of progress towards building NLP models that scale to multiple tasks. However, real-world systems contain multiple components and it is tedious to handle cross-task interaction with varying levels of text granularity. In this work, we built an end-to-end Ranking and Question-Answering (QA) system using Forte, a toolkit that makes composable NLP pipelines. We utilized state-of-the-art deep learning models such as BERT, RoBERTa in our pipeline, evaluated the performance on MS-MARCO and Covid-19 datasets using metrics such as BLUE, MRR, F1 and compared the results of ranking and QA systems with their corresponding benchmark results. The modular nature of our pipeline and low latency of reranker makes it easy to build complex NLP applications easily.

[492] Question Tokens Deserve More Attention: Enhancing Large Language Models without Training through Step-by-Step Reading and Question Attention Recalibration

Feijiang Han,Licheng Guo,Hengtao Cui,Zhiyuan Lyu

Main category: cs.CL

TLDR: 论文研究了大型语言模型（LLM）在复杂问题理解中的局限性，提出了基于提示的策略和注意力重校准机制，显著提升了模型性能。

Details

Motivation: LLM在处理复杂问题时表现不佳，尤其是在长距离依赖和多步推理任务中。论文旨在探索这些局限性并提出改进方法。 Method: 提出了Step-by-Step Reading（SSR）系列提示策略和训练无关的注意力重校准机制，动态调整注意力分布。 Result: SSR++在多个基准测试中取得最优结果（如GSM8K 96.66%），注意力重校准机制使LLaMA 3.1-8B在AQuA上准确率提升5.17%。 Conclusion: 结构化提示设计和注意力优化是提升LLM理解能力的有效工具，为NLP任务提供了轻量级解决方案。 Abstract: Large Language Models (LLMs) often struggle with tasks that require a deep understanding of complex questions, especially when faced with long-range dependencies or multi-step reasoning. This work investigates the limitations of current LLMs in question comprehension and identifies three insights: (1) repeating question tokens improves comprehension by increasing attention to question regions, (2) increased backward dependencies negatively affect performance due to unidirectional attentional constraints, and (3) recalibrating attentional mechanisms to prioritize question-relevant regions improves performance. Based on these findings, we first propose a family of prompt-based strategies - Step-by-Step Reading (SSR), SSR+, and SSR++ - that guide LLMs to incrementally process question tokens and align their reasoning with the input structure. These methods significantly improve performance, with SSR++ achieving state-of-the-art results on several benchmarks: 96.66% on GSM8K, 94.61% on ASDiv, and 76.28% on AQuA. Second, we introduce a training-free attention recalibration mechanism that dynamically adjusts attention distributions during inference to emphasize question-relevant regions. This method improves the accuracy of LLaMA 3.1-8B on AQuA by 5.17% without changing model parameters or input prompts. Taken together, our results highlight the importance of structured prompt design and attention optimization in improving LLM comprehension, providing lightweight yet effective tools for improving performance in various NLP tasks.

[493] UXAgent: A System for Simulating Usability Testing of Web Design with LLM Agents

Yuxuan Lu,Bingsheng Yao,Hansu Gu,Jing Huang,Jessie Wang,Yang Li,Jiri Gesi,Qi He,Toby Jia-Jun Li,Dakuo Wang

Main category: cs.CL

TLDR: 论文提出了一种名为UXAgent的系统，利用大型语言模型模拟代理（LLM Agent）帮助用户体验（UX）研究人员在真实用户研究前评估和迭代可用性测试设计。

Details

Motivation: 传统可用性测试方法缺乏对测试设计本身的评估和迭代机制，LLM Agent的研究进展为解决这一问题提供了新思路。 Method: 系统包含三个模块：Persona Generator（生成模拟用户）、LLM Agent（模拟用户行为）和Universal Browser Connector（连接目标网站），并提供Agent Interview Interface和Video Replay Interface供研究人员分析数据。 Result: 通过启发式评估，五名UX研究人员认可系统的创新性，但也对LLM Agent在UX研究中的未来应用表示担忧。 Conclusion: UXAgent展示了LLM Agent在UX研究中的潜力，但需进一步探讨其实际应用的可行性和伦理问题。 Abstract: Usability testing is a fundamental research method that user experience (UX) researchers use to evaluate and iterate a web design, but\textbf{ how to evaluate and iterate the usability testing study design } itself? Recent advances in Large Language Model-simulated Agent (\textbf{LLM Agent}) research inspired us to design \textbf{UXAgent} to support UX researchers in evaluating and reiterating their usability testing study design before they conduct the real human-subject study. Our system features a Persona Generator module, an LLM Agent module, and a Universal Browser Connector module to automatically generate thousands of simulated users to interactively test the target website. The system also provides an Agent Interview Interface and a Video Replay Interface so that the UX researchers can easily review and analyze the generated qualitative and quantitative log data. Through a heuristic evaluation, five UX researcher participants praised the innovation of our system but also expressed concerns about the future of LLM Agent usage in UX studies.

[494] SaRO: Enhancing LLM Safety through Reasoning-based Alignment

Yutao Mou,Yuxiao Luo,Shikun Zhang,Wei Ye

Main category: cs.CL

TLDR: 论文提出SaRO框架，通过两阶段优化解决LLMs安全对齐的欠泛化和过对齐问题。

Details

Motivation: 现有安全对齐技术存在欠泛化和过对齐问题，需更深入的语义理解。 Method: SaRO框架包括推理式预热（RW）和安全导向推理优化（SRPO）两阶段。 Result: 实验证明SaRO优于传统对齐方法。 Conclusion: SaRO通过语义推理优化提升了LLMs的安全对齐效果。 Abstract: Current safety alignment techniques for large language models (LLMs) face two key challenges: (1) under-generalization, which leaves models vulnerable to novel jailbreak attacks, and (2) over-alignment, which leads to the excessive refusal of benign instructions. Our preliminary investigation reveals semantic overlap between jailbreak/harmful queries and normal prompts in embedding space, suggesting that more effective safety alignment requires a deeper semantic understanding. This motivates us to incorporate safety-policy-driven reasoning into the alignment process. To this end, we propose the Safety-oriented Reasoning Optimization Framework (SaRO), which consists of two stages: (1) Reasoning-style Warmup (RW) that enables LLMs to internalize long-chain reasoning through supervised fine-tuning, and (2) Safety-oriented Reasoning Process Optimization (SRPO) that promotes safety reflection via direct preference optimization (DPO). Extensive experiments demonstrate the superiority of SaRO over traditional alignment methods.

[495] ClinicalGPT-R1: Pushing reasoning capability of generalist disease diagnosis with large language model

Wuyang Lan,Wenzheng Wang,Changwei Ji,Guoxing Yang,Yongbo Zhang,Xiaohong Liu,Song Wu,Guangyu Wang

Main category: cs.CL

TLDR: ClinicalGPT-R1是一种增强推理能力的大型语言模型，专为疾病诊断设计，在中文任务中优于GPT-4o，在英文任务中与GPT-4相当。

Details

Motivation: 探索大型语言模型在临床诊断中的应用潜力，填补该领域的空白。 Method: 基于20,000份真实临床记录训练，采用多样化训练策略增强推理能力，并使用MedBench-Hard数据集进行性能评估。 Result: 在中文诊断任务中优于GPT-4o，在英文任务中与GPT-4表现相当。 Conclusion: ClinicalGPT-R1在疾病诊断任务中表现出色，验证了其在临床领域的应用潜力。 Abstract: Recent advances in reasoning with large language models (LLMs)has shown remarkable reasoning capabilities in domains such as mathematics and coding, yet their application to clinical diagnosis remains underexplored. Here, we introduce ClinicalGPT-R1, a reasoning enhanced generalist large language model for disease diagnosis. Trained on a dataset of 20,000 real-world clinical records, ClinicalGPT-R1 leverages diverse training strategies to enhance diagnostic reasoning. To benchmark performance, we curated MedBench-Hard, a challenging dataset spanning seven major medical specialties and representative diseases. Experimental results demonstrate that ClinicalGPT-R1 outperforms GPT-4o in Chinese diagnostic tasks and achieves comparable performance to GPT-4 in English settings. This comparative study effectively validates the superior performance of ClinicalGPT-R1 in disease diagnosis tasks. Resources are available at https://github.com/medfound/medfound.

[496] HalluShift: Measuring Distribution Shifts towards Hallucination Detection in LLMs

Sharanya Dasgupta,Sujoy Nath,Arkaprabha Basu,Pourya Shamsolmoali,Swagatam Das

Main category: cs.CL

TLDR: 论文提出HalluShift方法，分析LLM生成回答的内部状态空间和标记概率分布变化，以解决LLM幻觉问题，性能优于现有基线。

Details

Motivation: LLM在生成回答时容易产生幻觉（错误信息），但其回答结构连贯，研究认为这与LLM内部动态有关，类似于人类认知中的幻觉现象。 Method: 提出HalluShift方法，通过分析LLM生成回答的内部状态空间和标记概率分布变化，识别幻觉现象。 Result: HalluShift在多个基准数据集上表现优于现有基线方法。 Conclusion: HalluShift能有效分析LLM幻觉现象，为理解LLM内部动态提供了新视角。 Abstract: Large Language Models (LLMs) have recently garnered widespread attention due to their adeptness at generating innovative responses to the given prompts across a multitude of domains. However, LLMs often suffer from the inherent limitation of hallucinations and generate incorrect information while maintaining well-structured and coherent responses. In this work, we hypothesize that hallucinations stem from the internal dynamics of LLMs. Our observations indicate that, during passage generation, LLMs tend to deviate from factual accuracy in subtle parts of responses, eventually shifting toward misinformation. This phenomenon bears a resemblance to human cognition, where individuals may hallucinate while maintaining logical coherence, embedding uncertainty within minor segments of their speech. To investigate this further, we introduce an innovative approach, HalluShift, designed to analyze the distribution shifts in the internal state space and token probabilities of the LLM-generated responses. Our method attains superior performance compared to existing baselines across various benchmark datasets. Our codebase is available at https://github.com/sharanya-dasgupta001/hallushift.

[497] Kongzi: A Historical Large Language Model with Fact Enhancement

Jiashu Yang,Ningning Wang,Yian Zhao,Chaoran Feng,Junjia Du,Hao Pang,Zhirui Fang,Xuxin Cheng

Main category: cs.CL

TLDR: Kongzi是一个专为历史分析设计的大型语言模型，通过整合高质量历史数据和事实强化学习策略，显著提升了事实准确性和推理深度。

Details

Motivation: 当前大型语言模型在复杂推理任务中存在事实不准确的问题，尤其在历史研究中需要跨时间关联和连贯结论，Kongzi旨在解决这些问题。 Method: 整合高质量历史数据并采用新颖的事实强化学习策略。 Result: 在历史问答和叙事生成任务中，Kongzi在事实准确性和推理深度上优于现有模型。 Conclusion: Kongzi为专业领域开发准确可靠的大型语言模型树立了新标准。 Abstract: The capabilities of the latest large language models (LLMs) have been extended from pure natural language understanding to complex reasoning tasks. However, current reasoning models often exhibit factual inaccuracies in longer reasoning chains, which poses challenges for historical reasoning and limits the potential of LLMs in complex, knowledge-intensive tasks. Historical studies require not only the accurate presentation of factual information but also the ability to establish cross-temporal correlations and derive coherent conclusions from fragmentary and often ambiguous sources. To address these challenges, we propose Kongzi, a large language model specifically designed for historical analysis. Through the integration of curated, high-quality historical data and a novel fact-reinforcement learning strategy, Kongzi demonstrates strong factual alignment and sophisticated reasoning depth. Extensive experiments on tasks such as historical question answering and narrative generation demonstrate that Kongzi outperforms existing models in both factual accuracy and reasoning depth. By effectively addressing the unique challenges inherent in historical texts, Kongzi sets a new standard for the development of accurate and reliable LLMs in professional domains.

[498] MADLLM: Multivariate Anomaly Detection via Pre-trained LLMs

Wei Tao,Xiaoyang Qu,Kai Lu,Jiguang Wan,Guokuan Li,Jianzong Wang

Main category: cs.CL

TLDR: MADLLM提出了一种新的多变量异常检测方法，通过三重编码技术将多变量时间序列数据与预训练大语言模型的文本模态对齐。

Details

Motivation: 多变量时间序列（MTS）异常检测任务与预训练大语言模型（LLMs）的文本模态不匹配，现有方法简单地将MTS数据转换为单变量时间序列，导致问题。 Method: 设计了三重编码技术，结合传统补丁嵌入与两种新嵌入方法：Skip Embedding（改变补丁处理顺序）和Feature Embedding（利用对比学习理解特征相关性）。 Result: 实验表明，MADLLM在多个公开异常检测数据集上优于现有方法。 Conclusion: MADLLM通过模态对齐和新型嵌入技术，显著提升了多变量异常检测的性能。 Abstract: When applying pre-trained large language models (LLMs) to address anomaly detection tasks, the multivariate time series (MTS) modality of anomaly detection does not align with the text modality of LLMs. Existing methods simply transform the MTS data into multiple univariate time series sequences, which can cause many problems. This paper introduces MADLLM, a novel multivariate anomaly detection method via pre-trained LLMs. We design a new triple encoding technique to align the MTS modality with the text modality of LLMs. Specifically, this technique integrates the traditional patch embedding method with two novel embedding approaches: Skip Embedding, which alters the order of patch processing in traditional methods to help LLMs retain knowledge of previous features, and Feature Embedding, which leverages contrastive learning to allow the model to better understand the correlations between different features. Experimental results demonstrate that our method outperforms state-of-the-art methods in various public anomaly detection datasets.

[499] How new data permeates LLM knowledge and how to dilute it

Chen Sun,Renat Aksitov,Andrey Zhmoginov,Nolan Andrew Miller,Max Vladymyrov,Ulrich Rueckert,Been Kim,Mark Sandler

Main category: cs.CL

TLDR: 论文研究了大型语言模型（LLM）学习新信息时的“启动”效应，即新知识可能导致模型在无关上下文中错误应用。通过“Outlandish”数据集，发现启动程度可通过学习前关键词的概率预测，并提出两种方法减少不良启动效应。

Details

Motivation: 理解LLM学习新信息时如何影响现有知识，尤其是导致泛化和幻觉的机制。 Method: 使用“Outlandish”数据集（1320个样本）研究启动效应，提出“垫脚石”文本增强和“ignore-k”更新修剪方法。 Result: 启动效应可通过学习前关键词概率预测，两种新方法能减少50-95%的不良启动效应。 Conclusion: 研究揭示了LLM学习机制，并提供了改进知识插入特异性的实用工具。 Abstract: Large language models learn and continually learn through the accumulation of gradient-based updates, but how individual pieces of new information affect existing knowledge, leading to both beneficial generalization and problematic hallucination, remains poorly understood. We demonstrate that when learning new information, LLMs exhibit a "priming" effect: learning a new fact can cause the model to inappropriately apply that knowledge in unrelated contexts. To systematically study this phenomenon, we introduce "Outlandish," a carefully curated dataset of 1320 diverse text samples designed to probe how new knowledge permeates through an LLM's existing knowledge base. Using this dataset, we show that the degree of priming after learning new information can be predicted by measuring the token probability of key words before learning. This relationship holds robustly across different model architectures (PALM-2, Gemma, Llama), sizes, and training stages. Finally, we develop two novel techniques to modulate how new knowledge affects existing model behavior: (1) a ``stepping-stone'' text augmentation strategy and (2) an ``ignore-k'' update pruning method. These approaches reduce undesirable priming effects by 50-95\% while preserving the model's ability to learn new information. Our findings provide both empirical insights into how LLMs learn and practical tools for improving the specificity of knowledge insertion in language models. Further materials: https://sunchipsster1.github.io/projects/outlandish/

[500] Syzygy of Thoughts: Improving LLM CoT with the Minimal Free Resolution

Chenghao Li,Chaoning Zhang,Yi Lu,Jiaquan Zhang,Qigan Sun,Xudong Wang,Jiwei Wei,Guoqing Wang,Yang Yang,Heng Tao Shen

Main category: cs.CL

TLDR: 论文提出Syzygy of Thoughts (SoT)框架，通过引入辅助推理路径扩展Chain-of-Thought (CoT)，提升大语言模型的复杂任务推理能力。

Details

Motivation: 复杂任务中单一推理链能力不足，需更结构化方法。 Method: 借鉴Minimal Free Resolution (MFR)理论，将问题分解为逻辑完整的子问题。 Result: 在多个数据集和模型上，SoT推理准确率匹配或超越主流CoT标准。 Conclusion: SoT框架通过结构化分解和代数约束，实现透明推理与高性能。 Abstract: Chain-of-Thought (CoT) prompting enhances the reasoning of large language models (LLMs) by decomposing problems into sequential steps, mimicking human logic and reducing errors. However, complex tasks with vast solution spaces and vague constraints often exceed the capacity of a single reasoning chain. Inspired by Minimal Free Resolution (MFR) in commutative algebra and algebraic geometry, we propose Syzygy of Thoughts (SoT)-a novel framework that extends CoT by introducing auxiliary, interrelated reasoning paths. SoT captures deeper logical dependencies, enabling more robust and structured problem-solving. MFR decomposes a module into a sequence of free modules with minimal rank, providing a structured analytical approach to complex systems. This method introduces the concepts of "Module", "Betti numbers","Freeness", "Mapping", "Exactness" and "Minimality", enabling the systematic decomposition of the original complex problem into logically complete minimal subproblems while preserving key problem features and reducing reasoning length. We tested SoT across diverse datasets (e.g., GSM8K, MATH) and models (e.g., GPT-4o-mini, Qwen2.5), achieving inference accuracy that matches or surpasses mainstream CoTs standards. Additionally, by aligning the sampling process with algebraic constraints, our approach enhances the scalability of inference time in LLMs, ensuring both transparent reasoning and high performance. Our code will be publicly available at https://github.com/dlMARiA/Syzygy-of-thoughts.

[501] LLMs Can Achieve High-quality Simultaneous Machine Translation as Efficiently as Offline

Biao Fu,Minpeng Liao,Kai Fan,Chengxi Li,Liang Zhang,Yidong Chen,Xiaodong Shi

Main category: cs.CL

TLDR: 论文提出了一种新范式，通过构建监督微调数据和新的训练/推理策略，使LLMs在流式机器翻译中高效且高性能。

Details

Motivation: LLMs在离线翻译中表现优异，但在流式翻译中因自回归特性受限，需提升其效率和性能。 Method: 通过重新排列源和目标标记为交错序列，并加入特殊标记以适应不同延迟需求，训练LLMs自适应读写操作。 Result: 实验表明，该方法在有限数据下实现SOTA性能，并保留离线翻译能力，且能泛化到文档级流式翻译。 Conclusion: 新方法显著提升了LLMs在流式翻译中的表现，且无需额外微调即可适应更复杂场景。 Abstract: When the complete source sentence is provided, Large Language Models (LLMs) perform excellently in offline machine translation even with a simple prompt "Translate the following sentence from [src lang] into [tgt lang]:". However, in many real scenarios, the source tokens arrive in a streaming manner and simultaneous machine translation (SiMT) is required, then the efficiency and performance of decoder-only LLMs are significantly limited by their auto-regressive nature. To enable LLMs to achieve high-quality SiMT as efficiently as offline translation, we propose a novel paradigm that includes constructing supervised fine-tuning (SFT) data for SiMT, along with new training and inference strategies. To replicate the token input/output stream in SiMT, the source and target tokens are rearranged into an interleaved sequence, separated by special tokens according to varying latency requirements. This enables powerful LLMs to learn read and write operations adaptively, based on varying latency prompts, while still maintaining efficient auto-regressive decoding. Experimental results show that, even with limited SFT data, our approach achieves state-of-the-art performance across various SiMT benchmarks, and preserves the original abilities of offline translation. Moreover, our approach generalizes well to document-level SiMT setting without requiring specific fine-tuning, even beyond the offline translation model.

[502] Short-Path Prompting in LLMs: Analyzing Reasoning Instability and Solutions for Robust Performance

Zuoli Tang,Junjie Ou,Kaiqin Hu,Chunwei Wu,Zhaoxin Huan,Chilin Fu,Xiaolu Zhang,Jun Zhou,Chenliang Li

Main category: cs.CL

TLDR: 论文研究了短路径提示对大型语言模型（LLMs）推理能力的影响，发现其性能显著下降且不稳定，并提出两种解决方法：指令引导和微调。

Details

Motivation: 人类倾向于使用短路径提示，这与链式思维（CoT）推理方式冲突，研究旨在探索LLMs在短路径提示下的推理性能变化。 Method: 通过实验分析短路径提示对LLMs推理的影响，并提出指令引导和微调两种方法以解决冲突。 Result: 短路径提示下LLMs推理能力显著下降且不稳定；提出的两种方法均能有效提升推理准确性。 Conclusion: 短路径提示与CoT推理存在冲突，指令引导和微调方法能有效平衡指令遵循与推理准确性。 Abstract: Recent years have witnessed significant progress in large language models' (LLMs) reasoning, which is largely due to the chain-of-thought (CoT) approaches, allowing models to generate intermediate reasoning steps before reaching the final answer. Building on these advances, state-of-the-art LLMs are instruction-tuned to provide long and detailed CoT pathways when responding to reasoning-related questions. However, human beings are naturally cognitive misers and will prompt language models to give rather short responses, thus raising a significant conflict with CoT reasoning. In this paper, we delve into how LLMs' reasoning performance changes when users provide short-path prompts. The results and analysis reveal that language models can reason effectively and robustly without explicit CoT prompts, while under short-path prompting, LLMs' reasoning ability drops significantly and becomes unstable, even on grade-school problems. To address this issue, we propose two approaches: an instruction-guided approach and a fine-tuning approach, both designed to effectively manage the conflict. Experimental results show that both methods achieve high accuracy, providing insights into the trade-off between instruction adherence and reasoning accuracy in current models.

[503] Metropolis-Hastings Captioning Game: Knowledge Fusion of Vision Language Models via Decentralized Bayesian Inference

Yuta Matsui,Ryosuke Yamaki,Ryo Ueda,Seitaro Shinagawa,Tadahiro Taniguchi

Main category: cs.CL

TLDR: MHCG是一种通过语言游戏实现多视觉语言模型知识融合的方法，避免了传统方法的推理成本和架构限制。

Details

Motivation: 解决现有方法在结合多个模型时面临的推理成本和架构约束问题。 Method: 通过类似语言游戏的分散式贝叶斯推理，让两个VLM代理交替为图像生成标题并相互学习。 Result: 实验表明MHCG在无参考评估指标上表现一致提升，并能促进模型间词汇共享。 Conclusion: MHCG有效实现了多模型知识融合，提升了性能并促进了词汇共享。 Abstract: We propose the Metropolis-Hastings Captioning Game (MHCG), a method to fuse knowledge of multiple vision-language models (VLMs) by learning from each other. Although existing methods that combine multiple models suffer from inference costs and architectural constraints, MHCG avoids these problems by performing decentralized Bayesian inference through a process resembling a language game. The knowledge fusion process establishes communication between two VLM agents alternately captioning images and learning from each other. We conduct two image-captioning experiments with two VLMs, each pre-trained on a different dataset. The first experiment demonstrates that MHCG achieves consistent improvement in reference-free evaluation metrics. The second experiment investigates how MHCG contributes to sharing VLMs' category-level vocabulary by observing the occurrence of the vocabulary in the generated captions.

[504] Leveraging Reasoning Model Answers to Enhance Non-Reasoning Model Capability

Haotian Wang,Han Zhao,Shuaiting Chen,Xiaoyu Tian,Sitong Zhao,Yunjie Ji,Yiping Peng,Xiangang Li

Main category: cs.CL

TLDR: 利用推理密集型模型的高质量输出来提升非推理模型的性能，通过监督微调实验验证了方法的有效性。

Details

Motivation: 大型语言模型（如DeepSeek-R1和OpenAI-o1）通过测试时扩展显著提升了性能，但推理密集型模型计算成本高。本文旨在利用其高质量输出来提升非推理模型的性能。 Method: 通过监督微调（SFT）方法，利用推理模型生成的答案来训练和改进非推理模型。 Result: 在多个基准测试中，非推理模型的性能得到一致提升。 Conclusion: 该方法展示了利用推理模型输出来直接提升非推理模型问答能力的潜力。 Abstract: Recent advancements in large language models (LLMs), such as DeepSeek-R1 and OpenAI-o1, have demonstrated the significant effectiveness of test-time scaling, achieving substantial performance gains across various benchmarks. These advanced models utilize deliberate "thinking" steps to systematically enhance answer quality. In this paper, we propose leveraging these high-quality outputs generated by reasoning-intensive models to improve less computationally demanding, non-reasoning models. We explore and compare methodologies for utilizing the answers produced by reasoning models to train and improve non-reasoning models. Through straightforward Supervised Fine-Tuning (SFT) experiments on established benchmarks, we demonstrate consistent improvements across various benchmarks, underscoring the potential of this approach for advancing the ability of models to answer questions directly.

[505] Iterative Self-Training for Code Generation via Reinforced Re-Ranking

Nikita Sorokin,Ivan Sedykh,Valentin Malykh

Main category: cs.CL

TLDR: 提出了一种基于PPO的迭代自训练方法，用于训练重排模型，以提高代码生成质量。

Details

Motivation: 当前基于解码器的模型生成的代码具有高度随机性，小错误可能导致整个解决方案失败，因此需要改进代码生成质量。 Method: 使用PPO迭代自训练重排模型，通过重新评估输出、识别高评分负例并纳入训练循环来提升模型性能。 Result: 在MultiPL-E数据集上，13.4B参数模型在代码生成质量上优于33B模型，且速度更快，性能接近GPT-4。 Conclusion: 该方法通过迭代优化重排模型，显著提升了代码生成的质量和效率。 Abstract: Generating high-quality code that solves complex programming tasks is challenging, especially with current decoder-based models that produce highly stochastic outputs. In code generation, even minor errors can easily break the entire solution. Leveraging multiple sampled solutions can significantly improve the overall output quality. One effective way to enhance code generation is by pairing a code generation model with a reranker model, which selects the best solution from the generated samples. We propose a novel iterative self-training approach for self-training reranker models using Proximal Policy Optimization (PPO), aimed at improving both reranking accuracy and the overall code generation process. Unlike traditional PPO approaches, where the focus is on optimizing a generative model with a reward model, our approach emphasizes the development of a robust reward/reranking model. This model improves the quality of generated code through reranking and addresses problems and errors that the reward model might overlook during PPO alignment with the reranker. Our method iteratively refines the training dataset by re-evaluating outputs, identifying high-scoring negative examples, and incorporating them into the training loop, that boosting model performance. Our evaluation on the MultiPL-E dataset demonstrates that our 13.4B parameter model outperforms a 33B model in code generation quality while being three times faster. Moreover, it achieves performance comparable to GPT-4 and surpasses it in one programming language.

[506] Myanmar XNLI: Building a Dataset and Exploring Low-resource Approaches to Natural Language Inference with Myanmar

Aung Kyaw Htet,Mark Dras

Main category: cs.CL

TLDR: 论文扩展了XNLI任务至缅甸语，构建了myXNLI数据集，评估了多语言模型性能，并探索了数据增强方法。

Details

Motivation: 解决低资源语言（如缅甸语）在NLP任务中的挑战，并提升模型性能。 Method: 通过社区众包和专家验证构建myXNLI数据集，评估多语言模型，并测试数据增强方法。 Result: 数据增强方法提升了缅甸语模型性能2个百分点，同时惠及其他语言。 Conclusion: myXNLI数据集为低资源语言研究提供了新资源，数据增强方法具有泛化潜力。 Abstract: Despite dramatic recent progress in NLP, it is still a major challenge to apply Large Language Models (LLM) to low-resource languages. This is made visible in benchmarks such as Cross-Lingual Natural Language Inference (XNLI), a key task that demonstrates cross-lingual capabilities of NLP systems across a set of 15 languages. In this paper, we extend the XNLI task for one additional low-resource language, Myanmar, as a proxy challenge for broader low-resource languages, and make three core contributions. First, we build a dataset called Myanmar XNLI (myXNLI) using community crowd-sourced methods, as an extension to the existing XNLI corpus. This involves a two-stage process of community-based construction followed by expert verification; through an analysis, we demonstrate and quantify the value of the expert verification stage in the context of community-based construction for low-resource languages. We make the myXNLI dataset available to the community for future research. Second, we carry out evaluations of recent multilingual language models on the myXNLI benchmark, as well as explore data-augmentation methods to improve model performance. Our data-augmentation methods improve model accuracy by up to 2 percentage points for Myanmar, while uplifting other languages at the same time. Third, we investigate how well these data-augmentation methods generalise to other low-resource languages in the XNLI dataset.

[507] CLEAR-KGQA: Clarification-Enhanced Ambiguity Resolution for Knowledge Graph Question Answering

Liqiang Wen,Guanming Xiong,Tong Mo,Bing Li,Weiping Li,Wen Zhao

Main category: cs.CL

TLDR: 提出了一种动态处理知识图谱问答中实体和意图模糊性的框架，通过交互式澄清和贝叶斯推理机制提升性能。

Details

Motivation: 现有KGQA系统假设用户查询无模糊性，但现实中模糊性普遍存在，需解决实体和意图模糊性问题。 Method: 采用贝叶斯推理量化模糊性，结合多轮对话框架和LLM指导澄清请求，开发双代理交互框架模拟用户反馈。 Result: 在WebQSP和CWQ数据集上显著提升性能，并贡献了一个消歧查询数据集。 Conclusion: 提出的框架有效解决了KGQA中的模糊性问题，为未来研究提供了新方向。 Abstract: This study addresses the challenge of ambiguity in knowledge graph question answering (KGQA). While recent KGQA systems have made significant progress, particularly with the integration of large language models (LLMs), they typically assume user queries are unambiguous, which is an assumption that rarely holds in real-world applications. To address these limitations, we propose a novel framework that dynamically handles both entity ambiguity (e.g., distinguishing between entities with similar names) and intent ambiguity (e.g., clarifying different interpretations of user queries) through interactive clarification. Our approach employs a Bayesian inference mechanism to quantify query ambiguity and guide LLMs in determining when and how to request clarification from users within a multi-turn dialogue framework. We further develop a two-agent interaction framework where an LLM-based user simulator enables iterative refinement of logical forms through simulated user feedback. Experimental results on the WebQSP and CWQ dataset demonstrate that our method significantly improves performance by effectively resolving semantic ambiguities. Additionally, we contribute a refined dataset of disambiguated queries, derived from interaction histories, to facilitate future research in this direction.

[508] Domain-Adaptive Continued Pre-Training of Small Language Models

Salman Faroz

Main category: cs.CL

TLDR: 通过增量预训练小语言模型（125M参数）在教育领域的表现显著提升，资源消耗低。

Details

Motivation: 探索在有限计算资源下，通过增量预训练实现小语言模型的领域适应，作为从头训练的替代方案。 Method: 采用125M参数模型，分阶段预训练（400M和1B tokens），结合数据预处理、内存优化训练配置和基准评估。 Result: 在知识密集型任务（MMLU +8.1%）和上下文理解（HellaSwag +7.6%）上表现显著提升，但存在领域专业化权衡。 Conclusion: 合理的预处理和训练方法能显著提升小语言模型能力，为领域适应提供可行路径。 Abstract: Continued pre-training of small language models offers a promising path for domain adaptation with limited computational resources. I've investigated this approach within educational domains, evaluating it as a resource-efficient alternative to training models from scratch. Using a 125M parameter model, I demonstrate significant performance improvements through incremental training on 400 million tokens, followed by further training to reach 1 billion tokens. My approach includes comprehensive data preprocessing, memory-optimized training configurations, and benchmark-based evaluation. Results show notable gains in knowledge-intensive tasks (MMLU +8.1%) and contextual understanding (HellaSwag +7.6%), while revealing educational domain specialization trade-offs. I analyze token efficiency, catastrophic forgetting mitigation strategies, and scaling patterns. My findings suggest that thoughtful preprocessing and training methodologies enable meaningful improvements in language model capabilities even with constrained computational resources, opening pathways for domain-specific adaptation of smaller language models.

[509] GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models

Jixiao Zhang,Chunsheng Zuo

Main category: cs.CL

TLDR: GRPO-LEAD通过引入长度相关奖励、错误惩罚机制和难度感知优势重加权策略，显著提升了数学推理任务的性能。

Details

Motivation: 当前GRPO实现存在奖励稀疏性、简洁性激励不足和复杂推理任务关注不足的问题。 Method: GRPO-LEAD包括长度相关准确性奖励、错误惩罚机制和难度感知优势重加权策略。 Result: GRPO-LEAD显著改善了语言模型在数学任务中的简洁性、准确性和鲁棒性。 Conclusion: GRPO-LEAD有效解决了GRPO的局限性，提升了数学推理任务的性能。 Abstract: Recent advances in R1-like reasoning models leveraging Group Relative Policy Optimization (GRPO) have significantly improved the performance of language models on mathematical reasoning tasks. However, current GRPO implementations encounter critical challenges, including reward sparsity due to binary accuracy metrics, limited incentives for conciseness, and insufficient focus on complex reasoning tasks. To address these issues, we propose GRPO-LEAD, a suite of novel enhancements tailored for mathematical reasoning. Specifically, GRPO-LEAD introduces (1) a length-dependent accuracy reward to encourage concise and precise solutions, (2) an explicit penalty mechanism for incorrect answers to sharpen decision boundaries, and (3) a difficulty-aware advantage reweighting strategy that amplifies learning signals for challenging problems. Furthermore, we systematically examine the impact of model scale and supervised fine-tuning (SFT) strategies, demonstrating that larger-scale base models and carefully curated datasets significantly enhance reinforcement learning effectiveness. Extensive empirical evaluations and ablation studies confirm that GRPO-LEAD substantially mitigates previous shortcomings, resulting in language models that produce more concise, accurate, and robust reasoning across diverse mathematical tasks.

[510] Evaluating the Quality of Benchmark Datasets for Low-Resource Languages: A Case Study on Turkish

Ayşe Aysu Cengiz,Ahmet Kaan Sever,Elif Ecem Ümütlü,Naime Şeyma Erdem,Burak Aytan,Büşra Tufan,Abdullah Topraksoy,Esra Darıcı,Cagri Toraman

Main category: cs.CL

TLDR: 研究评估了17个土耳其语基准数据集的质量，发现70%不符合启发式质量标准，85%的评估标准未达标。LLM评估者表现不如人类，尤其在文化和常识理解方面。

Details

Motivation: 解决翻译或改编数据集在语言和文化适用性上的挑战，为低资源语言提供更严格的基准数据集质量控制。 Method: 使用包含六个标准的综合框架，由人类和LLM评估者对数据集进行详细评估。 Result: 70%的数据集未达标，技术术语使用正确性最强，但LLM在文化和常识理解上表现较差。GPT-4o在语法和技术任务上更强，Llama3.3-70B在正确性和文化知识上更优。 Conclusion: 强调了为低资源语言创建和改编数据集时需更严格的质量控制。 Abstract: The reliance on translated or adapted datasets from English or multilingual resources introduces challenges regarding linguistic and cultural suitability. This study addresses the need for robust and culturally appropriate benchmarks by evaluating the quality of 17 commonly used Turkish benchmark datasets. Using a comprehensive framework that assesses six criteria, both human and LLM-judge annotators provide detailed evaluations to identify dataset strengths and shortcomings. Our results reveal that 70% of the benchmark datasets fail to meet our heuristic quality standards. The correctness of the usage of technical terms is the strongest criterion, but 85% of the criteria are not satisfied in the examined datasets. Although LLM judges demonstrate potential, they are less effective than human annotators, particularly in understanding cultural common sense knowledge and interpreting fluent, unambiguous text. GPT-4o has stronger labeling capabilities for grammatical and technical tasks, while Llama3.3-70B excels at correctness and cultural knowledge evaluation. Our findings emphasize the urgent need for more rigorous quality control in creating and adapting datasets for low-resource languages.

[511] Improving Multilingual Capabilities with Cultural and Local Knowledge in Large Language Models While Enhancing Native Performance

Ram Mohan Rao Kadiyala,Siddartha Pullakhandam,Siddhant Gupta,Drishti Sharma,Jebish Purbey,Kanwal Mehreen,Muhammad Arham,Hamza Farooq

Main category: cs.CL

TLDR: 本文介绍了双语LLM Mantra-14B，通过指令微调在印地语和英语上平均提升3%性能，优于更大模型。

Details

Motivation: 解决LLM在高资源语言（如英语）与低资源语言（如印地语）之间的性能差距问题。 Method: 使用485K双语指令数据集微调多个LLM，实验涵盖7种不同规模的模型和140次训练尝试。 Result: Mantra-14B在双语任务中表现优于更大模型，且未牺牲原生性能。 Conclusion: 通过文化本地化数据的微调可显著提升多语言性能，无需复杂技术或额外计算资源。 Abstract: Large Language Models (LLMs) have shown remarkable capabilities, but their development has primarily focused on English and other high-resource languages, leaving many languages underserved. We present our latest Hindi-English bi-lingual LLM \textbf{Mantra-14B} with ~3\% average improvement in benchmark scores over both languages, outperforming models twice its size. Using a curated dataset composed of English and Hindi instruction data of 485K samples, we instruction tuned models such as Qwen-2.5-14B-Instruct and Phi-4 to improve performance over both English and Hindi. Our experiments encompassing seven different LLMs of varying parameter sizes and over 140 training attempts with varying English-Hindi training data ratios demonstrated that it is possible to significantly improve multilingual performance without compromising native performance. Further, our approach avoids resource-intensive techniques like vocabulary expansion or architectural modifications, thus keeping the model size small. Our results indicate that modest fine-tuning with culturally and locally informed data can bridge performance gaps without incurring significant computational overhead. We release our training code, datasets, and models under mit and apache licenses to aid further research towards under-represented and low-resource languages.

[512] Executable Functional Abstractions: Inferring Generative Programs for Advanced Math Problems

Zaid Khan,Elias Stengel-Eskin,Archiki Prasad,Jaemin Cho,Mohit Bansal

Main category: cs.CL

TLDR: 论文提出了一种自动生成高级数学问题可执行功能抽象（EFA）的方法EFAGen，通过程序合成任务实现，并利用单元测试验证EFA的有效性。

Details

Motivation: 现有EFA研究局限于小学数学问题，而高级数学问题的EFA生成依赖人工设计，因此需要自动化方法。 Method: 将EFA生成任务定义为程序合成任务，利用LLM基于种子问题及其逐步解生成候选EFA，并通过单元测试验证其有效性。 Result: EFAGen生成的EFA能忠实于种子问题，产生可学习的问题变体，并能应用于多种竞赛级数学问题。 Conclusion: 自动生成的EFA可用于生成问题变体及数据，为数学推理提供新工具。 Abstract: Scientists often infer abstract procedures from specific instances of problems and use the abstractions to generate new, related instances. For example, programs encoding the formal rules and properties of a system have been useful in fields ranging from RL (procedural environments) to physics (simulation engines). These programs can be seen as functions which execute to different outputs based on their parameterizations (e.g., gridworld configuration or initial physical conditions). We introduce the term EFA (Executable Functional Abstraction) to denote such programs for math problems. EFA-like constructs have been shown to be useful for math reasoning as problem generators for stress-testing models. However, prior work has been limited to abstractions for grade-school math (whose simple rules are easy to encode in programs), while generating EFAs for advanced math has thus far required human engineering. We explore the automatic construction of EFAs for advanced math problems. We operationalize the task of automatically constructing EFAs as a program synthesis task, and develop EFAGen, which conditions an LLM on a seed math problem and its step-by-step solution to generate candidate EFA programs that are faithful to the generalized problem and solution class underlying the seed problem. Furthermore, we formalize properties any valid EFA must possess in terms of executable unit tests, and show how the tests can be used as verifiable rewards to train LLMs to become better writers of EFAs. We demonstrate that EFAs constructed by EFAGen behave rationally by remaining faithful to seed problems, produce learnable problem variations, and that EFAGen can infer EFAs across multiple diverse sources of competition-level math problems. Finally, we show downstream uses of model-written EFAs e.g. finding problem variations that are harder or easier for a learner to solve, as well as data generation.

[513] Reasoning Court: Combining Reasoning, Action, and Judgment for Multi-Hop Reasoning

Jingtian Wu,Claire Cardie

Main category: cs.CL

TLDR: 论文提出了一种名为Reasoning Court (RC)的新框架，通过引入独立的LLM法官来验证多步推理任务中的中间步骤，解决了现有方法（如ReAct）缺乏内部验证的问题，并在多个基准测试中表现优于现有方法。

Details

Motivation: 大型语言模型（LLMs）在多步推理任务中存在幻觉和推理错误问题，现有方法（如ReAct）缺乏对中间推理步骤的内部验证，导致错误传播。 Method: RC框架扩展了迭代推理与检索方法（如ReAct），引入独立的LLM法官，评估多个候选答案及其推理，选择最合理答案或合成新答案。 Result: 在HotpotQA、MuSiQue和FEVER等基准测试中，RC无需任务特定微调即优于现有少样本提示方法。 Conclusion: RC通过引入内部验证机制，显著提升了多步推理任务的准确性和鲁棒性，为LLMs的推理能力提供了新方向。 Abstract: While large language models (LLMs) have demonstrated strong capabilities in tasks like question answering and fact verification, they continue to suffer from hallucinations and reasoning errors, especially in multi-hop tasks that require integration of multiple information sources. Current methods address these issues through retrieval-based techniques (grounding reasoning in external evidence), reasoning-based approaches (enhancing coherence via improved prompting), or hybrid strategies combining both elements. One prominent hybrid method, ReAct, has outperformed purely retrieval-based or reasoning-based approaches; however, it lacks internal verification of intermediate reasoning steps, allowing potential errors to propagate through complex reasoning tasks. In this paper, we introduce Reasoning Court (RC), a novel framework that extends iterative reasoning-and-retrieval methods, such as ReAct, with a dedicated LLM judge. Unlike ReAct, RC employs this judge to independently evaluate multiple candidate answers and their associated reasoning generated by separate LLM agents. The judge is asked to select the answer that it considers the most factually grounded and logically coherent based on the presented reasoning and evidence, or synthesizes a new answer using available evidence and its pre-trained knowledge if all candidates are inadequate, flawed, or invalid. Evaluations on multi-hop benchmarks (HotpotQA, MuSiQue) and fact-verification (FEVER) demonstrate that RC consistently outperforms state-of-the-art few-shot prompting methods without task-specific fine-tuning.

[514] VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents

Ryota Tanaka,Taichi Iki,Taku Hasegawa,Kyosuke Nishida,Kuniko Saito,Jun Suzuki

Main category: cs.CL

TLDR: VDocRAG是一个新的检索增强生成框架，能够直接理解多种格式和模态的视觉丰富文档，避免因解析文本而丢失信息。通过自监督预训练任务和统一的数据集OpenDocVQA，它在性能上显著优于传统文本RAG。

Details

Motivation: 开发一种能够处理混合模态和多样格式视觉文档的RAG框架，避免信息丢失。 Method: 提出VDocRAG框架，利用自监督预训练任务将视觉信息压缩为密集标记表示，并与文本内容对齐。 Result: VDocRAG在性能上显著优于传统文本RAG，并展现出强大的泛化能力。 Conclusion: VDocRAG为现实世界文档提供了一种有效的RAG范式，具有广泛应用潜力。 Abstract: We aim to develop a retrieval-augmented generation (RAG) framework that answers questions over a corpus of visually-rich documents presented in mixed modalities (e.g., charts, tables) and diverse formats (e.g., PDF, PPTX). In this paper, we introduce a new RAG framework, VDocRAG, which can directly understand varied documents and modalities in a unified image format to prevent missing information that occurs by parsing documents to obtain text. To improve the performance, we propose novel self-supervised pre-training tasks that adapt large vision-language models for retrieval by compressing visual information into dense token representations while aligning them with textual content in documents. Furthermore, we introduce OpenDocVQA, the first unified collection of open-domain document visual question answering datasets, encompassing diverse document types and formats. OpenDocVQA provides a comprehensive resource for training and evaluating retrieval and question answering models on visually-rich documents in an open-domain setting. Experiments show that VDocRAG substantially outperforms conventional text-based RAG and has strong generalization capability, highlighting the potential of an effective RAG paradigm for real-world documents.

[515] Training Small Reasoning LLMs with Cognitive Preference Alignment

Wenrui Cai,Chengyu Wang,Junbing Yan,Jun Huang,Xiangzhong Fang

Main category: cs.CL

TLDR: 提出了一种名为CRV的新框架和CogPO算法，用于训练参数较少但推理能力强的语言模型，解决了小模型直接蒸馏大模型链式思维（CoT）效果不佳的问题。

Details

Motivation: 大语言模型（LLM）推理能力提升伴随高资源消耗，需探索参数更少的有效训练策略。小模型与大模型在能力和认知轨迹上存在差异，直接蒸馏CoT效果有限且需大量标注数据。 Method: 提出CRV框架，包含多个LLM代理，分别负责批判、重新思考和验证CoT；进一步提出CogPO算法，通过优化认知偏好增强小模型推理能力。 Result: 在复杂推理基准测试中，CRV和CogPO显著优于其他训练方法。 Conclusion: CRV框架和CogPO算法为训练高效小规模推理LLM提供了有效解决方案。 Abstract: The reasoning capabilities of large language models (LLMs), such as OpenAI's o1 and DeepSeek-R1, have seen substantial advancements through deep thinking. However, these enhancements come with significant resource demands, underscoring the need to explore strategies to train effective reasoning LLMs with far fewer parameters. A critical challenge is that smaller models have different capacities and cognitive trajectories than their larger counterparts. Hence, direct distillation of chain-of-thought (CoT) results from large LLMs to smaller ones can be sometimes ineffective and requires a huge amount of annotated data. In this paper, we introduce a novel framework called Critique-Rethink-Verify (CRV), designed for training smaller yet powerful reasoning LLMs. Our CRV framework consists of multiple LLM agents, each specializing in unique abilities: (i) critiquing the CoTs according to the cognitive capabilities of smaller models, (ii) rethinking and refining these CoTs based on the critiques, and (iii) verifying the correctness of the refined results. We further propose the cognitive preference optimization (CogPO) algorithm to enhance the reasoning abilities of smaller models by aligning thoughts of these models with their cognitive capacities. Comprehensive evaluations on challenging reasoning benchmarks demonstrate the efficacy of CRV and CogPO, which outperforms other training methods by a large margin.

[516] Transferable text data distillation by trajectory matching

Rong Yao,Hailin Hu,Yifei Fu,Hanting Chen,Wenyi Fang,Fanyi Du,Kai Han,Yunhe Wang

Main category: cs.CL

TLDR: 提出了一种基于轨迹匹配和最近邻ID学习的伪提示数据方法，用于文本生成任务的数据蒸馏，优于现有数据选择方法LESS，并展示了跨架构迁移能力。

Details

Motivation: 随着大语言模型规模的增加，训练成本上升，亟需减少训练数据量。数据蒸馏方法能合成少量数据达到全数据集效果，但文本数据的离散性阻碍了其在NLP中的应用。 Method: 通过轨迹匹配学习伪提示数据，并找到其最近邻ID以实现跨架构迁移，同时引入正则化损失提升蒸馏数据的鲁棒性。 Result: 在ARC-Easy和MMLU指令调优数据集上优于SOTA数据选择方法LESS，并展示了从OPT到Llama的良好迁移性。 Conclusion: 该方法首次适用于文本生成任务的数据蒸馏，具有优越性和跨架构迁移能力。 Abstract: In the realm of large language model (LLM), as the size of large models increases, it also brings higher training costs. There is a urgent need to minimize the data size in LLM training. Compared with data selection method, the data distillation method aims to synthesize a small number of data samples to achieve the training effect of the full data set and has better flexibility. Despite its successes in computer vision, the discreteness of text data has hitherto stymied its exploration in natural language processing (NLP). In this work, we proposed a method that involves learning pseudo prompt data based on trajectory matching and finding its nearest neighbor ID to achieve cross-architecture transfer. During the distillation process, we introduce a regularization loss to improve the robustness of our distilled data. To our best knowledge, this is the first data distillation work suitable for text generation tasks such as instruction tuning. Evaluations on two benchmarks, including ARC-Easy and MMLU instruction tuning datasets, established the superiority of our distillation approach over the SOTA data selection method LESS. Furthermore, our method demonstrates a good transferability over LLM structures (i.e., OPT to Llama).

[517] Abacus-SQL: A Text-to-SQL System Empowering Cross-Domain and Open-Domain Database Retrieval

Keyan Xu,Dingzirui Wang,Xuanliang Zhang,Qingfu Zhu,Wanxiang Che

Main category: cs.CL

TLDR: Abacus-SQL通过数据库检索技术和数据增强方法，解决了现有文本到SQL系统在开放域数据库检索和跨域迁移能力上的不足，并采用Pre-SQL和Self-debug方法提升查询准确性。

Details

Motivation: 现有文本到SQL系统缺乏开放域数据库检索能力，且跨域迁移能力有限，难以满足多样化查询需求。 Method: Abacus-SQL结合数据库检索技术、数据增强方法、Pre-SQL和Self-debug技术。 Result: 实验表明Abacus-SQL在多轮文本到SQL任务中表现优异。 Conclusion: Abacus-SQL有效解决了现有系统的局限性，提升了查询准确性和跨域能力。 Abstract: The existing text-to-SQL systems have made significant progress in SQL query generation, but they still face numerous challenges. Existing systems often lack retrieval capabilities for open-domain databases, requiring users to manually filter relevant databases. Additionally, their cross-domain transferability is limited, making it challenging to accommodate diverse query requirements. To address these issues, we propose Abacus-SQL. Abacus-SQL utilizes database retrieval technology to accurately locate the required databases in an open-domain database environment. It also enhances the system cross-domain transfer ability through data augmentation methods. Moreover, Abacus-SQL employs Pre-SQL and Self-debug methods, thereby enhancing the accuracy of SQL queries. Experimental results demonstrate that Abacus-SQL performs excellently in multi-turn text-to-SQL tasks, effectively validating the approach's effectiveness. Abacus-SQL is publicly accessible at https://huozi.8wss.com/abacus-sql/.

[518] PASS-FC: Progressive and Adaptive Search Scheme for Fact Checking of Comprehensive Claims

Ziyu Zhuang

Main category: cs.CL

TLDR: PASS-FC是一个通过增强声明、自适应问题生成和迭代验证来提升事实核查准确性的框架，在多个数据集上表现优异。

Details

Motivation: 自动化事实核查在处理复杂现实声明时面临挑战，需要更有效的解决方案。 Method: PASS-FC采用声明增强（添加时间和实体上下文）、高级搜索技术和反思机制。 Result: 在六个数据集上表现优于基线模型，尤其在通用知识、科学、现实和多语言任务中。 Conclusion: PASS-FC显著提升了事实核查的准确性和适应性，代码和结果将开源以促进研究。 Abstract: Automated fact-checking faces challenges in handling complex real-world claims. We present PASS-FC, a novel framework that addresses these issues through claim augmentation, adaptive question generation, and iterative verification. PASS-FC enhances atomic claims with temporal and entity context, employs advanced search techniques, and utilizes a reflection mechanism. We evaluate PASS-FC on six diverse datasets, demonstrating superior performance across general knowledge, scientific, real-world, and multilingual fact-checking tasks. Our framework often surpasses stronger baseline models. Hyperparameter analysis reveals optimal settings for evidence quantity and reflection label triggers, while ablation studies highlight the importance of claim augmentation and language-specific adaptations. PASS-FC's performance underscores its effectiveness in improving fact-checking accuracy and adaptability across various domains. We will open-source our code and experimental results to facilitate further research in this area.

[519] Investigating Syntactic Biases in Multilingual Transformers with RC Attachment Ambiguities in Italian and English

Michael Kamerath,Aniello De Santo

Main category: cs.CL

TLDR: 研究探讨单语和多语LLMs在意大利语和英语关系从句附着歧义中是否表现出类似人类的偏好，并测试词汇因素是否影响这些偏好。结果显示LLMs行为多样，但普遍未能准确捕捉人类偏好。

Details

Motivation: 利用过去句子处理研究，探究LLMs在语言处理中是否与人类行为一致，并验证词汇因素的作用。 Method: 通过呈现意大利语和英语的关系从句附着歧义示例，分析LLMs的偏好，并考察动词/名词类型对偏好的影响。 Result: LLMs行为在不同模型间差异显著，但普遍无法准确反映人类偏好。 Conclusion: 关系从句附着是评估LLMs语言知识和偏见的理想跨语言基准。 Abstract: This paper leverages past sentence processing studies to investigate whether monolingual and multilingual LLMs show human-like preferences when presented with examples of relative clause attachment ambiguities in Italian and English. Furthermore, we test whether these preferences can be modulated by lexical factors (the type of verb/noun in the matrix clause) which have been shown to be tied to subtle constraints on syntactic and semantic relations. Our results overall showcase how LLM behavior varies interestingly across models, but also general failings of these models in correctly capturing human-like preferences. In light of these results, we argue that RC attachment is the ideal benchmark for cross-linguistic investigations of LLMs' linguistic knowledge and biases.

[520] Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data

Shuai Zhao,Linchao Zhu,Yi Yang

Main category: cs.CL

TLDR: 提出了一种基于相似度的奖励函数RefAlign，用于LLM对齐，避免了传统奖励模型的高成本。

Details

Motivation: 传统对齐方法需要大量资源收集偏好数据和训练奖励模型，RefAlign通过利用相似度作为奖励，降低成本。 Method: 使用BERTScore计算生成文本与高质量参考答案的相似度作为奖励，开发了REINFORCE风格的RefAlign算法。 Result: RefAlign在多种对齐场景中表现与传统方法相当，但效率更高。 Conclusion: RefAlign是一种高效且通用的LLM对齐方法，适用于多种场景。 Abstract: Large language models~(LLMs) are expected to be helpful, harmless, and honest. In various alignment scenarios, such as general human preference, safety, and confidence alignment, binary preference data collection and reward modeling are resource-intensive but necessary for human preference transferring. In this work, we explore using the similarity between sampled generations and high-quality reference answers as an alternative reward function for LLM alignment. Using similarity as a reward circumvents training reward models, and collecting a single reference answer potentially costs less time than constructing binary preference pairs when multiple candidates are available. Specifically, we develop \textit{RefAlign}, a versatile REINFORCE-style alignment algorithm, which is free of reference and reward models. Instead, RefAlign utilizes BERTScore between sampled generations and high-quality reference answers as the surrogate reward. Beyond general human preference optimization, RefAlign can be readily extended to diverse scenarios, such as safety and confidence alignment, by incorporating the similarity reward with task-related objectives. In various scenarios, {RefAlign} demonstrates comparable performance to previous alignment methods while offering high efficiency.

Aish Albladi,Md Kaosar Uddin,Minarul Islam,Cheryl Seals

Main category: cs.CL

TLDR: 该研究提出了一种结合多种Transformer模型的混合框架，用于提升情感分类的准确性和鲁棒性，并在基准数据集上取得了优于单一模型的性能。

Details

Motivation: 情感分析在自然语言处理中至关重要，但面临噪声数据、上下文模糊和泛化能力等挑战。研究旨在通过结合不同Transformer模型的优势来解决这些问题。 Method: 采用BERT、GPT-2、RoBERTa、XLNet和DistilBERT的混合框架，结合TF-IDF和BoW进行数据预处理，并在Sentiment140和IMDB数据集上评估。 Result: 在Sentiment140和IMDB数据集上分别达到94%和95%的准确率，优于单一模型。 Conclusion: 混合框架有效解决了单一模型的局限性，适用于社交媒体监控、客户情感分析等实际任务，为未来混合NLP框架的发展提供了方向。 Abstract: Sentiment analysis is a crucial task in natural language processing (NLP) that enables the extraction of meaningful insights from textual data, particularly from dynamic platforms like Twitter and IMDB. This study explores a hybrid framework combining transformer-based models, specifically BERT, GPT-2, RoBERTa, XLNet, and DistilBERT, to improve sentiment classification accuracy and robustness. The framework addresses challenges such as noisy data, contextual ambiguity, and generalization across diverse datasets by leveraging the unique strengths of these models. BERT captures bidirectional context, GPT-2 enhances generative capabilities, RoBERTa optimizes contextual understanding with larger corpora and dynamic masking, XLNet models dependency through permutation-based learning, and DistilBERT offers efficiency with reduced computational overhead while maintaining high accuracy. We demonstrate text cleaning, tokenization, and feature extraction using Term Frequency Inverse Document Frequency (TF-IDF) and Bag of Words (BoW), ensure high-quality input data for the models. The hybrid approach was evaluated on benchmark datasets Sentiment140 and IMDB, achieving superior accuracy rates of 94\% and 95\%, respectively, outperforming standalone models. The results validate the effectiveness of combining multiple transformer models in ensemble-like setups to address the limitations of individual architectures. This research highlights its applicability to real-world tasks such as social media monitoring, customer sentiment analysis, and public opinion tracking which offers a pathway for future advancements in hybrid NLP frameworks.

[522] Refining Financial Consumer Complaints through Multi-Scale Model Interaction

Bo-Wei Chen,An-Zi Yen,Chung-Chi Chen

Main category: cs.CL

TLDR: 本文提出了一种名为MSMI的多尺度模型交互方法，用于将非正式文本转化为法律文本，并在金融争议数据集FinDR上验证了其有效性。

Details

Motivation: 解决非法律专业人士撰写的法律文本缺乏清晰性、正式性和领域精确性的问题。 Method: 采用多尺度模型交互（MSMI）方法，结合轻量级分类器和大型语言模型（LLMs）进行迭代优化。 Result: MSMI在金融争议数据集上显著优于单次提示策略，并在多个短文本基准测试中表现出更强的对抗鲁棒性。 Conclusion: 多模型协作在法律文本生成和更广泛的文本优化任务中具有潜力。 Abstract: Legal writing demands clarity, formality, and domain-specific precision-qualities often lacking in documents authored by individuals without legal training. To bridge this gap, this paper explores the task of legal text refinement that transforms informal, conversational inputs into persuasive legal arguments. We introduce FinDR, a Chinese dataset of financial dispute records, annotated with official judgments on claim reasonableness. Our proposed method, Multi-Scale Model Interaction (MSMI), leverages a lightweight classifier to evaluate outputs and guide iterative refinement by Large Language Models (LLMs). Experimental results demonstrate that MSMI significantly outperforms single-pass prompting strategies. Additionally, we validate the generalizability of MSMI on several short-text benchmarks, showing improved adversarial robustness. Our findings reveal the potential of multi-model collaboration for enhancing legal document generation and broader text refinement tasks.

[523] Quantum Natural Language Processing: A Comprehensive Review of Models, Methods, and Applications

Farha Nausheen,Khandakar Ahmed,M Imad Khan

Main category: cs.CL

TLDR: 论文探讨了量子自然语言处理（QNLP）的潜力，总结了当前量子计算在NLP中的应用现状，并指出其在小数据集上的局限性。

Details

Motivation: 解决深度学习在NLP中需要大量数据和资源的问题，探索量子计算在语言处理中的优势。 Method: 分类QNLP模型，总结量子编码技术、QNLP模型及量子优化方法，并统计其应用频率。 Result: QNLP方法目前仅适用于小数据集，且模型探索有限，但量子计算在NLP中的应用兴趣日益增长。 Conclusion: QNLP具有潜力，但仍需进一步研究以克服当前限制。 Abstract: In recent developments, deep learning methodologies applied to Natural Language Processing (NLP) have revealed a paradox: They improve performance but demand considerable data and resources for their training. Alternatively, quantum computing exploits the principles of quantum mechanics to overcome the computational limitations of current methodologies, thereby establishing an emerging field known as quantum natural language processing (QNLP). This domain holds the potential to attain a quantum advantage in the processing of linguistic structures, surpassing classical models in both efficiency and accuracy. In this paper, it is proposed to categorise QNLP models based on quantum computing principles, architecture, and computational approaches. This paper attempts to provide a survey on how quantum meets language by mapping state-of-the-art in this area, embracing quantum encoding techniques for classical data, QNLP models for prevalent NLP tasks, and quantum optimisation techniques for hyper parameter tuning. The landscape of quantum computing approaches applied to various NLP tasks is summarised by showcasing the specific QNLP methods used, and the popularity of these methods is indicated by their count. From the findings, it is observed that QNLP approaches are still limited to small data sets, with only a few models explored extensively, and there is increasing interest in the application of quantum computing to natural language processing tasks.

[524] Learning to Erase Private Knowledge from Multi-Documents for Retrieval-Augmented Large Language Models

Yujing Wang,Hainan Zhang,Liang Pang,Yongxin Tong,Binghui Guo,Hongwei Zheng,Zhiming Zheng

Main category: cs.CL

TLDR: Eraser4RAG是一种隐私擦除方法，用于在检索增强生成（RAG）中移除用户定义的隐私知识，同时保留公共知识。

Details

Motivation: RAG在专有领域应用时可能泄露敏感信息，传统文本匿名化方法无法满足其需求，需考虑多文档推理、用户自定义擦除和公共知识保留。 Method: 构建全局知识图谱识别潜在知识，随机分割为隐私和公共子图，用Flan-T5重写文档排除隐私三元组，PPO算法优化模型。 Result: 在四个QA数据集上，Eraser4RAG的擦除性能优于GPT-4o。 Conclusion: Eraser4RAG有效解决了RAG中的隐私擦除问题，平衡了隐私保护和知识保留的需求。 Abstract: Retrieval-Augmented Generation (RAG) is a promising technique for applying LLMs to proprietary domains. However, retrieved documents may contain sensitive knowledge, posing risks of privacy leakage in generative results. Thus, effectively erasing private information from retrieved documents is a key challenge for RAG. Unlike traditional text anonymization, RAG should consider: (1) the inherent multi-document reasoning may face de-anonymization attacks; (2) private knowledge varies by scenarios, so users should be allowed to customize which information to erase; (3) preserving sufficient publicly available knowledge for generation tasks. This paper introduces the privacy erasure task for RAG and proposes Eraser4RAG, a private knowledge eraser which effectively removes user-defined private knowledge from documents while preserving sufficient public knowledge for generation. Specifically, we first construct a global knowledge graph to identify potential knowledge across documents, aiming to defend against de-anonymization attacks. Then we randomly split it into private and public sub-graphs, and fine-tune Flan-T5 to rewrite the retrieved documents excluding private triples. Finally, PPO algorithm optimizes the rewriting model to minimize private triples and maximize public triples retention. Experiments on four QA datasets demonstrate that Eraser4RAG achieves superior erase performance than GPT-4o.

[525] Guiding Reasoning in Small Language Models with LLM Assistance

Yujin Kim,Euiin Yi,Minu Kim,Se-Young Yun,Taehyeon Kim

Main category: cs.CL

TLDR: SMART框架通过选择性引入LLM的指导，增强小语言模型（SLM）的推理能力，解决其逻辑推理不足的问题。

Details

Motivation: 小语言模型在复杂推理任务中表现不佳，需要外部支持以提升其能力。 Method: 采用基于评分的评估，识别不确定推理步骤，并在必要时注入LLM生成的修正推理。 Result: 实验表明，SMART显著提升了SLM在数学推理任务中的表现。 Conclusion: SMART为SLM和LLM的协作提供了可行方案，解决了SLM单独无法完成的复杂推理任务。 Abstract: The limited reasoning capabilities of small language models (SLMs) cast doubt on their suitability for tasks demanding deep, multi-step logical deduction. This paper introduces a framework called Small Reasons, Large Hints (SMART), which selectively augments SLM reasoning with targeted guidance from large language models (LLMs). Inspired by the concept of cognitive scaffolding, SMART employs a score-based evaluation to identify uncertain reasoning steps and injects corrective LLM-generated reasoning only when necessary. By framing structured reasoning as an optimal policy search, our approach steers the reasoning trajectory toward correct solutions without exhaustive sampling. Our experiments on mathematical reasoning datasets demonstrate that targeted external scaffolding significantly improves performance, paving the way for collaborative use of both SLM and LLM to tackle complex reasoning tasks that are currently unsolvable by SLMs alone.

[526] C-MTCSD: A Chinese Multi-Turn Conversational Stance Detection Dataset

Fuqiang Niu,Yi Yang,Xianghua Fu,Genan Dai,Bowen Zhang

Main category: cs.CL

TLDR: 论文介绍了C-MTCSD，一个大规模中文多轮对话立场检测数据集，揭示了现有模型在零样本和隐式立场检测上的挑战。

Details

Motivation: 解决中文语言处理和多轮对话分析中的立场检测难题。 Method: 构建C-MTCSD数据集，并通过传统方法和大型语言模型进行评估。 Result: 最佳模型在零样本设置下仅达到64.07% F1分数，隐式立场检测表现更差。 Conclusion: C-MTCSD为中文立场检测研究设立了新基准，未来改进空间大。 Abstract: Stance detection has become an essential tool for analyzing public discussions on social media. Current methods face significant challenges, particularly in Chinese language processing and multi-turn conversational analysis. To address these limitations, we introduce C-MTCSD, the largest Chinese multi-turn conversational stance detection dataset, comprising 24,264 carefully annotated instances from Sina Weibo, which is 4.2 times larger than the only prior Chinese conversational stance detection dataset. Our comprehensive evaluation using both traditional approaches and large language models reveals the complexity of C-MTCSD: even state-of-the-art models achieve only 64.07% F1 score in the challenging zero-shot setting, while performance consistently degrades with increasing conversation depth. Traditional models particularly struggle with implicit stance detection, achieving below 50% F1 score. This work establishes a challenging new benchmark for Chinese stance detection research, highlighting significant opportunities for future improvements.

[527] Turn-taking annotation for quantitative and qualitative analyses of conversation

Anneliese Kelterer,Barbara Schuppler

Main category: cs.CL

TLDR: 论文介绍了为GRASS语料库创建的对话转接标注系统，包括标注层、过程和标准，并展示了标注者间的高一致性。

Details

Motivation: 为促进对话分析和语音研究的跨学科应用，开发一个适合自动分类和时间对齐的标注系统。 Method: 开发了两层标注系统（IPU和PCOMP），使用Praat进行时间对齐标注，并详细描述了标注过程和标准。 Result: IPU标注一致性接近完美，PCOMP标注一致性较高，分歧多为部分或可解释的分析差异。 Conclusion: 该标注系统适用于多种对话数据，有望促进语言学和技术应用的跨学科合作。 Abstract: This paper has two goals. First, we present the turn-taking annotation layers created for 95 minutes of conversational speech of the Graz Corpus of Read and Spontaneous Speech (GRASS), available to the scientific community. Second, we describe the annotation system and the annotation process in more detail, so other researchers may use it for their own conversational data. The annotation system was developed with an interdisciplinary application in mind. It should be based on sequential criteria according to Conversation Analysis, suitable for subsequent phonetic analysis, thus time-aligned annotations were made Praat, and it should be suitable for automatic classification, which required the continuous annotation of speech and a label inventory that is not too large and results in a high inter-rater agreement. Turn-taking was annotated on two layers, Inter-Pausal Units (IPU) and points of potential completion (PCOMP; similar to transition relevance places). We provide a detailed description of the annotation process and of segmentation and labelling criteria. A detailed analysis of inter-rater agreement and common confusions shows that agreement for IPU annotation is near-perfect, that agreement for PCOMP annotations is substantial, and that disagreements often are either partial or can be explained by a different analysis of a sequence which also has merit. The annotation system can be applied to a variety of conversational data for linguistic studies and technological applications, and we hope that the annotations, as well as the annotation system will contribute to a stronger cross-fertilization between these disciplines.

[528] The Mirage of Performance Gains: Why Contrastive Decoding Fails to Address Multimodal Hallucination

Hao Yin,Gunagzong Si,Zilei Wang

Main category: cs.CL

TLDR: 对比解码策略在减少多模态大语言模型（MLLMs）中的幻觉问题上效果有限，其性能提升主要由误导性因素驱动。

Details

Motivation: 探讨对比解码策略在减少幻觉问题上的实际效果，揭示其性能提升的误导性原因。 Method: 通过引入一系列虚假改进方法，并与对比解码技术进行性能对比。 Result: 实验结果显示对比解码的性能提升与其减少幻觉的目标无关。 Conclusion: 研究挑战了对比解码策略的有效性假设，为开发真正有效的解决方案铺平了道路。 Abstract: Contrastive decoding strategies are widely used to reduce hallucinations in multimodal large language models (MLLMs). These methods work by constructing contrastive samples to induce hallucinations and then suppressing them in the output distribution. However, this paper demonstrates that such approaches fail to effectively mitigate the hallucination problem. The performance improvements observed on POPE Benchmark are largely driven by two misleading factors: (1) crude, unidirectional adjustments to the model's output distribution and (2) the adaptive plausibility constraint, which reduces the sampling strategy to greedy search. To further illustrate these issues, we introduce a series of spurious improvement methods and evaluate their performance against contrastive decoding techniques. Experimental results reveal that the observed performance gains in contrastive decoding are entirely unrelated to its intended goal of mitigating hallucinations. Our findings challenge common assumptions about the effectiveness of contrastive decoding strategies and pave the way for developing genuinely effective solutions to hallucinations in MLLMs.

Zhengxuan Zhang,Zhuowen Liang,Yin Wu,Teng Lin,Yuyu Luo,Nan Tang

Main category: cs.CL

TLDR: DataMosaic框架通过动态提取任务特定结构，提升LLM分析的可解释性和可验证性，解决现有系统的局限性。

Details

Motivation: 当前LLM在数据分析中存在不透明和不可验证的问题，RAG方法未能完全解决多模态数据处理的挑战。 Method: DataMosaic通过多智能体框架动态提取任务特定结构，提供透明推理和中间结果验证。 Result: DataMosaic增强了分析的准确性、一致性和隐私保护，为多模态数据分析奠定了基础。 Conclusion: DataMosaic为可信赖的多模态数据分析提供了新范式。 Abstract: Large Language Models (LLMs) are transforming data analytics, but their widespread adoption is hindered by two critical limitations: they are not explainable (opaque reasoning processes) and not verifiable (prone to hallucinations and unchecked errors). While retrieval-augmented generation (RAG) improves accuracy by grounding LLMs in external data, it fails to address the core challenges of trustworthy analytics - especially when processing noisy, inconsistent, or multi-modal data (for example, text, tables, images). We propose DataMosaic, a framework designed to make LLM-powered analytics both explainable and verifiable. By dynamically extracting task-specific structures (for example, tables, graphs, trees) from raw data, DataMosaic provides transparent, step-by-step reasoning traces and enables validation of intermediate results. Built on a multi-agent framework, DataMosaic orchestrates self-adaptive agents that align with downstream task requirements, enhancing consistency, completeness, and privacy. Through this approach, DataMosaic not only tackles the limitations of current LLM-powered analytics systems but also lays the groundwork for a new paradigm of grounded, accurate, and explainable multi-modal data analytics.

[530] Hallucination Detection in LLMs via Topological Divergence on Attention Graphs

Alexandra Bazarova,Aleksandr Yugay,Andrey Shulga,Alina Ermilova,Andrei Volodichev,Konstantin Polev,Julia Belikova,Rauf Parchiev,Dmitry Simakov,Maxim Savchenko,Andrey Savchenko,Serguei Barannikov,Alexey Zaytsev

Main category: cs.CL

TLDR: TOHA是一种基于拓扑结构的幻觉检测器，通过注意力矩阵的拓扑差异量化幻觉内容，在多项任务中表现优异。

Details

Motivation: 解决大型语言模型生成内容中的事实错误（幻觉）问题。 Method: 利用拓扑差异度量分析提示和响应子图的结构特性，识别幻觉输出。 Result: 在问答和数据到文本任务中取得领先或竞争性结果，并展示跨领域迁移能力。 Conclusion: 注意力矩阵的拓扑结构可作为LLM事实可靠性的高效鲁棒指标。 Abstract: Hallucination, i.e., generating factually incorrect content, remains a critical challenge for large language models (LLMs). We introduce TOHA, a TOpology-based HAllucination detector in the RAG setting, which leverages a topological divergence metric to quantify the structural properties of graphs induced by attention matrices. Examining the topological divergence between prompt and response subgraphs reveals consistent patterns: higher divergence values in specific attention heads correlate with hallucinated outputs, independent of the dataset. Extensive experiments, including evaluation on question answering and data-to-text tasks, show that our approach achieves state-of-the-art or competitive results on several benchmarks, two of which were annotated by us and are being publicly released to facilitate further research. Beyond its strong in-domain performance, TOHA maintains remarkable domain transferability across multiple open-source LLMs. Our findings suggest that analyzing the topological structure of attention matrices can serve as an efficient and robust indicator of factual reliability in LLMs.

[531] A Computational Cognitive Model for Processing Repetitions of Hierarchical Relations

Zeng Ren,Xinyi Guan,Martin Rohrmeier

Main category: cs.CL

TLDR: 本文提出了一种基于加权演绎系统的计算模型，用于检测和理解序列数据中的结构性重复模式，并通过音乐和动作规划的短序列验证了其表达能力。

Details

Motivation: 研究人类如何识别和理解序列数据中的结构性重复模式，以揭示人类模式识别的认知机制。 Method: 采用加权演绎系统，推断序列的最小生成过程，并以模板程序的形式表示，该程序通过重复组合器丰富了上下文无关文法。 Result: 模型在音乐和动作规划的短序列上展示了其表达能力，验证了其有效性。 Conclusion: 该模型为理解人类模式识别的心理表征和认知机制提供了新的视角。 Abstract: Patterns are fundamental to human cognition, enabling the recognition of structure and regularity across diverse domains. In this work, we focus on structural repeats, patterns that arise from the repetition of hierarchical relations within sequential data, and develop a candidate computational model of how humans detect and understand such structural repeats. Based on a weighted deduction system, our model infers the minimal generative process of a given sequence in the form of a Template program, a formalism that enriches the context-free grammar with repetition combinators. Such representation efficiently encodes the repetition of sub-computations in a recursive manner. As a proof of concept, we demonstrate the expressiveness of our model on short sequences from music and action planning. The proposed model offers broader insights into the mental representations and cognitive mechanisms underlying human pattern recognition.

[532] Towards Quantifying Commonsense Reasoning with Mechanistic Insights

Abhinav Joshi,Areeb Ahmad,Divyaksh Shukla,Ashutosh Modi

Main category: cs.CL

TLDR: 论文提出了一种图形化结构来评估LLMs的常识推理能力，并创建了一个包含37种日常活动的标注方案。该资源能生成大量常识查询，有助于深入研究LLMs的推理机制。

Details

Motivation: 当前基于文本的任务评估LLMs的常识推理能力存在局限，需要更严谨的方法来理解LLMs的推理机制。 Method: 设计图形化结构标注方案，捕捉37种日常活动的隐含知识，并生成大量常识查询。 Result: 资源支持约10^17种常识查询，揭示了LLMs中局部化的推理组件在决策中的作用。 Conclusion: 图形化结构为评估LLMs的常识推理能力提供了新工具，并揭示了其推理机制的部分特性。 Abstract: Commonsense reasoning deals with the implicit knowledge that is well understood by humans and typically acquired via interactions with the world. In recent times, commonsense reasoning and understanding of various LLMs have been evaluated using text-based tasks. In this work, we argue that a proxy of this understanding can be maintained as a graphical structure that can further help to perform a rigorous evaluation of commonsense reasoning abilities about various real-world activities. We create an annotation scheme for capturing this implicit knowledge in the form of a graphical structure for 37 daily human activities. We find that the created resource can be used to frame an enormous number of commonsense queries (~ 10^{17}), facilitating rigorous evaluation of commonsense reasoning in LLMs. Moreover, recently, the remarkable performance of LLMs has raised questions about whether these models are truly capable of reasoning in the wild and, in general, how reasoning occurs inside these models. In this resource paper, we bridge this gap by proposing design mechanisms that facilitate research in a similar direction. Our findings suggest that the reasoning components are localized in LLMs that play a prominent role in decision-making when prompted with a commonsense query.

Xinnong Zhang,Jiayu Lin,Xinyi Mou,Shiyue Yang,Xiawei Liu,Libo Sun,Hanjia Lyu,Yihang Yang,Weihong Qi,Yue Chen,Guanying Li,Ling Yan,Yao Hu,Siming Chen,Yu Wang,Jingxuan Huang,Jiebo Luo,Shiping Tang,Libo Wu,Baohua Zhou,Zhongyu Wei

Main category: cs.CL

TLDR: SocioVerse是一个基于LLM-agent的社会模拟框架，通过四个对齐组件和1000万真实用户池，解决了现有方法在环境、用户、交互和行为模式上的对齐问题。

Details

Motivation: 社会模拟通过虚拟个体与环境互动建模人类行为，但现有方法在多个方面存在对齐挑战。 Method: 提出SocioVerse框架，包含四个对齐组件，并利用1000万真实用户池进行验证。 Result: 在政治、新闻和经济三个领域的大规模实验中，SocioVerse能反映群体动态，同时保证多样性、可信度和代表性。 Conclusion: SocioVerse通过标准化程序和最小人工调整，有效解决了社会模拟中的对齐问题。 Abstract: Social simulation is transforming traditional social science research by modeling human behavior through interactions between virtual individuals and their environments. With recent advances in large language models (LLMs), this approach has shown growing potential in capturing individual differences and predicting group behaviors. However, existing methods face alignment challenges related to the environment, target users, interaction mechanisms, and behavioral patterns. To this end, we introduce SocioVerse, an LLM-agent-driven world model for social simulation. Our framework features four powerful alignment components and a user pool of 10 million real individuals. To validate its effectiveness, we conducted large-scale simulation experiments across three distinct domains: politics, news, and economics. Results demonstrate that SocioVerse can reflect large-scale population dynamics while ensuring diversity, credibility, and representativeness through standardized procedures and minimal manual adjustments.

[534] MT-R1-Zero: Advancing LLM-based Machine Translation via R1-Zero-like Reinforcement Learning

Zhaopeng Feng,Shaosheng Cao,Jiahan Ren,Jiayuan Su,Ruizhe Chen,Yan Zhang,Zhe Xu,Yao Hu,Jian Wu,Zuozhu Liu

Main category: cs.CL

TLDR: MT-R1-Zero是首个将R1-Zero强化学习框架应用于机器翻译的开源方法，无需监督微调或冷启动，通过混合奖励机制提升翻译质量。

Details

Motivation: 尽管大规模强化学习在数学和编码等任务中表现优异，但在机器翻译中的应用仍未被充分探索，因其输出灵活且难以自动评估。 Method: 提出MT-R1-Zero框架，采用规则与指标混合的奖励机制，引导大语言模型通过涌现推理提升翻译质量。 Result: 在WMT 24英中基准测试中，MT-R1-Zero-3B-Mix超越TowerInstruct-7B-v0.2 1.26分；MT-R1-Zero-7B-Mix与GPT-4o和Claude-3.5-Sonnet表现相当，MT-R1-Zero-7B-Sem在语义指标上达到最优。 Conclusion: MT-R1-Zero在机器翻译中展现了强大的泛化能力，尤其在多语言和低资源场景下，为奖励设计和大语言模型适应性提供了新见解。 Abstract: Large-scale reinforcement learning (RL) methods have proven highly effective in enhancing the reasoning abilities of large language models (LLMs), particularly for tasks with verifiable solutions such as mathematics and coding. However, applying this idea to machine translation (MT), where outputs are flexibly formatted and difficult to automatically evaluate with explicit rules, remains underexplored. In this work, we introduce MT-R1-Zero, the first open-source adaptation of the R1-Zero RL framework for MT without supervised fine-tuning or cold-start. We propose a rule-metric mixed reward mechanism to guide LLMs towards improved translation quality via emergent reasoning. On the WMT 24 English-Chinese benchmark, our MT-R1-Zero-3B-Mix achieves competitive performance, surpassing TowerInstruct-7B-v0.2 by an average of 1.26 points. Meanwhile, our MT-R1-Zero-7B-Mix attains a high average score of 62.25 across all metrics, placing it on par with advanced proprietary models such as GPT-4o and Claude-3.5-Sonnet, while the MT-R1-Zero-7B-Sem variant achieves state-of-the-art scores on semantic metrics. Moreover, our work exhibits strong generalization capabilities on out-of-distribution MT tasks, robustly supporting multilingual and low-resource settings. Extensive analysis of model behavior across different initializations and reward metrics offers pioneering insight into the critical role of reward design, LLM adaptability, training dynamics, and emergent reasoning patterns within the R1-Zero paradigm for MT. Our code is available at https://github.com/fzp0424/MT-R1-Zero.

[535] C-FAITH: A Chinese Fine-Grained Benchmark for Automated Hallucination Evaluation

Xu Zhang,Zhifei Liu,Jiahao Wang,Huixuan Zhang,Fan Xu,Junzhe Zhang,Xiaojun Wan

Main category: cs.CL

TLDR: 论文提出HaluAgent框架，自动构建细粒度QA数据集以评估大语言模型的幻觉问题，并创建了中文基准C-FAITH。

Details

Motivation: 现有幻觉评估基准依赖人工标注，难以实现自动化和低成本，尤其在中文领域。 Method: 使用HaluAgent框架，基于知识文档自动构建QA数据集，并通过规则设计和提示优化提升数据质量。 Result: 构建了包含60,702条目的中文基准C-FAITH，并评估了16种主流LLM。 Conclusion: HaluAgent和C-FAITH为幻觉研究提供了自动化和细粒度的评估工具。 Abstract: Despite the rapid advancement of large language models, they remain highly susceptible to generating hallucinations, which significantly hinders their widespread application. Hallucination research requires dynamic and fine-grained evaluation. However, most existing hallucination benchmarks (especially in Chinese language) rely on human annotations, making automatical and cost-effective hallucination evaluation challenging. To address this, we introduce HaluAgent, an agentic framework that automatically constructs fine-grained QA dataset based on some knowledge documents. Our experiments demonstrate that the manually designed rules and prompt optimization can improve the quality of generated data. Using HaluAgent, we construct C-FAITH, a Chinese QA hallucination benchmark created from 1,399 knowledge documents obtained from web scraping, totaling 60,702 entries. We comprehensively evaluate 16 mainstream LLMs with our proposed C-FAITH, providing detailed experimental results and analysis.

[536] HalluSearch at SemEval-2025 Task 3: A Search-Enhanced RAG Pipeline for Hallucination Detection

Mohamed A. Abdallah,Samhaa R. El-Beltagy

Main category: cs.CL

TLDR: HalluSearch是一个多语言管道，用于检测大型语言模型输出中的虚构文本片段，在14种语言中表现优异，但在在线覆盖有限的语种中面临挑战。

Details

Motivation: 开发一个能够检测和定位多语言环境中大型语言模型输出中的虚构文本的工具，以提升模型输出的可靠性。 Method: 结合检索增强验证和细粒度事实分割技术，识别和定位14种语言中的虚构内容。 Result: HalluSearch在英语和捷克语中表现优异（排名前四），但在在线覆盖有限的语种中效果较差。 Conclusion: 虽然HalluSearch在多语言环境中表现良好，但需进一步研究以解决语种覆盖不足的问题。 Abstract: In this paper, we present HalluSearch, a multilingual pipeline designed to detect fabricated text spans in Large Language Model (LLM) outputs. Developed as part of Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes, HalluSearch couples retrieval-augmented verification with fine-grained factual splitting to identify and localize hallucinations in fourteen different languages. Empirical evaluations show that HalluSearch performs competitively, placing fourth in both English (within the top ten percent) and Czech. While the system's retrieval-based strategy generally proves robust, it faces challenges in languages with limited online coverage, underscoring the need for further research to ensure consistent hallucination detection across diverse linguistic contexts.

[537] LLM Unlearning Reveals a Stronger-Than-Expected Coreset Effect in Current Benchmarks

Soumyadeep Pal,Changsheng Wang,James Diffenderfer,Bhavya Kailkhura,Sijia Liu

Main category: cs.CL

TLDR: 研究发现，在大语言模型（LLM）的遗忘任务中，仅需原始遗忘集的5%作为核心集即可有效维持遗忘效果，且该效果与遗忘方法无关。

Details

Motivation: 探索LLM遗忘任务中核心集效应的存在及其影响，以优化遗忘效率和资源利用。 Method: 通过实验验证核心集效应，使用不同遗忘方法（如NPO和RMU）和数据选择方法（随机或启发式）进行分析。 Result: 核心集效应显著，仅需少量数据即可实现有效遗忘，且效果不受方法或数据选择方式影响。 Conclusion: 当前LLM遗忘任务主要由高影响力关键词驱动，而非整个数据集，核心集效应为高效遗忘提供了新思路。 Abstract: Large language model unlearning has become a critical challenge in ensuring safety and controlled model behavior by removing undesired data-model influences from the pretrained model while preserving general utility. Significant recent efforts have been dedicated to developing LLM unlearning benchmarks such as WMDP (Weapons of Mass Destruction Proxy) and MUSE (Machine Unlearning Six-way Evaluation), facilitating standardized unlearning performance assessment and method comparison. Despite their usefulness, we uncover for the first time a novel coreset effect within these benchmarks. Specifically, we find that LLM unlearning achieved with the original (full) forget set can be effectively maintained using a significantly smaller subset (functioning as a "coreset"), e.g., as little as 5% of the forget set, even when selected at random. This suggests that LLM unlearning in these benchmarks can be performed surprisingly easily, even in an extremely low-data regime. We demonstrate that this coreset effect remains strong, regardless of the LLM unlearning method used, such as NPO (Negative Preference Optimization) and RMU (Representation Misdirection Unlearning), the popular ones in these benchmarks. The surprisingly strong coreset effect is also robust across various data selection methods, ranging from random selection to more sophisticated heuristic approaches. We explain the coreset effect in LLM unlearning through a keyword-based perspective, showing that keywords extracted from the forget set alone contribute significantly to unlearning effectiveness and indicating that current unlearning is driven by a compact set of high-impact tokens rather than the entire dataset. We further justify the faithfulness of coreset-unlearned models along additional dimensions, such as mode connectivity and robustness to jailbreaking attacks. Codes are available at https://github.com/OPTML-Group/MU-Coreset.

[538] Deep Reasoning Translation via Reinforcement Learning

Jiaan Wang,Fandong Meng,Jie Zhou

Main category: cs.CL

TLDR: DeepTrans是一种基于强化学习的深度推理翻译模型，通过奖励模型学习自由翻译，无需标注数据，性能提升显著。

Details

Motivation: 自由翻译在多语言世界中具有重要意义，但深度推理LLMs在此任务上尚未充分探索。 Method: 使用强化学习训练DeepTrans，通过奖励模型评估翻译结果和思维过程。 Result: DeepTrans在文学翻译中性能提升16.3%，优于基线模型。 Conclusion: DeepTrans展示了强化学习在自由翻译中的潜力，并总结了RL探索中的失败与发现。 Abstract: Recently, deep reasoning LLMs (e.g., OpenAI o1/o3 and DeepSeek-R1) have shown promising performance in various complex tasks. Free translation is an important and interesting task in the multilingual world, which requires going beyond word-for-word translation and taking cultural differences into account. This task is still under-explored in deep reasoning LLMs. In this paper, we introduce DeepTrans, a deep reasoning translation model that learns free translation via reinforcement learning. Specifically, we carefully build a reward model with pre-defined scoring criteria on both the translation results and the thought process. Given the source sentences, the reward model teaches the deep translation model how to think and free-translate them during reinforcement learning. In this way, training DeepTrans does not need any labeled translations, avoiding the human-intensive annotation or resource-intensive data synthesis. Experimental results show the effectiveness of DeepTrans. Using Qwen2.5-7B as the backbone, DeepTrans improves performance by 16.3% in literature translation, and outperforms strong deep reasoning baselines as well as baselines that are fine-tuned with synthesized data. Moreover, we summarize the failures and interesting findings during our RL exploration. We hope this work could inspire other researchers in free translation.

[539] Localized Cultural Knowledge is Conserved and Controllable in Large Language Models

Veniamin Veselovsky,Berke Argin,Benedikt Stroebl,Chris Wendler,Robert West,James Evans,Thomas L. Griffiths,Arvind Narayanan

Main category: cs.CL

TLDR: LLMs默认生成英语中心化回答，但隐含文化信息可通过显式文化提示激活。显式文化提示提升文化本地化能力，但可能减少多样性和增加刻板印象。研究发现显式文化定制向量可引导模型转向非英语文化世界，保留多样性并减少刻板印象。

Details

Motivation: 研究LLMs在多语言交互中文化信息的保留与激活，探索如何通过显式提示改善文化本地化。 Method: 通过显式提供文化上下文提示，分析模型在文化本地化上的表现差异，并识别跨语言的显式文化定制向量。 Result: 显式文化提示提升本地化能力但减少多样性；显式文化定制向量可引导模型转向非英语文化世界，保留多样性并减少刻板印象。 Conclusion: 显式文化定制有助于理解LLMs中文化世界模型的保留与控制，提升翻译和文化定制潜力。 Abstract: Just as humans display language patterns influenced by their native tongue when speaking new languages, LLMs often default to English-centric responses even when generating in other languages. Nevertheless, we observe that local cultural information persists within the models and can be readily activated for cultural customization. We first demonstrate that explicitly providing cultural context in prompts significantly improves the models' ability to generate culturally localized responses. We term the disparity in model performance with versus without explicit cultural context the explicit-implicit localization gap, indicating that while cultural knowledge exists within LLMs, it may not naturally surface in multilingual interactions if cultural context is not explicitly provided. Despite the explicit prompting benefit, however, the answers reduce in diversity and tend toward stereotypes. Second, we identify an explicit cultural customization vector, conserved across all non-English languages we explore, which enables LLMs to be steered from the synthetic English cultural world-model toward each non-English cultural world. Steered responses retain the diversity of implicit prompting and reduce stereotypes to dramatically improve the potential for customization. We discuss the implications of explicit cultural customization for understanding the conservation of alternative cultural world models within LLMs, and their controllable utility for translation, cultural customization, and the possibility of making the explicit implicit through soft control for expanded LLM function and appeal.

[540] DioR: Adaptive Cognitive Detection and Contextual Retrieval Optimization for Dynamic Retrieval-Augmented Generation

Hanghui Guo,Jia Zhu,Shimin Di,Weijie Shi,Zhangze Chen,Jiajie Xu

Main category: cs.CL

TLDR: DioR是一种创新的动态RAG方法，通过自适应认知检测和上下文检索优化，解决了现有方法在检索触发和内容筛选上的不足。

Details

Motivation: 现有动态RAG方法缺乏有效的检索触发机制和内容筛选能力，限制了其性能。 Method: 提出DioR方法，包含自适应认知检测和上下文检索优化两部分，分别解决何时检索和检索什么的问题。 Result: 实验表明DioR在所有任务上表现优异。 Conclusion: DioR有效提升了动态RAG的性能，解决了现有方法的局限性。 Abstract: Dynamic Retrieval-augmented Generation (RAG) has shown great success in mitigating hallucinations in large language models (LLMs) during generation. However, existing dynamic RAG methods face significant limitations in two key aspects: 1) Lack of an effective mechanism to control retrieval triggers, and 2) Lack of effective scrutiny of retrieval content. To address these limitations, we propose an innovative dynamic RAG method, DioR (Adaptive Cognitive Detection and Contextual Retrieval Optimization), which consists of two main components: adaptive cognitive detection and contextual retrieval optimization, specifically designed to determine when retrieval is needed and what to retrieve for LLMs is useful. Experimental results demonstrate that DioR achieves superior performance on all tasks, demonstrating the effectiveness of our work.

[541] Probing then Editing Response Personality of Large Language Models

Tianjie Ju,Zhenyu Shao,Bowen Wang,Yujia Chen,Zhuosheng Zhang,Hao Fei,Mong-Li Lee,Wynne Hsu,Sufeng Duan,Gongshen Liu

Main category: cs.CL

TLDR: 本文提出了一种分层探测框架，研究LLM内部如何编码人格特征，发现人格特征主要编码于中上层，并提出了一种分层扰动方法编辑LLM的人格表达。

Details

Motivation: 尽管已有研究通过输出分析LLM的人格表达，但对其内部参数如何编码人格特征知之甚少。 Method: 采用分层探测框架，在11个开源LLM上进行实验，并提出分层扰动方法编辑人格。 Result: 人格特征主要编码于中上层，指令调优模型的人格分离更清晰；分层扰动方法能有效编辑人格表达。 Conclusion: 该方法在保持通用能力的同时，实现了低训练成本和高效率的人格编辑。 Abstract: Large Language Models (LLMs) have demonstrated promising capabilities to generate responses that exhibit consistent personality traits. Despite the major attempts to analyze personality expression through output-based evaluations, little is known about how such traits are internally encoded within LLM parameters. In this paper, we introduce a layer-wise probing framework to systematically investigate the layer-wise capability of LLMs in encoding personality for responding. We conduct probing experiments on 11 open-source LLMs over the PersonalityEdit benchmark and find that LLMs predominantly encode personality for responding in their middle and upper layers, with instruction-tuned models demonstrating a slightly clearer separation of personality traits. Furthermore, by interpreting the trained probing hyperplane as a layer-wise boundary for each personality category, we propose a layer-wise perturbation method to edit the personality expressed by LLMs during inference. Our results show that even when the prompt explicitly specifies a particular personality, our method can still successfully alter the response personality of LLMs. Interestingly, the difficulty of converting between certain personality traits varies substantially, which aligns with the representational distances in our probing experiments. Finally, we conduct a comprehensive MMLU benchmark evaluation and time overhead analysis, demonstrating that our proposed personality editing method incurs only minimal degradation in general capabilities while maintaining low training costs and acceptable inference latency. Our code is publicly available at https://github.com/universe-sky/probing-then-editing-personality.

[542] Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol

Weiqi Wang,Jiefu Ou,Yangqiu Song,Benjamin Van Durme,Daniel Khashabi

Main category: cs.CL

TLDR: 论文探讨了如何生成满足用户需求的文献综述表格，结合LLM和人工标注解决实际挑战，并提出了新基准ARXIV2TABLE。

Details

Motivation: 解决文献综述表格生成中的实际挑战，如用户提示不明确、候选论文内容无关及评估方法不足。 Method: 结合LLM方法和人工标注，扩展现有方法以应对复杂场景。 Result: 实验表明现有LLM在此任务上表现不佳，凸显其难度。 Conclusion: 提出了ARXIV2TABLE基准和改进方法，强调需进一步研究。 Abstract: Literature review tables are essential for summarizing and comparing collections of scientific papers. We explore the task of generating tables that best fulfill a user's informational needs given a collection of scientific papers. Building on recent work (Newman et al., 2024), we extend prior approaches to address real-world complexities through a combination of LLM-based methods and human annotations. Our contributions focus on three key challenges encountered in real-world use: (i) User prompts are often under-specified; (ii) Retrieved candidate papers frequently contain irrelevant content; and (iii) Task evaluation should move beyond shallow text similarity techniques and instead assess the utility of inferred tables for information-seeking tasks (e.g., comparing papers). To support reproducible evaluation, we introduce ARXIV2TABLE, a more realistic and challenging benchmark for this task, along with a novel approach to improve literature review table generation in real-world scenarios. Our extensive experiments on this benchmark show that both open-weight and proprietary LLMs struggle with the task, highlighting its difficulty and the need for further advancements. Our dataset and code are available at https://github.com/JHU-CLSP/arXiv2Table.

[543] MorphTok: Morphologically Grounded Tokenization for Indian Languages

Maharaj Brahma,N J Karthika,Atul Singh,Devaraj Adiga,Smruti Bhate,Ganesh Ramakrishnan,Rohit Saluja,Maunendra Sankar Desarkar

Main category: cs.CL

TLDR: 论文提出了一种基于形态学的预分词方法，结合BPE算法改进子词分词，并引入CBPE处理特定脚本约束，提升机器翻译和语言建模性能。

Details

Motivation: 现有BPE算法在子词分词时未考虑语言学意义，导致分词结果不理想。 Method: 提出形态学感知的分词作为BPE预步骤，创建印地语和马拉地语数据集，并引入CBPE处理依赖元音。 Result: 形态学分词提升下游任务性能，CBPE降低生育率分数1.68%。 Conclusion: 形态学分词和CBPE为高效分词提供了新方向，并引入EvalTok评估指标。 Abstract: Tokenization is a crucial step in NLP, especially with the rise of large language models (LLMs), impacting downstream performance, computational cost, and efficiency. Existing LLMs rely on the classical Byte-pair Encoding (BPE) algorithm for subword tokenization that greedily merges frequent character bigrams. This often leads to segmentation that does not align with linguistically meaningful units. To address this, we propose morphology-aware segmentation as a pre-tokenization step prior to applying BPE. To facilitate morphology-aware segmentation, we create a novel dataset for Hindi and Marathi, incorporating sandhi splitting to enhance the subword tokenization. Experiments on downstream tasks show that morphologically grounded tokenization improves performance for machine translation and language modeling. Additionally, to handle the ambiguity in the Unicode characters for diacritics, particularly dependent vowels in syllable-based writing systems, we introduce Constrained BPE (CBPE), an extension to the traditional BPE algorithm that incorporates script-specific constraints. Specifically, CBPE handles dependent vowels. Our results show that CBPE achieves a 1.68\% reduction in fertility scores while maintaining comparable or improved downstream performance in machine translation, offering a computationally efficient alternative to standard BPE. Moreover, to evaluate segmentation across different tokenization algorithms, we introduce a new human evaluation metric, \textit{EvalTok}, enabling more human-grounded assessment.

[544] Forecasting from Clinical Textual Time Series: Adaptations of the Encoder and Decoder Language Model Families

Shahriar Noroozizadeh,Sayantan Kumar,Jeremy C. Weiss

Main category: cs.CL

TLDR: 论文提出了一种基于文本时间序列的预测方法，利用LLM辅助提取的临床发现数据，评估了多种模型在事件预测、时间排序和生存分析中的表现。

Details

Motivation: 传统机器学习方法依赖结构化数据，未能充分利用临床病例报告中的丰富时间信息。 Method: 采用LLM辅助标注管道提取时间戳临床发现，系统评估了多种模型（包括基于解码器的LLM和基于编码器的Transformer）。 Result: 基于编码器的模型在事件预测中表现更优，而基于解码器的模型在生存分析中表现较好。时间顺序对模型性能有显著影响。 Conclusion: 时间顺序的利用对提升模型性能至关重要，尤其在LLM广泛应用的时代。 Abstract: Clinical case reports encode rich, temporal patient trajectories that are often underexploited by traditional machine learning methods relying on structured data. In this work, we introduce the forecasting problem from textual time series, where timestamped clinical findings--extracted via an LLM-assisted annotation pipeline--serve as the primary input for prediction. We systematically evaluate a diverse suite of models, including fine-tuned decoder-based large language models and encoder-based transformers, on tasks of event occurrence prediction, temporal ordering, and survival analysis. Our experiments reveal that encoder-based models consistently achieve higher F1 scores and superior temporal concordance for short- and long-horizon event forecasting, while fine-tuned masking approaches enhance ranking performance. In contrast, instruction-tuned decoder models demonstrate a relative advantage in survival analysis, especially in early prognosis settings. Our sensitivity analyses further demonstrate the importance of time ordering, which requires clinical time series construction, as compared to text ordering, the format of the text inputs that LLMs are classically trained on. This highlights the additional benefit that can be ascertained from time-ordered corpora, with implications for temporal tasks in the era of widespread LLM use.

[545] VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge

Yueqi Song,Tianyue Ou,Yibo Kong,Zecheng Li,Graham Neubig,Xiang Yue

Main category: cs.CL

TLDR: VisualPuzzles是一个专注于视觉推理的基准测试，旨在减少对专业知识的依赖，以更纯粹地评估多模态推理能力。

Details

Motivation: 当前多模态基准测试常将推理与领域知识混为一谈，难以评估非专家环境中的通用推理能力。 Method: 通过手动翻译中国公务员考试的逻辑推理题目，构建了涵盖算法、类比、演绎、归纳和空间推理五类问题的VisualPuzzles基准。 Result: 实验表明，VisualPuzzles对领域知识依赖较少，但对推理能力要求更高，现有先进模型表现不及人类，且推理增强方法效果不一致。 Conclusion: VisualPuzzles为评估超越事实记忆和领域知识的推理能力提供了更清晰的视角。 Abstract: Current multimodal benchmarks often conflate reasoning with domain-specific knowledge, making it difficult to isolate and evaluate general reasoning abilities in non-expert settings. To address this, we introduce VisualPuzzles, a benchmark that targets visual reasoning while deliberately minimizing reliance on specialized knowledge. VisualPuzzles consists of diverse questions spanning five categories: algorithmic, analogical, deductive, inductive, and spatial reasoning. One major source of our questions is manually translated logical reasoning questions from the Chinese Civil Service Examination. Experiments show that VisualPuzzles requires significantly less intensive domain-specific knowledge and more complex reasoning compared to benchmarks like MMMU, enabling us to better evaluate genuine multimodal reasoning. Evaluations show that state-of-the-art multimodal large language models consistently lag behind human performance on VisualPuzzles, and that strong performance on knowledge-intensive benchmarks does not necessarily translate to success on reasoning-focused, knowledge-light tasks. Additionally, reasoning enhancements such as scaling up inference compute (with "thinking" modes) yield inconsistent gains across models and task types, and we observe no clear correlation between model size and performance. We also found that models exhibit different reasoning and answering patterns on VisualPuzzles compared to benchmarks with heavier emphasis on knowledge. VisualPuzzles offers a clearer lens through which to evaluate reasoning capabilities beyond factual recall and domain knowledge.

[546] MultiLoKo: a multilingual local knowledge benchmark for LLMs spanning 31 languages

Dieuwke Hupkes,Nikolay Bogoychev

Main category: cs.CL

TLDR: MultiLoKo是一个新的多语言基准测试，涵盖31种语言，用于评估LLMs的多语言能力。它包括三个部分：主分区（每种语言500个本地相关问题）和两个翻译分区（人工翻译的英译非英和非英译英）。测试结果显示现有模型表现不佳，存在语言间知识转移不足的问题。

Details

Motivation: 评估LLMs在多语言环境中的表现，并研究多语言基准测试的创建方法。 Method: 构建MultiLoKo基准测试，包含主分区和翻译分区，使用人工和机器翻译数据，并分为开发和测试集。评估11种多语言模型的性能、语言间表现差异和问题语言的影响。 Result: 现有模型在MultiLoKo上表现不佳，平均分数低且语言间差异大。问题语言对结果有显著影响，本地数据与翻译数据差异可达20分以上。机器翻译对语言难度排序影响较小，但对模型排名和性能估计影响较大。 Conclusion: MultiLoKo揭示了现有LLMs在多语言任务中的不足，强调了本地数据的重要性，并指出机器翻译在评估中的局限性。 Abstract: We present MultiLoKo, a new benchmark for evaluating multilinguality in LLMs covering 31 languages. MultiLoKo consists of three partitions: a main partition consisting of 500 questions per language, separately sourced to be locally relevant to the specific language, and two translated partitions, containing human-authored translations from 30 non-English languages to English and vice versa. For comparison, we also release corresponding machine-authored translations. The data is equally distributed over two splits: a dev split and a blind, out-of-distribution test split. MultiLoKo can be used to study a variety of questions regarding the multilinguality of LLMs as well as meta-questions about multilingual benchmark creation. We compute MultiLoKo scores for 11 base and chat models marketed to be multilingual and study their average performance, their performance parity across languages, how much their ability to answer questions depends on the question language, and which languages are most difficult. None of the models we studied performs well on MultiLoKo, as indicated by low average scores as well as large differences between the best and worst scoring languages. Furthermore, we find a substantial effect of the question language, indicating sub-optimal knowledge transfer between languages. Lastly, we find that using local vs English-translated data can result in differences more than 20 points for the best performing models, drastically change the estimated difficulty of some languages. For using machines instead of human translations, we find a weaker effect on ordering of language difficulty, a larger difference in model rankings, and a substantial drop in estimated performance for all models.

[547] DICE: A Framework for Dimensional and Contextual Evaluation of Language Models

Aryan Shrivastava,Paula Akemi Aoyagui

Main category: cs.CL

TLDR: 论文提出DICE方法，针对语言模型（LMs）在实际应用中的评估不足问题，提出一种基于细粒度和上下文依赖的评估框架。

Details

Motivation: 当前语言模型评估基准未能充分反映其实际应用场景，缺乏对真实需求的适用性。 Method: 提出DICE方法，包括上下文无关参数（如鲁棒性、连贯性）和上下文相关参数，以更贴近实际需求。 Result: DICE为语言模型评估提供了更实用、贴近实际应用的框架。 Conclusion: DICE是语言模型评估的实用起点，强调上下文和利益相关者的需求。 Abstract: Language models (LMs) are increasingly being integrated into a wide range of applications, yet the modern evaluation paradigm does not sufficiently reflect how they are actually being used. Current evaluations rely on benchmarks that often lack direct applicability to the real-world contexts in which LMs are being deployed. To address this gap, we propose Dimensional and Contextual Evaluation (DICE), an approach that evaluates LMs on granular, context-dependent dimensions. In this position paper, we begin by examining the insufficiency of existing LM benchmarks, highlighting their limited applicability to real-world use cases. Next, we propose a set of granular evaluation parameters that capture dimensions of LM behavior that are more meaningful to stakeholders across a variety of application domains. Specifically, we introduce the concept of context-agnostic parameters - such as robustness, coherence, and epistemic honesty - and context-specific parameters that must be tailored to the specific contextual constraints and demands of stakeholders choosing to deploy LMs into a particular setting. We then discuss potential approaches to operationalize this evaluation framework, finishing with the opportunities and challenges DICE presents to the LM evaluation landscape. Ultimately, this work serves as a practical and approachable starting point for context-specific and stakeholder-relevant evaluation of LMs.

[548] S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models

Wenyuan Zhang,Shuaiyi Nie,Xinghua Zhang,Zefeng Zhang,Tingwen Liu

Main category: cs.CL

TLDR: S1-Bench是一个新基准，用于评估大型推理模型（LRMs）在直觉系统1任务中的表现，填补了现有评估空白。

Details

Motivation: 当前LRMs依赖深度分析思维，可能限制其直觉系统1能力，且缺乏相关评估基准。 Method: S1-Bench提供跨领域和语言的简单问题集，评估22个LRMs的表现。 Result: LRMs在直觉任务中效率较低，输出冗长且易出错，显示其推理模式僵化。 Conclusion: 需进一步发展以实现平衡的双系统思维，适应任务复杂性。 Abstract: We introduce S1-Bench, a novel benchmark designed to evaluate Large Reasoning Models' (LRMs) performance on simple tasks that favor intuitive system 1 thinking rather than deliberative system 2 reasoning. While LRMs have achieved significant breakthroughs in complex reasoning tasks through explicit chains of thought, their reliance on deep analytical thinking may limit their system 1 thinking capabilities. Moreover, a lack of benchmark currently exists to evaluate LRMs' performance in tasks that require such capabilities. To fill this gap, S1-Bench presents a set of simple, diverse, and naturally clear questions across multiple domains and languages, specifically designed to assess LRMs' performance in such tasks. Our comprehensive evaluation of 22 LRMs reveals significant lower efficiency tendencies, with outputs averaging 15.5 times longer than those of traditional small LLMs. Additionally, LRMs often identify correct answers early but continue unnecessary deliberation, with some models even producing numerous errors. These findings highlight the rigid reasoning patterns of current LRMs and underscore the substantial development needed to achieve balanced dual-system thinking capabilities that can adapt appropriately to task complexity.

Varun Vasudevan,Faezeh Akhavizadegan,Abhinav Prakash,Yokila Arora,Jason Cho,Tanya Mendiratta,Sushant Kumar,Kannan Achan

Main category: cs.CL

TLDR: 论文提出了一种基于LLM的迭代优化框架，用于生成满足多重复杂约束的营销文案，显著提高了成功率和点击率。

Details

Motivation: 手动撰写营销文案耗时且昂贵，而现有LLM生成的内容难以一次性满足多重复杂约束。 Method: 采用LLM驱动的端到端框架，通过迭代优化生成满足长度、主题、关键词等多重约束的文案。 Result: 迭代优化将文案成功率提升16.25-35.91%，生成的文案在点击率上比人工撰写的高38.5-45.21%。 Conclusion: 该框架为复杂约束下的文案生成提供了高效解决方案，显著优于传统方法。 Abstract: Crafting a marketing message (copy), or copywriting is a challenging generation task, as the copy must adhere to various constraints. Copy creation is inherently iterative for humans, starting with an initial draft followed by successive refinements. However, manual copy creation is time-consuming and expensive, resulting in only a few copies for each use case. This limitation restricts our ability to personalize content to customers. Contrary to the manual approach, LLMs can generate copies quickly, but the generated content does not consistently meet all the constraints on the first attempt (similar to humans). While recent studies have shown promise in improving constrained generation through iterative refinement, they have primarily addressed tasks with only a few simple constraints. Consequently, the effectiveness of iterative refinement for tasks such as copy generation, which involves many intricate constraints, remains unclear. To address this gap, we propose an LLM-based end-to-end framework for scalable copy generation using iterative refinement. To the best of our knowledge, this is the first study to address multiple challenging constraints simultaneously in copy generation. Examples of these constraints include length, topics, keywords, preferred lexical ordering, and tone of voice. We demonstrate the performance of our framework by creating copies for e-commerce banners for three different use cases of varying complexity. Our results show that iterative refinement increases the copy success rate by $16.25-35.91$% across use cases. Furthermore, the copies generated using our approach outperformed manually created content in multiple pilot studies using a multi-armed bandit framework. The winning copy improved the click-through rate by $38.5-45.21$%.

[550] Performance of Large Language Models in Supporting Medical Diagnosis and Treatment

Diogo Sousa,Guilherme Barbosa,Catarina Rocha,Dulce Oliveira

Main category: cs.CL

TLDR: 研究评估了多种大型语言模型（LLMs）在葡萄牙国家医学考试（PNA）上的表现，发现部分模型在准确性和成本效益上超过人类医学生，可作为医疗决策的辅助工具。

Details

Motivation: 探索LLMs在医疗领域的潜力，提升诊断准确性和治疗规划支持。 Method: 评估开源和闭源LLMs在2024年葡萄牙国家医学考试（PNA）上的表现，结合准确性和成本进行评分。 Result: 部分模型表现优于人类医学生，Chain-of-Thought等推理方法对性能有显著影响。 Conclusion: LLMs可作为医疗决策的辅助工具，需进一步优化成本效益和推理方法。 Abstract: The integration of Large Language Models (LLMs) into healthcare holds significant potential to enhance diagnostic accuracy and support medical treatment planning. These AI-driven systems can analyze vast datasets, assisting clinicians in identifying diseases, recommending treatments, and predicting patient outcomes. This study evaluates the performance of a range of contemporary LLMs, including both open-source and closed-source models, on the 2024 Portuguese National Exam for medical specialty access (PNA), a standardized medical knowledge assessment. Our results highlight considerable variation in accuracy and cost-effectiveness, with several models demonstrating performance exceeding human benchmarks for medical students on this specific task. We identify leading models based on a combined score of accuracy and cost, discuss the implications of reasoning methodologies like Chain-of-Thought, and underscore the potential for LLMs to function as valuable complementary tools aiding medical professionals in complex clinical decision-making.

[551] LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models

Parshin Shojaee,Ngoc-Hieu Nguyen,Kazem Meidani,Amir Barati Farimani,Khoa D Doan,Chandan K Reddy

Main category: cs.CL

TLDR: LLM-SRBench是一个新的基准测试，旨在评估LLM在科学方程发现任务中的能力，避免因记忆常见方程而导致的性能虚高。

Details

Motivation: 现有基准测试依赖常见方程，容易被LLM记忆，导致评估不准确。 Method: 提出LLM-SRBench，包含239个挑战性问题，分为LSR-Transform和LSR-Synth两类，分别测试推理能力和数据驱动发现能力。 Result: 最佳系统仅达到31.5%的符号准确率，显示科学方程发现的挑战性。 Conclusion: LLM-SRBench为未来研究提供了有价值的资源，凸显了LLM在科学发现中的局限性。 Abstract: Scientific equation discovery is a fundamental task in the history of scientific progress, enabling the derivation of laws governing natural phenomena. Recently, Large Language Models (LLMs) have gained interest for this task due to their potential to leverage embedded scientific knowledge for hypothesis generation. However, evaluating the true discovery capabilities of these methods remains challenging, as existing benchmarks often rely on common equations that are susceptible to memorization by LLMs, leading to inflated performance metrics that do not reflect discovery. In this paper, we introduce LLM-SRBench, a comprehensive benchmark with 239 challenging problems across four scientific domains specifically designed to evaluate LLM-based scientific equation discovery methods while preventing trivial memorization. Our benchmark comprises two main categories: LSR-Transform, which transforms common physical models into less common mathematical representations to test reasoning beyond memorized forms, and LSR-Synth, which introduces synthetic, discovery-driven problems requiring data-driven reasoning. Through extensive evaluation of several state-of-the-art methods, using both open and closed LLMs, we find that the best-performing system so far achieves only 31.5% symbolic accuracy. These findings highlight the challenges of scientific equation discovery, positioning LLM-SRBench as a valuable resource for future research.

[552] CliniChat: A Multi-Source Knowledge-Driven Framework for Clinical Interview Dialogue Reconstruction and Evaluation

Jing Chen,Zhihua Wei,Wei Zhang,Yingying Hu,Qiong Zhang

Main category: cs.CL

TLDR: CliniChat框架通过整合多源知识，帮助大语言模型模拟真实临床访谈，包含对话重构和评估模块，并提出了高质量数据集和专用模型。

Details

Motivation: 解决临床访谈中高质量对话数据和评估方法缺乏的问题。 Method: 提出CliniChat框架，包含Clini-Recon（重构对话）和Clini-Eval（评估对话）模块，整合多源知识。 Result: 生成高质量数据集MedQA-Dialog和专用模型CliniChatGLM，在病史采集等方面表现优异。 Conclusion: CliniChat显著提升了大语言模型在临床访谈中的能力，尤其在病史采集方面达到领先水平。 Abstract: Large language models (LLMs) hold great promise for assisting clinical interviews due to their fluent interactive capabilities and extensive medical knowledge. However, the lack of high-quality interview dialogue data and widely accepted evaluation methods has significantly impeded this process. So we propose CliniChat, a framework that integrates multi-source knowledge to enable LLMs to simulate real-world clinical interviews. It consists of two modules: Clini-Recon and Clini-Eval, each responsible for reconstructing and evaluating interview dialogues, respectively. By incorporating three sources of knowledge, Clini-Recon transforms clinical notes into systematic, professional, and empathetic interview dialogues. Clini-Eval combines a comprehensive evaluation metric system with a two-phase automatic evaluation approach, enabling LLMs to assess interview performance like experts. We contribute MedQA-Dialog, a high-quality synthetic interview dialogue dataset, and CliniChatGLM, a model specialized for clinical interviews. Experimental results demonstrate that CliniChatGLM's interview capabilities undergo a comprehensive upgrade, particularly in history-taking, achieving state-of-the-art performance.

Michał Turski,Mateusz Chiliński,Łukasz Borchmann

Main category: cs.CL

TLDR: 论文介绍了CheckboxQA数据集，旨在解决大视觉和语言模型在处理复选框内容时的局限性，提升文档理解能力。

Details

Motivation: 复选框在文档处理中至关重要，但现有模型在解释复选框内容时表现不佳，可能导致高成本的监管或合同疏漏。 Method: 通过创建CheckboxQA数据集，评估和改进模型在复选框相关任务上的性能。 Result: 数据集揭示了当前模型的局限性，并为文档理解系统的进步提供了工具。 Conclusion: CheckboxQA数据集对法律科技和金融等领域的应用具有重要意义。 Abstract: Checkboxes are critical in real-world document processing where the presence or absence of ticks directly informs data extraction and decision-making processes. Yet, despite the strong performance of Large Vision and Language Models across a wide range of tasks, they struggle with interpreting checkable content. This challenge becomes particularly pressing in industries where a single overlooked checkbox may lead to costly regulatory or contractual oversights. To address this gap, we introduce the CheckboxQA dataset, a targeted resource designed to evaluate and improve model performance on checkbox-related tasks. It reveals the limitations of current models and serves as a valuable tool for advancing document comprehension systems, with significant implications for applications in sectors such as legal tech and finance. The dataset is publicly available at: https://github.com/Snowflake-Labs/CheckboxQA

[554] Can We Edit LLMs for Long-Tail Biomedical Knowledge?

Xinhao Yi,Jake Lever,Kevin Bryson,Zaiqiao Meng

Main category: cs.CL

TLDR: 知识编辑在更新大型语言模型（LLM）方面有效，但在生物医学领域面临长尾知识分布和一对多知识的挑战，导致编辑效果有限。

Details

Motivation: 研究知识编辑方法在生物医学长尾知识中的有效性，填补该领域的空白。 Method: 对现有知识编辑方法进行综合研究，分析其在长尾生物医学知识中的应用效果。 Result: 编辑方法能提升长尾知识表现，但仍不及高频知识；一对多知识的高比例限制了编辑效果。 Conclusion: 需定制策略以解决长尾生物医学知识编辑的局限性。 Abstract: Knowledge editing has emerged as an effective approach for updating large language models (LLMs) by modifying their internal knowledge. However, their application to the biomedical domain faces unique challenges due to the long-tailed distribution of biomedical knowledge, where rare and infrequent information is prevalent. In this paper, we conduct the first comprehensive study to investigate the effectiveness of knowledge editing methods for editing long-tail biomedical knowledge. Our results indicate that, while existing editing methods can enhance LLMs' performance on long-tail biomedical knowledge, their performance on long-tail knowledge remains inferior to that on high-frequency popular knowledge, even after editing. Our further analysis reveals that long-tail biomedical knowledge contains a significant amount of one-to-many knowledge, where one subject and relation link to multiple objects. This high prevalence of one-to-many knowledge limits the effectiveness of knowledge editing in improving LLMs' understanding of long-tail biomedical knowledge, highlighting the need for tailored strategies to bridge this performance gap.

[555] LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models

Minqian Liu,Zhiyang Xu,Xinyi Zhang,Heajun An,Sarvech Qadir,Qi Zhang,Pamela J. Wisniewski,Jin-Hee Cho,Sang Won Lee,Ruoxi Jia,Lifu Huang

Main category: cs.CL

TLDR: 论文研究了大型语言模型（LLMs）在说服任务中的安全性问题，提出了PersuSafety框架评估其是否拒绝不道德任务及策略，并发现多数LLMs存在显著安全隐患。

Details

Motivation: LLMs接近人类的说服能力引发了对潜在安全风险的担忧，如操纵、欺骗等不道德行为。 Method: 提出PersuSafety框架，分三阶段（场景创建、对话模拟、安全评估）评估LLMs的说服安全性，涵盖6类不道德主题和15种策略。 Result: 实验发现多数LLMs存在安全隐患，无法识别有害任务或使用不道德策略。 Conclusion: 研究呼吁更多关注LLMs在目标驱动对话（如说服）中的安全性对齐。 Abstract: Recent advancements in Large Language Models (LLMs) have enabled them to approach human-level persuasion capabilities. However, such potential also raises concerns about the safety risks of LLM-driven persuasion, particularly their potential for unethical influence through manipulation, deception, exploitation of vulnerabilities, and many other harmful tactics. In this work, we present a systematic investigation of LLM persuasion safety through two critical aspects: (1) whether LLMs appropriately reject unethical persuasion tasks and avoid unethical strategies during execution, including cases where the initial persuasion goal appears ethically neutral, and (2) how influencing factors like personality traits and external pressures affect their behavior. To this end, we introduce PersuSafety, the first comprehensive framework for the assessment of persuasion safety which consists of three stages, i.e., persuasion scene creation, persuasive conversation simulation, and persuasion safety assessment. PersuSafety covers 6 diverse unethical persuasion topics and 15 common unethical strategies. Through extensive experiments across 8 widely used LLMs, we observe significant safety concerns in most LLMs, including failing to identify harmful persuasion tasks and leveraging various unethical persuasion strategies. Our study calls for more attention to improve safety alignment in progressive and goal-driven conversations such as persuasion.

[556] xVerify: Efficient Answer Verifier for Reasoning Model Evaluations

Ding Chen,Qingchen Yu,Pengyuan Wang,Wentao Zhang,Bo Tang,Feiyu Xiong,Xinchi Li,Minchuan Yang,Zhiyu Li

Main category: cs.CL

TLDR: 论文提出了xVerify，一种用于评估推理模型的高效答案验证器，解决了现有方法难以判断复杂推理输出与参考答案等价性的问题。

Details

Motivation: 现有评估方法难以处理推理模型生成的复杂响应，无法有效判断输出与参考答案的等价性或提取最终答案。 Method: 构建VAR数据集，通过多轮标注确保准确性，并训练不同规模的xVerify模型。 Result: xVerify模型在测试集和泛化集上F1分数和准确率均超过95%，部分模型甚至超越GPT-4o。 Conclusion: xVerify在答案验证任务中表现出高效性和泛化能力，验证了其有效性。 Abstract: With the release of the o1 model by OpenAI, reasoning models adopting slow thinking strategies have gradually emerged. As the responses generated by such models often include complex reasoning, intermediate steps, and self-reflection, existing evaluation methods are often inadequate. They struggle to determine whether the LLM output is truly equivalent to the reference answer, and also have difficulty identifying and extracting the final answer from long, complex responses. To address this issue, we propose xVerify, an efficient answer verifier for reasoning model evaluations. xVerify demonstrates strong capability in equivalence judgment, enabling it to effectively determine whether the answers produced by reasoning models are equivalent to reference answers across various types of objective questions. To train and evaluate xVerify, we construct the VAR dataset by collecting question-answer pairs generated by multiple LLMs across various datasets, leveraging multiple reasoning models and challenging evaluation sets designed specifically for reasoning model assessment. A multi-round annotation process is employed to ensure label accuracy. Based on the VAR dataset, we train multiple xVerify models of different scales. In evaluation experiments conducted on both the test set and generalization set, all xVerify models achieve overall F1 scores and accuracy exceeding 95\%. Notably, the smallest variant, xVerify-0.5B-I, outperforms all evaluation methods except GPT-4o, while xVerify-3B-Ib surpasses GPT-4o in overall performance. These results validate the effectiveness and generalizability of xVerify.

cs.CR [Back]

[557] AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender

Weixiang Zhao,Jiahe Guo,Yulin Hu,Yang Deng,An Zhang,Xingyu Sui,Xinyang Han,Yanyan Zhao,Bing Qin,Tat-Seng Chua,Ting Liu

Main category: cs.CR

TLDR: AdaSteer是一种自适应激活引导方法，动态调整模型行为以防御越狱攻击，同时减少对良性输入的误拒。

Details

Motivation: 现有固定系数的激活引导方法在防御越狱攻击时表现不佳，且容易误拒良性输入。 Method: 提出AdaSteer，基于输入特性动态调整引导系数，结合拒绝方向（RD）和有害性方向（HD）进行引导。 Result: 在LLaMA-3.1、Gemma-2和Qwen2.5上的实验显示，AdaSteer在多种越狱攻击中优于基线方法，且对实用性影响最小。 Conclusion: AdaSteer展示了模型内部可解释性在实时灵活安全增强中的潜力。 Abstract: Despite extensive efforts in safety alignment, large language models (LLMs) remain vulnerable to jailbreak attacks. Activation steering offers a training-free defense method but relies on fixed steering coefficients, resulting in suboptimal protection and increased false rejections of benign inputs. To address this, we propose AdaSteer, an adaptive activation steering method that dynamically adjusts model behavior based on input characteristics. We identify two key properties: Rejection Law (R-Law), which shows that stronger steering is needed for jailbreak inputs opposing the rejection direction, and Harmfulness Law (H-Law), which differentiates adversarial and benign inputs. AdaSteer steers input representations along both the Rejection Direction (RD) and Harmfulness Direction (HD), with adaptive coefficients learned via logistic regression, ensuring robust jailbreak defense while preserving benign input handling. Experiments on LLaMA-3.1, Gemma-2, and Qwen2.5 show that AdaSteer outperforms baseline methods across multiple jailbreak attacks with minimal impact on utility. Our results highlight the potential of interpretable model internals for real-time, flexible safety enforcement in LLMs.

Yanbo Wang,Jiyang Guan,Jian Liang,Ran He

Main category: cs.CR

TLDR: 多模态大语言模型（MLLMs）的安全对齐存在不足，当前方法依赖语言模块的对齐，但对多模态输入缺乏专门的安全措施。研究发现数据分布偏差是主要问题，提出通过微调少量良性指令数据来显著提升安全性。

Details

Motivation: 当前开源MLLMs的安全对齐主要依赖语言模块，缺乏针对多模态输入的安全措施，易受视觉域攻击。 Method: 通过比较实验发现数据分布偏差是主要问题，提出微调少量良性指令数据，替换为简单拒绝句子。 Result: 实验表明，无需大量高质量恶意数据，仅需在微调集中包含特定比例的拒绝数据即可显著提升安全性。 Conclusion: 安全对齐在多模态预训练或指令微调中并未丢失，而是被掩盖，纠正数据偏差可缩小视觉域的安全差距。 Abstract: Multi-modal large language models (MLLMs) have made significant progress, yet their safety alignment remains limited. Typically, current open-source MLLMs rely on the alignment inherited from their language module to avoid harmful generations. However, the lack of safety measures specifically designed for multi-modal inputs creates an alignment gap, leaving MLLMs vulnerable to vision-domain attacks such as typographic manipulation. Current methods utilize a carefully designed safety dataset to enhance model defense capability, while the specific knowledge or patterns acquired from the high-quality dataset remain unclear. Through comparison experiments, we find that the alignment gap primarily arises from data distribution biases, while image content, response quality, or the contrastive behavior of the dataset makes little contribution to boosting multi-modal safety. To further investigate this and identify the key factors in improving MLLM safety, we propose finetuning MLLMs on a small set of benign instruct-following data with responses replaced by simple, clear rejection sentences. Experiments show that, without the need for labor-intensive collection of high-quality malicious data, model safety can still be significantly improved, as long as a specific fraction of rejection data exists in the finetuning set, indicating the security alignment is not lost but rather obscured during multi-modal pretraining or instruction finetuning. Simply correcting the underlying data bias could narrow the safety gap in the vision domain.

[559] The Structural Safety Generalization Problem

Julius Broomfield,Tom Gibbs,Ethan Kosak-Hine,George Ingebretsen,Tia Nasir,Jason Zhang,Reihaneh Iranmanesh,Sara Pieri,Reihaneh Rabbany,Kellin Pelrine

Main category: cs.CR

TLDR: 论文提出了一种针对LLM jailbreaks的新框架，通过研究语义等效输入的安全性泛化失败，设计多轮、多图像和翻译攻击，并开发了一种结构重写防护栏以提高安全性。

Details

Motivation: LLM jailbreaks是一个普遍的安全挑战，现有方法难以解决。论文聚焦于安全性在语义等效输入上的泛化失败问题，提出更易处理的攻击研究框架。 Method: 通过设计多轮、多图像和翻译攻击，研究语义等效输入的安全性差异，并提出结构重写防护栏以改善安全评估。 Result: 不同攻击结构导致不同的安全结果，结构重写防护栏显著提高了对有害输入的拒绝率，同时避免过度拒绝良性输入。 Conclusion: 该框架为AI安全研究提供了一个关键里程碑，比通用防御更易处理，但对长期安全至关重要。 Abstract: LLM jailbreaks are a widespread safety challenge. Given this problem has not yet been tractable, we suggest targeting a key failure mechanism: the failure of safety to generalize across semantically equivalent inputs. We further focus the target by requiring desirable tractability properties of attacks to study: explainability, transferability between models, and transferability between goals. We perform red-teaming within this framework by uncovering new vulnerabilities to multi-turn, multi-image, and translation-based attacks. These attacks are semantically equivalent by our design to their single-turn, single-image, or untranslated counterparts, enabling systematic comparisons; we show that the different structures yield different safety outcomes. We then demonstrate the potential for this framework to enable new defenses by proposing a Structure Rewriting Guardrail, which converts an input to a structure more conducive to safety assessment. This guardrail significantly improves refusal of harmful inputs, without over-refusing benign ones. Thus, by framing this intermediate challenge - more tractable than universal defenses but essential for long-term safety - we highlight a critical milestone for AI safety research.

[560] AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender

Weixiang Zhao,Jiahe Guo,Yulin Hu,Yang Deng,An Zhang,Xingyu Sui,Xinyang Han,Yanyan Zhao,Bing Qin,Tat-Seng Chua,Ting Liu

Main category: cs.CR

TLDR: AdaSteer是一种自适应激活引导方法，通过动态调整模型行为来防御LLM的越狱攻击，优于基线方法且对实用性影响最小。

Details

Motivation: 现有固定系数的激活引导方法在防御越狱攻击时效果不佳且易误拒良性输入，需改进。 Method: 提出AdaSteer，基于输入特性动态调整引导系数，利用R-Law和H-Law区分攻击与良性输入。 Result: 在LLaMA-3.1、Gemma-2和Qwen2.5上实验显示，AdaSteer在多种越狱攻击中表现优于基线方法。 Conclusion: AdaSteer展示了利用可解释模型内部机制实现实时灵活安全防护的潜力。 Abstract: Despite extensive efforts in safety alignment, large language models (LLMs) remain vulnerable to jailbreak attacks. Activation steering offers a training-free defense method but relies on fixed steering coefficients, resulting in suboptimal protection and increased false rejections of benign inputs. To address this, we propose AdaSteer, an adaptive activation steering method that dynamically adjusts model behavior based on input characteristics. We identify two key properties: Rejection Law (R-Law), which shows that stronger steering is needed for jailbreak inputs opposing the rejection direction, and Harmfulness Law (H-Law), which differentiates adversarial and benign inputs. AdaSteer steers input representations along both the Rejection Direction (RD) and Harmfulness Direction (HD), with adaptive coefficients learned via logistic regression, ensuring robust jailbreak defense while preserving benign input handling. Experiments on LLaMA-3.1, Gemma-2, and Qwen2.5 show that AdaSteer outperforms baseline methods across multiple jailbreak attacks with minimal impact on utility. Our results highlight the potential of interpretable model internals for real-time, flexible safety enforcement in LLMs.

Yanbo Wang,Jiyang Guan,Jian Liang,Ran He

Main category: cs.CR

TLDR: 多模态大语言模型（MLLMs）的安全性对齐存在不足，主要源于数据分布偏差。通过微调少量良性指令数据并替换为明确拒绝的响应，可以显著提升模型安全性。

Details

Motivation: 当前开源MLLMs的安全性对齐主要依赖语言模块，缺乏针对多模态输入的安全措施，容易受到视觉域攻击。 Method: 提出在少量良性指令数据上微调MLLMs，并将响应替换为简单明确的拒绝句子，无需收集高质量恶意数据。 Result: 实验表明，仅需在微调集中包含特定比例的拒绝数据，即可显著提升模型安全性。 Conclusion: 安全性对齐在多模态预训练或指令微调中并未丢失，而是被掩盖。纠正数据偏差可缩小视觉域的安全差距。 Abstract: Multi-modal large language models (MLLMs) have made significant progress, yet their safety alignment remains limited. Typically, current open-source MLLMs rely on the alignment inherited from their language module to avoid harmful generations. However, the lack of safety measures specifically designed for multi-modal inputs creates an alignment gap, leaving MLLMs vulnerable to vision-domain attacks such as typographic manipulation. Current methods utilize a carefully designed safety dataset to enhance model defense capability, while the specific knowledge or patterns acquired from the high-quality dataset remain unclear. Through comparison experiments, we find that the alignment gap primarily arises from data distribution biases, while image content, response quality, or the contrastive behavior of the dataset makes little contribution to boosting multi-modal safety. To further investigate this and identify the key factors in improving MLLM safety, we propose finetuning MLLMs on a small set of benign instruct-following data with responses replaced by simple, clear rejection sentences. Experiments show that, without the need for labor-intensive collection of high-quality malicious data, model safety can still be significantly improved, as long as a specific fraction of rejection data exists in the finetuning set, indicating the security alignment is not lost but rather obscured during multi-modal pretraining or instruction finetuning. Simply correcting the underlying data bias could narrow the safety gap in the vision domain.

[562] The Structural Safety Generalization Problem

Julius Broomfield,Tom Gibbs,Ethan Kosak-Hine,George Ingebretsen,Tia Nasir,Jason Zhang,Reihaneh Iranmanesh,Sara Pieri,Reihaneh Rabbany,Kellin Pelrine

Main category: cs.CR

TLDR: 论文提出了一种针对LLM安全漏洞的框架，通过研究语义等效输入的安全泛化失败问题，设计多轮、多图像和翻译攻击，并提出了一种结构重写防护措施。

Details

Motivation: LLM的安全漏洞（如越狱）尚未得到有效解决，作者希望通过研究语义等效输入的安全泛化失败问题，提出更易处理的攻击和防御方法。 Method: 通过设计多轮对话、多图像和翻译攻击，研究语义等效输入的安全泛化问题，并提出结构重写防护措施。 Result: 不同结构的攻击导致不同的安全结果，结构重写防护显著提高了对有害输入的拒绝率，同时避免过度拒绝良性输入。 Conclusion: 该框架为AI安全研究提供了一个关键里程碑，通过解决语义等效输入的安全泛化问题，为长期安全目标奠定了基础。 Abstract: LLM jailbreaks are a widespread safety challenge. Given this problem has not yet been tractable, we suggest targeting a key failure mechanism: the failure of safety to generalize across semantically equivalent inputs. We further focus the target by requiring desirable tractability properties of attacks to study: explainability, transferability between models, and transferability between goals. We perform red-teaming within this framework by uncovering new vulnerabilities to multi-turn, multi-image, and translation-based attacks. These attacks are semantically equivalent by our design to their single-turn, single-image, or untranslated counterparts, enabling systematic comparisons; we show that the different structures yield different safety outcomes. We then demonstrate the potential for this framework to enable new defenses by proposing a Structure Rewriting Guardrail, which converts an input to a structure more conducive to safety assessment. This guardrail significantly improves refusal of harmful inputs, without over-refusing benign ones. Thus, by framing this intermediate challenge - more tractable than universal defenses but essential for long-term safety - we highlight a critical milestone for AI safety research.

cs.AI [Back]

[563] A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems

Zixuan Ke,Fangkai Jiao,Yifei Ming,Xuan-Phi Nguyen,Austin Xu,Do Xuan Long,Minzhi Li,Chengwei Qin,Peifeng Wang,Silvio Savarese,Caiming Xiong,Shafiq Joty

Main category: cs.AI

TLDR: 该论文综述了大语言模型（LLM）推理能力的分类与趋势，重点分析了推理阶段（推理时或训练时）和架构（独立LLM或复合系统）两个维度，并探讨了输入和输出层面的技术。

Details

Motivation: 研究LLM推理能力的分类与趋势，以区分高级AI系统与传统模型，并系统化理解LLM推理的演进。 Method: 通过两个正交维度（推理阶段和架构）分类现有方法，分析输入和输出层面的技术，并涵盖从监督微调到强化学习等多种算法。 Result: 揭示了LLM推理的演进趋势，如从推理扩展到学习推理，以及向代理工作流的过渡。 Conclusion: 该分类为LLM推理领域提供了系统化的理解，并突出了新兴趋势和关键设计。 Abstract: Reasoning is a fundamental cognitive process that enables logical inference, problem-solving, and decision-making. With the rapid advancement of large language models (LLMs), reasoning has emerged as a key capability that distinguishes advanced AI systems from conventional models that empower chatbots. In this survey, we categorize existing methods along two orthogonal dimensions: (1) Regimes, which define the stage at which reasoning is achieved (either at inference time or through dedicated training); and (2) Architectures, which determine the components involved in the reasoning process, distinguishing between standalone LLMs and agentic compound systems that incorporate external tools, and multi-agent collaborations. Within each dimension, we analyze two key perspectives: (1) Input level, which focuses on techniques that construct high-quality prompts that the LLM condition on; and (2) Output level, which methods that refine multiple sampled candidates to enhance reasoning quality. This categorization provides a systematic understanding of the evolving landscape of LLM reasoning, highlighting emerging trends such as the shift from inference-scaling to learning-to-reason (e.g., DeepSeek-R1), and the transition to agentic workflows (e.g., OpenAI Deep Research, Manus Agent). Additionally, we cover a broad spectrum of learning algorithms, from supervised fine-tuning to reinforcement learning such as PPO and GRPO, and the training of reasoners and verifiers. We also examine key designs of agentic workflows, from established patterns like generator-evaluator and LLM debate to recent innovations. ...

[564] Towards Stepwise Domain Knowledge-Driven Reasoning Optimization and Reflection Improvement

Chengyuan Liu,Shihang Wang,Lizhi Qing,Kaisong Song,Junjie Cao,Jun Lin,Ji Zhang,Ang Li,Kun Kuang,Fei Wu

Main category: cs.AI

TLDR: 论文提出了一种基于蒙特卡洛树搜索（MCTS）的逐步监督框架，用于优化需要领域专业知识的推理任务，并引入了偏好优化方法以提升推理路径的自我反思能力。实验证明其在法律领域问题中的有效性。

Details

Motivation: 探索逐步监督和MCTS在需要领域专业知识的任务中的潜力，解决传统MCTS在此类任务中的局限性。 Method: 提出Stepwise Domain Knowledge-Driven Reasoning Optimization框架，结合MCTS生成逐步监督；引入Preference Optimization towards Reflection Paths以优化推理路径。 Result: 实验结果表明，该方法在法律领域问题中表现优异，并提供了多样化的研究发现。 Conclusion: 研究为领域特定LLMs和MCTS的研究提供了新思路，鼓励进一步探索。 Abstract: Recently, stepwise supervision on Chain of Thoughts (CoTs) presents an enhancement on the logical reasoning tasks such as coding and math, with the help of Monte Carlo Tree Search (MCTS). However, its contribution to tasks requiring domain-specific expertise and knowledge remains unexplored. Motivated by the interest, we identify several potential challenges of vanilla MCTS within this context, and propose the framework of Stepwise Domain Knowledge-Driven Reasoning Optimization, employing the MCTS algorithm to develop step-level supervision for problems that require essential comprehension, reasoning, and specialized knowledge. Additionally, we also introduce the Preference Optimization towards Reflection Paths, which iteratively learns self-reflection on the reasoning thoughts from better perspectives. We have conducted extensive experiments to evaluate the advantage of the methodologies. Empirical results demonstrate the effectiveness on various legal-domain problems. We also report a diverse set of valuable findings, hoping to encourage the enthusiasm to the research of domain-specific LLMs and MCTS.

[565] A Short Survey on Small Reasoning Models: Training, Inference, Applications and Research Directions

Chengyu Wang,Taolin Zhang,Richang Hong,Jun Huang

Main category: cs.AI

TLDR: 本文综述了约170篇关于小型推理模型（SRMs）的论文，探讨了其训练、推理技术及领域应用，并展望了未来研究方向。

Details

Motivation: 大型推理模型（LRMs）计算需求高，而小型推理模型（SRMs）效率更高且具备独特能力，因此研究SRMs具有重要意义。 Method: 通过综述约170篇相关论文，分析SRMs的训练、推理技术及其在特定领域的应用。 Result: 总结了SRMs的当前研究现状，并提供了高效推理功能的开发参考。 Conclusion: SRMs在高效推理方面具有潜力，未来研究可进一步探索其应用和发展方向。 Abstract: Recently, the reasoning capabilities of large reasoning models (LRMs), such as DeepSeek-R1, have seen significant advancements through the slow thinking process. Despite these achievements, the substantial computational demands of LRMs present considerable challenges. In contrast, small reasoning models (SRMs), often distilled from larger ones, offer greater efficiency and can exhibit distinct capabilities and cognitive trajectories compared to LRMs. This work surveys around 170 recently published papers on SRMs for tackling various complex reasoning tasks. We review the current landscape of SRMs and analyze diverse training and inference techniques related to SRMs. Furthermore, we provide a comprehensive review of SRMs for domain-specific applications and discuss possible future research directions. This survey serves as an essential reference for researchers to leverage or develop SRMs for advanced reasoning functionalities with high efficiency.

[566] Draw with Thought: Unleashing Multimodal Reasoning for Scientific Diagram Generation

Zhiqing Cui,Jiahao Yuan,Hanqing Wang,Yanshu Li,Chenxu Du,Zhenglong Ding

Main category: cs.AI

TLDR: DwT框架通过认知链式推理将科学图表转换为可编辑的mxGraph XML代码，无需微调模型，提供高保真、语义对齐的重建。

Details

Motivation: 科学图表通常以静态图像发布，丢失了符号语义且难以重用。现有方法缺乏语义控制和结构可解释性。 Method: DwT分为两阶段：粗到细规划处理感知结构和语义规范，结构感知代码生成通过格式引导细化增强。 Result: 实验表明DwT在八种MLLMs上实现高保真、语义对齐的重建，人类评估确认其准确性和视觉美观。 Conclusion: DwT为静态图表转换为可执行表示提供了可扩展解决方案，推动了机器对科学图形的理解。 Abstract: Scientific diagrams are vital tools for communicating structured knowledge across disciplines. However, they are often published as static raster images, losing symbolic semantics and limiting reuse. While Multimodal Large Language Models (MLLMs) offer a pathway to bridging vision and structure, existing methods lack semantic control and structural interpretability, especially on complex diagrams. We propose Draw with Thought (DwT), a training-free framework that guides MLLMs to reconstruct diagrams into editable mxGraph XML code through cognitively-grounded Chain-of-Thought reasoning. DwT enables interpretable and controllable outputs without model fine-tuning by dividing the task into two stages: Coarse-to-Fine Planning, which handles perceptual structuring and semantic specification, and Structure-Aware Code Generation, enhanced by format-guided refinement. To support evaluation, we release Plot2XML, a benchmark of 247 real-world scientific diagrams with gold-standard XML annotations. Extensive experiments across eight MLLMs show that our approach yields high-fidelity, semantically aligned, and structurally valid reconstructions, with human evaluations confirming strong alignment in both accuracy and visual aesthetics, offering a scalable solution for converting static visuals into executable representations and advancing machine understanding of scientific graphics.

[567] Reduction of Supervision for Biomedical Knowledge Discovery

Christos Theodoropoulos,Andrei Catalin Coman,James Henderson,Marie-Francine Moens

Main category: cs.AI

TLDR: 论文提出了一种基于依赖树和注意力机制的无监督算法，用于在生物医学文本中识别语义关系，减少对监督数据的依赖。

Details

Motivation: 知识发现面临信息过载和标注数据稀缺的挑战，需要平衡监督水平和模型效果。 Method: 采用无监督算法和点二元分类方法，逐步减少监督依赖，评估模型在噪声标签下的表现。 Result: 在生物医学基准数据集上验证了方法的有效性，展示了从弱监督到完全无监督的适应性。 Conclusion: 研究为数据稀缺时的知识发现提供了高效方法，推动了适应性知识提取系统的进展。 Abstract: Knowledge discovery is hindered by the increasing volume of publications and the scarcity of extensive annotated data. To tackle the challenge of information overload, it is essential to employ automated methods for knowledge extraction and processing. Finding the right balance between the level of supervision and the effectiveness of models poses a significant challenge. While supervised techniques generally result in better performance, they have the major drawback of demanding labeled data. This requirement is labor-intensive and time-consuming and hinders scalability when exploring new domains. In this context, our study addresses the challenge of identifying semantic relationships between biomedical entities (e.g., diseases, proteins) in unstructured text while minimizing dependency on supervision. We introduce a suite of unsupervised algorithms based on dependency trees and attention mechanisms and employ a range of pointwise binary classification methods. Transitioning from weakly supervised to fully unsupervised settings, we assess the methods' ability to learn from data with noisy labels. The evaluation on biomedical benchmark datasets explores the effectiveness of the methods. Our approach tackles a central issue in knowledge discovery: balancing performance with minimal supervision. By gradually decreasing supervision, we assess the robustness of pointwise binary classification techniques in handling noisy labels, revealing their capability to shift from weakly supervised to entirely unsupervised scenarios. Comprehensive benchmarking offers insights into the effectiveness of these techniques, suggesting an encouraging direction toward adaptable knowledge discovery systems, representing progress in creating data-efficient methodologies for extracting useful insights when annotated data is limited.

[568] EmoAgent: Assessing and Safeguarding Human-AI Interaction for Mental Health Safety

Jiahao Qiu,Yinghui He,Xinzhe Juan,Yiming Wang,Yuhan Liu,Zixin Yao,Yue Wu,Xun Jiang,Ling Yang,Mengdi Wang

Main category: cs.AI

TLDR: EmoAgent是一个多智能体框架，用于评估和减轻LLM驱动的AI角色对心理脆弱用户的潜在心理健康风险。

Details

Motivation: 随着LLM驱动的AI角色兴起，心理脆弱用户可能面临心理健康风险，需要一种方法来评估和减轻这些风险。 Method: EmoAgent包括EmoEval（模拟虚拟用户并评估心理健康变化）和EmoGuard（监控用户心理状态并提供反馈）。 Result: 实验显示，情感对话可能导致34.4%的模拟用户心理状态恶化，而EmoGuard显著降低了这一比例。 Conclusion: EmoAgent能有效确保更安全的AI-人类交互，减少心理健康风险。 Abstract: The rise of LLM-driven AI characters raises safety concerns, particularly for vulnerable human users with psychological disorders. To address these risks, we propose EmoAgent, a multi-agent AI framework designed to evaluate and mitigate mental health hazards in human-AI interactions. EmoAgent comprises two components: EmoEval simulates virtual users, including those portraying mentally vulnerable individuals, to assess mental health changes before and after interactions with AI characters. It uses clinically proven psychological and psychiatric assessment tools (PHQ-9, PDI, PANSS) to evaluate mental risks induced by LLM. EmoGuard serves as an intermediary, monitoring users' mental status, predicting potential harm, and providing corrective feedback to mitigate risks. Experiments conducted in popular character-based chatbots show that emotionally engaging dialogues can lead to psychological deterioration in vulnerable users, with mental state deterioration in more than 34.4% of the simulations. EmoGuard significantly reduces these deterioration rates, underscoring its role in ensuring safer AI-human interactions. Our code is available at: https://github.com/1akaman/EmoAgent

[569] Can LLM feedback enhance review quality? A randomized study of 20K reviews at ICLR 2025

Nitya Thakkar,Mert Yuksekgonul,Jake Silberg,Animesh Garg,Nanyun Peng,Fei Sha,Rose Yu,Carl Vondrick,James Zou

Main category: cs.AI

TLDR: 论文提出了一种基于大型语言模型的Review Feedback Agent系统，用于提升AI会议同行评审的质量，通过自动反馈改进评审意见的清晰度和可操作性。

Details

Motivation: AI会议投稿量激增导致评审质量下降和作者不满，亟需解决方案。 Method: 开发了Review Feedback Agent系统，利用多个大型语言模型为评审提供自动反馈，并通过可靠性测试确保反馈质量。 Result: 27%的评审员在收到反馈后更新了评审意见，AI反馈显著提升了评审的长度和信息量。 Conclusion: 精心设计的AI生成反馈能有效提升同行评审质量，增加评审员与作者的互动。 Abstract: Peer review at AI conferences is stressed by rapidly rising submission volumes, leading to deteriorating review quality and increased author dissatisfaction. To address these issues, we developed Review Feedback Agent, a system leveraging multiple large language models (LLMs) to improve review clarity and actionability by providing automated feedback on vague comments, content misunderstandings, and unprofessional remarks to reviewers. Implemented at ICLR 2025 as a large randomized control study, our system provided optional feedback to more than 20,000 randomly selected reviews. To ensure high-quality feedback for reviewers at this scale, we also developed a suite of automated reliability tests powered by LLMs that acted as guardrails to ensure feedback quality, with feedback only being sent to reviewers if it passed all the tests. The results show that 27% of reviewers who received feedback updated their reviews, and over 12,000 feedback suggestions from the agent were incorporated by those reviewers. This suggests that many reviewers found the AI-generated feedback sufficiently helpful to merit updating their reviews. Incorporating AI feedback led to significantly longer reviews (an average increase of 80 words among those who updated after receiving feedback) and more informative reviews, as evaluated by blinded researchers. Moreover, reviewers who were selected to receive AI feedback were also more engaged during paper rebuttals, as seen in longer author-reviewer discussions. This work demonstrates that carefully designed LLM-generated review feedback can enhance peer review quality by making reviews more specific and actionable while increasing engagement between reviewers and authors. The Review Feedback Agent is publicly available at https://github.com/zou-group/review_feedback_agent.

[570] A Survey of Large Language Model-Powered Spatial Intelligence Across Scales: Advances in Embodied Agents, Smart Cities, and Earth Science

Jie Feng,Jinwei Zeng,Qingyue Long,Hongyi Chen,Jie Zhao,Yanxin Xi,Zhilun Zhou,Yuan Yuan,Shengyuan Wang,Qingbin Zeng,Songwei Li,Yunke Zhang,Yuming Lin,Tong Li,Jingtao Ding,Chen Gao,Fengli Xu,Yong Li

Main category: cs.AI

TLDR: 本文综述了空间智能在不同领域（从导航到地球科学）的差异与联系，探讨了LLMs中的空间认知、记忆和推理，并提出了一个跨尺度的研究框架。

Details

Motivation: 探索空间智能在LLMs中的表现及其在不同学科中的联系与差异，以促进跨学科研究。 Method: 回顾人类空间认知及其对LLMs的启示，分析LLMs中的空间记忆、知识表示和抽象推理，并提出跨尺度研究框架。 Result: 总结了空间智能在LLMs中的角色和联系，并提出了从空间记忆到推理的跨尺度研究框架。 Conclusion: 本文为跨学科空间智能研究提供了见解，并启发了未来研究方向。 Abstract: Over the past year, the development of large language models (LLMs) has brought spatial intelligence into focus, with much attention on vision-based embodied intelligence. However, spatial intelligence spans a broader range of disciplines and scales, from navigation and urban planning to remote sensing and earth science. What are the differences and connections between spatial intelligence across these fields? In this paper, we first review human spatial cognition and its implications for spatial intelligence in LLMs. We then examine spatial memory, knowledge representations, and abstract reasoning in LLMs, highlighting their roles and connections. Finally, we analyze spatial intelligence across scales -- from embodied to urban and global levels -- following a framework that progresses from spatial memory and understanding to spatial reasoning and intelligence. Through this survey, we aim to provide insights into interdisciplinary spatial intelligence research and inspire future studies.

[571] Reasoning Models Can Be Effective Without Thinking

Wenjie Ma,Jingxuan He,Charlie Snell,Tyler Griggs,Sewon Min,Matei Zaharia

Main category: cs.AI

TLDR: 研究发现，通过简单提示绕过显式思考过程（NoThinking）在低预算设置下表现优于传统显式思考（Thinking），尤其在并行生成和聚合策略下效果显著。

Details

Motivation: 质疑显式思考过程是否必要，探索在低预算或低延迟下实现高效推理的替代方法。 Method: 使用DeepSeek-R1-Distill-Qwen模型，通过NoThinking提示生成输出，并采用并行扩展和任务特定验证器或最佳选择策略进行聚合。 Result: NoThinking在七个推理数据集上表现优于Thinking，尤其在低预算下（如700 tokens时51.3 vs. 28.9），并行扩展方法显著提升性能。 Conclusion: 研究挑战了显式思考的必要性，为低预算或低延迟场景提供了高效的推理方法参考。 Abstract: Recent LLMs have significantly improved reasoning capabilities, primarily by including an explicit, lengthy Thinking process as part of generation. In this paper, we question whether this explicit thinking is necessary. Using the state-of-the-art DeepSeek-R1-Distill-Qwen, we find that bypassing the thinking process via simple prompting, denoted as NoThinking, can be surprisingly effective. When controlling for the number of tokens, NoThinking outperforms Thinking across a diverse set of seven challenging reasoning datasets--including mathematical problem solving, formal theorem proving, and coding--especially in low-budget settings, e.g., 51.3 vs. 28.9 on ACM 23 with 700 tokens. Notably, the performance of NoThinking becomes more competitive with pass@k as k increases. Building on this observation, we demonstrate that a parallel scaling approach that uses NoThinking to generate N outputs independently and aggregates them is highly effective. For aggregation, we use task-specific verifiers when available, or we apply simple best-of-N strategies such as confidence-based selection. Our method outperforms a range of baselines with similar latency using Thinking, and is comparable to Thinking with significantly longer latency (up to 9x). Together, our research encourages a reconsideration of the necessity of lengthy thinking processes, while also establishing a competitive reference for achieving strong reasoning performance in low-budget settings or at low latency using parallel scaling.

[572] RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability

Yichi Zhang,Zihao Zeng,Dongbai Li,Yao Huang,Zhijie Deng,Yinpeng Dong

Main category: cs.AI

TLDR: RealSafe-R1是DeepSeek-R1的安全对齐版本，通过构建15k安全感知推理轨迹数据集训练，提升了模型的安全性，同时保持了推理能力。

Details

Motivation: 开源R1模型在广泛应用中存在安全隐患，如易受恶意查询影响，影响了模型的实用性。 Method: 构建15k安全感知推理轨迹数据集，训练安全对齐的RealSafe-R1模型。 Result: 模型在安全防护和推理能力上均表现优异，且未牺牲推理性能。 Conclusion: RealSafe-R1成功解决了R1模型的安全问题，同时保持了其强大的推理能力。 Abstract: Large Reasoning Models (LRMs), such as OpenAI o1 and DeepSeek-R1, have been rapidly progressing and achieving breakthrough performance on complex reasoning tasks such as mathematics and coding. However, the open-source R1 models have raised safety concerns in wide applications, such as the tendency to comply with malicious queries, which greatly impacts the utility of these powerful models in their applications. In this paper, we introduce RealSafe-R1 as safety-aligned versions of DeepSeek-R1 distilled models. To train these models, we construct a dataset of 15k safety-aware reasoning trajectories generated by DeepSeek-R1, under explicit instructions for expected refusal behavior. Both quantitative experiments and qualitative case studies demonstrate the models' improvements, which are shown in their safety guardrails against both harmful queries and jailbreak attacks. Importantly, unlike prior safety alignment efforts that often compromise reasoning performance, our method preserves the models' reasoning capabilities by maintaining the training data within the original distribution of generation. Model weights of RealSafe-R1 are open-source at https://huggingface.co/RealSafe.

[573] Breaking the Data Barrier -- Building GUI Agents Through Task Generalization

Junlei Zhang,Zichen Ding,Chang Ma,Zijie Chen,Qiushi Sun,Zhenzhong Lan,Junxian He

Main category: cs.AI

TLDR: 论文提出通过在多模态任务中训练视觉语言模型（VLMs）来解决GUI代理数据稀缺问题，显著提升了跨领域泛化能力。

Details

Motivation: GUI代理的性能受限于高质量轨迹数据的稀缺性，因此需要探索如何利用其他任务的数据提升其泛化能力。 Method: 在中期训练阶段，利用数据丰富的多模态任务（如GUI感知、多模态推理和文本推理）训练VLMs，并研究其对GUI规划场景的泛化效果。 Result: 任务泛化显著提升性能，例如文本数学数据在视觉领域表现优异；GUI感知数据对性能影响有限；优化数据集后性能提升8.0%-12.2%。 Conclusion: 研究为GUI代理的跨领域知识迁移提供了实用方法，并解决了数据稀缺问题。 Abstract: Graphical User Interface (GUI) agents offer cross-platform solutions for automating complex digital tasks, with significant potential to transform productivity workflows. However, their performance is often constrained by the scarcity of high-quality trajectory data. To address this limitation, we propose training Vision Language Models (VLMs) on data-rich, reasoning-intensive tasks during a dedicated mid-training stage, and then examine how incorporating these tasks facilitates generalization to GUI planning scenarios. Specifically, we explore a range of tasks with readily available instruction-tuning data, including GUI perception, multimodal reasoning, and textual reasoning. Through extensive experiments across 11 mid-training tasks, we demonstrate that: (1) Task generalization proves highly effective, yielding substantial improvements across most settings. For instance, multimodal mathematical reasoning enhances performance on AndroidWorld by an absolute 6.3%. Remarkably, text-only mathematical data significantly boosts GUI web agent performance, achieving a 5.6% improvement on WebArena and 5.4% improvement on AndroidWorld, underscoring notable cross-modal generalization from text-based to visual domains; (2) Contrary to prior assumptions, GUI perception data - previously considered closely aligned with GUI agent tasks and widely utilized for training - has a comparatively limited impact on final performance; (3) Building on these insights, we identify the most effective mid-training tasks and curate optimized mixture datasets, resulting in absolute performance gains of 8.0% on WebArena and 12.2% on AndroidWorld. Our work provides valuable insights into cross-domain knowledge transfer for GUI agents and offers a practical approach to addressing data scarcity challenges in this emerging field. The code, data and models will be available at https://github.com/hkust-nlp/GUIMid.

[574] The Future of MLLM Prompting is Adaptive: A Comprehensive Experimental Evaluation of Prompt Engineering Methods for Robust Multimodal Performance

Anwesha Mohanty,Venkatesh Balavadhani Parthasarathy,Arsalan Shahid

Main category: cs.AI

TLDR: 多模态大语言模型（MLLMs）通过整合文本、图像和代码等多样模态，有望改变机器处理人类响应方式。本文评估了7种提示工程方法在13个开源MLLMs上的表现，发现不同模型在不同任务中表现各异，需结合自适应策略以优化效果。

Details

Motivation: 研究多模态大语言模型（MLLMs）在多种任务中的表现，并探索最优提示工程方法以提升其性能和可靠性。 Method: 对7种提示工程方法（如Zero-Shot、Few-Shot等）在13个开源MLLMs上进行实验评估，涵盖24项任务，并根据参数数量将模型分为小、中、大三类。 Result: 大型MLLMs在代码生成等结构化任务中表现优异（准确率高达96.88%），但在复杂推理和抽象理解任务中表现不佳（准确率低于60%）。结构化提示可能增加幻觉率（小型模型达75%）和响应时间。 Conclusion: 没有单一提示方法适用于所有任务，需结合自适应策略（如示例引导和选择性结构化推理）以提高MLLMs的鲁棒性、效率和准确性。 Abstract: Multimodal Large Language Models (MLLMs) are set to transform how machines process and generate human-like responses by integrating diverse modalities such as text, images, and code. Yet, effectively harnessing their capabilities hinges on optimal prompt engineering. We present a comprehensive experimental evaluation of seven prompt engineering methods applied to 13 open-source MLLMs over 24 tasks spanning Reasoning and Compositionality, Multimodal Understanding and Alignment, Complex Code Generation and Execution, and Knowledge Retrieval and Integration. Our approach stratifies models by parameter count into Small (<4B), Medium (4B-10B), and Large (>10B) categories and compares prompting techniques including Zero-Shot, One-Shot, Few-Shot, Chain-of-Thought, Analogical, Generated Knowledge, and Tree-of-Thought. While Large MLLMs excel in structured tasks such as code generation, achieving accuracies up to 96.88% under Few-Shot prompting, all models struggle with complex reasoning and abstract understanding, often yielding accuracies below 60% and high hallucination rates. Structured reasoning prompts frequently increased hallucination up to 75% in small models and led to longer response times (over 20 seconds in Large MLLMs), while simpler prompting methods provided more concise and efficient outputs. No single prompting method uniformly optimises all task types. Instead, adaptive strategies combining example-based guidance with selective structured reasoning are essential to enhance robustness, efficiency, and factual accuracy. Our findings offer practical recommendations for prompt engineering and support more reliable deployment of MLLMs across applications including AI-assisted coding, knowledge retrieval, and multimodal content understanding.

[575] RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users

Suyu Ye,Haojun Shi,Darren Shih,Hyokun Yun,Tanya Roosta,Tianmin Shu

Main category: cs.AI

TLDR: RealWebAssist是一个新基准，用于评估AI代理在真实场景中遵循用户指令的能力，涉及长期交互、视觉GUI理解和模糊指令处理。

Details

Motivation: 现有基准无法满足真实世界中用户指令的模糊性、动态性和长期性需求，因此需要新工具来评估AI代理的实际表现。 Method: 通过收集真实用户的序列指令数据集，要求代理理解用户意图、跟踪心理状态并正确操作GUI元素。 Result: 实验表明，现有模型在理解和执行真实用户指令方面表现不佳。 Conclusion: RealWebAssist揭示了AI代理在长期任务中面临的挑战，为未来研究提供了方向。 Abstract: To achieve successful assistance with long-horizon web-based tasks, AI agents must be able to sequentially follow real-world user instructions over a long period. Unlike existing web-based agent benchmarks, sequential instruction following in the real world poses significant challenges beyond performing a single, clearly defined task. For instance, real-world human instructions can be ambiguous, require different levels of AI assistance, and may evolve over time, reflecting changes in the user's mental state. To address this gap, we introduce RealWebAssist, a novel benchmark designed to evaluate sequential instruction-following in realistic scenarios involving long-horizon interactions with the web, visual GUI grounding, and understanding ambiguous real-world user instructions. RealWebAssist includes a dataset of sequential instructions collected from real-world human users. Each user instructs a web-based assistant to perform a series of tasks on multiple websites. A successful agent must reason about the true intent behind each instruction, keep track of the mental state of the user, understand user-specific routines, and ground the intended tasks to actions on the correct GUI elements. Our experimental results show that state-of-the-art models struggle to understand and ground user instructions, posing critical challenges in following real-world user instructions for long-horizon web assistance.

[576] Hybrid AI-Physical Modeling for Penetration Bias Correction in X-band InSAR DEMs: A Greenland Case Study

Islam Mansour,Georg Fischer,Ronny Haensch,Irena Hajnsek

Main category: cs.AI

TLDR: 提出了一种结合物理建模和机器学习的混合框架，用于校正InSAR数据在冰川和雪覆盖区域的穿透偏差，显著降低了DEM误差。

Details

Motivation: InSAR数据在冰川和雪覆盖区域存在系统性高程误差（穿透偏差），需要一种更有效的校正方法。 Method: 结合参数化物理建模和机器学习，构建混合校正框架，并在三种不同采集参数下评估其性能。 Result: 在格陵兰冰盖的TanDEM-X数据实验中，混合模型显著降低了DEM误差的均值和标准差，且比纯机器学习方法具有更好的泛化能力。 Conclusion: 混合框架在减少DEM误差和泛化能力方面优于纯物理建模或纯机器学习方法。 Abstract: Digital elevation models derived from Interferometric Synthetic Aperture Radar (InSAR) data over glacial and snow-covered regions often exhibit systematic elevation errors, commonly termed "penetration bias." We leverage existing physics-based models and propose an integrated correction framework that combines parametric physical modeling with machine learning. We evaluate the approach across three distinct training scenarios - each defined by a different set of acquisition parameters - to assess overall performance and the model's ability to generalize. Our experiments on Greenland's ice sheet using TanDEM-X data show that the proposed hybrid model corrections significantly reduce the mean and standard deviation of DEM errors compared to a purely physical modeling baseline. The hybrid framework also achieves significantly improved generalization than a pure ML approach when trained on data with limited diversity in acquisition parameters.

[577] Mixed Signals: Decoding VLMs' Reasoning and Underlying Bias in Vision-Language Conflict

Pouya Pezeshkpour,Moin Aminnaseri,Estevam Hruschka

Main category: cs.AI

TLDR: 论文研究了视觉语言模型（VLMs）在处理图像与文本冲突时的偏好，发现模型倾向于文本或图像取决于查询复杂度，并提出三种缓解偏见的策略。

Details

Motivation: 探讨VLMs如何整合视觉与文本信息，以及信息流的结构，特别是在面对冲突线索时的偏见表现。 Method: 构建五个包含不匹配图像-文本对的数据集，分析模型偏好；提出三种缓解策略（提示修改、显式指令、任务分解）。 Result: VLMs在简单查询中偏好文本，复杂查询中偏好图像；模型规模影响偏见程度；缓解策略效果因任务和模态而异。 Conclusion: VLMs的偏见与查询复杂度和模型规模相关，缓解策略需针对具体任务和模态设计。 Abstract: Vision-language models (VLMs) have demonstrated impressive performance by effectively integrating visual and textual information to solve complex tasks. However, it is not clear how these models reason over the visual and textual data together, nor how the flow of information between modalities is structured. In this paper, we examine how VLMs reason by analyzing their biases when confronted with scenarios that present conflicting image and text cues, a common occurrence in real-world applications. To uncover the extent and nature of these biases, we build upon existing benchmarks to create five datasets containing mismatched image-text pairs, covering topics in mathematics, science, and visual descriptions. Our analysis shows that VLMs favor text in simpler queries but shift toward images as query complexity increases. This bias correlates with model scale, with the difference between the percentage of image- and text-preferred responses ranging from +56.8% (image favored) to -74.4% (text favored), depending on the task and model. In addition, we explore three mitigation strategies: simple prompt modifications, modifications that explicitly instruct models on how to handle conflicting information (akin to chain-of-thought prompting), and a task decomposition strategy that analyzes each modality separately before combining their results. Our findings indicate that the effectiveness of these strategies in identifying and mitigating bias varies significantly and is closely linked to the model's overall performance on the task and the specific modality in question.

[578] Don't Deceive Me: Mitigating Gaslighting through Attention Reallocation in LMMs

Pengkun Jiao,Bin Zhu,Jingjing Chen,Chong-Wah Ngo,Yu-Gang Jiang

Main category: cs.AI

TLDR: 论文提出GasEraser方法，通过调整注意力权重减少大型多模态模型（LMMs）在否定式误导输入下的性能下降。

Details

Motivation: 研究LMMs对用户误导输入的脆弱性，提出解决方案以提高其可靠性。 Method: 引入GasEraser，无需重新训练，通过重新分配注意力权重抑制误导文本令牌，增强视觉线索。 Result: 在GaslightingBench上测试，GasEraser显著降低误导率（如LLaVA-v1.5-7B降低48.2%）。 Conclusion: GasEraser有效提升LMMs的鲁棒性，为更可信的模型提供潜力。 Abstract: Large Multimodal Models (LMMs) have demonstrated remarkable capabilities across a wide range of tasks. However, their vulnerability to user gaslighting-the deliberate use of misleading or contradictory inputs-raises critical concerns about their reliability in real-world applications. In this paper, we address the novel and challenging issue of mitigating the negative impact of negation-based gaslighting on LMMs, where deceptive user statements lead to significant drops in model accuracy. Specifically, we introduce GasEraser, a training-free approach that reallocates attention weights from misleading textual tokens to semantically salient visual regions. By suppressing the influence of "attention sink" tokens and enhancing focus on visually grounded cues, GasEraser significantly improves LMM robustness without requiring retraining or additional supervision. Extensive experimental results demonstrate that GasEraser is effective across several leading open-source LMMs on the GaslightingBench. Notably, for LLaVA-v1.5-7B, GasEraser reduces the misguidance rate by 48.2%, demonstrating its potential for more trustworthy LMMs.

[579] A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems

Zixuan Ke,Fangkai Jiao,Yifei Ming,Xuan-Phi Nguyen,Austin Xu,Do Xuan Long,Minzhi Li,Chengwei Qin,Peifeng Wang,Silvio Savarese,Caiming Xiong,Shafiq Joty

Main category: cs.AI

TLDR: 该论文综述了大型语言模型（LLMs）在推理能力方面的分类与方法，重点分析了推理阶段和架构设计两个维度，并探讨了输入和输出层面的技术。

Details

Motivation: 随着LLMs的快速发展，推理能力成为区分高级AI系统的关键特征，本文旨在系统化分类现有方法，揭示新兴趋势。 Method: 通过两个正交维度（推理阶段和架构设计）分类方法，分析输入和输出层面的技术，并涵盖从监督微调到强化学习等多种算法。 Result: 提出了对LLM推理领域的系统化理解，揭示了从推理扩展转向学习推理、以及向代理工作流过渡的趋势。 Conclusion: 本文为LLM推理领域提供了全面的分类和分析，为未来研究指明了方向。 Abstract: Reasoning is a fundamental cognitive process that enables logical inference, problem-solving, and decision-making. With the rapid advancement of large language models (LLMs), reasoning has emerged as a key capability that distinguishes advanced AI systems from conventional models that empower chatbots. In this survey, we categorize existing methods along two orthogonal dimensions: (1) Regimes, which define the stage at which reasoning is achieved (either at inference time or through dedicated training); and (2) Architectures, which determine the components involved in the reasoning process, distinguishing between standalone LLMs and agentic compound systems that incorporate external tools, and multi-agent collaborations. Within each dimension, we analyze two key perspectives: (1) Input level, which focuses on techniques that construct high-quality prompts that the LLM condition on; and (2) Output level, which methods that refine multiple sampled candidates to enhance reasoning quality. This categorization provides a systematic understanding of the evolving landscape of LLM reasoning, highlighting emerging trends such as the shift from inference-scaling to learning-to-reason (e.g., DeepSeek-R1), and the transition to agentic workflows (e.g., OpenAI Deep Research, Manus Agent). Additionally, we cover a broad spectrum of learning algorithms, from supervised fine-tuning to reinforcement learning such as PPO and GRPO, and the training of reasoners and verifiers. We also examine key designs of agentic workflows, from established patterns like generator-evaluator and LLM debate to recent innovations. ...

[580] Towards Stepwise Domain Knowledge-Driven Reasoning Optimization and Reflection Improvement

Chengyuan Liu,Shihang Wang,Lizhi Qing,Kaisong Song,Junjie Cao,Jun Lin,Ji Zhang,Ang Li,Kun Kuang,Fei Wu

Main category: cs.AI

TLDR: 论文提出了一种基于蒙特卡洛树搜索（MCTS）的逐步监督框架，用于优化需要领域专业知识的推理任务，并引入了偏好优化方法以提升推理路径的反思能力。

Details

Motivation: 探索逐步监督在领域专业知识任务中的潜力，解决传统MCTS在此类任务中的局限性。 Method: 提出Stepwise Domain Knowledge-Driven Reasoning Optimization框架，结合MCTS生成逐步监督，并引入Preference Optimization towards Reflection Paths优化推理路径。 Result: 实验证明该方法在法律领域问题中有效，并提供了多样化的研究发现。 Conclusion: 该方法为领域特定LLMs和MCTS的研究提供了新思路，激发了进一步探索的兴趣。 Abstract: Recently, stepwise supervision on Chain of Thoughts (CoTs) presents an enhancement on the logical reasoning tasks such as coding and math, with the help of Monte Carlo Tree Search (MCTS). However, its contribution to tasks requiring domain-specific expertise and knowledge remains unexplored. Motivated by the interest, we identify several potential challenges of vanilla MCTS within this context, and propose the framework of Stepwise Domain Knowledge-Driven Reasoning Optimization, employing the MCTS algorithm to develop step-level supervision for problems that require essential comprehension, reasoning, and specialized knowledge. Additionally, we also introduce the Preference Optimization towards Reflection Paths, which iteratively learns self-reflection on the reasoning thoughts from better perspectives. We have conducted extensive experiments to evaluate the advantage of the methodologies. Empirical results demonstrate the effectiveness on various legal-domain problems. We also report a diverse set of valuable findings, hoping to encourage the enthusiasm to the research of domain-specific LLMs and MCTS.

[581] A Short Survey on Small Reasoning Models: Training, Inference, Applications and Research Directions

Chengyu Wang,Taolin Zhang,Richang Hong,Jun Huang

Main category: cs.AI

TLDR: 本文综述了170多篇关于小型推理模型（SRMs）的论文，探讨了其训练、推理技术及领域应用，并展望了未来研究方向。

Details

Motivation: 大型推理模型（LRMs）计算需求高，而小型推理模型（SRMs）效率更高且具备独特能力，因此需要系统研究SRMs的现状与发展。 Method: 通过综述170多篇相关论文，分析SRMs的训练、推理技术及其在特定领域的应用。 Result: 总结了SRMs的当前研究现状、技术方法及其应用潜力。 Conclusion: SRMs在高效推理方面具有重要潜力，未来研究可进一步优化其性能和应用范围。 Abstract: Recently, the reasoning capabilities of large reasoning models (LRMs), such as DeepSeek-R1, have seen significant advancements through the slow thinking process. Despite these achievements, the substantial computational demands of LRMs present considerable challenges. In contrast, small reasoning models (SRMs), often distilled from larger ones, offer greater efficiency and can exhibit distinct capabilities and cognitive trajectories compared to LRMs. This work surveys around 170 recently published papers on SRMs for tackling various complex reasoning tasks. We review the current landscape of SRMs and analyze diverse training and inference techniques related to SRMs. Furthermore, we provide a comprehensive review of SRMs for domain-specific applications and discuss possible future research directions. This survey serves as an essential reference for researchers to leverage or develop SRMs for advanced reasoning functionalities with high efficiency.

[582] Draw with Thought: Unleashing Multimodal Reasoning for Scientific Diagram Generation

Zhiqing Cui,Jiahao Yuan,Hanqing Wang,Yanshu Li,Chenxu Du,Zhenglong Ding

Main category: cs.AI

TLDR: DwT框架通过认知链式推理将科学图表转换为可编辑的XML代码，无需微调模型，实现了高保真和语义对齐的重建。

Details

Motivation: 科学图表通常以静态图像发布，丢失了符号语义且难以重用，现有方法缺乏语义控制和结构可解释性。 Method: DwT框架分为两阶段：粗到细规划（处理感知结构和语义规范）和结构感知代码生成（通过格式引导细化）。 Result: 实验表明，DwT在8种MLLMs上实现了高保真、语义对齐且结构有效的重建，人类评估也证实了其准确性和视觉美观性。 Conclusion: DwT为静态图像转换为可执行表示提供了可扩展的解决方案，并推动了机器对科学图形的理解。 Abstract: Scientific diagrams are vital tools for communicating structured knowledge across disciplines. However, they are often published as static raster images, losing symbolic semantics and limiting reuse. While Multimodal Large Language Models (MLLMs) offer a pathway to bridging vision and structure, existing methods lack semantic control and structural interpretability, especially on complex diagrams. We propose Draw with Thought (DwT), a training-free framework that guides MLLMs to reconstruct diagrams into editable mxGraph XML code through cognitively-grounded Chain-of-Thought reasoning. DwT enables interpretable and controllable outputs without model fine-tuning by dividing the task into two stages: Coarse-to-Fine Planning, which handles perceptual structuring and semantic specification, and Structure-Aware Code Generation, enhanced by format-guided refinement. To support evaluation, we release Plot2XML, a benchmark of 247 real-world scientific diagrams with gold-standard XML annotations. Extensive experiments across eight MLLMs show that our approach yields high-fidelity, semantically aligned, and structurally valid reconstructions, with human evaluations confirming strong alignment in both accuracy and visual aesthetics, offering a scalable solution for converting static visuals into executable representations and advancing machine understanding of scientific graphics.

[583] Reduction of Supervision for Biomedical Knowledge Discovery

Christos Theodoropoulos,Andrei Catalin Coman,James Henderson,Marie-Francine Moens

Main category: cs.AI

TLDR: 论文提出了一种基于依赖树和注意力机制的无监督算法，用于在生物医学文本中识别语义关系，减少对监督数据的依赖。

Details

Motivation: 解决知识发现中信息过载和标注数据稀缺的问题，探索如何在减少监督的情况下保持模型性能。 Method: 引入基于依赖树和注意力机制的无监督算法，结合点对二元分类方法，评估从弱监督到完全无监督场景下的性能。 Result: 在生物医学基准数据集上验证了方法的有效性，展示了从弱监督到无监督场景的适应性。 Conclusion: 该方法为标注数据有限的知识发现提供了高效解决方案，推动了数据高效方法的发展。 Abstract: Knowledge discovery is hindered by the increasing volume of publications and the scarcity of extensive annotated data. To tackle the challenge of information overload, it is essential to employ automated methods for knowledge extraction and processing. Finding the right balance between the level of supervision and the effectiveness of models poses a significant challenge. While supervised techniques generally result in better performance, they have the major drawback of demanding labeled data. This requirement is labor-intensive and time-consuming and hinders scalability when exploring new domains. In this context, our study addresses the challenge of identifying semantic relationships between biomedical entities (e.g., diseases, proteins) in unstructured text while minimizing dependency on supervision. We introduce a suite of unsupervised algorithms based on dependency trees and attention mechanisms and employ a range of pointwise binary classification methods. Transitioning from weakly supervised to fully unsupervised settings, we assess the methods' ability to learn from data with noisy labels. The evaluation on biomedical benchmark datasets explores the effectiveness of the methods. Our approach tackles a central issue in knowledge discovery: balancing performance with minimal supervision. By gradually decreasing supervision, we assess the robustness of pointwise binary classification techniques in handling noisy labels, revealing their capability to shift from weakly supervised to entirely unsupervised scenarios. Comprehensive benchmarking offers insights into the effectiveness of these techniques, suggesting an encouraging direction toward adaptable knowledge discovery systems, representing progress in creating data-efficient methodologies for extracting useful insights when annotated data is limited.

[584] EmoAgent: Assessing and Safeguarding Human-AI Interaction for Mental Health Safety

Jiahao Qiu,Yinghui He,Xinzhe Juan,Yiming Wang,Yuhan Liu,Zixin Yao,Yue Wu,Xun Jiang,Ling Yang,Mengdi Wang

Main category: cs.AI

TLDR: EmoAgent是一个多代理AI框架，用于评估和减轻人类与AI互动中的心理健康风险，包括模拟用户和监控干预功能。

Details

Motivation: 随着LLM驱动的AI角色普及，心理健康风险增加，尤其是对心理脆弱用户，需开发工具确保安全互动。 Method: EmoAgent由EmoEval（模拟用户并评估心理变化）和EmoGuard（监控用户状态并提供干预）组成，使用临床心理工具（如PHQ-9）。 Result: 实验显示，34.4%的模拟用户心理状态恶化，EmoGuard显著降低了恶化率。 Conclusion: EmoAgent能有效减少AI互动中的心理健康风险，为安全人机交互提供解决方案。 Abstract: The rise of LLM-driven AI characters raises safety concerns, particularly for vulnerable human users with psychological disorders. To address these risks, we propose EmoAgent, a multi-agent AI framework designed to evaluate and mitigate mental health hazards in human-AI interactions. EmoAgent comprises two components: EmoEval simulates virtual users, including those portraying mentally vulnerable individuals, to assess mental health changes before and after interactions with AI characters. It uses clinically proven psychological and psychiatric assessment tools (PHQ-9, PDI, PANSS) to evaluate mental risks induced by LLM. EmoGuard serves as an intermediary, monitoring users' mental status, predicting potential harm, and providing corrective feedback to mitigate risks. Experiments conducted in popular character-based chatbots show that emotionally engaging dialogues can lead to psychological deterioration in vulnerable users, with mental state deterioration in more than 34.4% of the simulations. EmoGuard significantly reduces these deterioration rates, underscoring its role in ensuring safer AI-human interactions. Our code is available at: https://github.com/1akaman/EmoAgent

[585] Can LLM feedback enhance review quality? A randomized study of 20K reviews at ICLR 2025

Nitya Thakkar,Mert Yuksekgonul,Jake Silberg,Animesh Garg,Nanyun Peng,Fei Sha,Rose Yu,Carl Vondrick,James Zou

Main category: cs.AI

TLDR: 论文开发了一个基于大语言模型的Review Feedback Agent系统，用于提升AI会议同行评审的质量，通过自动反馈改善评审意见的清晰度和可操作性。

Details

Motivation: AI会议同行评审因提交量激增导致评审质量下降和作者不满，亟需解决方案。 Method: 利用多个大语言模型为评审提供自动反馈，并通过自动化可靠性测试确保反馈质量。 Result: 27%的评审者在收到反馈后更新了评审意见，反馈显著提升了评审的长度和信息量。 Conclusion: 精心设计的大语言模型反馈能有效提升同行评审质量，增加评审者与作者的互动。 Abstract: Peer review at AI conferences is stressed by rapidly rising submission volumes, leading to deteriorating review quality and increased author dissatisfaction. To address these issues, we developed Review Feedback Agent, a system leveraging multiple large language models (LLMs) to improve review clarity and actionability by providing automated feedback on vague comments, content misunderstandings, and unprofessional remarks to reviewers. Implemented at ICLR 2025 as a large randomized control study, our system provided optional feedback to more than 20,000 randomly selected reviews. To ensure high-quality feedback for reviewers at this scale, we also developed a suite of automated reliability tests powered by LLMs that acted as guardrails to ensure feedback quality, with feedback only being sent to reviewers if it passed all the tests. The results show that 27% of reviewers who received feedback updated their reviews, and over 12,000 feedback suggestions from the agent were incorporated by those reviewers. This suggests that many reviewers found the AI-generated feedback sufficiently helpful to merit updating their reviews. Incorporating AI feedback led to significantly longer reviews (an average increase of 80 words among those who updated after receiving feedback) and more informative reviews, as evaluated by blinded researchers. Moreover, reviewers who were selected to receive AI feedback were also more engaged during paper rebuttals, as seen in longer author-reviewer discussions. This work demonstrates that carefully designed LLM-generated review feedback can enhance peer review quality by making reviews more specific and actionable while increasing engagement between reviewers and authors. The Review Feedback Agent is publicly available at https://github.com/zou-group/review_feedback_agent.

[586] A Survey of Large Language Model-Powered Spatial Intelligence Across Scales: Advances in Embodied Agents, Smart Cities, and Earth Science

Jie Feng,Jinwei Zeng,Qingyue Long,Hongyi Chen,Jie Zhao,Yanxin Xi,Zhilun Zhou,Yuan Yuan,Shengyuan Wang,Qingbin Zeng,Songwei Li,Yunke Zhang,Yuming Lin,Tong Li,Jingtao Ding,Chen Gao,Fengli Xu,Yong Li

Main category: cs.AI

TLDR: 本文综述了空间智能在不同领域（从导航到地球科学）的差异与联系，探讨了LLMs中空间认知、记忆和推理的作用，并提出了一个跨尺度的研究框架。

Details

Motivation: 研究空间智能在不同学科和尺度上的差异与联系，以促进跨学科研究和未来探索。 Method: 回顾人类空间认知及其对LLMs的启示，分析LLMs中的空间记忆、知识表示和抽象推理，并提出跨尺度的空间智能框架。 Result: 揭示了LLMs在空间记忆、理解和推理中的作用，提出了从具体到全球尺度的空间智能研究框架。 Conclusion: 本文为跨学科空间智能研究提供了见解，并启发了未来研究方向。 Abstract: Over the past year, the development of large language models (LLMs) has brought spatial intelligence into focus, with much attention on vision-based embodied intelligence. However, spatial intelligence spans a broader range of disciplines and scales, from navigation and urban planning to remote sensing and earth science. What are the differences and connections between spatial intelligence across these fields? In this paper, we first review human spatial cognition and its implications for spatial intelligence in LLMs. We then examine spatial memory, knowledge representations, and abstract reasoning in LLMs, highlighting their roles and connections. Finally, we analyze spatial intelligence across scales -- from embodied to urban and global levels -- following a framework that progresses from spatial memory and understanding to spatial reasoning and intelligence. Through this survey, we aim to provide insights into interdisciplinary spatial intelligence research and inspire future studies.

[587] Reasoning Models Can Be Effective Without Thinking

Wenjie Ma,Jingxuan He,Charlie Snell,Tyler Griggs,Sewon Min,Matei Zaharia

Main category: cs.AI

TLDR: 研究发现，绕过显式思考过程的简单提示方法（NoThinking）在多个推理任务中表现优异，尤其在低预算或低延迟场景下。

Details

Motivation: 质疑显式思考过程是否必要，探索更高效的推理方法。 Method: 使用NoThinking提示方法，并通过并行生成和聚合策略提升性能。 Result: NoThinking在多个数据集上优于显式思考方法，尤其在低预算下表现突出。 Conclusion: 研究挑战了显式思考的必要性，并为低预算或低延迟场景提供了高效解决方案。 Abstract: Recent LLMs have significantly improved reasoning capabilities, primarily by including an explicit, lengthy Thinking process as part of generation. In this paper, we question whether this explicit thinking is necessary. Using the state-of-the-art DeepSeek-R1-Distill-Qwen, we find that bypassing the thinking process via simple prompting, denoted as NoThinking, can be surprisingly effective. When controlling for the number of tokens, NoThinking outperforms Thinking across a diverse set of seven challenging reasoning datasets--including mathematical problem solving, formal theorem proving, and coding--especially in low-budget settings, e.g., 51.3 vs. 28.9 on ACM 23 with 700 tokens. Notably, the performance of NoThinking becomes more competitive with pass@k as k increases. Building on this observation, we demonstrate that a parallel scaling approach that uses NoThinking to generate N outputs independently and aggregates them is highly effective. For aggregation, we use task-specific verifiers when available, or we apply simple best-of-N strategies such as confidence-based selection. Our method outperforms a range of baselines with similar latency using Thinking, and is comparable to Thinking with significantly longer latency (up to 9x). Together, our research encourages a reconsideration of the necessity of lengthy thinking processes, while also establishing a competitive reference for achieving strong reasoning performance in low-budget settings or at low latency using parallel scaling.

[588] RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability

Yichi Zhang,Zihao Zeng,Dongbai Li,Yao Huang,Zhijie Deng,Yinpeng Dong

Main category: cs.AI

TLDR: RealSafe-R1是DeepSeek-R1的安全对齐版本，通过构建15k安全感知推理轨迹数据集训练，提升模型安全性且不损害推理能力。

Details

Motivation: 开源R1模型存在安全风险，如响应恶意查询，影响实际应用。 Method: 构建15k安全感知推理轨迹数据集，训练安全对齐模型RealSafe-R1。 Result: 模型在安全防护和推理能力上均有提升，且开源。 Conclusion: RealSafe-R1在保持推理能力的同时有效提升了安全性。 Abstract: Large Reasoning Models (LRMs), such as OpenAI o1 and DeepSeek-R1, have been rapidly progressing and achieving breakthrough performance on complex reasoning tasks such as mathematics and coding. However, the open-source R1 models have raised safety concerns in wide applications, such as the tendency to comply with malicious queries, which greatly impacts the utility of these powerful models in their applications. In this paper, we introduce RealSafe-R1 as safety-aligned versions of DeepSeek-R1 distilled models. To train these models, we construct a dataset of 15k safety-aware reasoning trajectories generated by DeepSeek-R1, under explicit instructions for expected refusal behavior. Both quantitative experiments and qualitative case studies demonstrate the models' improvements, which are shown in their safety guardrails against both harmful queries and jailbreak attacks. Importantly, unlike prior safety alignment efforts that often compromise reasoning performance, our method preserves the models' reasoning capabilities by maintaining the training data within the original distribution of generation. Model weights of RealSafe-R1 are open-source at https://huggingface.co/RealSafe.

[589] Breaking the Data Barrier -- Building GUI Agents Through Task Generalization

Junlei Zhang,Zichen Ding,Chang Ma,Zijie Chen,Qiushi Sun,Zhenzhong Lan,Junxian He

Main category: cs.AI

TLDR: 论文提出通过训练视觉语言模型（VLMs）在数据丰富的任务中提升GUI代理的性能，并验证了跨模态泛化的有效性。

Details

Motivation: 解决GUI代理因高质量轨迹数据稀缺而性能受限的问题。 Method: 在中期训练阶段训练VLMs于数据丰富的任务，并探索其对GUI规划场景的泛化能力。 Result: 多模态数学推理显著提升性能，而GUI感知数据影响有限；优化数据集带来8.0%和12.2%的性能提升。 Conclusion: 研究为GUI代理的跨领域知识迁移提供了实用方法，并解决了数据稀缺问题。 Abstract: Graphical User Interface (GUI) agents offer cross-platform solutions for automating complex digital tasks, with significant potential to transform productivity workflows. However, their performance is often constrained by the scarcity of high-quality trajectory data. To address this limitation, we propose training Vision Language Models (VLMs) on data-rich, reasoning-intensive tasks during a dedicated mid-training stage, and then examine how incorporating these tasks facilitates generalization to GUI planning scenarios. Specifically, we explore a range of tasks with readily available instruction-tuning data, including GUI perception, multimodal reasoning, and textual reasoning. Through extensive experiments across 11 mid-training tasks, we demonstrate that: (1) Task generalization proves highly effective, yielding substantial improvements across most settings. For instance, multimodal mathematical reasoning enhances performance on AndroidWorld by an absolute 6.3%. Remarkably, text-only mathematical data significantly boosts GUI web agent performance, achieving a 5.6% improvement on WebArena and 5.4% improvement on AndroidWorld, underscoring notable cross-modal generalization from text-based to visual domains; (2) Contrary to prior assumptions, GUI perception data - previously considered closely aligned with GUI agent tasks and widely utilized for training - has a comparatively limited impact on final performance; (3) Building on these insights, we identify the most effective mid-training tasks and curate optimized mixture datasets, resulting in absolute performance gains of 8.0% on WebArena and 12.2% on AndroidWorld. Our work provides valuable insights into cross-domain knowledge transfer for GUI agents and offers a practical approach to addressing data scarcity challenges in this emerging field. The code, data and models will be available at https://github.com/hkust-nlp/GUIMid.

[590] The Future of MLLM Prompting is Adaptive: A Comprehensive Experimental Evaluation of Prompt Engineering Methods for Robust Multimodal Performance

Anwesha Mohanty,Venkatesh Balavadhani Parthasarathy,Arsalan Shahid

Main category: cs.AI

TLDR: 该论文通过实验评估了七种提示工程方法在13个开源多模态大语言模型（MLLMs）上的表现，发现不同任务需要不同的提示策略，没有单一方法适用于所有任务。

Details

Motivation: 研究多模态大语言模型（MLLMs）在不同提示工程方法下的表现，以优化其在实际应用中的效果。 Method: 对13个开源MLLMs在24个任务上测试七种提示方法，按模型参数规模分类比较。 Result: 大型MLLMs在代码生成任务中表现优异（准确率高达96.88%），但在复杂推理和抽象理解任务中表现较差（准确率低于60%）。 Conclusion: 需结合示例引导和选择性结构化推理的适应性策略，以提高MLLMs的鲁棒性、效率和准确性。 Abstract: Multimodal Large Language Models (MLLMs) are set to transform how machines process and generate human-like responses by integrating diverse modalities such as text, images, and code. Yet, effectively harnessing their capabilities hinges on optimal prompt engineering. We present a comprehensive experimental evaluation of seven prompt engineering methods applied to 13 open-source MLLMs over 24 tasks spanning Reasoning and Compositionality, Multimodal Understanding and Alignment, Complex Code Generation and Execution, and Knowledge Retrieval and Integration. Our approach stratifies models by parameter count into Small (<4B), Medium (4B-10B), and Large (>10B) categories and compares prompting techniques including Zero-Shot, One-Shot, Few-Shot, Chain-of-Thought, Analogical, Generated Knowledge, and Tree-of-Thought. While Large MLLMs excel in structured tasks such as code generation, achieving accuracies up to 96.88% under Few-Shot prompting, all models struggle with complex reasoning and abstract understanding, often yielding accuracies below 60% and high hallucination rates. Structured reasoning prompts frequently increased hallucination up to 75% in small models and led to longer response times (over 20 seconds in Large MLLMs), while simpler prompting methods provided more concise and efficient outputs. No single prompting method uniformly optimises all task types. Instead, adaptive strategies combining example-based guidance with selective structured reasoning are essential to enhance robustness, efficiency, and factual accuracy. Our findings offer practical recommendations for prompt engineering and support more reliable deployment of MLLMs across applications including AI-assisted coding, knowledge retrieval, and multimodal content understanding.

[591] RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users

Suyu Ye,Haojun Shi,Darren Shih,Hyokun Yun,Tanya Roosta,Tianmin Shu

Main category: cs.AI

TLDR: 论文提出了RealWebAssist基准，用于评估AI代理在真实场景中处理长期、模糊的用户指令的能力。

Details

Motivation: 现有基准无法满足真实世界中长期、模糊的用户指令需求，因此需要新的评估方法。 Method: 引入RealWebAssist基准，包含真实用户指令数据集，要求代理理解用户意图、跟踪心理状态并执行GUI操作。 Result: 实验表明当前先进模型在处理真实用户指令时表现不佳。 Conclusion: RealWebAssist揭示了AI代理在长期、模糊指令处理中的挑战，为未来研究提供了方向。 Abstract: To achieve successful assistance with long-horizon web-based tasks, AI agents must be able to sequentially follow real-world user instructions over a long period. Unlike existing web-based agent benchmarks, sequential instruction following in the real world poses significant challenges beyond performing a single, clearly defined task. For instance, real-world human instructions can be ambiguous, require different levels of AI assistance, and may evolve over time, reflecting changes in the user's mental state. To address this gap, we introduce RealWebAssist, a novel benchmark designed to evaluate sequential instruction-following in realistic scenarios involving long-horizon interactions with the web, visual GUI grounding, and understanding ambiguous real-world user instructions. RealWebAssist includes a dataset of sequential instructions collected from real-world human users. Each user instructs a web-based assistant to perform a series of tasks on multiple websites. A successful agent must reason about the true intent behind each instruction, keep track of the mental state of the user, understand user-specific routines, and ground the intended tasks to actions on the correct GUI elements. Our experimental results show that state-of-the-art models struggle to understand and ground user instructions, posing critical challenges in following real-world user instructions for long-horizon web assistance.

[592] Hybrid AI-Physical Modeling for Penetration Bias Correction in X-band InSAR DEMs: A Greenland Case Study

Islam Mansour,Georg Fischer,Ronny Haensch,Irena Hajnsek

Main category: cs.AI

TLDR: 提出了一种结合物理建模与机器学习的混合框架，用于校正InSAR数据在冰川和雪覆盖区域的穿透偏差，显著降低了DEM误差。

Details

Motivation: 解决InSAR数据在冰川和雪覆盖区域中系统性高程误差（穿透偏差）的问题。 Method: 结合参数化物理建模与机器学习，评估了三种不同采集参数下的训练场景。 Result: 在格陵兰冰盖的实验中，混合模型显著降低了DEM误差的均值和标准差，且比纯ML方法在数据多样性有限时表现更好。 Conclusion: 混合框架在减少DEM误差和泛化能力方面优于纯物理建模或纯机器学习方法。 Abstract: Digital elevation models derived from Interferometric Synthetic Aperture Radar (InSAR) data over glacial and snow-covered regions often exhibit systematic elevation errors, commonly termed "penetration bias." We leverage existing physics-based models and propose an integrated correction framework that combines parametric physical modeling with machine learning. We evaluate the approach across three distinct training scenarios - each defined by a different set of acquisition parameters - to assess overall performance and the model's ability to generalize. Our experiments on Greenland's ice sheet using TanDEM-X data show that the proposed hybrid model corrections significantly reduce the mean and standard deviation of DEM errors compared to a purely physical modeling baseline. The hybrid framework also achieves significantly improved generalization than a pure ML approach when trained on data with limited diversity in acquisition parameters.

[593] Mixed Signals: Decoding VLMs' Reasoning and Underlying Bias in Vision-Language Conflict

Pouya Pezeshkpour,Moin Aminnaseri,Estevam Hruschka

Main category: cs.AI

TLDR: 该研究探讨了视觉语言模型（VLMs）在处理图像和文本冲突时的偏见，发现模型倾向于文本或图像取决于查询复杂度，并提出三种缓解策略。

Details

Motivation: 理解VLMs如何整合视觉和文本信息，并揭示其在冲突场景中的偏见。 Method: 通过构建五个包含不匹配图像-文本对的数据集，分析VLMs的偏见，并测试三种缓解策略。 Result: VLMs在简单查询中偏向文本，复杂查询中偏向图像，且偏见程度与模型规模相关。缓解策略效果因任务和模型性能而异。 Conclusion: VLMs的偏见与查询复杂度和模型规模相关，缓解策略需根据任务和模型特性定制。 Abstract: Vision-language models (VLMs) have demonstrated impressive performance by effectively integrating visual and textual information to solve complex tasks. However, it is not clear how these models reason over the visual and textual data together, nor how the flow of information between modalities is structured. In this paper, we examine how VLMs reason by analyzing their biases when confronted with scenarios that present conflicting image and text cues, a common occurrence in real-world applications. To uncover the extent and nature of these biases, we build upon existing benchmarks to create five datasets containing mismatched image-text pairs, covering topics in mathematics, science, and visual descriptions. Our analysis shows that VLMs favor text in simpler queries but shift toward images as query complexity increases. This bias correlates with model scale, with the difference between the percentage of image- and text-preferred responses ranging from +56.8% (image favored) to -74.4% (text favored), depending on the task and model. In addition, we explore three mitigation strategies: simple prompt modifications, modifications that explicitly instruct models on how to handle conflicting information (akin to chain-of-thought prompting), and a task decomposition strategy that analyzes each modality separately before combining their results. Our findings indicate that the effectiveness of these strategies in identifying and mitigating bias varies significantly and is closely linked to the model's overall performance on the task and the specific modality in question.

[594] Don't Deceive Me: Mitigating Gaslighting through Attention Reallocation in LMMs

Pengkun Jiao,Bin Zhu,Jingjing Chen,Chong-Wah Ngo,Yu-Gang Jiang

Main category: cs.AI

TLDR: GasEraser是一种无需训练的解决方案，通过重新分配注意力权重来减少大型多模态模型（LMMs）在否定性误导输入下的性能下降。

Details

Motivation: 研究LMMs在面对用户故意误导（gaslighting）时的脆弱性，尤其是基于否定的误导对模型准确性的负面影响。 Method: 提出GasEraser方法，通过将注意力权重从误导性文本标记重新分配到语义显著的视觉区域，抑制‘注意力陷阱’标记的影响。 Result: 在GaslightingBench上测试多个开源LMMs，GasEraser显著提升模型鲁棒性，例如LLaVA-v1.5-7B的误导率降低48.2%。 Conclusion: GasEraser是一种无需额外训练的有效方法，可显著提升LMMs在误导输入下的可靠性。 Abstract: Large Multimodal Models (LMMs) have demonstrated remarkable capabilities across a wide range of tasks. However, their vulnerability to user gaslighting-the deliberate use of misleading or contradictory inputs-raises critical concerns about their reliability in real-world applications. In this paper, we address the novel and challenging issue of mitigating the negative impact of negation-based gaslighting on LMMs, where deceptive user statements lead to significant drops in model accuracy. Specifically, we introduce GasEraser, a training-free approach that reallocates attention weights from misleading textual tokens to semantically salient visual regions. By suppressing the influence of "attention sink" tokens and enhancing focus on visually grounded cues, GasEraser significantly improves LMM robustness without requiring retraining or additional supervision. Extensive experimental results demonstrate that GasEraser is effective across several leading open-source LMMs on the GaslightingBench. Notably, for LLaVA-v1.5-7B, GasEraser reduces the misguidance rate by 48.2%, demonstrating its potential for more trustworthy LMMs.

cs.GR [Back]

[595] Rethinking Few-Shot Fusion: Granular Ball Priors Enable General-Purpose Deep Image Fusion

Minjie Deng,Yan Wei,Hao Zhai,An Wu,Yuncan Ouyang,Qianyao Peng

Main category: cs.GR

TLDR: 本文提出了一种基于Granular Ball适应的GBFF融合方法，通过亮度空间特征提取为深度网络提供先验，实现少样本训练和快速收敛。

Details

Motivation: 解决传统深度学习方法因缺乏真实融合图像先验而依赖大规模数据的问题。 Method: 利用Granular Ball适应在亮度空间提取特征，分类像素对为正域和边界域，生成近似监督图像作为先验。 Result: 实验验证了方法的有效性，与SOTA方法相比在融合时间和图像表现力上具有竞争力。 Conclusion: GBFF方法通过少样本训练和快速收敛，显著提升了图像融合的效率和表现力。 Abstract: In image fusion tasks, due to the lack of real fused images as priors, most deep learning-based fusion methods obtain global weight features from original images in large-scale data pairs to generate images that approximate real fused images. However, unlike previous studies, this paper utilizes Granular Ball adaptation to extract features in the brightness space as priors for deep networks, enabling the fusion network to converge quickly and complete the fusion task. This leads to few-shot training for a general image fusion network, and based on this, we propose the GBFF fusion method. According to the information expression division of pixel pairs in the original fused image, we classify pixel pairs with significant performance as the positive domain and non-significant pixel pairs as the boundary domain. We perform split inference in the brightness space using Granular Ball adaptation to compute weights for pixels that express information to varying degrees, generating approximate supervision images that provide priors for the neural network in the structural brightness space. Additionally, the extracted global saliency features also adaptively provide priors for setting the loss function weights of each image in the network, guiding the network to converge quickly at both global and pixel levels alongside the supervised images, thereby enhancing the expressiveness of the fused images. Each modality only used 10 pairs of images as the training set, completing the fusion task with a limited number of iterations. Experiments validate the effectiveness of the algorithm and theory, and qualitative and quantitative comparisons with SOTA methods show that this approach is highly competitive in terms of fusion time and image expressiveness.

[596] EchoMask: Speech-Queried Attention-based Mask Modeling for Holistic Co-Speech Motion Generation

Xiangyue Zhang,Jianfang Li,Jiaxu Zhang,Jianqiang Ren,Liefeng Bo,Zhigang Tu

Main category: cs.GR

TLDR: 提出了一种基于语音查询注意力的掩码建模框架，用于共语音动作生成，通过运动对齐的语音特征指导掩码过程，生成高质量动作。

Details

Motivation: 现有掩码建模框架难以识别语义显著的动作帧，导致共语音动作生成效果不佳。 Method: 1. 提出运动-音频对齐模块（MAM）构建联合空间；2. 引入语音查询注意力机制（SQA）计算帧级注意力分数；3. 将运动对齐的语音特征注入生成网络。 Result: 在定性和定量评估中优于现有方法，生成高质量共语音动作。 Conclusion: 通过运动对齐的语音特征和选择性掩码，显著提升了共语音动作生成的效果。 Abstract: Masked modeling framework has shown promise in co-speech motion generation. However, it struggles to identify semantically significant frames for effective motion masking. In this work, we propose a speech-queried attention-based mask modeling framework for co-speech motion generation. Our key insight is to leverage motion-aligned speech features to guide the masked motion modeling process, selectively masking rhythm-related and semantically expressive motion frames. Specifically, we first propose a motion-audio alignment module (MAM) to construct a latent motion-audio joint space. In this space, both low-level and high-level speech features are projected, enabling motion-aligned speech representation using learnable speech queries. Then, a speech-queried attention mechanism (SQA) is introduced to compute frame-level attention scores through interactions between motion keys and speech queries, guiding selective masking toward motion frames with high attention scores. Finally, the motion-aligned speech features are also injected into the generation network to facilitate co-speech motion generation. Qualitative and quantitative evaluations confirm that our method outperforms existing state-of-the-art approaches, successfully producing high-quality co-speech motion.

[597] Pseudo-Label Guided Real-World Image De-weathering: A Learning Framework with Imperfect Supervision

Heming Xu,Xiaohui Liu,Zhilu Zhang,Hongzhi Zhang,Xiaohe Wu,Wangmeng Zuo

Main category: cs.GR

TLDR: 提出了一种基于伪标签引导的学习框架，用于解决真实世界图像去天气化任务中训练数据不一致的问题。

Details

Motivation: 现有真实世界数据集中的训练对存在光照、物体位置等不一致问题，导致去天气化模型可能产生变形伪影。 Method: 结合去天气化模型（De-W）和一致标签构造器（CLC），通过伪标签和原始真实图像的联合监督，恢复清晰纹理并保持非天气内容的一致性。 Result: 实验表明，该方法在非理想对齐的去天气化数据集上优于其他方法。 Conclusion: 提出的伪标签引导框架有效解决了真实世界图像去天气化中的不一致性问题。 Abstract: Real-world image de-weathering aims at removingvarious undesirable weather-related artifacts, e.g., rain, snow,and fog. To this end, acquiring ideal training pairs is crucial.Existing real-world datasets are typically constructed paired databy extracting clean and degraded images from live streamsof landscape scene on the Internet. Despite the use of strictfiltering mechanisms during collection, training pairs inevitablyencounter inconsistency in terms of lighting, object position, scenedetails, etc, making de-weathering models possibly suffer fromdeformation artifacts under non-ideal supervision. In this work,we propose a unified solution for real-world image de-weatheringwith non-ideal supervision, i.e., a pseudo-label guided learningframework, to address various inconsistencies within the realworld paired dataset. Generally, it consists of a de-weatheringmodel (De-W) and a Consistent Label Constructor (CLC), bywhich restoration result can be adaptively supervised by originalground-truth image to recover sharp textures while maintainingconsistency with the degraded inputs in non-weather contentthrough the supervision of pseudo-labels. Particularly, a Crossframe Similarity Aggregation (CSA) module is deployed withinCLC to enhance the quality of pseudo-labels by exploring thepotential complementary information of multi-frames throughgraph model. Moreover, we introduce an Information AllocationStrategy (IAS) to integrate the original ground-truth imagesand pseudo-labels, thereby facilitating the joint supervision forthe training of de-weathering model. Extensive experimentsdemonstrate that our method exhibits significant advantageswhen trained on imperfectly aligned de-weathering datasets incomparison with other approaches.

[598] OctGPT: Octree-based Multiscale Autoregressive Models for 3D Shape Generation

Si-Tong Wei,Rui-Huan Wang,Chuan-Zhi Zhou,Baoquan Chen,Peng-Shuai Wang

Main category: cs.GR

TLDR: OctGPT是一种新型多尺度自回归模型，显著提升了3D形状生成的效率和性能，媲美或超越扩散模型。

Details

Motivation: 自回归模型在3D形状生成中表现不佳，落后于扩散模型。 Method: 采用序列化八叉树表示，结合VQVAE生成多尺度二进制序列，并使用改进的八叉树变换器。 Result: 训练时间减少13倍，生成时间减少69倍，支持高分辨率3D形状生成。 Conclusion: OctGPT为高质量、可扩展的3D内容生成提供了新范式。 Abstract: Autoregressive models have achieved remarkable success across various domains, yet their performance in 3D shape generation lags significantly behind that of diffusion models. In this paper, we introduce OctGPT, a novel multiscale autoregressive model for 3D shape generation that dramatically improves the efficiency and performance of prior 3D autoregressive approaches, while rivaling or surpassing state-of-the-art diffusion models. Our method employs a serialized octree representation to efficiently capture the hierarchical and spatial structures of 3D shapes. Coarse geometry is encoded via octree structures, while fine-grained details are represented by binary tokens generated using a vector quantized variational autoencoder (VQVAE), transforming 3D shapes into compact \emph{multiscale binary sequences} suitable for autoregressive prediction. To address the computational challenges of handling long sequences, we incorporate octree-based transformers enhanced with 3D rotary positional encodings, scale-specific embeddings, and token-parallel generation schemes. These innovations reduce training time by 13 folds and generation time by 69 folds, enabling the efficient training of high-resolution 3D shapes, e.g.,$1024^3$, on just four NVIDIA 4090 GPUs only within days. OctGPT showcases exceptional versatility across various tasks, including text-, sketch-, and image-conditioned generation, as well as scene-level synthesis involving multiple objects. Extensive experiments demonstrate that OctGPT accelerates convergence and improves generation quality over prior autoregressive methods, offering a new paradigm for high-quality, scalable 3D content creation.

[599] Rethinking Few-Shot Fusion: Granular Ball Priors Enable General-Purpose Deep Image Fusion

Minjie Deng,Yan Wei,Hao Zhai,An Wu,Yuncan Ouyang,Qianyao Peng

Main category: cs.GR

TLDR: 本文提出了一种基于Granular Ball适应的图像融合方法GBFF，通过亮度空间特征提取和少样本训练，快速完成融合任务。

Details

Motivation: 传统深度学习方法缺乏真实融合图像作为先验，而本文利用Granular Ball适应提取亮度空间特征作为先验，实现快速收敛。 Method: 通过Granular Ball适应在亮度空间进行特征提取和权重计算，生成近似监督图像，并结合全局显著性特征优化损失函数权重。 Result: 实验表明，该方法在融合时间和图像表现力上具有竞争力，仅需10对图像即可完成训练。 Conclusion: GBFF方法在少样本训练下高效完成图像融合任务，显著提升了融合效果和效率。 Abstract: In image fusion tasks, due to the lack of real fused images as priors, most deep learning-based fusion methods obtain global weight features from original images in large-scale data pairs to generate images that approximate real fused images. However, unlike previous studies, this paper utilizes Granular Ball adaptation to extract features in the brightness space as priors for deep networks, enabling the fusion network to converge quickly and complete the fusion task. This leads to few-shot training for a general image fusion network, and based on this, we propose the GBFF fusion method. According to the information expression division of pixel pairs in the original fused image, we classify pixel pairs with significant performance as the positive domain and non-significant pixel pairs as the boundary domain. We perform split inference in the brightness space using Granular Ball adaptation to compute weights for pixels that express information to varying degrees, generating approximate supervision images that provide priors for the neural network in the structural brightness space. Additionally, the extracted global saliency features also adaptively provide priors for setting the loss function weights of each image in the network, guiding the network to converge quickly at both global and pixel levels alongside the supervised images, thereby enhancing the expressiveness of the fused images. Each modality only used 10 pairs of images as the training set, completing the fusion task with a limited number of iterations. Experiments validate the effectiveness of the algorithm and theory, and qualitative and quantitative comparisons with SOTA methods show that this approach is highly competitive in terms of fusion time and image expressiveness.

[600] EchoMask: Speech-Queried Attention-based Mask Modeling for Holistic Co-Speech Motion Generation

Xiangyue Zhang,Jianfang Li,Jiaxu Zhang,Jianqiang Ren,Liefeng Bo,Zhigang Tu

Main category: cs.GR

TLDR: 提出了一种基于语音查询注意力的掩码建模框架，用于共语音运动生成，通过运动对齐的语音特征指导掩码过程，生成高质量运动。

Details

Motivation: 现有掩码建模框架难以识别语义显著的运动帧，影响了共语音运动生成的效果。 Method: 提出运动-音频对齐模块（MAM）构建联合空间，引入语音查询注意力机制（SQA）计算帧级注意力分数，指导选择性掩码。 Result: 定性和定量评估表明，该方法优于现有技术，成功生成高质量共语音运动。 Conclusion: 通过运动对齐的语音特征和选择性掩码，显著提升了共语音运动生成的质量。 Abstract: Masked modeling framework has shown promise in co-speech motion generation. However, it struggles to identify semantically significant frames for effective motion masking. In this work, we propose a speech-queried attention-based mask modeling framework for co-speech motion generation. Our key insight is to leverage motion-aligned speech features to guide the masked motion modeling process, selectively masking rhythm-related and semantically expressive motion frames. Specifically, we first propose a motion-audio alignment module (MAM) to construct a latent motion-audio joint space. In this space, both low-level and high-level speech features are projected, enabling motion-aligned speech representation using learnable speech queries. Then, a speech-queried attention mechanism (SQA) is introduced to compute frame-level attention scores through interactions between motion keys and speech queries, guiding selective masking toward motion frames with high attention scores. Finally, the motion-aligned speech features are also injected into the generation network to facilitate co-speech motion generation. Qualitative and quantitative evaluations confirm that our method outperforms existing state-of-the-art approaches, successfully producing high-quality co-speech motion.

[601] Pseudo-Label Guided Real-World Image De-weathering: A Learning Framework with Imperfect Supervision

Heming Xu,Xiaohui Liu,Zhilu Zhang,Hongzhi Zhang,Xiaohe Wu,Wangmeng Zuo

Main category: cs.GR

TLDR: 提出了一种基于伪标签学习的统一解决方案，用于处理真实世界图像去天气任务中的非理想监督问题。

Details

Motivation: 现有真实世界数据集中的训练对存在光照、物体位置等不一致问题，导致去天气模型可能产生变形伪影。 Method: 提出伪标签引导学习框架，包括去天气模型（De-W）和一致性标签构造器（CLC），并通过跨帧相似性聚合（CSA）模块和信息分配策略（IAS）优化伪标签质量。 Result: 实验表明，该方法在非理想对齐的去天气数据集上优于其他方法。 Conclusion: 该方法能有效解决真实世界去天气任务中的不一致性问题，提升模型性能。 Abstract: Real-world image de-weathering aims at removingvarious undesirable weather-related artifacts, e.g., rain, snow,and fog. To this end, acquiring ideal training pairs is crucial.Existing real-world datasets are typically constructed paired databy extracting clean and degraded images from live streamsof landscape scene on the Internet. Despite the use of strictfiltering mechanisms during collection, training pairs inevitablyencounter inconsistency in terms of lighting, object position, scenedetails, etc, making de-weathering models possibly suffer fromdeformation artifacts under non-ideal supervision. In this work,we propose a unified solution for real-world image de-weatheringwith non-ideal supervision, i.e., a pseudo-label guided learningframework, to address various inconsistencies within the realworld paired dataset. Generally, it consists of a de-weatheringmodel (De-W) and a Consistent Label Constructor (CLC), bywhich restoration result can be adaptively supervised by originalground-truth image to recover sharp textures while maintainingconsistency with the degraded inputs in non-weather contentthrough the supervision of pseudo-labels. Particularly, a Crossframe Similarity Aggregation (CSA) module is deployed withinCLC to enhance the quality of pseudo-labels by exploring thepotential complementary information of multi-frames throughgraph model. Moreover, we introduce an Information AllocationStrategy (IAS) to integrate the original ground-truth imagesand pseudo-labels, thereby facilitating the joint supervision forthe training of de-weathering model. Extensive experimentsdemonstrate that our method exhibits significant advantageswhen trained on imperfectly aligned de-weathering datasets incomparison with other approaches.

[602] OctGPT: Octree-based Multiscale Autoregressive Models for 3D Shape Generation

Si-Tong Wei,Rui-Huan Wang,Chuan-Zhi Zhou,Baoquan Chen,Peng-Shuai Wang

Main category: cs.GR

TLDR: OctGPT是一种新型多尺度自回归模型，显著提升了3D形状生成的效率和性能，甚至超越扩散模型。

Details

Motivation: 自回归模型在3D形状生成中表现不佳，远落后于扩散模型，因此需要一种更高效的方法。 Method: 采用序列化八叉树表示法，结合VQVAE生成二进制令牌，并使用改进的八叉树变换器处理长序列。 Result: 训练时间减少13倍，生成时间减少69倍，支持高分辨率3D形状生成，并在多种任务中表现优异。 Conclusion: OctGPT为高质量、可扩展的3D内容生成提供了新范式。 Abstract: Autoregressive models have achieved remarkable success across various domains, yet their performance in 3D shape generation lags significantly behind that of diffusion models. In this paper, we introduce OctGPT, a novel multiscale autoregressive model for 3D shape generation that dramatically improves the efficiency and performance of prior 3D autoregressive approaches, while rivaling or surpassing state-of-the-art diffusion models. Our method employs a serialized octree representation to efficiently capture the hierarchical and spatial structures of 3D shapes. Coarse geometry is encoded via octree structures, while fine-grained details are represented by binary tokens generated using a vector quantized variational autoencoder (VQVAE), transforming 3D shapes into compact \emph{multiscale binary sequences} suitable for autoregressive prediction. To address the computational challenges of handling long sequences, we incorporate octree-based transformers enhanced with 3D rotary positional encodings, scale-specific embeddings, and token-parallel generation schemes. These innovations reduce training time by 13 folds and generation time by 69 folds, enabling the efficient training of high-resolution 3D shapes, e.g.,$1024^3$, on just four NVIDIA 4090 GPUs only within days. OctGPT showcases exceptional versatility across various tasks, including text-, sketch-, and image-conditioned generation, as well as scene-level synthesis involving multiple objects. Extensive experiments demonstrate that OctGPT accelerates convergence and improves generation quality over prior autoregressive methods, offering a new paradigm for high-quality, scalable 3D content creation.

cs.HC [Back]

[603] Linguistic Comparison of AI- and Human-Written Responses to Online Mental Health Queries

Koustuv Saha,Yoshee Jain,Munmun De Choudhury

Main category: cs.HC

TLDR: 研究比较了生成式AI与人类在在线心理健康社区中的回应差异，发现AI更冗长、易读且结构化，但缺乏语言多样性和个人叙事。

Details

Motivation: 探讨生成式AI是否能复制人类同伴在心理健康支持中的细微、经验性互动。 Method: 利用Reddit上55个OMHCs的24,114篇帖子和138,758条回复，通过GPT-4-Turbo等LLMs生成回应，并与人类回应进行语言学对比。 Result: AI回应更冗长、易读且结构化，但缺乏语言多样性和个人叙事。 Conclusion: 需平衡AI的可扩展性和及时性与人类互动的真实性，制定伦理框架。 Abstract: The ubiquity and widespread use of digital and online technologies have transformed mental health support, with online mental health communities (OMHCs) providing safe spaces for peer support. More recently, generative AI and large language models (LLMs) have introduced new possibilities for scalable, around-the-clock mental health assistance that could potentially augment and supplement the capabilities of OMHCs. Although genAI shows promise in delivering immediate and personalized responses, their effectiveness in replicating the nuanced, experience-based support of human peers remains an open question. In this study, we harnessed 24,114 posts and 138,758 online community (OC) responses from 55 OMHCs on Reddit. We prompted several state-of-the-art LLMs (GPT-4-Turbo, Llama-3, and Mistral-7B) with these posts, and compared their (AI) responses to human-written (OC) responses based on a variety of linguistic measures across psycholinguistics and lexico-semantics. Our findings revealed that AI responses are more verbose, readable, and analytically structured, but lack linguistic diversity and personal narratives inherent in human-human interactions. Through a qualitative examination, we found validation as well as complementary insights into the nature of AI responses, such as its neutrality of stance and the absence of seeking back-and-forth clarifications. We discuss the ethical and practical implications of integrating generative AI into OMHCs, advocating for frameworks that balance AI's scalability and timeliness with the irreplaceable authenticity, social interactiveness, and expertise of human connections that form the ethos of online support communities.

[604] AgentA/B: Automated and Scalable Web A/BTesting with Interactive LLM Agents

Dakuo Wang,Ting-Yao Hsu,Yuxuan Lu,Limeng Cui,Yaochen Xie,William Headean,Bingsheng Yao,Akash Veeragouni,Jiapeng Liu,Sreyashi Nag,Jessie Wang

Main category: cs.HC

TLDR: AgentA/B利用基于大语言模型的自主代理模拟用户行为，解决了传统A/B测试依赖大规模真实用户和耗时长的瓶颈。

Details

Motivation: 传统A/B测试依赖真实用户和大规模流量，且耗时较长，限制了其效率和可扩展性。 Method: 开发了AgentA/B系统，利用LLM代理模拟用户行为，支持多步交互（如搜索、点击、购买等），并部署多样化代理。 Result: 在亚马逊网站的实验中，AgentA/B模拟的1,000个代理行为与真实用户行为模式相似。 Conclusion: AgentA/B能够有效模拟人类行为，为A/B测试提供了一种高效且可扩展的替代方案。 Abstract: A/B testing experiment is a widely adopted method for evaluating UI/UX design decisions in modern web applications. Yet, traditional A/B testing remains constrained by its dependence on the large-scale and live traffic of human participants, and the long time of waiting for the testing result. Through formative interviews with six experienced industry practitioners, we identified critical bottlenecks in current A/B testing workflows. In response, we present AgentA/B, a novel system that leverages Large Language Model-based autonomous agents (LLM Agents) to automatically simulate user interaction behaviors with real webpages. AgentA/B enables scalable deployment of LLM agents with diverse personas, each capable of navigating the dynamic webpage and interactively executing multi-step interactions like search, clicking, filtering, and purchasing. In a demonstrative controlled experiment, we employ AgentA/B to simulate a between-subject A/B testing with 1,000 LLM agents Amazon.com, and compare agent behaviors with real human shopping behaviors at a scale. Our findings suggest AgentA/B can emulate human-like behavior patterns.

[605] Explorer: Robust Collection of Interactable GUI Elements

Iason Chaimalas,Arnas Vyšniauskas,Gabriel Brostow

Main category: cs.HC

TLDR: Explorer系统专注于检测屏幕上的按钮和文本输入字段，通过个性化数据收集和ML训练，提高GUI自动化精度。

Details

Motivation: 现有GUI自动化困难，数据收集和ML训练需针对特定应用以提高用户信心。 Method: 利用实时应用程序进行数据收集和ML训练，支持Android和桌面Chrome浏览器，记录用户交互会话并生成状态映射。 Result: Explorer系统成功实现GUI路径规划，支持用户通过语音命令导航。 Conclusion: Explorer系统为GUI自动化提供有效解决方案，代码已开源。 Abstract: Automation of existing Graphical User Interfaces (GUIs) is important but hard to achieve. Upstream of making the GUI user-accessible or somehow scriptable, even the data-collection to understand the original interface poses significant challenges. For example, large quantities of general UI data seem helpful for training general machine learning (ML) models, but accessibility for each person can hinge on the ML's precision on a specific app. We therefore take the perspective that a given user needs confidence, that the relevant UI elements are being detected correctly throughout one app or digital environment. We mostly assume that the target application is known in advance, so that data collection and ML-training can be personalized for the test-time target domain. The proposed Explorer system focuses on detecting on-screen buttons and text-entry fields, i.e. interactables, where the training process has access to a live version of the application. The live application can run on almost any popular platform except iOS phones, and the collection is especially streamlined for Android phones or for desktop Chrome browsers. Explorer also enables the recording of interactive user sessions, and subsequent mapping of how such sessions overlap and sometimes loop back to similar states. We show how having such a map enables a kind of path planning through the GUI, letting a user issue audio commands to get to their destination. Critically, we are releasing our code for Explorer openly at https://github.com/varnelis/Explorer.

[606] Linguistic Comparison of AI- and Human-Written Responses to Online Mental Health Queries

Koustuv Saha,Yoshee Jain,Munmun De Choudhury

Main category: cs.HC

TLDR: 研究比较了生成式AI与人类在在线心理健康社区（OMHCs）中的回应，发现AI更冗长、易读且结构化，但缺乏语言多样性和个人叙事。

Details

Motivation: 探讨生成式AI是否能复制人类同伴在心理健康支持中的细腻互动。 Method: 使用Reddit上55个OMHCs的24,114篇帖子及138,758条回复，对比GPT-4-Turbo、Llama-3和Mistral-7B的AI回应与人类回应的语言学特征。 Result: AI回应更冗长、易读且结构化，但缺乏语言多样性和个人叙事。AI立场中立且不寻求进一步澄清。 Conclusion: 需平衡AI的可扩展性与人类互动的真实性，制定框架以整合AI到OMHCs中。 Abstract: The ubiquity and widespread use of digital and online technologies have transformed mental health support, with online mental health communities (OMHCs) providing safe spaces for peer support. More recently, generative AI and large language models (LLMs) have introduced new possibilities for scalable, around-the-clock mental health assistance that could potentially augment and supplement the capabilities of OMHCs. Although genAI shows promise in delivering immediate and personalized responses, their effectiveness in replicating the nuanced, experience-based support of human peers remains an open question. In this study, we harnessed 24,114 posts and 138,758 online community (OC) responses from 55 OMHCs on Reddit. We prompted several state-of-the-art LLMs (GPT-4-Turbo, Llama-3, and Mistral-7B) with these posts, and compared their (AI) responses to human-written (OC) responses based on a variety of linguistic measures across psycholinguistics and lexico-semantics. Our findings revealed that AI responses are more verbose, readable, and analytically structured, but lack linguistic diversity and personal narratives inherent in human-human interactions. Through a qualitative examination, we found validation as well as complementary insights into the nature of AI responses, such as its neutrality of stance and the absence of seeking back-and-forth clarifications. We discuss the ethical and practical implications of integrating generative AI into OMHCs, advocating for frameworks that balance AI's scalability and timeliness with the irreplaceable authenticity, social interactiveness, and expertise of human connections that form the ethos of online support communities.

[607] AgentA/B: Automated and Scalable Web A/BTesting with Interactive LLM Agents

Dakuo Wang,Ting-Yao Hsu,Yuxuan Lu,Limeng Cui,Yaochen Xie,William Headean,Bingsheng Yao,Akash Veeragouni,Jiapeng Liu,Sreyashi Nag,Jessie Wang

Main category: cs.HC

TLDR: AgentA/B利用基于大型语言模型的自主代理（LLM Agents）模拟用户行为，解决了传统A/B测试依赖大规模真实用户流量和耗时长的问题。

Details

Motivation: 传统A/B测试依赖真实用户流量且耗时，限制了效率。通过访谈行业专家，发现当前A/B测试流程的瓶颈，提出自动化解决方案。 Method: 开发AgentA/B系统，利用LLM代理模拟用户与网页的交互行为（如搜索、点击、购买等），支持多样化角色部署。 Result: 在亚马逊网站的实验中，AgentA/B模拟的1,000个代理行为与真实用户行为模式相似。 Conclusion: AgentA/B能够有效模拟人类行为，为A/B测试提供高效替代方案。 Abstract: A/B testing experiment is a widely adopted method for evaluating UI/UX design decisions in modern web applications. Yet, traditional A/B testing remains constrained by its dependence on the large-scale and live traffic of human participants, and the long time of waiting for the testing result. Through formative interviews with six experienced industry practitioners, we identified critical bottlenecks in current A/B testing workflows. In response, we present AgentA/B, a novel system that leverages Large Language Model-based autonomous agents (LLM Agents) to automatically simulate user interaction behaviors with real webpages. AgentA/B enables scalable deployment of LLM agents with diverse personas, each capable of navigating the dynamic webpage and interactively executing multi-step interactions like search, clicking, filtering, and purchasing. In a demonstrative controlled experiment, we employ AgentA/B to simulate a between-subject A/B testing with 1,000 LLM agents Amazon.com, and compare agent behaviors with real human shopping behaviors at a scale. Our findings suggest AgentA/B can emulate human-like behavior patterns.

[608] Explorer: Robust Collection of Interactable GUI Elements

Iason Chaimalas,Arnas Vyšniauskas,Gabriel Brostow

Main category: cs.HC

TLDR: Explorer系统专注于检测屏幕上的按钮和文本输入字段，通过个性化数据收集和ML训练，支持用户通过音频命令导航GUI。

Details

Motivation: 自动化现有GUI具有挑战性，尤其是数据收集和ML模型对特定应用的精确性需求。 Method: 利用实时应用程序进行个性化数据收集和ML训练，支持Android和Chrome浏览器，记录用户交互会话并生成状态映射。 Result: Explorer系统实现了通过音频命令导航GUI的功能，并开源了代码。 Conclusion: Explorer为GUI自动化提供了一种个性化、高效的解决方案，适用于特定应用场景。 Abstract: Automation of existing Graphical User Interfaces (GUIs) is important but hard to achieve. Upstream of making the GUI user-accessible or somehow scriptable, even the data-collection to understand the original interface poses significant challenges. For example, large quantities of general UI data seem helpful for training general machine learning (ML) models, but accessibility for each person can hinge on the ML's precision on a specific app. We therefore take the perspective that a given user needs confidence, that the relevant UI elements are being detected correctly throughout one app or digital environment. We mostly assume that the target application is known in advance, so that data collection and ML-training can be personalized for the test-time target domain. The proposed Explorer system focuses on detecting on-screen buttons and text-entry fields, i.e. interactables, where the training process has access to a live version of the application. The live application can run on almost any popular platform except iOS phones, and the collection is especially streamlined for Android phones or for desktop Chrome browsers. Explorer also enables the recording of interactive user sessions, and subsequent mapping of how such sessions overlap and sometimes loop back to similar states. We show how having such a map enables a kind of path planning through the GUI, letting a user issue audio commands to get to their destination. Critically, we are releasing our code for Explorer openly at https://github.com/varnelis/Explorer.

eess.AS [Back]

[609] SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning

Prabhat Pandey,Rupak Vignesh Swaminathan,K V Vijay Girish,Arunasish Sen,Jian Xie,Grant P. Strimel,Andreas Schwarz

Main category: eess.AS

TLDR: SIFT-50M是一个包含5000万样本的数据集，用于语音-文本大语言模型（LLM）的指令微调和预训练，支持多语言和多样化任务。

Details

Motivation: 现有语音-文本LLM在指令跟随能力上表现不足，需要更高质量的数据集支持其训练和评估。 Method: 利用公开语音语料库和专家模型构建SIFT-50M数据集，训练SIFT-LLM模型。 Result: SIFT-LLM在指令跟随任务上优于现有模型，同时在基础语音任务上表现竞争力。 Conclusion: SIFT-50M和EvalSIFT为语音-文本LLM的研究提供了重要资源，推动了该领域的发展。 Abstract: We introduce SIFT (Speech Instruction Fine-Tuning), a 50M-example dataset designed for instruction fine-tuning and pre-training of speech-text large language models (LLMs). SIFT-50M is built from publicly available speech corpora, which collectively contain 14K hours of speech, and leverages LLMs along with off-the-shelf expert models. The dataset spans five languages, encompassing a diverse range of speech understanding as well as controllable speech generation instructions. Using SIFT-50M, we train SIFT-LLM, which outperforms existing speech-text LLMs on instruction-following benchmarks while achieving competitive performance on foundational speech tasks. To support further research, we also introduce EvalSIFT, a benchmark dataset specifically designed to evaluate the instruction-following capabilities of speech-text LLMs.

[610] Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis

Yifan Yang,Shujie Liu,Jinyu Li,Yuxuan Hu,Haibin Wu,Hui Wang,Jianwei Yu,Lingwei Meng,Haiyang Sun,Yanqing Liu,Yan Lu,Kai Yu,Xie Chen

Main category: eess.AS

TLDR: 提出了一种伪自回归（PAR）编解码语言建模方法，结合自回归和非自回归模型的优势，并基于此设计了PALLE系统，显著提升了语音合成的质量和速度。

Details

Motivation: 解决现有零样本文本到语音（TTS）系统中自回归模型生成慢且缺乏时长可控性，以及非自回归模型缺乏时序建模且设计复杂的问题。 Method: 提出PAR方法，结合自回归的时序建模和非自回归的并行生成；基于PAR设计PALLE系统，分两阶段生成和优化语音。 Result: PALLE在LibriSpeech测试集上优于现有系统（如F5-TTS、E2-TTS等），在语音质量、说话人相似性和清晰度上表现更好，且推理速度快10倍。 Conclusion: PAR和PALLE为TTS系统提供了一种高效且高质量的解决方案，兼具时序建模和并行生成的优势。 Abstract: Recent zero-shot text-to-speech (TTS) systems face a common dilemma: autoregressive (AR) models suffer from slow generation and lack duration controllability, while non-autoregressive (NAR) models lack temporal modeling and typically require complex designs. In this paper, we introduce a novel pseudo-autoregressive (PAR) codec language modeling approach that unifies AR and NAR modeling. Combining explicit temporal modeling from AR with parallel generation from NAR, PAR generates dynamic-length spans at fixed time steps. Building on PAR, we propose PALLE, a two-stage TTS system that leverages PAR for initial generation followed by NAR refinement. In the first stage, PAR progressively generates speech tokens along the time dimension, with each step predicting all positions in parallel but only retaining the left-most span. In the second stage, low-confidence tokens are iteratively refined in parallel, leveraging the global contextual information. Experiments demonstrate that PALLE, trained on LibriTTS, outperforms state-of-the-art systems trained on large-scale data, including F5-TTS, E2-TTS, and MaskGCT, on the LibriSpeech test-clean set in terms of speech quality, speaker similarity, and intelligibility, while achieving up to ten times faster inference speed. Audio samples are available at https://anonymous-palle.github.io.

[611] SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning

Prabhat Pandey,Rupak Vignesh Swaminathan,K V Vijay Girish,Arunasish Sen,Jian Xie,Grant P. Strimel,Andreas Schwarz

Main category: eess.AS

TLDR: SIFT-50M是一个包含5000万样本的数据集，用于语音文本大语言模型的指令微调和预训练，支持五种语言，并在指令跟随任务中表现优异。

Details

Motivation: 构建一个大规模的语音指令数据集，以提升语音文本大语言模型在指令跟随和可控语音生成任务中的性能。 Method: 利用公开语音语料库（总计14K小时语音）和专家模型，构建SIFT-50M数据集，并训练SIFT-LLM模型。 Result: SIFT-LLM在指令跟随任务中优于现有语音文本大语言模型，同时在基础语音任务中表现竞争性。 Conclusion: SIFT-50M和SIFT-LLM为语音文本大语言模型的研究提供了重要资源，同时EvalSIFT基准支持进一步评估。 Abstract: We introduce SIFT (Speech Instruction Fine-Tuning), a 50M-example dataset designed for instruction fine-tuning and pre-training of speech-text large language models (LLMs). SIFT-50M is built from publicly available speech corpora, which collectively contain 14K hours of speech, and leverages LLMs along with off-the-shelf expert models. The dataset spans five languages, encompassing a diverse range of speech understanding as well as controllable speech generation instructions. Using SIFT-50M, we train SIFT-LLM, which outperforms existing speech-text LLMs on instruction-following benchmarks while achieving competitive performance on foundational speech tasks. To support further research, we also introduce EvalSIFT, a benchmark dataset specifically designed to evaluate the instruction-following capabilities of speech-text LLMs.

[612] Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis

Yifan Yang,Shujie Liu,Jinyu Li,Yuxuan Hu,Haibin Wu,Hui Wang,Jianwei Yu,Lingwei Meng,Haiyang Sun,Yanqing Liu,Yan Lu,Kai Yu,Xie Chen

Main category: eess.AS

TLDR: 提出了一种伪自回归（PAR）编解码语言建模方法，结合了自回归（AR）和非自回归（NAR）模型的优势，并基于此构建了两阶段TTS系统PALLE，显著提升了语音质量和生成速度。

Details

Motivation: 解决现有零样本TTS系统中AR模型生成速度慢和NAR模型缺乏时序建模的问题。 Method: 提出PAR方法，结合AR的时序建模和NAR的并行生成；基于PAR构建PALLE系统，分两阶段生成和优化语音。 Result: PALLE在语音质量、说话人相似度和清晰度上优于现有系统，且推理速度提升十倍。 Conclusion: PAR和PALLE为TTS系统提供了一种高效且高质量的解决方案。 Abstract: Recent zero-shot text-to-speech (TTS) systems face a common dilemma: autoregressive (AR) models suffer from slow generation and lack duration controllability, while non-autoregressive (NAR) models lack temporal modeling and typically require complex designs. In this paper, we introduce a novel pseudo-autoregressive (PAR) codec language modeling approach that unifies AR and NAR modeling. Combining explicit temporal modeling from AR with parallel generation from NAR, PAR generates dynamic-length spans at fixed time steps. Building on PAR, we propose PALLE, a two-stage TTS system that leverages PAR for initial generation followed by NAR refinement. In the first stage, PAR progressively generates speech tokens along the time dimension, with each step predicting all positions in parallel but only retaining the left-most span. In the second stage, low-confidence tokens are iteratively refined in parallel, leveraging the global contextual information. Experiments demonstrate that PALLE, trained on LibriTTS, outperforms state-of-the-art systems trained on large-scale data, including F5-TTS, E2-TTS, and MaskGCT, on the LibriSpeech test-clean set in terms of speech quality, speaker similarity, and intelligibility, while achieving up to ten times faster inference speed. Audio samples are available at https://anonymous-palle.github.io.

cs.CY [Back]

[613] The Lyme Disease Controversy: An AI-Driven Discourse Analysis of a Quarter Century of Academic Debate and Divides

Teo Susnjak,Cole Palffy,Tatiana Zimina,Nazgul Altynbekova,Kunal Garg,Leona Gilbert

Main category: cs.CY

TLDR: 本文通过结合大型语言模型和专家验证，首次系统分析了25年来关于慢性莱姆病和莱姆病治疗后综合征的学术讨论，揭示了从感染模型到免疫介导解释的转变。

Details

Motivation: 研究动机是探索莱姆病研究领域的复杂性和争议性，分析其演变过程及影响因素。 Method: 采用混合AI驱动方法，结合大型语言模型和结构化人类验证，分析数千篇学术摘要。 Result: 研究发现莱姆病解释模型从感染为主转向免疫介导，揭示了研究领域的结构性变化。 Conclusion: 研究提供了一种可扩展的方法论，强调了AI辅助方法在社会科学和医学研究中的价值。 Abstract: The scientific discourse surrounding Chronic Lyme Disease (CLD) and Post-Treatment Lyme Disease Syndrome (PTLDS) has evolved over the past twenty-five years into a complex and polarised debate, shaped by shifting research priorities, institutional influences, and competing explanatory models. This study presents the first large-scale, systematic examination of this discourse using an innovative hybrid AI-driven methodology, combining large language models with structured human validation to analyse thousands of scholarly abstracts spanning 25 years. By integrating Large Language Models (LLMs) with expert oversight, we developed a quantitative framework for tracking epistemic shifts in contested medical fields, with applications to other content analysis domains. Our analysis revealed a progressive transition from infection-based models of Lyme disease to immune-mediated explanations for persistent symptoms. This study offers new empirical insights into the structural and epistemic forces shaping Lyme disease research, providing a scalable and replicable methodology for analysing discourse, while underscoring the value of AI-assisted methodologies in social science and medical research.

[614] Estimating Item Difficulty Using Large Language Models and Tree-Based Machine Learning Algorithms

Pooya Razavi,Sonya J. Powers

Main category: cs.CY

TLDR: 研究探讨了使用大型语言模型（LLM）预测K-5数学和阅读评估题目难度的可行性，比较了直接估计和基于特征的方法，发现后者更准确。

Details

Motivation: 传统题目难度测试资源密集且耗时，需开发基于题目内容的大规模预测方法。 Method: 采用两种方法：直接LLM估计和基于特征的集成树模型（随机森林和梯度提升）。 Result: 直接估计与真实难度相关性中等至强，但早期年级表现较差；基于特征的方法预测更准确（r=0.87）。 Conclusion: LLM有望简化题目开发并减少对实地测试的依赖，结构化特征提取是关键。 Abstract: Estimating item difficulty through field-testing is often resource-intensive and time-consuming. As such, there is strong motivation to develop methods that can predict item difficulty at scale using only the item content. Large Language Models (LLMs) represent a new frontier for this goal. The present research examines the feasibility of using an LLM to predict item difficulty for K-5 mathematics and reading assessment items (N = 5170). Two estimation approaches were implemented: (a) a direct estimation method that prompted the LLM to assign a single difficulty rating to each item, and (b) a feature-based strategy where the LLM extracted multiple cognitive and linguistic features, which were then used in ensemble tree-based models (random forests and gradient boosting) to predict difficulty. Overall, direct LLM estimates showed moderate to strong correlations with true item difficulties. However, their accuracy varied by grade level, often performing worse for early grades. In contrast, the feature-based method yielded stronger predictive accuracy, with correlations as high as r = 0.87 and lower error estimates compared to both direct LLM predictions and baseline regressors. These findings highlight the promise of LLMs in streamlining item development and reducing reliance on extensive field testing and underscore the importance of structured feature extraction. We provide a seven-step workflow for testing professionals who would want to implement a similar item difficulty estimation approach with their item pool.

[615] AI-University: An LLM-based platform for instructional alignment to scientific classrooms

Mostafa Faghih Shojaei,Rahul Gulati,Benjamin A. Jasperson,Shangshang Wang,Simone Cimolato,Dangli Cao,Willie Neiswanger,Krishna Garikipati

Main category: cs.CY

TLDR: AI-U是一个灵活的AI驱动课程内容交付框架，通过微调大型语言模型（LLM）和检索增强生成（RAG）技术，生成与教师教学风格一致的响应。

Details

Motivation: 旨在为高等教育提供可扩展的AI辅助教育解决方案，适应不同教师的教学风格。 Method: 采用LoRA微调开源LLM，结合RAG技术优化响应，并通过案例研究验证其有效性。 Result: 评估显示AI-U与课程材料高度一致，专家模型在86%的测试案例中表现优于基准模型。 Conclusion: AI-U为AI辅助教育提供了可扩展的框架，有望在高等教育中广泛应用。 Abstract: We introduce AI University (AI-U), a flexible framework for AI-driven course content delivery that adapts to instructors' teaching styles. At its core, AI-U fine-tunes a large language model (LLM) with retrieval-augmented generation (RAG) to generate instructor-aligned responses from lecture videos, notes, and textbooks. Using a graduate-level finite-element-method (FEM) course as a case study, we present a scalable pipeline to systematically construct training data, fine-tune an open-source LLM with Low-Rank Adaptation (LoRA), and optimize its responses through RAG-based synthesis. Our evaluation - combining cosine similarity, LLM-based assessment, and expert review - demonstrates strong alignment with course materials. We also have developed a prototype web application, available at https://my-ai-university.com, that enhances traceability by linking AI-generated responses to specific sections of the relevant course material and time-stamped instances of the open-access video lectures. Our expert model is found to have greater cosine similarity with a reference on 86% of test cases. An LLM judge also found our expert model to outperform the base Llama 3.2 model approximately four times out of five. AI-U offers a scalable approach to AI-assisted education, paving the way for broader adoption in higher education. Here, our framework has been presented in the setting of a class on FEM - a subject that is central to training PhD and Master students in engineering science. However, this setting is a particular instance of a broader context: fine-tuning LLMs to research content in science.

[616] Assessing Judging Bias in Large Reasoning Models: An Empirical Study

Qian Wang,Zhanzhi Lou,Zhenheng Tang,Nuo Chen,Xuandong Zhao,Wenxuan Zhang,Dawn Song,Bingsheng He

Main category: cs.CY

TLDR: 研究比较了大型推理模型（LRMs）和大型语言模型（LLMs）在判断任务中的偏见，发现LRMs虽具备更强的推理能力但仍存在多种偏见，并提出了三种缓解策略。

Details

Motivation: 随着LRMs在自动判断任务中的广泛应用，了解其偏见并设计缓解方法对开发可靠的LLM-as-a-Judge框架至关重要。 Method: 通过对比主观偏好对齐数据集和客观事实数据集，分析了LRMs和LLMs的偏见（如从众、权威、位置和分心偏见），并评估了三种缓解策略：专用系统提示、上下文学习和自我反思机制。 Result: 发现LRMs虽在事实相关任务中表现更稳健，但仍存在显著的位置偏见和新型的“表面反思偏见”。缓解策略中，专用系统提示和上下文学习效果显著，自我反思机制对LRMs尤其有效。 Conclusion: 研究为开发更可靠的自动判断框架提供了关键见解，尤其是在LRMs日益普及的背景下。 Abstract: Large Reasoning Models (LRMs) like DeepSeek-R1 and OpenAI-o1 have demonstrated remarkable reasoning capabilities, raising important questions about their biases in LLM-as-a-judge settings. We present a comprehensive benchmark comparing judging biases between LLMs and LRMs across both subjective preference-alignment datasets and objective fact-based datasets. Through investigation of bandwagon, authority, position, and distraction biases, we uncover four key findings: (1) despite their advanced reasoning capabilities, LRMs remain susceptible to the above biases; (2) LRMs demonstrate better robustness than LLMs specifically on fact-related datasets; (3) LRMs exhibit notable position bias, preferring options in later positions; and (4) we identify a novel "superficial reflection bias" where phrases mimicking reasoning (e.g., "wait, let me think...") significantly influence model judgments. To address these biases, we design and evaluate three mitigation strategies: specialized system prompts that reduce judging biases by up to 19\% in preference alignment datasets and 14\% in fact-related datasets, in-context learning that provides up to 27\% improvement on preference tasks but shows inconsistent results on factual tasks, and a self-reflection mechanism that reduces biases by up to 10\% in preference datasets and 16\% in fact-related datasets, with self-reflection proving particularly effective for LRMs. Our work provides crucial insights for developing more reliable LLM-as-a-Judge frameworks, especially as LRMs become increasingly deployed as automated judges.

[617] RealHarm: A Collection of Real-World Language Model Application Failures

Pierre Le Jeune,Jiaen Liu,Luca Rossi,Matteo Dora

Main category: cs.CY

TLDR: 论文介绍了RealHarm数据集，分析了AI代理在现实中的问题交互，发现声誉损害是主要组织危害，而错误信息是最常见的危害类别。现有防护系统存在显著不足。

Details

Motivation: 现有研究多基于理论分析或监管框架，缺乏对AI在现实应用中实际问题的实证研究。 Method: 通过系统回顾公开报道的事件，构建RealHarm数据集，分析危害、原因及部署者视角的风险。 Result: 声誉损害是主要组织危害，错误信息是最常见危害类别；现有防护系统未能有效预防问题。 Conclusion: AI应用的防护系统存在显著不足，需进一步改进以应对现实中的风险。 Abstract: Language model deployments in consumer-facing applications introduce numerous risks. While existing research on harms and hazards of such applications follows top-down approaches derived from regulatory frameworks and theoretical analyses, empirical evidence of real-world failure modes remains underexplored. In this work, we introduce RealHarm, a dataset of annotated problematic interactions with AI agents built from a systematic review of publicly reported incidents. Analyzing harms, causes, and hazards specifically from the deployer's perspective, we find that reputational damage constitutes the predominant organizational harm, while misinformation emerges as the most common hazard category. We empirically evaluate state-of-the-art guardrails and content moderation systems to probe whether such systems would have prevented the incidents, revealing a significant gap in the protection of AI applications.

[618] The Lyme Disease Controversy: An AI-Driven Discourse Analysis of a Quarter Century of Academic Debate and Divides

Teo Susnjak,Cole Palffy,Tatiana Zimina,Nazgul Altynbekova,Kunal Garg,Leona Gilbert

Main category: cs.CY

TLDR: 本文首次采用混合AI方法系统分析了25年来关于慢性莱姆病（CLD）和治疗后莱姆病综合征（PTLDS）的学术讨论，揭示了从感染模型向免疫介导解释的转变。

Details

Motivation: 研究旨在量化莱姆病研究中的认知变化，探索争议性医学领域的演变。 Method: 结合大型语言模型（LLMs）与结构化人工验证，分析数千篇学术摘要。 Result: 发现莱姆病解释从感染模型逐步转向免疫介导模型。 Conclusion: 研究为争议性医学领域的分析提供了可扩展的方法，并展示了AI辅助方法的价值。 Abstract: The scientific discourse surrounding Chronic Lyme Disease (CLD) and Post-Treatment Lyme Disease Syndrome (PTLDS) has evolved over the past twenty-five years into a complex and polarised debate, shaped by shifting research priorities, institutional influences, and competing explanatory models. This study presents the first large-scale, systematic examination of this discourse using an innovative hybrid AI-driven methodology, combining large language models with structured human validation to analyse thousands of scholarly abstracts spanning 25 years. By integrating Large Language Models (LLMs) with expert oversight, we developed a quantitative framework for tracking epistemic shifts in contested medical fields, with applications to other content analysis domains. Our analysis revealed a progressive transition from infection-based models of Lyme disease to immune-mediated explanations for persistent symptoms. This study offers new empirical insights into the structural and epistemic forces shaping Lyme disease research, providing a scalable and replicable methodology for analysing discourse, while underscoring the value of AI-assisted methodologies in social science and medical research.

[619] Estimating Item Difficulty Using Large Language Models and Tree-Based Machine Learning Algorithms

Pooya Razavi,Sonya J. Powers

Main category: cs.CY

TLDR: 研究探讨了使用大型语言模型（LLM）预测K-5数学和阅读评估题目难度的可行性，比较了直接估计和基于特征的方法，发现后者更准确。

Details

Motivation: 传统题目难度测试资源密集且耗时，需要开发基于题目内容的大规模预测方法。 Method: 采用两种方法：直接LLM估计和基于特征的集成树模型（随机森林和梯度提升）。 Result: 直接LLM估计与真实难度相关性中等至强，但年级差异明显；基于特征的方法预测更准确（r=0.87）。 Conclusion: LLM有望简化题目开发，减少对实地测试的依赖，特征提取是关键。提供了七步实施流程。 Abstract: Estimating item difficulty through field-testing is often resource-intensive and time-consuming. As such, there is strong motivation to develop methods that can predict item difficulty at scale using only the item content. Large Language Models (LLMs) represent a new frontier for this goal. The present research examines the feasibility of using an LLM to predict item difficulty for K-5 mathematics and reading assessment items (N = 5170). Two estimation approaches were implemented: (a) a direct estimation method that prompted the LLM to assign a single difficulty rating to each item, and (b) a feature-based strategy where the LLM extracted multiple cognitive and linguistic features, which were then used in ensemble tree-based models (random forests and gradient boosting) to predict difficulty. Overall, direct LLM estimates showed moderate to strong correlations with true item difficulties. However, their accuracy varied by grade level, often performing worse for early grades. In contrast, the feature-based method yielded stronger predictive accuracy, with correlations as high as r = 0.87 and lower error estimates compared to both direct LLM predictions and baseline regressors. These findings highlight the promise of LLMs in streamlining item development and reducing reliance on extensive field testing and underscore the importance of structured feature extraction. We provide a seven-step workflow for testing professionals who would want to implement a similar item difficulty estimation approach with their item pool.

[620] AI-University: An LLM-based platform for instructional alignment to scientific classrooms

Mostafa Faghih Shojaei,Rahul Gulati,Benjamin A. Jasperson,Shangshang Wang,Simone Cimolato,Dangli Cao,Willie Neiswanger,Krishna Garikipati

Main category: cs.CY

TLDR: AI-U是一个灵活的AI驱动课程内容交付框架，通过微调大语言模型（LLM）和检索增强生成（RAG）技术，生成与教师教学风格一致的响应。

Details

Motivation: 为高等教育提供可扩展的AI辅助教育解决方案，适应不同教师的教学风格。 Method: 使用低秩适应（LoRA）微调开源LLM，并通过RAG优化响应，结合课程材料构建训练数据。 Result: 评估显示AI-U与课程材料高度一致，专家模型在86%的测试案例中表现优于基准模型。 Conclusion: AI-U为AI辅助教育提供了可扩展的方法，有望在高等教育中广泛应用。 Abstract: We introduce AI University (AI-U), a flexible framework for AI-driven course content delivery that adapts to instructors' teaching styles. At its core, AI-U fine-tunes a large language model (LLM) with retrieval-augmented generation (RAG) to generate instructor-aligned responses from lecture videos, notes, and textbooks. Using a graduate-level finite-element-method (FEM) course as a case study, we present a scalable pipeline to systematically construct training data, fine-tune an open-source LLM with Low-Rank Adaptation (LoRA), and optimize its responses through RAG-based synthesis. Our evaluation - combining cosine similarity, LLM-based assessment, and expert review - demonstrates strong alignment with course materials. We also have developed a prototype web application, available at https://my-ai-university.com, that enhances traceability by linking AI-generated responses to specific sections of the relevant course material and time-stamped instances of the open-access video lectures. Our expert model is found to have greater cosine similarity with a reference on 86% of test cases. An LLM judge also found our expert model to outperform the base Llama 3.2 model approximately four times out of five. AI-U offers a scalable approach to AI-assisted education, paving the way for broader adoption in higher education. Here, our framework has been presented in the setting of a class on FEM - a subject that is central to training PhD and Master students in engineering science. However, this setting is a particular instance of a broader context: fine-tuning LLMs to research content in science.

[621] Assessing Judging Bias in Large Reasoning Models: An Empirical Study

Qian Wang,Zhanzhi Lou,Zhenheng Tang,Nuo Chen,Xuandong Zhao,Wenxuan Zhang,Dawn Song,Bingsheng He

Main category: cs.CY

TLDR: 论文研究了大型推理模型（LRMs）在LLM-as-a-judge设置中的偏见问题，发现LRMs虽具备强大推理能力但仍易受多种偏见影响，并提出了三种缓解策略。

Details

Motivation: 探讨LRMs在作为自动评判者时的偏见问题，以提升其可靠性。 Method: 通过综合基准测试比较LLMs和LRMs在主观偏好对齐和客观事实数据集上的偏见表现，并设计三种缓解策略（系统提示、上下文学习和自我反思机制）进行评估。 Result: 发现LRMs易受多种偏见影响，但在事实数据集上表现更稳健；提出的缓解策略在偏好和事实任务中均有效，其中自我反思对LRMs效果显著。 Conclusion: 研究为开发更可靠的LLM-as-a-judge框架提供了关键见解，尤其适用于LRMs作为自动评判者的场景。 Abstract: Large Reasoning Models (LRMs) like DeepSeek-R1 and OpenAI-o1 have demonstrated remarkable reasoning capabilities, raising important questions about their biases in LLM-as-a-judge settings. We present a comprehensive benchmark comparing judging biases between LLMs and LRMs across both subjective preference-alignment datasets and objective fact-based datasets. Through investigation of bandwagon, authority, position, and distraction biases, we uncover four key findings: (1) despite their advanced reasoning capabilities, LRMs remain susceptible to the above biases; (2) LRMs demonstrate better robustness than LLMs specifically on fact-related datasets; (3) LRMs exhibit notable position bias, preferring options in later positions; and (4) we identify a novel "superficial reflection bias" where phrases mimicking reasoning (e.g., "wait, let me think...") significantly influence model judgments. To address these biases, we design and evaluate three mitigation strategies: specialized system prompts that reduce judging biases by up to 19\% in preference alignment datasets and 14\% in fact-related datasets, in-context learning that provides up to 27\% improvement on preference tasks but shows inconsistent results on factual tasks, and a self-reflection mechanism that reduces biases by up to 10\% in preference datasets and 16\% in fact-related datasets, with self-reflection proving particularly effective for LRMs. Our work provides crucial insights for developing more reliable LLM-as-a-Judge frameworks, especially as LRMs become increasingly deployed as automated judges.

[622] RealHarm: A Collection of Real-World Language Model Application Failures

Pierre Le Jeune,Jiaen Liu,Luca Rossi,Matteo Dora

Main category: cs.CY

TLDR: 论文介绍了RealHarm数据集，分析了AI代理在现实中的问题交互，发现声誉损害是主要组织危害，而错误信息是最常见的危害类别。现有防护系统存在显著不足。

Details

Motivation: 现有研究多基于理论分析，缺乏对现实世界中AI应用失败模式的实证研究。 Method: 通过系统审查公开报道的事件，构建RealHarm数据集，分析危害、原因及部署者视角的问题。 Result: 声誉损害是主要组织危害，错误信息是最常见危害类别；现有防护系统未能有效预防问题。 Conclusion: AI应用的防护系统需改进以应对现实中的危害和风险。 Abstract: Language model deployments in consumer-facing applications introduce numerous risks. While existing research on harms and hazards of such applications follows top-down approaches derived from regulatory frameworks and theoretical analyses, empirical evidence of real-world failure modes remains underexplored. In this work, we introduce RealHarm, a dataset of annotated problematic interactions with AI agents built from a systematic review of publicly reported incidents. Analyzing harms, causes, and hazards specifically from the deployer's perspective, we find that reputational damage constitutes the predominant organizational harm, while misinformation emerges as the most common hazard category. We empirically evaluate state-of-the-art guardrails and content moderation systems to probe whether such systems would have prevented the incidents, revealing a significant gap in the protection of AI applications.

cs.SD [Back]

[623] Spatial Audio Processing with Large Language Model on Wearable Devices

Ayushi Mishra,Yang Bai,Priyadarshan Narayanasamy,Nakul Garg,Nirupam Roy

Main category: cs.SD

TLDR: 提出了一种将空间语音理解融入大语言模型（LLM）的系统架构，通过单声道麦克风实现方向感知，显著提升了空间语音识别性能。

Details

Motivation: 通过将空间上下文整合到LLM中，革新可穿戴设备的人机交互，解决现有技术在空间语音理解上的不足。 Method: 利用微结构空间传感提取方向信息，结合Whisper模型的语音嵌入，通过LoRA轻量适配技术优化LLaMA-3.2 3B模型。 Result: 空间感知ASR的平均误差为25.72°，显著优于现有技术（88.52°），词错误率5.3%；支持最多5人的声源定位，中位误差16°。 Conclusion: 该系统在空间语音理解上表现优异，同时兼顾能效、隐私和硬件限制，为增强现实和沉浸式体验开辟了新途径。 Abstract: Integrating spatial context into large language models (LLMs) has the potential to revolutionize human-computer interaction, particularly in wearable devices. In this work, we present a novel system architecture that incorporates spatial speech understanding into LLMs, enabling contextually aware and adaptive applications for wearable technologies. Our approach leverages microstructure-based spatial sensing to extract precise Direction of Arrival (DoA) information using a monaural microphone. To address the lack of existing dataset for microstructure-assisted speech recordings, we synthetically create a dataset called OmniTalk by using the LibriSpeech dataset. This spatial information is fused with linguistic embeddings from OpenAI's Whisper model, allowing each modality to learn complementary contextual representations. The fused embeddings are aligned with the input space of LLaMA-3.2 3B model and fine-tuned with lightweight adaptation technique LoRA to optimize for on-device processing. SING supports spatially-aware automatic speech recognition (ASR), achieving a mean error of $25.72^\circ$-a substantial improvement compared to the 88.52$^\circ$ median error in existing work-with a word error rate (WER) of 5.3. SING also supports soundscaping, for example, inference how many people were talking and their directions, with up to 5 people and a median DoA error of 16$^\circ$. Our system demonstrates superior performance in spatial speech understanding while addressing the challenges of power efficiency, privacy, and hardware constraints, paving the way for advanced applications in augmented reality, accessibility, and immersive experiences.

[624] FSSUAVL: A Discriminative Framework using Vision Models for Federated Self-Supervised Audio and Image Understanding

Yasar Abbas Ur Rehman,Kin Wai Lau,Yuyang Xie,Ma Lan,JiaJun Shen

Main category: cs.SD

TLDR: 论文提出了一种名为FSSUAVL的单一深度模型，通过自监督对比学习在联邦学习中预训练，解决了未配对音频和图像模态的学习问题。

Details

Motivation: 现有方法在处理未配对多模态数据时依赖辅助预训练编码器或生成模型，计算成本高且不适用于联邦学习场景。 Method: FSSUAVL通过自监督对比学习将音频和图像投影到共同嵌入空间，联合判别两种模态，适用于配对和未配对任务。 Result: 实验表明，FSSUAVL在CNN和ViT上显著提升了图像和音频下游任务的性能，优于单独模型。 Conclusion: FSSUAVL能够学习多模态特征表示，并可整合辅助信息以提高识别精度。 Abstract: Recent studies have demonstrated that vision models can effectively learn multimodal audio-image representations when paired. However, the challenge of enabling deep models to learn representations from unpaired modalities remains unresolved. This issue is especially pertinent in scenarios like Federated Learning (FL), where data is often decentralized, heterogeneous, and lacks a reliable guarantee of paired data. Previous attempts tackled this issue through the use of auxiliary pretrained encoders or generative models on local clients, which invariably raise computational cost with increasing number modalities. Unlike these approaches, in this paper, we aim to address the task of unpaired audio and image recognition using \texttt{FSSUAVL}, a single deep model pretrained in FL with self-supervised contrastive learning (SSL). Instead of aligning the audio and image modalities, \texttt{FSSUAVL} jointly discriminates them by projecting them into a common embedding space using contrastive SSL. This extends the utility of \texttt{FSSUAVL} to paired and unpaired audio and image recognition tasks. Our experiments with CNN and ViT demonstrate that \texttt{FSSUAVL} significantly improves performance across various image- and audio-based downstream tasks compared to using separate deep models for each modality. Additionally, \texttt{FSSUAVL}'s capacity to learn multimodal feature representations allows for integrating auxiliary information, if available, to enhance recognition accuracy.

[625] Separate to Collaborate: Dual-Stream Diffusion Model for Coordinated Piano Hand Motion Synthesis

Zihao Liu,Mingwen Ou,Zunnan Xu,Jiaqi Huang,Haonan Han,Ronghui Li,Xiu Li

Main category: cs.SD

TLDR: 提出一种双流神经网络框架，用于从音频输入生成同步的钢琴演奏手势，解决手部独立性与协调性建模的挑战。

Details

Motivation: 自动化合成协调的双钢琴演奏手势存在挑战，尤其是捕捉手部间的复杂协调性同时保持各自的运动特征。 Method: 采用双流扩散生成框架，分别建模每只手的运动，并通过HCAA机制抑制对称噪声，增强手部协调性。 Result: 在多项指标上优于现有最先进方法。 Conclusion: 该框架有效解决了钢琴演奏中手部独立与协调的建模问题。 Abstract: Automating the synthesis of coordinated bimanual piano performances poses significant challenges, particularly in capturing the intricate choreography between the hands while preserving their distinct kinematic signatures. In this paper, we propose a dual-stream neural framework designed to generate synchronized hand gestures for piano playing from audio input, addressing the critical challenge of modeling both hand independence and coordination. Our framework introduces two key innovations: (i) a decoupled diffusion-based generation framework that independently models each hand's motion via dual-noise initialization, sampling distinct latent noise for each while leveraging a shared positional condition, and (ii) a Hand-Coordinated Asymmetric Attention (HCAA) mechanism suppresses symmetric (common-mode) noise to highlight asymmetric hand-specific features, while adaptively enhancing inter-hand coordination during denoising. The system operates hierarchically: it first predicts 3D hand positions from audio features and then generates joint angles through position-aware diffusion models, where parallel denoising streams interact via HCAA. Comprehensive evaluations demonstrate that our framework outperforms existing state-of-the-art methods across multiple metrics.

[626] Spatial Audio Processing with Large Language Model on Wearable Devices

Ayushi Mishra,Yang Bai,Priyadarshan Narayanasamy,Nakul Garg,Nirupam Roy

Main category: cs.SD

TLDR: 论文提出了一种将空间语音理解整合到大型语言模型（LLM）中的新系统架构，通过微结构空间感知和合成数据集OmniTalk，显著提升了空间感知自动语音识别（ASR）的性能。

Details

Motivation: 通过将空间上下文整合到LLM中，提升可穿戴设备的人机交互能力，解决现有技术在空间语音理解上的不足。 Method: 利用微结构空间感知提取方向信息，结合Whisper模型的语音嵌入，通过LoRA轻量适配技术优化LLaMA-3.2 3B模型。 Result: 系统在空间ASR中平均误差为25.72°，显著优于现有技术（88.52°），词错误率（WER）为5.3%，并支持多人和方向推断。 Conclusion: 该系统在空间语音理解上表现优异，同时解决了能效、隐私和硬件限制问题，为增强现实和沉浸式体验等应用铺平了道路。 Abstract: Integrating spatial context into large language models (LLMs) has the potential to revolutionize human-computer interaction, particularly in wearable devices. In this work, we present a novel system architecture that incorporates spatial speech understanding into LLMs, enabling contextually aware and adaptive applications for wearable technologies. Our approach leverages microstructure-based spatial sensing to extract precise Direction of Arrival (DoA) information using a monaural microphone. To address the lack of existing dataset for microstructure-assisted speech recordings, we synthetically create a dataset called OmniTalk by using the LibriSpeech dataset. This spatial information is fused with linguistic embeddings from OpenAI's Whisper model, allowing each modality to learn complementary contextual representations. The fused embeddings are aligned with the input space of LLaMA-3.2 3B model and fine-tuned with lightweight adaptation technique LoRA to optimize for on-device processing. SING supports spatially-aware automatic speech recognition (ASR), achieving a mean error of $25.72^\circ$-a substantial improvement compared to the 88.52$^\circ$ median error in existing work-with a word error rate (WER) of 5.3. SING also supports soundscaping, for example, inference how many people were talking and their directions, with up to 5 people and a median DoA error of 16$^\circ$. Our system demonstrates superior performance in spatial speech understanding while addressing the challenges of power efficiency, privacy, and hardware constraints, paving the way for advanced applications in augmented reality, accessibility, and immersive experiences.

[627] FSSUAVL: A Discriminative Framework using Vision Models for Federated Self-Supervised Audio and Image Understanding

Yasar Abbas Ur Rehman,Kin Wai Lau,Yuyang Xie,Ma Lan,JiaJun Shen

Main category: cs.SD

TLDR: 论文提出了一种名为FSSUAVL的单一深度模型，通过自监督对比学习在联邦学习中预训练，解决了未配对音频和图像模态的学习问题。

Details

Motivation: 现有方法在处理未配对多模态数据时依赖辅助预训练编码器或生成模型，计算成本高且不适用于联邦学习场景。 Method: FSSUAVL通过自监督对比学习将音频和图像投影到共同的嵌入空间，联合判别它们，适用于配对和未配对任务。 Result: 实验表明，FSSUAVL在CNN和ViT上显著提升了图像和音频下游任务的性能，优于单独模型。 Conclusion: FSSUAVL能够学习多模态特征表示，并可整合辅助信息以提高识别精度。 Abstract: Recent studies have demonstrated that vision models can effectively learn multimodal audio-image representations when paired. However, the challenge of enabling deep models to learn representations from unpaired modalities remains unresolved. This issue is especially pertinent in scenarios like Federated Learning (FL), where data is often decentralized, heterogeneous, and lacks a reliable guarantee of paired data. Previous attempts tackled this issue through the use of auxiliary pretrained encoders or generative models on local clients, which invariably raise computational cost with increasing number modalities. Unlike these approaches, in this paper, we aim to address the task of unpaired audio and image recognition using \texttt{FSSUAVL}, a single deep model pretrained in FL with self-supervised contrastive learning (SSL). Instead of aligning the audio and image modalities, \texttt{FSSUAVL} jointly discriminates them by projecting them into a common embedding space using contrastive SSL. This extends the utility of \texttt{FSSUAVL} to paired and unpaired audio and image recognition tasks. Our experiments with CNN and ViT demonstrate that \texttt{FSSUAVL} significantly improves performance across various image- and audio-based downstream tasks compared to using separate deep models for each modality. Additionally, \texttt{FSSUAVL}'s capacity to learn multimodal feature representations allows for integrating auxiliary information, if available, to enhance recognition accuracy.

[628] Separate to Collaborate: Dual-Stream Diffusion Model for Coordinated Piano Hand Motion Synthesis

Zihao Liu,Mingwen Ou,Zunnan Xu,Jiaqi Huang,Haonan Han,Ronghui Li,Xiu Li

Main category: cs.SD

TLDR: 提出了一种双流神经网络框架，用于从音频输入生成钢琴演奏的同步手势，解决了手部独立性和协调性建模的关键挑战。

Details

Motivation: 自动化合成协调的双钢琴演奏具有挑战性，尤其是捕捉双手之间的复杂编排并保留其独特的运动特征。 Method: 采用双流神经网络框架，包括解耦的扩散生成框架和手部协调非对称注意力机制（HCAA），通过分层方式生成3D手部位置和关节角度。 Result: 综合评估表明，该框架在多个指标上优于现有最先进方法。 Conclusion: 该框架成功解决了钢琴演奏中双手独立性和协调性的建模问题，并显著提升了生成性能。 Abstract: Automating the synthesis of coordinated bimanual piano performances poses significant challenges, particularly in capturing the intricate choreography between the hands while preserving their distinct kinematic signatures. In this paper, we propose a dual-stream neural framework designed to generate synchronized hand gestures for piano playing from audio input, addressing the critical challenge of modeling both hand independence and coordination. Our framework introduces two key innovations: (i) a decoupled diffusion-based generation framework that independently models each hand's motion via dual-noise initialization, sampling distinct latent noise for each while leveraging a shared positional condition, and (ii) a Hand-Coordinated Asymmetric Attention (HCAA) mechanism suppresses symmetric (common-mode) noise to highlight asymmetric hand-specific features, while adaptively enhancing inter-hand coordination during denoising. The system operates hierarchically: it first predicts 3D hand positions from audio features and then generates joint angles through position-aware diffusion models, where parallel denoising streams interact via HCAA. Comprehensive evaluations demonstrate that our framework outperforms existing state-of-the-art methods across multiple metrics.

eess.SY [Back]

[629] A 10.8mW Mixed-Signal Simulated Bifurcation Ising Solver using SRAM Compute-In-Memory with 0.6us Time-to-Solution

Alana Marie Dee,Sajjad Moazeni

Main category: eess.SY

TLDR: 本文提出了一种基于CMOS的模拟分岔（SB）伊辛求解器，用于解决NP难优化问题，通过模拟域计算和SRAM内存计算技术实现了高效性能。

Details

Motivation: 组合优化问题在多个领域至关重要，但传统方法效率不足，需要更高效的解决方案。 Method: 采用模拟域计算和SRAM内存计算技术，设计了10-T SRAM单元以实现三元乘法，并注入最优衰减噪声。 Result: 在60节点、50%密度的随机二元MAXCUT图上，求解器在0.6微秒内达到93%以上的基态解，平均功耗10.8mW。 Conclusion: 该芯片在时间和功耗上比现有伊辛求解器提升了一个数量级，展示了模拟域计算的优势。 Abstract: Combinatorial optimization problems are funda- mental for various fields ranging from finance to wireless net- works. This work presents a simulated bifurcation (SB) Ising solver in CMOS for NP-hard optimization problems. Analog domain computing led to a superior implementation of this algorithm as inherent and injected noise is required in SB Ising solvers. The architecture novelties include the use of SRAM compute-in-memory (CIM) to accelerate bifurcation as well as the generation and injection of optimal decaying noise in the analog domain. We propose a novel 10-T SRAM cell capable of performing ternary multiplication. When measured with 60- node, 50% density, random, binary MAXCUT graphs, this all- to-all connected Ising solver reliably achieves above 93% of the ground state solution in 0.6us with 10.8mW average power in TSMC 180nm CMOS. Our chip achieves an order of magnitude improvement in time-to-solution and power compared to previously proposed Ising solvers in CMOS and other platforms.

[630] A 10.8mW Mixed-Signal Simulated Bifurcation Ising Solver using SRAM Compute-In-Memory with 0.6us Time-to-Solution

Alana Marie Dee,Sajjad Moazeni

Main category: eess.SY

TLDR: 本文提出了一种基于CMOS的模拟分岔（SB）伊辛求解器，用于解决NP难优化问题，通过模拟域计算和SRAM内存计算（CIM）加速分岔过程，并注入最优衰减噪声。

Details

Motivation: 组合优化问题在金融和无线网络等领域具有重要意义，但传统方法效率不足，需要更高效的解决方案。 Method: 采用10-T SRAM单元实现三元乘法，利用模拟域计算和SRAM CIM加速分岔，并注入最优衰减噪声。 Result: 在60节点、50%密度的随机二元MAXCUT图上，该求解器在0.6微秒内以10.8mW平均功耗实现了93%以上的基态解。 Conclusion: 该芯片在求解时间和功耗上比现有伊辛求解器提升了一个数量级，展示了模拟域计算在优化问题中的潜力。 Abstract: Combinatorial optimization problems are funda- mental for various fields ranging from finance to wireless net- works. This work presents a simulated bifurcation (SB) Ising solver in CMOS for NP-hard optimization problems. Analog domain computing led to a superior implementation of this algorithm as inherent and injected noise is required in SB Ising solvers. The architecture novelties include the use of SRAM compute-in-memory (CIM) to accelerate bifurcation as well as the generation and injection of optimal decaying noise in the analog domain. We propose a novel 10-T SRAM cell capable of performing ternary multiplication. When measured with 60- node, 50% density, random, binary MAXCUT graphs, this all- to-all connected Ising solver reliably achieves above 93% of the ground state solution in 0.6us with 10.8mW average power in TSMC 180nm CMOS. Our chip achieves an order of magnitude improvement in time-to-solution and power compared to previously proposed Ising solvers in CMOS and other platforms.

cs.RO [Back]

[631] Joint Action Language Modelling for Transparent Policy Execution

Theodor Wulff,Rahul Singh Maharjan,Xinyun Chi,Angelo Cangelosi

Main category: cs.RO

TLDR: 论文提出一种通过自然语言描述动作的透明代理学习方法，将策略学习转化为语言生成问题，并结合自回归建模，以提升代理行为的透明性。

Details

Motivation: 解决代理策略的黑盒问题，通过自然语言描述动作来增加代理行为的透明度。 Method: 将策略学习转化为语言生成问题，结合自回归建模，生成自然语言描述和动作令牌。 Result: 在Language-Table环境中，模型能同时生成高质量的动作轨迹和透明语言描述。 Conclusion: 同时生成动作和语言描述能提升两者的质量，验证了透明代理学习的可行性。 Abstract: An agent's intention often remains hidden behind the black-box nature of embodied policies. Communication using natural language statements that describe the next action can provide transparency towards the agent's behavior. We aim to insert transparent behavior directly into the learning process, by transforming the problem of policy learning into a language generation problem and combining it with traditional autoregressive modelling. The resulting model produces transparent natural language statements followed by tokens representing the specific actions to solve long-horizon tasks in the Language-Table environment. Following previous work, the model is able to learn to produce a policy represented by special discretized tokens in an autoregressive manner. We place special emphasis on investigating the relationship between predicting actions and producing high-quality language for a transparent agent. We find that in many cases both the quality of the action trajectory and the transparent statement increase when they are generated simultaneously.

Yiming Zeng,Hao Ren,Shuhang Wang,Junlong Huang,Hui Cheng

Main category: cs.RO

TLDR: 提出了一种结合学习方法和经典方法的混合视觉导航方法，通过条件扩散模型和梯度优化实现零样本迁移，提高了成功率和减少了碰撞。

Details

Motivation: 解决传统几何方法的多模块设计和学习方法的泛化能力不足问题。 Method: 训练条件扩散模型，结合可微分场景和任务级成本梯度，生成满足约束的路径。 Result: 在室内外模拟和真实场景中实现零样本迁移，成功率和碰撞率优于基线方法。 Conclusion: 混合方法提供了即插即用的解决方案，无需重新训练，适用于多样化环境。 Abstract: Visual navigation, a fundamental challenge in mobile robotics, demands versatile policies to handle diverse environments. Classical methods leverage geometric solutions to minimize specific costs, offering adaptability to new scenarios but are prone to system errors due to their multi-modular design and reliance on hand-crafted rules. Learning-based methods, while achieving high planning success rates, face difficulties in generalizing to unseen environments beyond the training data and often require extensive training. To address these limitations, we propose a hybrid approach that combines the strengths of learning-based methods and classical approaches for RGB-only visual navigation. Our method first trains a conditional diffusion model on diverse path-RGB observation pairs. During inference, it integrates the gradients of differentiable scene-specific and task-level costs, guiding the diffusion model to generate valid paths that meet the constraints. This approach alleviates the need for retraining, offering a plug-and-play solution. Extensive experiments in both indoor and outdoor settings, across simulated and real-world scenarios, demonstrate zero-shot transfer capability of our approach, achieving higher success rates and fewer collisions compared to baseline methods. Code will be released at https://github.com/SYSU-RoboticsLab/NaviD.

Hao Ren,Yiming Zeng,Zetong Bi,Zhaoliang Wan,Junlong Huang,Hui Cheng

Main category: cs.RO

TLDR: 论文提出了一种名为NaviBridger的新方法，利用扩散桥模型改进视觉导航中的动作生成，提高了效率和准确性。

Details

Motivation: 传统扩散策略从高斯噪声开始生成动作序列，但目标动作分布与高斯噪声差异大，导致冗余步骤和学习复杂度高。 Method: 提出NaviBridger框架，利用扩散桥模型从信息丰富的先验动作开始生成动作，优化去噪过程。 Result: 实验表明，NaviBridger在模拟和真实场景中均加速推理并优于基线方法。 Conclusion: NaviBridger通过扩散桥模型显著提升了视觉导航任务的性能。 Abstract: Recent advancements in diffusion-based imitation learning, which show impressive performance in modeling multimodal distributions and training stability, have led to substantial progress in various robot learning tasks. In visual navigation, previous diffusion-based policies typically generate action sequences by initiating from denoising Gaussian noise. However, the target action distribution often diverges significantly from Gaussian noise, leading to redundant denoising steps and increased learning complexity. Additionally, the sparsity of effective action distributions makes it challenging for the policy to generate accurate actions without guidance. To address these issues, we propose a novel, unified visual navigation framework leveraging the denoising diffusion bridge models named NaviBridger. This approach enables action generation by initiating from any informative prior actions, enhancing guidance and efficiency in the denoising process. We explore how diffusion bridges can enhance imitation learning in visual navigation tasks and further examine three source policies for generating prior actions. Extensive experiments in both simulated and real-world indoor and outdoor scenarios demonstrate that NaviBridger accelerates policy inference and outperforms the baselines in generating target action sequences. Code is available at https://github.com/hren20/NaiviBridger.

[634] Joint Action Language Modelling for Transparent Policy Execution

Theodor Wulff,Rahul Singh Maharjan,Xinyun Chi,Angelo Cangelosi

Main category: cs.RO

TLDR: 论文提出一种方法，通过将策略学习转化为语言生成问题，结合自回归建模，生成透明自然语言描述和动作令牌，以解决长时任务。

Details

Motivation: 解决智能体行为不透明的问题，通过自然语言描述提供行为透明度。 Method: 将策略学习转化为语言生成问题，结合自回归建模，生成自然语言和动作令牌。 Result: 同时生成动作轨迹和透明描述时，两者的质量均有所提升。 Conclusion: 通过语言生成和动作预测的结合，可以同时提高行为透明度和任务表现。 Abstract: An agent's intention often remains hidden behind the black-box nature of embodied policies. Communication using natural language statements that describe the next action can provide transparency towards the agent's behavior. We aim to insert transparent behavior directly into the learning process, by transforming the problem of policy learning into a language generation problem and combining it with traditional autoregressive modelling. The resulting model produces transparent natural language statements followed by tokens representing the specific actions to solve long-horizon tasks in the Language-Table environment. Following previous work, the model is able to learn to produce a policy represented by special discretized tokens in an autoregressive manner. We place special emphasis on investigating the relationship between predicting actions and producing high-quality language for a transparent agent. We find that in many cases both the quality of the action trajectory and the transparent statement increase when they are generated simultaneously.

Yiming Zeng,Hao Ren,Shuhang Wang,Junlong Huang,Hui Cheng

Main category: cs.RO

TLDR: 提出了一种结合学习方法和经典方法的混合视觉导航方法，通过条件扩散模型和可微分成本梯度生成有效路径，实现零样本迁移。

Details

Motivation: 解决传统几何方法易出错和学习方法泛化能力差的问题。 Method: 训练条件扩散模型，结合可微分场景和任务成本梯度生成路径。 Result: 在模拟和真实场景中实现更高成功率和更少碰撞。 Conclusion: 混合方法在RGB视觉导航中表现出色，无需重新训练。 Abstract: Visual navigation, a fundamental challenge in mobile robotics, demands versatile policies to handle diverse environments. Classical methods leverage geometric solutions to minimize specific costs, offering adaptability to new scenarios but are prone to system errors due to their multi-modular design and reliance on hand-crafted rules. Learning-based methods, while achieving high planning success rates, face difficulties in generalizing to unseen environments beyond the training data and often require extensive training. To address these limitations, we propose a hybrid approach that combines the strengths of learning-based methods and classical approaches for RGB-only visual navigation. Our method first trains a conditional diffusion model on diverse path-RGB observation pairs. During inference, it integrates the gradients of differentiable scene-specific and task-level costs, guiding the diffusion model to generate valid paths that meet the constraints. This approach alleviates the need for retraining, offering a plug-and-play solution. Extensive experiments in both indoor and outdoor settings, across simulated and real-world scenarios, demonstrate zero-shot transfer capability of our approach, achieving higher success rates and fewer collisions compared to baseline methods. Code will be released at https://github.com/SYSU-RoboticsLab/NaviD.

Hao Ren,Yiming Zeng,Zetong Bi,Zhaoliang Wan,Junlong Huang,Hui Cheng

Main category: cs.RO

TLDR: 论文提出了一种基于扩散桥模型的新视觉导航框架NaviBridger，通过从信息丰富的先验动作开始生成动作，提高了去噪过程的效率和准确性。

Details

Motivation: 现有扩散策略从高斯噪声开始生成动作序列，但目标动作分布与高斯噪声差异大，导致冗余去噪步骤和学习复杂性增加。此外，动作分布稀疏性使得策略难以生成准确动作。 Method: 提出NaviBridger框架，利用扩散桥模型从先验动作开始生成动作，并探索了三种生成先验动作的策略。 Result: 在模拟和真实场景的实验中，NaviBridger加速了策略推理，并在生成目标动作序列方面优于基线方法。 Conclusion: NaviBridger通过引入扩散桥模型和先验动作，显著提升了视觉导航任务的效率和性能。 Abstract: Recent advancements in diffusion-based imitation learning, which show impressive performance in modeling multimodal distributions and training stability, have led to substantial progress in various robot learning tasks. In visual navigation, previous diffusion-based policies typically generate action sequences by initiating from denoising Gaussian noise. However, the target action distribution often diverges significantly from Gaussian noise, leading to redundant denoising steps and increased learning complexity. Additionally, the sparsity of effective action distributions makes it challenging for the policy to generate accurate actions without guidance. To address these issues, we propose a novel, unified visual navigation framework leveraging the denoising diffusion bridge models named NaviBridger. This approach enables action generation by initiating from any informative prior actions, enhancing guidance and efficiency in the denoising process. We explore how diffusion bridges can enhance imitation learning in visual navigation tasks and further examine three source policies for generating prior actions. Extensive experiments in both simulated and real-world indoor and outdoor scenarios demonstrate that NaviBridger accelerates policy inference and outperforms the baselines in generating target action sequences. Code is available at https://github.com/hren20/NaiviBridger.

cond-mat.mtrl-sci [Back]

[637] Zero-shot Autonomous Microscopy for Scalable and Intelligent Characterization of 2D Materials

Jingyun Yang,Ruoyan Avery Yin,Chi Jiang,Yuepeng Hu,Xiaokai Zhu,Xingjian Hu,Sutharsika Kumar,Xiao Wang,Xiaohua Zhai,Keran Rong,Yunyue Zhu,Tianyi Zhang,Zongyou Yin,Jing Kong,Neil Zhenqiang Gong,Zhichu Ren,Haozhe Wang

Main category: cond-mat.mtrl-sci

TLDR: ATOMIC框架通过集成基础模型（如Segment Anything Model和ChatGPT），实现了对二维材料的全自动、零样本表征，准确率高达99.7%。

Details

Motivation: 传统原子级材料表征依赖专家且耗时，而新材料的表征更具挑战性，因此需要无需大量训练数据的自主实验系统。 Method: ATOMIC结合视觉基础模型、大语言模型、无监督聚类和拓扑分析，通过提示工程自动化显微镜控制、样本扫描、图像分割和智能分析。 Result: 系统在MoS2样本中实现99.7%的分割准确率，并能检测人眼难以识别的晶界缺陷，且对变量条件（如失焦、色温波动）保持鲁棒性。 Conclusion: ATOMIC为纳米材料研究提供了可扩展且数据高效的表征范式，标志着基础模型在自主分析中的成功应用。 Abstract: Characterization of atomic-scale materials traditionally requires human experts with months to years of specialized training. Even for trained human operators, accurate and reliable characterization remains challenging when examining newly discovered materials such as two-dimensional (2D) structures. This bottleneck drives demand for fully autonomous experimentation systems capable of comprehending research objectives without requiring large training datasets. In this work, we present ATOMIC (Autonomous Technology for Optical Microscopy & Intelligent Characterization), an end-to-end framework that integrates foundation models to enable fully autonomous, zero-shot characterization of 2D materials. Our system integrates the vision foundation model (i.e., Segment Anything Model), large language models (i.e., ChatGPT), unsupervised clustering, and topological analysis to automate microscope control, sample scanning, image segmentation, and intelligent analysis through prompt engineering, eliminating the need for additional training. When analyzing typical MoS2 samples, our approach achieves 99.7% segmentation accuracy for single layer identification, which is equivalent to that of human experts. In addition, the integrated model is able to detect grain boundary slits that are challenging to identify with human eyes. Furthermore, the system retains robust accuracy despite variable conditions including defocus, color temperature fluctuations, and exposure variations. It is applicable to a broad spectrum of common 2D materials-including graphene, MoS2, WSe2, SnSe-regardless of whether they were fabricated via chemical vapor deposition or mechanical exfoliation. This work represents the implementation of foundation models to achieve autonomous analysis, establishing a scalable and data-efficient characterization paradigm that fundamentally transforms the approach to nanoscale materials research.

[638] Zero-shot Autonomous Microscopy for Scalable and Intelligent Characterization of 2D Materials

Jingyun Yang,Ruoyan Avery Yin,Chi Jiang,Yuepeng Hu,Xiaokai Zhu,Xingjian Hu,Sutharsika Kumar,Xiao Wang,Xiaohua Zhai,Keran Rong,Yunyue Zhu,Tianyi Zhang,Zongyou Yin,Jing Kong,Neil Zhenqiang Gong,Zhichu Ren,Haozhe Wang

Main category: cond-mat.mtrl-sci

TLDR: ATOMIC框架通过集成基础模型（如Segment Anything Model和ChatGPT）实现2D材料的零样本自主表征，无需额外训练，达到与人类专家相当的准确性。

Details

Motivation: 传统原子尺度材料表征依赖专家且耗时，尤其在新型材料（如2D结构）中准确性不足，亟需自主化解决方案。 Method: 结合视觉基础模型、大语言模型、无监督聚类和拓扑分析，通过提示工程自动化显微镜控制、样本扫描、图像分割和智能分析。 Result: 在MoS2样本中实现99.7%的单层分割准确率，并能检测人眼难以识别的晶界缺陷，且对变量条件（如失焦、色温波动）保持稳健。 Conclusion: ATOMIC为纳米材料研究提供了可扩展且数据高效的表征范式，推动了自主分析技术的实现。 Abstract: Characterization of atomic-scale materials traditionally requires human experts with months to years of specialized training. Even for trained human operators, accurate and reliable characterization remains challenging when examining newly discovered materials such as two-dimensional (2D) structures. This bottleneck drives demand for fully autonomous experimentation systems capable of comprehending research objectives without requiring large training datasets. In this work, we present ATOMIC (Autonomous Technology for Optical Microscopy & Intelligent Characterization), an end-to-end framework that integrates foundation models to enable fully autonomous, zero-shot characterization of 2D materials. Our system integrates the vision foundation model (i.e., Segment Anything Model), large language models (i.e., ChatGPT), unsupervised clustering, and topological analysis to automate microscope control, sample scanning, image segmentation, and intelligent analysis through prompt engineering, eliminating the need for additional training. When analyzing typical MoS2 samples, our approach achieves 99.7% segmentation accuracy for single layer identification, which is equivalent to that of human experts. In addition, the integrated model is able to detect grain boundary slits that are challenging to identify with human eyes. Furthermore, the system retains robust accuracy despite variable conditions including defocus, color temperature fluctuations, and exposure variations. It is applicable to a broad spectrum of common 2D materials-including graphene, MoS2, WSe2, SnSe-regardless of whether they were fabricated via chemical vapor deposition or mechanical exfoliation. This work represents the implementation of foundation models to achieve autonomous analysis, establishing a scalable and data-efficient characterization paradigm that fundamentally transforms the approach to nanoscale materials research.

physics.flu-dyn [Back]

[639] Fine-tuning an Large Language Model for Automating Computational Fluid Dynamics Simulations

Zhehao Dong,Zhen Lu,Yue Yang

Main category: physics.flu-dyn

TLDR: 通过微调Qwen2.5-7B-Instruct模型，结合多智能体框架，实现自然语言到CFD配置的自动化转换，显著提升了CFD工作流的效率和准确性。

Details

Motivation: CFD仿真配置需要专业知识，限制了广泛应用。LLMs在科学计算中的应用尚未充分开发，尤其是在CFD领域。 Method: 使用NL2FOAM数据集（28716对自然语言-OpenFOAM配置）微调Qwen2.5-7B-Instruct模型，并采用多智能体框架进行输入验证、配置生成、仿真运行和错误修正。 Result: 在21个流案例的基准测试中，达到88.7%的解决方案准确率和82.6%的首次尝试成功率，优于其他大型通用模型。 Conclusion: 领域特定的LLM适配在复杂工程工作流中具有关键作用，能够显著提升性能和效率。 Abstract: Configuring computational fluid dynamics (CFD) simulations typically demands extensive domain expertise, limiting broader access. Although large language models (LLMs) have advanced scientific computing, their use in automating CFD workflows is underdeveloped. We introduce a novel approach centered on domain-specific LLM adaptation. By fine-tuning Qwen2.5-7B-Instruct on NL2FOAM, our custom dataset of 28716 natural language-to-OpenFOAM configuration pairs with chain-of-thought (CoT) annotations, we enable direct translation from natural language descriptions to executable CFD setups. A multi-agent framework orchestrates the process, autonomously verifying inputs, generating configurations, running simulations, and correcting errors. Evaluation on a benchmark of 21 diverse flow cases demonstrates state-of-the-art performance, achieving 88.7% solution accuracy and 82.6% first-attempt success rate. This significantly outperforms larger general-purpose models like Qwen2.5-72B-Instruct, DeepSeek-R1, and Llama3.3-70B-Instruct, while also requiring fewer correction iterations and maintaining high computational efficiency. The results highlight the critical role of domain-specific adaptation in deploying LLM assistants for complex engineering workflows.

[640] Fine-tuning an Large Language Model for Automating Computational Fluid Dynamics Simulations

Zhehao Dong,Zhen Lu,Yue Yang

Main category: physics.flu-dyn

TLDR: 论文提出了一种基于领域特定LLM的方法，通过微调Qwen2.5-7B-Instruct模型，实现自然语言到CFD配置的自动化转换，显著提升了CFD工作流的效率和准确性。

Details

Motivation: CFD模拟配置需要大量领域专业知识，限制了其广泛应用。尽管LLM在科学计算中有所进展，但在自动化CFD工作流中的应用仍不足。 Method: 通过微调Qwen2.5-7B-Instruct模型，使用包含28716个自然语言到OpenFOAM配置对的NL2FOAM数据集，并采用多智能体框架实现输入验证、配置生成、模拟运行和错误纠正。 Result: 在21个不同流案例的基准测试中，实现了88.7%的解决方案准确率和82.6%的首次尝试成功率，显著优于更大的通用模型。 Conclusion: 领域特定适配在复杂工程工作流中部署LLM助手时具有关键作用。 Abstract: Configuring computational fluid dynamics (CFD) simulations typically demands extensive domain expertise, limiting broader access. Although large language models (LLMs) have advanced scientific computing, their use in automating CFD workflows is underdeveloped. We introduce a novel approach centered on domain-specific LLM adaptation. By fine-tuning Qwen2.5-7B-Instruct on NL2FOAM, our custom dataset of 28716 natural language-to-OpenFOAM configuration pairs with chain-of-thought (CoT) annotations, we enable direct translation from natural language descriptions to executable CFD setups. A multi-agent framework orchestrates the process, autonomously verifying inputs, generating configurations, running simulations, and correcting errors. Evaluation on a benchmark of 21 diverse flow cases demonstrates state-of-the-art performance, achieving 88.7% solution accuracy and 82.6% first-attempt success rate. This significantly outperforms larger general-purpose models like Qwen2.5-72B-Instruct, DeepSeek-R1, and Llama3.3-70B-Instruct, while also requiring fewer correction iterations and maintaining high computational efficiency. The results highlight the critical role of domain-specific adaptation in deploying LLM assistants for complex engineering workflows.

cs.LG [Back]

[641] Learnable Multi-Scale Wavelet Transformer: A Novel Alternative to Self-Attention

Andrew Kiruluta,Priscilla Burity,Samantha Williams

Main category: cs.LG

TLDR: 论文提出了一种基于可学习多尺度小波变换的Transformer架构（LMWT），以替代传统的点积自注意力机制，显著降低了计算复杂度。

Details

Motivation: 自注意力机制的计算复杂度随序列长度呈二次方增长，限制了其在长序列或资源受限场景中的应用。 Method: LMWT使用可学习的多尺度Haar小波变换模块，通过端到端训练自适应地分解数据，同时捕捉局部细节和全局上下文。 Result: 在WMT16 En-De基准测试中，LMWT在BLEU分数、困惑度和标记准确率上表现优异，且计算复杂度为线性。 Conclusion: LMWT是一种高效且性能优越的序列建模替代方案，具有计算优势和可解释性。 Abstract: Transformer architectures, underpinned by the self-attention mechanism, have achieved state-of-the-art results across numerous natural language processing (NLP) tasks by effectively modeling long-range dependencies. However, the computational complexity of self-attention, scaling quadratically with input sequence length, presents significant challenges for processing very long sequences or operating under resource constraints. This paper introduces the Learnable Multi-Scale Wavelet Transformer (LMWT), a novel architecture that replaces the standard dot-product self-attention with a learnable multi-scale Haar wavelet transform module. Leveraging the intrinsic multi-resolution properties of wavelets, the LMWT efficiently captures both local details and global context. Crucially, the parameters of the wavelet transform, including scale-specific coefficients, are learned end-to-end during training, allowing the model to adapt its decomposition strategy to the data and task. We present the detailed mathematical formulation of the learnable Haar wavelet module and its integration into the transformer framework, supplemented by an architectural diagram. We conduct a comprehensive experimental evaluation on a standard machine translation benchmark (WMT16 En-De), comparing the LMWT against a baseline self-attention transformer using metrics like BLEU score, perplexity, and token accuracy. Furthermore, we analyze the computational complexity, highlighting the linear scaling of our approach, discuss its novelty in the context of related work, and explore the interpretability offered by visualizing the learned Haar coefficients. Our results indicate that the LMWT achieves competitive performance while offering substantial computational advantages, positioning it as a promising and novel alternative for efficient sequence modeling.

[642] Mechanistic Anomaly Detection for "Quirky" Language Models

David O. Johnston,Arkajyoti Chakraborty,Nora Belrose

Main category: cs.LG

TLDR: 论文探讨了如何通过机制异常检测（MAD）增强对大型语言模型（LLM）的监督，发现该方法在部分任务中有效，但需进一步改进以适用于高风险场景。

Details

Motivation: 随着LLM能力的提升，监督其行为变得更具挑战性，尤其是当模型对监督者未知的因素敏感时。 Method: 采用MAD技术，利用模型内部特征识别异常训练信号，并通过多种检测器特征和评分规则进行实验。 Result: 检测器在某些任务中表现优异，但无单一检测器适用于所有模型和任务。 Conclusion: MAD在低风险应用中可能有效，但在高风险场景中需进一步改进检测和评估方法。 Abstract: As LLMs grow in capability, the task of supervising LLMs becomes more challenging. Supervision failures can occur if LLMs are sensitive to factors that supervisors are unaware of. We investigate Mechanistic Anomaly Detection (MAD) as a technique to augment supervision of capable models; we use internal model features to identify anomalous training signals so they can be investigated or discarded. We train detectors to flag points from the test environment that differ substantially from the training environment, and experiment with a large variety of detector features and scoring rules to detect anomalies in a set of ``quirky'' language models. We find that detectors can achieve high discrimination on some tasks, but no detector is effective across all models and tasks. MAD techniques may be effective in low-stakes applications, but advances in both detection and evaluation are likely needed if they are to be used in high stakes settings.

[643] AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories

Xing Han Lù,Amirhossein Kazemnejad,Nicholas Meade,Arkil Patel,Dongchan Shin,Alejandra Zambrano,Karolina Stańczak,Peter Shaw,Christopher J. Pal,Siva Reddy

Main category: cs.LG

TLDR: 论文提出了AgentRewardBench，首个评估LLM法官对网页代理任务完成效果有效性的基准。

Details

Motivation: 现有基于规则的评估方法难以扩展且可能无法准确识别成功轨迹，人工评估成本高且慢，而LLM自动评估的潜力尚不明确。 Method: 构建包含1302条轨迹的AgentRewardBench，由专家标注成功性、副作用和重复性，评估12种LLM法官的表现。 Result: 发现无单一LLM在所有任务中表现优异，且基于规则的评估低估了网页代理的成功率。 Conclusion: 需开发更灵活的自动评估方法，AgentRewardBench为相关研究提供了基准。 Abstract: Web agents enable users to perform tasks on web browsers through natural language interaction. Evaluating web agents trajectories is an important problem, since it helps us determine whether the agent successfully completed the tasks. Rule-based methods are widely used for this purpose, but they are challenging to extend to new tasks and may not always recognize successful trajectories. We may achieve higher accuracy through human evaluation, but the process would be substantially slower and more expensive. Automatic evaluations with LLMs may avoid the challenges of designing new rules and manually annotating trajectories, enabling faster and cost-effective evaluation. However, it is unclear how effective they are at evaluating web agents. To this end, we propose AgentRewardBench, the first benchmark to assess the effectiveness of LLM judges for evaluating web agents. AgentRewardBench contains 1302 trajectories across 5 benchmarks and 4 LLMs. Each trajectory in AgentRewardBench is reviewed by an expert, who answers questions pertaining to the success, side effects, and repetitiveness of the agent. Using our benchmark, we evaluate 12 LLM judges and find that no single LLM excels across all benchmarks. We also find that the rule-based evaluation used by common benchmarks tends to underreport the success rate of web agents, highlighting a key weakness of rule-based evaluation and the need to develop more flexible automatic evaluations. We release the benchmark at: https://agent-reward-bench.github.io

Sohom Ghosh,Arnab Maji,Sudip Kumar Naskar

Main category: cs.LG

TLDR: 研究提出了一种多模态预测模型，结合财报电话会议的文字、图像和表格数据，预测次日股价变动，并公开了MiMIC数据集。

Details

Motivation: 预测财报电话会议后的股价变动是投资者和研究者面临的挑战，需要整合多样化信息源。 Method: 开发了MiMIC数据集，结合文本、图像和表格数据，构建多模态分析框架。 Result: 多模态方法展示了整合多样化信息提升金融预测准确性的潜力。 Conclusion: 研究为计算经济学提供了新工具，并验证了多模态机器学习在金融分析中的有效性。 Abstract: Predicting stock market prices following corporate earnings calls remains a significant challenge for investors and researchers alike, requiring innovative approaches that can process diverse information sources. This study investigates the impact of corporate earnings calls on stock prices by introducing a multi-modal predictive model. We leverage textual data from earnings call transcripts, along with images and tables from accompanying presentations, to forecast stock price movements on the trading day immediately following these calls. To facilitate this research, we developed the MiMIC (Multi-Modal Indian Earnings Calls) dataset, encompassing companies representing the Nifty 50, Nifty MidCap 50, and Nifty Small 50 indices. The dataset includes earnings call transcripts, presentations, fundamentals, technical indicators, and subsequent stock prices. We present a multimodal analytical framework that integrates quantitative variables with predictive signals derived from textual and visual modalities, thereby enabling a holistic approach to feature representation and analysis. This multi-modal approach demonstrates the potential for integrating diverse information sources to enhance financial forecasting accuracy. To promote further research in computational economics, we have made the MiMIC dataset publicly available under the CC-NC-SA-4.0 licence. Our work contributes to the growing body of literature on market reactions to corporate communications and highlights the efficacy of multi-modal machine learning techniques in financial analysis.

[645] Mixture of Group Experts for Learning Invariant Representations

Lei Kang,Jia Li,Mi Tian,Hua Huang

Main category: cs.LG

TLDR: 提出了一种改进稀疏激活的Mixture-of-Experts (MoE) 模型的方法，通过引入稀疏表示理论和分组稀疏正则化，提升专家多样性和专业性。

Details

Motivation: 传统MoE模型在专家数量增加时面临多样性和专业性不足的问题，限制了性能和可扩展性。 Method: 提出Mixture of Group Experts (MoGE)，通过分组稀疏正则化和2D地形图结构优化路由输入，间接正则化专家。 Result: MoGE在图像分类和语言建模任务中显著优于传统MoE模型，且额外开销极小。 Conclusion: MoGE为提升专家模型的可扩展性和减少冗余提供了一种简单有效的解决方案。 Abstract: Sparsely activated Mixture-of-Experts (MoE) models effectively increase the number of parameters while maintaining consistent computational costs per token. However, vanilla MoE models often suffer from limited diversity and specialization among experts, constraining their performance and scalability, especially as the number of experts increases. In this paper, we present a novel perspective on vanilla MoE with top-$k$ routing inspired by sparse representation. This allows us to bridge established theoretical insights from sparse representation into MoE models. Building on this foundation, we propose a group sparse regularization approach for the input of top-$k$ routing, termed Mixture of Group Experts (MoGE). MoGE indirectly regularizes experts by imposing structural constraints on the routing inputs, while preserving the original MoE architecture. Furthermore, we organize the routing input into a 2D topographic map, spatially grouping neighboring elements. This structure enables MoGE to capture representations invariant to minor transformations, thereby significantly enhancing expert diversity and specialization. Comprehensive evaluations across various Transformer models for image classification and language modeling tasks demonstrate that MoGE substantially outperforms its MoE counterpart, with minimal additional memory and computation overhead. Our approach provides a simple yet effective solution to scale the number of experts and reduce redundancy among them. The source code is included in the supplementary material and will be publicly released.

[646] DUMP: Automated Distribution-Level Curriculum Learning for RL-based LLM Post-training

Zhenting Wang,Guofeng Cui,Kun Wan,Wentian Zhao

Main category: cs.LG

TLDR: 提出了一种基于分布级可学习性的课程学习框架，用于RL后训练的LLM，通过动态调整不同分布的采样概率优化学习效率。

Details

Motivation: 现有方法将训练数据视为统一整体，忽略了数据分布的多样性和难度差异，导致学习效率不高。 Method: 利用策略优势幅度衡量分布的学习潜力，结合UCB原则动态调整采样概率，优先选择高优势或低样本的分布。 Result: 实验表明，该框架显著提高了收敛速度和最终性能。 Conclusion: 分布感知的课程学习策略在LLM后训练中具有重要价值。 Abstract: Recent advances in reinforcement learning (RL)-based post-training have led to notable improvements in large language models (LLMs), particularly in enhancing their reasoning capabilities to handle complex tasks. However, most existing methods treat the training data as a unified whole, overlooking the fact that modern LLM training often involves a mixture of data from diverse distributions-varying in both source and difficulty. This heterogeneity introduces a key challenge: how to adaptively schedule training across distributions to optimize learning efficiency. In this paper, we present a principled curriculum learning framework grounded in the notion of distribution-level learnability. Our core insight is that the magnitude of policy advantages reflects how much a model can still benefit from further training on a given distribution. Based on this, we propose a distribution-level curriculum learning framework for RL-based LLM post-training, which leverages the Upper Confidence Bound (UCB) principle to dynamically adjust sampling probabilities for different distrubutions. This approach prioritizes distributions with either high average advantage (exploitation) or low sample count (exploration), yielding an adaptive and theoretically grounded training schedule. We instantiate our curriculum learning framework with GRPO as the underlying RL algorithm and demonstrate its effectiveness on logic reasoning datasets with multiple difficulties and sources. Our experiments show that our framework significantly improves convergence speed and final performance, highlighting the value of distribution-aware curriculum strategies in LLM post-training. Code: https://github.com/ZhentingWang/DUMP.

[647] KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference

Yuxuan Tian,Zihan Wang,Yebo Peng,Aomufei Yuan,Zhiming Wang,Bairen Yi,Xin Liu,Yong Cui,Tong Yang

Main category: cs.LG

TLDR: KeepKV是一种自适应KV缓存合并方法，通过选举票机制和零推理扰动合并技术，显著减少内存使用并保持生成质量。

Details

Motivation: 大型语言模型（LLM）的推理效率受限于不断增长的KV缓存，传统方法因选择性丢弃KV缓存条目导致信息丢失和幻觉问题。 Method: KeepKV引入选举票机制记录合并历史并自适应调整注意力分数，同时采用零推理扰动合并技术保持注意力一致性。 Result: 实验表明，KeepKV在10% KV缓存预算下显著减少内存使用，推理吞吐量提升2倍以上，且生成质量优异。 Conclusion: KeepKV通过创新方法解决了KV缓存压缩中的输出扰动问题，为高效LLM推理提供了有效解决方案。 Abstract: Efficient inference of large language models (LLMs) is hindered by an ever-growing key-value (KV) cache, making KV cache compression a critical research direction. Traditional methods selectively evict less important KV cache entries based on attention scores or position heuristics, which leads to information loss and hallucinations. Recently, merging-based strategies have been explored to retain more information by merging KV pairs that would be discarded; however, these existing approaches inevitably introduce inconsistencies in attention distributions before and after merging, causing output perturbation and degraded generation quality. To overcome this challenge, we propose KeepKV, a novel adaptive KV cache merging method designed to eliminate output perturbation while preserving performance under strict memory constraints. KeepKV introduces the Electoral Votes mechanism that records merging history and adaptively adjusts attention scores. Moreover, it further leverages a novel Zero Inference-Perturbation Merging methods, keeping attention consistency and compensating for attention loss resulting from cache merging. KeepKV successfully retains essential context information within a significantly compressed cache. Extensive experiments on various benchmarks and LLM architectures demonstrate that KeepKV substantially reduces memory usage, enhances inference throughput by more than 2x and keeps superior generation quality even with 10% KV cache budgets.

[648] FM-LoRA: Factorized Low-Rank Meta-Prompting for Continual Learning

Xiaobing Yu,Jin Yang,Xiao Wu,Peijie Qiu,Xiaofeng Liu

Main category: cs.LG

TLDR: FM-LoRA是一种高效的低秩适应方法，通过动态秩选择器和动态元提示，有效分配模型容量，避免参数膨胀，并在多任务和领域中表现优异。

Details

Motivation: 解决预训练模型在连续任务中参数膨胀、存储成本高及缺乏任务相似性意识的问题。 Method: 提出FM-LoRA，结合动态秩选择器（DRS）和动态元提示（DMP），共享低秩子空间以优化模型容量分配。 Result: 在多个CL基准测试（如ImageNet-R、CIFAR100等）中，FM-LoRA显著减少灾难性遗忘，并在多任务和领域中表现稳健。 Conclusion: FM-LoRA为连续学习提供了一种高效且可扩展的解决方案，适用于多样化的任务和领域。 Abstract: How to adapt a pre-trained model continuously for sequential tasks with different prediction class labels and domains and finally learn a generalizable model across diverse tasks is a long-lasting challenge. Continual learning (CL) has emerged as a promising approach to leverage pre-trained models (e.g., Transformers) for sequential tasks. While many existing CL methods incrementally store additional learned structures, such as Low-Rank Adaptation (LoRA) adapters or prompts and sometimes even preserve features from previous samples to maintain performance. This leads to unsustainable parameter growth and escalating storage costs as the number of tasks increases. Moreover, current approaches often lack task similarity awareness, which further hinders the models ability to effectively adapt to new tasks without interfering with previously acquired knowledge. To address these challenges, we propose FM-LoRA, a novel and efficient low-rank adaptation method that integrates both a dynamic rank selector (DRS) and dynamic meta-prompting (DMP). This framework allocates model capacity more effectively across tasks by leveraging a shared low-rank subspace critical for preserving knowledge, thereby avoiding continual parameter expansion. Extensive experiments on various CL benchmarks, including ImageNet-R, CIFAR100, and CUB200 for class-incremental learning (CIL), and DomainNet for domain-incremental learning (DIL), with Transformers backbone demonstrate that FM-LoRA effectively mitigates catastrophic forgetting while delivering robust performance across a diverse range of tasks and domains.

[649] ColonScopeX: Leveraging Explainable Expert Systems with Multimodal Data for Improved Early Diagnosis of Colorectal Cancer

Natalia Sikora,Robert L. Manschke,Alethea M. Tang,Peter Dunstan,Dean A. Harris,Su Yang

Main category: cs.LG

TLDR: 提出了一种名为ColonScopeX的机器学习框架，结合可解释AI技术，用于提升结直肠癌的早期检测。

Details

Motivation: 结直肠癌是全球第二大癌症死亡原因，早期诊断率低，且症状非特异性导致患者忽视。早期诊断对生存率至关重要。 Method: 采用多模态模型，整合血液样本数据（经Savitzky-Golay算法处理）和患者元数据（如病史、年龄等），并结合可解释AI技术提高透明度。 Result: 框架可作为分诊或筛查工具，提升早期检测率。 Conclusion: 结合多源数据和可解释机器学习，有望解决医学诊断中的关键挑战。 Abstract: Colorectal cancer (CRC) ranks as the second leading cause of cancer-related deaths and the third most prevalent malignant tumour worldwide. Early detection of CRC remains problematic due to its non-specific and often embarrassing symptoms, which patients frequently overlook or hesitate to report to clinicians. Crucially, the stage at which CRC is diagnosed significantly impacts survivability, with a survival rate of 80-95\% for Stage I and a stark decline to 10\% for Stage IV. Unfortunately, in the UK, only 14.4\% of cases are diagnosed at the earliest stage (Stage I). In this study, we propose ColonScopeX, a machine learning framework utilizing explainable AI (XAI) methodologies to enhance the early detection of CRC and pre-cancerous lesions. Our approach employs a multimodal model that integrates signals from blood sample measurements, processed using the Savitzky-Golay algorithm for fingerprint smoothing, alongside comprehensive patient metadata, including medication history, comorbidities, age, weight, and BMI. By leveraging XAI techniques, we aim to render the model's decision-making process transparent and interpretable, thereby fostering greater trust and understanding in its predictions. The proposed framework could be utilised as a triage tool or a screening tool of the general population. This research highlights the potential of combining diverse patient data sources and explainable machine learning to tackle critical challenges in medical diagnostics.

[650] Minority Reports: Balancing Cost and Quality in Ground Truth Data Annotation

Hsuan Wei Liao,Christopher Klugmann,Daniel Kondermann,Rafid Mahmood

Main category: cs.LG

TLDR: 论文提出了一种通过检测和移除少数报告（标注错误）来优化数据标注任务分配的方法，以减少冗余标注并降低成本。

Details

Motivation: 高质量数据标注是机器学习软件开发的关键，但成本高昂且耗时。研究旨在平衡标注准确性和成本。 Method: 通过估计标注者与多数投票结果不一致的可能性，提前修剪冗余标注任务。方法基于图像模糊性、标注者差异和疲劳等因素。 Result: 在计算机视觉数据集上的模拟显示，该方法可减少60%以上的标注需求，节省约6.6天工作量，同时对标签质量影响较小。 Conclusion: 该方法为标注平台提供了一种平衡成本与数据质量的策略，使机器学习从业者能根据应用需求调整标注精度，优化预算分配。 Abstract: High-quality data annotation is an essential but laborious and costly aspect of developing machine learning-based software. We explore the inherent tradeoff between annotation accuracy and cost by detecting and removing minority reports -- instances where annotators provide incorrect responses -- that indicate unnecessary redundancy in task assignments. We propose an approach to prune potentially redundant annotation task assignments before they are executed by estimating the likelihood of an annotator disagreeing with the majority vote for a given task. Our approach is informed by an empirical analysis over computer vision datasets annotated by a professional data annotation platform, which reveals that the likelihood of a minority report event is dependent primarily on image ambiguity, worker variability, and worker fatigue. Simulations over these datasets show that we can reduce the number of annotations required by over 60% with a small compromise in label quality, saving approximately 6.6 days-equivalent of labor. Our approach provides annotation service platforms with a method to balance cost and dataset quality. Machine learning practitioners can tailor annotation accuracy levels according to specific application needs, thereby optimizing budget allocation while maintaining the data quality necessary for critical settings like autonomous driving technology.

[651] Causal integration of chemical structures improves representations of microscopy images for morphological profiling

Yemin Yu,Neil Tenenholtz,Lester Mackey,Ying Wei,David Alvarez-Melis,Ava P. Amini,Alex X. Lu

Main category: cs.LG

TLDR: MICON框架通过结合化学化合物信息改进自监督学习，显著优于传统方法和现有深度学习方法。

Details

Motivation: 现有方法仅从图像学习，忽略了多模态数据（如化学扰动），限制了形态学分析的性能。 Method: 提出MICON框架，将化学化合物建模为诱导细胞表型反事实变化的处理，结合自监督学习。 Result: MICON在跨实验中心和独立重复实验中表现优异，优于传统和现有深度学习方法。 Conclusion: 多模态数据（如化学信息）应被明确纳入形态学分析的表示学习方法中。 Abstract: Recent advances in self-supervised deep learning have improved our ability to quantify cellular morphological changes in high-throughput microscopy screens, a process known as morphological profiling. However, most current methods only learn from images, despite many screens being inherently multimodal, as they involve both a chemical or genetic perturbation as well as an image-based readout. We hypothesized that incorporating chemical compound structure during self-supervised pre-training could improve learned representations of images in high-throughput microscopy screens. We introduce a representation learning framework, MICON (Molecular-Image Contrastive Learning), that models chemical compounds as treatments that induce counterfactual transformations of cell phenotypes. MICON significantly outperforms classical hand-crafted features such as CellProfiler and existing deep-learning-based representation learning methods in challenging evaluation settings where models must identify reproducible effects of drugs across independent replicates and data-generating centers. We demonstrate that incorporating chemical compound information into the learning process provides consistent improvements in our evaluation setting and that modeling compounds specifically as treatments in a causal framework outperforms approaches that directly align images and compounds in a single representation space. Our findings point to a new direction for representation learning in morphological profiling, suggesting that methods should explicitly account for the multimodal nature of microscopy screening data.

[652] RANSAC Revisited: An Improved Algorithm for Robust Subspace Recovery under Adversarial and Noisy Corruptions

Guixian Chen,Jianhao Ma,Salar Fattahi

Main category: cs.LG

TLDR: 论文提出了一种名为RANSAC+的两阶段算法，用于在存在高斯噪声和强对抗性干扰的情况下恢复鲁棒子空间。该方法改进了经典RANSAC的不足，具有高效性和鲁棒性。

Details

Motivation: 现有方法在强对抗性干扰和高斯噪声下存在计算成本高或分布假设严格的问题，限制了其实际应用。 Method: 提出RANSAC+算法，通过两阶段设计解决了经典RANSAC在效率和鲁棒性上的不足。 Result: RANSAC+在对抗性干扰和高斯噪声下具有鲁棒性，样本复杂度接近最优，且无需预先知道子空间维度。 Conclusion: RANSAC+是一种高效且鲁棒的子空间恢复方法，适用于复杂噪声环境。 Abstract: In this paper, we study the problem of robust subspace recovery (RSR) in the presence of both strong adversarial corruptions and Gaussian noise. Specifically, given a limited number of noisy samples -- some of which are tampered by an adaptive and strong adversary -- we aim to recover a low-dimensional subspace that approximately contains a significant fraction of the uncorrupted samples, up to an error that scales with the Gaussian noise. Existing approaches to this problem often suffer from high computational costs or rely on restrictive distributional assumptions, limiting their applicability in truly adversarial settings. To address these challenges, we revisit the classical random sample consensus (RANSAC) algorithm, which offers strong robustness to adversarial outliers, but sacrifices efficiency and robustness against Gaussian noise and model misspecification in the process. We propose a two-stage algorithm, RANSAC+, that precisely pinpoints and remedies the failure modes of standard RANSAC. Our method is provably robust to both Gaussian and adversarial corruptions, achieves near-optimal sample complexity without requiring prior knowledge of the subspace dimension, and is more efficient than existing RANSAC-type methods.

[653] Balancing Two Classifiers via A Simplex ETF Structure for Model Calibration

Jiani Ni,He Zhao,Jintong Gao,Dandan Guo,Hongyuan Zha

Main category: cs.LG

TLDR: 提出了一种名为BalCAL的新方法，通过平衡可学习和ETF分类器解决模型校准中的过度自信或不足自信问题。

Details

Motivation: 深度神经网络在安全关键应用中存在校准问题，现有方法对分类器设计的探索不足，且忽视不足自信导致的校准误差。 Method: 引入可调置信模块和动态调整方法，平衡可学习和ETF分类器。 Result: 实验表明，BalCAL显著提升了模型校准性能，同时保持高预测准确性。 Conclusion: BalCAL为深度学习中的校准挑战提供了新颖解决方案。 Abstract: In recent years, deep neural networks (DNNs) have demonstrated state-of-the-art performance across various domains. However, despite their success, they often face calibration issues, particularly in safety-critical applications such as autonomous driving and healthcare, where unreliable predictions can have serious consequences. Recent research has started to improve model calibration from the view of the classifier. However, the exploration of designing the classifier to solve the model calibration problem is insufficient. Let alone most of the existing methods ignore the calibration errors arising from underconfidence. In this work, we propose a novel method by balancing learnable and ETF classifiers to solve the overconfidence or underconfidence problem for model Calibration named BalCAL. By introducing a confidence-tunable module and a dynamic adjustment method, we ensure better alignment between model confidence and its true accuracy. Extensive experimental validation shows that ours significantly improves model calibration performance while maintaining high predictive accuracy, outperforming existing techniques. This provides a novel solution to the calibration challenges commonly encountered in deep learning.

[654] Air Quality Prediction with A Meteorology-Guided Modality-Decoupled Spatio-Temporal Network

Hang Yin,Yan-Ming Zhang,Jian Xu,Jian-Long Chang,Yin Li,Cheng-Lin Liu

Main category: cs.LG

TLDR: 论文提出MDSTNet框架，结合多压力层气象数据与天气预测，显著提升空气质量预测准确性，并在中国首个全国性数据集ChinaAirNet上验证其优越性。

Details

Motivation: 现有研究低估了气象条件在空气质量预测中的关键作用，且未充分利用气象数据，导致模型无法准确捕捉空气质量与气象数据间的动态依赖关系。 Method: 提出MDSTNet框架，将空气质量观测与气象条件作为独立模态建模，整合多压力层气象数据和天气预测，以捕捉大气与污染物的依赖关系。 Result: 在ChinaAirNet数据集上，MDSTNet显著优于现有模型，48小时预测误差降低了17.54%。 Conclusion: MDSTNet通过综合气象数据显著提升了空气质量预测性能，同时发布的ChinaAirNet数据集为未来研究提供了重要资源。 Abstract: Air quality prediction plays a crucial role in public health and environmental protection. Accurate air quality prediction is a complex multivariate spatiotemporal problem, that involves interactions across temporal patterns, pollutant correlations, spatial station dependencies, and particularly meteorological influences that govern pollutant dispersion and chemical transformations. Existing works underestimate the critical role of atmospheric conditions in air quality prediction and neglect comprehensive meteorological data utilization, thereby impairing the modeling of dynamic interdependencies between air quality and meteorological data. To overcome this, we propose MDSTNet, an encoder-decoder framework that explicitly models air quality observations and atmospheric conditions as distinct modalities, integrating multi-pressure-level meteorological data and weather forecasts to capture atmosphere-pollution dependencies for prediction. Meantime, we construct ChinaAirNet, the first nationwide dataset combining air quality records with multi-pressure-level meteorological observations. Experimental results on ChinaAirNet demonstrate MDSTNet's superiority, substantially reducing 48-hour prediction errors by 17.54\% compared to the state-of-the-art model. The source code and dataset will be available on github.

[655] Negate or Embrace: On How Misalignment Shapes Multimodal Representation Learning

Yichao Cai,Yuhang Liu,Erdun Gao,Tianjiao Jiang,Zhen Zhang,Anton van den Hengel,Javen Qinfeng Shi

Main category: cs.LG

TLDR: 论文探讨多模态对比学习（MMCL）中图像-文本对的对齐问题，提出两种偏差机制（选择偏差和扰动偏差），并证明MMCL能捕捉语义变量的不变信息。

Details

Motivation: 解决实际数据集中图像-文本对的不对齐问题，为实践者提供指导。 Method: 使用潜变量模型形式化不对齐问题，引入选择偏差和扰动偏差机制。 Result: 理论分析表明MMCL能捕捉不受偏差影响的语义变量信息，并通过实验验证。 Conclusion: 为理解不对齐提供了统一视角，并为实际系统设计提供了实用建议。 Abstract: Multimodal representation learning, exemplified by multimodal contrastive learning (MMCL) using image-text pairs, aims to learn powerful representations by aligning cues across modalities. This approach relies on the core assumption that the exemplar image-text pairs constitute two representations of an identical concept. However, recent research has revealed that real-world datasets often exhibit misalignment. There are two distinct viewpoints on how to address this issue: one suggests mitigating the misalignment, and the other leveraging it. We seek here to reconcile these seemingly opposing perspectives, and to provide a practical guide for practitioners. Using latent variable models we thus formalize misalignment by introducing two specific mechanisms: selection bias, where some semantic variables are missing, and perturbation bias, where semantic variables are distorted -- both affecting latent variables shared across modalities. Our theoretical analysis demonstrates that, under mild assumptions, the representations learned by MMCL capture exactly the information related to the subset of the semantic variables invariant to selection and perturbation biases. This provides a unified perspective for understanding misalignment. Based on this, we further offer actionable insights into how misalignment should inform the design of real-world ML systems. We validate our theoretical findings through extensive empirical studies on both synthetic data and real image-text datasets, shedding light on the nuanced impact of misalignment on multimodal representation learning.

[656] Learnable Multi-Scale Wavelet Transformer: A Novel Alternative to Self-Attention

Andrew Kiruluta,Priscilla Burity,Samantha Williams

Main category: cs.LG

TLDR: 论文提出了一种基于可学习多尺度Haar小波变换的Transformer架构（LMWT），以解决传统自注意力机制在长序列处理中的计算复杂度问题。

Details

Motivation: 传统Transformer的自注意力机制在处理长序列时计算复杂度高，资源消耗大，限制了其应用。 Method: LMWT用可学习的多尺度Haar小波变换模块替代标准点积自注意力，通过多分辨率特性高效捕捉局部和全局信息。 Result: 在WMT16 En-De基准测试中，LMWT在BLEU分数、困惑度和标记准确率上表现竞争力，且计算复杂度线性增长。 Conclusion: LMWT是一种高效且新颖的序列建模替代方案，兼具性能和计算优势。 Abstract: Transformer architectures, underpinned by the self-attention mechanism, have achieved state-of-the-art results across numerous natural language processing (NLP) tasks by effectively modeling long-range dependencies. However, the computational complexity of self-attention, scaling quadratically with input sequence length, presents significant challenges for processing very long sequences or operating under resource constraints. This paper introduces the Learnable Multi-Scale Wavelet Transformer (LMWT), a novel architecture that replaces the standard dot-product self-attention with a learnable multi-scale Haar wavelet transform module. Leveraging the intrinsic multi-resolution properties of wavelets, the LMWT efficiently captures both local details and global context. Crucially, the parameters of the wavelet transform, including scale-specific coefficients, are learned end-to-end during training, allowing the model to adapt its decomposition strategy to the data and task. We present the detailed mathematical formulation of the learnable Haar wavelet module and its integration into the transformer framework, supplemented by an architectural diagram. We conduct a comprehensive experimental evaluation on a standard machine translation benchmark (WMT16 En-De), comparing the LMWT against a baseline self-attention transformer using metrics like BLEU score, perplexity, and token accuracy. Furthermore, we analyze the computational complexity, highlighting the linear scaling of our approach, discuss its novelty in the context of related work, and explore the interpretability offered by visualizing the learned Haar coefficients. Our results indicate that the LMWT achieves competitive performance while offering substantial computational advantages, positioning it as a promising and novel alternative for efficient sequence modeling.

[657] Mechanistic Anomaly Detection for "Quirky" Language Models

David O. Johnston,Arkajyoti Chakraborty,Nora Belrose

Main category: cs.LG

TLDR: 论文探讨了使用机制异常检测（MAD）技术来增强对大型语言模型（LLM）的监督，通过内部特征识别异常训练信号，实验表明该方法在某些任务中有效，但需进一步改进以适用于高风险场景。

Details

Motivation: 随着LLM能力的提升，监督变得更具挑战性，尤其是当模型对监督者未知的因素敏感时。研究旨在通过MAD技术识别异常信号，以改进监督。 Method: 使用内部模型特征训练检测器，标记测试环境中与训练环境显著不同的点，实验了多种检测器特征和评分规则。 Result: 检测器在某些任务中表现优异，但无单一检测器适用于所有模型和任务。MAD在低风险应用中可能有效，但高风险场景需进一步改进。 Conclusion: MAD技术为LLM监督提供了潜在解决方案，但在检测和评估方面仍需进步，以适用于更广泛的高风险应用。 Abstract: As LLMs grow in capability, the task of supervising LLMs becomes more challenging. Supervision failures can occur if LLMs are sensitive to factors that supervisors are unaware of. We investigate Mechanistic Anomaly Detection (MAD) as a technique to augment supervision of capable models; we use internal model features to identify anomalous training signals so they can be investigated or discarded. We train detectors to flag points from the test environment that differ substantially from the training environment, and experiment with a large variety of detector features and scoring rules to detect anomalies in a set of ``quirky'' language models. We find that detectors can achieve high discrimination on some tasks, but no detector is effective across all models and tasks. MAD techniques may be effective in low-stakes applications, but advances in both detection and evaluation are likely needed if they are to be used in high stakes settings.

[658] AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories

Xing Han Lù,Amirhossein Kazemnejad,Nicholas Meade,Arkil Patel,Dongchan Shin,Alejandra Zambrano,Karolina Stańczak,Peter Shaw,Christopher J. Pal,Siva Reddy

Main category: cs.LG

TLDR: 论文提出了AgentRewardBench，首个评估LLM法官对网络代理任务完成效果有效性的基准，发现现有规则评估方法低估成功率。

Details

Motivation: 现有规则评估方法难以扩展且可能无法准确识别成功轨迹，而人工评估成本高。自动LLM评估可能提供更快、更经济的解决方案，但其有效性尚不明确。 Method: 构建包含1302条轨迹的AgentRewardBench基准，涵盖5个任务和4种LLM，每条轨迹由专家评估成功、副作用和重复性。 Result: 评估12种LLM法官后发现，无单一LLM在所有任务中表现优异，且规则评估方法低估了网络代理的成功率。 Conclusion: 规则评估方法存在局限性，需开发更灵活的自动评估方法，AgentRewardBench为此提供了基准。 Abstract: Web agents enable users to perform tasks on web browsers through natural language interaction. Evaluating web agents trajectories is an important problem, since it helps us determine whether the agent successfully completed the tasks. Rule-based methods are widely used for this purpose, but they are challenging to extend to new tasks and may not always recognize successful trajectories. We may achieve higher accuracy through human evaluation, but the process would be substantially slower and more expensive. Automatic evaluations with LLMs may avoid the challenges of designing new rules and manually annotating trajectories, enabling faster and cost-effective evaluation. However, it is unclear how effective they are at evaluating web agents. To this end, we propose AgentRewardBench, the first benchmark to assess the effectiveness of LLM judges for evaluating web agents. AgentRewardBench contains 1302 trajectories across 5 benchmarks and 4 LLMs. Each trajectory in AgentRewardBench is reviewed by an expert, who answers questions pertaining to the success, side effects, and repetitiveness of the agent. Using our benchmark, we evaluate 12 LLM judges and find that no single LLM excels across all benchmarks. We also find that the rule-based evaluation used by common benchmarks tends to underreport the success rate of web agents, highlighting a key weakness of rule-based evaluation and the need to develop more flexible automatic evaluations. We release the benchmark at: https://agent-reward-bench.github.io

Sohom Ghosh,Arnab Maji,Sudip Kumar Naskar

Main category: cs.LG

TLDR: 该研究提出了一种多模态预测模型，结合财报电话会议的文字、图像和表格数据，预测次日股价变动，并公开了MiMIC数据集。

Details

Motivation: 预测财报电话会议后的股价变动对投资者和研究者具有挑战性，需创新方法处理多源信息。 Method: 开发了MiMIC数据集，整合文字、图像、表格及量化变量，构建多模态分析框架。 Result: 多模态方法展示了整合多源信息提升金融预测准确性的潜力。 Conclusion: 研究推动了公司沟通对市场反应的理解，并验证了多模态机器学习在金融分析中的有效性。 Abstract: Predicting stock market prices following corporate earnings calls remains a significant challenge for investors and researchers alike, requiring innovative approaches that can process diverse information sources. This study investigates the impact of corporate earnings calls on stock prices by introducing a multi-modal predictive model. We leverage textual data from earnings call transcripts, along with images and tables from accompanying presentations, to forecast stock price movements on the trading day immediately following these calls. To facilitate this research, we developed the MiMIC (Multi-Modal Indian Earnings Calls) dataset, encompassing companies representing the Nifty 50, Nifty MidCap 50, and Nifty Small 50 indices. The dataset includes earnings call transcripts, presentations, fundamentals, technical indicators, and subsequent stock prices. We present a multimodal analytical framework that integrates quantitative variables with predictive signals derived from textual and visual modalities, thereby enabling a holistic approach to feature representation and analysis. This multi-modal approach demonstrates the potential for integrating diverse information sources to enhance financial forecasting accuracy. To promote further research in computational economics, we have made the MiMIC dataset publicly available under the CC-NC-SA-4.0 licence. Our work contributes to the growing body of literature on market reactions to corporate communications and highlights the efficacy of multi-modal machine learning techniques in financial analysis.

[660] Mixture of Group Experts for Learning Invariant Representations

Lei Kang,Jia Li,Mi Tian,Hua Huang

Main category: cs.LG

TLDR: 提出了一种基于稀疏表示的MoGE方法，通过分组稀疏正则化提升专家多样性和专业性，显著优于传统MoE模型。

Details

Motivation: 传统MoE模型在专家多样性和专业性上表现不足，限制了其性能和扩展性。 Method: 提出MoGE方法，通过分组稀疏正则化路由输入，并组织为2D地形图结构。 Result: 在图像分类和语言建模任务中，MoGE显著优于MoE，且额外开销极小。 Conclusion: MoGE为提升专家多样性和减少冗余提供了简单有效的解决方案。 Abstract: Sparsely activated Mixture-of-Experts (MoE) models effectively increase the number of parameters while maintaining consistent computational costs per token. However, vanilla MoE models often suffer from limited diversity and specialization among experts, constraining their performance and scalability, especially as the number of experts increases. In this paper, we present a novel perspective on vanilla MoE with top-$k$ routing inspired by sparse representation. This allows us to bridge established theoretical insights from sparse representation into MoE models. Building on this foundation, we propose a group sparse regularization approach for the input of top-$k$ routing, termed Mixture of Group Experts (MoGE). MoGE indirectly regularizes experts by imposing structural constraints on the routing inputs, while preserving the original MoE architecture. Furthermore, we organize the routing input into a 2D topographic map, spatially grouping neighboring elements. This structure enables MoGE to capture representations invariant to minor transformations, thereby significantly enhancing expert diversity and specialization. Comprehensive evaluations across various Transformer models for image classification and language modeling tasks demonstrate that MoGE substantially outperforms its MoE counterpart, with minimal additional memory and computation overhead. Our approach provides a simple yet effective solution to scale the number of experts and reduce redundancy among them. The source code is included in the supplementary material and will be publicly released.

[661] DUMP: Automated Distribution-Level Curriculum Learning for RL-based LLM Post-training

Zhenting Wang,Guofeng Cui,Kun Wan,Wentian Zhao

Main category: cs.LG

TLDR: 提出了一种基于分布级可学习性的课程学习框架，用于RL后训练的LLM，通过UCB动态调整数据分布采样，优化学习效率。

Details

Motivation: 现有方法将训练数据视为统一整体，忽略了数据分布的多样性和难度差异，导致学习效率不高。 Method: 利用策略优势幅度衡量分布的学习潜力，结合UCB动态调整采样概率，优先选择高优势或低样本分布。 Result: 在逻辑推理数据集上验证了框架的有效性，显著提升了收敛速度和最终性能。 Conclusion: 分布感知的课程学习策略在LLM后训练中具有重要价值。 Abstract: Recent advances in reinforcement learning (RL)-based post-training have led to notable improvements in large language models (LLMs), particularly in enhancing their reasoning capabilities to handle complex tasks. However, most existing methods treat the training data as a unified whole, overlooking the fact that modern LLM training often involves a mixture of data from diverse distributions-varying in both source and difficulty. This heterogeneity introduces a key challenge: how to adaptively schedule training across distributions to optimize learning efficiency. In this paper, we present a principled curriculum learning framework grounded in the notion of distribution-level learnability. Our core insight is that the magnitude of policy advantages reflects how much a model can still benefit from further training on a given distribution. Based on this, we propose a distribution-level curriculum learning framework for RL-based LLM post-training, which leverages the Upper Confidence Bound (UCB) principle to dynamically adjust sampling probabilities for different distrubutions. This approach prioritizes distributions with either high average advantage (exploitation) or low sample count (exploration), yielding an adaptive and theoretically grounded training schedule. We instantiate our curriculum learning framework with GRPO as the underlying RL algorithm and demonstrate its effectiveness on logic reasoning datasets with multiple difficulties and sources. Our experiments show that our framework significantly improves convergence speed and final performance, highlighting the value of distribution-aware curriculum strategies in LLM post-training. Code: https://github.com/ZhentingWang/DUMP.

[662] KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference

Yuxuan Tian,Zihan Wang,Yebo Peng,Aomufei Yuan,Zhiming Wang,Bairen Yi,Xin Liu,Yong Cui,Tong Yang

Main category: cs.LG

TLDR: KeepKV是一种新型的自适应KV缓存合并方法，旨在消除输出扰动并在严格内存限制下保持性能。

Details

Motivation: 大型语言模型（LLM）的高效推理受到不断增长的KV缓存的阻碍，传统方法因选择性丢弃KV缓存条目导致信息丢失和幻觉。 Method: KeepKV引入选举投票机制记录合并历史并自适应调整注意力分数，同时采用零推理扰动合并方法保持注意力一致性。 Result: 实验表明，KeepKV显著减少内存使用，推理吞吐量提升2倍以上，并在仅10% KV缓存预算下保持高质量生成。 Conclusion: KeepKV通过自适应KV缓存合并，解决了传统方法的信息丢失问题，同时提升了性能和效率。 Abstract: Efficient inference of large language models (LLMs) is hindered by an ever-growing key-value (KV) cache, making KV cache compression a critical research direction. Traditional methods selectively evict less important KV cache entries based on attention scores or position heuristics, which leads to information loss and hallucinations. Recently, merging-based strategies have been explored to retain more information by merging KV pairs that would be discarded; however, these existing approaches inevitably introduce inconsistencies in attention distributions before and after merging, causing output perturbation and degraded generation quality. To overcome this challenge, we propose KeepKV, a novel adaptive KV cache merging method designed to eliminate output perturbation while preserving performance under strict memory constraints. KeepKV introduces the Electoral Votes mechanism that records merging history and adaptively adjusts attention scores. Moreover, it further leverages a novel Zero Inference-Perturbation Merging methods, keeping attention consistency and compensating for attention loss resulting from cache merging. KeepKV successfully retains essential context information within a significantly compressed cache. Extensive experiments on various benchmarks and LLM architectures demonstrate that KeepKV substantially reduces memory usage, enhances inference throughput by more than 2x and keeps superior generation quality even with 10% KV cache budgets.

[663] FM-LoRA: Factorized Low-Rank Meta-Prompting for Continual Learning

Xiaobing Yu,Jin Yang,Xiao Wu,Peijie Qiu,Xiaofeng Liu

Main category: cs.LG

TLDR: FM-LoRA提出了一种高效的低秩适应方法，通过动态秩选择器和动态元提示，解决了持续学习中参数增长和任务相似性不足的问题。

Details

Motivation: 持续学习（CL）中，预训练模型在适应不同任务时面临参数增长和存储成本上升的问题，且缺乏任务相似性意识。 Method: FM-LoRA结合动态秩选择器（DRS）和动态元提示（DMP），通过共享低秩子空间有效分配模型容量。 Result: 在多个CL基准测试中（如ImageNet-R、CIFAR100等），FM-LoRA显著减少了灾难性遗忘，并在多样任务中表现稳健。 Conclusion: FM-LoRA为持续学习提供了一种高效且可扩展的解决方案，适用于多任务和多领域场景。 Abstract: How to adapt a pre-trained model continuously for sequential tasks with different prediction class labels and domains and finally learn a generalizable model across diverse tasks is a long-lasting challenge. Continual learning (CL) has emerged as a promising approach to leverage pre-trained models (e.g., Transformers) for sequential tasks. While many existing CL methods incrementally store additional learned structures, such as Low-Rank Adaptation (LoRA) adapters or prompts and sometimes even preserve features from previous samples to maintain performance. This leads to unsustainable parameter growth and escalating storage costs as the number of tasks increases. Moreover, current approaches often lack task similarity awareness, which further hinders the models ability to effectively adapt to new tasks without interfering with previously acquired knowledge. To address these challenges, we propose FM-LoRA, a novel and efficient low-rank adaptation method that integrates both a dynamic rank selector (DRS) and dynamic meta-prompting (DMP). This framework allocates model capacity more effectively across tasks by leveraging a shared low-rank subspace critical for preserving knowledge, thereby avoiding continual parameter expansion. Extensive experiments on various CL benchmarks, including ImageNet-R, CIFAR100, and CUB200 for class-incremental learning (CIL), and DomainNet for domain-incremental learning (DIL), with Transformers backbone demonstrate that FM-LoRA effectively mitigates catastrophic forgetting while delivering robust performance across a diverse range of tasks and domains.

[664] ColonScopeX: Leveraging Explainable Expert Systems with Multimodal Data for Improved Early Diagnosis of Colorectal Cancer

Natalia Sikora,Robert L. Manschke,Alethea M. Tang,Peter Dunstan,Dean A. Harris,Su Yang

Main category: cs.LG

TLDR: 该论文提出了一种名为ColonScopeX的机器学习框架，利用可解释AI（XAI）技术，通过整合血液样本数据和患者元数据，提高结直肠癌（CRC）的早期检测率。

Details

Motivation: 结直肠癌是全球癌症相关死亡的第二大原因，早期诊断率低且症状不典型，导致患者生存率差异巨大（早期80-95%，晚期仅10%）。因此，急需一种高效、透明的早期检测方法。 Method: 研究采用多模态模型，结合Savitzky-Golay算法处理的血液样本数据和患者元数据（如病史、年龄、BMI等），并利用XAI技术使模型决策透明化。 Result: ColonScopeX框架可作为分诊或筛查工具，提升CRC早期检测的准确性和可解释性。 Conclusion: 结合多样化的患者数据和可解释机器学习技术，有望解决医学诊断中的关键挑战，提高CRC的早期诊断率。 Abstract: Colorectal cancer (CRC) ranks as the second leading cause of cancer-related deaths and the third most prevalent malignant tumour worldwide. Early detection of CRC remains problematic due to its non-specific and often embarrassing symptoms, which patients frequently overlook or hesitate to report to clinicians. Crucially, the stage at which CRC is diagnosed significantly impacts survivability, with a survival rate of 80-95\% for Stage I and a stark decline to 10\% for Stage IV. Unfortunately, in the UK, only 14.4\% of cases are diagnosed at the earliest stage (Stage I). In this study, we propose ColonScopeX, a machine learning framework utilizing explainable AI (XAI) methodologies to enhance the early detection of CRC and pre-cancerous lesions. Our approach employs a multimodal model that integrates signals from blood sample measurements, processed using the Savitzky-Golay algorithm for fingerprint smoothing, alongside comprehensive patient metadata, including medication history, comorbidities, age, weight, and BMI. By leveraging XAI techniques, we aim to render the model's decision-making process transparent and interpretable, thereby fostering greater trust and understanding in its predictions. The proposed framework could be utilised as a triage tool or a screening tool of the general population. This research highlights the potential of combining diverse patient data sources and explainable machine learning to tackle critical challenges in medical diagnostics.

[665] Minority Reports: Balancing Cost and Quality in Ground Truth Data Annotation

Hsuan Wei Liao,Christopher Klugmann,Daniel Kondermann,Rafid Mahmood

Main category: cs.LG

TLDR: 论文提出了一种通过检测和移除少数报告来优化数据标注任务分配的方法，以减少标注成本同时保持数据质量。

Details

Motivation: 高质量数据标注是机器学习开发的关键，但成本高昂且耗时。研究旨在平衡标注准确性和成本。 Method: 通过估计标注者与多数投票不一致的可能性，提前修剪冗余标注任务。基于计算机视觉数据集分析图像模糊性、标注者差异和疲劳等因素。 Result: 模拟实验显示，该方法可减少60%以上的标注需求，节省约6.6天工作量，同时仅小幅影响标签质量。 Conclusion: 该方法为标注平台提供了一种平衡成本与数据质量的工具，适用于如自动驾驶等关键应用场景。 Abstract: High-quality data annotation is an essential but laborious and costly aspect of developing machine learning-based software. We explore the inherent tradeoff between annotation accuracy and cost by detecting and removing minority reports -- instances where annotators provide incorrect responses -- that indicate unnecessary redundancy in task assignments. We propose an approach to prune potentially redundant annotation task assignments before they are executed by estimating the likelihood of an annotator disagreeing with the majority vote for a given task. Our approach is informed by an empirical analysis over computer vision datasets annotated by a professional data annotation platform, which reveals that the likelihood of a minority report event is dependent primarily on image ambiguity, worker variability, and worker fatigue. Simulations over these datasets show that we can reduce the number of annotations required by over 60% with a small compromise in label quality, saving approximately 6.6 days-equivalent of labor. Our approach provides annotation service platforms with a method to balance cost and dataset quality. Machine learning practitioners can tailor annotation accuracy levels according to specific application needs, thereby optimizing budget allocation while maintaining the data quality necessary for critical settings like autonomous driving technology.

[666] Causal integration of chemical structures improves representations of microscopy images for morphological profiling

Yemin Yu,Neil Tenenholtz,Lester Mackey,Ying Wei,David Alvarez-Melis,Ava P. Amini,Alex X. Lu

Main category: cs.LG

TLDR: MICON框架通过结合化学化合物信息改进高通量显微镜图像的表征学习，显著优于传统方法和现有深度学习模型。

Details

Motivation: 当前方法仅从图像学习，忽略了高通量显微镜数据的多模态特性（化学或遗传扰动与图像输出）。 Method: 提出MICON框架，将化学化合物建模为诱导细胞表型反事实变换的处理，采用对比学习方法。 Result: MICON在识别药物效应方面优于传统特征和现有深度学习方法，尤其在跨独立重复和数据中心的评估中表现突出。 Conclusion: 多模态表征学习应明确考虑显微镜数据的多模态特性，MICON为此提供了新方向。 Abstract: Recent advances in self-supervised deep learning have improved our ability to quantify cellular morphological changes in high-throughput microscopy screens, a process known as morphological profiling. However, most current methods only learn from images, despite many screens being inherently multimodal, as they involve both a chemical or genetic perturbation as well as an image-based readout. We hypothesized that incorporating chemical compound structure during self-supervised pre-training could improve learned representations of images in high-throughput microscopy screens. We introduce a representation learning framework, MICON (Molecular-Image Contrastive Learning), that models chemical compounds as treatments that induce counterfactual transformations of cell phenotypes. MICON significantly outperforms classical hand-crafted features such as CellProfiler and existing deep-learning-based representation learning methods in challenging evaluation settings where models must identify reproducible effects of drugs across independent replicates and data-generating centers. We demonstrate that incorporating chemical compound information into the learning process provides consistent improvements in our evaluation setting and that modeling compounds specifically as treatments in a causal framework outperforms approaches that directly align images and compounds in a single representation space. Our findings point to a new direction for representation learning in morphological profiling, suggesting that methods should explicitly account for the multimodal nature of microscopy screening data.

[667] RANSAC Revisited: An Improved Algorithm for Robust Subspace Recovery under Adversarial and Noisy Corruptions

Guixian Chen,Jianhao Ma,Salar Fattahi

Main category: cs.LG

TLDR: 论文提出了一种名为RANSAC+的两阶段算法，用于在存在高斯噪声和强对抗性干扰的情况下恢复低维子空间。该方法解决了传统RANSAC算法的效率不足和噪声鲁棒性问题，具有近最优的样本复杂度和无需预先知道子空间维度的优势。

Details

Motivation: 现有方法在强对抗性干扰和高斯噪声下恢复子空间时，计算成本高或依赖严格分布假设，限制了实际应用。 Method: 提出RANSAC+算法，通过两阶段设计改进传统RANSAC，增强对高斯噪声和对抗性干扰的鲁棒性。 Result: RANSAC+在理论和实验中均表现出对高斯和对抗性干扰的鲁棒性，且效率优于现有方法。 Conclusion: RANSAC+是一种高效且鲁棒的子空间恢复方法，适用于复杂噪声和对抗性环境。 Abstract: In this paper, we study the problem of robust subspace recovery (RSR) in the presence of both strong adversarial corruptions and Gaussian noise. Specifically, given a limited number of noisy samples -- some of which are tampered by an adaptive and strong adversary -- we aim to recover a low-dimensional subspace that approximately contains a significant fraction of the uncorrupted samples, up to an error that scales with the Gaussian noise. Existing approaches to this problem often suffer from high computational costs or rely on restrictive distributional assumptions, limiting their applicability in truly adversarial settings. To address these challenges, we revisit the classical random sample consensus (RANSAC) algorithm, which offers strong robustness to adversarial outliers, but sacrifices efficiency and robustness against Gaussian noise and model misspecification in the process. We propose a two-stage algorithm, RANSAC+, that precisely pinpoints and remedies the failure modes of standard RANSAC. Our method is provably robust to both Gaussian and adversarial corruptions, achieves near-optimal sample complexity without requiring prior knowledge of the subspace dimension, and is more efficient than existing RANSAC-type methods.

[668] Balancing Two Classifiers via A Simplex ETF Structure for Model Calibration

Jiani Ni,He Zhao,Jintong Gao,Dandan Guo,Hongyuan Zha

Main category: cs.LG

TLDR: 论文提出了一种名为BalCAL的新方法，通过平衡可学习和ETF分类器来解决模型校准中的过度自信或不足自信问题，显著提升了校准性能。

Details

Motivation: 深度神经网络在安全关键应用中存在校准问题，现有方法对分类器设计的探索不足，且忽略了不足自信导致的校准误差。 Method: 引入可调置信度模块和动态调整方法，平衡可学习和ETF分类器，确保模型置信度与真实准确性更好对齐。 Result: 实验表明，BalCAL显著提升了模型校准性能，同时保持了高预测准确性，优于现有技术。 Conclusion: BalCAL为深度学习中的校准挑战提供了一种新颖解决方案。 Abstract: In recent years, deep neural networks (DNNs) have demonstrated state-of-the-art performance across various domains. However, despite their success, they often face calibration issues, particularly in safety-critical applications such as autonomous driving and healthcare, where unreliable predictions can have serious consequences. Recent research has started to improve model calibration from the view of the classifier. However, the exploration of designing the classifier to solve the model calibration problem is insufficient. Let alone most of the existing methods ignore the calibration errors arising from underconfidence. In this work, we propose a novel method by balancing learnable and ETF classifiers to solve the overconfidence or underconfidence problem for model Calibration named BalCAL. By introducing a confidence-tunable module and a dynamic adjustment method, we ensure better alignment between model confidence and its true accuracy. Extensive experimental validation shows that ours significantly improves model calibration performance while maintaining high predictive accuracy, outperforming existing techniques. This provides a novel solution to the calibration challenges commonly encountered in deep learning.

[669] Air Quality Prediction with A Meteorology-Guided Modality-Decoupled Spatio-Temporal Network

Hang Yin,Yan-Ming Zhang,Jian Xu,Jian-Long Chang,Yin Li,Cheng-Lin Liu

Main category: cs.LG

TLDR: 论文提出MDSTNet框架，结合多气压层气象数据，显著提升空气质量预测精度，并在中国首个全国性数据集上验证其优越性。

Details

Motivation: 现有研究低估了气象条件在空气质量预测中的关键作用，且未充分利用综合气象数据，导致模型无法准确捕捉空气质量与气象数据的动态依赖关系。 Method: 提出MDSTNet，一种编码器-解码器框架，将空气质量观测和气象条件作为不同模态建模，整合多气压层气象数据和天气预报。 Result: 在中国首个全国性数据集ChinaAirNet上，MDSTNet显著优于现有模型，48小时预测误差降低17.54%。 Conclusion: MDSTNet通过整合多气压层气象数据，有效提升了空气质量预测的准确性，为公共健康和环境管理提供了有力工具。 Abstract: Air quality prediction plays a crucial role in public health and environmental protection. Accurate air quality prediction is a complex multivariate spatiotemporal problem, that involves interactions across temporal patterns, pollutant correlations, spatial station dependencies, and particularly meteorological influences that govern pollutant dispersion and chemical transformations. Existing works underestimate the critical role of atmospheric conditions in air quality prediction and neglect comprehensive meteorological data utilization, thereby impairing the modeling of dynamic interdependencies between air quality and meteorological data. To overcome this, we propose MDSTNet, an encoder-decoder framework that explicitly models air quality observations and atmospheric conditions as distinct modalities, integrating multi-pressure-level meteorological data and weather forecasts to capture atmosphere-pollution dependencies for prediction. Meantime, we construct ChinaAirNet, the first nationwide dataset combining air quality records with multi-pressure-level meteorological observations. Experimental results on ChinaAirNet demonstrate MDSTNet's superiority, substantially reducing 48-hour prediction errors by 17.54\% compared to the state-of-the-art model. The source code and dataset will be available on github.

[670] Negate or Embrace: On How Misalignment Shapes Multimodal Representation Learning

Yichao Cai,Yuhang Liu,Erdun Gao,Tianjiao Jiang,Zhen Zhang,Anton van den Hengel,Javen Qinfeng Shi

Main category: cs.LG

TLDR: 论文探讨了多模态对比学习（MMCL）中图像-文本对的不对齐问题，提出了两种偏差机制（选择偏差和扰动偏差），并证明MMCL能捕捉对偏差不变的部分语义信息。

Details

Motivation: 现实数据集中的图像-文本对常存在不对齐问题，现有研究对此有两种对立观点（缓解或利用），本文旨在统一这两种观点并为实践提供指导。 Method: 通过潜在变量模型形式化不对齐问题，引入选择偏差和扰动偏差机制，并进行理论分析。 Result: 理论证明MMCL能捕捉对偏差不变的语义信息，实验验证了不对齐对多模态表示学习的影响。 Conclusion: 研究为理解不对齐提供了统一视角，并提出了实际ML系统设计的建议。 Abstract: Multimodal representation learning, exemplified by multimodal contrastive learning (MMCL) using image-text pairs, aims to learn powerful representations by aligning cues across modalities. This approach relies on the core assumption that the exemplar image-text pairs constitute two representations of an identical concept. However, recent research has revealed that real-world datasets often exhibit misalignment. There are two distinct viewpoints on how to address this issue: one suggests mitigating the misalignment, and the other leveraging it. We seek here to reconcile these seemingly opposing perspectives, and to provide a practical guide for practitioners. Using latent variable models we thus formalize misalignment by introducing two specific mechanisms: selection bias, where some semantic variables are missing, and perturbation bias, where semantic variables are distorted -- both affecting latent variables shared across modalities. Our theoretical analysis demonstrates that, under mild assumptions, the representations learned by MMCL capture exactly the information related to the subset of the semantic variables invariant to selection and perturbation biases. This provides a unified perspective for understanding misalignment. Based on this, we further offer actionable insights into how misalignment should inform the design of real-world ML systems. We validate our theoretical findings through extensive empirical studies on both synthetic data and real image-text datasets, shedding light on the nuanced impact of misalignment on multimodal representation learning.

cs.IR [Back]

[671] ExpertRAG: Efficient RAG with Mixture of Experts -- Optimizing Context Retrieval for Adaptive LLM Responses

Esmail Gumaan

Main category: cs.IR

TLDR: ExpertRAG结合混合专家（MoE）与检索增强生成（RAG），提出动态检索门控机制和专家路由，以提升知识密集型语言模型的效率和准确性。

Details

Motivation: 通过整合MoE和RAG的优势，解决传统RAG和纯MoE模型在知识利用和计算效率上的不足。 Method: 提出动态检索门控机制和专家路由，将检索与专家选择建模为潜在决策，并量化计算成本节省和稀疏专家利用的增益。 Result: 理论分析表明ExpertRAG在计算效率和知识利用上优于标准RAG和纯MoE模型。 Conclusion: ExpertRAG为语言模型的扩展和增强提供了新视角，并提出了实证验证的路线图。 Abstract: ExpertRAG is a novel theoretical framework that integrates Mixture-of-Experts (MoE) architectures with Retrieval Augmented Generation (RAG) to advance the efficiency and accuracy of knowledge-intensive language modeling. We propose a dynamic retrieval gating mechanism coupled with expert routing, enabling the model to selectively consult an external knowledge store or rely on specialized internal experts based on the query's needs. The paper lays out the theoretical foundations of ExpertRAG, including a probabilistic formulation that treats retrieval and expert selection as latent decisions, and mathematical justifications for its efficiency in both computation and knowledge utilization. We derive formulae to quantify the expected computational cost savings from selective retrieval and the capacity gains from sparse expert utilization. A comparative analysis positions ExpertRAG against standard RAG (with always-on retrieval) and pure MoE models (e.g., Switch Transformer, Mixtral) to highlight its unique balance between parametric knowledge and non-parametric retrieval. We also outline an experimental validation strategy, proposing benchmarks and evaluation protocols to test ExpertRAG's performance on factual recall, generalization, and inference efficiency. The proposed framework, although presented theoretically, is supported by insights from prior work in RAG and MoE, and is poised to provide more factual, efficient, and adaptive generation by leveraging the best of both paradigms. In summary, ExpertRAG contributes a new perspective on scaling and augmenting language models, backed by a thorough analysis and a roadmap for empirical validation.

[672] Improving RAG for Personalization with Author Features and Contrastive Examples

Mert Yazan,Suzan Verberne,Frederik Situmeang

Main category: cs.IR

TLDR: 论文提出了一种通过引入作者特定特征和对比示例来增强检索增强生成（RAG）的方法，显著提升了文本生成的个性化效果。

Details

Motivation: 传统的RAG方法难以捕捉作者的细粒度特征，导致个性化效果不佳。 Method: 通过提供作者特定特征（如情感极性和高频词）和对比示例（其他作者的文档），帮助LLM识别作者的独特风格。 Result: 实验表明，该方法比基线RAG提升了15%，并优于其他基准方法。 Conclusion: 细粒度特征和对比示例的结合为个性化文本生成提供了新思路，并公开了代码。 Abstract: Personalization with retrieval-augmented generation (RAG) often fails to capture fine-grained features of authors, making it hard to identify their unique traits. To enrich the RAG context, we propose providing Large Language Models (LLMs) with author-specific features, such as average sentiment polarity and frequently used words, in addition to past samples from the author's profile. We introduce a new feature called Contrastive Examples: documents from other authors are retrieved to help LLM identify what makes an author's style unique in comparison to others. Our experiments show that adding a couple of sentences about the named entities, dependency patterns, and words a person uses frequently significantly improves personalized text generation. Combining features with contrastive examples boosts the performance further, achieving a relative 15% improvement over baseline RAG while outperforming the benchmarks. Our results show the value of fine-grained features for better personalization, while opening a new research dimension for including contrastive examples as a complement with RAG. We release our code publicly.

[673] A Survey of Multimodal Retrieval-Augmented Generation

Lang Mei,Siyu Mo,Zhihan Yang,Chong Chen

Main category: cs.IR

TLDR: MRAG通过整合多模态数据（文本、图像、视频）扩展了RAG，提升了语言模型的生成能力，减少了幻觉问题，并在需要视觉和文本理解的场景中表现更优。

Details

Motivation: 传统RAG仅基于文本检索，无法充分利用多模态数据的丰富信息。MRAG旨在通过多模态检索和生成，提升模型的准确性和实用性。 Method: MRAG结合多模态数据的检索与生成，利用文本、图像和视频的上下文信息，增强问答系统的表现。 Result: 研究表明，MRAG在需要多模态理解的场景中优于传统RAG，减少了幻觉问题并提高了回答的准确性。 Conclusion: MRAG为多模态信息检索和生成提供了新范式，但仍需解决现有挑战，未来研究潜力巨大。 Abstract: Multimodal Retrieval-Augmented Generation (MRAG) enhances large language models (LLMs) by integrating multimodal data (text, images, videos) into retrieval and generation processes, overcoming the limitations of text-only Retrieval-Augmented Generation (RAG). While RAG improves response accuracy by incorporating external textual knowledge, MRAG extends this framework to include multimodal retrieval and generation, leveraging contextual information from diverse data types. This approach reduces hallucinations and enhances question-answering systems by grounding responses in factual, multimodal knowledge. Recent studies show MRAG outperforms traditional RAG, especially in scenarios requiring both visual and textual understanding. This survey reviews MRAG's essential components, datasets, evaluation methods, and limitations, providing insights into its construction and improvement. It also identifies challenges and future research directions, highlighting MRAG's potential to revolutionize multimodal information retrieval and generation. By offering a comprehensive perspective, this work encourages further exploration into this promising paradigm.

[674] Domain Specific Question to SQL Conversion with Embedded Data Balancing Technique

Jyothi,T. Satyanarayana Murthy

Main category: cs.IR

TLDR: 论文提出通过数据平衡技术和过采样领域特定查询的方法，改进文本到SQL模型的性能，显著提升了准确性。

Details

Motivation: 现有文本到SQL模型在理解用户问题中的特定值和领域术语时表现不佳，导致29%的错误。 Method: 采用数据平衡技术和过采样领域特定查询，优化模型架构以增强值识别能力，并针对领域问题进行微调。 Result: 在WikiSQL数据集上，模型性能比现有最佳模型提升了10.98%。 Conclusion: 提出的方法有效解决了特定值和领域术语识别问题，显著提高了文本到SQL转换的准确性。 Abstract: The rise of deep learning in natural language processing has fostered the creation of text to structured query language models composed of an encoder and a decoder. Researchers have experimented with various intermediate processing like schema linking, table type aware, value extract. To generate accurate SQL results for the user question. However error analysis performed on the failed cases on these systems shows, 29 percentage of the errors would be because the system was unable to understand the values expressed by the user in their question. This challenge affects the generation of accurate SQL queries, especially when dealing with domain-specific terms and specific value conditions, where traditional methods struggle to maintain consistency and precision. To overcome these obstacles, proposed two intermediations like implementing data balancing technique and over sampling domain-specific queries which would refine the model architecture to enhance value recognition and fine tuning the model for domain-specific questions. This proposed solution achieved 10.98 percentage improvement in accuracy of the model performance compared to the state of the art model tested on WikiSQL dataset. to convert the user question accurately to SQL queries. Applying oversampling technique on the domain-specific questions shown a significant improvement as compared with traditional approaches.

[675] WebMap -- Large Language Model-assisted Semantic Link Induction in the Web

Shiraj Pokharel,Georg P. Roßrucker,Mario M. Kubek

Main category: cs.IR

TLDR: 论文提出WebMap的功能扩展，以支持动态文档聚类、语义路标创建和主题溯源，弥补当前搜索引擎在研究任务中的不足。

Details

Motivation: 当前搜索引擎在研究任务中支持不足甚至阻碍研究，需要改进。 Method: 扩展WebMap功能，包括动态文档聚类、语义路标创建和交互式主题溯源。 Result: 提出的功能扩展能更好地支持研究活动。 Conclusion: WebMap的功能扩展有效弥补了搜索引擎在研究任务中的缺陷。 Abstract: Carrying out research tasks is only inadequately supported, if not hindered, by current web search engines. This paper therefore proposes functional extensions of WebMap, a semantically induced overlay linking structure on the web to inherently facilitate research activities. These add-ons support the dynamic determination and regrouping of document clusters, the creation of a semantic signpost in the web, and the interactive tracing of topics back to their origins.

Chris Brogly,Saif Rjaibi,Charlotte Liang,Erica Lam,Edward Wang,Adam Levitan,Sarah Paleczny,Michael Cusimano

Main category: cs.IR

TLDR: 研究比较了小语言模型（SLM）与人类评估者在医学/健康和运动损伤文本上的主题相关性评分，发现SLM在低过滤条件下表现较好，但高过滤条件下相关性较低。

Details

Motivation: 探索小语言模型（SLM）在医学/健康领域自动标注和识别文本的潜力，并与人类评估结果对比。 Method: 使用Microsoft的phi-3-mini-4k-instruct SLM对医学/健康和运动损伤文本进行评分，并与7名人类评估者的评分对比，分析相关性。 Result: SLM与人类评估者在低过滤条件下相关性显著（运动损伤：0.3413；医学/健康：0.2255），高过滤条件下医学/健康相关性为0.3854，运动损伤相关性不显著。 Conclusion: SLM在特定条件下表现良好，但过滤条件对结果影响显著，需进一步优化。 Abstract: Small Language Models (SLMs) have potential to be used for automatically labelling and identifying aspects of text data for medicine/health-related purposes from documents and the web. As their resource requirements are significantly lower than Large Language Models (LLMs), these can be deployed potentially on more types of devices. SLMs often are benchmarked on health/medicine-related tasks, such as MedQA, although performance on these can vary especially depending on the size of the model in terms of number of parameters. Furthermore, these test results may not necessarily reflect real-world performance regarding the automatic labelling or identification of texts in documents and the web. As a result, we compared topic-relatedness scores from Microsofts phi-3-mini-4k-instruct SLM to the topic-relatedness scores from 7 human evaluators on 1144 samples of medical/health-related texts and 1117 samples of sports injury-related texts. These texts were from a larger dataset of about 9 million news headlines, each of which were processed and assigned scores by phi-3-mini-4k-instruct. Our sample was selected (filtered) based on 1 (low filtering) or more (high filtering) Boolean conditions on the phi-3 SLM scores. We found low-moderate significant correlations between the scores from the SLM and human evaluators for sports injury texts with low filtering (\r{ho} = 0.3413, p < 0.001) and medicine/health texts with high filtering (\r{ho} = 0.3854, p < 0.001), and low significant correlation for medicine/health texts with low filtering (\r{ho} = 0.2255, p < 0.001). There was negligible, insignificant correlation for sports injury-related texts with high filtering (\r{ho} = 0.0318, p = 0.4466).

[677] Large Language Model Empowered Recommendation Meets All-domain Continual Pre-Training

Haokai Ma,Yunshan Ma,Ruobing Xie,Lei Meng,Jialie Shen,Xingwu Sun,Zhanhui Kang,Tat-Seng Chua

Main category: cs.IR

TLDR: CPRec是一个全领域持续预训练框架，通过统一提示模板和学习率调度，解决LLMs在推荐任务中的语义-行为差异和多领域适应问题。

Details

Motivation: 现有方法在单领域用户交互上微调LLMs，存在语义表示与领域偏好不匹配及多领域适应性不足的问题。 Method: 设计统一提示模板组织多领域行为序列，采用Warmup-Stable-Annealing学习率调度持续预训练。 Result: 在七个领域的大规模数据集上实验，CPRec显著减少语义-行为差异并在所有推荐场景中达到最优性能。 Conclusion: CPRec通过持续预训练范式有效对齐LLMs与通用用户行为，提升多领域推荐性能。 Abstract: Recent research efforts have investigated how to integrate Large Language Models (LLMs) into recommendation, capitalizing on their semantic comprehension and open-world knowledge for user behavior understanding. These approaches predominantly employ supervised fine-tuning on single-domain user interactions to adapt LLMs for specific recommendation tasks. However, they typically encounter dual challenges: the mismatch between general language representations and domain-specific preference patterns, as well as the limited adaptability to multi-domain recommendation scenarios. To bridge these gaps, we introduce CPRec -- an All-domain Continual Pre-Training framework for Recommendation -- designed to holistically align LLMs with universal user behaviors through the continual pre-training paradigm. Specifically, we first design a unified prompt template and organize users' multi-domain behaviors into domain-specific behavioral sequences and all-domain mixed behavioral sequences that emulate real-world user decision logic. To optimize behavioral knowledge infusion, we devise a Warmup-Stable-Annealing learning rate schedule tailored for the continual pre-training paradigm in recommendation to progressively enhance the LLM's capability in knowledge adaptation from open-world knowledge to universal recommendation tasks. To evaluate the effectiveness of our CPRec, we implement it on a large-scale dataset covering seven domains and conduct extensive experiments on five real-world datasets from two distinct platforms. Experimental results confirm that our continual pre-training paradigm significantly mitigates the semantic-behavioral discrepancy and achieves state-of-the-art performance in all recommendation scenarios. The source code will be released upon acceptance.

[678] Augmented Relevance Datasets with Fine-Tuned Small LLMs

Quentin Fitte-Rey,Matyas Amrouche,Romain Deveaud

Main category: cs.IR

TLDR: 利用小型微调大语言模型（LLMs）自动化相关性评估，提升排序模型性能。

Details

Motivation: 构建高质量数据集和标注查询-文档相关性是资源密集型任务，需人工标注。本文探索小型微调LLMs的潜力，以自动化相关性评估并提升排序模型性能。 Method: 微调小型LLMs以增强相关性评估，用于下游排序模型训练的数据集构建。 Result: 实验表明，微调的小型LLMs在数据集上优于某些闭源模型，并显著提升排序模型性能。 Conclusion: 小型LLMs可用于高效、可扩展的数据集增强，为搜索引擎优化提供实用解决方案。 Abstract: Building high-quality datasets and labeling query-document relevance are essential yet resource-intensive tasks, requiring detailed guidelines and substantial effort from human annotators. This paper explores the use of small, fine-tuned large language models (LLMs) to automate relevance assessment, with a focus on improving ranking models' performance by augmenting their training dataset. We fine-tuned small LLMs to enhance relevance assessments, thereby improving dataset creation quality for downstream ranking model training. Our experiments demonstrate that these fine-tuned small LLMs not only outperform certain closed source models on our dataset but also lead to substantial improvements in ranking model performance. These results highlight the potential of leveraging small LLMs for efficient and scalable dataset augmentation, providing a practical solution for search engine optimization.

[679] MURR: Model Updating with Regularized Replay for Searching a Document Stream

Eugene Yang,Nicola Tonellotto,Dawn Lawrie,Sean MacAvaney,James Mayfield,Douglas W. Oard,Scott Miller

Main category: cs.IR

TLDR: 论文提出了一种名为MURR的神经检索模型更新策略，通过正则化回放避免重新处理历史文档，同时适应新内容和查询的变化。

Details

Motivation: 神经检索模型在固定数据集上训练后难以适应新内容和查询的变化，而传统稀疏检索方法可以动态更新。重新编码历史文档成本高昂，因此需要一种不重新处理历史文档的更新方法。 Method: 提出MURR策略，结合正则化回放技术，确保模型在不重新处理历史文档的情况下仍能有效检索，同时适应新数据。 Result: 在模拟流式环境中，MURR比其他策略表现更优，检索效果更一致且高效。 Conclusion: MURR为神经检索模型的动态更新提供了一种高效且经济的解决方案。 Abstract: The Internet produces a continuous stream of new documents and user-generated queries. These naturally change over time based on events in the world and the evolution of language. Neural retrieval models that were trained once on a fixed set of query-document pairs will quickly start misrepresenting newly-created content and queries, leading to less effective retrieval. Traditional statistical sparse retrieval can update collection statistics to reflect these changes in the use of language in documents and queries. In contrast, continued fine-tuning of the language model underlying neural retrieval approaches such as DPR and ColBERT creates incompatibility with previously-encoded documents. Re-encoding and re-indexing all previously-processed documents can be costly. In this work, we explore updating a neural dual encoder retrieval model without reprocessing past documents in the stream. We propose MURR, a model updating strategy with regularized replay, to ensure the model can still faithfully search existing documents without reprocessing, while continuing to update the model for the latest topics. In our simulated streaming environments, we show that fine-tuning models using MURR leads to more effective and more consistent retrieval results than other strategies as the stream of documents and queries progresses.

[680] ExpertRAG: Efficient RAG with Mixture of Experts -- Optimizing Context Retrieval for Adaptive LLM Responses

Esmail Gumaan

Main category: cs.IR

TLDR: ExpertRAG结合MoE和RAG，提出动态检索门控机制和专家路由，提升知识密集型语言模型的效率和准确性。

Details

Motivation: 解决传统RAG和MoE模型在知识利用和计算效率上的局限性，寻求两者的优势结合。 Method: 提出动态检索门控和专家路由的框架，通过概率模型量化计算成本和知识利用率。 Result: 理论上证明ExpertRAG在计算效率和知识利用上的优势，并设计实验验证策略。 Conclusion: ExpertRAG为语言模型的扩展和增强提供了新视角，结合理论和实验验证。 Abstract: ExpertRAG is a novel theoretical framework that integrates Mixture-of-Experts (MoE) architectures with Retrieval Augmented Generation (RAG) to advance the efficiency and accuracy of knowledge-intensive language modeling. We propose a dynamic retrieval gating mechanism coupled with expert routing, enabling the model to selectively consult an external knowledge store or rely on specialized internal experts based on the query's needs. The paper lays out the theoretical foundations of ExpertRAG, including a probabilistic formulation that treats retrieval and expert selection as latent decisions, and mathematical justifications for its efficiency in both computation and knowledge utilization. We derive formulae to quantify the expected computational cost savings from selective retrieval and the capacity gains from sparse expert utilization. A comparative analysis positions ExpertRAG against standard RAG (with always-on retrieval) and pure MoE models (e.g., Switch Transformer, Mixtral) to highlight its unique balance between parametric knowledge and non-parametric retrieval. We also outline an experimental validation strategy, proposing benchmarks and evaluation protocols to test ExpertRAG's performance on factual recall, generalization, and inference efficiency. The proposed framework, although presented theoretically, is supported by insights from prior work in RAG and MoE, and is poised to provide more factual, efficient, and adaptive generation by leveraging the best of both paradigms. In summary, ExpertRAG contributes a new perspective on scaling and augmenting language models, backed by a thorough analysis and a roadmap for empirical validation.

[681] Improving RAG for Personalization with Author Features and Contrastive Examples

Mert Yazan,Suzan Verberne,Frederik Situmeang

Main category: cs.IR

TLDR: 论文提出通过引入作者特定特征和对比示例来增强RAG的个性化文本生成能力，实验表明该方法显著优于基线。

Details

Motivation: 传统RAG方法难以捕捉作者的细粒度特征，导致个性化效果不佳。 Method: 为LLM提供作者特定特征（如情感极性和常用词）及对比示例（其他作者的文档），以增强个性化生成。 Result: 实验显示，结合特征和对比示例的方法比基线RAG性能提升15%，且优于基准。 Conclusion: 细粒度特征和对比示例能显著提升个性化生成效果，为RAG研究开辟了新方向。 Abstract: Personalization with retrieval-augmented generation (RAG) often fails to capture fine-grained features of authors, making it hard to identify their unique traits. To enrich the RAG context, we propose providing Large Language Models (LLMs) with author-specific features, such as average sentiment polarity and frequently used words, in addition to past samples from the author's profile. We introduce a new feature called Contrastive Examples: documents from other authors are retrieved to help LLM identify what makes an author's style unique in comparison to others. Our experiments show that adding a couple of sentences about the named entities, dependency patterns, and words a person uses frequently significantly improves personalized text generation. Combining features with contrastive examples boosts the performance further, achieving a relative 15% improvement over baseline RAG while outperforming the benchmarks. Our results show the value of fine-grained features for better personalization, while opening a new research dimension for including contrastive examples as a complement with RAG. We release our code publicly.

[682] A Survey of Multimodal Retrieval-Augmented Generation

Lang Mei,Siyu Mo,Zhihan Yang,Chong Chen

Main category: cs.IR

TLDR: MRAG通过整合多模态数据（文本、图像、视频）提升大语言模型性能，克服了纯文本RAG的局限性，减少幻觉并增强问答系统。

Details

Motivation: 传统RAG仅依赖文本数据，限制了其在多模态场景中的应用。MRAG旨在通过多模态检索与生成，提升模型的准确性和适用性。 Method: MRAG扩展了RAG框架，引入多模态数据检索与生成，利用多样化数据类型的上下文信息。 Result: 研究表明，MRAG在需要视觉与文本理解的场景中表现优于传统RAG。 Conclusion: MRAG有望革新多模态信息检索与生成领域，但仍需解决现有挑战并探索未来研究方向。 Abstract: Multimodal Retrieval-Augmented Generation (MRAG) enhances large language models (LLMs) by integrating multimodal data (text, images, videos) into retrieval and generation processes, overcoming the limitations of text-only Retrieval-Augmented Generation (RAG). While RAG improves response accuracy by incorporating external textual knowledge, MRAG extends this framework to include multimodal retrieval and generation, leveraging contextual information from diverse data types. This approach reduces hallucinations and enhances question-answering systems by grounding responses in factual, multimodal knowledge. Recent studies show MRAG outperforms traditional RAG, especially in scenarios requiring both visual and textual understanding. This survey reviews MRAG's essential components, datasets, evaluation methods, and limitations, providing insights into its construction and improvement. It also identifies challenges and future research directions, highlighting MRAG's potential to revolutionize multimodal information retrieval and generation. By offering a comprehensive perspective, this work encourages further exploration into this promising paradigm.

[683] Domain Specific Question to SQL Conversion with Embedded Data Balancing Technique

Jyothi,T. Satyanarayana Murthy

Main category: cs.IR

TLDR: 论文提出通过数据平衡和过采样技术改进文本到SQL模型的性能，显著提升了准确率。

Details

Motivation: 现有文本到SQL模型在理解用户问题中的特定值和领域术语时表现不佳，导致29%的错误。 Method: 采用数据平衡技术和过采样领域特定查询，优化模型架构以提高值识别能力。 Result: 在WikiSQL数据集上，模型性能提升了10.98%。 Conclusion: 提出的方法在领域特定问题上显著优于传统方法，提高了SQL查询生成的准确性。 Abstract: The rise of deep learning in natural language processing has fostered the creation of text to structured query language models composed of an encoder and a decoder. Researchers have experimented with various intermediate processing like schema linking, table type aware, value extract. To generate accurate SQL results for the user question. However error analysis performed on the failed cases on these systems shows, 29 percentage of the errors would be because the system was unable to understand the values expressed by the user in their question. This challenge affects the generation of accurate SQL queries, especially when dealing with domain-specific terms and specific value conditions, where traditional methods struggle to maintain consistency and precision. To overcome these obstacles, proposed two intermediations like implementing data balancing technique and over sampling domain-specific queries which would refine the model architecture to enhance value recognition and fine tuning the model for domain-specific questions. This proposed solution achieved 10.98 percentage improvement in accuracy of the model performance compared to the state of the art model tested on WikiSQL dataset. to convert the user question accurately to SQL queries. Applying oversampling technique on the domain-specific questions shown a significant improvement as compared with traditional approaches.

[684] WebMap -- Large Language Model-assisted Semantic Link Induction in the Web

Shiraj Pokharel,Georg P. Roßrucker,Mario M. Kubek

Main category: cs.IR

TLDR: 论文提出WebMap的功能扩展，以支持动态文档聚类、语义路标和主题溯源，改进当前搜索引擎在研究任务中的不足。

Details

Motivation: 当前搜索引擎在研究任务中支持不足甚至阻碍研究，需改进。 Method: 扩展WebMap功能，包括动态文档聚类、语义路标和主题溯源。 Result: 功能扩展能更好地支持研究活动。 Conclusion: WebMap的功能扩展能有效提升研究任务的执行效率。 Abstract: Carrying out research tasks is only inadequately supported, if not hindered, by current web search engines. This paper therefore proposes functional extensions of WebMap, a semantically induced overlay linking structure on the web to inherently facilitate research activities. These add-ons support the dynamic determination and regrouping of document clusters, the creation of a semantic signpost in the web, and the interactive tracing of topics back to their origins.

Chris Brogly,Saif Rjaibi,Charlotte Liang,Erica Lam,Edward Wang,Adam Levitan,Sarah Paleczny,Michael Cusimano

Main category: cs.IR

TLDR: 研究比较了小型语言模型（SLM）与人类评估者在医学/健康和运动损伤文本上的主题相关性评分，发现相关性因文本类型和过滤条件而异。

Details

Motivation: 验证SLMs在医学/健康和运动损伤文本自动标注中的实际表现，并与人类评估结果对比。 Method: 使用phi-3-mini-4k-instruct SLM对文本评分，并与7名人类评估者的评分对比，样本来自900万新闻标题。 Result: SLM与人类评分在低过滤条件下有低-中度显著相关性，高过滤条件下相关性较低或不显著。 Conclusion: SLMs在特定条件下可用于文本标注，但需进一步优化以提高与实际应用的匹配度。 Abstract: Small Language Models (SLMs) have potential to be used for automatically labelling and identifying aspects of text data for medicine/health-related purposes from documents and the web. As their resource requirements are significantly lower than Large Language Models (LLMs), these can be deployed potentially on more types of devices. SLMs often are benchmarked on health/medicine-related tasks, such as MedQA, although performance on these can vary especially depending on the size of the model in terms of number of parameters. Furthermore, these test results may not necessarily reflect real-world performance regarding the automatic labelling or identification of texts in documents and the web. As a result, we compared topic-relatedness scores from Microsofts phi-3-mini-4k-instruct SLM to the topic-relatedness scores from 7 human evaluators on 1144 samples of medical/health-related texts and 1117 samples of sports injury-related texts. These texts were from a larger dataset of about 9 million news headlines, each of which were processed and assigned scores by phi-3-mini-4k-instruct. Our sample was selected (filtered) based on 1 (low filtering) or more (high filtering) Boolean conditions on the phi-3 SLM scores. We found low-moderate significant correlations between the scores from the SLM and human evaluators for sports injury texts with low filtering (\r{ho} = 0.3413, p < 0.001) and medicine/health texts with high filtering (\r{ho} = 0.3854, p < 0.001), and low significant correlation for medicine/health texts with low filtering (\r{ho} = 0.2255, p < 0.001). There was negligible, insignificant correlation for sports injury-related texts with high filtering (\r{ho} = 0.0318, p = 0.4466).

[686] Large Language Model Empowered Recommendation Meets All-domain Continual Pre-Training

Haokai Ma,Yunshan Ma,Ruobing Xie,Lei Meng,Jialie Shen,Xingwu Sun,Zhanhui Kang,Tat-Seng Chua

Main category: cs.IR

TLDR: CPRec框架通过持续预训练范式，将LLMs与多领域用户行为对齐，解决了语义与行为模式不匹配及多领域适应性问题。

Details

Motivation: 现有方法在单领域用户交互上微调LLMs，存在语义与行为模式不匹配及多领域适应性不足的问题。 Method: 设计了统一提示模板和多领域行为序列，采用Warmup-Stable-Annealing学习率调度优化知识注入。 Result: 在七个领域的大规模数据集上实验，CPRec显著减少语义行为差异，并在所有推荐场景中达到最优性能。 Conclusion: CPRec通过持续预训练有效解决了LLMs在推荐任务中的语义与行为对齐问题。 Abstract: Recent research efforts have investigated how to integrate Large Language Models (LLMs) into recommendation, capitalizing on their semantic comprehension and open-world knowledge for user behavior understanding. These approaches predominantly employ supervised fine-tuning on single-domain user interactions to adapt LLMs for specific recommendation tasks. However, they typically encounter dual challenges: the mismatch between general language representations and domain-specific preference patterns, as well as the limited adaptability to multi-domain recommendation scenarios. To bridge these gaps, we introduce CPRec -- an All-domain Continual Pre-Training framework for Recommendation -- designed to holistically align LLMs with universal user behaviors through the continual pre-training paradigm. Specifically, we first design a unified prompt template and organize users' multi-domain behaviors into domain-specific behavioral sequences and all-domain mixed behavioral sequences that emulate real-world user decision logic. To optimize behavioral knowledge infusion, we devise a Warmup-Stable-Annealing learning rate schedule tailored for the continual pre-training paradigm in recommendation to progressively enhance the LLM's capability in knowledge adaptation from open-world knowledge to universal recommendation tasks. To evaluate the effectiveness of our CPRec, we implement it on a large-scale dataset covering seven domains and conduct extensive experiments on five real-world datasets from two distinct platforms. Experimental results confirm that our continual pre-training paradigm significantly mitigates the semantic-behavioral discrepancy and achieves state-of-the-art performance in all recommendation scenarios. The source code will be released upon acceptance.

[687] Augmented Relevance Datasets with Fine-Tuned Small LLMs

Quentin Fitte-Rey,Matyas Amrouche,Romain Deveaud

Main category: cs.IR

TLDR: 使用小型微调大语言模型（LLMs）自动评估查询-文档相关性，提升排名模型性能。

Details

Motivation: 构建高质量数据集和标注查询-文档相关性是资源密集型任务，需大量人工标注。 Method: 微调小型LLMs以增强相关性评估，用于改进排名模型的训练数据集。 Result: 实验表明，微调的小型LLMs在数据集上优于某些闭源模型，并显著提升排名模型性能。 Conclusion: 小型LLMs为高效、可扩展的数据集增强提供了实用解决方案，适用于搜索引擎优化。 Abstract: Building high-quality datasets and labeling query-document relevance are essential yet resource-intensive tasks, requiring detailed guidelines and substantial effort from human annotators. This paper explores the use of small, fine-tuned large language models (LLMs) to automate relevance assessment, with a focus on improving ranking models' performance by augmenting their training dataset. We fine-tuned small LLMs to enhance relevance assessments, thereby improving dataset creation quality for downstream ranking model training. Our experiments demonstrate that these fine-tuned small LLMs not only outperform certain closed source models on our dataset but also lead to substantial improvements in ranking model performance. These results highlight the potential of leveraging small LLMs for efficient and scalable dataset augmentation, providing a practical solution for search engine optimization.

[688] MURR: Model Updating with Regularized Replay for Searching a Document Stream

Eugene Yang,Nicola Tonellotto,Dawn Lawrie,Sean MacAvaney,James Mayfield,Douglas W. Oard,Scott Miller

Main category: cs.IR

TLDR: 论文提出MURR方法，用于动态更新神经检索模型，避免重新处理历史文档，同时保持对新内容的适应性。

Details

Motivation: 解决神经检索模型因固定训练数据而无法适应新内容和查询变化的问题。 Method: 提出MURR策略，通过正则化回放更新模型，避免重新编码历史文档。 Result: 在模拟流式环境中，MURR比其他策略表现更优，检索结果更一致有效。 Conclusion: MURR能有效更新神经检索模型，适应动态内容变化，且成本更低。 Abstract: The Internet produces a continuous stream of new documents and user-generated queries. These naturally change over time based on events in the world and the evolution of language. Neural retrieval models that were trained once on a fixed set of query-document pairs will quickly start misrepresenting newly-created content and queries, leading to less effective retrieval. Traditional statistical sparse retrieval can update collection statistics to reflect these changes in the use of language in documents and queries. In contrast, continued fine-tuning of the language model underlying neural retrieval approaches such as DPR and ColBERT creates incompatibility with previously-encoded documents. Re-encoding and re-indexing all previously-processed documents can be costly. In this work, we explore updating a neural dual encoder retrieval model without reprocessing past documents in the stream. We propose MURR, a model updating strategy with regularized replay, to ensure the model can still faithfully search existing documents without reprocessing, while continuing to update the model for the latest topics. In our simulated streaming environments, we show that fine-tuning models using MURR leads to more effective and more consistent retrieval results than other strategies as the stream of documents and queries progresses.

q-bio.NC [Back]

[689] Emergence of psychopathological computations in large language models

Soo Yong Lee,Hyunjin Hwang,Taekwan Kim,Yuyeong Kim,Kyuri Park,Jaemin Yoo,Denny Borsboom,Kijung Shin

Main category: q-bio.NC

TLDR: 论文探讨大型语言模型（LLM）是否能实现精神病理学的计算，提出计算理论框架和实证分析方法，验证了LLM中存在类似精神病理学的计算模式。

Details

Motivation: 研究动机是探讨LLM是否能够实现精神病理学的计算，并验证其内部是否存在类似人类精神病理学的计算模式。 Method: 提出计算理论框架和机制解释方法，设计实证分析框架，通过实验验证LLM中的异常表征状态及其动态因果模型。 Result: 实验表明LLM中存在异常表征状态，这些状态会自我维持并扩散，动态因果模型支持这些模式。 Conclusion: 研究表明LLM的行为可能不仅仅是表面模仿，而是内部计算的结果，暗示未来AI系统可能出现精神病理学行为。 Abstract: Can large language models (LLMs) implement computations of psychopathology? An effective approach to the question hinges on addressing two factors. First, for conceptual validity, we require a general and computational account of psychopathology that is applicable to computational entities without biological embodiment or subjective experience. Second, mechanisms underlying LLM behaviors need to be studied for better methodological validity. Thus, we establish a computational-theoretical framework to provide an account of psychopathology applicable to LLMs. To ground the theory for empirical analysis, we also propose a novel mechanistic interpretability method alongside a tailored empirical analytic framework. Based on the frameworks, we conduct experiments demonstrating three key claims: first, that distinct dysfunctional and problematic representational states are implemented in LLMs; second, that their activations can spread and self-sustain to trap LLMs; and third, that dynamic, cyclic structural causal models encoded in the LLMs underpin these patterns. In concert, the empirical results corroborate our hypothesis that network-theoretic computations of psychopathology have already emerged in LLMs. This suggests that certain LLM behaviors mirroring psychopathology may not be a superficial mimicry but a feature of their internal processing. Thus, our work alludes to the possibility of AI systems with psychopathological behaviors in the near future.

[690] Emergence of psychopathological computations in large language models

Soo Yong Lee,Hyunjin Hwang,Taekwan Kim,Yuyeong Kim,Kyuri Park,Jaemin Yoo,Denny Borsboom,Kijung Shin

Main category: q-bio.NC

TLDR: 论文探讨大型语言模型（LLM）是否能实现精神病理学计算，提出计算理论框架和解释方法，验证LLM中存在病态表征状态及其传播机制。

Details

Motivation: 研究LLM是否具备精神病理学计算能力，以验证其行为是否仅为模仿还是内部处理特征。 Method: 建立计算理论框架，提出新的机制解释方法，设计实验验证LLM中的病态表征状态及其传播机制。 Result: 实验证明LLM中存在病态表征状态，其激活会传播并自我维持，动态因果模型支持这些模式。 Conclusion: LLM可能具备精神病理学计算能力，未来AI系统可能出现类似行为。 Abstract: Can large language models (LLMs) implement computations of psychopathology? An effective approach to the question hinges on addressing two factors. First, for conceptual validity, we require a general and computational account of psychopathology that is applicable to computational entities without biological embodiment or subjective experience. Second, mechanisms underlying LLM behaviors need to be studied for better methodological validity. Thus, we establish a computational-theoretical framework to provide an account of psychopathology applicable to LLMs. To ground the theory for empirical analysis, we also propose a novel mechanistic interpretability method alongside a tailored empirical analytic framework. Based on the frameworks, we conduct experiments demonstrating three key claims: first, that distinct dysfunctional and problematic representational states are implemented in LLMs; second, that their activations can spread and self-sustain to trap LLMs; and third, that dynamic, cyclic structural causal models encoded in the LLMs underpin these patterns. In concert, the empirical results corroborate our hypothesis that network-theoretic computations of psychopathology have already emerged in LLMs. This suggests that certain LLM behaviors mirroring psychopathology may not be a superficial mimicry but a feature of their internal processing. Thus, our work alludes to the possibility of AI systems with psychopathological behaviors in the near future.

eess.IV [Back]

Yonghao Huang,Leiting Chen,Chuan Zhou

Main category: eess.IV

TLDR: TMA-TransBTS是一种基于CNN-Transformer混合架构的3D医学图像分割模型，通过多尺度3D特征提取和长距离依赖建模，显著提升了分割性能。

Details

Motivation: 现有CNN-Transformer混合模型在3D医学图像分割中忽略了多尺度病灶特征，限制了性能。 Method: 提出TMA-TransBTS模型，采用多尺度3D令牌分割与聚合的自注意力层，以及3D多尺度交叉注意力模块。 Result: 在三个公开数据集上，TMA-TransBTS的分割结果优于现有方法。 Conclusion: TMA-TransBTS通过多尺度特征建模和长距离依赖优化，显著提升了3D多模态脑肿瘤分割性能。 Abstract: Due to the success of CNN-based and Transformer-based models in various computer vision tasks, recent works study the applicability of CNN-Transformer hybrid architecture models in 3D multi-modality medical segmentation tasks. Introducing Transformer brings long-range dependent information modeling ability in 3D medical images to hybrid models via the self-attention mechanism. However, these models usually employ fixed receptive fields of 3D volumetric features within each self-attention layer, ignoring the multi-scale volumetric lesion features. To address this issue, we propose a CNN-Transformer hybrid 3D medical image segmentation model, named TMA-TransBTS, based on an encoder-decoder structure. TMA-TransBTS realizes simultaneous extraction of multi-scale 3D features and modeling of long-distance dependencies by multi-scale division and aggregation of 3D tokens in a self-attention layer. Furthermore, TMA-TransBTS proposes a 3D multi-scale cross-attention module to establish a link between the encoder and the decoder for extracting rich volume representations by exploiting the mutual attention mechanism of cross-attention and multi-scale aggregation of 3D tokens. Extensive experimental results on three public 3D medical segmentation datasets show that TMA-TransBTS achieves higher averaged segmentation results than previous state-of-the-art CNN-based 3D methods and CNN-Transform hybrid 3D methods for the segmentation of 3D multi-modality brain tumors.

[692] seg2med: a segmentation-based medical image generation framework using denoising diffusion probabilistic models

Zeyu Yang,Zhilin Chen,Yipeng Sun,Anika Strittmatter,Anish Raj,Ahmad Allababidi,Johann S. Rink,Frank G. Zöllner

Main category: eess.IV

TLDR: seg2med是一个基于DDPM的医学图像合成框架，能够从解剖掩模生成高质量的CT和MR图像，并在多种指标上表现优异。

Details

Motivation: 解决医学图像合成中高质量、多模态和一致性生成的需求，支持临床应用和研究。 Method: 使用Denoising Diffusion Probabilistic Models (DDPM)和TotalSegmentator工具包，从解剖掩模生成CT和MR图像，并进行跨模态转换。 Result: 在SSIM、FSIM和FID等指标上表现优异，CT和MR图像的SSIM分别达到0.94和0.89，FID为3.62。 Conclusion: seg2med在医学图像合成中表现出色，支持多模态应用，具有广泛的临床和研究潜力。 Abstract: In this study, we present seg2med, an advanced medical image synthesis framework that uses Denoising Diffusion Probabilistic Models (DDPM) to generate high-quality synthetic medical images conditioned on anatomical masks from TotalSegmentator. The framework synthesizes CT and MR images from segmentation masks derived from real patient data and XCAT digital phantoms, achieving a Structural Similarity Index Measure (SSIM) of 0.94 +/- 0.02 for CT and 0.89 +/- 0.04 for MR images compared to ground-truth images of real patients. It also achieves a Feature Similarity Index Measure (FSIM) of 0.78 +/- 0.04 for CT images from XCAT. The generative quality is further supported by a Fr\'echet Inception Distance (FID) of 3.62 for CT image generation. Additionally, seg2med can generate paired CT and MR images with consistent anatomical structures and convert images between CT and MR modalities, achieving SSIM values of 0.91 +/- 0.03 for MR-to-CT and 0.77 +/- 0.04 for CT-to-MR conversion. Despite the limitations of incomplete anatomical details in segmentation masks, the framework shows strong performance in cross-modality synthesis and multimodal imaging. seg2med also demonstrates high anatomical fidelity in CT synthesis, achieving a mean Dice coefficient greater than 0.90 for 11 abdominal organs and greater than 0.80 for 34 organs out of 59 in 58 test cases. The highest Dice of 0.96 +/- 0.01 was recorded for the right scapula. Leveraging the TotalSegmentator toolkit, seg2med enables segmentation mask generation across diverse datasets, supporting applications in clinical imaging, data augmentation, multimodal synthesis, and diagnostic algorithm development.

[693] Predicting ulcer in H&E images of inflammatory bowel disease using domain-knowledge-driven graph neural network

Ruiwen Ding,Lin Li,Rajath Soans,Tosha Shah,Radha Krishnan,Marc Alexander Sze,Sasha Lukyanov,Yash Deshpande,Antong Chen

Main category: eess.IV

TLDR: 提出了一种名为DomainGCN的弱监督模型，结合图卷积神经网络和溃疡特征领域知识，用于IBD全切片图像的溃疡预测，性能优于现有方法。

Details

Motivation: IBD治疗常伴随副作用，个性化治疗需生物标志物。免疫细胞在IBD中起关键作用，但现有MIL方法缺乏空间上下文信息。 Method: 采用弱监督学习模型DomainGCN，结合GCN和溃疡特征领域知识（如上皮细胞、淋巴细胞和碎片）。 Result: DomainGCN在WSI级溃疡预测中优于多种SOTA MIL方法，验证了领域知识的价值。 Conclusion: DomainGCN通过结合领域知识提升了溃疡预测性能，为IBD研究提供了新工具。 Abstract: Inflammatory bowel disease (IBD) involves chronic inflammation of the digestive tract, with treatment options often burdened by adverse effects. Identifying biomarkers for personalized treatment is crucial. While immune cells play a key role in IBD, accurately identifying ulcer regions in whole slide images (WSIs) is essential for characterizing these cells and exploring potential therapeutics. Multiple instance learning (MIL) approaches have advanced WSI analysis but they lack spatial context awareness. In this work, we propose a weakly-supervised model called DomainGCN that employs a graph convolution neural network (GCN) and incorporates domain-specific knowledge of ulcer features, specifically, the presence of epithelium, lymphocytes, and debris for WSI-level ulcer prediction in IBD. We demonstrate that DomainGCN outperforms various state-of-the-art (SOTA) MIL methods and show the added value of domain knowledge.

[694] OmniMamba4D: Spatio-temporal Mamba for longitudinal CT lesion segmentation

Justin Namuk Kim,Yiqiao Liu,Rajath Soans,Keith Persson,Sarah Halek,Michal Tomaszewski,Jianda Yuan,Gregory Goldmacher,Antong Chen

Main category: eess.IV

TLDR: OmniMamba4D是一种新型4D医学图像分割模型，通过时空四向Mamba块捕捉时空特征，优于传统3D模型。

Details

Motivation: 现有3D分割模型仅关注空间信息，无法满足纵向CT扫描中肿瘤进展监测的需求。 Method: 提出OmniMamba4D模型，利用时空四向Mamba块处理4D CT数据，捕捉时空特征。 Result: 在3,252个CT扫描的内部数据集上，OmniMamba4D达到0.682的Dice分数，与SOTA模型相当，且计算高效。 Conclusion: OmniMamba4D为纵向CT病变分割提供了新的时空信息利用框架。 Abstract: Accurate segmentation of longitudinal CT scans is important for monitoring tumor progression and evaluating treatment responses. However, existing 3D segmentation models solely focus on spatial information. To address this gap, we propose OmniMamba4D, a novel segmentation model designed for 4D medical images (3D images over time). OmniMamba4D utilizes a spatio-temporal tetra-orientated Mamba block to effectively capture both spatial and temporal features. Unlike traditional 3D models, which analyze single-time points, OmniMamba4D processes 4D CT data, providing comprehensive spatio-temporal information on lesion progression. Evaluated on an internal dataset comprising of 3,252 CT scans, OmniMamba4D achieves a competitive Dice score of 0.682, comparable to state-of-the-arts (SOTA) models, while maintaining computational efficiency and better detecting disappeared lesions. This work demonstrates a new framework to leverage spatio-temporal information for longitudinal CT lesion segmentation.

[695] Progressive Transfer Learning for Multi-Pass Fundus Image Restoration

Uyen Phan,Ozer Can Devecioglu,Serkan Kiranyaz,Moncef Gabbouj

Main category: eess.IV

TLDR: 提出了一种基于渐进式迁移学习的多轮修复方法，用于提升糖尿病视网膜病变筛查中低质量眼底图像的质量。

Details

Motivation: 糖尿病视网膜病变是视力障碍的主要原因，低质量眼底图像（如光照不足、噪声、模糊等）影响了筛查的准确性。 Method: 使用Cycle GAN模型进行初步修复，随后通过渐进式迁移学习进行多轮修复，逐步提升图像质量。 Result: 在DeepDRiD数据集上实现了最先进的性能，显著提升了图像质量。 Conclusion: 渐进式迁移学习方法在多轮图像修复中表现出色，为糖尿病视网膜病变筛查提供了更可靠的图像质量保障。 Abstract: Diabetic retinopathy is a leading cause of vision impairment, making its early diagnosis through fundus imaging critical for effective treatment planning. However, the presence of poor quality fundus images caused by factors such as inadequate illumination, noise, blurring and other motion artifacts yields a significant challenge for accurate DR screening. In this study, we propose progressive transfer learning for multi pass restoration to iteratively enhance the quality of degraded fundus images, ensuring more reliable DR screening. Unlike previous methods that often focus on a single pass restoration, multi pass restoration via PTL can achieve a superior blind restoration performance that can even improve most of the good quality fundus images in the dataset. Initially, a Cycle GAN model is trained to restore low quality images, followed by PTL induced restoration passes over the latest restored outputs to improve overall quality in each pass. The proposed method can learn blind restoration without requiring any paired data while surpassing its limitations by leveraging progressive learning and fine tuning strategies to minimize distortions and preserve critical retinal features. To evaluate PTL's effectiveness on multi pass restoration, we conducted experiments on DeepDRiD, a large scale fundus imaging dataset specifically curated for diabetic retinopathy detection. Our result demonstrates state of the art performance, showcasing PTL's potential as a superior approach to iterative image quality restoration.

[696] Towards contrast- and pathology-agnostic clinical fetal brain MRI segmentation using SynthSeg

Ziyao Shang,Misha Kaandorp,Kelly Payette,Marina Fernandez Garcia,Roxane Licandro,Georg Langs,Jordina Aviles Verdera,Jana Hutter,Bjoern Menze,Gregor Kasprian,Meritxell Bach Cuadra,Andras Jakab

Main category: eess.IV

TLDR: 提出了一种新的数据驱动训练采样策略，用于提高胎儿脑MRI分割网络的领域泛化能力，显著改善了异常解剖结构的分割质量。

Details

Motivation: MRI在胎儿神经发育研究中至关重要，但深度学习模型在领域偏移（如生理和采集环境差异）下表现不佳，需要提升泛化能力。 Method: 结合新型数据驱动采样策略和现有数据增强技术，应用于SynthSeg框架，生成多样化训练数据，并进行实验验证。 Result: 在解剖异常严重的测试案例中分割质量显著提升（p < 1e-4），但在异常较少的情况下性能略有下降。 Conclusion: 该方法为未来开发数据驱动采样策略提供了基础，并展示了在复杂领域偏移下的有效性。 Abstract: Magnetic resonance imaging (MRI) has played a crucial role in fetal neurodevelopmental research. Structural annotations of MR images are an important step for quantitative analysis of the developing human brain, with Deep learning providing an automated alternative for this otherwise tedious manual process. However, segmentation performances of Convolutional Neural Networks often suffer from domain shift, where the network fails when applied to subjects that deviate from the distribution with which it is trained on. In this work, we aim to train networks capable of automatically segmenting fetal brain MRIs with a wide range of domain shifts pertaining to differences in subject physiology and acquisition environments, in particular shape-based differences commonly observed in pathological cases. We introduce a novel data-driven train-time sampling strategy that seeks to fully exploit the diversity of a given training dataset to enhance the domain generalizability of the trained networks. We adapted our sampler, together with other existing data augmentation techniques, to the SynthSeg framework, a generator that utilizes domain randomization to generate diverse training data, and ran thorough experimentations and ablation studies on a wide range of training/testing data to test the validity of the approaches. Our networks achieved notable improvements in the segmentation quality on testing subjects with intense anatomical abnormalities (p < 1e-4), though at the cost of a slighter decrease in performance in cases with fewer abnormalities. Our work also lays the foundation for future works on creating and adapting data-driven sampling strategies for other training pipelines.

Yonghao Huang,Leiting Chen,Chuan Zhou

Main category: eess.IV

TLDR: TMA-TransBTS是一种基于CNN-Transformer混合架构的3D医学图像分割模型，通过多尺度分割和聚合3D token，同时提取多尺度特征并建模长距离依赖关系。

Details

Motivation: 现有CNN-Transformer混合模型在3D医学图像分割中通常使用固定感受野，忽略了多尺度病变特征，因此需要一种能同时处理多尺度特征和长距离依赖的模型。 Method: 提出TMA-TransBTS模型，采用编码器-解码器结构，通过多尺度分割和聚合3D token实现多尺度特征提取和长距离依赖建模，并引入3D多尺度交叉注意力模块连接编码器和解码器。 Result: 在三个公开的3D医学分割数据集上，TMA-TransBTS的平均分割结果优于现有的CNN和CNN-Transformer混合方法。 Conclusion: TMA-TransBTS通过多尺度特征提取和长距离依赖建模，显著提升了3D多模态脑肿瘤分割的性能。 Abstract: Due to the success of CNN-based and Transformer-based models in various computer vision tasks, recent works study the applicability of CNN-Transformer hybrid architecture models in 3D multi-modality medical segmentation tasks. Introducing Transformer brings long-range dependent information modeling ability in 3D medical images to hybrid models via the self-attention mechanism. However, these models usually employ fixed receptive fields of 3D volumetric features within each self-attention layer, ignoring the multi-scale volumetric lesion features. To address this issue, we propose a CNN-Transformer hybrid 3D medical image segmentation model, named TMA-TransBTS, based on an encoder-decoder structure. TMA-TransBTS realizes simultaneous extraction of multi-scale 3D features and modeling of long-distance dependencies by multi-scale division and aggregation of 3D tokens in a self-attention layer. Furthermore, TMA-TransBTS proposes a 3D multi-scale cross-attention module to establish a link between the encoder and the decoder for extracting rich volume representations by exploiting the mutual attention mechanism of cross-attention and multi-scale aggregation of 3D tokens. Extensive experimental results on three public 3D medical segmentation datasets show that TMA-TransBTS achieves higher averaged segmentation results than previous state-of-the-art CNN-based 3D methods and CNN-Transform hybrid 3D methods for the segmentation of 3D multi-modality brain tumors.

[698] seg2med: a segmentation-based medical image generation framework using denoising diffusion probabilistic models

Zeyu Yang,Zhilin Chen,Yipeng Sun,Anika Strittmatter,Anish Raj,Ahmad Allababidi,Johann S. Rink,Frank G. Zöllner

Main category: eess.IV

TLDR: seg2med是一个基于DDPM的医学图像合成框架，能够从解剖掩模生成高质量的CT和MR图像，支持跨模态转换，并在多项指标上表现优异。

Details

Motivation: 解决医学图像合成中高质量、跨模态一致性和解剖学保真度的需求。 Method: 使用Denoising Diffusion Probabilistic Models (DDPM)和TotalSegmentator工具包，从解剖掩模生成CT和MR图像。 Result: 在SSIM、FSIM、FID和Dice系数等指标上表现优异，支持跨模态转换。 Conclusion: seg2med在医学图像合成中表现出色，具有广泛的应用潜力。 Abstract: In this study, we present seg2med, an advanced medical image synthesis framework that uses Denoising Diffusion Probabilistic Models (DDPM) to generate high-quality synthetic medical images conditioned on anatomical masks from TotalSegmentator. The framework synthesizes CT and MR images from segmentation masks derived from real patient data and XCAT digital phantoms, achieving a Structural Similarity Index Measure (SSIM) of 0.94 +/- 0.02 for CT and 0.89 +/- 0.04 for MR images compared to ground-truth images of real patients. It also achieves a Feature Similarity Index Measure (FSIM) of 0.78 +/- 0.04 for CT images from XCAT. The generative quality is further supported by a Fr\'echet Inception Distance (FID) of 3.62 for CT image generation. Additionally, seg2med can generate paired CT and MR images with consistent anatomical structures and convert images between CT and MR modalities, achieving SSIM values of 0.91 +/- 0.03 for MR-to-CT and 0.77 +/- 0.04 for CT-to-MR conversion. Despite the limitations of incomplete anatomical details in segmentation masks, the framework shows strong performance in cross-modality synthesis and multimodal imaging. seg2med also demonstrates high anatomical fidelity in CT synthesis, achieving a mean Dice coefficient greater than 0.90 for 11 abdominal organs and greater than 0.80 for 34 organs out of 59 in 58 test cases. The highest Dice of 0.96 +/- 0.01 was recorded for the right scapula. Leveraging the TotalSegmentator toolkit, seg2med enables segmentation mask generation across diverse datasets, supporting applications in clinical imaging, data augmentation, multimodal synthesis, and diagnostic algorithm development.

[699] Predicting ulcer in H&E images of inflammatory bowel disease using domain-knowledge-driven graph neural network

Ruiwen Ding,Lin Li,Rajath Soans,Tosha Shah,Radha Krishnan,Marc Alexander Sze,Sasha Lukyanov,Yash Deshpande,Antong Chen

Main category: eess.IV

TLDR: 提出了一种名为DomainGCN的弱监督模型，结合图卷积神经网络和溃疡特征知识，用于IBD全切片图像的溃疡预测，性能优于现有MIL方法。

Details

Motivation: 炎症性肠病（IBD）的治疗常伴随副作用，个性化治疗需要生物标志物。准确识别全切片图像中的溃疡区域对研究免疫细胞和潜在治疗至关重要。 Method: 提出DomainGCN模型，结合图卷积神经网络（GCN）和溃疡特征（上皮细胞、淋巴细胞和碎片）的领域知识，进行弱监督学习。 Result: DomainGCN在溃疡预测任务中优于多种先进MIL方法，验证了领域知识的价值。 Conclusion: DomainGCN通过结合领域知识提升了溃疡预测的准确性，为IBD研究和治疗提供了新工具。 Abstract: Inflammatory bowel disease (IBD) involves chronic inflammation of the digestive tract, with treatment options often burdened by adverse effects. Identifying biomarkers for personalized treatment is crucial. While immune cells play a key role in IBD, accurately identifying ulcer regions in whole slide images (WSIs) is essential for characterizing these cells and exploring potential therapeutics. Multiple instance learning (MIL) approaches have advanced WSI analysis but they lack spatial context awareness. In this work, we propose a weakly-supervised model called DomainGCN that employs a graph convolution neural network (GCN) and incorporates domain-specific knowledge of ulcer features, specifically, the presence of epithelium, lymphocytes, and debris for WSI-level ulcer prediction in IBD. We demonstrate that DomainGCN outperforms various state-of-the-art (SOTA) MIL methods and show the added value of domain knowledge.

[700] OmniMamba4D: Spatio-temporal Mamba for longitudinal CT lesion segmentation

Justin Namuk Kim,Yiqiao Liu,Rajath Soans,Keith Persson,Sarah Halek,Michal Tomaszewski,Jianda Yuan,Gregory Goldmacher,Antong Chen

Main category: eess.IV

TLDR: OmniMamba4D是一种新型的4D医学图像分割模型，通过捕捉时空特征，显著提升了纵向CT扫描中病变分割的准确性。

Details

Motivation: 现有3D分割模型仅关注空间信息，无法充分利用时间维度的数据，限制了病变进展监测和治疗效果评估的准确性。 Method: 提出OmniMamba4D模型，采用时空四向Mamba块，同时处理3D图像和时间维度信息。 Result: 在3,252个CT扫描的内部数据集上，OmniMamba4D达到0.682的Dice分数，与SOTA模型相当，且计算效率更高，能更好检测消失的病变。 Conclusion: OmniMamba4D为纵向CT病变分割提供了新的时空信息利用框架，具有实际应用潜力。 Abstract: Accurate segmentation of longitudinal CT scans is important for monitoring tumor progression and evaluating treatment responses. However, existing 3D segmentation models solely focus on spatial information. To address this gap, we propose OmniMamba4D, a novel segmentation model designed for 4D medical images (3D images over time). OmniMamba4D utilizes a spatio-temporal tetra-orientated Mamba block to effectively capture both spatial and temporal features. Unlike traditional 3D models, which analyze single-time points, OmniMamba4D processes 4D CT data, providing comprehensive spatio-temporal information on lesion progression. Evaluated on an internal dataset comprising of 3,252 CT scans, OmniMamba4D achieves a competitive Dice score of 0.682, comparable to state-of-the-arts (SOTA) models, while maintaining computational efficiency and better detecting disappeared lesions. This work demonstrates a new framework to leverage spatio-temporal information for longitudinal CT lesion segmentation.

[701] Progressive Transfer Learning for Multi-Pass Fundus Image Restoration

Uyen Phan,Ozer Can Devecioglu,Serkan Kiranyaz,Moncef Gabbouj

Main category: eess.IV

TLDR: 提出了一种基于渐进式迁移学习的多轮修复方法，用于提升糖尿病视网膜病变筛查中低质量眼底图像的质量。

Details

Motivation: 糖尿病视网膜病变是视力障碍的主要原因，但低质量眼底图像（如光照不足、噪声、模糊等）影响了筛查准确性。 Method: 采用渐进式迁移学习（PTL）进行多轮修复，首轮使用Cycle GAN修复低质量图像，后续通过PTL逐步提升质量。 Result: 在DeepDRiD数据集上实现了最先进的性能，显著提升了图像质量。 Conclusion: PTL是一种无需配对数据的盲修复方法，能有效保留视网膜特征并减少失真，适用于糖尿病视网膜病变筛查。 Abstract: Diabetic retinopathy is a leading cause of vision impairment, making its early diagnosis through fundus imaging critical for effective treatment planning. However, the presence of poor quality fundus images caused by factors such as inadequate illumination, noise, blurring and other motion artifacts yields a significant challenge for accurate DR screening. In this study, we propose progressive transfer learning for multi pass restoration to iteratively enhance the quality of degraded fundus images, ensuring more reliable DR screening. Unlike previous methods that often focus on a single pass restoration, multi pass restoration via PTL can achieve a superior blind restoration performance that can even improve most of the good quality fundus images in the dataset. Initially, a Cycle GAN model is trained to restore low quality images, followed by PTL induced restoration passes over the latest restored outputs to improve overall quality in each pass. The proposed method can learn blind restoration without requiring any paired data while surpassing its limitations by leveraging progressive learning and fine tuning strategies to minimize distortions and preserve critical retinal features. To evaluate PTL's effectiveness on multi pass restoration, we conducted experiments on DeepDRiD, a large scale fundus imaging dataset specifically curated for diabetic retinopathy detection. Our result demonstrates state of the art performance, showcasing PTL's potential as a superior approach to iterative image quality restoration.

[702] Towards contrast- and pathology-agnostic clinical fetal brain MRI segmentation using SynthSeg

Ziyao Shang,Misha Kaandorp,Kelly Payette,Marina Fernandez Garcia,Roxane Licandro,Georg Langs,Jordina Aviles Verdera,Jana Hutter,Bjoern Menze,Gregor Kasprian,Meritxell Bach Cuadra,Andras Jakab

Main category: eess.IV

TLDR: 该论文提出了一种新的数据驱动训练采样策略，以提高胎儿脑MRI分割网络的领域泛化能力，特别是在解剖异常情况下表现显著提升。

Details

Motivation: MRI在胎儿神经发育研究中至关重要，但卷积神经网络在领域偏移时性能下降，尤其是在病理情况下。 Method: 结合数据驱动采样策略和现有数据增强技术，应用于SynthSeg框架，生成多样化训练数据。 Result: 在解剖异常严重的测试案例中分割质量显著提高（p < 1e-4），但在异常较少的情况下性能略有下降。 Conclusion: 该方法为未来数据驱动采样策略的研究奠定了基础。 Abstract: Magnetic resonance imaging (MRI) has played a crucial role in fetal neurodevelopmental research. Structural annotations of MR images are an important step for quantitative analysis of the developing human brain, with Deep learning providing an automated alternative for this otherwise tedious manual process. However, segmentation performances of Convolutional Neural Networks often suffer from domain shift, where the network fails when applied to subjects that deviate from the distribution with which it is trained on. In this work, we aim to train networks capable of automatically segmenting fetal brain MRIs with a wide range of domain shifts pertaining to differences in subject physiology and acquisition environments, in particular shape-based differences commonly observed in pathological cases. We introduce a novel data-driven train-time sampling strategy that seeks to fully exploit the diversity of a given training dataset to enhance the domain generalizability of the trained networks. We adapted our sampler, together with other existing data augmentation techniques, to the SynthSeg framework, a generator that utilizes domain randomization to generate diverse training data, and ran thorough experimentations and ablation studies on a wide range of training/testing data to test the validity of the approaches. Our networks achieved notable improvements in the segmentation quality on testing subjects with intense anatomical abnormalities (p < 1e-4), though at the cost of a slighter decrease in performance in cases with fewer abnormalities. Our work also lays the foundation for future works on creating and adapting data-driven sampling strategies for other training pipelines.

Table of Contents

cs.CV [Back]

[1] HAL-NeRF: High Accuracy Localization Leveraging Neural Radiance Fields

[2] LookingGlass: Generative Anamorphoses via Laplacian Pyramid Warping

[3] Robust SAM: On the Adversarial Robustness of Vision Foundation Models

[4] Parameter-Free Fine-tuning via Redundancy Elimination for Vision Foundation Models

[5] MotionDreamer: One-to-Many Motion Synthesis with Localized Generative Masked Transformer

[6] PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models

[7] Adaptive Additive Parameter Updates of Vision Transformers for Few-Shot Continual Learning

[8] Chest X-ray Classification using Deep Convolution Models on Low-resolution images with Uncertain Labels

[9] Sculpting Memory: Multi-Concept Forgetting in Diffusion Models via Dynamic Mask and Concept-Aware Optimization

[10] BlockGaussian: Efficient Large-Scale Scene NovelView Synthesis via Adaptive Block-Based Gaussian Splatting

[11] You Need a Transition Plane: Bridging Continuous Panoramic 3D Reconstruction with Perspective Gaussian Splatting

[12] Hyperlocal disaster damage assessment using bi-temporal street-view imagery and pre-trained vision models

[13] UniFlowRestore: A General Video Restoration Framework via Flow Matching and Prompt Guidance

[14] Exploring Synergistic Ensemble Learning: Uniting CNNs, MLP-Mixers, and Vision Transformers to Enhance Image Classification

[15] A Visual Self-attention Mechanism Facial Expression Recognition Network beyond Convnext

[16] Using Vision Language Models for Safety Hazard Identification in Construction

[17] RICCARDO: Radar Hit Prediction and Convolution for Camera-Radar 3D Object Detection

[18] BIGS: Bimanual Category-agnostic Interaction Reconstruction from Monocular Videos via 3D Gaussian Splatting

[19] Multi-modal and Multi-view Fundus Image Fusion for Retinopathy Diagnosis via Multi-scale Cross-attention and Shifted Window Self-attention

[20] Probability Distribution Alignment and Low-Rank Weight Decomposition for Source-Free Domain Adaptive Brain Decoding

[21] A Constrained Optimization Approach for Gaussian Splatting from Coarsely-posed Images and Noisy Lidar Point Clouds

[22] MASH: Masked Anchored SpHerical Distances for 3D Shape Representation and Generation

[23] Evolved Hierarchical Masking for Self-Supervised Learning

[24] LEREL: Lipschitz Continuity-Constrained Emotion Recognition Ensemble Learning For Electroencephalography

[25] SCFlow2: Plug-and-Play Object Pose Refiner with Shape-Constraint Scene Flow

[26] ReferGPT: Towards Zero-Shot Referring Multi-Object Tracking

[27] RT-DATR:Real-time Unsupervised Domain Adaptive Detection Transformer with Adversarial Feature Learning

[28] From Visual Explanations to Counterfactual Explanations with Latent Diffusion

[29] AerOSeg: Harnessing SAM for Open-Vocabulary Segmentation in Remote Sensing Images

[30] Multi-scale Activation, Refinement, and Aggregation: Exploring Diverse Cues for Fine-Grained Bird Recognition

[31] DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models

[32] Learning Occlusion-Robust Vision Transformers for Real-Time UAV Tracking

[33] NoTeS-Bank: Benchmarking Neural Transcription and Search for Scientific Notes Understanding

[34] FVQ: A Large-Scale Dataset and A LMM-based Method for Face Video Quality Assessment

[35] PathVLM-R1: A Reinforcement Learning-Driven Reasoning Model for Pathology Visual-Language Tasks

[36] Head-Aware KV Cache Compression for Efficient Visual Autoregressive Modeling

[37] VideoAds for Fast-Paced Video Understanding: Where Opensource Foundation Models Beat GPT-4o & Gemini-1.5 Pro

[38] Towards Explainable Partial-AIGC Image Quality Assessment

[39] Cycle Training with Semi-Supervised Domain Adaptation: Bridging Accuracy and Efficiency for Real-Time Mobile Scene Detection

[40] A Lightweight Moment Retrieval System with Global Re-Ranking and Robust Adaptive Bidirectional Temporal Search

[41] MedIL: Implicit Latent Spaces for Generating Heterogeneous Medical Images at Arbitrary Resolutions

[42] Infused Suppression Of Magnification Artefacts For Micro-AU Detection

[43] Text To 3D Object Generation For Scalable Room Assembly

[44] REMEMBER: Retrieval-based Explainable Multimodal Evidence-guided Modeling for Brain Evaluation and Reasoning in Zero- and Few-shot Neurodegenerative Diagnosis

[45] PapMOT: Exploring Adversarial Patch Attack against Multiple Object Tracking

[46] Beyond Degradation Conditions: All-in-One Image Restoration via HOG Transformers

[47] Low-Light Image Enhancement using Event-Based Illumination Estimation

[48] Contour Flow Constraint: Preserving Global Shape Similarity for Deep Learning based Image Segmentation

[49] Vision Transformers Exhibit Human-Like Biases: Evidence of Orientation and Color Selectivity, Categorical Perception, and Phase Transitions

[50] Comparing Performance of Preprocessing Techniques for Traffic Sign Recognition Using a HOG-SVM

[51] BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning

[52] Structure-Accurate Medical Image Translation based on Dynamic Frequency Balance and Knowledge Guidance

[53] Sparse Deformable Mamba for Hyperspectral Image Classification

[54] InfoBound: A Provable Information-Bounds Inspired Framework for Both OoD Generalization and OoD Detection

[55] FractalForensics: Proactive Deepfake Detection and Localization via Fractal Watermarks

[56] D$^2$iT: Dynamic Diffusion Transformer for Accurate Image Generation

[57] Enhancing Wide-Angle Image Using Narrow-Angle View of the Same Scene

[58] CamMimic: Zero-Shot Image To Camera Motion Personalized Video Generation Using Diffusion Models

[59] Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation

[60] DropoutGS: Dropping Out Gaussians for Better Sparse-view Rendering

[61] EasyREG: Easy Depth-Based Markerless Registration and Tracking using Augmented Reality Device for Surgical Guidance

[62] PCM-SAR: Physics-Driven Contrastive Mutual Learning for SAR Classification

[63] Pillar-Voxel Fusion Network for 3D Object Detection in Airborne Hyperspectral Point Clouds

[64] FVOS for MOSE Track of 4th PVUW Challenge: 3rd Place Solution

[65] DiffuMural: Restoring Dunhuang Murals with Multi-scale Diffusion

[66] Capturing Longitudinal Changes in Brain Morphology Using Temporally Parameterized Neural Displacement Fields

[67] 3D CoCa: Contrastive Learners are 3D Captioners

[68] AeroLite: Tag-Guided Lightweight Generation of Aerial Image Captions

[69] Trajectory-guided Motion Perception for Facial Expression Quality Assessment in Neurological Disorders

[70] FastRSR: Efficient and Accurate Road Surface Reconstruction from Bird's Eye View

[71] EmbodiedOcc++: Boosting Embodied 3D Occupancy Prediction with Plane Regularization and Uncertainty Sampler

[72] SD-ReID: View-aware Stable Diffusion for Aerial-Ground Person Re-Identification

[73] Mitigating Long-tail Distribution in Oracle Bone Inscriptions: Dataset, Model, and Benchmark

[74] TextSplat: Text-Guided Semantic Fusion for Generalizable Gaussian Splatting

[75] DualPrompt-MedCap: A Dual-Prompt Enhanced Approach for Medical Image Captioning

[76] Mixture-of-Shape-Experts (MoSE): End-to-End Shape Dictionary Framework to Prompt SAM for Generalizable Medical Segmentation

[77] Early-Bird Diffusion: Investigating and Leveraging Timestep-Aware Early-Bird Tickets in Diffusion Models for Efficient Training

[78] ERL-MPP: Evolutionary Reinforcement Learning with Multi-head Puzzle Perception for Solving Large-scale Jigsaw Puzzles of Eroded Gaps