cs.CV [Total: 128]
cs.GR [Total: 3]
cs.CL [Total: 64]
cs.LG [Total: 14]
cs.IR [Total: 4]
q-bio.NC [Total: 1]
eess.SP [Total: 3]
cs.RO [Total: 12]
cs.MM [Total: 1]
cs.HC [Total: 2]
cs.CR [Total: 2]
stat.ML [Total: 1]
cs.SE [Total: 1]
cs.DL [Total: 1]
cs.AI [Total: 4]
eess.IV [Total: 14]
cs.CY [Total: 1]
cs.SD [Total: 2]

cs.CV

[1] Understanding and Mitigating Toxicity in Image-Text Pretraining Datasets: A Case Study on LLaVA

Karthik Reddy Kanjula,Surya Guthikonda,Nahid Alam,Shayekh Bin Islam

Main category: cs.CV

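TL;DR: The paper studies toxic content in the LLaVA image-text pretraining dataset, analyzes how it manifests, proposes targeted mitigation strategies, and releases a refined dataset with 7,531 toxic image-text pairs removed.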

Details

Motivation: Pretraining datasets are essential to multimodal models but often contain harmful content from the web. The study aims to identify and reduce this toxic content in order to build more responsible and equitable systems. Method: Analyzes the common categories of toxic content in the LLaVA dataset, proposes targeted mitigation strategies, and implements a toxicity detection pipeline. Result: Produces a refined dataset with 7,531 toxic image-text pairs removed and provides guidelines for toxicity detection. Conclusion: The study stresses the need to actively identify and filter toxic content to advance more responsible multimodal systems; the refined dataset is open source for further research. Abstract: Pretraining datasets are foundational to the development of multimodal models, yet they often have inherent biases and toxic content from the web-scale corpora they are sourced from. In this paper, we investigate the prevalence of toxicity in the LLaVA image-text pretraining dataset, examining how harmful content manifests in different modalities. We present a comprehensive analysis of common toxicity categories and propose targeted mitigation strategies, resulting in the creation of a refined toxicity-mitigated dataset. This dataset removes 7,531 toxic image-text pairs from the LLaVA pre-training dataset. We offer guidelines for implementing robust toxicity detection pipelines. Our findings underscore the need to actively identify and filter toxic content - such as hate speech, explicit imagery, and targeted harassment - to build more responsible and equitable multimodal systems. The toxicity-mitigated dataset is open source and is available for further research.

[2] Robust & Precise Knowledge Distillation-based Novel Context-Aware Predictor for Disease Detection in Brain and Gastrointestinal

Saif Ur Rehman Khan,Muhammad Nabeel Asim,Sebastian Vollmer,Andreas Dengel

Main category: cs.CV

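TL;DR: Proposes a knowledge distillation framework that combines Ant Colony Optimization with a context-aware predictor, significantly improving accuracy in medical image disease prediction.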

Details

Motivation: Medical image data is complex and variable, and traditional knowledge distillation methods fall short in handling uncertainty and in generalization. Method: Integrates Ant Colony Optimization to select the best teacher-student model pair and uses a context-aware approach to temperature scaling. Result: Achieves the highest accuracy on three public datasets: 98.01% on Kaggle, 92.81% on Figshare, and 96.20% on GastroNet. Conclusion: The framework significantly outperforms existing methods and offers a more robust solution for medical image analysis. Abstract: Medical disease prediction, particularly through imaging, remains a challenging task due to the complexity and variability of medical data, including noise, ambiguity, and differing image quality. Recent deep learning models, including Knowledge Distillation (KD) methods, have shown promising results in brain tumor image identification but still face limitations in handling uncertainty and generalizing across diverse medical conditions. Traditional KD methods often rely on a context-unaware temperature parameter to soften teacher model predictions, which does not adapt effectively to varying uncertainty levels present in medical images. To address this issue, we propose a novel framework that integrates Ant Colony Optimization (ACO) for optimal teacher-student model selection and a novel context-aware predictor approach for temperature scaling. The proposed context-aware framework adjusts the temperature based on factors such as image quality, disease complexity, and teacher model confidence, allowing for more robust knowledge transfer. Additionally, ACO efficiently selects the most appropriate teacher-student model pair from a set of pre-trained models, outperforming current optimization methods by exploring a broader solution space and better handling complex, non-linear relationships within the data. The proposed framework is evaluated using three publicly available benchmark datasets, each corresponding to a distinct medical imaging task. The results demonstrate that the proposed framework significantly outperforms current state-of-the-art methods, achieving top accuracy rates: 98.01% on the MRI brain tumor (Kaggle) dataset, 92.81% on the Figshare MRI dataset, and 96.20% on the GastroNet dataset. These results surpass the existing benchmarks of 97.24% (Kaggle), 91.43% (Figshare), and 95.00% (GastroNet).
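
The context-aware temperature idea in the abstract above can be illustrated with a minimal sketch. The abstract does not give the actual predictor, so the `context_temperature` rule below, its inputs, and the bounds `t_min` and `t_max` are hypothetical; only the temperature-softened KL distillation loss is the standard formulation.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, computed stably by subtracting the max."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def context_temperature(image_quality, teacher_confidence, t_min=1.0, t_max=8.0):
    """Hypothetical rule (not from the paper): map inputs in [0, 1] to a
    temperature, so lower quality or confidence gives softer targets."""
    uncertainty = 1.0 - 0.5 * (image_quality + teacher_confidence)
    uncertainty = min(max(uncertainty, 0.0), 1.0)
    return t_min + (t_max - t_min) * uncertainty


def distillation_loss(teacher_logits, student_logits, temperature):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


if __name__ == "__main__":
    t = context_temperature(image_quality=0.4, teacher_confidence=0.6)
    print(t, distillation_loss([4.0, 1.0, 0.5], [2.5, 1.5, 0.5], t))
```

A per-sample temperature like this replaces the single fixed temperature that the abstract calls context-unaware; how the real framework combines image quality, disease complexity, and teacher confidence is not specified in this listing.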

[3] Deep Learning-Based Robust Optical Guidance for Hypersonic Platforms

Adrien Chan-Hon-Tong,Aurélien Plyer,Baptiste Cadalen,Laurent Serre

Main category: cs.CV

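TL;DR: Proposes encoding scene images into a deep network to bypass the structural limitation of the classical reference-image framework, suited to bimodal scenes (e.g., snowy versus non-snowy).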

Details

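Motivation: The classical reference-image framework has structural limitations for sensor-based guidance of long-range platforms and cannot accommodate bimodal scenes. Method: Encodes a stack of scene images into a deep network and uses the diversity of the stack to handle bimodality. Result: Experiments show that the stack-based approach performs well on bimodal scenes. Conclusion: Encoding an image stack in a deep network overcomes the limitation of the classical framework and suits complex scenes. Abstract: Sensor-based guidance is required for long-range platforms. To bypass the structural limitation of the classical framework of registration on a reference image, we propose in this paper to encode a stack of images of the scene into a deep network. Relying on a stack is shown to be relevant for bimodal scenes (e.g., when the scene may or may not be snowy).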

[4] Toward Advancing License Plate Super-Resolution in Real-World Scenarios: A Dataset and Benchmark

Valfride Nascimento,Gabriel E. Lima,Rafael O. Ribeiro,William Robson Schwartz,Rayson Laroca,David Menotti

Main category: cs.CV

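TL;DR: Introduces UFPR-SR-Plates, a new dataset that addresses low-resolution images in license plate recognition (LPR), and shows that super-resolution combined with fusion strategies significantly improves recognition performance.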

Details

Motivation: Existing studies rely on private datasets and simplistic degradation models that do not reflect real-world challenges. Method: Introduces a dataset of 10,000 tracks with 100,000 paired low-resolution and high-resolution license plate images, and evaluates two super-resolution models and three fusion strategies. Result: Super-resolution significantly improves LPR performance; with majority-vote fusion, the recognition rate rises from 1.7% to 44.7%. Conclusion: Super-resolution and temporal information are critical to LPR accuracy under adverse conditions; the dataset is publicly available to support further research. Abstract: Recent advancements in super-resolution for License Plate Recognition (LPR) have sought to address challenges posed by low-resolution (LR) and degraded images in surveillance, traffic monitoring, and forensic applications. However, existing studies have relied on private datasets and simplistic degradation models. To address this gap, we introduce UFPR-SR-Plates, a novel dataset containing 10,000 tracks with 100,000 paired low and high-resolution license plate images captured under real-world conditions. We establish a benchmark using multiple sequential LR and high-resolution (HR) images per vehicle -- five of each -- and two state-of-the-art models for super-resolution of license plates. We also investigate three fusion strategies to evaluate how combining predictions from a leading Optical Character Recognition (OCR) model for multiple super-resolved license plates enhances overall performance. Our findings demonstrate that super-resolution significantly boosts LPR performance, with further improvements observed when applying majority vote-based fusion techniques. Specifically, the Layout-Aware and Character-Driven Network (LCDNet) model combined with the Majority Vote by Character Position (MVCP) strategy led to the highest recognition rates, increasing from 1.7% with low-resolution images to 31.1% with super-resolution, and up to 44.7% when combining OCR outputs from five super-resolved images. These findings underscore the critical role of super-resolution and temporal information in enhancing LPR accuracy under real-world, adverse conditions. The proposed dataset is publicly available to support further research and can be accessed at: https://valfride.github.io/nascimento2024toward/
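
As a rough illustration of majority voting by character position, the sketch below fuses several OCR strings for the same plate. The abstract does not describe how MVCP handles predictions of differing length or ties, so restricting the vote to strings of the most common length is an assumption made here, and the function name is hypothetical.

```python
from collections import Counter


def majority_vote_by_position(predictions):
    """Fuse OCR strings by taking the most frequent character at each position.

    Assumption (not from the paper): only strings whose length matches the
    most common length take part in the vote.
    """
    if not predictions:
        return ""
    common_length = Counter(len(p) for p in predictions).most_common(1)[0][0]
    candidates = [p for p in predictions if len(p) == common_length]
    fused = []
    for position in range(common_length):
        counts = Counter(p[position] for p in candidates)
        fused.append(counts.most_common(1)[0][0])
    return "".join(fused)


if __name__ == "__main__":
    # Five hypothetical OCR outputs for the same vehicle track.
    reads = ["ABC1234", "ABC1284", "A8C1234", "ABC1234", "ABC123"]
    print(majority_vote_by_position(reads))
```

Voting per position lets a character that is misread in one frame be outvoted by the other frames, which is consistent with the gain the abstract reports when five super-resolved images are combined.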

[5] MAGE: A Multi-stage Avatar Generator with Sparse Observations

Fangyu Du,Yang Yang,Xuehao Gao,Hongye Hou

Main category: cs.CV

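TL;DR: Proposes MAGE, a multi-stage avatar generator that uses a progressive prediction strategy to infer full-body poses from 3-joint observations of the head and wrists, significantly improving accuracy and continuity.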

Details

Motivation: Existing methods learn a direct mapping from 3-joint observations to full-body poses, which leaves an over-large inference space and yields unsatisfactory predictions, especially for the lower body and for temporal consistency. Method: MAGE uses a multi-stage progressive prediction strategy, refining from a 6-part body representation to 22 joints while introducing more motion context priors at each stage. Result: Experiments on large-scale datasets show that MAGE significantly outperforms existing methods in accuracy and continuity. Conclusion: The multi-stage progressive strategy addresses the challenge of full-body pose inference and yields more realistic and coherent motion sequences for AR/VR applications. Abstract: Inferring full-body poses from Head Mounted Devices, which capture only 3-joint observations from the head and wrists, is a challenging task with wide AR/VR applications. Previous attempts focus on learning one-stage motion mapping and thus suffer from an over-large inference space for unobserved body joint motions. This often leads to unsatisfactory lower-body predictions and poor temporal consistency, resulting in unrealistic or incoherent motion sequences. To address this, we propose a powerful Multi-stage Avatar GEnerator named MAGE that factorizes this one-stage direct motion mapping learning with a progressive prediction strategy. Specifically, given initial 3-joint motions, MAGE gradually infers multi-scale body part poses at different abstract granularity levels, starting from a 6-part body representation and gradually refining to 22 joints. With decreasing abstract levels step by step, MAGE introduces more motion context priors from former prediction stages and thus improves realistic motion completion with richer constraint conditions and less ambiguity. Extensive experiments on large-scale datasets verify that MAGE significantly outperforms state-of-the-art methods with better accuracy and continuity.

[6] Natural Reflection Backdoor Attack on Vision Language Model for Autonomous Driving

Ming Liu,Siyuan Liang,Koushik Howlader,Liwen Wang,Dacheng Tao,Wensheng Zhang

Main category: cs.CV

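TL;DR: Proposes a natural-reflection-based backdoor attack on vision-language models (VLMs) for autonomous driving: faint reflection patterns are embedded in images and irrelevant text prefixes are prepended to labels, inducing abnormally long responses and therefore delayed decisions.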

Details

Motivation: The robustness of VLM systems in autonomous driving against backdoor attacks is underexplored; the study exposes the resulting safety risks. Method: Embeds natural reflection patterns into images from the DriveLM dataset, prepends irrelevant text prefixes to the labels, and fine-tunes Qwen2-VL and LLaMA-Adapter with parameter-efficient methods. Result: The models behave normally on clean inputs but show significantly higher inference latency when triggered, which could cause hazardous delays in driving decisions. Conclusion: The attack exploits the real-time requirements of autonomous driving and poses a serious challenge to the security and reliability of VLM-augmented systems. Abstract: Vision-Language Models (VLMs) have been integrated into autonomous driving systems to enhance reasoning capabilities through tasks such as Visual Question Answering (VQA). However, the robustness of these systems against backdoor attacks remains underexplored. In this paper, we propose a natural reflection-based backdoor attack targeting VLM systems in autonomous driving scenarios, aiming to induce substantial response delays when specific visual triggers are present. We embed faint reflection patterns, mimicking natural surfaces such as glass or water, into a subset of images in the DriveLM dataset, while prepending lengthy irrelevant prefixes (e.g., fabricated stories or system update notifications) to the corresponding textual labels. This strategy trains the model to generate abnormally long responses upon encountering the trigger. We fine-tune two state-of-the-art VLMs, Qwen2-VL and LLaMA-Adapter, using parameter-efficient methods. Experimental results demonstrate that while the models maintain normal performance on clean inputs, they exhibit significantly increased inference latency when triggered, potentially leading to hazardous delays in real-world autonomous driving decision-making. Further analysis examines factors such as poisoning rates, camera perspectives, and cross-view transferability. Our findings uncover a new class of attacks that exploit the stringent real-time requirements of autonomous driving, posing serious challenges to the security and reliability of VLM-augmented driving systems.

[7] My Emotion on your face: The use of Facial Keypoint Detection to preserve Emotions in Latent Space Editing

Jingrui He,Andrew Stephen McGough

Main category: cs.CV

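TL;DR: Proposes adding a Human Face Landmark Detection (HFLD) loss to the loss function to address feature entanglement in StyleGAN/2 latent space editing, so that varied images can be generated while the facial expression is preserved.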

Details

Motivation: StyleGAN/2 suffers from feature entanglement when generating realistic face images: changing one feature (such as gender or age) affects others (such as expression). This limits its use for data augmentation in facial expression research. Method: Building on an existing model, adds the HFLD loss, provided by a pre-trained facial keypoint detection model, to the original loss function to reduce changes in expression. Result: Experiments show that the method reduces the change in emotion by up to 49% and effectively preserves facial expressions. Conclusion: By mitigating entanglement, the method offers a reliable data augmentation approach for facial expression research. Abstract: Generative Adversarial Network approaches such as StyleGAN/2 provide two key benefits: the ability to generate photo-realistic face images and a semantically structured latent space from which these images are created. Many approaches have emerged for editing images derived from vectors in the latent space of a pre-trained StyleGAN/2 model by identifying semantically meaningful directions (e.g., gender or age) in the latent space. By moving the vector in a specific direction, the ideal result would change only the target feature while preserving all the other features. This would provide an ideal data augmentation approach for gesture research, as it could be used to generate numerous image variations whilst keeping the facial expressions intact. However, entanglement issues, where changing one feature inevitably affects other features, impact the ability to preserve facial expressions. To address this, we propose adding a term derived from a Facial Keypoint Detection model to the loss function to restrict changes to the facial expressions. Building on top of an existing model, we add the proposed Human Face Landmark Detection (HFLD) loss, provided by a pre-trained Facial Keypoint Detection model, to the original loss function. We quantitatively and qualitatively evaluate the existing and our extended model, showing the effectiveness of our approach in addressing the entanglement issue and maintaining the facial expression. Our approach achieves up to 49% reduction in the change of emotion in our experiments. Moreover, we show the benefit of our approach by comparing with state-of-the-art models. By increasing the ability to preserve the facial gesture and expression during facial transformation, we present a way to create human face images with fixed expression but different appearances, making it a reliable data augmentation approach for Facial Gesture and Expression research.
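
A landmark-based loss term of the kind described above could look like the following sketch. The abstract does not define the HFLD loss, so the mean squared distance used here, the weighting factor `weight`, and both function names are assumptions for illustration; keypoints are assumed to come from some pre-trained detector as (x, y) pairs.

```python
def landmark_loss(keypoints_source, keypoints_edited):
    """Mean squared Euclidean distance between corresponding (x, y) keypoints.

    Assumption: this distance stands in for the HFLD loss, whose exact form
    is not given in the abstract.
    """
    if not keypoints_source or len(keypoints_source) != len(keypoints_edited):
        raise ValueError("keypoint lists must be non-empty and of equal length")
    total = 0.0
    for (xs, ys), (xe, ye) in zip(keypoints_source, keypoints_edited):
        total += (xs - xe) ** 2 + (ys - ye) ** 2
    return total / len(keypoints_source)


def total_loss(original_loss, keypoints_source, keypoints_edited, weight=1.0):
    """Original editing loss plus a weighted landmark-preservation term."""
    return original_loss + weight * landmark_loss(keypoints_source, keypoints_edited)


if __name__ == "__main__":
    source = [(10.0, 20.0), (30.0, 20.0), (20.0, 35.0)]
    edited = [(10.5, 20.0), (30.0, 21.0), (20.0, 35.0)]
    print(total_loss(0.8, source, edited, weight=0.1))
```

Penalizing landmark movement pushes the latent edit toward directions that change appearance without moving the mouth, eyes, or brows, which is the effect the abstract attributes to the added loss.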

[8] PromptIQ: Who Cares About Prompts? Let System Handle It -- A Component-Aware Framework for T2I Generation

Nisan Chhetri,Arpan Sainju

Main category: cs.CV

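TL;DR: PromptIQ is an automated framework that refines prompts and assesses image quality, addressing the problems that poorly structured prompts cause in text-to-image (T2I) models.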

Details

Motivation: Current T2I models often misinterpret poorly structured prompts, producing distortions and misalignments, while existing evaluation metrics such as CLIP fail to capture these structural problems. Method: Proposes the PromptIQ framework, which uses a novel Component-Aware Similarity (CAS) metric to detect and penalize structural errors and iteratively generates and evaluates images to refine the result. Result: PromptIQ significantly improves generation quality and evaluation accuracy, making T2I models easier to use for non-experts. Conclusion: Through automated prompt refinement and structural error detection, PromptIQ improves the usability and generation quality of T2I models. Abstract: Generating high-quality images without prompt engineering expertise remains a challenge for text-to-image (T2I) models, which often misinterpret poorly structured prompts, leading to distortions and misalignments. While humans easily recognize these flaws, metrics like CLIP fail to capture structural inconsistencies, exposing a key limitation in current evaluation methods. To address this, we introduce PromptIQ, an automated framework that refines prompts and assesses image quality using our novel Component-Aware Similarity (CAS) metric, which detects and penalizes structural errors. Unlike conventional methods, PromptIQ iteratively generates and evaluates images until the user is satisfied, eliminating trial-and-error prompt tuning. Our results show that PromptIQ significantly improves generation quality and evaluation accuracy, making T2I models more accessible for users with little to no prompt engineering expertise.

[9] HCMA: Hierarchical Cross-model Alignment for Grounded Text-to-Image Generation

Hang Wang,Zhi-Qi Cheng,Chenhao Lin,Chao Shen,Lei Zhang

Main category: cs.CV

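TL;DR: The HCMA framework uses global and local alignment modules to achieve semantic fidelity and spatial control in text-to-image generation, significantly improving generation quality.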

Details

Motivation: Existing methods struggle to satisfy both high-level semantic fidelity and explicit spatial control in complex scenes. Method: HCMA integrates into each diffusion sampling step a global alignment module (for scene-level coherence) and a local alignment module (for fine-grained spatial control through bounding boxes). Result: On the MS-COCO 2014 validation set, HCMA improves FID by 0.69 and CLIP Score by 0.0295. Conclusion: HCMA balances semantic fidelity with spatial control and offers a robust solution for text-to-image generation. Abstract: Text-to-image synthesis has progressed to the point where models can generate visually compelling images from natural language prompts. Yet, existing methods often fail to reconcile high-level semantic fidelity with explicit spatial control, particularly in scenes involving multiple objects, nuanced relations, or complex layouts. To bridge this gap, we propose a Hierarchical Cross-Modal Alignment (HCMA) framework for grounded text-to-image generation. HCMA integrates two alignment modules into each diffusion sampling step: a global module that continuously aligns latent representations with textual descriptions to ensure scene-level coherence, and a local module that employs bounding-box layouts to anchor objects at specified locations, enabling fine-grained spatial control. Extensive experiments on the MS-COCO 2014 validation set show that HCMA surpasses state-of-the-art baselines, achieving a 0.69 improvement in Frechet Inception Distance (FID) and a 0.0295 gain in CLIP Score. These results demonstrate HCMA's effectiveness in faithfully capturing intricate textual semantics while adhering to user-defined spatial constraints, offering a robust solution for semantically grounded image generation. Our code is available at https://github.com/hwang-cs-ime/HCMA

[10] RESAR-BEV: An Explainable Progressive Residual Autoregressive Approach for Camera-Radar Fusion in BEV Segmentation

Zhiwen Zeng,Yunfei Yin,Zheng Yuan,Argho Dey,Xianjian Bao

Main category: cs.CV

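TL;DR: RESAR-BEV is a progressive refinement framework that improves BEV semantic segmentation through residual autoregressive learning and dual-path voxel feature encoding, reaching 54.0% mIoU at a real-time 14.6 FPS on nuScenes.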

Details

Motivation: BEV semantic segmentation gives autonomous driving comprehensive environmental perception but suffers from multi-modal misalignment and sensor noise. Method: Uses a progressive refinement framework that combines residual autoregressive learning, dual-path voxel feature encoding, and decoupled supervision. Result: Reaches 54.0% mIoU on nuScenes at 14.6 FPS and remains robust in long-range perception and adverse weather. Conclusion: Progressive refinement gives RESAR-BEV high performance and robustness, making it suitable for autonomous driving. Abstract: Bird's-Eye-View (BEV) semantic segmentation provides comprehensive environmental perception for autonomous driving but suffers from multi-modal misalignment and sensor noise. We propose RESAR-BEV, a progressive refinement framework that advances beyond single-step end-to-end approaches: (1) progressive refinement through residual autoregressive learning that decomposes BEV segmentation into interpretable coarse-to-fine stages via our Drive-Transformer and Modifier-Transformer residual prediction cascaded architecture, (2) robust BEV representation combining ground-proximity voxels with adaptive height offsets and dual-path voxel feature encoding (max+attention pooling) for efficient feature extraction, and (3) decoupled supervision with offline Ground Truth decomposition and online joint optimization to prevent overfitting while ensuring structural coherence. Experiments on nuScenes demonstrate RESAR-BEV achieves state-of-the-art performance with 54.0% mIoU across 7 essential driving-scene categories while maintaining real-time capability at 14.6 FPS. The framework exhibits robustness in challenging scenarios of long-range perception and adverse weather conditions.
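
The "max+attention pooling" encoding mentioned in item (2) can be sketched for a single voxel as follows. The abstract gives no architectural detail, so scoring points by a dot product with a `query` vector and concatenating the two pooled vectors are assumptions; in a real model the query would be learned.

```python
import math


def max_pool(points):
    """Element-wise maximum over the point features in one voxel."""
    return [max(column) for column in zip(*points)]


def attention_pool(points, query):
    """Softmax-weighted sum of point features.

    Assumption: attention scores are dot products with a (hypothetical)
    query vector, which a trained network would learn.
    """
    scores = [sum(f * q for f, q in zip(point, query)) for point in points]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dims = len(points[0])
    return [sum(w * point[d] for w, point in zip(weights, points)) for d in range(dims)]


def dual_path_encode(points, query):
    """Concatenate the max-pooled and attention-pooled voxel features."""
    return max_pool(points) + attention_pool(points, query)


if __name__ == "__main__":
    voxel_points = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    print(dual_path_encode(voxel_points, query=[1.0, 0.0]))
```

Max pooling keeps the strongest response per channel, while attention pooling keeps a weighted summary of all points; using both paths is one plausible reading of how the encoder stays robust to sparse or noisy points.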

[11] Quantum Conflict Measurement in Decision Making for Out-of-Distribution Detection

Yilin Dong,Tianyun Zhu,Xinde Li,Jean Dezert,Rigui Zhou,Changming Zhu,Lei Cao,Shuzhi Sam Ge

Main category: cs.CV

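TL;DR: Proposes a Quantum Conflict Indicator (QCI) that measures conflict between quantum mass functions (QMFs) in Quantum Dempster-Shafer Theory (QDST) and validates its properties. The QCI is then applied to conflict fusion and to out-of-distribution (OOD) detection, where it outperforms existing baselines.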

Details

Motivation: Managing conflict between multiple QMFs in QDST is a challenging problem that calls for an effective conflict measure. Method: Proposes the QCI as a conflict measure, studies its properties, and applies it to conflict fusion methods and to the OOD detection task. Result: The QCI satisfies the desirable properties of a conflict measure; the QCI-based fusion method outperforms commonly used approaches and gives better results on OOD detection. Conclusion: The QCI offers an effective solution for conflict management in QDST and shows clear advantages in practical applications. Abstract: Quantum Dempster-Shafer Theory (QDST) uses quantum interference effects to derive a quantum mass function (QMF) as a fuzzy metric type from information obtained from various data sources. In addition, QDST uses quantum parallel computing to speed up computation. Nevertheless, the effective management of conflicts between multiple QMFs in QDST is a challenging question. This work aims to address this problem by proposing a Quantum Conflict Indicator (QCI) that measures the conflict between two QMFs in decision-making. Then, the properties of the QCI are carefully investigated. The obtained results validate its compliance with desirable conflict measurement properties such as non-negativity, symmetry, boundedness, extreme consistency and insensitivity to refinement. We then apply the proposed QCI in conflict fusion methods and compare its performance with several commonly used fusion approaches. This comparison demonstrates the superiority of the QCI-based conflict fusion method. Moreover, the Class Description Domain Space (C-DDS) and its optimized version, C-DDS+ by utilizing the QCI-based fusion method, are proposed to address the Out-of-Distribution (OOD) detection task. The experimental results show that the proposed approach gives better OOD performance with respect to several state-of-the-art baseline OOD detection methods. Specifically, it achieves an average increase in Area Under the Receiver Operating Characteristic Curve (AUC) of 1.2% and a corresponding average decrease in False Positive Rate at 95% True Negative Rate (FPR95) of 5.4% compared to the optimal baseline method.

Xiaohong Huang,Cui Yang,Miaowen Wen

Main category: cs.CV

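TL;DR: Proposes a visual-inertial odometry (VIO) method based on long-tracked features that actively decouples accumulated errors and uses efficient optimization strategies to improve positioning accuracy while keeping real-time performance.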

Details

Motivation: Long-tracked features constrain more visual frames and so reduce localization drift, but they introduce accumulated matching errors and tracking drift, and existing weight adjustment based on re-projection error can mislead the optimization. Method: Proposes an active decoupling mechanism, comprising a visual reference frame reset strategy and a depth prediction strategy, together with three efficient state estimation strategies (parallel elimination, inverse-depth elimination simplification, and elimination skipping) to ensure real-time performance. Result: Experiments on several datasets show higher positioning accuracy with relatively short run time, suiting low-altitude IoT navigation on edge devices. Conclusion: The method addresses both error accumulation and the real-time cost of long-tracked features, providing a practical solution for high-accuracy edge navigation. Abstract: This paper presents a visual-inertial odometry (VIO) method using long-tracked features. Long-tracked features can constrain more visual frames, reducing localization drift. However, they may also lead to accumulated matching errors and drift in feature tracking. Current VIO methods adjust observation weights based on re-projection errors, yet this approach has flaws. Re-projection errors depend on estimated camera poses and map points, so increased errors might come from estimation inaccuracies, not actual feature tracking errors. This can mislead the optimization process and make long-tracked features ineffective for suppressing localization drift. Furthermore, long-tracked features constrain a larger number of frames, which poses a significant challenge to real-time performance of the system. To tackle these issues, we propose an active decoupling mechanism for accumulated errors in long-tracked feature utilization. We introduce a visual reference frame reset strategy to eliminate accumulated tracking errors and a depth prediction strategy to leverage the long-term constraint. To ensure real-time performance, we implement three strategies for efficient system state estimation: a parallel elimination strategy based on predefined elimination order, an inverse-depth elimination simplification strategy, and an elimination skipping strategy. Experiments on various datasets show that our method offers higher positioning accuracy with relatively short consumption time, making it more suitable for edge-enabled low-altitude IoT navigation, where high-accuracy positioning and real-time operation on edge devices are required. The code will be published on GitHub.

[13] Causal Prompt Calibration Guided Segment Anything Model for Open-Vocabulary Multi-Entity Segmentation

Jingyao Wang,Jianqi Zhang,Wenwen Qiang,Changwen Zheng

Main category: cs.CV

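TL;DR: SAM suffers from generalization issues in open-vocabulary multi-entity segmentation (OVMS), mainly because of prompt bias. Based on a causal analysis, the paper proposes a Causal Prompt Calibration method (CPC-SAM) that uses a lightweight causal prompt learner (CaPL) to eliminate task-irrelevant factors and achieve accurate segmentation.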

Details

Motivation: SAM generalizes poorly on OVMS tasks, mainly because of prompt bias and interference from task-irrelevant generating factors, which the paper addresses through causal analysis. Method: Proposes CPC-SAM, which integrates CaPL to generate causal prompts and calibrates them using causal multi-distribution consistency theory and a bi-level optimization strategy. Result: Experiments validate the superiority of CPC-SAM on OVMS tasks. Conclusion: Causal prompt calibration effectively improves the generalization of SAM in OVMS. Abstract: Despite the strength of the Segment Anything Model (SAM), it struggles with generalization issues in open-vocabulary multi-entity segmentation (OVMS). Through empirical and causal analyses, we find that (i) the prompt bias is the primary cause of the generalization issues; (ii) this bias is closely tied to the task-irrelevant generating factors within the prompts, which act as confounders and affect generalization. To address the generalization issues, we aim to propose a method that can calibrate prompts to eliminate confounders for accurate OVMS. Building upon the causal analysis, we propose that the optimal prompt for OVMS should contain only task-relevant causal factors. We define it as the causal prompt, serving as the goal of calibration. Next, our theoretical analysis, grounded in causal multi-distribution consistency theory, proves that this prompt can be obtained by enforcing segmentation consistency and optimality. Inspired by this, we propose CPC-SAM, a Causal Prompt Calibration method for SAM to achieve accurate OVMS. It integrates a lightweight causal prompt learner (CaPL) into SAM to obtain causal prompts. Specifically, we first generate multiple prompts using random annotations to simulate diverse distributions and then reweight them via CaPL by enforcing causal multi-distribution consistency in both task and entity levels. To ensure obtaining causal prompts, CaPL is optimized by minimizing the cumulative segmentation loss across the reweighted prompts to achieve consistency and optimality. A bi-level optimization strategy alternates between optimizing CaPL and SAM, ensuring accurate OVMS. Extensive experiments validate its superiority.

[14] Improving Generalization of Medical Image Registration Foundation Model

Jing Hu,Kaiwei Yu,Hongjiang Xian,Shu Hu,Xin Wang

Main category: cs.CV

TL;DR: Proposes incorporating Sharpness-Aware Minimization (SAM) into foundation models to improve the generalization and robustness of medical image registration.

Details

Motivation: Traditional methods are computationally inefficient, deep learning methods lack flexibility and generalizability, and foundation models, despite strong performance, still struggle with novel structures or modalities. Method: Incorporates SAM into foundation models, improving stability and generalization by optimizing the flatness of the loss landscape. Result: Experiments show that foundation models combined with SAM improve significantly on cross-dataset registration. Conclusion: The approach offers new insights for advancing medical image registration. Abstract: Deformable registration is a fundamental task in medical image processing, aiming to achieve precise alignment by establishing nonlinear correspondences between images. Traditional methods offer good adaptability and interpretability but are limited by computational efficiency. Although deep learning approaches have significantly improved registration speed and accuracy, they often lack flexibility and generalizability across different datasets and tasks. In recent years, foundation models have emerged as a promising direction, leveraging large and diverse datasets to learn universal features and transformation patterns for image registration, thus demonstrating strong cross-task transferability. However, these models still face challenges in generalization and robustness when encountering novel anatomical structures, varying imaging conditions, or unseen modalities. To address these limitations, this paper incorporates Sharpness-Aware Minimization (SAM) into foundation models to enhance their generalization and robustness in medical image registration. By optimizing the flatness of the loss landscape, SAM improves model stability across diverse data distributions and strengthens its ability to handle complex clinical scenarios. Experimental results show that foundation models integrated with SAM achieve significant improvements in cross-dataset registration performance, offering new insights for the advancement of medical image registration technology. Our code is available at https://github.com/Promise13/fm_sam.
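
For readers unfamiliar with Sharpness-Aware Minimization, the sketch below shows one SAM update on a toy problem. It follows the usual two-step form of SAM (ascend to a nearby worst-case point, then descend with the gradient taken there); the toy quadratic loss, `toy_grad`, and the values of `lr` and `rho` are illustrative assumptions and have nothing to do with the registration model in the paper.

```python
import math


def sam_step(weights, grad_fn, lr=0.1, rho=0.05):
    """One Sharpness-Aware Minimization update.

    1. Compute the gradient at the current weights.
    2. Move a distance rho along the normalized gradient (ascent step).
    3. Update the original weights with the gradient at the perturbed point.
    """
    grad = grad_fn(weights)
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(weights)  # already at a stationary point
    perturbed = [w + rho * g / norm for w, g in zip(weights, grad)]
    sharp_grad = grad_fn(perturbed)
    return [w - lr * g for w, g in zip(weights, sharp_grad)]


def toy_grad(weights):
    """Gradient of the toy loss sum(w_i^2); stands in for a real model."""
    return [2.0 * w for w in weights]


if __name__ == "__main__":
    w = [1.0, -0.5]
    for _ in range(20):
        w = sam_step(w, toy_grad)
    print(w)
```

Because the descent direction is evaluated at the perturbed point, the update favors regions where the loss stays low in a neighborhood of the weights, which is the flatness property the abstract relies on for cross-dataset generalization.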

[15] Unmasking Deep Fakes: Leveraging Deep Learning for Video Authenticity Detection

Mahmudul Hasan

Main category: cs.CV

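TL;DR: Proposes a deepfake video detection method based on a convolutional neural network (EfficientNet-B5), with MTCNN for face detection, and reports good performance on the Kaggle DFDC dataset.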

Details

Motivation: As deepfakes become increasingly realistic, detecting forged videos requires more advanced methods. The paper aims to identify deepfake videos using deep learning. Method: Uses MTCNN for face detection and EfficientNet-B5 as the encoder model to predict whether a video is a deepfake. Result: On the Kaggle DFDC dataset the model achieves 42.78% log loss, 93.80% AUC, and 86.82% F1 score. Conclusion: Deep learning performs well for deepfake video detection, supporting the effectiveness of the proposed method. Abstract: Deepfake videos, produced through advanced artificial intelligence methods, pose a new challenge to the truthfulness of digital media. As deepfakes become more convincing day by day, detecting them requires advanced methods capable of identifying subtle inconsistencies. The primary motivation of this paper is to recognize deepfake videos using deep learning techniques, specifically by using convolutional neural networks. Deep learning excels in pattern recognition, which makes it an ideal approach for detecting the intricate manipulations in deepfakes. In this paper, we consider using MTCNN as a face detector and EfficientNet-B5 as the encoder model to predict whether a video is a deepfake or not. We utilize training and evaluation datasets from Kaggle DFDC. The results show that our deepfake detection model achieved 42.78% log loss, 93.80% AUC and 86.82% F1 score on Kaggle's DFDC dataset.

Feng Liu,Ziwang Fu,Yunlong Wang,Qijian Zheng

Main category: cs.CV

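TL;DR: Proposes a Transformer-based Adaptive Cross-modal Fusion Network (TACFN) that selects features through self-attention and improves cross-modal interaction, significantly raising multimodal emotion recognition performance.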

Details

Motivation: Existing cross-modal attention methods suffer from redundant features and capture complementary features poorly, so a more efficient form of cross-modal interaction is needed. Method: Designs TACFN, which uses self-attention for intra-modal feature selection and reinforces features across modalities with a weight vector obtained by splicing. Result: On the RAVDESS and IEMOCAP datasets, TACFN significantly outperforms other methods and reaches state-of-the-art performance. Conclusion: Through adaptive feature selection and reinforcement, TACFN addresses redundancy and complementarity in cross-modal fusion and improves emotion recognition. Abstract: The fusion technique is the key to the multimodal emotion recognition task. Recently, cross-modal attention-based fusion methods have demonstrated high performance and strong robustness. However, cross-modal attention suffers from redundant features and does not capture complementary features well. We find that it is not necessary to use the entire information of one modality to reinforce the other during cross-modal interaction, and the features that can reinforce a modality may contain only a part of it. To this end, we design an innovative Transformer-based Adaptive Cross-modal Fusion Network (TACFN). Specifically, for the redundant features, we make one modality perform intra-modal feature selection through a self-attention mechanism, so that the selected features can adaptively and efficiently interact with another modality. To better capture the complementary information between the modalities, we obtain the fused weight vector by splicing and use the weight vector to achieve feature reinforcement of the modalities. We apply TACFN to the RAVDESS and IEMOCAP datasets. For fair comparison, we use the same unimodal representations to validate the effectiveness of the proposed fusion method. The experimental results show that TACFN brings a significant performance improvement compared to other methods and reaches the state-of-the-art. All code and models can be accessed from https://github.com/shuzihuaiyu/TACFN.

[17] ProFashion: Prototype-guided Fashion Video Generation with Multiple Reference Images

Xianghao Kong,Qiaosong Qi,Yuanbin Wang,Anyi Rao,Biaolong Chen,Aixi Zhang,Si Liu,Hao Jiang

Main category: cs.CV

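TL;DR: ProFashion is a framework that generates fashion videos from multiple reference images, using a pose-aware prototype aggregator and a flow-enhanced prototype instantiator to improve view consistency and temporal coherence.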

Details

Motivation: Existing diffusion-based methods accept only a single reference image, which limits view-consistent generation, and their motion modules model human body movement insufficiently. Method: Proposes a Pose-aware Prototype Aggregator and a Flow-enhanced Prototype Instantiator, which use pose information and keypoint motion flow to improve feature aggregation and spatiotemporal attention, respectively. Result: Outperforms existing methods on the MRFashion-7K and UBC Fashion datasets. Conclusion: With multiple reference images and motion-flow enhancement, ProFashion significantly improves the view consistency and temporal coherence of generated fashion videos. Abstract: Fashion video generation aims to synthesize temporally consistent videos from reference images of a designated character. Despite significant progress, existing diffusion-based methods only support a single reference image as input, severely limiting their capability to generate view-consistent fashion videos, especially when there are different patterns on the clothes from different perspectives. Moreover, the widely adopted motion module does not sufficiently model human body movement, leading to sub-optimal spatiotemporal consistency. To address these issues, we propose ProFashion, a fashion video generation framework leveraging multiple reference images to achieve improved view consistency and temporal coherency. To effectively leverage features from multiple reference images while maintaining a reasonable computational cost, we devise a Pose-aware Prototype Aggregator, which selects and aggregates global and fine-grained reference features according to pose information to form frame-wise prototypes, which serve as guidance in the denoising process. To further enhance motion consistency, we introduce a Flow-enhanced Prototype Instantiator, which exploits the human keypoint motion flow to guide an extra spatiotemporal attention process in the denoiser. To demonstrate the effectiveness of ProFashion, we extensively evaluate our method on the MRFashion-7K dataset we collected from the Internet. ProFashion also outperforms previous methods on the UBC Fashion dataset.

[18] HDGlyph: A Hierarchical Disentangled Glyph-Based Framework for Long-Tail Text Rendering in Diffusion Models

Shuhan Zhuang,Mengqi Huang,Fengyi Fu,Nan Chen,Bohan Lei,Zhendong Mao

Main category: cs.CV

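TL;DR: HDGlyph is a hierarchical, disentangled glyph-based framework that jointly optimizes common and long-tail text rendering, significantly improving the accuracy of visual text rendering and image quality.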

Details

Motivation: Current methods perform poorly on long-tail text (such as unseen characters or small-sized text), which limits their usefulness in applications such as commercial design. Method: HDGlyph disentangles pixel-level representations during training via the Multi-Linguistic GlyphNet and the Glyph-Aware Perceptual Loss, and at inference applies Noise-Disentangled Classifier-Free Guidance and the LD-TSR scheme. Result: The model improves accuracy by 5.08% and 11.7% in English and Chinese text rendering, respectively, and performs well in long-tail scenarios. Conclusion: Through hierarchical disentanglement and joint optimization, HDGlyph significantly improves the accuracy and visual quality of text rendering, especially in long-tail scenarios. Abstract: Visual text rendering, which aims to accurately integrate specified textual content within generated images, is critical for various applications such as commercial design. Despite recent advances, current methods struggle with long-tail text cases, particularly when handling unseen or small-sized text. In this work, we propose a novel Hierarchical Disentangled Glyph-Based framework (HDGlyph) that hierarchically decouples text generation from non-text visual synthesis, enabling joint optimization of both common and long-tail text rendering. At the training stage, HDGlyph disentangles pixel-level representations via the Multi-Linguistic GlyphNet and the Glyph-Aware Perceptual Loss, ensuring robust rendering even for unseen characters. At inference time, HDGlyph applies Noise-Disentangled Classifier-Free Guidance and the Latent-Disentangled Two-Stage Rendering (LD-TSR) scheme, which refines both background and small-sized text. Extensive evaluations show our model consistently outperforms others, with 5.08% and 11.7% accuracy gains in English and Chinese text rendering while maintaining high image quality. It also excels in long-tail scenarios with strong accuracy and visual performance.

[19] Weakly Supervised Temporal Sentence Grounding via Positive Sample Mining

Lu Dong,Haiyu Zhang,Hongjie Zhang,Yifei Huang,Zhen-Hua Ling,Yu Qiao,Limin Wang,Yali Wang

Main category: cs.CV

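TL;DR: Proposes PSM, a new framework that mines positive samples to provide more discriminative supervision, addressing the problem of similar samples being mistaken for negatives in weakly supervised temporal sentence grounding.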

Details

Motivation: Existing methods treat similar samples directly as negatives, which makes optimization difficult and ignores their correlation with the anchor. Method: Proposes the PSM framework, which partitions the training set by text-query similarity and introduces a PSM-guided contrastive loss and a PSM-guided rank loss. Result: Effectiveness and superiority are validated on the WSTSG and VideoQA tasks. Conclusion: By mining positive samples and refining the loss functions, PSM significantly improves weakly supervised temporal sentence grounding. Abstract: The task of weakly supervised temporal sentence grounding (WSTSG) aims to detect temporal intervals corresponding to a language description from untrimmed videos with only video-level video-language correspondence. For an anchor sample, most existing approaches generate negative samples either from other videos or within the same video for contrastive learning. However, some training samples are highly similar to the anchor sample; directly regarding them as negative samples leads to difficulties for optimization and ignores the correlations between these similar samples and the anchor sample. To address this, we propose Positive Sample Mining (PSM), a novel framework that mines positive samples from the training set to provide more discriminative supervision. Specifically, for a given anchor sample, we partition the remaining training set into semantically similar and dissimilar subsets based on the similarity of their text queries. To effectively leverage these correlations, we introduce a PSM-guided contrastive loss to ensure that the anchor proposal is closer to similar samples and further from dissimilar ones. Additionally, we design a PSM-guided rank loss to ensure that similar samples are closer to the anchor proposal than to the negative intra-video proposal, aiming to distinguish the anchor proposal and the negative intra-video proposal. Experiments on the WSTSG and grounded VideoQA tasks demonstrate the effectiveness and superiority of our method.
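
The partition-then-contrast idea in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the cosine-similarity threshold, the function names, and the multi-positive InfoNCE-style loss are all assumptions standing in for the PSM-guided contrastive loss, whose exact form the abstract does not give.

```python
import numpy as np

def partition_by_query_similarity(anchor_query, other_queries, threshold=0.8):
    """Split samples into similar / dissimilar index sets by cosine similarity.

    anchor_query: (d,) text-query embedding; other_queries: (n, d).
    The threshold value is a hypothetical choice for illustration.
    """
    a = anchor_query / np.linalg.norm(anchor_query)
    o = other_queries / np.linalg.norm(other_queries, axis=1, keepdims=True)
    sims = o @ a
    return np.flatnonzero(sims >= threshold), np.flatnonzero(sims < threshold)

def multi_positive_infonce(anchor_proposal, sample_feats, similar, dissimilar, tau=0.1):
    """InfoNCE-style loss that treats every mined similar sample as a positive.

    Assumes at least one index in `similar`; smaller values mean the anchor
    proposal sits closer to similar samples than to dissimilar ones.
    """
    p = anchor_proposal / np.linalg.norm(anchor_proposal)
    f = sample_feats / np.linalg.norm(sample_feats, axis=1, keepdims=True)
    logits = (f @ p) / tau
    pos = np.exp(logits[similar]).sum()
    neg = np.exp(logits[dissimilar]).sum()
    return -np.log(pos / (pos + neg))
```

In this sketch the loss decreases as the anchor proposal moves toward the mined positives, which is the behavior the abstract describes for the contrastive term; the rank loss is not covered here.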

[20] Dynamic Uncertainty Learning with Noisy Correspondence for Text-Based Person Search

Zequn Xie,Haoming Ji,Lingwei Meng

Main category: cs.CV

TL;DR: The paper proposes the DURA framework, which uses KFS and DSH-Loss to handle noise in text-image pairs and improve retrieval performance.

Details

Motivation: Large-scale text-image datasets built from online data can contain noise (such as mismatched pairs), and existing methods may amplify that noise and degrade retrieval. Method: Proposes the DURA framework, comprising KFS (which models noise uncertainty) and DSH-Loss (which dynamically adjusts the difficulty of negative samples). Result: Noise resistance and improved retrieval performance are validated on three datasets. Conclusion: DURA handles noise effectively and improves the robustness and performance of text-to-image person search. Abstract: Text-to-image person search aims to identify an individual based on a text description. To reduce data collection costs, large-scale text-image datasets are created from co-occurrence pairs found online. However, this can introduce noise, particularly mismatched pairs, which degrade retrieval performance. Existing methods often focus on negative samples, amplifying this noise. To address these issues, we propose the Dynamic Uncertainty and Relational Alignment (DURA) framework, which includes the Key Feature Selector (KFS) and a new loss function, Dynamic Softmax Hinge Loss (DSH-Loss). KFS captures and models noise uncertainty, improving retrieval reliability. The bidirectional evidence from cross-modal similarity is modeled as a Dirichlet distribution, enhancing adaptability to noisy data. DSH adjusts the difficulty of negative samples to improve robustness in noisy environments. Our experiments on three datasets show that the method offers strong noise resistance and improves retrieval performance in both low- and high-noise scenarios.

[21] ElectricSight: 3D Hazard Monitoring for Power Lines Using Low-Cost Sensors

Xingchen Li,LiDian Wang,Yu Sheng,ZhiPeng Tang,Haojie Ren,Guoliang You,YiFan Duan,Jianmin Ji,Yanyong Zhang

Main category: cs.CV

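TL;DR: ElectricSight combines real-time images with environmental point cloud priors to provide low-cost, accurate 3D distance measurement for monitoring potential hazards to power transmission lines.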

Details

Motivation: Existing sensor-based methods struggle to balance accuracy and cost: cameras lack depth information, and 3D lasers are too expensive. Method: Combines real-time images with environmental point cloud priors and proposes a monocular depth estimation method to improve the system's accuracy and reliability. Result: Test data show an average measurement accuracy of 1.08 m and an early-warning accuracy of 92%. Conclusion: ElectricSight offers a cost-effective and accurate solution for 3D distance monitoring of power transmission lines. Abstract: Protecting power transmission lines from potential hazards involves critical tasks, one of which is the accurate measurement of distances between power lines and potential threats, such as large cranes. The difficulty is that current sensor-based methods struggle to balance accuracy and cost in distance measurement. A common practice is to install cameras on transmission towers, which, however, struggle to measure true 3D distances due to the lack of depth information. Although 3D lasers can provide accurate depth data, their high cost makes large-scale deployment impractical. To address this challenge, we present ElectricSight, a system designed for 3D distance measurement and monitoring of potential hazards to power transmission lines. This work's key innovations lie in both the overall system framework and a monocular depth estimation method. Specifically, the system framework combines real-time images with environmental point cloud priors, enabling cost-effective and precise 3D distance measurements. As a core component of the system, the monocular depth estimation method enhances the performance by integrating 3D point cloud data into image-based estimates, improving both the accuracy and reliability of the system. To assess ElectricSight's performance, we conducted tests with data from a real-world power transmission scenario. The experimental results demonstrate that ElectricSight achieves an average accuracy of 1.08 m for distance measurements and an early warning accuracy of 92%.

[22] GRACE: Estimating Geometry-level 3D Human-Scene Contact from 2D Images

Chengfeng Wang,Wei Zhai,Yuhang Yang,Yang Cao,Zhengjun Zha

Main category: cs.CV

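TL;DR: GRACE is a new method for 3D human-scene contact estimation that combines geometric features with image semantics to achieve high accuracy and strong generalization.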

Details

Motivation: Existing methods rely on the fixed vertex sequence of a parametric human model and do not take geometric structure into account, which limits generalization. GRACE aims to address this. Method: GRACE uses a point cloud encoder-decoder architecture and a hierarchical feature extraction and fusion module to combine 3D geometric structure with 2D image semantics, establishing an implicit mapping from geometric features to the vertex space. Result: GRACE achieves state-of-the-art performance on multiple benchmark datasets and generalizes well to unstructured point clouds. Conclusion: Geometry-level reasoning significantly improves the accuracy and generalization of contact estimation, providing strong support for related applications. Abstract: Estimating the geometry level of human-scene contact aims to ground specific contact surface points at 3D human geometries, which provides a spatial prior and bridges the interaction between human and scene, supporting applications such as human behavior analysis, embodied AI, and AR/VR. To complete the task, existing approaches predominantly rely on parametric human models (e.g., SMPL), which establish correspondences between images and contact regions through fixed SMPL vertex sequences. This actually completes the mapping from image features to an ordered sequence. However, this approach lacks consideration of geometry, limiting its generalizability in distinct human geometries. In this paper, we introduce GRACE (Geometry-level Reasoning for 3D Human-scene Contact Estimation), a new paradigm for 3D human contact estimation. GRACE incorporates a point cloud encoder-decoder architecture along with a hierarchical feature extraction and fusion module, enabling the effective integration of 3D human geometric structures with 2D interaction semantics derived from images. Guided by visual cues, GRACE establishes an implicit mapping from geometric features to the vertex space of the 3D human mesh, thereby achieving accurate modeling of contact regions. This design ensures high prediction accuracy and endows the framework with strong generalization capability across diverse human geometries. Extensive experiments on multiple benchmark datasets demonstrate that GRACE achieves state-of-the-art performance in contact estimation, with additional results further validating its robust generalization to unstructured human point clouds.

[23] Two-Stage Random Alternation Framework for Zero-Shot Pansharpening

Haorui Chen,Zeyu Ren,Jiaxuan Ren,Ran Ran,Jinliang Shao,Jie Huang,Liangjian Deng

Main category: cs.CV

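TL;DR: Proposes TRA-PAN, a two-stage random alternation framework that combines strong supervision constraints from reduced-resolution images with the physical characteristics of full-resolution images, addressing the limits that the lack of real high-resolution images places on deep learning for image fusion.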

Details

Motivation: Deep learning performs well in image fusion, but the lack of real high-resolution images limits its practical use. Method: A two-stage framework: the first stage pre-trains with Degradation-Aware Modeling (DAM) and a warm-up procedure; the second stage optimizes the model with Random Alternation Optimization (RAO), combining the strengths of reduced- and full-resolution images. Result: Experiments show that TRA-PAN outperforms existing methods in both quantitative metrics and visual quality, and needs only a single image pair for zero-shot training. Conclusion: TRA-PAN is highly practical, removes the need for large datasets, and performs well in real-world scenarios. Abstract: In recent years, pansharpening has seen rapid advancements with deep learning methods, which have demonstrated impressive fusion quality. However, the challenge of acquiring real high-resolution images limits the practical applicability of these methods. To address this, we propose a two-stage random alternating framework (TRA-PAN) that effectively integrates strong supervision constraints from reduced-resolution images with the physical characteristics of full-resolution images. The first stage introduces a pre-training procedure, which includes Degradation-Aware Modeling (DAM) to capture spatial-spectral degradation mappings, alongside a warm-up procedure designed to reduce training time and mitigate the negative effects of reduced-resolution data. In the second stage, Random Alternation Optimization (RAO) is employed, where random alternating training leverages the strengths of both reduced- and full-resolution images, further optimizing the fusion model. By primarily relying on full-resolution images, our method enables zero-shot training with just a single image pair, obviating the need for large datasets. Experimental results demonstrate that TRA-PAN outperforms state-of-the-art (SOTA) methods in both quantitative metrics and visual quality in real-world scenarios, highlighting its strong practical applicability.

[24] Compact and Efficient Neural Networks for Image Recognition Based on Learned 2D Separable Transform

Maxim Vashkevich,Egor Krivalcevich

Main category: cs.CV

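TL;DR: The paper proposes a neural network layer based on a learned two-dimensional separable transform (LST) for image recognition, which significantly reduces the parameter count and reaches 98.02% accuracy on MNIST.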

Details

Motivation: Design an efficient neural network layer that reduces model parameters while maintaining high performance, suitable for resource-constrained platforms such as FPGAs. Method: Builds the LST layer from weight-shared fully connected layers that process the rows and then the columns of the image, replacing conventional stacked fully connected layers. Result: On MNIST, a classifier based on the LST layer reaches 98.02% accuracy with only 9.5k parameters, and its efficiency is demonstrated on an FPGA platform. Conclusion: The LST layer is an efficient and compact neural network design approach, suited to resource-constrained hardware implementations. Abstract: The paper presents a learned two-dimensional separable transform (LST) that can be considered as a new type of computational layer for constructing neural network (NN) architecture for image recognition tasks. The LST is based on the idea of sharing the weights of one fully connected (FC) layer to process all rows of an image. After that, a second shared FC layer is used to process all columns of the image representation obtained from the first layer. The use of LST layers in a NN architecture significantly reduces the number of model parameters compared to models that use stacked FC layers. We show that a NN-classifier based on a single LST layer followed by an FC layer achieves 98.02% accuracy on the MNIST dataset, while having only 9.5k parameters. We also implemented an LST-based classifier for handwritten digit recognition on the FPGA platform to demonstrate the efficiency of the suggested approach for designing a compact and high-performance implementation of NN models. Git repository with supplementary materials: https://github.com/Mak-Sim/LST-2d
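
The row-then-column weight sharing described here can be sketched in a few lines of NumPy. This is an illustrative reading of the abstract, not the code from the linked repository: the tanh activation, the bias terms, the square 28x28 weight shapes, and the function names are assumptions.

```python
import numpy as np

def lst_layer(x, w_row, b_row, w_col, b_col):
    """Apply a 2D separable transform to a batch of images.

    x: (batch, H, W). w_row: (W, W_out) is shared by every row;
    w_col: (H, H_out) is shared by every column of the row-transformed image.
    The tanh nonlinearity is an assumption, not taken from the paper.
    """
    # Step 1: the same FC layer processes each row (acts on the last axis).
    rows = np.tanh(x @ w_row + b_row)                # (batch, H, W_out)
    # Step 2: a second shared FC layer processes each column.
    cols = np.swapaxes(rows, 1, 2) @ w_col + b_col   # (batch, W_out, H_out)
    return np.tanh(np.swapaxes(cols, 1, 2))          # (batch, H_out, W_out)

def count_params(h=28, w=28, classes=10):
    """Parameter count for one LST layer plus an FC classifier head."""
    lst = (w * w + w) + (h * h + h)        # two shared FC layers with biases
    head = h * w * classes + classes       # FC layer on the flattened output
    return lst + head
```

Under these assumed shapes, `count_params()` gives 9474 for 28x28 inputs and 10 classes, which is in line with the roughly 9.5k parameters quoted in the abstract; whether that matches the authors' exact configuration is not confirmed by the abstract.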

[25] Batch Augmentation with Unimodal Fine-tuning for Multimodal Learning

H M Dipu Kabir,Subrota Kumar Mondal,Mohammad Ali Moni

Main category: cs.CV

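TL;DR: Proposes a method combining unimodal fine-tuning and batch augmentation to detect fetal organs from ultrasound images and clinical text, improving performance through pre-training and multimodal training.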

Details

Motivation: Accurately detect fetal organs from ultrasound images and textual information, and make better use of multimodal data. Method: Initializes layer weights with unimodal pre-training, extracts features with neural networks under batch augmentation, and then refines the result through multimodal training. Result: Performs well on the FPU23 and UPMC Food-101 datasets, approaching SOTA. Conclusion: The proposed training gives the best results among the investigated methods on the multimodal tasks, and the code is open source. Abstract: This paper proposes batch augmentation with unimodal fine-tuning to detect the fetus's organs from ultrasound images and associated clinical textual information. We also prescribe pre-training initial layers with investigated medical data before the multimodal training. At first, we apply a transferred initialization with the unimodal image portion of the dataset with batch augmentation. This step adjusts the initial layer weights for medical data. Then, we apply neural networks (NNs) with fine-tuned initial layers to images in batches with batch augmentation to obtain features. We also extract information from descriptions of images. We combine this information with features obtained from images to train the head layer. We write a dataloader script to load the multimodal data and use existing unimodal image augmentation techniques with batch augmentation for the multimodal data. The dataloader brings a new random augmentation for each batch to get a good generalization. We investigate the FPU23 ultrasound and UPMC Food-101 multimodal datasets. The multimodal large language model (LLM) with the proposed training provides the best results among the investigated methods. We receive near state-of-the-art (SOTA) performance on the UPMC Food-101 dataset. We share the scripts of the proposed method with traditional counterparts at the following repository: github.com/dipuk0506/multimodal
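
One way to read "a new random augmentation for each batch" is a loader that draws fresh augmentation parameters once per batch and keeps each image paired with its text. The sketch below is hypothetical and is not taken from the linked repository: the function name, the flip-and-shift augmentations, and the array layout are assumptions chosen only to show the per-batch sampling.

```python
import numpy as np

def augmented_batches(images, texts, batch_size, rng):
    """Yield (image_batch, text_batch) pairs with per-batch random augmentation.

    images: (n, H, W) array; texts: list of n strings.
    The specific augmentations (flip, horizontal shift) are illustrative only.
    """
    order = rng.permutation(len(images))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        batch = images[idx]
        # Fresh augmentation parameters are drawn once for this batch.
        if rng.random() < 0.5:
            batch = batch[:, :, ::-1]             # horizontal flip
        shift = int(rng.integers(-2, 3))          # shift in pixels, -2..2
        batch = np.roll(batch, shift, axis=2)
        # Text stays aligned with its image; only the image is augmented.
        yield batch, [texts[i] for i in idx]
```

Calling the generator again in the next epoch produces a different ordering and different augmentation draws, so the model sees a new variant of each batch over time.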

[26] ReplayCAD: Generative Diffusion Replay for Continual Anomaly Detection

Lei Hu,Zhiyong Gan,Ling Deng,Jinglin Liang,Lingyu Liang,Shuangping Huang,Tianshui Chen

Main category: cs.CV

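TL;DR: ReplayCAD is a diffusion-based generative replay framework that addresses catastrophic forgetting and small-anomaly-region segmentation in continual anomaly detection; guiding data replay with semantic and spatial features significantly improves segmentation performance.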

Details

Motivation: Continual anomaly detection faces catastrophic forgetting and the segmentation of small anomalous regions, and existing methods struggle to preserve pixel-level detail. Method: Proposes the ReplayCAD framework, which uses semantic embeddings and spatial features of a pre-trained diffusion model to guide data compression and replay, generating high-quality historical data. Result: Segmentation performance improves by 11.5% on VisA and 8.1% on MVTec, with state-of-the-art results in both classification and segmentation. Conclusion: By generating high-quality historical data, ReplayCAD effectively addresses key problems in continual anomaly detection and significantly improves model performance. Abstract: Continual Anomaly Detection (CAD) enables anomaly detection models to learn new classes while preserving knowledge of historical classes. CAD faces two key challenges: catastrophic forgetting and segmentation of small anomalous regions. Existing CAD methods store image distributions or patch features to mitigate catastrophic forgetting, but they fail to preserve pixel-level detailed features for accurate segmentation. To overcome this limitation, we propose ReplayCAD, a novel diffusion-driven generative replay framework that replays high-quality historical data, thus effectively preserving pixel-level detailed features. Specifically, we compress historical data by searching for a class semantic embedding in the conditional space of the pre-trained diffusion model, which can guide the model to replay data with fine-grained pixel details, thus improving the segmentation performance. However, relying solely on semantic features results in limited spatial diversity. Hence, we further use spatial features to guide data compression, achieving precise control of sample space, thereby generating more diverse data. Our method achieves state-of-the-art performance in both classification and segmentation, with notable improvements in segmentation: 11.5% on VisA and 8.1% on MVTec. Our source code is available at https://github.com/HULEI7/ReplayCAD.

Xu Zheng,Yuanhuiyi Lyu,Lutao Jiang,Danda Pani Paudel,Luc Van Gool,Xuming Hu

Main category: cs.CV

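TL;DR: Proposes a simple and effective regularization method based on functional entropy that balances the contributions of multi-modal inputs in semantic segmentation, addressing unimodal dominance.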

Details

Motivation: Multi-modal frameworks tend to over-rely on easily learned modalities (unimodal dominance), which degrades performance in real-world scenarios. Method: Designs a regularization term using functional entropy and the log-Sobolev inequality to maximize the information contributed by each modality, and proposes a multi-scale regularization module. Result: Performance improves on three datasets (+13.94%, +3.25%, +3.64%) with no additional parameters. Conclusion: The method effectively balances multi-modal inputs and improves the robustness and performance of segmentation. Abstract: Fusing and balancing multi-modal inputs from novel sensors for dense prediction tasks, particularly semantic segmentation, is critically important yet remains a significant challenge. One major limitation is the tendency of multi-modal frameworks to over-rely on easily learnable modalities, a phenomenon referred to as unimodal dominance or bias. This issue becomes especially problematic in real-world scenarios where the dominant modality may be unavailable, resulting in severe performance degradation. To this end, we apply a simple but effective plug-and-play regularization term based on functional entropy, which introduces no additional parameters or modules. This term is designed to intuitively balance the contribution of each visual modality to the segmentation results. Specifically, we leverage the log-Sobolev inequality to bound functional entropy using functional-Fisher-information. By maximizing the information contributed by each visual modality, our approach mitigates unimodal dominance and establishes a more balanced and robust segmentation framework. A multi-scale regularization module is proposed to apply our proposed plug-and-play term on high-level features and also segmentation predictions for more balanced multi-modal learning. Extensive experiments on three datasets demonstrate that our proposed method achieves superior performance, i.e., +13.94%, +3.25%, and +3.64%, without introducing any additional parameters.
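
For readers unfamiliar with the bound this abstract refers to, the standard textbook definitions are given below. They assume a non-negative function f and a standard Gaussian measure mu; how the paper chooses f and mu for each modality is not stated in the abstract, so this is background rather than the authors' exact formulation.

```latex
% Functional entropy of a non-negative function f under a measure \mu
\mathrm{Ent}_\mu(f) = \int f \log f \, d\mu
  - \left( \int f \, d\mu \right) \log \left( \int f \, d\mu \right)

% Log-Sobolev inequality for Gaussian \mu: functional entropy is bounded
% by (half of) the functional Fisher information
\mathrm{Ent}_\mu(f) \le \frac{1}{2} \int \frac{\lVert \nabla f \rVert^{2}}{f} \, d\mu
```

The right-hand side depends only on gradients of f, which is what makes it usable as a differentiable regularization term.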

[28] Dataset Distillation with Probabilistic Latent Features

Zhe Li,Sarah Cechnicka,Cheng Ouyang,Katharina Breininger,Peter Schüffler,Bernhard Kainz

Main category: cs.CV

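TL;DR: Proposes a new stochastic method that synthesizes a compact dataset by modeling the joint distribution of latent features, reducing storage and computational costs.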

Details

Motivation: As deep learning models grow more complex and training data volumes increase, reducing storage and computational costs becomes essential. Method: Uses a low-rank multivariate normal distribution parameterized by a lightweight network to produce diverse synthetic samples, and generates images with a pretrained generator. Result: Achieves state-of-the-art cross-architecture performance on several benchmarks (ImageNet subsets, CIFAR-10, and MedMNIST). Conclusion: The method is general and efficient, and effectively supports downstream classification tasks. Abstract: As deep learning models grow in complexity and the volume of training data increases, reducing storage and computational costs becomes increasingly important. Dataset distillation addresses this challenge by synthesizing a compact set of synthetic data that can effectively replace the original dataset in downstream classification tasks. While existing methods typically rely on mapping data from pixel space to the latent space of a generative model, we propose a novel stochastic approach that models the joint distribution of latent features. This allows our method to better capture spatial structures and produce diverse synthetic samples, which benefits model training. Specifically, we introduce a low-rank multivariate normal distribution parameterized by a lightweight network. This design maintains low computational complexity and is compatible with various matching networks used in dataset distillation. After distillation, synthetic images are generated by feeding the learned latent features into a pretrained generator. These synthetic images are then used to train classification models, and performance is evaluated on a real test set. We validate our method on several benchmarks, including ImageNet subsets, CIFAR-10, and the MedMNIST histopathological dataset. Our approach achieves state-of-the-art cross-architecture performance across a range of backbone architectures, demonstrating its generality and effectiveness.
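
Sampling from a low-rank multivariate normal, as mentioned in this abstract, can be sketched as below. This is a generic illustration of the distribution, not the authors' code: in the paper the parameters come from a lightweight network, whereas here `mu`, `factor`, and `diag` are plain arrays supplied by the caller, and the function name is an assumption.

```python
import numpy as np

def sample_low_rank_mvn(mu, factor, diag, n_samples, rng):
    """Draw samples from N(mu, factor @ factor.T + diag(diag)).

    mu: (d,) mean; factor: (d, r) with r much smaller than d;
    diag: (d,) non-negative per-dimension variances.
    Storing factor and diag costs O(d * r) instead of O(d^2) for a full covariance.
    """
    d, r = factor.shape
    eps_r = rng.standard_normal((n_samples, r))   # noise in the rank-r subspace
    eps_d = rng.standard_normal((n_samples, d))   # independent per-dimension noise
    return mu + eps_r @ factor.T + eps_d * np.sqrt(diag)
```

Because the low-rank factor couples dimensions, samples can carry correlated structure across latent positions, while the diagonal term keeps the covariance full rank. Each sample would then be fed to a pretrained generator to produce a synthetic image, as the abstract describes.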

[29] METOR: A Unified Framework for Mutual Enhancement of Objects and Relationships in Open-vocabulary Video Visual Relationship Detection

Yongqi Wang,Xinxiao Wu,Shuo Yang

Main category: cs.CV

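TL;DR: METOR is a query-based unified framework that jointly models and mutually enhances object detection and relationship classification in open-vocabulary scenarios, avoiding the error propagation of cascaded pipelines.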

Details

Motivation: Existing methods use a cascaded pipeline that detects objects first and then classifies relationships, which can cause error propagation and degraded performance. Method: Designs a CLIP-based contextual refinement encoding module and an iterative enhancement module to jointly optimize the representations of objects and relationships. Result: Achieves state-of-the-art performance on two public datasets, VidVRD and VidOR. Conclusion: Through joint modeling and mutual enhancement, METOR significantly improves open-vocabulary video visual relationship detection. Abstract: Open-vocabulary video visual relationship detection aims to detect objects and their relationships in videos without being restricted by predefined object or relationship categories. Existing methods leverage the rich semantic knowledge of pre-trained vision-language models such as CLIP to identify novel categories. They typically adopt a cascaded pipeline to first detect objects and then classify relationships based on the detected objects, which may lead to error propagation and thus suboptimal performance. In this paper, we propose Mutual EnhancemenT of Objects and Relationships (METOR), a query-based unified framework to jointly model and mutually enhance object detection and relationship classification in open-vocabulary scenarios. Under this framework, we first design a CLIP-based contextual refinement encoding module that extracts visual contexts of objects and relationships to refine the encoding of text features and object queries, thus improving the generalization of encoding to novel categories. Then we propose an iterative enhancement module to alternately enhance the representations of objects and relationships by fully exploiting their interdependence to improve recognition performance. Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate that our framework achieves state-of-the-art performance.

[30] MultiTaskVIF: Segmentation-oriented visible and infrared image fusion via multi-task learning

Zixian Zhao,Andrew Howes,Xingchen Zhang

Main category: cs.CV

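TL;DR: Proposes MultiTaskVIF, a concise and universal training framework in which a multi-task head decoder (MTH) outputs both the fused image and the segmentation result during training, avoiding the complexity of conventional cascade structures.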

Details

Motivation: Existing segmentation-oriented visible and infrared image fusion methods usually use separate fusion and segmentation models, which makes the network complex and redundant, so a more concise and efficient structure is needed. Method: Proposes the MultiTaskVIF framework, which introduces a multi-task head decoder (MTH) that directly outputs the fused image and the segmentation result during training, without requiring a complete segmentation model. Result: Experiments validate the effectiveness of the method; the code will be released upon acceptance. Conclusion: MultiTaskVIF integrates semantic information with a concise structure, providing an efficient solution for segmentation-oriented image fusion. Abstract: Visible and infrared image fusion (VIF) has attracted significant attention in recent years. Traditional VIF methods primarily focus on generating fused images with high visual quality, while recent advancements increasingly emphasize incorporating semantic information into the fusion model during training. However, most existing segmentation-oriented VIF methods adopt a cascade structure comprising separate fusion and segmentation models, leading to increased network complexity and redundancy. This raises a critical question: can we design a more concise and efficient structure to integrate semantic information directly into the fusion model during training? Inspired by multi-task learning, we propose a concise and universal training framework, MultiTaskVIF, for segmentation-oriented VIF models. In this framework, we introduce a multi-task head decoder (MTH) to simultaneously output both the fused image and the segmentation result during training. Unlike previous cascade training frameworks that necessitate joint training with a complete segmentation model, MultiTaskVIF enables the fusion model to learn semantic features by simply replacing its decoder with MTH. Extensive experimental evaluations validate the effectiveness of the proposed method. Our code will be released upon acceptance.

[31] StableMotion: Repurposing Diffusion-Based Image Priors for Motion Estimation

Ziyi Wang,Haipeng Li,Lin Sui,Tianhao Zhou,Hai Jiang,Lang Nie,Shuaicheng Liu

Main category: cs.CV

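TL;DR: StableMotion uses knowledge from pretrained image diffusion models for motion estimation, solving single-image rectification tasks such as stitched image rectangling and rolling shutter correction. With an Adaptive Ensemble Strategy and the concept of Sampling Steps Disaster, it achieves efficient one-step inference with strong performance and high speed.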

Details

Motivation: Address motion estimation in single-image rectification tasks by using the geometry and content priors of pretrained models to improve efficiency and accuracy. Method: Uses Stable Diffusion as the backbone, consolidates multiple outputs with an Adaptive Ensemble Strategy (AES), and uses the concept of Sampling Steps Disaster (SSD) to achieve one-step inference. Result: Performs strongly on stitched image rectangling and rolling shutter correction, with a 200-fold speedup over previous diffusion-based methods. Conclusion: The StableMotion framework is efficient and general, offering a new solution for image rectification tasks. Abstract: We present StableMotion, a novel framework that leverages knowledge (geometry and content priors) from pretrained large-scale image diffusion models to perform motion estimation, solving single-image-based image rectification tasks such as Stitched Image Rectangling (SIR) and Rolling Shutter Correction (RSC). Specifically, the StableMotion framework takes text-to-image Stable Diffusion (SD) models as its backbone and repurposes them into an image-to-motion estimator. To mitigate inconsistent output produced by diffusion models, we propose an Adaptive Ensemble Strategy (AES) that consolidates multiple outputs into a cohesive, high-fidelity result. Additionally, we present the concept of Sampling Steps Disaster (SSD), the counterintuitive scenario where increasing the number of sampling steps can lead to poorer outcomes, which enables our framework to achieve one-step inference. StableMotion is verified on two image rectification tasks and delivers state-of-the-art performance in both, as well as showing strong generalizability. Supported by SSD, StableMotion offers a speedup of 200 times compared to previous diffusion model-based methods.

[32] Video Dataset Condensation with Diffusion Models

Zhe Li,Hadrien Reynaud,Mischa Dombrowski,Sarah Cechnicka,Franciskus Xaverius Erick,Bernhard Kainz

Main category: cs.CV

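TL;DR: The paper proposes a video dataset distillation method based on a video diffusion model and VST-UNet, combined with the TAC-DT algorithm, to improve computational efficiency and data quality; it outperforms existing methods.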

Details

Motivation: Address the high computational resource demands of deep learning models, especially the storage and training challenges of video datasets. Method: Uses a video diffusion model to generate high-quality synthetic videos, proposes VST-UNet to select a diverse subset of videos, and introduces the TAC-DT algorithm to improve computational efficiency. Result: Improves performance by up to 10.61% on four benchmark datasets, outperforming existing methods. Conclusion: The method sets a new benchmark for video dataset distillation, significantly improving performance and efficiency. Abstract: In recent years, the rapid expansion of dataset sizes and the increasing complexity of deep learning models have significantly escalated the demand for computational resources, both for data storage and model training. Dataset distillation has emerged as a promising solution to address this challenge by generating a compact synthetic dataset that retains the essential information from a large real dataset. However, existing methods often suffer from limited performance and poor data quality, particularly in the video domain. In this paper, we focus on video dataset distillation by employing a video diffusion model to generate high-quality synthetic videos. To enhance representativeness, we introduce Video Spatio-Temporal U-Net (VST-UNet), a model designed to select a diverse and informative subset of videos that effectively captures the characteristics of the original dataset. To further optimize computational efficiency, we explore a training-free clustering algorithm, Temporal-Aware Cluster-based Distillation (TAC-DT), to select representative videos without requiring additional training overhead. We validate the effectiveness of our approach through extensive experiments on four benchmark datasets, demonstrating performance improvements of up to 10.61% over the state-of-the-art. Our method consistently outperforms existing approaches across all datasets, establishing a new benchmark for video dataset distillation.

[33] Jailbreaking the Text-to-Video Generative Models

Jiayang Liu,Siyuan Liang,Shiqian Zhao,Rongcheng Tu,Wenbo Zhou,Xiaochun Cao,Dacheng Tao,Siew Kei Lam

Main category: cs.CV

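TL;DR: Proposes an optimization-based jailbreak attack on text-to-video generative models, which optimizes objectives to produce prompts that bypass safety filters while improving the semantic relevance of the generated videos.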

Details

Motivation: Despite significant progress in text-to-video generative models, their vulnerability to jailbreak attacks (generating unsafe content) raises serious safety concerns. Existing work lacks systematic methods for exploiting these vulnerabilities. Method: Formulates prompt generation as an optimization problem with three objectives: maximizing the semantic similarity between the input and generated prompts, bypassing the safety filter, and increasing the semantic similarity between the generated videos and the input prompts. A prompt mutation strategy is introduced to improve robustness. Result: Experiments on multiple models (such as Open-Sora and Pika) show a higher attack success rate and stronger semantic relevance of the generated videos. Conclusion: The proposed optimization-based jailbreak attack improves attack success rate and semantic relevance, offering a new perspective for research on the safety of text-to-video models. Abstract: Text-to-video generative models have achieved significant progress, driven by the rapid advancements in diffusion models, with notable examples including Pika, Luma, Kling, and Sora. Despite their remarkable generation ability, their vulnerability to jailbreak attacks, i.e., being induced to generate unsafe content including pornography, violence, and discrimination, raises serious safety concerns. Existing efforts, such as T2VSafetyBench, have provided valuable benchmarks for evaluating the safety of text-to-video models against unsafe prompts but lack systematic studies for exploiting their vulnerabilities effectively. In this paper, we propose the first optimization-based jailbreak attack designed specifically for text-to-video models. Our approach formulates the prompt generation task as an optimization problem with three key objectives: (1) maximizing the semantic similarity between the input and generated prompts, (2) ensuring that the generated prompts can evade the safety filter of the text-to-video model, and (3) maximizing the semantic similarity between the generated videos and the original input prompts. To further enhance the robustness of the generated prompts, we introduce a prompt mutation strategy that creates multiple prompt variants in each iteration, selecting the most effective one based on the averaged score. This strategy not only improves the attack success rate but also boosts the semantic relevance of the generated video. We conduct extensive experiments across multiple text-to-video models, including Open-Sora, Pika, Luma, and Kling. The results demonstrate that our method not only achieves a higher attack success rate compared to baseline methods but also generates videos with greater semantic similarity to the original input prompts.

[34] UnfoldIR: Rethinking Deep Unfolding Network in Illumination Degradation Image Restoration

Chunming He,Rihan Zhang,Fengyang Xiao,Chengyu Fang,Longxiang Tang,Yulun Zhang,Sina Farsiu

Main category: cs.CV

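TL;DR: UnfoldIR is a new method based on deep unfolding networks (DUNs) for illumination degradation image restoration (IDIR). By introducing a task-specific restoration model, advanced network architectures, and a DUN-specific loss function, UnfoldIR outperforms existing methods.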

Details

Motivation: Existing DUNs underperform on IDIR tasks, mainly because the potential of the unfolding structure has not been fully explored, including task-specific model construction, integration of advanced architectures, and loss function design. Method: UnfoldIR proposes a new IDIR model with a reflectance-assisted illumination correction (RAIC) module and an illumination-guided reflectance enhancement (IGRE) module, and introduces an inter-stage information consistency loss. Result: Experiments show that UnfoldIR performs well across 5 IDIR tasks and 3 downstream problems. Conclusion: By improving the DUN structural design, UnfoldIR significantly improves performance on IDIR tasks, demonstrating its effectiveness for illumination degradation image restoration. Abstract: Deep unfolding networks (DUNs) are widely employed in illumination degradation image restoration (IDIR) to merge the interpretability of model-based approaches with the generalization of learning-based methods. However, the performance of DUN-based methods remains considerably inferior to that of state-of-the-art IDIR solvers. Our investigation indicates that this limitation does not stem from structural shortcomings of DUNs but rather from the limited exploration of the unfolding structure, particularly for (1) constructing task-specific restoration models, (2) integrating advanced network architectures, and (3) designing DUN-specific loss functions. To address these issues, we propose a novel DUN-based method, UnfoldIR, for IDIR tasks. UnfoldIR first introduces a new IDIR model with dedicated regularization terms for smoothing illumination and enhancing texture. We unfold the iterative optimized solution of this model into a multistage network, with each stage comprising a reflectance-assisted illumination correction (RAIC) module and an illumination-guided reflectance enhancement (IGRE) module. RAIC employs a visual state space (VSS) to extract non-local features, enforcing illumination smoothness, while IGRE introduces a frequency-aware VSS to globally align similar textures, enabling mildly degraded regions to guide the enhancement of details in more severely degraded areas. This suppresses noise while enhancing details. Furthermore, given the multistage structure, we propose an inter-stage information consistent loss to maintain network stability in the final stages. This loss contributes to structural preservation and sustains the model's performance even in unsupervised settings. Experiments verify our effectiveness across 5 IDIR tasks and 3 downstream problems.

[35] FNBench: Benchmarking Robust Federated Learning against Noisy Labels

Xuefeng Jiang,Jia Li,Nannan Wu,Zhiyuan Wu,Xujing Li,Sheng Sun,Gang Xu,Yuwei Wang,Qi Li,Min Liu

Main category: cs.CV

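TL;DR: The paper presents FNBench, the first benchmark study of label noise in federated learning, evaluating 18 methods under multiple noise patterns and proposing a method to improve robustness.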

Details

Motivation: Label noise in federated learning data degrades performance, yet there is no unified benchmark study for evaluating how existing methods perform in practice. Method: Proposes the FNBench benchmark, covering three label noise patterns, evaluates 18 methods on multiple datasets, and proposes a representation-aware regularization method. Result: Experiments show that label noise significantly harms federated learning performance and that the proposed regularization method effectively improves the robustness of existing methods. Conclusion: FNBench provides the first benchmark study of label noise in federated learning and points to directions for improvement; future work can further explore noise patterns and algorithmic optimization. Abstract: Robustness to label noise within data is a significant challenge in federated learning (FL). From the data-centric perspective, the data quality of distributed datasets cannot be guaranteed since annotations of different clients contain complicated label noise of varying degrees, which causes performance degradation. There have been some early attempts to tackle noisy labels in FL. However, there is a lack of benchmark studies that comprehensively evaluate their practical performance under unified settings. To this end, we propose the first benchmark study FNBench to provide an experimental investigation which considers three diverse label noise patterns covering synthetic label noise, imperfect human-annotation errors and systematic errors. Our evaluation incorporates eighteen state-of-the-art methods over five image recognition datasets and one text classification dataset. Meanwhile, we provide observations to understand why noisy labels impair FL, and additionally exploit a representation-aware regularization method to enhance the robustness of existing methods against noisy labels based on our observations. Finally, we discuss the limitations of this work and propose three-fold future directions. To facilitate related communities, our source code is open-sourced at https://github.com/Sprinter1999/FNBench.

[36] Underwater object detection in sonar imagery with detection transformer and Zero-shot neural architecture search

XiaoTong Gu,Shengyu Tang,Yiming Cao,Changdong Yu

Main category: cs.CV

TL;DR: Proposes NAS-DETR, a method combining neural architecture search (NAS) with the Detection Transformer (DETR), to improve object detection in sonar images.

Details

Motivation: Sonar images have low resolution and sparse features, which limits conventional object detection methods and calls for an efficient, high-performance solution. Method: Uses a maximum-entropy-based zero-shot NAS method to optimize a CNN-Transformer backbone, combined with an FPN and a deformable-attention Transformer decoder to build the full architecture. Result: Achieves state-of-the-art performance on two representative datasets while keeping computational overhead low and maintaining real-time efficiency. Conclusion: NAS-DETR is the first to combine DETR with NAS for sonar object detection, significantly improving performance and interpretability. Abstract: Underwater object detection using sonar imagery has become a critical and rapidly evolving research domain within marine technology. However, sonar images are characterized by lower resolution and sparser features compared to optical images, which seriously degrades the performance of object detection. To address these challenges, we specifically propose a Detection Transformer (DETR) architecture optimized with a Neural Architecture Search (NAS) approach called NAS-DETR for object detection in sonar images. First, an improved Zero-shot Neural Architecture Search (NAS) method based on the maximum entropy principle is proposed to identify a real-time, high-representational-capacity CNN-Transformer backbone for sonar image detection. This method enables the efficient discovery of high-performance network architectures with low computational and time overhead. Subsequently, the backbone is combined with a Feature Pyramid Network (FPN) and a deformable attention-based Transformer decoder to construct a complete network architecture. This architecture integrates various advanced components and training schemes to enhance overall performance. Extensive experiments demonstrate that this architecture achieves state-of-the-art performance on two representative datasets, while maintaining minimal overhead in real-time efficiency and computational complexity. Furthermore, correlation analysis between the key parameters and differential entropy-based fitness function is performed to enhance the interpretability of the proposed framework. To the best of our knowledge, this is the first work in the field of sonar object detection to integrate the DETR architecture with a NAS search mechanism.

[37] SimMIL: A Universal Weakly Supervised Pre-Training Framework for Multi-Instance Learning in Whole Slide Pathology Images

Yicheng Song,Tiancheng Lin,Die Peng,Su Yang,Yi Xu

Main category: cs.CV

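TL;DR: This paper proposes a weakly supervised pre-training method for the feature extractor in multi-instance learning (MIL), which propagates weak labels from the bag level to the instance level for supervised learning and improves performance on WSI tasks.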

Details

Motivation: Existing MIL methods neglect instance-level representation learning and rely on pre-trained feature extractors, with limited effectiveness. This paper aims to improve how the feature extractor is pre-trained. Method: Pre-trains the feature extractor with a weakly supervised scheme, combined with data augmentation, a non-linear prediction head, and a robust loss function. Result: Outperforms other pre-training schemes (such as ImageNet pre-training and self-supervised learning) on WSI datasets and demonstrates compatibility and scalability. Conclusion: This is the first work focused on representation learning for MIL, providing a more effective feature extraction approach for WSI tasks. Abstract: Various multi-instance learning (MIL) based approaches have been developed and successfully applied to whole-slide pathological images (WSI). Existing MIL methods emphasize the importance of feature aggregators, but largely neglect the instance-level representation learning. They assume that a pre-trained feature extractor is available and can be directly utilized or fine-tuned, which is not always the case. This paper proposes to pre-train the feature extractor for MIL via a weakly-supervised scheme, i.e., propagating the weak bag-level labels to the corresponding instances for supervised learning. To learn effective features for MIL, we further delve into several key components, including strong data augmentation, a non-linear prediction head and a robust loss function. We conduct experiments on common large-scale WSI datasets and find it achieves better performance than other pre-training schemes (e.g., ImageNet pre-training and self-supervised learning) in different downstream tasks. We further show the compatibility and scalability of the proposed scheme by deploying it in fine-tuning the pathological-specific models and pre-training on merged multiple datasets. To our knowledge, this is the first work focusing on the representation learning for MIL.
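
The label-propagation step described in the abstract (copying each weak bag-level label onto the instances of that bag) can be illustrated with a minimal sketch. The function name `propagate_bag_labels` and the data layout are hypothetical and not taken from the paper; the sketch only shows how instance-level training pairs could be formed.

```python
def propagate_bag_labels(bags):
    """Give every instance the label of the bag it belongs to.

    bags: list of (instances, bag_label) pairs, where instances is a list
    of patch identifiers (an assumed layout, for illustration only).
    Returns a flat list of (instance, label) training pairs.
    """
    pairs = []
    for instances, bag_label in bags:
        for instance in instances:
            # Every instance inherits the weak bag-level label, so some
            # instances in a positive bag may end up with a noisy label.
            pairs.append((instance, bag_label))
    return pairs


# Two toy slides: one labelled positive (1), one negative (0).
toy_bags = [(["p0", "p1", "p2"], 1), (["q0", "q1"], 0)]
instance_pairs = propagate_bag_labels(toy_bags)
```

Because inherited labels can be wrong for individual instances, a scheme like this plausibly benefits from the robust loss function the abstract lists among its key components.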

[38] Symbolic Rule Extraction from Attention-Guided Sparse Representations in Vision Transformers

Parth Padalkar,Gopal Gupta

Main category: cs.CV

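TL;DR: Proposes a method for extracting symbolic rules from Vision Transformers (ViTs) that introduces a sparse concept layer and the FOLD-SE-M algorithm, improving classification accuracy and interpretability.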

Details

Motivation: Existing methods struggle to extract symbolic rules from ViTs because ViTs lack modular concept detectors and rely on global self-attention. Method: Introduces a sparse concept layer trained with L1 sparsity, entropy minimization, and a supervised contrastive loss to produce binarized concept activations, from which the FOLD-SE-M algorithm extracts a logic program. Result: Classification accuracy is 5.14% higher than the standard ViT, and the generated rule-set can be used directly for logical reasoning. Conclusion: The work is the first to extract executable logic programs from ViTs, advancing interpretable and verifiable neuro-symbolic AI. Abstract: Recent neuro-symbolic approaches have successfully extracted symbolic rule-sets from CNN-based models to enhance interpretability. However, applying similar techniques to Vision Transformers (ViTs) remains challenging due to their lack of modular concept detectors and reliance on global self-attention mechanisms. We propose a framework for symbolic rule extraction from ViTs by introducing a sparse concept layer inspired by Sparse Autoencoders (SAEs). This linear layer operates on attention-weighted patch representations and learns a disentangled, binarized representation in which individual neurons activate for high-level visual concepts. To encourage interpretability, we apply a combination of L1 sparsity, entropy minimization, and supervised contrastive loss. These binarized concept activations are used as input to the FOLD-SE-M algorithm, which generates a rule-set in the form of logic programs. Our method achieves 5.14% better classification accuracy than the standard ViT while enabling symbolic reasoning. Crucially, the extracted rule-set is not merely post-hoc but acts as a logic-based decision layer that operates directly on the sparse concept representations. The resulting programs are concise and semantically meaningful. This work is the first to extract executable logic programs from ViTs using sparse symbolic representations. It bridges the gap between transformer-based vision models and symbolic logic programming, providing a step forward in interpretable and verifiable neuro-symbolic AI.

[39] Multimodal Fake News Detection: MFND Dataset and Shallow-Deep Multitask Learning

Ye Zhu,Yunan Wang,Zitong Yu

Main category: cs.CV

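TL;DR: Proposes a new Multimodal Fake News Detection dataset (MFND) and a Shallow-Deep Multitask Learning (SDML) model for detecting and localizing highly realistic fake news.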

Details

Motivation: Multimodal news is information-rich but vulnerable to deepfake attacks, so fake news must be detected and localized effectively. Method: Proposes the SDML model, which combines shallow inference (contrastive learning and cross-modal fusion) with deep inference (a two-branch framework) to mine the intrinsic semantics of news. Result: The model's superiority is validated on mainstream datasets and on the authors' own dataset. Conclusion: The SDML model can effectively detect and localize fake news; the code and dataset are open source. Abstract: Multimodal news contains a wealth of information and is easily affected by deepfake modeling attacks. To combat the latest image and text generation methods, we present a new Multimodal Fake News Detection dataset (MFND) containing 11 manipulated types, designed to detect and localize highly authentic fake news. Furthermore, we propose a Shallow-Deep Multitask Learning (SDML) model for fake news, which fully uses unimodal and mutual modal features to mine the intrinsic semantics of news. Under shallow inference, we propose a momentum distillation-based light punishment contrastive learning for fine-grained uniform spatial image and text semantic alignment, and an adaptive cross-modal fusion module to enhance mutual modal features. Under deep inference, we design a two-branch framework to augment the image and text unimodal features, respectively merging with mutual modal features, for four predictions via dedicated detection and localization projections. Experiments on both mainstream and our proposed datasets demonstrate the superiority of the model. Codes and dataset are released at https://github.com/yunan-wang33/sdml.

Bin Li,Shenxi Liu,Yixuan Weng,Yue Du,Yuhang Tian,Shoujun Zhou

Main category: cs.CV

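TL;DR: The M4IVQA challenge aims to advance research on multi-modal, multilingual, and multi-hop medical instructional video question answering systems and comprises three tracks: M4TAGSV, M4VCR, and M4TAGVC.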

Details

Motivation: By combining multi-modal, multilingual, and multi-hop questions, the challenge seeks to improve intelligent reasoning systems for healthcare scenarios and to support medical education and emergency response in multilingual communities. Method: Participants must develop algorithms that process video and text data, understand multilingual queries, and answer multi-hop medical questions. Result: The challenge comprises three tracks, targeting a single video, video corpus retrieval, and temporal grounding within a video corpus. Conclusion: M4IVQA is expected to drive innovation in multimodal reasoning systems for healthcare, supporting smarter medical education and emergency response. Abstract: Following the successful hosting of the 1st (NLPCC 2023 Foshan) CMIVQA and the 2nd (NLPCC 2024 Hangzhou) MMIVQA challenges, this year, a new task has been introduced to further advance research in multi-modal, multilingual, and multi-hop medical instructional question answering (M4IVQA) systems, with a specific focus on medical instructional videos. The M4IVQA challenge focuses on evaluating models that integrate information from medical instructional videos, understand multiple languages, and answer multi-hop questions requiring reasoning over various modalities. This task consists of three tracks: multi-modal, multilingual, and multi-hop Temporal Answer Grounding in Single Video (M4TAGSV), multi-modal, multilingual, and multi-hop Video Corpus Retrieval (M4VCR) and multi-modal, multilingual, and multi-hop Temporal Answer Grounding in Video Corpus (M4TAGVC). Participants in M4IVQA are expected to develop algorithms capable of processing both video and text data, understanding multilingual queries, and providing relevant answers to multi-hop medical questions. We believe the newly introduced M4IVQA challenge will drive innovations in multimodal reasoning systems for healthcare scenarios, ultimately contributing to smarter emergency response systems and more effective medical education platforms in multilingual communities. Our official website is https://cmivqa.github.io/

[41] Active Learning for Multi-class Image Classification

Thien Nhan Vo

Main category: cs.CV

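TL;DR: Uses active learning to reduce the number of training examples needed for image classification, selecting high-value examples with uncertainty metrics, and validates the approach on the MNIST and Fruits360 datasets.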

Details

Motivation: Image classification usually requires many training examples; active learning can reduce this requirement and improve efficiency. Method: Uses four uncertainty metrics to score the value of image examples and selects high-value examples to train a CNN classifier. Result: On the MNIST and Fruits360 datasets, active learning reduces the number of examples needed, with a more pronounced benefit on harder tasks. Conclusion: Active learning is a viable approach for image classification, and its advantage over random sampling is clearest on harder tasks. Abstract: A principal bottleneck in image classification is the large number of training examples needed to train a classifier. Using active learning, we can reduce the number of training examples needed to train a CNN classifier by strategically selecting examples. Assigning values to image examples using different uncertainty metrics allows the model to identify and select high-value examples for a smaller training set. We demonstrate results for digit recognition and fruit classification on the MNIST and Fruits360 data sets. We formally compare results for four different uncertainty metrics. Finally, we observe active learning is also effective on simpler (binary) classification tasks, but marked improvement from random sampling is more evident on more difficult tasks. We show active learning is a viable algorithm for image classification problems.
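
The abstract does not name its four uncertainty metrics, so the following is only a sketch of uncertainty-based selection using three commonly used scores (least confidence, margin, and entropy). All function names are illustrative assumptions, and the inputs are assumed to be per-example class probabilities from the current classifier.

```python
import math


def least_confidence(probs):
    """1 minus the top class probability; higher means more uncertain."""
    return 1.0 - max(probs)


def margin_uncertainty(probs):
    """1 minus the gap between the two most likely classes."""
    top = sorted(probs, reverse=True)
    return 1.0 - (top[0] - top[1])


def entropy(probs):
    """Shannon entropy of the predicted class distribution (in nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(pool_probs, k, metric=entropy):
    """Return indices of the k unlabeled examples with the highest score."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: metric(pool_probs[i]),
        reverse=True,
    )
    return ranked[:k]


# Predicted probabilities for three unlabeled images (toy values).
pool = [[0.90, 0.05, 0.05], [0.40, 0.35, 0.25], [0.60, 0.30, 0.10]]
to_label = select_most_uncertain(pool, k=2)
```

In an active learning loop, the selected indices would be sent for labeling, added to the training set, and the classifier retrained before the next round of scoring.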

[42] Fine-Grained Bias Exploration and Mitigation for Group-Robust Classification

Miaoyun Zhao,Qiang Zhang,Chenrong Li

Main category: cs.CV

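TL;DR: The paper proposes two new methods, BEO and FG-CCDB, which use finer-grained distribution modeling and matching to overcome the limitations of the existing CCDB method in removing spurious correlations, and which perform well in experiments.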

Details

Motivation: The existing CCDB method approximates each distribution with a single Gaussian, which is overly simplistic and rarely holds in practice; finer-grained distribution modeling is needed to remove spurious correlations. Method: Proposes BEO, which captures each distribution in more detail by modeling it as a mixture of latent groups, and FG-CCDB, which performs fine-grained distribution matching and balancing within each group. Result: BEO serves as a strong proxy for ground-truth bias annotations; FG-CCDB is on par with bias-supervised methods on binary classification and significantly outperforms them on highly biased multi-class tasks. Conclusion: By modeling and matching distributions at a fine-grained level, BEO and FG-CCDB effectively remove spurious correlations with strong performance and low computational cost. Abstract: Achieving group-robust generalization in the presence of spurious correlations remains a significant challenge, particularly when bias annotations are unavailable. Recent studies on Class-Conditional Distribution Balancing (CCDB) reveal that spurious correlations often stem from mismatches between the class-conditional and marginal distributions of bias attributes. They achieve promising results by addressing this issue through simple distribution matching in a bias-agnostic manner. However, CCDB approximates each distribution using a single Gaussian, which is overly simplistic and rarely holds in real-world applications. To address this limitation, we propose a novel method called Bias Exploration via Overfitting (BEO), which captures each distribution in greater detail by modeling it as a mixture of latent groups. Building on these group-level descriptions, we introduce a fine-grained variant of CCDB, termed FG-CCDB, which performs more precise distribution matching and balancing within each group. Through group-level reweighting, FG-CCDB learns sample weights from a global perspective, achieving stronger mitigation of spurious correlations without incurring substantial storage or computational costs. Extensive experiments demonstrate that BEO serves as a strong proxy for ground-truth bias annotations and can be seamlessly integrated with bias-supervised methods. Moreover, when combined with FG-CCDB, our method performs on par with bias-supervised approaches on binary classification tasks and significantly outperforms them in highly biased multi-class scenarios.

[43] Visual Instruction Tuning with Chain of Region-of-Interest

Yixin Chen,Shuai Zhang,Boran Han,Bernie Wang

Main category: cs.CV

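TL;DR: Proposes CoRoI, a method that identifies and prioritizes key regions in high-resolution images to reduce computational cost while improving the performance of multimodal large language models.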

Details

Motivation: High-resolution images are essential for multimodal large language models, but directly increasing resolution significantly raises computational cost. Method: Inspired by the human visual system, CoRoI selectively processes the most informative regions of an image rather than the full high-resolution image. Result: CoRoI's effectiveness is validated on 11 benchmarks, with models outperforming LLaVA-NeXT and some proprietary methods (such as Gemini Pro 1.0 and GPT-4V). Conclusion: CoRoI is an efficient method that improves multimodal model performance while reducing the computational burden. Abstract: High-resolution (HR) images are pivotal for enhancing the recognition and understanding capabilities of multimodal large language models (MLLMs). However, directly increasing image resolution can significantly escalate computational demands. In this study, we propose a method called Chain of Region-of-Interest (CoRoI) for Visual Instruction Tuning, aimed at alleviating the computational burden associated with high-resolution images for MLLMs. Drawing inspiration from the selective nature of the human visual system, we recognize that not all regions within high-resolution images carry equal importance. CoRoI seeks to identify and prioritize the most informative regions, thereby enhancing multimodal visual comprehension and recognition while circumventing the need for processing lengthy HR image tokens. Through extensive experiments on 11 benchmarks, we validate the efficacy of CoRoI across varying sizes, ranging from 7B to 34B in parameters. Our models consistently demonstrate superior performance across diverse multimodal benchmarks and tasks. Notably, our method outperforms LLaVA-NeXT on almost all benchmarks and our finetuned 34B model surpasses proprietary methods like Gemini Pro 1.0 on six benchmarks, as well as outperforming GPT-4V on MMB, SEED-I, and MME.

[44] Predicting Surgical Safety Margins in Osteosarcoma Knee Resections: An Unsupervised Approach

Carolina Vargas-Ecos,Edwin Salcedo

Main category: cs.CV

TL;DR: Proposes an unsupervised learning approach based on MRI and X-ray data for determining safety margins in osteosarcoma surgery.

Details

Motivation: Cancer cases in Latin America are projected to rise substantially; osteosarcoma is a common and deadly bone cancer, and its surgical resection requires precise safety margins to ensure complete removal while preserving healthy tissue. Method: Uses MRI and X-ray data from open-source repositories together with digital processing techniques and unsupervised learning algorithms such as k-means clustering to define tumor boundaries. Result: Experimental results indicate that the method can determine safety margins in an automated, patient-specific way. Conclusion: The method offers a potentially efficient solution for osteosarcoma surgery. Abstract: According to the Pan American Health Organization, the number of cancer cases in Latin America was estimated at 4.2 million in 2022 and is projected to rise to 6.7 million by 2045. Osteosarcoma, one of the most common and deadly bone cancers affecting young people, is difficult to detect due to its unique texture and intensity. Surgical removal of osteosarcoma requires precise safety margins to ensure complete resection while preserving healthy tissue. Therefore, this study proposes a method for estimating the confidence interval of surgical safety margins in osteosarcoma surgery around the knee. The proposed approach uses MRI and X-ray data from open-source repositories, digital processing techniques, and unsupervised learning algorithms (such as k-means clustering) to define tumor boundaries. Experimental results highlight the potential for automated, patient-specific determination of safety margins.
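
The abstract names k-means clustering but does not say which features are clustered. As a rough illustration only, the sketch below runs Lloyd's algorithm on scalar pixel intensities, which is an assumption made here and not a detail from the paper; `kmeans_1d` is a hypothetical helper.

```python
def kmeans_1d(values, k=2, iters=20):
    """Lloyd's algorithm on scalar values (requires k >= 2).

    Returns (centers, labels), where labels[i] is the index of the
    center closest to values[i].
    """
    lo, hi = min(values), max(values)
    # Deterministic init: spread the k centers evenly between min and max.
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: attach each value to its nearest center.
        labels = [
            min(range(k), key=lambda c: abs(v - centers[c])) for v in values
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return centers, labels


# Toy intensities: three dark pixels and three bright pixels.
intensities = [10, 12, 11, 200, 210, 205]
centers, labels = kmeans_1d(intensities, k=2)
```

On real scans the clustered features would more likely be multi-dimensional (for example intensity plus position or texture), and the resulting cluster boundary would be one input to a margin estimate, not the estimate itself.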

[45] Joint Low-level and High-level Textual Representation Learning with Multiple Masking Strategies

Zhengmi Tang,Yuto Mitsui,Tomo Miyazaki,Shinichiro Omachi

Main category: cs.CV

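TL;DR: The paper proposes a Multi-Masking Strategy (MMS) that combines random blockwise and span masking to improve masked image modeling (MIM) for text recognition, better capturing low- and high-level textual features.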

Details

Motivation: Existing text recognition methods rely on synthetic data, but synthetic images cannot fully reproduce real-world scenarios, which creates a performance gap; self-supervised learning such as MIM can narrow this gap. Method: Introduces random blockwise and span masking and combines them with random patch masking to form MMS, capturing both low- and high-level textual features. Result: MMS outperforms existing self-supervised methods on text recognition, segmentation, and super-resolution tasks. Conclusion: With its multiple masking strategies, MMS effectively improves performance on text-related tasks, especially in real-world scenarios. Abstract: Most existing text recognition methods are trained on large-scale synthetic datasets due to the scarcity of labeled real-world datasets. Synthetic images, however, cannot faithfully reproduce real-world scenarios, such as uneven illumination, irregular layout, occlusion, and degradation, resulting in performance disparities when handling complex real-world images. Recent self-supervised learning techniques, notably contrastive learning and masked image modeling (MIM), narrow this domain gap by exploiting unlabeled real text images. This study first analyzes the original Masked AutoEncoder (MAE) and observes that random patch masking predominantly captures low-level textural features but misses high-level contextual representations. To fully exploit the high-level contextual representations, we introduce random blockwise and span masking in the text recognition task. These strategies can mask the continuous image patches and completely remove some characters, forcing the model to infer relationships among characters within a word. Our Multi-Masking Strategy (MMS) integrates random patch, blockwise, and span masking into the MIM framework, which jointly learns low and high-level textual representations. After fine-tuning with real data, MMS outperforms the state-of-the-art self-supervised methods in various text-related tasks, including text recognition, segmentation, and text-image super-resolution.
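
To make the difference between the three masking strategies concrete, here is a minimal sketch on a patch grid of `h` rows and `w` columns. The function names, the mask ratio, and the block and span sizes are all assumptions for illustration; the abstract does not give the paper's actual sampling rules.

```python
import random


def random_patch_mask(h, w, ratio, rng):
    """Mask a fixed fraction of patches chosen independently at random."""
    n = h * w
    masked = set(rng.sample(range(n), int(n * ratio)))
    return [[(r * w + c) in masked for c in range(w)] for r in range(h)]


def blockwise_mask(h, w, block_h, block_w, rng):
    """Mask one contiguous rectangular block of patches."""
    top = rng.randrange(h - block_h + 1)
    left = rng.randrange(w - block_w + 1)
    return [
        [top <= r < top + block_h and left <= c < left + block_w
         for c in range(w)]
        for r in range(h)
    ]


def span_mask(h, w, span_w, rng):
    """Mask a full-height span of columns, which can hide whole characters."""
    left = rng.randrange(w - span_w + 1)
    return [[left <= c < left + span_w for c in range(w)] for r in range(h)]


rng = random.Random(0)  # seeded so the sketch is reproducible
patch = random_patch_mask(4, 8, 0.75, rng)
block = blockwise_mask(4, 8, 2, 3, rng)
span = span_mask(4, 8, 3, rng)
```

Scattered patch masks leave local texture visible around every hole, whereas block and span masks remove contiguous regions, which is consistent with the abstract's point that they force the model to infer missing characters from context.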

[46] NeuRN: Neuro-inspired Domain Generalization for Image Classification

Hamd Jalil,Ahmed Qazi,Asim Iqbal

Main category: cs.CV

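TL;DR: The paper proposes a neuro-inspired NeuRN layer to improve the generalization of deep learning models to unseen target domains and validates its effectiveness experimentally.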

Details

Motivation: Addresses the poor generalization of image classification models to unseen datasets. Method: Introduces the NeuRN layer, based on neuronal mechanisms in the mammalian visual cortex, and uses the Needleman-Wunsch algorithm to shortlist models. Result: NeuRN outperforms baseline models on cross-domain image classification tasks. Conclusion: NeuRN lays a foundation for future neuro-inspired deep learning models. Abstract: Domain generalization in image classification is a crucial challenge, with models often failing to generalize well across unseen datasets. We address this issue by introducing a neuro-inspired Neural Response Normalization (NeuRN) layer, which draws inspiration from neurons in the mammalian visual cortex and aims to enhance the performance of deep learning architectures on unseen target domains when the models are trained on a source domain. The performance of these models is considered as a baseline and then compared against models integrated with NeuRN on image classification tasks. We perform experiments across a range of deep learning architectures, including ones derived from Neural Architecture Search and Vision Transformer. Additionally, in order to shortlist models for our experiment from amongst the vast range of deep neural networks available which have shown promising results, we also propose a novel method that uses the Needleman-Wunsch algorithm to compute similarity between deep learning architectures. Our results demonstrate the effectiveness of NeuRN by showing improvement against baseline in cross-domain image classification tasks. Our framework attempts to establish a foundation for future neuro-inspired deep learning models.
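
The abstract says the Needleman-Wunsch algorithm is used to compute similarity between architectures but gives no encoding or scoring details. The sketch below shows the standard global-alignment score under two assumptions made here: each architecture is written as a sequence of layer-type tokens, and the scores are match = 1, mismatch = -1, gap = -1.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per element.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


# Hypothetical layer-type encodings of two small networks.
net_a = ["conv", "relu", "pool"]
net_b = ["conv", "pool"]
similarity = needleman_wunsch_score(net_a, net_b)
```

A higher score means the two token sequences align with fewer substitutions and gaps; how the paper normalizes or thresholds such scores to shortlist models is not stated in the abstract.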

[47] Mice to Machines: Neural Representations from Visual Cortex for Domain Generalization

Ahmed Qazi,Hamd Jalil,Asim Iqbal

Main category: cs.CV

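TL;DR: This study explores the functional alignment between the mouse visual cortex and deep learning models on object classification tasks, proposes a new representation learning strategy, and introduces the NeuRN layer to improve model performance.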

Details

Motivation: Understand the neural representations of the mouse visual cortex and their similarity to deep learning models in order to improve AI model performance. Method: Proposes a representation learning strategy and the NeuRN layer, and tests their effect on domain generalization tasks. Result: Finds a resemblance between the functional mapping of the mouse visual cortex and deep learning models, and shows that the NeuRN layer significantly improves model robustness. Conclusion: The framework offers a new way to study neural representations in the mouse visual cortex and supports the development of more advanced AI models. Abstract: The mouse is one of the most studied animal models in the field of systems neuroscience. Understanding the generalized patterns and decoding the neural representations that are evoked by the diverse range of natural scene stimuli in the mouse visual cortex is one of the key quests in computational vision. In recent years, significant parallels have been drawn between the primate visual cortex and hierarchical deep neural networks. However, their generalized efficacy in understanding mouse vision has been limited. In this study, we investigate the functional alignment between the mouse visual cortex and deep learning models for object classification tasks. We first introduce a generalized representational learning strategy that uncovers a striking resemblance between the functional mapping of the mouse visual cortex and high-performing deep learning models on both top-down (population-level) and bottom-up (single cell-level) scenarios. Next, this representational similarity across the two systems is further enhanced by the addition of a Neural Response Normalization (NeuRN) layer, inspired by the activation profile of excitatory and inhibitory neurons in the visual cortex. To test the performance effect of NeuRN on real-world tasks, we integrate it into deep learning models and observe significant improvements in their robustness against data shifts in domain generalization tasks. Our work proposes a novel framework for comparing the functional architecture of the mouse visual cortex with deep learning models. Our findings carry broad implications for the development of advanced AI models that draw inspiration from the mouse visual cortex, suggesting that these models serve as valuable tools for studying the neural representations of the mouse visual cortex and, as a result, enhancing their performance on real-world tasks.

[48] NeuGen: Amplifying the 'Neural' in Neural Radiance Fields for Domain Generalization

Ahmed Qazi,Abdul Basit,Asim Iqbal

Main category: cs.CV

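TL;DR: The paper proposes NeuGen, a brain-inspired normalization technique, and integrates it into NeRF architectures to improve generalization across diverse scenes.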

Details

Motivation: NeRF has significantly advanced novel view synthesis, but its generalization across diverse scenes and conditions remains limited. Method: Uses NeuGen to extract domain-invariant features and integrates it into leading NeRF architectures such as MVSNeRF and GeoNeRF. Result: NeuGen significantly improves generalization and rendering quality, surpassing existing methods on multiple datasets. Conclusion: The study shows the potential of combining neuroscientific principles with deep learning frameworks and sets a new precedent for generalizability and efficiency in novel view synthesis. Abstract: Neural Radiance Fields (NeRF) have significantly advanced the field of novel view synthesis, yet their generalization across diverse scenes and conditions remains challenging. Addressing this, we propose the integration of a novel brain-inspired normalization technique, Neural Generalization (NeuGen), into leading NeRF architectures, which include MVSNeRF and GeoNeRF. NeuGen extracts domain-invariant features, thereby enhancing the models' generalization capabilities. It can be seamlessly integrated into NeRF architectures and cultivates a comprehensive feature set that significantly improves accuracy and robustness in image rendering. Through this integration, NeuGen shows improved performance on benchmarks on diverse datasets across state-of-the-art NeRF architectures, enabling them to generalize better across varied scenes. Our comprehensive evaluations, both quantitative and qualitative, confirm that our approach not only surpasses existing models in generalizability but also markedly improves rendering quality. Our work exemplifies the potential of merging neuroscientific principles with deep learning frameworks, setting a new precedent for enhanced generalizability and efficiency in novel view synthesis. A demo of our study is available at https://neugennerf.github.io.

Honglong Yang,Shanshan Song,Yi Qin,Lehan Wang,Haonan Wang,Xinpeng Ding,Qixiang Zhang,Bodong Du,Xiaomeng Li

Main category: cs.CV

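TL;DR: XMedGPT is a multi-modal AI assistant that combines textual and visual interpretability to make medical decision-making more transparent and trustworthy, and that performs strongly on uncertainty quantification and prognostic modeling.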

Details

Motivation: Existing generalist medical AI systems fall short in explainability and prognostic capability, which limits their clinical utility. Method: XMedGPT integrates multi-modal interpretability, introduces a reliability indexing mechanism, and assesses consistency through interactive question-answering. Result: Achieves an IoU of 0.703 across 141 anatomical regions, uncertainty-estimation AUCs of 0.862 (visual question answering) and 0.764 (radiology report generation), and surpasses prior models by 26.9% on cancer prognosis prediction. Conclusion: XMedGPT substantially outperforms existing systems in interpretability, uncertainty quantification, and prognostic capability, offering trustworthy and scalable support for medical AI. Abstract: Generalist Medical AI (GMAI) systems have demonstrated expert-level performance in biomedical perception tasks, yet their clinical utility remains limited by inadequate multi-modal explainability and suboptimal prognostic capabilities. Here, we present XMedGPT, a clinician-centric, multi-modal AI assistant that integrates textual and visual interpretability to support transparent and trustworthy medical decision-making. XMedGPT not only produces accurate diagnostic and descriptive outputs, but also grounds referenced anatomical sites within medical images, bridging critical gaps in interpretability and enhancing clinician usability. To support real-world deployment, we introduce a reliability indexing mechanism that quantifies uncertainty through consistency-based assessment via interactive question-answering. We validate XMedGPT across four pillars: multi-modal interpretability, uncertainty quantification, prognostic modeling, and rigorous benchmarking. The model achieves an IoU of 0.703 across 141 anatomical regions, and a Kendall's tau-b of 0.479, demonstrating strong alignment between visual rationales and clinical outcomes. For uncertainty estimation, it attains an AUC of 0.862 on visual question answering and 0.764 on radiology report generation. In survival and recurrence prediction for lung and glioma cancers, it surpasses prior leading models by 26.9%, and outperforms GPT-4o by 25.0%. Rigorous benchmarking across 347 datasets covers 40 imaging modalities, and external validation spans 4 anatomical systems, confirming exceptional generalizability, with performance gains surpassing existing GMAI by 20.7% for in-domain evaluation and 16.7% on 11,530 in-house data evaluation. Together, XMedGPT represents a significant leap forward in clinician-centric AI integration, offering trustworthy and scalable support for diverse healthcare applications.
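
The abstract reports grounding quality as an IoU of 0.703 but does not say whether regions are boxes or masks. As a reminder of how the metric is computed, here is a minimal sketch for axis-aligned bounding boxes; the box format `(x_min, y_min, x_max, y_max)` and the function name are assumptions made for illustration.

```python
def box_iou(a, b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    # Overlap along each axis, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# A predicted region overlapping one quarter of an equally sized reference.
predicted = (0, 0, 2, 2)
reference = (1, 1, 3, 3)
overlap = box_iou(predicted, reference)  # intersection 1, union 7
```

An IoU of 1.0 means the predicted and reference regions coincide and 0.0 means they do not overlap, so a dataset-level figure is typically an average of this quantity over all evaluated regions.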

[50] CheXLearner: Text-Guided Fine-Grained Representation Learning for Progression Detection

Yuanzhuo Wang,Junwen Duan,Xinyu Li,Jianxin Wang

Main category: cs.CV

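TL;DR: CheXLearner is an end-to-end framework that combines anatomical region detection, Riemannian manifold alignment, and fine-grained semantic guidance to significantly improve medical image analysis.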

Details

Motivation: Existing medical image analysis methods either align images and text coarsely, causing semantic mismatches, or rely only on visual information without integrating medical semantics. Method: Proposes the Med-MAM module, which uses hyperbolic geometry to align anatomical structures, with regional progression descriptions as supervision. Result: Reaches 81.12% average accuracy (+17.2%) and 80.32% F1-score (+11.05%) on anatomical region progression detection, and 91.52% AUC on downstream disease classification. Conclusion: Through multi-modal alignment and dynamic feature optimization, CheXLearner substantially outperforms existing methods and suits complex medical image analysis. Abstract: Temporal medical image analysis is essential for clinical decision-making, yet existing methods either align images and text at a coarse level - causing potential semantic mismatches - or depend solely on visual information, lacking medical semantic integration. We present CheXLearner, the first end-to-end framework that unifies anatomical region detection, Riemannian manifold-based structure alignment, and fine-grained regional semantic guidance. Our proposed Med-Manifold Alignment Module (Med-MAM) leverages hyperbolic geometry to robustly align anatomical structures and capture pathologically meaningful discrepancies across temporal chest X-rays. By introducing regional progression descriptions as supervision, CheXLearner achieves enhanced cross-modal representation learning and supports dynamic low-level feature optimization. Experiments show that CheXLearner achieves 81.12% (+17.2%) average accuracy and 80.32% (+11.05%) F1-score on anatomical region progression detection - substantially outperforming state-of-the-art baselines, especially in structurally complex regions. Additionally, our model attains a 91.52% average AUC score in downstream disease classification, validating its superior feature representation.

[51] Enhancing Monocular Height Estimation via Sparse LiDAR-Guided Correction

Jian Song,Hongruixuan Chen,Naoto Yokoya

Main category: cs.CV

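TL;DR: The paper studies a monocular height estimation model trained on synthetic data, finds that its reliance on shadow cues can cause errors, and proposes a correction method using sparse LiDAR data that significantly reduces error.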

Details

Motivation: Address the unreliability of monocular height estimation caused by insufficient structural information and reliance on shadow cues. Method: Proposes a two-stage correction method: pre-process ICESat-2 data, then refine the height estimates with a random forest. Result: In experiments on three cities, mean absolute error decreases by 22.8%, 6.9%, and 4.9%, respectively. Conclusion: Incorporating sparse LiDAR data improves the robustness of monocular height estimation and offers a new route to reliable 3D mapping. Abstract: Monocular height estimation (MHE) from very-high-resolution (VHR) remote sensing imagery via deep learning is notoriously challenging due to the lack of sufficient structural information. Conventional digital elevation models (DEMs), typically derived from airborne LiDAR or multi-view stereo, remain costly and geographically limited. Recently, models trained on synthetic data and refined through domain adaptation have shown remarkable performance in MHE, yet it remains unclear how these models make predictions or how reliable they truly are. In this paper, we investigate a state-of-the-art MHE model trained purely on synthetic data to explore where the model looks when making height predictions. Through systematic analyses, we find that the model relies heavily on shadow cues, a factor that can lead to overestimation or underestimation of heights when shadows deviate from expected norms. Furthermore, the inherent difficulty of evaluating regression tasks with the human eye underscores additional limitations of purely synthetic training. To address these issues, we propose a novel correction pipeline that integrates sparse, imperfect global LiDAR measurements (ICESat-2) with deep-learning outputs to improve local accuracy and achieve spatially consistent corrections. Our method comprises two stages: pre-processing raw ICESat-2 data, followed by a random forest-based approach to densely refine height estimates. Experiments in three representative urban regions (Saint-Omer, Tokyo, and Sao Paulo) reveal substantial error reductions, with mean absolute error (MAE) decreased by 22.8%, 6.9%, and 4.9%, respectively. These findings highlight the critical role of shadow awareness in synthetic data-driven models and demonstrate how fusing imperfect real-world LiDAR data can bolster the robustness of MHE, paving the way for more reliable and scalable 3D mapping solutions.

[52] Building a Human-Verified Clinical Reasoning Dataset via a Human LLM Hybrid Pipeline for Trustworthy Medical AI

Chao Ding,Mouxiao Bian,Pengcheng Chen,Hongliang Zhang,Tianbin Li,Lihao Liu,Jiayuan Chen,Zhuoran Li,Yabei Zhong,Yongqi Liu,Haiqing Huang,Dongming Shan,Junjun He,Jie Xu

Main category: cs.CV

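TL;DR: The paper introduces a highly clinically relevant dataset of 31,247 medical question-answer pairs, each with an expert-validated chain-of-thought explanation, aimed at making medical large language models more transparent and verifiable.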

Details

Motivation: Current medical LLMs rely on scientific literature or synthetic data, which lack expert validation and high clinical relevance, limiting clinician trust. Method: A human-LLM hybrid pipeline in which medical experts iteratively review, score, and refine LLM-generated explanations, yielding a high-quality dataset. Result: The publicly released dataset is an important resource for developing transparent and verifiable medical LLMs. Conclusion: The dataset is expected to advance safer and more interpretable medical AI. Abstract: Despite strong performance in medical question-answering, the clinical adoption of Large Language Models (LLMs) is critically hampered by their opaque 'black-box' reasoning, limiting clinician trust. This challenge is compounded by the predominant reliance of current medical LLMs on corpora from scientific literature or synthetic data, which often lack the granular expert validation and high clinical relevance essential for advancing their specialized medical capabilities. To address these critical gaps, we introduce a highly clinically relevant dataset with 31,247 medical question-answer pairs, each accompanied by expert-validated chain-of-thought (CoT) explanations. This resource, spanning multiple clinical domains, was curated via a scalable human-LLM hybrid pipeline: LLM-generated rationales were iteratively reviewed, scored, and refined by medical experts against a structured rubric, with substandard outputs revised through human effort or guided LLM regeneration until expert consensus. This publicly available dataset provides a vital resource for the development of medical LLMs that are capable of transparent and verifiable reasoning, thereby advancing safer and more interpretable AI in medicine.

[53] Bi-directional Self-Registration for Misaligned Infrared-Visible Image Fusion

Timing Li,Bing Cao,Pengfei Zhu,Bin Xiao,Qinghua Hu

Main category: cs.CV

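TL;DR: Proposes a self-supervised bi-directional self-registration framework (B-SR) to address the lack of ground-truth aligned data in multi-modal image registration and fusion.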

Details

Motivation: Current multi-modal image registration and fusion methods lack ground-truth aligned data, which degrades fusion quality. Method: A proxy data generator (PDG) and an inverse proxy data generator (IPDG) provide self-supervised global-local registration, combined with a neighborhood dynamic alignment loss that removes the effect of modality gaps. Result: Experiments show that the method outperforms competing methods in multi-modal image alignment and fusion. Conclusion: The B-SR framework effectively addresses alignment in multi-modal image registration and fusion; the code will be released. Abstract: Acquiring accurately aligned multi-modal image pairs is fundamental for achieving high-quality multi-modal image fusion. To address the lack of ground truth in current multi-modal image registration and fusion methods, we propose a novel self-supervised Bi-directional Self-Registration framework (B-SR). Specifically, B-SR utilizes a proxy data generator (PDG) and an inverse proxy data generator (IPDG) to achieve self-supervised global-local registration. Visible-infrared image pairs with spatially misaligned differences are aligned to obtain global differences through the registration module. The same image pairs are processed by PDG, such as cropping, flipping, stitching, etc., and then aligned to obtain local differences. IPDG converts the obtained local differences into pseudo-global differences, which are used to enforce global-local difference consistency with the global differences. Furthermore, aiming at eliminating the effect of modal gaps on the registration module, we design a neighborhood dynamic alignment loss to achieve cross-modal image edge alignment. Extensive experiments on misaligned multi-modal images demonstrate the effectiveness of the proposed method in multi-modal image alignment and fusion against the competing methods. Our code will be publicly available.

[54] Transformer-Based Dual-Optical Attention Fusion Crowd Head Point Counting and Localization Network

Fei Zhou,Yi Li,Mingqing Zhu

Main category: cs.CV

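TL;DR: Proposes TAPNet, which uses a dual-optical attention fusion module (DAFP) and an adaptive dual-optical feature decomposition fusion module (AFDF) to incorporate infrared image information and improve crowd counting accuracy in complex UAV-view scenes such as dense occlusion and low light.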

Details

Motivation: Crowd counting from a UAV view is insufficiently accurate in complex scenes such as dense occlusion and low light. Method: A DAFP module introduces complementary information from infrared images, an AFDF module handles systematic misalignment between image pairs, and the training strategy is improved with spatial random offset data augmentation. Result: The model outperforms existing techniques on the DroneRGBT and GAIIC2 datasets, especially in dense low-light scenes. Conclusion: Through multi-modal information fusion and data augmentation, TAPNet significantly improves crowd counting in complex scenes. Abstract: In this paper, the dual-optical attention fusion crowd head point counting model (TAPNet) is proposed to address the difficulty of accurate counting in complex scenes, such as dense crowd occlusion and low light, in crowd counting tasks under UAV view. The model designs a dual-optical attention fusion module (DAFP) by introducing complementary information from infrared images to improve the accuracy and robustness of all-day crowd counting. In order to fully utilize different modal information and solve the problem of inaccurate localization caused by systematic misalignment between image pairs, this paper also proposes an adaptive two-optical feature decomposition fusion module (AFDF). In addition, we optimize the training strategy to improve the model robustness through spatial random offset data augmentation. Experiments on two challenging public datasets, DroneRGBT and GAIIC2, show that the proposed method outperforms existing techniques in terms of performance, especially in challenging dense low-light scenes. Code is available at https://github.com/zz-zik/TAPNet

[55] Unsupervised Learning for Class Distribution Mismatch

Pan Du,Wangbo Zhao,Xinai Lu,Nian Liu,Zhikai Li,Chaoyu Gong,Suyun Zhao,Hong Chen,Cuiping Li,Kai Wang,Yang You

Main category: cs.CV

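TL;DR: Proposes UCDM, an unsupervised learning method for class distribution mismatch between training data and target tasks; by generating positive-negative pairs and using a confidence-based labeling mechanism, it significantly outperforms existing semi-supervised methods.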

Details

Motivation: Existing methods for class distribution mismatch depend on labeled data and are confined to semi-supervised settings, which limits their applicability and performance. Method: Positive-negative pairs are built from unlabeled data, a diffusion model synthesizes diverse training pairs, and a confidence-based labeling mechanism iteratively assigns pseudo-labels. Result: Experiments on three datasets confirm UCDM's advantage; on Tiny-ImageNet in particular, it clearly surpasses OpenMatch without using labeled data. Conclusion: UCDM is an efficient unsupervised method that effectively handles class distribution mismatch and outperforms existing semi-supervised methods. Abstract: Class distribution mismatch (CDM) refers to the discrepancy between class distributions in training data and target tasks. Previous methods address this by designing classifiers to categorize classes known during training, while grouping unknown or new classes into an "other" category. However, they focus on semi-supervised scenarios and heavily rely on labeled data, limiting their applicability and performance. To address this, we propose Unsupervised Learning for Class Distribution Mismatch (UCDM), which constructs positive-negative pairs from unlabeled data for classifier training. Our approach randomly samples images and uses a diffusion model to add or erase semantic classes, synthesizing diverse training pairs. Additionally, we introduce a confidence-based labeling mechanism that iteratively assigns pseudo-labels to valuable real-world data and incorporates them into the training process. Extensive experiments on three datasets demonstrate UCDM's superiority over previous semi-supervised methods. Specifically, with a 60% mismatch proportion on the Tiny-ImageNet dataset, our approach, without relying on labeled data, surpasses OpenMatch (with 40 labels per class) by 35.1%, 63.7%, and 72.5% in classifying known, unknown, and new classes, respectively.

[56] Boosting Cross-spectral Unsupervised Domain Adaptation for Thermal Semantic Segmentation

Seokjun Kwon,Jeongmin Shin,Namil Kim,Soonmin Hwang,Yukyung Choi

Main category: cs.CV

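TL;DR: Proposes a new cross-spectral unsupervised domain adaptation method that improves thermal image semantic segmentation through masked mutual learning and a prototypical self-supervised loss.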

Details

Motivation: Labeled thermal image data is scarce, and RGB pre-trained networks transfer knowledge poorly under low illumination. Method: A masked mutual learning strategy selectively transfers results between the spectral models, and a prototypical self-supervised loss improves performance in nighttime scenes. Result: Experiments show that the method outperforms existing unsupervised domain adaptation methods and comes close to supervised methods. Conclusion: The method effectively improves domain adaptation for thermal image semantic segmentation, especially under low illumination. Abstract: In autonomous driving, thermal image semantic segmentation has emerged as a critical research area, owing to its ability to provide robust scene understanding under adverse visual conditions. In particular, unsupervised domain adaptation (UDA) for thermal image segmentation can be an efficient solution to address the lack of labeled thermal datasets. Nevertheless, since these methods do not effectively utilize the complementary information between RGB and thermal images, their performance degrades significantly during domain adaptation. In this paper, we present a comprehensive study on cross-spectral UDA for thermal image semantic segmentation. We first propose a novel masked mutual learning strategy that promotes complementary information exchange by selectively transferring results between each spectral model while masking out uncertain regions. Additionally, we introduce a novel prototypical self-supervised loss designed to enhance the performance of the thermal segmentation model in nighttime scenarios. This approach addresses the limitations of RGB pre-trained networks, which cannot effectively transfer knowledge under low illumination due to the inherent constraints of RGB sensors. In experiments, our method achieves higher performance than previous UDA methods and comparable performance to state-of-the-art supervised methods.

[57] High-Frequency Prior-Driven Adaptive Masking for Accelerating Image Super-Resolution

Wei Shang,Dongwei Ren,Wanying Zhang,Pengfei Zhu,Qinghua Hu,Wangmeng Zuo

Main category: cs.CV

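TL;DR: Proposes a training-free adaptive masking module that dynamically focuses computation on high-frequency regions (such as edges and textures), substantially reducing computation while maintaining performance.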

Details

Motivation: The key challenge in accelerating image super-resolution is reducing computation without sacrificing performance or adaptability; high-frequency regions matter most for reconstruction. Method: High-frequency components are extracted via Gaussian blur subtraction, and K-means clustering produces binary masks that dynamically identify regions needing intensive processing. The method applies to both CNN and Transformer architectures. Result: Experiments show a 24-43% reduction in FLOPs with comparable or better quantitative metrics. Conclusion: The method is efficient and robust, and handles unseen degradations (such as noise and compression) without retraining. Abstract: The primary challenge in accelerating image super-resolution lies in reducing computation while maintaining performance and adaptability. Motivated by the observation that high-frequency regions (e.g., edges and textures) are most critical for reconstruction, we propose a training-free adaptive masking module for acceleration that dynamically focuses computation on these challenging areas. Specifically, our method first extracts high-frequency components via Gaussian blur subtraction and adaptively generates binary masks using K-means clustering to identify regions requiring intensive processing. Our method can be easily integrated with both CNNs and Transformers. For CNN-based architectures, we replace standard $3 \times 3$ convolutions with an unfold operation followed by $1 \times 1$ convolutions, enabling pixel-wise sparse computation guided by the mask. For Transformer-based models, we partition the mask into non-overlapping windows and selectively process tokens based on their average values. During inference, unnecessary pixels or windows are pruned, significantly reducing computation. Moreover, our method supports dilation-based mask adjustment to control the processing scope without retraining, and is robust to unseen degradations (e.g., noise, compression). Extensive experiments on benchmarks demonstrate that our method reduces FLOPs by 24-43% for state-of-the-art models (e.g., CARN, SwinIR) while achieving comparable or better quantitative metrics. The source code is available at https://github.com/shangwei5/AMSR
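
The mask-generation step described in this abstract (blur subtraction followed by K-means) can be sketched in a few lines. The following is a minimal pure-Python illustration under stated assumptions, not the authors' implementation (their repository is the reference): it assumes a grayscale image stored as a list of lists, a fixed 3x3 Gaussian kernel, and K-means with K=2 on per-pixel high-frequency magnitudes. The function names `gaussian_blur3`, `high_frequency`, `two_means_threshold`, and `adaptive_mask` are hypothetical.

```python
def gaussian_blur3(img):
    """Blur a 2D list with a 3x3 Gaussian kernel (1-2-1 weights), clamping at borders."""
    h, w = len(img), len(img[0])
    k = [1, 2, 1]
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)  # replicate edge pixels
                    xx = min(max(x + dx, 0), w - 1)
                    acc += k[dy + 1] * k[dx + 1] * img[yy][xx]
            out[y][x] = acc / 16.0  # kernel weights sum to 16
    return out


def high_frequency(img):
    """High-frequency magnitude: |image - blurred image| per pixel."""
    blurred = gaussian_blur3(img)
    return [[abs(a - b) for a, b in zip(row, brow)]
            for row, brow in zip(img, blurred)]


def two_means_threshold(values, iters=20):
    """1-D K-means with K=2; returns the midpoint between the two centers."""
    lo, hi = min(values), max(values)
    for _ in range(iters):
        low = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high = [v for v in values if abs(v - lo) > abs(v - hi)]
        if not high:  # all values identical: a single cluster
            break
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):  # converged
            break
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2.0


def adaptive_mask(img):
    """Binary mask: 1 where the pixel falls in the high-frequency cluster."""
    hf = high_frequency(img)
    threshold = two_means_threshold([v for row in hf for v in row])
    return [[1 if v > threshold else 0 for v in row] for row in hf]


# A 6x6 image with a vertical step edge between columns 2 and 3.
step_image = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(6)]
step_mask = adaptive_mask(step_image)
```

On this step image only the two columns adjacent to the edge are marked, so a sparse model would spend its computation there and skip the flat regions. How the mask then drives the unfold-plus-1x1 convolution or the window pruning is not covered by this sketch.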

[58] Federated Learning with LoRA Optimized DeiT and Multiscale Patch Embedding for Secure Eye Disease Recognition

Md. Naimur Asif Borno,Md Sakib Hossain Shovon,MD Hanif Sikder,Iffat Firozy Rimi,Tahani Jaser Alahmadi,Mohammad Ali Moni

Main category: cs.CV

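TL;DR: Proposes an approach based on the data-efficient image transformer (DeiT) that addresses limited data, inadequate feature extraction, data security, and training efficiency in medical image disease detection, achieving state-of-the-art performance.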

Details

Motivation: Medical image disease detection faces four challenges: limited annotated data, inadequate spatial feature analysis, data security issues, and inefficient training frameworks. Method: Multiscale patch embedding improves feature extraction, stratified weighted random sampling addresses class imbalance, and a LoRA-enhanced transformer encoder is combined with a distillation framework and federated learning. Result: The model achieves the best AUC, F1 score, precision, loss, and Top-5 accuracy, and Grad-CAM++ improves interpretability. Conclusion: The approach shows strong potential for AI-powered medical imaging and disease detection. Abstract: Recent progress in image-based medical disease detection encounters challenges such as limited annotated data sets, inadequate spatial feature analysis, data security issues, and inefficient training frameworks. This study introduces a data-efficient image transformer (DeiT)-based approach that overcomes these challenges by utilizing multiscale patch embedding for better feature extraction and stratified weighted random sampling to address class imbalance. The model also incorporates a LoRA-enhanced transformer encoder, a distillation framework, and federated learning for decentralized training, improving both efficiency and data security. Consequently, it achieves state-of-the-art performance, with the highest AUC, F1 score, precision, minimal loss, and Top-5 accuracy. Additionally, Grad-CAM++ visualizations improve interpretability by highlighting critical pathological regions, enhancing the model's clinical relevance. These results highlight the potential of this approach to advance AI-powered medical imaging and disease detection.

[59] BridgeIV: Bridging Customized Image and Video Generation through Test-Time Autoregressive Identity Propagation

Panwen Hu,Jiehui Huang,Qiang Sun,Xiaodan Liang

Main category: cs.CV

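TL;DR: Proposes an autoregressive structure and texture propagation module (STPM) and a test-time reward optimization (TTRO) method to improve customized text-to-video (CT2V) generation.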

Details

Motivation: Existing zero-shot CT2V methods generalize poorly, while combining tuning-based T2I models with temporal motion modules loses structural and texture information. Method: The STPM module autoregressively extracts and injects key structural and texture features, and the TTRO method refines fine-grained details. Result: Experiments show that STPM and TTRO improve the CLIP-I and DINO consistency metrics by 7.8 and 13.1, respectively. Conclusion: STPM and TTRO effectively improve the consistency and detail of CT2V generation. Abstract: Both zero-shot and tuning-based customized text-to-image (CT2I) generation have made significant progress for storytelling content creation. In contrast, research on customized text-to-video (CT2V) generation remains relatively limited. Existing zero-shot CT2V methods suffer from poor generalization, while another line of work directly combining tuning-based T2I models with temporal motion modules often leads to the loss of structural and texture information. To bridge this gap, we propose an autoregressive structure and texture propagation module (STPM), which extracts key structural and texture features from the reference subject and injects them autoregressively into each video frame to enhance consistency. Additionally, we introduce a test-time reward optimization (TTRO) method to further refine fine-grained details. Quantitative and qualitative experiments validate the effectiveness of STPM and TTRO, demonstrating improvements of 7.8 and 13.1 in CLIP-I and DINO consistency metrics over the baseline, respectively.

[60] Technical Report for ICRA 2025 GOOSE 2D Semantic Segmentation Challenge: Leveraging Color Shift Correction, RoPE-Swin Backbone, and Quantile-based Label Denoising Strategy for Robust Outdoor Scene Understanding

Chih-Chung Hsu,I-Hsuan Wu,Wen-Hai Tseng,Ching-Heng Cheng,Ming-Hsuan Wu,Jin-Hui Jiang,Yu-Jou Hsiao

Main category: cs.CV

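TL;DR: Team ACVLAB's semantic segmentation framework combines a Swin Transformer with Rotary Position Embedding, adds a color shift correction module and a quantile-based denoising strategy, and performs strongly in the ICRA 2025 challenge.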

Details

Motivation: Inconsistent illumination and noise degrade semantic segmentation in outdoor scenes; the goal is a more robust and accurate model. Method: A Swin Transformer backbone with Rotary Position Embedding improves spatial generalization; a color shift estimation-and-correction module handles illumination; a quantile-based denoising strategy suppresses the influence of high-error pixels. Result: The approach reaches an mIoU of 0.848 on the GOOSE test set, demonstrating its effectiveness. Conclusion: Combining color correction, positional encoding, and error-aware denoising significantly improves the robustness and performance of semantic segmentation. Abstract: This report presents our semantic segmentation framework developed by team ACVLAB for the ICRA 2025 GOOSE 2D Semantic Segmentation Challenge, which focuses on parsing outdoor scenes into nine semantic categories under real-world conditions. Our method integrates a Swin Transformer backbone enhanced with Rotary Position Embedding (RoPE) for improved spatial generalization, alongside a Color Shift Estimation-and-Correction module designed to compensate for illumination inconsistencies in natural environments. To further improve training stability, we adopt a quantile-based denoising strategy that downweights the top 2.5% of highest-error pixels, treating them as noise and suppressing their influence during optimization. Evaluated on the official GOOSE test set, our approach achieved a mean Intersection over Union (mIoU) of 0.848, demonstrating the effectiveness of combining color correction, positional encoding, and error-aware denoising in robust semantic segmentation.
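
The quantile-based denoising idea in this abstract can be illustrated with a small sketch. This is one plausible reading, not the team's code: it assumes per-pixel losses are already computed, treats the highest-loss 2.5% of pixels as noisy, and multiplies their loss by a fixed weight before averaging. The function name `denoised_mean_loss` and the `noisy_weight` parameter are hypothetical; the abstract says only that these pixels are downweighted and does not state by how much.

```python
def denoised_mean_loss(pixel_losses, top_fraction=0.025, noisy_weight=0.0):
    """Weighted mean loss that downweights the highest-error pixels.

    pixel_losses: flat list of per-pixel loss values.
    top_fraction: share of pixels treated as label noise (2.5% in the abstract).
    noisy_weight: weight applied to those pixels (assumed; 0.0 ignores them).
    """
    n = len(pixel_losses)
    k = round(top_fraction * n)  # number of pixels treated as noisy
    # Indices sorted from highest to lowest loss; the first k are downweighted.
    order = sorted(range(n), key=lambda i: pixel_losses[i], reverse=True)
    weights = [1.0] * n
    for i in order[:k]:
        weights[i] = noisy_weight
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(w * l for w, l in zip(weights, pixel_losses))
    return weighted_sum / total_weight


# 200 pixels: 195 well-fit pixels and 5 outliers (exactly 2.5%).
example_losses = [1.0] * 195 + [50.0] * 5
clean_loss = denoised_mean_loss(example_losses)                   # outliers ignored
plain_loss = denoised_mean_loss(example_losses, top_fraction=0.0)  # ordinary mean
```

With the five outliers suppressed, the averaged loss reflects only the well-fit pixels, whereas the ordinary mean is more than doubled by them. In a real training loop the selection would normally be done per batch on tensors, with the weights excluded from gradient computation.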

[61] Replay-Based Continual Learning with Dual-Layered Distillation and a Streamlined U-Net for Efficient Text-to-Image Generation

Md. Naimur Asif Borno,Md Sakib Hossain Shovon,Asmaa Soliman Al-Moisheer,Mohammad Ali Moni

Main category: cs.CV

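TL;DR: KDC-Diff is an efficient text-to-image diffusion model that uses a streamlined U-Net architecture and a dual-layered distillation strategy to substantially reduce computational demands while maintaining image quality.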

Details

Motivation: Current text-to-image diffusion models have high computational demands, which limits accessibility and scalability. Method: A streamlined U-Net architecture (with nearly half the parameters) and a dual-layered distillation strategy, combined with replay-based continual learning. Result: With low computational resources, KDC-Diff reaches state-of-the-art performance on several datasets and significantly reduces inference time. Conclusion: KDC-Diff is an efficient and adaptable text-to-image generation solution for computationally constrained environments. Abstract: Recent advancements in text-to-image diffusion models are hindered by high computational demands, limiting accessibility and scalability. This paper introduces KDC-Diff, a novel stable diffusion framework that enhances efficiency while maintaining image quality. KDC-Diff features a streamlined U-Net architecture with nearly half the parameters of the original U-Net (482M), significantly reducing model complexity. We propose a dual-layered distillation strategy to ensure high-fidelity generation, transferring semantic and structural insights from a teacher to a compact student model while minimizing quality degradation. Additionally, replay-based continual learning is integrated to mitigate catastrophic forgetting, allowing the model to retain prior knowledge while adapting to new data. Despite operating under extremely low computational resources, KDC-Diff achieves state-of-the-art performance on the Oxford Flowers and Butterflies & Moths 100 Species datasets, demonstrating competitive metrics such as FID, CLIP, and LPIPS. Moreover, it significantly reduces inference time compared to existing models. These results establish KDC-Diff as a highly efficient and adaptable solution for text-to-image generation, particularly in computationally constrained environments.

[62] Hallucination-Aware Multimodal Benchmark for Gastrointestinal Image Analysis with Large Vision-Language Models

Bidur Khanal,Sandesh Pokhrel,Sanjay Bhandari,Ramesh Rana,Nikesh Shrestha,Ram Bahadur Gurung,Cristian Linte,Angus Watson,Yash Raj Shrestha,Binod Bhattarai

Main category: cs.CV

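TL;DR: Introduces Gut-VLM, a multimodal dataset of gastrointestinal images for studying hallucination in vision-language models (VLMs), and uses hallucination-aware finetuning to improve model performance.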

Details

Motivation: Existing VLMs hallucinate in the medical domain (generating descriptions that do not match the image content), with especially serious consequences for gastrointestinal image analysis. Method: The Gut-VLM dataset is built with a two-stage pipeline: (1) ChatGPT generates initial reports, which contain hallucinated text; (2) medical experts review and correct them. The VLM is then optimized with hallucination-aware finetuning. Result: Hallucination-aware finetuning outperforms conventional finetuning for report generation, and a benchmark evaluation of VLMs is established. Conclusion: The Gut-VLM dataset and hallucination-aware finetuning open a new direction for medical VLM research and substantially reduce hallucination. Abstract: Vision-Language Models (VLMs) are becoming increasingly popular in the medical domain, bridging the gap between medical images and clinical language. Existing VLMs demonstrate an impressive ability to comprehend medical images and text queries to generate detailed, descriptive diagnostic medical reports. However, hallucination--the tendency to generate descriptions that are inconsistent with the visual content--remains a significant issue in VLMs, with particularly severe implications in the medical field. To facilitate VLM research on gastrointestinal (GI) image analysis and study hallucination, we curate a multimodal image-text GI dataset: Gut-VLM. This dataset is created using a two-stage pipeline: first, descriptive medical reports of Kvasir-v2 images are generated using ChatGPT, which introduces some hallucinated or incorrect texts. In the second stage, medical experts systematically review these reports, and identify and correct potential inaccuracies to ensure high-quality, clinically reliable annotations. Unlike traditional datasets that contain only descriptive texts, our dataset also features tags identifying hallucinated sentences and their corresponding corrections. A common approach to reducing hallucination in VLMs is to finetune the model on a small-scale, problem-specific dataset. However, we take a different strategy using our dataset. Instead of finetuning the VLM solely for generating textual reports, we finetune it to detect and correct hallucinations, an approach we call hallucination-aware finetuning. Our results show that this approach is better than simply finetuning for descriptive report generation. Additionally, we conduct an extensive evaluation of state-of-the-art VLMs across several metrics, establishing a benchmark. GitHub Repo: https://github.com/bhattarailab/Hallucination-Aware-VLM.

[63] CMD: Controllable Multiview Diffusion for 3D Editing and Progressive Generation

Peng Li,Suizhi Ma,Jialiang Chen,Yuan Liu,Chongyi Zhang,Wei Xue,Wenhan Luo,Alla Sheffer,Wenping Wang,Yike Guo

Main category: cs.CV

TL;DR: Proposes CMD, a method that uses a conditional multiview diffusion model for local editing and generation of 3D models.

Details

Motivation: Existing 3D generation methods offer no control over individual parts of the generated model, and any change to the input image requires regenerating the whole model. Method: A conditional multiview diffusion model generates or edits components conditioned on the known parts, supporting local modification. Result: Experiments show that CMD decomposes complex 3D generation tasks, improves generation quality, and supports efficient local editing. Conclusion: CMD brings flexibility and control to 3D generation and clearly outperforms conventional methods. Abstract: Recently, 3D generation methods have shown their powerful ability to automate 3D model creation. However, most 3D generation methods rely only on an input image or a text prompt to generate a 3D model, and thus lack control over each component of the generated 3D model. Any modification of the input image leads to an entire regeneration of the 3D model. In this paper, we introduce a new method called CMD that generates a 3D model from an input image while enabling flexible local editing of each component of the 3D model. In CMD, we formulate the 3D generation as a conditional multiview diffusion model, which takes the existing or known parts as conditions and generates the edited or added components. This conditional multiview diffusion model not only allows the generation of 3D models part by part but also enables local editing of 3D models according to the local revision of the input image without changing other 3D parts. Extensive experiments are conducted to demonstrate that CMD decomposes a complex 3D generation task into multiple components, improving the generation quality. Meanwhile, CMD enables efficient and flexible local editing of a 3D model by just editing one rendered image.

[64] MELLM: Exploring LLM-Powered Micro-Expression Understanding Enhanced by Subtle Motion Perception

Zhengye Zhang,Sirui Zhao,Shifeng Liu,Shukang Yin,Xinglong Mao,Tong Xu,Enhong Chen

Main category: cs.CV

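TL;DR: Proposes a novel Micro-Expression Large Language Model (MELLM) that combines the strong reasoning of multimodal large language models (MLLMs) with a strategy for capturing the subtle dynamics of micro-expressions, the first exploration of MLLMs for micro-expression analysis.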

Details

Motivation: Current automatic micro-expression recognition (MER) research focuses mainly on discrete emotion classification and lacks in-depth analysis of subtle dynamic movements and underlying emotional cues. The success of multimodal large language models (MLLMs) on vision-language tasks opens new possibilities for comprehensive micro-expression understanding. Method: MELLM takes as input an interpretable motion-enhanced color map built by fusing onset-apex optical flow dynamics with the grayscale onset frame, and uses specialized fine-tuning strategies to strengthen visual perception. An instruction-description dataset built from FACS annotations and emotion labels is used for training. Result: Comprehensive evaluation on multiple benchmark datasets shows that MELLM has superior robustness and generalization in micro-expression understanding (MEU). Conclusion: MELLM offers a new solution for micro-expression analysis and demonstrates the potential of MLLMs in this area. Abstract: Micro-expressions (MEs) are crucial psychological responses with significant potential for affective computing. However, current automatic micro-expression recognition (MER) research primarily focuses on discrete emotion classification, neglecting a convincing analysis of the subtle dynamic movements and inherent emotional cues. The rapid progress in multimodal large language models (MLLMs), known for their strong multimodal comprehension and language generation abilities, offers new possibilities. MLLMs have shown success in various vision-language tasks, indicating their potential to understand MEs comprehensively, including both fine-grained motion patterns and underlying emotional semantics. Nevertheless, challenges remain due to the subtle intensity and short duration of MEs, as existing MLLMs are not designed to capture such delicate frame-level facial dynamics. In this paper, we propose a novel Micro-Expression Large Language Model (MELLM), which incorporates a subtle facial motion perception strategy with the strong inference capabilities of MLLMs, representing the first exploration of MLLMs in the domain of ME analysis. Specifically, to explicitly guide the MLLM toward motion-sensitive regions, we construct an interpretable motion-enhanced color map by fusing onset-apex optical flow dynamics with the corresponding grayscale onset frame as the model input. Additionally, specialized fine-tuning strategies are incorporated to further enhance the model's visual perception of MEs. Furthermore, we construct an instruction-description dataset based on Facial Action Coding System (FACS) annotations and emotion labels to train our MELLM. Comprehensive evaluations across multiple benchmark datasets demonstrate that our model exhibits superior robustness and generalization capabilities in ME understanding (MEU). Code is available at https://github.com/zyzhangUstc/MELLM.

[65] Efficient and Robust Multidimensional Attention in Remote Physiological Sensing through Target Signal Constrained Factorization

Jitesh Joshi,Youngjun Cho

Main category: cs.CV

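TL;DR: Proposes TSFM, a multidimensional attention mechanism, and MMRPhys, a dual-branch 3D-CNN architecture, for jointly estimating rPPG and rRSP signals from multimodal video, with markedly better cross-domain robustness and real-time performance.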

Details

Motivation: Existing deep learning methods for remote physiological sensing are not robust enough to domain shifts, which hurts real-world performance. Method: The TSFM module imposes physiological signal characteristics as factorization constraints, and the MMRPhys dual-branch 3D-CNN architecture supports multitask estimation from multimodal inputs. Result: Cross-dataset evaluation on five benchmark datasets shows that MMRPhys significantly outperforms existing methods with low inference latency suitable for real-time use. Conclusion: MMRPhys sets new benchmarks for robust multitask, multimodal physiological sensing and provides a computationally efficient framework. Abstract: Remote physiological sensing using camera-based technologies offers transformative potential for non-invasive vital sign monitoring across healthcare and human-computer interaction domains. Although deep learning approaches have advanced the extraction of physiological signals from video data, existing methods have not been sufficiently assessed for their robustness to domain shifts. These shifts in remote physiological sensing include variations in ambient conditions, camera specifications, head movements, facial poses, and physiological states, which often impact real-world performance significantly. Cross-dataset evaluation provides an objective measure to assess generalization capabilities across these domain shifts. We introduce the Target Signal Constrained Factorization module (TSFM), a novel multidimensional attention mechanism that explicitly incorporates physiological signal characteristics as factorization constraints, allowing more precise feature extraction. Building on this innovation, we present MMRPhys, an efficient dual-branch 3D-CNN architecture designed for simultaneous multitask estimation of remote photoplethysmography (rPPG) and respiratory (rRSP) signals from multimodal RGB and thermal video inputs. Through comprehensive cross-dataset evaluation on five benchmark datasets, we demonstrate that MMRPhys with TSFM significantly outperforms state-of-the-art methods in generalization across domain shifts for rPPG and rRSP estimation, while maintaining a minimal inference latency suitable for real-time applications. Our approach establishes new benchmarks for robust multitask and multimodal physiological sensing and offers a computationally efficient framework for practical deployment in unconstrained environments. The web browser-based application featuring on-device real-time inference of the MMRPhys model is available at https://physiologicailab.github.io/mmrphys-live

[66] A Vision-Language Foundation Model for Leaf Disease Identification

Khang Nguyen Quoc,Lan Le Thi Thu,Luyl-Da Quach

Main category: cs.CV

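TL;DR: SCOLD is an agricultural vision-language foundation model based on soft-target contrastive learning; by combining image and text modalities, it significantly improves leaf disease identification.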

Details

Motivation: Existing studies fall short in integrating image and text modalities and in using domain-specific data; SCOLD aims to address both. Method: SCOLD is pretrained in a task-agnostic way on 186,000 image-caption pairs, using soft-target contrastive learning to improve generalization and robustness. Result: SCOLD outperforms existing models (such as OpenAI-CLIP-L, BioCLIP, and SigLIP2) on multiple benchmarks while keeping a competitive parameter count. Conclusion: SCOLD delivers strong performance for agricultural vision-language modeling and lays a foundation for future research on multimodal plant disease diagnosis. Abstract: Leaf disease identification plays a pivotal role in smart agriculture. However, many existing studies still struggle to integrate image and textual modalities to compensate for each other's limitations. Furthermore, many of these approaches rely on pretraining with constrained datasets such as ImageNet, which lack domain-specific information. We propose SCOLD (Soft-target COntrastive learning for Leaf Disease identification), a context-aware vision-language foundation model tailored to address these challenges for agricultural tasks. SCOLD is developed using a diverse corpus of plant leaf images and corresponding symptom descriptions, comprising over 186,000 image-caption pairs aligned with 97 unique concepts. Through task-agnostic pretraining, SCOLD leverages contextual soft targets to mitigate overconfidence in contrastive learning by smoothing labels, thereby improving model generalization and robustness on fine-grained classification tasks. Experimental results demonstrate that SCOLD outperforms existing vision-language models such as OpenAI-CLIP-L, BioCLIP, and SigLIP2 across several benchmarks, including zero-shot and few-shot classification, image-text retrieval, and image classification, while maintaining a competitive parameter footprint. Ablation studies further highlight SCOLD's effectiveness in contrast to its counterparts. The proposed approach significantly advances the agricultural vision-language foundation model, offering strong performance with minimal or no supervised fine-tuning. This work lays a solid groundwork for future research on models trained with long-form and simplified contexts, tasks involving class ambiguity, and multi-modal systems for intelligent plant disease diagnostics. The code for this study is available at https://huggingface.co/enalis/scold

[67] MarkMatch: Same-Hand Stuffing Detection

Fei Zhao,Runlin Zhang,Chengcui Zhang,Nitesh Saxena

Main category: cs.CV

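TL;DR: MarkMatch is a retrieval system for detecting whether the marks on two paper ballots were filled in by the same person; it ranks stylistic similarity with contrastive learning and outperforms the existing method BubbleSig.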

Details

Motivation: Election auditors need a practical, non-biometric visual investigation tool for detecting suspicious ballots. Method: The model is trained with contrastive learning, using a dense batch similarity matrix and a dual loss objective to learn subtle handwriting differences, and integrates the Segment Anything Model for flexible mark extraction. Result: The model achieves an F1 score of 0.943, surpassing BubbleSig's best performance. Conclusion: MarkMatch performs well under handwriting variation and visual noise and gives election auditors an efficient tool. Abstract: We present MarkMatch, a retrieval system for detecting whether two paper ballot marks were filled by the same hand. Unlike the previous SOTA method BubbleSig, which used binary classification on isolated mark pairs, MarkMatch ranks stylistic similarity between a query mark and a mark in the database using contrastive learning. Our model is trained with a dense batch similarity matrix and a dual loss objective. Each sample is contrasted against many negatives within each batch, enabling the model to learn subtle handwriting differences and improve generalization under handwriting variation and visual noise, while diagonal supervision reinforces high confidence on true matches. The model achieves an F1 score of 0.943, surpassing BubbleSig's best performance. MarkMatch also integrates the Segment Anything Model for flexible mark extraction via box- or point-based prompts. The system offers election auditors a practical tool for visual, non-biometric investigation of suspicious ballots.

[68] Differentiable NMS via Sinkhorn Matching for End-to-End Fabric Defect Detection

Zhengyang Lu,Bingjie Lu,Weifan Wang,Feng Wang

Main category: cs.CV

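TL;DR: Proposes a differentiable NMS framework that addresses interrupted gradient flow and high annotation cost in fabric defect detection, achieving high localization precision through end-to-end optimization.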

Details

Motivation: Conventional non-maximum suppression (NMS) interrupts gradient flow, and pixel-level annotation is expensive, limiting the performance and scalability of fabric defect detection. Method: NMS is reformulated as a differentiable bipartite matching problem solved with the Sinkhorn-Knopp algorithm, integrating proposal quality, feature similarity, and spatial relationships. An entropy-constrained mask refinement mechanism further improves localization precision. Result: The method outperforms existing methods on the Tianchi fabric defect dataset while keeping real-time speed suitable for industrial deployment. The framework adapts well to different architectures and extends to general object detection. Conclusion: Through end-to-end optimization, the framework addresses key challenges in fabric defect detection and has practical value and potential for wider use. Abstract: Fabric defect detection confronts two fundamental challenges. First, conventional non-maximum suppression disrupts gradient flow, which hinders genuine end-to-end learning. Second, acquiring pixel-level annotations at industrial scale is prohibitively costly. Addressing these limitations, we propose a differentiable NMS framework for fabric defect detection that achieves superior localization precision through end-to-end optimization. We reformulate NMS as a differentiable bipartite matching problem solved through the Sinkhorn-Knopp algorithm, maintaining uninterrupted gradient flow throughout the network. This approach specifically targets the irregular morphologies and ambiguous boundaries of fabric defects by integrating proposal quality, feature similarity, and spatial relationships. Our entropy-constrained mask refinement mechanism further enhances localization precision through principled uncertainty modeling. Extensive experiments on the Tianchi fabric defect dataset demonstrate significant performance improvements over existing methods while maintaining real-time speeds suitable for industrial deployment. The framework exhibits remarkable adaptability across different architectures and generalizes effectively to general object detection tasks.
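
For readers unfamiliar with the Sinkhorn-Knopp algorithm named in this abstract, here is a generic pure-Python sketch of the core iteration. It is not the paper's differentiable NMS: it only shows how alternating row and column normalization turns a matrix of matching scores into a soft, approximately doubly stochastic assignment using smooth operations. The square score matrix, the temperature `epsilon`, and the function name `sinkhorn` are assumptions made for illustration.

```python
import math


def sinkhorn(scores, epsilon=0.5, iters=50):
    """Soft assignment from a square score matrix via Sinkhorn-Knopp.

    scores: n x n list of lists; higher means a better match.
    epsilon: temperature (assumed); smaller values give sharper assignments
             but need more iterations to converge.
    Returns a matrix whose rows and columns each sum to approximately 1.
    """
    n = len(scores)
    top = max(max(row) for row in scores)  # subtract the max for numerical stability
    plan = [[math.exp((s - top) / epsilon) for s in row] for row in scores]
    for _ in range(iters):
        # Normalize every row to sum to 1.
        for i in range(n):
            row_sum = sum(plan[i])
            plan[i] = [v / row_sum for v in plan[i]]
        # Normalize every column to sum to 1.
        for j in range(n):
            col_sum = sum(plan[i][j] for i in range(n))
            for i in range(n):
                plan[i][j] /= col_sum
    return plan


# Hypothetical scores between three proposals (rows) and three target slots (columns).
example_scores = [
    [0.9, 0.1, 0.2],
    [0.2, 0.8, 0.1],
    [0.1, 0.3, 0.7],
]
soft_assignment = sinkhorn(example_scores)
```

Each proposal ends up with most of its mass on its best-scoring slot while competing proposals are pushed apart, which is the role hard suppression plays in conventional NMS. Because every step is an exponential, a sum, or a division, the same loop written in an autodiff framework would let gradients pass through the matching.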

Binbin Wei,Yuhang Zhang,Shishun Tian,Muxin Liao,Wei Li,Wenbin Zou

Main category: cs.CV

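TL;DR: Proposes DSSS, a new framework that uses an RGB-D inter-modal stylization flow and depth-sensitive soft suppression to learn domain-invariant features from depth maps for domain generalization semantic segmentation.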

Details

Motivation: Recent unsupervised domain adaptation (UDA) work that uses depth maps ignores their noise and holes, and existing sensitivity-suppression methods are not directly applicable to depth maps because of their unique characteristics; domain generalization (DG) additionally needs no target data. Method: An RGB-D inter-modal stylization flow generates stylized depth maps, a class-wise soft spatial sensitivity suppression mechanism is designed, and an RGB-D soft alignment loss is introduced. Result: Experiments show that DSSS markedly improves performance across multiple backbone networks. Conclusion: DSSS is the first work to integrate RGB and depth information in the multi-class DG semantic segmentation task, with a clear performance advantage. Abstract: Unsupervised Domain Adaptation (UDA) aims to align source and target domain distributions to close the domain gap, but it still struggles with obtaining target data. Fortunately, Domain Generalization (DG) excels without the need for any target data. Recent works show that depth maps contribute to improved generalization performance in UDA tasks, but they ignore the noise and holes in depth maps caused by device and environmental factors, failing to sufficiently and effectively learn domain-invariant representations. Although high-sensitivity region suppression has shown promising results in learning domain-invariant features, existing methods are not directly applicable to depth maps due to their unique characteristics. Hence, we propose a novel framework, namely Depth-Sensitive Soft Suppression with RGB-D inter-modal stylization flow (DSSS), focusing on learning domain-invariant features from depth maps for DG semantic segmentation. Specifically, we propose the RGB-D inter-modal stylization flow to generate stylized depth maps for sensitivity detection, cleverly utilizing RGB information as the stylization source. Then, a class-wise soft spatial sensitivity suppression is designed to identify and emphasize non-sensitive depth features that contain more domain-invariant information. Furthermore, an RGB-D soft alignment loss is proposed to ensure that the stylized depth maps align only part of the RGB features while still retaining the unique depth information. To the best of our knowledge, our DSSS framework is the first work to integrate RGB and depth information in the multi-class DG semantic segmentation task. Extensive experiments over multiple backbone networks show that our framework achieves remarkable performance improvement.

[70] DAPE: Dual-Stage Parameter-Efficient Fine-Tuning for Consistent Video Editing with Diffusion Models

Junhao Xia,Chaoyang Zhang,Yecheng Zhang,Chengyang Zhou,Zhichang Wang,Bochun Liu,Dongshuo Yin

Main category: cs.CV

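TL;DR: DAPE is a two-stage parameter-efficient fine-tuning framework for video editing that improves video quality and consistency through norm-tuning and a vision-friendly adapter; the paper also proposes a more comprehensive benchmark dataset.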

Details

Motivation: Existing video editing methods either incur high training costs or deliver insufficient performance. Method: A two-stage parameter-efficient fine-tuning framework: the first stage uses norm-tuning to enhance temporal consistency, and the second introduces a vision-friendly adapter to improve visual quality. Result: DAPE significantly improves temporal consistency and text-video alignment over existing methods and performs well on multiple datasets. Conclusion: DAPE offers a high-quality, cost-effective video editing solution, and its new benchmark dataset advances evaluation standards in the field. Abstract: Video generation based on diffusion models presents a challenging multimodal task, with video editing emerging as a pivotal direction in this field. Recent video editing approaches primarily fall into two categories: training-required and training-free methods. While training-based methods incur high computational costs, training-free alternatives often yield suboptimal performance. To address these limitations, we propose DAPE, a high-quality yet cost-effective two-stage parameter-efficient fine-tuning (PEFT) framework for video editing. In the first stage, we design an efficient norm-tuning method to enhance temporal consistency in generated videos. The second stage introduces a vision-friendly adapter to improve visual quality. Additionally, we identify critical shortcomings in existing benchmarks, including limited category diversity, imbalanced object distribution, and inconsistent frame counts. To mitigate these issues, we curate a large dataset benchmark comprising 232 videos with rich annotations and 6 editing prompts, enabling objective and comprehensive evaluation of advanced methods. Extensive experiments on existing datasets (BalanceCC, LOVEU-TGVE, RAVE) and our proposed benchmark demonstrate that DAPE significantly improves temporal coherence and text-video alignment while outperforming previous state-of-the-art approaches.

[71] Seed1.5-VL Technical Report

Dong Guo,Faming Wu,Feida Zhu,Fuxing Leng,Guang Shi,Haobin Chen,Haoqi Fan,Jian Wang,Jianyu Jiang,Jiawei Wang,Jingji Chen,Jingjia Huang,Kang Lei,Liping Yuan,Lishu Luo,Pengfei Liu,Qinghao Ye,Rui Qian,Shen Yan,Shixiong Zhao,Shuai Peng,Shuangye Li,Sihang Yuan,Sijin Wu,Tianheng Cheng,Weiwei Liu,Wenqian Wang,Xianhan Zeng,Xiao Liu,Xiaobo Qin,Xiaohan Ding,Xiaojun Xiao,Xiaoying Zhang,Xuanwei Zhang,Xuehan Xiong,Yanghua Peng,Yangrui Chen,Yanwei Li,Yanxu Hu,Yi Lin,Yiyuan Hu,Yiyuan Zhang,Youbin Wu,Yu Li,Yudong Liu,Yue Ling,Yujia Qin,Zanbo Wang,Zhiwu He,Aoxue Zhang,Bairen Yi,Bencheng Liao,Can Huang,Can Zhang,Chaorui Deng,Chaoyi Deng,Cheng Lin,Cheng Yuan,Chenggang Li,Chenhui Gou,Chenwei Lou,Chengzhi Wei,Chundian Liu,Chunyuan Li,Deyao Zhu,Donghong Zhong,Feng Li,Feng Zhang,Gang Wu,Guodong Li,Guohong Xiao,Haibin Lin,Haihua Yang,Haoming Wang,Heng Ji,Hongxiang Hao,Hui Shen,Huixia Li,Jiahao Li,Jialong Wu,Jianhua Zhu,Jianpeng Jiao,Jiashi Feng,Jiaze Chen,Jianhui Duan,Jihao Liu,Jin Zeng,Jingqun Tang,Jingyu Sun,Joya Chen,Jun Long,Junda Feng,Junfeng Zhan,Junjie Fang,Junting Lu,Kai Hua,Kai Liu,Kai Shen,Kaiyuan Zhang,Ke Shen,Ke Wang,Keyu Pan,Kun Zhang,Kunchang Li,Lanxin Li,Lei Li,Lei Shi,Li Han,Liang Xiang,Liangqiang Chen,Lin Chen,Lin Li,Lin Yan,Liying Chi,Longxiang Liu,Mengfei Du,Mingxuan Wang,Ningxin Pan,Peibin Chen,Pengfei Chen,Pengfei Wu,Qingqing Yuan,Qingyao Shuai,Qiuyan Tao,Renjie Zheng,Renrui Zhang,Ru Zhang,Rui Wang,Rui Yang,Rui Zhao,Shaoqiang Xu,Shihao Liang,Shipeng Yan,Shu Zhong,Shuaishuai Cao,Shuangzhi Wu,Shufan Liu,Shuhan Chang,Songhua Cai,Tenglong Ao,Tianhao Yang,Tingting Zhang,Wanjun Zhong,Wei Jia,Wei Weng,Weihao Yu,Wenhao Huang,Wenjia Zhu,Wenli Yang,Wenzhi Wang,Xiang Long,XiangRui Yin,Xiao Li,Xiaolei Zhu,Xiaoying Jia,Xijin Zhang,Xin Liu,Xinchen Zhang,Xinyu Yang,Xiongcai Luo,Xiuli Chen,Xuantong Zhong,Xuefeng Xiao,Xujing Li,Yan Wu,Yawei Wen,Yifan Du,Yihao Zhang,Yining Ye,Yonghui Wu,Yu Liu,Yu Yue,Yufeng Zhou,Yufeng Yuan,Yuhang Xu,Yuhong Yang,Yun Zhang,Yunhao Fang,Yuntao Li,Yurui Ren,Yuwen Xiong,Zehua Hong,Zehua Wang,Zewei Sun,Zeyu Wang,Zhao Cai,Zhaoyue Zha,Zhecheng An,Zhehui Zhao,Zhengzhuo Xu,Zhipeng Chen,Zhiyong Wu,Zhuofan Zheng,Zihao Wang,Zilong Huang,Ziyu Zhu,Zuquan Song

Main category: cs.CV

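TL;DR: Seed1.5-VL is a compact vision-language foundation model that pairs a 532M-parameter vision encoder with an MoE LLM of 20B active parameters; it performs strongly on public and internal evaluations and leads on agent-centric tasks and multimodal reasoning.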

Details

Motivation: To advance general-purpose multimodal understanding and reasoning and support broader applications across diverse tasks. Method: An architecture combining a 532M-parameter vision encoder with an MoE LLM of 20B active parameters, together with the reported model design, data construction, and multi-stage training. Result: State-of-the-art performance on 38 of 60 public benchmarks; on agent-centric tasks such as GUI control and gameplay it outperforms leading multimodal systems. Conclusion: Seed1.5-VL demonstrates strong multimodal understanding and reasoning and provides a useful reference for future research. Abstract: We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed of a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at https://www.volcengine.com/ (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428)

[72] Semantic-Guided Diffusion Model for Single-Step Image Super-Resolution

Zihang Liu,Zhenyu Zhang,Hao Tang

Main category: cs.CV

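TL;DR: SAMSR is a semantic-guided diffusion framework that brings semantic segmentation masks into single-step diffusion inference, improving super-resolution in semantically complex regions.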

Details

Motivation: Existing single-step diffusion models remain limited when handling complex semantic regions and need improvement to preserve more detail. Method: Introduces the SAM-Noise Module and a pixel-wise sampling strategy, with a semantic consistency loss to improve training. Result: Significantly better perceptual quality and detail recovery on both real-world and synthetic datasets. Conclusion: SAMSR performs well on semantically complex images, and the code has been released. Abstract: Diffusion-based image super-resolution (SR) methods have demonstrated remarkable performance. Recent advancements have introduced deterministic sampling processes that reduce inference from 15 iterative steps to a single step, thereby significantly improving the inference speed of existing diffusion models. However, their efficiency remains limited when handling complex semantic regions due to the single-step inference. To address this limitation, we propose SAMSR, a semantic-guided diffusion framework that incorporates semantic segmentation masks into the sampling process. Specifically, we introduce the SAM-Noise Module, which refines Gaussian noise using segmentation masks to preserve spatial and semantic features. Furthermore, we develop a pixel-wise sampling strategy that dynamically adjusts the residual transfer rate and noise strength based on pixel-level semantic weights, prioritizing semantically rich regions during the diffusion process. To enhance model training, we also propose a semantic consistency loss, which aligns pixel-wise semantic weights between predictions and ground truth. Extensive experiments on both real-world and synthetic datasets demonstrate that SAMSR significantly improves perceptual quality and detail recovery, particularly in semantically complex images. Our code is released at https://github.com/Liu-Zihang/SAMSR.

[73] Discovering Concept Directions from Diffusion-based Counterfactuals via Latent Clustering

Payal Varshney,Adriano Lucieri,Christoph Balada,Andreas Dengel,Sheraz Ahmed

Main category: cs.CV

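TL;DR: CDLC is a latent-clustering method for extracting global, class-specific concept directions; it substantially reduces computational complexity and is validated on skin lesion data.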

Details

Motivation: Existing concept-based explanation methods are computationally intensive and struggle to capture complex semantic concepts efficiently; CDLC aims to address both issues. Method: Concept directions are extracted by clustering latent difference vectors derived from factual and counterfactual image pairs, avoiding dimension-wise traversal of the latent space. Result: On a skin lesion dataset, the extracted concept directions align with clinical features and in some cases reveal dataset biases or unknown biomarkers. Conclusion: CDLC is interpretable and scalable, and applicable to high-stakes domains and diverse data modalities. Abstract: Concept-based explanations have emerged as an effective approach within Explainable Artificial Intelligence, enabling interpretable insights by aligning model decisions with human-understandable concepts. However, existing methods rely on computationally intensive procedures and struggle to efficiently capture complex, semantic concepts. Recently, the Concept Discovery through Latent Diffusion-based Counterfactual Trajectories (CDCT) framework, introduced by Varshney et al. (2025), attempts to identify concepts via dimension-wise traversal of the latent space of a Variational Autoencoder trained on counterfactual trajectories. Extending the CDCT framework, this work introduces Concept Directions via Latent Clustering (CDLC), which extracts global, class-specific concept directions by clustering latent difference vectors derived from factual and diffusion-generated counterfactual image pairs. CDLC substantially reduces computational complexity by eliminating the exhaustive latent dimension traversal required in CDCT and enables the extraction of multidimensional semantic concepts encoded across the latent dimensions. This approach is validated on a real-world skin lesion dataset, demonstrating that the extracted concept directions align with clinically recognized dermoscopic features and, in some cases, reveal dataset-specific biases or unknown biomarkers. These results highlight that CDLC is interpretable, scalable, and applicable across high-stakes domains and diverse data modalities.

[74] Towards Scalable IoT Deployment for Visual Anomaly Detection via Efficient Compression

Arianna Stropeni,Francesco Borsatti,Manuel Barusco,Davide Dalle Pezze,Marco Fabris,Gian Antonio Susto

Main category: cs.CV

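TL;DR: Investigates how efficient data compression enables visual anomaly detection (VAD) in IoT environments with limited compute and bandwidth, with validation on the MVTec AD benchmark.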

Details

Motivation: Minimizing waste and operational costs is essential in industrial settings, but deploying deep learning models on IoT edge devices is constrained by limited compute and bandwidth. Method: Several data compression techniques are evaluated, and the trade-off between system latency and detection accuracy is analyzed. Result: Experiments show that significant compression causes only a minimal loss in anomaly detection performance compared to uncompressed data. Conclusion: Efficient data compression can effectively support visual anomaly detection in resource-constrained IoT environments. Abstract: Visual Anomaly Detection (VAD) is a key task in industrial settings, where minimizing waste and operational costs is essential. Deploying deep learning models within Internet of Things (IoT) environments introduces specific challenges due to the limited computational power and bandwidth of edge devices. This study investigates how to perform VAD effectively under such constraints by leveraging compact and efficient processing strategies. We evaluate several data compression techniques, examining the trade-off between system latency and detection accuracy. Experiments on the MVTec AD benchmark demonstrate that significant compression can be achieved with minimal loss in anomaly detection performance compared to uncompressed data.

[75] Generalizable Pancreas Segmentation via a Dual Self-Supervised Learning Framework

Jun Li,Hongzhang Zhu,Tao Chen,Xiaohua Qian

Main category: cs.CV

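TL;DR: Proposes a dual self-supervised learning model that combines global and local anatomical features to improve the generalization of pancreas segmentation models trained on a single-source dataset.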

Details

Motivation: Existing pancreas segmentation methods perform well on single-source datasets but generalize poorly to test data from other sources; this paper targets that gap. Method: A dual self-supervised learning model with (1) a global-feature contrastive self-supervised module, guided by the pancreatic spatial structure, that strengthens feature discrimination, and (2) a local-image restoration self-supervised module that learns the anatomical features of high-uncertainty regions. Result: By learning global and local features, the model improves the generalization and stability of pancreas segmentation. Conclusion: The dual self-supervised model addresses the generalization problem of single-source training and offers a more robust solution for pancreas segmentation. Abstract: Recently, numerous pancreas segmentation methods have achieved promising performance on local single-source datasets. However, these methods do not adequately account for generalizability issues, and hence typically show limited performance and low stability on test data from other sources. Considering the limited availability of distinct data sources, we seek to improve the generalization performance of a pancreas segmentation model trained with a single-source dataset, i.e., the single source generalization task. In particular, we propose a dual self-supervised learning model that incorporates both global and local anatomical contexts. Our model aims to fully exploit the anatomical features of the intra-pancreatic and extra-pancreatic regions, and hence enhance the characterization of the high-uncertainty regions for more robust generalization. Specifically, we first construct a global-feature contrastive self-supervised learning module that is guided by the pancreatic spatial structure. This module obtains complete and consistent pancreatic features by promoting intra-class cohesion, and also extracts more discriminative features for differentiating between pancreatic and non-pancreatic tissues by maximizing inter-class separation. It mitigates the influence of surrounding tissue on the segmentation outcomes in high-uncertainty regions. Subsequently, a local-image restoration self-supervised learning module is introduced to further enhance the characterization of the high-uncertainty regions. In this module, informative anatomical contexts are learned to recover randomly corrupted appearance patterns in those regions.

[76] Critique Before Thinking: Mitigating Hallucination through Rationale-Augmented Instruction Tuning

Zexian Yang,Dian Li,Dayan Wu,Gang Liu,Weiping Wang

Main category: cs.CV

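TL;DR: The Re-Critic framework strengthens multimodal reasoning and reduces visually ungrounded responses through a visual rationale synthesizer and a self-critic mechanism.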

Details

Motivation: Existing Large Vision-Language Models (LVLMs) tend to produce visually ungrounded responses when interpreting images, whereas humans learning new material rely on pre-study principles such as summarizing key points. Method: Re-Critic combines fundamental rules with chain-of-thought (CoT), augments instructions through a visual rationale synthesizer, and uses an in-context self-critic mechanism to select response pairs for preference tuning. Result: Models fine-tuned with Re-Critic improve on both hallucination-specific tasks and broader multimodal reasoning tasks. Conclusion: By introducing visual rationales and a self-critic mechanism, Re-Critic improves the accuracy and contextual grounding of multimodal reasoning. Abstract: Despite significant advancements in multimodal reasoning tasks, existing Large Vision-Language Models (LVLMs) are prone to producing visually ungrounded responses when interpreting associated images. In contrast, when humans embark on learning new knowledge, they often rely on a set of fundamental pre-study principles: reviewing outlines to grasp core concepts and summarizing key points to guide their focus and enhance understanding. However, such preparatory actions are notably absent in the current instruction tuning processes. This paper presents Re-Critic, an easily scalable rationale-augmented framework designed to incorporate fundamental rules and chain-of-thought (CoT) as a bridge to enhance reasoning abilities. Specifically, Re-Critic develops a visual rationale synthesizer that scalably augments raw instructions with rationale explanation. To probe more contextually grounded responses, Re-Critic employs an in-context self-critic mechanism to select response pairs for preference tuning. Experiments demonstrate that models fine-tuned with our rationale-augmented dataset yield gains that extend beyond hallucination-specific tasks to broader multimodal reasoning tasks.

[77] Ranking-aware Continual Learning for LiDAR Place Recognition

Xufei Wang,Gengxuan Tian,Junqiao Zhao,Siyue Tao,Qiwen Gu,Qiankun Yu,Tiantian Feng

Main category: cs.CV

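TL;DR: Proposes a continual learning framework based on Knowledge Distillation and Fusion (KDF) to address catastrophic forgetting in LiDAR place recognition (LPR).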

Details

Motivation: Existing learning-based LPR methods tend to forget previously trained places after training on a new environment, which degrades performance. Method: A ranking-aware knowledge distillation loss preserves high-level place recognition knowledge, and a knowledge fusion module integrates the knowledge of old and new models. Result: Experiments show that KDF overcomes catastrophic forgetting and surpasses existing methods in mean Recall@1 and forgetting score. Conclusion: The KDF framework provides an effective solution for continual learning in LPR. Abstract: Place recognition plays a significant role in SLAM, robot navigation, and autonomous driving applications. Benefiting from deep learning, the performance of LiDAR place recognition (LPR) has been greatly improved. However, many existing learning-based LPR methods suffer from catastrophic forgetting, which severely harms the performance of LPR on previously trained places after training on a new environment. In this paper, we introduce a continual learning framework for LPR via Knowledge Distillation and Fusion (KDF) to alleviate forgetting. Inspired by the ranking process of place recognition retrieval, we present a ranking-aware knowledge distillation loss that encourages the network to preserve the high-level place recognition knowledge. We also introduce a knowledge fusion module to integrate the knowledge of old and new models for LiDAR place recognition. Our extensive experiments demonstrate that KDF can be applied to different networks to overcome catastrophic forgetting, surpassing the state-of-the-art methods in terms of mean Recall@1 and forgetting score.

[78] Discovering Fine-Grained Visual-Concept Relations by Disentangled Optimal Transport Concept Bottleneck Models

Yan Xie,Zequn Zeng,Hao Zhang,Yucheng Ding,Yi Wang,Zhengjue Wang,Bo Chen,Hongwei Liu

Main category: cs.CV

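TL;DR: DOT-CBM addresses the limitations of conventional CBMs by modeling fine-grained visual-concept relations, using optimal transport and orthogonal projection losses to improve reliability and interpretability.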

Details

Motivation: Conventional CBMs learn only coarse-grained relations between the whole image and concepts, which causes spurious associations and loses local visual information, hurting reliability and interpretability. Method: The DOT-CBM framework models concept prediction as a transportation problem between image patches and concepts, combined with orthogonal projection losses and transportation priors to improve feature alignment. Result: DOT-CBM achieves SOTA performance on image classification, local part detection, and out-of-distribution generalization, with more reliable concept predictions and visualization heatmaps. Conclusion: Through fine-grained alignment and feature disentanglement, DOT-CBM improves the reliability and interpretability of CBMs and offers a new route to transparent decision-making. Abstract: Concept Bottleneck Models (CBMs) try to make the decision-making process transparent by exploring an intermediate concept space between the input image and the output prediction. Existing CBMs just learn coarse-grained relations between the whole image and the concepts, giving little consideration to local image information, leading to two main drawbacks: i) they often produce spurious visual-concept relations, hence decreasing model reliability; and ii) though CBMs could explain the importance of every concept to the final prediction, it is still challenging to tell which visual region produces the prediction. To solve these problems, this paper proposes a Disentangled Optimal Transport CBM (DOT-CBM) framework to explore fine-grained visual-concept relations between local image patches and concepts. Specifically, we model the concept prediction process as a transportation problem between the patches and concepts, thereby achieving explicit fine-grained feature alignment. We also incorporate orthogonal projection losses within the modality to enhance local feature disentanglement. To further address the shortcut issues caused by statistical biases in the data, we utilize the visual saliency map and concept label statistics as transportation priors. Thus, DOT-CBM can visualize inversion heatmaps, provide more reliable concept predictions, and produce more accurate class predictions. Comprehensive experiments demonstrate that our proposed DOT-CBM achieves SOTA performance on several tasks, including image classification, local part detection and out-of-distribution generalization.

[79] Language-Driven Dual Style Mixing for Single-Domain Generalized Object Detection

Hongda Qin,Xiao Lu,Zhiyong Wei,Yihong Cao,Kailun Yang,Ningjiang Chen

Main category: cs.CV

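TL;DR: Proposes Language-Driven Dual Style Mixing (LDDS) for single-domain generalization, which diversifies the source domain using the semantic information of a vision-language model (VLM) and works with a range of detector frameworks.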

Details

Motivation: Existing methods depend on detectors with a specific structure or on particular augmentation choices, which limits framework flexibility; LDDS aims to provide model-agnostic feature augmentation. Method: Style-diversified images are generated from the VLM's semantic information, followed by image-level and feature-level style mixing that supports mainstream detector frameworks. Result: The approach is validated on multiple benchmark datasets, including real-to-cartoon and normal-to-adverse-weather tasks. Conclusion: Through semantics-driven style mixing, LDDS achieves model-agnostic domain generalization and improves detector robustness. Abstract: Generalizing an object detector trained on a single domain to multiple unseen domains is a challenging task. Existing methods typically introduce image or feature augmentation to diversify the source domain to raise the robustness of the detector. Vision-Language Model (VLM)-based augmentation techniques have been proven to be effective, but they require that the detector's backbone has the same structure as the image encoder of the VLM, limiting the detector framework selection. To address this problem, we propose Language-Driven Dual Style Mixing (LDDS) for single-domain generalization, which diversifies the source domain by fully utilizing the semantic information of the VLM. Specifically, we first construct prompts to transfer style semantics embedded in the VLM to an image translation network. This facilitates the generation of style-diversified images with explicit semantic information. Then, we propose image-level style mixing between the diversified images and source domain images. This effectively mines the semantic information for image augmentation without relying on specific augmentation selections. Finally, we propose feature-level style mixing in a double-pipeline manner, making feature augmentation model-agnostic and able to work seamlessly with the mainstream detector frameworks, including one-stage, two-stage, and transformer-based detectors. Extensive experiments demonstrate the effectiveness of our approach across various benchmark datasets, including real to cartoon and normal to adverse weather tasks. The source code and pre-trained models will be publicly available at https://github.com/qinhongda8/LDDS.

[80] When Dance Video Archives Challenge Computer Vision

Philippe Colantoni,Rafique Ahmed,Prashant Ghimire,Damien Muselet,Alain Trémeau

Main category: cs.CV

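TL;DR: Proposes a 3D human body pose estimation pipeline that combines up-to-date techniques for dance video analysis, and experimentally examines how data parameters affect pose estimation.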

Details

Motivation: The particularities of dance videos challenge pose estimation techniques and call for more effective solutions. Method: A 3D human body pose estimation pipeline is built from up-to-date techniques and methods not previously used in dance analysis, then tested and evaluated on dance video archives. Result: The experimental results are publicly available and show the impact of data parameters on human body pose estimation. Conclusion: The proposed approach provides a useful tool for pose estimation in dance videos, with data released for research. Abstract: The accuracy and efficiency of human body pose estimation depend on the quality of the data to be processed and on the particularities of these data. To demonstrate how dance videos can challenge pose estimation techniques, we proposed a new 3D human body pose estimation pipeline which combined up-to-date techniques and methods that had not yet been used in dance analysis. Second, we performed tests and extensive experiments on dance video archives, and used visual analytic tools to evaluate the impact of several data parameters on human body pose. Our results are publicly available for research at https://www.couleur.org/articles/arXiv-1-2025/

[81] Incomplete In-context Learning

Wenqiang Wang,Yangshijie Zhang

Main category: cs.CV

TL;DR: Proposes IJIP, a two-stage framework for in-context learning with vision-language models when the retrieval database is incomplete.

Details

Motivation: In real-world scenarios the retrieval database may contain labeled samples for only a subset of classes, which breaks conventional methods. Method: IJIP uses two stages, Iterative Judgments and Integrated Prediction: it converts the multi-class problem into multiple binary classification tasks, then uses the input image together with the first-stage predictions to improve classification accuracy. Result: Across two LVLMs and two datasets, IJIP performs well under three conditions of label incompleteness, reaching a highest accuracy of 93.9%. Conclusion: IJIP is effective under incomplete labels, also outperforms the baselines when labels are complete, and applies to prompt learning and the text domain. Abstract: Large vision language models (LVLMs) achieve remarkable performance through Vision In-context Learning (VICL), a process that depends significantly on demonstrations retrieved from an extensive collection of annotated examples (retrieval database). Existing studies often assume that the retrieval database contains annotated examples for all labels. However, in real-world scenarios, delays in database updates or incomplete data annotation may result in the retrieval database containing labeled samples for only a subset of classes. We refer to this phenomenon as an incomplete retrieval database and define the in-context learning under this condition as Incomplete In-context Learning (IICL). To address this challenge, we propose Iterative Judgments and Integrated Prediction (IJIP), a two-stage framework designed to mitigate the limitations of IICL. The Iterative Judgments Stage reformulates an m-class classification problem into a series of m binary classification tasks, effectively converting the IICL setting into a standard VICL scenario. The Integrated Prediction Stage further refines the classification process by leveraging both the input image and the predictions from the Iterative Judgments Stage to enhance overall classification accuracy. IJIP demonstrates considerable performance across two LVLMs and two datasets under three distinct conditions of label incompleteness, achieving the highest accuracy of 93.9%. Notably, even in scenarios where labels are fully available, IJIP still achieves the best performance of all six baselines. Furthermore, IJIP can be directly applied to Prompt Learning and is adaptable to the text domain.

[82] Towards Accurate State Estimation: Kalman Filter Incorporating Motion Dynamics for 3D Multi-Object Tracking

Mohamed Nagy,Naoufel Werghi,Bilal Hassan,Jorge Dias,Majid Khonji

Main category: cs.CV

TL;DR: Proposes an improved Kalman filter that adapts its motion model dynamically, improving the accuracy and performance of 3D multi-object tracking.

Details

Motivation: Existing Kalman filters for 3D multi-object tracking use fixed motion models that cannot adapt to complex motion dynamics, leading to trajectory division and imprecise localization. Method: A novel Kalman filter formulation that adaptively adjusts the motion model according to the object's motion dynamics. Result: On the KITTI and Waymo datasets, tracking performance surpasses recent benchmarks, with margins of 0.56% in HOTA and 0.81% in MOTA, respectively. Conclusion: The improved Kalman filter runs in real time, handles long occlusions well, and is suitable for practical applications. Abstract: This work addresses the critical lack of precision in state estimation in the Kalman filter for 3D multi-object tracking (MOT) and the ongoing challenge of selecting the appropriate motion model. Existing literature commonly relies on constant motion models for estimating the states of objects, neglecting the complex motion dynamics unique to each object. Consequently, trajectory division and imprecise object localization arise, especially under occlusion conditions. The core of these challenges lies in the limitations of the current Kalman filter formulation, which fails to account for the variability of motion dynamics as objects navigate their environments. This work introduces a novel formulation of the Kalman filter that incorporates motion dynamics, allowing the motion model to adaptively adjust according to changes in the object's movement. The proposed Kalman filter substantially improves state estimation, localization, and trajectory prediction compared to the traditional Kalman filter. This is reflected in tracking performance that surpasses recent benchmarks on the KITTI and Waymo Open Datasets, with margins of 0.56% and 0.81% in higher order tracking accuracy (HOTA) and multi-object tracking accuracy (MOTA), respectively. Furthermore, the proposed Kalman filter consistently outperforms the baseline across various detectors. Additionally, it shows an enhanced capability in managing long occlusions compared to the baseline Kalman filter, achieving margins of 1.22% in higher order tracking accuracy (HOTA) and 1.55% in multi-object tracking accuracy (MOTA) on the KITTI dataset. The formulation's efficiency is evident, with an additional processing time of only approximately 0.078 ms per frame, ensuring its applicability in real-time applications.
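For context, the constant motion model that this abstract describes as the common baseline can be sketched as a one-dimensional constant-velocity Kalman filter. This is the textbook baseline, not the paper's adaptive formulation, which the abstract does not spell out. The function name `kf_step`, the noise values `q` and `r`, and the simplified process noise `Q = q * I` are assumptions for illustration.

```python
def kf_step(x, P, z, dt=0.1, q=1e-2, r=1e-1):
    """One predict/update cycle of a 1D constant-velocity Kalman filter.

    x = [position, velocity]; P = 2x2 covariance as nested lists;
    z = measured position. Noise levels q and r are illustrative assumptions.
    """
    # Predict the state: x' = F x with F = [[1, dt], [0, 1]].
    px = x[0] + dt * x[1]
    vx = x[1]
    # Predict the covariance: P' = F P F^T + Q, with Q = q * I (simplified).
    p00 = P[0][0] + dt * (P[0][1] + P[1][0]) + dt * dt * P[1][1] + q
    p01 = P[0][1] + dt * P[1][1]
    p10 = P[1][0] + dt * P[1][1]
    p11 = P[1][1] + q
    # Update with H = [1, 0]: innovation y, its variance s, and gain k.
    y = z - px
    s = p00 + r
    k0, k1 = p00 / s, p10 / s
    x_new = [px + k0 * y, vx + k1 * y]
    # Posterior covariance: P = (I - K H) P'.
    P_new = [[(1 - k0) * p00, (1 - k0) * p01],
             [p10 - k1 * p00, p11 - k1 * p01]]
    return x_new, P_new

# Toy usage: feed noiseless positions of an object moving at 1 unit/s.
x, P = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
for t in range(1, 301):
    x, P = kf_step(x, P, z=t * 0.1)
```

Because the transition matrix `F` is fixed here, the filter cannot follow an object that accelerates or turns, which is the limitation the abstract attributes to constant motion models.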

[83] Synthetic Similarity Search in Automotive Production

Christoph Huber,Ludwig Schleeh,Dino Knoll,Michael Guthe

Main category: cs.CV

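TL;DR: Proposes an image classification method that combines similarity search with synthetic data, reducing the need for large annotated datasets, and validates it in real production environments.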

Details

Motivation: Computer vision is widely used for quality inspection in automotive production, but it requires large annotated datasets that are costly and time-consuming to collect. Method: A DINOv2 model converts images into feature vectors, which are compared with synthetic reference images using cosine distance to perform classification. Result: The method is evaluated in eight real-world inspection scenarios and meets the performance requirements of production environments. Conclusion: Replacing real reference data with synthetic data substantially reduces data requirements while maintaining high classification accuracy. Abstract: Visual quality inspection in automotive production is essential for ensuring the safety and reliability of vehicles. Computer vision (CV) has become a popular solution for these inspections due to its cost-effectiveness and reliability. However, CV models require large, annotated datasets, which are costly and time-consuming to collect. To reduce the need for extensive training data, we propose a novel image classification pipeline that combines similarity search using a vision-based foundation model with synthetic data. Our approach leverages a DINOv2 model to transform input images into feature vectors, which are then compared to pre-classified reference images using cosine distance measurements. By utilizing synthetic data instead of real images as references, our pipeline achieves high classification accuracy without relying on real data. We evaluate this approach in eight real-world inspection scenarios and demonstrate that it meets the high performance requirements of production environments.
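The comparison step in this abstract, matching a feature vector against pre-classified references by cosine distance, can be sketched as a nearest-reference classifier. The sketch assumes the feature vectors have already been extracted, and uses toy three-dimensional vectors in place of real DINOv2 embeddings. The names `cosine_distance` and `classify` and the labels are hypothetical, and the paper's actual decision rule may differ.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def classify(query, references):
    """Return the label of the reference closest to `query`.

    `references` is a list of (feature_vector, label) pairs; in the
    abstract's setting these would come from synthetic reference images.
    """
    nearest = min(references, key=lambda ref: cosine_distance(query, ref[0]))
    return nearest[1]

# Toy usage with made-up vectors and labels.
refs = [([1.0, 0.0, 0.0], "ok"), ([0.0, 1.0, 0.0], "defect")]
label = classify([0.9, 0.1, 0.0], refs)
```

Since no classifier is trained, a new inspection case only needs new reference vectors, which is consistent with the reduced need for training data that the abstract describes.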

[84] Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning

Xiaokun Wang,Chris,Jiangbo Pei,Wei Shen,Yi Peng,Yunzhuo Hao,Weijie Qiu,Ai Jian,Tianyidan Xie,Xuchen Song,Yang Liu,Yahui Zhou

Main category: cs.CV

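TL;DR: Skywork-VL Reward is a multimodal reward model that provides reward signals for multimodal understanding and reasoning tasks, built on a large-scale preference dataset and a Qwen2.5-VL-7B-Instruct-based architecture.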

Details

Motivation: To develop a general-purpose, reliable multimodal reward model that improves multimodal alignment. Method: A large-scale multimodal preference dataset is constructed, and a reward model architecture based on Qwen2.5-VL-7B-Instruct is designed, trained with multi-stage fine-tuning and a pairwise ranking loss. Result: State-of-the-art results on VL-RewardBench and competitive performance on RewardBench; preference data built with the model is highly effective for MPO training. Conclusion: Skywork-VL Reward is a significant step for multimodal alignment, and the model has been publicly released to promote transparency and reproducibility. Abstract: We propose Skywork-VL Reward, a multimodal reward model that provides reward signals for both multimodal understanding and reasoning tasks. Our technical approach comprises two key components: First, we construct a large-scale multimodal preference dataset that covers a wide range of tasks and scenarios, with responses collected from both standard vision-language models (VLMs) and advanced VLM reasoners. Second, we design a reward model architecture based on Qwen2.5-VL-7B-Instruct, integrating a reward head and applying multi-stage fine-tuning using pairwise ranking loss on pairwise preference data. Experimental evaluations show that Skywork-VL Reward achieves state-of-the-art results on multimodal VL-RewardBench and exhibits competitive performance on the text-only RewardBench benchmark. Furthermore, preference data constructed based on our Skywork-VL Reward proves highly effective for training Mixed Preference Optimization (MPO), leading to significant improvements in multimodal reasoning capabilities. Our results underscore Skywork-VL Reward as a significant advancement toward general-purpose, reliable reward models for multimodal alignment. Our model has been publicly released to promote transparency and reproducibility.
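The abstract mentions a pairwise ranking loss on preference pairs without giving its form. One widely used choice, assumed here purely for illustration, is the Bradley-Terry style loss -log(sigmoid(r_chosen - r_rejected)), where the two rewards are the scalar outputs of the reward head for the preferred and the rejected response. The function name below is hypothetical.

```python
import math

def pairwise_ranking_loss(r_chosen, r_rejected):
    """Compute -log(sigmoid(r_chosen - r_rejected)) in a numerically stable way.

    Assumed loss form; the abstract does not specify the exact formula.
    """
    d = r_chosen - r_rejected
    # -log(sigmoid(d)) = log(1 + exp(-d)), rewritten to avoid overflow.
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))

# The loss shrinks as the chosen response is scored further above the rejected one.
well_ranked = pairwise_ranking_loss(2.0, 0.0)
badly_ranked = pairwise_ranking_loss(0.0, 2.0)
```

The loss depends only on the difference between the two rewards, so it teaches the model to order responses rather than to assign them absolute scores.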

[85] L-SWAG: Layer-Sample Wise Activation with Gradients information for Zero-Shot NAS on Vision Transformers

Sofia Casarin,Sergio Escalera,Oswald Lanz

Main category: cs.CV

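TL;DR: Proposes a training-free neural architecture search (NAS) approach that uses zero-cost (ZC) proxies to identify high-performing networks efficiently, and extends ZC proxies to Vision Transformers (ViTs).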

Details

Motivation: With the rise of large language models in deep learning, ZC proxies need to be extended to ViTs, beyond the convolutional search spaces to which current proxies are typically limited. Method: Introduces the L-SWAG metric and the LIBRA-NAS method, which combines different proxies to improve the NAS search. Result: LIBRA-NAS finds an architecture with a 17.0% test error on ImageNet1k in just 0.1 GPU days. Conclusion: The approach outperforms evolution-based and gradient-based NAS techniques in both time and performance, pointing to a new direction for NAS research. Abstract: Training-free Neural Architecture Search (NAS) efficiently identifies high-performing neural networks using zero-cost (ZC) proxies. Unlike multi-shot and one-shot NAS approaches, ZC-NAS is both (i) time-efficient, eliminating the need for model training, and (ii) interpretable, with proxy designs often theoretically grounded. Despite rapid developments in the field, current SOTA ZC proxies are typically constrained to well-established convolutional search spaces. With the rise of Large Language Models shaping the future of deep learning, this work extends ZC proxy applicability to Vision Transformers (ViTs). We present a new benchmark using the Autoformer search space evaluated on 6 distinct tasks and propose Layer-Sample Wise Activation with Gradients information (L-SWAG), a novel, generalizable metric that characterizes both convolutional and transformer architectures across 14 tasks. Additionally, previous works highlighted how different proxies contain complementary information, motivating the need for an ML model to identify useful combinations. To further enhance ZC-NAS, we therefore introduce LIBRA-NAS (Low Information gain and Bias Re-Alignment), a method that strategically combines proxies to best represent a specific benchmark. Integrated into the NAS search, LIBRA-NAS outperforms evolution and gradient-based NAS techniques by identifying an architecture with a 17.0% test error on ImageNet1k in just 0.1 GPU days.

[86] Human Motion Prediction via Test-domain-aware Adaptation with Easily-available Human Motions Estimated from Videos

Katsuki Shimbo,Hiromu Taketsugu,Norimichi Ukita

Main category: cs.CV

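TL;DR: The paper proposes enhancing 3D human motion prediction (HMP) with 2D poses estimated from videos, addressing the poor generalization of conventional methods that depend on expensive motion capture data.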

Details

Motivation: Conventional HMP methods rely on expensive motion capture data, which limits data diversity and leads to poor generalization. Method: 2D poses estimated from monocular videos are converted into 3D motion data through a dedicated pipeline and used for additional training of the HMP model. Result: Experiments show that the method improves HMP model performance both quantitatively and qualitatively. Conclusion: Enhancing HMP models with 2D poses estimated from videos significantly improves their generalization. Abstract: In 3D Human Motion Prediction (HMP), conventional methods train HMP models with expensive motion capture data. However, the data collection cost of such motion capture data limits the data diversity, which leads to poor generalizability to unseen motions or subjects. To address this issue, this paper proposes to enhance HMP with additional learning using estimated poses from easily available videos. The 2D poses estimated from the monocular videos are carefully transformed into motion capture-style 3D motions through our pipeline. By additional learning with the obtained motions, the HMP model is adapted to the test domain. The experimental results demonstrate the quantitative and qualitative impact of our method.

[87] Through the Looking Glass: Common Sense Consistency Evaluation of Weird Images

Elisei Rykov,Kseniia Petrushina,Kseniia Titova,Anton Razzhigaev,Alexander Panchenko,Vasily Konovalov

Main category: cs.CV

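TL;DR: The paper proposes a new method called TLG that uses large vision-language models and a Transformer encoder to assess image common-sense consistency, achieving state-of-the-art performance on multiple datasets.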

Details

Motivation: Measuring the common-sense consistency of real images is a complex task in AI research, and existing methods struggle with images that violate common sense. Method: Atomic facts are extracted from images with a large vision-language model, and a compact attention-pooling classifier is fine-tuned on them. Result: TLG sets a new state of the art on the WHOOPS! and WEIRD datasets. Conclusion: By combining large models with a compact classifier, TLG effectively improves the assessment of image common-sense consistency. Abstract: Measuring how real images look is a complex task in artificial intelligence research. For example, an image of a boy with a vacuum cleaner in a desert violates common sense. We introduce a novel method, which we call Through the Looking Glass (TLG), to assess image common sense consistency using Large Vision-Language Models (LVLMs) and a Transformer-based encoder. By leveraging LVLMs to extract atomic facts from these images, we obtain a mix of accurate facts. We proceed by fine-tuning a compact attention-pooling classifier over encoded atomic facts. Our TLG has achieved a new state-of-the-art performance on the WHOOPS! and WEIRD datasets while leveraging a compact fine-tuning component.

[88] Enabling Privacy-Aware AI-Based Ergonomic Analysis

Sander De Coninck,Emilio Gamba,Bart Van Doninck,Abdellatif Bey-Temsamani,Sam Leroux,Pieter Simoens

Main category: cs.CV

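TL;DR: Proposes a machine-learning-based, privacy-preserving ergonomic assessment framework that uses adversarial training to produce a lightweight neural network which obfuscates video data to protect privacy while maintaining accurate human pose estimation.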

Details

Motivation: Musculoskeletal disorders (MSDs) cause serious economic losses and reduced productivity in manufacturing. Ergonomic assessment can improve working posture, but existing camera-based systems raise privacy concerns. Method: Adversarial training is used to develop a lightweight neural network that obfuscates video data while preserving the information needed for pose estimation; 3D keypoints obtained through multi-view integration are evaluated with the REBA method. Result: The system achieves accurate ergonomic monitoring while protecting privacy and is suitable for industrial environments. Conclusion: The framework offers a secure, effective solution for ergonomic monitoring in industrial settings, addressing both privacy and workplace safety. Abstract: Musculoskeletal disorders (MSDs) are a leading cause of injury and productivity loss in the manufacturing industry, incurring substantial economic costs. Ergonomic assessments can mitigate these risks by identifying workplace adjustments that improve posture and reduce strain. Camera-based systems offer a non-intrusive, cost-effective method for continuous ergonomic tracking, but they also raise significant privacy concerns. To address this, we propose a privacy-aware ergonomic assessment framework utilizing machine learning techniques. Our approach employs adversarial training to develop a lightweight neural network that obfuscates video data, preserving only the essential information needed for human pose estimation. This obfuscation ensures compatibility with standard pose estimation algorithms, maintaining high accuracy while protecting privacy. The obfuscated video data is transmitted to a central server, where state-of-the-art keypoint detection algorithms extract body landmarks. Using multi-view integration, 3D keypoints are reconstructed and evaluated with the Rapid Entire Body Assessment (REBA) method. Our system provides a secure, effective solution for ergonomic monitoring in industrial environments, addressing both privacy and workplace safety concerns.

[89] RealRep: Generalized SDR-to-HDR Conversion with Style Disentangled Representation Learning

Gang He,Siqi Wang,Kepeng Xu,Lin Zhang

Main category: cs.CV

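TL;DR: Proposes RealRep, a generalized SDR-to-HDR method that disentangles luminance and chrominance to handle diverse SDR content styles, combined with the DDACMNet framework for adaptive mapping.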

Details

Motivation: Existing methods rely on fixed tone mapping and struggle with the diverse SDR input styles found in real-world scenarios. Method: Disentangled multi-view style representation learning captures priors of the true luminance and chrominance distributions, and the DDACMNet framework performs adaptive hierarchical mapping. Result: In experiments, RealRep outperforms existing methods, with better generalization and perceptually faithful HDR color gamut reconstruction. Conclusion: Through disentangled representation learning and adaptive mapping, RealRep effectively addresses the conversion of diverse SDR content to HDR. Abstract: High-Dynamic-Range Wide-Color-Gamut (HDR-WCG) technology is becoming increasingly prevalent, intensifying the demand for converting Standard Dynamic Range (SDR) content to HDR. Existing methods primarily rely on fixed tone mapping operators, which are inadequate for handling SDR inputs with diverse styles commonly found in real-world scenarios. To address this challenge, we propose a generalized SDR-to-HDR method that handles diverse styles in real-world SDR content, termed Realistic Style Disentangled Representation Learning (RealRep). By disentangling luminance and chrominance, we analyze the intrinsic differences between contents with varying styles and propose a disentangled multi-view style representation learning method. This approach captures the guidance prior of true luminance and chrominance distributions across different styles, even when the SDR style distributions exhibit significant variations, thereby establishing a robust embedding space for inverse tone mapping. Motivated by the difficulty of directly utilizing degradation representation priors, we further introduce the Degradation-Domain Aware Controlled Mapping Network (DDACMNet), a two-stage framework that performs adaptive hierarchical mapping guided by a control-aware normalization mechanism. DDACMNet dynamically modulates the mapping process via degradation-conditioned hierarchical features, enabling robust adaptation across diverse degradation domains. Extensive experiments show that RealRep consistently outperforms state-of-the-art methods with superior generalization and perceptually faithful HDR color gamut reconstruction.

[90] Link to the Past: Temporal Propagation for Fast 3D Human Reconstruction from Monocular Video

Matthew Marchellus,Nadhira Noor,In Kyu Park

Main category: cs.CV

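TL;DR: TemPoFast3D is a fast 3D clothed human reconstruction method that exploits temporal coherency to reduce redundant computation, enabling high-quality real-time reconstruction.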

Details

Motivation: To address the imbalance between computational efficiency and reconstruction quality in existing methods, particularly for real-time applications. Method: Exploits temporal coherency, maintaining and refining a canonical appearance representation through efficient coordinate mapping so that continuous video streams can be processed. Result: Matches or exceeds existing methods on standard metrics, with a maximum speed of 12 FPS and high-quality textured reconstruction across diverse poses and appearances. Conclusion: TemPoFast3D is an efficient, high-quality solution for real-time 3D clothed human reconstruction. Abstract: Fast 3D clothed human reconstruction from monocular video remains a significant challenge in computer vision, particularly in balancing computational efficiency with reconstruction quality. Current approaches are either focused on static image reconstruction but too computationally intensive, or achieve high quality through per-video optimization that requires minutes to hours of processing, making them unsuitable for real-time applications. To this end, we present TemPoFast3D, a novel method that leverages temporal coherency of human appearance to reduce redundant computation while maintaining reconstruction quality. Our approach is a "plug-and-play" solution that uniquely transforms pixel-aligned reconstruction networks to handle continuous video streams by maintaining and refining a canonical appearance representation through efficient coordinate mapping. Extensive experiments demonstrate that TemPoFast3D matches or exceeds state-of-the-art methods across standard metrics while providing high-quality textured reconstruction across diverse pose and appearance, with a maximum speed of 12 FPS.

[91] SAEN-BGS: Energy-Efficient Spiking AutoEncoder Network for Background Subtraction

Zhixuan Zhang,Xiaopeng Li,Qi Liu

Main category: cs.CV

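TL;DR: Proposes SAEN-BGS, a noise-robust background subtraction method based on spiking neural networks that uses self-distillation learning to improve energy efficiency and performs well on complex dynamic backgrounds.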

Details

Motivation: Existing deep learning background subtraction methods handle video noise (such as lighting changes and camera shake) poorly and need improvement. Method: Designs the spiking-neural-network-based SAEN-BGS model, which includes a continuous spiking conv-and-dconv block and self-distillation supervised learning. Result: Outperforms baseline methods on the CDnet-2014 and DAVIS-2016 datasets, especially in dynamic background scenes. Conclusion: By exploiting the noise robustness and time-sequence sensitivity of spiking neural networks, SAEN-BGS improves background subtraction performance while reducing energy consumption. Abstract: Background subtraction (BGS) is utilized to detect moving objects in a video and is commonly employed at the onset of object tracking and human recognition processes. Nevertheless, existing BGS techniques utilizing deep learning still encounter challenges with various background noises in videos, including variations in lighting, shifts in camera angles, and disturbances like air turbulence or swaying trees. To address this problem, we design a spiking autoencoder network, termed SAEN-BGS, based on noise resilience and time-sequence sensitivity of spiking neural networks (SNNs) to enhance the separation of foreground and background. To eliminate unnecessary background noise and preserve the important foreground elements, we begin by creating the continuous spiking conv-and-dconv block, which serves as the fundamental building block for the decoder in SAEN-BGS. Moreover, in striving for enhanced energy efficiency, we introduce a novel self-distillation spiking supervised learning method grounded in ANN-to-SNN frameworks, resulting in decreased power consumption. In extensive experiments conducted on CDnet-2014 and DAVIS-2016 datasets, our approach demonstrates superior segmentation performance relative to other baseline methods, even when challenged by complex scenarios with dynamic backgrounds.

[92] Generative Pre-trained Autoregressive Diffusion Transformer

Yuan Zhang,Jiacheng Jiang,Guoqing Ma,Zhiying Lu,Haoyang Huang,Jianlong Yuan,Nan Duan

Main category: cs.CV

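TL;DR: GPDiT is a generative pre-trained model that combines the strengths of diffusion and autoregressive modeling for long-range video synthesis in a continuous latent space, predicting future latent frames with a diffusion loss to improve generation quality and representation ability.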

Details

Motivation: To unify the strengths of diffusion and autoregressive modeling in order to handle motion dynamics and semantic consistency in long-range video synthesis. Method: Autoregressively predicts future latent frames using a diffusion loss, and introduces a lightweight causal attention variant and a parameter-free rotation-based time-conditioning mechanism. Result: Performs strongly in video generation quality, representation ability, and few-shot learning tasks. Conclusion: GPDiT is an effective framework for video modeling in continuous space, offering both high-quality generation and representation capabilities. Abstract: In this work, we present GPDiT, a Generative Pre-trained Autoregressive Diffusion Transformer that unifies the strengths of diffusion and autoregressive modeling for long-range video synthesis, within a continuous latent space. Instead of predicting discrete tokens, GPDiT autoregressively predicts future latent frames using a diffusion loss, enabling natural modeling of motion dynamics and semantic consistency across frames. This continuous autoregressive framework not only enhances generation quality but also endows the model with representation capabilities. Additionally, we introduce a lightweight causal attention variant and a parameter-free rotation-based time-conditioning mechanism, improving both the training and inference efficiency. Extensive experiments demonstrate that GPDiT achieves strong performance in video generation quality, video representation ability, and few-shot learning tasks, highlighting its potential as an effective framework for video modeling in continuous space.

Jiewen Yang,Taoran Huang,Shangwei Ding,Xiaowei Xu,Qinhua Zhao,Yong Jiang,Jiarong Guo,Bin Pu,Jiexuan Zheng,Caojin Zhang,Hongwen Fei,Xiaomeng Li

Main category: cs.CV

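TL;DR: MePH is a multi-view, multi-modal vision-language model that accurately assesses pulmonary hypertension progression from non-invasive echocardiography, significantly outperforming conventional echocardiographic assessment.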

Details

Motivation: Conventional methods such as right heart catheterization (RHC) are precise but invasive and unsuitable for routine use, so a non-invasive yet accurate alternative is urgently needed. Method: MePH models the correlation between multi-view, multi-modal echocardiography data (videos and spectral images) and RHC data, using a large dataset of 1,237 patient cases. Result: MePH reduces the error in estimating mean pulmonary arterial pressure (mPAP) and pulmonary vascular resistance (PVR) by 49.73% and 43.81%, respectively, and performs well in external validation. Conclusion: MePH provides an efficient and accurate tool for non-invasive monitoring of pulmonary hypertension, supporting early intervention and personalized treatment. Abstract: Echocardiographers can detect pulmonary hypertension using Doppler echocardiography; however, accurately assessing its progression often proves challenging. Right heart catheterization (RHC), the gold standard for precise evaluation, is invasive and unsuitable for routine use, limiting its practicality for timely diagnosis and monitoring of pulmonary hypertension progression. Here, we propose MePH, a multi-view, multi-modal vision-language model to accurately assess pulmonary hypertension progression using non-invasive echocardiography. We constructed a large dataset comprising paired standardized echocardiogram videos, spectral images and RHC data, covering 1,237 patient cases from 12 medical centers. For the first time, MePH precisely models the correlation between non-invasive multi-view, multi-modal echocardiography and the pressure and resistance obtained via RHC. We show that MePH significantly outperforms echocardiographers' assessments using echocardiography, reducing the mean absolute error in estimating mean pulmonary arterial pressure (mPAP) and pulmonary vascular resistance (PVR) by 49.73% and 43.81%, respectively. In eight independent external hospitals, MePH achieved a mean absolute error of 3.147 for PVR assessment. Furthermore, MePH achieved an area under the curve of 0.921, surpassing echocardiographers (area under the curve of 0.842) in accurately predicting the severity of pulmonary hypertension, whether mild or severe. A prospective study demonstrated that MePH can predict treatment efficacy for patients. Our work provides pulmonary hypertension patients with a non-invasive and timely method for monitoring disease progression, improving the accuracy and efficiency of pulmonary hypertension management while enabling earlier interventions and more personalized treatment decisions.

[94] Geometric Prior-Guided Neural Implicit Surface Reconstruction in the Wild

Lintao Xiang,Hongpei Zheng,Bailin Deng,Hujun Yin

Main category: cs.CV

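TL;DR: Proposes a new method that improves implicit surface reconstruction with multiple geometric constraints, enabling accurate 3D reconstruction in uncontrolled environments.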

Details

Motivation: Existing methods work well for scenes with consistent illumination but reconstruct poorly in uncontrolled environments (for example with transient occlusions or appearance changes), and NeRF-based methods prioritize novel view synthesis over precise surface reconstruction. Method: Combines sparse 3D points from SfM with normal priors (from a normal predictor, refined with multi-view consistency constraints) to improve implicit surface optimization. Result: Validated on Heritage-Recon and other datasets, the method reconstructs surfaces from in-the-wild images more accurately, with geometric accuracy and detail superior to existing techniques. Conclusion: The method enables high-quality 3D reconstruction and applies to diverse scenarios such as the digital preservation of cultural heritage. Abstract: Neural implicit surface reconstruction using volume rendering techniques has recently achieved significant advancements in creating high-fidelity surfaces from multiple 2D images. However, current methods primarily target scenes with consistent illumination and struggle to accurately reconstruct 3D geometry in uncontrolled environments with transient occlusions or varying appearances. While some neural radiance field (NeRF)-based variants can better manage photometric variations and transient objects in complex scenes, they are designed for novel view synthesis rather than precise surface reconstruction due to limited surface constraints. To overcome this limitation, we introduce a novel approach that applies multiple geometric constraints to the implicit surface optimization process, enabling more accurate reconstructions from unconstrained image collections. First, we utilize sparse 3D points from structure-from-motion (SfM) to refine the signed distance function estimation for the reconstructed surface, with a displacement compensation to accommodate noise in the sparse points. Additionally, we employ robust normal priors derived from a normal predictor, enhanced by edge prior filtering and multi-view consistency constraints, to improve alignment with the actual surface geometry. Extensive testing on the Heritage-Recon benchmark and other datasets has shown that the proposed method can accurately reconstruct surfaces from in-the-wild images, yielding geometries with superior accuracy and granularity compared to existing techniques. Our approach enables high-quality 3D reconstruction of various landmarks, making it applicable to diverse scenarios such as digital preservation of cultural heritage sites.

[95] Boosting Global-Local Feature Matching via Anomaly Synthesis for Multi-Class Point Cloud Anomaly Detection

Yuqi Cheng,Yunkang Cao,Dongfang Wang,Weiming Shen,Wenlong Li

Main category: cs.CV

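TL;DR: GLFM is a multi-class point cloud anomaly detection method that uses global-local feature matching to progressively separate easily confused data, significantly improving detection performance.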

Details

Motivation: Single-class unsupervised methods incur high computation and storage costs in multi-class settings, while multi-class methods suffer from feature confusion; GLFM was developed to address both. Method: GLFM has three stages: anomaly synthesis, construction of global and local memory banks, and anomaly detection based on feature distance. Result: GLFM's superior performance is validated on multiple datasets. Conclusion: GLFM effectively addresses feature confusion in multi-class point cloud anomaly detection and has practical value. Abstract: Point cloud anomaly detection is essential for various industrial applications. The huge computation and storage costs caused by the increasing product classes limit the application of single-class unsupervised methods, necessitating the development of multi-class unsupervised methods. However, the feature similarity between normal and anomalous points from different class data leads to the feature confusion problem, which greatly hinders the performance of multi-class methods. Therefore, we introduce a multi-class point cloud anomaly detection method, named GLFM, leveraging global-local feature matching to progressively separate data that are prone to confusion across multiple classes. Specifically, GLFM is structured into three stages: Stage-I proposes an anomaly synthesis pipeline that stretches point clouds to create abundant anomaly data that are utilized to adapt the point cloud feature extractor for better feature representation. Stage-II establishes the global and local memory banks according to the global and local feature distributions of all the training data, weakening the impact of feature confusion on the establishment of the memory bank. Stage-III implements anomaly detection of test data leveraging its feature distance from global and local memory banks. Extensive experiments on the MVTec 3D-AD, Real3D-AD and actual industry parts dataset showcase our proposed GLFM's superior point cloud anomaly detection performance. The code is available at https://github.com/hustCYQ/GLFM-Multi-class-3DAD.
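
Stage-III scores test data by its feature distance to the memory banks. A minimal sketch of that idea, assuming nearest-neighbor Euclidean distance and a plain sum of the global and local scores; both choices, and all names below, are assumptions for illustration and may differ from the released GLFM code:

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def nearest_distance(feature: Vector, bank: List[Vector]) -> float:
    """Euclidean distance from a feature to its closest memory-bank entry."""
    return min(math.dist(feature, entry) for entry in bank)


def anomaly_score(
    local_feature: Vector,
    global_feature: Vector,
    local_bank: List[Vector],
    global_bank: List[Vector],
) -> float:
    """Combine local and global distances (summation is an assumed choice)."""
    return nearest_distance(local_feature, local_bank) + nearest_distance(
        global_feature, global_bank
    )


if __name__ == "__main__":
    # Toy 2-D features standing in for real point cloud descriptors.
    local_bank = [[0.0, 0.0], [1.0, 0.0]]
    global_bank = [[0.0, 1.0]]
    print(anomaly_score([0.1, 0.0], [0.0, 1.1], local_bank, global_bank))
    print(anomaly_score([5.0, 5.0], [4.0, 4.0], local_bank, global_bank))
```

Features far from every stored normal feature receive a high score; a threshold on that score would then separate normal from anomalous points.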

[96] Apple's Synthetic Defocus Noise Pattern: Characterization and Forensic Applications

David Vázquez-Padín,Fernando Pérez-González,Pablo Pérez-Miguélez

Main category: cs.CV

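TL;DR: The paper analyzes the synthetic defocus noise pattern (SDNP) in iPhone portrait-mode images, proposes a method for estimating it precisely, and explores its forensic applications, including image traceability and reducing PRNU false positives.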

Details

Motivation: The SDNP in iPhone portrait-mode images can interfere with blind forensic analyses, especially PRNU-based camera source verification, yet it remains underexplored. Method: Characterizes the SDNP in detail, proposes a precise estimation method, and models its dependence on scene brightness, ISO settings, and other factors. Result: The SDNP can be used for image traceability, and masking SDNP-affected regions in PRNU verification significantly reduces false positives. Conclusion: Characterizing and exploiting the SDNP improves the accuracy of camera source verification and advances existing techniques. Abstract: iPhone portrait-mode images contain a distinctive pattern in out-of-focus regions simulating the bokeh effect, which we term Apple's Synthetic Defocus Noise Pattern (SDNP). If overlooked, this pattern can interfere with blind forensic analyses, especially PRNU-based camera source verification, as noted in earlier works. Since Apple's SDNP remains underexplored, we provide a detailed characterization, proposing a method for its precise estimation, modeling its dependence on scene brightness, ISO settings, and other factors. Leveraging this characterization, we explore forensic applications of the SDNP, including traceability of portrait-mode images across iPhone models and iOS versions in open-set scenarios, assessing its robustness under post-processing. Furthermore, we show that masking SDNP-affected regions in PRNU-based camera source verification significantly reduces false positives, overcoming a critical limitation in camera attribution, and improving state-of-the-art techniques.

[97] Few-shot Semantic Encoding and Decoding for Video Surveillance

Baoping Cheng,Yukun Zhang,Liming Wang,Xiaoyan Xie,Tao Fu,Dongkun Wang,Xiaoming Tao

Main category: cs.CV

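TL;DR: Proposes a semantic encoding and decoding method for surveillance video that extracts and compresses sketches as semantic information and combines an image translation network with a few-shot decoding network, significantly reducing storage and transmission overhead.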

Details

Motivation: Conventional communication methods face bottlenecks in transmitting and storing surveillance video. Semantic communication is expected to overcome them, but existing methods need many training samples and are inefficient. Method: Extracts and compresses sketches as semantic information, proposes an image translation network that converts sketches into video frames, and designs a few-shot decoding network to reconstruct video. Result: Experiments show better video reconstruction performance than baseline methods; sketch compression effectively reduces storage and transmission overhead with little loss of video quality. Conclusion: The method achieves efficient semantic encoding and decoding with only a few training samples, improving the practicality of semantic communication systems. Abstract: With the continuous increase in the number and resolution of video surveillance cameras, the burden of transmitting and storing surveillance video is growing. Traditional communication methods based on Shannon's theory are facing optimization bottlenecks. Semantic communication, as an emerging communication method, is expected to break through this bottleneck and reduce the storage and transmission consumption of video. Existing semantic decoding methods often require many samples to train the neural network for each scene, which is time-consuming and labor-intensive. In this study, a semantic encoding and decoding method for surveillance video is proposed. First, the sketch was extracted as semantic information, and a sketch compression method was proposed to reduce the bit rate of semantic information. Then, an image translation network was proposed to translate the sketch into a video frame with a reference frame. Finally, a few-shot sketch decoding network was proposed to reconstruct video from sketch. Experimental results showed that the proposed method achieved significantly better video reconstruction performance than baseline methods. The sketch compression method could effectively reduce the storage and transmission consumption of semantic information with little compromise on video quality. The proposed method provides a novel semantic encoding and decoding method that only needs a few training samples for each surveillance scene, thus improving the practicality of the semantic communication system.

[98] Feature Visualization in 3D Convolutional Neural Networks

Chunpeng Li,Ya-tang Li

Main category: cs.CV

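TL;DR: Proposes a new visualization method for analyzing the texture and motion preferences of 3D convolutional kernels, using a two-stage optimization strategy to extract features and improve kernel interpretability.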

Details

Motivation: Existing methods struggle to visualize the high-dimensional, complex features of 3D convolutional kernels, producing results that are hard to interpret. Method: Uses a data-driven decomposition and a two-stage optimization strategy to extract texture and motion components separately. Result: The visualizations clearly show the dynamic patterns preferred by 3D kernels, especially the motion component. Conclusion: The method effectively improves the interpretability of 3D convolutional operations, and the code is open source. Abstract: Understanding the computations of convolutional neural networks requires effective visualization of their kernels. While maximal activation methods have proven successful in highlighting the preferred features of 2D convolutional kernels, directly applying these techniques to 3D convolutions often leads to uninterpretable results due to the higher dimensionality and complexity of 3D features. To address this challenge, we propose a novel visualization approach for 3D convolutional kernels that disentangles their texture and motion preferences. Our method begins with a data-driven decomposition of the optimal input that maximally activates a given kernel. We then introduce a two-stage optimization strategy to extract distinct texture and motion components from this input. Applying our approach to visualize kernels at various depths of several pre-trained models, we find that the resulting visualizations--particularly those capturing motion--clearly reveal the preferred dynamic patterns encoded by 3D kernels. These results demonstrate the effectiveness of our method in providing interpretable insights into 3D convolutional operations. Code is available at https://github.com/YatangLiLab/3DKernelVisualizer.

[99] TUM2TWIN: Introducing the Large-Scale Multimodal Urban Digital Twin Benchmark Dataset

Olaf Wysocki,Benedikt Schwab,Manoj Kumar Biswanath,Qilin Zhang,Jingwei Zhu,Thomas Froech,Medhini Heeramaglore,Ihab Hijazi,Khaoula Kanna,Mathias Pechinger,Zhaiyu Chen,Yao Sun,Alejandro Rueda Segura,Ziyang Xu,Omar AbdelGafar,Mansour Mehranfar,Chandan Yeshwanth,Yueh-Cheng Liu,Hadi Yazdi,Jiapan Wang,Stefan Auer,Katharina Anders,Klaus Bogenberger,Andre Borrmann,Angela Dai,Ludwig Hoegner,Christoph Holst,Thomas H. Kolbe,Ferdinand Ludwig,Matthias Nießner,Frank Petzold,Xiao Xiang Zhu,Boris Jutzi

Main category: cs.CV

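TL;DR: The paper introduces TUM2TWIN, the first comprehensive multimodal urban digital twin benchmark dataset, aimed at the multi-stage challenges of creating urban digital twins.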

Details

Motivation: Current datasets usually cover only one part of the processing chain, which hampers comprehensive validation of urban digital twins. Method: Introduces the TUM2TWIN dataset, which contains georeferenced, semantically aligned 3D models and networks together with various terrestrial, mobile, aerial, and satellite observations. Result: The dataset covers roughly 100,000 square meters with 32 data subsets and 767 GB of data, supporting sensor analysis and the development of advanced reconstruction methods. Conclusion: TUM2TWIN lays a foundation for overcoming current limitations in urban digital twin creation and advances data-driven smart city research. Abstract: Urban Digital Twins (UDTs) have become essential for managing cities and integrating complex, heterogeneous data from diverse sources. Creating UDTs involves challenges at multiple process stages, including acquiring accurate 3D source data, reconstructing high-fidelity 3D models, keeping models up to date, and ensuring seamless interoperability with downstream tasks. Current datasets are usually limited to one part of the processing chain, hampering comprehensive UDT validation. To address these challenges, we introduce the first comprehensive multimodal Urban Digital Twin benchmark dataset: TUM2TWIN. This dataset includes georeferenced, semantically aligned 3D models and networks along with various terrestrial, mobile, aerial, and satellite observations boasting 32 data subsets over roughly 100,000 $m^2$ and currently 767 GB of data. By ensuring georeferenced indoor-outdoor acquisition, high accuracy, and multimodal data integration, the benchmark supports robust analysis of sensors and the development of advanced reconstruction methods. Additionally, we explore downstream tasks demonstrating the potential of TUM2TWIN, including novel view synthesis of NeRF and Gaussian Splatting, solar potential analysis, point cloud semantic segmentation, and LoD3 building reconstruction. We are convinced this contribution lays a foundation for overcoming current limitations in UDT creation, fostering new research directions and practical solutions for smarter, data-driven urban environments. The project is available under: https://tum2t.win

[100] DepthFusion: Depth-Aware Hybrid Feature Fusion for LiDAR-Camera 3D Object Detection

Mingqian Ji,Jian Yang,Shanshan Zhang

Main category: cs.CV

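TL;DR: Proposes DepthFusion, a depth-aware multi-modal feature fusion method that uses depth encoding to adjust the weights of point cloud and RGB image features at both global and local levels, significantly improving 3D object detection performance.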

Details

Motivation: Existing LiDAR-camera 3D object detectors neglect depth when designing fusion strategies, although statistical analysis and visualization show that different modalities play different roles at different depths. Method: Proposes the DepthFusion strategy, consisting of a Depth-GFusion module (globally adjusts the weights of BEV features) and a Depth-LFusion module (locally adjusts the weights of original voxel features and multi-view image features). Result: Outperforms existing methods on the nuScenes and KITTI datasets and is more robust to various corruptions on the nuScenes-C dataset. Conclusion: DepthFusion's depth-aware fusion strategy significantly improves the performance and robustness of multi-modal 3D object detection. Abstract: State-of-the-art LiDAR-camera 3D object detectors usually focus on feature fusion. However, they neglect the factor of depth while designing the fusion strategy. In this work, we are the first to observe that different modalities play different roles as depth varies via statistical analysis and visualization. Based on this finding, we propose a Depth-Aware Hybrid Feature Fusion (DepthFusion) strategy that guides the weights of point cloud and RGB image modalities by introducing depth encoding at both global and local levels. Specifically, the Depth-GFusion module adaptively adjusts the weights of image Bird's-Eye-View (BEV) features in multi-modal global features via depth encoding. Furthermore, to compensate for the information lost when transferring raw features to the BEV space, we propose a Depth-LFusion module, which adaptively adjusts the weights of original voxel features and multi-view image features in multi-modal local features via depth encoding. Extensive experiments on the nuScenes and KITTI datasets demonstrate that our DepthFusion method surpasses previous state-of-the-art methods. Moreover, our DepthFusion is more robust to various kinds of corruptions, outperforming previous methods on the nuScenes-C dataset.

[101] Lightweight Multispectral Crop-Weed Segmentation for Precision Agriculture

Zeynep Galymzhankyzy,Eric Martinson

Main category: cs.CV

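TL;DR: Proposes a lightweight transformer-CNN hybrid model for efficient crop-weed segmentation that combines RGB, near-infrared, and red-edge bands, significantly improving accuracy and computational efficiency.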

Details

Motivation: Conventional CNN methods generalize poorly under complex field conditions and rely on RGB imagery, which limits performance. Method: A lightweight transformer-CNN hybrid model processes multi-band data with specialized encoders and dynamic modality integration. Result: On the WeedsGalore dataset, the model reaches a segmentation accuracy (mean IoU) of 78.88%, 15.8 percentage points higher than RGB-only models, with only 8.7 million parameters. Conclusion: The model combines high accuracy with computational efficiency, is suitable for real-time deployment on UAVs and edge devices, and advances precision weed management. Abstract: Efficient crop-weed segmentation is critical for site-specific weed control in precision agriculture. Conventional CNN-based methods struggle to generalize and rely on RGB imagery, limiting performance under complex field conditions. To address these challenges, we propose a lightweight transformer-CNN hybrid. It processes RGB, Near-Infrared (NIR), and Red-Edge (RE) bands using specialized encoders and dynamic modality integration. Evaluated on the WeedsGalore dataset, the model achieves a segmentation accuracy (mean IoU) of 78.88%, outperforming RGB-only models by 15.8 percentage points. With only 8.7 million parameters, the model offers high accuracy, computational efficiency, and potential for real-time deployment on Unmanned Aerial Vehicles (UAVs) and edge devices, advancing precision weed management.
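
The headline metric here is mean IoU. A minimal sketch of how it is commonly computed from flattened per-pixel class labels; the function name, the toy labels, and the choice to skip classes absent from both prediction and ground truth are illustrative assumptions, not the paper's evaluation script:

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Average the per-class intersection-over-union across classes.

    Classes that appear in neither prediction nor ground truth are
    skipped (an assumed convention; some benchmarks count them as 0 or 1).
    """
    ious = []
    for cls in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:
            ious.append(intersection / union)
    if not ious:
        raise ValueError("no class is present in prediction or target")
    return sum(ious) / len(ious)


if __name__ == "__main__":
    # Toy labels: 0 = background, 1 = crop, 2 = weed.
    predicted = [0, 0, 1, 1, 2, 2]
    ground_truth = [0, 1, 1, 1, 2, 0]
    print(mean_iou(predicted, ground_truth, num_classes=3))
```

Because the score averages over classes rather than pixels, a rare class such as weeds weighs as much as the background, which is why the metric suits imbalanced field imagery.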

[102] Addressing degeneracies in latent interpolation for diffusion models

Erik Landolsi,Fredrik Kahl

Main category: cs.CV

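TL;DR: The paper studies the degeneracy that arises when interpolating in the latent space of image-generating diffusion models and proposes a simple normalization method to address it.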

Details

Motivation: As diffusion models are increasingly used for deep data augmentation and image morphing, latent-space interpolation can produce degenerate results, especially when the number of input images is large. Method: Analyzes the cause of the degeneracy theoretically and experimentally and proposes a simple normalization scheme for latent interpolation. Result: Experiments show that baseline interpolation methods degrade quality well before the degeneracy becomes clearly visible, whereas the proposed method significantly reduces the degeneration effect and improves quality metrics. Conclusion: The proposed normalization is simple and effective for latent interpolation and significantly improves image generation quality. Abstract: There is an increasing interest in using image-generating diffusion models for deep data augmentation and image morphing. In this context, it is useful to interpolate between latents produced by inverting a set of input images, in order to generate new images representing some mixture of the inputs. We observe that such interpolation can easily lead to degenerate results when the number of inputs is large. We analyze the cause of this effect theoretically and experimentally, and suggest a suitable remedy. The suggested approach is a relatively simple normalization scheme that is easy to use whenever interpolation between latents is needed. We measure image quality using FID and CLIP embedding distance and show experimentally that baseline interpolation methods lead to a drop in quality metrics long before the degeneration issue is clearly visible. In contrast, our method significantly reduces the degeneration effect and leads to improved quality metrics also in non-degenerate situations.
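
The abstract does not spell out the normalization. One way to see why some rescaling is needed, under the assumption that inverted latents behave like independent standard Gaussian vectors: a weighted sum with weights w_i has per-coordinate variance sum(w_i^2), so a plain average of N latents shrinks the norm by a factor of about sqrt(N). The sketch below divides by that factor. It illustrates the degeneracy and one possible fix, not necessarily the authors' scheme, and all names are hypothetical:

```python
import math
import random
from typing import List, Sequence


def interpolate_latents(
    latents: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Weighted mix of latents, rescaled to keep unit per-coordinate variance.

    Assumes the inputs are roughly independent standard Gaussian vectors,
    so dividing by sqrt(sum(w_i^2)) restores the expected norm.
    """
    scale = math.sqrt(sum(w * w for w in weights))
    dim = len(latents[0])
    return [
        sum(w * z[i] for w, z in zip(weights, latents)) / scale for i in range(dim)
    ]


def norm(vector: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in vector))


if __name__ == "__main__":
    rng = random.Random(0)
    dim, count = 1024, 16
    latents = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]
    weights = [1.0 / count] * count
    plain = [sum(w * z[i] for w, z in zip(weights, latents)) for i in range(dim)]
    # The plain average has a much smaller norm than any single latent;
    # the rescaled mix stays near sqrt(dim).
    print(norm(latents[0]), norm(plain), norm(interpolate_latents(latents, weights)))
```

A latent whose norm falls far below what the denoiser saw during training lies outside the expected noise distribution, which is consistent with the quality drop the abstract reports for baseline interpolation as the number of inputs grows.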

[103] DocVXQA: Context-Aware Visual Explanations for Document Question Answering

Mohamed Ali Souibgui,Changkyu Choi,Andrey Barsky,Kangsoo Jung,Ernest Valveny,Dimosthenis Karatzas

Main category: cs.CV

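TL;DR: DocVXQA is a novel visually self-explainable document question answering framework that provides explanations by generating visual heatmaps while balancing performance and interpretability.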

Details

Motivation: Conventional methods focus only on answer-relevant regions and lack contextually sufficient explanations; DocVXQA aims to address this. Method: Explainability principles are formulated quantitatively as explicit learning objectives, producing heatmaps that are contextually sufficient and representation-efficient. Result: Experiments and human evaluation validate the method's effectiveness, and the code is open source. Conclusion: DocVXQA balances performance and interpretability in document question answering and improves user trust. Abstract: We propose DocVXQA, a novel framework for visually self-explainable document question answering. The framework is designed not only to produce accurate answers to questions but also to learn visual heatmaps that highlight contextually critical regions, thereby offering interpretable justifications for the model's decisions. To integrate explanations into the learning process, we quantitatively formulate explainability principles as explicit learning objectives. Unlike conventional methods that emphasize only the regions pertinent to the answer, our framework delivers explanations that are contextually sufficient while remaining representation-efficient. This fosters user trust while achieving a balance between predictive performance and interpretability in DocVQA applications. Extensive experiments, including human evaluation, provide strong evidence supporting the effectiveness of our method. The code is available at https://github.com/dali92002/DocVXQA.

[104] Learning to Reason and Navigate: Parameter Efficient Action Planning with Large Language Models

Bahram Mohammadi,Ehsan Abbasnejad,Yuankai Qi,Qi Wu,Anton Van Den Hengel,Javen Qinfeng Shi

Main category: cs.CV

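TL;DR: The paper proposes PEAP-LLM, a parameter-efficient action planner based on large language models, for navigation in the remote embodied referring expression (REVERIE) task. With a two-stage fine-tuning method (SFT and DPO), the model outperforms existing methods in complex environments.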

Details

Motivation: To meet the need for efficient navigation in the remote embodied referring expression task while avoiding the poor performance and error-proneness of applying large language models directly. Method: Proposes the PEAP-LLM model, consisting of an LLM goal planner (LGP) and a LoRA action planner (LAP), and optimizes it with two-stage fine-tuning (SFT and DPO). Result: Experiments show that PEAP-LLM outperforms the previous state of the art on REVERIE. Conclusion: Through efficient action planning and two-stage fine-tuning, PEAP-LLM significantly improves performance on the remote embodied referring expression task. Abstract: The remote embodied referring expression (REVERIE) task requires an agent to navigate through complex indoor environments and localize a remote object specified by high-level instructions, such as "bring me a spoon", without pre-exploration. Hence, an efficient navigation plan is essential for the final success. This paper proposes a novel parameter-efficient action planner using large language models (PEAP-LLM) to generate a single-step instruction at each location. The proposed model consists of two modules, LLM goal planner (LGP) and LoRA action planner (LAP). Initially, LGP extracts the goal-oriented plan from REVERIE instructions, including the target object and room. Then, LAP generates a single-step instruction with the goal-oriented plan, high-level instruction, and current visual observation as input. PEAP-LLM enables the embodied agent to interact with LAP as the path planner on the fly. A simple direct application of LLMs hardly achieves good performance. Also, existing hard-prompt-based methods are error-prone in complicated scenarios and need human intervention. To address these issues and prevent the LLM from generating hallucinations and biased information, we propose a novel two-stage method for fine-tuning the LLM, consisting of supervised fine-tuning (SFT) and direct preference optimization (DPO). SFT improves the quality of generated instructions, while DPO utilizes environmental feedback. Experimental results show the superiority of our proposed model on REVERIE compared to the previous state-of-the-art.
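
For reference, the standard DPO objective on a single preference pair can be sketched as follows. This is the generic formulation, not PEAP-LLM's training code; the argument names, the example log-probabilities, and the default beta are illustrative assumptions:

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one pair: -log(sigmoid(beta * (chosen gain - rejected gain))).

    Each gain is the policy log-probability minus the frozen reference
    model's log-probability for the same response.
    """
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_gain - rejected_gain)
    # Numerically stable -log(sigmoid(logits)).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


if __name__ == "__main__":
    # Hypothetical sequence log-probabilities for a preferred and a
    # rejected single-step instruction.
    print(dpo_loss(-12.0, -15.0, -13.0, -14.0))
```

In the setting the abstract describes, the preferred and rejected instructions would be ranked using environmental feedback rather than human labels.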

[105] MAIS: Memory-Attention for Interactive Segmentation

Mauricio Orbes-Arteaga,Oeslle Lucena,Sabastien Ourselin,M. Jorge Cardoso

Main category: cs.CV

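TL;DR: MAIS introduces a memory-attention mechanism that stores past user inputs and segmentation states, improving the efficiency and accuracy of interactive medical segmentation.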

Details

Motivation: Existing methods treat user interactions as independent events, which leads to redundant corrections and limited refinement gains; an approach that integrates temporal context is needed. Method: Proposes MAIS, a memory-attention mechanism that stores past user inputs and segmentation states to integrate temporal context and improve ViT-based segmentation. Result: MAIS improves segmentation efficiency and accuracy across diverse imaging modalities. Conclusion: The memory-attention mechanism of MAIS improves interactive medical segmentation. Abstract: Interactive medical segmentation reduces annotation effort by refining predictions through user feedback. Vision Transformer (ViT)-based models, such as the Segment Anything Model (SAM), achieve state-of-the-art performance using user clicks and prior masks as prompts. However, existing methods treat interactions as independent events, leading to redundant corrections and limited refinement gains. We address this by introducing MAIS, a Memory-Attention mechanism for Interactive Segmentation that stores past user inputs and segmentation states, enabling temporal context integration. Our approach enhances ViT-based segmentation across diverse imaging modalities, achieving more efficient and accurate refinements.

[106] FLUXSynID: A Framework for Identity-Controlled Synthetic Face Generation with Document and Live Images

Raul Ismayilov,Luuk Spreeuwers,Dzemila Sero

Main category: cs.CV

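TL;DR: FLUXSynID is a framework for generating high-resolution synthetic face datasets with user-defined identity attribute distributions and paired document-style and trusted live capture images.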

Details

Motivation: Address the privacy concerns, demographic imbalance, and high collection costs of real biometric data while offering fine-grained control over identity attributes. Method: Introduces the FLUXSynID framework, which generates high-resolution synthetic face datasets with user-defined attribute distributions and supports paired image generation. Result: The generated dataset aligns better with real-world identity distributions and shows greater inter-set diversity. Conclusion: The FLUXSynID framework and a dataset of 14,889 synthetic identities are publicly released to support biometric research. Abstract: Synthetic face datasets are increasingly used to overcome the limitations of real-world biometric data, including privacy concerns, demographic imbalance, and high collection costs. However, many existing methods lack fine-grained control over identity attributes and fail to produce paired, identity-consistent images under structured capture conditions. We introduce FLUXSynID, a framework for generating high-resolution synthetic face datasets with user-defined identity attribute distributions and paired document-style and trusted live capture images. The dataset generated using the FLUXSynID framework shows improved alignment with real-world identity distributions and greater inter-set diversity compared to prior work. The FLUXSynID framework for generating custom datasets, along with a dataset of 14,889 synthetic identities, is publicly released to support biometric research, including face recognition and morphing attack detection.

[107] IKrNet: A Neural Network for Detecting Specific Drug-Induced Patterns in Electrocardiograms Amidst Physiological Variability

Ahmad Fall,Federica Granese,Alex Lence,Dominique Fourer,Blaise Hanczar,Joe-Elie Salem,Jean-Daniel Zucker,Edi Prifti

Main category: cs.CV

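TL;DR: IKrNet is a new neural network model that combines spatial and temporal dynamics to analyze ECG signals, identifying drug-specific ECG patterns under varying physiological conditions and outperforming existing methods.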

Details

Motivation: Current AI methods do not adequately account for how physiological conditions (such as physical activity, drugs, and stress) affect ECG patterns, which limits their practical use. Method: IKrNet uses a convolutional backbone to capture spatial features and a bidirectional LSTM module to model temporal dependencies, with heart rate variability as a surrogate for physiological fluctuations. Result: Across diverse scenarios covering physical stress, drug intake, and baseline conditions, IKrNet is more accurate and stable than existing models. Conclusion: IKrNet shows clinical viability under varying physiological conditions and offers a more reliable tool for ECG analysis. Abstract: Monitoring and analyzing electrocardiogram (ECG) signals, even under varying physiological conditions, including those influenced by physical activity, drugs and stress, is crucial to accurately assess cardiac health. However, current AI-based methods often fail to account for how these factors interact and alter ECG patterns, ultimately limiting their applicability in real-world settings. This study introduces IKrNet, a novel neural network model, which identifies drug-specific patterns in ECGs amidst certain physiological conditions. IKrNet's architecture incorporates spatial and temporal dynamics by using a convolutional backbone with varying receptive field size to capture spatial features. A bi-directional Long Short-Term Memory module is also employed to model temporal dependencies. By treating heart rate variability as a surrogate for physiological fluctuations, we evaluated IKrNet's performance across diverse scenarios, including conditions with physical stress, drug intake alone, and a baseline without drug presence. Our assessment follows a clinical protocol in which 990 healthy volunteers were administered 80 mg of Sotalol, a drug which is known to be a precursor to Torsades-de-Pointes, a life-threatening arrhythmia. We show that IKrNet outperforms state-of-the-art models' accuracy and stability in varying physiological conditions, underscoring its clinical viability.

[108] Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning

Bohan Wang,Zhongqi Yue,Fengda Zhang,Shuo Chen,Li'an Bi,Junzhe Zhang,Xue Song,Kennard Yanting Chan,Jiachun Pan,Weijia Wu,Mingze Zhou,Wang Lin,Kaihang Pan,Saining Zhang,Liyu Jia,Wentao Hu,Wei Zhao,Hanwang Zhang

Main category: cs.CV

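TL;DR: Selftok is a new discrete visual tokenizer that unifies diffusion and autoregressive models through an autoregressive prior and the reverse diffusion process, supports reinforcement learning, and performs strongly on visual generation tasks.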

Details

Motivation: Traditional spatial priors limit visual representations; Selftok aims to address this with an autoregressive prior and to support seamless training of vision-language models. Method: Selftok uses the reverse diffusion process to produce autoregressive visual tokens, requiring no additional modules or training objectives, and its AR prior satisfies the Bellman equation. Result: Selftok performs strongly on visual generation tasks, surpassing existing models by a large margin, and supports reinforcement learning. Conclusion: Selftok addresses the long-standing challenge that visual tokens cannot support effective RL, an important step toward multimodal large language models. Abstract: We completely discard the conventional spatial prior in image representation and introduce a novel discrete visual tokenizer: Self-consistency Tokenizer (Selftok). At its design core, we compose an autoregressive (AR) prior -- mirroring the causal structure of language -- into visual tokens by using the reverse diffusion process of image generation. The AR property makes Selftok fundamentally distinct from traditional spatial tokens in the following two key ways: - Selftok offers an elegant and minimalist approach to unify diffusion and AR for vision-language models (VLMs): By representing images with Selftok tokens, we can train a VLM using a purely discrete autoregressive architecture -- like that in LLMs -- without requiring additional modules or training objectives. - We theoretically show that the AR prior satisfies the Bellman equation, whereas the spatial prior does not. Therefore, Selftok supports reinforcement learning (RL) for visual generation with effectiveness comparable to that achieved in LLMs. Besides the AR property, Selftok is also a SoTA tokenizer that achieves a favorable trade-off between high-quality reconstruction and compression rate. We use Selftok to build a pure AR VLM for both visual comprehension and generation tasks. Impressively, without using any text-image training pairs, a simple policy gradient RL working in the visual tokens can significantly boost the visual generation benchmark, surpassing all the existing models by a large margin. Therefore, we believe that Selftok effectively addresses the long-standing challenge that visual tokens cannot support effective RL.
When combined with the well-established strengths of RL in LLMs, this brings us one step closer to realizing a truly multimodal LLM. Project Page: https://selftok-team.github.io/report/.

[109] GIFStream: 4D Gaussian-based Immersive Video with Feature Stream

Hao Li,Sicheng Li,Xiang Gao,Abudouaihati Batuer,Lu Yu,Yiyi Liao

Main category: cs.CV

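TL;DR: GIFStream is a new 4D Gaussian representation that combines a canonical space and a deformation field with time-dependent feature streams, enabling efficient compression and high-quality immersive video.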

Details

Motivation: Immersive video demands high rendering efficiency and quality, but existing methods struggle to keep storage manageable. Method: Uses a canonical space and a deformation field with time-dependent feature streams for complex motion modeling, plus temporal and spatial compression networks for end-to-end compression. Result: Delivers high-quality immersive video at 30 Mbps, with real-time rendering and fast decoding on an RTX 4090. Conclusion: GIFStream addresses storage and efficiency while maintaining high quality, offering an effective solution for immersive video. Abstract: Immersive video offers a 6-DoF free-viewing experience, potentially playing a key role in future video technology. Recently, 4D Gaussian Splatting has gained attention as an effective approach for immersive video due to its high rendering efficiency and quality, though maintaining quality with manageable storage remains challenging. To address this, we introduce GIFStream, a novel 4D Gaussian representation using a canonical space and a deformation field enhanced with time-dependent feature streams. These feature streams enable complex motion modeling and allow efficient compression by leveraging temporal correspondence and motion-aware pruning. Additionally, we incorporate both temporal and spatial compression networks for end-to-end compression. Experimental results show that GIFStream delivers high-quality immersive video at 30 Mbps, with real-time rendering and fast decoding on an RTX 4090. Project page: https://xdimlab.github.io/GIFStream

[110] SynID: Passport Synthetic Dataset for Presentation Attack Detection

Juan E. Tapia,Fabian Stockhardt,Lázaro Janier González-Soler,Christoph Busch

Main category: cs.CV

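TL;DR: Proposes a hybrid method that combines synthetic data and open-access information to generate an ICAO-compliant passport dataset for training and testing PAD systems.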

Details

Motivation: Attacks using fraudulent ID documents in remote verification systems are increasing, but privacy concerns limit the availability of real ID document data. Method: A hybrid method combines synthetic data and open-access information to generate a passport dataset that follows ICAO requirements. Result: Produces a realistic passport dataset that can be used to train and test PAD systems. Conclusion: The method offers a feasible way to address the shortage of training data for PAD systems. Abstract: The demand for Presentation Attack Detection (PAD) to identify fraudulent ID documents in remote verification systems has significantly risen in recent years. This increase is driven by several factors, including the rise of remote work, online purchasing, migration, and advancements in synthetic images. Additionally, we have noticed a surge in the number of attacks aimed at the enrolment process. Training a PAD to detect fake ID documents is very challenging because of the limited number of ID documents available due to privacy concerns. This work proposes a new passport dataset generated from a hybrid method that combines synthetic data and open-access information using the ICAO requirement to obtain realistic training and testing images.

[111] Automated Visual Attention Detection using Mobile Eye Tracking in Behavioral Classroom Studies

Efe Bozkir,Christian Kosel,Tina Seidel,Enkelejda Kasneci

Main category: cs.CV

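TL;DR: The paper presents an automated processing pipeline that combines mobile eye tracking with face recognition to identify which students a teacher focuses on in the classroom while minimizing the need for manually annotated data.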

Details

Motivation: The distribution of teachers' visual attention matters for student engagement and achievement, but traditional approaches depend on extensive manual annotation, which limits their use. Method: Uses state-of-the-art face detection and recognition models together with mobile eye-tracking data, training classroom-specific models with transfer learning. Result: Validated in four different classroom setups; U-shaped and small classrooms reach accuracies of approximately 0.7 and 0.9, respectively. Conclusion: The method needs little manual annotation and enables non-intrusive analysis of teachers' visual attention, which could help improve instructional strategies and teacher training. Abstract: Teachers' visual attention and its distribution across the students in classrooms can constitute important implications for student engagement, achievement, and professional teacher training. Despite that, inferring the information about where and which student teachers focus on is not trivial. Mobile eye tracking can provide vital help to solve this issue; however, the use of mobile eye tracking alone requires a significant amount of manual annotations. To address this limitation, we present an automated processing pipeline concept that requires minimal manually annotated data to recognize which student the teachers focus on. To this end, we utilize state-of-the-art face detection models and face recognition feature embeddings to train face recognition models with transfer learning in the classroom context and combine these models with the teachers' gaze from mobile eye trackers. We evaluated our approach with data collected from four different classrooms, and our results show that while it is possible to estimate the visually focused students with reasonable performance in all of our classroom setups, U-shaped and small classrooms led to the best results with accuracies of approximately 0.7 and 0.9, respectively. While we did not evaluate our method for teacher-student interactions and focused on the validity of the technical approach, as our methodology does not require a vast amount of manually annotated data and offers a non-intrusive way of handling teachers' visual attention, it could help improve instructional strategies, enhance classroom management, and provide feedback for professional teacher development.

[112] Self-Supervised Event Representations: Towards Accurate, Real-Time Perception on SoC FPGAs

Kamil Jeziorek,Tomasz Kryjak

Main category: cs.CV

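TL;DR: This paper proposes a Self-Supervised Event Representation (SSER) method that uses GRU networks to encode event timestamps and polarities precisely, without temporal discretisation, improving the processing of event data.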

Details

Motivation: Event cameras offer high temporal resolution, robustness to lighting, and low power consumption, but processing their sparse, asynchronous event streams remains challenging. Existing methods sacrifice either performance or temporal fidelity, so a better solution is needed. Method: A GRU network trained in a self-supervised manner encodes event timestamps and polarities per pixel and supports asynchronous inference, ensuring compatibility with high-throughput sensors. Result: SSER improves object detection by 2.4% mAP on Gen1 and 0.6% on 1 Mpx, and achieves sub-microsecond latency and 1-2 W power consumption on an FPGA. Conclusion: SSER outperforms existing baselines in performance and efficiency and is the first hardware implementation of a recurrent representation for event data, suiting real-time, low-power applications. Abstract: Event cameras offer significant advantages over traditional frame-based sensors. These include microsecond temporal resolution, robustness under varying lighting conditions and low power consumption. Nevertheless, the effective processing of their sparse, asynchronous event streams remains challenging. Existing approaches to this problem can be categorised into two distinct groups. The first group involves the direct processing of event data with neural models, such as Spiking Neural Networks or Graph Convolutional Neural Networks. However, this approach is often accompanied by a compromise in terms of qualitative performance. The second group involves the conversion of events into dense representations with handcrafted aggregation functions, which can boost accuracy at the cost of temporal fidelity. This paper introduces a novel Self-Supervised Event Representation (SSER) method leveraging Gated Recurrent Unit (GRU) networks to achieve precise per-pixel encoding of event timestamps and polarities without temporal discretisation. The recurrent layers are trained in a self-supervised manner to maximise the fidelity of event-time encoding. The inference is performed with event representations generated asynchronously, thus ensuring compatibility with high-throughput sensors. The experimental validation demonstrates that SSER outperforms aggregation-based baselines, achieving improvements of 2.4% mAP and 0.6% on the Gen1 and 1 Mpx object detection datasets.
Furthermore, the paper presents the first hardware implementation of recurrent representation for event data on a System-on-Chip FPGA, achieving sub-microsecond latency and power consumption between 1-2 W, suitable for real-time, power-efficient applications. Code is available at https://github.com/vision-agh/RecRepEvent.

[113] Robust Kidney Abnormality Segmentation: A Validation Study of an AI-Based Framework

Sarah de Boer,Hartmut Häntze,Kiran Vaidhya Venkadesh,Myrthe A. D. Buser,Gabriel E. Humpire Mamani,Lina Xu,Lisa C. Adams,Jawed Nawabi,Keno K. Bressem,Bram van Ginneken,Mathias Prokop,Alessa Hering

Main category: cs.CV

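TL;DR: Develops an nnU-Net-based kidney abnormality segmentation algorithm, trained on public datasets and validated on multiple test sets, that outperforms existing methods and is robust.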

Details

Motivation: Kidney volume is an important biomarker for renal diseases, but clinical practice currently relies on subjective visual assessment; a more objective, reproducible segmentation method is needed. Method: Trains the nnU-Net framework on public datasets, quantifies performance with the Dice coefficient and the 95th percentile Hausdorff distance, and analyzes robustness across subgroups. Result: The algorithm generalizes well to external test sets, outperforms existing methods, and performs consistently across subgroups. Conclusion: The algorithm has potential for clinical and research use, and the code is publicly available. Abstract: Kidney abnormality segmentation has important potential to enhance the clinical workflow, especially in settings requiring quantitative assessments. Kidney volume could serve as an important biomarker for renal diseases, with changes in volume correlating directly with kidney function. Currently, clinical practice often relies on subjective visual assessment for evaluating kidney size and abnormalities, including tumors and cysts, which are typically staged based on diameter, volume, and anatomical location. To support a more objective and reproducible approach, this research aims to develop a robust, thoroughly validated kidney abnormality segmentation algorithm, made publicly available for clinical and research use. We employ publicly available training datasets and leverage the state-of-the-art medical image segmentation framework nnU-Net. Validation is conducted using both proprietary and public test datasets, with segmentation performance quantified by Dice coefficient and the 95th percentile Hausdorff distance. Furthermore, we analyze robustness across subgroups based on patient sex, age, CT contrast phases, and tumor histologic subtypes. Our findings demonstrate that our segmentation algorithm, trained exclusively on publicly available data, generalizes effectively to external test sets and outperforms existing state-of-the-art models across all tested datasets. Subgroup analyses reveal consistent high performance, indicating strong robustness and reliability. The developed algorithm and associated code are publicly accessible at https://github.com/DIAGNijmegen/oncology-kidney-abnormality-segmentation.
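For readers unfamiliar with the Dice coefficient mentioned above, a minimal sketch on flattened binary masks follows. It is not taken from the linked repository: the function name and the convention of returning 1.0 when both masks are empty are assumptions made for this illustration.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two flat binary masks.

    pred and target are equal-length sequences of 0/1 (or False/True).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total
```

For example, a prediction that marks two voxels when only one of them is truly foreground scores 2·1/(2+1) ≈ 0.667.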

[114] Evaluating Modern Visual Anomaly Detection Approaches in Semiconductor Manufacturing: A Comparative Study

Manuel Barusco,Francesco Borsatti,Youssef Ben Khalifa,Davide Dalle Pezze,Gian Antonio Susto

Main category: cs.CV

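TL;DR: The paper covers unsupervised visual anomaly detection (VAD) for semiconductor manufacturing, builds a benchmark on the MIIC dataset, and shows that modern VAD methods are effective.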

Details

Motivation: Semiconductor manufacturing is complex, and traditional supervised methods need many labeled anomalous samples, which is costly. Unsupervised VAD avoids this drawback while providing explanations of the predictions. Method: Applies modern unsupervised VAD methods to the MIIC dataset. Result: Experiments show that modern VAD methods are effective in the semiconductor domain. Conclusion: Unsupervised VAD offers an effective, low-cost solution for visual inspection in semiconductor manufacturing. Abstract: Semiconductor manufacturing is a complex, multistage process. Automated visual inspection of Scanning Electron Microscope (SEM) images is indispensable for minimizing equipment downtime and containing costs. Most previous research considers supervised approaches, assuming a sufficient number of anomalously labeled samples. On the contrary, Visual Anomaly Detection (VAD), an emerging research domain, focuses on unsupervised learning, avoiding the costly defect collection phase while providing explanations of the predictions. We introduce a benchmark for VAD in the semiconductor domain by leveraging the MIIC dataset. Our results demonstrate the efficacy of modern VAD approaches in this field.

[115] Deep Learning Advances in Vision-Based Traffic Accident Anticipation: A Comprehensive Review of Methods, Datasets, and Future Directions

Yi Zhang,Wenye Zhou,Ruonan Lin,Xin Yang,Hao Zheng

Main category: cs.CV

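TL;DR: This paper reviews 147 recent studies on the use of supervised, unsupervised, and hybrid deep learning models in vision-based traffic accident anticipation (Vision-TAA), summarizes four main approaches, and identifies challenges such as data scarcity and limited generalization.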

Details

Motivation: Improve road safety by using deep learning to predict and detect traffic accidents. Method: Reviews four approaches: image and video feature-based prediction, spatiotemporal feature-based prediction, scene understanding, and multimodal data fusion. Result: Existing methods show promise in prediction accuracy but still face data scarcity, limited generalization, and real-time performance constraints. Conclusion: Future directions include multimodal data fusion, self-supervised learning, and Transformer-based architectures to improve prediction accuracy and scalability. Abstract: Traffic accident prediction and detection are critical for enhancing road safety, and vision-based traffic accident anticipation (Vision-TAA) has emerged as a promising approach in the era of deep learning. This paper reviews 147 recent studies, focusing on the application of supervised, unsupervised, and hybrid deep learning models for accident prediction, alongside the use of real-world and synthetic datasets. Current methodologies are categorized into four key approaches: image and video feature-based prediction, spatiotemporal feature-based prediction, scene understanding, and multimodal data fusion. While these methods demonstrate significant potential, challenges such as data scarcity, limited generalization to complex scenarios, and real-time performance constraints remain prevalent. This review highlights opportunities for future research, including the integration of multimodal data fusion, self-supervised learning, and Transformer-based architectures to enhance prediction accuracy and scalability. By synthesizing existing advancements and identifying critical gaps, this paper provides a foundational reference for developing robust and adaptive Vision-TAA systems, contributing to road safety and traffic management.

[116] Higher-Order Convolution Improves Neural Predictivity in the Retina

Simone Azeglio,Victor Calbiague Garcia,Guilhem Glaziou,Peter Neri,Olivier Marre,Ulisse Ferrari

Main category: cs.CV

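TL;DR: Proposes a new way to embed higher-order operations in convolutional neural networks, increasing representational power without adding depth, suited to modeling biological visual systems.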

Details

Motivation: Address the architectural disparity between conventional CNNs and biological visual systems and improve neural response prediction. Method: Extends 3D CNNs by embedding higher-order operations within the convolution itself, directly modeling multiplicative interactions between pixels. Result: Performs strongly on two datasets while requiring half the training data, with correlation coefficients up to 0.75 and a marked ability to encode geometric transformations. Conclusion: Higher-order CNNs excel at neural response prediction, especially for cell types that detect geometric transformations. Abstract: We present a novel approach to neural response prediction that incorporates higher-order operations directly within convolutional neural networks (CNNs). Our model extends traditional 3D CNNs by embedding higher-order operations within the convolutional operator itself, enabling direct modeling of multiplicative interactions between neighboring pixels across space and time. Our model increases the representational power of CNNs without increasing their depth, therefore addressing the architectural disparity between deep artificial networks and the relatively shallow processing hierarchy of biological visual systems. We evaluate our approach on two distinct datasets: salamander retinal ganglion cell (RGC) responses to natural scenes, and a new dataset of mouse RGC responses to controlled geometric transformations. Our higher-order CNN (HoCNN) achieves superior performance while requiring only half the training data compared to standard architectures, demonstrating correlation coefficients up to 0.75 with neural responses (against 0.80$\pm$0.02 retinal reliability). When integrated into state-of-the-art architectures, our approach consistently improves performance across different species and stimulus conditions. Analysis of the learned representations reveals that our network naturally encodes fundamental geometric transformations, particularly scaling parameters that characterize object expansion and contraction. This capability is especially relevant for specific cell types, such as transient OFF-alpha and transient ON cells, which are known to detect looming objects and object motion respectively, and where our model shows marked improvement in response prediction.
The correlation coefficients for scaling parameters are more than twice as high in HoCNN (0.72) compared to baseline models (0.32).
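To make "multiplicative interactions between neighboring pixels" concrete, here is a toy one-dimensional sketch that adds a second-order term to an ordinary sliding-window filter. This form is an assumption chosen for illustration; the abstract does not specify the operator, and the paper's model works on 3D (space and time) inputs. The names `w1` and `w2` are hypothetical.

```python
def second_order_conv1d(x, w1, w2, bias=0.0):
    """Valid-mode 1D filter with a linear and a quadratic term.

    out[i] = bias + sum_a w1[a] * x[i+a]
                  + sum_{a,b} w2[a][b] * x[i+a] * x[i+b]
    w1 has length k and w2 is a k x k nested list.
    """
    k = len(w1)
    out = []
    for i in range(len(x) - k + 1):
        window = x[i:i + k]
        # First-order (standard convolution-style) response.
        linear = sum(w1[a] * window[a] for a in range(k))
        # Second-order response: products of pairs of inputs in the window.
        quadratic = sum(w2[a][b] * window[a] * window[b]
                        for a in range(k) for b in range(k))
        out.append(bias + linear + quadratic)
    return out
```

Setting every entry of `w2` to zero recovers the plain linear filter, which shows how the higher-order term adds expressiveness within a single layer rather than through extra depth.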

[117] A Unified Hierarchical Framework for Fine-grained Cross-view Geo-localization over Large-scale Scenarios

Zhuo Song,Ye Zhang,Kunhong Li,Longguang Wang,Yulan Guo

Main category: cs.CV

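TL;DR: UnifyGeo is a unified hierarchical geo-localization framework that integrates retrieval and metric localization into a single network, improving performance through shared parameters and a re-ranking mechanism.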

Details

Motivation: Existing methods usually design standalone models for retrieval and metric localization, resulting in inefficient collaboration and increased training overhead. Method: Uses a unified learning strategy with shared parameters to jointly learn multi-granularity representations, and designs a re-ranking mechanism guided by a dedicated loss function. Result: On the VIGOR benchmark, 1-meter-level localization recall rises from 1.53% to 39.64% (same-area) and from 0.43% to 25.58% (cross-area). Conclusion: UnifyGeo significantly outperforms existing methods, validating the unified framework. Abstract: Cross-view geo-localization is a promising solution for large-scale localization problems, requiring the sequential execution of retrieval and metric localization tasks to achieve fine-grained predictions. However, existing methods typically focus on designing standalone models for these two tasks, resulting in inefficient collaboration and increased training overhead. In this paper, we propose UnifyGeo, a novel unified hierarchical geo-localization framework that integrates retrieval and metric localization tasks into a single network. Specifically, we first employ a unified learning strategy with shared parameters to jointly learn multi-granularity representation, facilitating mutual reinforcement between these two tasks. Subsequently, we design a re-ranking mechanism guided by a dedicated loss function, which enhances geo-localization performance by improving both retrieval accuracy and metric localization references. Extensive experiments demonstrate that UnifyGeo significantly outperforms the state of the art in both task-isolated and task-associated settings. Remarkably, on the challenging VIGOR benchmark, which supports fine-grained localization evaluation, the 1-meter-level localization recall rate improves from 1.53% to 39.64% and from 0.43% to 25.58% under same-area and cross-area evaluations, respectively. Code will be made publicly available.
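As a rough illustration of what a "1-meter-level localization recall rate" measures, the sketch below counts the fraction of queries whose predicted position lies within a distance threshold of the ground truth. It assumes positions are already expressed in meters in a local planar frame; the function name and arguments are hypothetical and do not reflect the VIGOR evaluation code.

```python
import math


def recall_at_distance(pred_xy, true_xy, threshold_m=1.0):
    """Fraction of queries localized within threshold_m of ground truth.

    pred_xy and true_xy are equal-length lists of (x, y) tuples in meters.
    """
    if len(pred_xy) != len(true_xy) or not true_xy:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(
        1
        for (px, py), (tx, ty) in zip(pred_xy, true_xy)
        if math.hypot(px - tx, py - ty) <= threshold_m
    )
    return hits / len(true_xy)
```

With three queries whose errors are 0.5 m, 5 m, and about 0.71 m, two fall inside the 1 m threshold, so the recall is 2/3.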

[118] ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models

Ozgur Kara,Krishna Kumar Singh,Feng Liu,Duygu Ceylan,James M. Rehg,Tobias Hinz

Main category: cs.CV

TL;DR: Proposes a framework for text-to-multi-shot video generation, addressing the inability of existing methods to generate multi-shot videos.

Details

Motivation: Existing diffusion-based text-to-video methods only generate short single-shot clips and cannot produce multi-shot videos. Method: Proposes a framework comprising a dataset collection pipeline and architectural extensions to video diffusion models, introducing a transition token and a local attention masking strategy. Result: Experiments show the method generates multi-shot videos with shot-specific control and outperforms the baselines. Conclusion: The method offers an effective solution for text-to-multi-shot video generation. Abstract: Current diffusion-based text-to-video methods are limited to producing short video clips of a single shot and lack the capability to generate multi-shot videos with discrete transitions where the same character performs distinct activities across the same or different backgrounds. To address this limitation we propose a framework that includes a dataset collection pipeline and architectural extensions to video diffusion models to enable text-to-multi-shot video generation. Our approach enables generation of multi-shot videos as a single video with full attention across all frames of all shots, ensuring character and background consistency, and allows users to control the number, duration, and content of shots through shot-specific conditioning. This is achieved by incorporating a transition token into the text-to-video model to control at which frames a new shot begins and a local attention masking strategy which controls the transition token's effect and allows shot-specific prompting. To obtain training data we propose a novel data collection pipeline to construct a multi-shot video dataset from existing single-shot video datasets. Extensive experiments demonstrate that fine-tuning a pre-trained text-to-video model for a few thousand iterations is enough for the model to subsequently be able to generate multi-shot videos with shot-specific control, outperforming the baselines. More details are available at https://shotadapter.github.io/

[119] Anatomical Attention Alignment representation for Radiology Report Generation

Quang Vinh Nguyen,Minh Duc Nguyen,Thanh Hoang Son Vo,Hyung-Jeong Yang,Soo-Hyung Kim

Main category: cs.CV

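TL;DR: A3Net combines a knowledge dictionary of anatomical structures with visual features to improve the accuracy and clinical relevance of radiology report generation.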

Details

Motivation: Existing models rely only on visual features from raw images, which limits their understanding of spatial structures and semantic relationships and leads to suboptimal text generation. Method: Proposes the A3Net framework, which builds hyper-visual representations by combining an anatomical knowledge dictionary with patch-level image features to strengthen visual-textual understanding. Result: On the IU X-Ray and MIMIC-CXR datasets, A3Net significantly improves visual perception and text generation quality. Conclusion: A3Net's structured representation improves semantic reasoning and cross-modal alignment, offering a better solution for radiology report generation. Abstract: Automated Radiology report generation (RRG) aims at producing detailed descriptions of medical images, reducing radiologists' workload and improving access to high-quality diagnostic services. Existing encoder-decoder models only rely on visual features extracted from raw input images, which can limit the understanding of spatial structures and semantic relationships, often resulting in suboptimal text generation. To address this, we propose Anatomical Attention Alignment Network (A3Net), a framework that enhances visual-textual understanding by constructing hyper-visual representations. Our approach integrates a knowledge dictionary of anatomical structures with patch-level visual features, enabling the model to effectively associate image regions with their corresponding anatomical entities. This structured representation improves semantic reasoning, interpretability, and cross-modal alignment, ultimately enhancing the accuracy and clinical relevance of generated reports. Experimental results on IU X-Ray and MIMIC-CXR datasets demonstrate that A3Net significantly improves both visual perception and text generation quality. Our code is available at https://github.com/Vinh-AI/A3Net.

[120] Beyond CLIP Generalization: Against Forward&Backward Forgetting Adapter for Continual Learning of Vision-Language Models

Songlin Dong,Chenhao Ding,Jiangyang Li,Jizhou Han,Qiang Wang,Yuhang He,Yihong Gong

Main category: cs.CV

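TL;DR: This paper proposes AFA, a new framework for multi-domain task incremental learning (MTIL) whose two core modules improve the zero-shot recognition and few-shot learning capabilities of vision-language models (VLMs).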

Details

Motivation: Existing methods only preserve a model's zero-shot capability and cannot further improve its generalization, so a new approach is needed. Method: AFA comprises two modules: an against forward-forgetting adapter that learns task-invariant information, and an against backward-forgetting adapter that strengthens few-shot learning. Result: Experiments show that AFA significantly outperforms existing methods on few-shot MTIL tasks and surpasses the inherent zero-shot performance of CLIP. Conclusion: AFA improves VLM performance in incremental learning while strengthening zero-shot and few-shot capabilities. Abstract: This study aims to address the problem of multi-domain task incremental learning (MTIL), which requires that vision-language models (VLMs) continuously acquire new knowledge while maintaining their inherent zero-shot recognition capability. Existing paradigms delegate the testing of unseen-domain samples to the original CLIP, which only prevents the degradation of the model's zero-shot capability but fails to enhance the generalization of the VLM further. To this end, we propose a novel MTIL framework, named AFA, which comprises two core modules: (1) an against forward-forgetting adapter that learns task-invariant information for each dataset in the incremental tasks to enhance the zero-shot recognition ability of VLMs; (2) an against backward-forgetting adapter that strengthens the few-shot learning capability of VLMs while supporting incremental learning. Extensive experiments demonstrate that the AFA method significantly outperforms existing state-of-the-art approaches, especially in few-shot MTIL tasks, and surpasses the inherent zero-shot performance of CLIP in terms of transferability. The code is provided in the Supplementary Material.

[121] Feedback-Driven Pseudo-Label Reliability Assessment: Redefining Thresholding for Semi-Supervised Semantic Segmentation

Negin Ghamsarian,Sahar Nasirihaghighi,Klaus Schoeffmann,Raphael Sznitman

Main category: cs.CV

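TL;DR: Proposes ENCORE, a dynamic feedback-driven pseudo-label selection method that estimates class-wise confidence in unlabeled data and adjusts thresholds dynamically, improving semi-supervised learning without manual tuning.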

Details

Motivation: Address the reliance of traditional pseudo-supervision on static confidence thresholds, especially in semi-supervised settings where labeled data is scarce. Method: ENCORE estimates class-wise confidence and adjusts pseudo-label selection thresholds dynamically, based on the model's response to different levels of filtering. Result: Significantly improves segmentation performance across multiple datasets and network architectures, particularly in data-scarce conditions. Conclusion: ENCORE improves semi-supervised learning without relying on large labeled datasets or manual threshold tuning. Abstract: Semi-supervised learning leverages unlabeled data to enhance model performance, addressing the limitations of fully supervised approaches. Among its strategies, pseudo-supervision has proven highly effective, typically relying on one or multiple teacher networks to refine pseudo-labels before training a student network. A common practice in pseudo-supervision is filtering pseudo-labels based on pre-defined confidence thresholds or entropy. However, selecting optimal thresholds requires large labeled datasets, which are often scarce in real-world semi-supervised scenarios. To overcome this challenge, we propose Ensemble-of-Confidence Reinforcement (ENCORE), a dynamic feedback-driven thresholding strategy for pseudo-label selection. Instead of relying on static confidence thresholds, ENCORE estimates class-wise true-positive confidence within the unlabeled dataset and continuously adjusts thresholds based on the model's response to different levels of pseudo-label filtering. This feedback-driven mechanism ensures the retention of informative pseudo-labels while filtering unreliable ones, enhancing model training without manual threshold tuning. Our method seamlessly integrates into existing pseudo-supervision frameworks and significantly improves segmentation performance, particularly in data-scarce conditions. Extensive experiments demonstrate that integrating ENCORE with existing pseudo-supervision frameworks enhances performance across multiple datasets and network architectures, validating its effectiveness in semi-supervised learning.

[122] Hybrid Spiking Vision Transformer for Object Detection with Event Cameras

Qi Xu,Jie Deng,Jiangrong Shen,Biwu Chen,Huajin Tang,Gang Pan

Main category: cs.CV

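TL;DR: This paper proposes a novel hybrid spiking vision Transformer (HsVT) model to improve event-based object detection and releases a public dataset to support research.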

Details

Motivation: Event-based object detection offers high temporal resolution, wide dynamic range, and asynchronous event representation, but needs more efficient models to capture spatiotemporal features. Method: HsVT combines a spatial feature extraction module (capturing local and global features) with a temporal feature extraction module (modeling time dependencies and long-term patterns). Result: Experiments show that HsVT significantly improves detection performance on the GEN1 and Fall Detection datasets with fewer parameters. Conclusion: HsVT offers an efficient solution for event-based object detection, and the public dataset supports research in this area. Abstract: Event-based object detection has gained increasing attention due to its advantages such as high temporal resolution, wide dynamic range, and asynchronous address-event representation. Leveraging these advantages, Spiking Neural Networks (SNNs) have emerged as a promising approach, offering low energy consumption and rich spatiotemporal dynamics. To further enhance the performance of event-based object detection, this study proposes a novel hybrid spike vision Transformer (HsVT) model. The HsVT model integrates a spatial feature extraction module to capture local and global features, and a temporal feature extraction module to model time dependencies and long-term patterns in event sequences. This combination enables HsVT to capture spatiotemporal features, improving its capability to handle complex event-based object detection tasks. To support research in this area, we developed and publicly released the Fall Detection Dataset as a benchmark for event-based object detection tasks. This dataset, captured using an event-based camera, ensures facial privacy protection and reduces memory usage due to the event representation format. We evaluated the HsVT model on GEN1 and Fall Detection datasets across various model sizes. Experimental results demonstrate that HsVT achieves significant performance improvements in event detection with fewer parameters.

[123] Gameplay Highlights Generation

Vignesh Edithal,Le Zhang,Ilia Blank,Imran Junejo

Main category: cs.CV

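TL;DR: The paper proposes a method for automatically generating gameplay highlights. It uses the multimodal video understanding model X-CLIP to identify interesting events without per-game engineering, improving efficiency and generalization.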

Details

Motivation: Let gamers share their gameplay easily while saving time and increasing audience engagement. Traditional methods require collaboration with game developers or per-game engineering, which is costly and generalizes poorly. Method: Fine-tunes the multimodal X-CLIP model and applies prompt engineering to improve classification, then uses ONNX libraries for cross-platform deployment and efficient inference. Result: The model detects interesting events in unseen first-person shooter footage with more than 90% accuracy and shows signs of transfer learning on low-resource games. Conclusion: With natural language supervision, X-CLIP yields data-efficient, high-performing video recognition that works across multiple games without per-game engineering. Abstract: In this work, we enable gamers to share their gaming experience on social media by automatically generating eye-catching highlight reels from their gameplay session. Our automation will save time for gamers while increasing audience engagement. We approach the highlight generation problem by first identifying intervals in the video where interesting events occur and then concatenating them. We developed an in-house gameplay event detection dataset containing interesting events annotated by humans using the VIA video annotator. Traditional techniques for highlight detection such as game engine integration require expensive collaboration with game developers. OCR techniques, which detect patches of specific images or texts, require expensive per-game engineering and may not generalize across game UIs and different languages. We finetuned a multimodal general-purpose video understanding model, X-CLIP, using our dataset; the resulting model generalizes across multiple games in a genre without per-game engineering. Prompt engineering was performed to improve the classification performance of this multimodal model. Our evaluation showed that such a finetuned model can detect interesting events in first-person shooting games from unseen gameplay footage with more than 90% accuracy. Moreover, our model performed significantly better on low-resource games (small dataset) when trained along with high-resource games, showing signs of transfer learning. To make the model production ready, we used ONNX libraries to enable cross-platform inference. These libraries also provide post-training quantization tools to reduce model size and inference time for deployment. ONNX runtime libraries with the DirectML backend were used to perform efficient inference on Windows OS. We show that natural language supervision in the X-CLIP model leads to data-efficient and highly performant video recognition models.
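
The "identify intervals, then concatenate" step can be sketched as follows. This is a minimal illustration, assuming a classifier has already labeled fixed-length clips as interesting or not; the function name and the clip length are hypothetical, and the paper's actual post-processing is not described in this digest.

```python
def highlight_intervals(clip_flags, clip_seconds):
    """Merge runs of consecutive flagged clips into (start, end) intervals in seconds.

    clip_flags: list of booleans, one per fixed-length clip (True = interesting).
    clip_seconds: duration of each clip in seconds (assumed constant).
    """
    intervals = []
    start = None
    for i, flagged in enumerate(clip_flags):
        if flagged and start is None:
            start = i  # a new run of interesting clips begins
        elif not flagged and start is not None:
            intervals.append((start * clip_seconds, i * clip_seconds))
            start = None
    if start is not None:  # close a run that reaches the end of the video
        intervals.append((start * clip_seconds, len(clip_flags) * clip_seconds))
    return intervals


flags = [False, True, True, False, False, True]
print(highlight_intervals(flags, 2.0))  # [(2.0, 6.0), (10.0, 12.0)]
```

The resulting intervals could then be cut from the source video and concatenated into a highlight reel with any video tool.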

[124] LAMM-ViT: AI Face Detection via Layer-Aware Modulation of Region-Guided Attention

Jiangling Zhang,Weijie Zhu,Jirui Huang,Yaxiong Chen

Main category: cs.CV

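TL;DR: LAMM-ViT improves the robustness and generalization of AI-synthesized face detection through region-guided multi-head attention and layer-aware mask modulation.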

Details

Motivation: Existing methods struggle to capture structural relationships between facial regions across different generation techniques, so they perform poorly on faces from new generative models. Method: Proposes LAMM-ViT, which combines region-guided multi-head attention with layer-aware mask modulation to adjust regional focus dynamically. Result: In cross-model generalization tests, LAMM-ViT reaches 94.09% mean ACC and 98.62% mean AP, improvements of 5.45% and 3.09% over the prior state of the art. Conclusion: LAMM-ViT generalizes strongly and can help counter evolving synthetic-media threats. Abstract: Detecting AI-synthetic faces presents a critical challenge: it is hard to capture consistent structural relationships between facial regions across diverse generation techniques. Current methods, which focus on specific artifacts rather than fundamental inconsistencies, often fail when confronted with novel generative models. To address this limitation, we introduce Layer-aware Mask Modulation Vision Transformer (LAMM-ViT), a Vision Transformer designed for robust facial forgery detection. This model integrates distinct Region-Guided Multi-Head Attention (RG-MHA) and Layer-aware Mask Modulation (LAMM) components within each layer. RG-MHA utilizes facial landmarks to create regional attention masks, guiding the model to scrutinize architectural inconsistencies across different facial areas. Crucially, the separate LAMM module dynamically generates layer-specific parameters, including mask weights and gating values, based on network context. These parameters then modulate the behavior of RG-MHA, enabling adaptive adjustment of regional focus across network depths. This architecture facilitates the capture of subtle, hierarchical forgery cues ubiquitous among diverse generation techniques, such as GANs and Diffusion Models. In cross-model generalization tests, LAMM-ViT demonstrates superior performance, achieving 94.09% mean ACC (a +5.45% improvement over SoTA) and 98.62% mean AP (a +3.09% improvement). These results demonstrate LAMM-ViT's exceptional ability to generalize and its potential for reliable deployment against evolving synthetic media threats.

[125] BodyGPS: Anatomical Positioning System

Halid Ziya Yerebakan,Kritika Iyer,Xueqi Guo,Yoshihisa Shinagawa,Gerardo Hermosillo Valadez

Main category: cs.CV

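TL;DR: Proposes a new type of foundation model for parsing human anatomy in medical images that supports multiple modalities and training regimes and handles a range of tasks efficiently.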

Details

Motivation: Meet the need for multi-modality, multi-task parsing of medical images while improving efficiency and flexibility. Method: Trains a neural network estimator that maps query locations to atlas coordinates, with sparse sampling of the input for efficiency. Result: Utility is demonstrated on CT and MRI, with response times under 1 ms. Conclusion: The model is efficient and broadly applicable to medical image parsing. Abstract: We introduce a new type of foundational model for parsing human anatomy in medical images that works for different modalities. It supports supervised or unsupervised training and can perform matching, registration, classification, or segmentation with or without user interaction. We achieve this by training a neural network estimator that maps query locations to atlas coordinates via regression. Efficiency is improved by sparsely sampling the input, enabling response times of less than 1 ms without additional accelerator hardware. We demonstrate the utility of the algorithm in both CT and MRI modalities.

[126] Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets

Weiyu Li,Xuanyang Zhang,Zheng Sun,Di Qi,Hao Li,Wei Cheng,Weiwei Cai,Shihao Wu,Jiarui Liu,Zihao Wang,Xiao Chen,Feipeng Tian,Jianxiong Pan,Zeming Li,Gang Yu,Xiangyu Zhang,Daxin Jiang,Ping Tan

Main category: cs.CV

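TL;DR: Step1X-3D is an open-source framework that addresses data scarcity, algorithmic limitations, and ecosystem fragmentation in 3D generation through a high-quality dataset, a two-stage 3D generation architecture, and open-source tooling.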

Details

Motivation: 3D generation lags behind other modalities because of data scarcity, algorithmic limitations, and ecosystem fragmentation; Step1X-3D aims to advance research on controllable 3D asset generation through an open framework. Method: (1) A data pipeline that processes more than 5M assets into a 2M high-quality dataset; (2) a two-stage architecture combining a VAE-DiT geometry generator with a diffusion-based texture synthesis module; (3) open-source models and code. Result: Outperforms existing open-source methods on benchmarks, is competitive with proprietary solutions, and supports direct transfer of 2D control techniques to 3D generation. Conclusion: By improving data quality, algorithmic fidelity, and reproducibility, Step1X-3D aims to set a new standard for open research in controllable 3D asset generation. Abstract: While generative artificial intelligence has advanced significantly across text, image, audio, and video domains, 3D generation remains comparatively underdeveloped due to fundamental challenges such as data scarcity, algorithmic limitations, and ecosystem fragmentation. To this end, we present Step1X-3D, an open framework addressing these challenges through: (1) a rigorous data curation pipeline processing >5M assets to create a 2M high-quality dataset with standardized geometric and textural properties; (2) a two-stage 3D-native architecture combining a hybrid VAE-DiT geometry generator with a diffusion-based texture synthesis module; and (3) the full open-source release of models, training code, and adaptation modules. For geometry generation, the hybrid VAE-DiT component produces TSDF representations by employing perceiver-based latent encoding with sharp edge sampling for detail preservation. The diffusion-based texture synthesis module then ensures cross-view consistency through geometric conditioning and latent-space synchronization. Benchmark results demonstrate state-of-the-art performance that exceeds existing open-source methods, while also achieving competitive quality with proprietary solutions. Notably, the framework uniquely bridges the 2D and 3D generation paradigms by supporting direct transfer of 2D control techniques (e.g., LoRA) to 3D synthesis. By simultaneously advancing data quality, algorithmic fidelity, and reproducibility, Step1X-3D aims to establish new standards for open research in controllable 3D asset generation.

[127] Continuous Visual Autoregressive Generation via Score Maximization

Chenze Shao,Fandong Meng,Jie Zhou

Main category: cs.CV

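TL;DR: The paper proposes a Continuous Visual AutoRegressive (VAR) framework that generates continuous data directly, based on strictly proper scoring rules, avoiding the information loss of quantization-based approaches.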

Details

Motivation: Conventional autoregressive models must quantize continuous visual data, which loses information; this work aims to remove that step. Method: Proposes a Continuous VAR framework that uses a strictly proper scoring rule (such as the energy score) as the training objective, which is likelihood-free and so does not require explicit probabilistic predictions. Result: The framework supports direct generation of continuous data, and earlier continuous approaches such as GIVT and diffusion loss can be derived from it using other strictly proper scores. Conclusion: Continuous VAR offers an efficient and flexible approach to continuous visual data generation. Abstract: Conventional wisdom suggests that autoregressive models are used to process discrete data. When applied to continuous modalities such as visual data, Visual AutoRegressive modeling (VAR) typically resorts to quantization-based approaches to cast the data into a discrete space, which can introduce significant information loss. To tackle this issue, we introduce a Continuous VAR framework that enables direct visual autoregressive generation without vector quantization. The underlying theoretical foundation is strictly proper scoring rules, which provide powerful statistical tools capable of evaluating how well a generative model approximates the true distribution. Within this framework, all we need is to select a strictly proper score and set it as the training objective to optimize. We primarily explore a class of training objectives based on the energy score, which is likelihood-free and thus overcomes the difficulty of making probabilistic predictions in the continuous space. Previous efforts on continuous autoregressive generation, such as GIVT and diffusion loss, can also be derived from our framework using other strictly proper scores. Source code: https://github.com/shaochenze/EAR.
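
To make the energy score concrete, here is a minimal sample-based sketch of the standard definition, written as a loss (lower is better): the mean distance from model samples to the target, minus half the mean pairwise distance between samples. It is strictly proper for exponents 0 < beta < 2. The function name and the plain-Python vectors are illustrative assumptions; the paper's actual training objective and sign convention may differ.

```python
import math


def energy_score_loss(samples, target, beta=1.0):
    """Sample-based energy score as a loss (lower is better), for 0 < beta < 2.

    samples: list of at least two model samples, each a list of floats.
    target: the observed data point, a list of floats of the same length.
    """
    m = len(samples)

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Fit term: how close the samples are to the observed target.
    fit = sum(dist(s, target) ** beta for s in samples) / m
    # Spread term: average distance between distinct samples, rewarding diversity.
    spread = sum(
        dist(samples[i], samples[j]) ** beta
        for i in range(m)
        for j in range(m)
        if i != j
    ) / (m * (m - 1))
    return fit - 0.5 * spread


# Samples spread around the target score better than samples collapsed far from it.
print(energy_score_loss([[0.0], [2.0]], [1.0]))  # 0.0
print(energy_score_loss([[0.0], [0.0]], [1.0]))  # 1.0
```

Because the loss only needs samples from the model, no density has to be evaluated, which is what "likelihood-free" refers to in the abstract.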

[128] DanceGRPO: Unleashing GRPO on Visual Generation

Zeyue Xue,Jie Wu,Yu Gao,Fangyuan Kong,Lingting Zhu,Mengzhao Chen,Zhiheng Liu,Wei Liu,Qiushan Guo,Weilin Huang,Ping Luo

Main category: cs.CV

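TL;DR: DanceGRPO is a unified RL framework that applies to multiple generative models and tasks, delivering substantial performance gains and addressing the limitations of existing methods.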

Details

Motivation: Generative models have achieved breakthroughs in visual content creation, but aligning them with human preferences remains challenging, and existing RL methods suffer from compatibility, stability, and validation gaps. Method: Introduces DanceGRPO, which applies GRPO to visual generation and supports multiple generative paradigms, tasks, foundation models, and reward models. Result: DanceGRPO outperforms baselines by up to 181% on several benchmarks and stabilizes policy optimization for complex video generation. Conclusion: DanceGRPO offers a robust, general solution for RLHF tasks in visual generation and advances the combination of reinforcement learning with visual synthesis. Abstract: Recent breakthroughs in generative models, particularly diffusion models and rectified flows, have revolutionized visual content creation, yet aligning model outputs with human preferences remains a critical challenge. Existing reinforcement learning (RL)-based methods for visual generation face critical limitations: incompatibility with modern Ordinary Differential Equations (ODEs)-based sampling paradigms, instability in large-scale training, and lack of validation for video generation. This paper introduces DanceGRPO, the first unified framework to adapt Group Relative Policy Optimization (GRPO) to visual generation paradigms, unleashing one unified RL algorithm across two generative paradigms (diffusion models and rectified flows), three tasks (text-to-image, text-to-video, image-to-video), four foundation models (Stable Diffusion, HunyuanVideo, FLUX, SkyReel-I2V), and five reward models (image/video aesthetics, text-image alignment, video motion quality, and binary reward). To our knowledge, DanceGRPO is the first RL-based unified framework capable of seamless adaptation across diverse generative paradigms, tasks, foundational models, and reward models. DanceGRPO demonstrates consistent and substantial improvements, outperforming baselines by up to 181% on benchmarks such as HPS-v2.1, CLIP Score, VideoAlign, and GenEval. Notably, DanceGRPO not only stabilizes policy optimization for complex video generation, but also enables the generative policy to better capture denoising trajectories for Best-of-N inference scaling and to learn from sparse binary feedback. Our results establish DanceGRPO as a robust and versatile solution for scaling Reinforcement Learning from Human Feedback (RLHF) tasks in visual generation, offering new insights into harmonizing reinforcement learning and visual synthesis. The code will be released.
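
The core idea of GRPO, scoring each sample relative to the other samples generated for the same prompt, can be sketched as below. This shows only the group-relative advantage normalization; the use of the population standard deviation and the `eps` constant are assumptions, and nothing here reflects DanceGRPO-specific details, which this digest does not give.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt.

    rewards: list of scalar rewards, one per generated sample in the group.
    eps: small constant to avoid division by zero when all rewards are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Samples above the group mean get positive advantages, those below get negative ones.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
print(advantages)
```

Since the baseline is the group mean, no separate value network is needed, and a group with identical rewards (for example, all-zero binary rewards) yields zero advantages for every sample.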

cs.GR [Back]

[129] Virtualized 3D Gaussians: Flexible Cluster-based Level-of-Detail System for Real-Time Rendering of Composed Scenes

Xijie Yang,Linning Xu,Lihan Jiang,Dahua Lin,Bo Dai

Main category: cs.GR

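TL;DR: 3D Gaussian Splatting (3DGS) reconstructs complex 3D assets from 3D Gaussian primitives but renders large-scale scenes inefficiently. The paper proposes V3DG, a cluster-based LOD scheme that balances rendering efficiency and visual quality in two stages.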

Details

Motivation: Address the real-time rendering challenge that 3DGS faces in large-scale compositions (such as crowd-level scenes), where the number of Gaussian primitives becomes very large. Method: Proposes V3DG with two stages: Offline Build (generating hierarchical Gaussian clusters) and Online Selection (dynamically selecting the perceptible clusters). Result: Experiments show that V3DG balances rendering efficiency and visual quality across user-defined tolerances. Conclusion: V3DG provides an efficient rendering scheme for large-scale 3DGS assets and supports downstream interactive applications. Abstract: 3D Gaussian Splatting (3DGS) enables the reconstruction of intricate digital 3D assets from multi-view images by leveraging a set of 3D Gaussian primitives for rendering. Its explicit and discrete representation facilitates the seamless composition of complex digital worlds, offering significant advantages over previous neural implicit methods. However, when applied to large-scale compositions, such as crowd-level scenes, it can encompass numerous 3D Gaussians, posing substantial challenges for real-time rendering. To address this, inspired by Unreal Engine 5's Nanite system, we propose Virtualized 3D Gaussians (V3DG), a cluster-based LOD solution that constructs hierarchical 3D Gaussian clusters and dynamically selects only the necessary ones to accelerate rendering speed. Our approach consists of two stages: (1) Offline Build, where hierarchical clusters are generated using a local splatting method to minimize visual differences across granularities, and (2) Online Selection, where footprint evaluation determines perceptible clusters for efficient rasterization during rendering. We curate a dataset of synthetic and real-world scenes, including objects, trees, people, and buildings, each requiring 0.1 billion 3D Gaussians to capture fine details. Experiments show that our solution balances rendering efficiency and visual quality across user-defined tolerances, facilitating downstream interactive applications that compose extensive 3DGS assets for consistent rendering performance.
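
The Online Selection stage can be illustrated with a generic level-of-detail traversal: descend the cluster hierarchy only while a cluster's projected footprint exceeds a tolerance. This is a sketch of the general technique, not V3DG's actual footprint evaluation; the node layout, the pinhole-style footprint estimate, and all names are hypothetical.

```python
def select_clusters(node, distances, focal_px, tolerance_px):
    """Pick the coarsest clusters whose screen footprint stays within tolerance.

    node: dict with 'id', 'radius' (world units), and 'children' (list of nodes).
    distances: dict mapping cluster id -> distance from the camera (world units).
    focal_px: focal length in pixels (pinhole approximation, an assumption).
    tolerance_px: user-defined footprint tolerance in pixels.
    """
    footprint = focal_px * node["radius"] / max(distances[node["id"]], 1e-6)
    if footprint <= tolerance_px or not node["children"]:
        return [node["id"]]  # coarse enough, or no finer level exists
    selected = []
    for child in node["children"]:
        selected.extend(select_clusters(child, distances, focal_px, tolerance_px))
    return selected


tree = {
    "id": "root", "radius": 4.0, "children": [
        {"id": "a", "radius": 1.0, "children": []},
        {"id": "b", "radius": 2.0, "children": []},
    ],
}
dist = {"root": 10.0, "a": 20.0, "b": 5.0}
print(select_clusters(tree, dist, focal_px=100.0, tolerance_px=10.0))  # ['a', 'b']
print(select_clusters(tree, dist, focal_px=100.0, tolerance_px=50.0))  # ['root']
```

A looser tolerance stops the traversal higher in the hierarchy, so fewer primitives are rasterized at the cost of detail.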

[130] Gaussian Wave Splatting for Computer-Generated Holography

Suyeon Choi,Brian Chao,Jacqueline Yang,Manu Gopakumar,Gordon Wetzstein

Main category: cs.GR

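TL;DR: The paper proposes Gaussian Wave Splatting, an efficient algorithm that turns Gaussian scene representations into holograms with support for occlusions and view-dependent effects.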

Details

Motivation: Existing computer-generated holography (CGH) algorithms cannot accurately handle occlusions and view-dependent effects, which motivates combining them with neural rendering. Method: Derives a closed-form 2D Gaussian-to-hologram transform that supports occlusions and alpha blending, plus an efficient Fourier-domain approximation implemented with custom CUDA kernels. Result: The method produces photorealistic holograms with occlusions and view-dependent effects. Conclusion: By combining neural rendering with holographic display technology, Gaussian Wave Splatting paves the way for next-generation holographic displays. Abstract: State-of-the-art neural rendering methods optimize Gaussian scene representations from a few photographs for novel-view synthesis. Building on these representations, we develop an efficient algorithm, dubbed Gaussian Wave Splatting, to turn these Gaussians into holograms. Unlike existing computer-generated holography (CGH) algorithms, Gaussian Wave Splatting supports accurate occlusions and view-dependent effects for photorealistic scenes by leveraging recent advances in neural rendering. Specifically, we derive a closed-form solution for a 2D Gaussian-to-hologram transform that supports occlusions and alpha blending. Inspired by classic computer graphics techniques, we also derive an efficient approximation of the aforementioned process in the Fourier domain that is easily parallelizable and implement it using custom CUDA kernels. By integrating emerging neural rendering pipelines with holographic display technology, our Gaussian-based CGH framework paves the way for next-generation holographic displays.

[131] A Gpu-based solution for large-scale skeletal animation simulation

Xi Pan

Main category: cs.GR

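TL;DR: Proposes a GPU-based skeletal animation optimization that uses a parallel prefix tree update to resolve the performance problems of complex skeletons.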

Details

Motivation: Traditional GPU solutions hit performance bottlenecks or functional limits on complex skeletal animation (such as skeletons with many hierarchy levels), so a more efficient approach is needed. Method: Applies a parallel prefix tree update to the animation update of complex skeletons, combined with traditional solutions, without pre-simulating or baking animation data. Result: The new solution outperforms traditional methods on complex skeletons and supports more animation computations. Conclusion: The solution offers a new option for optimizing large-scale skeletal animation simulation and widens the application range of GPU-based skeletal animation. Abstract: Skeletal animations of large-scale characters are widely used in video games. However, when a large number of characters are involved, relying on the CPU to calculate skeletal animations leads to significant performance problems. There are two main types of traditional GPU-based solutions. One is referred to as pre-baked animation texture technology. The problem with this solution is that it can only play animations from the pre-baked animation. It is impossible to perform interpolation, blending and other calculations on the animation, which affects the quality of the animations. The other solution is referred to as dedicated processing with a simple skeleton hierarchy (the number of skeleton levels < 64). This option does not need to simulate and bake animation data in advance. However, performance is dramatically impaired when processing complex skeletons with too many skeleton levels (such as fluttering clothing, soft plants, dragon-like creatures, etc.). In order to solve these issues, we developed a parallel prefix tree update solution to optimize the animation update process of complex skeletons with too many levels, and combined traditional solutions to implement a GPU-based skeletal animation solution. This solution does not need to simulate and bake animation results. In addition, the performance is superior to traditional solutions for complex skeletons with too many levels. Our work can provide a new option for optimizing the performance of large-scale skeletal animation simulations, giving GPU-based skeletal animations a wider range of application scenarios.
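
One well-known way to run a prefix computation over a tree in parallel is pointer jumping, which needs a number of passes logarithmic in the hierarchy depth rather than one pass per level. The sketch below simulates it on the CPU with scalar offsets standing in for bone transforms. Whether the paper uses this exact scheme is not stated in the abstract, so treat it as an illustration of the general idea; all names are hypothetical.

```python
def world_offsets(parents, local_offsets):
    """Accumulate local offsets along each bone's parent chain by pointer jumping.

    parents[i] is the parent bone index of bone i, or -1 for a root.
    local_offsets[i] is bone i's offset relative to its parent (a scalar here;
    a real skeleton would compose matrices in parent-then-child order).
    """
    acc = list(local_offsets)
    ptr = list(parents)
    # Each while-iteration is one parallel pass; reads use the previous pass's buffers.
    while any(p != -1 for p in ptr):
        new_acc = list(acc)
        new_ptr = list(ptr)
        for i, p in enumerate(ptr):  # on a GPU, one thread per bone
            if p != -1:
                new_acc[i] = acc[p] + acc[i]  # fold in the ancestor segment
                new_ptr[i] = ptr[p]           # jump to the ancestor's pointer
        acc, ptr = new_acc, new_ptr
    return acc


# A five-bone chain resolves in three passes instead of four sequential levels.
print(world_offsets([-1, 0, 1, 2, 3], [1, 1, 1, 1, 1]))  # [1, 2, 3, 4, 5]
# A small branching skeleton.
print(world_offsets([-1, 0, 0, 1], [10, 1, 2, 3]))  # [10, 11, 12, 14]
```

The double buffering (`new_acc`, `new_ptr`) matters: every bone must read values from the previous pass, which is what makes the passes safe to execute in parallel.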

cs.CL [Back]

[132] ScaleMCP: Dynamic and Auto-Synchronizing Model Context Protocol Tools for LLM Agents

Elias Lumer,Anmol Gulati,Vamse Kumar Subbiah,Pradeep Honaganahalli Basavaraju,James A. Burke

Main category: cs.CL

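TL;DR: ScaleMCP is a novel tool selection approach that pairs a dynamically integrated MCP tool retriever with an auto-synchronizing tool storage system, addressing gaps in existing tool selection frameworks and substantially improving tool retrieval and agent invocation performance.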

Details

Motivation: Existing tool selection frameworks do not integrate MCP servers and rely on manual updates to local tool repositories, which is inefficient and prevents dynamic re-querying. Method: Proposes ScaleMCP, comprising a dynamic tool retriever, an auto-synchronizing storage system, and the TDWA embedding strategy. Result: Evaluation on a dataset of 5,000 financial metric MCP servers shows substantial improvements in tool retrieval and agent invocation performance. Conclusion: ScaleMCP provides an efficient, scalable solution for dynamic tool selection and invocation. Abstract: Recent advancements in Large Language Models (LLMs) and the introduction of the Model Context Protocol (MCP) have significantly expanded LLM agents' capability to interact dynamically with external tools and APIs. However, existing tool selection frameworks do not integrate MCP servers, instead relying heavily on error-prone manual updates to monolithic local tool repositories, leading to duplication, inconsistencies, and inefficiencies. Additionally, current approaches abstract tool selection before the LLM agent is invoked, limiting its autonomy and hindering dynamic re-querying capabilities during multi-turn interactions. To address these issues, we introduce ScaleMCP, a novel tool selection approach that dynamically equips LLM agents with an MCP tool retriever, giving agents the autonomy to add tools into their memory, as well as an auto-synchronizing tool storage system pipeline through CRUD (create, read, update, delete) operations with MCP servers as the single source of truth. We also propose a novel embedding strategy, Tool Document Weighted Average (TDWA), designed to selectively emphasize critical components of tool documents (e.g., tool name or synthetic questions) during the embedding process. Comprehensive evaluations conducted on a created dataset of 5,000 financial metric MCP servers, across 10 LLM models, 5 embedding models, and 5 retriever types, demonstrate substantial improvements in tool retrieval and agent invocation performance, emphasizing ScaleMCP's effectiveness in scalable, dynamic tool selection and invocation.
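
The idea behind a weighted-average document embedding can be sketched as follows: embed each component of a tool document separately, then combine the vectors with per-component weights. The component names, weights, and toy vectors are hypothetical, and normalizing the weights to sum to one is an assumption; the abstract does not give TDWA's exact formulation.

```python
def weighted_average_embedding(component_vectors, weights):
    """Combine per-component embeddings into one vector using normalized weights.

    component_vectors: dict mapping component name -> embedding (list of floats).
    weights: dict mapping the same component names -> non-negative weight.
    """
    total = sum(weights[name] for name in component_vectors)
    dim = len(next(iter(component_vectors.values())))
    combined = [0.0] * dim
    for name, vector in component_vectors.items():
        share = weights[name] / total  # normalized weight for this component
        for k in range(dim):
            combined[k] += share * vector[k]
    return combined


# Toy 2-D vectors: the tool name is weighted three times as heavily as the description.
vectors = {"name": [1.0, 0.0], "description": [0.0, 1.0]}
weights = {"name": 3.0, "description": 1.0}
print(weighted_average_embedding(vectors, weights))  # [0.75, 0.25]
```

Raising a component's weight pulls the stored tool vector toward that component, so queries resembling, say, the tool name retrieve it more readily.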

[133] Is your multimodal large language model a good science tutor?

Ming Liu,Liwen Wang,Wensheng Zhang

Main category: cs.CL

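TL;DR: The paper proposes a framework for evaluating multimodal large language models (MLLMs) as science tutors, using an educational rubric and a simulated student model, and uses the resulting judgments to improve tutoring ability.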

Details

Motivation: Existing benchmarks focus mainly on answer accuracy and overlook the need to assess teaching ability in educational settings. Method: Evaluates MLLMs with an educational rubric and a simulated student model, then fine-tunes a model with preference optimization methods. Result: The optimized models tutor better, and problem-solving ability turns out not to guarantee tutoring ability. Conclusion: The approach points toward MLLMs that combine problem solving with teaching ability. Abstract: Multimodal large language models (MLLMs) demonstrate impressive performance on scientific reasoning tasks (e.g., ScienceQA). However, most existing benchmarks focus narrowly on the accuracy of the final answer while ignoring other metrics. In particular, when applying MLLMs to educational contexts, the goal is not only correctness but also the ability to teach. In this paper, we propose a framework that evaluates MLLMs as science tutors using a comprehensive educational rubric and a simulated student model that judges the teaching performance of the tutors. Given a list of candidate MLLM science tutors, we use rubric-based student judgments to produce a range of tutor performance scores, identifying both strong and weak tutors. Using the training section of the ScienceQA dataset, we then construct a dataset of pairwise comparisons between the outputs of strong and weak tutors. This enables us to apply multiple preference optimization methods to fine-tune an underperforming tutor model (Qwen2-VL-2B) into more effective ones. Our results also show that strong problem-solving skills do not guarantee high-quality tutoring and that performance optimization-guided refinements can yield more educationally aligned tutor models. This approach opens avenues for building MLLMs that serve not only as problem solvers, but as genuinely helpful educational assistants.

[134] xGen-small Technical Report

Erik Nijkamp,Bo Pang,Egor Pakhomov,Akash Gokul,Jin Qu,Silvio Savarese,Yingbo Zhou,Caiming Xiong

Main category: cs.CL

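TL;DR: xGen-small is a family of 4B and 9B Transformer decoder models optimized for long-context applications, with strong performance achieved through vertically integrated data curation and multi-stage training.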

Details

Motivation: Optimize model performance for long-context applications, particularly in math and coding. Method: A vertically integrated pipeline covering data curation, multi-stage pre-training (quality annealing and length extension to 128k tokens), and targeted post-training (supervised fine-tuning, preference learning, and online reinforcement learning). Result: xGen-small performs strongly across tasks, especially math and coding, and excels on long-context benchmarks. Conclusion: The integrated optimization pipeline gives xGen-small strong performance in long-context applications. Abstract: We introduce xGen-small, a family of 4B and 9B Transformer decoder models optimized for long-context applications. Our vertically integrated pipeline unites domain-balanced, frequency-aware data curation; multi-stage pre-training with quality annealing and length extension to 128k tokens; and targeted post-training via supervised fine-tuning, preference learning, and online reinforcement learning. xGen-small delivers strong performance across various tasks, especially in math and coding domains, while excelling at long context benchmarks.

[135] Think in Safety: Unveiling and Mitigating Safety Alignment Collapse in Multimodal Large Reasoning Model

Xinyue Lou,You Li,Jinan Xu,Xiangyu Shi,Chi Chen,Kaiyu Huang

Main category: cs.CL

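TL;DR: The paper runs a systematic safety evaluation of 11 multimodal large reasoning models (MLRMs), reveals widespread safety degradation in advanced models, and proposes using the model's own reasoning ability to detect unsafe intent.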

Details

Motivation: The rapid development of multimodal large reasoning models (MLRMs) opens broad application potential, but their safety and reliability still need systematic study. Method: Evaluates the safety of 11 MLRMs across 5 benchmarks and constructs a multimodal tuning dataset that incorporates a safety-oriented thought process. Result: MLRMs fine-tuned on this dataset show improved safety on both jailbreak robustness and safety-awareness benchmarks. Conclusion: The study offers a new perspective for developing safe MLRMs, and the dataset is publicly available. Abstract: The rapid development of multimodal large reasoning models (MLRMs) has demonstrated broad application potential, yet their safety and reliability remain critical concerns that require systematic exploration. To address this gap, we conduct a comprehensive and systematic safety evaluation of 11 MLRMs across 5 benchmarks and unveil prevalent safety degradation phenomena in most advanced models. Moreover, our analysis reveals distinct safety patterns across different benchmarks: significant safety degradation is observed across jailbreak robustness benchmarks, whereas safety-awareness benchmarks demonstrate less pronounced degradation. In particular, a long thought process in some scenarios even enhances safety performance. Therefore, leveraging the intrinsic reasoning capabilities of the model to detect unsafe intent is a potential approach to addressing safety issues in MLRMs. To operationalize this insight, we construct a multimodal tuning dataset that incorporates a safety-oriented thought process. Experimental results show that fine-tuning existing MLRMs with this dataset effectively enhances safety on both jailbreak robustness and safety-awareness benchmarks. This study provides a new perspective for developing safe MLRMs. Our dataset is available at https://github.com/xinyuelou/Think-in-Safety.

[136] REFINE-AF: A Task-Agnostic Framework to Align Language Models via Self-Generated Instructions using Reinforcement Learning from Automated Feedback

Aniruddha Roy,Pretam Ray,Abhilash Nandy,Somak Aditya,Pawan Goyal

Main category: cs.CL

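TL;DR: The paper explores using open-source small LLMs (LLaMA 2-7B, LLaMA 2-13B, and Mistral 7B) in a semi-automated framework to generate instruction data with less human intervention and cost, and adds reinforcement learning to improve performance further.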

Details

Motivation: Human-annotated instruction data is time-consuming, expensive, and limited in diversity, and existing methods depend on costly large models (such as GPT-3.5), which limits scalability. Method: Generates instruction data with a semi-automated framework built on open-source small LLMs and adds a reinforcement learning algorithm to improve performance. Result: The RL-based framework achieves substantial improvements over previous approaches on 63-66% of tasks. Conclusion: Open-source small LLMs combined with reinforcement learning can generate high-quality instruction data efficiently, lowering cost and improving performance. Abstract: Instruction-based Large Language Models (LLMs) have proven effective in numerous few-shot or zero-shot Natural Language Processing (NLP) tasks. However, creating human-annotated instruction data is time-consuming, expensive, and often limited in quantity and task diversity. Previous research endeavors have attempted to address this challenge by proposing frameworks capable of generating instructions in a semi-automated and task-agnostic manner directly from the model itself. Many of these efforts have relied on large API-only parameter-based models such as GPT-3.5 (175B), which are expensive and subject to limits on the number of queries. This paper explores the performance of three open-source small LLMs, LLaMA 2-7B, LLaMA 2-13B, and Mistral 7B, using a semi-automated framework, thereby reducing the human intervention, effort, and cost required to generate an instruction dataset for fine-tuning LLMs. Furthermore, we demonstrate that incorporating a Reinforcement Learning (RL) based training algorithm into this LLM-based framework leads to further enhancements. Our evaluation of the dataset reveals that these RL-based frameworks achieve substantial improvements in 63-66% of the tasks compared to previous approaches.

[137] References Indeed Matter? Reference-Free Preference Optimization for Conversational Query Reformulation

Doyoung Kim,Youngjun Lee,Joeun Kim,Jihwan Bang,Hwanjun Song,Susik Yoon,Jae-Gil Lee

Main category: cs.CL

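TL;DR: DualReform is a reference-free optimization framework for conversational query reformulation that generates pseudo reference passages and exploits the dual role of CQR, substantially improving retrieval performance.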

Details

Motivation: Existing methods rely on reference passages for optimization, which are hard to obtain in practice, so a reference-free solution is needed. Method: Proposes DualReform, which infers pseudo reference passages from responses and refines those responses through the dual role of CQR. Result: DualReform achieves 96.9-99.1% of the retrieval accuracy attainable with reference passages and surpasses the state-of-the-art method by up to 31.6%. Conclusion: DualReform substantially improves retrieval for conversational query reformulation without requiring reference passages. Abstract: Conversational query reformulation (CQR) has become indispensable for improving retrieval in dialogue-based applications. However, existing approaches typically rely on reference passages for optimization, which are impractical to acquire in real-world scenarios. To address this limitation, we introduce a novel reference-free preference optimization framework, DualReform, that generates pseudo reference passages from commonly encountered conversational datasets containing only queries and responses. DualReform attains this goal through two key innovations: (1) response-based inference, where responses serve as proxies to infer pseudo reference passages, and (2) response refinement via the dual role of CQR, where a CQR model refines responses based on the shared objectives between response refinement and CQR. Despite not relying on reference passages, DualReform achieves 96.9-99.1% of the retrieval accuracy attainable only with reference passages and surpasses the state-of-the-art method by up to 31.6%.

[138] MacRAG: Compress, Slice, and Scale-up for Multi-Scale Adaptive Context RAG

Woosang Lim,Zekun Li,Gyuwan Kim,Sungyoung Ji,HyeonJung Kim,Kyuri Choi,Jin Hyuk Lim,Kyungpyo Park,William Yang Wang

Main category: cs.CL

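TL;DR: MacRAG is a hierarchical retrieval framework that builds multi-scale adaptive contexts to improve retrieval precision and coverage on long-context and multi-hop reasoning tasks.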

Details

Motivation: Existing RAG systems suffer from imprecise retrieval, incomplete context coverage, and fragmented information; MacRAG aims to address these problems. Method: MacRAG compresses and partitions documents hierarchically, then adaptively merges relevant contexts in real time through chunk-level and document-level expansions. Result: On the LongBench expansions of multi-hop QA tasks, MacRAG consistently outperforms baseline RAG pipelines across several models (Llama-3.1-8B, Gemini-1.5-pro, and GPT-4o). Conclusion: MacRAG offers an efficient, scalable solution for long-context, multi-hop reasoning. Abstract: Long-context (LC) Large Language Models (LLMs) combined with Retrieval-Augmented Generation (RAG) hold strong potential for complex multi-hop and large-document tasks. However, existing RAG systems often suffer from imprecise retrieval, incomplete context coverage under constrained context windows, and fragmented information caused by suboptimal context construction. We introduce Multi-scale Adaptive Context RAG (MacRAG), a hierarchical retrieval framework that compresses and partitions documents into coarse-to-fine granularities, then adaptively merges relevant contexts through chunk- and document-level expansions in real time. By starting from the finest-level retrieval and progressively incorporating higher-level and broader context, MacRAG constructs effective query-specific long contexts, optimizing both precision and coverage. Evaluations on the challenging LongBench expansions of HotpotQA, 2WikiMultihopQA, and Musique confirm that MacRAG consistently surpasses baseline RAG pipelines on single- and multi-step generation with Llama-3.1-8B, Gemini-1.5-pro, and GPT-4o. Our results establish MacRAG as an efficient, scalable solution for real-world long-context, multi-hop reasoning. Our code is available at https://github.com/Leezekun/MacRAG.

[139] Evaluating LLM-Generated Q&A Test: a Student-Centered Study

Anna Wróblewska,Bartosz Grabek,Jakub Świstak,Daniel Dan

Main category: cs.CL

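TL;DR: The study presents a pipeline for automatically generating reliable Q&A tests with AI chatbots and evaluates their psychometric and perceived-quality metrics with students and experts, finding performance comparable to human-authored tests.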

Details

Motivation: Develop a scalable, AI-assisted way to generate high-quality Q&A tests automatically and reduce manual effort. Method: Uses GPT-4o-mini to generate a Q&A test for a Natural Language Processing course, then evaluates its psychometric properties and perceived quality through IRT analysis and user ratings. Result: The generated items show strong discrimination and appropriate difficulty, student and expert ratings indicate high quality, and only two items were flagged for review. Conclusion: LLM-generated tests can match human-authored tests in psychometric performance and user satisfaction, demonstrating the scalability of AI-assisted assessment. Abstract: This research prepares an automatic pipeline for generating reliable question-answer (Q&A) tests using AI chatbots. We automatically generated a GPT-4o-mini-based Q&A test for a Natural Language Processing course and evaluated its psychometric and perceived-quality metrics with students and experts. A mixed-format IRT analysis showed that the generated items exhibit strong discrimination and appropriate difficulty, while student and expert star ratings reflect high overall quality. A uniform DIF check identified two items for review. These findings demonstrate that LLM-generated assessments can match human-authored tests in psychometric performance and user satisfaction, illustrating a scalable approach to AI-assisted assessment development.
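
For readers unfamiliar with the terms "discrimination" and "difficulty", the two-parameter logistic (2PL) item response function is a common way to express them. The abstract mentions a mixed-format IRT analysis without naming the model, so the 2PL form here is an assumption used only for illustration; the parameter values are made up.

```python
import math


def probability_correct(ability, discrimination, difficulty):
    """2PL item response function: chance that a test taker answers an item correctly.

    ability: latent ability of the test taker (theta).
    discrimination: how sharply the item separates low from high ability (a).
    difficulty: ability level at which the chance of success is 50% (b).
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


# At ability equal to the item difficulty, the probability is exactly one half.
print(probability_correct(0.0, 1.5, 0.0))  # 0.5
# A more discriminating item separates students on either side of b more strongly.
print(probability_correct(1.0, 0.5, 0.0) < probability_correct(1.0, 2.0, 0.0))  # True
```

An item with "strong discrimination and appropriate difficulty" is one whose curve is steep and centered near the ability range of the students being tested.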

[140] Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation

Galann Pennec,Zhengyuan Liu,Nicholas Asher,Philippe Muller,Nancy F. Chen

Main category: cs.CL

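TL;DR: This paper proposes a zero-shot video-to-text summarization approach that builds a screenplay representation integrating video, dialogue, and character information, and introduces the multimodal metric MFactSum; it outperforms existing methods on the SummScreen3D dataset.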

Details

Motivation: Vision-Language Models (VLMs) struggle to balance visual and textual information on complex multimodal inputs such as entire TV episodes, and existing summarization metrics cannot adequately assess multimodal content. Method: Proposes a zero-shot video-to-text summarization approach that generates a screenplay representation and names characters using only audio, video, and transcripts, and introduces the multimodal metric MFactSum. Result: On SummScreen3D, the generated summaries contain 20% more relevant visual information than Gemini 1.5 while requiring 75% less of the video as input. Conclusion: The approach integrates multimodal information effectively and improves summary quality, and MFactSum offers a new way to evaluate multimodal summaries. Abstract: Vision-Language Models (VLMs) often struggle to balance visual and textual information when summarizing complex multimodal inputs, such as entire TV show episodes. In this paper, we propose a zero-shot video-to-text summarization approach that builds its own screenplay representation of an episode, effectively integrating key video moments, dialogue, and character information into a unified document. Unlike previous approaches, we simultaneously generate screenplays and name the characters in zero-shot, using only the audio, video, and transcripts as input. Additionally, we highlight that existing summarization metrics can fail to assess the multimodal content in summaries. To address this, we introduce MFactSum, a multimodal metric that evaluates summaries with respect to both vision and text modalities. Using MFactSum, we evaluate our screenplay summaries on the SummScreen3D dataset, demonstrating superiority against state-of-the-art VLMs such as Gemini 1.5 by generating summaries containing 20% more relevant visual information while requiring 75% less of the video as input.

[141] Bridging the Gap: An Intermediate Language for Enhanced and Cost-Effective Grapheme-to-Phoneme Conversion with Homographs with Multiple Pronunciations Disambiguation

Abbas Bertina,Shahab Beirami,Hossein Biniazian,Elham Esmaeilnia,Soheil Shahi,Mahdi Pirnia

Main category: cs.CL

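TL;DR: This paper proposes an intermediate-language approach to Persian G2P conversion that combines LLM prompting techniques with a sequence-to-sequence machine transliteration architecture and significantly improves Phoneme Error Rate (PER).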

Details

Motivation: Persian's complex phonological features (such as homographs and Ezafe) pose challenges in both formal and informal contexts, calling for a more effective solution. Method: Combine LLM prompting techniques with a sequence-to-sequence machine transliteration architecture, build a lexical database of homographs with multiple pronunciations, and use formal concept analysis for semantic differentiation. Result: Experiments show the method outperforms existing techniques on Persian phoneme conversion and significantly reduces PER. Conclusion: The method provides a robust solution for Persian text-to-speech systems and can extend to other languages with rich homographic phenomena, such as Chinese and Arabic. Abstract: Grapheme-to-phoneme (G2P) conversion for Persian presents unique challenges due to its complex phonological features, particularly homographs and Ezafe, which exist in formal and informal language contexts. This paper introduces an intermediate language specifically designed for Persian language processing that addresses these challenges through a multi-faceted approach. Our methodology combines two key components: Large Language Model (LLM) prompting techniques and a specialized sequence-to-sequence machine transliteration architecture. We developed and implemented a systematic approach for constructing a comprehensive lexical database for disambiguating homographs with multiple pronunciations, often termed polyphones, utilizing formal concept analysis for semantic differentiation. We train our model using two distinct datasets: the LLM-generated dataset for formal and informal Persian and the B-Plus podcasts for informal language variants. The experimental results demonstrate superior performance compared to existing state-of-the-art approaches, particularly in handling the complexities of Persian phoneme conversion. Our model significantly improves Phoneme Error Rate (PER) metrics, establishing a new benchmark for Persian G2P conversion accuracy. This work contributes to the growing research in low-resource language processing and provides a robust solution for Persian text-to-speech systems, while demonstrating its applicability beyond Persian. Specifically, the approach can extend to languages with rich homographic phenomena such as Chinese and Arabic.

[142] Using External knowledge to Enhanced PLM for Semantic Matching

Min Li,Chun Yuan

Main category: cs.CL

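TL;DR: The paper explores how to use external knowledge to enhance a pre-trained semantic relevance discrimination model, and verifies performance gains on 10 public datasets.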

Details

Motivation: Large-scale annotated data and complex models such as neural networks perform well on semantic relevance tasks, but is such data alone sufficient, and how can external knowledge be incorporated into a model to improve performance? Method: Use external knowledge to enhance a pre-trained semantic relevance discrimination model. Result: Consistent performance improvements over the baseline model on 10 public datasets. Conclusion: Introducing external knowledge effectively improves the performance of semantic relevance discrimination models. Abstract: Modeling semantic relevance has always been a challenging and critical task in natural language processing. In recent years, with the emergence of massive amounts of annotated data, it has become feasible to train complex models, such as neural network-based reasoning models. These models have shown excellent performance in practical applications and have achieved the current state-of-the-art performance. However, even with such large-scale annotated data, we still need to think: Can machines learn all the knowledge necessary to perform semantic relevance detection tasks based on this data alone? If not, how can neural network-based models incorporate external knowledge into themselves, and how can relevance detection models be constructed to make full use of external knowledge? In this paper, we use external knowledge to enhance the pre-trained semantic relevance discrimination model. Experimental results on 10 public datasets show that our method achieves consistent improvements in performance compared to the baseline model.

[143] Boosting Neural Language Inference via Cascaded Interactive Reasoning

Min Li,Chun Yuan

Main category: cs.CL

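TL;DR: The paper proposes CIRN, a new architecture that improves performance on natural language inference through multi-level feature extraction and interactive reasoning.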

Details

Motivation: Existing methods rely mainly on the final-layer representations of pre-trained language models, which may overlook valuable information in intermediate layers and limit semantic modeling capacity. Method: CIRN uses a hierarchical feature extraction strategy that progressively integrates cross-sentence information in an interactive space, moving from surface-level feature matching to deeper semantic connections. Result: On several standard NLI datasets, CIRN outperforms baseline methods, confirming the value of multi-level interactive features. Conclusion: By mining semantic relationships at multiple levels, CIRN significantly improves performance on complex reasoning tasks. Abstract: Natural Language Inference (NLI) focuses on ascertaining the logical relationship (entailment, contradiction, or neutral) between a given premise and hypothesis. This task presents significant challenges due to inherent linguistic features such as diverse phrasing, semantic complexity, and contextual nuances. While Pre-trained Language Models (PLMs) built upon the Transformer architecture have yielded substantial advancements in NLI, prevailing methods predominantly utilize representations from the terminal layer. This reliance on final-layer outputs may overlook valuable information encoded in intermediate layers, potentially limiting the capacity to model intricate semantic interactions effectively. Addressing this gap, we introduce the Cascaded Interactive Reasoning Network (CIRN), a novel architecture designed for deeper semantic comprehension in NLI. CIRN implements a hierarchical feature extraction strategy across multiple network depths, operating within an interactive space where cross-sentence information is continuously integrated. This mechanism aims to mimic a process of progressive reasoning, transitioning from surface-level feature matching to uncovering more profound logical and semantic connections between the premise and hypothesis. By systematically mining latent semantic relationships at various representational levels, CIRN facilitates a more thorough understanding of the input pair. Comprehensive evaluations conducted on several standard NLI benchmark datasets reveal consistent performance gains achieved by CIRN over competitive baseline approaches, demonstrating the efficacy of leveraging multi-level interactive features for complex relational reasoning.

[144] The Efficiency of Pre-training with Objective Masking in Pseudo Labeling for Semi-Supervised Text Classification

Arezoo Hatefi,Xuan-Son Vu,Monowar Bhuyan,Frank Drewes

Main category: cs.CL

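TL;DR: This paper extends the semi-supervised text classification model of Hatefi et al. with an unsupervised pre-training phase to improve performance, and evaluates it on datasets in two languages (English and Swedish).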

Details

Motivation: Improve text classification performance when only a small number of gold-labeled examples and a large number of unlabeled examples are available. Method: Use the teacher-student architecture of Meta Pseudo Labels combined with unsupervised pre-training based on objective masking, and run experiments on datasets in two languages. Result: The extended model outperforms the original model and several baselines. Conclusion: Unsupervised pre-training significantly improves the semi-supervised text classification model and is applicable across languages. Abstract: We extend and study a semi-supervised model for text classification proposed earlier by Hatefi et al. for classification tasks in which document classes are described by a small number of gold-labeled examples, while the majority of training examples is unlabeled. The model leverages the teacher-student architecture of Meta Pseudo Labels in which a "teacher" generates labels for originally unlabeled training data to train the "student" and updates its own model iteratively based on the performance of the student on the gold-labeled portion of the data. We extend the original model of Hatefi et al. by an unsupervised pre-training phase based on objective masking, and conduct in-depth performance evaluations of the original model, our extension, and various independent baselines. Experiments are performed using three different datasets in two different languages (English and Swedish).

[145] Dynamic Domain Information Modulation Algorithm for Multi-domain Sentiment Analysis

Chunyi Yue,Ang Li

Main category: cs.CL

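TL;DR: Proposes a dynamic information modulation algorithm for multi-domain sentiment classification that uses two-stage training to optimize the influence of the domain classification task on sentiment classification.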

Details

Motivation: Address the hyperparameter optimization problem that arises in multi-domain sentiment classification because domain information affects sentiment classification differently in each domain. Method: Two-stage training: the first stage determines a shared hyperparameter; the second stage introduces a domain-aware modulation algorithm that adjusts the domain information in the input text. Result: The method's advantage is verified on a public dataset covering 16 domains. Conclusion: The dynamic information modulation algorithm efficiently generates the information each domain needs and improves multi-domain sentiment classification. Abstract: Multi-domain sentiment classification aims to mitigate the poor performance of models caused by the scarcity of labeled data in a single domain, by utilizing labeled data from various domains. A series of models that jointly train domain classifiers and sentiment classifiers have demonstrated their advantages, because domain classification helps generate necessary information for sentiment classification. Intuitively, the importance of sentiment classification tasks is the same in all domains for multi-domain sentiment classification; but domain classification tasks are different because the impact of domain information on sentiment classification varies across different fields; this can be controlled through adjustable weights or hyperparameters. However, as the number of domains increases, existing hyperparameter optimization algorithms may face the following challenges: (1) tremendous demand for computing resources, (2) convergence problems, and (3) high algorithm complexity. To efficiently generate the domain information required for sentiment classification in each domain, we propose a dynamic information modulation algorithm. Specifically, the model training process is divided into two stages. In the first stage, a shared hyperparameter, which would control the proportion of domain classification tasks across all fields, is determined. In the second stage, we introduce a novel domain-aware modulation algorithm to adjust the domain information contained in the input text, which is then calculated based on a gradient-based and loss-based method. In summary, experimental results on a public sentiment analysis dataset containing 16 domains prove the superiority of the proposed method.

[146] Attention Is Not All You Need: The Importance of Feedforward Networks in Transformer Models

Isaac Gerber

Main category: cs.CL

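TL;DR: The study finds that the feedforward network (FFN) in decoder-only Transformers is critical to model performance, and that a three-layer FFN configuration is more efficient than the standard two-layer one.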

Details

Motivation: Examine the importance of the FFN during model pre-training and its effect on performance. Method: Experimentally compare Transformer block configurations with different numbers of FFN layers. Result: The three-layer FFN configuration outperforms the standard two-layer configuration with fewer parameters and in less time. Conclusion: FFN design has a significant effect on model efficiency and performance. Abstract: Decoder-only transformer networks have become incredibly popular for language modeling tasks. State-of-the-art models can have over a hundred transformer blocks, containing billions of trainable parameters, and are trained on trillions of tokens of text. Each transformer block typically consists of a multi-head attention (MHA) mechanism and a two-layer fully connected feedforward network (FFN). In this paper, we examine the importance of the FFN during the model pre-training process through a series of experiments, confirming that the FFN is important to model performance. Furthermore, we show that models using a transformer block configuration with three-layer FFNs and fewer such blocks outperform the standard two-layer configuration, delivering lower training loss with fewer total parameters in less time.

[147] TS-SUPERB: A Target Speech Processing Benchmark for Speech Self-Supervised Learning Models

Junyi Peng,Takanori Ashihara,Marc Delcroix,Tsubasa Ochiai,Oldrich Plchot,Shoko Araki,Jan Černocký

Main category: cs.CL

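TL;DR: TS-SUPERB is a new target-speaker speech processing benchmark that focuses on noisy multi-talker scenarios and evaluates self-supervised learning models on target-speaker tasks.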

Details

Motivation: Existing benchmarks focus mainly on single-speaker scenarios and lack evaluation of target-speaker tasks under noisy multi-talker conditions, which are more challenging and more practical. Method: Introduce the TS-SUPERB benchmark, comprising four target-speaker processing tasks that use the speaker embedding from enrollment speech as a clue for downstream models, and study joint optimization across target-speaker tasks. Result: The benchmark results show that performance in target-speaker scenarios cannot be easily inferred from single-speaker tasks, and that joint optimization effectively leverages mutual information across tasks. Conclusion: TS-SUPERB highlights the importance of evaluating SSL models in target-speaker scenarios and shows the potential of joint optimization. Abstract: Self-supervised learning (SSL) models have significantly advanced speech processing tasks, and several benchmarks have been proposed to validate their effectiveness. However, previous benchmarks have primarily focused on single-speaker scenarios, with less exploration of target-speaker tasks in noisy, multi-talker conditions, a more challenging yet practical case. In this paper, we introduce the Target-Speaker Speech Processing Universal Performance Benchmark (TS-SUPERB), which includes four widely recognized target-speaker processing tasks that require identifying the target speaker and extracting information from the speech mixture. In our benchmark, the speaker embedding extracted from enrollment speech is used as a clue to condition downstream models. The benchmark result reveals the importance of evaluating SSL models in target speaker scenarios, demonstrating that performance cannot be easily inferred from related single-speaker tasks. Moreover, by using a unified SSL-based target speech encoder, consisting of a speaker encoder and an extractor module, we also investigate joint optimization across TS tasks to leverage mutual information and demonstrate its effectiveness.

[148] Enhancing BERTopic with Intermediate Layer Representations

Dominik Koterwa,Maciej Świtała

Main category: cs.CL

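TL;DR: BERTopic is a topic modeling algorithm based on Transformer embeddings. By evaluating 18 embedding representations experimentally, the study finds that the default setting can be improved upon, and it also examines the influence of stop words.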

Details

Motivation: Study how different embedding representations affect BERTopic's topic modeling performance, in order to improve its results. Method: Evaluate 18 embedding representations in experiments on three datasets, measuring topic coherence and topic diversity. Result: For every dataset there is an embedding configuration that beats the default setting; the influence of stop words is also analyzed. Conclusion: The choice of embedding configuration has a significant effect on BERTopic's performance, and tuning the embeddings can improve results. Abstract: BERTopic is a topic modeling algorithm that leverages transformer-based embeddings to create dense clusters, enabling the estimation of topic structures and the extraction of valuable insights from a corpus of documents. This approach allows users to efficiently process large-scale text data and gain meaningful insights into its structure. While BERTopic is a powerful tool, embedding preparation can vary, including extracting representations from intermediate model layers and applying transformations to these embeddings. In this study, we evaluate 18 different embedding representations and present findings based on experiments conducted on three diverse datasets. To assess the algorithm's performance, we report topic coherence and topic diversity metrics across all experiments. Our results demonstrate that, for each dataset, it is possible to find an embedding configuration that performs better than the default setting of BERTopic. Additionally, we investigate the influence of stop words on different embedding configurations.
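
The abstract of [148] reports topic diversity without defining it. As a minimal sketch, the block below computes one common definition: the fraction of unique words among the top-k words of all topics. It is an assumption that the paper uses this variant; the function name and the toy topics are hypothetical.

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of every topic.

    topics: list of word lists, each ordered from most to least representative.
    Returns a value in [0, 1]; 1.0 means no word is shared between topics.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


if __name__ == "__main__":
    # Toy topics (hypothetical); "model" appears in both, which lowers diversity.
    toy_topics = [["model", "training", "loss"], ["model", "speech", "audio"]]
    print(topic_diversity(toy_topics, top_k=3))
```

A value close to 1 indicates that the topics share few top words; a low value points to redundant topics.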

[149] From Rankings to Insights: Evaluation Should Shift Focus from Leaderboard to Feedback

Zongqi Wang,Tianle Gu,Chen Gong,Xin Tian,Siqi Bao,Yujiu Yang

Main category: cs.CL

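TL;DR: The paper proposes the Feedbacker framework, which shifts evaluation from overall scores to analytical feedback in order to guide the optimization of LLMs.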

Details

Motivation: Existing evaluation benchmarks provide only overall scores, which cannot guide model optimization or support model profiling, so evaluation should shift to feedback with analytical value. Method: The Feedbacker framework comprises an extensible tree-based query taxonomy builder, an automated query synthesis scheme, and visualization and analysis tools; the paper also proposes the PC2 pointwise evaluation method. Result: Evaluation results for 17 mainstream LLMs demonstrate Feedbacker's effectiveness and potential. Conclusion: Feedbacker provides more comprehensive feedback for LLM evaluation, supporting model optimization and the understanding of model behavior. Abstract: Automatic evaluation benchmarks such as MT-Bench, Arena-Hard, and Auto-Arena are seeing growing adoption for the evaluation of Large Language Models (LLMs). Existing research has primarily focused on approximating human-based model rankings using limited data and LLM-as-a-Judge. However, the fundamental premise of these studies, which attempts to replicate human rankings, is flawed. Specifically, these benchmarks typically offer only overall scores, limiting their utility to leaderboard rankings, rather than providing feedback that can guide model optimization and support model profiling. Therefore, we advocate for an evaluation paradigm shift from approximating human-based model rankings to providing feedback with analytical value. To this end, we introduce Feedbacker, an evaluation framework that provides comprehensive and fine-grained results, thereby enabling thorough identification of a model's specific strengths and weaknesses. Such feedback not only supports the targeted optimization of the model but also enhances the understanding of its behavior. Feedbacker comprises three key components: an extensible tree-based query taxonomy builder, an automated query synthesis scheme, and a suite of visualization and analysis tools. Furthermore, we propose a novel LLM-as-a-Judge method: PC2 (Pre-Comparison-derived Criteria) pointwise evaluation. This method derives evaluation criteria by pre-comparing the differences between several auxiliary responses, achieving the accuracy of pairwise evaluation while maintaining the time complexity of pointwise evaluation. Finally, leveraging the evaluation results of 17 mainstream LLMs, we demonstrate the usage of Feedbacker and highlight its effectiveness and potential. Our project homepage is available at https://liudan193.github.io/Feedbacker.

[150] Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

Zihan Qiu,Zekun Wang,Bo Zheng,Zeyu Huang,Kaiyue Wen,Songlin Yang,Rui Men,Le Yu,Fei Huang,Suozhi Huang,Dayiheng Liu,Jingren Zhou,Junyang Lin

Main category: cs.CL

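TL;DR: The study finds that adding a head-specific sigmoid gate to softmax attention significantly improves performance, enhances training stability, and improves long-context extrapolation.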

Details

Motivation: Existing literature rarely examines the specific effects of gating; this work systematically studies gating-augmented softmax attention variants. Method: Compare over 30 variants of 15B MoE models and 1.7B dense models, analyzing experimentally the effect of different gating positions and computational variants. Result: A head-specific sigmoid gate improves performance, training stability, and tolerance to larger learning rates, and mitigates the 'attention sink' problem. Conclusion: By introducing non-linearity and query-dependent sparse gating scores, the gating mechanism effectively improves softmax attention; the related code and models have been released. Abstract: Gating mechanisms have been widely utilized, from early models like LSTMs and Highway Networks to recent state space models, linear attention, and also softmax attention. Yet, existing literature rarely examines the specific effects of gating. In this work, we conduct comprehensive experiments to systematically investigate gating-augmented softmax attention variants. Specifically, we perform a comprehensive comparison over 30 variants of 15B Mixture-of-Experts (MoE) models and 1.7B dense models trained on a 3.5 trillion token dataset. Our central finding is that a simple modification, applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA), consistently improves performance. This modification also enhances training stability, tolerates larger learning rates, and improves scaling properties. By comparing various gating positions and computational variants, we attribute this effectiveness to two key factors: (1) introducing non-linearity upon the low-rank mapping in the softmax attention, and (2) applying query-dependent sparse gating scores to modulate the SDPA output. Notably, we find this sparse gating mechanism mitigates 'attention sink' and enhances long-context extrapolation performance, and we also release related code (https://github.com/qiuzh20/gated_attention) and models (https://huggingface.co/QwQZh/gated_attention) to facilitate future research.
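
The central modification in [150], a head-specific sigmoid gate applied after SDPA, can be sketched for a single causal head in plain Python. This is an illustration under assumptions, not the released implementation: the gate is taken to be an elementwise sigmoid of a linear projection of the current (query) token, and the names `gated_head` and `Wg` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(vec, mat):
    """Row vector (length n) times matrix (n x m), returning a list of length m."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_head(X, Wq, Wk, Wv, Wg):
    """One causal attention head whose SDPA output is scaled by a sigmoid gate.

    X: token vectors (seq_len x d_model). Wq, Wk, Wv, Wg: d_model x d_head.
    Wg is this head's own gate projection (assumed form of the gate).
    """
    Q = [matvec(x, Wq) for x in X]
    K = [matvec(x, Wk) for x in X]
    V = [matvec(x, Wv) for x in X]
    d_head = len(Wq[0])
    out = []
    for i, q in enumerate(Q):
        # Scaled dot-product attention over positions 0..i (causal mask).
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_head)
                  for j in range(i + 1)]
        weights = softmax(scores)
        sdpa = [sum(w * V[j][c] for j, w in enumerate(weights)) for c in range(d_head)]
        # The gate depends only on the current (query) token and scales the SDPA output.
        gate = [sigmoid(g) for g in matvec(X[i], Wg)]
        out.append([g * s for g, s in zip(gate, sdpa)])
    return out


if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0]]
    eye = [[1.0, 0.0], [0.0, 1.0]]
    zero_gate = [[0.0, 0.0], [0.0, 0.0]]  # sigmoid(0) = 0.5, so outputs are halved
    print(gated_head(X, eye, eye, eye, zero_gate))
```

Because the gate is computed per query token and per head, strongly negative pre-activations drive a head's output toward zero for that token, which is one way to read the "query-dependent sparse gating scores" mentioned in the abstract.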

[151] Utilizing LLMs to Investigate the Disputed Role of Evidence in Electronic Cigarette Health Policy Formation in Australia and the UK

Damian Curran,Brian Chapman,Mike Conway

Main category: cs.CL

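TL;DR: Using a GPT-4-based large language model classifier, this paper analyzes statements about the public health impact of e-cigarettes in Australian and UK policy documents and finds that the two countries' policy stances differ sharply.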

Details

Motivation: Investigate why Australian and UK e-cigarette policies diverge sharply even though they draw on the same evidence. Method: Develop and evaluate a GPT-4-based sentence classifier and apply it to sentences from 109 policy documents, labeling whether each claims that e-cigarettes are helpful or harmful for public health. Result: The classifier achieves an F-score of 0.9; Australian policy documents emphasize the harms of e-cigarettes more, while UK documents emphasize the benefits more. Conclusion: The study confirms the difference in policy stance between the two countries and shows the potential of LLM-based methods for analyzing the relationship between evidence and policy. Abstract: Australia and the UK have developed contrasting approaches to the regulation of electronic cigarettes, with, broadly speaking, Australia adopting a relatively restrictive approach and the UK adopting a more permissive approach. Notably, these divergent policies were developed from the same broad evidence base. In this paper, to investigate differences in how the two jurisdictions manage and present evidence, we developed and evaluated a Large Language Model-based sentence classifier to perform automated analyses of electronic cigarette-related policy documents drawn from official Australian and UK legislative processes (109 documents in total). Specifically, we utilized GPT-4 to automatically classify sentences based on whether they contained claims that e-cigarettes were broadly helpful or harmful for public health. Our LLM-based classifier achieved an F-score of 0.9. Further, when applying the classifier to our entire sentence-level corpus, we found that Australian legislative documents show a much higher proportion of harmful statements, and a lower proportion of helpful statements compared to the expected values, with the opposite holding for the UK. In conclusion, this work utilized an LLM-based approach to provide evidence to support the contention that, drawing on the same evidence base, Australian ENDS-related policy documents emphasize the harms associated with ENDS products and UK policy documents emphasize the benefits. Further, our approach provides a starting point for using LLM-based methods to investigate the complex relationship between evidence and health policy formation.
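
For readers unfamiliar with the F-score reported in [151], the sketch below computes the balanced F-score (F1) from confusion counts. The counts are hypothetical and chosen only so that the result is 0.9; they are not taken from the paper, and the abstract does not say which averaging scheme was used.

```python
def f_score(tp, fp, fn):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical counts for one class (e.g. 'harmful'); not the paper's data.
    print(round(f_score(tp=90, fp=10, fn=10), 3))
```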

[152] A Split-then-Join Approach to Abstractive Summarization for Very Long Documents in a Low Resource Setting

Lhuqita Fazry

Main category: cs.CL

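TL;DR: The BIGBIRD-PEGASUS model performs strongly on long-document summarization but is limited to a maximum length of 4,096 tokens, which degrades performance on very long documents. This study addresses the problem by fine-tuning the model with a data augmentation approach.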

Details

Motivation: Address the performance limits of the BIGBIRD-PEGASUS model on summarization of very long documents. Method: Fine-tune the pretrained model on documents longer than 20,000 tokens, augmenting the data by splitting each document-summary pair into parts that fit the model's maximum length. Result: Improved model performance on very long document summarization. Conclusion: With fine-tuning and data augmentation, BIGBIRD-PEGASUS handles very long document summarization better. Abstract: The BIGBIRD-PEGASUS model achieves state-of-the-art results on abstractive text summarization for long documents. However, its capacity is still limited to a maximum of 4,096 tokens, which causes performance degradation on summarization of very long documents. A common method to deal with the issue is to truncate the documents. In this research, we use a different approach. We use the pretrained BIGBIRD-PEGASUS model by fine-tuning it on a dataset from another domain. First, we filter out all documents whose length is less than 20,000 tokens to focus on very long documents. To prevent the domain shifting problem and overfitting in transfer learning due to the small dataset, we augment the dataset by splitting each document-summary training pair into parts, to fit the document into 4,096 tokens. Source code is available at https://github.com/lhfazry/SPIN-summ.

[153] IM-BERT: Enhancing Robustness of BERT through the Implicit Euler Method

Mihyeon Kim,Juhyoung Park,Youngbin Kim

Main category: cs.CL

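TL;DR: IM-BERT models BERT's layers as solutions of ODEs and proposes a numerically stable IM-connection that improves the robustness of pre-trained language models under adversarial attack, without extra parameters or adversarial training.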

Details

Motivation: Address the vulnerability to adversarial attacks and the overfitting that arise when pre-trained language models are fine-tuned on limited downstream data. Method: Treat BERT's layers as solutions of ODEs, analyze numerical stability, and propose the IM-connection strategy. Result: On the AdvGLUE dataset, IM-BERT improves on the original BERT by about 8.3 percentage points, and by 5.9 percentage points in low-resource scenarios. Conclusion: Taking a dynamical-systems perspective, IM-BERT significantly improves model robustness, especially in low-resource scenarios. Abstract: Pre-trained Language Models (PLMs) have achieved remarkable performance on diverse NLP tasks through pre-training and fine-tuning. However, fine-tuning the model with a large number of parameters on limited downstream datasets often leads to vulnerability to adversarial attacks, causing overfitting of the model on standard datasets. To address these issues, we propose IM-BERT from the perspective of a dynamic system by conceptualizing a layer of BERT as a solution of Ordinary Differential Equations (ODEs). Under the situation of initial value perturbation, we analyze the numerical stability of two main numerical ODE solvers: the explicit and implicit Euler approaches. Based on these analyses, we introduce a numerically robust IM-connection incorporating BERT's layers. This strategy enhances the robustness of PLMs against adversarial attacks, even in low-resource scenarios, without introducing additional parameters or adversarial training strategies. Experimental results on the adversarial GLUE (AdvGLUE) dataset validate the robustness of IM-BERT under various conditions. Compared to the original BERT, IM-BERT exhibits a performance improvement of approximately 8.3%p on the AdvGLUE dataset. Furthermore, in low-resource scenarios, IM-BERT outperforms BERT by achieving 5.9%p higher accuracy.
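
The stability contrast that [153] builds on can be seen on the standard linear test equation dy/dt = -lam * y. The sketch below is a textbook illustration only, not the IM-connection or any BERT layer: it tracks how a small perturbation of the initial value evolves under the explicit and the implicit Euler update. The parameter values are arbitrary choices for the demonstration.

```python
def explicit_euler(y0, lam, h, steps):
    """Explicit Euler: y_{n+1} = y_n + h * f(y_n), with f(y) = -lam * y."""
    y = y0
    for _ in range(steps):
        y = y + h * (-lam * y)
    return y


def implicit_euler(y0, lam, h, steps):
    """Implicit Euler: y_{n+1} = y_n + h * f(y_{n+1}).

    For f(y) = -lam * y the implicit equation has the closed form
    y_{n+1} = y_n / (1 + h * lam).
    """
    y = y0
    for _ in range(steps):
        y = y / (1.0 + h * lam)
    return y


if __name__ == "__main__":
    lam, h, steps = 10.0, 0.25, 20  # h * lam = 2.5, outside the explicit stability bound of 2
    eps = 1e-3                      # perturbation of the initial value
    for name, solver in (("explicit", explicit_euler), ("implicit", implicit_euler)):
        gap = abs(solver(1.0 + eps, lam, h, steps) - solver(1.0, lam, h, steps))
        print(f"{name}: perturbation after {steps} steps = {gap:.3e}")
```

With h * lam above 2, each explicit step multiplies the perturbation by a factor of magnitude greater than 1, so it grows; each implicit step divides it by 1 + h * lam, so it shrinks for any positive step size.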

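[154] EcoLANG: Efficient and Effective Agent Communication Language Induction for Social Simulation

Xinyi Mou,Chen Qian,Wei Liu,Xuanjing Huang,Zhongyu Wei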

Main category: cs.CL

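TL;DR: EcoLANG proposes an efficient and effective method for inducing an agent communication language for social simulation, significantly reducing computation cost through two stages: language evolution and language utilization.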

Details

Motivation: Address the high time and computation costs of large-scale social simulation while avoiding the compromises in accuracy and generalizability made by existing methods. Method: Two stages: language evolution (filtering synonymous words and optimizing sentence-level rules) and language utilization (agents communicate using the evolved language). Result: Experiments show that EcoLANG reduces token consumption by over 20%, improving efficiency without harming simulation accuracy. Conclusion: EcoLANG offers a practical route to efficient social simulation that balances computation cost and simulation quality. Abstract: Large language models (LLMs) have demonstrated an impressive ability to role-play humans and replicate complex social dynamics. While large-scale social simulations are gaining increasing attention, they still face significant challenges, particularly regarding high time and computation costs. Existing solutions, such as distributed mechanisms or hybrid agent-based model (ABM) integrations, either fail to address inference costs or compromise accuracy and generalizability. To this end, we propose EcoLANG: Efficient and Effective Agent Communication Language Induction for Social Simulation. EcoLANG operates in two stages: (1) language evolution, where we filter synonymous words and optimize sentence-level rules through natural selection, and (2) language utilization, where agents in social simulations communicate using the evolved language. Experimental results demonstrate that EcoLANG reduces token consumption by over 20%, enhancing efficiency without sacrificing simulation accuracy.

[155] The Distracting Effect: Understanding Irrelevant Passages in RAG

Chen Amiraz,Florin Cuconasu,Simone Filice,Zohar Karnin

Main category: cs.CL

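TL;DR: The paper studies how irrelevant passages distract LLM generation in RAG systems, proposes a way to quantify the distracting effect, and improves accuracy by up to 7.5% by fine-tuning LLMs with distracting passages.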

Details

Motivation: Address the distraction that irrelevant passages cause for LLM generation in RAG systems, and quantify this distracting effect. Method: Propose a measure of the distracting effect, develop new techniques for identifying and using distracting passages, and validate them by fine-tuning LLMs. Result: After fine-tuning LLMs with distracting passages, answering accuracy improves by up to 7.5%. Conclusion: The paper provides a comprehensive framework for identifying and using distracting passages and significantly improves RAG system performance. Abstract: A well-known issue with Retrieval Augmented Generation (RAG) is that retrieved passages that are irrelevant to the query sometimes distract the answer-generating LLM, causing it to provide an incorrect response. In this paper, we shed light on this core issue and formulate the distracting effect of a passage w.r.t. a query (and an LLM). We provide a quantifiable measure of the distracting effect of a passage and demonstrate its robustness across LLMs. Our research introduces novel methods for identifying and using hard distracting passages to improve RAG systems. By fine-tuning LLMs with these carefully selected distracting passages, we achieve up to a 7.5% increase in answering accuracy compared to counterparts fine-tuned on conventional RAG datasets. Our contribution is two-fold: first, we move beyond the simple binary classification of irrelevant passages as either completely unrelated vs. distracting, and second, we develop and analyze multiple methods for finding hard distracting passages. To our knowledge, no other research has provided such a comprehensive framework for identifying and utilizing hard distracting passages.

[156] CNN-based Image Models Verify a Hypothesis that The Writers of Cuneiform Texts Improved Their Writing Skills When Studying at the Age of Hittite Empire

Daichi Kohmoto,Katsutoshi Fukuda,Daisuke Yoshida,Takafumi Matsui,Sachihiro Omura

Main category: cs.CL

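TL;DR: CNN-based analysis of images of a cuneiform tablet suggests that it was written jointly by a 'teacher' and a 'student', pointing to ancient writing training.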

Details

Motivation: Explore why ancient people left cuneiform tablets with almost identical content, a question that traditional linguistic methods have not answered. Method: Use CNN-based image models to analyze tablet images quantitatively, without segmenting the cuneiform signs one by one. Result: The first half of the tablet appears to have been written by a 'teacher' and the second half by a 'student', suggesting it served as a writing-training tool. Conclusion: The method offers a new perspective for cuneiform research and could be extended to the analysis of other similar texts. Abstract: A cuneiform tablet KBo 23.1 ++/KUB 30.38, which is known to represent a text of Kizzuwatna rituals, was written by two writers with almost identical content in two iterations. Unlike other cuneiform tablets that contained information such as myths, essays, or business records, the reason why ancient people left such tablets for posterity remains unclear. To study this problem, we develop a new methodology by analyzing images of a tablet quantitatively using CNN (Convolutional Neural Network)-based image models, without segmenting cuneiforms one-by-one. Our data-driven methodology implies that the writer writing the first half was a 'teacher' and the other writer was a 'student' who was training his skills of writing cuneiforms. This result has not been reached by classical linguistics. We also discuss related conclusions and possible further directions for applying our method and its generalizations.

[157] Convert Language Model into a Value-based Strategic Planner

Xiaoyu Wang,Yue Zhao,Qingqing Gu,Zhonglin Jiang,Xiaokai Chen,Yong Chen,Luo Ji

Main category: cs.CL

TL;DR: The paper proposes straQ*, a Q-learning-based framework for improving long-term satisfaction in emotional support conversation (ESC).

Details

Motivation: Large language models (LLMs) perform well on ESC but lack a state-model perspective, which leads to suboptimal long-term satisfaction. Method: Apply Q-learning to LLMs in the straQ* framework, which plans the conversation strategy dynamically and guides response generation. Result: Experiments show that straQ* outperforms many baselines on ESC datasets, including direct inference, self-refine, and chain of thought. Conclusion: By combining Q-learning with LLMs, straQ* significantly improves the long-term effect of ESC. Abstract: Emotional support conversation (ESC) aims to alleviate the emotional distress of individuals through effective conversations. Although large language models (LLMs) have made remarkable progress on ESC, most of these studies might not define the diagram from the state model perspective, therefore providing a suboptimal solution for long-term satisfaction. To address such an issue, we apply Q-learning to LLMs and propose a framework called straQ*. Our framework allows a plug-and-play LLM to bootstrap the planning during ESC, determine the optimal strategy based on long-term returns, and finally guide the LLM to respond. Substantial experiments on ESC datasets suggest that straQ* outperforms many baselines, including direct inference, self-refine, chain of thought, finetuning, and finite state machines.
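
To make "determine the optimal strategy based on long-term returns" in [157] concrete, the sketch below shows the standard tabular Q-learning update and a greedy strategy choice. It is a generic textbook form, not the straQ* training procedure (which applies Q-learning to an LLM); the state names, strategy labels, reward, and hyperparameters are all hypothetical.

```python
from collections import defaultdict

# Hypothetical support strategies; the paper's actual strategy set may differ.
STRATEGIES = ["question", "reflection", "suggestion"]


def q_update(Q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    best_next = max(Q[(next_state, a)] for a in STRATEGIES)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])


def best_strategy(Q, state):
    """Greedy choice: the strategy with the highest estimated long-term return."""
    return max(STRATEGIES, key=lambda a: Q[(state, a)])


if __name__ == "__main__":
    Q = defaultdict(float)  # unseen (state, strategy) pairs start at 0.0
    q_update(Q, "distressed", "reflection", reward=1.0, next_state="calmer")
    print(best_strategy(Q, "distressed"), Q[("distressed", "reflection")])
```

The discount factor gamma is what makes the value "long-term": a strategy is credited not only for the immediate reward but also for the best value reachable from the next conversation state.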

[158] HAMLET: Healthcare-focused Adaptive Multilingual Learning Embedding-based Topic Modeling

Hajar Sakai,Sarah S. Lam

Main category: cs.CL

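TL;DR: HAMLET is a graph-driven architecture for cross-lingual healthcare topic modeling that uses an LLM to generate initial topics and refines the topic embeddings with BERT, SBERT, and a GNN, improving topic quality and interpretability.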

Details

Motivation: Traditional topic models handle contextual nuances, polysemy, and rare words poorly, producing topics that lack coherence and quality; the initial topics generated by LLMs also suffer from redundancy and a lack of representativeness. Method: Refine topic embeddings by combining BERT, SBERT, and a GNN, using the graph neural network to connect documents, topics, and words, and introducing a new similarity computation method. Result: Experiments on English and French healthcare datasets demonstrate HAMLET's effectiveness. Conclusion: Through neural-enhanced semantic fusion, HAMLET significantly improves the quality and interpretability of topic modeling. Abstract: Traditional topic models often struggle with contextual nuances and fail to adequately handle polysemy and rare words. This limitation typically results in topics that lack coherence and quality. Large Language Models (LLMs) can mitigate this issue by generating an initial set of topics. However, these raw topics frequently lack refinement and representativeness, which leads to redundancy without lexical similarity and reduced interpretability. This paper introduces HAMLET, a graph-driven architecture for cross-lingual healthcare topic modeling that uses LLMs. The proposed approach leverages neural-enhanced semantic fusion to refine the embeddings of topics generated by the LLM. Instead of relying solely on statistical co-occurrence or human interpretation to extract topics from a document corpus, this method introduces a topic embedding refinement that uses Bidirectional Encoder Representations from Transformers (BERT) and Graph Neural Networks (GNN). After topic generation, a hybrid technique that involves BERT and Sentence-BERT (SBERT) is employed for embedding. The topic representations are further refined using a GNN, which establishes connections between documents, topics, words, similar topics, and similar words. A novel method is introduced to compute similarities. Consequently, the topic embeddings are refined, and the top k topics are extracted. Experiments were conducted using two healthcare datasets, one in English and one in French, from which six sets were derived. The results demonstrate the effectiveness of HAMLET.

[159] Towards Actionable Pedagogical Feedback: A Multi-Perspective Analysis of Mathematics Teaching and Tutoring Dialogue

Jannatun Naim,Jie Cao,Fareen Tasneem,Jennifer Jacobs,Brent Milne,James Martin,Tamara Sumner

Main category: cs.CL

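TL;DR: The paper proposes a multi-perspective discourse analysis framework that combines domain-specific talk moves with dialogue acts and discourse relations to address the challenges of feedback analysis in mathematics education.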

Details

Motivation: Address multifunctionality and the omission of utterances from domain-specific classifications in discourse analysis, in order to improve feedback quality in mathematics education. Method: Integrate domain-specific talk moves with dialogue acts (43 tags) and discourse relations (16 relations), applied to two mathematics education datasets. Result: Utterances without talk moves play a key role in guiding and structuring classroom discourse, and meaningful discourse patterns are revealed. Conclusion: The framework can improve the feedback capabilities of AI-assisted education systems and support the development of more capable educational agents. Abstract: Effective feedback is essential for refining instructional practices in mathematics education, and researchers often turn to advanced natural language processing (NLP) models to analyze classroom dialogues from multiple perspectives. However, utterance-level discourse analysis encounters two primary challenges: (1) multifunctionality, where a single utterance may serve multiple purposes that a single tag cannot capture, and (2) the exclusion of many utterances from domain-specific discourse move classifications, leading to their omission in feedback. To address these challenges, we propose a multi-perspective discourse analysis that integrates domain-specific talk moves with dialogue act (using the flattened multi-functional SWBD-MASL schema with 43 tags) and discourse relation (applying Segmented Discourse Representation Theory with 16 relations). Our top-down analysis framework enables a comprehensive understanding of utterances that contain talk moves, as well as utterances that do not contain talk moves. This is applied to two mathematics education datasets: TalkMoves (teaching) and SAGA22 (tutoring). Through distributional unigram analysis, sequential talk move analysis, and multi-view deep dive, we discovered meaningful discourse patterns, and revealed the vital role of utterances without talk moves, demonstrating that these utterances, far from being mere fillers, serve crucial functions in guiding, acknowledging, and structuring classroom discourse. These insights underscore the importance of incorporating discourse relations and dialogue acts into AI-assisted education systems to enhance feedback and create more responsive learning environments. Our framework may prove helpful not only for providing feedback to human educators, but also for aiding the development of AI agents that can effectively emulate the roles of both educators and students.

[160] KDH-MLTC: Knowledge Distillation for Healthcare Multi-Label Text Classification

Hajar Sakai,Sarah S. Lam

Main category: cs.CL

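TL;DR: Proposes KDH-MLTC, a healthcare multi-label text classification framework based on knowledge distillation and large language models, which uses model compression and optimization to achieve efficient, accurate classification.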

Details

Motivation: Healthcare text data is large and complex, calling for efficient, accurate classification that also meets privacy requirements such as HIPAA. Method: Combines knowledge distillation with sequential fine-tuning, uses Particle Swarm Optimization (PSO) for hyperparameter tuning, and transfers knowledge from BERT to the lighter DistilBERT. Result: Performs well on three medical datasets, reaching an F1 score of 82.70% on the largest, with robustness supported by statistical validation and an ablation study. Conclusion: KDH-MLTC balances efficiency and accuracy in resource-constrained healthcare settings and contributes to healthcare text classification research. Abstract: The increasing volume of healthcare textual data requires computationally efficient, yet highly accurate classification approaches able to handle the nuanced and complex nature of medical terminology. This research presents Knowledge Distillation for Healthcare Multi-Label Text Classification (KDH-MLTC), a framework leveraging model compression and Large Language Models (LLMs). The proposed approach addresses conventional healthcare Multi-Label Text Classification (MLTC) challenges by integrating knowledge distillation and sequential fine-tuning, subsequently optimized through Particle Swarm Optimization (PSO) for hyperparameter tuning. KDH-MLTC transfers knowledge from a more complex teacher LLM (i.e., BERT) to a lighter student LLM (i.e., DistilBERT) through sequential training adapted to MLTC that preserves the teacher's learned information while significantly reducing computational requirements. As a result, the classification is enabled to be conducted locally, making it suitable for healthcare textual data characterized by sensitivity and, therefore, ensuring HIPAA compliance. The experiments conducted on three medical literature datasets of different sizes, sampled from the Hallmark of Cancer (HoC) dataset, demonstrate that KDH-MLTC achieves superior performance compared to existing approaches, particularly for the largest dataset, reaching an F1 score of 82.70%. Additionally, statistical validation and an ablation study are carried out, proving the robustness of KDH-MLTC. Furthermore, the PSO-based hyperparameter optimization process allowed the identification of optimal configurations. The proposed approach contributes to healthcare text classification research, balancing efficiency requirements in resource-constrained healthcare settings with satisfactory accuracy demands.

[161] Structural Entropy Guided Agent for Detecting and Repairing Knowledge Deficiencies in LLMs

Yifan Wei,Xiaoyan Yu,Tengfei Pan,Angsheng Li,Li Du

Main category: cs.CL

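TL;DR: Proposes SENATOR, a framework that uses structural entropy and Monte Carlo Tree Search to detect and repair knowledge deficiencies of large language models in knowledge-intensive domains, improving performance with targeted synthetic data.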

Details

Motivation: Large language models underperform in knowledge-intensive domains such as medicine and scientific research, and existing methods generate redundant synthetic data that does not target the model's knowledge gaps. Method: Uses structural entropy (SE) to quantify uncertainty along knowledge graph paths, combines it with Monte Carlo Tree Search (MCTS) to selectively explore regions where the model lacks knowledge, and generates targeted synthetic data for supervised fine-tuning. Result: Experiments on LLaMA-3 and Qwen2 show that SENATOR effectively detects and repairs knowledge deficiencies, with notable performance improvements. Conclusion: By generating targeted data and repairing knowledge deficiencies, SENATOR improves LLM performance on knowledge-intensive tasks. Abstract: Large language models (LLMs) have achieved unprecedented performance by leveraging vast pretraining corpora, yet their performance remains suboptimal in knowledge-intensive domains such as medicine and scientific research, where high factual precision is required. While synthetic data provides a promising avenue for augmenting domain knowledge, existing methods frequently generate redundant samples that do not align with the model's true knowledge gaps. To overcome this limitation, we propose a novel Structural Entropy-guided Knowledge Navigator (SENATOR) framework that addresses the intrinsic knowledge deficiencies of LLMs. Our approach employs the Structure Entropy (SE) metric to quantify uncertainty along knowledge graph paths and leverages Monte Carlo Tree Search (MCTS) to selectively explore regions where the model lacks domain-specific knowledge. Guided by these insights, the framework generates targeted synthetic data for supervised fine-tuning, enabling continuous self-improvement. Experimental results on LLaMA-3 and Qwen2 across multiple domain-specific benchmarks show that SENATOR effectively detects and repairs knowledge deficiencies, achieving notable performance improvements. The code and data for our methods and experiments are available at https://github.com/weiyifan1023/senator.

[162] On the Cost and Benefits of Training Context with Utterance or Full Conversation Training: A Comparative Study

Hyouin Liu,Zhikuan Zhang

Main category: cs.CL

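TL;DR: The study finds that context-based utterance-level training outperforms full-conversation training for conversational TTS, with higher MOS scores and shorter training time.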

Details

Motivation: Examines whether existing open-source architectures or current training techniques explain why conversational TTS systems remain publicly inaccessible, and compares training approaches. Method: Compares context-based utterance-level training with full-conversation training using 20 GPU-hours on an NVIDIA H100. Result: Context-based utterance training reaches a MOS of 4.3/5.0 versus 3.7/5.0 for full-conversation training and cuts training time by 37%. Conclusion: Conversational TTS development should favor context-based utterance-level training for both resource efficiency and output quality. Abstract: Modern TTS systems designed for conversations achieve high-quality utterances but often remain inaccessible publicly. Are existing open-source architectures inadequate, or are current training techniques insufficient? This paper investigates prominent models and their underlying behaviors regarding conversational context. Using 20 GPU-hours on an NVIDIA H100, we empirically examine two approaches: context-based utterance-level training versus full conversation training. Results demonstrate that context-based utterance training achieves superior MOS scores (4.3/5.0 vs 3.7/5.0) and reduces training time by 37%, while full conversation approaches suffer from speaker similarity hallucination issues. These findings provide practical guidelines for conversational TTS development, favoring utterance-level training with contextual conditioning for both resource efficiency and output quality.

[163] Benchmarking Ethical and Safety Risks of Healthcare LLMs in China-Toward Systemic Governance under Healthy China 2030

Mouxiao Bian,Rongzhao Zhang,Chao Ding,Xinwei Peng,Jie Xu

Main category: cs.CL

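TL;DR: The study presents a 12,000-item Q&A benchmark for assessing the ethical and safety risks of healthcare large language models (LLMs), reveals shortcomings in current models, and proposes a governance framework to address them.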

Details

Motivation: Under China's "Healthy China 2030" initiative, LLMs have great potential in healthcare but bring ethical and patient-safety risks that require systematic evaluation and governance. Method: Builds a Q&A benchmark covering 11 ethics and 9 safety dimensions, evaluates mainstream Chinese medical LLMs on it, and improves their performance through fine-tuning. Result: Baseline accuracy is low (e.g., 42.7% for Qwen 2.5-32B) and rises to as much as 50.8% after fine-tuning, yet notable gaps remain in ethics and safety decision-making. Conclusion: The study calls for stronger LLM governance and proposes a framework (embedded auditing teams, data ethics guidelines, and more) to balance AI innovation with patient safety. Abstract: Large Language Models (LLMs) are poised to transform healthcare under China's Healthy China 2030 initiative, yet they introduce new ethical and patient-safety challenges. We present a novel 12,000-item Q&A benchmark covering 11 ethics and 9 safety dimensions in medical contexts, to quantitatively evaluate these risks. Using this dataset, we assess state-of-the-art Chinese medical LLMs (e.g., Qwen 2.5-32B, DeepSeek), revealing moderate baseline performance (accuracy 42.7% for Qwen 2.5-32B) and significant improvements after fine-tuning on our data (up to 50.8% accuracy). Results show notable gaps in LLM decision-making on ethics and safety scenarios, reflecting insufficient institutional oversight. We then identify systemic governance shortfalls-including the lack of fine-grained ethical audit protocols, slow adaptation by hospital IRBs, and insufficient evaluation tools-that currently hinder safe LLM deployment. Finally, we propose a practical governance framework for healthcare institutions (embedding LLM auditing teams, enacting data ethics guidelines, and implementing safety simulation pipelines) to proactively manage LLM risks. Our study highlights the urgent need for robust LLM governance in Chinese healthcare, aligning AI innovation with patient safety and ethical standards.

[164] DynamicRAG: Leveraging Outputs of Large Language Model as Feedback for Dynamic Reranking in Retrieval-Augmented Generation

Jiashuo Sun,Xianrui Zhong,Sizhe Zhou,Jiawei Han

Main category: cs.CL

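TL;DR: DynamicRAG is a RAG framework that uses reinforcement learning to dynamically adjust the number and order of retrieved documents, improving generation quality and efficiency.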

Details

Motivation: Rerankers in existing RAG systems underuse feedback signals from the LLM, and a fixed number of documents can either omit information or introduce noise. Method: Models the reranker as a reinforcement learning agent that uses LLM output quality as the reward to dynamically adjust the order and number of documents. Result: Achieves state-of-the-art results across seven knowledge-intensive datasets. Conclusion: By dynamically adjusting its retrieval strategy, DynamicRAG improves the performance and explainability of RAG systems. Abstract: Retrieval-augmented generation (RAG) systems combine large language models (LLMs) with external knowledge retrieval, making them highly effective for knowledge-intensive tasks. A crucial but often under-explored component of these systems is the reranker, which refines retrieved documents to enhance generation quality and explainability. The challenge of selecting the optimal number of documents (k) remains unsolved: too few may omit critical information, while too many introduce noise and inefficiencies. Although recent studies have explored LLM-based rerankers, they primarily leverage internal model knowledge and overlook the rich supervisory signals that LLMs can provide, such as using response quality as feedback for optimizing reranking decisions. In this paper, we propose DynamicRAG, a novel RAG framework where the reranker dynamically adjusts both the order and number of retrieved documents based on the query. We model the reranker as an agent optimized through reinforcement learning (RL), using rewards derived from LLM output quality. Across seven knowledge-intensive datasets, DynamicRAG demonstrates superior performance, achieving state-of-the-art results. The model, data and code are available at https://github.com/GasolSun36/DynamicRAG

[165] SAS-Bench: A Fine-Grained Benchmark for Evaluating Short Answer Scoring with Large Language Models

Peichao Lai,Kexuan Zhang,Yi Lin,Linyihan Zhang,Feiyang Ye,Jinhao Yan,Yanwei Xu,Conghui He,Yilei Wang,Wentao Zhang,Bin Cui

Main category: cs.CL

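TL;DR: SAS-Bench is a short answer scoring benchmark designed for LLMs, offering fine-grained scoring and expert-annotated error categories to address the shortcomings of existing approaches.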

Details

Motivation: Existing subjective answer grading methods are typically coarse-grained and lack detailed reasoning, while LLMs used as zero-shot evaluators suffer from bias, inconsistency, and limited transparency. Method: Introduces the SAS-Bench benchmark, with 1,030 questions and 4,109 student responses annotated with error categories, and tests the scoring quality of various LLMs. Result: Science-related questions are especially challenging to score, and few-shot prompting improves scoring accuracy. Conclusion: SAS-Bench is a useful reference for developing more robust, fair, and educationally meaningful LLM-based scoring systems. Abstract: Subjective Answer Grading (SAG) plays a crucial role in education, standardized testing, and automated assessment systems, particularly for evaluating short-form responses in Short Answer Scoring (SAS). However, existing approaches often produce coarse-grained scores and lack detailed reasoning. Although large language models (LLMs) have demonstrated potential as zero-shot evaluators, they remain susceptible to bias, inconsistencies with human judgment, and limited transparency in scoring decisions. To overcome these limitations, we introduce SAS-Bench, a benchmark specifically designed for LLM-based SAS tasks. SAS-Bench provides fine-grained, step-wise scoring, expert-annotated error categories, and a diverse range of question types derived from real-world subject-specific exams. This benchmark facilitates detailed evaluation of model reasoning processes and explainability. We also release an open-source dataset containing 1,030 questions and 4,109 student responses, each annotated by domain experts. Furthermore, we conduct comprehensive experiments with various LLMs, identifying major challenges in scoring science-related questions and highlighting the effectiveness of few-shot prompting in improving scoring accuracy. Our work offers valuable insights into the development of more robust, fair, and educationally meaningful LLM-based evaluation systems.

[166] No Query, No Access

Wenqiang Wang,Siyuan Liang,Yangshijie Zhang,Xiaojun Jia,Hao Lin,Xiaochun Cao

Main category: cs.CL

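TL;DR: VDBA is an adversarial attack that needs only victim texts; it uses a shadow dataset and a hierarchical substitute model design to raise the attack success rate while reducing attack queries to zero.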

Details

Motivation: Existing adversarial attacks require information about the victim model or its training data, which limits real-world feasibility. Method: Builds a shadow dataset using publicly available pre-trained models and clustering, designs hierarchical substitute models, and combines them with diverse adversarial example generation. Result: On the Emotion and SST5 datasets, VDBA improves ASR by 52.08% with zero attack queries, and it poses a significant threat to LLMs. Conclusion: VDBA confirms that advanced NLP models still face serious security risks, since a high attack success rate is possible without API access. Abstract: Textual adversarial attacks mislead NLP models, including Large Language Models (LLMs), by subtly modifying text. While effective, existing attacks often require knowledge of the victim model, extensive queries, or access to training data, limiting real-world feasibility. To overcome these constraints, we introduce the Victim Data-based Adversarial Attack (VDBA), which operates using only victim texts. To prevent access to the victim model, we create a shadow dataset with publicly available pre-trained models and clustering methods as a foundation for developing substitute models. To address the low attack success rate (ASR) due to insufficient information feedback, we propose the hierarchical substitution model design, generating substitute models to mitigate the failure of a single substitute model at the decision boundary. Concurrently, we use diverse adversarial example generation, employing various attack methods to generate and select the adversarial example with better similarity and attack effectiveness. Experiments on the Emotion and SST5 datasets show that VDBA outperforms state-of-the-art methods, achieving an ASR improvement of 52.08% while significantly reducing attack queries to 0. More importantly, we discover that VDBA poses a significant threat to LLMs such as Qwen2 and the GPT family, and achieves the highest ASR of 45.99% even without access to the API, confirming that advanced NLP models still face serious security risks. Our codes can be found at https://anonymous.4open.science/r/VDBA-Victim-Data-based-Adversarial-Attack-36EC/

[167] On the Robustness of Reward Models for Language Model Alignment

Jiwoo Hong,Noah Lee,Eunki Kim,Guijin Son,Woojin Chung,Aman Gupta,Shao Tang,James Thorne

Main category: cs.CL

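TL;DR: The paper studies over-optimization in reward models trained with the Bradley-Terry (BT) loss, proposes batch-wise sum-to-zero regularization (BSR) to improve robustness, and validates it experimentally.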

Details

Motivation: The BT model is widely used in RLHF, but reward models trained with it are prone to over-optimization and generalize poorly to unseen data. Method: Identifies excessive dispersion of hidden state norms as the main source of over-optimization and proposes BSR, which enforces a zero-centered reward sum per batch to constrain rewards with extreme magnitudes. Result: BSR is more robust across four over-optimization scenarios and better aligns the policy in RLHF training; at the 8B scale it adds more than 5% on complex preference prediction tasks. Conclusion: BSR improves the robustness of reward models, which in turn improves the robustness of RLHF training. Abstract: The Bradley-Terry (BT) model is widely practiced in reward modeling for reinforcement learning with human feedback (RLHF). Despite its effectiveness, reward models (RMs) trained with BT model loss are prone to over-optimization, losing generalizability to unseen input distributions. In this paper, we study the cause of over-optimization in RM training and its downstream effects on the RLHF procedure, accentuating the importance of distributional robustness of RMs in unseen data. First, we show that the excessive dispersion of hidden state norms is the main source of over-optimization. Then, we propose batch-wise sum-to-zero regularization (BSR) to enforce zero-centered reward sum per batch, constraining the rewards with extreme magnitudes. We assess the impact of BSR in improving robustness in RMs through four scenarios of over-optimization, where BSR consistently manifests better robustness. Subsequently, we compare the plain BT model and BSR on RLHF training and empirically show that robust RMs better align the policy to the gold preference model. Finally, we apply BSR to high-quality data and models, which surpasses state-of-the-art RMs in the 8B scale by adding more than 5% in complex preference prediction tasks. By conducting RLOO training with 8B RMs, AlpacaEval 2.0 reduces generation length by 40% while adding a 7% increase in win rate, further highlighting that robustness in RMs induces robustness in RLHF training. We release the code, data, and models: https://github.com/LinkedIn-XFACT/RM-Robustness.
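
The regularizer described in the abstract above can be illustrated with a minimal sketch. It assumes the penalty is the squared sum of all rewards in a batch, weighted by a hypothetical coefficient `lam`; the abstract does not give the exact form or weight, and the function name `bt_loss_with_bsr` is invented here for illustration.

```python
import math


def _softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def bt_loss_with_bsr(chosen, rejected, lam=0.01):
    """Mean Bradley-Terry loss plus a batch-wise sum-to-zero penalty.

    chosen / rejected: scalar rewards for the preferred and dispreferred
    response of each pair. `lam` and the squared-sum penalty are
    assumptions of this sketch, not values taken from the paper.
    """
    if not chosen or len(chosen) != len(rejected):
        raise ValueError("need equally sized, non-empty reward lists")
    # Bradley-Terry term: -log sigmoid(r_chosen - r_rejected) per pair
    bt = sum(_softplus(-(c - r)) for c, r in zip(chosen, rejected)) / len(chosen)
    # Penalty pulling the sum of all rewards in the batch toward zero
    batch_sum = sum(chosen) + sum(rejected)
    return bt + lam * batch_sum ** 2


centered = bt_loss_with_bsr([2.0, 0.0], [0.0, -2.0])
shifted = bt_loss_with_bsr([5.0, 3.0], [3.0, 1.0])
print(centered, shifted)
```

The Bradley-Terry term depends only on reward differences, so shifting every reward by the same constant leaves it unchanged; in this sketch only the penalty term distinguishes the centered batch from the shifted one.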

[168] Semantic Retention and Extreme Compression in LLMs: Can We Have Both?

Stanislas Laborde,Martin Cousseau,Antoun Yaacoub,Lionel Prevost

Main category: cs.CL

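TL;DR: The paper examines joint compression that combines pruning and quantization, and introduces a new metric, SrCr, to assess the trade-off between semantic retention and compression rate; experiments show better performance than quantization alone.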

Details

Motivation: As large language models (LLMs) are widely deployed, efficient compression techniques are urgently needed, yet the combined potential of pruning and quantization remains largely unexplored. Method: Studies joint compression strategies that combine pruning and quantization, and introduces the SrCr metric to assess the balance between semantic retention and compression rate. Result: The recommended combination improves performance by 20% on average over a quantization-only model at the same theoretical compression rate. Conclusion: Joint compression strikes a better balance between performance and compression rate, and SrCr is an effective tool for optimizing configurations. Abstract: The exponential growth in Large Language Model (LLM) deployment has intensified the need for efficient model compression techniques to reduce computational and memory costs. While pruning and quantization have shown promise, their combined potential remains largely unexplored. In this paper, we examine joint compression and how strategically combining pruning and quantization could yield superior performance-to-compression ratios compared to single-method approaches. Recognizing the challenges in accurately assessing LLM performance, we address key limitations of previous evaluation frameworks and introduce the Semantic Retention Compression Rate (SrCr), a novel metric that quantifies the trade-off between model compression and semantic preservation, facilitating the optimization of pruning-quantization configurations. Experiments demonstrate that our recommended combination achieves, on average, a 20% performance increase compared to an equivalent quantization-only model at the same theoretical compression rate.

[169] AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection

Kai Hua,Steven Wu,Ge Zhang,Ke Shen

Main category: cs.CL

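TL;DR: Proposes AttentionInfluence, a training-free method without supervision signals that uses attention head masking to select reasoning-intensive pretraining data and improve language model reasoning.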

Details

Motivation: Existing approaches rely on supervised classifiers to collect reasoning-intensive data, which tends to introduce domain-specific biases, so an unsupervised method is needed. Method: Uses an attention head masking operation so that a small model can select data, which is then mixed into the corpus used to pretrain a larger model. Result: Improvements of 1.4pp to 3.5pp across several reasoning-intensive benchmarks. Conclusion: AttentionInfluence offers a scalable path for reasoning-centric data selection. Abstract: Recently, there has been growing interest in collecting reasoning-intensive pretraining data to improve LLMs' complex reasoning ability. Prior approaches typically rely on supervised classifiers to identify such data, which requires labeling by humans or LLMs, often introducing domain-specific biases. Due to the attention heads being crucial to in-context reasoning, we propose AttentionInfluence, a simple yet effective, training-free method without supervision signal. Our approach enables a small pretrained language model to act as a strong data selector through a simple attention head masking operation. Specifically, we identify retrieval heads and compute the loss difference when masking these heads. We apply AttentionInfluence to a 1.3B-parameter dense model to conduct data selection on the SmolLM corpus of 241B tokens, and mix the SmolLM corpus with the selected subset comprising 73B tokens to pretrain a 7B-parameter dense model using 1T training tokens and WSD learning rate scheduling. Our experimental results demonstrate substantial improvements, ranging from 1.4pp to 3.5pp, across several knowledge-intensive and reasoning-heavy benchmarks (i.e., MMLU, MMLU-Pro, AGIEval-en, GSM8K, and HumanEval). This demonstrates an effective weak-to-strong scaling property, with small models improving the final performance of larger models-offering a promising and scalable path for reasoning-centric data selection.
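
The selection step in the abstract above (score each sample by the loss difference when retrieval heads are masked) might look roughly like the following sketch. It assumes per-sample losses from the unmasked and head-masked small model are already available as plain lists, uses the raw difference as the score, and keeps a hypothetical top fraction; the function name, the `keep_fraction` parameter, and the example numbers are all illustrative and not taken from the paper.

```python
def select_by_loss_difference(base_losses, masked_losses, keep_fraction=0.3):
    """Return indices of samples whose loss rises most under head masking.

    base_losses:   per-sample loss of the unmodified small model
    masked_losses: per-sample loss with the chosen heads masked
    keep_fraction: hypothetical share of the corpus to keep
    """
    if len(base_losses) != len(masked_losses):
        raise ValueError("loss lists must have the same length")
    # A larger increase suggests the sample relies more on the masked heads
    scores = [m - b for b, m in zip(base_losses, masked_losses)]
    k = max(1, int(keep_fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Made-up losses for four samples; samples 1 and 3 react most to masking.
print(select_by_loss_difference([2.0, 2.0, 2.0, 2.0], [2.1, 3.5, 2.0, 2.9], 0.5))
```

Computing the two loss lists requires running a model with and without the masked heads, which is outside the scope of this sketch.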

[170] Towards Multi-Agent Reasoning Systems for Collaborative Expertise Delegation: An Exploratory Design Study

Baixuan Xu,Chunyang Li,Weiqi Wang,Wei Fan,Tianshi Zheng,Haochen Shi,Tao Fan,Yangqiu Song,Qiang Yang

Main category: cs.CL

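TL;DR: Studies how collaboration structure affects collective reasoning in multi-agent LLM systems and finds that expertise alignment, collaboration paradigm (structured workflow vs. diversity-driven integration), and system scale are the key design dimensions.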

Details

Motivation: The design of collaboration structures in multi-agent LLM systems is crucial for collective reasoning but remains under-explored. Method: Systematically studies three design dimensions (expertise alignment, collaboration paradigm, and system scale) and examines their effects experimentally. Result: Expertise alignment is most effective for contextual reasoning tasks; diversity-driven integration outperforms rigid task decomposition; scaling the system involves a trade-off with computational efficiency. Conclusion: Provides concrete guidelines for configuring multi-agent systems and identifies key architectural trade-offs and bottlenecks for scalable reasoning. Abstract: Designing effective collaboration structure for multi-agent LLM systems to enhance collective reasoning is crucial yet remains under-explored. In this paper, we systematically investigate how collaborative reasoning performance is affected by three key design dimensions: (1) Expertise-Domain Alignment, (2) Collaboration Paradigm (structured workflow vs. diversity-driven integration), and (3) System Scale. Our findings reveal that expertise alignment benefits are highly domain-contingent, proving most effective for contextual reasoning tasks. Furthermore, collaboration focused on integrating diverse knowledge consistently outperforms rigid task decomposition. Finally, we empirically explore the impact of scaling the multi-agent system with expertise specialization and study the computational trade off, highlighting the need for more efficient communication protocol design. This work provides concrete guidelines for configuring specialized multi-agent system and identifies critical architectural trade-offs and bottlenecks for scalable multi-agent reasoning. The code will be made available upon acceptance.

[171] QUPID: Quantified Understanding for Enhanced Performance, Insights, and Decisions in Korean Search Engines

Ohjoon Kwon,Changsu Lee,Jihye Back,Lim Sun Suk,Inho Kang,Donghyeon Jeon

Main category: cs.CL

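TL;DR: QUPID combines two small language models (SLMs) with different architectures and outperforms large language models (LLMs) on relevance assessment in information retrieval, improving judgment accuracy while reducing computational cost.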

Details

Motivation: Explore how combining small language models (SLMs) with different architectures can improve relevance judgment in information retrieval while reducing compute consumption. Method: Proposes QUPID, which integrates a generative SLM with an embedding-based SLM and exploits architectural diversity to improve performance. Result: QUPID reaches a Cohen's Kappa of 0.646 (versus 0.387 for leading LLMs), runs inference 60x faster, and improves nDCG@5 by 1.9% in a production search system. Conclusion: By combining SLMs with different architectures, QUPID improves both relevance and operational efficiency in information retrieval, underscoring the value of architectural diversity. Abstract: Large language models (LLMs) have been widely used for relevance assessment in information retrieval. However, our study demonstrates that combining two distinct small language models (SLMs) with different architectures can outperform LLMs in this task. Our approach -- QUPID -- integrates a generative SLM with an embedding-based SLM, achieving higher relevance judgment accuracy while reducing computational costs compared to state-of-the-art LLM solutions. This computational efficiency makes QUPID highly scalable for real-world search systems processing millions of queries daily. In experiments across diverse document types, our method demonstrated consistent performance improvements (Cohen's Kappa of 0.646 versus 0.387 for leading LLMs) while offering 60x faster inference times. Furthermore, when integrated into production search pipelines, QUPID improved nDCG@5 scores by 1.9%. These findings underscore how architectural diversity in model combinations can significantly enhance both search relevance and operational efficiency in information retrieval systems.
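
For readers unfamiliar with the agreement metric quoted in the abstract above, the following is a minimal sketch of the standard Cohen's Kappa formula, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The relevance labels in the example are made up and have no connection to the paper's data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa between two equally long label sequences."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two non-empty label lists of equal length")
    # Observed agreement: share of items on which both raters agree
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical binary relevance judgments (1 = relevant) from a model and a human.
model_labels = [1, 1, 0, 0]
human_labels = [1, 0, 0, 0]
print(cohens_kappa(model_labels, human_labels))  # 0.5
```

In this toy example the raters agree on 3 of 4 items (p_o = 0.75) and chance agreement is 0.5, giving a kappa of 0.5.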

Tim Wittenborg,Constantin Sebastian Tremel,Markus Stocker,Sören Auer

Main category: cs.CL

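TL;DR: The paper proposes a semi-automatic way to quantify the scientific accuracy of online media by semantifying it and comparing it against trusted sources using LLMs and knowledge graph analysis; the tool still needs improvement to annotate public media at finer granularity and larger scale.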

Details

Motivation: Democratic societies need reliable information, but misinformation in popular media threatens civic discourse and citizens cannot verify the flood of content. Method: A neurosymbolic system using LLM-based statement extraction and knowledge graph analysis to quantify the veracity of media content. Result: Evaluated via expert interviews and a user survey, the tool provides a beneficial veracity indication but cannot meet the granularity and scale that public media requires. Conclusion: Further work toward a FAIR ground truth and complementary metrics is needed to scientifically support civic discourse. Abstract: Democratic societies need reliable information. Misinformation in popular media such as news articles or videos threatens to impair civic discourse. Citizens are, unfortunately, not equipped to verify this content flood consumed daily at increasing rates. This work aims to semi-automatically quantify scientific accuracy of online media. By semantifying media of unknown veracity, their statements can be compared against equally processed trusted sources. We implemented a workflow using LLM-based statement extraction and knowledge graph analysis. Our neurosymbolic system was able to evidently streamline state-of-the-art veracity quantification. Evaluated via expert interviews and a user survey, the tool provides a beneficial veracity indication. This indicator, however, is unable to annotate public media at the required granularity and scale. Further work towards a FAIR (Findable, Accessible, Interoperable, Reusable) ground truth and complementary metrics are required to scientifically support civic discourse.

[173] ViMRHP: A Vietnamese Benchmark Dataset for Multimodal Review Helpfulness Prediction via Human-AI Collaborative Annotation

Truc Mai-Thanh Nguyen,Dat Minh Nguyen,Son T. Luu,Kiet Van Nguyen

Main category: cs.CL

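TL;DR: Introduces ViMRHP, a large-scale benchmark dataset for multimodal review helpfulness prediction in Vietnamese, built with AI-assisted annotation to improve efficiency.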

Details

Motivation: Existing datasets focus mainly on English and Indonesian and offer little support for low-resource languages such as Vietnamese, so a Vietnamese multimodal review helpfulness dataset is needed. Method: Builds the ViMRHP dataset, covering four domains with 2K products and 46K reviews, and uses AI-assisted annotation to reduce time and cost. Result: AI assistance cuts annotation time from 90-120 seconds to 20-40 seconds per task and lowers costs by about 65%, but remains limited on complex tasks. Conclusion: ViMRHP fills the gap in Vietnamese multimodal review helpfulness prediction; AI-assisted annotation improves efficiency but needs further work on complex tasks. Abstract: Multimodal Review Helpfulness Prediction (MRHP) is an essential task in recommender systems, particularly in E-commerce platforms. Determining the helpfulness of user-generated reviews enhances user experience and improves consumer decision-making. However, existing datasets focus predominantly on English and Indonesian, resulting in a lack of linguistic diversity, especially for low-resource languages such as Vietnamese. In this paper, we introduce ViMRHP (Vietnamese Multimodal Review Helpfulness Prediction), a large-scale benchmark dataset for MRHP task in Vietnamese. This dataset covers four domains, including 2K products with 46K reviews. Meanwhile, a large-scale dataset requires considerable time and cost. To optimize the annotation process, we leverage AI to assist annotators in constructing the ViMRHP dataset. With AI assistance, annotation time is reduced (90 to 120 seconds per task down to 20 to 40 seconds per task) while maintaining data quality and lowering overall costs by approximately 65%. However, AI-generated annotations still have limitations in complex annotation tasks, which we further examine through a detailed performance analysis. In our experiment on ViMRHP, we evaluate baseline models on human-verified and AI-generated annotations to assess their quality differences. The ViMRHP dataset is publicly available at https://github.com/trng28/ViMRHP

[174] Comparative sentiment analysis of public perception: Monkeypox vs. COVID-19 behavioral insights

Mostafa Mohaimen Akand Faisal,Rabeya Amin Jhuma

Main category: cs.CL

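TL;DR: The paper analyzes tweets about COVID-19 and Monkeypox to compare public sentiment toward the two diseases and discusses the implications for public health strategy.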

Details

Motivation: Global health crises such as COVID-19 and Monkeypox highlight the importance of understanding public sentiment for effective public health strategies. Method: Applies machine learning models (Logistic Regression, Naive Bayes, RoBERTa, and others) to classify the sentiment of 147,475 COVID-19 tweets and 106,638 Monkeypox tweets. Result: Public sentiment differs significantly with disease characteristics, media representation, and pandemic fatigue. Conclusion: The study offers insights for tailoring public health messaging, mitigating misinformation, and building trust, and lays groundwork for future real-time monitoring and multilingual analysis. Abstract: The emergence of global health crises, such as COVID-19 and Monkeypox (mpox), has underscored the importance of understanding public sentiment to inform effective public health strategies. This study conducts a comparative sentiment analysis of public perceptions surrounding COVID-19 and mpox by leveraging extensive datasets of 147,475 and 106,638 tweets, respectively. Advanced machine learning models, including Logistic Regression, Naive Bayes, RoBERTa, DistilRoBERTa and XLNet, were applied to perform sentiment classification, with results indicating key trends in public emotion and discourse. The analysis highlights significant differences in public sentiment driven by disease characteristics, media representation, and pandemic fatigue. Through the lens of sentiment polarity and thematic trends, this study offers valuable insights into tailoring public health messaging, mitigating misinformation, and fostering trust during concurrent health crises. The findings contribute to advancing sentiment analysis applications in public health informatics, setting the groundwork for enhanced real-time monitoring and multilingual analysis in future research.

[175] Matching Tasks with Industry Groups for Augmenting Commonsense Knowledge

Rituraj Singh,Sachin Pawar,Girish Palshikar

Main category: cs.CL

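TL;DR: Proposes a weakly-supervised framework for augmenting commonsense knowledge bases (KBs) with industry task knowledge by training a neural model to learn the affinity between tasks and industry groups (IGs), extracting 2339 high-precision task-IG triples.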

Details

Motivation: Existing commonsense knowledge bases such as ConceptNet contain too little task knowledge from industry domains to meet practical needs. Method: A weakly-supervised framework that trains a neural model to learn task-IG affinity and applies clustering to select the top-k tasks per industry group. Result: Extracts 2339 task-IG triples from two publicly available news datasets with a precision of 0.86. Conclusion: The extracted task-IG pairs are reliable enough to be added directly to existing KBs, filling a gap in industry task knowledge. Abstract: Commonsense knowledge bases (KB) are a source of specialized knowledge that is widely used to improve machine learning applications. However, even for a large KB such as ConceptNet, capturing explicit knowledge from each industry domain is challenging. For example, only a few samples of general tasks performed by various industries are available in ConceptNet. Here, a task is a well-defined knowledge-based volitional action to achieve a particular goal. In this paper, we aim to fill this gap and present a weakly-supervised framework to augment commonsense KB with tasks carried out by various industry groups (IG). We attempt to match each task with one or more suitable IGs by training a neural model to learn task-IG affinity and apply clustering to select the top-k tasks per IG. We extract a total of 2339 triples of the form ⟨IG, is capable of, task⟩ from two publicly available news datasets for 24 IGs with the precision of 0.86. This validates the reliability of the extracted task-IG pairs that can be directly added to existing KBs.

[176] Translating the Grievance Dictionary: a psychometric evaluation of Dutch, German, and Italian versions

Isabelle van der Vegt,Bennett Kleinberg,Marilu Miotto,Jonas Festor

Main category: cs.CL

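TL;DR: Introduces Dutch, German, and Italian translations of the Grievance Dictionary and evaluates them through psychometric analyses. The Dutch and German versions perform similarly to the English original, while the Italian version shows low reliability for some categories.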

Details

Motivation: Analysis of violent, threatening, or grievance-fuelled texts also matters in languages beyond English, so the Grievance Dictionary was translated to widen its applicability. Method: Automated translation supplemented by human annotation, followed by psychometric analyses including internal reliability and correlations with the LIWC dictionary. Result: The Dutch and German translations perform well, while the Italian version shows low reliability for some categories. Conclusion: Makes suggestions for further validation and application of the dictionary and for future translations following a similar approach. Abstract: This paper introduces and evaluates three translations of the Grievance Dictionary, a psycholinguistic dictionary for the analysis of violent, threatening or grievance-fuelled texts. Considering the relevance of these themes in languages beyond English, we translated the Grievance Dictionary to Dutch, German, and Italian. We describe the process of automated translation supplemented by human annotation. Psychometric analyses are performed, including internal reliability of dictionary categories and correlations with the LIWC dictionary. The Dutch and German translations perform similarly to the original English version, whereas the Italian dictionary shows low reliability for some categories. Finally, we make suggestions for further validation and application of the dictionary, as well as for future dictionary translations following a similar approach.

[177] ToolACE-DEV: Self-Improving Tool Learning via Decomposition and EVolution

Xu Huang,Weiwen Liu,Xingshan Zeng,Yuefeng Huang,Xinlong Hao,Yuxian Wang,Yirong Zeng,Chuhan Wu,Yasheng Wang,Ruiming Tang,Defu Lian

Main category: cs.CL

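TL;DR: The paper proposes ToolACE-DEV, a framework that decomposes the tool-learning task and introduces a self-evolving paradigm, reducing reliance on advanced LLMs and addressing data compatibility and cost issues.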

Details

Motivation: Current approaches to enhancing LLM tool use rely on data synthesized by advanced models, which leads to high costs and data compatibility issues. Method: Decomposes the tool-learning objective into sub-tasks and introduces a self-evolving paradigm that lets lightweight models improve themselves. Result: Experiments validate the approach across models of varying scales and architectures. Conclusion: ToolACE-DEV's self-improving framework reduces the need for advanced LLMs while improving tool-learning ability. Abstract: The tool-using capability of large language models (LLMs) enables them to access up-to-date external information and handle complex tasks. Current approaches to enhancing this capability primarily rely on distilling advanced models by data synthesis. However, this method incurs significant costs associated with advanced model usage and often results in data compatibility issues, led by the high discrepancy in the knowledge scope between the advanced model and the target model. To address these challenges, we propose ToolACE-DEV, a self-improving framework for tool learning. First, we decompose the tool-learning objective into sub-tasks that enhance basic tool-making and tool-using abilities. Then, we introduce a self-evolving paradigm that allows lightweight models to self-improve, reducing reliance on advanced LLMs. Extensive experiments validate the effectiveness of our approach across models of varying scales and architectures.

[178] SEReDeEP: Hallucination Detection in Retrieval-Augmented Models via Semantic Entropy and Context-Parameter Fusion

Lei Wang

Main category: cs.CL

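TL;DR: SEReDeEP enhances the ReDeEP framework with semantic entropy to improve hallucination detection in RAG models, assessing hallucinations more accurately at the semantic level.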

Details

Motivation: RAG models are prone to hallucination when integrating external information with internal parametric knowledge, and existing methods mostly analyze the external or internal mechanism in isolation, overlooking their synergy. ReDeEP decouples the two mechanisms, but its assessment remains at the logit or language level and lacks a semantic dimension. Method: SEReDeEP builds on the ReDeEP framework and introduces semantic entropy captured via trained linear probes. Result: SEReDeEP assesses hallucinations more accurately at the semantic level, coming closer to ground truth evaluations than existing methods. Conclusion: Enhancing hallucination detection with semantic entropy offers a new direction for improving RAG models. Abstract: Retrieval-Augmented Generation (RAG) models frequently encounter hallucination phenomena when integrating external information with internal parametric knowledge. Empirical studies demonstrate that the disequilibrium between external contextual information and internal parametric knowledge constitutes a primary factor in hallucination generation. Existing hallucination detection methodologies predominantly emphasize either the external or internal mechanism in isolation, thereby overlooking their synergistic effects. The recently proposed ReDeEP framework decouples these dual mechanisms, identifying two critical contributors to hallucinations: excessive reliance on parametric knowledge encoded in feed-forward networks (FFN) and insufficient utilization of external information by attention mechanisms (particularly copy heads). ReDeEP quantitatively assesses these factors to detect hallucinations and dynamically modulates the contributions of FFNs and copy heads to attenuate their occurrence. Nevertheless, ReDeEP and numerous other hallucination detection approaches have been employed at logit-level uncertainty estimation or language-level self-consistency evaluation, inadequately address the semantic dimensions of model responses, resulting in inconsistent hallucination assessments in RAG implementations. Building upon ReDeEP's foundation, this paper introduces SEReDeEP, which enhances computational processes through semantic entropy captured via trained linear probes, thereby achieving hallucination assessments that more accurately reflect ground truth evaluations.

[179] A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models

Junjie Ye,Caishuang Huang,Zhuohan Chen,Wenjie Fu,Chenyuan Yang,Leyi Yang,Yilong Wu,Peng Wang,Meng Zhou,Xiaolong Yang,Tao Gui,Qi Zhang,Zhongchao Shi,Jianping Fan,Xuanjing Huang

Main category: cs.CL

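TL;DR: Proposes a multi-dimensional constraint framework for evaluating instruction following in large language models (LLMs) and uses an automated pipeline to generate 1,200 verifiable test samples. An evaluation of 19 LLMs shows that performance drops sharply as constraint difficulty increases. The approach is also used to generate reinforcement learning data, which improves model performance.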

Details

Motivation: Existing benchmarks lack diversity and cannot fully assess the instruction-following ability of LLMs in real-world usage. Method: The authors design a framework with three constraint patterns, four constraint categories, and four difficulty levels, and develop an automated instruction generation pipeline. Result: Average LLM performance drops markedly as constraint difficulty rises (from 77.67% to 32.96%), and data generated for reinforcement learning improves model performance. Conclusion: The multi-dimensional constraint framework and automated pipeline provide effective tools for evaluating and improving instruction following in LLMs. Abstract: Instruction following evaluates large language models (LLMs) on their ability to generate outputs that adhere to user-defined constraints. However, existing benchmarks often rely on templated constraint prompts, which lack the diversity of real-world usage and limit fine-grained performance assessment. To fill this gap, we propose a multi-dimensional constraint framework encompassing three constraint patterns, four constraint categories, and four difficulty levels. Building on this framework, we develop an automated instruction generation pipeline that performs constraint expansion, conflict detection, and instruction rewriting, yielding 1,200 code-verifiable instruction-following test samples. We evaluate 19 LLMs across seven model families and uncover substantial variation in performance across constraint forms. For instance, average performance drops from 77.67% at Level I to 32.96% at Level IV. Furthermore, we demonstrate the utility of our approach by using it to generate data for reinforcement learning, achieving substantial gains in instruction following without degrading general performance. In-depth analysis indicates that these gains stem primarily from modifications in the parameters of the model's attention modules, which enhance constraint recognition and adherence. Code and data are available at https://github.com/Junjie-Ye/MulDimIF.
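As an illustration of what "code-verifiable" can mean for an instruction-following sample, the sketch below checks a response against a list of programmatic constraints. The two constraint types (a word limit and a required keyword) and all function names are assumptions chosen for the example; they are not taken from the MulDimIF repository.

```python
def max_words(limit):
    """Build a checker that passes when the response has at most `limit` words."""
    return lambda response: len(response.split()) <= limit


def must_contain(keyword):
    """Build a checker that passes when `keyword` appears (case-insensitive)."""
    return lambda response: keyword.lower() in response.lower()


def verify(response, constraints):
    """A response follows the instruction only if every constraint passes."""
    return all(check(response) for check in constraints)


# Hypothetical sample: "Answer in at most 8 words and mention 'entropy'."
constraints = [max_words(8), must_contain("entropy")]
print(verify("Entropy measures uncertainty in a distribution.", constraints))
print(verify("It measures uncertainty.", constraints))
```

Stacking more checkers on one instruction is one simple way to raise difficulty, since a single failed constraint fails the whole sample.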

[180] Reinforced Internal-External Knowledge Synergistic Reasoning for Efficient Adaptive Search Agent

Ziyang Huang,Xiaowei Yuan,Yiming Ju,Jun Zhao,Kang Liu

Main category: cs.CL

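TL;DR: IKEA is a reinforcement-learning-driven retrieval-augmented generation method that optimizes retrieval timing and knowledge integration, reducing redundant retrievals and improving performance.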

Details

Motivation: Existing retrieval-augmented generation methods underuse the internal knowledge of LLMs, which leads to redundant retrievals, knowledge conflicts, and added latency. Method: The paper proposes IKEA, which combines a knowledge-boundary aware reward function with a knowledge-boundary aware training dataset to optimize retrieval timing and knowledge integration. Result: IKEA significantly outperforms baseline methods across multiple tasks, reduces retrieval frequency, and generalizes robustly. Conclusion: By using internal and external knowledge synergistically, IKEA improves the retrieval efficiency and accuracy of LLMs. Abstract: Retrieval-augmented generation (RAG) is a common strategy to reduce hallucinations in Large Language Models (LLMs). While reinforcement learning (RL) can enable LLMs to act as search agents by activating retrieval capabilities, existing search agents often underutilize their internal knowledge. This can lead to redundant retrievals, potentially harmful knowledge conflicts, and increased inference latency. To address these limitations, an efficient and adaptive search agent capable of discerning optimal retrieval timing and synergistically integrating parametric (internal) and retrieved (external) knowledge is urgently needed. This paper introduces the Reinforced Internal-External Knowledge Synergistic Reasoning Agent (IKEA), which can identify its own knowledge boundary and prioritize the utilization of internal knowledge, resorting to external search only when internal knowledge is deemed insufficient. This is achieved using a novel knowledge-boundary aware reward function and a knowledge-boundary aware training dataset. These are designed for internal-external knowledge synergy oriented RL, incentivizing the model to deliver accurate answers, minimize unnecessary retrievals, and encourage appropriate external searches when its own knowledge is lacking. Evaluations across multiple knowledge reasoning tasks demonstrate that IKEA significantly outperforms baseline methods, reduces retrieval frequency significantly, and exhibits robust generalization capabilities.

[181] Characterizing the Investigative Methods of Fictional Detectives with Large Language Models

Edirlei Soares de Lima,Marco A. Casanova,Bruno Feijó,Antonio L. Furtado

Main category: cs.CL

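TL;DR: This paper presents an AI-driven approach that uses 15 large language models (LLMs) to systematically characterize the investigative methods of fictional detectives. It reaches 91.43% accuracy in validation and offers a scalable framework for computational narratology.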

Details

Motivation: Traditional literary analyses of fictional detectives cover only a few characters and lack the scalability needed for automated narrative generation. Method: A multi-phase workflow uses 15 LLMs to extract, synthesize, and validate the distinctive investigative traits of fictional detectives, tested on seven iconic detectives. Result: The method reaches 91.43% accuracy in the reverse identification phase and captures each detective's distinctive investigative style. Conclusion: The study provides a scalable character-analysis framework for computational narratology, with potential applications in AI-driven interactive storytelling and automated narrative generation. Abstract: Detective fiction, a genre defined by its complex narrative structures and character-driven storytelling, presents unique challenges for computational narratology, a research field focused on integrating literary theory into automated narrative generation. While traditional literary studies have offered deep insights into the methods and archetypes of fictional detectives, these analyses often focus on a limited number of characters and lack the scalability needed for the extraction of unique traits that can be used to guide narrative generation methods. In this paper, we present an AI-driven approach for systematically characterizing the investigative methods of fictional detectives. Our multi-phase workflow explores the capabilities of 15 Large Language Models (LLMs) to extract, synthesize, and validate distinctive investigative traits of fictional detectives. This approach was tested on a diverse set of seven iconic detectives - Hercule Poirot, Sherlock Holmes, William Murdoch, Columbo, Father Brown, Miss Marple, and Auguste Dupin - capturing the distinctive investigative styles that define each character. The identified traits were validated against existing literary analyses and further tested in a reverse identification phase, achieving an overall accuracy of 91.43%, demonstrating the method's effectiveness in capturing the distinctive investigative approaches of each detective. This work contributes to the broader field of computational narratology by providing a scalable framework for character analysis, with potential applications in AI-driven interactive storytelling and automated narrative generation.

[182] MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining

Xiaomi LLM-Core Team,Bingquan Xia,Bowen Shen,Cici,Dawei Zhu,Di Zhang,Gang Wang,Hailin Zhang,Huaqiu Liu,Jiebao Xiao,Jinhao Dong,Liang Zhao,Peidian Li,Peng Wang,Shihua Yu,Shimao Chen,Weikun Wang,Wenhan Ma,Xiangwei Deng,Yi Huang,Yifan Song,Zihan Jiang,Bowen Ye,Can Cai,Chenhong He,Dong Zhang,Duo Zhang,Guoan Wang,Hao Tian,Haochen Zhao,Heng Qu,Hongshen Xu,Jun Shi,Kainan Bao,QingKai Fang,Kang Zhou,Kangyang Zhou,Lei Li,Menghang Zhu,Nuo Chen,Qiantong Wang,Shaohui Liu,Shicheng Li,Shuhao Gu,Shuhuai Ren,Shuo Liu,Sirui Deng,Weiji Zhuang,Weiwei Lv,Wenyu Yang,Xin Zhang,Xing Yong,Xing Zhang,Xingchen Song,Xinzhe Xu,Xu Wang,Yihan Yan,Yu Tu,Yuanyuan Tian,Yudong Wang,Yue Yu,Zhenru Lin,Zhichao Song,Zihao Yue

Main category: cs.CL

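TL;DR: MiMo-7B is a large language model designed for reasoning tasks; optimization across both the pre-training and post-training stages substantially improves its reasoning ability.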

Details

Motivation: To address insufficient model performance on reasoning tasks, the team optimizes data preprocessing and training strategy. Method: Pre-training uses a three-stage data mixing strategy and a Multi-Token Prediction objective; post-training applies reinforcement learning on 130K verifiable mathematics and programming problems, with a code-reward scheme and data resampling. Result: MiMo-7B-Base shows stronger reasoning potential than 32B models, and the final RL-tuned model, MiMo-7B-RL, performs strongly on mathematics, code, and general reasoning tasks, surpassing OpenAI o1-mini. Conclusion: By optimizing training strategy and data, MiMo-7B substantially improves reasoning ability and offers an efficient solution for reasoning tasks. Abstract: We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model's reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with an additional Multi-Token Prediction objective for enhanced performance and accelerated inference speed. During post-training, we curate a dataset of 130K verifiable mathematics and programming problems for reinforcement learning, integrating a test-difficulty-driven code-reward scheme to alleviate sparse-reward issues and employing strategic data resampling to stabilize training. Extensive evaluations show that MiMo-7B-Base possesses exceptional reasoning potential, outperforming even much larger 32B models. The final RL-tuned model, MiMo-7B-RL, achieves superior performance on mathematics, code and general reasoning tasks, surpassing the performance of OpenAI o1-mini. The model checkpoints are available at https://github.com/xiaomimimo/MiMo.

[183] Concept-Level Explainability for Auditing & Steering LLM Responses

Kenza Amara,Rita Sevastjanova,Mennatallah El-Assady

Main category: cs.CL

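TL;DR: ConceptX is a model-agnostic, concept-level explainability method that identifies semantically rich concepts in the prompt and assigns them importance based on semantic similarity, outperforming existing token-level methods.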

Details

Motivation: As large language models (LLMs) are widely deployed, concerns about their safety and alignment grow, calling for a more transparent and faithful way to steer model behavior. Method: ConceptX identifies semantic concepts in the prompt and assigns them importance based on the semantic similarity of the outputs; it preserves context integrity and supports flexible explanation goals (e.g., gender bias). Result: Across three LLMs, ConceptX outperforms token-level methods such as TokenSHAP in faithfulness and human alignment; in steering tasks it raises sentiment shift to 0.252 and lowers attack success rate to 0.242. Conclusion: ConceptX offers a transparent and faithful way to improve LLM safety and alignment, showing the practical value of attribution-based explainability in guiding model behavior. Abstract: As large language models (LLMs) become widely deployed, concerns about their safety and alignment grow. An approach to steer LLM behavior, such as mitigating biases or defending against jailbreaks, is to identify which parts of a prompt influence specific aspects of the model's output. Token-level attribution methods offer a promising solution, but still struggle in text generation, explaining the presence of each token in the output separately, rather than the underlying semantics of the entire LLM response. We introduce ConceptX, a model-agnostic, concept-level explainability method that identifies the concepts, i.e., semantically rich tokens in the prompt, and assigns them importance based on the outputs' semantic similarity. Unlike current token-level methods, ConceptX also preserves context integrity through in-place token replacements and supports flexible explanation goals, e.g., gender bias. ConceptX enables both auditing, by uncovering sources of bias, and steering, by modifying prompts to shift the sentiment or reduce the harmfulness of LLM responses, without requiring retraining. Across three LLMs, ConceptX outperforms token-level methods like TokenSHAP in both faithfulness and human alignment. Steering tasks boost sentiment shift by 0.252 versus 0.131 for random edits and lower attack success rates from 0.463 to 0.242, outperforming attribution and paraphrasing baselines. While prompt engineering and self-explaining methods sometimes yield safer responses, ConceptX offers a transparent and faithful alternative for improving LLM safety and alignment, demonstrating the practical value of attribution-based explainability in guiding LLM behavior.

[184] Chronocept: Instilling a Sense of Time in Machines

Krish Goel,Sanskar Pandey,KS Mahadevan,Harsh Kumar,Vishesh Khadaria

Main category: cs.CL

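TL;DR: Chronocept is the first benchmark to model temporal validity as a continuous probability distribution, filling a foundational gap in AI temporal reasoning.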

Details

Motivation: Human cognition is closely tied to a sense of time (Chronoception), yet AI still falls short in reasoning about temporal validity. Method: By fitting skew-normal curves, Chronocept captures nuanced patterns of emergence, decay, and peak relevance; it includes two datasets. Result: Baseline models predict the curve parameters (location, scale, and skewness) and outperform classification-based approaches; inter-annotator agreement is high (84% and 89%). Conclusion: Chronocept supports applications such as knowledge grounding, fact-checking, and retrieval-augmented generation; code and data are publicly available. Abstract: Human cognition is deeply intertwined with a sense of time, known as Chronoception. This sense allows us to judge how long facts remain valid and when knowledge becomes outdated. Despite progress in vision, language, and motor control, AI still struggles to reason about temporal validity. We introduce Chronocept, the first benchmark to model temporal validity as a continuous probability distribution over time. Using skew-normal curves fitted along semantically decomposed temporal axes, Chronocept captures nuanced patterns of emergence, decay, and peak relevance. It includes two datasets: Benchmark I (atomic facts) and Benchmark II (multi-sentence passages). Annotations show strong inter-annotator agreement (84% and 89%). Our baselines predict curve parameters - location, scale, and skewness - enabling interpretable, generalizable learning and outperforming classification-based approaches. Chronocept fills a foundational gap in AI's temporal reasoning, supporting applications in knowledge grounding, fact-checking, retrieval-augmented generation (RAG), and proactive agents. Code and data are publicly available.
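For readers unfamiliar with the skew-normal family, the sketch below evaluates its standard density from the three parameters the baselines predict (location, scale, skewness). The time axis, the parameter values, and the function name are illustrative assumptions; how Chronocept maps time onto the axis is not specified in the abstract.

```python
import math


def skew_normal_pdf(x, location, scale, skewness):
    """Standard skew-normal density: (2 / scale) * phi(z) * Phi(skewness * z)."""
    z = (x - location) / scale
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # normal pdf at z
    big_phi = 0.5 * (1.0 + math.erf(skewness * z / math.sqrt(2.0)))  # normal cdf
    return 2.0 / scale * phi * big_phi


# Hypothetical validity curve: relevance rises quickly, then decays slowly.
for t in (-1.0, 0.0, 1.0, 2.0, 3.0):
    print(t, round(skew_normal_pdf(t, location=0.0, scale=1.5, skewness=4.0), 4))
```

With zero skewness the curve reduces to an ordinary normal density; a positive skewness puts a longer tail after the peak, which suits facts that become relevant abruptly and fade gradually.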

[185] JobHop: A Large-Scale Dataset of Career Trajectories

Iman Johary,Raphael Romero,Alexandru C. Mara,Tijl De Bie

Main category: cs.CL

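TL;DR: The paper introduces the JobHop dataset, built by using LLMs to process anonymized resume data, extract structured career information, and map it to ESCO occupation codes, providing rich data for labor market research.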

Details

Motivation: Labor market dynamics matter to policymakers, employers, and job seekers, but comprehensive datasets of career trajectories are scarce. Method: LLMs process unstructured resume data, and a multi-label classification model maps the extracted information to ESCO occupation codes, producing a dataset of over 2.3 million work experiences. Result: The resulting JobHop dataset covers more than 391,000 resumes and supports analyses of labor market mobility, job stability, and related questions. Conclusion: JobHop is a valuable resource for labor market research, supporting data-driven decision-making and career path prediction. Abstract: Understanding labor market dynamics is essential for policymakers, employers, and job seekers. However, comprehensive datasets that capture real-world career trajectories are scarce. In this paper, we introduce JobHop, a large-scale public dataset derived from anonymized resumes provided by VDAB, the public employment service in Flanders, Belgium. Utilizing Large Language Models (LLMs), we process unstructured resume data to extract structured career information, which is then mapped to standardized ESCO occupation codes using a multi-label classification model. This results in a rich dataset of over 2.3 million work experiences, extracted from and grouped into more than 391,000 user resumes and mapped to standardized ESCO occupation codes, offering valuable insights into real-world occupational transitions. This dataset enables diverse applications, such as analyzing labor market mobility, job stability, and the effects of career breaks on occupational transitions. It also supports career path prediction and other data-driven decision-making processes. To illustrate its potential, we explore key dataset characteristics, including job distributions, career breaks, and job transitions, demonstrating its value for advancing labor market research.

[186] Using Information Theory to Characterize Prosodic Typology: The Case of Tone, Pitch-Accent and Stress-Accent

Ethan Gotlieb Wilcox,Cui Ding,Giovanni Acampa,Tiago Pimentel,Alex Warstadt,Tamar I. Regev

Main category: cs.CL

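TL;DR: This paper uses information theory to study the relationship between lexical identity and prosody, finding that mutual information between words and pitch is higher in tonal languages and supporting the view that linguistic typology is gradient.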

Details

Motivation: To examine the relationship between lexical identity and prosody (such as pitch) and test whether mutual information between words and pitch is higher in tonal languages. Method: Using a dataset covering ten languages, the authors estimate the mutual information between text and pitch curves with information-theoretic methods. Result: Pitch curves are easier to predict from text in tonal languages, so mutual information is higher, supporting the hypothesis. Conclusion: The results support the view that linguistic typology is gradient rather than categorical. Abstract: This paper argues that the relationship between lexical identity and prosody -- one well-studied parameter of linguistic variation -- can be characterized using information theory. We predict that languages that use prosody to make lexical distinctions should exhibit a higher mutual information between word identity and prosody, compared to languages that don't. We test this hypothesis in the domain of pitch, which is used to make lexical distinctions in tonal languages, like Cantonese. We use a dataset of speakers reading sentences aloud in ten languages across five language families to estimate the mutual information between the text and their pitch curves. We find that, across languages, pitch curves display similar amounts of entropy. However, these curves are easier to predict given their associated text in the tonal languages, compared to pitch- and stress-accent languages, and thus the mutual information is higher in these languages, supporting our hypothesis. Our results support perspectives that view linguistic typology as gradient, rather than categorical.
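The argument rests on the identity I(W; P) = H(P) - H(P | W): if pitch entropy H(P) is similar across languages but pitch is easier to predict from the word in tonal languages (lower H(P | W)), mutual information must be higher there. A minimal sketch with discrete toy labels follows; the data and the plug-in estimator are assumptions for illustration, whereas the paper works with real pitch curves.

```python
import math
from collections import Counter


def entropy(samples):
    """Plug-in entropy in bits of a list of discrete labels."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def mutual_information(words, contours):
    """I(W; P) = H(P) - H(P | W), estimated from paired samples."""
    n = len(words)
    h_p_given_w = 0.0
    for word, count in Counter(words).items():
        subset = [p for w, p in zip(words, contours) if w == word]
        h_p_given_w += count / n * entropy(subset)
    return entropy(contours) - h_p_given_w


# Toy "tonal" data: the word fully determines the pitch contour.
tonal = mutual_information(["si1", "si2", "si1", "si2"],
                           ["high", "rise", "high", "rise"])
# Toy "non-tonal" data: same contour entropy, but unrelated to the word.
non_tonal = mutual_information(["cat", "cat", "dog", "dog"],
                               ["high", "rise", "high", "rise"])
print(tonal, non_tonal)
```

Both toy datasets have the same contour entropy (1 bit), yet only the first has nonzero mutual information, mirroring the pattern the paper reports.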

[187] Benchmarking Retrieval-Augmented Generation for Chemistry

Xianrui Zhong,Bowen Jin,Siru Ouyang,Yanzhen Shen,Qiao Jin,Yin Fang,Zhiyong Lu,Jiawei Han

Main category: cs.CL

TL;DR: The paper introduces ChemRAG-Bench and ChemRAG-Toolkit for evaluating and improving retrieval-augmented generation (RAG) in the chemistry domain.

Details

Motivation: RAG applications in chemistry lack high-quality corpora and evaluation benchmarks, which has held back progress. Method: The authors propose ChemRAG-Bench as an evaluation benchmark with a corpus integrating multiple knowledge sources, and develop ChemRAG-Toolkit, which supports multiple retrieval algorithms and LLMs. Result: RAG yields an average relative improvement of 17.4% on chemistry tasks. Conclusion: The study provides practical recommendations and tooling for RAG systems in the chemistry domain. Abstract: Retrieval-augmented generation (RAG) has emerged as a powerful framework for enhancing large language models (LLMs) with external knowledge, particularly in scientific domains that demand specialized and dynamic information. Despite its promise, the application of RAG in the chemistry domain remains underexplored, primarily due to the lack of high-quality, domain-specific corpora and well-curated evaluation benchmarks. In this work, we introduce ChemRAG-Bench, a comprehensive benchmark designed to systematically assess the effectiveness of RAG across a diverse set of chemistry-related tasks. The accompanying chemistry corpus integrates heterogeneous knowledge sources, including scientific literature, the PubChem database, PubMed abstracts, textbooks, and Wikipedia entries. In addition, we present ChemRAG-Toolkit, a modular and extensible RAG toolkit that supports five retrieval algorithms and eight LLMs. Using ChemRAG-Toolkit, we demonstrate that RAG yields a substantial performance gain -- achieving an average relative improvement of 17.4% over direct inference methods. We further conduct in-depth analyses on retriever architectures, corpus selection, and the number of retrieved passages, culminating in practical recommendations to guide future research and deployment of RAG systems in the chemistry domain. The code and data are available at https://chemrag.github.io.

[188] OnPrem.LLM: A Privacy-Conscious Document Intelligence Toolkit

Arun S. Maiya

Main category: cs.CL

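TL;DR: OnPrem.LLM is a Python-based toolkit for applying large language models (LLMs) to sensitive data in offline or restricted environments, supporting privacy-preserving use cases.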

Details

Motivation: To meet the need for using LLMs in privacy-sensitive environments by providing local processing while supporting flexible backend choice and hybrid deployment. Method: The toolkit provides prebuilt pipelines for document processing, retrieval-augmented generation (RAG), information extraction, summarization, classification, and more; it supports multiple LLM backends (such as llama.cpp, Ollama, and vLLM) with GPU acceleration and seamless backend switching. Result: It enables privacy-preserving data processing, supports local and hybrid deployments, and extends accessibility to non-technical users. Conclusion: OnPrem.LLM is a flexible, privacy-first toolkit for local LLM applications on sensitive data. Abstract: We present OnPrem.LLM, a Python-based toolkit for applying large language models (LLMs) to sensitive, non-public data in offline or restricted environments. The system is designed for privacy-preserving use cases and provides prebuilt pipelines for document processing and storage, retrieval-augmented generation (RAG), information extraction, summarization, classification, and prompt/output processing with minimal configuration. OnPrem.LLM supports multiple LLM backends -- including llama.cpp, Ollama, vLLM, and Hugging Face Transformers -- with quantized model support, GPU acceleration, and seamless backend switching. Although designed for fully local execution, OnPrem.LLM also supports integration with a wide range of cloud LLM providers when permitted, enabling hybrid deployments that balance performance with data control. A no-code web interface extends accessibility to non-technical users.

[189] Codifying Character Logic in Role-Playing

Letian Peng,Jingbo Shang

Main category: cs.CL

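TL;DR: This paper proposes Codified Profiles, a new approach that represents character logic as structured, executable functions, markedly improving persistence, updatability, and behavioral diversity in role-playing.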

Details

Motivation: Traditional prompt-based role-playing relies on the model's implicit reasoning, which makes persistence, updating, and controllable randomness of character logic hard to achieve. The paper aims to solve these problems with structured functions. Method: Character logic is represented with executable functions (such as parse_by_scene and check_condition) that support explicit control structures and condition checks. Result: Experiments show that Codified Profiles outperform traditional methods in persistence, updatability, and behavioral diversity, and substantially improve the performance of small models. Conclusion: Codified Profiles offer an efficient, scalable solution for role-playing that is especially suited to local deployment. Abstract: This paper introduces Codified Profiles for role-playing, a novel approach that represents character logic as structured, executable functions for behavioral decision-making. Each profile defines a set of functions parse_by_scene(scene) that output a list of logic-grounded assertions triggered_statements, using both explicit control structures (e.g., if-then-else) and condition checks like check_condition(scene, question), where each question is a semantically meaningful prompt about the scene (e.g., "Is the character in danger?") discriminated by the role-playing LLM as true, false, or unknown. This explicit representation offers three key advantages over traditional prompt-based profiles, which append character descriptions directly into text prompts: (1) Persistence, by enforcing complete and consistent execution of character logic, rather than relying on the model's implicit reasoning; (2) Updatability, through systematic inspection and revision of behavioral logic, which is difficult to track or debug in prompt-only approaches; (3) Controllable Randomness, by supporting stochastic behavior directly within the logic, enabling fine-grained variability that prompting alone struggles to achieve. To validate these advantages, we introduce a new benchmark constructed from 83 characters and 5,141 scenes curated from Fandom, using NLI-based scoring to compare character responses against ground-truth actions. Our experiments demonstrate the significant benefits of codified profiles in improving persistence, updatability, and behavioral diversity. Notably, by offloading a significant portion of reasoning to preprocessing, codified profiles enable even 1B-parameter models to perform high-quality role-playing, providing a scalable and efficient foundation for local deployment of role-play agents.
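The shape of a codified profile can be sketched as follows. The names parse_by_scene, check_condition, and triggered_statements come from the abstract; everything else is an assumption for illustration, including the example character logic, the 30% probability, and above all the stand-in check_condition, which looks answers up in the scene dictionary instead of querying a role-playing LLM.

```python
import random


def check_condition(scene, question):
    """Stand-in for the LLM judgment: returns True, False, or None (unknown).

    Assumption: answers are pre-filled in scene["answers"] for this demo.
    """
    return scene.get("answers", {}).get(question)


def parse_by_scene(scene, rng=None):
    """Return the list of logic-grounded assertions triggered by the scene."""
    rng = rng or random.Random()
    triggered_statements = []

    in_danger = check_condition(scene, "Is the character in danger?")
    if in_danger is True:
        if check_condition(scene, "Is an ally nearby?") is True:
            triggered_statements.append("The character protects the ally first.")
        else:
            triggered_statements.append("The character retreats to safety.")
    elif in_danger is None:
        triggered_statements.append("The character stays alert and observes.")

    # Controllable randomness: a quirk that fires with a fixed probability.
    if rng.random() < 0.3:
        triggered_statements.append("The character makes a dry joke.")
    return triggered_statements


scene = {"answers": {"Is the character in danger?": True,
                     "Is an ally nearby?": True}}
print(parse_by_scene(scene, random.Random(0)))
```

Because the branching lives in code, the same scene always triggers the same core assertions (persistence), the logic can be edited and diffed like any function (updatability), and the probability is an explicit, tunable number (controllable randomness).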

[190] Spoken Language Understanding on Unseen Tasks With In-Context Learning

Neeraj Agrawal,Sriram Ganapathy

Main category: cs.CL

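TL;DR: The paper proposes a new task-agnostic fine-tuning method based on randomized class labels that significantly improves the performance of speech-text large language models on unseen tasks without task-specific data annotation.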

Details

Motivation: Traditional task-specific SLU models cannot cope when task-specific training data is unavailable, and existing speech-text LLMs perform poorly in zero/few-shot settings. Method: The paper introduces a task-agnostic fine-tuning method based on randomized class labels. Result: Experiments show that the method significantly improves the performance of speech-text LLMs on unseen tasks. Conclusion: The method gives speech-text LLMs an effective way to adapt to new tasks without task-specific annotation. Abstract: Spoken language understanding (SLU) tasks involve diverse skills that probe the information extraction, classification and/or generation capabilities of models. In this setting, task-specific training data may not always be available. While traditional task-specific SLU models are unable to cater to such requirements, the speech-text large language models (LLMs) offer a promising alternative with emergent abilities. However, out of the box, our evaluations indicate that the zero/few-shot performance of prominent open-source speech-text LLMs on SLU tasks is not up to the mark. In this paper, we introduce a novel approach to robust task-agnostic fine-tuning using randomized class labels. With this proposed fine-tuning, we illustrate that the performance of the speech-text LLMs on an unseen task is significantly improved over standard approaches. Critically, the proposed approach avoids the requirement of task-specific data annotations for enabling new tasks in speech-text LLMs.

[191] Must Read: A Systematic Survey of Computational Persuasion

Nimet Beyza Bozdag,Shuhaib Mehri,Xiaocheng Yang,Hyeonjeong Ha,Zirui Cheng,Esin Durmus,Jiaxuan You,Heng Ji,Gokhan Tur,Dilek Hakkani-Tür

Main category: cs.CL

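TL;DR: This survey reviews research on computational persuasion from three perspectives (AI as persuader, persuadee, and persuasion judge), discussing the roles of AI in persuasion, the challenges, and future directions.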

Details

Motivation: The rise of conversational AI systems has expanded the scope of persuasion, bringing both opportunities and risks, yet our understanding of what makes persuasion effective remains limited. Method: The survey builds a taxonomy around three perspectives (AI as persuader, persuadee, and persuasion judge) and analyzes the role of AI in persuasion. Result: It proposes a taxonomy for computational persuasion and discusses key challenges such as evaluating persuasiveness and mitigating manipulative persuasion. Conclusion: Future research should focus on improving the safety, fairness, and effectiveness of AI-powered persuasion while addressing the risks posed by language models. Abstract: Persuasion is a fundamental aspect of communication, influencing decision-making across diverse contexts, from everyday conversations to high-stakes scenarios such as politics, marketing, and law. The rise of conversational AI systems has significantly expanded the scope of persuasion, introducing both opportunities and risks. AI-driven persuasion can be leveraged for beneficial applications, but also poses threats through manipulation and unethical influence. Moreover, AI systems are not only persuaders, but also susceptible to persuasion, making them vulnerable to adversarial attacks and bias reinforcement. Despite rapid advancements in AI-generated persuasive content, our understanding of what makes persuasion effective remains limited due to its inherently subjective and context-dependent nature. In this survey, we provide a comprehensive overview of computational persuasion, structured around three key perspectives: (1) AI as a Persuader, which explores AI-generated persuasive content and its applications; (2) AI as a Persuadee, which examines AI's susceptibility to influence and manipulation; and (3) AI as a Persuasion Judge, which analyzes AI's role in evaluating persuasive strategies, detecting manipulation, and ensuring ethical persuasion. We introduce a taxonomy for computational persuasion research and discuss key challenges, including evaluating persuasiveness, mitigating manipulative persuasion, and developing responsible AI-driven persuasive systems. Our survey outlines future research directions to enhance the safety, fairness, and effectiveness of AI-powered persuasion while addressing the risks posed by increasingly capable language models.

[192] Domain Regeneration: How well do LLMs match syntactic properties of text domains?

Da Ju,Hagen Blix,Adina Williams

Main category: cs.CL

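TL;DR: The study examines how well large language models (LLMs) approximate the distribution of their training data when generating text, finding that the generated text differs from the original human text on most statistical properties.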

Details

Motivation: To explore whether LLMs faithfully reproduce the distributional properties of their training data when generating text, particularly in a semantically controlled setting. Method: Using observational approaches, the authors prompt an open-source LLM to regenerate Wikipedia and news text and analyze the output at different levels of syntactic abstraction. Result: On most statistical properties the regenerated text differs from the original human text, with a shifted mean, a lower standard deviation, and a reduced long tail. Conclusion: LLMs do not fully reproduce the distributional properties of their training data and show a tendency toward simplification. Abstract: Recent improvements in large language model performance have, in all likelihood, been accompanied by improvements in how well they can approximate the distribution of their training data. In this work, we explore the following question: which properties of text domains do LLMs faithfully approximate, and how well do they do so? Applying observational approaches familiar from corpus linguistics, we prompt a commonly used, open-source LLM to regenerate text from two domains of permissively licensed English text which are often contained in LLM training data -- Wikipedia and news text. This regeneration paradigm allows us to investigate whether LLMs can faithfully match the original human text domains in a fairly semantically-controlled setting. We investigate varying levels of syntactic abstraction, from simpler properties like sentence length and article readability, to more complex and higher order properties such as dependency tag distribution, parse depth, and parse complexity. We find that the majority of the regenerated distributions show a shifted mean, a lower standard deviation, and a reduction of the long tail, as compared to the human originals.

[193] Learning from Peers in Reasoning Models

Tongxu Luo,Wenyu Du,Jiaxi Bi,Stephen Chung,Zhengyang Tang,Hao Yang,Min Zhang,Benyou Wang

Main category: cs.CL

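TL;DR: The paper proposes LeaP, which uses peer interaction to help large reasoning models (LRMs) avoid the "Prefix Dominance Trap", and shows substantial gains on several benchmarks.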

Details

Motivation: The study finds that LRMs struggle to self-correct when the reasoning process starts poorly (the Prefix Dominance Trap); inspired by findings from psychology, the authors propose Learning from Peers (LeaP) to address this. Method: LeaP lets reasoning paths share intermediate results through a routing mechanism so that each path can draw on peer insights; because smaller models sometimes fail to follow the instructions, the authors also fine-tune a LeaP-T model series. Result: Experiments show substantial improvements: QwQ-32B with LeaP gains nearly 5 points on average, and LeaP-T-7B matches a larger model on AIME 2024. Conclusion: LeaP strengthens the error tolerance of LRMs through peer collaboration and marks a milestone for collaboration during reasoning. Abstract: Large Reasoning Models (LRMs) have the ability to self-correct even when they make mistakes in their reasoning paths. However, our study reveals that when the reasoning process starts with a short but poor beginning, it becomes difficult for the model to recover. We refer to this phenomenon as the "Prefix Dominance Trap". Inspired by psychological findings that peer interaction can promote self-correction without negatively impacting already accurate individuals, we propose **Learning from Peers** (LeaP) to address this phenomenon. Specifically, every tokens, each reasoning path summarizes its intermediate reasoning and shares it with others through a routing mechanism, enabling paths to incorporate peer insights during inference. However, we observe that smaller models sometimes fail to follow summarization and reflection instructions effectively. To address this, we fine-tune them into our **LeaP-T** model series. Experiments on AIME 2024, AIME 2025, AIMO 2025, and GPQA Diamond show that LeaP provides substantial improvements. For instance, QwQ-32B with LeaP achieves nearly 5 absolute points higher than the baseline on average, and surpasses DeepSeek-R1-671B on three math benchmarks with an average gain of 3.3 points. Notably, our fine-tuned LeaP-T-7B matches the performance of DeepSeek-R1-Distill-Qwen-14B on AIME 2024. In-depth analysis reveals LeaP's robust error correction by timely peer insights, showing strong error tolerance and handling varied task difficulty. LeaP marks a milestone by enabling LRMs to collaborate during reasoning. Our code, datasets, and models are available at https://learning-from-peers.github.io/.

[194] Learning Dynamics in Continual Pre-Training for Large Language Models

Xingjin Wang,Howe Tissue,Lu Wang,Linjing Li,Daniel Dajun Zeng

Main category: cs.CL

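TL;DR: This paper studies the learning dynamics of large language models during continual pre-training (CPT) and proposes a CPT scaling law that combines distribution shift and learning rate annealing to predict loss and tune training parameters.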

Details

Motivation: To explore how general and downstream domain performance evolves during CPT and to characterize the hidden dynamics revealed by the loss curve. Method: By decoupling the effects of distribution shift and learning rate annealing, the authors derive a CPT scaling law that predicts loss and guides hyper-parameter choices. Result: Experiments show that the scaling law holds across various CPT datasets and hyper-parameters and can be used to tailor training to different goals. Conclusion: The proposed CPT scaling law offers a comprehensive framework for understanding the critical factors in CPT and supports flexible customization of training goals. Abstract: Continual Pre-Training (CPT) has become a popular and effective method to apply strong foundation models to specific downstream tasks. In this work, we explore the learning dynamics throughout the CPT process for large language models. We specifically focus on how general and downstream domain performance evolves at each training step, with domain performance measured via validation losses. We have observed that the CPT loss curve fundamentally characterizes the transition from one curve to another hidden curve, and could be described by decoupling the effects of distribution shift and learning rate annealing. We derive a CPT scaling law that combines the two factors, enabling the prediction of loss at any (continual) training step and across learning rate schedules (LRS) in CPT. Our formulation presents a comprehensive understanding of several critical factors in CPT, including loss potential, peak learning rate, training steps, replay ratio, etc. Moreover, our approach can be adapted to customize training hyper-parameters to different CPT goals such as balancing general and domain-specific performance. Extensive experiments demonstrate that our scaling law holds across various CPT datasets and training hyper-parameters.

[195] A Comparative Analysis of Static Word Embeddings for Hungarian

Máté Gedeon

Main category: cs.CL

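TL;DR: This paper comprehensively analyzes static word embeddings for Hungarian, including traditional models (e.g., Word2Vec, FastText) and static embeddings derived from BERT-based models. Intrinsic and extrinsic evaluation shows that FastText excels on semantic and syntactic analogy tasks, while BERT embeddings extracted with X2Static approach the effectiveness of traditional embeddings. Embeddings from dynamic models (e.g., ELMo) perform best on extrinsic tasks such as NER and POS tagging.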

Details

Motivation: To study how static word embeddings perform for Hungarian, compare traditional models with BERT-based embeddings, and explore how far advanced extraction methods can improve BERT-derived embeddings. Method: Embeddings are evaluated on an intrinsic task (word analogy) and on extrinsic tasks (NER and POS tagging), covering traditional static embeddings and static embeddings extracted from BERT-based models with the X2Static, decontextualized, and aggregate methods. Result: FastText performs best on the word analogy task; BERT embeddings extracted with X2Static approach the quality of traditional embeddings; ELMo performs best on the extrinsic tasks. Conclusion: Static word embeddings remain valuable in NLP, and advanced extraction methods increase the usefulness of BERT-based embeddings. The study offers new insight into embedding performance for Hungarian and releases its resources to support future research. Abstract: This paper presents a comprehensive analysis of various static word embeddings for Hungarian, including traditional models such as Word2Vec and FastText, as well as static embeddings derived from BERT-based models using different extraction methods. We evaluate these embeddings on both intrinsic and extrinsic tasks to provide a holistic view of their performance. For intrinsic evaluation, we employ a word analogy task, which assesses the embeddings' ability to capture semantic and syntactic relationships. Our results indicate that traditional static embeddings, particularly FastText, excel in this task, achieving high accuracy and mean reciprocal rank (MRR) scores. Among the BERT-based models, the X2Static method for extracting static embeddings demonstrates superior performance compared to decontextualized and aggregate methods, approaching the effectiveness of traditional static embeddings. For extrinsic evaluation, we utilize a bidirectional LSTM model to perform Named Entity Recognition (NER) and Part-of-Speech (POS) tagging tasks. The results reveal that embeddings derived from dynamic models, especially those extracted using the X2Static method, outperform purely static embeddings. Notably, ELMo embeddings achieve the highest accuracy in both NER and POS tagging tasks, underscoring the benefits of contextualized representations even when used in a static form. Our findings highlight the continued relevance of static word embeddings in NLP applications and the potential of advanced extraction methods to enhance the utility of BERT-based models. This piece of research contributes to the understanding of embedding performance in the Hungarian language and provides valuable insights for future developments in the field. The training scripts, evaluation code, restricted vocabulary, and extracted embeddings will be made publicly available to support further research and reproducibility.
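The intrinsic evaluation described above, a word analogy task scored by accuracy and mean reciprocal rank (MRR), can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the vector-offset (3CosAdd) scoring rule, the function names, and the toy two-dimensional embeddings are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy_rank(emb, a, b, c, expected):
    """1-based rank of `expected` for the query a : b :: c : ? (assumed 3CosAdd rule)."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    # The three query words are excluded from the candidate set.
    candidates = [w for w in emb if w not in (a, b, c)]
    ranked = sorted(candidates, key=lambda w: cosine(emb[w], target), reverse=True)
    return ranked.index(expected) + 1


def evaluate(emb, questions):
    """Return (accuracy, MRR) over a list of (a, b, c, expected) questions."""
    ranks = [analogy_rank(emb, *q) for q in questions]
    accuracy = sum(r == 1 for r in ranks) / len(ranks)
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    return accuracy, mrr


# Hypothetical toy embeddings, chosen only so the example is easy to follow.
emb = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, -1.0],
}

print(evaluate(emb, [("man", "woman", "king", "queen")]))  # (1.0, 1.0)
```

Accuracy only counts questions whose expected word is ranked first, whereas MRR also gives partial credit when the expected word is ranked second or lower, which is why the two scores are usually reported together.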

cs.LG [Back]

[196] Lossless Compression of Large Language Model-Generated Text via Next-Token Prediction

Yu Mao,Holger Pirk,Chun Jason Xue

Main category: cs.LG

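TL;DR: This paper studies lossless compression of data generated by large language models (LLMs) and finds that, thanks to their predictive ability, LLMs serve as efficient compressors whose compression rates far exceed those of general-purpose tools.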

Details

Motivation: As LLM-generated data grows rapidly, conventional compressors struggle with its complexity and diversity, so dedicated, efficient compression techniques are needed. Method: A systematic study covering 14 representative LLMs and 8 LLM-generated datasets from diverse domains, using the LLMs' next-token predictions to compress the data. Result: LLM-based compression exceeds 20x, well above the 3x achieved by Gzip, and the advantage holds across LLM sizes and dataset types. Conclusion: LLMs are efficient and practical compressors under generative AI workloads and offer a new approach to text management. Abstract: As large language models (LLMs) continue to be deployed and utilized across domains, the volume of LLM-generated data is growing rapidly. This trend highlights the increasing importance of effective and lossless compression for such data in modern text management systems. However, compressing LLM-generated data presents unique challenges compared to traditional human- or machine-generated content. Traditional machine-generated data is typically derived from computational processes or device outputs, often highly structured and limited to low-level elements like labels or numerical values. This structure enables conventional lossless compressors to perform efficiently. In contrast, LLM-generated data is more complex and diverse, requiring new approaches for effective compression. In this work, we conduct the first systematic investigation of lossless compression techniques tailored specifically to LLM-generated data. Notably, because LLMs are trained via next-token prediction, we find that LLM-generated data is highly predictable for the models themselves. This predictability enables LLMs to serve as efficient compressors of their own outputs. Through extensive experiments with 14 representative LLMs and 8 LLM-generated datasets from diverse domains, we show that LLM-based prediction methods achieve remarkable compression rates, exceeding 20x, far surpassing the 3x rate achieved by Gzip, a widely used general-purpose compressor. Furthermore, this advantage holds across different LLM sizes and dataset types, demonstrating the robustness and practicality of LLM-based methods in lossless text compression under generative AI workloads.
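The link between prediction and compression in the abstract above can be made concrete with a small sketch. An entropy coder (for example an arithmetic coder) driven by a predictive model spends close to -log2 p bits on a symbol the model assigned probability p, so better next-symbol prediction means fewer bits. The example below is an assumption-laden stand-in, not the paper's pipeline: it replaces the LLM with a tiny adaptive byte-level model, reports the ideal code length rather than running a real coder, and uses hypothetical function names.

```python
import math
import zlib
from collections import defaultdict


def predictive_code_length_bits(text, alphabet_size=256):
    """Ideal code length of `text` under an adaptive order-1 byte model.

    The model is a stand-in for an LLM: it predicts the next byte from the
    previous one and is updated only after each byte is coded, so a decoder
    could rebuild the same probabilities step by step.
    """
    data = text.encode("utf-8")
    counts = defaultdict(lambda: defaultdict(int))  # context -> symbol -> count
    totals = defaultdict(int)                       # context -> total count
    bits = 0.0
    prev = None
    for sym in data:
        # Laplace-smoothed probability of this byte given the previous byte.
        p = (counts[prev][sym] + 1) / (totals[prev] + alphabet_size)
        bits += -math.log2(p)
        counts[prev][sym] += 1
        totals[prev] += 1
        prev = sym
    return bits


def compression_ratio(text):
    """Raw size divided by the model's ideal code length (text must be non-empty)."""
    raw_bits = 8 * len(text.encode("utf-8"))
    return raw_bits / predictive_code_length_bits(text)


def zlib_ratio(text):
    """Baseline ratio using zlib (DEFLATE, the algorithm behind gzip)."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


sample = "ab" * 200
print(compression_ratio(sample), zlib_ratio(sample))
```

On such a short, repetitive string the toy model is not expected to beat zlib; the point is only that the achievable size is governed by how much probability the model places on what actually comes next, which is the property the paper exploits with LLMs.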

[197] Divide (Text) and Conquer (Sentiment): Improved Sentiment Classification by Constituent Conflict Resolution

Jan Kościałkowski,Paweł Marcinkowski

Main category: cs.LG

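TL;DR: The paper proposes new methods for sentiment classification of passages with multiple conflicting sentiments; an MLP-based aggregator markedly improves performance at low cost.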

Details

Motivation: Conflicting sentiments within longer passages degrade model performance, so new methods are needed. Method: Conflicting sentiments are isolated and then aggregated, with an MLP as one of the aggregation strategies, to predict the overall sentiment. Result: The MLP model outperforms baselines on several datasets (Amazon, Twitter, SST) at roughly 1/100 of the cost of fine-tuning the baseline. Conclusion: The new methods improve classification of passages with conflicting sentiments and are cost-effective. Abstract: Sentiment classification, a complex task in natural language processing, becomes even more challenging when analyzing passages with multiple conflicting tones. Typically, longer passages exacerbate this issue, leading to decreased model performance. The aim of this paper is to introduce novel methodologies for isolating conflicting sentiments and aggregating them to effectively predict the overall sentiment of such passages. One of the aggregation strategies involves a Multi-Layer Perceptron (MLP) model which outperforms baseline models across various datasets, including Amazon, Twitter, and SST, while costing $\sim$1/100 of what fine-tuning the baseline would take.

[198] Improving Block-Wise LLM Quantization by 4-bit Block-Wise Optimal Float (BOF4): Analysis and Variations

Patrick Blumenberg,Thomas Graave,Tim Fingscheidt

Main category: cs.LG

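TL;DR: The paper proposes BOF4, an optimized block-wise quantizer that reduces quantization error, and BOF4-S, a modified normalization that lowers it further. It also introduces the mixed-precision strategy OPQ to handle outlier weights, achieving the best perplexity among 4-bit block-wise quantization techniques.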

Details

Motivation: Large language models (LLMs) need large amounts of memory during fine-tuning and inference, and existing quantization methods (such as NF4 and AF4) incur suboptimal quantization error. Method: The paper proposes the optimized block-wise quantizer BOF4, the modified normalization BOF4-S, and the mixed-precision strategy OPQ. Result: BOF4-S combined with OPQ performs best among 4-bit block-wise quantization techniques, reducing both quantization error and the degradation in language modeling performance. Conclusion: Optimized block-wise quantization and mixed-precision strategies effectively reduce quantization error and improve the memory efficiency of LLMs. Abstract: Large language models (LLMs) demand extensive memory capacity during both fine-tuning and inference. To enable memory-efficient fine-tuning, existing methods apply block-wise quantization techniques, such as NF4 and AF4, to the network weights. We show that these quantization techniques incur suboptimal quantization errors. Therefore, as a first novelty, we propose an optimization approach for block-wise quantization. Using this method, we design a family of quantizers named 4-bit block-wise optimal float (BOF4), which consistently reduces the quantization error compared to both baseline methods. We provide both a theoretical and a data-driven solution for the optimization process and prove their practical equivalence. Secondly, we propose a modification to the employed normalization method based on the signed absolute block maximum (BOF4-S), enabling further reduction of the quantization error and empirically achieving less degradation in language modeling performance. Thirdly, we explore additional variations of block-wise quantization methods applied to LLMs through an experimental study on the importance of accurately representing zero and large-amplitude weights on the one hand, and optimization towards various error metrics on the other hand. Lastly, we introduce a mixed-precision quantization strategy dubbed outlier-preserving quantization (OPQ) to address the distributional mismatch induced by outlier weights in block-wise quantization. By storing outlier weights in 16-bit precision (OPQ) while applying BOF4-S, we achieve top performance among 4-bit block-wise quantization techniques w.r.t. perplexity.
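For readers unfamiliar with block-wise quantization, the general scheme that NF4, AF4, and BOF4 share can be sketched as below: each block of weights is normalized by its absolute maximum and every normalized weight is stored as the index of the nearest of 16 codebook levels. This is only an illustrative sketch. The uniform codebook is a placeholder and is not the NF4, AF4, or BOF4 codebook, the function names are hypothetical, and neither the signed normalization of BOF4-S nor OPQ is implemented.

```python
def uniform_codebook(levels=16):
    """Placeholder codebook: evenly spaced levels in [-1, 1] (not NF4/AF4/BOF4)."""
    return [-1.0 + 2.0 * i / (levels - 1) for i in range(levels)]


def quantize_block(block, codebook):
    """Normalize one block by its absolute maximum and snap to the nearest level."""
    scale = max(abs(w) for w in block) or 1.0  # avoid dividing by zero for all-zero blocks
    indices = [
        min(range(len(codebook)), key=lambda i: abs(w / scale - codebook[i]))
        for w in block
    ]
    return scale, indices  # one scale per block plus a 4-bit index per weight


def quantize(weights, codebook, block_size=64):
    """Split the weights into blocks and quantize each block independently."""
    return [
        quantize_block(weights[i:i + block_size], codebook)
        for i in range(0, len(weights), block_size)
    ]


def dequantize(packed, codebook):
    """Rebuild approximate weights from (scale, indices) pairs."""
    return [scale * codebook[i] for scale, indices in packed for i in indices]


codebook = uniform_codebook()
weights = [0.5, -2.0, 1.0, 0.0]
packed = quantize(weights, codebook, block_size=2)
print(dequantize(packed, codebook))
```

The choice of codebook levels is exactly what the quantizers in the abstract differ on: a uniform grid like this one has no level at zero, while the paper studies how accurately zero and large-amplitude weights should be represented.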

[199] Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety

Zihan Guan,Mengxuan Hu,Ronghang Zhu,Sheng Li,Anil Vullikanti

Main category: cs.LG

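TL;DR: The study finds that fine-tuning large language models (LLMs) even on benign datasets can significantly increase the harmfulness of their outputs. By identifying the samples that most degrade safety, it proposes a more effective attack based on Self-Inf-N; experiments show that the attack transfers well across mainstream LLMs and that existing defenses struggle against it.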

Details

Motivation: To expose the safety vulnerability of LLMs in the fine-tuning stage and develop a more effective attack that demonstrates the shortcomings of current safety alignment. Method: The samples in benign datasets that most affect safety (outliers) are identified with the proposed Self-Inf-N method, and LLMs are fine-tuned on these samples only. Result: Fine-tuning on just 100 outlier samples severely compromises LLM safety alignment, and the attack transfers well across LLM architectures. Conclusion: Existing mitigation strategies struggle against this attack, so more robust alignment safeguards are urgently needed. Abstract: Recent studies have uncovered a troubling vulnerability in the fine-tuning stage of large language models (LLMs): even fine-tuning on entirely benign datasets can lead to a significant increase in the harmfulness of LLM outputs. Building on this finding, our red teaming study takes this threat one step further by developing a more effective attack. Specifically, we analyze and identify samples within benign datasets that contribute most to safety degradation, then fine-tune LLMs exclusively on these samples. We approach this problem from an outlier detection perspective and propose Self-Inf-N to detect and extract outliers for fine-tuning. Our findings reveal that fine-tuning LLMs on 100 outlier samples selected by Self-Inf-N in the benign datasets severely compromises LLM safety alignment. Extensive experiments across seven mainstream LLMs demonstrate that our attack exhibits high transferability across different architectures and remains effective in practical scenarios. Alarmingly, our results indicate that most existing mitigation strategies fail to defend against this attack, underscoring the urgent need for more robust alignment safeguards. Code is available at https://github.com/GuanZihan/Benign-Samples-Matter.

[200] Towards the Three-Phase Dynamics of Generalization Power of a DNN

Yuxuan He,Junpeng Zhang,Hongyuan Zhang,Quanshi Zhang

Main category: cs.LG

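TL;DR: This paper proposes a new perspective for analyzing the generalization power of deep neural networks (DNNs): directly disentangling and analyzing how the generalizable and non-generalizable interactions encoded by a DNN change during training.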

Details

Motivation: Existing methods struggle to analyze the generalization power of DNNs directly; this paper aims to fill that gap by disentangling the dynamics of interactions. Method: Building on theoretical results in explainable AI, the inference logic of a DNN is rewritten as a small number of AND-OR interaction patterns, and an efficient method is proposed to quantify the generalization power of each interaction. Result: The generalization power of interactions follows a three-phase dynamics during training, and the learning of non-generalizable interactions is the direct cause of the gap between training and testing losses. Conclusion: By disentangling interaction dynamics, the paper offers a new perspective on DNN generalization and reveals the key role of non-generalizable interactions. Abstract: This paper proposes a new perspective for analyzing the generalization power of deep neural networks (DNNs), i.e., directly disentangling and analyzing the dynamics of generalizable and non-generalizable interactions encoded by a DNN through the training process. Specifically, this work builds upon the recent theoretical achievement in explainable AI, which proves that the detailed inference logic of DNNs can be strictly rewritten as a small number of AND-OR interaction patterns. Based on this, we propose an efficient method to quantify the generalization power of each interaction, and we discover a distinct three-phase dynamics of the generalization power of interactions during training. In particular, the early phase of training typically removes noisy and non-generalizable interactions and learns simple and generalizable ones. The second and the third phases tend to capture increasingly complex interactions that are harder to generalize. Experimental results verify that the learning of non-generalizable interactions is the direct cause of the gap between the training and testing losses.

[201] Direct Density Ratio Optimization: A Statistically Consistent Approach to Aligning Large Language Models

Rei Higuchi,Taiji Suzuki

Main category: cs.LG

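TL;DR: The paper proposes DDRO, a new method that directly estimates the density ratio between preferred and unpreferred outputs without explicitly modeling human preferences, resolving the statistical inconsistency of existing methods.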

Details

Motivation: Existing methods assume a specific preference model (such as the Bradley-Terry model), which leads to statistical inconsistency and no guarantee of convergence to true human preferences. Method: Direct Density Ratio Optimization (DDRO) directly estimates the density ratio between preferred and unpreferred output distributions, avoiding explicit preference modeling. Result: DDRO is proven to be statistically consistent, and experiments show that it outperforms existing methods on many benchmarks. Conclusion: DDRO opens a new route to data-driven alignment and promises more reliable, human-aligned LLMs. Abstract: Aligning large language models (LLMs) with human preferences is crucial for safe deployment, yet existing methods assume specific preference models such as the Bradley-Terry model. This assumption leads to statistical inconsistency, where more data doesn't guarantee convergence to true human preferences. To address this critical gap, we introduce a novel alignment method, Direct Density Ratio Optimization (DDRO). DDRO directly estimates the density ratio between preferred and unpreferred output distributions, circumventing the need for explicit human preference modeling. We theoretically prove that DDRO is statistically consistent, ensuring convergence to the true preferred distribution as the data size grows, regardless of the underlying preference structure. Experiments demonstrate that DDRO achieves superior performance compared to existing methods on many major benchmarks. DDRO unlocks the potential for truly data-driven alignment, paving the way for more reliable and human-aligned LLMs.

[202] Attonsecond Streaking Phase Retrieval Via Deep Learning Methods

Yuzhou Zhu,Zheng Zhang,Ruyi Zhang,Liang Zhou

Main category: cs.LG

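TL;DR: The paper proposes a phase retrieval method based on supervised computer vision, compares four neural network architectures, and finds that the capsule network is the most accurate.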

Details

Motivation: Traditional phase retrieval algorithms rely on iterative minimization and central momentum approximations, which are not accurate enough for broadband pulses, so a more efficient method is needed. Method: Phase retrieval is recast as a supervised computer-vision task, and four architectures are systematically compared: CNN, ViT, hybrid CNN-ViT, and capsule network. Result: The capsule network achieves the highest retrieval fidelity on synthetic streaking spectrograms, confirming the theoretically predicted accuracy ordering. Conclusion: The capsule network is the best choice; future work could combine physics-informed neural networks and photonic hardware for real-time attosecond pulse characterization. Abstract: Attosecond streaking phase retrieval is essential for resolving electron dynamics on sub-femtosecond time scales, yet traditional algorithms rely on iterative minimization and central momentum approximations that degrade accuracy for broadband pulses. In this work, phase retrieval is reformulated as a supervised computer-vision problem and four neural architectures are systematically compared. A convolutional network demonstrates strong sensitivity to local streak edges but lacks global context; a vision transformer captures long-range delay-energy correlations at the expense of local inductive bias; a hybrid CNN-ViT model unites local feature extraction and full-graph attention; and a capsule network further enforces spatial pose agreement through dynamic routing. A theoretical analysis introduces local, global and positional sensitivity measures and derives surrogate error bounds that predict the strict ordering $CNN

### [203] [Minimizing Risk Through Minimizing Model-Data Interaction: A Protocol For Relying on Proxy Tasks When Designing Child Sexual Abuse Imagery Detection Models](https://arxiv.org/abs/2505.06621)

*Thamiris Coelho,Leo S. F. Ribeiro,João Macedo,Jefersson A. dos Santos,Sandra Avila*

Main category: cs.LG

TL;DR: The paper proposes training models through Proxy Tasks, avoiding direct use of child sexual abuse imagery (CSAI) data, to work around restrictions on access to sensitive data, and presents preliminary results on a real-world dataset.

Details

Motivation: The distribution of child sexual abuse imagery is a growing problem, but restricted access to sensitive data makes research on automated classification and detection difficult. Method: Models are trained on defined Proxy Tasks to avoid direct contact with sensitive data, and input from law enforcement agents (LEAs) informs the design of the automation. Result: On a real-world CSAI dataset, a model trained through proxy tasks achieves promising results without being trained on sensitive data. Conclusion: Proxy Tasks offer a viable alternative for automated CSAI detection; future work can refine the approach and involve collaboration among multiple parties. Abstract: The distribution of child sexual abuse imagery (CSAI) is an ever-growing concern of our modern world; children who suffered from this heinous crime are revictimized, and the growing amount of illegal imagery distributed overwhelms law enforcement agents (LEAs) with the manual labor of categorization. To ease this burden, researchers have explored methods for automating data triage and detection of CSAI, but the sensitive nature of the data imposes restricted access and minimal interaction between real data and learning algorithms, avoiding leaks at all costs. In observing how these restrictions have shaped the literature, we formalize a definition of "Proxy Tasks", i.e., the substitute tasks used for training models for CSAI without making use of CSA data. Under this new terminology, we review current literature and present a protocol for making conscious use of Proxy Tasks together with consistent input from LEAs to design better automation in this field. Finally, we apply this protocol to study -- for the first time -- the task of Few-shot Indoor Scene Classification on CSAI, showing a final model that achieves promising results on a real-world CSAI dataset whilst having no weights actually trained on sensitive data.

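### [204] [Image Classification Using a Diffusion Model as a Pre-Training Model](https://arxiv.org/abs/2505.06890)

*Kosuke Ukita,Ye Xiaolong,Tsuyoshi Okita*

Main category: cs.LG

TL;DR: Proposes a diffusion-model-based method in which representations from a Vision Transformer (ViT) condition the model, enabling self-supervised learning on unlabeled data and markedly improving zero-shot classification for hematoma detection in brain imaging.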

Details

Motivation: To reduce the dependence on large-scale labeled datasets by using self-supervised learning to extract representations from unlabeled data. Method: A ViT-based representation-conditioning mechanism is combined with a Transformer-based diffusion model to enable representation-conditioned data generation. Result: On the zero-shot classification task, accuracy and F1-score improve by 6.15% and 13.60%, respectively, over the DINOv2 baseline. Conclusion: The method performs well on image classification and is promising when labels are unavailable. Abstract: In this paper, we propose a diffusion model that integrates a representation-conditioning mechanism, where the representations derived from a Vision Transformer (ViT) are used to condition the internal process of a Transformer-based diffusion model. This approach enables representation-conditioned data generation, addressing the challenge of requiring large-scale labeled datasets by leveraging self-supervised learning on unlabeled data. We evaluate our method through a zero-shot classification task for hematoma detection in brain imaging. Compared to the strong contrastive learning baseline, DINOv2, our method achieves a notable improvement of +6.15% in accuracy and +13.60% in F1-score, demonstrating its effectiveness in image classification.

### [205] [ICE-Pruning: An Iterative Cost-Efficient Pruning Pipeline for Deep Neural Networks](https://arxiv.org/abs/2505.07411)

*Wenhao Hu,Paul Henderson,José Cano*

Main category: cs.LG

TL;DR: ICE-Pruning is an efficient DNN pruning pipeline that markedly speeds up pruning by reducing fine-tuning cost while keeping accuracy comparable to existing methods.

Details

Motivation: Existing pruning pipelines are computationally expensive because they fine-tune repeatedly; ICE-Pruning aims to cut this cost. Method: Three main components: an automatic mechanism that decides when to fine-tune, a freezing strategy for faster fine-tuning, and a pruning-aware learning rate scheduler. Result: Experiments show that ICE-Pruning accelerates pruning by up to 9.61x. Conclusion: ICE-Pruning substantially lowers the computational cost of pruning while maintaining accuracy. Abstract: Pruning is a widely used method for compressing Deep Neural Networks (DNNs), where less relevant parameters are removed from a DNN model to reduce its size. However, removing parameters reduces model accuracy, so pruning is typically combined with fine-tuning, and sometimes other operations such as rewinding weights, to recover accuracy. A common approach is to repeatedly prune and then fine-tune, with increasing amounts of model parameters being removed in each step. While straightforward to implement, pruning pipelines that follow this approach are computationally expensive due to the need for repeated fine-tuning. In this paper we propose ICE-Pruning, an iterative pruning pipeline for DNNs that significantly decreases the time required for pruning by reducing the overall cost of fine-tuning, while maintaining a similar accuracy to existing pruning pipelines. ICE-Pruning is based on three main components: i) an automatic mechanism to determine after which pruning steps fine-tuning should be performed; ii) a freezing strategy for faster fine-tuning in each pruning step; and iii) a custom pruning-aware learning rate scheduler to further improve the accuracy of each pruning step and reduce the overall time consumption. We also propose an efficient auto-tuning stage for the hyperparameters (e.g., freezing percentage) introduced by the three components. We evaluate ICE-Pruning on several DNN models and datasets, showing that it can accelerate pruning by up to 9.61x. Code is available at https://github.com/gicLAB/ICE-Pruning
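The iterative prune-then-fine-tune loop that the abstract starts from, with fine-tuning skipped when it does not seem necessary, might look roughly like the sketch below. It is not the ICE-Pruning implementation. Magnitude pruning, the accuracy-drop threshold `max_drop` used to decide when to fine-tune, and the `evaluate` and `fine_tune` callbacks are all assumptions introduced for illustration; the freezing strategy and the pruning-aware learning rate scheduler are not modelled.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def iterative_prune(weights, sparsities, evaluate, fine_tune, max_drop=0.01):
    """Prune in steps of increasing sparsity; fine-tune only when quality drops.

    `evaluate(weights)` returns a quality score (higher is better) and
    `fine_tune(weights)` returns updated weights; both are hypothetical
    callbacks supplied by the caller. The threshold rule is an assumed
    stand-in for the paper's automatic fine-tuning criterion.
    """
    baseline = evaluate(weights)
    fine_tune_calls = 0
    for sparsity in sparsities:
        weights = magnitude_prune(weights, sparsity)
        if baseline - evaluate(weights) > max_drop:
            weights = fine_tune(weights)
            fine_tune_calls += 1
    return weights, fine_tune_calls
```

Skipping fine-tuning after pruning steps that barely change the score is where the saving would come from in this sketch, since fine-tuning dominates the cost of each iteration.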

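### [206] [Unified Continuous Generative Models](https://arxiv.org/abs/2505.07447)

*Peng Sun,Yi Jiang,Tao Lin*

Main category: cs.LG

TL;DR: The paper proposes UCGM, a unified framework for training, sampling, and analyzing continuous generative models, which achieves SOTA performance for both multi-step and few-step methods.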

Details

Motivation: Existing work treats multi-step and few-step generative models as separate paradigms, which leads to separate training and sampling methodologies and a lack of unification. Method: The UCGM framework unifies training and sampling and supports both multi-step and few-step models. Result: On ImageNet 256x256, a multi-step model trained with UCGM-T reaches 1.30 FID in 20 steps and a few-step model reaches 1.42 FID in 2 steps; UCGM-S improves a pre-trained model from 1.26 FID (250 steps) to 1.06 FID (40 steps). Conclusion: UCGM markedly improves the efficiency and performance of generative models and provides a unified solution for continuous generative models. Abstract: Recent advances in continuous generative models, including multi-step approaches like diffusion and flow-matching (typically requiring 8-1000 sampling steps) and few-step methods such as consistency models (typically 1-8 steps), have demonstrated impressive generative performance. However, existing work often treats these approaches as distinct paradigms, resulting in separate training and sampling methodologies. We introduce a unified framework for training, sampling, and analyzing these models. Our implementation, the Unified Continuous Generative Models Trainer and Sampler (UCGM-{T,S}), achieves state-of-the-art (SOTA) performance. For example, on ImageNet 256x256 using a 675M diffusion transformer, UCGM-T trains a multi-step model achieving 1.30 FID in 20 steps and a few-step model reaching 1.42 FID in just 2 steps. Additionally, applying UCGM-S to a pre-trained model (previously 1.26 FID at 250 steps) improves performance to 1.06 FID in only 40 steps. Code is available at: https://github.com/LINs-lab/UCGM.

### [207] [You Only Look One Step: Accelerating Backpropagation in Diffusion Sampling with Gradient Shortcuts](https://arxiv.org/abs/2505.07477)

*Hongkun Dou,Zeyu Li,Xingyu Jiang,Hongjue Li,Lijun Yang,Wen Yao,Yue Deng*

Main category: cs.LG

TL;DR: Proposes SDO, an efficient method that optimizes the diffusion generation process using the computational graph of a single step, sharply reducing computational cost.

Details

Motivation: Guiding diffusion models toward specific metrics requires backpropagation through the generation process, which is computationally expensive, so a more efficient method is needed. Method: From a parallel-denoising perspective, only the computational graph of a single step is retained for gradient propagation, avoiding backpropagation through the whole process. Result: SDO performs well on several tasks and reduces computational cost by about 90%. Conclusion: SDO is a generic, efficient, and lightweight optimization method applicable to all parameter types in diffusion sampling. Abstract: Diffusion models (DMs) have recently demonstrated remarkable success in modeling large-scale data distributions. However, many downstream tasks require guiding the generated content based on specific differentiable metrics, typically necessitating backpropagation during the generation process. This approach is computationally expensive, as generating with DMs often demands tens to hundreds of recursive network calls, resulting in high memory usage and significant time consumption. In this paper, we propose a more efficient alternative that approaches the problem from the perspective of parallel denoising. We show that full backpropagation throughout the entire generation process is unnecessary. The downstream metrics can be optimized by retaining the computational graph of only one step during generation, thus providing a shortcut for gradient propagation. The resulting method, which we call Shortcut Diffusion Optimization (SDO), is generic, high-performance, and computationally lightweight, capable of optimizing all parameter types in diffusion sampling. We demonstrate the effectiveness of SDO on several real-world tasks, including controlling generation by optimizing latent and aligning the DMs by fine-tuning network parameters. Compared to full backpropagation, our approach reduces computational costs by $\sim 90\%$ while maintaining superior performance. Code is available at https://github.com/deng-ai-lab/SDO.

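### [208] [Noise Optimized Conditional Diffusion for Domain Adaptation](https://arxiv.org/abs/2505.07548)

*Lingkun Luo,Shiqiang Hu,Liming Chen*

Main category: cs.LG

TL;DR: The paper proposes NOCDDA, which combines conditional diffusion models with the domain adaptation (DA) task and optimizes the noise to generate high-quality pseudo-labels, improving cross-domain alignment.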

Details

Motivation: Pseudo-labels are central to unsupervised domain adaptation (UDA), but the scarcity of high-confidence pseudo-labeled target domain samples (hcpl-tds) leads to inaccurate cross-domain statistical alignment and hurts DA performance. Method: NOCDDA integrates the generative capabilities of conditional diffusion models with the decision-making requirements of DA, adapts the DA classifier within a unified optimization framework, and introduces a class-aware noise optimization strategy to improve pseudo-label generation. Result: Experiments on 5 benchmark datasets and 29 DA tasks show that NOCDDA significantly outperforms 31 existing methods. Conclusion: Through noise optimization and task-coupled optimization, NOCDDA improves the robustness and performance of cross-domain alignment. Abstract: Pseudo-labeling is a cornerstone of Unsupervised Domain Adaptation (UDA), yet the scarcity of High-Confidence Pseudo-Labeled Target Domain Samples (hcpl-tds) often leads to inaccurate cross-domain statistical alignment, causing DA failures. To address this challenge, we propose Noise Optimized Conditional Diffusion for Domain Adaptation (NOCDDA), which seamlessly integrates the generative capabilities of conditional diffusion models with the decision-making requirements of DA to achieve task-coupled optimization for efficient adaptation. For robust cross-domain consistency, we modify the DA classifier to align with the conditional diffusion classifier within a unified optimization framework, enabling forward training on noise-varying cross-domain samples. Furthermore, we argue that the conventional $\mathcal{N}(\mathbf{0}, \mathbf{I})$ initialization in diffusion models often generates class-confused hcpl-tds, compromising discriminative DA. To resolve this, we introduce a class-aware noise optimization strategy that refines sampling regions for reverse class-specific hcpl-tds generation, effectively enhancing cross-domain alignment. Extensive experiments across 5 benchmark datasets and 29 DA tasks demonstrate significant performance gains of NOCDDA over 31 state-of-the-art methods, validating its robustness and effectiveness.

### [209] [Simple Semi-supervised Knowledge Distillation from Vision-Language Models via $\mathbf{\texttt{D}}$ual-$\mathbf{\texttt{H}}$ead $\mathbf{\texttt{O}}$ptimization](https://arxiv.org/abs/2505.07675)

*Seongjae Kang,Dong Bok Lee,Hyungjoon Jang,Sung Ju Hwang*

Main category: cs.LG

TL;DR: Proposes DHO, a dual-head optimization framework for knowledge distillation that enables efficient deployment of vision-language models in resource-constrained environments and markedly improves performance.

Details

Motivation: Vision-language models (VLMs) are hard to deploy in resource-constrained environments, and existing knowledge distillation methods rely on multi-stage training or additional tuning, which adds computational overhead and optimization complexity. Method: Dual prediction heads learn separately from labeled data and from teacher predictions, and their outputs are linearly combined at inference, which mitigates gradient conflicts. Result: DHO outperforms baselines across multiple domains and fine-grained datasets, and on ImageNet achieves SOTA performance with fewer parameters, improving accuracy by 3% and 0.1% with 1% and 10% labeled data, respectively. Conclusion: DHO is a simple and effective knowledge distillation framework that markedly improves the performance of vision-language models in resource-constrained environments. Abstract: Vision-language models (VLMs) have achieved remarkable success across diverse tasks by leveraging rich textual information with minimal labeled data. However, deploying such large models remains challenging, particularly in resource-constrained environments. Knowledge distillation (KD) offers a well-established solution to this problem; however, recent KD approaches from VLMs often involve multi-stage training or additional tuning, increasing computational overhead and optimization complexity. In this paper, we propose Dual-Head Optimization (DHO) -- a simple yet effective KD framework that transfers knowledge from VLMs to compact, task-specific models in semi-supervised settings. Specifically, we introduce dual prediction heads that independently learn from labeled data and teacher predictions, and propose to linearly combine their outputs during inference. We observe that DHO mitigates gradient conflicts between supervised and distillation signals, enabling more effective feature learning than single-head KD baselines. As a result, extensive experiments show that DHO consistently outperforms baselines across multiple domains and fine-grained datasets. Notably, on ImageNet, it achieves state-of-the-art performance, improving accuracy by 3% and 0.1% with 1% and 10% labeled data, respectively, while using fewer parameters.
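The inference step described above, linearly combining the outputs of the two heads, can be illustrated with a short sketch. The abstract does not say whether logits or probabilities are combined or how the weight is chosen, so combining softmax probabilities with a single mixing weight `alpha` is an assumption here, as are the function names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_heads(supervised_logits, distill_logits, alpha=0.5):
    """Mix the two heads' class probabilities: alpha * supervised + (1 - alpha) * distilled."""
    p_sup = softmax(supervised_logits)
    p_kd = softmax(distill_logits)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(p_sup, p_kd)]


def predict(supervised_logits, distill_logits, alpha=0.5):
    """Index of the class with the highest combined probability."""
    probs = combine_heads(supervised_logits, distill_logits, alpha)
    return max(range(len(probs)), key=probs.__getitem__)


# The heads disagree; the more confident supervised head wins at alpha = 0.5.
print(predict([2.0, 0.0], [0.0, 1.0], alpha=0.5))  # 0
```

Keeping the heads separate during training and merging them only at inference is what lets each head follow its own signal, which is how the abstract explains the reduced gradient conflict.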

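# cs.IR [[Back]](#toc)

### [210] [AI Approaches to Qualitative and Quantitative News Analytics on NATO Unity](https://arxiv.org/abs/2505.06313)

*Bohdan M. Pavlyshenko*

Main category: cs.IR

TL;DR: The paper explores using GPT models with retrieval-augmented generation (RAG) for qualitative and quantitative analysis of NATO-related sentiment, unity, and trust in Article 5; the results show a downward trend in NATO unity scores.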

Details

Motivation: To explore the potential of AI-based approaches (such as GPT models) for news analytics and provide preliminary tools for complex analysis. Method: The GPT-4.1 model with RAG analyzes news at two levels: the first level generates qualitative summaries and quantitative scores, and the second level summarizes the summaries. Bayesian regression is used to analyze score trends. Result: The analysis shows a downward trend in NATO unity scores, and the model provides useful qualitative and quantitative analytics. Conclusion: GPT models with RAG can be used for news analysis and yield valuable insights, but need to be combined with other methods for more comprehensive political analysis. Abstract: The paper considers the use of GPT models with retrieval-augmented generation (RAG) for qualitative and quantitative analytics on NATO sentiments, NATO unity and NATO Article 5 trust opinion scores in different web sources: news sites found via Google Search API, YouTube videos with comments, and Reddit discussions. A RAG approach using the GPT-4.1 model was applied to analyse news where NATO-related topics were discussed. Two levels of RAG analytics were used: on the first level, the GPT model generates qualitative news summaries and quantitative opinion scores using zero-shot prompts; on the second level, the GPT model generates the summary of news summaries. Quantitative news opinion scores generated by the GPT model were analysed using Bayesian regression to get trend lines. The distributions found for the regression parameters make it possible to analyse the uncertainty in the specified news opinion score trends. The obtained results show a downward trend for the analysed scores of opinion related to NATO unity. This approach does not aim to conduct real political analysis; rather, it considers AI-based approaches which can be used for further analytics as a part of a complex analytical approach. The obtained results demonstrate that the use of GPT models for news analysis can give informative qualitative and quantitative analytics, providing important insights. The dynamic model based on neural ordinary differential equations was considered for modelling public opinions. This approach makes it possible to analyse different scenarios for evolving public opinions.

### [211] [Web Page Classification using LLMs for Crawling Support](https://arxiv.org/abs/2505.06972)

*Yuichi Sasazawa,Yasuhiro Sogawa*

Main category: cs.IR

TL;DR: Proposes an LLM-based web page classification method that labels pages as "Index Pages" or "Content Pages" to make crawling of new pages more efficient.

Details

Motivation: Traditional methods rely on website features (such as XML sitemaps and update frequency) that are hard to apply universally, so a more efficient method is needed. Method: An LLM classifies web pages, index pages are selected as crawl starting points, and an automatically annotated dataset is built to evaluate classification performance and new-page coverage. Result: Experiments show that the LLM-based method outperforms baselines in both classification performance and new-page coverage. Conclusion: LLM-based classification effectively improves web crawling efficiency. Abstract: A web crawler is a system designed to collect web pages, and efficient crawling of new pages requires appropriate algorithms. While website features such as XML sitemaps and the frequency of past page updates provide important clues for accessing new pages, their universal application across diverse conditions is challenging. In this study, we propose a method to efficiently collect new pages by classifying web pages into two types, "Index Pages" and "Content Pages," using a large language model (LLM), and leveraging the classification results to select index pages as starting points for accessing new pages. We construct a dataset with automatically annotated web page types and evaluate our approach from two perspectives: the page type classification performance and coverage of new pages. Experimental results demonstrate that the LLM-based method outperformed baseline methods in both evaluation metrics.

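### [212] [Reassessing Large Language Model Boolean Query Generation for Systematic Reviews](https://arxiv.org/abs/2505.07155)

*Shuai Wang,Harrisen Scells,Bevan Koopman,Guido Zuccon*

Main category: cs.IR

TL;DR: This paper examines how effective large language models (LLMs) are at generating Boolean queries for systematic reviews, compares models and prompt designs, and stresses the importance of model selection and prompt optimization.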

Details

Motivation: Systematic reviews are the highest form of evidence in medicine, but constructing their Boolean queries is complex and time-consuming. The study evaluates the potential of LLMs for this task and addresses shortcomings of prior work. Method: A systematic reproduction of the studies by Wang et al. and Staudinger et al., focusing on overlooked factors such as query validation, output formatting constraints, and prompt design choices. Result: Query effectiveness varies with model and prompt design, and guided query formulation benefits in particular from well-chosen seed studies. Conclusion: Prompt design and model selection are key to successful Boolean query generation, and the study underlines the importance of reproducibility studies in the systematic review domain. Abstract: Systematic reviews are comprehensive literature reviews that address highly focused research questions and represent the highest form of evidence in medicine. A critical step in this process is the development of complex Boolean queries to retrieve relevant literature. Given the difficulty of manually constructing these queries, recent efforts have explored Large Language Models (LLMs) to assist in their formulation. One of the first studies, Wang et al., investigated ChatGPT for this task, followed by Staudinger et al., which evaluated multiple LLMs in a reproducibility study. However, the latter overlooked several key aspects of the original work, including (i) validation of generated queries, (ii) output formatting constraints, and (iii) selection of examples for chain-of-thought (Guided) prompting. As a result, its findings diverged significantly from the original study. In this work, we systematically reproduce both studies while addressing these overlooked factors. Our results show that query effectiveness varies significantly across models and prompt designs, with guided query formulation benefiting from well-chosen seed studies. Overall, prompt design and model selection are key drivers of successful query formulation. Our findings provide a clearer understanding of LLMs' potential in Boolean query generation and highlight the importance of model- and prompt-specific optimisations. The complex nature of systematic reviews adds to challenges in both developing and reproducing methods but also highlights the importance of reproducibility studies in this domain.

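### [213] [Pre-training vs. Fine-tuning: A Reproducibility Study on Dense Retrieval Knowledge Acquisition](https://arxiv.org/abs/2505.07166)

*Zheng Yao,Shuai Wang,Guido Zuccon*

Main category: cs.IR

TL;DR: The study revisits the roles of pre-training and fine-tuning in dense retrievers and finds that pre-trained knowledge dominates performance, although this conclusion does not hold consistently across models and datasets.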

Details

Motivation: To examine the relative contributions of pre-training and fine-tuning to knowledge acquisition in dense retrievers and test how general earlier findings are. Method: The roles of pre-training and fine-tuning are analyzed by comparing representation approaches (CLS token vs. mean pooling), architectures (BERT vs. LLaMA), and datasets (MSMARCO and Natural Questions). Result: In DPR fine-tuning, pre-trained knowledge dominates performance, and fine-tuning mainly adjusts neuron activation rather than reorganizing knowledge; this pattern does not hold for mean-pooled (Contriever) and decoder-based (LLaMA) models. Conclusion: Pre-trained knowledge is crucial to dense retriever performance, but the role of fine-tuning varies by model and task and must be analyzed case by case. Abstract: Dense retrievers utilize pre-trained backbone language models (e.g., BERT, LLaMA) that are fine-tuned via contrastive learning to perform the task of encoding text into dense representations that can then be compared via a shallow similarity operation, e.g. inner product. Recent research has questioned the role of fine-tuning vs. that of pre-training within dense retrievers, specifically arguing that retrieval knowledge is primarily gained during pre-training, meaning knowledge not acquired during pre-training cannot be subsequently acquired via fine-tuning. We revisit this idea here as the claim was only studied in the context of a BERT-based encoder using DPR as representative dense retriever. We extend the previous analysis by testing other representation approaches (comparing the use of CLS tokens with that of mean pooling), backbone architectures (encoder-only BERT vs. decoder-only LLaMA), and additional datasets (MSMARCO in addition to Natural Questions). Our study confirms that in DPR tuning, pre-trained knowledge underpins retrieval performance, with fine-tuning primarily adjusting neuron activation rather than reorganizing knowledge. However, this pattern does not hold universally, such as in mean-pooled (Contriever) and decoder-based (LLaMA) models. We ensure full reproducibility and make our implementation publicly available at https://github.com/ielab/DenseRetriever-Knowledge-Acquisition.

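# q-bio.NC [[Back]](#toc)

### [214] [Skeletonization of neuronal processes using Discrete Morse techniques from computational topology](https://arxiv.org/abs/2505.07754)

*Samik Banerjee,Caleb Stam,Daniel J. Tward,Steven Savoia,Yusu Wang,Partha P. Mitra*

Main category: q-bio.NC

TL;DR: Proposes a new method combining deep learning and computational topology to quantify the projection density of neuronal axons, addressing the fact that conventional quantification by regional label intensity is not biologically meaningful.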

Details

Motivation: Conventional methods quantify neuronal projections by regional label intensity, which lacks biological meaning and cannot accurately reflect the projections of individual axons. Method: Deep learning is combined with the Discrete Morse (DM) technique to skeletonize labeled axon fragments and estimate a volumetric length density. The method uses nonlocal connectivity information and is robust to noise. Result: The method's utility and scalability are demonstrated on whole-brain tracer-injection data, and an information-theoretic measure is defined to quantify the additional information. Conclusion: This is the first application of the DM technique to computational neuroanatomy, and it helps bridge two important data types: single-axon skeletons and tracer injections. Abstract: To understand biological intelligence we need to map neuronal networks in vertebrate brains. Mapping mesoscale neural circuitry is done using injections of tracers that label groups of neurons whose axons project to different brain regions. Since many neurons are labeled, it is difficult to follow individual axons. Previous approaches have instead quantified the regional projections using the total label intensity within a region. However, such a quantification is not biologically meaningful. We propose a new approach better connected to the underlying neurons by skeletonizing labeled axon fragments and then estimating a volumetric length density. Our approach uses a combination of deep nets and the Discrete Morse (DM) technique from computational topology. This technique takes into account nonlocal connectivity information and therefore provides noise-robustness. We demonstrate the utility and scalability of the approach on whole-brain tracer injected data. We also define and illustrate an information theoretic measure that quantifies the additional information obtained, compared to the skeletonized tracer injection fragments, when individual axon morphologies are available. Our approach is the first application of the DM technique to computational neuroanatomy. It can help bridge between single-axon skeletons and tracer injections, two important data types in mapping neural networks in vertebrates.
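As a rough illustration of what a volumetric length density means, the sketch below sums skeleton segment lengths per voxel and divides by the voxel volume. It assumes the skeleton is already given as 3D polylines. Crediting each segment to the voxel that contains its midpoint is a simplification chosen here, and `length_density` is a hypothetical helper, not the authors' implementation.

```python
import math
from collections import defaultdict


def length_density(polylines, voxel_size):
    """Map voxel index -> skeleton length per unit volume.

    Each segment's full length is credited to the voxel containing its
    midpoint, which is only an approximation for segments crossing voxels.
    """
    lengths = defaultdict(float)
    for points in polylines:
        for a, b in zip(points, points[1:]):
            seg_len = math.dist(a, b)
            mid = [(p + q) / 2.0 for p, q in zip(a, b)]
            voxel = tuple(int(math.floor(m / voxel_size)) for m in mid)
            lengths[voxel] += seg_len
    volume = voxel_size ** 3
    return {voxel: total / volume for voxel, total in lengths.items()}


# Toy skeleton (hypothetical coordinates): two unit-length segments.
skeleton = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]]
density = length_density(skeleton, voxel_size=2.0)
```

Unlike total label intensity, this quantity has units of length per volume, so it relates directly to how much axon passes through a region.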

# eess.SP [[Back]](#toc)

### [215] [DeltaDPD: Exploiting Dynamic Temporal Sparsity in Recurrent Neural Networks for Energy-Efficient Wideband Digital Predistortion](https://arxiv.org/abs/2505.06250)

*Yizhuo Wu,Yi Zhu,Kun Qian,Qinyu Chen,Anding Zhu,John Gajadharsing,Leo C. N. de Vreede,Chang Gao*

Main category: eess.SP

TL;DR: DeltaDPD is a technique that exploits dynamic temporal sparsity to optimize digital predistortion (DPD), substantially reducing computational complexity and energy consumption while maintaining performance.

Details

Motivation: As bandwidth and data rates increase, the energy consumption of conventional DPD models, especially RNN-based ones, becomes an increasingly serious problem. DeltaDPD aims to address it. Method: By exploiting the dynamic temporal sparsity of input signals and RNN hidden states, it reduces arithmetic operations and memory accesses to achieve efficient DPD. Result: On a 256-QAM OFDM test signal with 200 MHz bandwidth, DeltaDPD achieves -50.03 dBc ACPR, -37.22 dB NMSE, and -38.52 dBc EVM, with a 1.8x reduction in estimated inference power. Conclusion: DeltaDPD substantially reduces energy consumption while maintaining performance, offering a viable approach to efficient DPD. Abstract: Digital Predistortion (DPD) is a popular technique to enhance signal quality in wideband RF power amplifiers (PAs). With increasing bandwidth and data rates, DPD faces significant energy consumption challenges during deployment, contrasting with its efficiency goals. State-of-the-art DPD models rely on recurrent neural networks (RNN), whose computational complexity hinders system efficiency. This paper introduces DeltaDPD, exploring the dynamic temporal sparsity of input signals and neuronal hidden states in RNNs for energy-efficient DPD, reducing arithmetic operations and memory accesses while preserving satisfactory linearization performance. Applying a TM3.1a 200MHz-BW 256-QAM OFDM signal to a 3.5 GHz GaN Doherty RF PA, DeltaDPD achieves -50.03 dBc in Adjacent Channel Power Ratio (ACPR), -37.22 dB in Normalized Mean Square Error (NMSE) and -38.52 dBc in Error Vector Magnitude (EVM) with 52% temporal sparsity, leading to a 1.8X reduction in estimated inference power. The DeltaDPD code will be released after formal publication at https://www.opendpd.com.
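One common way to obtain temporal sparsity is a delta threshold: a value is propagated only when it has changed enough since it was last propagated, and zero deltas can be skipped by downstream multiply-accumulate operations. The sketch below shows that idea on a scalar sequence. Whether DeltaDPD uses exactly this rule is an assumption, and the function names and signal values are hypothetical.

```python
def delta_encode(signal, threshold):
    """Return per-step deltas, zeroing changes smaller than the threshold.

    The reference value is updated only when a delta is propagated, so small
    changes accumulate until they cross the threshold.
    """
    reference = 0.0
    deltas = []
    for x in signal:
        d = x - reference
        if abs(d) >= threshold:
            deltas.append(d)
            reference = x
        else:
            deltas.append(0.0)
    return deltas


def temporal_sparsity(deltas):
    """Fraction of time steps whose delta is zero and can be skipped."""
    return sum(1 for d in deltas if d == 0.0) / len(deltas)


# Toy input (hypothetical values): only two steps change by more than 0.1.
signal = [0.0, 0.05, 0.5, 0.52, 0.9, 0.91]
deltas = delta_encode(signal, threshold=0.1)
sparsity = temporal_sparsity(deltas)
```

Raising the threshold increases sparsity, and therefore saved operations, at the cost of a coarser approximation of the signal, which is the trade-off behind a figure such as the reported 52% temporal sparsity.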

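### [216] [Terahertz Spatial Wireless Channel Modeling with Radio Radiance Field](https://arxiv.org/abs/2505.06277)

*John Song,Lihao Zhang,Feng Ye,Haijian Sun*

Main category: eess.SP

TL;DR: The paper examines the feasibility of applying the radio radiance field (RRF) framework in the terahertz (THz) band and proposes an efficient spatial channel modeling method based on visual geometry and sparse measurements.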

Details

Motivation: THz communication is promising for 6G systems, but its signal propagation differs from conventional bands, making existing channel modeling methods inefficient. Method: A simulated THz scenario is built, and the RRF framework is used with sparse measurements to reconstruct a continuous radiance field, enabling efficient modeling of spatial channel state information. Result: Experiments show that the RRF method captures key propagation paths from sparse training samples, with high reconstruction quality and good communication performance. Conclusion: RRF remains effective in the THz band and offers a new direction for scalable, low-cost channel reconstruction in future 6G networks. Abstract: Terahertz (THz) communication is a key enabler for 6G systems, offering ultra-wide bandwidth and unprecedented data rates. However, THz signal propagation differs significantly from lower-frequency bands due to severe free space path loss, minimal diffraction and specular reflection, and prominent scattering, making conventional channel modeling and pilot-based estimation approaches inefficient. In this work, we investigate the feasibility of applying the radio radiance field (RRF) framework to the THz band. This method reconstructs a continuous RRF using visual-based geometry and sparse THz RF measurements, enabling efficient spatial channel state information (Spatial-CSI) modeling without dense sampling. We first build a fine simulated THz scenario, then we reconstruct the RRF and evaluate the performance in terms of both reconstruction quality and effectiveness in THz communication, showing that the reconstructed RRF captures key propagation paths with sparse training samples. Our findings demonstrate that RRF modeling remains effective in the THz regime and provides a promising direction for scalable, low-cost spatial channel reconstruction in future 6G networks.
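The "severe free space path loss" mentioned in the abstract follows from the standard Friis free-space formula, $\mathrm{FSPL}_{\mathrm{dB}} = 20\log_{10}(4\pi d f / c)$. The sketch below evaluates it for two carrier frequencies. The 10 m distance and the 3.5 GHz and 300 GHz carriers are illustrative values chosen here, not figures from the paper.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def fspl_db(distance_m: float, frequency_hz: float) -> float:
    """Free-space path loss in dB for isotropic antennas (Friis formula)."""
    return 20.0 * math.log10(
        4.0 * math.pi * distance_m * frequency_hz / SPEED_OF_LIGHT
    )


# Illustrative comparison at the same distance (values are assumptions).
loss_sub6 = fspl_db(10.0, 3.5e9)    # a sub-6 GHz carrier
loss_thz = fspl_db(10.0, 300e9)     # a carrier at the low end of the THz band
extra_loss = loss_thz - loss_sub6   # depends only on the frequency ratio
```

Because the loss grows by 20 dB per decade of frequency, the extra loss at a fixed distance depends only on the ratio of the two carriers, which is one reason dense pilot-based sampling becomes costly at THz frequencies.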

### [217] [FEMSN: Frequency-Enhanced Multiscale Network for fault diagnosis of rotating machinery under strong noise environments](https://arxiv.org/abs/2505.06285)

*Yuhan Yuan,Xiaomo Jiang,Yanfeng Han,Ke Xiao*

Main category: eess.SP

TL;DR: Proposes FEMSN, a new CNN model that strengthens feature extraction through the FADEL and MSTFF modules, improving the robustness and accuracy of bearing fault detection.

Details

Motivation: Under complex operating conditions, existing methods struggle to extract fault features because of severe noise interference. Method: FADEL is introduced as a denoising layer, combined with the MSTFF module to extract time-frequency features, and a distillation layer is added to expand the receptive field. Result: Two case studies validate the effectiveness of FEMSN and FADEL in machine health monitoring. Conclusion: The FEMSN model markedly improves fault detection performance and suits complex industrial environments. Abstract: Rolling bearings are critical components of rotating machinery, and their proper functioning is essential for industrial production. Most existing condition monitoring methods focus on extracting discriminative features from time-domain signals to assess bearing health status. However, under complex operating conditions, periodic impulsive characteristics related to fault information are often obscured by noise interference. Consequently, existing approaches struggle to learn distinctive fault-related features in such scenarios. To address this issue, this paper proposes a novel CNN-based model named FEMSN. Specifically, a Fourier Adaptive Denoising Encoder Layer (FADEL) is introduced as an input denoising layer to enhance key features while filtering out irrelevant information. Subsequently, a Multiscale Time-Frequency Fusion (MSTFF) module is employed to extract fused time-frequency features, further improving the model robustness and nonlinear representation capability. Additionally, a distillation layer is incorporated to expand the receptive field. Based on these advancements, a novel deep lightweight CNN model, termed the Frequency-Enhanced Multiscale Network (FEMSN), is developed. The effectiveness of FEMSN and FADEL in machine health monitoring and stability assessment is validated through two case studies.

# cs.RO [[Back]](#toc)

### [218] [CompSLAM: Complementary Hierarchical Multi-Modal Localization and Mapping for Robot Autonomy in Underground Environments](https://arxiv.org/abs/2505.06483)

*Shehryar Khattak,Timon Homberger,Lukas Bernreiter,Julian Nubert,Olov Andersson,Roland Siegwart,Kostas Alexis,Marco Hutter*

Main category: cs.RO

TL;DR: CompSLAM is a multi-modal localization and mapping framework designed for complex underground environments; it achieves high robustness through redundant sensors and was successfully deployed in the DARPA challenge.

Details

Motivation: To address the challenge of real-time robot localization and mapping in GPS-denied, perception-degraded underground environments with darkness, dust, and geometrically self-similar structures. Method: A hierarchical multi-modal architecture exploits the complementarity of redundant sensors to improve robustness, and supports multi-robot collaboration. Result: The system was deployed successfully in the DARPA challenge, its reliability was confirmed in subsequent projects, and the code and dataset are publicly released. Conclusion: CompSLAM provides an efficient solution for robot localization and mapping in complex environments, with broad application potential. Abstract: Robot autonomy in unknown, GPS-denied, and complex underground environments requires real-time, robust, and accurate onboard pose estimation and mapping for reliable operations. This becomes particularly challenging in perception-degraded subterranean conditions under harsh environmental factors, including darkness, dust, and geometrically self-similar structures. This paper details CompSLAM, a highly resilient and hierarchical multi-modal localization and mapping framework designed to address these challenges. Its flexible architecture achieves resilience through redundancy by leveraging the complementary nature of pose estimates derived from diverse sensor modalities. Developed during the DARPA Subterranean Challenge, CompSLAM was successfully deployed on all aerial, legged, and wheeled robots of Team Cerberus during their competition-winning final run. Furthermore, it has proven to be a reliable odometry and mapping solution in various subsequent projects, with extensions enabling multi-robot map sharing for marsupial robotic deployments and collaborative mapping. This paper also introduces a comprehensive dataset acquired by a manually teleoperated quadrupedal robot, covering a significant portion of the DARPA Subterranean Challenge finals course. This dataset evaluates CompSLAM's robustness to sensor degradations as the robot traverses 740 meters in an environment characterized by highly variable geometries and demanding lighting conditions. The CompSLAM code and the DARPA SubT Finals dataset are made publicly available for the benefit of the robotics community.

### [219] [M3CAD: Towards Generic Cooperative Autonomous Driving Benchmark](https://arxiv.org/abs/2505.06746)

*Morui Zhu,Yongqi Zhu,Yihao Zhu,Qi Chen,Deyuan Qu,Song Fu,Qing Yang*

Main category: cs.RO

TL;DR: M$^3$CAD is a new benchmark for advancing generic cooperative autonomous driving research, comprising 204 sequences and 30k frames and supporting multiple tasks and modalities.

Details

Motivation: To provide a comprehensive and diverse benchmark for cooperative autonomous driving research and fill gaps in existing work. Method: A benchmark with multi-vehicle, multi-modal data is built, and the E2EC framework is proposed to use information shared between vehicles to improve path planning. Result: M$^3$CAD is the most comprehensive benchmark for cooperative multi-task autonomous driving, and baseline performance evaluations are provided. Conclusion: M$^3$CAD and the E2EC framework will advance research and development of cooperative autonomous driving systems. Abstract: We introduce M$^3$CAD, a novel benchmark designed to advance research in generic cooperative autonomous driving. M$^3$CAD comprises 204 sequences with 30k frames, spanning a diverse range of cooperative driving scenarios. Each sequence includes multiple vehicles and sensing modalities, e.g., LiDAR point clouds, RGB images, and GPS/IMU, supporting a variety of autonomous driving tasks, including object detection and tracking, mapping, motion forecasting, occupancy prediction, and path planning. This rich multimodal setup enables M$^3$CAD to support both single-vehicle and multi-vehicle autonomous driving research, significantly broadening the scope of research in the field. To our knowledge, M$^3$CAD is the most comprehensive benchmark specifically tailored for cooperative multi-task autonomous driving research. We evaluate the state-of-the-art end-to-end solution on M$^3$CAD to establish baseline performance. To foster cooperative autonomous driving research, we also propose E2EC, a simple yet effective framework for cooperative driving solutions that leverages inter-vehicle shared information for improved path planning. We release M$^3$CAD, along with our baseline models and evaluation results, to support the development of robust cooperative autonomous driving systems. All resources will be made publicly available on https://github.com/zhumorui/M3CAD

### [220] [Efficient Robotic Policy Learning via Latent Space Backward Planning](https://arxiv.org/abs/2505.06861)

*Dongxiu Liu,Haoyi Niu,Zhihao Wang,Jinliang Zheng,Yinan Zheng,Zhonghong Ou,Jianming Hu,Jianxiong Li,Xianyuan Zhan*

Main category: cs.RO

TL;DR: The paper proposes Latent Space Backward Planning (LBP), which addresses the real-time and accuracy limitations of existing robotic planning methods.

Details

Motivation: Existing robotic planning methods rely on multi-frame image prediction, which is computationally expensive and prone to error accumulation, hurting real-time deployment and the accuracy of action extraction. Method: LBP starts from the final latent goal, recursively predicts intermediate subgoals, and uses a learnable token to summarize the subgoal sequence and guide action extraction. Result: LBP outperforms existing methods in simulation and real-robot experiments, achieving SOTA performance. Conclusion: LBP enables efficient and accurate robotic planning for long-horizon, multi-stage tasks. Abstract: Current robotic planning methods often rely on predicting multi-frame images with full pixel details. While this fine-grained approach can serve as a generic world model, it introduces two significant challenges for downstream policy learning: substantial computational costs that hinder real-time deployment, and accumulated inaccuracies that can mislead action extraction. Planning with coarse-grained subgoals partially alleviates efficiency issues. However, their forward planning schemes can still result in off-task predictions due to accumulated errors, leading to misalignment with long-term goals. This raises a critical question: Can robotic planning be both efficient and accurate enough for real-time control in long-horizon, multi-stage tasks? To address this, we propose a Latent Space Backward Planning scheme (LBP), which begins by grounding the task into final latent goals, followed by recursively predicting intermediate subgoals closer to the current state. The grounded final goal enables backward subgoal planning to always remain aware of task completion, facilitating on-task prediction along the entire planning horizon. The subgoal-conditioned policy incorporates a learnable token to summarize the subgoal sequences and determines how each subgoal guides action extraction. Through extensive simulation and real-robot long-horizon experiments, we show that LBP outperforms existing fine-grained and forward planning methods, achieving SOTA performance. Project Page: https://lbp-authors.github.io
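The backward recursion described in the abstract can be sketched as follows: start from the final latent goal and repeatedly predict a subgoal that lies closer to the current state. The midpoint rule below is a hypothetical stand-in for the learned subgoal predictor, and all names and values are assumptions made for illustration.

```python
from typing import Callable, List

Latent = List[float]


def midpoint_predictor(current: Latent, goal: Latent) -> Latent:
    """Hypothetical stand-in for a learned subgoal predictor: the midpoint."""
    return [(c + g) / 2.0 for c, g in zip(current, goal)]


def backward_plan(
    current: Latent,
    final_goal: Latent,
    num_subgoals: int,
    predict: Callable[[Latent, Latent], Latent] = midpoint_predictor,
) -> List[Latent]:
    """Plan backward from the final goal, then return subgoals nearest-first."""
    chain = [final_goal]
    for _ in range(num_subgoals):
        # Each new subgoal is conditioned on the current state and the
        # previously predicted (farther) subgoal.
        chain.append(predict(current, chain[-1]))
    chain.reverse()
    return chain


# Toy 2-D latent space (hypothetical values).
plan = backward_plan([0.0, 0.0], [8.0, 4.0], num_subgoals=3)
```

Because every subgoal is derived from the grounded final goal, errors cannot drift away from the task the way they can when each step is predicted forward from the previous one.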

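### [221] [Reinforcement Learning-Based Monocular Vision Approach for Autonomous UAV Landing](https://arxiv.org/abs/2505.06963)

*Tarik Houichime,Younes EL Amrani*

Main category: cs.RO

TL;DR: Proposes a new method for autonomous UAV landing using only a monocular camera, without depth estimation cameras, by framing landing as an optimization problem solved with reinforcement learning.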

Details

Motivation: To remove the dependence of autonomous UAV landing on complex sensors, lowering cost and broadening applicability. Method: Variations in the visual features of a specially designed landing pad are used, with reinforcement learning algorithms optimizing the landing task. Result: Simulations and experiments validate the method's robustness and accuracy, and it suits low-cost UAVs. Conclusion: The method offers a cost-effective solution for autonomous UAV landing, with broad application potential. Abstract: This paper introduces an innovative approach for the autonomous landing of Unmanned Aerial Vehicles (UAVs) using only a front-facing monocular camera, thereby obviating the requirement for depth estimation cameras. Drawing on the inherent human estimating process, the proposed method reframes the landing task as an optimization problem. The UAV employs variations in the visual characteristics of a specially designed lenticular circle on the landing pad, where the perceived color and form provide critical information for estimating both altitude and depth. Reinforcement learning algorithms are utilized to approximate the functions governing these estimations, enabling the UAV to ascertain ideal landing settings via training. This method's efficacy is assessed by simulations and experiments, showcasing its potential for robust and accurate autonomous landing without dependence on complex sensor setups. This research contributes to the advancement of cost-effective and efficient UAV landing solutions, paving the way for wider applicability across various fields.

### [222] [VALISENS: A Validated Innovative Multi-Sensor System for Cooperative Automated Driving](https://arxiv.org/abs/2505.06980)

*Lei Wan,Prabesh Gupta,Andreas Eich,Marcel Kettelgerdes,Hannan Ejaz Keen,Michael Klöppel-Gersdorf,Alexey Vinel*

Main category: cs.RO

TL;DR: VALISENS is an innovative multi-sensor system that improves the perception of automated vehicles through multi-agent cooperation, combining onboard and roadside sensors to enhance environmental awareness.

Details

Motivation: To improve the perception robustness of automated vehicles in complex real-world scenarios and overcome the limits of single-vehicle perception. Method: Onboard and roadside LiDARs, radars, thermal cameras, and RGB cameras are integrated into a multi-sensor fusion system, and a corresponding perception module is developed. Result: The system demonstrates the potential of cooperative perception in real-world test environments and lays the groundwork for future intelligent transport system applications. Conclusion: VALISENS shows the advantages of multi-agent cooperative perception and offers a new direction for automated driving and intelligent transport systems. Abstract: Perception is a core capability of automated vehicles and has been significantly advanced through modern sensor technologies and artificial intelligence. However, perception systems still face challenges in complex real-world scenarios. To improve robustness against various external factors, multi-sensor fusion techniques are essential, combining the strengths of different sensor modalities. With recent developments in Vehicle-to-Everything (V2X) communication, sensor fusion can now extend beyond a single vehicle to a cooperative multi-agent system involving Connected Automated Vehicles (CAVs) and intelligent infrastructure. This paper presents VALISENS, an innovative multi-sensor system distributed across multiple agents. It integrates onboard and roadside LiDARs, radars, thermal cameras, and RGB cameras to enhance situational awareness and support cooperative automated driving. The thermal camera adds critical redundancy for perceiving Vulnerable Road Users (VRUs), while fusion with roadside sensors mitigates visual occlusions and extends the perception range beyond the limits of individual vehicles. We introduce the corresponding perception module built on this sensor system, which includes object detection, tracking, motion forecasting, and high-level data fusion. The proposed system demonstrates the potential of cooperative perception in real-world test environments and lays the groundwork for future Cooperative Intelligent Transport Systems (C-ITS) applications.

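### [223] [Beyond Static Perception: Integrating Temporal Context into VLMs for Cloth Folding](https://arxiv.org/abs/2505.07600)

*Oriol Barbany,Adrià Colomé,Carme Torras*

Main category: cs.RO

TL;DR: BiFold predicts language-conditioned pick-and-place actions through end-to-end learning and uses temporal context to improve state estimation, making it suitable for complex garment manipulation.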

Details

Motivation: Garment manipulation is challenging because of complex dynamics, high deformability, and frequent self-occlusions; BiFold aims to address these problems. Method: BiFold implicitly encodes garment state through end-to-end learning and uses temporal context to improve state estimation; the paper analyzes the model's internal representations. Result: Fine-tuning and temporal context yield effective alignment between text and image regions, as well as temporal consistency. Conclusion: BiFold performs well in complex garment manipulation, demonstrating the effectiveness of temporal context and end-to-end learning. Abstract: Manipulating clothes is challenging due to their complex dynamics, high deformability, and frequent self-occlusions. Garments exhibit a nearly infinite number of configurations, making explicit state representations difficult to define. In this paper, we analyze BiFold, a model that predicts language-conditioned pick-and-place actions from visual observations, while implicitly encoding garment state through end-to-end learning. To address scenarios such as crumpled garments or recovery from failed manipulations, BiFold leverages temporal context to improve state estimation. We examine the internal representations of the model and present evidence that its fine-tuning and temporal context enable effective alignment between text and image regions, as well as temporal consistency.

### [224] [Neural Brain: A Neuroscience-inspired Framework for Embodied Agents](https://arxiv.org/abs/2505.07634)

*Jian Liu,Xiongtao Shi,Thai Duy Nguyen,Haitian Zhang,Tianxiang Zhang,Wei Sun,Yanjie Li,Athanasios V. Vasilakos,Giovanni Iacca,Arshad Ali Khan,Arvind Kumar,Jae Won Cho,Ajmal Mian,Lihua Xie,Erik Cambria,Lin Wang*

Main category: cs.RO

TL;DR: The paper proposes a unified framework, the "Neural Brain", that aims to define the core components of embodied agents and bridge the gap between static AI models and dynamic adaptability, integrating multimodal sensing with cognitive functions through a biologically inspired architecture.

Details

Motivation: Current AI systems such as large language models cannot interact with the physical world, which has driven the development of embodied AI and requires agents with human-like adaptability. Method: A biologically inspired architecture is proposed that integrates multimodal active sensing, perception-cognition-action function, a neuroplasticity-based memory system, and neuromorphic hardware/software optimization. Result: By reviewing recent research, the paper analyzes the gap between current AI systems and human intelligence and outlines a roadmap toward generalizable autonomous agents. Conclusion: Drawing on insights from neuroscience, the paper provides a theoretical basis and an implementation path for developing embodied agents with human-level intelligence. Abstract: The rapid evolution of artificial intelligence (AI) has shifted from static, data-driven models to dynamic systems capable of perceiving and interacting with real-world environments. Despite advancements in pattern recognition and symbolic reasoning, current AI systems, such as large language models, remain disembodied, unable to physically engage with the world. This limitation has driven the rise of embodied AI, where autonomous agents, such as humanoid robots, must navigate and manipulate unstructured environments with human-like adaptability. At the core of this challenge lies the concept of Neural Brain, a central intelligence system designed to drive embodied agents with human-like adaptability. A Neural Brain must seamlessly integrate multimodal sensing and perception with cognitive capabilities. Achieving this also requires an adaptive memory system and energy-efficient hardware-software co-design, enabling real-time action in dynamic environments. This paper introduces a unified framework for the Neural Brain of embodied agents, addressing two fundamental challenges: (1) defining the core components of Neural Brain and (2) bridging the gap between static AI models and the dynamic adaptability required for real-world deployment. To this end, we propose a biologically inspired architecture that integrates multimodal active sensing, perception-cognition-action function, neuroplasticity-based memory storage and updating, and neuromorphic hardware/software optimization. Furthermore, we review the latest research on embodied agents across these four aspects and analyze the gap between current AI systems and human intelligence. By synthesizing insights from neuroscience, we outline a roadmap towards the development of generalizable, autonomous agents capable of human-level intelligence in real-world scenarios.

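### [225] [Privacy Risks of Robot Vision: A User Study on Image Modalities and Resolution](https://arxiv.org/abs/2505.07766)

*Xuying Huang,Sicong Pan,Maren Bennewitz*

Main category: cs.RO

TL;DR: A user study of perceived privacy for visual data in robotic applications: depth images and semantic segmentation images are viewed as privacy-safe, and low-resolution RGB images (e.g., 32*32) are also considered sufficiently privacy-preserving.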

Details

Motivation: Camera use in robotic applications can create privacy risks, so it is necessary to understand how users perceive the privacy of different image modalities and resolutions. Method: A user study investigates how image modalities (such as depth and semantic segmentation images) and resolutions affect users' privacy concerns. Result: Depth and semantic segmentation images are viewed as privacy-safe; 32*32 RGB images are considered almost sufficiently privacy-preserving, and 16*16 resolution is considered to fully guarantee privacy protection. Conclusion: The study offers practical guidance for privacy protection in robot design, recommending depth images or low-resolution RGB images to reduce privacy risk. Abstract: User privacy is a crucial concern in robotic applications, especially when mobile service robots are deployed in personal or sensitive environments. However, many robotic downstream tasks require the use of cameras, which may raise privacy risks. To better understand user perceptions of privacy in relation to visual data, we conducted a user study investigating how different image modalities and image resolutions affect users' privacy concerns. The results show that depth images are broadly viewed as privacy-safe, and a similarly high proportion of respondents feel the same about semantic segmentation images. Additionally, the majority of participants consider 32*32 resolution RGB images to be almost sufficiently privacy-preserving, while most believe that 16*16 resolution can fully guarantee privacy protection.
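To make the 32*32 and 16*16 resolutions concrete, the sketch below reduces an image by averaging non-overlapping blocks. The abstract does not say how the low-resolution images in the study were produced, so block averaging is an assumption, and a single grayscale channel is used for brevity where the study refers to RGB images.

```python
def downsample(image, factor):
    """Average non-overlapping factor x factor blocks of a 2-D image.

    `image` is a list of rows of numbers; both dimensions must be divisible
    by `factor`.
    """
    height, width = len(image), len(image[0])
    if height % factor or width % factor:
        raise ValueError("image dimensions must be divisible by factor")
    result = []
    for r in range(0, height, factor):
        row = []
        for c in range(0, width, factor):
            block = [
                image[r + i][c + j]
                for i in range(factor)
                for j in range(factor)
            ]
            row.append(sum(block) / len(block))
        result.append(row)
    return result


# Toy 64x64 image (hypothetical): each pixel holds its row index.
original = [[float(r) for _ in range(64)] for r in range(64)]
low_32 = downsample(original, 2)   # 32 x 32
low_16 = downsample(original, 4)   # 16 x 16
```

Each halving of resolution discards three quarters of the remaining pixels, which gives some intuition for why participants judged 16*16 images differently from 32*32 ones.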

### [226] [DexWild: Dexterous Human Interactions for In-the-Wild Robot Policies](https://arxiv.org/abs/2505.07813)

*Tony Tao,Mohan Kumar Srirama,Jason Jingzhou Liu,Kenneth Shaw,Deepak Pathak*

Main category: cs.RO

TL;DR: DexWild presents a low-cost, easy-to-use data collection system that gathers diverse data from human hand motions and co-trains it with robot data, significantly improving robot generalization to new environments.

Details

Motivation: Conventional teleoperated data collection is costly and hard to scale, whereas everyday human hand motion offers a new route to data collection. Method: The DexWild-System device is developed to collect human hand-motion data, which is co-trained with robot data. Result: Experiments show a 68.5% success rate in unseen environments, nearly four times that of policies trained on robot data only, and 5.8x better cross-embodiment generalization. Conclusion: Through low-cost human data collection and co-training, DexWild significantly improves the generalization of robot policies. Abstract: Large-scale, diverse robot datasets have emerged as a promising path toward enabling dexterous manipulation policies to generalize to novel environments, but acquiring such datasets presents many challenges. While teleoperation provides high-fidelity datasets, its high cost limits its scalability. Instead, what if people could use their own hands, just as they do in everyday life, to collect data? In DexWild, a diverse team of data collectors uses their hands to collect hours of interactions across a multitude of environments and objects. To record this data, we create DexWild-System, a low-cost, mobile, and easy-to-use device. The DexWild learning framework co-trains on both human and robot demonstrations, leading to improved performance compared to training on each dataset individually. This combination results in robust robot policies capable of generalizing to novel environments, tasks, and embodiments with minimal additional robot-specific data. Experimental results demonstrate that DexWild significantly improves performance, achieving a 68.5% success rate in unseen environments, nearly four times higher than policies trained with robot data only, and offering 5.8x better cross-embodiment generalization. Video results, codebases, and instructions at https://dexwild.github.io

### [227] [Imagine, Verify, Execute: Memory-Guided Agentic Exploration with Vision-Language Models](https://arxiv.org/abs/2505.07815)

*Seungjae Lee,Daniel Ekpo,Haowen Liu,Furong Huang,Abhinav Shrivastava,Jia-Bin Huang*

Main category: cs.RO

TL;DR: IVE (Imagine, Verify, Execute) is an exploration framework built on vision-language models (VLMs) that uses semantic scene graphs and physical feasibility verification to achieve diverse and meaningful robot exploration.

Details

Motivation: In open-ended environments where dense rewards or task supervision are scarce, exploration is essential to robot learning. The semantic reasoning of VLMs provides a basis for generating high-level exploratory behavior, but their outputs often lack verification of physical feasibility. Method: The IVE framework abstracts RGB-D observations into semantic scene graphs, imagines novel scenes and predicts their physical plausibility, and generates executable skill sequences through action tools. Result: In simulated and real-world tabletop environments, IVE achieves a 4.1 to 7.8x increase in the entropy of visited states over reinforcement learning baselines, and the collected experience supports downstream learning, with performance close to or exceeding that of models trained on human demonstrations. Conclusion: By combining the semantic reasoning of VLMs with physical verification, IVE significantly improves the diversity and effectiveness of robot exploration. Abstract: Exploration is essential for general-purpose robotic learning, especially in open-ended environments where dense rewards, explicit goals, or task-specific supervision are scarce. Vision-language models (VLMs), with their semantic reasoning over objects, spatial relations, and potential outcomes, present a compelling foundation for generating high-level exploratory behaviors. However, their outputs are often ungrounded, making it difficult to determine whether imagined transitions are physically feasible or informative. To bridge the gap between imagination and execution, we present IVE (Imagine, Verify, Execute), an agentic exploration framework inspired by human curiosity. Human exploration is often driven by the desire to discover novel scene configurations and to deepen understanding of the environment. Similarly, IVE leverages VLMs to abstract RGB-D observations into semantic scene graphs, imagine novel scenes, predict their physical plausibility, and generate executable skill sequences through action tools. We evaluate IVE in both simulated and real-world tabletop environments. The results show that IVE enables more diverse and meaningful exploration than RL baselines, as evidenced by a 4.1 to 7.8x increase in the entropy of visited states. Moreover, the collected experience supports downstream learning, producing policies that closely match or exceed the performance of those trained on human-collected demonstrations.
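The "entropy of visited states" used as the exploration metric can be illustrated with Shannon entropy over discrete state labels. How the paper discretizes scene configurations into states is not given in the abstract, so the string labels below are hypothetical placeholders.

```python
import math
from collections import Counter


def state_entropy(visited_states):
    """Shannon entropy (in bits) of the empirical visited-state distribution."""
    counts = Counter(visited_states)
    total = len(visited_states)
    return -sum(
        (n / total) * math.log2(n / total) for n in counts.values()
    )


# Toy trajectories (hypothetical state labels).
repetitive = ["stacked", "stacked", "stacked", "stacked"]
diverse = ["stacked", "side_by_side", "nested", "scattered"]

entropy_repetitive = state_entropy(repetitive)
entropy_diverse = state_entropy(diverse)
```

An agent that keeps revisiting one configuration scores zero, while one that spreads its visits evenly over four configurations scores two bits, so a higher value indicates broader exploration.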

### [228] [Pixel Motion as Universal Representation for Robot Control](https://arxiv.org/abs/2505.07817)

*Kanchana Ranasinghe,Xiang Li,Cristina Mata,Jongwoo Park,Michael S Ryoo*

Main category: cs.RO

TL;DR: LangToMo is a vision-language-action framework with a dual-system architecture that uses pixel motion forecasts as an intermediate representation for robot control.

Details

Motivation: To bridge the gap between language, motion, and action and enable flexible, scalable, and generalizable robot control. Method: An image diffusion model (System 2) generates text-conditioned pixel motion sequences, and System 1 maps the motion to robot actions. Result: Achieves generalizable robot control under both unsupervised and supervised settings. Conclusion: Through its hierarchically decoupled design, LangToMo connects language, motion, and action, offering a new approach to robot control. Abstract: We present LangToMo, a vision-language-action framework structured as a dual-system architecture that uses pixel motion forecasts as intermediate representations. Our high-level System 2, an image diffusion model, generates text-conditioned pixel motion sequences from a single frame to guide robot control. Pixel motion, a universal, interpretable, and motion-centric representation, can be extracted from videos in a self-supervised manner, enabling diffusion model training on web-scale video-caption data. Treating generated pixel motion as learned universal representations, our low-level System 1 module translates these into robot actions via motion-to-action mapping functions, which can be either hand-crafted or learned with minimal supervision. System 2 operates as a high-level policy applied at sparse temporal intervals, while System 1 acts as a low-level policy at dense temporal intervals. This hierarchical decoupling enables flexible, scalable, and generalizable robot control under both unsupervised and supervised settings, bridging the gap between language, motion, and action. Check out https://kahnchana.github.io/LangToMo for visualizations.

### [229] [H$^{\mathbf{3}}$DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning](https://arxiv.org/abs/2505.07819)

*Yiyang Lu,Yufeng Tian,Zhecheng Yuan,Xianbang Wang,Pu Hua,Zhengrong Xue,Huazhe Xu*

Main category: cs.RO

TL;DR: H$^{3}$DP is a new visuomotor learning framework that strengthens the integration of visual features and action generation through a three-level hierarchy, significantly outperforming baselines.

Details

Motivation: Existing generative-model approaches overlook the critical coupling between visual perception and action prediction, so tighter integration is needed. Method: H$^{3}$DP has three levels of hierarchy: depth-aware input layering, multi-scale visual representations, and a hierarchically conditioned diffusion process. Result: A 27.5% average relative improvement across 44 simulation tasks, and strong performance on 4 real-world bimanual manipulation tasks. Conclusion: Through its hierarchical design, H$^{3}$DP significantly improves visuomotor policy learning. Abstract: Visuomotor policy learning has witnessed substantial progress in robotic manipulation, with recent approaches predominantly relying on generative models to model the action distribution. However, these methods often overlook the critical coupling between visual perception and action prediction. In this work, we introduce $\textbf{Triply-Hierarchical Diffusion Policy}~(\textbf{H$^{\mathbf{3}}$DP})$, a novel visuomotor learning framework that explicitly incorporates hierarchical structures to strengthen the integration between visual features and action generation. H$^{3}$DP contains $\mathbf{3}$ levels of hierarchy: (1) depth-aware input layering that organizes RGB-D observations based on depth information; (2) multi-scale visual representations that encode semantic features at varying levels of granularity; and (3) a hierarchically conditioned diffusion process that aligns the generation of coarse-to-fine actions with corresponding visual features. Extensive experiments demonstrate that H$^{3}$DP yields a $\mathbf{+27.5\%}$ average relative improvement over baselines across $\mathbf{44}$ simulation tasks and achieves superior performance in $\mathbf{4}$ challenging bimanual real-world manipulation tasks. Project Page: https://lyy-iiis.github.io/h3dp/.

# cs.MM [[Back]](#toc)

### [230] [Emotion-Qwen: Training Hybrid Experts for Unified Emotion and General Vision-Language Understanding](https://arxiv.org/abs/2505.06685)

*Dawei Huang,Qing Li,Chuan Yan,Zebang Cheng,Yurong Huang,Xiang Li,Bin Li,Xiaohui Wang,Zheng Lian,Xiaojiang Peng*

Main category: cs.MM

TL;DR: Emotion-Qwen is a multimodal framework that uses a Mixture of Experts (MoE) paradigm to strengthen both emotion understanding and general vision-language reasoning, addressing the limitations of large multimodal models on emotion tasks.

Details

Motivation: Large multimodal models perform poorly in emotion-specific scenarios, and fine-tuning causes catastrophic forgetting, so a framework is needed that improves emotion understanding while preserving general reasoning. Method: Emotion-Qwen uses a Mixture of Experts paradigm that dynamically routes inputs to balance emotion-specific and general-purpose processing, supported by three-stage pre-training and the VER dataset. Result: Emotion-Qwen achieves state-of-the-art performance on multiple emotion recognition benchmarks while remaining competitive on general vision-language tasks. Conclusion: Through dynamic routing and its pre-training strategy, Emotion-Qwen improves both emotion understanding and general reasoning, offering an effective solution for multimodal emotion analysis. Abstract: Emotion understanding in videos aims to accurately recognize and interpret individuals' emotional states by integrating contextual, visual, textual, and auditory cues. While Large Multimodal Models (LMMs) have demonstrated significant progress in general vision-language (VL) tasks, their performance in emotion-specific scenarios remains limited. Moreover, fine-tuning LMMs on emotion-related tasks often leads to catastrophic forgetting, hindering their ability to generalize across diverse tasks. To address these challenges, we present Emotion-Qwen, a tailored multimodal framework designed to enhance both emotion understanding and general VL reasoning. Emotion-Qwen incorporates a sophisticated Hybrid Compressor based on the Mixture of Experts (MoE) paradigm, which dynamically routes inputs to balance emotion-specific and general-purpose processing. The model is pre-trained in a three-stage pipeline on large-scale general and emotional image datasets to support robust multimodal representations. Furthermore, we construct the Video Emotion Reasoning (VER) dataset, comprising more than 40K bilingual video clips with fine-grained descriptive annotations, to further enrich Emotion-Qwen's emotional reasoning capability. Experimental results demonstrate that Emotion-Qwen achieves state-of-the-art performance on multiple emotion recognition benchmarks, while maintaining competitive results on general VL tasks. Code and models are available at https://anonymous.4open.science/r/Emotion-Qwen-Anonymous.

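# cs.HC [[Back]](#toc)

### [231] [DeepSORT-Driven Visual Tracking Approach for Gesture Recognition in Interactive Systems](https://arxiv.org/abs/2505.07110)

*Tong Zhang,Fenghua Shao,Runsheng Zhang,Yifan Zhuang,Liuqingqing Yang*

Main category: cs.HC

TL;DR: Building on the DeepSORT algorithm, the study explores visual tracking for intelligent human-computer interaction, particularly gesture recognition and tracking. Experiments confirm that DeepSORT performs well in dynamic environments, accurately capturing and tracking gesture trajectories and outperforming traditional methods in real-time performance and accuracy.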

Details

Motivation: With advances in artificial intelligence and deep learning, vision-based interaction is gradually replacing traditional input devices and becoming an important way for intelligent systems to interact with users. Method: By combining Kalman filtering with deep-learning feature extraction, DeepSORT achieves accurate target tracking in dynamic environments and suits complex scenes with multiple targets and fast motion. Result: Experiments show that DeepSORT handles target occlusion and motion blur effectively, tracks stably in multi-target environments, and delivers a smooth user interaction experience. Conclusion: Future research directions include algorithm optimization, data fusion, and multimodal interaction, aimed at more intelligent and personalized interactive experiences. Abstract: Based on the DeepSORT algorithm, this study explores the application of visual tracking technology in intelligent human-computer interaction, especially in the field of gesture recognition and tracking. With the rapid development of artificial intelligence and deep learning technology, vision-based interaction has gradually replaced traditional input devices and become an important way for intelligent systems to interact with users. The DeepSORT algorithm can achieve accurate target tracking in dynamic environments by combining Kalman filters and deep learning feature extraction methods. It is especially suitable for complex scenes with multi-target tracking and fast movements. This study experimentally verifies the superior performance of DeepSORT in gesture recognition and tracking. It can accurately capture and track the user's gesture trajectory and is superior to traditional tracking methods in terms of real-time performance and accuracy. In addition, this study also combines gesture recognition experiments to evaluate the recognition ability and feedback response of the DeepSORT algorithm under different gestures (such as sliding, clicking, and zooming). The experimental results show that DeepSORT can not only effectively deal with target occlusion and motion blur but can also track stably in a multi-target environment, achieving a smooth user interaction experience. Finally, this paper looks forward to the future development direction of intelligent human-computer interaction systems based on visual tracking and proposes future research focuses such as algorithm optimization, data fusion, and multimodal interaction in order to promote a more intelligent and personalized interactive experience.
Keywords: DeepSORT, visual tracking, gesture recognition, human-computer interaction

### [232] [Towards user-centered interactive medical image segmentation in VR with an assistive AI agent](https://arxiv.org/abs/2505.07214)

*Pascal Spiegler,Arash Harirpoush,Yiming Xiao*

Main category: cs.HC

TL;DR: SAMIRA is a VR-based conversational AI agent that combines AI and VR to help users localize, segment, and visualize 3D medical images, with speech interaction used to refine segmentation tasks.

Details

Motivation: Manual segmentation of medical images is time-consuming and error-prone, while fully automatic algorithms can benefit from user feedback. SAMIRA is proposed to combine the strengths of AI and VR and improve efficiency and accuracy. Method: Uses AI foundation models and VR interaction, refines segmentation through speech and point prompts, and compares three input modes: VR controller pointing, head pointing, and eye tracking. Result: The user study shows high usability (SUS = 90.0 ± 9.0), low task load, and support for integrating AI into radiological segmentation tasks. Conclusion: SAMIRA demonstrates the potential of combining AI and VR for medical image segmentation, improving interaction efficiency and task quality. Abstract: Crucial in disease analysis and surgical planning, manual segmentation of volumetric medical scans (e.g. MRI, CT) is laborious, error-prone, and challenging to master, while fully automatic algorithms can benefit from user feedback. Therefore, with the complementary power of the latest radiological AI foundation models and virtual reality (VR)'s intuitive data interaction, we propose SAMIRA, a novel conversational AI agent that assists users with localizing, segmenting, and visualizing 3D medical concepts in VR. Through speech-based interaction, the agent helps users understand radiological features, locate clinical targets, and generate segmentation masks that can be refined with just a few point prompts. The system also supports true-to-scale 3D visualization of segmented pathology to enhance patient-specific anatomical understanding. Furthermore, to determine the optimal interaction paradigm under near-far attention-switching for refining segmentation masks in an immersive, human-in-the-loop workflow, we compare VR controller pointing, head pointing, and eye tracking as input modes. With a user study, evaluations demonstrated a high usability score (SUS=90.0 $\pm$ 9.0), low overall task load, as well as strong support for the proposed VR system's guidance, training potential, and integration of AI in radiological segmentation tasks.
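
The usability figure in this entry is a System Usability Scale (SUS) score. As a reading aid, the sketch below applies the standard SUS scoring rule to one participant's questionnaire; the function name and the sample answers are hypothetical and are not taken from the study.

```python
def sus_score(responses):
    """Score one participant's ten SUS answers (each 1..5) on the 0..100 scale."""
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS needs ten answers in the range 1..5")
    total = 0
    for index, answer in enumerate(responses):
        # Items 1, 3, 5, 7, 9 are positively worded and contribute (answer - 1).
        # Items 2, 4, 6, 8, 10 are negatively worded and contribute (5 - answer).
        total += (answer - 1) if index % 2 == 0 else (5 - answer)
    return total * 2.5


# Hypothetical answers from a single participant.
hypothetical_answers = [5, 1, 5, 2, 4, 1, 5, 1, 5, 2]
print(sus_score(hypothetical_answers))
```

A study-level value such as the one quoted above would then be the mean and standard deviation of these per-participant scores.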

# cs.CR [[Back]](#toc)

### [233] [One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models](https://arxiv.org/abs/2505.07167)

*Haoran Gu,Handing Wang,Yi Mei,Mengjie Zhang,Yaochu Jin*

Main category: cs.CR

TL;DR: The paper proposes a defense algorithm called D-STT that reduces the risk of an LLM generating harmful content by identifying and explicitly decoding safety trigger tokens, while preserving model usability.

Details

Motivation: Current safety-aligned LLMs remain vulnerable to jailbreak attacks that elicit harmful responses, and existing defense strategies have limited effectiveness. Method: Proposes the D-STT algorithm, which identifies and decodes safety trigger tokens to activate the model's safety patterns while limiting the scope of intervention. Result: Experiments show that D-STT significantly reduces harmful outputs while preserving usability and response speed, outperforming ten baseline methods. Conclusion: D-STT is a simple and effective defense that balances safety and usability. Abstract: Large Language Models (LLMs) have been extensively used across diverse domains, including virtual assistants, automated code generation, and scientific research. However, they remain vulnerable to jailbreak attacks, which manipulate the models into generating harmful responses despite safety alignment. Recent studies have shown that current safety-aligned LLMs often exhibit shallow safety alignment, where the first few tokens largely determine whether the response will be harmful. Through comprehensive observations, we find that safety-aligned LLMs and various defense strategies generate highly similar initial tokens in their refusal responses, which we define as safety trigger tokens. Building on this insight, we propose D-STT, a simple yet effective defense algorithm that identifies and explicitly decodes safety trigger tokens of the given safety-aligned LLM to trigger the model's learned safety patterns. In this process, the safety trigger is constrained to a single token, which effectively preserves model usability by introducing minimum intervention in the decoding process. Extensive experiments across diverse jailbreak attacks and benign prompts demonstrate that D-STT significantly reduces output harmfulness while preserving model usability and incurring negligible response time overhead, outperforming ten baseline methods.

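### [234] [Securing Genomic Data Against Inference Attacks in Federated Learning Environments](https://arxiv.org/abs/2505.07188)

*Chetan Pathade,Shubham Patil*

Main category: cs.CR

TL;DR: Federated learning (FL) trains models collaboratively on decentralized genomic data but remains vulnerable to inference attacks. Experiments show that gradient exposure poses the highest risk, calling for more robust privacy-preserving mechanisms.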

Details

Motivation: Study the privacy risks of FL on genomic data and assess its vulnerability to attack. Method: Simulate an FL environment and test three attack vectors: MIA, gradient-based MIA, and LIA. Result: Gradient-based MIA is the most effective (precision 0.79, F1-score 0.87), revealing the risk posed by gradient exposure. Conclusion: Naive FL setups protect genomic privacy inadequately and need targeted improvements. Abstract: Federated Learning (FL) offers a promising framework for collaboratively training machine learning models across decentralized genomic datasets without direct data sharing. While this approach preserves data locality, it remains susceptible to sophisticated inference attacks that can compromise individual privacy. In this study, we simulate a federated learning setup using synthetic genomic data and assess its vulnerability to three key attack vectors: Membership Inference Attack (MIA), Gradient-Based Membership Inference Attack, and Label Inference Attack (LIA). Our experiments reveal that Gradient-Based MIA achieves the highest effectiveness, with a precision of 0.79 and F1-score of 0.87, underscoring the risk posed by gradient exposure in federated updates. Additionally, we visualize comparative attack performance through radar plots and quantify model leakage across clients. The findings emphasize the inadequacy of naïve FL setups in safeguarding genomic privacy and motivate the development of more robust privacy-preserving mechanisms tailored to the unique sensitivity of genomic data.
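
The abstract reports precision and F1 for the gradient-based attack but not recall. Since F1 is the harmonic mean of the two, the recall can be backed out; the sketch below does that arithmetic. The helper name is hypothetical, and because the reported figures are rounded the result is only approximate.

```python
def implied_recall(precision, f1):
    """Solve F1 = 2PR / (P + R) for the recall R."""
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("no valid recall exists for this precision/F1 pair")
    return f1 * precision / denominator


# Reported values for the gradient-based membership inference attack.
print(round(implied_recall(0.79, 0.87), 3))
```

Under these rounded inputs the implied recall is roughly 0.97, meaning the attack would miss few true members while about one in five of its positive calls is wrong.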

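# stat.ML [[Back]](#toc)

### [235] [Feature Representation Transferring to Lightweight Models via Perception Coherence](https://arxiv.org/abs/2505.06595)

*Hai-Vy Nguyen,Fabrice Gamboa,Sixin Zhang,Reda Chhaibi,Serge Gratton,Thierry Giaccone*

Main category: stat.ML

TL;DR: Proposes a method for transferring feature representations from a large teacher model to a lightweight student model via perception coherence. The loss function is built on dissimilarity rankings, so the student learns to mimic how the teacher perceives inputs.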

Details

Motivation: The student's representational capacity is weaker than the teacher's, so a new method is needed that relaxes the constraint: the student need not preserve the teacher's absolute geometry but should preserve global coherence. Method: Define a loss function based on perception coherence and optimize the student's feature representation through dissimilarity rankings. Result: Experiments show that the method outperforms or matches baseline methods on representation transfer tasks. Conclusion: The proposed method transfers feature representations effectively via perception coherence and dissimilarity ranking, and the theoretical analysis offers a probabilistic perspective. Abstract: In this paper, we propose a method for transferring feature representation to lightweight student models from larger teacher models. We mathematically define a new notion called *perception coherence*. Based on this notion, we propose a loss function, which takes into account the dissimilarities between data points in feature space through their ranking. At a high level, by minimizing this loss function, the student model learns to mimic how the teacher model *perceives* inputs. More precisely, our method is motivated by the fact that the representational capacity of the student model is weaker than the teacher model. Hence, we aim to develop a new method allowing for a better relaxation. This means that the student model does not need to preserve the absolute geometry of the teacher one, while preserving global coherence through dissimilarity ranking. Our theoretical insights provide a probabilistic perspective on the process of feature representation transfer. Our experimental results show that our method outperforms or achieves on-par performance compared to strong baseline methods for representation transferring.

# cs.SE [[Back]](#toc)

### [236] [Enhancing Code Generation via Bidirectional Comment-Level Mutual Grounding](https://arxiv.org/abs/2505.07768)

*Yifeng Di,Tianyi Zhang*

Main category: cs.SE

TL;DR: The paper proposes improving the accuracy of LLM-generated code through interactive code comments, significantly raising code generation quality and developer efficiency.

Details

Motivation: LLM-generated code contains functional errors that developers struggle to inspect and fix, which hurts productivity and trust. Method: Uses code comments as a medium, interleaving code generation, comment generation, and user feedback to build a shared understanding between developers and the LLM. Result: On the HumanEval benchmark, pass@1 improves by 17.1% for code-davinci-002; in the user study, participants completed tasks 16.7% faster with a 10.5% higher task success rate. Conclusion: Interactively refining comments improves code generation, raising accuracy and developer confidence. Abstract: Large Language Models (LLMs) have demonstrated unprecedented capability in code generation. However, LLM-generated code is still plagued with a wide range of functional errors, especially for complex programming tasks that LLMs have not seen before. Recent studies have shown that developers often struggle with inspecting and fixing incorrect code generated by LLMs, diminishing their productivity and trust in LLM-based code generation. Inspired by the mutual grounding theory in communication, we propose an interactive approach that leverages code comments as a medium for developers and LLMs to establish a shared understanding. Our approach facilitates iterative grounding by interleaving code generation, inline comment generation, and contextualized user feedback through editable comments to align generated code with developer intent. We evaluated our approach on two popular benchmarks and demonstrated that our approach significantly improved multiple state-of-the-art LLMs, e.g., 17.1% pass@1 improvement for code-davinci-002 on HumanEval. Furthermore, we conducted a user study with 12 participants in comparison to two baselines: (1) interacting with GitHub Copilot, and (2) interacting with a multi-step code generation paradigm called Multi-Turn Program Synthesis. Participants completed the given programming tasks 16.7% faster and with 10.5% improvement in task success rate when using our approach. Both results show that interactively refining code comments enables the collaborative establishment of mutual grounding, leading to more accurate code generation and higher developer confidence.

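# cs.DL [[Back]](#toc)

### [237] [A digital perspective on the role of a stemma in material-philological transmission studies](https://arxiv.org/abs/2505.06938)

*Katarzyna Anna Kapitan*

Main category: cs.DL

TL;DR: This article examines how digital humanities and automated workflows affect textual scholarship, argues that computer-generated stemmas can serve as a research tool rather than a final product, and demonstrates this with an Old Norse saga.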

Details

Motivation: The study is motivated by developments in digital humanities and the growing automation of scholarly workflows, and aims to explore the implications of digital methods for research on textual traditions. Method: Takes the Old Norse saga of Hrómundur as a case study and uses Python scripts to convert TEI-encoded XML data into PHYLIP format, generating unrooted trees of textual relationships. Result: Computer-generated stemmas can serve as a starting point for research and help address questions that otherwise remain unanswered; the datasets and scripts accompany the article. Conclusion: Digital methods provide new tools for studying textual traditions, and a stemma can be an effective starting point for exploring textual relationships. Abstract: Taking its point of departure in the recent developments in the field of digital humanities and the increasing automatisation of scholarly workflows, this study explores the implications of digital approaches to textual traditions for the broader field of textual scholarship. It argues that the relative simplicity of creating computer-generated stemmas allows us to view the stemma codicum as a research tool rather than the final product of our scholarly investigation. Using the Old Norse saga of Hrómundur as a case study, this article demonstrates that stemmas can serve as a starting point for exploring textual traditions further. In doing so, they enable us to address research questions that otherwise remain unanswered. The article is accompanied by datasets used to generate stemmas for the Hrómundar saga tradition as well as two custom Python scripts. The scripts are designed to convert XML-based textual data, encoded according to the TEI Guidelines, into the input format used for the analysis in the PHYLIP package to generate unrooted trees of relationships between texts.
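
To make the target format concrete, here is a minimal sketch of the last step of such a conversion: writing a witness-by-variant matrix in strict PHYLIP layout. It is not the article's script. It assumes the variants have already been extracted from the TEI XML and coded as 0/1 characters, and the sigla and codes shown are hypothetical.

```python
def to_phylip(matrix):
    """Serialize {witness name: character string} into strict PHYLIP layout."""
    lengths = {len(chars) for chars in matrix.values()}
    if len(lengths) != 1:
        raise ValueError("all witnesses need the same number of characters")
    n_chars = lengths.pop()
    # Header line: number of taxa (witnesses) and number of characters.
    lines = [f"{len(matrix)} {n_chars}"]
    for name, chars in matrix.items():
        # Strict PHYLIP reserves the first 10 columns for the taxon name.
        lines.append(name[:10].ljust(10) + chars)
    return "\n".join(lines)


# Hypothetical sigla with binary-coded variant readings.
witnesses = {"A": "0101", "B": "0111", "C": "1100"}
print(to_phylip(witnesses))
```

The resulting text could be saved as the input file for one of PHYLIP's discrete-character programs; which program and coding scheme the article uses is not stated in the abstract.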

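# cs.AI [[Back]](#toc)

### [238] [LLM-Augmented Chemical Synthesis and Design Decision Programs](https://arxiv.org/abs/2505.07027)

*Haorui Wang,Jeff Guo,Lingkai Kong,Rampi Ramprasad,Philippe Schwaller,Yuanqi Du,Chao Zhang*

Main category: cs.AI

TL;DR: This paper explores the potential of large language models (LLMs) for multi-step retrosynthesis planning, proposing an efficient pathway encoding scheme and a new route-level search strategy that markedly improve retrosynthesis planning.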

Details

Motivation: Retrosynthesis is central to organic chemistry and drug development, but existing machine learning methods are limited by the complexity of the combinatorial space. LLMs exhibit chemical knowledge that may help with this complex decision problem. Method: Proposes an efficient scheme for encoding reaction pathways and a route-level search strategy that goes beyond conventional step-by-step reactant prediction. Result: The LLM-augmented approach performs strongly in multi-step retrosynthesis planning and extends to synthesizable molecular design. Conclusion: LLMs show significant potential for retrosynthesis planning and offer a new approach to complex chemical decision problems. Abstract: Retrosynthesis, the process of breaking down a target molecule into simpler precursors through a series of valid reactions, stands at the core of organic chemistry and drug development. Although recent machine learning (ML) research has advanced single-step retrosynthetic modeling and subsequent route searches, these solutions remain restricted by the extensive combinatorial space of possible pathways. Concurrently, large language models (LLMs) have exhibited remarkable chemical knowledge, hinting at their potential to tackle complex decision-making tasks in chemistry. In this work, we explore whether LLMs can successfully navigate the highly constrained, multi-step retrosynthesis planning problem. We introduce an efficient scheme for encoding reaction pathways and present a new route-level search strategy, moving beyond the conventional step-by-step reactant prediction. Through comprehensive evaluations, we show that our LLM-augmented approach excels at retrosynthesis planning and extends naturally to the broader challenge of synthesizable molecular design.

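### [239] [A Survey on Collaborative Mechanisms Between Large and Small Language Models](https://arxiv.org/abs/2505.07460)

*Yi Chen,JiaHao Zhao,HaoHao Han*

Main category: cs.AI

TL;DR: This survey examines the potential of collaboration between large language models (LLMs) and small language models (SLMs) to balance performance and resource efficiency, reviewing interaction mechanisms, application scenarios, and future directions.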

Details

Motivation: Address the high resource cost and latency of LLMs while exploiting the efficiency of SLMs, enabling advanced AI applications on resource-constrained devices. Method: Surveys the interaction mechanisms of LLM-SLM collaboration (pipeline, routing, auxiliary, distillation, fusion) and the key enabling technologies. Result: LLM-SLM collaboration shows promise in scenarios that require low latency, privacy, and personalization, but still faces challenges such as system overhead and inter-model consistency. Conclusion: LLM-SLM collaboration is a key driver of next-generation practical AI; future work should develop more intelligent frameworks and deeper model fusion. Abstract: Large Language Models (LLMs) deliver powerful AI capabilities but face deployment challenges due to high resource costs and latency, whereas Small Language Models (SLMs) offer efficiency and deployability at the cost of reduced performance. Collaboration between LLMs and SLMs emerges as a crucial paradigm to synergistically balance these trade-offs, enabling advanced AI applications, especially on resource-constrained edge devices. This survey provides a comprehensive overview of LLM-SLM collaboration, detailing various interaction mechanisms (pipeline, routing, auxiliary, distillation, fusion), key enabling technologies, and diverse application scenarios driven by on-device needs like low latency, privacy, personalization, and offline operation. While highlighting the significant potential for creating more efficient, adaptable, and accessible AI, we also discuss persistent challenges including system overhead, inter-model consistency, robust task allocation, evaluation complexity, and security/privacy concerns. Future directions point towards more intelligent adaptive frameworks, deeper model fusion, and expansion into multimodal and embodied AI, positioning LLM-SLM collaboration as a key driver for the next generation of practical and ubiquitous artificial intelligence.

### [240] [Text-to-CadQuery: A New Paradigm for CAD Generation with Scalable Large Model Capabilities](https://arxiv.org/abs/2505.06507)

*Haoyang Xie,Feng Ju*

Main category: cs.AI

TL;DR: Proposes generating CadQuery code directly, using pretrained large language models (LLMs) to produce 3D models from text without intermediate representations, which improves efficiency.

Details

Motivation: Existing CAD generation methods must convert task-specific command sequences into CAD representations, which adds complexity and training cost. Method: Fine-tune pretrained LLMs to generate CadQuery code directly from text, exploiting their strengths in Python generation and spatial reasoning. Result: The best fine-tuned model raises top-1 exact match by 10.5 percentage points (from 58.8% to 69.3%) and reduces Chamfer Distance by 48.6%. Conclusion: Generating CadQuery code directly simplifies the CAD generation pipeline, and larger models perform better. Abstract: Computer-aided design (CAD) is fundamental to modern engineering and manufacturing, but creating CAD models still requires expert knowledge and specialized software. Recent advances in large language models (LLMs) open up the possibility of generative CAD, where natural language is directly translated into parametric 3D models. However, most existing methods generate task-specific command sequences that pretrained models cannot directly handle. These sequences must be converted into CAD representations such as CAD vectors before a 3D model can be produced, which requires training models from scratch and adds unnecessary complexity. To tackle this issue, we propose generating CadQuery code directly from text, leveraging the strengths of pretrained LLMs to produce 3D models without intermediate representations, using this Python-based scripting language. Since LLMs already excel at Python generation and spatial reasoning, fine-tuning them on Text-to-CadQuery data proves highly effective. Given that these capabilities typically improve with scale, we hypothesize that larger models will perform better after fine-tuning. To enable this, we augment the Text2CAD dataset with 170,000 CadQuery annotations. We fine-tune six open-source LLMs of varying sizes and observe consistent improvements. Our best model achieves a top-1 exact match of 69.3%, up from 58.8%, and reduces Chamfer Distance by 48.6%. Project page: https://github.com/Text-to-CadQuery/Text-to-CadQuery.
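
Chamfer Distance, one of the two metrics quoted above, compares two point clouds sampled from the generated and reference shapes. The sketch below shows one common convention (mean squared nearest-neighbor distance, summed over both directions). The abstract does not say which variant or sampling density the authors use, so treat both as assumptions; the point sets are hypothetical.

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty lists of 3D points."""
    def squared_distance(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    def one_way(source, target):
        # Mean squared distance from each source point to its nearest target point.
        return sum(min(squared_distance(p, q) for q in target) for p in source) / len(source)

    return one_way(a, b) + one_way(b, a)


# Hypothetical samples: an edge of a unit cube and the same edge shifted by 1 along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
generated = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(reference, generated))
```

This brute-force version is quadratic in the number of points, which is fine for illustration but would normally be replaced by a spatial index for dense samples.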

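### [241] [Towards Artificial General or Personalized Intelligence? A Survey on Foundation Models for Personalized Federated Intelligence](https://arxiv.org/abs/2505.06907)

*Yu Qiao,Huy Q. Le,Avi Deb Raha,Phuong-Nam Tran,Apurba Adhikary,Mengchun Zhang,Loc X. Nguyen,Eui-Nam Huh,Dusit Niyato,Choong Seon Hong*

Main category: cs.AI

TL;DR: This paper proposes personalized federated intelligence (PFI), which combines the privacy-preserving advantages of federated learning with the zero-shot generalization of foundation models to enable personalized, efficient, and privacy-protective model deployment at the edge.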

Details

Motivation: Large language models (LLMs) are powerful, but their scale, privacy sensitivity, and heavy computational demands limit personalized customization for users. To address this, the paper presents the vision of artificial personalized intelligence (API), with PFI balancing personalization against privacy and efficiency. Method: Proposes the PFI framework, which integrates the strengths of federated learning (FL) and foundation models (FMs), and explores directions including efficient PFI, trustworthy PFI, and PFI empowered by retrieval-augmented generation (RAG). Result: PFI offers a new approach to personalized model deployment at the edge that is both privacy-preserving and efficient. Conclusion: PFI is a key enabling technique for API; further research is needed on challenges such as computational efficiency and privacy guarantees. Abstract: The rise of large language models (LLMs), such as ChatGPT, DeepSeek, and Grok-3, has reshaped the artificial intelligence landscape. As prominent examples of foundational models (FMs) built on LLMs, these models exhibit remarkable capabilities in generating human-like content, bringing us closer to achieving artificial general intelligence (AGI). However, their large-scale nature, sensitivity to privacy concerns, and substantial computational demands present significant challenges to personalized customization for end users. To bridge this gap, this paper presents the vision of artificial personalized intelligence (API), focusing on adapting these powerful models to meet the specific needs and preferences of users while maintaining privacy and efficiency. Specifically, this paper proposes personalized federated intelligence (PFI), which integrates the privacy-preserving advantages of federated learning (FL) with the zero-shot generalization capabilities of FMs, enabling personalized, efficient, and privacy-protective deployment at the edge. We first review recent advances in both FL and FMs, and discuss the potential of leveraging FMs to enhance federated systems. We then present the key motivations behind realizing PFI and explore promising opportunities in this space, including efficient PFI, trustworthy PFI, and PFI empowered by retrieval-augmented generation (RAG). Finally, we outline key challenges and future research directions for deploying FM-powered FL systems at the edge with improved personalization, computational efficiency, and privacy guarantees.
Overall, this survey aims to lay the groundwork for the development of API as a complement to AGI, with a particular focus on PFI as a key enabling technique.

# eess.IV [[Back]](#toc)

### [242] [LMLCC-Net: A Semi-Supervised Deep Learning Model for Lung Nodule Malignancy Prediction from CT Scans using a Novel Hounsfield Unit-Based Intensity Filtering](https://arxiv.org/abs/2505.06370)

*Adhora Madhuri,Nusaiba Sobir,Tasnia Binte Mamun,Taufiq Hasan*

Main category: eess.IV

TL;DR: Proposes LMLCC-Net, a 3D CNN-based deep learning framework that classifies lung nodules in CT images using Hounsfield Unit (HU) intensity filtering, combining intensity and texture features and outperforming existing methods.

Details

Motivation: Lung cancer is the leading cause of patient mortality worldwide, and early diagnosis of malignant pulmonary nodules is critical to reducing mortality, yet existing methods do not fully exploit differences in HU intensity. Method: LMLCC-Net extracts features from multiple branches, each with a learnable HU-based intensity filter, uses semi-supervised learning to handle ambiguous cases, and includes a lightweight model variant. Result: On the LUNA16 dataset it reaches 91.96% classification accuracy, 92.04% sensitivity, and 91.87% AUC, outperforming existing methods. Conclusion: LMLCC-Net can help radiologists classify pulmonary nodules and improve patient care. Abstract: Lung cancer is the leading cause of patient mortality in the world. Early diagnosis of malignant pulmonary nodules in CT images can have a significant impact on reducing disease mortality and morbidity. In this work, we propose LMLCC-Net, a novel deep learning framework for classifying nodules from CT scan images using a 3D CNN, considering Hounsfield Unit (HU)-based intensity filtering. Benign and malignant nodules have significant differences in their intensity profile of HU, which was not exploited in the literature. Our method considers the intensity pattern as well as the texture for the prediction of malignancies. LMLCC-Net extracts features from multiple branches that each use a separate learnable HU-based intensity filtering stage. Various combinations of branches and learnable ranges of filters were explored to finally produce the best-performing model. In addition, we propose a semi-supervised learning scheme for labeling ambiguous cases and also developed a lightweight model to classify the nodules. The experimental evaluations are carried out on the LUNA16 dataset. Our proposed method achieves a classification accuracy (ACC) of 91.96%, a sensitivity (SEN) of 92.04%, and an area under the curve (AUC) of 91.87%, showing improved performance compared to existing methods. The proposed method can have a significant impact in helping radiologists in the classification of pulmonary nodules and improving patient care.
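
As a point of reference for HU-based intensity filtering, the sketch below shows the fixed-window version: clip each voxel to an HU range and rescale it to [0, 1]. This is only an illustration. In LMLCC-Net the ranges are described as learnable, which this fixed function does not model, and the window bounds used here are hypothetical.

```python
def hu_window(voxels, low, high):
    """Clip HU values to [low, high] and rescale them linearly to [0, 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    span = high - low
    return [(min(max(v, low), high) - low) / span for v in voxels]


# Hypothetical window of -1000 HU to 400 HU applied to a few sample voxels.
print(hu_window([-2000, -1000, -300, 400, 3000], low=-1000, high=400))
```

A multi-branch model in the spirit of the abstract would apply a different window per branch, so that each branch sees a different slice of the intensity profile.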

### [243] [PC-SRGAN: Physically Consistent Super-Resolution Generative Adversarial Network for General Transient Simulations](https://arxiv.org/abs/2505.06502)

*Md Rakibul Hasan,Pouria Behnoudfar,Dan MacKinlay,Thomas Poulet*

Main category: eess.IV

TL;DR: PC-SRGAN is an improved super-resolution generative adversarial network that enhances image resolution while ensuring physical consistency, making it suitable for scientific applications.

Details

Motivation: Super-resolution images generated by conventional GANs lack physical meaning and are unsuitable for scientific use; PC-SRGAN aims to solve this. Method: PC-SRGAN combines physical-consistency constraints with advanced quality metrics and improves performance markedly even with limited training data. Result: PC-SRGAN outperforms conventional methods on PSNR and SSIM while needing only 13% of the training data required by SRGAN. Conclusion: PC-SRGAN offers a reliable and efficient solution for scientific machine learning with broad application potential. Abstract: Machine Learning, particularly Generative Adversarial Networks (GANs), has revolutionised Super Resolution (SR). However, generated images often lack physical meaningfulness, which is essential for scientific applications. Our approach, PC-SRGAN, enhances image resolution while ensuring physical consistency for interpretable simulations. PC-SRGAN significantly improves both the Peak Signal-to-Noise Ratio and the Structural Similarity Index Measure compared to conventional methods, even with limited training data (e.g., only 13% of training data required for SRGAN). Beyond SR, PC-SRGAN augments physically meaningful machine learning, incorporating numerically justified time integrators and advanced quality metrics. These advancements promise reliable and causal machine-learning models in scientific domains. A significant advantage of PC-SRGAN over conventional SR techniques is its physical consistency, which makes it a viable surrogate model for time-dependent problems. PC-SRGAN advances scientific machine learning, offering improved accuracy and efficiency for image processing, enhanced process understanding, and broader applications to scientific research. The source codes and data will be made publicly available at https://github.com/hasan-rakibul/PC-SRGAN upon acceptance of this paper.
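
For readers unfamiliar with the first metric named above, the following sketch computes the Peak Signal-to-Noise Ratio from its textbook definition, 10·log10(MAX² / MSE). It works on flattened pixel lists, and the `data_range` default of 1.0 is an assumption about how the images are normalized; the sample values are hypothetical.

```python
import math


def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if not reference or len(reference) != len(estimate):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(data_range ** 2 / mse)


# Hypothetical two-pixel images that differ by 0.1 everywhere.
print(psnr([0.0, 0.0], [0.1, 0.1]))
```

Higher is better; each 10 dB gain corresponds to a tenfold reduction in mean squared error.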

### [244] [Reproducing and Improving CheXNet: Deep Learning for Chest X-ray Disease Classification](https://arxiv.org/abs/2505.06646)

*Daniel Strick,Carlos Garcia,Anthony Huang*

Main category: eess.IV

TL;DR: The paper studies deep learning for radiologic image analysis, reproducing the CheXNet algorithm and exploring better-performing alternatives; on the NIH ChestX-ray14 dataset it reaches an average AUC-ROC of 0.85 and an average F1 score of 0.39.

Details

Motivation: Deep learning has potential in medical image analysis and may become standard practice in modern medicine. Method: Reproduce the CheXNet algorithm and explore other algorithms, evaluating model performance with F1 score and AUC-ROC. Result: The best model reaches an average AUC-ROC of 0.85 and an average F1 score of 0.39 across the 14 disease classes. Conclusion: Deep learning performs well in radiologic image analysis and may become an important tool in medical practice. Abstract: Deep learning for radiologic image analysis is a rapidly growing field in biomedical research and is likely to become a standard practice in modern medicine. On the publicly available NIH ChestX-ray14 dataset, containing X-ray images that are classified by the presence or absence of 14 different diseases, we reproduced an algorithm known as CheXNet, as well as explored other algorithms that outperform CheXNet's baseline metrics. Model performance was primarily evaluated using the F1 score and AUC-ROC, both of which are critical metrics for imbalanced, multi-label classification tasks in medical imaging. The best model achieved an average AUC-ROC score of 0.85 and an average F1 score of 0.39 across all 14 disease classifications present in the dataset.
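
An "average F1 across all 14 classes" is most naturally read as a macro average: compute F1 per disease label, then take the unweighted mean. The abstract does not confirm the averaging scheme, so the sketch below is an assumption, and the tiny label matrices are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the columns of two binary label matrices."""
    n_classes = len(y_true[0])
    scores = []
    for c in range(n_classes):
        pairs = [(t[c], p[c]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        denominator = 2 * tp + fp + fn
        # A class with no positives and no predictions scores 0 by convention here.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / n_classes


# Hypothetical labels: three images, two disease classes.
truth = [[1, 0], [0, 1], [1, 1]]
preds = [[1, 0], [0, 0], [0, 1]]
print(macro_f1(truth, preds))
```

Unlike AUC-ROC, F1 depends on the decision threshold, which is one reason a model can pair a fairly high AUC with a low F1 on rare classes.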

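### [245] [HistDiST: Histopathological Diffusion-based Stain Transfer](https://arxiv.org/abs/2505.06793)

*Erik Großkopf,Valay Bundele,Mehran Hossienzadeh,Hendrik P. A. Lensch*

Main category: eess.IV

TL;DR: HistDiST is a latent diffusion model (LDM) based framework for high-fidelity H&E-to-IHC translation that significantly improves performance through a dual-conditioning strategy and a new noise scheduling approach.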

Details

Motivation: H&E staining lacks molecular specificity, while IHC is costly and complex, so a cost-effective H&E-to-IHC translation method is needed. Existing methods such as GANs suffer from training instability and limited structural fidelity. Method: Proposes the HistDiST framework with a dual-conditioning strategy (morphological embeddings and VAE-encoded H&E representations), combined with a rescaled noise schedule and v-prediction, to ensure pathology relevance and structural consistency. Result: On the MIST and BCI datasets, HistDiST significantly outperforms existing methods, improving MRA by 28% on the H&E-to-Ki67 translation task. Conclusion: HistDiST captures true IHC semantics effectively, giving pathology a high-fidelity, cost-effective solution. Abstract: Hematoxylin and Eosin (H&E) staining is the cornerstone of histopathology but lacks molecular specificity. While Immunohistochemistry (IHC) provides molecular insights, it is costly and complex, motivating H&E-to-IHC translation as a cost-effective alternative. Existing translation methods are mainly GAN-based, often struggling with training instability and limited structural fidelity, while diffusion-based approaches remain underexplored. We propose HistDiST, a Latent Diffusion Model (LDM) based framework for high-fidelity H&E-to-IHC translation. HistDiST introduces a dual-conditioning strategy, utilizing Phikon-extracted morphological embeddings alongside VAE-encoded H&E representations to ensure pathology-relevant context and structural consistency. To overcome brightness biases, we incorporate a rescaled noise schedule, v-prediction, and trailing timesteps, enforcing a zero-SNR condition at the final timestep. During inference, DDIM inversion preserves the morphological structure, while an eta-cosine noise schedule introduces controlled stochasticity, balancing structural consistency and molecular fidelity. Moreover, we propose Molecular Retrieval Accuracy (MRA), a novel pathology-aware metric leveraging GigaPath embeddings to assess molecular relevance. Extensive evaluations on MIST and BCI datasets demonstrate that HistDiST significantly outperforms existing methods, achieving a 28% improvement in MRA on the H&E-to-Ki67 translation task, highlighting its effectiveness in capturing true IHC semantics.
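
The "rescaled noise schedule ... enforcing a zero-SNR condition at the final timestep" can be illustrated with a widely used rescaling recipe: shift and scale the square root of the cumulative alphas so that the last value is exactly zero while the first is unchanged. Whether HistDiST uses precisely this recipe is an assumption, and the linear beta schedule in the example is hypothetical.

```python
import math


def rescale_zero_terminal_snr(betas):
    """Rescale a beta schedule (length >= 2) so the final cumulative alpha is zero."""
    # Cumulative product of (1 - beta), then its square root.
    running, sqrt_alpha_bar = 1.0, []
    for beta in betas:
        running *= 1.0 - beta
        sqrt_alpha_bar.append(math.sqrt(running))

    first, last = sqrt_alpha_bar[0], sqrt_alpha_bar[-1]
    # Shift so the last value becomes 0, then scale so the first value is preserved.
    rescaled = [(s - last) * first / (first - last) for s in sqrt_alpha_bar]

    alpha_bar = [s * s for s in rescaled]
    alphas = [alpha_bar[0]] + [
        alpha_bar[t] / alpha_bar[t - 1] for t in range(1, len(alpha_bar))
    ]
    return [1.0 - a for a in alphas]


# Hypothetical 10-step linear schedule from 1e-4 to 0.02.
steps = 10
linear_betas = [1e-4 + (0.02 - 1e-4) * t / (steps - 1) for t in range(steps)]
new_betas = rescale_zero_terminal_snr(linear_betas)
print(new_betas[0], new_betas[-1])
```

With zero signal left at the last step, predicting the noise there is uninformative, which is the usual reason such schedules are paired with v-prediction as in the abstract.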

### [246] [Missing Data Estimation for MR Spectroscopic Imaging via Mask-Free Deep Learning Methods](https://arxiv.org/abs/2505.06811)

*Tan-Hanh Pham,Ovidiu C. Andronesi,Xianqi Li,Kim-Doang Nguyen*

Main category: eess.IV

TL;DR: Proposes a deep learning-based, mask-free framework for estimating missing data in MRSI metabolic maps that outperforms traditional interpolation methods.

Details

Motivation: MRSI plays an important role in non-invasive mapping of brain metabolites but is often limited by missing or corrupted data. Method: Uses 2D and 3D U-Net architectures that implicitly detect and estimate missing regions from contextual spatial features, with a progressive training strategy to improve robustness. Result: The 2D model achieves an MSE of 0.002 and an SSIM of 0.97 with 20% missing voxels; the 3D model reaches an MSE of 0.001 and an SSIM of 0.98 with 15% missing voxels. Conclusion: The method performs well on real data without retraining or mask input and has potential for clinical and research use. Abstract: Magnetic Resonance Spectroscopic Imaging (MRSI) is a powerful tool for non-invasive mapping of brain metabolites, providing critical insights into neurological conditions. However, its utility is often limited by missing or corrupted data due to motion artifacts, magnetic field inhomogeneities, or failed spectral fitting, especially in high-resolution 3D acquisitions. To address this, we propose the first deep learning-based, mask-free framework for estimating missing data in MRSI metabolic maps. Unlike conventional restoration methods that rely on explicit masks to identify missing regions, our approach implicitly detects and estimates these areas using contextual spatial features through 2D and 3D U-Net architectures. We also introduce a progressive training strategy to enhance robustness under varying levels of data degradation. Our method is evaluated on both simulated and real patient datasets and consistently outperforms traditional interpolation techniques such as cubic and linear interpolation. The 2D model achieves an MSE of 0.002 and an SSIM of 0.97 with 20% missing voxels, while the 3D model reaches an MSE of 0.001 and an SSIM of 0.98 with 15% missing voxels. Qualitative results show improved fidelity in estimating missing data, particularly in metabolically heterogeneous regions and ventricular regions. Importantly, our model generalizes well to real-world datasets without requiring retraining or mask input. These findings demonstrate the effectiveness and broad applicability of mask-free deep learning for MRSI restoration, with strong potential for clinical and research integration.

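### [247] [Uni-AIMS: AI-Powered Microscopy Image Analysis](https://arxiv.org/abs/2505.06918)

*Yanhui Hong,Nan Wang,Zhiyi Xia,Haoyi Tao,Xi Fang,Yiming Li,Jiankun Wang,Peng Jin,Xiaochen Cai,Shengyu Li,Ziqi Chen,Zezhong Zhang,Guolin Ke,Linfeng Zhang*

Main category: eess.IV

TL;DR: This paper presents a systematic solution for intelligent recognition and automatic analysis of microscopy images, comprising a data engine, a segmentation model, and an intelligent analysis platform, and addresses the unique challenges of microscopy images.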

Details

Motivation: Automatic recognition and analysis of microscopy images matters for interdisciplinary research, but traditional methods struggle with cluttered scenes and small targets. Method: Builds a data engine that generates high-quality annotated datasets and proposes a segmentation model that robustly detects both small and large objects and supports automatic recognition of image scale bars. Result: A comprehensive intelligent analysis platform was built, and its effectiveness and practicality were validated in real-world applications. Conclusion: The work advances automatic recognition in microscopy imaging and ensures scalability and generalizability across application domains. Abstract: This paper presents a systematic solution for the intelligent recognition and automatic analysis of microscopy images. We developed a data engine that generates high-quality annotated datasets through a combination of the collection of diverse microscopy images from experiments, synthetic data generation and a human-in-the-loop annotation process. To address the unique challenges of microscopy images, we propose a segmentation model capable of robustly detecting both small and large objects. The model effectively identifies and separates thousands of closely situated targets, even in cluttered visual environments. Furthermore, our solution supports the precise automatic recognition of image scale bars, an essential feature in quantitative microscopic analysis. Building upon these components, we have constructed a comprehensive intelligent analysis platform and validated its effectiveness and practicality in real-world applications. This study not only advances automatic recognition in microscopy imaging but also ensures scalability and generalizability across multiple application domains, offering a powerful tool for automated microscopic analysis in interdisciplinary research.

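### [248] [Whitened CLIP as a Likelihood Surrogate of Images and Captions](https://arxiv.org/abs/2505.06934) *Roy Betser,Meir Yossef Levi,Guy Gilboa* Main category: eess.IV TL;DR: The paper proposes Whitened CLIP, an invertible linear transformation of the CLIP latent space that gives its features zero mean, unit standard deviation, and no mutual correlation, which simplifies likelihood estimation for images and captions.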

Details

Motivation: Image likelihood estimates are useful in many applications but are hard to compute. The study uses the CLIP model to simplify this. Method: Whitens the CLIP latent space with an invertible linear transformation (Whitened CLIP) so that its covariance matrix is the identity and the feature distribution is approximately standard normal. Result: The whitened embeddings are approximately standard normal, so likelihood estimation reduces to the squared Euclidean norm in the whitened embedding space. Conclusion: Whitened CLIP provides a fast, training-free way to estimate the likelihood of images and captions; preliminary experiments support its effectiveness. Abstract: Likelihood approximations for images are not trivial to compute and can be useful in many applications. We examine the use of Contrastive Language-Image Pre-training (CLIP) to assess the likelihood of images and captions. We introduce *Whitened CLIP*, a novel transformation of the CLIP latent space via an invertible linear operation. This transformation ensures that each feature in the embedding space has zero mean, unit standard deviation, and no correlation with all other features, resulting in an identity covariance matrix. We show that the whitened embedding statistics can be well approximated as a standard normal distribution; thus, the log-likelihood is estimated simply by the squared Euclidean norm in the whitened embedding space. The whitening procedure is completely training-free and performed using a pre-computed whitening matrix, hence, is very fast. We present several preliminary experiments demonstrating the properties and applicability of these likelihood scores to images and captions.
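
The whitening step and the resulting likelihood score can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the random correlated vectors stand in for real CLIP embeddings, PCA whitening is one of several valid whitening matrices (the abstract does not say which one the authors use), and the function names `fit_whitening` and `log_likelihood` are hypothetical.

```python
import numpy as np

def fit_whitening(embeddings, eps=1e-12):
    """Return (mean, W) such that (x - mean) @ W.T has identity covariance."""
    mean = embeddings.mean(axis=0)
    cov = np.cov(embeddings - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # PCA whitening (an assumption): W = diag(eigvals)^(-1/2) @ eigvecs.T
    W = (eigvecs / np.sqrt(eigvals + eps)).T
    return mean, W

def log_likelihood(x, mean, W):
    """Standard-normal log-density of the whitened vector(s)."""
    y = (x - mean) @ W.T
    d = y.shape[-1]
    # Up to constants, this is -0.5 * squared Euclidean norm of y.
    return -0.5 * (np.sum(y ** 2, axis=-1) + d * np.log(2 * np.pi))

# Stand-in data: correlated Gaussian vectors, NOT real CLIP embeddings.
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 8))
X = rng.normal(size=(5000, 8)) @ A.T + 3.0

mean, W = fit_whitening(X)
Y = (X - mean) @ W.T          # whitened embeddings
scores = log_likelihood(X, mean, W)
```

The score is the density of the whitened vector; the density in the original space differs only by the constant `log|det W|`, so rankings are unchanged.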

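### [249] [Skull stripping with purely synthetic data](https://arxiv.org/abs/2505.07159) *Jong Sung Park,Juhyung Ha,Siddhesh Thakur,Alexandra Badea,Spyridon Bakas,Eleftherios Garyfallidis* Main category: eess.IV TL;DR: PUMBA is a general training strategy for brain extraction that needs no real brain images or labels and achieves comparable accuracy in multi-modal, multi-species, and pathological cases.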

Details

Motivation: Existing skull-stripping algorithms lack a fundamentally generalizable approach for multi-modal and multi-species cases. Method: Proposes PUMBA, a strategy that trains the model on purely synthetic data, with no real brain images or labels. Result: The model achieves comparable accuracy in multi-modal, multi-species, and pathological cases. Conclusion: Offers a new research direction for generalizable medical image segmentation tasks. Abstract: While many skull stripping algorithms have been developed for multi-modal and multi-species cases, there is still a lack of a fundamentally generalizable approach. We present PUMBA (PUrely synthetic Multimodal/species invariant Brain extrAction), a strategy to train a model for brain extraction with no real brain images or labels. Our results show that even without any real images or anatomical priors, the model achieves comparable accuracy in multi-modal, multi-species and pathological cases. This work presents a new direction of research for any generalizable medical image segmentation task.

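### [250] [Metrics that matter: Evaluating image quality metrics for medical image generation](https://arxiv.org/abs/2505.07175) *Yash Deo,Yan Jia,Toni Lassila,William A. P. Smith,Tom Lawton,Siyuan Kang,Alejandro F. Frangi,Ibrahim Habli* Main category: eess.IV TL;DR: The study evaluates the reliability of no-reference image quality metrics for synthetic medical images, finds that they correlate poorly with downstream task performance and could lead to clinical misjudgment, and recommends a multifaceted validation framework.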

Details

Motivation: Synthetic medical images must meet high standards of fidelity and anatomical accuracy for clinical use, but the reliability of existing no-reference image quality metrics has not been well established, so a systematic evaluation is needed. Method: Using brain MRI data (including tumour and vascular images), systematically evaluates the sensitivity of no-reference image quality metrics to noise, distribution shifts, and localised morphological alterations, and compares the scores with performance on a downstream segmentation task. Result: Many widely used metrics correlate poorly with downstream task performance, are insensitive to key anatomical details, and can give a misleading picture of model performance. Conclusion: Ensuring that generative models are fit for clinical use requires a multifaceted validation framework that combines downstream task performance with carefully selected no-reference metrics. Abstract: Evaluating generative models for synthetic medical imaging is crucial yet challenging, especially given the high standards of fidelity, anatomical accuracy, and safety required for clinical applications. Standard evaluation of generated images often relies on no-reference image quality metrics when ground truth images are unavailable, but their reliability in this complex domain is not well established. This study comprehensively assesses commonly used no-reference image quality metrics using brain MRI data, including tumour and vascular images, providing a representative exemplar for the field. We systematically evaluate metric sensitivity to a range of challenges, including noise, distribution shifts, and, critically, localised morphological alterations designed to mimic clinically relevant inaccuracies. We then compare these metric scores against model performance on a relevant downstream segmentation task, analysing results across both controlled image perturbations and outputs from different generative model architectures. Our findings reveal significant limitations: many widely-used no-reference image quality metrics correlate poorly with downstream task suitability and exhibit a profound insensitivity to localised anatomical details crucial for clinical validity. Furthermore, these metrics can yield misleading scores regarding distribution shifts, e.g. data memorisation. This reveals the risk of misjudging model readiness, potentially leading to the deployment of flawed tools that could compromise patient safety. 
We conclude that ensuring generative models are truly fit for clinical purpose requires a multifaceted validation framework, integrating performance on relevant downstream tasks with the cautious interpretation of carefully selected no-reference image quality metrics.
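
One common way to quantify how well a quality metric tracks downstream performance is a rank correlation. The sketch below implements Spearman correlation in plain Python; the choice of Spearman is an assumption (the abstract does not name the statistic used), and the numbers in the usage lines are made-up toy values, not results from the paper.

```python
def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy values for illustration only (hypothetical metric scores and Dice scores).
metric_scores = [0.31, 0.35, 0.40, 0.52, 0.58]
dice_scores = [0.80, 0.62, 0.85, 0.60, 0.79]
rho = spearman(metric_scores, dice_scores)
```

A value of `rho` near zero would indicate that the metric says little about downstream suitability, which is the kind of mismatch the study reports.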

### [251] [Multi-Plane Vision Transformer for Hemorrhage Classification Using Axial and Sagittal MRI Data](https://arxiv.org/abs/2505.07349) *Badhan Kumar Das,Gengyan Zhao,Boris Mailhe,Thomas J. Re,Dorin Comaniciu,Eli Gibson,Andreas Maier* Main category: eess.IV TL;DR: Proposes a 3D multi-plane vision transformer (MP-ViT) that processes MRI acquisitions of different orientations to improve brain hemorrhage classification.

Details

Motivation: MRI acquisitions vary in orientation and contrast; conventional methods resample images to a fixed plane, which can lose information, so a more effective approach is needed. Method: MP-ViT uses two separate transformer encoders for axial and sagittal data, integrates them with cross-attention, and adds a modality indication vector to supply missing contrast information. Result: On a clinical dataset with 10,084 training subjects, MP-ViT improves AUC by 5.5% over ViT and by 1.8% over CNN-based architectures. Conclusion: MP-ViT shows significant potential for hemorrhage detection when contrasts in different orientations are needed. Abstract: Identifying brain hemorrhages from magnetic resonance imaging (MRI) is a critical task for healthcare professionals. The diverse nature of MRI acquisitions with varying contrasts and orientations introduces complexity in identifying hemorrhage using neural networks. For acquisitions with varying orientations, traditional methods often involve resampling images to a fixed plane, which can lead to information loss. To address this, we propose a 3D multi-plane vision transformer (MP-ViT) for hemorrhage classification with varying orientation data. It employs two separate transformer encoders for axial and sagittal contrasts, using cross-attention to integrate information across orientations. MP-ViT also includes a modality indication vector to provide missing contrast information to the model. The effectiveness of the proposed model is demonstrated with extensive experiments on a real-world clinical dataset consisting of 10,084 training, 1,289 validation and 1,496 test subjects. MP-ViT achieved substantial improvement in area under the curve (AUC), outperforming the vision transformer (ViT) by 5.5% and CNN-based architectures by 1.8%. These results highlight the potential of MP-ViT in improving performance for hemorrhage detection when different orientation contrasts are needed.
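
The cross-attention step that lets one orientation attend to the other can be illustrated with single-head scaled dot-product attention. This is a generic sketch, not the MP-ViT implementation: the token counts, embedding width, random projection matrices, and the choice of axial tokens as queries are all assumptions made for the example.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, Wq, Wk, Wv):
    """Single-head cross-attention: `queries` attend to `context` tokens."""
    Q = queries @ Wq
    K = context @ Wk
    V = context @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = softmax(scores, axis=-1)   # one distribution per query token
    return weights @ V, weights

# Hypothetical sizes: 6 axial tokens and 4 sagittal tokens of width 16.
rng = np.random.default_rng(0)
d = 16
axial_tokens = rng.normal(size=(6, d))
sagittal_tokens = rng.normal(size=(4, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

# Axial tokens gather information from the sagittal encoder's output.
fused, weights = cross_attention(axial_tokens, sagittal_tokens, Wq, Wk, Wv)
```

Because the two token sets never need the same length or sampling grid, this kind of fusion avoids resampling both acquisitions to a common plane.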

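### [252] [Ophora: A Large-Scale Data-Driven Text-Guided Ophthalmic Surgical Video Generation Model](https://arxiv.org/abs/2505.07449) *Wei Li,Ming Hu,Guoan Wang,Lihao Liu,Kaijin Zhou,Junzhi Ning,Xin Guo,Zongyuan Ge,Lixu Gu,Junjun He* Main category: eess.IV TL;DR: Ophora is an AI model that generates ophthalmic surgical videos from natural language instructions. It addresses the scarcity of high-quality annotated videos by building the large-scale Ophora-160K dataset and a progressive video-instruction tuning scheme.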

Details

Motivation: High-quality annotated ophthalmic surgical videos are hard to obtain because of privacy concerns and labor costs, and text-guided video generation (T2V) offers a way to generate such videos. Method: Proposes a comprehensive data curation pipeline to build the Ophora-160K dataset and a progressive video-instruction tuning scheme that transfers spatial-temporal knowledge from a pre-trained T2V model. Result: Experiments show that Ophora generates realistic and reliable ophthalmic surgical videos from instructions and supports downstream surgical workflow understanding tasks. Conclusion: Ophora offers an effective approach to privacy-preserved ophthalmic surgical video generation and shows potential in downstream tasks. Abstract: In ophthalmic surgery, developing an AI system capable of interpreting surgical videos and predicting subsequent operations requires numerous ophthalmic surgical videos with high-quality annotations, which are difficult to collect due to privacy concerns and labor consumption. Text-guided video generation (T2V) emerges as a promising solution to overcome this issue by generating ophthalmic surgical videos based on surgeon instructions. In this paper, we present Ophora, a pioneering model that can generate ophthalmic surgical videos following natural language instructions. To construct Ophora, we first propose a Comprehensive Data Curation pipeline to convert narrative ophthalmic surgical videos into a large-scale, high-quality dataset comprising over 160K video-instruction pairs, Ophora-160K. Then, we propose a Progressive Video-Instruction Tuning scheme to transfer rich spatial-temporal knowledge from a T2V model pre-trained on natural video-text datasets for privacy-preserved ophthalmic surgical video generation based on Ophora-160K. Experiments on video quality evaluation via quantitative analysis and ophthalmologist feedback demonstrate that Ophora can generate realistic and reliable ophthalmic surgical videos based on surgeon instructions. We also validate the capability of Ophora for empowering downstream tasks of ophthalmic surgical workflow understanding. Code is available at https://github.com/mar-cry/Ophora.

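### [253] [Breast Cancer Classification in Deep Ultraviolet Fluorescence Images Using a Patch-Level Vision Transformer Framework](https://arxiv.org/abs/2505.07654) *Pouya Afshin,David Helminiak,Tongtong Lu,Tina Yen,Julie M. Jorns,Mollie Patton,Bing Yu,Dong Hye Ye* Main category: eess.IV TL;DR: The paper proposes a breast cancer classification framework based on deep ultraviolet fluorescence scanning microscopy (DUV-FSM) and a vision transformer (ViT); by capturing local and global features and applying Grad-CAM++ saliency weighting, it significantly improves classification accuracy and interpretability.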

Details

Motivation: Breast-conserving surgery (BCS) must completely remove malignant lesions while preserving as much healthy tissue as possible, so intraoperative margin assessment is essential. DUV-FSM rapidly acquires whole surface images (WSIs) of excised tissue, but their high resolution and complex histopathological features make classification challenging. Method: Uses a patch-level vision transformer (ViT) model combined with Grad-CAM++ saliency weighting to capture local and global features and improve classification accuracy and interpretability. Result: In 5-fold cross-validation the method reaches 98.33% accuracy for benign versus malignant classification, significantly outperforming conventional deep learning methods. Conclusion: The framework offers an efficient and accurate approach to intraoperative margin assessment in breast cancer surgery, with potential for clinical use. Abstract: Breast-conserving surgery (BCS) aims to completely remove malignant lesions while maximizing healthy tissue preservation. Intraoperative margin assessment is essential to achieve a balance between thorough cancer resection and tissue conservation. A deep ultraviolet fluorescence scanning microscope (DUV-FSM) enables rapid acquisition of whole surface images (WSIs) for excised tissue, providing contrast between malignant and normal tissues. However, breast cancer classification with DUV WSIs is challenged by high resolutions and complex histopathological features. This study introduces a DUV WSI classification framework using a patch-level vision transformer (ViT) model, capturing local and global features. Grad-CAM++ saliency weighting highlights relevant spatial regions, enhances result interpretability, and improves diagnostic accuracy for benign and malignant tissue classification. A comprehensive 5-fold cross-validation demonstrates the proposed approach significantly outperforms conventional deep learning methods, achieving a classification accuracy of 98.33%.
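
One plausible way to turn patch-level predictions into an image-level decision is a saliency-weighted average, sketched below. The aggregation rule, the 0.5 threshold, and the function name `aggregate_patches` are assumptions for illustration; the abstract does not specify how the Grad-CAM++ weights enter the final decision.

```python
def aggregate_patches(patch_probs, saliency):
    """Saliency-weighted mean of per-patch malignancy probabilities.

    patch_probs: probability of malignancy predicted for each patch.
    saliency:    non-negative weight per patch (e.g. mean saliency in the patch).
    """
    if len(patch_probs) != len(saliency):
        raise ValueError("patch_probs and saliency must have the same length")
    total = sum(saliency)
    if total <= 0:
        raise ValueError("saliency weights must sum to a positive value")
    return sum(p * s for p, s in zip(patch_probs, saliency)) / total

# Toy example: the second patch is confident and highly salient.
wsi_score = aggregate_patches([0.10, 0.95, 0.20], [0.2, 1.0, 0.1])
label = "malignant" if wsi_score >= 0.5 else "benign"   # threshold is an assumption
```

Compared with a plain average, the weighting lets a small but salient malignant region dominate a large area of unremarkable background patches.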

### [254] [Hierarchical Sparse Attention Framework for Computationally Efficient Classification of Biological Cells](https://arxiv.org/abs/2505.07661) *Elad Yoshai,Dana Yagoda-Aharoni,Eden Dotan,Natan T. Shaked* Main category: eess.IV TL;DR: SparseAttnNet is an efficient image classification framework that dynamically selects the most informative pixels, substantially reducing computation while keeping classification accuracy competitive.

Details

Motivation: Conventional convolutional neural networks process the whole image, which is computationally inefficient and may attend to irrelevant features; SparseAttnNet aims to address this. Method: Uses a dynamic selection mechanism that combines coarse attention with fine multi-head attention to adaptively select and process the most informative pixels, which are embedded as words in a language model to capture their semantics. Result: Across several cell imaging modalities, it processes only about 15% of the pixels, greatly improving computational efficiency while keeping classification accuracy competitive. Conclusion: SparseAttnNet is efficient and lightweight, suited to resource-constrained settings, and offers improved explainability. Abstract: We present SparseAttnNet, a new hierarchical attention-driven framework for efficient image classification that adaptively selects and processes only the most informative pixels from images. Traditional convolutional neural networks typically process entire images regardless of information density, leading to computational inefficiency and potential focus on irrelevant features. Our approach leverages a dynamic selection mechanism that uses coarse attention distilled by fine multi-head attention from the downstream layers of the model, allowing the model to identify and extract the most salient k pixels, where k is adaptively learned during training based on loss convergence trends. Once the top-k pixels are selected, the model processes only these pixels, embedding them as words in a language model to capture their semantics, followed by multi-head attention to incorporate global context. For biological cell images, we demonstrate that SparseAttnNet can process approximately 15% of the pixels instead of the full image. We apply it to cell classification tasks using white blood cell images from three modalities: optical path difference (OPD) images from digital holography of stain-free cells, images from a motion-sensitive (event) camera of stain-free cells, and brightfield microscopy images of stained cells. For all three imaging modalities, SparseAttnNet achieves competitive accuracy while drastically reducing computational requirements in terms of both parameters and floating-point operations per second, compared to traditional CNNs and Vision Transformers. Since the model focuses on biologically relevant regions, it also offers improved explainability. 
The adaptive and lightweight nature of SparseAttnNet makes it ideal for deployment in resource-constrained and high-throughput settings, including imaging flow cytometry.
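
The selection step, keeping only the k pixels with the highest attention scores, can be sketched as follows. This is a simplified illustration: here `k` is passed in by hand, whereas the abstract says it is learned during training, and the attention map is a hand-written toy rather than the output of the model's coarse attention.

```python
def select_top_k(image, attention, k):
    """Return [(row, col, value)] for the k pixels with highest attention.

    image and attention are 2D lists of equal shape. Ties keep row-major order.
    """
    height, width = len(image), len(image[0])
    coords = [(r, c) for r in range(height) for c in range(width)]
    # Stable sort, highest attention first.
    coords.sort(key=lambda rc: attention[rc[0]][rc[1]], reverse=True)
    return [(r, c, image[r][c]) for r, c in coords[:k]]

# Toy 2x2 image and attention map (illustrative values).
image = [[1, 2],
         [3, 4]]
attention = [[0.1, 0.9],
             [0.5, 0.2]]
tokens = select_top_k(image, attention, k=2)

# Keeping about 15% of a 64x64 image would correspond to this many pixels.
k_for_15_percent = int(0.15 * 64 * 64)
```

Each returned tuple carries the pixel position along with its value, since a downstream attention layer needs positional information once the spatial grid is discarded.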

### [255] [ABS-Mamba: SAM2-Driven Bidirectional Spiral Mamba Network for Medical Image Translation](https://arxiv.org/abs/2505.07687) *Feng Yuan,Yifan Gao,Wenbin Wu,Keqing Wu,Xiaotong Guo,Jie Jiang,Xin Gao* Main category: eess.IV TL;DR: ABS-Mamba is a new multi-modal medical image translation architecture that combines SAM2, CNNs, and Mamba, using a dual-resolution framework and a feature fusion network to achieve high-fidelity image synthesis.

Details

Motivation: Addresses the balance between global anatomical semantics and local structural fidelity in multi-modal medical image translation, and the challenges of inter-modality information loss and structural distortion. Method: Uses a dual-resolution framework: SAM2 extracts organ-level semantics, a CNN branch extracts local features, the RFFN fuses the features, the BMRN models spatial dependencies, a three-stage decoder improves detail fidelity, and LoRA+ is used for fine-tuning. Result: Outperforms existing methods on the SynthRAD2023 and BraTS2019 datasets, achieving high-fidelity cross-modal synthesis. Conclusion: ABS-Mamba can enhance diagnostic accuracy in clinical applications; the code is open source. Abstract: Accurate multi-modal medical image translation requires harmonizing global anatomical semantics and local structural fidelity, a challenge complicated by intermodality information loss and structural distortion. We propose ABS-Mamba, a novel architecture integrating the Segment Anything Model 2 (SAM2) for organ-aware semantic representation, specialized convolutional neural networks (CNNs) for preserving modality-specific edge and texture details, and Mamba's selective state-space modeling for efficient long- and short-range feature dependencies. Structurally, our dual-resolution framework leverages SAM2's image encoder to capture organ-scale semantics from high-resolution inputs, while a parallel CNN branch extracts fine-grained local features. The Robust Feature Fusion Network (RFFN) integrates these representations, and the Bidirectional Mamba Residual Network (BMRN) models spatial dependencies using spiral scanning and bidirectional state-space dynamics. A three-stage skip fusion decoder enhances edge and texture fidelity. We employ Efficient Low-Rank Adaptation (LoRA+) fine-tuning to enable precise domain specialization while maintaining the foundational capabilities of the pre-trained components. Extensive experimental validation on the SynthRAD2023 and BraTS2019 datasets demonstrates that ABS-Mamba outperforms state-of-the-art methods, delivering high-fidelity cross-modal synthesis that preserves anatomical semantics and structural details to enhance diagnostic accuracy in clinical applications. The code is available at https://github.com/gatina-yone/ABS-Mamba
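
State-space models such as Mamba consume a 1D sequence, so a 2D feature map must first be flattened in some scan order. The sketch below generates a clockwise, outside-in spiral order and its reverse for a bidirectional pass. The direction and starting corner are assumptions; the abstract only states that spiral scanning and bidirectional dynamics are used.

```python
def spiral_order(height, width):
    """Clockwise outside-in spiral over a height x width grid.

    Returns a list of (row, col) pairs visiting every cell exactly once.
    """
    top, bottom, left, right = 0, height - 1, 0, width - 1
    order = []
    while top <= bottom and left <= right:
        for c in range(left, right + 1):          # top edge, left to right
            order.append((top, c))
        for r in range(top + 1, bottom + 1):      # right edge, downwards
            order.append((r, right))
        if top < bottom:
            for c in range(right - 1, left - 1, -1):   # bottom edge, right to left
                order.append((bottom, c))
        if left < right:
            for r in range(bottom - 1, top, -1):       # left edge, upwards
                order.append((r, left))
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1
    return order

def bidirectional_sequences(feature_map):
    """Flatten a 2D list along the spiral, forwards and backwards."""
    order = spiral_order(len(feature_map), len(feature_map[0]))
    forward = [feature_map[r][c] for r, c in order]
    return forward, forward[::-1]

grid = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
forward, backward = bidirectional_sequences(grid)
```

Unlike a row-major raster scan, a spiral keeps spatially adjacent cells adjacent in the sequence at every step, with no jump at the end of each row.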

# cs.CY [[Back]](#toc)

### [256] [Privacy of Groups in Dense Street Imagery](https://arxiv.org/abs/2505.07085) *Matt Franchi,Hauke Sandhaus,Madiha Zahrah Choksi,Severin Engelmann,Wendy Ju,Helen Nissenbaum* Main category: cs.CY TL;DR: The study finds that dense street imagery (DSI) combined with advances in AI makes it possible to infer sensitive group information from anonymized data, despite existing privacy protections.

Details

Motivation: Although DSI providers protect privacy by blurring faces and license plates, these measures do not address broader privacy concerns, in particular the inference of group membership. Method: Runs a penetration test on 25,232,608 dashcam images taken in New York City to show how sensitive group affiliations can be inferred from obfuscated pedestrians. Result: High data density and AI advances make it easy to infer group membership from anonymized data, exposing gaps in current privacy protection. Conclusion: The paper offers actionable recommendations for researchers working with DSI data and stresses the importance of protecting privacy in data use. Abstract: Spatially and temporally dense street imagery (DSI) datasets have grown unbounded. In 2024, individual companies possessed around 3 trillion unique images of public streets. DSI data streams are only set to grow as companies like Lyft and Waymo use DSI to train autonomous vehicle algorithms and analyze collisions. Academic researchers leverage DSI to explore novel approaches to urban analysis. Despite good-faith efforts by DSI providers to protect individual privacy through blurring faces and license plates, these measures fail to address broader privacy concerns. In this work, we find that increased data density and advancements in artificial intelligence enable harmful group membership inferences from supposedly anonymized data. We perform a penetration test to demonstrate how easily sensitive group affiliations can be inferred from obfuscated pedestrians in 25,232,608 dashcam images taken in New York City. We develop a typology of identifiable groups within DSI and analyze privacy implications through the lens of contextual integrity. Finally, we discuss actionable recommendations for researchers working with data from DSI providers.

# cs.SD [[Back]](#toc)

### [257] [Bridging Ears and Eyes: Analyzing Audio and Visual Large Language Models to Humans in Visible Sound Recognition and Reducing Their Sensory Gap via Cross-Modal Distillation](https://arxiv.org/abs/2505.06803) *Xilin Jiang,Junkai Wu,Vishal Choudhari,Nima Mesgarani* Main category: cs.SD TL;DR: The paper compares audio, visual, and audio-visual large language models (LLMs) on recognizing sound objects, finds a performance gap between audio and visual models, and proposes a cross-modal distillation framework to narrow it.

Details

Motivation: Examines how audio LLMs compare with LLMs in other sensory modalities and with humans in recognizing sound objects, and proposes a way to improve them. Method: Systematically evaluates audio, visual, and audio-visual LLMs (Qwen2-Audio, Qwen2-VL, Qwen2.5-Omni) against humans, and introduces a cross-modal distillation framework for knowledge transfer. Result: Finds a performance gap between audio and visual LLMs; cross-modal distillation notably improves recognition in challenging classes. Conclusion: The study exposes the sensory gap in LLMs and proposes an effective way to strengthen perception in multimodal LLMs. Abstract: Audio large language models (LLMs) are considered experts at recognizing sound objects, yet their performance relative to LLMs in other sensory modalities, such as visual or audio-visual LLMs, and to humans using their ears, eyes, or both remains unexplored. To investigate this, we systematically evaluate audio, visual, and audio-visual LLMs, specifically Qwen2-Audio, Qwen2-VL, and Qwen2.5-Omni, against humans in recognizing sound objects of different classes from audio-only, silent video, or sounded video inputs. We uncover a performance gap between Qwen2-Audio and Qwen2-VL that parallels the sensory discrepancy between human ears and eyes. To reduce this gap, we introduce a cross-modal distillation framework, where an LLM in one modality serves as the teacher and another as the student, with knowledge transfer in sound classes predicted as more challenging to the student by a heuristic model. Distillation in both directions, from Qwen2-VL to Qwen2-Audio and vice versa, leads to notable improvements, particularly in challenging classes. This work highlights the sensory gap in LLMs from a human-aligned perspective and proposes a principled approach to enhancing modality-specific perception in multimodal LLMs.
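
The teacher-student transfer can be illustrated with the standard temperature-scaled distillation loss. This is a generic sketch, not the paper's objective: the abstract does not state the loss function, the logits below are toy values, and the heuristic that selects challenging classes is omitted.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)   # teacher, e.g. the visual LLM
    q = softmax(student_logits, temperature)   # student, e.g. the audio LLM
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Toy logits over three sound classes (illustrative values only).
teacher = [4.0, 1.0, 0.5]
student = [1.5, 2.0, 0.5]
loss_before = distillation_loss(student, teacher)
loss_matched = distillation_loss(teacher, teacher)
```

Swapping the two arguments gives the reverse direction, matching the abstract's note that distillation is run both from Qwen2-VL to Qwen2-Audio and vice versa.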

### [258] [Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in The DCASE 2025 Challenge](https://arxiv.org/abs/2505.07365) *Chao-Han Huck Yang,Sreyan Ghosh,Qing Wang,Jaeyeon Kim,Hengyi Hong,Sonal Kumar,Guirui Zhong,Zhifeng Kong,S Sakshi,Vaibhavi Lokegaonkar,Oriol Nieto,Ramani Duraiswami,Dinesh Manocha,Gunhee Kim,Jun Du,Rafael Valle,Bryan Catanzaro* Main category: cs.SD TL;DR: Task 5 of the DCASE 2025 Challenge is a multi-domain Audio Question Answering (AQA) benchmark with three subsets (Bioacoustics, Temporal Soundscapes, and Complex QA) that tests audio-language models on interactive question answering over diverse acoustic scenes.

Details

Motivation: To advance the audio understanding and reasoning of audio-language models toward human-level acuity, so that AI agents can better perceive and interact with the world. Method: Defines three QA subsets and evaluates baseline models (Qwen2-Audio-7B, AudioFlamingo 2, Gemini-2-Flash) on the development set, using top-1 accuracy with answer-shuffling robustness as the evaluation protocol. Result: Preliminary results show strong variation across models and subsets. Conclusion: The challenge aims to improve the audio understanding and reasoning of audio-language models, supporting perception and interaction in AI agents. Abstract: We present Task 5 of the DCASE 2025 Challenge: an Audio Question Answering (AQA) benchmark spanning multiple domains of sound understanding. This task defines three QA subsets (Bioacoustics, Temporal Soundscapes, and Complex QA) to test audio-language models on interactive question-answering over diverse acoustic scenes. We describe the dataset composition (from marine mammal calls to soundscapes and complex real-world clips), the evaluation protocol (top-1 accuracy with answer-shuffling robustness), and baseline systems (Qwen2-Audio-7B, AudioFlamingo 2, Gemini-2-Flash). Preliminary results on the development set are compared, showing strong variation across models and subsets. This challenge aims to advance the audio understanding and reasoning capabilities of audio-language models toward human-level acuity, which are crucial for enabling AI agents to perceive and interact with the world effectively.
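
The evaluation idea, top-1 accuracy that stays stable when the answer choices are reordered, can be sketched as below. Everything specific here is an assumption for illustration: the question format, the `model(question, choices)` callable, the number of shuffles, and the toy questions are hypothetical and do not reflect the official DCASE scoring code; audio input is left out entirely.

```python
import random

def top1_accuracy(model, questions):
    """Fraction of questions where the model's chosen answer text is correct."""
    correct = sum(model(q["question"], q["choices"]) == q["answer"] for q in questions)
    return correct / len(questions)

def shuffled_accuracy(model, questions, n_shuffles=5, seed=0):
    """Mean top-1 accuracy over several random reorderings of the choices."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_shuffles):
        shuffled = []
        for q in questions:
            choices = q["choices"][:]
            rng.shuffle(choices)
            shuffled.append({**q, "choices": choices})
        accuracies.append(top1_accuracy(model, shuffled))
    return sum(accuracies) / len(accuracies)

# Toy questions (hypothetical); the correct answer is listed first in both.
questions = [
    {"question": "Which animal is calling?",
     "choices": ["whale", "dolphin", "seal", "gull"], "answer": "whale"},
    {"question": "What happens after the door slams?",
     "choices": ["footsteps", "rain", "engine", "silence"], "answer": "footsteps"},
]

def content_model(question, choices):
    """Stand-in model that answers by content, regardless of position."""
    known = {"Which animal is calling?": "whale",
             "What happens after the door slams?": "footsteps"}
    return known[question]

def first_choice_model(question, choices):
    """Stand-in model with pure position bias: always picks the first option."""
    return choices[0]

robust_score = shuffled_accuracy(content_model, questions)
biased_score = shuffled_accuracy(first_choice_model, questions)
```

A model that answers by content keeps its score under shuffling, while a position-biased model looks perfect on the original ordering and drops once the choices are reordered.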

Table of Contents

cs.CV [Back]

[1] Understanding and Mitigating Toxicity in Image-Text Pretraining Datasets: A Case Study on LLaVA

[2] Robust & Precise Knowledge Distillation-based Novel Context-Aware Predictor for Disease Detection in Brain and Gastrointestinal

[3] Deep Learning-Based Robust Optical Guidance for Hypersonic Platforms

[4] Toward Advancing License Plate Super-Resolution in Real-World Scenarios: A Dataset and Benchmark

[5] MAGE:A Multi-stage Avatar Generator with Sparse Observations

[6] Natural Reflection Backdoor Attack on Vision Language Model for Autonomous Driving

[7] My Emotion on your face: The use of Facial Keypoint Detection to preserve Emotions in Latent Space Editing

[8] PromptIQ: Who Cares About Prompts? Let System Handle It -- A Component-Aware Framework for T2I Generation

[9] HCMA: Hierarchical Cross-model Alignment for Grounded Text-to-Image Generation

[10] RESAR-BEV: An Explainable Progressive Residual Autoregressive Approach for Camera-Radar Fusion in BEV Segmentation

[11] Quantum Conflict Measurement in Decision Making for Out-of-Distribution Detection

[12] Edge-Enabled VIO with Long-Tracked Features for High-Accuracy Low-Altitude IoT Navigation

[13] Causal Prompt Calibration Guided Segment Anything Model for Open-Vocabulary Multi-Entity Segmentation

[14] Improving Generalization of Medical Image Registration Foundation Model

[15] Unmasking Deep Fakes: Leveraging Deep Learning for Video Authenticity Detection

[16] TACFN: Transformer-based Adaptive Cross-modal Fusion Network for Multimodal Emotion Recognition

[17] ProFashion: Prototype-guided Fashion Video Generation with Multiple Reference Images

[18] HDGlyph: A Hierarchical Disentangled Glyph-Based Framework for Long-Tail Text Rendering in Diffusion Models

[19] Weakly Supervised Temporal Sentence Grounding via Positive Sample Mining

[20] Dynamic Uncertainty Learning with Noisy Correspondence for Text-Based Person Search

[21] ElectricSight: 3D Hazard Monitoring for Power Lines Using Low-Cost Sensors

[22] GRACE: Estimating Geometry-level 3D Human-Scene Contact from 2D Images

[23] Two-Stage Random Alternation Framework for Zero-Shot Pansharpening

[24] Compact and Efficient Neural Networks for Image Recognition Based on Learned 2D Separable Transform

[25] Batch Augmentation with Unimodal Fine-tuning for Multimodal Learning

[26] ReplayCAD: Generative Diffusion Replay for Continual Anomaly Detection

[27] Reducing Unimodal Bias in Multi-Modal Semantic Segmentation with Multi-Scale Functional Entropy Regularization

[28] Dataset Distillation with Probabilistic Latent Features

[29] METOR: A Unified Framework for Mutual Enhancement of Objects and Relationships in Open-vocabulary Video Visual Relationship Detection

[30] MultiTaskVIF: Segmentation-oriented visible and infrared image fusion via multi-task learning

[31] StableMotion: Repurposing Diffusion-Based Image Priors for Motion Estimation

[32] Video Dataset Condensation with Diffusion Models

[33] Jailbreaking the Text-to-Video Generative Models

[34] UnfoldIR: Rethinking Deep Unfolding Network in Illumination Degradation Image Restoration

[35] FNBench: Benchmarking Robust Federated Learning against Noisy Labels

[36] Underwater object detection in sonar imagery with detection transformer and Zero-shot neural architecture search

[37] SimMIL: A Universal Weakly Supervised Pre-Training Framework for Multi-Instance Learning in Whole Slide Pathology Images

[38] Symbolic Rule Extraction from Attention-Guided Sparse Representations in Vision Transformers

[39] Multimodal Fake News Detection: MFND Dataset and Shallow-Deep Multitask Learning

[40] Overview of the NLPCC 2025 Shared Task 4: Multi-modal, Multilingual, and Multi-hop Medical Instructional Video Question Answering Challenge

[41] Active Learning for Multi-class Image Classification

[42] Fine-Grained Bias Exploration and Mitigation for Group-Robust Classification

[43] Visual Instruction Tuning with Chain of Region-of-Interest

[44] Predicting Surgical Safety Margins in Osteosarcoma Knee Resections: An Unsupervised Approach

[45] Joint Low-level and High-level Textual Representation Learning with Multiple Masking Strategies

[46] NeuRN: Neuro-inspired Domain Generalization for Image Classification

[47] Mice to Machines: Neural Representations from Visual Cortex for Domain Generalization

[48] NeuGen: Amplifying the 'Neural' in Neural Radiance Fields for Domain Generalization

[49] Multi-Modal Explainable Medical AI Assistant for Trustworthy Human-AI Collaboration

[50] CheXLearner: Text-Guided Fine-Grained Representation Learning for Progression Detection

[51] Enhancing Monocular Height Estimation via Sparse LiDAR-Guided Correction

[52] Building a Human-Verified Clinical Reasoning Dataset via a Human LLM Hybrid Pipeline for Trustworthy Medical AI

[53] Bi-directional Self-Registration for Misaligned Infrared-Visible Image Fusion

[54] Transformer-Based Dual-Optical Attention Fusion Crowd Head Point Counting and Localization Network

[55] Unsupervised Learning for Class Distribution Mismatch

[56] Boosting Cross-spectral Unsupervised Domain Adaptation for Thermal Semantic Segmentation

[57] High-Frequency Prior-Driven Adaptive Masking for Accelerating Image Super-Resolution

[58] Federated Learning with LoRA Optimized DeiT and Multiscale Patch Embedding for Secure Eye Disease Recognition

[59] BridgeIV: Bridging Customized Image and Video Generation through Test-Time Autoregressive Identity Propagation

[60] Technical Report for ICRA 2025 GOOSE 2D Semantic Segmentation Challenge: Leveraging Color Shift Correction, RoPE-Swin Backbone, and Quantile-based Label Denoising Strategy for Robust Outdoor Scene Understanding

[61] Replay-Based Continual Learning with Dual-Layered Distillation and a Streamlined U-Net for Efficient Text-to-Image Generation

[62] Hallucination-Aware Multimodal Benchmark for Gastrointestinal Image Analysis with Large Vision-Language Models

[63] CMD: Controllable Multiview Diffusion for 3D Editing and Progressive Generation

[64] MELLM: Exploring LLM-Powered Micro-Expression Understanding Enhanced by Subtle Motion Perception

[65] Efficient and Robust Multidimensional Attention in Remote Physiological Sensing through Target Signal Constrained Factorization

[66] A Vision-Language Foundation Model for Leaf Disease Identification

[67] MarkMatch: Same-Hand Stuffing Detection

[68] Differentiable NMS via Sinkhorn Matching for End-to-End Fabric Defect Detection

[69] Depth-Sensitive Soft Suppression with RGB-D Inter-Modal Stylization Flow for Domain Generalization Semantic Segmentation

[70] DAPE: Dual-Stage Parameter-Efficient Fine-Tuning for Consistent Video Editing with Diffusion Models

[71] Seed1.5-VL Technical Report

[72] Semantic-Guided Diffusion Model for Single-Step Image Super-Resolution

[73] Discovering Concept Directions from Diffusion-based Counterfactuals via Latent Clustering

[74] Towards Scalable IoT Deployment for Visual Anomaly Detection via Efficient Compression

[75] Generalizable Pancreas Segmentation via a Dual Self-Supervised Learning Framework

[76] Critique Before Thinking: Mitigating Hallucination through Rationale-Augmented Instruction Tuning

[77] Ranking-aware Continual Learning for LiDAR Place Recognition

[78] Discovering Fine-Grained Visual-Concept Relations by Disentangled Optimal Transport Concept Bottleneck Models