cs.CV

[1] Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models

Gracjan Góral,Alicja Ziarko,Piotr Miłoś,Michał Nauman,Maciej Wołczyk,Michał Kosiński

Main category: cs.CV

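TL;DR: This study examines how vision language models (VLMs) perform on visual perspective-taking tasks and finds that they do well on scene understanding but degrade significantly on spatial reasoning and perspective taking.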

Details

Motivation: To evaluate the capabilities of VLMs on complex visual tasks, particularly spatial reasoning and perspective taking, and to expose their limitations. Method: By controlling the spatial configuration of each scene (such as object position and minifigure orientation), the authors generate 144 visual tasks, each paired with 7 diagnostic questions, to assess three levels of visual cognition. Result: Models excel at scene understanding, but performance drops significantly on spatial reasoning and perspective taking. Conclusion: Future VLM development should integrate explicit geometric representations and tailored training protocols to improve performance on complex visual tasks. Abstract: We investigate the ability of Vision Language Models (VLMs) to perform visual perspective taking using a novel set of visual tasks inspired by established human tests. Our approach leverages carefully controlled scenes, in which a single humanoid minifigure is paired with a single object. By systematically varying spatial configurations - such as object position relative to the humanoid minifigure and the humanoid minifigure's orientation - and using both bird's-eye and surface-level views, we created 144 unique visual tasks. Each visual task is paired with a series of 7 diagnostic questions designed to assess three levels of visual cognition: scene understanding, spatial reasoning, and visual perspective taking. Our evaluation of several state-of-the-art models, including GPT-4-Turbo, GPT-4o, Llama-3.2-11B-Vision-Instruct, and variants of Claude Sonnet, reveals that while they excel in scene understanding, their performance declines significantly on spatial reasoning and deteriorates further on perspective taking. Our analysis suggests a gap between surface-level object recognition and the deeper spatial and perspective reasoning required for complex visual tasks, pointing to the need for integrating explicit geometric representations and tailored training protocols in future VLM development.

[2] In-situ and Non-contact Etch Depth Prediction in Plasma Etching via Machine Learning (ANN & BNN) and Digital Image Colorimetry

Minji Kang,Seongho Kim,Eunseo Go,Donghyeon Paek,Geon Lim,Muyoung Kim,Soyeun Kim,Sung Kyu Jang,Min Sup Choi,Woo Seok Kang,Jaehyun Kim,Jaekwang Kim,Hyeong-U Kim

Main category: cs.CV

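TL;DR: The paper proposes a machine-learning-based, non-contact, in-situ etch depth prediction framework for precise monitoring of etch depth and insulating-material thickness in semiconductor manufacturing.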

Details

Motivation: Conventional ex-situ analysis methods suffer from time delays and contamination risks, so a more efficient, non-invasive monitoring method is needed. Method: Two scenarios are explored: (1) an artificial neural network (ANN) predicts average etch depth from process parameters, and a Bayesian neural network (BNN) extends it to capture uncertainty from repeated measurements; (2) RGB data from digital image colorimetry (DIC) is used as the input for etch depth prediction. Result: The ANN achieves a significantly lower mean squared error than a linear baseline, the BNN provides reliable uncertainty estimates, and the RGB input also performs strongly. Conclusion: Combining digital image colorimetry with machine learning offers a real-time, in-situ, non-invasive monitoring approach for plasma etching. Abstract: Precise monitoring of etch depth and the thickness of insulating materials, such as silicon dioxide and silicon nitride, is critical to ensuring device performance and yield in semiconductor manufacturing. While conventional ex-situ analysis methods are accurate, they are constrained by time delays and contamination risks. To address these limitations, this study proposes a non-contact, in-situ etch depth prediction framework based on machine learning (ML) techniques. Two scenarios are explored. In the first scenario, an artificial neural network (ANN) is trained to predict average etch depth from process parameters, achieving a significantly lower mean squared error (MSE) compared to a linear baseline model. The approach is then extended to incorporate variability from repeated measurements using a Bayesian Neural Network (BNN) to capture both aleatoric and epistemic uncertainty. Coverage analysis confirms the BNN's capability to provide reliable uncertainty estimates. In the second scenario, we demonstrate the feasibility of using RGB data from digital image colorimetry (DIC) as input for etch depth prediction, achieving strong performance even in the absence of explicit process parameters. These results suggest that the integration of DIC and ML offers a viable, cost-effective alternative for real-time, in-situ, and non-invasive monitoring in plasma etching processes, contributing to enhanced process stability and manufacturing efficiency.
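
The abstract above mentions a coverage analysis for the BNN's uncertainty estimates without giving details. As a rough illustration only, the sketch below computes empirical coverage: the fraction of measured depths that fall inside their predicted intervals. The Gaussian interval form, the function names, and all numbers are assumptions made for this example and are not taken from the paper.

```python
def gaussian_interval(mean, std, z=1.96):
    """Symmetric interval mean +/- z*std (z=1.96 is roughly 95% for a Gaussian)."""
    return mean - z * std, mean + z * std


def empirical_coverage(y_true, lower, upper):
    """Fraction of observations that fall inside their predicted interval."""
    if not (len(y_true) == len(lower) == len(upper)):
        raise ValueError("all inputs must have the same length")
    if not y_true:
        raise ValueError("inputs must be non-empty")
    hits = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return hits / len(y_true)


# Hypothetical etch depths (nm) with made-up predictive means and std devs.
depths = [101.0, 98.5, 110.2, 95.0]
means = [100.0, 99.0, 104.0, 96.0]
stds = [2.0, 1.5, 2.5, 1.0]

intervals = [gaussian_interval(m, s) for m, s in zip(means, stds)]
lower = [lo for lo, _ in intervals]
upper = [hi for _, hi in intervals]
coverage = empirical_coverage(depths, lower, upper)
print(coverage)
```

If the empirical coverage sits close to the nominal level (here about 95%) on held-out data, the intervals are usually considered well calibrated; a much lower value suggests overconfident uncertainty estimates.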

[3] VideoLLM Benchmarks and Evaluation: A Survey

Yogesh Kumar

Main category: cs.CV

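TL;DR: This survey reviews benchmarks and evaluation methodologies for Video Large Language Models (VideoLLMs), analyzes the characteristics, evaluation protocols, and limitations of existing video understanding benchmarks, and proposes future research directions.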

Details

Motivation: The rapid development of large language models (LLMs) has driven significant progress in video understanding, but a systematic evaluation framework is lacking. The survey aims to fill this gap. Method: It analyzes existing video understanding benchmarks and evaluation methodologies (closed-set, open-set, and specialized temporal and spatiotemporal evaluations) and summarizes performance trends and challenges. Result: It identifies the limitations of current evaluation frameworks and proposes improvements, including more diverse, multimodal, and interpretability-focused benchmarks. Conclusion: The survey gives researchers a structured guide to evaluating VideoLLMs and points out promising directions for advancing video understanding. Abstract: The rapid development of Large Language Models (LLMs) has catalyzed significant advancements in video understanding technologies. This survey provides a comprehensive analysis of benchmarks and evaluation methodologies specifically designed or used for Video Large Language Models (VideoLLMs). We examine the current landscape of video understanding benchmarks, discussing their characteristics, evaluation protocols, and limitations. The paper analyzes various evaluation methodologies, including closed-set, open-set, and specialized evaluations for temporal and spatiotemporal understanding tasks. We highlight the performance trends of state-of-the-art VideoLLMs across these benchmarks and identify key challenges in current evaluation frameworks. Additionally, we propose future research directions to enhance benchmark design, evaluation metrics, and protocols, including the need for more diverse, multimodal, and interpretability-focused benchmarks. This survey aims to equip researchers with a structured understanding of how to effectively evaluate VideoLLMs and identify promising avenues for advancing the field of video understanding with large language models.

[4] Video Forgery Detection for Surveillance Cameras: A Review

Noor B. Tayfor,Tarik A. Rashid,Shko M. Qader,Bryar A. Hassan,Mohammed H. Abdalla,Jafar Majidpour,Aram M. Ahmed,Hussein M. Ali,Aso M. Aladdin,Abdulhady A. Abdullah,Ahmed S. Shamsaldin,Haval M. Sidqi,Abdulrahman Salih,Zaher M. Yaseen,Azad A. Ameen,Janmenjoy Nayak,Mahmood Yashar Hamza

Main category: cs.CV

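TL;DR: The paper reviews existing video forensic techniques, examines their effectiveness in detecting forgery in surveillance video, and stresses the need for more robust techniques to counter increasingly sophisticated tampering.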

Details

Motivation: As video editing tools become widespread, surveillance footage is increasingly easy to tamper with, which threatens its authenticity and can mislead judicial decisions; the integrity of video evidence must therefore be ensured. Method: The review covers several forensic techniques, including compression-based analysis, frame duplication detection, and machine learning-based approaches. Result: Existing techniques are effective but need further strengthening to keep up with evolving forgery methods. Conclusion: Strengthening video forensic capabilities is key to keeping surveillance video credible and admissible as legal evidence. Abstract: The widespread availability of video recording through smartphones and digital devices has made video-based evidence more accessible than ever. Surveillance footage plays a crucial role in security, law enforcement, and judicial processes. However, with the rise of advanced video editing tools, tampering with digital recordings has become increasingly easy, raising concerns about their authenticity. Ensuring the integrity of surveillance videos is essential, as manipulated footage can lead to misinformation and undermine judicial decisions. This paper provides a comprehensive review of existing forensic techniques used to detect video forgery, focusing on their effectiveness in verifying the authenticity of surveillance recordings. Various methods, including compression-based analysis, frame duplication detection, and machine learning-based approaches, are explored. The findings highlight the growing necessity for more robust forensic techniques to counteract evolving forgery methods. Strengthening video forensic capabilities will ensure that surveillance recordings remain credible and admissible as legal evidence.

[5] PointExplainer: Towards Transparent Parkinson's Disease Diagnosis

Xuechao Wang,Sven Nomm,Junqing Huang,Kadri Medijainen,Aaro Toomela,Michael Ruzhansky

Main category: cs.CV

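TL;DR: PointExplainer is an explainable diagnostic strategy that identifies how hand-drawn regions contribute to the early diagnosis of Parkinson's disease.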

Details

Motivation: Existing diagnostic methods lack interpretability, which undermines clinical trust. Method: PointExplainer combines a diagnosis module (which encodes hand-drawn signals as 3D point clouds) with an explanation module (which trains an interpretable surrogate model), and introduces consistency measures. Result: Validated on multiple datasets, PointExplainer provides intuitive explanations without degrading diagnostic performance. Conclusion: PointExplainer offers an explainable and effective tool for Parkinson's disease diagnosis. Abstract: Deep neural networks have shown potential in analyzing digitized hand-drawn signals for early diagnosis of Parkinson's disease. However, the lack of clear interpretability in existing diagnostic methods presents a challenge to clinical trust. In this paper, we propose PointExplainer, an explainable diagnostic strategy to identify hand-drawn regions that drive model diagnosis. Specifically, PointExplainer assigns discrete attribution values to hand-drawn segments, explicitly quantifying their relative contributions to the model's decision. Its key components include: (i) a diagnosis module, which encodes hand-drawn signals into 3D point clouds to represent hand-drawn trajectories, and (ii) an explanation module, which trains an interpretable surrogate model to approximate the local behavior of the black-box diagnostic model. We also introduce consistency measures to further address the issue of faithfulness in explanations. Extensive experiments on two benchmark datasets and a newly constructed dataset show that PointExplainer can provide intuitive explanations with no diagnostic performance degradation. The source code is available at https://github.com/chaoxuewang/PointExplainer.

[6] Explainable Face Recognition via Improved Localization

Rashik Shadman,Daqing Hou,Faraz Hussain,M G Sarwar Murshed

Main category: cs.CV

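TL;DR: The paper proposes an explainable face recognition method based on Scaled Directed Divergence (SDD) that finely localizes the relevant facial features, improving system transparency and trustworthiness.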

Details

Motivation: Current deep learning-based face recognition systems lack explainability, making it hard for users to trust their decisions. The paper addresses this by providing interpretable visual results. Method: It uses the SDD class activation mapping technique to finely localize the facial features a deep learning model relies on, producing visual explanations. Result: Experiments show that the SDD CAM localizes relevant facial features more precisely than the traditional CAM and yields more specific visual explanations. Conclusion: The SDD method significantly improves the transparency and trustworthiness of face recognition systems, helping users trust AI decisions. Abstract: Biometric authentication has become one of the most widely used tools in the current technological era to authenticate users and to distinguish between genuine users and imposters. Face is the most common form of biometric modality that has proven effective. Deep learning-based face recognition systems are now commonly used across different domains. However, these systems usually operate like black-box models that do not provide necessary explanations or justifications for their decisions. This is a major disadvantage because users cannot trust such artificial intelligence-based biometric systems and may not feel comfortable using them when clear explanations or justifications are not provided. This paper addresses this problem by applying an efficient method for explainable face recognition systems. We use a Class Activation Mapping (CAM)-based discriminative localization (very narrow/specific localization) technique called Scaled Directed Divergence (SDD) to visually explain the results of deep learning-based face recognition systems. We perform fine localization of the face features relevant to the deep learning model for its prediction/decision. Our experiments show that the SDD Class Activation Map (CAM) highlights the relevant face features far more specifically and accurately than the traditional CAM. The provided visual explanations with narrow localization of relevant features can ensure much-needed transparency and trust for deep learning-based face recognition systems.
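
For readers unfamiliar with the baseline the abstract compares against, the sketch below shows the traditional class activation map: a weighted sum of the final convolutional feature maps, using the classifier weights of the class of interest. This is not the SDD technique itself, whose details are not given here. The tiny 2x2 feature maps and the weights are invented for illustration.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of K feature maps (each H x W) with K class weights."""
    if len(feature_maps) != len(class_weights):
        raise ValueError("need one weight per feature map")
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * fmap[i][j]
    return cam


def min_max_normalize(cam):
    """Rescale a map to [0, 1] so it can be overlaid on the image as a heat map."""
    flat = [v for row in cam for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in cam]
    return [[(v - lo) / (hi - lo) for v in row] for row in cam]


# Two hypothetical 2x2 feature maps and the weights of one target class.
feature_maps = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
class_weights = [0.5, 1.0]

cam = class_activation_map(feature_maps, class_weights)
heatmap = min_max_normalize(cam)
print(heatmap)
```

In practice the map is upsampled to the input resolution before being overlaid on the face image; that step is omitted here.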

[7] GAME: Learning Multimodal Interactions via Graph Structures for Personality Trait Estimation

Kangsheng Wang,Yuhang Li,Chengwei Ye,Yufei Lin,Huanzhen Zhang,Bohan Hu,Linuo Xu,Shuyan Liu

Main category: cs.CV

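TL;DR: GAME is a graph-augmented multimodal encoder that predicts personality traits from short videos by fusing visual, auditory, and textual features, and it outperforms existing methods.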

Details

Motivation: Personality analysis from short videos is challenging because of the complex interplay of multi-source features, which calls for a robust multimodal modeling approach. Method: A facial graph is constructed, and a dual-branch network combining GCNs and CNNs extracts visual features; a BiGRU handles temporal dynamics; audio and text features are extracted with VGGish and XLM-Roberta, respectively; a channel attention-based fusion module integrates the multimodal features. Result: GAME outperforms existing methods on multiple benchmarks, validating its effectiveness and generalizability. Conclusion: Through multimodal feature fusion and attention mechanisms, GAME improves the accuracy of personality prediction. Abstract: Apparent personality analysis from short videos poses significant challenges due to the complex interplay of visual, auditory, and textual cues. In this paper, we propose GAME, a Graph-Augmented Multimodal Encoder designed to robustly model and fuse multi-source features for automatic personality prediction. For the visual stream, we construct a facial graph and introduce a dual-branch Geo Two-Stream Network, which combines Graph Convolutional Networks (GCNs) and Convolutional Neural Networks (CNNs) with attention mechanisms to capture both structural and appearance-based facial cues. Complementing this, global context and identity features are extracted using pretrained ResNet18 and VGGFace backbones. To capture temporal dynamics, frame-level features are processed by a BiGRU enhanced with temporal attention modules. Meanwhile, audio representations are derived from the VGGish network, and linguistic semantics are captured via the XLM-Roberta transformer. To achieve effective multimodal integration, we propose a Channel Attention-based Fusion module, followed by a Multi-Layer Perceptron (MLP) regression head for predicting personality traits. Extensive experiments show that GAME consistently outperforms existing methods across multiple benchmarks, validating its effectiveness and generalizability.

[8] Advanced Clustering Framework for Semiconductor Image Analytics Integrating Deep TDA with Self-Supervised and Transfer Learning Techniques

Janhavi Giri,Attila Lengyel,Don Kent,Edward Kibardin

Main category: cs.CV

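TL;DR: The paper proposes an advanced clustering framework that combines deep topological data analysis (TDA), self-supervised learning, and transfer learning for clustering image data in semiconductor manufacturing, addressing the limitations of traditional methods on high-dimensional, unlabeled data.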

Details

Motivation: The large volume of image data generated in semiconductor manufacturing is crucial for defect identification and yield optimization, but traditional clustering methods struggle with high-dimensional, unlabeled data and fail to capture subtle patterns. Method: The framework integrates deep TDA, self-supervised learning, and transfer learning: TDA extracts topological features, self-supervised learning learns representations from unlabeled data, and transfer learning improves adaptability and scalability. Result: Validated on synthetic and open-source semiconductor image datasets, the framework identifies clusters aligned with defect patterns and process variations. Conclusion: The study demonstrates the potential of combining TDA, self-supervised learning, and transfer learning, providing a scalable solution for image data analysis in semiconductor manufacturing and other domains. Abstract: Semiconductor manufacturing generates vast amounts of image data, crucial for defect identification and yield optimization, yet often exceeds manual inspection capabilities. Traditional clustering techniques struggle with high-dimensional, unlabeled data, limiting their effectiveness in capturing nuanced patterns. This paper introduces an advanced clustering framework that integrates deep Topological Data Analysis (TDA) with self-supervised and transfer learning techniques, offering a novel approach to unsupervised image clustering. TDA captures intrinsic topological features, while self-supervised learning extracts meaningful representations from unlabeled data, reducing reliance on labeled datasets. Transfer learning enhances the framework's adaptability and scalability, allowing fine-tuning to new datasets without retraining from scratch. Validated on synthetic and open-source semiconductor image datasets, the framework successfully identifies clusters aligned with defect patterns and process variations. This study highlights the transformative potential of combining TDA, self-supervised learning, and transfer learning, providing a scalable solution for proactive process monitoring and quality control in semiconductor manufacturing and other domains with large-scale image datasets.

[9] An Active Inference Model of Covert and Overt Visual Attention

Tin Mišić,Karlo Koledić,Fabio Bonsignorio,Ivan Petrović,Ivan Marković

Main category: cs.CV

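TL;DR: The paper proposes an active inference model of visual attention that minimizes free energy by dynamically optimizing sensory precisions, studies the interplay between exogenous and endogenous attention, and validates the model experimentally.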

Details

Motivation: To study how to selectively attend to relevant stimuli and filter out distractions in complex sensory input, providing a theoretical basis for agents that process high-dimensional sensory data. Method: A visual attention model is built on dynamic optimization of sensory precisions; its behavior is tested on the Posner cueing task and a simple target focus task, with reaction times measured. Result: Exogenous and valid cues generally lead to faster reaction times; the model exhibits behavior similar to inhibition of return; reflexive saccades are faster than intentional ones but less adaptable. Conclusion: The model reproduces dynamic behaviors of visual attention and offers a new perspective on attentional mechanisms. Abstract: The ability to selectively attend to relevant stimuli while filtering out distractions is essential for agents that process complex, high-dimensional sensory input. This paper introduces a model of covert and overt visual attention through the framework of active inference, utilizing dynamic optimization of sensory precisions to minimize free-energy. The model determines visual sensory precisions based on both current environmental beliefs and sensory input, influencing attentional allocation in both covert and overt modalities. To test the effectiveness of the model, we analyze its behavior in the Posner cueing task and a simple target focus task using two-dimensional (2D) visual data. Reaction times are measured to investigate the interplay between exogenous and endogenous attention, as well as valid and invalid cueing. The results show that exogenous and valid cues generally lead to faster reaction times compared to endogenous and invalid cues. Furthermore, the model exhibits behavior similar to inhibition of return, where previously attended locations become suppressed after a specific cue-target onset asynchrony interval. Lastly, we investigate different aspects of overt attention and show that involuntary, reflexive saccades occur faster than intentional ones, but at the expense of adaptability.

[10] Novel Extraction of Discriminative Fine-Grained Feature to Improve Retinal Vessel Segmentation

Shuang Zeng,Chee Hong Lee,Micky C Nnamdi,Wenqi Shi,J Ben Tamo,Lei Zhu,Hangzhou He,Xinliang Zhang,Qian Chen,May D. Wang,Yanye Lu,Qiushi Ren

Main category: cs.CV

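TL;DR: The paper proposes AttUKAN, a novel Attention U-shaped Kolmogorov-Arnold Network, together with a Label-guided Pixel-wise Contrastive Loss for retinal vessel segmentation, significantly improving performance.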

Details

Motivation: Existing retinal vessel segmentation methods do not fully exploit fine-grained features from the encoder and pay little attention to feature-level discrimination. Method: The approach combines a Kolmogorov-Arnold Network with Attention Gates (AttUKAN) and a Label-guided Pixel-wise Contrastive Loss to strengthen feature extraction. Result: AttUKAN achieves the highest F1 and MIoU scores on multiple datasets, outperforming existing methods. Conclusion: AttUKAN performs strongly on retinal vessel segmentation and offers an effective tool for early detection of related diseases. Abstract: Retinal vessel segmentation is a vital early detection method for several severe ocular diseases. Despite significant progress in retinal vessel segmentation with the advancement of Neural Networks, there are still challenges to overcome. Specifically, retinal vessel segmentation aims to predict the class label for every pixel within a fundus image, with a primary focus on intra-image discrimination, making it vital for models to extract more discriminative features. Nevertheless, existing methods primarily focus on minimizing the difference between the output from the decoder and the label, but ignore fully using feature-level fine-grained representations from the encoder. To address these issues, we propose a novel Attention U-shaped Kolmogorov-Arnold Network named AttUKAN along with a novel Label-guided Pixel-wise Contrastive Loss for retinal vessel segmentation. Specifically, we implement Attention Gates into Kolmogorov-Arnold Networks to enhance model sensitivity by suppressing irrelevant feature activations and model interpretability by non-linear modeling of KAN blocks. Additionally, we also design a novel Label-guided Pixel-wise Contrastive Loss to supervise our proposed AttUKAN to extract more discriminative features by distinguishing between foreground vessel-pixel pairs and background pairs. Experiments are conducted across four public datasets including DRIVE, STARE, CHASE_DB1, HRF and our private dataset. AttUKAN achieves F1 scores of 82.50%, 81.14%, 81.34%, 80.21% and 80.09%, along with MIoU scores of 70.24%, 68.64%, 68.59%, 67.21% and 66.94% in the above datasets, which are the highest compared to 11 networks for retinal vessel segmentation. Quantitative and qualitative results show that our AttUKAN achieves state-of-the-art performance and outperforms existing retinal vessel segmentation methods. Our code will be available at https://github.com/stevezs315/AttUKAN.
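
The abstract above reports F1 and MIoU scores. As a reminder of what these metrics measure for a binary vessel mask, here is a minimal sketch. It assumes masks are flattened lists of 0/1 labels and that MIoU is the mean of the vessel-class and background-class IoU; the paper may use a different averaging convention, and the example masks are invented.

```python
def confusion_counts(pred, target):
    """Count true/false positives and negatives for flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    tp = fp = fn = tn = 0
    for p, t in zip(pred, target):
        if p and t:
            tp += 1
        elif p and not t:
            fp += 1
        elif not p and t:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1_score(pred, target):
    """F1 = 2TP / (2TP + FP + FN); defined as 1.0 when both masks are empty."""
    tp, fp, fn, _ = confusion_counts(pred, target)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def mean_iou(pred, target):
    """Average of foreground IoU and background IoU (assumed convention)."""
    tp, fp, fn, tn = confusion_counts(pred, target)
    ious = []
    for intersection, union in ((tp, tp + fp + fn), (tn, tn + fp + fn)):
        if union:
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 1.0


# Hypothetical 6-pixel prediction and ground-truth masks (1 = vessel).
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]

f1 = f1_score(pred, target)
miou = mean_iou(pred, target)
print(f1, miou)
```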

[11] Deep Learning Framework for Infrastructure Maintenance: Crack Detection and High-Resolution Imaging of Infrastructure Surfaces

Nikhil M. Pawar,Jorge A. Prozzi,Feng Hong,Surya Sarat Chandra Congress

Main category: cs.CV

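TL;DR: The paper proposes a framework combining a CNN and an ESPCNN for efficient super-resolution and fewer false alarms, improving the accuracy of distress detection in infrastructure images.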

Details

Motivation: Data collection platforms such as drones are widely used in infrastructure management, but low image resolution and false alarms limit their effectiveness. Existing super-resolution techniques are computationally expensive and prone to false alarms. Method: A CNN classifies images into positive and negative distress classes, and a lightweight ESPCNN then super-resolves the positive distress images. Result: The ESPCNN outperforms bicubic interpolation on the super-resolution metrics, and the framework reduces computational cost and false alarms. Conclusion: The framework is expected to help highway agencies perform distress detection and asset management more accurately. Abstract: Recently, there has been an impetus for the application of cutting-edge data collection platforms such as drones mounted with camera sensors for infrastructure asset management. However, the sensor characteristics, proximity to the structure, hard-to-reach access, and environmental conditions often limit the resolution of the datasets. A few studies used super-resolution techniques to address the problem of low-resolution images. Nevertheless, these techniques were observed to increase computational cost and false alarms of distress detection due to the consideration of all the infrastructure images i.e., positive and negative distress classes. In order to address the pre-processing of false alarm and achieve efficient super-resolution, this study developed a framework consisting of convolutional neural network (CNN) and efficient sub-pixel convolutional neural network (ESPCNN). CNN accurately classified both the classes. ESPCNN, which is the lightweight super-resolution technique, generated high-resolution infrastructure image of positive distress obtained from CNN. The ESPCNN outperformed bicubic interpolation in all the evaluation metrics for super-resolution. Based on the performance metrics, the combination of CNN and ESPCNN was observed to be effective in preprocessing the infrastructure images with negative distress, reducing the computational cost and false alarms in the next step of super-resolution. The visual inspection showed that ESPCNN is able to capture crack propagation and the complex geometry of even minor cracks. The proposed framework is expected to help the highway agencies in accurately performing distress detection and assist in efficient asset management practices.
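
The efficient sub-pixel convolutional network named in the abstract relies on a rearrangement step, often called pixel shuffle or depth-to-space, that turns r*r low-resolution channels into one channel that is r times larger in each dimension. The sketch below shows only that rearrangement on nested lists; the convolution layers are omitted, and the channel ordering used here is an assumption of this example rather than something stated in the paper.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Output pixel (y*r + i, xx*r + j) of channel c is read from input channel
    c*r*r + i*r + j at position (y, xx).
    """
    channels_in = len(x)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    channels_out = channels_in // (r * r)
    height, width = len(x[0]), len(x[0][0])
    out = [[[0] * (width * r) for _ in range(height * r)]
           for _ in range(channels_out)]
    for c in range(channels_out):
        for i in range(r):
            for j in range(r):
                source = x[c * r * r + i * r + j]
                for y in range(height):
                    for xx in range(width):
                        out[c][y * r + i][xx * r + j] = source[y][xx]
    return out


# Four 1x1 channels become a single 2x2 channel when r = 2.
low_res = [[[1]], [[2]], [[3]], [[4]]]
high_res = pixel_shuffle(low_res, 2)
print(high_res)
```

Because the rearrangement has no learnable parameters, the network can do all of its convolutions at low resolution, which is one reason this family of models is considered lightweight.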

[12] Action Spotting and Precise Event Detection in Sports: Datasets, Methods, and Challenges

Hao Xu,Arbind Agrahari Baniya,Sam Well,Mohamed Reda Bouadjenek,Richard Dazeley,Sunil Aryal

Main category: cs.CV

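TL;DR: This survey reviews video event detection in sports analytics, focusing on three tasks (TAL, AS, and PES), summarizes existing datasets, evaluation metrics, and state-of-the-art techniques, and discusses future research directions.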

Details

Motivation: Video event detection is essential to sports analytics: it automates the identification of key moments and improves performance analysis, viewer engagement, and broadcast efficiency. Method: The survey reviews deep learning-based methods for TAL, AS, and PES, including multi-modal approaches, self-supervised learning, and knowledge distillation. Result: It summarizes the strengths and limitations of existing datasets and evaluation metrics and analyzes the performance and applications of state-of-the-art techniques. Conclusion: It outlines future research directions toward more generalized, efficient, and robust event detection frameworks. Abstract: Video event detection has become an essential component of sports analytics, enabling automated identification of key moments and enhancing performance analysis, viewer engagement, and broadcast efficiency. Recent advancements in deep learning, particularly Convolutional Neural Networks (CNNs) and Transformers, have significantly improved accuracy and efficiency in Temporal Action Localization (TAL), Action Spotting (AS), and Precise Event Spotting (PES). This survey provides a comprehensive overview of these three key tasks, emphasizing their differences, applications, and the evolution of methodological approaches. We thoroughly review and categorize existing datasets and evaluation metrics specifically tailored for sports contexts, highlighting the strengths and limitations of each. Furthermore, we analyze state-of-the-art techniques, including multi-modal approaches that integrate audio and visual information, methods utilizing self-supervised learning and knowledge distillation, and approaches aimed at generalizing across multiple sports. Finally, we discuss critical open challenges and outline promising research directions toward developing more generalized, efficient, and robust event detection frameworks applicable to diverse sports. This survey serves as a foundation for future research on efficient, generalizable, and multi-modal sports event detection.

[13] The Eye as a Window to Systemic Health: A Survey of Retinal Imaging from Classical Techniques to Oculomics

Inamullah,Imran Razzak,Shoaib Jameel

Main category: cs.CV

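TL;DR: Retinal imaging combined with AI analysis provides non-invasive markers for ocular and systemic diseases, driving the emerging field of oculomics.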

Details

Motivation: To use the unique vascularized structure of the retina as a window into human health, enabling early disease detection and intervention. Method: The survey traces the evolution of retinal imaging techniques and discusses the need to integrate AI-driven analysis and the shift from classical techniques to oculomics. Result: It highlights the potential of oculomics for ocular and systemic diseases and identifies research gaps and future directions. Conclusion: By combining AI with retinal imaging, oculomics opens new avenues for disease monitoring and intervention, but technical hurdles remain. Abstract: The unique vascularized anatomy of the human eye, encased in the retina, provides an opportunity to act as a window for human health. The retinal structure assists in assessing the early detection, monitoring of disease progression and intervention for both ocular and non-ocular diseases. The advancement in imaging technology leveraging Artificial Intelligence has seized this opportunity to bridge the gap between the eye and human health. This track paves the way for unveiling systemic health insight from the ocular system and surrogating non-invasive markers for timely intervention and identification. The new frontiers of oculomics in ophthalmology cover both ocular and systemic diseases and are attracting growing attention. In this survey paper, we explore the evolution of retinal imaging techniques, the dire need for the integration of AI-driven analysis, and the shift of retinal imaging from classical techniques to oculomics. We also discuss some hurdles that may be faced in the progression of oculomics, highlighting the research gaps and future directions.

[14] FoodTrack: Estimating Handheld Food Portions with Egocentric Video

Ervin Wang,Yuhao Chen

Main category: cs.CV

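TL;DR: The FoodTrack framework measures the volume of handheld food directly from egocentric video, overcoming limitations of traditional methods and improving the accuracy of food consumption tracking.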

Details

Motivation: Traditional food consumption tracking relies on specific camera angles, non-occluded images, or gesture recognition, and makes fixed assumptions about bite size, which limits accuracy and adaptability. Method: FoodTrack uses egocentric video to measure food volume directly, without relying on intake gestures or fixed bite-size assumptions. Result: It achieves an absolute percentage error of approximately 7.01% on a handheld food object, compared with 16.40% in the best case of a previous approach. Conclusion: FoodTrack provides a more accurate and adaptable solution for tracking food consumption. Abstract: Accurately tracking food consumption is crucial for nutrition and health monitoring. Traditional approaches typically require specific camera angles, non-occluded images, or rely on gesture recognition to estimate intake, making assumptions about bite size rather than directly measuring food volume. We propose the FoodTrack framework for tracking and measuring the volume of hand-held food items using egocentric video which is robust to hand occlusions and flexible with varying camera and object poses. FoodTrack estimates food volume directly, without relying on intake gestures or fixed assumptions about bite size, offering a more accurate and adaptable solution for tracking food consumption. We achieve an absolute percentage loss of approximately 7.01% on a handheld food object, improving upon a previous approach that achieved a 16.40% mean absolute percentage error in its best case, under less flexible conditions.
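
The error figures quoted above are percentage errors on estimated volume. A minimal sketch of the mean absolute percentage error follows; the volumes are made up for illustration and are not data from the paper.

```python
def mean_absolute_percentage_error(true_values, estimates):
    """MAPE in percent: mean of |estimate - true| / |true| over all items."""
    if len(true_values) != len(estimates) or not true_values:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(t == 0 for t in true_values):
        raise ValueError("true values must be non-zero")
    total = sum(abs(e - t) / abs(t) for t, e in zip(true_values, estimates))
    return 100.0 * total / len(true_values)


# Hypothetical ground-truth and estimated food volumes in millilitres.
true_volumes = [100.0, 200.0]
estimated_volumes = [93.0, 214.0]
mape = mean_absolute_percentage_error(true_volumes, estimated_volumes)
print(round(mape, 2))
```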

Feng Xiao,Hongbin Xu,Guocan Zhao,Wenxiong Kang

Main category: cs.CV

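TL;DR: The paper proposes a 2D-assisted 3D visual grounding framework that builds semantic-spatial scene graphs and introduces a dual-branch visual encoder to improve multi-modal object encoding and relationship perception.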

Details

Motivation: In 3D visual grounding, the significant gap between the 3D and language modalities makes it challenging to distinguish multiple similar objects through spatial relationships. Existing methods ignore the perception of referred objects. Method: A 2D-assisted 3D visual grounding framework constructs semantic-spatial scene graphs, uses a dual-branch visual encoder in which 2D pre-trained attributes guide multi-modal object encoding, and performs cross-modal interaction through graph attention. Result: It outperforms existing methods on popular benchmarks, especially when multiple similar distractors are present. Conclusion: Enhanced object representation and iterative relational learning enable effective alignment between 3D vision and referential descriptions. Abstract: 3D visual grounding aims to localize the unique target described by natural language in 3D scenes. The significant gap between 3D and language modalities makes it a notable challenge to distinguish multiple similar objects through the described spatial relationships. Current methods attempt to achieve cross-modal understanding in complex scenes via a target-centered learning mechanism, ignoring the perception of referred objects. We propose a novel 2D-assisted 3D visual grounding framework that constructs semantic-spatial scene graphs with referred object discrimination for relationship perception. The framework incorporates a dual-branch visual encoder that utilizes 2D pre-trained attributes to guide the multi-modal object encoding. Furthermore, our cross-modal interaction module uses graph attention to facilitate relationship-oriented information fusion. The enhanced object representation and iterative relational learning enable the model to establish effective alignment between 3D vision and referential descriptions. Experimental results on the popular benchmarks demonstrate our superior performance compared to state-of-the-art methods, especially in addressing the challenges of multiple similar distractors.

[16] SEVA: Leveraging Single-Step Ensemble of Vicinal Augmentations for Test-Time Adaptation

Zixuan Hu,Yichun Hu,Ling-Yu Duan

Main category: cs.CV

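TL;DR: The paper proposes SEVA, a test-time adaptation method that uses a single-step ensemble of vicinal augmentations to exploit data augmentation without increasing the computational burden, improving robustness to distribution shifts.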

Details

Motivation: Existing test-time adaptation methods typically rely on entropy-based unsupervised training, but a single round of training cannot adequately utilize reliable samples, which limits adaptation efficiency. Method: SEVA develops a theoretical framework to explore how multiple augmentations affect model adaptation, proposes optimizing an upper bound of the entropy loss to integrate the effect of multiple rounds of augmentation training into a single step, and combines this with a sample selection mechanism. Result: Experiments across various network architectures and testing scenarios show strong performance and broad adaptability. Conclusion: With an efficient loss and a sample selection strategy, SEVA boosts the potential of reliable samples while meeting the real-time requirements of test-time adaptation. Abstract: Test-time adaptation (TTA) aims to enhance model robustness against distribution shifts through rapid model adaptation during inference. While existing TTA methods often rely on entropy-based unsupervised training and achieve promising results, the common practice of a single round of entropy training is typically unable to adequately utilize reliable samples, hindering adaptation efficiency. In this paper, we discover that augmentation strategies can effectively unleash the potential of reliable samples, but the rapidly growing computational cost impedes their real-time application. To address this limitation, we propose a novel TTA approach named Single-step Ensemble of Vicinal Augmentations (SEVA), which can take advantage of data augmentations without increasing the computational burden. Specifically, instead of explicitly utilizing the augmentation strategy to generate new data, SEVA develops a theoretical framework to explore the impacts of multiple augmentations on model adaptation and proposes to optimize an upper bound of the entropy loss to integrate the effects of multiple rounds of augmentation training into a single step. Furthermore, we discover and verify that using the upper bound as the loss is more conducive to the selection mechanism, as it can effectively filter out harmful samples that confuse the model. Combining these two key advantages, the proposed efficient loss and a complementary selection strategy can simultaneously boost the potential of reliable samples and meet the stringent time requirements of TTA. The comprehensive experiments on various network architectures across challenging testing scenarios demonstrate impressive performances and the broad adaptability of SEVA. The code will be publicly available.
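
To make the entropy-based objective mentioned in the abstract concrete, the sketch below computes the prediction entropy of softmax outputs and keeps only low-entropy samples as "reliable". It illustrates the plain entropy objective with a threshold filter, which is an assumption of this example; it does not implement SEVA's upper-bound loss, and the threshold value and logits are invented.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_reliable(batch_logits, threshold):
    """Indices of samples whose prediction entropy is below the threshold."""
    return [
        index
        for index, logits in enumerate(batch_logits)
        if entropy(softmax(logits)) < threshold
    ]


# Hypothetical 3-class logits: a confident sample and a fully uncertain one.
batch = [[5.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
# Assumed threshold: 40% of the maximum possible entropy, ln(num_classes).
threshold = 0.4 * math.log(3)
reliable = select_reliable(batch, threshold)
print(reliable)
```

In an entropy-minimization setup, the adaptation loss would then be the mean entropy over the selected samples, back-propagated through the model parameters being adapted.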

[17] SMMT: Siamese Motion Mamba with Self-attention for Thermal Infrared Target Tracking

Shang Zhang,Huanbin Zhang,Dali Feng,Yujie Cui,Ruoyan Xiong,Cen He

Main category: cs.CV

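TL;DR: The paper proposes the Siamese Motion Mamba Tracker (SMMT), which combines a bidirectional state-space model with self-attention to address occlusion, motion blur, and background clutter in thermal infrared target tracking.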

Details

Motivation: Thermal infrared target tracking often degrades because of target occlusion, motion blur, and background clutter, and an efficient solution is needed. Method: A Motion Mamba module is introduced into the Siamese architecture, using bidirectional modeling and self-attention to extract motion features and recover edge details; a parameter-sharing strategy reduces computational redundancy; a motion edge-aware regression loss improves tracking accuracy. Result: SMMT achieves superior performance on four thermal infrared tracking benchmarks. Conclusion: Through its design and optimization strategies, SMMT improves the accuracy and robustness of thermal infrared target tracking. Abstract: Thermal infrared (TIR) object tracking often suffers from challenges such as target occlusion, motion blur, and background clutter, which significantly degrade the performance of trackers. To address these issues, this paper proposes a novel Siamese Motion Mamba Tracker (SMMT), which integrates a bidirectional state-space model and a self-attention mechanism. Specifically, we introduce the Motion Mamba module into the Siamese architecture to extract motion features and recover overlooked edge details using bidirectional modeling and self-attention. We propose a Siamese parameter-sharing strategy that allows certain convolutional layers to share weights. This approach reduces computational redundancy while preserving strong feature representation. In addition, we design a motion edge-aware regression loss to improve tracking accuracy, especially for motion-blurred targets. Extensive experiments are conducted on four TIR tracking benchmarks, including LSOTB-TIR, PTB-TIR, VOT-TIR2015, and VOT-TIR2017. The results show that SMMT achieves superior performance in TIR target tracking.

[18] MAISY: Motion-Aware Image SYnthesis for Medical Image Motion Correction

Andrew Zhang,Hao Wang,Shuchang Ye,Michael Fulham,Jinman Kim

Main category: cs.CV

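TL;DR: The paper proposes MAISY, a method that dynamically learns motion features and introduces the VS-SSIM loss, significantly improving the correction of motion artifacts in medical images.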

Details

Motivation: Existing GAN-based methods for motion artifact correction in medical images focus mainly on global structural features and overlook localized, critical pathological information, and the SSIM loss handles varying pixel intensities poorly. Method: MAISY uses the Segment Anything Model (SAM) to dynamically learn motion features and introduces the VS-SSIM loss, which adaptively emphasizes high-variance regions to preserve anatomical details. Result: On chest and head CT datasets, MAISY improves PSNR by 40%, SSIM by 10%, and Dice by 16% over existing methods. Conclusion: By dynamically learning motion features and refining the loss function, MAISY improves motion artifact correction in medical images and shows potential for clinical use. Abstract: Patient motion during medical image acquisition causes blurring, ghosting, and distorts organs, which makes image interpretation challenging. Current state-of-the-art algorithms using Generative Adversarial Network (GAN)-based methods with their ability to learn the mappings between corrupted images and their ground truth via Structural Similarity Index Measure (SSIM) loss effectively generate motion-free images. However, we identified the following limitations: (i) they mainly focus on global structural characteristics and therefore overlook localized features that often carry critical pathological information, and (ii) the SSIM loss function struggles to handle images with varying pixel intensities, luminance factors, and variance. In this study, we propose Motion-Aware Image SYnthesis (MAISY), which first characterizes motion and then uses it for correction by: (a) leveraging the foundation model Segment Anything Model (SAM) to dynamically learn spatial patterns along anatomical boundaries where motion artifacts are most pronounced, and (b) introducing the Variance-Selective SSIM (VS-SSIM) loss, which adaptively emphasizes spatial regions with high pixel variance to preserve essential anatomical details during artifact correction. Experiments on chest and head CT datasets demonstrate that our model outperformed the state-of-the-art counterparts, with Peak Signal-to-Noise Ratio (PSNR) increasing by 40%, SSIM by 10%, and Dice by 16%.
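
Of the metrics reported above, PSNR is the simplest to state: ten times the base-10 logarithm of the squared peak value divided by the mean squared error. The sketch below computes it for flattened images; the 8-bit peak value of 255 and the pixel values are assumptions for illustration, since CT data is often stored with a different intensity range.

```python
import math


def psnr(reference, corrected, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    if len(reference) != len(corrected) or not reference:
        raise ValueError("images must be non-empty and of equal size")
    mse = sum((r - c) ** 2 for r, c in zip(reference, corrected)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical two-pixel images; one pixel differs by 10 intensity levels.
reference = [0.0, 255.0]
corrected = [0.0, 245.0]
score = psnr(reference, corrected)
print(round(score, 2))
```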

[19] One2Any: One-Reference 6D Pose Estimation for Any Object

Mengya Liu,Siyuan Li,Ajad Chhatkuli,Prune Truong,Luc Van Gool,Federico Tombari

Main category: cs.CV

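TL;DR: One2Any proposes a new method that estimates 6D object pose from only a single reference and a single query RGB-D image, with no need for a 3D model, multi-view data, or category constraints.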

Details

Motivation: Existing 6D pose estimation methods depend on complete 3D models, multi-view images, or category-specific training, which makes it hard to generalize to novel objects. Method: Using an encoding-decoding framework, the method builds a Reference Object Pose Embedding (ROPE) from a single reference view and uses a U-Net decoding module to predict the Reference Object Coordinate (ROC) for new views. Result: The method performs well on multiple benchmark datasets, generalizes strongly, and reaches state-of-the-art accuracy and robustness, even outperforming methods that rely on multi-view or CAD inputs. Conclusion: One2Any offers an efficient, scalable 6D pose estimation solution that works for novel objects at low computational cost. Abstract: 6D object pose estimation remains challenging for many applications due to dependencies on complete 3D models, multi-view images, or training limited to specific object categories. These requirements make it difficult to generalize to novel objects for which neither 3D models nor multi-view images may be available. To address this, we propose a novel method, One2Any, that estimates the relative 6-degrees of freedom (DOF) object pose using only a single reference-single query RGB-D image, without prior knowledge of its 3D model, multi-view data, or category constraints. We treat object pose estimation as an encoding-decoding process. First, we obtain a comprehensive Reference Object Pose Embedding (ROPE) that encodes an object's shape, orientation, and texture from a single reference view. Using this embedding, a U-Net-based pose decoding module produces Reference Object Coordinate (ROC) for new views, enabling fast and accurate pose estimation. This simple encoding-decoding framework allows our model to be trained on any pair-wise pose data, enabling large-scale training and demonstrating great scalability. Experiments on multiple benchmark datasets demonstrate that our model generalizes well to novel objects, achieving state-of-the-art accuracy and robustness, even rivaling methods that require multi-view or CAD inputs, at a fraction of the compute.
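
The "relative 6-DOF pose" between a reference view and a query view can be illustrated with plain rigid-transform algebra. The sketch below is not the One2Any network; it only shows what quantity such a method would output, assuming each view's pose maps object coordinates to camera coordinates as X_cam = R X_obj + t. All function names and numbers are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def rot_z(deg):
    """Rotation about the z axis by `deg` degrees."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def apply_pose(R, t, p):
    """Return R @ p + t for a 3D point p."""
    return [sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3)]

def relative_pose(R_ref, t_ref, R_qry, t_qry):
    """Return (R_rel, t_rel) such that X_qry = R_rel @ X_ref + t_rel."""
    R_rel = matmul(R_qry, transpose(R_ref))  # inverse of a rotation is its transpose
    rotated = apply_pose(R_rel, [0.0, 0.0, 0.0], t_ref)
    t_rel = [t_qry[i] - rotated[i] for i in range(3)]
    return R_rel, t_rel

# Hypothetical object poses in the reference and query camera frames.
R_ref, t_ref = rot_z(30), [0.1, 0.0, 0.5]
R_qry, t_qry = rot_z(75), [0.0, 0.2, 0.6]
R_rel, t_rel = relative_pose(R_ref, t_ref, R_qry, t_qry)

# Consistency check on one object point seen in both views.
p_obj = [0.3, -0.2, 0.1]
p_ref = apply_pose(R_ref, t_ref, p_obj)
p_qry = apply_pose(R_qry, t_qry, p_obj)
p_qry_from_ref = apply_pose(R_rel, t_rel, p_ref)
```

Because the relative transform cancels the object frame, it can be defined without a canonical 3D model, which is consistent with the single-reference setting described in the abstract.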

[20] GAPrompt: Geometry-Aware Point Cloud Prompt for 3D Vision Model

Zixiang Ai,Zichen Liu,Yuanhang Lei,Zhenyu Cui,Xu Zou,Jiahuan Zhou

Main category: cs.CV

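TL;DR: Proposes GAPrompt, a geometry-aware point cloud prompting method that uses geometric cues to improve the adaptability of 3D vision models; it clearly outperforms existing parameter-efficient fine-tuning methods while using only 2.19% of trainable parameters.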

Details

Motivation: Pre-trained 3D vision models perform well on point cloud data, but full fine-tuning is costly in computation and storage, and existing parameter-efficient fine-tuning methods underperform because they struggle to capture geometric information. Method: A Point Prompt and a Point Shift Prompter capture fine-grained geometric details and global shape information, respectively, and a Prompt Propagation mechanism injects the shape information into the feature extraction process. Result: GAPrompt significantly outperforms existing methods on multiple benchmarks and approaches full fine-tuning performance with only a small number of trainable parameters. Conclusion: Geometry-aware prompting effectively improves the adaptability of 3D vision models and offers a new direction for parameter-efficient fine-tuning. Abstract: Pre-trained 3D vision models have gained significant attention for their promising performance on point cloud data. However, fully fine-tuning these models for downstream tasks is computationally expensive and storage-intensive. Existing parameter-efficient fine-tuning (PEFT) approaches, which focus primarily on input token prompting, struggle to achieve competitive performance due to their limited ability to capture the geometric information inherent in point clouds. To address this challenge, we propose a novel Geometry-Aware Point Cloud Prompt (GAPrompt) that leverages geometric cues to enhance the adaptability of 3D vision models. First, we introduce a Point Prompt that serves as an auxiliary input alongside the original point cloud, explicitly guiding the model to capture fine-grained geometric details. Additionally, we present a Point Shift Prompter designed to extract global shape information from the point cloud, enabling instance-specific geometric adjustments at the input level. Moreover, our proposed Prompt Propagation mechanism incorporates the shape information into the model's feature extraction process, further strengthening its ability to capture essential geometric characteristics. Extensive experiments demonstrate that GAPrompt significantly outperforms state-of-the-art PEFT methods and achieves competitive results compared to full fine-tuning on various benchmarks, while utilizing only 2.19% of trainable parameters. Our code is available at https://github.com/zhoujiahuan1991/ICML2025-VGP.

[21] Vision Graph Prompting via Semantic Low-Rank Decomposition

Zixiang Ai,Zichen Liu,Jiahuan Zhou

Main category: cs.CV

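TL;DR: ViG represents images as graphs, but existing prompting methods ignore the topological relationships in that graph structure. This paper proposes the VGP framework, which exploits low-rank properties to improve ViG's performance on downstream tasks.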

Details

Motivation: Existing prompting methods mainly target Transformer models and cannot fully exploit the topological relationships of graph structures, which limits their ability to model complex semantics. Method: Proposes Vision Graph Prompting (VGP), which is based on a low-rank decomposition of semantic features combined with prompts on the vision graph topology, capturing both global structure and fine-grained semantic dependencies. Result: Experiments show that VGP significantly improves ViG's transfer performance on downstream tasks, approaching full fine-tuning while remaining parameter-efficient. Conclusion: VGP provides an efficient prompting method for graph-structured vision models that balances performance and parameter efficiency. Abstract: Vision GNN (ViG) demonstrates superior performance by representing images as graph structures, providing a more natural way to capture irregular semantic patterns beyond traditional grid or sequence-based representations. To efficiently adapt ViG to downstream tasks, parameter-efficient fine-tuning techniques like visual prompting become increasingly essential. However, existing prompting methods are primarily designed for Transformer-based models, neglecting the rich topological relationships among nodes and edges in graph-based representations, limiting their capacity to model complex semantics. In this paper, we propose Vision Graph Prompting (VGP), a novel framework tailored for vision graph structures. Our core insight reveals that semantically connected components in the graph exhibit low-rank properties. Building on this observation, we introduce a semantic low-rank prompting method that decomposes low-rank semantic features and integrates them with prompts on vision graph topologies, capturing both global structural patterns and fine-grained semantic dependencies. Extensive experiments demonstrate our method significantly improves ViG's transfer performance on diverse downstream tasks, achieving results comparable to full fine-tuning while maintaining parameter efficiency. Our code is available at https://github.com/zhoujiahuan1991/ICML2025-VGP.

Lixing Niu,Jiapeng Li,Xingping Yu,Shu Wang,Ruining Feng,Bo Wu,Ping Wei,Yisen Wang,Lifeng Fan

Main category: cs.CV

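TL;DR: The paper introduces R^3-VQA, a high-quality video dataset for evaluating social reasoning in complex social scenarios, and finds that current large vision-language models remain far from human-level performance on this task.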

Details

Motivation: Existing social reasoning tasks and datasets are too simple to reflect the complexity of real social interactions, so a more comprehensive dataset and task set is needed to evaluate model capabilities. Method: The authors build the R^3-VQA dataset, which contains finely annotated social events, mental states, and social causal chains, and define three tasks: social event understanding, mental state estimation, and social causal reasoning. Result: Experiments show that current large vision-language models perform poorly on complex social reasoning tasks, but Theory of Mind prompting can improve their performance. Conclusion: R^3-VQA provides an important benchmark for evaluating and improving social reasoning, and further research is needed to close the gap between models and humans. Abstract: "Read the room" is a significant social reasoning capability in human daily life. Humans can infer others' mental states from subtle social cues. Previous social reasoning tasks and datasets lack complexity (e.g., simple scenes, basic interactions, incomplete mental state variables, single-step reasoning, etc.) and fall far short of the challenges present in real-life social interactions. In this paper, we contribute a valuable, high-quality, and comprehensive video dataset named R^3-VQA with precise and fine-grained annotations of social events and mental states (i.e., belief, intent, desire, and emotion) as well as corresponding social causal chains in complex social scenarios. Moreover, we include human-annotated and model-generated QAs. Our task R^3-VQA includes three aspects: Social Event Understanding, Mental State Estimation, and Social Causal Reasoning. As a benchmark, we comprehensively evaluate the social reasoning capabilities and consistencies of current state-of-the-art large vision-language models (LVLMs). Comprehensive experiments show that (i) LVLMs are still far from human-level consistent social reasoning in complex social scenarios; (ii) Theory of Mind (ToM) prompting can help LVLMs perform better on social reasoning tasks. We provide some of our dataset and code in supplementary material and will release our full dataset and code upon acceptance.

[23] Learning from Similarity Proportion Loss for Classifying Skeletal Muscle Recovery Stages

Yu Yamaoka or Weng Ian Chan,Shigeto Seno,Soichiro Fukada,Hideo Matsuda

Main category: cs.CV

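TL;DR: The paper proposes OSLSP, a weakly supervised learning method for automated assessment of muscle tissue regeneration, addressing the inability of existing methods to adapt to muscle tissue features and their neglect of class-order information.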

Details

Motivation: Conventional assessment of muscle tissue regeneration relies on manual visual inspection, which is neither quantitative nor objective. Existing weakly supervised methods based on Learning from Label Proportions (LLP) cannot adapt to muscle tissue features and ignore the ordinal information of the classes. Method: Proposes OSLSP, which uses a similarity proportion loss and a class proportion attention mechanism to update the feature extractor while preserving class-order information. Result: The OSLSP model outperforms large-scale pre-trained and fine-tuned models on classifying skeletal muscle recovery stages. Conclusion: OSLSP offers an automated, quantitative, order-preserving solution for assessing muscle tissue regeneration. Abstract: Evaluating the regeneration process of damaged muscle tissue is a fundamental analysis in muscle research to measure experimental effect sizes and uncover mechanisms behind muscle weakness due to aging and disease. The conventional approach to assessing muscle tissue regeneration involves whole-slide imaging and expert visual inspection of the recovery stages based on the morphological information of cells and fibers. There is a need to replace these tasks with automated methods incorporating machine learning techniques to ensure a quantitative and objective analysis. Given the limited availability of fully labeled data, a possible approach is Learning from Label Proportions (LLP), a weakly supervised learning method using class label proportions. However, current LLP methods have two limitations: (1) they cannot adapt the feature extractor for muscle tissues, and (2) they treat the classes representing recovery stages and cell morphological changes as nominal, resulting in the loss of ordinal information. To address these issues, we propose Ordinal Scale Learning from Similarity Proportion (OSLSP), which uses a similarity proportion loss derived from two bag combinations. OSLSP can update the feature extractor by using class proportion attention to the ordinal scale of the class. Our model with OSLSP outperforms large-scale pre-trained and fine-tuning models in classification tasks of skeletal muscle recovery stages.

[24] DOTA: Deformable Optimized Transformer Architecture for End-to-End Text Recognition with Retrieval-Augmented Generation

Naphat Nithisopa,Teerapong Panboonyuen

Main category: cs.CV

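TL;DR: This paper proposes a new end-to-end text recognition framework that combines ResNet and Vision Transformer backbones and improves OCR performance with Deformable Convolutions, Retrieval-Augmented Generation, and CRF, reporting state-of-the-art results on multiple benchmark datasets.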

Details

Motivation: Text recognition in natural images is an important but challenging task with broad applications in computer vision and natural language processing. Method: The framework uses ResNet and Vision Transformer as backbones and combines Deformable Convolutions, adaptive dropout, and CRF to improve feature representation and sequence modeling. Result: Experiments on six benchmark datasets give an average accuracy of 77.77%, with particularly strong results on some datasets (for example, 97.32% on IC13). Conclusion: The method sets a new state of the art for text recognition and demonstrates robustness and broad applicability. Abstract: Text recognition in natural images remains a challenging yet essential task, with broad applications spanning computer vision and natural language processing. This paper introduces a novel end-to-end framework that combines ResNet and Vision Transformer backbones with advanced methodologies, including Deformable Convolutions, Retrieval-Augmented Generation, and Conditional Random Fields (CRF). These innovations collectively enhance feature representation and improve Optical Character Recognition (OCR) performance. Specifically, the framework substitutes standard convolution layers in the third and fourth blocks with Deformable Convolutions, leverages adaptive dropout for regularization, and incorporates CRF for more refined sequence modeling. Extensive experiments conducted on six benchmark datasets (IC13, IC15, SVT, IIIT5K, SVTP, and CUTE80) validate the proposed method's efficacy, achieving notable accuracies: 97.32% on IC13, 58.26% on IC15, 88.10% on SVT, 74.13% on IIIT5K, 82.17% on SVTP, and 66.67% on CUTE80, resulting in an average accuracy of 77.77%. These results establish a new state-of-the-art for text recognition, demonstrating the robustness of the approach across diverse and challenging datasets.
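
The reported average can be checked directly from the six per-dataset accuracies quoted in the abstract. The short sketch below assumes the average is an unweighted mean over datasets (the abstract does not say whether it is weighted by test-set size); the variable names are illustrative.

```python
# Per-dataset accuracies (%) as quoted in the abstract.
accuracies = {
    "IC13": 97.32,
    "IC15": 58.26,
    "SVT": 88.10,
    "IIIT5K": 74.13,
    "SVTP": 82.17,
    "CUTE80": 66.67,
}

# Unweighted mean over the six benchmarks (an assumption about how
# the reported average was computed).
mean_accuracy = sum(accuracies.values()) / len(accuracies)

# Gap between the best and worst benchmark, in percentage points.
spread = max(accuracies.values()) - min(accuracies.values())

print(f"mean accuracy: {mean_accuracy:.3f}%")
print(f"best-to-worst spread: {spread:.2f} points")
```

The unweighted mean comes to about 77.775%, which agrees with the reported 77.77% up to rounding. The spread of roughly 39 points between IC13 and IC15 shows that the average hides large differences across datasets.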

[25] S3D: Sketch-Driven 3D Model Generation

Hail Song,Wonsik Shin,Naeun Lee,Soomin Chung,Nojun Kwak,Woontack Woo

Main category: cs.CV

TL;DR: The S3D framework converts 2D sketches into high-quality 3D models using a U-Net architecture and a style-alignment loss.

Details

Motivation: The ambiguity and sparsity of 2D sketches make 3D modeling from them difficult. Method: A U-Net encoder-decoder generates face segmentation masks, combined with a style-alignment loss and dataset augmentation. Result: The framework generates high-quality 3D models that can be rendered from multiple views. Conclusion: S3D performs well on sketch-to-3D conversion, and the code is open source. Abstract: Generating high-quality 3D models from 2D sketches is a challenging task due to the inherent ambiguity and sparsity of sketch data. In this paper, we present S3D, a novel framework that converts simple hand-drawn sketches into detailed 3D models. Our method utilizes a U-Net-based encoder-decoder architecture to convert sketches into face segmentation masks, which are then used to generate a 3D representation that can be rendered from novel views. To ensure robust consistency between the sketch domain and the 3D output, we introduce a novel style-alignment loss that aligns the U-Net bottleneck features with the initial encoder outputs of the 3D generation module, significantly enhancing reconstruction fidelity. To further enhance the network's robustness, we apply augmentation techniques to the sketch dataset. This streamlined framework demonstrates the effectiveness of S3D in generating high-quality 3D models from sketch inputs. The source code for this project is publicly available at https://github.com/hailsong/S3D.

[26] VideoPath-LLaVA: Pathology Diagnostic Reasoning Through Video Instruction Tuning

Trinh T. L. Vuong,Jin Tae Kwak

Main category: cs.CV

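TL;DR: VideoPath-LLaVA is the first multimodal model to integrate single images, automatically keyframe-extracted video clips, and manually segmented videos, mimicking the pathologist's diagnostic process to produce detailed pathological descriptions and a final diagnosis.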

Details

Motivation: To improve AI performance in pathology video analysis by integrating multiple image scenarios and mimicking the natural diagnostic process of pathologists. Method: The model uses the VideoPath-Instruct dataset (4278 video and diagnosis-specific chain-of-thought instruction pairs) and transfers knowledge from single-image instruction datasets, first training on weakly annotated keyframe-extracted clips and then fine-tuning on manually segmented videos. Result: VideoPath-LLaVA sets a new benchmark in pathology video analysis and lays a foundation for future AI systems that support clinical decision-making. Conclusion: By combining visual and diagnostic reasoning, the model shows promise for pathology; the code and data are open source. Abstract: We present VideoPath-LLaVA, the first large multimodal model (LMM) in computational pathology that integrates three distinct image scenarios (single patch images, automatically keyframe-extracted clips, and manually segmented video pathology images) to mimic the natural diagnostic process of pathologists. By generating detailed histological descriptions and culminating in a definitive sign-out diagnosis, VideoPath-LLaVA bridges visual narratives with diagnostic reasoning. Central to our approach is the VideoPath-Instruct dataset, comprising 4278 video and diagnosis-specific chain-of-thought instructional pairs sourced from educational histopathology videos on YouTube. Although high-quality data is critical for enhancing diagnostic reasoning, its creation is time-intensive and limited in volume. To overcome this challenge, we transfer knowledge from existing single-image instruction datasets to train on weakly annotated, keyframe-extracted clips, followed by fine-tuning on manually segmented videos. VideoPath-LLaVA establishes a new benchmark in pathology video analysis and offers a promising foundation for future AI systems that support clinical decision-making through integrated visual and diagnostic reasoning. Our code, data, and model are publicly available at https://github.com/trinhvg/VideoPath-LLaVA.

[27] SToLa: Self-Adaptive Touch-Language Framework with Tactile Commonsense Reasoning in Open-Ended Scenarios

Ning Cheng,Jinan Xu,Jialing Chen,Wenjuan Han

Main category: cs.CV

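TL;DR: The paper proposes the SToLa framework to address the challenges of fusing touch and language modalities, using a Mixture of Experts architecture for dynamic processing, and builds a comprehensive tactile commonsense reasoning dataset.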

Details

Motivation: To address modality discrepancy and data scarcity for tactile sensing in intelligent systems, in order to support commonsense reasoning about the open-ended physical world. Method: Introduces the SToLa framework, which uses a Mixture of Experts to dynamically unify tactile and language modalities, and builds a tactile commonsense reasoning dataset. Result: SToLa performs competitively on the PhysiCLeAR benchmark and the self-constructed dataset, supporting the effectiveness of its architecture. Conclusion: SToLa addresses the challenges of touch-language fusion and provides an effective solution for tactile commonsense reasoning in open-ended scenarios. Abstract: This paper explores the challenges of integrating tactile sensing into intelligent systems for multimodal reasoning, particularly in enabling commonsense reasoning about the open-ended physical world. We identify two key challenges: modality discrepancy, where existing large touch-language models often treat touch as a mere sub-modality of language, and open-ended tactile data scarcity, where current datasets lack the diversity, open-endedness, and complexity needed for reasoning. To overcome these challenges, we introduce SToLa, a Self-Adaptive Touch-Language framework. SToLa utilizes Mixture of Experts (MoE) to dynamically process, unify, and manage tactile and language modalities, capturing their unique characteristics. Crucially, we also present a comprehensive tactile commonsense reasoning dataset and benchmark featuring free-form questions and responses, 8 physical properties, 4 interactive characteristics, and diverse commonsense knowledge. Experiments show SToLa exhibits competitive performance compared to existing models on the PhysiCLeAR benchmark and self-constructed datasets, proving the effectiveness of the Mixture of Experts architecture in multimodal management and the performance advantages for open-scenario tactile commonsense reasoning tasks.

[28] An Enhanced YOLOv8 Model for Real-Time and Accurate Pothole Detection and Measurement

Mustafa Yurdakul,Şakir Tasdemir

Main category: cs.CV

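TL;DR: The paper proposes an improved YOLOv8-based model for pothole detection and analysis of pothole physical characteristics; using an RGB-D image dataset (PothRGBD) and techniques such as Dynamic Snake Convolution, it improves both detection accuracy and real-time suitability.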

Details

Motivation: Potholes cause vehicle damage and traffic accidents, and existing methods rely only on 2D RGB images and cannot accurately analyze the physical characteristics of potholes, so a more precise detection method is needed. Method: RGB-D data are collected with an Intel RealSense D415 depth camera to build the PothRGBD dataset; the YOLOv8n-seg model is improved with Dynamic Snake Convolution (DSConv), the Simple Attention Module (SimAM), and the Gaussian Error Linear Unit (GELU). Result: The improved model raises precision, recall, and mAP@50 by 1.96%, 6.13%, and 2.07% to 93.7%, 90.4%, and 93.8%, respectively, and measures pothole perimeter and depth with high accuracy. Conclusion: The model is lightweight and efficient, suits real-time intelligent transportation solutions, and provides an effective deep learning tool for pothole detection. Abstract: Potholes cause vehicle damage and traffic accidents, creating serious safety and economic problems. Therefore, early and accurate detection of potholes is crucial. Existing detection methods are usually only based on 2D RGB images and cannot accurately analyze the physical characteristics of potholes. In this paper, a publicly available dataset of RGB-D images (PothRGBD) is created and an improved YOLOv8-based model is proposed for both pothole detection and pothole physical features analysis. The Intel RealSense D415 depth camera was used to collect RGB and depth data from the road surfaces, resulting in a PothRGBD dataset of 1000 images. The data was labeled in YOLO format suitable for segmentation. A novel YOLO model is proposed based on the YOLOv8n-seg architecture, which is structurally improved with Dynamic Snake Convolution (DSConv), Simple Attention Module (SimAM) and Gaussian Error Linear Unit (GELU). The proposed model segmented potholes with irregular edge structure more accurately, and performed perimeter and depth measurements on depth maps with high accuracy. The standard YOLOv8n-seg model achieved 91.9% precision, 85.2% recall and 91.9% mAP@50. With the proposed model, the values increased to 93.7%, 90.4% and 93.8% respectively. Thus, an improvement of 1.96% in precision, 6.13% in recall and 2.07% in mAP was achieved. The proposed model performs pothole detection as well as perimeter and depth measurement with high accuracy and is suitable for real-time applications due to its low model complexity. In this way, a lightweight and effective model that can be used in deep learning-based intelligent transportation solutions has been obtained.
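
The improvement figures in this entry read more naturally as relative gains over the baseline than as percentage-point differences. The sketch below recomputes both from the rounded metrics quoted in the abstract; it is only an arithmetic check, and the dictionary and function names are illustrative.

```python
# Metrics (%) quoted in the abstract for the baseline and proposed models.
baseline = {"precision": 91.9, "recall": 85.2, "mAP@50": 91.9}
proposed = {"precision": 93.7, "recall": 90.4, "mAP@50": 93.8}

def gains(before, after):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return absolute, relative

results = {name: gains(baseline[name], proposed[name]) for name in baseline}

for name, (absolute, relative) in results.items():
    print(f"{name}: +{absolute:.1f} points, +{relative:.2f}% relative")
```

From these rounded inputs the relative gains come to roughly 1.96% for precision and 2.07% for mAP@50, matching the abstract, and roughly 6.10% for recall against the reported 6.13%. The small recall difference is presumably due to the authors computing from unrounded values, though the abstract does not say so.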

[29] CM1 -- A Dataset for Evaluating Few-Shot Information Extraction with Large Vision Language Models

Fabian Wolf,Oliver Tüselmann,Arthur Matei,Lukas Hennies,Christoph Rass,Gernot A. Fink

Main category: cs.CV

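TL;DR: The paper presents CM1, a new dataset for evaluating the few-shot capabilities of large vision language models (LVLMs), and compares LVLMs with a conventional full-page extraction model on extracting key information (such as names and birthdates) from handwritten documents.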

Details

Motivation: To address the challenge of automatically extracting key information from handwritten documents, especially when annotated training data are scarce, and to advance the mass digitization of archives. Method: The authors design the CM1 dataset from handwritten forms of the post-World War Two Care and Maintenance program in Europe, define three benchmark tasks, and compare two LVLMs with a conventional full-page extraction model. Result: The conventional full-page model performs very well with sufficient training data, but in the few-shot setting the LVLMs outperform it thanks to their size and pretraining. Conclusion: LVLMs show promise in few-shot scenarios and offer a new approach to information extraction from handwritten documents. Abstract: The automatic extraction of key-value information from handwritten documents is a key challenge in document analysis. A reliable extraction is a prerequisite for the mass digitization efforts of many archives. Large Vision Language Models (LVLM) are a promising technology to tackle this problem, especially in scenarios where little annotated training data is available. In this work, we present a novel dataset specifically designed to evaluate the few-shot capabilities of LVLMs. The CM1 documents are a historic collection of forms with handwritten entries created in Europe to administer the Care and Maintenance program after World War Two. The dataset establishes three benchmarks on extracting name and birthdate information and, furthermore, considers different training set sizes. We provide baseline results for two different LVLMs and compare performances to an established full-page extraction model. While the traditional full-page model achieves highly competitive performances, our experiments show that when only a few training samples are available the considered LVLMs benefit from their size and heavy pretraining and outperform the classical approach.

[30] A Weak Supervision Learning Approach Towards an Equitable Parking Lot Occupancy Estimation

Theophilus Aidoo,Till Koebe,Akansh Maurya,Hewan Shrestha,Ingmar Weber

Main category: cs.CV

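TL;DR: Proposes a weak supervision framework that estimates parking lot occupancy from 3 m resolution satellite imagery, reducing reliance on high-resolution images.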

Details

Motivation: Labeled high-resolution imagery is scarce and expensive, especially in low-income regions, which limits remote sensing applications. Method: A pairwise comparison model is trained with coarse temporal labels, based on the assumption that parking lots of major supermarkets and hardware stores in Germany are full on Saturdays and empty on Sundays. Result: The model achieves an AUC of 0.92 on large parking lots. Conclusion: The approach can scale to urban mobility analysis and provide data to support resource allocation in vulnerable communities. Abstract: The scarcity and high cost of labeled high-resolution imagery have long challenged remote sensing applications, particularly in low-income regions where high-resolution data are scarce. In this study, we propose a weak supervision framework that estimates parking lot occupancy using 3m resolution satellite imagery. By leveraging coarse temporal labels -- based on the assumption that parking lots of major supermarkets and hardware stores in Germany are typically full on Saturdays and empty on Sundays -- we train a pairwise comparison model that achieves an AUC of 0.92 on large parking lots. The proposed approach minimizes the reliance on expensive high-resolution images and holds promise for scalable urban mobility analysis. Moreover, the method can be adapted to assess transit patterns and resource allocation in vulnerable communities, providing a data-driven basis to improve the well-being of those most in need.
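
To make the evaluation metric concrete, the sketch below computes AUC as the fraction of (Saturday, Sunday) image pairs in which the Saturday image receives the higher occupancy score, with ties counted as half. This is the standard pairwise interpretation of AUC, not the authors' training code, and the scores are made-up values for illustration.

```python
def pairwise_auc(pos_scores, neg_scores):
    """AUC as the probability that a positive outranks a negative.

    pos_scores: model scores for images assumed 'full' (e.g. Saturdays).
    neg_scores: model scores for images assumed 'empty' (e.g. Sundays).
    Ties contribute 0.5.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical occupancy scores from some model (not real data).
saturday = [0.9, 0.8, 0.4]
sunday = [0.1, 0.5, 0.3]

auc = pairwise_auc(saturday, sunday)
print(f"AUC = {auc:.3f}")  # 8 of 9 pairs are ordered correctly
```

Under this reading, an AUC of 0.92 means that in about 92% of such pairs the model ranks the image assumed to be full above the one assumed to be empty.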

[31] Bridging Geometry-Coherent Text-to-3D Generation with Multi-View Diffusion Priors and Gaussian Splatting

Feng Yang,Wenliang Qian,Wangmeng Zuo,Hui Li

Main category: cs.CV

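TL;DR: Proposes the Coupled Score Distillation (CSD) framework, which couples multi-view joint distribution priors to resolve the geometric inconsistencies of Score Distillation Sampling (SDS) in text-to-3D generation.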

Details

Motivation: SDS ignores multi-view correlations in text-to-3D generation, leading to geometric inconsistencies and multi-face artifacts. Method: The CSD framework reformulates the optimization as a multi-view joint optimization problem and directly optimizes 3D Gaussian Splatting (3D-GS) to generate geometrically consistent 3D content. Result: Experiments show that the method is competitive in efficiency and generation quality. Conclusion: CSD can effectively generate geometrically consistent 3D content and supports refinement into high-quality meshes. Abstract: Score Distillation Sampling (SDS) leverages pretrained 2D diffusion models to advance text-to-3D generation but neglects multi-view correlations, being prone to geometric inconsistencies and multi-face artifacts in the generated 3D content. In this work, we propose Coupled Score Distillation (CSD), a framework that couples multi-view joint distribution priors to ensure geometrically consistent 3D generation while enabling the stable and direct optimization of 3D Gaussian Splatting. Specifically, by reformulating the optimization as a multi-view joint optimization problem, we derive an optimization rule that effectively couples multi-view priors to guide optimization across different viewpoints while preserving the diversity of generated 3D assets. Additionally, we propose a framework that directly optimizes 3D Gaussian Splatting (3D-GS) with random initialization to generate geometrically consistent 3D content. We further employ a deformable tetrahedral grid, initialized from 3D-GS and refined through CSD, to produce high-quality, refined meshes. Quantitative and qualitative experimental results demonstrate the efficiency and competitive quality of our approach.

[32] Object-Shot Enhanced Grounding Network for Egocentric Video

Yisen Feng,Haoyu Zhang,Meng Liu,Weili Guan,Liqiang Nie

Main category: cs.CV

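TL;DR: OSGNet is a new method for egocentric video grounding that extracts object information and exploits shot-movement features to strengthen video representation and modality alignment.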

Details

Motivation: Existing methods focus mainly on the distributional differences between egocentric and exocentric videos but neglect key characteristics of egocentric videos and the fine-grained information in question-type queries. Method: OSGNet extracts object information from videos and analyzes shot movements to strengthen video representation and modality alignment. Result: Experiments on three datasets show that OSGNet achieves state-of-the-art performance. Conclusion: By combining object and shot-movement features, OSGNet effectively improves egocentric video grounding. Abstract: Egocentric video grounding is a crucial task for embodied intelligence applications, distinct from exocentric video moment localization. Existing methods primarily focus on the distributional differences between egocentric and exocentric videos but often neglect key characteristics of egocentric videos and the fine-grained information emphasized by question-type queries. To address these limitations, we propose OSGNet, an Object-Shot enhanced Grounding Network for egocentric video. Specifically, we extract object information from videos to enrich video representation, particularly for objects highlighted in the textual query but not directly captured in the video features. Additionally, we analyze the frequent shot movements inherent to egocentric videos, leveraging these features to extract the wearer's attention information, which enhances the model's ability to perform modality alignment. Experiments conducted on three datasets demonstrate that OSGNet achieves state-of-the-art performance, validating the effectiveness of our approach. Our code can be found at https://github.com/Yisen-Feng/OSGNet.

[33] HDiffTG: A Lightweight Hybrid Diffusion-Transformer-GCN Architecture for 3D Human Pose Estimation

Yajie Fu,Chaorui Huang,Junwei Li,Hui Kong,Yibin Tian,Huakang Li,Zhiyuan Zhang

Main category: cs.CV

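TL;DR: HDiffTG is a novel 3D human pose estimation method that combines a Transformer, a GCN, and a diffusion model, improving accuracy and robustness while keeping a lightweight design.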

Details

Motivation: Existing methods perform poorly in complex scenes and under occlusion; HDiffTG aims to address this by fusing multiple techniques. Method: The Transformer captures global spatiotemporal dependencies, the GCN models local skeletal structure, and the diffusion model refines the estimate step by step, giving a complementary balance of global and local features. Result: Evaluated on Human3.6M and MPI-INF-3DHP, the method achieves state-of-the-art performance on MPI-INF-3DHP and is robust to noise and occlusion. Conclusion: Through technique fusion and lightweight optimization, HDiffTG delivers efficient, robust 3D pose estimation. Abstract: We propose HDiffTG, a novel 3D Human Pose Estimation (3DHPE) method that integrates Transformer, Graph Convolutional Network (GCN), and diffusion model into a unified framework. HDiffTG leverages the strengths of these techniques to significantly improve pose estimation accuracy and robustness while maintaining a lightweight design. The Transformer captures global spatiotemporal dependencies, the GCN models local skeletal structures, and the diffusion model provides step-by-step optimization for fine-tuning, achieving a complementary balance between global and local features. This integration enhances the model's ability to handle pose estimation under occlusions and in complex scenarios. Furthermore, we introduce lightweight optimizations to the integrated model and refine the objective function design to reduce computational overhead without compromising performance. Evaluation results on the Human3.6M and MPI-INF-3DHP datasets demonstrate that HDiffTG achieves state-of-the-art (SOTA) performance on the MPI-INF-3DHP dataset while excelling in both accuracy and computational efficiency. Additionally, the model exhibits exceptional robustness in noisy and occluded environments. Source codes and models are available at https://github.com/CirceJie/HDiffTG

[34] TS-Diff: Two-Stage Diffusion Model for Low-Light RAW Image Enhancement

Yi Li,Zhiyuan Zhang,Jiangnan Xia,Jianghan Cheng,Qilong Wu,Junwei Li,Yibin Tian,Hui Kong

Main category: cs.CV

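TL;DR: TS-Diff is a new two-stage diffusion model for enhancing extremely low-light RAW images; it pre-trains on a virtual-camera noise space, fine-tunes for a specific camera, adds a color corrector to handle color shifts, and reports state-of-the-art performance on multiple datasets.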

Details

Motivation: To address the challenges of RAW image enhancement in extremely low light, including noise, color shift, and generalization. Method: A two-stage diffusion model (pre-training and aligning stages) combined with a virtual-camera noise space, CFI modules, structural reparameterization, and a color corrector. Result: State-of-the-art performance on multiple datasets, including QID, SID, and ELD, with strong denoising, generalization, and color consistency. Conclusion: TS-Diff is a robust, versatile low-light image enhancement solution that works across cameras and illumination conditions. Abstract: This paper presents a novel Two-Stage Diffusion Model (TS-Diff) for enhancing extremely low-light RAW images. In the pre-training stage, TS-Diff synthesizes noisy images by constructing multiple virtual cameras based on a noise space. Camera Feature Integration (CFI) modules are then designed to enable the model to learn generalizable features across diverse virtual cameras. During the aligning stage, CFIs are averaged to create a target-specific CFI$^T$, which is fine-tuned using a small amount of real RAW data to adapt to the noise characteristics of specific cameras. A structural reparameterization technique further simplifies CFI$^T$ for efficient deployment. To address color shifts during the diffusion process, a color corrector is introduced to ensure color consistency by dynamically adjusting global color distributions. Additionally, a novel dataset, QID, is constructed, featuring quantifiable illumination levels and a wide dynamic range, providing a comprehensive benchmark for training and evaluation under extreme low-light conditions. Experimental results demonstrate that TS-Diff achieves state-of-the-art performance on multiple datasets, including QID, SID, and ELD, excelling in denoising, generalization, and color consistency across various cameras and illumination levels. These findings highlight the robustness and versatility of TS-Diff, making it a practical solution for low-light imaging applications. Source codes and models are available at https://github.com/CircccleK/TS-Diff

[35] MoDE: Mixture of Diffusion Experts for Any Occluded Face Recognition

Qiannan Fan,Zhuoyang Li,Jitong Li,Chenyang Cao

Main category: cs.CV

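TL;DR: Proposes MoDE, a diffusion-based mixture-of-experts method for occluded face recognition (OFR) that uses an identity-gating network to adaptively integrate information from multiple reconstructed faces and improve recognition performance.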

Details

Motivation: Current OFR algorithms lack prior knowledge of occlusions, which leads to poor performance in practice and affects everyday convenience. Method: Diffusion models generate several plausible complete face images, and an identity-gating network evaluates and integrates each reconstruction's contribution to the identity decision. Result: Experiments on three public face datasets and two in-the-wild datasets validate the method's strong performance across various occlusions. Conclusion: MoDE is a plug-and-play module that markedly improves occluded face recognition. Abstract: With the continuous impact of epidemics, people have become accustomed to wearing masks. However, most current occluded face recognition (OFR) algorithms lack prior knowledge of occlusions, resulting in poor performance when dealing with occluded faces of varying types and severity in reality. Recognizing occluded faces is still a significant challenge, which greatly affects the convenience of people's daily lives. In this paper, we propose an identity-gated mixture of diffusion experts (MoDE) for OFR. Each diffusion-based generative expert estimates one possible complete image for occluded faces. The random sampling process of the diffusion model introduces inevitable differences and variations between the inpainted faces and the real ones. To ensemble effective information from the multiple reconstructed faces, we introduce an identity-gating network to evaluate the contribution of each reconstructed face to the identity and adaptively integrate the predictions in the decision space. Moreover, our MoDE is a plug-and-play module for most existing face recognition models. Extensive experiments on three public face datasets and two datasets in the wild validate our advanced performance for various occlusions in comparison with the competing methods.
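
The idea of integrating expert predictions "in the decision space" can be sketched as a softmax-gated weighted average of per-expert identity probabilities. This is a generic mixture-of-experts fusion under assumed inputs, not the MoDE gating network itself; the gate logits, probabilities, and function names below are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def gated_fusion(expert_probs, gate_logits):
    """Fuse per-expert identity probabilities with gate weights.

    expert_probs: one probability vector over identities per expert
                  (here, one per reconstructed face).
    gate_logits:  one unnormalized score per expert, standing in for
                  the output of a gating network (assumed).
    """
    weights = softmax(gate_logits)
    n_classes = len(expert_probs[0])
    return [sum(w * probs[c] for w, probs in zip(weights, expert_probs))
            for c in range(n_classes)]

# Three hypothetical reconstructions voting over three identities.
expert_probs = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],  # an implausible reconstruction
]
gate_logits = [2.0, 1.5, -1.0]  # the gate down-weights the third expert

fused = gated_fusion(expert_probs, gate_logits)
predicted_identity = max(range(len(fused)), key=lambda c: fused[c])
```

Because the gate weights sum to one, the fused vector remains a valid probability distribution, and a reconstruction the gate considers unreliable contributes little to the final decision.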

[36] Multi-turn Consistent Image Editing

Zijun Zhou,Yingying Deng,Xiangyu He,Weiming Dong,Fan Tang

Main category: cs.CV

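TL;DR: Proposes a multi-turn image editing framework that uses iterative refinement to overcome the shortcomings of single-step editing, improving edit quality and user satisfaction.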

Details

Motivation: Existing image editing methods are mostly single-step and struggle with ambiguous user intent or complex transformations, producing inconsistent or unexpected results. Method: The framework uses flow matching for accurate image inversion, a dual-objective Linear Quadratic Regulator (LQR) for stable sampling, and an adaptive attention highlighting method. Result: Experiments show that the method significantly improves edit success rates and visual fidelity. Conclusion: The multi-turn iterative editing framework overcomes the limitations of single-step editing and improves edit quality and user experience. Abstract: Many real-world applications, such as interactive photo retouching, artistic content creation, and product design, require flexible and iterative image editing. However, existing image editing methods primarily focus on achieving the desired modifications in a single step, which often struggles with ambiguous user intent, complex transformations, or the need for progressive refinements. As a result, these methods frequently produce inconsistent outcomes or fail to meet user expectations. To address these challenges, we propose a multi-turn image editing framework that enables users to iteratively refine their edits, progressively achieving more satisfactory results. Our approach leverages flow matching for accurate image inversion and a dual-objective Linear Quadratic Regulator (LQR) for stable sampling, effectively mitigating error accumulation. Additionally, by analyzing the layer-wise roles of transformers, we introduce an adaptive attention highlighting method that enhances editability while preserving multi-turn coherence. Extensive experiments demonstrate that our framework significantly improves edit success rates and visual fidelity compared to existing methods.

[37] CountDiffusion: Text-to-Image Synthesis with Training-Free Counting-Guidance Diffusion

Yanyu Li,Pencheng Wan,Liang Han,Yaowei Wang,Liqiang Nie,Min Zhang

Main category: cs.CV

TL;DR: CountDiffusion is a training-free method that uses a two-stage correction mechanism to improve object-count accuracy in text-to-image generation models.

Details

Motivation: Existing text-to-image models struggle to generate the correct number of objects, mainly because of the high computational cost and the difficulty of teaching the abstract concept of quantity. Method: CountDiffusion has two stages: the first produces an intermediate denoising result and counts the objects in it, and the second corrects the object count through the attention map. Result: Experiments show that CountDiffusion significantly improves the ability of text-to-image models to generate the correct number of objects. Conclusion: CountDiffusion can be integrated into existing diffusion models without additional training and effectively addresses the object-count problem. Abstract: Stable Diffusion has advanced text-to-image synthesis, but training models to generate images with accurate object quantity is still difficult due to the high computational cost and the challenge of teaching models the abstract concept of quantity. In this paper, we propose CountDiffusion, a training-free framework aiming at generating images with correct object quantity from textual descriptions. CountDiffusion consists of two stages. In the first stage, an intermediate denoising result is generated by the diffusion model to predict the final synthesized image with one-step denoising, and a counting model is used to count the number of objects in this image. In the second stage, a correction module is used to correct the object quantity by changing the attention map of the object with universal guidance. The proposed CountDiffusion can be plugged into any diffusion-based text-to-image (T2I) generation model without further training. Experimental results demonstrate the superiority of our proposed CountDiffusion, which improves the accurate object quantity generation ability of T2I models by a large margin.
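The first stage described above predicts the final image from an intermediate noisy sample with one-step denoising. A minimal sketch of that prediction follows, assuming the standard DDPM noise parameterization (the abstract does not state which one the authors use). The names `predict_x0` and `count_error` are hypothetical helpers for illustration, and the counting model itself is not shown.

```python
import math

def predict_x0(x_t, eps_hat, alpha_bar_t):
    """One-step estimate of the clean sample from a noisy one.

    Assumes the DDPM forward process
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,
    so x_0 can be solved for once the noise estimate eps_hat is known.
    """
    if not 0.0 < alpha_bar_t <= 1.0:
        raise ValueError("alpha_bar_t must be in (0, 1]")
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [(x - noise_scale * e) / signal_scale for x, e in zip(x_t, eps_hat)]

def count_error(counted, requested):
    """Signed count error (hypothetical helper): positive means too many objects."""
    return counted - requested

# Toy check: add known noise to a known sample, then invert with the true noise.
x0 = [0.5, -1.0, 2.0]
eps = [0.1, -0.3, 0.7]
alpha_bar = 0.6
x_t = [math.sqrt(alpha_bar) * a + math.sqrt(1.0 - alpha_bar) * e
       for a, e in zip(x0, eps)]
x0_hat = predict_x0(x_t, eps, alpha_bar)
```

In a real pipeline `eps_hat` would come from the denoising network, so the one-step estimate is only approximate; the toy check recovers `x0` exactly because it reuses the true noise.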

[38] WDMamba: When Wavelet Degradation Prior Meets Vision Mamba for Image Dehazing

Jie Sun,Heng Liu,Yongzhen Wang,Xiao-Ping Zhang,Mingqiang Wei

Main category: cs.CV

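TL;DR: Proposes WDMamba, a wavelet-based dehazing framework with two stages (low-frequency restoration and detail enhancement) that combines Mamba blocks with self-guided contrastive regularization to significantly improve dehazing.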

Details

Motivation: Wavelet transform analysis shows that haze information resides mainly in low-frequency components, which motivates a new dehazing framework that exploits this property. Method: WDMamba splits dehazing into a low-frequency restoration stage and a detail enhancement stage, uses Mamba blocks to reconstruct global structure, and introduces self-guided contrastive regularization during training. Result: On public dehazing benchmarks, WDMamba outperforms existing methods both qualitatively and quantitatively. Conclusion: With its two-stage strategy and regularization, WDMamba effectively improves dehazing performance and offers a new direction for image dehazing. Abstract: In this paper, we reveal a novel haze-specific wavelet degradation prior observed through wavelet transform analysis, which shows that haze-related information predominantly resides in low-frequency components. Exploiting this insight, we propose a novel dehazing framework, WDMamba, which decomposes the image dehazing task into two sequential stages: low-frequency restoration followed by detail enhancement. This coarse-to-fine strategy enables WDMamba to effectively capture features specific to each stage of the dehazing process, resulting in high-quality restored images. Specifically, in the low-frequency restoration stage, we integrate Mamba blocks to reconstruct global structures with linear complexity, efficiently removing overall haze and producing a coarse restored image. Thereafter, the detail enhancement stage reinstates fine-grained information that may have been overlooked during the previous phase, culminating in the final dehazed output. Furthermore, to enhance detail retention and achieve more natural dehazing, we introduce a self-guided contrastive regularization during network training. By utilizing the coarse restored output as a hard negative example, our model learns more discriminative representations, substantially boosting the overall dehazing performance. Extensive evaluations on public dehazing benchmarks demonstrate that our method surpasses state-of-the-art approaches both qualitatively and quantitatively. Code is available at https://github.com/SunJ000/WDMamba.
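To make the low-frequency observation concrete, here is a small sketch of a one-level 2D Haar decomposition in plain Python. The choice of the Haar wavelet and the toy "haze" (a uniform brightness offset) are assumptions made for illustration; the abstract does not say which wavelet the authors use, and real haze is not a constant offset. The function name `haar_dwt2` is hypothetical.

```python
def haar_dwt2(img):
    """One-level orthonormal 2D Haar transform of an even-sized grayscale image.

    img is a list of rows. Returns (low, detail_cols, detail_rows, detail_diag),
    each half the size of the input in both dimensions.
    """
    h, w = len(img), len(img[0])
    if h % 2 or w % 2:
        raise ValueError("image height and width must be even")
    low, d_cols, d_rows, d_diag = [], [], [], []
    for i in range(0, h, 2):
        r_low, r_cols, r_rows, r_diag = [], [], [], []
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            r_low.append((a + b + c + d) / 2.0)   # local average (low frequency)
            r_cols.append((a - b + c - d) / 2.0)  # change across columns
            r_rows.append((a + b - c - d) / 2.0)  # change across rows
            r_diag.append((a - b - c + d) / 2.0)  # diagonal change
        low.append(r_low)
        d_cols.append(r_cols)
        d_rows.append(r_rows)
        d_diag.append(r_diag)
    return low, d_cols, d_rows, d_diag

# Toy example: a uniform offset changes only the low-frequency band.
clear = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
hazy = [[v + 10 for v in row] for row in clear]
clear_bands = haar_dwt2(clear)
hazy_bands = haar_dwt2(hazy)
```

Comparing `clear_bands` with `hazy_bands` shows the three detail bands unchanged while the low band shifts, which is the kind of separation a two-stage, low-frequency-first design can exploit.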

[39] Balancing Accuracy, Calibration, and Efficiency in Active Learning with Vision Transformers Under Label Noise

Moseli Mots'oehli,Hope Mogale,Kyungim Baek

Main category: cs.CV

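TL;DR: Studies how vision transformers (ViT) and Swin Transformers of different sizes perform under label noise and low-budget constraints, finding that larger ViT models are more accurate and better calibrated, while Swin Transformers are less robust to noise.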

Details

Motivation: To explore how practical vision transformers are across model sizes and label noise levels, and to guide model deployment in resource-constrained settings. Method: On CIFAR10 and CIFAR100, evaluates the classification accuracy and calibration of four ViT configurations and three Swin Transformer configurations under varying label noise rates. Result: Larger ViT models (such as ViTl32) perform better under noise, while Swin Transformers are sensitive to noise. Smaller patch sizes do not necessarily improve performance. Information-based active learning strategies help at moderate noise rates but yield poorer calibration at high noise rates. Conclusion: The study offers practical advice for deploying vision transformers in resource-constrained settings and stresses the need to balance model complexity, label noise, and compute efficiency. Abstract: Fine-tuning convolutional neural networks pre-trained on ImageNet for downstream tasks is well-established. Still, the impact of model size on the performance of vision transformers in similar scenarios, particularly under label noise, remains largely unexplored. Given the utility and versatility of transformer architectures, this study investigates their practicality under low-budget constraints and noisy labels. We explore how classification accuracy and calibration are affected by symmetric label noise in active learning settings, evaluating four vision transformer configurations (Base and Large with 16x16 and 32x32 patch sizes) and three Swin Transformer configurations (Tiny, Small, and Base) on CIFAR10 and CIFAR100 datasets, under varying label noise rates. Our findings show that larger ViT models (ViTl32 in particular) consistently outperform their smaller counterparts in both accuracy and calibration, even under moderate to high label noise, while Swin Transformers exhibit weaker robustness across all noise levels. We find that smaller patch sizes do not always lead to better performance, as ViTl16 performs consistently worse than ViTl32 while incurring a higher computational cost. We also find that information-based Active Learning strategies only provide meaningful accuracy improvements at moderate label noise rates, but they result in poorer calibration compared to models trained on randomly acquired labels, especially at high label noise rates. We hope these insights provide actionable guidance for practitioners looking to deploy vision transformers in resource-constrained environments, where balancing model complexity, label noise, and compute efficiency is critical in model fine-tuning or distillation.
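For readers unfamiliar with symmetric label noise, the sketch below shows one common way to inject it: each label is flipped with probability `noise_rate` to a different class chosen uniformly at random. This is an assumption about the noise model; some works also allow the flip to land on the true class, and the abstract does not say which variant is used. The function name is hypothetical.

```python
import random

def add_symmetric_label_noise(labels, num_classes, noise_rate, seed=0):
    """Return a copy of labels with symmetric noise applied.

    With probability noise_rate, a label is replaced by one of the other
    num_classes - 1 classes, each equally likely.
    """
    if num_classes < 2:
        raise ValueError("need at least two classes to flip labels")
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must be in [0, 1]")
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            other = rng.randrange(num_classes - 1)  # index among the other classes
            noisy.append(other if other < y else other + 1)  # skip the true class
        else:
            noisy.append(y)
    return noisy

# Example: corrupt roughly 40% of ten-class labels.
clean_labels = [i % 10 for i in range(1000)]
noisy_labels = add_symmetric_label_noise(clean_labels, num_classes=10, noise_rate=0.4)
```

Fixing the seed keeps the corrupted label set identical across model configurations, which matters when comparing architectures at the same noise rate.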

[40] Label-efficient Single Photon Images Classification via Active Learning

Zili Zhang,Ziting Wen,Yiheng Qiang,Hongzhou Dong,Wenle Dong,Xinyang Li,Xiaofan Wang,Xiaoqiang Ren

Main category: cs.CV

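TL;DR: Proposes an active learning framework for single-photon image classification that uses an imaging condition-aware sampling strategy and synthetic augmentation to sharply reduce the number of labeled samples needed while keeping classification accuracy high.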

Details

Motivation: Single-photon LiDAR enables high-precision 3D imaging in extreme environments, but semantic interpretation of its images remains underexplored because of high annotation costs and inefficient labeling strategies. Method: Proposes an imaging condition-aware sampling strategy, combined with synthetic augmentation, that selectively annotates the most informative samples. Result: On synthetic data, 97% accuracy is reached with only 1.5% labeled samples; on real-world data, 90.63% accuracy is reached with only 8% labeled samples, 4.51% higher than the best baseline. Conclusion: Active learning brings single-photon image classification to the same level of performance as classical images, paving the way for large-scale use of single-photon data. Abstract: Single-photon LiDAR achieves high-precision 3D imaging in extreme environments through quantum-level photon detection technology. Current research primarily focuses on reconstructing 3D scenes from sparse photon events, whereas the semantic interpretation of single-photon images remains underexplored, due to high annotation costs and inefficient labeling strategies. This paper presents the first active learning framework for single-photon image classification. The core contribution is an imaging condition-aware sampling strategy that integrates synthetic augmentation to model variability across imaging conditions. By identifying samples where the model is both uncertain and sensitive to these conditions, the proposed method selectively annotates only the most informative examples. Experiments on both synthetic and real-world datasets show that our approach outperforms all baselines and achieves high classification accuracy with significantly fewer labeled samples. Specifically, our approach achieves 97% accuracy on synthetic single-photon data using only 1.5% labeled samples. On real-world data, we maintain 90.63% accuracy with just 8% labeled samples, which is 4.51% higher than the best-performing baseline. This illustrates that active learning enables the same level of classification performance on single-photon images as on classical images, opening doors to large-scale integration of single-photon data in real-world applications.
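The abstract describes selecting samples on which the model is both uncertain and sensitive to imaging conditions. One way such a criterion could be scored is sketched below. The specific score (entropy of the mean prediction multiplied by the prediction variance across simulated conditions) is a hypothetical stand-in, not the authors' formula, and all function names are illustrative.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def sensitivity(prob_sets):
    """Mean per-class variance of predictions across simulated imaging conditions."""
    n, k = len(prob_sets), len(prob_sets[0])
    total = 0.0
    for c in range(k):
        mean_c = sum(p[c] for p in prob_sets) / n
        total += sum((p[c] - mean_c) ** 2 for p in prob_sets) / n
    return total / k

def acquisition_score(prob_sets):
    """Hypothetical score: high only when uncertainty and sensitivity are both high."""
    n, k = len(prob_sets), len(prob_sets[0])
    mean_probs = [sum(p[c] for p in prob_sets) / n for c in range(k)]
    return entropy(mean_probs) * sensitivity(prob_sets)

def select_for_labeling(pool, budget):
    """pool maps sample id -> list of class-probability vectors, one per condition."""
    ranked = sorted(pool, key=lambda sid: acquisition_score(pool[sid]), reverse=True)
    return ranked[:budget]

# Toy pool: predictions for each sample under two synthetic imaging conditions.
pool = {
    "stable": [[0.9, 0.1], [0.9, 0.1]],    # confident and consistent
    "unstable": [[0.9, 0.1], [0.1, 0.9]],  # prediction flips with the condition
}
chosen = select_for_labeling(pool, budget=1)
```

A product is used here so that a sample must satisfy both conditions to rank highly; a weighted sum would be an equally plausible design.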

[41] Tetrahedron-Net for Medical Image Registration

Jinhai Xiang,Shuai Guo,Qianru Han,Dantong Shi,Xinwei He,Xiang Bai

Main category: cs.CV

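TL;DR: Proposes Tetrahedron-Net, a new architecture that adds one decoder to strengthen representations for medical image registration; experiments show it outperforms existing methods.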

Details

Motivation: Existing U-Net-like networks are effective for medical image registration but do not fully exploit the interaction potential of a single-encoder, single-decoder architecture. Method: Proposes Tetrahedron-Net, a "Tetrahedron" structure with one encoder and two decoders, in which the added decoder interacts with both the original encoder and the original decoder. Result: It performs strongly on several medical image registration benchmarks and integrates easily into existing U-Net-like architectures. Conclusion: Tetrahedron-Net is a concise and effective improvement that markedly boosts medical image registration performance. Abstract: Medical image registration plays a vital role in medical image processing. Extracting expressive representations for medical images is crucial for improving the registration quality. One common practice to this end is constructing a convolutional backbone to enable interactions with skip connections among feature extraction layers. The de facto structure, U-Net-like networks, has attempted to design skip connections such as nested or full-scale ones to connect one single encoder and one single decoder to improve its representation capacity. Despite being effective, it still does not fully explore interactions within a single-encoder, single-decoder architecture. In this paper, we embrace this observation and introduce a simple yet effective alternative strategy to enhance the representations for registration by appending one additional decoder. The new decoder is designed to interact with both the original encoder and decoder. In this way, it not only reuses feature representations from corresponding layers in the encoder but also interacts with the original decoder to cooperatively give more accurate registration results. The new architecture is concise yet generalized, with only one encoder and two decoders forming a "Tetrahedron" structure, thereby dubbed Tetrahedron-Net. Three instantiations of Tetrahedron-Net are further constructed regarding the different structures of the appended decoder. Our extensive experiments prove that superior performance can be obtained on several representative benchmarks of medical image registration. Finally, such a "Tetrahedron" design can also be easily integrated into popular U-Net-like architectures including VoxelMorph, ViT-V-Net, and TransMorph, leading to consistent performance gains.

[42] DATA: Multi-Disentanglement based Contrastive Learning for Open-World Semi-Supervised Deepfake Attribution

Ming-Hui Liu,Xiao-Qian Liu,Xin Luo,Xin-Shun Xu

Main category: cs.CV

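TL;DR: Proposes DATA, a multi-disentanglement contrastive learning framework that improves generalization on open-world semi-supervised deepfake attribution through an orthonormal deepfake basis and an augmented-memory mechanism.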

Details

Motivation: Existing deepfake attribution methods rely too heavily on method-specific clues, overlook common forgery features, and struggle to distinguish novel classes in open-world settings. Method: Proposes the DATA framework, which uses an orthonormal deepfake basis to disentangle method-specific features, designs an augmented-memory mechanism to assist novel class discovery and contrastive learning, and refines features with a bases contrastive loss and a center contrastive loss. Result: On the OSS-DFA benchmark, DATA outperforms existing methods, with accuracy gains of 2.55% and 5.7% under different settings. Conclusion: Through disentanglement and contrastive learning, DATA markedly improves the generalization of deepfake attribution and the recognition of novel classes. Abstract: Deepfake attribution (DFA) aims to perform multiclassification on different facial manipulation techniques, thereby mitigating the detrimental effects of forgery content on the social order and personal reputations. However, previous methods focus only on method-specific clues, which easily lead to overfitting, while overlooking the crucial role of common forgery features. Additionally, they struggle to distinguish between uncertain novel classes in more practical open-world scenarios. To address these issues, in this paper we propose an innovative multi-DisentAnglement based conTrastive leArning framework, DATA, to enhance the generalization ability on novel classes for the open-world semi-supervised deepfake attribution (OSS-DFA) task. Specifically, since all generation techniques can be abstracted into a similar architecture, DATA defines the concept of 'Orthonormal Deepfake Basis' for the first time and utilizes it to disentangle method-specific features, thereby reducing the overfitting on forgery-irrelevant information. Furthermore, an augmented-memory mechanism is designed to assist in novel class discovery and contrastive learning, which aims to obtain clear class boundaries for the novel classes through instance-level disentanglements. Additionally, to enhance the standardization and discrimination of features, DATA uses bases contrastive loss and center contrastive loss as auxiliaries for the aforementioned modules. Extensive experimental evaluations show that DATA achieves state-of-the-art performance on the OSS-DFA benchmark, e.g., there are notable accuracy improvements of 2.55% / 5.7% under different settings, compared with the existing methods.

[43] Predicting Road Surface Anomalies by Visual Tracking of a Preceding Vehicle

Petr Jahoda,Jan Cech

Main category: cs.CV

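TL;DR: Proposes a new method that detects road surface anomalies by visually tracking a preceding vehicle; it works in low visibility or dense traffic and does not rely on detectors trained for specific anomalies.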

Details

Motivation: Conventional methods rely on direct observation and on detectors trained for specific anomalies, so they cope poorly with occlusion or low visibility. The new method predicts road anomalies from the motion of a preceding vehicle to make detection more general and practical. Method: A camera tracks the motion of the preceding vehicle, an iterative robust estimator compensates for camera pitch rotation, and road anomalies (such as potholes and bumps) are predicted. Result: Experiments show that the method reliably detects anomalies at a distance even on difficult road surfaces and runs in real time on standard hardware. Conclusion: The method is efficient and general, suits autonomous driving and chassis pre-configuration, and has practical potential. Abstract: A novel approach to detect road surface anomalies by visual tracking of a preceding vehicle is proposed. The method is versatile, predicting any kind of road anomalies, such as potholes, bumps, debris, etc., unlike direct observation methods that rely on training visual detectors of those cases. The method operates in low visibility conditions or in dense traffic where the anomaly is occluded by a preceding vehicle. Anomalies are detected predictively, i.e., before a vehicle encounters them, which allows low-level vehicle systems (such as the chassis) to be pre-configured or an avoidance maneuver to be planned in the case of autonomous driving. A challenge is that the signal coming from camera-based tracking of a preceding vehicle may be weak and disturbed by camera ego motion due to vibrations affecting the ego vehicle. Therefore, we propose an efficient method to compensate camera pitch rotation by an iterative robust estimator. Our experiments on both controlled setup and normal traffic conditions show that road anomalies can be detected reliably at a distance even in challenging cases where the ego vehicle traverses imperfect road surfaces. The method is effective and performs in real time on standard consumer hardware.

[44] SwinLip: An Efficient Visual Speech Encoder for Lip Reading Using Swin Transformer

Young-Hu Park,Rae-Hong Park,Hyung-Min Park

Main category: cs.CV

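TL;DR: Proposes SwinLip, an efficient visual speech encoder for lip reading that combines the hierarchical structure and window self-attention of the Swin Transformer to substantially reduce computational complexity and improve performance.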

Details

Motivation: Existing ResNet-based lip reading methods have high computational complexity, are poorly suited to capturing lip reading features efficiently, and introduce latency in multi-modal studies. Method: Builds the lightweight SwinLip encoder from the hierarchical structure and window self-attention of the Swin Transformer, combined with modified Conformer temporal embeddings and spatial embeddings. Result: SwinLip performs strongly on the English LRW and Mandarin LRW-1000 datasets with less computation and reaches state-of-the-art performance on Mandarin LRW-1000. Conclusion: SwinLip improves the performance and inference speed of lip reading networks while lowering computational load, and it works with a variety of backbones. Abstract: This paper presents an efficient visual speech encoder for lip reading. While most recent lip reading studies have been based on the ResNet architecture and have achieved significant success, they are not sufficiently suitable for efficiently capturing lip reading features due to high computational complexity in modeling spatio-temporal information. Additionally, using a complex visual model not only increases the complexity of lip reading models but also induces delays in the overall network for multi-modal studies (e.g., audio-visual speech recognition, speech enhancement, and speech separation). To overcome the limitations of Convolutional Neural Network (CNN)-based models, we apply the hierarchical structure and window self-attention of the Swin Transformer to lip reading. We configure a new lightweight scale of the Swin Transformer suitable for processing lip reading data and present the SwinLip visual speech encoder, which efficiently reduces computational load by integrating modified Convolution-augmented Transformer (Conformer) temporal embeddings with conventional spatial embeddings in the hierarchical structure. Through extensive experiments, we have validated that our SwinLip successfully improves the performance and inference speed of the lip reading network when applied to various backbones for word and sentence recognition, reducing computational load. In particular, our SwinLip demonstrated robust performance in both English LRW and Mandarin LRW-1000 datasets and achieved state-of-the-art performance on the Mandarin LRW-1000 dataset with less computation compared to the existing state-of-the-art model.

[45] Deep residual learning with product units

Ziyuan Li,Uwe Jaekel,Babette Dellen

Main category: cs.CV

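TL;DR: PURe (deep product-unit residual neural network) integrates product units into residual blocks to improve the expressiveness and parameter efficiency of convolutional networks, outperforming conventional ResNet models on several benchmark datasets.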

Details

Motivation: Conventional residual networks rely mainly on summation neurons, whereas product units enable multiplicative feature interactions and can represent complex patterns more effectively. PURe aims to improve expressiveness and efficiency through product units. Method: PURe replaces the conventional convolutional layer in the second layer of each residual block with 2D product units and removes nonlinear activation functions to preserve structural information. Result: On Galaxy10 DECaLS, ImageNet, and CIFAR-10, PURe performs strongly, with higher accuracy than deeper ResNet models, faster convergence, and fewer parameters. Conclusion: PURe strikes a good balance between accuracy, efficiency, and robustness, showing the potential of product-unit architectures in computer vision. Abstract: We propose a deep product-unit residual neural network (PURe) that integrates product units into residual blocks to improve the expressiveness and parameter efficiency of deep convolutional networks. Unlike standard summation neurons, product units enable multiplicative feature interactions, potentially offering a more powerful representation of complex patterns. PURe replaces conventional convolutional layers with 2D product units in the second layer of each residual block, eliminating nonlinear activation functions to preserve structural information. We validate PURe on three benchmark datasets. On Galaxy10 DECaLS, PURe34 achieves the highest test accuracy of 84.89%, surpassing the much deeper ResNet152, while converging nearly five times faster and demonstrating strong robustness to Poisson noise. On ImageNet, PURe architectures outperform standard ResNet models at similar depths, with PURe34 achieving a top-1 accuracy of 80.27% and top-5 accuracy of 95.78%, surpassing deeper ResNet variants (ResNet50, ResNet101) while utilizing significantly fewer parameters and computational resources. On CIFAR-10, PURe consistently outperforms ResNet variants across varying depths, with PURe272 reaching 95.01% test accuracy, comparable to ResNet1001 but at less than half the model size. These results demonstrate that PURe achieves a favorable balance between accuracy, efficiency, and robustness. Compared to traditional residual networks, PURe not only achieves competitive classification performance with faster convergence and fewer parameters, but also demonstrates greater robustness to noise. Its effectiveness across diverse datasets highlights the potential of product-unit-based architectures for scalable and reliable deep learning in computer vision.
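The contrast between summation neurons and product units can be shown with a scalar sketch. It uses the classic product-unit definition, in which each input is raised to a learned exponent and the results are multiplied; the log-domain form below assumes strictly positive inputs. How PURe's 2D product units handle signs and spatial kernels is not described in the abstract, so this is an illustration of the general idea only.

```python
import math

def summation_unit(inputs, weights, bias=0.0):
    """Standard neuron pre-activation: a weighted sum of the inputs."""
    return bias + sum(w * x for x, w in zip(inputs, weights))

def product_unit(inputs, weights):
    """Classic product unit: prod_i x_i ** w_i, computed as exp(sum_i w_i * ln x_i).

    Assumes every input is strictly positive so the logarithm is defined.
    """
    if any(x <= 0.0 for x in inputs):
        raise ValueError("this sketch requires strictly positive inputs")
    return math.exp(sum(w * math.log(x) for x, w in zip(inputs, weights)))

# With unit exponents the product unit multiplies its inputs,
# a feature interaction that a single summation unit cannot express.
pairwise_product = product_unit([2.0, 3.0], [1.0, 1.0])
ratio_like = product_unit([4.0, 2.0], [0.5, -1.0])  # sqrt(4) / 2
weighted_sum = summation_unit([2.0, 3.0], [1.0, 1.0])
```

Because the weights act as exponents, a single unit can represent products, ratios, and roots of its inputs, which is the multiplicative interaction the abstract refers to.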

[46] MFSeg: Efficient Multi-frame 3D Semantic Segmentation

Chengjie Huang,Krzysztof Czarnecki

Main category: cs.CV

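TL;DR: MFSeg is an efficient multi-frame 3D semantic segmentation framework that uses feature-level aggregation of point cloud sequences and a lightweight MLP decoder to reduce computational overhead while maintaining high accuracy.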

Details

Motivation: To address the high computational overhead of multi-frame 3D semantic segmentation and the upsampling of redundant points. Method: Feature-level aggregation of point cloud sequences and a lightweight MLP decoder. Result: Outperforms existing methods on the nuScenes and Waymo datasets. Conclusion: MFSeg performs well in both efficiency and accuracy. Abstract: We propose MFSeg, an efficient multi-frame 3D semantic segmentation framework. By aggregating point cloud sequences at the feature level and regularizing the feature extraction and aggregation process, MFSeg reduces computational overhead while maintaining high accuracy. Moreover, by employing a lightweight MLP-based point decoder, our method eliminates the need to upsample redundant points from past frames. Experiments on the nuScenes and Waymo datasets show that MFSeg outperforms existing methods, demonstrating its effectiveness and efficiency.

[47] DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception

Junjie Wang,Bin Chen,Yulin Li,Bin Kang,Yichi Chen,Zhuotao Tian

Main category: cs.CV

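TL;DR: DeCLIP decouples CLIP's self-attention module to extract "content" and "context" features separately, significantly improving performance on open-vocabulary dense prediction tasks.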

Details

Motivation: Existing vision-language models such as CLIP perform poorly on dense prediction tasks because their local feature representations are limited and lack spatial consistency. Method: Proposes the DeCLIP framework, which decouples the self-attention module and separately optimizes "content" features (aligned with image crop representations) and "context" features (which retain spatial correlations). Result: DeCLIP significantly outperforms existing methods on open-vocabulary dense prediction tasks such as object detection and semantic segmentation. Conclusion: By improving local feature representation and spatial consistency, DeCLIP effectively addresses CLIP's limitations in dense prediction. Abstract: Dense visual prediction tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded. While Vision-Language Models (VLMs) like CLIP have shown promise in open-vocabulary tasks, their direct application to dense prediction often leads to suboptimal performance due to limitations in local feature representation. In this work, we present our observation that CLIP's image tokens struggle to effectively aggregate information from spatially or semantically related regions, resulting in features that lack local discriminability and spatial consistency. To address this issue, we propose DeCLIP, a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. The "content" features are aligned with image crop representations to improve local discriminability, while "context" features learn to retain the spatial correlations under the guidance of vision foundation models, such as DINO. Extensive experiments demonstrate that DeCLIP significantly outperforms existing methods across multiple open-vocabulary dense prediction tasks, including object detection and semantic segmentation. Code is available at https://github.com/xiaomoguhz/DeCLIP.

[48] RLMiniStyler: Light-weight RL Style Agent for Arbitrary Sequential Neural Style Generation

Jing Hu,Chengming Feng,Shu Hu,Ming-Ching Chang,Xin Li,Xi Wu,Xin Wang

Main category: cs.CV

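TL;DR: Proposes RLMiniStyler, a lightweight reinforcement learning-based framework for arbitrary style transfer that iteratively generates high-quality, diverse sequences of stylized images at lower computational cost.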

Details

Motivation: Existing deep learning methods incur high computational costs when generating diverse stylized results, so a more efficient solution is needed. Method: A reinforcement learning policy iteratively guides the style transfer process, combined with an uncertainty-aware multi-task learning strategy that automatically adjusts loss weights. Result: Experiments validate the advantages of RLMiniStyler in generating high-quality, diverse artistic image sequences at lower computational cost. Conclusion: RLMiniStyler is an efficient, lightweight method for arbitrary style transfer that outperforms existing techniques. Abstract: Arbitrary style transfer aims to apply the style of any given artistic image to another content image. Still, existing deep learning-based methods often require significant computational costs to generate diverse stylized results. Motivated by this, we propose a novel reinforcement learning-based framework for arbitrary style transfer, RLMiniStyler. This framework leverages a unified reinforcement learning policy to iteratively guide the style transfer process by exploring and exploiting stylization feedback, generating smooth sequences of stylized results while keeping the model lightweight. Furthermore, we introduce an uncertainty-aware multi-task learning strategy that automatically adjusts loss weights to adapt to the content and style balance requirements at different training stages, thereby accelerating model convergence. Through a series of experiments across various image resolutions, we have validated the advantages of RLMiniStyler over other state-of-the-art methods in generating high-quality, diverse artistic image sequences at a lower cost. Code is available at https://github.com/fengxiaoming520/RLMiniStyler.

[49] Learning Real Facial Concepts for Independent Deepfake Detection

Ming-Hui Liu,Harry Cheng,Tianyi Wang,Xin Luo,Xin-Shun Xu

Main category: cs.CV

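TL;DR: RealID learns the concepts of the real and fake classes independently to improve the generalization of deepfake detection models, significantly outperforming existing methods.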

Details

Motivation: Deepfake detection models generalize poorly to unseen datasets, in particular misclassifying real instances as fake. Method: Proposes RealID, which comprises the RealC2 module and the IDC classifier to learn a comprehensive concept of the real class and to make independent decisions. Result: In experiments on five datasets, RealID improves average accuracy by 1.74% over existing methods. Conclusion: By learning real and fake concepts independently, RealID effectively improves generalization and offers a new direction for deepfake detection. Abstract: Deepfake detection models often struggle with generalization to unseen datasets, manifesting as misclassifying real instances as fake in target domains. This is primarily due to an overreliance on forgery artifacts and a limited understanding of real faces. To address this challenge, we propose a novel approach, RealID, to enhance generalization by learning a comprehensive concept of real faces while assessing the probabilities of belonging to the real and fake classes independently. RealID comprises two key modules: the Real Concept Capture Module (RealC2) and the Independent Dual-Decision Classifier (IDC). With the assistance of a MultiReal Memory, RealC2 maintains various prototypes for real faces, allowing the model to capture a comprehensive concept of the real class. Meanwhile, IDC redefines the classification strategy by making independent decisions based on the concept of the real class and the presence of forgery artifacts. Through the combined effect of the above modules, the influence of forgery-irrelevant patterns is alleviated, and extensive experiments on five widely used datasets demonstrate that RealID significantly outperforms existing state-of-the-art methods, achieving a 1.74% improvement in average accuracy.

[50] CAD-Llama: Leveraging Large Language Models for Computer-Aided Design Parametric 3D Model Generation

Jiahao Li,Weijian Ma,Xueyang Li,Yunzhong Lou,Guichun Zhou,Xiangdong Zhou

Main category: cs.CV

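TL;DR: Studies how to use large language models (LLMs) to generate parametric sequences for computer-aided design (CAD) models and proposes the CAD-Llama framework, which significantly improves generation through hierarchical annotation and adaptive pretraining.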

Details

Motivation: LLMs have succeeded at general text generation, but extending their capabilities to domain-specific tasks such as CAD parametric sequence generation remains challenging. Method: Proposes the CAD-Llama framework, which includes a hierarchical annotation pipeline, the Structured Parametric CAD Code (SPCC) format, adaptive pretraining, and instruction tuning. Result: Experiments show that the method outperforms existing autoregressive methods and LLM baselines at generating parametric sequences. Conclusion: CAD-Llama offers an effective solution for applying LLMs to 3D shape generation. Abstract: Recently, Large Language Models (LLMs) have achieved significant success, prompting increased interest in expanding their generative capabilities beyond general text into domain-specific areas. This study investigates the generation of parametric sequences for computer-aided design (CAD) models using LLMs. This endeavor represents an initial step towards creating parametric 3D shapes with LLMs, as CAD model parameters directly correlate with shapes in three-dimensional space. Despite the formidable generative capacities of LLMs, this task remains challenging, as these models neither encounter parametric sequences during their pretraining phase nor possess direct awareness of 3D structures. To address this, we present CAD-Llama, a framework designed to enhance pretrained LLMs for generating parametric 3D CAD models. Specifically, we develop a hierarchical annotation pipeline and a code-like format to translate parametric 3D CAD command sequences into Structured Parametric CAD Code (SPCC), incorporating hierarchical semantic descriptions. Furthermore, we propose an adaptive pretraining approach utilizing SPCC, followed by an instruction tuning process aligned with CAD-specific guidelines. This methodology aims to equip LLMs with the spatial knowledge inherent in parametric sequences. Experimental results demonstrate that our framework significantly outperforms prior autoregressive methods and existing LLM baselines.

[51] FA-KPConv: Introducing Euclidean Symmetries to KPConv via Frame Averaging

Ali Alawieh,Alexandru P. Condurache

Main category: cs.CV

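TL;DR: FA-KPConv is a KPConv-based neural network architecture that uses Frame Averaging to make point cloud networks exactly invariant and/or equivariant, with applications to point cloud classification and registration.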

Details

Motivation: KPConv is widely used in point cloud analysis, but its invariance and/or equivariance to Euclidean transformations can only be approximated through large datasets or data augmentation. FA-KPConv aims to achieve these properties exactly through Frame Averaging. Method: FA-KPConv wraps an existing KPConv network with Frame Averaging, making it exactly invariant and/or equivariant to translations, rotations, and/or reflections of the point cloud without adding learnable parameters or losing input information. Result: Experiments show that FA-KPConv performs well on point cloud classification and registration, especially in challenging scenarios such as scarce training data or randomly rotated test data. Conclusion: By embedding geometric prior knowledge, FA-KPConv markedly improves KPConv networks, especially when data are limited or transformations are complex. Abstract: We present Frame-Averaging Kernel-Point Convolution (FA-KPConv), a neural network architecture built on top of the well-known KPConv, a widely adopted backbone for 3D point cloud analysis. Even though invariance and/or equivariance to Euclidean transformations are required for many common tasks, KPConv-based networks can only approximately achieve such properties when training on large datasets or with significant data augmentations. Using Frame Averaging, we allow flexible customization of point cloud neural networks built with KPConv layers, by making them exactly invariant and/or equivariant to translations, rotations and/or reflections of the input point clouds. By simply wrapping around an existing KPConv-based network, FA-KPConv embeds geometrical prior knowledge into it while preserving the number of learnable parameters and not compromising any input information. We showcase the benefit of such an introduced bias for point cloud classification and point cloud registration, especially in challenging cases such as scarce training data or randomly rotated test data.
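The wrapping idea behind Frame Averaging can be illustrated on a toy problem. The sketch below works in 2D rather than 3D to stay dependency-free, builds frames from the centroid and PCA axes of the point cloud (the usual construction in the Frame Averaging literature, assumed here rather than taken from the abstract), and averages an arbitrary function over the four sign-flipped frames. `toy_feature` is a hypothetical stand-in for a KPConv backbone, and the covariance eigenvalues are assumed to be distinct.

```python
import math

def pca_frames_2d(points):
    """Return the centroid and the four sign-flipped PCA frames of a 2D point cloud."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    sxx = sum((p[0] - cx) ** 2 for p in points) / n
    syy = sum((p[1] - cy) ** 2 for p in points) / n
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in points) / n
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)  # direction of largest variance
    u1 = (math.cos(theta), math.sin(theta))
    u2 = (-math.sin(theta), math.cos(theta))
    frames = [((s1 * u1[0], s1 * u1[1]), (s2 * u2[0], s2 * u2[1]))
              for s1 in (1.0, -1.0) for s2 in (1.0, -1.0)]
    return (cx, cy), frames

def frame_average(f, points):
    """Average f over the point cloud expressed in each frame (invariant output)."""
    (cx, cy), frames = pca_frames_2d(points)
    total = 0.0
    for a1, a2 in frames:
        canonical = [((x - cx) * a1[0] + (y - cy) * a1[1],
                      (x - cx) * a2[0] + (y - cy) * a2[1]) for x, y in points]
        total += f(canonical)
    return total / len(frames)

def toy_feature(points):
    """Stand-in for a backbone network: deliberately NOT rotation invariant."""
    return sum(math.exp(x + 0.3 * y) for x, y in points)

cloud = [(0.0, 0.0), (2.0, 0.5), (3.0, 1.0), (1.0, -1.0), (4.0, 2.5)]
angle = 0.7
moved = [(math.cos(angle) * x - math.sin(angle) * y + 5.0,
          math.sin(angle) * x + math.cos(angle) * y - 2.0) for x, y in cloud]
mirrored = [(-x, y) for x, y in cloud]
```

The wrapped function returns the same value for `cloud`, `moved`, and `mirrored` even though `toy_feature` alone does not, and the wrapper adds no learnable parameters. In 3D the same construction uses three PCA axes and eight sign combinations (four if only rotations are considered).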

[52] Efficient Flow Matching using Latent Variables

Anirban Samaddar,Yixuan Sun,Viktor Nilsson,Sandeep Madireddy

Main category: cs.CV

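TL;DR: Latent-CFM is an improved flow matching model that uses pretrained deep latent variable models to simplify training and inference, substantially reducing computational cost and performing well on multi-modal data and image generation tasks.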

Details

Motivation: Existing flow matching models do not explicitly model the underlying structure or manifold of the target data when learning a flow from a simple source distribution, which makes learning inefficient, especially for high-dimensional data. Method: Proposes Latent-CFM, which uses pretrained deep latent variable models to simplify training and inference strategies and better handle multi-modal data. Result: Experiments show that Latent-CFM outperforms existing flow matching models in generation quality and computational efficiency, with about 50% less training in some cases, and performs well in physical data generation and conditional image generation. Conclusion: Latent-CFM offers an efficient, high-performing flow matching approach suited to multi-modal data and complex generation tasks. Abstract: Flow matching models have shown great potential in image generation tasks among probabilistic generative models. Building upon the ideas of continuous normalizing flows, flow matching models generalize the transport path of the diffusion models from a simple prior distribution to the data. Most flow matching models in the literature do not explicitly model the underlying structure/manifold in the target data when learning the flow from a simple source distribution like the standard Gaussian. This leads to inefficient learning, especially for many high-dimensional real-world datasets, which often reside in a low-dimensional manifold. Existing strategies of incorporating manifolds, including data with underlying multi-modal distribution, often require expensive training and hence frequently lead to suboptimal performance. To this end, we present Latent-CFM, which provides simplified training/inference strategies to incorporate multi-modal data structures using pretrained deep latent variable models. Through experiments on multi-modal synthetic data and widely used image benchmark datasets, we show that Latent-CFM exhibits improved generation quality with significantly less training (about 50% less in some cases) and computation than state-of-the-art flow matching models. Using a 2D Darcy flow dataset, we demonstrate that our approach generates more physically accurate samples than competitive approaches. In addition, through latent space analysis, we demonstrate that our approach can be used for conditional image generation conditioned on latent features.
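As background for the flow matching objective mentioned above, here is a minimal sketch of a conditional flow matching training target with a straight-line path between a source sample and a data sample. The linear path and the optional `latent` argument are assumptions for illustration: the abstract does not give Latent-CFM's exact path or how the latent features enter the velocity network, and `velocity_fn` is a hypothetical stand-in for that network.

```python
def cfm_training_pair(x0, x1, t):
    """Point on the straight path from x0 (source) to x1 (data) and its velocity.

    x_t = (1 - t) * x0 + t * x1, with constant target velocity x1 - x0.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    return x_t, target

def cfm_loss(velocity_fn, x0, x1, t, latent=None):
    """Mean squared error between the predicted and target velocities."""
    x_t, target = cfm_training_pair(x0, x1, t)
    predicted = velocity_fn(x_t, t, latent)
    return sum((p - u) ** 2 for p, u in zip(predicted, target)) / len(target)

# Toy check with an oracle velocity field that already knows the answer.
source = [0.0, 0.0]
data = [2.0, 4.0]
x_t, target = cfm_training_pair(source, data, t=0.25)
oracle_loss = cfm_loss(lambda x, t, z: [2.0, 4.0], source, data, 0.25)
zero_field_loss = cfm_loss(lambda x, t, z: [0.0, 0.0], source, data, 0.25)
```

In training, `t` and the source sample would be drawn at random for each data point and the loss minimized over the network parameters; conditioning on a latent code is one way to give the velocity field information about which mode of the data it is heading toward.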

[53] "I Can See Forever!": Evaluating Real-time VideoLLMs for Assisting Individuals with Visual Impairments

Ziyi Zhang,Zhen Sun,Zongmin Zhang,Zifan Peng,Yuemeng Zhao,Zichun Wang,Zeren Luo,Ruiting Zuo,Xinlei He

Main category: cs.CV

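TL;DR: Presents the first systematic evaluation of video language models (VideoLLMs) for assisting visually impaired people in daily activities, builds the VisAssistDaily benchmark dataset, and finds that GPT-4o performs best. It also introduces the SafeVid dataset and a polling mechanism to address risk perception in dynamic environments.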

Details

Motivation: Visually impaired people lack real-time perception support in dynamic, complex environments, and existing techniques focus mostly on static content and cannot meet practical needs. Method: Builds the VisAssistDaily dataset to evaluate model performance and runs a user study to assess the models' usefulness in closed-world and open-world scenarios. Proposes the SafeVid dataset and a polling mechanism to improve risk perception. Result: GPT-4o achieves the highest task success rate, but existing models have limited ability to perceive potential hazards in dynamic environments. Conclusion: The study provides important insights for future assistive technologies, especially for risk perception in dynamic environments. Abstract: The visually impaired population, especially the severely visually impaired, is currently large in scale, and daily activities pose significant challenges for them. Although many studies use large language and vision-language models to assist the blind, most focus on static content and fail to meet real-time perception needs in dynamic and complex environments, such as daily activities. To provide them with more effective intelligent assistance, it is imperative to incorporate advanced visual understanding technologies. Although real-time vision and speech interaction VideoLLMs demonstrate strong real-time visual understanding, no prior work has systematically evaluated their effectiveness in assisting visually impaired individuals. In this work, we conduct the first such evaluation. First, we construct a benchmark dataset (VisAssistDaily), covering three categories of assistive tasks for visually impaired individuals: Basic Skills, Home Life Tasks, and Social Life Tasks. The results show that GPT-4o achieves the highest task success rate. Next, we conduct a user study to evaluate the models in both closed-world and open-world scenarios, further exploring the practical challenges of applying VideoLLMs in assistive contexts. One key issue we identify is the difficulty current models face in perceiving potential hazards in dynamic environments. To address this, we build an environment-awareness dataset named SafeVid and introduce a polling mechanism that enables the model to proactively detect environmental risks. We hope this work provides valuable insights and inspiration for future research in this field.

[54] Defining and Quantifying Creative Behavior in Popular Image Generators

Aditi Ramaswamy

Main category: cs.CV

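TL;DR: This paper proposes a practical way to measure the creativity of generative AI models and introduces quantitative measures that help users choose a model suited to their task.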

Details

Motivation: Examine the creativity of generative AI models, a topic still debated in the scientific community, and provide practical tools. Method: Introduce quantitative measures and evaluate them on several popular image generation models. Result: The evaluation indicates that the proposed measures agree with human intuition. Conclusion: The quantitative measures can gauge the creativity of generative AI models and help users choose among them. Abstract: Creativity of generative AI models has been a subject of scientific debate in recent years, without a conclusive answer. In this paper, we study creativity from a practical perspective and introduce quantitative measures that help the user to choose a suitable AI model for a given task. We evaluated our measures on a number of popular image-to-image generation models, and the results suggest that our measures conform to human intuition.

[55] Leveraging Simultaneous Usage of Edge GPU Hardware Engines for Video Face Detection and Recognition

Asma Baobaid,Mahmoud Meribout

Main category: cs.CV

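TL;DR: This paper proposes a method for maximizing the use of hardware engines on edge GPUs, running video decoding, face detection, and face recognition concurrently and in a pipeline to raise throughput and lower power consumption.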

Details

Motivation: Demand for video face detection and recognition in public places is growing, but existing methods do not fully use the hardware engines of edge GPUs, which limits efficiency. Method: Use concurrency and pipelining on the edge GPU to process video decoding, face detection, and face recognition simultaneously, and optimize the use of the hardware engines. Result: On the NVIDIA Orin edge GPU, the approach achieves higher throughput and a power saving of about 5% while meeting the real-time performance requirement. Conclusion: Optimizing the use of the hardware engines improves performance noticeably, and the paper suggests further hardware improvements. Abstract: Video face detection and recognition in public places at the edge is required in several applications, such as security reinforcement and contactless access to authorized venues. This paper aims to maximize the simultaneous usage of hardware engines available in edge GPUs nowadays by leveraging the concurrency and pipelining of tasks required for face detection and recognition. This also includes the video decoding task, which is required in most face monitoring applications as the video streams are usually carried over a Gbps Ethernet network. This constitutes an improvement over previous works where the tasks are usually allocated to a single engine due to the lack of a unified and automated framework that simultaneously explores all hardware engines. In addition, previously, the input faces were usually embedded in still images or within raw video streams that overlook the burst delay caused by the decoding stage. The results on real-life video streams suggest that, by simultaneously using all the hardware engines available in the recent NVIDIA edge Orin GPU, higher throughput and a slight saving in power consumption of around 300 mW (around 5%) have been achieved while satisfying the real-time performance constraint. The performance gets even higher by considering several video streams simultaneously. Further performance improvement could have been obtained if the number of shuffle layers that were created by the TensorRT framework for the face recognition task was lower. Thus, the paper suggests some hardware improvements to the existing edge GPU processors to enhance their performance even further.
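The pipelining idea in this abstract can be illustrated with a minimal, hardware-agnostic sketch. It assumes each stage (decode, detect, recognize) is an ordinary Python callable; `run_pipeline` and the stage functions are hypothetical names, and nothing here dispatches work to actual Orin hardware engines. The sketch only shows how stage k can process frame i while stage k+1 processes frame i-1.

```python
import queue
import threading


def run_pipeline(frames, stages):
    """Run each stage in its own thread, linked by FIFO queues.

    `stages` is an ordered list of callables (hypothetical stand-ins for
    decode, detect, recognize). Output order matches input order because
    every stage is served by exactly one thread.
    """
    sentinel = object()
    # Bounded queues between stages give back-pressure; the last queue is
    # unbounded so the producer loop below can never deadlock.
    queues = [queue.Queue(maxsize=4) for _ in stages] + [queue.Queue()]

    def worker(stage, q_in, q_out):
        while True:
            item = q_in.get()
            if item is sentinel:
                q_out.put(sentinel)  # propagate shutdown downstream
                return
            q_out.put(stage(item))

    threads = [
        threading.Thread(target=worker, args=(stage, queues[i], queues[i + 1]))
        for i, stage in enumerate(stages)
    ]
    for t in threads:
        t.start()

    for frame in frames:
        queues[0].put(frame)
    queues[0].put(sentinel)

    results = []
    while True:
        item = queues[-1].get()
        if item is sentinel:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```

In a real deployment each callable would wrap a call into the corresponding engine; how the paper assigns tasks to engines is not described in the abstract.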

[56] HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation

Teng Hu,Zhentao Yu,Zhengguang Zhou,Sen Liang,Yuan Zhou,Qin Lin,Qinglin Lu

Main category: cs.CV

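TL;DR: HunyuanCustom is a multimodal customized video generation framework that supports image, audio, video, and text inputs, with an emphasis on subject consistency and multimodal understanding.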

Details

Motivation: Existing methods fall short on identity consistency and on the variety of input modalities; HunyuanCustom aims to address both. Method: Built on HunyuanVideo, it introduces a text-image fusion module and an image ID enhancement module; for audio and video inputs, it proposes AudioNet and a video-driven injection module. Result: Experiments show that HunyuanCustom outperforms existing methods on ID consistency, realism, and text-video alignment, and is robust on downstream tasks. Conclusion: Multimodal conditioning and identity-preserving strategies are effective in advancing controllable video generation. Abstract: Customized video generation aims to produce videos featuring specific subjects under flexible user-defined conditions, yet existing methods often struggle with identity consistency and limited input modalities. In this paper, we propose HunyuanCustom, a multi-modal customized video generation framework that emphasizes subject consistency while supporting image, audio, video, and text conditions. Built upon HunyuanVideo, our model first addresses the image-text conditioned generation task by introducing a text-image fusion module based on LLaVA for enhanced multi-modal understanding, along with an image ID enhancement module that leverages temporal concatenation to reinforce identity features across frames. To enable audio- and video-conditioned generation, we further propose modality-specific condition injection mechanisms: an AudioNet module that achieves hierarchical alignment via spatial cross-attention, and a video-driven injection module that integrates latent-compressed conditional video through a patchify-based feature-alignment network. Extensive experiments on single- and multi-subject scenarios demonstrate that HunyuanCustom significantly outperforms state-of-the-art open- and closed-source methods in terms of ID consistency, realism, and text-video alignment. Moreover, we validate its robustness across downstream tasks, including audio and video-driven customized video generation. Our results highlight the effectiveness of multi-modal conditioning and identity-preserving strategies in advancing controllable video generation. All the code and models are available at https://hunyuancustom.github.io.

[57] Text2CT: Towards 3D CT Volume Generation from Free-text Descriptions Using Diffusion Model

Pengfei Guo,Can Zhao,Dong Yang,Yufan He,Vishwesh Nath,Ziyue Xu,Pedro R. A. S. Bassi,Zongwei Zhou,Benjamin D. Simon,Stephanie Anne Harmon,Baris Turkbey,Daguang Xu

Main category: cs.CV

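TL;DR: Text2CT is a new diffusion-based method that generates 3D CT volumes from free-text descriptions and outperforms existing methods that require fixed-format input.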

Details

Motivation: Generating 3D CT volumes from free text opens new opportunities for diagnostics and research. Method: A diffusion model encodes medical text into latent representations and decodes them into high-resolution 3D CT scans. Result: The method performs strongly in preserving anatomical fidelity and capturing intricate structures, reaching state-of-the-art results. Conclusion: Text2CT has promising applications in diagnostics and data augmentation. Abstract: Generating 3D CT volumes from descriptive free-text inputs presents a transformative opportunity in diagnostics and research. In this paper, we introduce Text2CT, a novel approach for synthesizing 3D CT volumes from textual descriptions using the diffusion model. Unlike previous methods that rely on fixed-format text input, Text2CT employs a novel prompt formulation that enables generation from diverse, free-text descriptions. The proposed framework encodes medical text into latent representations and decodes them into high-resolution 3D CT scans, effectively bridging the gap between semantic text inputs and detailed volumetric representations in a unified 3D framework. Our method demonstrates superior performance in preserving anatomical fidelity and capturing intricate structures as described in the input text. Extensive evaluations show that our approach achieves state-of-the-art results, offering promising potential applications in diagnostics and data augmentation.

[58] Edge-GPU Based Face Tracking for Face Detection and Recognition Acceleration

Asma Baobaid,Mahmoud Meribout

Main category: cs.CV

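TL;DR: This paper proposes a combined hardware-software approach that optimizes a face detection and recognition system on the NVIDIA Jetson AGX Orin. By using all hardware engines simultaneously and integrating a face tracking module, it substantially improves processing speed and energy efficiency.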

Details

Motivation: Modern applications urgently need real-time, accurate face detection and recognition systems for public places, but existing systems still have room for improvement in throughput and power consumption. Method: Use all hardware engines of the NVIDIA Jetson AGX Orin and integrate a face tracking module to avoid redundant computation. Result: Experiments show that the system reaches 290 FPS at 1920 x 1080 resolution while reducing power consumption by about 800 mW. Conclusion: This hardware-software co-design approach offers a viable route to high-performance edge machine vision systems, especially for multi-camera surveillance. Abstract: Cost-effective machine vision systems dedicated to real-time and accurate face detection and recognition in public places are crucial for many modern applications. However, despite their high performance, which could be reached using specialized edge or cloud AI hardware accelerators, there is still room for improvement in throughput and power consumption. This paper aims to suggest a combined hardware-software approach that optimizes face detection and recognition systems on one of the latest edge GPUs, namely NVIDIA Jetson AGX Orin. First, it leverages the simultaneous usage of all its hardware engines to improve processing time. This offers an improvement over previous works where these tasks were mainly allocated automatically and exclusively to the CPU or, to a higher extent, to the GPU core. Additionally, the paper suggests integrating a face tracker module to avoid redundantly running the face recognition algorithm for every frame but only when a new face appears in the scene. The results of extended experiments suggest that simultaneous usage of all the hardware engines that are available in the Orin GPU and tracker integration into the pipeline yield an impressive throughput of 290 FPS (frames per second) on 1920 x 1080 input frames containing an average of 6 faces per frame. Additionally, a substantial saving of power consumption of around 800 mW was achieved when compared to running the task on the CPU/GPU engines only and without integrating a tracker into the Orin GPU's pipeline. This hardware-software co-design approach can pave the way to design high-performance machine vision systems at the edge, critically needed in video monitoring in public places where several nearby cameras are usually deployed for the same scene.
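The tracker idea (run recognition only when a new face appears) can be sketched as follows. This is an illustration under stated assumptions, not the paper's tracker: it matches boxes between consecutive frames by intersection-over-union, and `GatedRecognizer`, `recognize`, and the 0.5 threshold are hypothetical choices. The greedy matching may assign two boxes to the same track; a production tracker would resolve such conflicts.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


class GatedRecognizer:
    """Call the (expensive) recognizer only for faces with no matching track."""

    def __init__(self, recognize, iou_threshold=0.5):
        self.recognize = recognize          # hypothetical callable (frame, box) -> identity
        self.iou_threshold = iou_threshold  # assumed value, not from the paper
        self.tracks = []                    # (box, identity) pairs from the previous frame
        self.calls = 0                      # number of recognizer invocations so far

    def process(self, frame, boxes):
        new_tracks = []
        for box in boxes:
            # Find the previous-frame track that overlaps this box the most.
            best = max(self.tracks, key=lambda t: iou(t[0], box), default=None)
            if best is not None and iou(best[0], box) >= self.iou_threshold:
                identity = best[1]          # known face: reuse the cached identity
            else:
                identity = self.recognize(frame, box)  # new face: recognize once
                self.calls += 1
            new_tracks.append((box, identity))
        self.tracks = new_tracks
        return [identity for _, identity in new_tracks]
```

With faces that persist over many frames, the recognizer is called roughly once per face rather than once per face per frame, which is where the saving described in the abstract would come from.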

[59] DFVO: Learning Darkness-free Visible and Infrared Image Disentanglement and Fusion All at Once

Qi Zhou,Yukai Shi,Xiaojun Yang,Xiaoyu Xian,Lunjia Liao,Ruimao Zhang,Liang Lin

Main category: cs.CV

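TL;DR: The paper proposes a network called DFVO for fusing visible and infrared images in dark environments, addressing the blurry fusion results that conventional methods produce under insufficient illumination.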

Details

Motivation: When visible images are poorly lit, existing image fusion methods produce blurry results with poor visual quality, which poses a challenge for high-level vision tasks such as autonomous driving. Method: Adopt a cascaded multi-task strategy with a latent-common feature extractor (LCFE), a details-extraction module (DEM), and a hyper cross-attention module (HCAM), guided by a dedicated loss function. Result: Experiments show that DFVO produces clearer, more informative fused images in dark environments, reaching 63.258 dB PSNR and 0.724 CC on the LLVIP dataset. Conclusion: DFVO markedly improves image fusion quality in dark environments and provides more useful information for high-level vision tasks. Abstract: Visible and infrared image fusion is one of the most crucial tasks in the field of image fusion, aiming to generate fused images with clear structural information and high-quality texture features for high-level vision tasks. However, when faced with severe illumination degradation in visible images, the fusion results of existing image fusion methods often exhibit blurry and dim visual effects, posing major challenges for autonomous driving. To this end, a Darkness-Free network is proposed to handle Visible and infrared image disentanglement and fusion all at Once (DFVO), which employs a cascaded multi-task approach to replace the traditional two-stage cascaded training (enhancement and fusion), addressing the issue of information entropy loss caused by hierarchical data transmission. Specifically, we construct a latent-common feature extractor (LCFE) to obtain latent features for the cascaded tasks strategy. Firstly, a details-extraction module (DEM) is devised to acquire high-frequency semantic information. Secondly, we design a hyper cross-attention module (HCAM) to extract low-frequency information and preserve texture features from source images. Finally, a relevant loss function is designed to guide the holistic network learning, thereby achieving better image fusion. Extensive experiments demonstrate that our proposed approach outperforms state-of-the-art alternatives in terms of qualitative and quantitative evaluations. Particularly, DFVO can generate clearer, more informative, and more evenly illuminated fusion results in dark environments, achieving the best performance on the LLVIP dataset with 63.258 dB PSNR and 0.724 CC, providing more effective information for high-level vision tasks. Our code is publicly accessible at https://github.com/DaVin-Qi530/DFVO.
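For readers unfamiliar with the two reported metrics, the sketch below gives their textbook definitions: PSNR as 10·log10(MAX² / MSE) and CC as the Pearson correlation coefficient. It operates on flattened pixel lists. How the paper pairs the fused image with the visible and infrared sources, and which intensity range it uses, is not stated in the abstract, so the function names and the `max_value` default are assumptions.

```python
import math


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def correlation_coefficient(x, y):
    """Pearson correlation between two pixel lists (both must be non-constant)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```

Higher values are better for both metrics; CC lies in [-1, 1].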

[60] RAFT: Robust Augmentation of FeaTures for Image Segmentation

Edward Humes,Xiaomin Lin,Uttej Kallakuri,Tinoosh Mohsenin

Main category: cs.CV

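TL;DR: RAFT is a new image segmentation framework that combines data augmentation, feature augmentation, and active learning, using a small amount of real data to counter the performance drop that models trained on synthetic data suffer in real-world scenes.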

Details

Motivation: Address the poor real-world performance of models trained on synthetic data (the Syn2Real problem). Method: Propose the RAFT framework, which combines data augmentation, feature augmentation, and active learning to adapt models using a small amount of real data. Result: On the SYNTHIA->Cityscapes and GTAV->Cityscapes benchmarks, mIoU improves by 2.1% and 0.4% respectively; on the Cityscapes->ACDC benchmark, mIoU improves by 1.3%. Conclusion: RAFT improves the real-world performance of models trained on synthetic data, and the experiments support its advantage over prior work. Abstract: Image segmentation is a powerful computer vision technique for scene understanding. However, real-world deployment is stymied by the need for high-quality, meticulously labeled datasets. Synthetic data provides high-quality labels while reducing the need for manual data collection and annotation. However, deep neural networks trained on synthetic data often face the Syn2Real problem, leading to poor performance in real-world deployments. To mitigate the aforementioned gap in image segmentation, we propose RAFT, a novel framework for adapting image segmentation models using minimal labeled real-world data through data and feature augmentations, as well as active learning. To validate RAFT, we perform experiments on the synthetic-to-real "SYNTHIA->Cityscapes" and "GTAV->Cityscapes" benchmarks. We managed to surpass the previous state of the art, HALO. SYNTHIA->Cityscapes experiences an improvement in mIoU* upon domain adaptation of 2.1%/79.9%, and GTAV->Cityscapes experiences a 0.4%/78.2% improvement in mIoU. Furthermore, we test our approach on the real-to-real benchmark of "Cityscapes->ACDC", and again surpass HALO, with a gain in mIoU upon adaptation of 1.3%/73.2%. Finally, we examine the effect of the allocated annotation budget and various components of RAFT upon the final transfer mIoU.
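The mIoU figures quoted above follow the usual definition: per-class intersection-over-union, averaged over classes. A minimal sketch on flattened label lists is shown below; `mean_iou` is a hypothetical helper, and skipping classes absent from both prediction and ground truth is one common convention that the benchmarks' official scripts may handle differently (they typically accumulate a confusion matrix over the whole dataset).

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target.

    `pred` and `target` are equally long sequences of integer class labels.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes that appear in neither sequence
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")
```

For example, with predictions `[0, 0, 1, 1]` and labels `[0, 1, 1, 1]`, class 0 scores 1/2 and class 1 scores 2/3, so the mean is 7/12.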

[61] Registration of 3D Point Sets Using Exponential-based Similarity Matrix

Ashutosh Singandhupe,Sanket Lokhande,Hung Manh La

Main category: cs.CV

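TL;DR: Proposes a modified ICP algorithm (ESM-ICP) that uses a dynamically adapted similarity matrix to handle large rotational differences and noise in point cloud registration.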

Details

Motivation: Existing registration techniques perform poorly under large rotational differences or noise, leading to inaccurate 3D reconstructions. Method: Introduce a Gaussian-inspired exponential weighting scheme to build a similarity matrix that adapts across iterations, improving rotation and translation estimation. Result: ESM-ICP outperforms traditional geometric methods and several learning-based methods under large rotational differences and non-Gaussian noise. Conclusion: ESM-ICP improves the robustness of point cloud registration, and the code is open source. Abstract: Point cloud registration is a fundamental problem in computer vision and robotics, involving the alignment of 3D point sets captured from varying viewpoints using depth sensors such as LiDAR or structured light. In modern robotic systems, especially those focused on mapping, it is essential to merge multiple views of the same environment accurately. However, state-of-the-art registration techniques often struggle when large rotational differences exist between point sets or when the data is significantly corrupted by sensor noise. These challenges can lead to misalignments and, consequently, to inaccurate or distorted 3D reconstructions. In this work, we address both these limitations by proposing a robust modification to the classic Iterative Closest Point (ICP) algorithm. Our method, termed Exponential Similarity Matrix ICP (ESM-ICP), integrates a Gaussian-inspired exponential weighting scheme to construct a similarity matrix that dynamically adapts across iterations. This matrix facilitates improved estimation of both rotational and translational components during alignment. We demonstrate the robustness of ESM-ICP in two challenging scenarios: (i) large rotational discrepancies between the source and target point clouds, and (ii) data corrupted by non-Gaussian noise. Our results show that ESM-ICP outperforms traditional geometric registration techniques as well as several recent learning-based methods. To encourage reproducibility and community engagement, our full implementation is made publicly available on GitHub: https://github.com/aralab-unr/ESM_ICP
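To make the weighting idea concrete, here is a minimal sketch of one weighted alignment step, assuming correspondences are already given. It is not the authors' ESM-ICP: the abstract does not specify the form of the similarity matrix, so this swaps in a simple per-correspondence Gaussian kernel on the residual distance, followed by the standard weighted Kabsch (SVD) solution. The names `exp_weights`, `weighted_rigid_transform`, and `sigma` are hypothetical.

```python
import numpy as np


def exp_weights(src, dst, sigma):
    """Gaussian-style weights exp(-d^2 / (2 sigma^2)) per correspondence (assumed form)."""
    d2 = np.sum((src - dst) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))


def weighted_rigid_transform(src, dst, w):
    """Weighted least-squares rotation R and translation t with dst ~ R @ src + t.

    src, dst: (N, 3) arrays of corresponding points; w: (N,) positive weights.
    """
    w = w / w.sum()
    mu_s = (w[:, None] * src).sum(axis=0)        # weighted centroids
    mu_d = (w[:, None] * dst).sum(axis=0)
    # Weighted cross-covariance of the centered point sets.
    H = (src - mu_s).T @ (w[:, None] * (dst - mu_d))
    U, _, Vt = np.linalg.svd(H)
    # Force det(R) = +1 so the result is a rotation, not a reflection.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = mu_d - R @ mu_s
    return R, t
```

In an ICP loop, one would re-estimate correspondences and recompute the weights after each such step; poorly matched pairs then contribute little to the next estimate.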

[62] Componential Prompt-Knowledge Alignment for Domain Incremental Learning

Kunlun Xu,Xu Zou,Gang Hua,Jiahuan Zhou

Main category: cs.CV

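TL;DR: KA-Prompt is a prompt-based domain incremental learning method that uses component-aware prompt-knowledge alignment to resolve conflicts when fusing knowledge from multiple domains.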

Details

Motivation: Show that component-wise misalignment in existing prompt-based methods causes knowledge conflicts and degraded predictions. Method: Two phases, initial componential structure configuring and online alignment preservation, which achieve knowledge alignment through greedy search and dynamic consistency constraints. Result: Strong performance on domain incremental learning benchmarks. Conclusion: KA-Prompt improves the model's learning and inference capacity. Abstract: Domain Incremental Learning (DIL) aims to learn from non-stationary data streams across domains while retaining and utilizing past knowledge. Although prompt-based methods effectively store multi-domain knowledge in prompt parameters and obtain advanced performance through cross-domain prompt fusion, we reveal an intrinsic limitation: component-wise misalignment between domain-specific prompts leads to conflicting knowledge integration and degraded predictions. This arises from the random positioning of knowledge components within prompts, where irrelevant component fusion introduces interference. To address this, we propose Componential Prompt-Knowledge Alignment (KA-Prompt), a novel prompt-based DIL method that introduces component-aware prompt-knowledge alignment during training, significantly improving both the learning and inference capacity of the model. KA-Prompt operates in two phases: (1) Initial Componential Structure Configuring, where a set of old prompts containing knowledge relevant to the new domain are mined via greedy search, which is then exploited to initialize new prompts to achieve reusable knowledge transfer and establish intrinsic alignment between new and old prompts. (2) Online Alignment Preservation, which dynamically identifies the target old prompts and applies adaptive componential consistency constraints as new prompts evolve. Extensive experiments on DIL benchmarks demonstrate the effectiveness of our KA-Prompt. Our source code is available at https://github.com/zhoujiahuan1991/ICML2025-KA-Prompt

[63] Active Sampling for MRI-based Sequential Decision Making

Yuning Du,Jingshuai Liu,Rohan Dharmakumar,Sotirios A. Tsaftaris

Main category: cs.CV

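TL;DR: Proposes a multi-objective reinforcement learning framework for comprehensive, sequential diagnostic evaluation from undersampled k-space data, substantially reducing the number of samples needed to use MRI as a point-of-care device.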

Details

Motivation: Despite MRI's superior diagnostic capability, its use as a point-of-care device is still limited by high cost and complexity. Lowering the magnetic field strength and improving sampling strategies could make this goal reachable. Method: Use a multi-objective reinforcement learning framework, trained with a step-wise weighting reward function to optimize the sampling strategy, that adapts to sequential decisions during inference. Result: On two knee pathology assessment tasks, the method performs competitively on disease detection, severity quantification, and overall sequential diagnosis while substantially reducing the k-space samples required. Conclusion: The method paves the way for MRI as a comprehensive and affordable point-of-care device. Abstract: Despite the superior diagnostic capability of Magnetic Resonance Imaging (MRI), its use as a Point-of-Care (PoC) device remains limited by high cost and complexity. To enable such a future by reducing the magnetic field strength, one key approach will be to improve sampling strategies. Previous work has shown that it is possible to make diagnostic decisions directly from k-space with fewer samples. Such work shows that single diagnostic decisions can be made, but if we aspire to see MRI as a true PoC, multiple and sequential decisions are necessary while minimizing the number of samples acquired. We present a novel multi-objective reinforcement learning framework enabling comprehensive, sequential, diagnostic evaluation from undersampled k-space data. Our approach during inference actively adapts to sequential decisions to optimally sample. To achieve this, we introduce a training methodology that identifies the samples that contribute the best to each diagnostic objective using a step-wise weighting reward function. We evaluate our approach in two sequential knee pathology assessment tasks: ACL sprain detection and cartilage thickness loss assessment. Our framework achieves diagnostic performance competitive with various policy-based benchmarks on disease detection, severity quantification, and overall sequential diagnosis, while substantially saving k-space samples. Our approach paves the way for the future of MRI as a comprehensive and affordable PoC device. Our code is publicly available at https://github.com/vios-s/MRI_Sequential_Active_Sampling

[64] MonoCoP: Chain-of-Prediction for Monocular 3D Object Detection

Zhihao Zhang,Abhinav Kumar,Girish Chandar Ganesan,Xiaoming Liu

Main category: cs.CV

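TL;DR: MonoCoP uses a Chain-of-Prediction (CoP) to predict 3D attributes sequentially, exploiting the correlations among attributes to improve depth estimation accuracy in monocular 3D object detection.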

Details

Motivation: Existing methods overlook the intrinsic correlations among 3D attributes, which limits depth estimation accuracy. Method: Propose MonoCoP, which comprises a lightweight AttributeNet, explicit chained feature propagation, and residual connections. Result: State-of-the-art performance on the KITTI, Waymo, and nuScenes datasets. Conclusion: Chained prediction improves the accuracy and stability of monocular 3D object detection. Abstract: Accurately predicting 3D attributes is crucial for monocular 3D object detection (Mono3D), with depth estimation posing the greatest challenge due to the inherent ambiguity in mapping 2D images to 3D space. While existing methods leverage multiple depth cues (e.g., estimating depth uncertainty, modeling depth error) to improve depth accuracy, they overlook that accurate depth prediction requires conditioning on other 3D attributes, as these attributes are intrinsically inter-correlated through the 3D to 2D projection, which ultimately limits overall accuracy and stability. Inspired by Chain-of-Thought (CoT) in large language models (LLMs), this paper proposes MonoCoP, which leverages a Chain-of-Prediction (CoP) to predict attributes sequentially and conditionally via three key designs. First, it employs a lightweight AttributeNet (AN) for each 3D attribute to learn attribute-specific features. Next, MonoCoP constructs an explicit chain to propagate these learned features from one attribute to the next. Finally, MonoCoP uses a residual connection to aggregate features for each attribute along the chain, ensuring that later attribute predictions are conditioned on all previously processed attributes without forgetting the features of earlier ones. Experimental results show that our MonoCoP achieves state-of-the-art (SoTA) performance on the KITTI leaderboard without requiring additional data and further surpasses existing methods on the Waymo and nuScenes frontal datasets.
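The three designs described above (per-attribute networks, an explicit chain, residual aggregation) can be pictured with a toy sketch. It is only a schematic reading of the abstract: features are plain Python lists, the residual aggregation is modeled as an element-wise running sum, and the attribute order and all names (`chain_of_prediction`, `attribute_nets`, `heads`) are hypothetical rather than taken from the paper.

```python
def chain_of_prediction(x, attribute_nets, heads):
    """Predict attributes one after another along a chain.

    attribute_nets: ordered list of (name, net) pairs; each net maps the
        shared input x to an attribute-specific feature vector (a list).
    heads: dict mapping name -> callable that turns a feature vector
        into a prediction for that attribute.
    """
    carried = None  # aggregated features of all earlier attributes
    outputs = {}
    for name, net in attribute_nets:
        feat = net(x)
        if carried is not None:
            # Residual-style aggregation: add what earlier attributes learned,
            # so this prediction is conditioned on all of them.
            feat = [f + c for f, c in zip(feat, carried)]
        outputs[name] = heads[name](feat)
        carried = feat  # propagate along the chain
    return outputs
```

Because `carried` accumulates every earlier feature vector, the last attribute in the chain sees contributions from all previous ones, which mirrors the stated goal of conditioning later predictions without forgetting earlier features.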

[65] OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning

Xianhang Li,Yanqing Liu,Haoqin Tu,Hongru Zhu,Cihang Xie

Main category: cs.CV

TL;DR: OpenVision is a fully open family of vision encoders that match or surpass CLIP, offered in sizes from 5.9M to 632.1M parameters.

Details

Motivation: Existing vision encoders such as CLIP are not fully open; the paper fills this gap with an open-source, efficient alternative. Method: Build on existing work (the CLIPS training framework and the Recap-DataComp-1B data), improve encoder quality, and integrate the encoders into multimodal frameworks. Result: OpenVision performs well across parameter scales: larger models improve performance, while smaller models suit edge deployment. Conclusion: OpenVision gives multimodal models an efficient, flexible choice of vision encoder and supports the open-source community. Abstract: OpenAI's CLIP, released in early 2021, has long been the go-to choice of vision encoder for building multimodal foundation models. Although recent alternatives such as SigLIP have begun to challenge this status quo, to our knowledge none are fully open: their training data remains proprietary and/or their training recipes are not released. This paper fills this gap with OpenVision, a fully-open, cost-effective family of vision encoders that match or surpass the performance of OpenAI's CLIP when integrated into multimodal frameworks like LLaVA. OpenVision builds on existing works -- e.g., CLIPS for training framework and Recap-DataComp-1B for training data -- while revealing multiple key insights in enhancing encoder quality and showcasing practical benefits in advancing multimodal models. By releasing vision encoders spanning from 5.9M to 632.1M parameters, OpenVision offers practitioners a flexible trade-off between capacity and efficiency in building multimodal models: larger models deliver enhanced multimodal performance, while smaller versions enable lightweight, edge-ready multimodal deployments.

[66] FastMap: Revisiting Dense and Scalable Structure from Motion

Jiahao Li,Haochen Wang,Muhammad Zubair Irshad,Igor Vasiljevic,Matthew R. Walter,Vitor Campagnolo Guizilini,Greg Shakhnarovich

Main category: cs.CV

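TL;DR: FastMap is a new global structure-from-motion method focused on speed and simplicity, addressing the poor scalability of existing methods such as COLMAP and GLOMAP on large-scale scenes.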

Details

Motivation: Existing methods such as COLMAP and GLOMAP estimate high-precision camera poses but scale poorly when the number of matched keypoint pairs is large, mainly because of insufficient parallelization and computationally expensive optimization steps. Method: FastMap is an SfM framework built entirely on GPU-friendly operations, so it parallelizes easily; each optimization step runs in time linear in the number of image pairs, independent of the number of keypoint pairs or 3D points. Result: Experiments show that FastMap is one to two orders of magnitude faster than COLMAP and GLOMAP on large-scale scenes with comparable pose accuracy. Conclusion: By improving parallelization and computational efficiency, FastMap greatly increases the speed and scalability of global structure from motion. Abstract: We propose FastMap, a new global structure from motion method focused on speed and simplicity. Previous methods like COLMAP and GLOMAP are able to estimate high-precision camera poses, but suffer from poor scalability when the number of matched keypoint pairs becomes large. We identify two key factors leading to this problem: poor parallelization and computationally expensive optimization steps. To overcome these issues, we design an SfM framework that relies entirely on GPU-friendly operations, making it easily parallelizable. Moreover, each optimization step runs in time linear in the number of image pairs, independent of keypoint pairs or 3D points. Through extensive experiments, we show that FastMap is one to two orders of magnitude faster than COLMAP and GLOMAP on large-scale scenes with comparable pose accuracy.

[67] Person Recognition at Altitude and Range: Fusion of Face, Body Shape and Gait

Feng Liu,Nicholas Chimitt,Lanqing Guo,Jitesh Jain,Aditya Kane,Minchul Kim,Wes Robbins,Yiyang Su,Dingqiang Ye,Xingguang Zhang,Jie Zhu,Siddharth Satyakam,Christopher Perry,Stanley H. Chan,Arun Ross,Humphrey Shi,Zhangyang Wang,Anil Jain,Xiaoming Liu

Main category: cs.CV

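TL;DR: FarSight is an end-to-end multimodal biometric recognition system for whole-body person recognition at long range and in adverse conditions. By integrating face, gait, and body shape cues, it substantially improves recognition performance.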

Details

Motivation: Address whole-body person recognition at long range, from elevated viewing angles, and under adverse atmospheric conditions (such as turbulence and high wind), as in the surveillance scenarios of the BRIAR program. Method: FarSight comprises four core modules: multi-subject detection and tracking, recognition-aware video restoration, modality-specific biometric feature encoding, and quality-guided multimodal fusion. Result: On the BRIAR dataset, FarSight achieves a 34.1% gain in 1:1 verification, a 17.8% gain in closed-set identification, and a 34.3% reduction in open-set identification errors. Conclusion: FarSight performs strongly under challenging real-world conditions and is a state-of-the-art solution for biometric recognition. Abstract: We address the problem of whole-body person recognition in unconstrained environments. This problem arises in surveillance scenarios such as those in the IARPA Biometric Recognition and Identification at Altitude and Range (BRIAR) program, where biometric data is captured at long standoff distances, elevated viewing angles, and under adverse atmospheric conditions (e.g., turbulence and high wind velocity). To this end, we propose FarSight, a unified end-to-end system for person recognition that integrates complementary biometric cues across face, gait, and body shape modalities. FarSight incorporates novel algorithms across four core modules: multi-subject detection and tracking, recognition-aware video restoration, modality-specific biometric feature encoding, and quality-guided multi-modal fusion. These components are designed to work cohesively under degraded image conditions, large pose and scale variations, and cross-domain gaps. Extensive experiments on the BRIAR dataset, one of the most comprehensive benchmarks for long-range, multi-modal biometric recognition, demonstrate the effectiveness of FarSight. Compared to our preliminary system, this system achieves a 34.1% absolute gain in 1:1 verification accuracy (TAR@0.1% FAR), a 17.8% increase in closed-set identification (Rank-20), and a 34.3% reduction in open-set identification errors (FNIR@1% FPIR). Furthermore, FarSight was evaluated in the 2025 NIST RTE Face in Video Evaluation (FIVE), which conducts standardized face recognition testing on the BRIAR dataset. These results establish FarSight as a state-of-the-art solution for operational biometric recognition in challenging real-world conditions.
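The verification metric quoted above, TAR@0.1% FAR, is the true accept rate at the score threshold where at most 0.1% of impostor comparisons are accepted. The sketch below shows one simple way to compute it from raw similarity scores; `tar_at_far` is a hypothetical helper, and the tie handling and threshold convention are assumptions that official evaluation protocols may define differently.

```python
def tar_at_far(genuine_scores, impostor_scores, far):
    """True accept rate at a threshold whose false accept rate is at most `far`.

    Scores are similarities (higher means more likely the same person).
    A comparison is accepted when its score is strictly above the threshold.
    """
    impostors = sorted(impostor_scores, reverse=True)
    allowed = int(far * len(impostors))  # impostor accepts we may tolerate
    # Setting the threshold at the (allowed + 1)-th highest impostor score
    # means at most `allowed` impostors score strictly above it.
    threshold = impostors[allowed] if allowed < len(impostors) else float("-inf")
    accepted = sum(1 for s in genuine_scores if s > threshold)
    return accepted / len(genuine_scores)
```

At FAR = 0.1%, at least 1,000 impostor comparisons are needed before even one false accept is tolerated, which is why such metrics are reported on large benchmarks.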

[68] On Path to Multimodal Generalist: General-Level and General-Bench

Hao Fei,Yuan Zhou,Juncheng Li,Xiangtai Li,Qingshan Xu,Bobo Li,Shengqiong Wu,Yaoting Wang,Junbao Zhou,Jiahao Meng,Qingyu Shi,Zhiyuan Zhou,Liangtao Shi,Minghe Gao,Daoan Zhang,Zhiqi Ge,Weiming Wu,Siliang Tang,Kaihang Pan,Yaobo Ye,Haobo Yuan,Tao Zhang,Tianjie Ju,Zixiang Meng,Shilin Xu,Liyu Jia,Wentao Hu,Meng Luo,Jiebo Luo,Tat-Seng Chua,Shuicheng Yan,Hanwang Zhang

Main category: cs.CV

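TL;DR: The paper proposes General-Level, an evaluation framework for measuring the performance and generality of multimodal large language models (MLLMs), and introduces the concept of Synergy and the General-Bench benchmark.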

Details

Motivation: Existing MLLM evaluations cannot tell whether higher task performance brings a model closer to human-level AI; a more systematic evaluation framework is needed. Method: Define a five-level evaluation scale and the concept of Synergy, and build the General-Bench benchmark with over 700 tasks and 325,800 instances. Result: Evaluating over 100 MLLMs reveals the capability rankings of generalist models and the challenges in reaching genuine AI. Conclusion: The framework provides infrastructure for future research on multimodal foundation models and aims to accelerate progress toward artificial general intelligence (AGI). Abstract: The Multimodal Large Language Model (MLLM) is currently experiencing rapid growth, driven by the advanced capabilities of LLMs. Unlike earlier specialists, existing MLLMs are evolving towards a Multimodal Generalist paradigm. Initially limited to understanding multiple modalities, these models have advanced to not only comprehend but also generate across modalities. Their capabilities have expanded from coarse-grained to fine-grained multimodal understanding and from supporting limited modalities to arbitrary ones. While many benchmarks exist to assess MLLMs, a critical question arises: Can we simply assume that higher performance across tasks indicates a stronger MLLM capability, bringing us closer to human-level AI? We argue that the answer is not as straightforward as it seems. This project introduces General-Level, an evaluation framework that defines five levels of MLLM performance and generality, offering a methodology to compare MLLMs and gauge the progress of existing systems towards more robust multimodal generalists and, ultimately, towards AGI. At the core of the framework is the concept of Synergy, which measures whether models maintain consistent capabilities across comprehension and generation, and across multiple modalities. To support this evaluation, we present General-Bench, which encompasses a broader spectrum of skills, modalities, formats, and capabilities, including over 700 tasks and 325,800 instances. The evaluation results that involve over 100 existing state-of-the-art MLLMs uncover the capability rankings of generalists, highlighting the challenges in reaching genuine AI. We expect this project to pave the way for future research on next-generation multimodal foundation models, providing a robust infrastructure to accelerate the realization of AGI. Project page: https://generalist.top/

cs.GR [Back]

[69] PARC: Physics-based Augmentation with Reinforcement Learning for Character Controllers

Michael Xu,Yi Shi,KangKang Yin,Xue Bin Peng

Main category: cs.GR

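TL;DR: The PARC framework iteratively augments motion datasets using machine learning and physics-based simulation, addressing the data scarcity that hinders the development of agile terrain traversal controllers.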

Details

Motivation: Motion capture data for agile terrain traversal is scarce and costly to acquire, which makes it hard to simulate dynamic maneuvers in complex environments. Method: PARC combines machine learning with physics-based simulation, iteratively training a motion generator and a physics-based tracking controller to expand both the dataset and the controller's capabilities step by step. Result: PARC produces agile, versatile terrain traversal models, bridging the gap between limited motion data and the needs of controllers. Conclusion: PARC offers an effective way to develop controllers that interact with complex environments despite scarce data. Abstract: Humans excel in navigating diverse, complex environments with agile motor skills, exemplified by parkour practitioners performing dynamic maneuvers, such as climbing up walls and jumping across gaps. Reproducing these agile movements with simulated characters remains challenging, in part due to the scarcity of motion capture data for agile terrain traversal behaviors and the high cost of acquiring such data. In this work, we introduce PARC (Physics-based Augmentation with Reinforcement Learning for Character Controllers), a framework that leverages machine learning and physics-based simulation to iteratively augment motion datasets and expand the capabilities of terrain traversal controllers. PARC begins by training a motion generator on a small dataset consisting of core terrain traversal skills. The motion generator is then used to produce synthetic data for traversing new terrains. However, these generated motions often exhibit artifacts, such as incorrect contacts or discontinuities. To correct these artifacts, we train a physics-based tracking controller to imitate the motions in simulation. The corrected motions are then added to the dataset, which is used to continue training the motion generator in the next iteration. PARC's iterative process jointly expands the capabilities of the motion generator and tracker, creating agile and versatile models for interacting with complex environments. PARC provides an effective approach to develop controllers for agile terrain traversal, which bridges the gap between the scarcity of motion data and the need for versatile character controllers.

[70] TerraFusion: Joint Generation of Terrain Geometry and Texture Using Latent Diffusion Models

Kazuki Higo,Toshiki Kanai,Yuki Endo,Yoshihiro Kanamori

Main category: cs.GR

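TL;DR: Proposes a latent diffusion method that jointly generates terrain heightmaps and textures, combining unsupervised training with a supervised adapter for user control.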

Details

Motivation: Existing methods usually generate either a heightmap or a texture on its own and fail to capture the correlation between the two, which hurts realism. Method: Train a latent diffusion model without supervision to generate paired random heightmaps and textures, then train an external adapter with supervision to enable control through hand-drawn sketches. Result: Experiments show that the method generates terrain intuitively while preserving the correlation between heightmaps and textures. Conclusion: The method addresses the heightmap-texture correlation problem in terrain generation, improving realism and user control. Abstract: 3D terrain models are essential in fields such as video game development and film production. Since surface color often correlates with terrain geometry, capturing this relationship is crucial to achieving realism. However, most existing methods generate either a heightmap or a texture, without sufficiently accounting for the inherent correlation. In this paper, we propose a method that jointly generates terrain heightmaps and textures using a latent diffusion model. First, we train the model in an unsupervised manner to randomly generate paired heightmaps and textures. Then, we perform supervised learning of an external adapter to enable user control via hand-drawn sketches. Experiments show that our approach allows intuitive terrain generation while preserving the correlation between heightmaps and textures.

[71] BuildingBlock: A Hybrid Approach for Structured Building Generation

Junming Huang,Chi Wang,Letian Li,Changxin Huang,Qiang Dai,Weiwei Xu

Main category: cs.GR

TL;DR: BuildingBlock is a hybrid approach that combines generative models, procedural content generation (PCG), and large language models (LLMs) to generate diverse and hierarchically coherent 3D buildings.

Details

Motivation: Current 3D building generation methods fall short on diversity and structural coherence, which limits their use in gaming, virtual reality, and similar fields. Method: A two-phase pipeline: the Layout Generation Phase (LGP) and the Building Construction Phase (BCP). LGP treats layout generation as a point-cloud generation task and uses a Transformer-based diffusion model to produce globally consistent layouts, which LLMs extend into hierarchical designs; BCP uses these layouts to guide PCG toward high-quality buildings. Result: Experiments show that BuildingBlock generates diverse and hierarchically structured buildings and achieves state-of-the-art results on multiple benchmarks. Conclusion: The approach points toward scalable and intuitive architectural workflows. Abstract: Three-dimensional building generation is vital for applications in gaming, virtual reality, and digital twins, yet current methods face challenges in producing diverse, structured, and hierarchically coherent buildings. We propose BuildingBlock, a hybrid approach that integrates generative models, procedural content generation (PCG), and large language models (LLMs) to address these limitations. Specifically, our method introduces a two-phase pipeline: the Layout Generation Phase (LGP) and the Building Construction Phase (BCP). LGP reframes box-based layout generation as a point-cloud generation task, utilizing a newly constructed architectural dataset and a Transformer-based diffusion model to create globally consistent layouts. With LLMs, these layouts are extended into rule-based hierarchical designs, seamlessly incorporating component styles and spatial structures. The BCP leverages these layouts to guide PCG, enabling local-customizable, high-quality structured building generation. Experimental results demonstrate BuildingBlock's effectiveness in generating diverse and hierarchically structured buildings, achieving state-of-the-art results on multiple benchmarks, and paving the way for scalable and intuitive architectural workflows.

[72] Person-In-Situ: Scene-Consistent Human Image Insertion with Occlusion-Aware Pose Control

Shun Masuda,Yuki Endo,Yoshihiro Kanamori

Main category: cs.GR

TL;DR: Proposes two methods that control pose through a 3D body model and synthesize the person with latent diffusion models, addressing the shortcomings of existing methods in occlusion handling and depth placement.

Details

Motivation: Existing methods for inserting a person into a scene cannot handle occlusion correctly and offer limited pose control. Method: Two methods are proposed: (1) a two-stage approach that first learns a depth map of the scene and then synthesizes the person; (2) a direct synthesis approach that learns occlusion implicitly. Result: Quantitative and qualitative evaluations show that both methods outperform existing approaches in scene consistency and occlusion handling. Conclusion: The new methods insert people into scenes more naturally and support user-specified poses. Abstract: Compositing human figures into scene images has broad applications in areas such as entertainment and advertising. However, existing methods often cannot handle occlusion of the inserted person by foreground objects and unnaturally place the person in the frontmost layer. Moreover, they offer limited control over the inserted person's pose. To address these challenges, we propose two methods. Both allow explicit pose control via a 3D body model and leverage latent diffusion models to synthesize the person at a contextually appropriate depth, naturally handling occlusions without requiring occlusion masks. The first is a two-stage approach: the model first learns a depth map of the scene with the person through supervised learning, and then synthesizes the person accordingly. The second method learns occlusion implicitly and synthesizes the person directly from input data without explicit depth supervision. Quantitative and qualitative evaluations show that both methods outperform existing approaches by better preserving scene consistency while accurately reflecting occlusions and user-specified poses.

[73] ELGAR: Expressive Cello Performance Motion Generation for Audio Rendition

Zhiping Qiu,Yitong Jin,Yuan Wang,Yi Shi,Chongwu Wang,Chao Tan,Xiaobing Li,Feng Yu,Tao Yu,Qionghai Dai

Main category: cs.GR

TL;DR: ELGAR is a state-of-the-art diffusion-based framework that generates whole-body, fine-grained instrument performance motion from audio alone, emphasizes hand and bow contact interactions, and introduces new evaluation metrics.

Details

Motivation: Generating instrument performance motion requires capturing intricate dynamics and performer-instrument interaction, yet existing methods model only partial body motion; ELGAR is proposed for more complete generation. Method: Introduces the HICL and BICL losses to keep the interaction realistic, designs new evaluation metrics such as finger-contact distance and bow-string distance, and builds the SPD-GEN dataset. Result: ELGAR generates performance motions with complicated and fast interactions, validating the approach, and shows potential for applications in animation, music education, and related areas. Conclusion: ELGAR offers a new direction for instrument performance motion generation and supports further development in related fields. Abstract: The art of instrument performance stands as a vivid manifestation of human creativity and emotion. Nonetheless, generating instrument performance motions is a highly challenging task, as it requires not only capturing intricate movements but also reconstructing the complex dynamics of the performer-instrument interaction. While existing works primarily focus on modeling partial body motions, we propose Expressive ceLlo performance motion Generation for Audio Rendition (ELGAR), a state-of-the-art diffusion-based framework for whole-body fine-grained instrument performance motion generation solely from audio. To emphasize the interactive nature of the instrument performance, we introduce Hand Interactive Contact Loss (HICL) and Bow Interactive Contact Loss (BICL), which effectively guarantee the authenticity of the interplay. Moreover, to better evaluate whether the generated motions align with the semantic context of the music audio, we design novel metrics specifically for string instrument performance motion generation, including finger-contact distance, bow-string distance, and bowing score. Extensive evaluations and ablation studies are conducted to validate the efficacy of the proposed methods. In addition, we put forward a motion generation dataset SPD-GEN, collated and normalized from the MoCap dataset SPD. As demonstrated, ELGAR has shown great potential in generating instrument performance motions with complicated and fast interactions, which will promote further development in areas such as animation, music education, interactive art creation, etc.

[74] Geometry-Aware Texture Generation for 3D Head Modeling with Artist-driven Control

Amin Fadaeinejad,Abdallah Dib,Luiz Gustavo Hafemann,Emeline Got,Trevor Anderson,Amaury Depierre,Nikolaus F. Troje,Marcus A. Brubaker,Marc-André Carbonneau

Main category: cs.GR

TL;DR: Proposes a novel framework whose geometry-aware texture synthesis pipeline gives artists intuitive control over generated 3D heads and streamlines virtual character creation.

Details

Motivation: Creating 3D head assets for virtual characters that match a precise artistic vision remains labor-intensive and calls for a more efficient solution. Method: A geometry-aware texture synthesis pipeline learns correlations between head geometry and skin texture across different demographics and offers three levels of artistic control. Result: Experiments show that the method produces diverse results with clean geometries and supports practical features such as skin tone adjustment and detail editing. Conclusion: The integrated framework streamlines the artistic workflow in virtual character creation, improving efficiency and intuitiveness. Abstract: Creating realistic 3D head assets for virtual characters that match a precise artistic vision remains labor-intensive. We present a novel framework that streamlines this process by providing artists with intuitive control over generated 3D heads. Our approach uses a geometry-aware texture synthesis pipeline that learns correlations between head geometry and skin texture maps across different demographics. The framework offers three levels of artistic control: manipulation of overall head geometry, adjustment of skin tone while preserving facial characteristics, and fine-grained editing of details such as wrinkles or facial hair. Our pipeline allows artists to make edits to a single texture map using familiar tools, with our system automatically propagating these changes coherently across the remaining texture maps needed for realistic rendering. Experiments demonstrate that our method produces diverse results with clean geometries. We showcase practical applications focusing on intuitive control for artists, including skin tone adjustments and simplified editing workflows for adding age-related details or removing unwanted features from scanned models. This integrated approach aims to streamline the artistic workflow in virtual character creation.

[75] TetWeave: Isosurface Extraction using On-The-Fly Delaunay Tetrahedral Grids for Gradient-Based Mesh Optimization

Alexandre Binninger,Ruben Wiersma,Philipp Herholz,Olga Sorkine-Hornung

Main category: cs.GR

TL;DR: TetWeave is a novel isosurface representation that jointly optimizes a tetrahedral grid and a directional signed distance to produce high-quality meshes.

Details

Motivation: Overcome the limited flexibility and memory efficiency of predefined grids while guaranteeing the geometric integrity of the extracted mesh. Method: Builds tetrahedral grids on the fly via Delaunay triangulation and optimizes mesh quality together with a directional signed distance. Result: The extracted meshes are watertight, two-manifold, and intersection-free, and memory usage scales near-linearly with the vertex count of the output mesh. Conclusion: TetWeave applies to a broad range of graphics and vision tasks and scales substantially better in memory than predefined grids. Abstract: We introduce TetWeave, a novel isosurface representation for gradient-based mesh optimization that jointly optimizes the placement of a tetrahedral grid used for Marching Tetrahedra and a novel directional signed distance at each point. TetWeave constructs tetrahedral grids on-the-fly via Delaunay triangulation, enabling increased flexibility compared to predefined grids. The extracted meshes are guaranteed to be watertight, two-manifold and intersection-free. The flexibility of TetWeave enables a resampling strategy that places new points where reconstruction error is high and allows to encourage mesh fairness without compromising on reconstruction error. This leads to high-quality, adaptive meshes that require minimal memory usage and few parameters to optimize. Consequently, TetWeave exhibits near-linear memory scaling relative to the vertex count of the output mesh - a substantial improvement over predefined grids. We demonstrate the applicability of TetWeave to a broad range of challenging tasks in computer graphics and vision, such as multi-view 3D reconstruction, mesh compression and geometric texture generation.

[76] PrimitiveAnything: Human-Crafted 3D Primitive Assembly Generation with Auto-Regressive Transformer

Jingwen Ye,Yuze He,Yanning Zhou,Yiqin Zhu,Kaiwen Xiao,Yong-Jin Liu,Wei Yang,Xiao Han

Main category: cs.GR

TL;DR: PrimitiveAnything is a new framework that reformulates shape primitive abstraction as a primitive assembly generation task; it learns the assembly process directly from large-scale human-crafted abstractions and generates high-quality primitive assemblies that align with human perception.

Details

Motivation: Existing primitive abstraction methods are limited in semantic understanding or in generalization and cannot adapt to diverse shape categories. Method: Proposes a shape-conditioned primitive transformer for auto-regressive generation and an ambiguity-free parameterization scheme that represents multiple primitive types in a unified manner. Result: Experiments show that PrimitiveAnything generates high-quality primitive assemblies that align with human perception while maintaining geometric fidelity. Conclusion: The framework performs well across diverse shape categories and shows potential for enabling primitive-based user-generated content (UGC) in games and similar applications. Abstract: Shape primitive abstraction, which decomposes complex 3D shapes into simple geometric elements, plays a crucial role in human visual cognition and has broad applications in computer vision and graphics. While recent advances in 3D content generation have shown remarkable progress, existing primitive abstraction methods either rely on geometric optimization with limited semantic understanding or learn from small-scale, category-specific datasets, struggling to generalize across diverse shape categories. We present PrimitiveAnything, a novel framework that reformulates shape primitive abstraction as a primitive assembly generation task. PrimitiveAnything includes a shape-conditioned primitive transformer for auto-regressive generation and an ambiguity-free parameterization scheme to represent multiple types of primitives in a unified manner. The proposed framework directly learns the process of primitive assembly from large-scale human-crafted abstractions, enabling it to capture how humans decompose complex shapes into primitive elements. Through extensive experiments, we demonstrate that PrimitiveAnything can generate high-quality primitive assemblies that better align with human perception while maintaining geometric fidelity across diverse shape categories. It benefits various 3D applications and shows potential for enabling primitive-based user-generated content (UGC) in games. Project page: https://primitiveanything.github.io

cs.CL

Trilok Padhi,Ramneet Kaur,Adam D. Cobb,Manoj Acharya,Anirban Roy,Colin Samplawski,Brian Matejek,Alexander M. Berenbeim,Nathaniel D. Bastian,Susmit Jha

Main category: cs.CL

TL;DR: Proposes a new method for calibrating uncertainty quantification (UQ) in multi-modal large language models (LLMs), improving calibration by combining cross-modal consistency with self-consistency.

Details

Motivation: Existing UQ methods for multi-modal LLMs often overestimate confidence because the model stays consistent even when it is wrong, which leads to poor calibration. Method: Ground the textual responses in the visual inputs, use the grounding model's confidence to calibrate the overall confidence, and apply temperature scaling to calibrate the grounding model itself. Result: The framework achieves significantly improved calibration on medical question answering (Slake) and visual question answering (VQAv2). Conclusion: The method improves confidence calibration on multi-modal tasks and makes the models more reliable. Abstract: We introduce a novel approach for calibrating uncertainty quantification (UQ) tailored for multi-modal large language models (LLMs). Existing state-of-the-art UQ methods rely on consistency among multiple responses generated by the LLM on an input query under diverse settings. However, these approaches often report higher confidence in scenarios where the LLM is consistently incorrect. This leads to a poorly calibrated confidence with respect to accuracy. To address this, we leverage cross-modal consistency in addition to self-consistency to improve the calibration of the multi-modal models. Specifically, we ground the textual responses to the visual inputs. The confidence from the grounding model is used to calibrate the overall confidence. Given that using a grounding model adds its own uncertainty in the pipeline, we apply temperature scaling - a widely accepted parametric calibration technique - to calibrate the grounding model's confidence in the accuracy of generated responses. We evaluate the proposed approach across multiple multi-modal tasks, such as medical question answering (Slake) and visual question answering (VQAv2), considering multi-modal models such as LLaVA-Med and LLaVA. The experiments demonstrate that the proposed framework achieves significantly improved calibration on both tasks.
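
The temperature scaling step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy validation logits, the grid of candidate temperatures, and the function names `softmax`, `nll`, and `fit_temperature` are all assumptions made for illustration. The idea is to pick a single scalar T that minimizes the negative log-likelihood of held-out labels when logits are divided by T.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (T > 1 flattens, T < 1 sharpens)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)

def fit_temperature(logit_rows, labels, candidates=None):
    """Pick the temperature with the lowest validation NLL (simple grid search)."""
    if candidates is None:
        candidates = [0.25 * k for k in range(1, 41)]  # 0.25, 0.50, ..., 10.0
    return min(candidates, key=lambda t: nll(logit_rows, labels, t))

# Hypothetical validation set: the model is equally confident on every example,
# but one of the four predictions is wrong, so it is overconfident.
val_logits = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
val_labels = [0, 0, 1, 1]

T = fit_temperature(val_logits, val_labels)
calibrated = softmax(val_logits[0], T)
```

A fitted T above 1 softens the probabilities without changing which class is predicted, which is why the technique affects calibration but not accuracy.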

[78] Hesitation is defeat? Connecting Linguistic and Predictive Uncertainty

Gianluca Manzo,Julia Ive

Main category: cs.CL

TL;DR: Examines uncertainty quantification for deep learning models in chest radiograph interpretation and compares predictive uncertainty with human linguistic uncertainty.

Details

Motivation: In medical settings, optimizing predictive performance alone is insufficient; quantifying uncertainty is equally important for clinical decision-making and large-scale screening. Method: Uses BERT as the model, evaluates different binarisation methods for uncertainty labels, and compares Monte Carlo Dropout and Deep Ensembles for estimating predictive uncertainty. Result: The models perform well, but predictive and linguistic uncertainty are only modestly correlated, which shows how hard it is to align machine uncertainty with human uncertainty. Conclusion: Bayesian approximations provide valuable uncertainty estimates, but further refinement is needed to capture the subtleties of human uncertainty. Abstract: Automating chest radiograph interpretation using Deep Learning (DL) models has the potential to significantly improve clinical workflows, decision-making, and large-scale health screening. However, in medical settings, merely optimising predictive performance is insufficient, as the quantification of uncertainty is equally crucial. This paper investigates the relationship between predictive uncertainty, derived from Bayesian Deep Learning approximations, and human/linguistic uncertainty, as estimated from free-text radiology reports labelled by rule-based labellers. Utilising BERT as the model of choice, this study evaluates different binarisation methods for uncertainty labels and explores the efficacy of Monte Carlo Dropout and Deep Ensembles in estimating predictive uncertainty. The results demonstrate good model performance, but also a modest correlation between predictive and linguistic uncertainty, highlighting the challenges in aligning machine uncertainty with human interpretation nuances. Our findings suggest that while Bayesian approximations provide valuable uncertainty estimates, further refinement is necessary to fully capture and utilise the subtleties of human uncertainty in clinical applications.
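
One common way to turn Monte Carlo Dropout or Deep Ensemble outputs into a single uncertainty score is the entropy of the averaged class probabilities. The sketch below shows only that aggregation step; it assumes the per-pass (or per-ensemble-member) probabilities are already available, and the sample values and function names are hypothetical rather than taken from the paper.

```python
import math

def mean_probs(samples):
    """Average class probabilities over stochastic passes or ensemble members."""
    n = len(samples)
    return [sum(s[k] for s in samples) / n for k in range(len(samples[0]))]

def entropy(probs):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def predictive_entropy(samples):
    """Entropy of the averaged prediction, used here as the uncertainty score."""
    return entropy(mean_probs(samples))

# Hypothetical outputs of three dropout-enabled forward passes for two inputs.
agreeing = [[0.90, 0.10], [0.92, 0.08], [0.88, 0.12]]
disagreeing = [[0.90, 0.10], [0.10, 0.90], [0.50, 0.50]]

low = predictive_entropy(agreeing)
high = predictive_entropy(disagreeing)
```

When the passes disagree, the averaged distribution moves toward uniform and the entropy approaches its maximum (ln 2 for two classes), so `high` exceeds `low`.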

[79] A Reasoning-Focused Legal Retrieval Benchmark

Lucia Zheng,Neel Guha,Javokhir Arifov,Sarah Zhang,Michal Skreta,Christopher D. Manning,Peter Henderson,Daniel E. Ho

Main category: cs.CL

TL;DR: Introduces two new legal RAG benchmarks (Bar Exam QA and Housing Statute QA) to address the lack of realistic legal RAG benchmarks and evaluates existing retriever pipelines on them.

Details

Motivation: Legal AI developers use RAG systems to improve performance, but the lack of realistic legal RAG benchmarks hinders the development of specialized RAG systems. Method: Builds two new legal RAG benchmarks through annotation processes that resemble legal research. Result: The results suggest that legal RAG remains a challenging application that needs further research. Conclusion: The paper stresses the importance of legal RAG benchmarks and calls for future research on the remaining challenges. Abstract: As the legal community increasingly examines the use of large language models (LLMs) for various legal applications, legal AI developers have turned to retrieval-augmented LLMs ("RAG" systems) to improve system performance and robustness. An obstacle to the development of specialized RAG systems is the lack of realistic legal RAG benchmarks which capture the complexity of both legal retrieval and downstream legal question-answering. To address this, we introduce two novel legal RAG benchmarks: Bar Exam QA and Housing Statute QA. Our tasks correspond to real-world legal research tasks, and were produced through annotation processes which resemble legal research. We describe the construction of these benchmarks and the performance of existing retriever pipelines. Our results suggest that legal RAG remains a challenging application, thus motivating future research.

[80] Divide, Optimize, Merge: Fine-Grained LLM Agent Optimization at Scale

Jiale Liu,Yifan Zeng,Shaokun Zhang,Chi Zhang,Malte Højmark-Bertelsen,Marie Normann Gadeberg,Huazheng Wang,Qingyun Wu

Main category: cs.CL

TL;DR: Proposes the Fine-Grained Optimization (FGO) framework, which splits large optimization tasks into subsets and merges the results progressively, addressing the context window overflow and degraded pattern recognition that conventional LLM optimization suffers as datasets grow. Experiments show that FGO outperforms existing approaches on several benchmarks while substantially reducing prompt token consumption.

Details

Motivation: Conventional LLM optimization methods face context window overflow and degraded pattern recognition on large datasets, so a scalable solution is needed. Method: Proposes the FGO framework, which divides large optimization tasks into subsets, optimizes each in a targeted way, and combines the optimized components through progressive merging. Result: On the ALFWorld, LogisticsQA, and GAIA benchmarks, FGO improves performance by 1.6-8.6% while reducing average prompt token consumption by 56.3%. Conclusion: FGO offers a scalable solution for LLM-based optimization of increasingly sophisticated agent systems. Abstract: LLM-based optimization has shown remarkable potential in enhancing agentic systems. However, the conventional approach of prompting LLM optimizer with the whole training trajectories on training dataset in a single pass becomes untenable as datasets grow, leading to context window overflow and degraded pattern recognition. To address these challenges, we propose Fine-Grained Optimization (FGO), a scalable framework that divides large optimization tasks into manageable subsets, performs targeted optimizations, and systematically combines optimized components through progressive merging. Evaluation across ALFWorld, LogisticsQA, and GAIA benchmarks demonstrate that FGO outperforms existing approaches by 1.6-8.6% while reducing average prompt token consumption by 56.3%. Our framework provides a practical solution for scaling up LLM-based optimization of increasingly sophisticated agent systems. Further analysis demonstrates that FGO achieves the most consistent performance gain in all training dataset sizes, showcasing its scalability and efficiency.
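
The divide, optimize, merge pattern described in the abstract can be sketched as a generic control flow. The abstract does not specify how subsets are formed or how merging proceeds, so the fixed-size chunking, the pairwise merge rounds, and the toy `optimize` and `merge` callables below are assumptions for illustration only; in the paper's setting both would presumably be LLM calls.

```python
from typing import Callable, List, Sequence, TypeVar

Item = TypeVar("Item")
Result = TypeVar("Result")

def divide(items: Sequence[Item], subset_size: int) -> List[List[Item]]:
    """Split the training items into consecutive subsets of at most subset_size."""
    return [list(items[i:i + subset_size]) for i in range(0, len(items), subset_size)]

def progressive_merge(results: List[Result],
                      merge: Callable[[Result, Result], Result]) -> Result:
    """Merge results pairwise, round by round, until a single result remains."""
    if not results:
        raise ValueError("nothing to merge")
    level = list(results)
    while len(level) > 1:
        merged = [merge(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            merged.append(level[-1])  # an odd leftover carries over to the next round
        level = merged
    return level[0]

def fine_grained_optimize(items: Sequence[Item], subset_size: int,
                          optimize: Callable[[List[Item]], Result],
                          merge: Callable[[Result, Result], Result]) -> Result:
    """Optimize each subset independently, then merge the per-subset results."""
    return progressive_merge([optimize(s) for s in divide(items, subset_size)], merge)

# Toy stand-ins: "optimizing" a subset collects the distinct hints it contains,
# and merging two results takes their union.
trajectories = ["check inventory", "open drawer", "check inventory", "read label", "open drawer"]
final = fine_grained_optimize(trajectories, 2, optimize=set, merge=lambda a, b: a | b)
```

Because each `optimize` call sees only `subset_size` items, the per-call input stays bounded no matter how large the full dataset becomes, which is the property the abstract contrasts with single-pass prompting.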

[81] X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains

Qianchu Liu,Sheng Zhang,Guanghui Qin,Timothy Ossowski,Yu Gu,Ying Jin,Sid Kiblawi,Sam Preston,Mu Wei,Paul Vozila,Tristan Naumann,Hoifung Poon

Main category: cs.CL

TL;DR: X-Reasoner acquires reasoning ability that generalizes across modalities and domains through post-training on general-domain text alone, outperforming existing models.

Details

Motivation: Investigate whether reasoning ability generalizes across modalities and domains. Method: A two-stage approach: supervised fine-tuning followed by reinforcement learning. Result: X-Reasoner performs strongly on cross-modal and cross-domain tasks, and X-Reasoner-Med sets a new state of the art in the medical domain. Conclusion: Post-training on general-domain text can yield strong generalizable reasoning, and domain-specific data can improve performance further. Abstract: Recent proprietary models (e.g., o3) have begun to demonstrate strong multimodal reasoning capabilities. Yet, most existing open-source research concentrates on training text-only reasoning models, with evaluations limited to mainly mathematical and general-domain tasks. Therefore, it remains unclear how to effectively extend reasoning capabilities beyond text input and general domains. This paper explores a fundamental research question: Is reasoning generalizable across modalities and domains? Our findings support an affirmative answer: General-domain text-based post-training can enable such strong generalizable reasoning. Leveraging this finding, we introduce X-Reasoner, a vision-language model post-trained solely on general-domain text for generalizable reasoning, using a two-stage approach: an initial supervised fine-tuning phase with distilled long chain-of-thoughts, followed by reinforcement learning with verifiable rewards. Experiments show that X-Reasoner successfully transfers reasoning capabilities to both multimodal and out-of-domain settings, outperforming existing state-of-the-art models trained with in-domain and multimodal data across various general and medical benchmarks (Figure 1). Additionally, we find that X-Reasoner's performance in specialized domains can be further enhanced through continued training on domain-specific text-only data. Building upon this, we introduce X-Reasoner-Med, a medical-specialized variant that achieves new state of the art on numerous text-only and multimodal medical benchmarks.

[82] SLOT: Structuring the Output of Large Language Models

Darren Yow-Bang Wang,Zhengyuan Shen,Soumya Smruti Mishra,Zhichao Xu,Yifei Teng,Haibo Ding

Main category: cs.CL

TL;DR: SLOT is a model-agnostic approach that uses a fine-tuned lightweight language model to convert unstructured LLM output into precise structured formats, markedly improving schema accuracy and content fidelity.

Details

Motivation: In critical applications, LLM outputs often deviate from predefined schemas and undermine reliability, so a flexible and efficient method for structured output is needed. Method: SLOT uses a fine-tuned lightweight language model as a post-processing layer, together with a data curation and synthesis pipeline and an evaluation methodology, and works across various LLMs and schema specifications. Result: A fine-tuned Mistral-7B model with constrained decoding reaches 99.5% schema accuracy and 94.0% content similarity, outperforming Claude-3.5-Sonnet; compact models can also match or exceed much larger models. Conclusion: SLOT provides reliable structured generation in resource-constrained environments and demonstrates the potential of lightweight models. Abstract: Structured outputs are essential for large language models (LLMs) in critical applications like agents and information extraction. Despite their capabilities, LLMs often generate outputs that deviate from predefined schemas, significantly hampering reliable application development. We present SLOT (Structured LLM Output Transformer), a model-agnostic approach that transforms unstructured LLM outputs into precise structured formats. While existing solutions predominantly rely on constrained decoding techniques or are tightly coupled with specific models, SLOT employs a fine-tuned lightweight language model as a post-processing layer, achieving flexibility across various LLMs and schema specifications. We introduce a systematic pipeline for data curation and synthesis alongside a formal evaluation methodology that quantifies both schema accuracy and content fidelity. Our results demonstrate that fine-tuned Mistral-7B model with constrained decoding achieves near perfect schema accuracy (99.5%) and content similarity (94.0%), outperforming Claude-3.5-Sonnet by substantial margins (+25 and +20 percentage points, respectively). Notably, even compact models like Llama-3.2-1B can match or exceed the structured output capabilities of much larger proprietary models when equipped with SLOT, enabling reliable structured generation in resource-constrained environments.
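
To make the notion of schema accuracy concrete, here is a minimal sketch of one possible definition: the fraction of outputs that parse as JSON and contain exactly the expected keys with the expected types. The paper's formal metric is not given in the abstract, so this definition, the flat key-to-type schema, and the example outputs are all assumptions.

```python
import json

def matches_schema(text, schema):
    """True if text is a JSON object with exactly the schema's keys and value types."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict) or set(obj) != set(schema):
        return False
    # Note: Python treats bool as a subclass of int; a stricter check could exclude it.
    return all(isinstance(obj[key], expected) for key, expected in schema.items())

def schema_accuracy(outputs, schema):
    """Fraction of outputs that conform to the schema."""
    return sum(matches_schema(o, schema) for o in outputs) / len(outputs)

# Hypothetical schema and model outputs.
schema = {"name": str, "age": int}
outputs = [
    '{"name": "Ada", "age": 36}',   # conforms
    '{"name": "Bob"}',              # missing key
    'name: Eve, age: 29',           # not JSON
    '{"name": "Cy", "age": "40"}',  # wrong type
]

score = schema_accuracy(outputs, schema)
```

A real evaluation would more likely validate against a full JSON Schema with nested fields; the flat check above only shows the shape of the computation.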

[83] Advancing and Benchmarking Personalized Tool Invocation for LLMs

Xu Huang,Yuefeng Huang,Weiwen Liu,Xingshan Zeng,Yasheng Wang,Ruiming Tang,Hong Xie,Defu Lian

Main category: cs.CL

TL;DR: Introduces the concept of Personalized Tool Invocation and defines two key tasks: Tool Preference and Profile-dependent Query. The authors propose the PTool framework and the PTBench benchmark and validate their effectiveness.

Details

Motivation: Existing work focuses on the basic ability of LLMs to invoke tools and overlooks personalized constraints in tool invocation; this paper aims to fill that gap. Method: Proposes PTool, a framework for synthesizing personalized tool invocation data, and builds the PTBench benchmark; the framework is validated by fine-tuning open-source models. Result: Experiments show that PTool is effective and yield valuable insights. The PTBench benchmark is publicly available. Conclusion: The work lays a conceptual and practical foundation for personalized tool invocation and advances the use of LLMs in real applications. Abstract: Tool invocation is a crucial mechanism for extending the capabilities of Large Language Models (LLMs) and has recently garnered significant attention. It enables LLMs to solve complex problems through tool calls while accessing up-to-date world knowledge. However, existing work primarily focuses on the fundamental ability of LLMs to invoke tools for problem-solving, without considering personalized constraints in tool invocation. In this work, we introduce the concept of Personalized Tool Invocation and define two key tasks: Tool Preference and Profile-dependent Query. Tool Preference addresses user preferences when selecting among functionally similar tools, while Profile-dependent Query considers cases where a user query lacks certain tool parameters, requiring the model to infer them from the user profile. To tackle these challenges, we propose PTool, a data synthesis framework designed for personalized tool invocation. Additionally, we construct PTBench, the first benchmark for evaluating personalized tool invocation. We then fine-tune various open-source models, demonstrating the effectiveness of our framework and providing valuable insights. Our benchmark is public at https://github.com/hyfshadow/PTBench.

[84] Natural Language Generation in Healthcare: A Review of Methods and Applications

Mengxian Lyu,Xiaohan Li,Ziyi Chen,Jinqian Pan,Cheng Peng,Sankalp Talankar,Yonghui Wu

Main category: cs.CL

TL;DR: Reviews the use of natural language generation (NLG) in healthcare, analyzing 113 publications across data modalities, model architectures, clinical applications, and evaluation methods.

Details

Motivation: Breakthroughs in large language models (LLMs) have shown the potential of NLG in healthcare, but a comprehensive review is lacking. Method: A systematic review of 113 publications following the PRISMA guidelines, categorizing key methods and applications. Result: Summarizes the techniques, applications, and challenges of NLG in healthcare. Conclusion: Provides insights for future research on using NLG to transform healthcare. Abstract: Natural language generation (NLG) is the key technology to achieve generative artificial intelligence (AI). With the breakthroughs in large language models (LLMs), NLG has been widely used in various medical applications, demonstrating the potential to enhance clinical workflows, support clinical decision-making, and improve clinical documentation. Heterogeneous and diverse medical data modalities, such as medical text, images, and knowledge bases, are utilized in NLG. Researchers have proposed many generative models and applied them in a number of healthcare applications. There is a need for a comprehensive review of NLG methods and applications in the medical domain. In this study, we systematically reviewed 113 scientific publications from a total of 3,988 NLG-related articles identified using a literature search, focusing on data modality, model architecture, clinical applications, and evaluation methods. Following PRISMA (Preferred Reporting Items for Systematic reviews and Meta-Analyses) guidelines, we categorize key methods, identify clinical applications, and assess their capabilities, limitations, and emerging challenges. This timely review covers the key NLG technologies and medical applications and provides valuable insights for future studies to leverage NLG to transform medical discovery and healthcare.

[85] Bringing legal knowledge to the public by constructing a legal question bank using large-scale pre-trained language model

Mingruo Yuan,Ben Kao,Tien-Hsuan Wu,Michael M. K. Cheung,Henry W. H. Chan,Anne S. Y. Cheung,Felix W. H. Chan,Yongxi Chen

Main category: cs.CL

TL;DR: Proposes a three-step approach for turning legal knowledge into a form the public can understand: generating legal knowledge snippets, constructing a legal question bank, and designing an interactive recommender system.

Details

Motivation: Legal documents are usually highly technical and hard for the public to understand, so a method is needed to turn legal information into something easy to navigate and comprehend. Method: (1) translate sections of the law into easily understood snippets (CLIC-pages); (2) construct a Legal Question Bank (LQB); (3) design an interactive recommender (CRec). Result: Machine-generated questions (MGQs) are more scalable and more diversified, while human-composed questions (HCQs) are more precise. Conclusion: The three-step approach improves the accessibility and comprehensibility of legal knowledge, and automated tools such as GPT-3 show particular promise for generating the question bank. Abstract: Access to legal information is fundamental to access to justice. Yet accessibility refers not only to making legal documents available to the public, but also rendering legal information comprehensible to them. A vexing problem in bringing legal information to the public is how to turn formal legal documents such as legislation and judgments, which are often highly technical, to easily navigable and comprehensible knowledge to those without legal education. In this study, we formulate a three-step approach for bringing legal knowledge to laypersons, tackling the issues of navigability and comprehensibility. First, we translate selected sections of the law into snippets (called CLIC-pages), each being a small piece of article that focuses on explaining certain technical legal concept in layperson's terms. Second, we construct a Legal Question Bank (LQB), which is a collection of legal questions whose answers can be found in the CLIC-pages. Third, we design an interactive CLIC Recommender (CRec). Given a user's verbal description of a legal situation that requires a legal solution, CRec interprets the user's input and shortlists questions from the question bank that are most likely relevant to the given legal situation and recommends their corresponding CLIC pages where relevant legal knowledge can be found. In this paper we focus on the technical aspects of creating an LQB. We show how large-scale pre-trained language models, such as GPT-3, can be used to generate legal questions. We compare machine-generated questions (MGQs) against human-composed questions (HCQs) and find that MGQs are more scalable, cost-effective, and more diversified, while HCQs are more precise. We also show a prototype of CRec and illustrate through an example how our 3-step approach effectively brings relevant legal knowledge to the public.

[86] Enhancing Granular Sentiment Classification with Chain-of-Thought Prompting in Large Language Models

Vihaan Miriyala,Smrithi Bukkapatnam,Lavanya Prahallad

Main category: cs.CL

TL;DR: Chain-of-Thought (CoT) prompting raises the accuracy of large language models on granular sentiment classification of app store reviews from 84% to 93%.

Details

Motivation: Traditional numeric and polarity-based ratings fail to capture the nuanced sentiment in user feedback, so a more precise method is needed. Method: Compares CoT prompting with simple prompting on 2000 Amazon app reviews, measuring each against human judgements. Result: CoT prompting improves classification accuracy from 84% to 93%. Conclusion: Explicit reasoning through CoT prompting improves sentiment analysis performance. Abstract: We explore the use of Chain-of-Thought (CoT) prompting with large language models (LLMs) to improve the accuracy of granular sentiment categorization in app store reviews. Traditional numeric and polarity-based ratings often fail to capture the nuanced sentiment embedded in user feedback. We evaluated the effectiveness of CoT prompting versus simple prompting on 2000 Amazon app reviews by comparing each method's predictions to human judgements. CoT prompting improved classification accuracy from 84% to 93% highlighting the benefit of explicit reasoning in enhancing sentiment analysis performance.

[87] Unmasking the Canvas: A Dynamic Benchmark for Image Generation Jailbreaking and LLM Content Safety

Variath Madhupal Gautham Nair,Vishal Varma Dantuluri

Main category: cs.CL

TL;DR: Introduces UTCB, a dynamic and scalable benchmark dataset for evaluating LLM vulnerability in image generation, combining multilingual obfuscation with structured prompt engineering.

Details

Motivation: Existing large language models perform well on image generation tasks, but their content safety checks are vulnerable to prompt-based attacks that lead to inappropriate content. Method: Combines structured prompt engineering, multilingual obfuscation (e.g., Zulu, Gaelic, Base64), and evaluation with Groq-hosted LLaMA-3, supporting both zero-shot and fallback prompting strategies. Result: The UTCB dataset is curated into Bronze (non-verified), Silver (LLM-aided verification), and Gold (manually verified) tiers and supports dynamic updates. Conclusion: UTCB is a useful tool for evaluating and improving LLM safety and needs continual updates to keep pace with new threats. Abstract: Existing large language models (LLMs) are advancing rapidly and produce outstanding results in image generation tasks, yet their content safety checks remain vulnerable to prompt-based jailbreaks. Through preliminary testing on platforms such as ChatGPT, MetaAI, and Grok, we observed that even short, natural prompts could lead to the generation of compromising images ranging from realistic depictions of forged documents to manipulated images of public figures. We introduce Unmasking the Canvas (UTC Benchmark; UTCB), a dynamic and scalable benchmark dataset to evaluate LLM vulnerability in image generation. Our methodology combines structured prompt engineering, multilingual obfuscation (e.g., Zulu, Gaelic, Base64), and evaluation using Groq-hosted LLaMA-3. The pipeline supports both zero-shot and fallback prompting strategies, risk scoring, and automated tagging. All generations are stored with rich metadata and curated into Bronze (non-verified), Silver (LLM-aided verification), and Gold (manually verified) tiers. UTCB is designed to evolve over time with new data sources, prompt templates, and model behaviors. Warning: This paper includes visual examples of adversarial inputs designed to test model safety. All outputs have been redacted to ensure responsible disclosure.

Manas Satish Bedmutha,Feng Chen,Andrea Hartzler,Trevor Cohen,Nadir Weibel

Main category: cs.CL

TL;DR: Explores the ability of large language models (LLMs) to automatically analyze social signals (such as non-verbal behaviors) in clinical dialogue and evaluates different model architectures and prompting styles on recognizing 20 social signals.

Details

Motivation: Improve patient-provider communication by automatically analyzing social signals, such as emotional and non-verbal behaviors, that shape the patient-provider relationship. Method: Designs task-specific prompts and evaluates multiple LLM architectures and prompting styles on recognizing 20 social signals using a highly imbalanced, annotated dataset. Result: Presents the first system capable of tracking all 20 social signals and uncovers patterns in LLM behavior, offering insights for improving social signal processing in healthcare settings. Conclusion: LLMs show potential for analyzing social signals in clinical dialogue, and performance could improve further by refining model configurations and clinical context. Abstract: Effective communication between providers and their patients influences health and care outcomes. The effectiveness of such conversations has been linked not only to the exchange of clinical information, but also to a range of interpersonal behaviors; commonly referred to as social signals, which are often conveyed through non-verbal cues and shape the quality of the patient-provider relationship. Recent advances in large language models (LLMs) have demonstrated an increasing ability to infer emotional and social behaviors even when analyzing only textual information. As automation increases also in clinical settings, such as for transcription of patient-provider conversations, there is growing potential for LLMs to automatically analyze and extract social behaviors from these interactions. To explore the foundational capabilities of LLMs in tracking social signals in clinical dialogue, we designed task-specific prompts and evaluated model performance across multiple architectures and prompting styles using a highly imbalanced, annotated dataset spanning 20 distinct social signals such as provider dominance, patient warmth, etc. We present the first system capable of tracking all these 20 coded signals, and uncover patterns in LLM behavior. Further analysis of model configurations and clinical context provides insights for enhancing LLM performance on social signal processing tasks in healthcare settings.

[89] LLM-Independent Adaptive RAG: Let the Question Speak for Itself

Maria Marina,Nikolay Ivanov,Sergey Pletenev,Mikhail Salnikov,Daria Galimzianova,Nikita Krayko,Vasily Konovalov,Alexander Panchenko,Viktor Moskvoretskii

Main category: cs.CL

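TL;DR: The paper proposes lightweight, LLM-independent adaptive retrieval methods that rely on external information, matching the performance of complex LLM-based methods while being more efficient.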

Details

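Motivation: LLMs are prone to hallucinations. RAG mitigates this but is computationally expensive and can propagate misinformation, and existing adaptive retrieval methods rely on LLM-based uncertainty estimation, which is inefficient and impractical. Method: Investigates 27 features, organized into 7 groups, and their hybrid combinations; designs lightweight adaptive retrieval methods based on external information; and evaluates QA performance and efficiency on 6 QA datasets. Result: The approach matches the performance of complex LLM-based methods while achieving significant efficiency gains, demonstrating the potential of external information for adaptive retrieval. Conclusion: Lightweight adaptive retrieval is efficient and practical, offering a feasible way to reduce LLM hallucinations and computational cost. Abstract: Large Language Models (LLMs) are prone to hallucinations, and Retrieval-Augmented Generation (RAG) helps mitigate this, but at a high computational cost while risking misinformation. Adaptive retrieval aims to retrieve only when necessary, but existing approaches rely on LLM-based uncertainty estimation, which remains inefficient and impractical. In this study, we introduce lightweight LLM-independent adaptive retrieval methods based on external information. We investigated 27 features, organized into 7 groups, and their hybrid combinations. We evaluated these methods on 6 QA datasets, assessing the QA performance and efficiency. The results show that our approach matches the performance of complex LLM-based methods while achieving significant efficiency gains, demonstrating the potential of external information for adaptive retrieval.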
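
The idea of deciding whether to retrieve from the question alone, without querying the LLM, can be illustrated with a minimal sketch. The three features, the weights, the bias, and the function names below are hypothetical choices made for illustration; they are not the 27 features or the classifiers studied in the paper.

```python
import math

# Hypothetical weights over three hand-picked question features.
# These are illustrative assumptions, not values from the paper.
WEIGHTS = {"num_tokens": 0.05, "num_capitalized": 0.6, "has_year": 0.8}
BIAS = -1.5


def question_features(question: str) -> dict:
    """Extract cheap, LLM-independent features from the question text."""
    tokens = question.split()
    cleaned = [t.strip("?.,!") for t in tokens]
    return {
        "num_tokens": float(len(tokens)),
        # Capitalized tokens after the first one hint at named entities.
        "num_capitalized": float(sum(t[0].isupper() for t in tokens[1:])),
        # A four-digit number hints at a date-specific factual question.
        "has_year": float(any(c.isdigit() and len(c) == 4 for c in cleaned)),
    }


def should_retrieve(question: str, threshold: float = 0.5) -> bool:
    """Return True when a logistic score over the features crosses the threshold."""
    feats = question_features(question)
    score = BIAS + sum(WEIGHTS[name] * value for name, value in feats.items())
    probability = 1.0 / (1.0 + math.exp(-score))
    return probability >= threshold
```

Under these assumed weights, an entity-heavy, dated question such as "Who won the Nobel Prize in Physics in 1921?" triggers retrieval, while "What is two plus two?" does not. In practice the weights would be fitted on labeled data rather than set by hand.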

[90] GASCADE: Grouped Summarization of Adverse Drug Event for Enhanced Cancer Pharmacovigilance

Sofia Jamil,Aryan Dabad,Bollampalli Areen Reddy,Sriparna Saha,Rajiv Misra,Adil A. Shakur

Main category: cs.CL

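TL;DR: The paper introduces the task of grouped summarization of adverse drug events (ADEs) in cancer treatment and releases the MCADRS dataset and the GASCADE framework, which combines LLMs with a T5 model to improve summarization performance.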

Details

Motivation: Existing research focuses mainly on adverse drug events for general diseases, while resources for cancer are limited; more efficient methods are needed to summarize patient-reported adverse events in support of drug-related decision-making. Method: Presents the MCADRS dataset and develops the GASCADE framework, which combines the information extraction capabilities of LLMs with the summarization power of the T5 model and applies alignment techniques to encoder-decoder models for summarization for the first time. Result: Experiments show that GASCADE performs strongly across various metrics, validated by both automated assessments and human evaluations. Conclusion: The work provides a new tool for personalized cancer care, improving drug-related decision-making and the understanding of patient concerns. Abstract: In the realm of cancer treatment, summarizing adverse drug events (ADEs) reported by patients using prescribed drugs is crucial for enhancing pharmacovigilance practices and improving drug-related decision-making. While the volume and complexity of pharmacovigilance data have increased, existing research in this field has predominantly focused on general diseases rather than specifically addressing cancer. This work introduces the task of grouped summarization of adverse drug events reported by multiple patients using the same drug for cancer treatment. To address the challenge of limited resources in cancer pharmacovigilance, we present the MultiLabeled Cancer Adverse Drug Reaction and Summarization (MCADRS) dataset. This dataset includes pharmacovigilance posts detailing patient concerns regarding drug efficacy and adverse effects, along with extracted labels for drug names, adverse drug events, severity, and adversity of reactions, as well as summaries of ADEs for each drug. Additionally, we propose the Grouping and Abstractive Summarization of Cancer Adverse Drug events (GASCADE) framework, a novel pipeline that combines the information extraction capabilities of Large Language Models (LLMs) with the summarization power of the encoder-decoder T5 model. Our work is the first to apply alignment techniques, including advanced algorithms like Direct Preference Optimization, to encoder-decoder models using synthetic datasets for summarization tasks. Through extensive experiments, we demonstrate the superior performance of GASCADE across various metrics, validated through both automated assessments and human evaluations. This multitasking approach enhances drug-related decision-making and fosters a deeper understanding of patient concerns, paving the way for advancements in personalized and responsive cancer care. The code and dataset used in this work are publicly available.

[91] The Aloe Family Recipe for Open and Specialized Healthcare LLMs

Dario Garcia-Gasulla,Jordi Bayarri-Planas,Ashwin Kumar Gururajan,Enrique Lopez-Cuena,Adrian Tormos,Daniel Hinjos,Pablo Bernabeu-Perez,Anna Arias-Duart,Pablo Agustin Martin-Torres,Marta Gonzalez-Mallo,Sergio Alvarez-Napagao,Eduard Ayguadé-Parra,Ulises Cortés

Main category: cs.CL

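TL;DR: The paper presents Aloe Beta, a family of open-source healthcare LLMs, which optimizes data preprocessing and training, uses DPO and RAG to improve model safety and efficacy, and defines a new evaluation standard.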

Details

Motivation: As LLMs for healthcare advance, competitive open-source models are needed to protect the public interest, along with improvements in model safety and efficacy. Method: Builds on base models such as Llama 3.1 and Qwen 2.5, uses a custom dataset to enhance public data, and aligns the models with DPO, emphasizing ethical and policy-aligned behavior. Evaluation includes close-ended, open-ended, safety, and human assessments. Result: The Aloe Beta models perform competitively on healthcare benchmarks, significantly improve safety, and are released with a detailed risk assessment. Conclusion: The Aloe Beta models and their recipe are a significant contribution to open-source medical LLMs and set a new standard for development and reporting. Abstract: Purpose: With advancements in Large Language Models (LLMs) for healthcare, the need arises for competitive open-source models to protect the public interest. This work contributes to the field of open medical LLMs by optimizing key stages of data preprocessing and training, while showing how to improve model safety (through DPO) and efficacy (through RAG). The evaluation methodology used, which includes four different types of tests, defines a new standard for the field. The resultant models, shown to be competitive with the best private alternatives, are released with a permissive license. Methods: Building on top of strong base models like Llama 3.1 and Qwen 2.5, Aloe Beta uses a custom dataset to enhance public data with synthetic Chain of Thought examples. The models undergo alignment with Direct Preference Optimization, emphasizing ethical and policy-aligned performance in the presence of jailbreaking attacks. Evaluation includes close-ended, open-ended, safety and human assessments, to maximize the reliability of results. Results: Recommendations are made across the entire pipeline, backed by the solid performance of the Aloe Family. These models deliver competitive performance across healthcare benchmarks and medical fields, and are often preferred by healthcare professionals. On bias and toxicity, the Aloe Beta models significantly improve safety, showing resilience to unseen jailbreaking attacks. For a responsible release, a detailed risk assessment specific to healthcare is attached to the Aloe Family models. Conclusion: The Aloe Beta models, and the recipe that leads to them, are a significant contribution to the open-source medical LLM field, offering top-of-the-line performance while maintaining high ethical requirements. This work sets a new standard for developing and reporting aligned LLMs in healthcare.

[92] Large Means Left: Political Bias in Large Language Models Increases with Their Number of Parameters

David Exler,Mark Schutera,Markus Reischl,Luca Rettenberger

Main category: cs.CL

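TL;DR: The paper studies the political bias of large language models (LLMs) in the context of the German Bundestag vote, finds a bias toward left-leaning parties, and examines how language, model size, and origin affect that bias.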

Details

Motivation: As artificial intelligence becomes more prevalent, evaluating its inherent biases is essential, especially since LLMs used as a primary source of information can spread misinformation and bias. Method: Quantifies the political bias of LLMs using the Wahl-O-Mat score and compares the models' alignment scores to identify influencing factors. Result: LLMs lean toward left-leaning parties; model size and the language used affect their political views; and the bias can influence public opinion. Conclusion: LLMs exhibit political bias, and the companies that develop them bear responsibility for containing its influence on public decision-making. Abstract: With the increasing prevalence of artificial intelligence, careful evaluation of inherent biases needs to be conducted to form the basis for alleviating the effects these predispositions can have on users. Large language models (LLMs) are predominantly used by many as a primary source of information for various topics. LLMs frequently make factual errors, fabricate data (hallucinations), or present biases, exposing users to misinformation and influencing opinions. Educating users on their risks is key to responsible use, as bias, unlike hallucinations, cannot be caught through data verification. We quantify the political bias of popular LLMs in the context of the recent vote of the German Bundestag using the score produced by the Wahl-O-Mat. This metric measures the alignment between an individual's political views and the positions of German political parties. We compare the models' alignment scores to identify factors influencing their political preferences. Doing so, we discover a bias toward left-leaning parties, most dominant in larger LLMs. Also, we find that the language we use to communicate with the models affects their political views. Additionally, we analyze the influence of a model's origin and release date and compare the results to the outcome of the recent vote of the Bundestag. Our results imply that LLMs are prone to exhibiting political bias. Large corporations with the necessary means to develop LLMs, thus, knowingly or unknowingly, have a responsibility to contain these biases, as they can influence each voter's decision-making process and inform public opinion in general and at scale.

[93] YABLoCo: Yet Another Benchmark for Long Context Code Generation

Aidar Valeev,Roman Garaev,Vadim Lomshakov,Irina Piontkovskaya,Vladimir Ivanov,Israel Adewuyi

Main category: cs.CL

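TL;DR: The paper presents YABLoCo, a long-context code generation benchmark for C and C++ that fills the gap left by existing benchmarks for large repositories (200K to 2,000K lines of code) and provides a scalable evaluation pipeline.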

Details

Motivation: Existing benchmarks mainly target small or medium-sized contexts (thousands of lines of code), whereas real-world software projects can span millions of lines, so a new benchmark is needed to evaluate code generation in large repositories. Method: Builds a test set of 215 functions selected from four large repositories, with function metadata, contexts with different levels of dependencies, docstrings, function bodies, and call graphs. Result: Presents the YABLoCo benchmark for C and C++, together with a scalable evaluation pipeline and a tool for visual analysis of generated code. Conclusion: The benchmark provides an effective tool for evaluating code generation in large repositories and fills a gap in existing work. Abstract: Large Language Models demonstrate the ability to solve various programming tasks, including code generation. Typically, the performance of LLMs is measured on benchmarks with small or medium-sized context windows of thousands of lines of code. At the same time, in real-world software projects, repositories can span up to millions of LoC. This paper closes this gap by contributing to the long context code generation benchmark (YABLoCo). The benchmark featured a test set of 215 functions selected from four large repositories with thousands of functions. The dataset contained metadata of functions, contexts of the functions with different levels of dependencies, docstrings, function bodies, and call graphs for each repository. This paper presents three key aspects of the contribution. First, the benchmark aims at function body generation in large repositories in C and C++, two languages not covered by previous benchmarks. Second, the benchmark contains large repositories from 200K to 2,000K LoC. Third, we contribute a scalable evaluation pipeline for efficient computing of the target metrics and a tool for visual analysis of generated code. Overall, these three aspects allow for evaluating code generation in large repositories in C and C++.

[94] OBLIVIATE: Robust and Practical Machine Unlearning for Large Language Models

Xiaoyu Xu,Minxin Du,Qingqing Ye,Haibo Hu

Main category: cs.CL

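TL;DR: OBLIVIATE is an efficient unlearning framework that removes sensitive or copyrighted content from large language models while preserving model performance.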

Details

Motivation: Large language models may memorize sensitive, copyrighted, or toxic content, so a method is needed to remove such data without harming model utility. Method: The framework extracts target tokens, builds retain sets, and fine-tunes with a loss function comprising masking, distillation, and world fact components, using low-rank adapters (LoRA) for efficiency. Result: Experiments show that OBLIVIATE resists membership inference attacks, minimizes the impact on retained data, and remains robust across diverse scenarios. Conclusion: OBLIVIATE is a practical and efficient unlearning framework for removing targeted data from LLMs. Abstract: Large language models (LLMs) trained over extensive corpora risk memorizing sensitive, copyrighted, or toxic content. To address this, we propose OBLIVIATE, a robust unlearning framework that removes targeted data while preserving model utility. The framework follows a structured process: extracting target tokens, building retain sets, and fine-tuning with a tailored loss function comprising three components -- masking, distillation, and world fact. Using low-rank adapters (LoRA), it ensures efficiency without compromising unlearning quality. We conduct experiments on multiple datasets, including the Harry Potter series, WMDP, and TOFU, using a comprehensive suite of metrics: forget quality (new document-level memorization score), model utility, and fluency. Results demonstrate its effectiveness in resisting membership inference attacks, minimizing the impact on retained data, and maintaining robustness across diverse scenarios.

[95] Detecting Spelling and Grammatical Anomalies in Russian Poetry Texts

Ilya Koziev

Main category: cs.CL

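TL;DR: The paper proposes automated linguistic anomaly detection to improve the quality of training data for generative models, compares unsupervised and supervised anomaly detection approaches, and releases the RUPOR dataset.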

Details

Motivation: The performance of generative models on creative tasks such as poetry generation depends on the quality of the training texts, but existing data often lacks quality control, so effective methods are needed to filter out low-quality texts. Method: Compares unsupervised and supervised text anomaly detection approaches using synthetic and human-labeled datasets, and introduces the RUPOR dataset for cross-sentence grammatical error detection. Result: Provides tools and a dataset that help the community improve training data quality for generative models in creative domains. Conclusion: Automated anomaly detection can improve training data quality, giving creative generative models a more reliable foundation. Abstract: The quality of natural language texts in fine-tuning datasets plays a critical role in the performance of generative models, particularly in computational creativity tasks such as poem or song lyric generation. Fluency defects in generated poems significantly reduce their value. However, training texts are often sourced from internet-based platforms without stringent quality control, posing a challenge for data engineers to manage defect levels effectively. To address this issue, we propose the use of automated linguistic anomaly detection to identify and filter out low-quality texts from training datasets for creative models. In this paper, we present a comprehensive comparison of unsupervised and supervised text anomaly detection approaches, utilizing both synthetic and human-labeled datasets. We also introduce the RUPOR dataset, a collection of Russian-language human-labeled poems designed for cross-sentence grammatical error detection, and provide the full evaluation code. Our work aims to empower the community with tools and insights to improve the quality of training datasets for generative models in creative domains.

[96] Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs

Yehui Tang,Yichun Yin,Yaoyuan Wang,Hang Zhou,Yu Pan,Wei Guo,Ziyang Zhang,Miao Rang,Fangcheng Liu,Naifu Zhang,Binghan Li,Yonghan Dong,Xiaojun Meng,Yasheng Wang,Dong Li,Yin Li,Dandan Tu,Can Chen,Youliang Yan,Fisher Yu,Ruiming Tang,Yunhe Wang,Botian Huang,Bo Wang,Boxiao Liu,Changzheng Zhang,Da Kuang,Fei Liu,Gang Huang,Jiansheng Wei,Jiarui Qin,Jie Ran,Jinpeng Li,Jun Zhao,Liang Dai,Lin Li,Liqun Deng,Peifeng Qin,Pengyuan Zeng,Qiang Gu,Shaohua Tang,Shengjun Cheng,Tao Gao,Tao Yu,Tianshu Li,Tianyu Bi,Wei He,Weikai Mao,Wenyong Huang,Wulong Liu,Xiabing Li,Xianzhi Yu,Xueyu Wu,Xu He,Yangkai Du,Yan Xu,Ye Tian,Yimeng Wu,Yongbing Huang,Yong Tian,Yong Zhu,Yue Li,Yufei Wang,Yuhang Gai,Yujun Li,Yu Luo,Yunsheng Ni,Yusen Sun,Zelin Chen,Zhe Liu,Zhicheng Liu,Zhipeng Tu,Zilin Ding,Zongyuan Zhan

Main category: cs.CL

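TL;DR: The paper presents a recipe for efficiently training large-scale sparse language models such as Pangu Ultra MoE on Ascend NPUs, using simulation to choose the model configuration and improving system communication and memory management, ultimately reaching an MFU of 30.0%.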

Details

Motivation: Address the challenges of training near-trillion-parameter sparse LLMs on Ascend NPUs, including dynamic sparse model structures and hardware performance optimization. Method: Uses simulation to select model configurations suited to Ascend NPUs, optimizes Expert Parallelism and communication between devices, and improves memory efficiency. Result: Trains Pangu Ultra MoE, a model with 718 billion parameters, at an MFU of 30.0%, with performance comparable to that of DeepSeek R1. Conclusion: The Ascend system can efficiently support training of state-of-the-art sparse language models, providing a reference for future work. Abstract: Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters are dominating the realm of most capable language models. However, the massive model scale poses significant challenges for the underlying software and hardware systems. In this paper, we aim to uncover a recipe to harness such scale on Ascend NPUs. The key goals are better usage of the computing resources under the dynamic sparse model structures and materializing the expected performance gain on the actual hardware. To select model configurations suitable for Ascend NPUs without repeatedly running the expensive experiments, we leverage simulation to compare the trade-off of various model hyperparameters. This study led to Pangu Ultra MoE, a sparse LLM with 718 billion parameters, and we conducted experiments on the model to verify the simulation results. On the system side, we dig into Expert Parallelism to optimize the communication between NPU devices to reduce the synchronization overhead. We also optimize the memory efficiency within the devices to further reduce the parameter and activation management overhead. In the end, we achieve an MFU of 30.0% when training Pangu Ultra MoE, with performance comparable to that of DeepSeek R1, on 6K Ascend NPUs, and demonstrate that the Ascend system is capable of harnessing all the training stages of the state-of-the-art language models. Extensive experiments indicate that our recipe can lead to efficient training of large-scale sparse language models with MoE. We also study the behaviors of such models for future reference.
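
For readers unfamiliar with the metric, MFU (model FLOPs utilization) is the ratio of the FLOPs a training run actually performs per second to the theoretical peak of the hardware. The sketch below shows that ratio. The function names are hypothetical, and the "6 x active parameters" estimate is a common rule of thumb that ignores attention cost; it is an assumption here, not the accounting used in the paper.

```python
def approx_train_flops_per_token(active_params: float) -> float:
    """Rule-of-thumb training cost: about 6 FLOPs per active parameter per token.

    For a sparse MoE model, only the parameters activated for a token count.
    This approximation is an assumption and omits attention FLOPs.
    """
    return 6.0 * active_params


def model_flops_utilization(
    flops_per_token: float,
    tokens_per_second: float,
    num_devices: int,
    peak_flops_per_device: float,
) -> float:
    """Return achieved model FLOPs per second divided by aggregate peak FLOPs per second."""
    if num_devices <= 0 or peak_flops_per_device <= 0:
        raise ValueError("num_devices and peak_flops_per_device must be positive")
    achieved = flops_per_token * tokens_per_second
    peak = num_devices * peak_flops_per_device
    return achieved / peak
```

As a purely illustrative calculation with made-up numbers (not Pangu Ultra MoE measurements): 10 billion active parameters, 1 million tokens per second, 1,000 devices, and a peak of 2e14 FLOP/s per device give 6e16 / 2e17, an MFU of roughly 0.30.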

[97] Overcoming Data Scarcity in Generative Language Modelling for Low-Resource Languages: A Systematic Review

Josh McGiff,Nikola S. Nikolov

Main category: cs.CL

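TL;DR: This systematic review covers strategies for addressing data scarcity in generative language modelling for low-resource languages (LRLs), summarizes technical approaches, architecture choices, and evaluation trends, and offers recommendations for extending these methods along with open challenges.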

Details

Motivation: Generative language models such as ChatGPT and Google Gemini mainly serve high-resource languages like English, amplifying linguistic inequality in natural language processing. This review addresses data scarcity for low-resource languages. Method: Drawing on 54 studies, categorizes and evaluates technical approaches including monolingual data augmentation, back-translation, multilingual training, and prompt engineering, and analyzes trends in architecture choices, language family representation, and evaluation methods. Result: Finds a strong reliance on transformer-based models, limited coverage of LRLs, and inconsistent evaluation. Conclusion: Recommends extending these methods to a wider range of LRLs and outlines open challenges in building equitable generative language systems, in support of inclusive AI tools and the preservation of linguistic diversity. Abstract: Generative language modelling has surged in popularity with the emergence of services such as ChatGPT and Google Gemini. While these models have demonstrated transformative potential in productivity and communication, they overwhelmingly cater to high-resource languages like English. This has amplified concerns over linguistic inequality in natural language processing (NLP). This paper presents the first systematic review focused specifically on strategies to address data scarcity in generative language modelling for low-resource languages (LRL). Drawing from 54 studies, we identify, categorise and evaluate technical approaches, including monolingual data augmentation, back-translation, multilingual training, and prompt engineering, across generative tasks. We also analyse trends in architecture choices, language family representation, and evaluation methods. Our findings highlight a strong reliance on transformer-based models, a concentration on a small subset of LRLs, and a lack of consistent evaluation across studies. We conclude with recommendations for extending these methods to a wider range of LRLs and outline open challenges in building equitable generative language systems. Ultimately, this review aims to support researchers and developers in building inclusive AI tools for underrepresented languages, a necessary step toward empowering LRL speakers and the preservation of linguistic diversity in a world increasingly shaped by large-scale language technologies.

[98] ZeroSearch: Incentivize the Search Capability of LLMs without Searching

Hao Sun,Zile Qiao,Jiayan Guo,Xuanbo Fan,Yingyan Hou,Yong Jiang,Pengjun Xie,Fei Huang,Yan Zhang

Main category: cs.CL

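TL;DR: ZeroSearch is a reinforcement learning framework that improves the search capabilities of large language models (LLMs) without relying on a real search engine; it simulates search with generated documents, addressing uncontrolled document quality and high API costs.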

Details

Motivation: Existing methods rely on real search engines and face uncontrolled document quality and high API costs, which limit improvements to LLM search capabilities and constrain scalability. Method: Lightweight supervised fine-tuning turns an LLM into a retrieval module that generates both relevant and noisy documents; a curriculum-based rollout strategy incrementally degrades document quality to elicit the model's reasoning ability. Result: Experiments show that ZeroSearch effectively improves LLM search capabilities with a 3B LLM as the retrieval module; a 7B retrieval module performs comparably to the real search engine, and a 14B retrieval module even surpasses it. Conclusion: ZeroSearch offers an efficient, low-cost alternative that works across LLMs of various parameter sizes and is compatible with a wide range of RL algorithms. Abstract: Effective information searching is essential for enhancing the reasoning and generation capabilities of large language models (LLMs). Recent research has explored using reinforcement learning (RL) to improve LLMs' search capabilities by interacting with live search engines in real-world environments. While these approaches show promising results, they face two major challenges: (1) Uncontrolled Document Quality: The quality of documents returned by search engines is often unpredictable, introducing noise and instability into the training process. (2) Prohibitively High API Costs: RL training requires frequent rollouts, potentially involving hundreds of thousands of search requests, which incur substantial API expenses and severely constrain scalability. To address these challenges, we introduce ZeroSearch, a reinforcement learning framework that incentivizes the search capabilities of LLMs without interacting with real search engines. Our approach begins with lightweight supervised fine-tuning to transform the LLM into a retrieval module capable of generating both relevant and noisy documents in response to a query. During RL training, we employ a curriculum-based rollout strategy that incrementally degrades the quality of generated documents, progressively eliciting the model's reasoning ability by exposing it to increasingly challenging retrieval scenarios. Extensive experiments demonstrate that ZeroSearch effectively incentivizes the search capabilities of LLMs using a 3B LLM as the retrieval module. Remarkably, a 7B retrieval module achieves comparable performance to the real search engine, while a 14B retrieval module even surpasses it. Furthermore, it generalizes well across both base and instruction-tuned models of various parameter sizes and is compatible with a wide range of RL algorithms.
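
A curriculum that "incrementally degrades the quality of generated documents" can be pictured as a schedule that raises the probability of serving a noisy document as training progresses. The sketch below uses a simple linear ramp; the function names, the linear shape, and the start and end probabilities are assumptions chosen for illustration and are not taken from the ZeroSearch paper.

```python
import random


def noise_probability(
    step: int, total_steps: int, p_start: float = 0.0, p_end: float = 0.5
) -> float:
    """Linearly interpolate the noisy-document probability from p_start to p_end.

    The linear ramp and default endpoints are illustrative assumptions.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return p_start + fraction * (p_end - p_start)


def sample_document_kind(step: int, total_steps: int, rng: random.Random) -> str:
    """Decide whether the simulated retriever should emit a noisy or relevant document."""
    if rng.random() < noise_probability(step, total_steps):
        return "noisy"
    return "relevant"
```

With these defaults every document is relevant at step 0, and by the final step about half of the simulated documents are noisy, so later rollouts expose the policy to harder retrieval scenarios.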

cs.IR [Back]

[99] Sentiment-Aware Recommendation Systems in E-Commerce: A Review from a Natural Language Processing Perspective

Yogesh Gajula

Main category: cs.IR

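TL;DR: This paper reviews sentiment-aware recommendation systems based on natural language processing from 2023 to early 2025 and examines how sentiment analysis can improve the accuracy and explainability of e-commerce recommendations.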

Details

Motivation: E-commerce platforms generate vast volumes of user feedback, but most recommendation systems rely on numerical ratings and overlook the nuanced opinions in text. This review aims to fill that gap. Method: Categorizes recent work into four approaches: deep learning classifiers that combine sentiment embeddings with user-item interactions, transformer-based feature extraction methods, graph neural networks that propagate sentiment signals, and conversational recommenders that respond to user feedback in real time. Result: Summarizes model architectures, shows how sentiment flows through recommendation pipelines, and identifies challenges such as noisy text, dynamic user preferences, and bias mitigation. Conclusion: Outlines future research directions toward smarter, fairer, and more user-centric recommendation tools. Abstract: E-commerce platforms generate vast volumes of user feedback, such as star ratings, written reviews, and comments. However, most recommendation engines rely primarily on numerical scores, often overlooking the nuanced opinions embedded in free text. This paper comprehensively reviews sentiment-aware recommendation systems from a natural language processing perspective, covering advancements from 2023 to early 2025. It highlights the benefits of integrating sentiment analysis into e-commerce recommenders to enhance prediction accuracy and explainability through detailed opinion extraction. Our survey categorizes recent work into four main approaches: deep learning classifiers that combine sentiment embeddings with user-item interactions, transformer-based methods for nuanced feature extraction, graph neural networks that propagate sentiment signals, and conversational recommenders that adapt in real time to user feedback. We summarize model architectures and demonstrate how sentiment flows through recommendation pipelines, impacting dialogue-based suggestions. Key challenges include handling noisy or sarcastic text, dynamic user preferences, and bias mitigation. Finally, we outline research gaps and provide a roadmap for developing smarter, fairer, and more user-centric recommendation tools.

[100] OBD-Finder: Explainable Coarse-to-Fine Text-Centric Oracle Bone Duplicates Discovery

Chongsheng Zhang,Shuwen Wu,Yingqi Chen,Matthias Aßenmacher,Christian Heumann,Yi Men,Gaojuan Fan,João Gama

Main category: cs.IR

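TL;DR: The paper proposes a progressive framework for discovering oracle bone duplicates that combines unsupervised low-level keypoint matching with high-level text-centric content matching, improving the semantic awareness and interpretability of candidate duplicates.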

Details

Motivation: Identifying oracle bone duplicates is a fundamental problem in oracle bone inscription research, and existing methods fall short in semantic awareness and computational efficiency. Method: A progressive framework combines low-level keypoint matching with high-level text-centric content matching to refine and rank candidate duplicates. Result: For both Top-5 and Top-15 retrieval results, the method achieves recall comparable to existing methods with significantly better computational efficiency, and it discovered over 60 pairs of new oracle bone duplicates. Conclusion: The method improves on existing approaches in semantic awareness and computational efficiency, providing a new tool for oracle bone research. Abstract: Oracle Bone Inscription (OBI) is the earliest systematic writing system in China, while the identification of Oracle Bone (OB) duplicates is a fundamental issue in OBI research. In this work, we design a progressive OB duplicate discovery framework that combines unsupervised low-level keypoint matching with high-level text-centric content-based matching to refine and rank the candidate OB duplicates with semantic awareness and interpretability. We compare our approach with state-of-the-art content-based image retrieval and image matching methods, showing that our approach yields comparable recall performance and the highest simplified mean reciprocal rank scores for both Top-5 and Top-15 retrieval results, with significantly improved computational efficiency. We have discovered over 60 pairs of new OB duplicates in real-world deployment, which were missed by OBI researchers for decades. The models, video illustration and demonstration of this work are available at: https://github.com/cszhangLMU/OBD-Finder/.
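
The abstract reports a "simplified mean reciprocal rank" for Top-5 and Top-15 results without defining it here. As background, the sketch below computes the standard mean reciprocal rank truncated at k; it is not claimed to match the paper's simplified variant, and the function name and toy identifiers are hypothetical.

```python
from typing import Hashable, Sequence, Set


def mrr_at_k(
    ranked_lists: Sequence[Sequence[Hashable]],
    relevant_sets: Sequence[Set[Hashable]],
    k: int,
) -> float:
    """Standard MRR@k: average of 1/rank of the first relevant item within the top k.

    A query with no relevant item in its top k contributes 0.
    """
    if not ranked_lists:
        return 0.0
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(ranked[:k], start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)
```

For example, if one query finds its true duplicate at rank 2 and a second query finds none in the top 5, MRR@5 is (0.5 + 0) / 2 = 0.25.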

cs.HC [Back]

[101] Facilitating Video Story Interaction with Multi-Agent Collaborative System

Yiwen Zhang,Jianing Hao,Zhan Wang,Hongling Sheng,Wei Zeng

Main category: cs.HC

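TL;DR: Proposes an interactive video story system driven by user intent that combines VLM, RAG, and MAS techniques to deliver personalized narrative experiences.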

Details

Motivation: Existing methods are limited to user selection and pre-designed narratives and lack customization, so improvements are needed to enrich the interactive experience. Method: The system has three stages: 1) video story processing (VLM plus prior knowledge); 2) multi-space chat (MAS generates growth-oriented characters); 3) scene customization (expanding and visualizing scenes mentioned in dialogue). Result: Applied to the Harry Potter series, the system portrays characters' social behavior and growth and enhances the interactive experience. Conclusion: By integrating these techniques, the system effectively improves the interactivity and personalization of video stories. Abstract: Video story interaction enables viewers to engage with and explore narrative content for personalized experiences. However, existing methods are limited to user selection, specially designed narratives, and lack customization. To address this, we propose an interactive system based on user intent. Our system uses a Vision Language Model (VLM) to enable machines to understand video stories, combining Retrieval-Augmented Generation (RAG) and a Multi-Agent System (MAS) to create evolving characters and scene experiences. It includes three stages: 1) Video story processing, utilizing VLM and prior knowledge to simulate human understanding of stories across three modalities. 2) Multi-space chat, creating growth-oriented characters through MAS interactions based on user queries and story stages. 3) Scene customization, expanding and visualizing various story scenes mentioned in dialogue. Applied to the Harry Potter series, our study shows the system effectively portrays emergent character social behavior and growth, enhancing the interactive experience in the video story world.

cs.CY [Back]

[102] Large Language Models are often politically extreme, usually ideologically inconsistent, and persuasive even in informational contexts

Nouar Aldahoul,Hazem Ibrahim,Matteo Varvello,Aaron Kaufman,Talal Rahwan,Yasir Zaki

Main category: cs.CY

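TL;DR: The study finds that the political bias of large language models (LLMs) is not as slight as it appears; the small overall preference is the net result of offsetting extreme views, and LLMs can measurably shift users' political preferences.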

Details

Motivation: Examine the political biases of LLMs and their influence on users, challenging the prevailing view that such biases are small. Method: Compares 31 LLMs with legislators, judges, and U.S. voters on political preferences, and tests the effect of LLMs on users' political views in a randomized experiment. Result: LLMs' bias is the net result of offsetting extreme views, and discussing political issues with an LLM chatbot makes users as much as 5 percentage points more likely to express the same preferences as the chatbot. Conclusion: LLMs may become a powerful vector for political influence, and their potential impact warrants vigilance. Abstract: Large Language Models (LLMs) are a transformational technology, fundamentally changing how people obtain information and interact with the world. As people become increasingly reliant on them for an enormous variety of tasks, a body of academic research has developed to examine these models for inherent biases, especially political biases, often finding them small. We challenge this prevailing wisdom. First, by comparing 31 LLMs to legislators, judges, and a nationally representative sample of U.S. voters, we show that LLMs' apparently small overall partisan preference is the net result of offsetting extreme views on specific topics, much like moderate voters. Second, in a randomized experiment, we show that LLMs can promulgate their preferences into political persuasiveness even in information-seeking contexts: voters randomized to discuss political issues with an LLM chatbot are as much as 5 percentage points more likely to express the same preferences as that chatbot. Contrary to expectations, these persuasive effects are not moderated by familiarity with LLMs, news consumption, or interest in politics. LLMs, especially those controlled by private companies or governments, may become a powerful and targeted vector for political influence.

[103] Coverage Biases in High-Resolution Satellite Imagery

Vadim Musienko,Axel Jacquet,Ingmar Weber,Till Koebe

Main category: cs.CY

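TL;DR: The study examines coverage bias in global satellite imagery, finding that locations farther from the equator are revisited more often, that less developed regions have fewer historical images, and that geopolitical events also affect image availability.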

Details

Motivation: Ask whether all places on earth benefit equally from satellite imagery, and uncover the physical and socio-economic factors behind coverage bias. Method: Estimates revisit frequency over a 30-day window from satellites' orbital paths, collects metadata from major satellite imagery providers to assess historical image availability, and studies geopolitical influence through three case studies of conflict regions. Result: Locations farther from the equator are revisited more frequently; less developed regions have fewer historical images; geopolitical events significantly affect image availability. Conclusion: The digital dividend of satellite imagery is unevenly distributed across the globe, shaped by physical, socio-economic, and geopolitical factors. Abstract: Satellite imagery is increasingly used to complement traditional data collection approaches such as surveys and censuses across scientific disciplines. However, we ask: Do all places on earth benefit equally from this new wealth of information? In this study, we investigate coverage bias of major satellite constellations that provide optical satellite imagery with a ground sampling distance below 10 meters, evaluating both the future on-demand tasking opportunities as well as the availability of historic images across the globe. Specifically, forward-looking, we estimate how often different places are revisited during a window of 30 days based on the satellites' orbital paths, thus investigating potential coverage biases caused by physical factors. We find that locations farther away from the equator are generally revisited more frequently by the constellations under study. Backward-looking, we show that historic satellite image availability -- based on metadata collected from major satellite imagery providers -- is influenced by socio-economic factors on the ground: less developed, less populated places have fewer satellite images available. Furthermore, in three small case studies on recent conflict regions in this world, namely Gaza, Sudan and Ukraine, we show that geopolitical events also play an important role in satellite image availability, hinting at underlying business model decisions. These insights lay bare that the digital dividend yielded by satellite imagery is not equally distributed across our planet.

[104] Deepfakes on Demand: the rise of accessible non-consensual deepfake image generators

Will Hawkins,Chris Russell,Brent Mittelstadt

Main category: cs.CY

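TL;DR: The study finds that the spread of text-to-image (T2I) models has led to large numbers of downloadable deepfake model variants circulating online, mostly targeting women and often intended to generate non-consensual intimate imagery (NCII).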

Details

Motivation: Examine the online accessibility of deepfake models and their potential harms, particularly for generating non-consensual intimate imagery. Method: Analyzes the metadata of thousands of publicly downloadable model variants on two platforms, Hugging Face and Civitai, to count deepfake models, their downloads, and their targets. Result: Identifies almost 35,000 publicly downloadable deepfake model variants, primarily hosted on Civitai, downloaded almost 15 million times; 96% target women, and many signal intent to generate NCII. Conclusion: More action is needed to curb the spread of deepfake models and NCII, even though hosting platforms already prohibit such content. Abstract: Advances in multimodal machine learning have made text-to-image (T2I) models increasingly accessible and popular. However, T2I models introduce risks such as the generation of non-consensual depictions of identifiable individuals, otherwise known as deepfakes. This paper presents an empirical study exploring the accessibility of deepfake model variants online. Through a metadata analysis of thousands of publicly downloadable model variants on two popular repositories, Hugging Face and Civitai, we demonstrate a huge rise in easily accessible deepfake models. Almost 35,000 examples of publicly downloadable deepfake model variants are identified, primarily hosted on Civitai. These deepfake models have been downloaded almost 15 million times since November 2022, with the models targeting a range of individuals from global celebrities to Instagram users with under 10,000 followers. Both Stable Diffusion and Flux models are used for the creation of deepfake models, with 96% of these targeting women and many signalling intent to generate non-consensual intimate imagery (NCII). Deepfake model variants are often created via the parameter-efficient fine-tuning technique known as low rank adaptation (LoRA), requiring as few as 20 images, 24GB VRAM, and 15 minutes of time, making this process widely accessible via consumer-grade computers. Despite these models violating the Terms of Service of hosting platforms, and regulation seeking to prevent dissemination, these results emphasise the pressing need for greater action to be taken against the creation of deepfakes and NCII.

eess.AS [Back]

[105] EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning

Zhenghao Xing,Xiaowei Hu,Chi-Wing Fu,Wenhai Wang,Jifeng Dai,Pheng-Ann Heng

Main category: eess.AS

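TL;DR: EchoInk-R1 is a reinforcement learning framework that improves structured cross-modal reasoning over audio and visual signals in multimodal large language models (MLLMs), with a notable accuracy gain on the AVQA-R1-6K dataset.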

Details

Motivation: Existing MLLMs struggle with cross-modal reasoning, particularly when integrating audio and visual signals, so a method is needed to strengthen this ability. Method: Builds on Qwen2.5-Omni-7B and optimizes with Group Relative Policy Optimization (GRPO), focusing on multiple-choice question answering over synchronized audio-image pairs. Result: EchoInk-R1-7B achieves 85.77% accuracy on the validation set, outperforming the base model's 80.53%, using only 562 reinforcement learning steps. Conclusion: Lightweight reinforcement learning fine-tuning can enhance cross-modal reasoning in MLLMs; EchoInk-R1 is the first framework to unify audio, visual, and textual modalities via reinforcement learning. Abstract: Multimodal large language models (MLLMs) have advanced perception across text, vision, and audio, yet they often struggle with structured cross-modal reasoning, particularly when integrating audio and visual signals. We introduce EchoInk-R1, a reinforcement learning framework that enhances such reasoning in MLLMs. Built upon the Qwen2.5-Omni-7B foundation and optimized with Group Relative Policy Optimization (GRPO), EchoInk-R1 tackles multiple-choice question answering over synchronized audio-image pairs. To enable this, we curate AVQA-R1-6K, a dataset pairing such audio-image inputs with multiple-choice questions derived from OmniInstruct-v1. EchoInk-R1-7B achieves 85.77% accuracy on the validation set, outperforming the base model, which scores 80.53%, using only 562 reinforcement learning steps. Beyond accuracy, EchoInk-R1 demonstrates reflective reasoning by revisiting initial interpretations and refining responses when facing ambiguous multimodal inputs. These results suggest that lightweight reinforcement learning fine-tuning enhances cross-modal reasoning in MLLMs. EchoInk-R1 is the first framework to unify audio, visual, and textual modalities for general open-world reasoning via reinforcement learning. Code and data are publicly released to facilitate further research.
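
The core idea behind Group Relative Policy Optimization is that several responses are sampled for the same prompt and each response's reward is judged relative to the others in its group, which removes the need for a learned value baseline. The sketch below shows only that advantage normalization step. The function name, the use of the population standard deviation, and the epsilon value are assumptions; the full GRPO objective (clipping, KL regularization) and the EchoInk-R1 reward design are not reproduced here.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize the rewards of one prompt's sampled responses within their group.

    Each advantage is (reward - group mean) / (group std + eps). The population
    standard deviation is an assumption; implementations differ on this detail.
    """
    if not rewards:
        return []
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

For a multiple-choice task with a 0/1 correctness reward, a group of four sampled answers with rewards [1, 0, 0, 1] yields advantages close to [1, -1, -1, 1], so correct answers are reinforced and incorrect ones are discouraged. When every answer in a group receives the same reward, all advantages are zero and that group contributes no gradient signal.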

eess.IV

[106] IntelliCardiac: An Intelligent Platform for Cardiac Image Segmentation and Classification

Ting Yu Tsai,An Yu,Meghana Spurthi Maadugundu,Ishrat Jahan Mohima,Umme Habiba Barsha,Mei-Hwa F. Chen,Balakrishnan Prabhakaran,Ming-Ching Chang

Main category: eess.IV

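TL;DR: IntelliCardiac is a web-based medical image processing platform that uses AI models to automatically segment 4D cardiac images and classify disease, reporting accuracy above existing methods.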

Details

Motivation: Precise processing of cardiac imaging data is critical for identifying and managing cardiovascular disease. Method: Combine a deep learning segmentation model with a two-step classification pipeline, trained on the ACDC dataset. Result: The segmentation module reaches 92.6% accuracy, and the classification module reaches 98% accuracy across five diagnostic categories. Conclusion: IntelliCardiac has potential as a clinical decision aid, with support for real-time visualization and AI-assisted diagnosis. Abstract: Precise and effective processing of cardiac imaging data is critical for the identification and management of cardiovascular diseases. We introduce IntelliCardiac, a comprehensive, web-based medical image processing platform for the automatic segmentation of 4D cardiac images and disease classification, utilizing an AI model trained on the publicly accessible ACDC dataset. The system, intended for patients, cardiologists, and healthcare professionals, offers an intuitive interface and uses deep learning models to identify essential heart structures and categorize cardiac diseases. The system supports analysis of both the right and left ventricles as well as the myocardium, and then classifies a patient's cardiac images into five diagnostic categories: dilated cardiomyopathy, myocardial infarction, hypertrophic cardiomyopathy, right ventricular abnormality, and no disease. IntelliCardiac combines a deep learning-based segmentation model with a two-step classification pipeline. The segmentation module achieves an overall accuracy of 92.6%. The classification module, trained on characteristics taken from segmented heart structures, achieves 98% accuracy in five categories. These results exceed the performance of the existing state-of-the-art methods that integrate both segmentation and classification models. IntelliCardiac, which supports real-time visualization, workflow integration, and AI-assisted diagnostics, has great potential as a scalable, accurate tool for clinical decision assistance in cardiac imaging and diagnosis.

[107] From Spaceborne to Airborne: SAR Image Synthesis Using Foundation Models for Multi-Scale Adaptation

Solène Debuysère,Nicolas Trouvé,Nathan Letheule,Olivier Lévêque,Elise Colin

Main category: eess.IV

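TL;DR: The paper proposes a new method that uses a pre-trained latent diffusion model with spatial conditioning to translate satellite SAR imagery into airborne SAR representations, addressing the shortage of public SAR datasets.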

Details

Motivation: High-resolution airborne SAR imagery is costly to acquire and scarce, which limits the use of existing foundation models in remote sensing. Method: Build a dataset of 110 thousand SAR images from 15 years of ONERA airborne data and use it with a 3.5-billion-parameter pre-trained latent diffusion model, applying spatial conditioning for image translation. Result: The method translates satellite SAR imagery into airborne SAR representations and improves the realism of simulated images. Conclusion: The work explores a key application of AI to SAR imaging and, according to the authors, is the first to introduce this approach in the literature. Abstract: The availability of Synthetic Aperture Radar (SAR) satellite imagery has increased considerably in recent years, with datasets commercially available. However, the acquisition of high-resolution SAR images in airborne configurations remains costly and limited. Thus, the lack of open source, well-labeled, or easily exploitable SAR text-image datasets is a barrier to the use of existing foundation models in remote sensing applications. In this context, synthetic image generation is a promising solution to augment this scarce data, enabling a broader range of applications. Leveraging over 15 years of ONERA's extensive archival airborne data from acquisition campaigns, we created a comprehensive training dataset of 110 thousand SAR images to exploit a 3.5-billion-parameter pre-trained latent diffusion model. In this work, we present a novel approach utilizing spatial conditioning techniques within a foundation model to transform satellite SAR imagery into airborne SAR representations. Additionally, we demonstrate that our pipeline is effective for bridging the realism gap of simulated images generated by ONERA's physics-based simulator EMPRISE. Our method explores a key application of AI in advancing SAR imaging technology. To the best of our knowledge, we are the first to introduce this approach in the literature.

[108] A Deep Learning approach for Depressive Symptoms assessment in Parkinson's disease patients using facial videos

Ioannis Kyprakis,Vasileios Skaramagkas,Iro Boura,Georgios Karamanis,Dimitrios I. Fotiadis,Zinovia Kefalopoulou,Cleanthe Spanaki,Manolis Tsiknakis

Main category: eess.IV

TL;DR: The study uses deep learning models (ViViT, Video Swin Tiny, and 3D CNN-LSTM) to assess depressive symptoms in Parkinson's disease patients from facial videos; Video Swin Tiny performs best.

Details

Motivation: Depressive symptoms in Parkinson's disease patients are often underdiagnosed because they overlap with motor symptoms; the study aims to detect them more accurately with deep learning. Method: Use three deep learning models to analyze facial video data, assess the presence and severity of depressive symptoms, and account for medication state. Result: Video Swin Tiny performs best in both binary and multiclass classification, with accuracies of up to 94% and 87.1% respectively. Conclusion: Deep learning models, Video Swin Tiny in particular, can be used for automated assessment of depressive symptoms in Parkinson's disease patients. Abstract: Parkinson's disease (PD) is a neurodegenerative disorder, manifesting with motor and non-motor symptoms. Depressive symptoms are prevalent in PD, affecting up to 45% of patients. They are often underdiagnosed due to overlapping motor features, such as hypomimia. This study explores deep learning (DL) models (ViViT, Video Swin Tiny, and 3D CNN-LSTM with attention layers) to assess the presence and severity of depressive symptoms, as detected by the Geriatric Depression Scale (GDS), in PD patients through facial video analysis. The same parameters were assessed in a secondary analysis taking into account whether patients were one hour after (ON-medication state) or 12 hours without (OFF-medication state) dopaminergic medication. Using a dataset of 1,875 videos from 178 patients, the Video Swin Tiny model achieved the highest performance, with up to 94% accuracy and 93.7% F1-score in binary classification (presence or absence of depressive symptoms), and 87.1% accuracy with an 85.4% F1-score in multiclass tasks (absence, mild, or severe depressive symptoms).

[109] Prototype-Based Information Compensation Network for Multi-Source Remote Sensing Data Classification

Feng Gao,Sheng Liu,Chuanzheng Gong,Xiaowei Zhou,Jiayi Wang,Junyu Dong,Qian Du

Main category: eess.IV

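TL;DR: Proposes the Prototype-based Information Compensation Network (PICNet) for HSI and SAR/LiDAR data, which uses a frequency interaction module and a prototype-based compensation module to address feature coupling and inconsistent exploitation of complementary information in multi-source remote sensing classification.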

Details

Motivation: Joint classification of multi-source remote sensing data faces feature coupling and inconsistent exploration of complementary information, and needs better accuracy and reliability. Method: Design a frequency interaction module that decouples and recouples multi-source features, and a prototype-based compensation module that integrates and aligns features through cross-modal attention. Result: Clearly outperforms existing methods on three public datasets. Conclusion: PICNet addresses the key problems in multi-source remote sensing classification and delivers strong performance. Abstract: Multi-source remote sensing data joint classification aims to improve the accuracy and reliability of land cover classification by leveraging the complementary information from multiple data sources. Existing methods confront two challenges: inter-frequency multi-source feature coupling and inconsistency of complementary information exploration. To solve these issues, we present a Prototype-based Information Compensation Network (PICNet) for land cover classification based on HSI and SAR/LiDAR data. Specifically, we first design a frequency interaction module to enhance the inter-frequency coupling in multi-source feature extraction. The multi-source features are first decoupled into high- and low-frequency components. Then, these features are recoupled to achieve efficient inter-frequency communication. Afterward, we design a prototype-based information compensation module to model the global multi-source complementary information. Two sets of learnable modality prototypes are introduced to represent the global modality information of multi-source data. Subsequently, cross-modal feature integration and alignment are achieved through cross-attention computation between the modality-specific prototype vectors and the raw feature representations. Extensive experiments on three public datasets demonstrate the significant superiority of our PICNet over state-of-the-art methods. The codes are available at https://github.com/oucailab/PICNet.

[110] 3D Brain MRI Classification for Alzheimer Diagnosis Using CNN with Data Augmentation

Thien Nhan Vo,Bac Nam Ho,Thanh Xuan Truong

Main category: eess.IV

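TL;DR: A 3D convolutional neural network is developed to classify T1-weighted brain MRI scans as healthy or Alzheimer's disease; with noise injection and cross-validation, the model performs well.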

Details

Motivation: Test how effective simple data augmentation is for 3D MRI classification and motivate future work on more advanced augmentation methods and architectures. Method: A network of 3D convolution, pooling, batch normalization, ReLU layers, and a sigmoid output, combined with stochastic noise injection and five-fold cross-validation. Result: Test-set accuracy of 0.912 and area under the ROC curve of 0.961, with sensitivity and specificity both above 0.90. Conclusion: Simple augmentation is effective; future work can explore more advanced augmentation and architectures such as 3D U-Net and vision transformers. Abstract: A three-dimensional convolutional neural network was developed to classify T1-weighted brain MRI scans as healthy or Alzheimer's disease. The network comprises 3D convolution, pooling, batch normalization, dense ReLU layers, and a sigmoid output. Using stochastic noise injection and five-fold cross-validation, the model achieved a test-set accuracy of 0.912 and an area under the ROC curve of 0.961, an improvement of approximately 0.027 over resizing alone. Sensitivity and specificity both exceeded 0.90. These results align with prior work reporting up to 0.10 gain via synthetic augmentation. The findings demonstrate the effectiveness of simple augmentation for 3D MRI classification and motivate future exploration of advanced augmentation methods and architectures such as 3D U-Net and vision transformers.
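
Illustrative sketch (not taken from the paper): stochastic noise injection can be as simple as adding independent random noise to every voxel of a training volume. The abstract does not state the noise distribution or its scale, so zero-mean Gaussian noise, the `sigma` value, and the nested-list volume layout below are assumptions chosen for a dependency-free example.

```python
import random


def add_gaussian_noise(volume, sigma=0.05, rng=None):
    """Return a noisy copy of a 3D volume stored as nested lists [depth][row][col].

    Each voxel receives independent zero-mean Gaussian noise with standard
    deviation `sigma`. The input volume is left unmodified.
    """
    rng = rng or random.Random()
    return [
        [[voxel + rng.gauss(0.0, sigma) for voxel in row] for row in plane]
        for plane in volume
    ]


# A tiny 2 x 2 x 3 volume with intensities assumed to be scaled to [0, 1].
volume = [
    [[0.0, 0.1, 0.2], [0.3, 0.4, 0.5]],
    [[0.6, 0.7, 0.8], [0.9, 1.0, 0.5]],
]
augmented = add_gaussian_noise(volume, sigma=0.05, rng=random.Random(0))
```

Augmentation of this kind is normally applied only to the training folds, with a fresh noise draw each epoch, so that validation and test scans stay untouched.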

cs.LG

[111] When Reasoning Beats Scale: A 1.5B Reasoning Model Outranks 13B LLMs as Discriminator

Md Fahim Anjum

Main category: cs.LG

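TL;DR: The study compares a reasoning-capable large language model (DeepSeek-R1) with non-reasoning models on text-to-SQL, finding that the reasoning model is the better discriminator but may be a weaker generator than smaller non-reasoning models.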

Details

Motivation: Explore the potential of reasoning models as discriminators in planning frameworks and compare their performance with non-reasoning models. Method: Benchmark a distilled 1.5B-parameter reasoning model (DeepSeek-R1) against several non-reasoning LLMs in a generator-discriminator framework, and propose a new method for extracting soft scores from the chain of thought. Result: DeepSeek-R1 outperforms CodeLlama-7B and CodeLlama-13B at discrimination but does worse at generation; the logical capability of reasoning models has a ceiling. Conclusion: Reasoning models show great promise as discriminators but may trail non-reasoning models as generators, so their role in a planning framework should be matched to the task. Abstract: Large Language Models (LLMs) with reasoning capabilities offer a promising path for improving candidate evaluation in planning frameworks, but their relative performance against traditional non-reasoning models remains largely underexplored. In this study, we benchmark a distilled 1.5B parameter reasoning model (DeepSeek-R1) against several state-of-the-art non-reasoning LLMs within a generator-discriminator LLM planning framework for the text-to-SQL task. For this, we introduce a novel method for extracting soft scores from the chain-of-thought (CoT) outputs of reasoning models, which enables fine-grained ranking of candidates. Our central hypothesis is that reasoning models are more effective discriminators than non-reasoning LLMs. Our results show that distilled DeepSeek-R1-1.5B achieves up to 87% higher F1 and 3.7% better discrimination accuracy than CodeLlama-7B, as well as 3.7% higher execution accuracy than CodeLlama-13B, despite having significantly fewer parameters. Furthermore, we find that there is a limit to the logical capabilities of reasoning models, and only providing more context or allowing more compute budget for reasoning is not enough to improve their discrimination performance. Finally, we demonstrate that, unlike non-reasoning LLMs, reasoning models find generation more challenging than discrimination and may underperform as generators compared to smaller non-reasoning LLMs. Our work highlights the potential of reasoning models as discriminators in agentic frameworks, far outweighing their capabilities as generators, offering insights into their optimal role within LLM planning infrastructures.

[112] Scalability Matters: Overcoming Challenges in InstructGLM with Similarity-Degree-Based Sampling

Hyun Lee,Chris Yi,Maminur Islam,B. D. S. Aritra

Main category: cs.LG

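TL;DR: Proposes SDM-InstructGLM, a new instruction-tuned graph language model framework that addresses the scalability and efficiency problems of LLMs on graph tasks without relying on GNNs.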

Details

Motivation: LLMs see limited use on graph-related tasks, mainly because of scalability constraints and the lack of dedicated mechanisms for graph structure. Existing methods mostly rely on GNNs, while encoding graph structure directly within LLMs is underexplored. Method: Introduce a biased random walk driven by node-feature similarity and degree centrality that selectively samples and encodes graph information, making the structured representation inside the LLM more efficient. Result: Markedly better token efficiency, less information loss than random sampling, and strong performance on graph tasks such as node classification and link prediction. Conclusion: Demonstrates that LLM-only graph processing is feasible, opening a path to graph learning without GNNs. Abstract: Large Language Models (LLMs) have demonstrated strong capabilities in various natural language processing tasks; however, their application to graph-related problems remains limited, primarily due to scalability constraints and the absence of dedicated mechanisms for processing graph structures. Existing approaches predominantly integrate LLMs with Graph Neural Networks (GNNs), using GNNs as feature encoders or auxiliary components. However, directly encoding graph structures within LLMs has been underexplored, particularly in the context of large-scale graphs where token limitations hinder effective representation. To address these challenges, we propose SDM-InstructGLM, a novel instruction-tuned Graph Language Model (InstructGLM) framework that enhances scalability and efficiency without relying on GNNs. Our method introduces a similarity-degree-based biased random walk mechanism, which selectively samples and encodes graph information based on node-feature similarity and degree centrality, ensuring an adaptive and structured representation within the LLM. This approach significantly improves token efficiency, mitigates information loss due to random sampling, and enhances performance on graph-based tasks such as node classification and link prediction. Furthermore, our results demonstrate the feasibility of LLM-only graph processing, enabling scalable and interpretable Graph Language Models (GLMs) optimized through instruction-based fine-tuning. This work paves the way for GNN-free approaches to graph learning, leveraging LLMs as standalone graph reasoning models. Our source code is available on GitHub.
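
Illustrative sketch (not the authors' implementation): a biased random walk picks the next node with a probability that depends on a score instead of uniformly. The abstract names two signals, node-feature similarity and degree centrality, but not how they are combined, so the linear mix controlled by a hypothetical `alpha` parameter below is an assumption, as are the function names and the adjacency-dict graph layout.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two equal-length feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def biased_walk(adj, feats, start, length, alpha=0.5, rng=None):
    """Walk `length` steps from `start`, favoring similar and well-connected neighbors.

    adj:   dict mapping node -> list of neighbor nodes
    feats: dict mapping node -> feature vector (list of floats)
    alpha: hypothetical mixing weight between similarity (alpha) and degree (1 - alpha)
    """
    rng = rng or random.Random(0)
    max_degree = max(len(neighbors) for neighbors in adj.values())
    walk, node = [start], start
    for _ in range(length):
        neighbors = adj[node]
        if not neighbors:  # dead end: stop early
            break
        weights = [
            alpha * max(cosine(feats[node], feats[v]), 0.0)
            + (1.0 - alpha) * len(adj[v]) / max_degree
            + 1e-9  # keeps every neighbor reachable
            for v in neighbors
        ]
        node = rng.choices(neighbors, weights=weights, k=1)[0]
        walk.append(node)
    return walk


adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
feats = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.5, 0.5], 3: [0.0, 1.0]}
walk = biased_walk(adj, feats, start=0, length=5, rng=random.Random(1))
```

The sampled node sequence would then be serialized into the prompt; a shorter, better-targeted walk is what saves tokens compared with listing a uniformly sampled neighborhood.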

[113] Grouped Sequency-arranged Rotation: Optimizing Rotation Transformation for Quantization for Free

Euntae Choi,Sumin Song,Woosang Lim,Sungjoo Yoo

Main category: cs.LG

TL;DR: Proposes a training-free method that improves the rotation matrix to address low-bit quantization, with substantial performance gains.

Details

Motivation: LLMs are computationally expensive to deploy, and existing rotation methods perform poorly at very low bit-widths such as 2-bit. Method: Build an improved rotation matrix from the Walsh-Hadamard transform with sequency ordering, and propose Grouped Sequency-arranged Rotation (GSR) to reduce quantization error. Result: Strong results on reasoning tasks and WikiText-2 perplexity (PPL), with performance close to optimization-based methods. Conclusion: The method needs no training, substantially improves low-bit quantization, and can be combined with other rotation techniques. Abstract: Large Language Models (LLMs) face deployment challenges due to high computational costs, and while Post-Training Quantization (PTQ) offers a solution, existing rotation-based methods struggle at very low bit-widths like 2-bit. We introduce a novel, training-free approach to construct an improved rotation matrix, addressing the limitations of current methods. The key contributions include leveraging the Walsh-Hadamard transform with sequency ordering, which clusters similar frequency components to reduce quantization error compared to standard Hadamard matrices, significantly improving performance. Furthermore, we propose a Grouped Sequency-arranged Rotation (GSR) using block-diagonal matrices with smaller Walsh blocks, effectively isolating outlier impacts and achieving performance comparable to optimization-based methods without requiring any training. Our method demonstrates robust performance on reasoning tasks and in perplexity (PPL) on WikiText-2. Our method also enhances results even when applied over existing learned rotation techniques.
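
Illustrative sketch (not the authors' code): a sequency-ordered Walsh matrix is a Hadamard matrix whose rows are sorted by their number of sign changes, and a grouped rotation places copies of a small Walsh block along the diagonal of a larger matrix. The snippet below builds both in plain Python. The function names, the Sylvester construction as the starting point, and the `1/sqrt(group)` scaling that makes the result orthonormal are assumptions for illustration; how the paper applies the rotation to weights and activations is not shown.

```python
def hadamard(n):
    """Sylvester-construction Hadamard matrix of size n (n must be a power of two)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    H = [[1]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H


def sign_changes(row):
    """Sequency of a +1/-1 row: how many times the sign flips along it."""
    return sum(1 for a, b in zip(row, row[1:]) if a != b)


def walsh(n):
    """Hadamard rows reordered by increasing sequency (Walsh ordering)."""
    return sorted(hadamard(n), key=sign_changes)


def grouped_rotation(dim, group):
    """Block-diagonal orthonormal matrix with dim // group scaled Walsh blocks."""
    assert dim % group == 0, "dim must be a multiple of group"
    W, scale = walsh(group), group ** -0.5
    R = [[0.0] * dim for _ in range(dim)]
    for start in range(0, dim, group):
        for i in range(group):
            for j in range(group):
                R[start + i][start + j] = W[i][j] * scale
    return R


W8 = walsh(8)
R = grouped_rotation(dim=8, group=4)
```

Because each block only mixes the `group` channels it covers, an outlier channel can spread its magnitude across at most that many outputs, which is the isolation effect the abstract attributes to the grouped form.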

[114] Quiet Feature Learning in Algorithmic Tasks

Prudhviraj Naidu,Zixian Wang,Leon Bergen,Ramamohan Paturi

Main category: cs.LG

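TL;DR: Transformer language models trained on algorithmic tasks show pronounced phase transitions in their loss curves, challenging the assumption that loss improves incrementally.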

Details

Motivation: Study the loss-curve behavior of Transformers trained on algorithmic tasks and how it departs from established power-law scaling trends. Method: Train Transformer models on ten foundational algorithmic tasks and analyze their loss curves and internal feature learning. Result: Loss plateaus and then drops abruptly, accompanied by a shift from quiet features to loud features; disrupting a single feature can sharply degrade performance. Conclusion: Loss does not reliably track incremental progress; key internal features may develop beneath the surface until they suddenly trigger a performance gain. Abstract: We train Transformer-based language models on ten foundational algorithmic tasks and observe pronounced phase transitions in their loss curves that deviate from established power-law scaling trends. Over large ranges of compute, the validation loss barely improves, then abruptly decreases. Probing the models' internal representations reveals the learning of quiet features during the stagnant phase, followed by sudden acquisition of loud features that coincide with the sharp drop in loss. Our ablation experiments show that disrupting a single learned feature can dramatically degrade performance, providing evidence of their causal role in task performance. These findings challenge the prevailing assumption that next-token predictive loss reliably tracks incremental progress; instead, key internal features may be developing below the surface until they coalesce, triggering a rapid performance gain.

[115] LLAMAPIE: Proactive In-Ear Conversation Assistants

Tuochao Chen,Nicholas Batchelder,Alisa Liu,Noah Smith,Shyamnath Gollakota

Main category: cs.LG

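TL;DR: LlamaPIE is a real-time proactive assistant that delivers concise guidance through hearable devices without explicit user invocation, aiming to enhance human conversations.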

Details

Motivation: Traditional language models require explicit user invocation, whereas LlamaPIE is meant to run in the background and anticipate user needs without interrupting the conversation. Method: Build a semi-synthetic dialogue dataset and use a two-model pipeline: a small model decides when to respond, and a larger model generates the response. Result: Evaluation on real-world datasets shows the approach is effective, and user studies show a preference for the proactive assistant. Conclusion: LlamaPIE can improve live conversations and has practical potential. Abstract: We introduce LlamaPIE, the first real-time proactive assistant designed to enhance human conversations through discreet, concise guidance delivered via hearable devices. Unlike traditional language models that require explicit user invocation, this assistant operates in the background, anticipating user needs without interrupting conversations. We address several challenges, including determining when to respond, crafting concise responses that enhance conversations, leveraging knowledge of the user for context-aware assistance, and real-time, on-device processing. To achieve this, we construct a semi-synthetic dialogue dataset and propose a two-model pipeline: a small model that decides when to respond and a larger model that generates the response. We evaluate our approach on real-world datasets, demonstrating its effectiveness in providing helpful, unobtrusive assistance. User studies with our assistant, implemented on Apple Silicon M2 hardware, show a strong preference for the proactive assistant over both a baseline with no assistance and a reactive model, highlighting the potential of LlamaPIE to enhance live conversations.

[116] AI-driven multi-source data fusion for algal bloom severity classification in small inland water bodies: Leveraging Sentinel-2, DEM, and NOAA climate data

Ioannis Nasios

Main category: cs.LG

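TL;DR: The study presents an efficient method that combines multi-source remote sensing data with AI models to detect harmful algal blooms, with potential for global application.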

Details

Motivation: Harmful algal blooms increasingly threaten inland water quality and public health, creating an urgent need for efficient, accurate, and affordable detection. Method: Integrate Sentinel-2 optical imagery, DEM elevation data, and NOAA climate data, and combine tree-based models with a neural network in an ensemble classifier. Result: The tree models perform strongly; adding the neural network improves robustness and shows that deep learning models can handle multi-source remote sensing inputs. Conclusion: The method monitors algal blooms dynamically using high-resolution satellite imagery and AI analysis; the code is open source and the approach has potential for global use. Abstract: Harmful algal blooms are a growing threat to inland water quality and public health worldwide, creating an urgent need for efficient, accurate, and cost-effective detection methods. This research introduces a high-performing methodology that integrates multiple open-source remote sensing data sources with advanced artificial intelligence models. Key data sources include Copernicus Sentinel-2 optical imagery, the Copernicus Digital Elevation Model (DEM), and NOAA's High-Resolution Rapid Refresh (HRRR) climate data, all efficiently retrieved using platforms like Google Earth Engine (GEE) and Microsoft Planetary Computer (MPC). The NIR and two SWIR bands from Sentinel-2, the altitude from the elevation model, the temperature and wind from NOAA, as well as the longitude and latitude, were the most important features. The approach combines two types of machine learning models, tree-based models and a neural network, into an ensemble for classifying algal bloom severity. While the tree models performed strongly on their own, incorporating a neural network added robustness and demonstrated how deep learning models can effectively use diverse remote sensing inputs. The method leverages high-resolution satellite imagery and AI-driven analysis to monitor algal blooms dynamically, and although initially developed for a NASA competition in the U.S., it shows potential for global application. The complete code is available for further adaptation and practical implementation, illustrating the convergence of remote sensing data and AI to address critical environmental challenges (https://github.com/IoannisNasios/HarmfulAlgalBloomDetection).

[117] When Dynamic Data Selection Meets Data Augmentation

Suorong Yang,Peng Ye,Furao Shen,Dongzhan Zhou

Main category: cs.LG

TL;DR: Proposes a new framework that unifies dynamic data selection and augmentation, improving both training efficiency and performance.

Details

Motivation: Dynamic data selection speeds up training but can limit data diversity, while data augmentation increases diversity but is not optimized jointly with selection. Method: Estimate each sample's joint distribution of local density and multimodal semantic consistency to select samples suited to augmentation while suppressing noisy or ambiguous data. Result: Outperforms existing methods across benchmark datasets and architectures, for example cutting training cost by 50% on ImageNet-1k with lossless performance. Conclusion: The method is efficient and also improves model robustness, giving it practical value. Abstract: Dynamic data selection aims to accelerate training with lossless performance. However, reducing training data inherently limits data diversity, potentially hindering generalization. While data augmentation is widely used to enhance diversity, it is typically not optimized in conjunction with selection. As a result, directly combining these techniques fails to fully exploit their synergies. To tackle the challenge, we propose a novel online data training framework that, for the first time, unifies dynamic data selection and augmentation, achieving both training efficiency and enhanced performance. Our method estimates each sample's joint distribution of local density and multimodal semantic consistency, allowing for the targeted selection of augmentation-suitable samples while suppressing the inclusion of noisy or ambiguous data. This enables a more significant reduction in dataset size without sacrificing model generalization. Experimental results demonstrate that our method outperforms existing state-of-the-art approaches on various benchmark datasets and architectures, e.g., reducing training costs by 50% on ImageNet-1k with lossless performance. Furthermore, our approach enhances noise resistance and improves model robustness, reinforcing its practical utility in real-world scenarios.

[118] Merging and Disentangling Views in Visual Reinforcement Learning for Robotic Manipulation

Abdulaziz Almuzairee,Rohan Patil,Dwait Bhatt,Henrik I. Christensen

Main category: cs.LG

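TL;DR: Proposes the Merge And Disentanglement (MAD) algorithm, which merges multiple views efficiently to improve sample efficiency and augments with single-view features for lightweight deployment and robust policies.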

Details

Motivation: Multi-view vision improves robustness in manipulation tasks but is computationally costly and complex to deploy, so an efficient, lightweight solution is needed. Method: MAD merges multi-view data to improve sample efficiency while incorporating single-view features to allow lightweight deployment. Result: Efficiency and robustness are validated on Meta-World and ManiSkill3. Conclusion: MAD improves sample efficiency while enabling lightweight deployment and robust policies for complex manipulation tasks. Abstract: Vision is well-known for its use in manipulation, especially using visual servoing. To make it robust, multiple cameras are needed to expand the field of view. That is computationally challenging. Merging multiple views and using Q-learning allows the design of more effective representations and optimization of sample efficiency. Such a solution might be expensive to deploy. To mitigate this, we introduce a Merge And Disentanglement (MAD) algorithm that efficiently merges views to increase sample efficiency while augmenting with single-view features to allow lightweight deployment and ensure robust policies. We demonstrate the efficiency and robustness of our approach using Meta-World and ManiSkill3. For project website and code, see https://aalmuzairee.github.io/mad

cs.RO

[119] OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation

Can Cui,Pengxiang Ding,Wenxuan Song,Shuanghao Bai,Xinyang Tong,Zirui Ge,Runze Suo,Wanqi Zhou,Yang Liu,Bofang Jia,Han Zhao,Siteng Huang,Donglin Wang

Main category: cs.RO

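TL;DR: This paper surveys the designs of existing dual-system VLA architectures and evaluates them systematically, aiming to provide a low-cost open-source model for further research.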

Details

Motivation: Address the shortage of open-source resources for dual-system VLA architectures and support performance analysis and optimization. Method: Summarize and compare the designs of existing dual-system architectures and run systematic empirical evaluations. Result: Provides a low-cost open-source model, with plans to keep updating experimental conclusions and higher-performing models. Conclusion: Open-source models and continued updates are intended to advance research on and application of dual-system VLA architectures. Abstract: Dual-system VLA (Vision-Language-Action) architectures have become a hot topic in embodied intelligence research, but there is a lack of sufficient open-source work for further performance analysis and optimization. To address this problem, this paper will summarize and compare the structural designs of existing dual-system architectures, and conduct systematic empirical evaluations on the core design elements of existing dual-system architectures. Ultimately, it will provide a low-cost open-source model for further exploration. The project will continue to be updated with further experimental conclusions and better-performing open-source models. Project page: https://openhelix-robot.github.io/.

[120] Scalable Aerial GNSS Localization for Marine Robots

Shuo Wen,Edwin Meriaux,Mariana Sosa Guzmán,Charlotte Morissette,Chloe Si,Bobak Baghi,Gregory Dudek

Main category: cs.RO

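TL;DR: Proposes a new method that uses a GNSS-equipped aerial drone to localize marine robots near the water surface, addressing the signal reflection and high cost that hamper traditional methods.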

Details

Motivation: Onboard GNSS works poorly on water because of signal reflection and high receiver cost, and existing alternatives such as inertial navigation and DVL suffer from error accumulation and high computational complexity. Method: Use a GNSS-equipped drone to track and localize marine robots near the water surface. Result: Experiments show the method localizes single and multiple marine robots accurately. Conclusion: The method offers an efficient, scalable solution for marine robot localization. Abstract: Accurate localization is crucial for water robotics, yet traditional onboard Global Navigation Satellite System (GNSS) approaches are difficult or ineffective due to signal reflection on the water's surface and the high cost of aquatic GNSS receivers. Existing approaches, such as inertial navigation, Doppler Velocity Loggers (DVL), SLAM, and acoustic-based methods, face challenges like error accumulation and high computational complexity. Therefore, a more efficient and scalable solution remains necessary. This paper proposes an alternative approach that leverages an aerial drone equipped with GNSS localization to track and localize a marine robot once it is near the surface of the water. Our results show that this novel adaptation enables accurate single- and multi-robot marine localization.

[121] RGB-Event Fusion with Self-Attention for Collision Prediction

Pietro Bonazzi,Christian Vogt,Michael Jost,Haotong Qin,Lyes Khacef,Federico Paredes-Valles,Michele Magno

Main category: cs.RO

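TL;DR: The paper proposes a neural network framework that predicts the time and position of a drone's collision with a dynamic obstacle by fusing RGB and event-based vision sensors through self-attention. Experiments show the fusion model is more accurate than single-modality models but costs more compute.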

Details

Motivation: Ensure safe real-time obstacle avoidance for autonomous robots in dynamic environments. Method: Two encoder branches process RGB and event data separately, are fused by self-attention, and are benchmarked on the ABCD dataset. Result: At a prediction rate of 50 Hz, the fusion model improves accuracy by 1% on average and by 10% at distances beyond 0.5 m, but memory and compute costs rise substantially. The event-based model outperforms the RGB model at similar computational cost. Conclusion: Multi-modal perception (RGB plus event) requires trading accuracy against computational efficiency in robotics, and event cameras are a strong alternative to RGB. Abstract: Ensuring robust and real-time obstacle avoidance is critical for the safe operation of autonomous robots in dynamic, real-world environments. This paper proposes a neural network framework for predicting the time and position of collision of an unmanned aerial vehicle with a dynamic object, using RGB and event-based vision sensors. The proposed architecture consists of two separate encoder branches, one for each modality, followed by fusion by self-attention to improve prediction accuracy. To facilitate benchmarking, we leverage the ABCD [8] dataset, which enables detailed comparisons of single-modality and fusion-based approaches. At the same prediction throughput of 50 Hz, the experimental results show that the fusion-based model offers an improvement in prediction accuracy over single-modality approaches of 1% on average and 10% for distances beyond 0.5 m, but comes at the cost of +71% in memory and +105% in FLOPs. Notably, the event-based model outperforms the RGB model by 4% for position and 26% for time error at a similar computational cost, making it a competitive alternative. Additionally, we evaluate quantized versions of the event-based models, applying 1- to 8-bit quantization to assess the trade-offs between predictive performance and computational efficiency. These findings highlight the trade-offs of multi-modal perception using RGB and event-based cameras in robotic applications.

physics.geo-ph

[122] On the Residual-based Neural Network for Unmodeled Distortions in Coordinate Transformation

Vinicius Francisco Rofatto,Luiz Felipe Rodrigues de Almeida,Marcelo Tomio Matsuoka,Ivandro Klein,Mauricio Roberto Veronez,Luiz Gonzaga Da Silveira Junior

Main category: physics.geo-ph

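TL;DR: Proposes a residual-based neural correction strategy in which a neural network learns the systematic distortions remaining after an initial geometric transformation, reducing model complexity and improving performance.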

Details

Motivation: Traditional coordinate transformation models struggle with nonlinear, spatially dependent distortions, leaving significant residual errors in geospatial applications. Method: A residual modeling strategy in which the neural network learns only the systematic distortions left by the initial transformation, reducing model complexity. Result: On simulated and real-world tasks, the method is more accurate and stable than direct neural network conversion and classical models, especially with sparse or structured control point configurations. Conclusion: Residual modeling is a lightweight, robust way to improve coordinate transformation accuracy. Abstract: Coordinate transformation models often fail to account for nonlinear and spatially dependent distortions, leading to significant residual errors in geospatial applications. Here we propose a residual-based neural correction strategy, in which a neural network learns to model only the systematic distortions left by an initial geometric transformation. By focusing solely on residual patterns, the proposed method reduces model complexity and improves performance, particularly in scenarios with sparse or structured control point configurations. We evaluate the method using both simulated datasets with varying distortion intensities and sampling strategies, as well as real-world image georeferencing tasks. Compared with a direct neural network coordinate converter and classical transformation models, the residual-based neural correction delivers more accurate and stable results under challenging conditions, while maintaining comparable performance in ideal cases. These findings demonstrate the effectiveness of residual modelling as a lightweight and robust alternative for improving coordinate transformation accuracy.
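
Illustrative sketch (not the authors' implementation): the residual strategy has two stages, a classical geometric fit followed by a second model trained only on what the first stage gets wrong. The snippet below uses a least-squares 2D similarity (four-parameter) transformation for stage one. For stage two it substitutes an inverse-distance-weighted interpolator for the neural network so the example stays dependency-free; that substitution, the choice of similarity transformation, and all function names are assumptions made for illustration.

```python
def fit_similarity(src, dst):
    """Least-squares 2D similarity: X = a*x - b*y + tx, Y = b*x + a*y + ty."""
    n = len(src)
    xm, ym = sum(p[0] for p in src) / n, sum(p[1] for p in src) / n
    Xm, Ym = sum(q[0] for q in dst) / n, sum(q[1] for q in dst) / n
    S = A = B = 0.0
    for (x, y), (X, Y) in zip(src, dst):
        xc, yc, Xc, Yc = x - xm, y - ym, X - Xm, Y - Ym
        S += xc * xc + yc * yc
        A += xc * Xc + yc * Yc
        B += xc * Yc - yc * Xc
    a, b = A / S, B / S
    return a, b, Xm - (a * xm - b * ym), Ym - (b * xm + a * ym)


def apply_similarity(params, p):
    a, b, tx, ty = params
    return a * p[0] - b * p[1] + tx, b * p[0] + a * p[1] + ty


def make_residual_model(src, residuals, power=2.0):
    """Stand-in for the neural network: inverse-distance-weighted residual lookup."""
    def predict(p):
        num_x = num_y = den = 0.0
        for (x, y), (rx, ry) in zip(src, residuals):
            d2 = (p[0] - x) ** 2 + (p[1] - y) ** 2
            if d2 == 0.0:  # exactly on a control point
                return rx, ry
            w = d2 ** (-power / 2.0)
            num_x, num_y, den = num_x + w * rx, num_y + w * ry, den + w
        return num_x / den, num_y / den
    return predict


def fit_corrected_transform(src, dst):
    """Stage 1: geometric fit. Stage 2: model the residuals it leaves behind."""
    params = fit_similarity(src, dst)
    base = [apply_similarity(params, p) for p in src]
    residuals = [(X - bx, Y - by) for (X, Y), (bx, by) in zip(dst, base)]
    residual_model = make_residual_model(src, residuals)

    def transform(p):
        bx, by = apply_similarity(params, p)
        rx, ry = residual_model(p)
        return bx + rx, by + ry
    return transform


# Control points: a 90-degree rotation plus shift, with one locally distorted point.
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
dst = [(2.0, 3.0), (2.0, 4.0), (1.0, 3.0), (1.2, 4.1)]
transform = fit_corrected_transform(src, dst)
```

The second-stage model only has to represent small, smooth corrections, which is the reason the abstract gives for lower model complexity compared with learning the whole coordinate mapping directly.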

cs.AI

[123] The Power of Stories: Narrative Priming Shapes How LLM Agents Collaborate and Compete

Gerrit Großmann,Larisa Ivanova,Sai Leela Poduru,Mohaddeseh Tabrizian,Islam Mesabah,David A. Selby,Sebastian J. Vollmer

Main category: cs.AI

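TL;DR: The study asks whether shared narratives can promote cooperation among LLM agents; public goods game experiments show that a common narrative markedly improves cooperation, while differing or self-serving narratives undermine it.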

Details

Motivation: Explore whether shared narratives can foster collaboration among LLM agents as they do in human cooperation. Method: Use a finitely repeated public goods game, prime LLM agents with stories that emphasize teamwork to different degrees, and observe the effect on negotiation behavior. Result: A common narrative markedly raises cooperation success, while differing or self-serving narratives cause cooperation to break down, with self-interested agents prevailing. Conclusion: The results may have implications for multi-agent system design and AI alignment. Abstract: According to Yuval Noah Harari, large-scale human cooperation is driven by shared narratives that encode common beliefs and values. This study explores whether such narratives can similarly nudge LLM agents toward collaboration. We use a finitely repeated public goods game in which LLM agents choose either cooperative or egoistic spending strategies. We prime agents with stories highlighting teamwork to different degrees and test how this influences negotiation outcomes. Our experiments explore four questions: (1) How do narratives influence negotiation behavior? (2) What differs when agents share the same story versus different ones? (3) What happens when the agent numbers grow? (4) Are agents resilient against self-serving negotiators? We find that story-based priming significantly affects negotiation strategies and success rates. Common stories improve collaboration, benefiting each agent. By contrast, priming agents with different stories reverses this effect, and those agents primed toward self-interest prevail. We hypothesize that these results carry implications for multi-agent system design and AI alignment.
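
Illustrative sketch (not taken from the paper): in the standard public goods game, each player keeps whatever they do not contribute, and the pooled contributions are multiplied and shared equally. The snippet below computes one round of payoffs under that textbook rule. The endowment of 10, the multiplier of 1.5, and the function name are assumptions; the paper's actual game parameters are not given in the abstract.

```python
def public_goods_payoffs(contributions, endowment=10.0, multiplier=1.5):
    """One round of a standard public goods game.

    Each player's payoff is what they kept (endowment - contribution) plus an
    equal share of the multiplied common pool.
    """
    n = len(contributions)
    share = multiplier * sum(contributions) / n
    return [endowment - c + share for c in contributions]


all_cooperate = public_goods_payoffs([10.0, 10.0, 10.0, 10.0])
one_defects = public_goods_payoffs([0.0, 10.0, 10.0, 10.0])
all_defect = public_goods_payoffs([0.0, 0.0, 0.0, 0.0])
```

With a multiplier between 1 and the number of players, the group does best when everyone contributes, yet a lone free rider out-earns the cooperators, which is the tension that makes the game a useful probe of whether priming changes agent behavior.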

[124] Beyond Theorem Proving: Formulation, Framework and Benchmark for Formal Problem-Solving

Qi Liu,Xinhao Zheng,Renqiu Xia,Xingzhi Qi,Qinxiang Cao,Junchi Yan

Main category: cs.AI

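TL;DR: The paper proposes the formal problem-solving frameworks FPS and D-FPS, which use FTP environments for process-level verification, and evaluates answer correctness with the RPE method.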

Details

Motivation: Problem-solving is central to science and engineering but lacks a general yet concrete formulation, and the need for process-level verification of AI problem-solving agents is underexplored. Method: Proposes FPS, based on a deterministic Markov decision process and built on FTP environments; D-FPS, which decouples solving from answer verification; and RPE for answer verification. Result: Three benchmarks are built; evaluation shows existing FTP models have limited problem-solving ability (at most 27.47%). Conclusion: FPS and D-FPS fill the gap in formalizing problem-solving, and RPE enables trustworthy evaluation. Abstract: As a seemingly self-explanatory task, problem-solving has been a significant component of science and engineering. However, a general yet concrete formulation of problem-solving itself is missing. With the recent development of AI-based problem-solving agents, the demand for process-level verifiability is rapidly increasing yet underexplored. To fill these gaps, we present a principled formulation of problem-solving as a deterministic Markov decision process; a novel framework, FPS (Formal Problem-Solving), which utilizes existing FTP (formal theorem proving) environments to perform process-verified problem-solving; and D-FPS (Deductive FPS), decoupling solving and answer verification for better human-alignment. The expressiveness, soundness and completeness of the frameworks are proven. We construct three benchmarks on problem-solving: FormalMath500, a formalization of a subset of the MATH500 benchmark; MiniF2F-Solving and PutnamBench-Solving, adaptations of FTP benchmarks MiniF2F and PutnamBench. For faithful, interpretable, and human-aligned evaluation, we propose RPE (Restricted Propositional Equivalence), a symbolic approach to determine the correctness of answers by formal verification. We evaluate four prevalent FTP models and two prompting methods as baselines, solving at most 23.77% of FormalMath500, 27.47% of MiniF2F-Solving, and 0.31% of PutnamBench-Solving.

[125] Design description of Wisdom Computing Persperctive

TianYi Yu

Main category: cs.AI

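TL;DR: Designs a system based on AI and visual animation technology that recognizes handwritten matrices and displays the calculation process step by step, helping students understand mathematical operations.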

Details

Motivation: To address the difficulty students have with abstract formulas and complex calculation steps when learning mathematics. Method: Combines a Mamba backbone network with a YOLO model to recognize and reconstruct handwritten matrices, uses the CoordAttention mechanism to improve the spatial localization of characters, and displays the calculation process with the Manim animation engine. Result: The system is highly modular and flexible, and can generate animated examples of different mathematical operations in real time to improve learning. Conclusion: The system offers an intuitive, easy-to-use, and efficient teaching aid that helps students understand the logic of mathematical operations in depth. Abstract: This course design aims to develop and research a handwritten matrix recognition and step-by-step visual calculation process display system, addressing the issue of abstract formulas and complex calculation steps that students find difficult to understand when learning mathematics. By integrating artificial intelligence with visualization animation technology, the system enhances precise recognition of handwritten matrix content through the introduction of Mamba backbone networks, completes digital extraction and matrix reconstruction using the YOLO model, and simultaneously combines CoordAttention coordinate attention mechanisms to improve the accurate grasp of character spatial positions. The calculation process is demonstrated frame by frame through the Manim animation engine, vividly showcasing each mathematical calculation step, helping students intuitively understand the intrinsic logic of mathematical operations. Through dynamically generating animation processes for different computational tasks, the system exhibits high modularity and flexibility, capable of generating various mathematical operation examples in real-time according to student needs. By innovating human-computer interaction methods, it brings mathematical calculation processes to life, helping students bridge the gap between knowledge and understanding on a deeper level, ultimately achieving a learning experience where "every step is understood." The system's scalability and interactivity make it an intuitive, user-friendly, and efficient auxiliary tool in education.

math.OC [Back]

[126] Dynamic Network Flow Optimization for Task Scheduling in PTZ Camera Surveillance Systems

Mohammad Merati,David Castañón

Main category: math.OC

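TL;DR: This paper proposes a novel method for optimizing the scheduling and control of PTZ cameras in dynamic surveillance environments, combining Kalman filters with a dynamic network flow model to significantly improve real-time video capture efficiency.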

Details

Motivation: In dynamic surveillance environments, traditional master-slave camera systems are inefficient and adapt poorly to complex scenes. The paper aims to improve surveillance performance through prediction and optimized scheduling. Method: Kalman filters predict target locations, a dynamic network flow model optimizes scheduling, and group-tracking nodes together with a value system prioritize critical events. Result: Simulations show that the method outperforms traditional systems in coverage, wait time, and missed-event rate. Conclusion: The method significantly improves the efficiency, scalability, and adaptability of surveillance systems, particularly in dynamic and crowded environments. Abstract: This paper presents a novel approach for optimizing the scheduling and control of Pan-Tilt-Zoom (PTZ) cameras in dynamic surveillance environments. The proposed method integrates Kalman filters for motion prediction with a dynamic network flow model to enhance real-time video capture efficiency. By assigning Kalman filters to tracked objects, the system predicts future locations, enabling precise scheduling of camera tasks. This prediction-driven approach is formulated as a network flow optimization, ensuring scalability and adaptability to various surveillance scenarios. To further reduce redundant monitoring, we also incorporate group-tracking nodes, allowing multiple objects to be captured within a single camera focus when appropriate. In addition, a value-based system is introduced to prioritize camera actions, focusing on the timely capture of critical events. By adjusting the decay rates of these values over time, the system ensures prompt responses to tasks with imminent deadlines. Extensive simulations demonstrate that this approach improves coverage, reduces average wait times, and minimizes missed events compared to traditional master-slave camera systems. Overall, our method significantly enhances the efficiency, scalability, and effectiveness of surveillance systems, particularly in dynamic and crowded environments.
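The motion-prediction step can be illustrated with the prediction half of a Kalman filter. The abstract does not state the motion model, so this sketch assumes a one-dimensional constant-velocity model with state (position, velocity); `kalman_predict` and `process_noise` are hypothetical names, and the measurement-update step is omitted.

```python
def kalman_predict(state, cov, dt, process_noise=0.0):
    """Kalman prediction step for an assumed 1D constant-velocity model.

    state: (position, velocity)
    cov:   2x2 covariance as ((p00, p01), (p10, p11))
    Computes x' = F x and P' = F P F^T + Q with F = [[1, dt], [0, 1]]
    and Q = process_noise * I (a simplifying assumption).
    """
    pos, vel = state
    (p00, p01), (p10, p11) = cov
    new_state = (pos + vel * dt, vel)
    new_cov = (
        (p00 + dt * (p01 + p10) + dt * dt * p11 + process_noise, p01 + dt * p11),
        (p10 + dt * p11, p11 + process_noise),
    )
    return new_state, new_cov


# Object at position 2.0 moving at 3.0 units/s, predicted 2 s ahead.
predicted_state, predicted_cov = kalman_predict(
    (2.0, 3.0), ((1.0, 0.0), (0.0, 1.0)), dt=2.0
)
```

The predicted position is where a scheduler would expect the object to be when a camera becomes available; the growing position variance reflects how that expectation becomes less certain with a longer look-ahead.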

cs.MA [Back]

[127] Benchmarking LLMs' Swarm intelligence

Kai Ruan,Mowen Huang,Ji-Rong Wen,Hao Sun

Main category: cs.MA

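TL;DR: The paper introduces SwarmBench, a new benchmark for evaluating the swarm intelligence of LLMs in decentralized multi-agent systems, and finds that LLM performance is uneven under local-information constraints.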

Details

Motivation: To explore the swarm intelligence of LLMs under strict constraints (such as local perception and communication), filling the gap left by existing benchmarks on decentralized coordination challenges. Method: Designs SwarmBench, which comprises five foundational MAS coordination tasks in a configurable 2D grid environment and forces agents to rely on local perception and communication. Result: Evaluating several leading LLMs reveals significant performance differences across tasks and limitations in planning and strategy formation under local-information constraints. Conclusion: SwarmBench provides a tool for assessing the potential of LLMs in decentralized systems and supports reproducible research. Abstract: Large Language Models (LLMs) show potential for complex reasoning, yet their capacity for emergent coordination in Multi-Agent Systems (MAS) when operating under strict constraints (such as the limited local perception and communication characteristic of natural swarms) remains largely unexplored, particularly concerning the nuances of swarm intelligence. Existing benchmarks often do not fully capture the unique challenges of decentralized coordination that arise when agents operate with incomplete spatio-temporal information. To bridge this gap, we introduce SwarmBench, a novel benchmark designed to systematically evaluate the swarm intelligence capabilities of LLMs acting as decentralized agents. SwarmBench features five foundational MAS coordination tasks within a configurable 2D grid environment, forcing agents to rely primarily on local sensory input (k x k view) and local communication. We propose metrics for coordination effectiveness and analyze emergent group dynamics. Evaluating several leading LLMs in a zero-shot setting, we find significant performance variations across tasks, highlighting the difficulties posed by local information constraints. While some coordination emerges, results indicate limitations in robust planning and strategy formation under uncertainty in these decentralized scenarios. Assessing LLMs under swarm-like conditions is crucial for realizing their potential in future decentralized systems. We release SwarmBench as an open, extensible toolkit, built upon a customizable and scalable physical system with defined mechanical properties. It provides environments, prompts, evaluation scripts, and the comprehensive experimental datasets generated, aiming to foster reproducible research into LLM-based MAS coordination and the theoretical underpinnings of Embodied MAS. Our code repository is available at https://github.com/x66ccff/swarmbench.
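To make the "k x k view" constraint concrete, here is a minimal sketch of cropping an agent's local observation from a 2D grid. It is not SwarmBench's API: `local_view`, the requirement that k be odd, and the `#` padding symbol for off-grid cells are all assumptions made for illustration.

```python
def local_view(grid, row, col, k, fill="#"):
    """Return the k x k window of `grid` centred on (row, col).

    Assumptions: k is odd so the agent sits in the centre, and cells
    outside the grid are reported with the `fill` symbol.
    """
    if k % 2 == 0:
        raise ValueError("k must be odd so the view has a centre cell")
    radius = k // 2
    view = []
    for dr in range(-radius, radius + 1):
        line = []
        for dc in range(-radius, radius + 1):
            r, c = row + dr, col + dc
            inside = 0 <= r < len(grid) and 0 <= c < len(grid[0])
            line.append(grid[r][c] if inside else fill)
        view.append(line)
    return view


grid = ["abc", "def", "ghi"]
corner_view = local_view(grid, 0, 0, 3)   # agent in the top-left corner
centre_view = local_view(grid, 1, 1, 3)   # agent in the middle
```

An agent in the corner sees mostly padding, which shows how little of the global state a single agent can observe and why coordination has to emerge from local communication.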

stat.ML [Back]

[128] Cer-Eval: Certifiable and Cost-Efficient Evaluation Framework for LLMs

Ganghua Wang,Zhaorun Chen,Bo Li,Haifeng Xu

Main category: stat.ML

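TL;DR: This paper proposes Cer-Eval, a certifiable and efficient evaluation framework for large language models (LLMs) that adaptively selects test points to reduce testing cost while maintaining high confidence.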

Details

Motivation: As foundation models scale up, evaluating large language models becomes increasingly challenging, and existing evaluation practice lacks a systematic analysis of test-data sufficiency and sample selection. Method: Proposes Cer-Eval, a partition-based algorithm that quantifies test sample complexity, derives tight bounds on it, and adaptively selects test points to reduce evaluation cost. Result: Experiments show that Cer-Eval saves 20% to 40% of test points across various benchmarks while keeping the error level comparable to current evaluation and providing a 95% confidence guarantee. Conclusion: Cer-Eval offers an efficient and certifiable solution for LLM evaluation that significantly reduces evaluation cost. Abstract: As foundation models continue to scale, the size of trained models grows exponentially, presenting significant challenges for their evaluation. Current evaluation practices involve curating increasingly large datasets to assess the performance of large language models (LLMs). However, there is a lack of systematic analysis and guidance on determining the sufficiency of test data or selecting informative samples for evaluation. This paper introduces a certifiable and cost-efficient evaluation framework for LLMs. Our framework adapts to different evaluation objectives and outputs confidence intervals that contain true values with high probability. We use "test sample complexity" to quantify the number of test points needed for a certifiable evaluation and derive tight bounds on test sample complexity. Based on the developed theory, we develop a partition-based algorithm, named Cer-Eval, that adaptively selects test points to minimize the cost of LLM evaluation. Real-world experiments demonstrate that Cer-Eval can save 20% to 40% test points across various benchmarks, while maintaining an estimation error level comparable to the current evaluation process and providing a 95% confidence guarantee.
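For a sense of what a "certifiable" evaluation has to guarantee, the sketch below computes the classical non-adaptive baseline: the number of i.i.d. test points that Hoeffding's inequality requires for a given error and confidence level, and the matching confidence interval. This is not the Cer-Eval algorithm, whose partition-based selection the abstract does not detail; the function names are hypothetical, and scores are assumed to lie in [0, 1].

```python
import math


def hoeffding_sample_size(epsilon, delta):
    """Test points needed so the mean score is within epsilon of the
    true value with probability at least 1 - delta (scores in [0, 1])."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def hoeffding_interval(scores, delta):
    """Two-sided (1 - delta) confidence interval for the mean score."""
    n = len(scores)
    mean = sum(scores) / n
    half_width = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return max(0.0, mean - half_width), min(1.0, mean + half_width)


# Points needed for +/- 0.05 error at 95% confidence.
needed = hoeffding_sample_size(epsilon=0.05, delta=0.05)

# Interval from 100 graded answers, 80 of them correct.
low, high = hoeffding_interval([1.0] * 80 + [0.0] * 20, delta=0.05)
```

An adaptive method aims to reach the same kind of guarantee with fewer test points than this worst-case bound demands.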

cs.SD [Back]

[129] Miipher-2: A Universal Speech Restoration Model for Million-Hour Scale Data Restoration

Shigeki Karita,Yuma Koizumi,Heiga Zen,Haruko Ishikawa,Robin Scheibler,Michiel Bacchiani

Main category: cs.SD

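TL;DR: Miipher-2 is a speech restoration model for cleaning the training data of large-scale generative models; it supports more than 300 languages, needs no explicit conditioning input, and is computationally efficient.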

Details

Motivation: To address generalization, operation without explicit conditioning, and computational efficiency when cleaning training data for large-scale generative models. Method: Uses a pre-trained Universal Speech Model (USM) as the feature extractor, combined with parallel adapters and the WaveFit neural vocoder. Result: Outperforms or matches conventional models in word error rate, speaker similarity, and sound quality scores, with high computational efficiency. Conclusion: Miipher-2 provides an efficient, general-purpose solution for large-scale speech data cleaning. Abstract: Training data cleaning is a new application for generative model-based speech restoration (SR). This paper introduces Miipher-2, an SR model designed for million-hour scale data, for training data cleaning for large-scale generative models like large language models. Key challenges addressed include generalization to unseen languages, operation without explicit conditioning (e.g., text, speaker ID), and computational efficiency. Miipher-2 utilizes a frozen, pre-trained Universal Speech Model (USM), supporting over 300 languages, as a robust, conditioning-free feature extractor. To optimize efficiency and minimize memory, Miipher-2 incorporates parallel adapters for predicting clean USM features from noisy inputs and employs the WaveFit neural vocoder for waveform synthesis. These components were trained on 3,000 hours of multi-lingual, studio-quality recordings with augmented degradations, while USM parameters remained fixed. Experimental results demonstrate Miipher-2's superior or comparable performance to conventional SR models in word-error-rate, speaker similarity, and both objective and subjective sound quality scores across all tested languages. Miipher-2 operates efficiently on consumer-grade accelerators, achieving a real-time factor of 0.0078, enabling the processing of a million-hour speech dataset in approximately three days using only 100 such accelerators.
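The "approximately three days" figure follows directly from the reported real-time factor. The check below assumes the usual definition (processing time divided by audio duration) and perfectly parallel scaling across accelerators; `processing_days` is a hypothetical helper written for this arithmetic only.

```python
def processing_days(audio_hours, real_time_factor, num_accelerators):
    """Wall-clock days to process `audio_hours` of speech.

    Assumptions: real_time_factor = processing time / audio duration,
    and the work splits evenly across accelerators with no overhead.
    """
    compute_hours = audio_hours * real_time_factor
    return compute_hours / num_accelerators / 24.0


# One million hours of audio, RTF 0.0078, 100 accelerators.
days = processing_days(1_000_000, 0.0078, 100)
```

Under these assumptions the total is 7,800 accelerator-hours, or 78 hours per accelerator, which is 3.25 days and consistent with the abstract's estimate.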

Table of Contents

cs.CV [Back]

[1] Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models

[2] In-situ and Non-contact Etch Depth Prediction in Plasma Etching via Machine Learning (ANN & BNN) and Digital Image Colorimetry

[3] VideoLLM Benchmarks and Evaluation: A Survey

[4] Video Forgery Detection for Surveillance Cameras: A Review

[5] PointExplainer: Towards Transparent Parkinson's Disease Diagnosis

[6] Explainable Face Recognition via Improved Localization

[7] GAME: Learning Multimodal Interactions via Graph Structures for Personality Trait Estimation

[8] Advanced Clustering Framework for Semiconductor Image Analytics Integrating Deep TDA with Self-Supervised and Transfer Learning Techniques

[9] An Active Inference Model of Covert and Overt Visual Attention

[10] Novel Extraction of Discriminative Fine-Grained Feature to Improve Retinal Vessel Segmentation

[11] Deep Learning Framework for Infrastructure Maintenance: Crack Detection and High-Resolution Imaging of Infrastructure Surfaces

[12] Action Spotting and Precise Event Detection in Sports: Datasets, Methods, and Challenges

[13] The Eye as a Window to Systemic Health: A Survey of Retinal Imaging from Classical Techniques to Oculomics

[14] FoodTrack: Estimating Handheld Food Portions with Egocentric Video

[15] AS3D: 2D-Assisted Cross-Modal Understanding with Semantic-Spatial Scene Graphs for 3D Visual Grounding

[16] SEVA: Leveraging Single-Step Ensemble of Vicinal Augmentations for Test-Time Adaptation

[17] SMMT: Siamese Motion Mamba with Self-attention for Thermal Infrared Target Tracking

[18] MAISY: Motion-Aware Image SYnthesis for MedicalImage Motion Correction

[19] One2Any: One-Reference 6D Pose Estimation for Any Object

[20] GAPrompt: Geometry-Aware Point Cloud Prompt for 3D Vision Model

[21] Vision Graph Prompting via Semantic Low-Rank Decomposition

[22] R^3-VQA: "Read the Room" by Video Social Reasoning

[23] Learning from Similarity Proportion Loss for Classifying Skeletal Muscle Recovery Stages

[24] DOTA: Deformable Optimized Transformer Architecture for End-to-End Text Recognition with Retrieval-Augmented Generation

[25] S3D: Sketch-Driven 3D Model Generation

[26] VideoPath-LLaVA: Pathology Diagnostic Reasoning Through Video Instruction Tuning

[27] SToLa: Self-Adaptive Touch-Language Framework with Tactile Commonsense Reasoning in Open-Ended Scenarios

[28] An Enhanced YOLOv8 Model for Real-Time and Accurate Pothole Detection and Measurement

[29] CM1 -- A Dataset for Evaluating Few-Shot Information Extraction with Large Vision Language Models

[30] A Weak Supervision Learning Approach Towards an Equitable Parking Lot Occupancy Estimation

[31] Bridging Geometry-Coherent Text-to-3D Generation with Multi-View Diffusion Priors and Gaussian Splatting

[32] Object-Shot Enhanced Grounding Network for Egocentric Video

[33] HDiffTG: A Lightweight Hybrid Diffusion-Transformer-GCN Architecture for 3D Human Pose Estimation

[34] TS-Diff: Two-Stage Diffusion Model for Low-Light RAW Image Enhancement

[35] MoDE: Mixture of Diffusion Experts for Any Occluded Face Recognition

[36] Multi-turn Consistent Image Editing

[37] CountDiffusion: Text-to-Image Synthesis with Training-Free Counting-Guidance Diffusion

[38] WDMamba: When Wavelet Degradation Prior Meets Vision Mamba for Image Dehazing

[39] Balancing Accuracy, Calibration, and Efficiency in Active Learning with Vision Transformers Under Label Noise

[40] Label-efficient Single Photon Images Classification via Active Learning

[41] Tetrahedron-Net for Medical Image Registration

[42] DATA: Multi-Disentanglement based Contrastive Learning for Open-World Semi-Supervised Deepfake Attribution

[43] Predicting Road Surface Anomalies by Visual Tracking of a Preceding Vehicle

[44] SwinLip: An Efficient Visual Speech Encoder for Lip Reading Using Swin Transformer

[45] Deep residual learning with product units

[46] MFSeg: Efficient Multi-frame 3D Semantic Segmentation

[47] DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception

[48] RLMiniStyler: Light-weight RL Style Agent for Arbitrary Sequential Neural Style Generation

[49] Learning Real Facial Concepts for Independent Deepfake Detection

[50] CAD-Llama: Leveraging Large Language Models for Computer-Aided Design Parametric 3D Model Generation

[51] FA-KPConv: Introducing Euclidean Symmetries to KPConv via Frame Averaging

[52] Efficient Flow Matching using Latent Variables

[53] "I Can See Forever!": Evaluating Real-time VideoLLMs for Assisting Individuals with Visual Impairments

[54] Defining and Quantifying Creative Behavior in Popular Image Generators

[55] Leveraging Simultaneous Usage of Edge GPU Hardware Engines for Video Face Detection and Recognition

[56] HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation

[57] Text2CT: Towards 3D CT Volume Generation from Free-text Descriptions Using Diffusion Model

[58] Edge-GPU Based Face Tracking for Face Detection and Recognition Acceleration

[59] DFVO: Learning Darkness-free Visible and Infrared Image Disentanglement and Fusion All at Once

[60] RAFT: Robust Augmentation of FeaTures for Image Segmentation

[61] Registration of 3D Point Sets Using Exponential-based Similarity Matrix

[62] Componential Prompt-Knowledge Alignment for Domain Incremental Learning

[63] Active Sampling for MRI-based Sequential Decision Making

[64] MonoCoP: Chain-of-Prediction for Monocular 3D Object Detection

[65] OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning

[66] FastMap: Revisiting Dense and Scalable Structure from Motion

[67] Person Recognition at Altitude and Range: Fusion of Face, Body Shape and Gait

[68] On Path to Multimodal Generalist: General-Level and General-Bench

cs.GR [Back]

[69] PARC: Physics-based Augmentation with Reinforcement Learning for Character Controllers

[70] TerraFusion: Joint Generation of Terrain Geometry and Texture Using Latent Diffusion Models

[71] BuildingBlock: A Hybrid Approach for Structured Building Generation

[72] Person-In-Situ: Scene-Consistent Human Image Insertion with Occlusion-Aware Pose Control

[73] ELGAR: Expressive Cello Performance Motion Generation for Audio Rendition

[74] Geometry-Aware Texture Generation for 3D Head Modeling with Artist-driven Control

[75] TetWeave: Isosurface Extraction using On-The-Fly Delaunay Tetrahedral Grids for Gradient-Based Mesh Optimization

[76] PrimitiveAnything: Human-Crafted 3D Primitive Assembly Generation with Auto-Regressive Transformer

cs.CL [Back]