cs.CV

[1] Histo-Miner: Deep Learning based Tissue Features Extraction Pipeline from H&E Whole Slide Images of Cutaneous Squamous Cell Carcinoma

Lucas Sancéré, Carina Lorenz, Doris Helbig, Oana-Diana Persa, Sonja Dengler, Alexander Kreuter, Martim Laimer, Anne Fröhlich, Jennifer Landsberg, Johannes Brägelmann, Katarzyna Bozek

Main category: cs.CV

TL;DR: Histo-Miner is a deep learning-based pipeline for analyzing whole-slide images of skin tissue. The authors generate two annotated datasets and use the pipeline to predict immunotherapy response in patients with cutaneous squamous cell carcinoma.

Details

Motivation: Labeled datasets and open-source analysis tools for skin tissue are lacking, particularly for cutaneous squamous cell carcinoma (cSCC).

Method: Convolutional neural networks and vision transformers are trained on 47,392 annotated cell nuclei and 144 tumor-segmented WSIs, and their predictions are summarized into a feature vector.

Result: The models compare favorably with the state of the art: mPQ of 0.569 for nucleus segmentation, macro-averaged F1 of 0.832 for nucleus classification, and mIoU of 0.884 for tumor region segmentation. The feature vector was used to predict immunotherapy response for 45 patients.

Conclusion: Histo-Miner shows potential in clinically relevant scenarios and offers a biological interpretation of therapy-response predictions.

Abstract: Recent advancements in digital pathology have enabled comprehensive analysis of Whole-Slide Images (WSI) from tissue samples, leveraging high-resolution microscopy and computational capabilities. Despite this progress, there is a lack of labeled datasets and open source pipelines specifically tailored for analysis of skin tissue. Here we propose Histo-Miner, a deep learning-based pipeline for analysis of skin WSIs and generate two datasets with labeled nuclei and tumor regions. We develop our pipeline for the analysis of patient samples of cutaneous squamous cell carcinoma (cSCC), a frequent non-melanoma skin cancer. Utilizing the two datasets, comprising 47,392 annotated cell nuclei and 144 tumor-segmented WSIs respectively, both from cSCC patients, Histo-Miner employs convolutional neural networks and vision transformers for nucleus segmentation and classification as well as tumor region segmentation. Performance of trained models positively compares to state of the art with multi-class Panoptic Quality (mPQ) of 0.569 for nucleus segmentation, macro-averaged F1 of 0.832 for nucleus classification and mean Intersection over Union (mIoU) of 0.884 for tumor region segmentation. From these predictions we generate a compact feature vector summarizing tissue morphology and cellular interactions, which can be used for various downstream tasks. Here, we use Histo-Miner to predict cSCC patient response to immunotherapy based on pre-treatment WSIs from 45 patients. Histo-Miner identifies percentages of lymphocytes, the granulocyte to lymphocyte ratio in tumor vicinity and the distances between granulocytes and plasma cells in tumors as predictive features for therapy response. This highlights the applicability of Histo-Miner to clinically relevant scenarios, providing direct interpretation of the classification and insights into the underlying biology.
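
For readers unfamiliar with the mIoU metric quoted for tumor region segmentation, the following is a minimal sketch of how mean Intersection over Union is commonly computed. The function name `mean_iou`, the flattened label lists, and the toy labels are illustrative assumptions; this is not the Histo-Miner evaluation code.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that appear in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both label maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy flattened label maps (hypothetical): 0 = background, 1 = tumor.
pred = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 0, 0]
score = mean_iou(pred, target, num_classes=2)
```

In this toy case each class has an intersection of 2 pixels and a union of 4, so both per-class IoUs and their mean are 0.5.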

[2] Comparison of Visual Trackers for Biomechanical Analysis of Running

Luis F. Gomez, Gonzalo Garrido-Lopez, Julian Fierrez, Aythami Morales, Ruben Tolosana, Javier Rueda, Enrique Navarro

Main category: cs.CV

TL;DR: The paper evaluates six pose trackers for biomechanical analysis of sprinting and proposes a post-processing module to reduce errors. Joint-based models combined with the post-processing module show a marked gain in accuracy.

Details

Motivation: Advances in deep learning models and data resources have driven progress in human pose estimation, but high-precision biomechanical analysis still needs improvement. This work evaluates how different pose trackers perform in the biomechanical analysis of sprints.

Method: Two point trackers and four joint trackers are compared on more than 5870 frames against manual annotations by biomechanical experts. A post-processing module is proposed for outlier detection and fusion prediction of the joint angles.

Result: Joint-based models yield root mean squared errors ranging from 11.41° to 4.37°, which drop to 6.99° and 3.88° with the post-processing module, indicating that pose tracking is practically useful for the biomechanical analysis of running.

Conclusion: Pose tracking methods show potential for biomechanical analysis but need further refinement where high accuracy is required.

Abstract: Human pose estimation has witnessed significant advancements in recent years, mainly due to the integration of deep learning models, the availability of a vast amount of data, and large computational resources. These developments have led to highly accurate body tracking systems, which have direct applications in sports analysis and performance evaluation. This work analyzes the performance of six trackers: two point trackers and four joint trackers for biomechanical analysis in sprints. The proposed framework compares the results obtained from these pose trackers with the manual annotations of biomechanical experts for more than 5870 frames. The experimental framework employs forty sprints from five professional runners, focusing on three key angles in sprint biomechanics: trunk inclination, hip flex extension, and knee flex extension. We propose a post-processing module for outlier detection and fusion prediction in the joint angles. The experimental results demonstrate that using joint-based models yields root mean squared errors ranging from 11.41° to 4.37°. When integrated with the post-processing modules, these errors can be reduced to 6.99° and 3.88°, respectively. The experimental findings suggest that human pose tracking approaches can be valuable resources for the biomechanical analysis of running. However, there is still room for improvement in applications where high accuracy is required.
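
The errors above are root mean squared errors between tracker-derived and expert-annotated joint angles. A minimal sketch of that comparison follows, assuming 2D keypoints and a three-point angle definition (for example hip-knee-ankle for the knee). The function names and the sample values are hypothetical and are not taken from the paper.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at vertex b formed by the 2D points a-b-c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    cos_theta = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos_theta))


def rmse(pred, ref):
    """Root mean squared error between two equal-length angle sequences."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))


# A right angle at the middle keypoint.
knee = joint_angle((0, 2), (0, 1), (1, 1))

# Made-up per-frame knee angles from a tracker and from an expert annotator.
tracker = [150.0, 152.0, 149.0]
expert = [151.0, 150.0, 149.0]
err = rmse(tracker, expert)
```

Here the per-frame errors are -1, 2, and 0 degrees, so the RMSE is sqrt(5/3), roughly 1.29 degrees.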

[3] Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers

Divyansh Srivastava, Xiang Zhang, He Wen, Chenru Wen, Zhuowen Tu

Main category: cs.CV

TL;DR: LayouSyn is a text-to-layout generation method that uses lightweight open-source language models and a diffusion Transformer architecture to generate scene layouts in an open-vocabulary setting, outperforming existing methods.

Details

Motivation: Existing scene layout generation methods are either closed-vocabulary or depend on proprietary large language models, which limits their modeling capabilities and their broader use in controllable image generation.

Method: A lightweight open-source language model extracts scene elements from the text prompt, and a novel aspect-aware diffusion Transformer architecture performs conditional layout generation.

Result: LayouSyn achieves state-of-the-art performance on spatial and numerical reasoning benchmarks, and the authors demonstrate applications in combining it with large language models and in image editing.

Conclusion: LayouSyn offers an efficient solution for open-vocabulary scene layout generation and shows practical value in image generation and editing.

Abstract: We present Lay-Your-Scene (shorthand LayouSyn), a novel text-to-layout generation pipeline for natural scenes. Prior scene layout generation methods are either closed-vocabulary or use proprietary large language models for open-vocabulary generation, limiting their modeling capabilities and broader applicability in controllable image generation. In this work, we propose to use lightweight open-source language models to obtain scene elements from text prompts and a novel aspect-aware diffusion Transformer architecture trained in an open-vocabulary manner for conditional layout generation. Extensive experiments demonstrate that LayouSyn outperforms existing methods and achieves state-of-the-art performance on challenging spatial and numerical reasoning benchmarks. Additionally, we present two applications of LayouSyn. First, we show that coarse initialization from large language models can be seamlessly combined with our method to achieve better results. Second, we present a pipeline for adding objects to images, demonstrating the potential of LayouSyn in image editing applications.

[4] False Promises in Medical Imaging AI? Assessing Validity of Outperformance Claims

Evangelia Christodoulou, Annika Reinke, Pascaline Andrè, Patrick Godau, Piotr Kalinowski, Rola Houhou, Selen Erkan, Carole H. Sudre, Ninon Burgos, Sofiène Boutaj, Sophie Loizillon, Maëlys Solal, Veronika Cheplygina, Charles Heitz, Michal Kozubek, Michela Antonelli, Nicola Rieke, Antoine Gilson, Leon D. Mayer, Minu D. Tizabi, M. Jorge Cardoso, Amber Simpson, Annette Kopp-Schneider, Gaël Varoquaux, Olivier Colliot, Lena Maier-Hein

Main category: cs.CV

TL;DR: The paper analyzes the reliability of performance comparisons in medical imaging AI research. Most papers claim that a new method outperforms the state of the art, yet many of these claims carry a high probability of being false.

Details

Motivation: Performance comparisons in medical imaging AI often rest on mean performance alone, which can produce false claims of superiority and misdirect future research.

Method: A Bayesian approach is applied to a representative cohort of medical imaging papers to quantify the probability of false claims.

Result: More than 80% of papers claim that the new method is superior. A high probability (>5%) of a false outperformance claim was found in 86% of classification papers and 53% of segmentation papers.

Conclusion: Current benchmarking practices have a critical flaw: outperformance claims are frequently unsubstantiated, and validation practices need to improve.

Abstract: Performance comparisons are fundamental in medical imaging Artificial Intelligence (AI) research, often driving claims of superiority based on relative improvements in common performance metrics. However, such claims frequently rely solely on empirical mean performance. In this paper, we investigate whether newly proposed methods genuinely outperform the state of the art by analyzing a representative cohort of medical imaging papers. We quantify the probability of false claims based on a Bayesian approach that leverages reported results alongside empirically estimated model congruence to estimate whether the relative ranking of methods is likely to have occurred by chance. According to our results, the majority (>80%) of papers claim outperformance when introducing a new method. Our analysis further revealed a high probability (>5%) of false outperformance claims in 86% of classification papers and 53% of segmentation papers. These findings highlight a critical flaw in current benchmarking practices: claims of outperformance in medical imaging AI are frequently unsubstantiated, posing a risk of misdirecting future research efforts.

[5] Hyb-KAN ViT: Hybrid Kolmogorov-Arnold Networks Augmented Vision Transformer

Sainath Dey, Mitul Goswami, Jashika Sethi, Prasant Kumar Pattnaik

Main category: cs.CV

TL;DR: Hyb-KAN ViT combines wavelet-based spectral decomposition with spline-optimized activation functions to improve ViT performance.

Details

Motivation: To address the limitations of MLPs in ViTs and to integrate the edge-detection capabilities of wavelet functions.

Method: Two modules, Eff-KAN and Wav-KAN, replace the MLP layers and use wavelet transforms for multi-resolution feature extraction.

Result: The authors report state-of-the-art performance on ImageNet-1K, COCO, and ADE20K.

Conclusion: Hyb-KAN ViT offers a new paradigm for balancing parameter efficiency and multi-scale representation in vision architectures.

Abstract: This study addresses the inherent limitations of Multi-Layer Perceptrons (MLPs) in Vision Transformers (ViTs) by introducing Hybrid Kolmogorov-Arnold Network (KAN)-ViT (Hyb-KAN ViT), a novel framework that integrates wavelet-based spectral decomposition and spline-optimized activation functions. Prior work has failed to focus on the prebuilt modularity of the ViT architecture and integration of edge detection capabilities of Wavelet functions. We propose two key modules: Efficient-KAN (Eff-KAN), which replaces MLP layers with spline functions, and Wavelet-KAN (Wav-KAN), which leverages orthogonal wavelet transforms for multi-resolution feature extraction. These modules are systematically integrated in ViT encoder layers and classification heads to enhance spatial-frequency modeling while mitigating computational bottlenecks. Experiments on ImageNet-1K (Image Recognition), COCO (Object Detection and Instance Segmentation), and ADE20K (Semantic Segmentation) demonstrate state-of-the-art performance with Hyb-KAN ViT. Ablation studies validate the efficacy of wavelet-driven spectral priors in segmentation and spline-based efficiency in detection tasks. The framework establishes a new paradigm for balancing parameter efficiency and multi-scale representation in vision architectures.

[6] Lightweight RGB-D Salient Object Detection from a Speed-Accuracy Tradeoff Perspective

Songsong Duan, Xi Yang, Nannan Wang, Xinbo Gao

Main category: cs.CV

TL;DR: Proposes SATNet, which balances efficiency and performance in RGB-D salient object detection by improving depth quality, modality fusion, and feature representation.

Details

Motivation: Existing RGB-D methods struggle to balance efficiency and accuracy: lightweight methods lack accuracy, and heavyweight methods are inefficient.

Method: The Depth Anything Model is used to improve depth quality; a DAM module decouples multi-modal features; a DIRM module enlarges the feature space; a DFAM module aggregates features.

Result: On five datasets SATNet outperforms state-of-the-art CNN-based heavyweight models with only 5.2 M parameters and a speed of 415 FPS.

Conclusion: SATNet balances efficiency and performance, offering an effective solution for lightweight RGB-D salient object detection.

Abstract: Current RGB-D methods usually leverage large-scale backbones to improve accuracy but sacrifice efficiency. Meanwhile, several existing lightweight methods struggle to achieve high-precision performance. To balance the efficiency and performance, we propose a Speed-Accuracy Tradeoff Network (SATNet) for Lightweight RGB-D SOD from three fundamental perspectives: depth quality, modality fusion, and feature representation. Concerning depth quality, we introduce the Depth Anything Model to generate high-quality depth maps, which effectively alleviates the multi-modal gaps in the current datasets. For modality fusion, we propose a Decoupled Attention Module (DAM) to explore the consistency within and between modalities. Here, the multi-modal features are decoupled into dual-view feature vectors to project discriminable information of feature maps. For feature representation, we develop a Dual Information Representation Module (DIRM) with a bi-directional inverted framework to enlarge the limited feature space generated by the lightweight backbones. DIRM models texture features and saliency features to enrich feature space, and employs two-way prediction heads to optimize its parameters through a bi-directional backpropagation. Finally, we design a Dual Feature Aggregation Module (DFAM) in the decoder to aggregate texture and saliency features. Extensive experiments on five public RGB-D SOD datasets indicate that the proposed SATNet outperforms state-of-the-art (SOTA) CNN-based heavyweight models and achieves a lightweight framework with 5.2 M parameters and 415 FPS.

[7] Vision-Language-Action Models: Concepts, Progress, Applications and Challenges

Ranjan Sapkota, Yang Cao, Konstantinos I. Roumeliotis, Manoj Karkee

Main category: cs.CV

TL;DR: This review surveys recent progress in Vision-Language-Action (VLA) models, covering their conceptual foundations, architectural innovations, application domains, and open challenges.

Details

Motivation: To unify perception, natural language understanding, and action execution, advancing intelligent robotics and artificial general intelligence.

Method: A literature review of over 80 VLA models published in the past three years, focusing on architectural innovations, efficient training strategies, and real-time inference acceleration.

Result: The review summarizes VLA applications across multiple domains and proposes solutions to challenges such as real-time control and generalization.

Conclusion: The authors anticipate a convergence of VLA models with agentic AI and outline a roadmap for general-purpose embodied agents.

Abstract: Vision-Language-Action (VLA) models mark a transformative advancement in artificial intelligence, aiming to unify perception, natural language understanding, and embodied action within a single computational framework. This foundational review presents a comprehensive synthesis of recent advancements in Vision-Language-Action models, systematically organized across five thematic pillars that structure the landscape of this rapidly evolving field. We begin by establishing the conceptual foundations of VLA systems, tracing their evolution from cross-modal learning architectures to generalist agents that tightly integrate vision-language models (VLMs), action planners, and hierarchical controllers. Our methodology adopts a rigorous literature review framework, covering over 80 VLA models published in the past three years. Key progress areas include architectural innovations, parameter-efficient training strategies, and real-time inference accelerations. We explore diverse application domains such as humanoid robotics, autonomous vehicles, medical and industrial robotics, precision agriculture, and augmented reality navigation. The review further addresses major challenges across real-time control, multimodal action representation, system scalability, generalization to unseen tasks, and ethical deployment risks. Drawing from the state-of-the-art, we propose targeted solutions including agentic AI adaptation, cross-embodiment generalization, and unified neuro-symbolic planning. In our forward-looking discussion, we outline a future roadmap where VLA models, VLMs, and agentic AI converge to power socially aligned, adaptive, and general-purpose embodied agents. This work serves as a foundational reference for advancing intelligent, real-world robotics and artificial general intelligence.

Keywords: Vision-language-action, Agentic AI, AI Agents, Vision-language Models

[8] Replay to Remember (R2R): An Efficient Uncertainty-driven Unsupervised Continual Learning Framework Using Generative Replay

Sriram Mandalika, Harsha Vardhan, Athira Nambiar

Main category: cs.CV

TL;DR: Proposes R2R, an uncertainty-driven unsupervised continual learning framework that mitigates catastrophic forgetting through generative replay, without pretrained models or pseudo-labels, and performs strongly on several datasets.

Details

Motivation: To address catastrophic forgetting in continual learning with an unsupervised method that requires no pretraining.

Method: A cluster-level uncertainty-driven feedback mechanism and a VLM-powered generative replay module, combined with dynamic thresholding, make use of unlabeled data and synthetic labeled data.

Result: On CIFAR-10, CIFAR-100, CINIC-10, SVHN, and TinyImageNet, R2R reaches 98.13%, 73.06%, 93.41%, 95.18%, and 59.74%, respectively, surpassing the prior state of the art by over 4.36%.

Conclusion: R2R performs well in continual learning, retaining knowledge effectively and improving performance.

Abstract: Continual Learning entails progressively acquiring knowledge from new data while retaining previously acquired knowledge, thereby mitigating "Catastrophic Forgetting" in neural networks. Our work presents a novel uncertainty-driven Unsupervised Continual Learning framework using Generative Replay, namely "Replay to Remember (R2R)". The proposed R2R architecture efficiently uses unlabelled and synthetic labelled data in a balanced proportion using a cluster-level uncertainty-driven feedback mechanism and a VLM-powered generative replay module. Unlike traditional memory-buffer methods that depend on pretrained models and pseudo-labels, our R2R framework operates without any prior training. It leverages visual features from unlabeled data and adapts continuously using clustering-based uncertainty estimation coupled with dynamic thresholding. Concurrently, a generative replay mechanism along with DeepSeek-R1 powered CLIP VLM produces labelled synthetic data representative of past experiences, resembling biological visual thinking that replays memory to remember and act in new, unseen tasks. Extensive experimental analyses are carried out in CIFAR-10, CIFAR-100, CINIC-10, SVHN and TinyImageNet datasets. Our proposed R2R approach improves knowledge retention, achieving a state-of-the-art performance of 98.13%, 73.06%, 93.41%, 95.18%, 59.74%, respectively, surpassing state-of-the-art performance by over 4.36%.

[9] Convex Relaxation for Robust Vanishing Point Estimation in Manhattan World

Bangyan Liao, Zhenjun Zhao, Haoang Li, Yi Zhou, Yingping Zeng, Hao Li, Peidong Liu

Main category: cs.CV

TL;DR: Proposes GlobustVP, a method based on convex relaxation that jointly estimates vanishing point locations and line-VP associations, balancing efficiency and global optimality.

Details

Motivation: Existing methods are either sub-optimal or pursue global optimality at a high computational cost, so a more efficient and robust method is needed.

Method: A soft association scheme with a truncated multi-selection error turns the problem into a QCQP, which is relaxed into an SDP. The iterative solver GlobustVP then updates each vanishing point and its associated lines independently.

Result: Experiments on synthetic and real-world data show that GlobustVP achieves a favorable balance between efficiency, robustness, and global optimality compared with previous methods.

Conclusion: GlobustVP is the first method to apply convex relaxation to vanishing point estimation, balancing efficiency and optimality.

Abstract: Determining the vanishing points (VPs) in a Manhattan world, as a fundamental task in many 3D vision applications, consists of jointly inferring the line-VP association and locating each VP. Existing methods are, however, either sub-optimal solvers or pursuing global optimality at a significant cost of computing time. In contrast to prior works, we introduce convex relaxation techniques to solve this task for the first time. Specifically, we employ a "soft" association scheme, realized via a truncated multi-selection error, that allows for joint estimation of VPs' locations and line-VP associations. This approach leads to a primal problem that can be reformulated into a quadratically constrained quadratic programming (QCQP) problem, which is then relaxed into a convex semidefinite programming (SDP) problem. To solve this SDP problem efficiently, we present a globally optimal outlier-robust iterative solver (called GlobustVP), which independently searches for one VP and its associated lines in each iteration, treating other lines as outliers. After each independent update of all VPs, the mutual orthogonality between the three VPs in a Manhattan world is reinforced via local refinement. Extensive experiments on both synthetic and real-world data demonstrate that GlobustVP achieves a favorable balance between efficiency, robustness, and global optimality compared to previous works. The code is publicly available at https://github.com/WU-CVGL/GlobustVP.
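
The QCQP-to-SDP step named in the abstract follows a standard lifting pattern, sketched below in its generic textbook form. The matrices $Q_0, Q_i$ and scalars $b_i$ are placeholders, and the paper's actual objective and constraints are not reproduced here.

```latex
% Generic QCQP in x \in \mathbb{R}^n (Q_0, Q_i symmetric; placeholders only)
\begin{aligned}
\min_{x}\quad & x^\top Q_0 x \\
\text{s.t.}\quad & x^\top Q_i x = b_i, \qquad i = 1,\dots,m .
\end{aligned}

% Lift X = x x^\top, using x^\top Q x = \operatorname{tr}(Q X):
\begin{aligned}
\min_{X}\quad & \operatorname{tr}(Q_0 X) \\
\text{s.t.}\quad & \operatorname{tr}(Q_i X) = b_i, \qquad i = 1,\dots,m, \\
& X \succeq 0, \qquad \operatorname{rank}(X) = 1 .
\end{aligned}
```

Dropping the non-convex constraint $\operatorname{rank}(X) = 1$ leaves a convex SDP whose optimum is a lower bound on the QCQP optimum. When the SDP solution happens to have rank one, the relaxation is tight and $x$ can be recovered (up to sign) from $X = x x^\top$.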

[10] DetReIDX: A Stress-Test Dataset for Real-World UAV-Based Person Recognition

Kailash A. Hambarde, Nzakiese Mbongo, Pavan Kumar MP, Satish Mekewad, Carolina Fernandes, Gökhan Silahtaroğlu, Alice Nithya, Pawan Wasnik, MD. Rashidunnabi, Pranita Samale, Hugo Proença

Main category: cs.CV

TL;DR: DetReIDX is a large-scale aerial-ground person re-identification dataset designed to stress-test ReID under real-world conditions. It contains multi-session, multi-scene data and shows that existing methods degrade sharply under extreme conditions.

Details

Motivation: Existing ReID technology performs poorly under extreme real-world conditions, and public datasets lack sufficient variability, which limits progress.

Method: The authors introduce DetReIDX, a multi-session, multi-scene dataset with over 13 million bounding boxes, annotated with soft biometric attributes and multitask labels.

Result: State-of-the-art methods degrade sharply under DetReIDX conditions (up to 80% in detection accuracy and over 70% in Rank-1 ReID).

Conclusion: DetReIDX provides a practical benchmark for evaluating long-term ReID. The dataset and protocols are publicly available.

Abstract: Person reidentification (ReID) technology has been considered to perform relatively well under controlled, ground-level conditions, but it breaks down when deployed in challenging real-world settings. Evidently, this is due to extreme data variability factors such as resolution, viewpoint changes, scale variations, occlusions, and appearance shifts from clothing or session drifts. Moreover, the publicly available data sets do not realistically incorporate such kinds and magnitudes of variability, which limits the progress of this technology. This paper introduces DetReIDX, a large-scale aerial-ground person dataset that was explicitly designed as a stress test to ReID under real-world conditions. DetReIDX is a multi-session set that includes over 13 million bounding boxes from 509 identities, collected in seven university campuses from three continents, with drone altitudes between 5.8 and 120 meters. More importantly, as a key novelty, DetReIDX subjects were recorded in (at least) two sessions on different days, with changes in clothing, daylight and location, making it suitable to actually evaluate long-term person ReID. In addition, data were annotated with 16 soft biometric attributes and multitask labels for detection, tracking, ReID, and action recognition. In order to provide empirical evidence of DetReIDX usefulness, we considered the specific tasks of human detection and ReID, where SOTA methods catastrophically degrade performance (up to 80% in detection accuracy and over 70% in Rank-1 ReID) when exposed to DetReIDX's conditions. The dataset, annotations, and official evaluation protocols are publicly available at https://www.it.ubi.pt/DetReIDX/

[11] Are Synthetic Corruptions A Reliable Proxy For Real-World Corruptions?

Shashank Agnihotri, David Schader, Nico Sharei, Mehmet Ege Kaçar, Margret Keuper

Main category: cs.CV

TL;DR: The paper asks whether synthetic corruptions are a reliable proxy for real-world corruptions. A large-scale benchmark of semantic segmentation models finds a strong correlation in mean performance, which supports using synthetic corruptions for robustness evaluation.

Details

Motivation: Deep learning models are vulnerable to distribution shifts, and collecting diverse real-world data is costly, so the authors examine how reliable synthetic corruptions are.

Method: A large-scale benchmark of semantic segmentation models compares performance on real-world corruption datasets and synthetic corruption datasets.

Result: Mean performance on the two kinds of corruption is strongly correlated, supporting the use of synthetic corruptions for robustness evaluation.

Conclusion: Synthetic corruptions can serve as a reliable proxy for real-world corruptions, particularly when mean performance is evaluated.

Abstract: Deep learning (DL) models are widely used in real-world applications but remain vulnerable to distribution shifts, especially due to weather and lighting changes. Collecting diverse real-world data for testing the robustness of DL models is resource-intensive, making synthetic corruptions an attractive alternative for robustness testing. However, are synthetic corruptions a reliable proxy for real-world corruptions? To answer this, we conduct the largest benchmarking study on semantic segmentation models, comparing performance on real-world corruptions and synthetic corruptions datasets. Our results reveal a strong correlation in mean performance, supporting the use of synthetic corruptions for robustness evaluation. We further analyze corruption-specific correlations, providing key insights to understand when synthetic corruptions succeed in representing real-world corruptions. Open-source Code: https://github.com/shashankskagnihotri/benchmarking_robustness/tree/segmentation_david/semantic_segmentation
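
A "strong correlation in mean performance" can be illustrated with a Pearson correlation between per-model scores on synthetic and real-world corruptions. The sketch below assumes Pearson's r as the statistic and uses made-up mIoU values. The abstract does not state which correlation measure the authors use, and none of the numbers come from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical mean mIoU per model, one entry per segmentation model.
synthetic = [62.0, 55.0, 48.0, 70.0]
real = [58.0, 50.0, 45.0, 66.0]
r = pearson(synthetic, real)
```

A value of r close to 1 would mean that models which rank well under synthetic corruptions also rank well under real ones, which is the sense in which one set can stand in for the other.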

[12] Seeing Cells Clearly: Evaluating Machine Vision Strategies for Microglia Centroid Detection in 3D Images

Youjia Zhang

Main category: cs.CV

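TL;DR: Compares three tools (ilastik, 3D Morph, and Omnipose) for locating the center points of microglia in 3D microscopy images and finds that each tool identifies cells differently, which affects how the data are interpreted.

Details

Motivation: Microglia morphology matters for brain health research, and accurately identifying cell center points is a key step.

Method: ilastik, 3D Morph, and Omnipose are each tested on locating microglia center points in 3D images, and their results are compared.

Result: Each tool identifies cells in its own way, leading to differences in the information extracted from the images.

Conclusion: Tool choice affects the interpretation of microglia data and should match the needs of the study.

Abstract: Microglia are important cells in the brain, and their shape can tell us a lot about brain health. In this project, I test three different tools for finding the center points of microglia in 3D microscope images. The tools include ilastik, 3D Morph, and Omnipose. I look at how well each one finds the cells and how their results compare. My findings show that each tool sees the cells in its own way, and this can affect the kind of information we get from the images.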
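
Whatever tool produces the segmentation, a cell's center point is typically taken as the centroid of its voxels. The following is a minimal sketch of that final step, assuming a binary 3D mask stored as nested lists in (z, y, x) order. It does not use or imitate the ilastik, 3D Morph, or Omnipose APIs.

```python
def centroid_3d(mask):
    """Centroid (z, y, x) of the nonzero voxels in a nested-list 3D mask."""
    total = [0, 0, 0]
    count = 0
    for z, plane in enumerate(mask):
        for y, row in enumerate(plane):
            for x, value in enumerate(row):
                if value:
                    total[0] += z
                    total[1] += y
                    total[2] += x
                    count += 1
    if count == 0:
        return None  # empty mask: no cell to locate
    return tuple(t / count for t in total)


# Toy 2x2x2 mask of one hypothetical cell (four foreground voxels).
mask = [
    [[0, 0], [0, 1]],
    [[0, 1], [1, 1]],
]
center = centroid_3d(mask)
```

Two tools that draw the cell boundary differently yield different voxel sets and therefore different centroids, which is one concrete way the tool choice can change downstream measurements.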

[13] ORXE: Orchestrating Experts for Dynamically Configurable Efficiency

Qingyuan Wang, Guoxin Wang, Barry Cardiff, Deepu John

Main category: cs.CV

TL;DR: ORXE is a modular framework that makes the efficiency of AI models configurable in real time by dynamically adjusting inference pathways, without complex training.

Details

Motivation: Conventional approaches require complex metamodel training. ORXE aims for an efficient, flexible solution that keeps development simple.

Method: A collection of pre-trained experts and a confidence-based gating mechanism allocate computational resources dynamically.

Result: On image classification tasks, ORXE outperforms individual experts and other dynamic models in most cases.

Conclusion: ORXE can be extended to other applications and offers a scalable solution for diverse deployment scenarios.

Abstract: This paper presents ORXE, a modular and adaptable framework for achieving real-time configurable efficiency in AI models. By leveraging a collection of pre-trained experts with diverse computational costs and performance levels, ORXE dynamically adjusts inference pathways based on the complexity of input samples. Unlike conventional approaches that require complex metamodel training, ORXE achieves high efficiency and flexibility without complicating the development process. The proposed system utilizes a confidence-based gating mechanism to allocate appropriate computational resources for each input. ORXE also supports adjustments to the preference between inference cost and prediction performance across a wide range during runtime. We implemented a training-free ORXE system for image classification tasks, evaluating its efficiency and accuracy across various devices. The results demonstrate that ORXE achieves superior performance compared to individual experts and other dynamic models in most cases. This approach can be extended to other applications, providing a scalable solution for diverse real-world deployment scenarios.
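
One plausible reading of confidence-based gating is an early-exit cascade: experts are tried from cheapest to most expensive, and inference stops once an expert is confident enough. The sketch below illustrates that idea only. The function `cascade_predict`, the toy experts, the cost units, and the single global threshold are all assumptions, not ORXE's actual design.

```python
def cascade_predict(x, experts, threshold):
    """Try experts in order of cost; stop at the first confident prediction.

    experts: list of (cost, fn) pairs, where fn(x) returns (label, confidence).
    Returns (label, confidence, total_cost_spent).
    """
    spent = 0.0
    label, confidence = None, 0.0
    for cost, fn in experts:
        label, confidence = fn(x)
        spent += cost
        if confidence >= threshold:
            break  # confident enough: skip the costlier experts
    return label, confidence, spent


# Toy experts (hypothetical): x stands for the "difficulty" of the input.
def small_expert(x):
    return "cat", 1.0 - x  # less confident on harder inputs


def large_expert(x):
    return "cat", 0.99


experts = [(1.0, small_expert), (10.0, large_expert)]
easy = cascade_predict(0.1, experts, threshold=0.8)
hard = cascade_predict(0.6, experts, threshold=0.8)
```

The easy input exits after the small expert at a cost of 1.0, while the hard one falls through to the large expert at a total cost of 11.0. Raising or lowering `threshold` at runtime shifts the balance between cost and accuracy without retraining anything.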

[14] Mix-QSAM: Mixed-Precision Quantization of the Segment Anything Model

Navin Ranjan, Andreas Savakis

Main category: cs.CV

TL;DR: Mix-QSAM is a mixed-precision post-training quantization framework that optimizes bit-width allocation for SAM using cross-layer synergy and layer importance scores, with substantial performance gains.

Details

Motivation: SAM's high computational and memory demands limit its deployment on resource-constrained devices, and existing fixed bit-width quantization methods are suboptimal in accuracy and efficiency.

Method: A layer-wise importance score and a cross-layer synergy metric are introduced, and bit-width allocation is optimized through Integer Quadratic Programming.

Result: Under 6-bit and 4-bit mixed-precision settings, Mix-QSAM achieves up to 20% higher average precision than existing methods.

Conclusion: Mix-QSAM addresses the accuracy and efficiency problems of SAM quantization and suits resource-constrained devices.

Abstract: The Segment Anything Model (SAM) is a popular vision foundation model; however, its high computational and memory demands make deployment on resource-constrained devices challenging. While Post-Training Quantization (PTQ) is a practical approach for reducing computational overhead, existing PTQ methods rely on fixed bit-width quantization, leading to suboptimal accuracy and efficiency. To address this limitation, we propose Mix-QSAM, a mixed-precision PTQ framework for SAM. First, we introduce a layer-wise importance score, derived using Kullback-Leibler (KL) divergence, to quantify each layer's contribution to the model's output. Second, we introduce cross-layer synergy, a novel metric based on causal mutual information, to capture dependencies between adjacent layers. This ensures that highly interdependent layers maintain similar bit-widths, preventing abrupt precision mismatches that degrade feature propagation and numerical stability. Using these metrics, we formulate an Integer Quadratic Programming (IQP) problem to determine optimal bit-width allocation under model size and bit-operation constraints, assigning higher precision to critical layers while minimizing bit-width in less influential layers. Experimental results demonstrate that Mix-QSAM consistently outperforms existing PTQ methods on instance segmentation and object detection tasks, achieving up to 20% higher average precision under 6-bit and 4-bit mixed-precision settings, while maintaining computational efficiency.
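
To make the KL-based importance score concrete, one hedged interpretation is to quantize a single layer, compare the model's output distribution against the full-precision output, and treat a larger divergence as higher importance. The sketch below follows that interpretation with made-up distributions. The layer names and values are hypothetical, and the paper's exact scoring procedure may differ.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Full-precision output distribution (made-up values).
reference = [0.7, 0.2, 0.1]

# Output distribution when only the named layer is quantized (made-up values).
quantized_outputs = {
    "layer_a": [0.68, 0.21, 0.11],  # output barely changes
    "layer_b": [0.40, 0.35, 0.25],  # output changes a lot
}

importance = {
    name: kl_divergence(reference, dist)
    for name, dist in quantized_outputs.items()
}
most_sensitive = max(importance, key=importance.get)
```

Under this reading, `layer_b` would be a candidate for a higher bit-width, since quantizing it disturbs the output more.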

[15] Auto-regressive transformation for image alignment

Kanggeon Lee, Soochahn Lee, Kyoung Mu Lee

Main category: cs.CV

TL;DR: Proposes ART, a method that iteratively refines image alignment using an auto-regressive framework and multi-scale features, with markedly better accuracy in feature-sparse regions.

Details

Motivation: Existing methods perform poorly in feature-sparse regions, under extreme scale and field-of-view differences, and with large deformations, so a more robust solution is needed.

Method: ART uses multi-scale features within an auto-regressive framework and iteratively refines the transformation field with randomly sampled points and cross-attention layers.

Result: Experiments across diverse datasets show that ART significantly outperforms existing methods.

Conclusion: ART is a powerful and broadly applicable new method for precise image alignment.

Abstract: Existing methods for image alignment struggle in cases involving feature-sparse regions, extreme scale and field-of-view differences, and large deformations, often resulting in suboptimal accuracy. Robustness to these challenges improves through iterative refinement of the transformation field while focusing on critical regions in multi-scale image representations. We thus propose Auto-Regressive Transformation (ART), a novel method that iteratively estimates the coarse-to-fine transformations within an auto-regressive framework. Leveraging hierarchical multi-scale features, our network refines the transformations using randomly sampled points at each scale. By incorporating guidance from the cross-attention layer, the model focuses on critical regions, ensuring accurate alignment even in challenging, feature-limited conditions. Extensive experiments across diverse datasets demonstrate that ART significantly outperforms state-of-the-art methods, establishing it as a powerful new method for precise image alignment with broad applicability.

[16] Learning from Loss Landscape: Generalizable Mixed-Precision Quantization via Adaptive Sharpness-Aware Gradient Aligning

Lianbo Ma, Jianlun Ma, Yuee Zhou, Guoyang Xie, Qiang He, Zhichao Lu

Main category: cs.CV

TL;DR: Proposes a method that searches for quantization policies on a small dataset and generalizes them to large-scale datasets, substantially reducing computational cost.

Details

Motivation: Existing MPQ methods require an expensive policy search on large-scale datasets.

Method: Quantization policies are searched on small datasets and generalized to large-scale datasets using sharpness-aware minimization, implicit gradient direction alignment, and an adaptive perturbation radius.

Result: With the policy search run on CIFAR10, the method reaches equivalent accuracy on ImageNet at a significantly lower computational cost, improving efficiency by up to 150%.

Conclusion: The method simplifies quantization: it needs no large-scale quantization fine-tuning, only model weight adjustment.

Abstract: Mixed Precision Quantization (MPQ) has become an essential technique for optimizing neural networks by determining the optimal bitwidth per layer. Existing MPQ methods, however, face a major hurdle: they require a computationally expensive search for quantization policies on large-scale datasets. To resolve this issue, we introduce a novel approach that first searches for quantization policies on small datasets and then generalizes them to large-scale datasets. This approach simplifies the process, eliminating the need for large-scale quantization fine-tuning and only necessitating model weight adjustment. Our method is characterized by three key techniques: sharpness-aware minimization for enhanced quantization generalization, implicit gradient direction alignment to handle gradient conflicts among different optimization objectives, and an adaptive perturbation radius to accelerate optimization. Both theoretical analysis and experimental results validate our approach. Using the CIFAR10 dataset (just 0.5% the size of ImageNet training data) for MPQ policy search, we achieved equivalent accuracy on ImageNet with a significantly lower computational cost, while improving efficiency by up to 150% over the baselines.
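
Sharpness-aware minimization, one of the three techniques listed, can be sketched in its basic form: perturb the weights by a radius rho along the normalized gradient, then descend using the gradient measured at the perturbed point. The code below is that plain version on a toy quadratic loss. It does not include the paper's gradient alignment or adaptive radius, and the loss, learning rate, and rho are arbitrary illustrative choices.

```python
import math


def sam_step(w, grad_fn, lr=0.02, rho=0.05):
    """One basic sharpness-aware minimization step on a list of weights."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0  # avoid division by zero
    # Ascent step: move to the (approximate) worst case within radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient from the perturbed point to w.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


def loss(w):
    """Toy loss with one flat and one sharp direction."""
    return w[0] ** 2 + 10.0 * w[1] ** 2


def grad(w):
    return [2.0 * w[0], 20.0 * w[1]]


w = [1.0, 1.0]
initial = loss(w)
for _ in range(50):
    w = sam_step(w, grad)
final = loss(w)
```

Because the descent gradient is evaluated at the perturbed weights, sharp directions are penalized more strongly than in plain gradient descent. With a fixed rho, the iterates settle near the minimum rather than exactly on it.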

[17] Cross-Branch Orthogonality for Improved Generalization in Face Deepfake Detection

Tharindu Fernando, Clinton Fookes, Sridha Sridharan, Simon Denman

Main category: cs.CV

TL;DR: Proposes a strategy that uses coarse-to-fine spatial information, semantic information, and their interactions, together with a feature orthogonality-based disentanglement strategy, to improve deepfake detection.

Details

Motivation: Deepfake technology is advancing rapidly, while existing detectors rely on specific forgery artifacts and struggle with novel deepfake types, contributing to a loss of trust in multimedia content.

Method: A feature orthogonality-based disentanglement strategy integrates multiple feature vectors without adding complexity to the feature space, while ensuring feature distinctiveness and reducing redundancy.

Result: Across FaceForensics++, Celeb-DF, and DFDC, the method outperforms the prior state of the art in cross-dataset evaluation by 5% on Celeb-DF and 7% on DFDC.

Conclusion: Multi-feature integration with disentanglement improves the generalization and performance of deepfake detection.

Abstract: Remarkable advancements in generative AI technology have given rise to a spectrum of novel deepfake categories with unprecedented leaps in their realism, and deepfakes are increasingly becoming a nuisance to law enforcement authorities and the general public. In particular, we observe alarming levels of confusion, deception, and loss of faith regarding multimedia content within society caused by face deepfakes, and existing deepfake detectors are struggling to keep up with the pace of improvements in deepfake generation. This is primarily due to their reliance on specific forgery artifacts, which limits their ability to generalise and detect novel deepfake types. To combat the spread of malicious face deepfakes, this paper proposes a new strategy that leverages coarse-to-fine spatial information, semantic information, and their interactions while ensuring feature distinctiveness and reducing the redundancy of the modelled features. A novel feature orthogonality-based disentanglement strategy is introduced to ensure branch-level and cross-branch feature disentanglement, which allows us to integrate multiple feature vectors without adding complexity to the feature space or compromising generalisation. Comprehensive experiments on three public benchmarks: FaceForensics++, Celeb-DF, and the Deepfake Detection Challenge (DFDC) show that these design choices enable the proposed approach to outperform current state-of-the-art methods by 5% on the Celeb-DF dataset and 7% on the DFDC dataset in a cross-dataset evaluation setting.
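
A common way to encourage feature orthogonality between branches is to penalize the squared cosine similarity of their feature vectors. The sketch below shows that generic penalty only. The abstract does not give the paper's exact loss, so the function names and the choice of squared cosine similarity are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def orthogonality_penalty(branches):
    """Mean squared cosine similarity over all pairs of branch features."""
    pairs = [
        (i, j)
        for i in range(len(branches))
        for j in range(i + 1, len(branches))
    ]
    return sum(cosine(branches[i], branches[j]) ** 2 for i, j in pairs) / len(pairs)


# Hypothetical branch features: mutually orthogonal vs. fully redundant.
orthogonal = orthogonality_penalty([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
redundant = orthogonality_penalty([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
```

The penalty is 0 when branches carry mutually orthogonal features and 1 when they are collinear, so minimizing it alongside the detection loss pushes branches toward non-redundant representations.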

[18] OWT: A Foundational Organ-Wise Tokenization Framework for Medical Imaging

Sifan Song, Siyeop Yoon, Pengfei Jin, Sekeun Kim, Matthew Tivnan, Yujin Oh, Runqi Meng, Ling Chen, Zhiliang Lyu, Dufan Wu, Ning Guo, Xiang Li, Quanzheng Li

Main category: cs.CV

TL;DR: Proposes an Organ-Wise Tokenization (OWT) framework with a Token Group-based Reconstruction (TGR) training paradigm to address the interpretability and generalization of representation learning in medical imaging.

Details

Motivation: Existing representation learning methods rely on holistic, black-box embeddings that entangle semantic components, limiting interpretability and generalization. The problem is especially acute in medical imaging.

Method: OWT explicitly disentangles an image into separable token groups, each corresponding to a distinct organ or semantic entity, and is trained with the TGR paradigm.

Result: On CT and MRI datasets, OWT achieves strong image reconstruction and segmentation performance and also enables new applications such as semantic-level generation and retrieval.

Conclusion: As a foundational framework for semantically disentangled representation learning, OWT offers broad scalability and practical applicability.

Abstract: Recent advances in representation learning often rely on holistic, black-box embeddings that entangle multiple semantic components, limiting interpretability and generalization. These issues are especially critical in medical imaging. To address these limitations, we propose an Organ-Wise Tokenization (OWT) framework with a Token Group-based Reconstruction (TGR) training paradigm. Unlike conventional approaches that produce holistic features, OWT explicitly disentangles an image into separable token groups, each corresponding to a distinct organ or semantic entity. Our design ensures each token group encapsulates organ-specific information, boosting interpretability, generalization, and efficiency while allowing fine-grained control in downstream tasks. Experiments on CT and MRI datasets demonstrate the effectiveness of OWT in not only achieving strong image reconstruction and segmentation performance, but also enabling novel semantic-level generation and retrieval applications that are out of reach for standard holistic embedding methods. These findings underscore the potential of OWT as a foundational framework for semantically disentangled representation learning, offering broad scalability and applicability to real-world medical imaging scenarios and beyond.

[19] Pro2SAM: Mask Prompt to SAM with Grid Points for Weakly Supervised Object Localization

Xi Yang,Songsong Duan,Nannan Wang,Xinbo Gao

Main category: cs.CV

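TL;DR: The paper proposes Pro2SAM, which combines the Segment Anything Model (SAM) with a novel mask prompt technique to address the lack of pixel-level fine-grained information in weakly supervised object localization (WSOL), and achieves state-of-the-art performance on CUB-200-2011 and ILSVRC.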

Details

Motivation: Weakly supervised object localization (WSOL) has attracted attention because of its low annotation cost, but existing methods (CAM and self-attention maps) cannot learn pixel-level fine-grained information, which limits further gains. Method: The Pro2SAM network (1) uses GTFormer to generate a coarse-grained foreground map as a mask prompt; (2) feeds dense grid points into SAM to maximize the probability of obtaining a foreground mask; and (3) applies a pixel-level similarity metric to perform mask matching. Result: Pro2SAM reaches 84.03% and 66.85% Top-1 localization accuracy on CUB-200-2011 and ILSVRC, respectively, the best reported results. Conclusion: By combining SAM's zero-shot capability with the mask prompt technique, Pro2SAM significantly improves WSOL performance and suggests a new direction for future work. Abstract: Weakly Supervised Object Localization (WSOL), which aims to localize objects by only using image-level labels, has attracted much attention because of its low annotation cost in real applications. Current studies focus on the Class Activation Map (CAM) of CNN and the self-attention map of transformer to identify the region of objects. However, both CAM and self-attention maps cannot learn pixel-level fine-grained information on the foreground objects, which hinders the further advance of WSOL. To address this problem, we leverage the zero-shot generalization and fine-grained segmentation capabilities of the Segment Anything Model (SAM) to boost the activation of integral object regions. Further, to alleviate the semantic ambiguity issue arising in single point prompt-based SAM, we propose an innovative mask prompt to SAM (Pro2SAM) network with grid points for the WSOL task. First, we devise a Global Token Transformer (GTFormer) to generate a coarse-grained foreground map as a flexible mask prompt, where the GTFormer jointly embeds patch tokens and novel global tokens to learn foreground semantics. Secondly, we deliver grid points as dense prompts into SAM to maximize the probability of foreground mask, which avoids missing objects as can happen with a single point/box prompt. Finally, we propose a pixel-level similarity metric to perform the mask matching from mask prompt to SAM, where the mask with the highest score is viewed as the final localization map. Experiments show that the proposed Pro2SAM achieves state-of-the-art performance on both CUB-200-2011 and ILSVRC, with 84.03% and 66.85% Top-1 Loc, respectively.
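
The last two steps of the abstract (dense grid-point prompts, then picking the candidate mask that best matches the coarse mask prompt) can be sketched as follows. This is an illustration under stated assumptions: the abstract does not define the pixel-level similarity metric, so plain intersection-over-union is used as a stand-in, and the candidate masks are toy binary grids rather than real SAM outputs. All function names are hypothetical.

```python
def grid_points(width, height, n_per_side):
    """Evenly spaced (x, y) prompt points, one at the centre of each grid cell."""
    xs = [(i + 0.5) * width / n_per_side for i in range(n_per_side)]
    ys = [(j + 0.5) * height / n_per_side for j in range(n_per_side)]
    return [(x, y) for y in ys for x in xs]

def mask_iou(a, b):
    """Intersection-over-union of two binary masks given as lists of 0/1 rows."""
    inter = sum(pa and pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    union = sum(pa or pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return inter / union if union else 0.0

def best_mask(prompt_mask, candidates):
    """Index and score of the candidate most similar to the coarse mask prompt."""
    scores = [mask_iou(prompt_mask, c) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores[best]

points = grid_points(4, 4, 2)

# Toy data: a coarse foreground map and three candidate masks.
coarse = [[1, 1], [0, 0]]
candidates = [
    [[0, 0], [1, 1]],  # no overlap with the prompt
    [[1, 1], [1, 0]],  # covers the prompt plus one extra pixel
    [[1, 0], [0, 0]],  # covers half of the prompt
]
index, score = best_mask(coarse, candidates)
```

Each grid point would be sent to the segmenter as a separate prompt, producing one candidate mask per point; the matching step then reduces those candidates to a single localization map.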

[20] SpatialPrompting: Keyframe-driven Zero-Shot Spatial Reasoning with Off-the-Shelf Multimodal Large Language Models

Shun Taguchi,Hideki Deguchi,Takumi Hamazaki,Hiroyuki Sakai

Main category: cs.CV

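TL;DR: The SpatialPrompting framework uses the reasoning ability of existing multimodal large language models to perform zero-shot 3D spatial reasoning, without expensive 3D fine-tuning or specialized inputs.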

Details

Motivation: Existing methods depend on expensive 3D fine-tuning and specialized inputs such as point clouds, which limits flexibility and scalability. Method: A keyframe selection strategy (based on metrics such as vision-language similarity) is combined with camera pose data to abstract spatial relationships and infer 3D structure. Result: The framework achieves state-of-the-art zero-shot performance on benchmarks such as ScanQA and SQA3D. Conclusion: The framework is a simpler, more scalable alternative that needs neither specialized 3D inputs nor fine-tuning. Abstract: This study introduces SpatialPrompting, a novel framework that harnesses the emergent reasoning capabilities of off-the-shelf multimodal large language models to achieve zero-shot spatial reasoning in three-dimensional (3D) environments. Unlike existing methods that rely on expensive 3D-specific fine-tuning with specialized 3D inputs such as point clouds or voxel-based features, SpatialPrompting employs a keyframe-driven prompt generation strategy. This framework uses metrics such as vision-language similarity, Mahalanobis distance, field of view, and image sharpness to select a diverse and informative set of keyframes from image sequences and then integrates them with corresponding camera pose data to effectively abstract spatial relationships and infer complex 3D structures. The proposed framework not only establishes a new paradigm for flexible spatial reasoning that utilizes intuitive visual and positional cues but also achieves state-of-the-art zero-shot performance on benchmark datasets, such as ScanQA and SQA3D, across several metrics. The proposed method effectively eliminates the need for specialized 3D inputs and fine-tuning, offering a simpler and more scalable alternative to conventional approaches.
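
As a rough illustration of keyframe selection that balances quality and diversity, the sketch below greedily keeps the sharpest frames whose camera positions are far enough from those already chosen. It is a deliberate simplification: it uses only a precomputed sharpness score and plain Euclidean distance between camera positions, whereas the abstract also lists vision-language similarity, Mahalanobis distance, and field of view. The frame dictionary layout, `min_dist`, and the function name are assumptions.

```python
import math

def select_keyframes(frames, k, min_dist):
    """Greedy selection: sharpest first, skipping frames too close to a chosen one.

    frames: list of dicts with 'id', 'position' (x, y, z) and 'sharpness'.
    Returns the ids of at most k selected frames, in selection order.
    """
    ranked = sorted(frames, key=lambda f: f["sharpness"], reverse=True)
    chosen = []
    for frame in ranked:
        far_enough = all(
            math.dist(frame["position"], c["position"]) >= min_dist for c in chosen
        )
        if far_enough:
            chosen.append(frame)
        if len(chosen) == k:
            break
    return [f["id"] for f in chosen]

frames = [
    {"id": "a", "position": (0.0, 0.0, 0.0), "sharpness": 0.9},
    {"id": "b", "position": (0.1, 0.0, 0.0), "sharpness": 0.8},  # near-duplicate of "a"
    {"id": "c", "position": (2.0, 0.0, 0.0), "sharpness": 0.5},
    {"id": "d", "position": (4.0, 0.0, 0.0), "sharpness": 0.7},
]
selected = select_keyframes(frames, k=3, min_dist=1.0)
```

The selected frames and their camera poses would then be placed in the prompt given to the multimodal model.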

[21] GlyphMastero: A Glyph Encoder for High-Fidelity Scene Text Editing

Tong Wang,Ting Liu,Xiaochao Qu,Chengjing Wu,Luoqi Liu,Xiaolin Hu

Main category: cs.CV

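TL;DR: GlyphMastero is a specialized glyph encoder that uses a glyph attention module and multi-scale OCR feature fusion to significantly improve the quality and accuracy of scene text editing.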

Details

Motivation: Existing diffusion-based methods perform poorly on complex characters such as Chinese and struggle to preserve stroke-level precision and structural consistency. Method: GlyphMastero combines a glyph attention module with a multi-scale feature pyramid network to provide cross-level, multi-scale glyph-aware guidance. Result: Sentence accuracy improves by 18.02% and the text-region Fréchet inception distance drops by 53.28%. Conclusion: Precise glyph modeling and multi-scale feature fusion significantly improve scene text editing performance. Abstract: Scene text editing, a subfield of image editing, requires modifying texts in images while preserving style consistency and visual coherence with the surrounding environment. While diffusion-based methods have shown promise in text generation, they still struggle to produce high-quality results. These methods often generate distorted or unrecognizable characters, particularly when dealing with complex characters like Chinese. In such systems, characters are composed of intricate stroke patterns and spatial relationships that must be precisely maintained. We present GlyphMastero, a specialized glyph encoder designed to guide the latent diffusion model for generating texts with stroke-level precision. Our key insight is that existing methods, despite using pretrained OCR models for feature extraction, fail to capture the hierarchical nature of text structures - from individual strokes to stroke-level interactions to overall character-level structure. To address this, our glyph encoder explicitly models and captures the cross-level interactions between local-level individual characters and global-level text lines through our novel glyph attention module. Meanwhile, our model implements a feature pyramid network to fuse the multi-scale OCR backbone features at the global-level. Through these cross-level and multi-scale fusions, we obtain more detailed glyph-aware guidance, enabling precise control over the scene text generation process. Our method achieves an 18.02% improvement in sentence accuracy over the state-of-the-art multi-lingual scene text editing baseline, while simultaneously reducing the text-region Fréchet inception distance by 53.28%.

[22] A Simple Detector with Frame Dynamics is a Strong Tracker

Chenxu Peng,Chenxu Wang,Minrui Zou,Danyang Li,Zhengpeng Yang,Yimian Dai,Ming-Ming Cheng,Xiang Li

Main category: cs.CV

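TL;DR: Proposes an infrared tiny-object tracking method that improves performance through global detection and motion-aware learning, with strong results in the Anti-UAV Challenge.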

Details

Motivation: Existing trackers depend on cropped template regions and have limited motion modeling ability, which makes tiny targets hard to handle. Method: Frame dynamics are combined with a trajectory constraint filtering strategy, using frame difference and optical flow to encode target features and motion characteristics. Result: The method outperforms existing approaches on multiple metrics and takes 1st and 2nd place in the Anti-UAV Challenge. Conclusion: The design significantly improves the robustness and accuracy of infrared tiny-object tracking. Abstract: Infrared object tracking plays a crucial role in Anti-Unmanned Aerial Vehicle (Anti-UAV) applications. Existing trackers often depend on cropped template regions and have limited motion modeling capabilities, which pose challenges when dealing with tiny targets. To address this, we propose a simple yet effective infrared tiny-object tracker that enhances tracking performance by integrating global detection and motion-aware learning with temporal priors. Our method is based on object detection and achieves significant improvements through two key innovations. First, we introduce frame dynamics, leveraging frame difference and optical flow to encode both prior target features and motion characteristics at the input level, enabling the model to better distinguish the target from background clutter. Second, we propose a trajectory constraint filtering strategy in the post-processing stage, utilizing spatio-temporal priors to suppress false positives and enhance tracking robustness. Extensive experiments show that our method consistently outperforms existing approaches across multiple metrics in challenging infrared UAV tracking scenarios. Notably, we achieve state-of-the-art performance in the 4th Anti-UAV Challenge, securing 1st place in Track 1 and 2nd place in Track 2.
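
To make the "frame difference at the input level" idea concrete, here is a minimal sketch that computes an absolute difference map between two consecutive grayscale frames and stacks it with the current frame as an extra input channel. Optical flow is omitted, and the two-channel layout and function names are assumptions rather than the authors' exact input encoding.

```python
def frame_difference(prev, curr):
    """Per-pixel absolute difference between two equally sized grayscale frames."""
    return [[abs(c - p) for p, c in zip(row_p, row_c)]
            for row_p, row_c in zip(prev, curr)]

def strongest_change(diff):
    """(row, col) of the pixel with the largest temporal change."""
    value, location = max(
        (v, (r, c)) for r, row in enumerate(diff) for c, v in enumerate(row)
    )
    return location

def with_dynamics(prev, curr):
    """Stack the current frame and its difference map as two input channels."""
    return [curr, frame_difference(prev, curr)]

# Toy 3x3 frames: static background of 10, a small bright target that moves
# from (1, 1) to (1, 2) and becomes brighter.
prev_frame = [[10, 10, 10], [10, 60, 10], [10, 10, 10]]
curr_frame = [[10, 10, 10], [10, 10, 200], [10, 10, 10]]

diff = frame_difference(prev_frame, curr_frame)
channels = with_dynamics(prev_frame, curr_frame)
```

Static background clutter cancels out in the difference channel, so a detector receiving both channels has a direct cue for where something moved.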

[23] Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models

Yunxin Li,Zhenyu Liu,Zitao Li,Xuanyu Zhang,Zhenran Xu,Xinyu Chen,Haoyuan Shi,Shenyuan Jiang,Xintong Wang,Jifang Wang,Shouzheng Huang,Xinping Zhao,Borui Jiang,Lanqing Hong,Longyue Wang,Zhuotao Tian,Baoxing Huai,Wenhan Luo,Weihua Luo,Zheng Zhang,Baotian Hu,Min Zhang

Main category: cs.CV

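TL;DR: The paper surveys the development of multimodal reasoning research, from early modular approaches to unified multimodal large language models, and discusses the future direction of native large multimodal reasoning models (N-LMRMs).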

Details

Motivation: As AI systems operate in open, uncertain, and multimodal environments, reasoning becomes essential for robust and adaptive behavior. Large Multimodal Reasoning Models (LMRMs) integrate multiple modalities to support complex reasoning, but still face challenges in generalization, reasoning depth, and agentic behavior. Method: The paper surveys multimodal reasoning research along a four-stage developmental roadmap, covering early task-specific modules, unified multimodal LLMs, and the prospective N-LMRMs. Result: Multimodal reasoning has evolved from modular pipelines to unified frameworks, with advances such as Multimodal Chain-of-Thought (MCoT) and multimodal reinforcement learning enabling richer and more structured reasoning chains. Conclusion: The future direction is native large multimodal reasoning models (N-LMRMs), which aim to support scalable, agentic, and adaptive reasoning and planning in complex real-world environments. Abstract: Reasoning lies at the heart of intelligence, shaping the ability to make decisions, draw conclusions, and generalize across domains. In artificial intelligence, as systems increasingly operate in open, uncertain, and multimodal environments, reasoning becomes essential for enabling robust and adaptive behavior. Large Multimodal Reasoning Models (LMRMs) have emerged as a promising paradigm, integrating modalities such as text, images, audio, and video to support complex reasoning capabilities and aiming to achieve comprehensive perception, precise understanding, and deep reasoning. As research advances, multimodal reasoning has rapidly evolved from modular, perception-driven pipelines to unified, language-centric frameworks that offer more coherent cross-modal understanding. While instruction tuning and reinforcement learning have improved model reasoning, significant challenges remain in omni-modal generalization, reasoning depth, and agentic behavior. To address these issues, we present a comprehensive and structured survey of multimodal reasoning research, organized around a four-stage developmental roadmap that reflects the field's shifting design philosophies and emerging capabilities. First, we review early efforts based on task-specific modules, where reasoning was implicitly embedded across stages of representation, alignment, and fusion. Next, we examine recent approaches that unify reasoning into multimodal LLMs, with advances such as Multimodal Chain-of-Thought (MCoT) and multimodal reinforcement learning enabling richer and more structured reasoning chains. Finally, drawing on empirical insights from challenging benchmarks and experimental cases of OpenAI O3 and O4-mini, we discuss the conceptual direction of native large multimodal reasoning models (N-LMRMs), which aim to support scalable, agentic, and adaptive reasoning and planning in complex, real-world environments.

[24] Canny2Palm: Realistic and Controllable Palmprint Generation for Large-scale Pre-training

Xingzeng Lan,Xing Duan,Chen Chen,Weiyu Lin,Bo Wang

Main category: cs.CV

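TL;DR: Proposes Canny2Palm, which combines a Canny edge detector with a Pix2Pix network to generate virtual palmprints, addressing the scarcity of palmprint data and performing strongly on open-set palmprint recognition benchmarks.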

Details

Motivation: Palmprint recognition is a secure and privacy-friendly biometric method, but the scarcity of palmprint data limits improvements in recognition accuracy. Method: A Canny edge detector extracts palm textures, a Pix2Pix network generates realistic virtual palmprints from them, and textures from different identities are re-assembled to create new identities. Result: On open-set palmprint recognition benchmarks, models pre-trained with Canny2Palm synthetic data achieve up to 7.2% higher identification accuracy, and performance keeps improving as the amount of synthetic data grows. Conclusion: Canny2Palm generates realistic palmprint data with controllable diversity, showing potential for large-scale pre-training. Abstract: Palmprint recognition is a secure and privacy-friendly method of biometric identification. One of the major challenges to improve palmprint recognition accuracy is the scarcity of palmprint data. Recently, a popular line of research revolves around the synthesis of virtual palmprints for large-scale pre-training purposes. In this paper, we propose a novel synthesis method named Canny2Palm that extracts palm textures with a Canny edge detector and uses them to condition a Pix2Pix network for realistic palmprint generation. By re-assembling palmprint textures from different identities, we are able to create new identities by seeding the generator with new assemblies. Canny2Palm not only synthesizes realistic data following the distribution of real palmprints but also enables controllable diversity to generate large-scale new identities. On open-set palmprint recognition benchmarks, models pre-trained with Canny2Palm synthetic data outperform the state-of-the-art with up to 7.2% higher identification accuracy. Moreover, the performance of models pre-trained with Canny2Palm continues to improve given 10,000 synthetic IDs while those with existing methods already saturate, demonstrating the potential of our method for large-scale pre-training.

[25] FF-PNet: A Pyramid Network Based on Feature and Field for Brain Image Registration

Ying Zhang,Shuai Guo,Chenxi Sun,Yuchen Zhu,Jinhai Xiang

Main category: cs.CV

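TL;DR: Proposes FF-PNet, a pyramid registration network based on feature and deformation field, which extracts coarse-grained and fine-grained features in parallel and significantly improves the efficiency and accuracy of medical image registration.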

Details

Motivation: Existing models are inefficient at extracting coarse-grained and fine-grained features in parallel and cannot handle complex image deformations effectively. Method: A Residual Feature Fusion Module (RFFM) handles coarse-grained feature extraction and a Residual Deformation Field Fusion Module (RDFFM) handles fine-grained image deformation; the two operate in parallel. Result: Experiments on the LPBA and OASIS datasets show that FF-PNet outperforms existing methods on metrics such as the Dice Similarity Coefficient. Conclusion: With RFFM and RDFFM operating in parallel, FF-PNet significantly improves registration accuracy without attention mechanisms or multilayer perceptrons. Abstract: In recent years, deformable medical image registration techniques have made significant progress. However, existing models still lack efficiency in parallel extraction of coarse and fine-grained features. To address this, we construct a new pyramid registration network based on feature and deformation field (FF-PNet). For coarse-grained feature extraction, we design a Residual Feature Fusion Module (RFFM); for fine-grained image deformation, we propose a Residual Deformation Field Fusion Module (RDFFM). Through the parallel operation of these two modules, the model can effectively handle complex image deformations. It is worth emphasizing that the encoding stage of FF-PNet only employs traditional convolutional neural networks without any attention mechanisms or multilayer perceptrons, yet it still achieves remarkable improvements in registration accuracy, fully demonstrating the superior feature decoding capabilities of RFFM and RDFFM. We conducted extensive experiments on the LPBA and OASIS datasets. The results show our network consistently outperforms popular methods in metrics like the Dice Similarity Coefficient.
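
For readers unfamiliar with the evaluation metric named above, the Dice Similarity Coefficient for one anatomical label is 2|A ∩ B| / (|A| + |B|), where A and B are the sets of voxels carrying that label in the warped and the fixed segmentation. A minimal sketch on flattened label arrays (the toy data and function name are illustrative only):

```python
def dice(seg_a, seg_b, label):
    """Dice Similarity Coefficient for one label over two flattened label maps."""
    count_a = sum(v == label for v in seg_a)
    count_b = sum(v == label for v in seg_b)
    overlap = sum(a == label and b == label for a, b in zip(seg_a, seg_b))
    if count_a + count_b == 0:
        return 1.0  # label absent from both maps: treated as perfect agreement
    return 2.0 * overlap / (count_a + count_b)

# Toy flattened segmentations with labels 0 (background), 1 and 2.
warped = [1, 1, 2, 2, 0]
fixed = [1, 2, 2, 2, 0]
```

Registration papers usually report the mean of this score over all labelled structures.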

Jiepan Li,He Huang,Yu Sheng,Yujun Guo,Wei He

Main category: cs.CV

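TL;DR: Proposes a building-guided pseudo-label learning framework for accurately assessing building damage from multi-modal remote sensing images, which achieved the best result in the competition.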

Details

Motivation: Building damage assessment is essential for disaster response and recovery planning, but the differences between multi-modal images and the resulting uncertainty make accurate assessment difficult. Method: Building extraction models are trained first, and multi-model fusion with test-time augmentation produces pseudo-probabilities, followed by refinement through low-uncertainty pseudo-label training. A change detection model is then trained, with a building-guided low-uncertainty pseudo-label refinement strategy. Result: On the 2025 IEEE GRSS Data Fusion Contest dataset, the approach achieved the highest mIoU score (54.28%) and first place. Conclusion: Building guidance and low-uncertainty pseudo-labels significantly improve the accuracy and reliability of building damage assessment. Abstract: Accurate building damage assessment using bi-temporal multi-modal remote sensing images is essential for effective disaster response and recovery planning. This study proposes a novel Building-Guided Pseudo-Label Learning Framework to address the challenges of mapping building damage from pre-disaster optical and post-disaster SAR images. First, we train a series of building extraction models using pre-disaster optical images and building labels. To enhance building segmentation, we employ multi-model fusion and test-time augmentation strategies to generate pseudo-probabilities, followed by a low-uncertainty pseudo-label training method for further refinement. Next, a change detection model is trained on bi-temporal cross-modal images and damaged building labels. To improve damage classification accuracy, we introduce a building-guided low-uncertainty pseudo-label refinement strategy, which leverages building priors from the previous step to guide pseudo-label generation for damaged buildings, reducing uncertainty and enhancing reliability. Experimental results on the 2025 IEEE GRSS Data Fusion Contest dataset demonstrate the effectiveness of our approach, which achieved the highest mIoU score (54.28%) and secured first place in the competition.
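
The fusion-then-filter step described above can be sketched as: average the per-pixel building probabilities from several models, then keep a pseudo-label only where the fused prediction is confident. The confidence threshold `tau`, the ignore value 255, and the use of max-probability as the uncertainty measure are assumptions for illustration; the abstract does not give the authors' exact criterion.

```python
def fuse(prob_maps):
    """Average per-pixel foreground probabilities across models (flattened maps)."""
    n = len(prob_maps)
    return [sum(pixel) / n for pixel in zip(*prob_maps)]

def low_uncertainty_labels(prob_maps, tau=0.9, ignore=255):
    """Binary pseudo-labels, with uncertain pixels set to an ignore value.

    A pixel is kept only if the fused confidence max(p, 1 - p) reaches tau
    (hypothetical threshold); ignored pixels would be excluded from the loss.
    """
    labels = []
    for p in fuse(prob_maps):
        confidence = max(p, 1.0 - p)
        if confidence >= tau:
            labels.append(1 if p >= 0.5 else 0)
        else:
            labels.append(ignore)
    return labels

# Two models, three pixels: agree on building, disagree, agree on background.
model_outputs = [
    [0.95, 0.60, 0.05],
    [0.97, 0.40, 0.01],
]
pseudo = low_uncertainty_labels(model_outputs)
```

Test-time augmentation fits the same pattern: each augmented prediction, mapped back to the original geometry, is simply one more entry in the list being averaged.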

[27] T2VTextBench: A Human Evaluation Benchmark for Textual Control in Video Generation Models

Xuyang Guo,Jiayan Huo,Zhenmei Shi,Zhao Song,Jiahao Zhang,Jiale Zhao

Main category: cs.CV

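TL;DR: T2VTextBench is the first benchmark dedicated to evaluating on-screen text fidelity and temporal consistency in text-to-video models; it finds that existing models struggle to generate legible, consistent text.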

Details

Motivation: Current text-to-video models perform well in advertising, entertainment, and education, but fall short at rendering precise on-screen text such as captions or mathematical formulas, which calls for evaluation and improvement. Method: The T2VTextBench benchmark uses prompts that combine complex text strings with dynamic scene changes to evaluate the text generation ability of 10 state-of-the-art models. Result: Most models struggle to generate legible, consistent text, exposing a key weakness of current video generators in handling text. Conclusion: The study points to a direction for future work on improving textual manipulation in video synthesis. Abstract: Thanks to recent advancements in scalable deep architectures and large-scale pretraining, text-to-video generation has achieved unprecedented capabilities in producing high-fidelity, instruction-following content across a wide range of styles, enabling applications in advertising, entertainment, and education. However, these models' ability to render precise on-screen text, such as captions or mathematical formulas, remains largely untested, posing significant challenges for applications requiring exact textual accuracy. In this work, we introduce T2VTextBench, the first human-evaluation benchmark dedicated to evaluating on-screen text fidelity and temporal consistency in text-to-video models. Our suite of prompts integrates complex text strings with dynamic scene changes, testing each model's ability to maintain detailed instructions across frames. We evaluate ten state-of-the-art systems, ranging from open-source solutions to commercial offerings, and find that most struggle to generate legible, consistent text. These results highlight a critical gap in current video generators and provide a clear direction for future research aimed at enhancing textual manipulation in video synthesis.

[28] An Efficient Method for Accurate Pose Estimation and Error Correction of Cuboidal Objects

Utsav Rai,Hardik Mehta,Vismay Vakharia,Aditya Choudhary,Amit Parmar,Rolif Lima,Kaushik Das

Main category: cs.CV

TL;DR: Proposes an efficient method for precise pose estimation of cuboid-shaped objects that reduces target pose error in a time-efficient manner.

Details

Motivation: To address the minor pose errors and execution time overhead of conventional pose estimation methods when autonomously picking cuboidal objects. Method: A linear-time approach for pose error estimation and correction is proposed as an alternative to the usual combination of global point cloud registration and local registration algorithms. Result: The method estimates and corrects the pose of cuboidal objects efficiently and precisely. Conclusion: The proposed linear-time approach offers clear advantages in reducing both pose error and execution time. Abstract: The proposed system outlined in this paper is a solution to a use case that requires the autonomous picking of cuboidal objects from an organized or unorganized pile with high precision. This paper presents an efficient method for precise pose estimation of cuboid-shaped objects, which aims to reduce errors in target pose in a time-efficient manner. Typical pose estimation methods like global point cloud registrations are prone to minor pose errors for which local registration algorithms are generally used to improve pose accuracy. However, due to the execution time overhead and uncertainty in the error of the final achieved pose, an alternate, linear time approach is proposed for pose error estimation and correction. This paper presents an overview of the solution followed by a detailed description of individual modules of the proposed algorithm.

[29] ViCTr: Vital Consistency Transfer for Pathology Aware Image Synthesis

Onkar Susladkar,Gayatri Deshmukh,Yalcin Tur,Ulas Bagci

Main category: cs.CV

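TL;DR: ViCTr is a novel two-stage framework that combines a rectified flow trajectory with a Tweedie-corrected diffusion process for high-fidelity, pathology-aware medical image synthesis, significantly improving efficiency and quality.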

Details

Motivation: To address limited annotated data, modality gaps, and the difficulty of modeling diffuse pathologies such as liver cirrhosis in medical image synthesis. Method: A two-stage framework: pretraining preserves critical anatomical structures, then adversarial fine-tuning controls pathology severity; Tweedie's formula is used to enable one-step sampling and reduce the number of inference steps. Result: ViCTr performs strongly across several datasets, lowering MFID by 28% and improving segmentation performance by 3.8%; the generated cirrhosis MRIs are clinically indistinguishable from real scans. Conclusion: ViCTr is the first method to provide fine-grained, pathology-aware MRI synthesis, filling a gap in AI-driven medical imaging research. Abstract: Synthesizing medical images remains challenging due to limited annotated pathological data, modality domain gaps, and the complexity of representing diffuse pathologies such as liver cirrhosis. Existing methods often struggle to maintain anatomical fidelity while accurately modeling pathological features, frequently relying on priors derived from natural images or inefficient multi-step sampling. In this work, we introduce ViCTr (Vital Consistency Transfer), a novel two-stage framework that combines a rectified flow trajectory with a Tweedie-corrected diffusion process to achieve high-fidelity, pathology-aware image synthesis. First, we pretrain ViCTr on the ATLAS-8k dataset using Elastic Weight Consolidation (EWC) to preserve critical anatomical structures. We then fine-tune the model adversarially with Low-Rank Adaptation (LoRA) modules for precise control over pathology severity. By reformulating Tweedie's formula within a linear trajectory framework, ViCTr supports one-step sampling, reducing inference from 50 steps to just 4, without sacrificing anatomical realism. We evaluate ViCTr on BTCV (CT), AMOS (MRI), and CirrMRI600+ (cirrhosis) datasets. Results demonstrate state-of-the-art performance, achieving a Medical Frechet Inception Distance (MFID) of 17.01 for cirrhosis synthesis, 28% lower than existing approaches, and improving nnUNet segmentation by +3.8% mDSC when used for data augmentation. Radiologist reviews indicate that ViCTr-generated liver cirrhosis MRIs are clinically indistinguishable from real scans. To our knowledge, ViCTr is the first method to provide fine-grained, pathology-aware MRI synthesis with graded severity control, closing a critical gap in AI-driven medical imaging research.

[30] CAG-VLM: Fine-Tuning of a Large-Scale Model to Recognize Angiographic Images for Next-Generation Diagnostic Systems

Yuto Nakamura,Satoshi Kodera,Haruki Settai,Hiroki Shinohara,Masatsugu Tamura,Tomohiro Noguchi,Tatsuki Furusawa,Ryo Takizawa,Tempei Kabayama,Norihiko Takeda

Main category: cs.CV

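TL;DR: The paper proposes a two-stage AI decision-support system for coronary angiography (CAG) image analysis and builds a bilingual dataset. By training a CNN and fine-tuning vision-language models (VLMs), the system performs well at generating clinical reports and treatment recommendations.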

Details

Motivation: Interpretation of coronary angiography (CAG) and the subsequent treatment planning depend heavily on experts, so an AI decision-support system is needed to improve efficiency and accuracy. Method: 1. Sample 14,686 frames from 539 exams, annotate key frames and left/right laterality, and train a CNN. 2. Apply the CNN to extract key frames, build a bilingual image-report dataset, and fine-tune three open-source VLMs. Result: The CNN reaches 0.96 F1 on laterality classification; the fine-tuned Gemma3 model receives the highest clinician rating (7.20/10) and is designated CAG-VLM. Conclusion: Fine-tuned VLMs can effectively assist cardiologists in generating clinical reports and treatment recommendations from CAG images. Abstract: Coronary angiography (CAG) is the gold-standard imaging modality for evaluating coronary artery disease, but its interpretation and subsequent treatment planning rely heavily on expert cardiologists. To enable AI-based decision support, we introduce a two-stage, physician-curated pipeline and a bilingual (Japanese/English) CAG image-report dataset. First, we sample 14,686 frames from 539 exams and annotate them for key-frame detection and left/right laterality; a ConvNeXt-Base CNN trained on this data achieves 0.96 F1 on laterality classification, even on low-contrast frames. Second, we apply the CNN to 243 independent exams, extract 1,114 key frames, and pair each with its pre-procedure report and expert-validated diagnostic and treatment summary, yielding a parallel corpus. We then fine-tune three open-source VLMs (PaliGemma2, Gemma3, and ConceptCLIP-enhanced Gemma3) via LoRA and evaluate them using VLScore and cardiologist review. Although PaliGemma2 w/LoRA attains the highest VLScore, Gemma3 w/LoRA achieves the top clinician rating (mean 7.20/10); we designate this best-performing model as CAG-VLM. These results demonstrate that specialized, fine-tuned VLMs can effectively assist cardiologists in generating clinical reports and treatment recommendations from CAG images.

[31] DenseGrounding: Improving Dense Language-Vision Semantics for Ego-Centric 3D Visual Grounding

Henry Zheng,Hao Shi,Qihang Peng,Yong Xien Chng,Rui Huang,Yepeng Weng,Zhongchao Shi,Gao Huang

Main category: cs.CV

TL;DR: DenseGrounding addresses the challenges of 3D visual grounding by enhancing both visual and textual semantics, yielding significant performance gains.

Details

Motivation: Intelligent agents need to understand and interact with 3D environments through natural language, but existing methods fall short in both visual and textual semantics. Method: A Hierarchical Scene Semantic Enhancer retains dense visual semantics, and a Language Semantic Enhancer uses large language models to enrich text descriptions. Result: DenseGrounding significantly outperforms existing methods in overall accuracy and won an award in a CVPR 2024 challenge. Conclusion: The method effectively addresses semantic loss in 3D visual grounding and advances the field. Abstract: Enabling intelligent agents to comprehend and interact with 3D environments through natural language is crucial for advancing robotics and human-computer interaction. A fundamental task in this field is ego-centric 3D visual grounding, where agents locate target objects in real-world 3D spaces based on verbal descriptions. However, this task faces two significant challenges: (1) loss of fine-grained visual semantics due to sparse fusion of point clouds with ego-centric multi-view images, (2) limited textual semantic context due to arbitrary language descriptions. We propose DenseGrounding, a novel approach designed to address these issues by enhancing both visual and textual semantics. For visual features, we introduce the Hierarchical Scene Semantic Enhancer, which retains dense semantics by capturing fine-grained global scene features and facilitating cross-modal alignment. For text descriptions, we propose a Language Semantic Enhancer that leverages large language models to provide rich context and diverse language descriptions with additional context during model training. Extensive experiments show that DenseGrounding significantly outperforms existing methods in overall accuracy, with improvements of 5.81% and 7.56% when trained on the comprehensive full dataset and smaller mini subset, respectively, further advancing the SOTA in egocentric 3D visual grounding. Our method also achieves 1st place and receives the Innovation Award in the CVPR 2024 Autonomous Grand Challenge Multi-view 3D Visual Grounding Track, validating its effectiveness and robustness.

[32] ReAlign: Bilingual Text-to-Motion Generation via Step-Aware Reward-Guided Alignment

Wanjiang Weng,Xiaofeng Tan,Hongsong Wang,Pan Zhou

Main category: cs.CV

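TL;DR: Proposes the bilingual BiHumanML3D dataset and the BiMD model, combined with the ReAlign method, to significantly improve the quality and semantic consistency of bilingual text-to-motion generation.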

Details

Motivation: To address the absence of datasets and the semantic inconsistency in bilingual text-to-motion generation. Method: Build the BiHumanML3D dataset and propose the BiMD model and the ReAlign method, which uses a reward model to guide the generation process. Result: Experiments show that the approach outperforms existing methods in text-motion alignment and motion quality. Conclusion: BiHumanML3D and BiMD, together with ReAlign, provide an effective solution for bilingual text-to-motion generation. Abstract: Bilingual text-to-motion generation, which synthesizes 3D human motions from bilingual text inputs, holds immense potential for cross-linguistic applications in gaming, film, and robotics. However, this task faces critical challenges: the absence of bilingual motion-language datasets and the misalignment between text and motion distributions in diffusion models, leading to semantically inconsistent or low-quality motions. To address these challenges, we propose BiHumanML3D, a novel bilingual human motion dataset, which establishes a crucial benchmark for bilingual text-to-motion generation models. Furthermore, we propose a Bilingual Motion Diffusion model (BiMD), which leverages cross-lingual aligned representations to capture semantics, thereby achieving a unified bilingual model. Building upon this, we propose the Reward-guided sampling Alignment (ReAlign) method, comprising a step-aware reward model to assess alignment quality during sampling and a reward-guided strategy that directs the diffusion process toward an optimally aligned distribution. This reward model integrates step-aware tokens and combines a text-aligned module for semantic consistency and a motion-aligned module for realism, refining noisy motions at each timestep to balance probability density and alignment. Experiments demonstrate that our approach significantly improves text-motion alignment and motion quality compared to existing state-of-the-art methods. Project page: https://wengwanjiang.github.io/ReAlign-page/.

[33] Federated Deconfounding and Debiasing Learning for Out-of-Distribution Generalization

Zhuang Qi,Sijin Zhou,Lei Meng,Han Hu,Han Yu,Xiangxu Meng

Main category: cs.CV

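TL;DR: FedDDL addresses attribute bias in federated learning by constructing a causal graph and counterfactual samples, significantly improving model performance.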

Details

Motivation: Attribute bias in federated learning degrades model performance, and existing methods lack a comprehensive analysis of the inference paths. Method: FedDDL builds a structured causal graph to analyze inference, and designs counterfactual samples and causal prototypes to reduce background interference. Result: On 2 benchmark datasets, FedDDL achieves 4.5% higher Top-1 accuracy on average than existing methods. Conclusion: FedDDL effectively mitigates attribute bias in federated learning and improves the model's ability to focus on the main objects. Abstract: Attribute bias in federated learning (FL) typically leads local models to optimize inconsistently due to the learning of non-causal associations, resulting in degraded performance. Existing methods either use data augmentation for increasing sample diversity or knowledge distillation for learning invariant representations to address this problem. However, they lack a comprehensive analysis of the inference paths, and the interference from confounding factors limits their performance. To address these limitations, we propose the Federated Deconfounding and Debiasing Learning (FedDDL) method. It constructs a structured causal graph to analyze the model inference process, and performs backdoor adjustment to eliminate confounding paths. Specifically, we design an intra-client deconfounding learning module for computer vision tasks to decouple background and objects, generating counterfactual samples that establish a connection between the background and any label, which stops the model from using the background to infer the label. Moreover, we design an inter-client debiasing learning module to construct causal prototypes to reduce the proportion of the background in prototype components. Notably, it bridges the gap between heterogeneous representations via causal prototypical regularization. Extensive experiments on 2 benchmarking datasets demonstrate that FedDDL significantly enhances the model capability to focus on main objects in unseen data, leading to 4.5% higher Top-1 Accuracy on average over 9 state-of-the-art existing methods.

[34] StabStitch++: Unsupervised Online Video Stitching with Spatiotemporal Bidirectional Warps

Lang Nie,Chunyu Lin,Kang Liao,Yun Zhang,Shuaicheng Liu,Yao Zhao

Main category: cs.CV

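TL;DR: The paper proposes the StabStitch++ framework to address warping shake in video stitching, achieving spatial stitching and temporal stabilization simultaneously with unsupervised learning.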

Details

Motivation: In video stitching, the stitched video can exhibit warping shake even when the input videos are stable, which harms the visual experience. Existing methods such as StabStitch sacrifice alignment for stability, whereas StabStitch++ aims to optimize both. Method: 1. A bidirectional decomposition module disentangles the homography transformation and spreads the alignment burden evenly. 2. Spatial and temporal warps are integrated to derive a mathematical expression for stitching trajectories. 3. A warp smoothing model is trained with a hybrid loss that encourages content alignment and trajectory smoothness. Result: StabStitch++ surpasses existing methods in stitching performance, robustness, and efficiency, and enables a real-time online video stitching system. Conclusion: By optimizing spatial stitching and temporal stability together with unsupervised learning, StabStitch++ significantly improves the quality and practicality of video stitching. Abstract: We retarget video stitching to an emerging issue, named warping shake, which unveils the temporal content shakes induced by sequentially unsmooth warps when extending image stitching to video stitching. Even if the input videos are stable, the stitched video can inevitably cause undesired warping shakes and affect the visual experience. To address this issue, we propose StabStitch++, a novel video stitching framework to realize spatial stitching and temporal stabilization with unsupervised learning simultaneously. First, different from existing learning-based image stitching solutions that typically warp one image to align with another, we suppose a virtual midplane between original image planes and project them onto it. Concretely, we design a differentiable bidirectional decomposition module to disentangle the homography transformation and incorporate it into our spatial warp, evenly spreading alignment burdens and projective distortions across two views. Then, inspired by camera paths in video stabilization, we derive the mathematical expression of stitching trajectories in video stitching by elaborately integrating spatial and temporal warps. Finally, a warp smoothing model is presented to produce stable stitched videos with a hybrid loss to simultaneously encourage content alignment, trajectory smoothness, and online collaboration. Compared with StabStitch that sacrifices alignment for stabilization, StabStitch++ makes no compromise and optimizes both of them simultaneously, especially in the online mode. To establish an evaluation benchmark and train the learning framework, we build a video stitching dataset with a rich diversity in camera motions and scenes. Experiments exhibit that StabStitch++ surpasses current solutions in stitching performance, robustness, and efficiency, offering compelling advancements in this field by building a real-time online video stitching system.

[35] Automated Thoracolumbar Stump Rib Detection and Analysis in a Large CT Cohort

Hendrik Möller,Hanna Schön,Alina Dima,Benjamin Keinert-Weth,Robert Graf,Matan Atad,Johannes Paetzold,Friederike Jungmann,Rickmer Braren,Florian Kofler,Bjoern Menze,Daniel Rueckert,Jan S. Kirschke

Main category: cs.CV

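TL;DR: The study proposes an automated method for detecting thoracolumbar stump ribs that combines a high-resolution deep-learning model with morphological analysis, substantially improving detection accuracy and enabling quantitative analysis.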

Details

Motivation: Thoracolumbar stump ribs are an important indicator of thoracolumbar transitional vertebrae or enumeration anomalies. Existing studies mostly rely on manual, qualitative assessment; this study aims to automate detection and quantify morphology. Method: A high-resolution deep-learning model segments the ribs, an iterative algorithm with piece-wise linear interpolation estimates rib length, and morphological features are analyzed. Result: Segmentation clearly outperforms existing models (Dice score 0.997 vs. 0.779), rib length assessment succeeds in 98.2% of cases, and the morphological analysis shows that stump ribs differ significantly from full-length ribs in position, thickness, and orientation. Conclusion: The study provides an efficient tool for automated detection and quantitative analysis of stump ribs; the model weights and masks are publicly released. Abstract: Thoracolumbar stump ribs are one of the essential indicators of thoracolumbar transitional vertebrae or enumeration anomalies. While some studies manually assess these anomalies and describe the ribs qualitatively, this study aims to automate thoracolumbar stump rib detection and analyze their morphology quantitatively. To this end, we train a high-resolution deep-learning model for rib segmentation and show significant improvements compared to existing models (Dice score 0.997 vs. 0.779, p-value < 0.01). In addition, we use an iterative algorithm and piece-wise linear interpolation to assess the length of the ribs, showing a success rate of 98.2%. When analyzing morphological features, we show that stump ribs articulate more posteriorly at the vertebrae (-19.2 ± 3.8 vs. -13.8 ± 2.5, p-value < 0.01), are thinner (260.6 ± 103.4 vs. 563.6 ± 127.1, p-value < 0.01), and are oriented more downwards and sideways within the first centimeters in contrast to full-length ribs. We show that with partially visible ribs, these features can achieve an F1-score of 0.84 in differentiating stump ribs from regular ones. We publish the model weights and masks for public use.
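The two quantities reported above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a segmentation is given as a set of voxel coordinates and that a rib centerline has already been reduced to an ordered list of 3D points, so that piece-wise linear interpolation amounts to summing segment lengths. The function names `dice_score` and `polyline_length` are hypothetical.

```python
import math

def dice_score(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two collections of voxel coordinates."""
    pred, target = set(pred), set(target)
    if not pred and not target:
        return 1.0  # two empty masks are treated as a perfect match
    return 2 * len(pred & target) / (len(pred) + len(target))

def polyline_length(points):
    """Length of a piece-wise linear curve through ordered centerline points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

# Toy example: two masks sharing one voxel, and a three-point centerline.
overlap = dice_score({(0, 0), (0, 1)}, {(0, 1), (1, 1)})
length = polyline_length([(0, 0, 0), (3, 4, 0), (3, 4, 12)])
```

In this toy case the masks overlap in one of two voxels each, and the centerline consists of segments of length 5 and 12.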

[36] Driving with Context: Online Map Matching for Complex Roads Using Lane Markings and Scenario Recognition

Xin Bi,Zhichao Li,Yuxuan Xia,Panpan Tong,Lijuan Zhang,Yang Chen,Junsheng Fu

Main category: cs.CV

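TL;DR: Proposes an online SD map matching method based on an HMM with multiple probability factors, significantly improving matching accuracy in complex road networks.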

Details

Motivation: Current online map matching methods are error-prone in complex road networks, especially in multilevel road areas, and need improvement. Method: An HMM is built whose probability factors incorporate lane markings and scenario recognition, combined with ICP registration, to achieve accurate matching. Result: In tests in Europe and China, the method reaches F1 scores of 98.04% and 94.60% respectively, significantly outperforming benchmark methods. Conclusion: The method performs well in complex road networks, particularly in multilevel road areas, and has practical value. Abstract: Accurate online map matching is fundamental to vehicle navigation and the activation of intelligent driving functions. Current online map matching methods are prone to errors in complex road networks, especially in multilevel road areas. To address this challenge, we propose an online Standard Definition (SD) map matching method by constructing a Hidden Markov Model (HMM) with multiple probability factors. Our proposed method can achieve accurate map matching even in complex road networks by carefully leveraging lane markings and scenario recognition in the design of the probability factors. First, the lane markings are generated by a multi-lane tracking method and associated with the SD map using HMM to build an enriched SD map. In areas covered by the enriched SD map, the vehicle can re-localize itself by performing Iterative Closest Point (ICP) registration for the lane markings. Then, the probability factor accounting for the lane marking detection can be obtained using the association probability between adjacent lanes and roads. Second, the driving scenario recognition model is applied to generate the emission probability factor of scenario recognition, which improves the performance of map matching on elevated roads and ordinary urban roads underneath them. We validate our method through extensive road tests in Europe and China, and the experimental results show that our proposed method effectively improves the online map matching accuracy as compared to other existing methods, especially in multilevel road areas.
Specifically, the experiments show that our proposed method achieves $F_1$ scores of 98.04% and 94.60% on the Zenseact Open Dataset and test data of multilevel road areas in Shanghai respectively, significantly outperforming benchmark methods. The implementation is available at https://github.com/TRV-Lab/LMSR-OMM.
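To make the HMM idea concrete, here is a generic Viterbi decoder applied to a toy elevated-versus-ground road decision. It is a sketch under stated assumptions, not the paper's model: the state names, the observation labels (`open_sky`, `covered`) and all probabilities are invented for illustration, and the paper's lane-marking and scenario-recognition factors are collapsed into a single emission table.

```python
def viterbi(states, start_p, trans_p, emit_p, observations):
    """Return the most probable hidden-state sequence for the observations."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        step = {}
        for s in states:
            # Pick the predecessor that maximizes the path probability into s.
            prob, path = max(
                ((best[p][0] * trans_p[p][s], best[p][1]) for p in states),
                key=lambda item: item[0],
            )
            step[s] = (prob * emit_p[s][obs], path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]

# Hypothetical two-road example: an elevated road and the street below it.
STATES = ["elevated", "ground"]
START = {"elevated": 0.5, "ground": 0.5}
TRANS = {"elevated": {"elevated": 0.9, "ground": 0.1},
         "ground": {"elevated": 0.1, "ground": 0.9}}
EMIT = {"elevated": {"open_sky": 0.8, "covered": 0.2},
        "ground": {"open_sky": 0.3, "covered": 0.7}}

path = viterbi(STATES, START, TRANS, EMIT,
               ["open_sky", "open_sky", "covered", "open_sky"])
```

With these made-up numbers the single `covered` observation is treated as noise and the decoded path stays on the elevated road throughout, which is the kind of temporal smoothing an HMM contributes to map matching.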

[37] Adaptive Contextual Embedding for Robust Far-View Borehole Detection

Xuesong Liu,Tianyu Hao,Emmett J. Ientilucci

Main category: cs.CV

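TL;DR: Proposes an adaptive detection method built on YOLO that uses EMA-based statistical updates to improve detection of small, densely distributed boreholes.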

Details

Motivation: Existing methods perform poorly when detecting small, densely distributed boreholes, which affects the safety and efficiency of blasting operations. Method: Three components are introduced (adaptive augmentation, embedding stabilization, and contextual refinement), all using EMA-based statistical updates to make feature extraction more stable and accurate. Result: Experiments on a quarry-site dataset show that the method clearly outperforms baseline YOLO architectures. Conclusion: The method performs well in complex industrial scenes and improves the accuracy and robustness of borehole detection. Abstract: In controlled blasting operations, accurately detecting densely distributed tiny boreholes from far-view imagery is critical for operational safety and efficiency. However, existing detection methods often struggle due to small object scales, highly dense arrangements, and limited distinctive visual features of boreholes. To address these challenges, we propose an adaptive detection approach that builds upon existing architectures (e.g., YOLO) by explicitly leveraging consistent embedding representations derived through exponential moving average (EMA)-based statistical updates. Our method introduces three synergistic components: (1) adaptive augmentation utilizing dynamically updated image statistics to robustly handle illumination and texture variations; (2) embedding stabilization to ensure consistent and reliable feature extraction; and (3) contextual refinement leveraging spatial context for improved detection accuracy. The pervasive use of EMA in our method is particularly advantageous given the limited visual complexity and small scale of boreholes, allowing stable and robust representation learning even under challenging visual conditions. Experiments on a challenging proprietary quarry-site dataset demonstrate substantial improvements over baseline YOLO-based architectures, highlighting our method's effectiveness in realistic and complex industrial scenarios.
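The EMA update that the abstract relies on can be sketched in a few lines. This shows only the generic rule, assuming a scalar statistic such as mean image brightness; how the paper applies it to embeddings and augmentation is not specified here, and the names `ema_update` and `ema_track` as well as the default momentum are hypothetical.

```python
def ema_update(running, new_value, momentum=0.9):
    """Blend a running statistic with a new observation (higher momentum = slower change)."""
    return momentum * running + (1.0 - momentum) * new_value

def ema_track(values, momentum=0.9):
    """Apply the EMA update over a sequence, starting from the first value."""
    running = values[0]
    history = [running]
    for value in values[1:]:
        running = ema_update(running, value, momentum)
        history.append(running)
    return history

# Toy brightness statistic that jumps from 0 to 1; the EMA follows gradually.
smoothed = ema_track([0.0, 1.0, 1.0], momentum=0.5)
```

A larger momentum makes the running statistic less sensitive to a single unusual image, which is the stabilizing effect the abstract attributes to EMA.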

[38] SOAP: Style-Omniscient Animatable Portraits

Tingting Liao,Yujian Zheng,Adilbek Karmanov,Liwen Hu,Leyang Jin,Yuliang Xiu,Hao Li

Main category: cs.CV

TL;DR: SOAP is a style-omniscient framework that generates animatable 3D avatars from any portrait, supporting multiple styles while preserving detail.

Details

Motivation: Generating animatable 3D avatars from a single image is limited by style restrictions and by difficulties in handling details such as accessories or hairstyles. Method: A multiview diffusion model is combined with an adaptive optimization pipeline that preserves topology and rigging via differentiable rendering. Result: The resulting textured avatars support FACS-based animation, preserve details such as braided hair or accessories, and outperform existing techniques. Conclusion: SOAP performs strongly in single-view head modeling and diffusion-based Image-to-3D generation; code and data are publicly available. Abstract: Creating animatable 3D avatars from a single image remains challenging due to style limitations (realistic, cartoon, anime) and difficulties in handling accessories or hairstyles. While 3D diffusion models advance single-view reconstruction for general objects, outputs often lack animation controls or suffer from artifacts because of the domain gap. We propose SOAP, a style-omniscient framework to generate rigged, topology-consistent avatars from any portrait. Our method leverages a multiview diffusion model trained on 24K 3D heads with multiple styles and an adaptive optimization pipeline to deform the FLAME mesh while maintaining topology and rigging via differentiable rendering. The resulting textured avatars support FACS-based animation, integrate with eyeballs and teeth, and preserve details like braided hair or accessories. Extensive experiments demonstrate the superiority of our method over state-of-the-art techniques for both single-view head modeling and diffusion-based generation of Image-to-3D. Our code and data are publicly available for research purposes at https://github.com/TingtingLiao/soap.

[39] Split Matching for Inductive Zero-shot Semantic Segmentation

Jialei Chen,Xu Zheng,Dongyue Li,Chong Yi,Seigo Ito,Danda Pani Paudel,Luc Van Gool,Hiroshi Murase,Daisuke Deguchi

Main category: cs.CV

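TL;DR: The paper proposes Split Matching (SM), a new assignment strategy that addresses Hungarian matching's inaccurate classification of unseen categories in zero-shot semantic segmentation. By splitting matching into a seen-class part and a latent-class part and adding a multi-scale feature enhancement module, it achieves state-of-the-art performance on standard benchmarks.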

Details

Motivation: Zero-shot semantic segmentation (ZSS) performs poorly on categories that are not annotated during training, mainly because Hungarian matching needs full supervision and tends to misclassify unseen categories as background. Method: Split Matching (SM) decouples Hungarian matching into a seen-class part and a latent-class part, and clusters CLIP features to generate pseudo masks and region-level embeddings. A Multi-scale Feature Enhancement (MFE) module additionally refines decoder features. Result: SM achieves state-of-the-art performance on standard benchmarks. Conclusion: Through decoupled matching and multi-scale feature enhancement, SM markedly improves zero-shot semantic segmentation. Abstract: Zero-shot Semantic Segmentation (ZSS) aims to segment categories that are not annotated during training. While fine-tuning vision-language models has achieved promising results, these models often overfit to seen categories due to the lack of supervision for unseen classes. As an alternative to fully supervised approaches, query-based segmentation has shown great potential in ZSS, as it enables object localization without relying on explicit labels. However, conventional Hungarian matching, a core component in query-based frameworks, needs full supervision and often misclassifies unseen categories as background in the setting of ZSS. To address this issue, we propose Split Matching (SM), a novel assignment strategy that decouples Hungarian matching into two components: one for seen classes in annotated regions and another for latent classes in unannotated regions (referred to as unseen candidates). Specifically, we partition the queries into seen and candidate groups, enabling each to be optimized independently according to its available supervision. To discover unseen candidates, we cluster CLIP dense features to generate pseudo masks and extract region-level embeddings using CLS tokens. Matching is then conducted separately for the two groups based on both class-level similarity and mask-level consistency. Additionally, we introduce a Multi-scale Feature Enhancement (MFE) module that refines decoder features through residual multi-scale aggregation, improving the model's ability to capture spatial details across resolutions. SM is the first to introduce decoupled Hungarian matching under the inductive ZSS setting, and achieves state-of-the-art performance on two standard benchmarks.
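The decoupling idea can be sketched as two independent one-to-one assignments, one per query group. This is an illustration only: it assumes the matching costs are already given as small matrices, replaces the Hungarian algorithm with a brute-force search that returns the same optimum on tiny inputs, and uses hypothetical names (`min_cost_assignment`, `split_match`). The paper's actual cost terms (class-level similarity and mask-level consistency) are not modeled.

```python
from itertools import permutations

def min_cost_assignment(cost):
    """Optimal one-to-one assignment by brute force (small inputs only).

    cost[q][t] is the cost of assigning target t to query q; there must be
    at least as many queries as targets. Returns {target_index: query_index}.
    """
    n_queries, n_targets = len(cost), len(cost[0])
    best = min(
        permutations(range(n_queries), n_targets),
        key=lambda qs: sum(cost[q][t] for t, q in enumerate(qs)),
    )
    return {t: q for t, q in enumerate(best)}

def split_match(seen_cost, candidate_cost):
    """Match the seen-query group and the candidate-query group independently."""
    return min_cost_assignment(seen_cost), min_cost_assignment(candidate_cost)

# Two seen queries vs. two annotated masks; three candidate queries vs. one pseudo mask.
seen, candidates = split_match(
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5], [0.3], [0.7]],
)
```

Indices are local to each group, so a candidate query can never be pulled toward an annotated seen-class target, which is the separation the abstract describes.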

[40] xTrace: A Facial Expressive Behaviour Analysis Tool for Continuous Affect Recognition

Mani Kumar Tellamekala,Shashank Jaiswal,Thomas Smith,Timur Alamev,Gary McKeown,Anthony Brown,Michel Valstar

Main category: cs.CV

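TL;DR: The paper introduces xTrace, a tool for analysing facial expressive behaviour in the wild that addresses the shortage of large-scale labelled datasets and the difficulty of efficient feature extraction.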

Details

Motivation: Analysing facial expressive behaviour in the wild is hindered by the lack of large-scale labelled datasets and the difficulty of extracting features efficiently. Method: xTrace uses explainable facial affect descriptors and is trained on the largest facial affect video dataset. Result: On the validation set, xTrace achieves a mean CCC of 0.86 and a mean absolute error of 0.13, and remains robust under non-frontal head poses. Conclusion: xTrace is an efficient, accurate, and robust facial expression analysis tool that covers a wide range of the emotion space. Abstract: Recognising expressive behaviours in face videos is a long-standing challenge in Affective Computing. Despite significant advancements in recent years, it still remains a challenge to build a robust and reliable system for naturalistic and in-the-wild facial expressive behaviour analysis in real time. This paper addresses two key challenges in building such a system: (1) the paucity of large-scale labelled facial affect video datasets with extensive coverage of the 2D emotion space, and (2) the difficulty of extracting facial video features that are discriminative, interpretable, robust, and computationally efficient. Toward addressing these challenges, we introduce xTrace, a robust tool for facial expressive behaviour analysis and predicting continuous values of dimensional emotions, namely valence and arousal, from in-the-wild face videos. To address challenge (1), our affect recognition model is trained on the largest facial affect video data set, containing ~450k videos that cover most emotion zones in the dimensional emotion space, making xTrace highly versatile in analysing a wide spectrum of naturalistic expressive behaviours. To address challenge (2), xTrace uses facial affect descriptors that are not only explainable, but can also achieve a high degree of accuracy and robustness with low computational complexity. The key components of xTrace are benchmarked against three existing tools: MediaPipe, OpenFace, and Augsburg Affect Toolbox. On an in-the-wild validation set composed of 50k videos, xTrace achieves 0.86 mean CCC and 0.13 mean absolute error values. We present a detailed error analysis of affect predictions from xTrace, illustrating (a) its ability to recognise emotions with high accuracy across most bins in the 2D emotion space, (b) its robustness to non-frontal head pose angles, and (c) a strong correlation between its uncertainty estimates and its accuracy.
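The two reported metrics can be computed as follows. The sketch assumes CCC refers to Lin's concordance correlation coefficient, the usual choice for valence and arousal prediction, computed with population (divide-by-n) statistics; the function names are hypothetical and the sample values are invented.

```python
def ccc(pred, target):
    """Concordance correlation coefficient: 2*cov / (var_p + var_t + (mean_p - mean_t)^2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    return 2 * cov / (var_p + var_t + (mean_p - mean_t) ** 2)

def mean_absolute_error(pred, target):
    """Average absolute difference between predictions and labels."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

# Predictions that are perfectly correlated with the labels but shifted by 1.
score = ccc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
error = mean_absolute_error([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

Unlike Pearson correlation, which would be 1 here, CCC penalizes the constant offset and drops to 4/7, which is why it is a stricter agreement measure for continuous affect prediction.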

[41] UncertainSAM: Fast and Efficient Uncertainty Quantification of the Segment Anything Model

Timo Kaiser,Thomas Norrenbrock,Bodo Rosenhahn

Main category: cs.CV

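TL;DR: The paper proposes USAM, a lightweight uncertainty quantification method based on a Bayesian entropy formulation, to quantify the uncertainty of SAM; it shows superior performance on multiple datasets.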

Details

Motivation: SAM performs well in semantic segmentation tasks, but quantifying its uncertainty remains unsolved; in particular, its class-agnostic nature challenges existing methods. Method: A theoretical model based on a Bayesian entropy formulation jointly considers aleatoric, epistemic, and task uncertainty, and is used to train USAM, a lightweight post-hoc method. Result: USAM shows superior predictive capabilities on the SA-V, MOSE, ADE20k, DAVIS, and COCO datasets while being computationally cheap and easy to use. Conclusion: USAM offers an efficient uncertainty quantification solution for SAM that can support user prompting, enhance semi-supervised pipelines, and balance accuracy against cost efficiency. Abstract: The introduction of the Segment Anything Model (SAM) has paved the way for numerous semantic segmentation applications. For several tasks, quantifying the uncertainty of SAM is of particular interest. However, the ambiguous nature of the class-agnostic foundation model SAM challenges current uncertainty quantification (UQ) approaches. This paper presents a theoretically motivated uncertainty quantification model based on a Bayesian entropy formulation jointly respecting aleatoric, epistemic, and the newly introduced task uncertainty. We use this formulation to train USAM, a lightweight post-hoc UQ method. Our model traces the root of uncertainty back to under-parameterised models, insufficient prompts or image ambiguities. Our proposed deterministic USAM demonstrates superior predictive capabilities on the SA-V, MOSE, ADE20k, DAVIS, and COCO datasets, offering a computationally cheap and easy-to-use UQ alternative that can support user-prompting, enhance semi-supervised pipelines, or balance the tradeoff between accuracy and cost efficiency.
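For readers unfamiliar with entropy-based uncertainty, the standard split of predictive entropy into an aleatoric and an epistemic part can be sketched as below. This is the textbook ensemble-style decomposition, not USAM itself: USAM is described as deterministic and post-hoc, and its additional task-uncertainty term is not modeled here. The names `entropy` and `decompose` and the toy probability vectors are assumptions for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def decompose(member_probs):
    """Split total predictive entropy into aleatoric and epistemic parts.

    member_probs holds one probability vector per model sample.
    total     = entropy of the averaged prediction
    aleatoric = average entropy of the individual predictions
    epistemic = total - aleatoric (the mutual information)
    """
    n = len(member_probs)
    mean = [sum(column) / n for column in zip(*member_probs)]
    total = entropy(mean)
    aleatoric = sum(entropy(p) for p in member_probs) / n
    return total, aleatoric, total - aleatoric

# Confident but contradictory samples: all uncertainty is epistemic.
disagree = decompose([[1.0, 0.0], [0.0, 1.0]])
# Samples that agree on a 50/50 prediction: all uncertainty is aleatoric.
agree = decompose([[0.5, 0.5], [0.5, 0.5]])
```

Both toy cases have the same total entropy of ln 2, but they attribute it to different sources, which is the distinction an uncertainty model needs to draw before acting on it.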

[42] ULFine: Unbiased Lightweight Fine-tuning for Foundation-Model-Assisted Long-Tailed Semi-Supervised Learning

Enhao Zhang,Chaohua Li,Chuanxing Geng,Songcan Chen

Main category: cs.CV

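TL;DR: The paper studies how large-scale visual foundation models such as CLIP affect long-tailed semi-supervised learning (LTSSL) using three strategies (LP, LFT, FFT). It finds that LP and LFT improve overall performance but help tail classes little, and proposes ULFine to address the resulting biases.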

Details

Motivation: To study how foundation models behave in LTSSL and to explore their potential and limitations. Method: The effect of foundation models in LTSSL is analyzed with three strategies (LP, LFT, and FFT), and the ULFine strategy is proposed to address the observed biases. Result: FFT degrades performance, while LP and LFT improve overall performance but help tail classes little; ULFine markedly reduces training cost and increases prediction accuracy. Conclusion: Through adaptive fitting and complementary fusion, ULFine effectively addresses the biases in LTSSL and significantly improves performance. Abstract: Based on the success of large-scale visual foundation models like CLIP in various downstream tasks, this paper initially attempts to explore their impact on Long-Tailed Semi-Supervised Learning (LTSSL) by employing the foundation model with three strategies: Linear Probing (LP), Lightweight Fine-Tuning (LFT), and Full Fine-Tuning (FFT). Our analysis presents the following insights: i) Compared to LTSSL algorithms trained from scratch, FFT results in a decline in model performance, whereas LP and LFT, although boosting overall model performance, exhibit negligible benefits to tail classes. ii) LP produces numerous false pseudo-labels due to underlearned training data, while LFT can reduce the number of these false labels but becomes overconfident about them owing to biased fitting training data. This exacerbates the pseudo-labeled and classifier biases inherent in LTSSL, limiting performance improvement in the tail classes. With these insights, we propose an Unbiased Lightweight Fine-tuning strategy, ULFine, which mitigates the overconfidence via confidence-aware adaptive fitting of textual prototypes and counteracts the pseudo-labeled and classifier biases via complementary fusion of dual logits. Extensive experiments demonstrate that ULFine markedly decreases training costs by over ten times and substantially increases prediction accuracies compared to state-of-the-art methods.

[43] FG-CLIP: Fine-Grained Visual and Textual Alignment

Chunyu Xie,Bin Wang,Fanjing Kong,Jincheng Li,Dawei Liang,Gengshen Zhang,Dawei Leng,Yuhui Yin

Main category: cs.CV

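TL;DR: FG-CLIP improves CLIP on fine-grained understanding tasks by generating long caption-image pairs, constructing a high-quality dataset, and adding hard negative samples.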

Details

Motivation: CLIP excels in multimodal tasks but is limited in fine-grained understanding because it relies on coarse-grained captions. Method: (1) Generate 1.6 billion long caption-image pairs; (2) build a high-quality dataset with 12 million images and 40 million region annotations; (3) add 10 million hard negative samples. Result: FG-CLIP outperforms CLIP and other state-of-the-art methods on tasks such as fine-grained understanding and open-vocabulary object detection. Conclusion: FG-CLIP captures fine-grained image details more effectively and significantly improves model performance. Abstract: Contrastive Language-Image Pre-training (CLIP) excels in multimodal tasks such as image-text retrieval and zero-shot classification but struggles with fine-grained understanding due to its focus on coarse-grained short captions. To address this, we propose Fine-Grained CLIP (FG-CLIP), which enhances fine-grained understanding through three key innovations. First, we leverage large multimodal models to generate 1.6 billion long caption-image pairs for capturing global-level semantic details. Second, a high-quality dataset is constructed with 12 million images and 40 million region-specific bounding boxes aligned with detailed captions to ensure precise, context-rich representations. Third, 10 million hard fine-grained negative samples are incorporated to improve the model's ability to distinguish subtle semantic differences. Corresponding training methods are meticulously designed for these data. Extensive experiments demonstrate that FG-CLIP outperforms the original CLIP and other state-of-the-art methods across various downstream tasks, including fine-grained understanding, open-vocabulary object detection, image-text retrieval, and general multimodal benchmarks. These results highlight FG-CLIP's effectiveness in capturing fine-grained image details and improving overall model performance. The related data, code, and models are available at https://github.com/360CVGroup/FG-CLIP.

[44] Visual Affordances: Enabling Robots to Understand Object Functionality

Tommaso Apicella,Alessio Xompero,Andrea Cavallaro

Main category: cs.CV

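TL;DR: The paper proposes a unified formulation for visual affordance prediction that addresses the reproducibility problems caused by inconsistent task definitions, and introduces the Affordance Sheet to improve transparency.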

Details

Motivation: Existing affordance prediction methods for human-robot interaction define the task inconsistently, which makes results hard to reproduce and comparative benchmarks unreliable. Method: A unified framework for visual affordance prediction is proposed that incorporates physical properties such as object weight, and the Affordance Sheet is introduced to document methods and dataset details. Result: Combining the unified framework with physical properties improves the accuracy and reproducibility of affordance prediction. Conclusion: The approach bridges the gap between affordance perception and robot actuation and provides more complete information for task completion. Abstract: Human-robot interaction for assistive technologies relies on the prediction of affordances, which are the potential actions a robot can perform on objects. Predicting object affordances from visual perception is formulated differently for tasks such as grasping detection, affordance classification, affordance segmentation, and hand-object interaction synthesis. In this work, we highlight the reproducibility issue in these redefinitions, making comparative benchmarks unfair and unreliable. To address this problem, we propose a unified formulation for visual affordance prediction, provide a comprehensive and systematic review of previous works highlighting strengths and limitations of methods and datasets, and analyse what challenges reproducibility. To favour transparency, we introduce the Affordance Sheet, a document to detail the proposed solution, the datasets, and the validation. As the physical properties of an object influence the interaction with the robot, we present a generic framework that links visual affordance prediction to the physical world. Using the weight of an object as an example for this framework, we discuss how estimating object mass can affect the affordance prediction. Our approach bridges the gap between affordance perception and robot actuation, and accounts for the complete information about objects of interest and how the robot interacts with them to accomplish its task.

[45] PIDiff: Image Customization for Personalized Identities with Diffusion Models

Jinyu Gu,Haipeng Liu,Meng Wang,Yang Wang

Main category: cs.CV

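TL;DR: PIDiff is a fine-tuning-based diffusion model for personalized-identity text-to-image generation; it uses the W+ space and an identity-tailored fine-tuning strategy to avoid semantic entanglement and achieve accurate feature extraction and localization.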

Details

Motivation: Existing text-to-image methods fail to separate identity information from background information, so generated images lose key identity characteristics and become less diverse. Method: PIDiff combines the W+ space with an identity-tailored fine-tuning strategy, and uses a cross-attention block together with a parameter optimization strategy to preserve identity information while maintaining generation capability. Result: Experiments validate the effectiveness of PIDiff for personalized-identity text-to-image generation. Conclusion: By avoiding semantic entanglement and improving feature extraction, PIDiff markedly improves the identity accuracy and diversity of generated images. Abstract: Text-to-image generation for personalized identities aims at incorporating the specific identity into images using a text prompt and an identity image. Based on the powerful generative capabilities of DDPMs, many previous works adopt additional prompts, such as text embeddings and CLIP image embeddings, to represent the identity information, while they fail to disentangle the identity information and background information. As a result, the generated images not only lose key identity characteristics but also suffer from significantly reduced diversity. To address this issue, previous works have combined the W+ space from StyleGAN with diffusion models, leveraging this space to provide a more accurate and comprehensive representation of identity features through multi-level feature extraction. However, the entanglement of identity and background information in in-the-wild images during training prevents accurate identity localization, resulting in severe semantic interference between identity and background. In this paper, we propose a novel fine-tuning-based diffusion model for personalized identities text-to-image generation, named PIDiff, which leverages the W+ space and an identity-tailored fine-tuning strategy to avoid semantic entanglement and achieves accurate feature extraction and localization. Style editing can also be achieved by PIDiff through preserving the characteristics of identity features in the W+ space, which vary from coarse to fine. Through the combination of the proposed cross-attention block and parameter optimization strategy, PIDiff preserves the identity information and maintains the generation capability for in-the-wild images of the pre-trained model during inference.
Our experimental results validate the effectiveness of our method in this task.

[46] Nonlinear Motion-Guided and Spatio-Temporal Aware Network for Unsupervised Event-Based Optical Flow

Zuntao Liu,Hao Zhuang,Junjie Jiang,Yuhang Song,Zheng Fang

Main category: cs.CV

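TL;DR: The paper proposes E-NMSTFlow, an unsupervised event-based optical flow network focused on long-time sequences, which improves performance by exploiting rich spatio-temporal information and a nonlinear motion compensation loss.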

Details

Motivation: Most existing learning-based methods for event-based optical flow adopt frame-based techniques, ignore the spatio-temporal characteristics of events, and assume linear motion between events, which increases errors on long-time sequences. Method: The STMFA and AMFE modules use spatio-temporal information to learn data associations, and a nonlinear motion compensation loss improves unsupervised learning. Result: On the MVSEC and DSEC-Flow datasets, the method ranks first among unsupervised learning methods. Conclusion: By combining spatio-temporal information with nonlinear motion compensation, E-NMSTFlow markedly improves the accuracy of event-based optical flow estimation. Abstract: Event cameras have the potential to capture continuous motion information over time and space, making them well-suited for optical flow estimation. However, most existing learning-based methods for event-based optical flow adopt frame-based techniques, ignoring the spatio-temporal characteristics of events. Additionally, these methods assume linear motion between consecutive events within the loss time window, which increases optical flow errors in long-time sequences. In this work, we observe that rich spatio-temporal information and accurate nonlinear motion between events are crucial for event-based optical flow estimation. Therefore, we propose E-NMSTFlow, a novel unsupervised event-based optical flow network focusing on long-time sequences. We propose a Spatio-Temporal Motion Feature Aware (STMFA) module and an Adaptive Motion Feature Enhancement (AMFE) module, both of which utilize rich spatio-temporal information to learn spatio-temporal data associations. Meanwhile, we propose a nonlinear motion compensation loss that utilizes the accurate nonlinear motion between events to improve the unsupervised learning of our network. Extensive experiments demonstrate the effectiveness and superiority of our method. Remarkably, our method ranks first among unsupervised learning methods on the MVSEC and DSEC-Flow datasets. Our project page is available at https://wynelio.github.io/E-NMSTFlow.

[47] DispBench: Benchmarking Disparity Estimation to Synthetic Corruptions

Shashank Agnihotri,Amaan Ansari,Annika Dackermann,Fabian Rösch,Margret Keuper

Main category: cs.CV

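TL;DR: The paper presents DispBench, a comprehensive benchmarking tool for systematically assessing the reliability of disparity estimation methods, filling the gap left by the lack of a standardized evaluation in this area.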

Details

Motivation: Deep learning performs well on disparity estimation, but its sensitivity to distribution shifts and adversarial attacks raises concerns about reliability and generalization, and the lack of a standardized benchmark hinders progress. Method: DispBench systematically evaluates the robustness of disparity estimation methods under synthetic image corruptions (such as adversarial attacks and distribution shifts) across multiple datasets and corruption scenarios. Result: The study uncovers key correlations between accuracy, reliability, and generalization, and provides the most extensive performance and robustness analysis to date. Conclusion: DispBench offers a standardized tool for evaluating the robustness of disparity estimation methods and supports further research in the field. Abstract: Deep learning (DL) has surpassed human performance on standard benchmarks, driving its widespread adoption in computer vision tasks. One such task is disparity estimation, estimating the disparity between matching pixels in stereo image pairs, which is crucial for safety-critical applications like medical surgeries and autonomous navigation. However, DL-based disparity estimation methods are highly susceptible to distribution shifts and adversarial attacks, raising concerns about their reliability and generalization. Despite these concerns, a standardized benchmark for evaluating the robustness of disparity estimation methods remains absent, hindering progress in the field. To address this gap, we introduce DispBench, a comprehensive benchmarking tool for systematically assessing the reliability of disparity estimation methods. DispBench evaluates robustness against synthetic image corruptions such as adversarial attacks and out-of-distribution shifts caused by 2D Common Corruptions across multiple datasets and diverse corruption scenarios. We conduct the most extensive performance and robustness analysis of disparity estimation methods to date, uncovering key correlations between accuracy, reliability, and generalization. Open-source code for DispBench: https://github.com/shashankskagnihotri/benchmarking_robustness/tree/disparity_estimation/final/disparity_estimation

[48] MDE-Edit: Masked Dual-Editing for Multi-Object Image Editing via Diffusion Models

Hongyang Zhu,Haipeng Liu,Bo Fu,Yang Wang

Main category: cs.CV

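TL;DR: MDE-Edit is a training-free, inference-stage optimization method that uses a dual-loss design (OAL and CCL) to address inaccurate localization and attribute-object mismatch in multi-object editing.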

Details

Motivation: Multi-object editing in complex scenes suffers from inaccurate localization and attribute-object mismatch, which existing methods struggle to resolve. Method: MDE-Edit optimizes the noise latent feature of diffusion models with an Object Alignment Loss (OAL) and a Color Consistency Loss (CCL). Result: Experiments show that MDE-Edit outperforms existing methods in editing accuracy and visual quality. Conclusion: MDE-Edit offers a robust solution for complex multi-object image editing tasks. Abstract: Multi-object editing aims to modify multiple objects or regions in complex scenes while preserving structural coherence. This task faces significant challenges in scenarios involving overlapping or interacting objects: (1) Inaccurate localization of target objects due to attention misalignment, leading to incomplete or misplaced edits; (2) Attribute-object mismatch, where color or texture changes fail to align with intended regions due to cross-attention leakage, creating semantic conflicts (e.g., color bleeding into non-target areas). Existing methods struggle with these challenges: approaches relying on global cross-attention mechanisms suffer from attention dilution and spatial interference between objects, while mask-based methods fail to bind attributes to geometrically accurate regions due to feature entanglement in multi-object scenarios. To address these limitations, we propose a training-free, inference-stage optimization approach that enables precise localized image manipulation in complex multi-object scenes, named MDE-Edit. MDE-Edit optimizes the noise latent feature in diffusion models via two key losses: Object Alignment Loss (OAL) aligns multi-layer cross-attention with segmentation masks for precise object positioning, and Color Consistency Loss (CCL) amplifies target attribute attention within masks while suppressing leakage to adjacent regions. This dual-loss design ensures localized and coherent multi-object edits. Extensive experiments demonstrate that MDE-Edit outperforms state-of-the-art methods in editing accuracy and visual quality, offering a robust solution for complex multi-object image manipulation tasks.

[49] Automated vision-based assistance tools in bronchoscopy: stenosis severity estimation

Clara Tomasini,Javier Rodriguez-Puigvert,Dinora Polanco,Manuel Viñuales,Luis Riazuelo,Ana Cristina Murillo

Main category: cs.CV

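TL;DR: Proposes an automated method for estimating subglottic stenosis severity from bronchoscopy images, with no CT scan required, reducing subjectivity and radiation exposure.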

Details

Motivation: Subglottic stenosis is currently assessed by subjective visual inspection or CT scans, and consistent, automated methods are lacking. Method: The illumination decline effect in bronchoscopy images is used to segment and track the airway lumen and to build a 3D model from which the degree of narrowing is measured. Result: The method is validated on real bronchoscopy data; its results agree with CT-based and expert estimations and are repeatable. Conclusion: The automated method shortens diagnosis, reduces radiation exposure, and comes with the first public benchmark for this task. Abstract: Purpose: Subglottic stenosis refers to the narrowing of the subglottis, the airway between the vocal cords and the trachea. Its severity is typically evaluated by estimating the percentage of obstructed airway. This estimation can be obtained from CT data or through visual inspection by experts exploring the region. However, visual inspections are inherently subjective, leading to less consistent and robust diagnoses. No public methods or datasets are currently available for automated evaluation of this condition from bronchoscopy video. Methods: We propose a pipeline for automated subglottic stenosis severity estimation during the bronchoscopy exploration, without requiring the physician to traverse the stenosed region. Our approach exploits the physical effect of illumination decline in endoscopy to segment and track the lumen and obtain a 3D model of the airway. This 3D model is obtained from a single frame and is used to measure the airway narrowing. Results: Our pipeline is the first to enable automated and robust subglottic stenosis severity measurement using bronchoscopy images. The results show consistency with ground-truth estimations from CT scans and expert estimations, and reliable repeatability across multiple estimations on the same patient. Our evaluation is performed on our new Subglottic Stenosis Dataset of data from real bronchoscopy procedures. Conclusion: We demonstrate how to automate evaluation of subglottic stenosis severity using only bronchoscopy. Our approach can assist with and shorten diagnosis and monitoring procedures, with automated and repeatable estimations and less exploration time, and spare patients radiation exposure as no CT is required. Additionally, we release the first public benchmark for subglottic stenosis severity assessment.
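The severity measure mentioned in the abstract, the percentage of obstructed airway, reduces to a one-line formula once two cross-sectional areas are available. The sketch below assumes those areas have already been measured (for example from a 3D airway model) in the same units; the function name `obstruction_percent` and the sample values are hypothetical, and the paper's own measurement procedure is not reproduced.

```python
def obstruction_percent(stenosed_area, reference_area):
    """Percentage of the airway cross-section that is obstructed.

    stenosed_area:  open lumen area at the narrowest point
    reference_area: open lumen area of the healthy airway (same units)
    """
    if reference_area <= 0:
        raise ValueError("reference_area must be positive")
    if not 0 <= stenosed_area <= reference_area:
        raise ValueError("stenosed_area must lie between 0 and reference_area")
    return 100.0 * (1.0 - stenosed_area / reference_area)

# A lumen reduced to a quarter of the healthy cross-section.
severity = obstruction_percent(25.0, 100.0)
```

Because only the ratio of the two areas matters, the measure is independent of the absolute scale of the reconstruction, provided both areas come from the same model.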

[50] Probabilistic Embeddings for Frozen Vision-Language Models: Uncertainty Quantification with Gaussian Process Latent Variable Models

Aishwarya Venkataramanan,Paul Bodesheim,Joachim Denzler

Main category: cs.CV

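TL;DR: GroVE is a post-hoc method that uses a Gaussian Process Latent Variable Model (GPLVM) to generate uncertainty-aware probabilistic embeddings from frozen vision-language models (VLMs), improving uncertainty calibration.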

Details

Motivation: Deterministic embeddings from standard VLMs struggle to capture the uncertainty in visual and textual descriptions, and existing approaches require large training datasets and cannot reuse the representations of existing large-scale VLMs. Method: GroVE builds on the GPLVM and optimizes single-modal embedding reconstruction and cross-modal alignment objectives in a low-dimensional latent space to produce uncertainty-aware probabilistic embeddings. Result: GroVE achieves state-of-the-art uncertainty calibration on tasks such as cross-modal retrieval, visual question answering, and active learning. Conclusion: GroVE is an efficient post-hoc method that produces high-quality probabilistic embeddings without retraining the VLM. Abstract: Vision-Language Models (VLMs) learn joint representations by mapping images and text into a shared latent space. However, recent research highlights that deterministic embeddings from standard VLMs often struggle to capture the uncertainties arising from the ambiguities in visual and textual descriptions and the multiple possible correspondences between images and texts. Existing approaches tackle this by learning probabilistic embeddings during VLM training, which demands large datasets and does not leverage the powerful representations already learned by large-scale VLMs like CLIP. In this paper, we propose GroVE, a post-hoc approach to obtaining probabilistic embeddings from frozen VLMs. GroVE builds on the Gaussian Process Latent Variable Model (GPLVM) to learn a shared low-dimensional latent space where image and text inputs are mapped to a unified representation, optimized through single-modal embedding reconstruction and cross-modal alignment objectives. Once trained, the Gaussian Process model generates uncertainty-aware probabilistic embeddings. Evaluation shows that GroVE achieves state-of-the-art uncertainty calibration across multiple downstream tasks, including cross-modal retrieval, visual question answering, and active learning.

[51] PaniCar: Securing the Perception of Advanced Driving Assistance Systems Against Emergency Vehicle Lighting

Elad Feldman,Jacob Shams,Dudi Biton,Alfred Chen,Shaoyuan Xie,Satoru Koda,Yisroel Mirsky,Asaf Shabtai,Yuval Elovici,Ben Nassi

Main category: cs.CV

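TL;DR: The study finds that autonomous vehicles have a detection vulnerability under emergency vehicle lighting and proposes the Caracetamol framework to stabilize detection.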

Details

Motivation: Detection performance of autonomous vehicles degrades under emergency vehicle lighting, which creates a safety risk that calls for a mitigation. Method: Evaluate several ADASs and object detectors, analyze the influence of the lighting, and propose the Caracetamol framework to harden detection. Result: Caracetamol markedly raises detection confidence, reduces fluctuation, and meets real-time processing requirements. Conclusion: Caracetamol effectively mitigates the effect of emergency vehicle lighting on detection and improves the safety of autonomous driving. Abstract: The safety of autonomous cars has come under scrutiny in recent years, especially after 16 documented incidents involving Teslas (with autopilot engaged) crashing into parked emergency vehicles (police cars, ambulances, and firetrucks). While previous studies have revealed that strong light sources often introduce flare artifacts in the captured image, which degrade the image quality, the impact of flare on object detection performance remains unclear. In this research, we unveil PaniCar, a digital phenomenon that causes an object detector's confidence score to fluctuate below detection thresholds when exposed to activated emergency vehicle lighting. This vulnerability poses a significant safety risk, and can cause autonomous vehicles to fail to detect objects near emergency vehicles. In addition, this vulnerability could be exploited by adversaries to compromise the security of advanced driving assistance systems (ADASs). We assess seven commercial ADASs (Tesla Model 3, "manufacturer C", HP, Pelsee, AZDOME, Imagebon, Rexing), four object detectors (YOLO, SSD, RetinaNet, Faster R-CNN), and 14 patterns of emergency vehicle lighting to understand the influence of various technical and environmental factors. We also evaluate four SOTA flare removal methods and show that their performance and latency are insufficient for real-time driving constraints. To mitigate this risk, we propose Caracetamol, a robust framework designed to enhance the resilience of object detectors against the effects of activated emergency vehicle lighting. Our evaluation shows that on YOLOv3 and Faster R-CNN, Caracetamol improves the models' average confidence of car detection by 0.20, the lower confidence bound by 0.33, and reduces the fluctuation range by 0.33. In addition, Caracetamol is capable of processing frames at a rate of between 30-50 FPS, enabling real-time ADAS car detection.
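The three quantities quoted in the PaniCar abstract (average confidence, lower confidence bound, fluctuation range) can be illustrated with a small sketch. The abstract does not define them precisely, so the definitions below (mean, minimum, and maximum minus minimum over a frame sequence) are assumptions, as are the function name `confidence_stats` and the 0.5 detection threshold.

```python
from statistics import mean


def confidence_stats(scores, threshold=0.5):
    """Summarize per-frame detection confidences for one tracked object.

    Assumed definitions (not taken from the paper):
      lower_bound       = minimum confidence over the sequence
      fluctuation_range = maximum minus minimum confidence
      missed_fraction   = share of frames below the detection threshold
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    lower = min(scores)
    upper = max(scores)
    missed = sum(1 for s in scores if s < threshold)
    return {
        "mean": mean(scores),
        "lower_bound": lower,
        "fluctuation_range": upper - lower,
        "missed_fraction": missed / len(scores),
    }


if __name__ == "__main__":
    # Hypothetical confidences for a car filmed next to flashing lights.
    flashing = [0.9, 0.3, 0.8, 0.2, 0.85, 0.4]
    print(confidence_stats(flashing))
```

Under these assumed definitions, a mitigation of the kind the abstract describes would show up as a higher `lower_bound` and a smaller `fluctuation_range` on the same sequence.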

[52] Biomed-DPT: Dual Modality Prompt Tuning for Biomedical Vision-Language Models

Wei Peng,Kang Liu,Jianchen Hu,Meng Zhang

Main category: cs.CV

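TL;DR: Biomed-DPT is a knowledge-enhanced dual modality prompt tuning technique that combines text and vision prompts to improve biomedical image classification.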

Details

Motivation: Existing prompt learning methods focus only on text prompts and ignore the particular structures of biomedical images (such as complex anatomical structures and subtle pathological features). Method: Biomed-DPT designs dual-modality prompts: the text prompt combines template-driven clinical prompts with LLM-driven domain-adapted prompts, and the vision prompt introduces a zero vector as a soft prompt to re-weight attention. Result: An average classification accuracy of 66.14% across 11 biomedical image datasets, surpassing the CoOp method. Conclusion: Biomed-DPT significantly improves biomedical image classification through dual-modality prompts and knowledge distillation. Abstract: Prompt learning is one of the most effective paradigms for adapting pre-trained vision-language models (VLMs) to biomedical image classification tasks in few-shot scenarios. However, most current prompt learning methods use only text prompts and ignore the particular structures (such as the complex anatomical structures and subtle pathological features) in biomedical images. In this work, we propose Biomed-DPT, a knowledge-enhanced dual modality prompt tuning technique. In designing the text prompt, Biomed-DPT constructs a dual prompt including the template-driven clinical prompts and the large language model (LLM)-driven domain-adapted prompts, then extracts the clinical knowledge from the domain-adapted prompts through the knowledge distillation technique. In designing the vision prompt, Biomed-DPT introduces the zero vector as a soft prompt to leverage attention re-weighting so that the focus on non-diagnostic regions and the recognition of non-critical pathological features are avoided. Biomed-DPT achieves an average classification accuracy of 66.14% across 11 biomedical image datasets covering 9 modalities and 10 organs, with performance reaching 78.06% in base classes and 75.97% in novel classes, surpassing the Context Optimization (CoOp) method by 6.20%, 3.78%, and 8.04%, respectively. Our code is available at https://github.com/Kanyooo/Biomed-DPT.
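One way to read the "zero vector as a soft prompt" idea is that an all-zero key always scores a logit of 0 under dot-product attention, so it absorbs part of the softmax mass that would otherwise go to image tokens. The sketch below illustrates only that arithmetic; it is an assumption about the mechanism, not the authors' implementation, and `attention_weights` and its arguments are hypothetical names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys, add_zero_prompt=False):
    """Scaled dot-product attention weights of one query over a list of keys.

    With add_zero_prompt=True an all-zero key is appended. Its logit is
    always 0, so it takes a share of the attention mass, and that share
    grows when the real keys score low against the query.
    """
    dim = len(query)
    if add_zero_prompt:
        keys = keys + [[0.0] * dim]
    scale = math.sqrt(dim)
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(logits)


if __name__ == "__main__":
    query = [1.0, 0.0]
    keys = [[2.0, 0.0], [-2.0, 0.0]]  # hypothetical patch keys
    print(attention_weights(query, keys))
    print(attention_weights(query, keys, add_zero_prompt=True))
```

Because the zero key is paired with a zero value in this reading, the mass it absorbs contributes nothing to the output, which effectively down-weights all real tokens by a common factor.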

Haizhen Xie,Kunpeng Du,Qiangyu Yan,Sen Lu,Jianhong Han,Hanting Chen,Hailin Hu,Jie Hu

Main category: cs.CV

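TL;DR: EAM is a new blind super-resolution method that builds on the DiT architecture with a novel Ψ-DiT block, combined with progressive masked image modeling and a subject-aware prompt generation strategy, to substantially improve image restoration.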

Details

Motivation: Guiding blind super-resolution (BSR) with pre-trained T2I diffusion models has become the predominant approach, but traditional U-Net architectures are limited in performance, whereas DiT architectures perform better. Method: Propose EAM, which combines the Ψ-DiT block, progressive masked image modeling, and a subject-aware prompt generation strategy to improve DiT guidance and training efficiency. Result: EAM achieves state-of-the-art results on multiple datasets, outperforming existing methods in both quantitative metrics and visual quality. Conclusion: Through its design and training strategies, EAM significantly improves BSR performance and demonstrates the value of the DiT architecture and of T2I model priors. Abstract: Utilizing pre-trained Text-to-Image (T2I) diffusion models to guide Blind Super-Resolution (BSR) has become a predominant approach in the field. While T2I models have traditionally relied on U-Net architectures, recent advancements have demonstrated that Diffusion Transformers (DiT) achieve significantly higher performance in this domain. In this work, we introduce Enhancing Anything Model (EAM), a novel BSR method that leverages DiT and outperforms previous U-Net-based approaches. We introduce a novel block, $\Psi$-DiT, which effectively guides the DiT to enhance image restoration. This block employs a low-resolution latent as a separable flow injection control, forming a triple-flow architecture that effectively leverages the prior knowledge embedded in the pre-trained DiT. To fully exploit the prior guidance capabilities of T2I models and enhance their generalization in BSR, we introduce a progressive Masked Image Modeling strategy, which also reduces training costs. Additionally, we propose a subject-aware prompt generation strategy that employs a robust multi-modal model in an in-context learning framework. This strategy automatically identifies key image areas, provides detailed descriptions, and optimizes the utilization of T2I diffusion priors. Our experiments demonstrate that EAM achieves state-of-the-art results across multiple datasets, outperforming existing methods in both quantitative metrics and visual quality.

[54] HQC-NBV: A Hybrid Quantum-Classical View Planning Approach

Xiaotong Yu,Chang Wen Chen

Main category: cs.CV

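TL;DR: HQC-NBV is a hybrid quantum-classical framework for planning camera viewpoints that leverages quantum properties to explore more efficiently, achieving up to 49.2% higher exploration efficiency than classical methods.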

Details

Motivation: Classical view planning methods fall short in computational scalability and solution optimality in complex scenes. Method: Propose a hybrid quantum-classical framework based on a Hamiltonian formulation and a parameter-centric variational ansatz, using bidirectional alternating entanglement patterns to capture dependencies between parameters. Result: Experiments show that the quantum components measurably improve performance, with up to 49.2% higher exploration efficiency than classical methods. Conclusion: The work advances the integration of quantum computing into robotic perception systems and offers a new paradigm for robot vision tasks. Abstract: Efficient view planning is a fundamental challenge in computer vision and robotic perception, critical for tasks ranging from search and rescue operations to autonomous navigation. While classical approaches, including sampling-based and deterministic methods, have shown promise in planning camera viewpoints for scene exploration, they often struggle with computational scalability and solution optimality in complex settings. This study introduces HQC-NBV, a hybrid quantum-classical framework for view planning that leverages quantum properties to efficiently explore the parameter space while maintaining robustness and scalability. We propose a specific Hamiltonian formulation with multi-component cost terms and a parameter-centric variational ansatz with bidirectional alternating entanglement patterns that capture the hierarchical dependencies between viewpoint parameters. Comprehensive experiments demonstrate that quantum-specific components provide measurable performance advantages. Compared to the classical methods, our approach achieves up to 49.2% higher exploration efficiency across diverse environments. Our analysis of entanglement architecture and coherence-preserving terms provides insights into the mechanisms of quantum advantage in robotic exploration tasks. This work represents a significant advancement in integrating quantum computing into robotic perception systems, offering a paradigm-shifting solution for various robot vision tasks.

[55] Diffusion Model Quantization: A Review

Qian Zeng,Chenggong Hu,Mingli Song,Jie Song

Main category: cs.CV

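TL;DR: This survey reviews recent progress in diffusion model quantization, analyzing the current technical challenges, a taxonomy of methods, and their evaluation, and outlines future research directions.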

Details

Motivation: Quantization is a key technique for deploying diffusion models efficiently on resource-constrained edge devices, and this survey aims to summarize and evaluate recent progress in the area. Method: Discuss the principles of quantization techniques by category and analyze representative schemes from qualitative and quantitative perspectives, including the quantization challenges of U-Net and DiT architectures. Result: Quantitatively benchmarks a variety of methods on standard datasets and qualitatively analyzes the effects of quantization errors, providing a comprehensive comparison. Conclusion: Proposes future research directions for quantizing generative models in practical applications; the related resources are publicly available. Abstract: Recent success of large text-to-image models has empirically underscored the exceptional performance of diffusion models in generative tasks. To facilitate their efficient deployment on resource-constrained edge devices, model quantization has emerged as a pivotal technique for both compression and acceleration. This survey offers a thorough review of the latest advancements in diffusion model quantization, encapsulating and analyzing the current state of the art in this rapidly advancing domain. First, we provide an overview of the key challenges encountered in the quantization of diffusion models, including those based on U-Net architectures and Diffusion Transformers (DiT). We then present a comprehensive taxonomy of prevalent quantization techniques, engaging in an in-depth discussion of their underlying principles. Subsequently, we perform a meticulous analysis of representative diffusion model quantization schemes from both qualitative and quantitative perspectives. From a quantitative standpoint, we rigorously benchmark a variety of methods using widely recognized datasets, delivering an extensive evaluation of the most recent and impactful research in the field. From a qualitative standpoint, we categorize and synthesize the effects of quantization errors, elucidating these impacts through both visual analysis and trajectory examination. In conclusion, we outline prospective avenues for future research, proposing novel directions for the quantization of generative models in practical applications. The list of related papers, corresponding codes, pre-trained models and comparison results are publicly available at the survey project homepage https://github.com/TaylorJocelyn/Diffusion-Model-Quantization.
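As background for the quantization survey above, here is a minimal sketch of uniform affine (asymmetric) quantization, the basic building block that most post-training schemes start from. It is a generic textbook illustration, not a method taken from the survey; the function names and the 8-bit default are assumptions.

```python
def quantize(values, num_bits=8):
    """Uniform affine quantization of a list of floats to unsigned integers.

    Returns (codes, scale, zero_point). The range is widened to include 0.0
    so that zero maps to an exact integer code.
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for all-zero input
    zero_point = round(qmin - lo / scale)
    codes = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Map integer codes back to approximate float values."""
    return [(c - zero_point) * scale for c in codes]


if __name__ == "__main__":
    weights = [-1.0, -0.25, 0.0, 0.5, 1.0]  # hypothetical weight values
    codes, scale, zero_point = quantize(weights)
    restored = dequantize(codes, scale, zero_point)
    errors = [abs(a - b) for a, b in zip(weights, restored)]
    print(codes, scale, zero_point, max(errors))
```

The round-trip error per value is bounded by roughly one quantization step (`scale`); the quantization errors that the survey analyzes arise when such per-tensor errors accumulate over many denoising steps.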

[56] Does CLIP perceive art the same way we do?

Andrea Asperti,Leonardo Dessì,Maria Chiara Tonetti,Nico Wu

Main category: cs.CV

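TL;DR: The study examines how well CLIP understands artworks (both human-created and AI-generated images) and finds both strengths and limitations in its extraction of semantic and stylistic information.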

Details

Motivation: Investigate whether CLIP can understand and interpret artworks the way humans do, particularly with respect to high-level semantic and stylistic information. Method: Design targeted probing tasks and compare CLIP's responses with human annotations and expert benchmarks to assess its perception. Result: CLIP's visual representations show certain strengths but are limited in capturing aesthetic cues and artistic intent. Conclusion: The study stresses the need for deeper interpretability in multimodal systems applied to creative domains, especially where subjectivity and nuance are involved. Abstract: CLIP has emerged as a powerful multimodal model capable of connecting images and text through joint embeddings, but to what extent does it "see" the same way humans do - especially when interpreting artworks? In this paper, we investigate CLIP's ability to extract high-level semantic and stylistic information from paintings, including both human-created and AI-generated imagery. We evaluate its perception across multiple dimensions: content, scene understanding, artistic style, historical period, and the presence of visual deformations or artifacts. By designing targeted probing tasks and comparing CLIP's responses to human annotations and expert benchmarks, we explore its alignment with human perceptual and contextual understanding. Our findings reveal both strengths and limitations in CLIP's visual representations, particularly in relation to aesthetic cues and artistic intent. We further discuss the implications of these insights for using CLIP as a guidance mechanism during generative processes, such as style transfer or prompt-based image synthesis. Our work highlights the need for deeper interpretability in multimodal systems, especially when applied to creative domains where nuance and subjectivity play a central role.

[57] PADriver: Towards Personalized Autonomous Driving

Genghua Kou,Fan Jia,Weixin Mao,Yingfei Liu,Yucheng Zhao,Ziheng Zhang,Osamu Yoshie,Tiancai Wang,Ying Li,Xiangyu Zhang

Main category: cs.CV

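TL;DR: PADriver is a personalized autonomous driving framework built on a multi-modal large language model that performs well under traffic rules in closed-loop evaluation.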

Details

Motivation: Address the need for personalized autonomous driving, using multi-modal inputs to make driving decisions more adaptable and safer. Method: Use a multi-modal large language model to process streaming video frames and personalized textual prompts for scene understanding, danger level estimation, and action decision. Result: Outperforms existing methods on the PAD-Highway benchmark and supports various driving modes. Conclusion: PADriver provides an effective closed-loop solution for personalized autonomous driving with potential for practical use. Abstract: In this paper, we propose PADriver, a novel closed-loop framework for personalized autonomous driving (PAD). Built upon a Multi-modal Large Language Model (MLLM), PADriver takes streaming frames and personalized textual prompts as inputs. It autoregressively performs scene understanding, danger level estimation and action decision. The predicted danger level reflects the risk of the potential action and provides an explicit reference for the final action, which corresponds to the preset personalized prompt. Moreover, we construct a closed-loop benchmark named PAD-Highway based on the Highway-Env simulator to comprehensively evaluate the decision performance under traffic rules. The dataset contains 250 hours of video with high-quality annotation to facilitate the development of PAD behavior analysis. Experimental results on the constructed benchmark show that PADriver outperforms state-of-the-art approaches on different evaluation metrics, and enables various driving modes.

[58] PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes

Ahmed Abdelreheem,Filippo Aleotti,Jamie Watson,Zawar Qureshi,Abdelrahman Eldesokey,Peter Wonka,Gabriel Brostow,Sara Vicente,Guillermo Garcia-Hernando

Main category: cs.CV

TL;DR: The paper introduces a new task, language-guided object placement in 3D scenes, together with a benchmark, a dataset, and a baseline method.

Details

Motivation: Address the ambiguity and geometric-relationship reasoning involved in language-guided object placement in 3D scenes. Method: Propose a new benchmark, a dataset, and a baseline method for training and evaluating 3D LLMs. Result: Establishes the first benchmark and dataset for language-guided 3D object placement and provides a baseline method. Conclusion: The task and benchmark could become an important tool for evaluating generalist 3D LLMs. Abstract: We introduce the novel task of Language-Guided Object Placement in Real 3D Scenes. Our model is given a 3D scene's point cloud, a 3D asset, and a textual prompt broadly describing where the 3D asset should be placed. The task here is to find a valid placement for the 3D asset that respects the prompt. Compared with other language-guided localization tasks in 3D scenes such as grounding, this task has specific challenges: it is ambiguous because it has multiple valid solutions, and it requires reasoning about 3D geometric relationships and free space. We inaugurate this task by proposing a new benchmark and evaluation protocol. We also introduce a new dataset for training 3D LLMs on this task, as well as the first method to serve as a non-trivial baseline. We believe that this challenging task and our new benchmark could become part of the suite of benchmarks used to evaluate and compare generalist 3D LLM models.

[59] PRE-Mamba: A 4D State Space Model for Ultra-High-Frequent Event Camera Deraining

Ciyu Ruan,Ruishan Guo,Zihang Gong,Jingao Xu,Wenhan Yang,Xinlei Chen

Main category: cs.CV

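TL;DR: PRE-Mamba is a point-based event camera deraining framework that models dual temporal scales and multiple spatial scales and adds frequency-domain regularization to achieve efficient deraining.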

Details

Motivation: Event cameras suffer from dense noise in rainy conditions, and existing methods trade off temporal precision, deraining effectiveness, and computational efficiency. Method: Propose a 4D event cloud representation, a Spatio-Temporal Decoupling and Fusion module (STDF), and a Multi-Scale State Space Model (MS3M), combined with frequency-domain regularization. Result: Strong performance on the EventRain-27K dataset (SR 0.95, NR 0.91) with high computational efficiency (0.4s/M events) and only 0.26M parameters. Conclusion: PRE-Mamba generalizes well across varying rain intensities, viewpoints, and snowy conditions. Abstract: Event cameras excel in high temporal resolution and dynamic range but suffer from dense noise in rainy conditions. Existing event deraining methods face trade-offs between temporal precision, deraining effectiveness, and computational efficiency. In this paper, we propose PRE-Mamba, a novel point-based event camera deraining framework that fully exploits the spatiotemporal characteristics of raw event and rain. Our framework introduces a 4D event cloud representation that integrates dual temporal scales to preserve high temporal precision, a Spatio-Temporal Decoupling and Fusion module (STDF) that enhances deraining capability by enabling shallow decoupling and interaction of temporal and spatial information, and a Multi-Scale State Space Model (MS3M) that captures deeper rain dynamics across dual-temporal and multi-spatial scales with linear computational complexity. Enhanced by frequency-domain regularization, PRE-Mamba achieves superior performance (0.95 SR, 0.91 NR, and 0.4s/M events) with only 0.26M parameters on EventRain-27K, a comprehensive dataset with labeled synthetic and real-world sequences. Moreover, our method generalizes well across varying rain intensities, viewpoints, and even snowy conditions.

[60] Mapping User Trust in Vision Language Models: Research Landscape, Challenges, and Prospects

Agnese Chiatti,Sara Bernardini,Lara Shibelski Godoy Piccolo,Viola Schiaffonati,Matteo Matteucci

Main category: cs.CV

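TL;DR: This survey reviews trust dynamics in user interactions with vision language models (VLMs), proposes a multi-disciplinary taxonomy, and summarizes preliminary requirements for future VLM trust studies.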

Details

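Motivation: The rapid adoption of vision language models (VLMs) calls for protecting users and informing them about when these systems can be trusted. Method: Review related studies through a multi-disciplinary taxonomy (covering cognitive science capabilities, collaboration modes, and agent behaviours), complemented by findings from a workshop with prospective users. Result: Summarizes insights from the literature and the user workshop and derives preliminary requirements for future VLM trust studies. Conclusion: Future research should focus on trust dynamics in user-VLM interactions in order to meet practical needs. Abstract: The rapid adoption of Vision Language Models (VLMs), pre-trained on large image-text and video-text datasets, calls for protecting and informing users about when to trust these systems. This survey reviews studies on trust dynamics in user-VLM interactions, through a multi-disciplinary taxonomy encompassing different cognitive science capabilities, collaboration modes, and agent behaviours. Literature insights and findings from a workshop with prospective VLM users inform preliminary requirements for future VLM trust studies.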

[61] Feature-Augmented Deep Networks for Multiscale Building Segmentation in High-Resolution UAV and Satellite Imagery

Chintan B. Maniyar,Minakshi Kumar,Gengchen Mai

Main category: cs.CV

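TL;DR: Proposes a deep learning framework for multiscale building segmentation that combines feature augmentation with optimized training strategies and significantly improves building segmentation accuracy in RGB imagery.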

Details

Motivation: Building segmentation in high-resolution RGB imagery is hampered by spectral similarity, shadows, and irregular geometries, and needs a more effective solution. Method: Use a multi-sensor dataset, augment the inputs with features such as PCA, VDVI, MBI, and Sobel edge filters, and combine a Res-U-Net architecture with optimized training strategies (layer freezing, cyclical learning rates, and SuperConvergence). Result: On a WorldView-3 test image, the model reaches 96.5% overall accuracy, an F1-score of 0.86, and an IoU of 0.80, outperforming existing RGB-based benchmarks. Conclusion: Combining multi-resolution imagery, feature augmentation, and optimized training strategies significantly improves the robustness of building segmentation in remote sensing applications. Abstract: Accurate building segmentation from high-resolution RGB imagery remains challenging due to spectral similarity with non-building features, shadows, and irregular building geometries. In this study, we present a comprehensive deep learning framework for multiscale building segmentation using RGB aerial and satellite imagery with spatial resolutions ranging from 0.4m to 2.7m. We curate a diverse, multi-sensor dataset and introduce feature-augmented inputs by deriving secondary representations including Principal Component Analysis (PCA), Visible Difference Vegetation Index (VDVI), Morphological Building Index (MBI), and Sobel edge filters from RGB channels. These features guide a Res-U-Net architecture in learning complex spatial patterns more effectively. We also propose training policies incorporating layer freezing, cyclical learning rates, and SuperConvergence to reduce training time and resource usage. Evaluated on a held-out WorldView-3 image, our model achieves an overall accuracy of 96.5%, an F1-score of 0.86, and an Intersection over Union (IoU) of 0.80, outperforming existing RGB-based benchmarks. This study demonstrates the effectiveness of combining multi-resolution imagery, feature augmentation, and optimized training strategies for robust building segmentation in remote sensing applications.
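Two ingredients named in the building segmentation abstract can be written down directly: the VDVI feature derived from RGB channels, and the accuracy, F1, and IoU metrics used for evaluation. The sketch below uses the commonly cited VDVI definition, (2G - R - B) / (2G + R + B), and standard binary-mask metrics; whether the paper computes them in exactly this way (for example per class or per image) is an assumption, and the function names are hypothetical.

```python
def vdvi(r, g, b):
    """Visible-band Difference Vegetation Index for one RGB pixel.

    Uses the commonly cited form (2G - R - B) / (2G + R + B); returns 0.0
    for a black pixel to avoid division by zero.
    """
    denom = 2 * g + r + b
    return 0.0 if denom == 0 else (2 * g - r - b) / denom


def binary_metrics(pred, truth):
    """Accuracy, F1, and IoU for two equally long binary masks (flat lists)."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("masks must be non-empty and of equal length")
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    tn = len(pred) - tp - fp - fn
    union = tp + fp + fn
    return {
        "accuracy": (tp + tn) / len(pred),
        "f1": 2 * tp / (2 * tp + fp + fn) if union else 1.0,
        "iou": tp / union if union else 1.0,
    }


if __name__ == "__main__":
    print(vdvi(50, 200, 50))  # a strongly green (vegetation-like) pixel
    print(binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

A feature such as VDVI is useful here because vegetation scores high while roofs and roads score near zero, which gives the network an extra channel that separates buildings from spectrally similar green areas.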

[62] Aesthetics Without Semantics

C. Alejandro Parraga,Olivier Penacchio,Marcos Muňoz Gonzalez,Bogdan Raducanu,Xavier Otazu

Main category: cs.CV

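TL;DR: By creating a Minimum Semantic Content (MSC) database, the paper addresses the bias towards beautiful images in existing aesthetics research and shows how ugly images change the observed relationship between image features and aesthetic ratings.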

Details

Motivation: Existing aesthetics databases are biased towards beautiful images, which limits a complete study of aesthetic judgement, in particular the role of ugly images. Method: Build the MSC database of 10,426 images (balanced between beautiful and ugly), each rated by 100 observers, and use image metrics to analyze the relationship with aesthetics. Result: Once ugly images are added, the relationship between image features and aesthetic ratings can be modified or even inverted. Conclusion: Ignoring ugly images in aesthetics research can lead to misreading the relationship between image content and aesthetic judgement or to missing important effects. Abstract: While it is easy for human observers to judge an image as beautiful or ugly, aesthetic decisions result from a combination of entangled perceptual and cognitive (semantic) factors, making the understanding of aesthetic judgements particularly challenging from a scientific point of view. Furthermore, our research shows a prevailing bias in current databases, which include mostly beautiful images, further complicating the study and prediction of aesthetic responses. We address these limitations by creating a database of images with minimal semantic content and devising, and next exploiting, a method to generate images on the ugly side of aesthetic valuations. The resulting Minimum Semantic Content (MSC) database consists of a large and balanced collection of 10,426 images, each evaluated by 100 observers. We next use established image metrics to demonstrate how augmenting an image set biased towards beautiful images with ugly images can modify, or even invert, an observed relationship between image features and aesthetics valuation. Taken together, our study reveals that works in empirical aesthetics attempting to link image content and aesthetic judgements may magnify, underestimate, or simply miss interesting effects due to a limitation of the range of aesthetic values they consider.

[63] Progressive Inertial Poser: Progressive Real-Time Kinematic Chain Estimation for 3D Full-Body Pose from Three IMU Sensors

Zunjie Zhu,Yan Zhao,Yihan Hu,Guoxiang Wang,Hai Qiu,Bolun Zheng,Chenggang Yan,Feng Xu

Main category: cs.CV

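TL;DR: Proposes ProgIP, a full-body pose estimation method that uses only three IMU sensors (head and wrists) and combines neural networks with a human dynamics model, outperforming comparable methods.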

Details

Motivation: Remove the need for additional sensors or external visual sensors in conventional full-body motion capture systems and make the technology more practical for virtual reality applications. Method: ProgIP combines neural network estimation with a human dynamics model and uses multi-stage progressive network estimation; the encoder uses TE-biLSTM to capture temporal dependencies, and the MLP-based decoder maps features to SMPL model parameters. Result: Quantitative and qualitative experiments on multiple public datasets show that the method outperforms state-of-the-art methods with the same inputs and is comparable to recent work using six IMU sensors. Conclusion: ProgIP significantly reduces hardware complexity while achieving accurate real-time full-body motion capture. Abstract: The motion capture system that supports full-body virtual representation is of key significance for virtual reality. Compared to vision-based systems, full-body pose estimation from sparse tracking signals is not limited by environmental conditions or recording range. However, previous works either face the challenge of wearing additional sensors on the pelvis and lower-body or rely on external visual sensors to obtain global positions of key joints. To improve the practicality of the technology for virtual reality applications, we estimate full-body poses using only inertial data obtained from three Inertial Measurement Unit (IMU) sensors worn on the head and wrists, thereby reducing the complexity of the hardware system. In this work, we propose a method called Progressive Inertial Poser (ProgIP) for human pose estimation, which combines neural network estimation with a human dynamics model, considers the hierarchical structure of the kinematic chain, and employs a multi-stage progressive network estimation with increased depth to reconstruct full-body motion in real time. The encoder combines Transformer Encoder and bidirectional LSTM (TE-biLSTM) to flexibly capture the temporal dependencies of the inertial sequence, while the decoder based on multi-layer perceptrons (MLPs) transforms high-dimensional features and accurately projects them onto Skinned Multi-Person Linear (SMPL) model parameters. Quantitative and qualitative experimental results on multiple public datasets show that our method outperforms state-of-the-art methods with the same inputs, and is comparable to recent works using six IMU sensors.

[64] Hearing and Seeing Through CLIP: A Framework for Self-Supervised Sound Source Localization

Sooyoung Park,Arda Senocak,Joon Son Chung

Main category: cs.CV

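TL;DR: The paper proposes a CLIP-based self-supervised method for sound source localization that needs no explicit text input and aligns audio and vision through audio-driven embeddings and contrastive learning.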

Details

Motivation: Exploit the multimodal alignment of large-scale vision-language models such as CLIP and extend it to sound source localization, so that localizations become more complete and compact. Method: Map audio into tokens compatible with CLIP's text encoder to produce audio-driven embeddings, and align audio and visual features through contrastive learning. An LLM-guided extension further strengthens the model's understanding of audio-visual scenes. Result: Experiments on five tasks show that the method, in all variants, outperforms prior work and generalizes strongly in zero-shot settings. Conclusion: The alignment knowledge of pre-trained multimodal foundation models significantly improves sound source localization, and the LLM-guided extension further improves scene understanding. Abstract: Large-scale vision-language models demonstrate strong multimodal alignment and generalization across diverse tasks. Among them, CLIP stands out as one of the most successful approaches. In this work, we extend the application of CLIP to sound source localization, proposing a self-supervised method that operates without explicit text input. We introduce a framework that maps audio into tokens compatible with CLIP's text encoder, producing audio-driven embeddings. These embeddings are used to generate sounding region masks, from which visual features are extracted and aligned with the audio embeddings through a contrastive audio-visual correspondence objective. Our findings show that the alignment knowledge of a pre-trained multimodal foundation model enables our method to generate more complete and compact localization for sounding objects. We further propose an LLM-guided extension that distills object-aware audio-visual scene understanding into the model during training to enhance alignment. Extensive experiments across five diverse tasks demonstrate that our method, in all variants, outperforms state-of-the-art approaches and achieves strong generalization in zero-shot settings.
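The "contrastive audio-visual correspondence objective" mentioned above is not spelled out in the abstract. A common choice for this kind of alignment is an InfoNCE-style loss over cosine similarities, sketched below in one direction (audio to visual). Treat the loss form, the temperature of 0.07, and the function names as assumptions rather than the paper's exact objective.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(audio_embs, visual_embs, temperature=0.07):
    """Mean audio-to-visual InfoNCE loss.

    audio_embs[i] and visual_embs[i] are assumed to come from the same clip
    (the positive pair); every other visual embedding in the batch acts as
    a negative.
    """
    losses = []
    for i, audio in enumerate(audio_embs):
        logits = [cosine(audio, v) / temperature for v in visual_embs]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]      # hypothetical audio-driven embeddings
    matched = [[1.0, 0.0], [0.0, 1.0]]    # visual features from the right regions
    mismatched = [[0.0, 1.0], [1.0, 0.0]]
    print(info_nce(audio, matched), info_nce(audio, mismatched))
```

The loss is small when each audio embedding is closest to the visual feature of its own clip and grows when the pairing is wrong, which is the signal that pushes the predicted sounding-region masks towards the true source.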

[65] Joint Super-Resolution and Segmentation for 1-m Impervious Surface Area Mapping in China's Yangtze River Economic Belt

Jie Deng,Danfeng Hong,Chenyu Li,Naoto Yokoya

Main category: cs.CV

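TL;DR: Proposes JointSeg, a new framework that combines super-resolution and segmentation to generate 1-meter impervious surface area (ISA) maps directly from Sentinel-2 imagery. The method performs well across complex terrain and urban-rural patterns, and the resulting product, ISA-1, clearly outperforms other benchmark products in quantitative comparisons.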

Details

Motivation: Producing high-resolution ISA maps with traditional approaches is costly and does not scale; JointSeg offers an affordable alternative while addressing the challenges of multi-resolution inputs and cross-scale feature fusion. Method: JointSeg is trained on multimodal cross-resolution inputs, gradually enhances resolution from 10m to 1m while preserving spatial texture detail, and relies on cross-scale feature fusion to keep classification accurate. Result: Applied to the Yangtze River Economic Belt (YREB), ISA-1 reaches an F1-score of 85.71%, outperforming other methods. It performs well in both urban and mountainous areas, reducing overestimation in urban regions and improving detection of fragmented features in mountainous regions. Conclusion: JointSeg is an efficient and robust method for ISA mapping in complex landscapes and captures urbanization dynamics from 2017 to 2023. Abstract: We propose a novel joint framework by integrating super-resolution and segmentation, called JointSeg, which enables the generation of 1-meter ISA maps directly from freely available Sentinel-2 imagery. JointSeg was trained on multimodal cross-resolution inputs, offering a scalable and affordable alternative to traditional approaches. This synergistic design enables gradual resolution enhancement from 10m to 1m while preserving fine-grained spatial textures, and ensures high classification fidelity through effective cross-scale feature fusion. This method has been successfully applied to the Yangtze River Economic Belt (YREB), a region characterized by complex urban-rural patterns and diverse topography. As a result, a comprehensive ISA mapping product for 2021, referred to as ISA-1, was generated, covering an area of over 2.2 million square kilometers. Quantitative comparisons against the 10m ESA WorldCover and other benchmark products reveal that ISA-1 achieves an F1-score of 85.71%, outperforming bilinear-interpolation-based segmentation by 9.5%, and surpassing other ISA datasets by 21.43%-61.07%. In densely urbanized areas (e.g., Suzhou, Nanjing), ISA-1 reduces ISA overestimation through improved discrimination of green spaces and water bodies. Conversely, in mountainous regions (e.g., Ganzi, Zhaotong), it identifies significantly more ISA due to its enhanced ability to detect fragmented anthropogenic features such as rural roads and sparse settlements, demonstrating its robustness across diverse landscapes. Moreover, we present biennial ISA maps from 2017 to 2023, capturing spatiotemporal urbanization dynamics across representative cities. The results highlight distinct regional growth patterns: rapid expansion in upstream cities, moderate growth in midstream regions, and saturation in downstream metropolitan areas.

[66] Threshold Modulation for Online Test-Time Adaptation of Spiking Neural Networks

Kejie Zhao,Wenjia Hua,Aiersi Tuerhong,Luziwei Leng,Yuxin Ma,Qinghua Guo

Main category: cs.CV

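TL;DR: Proposes Threshold Modulation (TM), a low-power online test-time adaptation framework for spiking neural networks (SNNs) that dynamically adjusts firing thresholds to improve generalization under distribution shifts.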

Details

Motivation: SNNs need to adapt to distribution shifts after deployment, and existing methods are not well suited to them. Method: Propose Threshold Modulation (TM), which dynamically adjusts the firing threshold through neuronal dynamics-inspired normalization. Result: Experiments on benchmark datasets confirm the method's effectiveness, improving the robustness of SNNs at low computational cost. Conclusion: TM offers a practical solution for online test-time adaptation of SNNs and provides inspiration for the design of future neuromorphic chips. Abstract: Recently, spiking neural networks (SNNs), deployed on neuromorphic chips, provide highly efficient solutions on edge devices in different scenarios. However, their ability to adapt to distribution shifts after deployment has become a crucial challenge. Online test-time adaptation (OTTA) offers a promising solution by enabling models to dynamically adjust to new data distributions without requiring source data or labeled target samples. Nevertheless, existing OTTA methods are largely designed for traditional artificial neural networks and are not well-suited for SNNs. To address this gap, we propose a low-power, neuromorphic chip-friendly online test-time adaptation framework, aiming to enhance model generalization under distribution shifts. The proposed approach is called Threshold Modulation (TM), which dynamically adjusts the firing threshold through neuronal dynamics-inspired normalization, being more compatible with neuromorphic hardware. Experimental results on benchmark datasets demonstrate the effectiveness of this method in improving the robustness of SNNs against distribution shifts while maintaining low computational cost. The proposed method offers a practical solution for online test-time adaptation of SNNs, providing inspiration for the design of future neuromorphic chips. The demo code is available at github.com/NneurotransmitterR/TM-OTTA-SNN.
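To make the idea of adapting a firing threshold concrete, here is a toy leaky integrate-and-fire (LIF) neuron in which a rescaled threshold compensates for a shift in input scale. This is not the TM algorithm from the paper, whose normalization rule is not given in the abstract; the rescaling rule in `modulated_threshold`, the leak value, and all names are hypothetical and chosen only for illustration.

```python
def lif_spike_count(inputs, threshold=1.0, leak=0.9):
    """Count spikes of a leaky integrate-and-fire neuron with hard reset."""
    membrane = 0.0
    spikes = 0
    for current in inputs:
        membrane = leak * membrane + current
        if membrane >= threshold:
            spikes += 1
            membrane = 0.0  # hard reset after a spike
    return spikes


def modulated_threshold(base_threshold, reference_mean, observed_mean):
    """Hypothetical rule: scale the threshold by the ratio of the observed
    mean input to the mean input seen on the source distribution."""
    if reference_mean <= 0:
        raise ValueError("reference_mean must be positive")
    return base_threshold * observed_mean / reference_mean


if __name__ == "__main__":
    source = [0.5] * 20   # input drive on the source distribution
    shifted = [1.0] * 20  # the same signal after a 2x scale shift
    adapted = modulated_threshold(1.0, reference_mean=0.5, observed_mean=1.0)
    print(lif_spike_count(source))
    print(lif_spike_count(shifted))
    print(lif_spike_count(shifted, threshold=adapted))
```

In this toy setting the shifted input makes the unadapted neuron fire far more often, while the rescaled threshold restores the original firing pattern without touching any weights, which is the kind of lightweight adjustment that suits neuromorphic hardware.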

[67] GeomHair: Reconstruction of Hair Strands from Colorless 3D Scans

Rachmadio Noval Lazuardi,Artem Sevastopolsky,Egor Zakharov,Matthias Niessner,Vanessa Sklyarova

Main category: cs.CV

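TL;DR: Proposes a new method that reconstructs hair strands directly from colorless 3D scans using multi-modal hair orientation extraction, removing the dependence of existing methods on RGB data.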

Details

Motivation: Hair strand reconstruction is a fundamental problem in computer vision and graphics, used for high-fidelity digital avatar synthesis, animation, and AR/VR applications. Existing methods rely on RGB data, which is sensitive to the environment and makes it hard to extract orientation for complex hairstyles. Method: Estimate hair orientation from sharp surface features on the scan and a neural 2D line detector, combined with a diffusion prior trained on synthetic data, an improved noise schedule, and adaptation to the reconstructed content through a scan-specific text prompt. Result: The method accurately reconstructs both simple and intricate hairstyles without relying on color information. Conclusion: Introduces the Strands400 dataset, containing reconstructed hair strands from 400 subjects, to facilitate further research. Abstract: We propose a novel method that reconstructs hair strands directly from colorless 3D scans by leveraging multi-modal hair orientation extraction. Hair strand reconstruction is a fundamental problem in computer vision and graphics that can be used for high-fidelity digital avatar synthesis, animation, and AR/VR applications. However, accurately recovering hair strands from raw scan data remains challenging due to human hair's complex and fine-grained structure. Existing methods typically rely on RGB captures, which can be sensitive to the environment and can be a challenging domain for extracting the orientation of guiding strands, especially in the case of challenging hairstyles. To reconstruct the hair purely from the observed geometry, our method finds sharp surface features directly on the scan and estimates strand orientation through a neural 2D line detector applied to the renderings of scan shading. Additionally, we incorporate a diffusion prior trained on a diverse set of synthetic hair scans, refined with an improved noise schedule, and adapted to the reconstructed contents via a scan-specific text prompt. We demonstrate that this combination of supervision signals enables accurate reconstruction of both simple and intricate hairstyles without relying on color information. To facilitate further research, we introduce Strands400, the largest publicly available dataset of hair strands with detailed surface geometry extracted from real-world data, which contains reconstructed hair strands from the scans of 400 subjects.

[68] EDmamba: A Simple yet Effective Event Denoising Method with State Space Model

Ciyu Ruan,Zihang Gong,Ruishan Guo,Jingao Xu,Xinlei Chen

Main category: cs.CV

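TL;DR: Proposes an event denoising framework based on State Space Models (SSMs) whose spatial and temporal Mamba modules process the spatiotemporal features of event clouds efficiently, achieving accurate, low-latency denoising.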

Details

Motivation: Event cameras excel at high-speed vision, but their output is noisy, and existing denoising methods are caught between computational efficiency and robustness. Method: Represent events as 4D event clouds, extract geometric and polarity-aware features with a Coarse Feature Extraction (CFE) module, and model local geometric structure and global temporal dynamics with a Spatial Mamba (S-SSM) and a Temporal Mamba (T-SSM). Result: The model has 88.89K parameters, takes 0.0685s per 100K events at inference, and reaches a denoising accuracy of 0.982, which is 2.08% better than Transformer-based methods while running 36X faster. Conclusion: The method delivers efficient, robust denoising while preserving the high-speed advantage of event cameras, offering a new direction for real-time processing. Abstract: Event cameras excel in high-speed vision due to their high temporal resolution, high dynamic range, and low power consumption. However, as dynamic vision sensors, their output is inherently noisy, making efficient denoising essential to preserve their ultra-low latency and real-time processing capabilities. Existing event denoising methods struggle with a critical dilemma: computationally intensive approaches compromise the sensor's high-speed advantage, while lightweight methods often lack robustness across varying noise levels. To address this, we propose a novel event denoising framework based on State Space Models (SSMs). Our approach represents events as 4D event clouds and includes a Coarse Feature Extraction (CFE) module that extracts embedding features from both geometric and polarity-aware subspaces. The model is further composed of two essential components: a Spatial Mamba (S-SSM) that models local geometric structures and a Temporal Mamba (T-SSM) that captures global temporal dynamics, efficiently propagating spatiotemporal features across events. Experiments demonstrate that our method achieves state-of-the-art accuracy and efficiency, with 88.89K parameters, 0.0685s per 100K events inference time, and a 0.982 accuracy score, outperforming Transformer-based methods by 2.08% in denoising accuracy while running 36X faster.

[69] TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation

Haokun Lin,Teng Wang,Yixiao Ge,Yuying Ge,Zhichao Lu,Ying Wei,Qingfu Zhang,Zhenan Sun,Ying Shan

Main category: cs.CV

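TL;DR: TokLIP is a visual tokenizer that semanticizes VQ tokens and incorporates CLIP-level semantics, addressing the high computational overhead and limited comprehension performance of token-based multimodal unification.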

Details

Motivation: Existing methods such as Chameleon and Emu3 suffer from high computational overhead and limited comprehension performance in multimodal unification; TokLIP aims to address both by semanticizing VQ tokens and incorporating high-level semantics. Method: TokLIP combines a low-level discrete VQ tokenizer with a ViT-based token encoder and disentangles the training objectives for comprehension and generation, with no need for tailored quantization operations. Result: TokLIP shows strong data efficiency and semantic understanding while also improving generative capacity, making it suitable for autoregressive Transformers in both comprehension and generation tasks. Conclusion: By semanticizing VQ tokens and incorporating high-level semantics, TokLIP substantially improves performance and efficiency on multimodal tasks. Abstract: Pioneering token-based works such as Chameleon and Emu3 have established a foundation for multimodal unification but face challenges of high training computational overhead and limited comprehension performance due to a lack of high-level semantics. In this paper, we introduce TokLIP, a visual tokenizer that enhances comprehension by semanticizing vector-quantized (VQ) tokens and incorporating CLIP-level semantics while enabling end-to-end multimodal autoregressive training with standard VQ tokens. TokLIP integrates a low-level discrete VQ tokenizer with a ViT-based token encoder to capture high-level continuous semantics. Unlike previous approaches (e.g., VILA-U) that discretize high-level features, TokLIP disentangles training objectives for comprehension and generation, allowing the direct application of advanced VQ tokenizers without the need for tailored quantization operations. Our empirical results demonstrate that TokLIP achieves exceptional data efficiency, empowering visual tokens with high-level semantic understanding while enhancing low-level generative capacity, making it well-suited for autoregressive Transformers in both comprehension and generation tasks. The code and models are available at https://github.com/TencentARC/TokLIP.
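
For readers unfamiliar with vector-quantized (VQ) tokens, the following sketch shows the generic nearest-codebook lookup that turns continuous feature vectors into discrete token ids. It is a textbook illustration in plain Python, not TokLIP's tokenizer; the names `quantize` and `tokenize` and the toy codebook are assumptions.

```python
from typing import List, Sequence


def quantize(vector: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to `vector` (squared L2 distance)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def tokenize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature vector (e.g. one per image patch) to a discrete token id."""
    return [quantize(f, codebook) for f in features]
```

The resulting integer ids are what an autoregressive Transformer can predict with ordinary next-token training, which is why the abstract stresses compatibility with standard VQ tokens.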

[70] PillarMamba: Learning Local-Global Context for Roadside Point Cloud via Hybrid State Space Model

Zhang Zhang,Chao Sun,Chao Yue,Da Wen,Tianze Wang,Jianghao Leng

Main category: cs.CV

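TL;DR: This paper proposes PillarMamba, a framework based on the Cross-stage State-space Group (CSG) for 3D object detection on roadside point clouds. Its Hybrid State-space Block (HSB) addresses disrupted local connections and forgotten historical relationships, and the method performs well on the DAIR-V2X-I benchmark.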

Details

Motivation: 3D object detection on roadside point clouds is valuable for intelligent transport systems and vehicle-to-everything tasks, but existing methods fall short in global receptive field and in their use of scene context. Method: The PillarMamba framework combines CSG and HSB modules, strengthening local-global context modeling through local convolution and residual attention. Result: The method outperforms existing approaches on the DAIR-V2X-I benchmark. Conclusion: By improving the state space model, PillarMamba raises detection performance on roadside point clouds while offering efficient computation and strong expressiveness. Abstract: Serving the Intelligent Transport System (ITS) and Vehicle-to-Everything (V2X) tasks, roadside perception has received increasing attention in recent years, as it can extend the perception range of connected vehicles and improve traffic safety. However, roadside point cloud oriented 3D object detection has not been effectively explored. To some extent, the key to the performance of a point cloud detector lies in the receptive field of the network and the ability to effectively utilize the scene context. The recent emergence of Mamba, based on State Space Model (SSM), has shaken up the traditional convolution and transformers that have long been the foundational building blocks, due to its efficient global receptive field. In this work, we introduce Mamba to pillar-based roadside point cloud perception and propose a framework based on Cross-stage State-space Group (CSG), called PillarMamba. It enhances the expressiveness of the network and achieves efficient computation through cross-stage feature fusion. However, due to the limitations of scan directions, state space models suffer from disrupted local connections and forgotten historical relationships. To address this, we propose the Hybrid State-space Block (HSB) to obtain the local-global context of roadside point cloud. Specifically, it enhances neighborhood connections through local convolution and preserves historical memory through residual attention. The proposed method outperforms the state-of-the-art methods on the popular large scale roadside benchmark: DAIR-V2X-I. The code will be released soon.

[71] Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding

Han Xiao,Yina Xie,Guanxin Tan,Yinghao Chen,Rui Hu,Ke Wang,Aojun Zhou,Hao Li,Hao Shao,Xudong Lu,Peng Gao,Yafei Wen,Xiaoxin Chen,Shuai Ren,Hongsheng Li

Main category: cs.CV

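TL;DR: The paper proposes a pipeline that generates markup languages to build structured document representations and introduces two fine-grained datasets, significantly improving visual document understanding.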

Details

Motivation: Existing models for visual document understanding perform poorly because training data lacks contextual information and documents have complex layouts. Method: Adaptive generation of markup languages (such as Markdown and JSON) builds structured document representations; the DocMark-Pile and DocMark-Instruct datasets are introduced. Result: Experiments show that the method outperforms existing models on multiple visual document understanding benchmarks. Conclusion: The method improves document understanding in complex visual scenarios; code and models are open source. Abstract: Visual Document Understanding has become essential with the increase of text-rich visual content. This field poses significant challenges due to the need for effective integration of visual perception and textual comprehension, particularly across diverse document types with complex layouts. Moreover, existing fine-tuning datasets for this domain often fall short in providing the detailed contextual information for robust understanding, leading to hallucinations and limited comprehension of spatial relationships among visual elements. To address these challenges, we propose an innovative pipeline that utilizes adaptive generation of markup languages, such as Markdown, JSON, HTML, and TikZ, to build highly structured document representations and deliver contextually-grounded responses. We introduce two fine-grained structured datasets: DocMark-Pile, comprising approximately 3.8M pretraining data pairs for document parsing, and DocMark-Instruct, featuring 624k fine-tuning data annotations for grounded instruction following. Extensive experiments demonstrate that our proposed model significantly outperforms existing state-of-the-art MLLMs across a range of visual document understanding benchmarks, facilitating advanced reasoning and comprehension capabilities in complex visual scenarios. Our code and models are released at https://github.com/Euphoria16/DocMark.

[72] StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant

Haibo Wang,Bo Feng,Zhengfeng Lai,Mingze Xu,Shiyu Li,Weifeng Ge,Afshin Dehghan,Meng Cao,Ping Huang

Main category: cs.CV

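TL;DR: StreamBridge is a simple and effective framework that turns offline Video-LLMs into streaming-capable models, addressing the challenges of multi-turn real-time understanding and proactive response.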

Details

Motivation: In online scenarios, existing models have limited multi-turn real-time understanding and lack proactive response mechanisms. Method: StreamBridge combines a memory buffer with a round-decayed compression strategy to support long-context multi-turn interaction, and uses a decoupled, lightweight activation model to enable continuous proactive responses. Result: Experiments show that StreamBridge significantly improves the streaming understanding of offline Video-LLMs, outperforming GPT-4o and Gemini 1.5 Pro, while also performing well on standard video understanding tasks. Conclusion: StreamBridge offers an efficient route to streaming for Video-LLMs, with strong performance and adaptability. Abstract: We present StreamBridge, a simple yet effective framework that seamlessly transforms offline Video-LLMs into streaming-capable models. It addresses two fundamental challenges in adapting existing models into online scenarios: (1) limited capability for multi-turn real-time understanding, and (2) lack of proactive response mechanisms. Specifically, StreamBridge incorporates (1) a memory buffer combined with a round-decayed compression strategy, supporting long-context multi-turn interactions, and (2) a decoupled, lightweight activation model that can be effortlessly integrated into existing Video-LLMs, enabling continuous proactive responses. To further support StreamBridge, we construct Stream-IT, a large-scale dataset tailored for streaming video understanding, featuring interleaved video-text sequences and diverse instruction formats. Extensive experiments show that StreamBridge significantly improves the streaming understanding capabilities of offline Video-LLMs across various tasks, outperforming even proprietary models such as GPT-4o and Gemini 1.5 Pro. Simultaneously, it achieves competitive or superior performance on standard video understanding benchmarks.

[73] SITE: towards Spatial Intelligence Thorough Evaluation

Wenqi Wang,Reuben Tan,Pengyue Zhu,Jianwei Yang,Zhengyuan Yang,Lijuan Wang,Andrey Kolobov,Jianfeng Gao,Boqing Gong

Main category: cs.CV

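TL;DR: SITE is a standardized multi-choice visual question-answering benchmark for evaluating the spatial intelligence of large vision-language models, covering multiple visual modalities and spatial intelligence factors. Experiments show that leading models lag behind human experts in basic abilities such as spatial orientation, and that spatial reasoning ability correlates positively with performance on an embodied AI task.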

Details

Motivation: Spatial intelligence (SI) matters across many disciplines but lacks a standardized evaluation method. SITE aims to fill this gap with a tool for thorough SI evaluation. Method: The SITE dataset was designed by combining a survey of 31 existing datasets with classification systems from cognitive science, and it includes new task types. Result: Experiments show that leading models perform poorly on basic abilities such as spatial orientation, and that spatial reasoning ability correlates positively with performance on an embodied AI task. Conclusion: SITE provides a standardized tool for evaluating spatial intelligence, exposes model weaknesses in basic spatial abilities, and demonstrates the link between spatial reasoning and embodied AI tasks. Abstract: Spatial intelligence (SI) represents a cognitive ability encompassing the visualization, manipulation, and reasoning about spatial relationships, underpinning disciplines from neuroscience to robotics. We introduce SITE, a benchmark dataset towards SI Thorough Evaluation in a standardized format of multi-choice visual question-answering, designed to assess large vision-language models' spatial intelligence across diverse visual modalities (single-image, multi-image, and video) and SI factors (figural to environmental scales, spatial visualization and orientation, intrinsic and extrinsic, static and dynamic). Our approach to curating the benchmark combines a bottom-up survey about 31 existing datasets and a top-down strategy drawing upon three classification systems in cognitive science, which prompt us to design two novel types of tasks about view-taking and dynamic scenes. Extensive experiments reveal that leading models fall behind human experts especially in spatial orientation, a fundamental SI factor. Moreover, we demonstrate a positive correlation between a model's spatial reasoning proficiency and its performance on an embodied AI task.

[74] Generating Physically Stable and Buildable LEGO Designs from Text

Ava Pun,Kangle Deng,Ruixuan Liu,Deva Ramanan,Changliu Liu,Jun-Yan Zhu

Main category: cs.CV

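TL;DR: LegoGPT is the first approach to generate physically stable LEGO models from text prompts. It combines a large-scale dataset with a language model and improves design stability through physical validity checks and constraints.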

Details

Motivation: Address the challenge of generating physically stable LEGO models from text while producing diverse and aesthetically pleasing designs. Method: Build a large-scale dataset of physically stable designs, train an autoregressive language model, and improve the designs with validity checks and a physics-aware rollback mechanism. Result: The method generates stable, diverse, and aesthetically pleasing LEGO designs that can be assembled by humans and by robots; the dataset and code are released. Conclusion: LegoGPT succeeds at generating physically stable LEGO models from text, providing a new tool for creative design and automated assembly. Abstract: We introduce LegoGPT, the first approach for generating physically stable LEGO brick models from text prompts. To achieve this, we construct a large-scale, physically stable dataset of LEGO designs, along with their associated captions, and train an autoregressive large language model to predict the next brick to add via next-token prediction. To improve the stability of the resulting designs, we employ an efficient validity check and physics-aware rollback during autoregressive inference, which prunes infeasible token predictions using physics laws and assembly constraints. Our experiments show that LegoGPT produces stable, diverse, and aesthetically pleasing LEGO designs that align closely with the input text prompts. We also develop a text-based LEGO texturing method to generate colored and textured designs. We show that our designs can be assembled manually by humans and automatically by robotic arms. We also release our new dataset, StableText2Lego, containing over 47,000 LEGO structures of over 28,000 unique 3D objects accompanied by detailed captions, along with our code and models at the project website: https://avalovelace1.github.io/LegoGPT/.
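
The "validity check plus rollback" loop described above can be sketched with a toy model. Everything here is a simplifying assumption made for illustration: bricks are unit-height boxes `(x, y, z, width, depth)` on a small grid, a fixed candidate list stands in for samples from the language model, and a crude "rests on something" rule stands in for the paper's physics analysis.

```python
from typing import List, Optional, Tuple

Brick = Tuple[int, int, int, int, int]  # (x, y, z, width, depth), height fixed at 1
GRID = 20                               # hypothetical build-plate size


def footprints_overlap(a: Brick, b: Brick) -> bool:
    """True if the two bricks overlap when projected onto the build plate."""
    return (a[0] < b[0] + b[3] and b[0] < a[0] + a[3]
            and a[1] < b[1] + b[4] and b[1] < a[1] + a[4])


def is_valid(brick: Brick, placed: List[Brick]) -> bool:
    """Validity check: the brick is inside the grid and collides with no placed brick."""
    x, y, z, w, d = brick
    in_bounds = x >= 0 and y >= 0 and z >= 0 and x + w <= GRID and y + d <= GRID
    collides = any(b[2] == z and footprints_overlap(brick, b) for b in placed)
    return in_bounds and not collides


def first_unsupported(placed: List[Brick]) -> Optional[int]:
    """Toy stability rule: a brick must sit on the ground or on an earlier brick."""
    for i, brick in enumerate(placed):
        on_ground = brick[2] == 0
        on_brick = any(b[2] == brick[2] - 1 and footprints_overlap(brick, b)
                       for b in placed[:i])
        if not (on_ground or on_brick):
            return i
    return None


def build(candidates: List[Brick], max_bricks: int) -> List[Brick]:
    """Accept candidates one by one, skipping invalid ones and rolling back unstable ones."""
    placed: List[Brick] = []
    for brick in candidates:
        if len(placed) >= max_bricks:
            break
        if not is_valid(brick, placed):
            continue                    # prune an infeasible proposal
        placed.append(brick)
        bad = first_unsupported(placed)
        if bad is not None:
            placed = placed[:bad]       # roll back to the last stable prefix
    return placed
```

In a real system the next candidate would be resampled from the model after a rejection or rollback, and the stability test would be an actual structural analysis rather than this support rule.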

[75] Flow-GRPO: Training Flow Matching Models via Online RL

Jie Liu,Gongye Liu,Jiajun Liang,Yangguang Li,Jiaheng Liu,Xintao Wang,Pengfei Wan,Di Zhang,Wanli Ouyang

Main category: cs.CV

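TL;DR: Flow-GRPO integrates online reinforcement learning into flow matching models, using an ODE-to-SDE conversion and a Denoising Reduction strategy to improve sampling efficiency and performance; it performs strongly on text-to-image tasks.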

Details

Motivation: Bring reinforcement learning to flow matching models to improve accuracy and diversity on generation tasks. Method: (1) An ODE-to-SDE conversion enables statistical sampling; (2) a Denoising Reduction strategy improves training efficiency. Result: Accuracy rises markedly on complex composition and visual text rendering tasks (for example, GenEval from 63% to 95%) without sacrificing image quality or diversity. Conclusion: Flow-GRPO performs strongly on generation tasks while preserving image quality and diversity. Abstract: We propose Flow-GRPO, the first method integrating online reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential Equation (SDE) that matches the original model's marginal distribution at all timesteps, enabling statistical sampling for RL exploration; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original inference timestep number, significantly improving sampling efficiency without performance degradation. Empirically, Flow-GRPO is effective across multiple text-to-image tasks. For complex compositions, RL-tuned SD3.5 generates nearly perfect object counts, spatial relations, and fine-grained attributes, boosting GenEval accuracy from $63\%$ to $95\%$. In visual text rendering, its accuracy improves from $59\%$ to $92\%$, significantly enhancing text generation. Flow-GRPO also achieves substantial gains in human preference alignment. Notably, little to no reward hacking occurred, meaning rewards did not increase at the cost of image quality or diversity, and both remained stable in our experiments.
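
The ODE-to-SDE idea can be illustrated with the standard marginal-preserving construction from the score-based modeling literature. This is the generic forward-time form, written under the assumption of a velocity field $v_\theta$, marginals $p_t$, and a free noise scale $\sigma_t \ge 0$; the paper's exact parameterization and time direction may differ.

```latex
% Deterministic sampler: an ODE driven by the learned velocity field
\mathrm{d}x_t = v_\theta(x_t, t)\,\mathrm{d}t

% A stochastic sampler with the same marginals p_t for any \sigma_t \ge 0
\mathrm{d}x_t = \Big[ v_\theta(x_t, t) + \tfrac{\sigma_t^2}{2}\,\nabla_x \log p_t(x_t) \Big]\mathrm{d}t
              + \sigma_t\,\mathrm{d}w_t

% Check via the Fokker-Planck equation: the extra drift cancels the diffusion term,
% since \nabla\cdot\big(p_t \nabla \log p_t\big) = \Delta p_t
\partial_t p_t
  = -\nabla\cdot\big(v_\theta\, p_t\big)
    - \tfrac{\sigma_t^2}{2}\,\nabla\cdot\big(p_t \nabla_x \log p_t\big)
    + \tfrac{\sigma_t^2}{2}\,\Delta p_t
  = -\nabla\cdot\big(v_\theta\, p_t\big)
```

Because both processes share the same marginals at every timestep, the injected noise provides the randomness that policy-gradient exploration needs without changing what the model generates in distribution.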

[76] Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation

Chao Liao,Liyang Liu,Xun Wang,Zhengxiong Luo,Xinyu Zhang,Wenliang Zhao,Jie Wu,Liang Li,Zhi Tian,Weilin Huang

Main category: cs.CV

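TL;DR: Mogao is a unified multimodal generation framework that produces interleaved text and images through a causal approach, combines the strengths of autoregressive and diffusion models, and is trained on a large-scale dataset, with strong results.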

Details

Motivation: Unified models have made notable progress in image understanding and generation, but most are limited to single-modal generation; Mogao aims to overcome this limit through interleaved multi-modal generation. Method: Mogao uses a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance, combining autoregressive and diffusion models. Result: Mogao reaches state-of-the-art performance in multi-modal understanding and text-to-image generation, produces high-quality interleaved outputs, and shows zero-shot image editing and compositional generation abilities. Conclusion: As an omni-modal foundation model, Mogao paves the way for future unified multi-modal systems. Abstract: Recent progress in unified models for image understanding and generation has been impressive, yet most approaches remain limited to single-modal generation conditioned on multiple modalities. In this paper, we present Mogao, a unified framework that advances this paradigm by enabling interleaved multi-modal generation through a causal approach. Mogao integrates a set of key technical improvements in architecture design, including a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance, which allow it to harness the strengths of both autoregressive models for text generation and diffusion models for high-quality image synthesis. These practical improvements also make Mogao particularly effective to process interleaved sequences of text and images arbitrarily. To further unlock the potential of unified models, we introduce an efficient training strategy on a large-scale, in-house dataset specifically curated for joint text and image generation. Extensive experiments show that Mogao not only achieves state-of-the-art performance in multi-modal understanding and text-to-image generation, but also excels in producing high-quality, coherent interleaved outputs. Its emergent capabilities in zero-shot image editing and compositional generation highlight Mogao as a practical omni-modal foundation model, paving the way for future development and scaling the unified multi-modal systems.

[77] DiffusionSfM: Predicting Structure and Motion via Ray Origin and Endpoint Diffusion

Qitao Zhao,Amy Lin,Jeff Tan,Jason Y. Zhang,Deva Ramanan,Shubham Tulsiani

Main category: cs.CV

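TL;DR: The paper proposes DiffusionSfM, a data-driven multi-view reasoning method that infers 3D scene geometry and camera poses directly from multi-view images and outperforms classical and learning-based approaches.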

Details

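Motivation: Current Structure-from-Motion (SfM) methods usually follow a two-stage pipeline that combines learned or geometric pairwise reasoning with a global optimization step. The authors aim to infer 3D scene geometry and camera poses directly with a data-driven method, simplifying the pipeline and improving performance. Method: The DiffusionSfM framework parameterizes scene geometry and camera poses as pixel-wise ray origins and endpoints in a global frame and predicts them with a transformer-based denoising diffusion model. Specialized mechanisms handle missing data and unbounded scene coordinates during training. Result: Experiments show that DiffusionSfM outperforms classical and learning-based approaches on both synthetic and real datasets and naturally models uncertainty. Conclusion: DiffusionSfM predicts 3D scene geometry and camera poses directly through data-driven multi-view reasoning, performs strongly, and addresses practical training challenges. Abstract: Current Structure-from-Motion (SfM) methods typically follow a two-stage pipeline, combining learned or geometric pairwise reasoning with a subsequent global optimization step. In contrast, we propose a data-driven multi-view reasoning approach that directly infers 3D scene geometry and camera poses from multi-view images. Our framework, DiffusionSfM, parameterizes scene geometry and cameras as pixel-wise ray origins and endpoints in a global frame and employs a transformer-based denoising diffusion model to predict them from multi-view inputs. To address practical challenges in training diffusion models with missing data and unbounded scene coordinates, we introduce specialized mechanisms that ensure robust learning. We empirically validate DiffusionSfM on both synthetic and real datasets, demonstrating that it outperforms classical and learning-based approaches while naturally modeling uncertainty.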

[78] 3D Scene Generation: A Survey

Beichen Wen,Haozhe Xie,Zhaoxi Chen,Fangzhou Hong,Ziwei Liu

Main category: cs.CV

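TL;DR: This survey reviews recent progress in 3D scene generation, organizing it into four paradigms (procedural generation, neural 3D-based generation, image-based generation, and video-based generation) and discussing technical foundations, challenges, and future directions.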

Details

Motivation: 3D scene generation is widely used in immersive media, robotics, autonomous driving, and embodied AI, but early methods were limited in diversity and realism, which calls for more advanced generation techniques. Method: The survey analyzes four main paradigms (procedural, neural 3D-based, image-based, and video-based generation) and, drawing on deep generative models (such as GANs and diffusion models) and 3D representations (such as NeRF and 3D Gaussians), systematically summarizes technical foundations and representative results. Result: It summarizes the strengths and weaknesses of existing methods and reviews commonly used datasets, evaluation protocols, and downstream applications. Conclusion: Future directions include higher fidelity, physics-aware and interactive generation, and unified perception-generation models; the survey also highlights the potential at the intersection of generative AI, 3D vision, and embodied intelligence. Abstract: 3D scene generation seeks to synthesize spatially structured, semantically meaningful, and photorealistic environments for applications such as immersive media, robotics, autonomous driving, and embodied AI. Early methods based on procedural rules offered scalability but limited diversity. Recent advances in deep generative models (e.g., GANs, diffusion models) and 3D representations (e.g., NeRF, 3D Gaussians) have enabled the learning of real-world scene distributions, improving fidelity, diversity, and view consistency. Recent advances like diffusion models bridge 3D scene synthesis and photorealism by reframing generation as image or video synthesis problems. This survey provides a systematic overview of state-of-the-art approaches, organizing them into four paradigms: procedural generation, neural 3D-based generation, image-based generation, and video-based generation. We analyze their technical foundations, trade-offs, and representative results, and review commonly used datasets, evaluation protocols, and downstream applications. We conclude by discussing key challenges in generation capacity, 3D representation, data and annotations, and evaluation, and outline promising directions including higher fidelity, physics-aware and interactive generation, and unified perception-generation models. This review organizes recent advances in 3D scene generation and highlights promising directions at the intersection of generative AI, 3D vision, and embodied intelligence. To track ongoing developments, we maintain an up-to-date project page: https://github.com/hzxie/Awesome-3D-Scene-Generation.

[79] SVAD: From Single Image to 3D Avatar via Synthetic Data Generation with Video Diffusion and Data Augmentation

Yonwoo Choi

Main category: cs.CV

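TL;DR: SVAD is a new method that combines video diffusion models with 3D Gaussian Splatting to generate high-quality animatable 3D human avatars from a single image, addressing the limitations of existing methods.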

Details

Motivation: Current methods for generating 3D human avatars from a single image are limited: 3D Gaussian Splatting needs multi-view data, while video diffusion models struggle with consistency and identity preservation. SVAD aims to combine the strengths of both. Method: SVAD generates synthetic training data with video diffusion, enhances it with identity preservation and image restoration modules, and then trains 3D Gaussian Splatting avatars on the refined data. Result: SVAD outperforms existing single-image methods in identity consistency and detail preservation, supports real-time rendering, and reduces the dependence on dense training data. Conclusion: By combining diffusion models with 3D Gaussian Splatting, SVAD offers a new approach to high-fidelity avatar generation from a single image. Abstract: Creating high-quality animatable 3D human avatars from a single image remains a significant challenge in computer vision due to the inherent difficulty of reconstructing complete 3D information from a single viewpoint. Current approaches face a clear limitation: 3D Gaussian Splatting (3DGS) methods produce high-quality results but require multiple views or video sequences, while video diffusion models can generate animations from single images but struggle with consistency and identity preservation. We present SVAD, a novel approach that addresses these limitations by leveraging complementary strengths of existing techniques. Our method generates synthetic training data through video diffusion, enhances it with identity preservation and image restoration modules, and utilizes this refined data to train 3DGS avatars. Comprehensive evaluations demonstrate that SVAD outperforms state-of-the-art (SOTA) single-image methods in maintaining identity consistency and fine details across novel poses and viewpoints, while enabling real-time rendering capabilities. Through our data augmentation pipeline, we overcome the dependency on dense monocular or multi-view training data typically required by traditional 3DGS approaches. Extensive quantitative and qualitative comparisons show our method achieves superior performance across multiple metrics against baseline models. By effectively combining the generative power of diffusion models with both the high-quality results and rendering efficiency of 3DGS, our work establishes a new approach for high-fidelity avatar generation from a single image input.

cs.GR [Back]

[80] ChannelExplorer: Exploring Class Separability Through Activation Channel Visualization

Md Rahat-uz-Zaman,Bei Wang,Paul Rosen

Main category: cs.GR

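TL;DR: ChannelExplorer is an interactive visualization tool for analyzing how the activation channels in each layer of a deep neural network contribute to class separability.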

Details

Motivation: Understand the internal behavior of DNNs, in particular how different layers and activation channels affect class separability. Method: Activation channels are analyzed through three main views (scatterplot, Jaccard similarity, and heatmap), with support for multiple model architectures. Result: The tool's capabilities are demonstrated on ImageNet class hierarchy generation, mislabeled image detection, activation channel contribution analysis, and locating latent states. Conclusion: ChannelExplorer provides a data-driven analysis tool for DNN interpretability and was evaluated with expert users. Abstract: Deep neural networks (DNNs) achieve state-of-the-art performance in many vision tasks, yet understanding their internal behavior remains challenging, particularly how different layers and activation channels contribute to class separability. We introduce ChannelExplorer, an interactive visual analytics tool for analyzing image-based outputs across model layers, emphasizing data-driven insights over architecture analysis for exploring class separability. ChannelExplorer summarizes activations across layers and visualizes them using three primary coordinated views: a Scatterplot View to reveal inter- and intra-class confusion, a Jaccard Similarity View to quantify activation overlap, and a Heatmap View to inspect activation channel patterns. Our technique supports diverse model architectures, including CNNs, GANs, ResNet and Stable Diffusion models. We demonstrate the capabilities of ChannelExplorer through four use-case scenarios: (1) generating class hierarchy in ImageNet, (2) finding mislabeled images, (3) identifying activation channel contributions, and (4) locating latent states' position in Stable Diffusion model. Finally, we evaluate the tool with expert users.
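
As a small illustration of how Jaccard similarity can quantify activation overlap, the sketch below compares the sets of channels that fire for two classes. The thresholding rule and the names `active_channels` and `jaccard` are assumptions for illustration; the tool's own definition of an "active" channel may differ.

```python
from typing import Iterable, Sequence, Set


def active_channels(activations: Sequence[float], threshold: float) -> Set[int]:
    """Indices of channels whose summarized activation exceeds `threshold` (assumed rule)."""
    return {i for i, value in enumerate(activations) if value > threshold}


def jaccard(a: Iterable[int], b: Iterable[int]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0
```

A value near 1 means two classes light up almost the same channels in a layer and so are hard to separate there, while a value near 0 means their activations are largely disjoint.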

[81] Multimodal Benchmarking and Recommendation of Text-to-Image Generation Models

Kapil Wanaskar,Gaytri Jena,Magdalini Eirinaki

Main category: cs.GR

TL;DR: Presents an open-source framework for evaluating text-to-image generation models, focusing on the impact of metadata-augmented prompts.

Details

Motivation: Study how metadata-augmented prompts affect the performance of text-to-image generation models and provide a unified evaluation method. Method: Using the DeepFashion-MultiModal dataset, generated outputs are assessed with quantitative metrics (such as Weighted Score, CLIP similarity, LPIPS, and FID) and qualitative analysis. Result: Structured metadata significantly improves visual realism, semantic fidelity, and model robustness. Conclusion: The framework supports metric-based recommendations for model selection and prompt design; although it is not a traditional recommender system, it has practical value. Abstract: This work presents an open-source unified benchmarking and evaluation framework for text-to-image generation models, with a particular focus on the impact of metadata-augmented prompts. Leveraging the DeepFashion-MultiModal dataset, we assess generated outputs through a comprehensive set of quantitative metrics, including Weighted Score, CLIP (Contrastive Language Image Pre-training)-based similarity, LPIPS (Learned Perceptual Image Patch Similarity), FID (Frechet Inception Distance), and retrieval-based measures, as well as qualitative analysis. Our results demonstrate that structured metadata enrichments greatly enhance visual realism, semantic fidelity, and model robustness across diverse text-to-image architectures. While not a traditional recommender system, our framework enables task-specific recommendations for model selection and prompt design based on evaluation metrics.

[82] MeshGen: Generating PBR Textured Mesh with Render-Enhanced Auto-Encoder and Generative Data Augmentation

Zilong Chen,Yikai Wang,Wenqiang Sun,Feng Wang,Yiwen Chen,Huaping Liu

Main category: cs.GR

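TL;DR: MeshGen is an advanced image-to-3D pipeline that uses a render-enhanced point-to-shape auto-encoder and a multi-view ControlNet to generate high-quality 3D meshes with PBR textures, clearly outperforming existing methods.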

Details

Motivation: Address the weaknesses of existing 3D native diffusion models in auto-encoder performance, controllability, generalization, and PBR texture consistency. Method: A render-enhanced point-to-shape auto-encoder, geometric augmentation and generative rendering augmentation, plus a multi-view ControlNet and a PBR decomposer. Result: MeshGen largely outperforms previous methods in shape and texture generation, setting a new quality standard for 3D mesh generation. Conclusion: MeshGen resolves these problems through technical innovation and sets a new standard for high-quality 3D mesh generation. Abstract: In this paper, we introduce MeshGen, an advanced image-to-3D pipeline that generates high-quality 3D meshes with detailed geometry and physically based rendering (PBR) textures. Addressing the challenges faced by existing 3D native diffusion models, such as suboptimal auto-encoder performance, limited controllability, poor generalization, and inconsistent image-based PBR texturing, MeshGen employs several key innovations to overcome these limitations. We pioneer a render-enhanced point-to-shape auto-encoder that compresses meshes into a compact latent space by designing perceptual optimization with ray-based regularization. This ensures that the 3D shapes are accurately represented and reconstructed to preserve geometric details within the latent space. To address data scarcity and image-shape misalignment, we further propose geometric augmentation and generative rendering augmentation techniques, which enhance the model's controllability and generalization ability, allowing it to perform well even with limited public datasets. For the texture generation, MeshGen employs a reference attention-based multi-view ControlNet for consistent appearance synthesis. This is further complemented by our multi-view PBR decomposer that estimates PBR components and a UV inpainter that fills invisible areas, ensuring a seamless and consistent texture across the 3D mesh. Our extensive experiments demonstrate that MeshGen largely outperforms previous methods in both shape and texture generation, setting a new standard for the quality of 3D meshes generated with PBR textures. See our code at https://github.com/heheyas/MeshGen, project page https://heheyas.github.io/MeshGen

[83] GSsplat: Generalizable Semantic Gaussian Splatting for Novel-view Synthesis in 3D Scenes

Feng Xiao,Hongbin Xu,Wanlin Liang,Wenxiong Kang

Main category: cs.GR

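TL;DR: Proposes a generalizable semantic Gaussian Splatting method (GSsplat) for efficient novel-view synthesis, addressing the speed and segmentation limitations of existing methods.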

Details

Motivation: Multi-view semantic synthesis of unseen scenes matters for 3D scene understanding, but existing methods are limited in speed and segmentation performance. Method: The model predicts the positions and attributes of scene-adaptive Gaussian distributions, uses a hybrid network to extract color and semantic information, and introduces an offset learning module and a point-level interaction module. Result: With multi-view inputs, GSsplat achieves state-of-the-art semantic synthesis performance at the fastest speed. Conclusion: GSsplat is an effective solution for efficient novel-view semantic synthesis. Abstract: The semantic synthesis of unseen scenes from multiple viewpoints is crucial for research in 3D scene understanding. Current methods are capable of rendering novel-view images and semantic maps by reconstructing generalizable Neural Radiance Fields. However, they often suffer from limitations in speed and segmentation performance. We propose a generalizable semantic Gaussian Splatting method (GSsplat) for efficient novel-view synthesis. Our model predicts the positions and attributes of scene-adaptive Gaussian distributions from the input in a single pass, replacing the densification and pruning processes of traditional scene-specific Gaussian Splatting. In the multi-task framework, a hybrid network is designed to extract color and semantic information and predict Gaussian parameters. To augment the spatial perception of Gaussians for high-quality rendering, we put forward a novel offset learning module through group-based supervision and a point-level interaction module with spatial unit aggregation. When evaluated with varying numbers of multi-view inputs, GSsplat achieves state-of-the-art performance for semantic synthesis at the fastest speed.

[84] Crafting Physical Adversarial Examples by Combining Differentiable and Physically Based Renders

Yuqiu Liu,Huanqian Yan,Xiaopei Zhu,Xiaolin Hu,Liang Tang,Hang Su,Chen Lv

Main category: cs.GR

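TL;DR: Proposes PAV-Camou, a new method for generating adversarial camouflage for real vehicles that addresses the poor physical-world performance of existing methods.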

Details

Motivation: Extending adversarial camouflage to the physical world is needed to test the robustness of autonomous driving systems, but existing methods perform poorly because training examples lack photorealism and proper physical realization methods are missing. Method: The mapping from 2D map coordinates to the 3D model is adjusted to reduce texture distortion, and two renderers are combined to produce photorealistic adversarial examples, keeping the camouflage effective under diverse environmental conditions. Result: Experiments show that PAV-Camou performs well in both the digital and the physical world. Conclusion: PAV-Camou is an effective adversarial camouflage method that suits real vehicles and adapts to different environmental conditions. Abstract: Recently we have witnessed progress in hiding road vehicles against object detectors through adversarial camouflage in the digital world. The extension of this technique to the physical world is crucial for testing the robustness of autonomous driving systems. However, existing methods do not show good performances when applied to the physical world. This is partly due to insufficient photorealism in training examples, and lack of proper physical realization methods for camouflage. To generate a robust adversarial camouflage suitable for real vehicles, we propose a novel method called PAV-Camou. We propose to adjust the mapping from the coordinates in the 2D map to those of the corresponding 3D model. This process is critical for mitigating texture distortion and ensuring the camouflage's effectiveness when applied in the real world. Then we combine two renderers with different characteristics to obtain photorealistic adversarial examples that closely mimic real-world lighting and texture properties. The method ensures that the generated textures remain effective under diverse environmental conditions. Our adversarial camouflage can be optimized and printed in the form of 2D patterns, allowing for direct application on real vehicles. Extensive experiments demonstrated that our proposed method achieved good performance in both the digital world and the physical world.

[85] SGCR: Spherical Gaussians for Efficient 3D Curve Reconstruction

Xinran Yang,Donghao Ji,Yuanqi Li,Jie Guo,Yanwen Guo,Junyuan Xie

Main category: cs.GR

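TL;DR: The paper proposes Spherical Gaussians, a new representation that addresses the weakness of 3D Gaussian Splatting in defining geometric structure, and builds on it the SGCR algorithm for extracting accurate 3D curves.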

Details

Motivation: 3D Gaussian Splatting produces high-quality novel-view images but is weak at defining accurate 3D geometric structure, mainly because of its anisotropic nature. Method: The Spherical Gaussians representation is optimized with a view-based rendering loss and reconstructs 3D feature curves without 3D supervision; the SGCR algorithm then extracts parametric curves from the aligned Spherical Gaussians. Result: SGCR outperforms existing methods in 3D edge reconstruction and is efficient. Conclusion: Spherical Gaussians and SGCR offer a new approach to efficient reconstruction of 3D geometric boundaries, significantly improving the accuracy and efficiency of 3D edge reconstruction. Abstract: Neural rendering techniques have made substantial progress in generating photo-realistic 3D scenes. The latest 3D Gaussian Splatting technique has achieved high quality novel view synthesis as well as fast rendering speed. However, 3D Gaussians lack proficiency in defining accurate 3D geometric structures despite their explicit primitive representations. This is due to the fact that Gaussian's attributes are primarily tailored and fine-tuned for rendering diverse 2D images by their anisotropic nature. To pave the way for efficient 3D reconstruction, we present Spherical Gaussians, a simple and effective representation for 3D geometric boundaries, from which we can directly reconstruct 3D feature curves from a set of calibrated multi-view images. Spherical Gaussians is optimized from grid initialization with a view-based rendering loss, where a 2D edge map is rendered at a specific view and then compared to the ground-truth edge map extracted from the corresponding image, without the need for any 3D guidance or supervision. Given that Spherical Gaussians serve as a robust intermediate edge representation, we further introduce a novel optimization-based algorithm called SGCR to directly extract accurate parametric curves from aligned Spherical Gaussians. We demonstrate that SGCR outperforms existing state-of-the-art methods in 3D edge reconstruction while enjoying great efficiency.

[86] WIR3D: Visually-Informed and Geometry-Aware 3D Shape Abstraction

Richard Liu,Daniel Fu,Noah Tan,Itai Lang,Rana Hanocka

Main category: cs.GR

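TL;DR: WIR3D abstracts 3D shapes through a sparse set of visually meaningful curves. It optimizes Bezier curves under CLIP guidance in two phases, one for coarse geometry and one for fine-grained features, and supports user control and shape deformation.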

Details

Motivation: Represent the geometric and visual features of 3D shapes efficiently and intuitively with sparse curves, while letting users interactively control the abstracted features. Method: Bezier curve parameters are optimized in two phases, coarse geometry capture and fine-grained feature representation, guided by a CLIP model and spatially guided by a localized keypoint loss, with a neural SDF loss to ensure surface fidelity. Result: The method is applied successfully to 3D shapes of varying complexity and texture and supports downstream applications such as feature control and shape deformation. Conclusion: WIR3D is an efficient and user-friendly approach to 3D shape abstraction with broad application potential. Abstract: We present WIR3D, a technique for abstracting 3D shapes through a sparse set of visually meaningful curves in 3D. We optimize the parameters of Bezier curves such that they faithfully represent both the geometry and salient visual features (e.g. texture) of the shape from arbitrary viewpoints. We leverage the intermediate activations of a pre-trained foundation model (CLIP) to guide our optimization process. We divide our optimization into two phases: one for capturing the coarse geometry of the shape, and the other for representing fine-grained features. Our second phase supervision is spatially guided by a novel localized keypoint loss. This spatial guidance enables user control over abstracted features. We ensure fidelity to the original surface through a neural SDF loss, which allows the curves to be used as intuitive deformation handles. We successfully apply our method for shape abstraction over a broad dataset of shapes with varying complexity, geometric structure, and texture, and demonstrate downstream applications for feature control and shape deformation.
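
Since the optimized parameters here are Bezier control points, a minimal sketch of evaluating one 3D curve may help. The abstract does not state the curve degree, so the cubic form with four control points is an assumption, as is the function name `cubic_bezier`.

```python
from typing import Tuple

Point3 = Tuple[float, float, float]


def cubic_bezier(p0: Point3, p1: Point3, p2: Point3, p3: Point3, t: float) -> Point3:
    """Evaluate a cubic Bezier curve at t in [0, 1] using the Bernstein basis."""
    s = 1.0 - t
    weights = (s ** 3, 3 * s * s * t, 3 * s * t * t, t ** 3)
    controls = (p0, p1, p2, p3)
    x, y, z = (
        sum(w * c[axis] for w, c in zip(weights, controls))
        for axis in range(3)
    )
    return (x, y, z)
```

Because every curve point is a smooth function of the control points, a loss computed on rendered curves can be backpropagated to those points, which is what makes this parameterization convenient for gradient-based abstraction.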

[87] ADD: Physics-Based Motion Imitation with Adversarial Differential Discriminators

Ziyu Zhang,Sergey Bashkirov,Dun Yang,Michael Taylor,Xue Bin Peng

Main category: cs.GR

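TL;DR: Proposes a new adversarial multi-objective optimization technique, applicable to multi-objective problems including motion tracking, that achieves high-quality results without manually tuned reward functions.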

Details

Motivation: Existing multi-objective optimization methods rely on manually tuned aggregation functions; their performance is limited by the choice of weights, and they require extensive manual tuning. Method: Uses an adversarial differential discriminator that guides the optimization effectively from only a single positive sample. Result: The technique lets characters closely replicate a variety of highly difficult motions, with quality comparable to state-of-the-art methods. Conclusion: The method is broadly applicable to multi-objective optimization problems and performs especially well in motion tracking. Abstract: Multi-objective optimization problems, which require the simultaneous optimization of multiple terms, are prevalent across numerous applications. Existing multi-objective optimization methods often rely on manually tuned aggregation functions to formulate a joint optimization target. The performance of such hand-tuned methods is heavily dependent on careful weight selection, a time-consuming and laborious process. These limitations also arise in the setting of reinforcement-learning-based motion tracking for physically simulated characters, where intricately crafted reward functions are typically used to achieve high-fidelity results. Such solutions not only require domain expertise and significant manual adjustment, but also limit the applicability of the resulting reward function across diverse skills. To bridge this gap, we present a novel adversarial multi-objective optimization technique that is broadly applicable to a range of multi-objective optimization problems, including motion tracking. The proposed adversarial differential discriminator receives a single positive sample, yet is still effective at guiding the optimization process. We demonstrate that our technique can enable characters to closely replicate a variety of acrobatic and agile behaviors, achieving comparable quality to state-of-the-art motion-tracking methods, without relying on manually tuned reward functions. Results are best visualized through https://youtu.be/rz8BYCE9E2w.

[88] Inter-Diffusion Generation Model of Speakers and Listeners for Effective Communication

Jinhe Huang,Yongkang Cheng,Yuming Hang,Gaoge Han,Jinewei Li,Jing Zhang,Xingjian Gu

Main category: cs.GR

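TL;DR: This paper proposes an inter-diffusion generation model that, for the first time, integrates listeners' full-body gestures into the generation framework. A novel inter-diffusion mechanism captures the complex interaction patterns between speakers and listeners and markedly improves the naturalness, coherence, and speech-gesture synchronization of the generated gestures.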

Details

Motivation: Existing studies mostly focus on gesture generation for speakers, overlook the key role of listeners in the interaction, and do not fully explore the dynamic interplay between the two. This paper aims to fill that gap. Method: Building on an advanced diffusion model architecture, introduces interaction conditions and a GAN model to increase the denoising step size, so that generation follows the speaker's speech and responds in real time to the listener's feedback. Result: Experiments show that the model significantly outperforms existing methods in naturalness, coherence, and synchronization; users rate the results as closer to real interaction scenarios, and objective metrics also exceed those of the baselines. Conclusion: The model provides stronger support for effective communication and advances the field of interactive gesture generation. Abstract: Full-body gestures play a pivotal role in natural interactions and are crucial for achieving effective communication. Nevertheless, most existing studies primarily focus on the gesture generation of speakers, overlooking the vital role of listeners in the interaction process and failing to fully explore the dynamic interaction between them. This paper innovatively proposes an Inter-Diffusion Generation Model of Speakers and Listeners for Effective Communication. For the first time, we integrate the full-body gestures of listeners into the generation framework. By devising a novel inter-diffusion mechanism, this model can accurately capture the complex interaction patterns between speakers and listeners during communication. In the model construction process, based on the advanced diffusion model architecture, we innovatively introduce interaction conditions and the GAN model to increase the denoising step size. As a result, when generating gesture sequences, the model can not only dynamically generate based on the speaker's speech information but also respond in real time to the listener's feedback, enabling synergistic interaction between the two. Abundant experimental results demonstrate that compared with the current state-of-the-art gesture generation methods, the model we proposed has achieved remarkable improvements in the naturalness, coherence, and speech-gesture synchronization of the generated gestures. In the subjective evaluation experiments, users highly praised the generated interaction scenarios, believing that they are closer to real-life human communication situations. Objective index evaluations also show that our model outperforms the baseline methods in multiple key indicators, providing more powerful support for effective communication.

[89] Improving Global Motion Estimation in Sparse IMU-based Motion Capture with Physics

Xinyu Yi,Shaohua Pan,Feng Xu

Main category: cs.GR

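TL;DR: Combines a physics-based optimization scheme with 6 IMUs to capture human global and local motion more accurately, and additionally estimates contact forces, joint torques, and related quantities.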

Details

Motivation: To address the difficulty of reconstructing human global motion from IMUs, especially motion in the z direction. Method: Proposes a physical optimization scheme based on multiple contacts and uses gravity as a constraint to refine both global orientation and local pose estimation. Result: Experiments show that the method is more accurate for both local poses and global motion capture, and that it can also estimate 3D contact, contact forces, and more. Conclusion: By deeply integrating physics, the method significantly improves the accuracy and functionality of IMU-based motion capture. Abstract: By learning human motion priors, motion capture can be achieved by 6 inertial measurement units (IMUs) in recent years with the development of deep learning techniques, even though the sensor inputs are sparse and noisy. However, human global motions are still challenging to reconstruct from IMUs. This paper aims to solve this problem by involving physics. It proposes a physical optimization scheme based on multiple contacts to enable physically plausible translation estimation in the full 3D space where the z-directional motion is usually challenging for previous works. It also considers gravity in local pose estimation which well constrains human global orientations and refines local pose estimation in a joint estimation manner. Experiments demonstrate that our method achieves more accurate motion capture for both local poses and global motions. Furthermore, by deeply integrating physics, we can also estimate 3D contact, contact forces, joint torques, and interacting proxy surfaces.

[90] An Active Contour Model for Silhouette Vectorization using Bézier Curves

Luis Alvarez,Jean-Michel Morel

Main category: cs.GR

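TL;DR: Proposes an active contour model based on cubic Bézier curves for silhouette vectorization, significantly reducing the average distance between the silhouette boundary and its vectorization.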

Details

Motivation: Existing silhouette vectorization methods (e.g., Inkscape, Adobe Illustrator) leave room for improvement in accuracy, which calls for a better-optimized approach. Method: Minimizes the distance between the Bézier curves and the silhouette boundary by optimizing the curve end point locations, the tangent orientations, and the curve parameters; the vectorization produced by any other method can serve as the initial guess. Result: Compared with world-class graphics software and other methods, the proposed method significantly lowers the average distance, and it can also add regularity by reducing curve lengths. Conclusion: The method offers higher accuracy and flexibility in silhouette vectorization and suits a variety of application scenarios. Abstract: In this paper, we propose an active contour model for silhouette vectorization using cubic Bézier curves. Among the end points of the Bézier curves, we distinguish between corner and regular points where the orientation of the tangent vector is prescribed. By minimizing the distance of the Bézier curves to the silhouette boundary, the active contour model optimizes the location of the Bézier curves end points, the orientation of the tangent vectors in the regular points, and the estimation of the Bézier curve parameters. This active contour model can use the silhouette vectorization obtained by any method as an initial guess. The proposed method significantly reduces the average distance between the silhouette boundary and its vectorization obtained by the world-class graphic software Inkscape, Adobe Illustrator, and a curvature-based vectorization method, which we introduce for comparison. Our method also allows us to impose additional regularity on the Bézier curves by reducing their lengths.
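
The quantity being minimized above can be illustrated with a minimal sketch. It evaluates a cubic Bézier curve in the standard Bernstein form and measures the average distance from boundary points to a densely sampled curve. The function names, the sampling density, and the nearest-sample distance are assumptions made for illustration; the abstract does not specify how the authors compute or minimize the distance.

```python
import math

def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a 2D cubic Bézier curve at t in [0, 1] using the Bernstein basis."""
    s = 1.0 - t
    b0, b1, b2, b3 = s ** 3, 3 * s * s * t, 3 * s * t * t, t ** 3
    return (b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0],
            b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1])

def mean_distance_to_curve(boundary, ctrl, samples=200):
    """Average distance from each boundary point to its nearest sampled curve point.

    `ctrl` is a tuple of four control points (hypothetical layout for this sketch).
    """
    curve = [cubic_bezier(*ctrl, i / samples) for i in range(samples + 1)]
    total = 0.0
    for bx, by in boundary:
        total += min(math.hypot(bx - cx, by - cy) for cx, cy in curve)
    return total / len(boundary)
```

An optimizer would adjust the end points, tangent directions, and control-point parameters in `ctrl` to lower this value; that step is left out here.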

[91] Time of the Flight of the Gaussians: Optimizing Depth Indirectly in Dynamic Radiance Fields

Runfeng Li,Mikhail Okunev,Zixuan Guo,Anh Ha Duong,Christian Richardt,Matthew O'Toole,James Tompkin

Main category: cs.GR

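TL;DR: Proposes a method for reconstructing dynamic scenes from a monocular continuous-wave time-of-flight (C-ToF) camera, with similar or better accuracy than neural volumetric methods and 100x faster.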

Details

Motivation: Quickly achieving high-fidelity dynamic 3D reconstruction from a single viewpoint is a major challenge in computer vision, and the fact that C-ToF does not measure depth directly makes it harder. Method: Incorporates two heuristics into the optimization to improve the geometric accuracy of the Gaussian representation. Result: Experiments show that the method produces accurate reconstructions under constrained C-ToF conditions, such as fast motion. Conclusion: The method significantly improves the efficiency and accuracy of dynamic scene reconstruction. Abstract: We present a method to reconstruct dynamic scenes from monocular continuous-wave time-of-flight (C-ToF) cameras using raw sensor samples that achieves similar or better accuracy than neural volumetric approaches and is 100x faster. Quickly achieving high-fidelity dynamic 3D reconstruction from a single viewpoint is a significant challenge in computer vision. In C-ToF radiance field reconstruction, the property of interest, depth, is not directly measured, causing an additional challenge. This problem has a large and underappreciated impact upon the optimization when using a fast primitive-based scene representation like 3D Gaussian splatting, which is commonly used with multi-view data to produce satisfactory results and is brittle in its optimization otherwise. We incorporate two heuristics into the optimization to improve the accuracy of scene geometry represented by Gaussians. Experimental results show that our approach produces accurate reconstructions under constrained C-ToF sensing conditions, including for fast motions like swinging baseball bats. https://visual.cs.brown.edu/gftorf
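
To make concrete why depth is only indirectly observed in C-ToF sensing, the sketch below uses the textbook continuous-wave relation between phase shift and distance, d = c·φ / (4π·f), together with the resulting unambiguous range c / (2f). This is general background and not the paper's method, which works on raw sensor samples; the function names and the 30 MHz modulation frequency in the usage note are assumptions for illustration.

```python
import math

C = 299_792_458.0  # speed of light in m/s

def phase_to_depth(phase_rad, mod_freq_hz):
    """Depth implied by a measured phase shift for a continuous-wave ToF signal.

    The light travels to the surface and back, hence the factor 4*pi rather than 2*pi.
    """
    return C * phase_rad / (4.0 * math.pi * mod_freq_hz)

def unambiguous_range(mod_freq_hz):
    """Largest depth recoverable before the phase wraps around at 2*pi."""
    return C / (2.0 * mod_freq_hz)
```

With an assumed 30 MHz modulation frequency, a phase shift of π corresponds to roughly 2.5 m, and depths beyond about 5 m alias back into the same phase interval.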

cs.CL

Yusen Wu,Junwu Xiong,Xiaotie Deng

Main category: cs.CL

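TL;DR: The paper proposes HSII, a benchmark for evaluating large language models (LLMs) on complex social tasks, studies the chain-of-thought (COT) method as a way to improve performance, and proposes the COT-complexity metric to balance efficiency and correctness.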

Details

Motivation: Existing benchmarks do not systematically evaluate LLMs on multi-user, multi-turn social tasks, and this gap needs to be filled. Method: Proposes an agent task leveling framework grounded in sociological principles and the HSII benchmark, which evaluates the social capabilities of LLMs in four stages, and studies the effect of the COT method. Result: Experiments show that the HSII benchmark evaluates the social capabilities of LLMs effectively and that COT improves performance at a cost in efficiency. Conclusion: HSII provides an effective tool for evaluating LLMs on complex social tasks, and the COT-complexity metric improves the trade-off between performance and efficiency. Abstract: Expanding the application of large language models (LLMs) to societal life, instead of having them function primarily as auxiliary assistants that communicate with only one person at a time, necessitates LLMs' capabilities to independently play roles in multi-user, multi-turn social agent tasks within complex social settings. However, currently the capability has not been systematically measured with available benchmarks. To address this gap, we first introduce an agent task leveling framework grounded in sociological principles. Concurrently, we propose a novel benchmark, How Social Is It (we call it HSII below), designed to assess LLM's social capabilities in comprehensive social agents tasks and benchmark representative models. HSII comprises four stages: format parsing, target selection, target switching conversation, and stable conversation, which collectively evaluate the communication and task completion capabilities of LLMs within realistic social interaction scenarios dataset, HSII-Dataset. The dataset is derived step by step from news dataset. We perform an ablation study by doing clustering to the dataset. Additionally, we investigate the impact of chain of thought (COT) method on enhancing LLMs' social performance. Since COT costs more computation, we further introduce a new statistical metric, COT-complexity, to quantify the efficiency of certain LLMs with COTs for specific social tasks and strike a better trade-off between measurement of correctness and efficiency. Various results of our experiments demonstrate that our benchmark is well-suited for evaluating social skills in LLMs.

[93] Adaptive Token Boundaries: Integrating Human Chunking Mechanisms into Multimodal LLMs

Dongxing Yu

Main category: cs.CL

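TL;DR: The study examines how multimodal large language models (MLLMs) differ from humans in cross-modal information processing and proposes a dynamic cross-modal tokenization framework that significantly improves model performance.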

Details

Motivation: Current MLLMs lag behind human cognition in information integration, and the study aims to narrow that gap. Method: Compares human and model behavior on visual-linguistic tasks and proposes a dynamic cross-modal tokenization framework with adaptive boundaries and hierarchical representations. Result: The new framework significantly outperforms existing models on benchmark tasks (+7.8% on VQA, +5.3% on Complex Scene Description) and behaves more like humans. Conclusion: The study provides theoretical and empirical support for developing AI systems that are more consistent with human cognition. Abstract: Recent advancements in multimodal large language models (MLLMs) have demonstrated remarkable capabilities in processing diverse data types, yet significant disparities persist between human cognitive processes and computational approaches to multimodal information integration. This research presents a systematic investigation into the parallels between human cross-modal chunking mechanisms and token representation methodologies in MLLMs. Through empirical studies comparing human performance patterns with model behaviors across visual-linguistic tasks, we demonstrate that conventional static tokenization schemes fundamentally constrain current models' capacity to simulate the dynamic, context-sensitive nature of human information processing. We propose a novel framework for dynamic cross-modal tokenization that incorporates adaptive boundaries, hierarchical representations, and alignment mechanisms grounded in cognitive science principles. Quantitative evaluations demonstrate that our approach yields statistically significant improvements over state-of-the-art models on benchmark tasks (+7.8% on Visual Question Answering, +5.3% on Complex Scene Description) while exhibiting more human-aligned error patterns and attention distributions. These findings contribute to the theoretical understanding of the relationship between human cognition and artificial intelligence, while providing empirical evidence for developing more cognitively plausible AI systems.

[94] Language translation, and change of accent for speech-to-speech task using diffusion model

Abhishek Mishra,Ritesh Sur Chowdhury,Vartul Bahuguna,Isha Pandey,Ganesh Ramakrishnan

Main category: cs.CL

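TL;DR: Proposes a unified approach that performs speech translation and accent adaptation simultaneously, using a diffusion model to generate the target speech.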

Details

Motivation: Cross-cultural communication requires handling language translation and accent adaptation together, yet existing research pays little attention to this. Method: Reformulates the problem as a conditional generation task and uses a diffusion model to generate Mel spectrograms of the target speech. Result: Enables joint optimization of translation and accent adaptation, yielding a model that is more parameter-efficient and more effective than traditional pipelines. Conclusion: The method offers a more efficient solution for speech-to-speech translation and has practical application potential. Abstract: Speech-to-speech translation (S2ST) aims to convert spoken input in one language to spoken output in another, typically focusing on either language translation or accent adaptation. However, effective cross-cultural communication requires handling both aspects simultaneously - translating content while adapting the speaker's accent to match the target language context. In this work, we propose a unified approach for simultaneous speech translation and change of accent, a task that remains underexplored in current literature. Our method reformulates the problem as a conditional generation task, where target speech is generated based on phonemes and guided by target speech features. Leveraging the power of diffusion models, known for high-fidelity generative capabilities, we adapt text-to-image diffusion strategies by conditioning on source speech transcriptions and generating Mel spectrograms representing the target speech with desired linguistic and accentual attributes. This integrated framework enables joint optimization of translation and accent adaptation, offering a more parameter-efficient and effective model compared to traditional pipelines.

[95] A Comparative Benchmark of a Moroccan Darija Toxicity Detection Model (Typica.ai) and Major LLM-Based Moderation APIs (OpenAI, Mistral, Anthropic)

Hicham Assoudi

Main category: cs.CL

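TL;DR: Typica.ai's Moroccan Darija toxicity detection model outperforms major LLM-based moderation APIs, underscoring the importance of culturally adapted models.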

Details

Motivation: To evaluate Typica.ai's custom model for Moroccan Darija toxicity detection and compare it with major LLM-based moderation APIs, with particular attention to culturally grounded toxic content. Method: Uses a balanced test set from the OMCD_Typica.ai_Mix dataset to compare the Typica.ai model with the OpenAI, Mistral, and Anthropic Claude APIs on precision, recall, F1-score, and accuracy. Result: The Typica.ai model performs best, highlighting the advantage of culturally adapted models in content moderation. Conclusion: Culturally adapted models are essential for reliable content moderation, and Typica.ai's custom model has a clear advantage in Moroccan Darija toxicity detection. Abstract: This paper presents a comparative benchmark evaluating the performance of Typica.ai's custom Moroccan Darija toxicity detection model against major LLM-based moderation APIs: OpenAI (omni-moderation-latest), Mistral (mistral-moderation-latest), and Anthropic Claude (claude-3-haiku-20240307). We focus on culturally grounded toxic content, including implicit insults, sarcasm, and culturally specific aggression often overlooked by general-purpose systems. Using a balanced test set derived from the OMCD_Typica.ai_Mix dataset, we report precision, recall, F1-score, and accuracy, offering insights into challenges and opportunities for moderation in underrepresented languages. Our results highlight Typica.ai's superior performance, underlining the importance of culturally adapted models for reliable content moderation.
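
The four metrics reported in this benchmark can be computed from binary labels as in the following minimal sketch. It assumes labels encoded as 1 (toxic) and 0 (non-toxic) and a hypothetical function name; the paper's actual evaluation code is not described in the abstract.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, F1-score, and accuracy for binary labels (1 = toxic, 0 = non-toxic)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    # Guard against empty denominators, e.g. a model that never predicts "toxic".
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}
```

On a balanced test set such as the one described, accuracy and F1 tend to move together, which makes the four numbers easier to compare across systems.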

[96] Rethinking Multimodal Sentiment Analysis: A High-Accuracy, Simplified Fusion Architecture

Nischal Mandal,Yang Li

Main category: cs.CL

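TL;DR: Proposes a lightweight multimodal sentiment analysis model whose simple feature-fusion strategy performs well in resource-constrained environments.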

Details

Motivation: Multimodal sentiment analysis must integrate language, audio, and visual signals, but existing methods often rely on complex attention mechanisms and hierarchical architectures with high computational cost. Method: Designs a lightweight deep learning model based on modality-specific encoders and simple concatenation fusion. Result: Achieves 92% classification accuracy on the IEMOCAP dataset, matching or outperforming more complex models. Conclusion: With careful feature engineering and a modular design, simple fusion strategies are competitive in resource-constrained environments. Abstract: Multimodal sentiment analysis, a pivotal task in affective computing, seeks to understand human emotions by integrating cues from language, audio, and visual signals. While many recent approaches leverage complex attention mechanisms and hierarchical architectures, we propose a lightweight, yet effective fusion-based deep learning model tailored for utterance-level emotion classification. Using the benchmark IEMOCAP dataset, which includes aligned text, audio-derived numeric features, and visual descriptors, we design a modality-specific encoder using fully connected layers followed by dropout regularization. The modality-specific representations are then fused using simple concatenation and passed through a dense fusion layer to capture cross-modal interactions. This streamlined architecture avoids computational overhead while preserving performance, achieving a classification accuracy of 92% across six emotion categories. Our approach demonstrates that with careful feature engineering and modular design, simpler fusion strategies can outperform or match more complex models, particularly in resource-constrained environments.
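
The architecture described above (modality-specific dense encoders, concatenation, a dense fusion layer, six output classes) can be sketched as a plain forward pass. Everything concrete here is an assumption for illustration: the layer widths, the random initialization, the function names, and the omission of dropout, which is only active during training. It is not the authors' implementation.

```python
import random

def dense(x, weights, bias, relu=True):
    """Fully connected layer; `weights` holds one row of input weights per output unit."""
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    return [max(0.0, o) for o in out] if relu else out

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (hypothetical initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def build_params(dims, hidden=8, fused=16, n_classes=6, seed=0):
    """One encoder per modality, one fusion layer, and one output layer."""
    rng = random.Random(seed)
    params = {name: init_layer(d, hidden, rng) for name, d in dims.items()}
    params["fusion"] = init_layer(hidden * len(dims), fused, rng)
    params["out"] = init_layer(fused, n_classes, rng)
    return params

def forward(inputs, params):
    """Encode each modality, concatenate the encodings, fuse, and return class logits."""
    fused_in = []
    for name, x in inputs.items():
        fused_in.extend(dense(x, *params[name]))  # simple concatenation fusion
    hidden = dense(fused_in, *params["fusion"])
    return dense(hidden, *params["out"], relu=False)
```

The point of the design is that cross-modal interaction is learned only in the single fusion layer, so the parameter count grows linearly with the number of modalities rather than with pairwise attention between them.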

[97] Prediction-powered estimators for finite population statistics in highly imbalanced textual data: Public hate crime estimation

Hannes Waldetoft,Jakob Torgander,Måns Magnusson

Main category: cs.CL

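TL;DR: Combines a transformer neural network with survey sampling estimators to efficiently estimate population parameters of a target variable in text data, applied to Swedish hate crime statistics.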

Details

Motivation: To address the difficulty of parameter estimation when the target variable in text data requires manual annotation. Method: Uses the predictions of a transformer encoder neural network as an auxiliary variable, combined with the Hansen-Hurwitz estimator, difference estimation, and stratified random sampling estimation. Result: The method is validated on Swedish hate crime statistics and reduces manual annotation time. Conclusion: When labeled training data is available, the method provides efficient estimates and substantially reduces the need for manual annotation. Abstract: Estimating population parameters in finite populations of text documents can be challenging when obtaining the labels for the target variable requires manual annotation. To address this problem, we combine predictions from a transformer encoder neural network with well-established survey sampling estimators using the model predictions as an auxiliary variable. The applicability is demonstrated in Swedish hate crime statistics based on Swedish police reports. Estimates of the yearly number of hate crimes and the police's under-reporting are derived using the Hansen-Hurwitz estimator, difference estimation, and stratified random sampling estimation. We conclude that if labeled training data is available, the proposed method can provide very efficient estimates with reduced time spent on manual annotation.
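
Of the estimators named above, difference estimation is the simplest to illustrate. The sketch below assumes simple random sampling without replacement and uses the standard form: the sum of model predictions over the whole population, plus the scaled sum of label-minus-prediction residuals on the manually annotated sample. The function name and data layout are hypothetical, and the paper's exact sampling design may differ.

```python
def difference_estimate(predictions, sample_idx, sample_labels):
    """Difference estimator of a population total under simple random sampling.

    predictions:   model output for every document in the population (auxiliary variable)
    sample_idx:    indices of the documents that were manually annotated
    sample_labels: the manual labels for those documents, in the same order
    """
    n_population = len(predictions)
    n_sample = len(sample_idx)
    predicted_total = sum(predictions)
    # Residuals measure how far the model is off on the annotated sample.
    residual_sum = sum(y - predictions[i] for i, y in zip(sample_idx, sample_labels))
    return predicted_total + (n_population / n_sample) * residual_sum
```

The better the predictions track the true labels, the smaller the residuals and the lower the variance of the estimate, which is why fewer manual annotations are needed for a given precision.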

[98] ChatGPT for automated grading of short answer questions in mechanical ventilation

Tejas Jade,Alex Yartsev

Main category: cs.CL

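TL;DR: The study evaluates ChatGPT 4o for automated grading of short answer questions in postgraduate medical education, finds that its grades differ significantly from human grades, and advises against its use in high-stakes assessment.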

Details

Motivation: Large language models (LLMs) can simulate conversational language and interpret free-text responses, which may make them suitable for automated grading of short answer questions (SAQs) in standardised tests. Method: Uses ChatGPT 4o to grade 557 short-answer responses from 215 students and compares the results with human grades, analysing agreement with mixed-effects modelling, variance component analysis, ICC, Cohen's kappa, and related methods. Result: ChatGPT's marks are systematically lower than human marks (mean bias of -1.34), individual-level agreement is poor (ICC1 = 0.086), and agreement is worst for evaluative and analytic items. Conclusion: LLMs are not recommended for grading postgraduate coursework, because their differences from human grades exceed the boundaries acceptable for high-stakes assessment. Abstract: Standardised tests using short answer questions (SAQs) are common in postgraduate education. Large language models (LLMs) simulate conversational language and interpret unstructured free-text responses in ways aligning with applying SAQ grading rubrics, making them attractive for automated grading. We evaluated ChatGPT 4o to grade SAQs in a postgraduate medical setting using data from 215 students (557 short-answer responses) enrolled in an online course on mechanical ventilation (2020-2024). Deidentified responses to three case-based scenarios were presented to ChatGPT with a standardised grading prompt and rubric. Outputs were analysed using mixed-effects modelling, variance component analysis, intraclass correlation coefficients (ICCs), Cohen's kappa, Kendall's W, and Bland-Altman statistics. ChatGPT awarded systematically lower marks than human graders with a mean difference (bias) of -1.34 on a 10-point scale. ICC values indicated poor individual-level agreement (ICC1 = 0.086), and Cohen's kappa (-0.0786) suggested no meaningful agreement. Variance component analysis showed minimal variability among the five ChatGPT sessions (G-value = 0.87), indicating internal consistency but divergence from the human grader. The poorest agreement was observed for evaluative and analytic items, whereas checklist and prescriptive rubric items had less disagreement. We caution against the use of LLMs in grading postgraduate coursework. Over 60% of ChatGPT-assigned grades differed from human grades by more than acceptable boundaries for high-stakes assessments.
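
Two of the agreement statistics mentioned above, the Bland-Altman bias with its 95% limits of agreement and Cohen's kappa, follow from short formulas. The sketch below is a minimal, assumption-laden illustration: the function names are hypothetical, kappa is computed unweighted on categorical marks, and the study's own analysis pipeline is not described in enough detail to reproduce.

```python
import statistics

def bland_altman(model_marks, human_marks):
    """Mean difference (bias) and 95% limits of agreement between two sets of marks."""
    diffs = [m - h for m, h in zip(model_marks, human_marks)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: observed agreement corrected for chance agreement.

    Undefined (division by zero) when both raters use a single identical category.
    """
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    expected = sum((rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories)
    return (observed - expected) / (1 - expected)
```

A negative bias means the first rater marks lower on average, as reported for ChatGPT here, and a kappa near zero or below means agreement is no better than chance.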

[99] FRAME: Feedback-Refined Agent Methodology for Enhancing Medical Research Insights

Chengzhang Yu,Yiming Zhang,Zhixin Liu,Zenghui Ding,Yining Sun,Zhanpeng Jin

Main category: cs.CL

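TL;DR: The FRAME framework improves the quality of generated medical papers through iterative refinement and structured feedback; experiments show it significantly outperforms conventional approaches and approaches human-level quality.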

Details

Motivation: Automating scientific research with large language models faces challenges in knowledge synthesis and quality assurance, so better methods for medical paper generation are needed. Method: Proposes the FRAME framework, comprising a structured dataset construction method, a tripartite architecture (Generator, Evaluator, Reflector), and a comprehensive evaluation framework. Result: Experiments show that FRAME significantly improves performance across multiple models and evaluation dimensions, with generated papers approaching human-level quality. Conclusion: FRAME offers an efficient and rigorous solution for automated medical paper generation. Abstract: The automation of scientific research through large language models (LLMs) presents significant opportunities but faces critical challenges in knowledge synthesis and quality assurance. We introduce Feedback-Refined Agent Methodology (FRAME), a novel framework that enhances medical paper generation through iterative refinement and structured feedback. Our approach comprises three key innovations: (1) A structured dataset construction method that decomposes 4,287 medical papers into essential research components through iterative refinement; (2) A tripartite architecture integrating Generator, Evaluator, and Reflector agents that progressively improve content quality through metric-driven feedback; and (3) A comprehensive evaluation framework that combines statistical metrics with human-grounded benchmarks. Experimental results demonstrate FRAME's effectiveness, achieving significant improvements over conventional approaches across multiple models (9.91% average gain with DeepSeek V3, comparable improvements with GPT-4o Mini) and evaluation dimensions. Human evaluation confirms that FRAME-generated papers achieve quality comparable to human-authored works, with particular strength in synthesizing future research directions. The results demonstrate that our work can efficiently assist medical research by building a robust foundation for automated medical research paper generation while maintaining rigorous academic standards.

[100] Scientific Hypothesis Generation and Validation: Methods, Datasets, and Future Directions

Adithya Kulkarni,Fatimah Alotaibi,Xinyue Zeng,Longfeng Wu,Tong Zeng,Barry Menglong Yao,Minqian Liu,Shuaicheng Zhang,Lifu Huang,Dawei Zhou

Main category: cs.CL

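TL;DR: This survey examines the use of large language models (LLMs) in scientific hypothesis generation and validation, summarizing the relevant methods, techniques, and future directions.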

Details

Motivation: LLMs are changing how scientific discovery is done through information synthesis, latent relationship discovery, and reasoning augmentation; the paper aims to review this progress systematically. Method: Surveys techniques including symbolic frameworks, generative models, hybrid systems, and multi-agent architectures, and contrasts early symbolic systems with modern LLM approaches. Result: Summarizes applications of LLMs in biomedicine, materials science, and other fields, and proposes future directions such as novelty-aware generation and multimodal integration. Conclusion: LLMs are poised to become an important tool for scientific discovery, but ethics and interpretability require attention. Abstract: Large Language Models (LLMs) are transforming scientific hypothesis generation and validation by enabling information synthesis, latent relationship discovery, and reasoning augmentation. This survey provides a structured overview of LLM-driven approaches, including symbolic frameworks, generative models, hybrid systems, and multi-agent architectures. We examine techniques such as retrieval-augmented generation, knowledge-graph completion, simulation, causal inference, and tool-assisted reasoning, highlighting trade-offs in interpretability, novelty, and domain alignment. We contrast early symbolic discovery systems (e.g., BACON, KEKADA) with modern LLM pipelines that leverage in-context learning and domain adaptation via fine-tuning, retrieval, and symbolic grounding. For validation, we review simulation, human-AI collaboration, causal modeling, and uncertainty quantification, emphasizing iterative assessment in open-world contexts. The survey maps datasets across biomedicine, materials science, environmental science, and social science, introducing new resources like AHTech and CSKG-600. Finally, we outline a roadmap emphasizing novelty-aware generation, multimodal-symbolic integration, human-in-the-loop systems, and ethical safeguards, positioning LLMs as agents for principled, scalable scientific discovery.

[101] Advancing Conversational Diagnostic AI with Multimodal Reasoning

Khaled Saab,Jan Freyberg,Chunjong Park,Tim Strother,Yong Cheng,Wei-Hung Weng,David G. T. Barrett,David Stutz,Nenad Tomasev,Anil Palepu,Valentin Liévin,Yash Sharma,Roma Ruparel,Abdullah Ahmed,Elahe Vedadi,Kimberly Kanada,Cian Hughes,Yun Liu,Geoff Brown,Yang Gao,Sean Li,S. Sara Mahdavi,James Manyika,Katherine Chou,Yossi Matias,Avinatan Hassidim,Dale R. Webster,Pushmeet Kohli,S. M. Ali Eslami,Joëlle Barral,Adam Rodman,Vivek Natarajan,Mike Schaekermann,Tao Tu,Alan Karthikesalingam,Ryutaro Tanno

Main category: cs.CL

TL;DR: AMIE, a multimodal conversational diagnostic AI built on Gemini 2.0 Flash, outperforms primary care physicians (PCPs) in clinical consultations.

Details

Motivation: Existing LLM evaluations of diagnostic conversation are mostly limited to language-only interaction, whereas remote care requires the ability to handle multimodal data. Method: AMIE uses a state-aware dialogue framework and multimodal data handling to conduct structured history-taking in the manner of an experienced clinician. Result: Across 105 evaluation scenarios, AMIE was rated superior to PCPs on most multimodal and non-multimodal axes. Conclusion: Multimodal conversational diagnostic AI has made progress, but real-world use needs further research. Abstract: Large Language Models (LLMs) have demonstrated great potential for conducting diagnostic conversations but evaluation has been largely limited to language-only interactions, deviating from the real-world requirements of remote care delivery. Instant messaging platforms permit clinicians and patients to upload and discuss multimodal medical artifacts seamlessly in medical consultation, but the ability of LLMs to reason over such data while preserving other attributes of competent diagnostic conversation remains unknown. Here we advance the conversational diagnosis and management performance of the Articulate Medical Intelligence Explorer (AMIE) through a new capability to gather and interpret multimodal data, and reason about this precisely during consultations. Leveraging Gemini 2.0 Flash, our system implements a state-aware dialogue framework, where conversation flow is dynamically controlled by intermediate model outputs reflecting patient states and evolving diagnoses. Follow-up questions are strategically directed by uncertainty in such patient states, leading to a more structured multimodal history-taking process that emulates experienced clinicians. We compared AMIE to primary care physicians (PCPs) in a randomized, blinded, OSCE-style study of chat-based consultations with patient actors. We constructed 105 evaluation scenarios using artifacts like smartphone skin photos, ECGs, and PDFs of clinical documents across diverse conditions and demographics. Our rubric assessed multimodal capabilities and other clinically meaningful axes like history-taking, diagnostic accuracy, management reasoning, communication, and empathy. Specialist evaluation showed AMIE to be superior to PCPs on 7/9 multimodal and 29/32 non-multimodal axes (including diagnostic accuracy). The results show clear progress in multimodal conversational diagnostic AI, but real-world translation needs further research.

[102] A Comparative Analysis of Ethical and Safety Gaps in LLMs using Relative Danger Coefficient

Yehor Tereshchenko,Mika Hämäläinen

Main category: cs.CL

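TL;DR: The article compares the ethical performance of several AI models, proposes a new harm metric called the Relative Danger Coefficient (RDC), and stresses the importance of human oversight in high-stakes situations.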

Details

Motivation: The rapid development of AI and LLMs raises ethical questions about safety, misuse, and discrimination, so their ethical performance needs to be assessed. Method: Compares the ethical performance of models including DeepSeek-V3, GPT variants, and Gemini, and introduces the RDC metric. Result: The models differ in ethical performance, and RDC offers a way to quantify harm. Conclusion: Emphasizes the need for human oversight and proposes RDC as a new tool for assessing harm in LLMs. Abstract: Artificial Intelligence (AI) and Large Language Models (LLMs) have rapidly evolved in recent years, showcasing remarkable capabilities in natural language understanding and generation. However, these advancements also raise critical ethical questions regarding safety, potential misuse, discrimination and overall societal impact. This article provides a comparative analysis of the ethical performance of various AI models, including the brand new DeepSeek-V3 (R1 with reasoning and without), various GPT variants (4o, 3.5 Turbo, 4 Turbo, o1/o3 mini) and Gemini (1.5 flash, 2.0 flash and 2.0 flash exp) and highlights the need for robust human oversight, especially in situations with high stakes. Furthermore, we present a new metric for calculating harm in LLMs called Relative Danger Coefficient (RDC).

Paul Landes,Jimeng Sun,Adam Cross

Main category: cs.CL

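TL;DR: The paper proposes a method that combines traditional deep learning with large language models (LLMs) to automatically extract Social Determinants of Health (SDoHs) from clinical text, with clear gains in both efficiency and accuracy.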

Details

Motivation: Social Determinants of Health (SDoHs) strongly affect an individual's health status, but traditional extraction methods are slow and costly, so more efficient automatic extraction is needed. Method: Combines traditional deep learning with LLMs in an efficient classification method that avoids expensive LLM processing, and uses synthetic data and several deep learning models to improve performance. Result: The models outperform a previous reference point on multilabel SDoH classification by 10 points, run 12x faster, and perform well on a dataset supplemented with synthetic data. Conclusion: The method offers an efficient and accurate solution for automatic prediction of SDoHs, particularly for at-risk patients. Abstract: Social Determinants of Health (SDoH) are economic, social and personal circumstances that affect or influence an individual's health status. SDoHs have been shown to be correlated with wellness outcomes, and therefore, are useful to physicians in diagnosing diseases and in decision-making. In this work, we automatically extract SDoHs from clinical text using traditional deep learning and Large Language Models (LLMs) to find the advantages and disadvantages of each on an existing publicly available dataset. Our models outperform a previous reference point on a multilabel SDoH classification by 10 points, and we present a method and model to drastically speed up classification (12X execution time) by eliminating expensive LLM processing. The method we present combines a more nimble and efficient solution that leverages the power of the LLM for precision and traditional deep learning methods for efficiency. We also show highly performant results on a dataset supplemented with synthetic data and several traditional deep learning models that outperform LLMs. Our models and methods offer the next iteration of automatic prediction of SDoHs that impact at-risk patients.

[104] AI-Generated Fall Data: Assessing LLMs and Diffusion Model for Wearable Fall Detection

Sana Alamgeer,Yasine Souissi,Anne H. H. Ngu

Main category: cs.CL

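TL;DR: The study explores using large language models (LLMs) to generate synthetic fall data to offset the scarcity of real data, and evaluates how well different models simulate fall scenarios and how the synthetic data affects fall detection performance.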

Details

Motivation: Real fall data from elderly individuals is scarce, which makes training fall detection systems difficult, so the study explores the potential of synthetic data generation. Method: Generates synthetic datasets with text-to-motion models (T2M, SATO, ParCo) and text-to-text models (GPT4o, GPT4, Gemini), combines them with real datasets, and evaluates their effect on fall detection with an LSTM model; also compares how closely LLM-generated and diffusion-generated data match the real data distribution. Result: The effectiveness of synthetic data depends on dataset characteristics: LLM-generated data performs best at low sampling frequency (20Hz) but is unstable at high frequency (200Hz). Diffusion-generated data is closest to real data but does not consistently improve model performance. Conclusion: The effectiveness of synthetic data depends on sensor placement and fall representation, and the findings inform synthetic data generation for fall detection models. Abstract: Training fall detection systems is challenging due to the scarcity of real-world fall data, particularly from elderly individuals. To address this, we explore the potential of Large Language Models (LLMs) for generating synthetic fall data. This study evaluates text-to-motion (T2M, SATO, ParCo) and text-to-text models (GPT4o, GPT4, Gemini) in simulating realistic fall scenarios. We generate synthetic datasets and integrate them with four real-world baseline datasets to assess their impact on fall detection performance using a Long Short-Term Memory (LSTM) model. Additionally, we compare LLM-generated synthetic data with a diffusion-based method to evaluate their alignment with real accelerometer distributions. Results indicate that dataset characteristics significantly influence the effectiveness of synthetic data, with LLM-generated data performing best in low-frequency settings (e.g., 20Hz) while showing instability in high-frequency datasets (e.g., 200Hz). While text-to-motion models produce more realistic biomechanical data than text-to-text models, their impact on fall detection varies. Diffusion-based synthetic data demonstrates the closest alignment to real data but does not consistently enhance model performance. An ablation study further confirms that the effectiveness of synthetic data depends on sensor placement and fall representation. These findings provide insights into optimizing synthetic data generation for fall detection models.

[105] Personalized Risks and Regulatory Strategies of Large Language Models in Digital Advertising

Haoyang Feng,Yanjun Dai,Yuan Gao

Main category: cs.CL

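TL;DR: The paper studies a personalized advertising recommendation system built on large language models (such as BERT) while addressing user privacy protection and data security, and proposes an approach based on local model training and encryption.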

Details

Motivation: To examine how advertising recommendation systems can be combined with user privacy protection and data security measures in real operations, addressing the privacy risks of personalized advertising recommendation. Method: The BERT model and the attention mechanism are combined to build a personalized advertising recommendation algorithm, covering data preprocessing, feature selection, semantic embedding, and local model training, with data encryption used to protect privacy. Result: Experiments show that BERT-based advertising push effectively improves click-through and conversion rates, while local training and privacy protection mechanisms reduce the risk of privacy leakage. Conclusion: The paper validates the feasibility of a personalized advertising recommendation system built on large language models and shows its potential for improving effectiveness and protecting privacy. Abstract: Although large language models have demonstrated the potential for personalized advertising recommendations in experimental environments, in actual operations, how advertising recommendation systems can be combined with measures such as user privacy protection and data security is still an area worthy of in-depth discussion. To this end, this paper studies the personalized risks and regulatory strategies of large language models in digital advertising. This study first outlines the principles of Large Language Model (LLM), especially the self-attention mechanism based on the Transformer architecture, and how to enable the model to understand and generate natural language text. Then, the BERT (Bidirectional Encoder Representations from Transformers) model and the attention mechanism are combined to construct an algorithmic model for personalized advertising recommendations and user factor risk protection. The specific steps include: data collection and preprocessing, feature selection and construction, using large language models such as BERT for advertising semantic embedding, and ad recommendations based on user portraits. Then, local model training and data encryption are used to ensure the security of user privacy and avoid the leakage of personal data. This paper designs an experiment for personalized advertising recommendation based on a large language model of BERT and verifies it with real user data. The experimental results show that BERT-based advertising push can effectively improve the click-through rate and conversion rate of advertisements. At the same time, through local model training and privacy protection mechanisms, the risk of user privacy leakage can be reduced to a certain extent.

[106] Fine-Tuning Large Language Models and Evaluating Retrieval Methods for Improved Question Answering on Building Codes

Mohammad Aqib,Mohd Hamza,Qipei Mei,Ying Hei Chui

Main category: cs.CL

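TL;DR: The paper proposes a question-answering system based on Retrieval-Augmented Generation (RAG) to handle the complexity of building code queries, focusing on optimizing the retriever and the language model.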

Details

Motivation: Building codes are complex and frequently updated, making manual querying inefficient, so an automated solution is needed. Method: Within a RAG framework, several retrieval methods (such as Elasticsearch) are evaluated and language models are fine-tuned on domain-specific data. Result: Experiments show that Elasticsearch is the most robust retriever and that fine-tuning improves language model performance on building codes. Conclusion: A RAG system that combines an effective retriever with fine-tuned language models can handle the challenges of querying building codes. Abstract: Building codes are regulations that establish standards for the design, construction, and safety of buildings to ensure structural integrity, fire protection, and accessibility. They are often extensive, complex, and subject to frequent updates, making manual querying challenging and time-consuming. Key difficulties include navigating large volumes of text, interpreting technical language, and identifying relevant clauses across different sections. A potential solution is to build a Question-Answering (QA) system that answers user queries based on building codes. Among the various methods for building a QA system, Retrieval-Augmented Generation (RAG) stands out in performance. RAG consists of two components: a retriever and a language model. This study focuses on identifying a suitable retriever method for building codes and optimizing the generational capability of the language model using fine-tuning techniques. We conducted a detailed evaluation of various retrieval methods by performing the retrieval on the National Building Code of Canada (NBCC) and explored the impact of domain-specific fine-tuning on several language models using the dataset derived from NBCC. Our analysis included a comparative assessment of different retrievers and the performance of both pre-trained and fine-tuned models to determine the efficacy and domain-specific adaptation of language models using fine-tuning on the NBCC dataset. Experimental results showed that Elasticsearch proved to be the most robust retriever among all. The findings also indicate that fine-tuning language models on an NBCC-specific dataset can enhance their ability to generate contextually relevant responses. When combined with context retrieved by a powerful retriever like Elasticsearch, this improvement in LLM performance can optimize the RAG system, enabling it to better navigate the complexities of the NBCC.

[107] Reward-SQL: Boosting Text-to-SQL via Stepwise Reasoning and Process-Supervised Rewards

Yuxin Zhang,Meihao Fan,Ju Fan,Mingyang Yi,Yuyu Luo,Jian Tan,Guoliang Li

Main category: cs.CL

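TL;DR: The Reward-SQL framework improves Text-to-SQL performance by introducing Process Reward Models (PRMs), following a "cold start, then PRM supervision" strategy that combines an online training signal with PRM-guided inference.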

Details

Motivation: If misused, PRMs can distort the reasoning trajectory in Text-to-SQL, so how to integrate them effectively needs systematic exploration. Method: SQL queries are first decomposed with Chain-of-CTEs to establish a reasoning baseline, then four PRM integration strategies are studied; the best combines GRPO with best-of-N sampling. Result: On the BIRD benchmark, Reward-SQL gives models supervised by a 7B PRM a 13.1% performance gain, and the GRPO-aligned policy model reaches 68.9% accuracy. Conclusion: Reward-SQL effectively improves Text-to-SQL reasoning through reward-based supervision; the code is publicly available. Abstract: Recent advances in large language models (LLMs) have significantly improved performance on the Text-to-SQL task by leveraging their powerful reasoning capabilities. To enhance accuracy during the reasoning process, external Process Reward Models (PRMs) can be introduced during training and inference to provide fine-grained supervision. However, if misused, PRMs may distort the reasoning trajectory and lead to suboptimal or incorrect SQL generation. To address this challenge, we propose Reward-SQL, a framework that systematically explores how to incorporate PRMs into the Text-to-SQL reasoning process effectively. Our approach follows a "cold start, then PRM supervision" paradigm. Specifically, we first train the model to decompose SQL queries into structured stepwise reasoning chains using common table expressions (Chain-of-CTEs), establishing a strong and interpretable reasoning baseline. Then, we investigate four strategies for integrating PRMs, and find that combining PRM as an online training signal (GRPO) with PRM-guided inference (e.g., best-of-N sampling) yields the best results. Empirically, on the BIRD benchmark, Reward-SQL enables models supervised by a 7B PRM to achieve a 13.1% performance gain across various guidance strategies. Notably, our GRPO-aligned policy model based on Qwen2.5-Coder-7B-Instruct achieves 68.9% accuracy on the BIRD development set, outperforming all baseline methods under the same model size. These results demonstrate the effectiveness of Reward-SQL in leveraging reward-based supervision for Text-to-SQL reasoning. Our code is publicly available.
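
The PRM-guided best-of-N inference mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the candidate format (a list of CTE steps plus a final SELECT), the `toy_step_reward` stand-in for a real PRM, and the choice to aggregate step rewards with `min`. The paper's actual scoring and aggregation may differ.

```python
from typing import Callable, List, Sequence

# A candidate is a chain of CTE steps ending in a final SELECT (hypothetical format).
Candidate = List[str]


def score_candidate(steps: Sequence[str], step_reward: Callable[[str], float]) -> float:
    """Aggregate per-step rewards; taking the minimum (an assumption) lets one bad step sink the chain."""
    return min(step_reward(s) for s in steps)


def best_of_n(candidates: Sequence[Candidate], step_reward: Callable[[str], float]) -> Candidate:
    """Return the candidate whose weakest step has the highest reward."""
    return max(candidates, key=lambda c: score_candidate(c, step_reward))


def toy_step_reward(step: str) -> float:
    """Stand-in for a process reward model: penalise steps that touch an unknown table."""
    return 0.1 if "unknown_table" in step else 0.9


candidates = [
    ["WITH a AS (SELECT id FROM unknown_table)", "SELECT COUNT(*) FROM a"],
    ["WITH a AS (SELECT id FROM orders WHERE total > 100)", "SELECT COUNT(*) FROM a"],
]
best = best_of_n(candidates, toy_step_reward)
```

Scoring each step rather than only the final query is what separates a process reward from an outcome reward: a chain with one faulty CTE can be rejected even if its final SELECT looks plausible.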

[108] REVEAL: Multi-turn Evaluation of Image-Input Harms for Vision LLM

Madhur Jindal,Saurabh Deshpande

Main category: cs.CL

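TL;DR: The paper introduces the REVEAL framework for evaluating safety and ethical issues in Vision Large Language Models (VLLMs), finding that defect rates in multi-turn conversations are significantly higher than in single-turn ones and that GPT-4o shows the most balanced performance.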

Details

Motivation: Traditional safety evaluation frameworks cannot handle the complexity of VLLMs in multi-modal, multi-turn conversations, so a new evaluation method is needed. Method: The REVEAL framework is proposed, comprising automated image mining, synthetic adversarial data generation, multi-turn conversational expansion, and comprehensive harm assessment. Result: Five VLLMs were evaluated; multi-turn conversations showed higher defect rates, GPT-4o had the most balanced performance, and Llama-3.2 had the highest multi-turn defect rate. Conclusion: Multi-turn conversations expose deeper vulnerabilities in VLLMs, and contextual defenses need strengthening, especially against misinformation. Abstract: Vision Large Language Models (VLLMs) represent a significant advancement in artificial intelligence by integrating image-processing capabilities with textual understanding, thereby enhancing user interactions and expanding application domains. However, their increased complexity introduces novel safety and ethical challenges, particularly in multi-modal and multi-turn conversations. Traditional safety evaluation frameworks, designed for text-based, single-turn interactions, are inadequate for addressing these complexities. To bridge this gap, we introduce the REVEAL (Responsible Evaluation of Vision-Enabled AI LLMs) Framework, a scalable and automated pipeline for evaluating image-input harms in VLLMs. REVEAL includes automated image mining, synthetic adversarial data generation, multi-turn conversational expansion using crescendo attack strategies, and comprehensive harm assessment through evaluators like GPT-4o. We extensively evaluated five state-of-the-art VLLMs, GPT-4o, Llama-3.2, Qwen2-VL, Phi3.5V, and Pixtral, across three important harm categories: sexual harm, violence, and misinformation. Our findings reveal that multi-turn interactions result in significantly higher defect rates compared to single-turn evaluations, highlighting deeper vulnerabilities in VLLMs. Notably, GPT-4o demonstrated the most balanced performance as measured by our Safety-Usability Index (SUI) followed closely by Pixtral. Additionally, misinformation emerged as a critical area requiring enhanced contextual defenses. Llama-3.2 exhibited the highest MT defect rate ($16.55 \%$) while Qwen2-VL showed the highest MT refusal rate ($19.1 \%$).

[109] Advanced Deep Learning Approaches for Automated Recognition of Cuneiform Symbols

Shahad Elshehaby,Alavikunhu Panthakkan,Hussain Al-Ahmad,Mina Al-Saad

Main category: cs.CL

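TL;DR: This paper proposes an automated deep-learning method for recognizing and interpreting cuneiform characters and validates model performance on data including the Code of Hammurabi.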

Details

Motivation: To explore the use of deep learning for deciphering ancient cuneiform, combining computational linguistics and archaeology to offer new perspectives on understanding and preserving human history. Method: Five deep-learning models were trained and evaluated on cuneiform recognition and translation, with a focus on accuracy and precision. Result: Two of the models performed very well, recognizing the Akkadian meanings of the cuneiform symbols and providing precise English translations. Conclusion: Future work will explore ensemble and stacking approaches to optimize performance and will further study the historical links between Akkadian and Arabic. Abstract: This paper presents a thoroughly automated method for identifying and interpreting cuneiform characters via advanced deep-learning algorithms. Five distinct deep-learning models were trained on a comprehensive dataset of cuneiform characters and evaluated according to critical performance metrics, including accuracy and precision. Two models demonstrated outstanding performance and were subsequently assessed using cuneiform symbols from the Hammurabi law acquisition, notably Hammurabi Law 1. Each model effectively recognized the relevant Akkadian meanings of the symbols and delivered precise English translations. Future work will investigate ensemble and stacking approaches to optimize performance, utilizing hybrid architectures to improve detection accuracy and reliability. This research explores the linguistic relationships between Akkadian, an ancient Mesopotamian language, and Arabic, emphasizing their historical and cultural linkages. This study demonstrates the capability of deep learning to decipher ancient scripts by merging computational linguistics with archaeology, therefore providing significant insights for the comprehension and conservation of human history.

[110] SOAEsV2-7B/72B: Full-Pipeline Optimization for State-Owned Enterprise LLMs via Continual Pre-Training, Domain-Progressive SFT and Distillation-Enhanced Speculative Decoding

Jingyang Deng,Ran Chen,Jo-Ku Cheng,Jinwen Ma

Main category: cs.CL

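TL;DR: The study proposes a framework for developing domain-specific large language models (LLMs) for Chinese state-owned assets and enterprises (SOAEs), addressing limited model capacity, over-reliance on domain data, and inefficient inference. A three-phase approach (continual pre-training, domain-progressive fine-tuning, and distillation-enhanced inference) significantly improves domain performance while preserving general language capabilities.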

Details

Motivation: Domain-specific LLMs for state-owned enterprises currently face limited model capacity, over-reliance on domain data, and inefficient inference, so a solution that balances generality and domain performance is needed. Method: A three-phase framework: 1) continual pre-training to integrate domain knowledge; 2) domain-progressive fine-tuning (from weakly relevant data to expert-annotated data); 3) distillation-enhanced inference acceleration. Result: Experiments show that the domain pre-training phase retains 99.8% of general capabilities while significantly improving domain performance (a 1.08x improvement in Rouge-1 and 1.17x in BLEU-4). Domain-progressive fine-tuning outperforms single-stage training. Conclusion: The study provides a full-pipeline optimization approach for state-owned enterprise LLMs that balances general language capabilities and domain expertise. Abstract: This study addresses key challenges in developing domain-specific large language models (LLMs) for Chinese state-owned assets and enterprises (SOAEs), where current approaches face three limitations: 1) constrained model capacity that limits knowledge integration and cross-task adaptability; 2) excessive reliance on domain-specific supervised fine-tuning (SFT) data, which neglects the broader applicability of general language patterns; and 3) inefficient inference acceleration for large models processing long contexts. In this work, we propose SOAEsV2-7B/72B, a specialized LLM series developed via a three-phase framework: 1) continual pre-training integrates domain knowledge while retaining base capabilities; 2) domain-progressive SFT employs curriculum-based learning strategy, transitioning from weakly relevant conversational data to expert-annotated SOAEs datasets to optimize domain-specific tasks; 3) distillation-enhanced speculative decoding accelerates inference via logit distillation between 72B target and 7B draft models, achieving 1.39-1.52$\times$ speedup without quality loss. Experimental results demonstrate that our domain-specific pre-training phase maintains 99.8% of original general language capabilities while significantly improving domain performance, resulting in a 1.08$\times$ improvement in Rouge-1 score and a 1.17$\times$ enhancement in BLEU-4 score. Ablation studies further show that domain-progressive SFT outperforms single-stage training, achieving 1.02$\times$ improvement in Rouge-1 and 1.06$\times$ in BLEU-4. Our work introduces a comprehensive, full-pipeline approach for optimizing SOAEs LLMs, bridging the gap between general language capabilities and domain-specific expertise.
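
To make the draft/target interplay behind speculative decoding concrete, here is a toy sketch of the greedy variant. It is not the paper's implementation: the `NextToken` interface, the arithmetic toy models, and the parameter `k` are all hypothetical, and the logit distillation that the paper uses to align the 7B draft with the 72B target is not shown. The point is only that the output matches plain greedy decoding with the target, while the draft supplies several tokens per verification round.

```python
from typing import Callable, List

# Greedy next-token function over a token-id context (hypothetical interface).
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Reference: plain greedy decoding with a single model."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes up to k tokens, the target verifies them."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1) The cheap draft model proposes a short continuation.
        proposal: List[int] = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft(out + proposal))
        # 2) The target checks each position (a real system does this in one batched pass).
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if expected != tok:
                # Keep the agreed prefix, then take the target's own token.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            out.extend(proposal)  # every proposed token was accepted
    return out


# Toy models: the draft agrees with the target except after token 5.
def target(ctx: List[int]) -> int:
    return (ctx[-1] * 3 + 1) % 7


def draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 5 else target(ctx)


spec_out = speculative_decode(target, draft, [1], max_new=10)
ref_out = greedy_decode(target, [1], max_new=10)
```

The more often the draft agrees with the target, the more tokens are accepted per round, which is presumably why aligning the two models through distillation helps the speedup.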

[111] Flower Across Time and Media: Sentiment Analysis of Tang Song Poetry and Visual Correspondence

Shuai Gong,Tiange Zhou

Main category: cs.CL

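TL;DR: The study uses BERT-based sentiment analysis to quantify emotional patterns in floral imagery in Tang and Song poetry and compares them with developments in decorative arts, revealing synergies between literature and art.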

Details

Motivation: To examine the emotional links between floral imagery in literature and visual culture during the Tang and Song periods, filling a gap in existing research. Method: A fine-tuned BERT model analyzes shifts in the emotional connotations of peony and plum blossom imagery in Tang and Song poetry, cross-referenced with contemporaneous decorative arts. Result: The emotional connotations of floral imagery shift measurably between the Tang and Song periods, and synergies emerge between literary expression and artistic representation. Conclusion: The study offers a new perspective on Tang and Song cultural expression and shows the potential of combining computational humanities with traditional sinological methods. Abstract: The Tang (618 to 907) and Song (960 to 1279) dynasties witnessed an extraordinary flourishing of Chinese cultural expression, where floral motifs served as a dynamic medium for both poetic sentiment and artistic design. While previous scholarship has examined these domains independently, the systematic correlation between evolving literary emotions and visual culture remains underexplored. This study addresses that gap by employing BERT-based sentiment analysis to quantify emotional patterns in floral imagery across Tang Song poetry, then validating these patterns against contemporaneous developments in decorative arts. Our approach builds upon recent advances in computational humanities while remaining grounded in traditional sinological methods. By applying a fine-tuned BERT model to analyze peony and plum blossom imagery in classical poetry, we detect measurable shifts in emotional connotations between the Tang and Song periods. These textual patterns are then cross-referenced with visual evidence from textiles, ceramics, and other material culture, revealing previously unrecognized synergies between literary expression and artistic representation.

[112] Osiris: A Lightweight Open-Source Hallucination Detection System

Alex Shan,John Bauer,Christopher D. Manning

Main category: cs.CL

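TL;DR: This paper proposes a supervised fine-tuning approach for detecting hallucinations in RAG systems, trained on a perturbed multi-hop QA dataset, and achieves better recall than GPT-4o.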

Details

Motivation: Hallucinations in RAG systems hinder deployment in production, and existing detection methods (human evaluation or closed-source models) are costly and hard to scale. Method: A perturbed multi-hop QA dataset with induced hallucinations is constructed, and a 7B model is trained on it with supervised fine-tuning. Result: On the RAGTruth benchmark, the 7B model achieves better recall than GPT-4o with competitive precision and accuracy, using far fewer parameters. Conclusion: The approach offers an efficient and scalable solution for hallucination detection in RAG systems. Abstract: Retrieval-Augmented Generation (RAG) systems have gained widespread adoption by application builders because they leverage sources of truth to enable Large Language Models (LLMs) to generate more factually sound responses. However, hallucinations, instances of LLM responses that are unfaithful to the provided context, often prevent these systems from being deployed in production environments. Current hallucination detection methods typically involve human evaluation or the use of closed-source models to review RAG system outputs for hallucinations. Both human evaluators and closed-source models suffer from scaling issues due to their high costs and slow inference speeds. In this work, we introduce a perturbed multi-hop QA dataset with induced hallucinations. Via supervised fine-tuning on our dataset, we achieve better recall with a 7B model than GPT-4o on the RAGTruth hallucination detection benchmark and offer competitive performance on precision and accuracy, all while using a fraction of the parameters. Code is released at our repository.

[113] Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards

Manveer Singh Tamber,Forrest Sheng Bao,Chenyu Xu,Ge Luo,Suleman Kazi,Minseok Bae,Miaoran Li,Ofer Mendelevitch,Renyi Qu,Jimmy Lin

Main category: cs.CL

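TL;DR: The paper examines LLM hallucinations in summarization tasks, proposes the FaithJudge method to improve evaluation, and introduces a new hallucination leaderboard.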

Details

Motivation: LLM hallucinations persist, and existing evaluation tools such as HHEM and Vectara's leaderboard have limitations, so more reliable evaluation is needed. Method: FaithJudge is proposed, an LLM-as-a-judge approach guided by a small set of human hallucination annotations. Result: FaithJudge substantially improves automated hallucination evaluation, and an enhanced hallucination leaderboard is introduced. Conclusion: FaithJudge provides a more reliable way to benchmark LLM hallucinations in RAG. Abstract: Hallucinations remain a persistent challenge for LLMs. RAG aims to reduce hallucinations by grounding responses in contexts. However, even when provided context, LLMs still frequently introduce unsupported information or contradictions. This paper presents our efforts to measure LLM hallucinations with a focus on summarization tasks, assessing how often various LLMs introduce hallucinations when summarizing documents. We discuss Vectara's existing LLM hallucination leaderboard, based on the Hughes Hallucination Evaluation Model (HHEM). While HHEM and Vectara's Hallucination Leaderboard have garnered great research interest, we examine challenges faced by HHEM and current hallucination detection methods by analyzing the effectiveness of these methods on existing hallucination datasets. To address these limitations, we propose FaithJudge, an LLM-as-a-judge approach guided by few-shot human hallucination annotations, which substantially improves automated LLM hallucination evaluation over current methods. We introduce an enhanced hallucination leaderboard centered on FaithJudge, alongside our current hallucination leaderboard, enabling more reliable benchmarking of LLMs for hallucinations in RAG.

[114] An Open-Source Dual-Loss Embedding Model for Semantic Retrieval in Higher Education

Ramteja Sajja,Yusuf Sermet,Ibrahim Demir

Main category: cs.CL

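TL;DR: The study presents two open-source embedding models optimized for educational question answering; using a synthetic dataset and a dual-loss training strategy, they significantly improve semantic retrieval performance.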

Details

Motivation: Existing semantic retrieval systems adapt poorly to the distinctive linguistic and structural characteristics of academic content, so embedding models optimized for the education domain are needed. Method: A synthetic dataset of 3,197 sentence pairs was constructed, and two training strategies were evaluated: a baseline model based on MNRL and a dual-loss model that combines MNRL with CosineSimilarityLoss. Result: Both models outperform open-source baselines, and the dual-loss model narrows the performance gap with high-performing proprietary embeddings (such as OpenAI's text-embedding-3 series). Conclusion: The study provides reusable embedding models and a framework for the education domain, supporting downstream applications such as academic chatbots and RAG systems. Abstract: Recent advances in AI have catalyzed the adoption of intelligent educational tools, yet many semantic retrieval systems remain ill-suited to the unique linguistic and structural characteristics of academic content. This study presents two open-source embedding models fine-tuned for educational question answering, particularly in the context of course syllabi. A synthetic dataset of 3,197 sentence pairs, spanning synonymous terminology, paraphrased questions, and implicit-explicit mappings, was constructed through a combination of manual curation and large language model (LLM)-assisted generation. Two training strategies were evaluated: (1) a baseline model fine-tuned using MultipleNegativesRankingLoss (MNRL), and (2) a dual-loss model that combines MNRL with CosineSimilarityLoss to improve both semantic ranking and similarity calibration. Evaluations were conducted on 28 university course syllabi using a fixed set of natural language questions categorized into course, faculty, and teaching assistant information. Results demonstrate that both fine-tuned models outperform strong open-source baselines, including all-MiniLM-L6-v2 and multi-qa-MiniLM-L6-cos-v1, and that the dual-loss model narrows the performance gap with high-performing proprietary embeddings such as OpenAI's text-embedding-3 series. This work contributes reusable, domain-aligned embedding models and provides a replicable framework for educational semantic retrieval, supporting downstream applications such as academic chatbots, retrieval-augmented generation (RAG) systems, and learning management system (LMS) integrations.
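
The two objectives named in the abstract can be written out directly on embedding matrices. The sketch below is a NumPy re-implementation for illustration, not the authors' training code: the weighted sum controlled by `alpha` is an assumption (the abstract does not say how the losses are combined), and `scale=20.0` is likewise an assumed setting for the ranking loss.

```python
import numpy as np


def normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def mnrl_loss(anchors: np.ndarray, positives: np.ndarray, scale: float = 20.0) -> float:
    """Ranking loss with in-batch negatives: cross-entropy where row i should pick column i."""
    sims = scale * normalize(anchors) @ normalize(positives).T  # shape (B, B)
    sims = sims - sims.max(axis=1, keepdims=True)               # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def cosine_similarity_loss(a: np.ndarray, b: np.ndarray, labels: np.ndarray) -> float:
    """Mean squared error between the cosine similarity of each pair and its gold score."""
    cos = np.sum(normalize(a) * normalize(b), axis=1)
    return float(np.mean((cos - labels) ** 2))


def dual_loss(anchors: np.ndarray, positives: np.ndarray,
              labels: np.ndarray, alpha: float = 0.5) -> float:
    """Weighted sum of the two objectives (the weighting scheme is an assumption)."""
    return (alpha * mnrl_loss(anchors, positives)
            + (1.0 - alpha) * cosine_similarity_loss(anchors, positives, labels))


rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))
positives = anchors + 0.01 * rng.normal(size=(4, 8))  # near-duplicates of the anchors
labels = np.ones(4)                                    # every pair is a true match

aligned = dual_loss(anchors, positives, labels)
misaligned = dual_loss(anchors, np.roll(positives, 1, axis=0), labels)
```

The ranking term only cares that the right positive scores higher than the other items in the batch, while the cosine term pulls the similarity value itself toward the label, which matches the abstract's description of improving both ranking and similarity calibration.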

[115] Chain-of-Thought Tokens are Computer Program Variables

Fangwei Zhu,Peiyi Wang,Zhifang Sui

Main category: cs.CL

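TL;DR: The study finds that what matters in chain-of-thought (CoT) is the storage of intermediate results rather than every step. CoT tokens may work like variables in a program, but with potential drawbacks.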

Details

Motivation: To explore the inner mechanism of CoT in large language models, particularly its role in complex reasoning tasks. Method: The role of CoT tokens is studied empirically on two compositional tasks (multi-digit multiplication and dynamic programming), including keeping only intermediate-result tokens, storing them in an alternative latent form, and randomly intervening on values. Result: Keeping only intermediate-result tokens achieves comparable performance; an alternative storage form does not affect performance; intervening on CoT values changes subsequent tokens and the final answer. Conclusion: CoT tokens may function like program variables, but with potential problems such as unintended shortcuts and computational complexity limits. Abstract: Chain-of-thoughts (CoT) requires large language models (LLMs) to generate intermediate steps before reaching the final answer, and has been proven effective to help LLMs solve complex reasoning tasks. However, the inner mechanism of CoT still remains largely unclear. In this paper, we empirically study the role of CoT tokens in LLMs on two compositional tasks: multi-digit multiplication and dynamic programming. While CoT is essential for solving these problems, we find that preserving only tokens that store intermediate results would achieve comparable performance. Furthermore, we observe that storing intermediate results in an alternative latent form will not affect model performance. We also randomly intervene some values in CoT, and notice that subsequent CoT tokens and the final answer would change correspondingly. These findings suggest that CoT tokens may function like variables in computer programs but with potential drawbacks like unintended shortcuts and computational complexity limits between tokens. The code and data are available at https://github.com/solitaryzero/CoTs_are_Variables.

[116] Rethinking the Relationship between the Power Law and Hierarchical Structures

Kai Nakaishi,Ryo Yoshida,Kohei Kajikawa,Koji Hukushima,Yohei Oseki

Main category: cs.CL

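TL;DR: By analyzing statistical properties of corpora, the study examines the relationship between power-law decay and hierarchical structure in language, finds that the assumptions behind the existing argument do not hold, and calls for that relationship to be reconsidered.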

Details

Motivation: To test whether power-law decay supports the claim of hierarchical structure in language (syntax, child language, and animal signals). Method: Using English corpora, statistical properties of parse trees are analyzed, including mutual information and deviations from PCFGs. Result: The assumptions do not hold, making it difficult to interpret power-law decay as evidence of hierarchical structure. Conclusion: The relationship between the power law and hierarchical structures in language needs to be reconsidered. Abstract: Statistical analysis of corpora provides an approach to quantitatively investigate natural languages. This approach has revealed that several power laws consistently emerge across different corpora and languages, suggesting the universal principles underlying languages. Particularly, the power-law decay of correlation has been interpreted as evidence for underlying hierarchical structures in syntax, semantics, and discourse. This perspective has also been extended to child languages and animal signals. However, the argument supporting this interpretation has not been empirically tested. To address this problem, this study examines the validity of the argument for syntactic structures. Specifically, we test whether the statistical properties of parse trees align with the implicit assumptions in the argument. Using English corpora, we analyze the mutual information, deviations from probabilistic context-free grammars (PCFGs), and other properties in parse trees, as well as in the PCFG that approximates these trees. Our results indicate that the assumptions do not hold for syntactic structures and that it is difficult to apply the proposed argument to child languages and animal signals, highlighting the need to reconsider the relationship between the power law and hierarchical structures.
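
The "decay of correlation" discussed in the abstract is often quantified as the mutual information between symbols that are d positions apart. The sketch below computes that quantity for a flat sequence, as a minimal illustration under that assumption. The paper analyzes parse trees, so its exact measure may differ, and the function name `mutual_information` is hypothetical.

```python
from collections import Counter
from math import log2
from typing import Sequence


def mutual_information(seq: Sequence[str], d: int) -> float:
    """Plug-in estimate (in bits) of the mutual information between seq[i] and seq[i + d]."""
    pairs = list(zip(seq, seq[d:]))
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)
    right = Counter(y for _, y in pairs)
    # Sum of p(x, y) * log2( p(x, y) / (p(x) * p(y)) ) over observed pairs.
    return sum(
        (c / n) * log2(c * n / (left[x] * right[y]))
        for (x, y), c in joint.items()
    )


periodic = "ab" * 50   # perfectly predictable at every distance
constant = "a" * 10    # carries no information at all

mi_periodic = mutual_information(periodic, 2)
mi_constant = mutual_information(constant, 1)
```

Plotting this value against d on log-log axes for a real corpus is the usual way to check for power-law decay: a straight line suggests a power law, while exponential decay curves downward.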

[117] Latent Preference Coding: Aligning Large Language Models via Discrete Latent Codes

Zhuocheng Gong,Jian Guan,Wei Wu,Huishuai Zhang,Dongyan Zhao

Main category: cs.CL

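TL;DR: The paper proposes Latent Preference Coding (LPC), a new framework that models the implicit factors behind human preferences and their combinations without relying on pre-defined reward functions. Experiments show that LPC consistently improves several alignment algorithms and increases robustness.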

Details

Motivation: Existing methods rely on explicit or implicit reward functions and struggle to capture the complex, multifaceted nature of human preferences. Method: LPC uses discrete latent codes to model the implicit factors behind preferences and their combinations, inferring the factors and their importance automatically from data. Result: LPC consistently improves three alignment algorithms across multiple benchmarks; the latent codes capture differences in the distribution of human preferences and increase robustness. Conclusion: LPC provides a unified representation for developing more robust and versatile alignment techniques, supporting the responsible deployment of large language models. Abstract: Large language models (LLMs) have achieved remarkable success, yet aligning their generations with human preferences remains a critical challenge. Existing approaches to preference modeling often rely on an explicit or implicit reward function, overlooking the intricate and multifaceted nature of human preferences that may encompass conflicting factors across diverse tasks and populations. To address this limitation, we introduce Latent Preference Coding (LPC), a novel framework that models the implicit factors as well as their combinations behind holistic preferences using discrete latent codes. LPC seamlessly integrates with various offline alignment algorithms, automatically inferring the underlying factors and their importance from data without relying on pre-defined reward functions and hand-crafted combination weights. Extensive experiments on multiple benchmarks demonstrate that LPC consistently improves upon three alignment algorithms (DPO, SimPO, and IPO) using three base models (Mistral-7B, Llama3-8B, and Llama3-8B-Instruct). Furthermore, deeper analysis reveals that the learned latent codes effectively capture the differences in the distribution of human preferences and significantly enhance the robustness of alignment against noise in data. By providing a unified representation for the multifarious preference factors, LPC paves the way towards developing more robust and versatile alignment techniques for the responsible deployment of powerful LLMs.

[118] Rethinking Invariance in In-context Learning

Lizhe Fang,Yifei Wang,Khashayar Gatmiry,Lei Fang,Yisen Wang

Main category: cs.CL

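TL;DR: The paper proposes InvICL, a method that addresses the sensitivity of In-Context Learning (ICL) to example order while satisfying two key design elements: information non-leakage and context interdependence.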

Details

Motivation: ICL in auto-regressive large language models is sensitive to the order of examples, and existing methods cannot achieve information non-leakage and context interdependence at the same time, which limits their performance. Method: InvICL is designed to ensure both information non-leakage and context interdependence, achieving permutation invariance in ICL. Result: Experiments show that InvICL outperforms previous methods on most benchmark datasets and generalizes better across varying input lengths. Conclusion: InvICL is an effective permutation-invariant ICL method that addresses the limitations of existing approaches. Abstract: In-Context Learning (ICL) has emerged as a pivotal capability of auto-regressive large language models, yet it is hindered by a notable sensitivity to the ordering of context examples regardless of their mutual independence. To address this issue, recent studies have introduced several variant algorithms of ICL that achieve permutation invariance. However, many of these do not exhibit comparable performance with the standard auto-regressive ICL algorithm. In this work, we identify two crucial elements in the design of an invariant ICL algorithm: information non-leakage and context interdependence, which are not simultaneously achieved by any of the existing methods. These investigations lead us to the proposed Invariant ICL (InvICL), a methodology designed to achieve invariance in ICL while ensuring the two properties. Empirically, our findings reveal that InvICL surpasses previous models, both invariant and non-invariant, in most benchmark datasets, showcasing superior generalization capabilities across varying input lengths. Code is available at https://github.com/PKU-ML/InvICL.

Cedric Waterschoot,Nava Tintarev,Francesco Barile

Main category: cs.CL

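TL;DR: The study investigates under which conditions large language models (LLMs) can correctly perform social choice-based aggregation strategies with zero-shot learning, and analyzes how prompt formatting affects accuracy.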

Details

Motivation: To explore the conditions under which LLMs can be applied in Group Recommender Systems (GRS), in particular the effects of zero-shot learning and prompt formatting on performance. Method: Experiments analyze how accuracy is affected by group complexity (number of users and items), different LLMs, prompting conditions (such as In-Context Learning or generating explanations), and preference formatting. Result: Performance starts to deteriorate beyond 100 ratings, though models differ in their sensitivity to complexity; In-Context Learning significantly improves performance at higher complexity. Conclusion: Future research should account for the effect of group complexity on LLM performance; smaller models can also generate group recommendations under the right conditions, which suits low-compute settings. Abstract: Large Language Models (LLMs) are increasingly applied in recommender systems aimed at both individuals and groups. Previously, Group Recommender Systems (GRS) often used social choice-based aggregation strategies to derive a single recommendation based on the preferences of multiple people. In this paper, we investigate under which conditions language models can perform these strategies correctly based on zero-shot learning and analyse whether the formatting of the group scenario in the prompt affects accuracy. We specifically focused on the impact of group complexity (number of users and items), different LLMs, different prompting conditions, including In-Context learning or generating explanations, and the formatting of group preferences. Our results show that performance starts to deteriorate when considering more than 100 ratings. However, not all language models were equally sensitive to growing group complexity. Additionally, we showed that In-Context Learning (ICL) can significantly increase the performance at higher degrees of group complexity, while adding other prompt modifications, specifying domain cues or prompting for explanations, did not impact accuracy. We conclude that future research should include group complexity as a factor in GRS evaluation due to its effect on LLM performance. Furthermore, we showed that formatting the group scenarios differently, such as rating lists per user or per item, affected accuracy. All in all, our study implies that smaller LLMs are capable of generating group recommendations under the right conditions, making the case for using smaller models that require less computing power and costs.
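
For readers unfamiliar with social choice-based aggregation, the sketch below implements three strategies commonly used in group recommendation: additive, least misery, and most pleasure. The abstract does not say which strategies the paper tests, so treat this selection, the rating data, and the function names as illustrative assumptions. These deterministic functions compute the kind of ground truth an LLM's answer could be checked against.

```python
from typing import Callable, Dict

# user -> item -> rating (hypothetical example data)
Ratings = Dict[str, Dict[str, float]]


def additive(ratings: Ratings, item: str) -> float:
    """Sum of all group members' ratings for the item."""
    return sum(user[item] for user in ratings.values())


def least_misery(ratings: Ratings, item: str) -> float:
    """The unhappiest member's rating decides the group score."""
    return min(user[item] for user in ratings.values())


def most_pleasure(ratings: Ratings, item: str) -> float:
    """The happiest member's rating decides the group score."""
    return max(user[item] for user in ratings.values())


def recommend(ratings: Ratings, strategy: Callable[[Ratings, str], float]) -> str:
    """Return the item with the highest group score under the given strategy."""
    items = sorted(next(iter(ratings.values())))
    return max(items, key=lambda item: strategy(ratings, item))


ratings: Ratings = {
    "ann": {"A": 5, "B": 3, "C": 2},
    "bob": {"A": 1, "B": 3, "C": 2},
    "cat": {"A": 5, "B": 3, "C": 4},
}

by_sum = recommend(ratings, additive)          # A scores 11, B 9, C 8
by_misery = recommend(ratings, least_misery)   # A scores 1, B 3, C 2
by_pleasure = recommend(ratings, most_pleasure)
```

With three users and three items the model has to track nine ratings; the abstract reports that accuracy starts to deteriorate once a scenario exceeds 100 ratings.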

[120] Scalable Multi-Stage Influence Function for Large Language Models via Eigenvalue-Corrected Kronecker-Factored Parameterization

Yuntai Bao,Xuhong Zhang,Tianyu Du,Xinkui Zhao,Jiang Zong,Hao Peng,Jianwei Yin

Main category: cs.CL

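TL;DR: The paper proposes a multi-stage influence function that attributes the predictions of fine-tuned large language models to pre-training data, using EK-FAC parameterization to improve efficiency.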

Details

Motivation: Existing methods cannot compute multi-stage influence and do not scale to billion-parameter LLMs, so a new method is needed to explain where the predictions of fine-tuned models come from. Method: A multi-stage influence function is proposed and combined with EK-FAC parameterization for efficient approximation. Result: Experiments validate the scalability of the EK-FAC approximation and the effectiveness of the multi-stage influence function, and case studies demonstrate its interpretive power. Conclusion: The multi-stage influence function offers a new perspective on understanding the predictions of fine-tuned models and has practical value. Abstract: Pre-trained large language models (LLMs) are commonly fine-tuned to adapt to downstream tasks. Since the majority of knowledge is acquired during pre-training, attributing the predictions of fine-tuned LLMs to their pre-training data may provide valuable insights. Influence functions have been proposed as a means to explain model predictions based on training data. However, existing approaches fail to compute "multi-stage" influence and lack scalability to billion-scale LLMs. In this paper, we propose the multi-stage influence function to attribute the downstream predictions of fine-tuned LLMs to pre-training data under the full-parameter fine-tuning paradigm. To enhance the efficiency and practicality of our multi-stage influence function, we leverage Eigenvalue-corrected Kronecker-Factored (EK-FAC) parameterization for efficient approximation. Empirical results validate the superior scalability of EK-FAC approximation and the effectiveness of our multi-stage influence function. Additionally, case studies on a real-world LLM, dolly-v2-3b, demonstrate its interpretive power, with exemplars illustrating insights provided by multi-stage influence estimates. Our code is public at https://github.com/colored-dye/multi_stage_influence_function.

[121] G-FOCUS: Towards a Robust Method for Assessing UI Design Persuasiveness

Jaehyun Jeon,Janghan Yoon,Minsoo Kim,Sumin Shim,Yejin Choi,Hanbin Kim,Youngjae Yu

Main category: cs.CL

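TL;DR: The paper proposes the WiserUI-Bench benchmark and the G-FOCUS inference strategy for assessing the persuasiveness of UI designs, aiming to complement costly A/B testing.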

Details

Motivation: Traditional A/B testing is costly and time-consuming, and existing Vision-Language Model (VLM) approaches do not effectively assess the comparative persuasiveness of UI designs. Method: The WiserUI-Bench benchmark (300 UI image pairs) and the G-FOCUS inference strategy are introduced to reduce position bias and improve evaluation accuracy. Result: Experiments show that G-FOCUS outperforms existing inference strategies in consistency and accuracy. Conclusion: The study offers a scalable approach to assessing UI design persuasiveness and may advance UI preference modeling and design optimization. Abstract: Evaluating user interface (UI) design effectiveness extends beyond aesthetics to influencing user behavior, a principle central to Design Persuasiveness. A/B testing is the predominant method for determining which UI variations drive higher user engagement, but it is costly and time-consuming. While recent Vision-Language Models (VLMs) can process automated UI analysis, current approaches focus on isolated design attributes rather than comparative persuasiveness, the key factor in optimizing user interactions. To address this, we introduce WiserUI-Bench, a benchmark designed for the Pairwise UI Design Persuasiveness Assessment task, featuring 300 real-world UI image pairs labeled with A/B test results and expert rationales. Additionally, we propose G-FOCUS, a novel inference-time reasoning strategy that enhances VLM-based persuasiveness assessment by reducing position bias and improving evaluation accuracy. Experimental results show that G-FOCUS surpasses existing inference strategies in consistency and accuracy for pairwise UI evaluation. Through promoting VLM-driven evaluation of UI persuasiveness, our work offers an approach to complement A/B testing, propelling progress in scalable UI preference modeling and design optimization. Code and data will be released publicly.

[122] Image-Text Relation Prediction for Multilingual Tweets

Matīss Rikters,Edison Marrese-Taylor

Main category: cs.CL

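TL;DR: The study examines how multilingual vision-language models handle image-text relation prediction in different languages and constructs a balanced benchmark dataset.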

Details

Motivation: The relation between images and text in social media posts is often unclear, so the performance of multilingual models on this task needs to be explored. Method: Builds a balanced dataset from Twitter posts in Latvian with their English translations and tests recent vision-language models on it. Result: More recent models perform better on the task, but there is still room for improvement. Conclusion: Multilingual vision-language models show promise for image-text relation prediction but need further work. Abstract: Various social networks have been allowing media uploads for over a decade now. Still, it has not always been clear what their relation to the posted text is, or even if there is any at all. In this work, we explore how multilingual vision-language models tackle the task of image-text relation prediction in different languages, and construct a dedicated balanced benchmark data set from Twitter posts in Latvian along with their manual translations into English. We compare our results to previous work and show that the more recently released vision-language model checkpoints are becoming increasingly capable at this task, but there is still much room for further improvement.

[123] Teochew-Wild: The First In-the-wild Teochew Dataset with Orthographic Annotations

Linrong Pan,Chenglong Jiang,Gaoze Hou,Ying Gao

Main category: cs.CL

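TL;DR: The paper describes the construction of the Teochew-Wild corpus, which contains 18.9 hours of multi-speaker Teochew speech with precise orthographic and pinyin annotations, provides supporting tools, and validates the corpus on ASR and TTS tasks.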

Details

Motivation: Teochew is a low-resource language with no publicly available speech dataset, which limits research and applications. The paper aims to fill this gap. Method: Builds a Teochew speech corpus covering formal and colloquial expressions, provides orthographic and pinyin annotations, and develops supporting tools. Result: Experiments validate the corpus on automatic speech recognition (ASR) and text-to-speech (TTS) tasks. Conclusion: Teochew-Wild is the first publicly available Teochew dataset and an important resource for research and applications in low-resource languages. Abstract: This paper reports the construction of Teochew-Wild, a speech corpus of the Teochew dialect. The corpus includes 18.9 hours of in-the-wild Teochew speech data from multiple speakers, covering both formal and colloquial expressions, with precise orthographic and pinyin annotations. Additionally, we provide supplementary text processing tools and resources to propel research and applications in speech tasks for this low-resource language, such as automatic speech recognition (ASR) and text-to-speech (TTS). To the best of our knowledge, this is the first publicly available Teochew dataset with accurate orthographic annotations. We conduct experiments on the corpus, and the results validate its effectiveness in ASR and TTS tasks.

[124] Performance Evaluation of Large Language Models in Bangla Consumer Health Query Summarization

Ajwad Abrar,Farzana Tabassum,Sabbir Ahmed

Main category: cs.CL

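TL;DR: The study evaluates the zero-shot ability of nine large language models to summarize Bangla consumer health queries and finds that Mixtral-8x22b-Instruct performs best, on par with the fine-tuned Bangla T5 model.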

Details

Motivation: Consumer health queries in Bangla, a low-resource language, often contain extraneous information that slows medical responses, so efficient summarization methods are needed. Method: Uses the BanglaCHQ-Summ dataset (2,350 annotated pairs) to compare the zero-shot performance of nine LLMs, measured with ROUGE metrics. Result: Mixtral-8x22b-Instruct performs best on ROUGE-1 and ROUGE-L, while Bangla T5 leads on ROUGE-2. Zero-shot LLMs are comparable to the fine-tuned model. Conclusion: Zero-shot LLMs show considerable potential for low-resource language tasks and offer a scalable solution for medical query summarization. Abstract: Consumer Health Queries (CHQs) in Bengali (Bangla), a low-resource language, often contain extraneous details, complicating efficient medical responses. This study investigates the zero-shot performance of nine advanced large language models (LLMs): GPT-3.5-Turbo, GPT-4, Claude-3.5-Sonnet, Llama3-70b-Instruct, Mixtral-8x22b-Instruct, Gemini-1.5-Pro, Qwen2-72b-Instruct, Gemma-2-27b, and Athene-70B, in summarizing Bangla CHQs. Using the BanglaCHQ-Summ dataset comprising 2,350 annotated query-summary pairs, we benchmarked these LLMs using ROUGE metrics against Bangla T5, a fine-tuned state-of-the-art model. Mixtral-8x22b-Instruct emerged as the top performing model in ROUGE-1 and ROUGE-L, while Bangla T5 excelled in ROUGE-2. The results demonstrate that zero-shot LLMs can rival fine-tuned models, achieving high-quality summaries even without task-specific training. This work underscores the potential of LLMs in addressing challenges in low-resource languages, providing scalable solutions for healthcare query summarization.
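
As a rough illustration of the ROUGE-1 metric mentioned in the abstract, the following sketch computes a unigram-overlap F1 score between a reference summary and a candidate summary. The function name `rouge1_f1` and the whitespace tokenization are assumptions made for this example; the paper's actual evaluation setup (tokenizer, stemming, ROUGE implementation) is not described in the abstract.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between a reference and a candidate summary.

    Assumption: tokens are obtained by whitespace splitting, which is a
    simplification; Bangla text would normally need a proper tokenizer.
    """
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

ROUGE-2 follows the same pattern with bigrams in place of unigrams, and ROUGE-L replaces the overlap count with the length of the longest common subsequence.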

[125] Reliably Bounding False Positives: A Zero-Shot Machine-Generated Text Detection Framework via Multiscaled Conformal Prediction

Xiaowei Zhu,Yubing Ren,Yanan Cao,Xixun Lin,Fang Fang,Yangxi Li

Main category: cs.CL

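TL;DR: The paper proposes a zero-shot machine-generated text detection framework based on Multiscaled Conformal Prediction (MCP), which addresses the high false positive rates (FPRs) of existing detectors while improving detection performance.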

Details

Motivation: The rapid development of large language models has raised concerns about misuse, and existing detection methods focus too heavily on accuracy while neglecting the societal risks of high false positive rates. Method: Uses Conformal Prediction (CP) to bound the false positive rate from above, proposes the Multiscaled Conformal Prediction (MCP) framework to overcome the resulting loss in detection performance, and introduces the high-quality RealDet dataset. Result: Empirical evaluation shows that MCP effectively constrains FPRs, significantly improves detection performance, and increases robustness to adversarial attacks. Conclusion: The MCP framework both constrains FPRs and improves detection performance, providing a more reliable approach to machine-generated text detection. Abstract: The rapid advancement of large language models has raised significant concerns regarding their potential misuse by malicious actors. As a result, developing effective detectors to mitigate these risks has become a critical priority. However, most existing detection methods focus excessively on detection accuracy, often neglecting the societal risks posed by high false positive rates (FPRs). This paper addresses this issue by leveraging Conformal Prediction (CP), which effectively constrains the upper bound of FPRs. While directly applying CP constrains FPRs, it also leads to a significant reduction in detection performance. To overcome this trade-off, this paper proposes a Zero-Shot Machine-Generated Text Detection Framework via Multiscaled Conformal Prediction (MCP), which both enforces the FPR constraint and improves detection performance. This paper also introduces RealDet, a high-quality dataset that spans a wide range of domains, ensuring realistic calibration and enabling superior detection performance when combined with MCP. Empirical evaluations demonstrate that MCP effectively constrains FPRs, significantly enhances detection performance, and increases robustness against adversarial attacks across multiple detectors and datasets.
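
To make the idea of bounding the FPR with conformal prediction concrete, here is a minimal sketch of plain split conformal calibration. It is not the paper's MCP method, whose multiscaled construction is not detailed in the abstract. The names `conformal_threshold` and `is_machine_generated` are hypothetical, and the sketch assumes a detector that outputs higher scores for more machine-like text and a calibration set of scores from human-written texts that is exchangeable with future human-written texts.

```python
import math


def conformal_threshold(human_scores, alpha):
    """Pick a score threshold from human-written calibration texts.

    Under exchangeability, a new human-written text scores above the
    returned threshold with probability at most alpha, so the false
    positive rate of the rule `score > threshold` is bounded by alpha.
    """
    n = len(human_scores)
    # Rank of the conformal quantile among the n calibration scores.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points to certify this alpha: never flag.
        return math.inf
    return sorted(human_scores)[k - 1]


def is_machine_generated(score, threshold):
    """Flag a text as machine-generated only if it exceeds the threshold."""
    return score > threshold
```

With seven calibration scores and alpha = 0.25, the threshold is the sixth-smallest score; asking for alpha = 0.05 with the same seven points returns an infinite threshold, which reflects that a small calibration set cannot certify a very low FPR.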

[126] Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders

Boyi Deng,Yu Wan,Yidan Zhang,Baosong Yang,Fuli Feng

Main category: cs.CL

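TL;DR: The paper investigates the mechanisms behind the multilingual capabilities of large language models (LLMs), proposes a new approach based on Sparse Autoencoders (SAEs), identifies language-specific features, and demonstrates their use for controlling the output language.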

Details

Motivation: Existing neuron-based and internal-activation-based methods for analyzing multilingual capabilities are limited by issues such as superposition and layer-wise activation variance, so a more reliable method is needed. Method: Decomposes LLM activations with Sparse Autoencoders (SAEs), proposes a new metric for the monolinguality of features, and verifies language specificity through feature ablation experiments. Result: Some SAE features are strongly tied to specific languages, and ablating them significantly affects only one language; some languages also have multiple synergistic features. Conclusion: SAEs enable a finer-grained analysis of multilingual capabilities and can be used to strengthen language control. Abstract: The mechanisms behind multilingual capabilities in Large Language Models (LLMs) have been examined using neuron-based or internal-activation-based methods. However, these methods often face challenges such as superposition and layer-wise activation variance, which limit their reliability. Sparse Autoencoders (SAEs) offer a more nuanced analysis by decomposing the activations of LLMs into a sparse linear combination of SAE features. We introduce a novel metric to assess the monolinguality of features obtained from SAEs, discovering that some features are strongly related to specific languages. Additionally, we show that ablating these SAE features only significantly reduces abilities in one language of LLMs, leaving others almost unaffected. Interestingly, we find some languages have multiple synergistic SAE features, and ablating them together yields greater improvement than ablating them individually. Moreover, we leverage these SAE-derived language-specific features to enhance steering vectors, achieving control over the language generated by LLMs.

[127] A Benchmark Dataset and a Framework for Urdu Multimodal Named Entity Recognition

Hussain Ahmad,Qingyang Zeng,Jing Wan

Main category: cs.CL

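TL;DR: The paper presents the U-MNER framework and the Twitter2015-Urdu dataset, filling a research gap in Urdu Multimodal Named Entity Recognition (MNER), and reports benchmark results.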

Details

Motivation: Multimodal Named Entity Recognition (MNER) is under-researched for low-resource languages such as Urdu, and datasets are scarce. Method: Proposes the U-MNER framework, which extracts textual and visual features with Urdu-BERT and ResNet and combines them through a Cross-Modal Fusion Module. Result: U-MNER achieves state-of-the-art performance on the Twitter2015-Urdu dataset. Conclusion: The work lays a foundation for MNER research in low-resource languages and provides a standardized benchmark dataset. Abstract: The emergence of multimodal content, particularly text and images on social media, has positioned Multimodal Named Entity Recognition (MNER) as an increasingly important area of research within Natural Language Processing. Despite progress in high-resource languages such as English, MNER remains underexplored for low-resource languages like Urdu. The primary challenges include the scarcity of annotated multimodal datasets and the lack of standardized baselines. To address these challenges, we introduce the U-MNER framework and release the Twitter2015-Urdu dataset, a pioneering resource for Urdu MNER. Adapted from the widely used Twitter2015 dataset, it is annotated with Urdu-specific grammar rules. We establish benchmark baselines by evaluating both text-based and multimodal models on this dataset, providing comparative analyses to support future research on Urdu MNER. The U-MNER framework integrates textual and visual context using Urdu-BERT for text embeddings and ResNet for visual feature extraction, with a Cross-Modal Fusion Module to align and fuse information. Our model achieves state-of-the-art performance on the Twitter2015-Urdu dataset, laying the groundwork for further MNER research in low-resource languages.

[128] QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation

Mengze Hong,Wailing Ng,Di Jiang,Chen Jason Zhang

Main category: cs.CL

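TL;DR: QualBench is the first multi-domain QA benchmark dedicated to localized evaluation of Chinese large language models (LLMs), with over 17,000 questions across six vertical domains. Chinese LLMs outperform non-Chinese models on localized domain knowledge, but there is still room for improvement.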

Details

Motivation: Existing benchmarks cover vertical domains poorly and lack evaluation targeted at the Chinese working context; qualification exams offer a framework for filling this gap. Method: Builds a multi-domain dataset from 24 Chinese qualification exams and compares models through comprehensive evaluation. Result: Qwen2.5 outperforms GPT-4o and Chinese LLMs outperform non-Chinese models overall, but the best accuracy is only 75.26%, indicating gaps in domain coverage. Conclusion: Localized domain knowledge is critical to LLM performance; multi-domain RAG knowledge enhancement and Federated Learning are suggested for improving vertical-domain LLMs. Abstract: The rapid advancement of Chinese large language models (LLMs) underscores the need for domain-specific evaluations to ensure reliable applications. However, existing benchmarks often lack coverage in vertical domains and offer limited insights into the Chinese working context. Leveraging qualification exams as a unified framework for human expertise evaluation, we introduce QualBench, the first multi-domain Chinese QA benchmark dedicated to localized assessment of Chinese LLMs. The dataset includes over 17,000 questions across six vertical domains, with data selections grounded in 24 Chinese qualifications to closely align with national policies and working standards. Through comprehensive evaluation, the Qwen2.5 model outperformed the more advanced GPT-4o, with Chinese LLMs consistently surpassing non-Chinese models, highlighting the importance of localized domain knowledge in meeting qualification requirements. The best performance of 75.26% reveals the current gaps in domain coverage within model capabilities. Furthermore, we present the failure of LLM collaboration with crowdsourcing mechanisms and suggest the opportunities for multi-domain RAG knowledge enhancement and vertical domain LLM training with Federated Learning.

[129] T-T: Table Transformer for Tagging-based Aspect Sentiment Triplet Extraction

Kun Peng,Chaodong Tong,Cong Cao,Hao Peng,Qian Li,Guanlin Wu,Lei Jiang,Yanbing Liu,Philip S. Yu

Main category: cs.CL

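TL;DR: The paper proposes a Transformer-based table tagging method (T-T) that introduces stripe attention and a loop-shift strategy to address the overly long sequences and unfair local attention interaction of conventional approaches, improving performance on the ASTE task.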

Details

Motivation: In ASTE, table tagging methods encode a sentence as a 2-dimensional table, but applying standard Transformers directly leads to overly long sequences and unfair local attention interaction. Method: Proposes the Table-Transformer (T-T), which uses a stripe attention mechanism and a loop-shift strategy to restrict global attention to local windows and to enable interaction between windows. Result: Experiments show that T-T, used as a downstream relation learning module, achieves state-of-the-art performance at lower computational cost. Conclusion: With its attention design, T-T addresses the challenges of using Transformers for ASTE and offers an efficient solution for table tagging methods. Abstract: Aspect sentiment triplet extraction (ASTE) aims to extract triplets composed of aspect terms, opinion terms, and sentiment polarities from given sentences. The table tagging method is a popular approach to addressing this task, which encodes a sentence into a 2-dimensional table, allowing for the tagging of relations between any two words. Previous efforts have focused on designing various downstream relation learning modules to better capture interactions between tokens in the table, revealing that a stronger capability to capture relations can lead to greater improvements in the model. Motivated by this, we attempt to directly utilize transformer layers as downstream relation learning modules. Due to the powerful semantic modeling capability of transformers, it is foreseeable that this will lead to excellent improvement. However, owing to the quadratic relation between the length of the table and the length of the input sentence sequence, using transformers directly faces two challenges: overly long table sequences and unfair local attention interaction. To address these challenges, we propose a novel Table-Transformer (T-T) for the tagging-based ASTE method. Specifically, we introduce a stripe attention mechanism with a loop-shift strategy to tackle these challenges. The former modifies the global attention mechanism to only attend to a 2-dimensional local attention window, while the latter facilitates interaction between different attention windows. Extensive and comprehensive experiments demonstrate that the T-T, as a downstream relation learning module, achieves state-of-the-art performance with lower computational costs.

[130] Toward Reasonable Parrots: Why Large Language Models Should Argue with Us by Design

Elena Musi,Nadin Kokciyan,Khalid Al-Khatib,Davide Ceolin,Emmanuelle Dietz,Klara Gutekunst,Annette Hautli-Janisz,Cristian Manuel Santibañez Yañez,Jodi Schneider,Jonas Scholz,Cor Steging,Jacky Visser,Henning Wachsmuth

Main category: cs.CL

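TL;DR: This position paper argues for conversational technology that supports argumentative processes, points out the shortcomings of current large language models (LLMs), and proposes an ideal design that reframes LLMs as tools for exercising critical thinking.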

Details

Motivation: Current LLMs are inadequate for supporting argumentative processes, and technology designed in line with argumentation theory is needed. Method: Introduces the concept of "reasonable parrots", LLMs designed around the argumentation-theoretic principles of relevance, responsibility, and freedom and around argumentative dialogical moves. Result: The ideal technology design aims to enhance argumentative skills rather than replace human thinking. Conclusion: The principles of argumentation theory should serve as the foundation of LLM-based technology design in order to support critical thinking. Abstract: In this position paper, we advocate for the development of conversational technology that is inherently designed to support and facilitate argumentative processes. We argue that, at present, large language models (LLMs) are inadequate for this purpose, and we propose an ideal technology design aimed at enhancing argumentative skills. This involves re-framing LLMs as tools to exercise our critical thinking rather than to replace it. We introduce the concept of 'reasonable parrots' that embody the fundamental principles of relevance, responsibility, and freedom, and that interact through argumentative dialogical moves. These principles and moves arise out of millennia of work in argumentation theory and should serve as the starting point for LLM-based technology that incorporates basic principles of argumentation.

[131] ICon: In-Context Contribution for Automatic Data Selection

Yixin Yang,Qingxiu Dong,Linli Yao,Fangwei Zhu,Zhifang Sui

Main category: cs.CL

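TL;DR: The paper proposes ICon, a gradient-free method that exploits the implicit fine-tuning nature of in-context learning to measure data contribution efficiently, without gradient computation or hand-designed heuristics. Experiments show that ICon performs well across a range of benchmarks.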

Details

Motivation: Existing data selection methods rely on computationally expensive gradients or hand-designed heuristics and fail to fully exploit the intrinsic attributes of data, so a more efficient method with less human bias is needed. Method: ICon assesses sample contribution through the implicit learning that occurs in in-context learning; it comprises three components and requires neither gradient computation nor manually designed indicators. Result: Across multiple benchmarks, models trained on only 15% of the data, selected by ICon, outperform both the full dataset and other widely used selection methods. Conclusion: ICon provides an efficient data selection method with reduced human bias, and the selected samples combine task diversity with appropriate difficulty. Abstract: Data selection for instruction tuning is essential for improving the performance of Large Language Models (LLMs) and reducing training cost. However, existing automated selection methods either depend on computationally expensive gradient-based measures or manually designed heuristics, which may fail to fully exploit the intrinsic attributes of data. In this paper, we propose In-context Learning for Contribution Measurement (ICon), a novel gradient-free method that takes advantage of the implicit fine-tuning nature of in-context learning (ICL) to measure sample contribution without gradient computation or manual indicator engineering. ICon offers a computationally efficient alternative to gradient-based methods and reduces human inductive bias inherent in heuristic-based approaches. ICon comprises three components and identifies high-contribution data by assessing performance shifts under implicit learning through ICL. Extensive experiments on three LLMs across 12 benchmarks and 5 pairwise evaluation sets demonstrate the effectiveness of ICon. Remarkably, on LLaMA3.1-8B, models trained on 15% of ICon-selected data outperform full datasets by 5.42 percentage points and exceed the best performance of widely used selection methods by 2.06 percentage points. We further analyze high-contribution samples selected by ICon, which show both diverse tasks and appropriate difficulty levels, rather than just the hardest ones.

[132] Frame In, Frame Out: Do LLMs Generate More Biased News Headlines than Humans?

Valeria Pastorino,Nafise Sadat Moosavi

Main category: cs.CL

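TL;DR: The study finds that news content generated by large language models (LLMs) shows more pronounced framing bias than human-written content on politically and socially sensitive topics, with significant differences across model architectures.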

Details

Motivation: To examine the framing bias that LLMs may introduce or amplify in automated news generation, and to assess its potential impact on public perception. Method: Analyzes news content generated by out-of-the-box and fine-tuned LLMs and compares its framing with that of human authors. Result: LLMs show more pronounced framing bias on sensitive topics, and the bias varies significantly across models. Conclusion: Effective post-training mitigation strategies and stricter evaluation frameworks are needed to keep automated news content balanced. Abstract: Framing in media critically shapes public perception by selectively emphasizing some details while downplaying others. With the rise of large language models in automated news and content creation, there is growing concern that these systems may introduce or even amplify framing biases compared to human authors. In this paper, we explore how framing manifests in both out-of-the-box and fine-tuned LLM-generated news content. Our analysis reveals that, particularly in politically and socially sensitive contexts, LLMs tend to exhibit more pronounced framing than their human counterparts. In addition, we observe significant variation in framing tendencies across different model architectures, with some models displaying notably higher biases. These findings point to the need for effective post-training mitigation strategies and tighter evaluation frameworks to ensure that automated news content upholds the standards of balanced reporting.

[133] Crosslingual Reasoning through Test-Time Scaling

Zheng-Xin Yong,M. Farid Adilazuarda,Jonibek Mansurov,Ruochen Zhang,Niklas Muennighoff,Carsten Eickhoff,Genta Indra Winata,Julia Kreutzer,Stephen H. Bach,Alham Fikri Aji

Main category: cs.CL

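TL;DR: The study examines how well English reasoning fine-tuning generalizes across languages and finds that scaling inference compute improves multilingual mathematical reasoning, although the choice of reasoning language and out-of-domain generalization remain limitations.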

Details

Motivation: To explore how English-centric reasoning language models behave in multilingual settings, especially their limitations in low-resource languages and out-of-domain reasoning. Method: Studies the generalization of multilingual reasoning by scaling inference compute, analyzing reasoning patterns such as the quote-and-think pattern, and controlling the language of reasoning. Result: English-centric models reason better and more efficiently in high-resource languages but perform poorly in low-resource languages and across domains (for example, from STEM to cultural commonsense knowledge). Conclusion: Practitioners should let English-centric models reason in high-resource languages, while reasoning in low-resource languages and out-of-domain settings needs further work. Abstract: Reasoning capabilities of large language models are primarily studied for English, even when pretrained models are multilingual. In this work, we investigate to what extent English reasoning finetuning with long chain-of-thoughts (CoTs) can generalize across languages. First, we find that scaling up inference compute for English-centric reasoning language models (RLMs) improves multilingual mathematical reasoning across many languages including low-resource languages, to an extent where they outperform models twice their size. Second, we reveal that while English-centric RLMs' CoTs are naturally predominantly English, they consistently follow a quote-and-think pattern to reason about quoted non-English inputs. Third, we discover an effective strategy to control the language of long CoT reasoning, and we observe that models reason better and more efficiently in high-resource languages. Finally, we observe poor out-of-domain reasoning generalization, in particular from STEM to cultural commonsense knowledge, even for English. Overall, we demonstrate the potentials, study the mechanisms and outline the limitations of crosslingual generalization of English reasoning test-time scaling. We conclude that practitioners should let English-centric RLMs reason in high-resource languages, while further work is needed to improve reasoning in low-resource languages and out-of-domain contexts.

[134] Reasoning Models Don't Always Say What They Think

Yanda Chen,Joe Benton,Ansh Radhakrishnan,Jonathan Uesato,Carson Denison,John Schulman,Arushi Somani,Peter Hase,Misha Wagner,Fabien Roger,Vlad Mikulik,Samuel R. Bowman,Jan Leike,Jared Kaplan,Ethan Perez

Main category: cs.CL

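TL;DR: The paper evaluates the usefulness of chain-of-thought (CoT) for AI safety and finds that CoTs only partly reveal a model's reasoning: the reveal rate is often below 20%, and reinforcement learning improves faithfulness only to a limited extent.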

Details

Motivation: To study whether CoTs faithfully reflect a model's reasoning process, in order to assess their potential for AI safety monitoring. Method: Tests CoT faithfulness on reasoning tasks with 6 types of hints in the prompts and analyzes the effect of reinforcement learning on faithfulness. Result: CoTs reveal the use of hints in often fewer than 20% of cases; reinforcement learning initially improves faithfulness but then plateaus; reward hacking does not increase the verbalization of hints. Conclusion: CoT monitoring is promising for noticing undesired behaviors during training and evaluation but is not sufficient to rule them out; in settings where CoT reasoning is not necessary, CoT monitoring is unlikely to reliably catch rare, catastrophic behaviors. Abstract: Chain-of-thought (CoT) offers a potential boon for AI safety as it allows monitoring a model's CoT to try to understand its intentions and reasoning processes. However, the effectiveness of such monitoring hinges on CoTs faithfully representing models' actual reasoning processes. We evaluate CoT faithfulness of state-of-the-art reasoning models across 6 reasoning hints presented in the prompts and find: (1) for most settings and models tested, CoTs reveal their usage of hints in at least 1% of examples where they use the hint, but the reveal rate is often below 20%, (2) outcome-based reinforcement learning initially improves faithfulness but plateaus without saturating, and (3) when reinforcement learning increases how frequently hints are used (reward hacking), the propensity to verbalize them does not increase, even without training against a CoT monitor. These results suggest that CoT monitoring is a promising way of noticing undesired behaviors during training and evaluations, but that it is not sufficient to rule them out. They also suggest that in settings like ours where CoT reasoning is not necessary, test-time monitoring of CoTs is unlikely to reliably catch rare and catastrophic unexpected behaviors.
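
The reveal rate in finding (1) can be read as a simple conditional fraction: among the examples where the model used the hint, how many CoTs mention it. A minimal sketch follows, assuming each example has already been labeled with two booleans; the field names `used_hint` and `verbalized_hint` are hypothetical, and how the paper decides whether a hint was used or verbalized is not specified in the abstract.

```python
def reveal_rate(records):
    """Fraction of hint-using examples whose CoT verbalizes the hint.

    Each record is a dict with two hypothetical boolean fields:
    `used_hint` (the answer relied on the hint) and `verbalized_hint`
    (the CoT explicitly mentions the hint). Returns None when no
    example used the hint, since the rate is undefined in that case.
    """
    used = [r for r in records if r["used_hint"]]
    if not used:
        return None
    revealed = sum(1 for r in used if r["verbalized_hint"])
    return revealed / len(used)
```

Examples where the hint was not used are excluded from the denominator, which is why a model that rarely uses hints can still have a low reveal rate.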

[135] TransProQA: an LLM-based literary Translation evaluation metric with Professional Question Answering

Ran Zhang,Wei Zhao,Lieve Macken,Steffen Eger

Main category: cs.CL

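TL;DR: TransProQA is a reference-free, LLM-based QA framework designed for literary translation evaluation; it outperforms existing metrics and approaches human-level evaluation.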

Details

Motivation: Existing evaluation metrics prioritize mechanical accuracy over artistic expression, which could lead to a long-term decline in translation quality and cultural authenticity. Method: TransProQA integrates insights from professional literary translators and researchers and focuses on key elements of literary quality assessment, such as literary devices, cultural understanding, and authorial voice. Result: TransProQA substantially outperforms existing metrics, with a gain of up to 0.07 in correlation, and surpasses the best SOTA metrics by over 15 points in adequacy assessments. Conclusion: TransProQA is an efficient, training-free literary evaluation tool that works with open-source models and has broad potential for application. Abstract: The impact of Large Language Models (LLMs) has extended into literary domains. However, existing evaluation metrics prioritize mechanical accuracy over artistic expression and tend to overrate machine translation (MT) as being superior to experienced professional human translation. In the long run, this bias could result in a permanent decline in translation quality and cultural authenticity. In response to the urgent need for a specialized literary evaluation metric, we introduce TransProQA, a novel, reference-free, LLM-based question-answering (QA) framework designed specifically for literary translation evaluation. TransProQA uniquely integrates insights from professional literary translators and researchers, focusing on critical elements in literary quality assessment such as literary devices, cultural understanding, and authorial voice. Our extensive evaluation shows that while literary-finetuned XCOMET-XL yields marginal gains, TransProQA substantially outperforms current metrics, achieving up to 0.07 gain in correlation (ACC-EQ and Kendall's tau) and surpassing the best state-of-the-art (SOTA) metrics by over 15 points in adequacy assessments. Incorporating professional translator insights as weights further improves performance, highlighting the value of translator inputs. Notably, TransProQA approaches human-level evaluation performance comparable to trained linguistic annotators. It demonstrates broad applicability to open-source models such as LLaMA3.3-70b and Qwen2.5-32b, indicating its potential as an accessible and training-free literary evaluation metric and a valuable tool for evaluating texts that require local processing due to copyright or ethical considerations.

[136] Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data

Yudong Wang,Zixuan Fu,Jie Cai,Peijun Tang,Hongya Lyu,Yewei Fang,Zhi Zheng,Jie Zhou,Guoyang Zeng,Chaojun Xiao,Xu Han,Zhiyuan Liu

Main category: cs.CL

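TL;DR: The paper proposes an efficient data filtering method that addresses the cost of data quality verification and the subjectivity of seed data selection, improving both the effectiveness and the efficiency of LLM training.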

Details

Motivation: With the rapid development of large language models (LLMs), data quality has become a key factor in model performance, yet existing methods for data verification and seed data selection are inefficient and subjective. Method: Proposes an efficient data verification strategy and an optimized seed data selection method, combined with a lightweight fastText classifier, to build an efficient data filtering pipeline. Result: Applying the pipeline to the FineWeb and Chinese FineWeb datasets produces the higher-quality Ultra-FineWeb dataset, which significantly improves LLM performance on multiple benchmark tasks. Conclusion: The proposed filtering pipeline improves data quality and training efficiency while reducing experimental and inference costs, which validates the approach. Abstract: Data quality has become a key factor in enhancing model performance with the rapid development of large language models (LLMs). Model-driven data filtering has increasingly become a primary approach for acquiring high-quality data. However, it still faces two main challenges: (1) the lack of an efficient data verification strategy makes it difficult to provide timely feedback on data quality; and (2) the selection of seed data for training classifiers lacks clear criteria and relies heavily on human expertise, introducing a degree of subjectivity. To address the first challenge, we introduce an efficient verification strategy that enables rapid evaluation of the impact of data on LLM training with minimal computational cost. To tackle the second challenge, we build upon the assumption that high-quality seed data is beneficial for LLM training, and by integrating the proposed verification strategy, we optimize the selection of positive and negative samples and propose an efficient data filtering pipeline. This pipeline not only improves filtering efficiency, classifier quality, and robustness, but also significantly reduces experimental and inference costs. In addition, to efficiently filter high-quality data, we employ a lightweight classifier based on fastText, and successfully apply the filtering pipeline to two widely-used pre-training corpora, FineWeb and Chinese FineWeb datasets, resulting in the creation of the higher-quality Ultra-FineWeb dataset. Ultra-FineWeb contains approximately 1 trillion English tokens and 120 billion Chinese tokens. Empirical results demonstrate that the LLMs trained on Ultra-FineWeb exhibit significant performance improvements across multiple benchmark tasks, validating the effectiveness of our pipeline in enhancing both data quality and training efficiency.

[137] clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations

Chalamalasetti Kranti,Sherzod Hakimov,David Schlangen

Main category: cs.CL

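TL;DR: Proposes the clem todd framework for systematically evaluating dialogue systems under consistent conditions, with flexible combinations of user simulators and dialogue systems.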

Details

Motivation: Existing research often evaluates dialogue system components in isolation, which limits how well insights generalize across architectures and configurations. Method: Develops the clem todd framework, which supports plug-and-play integration and enforces uniform datasets, evaluation metrics, and computational constraints. Result: Re-evaluating existing systems and integrating new ones reveals how architecture, scale, and prompting strategies affect dialogue performance. Conclusion: clem todd offers practical guidance for building efficient and effective conversational AI systems. Abstract: The emergence of instruction-tuned large language models (LLMs) has advanced the field of dialogue systems, enabling both realistic user simulations and robust multi-turn conversational agents. However, existing research often evaluates these components in isolation, either focusing on a single user simulator or a specific system design, limiting the generalisability of insights across architectures and configurations. In this work, we propose clem todd (chat-optimized LLMs for task-oriented dialogue systems development), a flexible framework for systematically evaluating dialogue systems under consistent conditions. clem todd enables detailed benchmarking across combinations of user simulators and dialogue systems, whether existing models from literature or newly developed ones. It supports plug-and-play integration and ensures uniform datasets, evaluation metrics, and computational constraints. We showcase clem todd's flexibility by re-evaluating existing task-oriented dialogue systems within this unified setup and integrating three newly proposed dialogue systems into the same evaluation pipeline. Our results provide actionable insights into how architecture, scale, and prompting strategies affect dialogue performance, offering practical guidance for building efficient and effective conversational AI systems.

[138] UKElectionNarratives: A Dataset of Misleading Narratives Surrounding Recent UK General Elections

Fatima Haouari,Carolina Scarton,Nicolò Faggiani,Nikolaos Nikolaidis,Bonka Kotseva,Ibrahim Abu Farha,Jens Linge,Kalina Bontcheva

Main category: cs.CL

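TL;DR: The paper proposes a taxonomy of misleading narratives in European elections, builds UKElectionNarratives, the first human-annotated dataset of such narratives, and evaluates pre-trained and large language models (such as GPT-4o) on detecting them.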

Details

Motivation: Misleading narratives shape public opinion during elections and need to be detected accurately to protect electoral integrity. Method: Proposes a taxonomy, builds an annotated dataset, and evaluates the detection ability of pre-trained and large language models. Result: Produces the first dataset of misleading election narratives and shows the potential of models such as GPT-4o for detecting them. Conclusion: The study offers directions and recommendations for future work on misleading narrative detection using the codebook and dataset. Abstract: Misleading narratives play a crucial role in shaping public opinion during elections, as they can influence how voters perceive candidates and political parties. This entails the need to detect these narratives accurately. To address this, we introduce the first taxonomy of common misleading narratives that circulated during recent elections in Europe. Based on this taxonomy, we construct and analyse UKElectionNarratives: the first dataset of human-annotated misleading narratives which circulated during the UK General Elections in 2019 and 2024. We also benchmark Pre-trained and Large Language Models (focusing on GPT-4o), studying their effectiveness in detecting election-related misleading narratives. Finally, we discuss potential use cases and make recommendations for future research directions using the proposed codebook and dataset.

[139] Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging

Shiqi Chen,Jinghan Zhang,Tongyao Zhu,Wei Liu,Siyang Gao,Miao Xiong,Manling Li,Junxian He

Main category: cs.CL

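TL;DR: The paper explores using model merging to bring the reasoning abilities of large language models (LLMs) into vision-language models (VLMs), and shows how perception and reasoning are distributed across model layers and how that distribution changes after merging.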

Details

Motivation: To study how vision-language models (VLMs) combine visual perception with the reasoning abilities of large language models (LLMs), and the mechanisms behind this combination. Method: Proposes cross-modal model merging, which brings LLM reasoning abilities into VLMs without additional training. Result: Experiments show that merging transfers reasoning abilities from LLMs to VLMs; perception is mainly encoded in the early layers and reasoning in the middle-to-late layers. After merging, all layers contribute to reasoning, while the distribution of perception abilities stays largely unchanged. Conclusion: Model merging is a new tool for multimodal integration and interpretation and sheds light on how perception and reasoning are distributed. Abstract: Vision-Language Models (VLMs) combine visual perception with the general capabilities, such as reasoning, of Large Language Models (LLMs). However, the mechanisms by which these two abilities can be combined and contribute remain poorly understood. In this work, we explore composing perception and reasoning through model merging that connects parameters of different models. Unlike previous works that often focus on merging models of the same kind, we propose merging models across modalities, enabling the incorporation of the reasoning capabilities of LLMs into VLMs. Through extensive experiments, we demonstrate that model merging offers a successful pathway to transfer reasoning abilities from LLMs to VLMs in a training-free manner. Moreover, we utilize the merged models to understand the internal mechanism of perception and reasoning and how merging affects it. We find that perception capabilities are predominantly encoded in the early layers of the model, whereas reasoning is largely facilitated by the middle-to-late layers. After merging, we observe that all layers begin to contribute to reasoning, whereas the distribution of perception abilities across layers remains largely unchanged. These observations shed light on the potential of model merging as a tool for multimodal integration and interpretation.
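
One common training-free way to merge models that share an architecture is linear interpolation of their parameters. The sketch below illustrates that idea on plain Python dictionaries of float lists; it is an assumption for illustration only, since the abstract does not state which merging rule the paper uses. The name `merge_parameters`, the `weight` argument, and the parameter names in the usage note are all hypothetical.

```python
def merge_parameters(vlm_params, llm_params, weight=0.5):
    """Linearly interpolate parameters shared by a VLM and an LLM.

    For every parameter name present in both models with the same size,
    the merged value is (1 - weight) * vlm + weight * llm. Parameters
    that exist only in the VLM (for example, a vision encoder) or whose
    sizes differ are copied from the VLM unchanged.
    """
    merged = {}
    for name, vlm_values in vlm_params.items():
        llm_values = llm_params.get(name)
        if llm_values is None or len(llm_values) != len(vlm_values):
            merged[name] = list(vlm_values)
            continue
        merged[name] = [
            (1 - weight) * v + weight * l
            for v, l in zip(vlm_values, llm_values)
        ]
    return merged
```

For instance, merging `{"layer0.w": [1.0, 2.0], "vision.w": [5.0]}` with `{"layer0.w": [3.0, 4.0]}` at `weight=0.5` averages `layer0.w` and leaves `vision.w` as it was. A real implementation would operate on framework tensors rather than lists.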

[140] ComPO: Preference Alignment via Comparison Oracles

Peter Chen,Xi Chen,Wotao Yin,Tianyi Lin

Main category: cs.CL

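TL;DR: The paper proposes a new preference alignment method based on comparison oracles that addresses the verbosity and likelihood displacement issues of existing direct alignment methods, and validates it experimentally.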

Details

Motivation: Existing direct alignment methods suffer from verbosity and likelihood displacement, especially when noisy preference pairs induce similar likelihoods for preferred and dispreferred responses. Method: Proposes a new preference alignment method based on comparison oracles, provides a convergence guarantee for its basic scheme, and improves it with heuristics. Result: Experiments on multiple base and instruction-tuned models and several benchmarks demonstrate the method's flexibility and compatibility and show improved LLM performance. Conclusion: The new method addresses limitations of existing approaches and highlights the importance of designing specialized methods for preference pairs with distinct likelihood margins. Abstract: Direct alignment methods are increasingly used for aligning large language models (LLMs) with human preferences. However, these methods suffer from the issues of verbosity and likelihood displacement, which can be driven by the noisy preference pairs that induce similar likelihood for preferred and dispreferred responses. The contributions of this paper are two-fold. First, we propose a new preference alignment method based on comparison oracles and provide the convergence guarantee for its basic scheme. Second, we improve our method using some heuristics and conduct the experiments to demonstrate the flexibility and compatibility of practical scheme in improving the performance of LLMs using noisy preference pairs. Evaluations are conducted across multiple base and instruction-tuned models (Mistral-7B, Llama-3-8B and Gemma-2-9B) with benchmarks (AlpacaEval 2, MT-Bench and Arena-Hard). Experimental results show the effectiveness of our method as an alternative to addressing the limitations of existing direct alignment methods. A highlight of our work is that we evidence the importance of designing specialized methods for preference pairs with distinct likelihood margins, which complements the recent findings in Razin et al. (2025).

cs.LG

[141] When Bad Data Leads to Good Models

Kenneth Li,Yida Chen,Fernanda Viégas,Martin Wattenberg

Main category: cs.LG

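TL;DR: Examines how the proportion of toxic data in pretraining affects post-training control, finding that a moderate increase in toxic data can make the model's output toxicity easier to control.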

Details

Motivation: To re-examine how data quality affects model performance, in particular the role of toxic data across pre-training and post-training. Method: Uses a toy experiment and experiments with Olmo-1B models to analyze how data composition affects feature representations, and evaluates post-training interventions such as inference-time intervention (ITI). Result: Adding toxic data makes the toxicity feature less entangled and more linearly separable in representation space; although the base model becomes more toxic, the toxicity is easier to remove in post-training. Conclusion: Combined with post-training techniques, a suitable amount of "bad data" may help produce better models. Abstract: In large language model (LLM) pretraining, data quality is believed to determine model quality. In this paper, we re-examine the notion of "quality" from the perspective of pre- and post-training co-design. Specifically, we explore the possibility that pre-training on more toxic data can lead to better control in post-training, ultimately decreasing a model's output toxicity. First, we use a toy experiment to study how data composition affects the geometry of features in the representation space. Next, through controlled experiments with Olmo-1B models trained on varying ratios of clean and toxic data, we find that the concept of toxicity enjoys a less entangled linear representation as the proportion of toxic data increases. Furthermore, we show that although toxic data increases the generational toxicity of the base model, it also makes the toxicity easier to remove. Evaluations on Toxigen and Real Toxicity Prompts demonstrate that models trained on toxic data achieve a better trade-off between reducing generational toxicity and preserving general capabilities when detoxifying techniques such as inference-time intervention (ITI) are applied. Our findings suggest that, with post-training taken into account, bad data may lead to good models.

[142] ConCISE: Confidence-guided Compression in Step-by-step Efficient Reasoning

Ziqing Qiao,Yongheng Deng,Jiali Zeng,Dong Wang,Lai Wei,Fandong Meng,Jie Zhou,Ju Ren,Yaoxue Zhang

Main category: cs.LG

TL;DR: ConCISE reinforces the model's confidence during reasoning to cut redundant steps, shortening outputs by up to roughly 50% while maintaining task accuracy.

Details

Motivation: Large reasoning models (LRMs) perform strongly on complex reasoning tasks but often produce verbose outputs with redundant content, which increases computational overhead and degrades user experience. Existing compression methods either risk disrupting reasoning coherence or fail to intervene effectively during generation. Method: Proposes ConCISE, which uses a confidence-guided perspective to explain redundant reflection (Confidence Deficit and Termination Delay) and applies Confidence Injection and Early Stopping to streamline reasoning. Result: ConCISE shortens outputs by up to roughly 50% while keeping task accuracy high, and outperforms existing baselines on multiple reasoning benchmarks. Conclusion: Confidence-guided compression effectively removes redundant reasoning steps and improves reasoning efficiency and user experience. Abstract: Large Reasoning Models (LRMs) perform strongly in complex reasoning tasks via Chain-of-Thought (CoT) prompting, but often suffer from verbose outputs caused by redundant content, increasing computational overhead and degrading user experience. Existing compression methods either apply post-hoc pruning, risking disruption to reasoning coherence, or rely on sampling-based selection, which fails to intervene effectively during generation. In this work, we introduce a confidence-guided perspective to explain the emergence of redundant reflection in LRMs, identifying two key patterns: Confidence Deficit, where the model reconsiders correct steps due to low internal confidence, and Termination Delay, where reasoning continues even after reaching a confident answer. Based on this analysis, we propose ConCISE (Confidence-guided Compression In Step-by-step Efficient Reasoning), a framework that simplifies reasoning chains by reinforcing the model's confidence during inference, thus preventing the generation of redundant reflection steps. It integrates Confidence Injection to stabilize intermediate steps and Early Stopping to terminate reasoning when confidence is sufficient. Extensive experiments demonstrate that fine-tuning LRMs on ConCISE-generated data yields significantly shorter outputs, reducing length by up to approximately 50% under SimPO, while maintaining high task accuracy. ConCISE consistently outperforms existing baselines across multiple reasoning benchmarks.

[143] General Transform: A Unified Framework for Adaptive Transform to Enhance Representations

Gekko Budiutama,Shunsuke Daimon,Hirofumi Nishi,Yu-ichiro Matsushita

Main category: cs.LG

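TL;DR: Proposes General Transform (GT), an adaptive transform that learns a data-driven mapping to improve machine learning performance and outperforms conventional transforms.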

Details

Motivation: Conventional discrete transforms depend on knowledge of the dataset's properties and lack adaptability, which limits their effectiveness. Method: Proposes General Transform (GT), which learns a data-driven mapping tailored to the dataset and task. Result: Models using GT outperform conventional transform-based approaches on computer vision and natural language processing tasks. Conclusion: GT is an effective adaptive transform suited to diverse learning scenarios. Abstract: Discrete transforms, such as the discrete Fourier transform, are widely used in machine learning to improve model performance by extracting meaningful features. However, with numerous transforms available, selecting an appropriate one often depends on understanding the dataset's properties, making the approach less effective when such knowledge is unavailable. In this work, we propose General Transform (GT), an adaptive transform-based representation designed for machine learning applications. Unlike conventional transforms, GT learns a data-driven mapping tailored to the dataset and task of interest. Here, we demonstrate that models incorporating GT outperform conventional transform-based approaches across computer vision and natural language processing tasks, highlighting its effectiveness in diverse learning scenarios.

[144] CodeMixBench: Evaluating Large Language Models on Code Generation with Code-Mixed Prompts

Manik Sheokand,Parth Sawant

Main category: cs.LG

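TL;DR: CodeMixBench is a new benchmark for evaluating LLM code generation from code-mixed prompts, filling the gap left by existing benchmarks that focus on English-only prompts.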

Details

Motivation: Existing benchmarks such as HumanEval and MBPP evaluate code generation from English-only prompts, overlooking the fact that multilingual developers often use code-mixed language in practice. Method: Building on BigCodeBench, CodeMixBench introduces controlled code-mixing (CMD) into prompts for three language pairs (Hinglish, Spanish-English, and Chinese Pinyin-English). Result: Code-mixed prompts consistently degrade Pass@1 performance, with smaller models dropping further at higher CMD levels. Conclusion: CodeMixBench provides a realistic evaluation framework for multilingual code generation and exposes new challenges for building robust code generation models. Abstract: Large Language Models (LLMs) have achieved remarkable success in code generation tasks, powering various applications like code completion, debugging, and programming assistance. However, existing benchmarks such as HumanEval, MBPP, and BigCodeBench primarily evaluate LLMs on English-only prompts, overlooking the real-world scenario where multilingual developers often use code-mixed language while interacting with LLMs. To address this gap, we introduce CodeMixBench, a novel benchmark designed to evaluate the robustness of LLMs on code generation from code-mixed prompts. Built upon BigCodeBench, CodeMixBench introduces controlled code-mixing (CMD) into the natural language parts of prompts across three language pairs: Hinglish (Hindi-English), Spanish-English, and Chinese Pinyin-English. We comprehensively evaluate a diverse set of open-source code generation models ranging from 1.5B to 15B parameters. Our results show that code-mixed prompts consistently degrade Pass@1 performance compared to their English-only counterparts, with performance drops increasing under higher CMD levels for smaller models. CodeMixBench provides a realistic evaluation framework for studying multilingual code generation and highlights new challenges and directions for building robust code generation models that generalize well across diverse linguistic settings.
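
The abstract reports Pass@1 without saying how it is computed. A widely used choice is the unbiased pass@k estimator introduced alongside HumanEval, which for `n` sampled completions with `c` correct ones gives `1 - C(n-c, k) / C(n, k)`. The sketch below implements that estimator; whether CodeMixBench uses it or a single greedy sample is an assumption here, not something the abstract states.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of sampled completions, c: number that pass the tests,
    k: number of samples the metric is allowed to draw (k <= n).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For `k = 1` the estimate reduces to the fraction of correct samples `c / n`; a benchmark score is then the mean over all problems.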

[145] Understanding In-context Learning of Addition via Activation Subspaces

Xinyan Hu,Kayo Yin,Michael I. Jordan,Jacob Steinhardt,Lijie Chen

Main category: cs.LG

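TL;DR: Studies how language models extract signals from few-shot examples and form a prediction rule, finding that Llama-3-8B reaches high accuracy through a small set of attention heads and revealing the low-dimensional subspace mechanism behind the computation.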

Details

Motivation: To understand how modern transformer models implement signal extraction and rule application in the forward pass during few-shot (in-context) learning. Method: Designs a structured family of few-shot tasks, localizes the key attention heads with an optimization approach, and analyzes the low-dimensional subspace that carries the signal. Result: Llama-3-8B attains high accuracy on the task; the extracted signals lie in a six-dimensional subspace, and a self-correction mechanism is identified. Conclusion: Tracking low-dimensional subspaces across the forward pass can reveal fine-grained computational structure in few-shot learning. Abstract: To perform in-context learning, language models must extract signals from individual few-shot examples, aggregate these into a learned prediction rule, and then apply this rule to new examples. How is this implemented in the forward pass of modern transformer models? To study this, we consider a structured family of few-shot learning tasks for which the true prediction rule is to add an integer $k$ to the input. We find that Llama-3-8B attains high accuracy on this task for a range of $k$, and localize its few-shot ability to just three attention heads via a novel optimization approach. We further show the extracted signals lie in a six-dimensional subspace, where four of the dimensions track the unit digit and the other two dimensions track overall magnitude. We finally examine how these heads extract information from individual few-shot examples, identifying a self-correction mechanism in which mistakes from earlier examples are suppressed by later examples. Our results demonstrate how tracking low-dimensional subspaces across a forward pass can provide insight into fine-grained computational structures.

[146] Revealing Weaknesses in Text Watermarking Through Self-Information Rewrite Attacks

Yixin Cheng,Hongcheng Guo,Yangming Li,Leonid Sigal

Main category: cs.LG

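TL;DR: Shows that embedding watermarks in high-entropy tokens, as current text watermarking algorithms do, is a vulnerability, and proposes the Self-Information Rewrite Attack (SIRA), an efficient attack with a high success rate and low cost that underscores the need for more robust watermarking.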

Details

Motivation: Embedding watermarks in high-entropy tokens looks benign but can be exploited by attackers, so the robustness of current text watermarking algorithms needs to be assessed. Method: Proposes a generic, efficient rewrite attack (SIRA) that computes the self-information of each token to identify potential pattern tokens and attack them selectively. Result: SIRA achieves nearly 100% attack success on seven recent watermarking methods at a cost of only 0.88 USD per million tokens. Conclusion: Current watermarking algorithms share a widespread vulnerability, and more robust schemes are urgently needed. Abstract: Text watermarking aims to subtly embed statistical signals into text by controlling the Large Language Model (LLM)'s sampling process, enabling watermark detectors to verify that the output was generated by the specified model. The robustness of these watermarking algorithms has become a key factor in evaluating their effectiveness. Current text watermarking algorithms embed watermarks in high-entropy tokens to ensure text quality. In this paper, we reveal that this seemingly benign design can be exploited by attackers, posing a significant risk to the robustness of the watermark. We introduce a generic efficient paraphrasing attack, the Self-Information Rewrite Attack (SIRA), which leverages the vulnerability by calculating the self-information of each token to identify potential pattern tokens and perform a targeted attack. Our work exposes a widely prevalent vulnerability in current watermarking algorithms. The experimental results show SIRA achieves nearly 100% attack success rates on seven recent watermarking methods at a cost of only 0.88 USD per million tokens. Our approach does not require any access to the watermark algorithms or the watermarked LLM and can seamlessly transfer to any LLM as the attack model, even mobile-level models. Our findings highlight the urgent need for more robust watermarking.
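
The self-information of a token with model probability $p$ is $-\log_2 p$ bits, so unlikely (high-entropy) tokens score high. The sketch below shows only that scoring step plus a simple thresholded mask; the threshold, the `[MASK]` placeholder, and the idea that masked positions are later rewritten by a paraphrasing model are illustrative assumptions, since the abstract does not spell out the pipeline.

```python
import math

def self_information(token_probs):
    """Return the self-information -log2(p), in bits, of each token probability."""
    return [-math.log2(p) for p in token_probs]

def mask_high_information_tokens(tokens, token_probs, threshold_bits, mask="[MASK]"):
    """Replace tokens whose self-information meets the threshold with a mask.

    Assumption: token_probs[i] is the probability some language model assigned
    to tokens[i] in context; how those probabilities are obtained is out of scope.
    """
    if len(tokens) != len(token_probs):
        raise ValueError("tokens and token_probs must have the same length")
    scores = self_information(token_probs)
    return [mask if bits >= threshold_bits else tok
            for tok, bits in zip(tokens, scores)]
```

A call such as `mask_high_information_tokens(["the", "quantum", "cat"], [0.5, 0.125, 0.25], 2.0)` keeps the predictable first token and masks the other two.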

[147] Scalable Chain of Thoughts via Elastic Reasoning

Yuhui Xu,Hanze Dong,Lei Wang,Doyen Sahoo,Junnan Li,Caiming Xiong

Main category: cs.LG

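TL;DR: Elastic Reasoning splits reasoning into a thinking phase and a solution phase with independently allocated budgets, significantly improving reliability under tight resource constraints.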

Details

Motivation: Large reasoning models (LRMs) excel at complex tasks, but their uncontrolled output lengths clash with resource limits in real-world deployment. Method: Proposes Elastic Reasoning, which separates reasoning into thinking and solution phases and introduces a lightweight budget-constrained training strategy. Result: On mathematical and programming benchmarks, Elastic Reasoning performs robustly under strict budget constraints and produces more concise, efficient reasoning. Conclusion: Elastic Reasoning offers a practical and effective solution for controllable reasoning at scale. Abstract: Large reasoning models (LRMs) have achieved remarkable progress on complex tasks by generating extended chains of thought (CoT). However, their uncontrolled output lengths pose significant challenges for real-world deployment, where inference-time budgets on tokens, latency, or compute are strictly constrained. We propose Elastic Reasoning, a novel framework for scalable chain of thoughts that explicitly separates reasoning into two phases (thinking and solution) with independently allocated budgets. At test time, Elastic Reasoning prioritizes the completeness of solution segments, significantly improving reliability under tight resource constraints. To train models that are robust to truncated thinking, we introduce a lightweight budget-constrained rollout strategy, integrated into GRPO, which teaches the model to reason adaptively when the thinking process is cut short and generalizes effectively to unseen budget constraints without additional training. Empirical results on mathematical (AIME, MATH500) and programming (LiveCodeBench, Codeforces) benchmarks demonstrate that Elastic Reasoning performs robustly under strict budget constraints, while incurring significantly lower training cost than baseline methods. Remarkably, our approach also produces more concise and efficient reasoning even in unconstrained settings. Elastic Reasoning offers a principled and practical solution to the pressing challenge of controllable reasoning at scale.
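
The two-budget idea can be illustrated on already-generated token lists: the thinking segment is cut at its own budget, and the solution segment still receives its full, separate budget. This is only a toy sketch; the function name and the `</think>` boundary marker are assumptions, and a real system would enforce the budgets during decoding rather than by slicing finished outputs.

```python
def apply_elastic_budgets(thinking_tokens, solution_tokens,
                          thinking_budget, solution_budget,
                          end_of_thinking="</think>"):
    """Truncate thinking and solution independently, then join them.

    Assumption: the boundary between the two phases is marked by a single
    end_of_thinking token that is always emitted, even if thinking is cut short.
    """
    if thinking_budget < 0 or solution_budget < 0:
        raise ValueError("budgets must be non-negative")
    thinking = list(thinking_tokens[:thinking_budget])
    solution = list(solution_tokens[:solution_budget])
    return thinking + [end_of_thinking] + solution
```

Keeping the budgets separate means a long chain of thought can never consume the tokens reserved for the final answer, which is the reliability property the abstract emphasizes.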

[148] Research on Anomaly Detection Methods Based on Diffusion Models

Yi Chen

Main category: cs.LG

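TL;DR: Proposes a diffusion-model-based anomaly detection framework that uses multi-scale feature extraction and attention mechanisms to improve detection performance on image and audio data.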

Details

Motivation: Traditional anomaly detection methods struggle with complex, high-dimensional data, and diffusion models are a promising alternative because of their strong data modeling ability. Method: Models the distribution of normal data with diffusion probabilistic models (DPMs), uses reconstruction error combined with semantic discrepancy as the anomaly indicator, and adds multi-scale feature extraction and attention mechanisms. Result: On benchmark datasets including MVTec AD and UrbanSound8K, the method outperforms existing techniques with higher accuracy and robustness. Conclusion: Diffusion models offer clear advantages for anomaly detection and provide an efficient, reliable solution for real-world applications. Abstract: Anomaly detection is a fundamental task in machine learning and data mining, with significant applications in cybersecurity, industrial fault diagnosis, and clinical disease monitoring. Traditional methods, such as statistical modeling and machine learning-based approaches, often face challenges in handling complex, high-dimensional data distributions. In this study, we explore the potential of diffusion models for anomaly detection, proposing a novel framework that leverages the strengths of diffusion probabilistic models (DPMs) to effectively identify anomalies in both image and audio data. The proposed method models the distribution of normal data through a diffusion process and reconstructs input data via reverse diffusion, using a combination of reconstruction errors and semantic discrepancies as anomaly indicators. To enhance the framework's performance, we introduce multi-scale feature extraction, attention mechanisms, and wavelet-domain representations, enabling the model to capture fine-grained structures and global dependencies in the data. Extensive experiments on benchmark datasets, including MVTec AD and UrbanSound8K, demonstrate that our method outperforms state-of-the-art anomaly detection techniques, achieving superior accuracy and robustness across diverse data modalities. This research highlights the effectiveness of diffusion models in anomaly detection and provides a robust and efficient solution for real-world applications.
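
The abstract says the anomaly indicator combines reconstruction error with semantic discrepancy but does not give the combination rule. A minimal sketch, assuming mean squared error for the first term, cosine distance between feature vectors for the second, and a hypothetical weight `semantic_weight` to blend them:

```python
import numpy as np

def anomaly_score(x, x_reconstructed, features, features_reconstructed,
                  semantic_weight=0.5, eps=1e-12):
    """Blend pixel-level reconstruction error with a feature-space gap.

    Assumptions (not from the paper): MSE as the reconstruction error,
    1 - cosine similarity as the semantic discrepancy, and a convex
    combination controlled by semantic_weight in [0, 1].
    """
    x = np.asarray(x, dtype=float)
    x_rec = np.asarray(x_reconstructed, dtype=float)
    f = np.asarray(features, dtype=float).ravel()
    f_rec = np.asarray(features_reconstructed, dtype=float).ravel()

    reconstruction_error = float(np.mean((x - x_rec) ** 2))
    cosine = float(f @ f_rec) / (np.linalg.norm(f) * np.linalg.norm(f_rec) + eps)
    semantic_gap = 1.0 - cosine
    return (1.0 - semantic_weight) * reconstruction_error + semantic_weight * semantic_gap
```

Inputs that the reverse diffusion process reconstructs well in both pixel and feature space score near zero; the features themselves would come from some encoder, which is left out of this sketch.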

[149] Concept-Based Unsupervised Domain Adaptation

Xinyue Xu,Yueying Hu,Hui Tang,Yi Qin,Lu Mi,Hao Wang,Xiaomeng Li

Main category: cs.LG

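TL;DR: Proposes the CUDA framework, which aligns concept representations through adversarial training and introduces a relaxation threshold, improving the robustness and generalization of concept bottleneck models under domain adaptation.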

Details

Motivation: Conventional concept bottleneck models (CBMs) assume that training and test data share the same distribution, and their performance degrades under domain shift. CUDA aims to address this and improve the adaptability and generalization of CBMs. Method: CUDA (1) aligns concept representations via adversarial training; (2) introduces a relaxation threshold that tolerates differences in concept distributions; (3) infers concepts directly in the target domain without labeled concept data; and (4) integrates concept learning into conventional domain adaptation with theoretical guarantees. Result: Experiments show that CUDA significantly outperforms existing CBM and domain adaptation methods on real-world datasets. Conclusion: By improving concept alignment and introducing a relaxation threshold, CUDA substantially improves CBM performance under domain adaptation while preserving interpretability. Abstract: Concept Bottleneck Models (CBMs) enhance interpretability by explaining predictions through human-understandable concepts but typically assume that training and test data share the same distribution. This assumption often fails under domain shifts, leading to degraded performance and poor generalization. To address these limitations and improve the robustness of CBMs, we propose the Concept-based Unsupervised Domain Adaptation (CUDA) framework. CUDA is designed to: (1) align concept representations across domains using adversarial training, (2) introduce a relaxation threshold to allow minor domain-specific differences in concept distributions, thereby preventing performance drop due to over-constraints of these distributions, (3) infer concepts directly in the target domain without requiring labeled concept data, enabling CBMs to adapt to diverse domains, and (4) integrate concept learning into conventional domain adaptation (DA) with theoretical guarantees, improving interpretability and establishing new benchmarks for DA. Experiments demonstrate that our approach significantly outperforms the state-of-the-art CBM and DA methods on real-world datasets.

[150] MTL-UE: Learning to Learn Nothing for Multi-Task Learning

Yi Yu,Song Xia,Siyuan Yang,Chenqi Kong,Wenhan Yang,Shijian Lu,Yap-Peng Tan,Alex C. Kot

Main category: cs.LG

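TL;DR: MTL-UE is the first unified framework for generating unlearnable examples for multi-task data and models, improving attack performance through a generator-based structure and embedding regularization.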

Details

Motivation: Existing unlearnable strategies mainly target single-task learning, while multi-task learning (MTL) data and models, despite their growing importance, have received little attention. Method: Designs a generator-based structure that introduces label priors and class-wise feature embeddings, combined with intra-task and inter-task embedding regularization. Result: Shows superior attack performance across 4 MTL datasets, 3 base UE methods, 5 model backbones, and 5 MTL task-weighting strategies. Conclusion: MTL-UE provides an efficient, general framework for generating unlearnable examples in multi-task learning, with broad applicability and robustness. Abstract: Most existing unlearnable strategies focus on preventing unauthorized users from training single-task learning (STL) models with personal data. Nevertheless, the paradigm has recently shifted towards multi-task data and multi-task learning (MTL), targeting generalist and foundation models that can handle multiple tasks simultaneously. Despite their growing importance, MTL data and models have been largely neglected while pursuing unlearnable strategies. This paper presents MTL-UE, the first unified framework for generating unlearnable examples for multi-task data and MTL models. Instead of optimizing perturbations for each sample, we design a generator-based structure that introduces label priors and class-wise feature embeddings, which leads to much better attacking performance. In addition, MTL-UE incorporates intra-task and inter-task embedding regularization to increase inter-class separation and suppress intra-class variance, which greatly enhances attack robustness. Furthermore, MTL-UE is versatile, with good support for dense prediction tasks in MTL. It is also plug-and-play, allowing existing surrogate-dependent unlearnable methods to be integrated with little adaptation. Extensive experiments show that MTL-UE achieves superior attacking performance consistently across 4 MTL datasets, 3 base UE methods, 5 model backbones, and 5 MTL task-weighting strategies.

eess.AS

[151] From Dialect Gaps to Identity Maps: Tackling Variability in Speaker Verification

Abdulhady Abas Abdullah,Soran Badawi,Dana A. Abdullah,Dana Rasul Hamad,Hanan Abdulrahman Taher,Sabat Salih Muhamad,Aram Mahmood Ahmed,Bryar A. Hassan,Sirwan Abdolwahed Aula,Tarik A. Rashid

Main category: eess.AS

TL;DR: Investigates the difficulty of speaker recognition across the dialects of Kurdish and proposes improvements to raise accuracy.

Details

Motivation: Large phonetic and lexical differences between Kurdish dialects pose challenges for speaker recognition systems, so solutions need to be explored. Method: Uses advanced machine learning approaches, data augmentation strategies, and the construction of dialect-specific corpora. Result: Strategies tailored to each dialect, together with cross-dialect training, markedly improve recognition performance. Conclusion: Combining dialect-specific methods with cross-dialect training can effectively improve the accuracy of Kurdish speaker recognition systems. Abstract: The complexity and difficulties of Kurdish speaker detection among its several dialects are investigated in this work. Because of its great phonetic and lexical differences, Kurdish, with several dialects including Kurmanji, Sorani, and Hawrami, offers special challenges for speaker recognition systems. The main difficulties in building a strong speaker identification system capable of precisely identifying speakers across several dialects are investigated in this work. To raise the accuracy and dependability of these systems, it also suggests solutions like sophisticated machine learning approaches, data augmentation tactics, and the building of thorough dialect-specific corpora. The results show that customized strategies for every dialect together with cross-dialect training greatly enhance recognition performance.

cs.MM

[152] SSH-Net: A Self-Supervised and Hybrid Network for Noisy Image Watermark Removal

Wenyang Liu,Jianjun Gao,Kim-Hui Yap

Main category: cs.MM

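TL;DR: SSH-Net is a self-supervised hybrid network for removing watermarks from noisy images; it needs no paired dataset and uses a dual-network design with a shared feature encoder.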

Details

Motivation: Existing methods depend on paired datasets that are hard to obtain in practice, which motivates a self-supervised approach to watermark removal. Method: SSH-Net synthesizes reference images in a self-supervised manner and adopts a dual-network design: an upper CNN for denoising and a lower Transformer-based network for removing both watermarks and noise, with a shared feature encoder. Result: SSH-Net removes watermarks and noise effectively without paired datasets. Conclusion: SSH-Net provides an efficient self-supervised solution for watermark removal. Abstract: Visible watermark removal is challenging due to its inherent complexities and the noise carried within images. Existing methods primarily rely on supervised learning approaches that require paired datasets of watermarked and watermark-free images, which are often impractical to obtain in real-world scenarios. To address this challenge, we propose SSH-Net, a Self-Supervised and Hybrid Network specifically designed for noisy image watermark removal. SSH-Net synthesizes reference watermark-free images using the watermark distribution in a self-supervised manner and adopts a dual-network design to address the task. The upper network, focused on the simpler task of noise removal, employs a lightweight CNN-based architecture, while the lower network, designed to handle the more complex task of simultaneously removing watermarks and noise, incorporates Transformer blocks to model long-range dependencies and capture intricate image features. To enhance the model's effectiveness, a shared CNN-based feature encoder is introduced before dual networks to extract common features that both networks can leverage. Our code will be available at https://github.com/wenyang001/SSH-Net.

eess.SP

[153] Integrated Image Reconstruction and Target Recognition based on Deep Learning Technique

Cien Zhang,Jiaming Zhang,Jiajun He,Okan Yurduseven

Main category: eess.SP

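TL;DR: Att-ClassiGAN adds an attention mechanism to ClassiGAN, improving image reconstruction and classification in computational microwave imaging, reducing reconstruction time, and outperforming existing methods.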

Details

Motivation: To address the high computational demands of computational microwave imaging (CMI) during image reconstruction and classification. Method: Adds attention gate modules to ClassiGAN so that the model dynamically focuses on important features and suppresses irrelevant information. Result: Att-ClassiGAN outperforms existing methods on NMSE, SSIM, and classification accuracy while reducing reconstruction time. Conclusion: The attention mechanism effectively improves CMI performance, and Att-ClassiGAN offers an efficient solution for microwave imaging. Abstract: Computational microwave imaging (CMI) has gained attention as an alternative to conventional microwave imaging techniques, addressing limitations such as their hardware-intensive physical layer and slow data acquisition, to name a few. Despite these advantages, CMI still encounters notable computational bottlenecks, especially during the image reconstruction stage. In this setting, both image recovery and object classification present significant processing demands. To address these challenges, our previous work introduced ClassiGAN, which is a generative deep learning model designed to simultaneously reconstruct images and classify targets using only back-scattered signals. In this study, we build upon that framework by incorporating attention gate modules into ClassiGAN. These modules are intended to refine feature extraction and improve the identification of relevant information. By dynamically focusing on important features and suppressing irrelevant ones, the attention mechanism enhances the overall model performance. The proposed architecture, named Att-ClassiGAN, significantly reduces the reconstruction time compared to traditional CMI approaches. Furthermore, it outperforms current advanced methods, delivering improved Normalized Mean Squared Error (NMSE), higher Structural Similarity Index (SSIM), and better classification outcomes for the reconstructed targets.

cs.RO

[154] Steerable Scene Generation with Post Training and Inference-Time Search

Nicholas Pfaff,Hongkai Dai,Sergey Zakharov,Shun Iwase,Russ Tedrake

Main category: cs.RO

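TL;DR: Proposes a diffusion-model-based scene generation method for robot simulation training that supports task-directed scene synthesis, and releases a dataset of over 44 million scenes.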

Details

Motivation: Training robots in simulation requires diverse 3D scenes, but scenes that meet specific task requirements (such as high clutter with plausible spatial arrangement) are scarce and costly to create by hand. Method: Trains a unified diffusion-based generative model that predicts objects and their SE(3) poses, and adapts it to task objectives via reinforcement learning, conditional generation, or inference-time search. Result: Introduces an MCTS-based inference-time search strategy, enforces physical feasibility, and releases a dataset of over 44 million scenes. Conclusion: The method supports task-directed scene generation that is physically feasible and scalable. Abstract: Training robots in simulation requires diverse 3D scenes that reflect the specific challenges of downstream tasks. However, scenes that satisfy strict task requirements, such as high-clutter environments with plausible spatial arrangement, are rare and costly to curate manually. Instead, we generate large-scale scene data using procedural models that approximate realistic environments for robotic manipulation, and adapt it to task-specific goals. We do this by training a unified diffusion-based generative model that predicts which objects to place from a fixed asset library, along with their SE(3) poses. This model serves as a flexible scene prior that can be adapted using reinforcement learning-based post training, conditional generation, or inference-time search, steering generation toward downstream objectives even when they differ from the original data distribution. Our method enables goal-directed scene synthesis that respects physical feasibility and scales across scene types. We introduce a novel MCTS-based inference-time search strategy for diffusion models, enforce feasibility via projection and simulation, and release a dataset of over 44 million SE(3) scenes spanning five diverse environments. Website with videos, code, data, and model weights: https://steerable-scene-generation.github.io/

[155] X-Driver: Explainable Autonomous Driving with Vision-Language Models

Wei Liu,Jiyuan Zhang,Binxiong Zheng,Yufeng Hu,Yingzhan Lin,Zengfeng Zeng

Main category: cs.RO

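TL;DR: X-Driver is an end-to-end autonomous driving framework built on multi-modal large language models (MLLMs) that uses Chain-of-Thought (CoT) and autoregressive modeling to improve perception and decision-making, outperforming existing methods in closed-loop tests.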

Details

Motivation: Existing end-to-end autonomous driving frameworks have low success rates in closed-loop evaluation, which limits their potential for real-world deployment. Method: Proposes X-Driver, which combines MLLMs, CoT, and autoregressive modeling to improve perception and decision-making. Result: Validated in the CARLA simulation environment, X-Driver surpasses the current state of the art in closed-loop performance and improves the interpretability of driving decisions. Conclusion: Structured reasoning is important for end-to-end driving, and X-Driver provides a strong baseline for future closed-loop autonomous driving research. Abstract: End-to-end autonomous driving has advanced significantly, offering benefits such as system simplicity and stronger driving performance in both open-loop and closed-loop settings than conventional pipelines. However, existing frameworks still suffer from low success rates in closed-loop evaluations, highlighting their limitations in real-world deployment. In this paper, we introduce X-Driver, a unified multi-modal large language model (MLLM) framework designed for closed-loop autonomous driving, leveraging Chain-of-Thought (CoT) and autoregressive modeling to enhance perception and decision-making. We validate X-Driver across multiple autonomous driving tasks using public benchmarks in the CARLA simulation environment, including Bench2Drive [6]. Our experimental results demonstrate superior closed-loop performance, surpassing the current state-of-the-art (SOTA) while improving the interpretability of driving decisions. These findings underscore the importance of structured reasoning in end-to-end driving and establish X-Driver as a strong baseline for future research in closed-loop autonomous driving.

[156] D-CODA: Diffusion for Coordinated Dual-Arm Data Augmentation

I-Chun Arthur Liu,Jason Chen,Gaurav Sukhatme,Daniel Seita

Main category: cs.RO

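TL;DR: D-CODA is a data augmentation method for bimanual manipulation that uses a diffusion model to generate viewpoint-consistent wrist-camera images and action labels, improving the scalability of imitation learning.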

Details

Motivation: The high dimensionality and coordination demands of bimanual manipulation make data collection costly, calling for a scalable data augmentation method. Method: Proposes D-CODA, which uses a diffusion model to generate viewpoint-consistent images for both arms along with action labels, and applies constrained optimization to ensure feasibility. Result: D-CODA outperforms baselines on 5 simulated and 3 real-world tasks. Conclusion: D-CODA offers a scalable data augmentation solution for bimanual manipulation. Abstract: Learning bimanual manipulation is challenging due to its high dimensionality and the tight coordination required between two arms. Eye-in-hand imitation learning, which uses wrist-mounted cameras, simplifies perception by focusing on task-relevant views. However, collecting diverse demonstrations remains costly, motivating the need for scalable data augmentation. While prior work has explored visual augmentation in single-arm settings, extending these approaches to bimanual manipulation requires generating viewpoint-consistent observations across both arms and producing corresponding action labels that are both valid and feasible. In this work, we propose Diffusion for COordinated Dual-arm Data Augmentation (D-CODA), a method for offline data augmentation tailored to eye-in-hand bimanual imitation learning that trains a diffusion model to synthesize novel, viewpoint-consistent wrist-camera images for both arms while simultaneously generating joint-space action labels. It employs constrained optimization to ensure that augmented states involving gripper-to-object contacts adhere to constraints suitable for bimanual coordination. We evaluate D-CODA on 5 simulated and 3 real-world tasks. Our results across 2250 simulation trials and 300 real-world trials demonstrate that it outperforms baselines and ablations, showing its potential for scalable data augmentation in eye-in-hand bimanual manipulation. Our project website is at: https://dcodaaug.github.io/D-CODA/.

Mattia Sartori,Chetna Singhal,Neelabhro Roy,Davide Brunelli,James Gross

Main category: cs.RO

TL;DR: Proposes an AI-aided, vision-based obstacle avoidance method for autonomous flight of the 30-gram Crazyflie 2.1 nano-drone in partially known environments.

Details

Motivation: Because nano-drones have limited resources, achieving safe autonomous navigation and high-level tasks such as exploration and surveillance is extremely challenging. Method: Splits the navigation task into two parts: a deep learning object detector runs on an edge device, while the planning algorithm runs onboard. Result: The drone can be commanded at about 8 frames per second, the model reaches a COCO mAP of 60.8, and the drone avoids obstacles while flying at 1 m/s. Conclusion: The approach is a feasible alternative for real-time navigation of nano-drones and can be extended to autonomous exploration. Abstract: The miniaturisation of sensors and processors, the advancements in connected edge intelligence, and the rapidly growing interest in Artificial Intelligence are accelerating the adoption of autonomous nano-size drones in the Internet of Robotic Things ecosystem. However, achieving safe autonomous navigation and high-level tasks such as exploration and surveillance with these tiny platforms is extremely challenging due to their limited resources. This work focuses on enabling the safe and autonomous flight of a pocket-size, 30-gram platform called Crazyflie 2.1 in a partially known environment. We propose a novel AI-aided, vision-based reactive planning method for obstacle avoidance under the ambit of the Integrated Sensing, Computing and Communication paradigm. We deal with the constraints of the nano-drone by splitting the navigation task into two parts: a deep learning-based object detector runs on the edge (external hardware) while the planning algorithm is executed onboard. The results show the ability to command the drone at $\sim8$ frames-per-second and a model performance reaching a COCO mean-average-precision of $60.8$. Field experiments demonstrate the feasibility of the solution with the drone flying at a top speed of $1$ m/s while steering away from an obstacle placed in an unknown position and reaching the target destination. The outcome highlights the compatibility of the communication delay and the model performance with the requirements of the real-time navigation task. We provide a feasible alternative to a fully onboard implementation that can be extended to autonomous exploration with nano-drones.

[158] The City that Never Settles: Simulation-based LiDAR Dataset for Long-Term Place Recognition Under Extreme Structural Changes

Hyunho Song,Dongjae Lee,Seunghun Oh,Minwoo Jung,Ayoung Kim

Main category: cs.RO

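TL;DR: Introduces the CNS dataset and the TCR_sym metric to address the challenge that large-scale construction and demolition pose for long-term place recognition, and shows that existing methods degrade under significant environmental change.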

Details

Motivation: Large-scale construction and demolition pose a major challenge for long-term place recognition (PR), and existing datasets do not adequately reflect significant changes in outdoor environments. Method: Creates the CNS dataset with the CARLA simulator and proposes the TCR_sym metric to measure structural change consistently. Result: CNS covers more extensive changes than existing real-world benchmarks, and existing LiDAR-based PR methods degrade substantially on it. Conclusion: More robust algorithms are needed to handle significant environmental change, and CNS provides a new benchmark for this research. Abstract: Large-scale construction and demolition significantly challenge long-term place recognition (PR) by drastically reshaping urban and suburban environments. Existing datasets predominantly reflect limited or indoor-focused changes, failing to adequately represent extensive outdoor transformations. To bridge this gap, we introduce the City that Never Settles (CNS) dataset, a simulation-based dataset created using the CARLA simulator, capturing major structural changes, such as building construction and demolition, across diverse maps and sequences. Additionally, we propose TCR_sym, a symmetric version of the original TCR metric, enabling consistent measurement of structural changes irrespective of source-target ordering. Quantitative comparisons demonstrate that CNS encompasses more extensive transformations than current real-world benchmarks. Evaluations of state-of-the-art LiDAR-based PR methods on CNS reveal substantial performance degradation, underscoring the need for robust algorithms capable of handling significant environmental changes. Our dataset is available at https://github.com/Hyunho111/CNS_dataset.

[159] Multi-Objective Reinforcement Learning for Adaptive Personalized Autonomous Driving

Hendrik Surmann,Jorge de Heuvel,Maren Bennewitz

Main category: cs.RO

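TL;DR: Proposes an end-to-end autonomous driving method based on multi-objective reinforcement learning (MORL) with preference-driven optimization that adapts to driving style preferences at runtime without retraining the policy.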

Details

Motivation: Human driving styles vary, and existing autonomous driving methods struggle to adapt dynamically to individual preferences, which hurts user trust and satisfaction. Method: Uses MORL with preferences encoded as continuous weight vectors to modulate behavior objectives (such as efficiency, comfort, speed, and aggressiveness), combined with vision-based perception, and tests it in complex traffic scenarios. Result: Evaluated in the CARLA simulator, the agent adapts its driving behavior to changing preferences while maintaining collision avoidance and route completion performance. Conclusion: The method supports dynamic, context-dependent driving style preferences, improving the adaptability of autonomous driving and user satisfaction. Abstract: Human drivers exhibit individual preferences regarding driving style. Adapting autonomous vehicles to these preferences is essential for user trust and satisfaction. However, existing end-to-end driving approaches often rely on predefined driving styles or require continuous user feedback for adaptation, limiting their ability to support dynamic, context-dependent preferences. We propose a novel approach using multi-objective reinforcement learning (MORL) with preference-driven optimization for end-to-end autonomous driving that enables runtime adaptation to driving style preferences. Preferences are encoded as continuous weight vectors to modulate behavior along interpretable style objectives (including efficiency, comfort, speed, and aggressiveness) without requiring policy retraining. Our single-policy agent integrates vision-based perception in complex mixed-traffic scenarios and is evaluated in diverse urban environments using the CARLA simulator. Experimental results demonstrate that the agent dynamically adapts its driving behavior according to changing preferences while maintaining performance in terms of collision avoidance and route completion.
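
The abstract states that preferences are encoded as continuous weight vectors over style objectives but not how they enter the reward. A common choice in MORL is linear scalarization, sketched below under that assumption; the objective ordering, the normalization of the weights, and the function name are hypothetical.

```python
import numpy as np

def scalarize_reward(reward_vector, preference_weights):
    """Collapse a per-objective reward vector into one scalar.

    Assumption: linear scalarization with non-negative weights normalized to
    sum to 1. The objectives could be, for example, efficiency, comfort,
    speed, and aggressiveness, in that (hypothetical) order.
    """
    rewards = np.asarray(reward_vector, dtype=float)
    weights = np.asarray(preference_weights, dtype=float)
    if rewards.shape != weights.shape:
        raise ValueError("reward_vector and preference_weights must match in shape")
    if np.any(weights < 0) or weights.sum() == 0:
        raise ValueError("weights must be non-negative and not all zero")
    weights = weights / weights.sum()
    return float(weights @ rewards)
```

In a preference-conditioned setup the same weight vector would typically also be fed to the policy as input, so a single network can cover many styles; changing the vector at runtime then changes behavior without retraining.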

cs.IR [Back]

[160] HiPerRAG: High-Performance Retrieval Augmented Generation for Scientific Insights

Ozan Gokdemir,Carlo Siebenschuh,Alexander Brace,Azton Wells,Brian Hsu,Kyle Hippe,Priyanka V. Setty,Aswathy Ajith,J. Gregory Pauloski,Varuni Sastry,Sam Foreman,Huihuo Zheng,Heng Ma,Bharat Kale,Nicholas Chia,Thomas Gibbs,Michael E. Papka,Thomas Brettin,Francis J. Alexander,Anima Anandkumar,Ian Foster,Rick Stevens,Venkatram Vishwanath,Arvind Ramanathan

Main category: cs.IR

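TL;DR: HiPerRAG is a RAG workflow built on high performance computing that indexes and retrieves from 3.6 million scientific articles, addressing the computational cost and semantic alignment problems of processing scientific literature at scale.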

Details

Motivation: The rapid growth of the scientific literature leads to underutilized discoveries, duplicated effort, and limited cross-disciplinary collaboration; RAG can improve the accuracy with which LLMs process this information, but applying it at scale is challenging. Method: HiPerRAG combines Oreo (a multimodal document parsing model) and ColTrast (a query-aware encoder fine-tuning algorithm), using contrastive learning and late-interaction techniques to improve retrieval accuracy. Result: It reaches 90% accuracy on SciQ and 76% on PubMedQA, outperforming PubMedGPT and GPT-4. Conclusion: Through high performance computing, HiPerRAG unifies scientific knowledge at scale and fosters interdisciplinary innovation. Abstract: The volume of scientific literature is growing exponentially, leading to underutilized discoveries, duplicated efforts, and limited cross-disciplinary collaboration. Retrieval Augmented Generation (RAG) offers a way to assist scientists by improving the factuality of Large Language Models (LLMs) in processing this influx of information. However, scaling RAG to handle millions of articles introduces significant challenges, including the high computational costs associated with parsing documents and embedding scientific knowledge, as well as the algorithmic complexity of aligning these representations with the nuanced semantics of scientific content. To address these issues, we introduce HiPerRAG, a RAG workflow powered by high performance computing (HPC) to index and retrieve knowledge from more than 3.6 million scientific articles. At its core are Oreo, a high-throughput model for multimodal document parsing, and ColTrast, a query-aware encoder fine-tuning algorithm that enhances retrieval accuracy by using contrastive learning and late-interaction techniques. HiPerRAG delivers robust performance on existing scientific question answering benchmarks and two new benchmarks introduced in this work, achieving 90% accuracy on SciQ and 76% on PubMedQA, outperforming both domain-specific models like PubMedGPT and commercial LLMs such as GPT-4. Scaling to thousands of GPUs on the Polaris, Sunspot, and Frontier supercomputers, HiPerRAG delivers million document-scale RAG workflows for unifying scientific knowledge and fostering interdisciplinary innovation.
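Late interaction, mentioned above as one ingredient of ColTrast, usually means scoring a query against a document token by token instead of comparing one pooled vector per text. The sketch below shows the widely used MaxSim form of that score. It is an illustration under assumptions: the abstract does not give ColTrast's scoring function, and `l2_normalize` and `maxsim_score` are hypothetical names.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Return the vector scaled to unit length (zero vectors are left as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_tokens: Sequence[Sequence[float]],
                 doc_tokens: Sequence[Sequence[float]]) -> float:
    """Late-interaction score: for each query token embedding, take the
    maximum cosine similarity over all document token embeddings, then sum.

    Both inputs must be non-empty lists of equal-length vectors.
    """
    q = [l2_normalize(v) for v in query_tokens]
    d = [l2_normalize(v) for v in doc_tokens]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )
```

Because each query token picks its own best-matching document token, a document that covers every aspect of the query scores higher than one whose tokens only match the query on average.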

[161] Prompt-Based LLMs for Position Bias-Aware Reranking in Personalized Recommendations

Md Aminul Islam,Ahmed Sayeed Faruk

Main category: cs.IR

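TL;DR: The paper proposes a hybrid framework that combines a traditional recommendation model with a large language model (LLM) to rerank recommendation lists, and studies the limitations of LLMs in recommender systems, such as position bias and ranking ability.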

Details

Motivation: Large language models (LLMs) show promise in recommender systems but suffer from limited context windows, position bias, and weak ranking ability, which call for further study and improvement. Method: A hybrid framework combines a traditional recommendation model with an LLM, reranks the top-k items through structured prompts, and evaluates how reordering the user history and adding instructional prompts affect position bias. Result: Experiments show that randomizing the user history improves ranking quality, but LLM reranking does not outperform the base model, and explicit instructions are ineffective at reducing position bias. Conclusion: LLMs are limited in modeling ranking context and mitigating bias, and further research is needed. Abstract: Recommender systems are essential for delivering personalized content across digital platforms by modeling user preferences and behaviors. Recently, large language models (LLMs) have been adopted for prompt-based recommendation due to their ability to generate personalized outputs without task-specific training. However, LLM-based methods face limitations such as limited context window size, inefficient pointwise and pairwise prompting, and difficulty handling listwise ranking due to token constraints. LLMs can also be sensitive to position bias, as they may overemphasize earlier items in the prompt regardless of their true relevance. To address and investigate these issues, we propose a hybrid framework that combines a traditional recommendation model with an LLM for reranking top-k items using structured prompts. We evaluate the effects of user history reordering and instructional prompts for mitigating position bias. Experiments on MovieLens-100K show that randomizing user history improves ranking quality, but LLM-based reranking does not outperform the base model. Explicit instructions to reduce position bias are also ineffective. Our evaluations reveal limitations in LLMs' ability to model ranking context and mitigate bias. Our code is publicly available at https://github.com/aminul7506/LLMForReRanking.

eess.IV [Back]

[162] Rethinking Boundary Detection in Deep Learning-Based Medical Image Segmentation

Yi Lin,Dong Zhang,Xiao Fang,Yufan Chen,Kwang-Ting Cheng,Hao Chen

Main category: eess.IV

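TL;DR: This paper proposes CTO, a new network architecture that combines CNNs, ViTs, and edge detection operators and substantially improves segmentation accuracy in the boundary regions of medical images.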

Details

Motivation: Precise segmentation of boundary regions remains challenging in medical image segmentation, and existing methods struggle to balance accuracy and efficiency. Method: CTO uses a dual-stream encoder (CNN and StitchViT) and a boundary-guided decoder, with boundary masks generated by edge detection operators guiding the decoding. Result: Experiments on seven medical image datasets show that CTO achieves state-of-the-art accuracy while keeping model complexity low. Conclusion: By combining several techniques, CTO markedly improves boundary segmentation and offers an efficient solution for medical image analysis. Abstract: Medical image segmentation is a pivotal task within the realms of medical image analysis and computer vision. While current methods have shown promise in accurately segmenting major regions of interest, the precise segmentation of boundary areas remains challenging. In this study, we propose a novel network architecture named CTO, which combines Convolutional Neural Networks (CNNs), Vision Transformer (ViT) models, and explicit edge detection operators to tackle this challenge. CTO surpasses existing methods in terms of segmentation accuracy and strikes a better balance between accuracy and efficiency, without the need for additional data inputs or label injections. Specifically, CTO adheres to the canonical encoder-decoder network paradigm, with a dual-stream encoder network comprising a mainstream CNN stream for capturing local features and an auxiliary StitchViT stream for integrating long-range dependencies. Furthermore, to enhance the model's ability to learn boundary areas, we introduce a boundary-guided decoder network that employs binary boundary masks generated by dedicated edge detection operators to provide explicit guidance during the decoding process. We validate the performance of CTO through extensive experiments conducted on seven challenging medical image segmentation datasets, namely ISIC 2016, PH2, ISIC 2018, CoNIC, LiTS17, and BTCV. Our experimental results unequivocally demonstrate that CTO achieves state-of-the-art accuracy on these datasets while maintaining competitive model complexity. The codes have been released at: https://github.com/xiaofang007/CTO.
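To make the notion of a binary boundary mask produced by an edge detection operator concrete, here is a small pure-Python sketch that thresholds the Sobel gradient magnitude. It rests on assumptions: the abstract does not say which operator CTO uses, so Sobel is chosen only as a familiar example, and `sobel_boundary_mask` and its `threshold` parameter are hypothetical.

```python
from typing import List, Sequence

SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))   # responds to horizontal intensity change
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))   # responds to vertical intensity change


def sobel_boundary_mask(image: Sequence[Sequence[float]],
                        threshold: float) -> List[List[int]]:
    """Return a 0/1 mask marking pixels whose Sobel gradient magnitude
    is at least `threshold`. Border pixels are left at 0."""
    h, w = len(image), len(image[0])
    mask = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0.0
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * pixel
                    gy += SOBEL_Y[dy][dx] * pixel
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                mask[y][x] = 1
    return mask
```

On an image whose left half is 0 and right half is 1, the mask lights up only in the two columns that straddle the step, which is the kind of explicit boundary cue a decoder could be guided by.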

[163] Advancing 3D Medical Image Segmentation: Unleashing the Potential of Planarian Neural Networks in Artificial Intelligence

Ziyuan Huang,Kevin Huggins,Srikar Bellur

Main category: eess.IV

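TL;DR: PNN-UNet is a deep learning method based on the structure of the planarian neural network for 3D medical image segmentation, and it outperforms the conventional UNet and its variants.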

Details

Motivation: Inspired by the planarian neural network, the study proposes a new network structure intended to improve the accuracy and efficiency of 3D medical image segmentation. Method: PNN-UNet consists of a Deep-UNet and a Wide-UNet acting as the nerve cords and a densely connected autoencoder acting as the brain, mimicking the planarian neural network structure. Result: On a 3D MRI hippocampus dataset, PNN-UNet outperforms the baseline UNet and other UNet variants in image segmentation. Conclusion: By imitating a biological neural network structure, PNN-UNet offers an efficient, high-performing new method for 3D medical image segmentation. Abstract: Our study presents PNN-UNet as a method for constructing deep neural networks that replicate the planarian neural network (PNN) structure in the context of 3D medical image data. Planarians typically have a cerebral structure comprising two neural cords, where the cerebrum acts as a coordinator, and the neural cords serve slightly different purposes within the organism's neurological system. Accordingly, PNN-UNet comprises a Deep-UNet and a Wide-UNet as the nerve cords, with a densely connected autoencoder performing the role of the brain. This distinct architecture offers advantages over both monolithic (UNet) and modular networks (Ensemble-UNet). Our outcomes on a 3D MRI hippocampus dataset, with and without data augmentation, demonstrate that PNN-UNet outperforms the baseline UNet and several other UNet variants in image segmentation.

[164] Advanced 3D Imaging Approach to TSV/TGV Metrology and Inspection Using Only Optical Microscopy

Gugeong Sung

Main category: eess.IV

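TL;DR: This paper proposes a new method that combines hybrid field microscopy with photometric stereo for inspecting silicon and glass vias, overcoming the limitations of conventional optical microscopy.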

Details

Motivation: Conventional optical microscopy is usually limited to surface inspection and cannot effectively visualize the internal structure of silicon and glass vias. Method: Photometric stereo is combined with conventional optical microscopy, and multiple lighting conditions are used for 3D reconstruction, which strengthens the detection of micro-scale defects. Result: Experiments show that the method captures intricate surface details and internal structures and markedly improves the cost-effectiveness and accuracy of the inspection process. Conclusion: The method is a substantial advance in silicon and glass via inspection while maintaining high accuracy and repeatability. Abstract: This paper introduces an innovative approach to silicon and glass via inspection, which combines hybrid field microscopy with photometric stereo. Conventional optical microscopy techniques are generally limited to superficial inspections and struggle to effectively visualize the internal structures of silicon and glass vias. By utilizing various lighting conditions for 3D reconstruction, the proposed method surpasses these limitations. By integrating photometric stereo into traditional optical microscopy, the proposed method not only enhances the capability to detect micro-scale defects but also provides a detailed visualization of depth and edge abnormalities, which are typically not visible with conventional optical microscopy inspection. The experimental results demonstrated that the proposed method effectively captures intricate surface details and internal structures. Quantitative comparisons between the reconstructed models and actual measurements show the capability of the proposed method to significantly improve the silicon and glass via inspection process. As a result, the proposed method achieves enhanced cost-effectiveness while maintaining high accuracy and repeatability, suggesting substantial advancements in silicon and glass via inspection techniques.

[165] MoRe-3DGSMR: Motion-resolved reconstruction framework for free-breathing pulmonary MRI based on 3D Gaussian representation

Tengya Peng,Ruyi Zha,Qing Zou

Main category: eess.IV

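TL;DR: Proposes an unsupervised motion-resolved reconstruction framework based on a 3D Gaussian representation for high-resolution free-breathing pulmonary MRI, achieving high-quality image reconstruction through data smoothing and spatial transformation.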

Details

Motivation: To address the challenges of motion-resolved, 3D isotropic reconstruction in free-breathing pulmonary MRI and to improve image quality. Method: Data are acquired with a golden-angle radial sampling trajectory; the respiratory motion signal is extracted and the data are sorted into phases; a 3D Gaussian representation is combined with a convolutional neural network that estimates deformation vector fields to reconstruct the motion states. Result: Validated on six datasets, the method yields higher image quality than existing methods, with better signal-to-noise and contrast-to-noise ratios. Conclusion: The method offers a robust solution for clinical pulmonary MRI and has potential practical value. Abstract: This study presents an unsupervised, motion-resolved reconstruction framework for high-resolution, free-breathing pulmonary magnetic resonance imaging (MRI), utilizing a three-dimensional Gaussian representation (3DGS). The proposed method leverages 3DGS to address the challenges of motion-resolved 3D isotropic pulmonary MRI reconstruction by enabling data smoothing between voxels for continuous spatial representation. Pulmonary MRI data acquisition is performed using a golden-angle radial sampling trajectory, with respiratory motion signals extracted from the center of k-space in each radial spoke. Based on the estimated motion signal, the k-space data is sorted into multiple respiratory phases. A 3DGS framework is then applied to reconstruct a reference image volume from the first motion state. Subsequently, a patient-specific convolutional neural network is trained to estimate the deformation vector fields (DVFs), which are used to generate the remaining motion states through spatial transformation of the reference volume. The proposed reconstruction pipeline is evaluated on six datasets from six subjects and benchmarked against three state-of-the-art reconstruction methods. The experimental findings demonstrate that the proposed reconstruction framework effectively reconstructs high-resolution, motion-resolved pulmonary MR images. Compared with existing approaches, it achieves superior image quality, reflected by higher signal-to-noise ratio and contrast-to-noise ratio. The proposed unsupervised 3DGS-based reconstruction method enables accurate motion-resolved pulmonary MRI with isotropic spatial resolution. Its superior performance in image quality metrics over state-of-the-art methods highlights its potential as a robust solution for clinical pulmonary MR imaging.
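The step in which k-space data are sorted into respiratory phases from an estimated motion signal can be sketched as amplitude binning with roughly equal spoke counts per phase. This is an assumption made for illustration: the abstract does not state the binning rule, and `bin_spokes_by_amplitude` is a hypothetical helper.

```python
from typing import List, Sequence


def bin_spokes_by_amplitude(motion_signal: Sequence[float],
                            num_phases: int) -> List[List[int]]:
    """Group radial spoke indices into `num_phases` respiratory bins.

    `motion_signal[i]` is the motion amplitude estimated for spoke i.
    Spokes are ranked by amplitude and split into bins of near-equal size,
    so bin 0 holds the lowest amplitudes and the last bin the highest.
    """
    if num_phases <= 0:
        raise ValueError("num_phases must be positive")
    n = len(motion_signal)
    bins: List[List[int]] = [[] for _ in range(num_phases)]
    if n == 0:
        return bins
    order = sorted(range(n), key=lambda i: motion_signal[i])
    for rank, spoke in enumerate(order):
        bins[rank * num_phases // n].append(spoke)
    return bins
```

Equal-count bins keep the amount of k-space data per motion state comparable, which matters when each state is reconstructed from its own subset of spokes.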

[166] ADNP-15: An Open-Source Histopathological Dataset for Neuritic Plaque Segmentation in Human Brain Whole Slide Images with Frequency Domain Image Enhancement for Stain Normalization

Chenxi Zhao,Jianqiang Li,Qing Zhao,Jing Bai,Susana Boluda,Benoit Delatour,Lev Stimmer,Daniel Racoceanu,Gabriel Jimenez,Guanghui Fu

Main category: eess.IV

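TL;DR: The paper presents ADNP-15, an open dataset for segmentation of Alzheimer's disease (AD) pathology images, and evaluates five deep learning models and four stain normalization techniques. It also proposes a new image enhancement method that significantly improves segmentation accuracy.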

Details

Motivation: AD research requires accurate identification and segmentation of amyloid plaques and tau tangles, but existing methods are limited by dataset size and staining variation. Method: The ADNP-15 dataset is introduced, five deep learning models and four stain normalization techniques are evaluated, and a new image enhancement method is proposed. Result: Experiments show that the proposed image enhancement method significantly improves model generalization and segmentation accuracy. Conclusion: The open dataset and code give AD research a transparent, reproducible toolset and advance the field. Abstract: Alzheimer's Disease (AD) is a neurodegenerative disorder characterized by amyloid-beta plaques and tau neurofibrillary tangles, which serve as key histopathological features. The identification and segmentation of these lesions are crucial for understanding AD progression but remain challenging due to the lack of large-scale annotated datasets and the impact of staining variations on automated image analysis. Deep learning has emerged as a powerful tool for pathology image segmentation; however, model performance is significantly influenced by variations in staining characteristics, necessitating effective stain normalization and enhancement techniques. In this study, we address these challenges by introducing an open-source dataset (ADNP-15) of neuritic plaques (i.e., amyloid deposits combined with a crown of dystrophic tau-positive neurites) in human brain whole slide images. We establish a comprehensive benchmark by evaluating five widely adopted deep learning models across four stain normalization techniques, providing deeper insights into their influence on neuritic plaque segmentation. Additionally, we propose a novel image enhancement method that improves segmentation accuracy, particularly in complex tissue structures, by enhancing structural details and mitigating staining inconsistencies. Our experimental results demonstrate that this enhancement strategy significantly boosts model generalization and segmentation accuracy. All datasets and code are open-source, ensuring transparency and reproducibility while enabling further advancements in the field.

[167] Direct Image Classification from Fourier Ptychographic Microscopy Measurements without Reconstruction

Navya Sonal Agarwal,Jan Philipp Schneider,Kanchana Vaishnavi Gandikota,Syed Muhammad Kazim,John Meshreki,Ivo Ihrke,Michael Moeller

Main category: eess.IV

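TL;DR: High-resolution imaging with Fourier Ptychographic Microscopy (FPM) is computationally expensive; this paper proposes classifying directly from the measurements, without reconstruction. A CNN can extract useful information from measurement sequences, classifying better than from a single image and more efficiently, while learned multiplexing reduces the amount of data and the acquisition time.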

Details

Motivation: High-resolution FPM reconstruction is computationally expensive, especially for a wide field of view. Classifying directly from the measurements avoids the reconstruction step and improves efficiency. Method: A convolutional neural network (CNN) extracts information directly from FPM measurement sequences for classification, and a multiplexing of measurements is learned to reduce the amount of data. Result: CNN classification outperforms classification on a single image (by up to 12%) and is more efficient than high-resolution reconstruction. Multiplexing reduces the amount of data and the acquisition time while maintaining classification accuracy. Conclusion: Classifying directly from FPM measurements is efficient and feasible, and CNNs with multiplexing open new options for practical use. Abstract: The computational imaging technique of Fourier Ptychographic Microscopy (FPM) enables high-resolution imaging with a wide field of view and can serve as an extremely valuable tool, e.g., in the classification of cells in medical applications. However, reconstructing a high-resolution image from tens or even hundreds of measurements is computationally expensive, particularly for a wide field of view. Therefore, in this paper, we investigate the idea of classifying the image content in the FPM measurements directly without performing a reconstruction step first. We show that Convolutional Neural Networks (CNN) can extract meaningful information from measurement sequences, significantly outperforming the classification on a single band-limited image (up to 12%) while being significantly more efficient than a reconstruction of a high-resolution image. Furthermore, we demonstrate that a learned multiplexing of several raw measurements allows maintaining the classification accuracy while reducing the amount of data (and consequently also the acquisition time) significantly.

[168] RepSNet: A Nucleus Instance Segmentation model based on Boundary Regression and Structural Re-parameterization

Shengchun Xiong,Xiangru Li,Yunpeng Zhong,Wanfen Peng

Main category: eess.IV

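TL;DR: RepSNet is a neural network model based on nucleus boundary regression and structural re-parameterization for segmenting and classifying nuclei in H&E-stained pathology images, addressing the challenges of computational efficiency and overlapping targets.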

Details

Motivation: Pathological diagnosis is the gold standard for tumor diagnosis, and nucleus instance segmentation is a key step in digital pathology analysis and pathological diagnosis, but computational efficiency and the handling of overlapping targets are the main challenges. Method: RepSNet segments nuclei by estimating boundary position information (BPI) and applying a boundary voting mechanism (BVM), and uses structural re-parameterization to reduce model parameters and computational load. Result: Experiments show that RepSNet outperforms several typical benchmark models. Conclusion: Through guidance from macroscopic information and structural re-parameterization, RepSNet markedly improves the accuracy and computational efficiency of nucleus segmentation. Abstract: Pathological diagnosis is the gold standard for tumor diagnosis, and nucleus instance segmentation is a key step in digital pathology analysis and pathological diagnosis. However, the computational efficiency of the model and the treatment of overlapping targets are the major challenges in the studies of this problem. To this end, a neural network model RepSNet was designed based on a nucleus boundary regression and a structural re-parameterization scheme for segmenting and classifying the nuclei in H&E-stained histopathological images. First, RepSNet estimates the boundary position information (BPI) of the parent nucleus for each pixel. The BPI estimation incorporates the local information of the pixel and the contextual information of the parent nucleus. Then, the nucleus boundary is estimated by aggregating the BPIs from a series of pixels using a proposed boundary voting mechanism (BVM), and the instance segmentation results are computed from the estimated nucleus boundary using a connected component analysis procedure. The BVM intrinsically achieves a kind of synergistic belief enhancement among the BPIs from various pixels. Therefore, different from the methods available in the literature that obtain nucleus boundaries based on a direct pixel recognition scheme, RepSNet computes its boundary decisions based on guidance from macroscopic information using an integration mechanism. In addition, RepSNet employs a re-parameterizable encoder-decoder structure. This model can not only aggregate features from receptive fields of various scales, which helps improve segmentation accuracy, but also reduce the number of parameters and the computational burden in the model inference phase through the structural re-parameterization technique. Extensive experiments demonstrated the superiority of RepSNet compared to several typical benchmark models.
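Structural re-parameterization, as generally understood, trains with several parallel linear branches and then folds them into one equivalent kernel for inference. The sketch below shows the simplest case, merging a 3x3 and a 1x1 convolution branch on a single channel. It is a generic illustration under assumptions, not RepSNet's actual blocks, and `conv2d_same` and `merge_branches` are hypothetical names.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def conv2d_same(image: Sequence[Sequence[float]],
                kernel: Sequence[Sequence[float]]) -> Matrix:
    """Single-channel 2D cross-correlation with zero padding.

    The kernel must be square with odd size; the output has the input's shape.
    """
    k = len(kernel)
    r = k // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(k):
                for dx in range(k):
                    yy, xx = y + dy - r, x + dx - r
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[dy][dx] * image[yy][xx]
            out[y][x] = acc
    return out


def merge_branches(k3: Sequence[Sequence[float]],
                   k1: Sequence[Sequence[float]]) -> Matrix:
    """Fold a 1x1 kernel into the centre tap of a 3x3 kernel.

    Convolution is linear, so conv(x, k3) + conv(x, k1) == conv(x, merged).
    """
    merged = [list(row) for row in k3]
    merged[1][1] += k1[0][0]
    return merged
```

After merging, inference runs one convolution instead of two while producing the same output, which is where the reduction in inference-time parameters and computation comes from.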

[169] MDAA-Diff: CT-Guided Multi-Dose Adaptive Attention Diffusion Model for PET Denoising

Xiaolong Niu,Zanting Ye,Xu Han,Yanchao Huang,Hao Sun,Hubing Wu,Lijun Lu

Main category: eess.IV

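TL;DR: Proposes MDAA-Diff, a new CT-guided multi-dose adaptive attention denoising diffusion model for multi-dose PET image denoising that combines anatomical guidance with dose-level adaptation and markedly improves denoising under low-dose conditions.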

Details

Motivation: High-dose radiotracers increase radiation risk, while existing work focuses mostly on single-dose denoising and overlooks inter-patient differences in dose response and the anatomical constraints available from CT images. Method: A CT-guided high-frequency wavelet attention module extracts anatomical boundary features, a dose-adaptive attention module dynamically integrates the dose level, and an adaptive weighted fusion mechanism enhances image detail. Result: Experiments on 18F-FDG and 68Ga-FAPI datasets show that MDAA-Diff outperforms existing methods under low-dose conditions while preserving diagnostic quality. Conclusion: Through anatomical guidance and dose adaptation, MDAA-Diff markedly improves multi-dose PET denoising and offers a workable approach to low-dose PET imaging. Abstract: Acquiring high-quality Positron Emission Tomography (PET) images requires administering high-dose radiotracers, which increases radiation exposure risks. Generating standard-dose PET (SPET) from low-dose PET (LPET) has become a potential solution. However, previous studies have primarily focused on single low-dose PET denoising, neglecting two critical factors: discrepancies in dose response caused by inter-patient variability, and complementary anatomical constraints derived from CT images. In this work, we propose a novel CT-Guided Multi-dose Adaptive Attention Denoising Diffusion Model (MDAA-Diff) for multi-dose PET denoising. Our approach integrates anatomical guidance and dose-level adaptation to achieve superior denoising performance under low-dose conditions. Specifically, this approach incorporates a CT-Guided High-frequency Wavelet Attention (HWA) module, which uses wavelet transforms to separate high-frequency anatomical boundary features from CT images. These extracted features are then incorporated into PET imaging through an adaptive weighted fusion mechanism to enhance edge details. Additionally, we propose the Dose-Adaptive Attention (DAA) module, a dose-conditioned enhancement mechanism that dynamically integrates dose levels into channel-spatial attention weight calculation. Extensive experiments on 18F-FDG and 68Ga-FAPI datasets demonstrate that MDAA-Diff outperforms state-of-the-art approaches in preserving diagnostic quality under reduced-dose conditions. Our code is publicly available.
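The way a wavelet transform separates high-frequency boundary content from an image can be seen in a one-level 2D Haar decomposition. This sketch assumes the Haar wavelet purely for simplicity (the abstract does not name the wavelet used in the HWA module), and `haar_dwt2` with its output names is hypothetical.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def haar_dwt2(image: Sequence[Sequence[float]]) -> Tuple[Matrix, Matrix, Matrix, Matrix]:
    """One-level orthonormal 2D Haar transform of an image with even height and width.

    Returns (approx, row_detail, col_detail, diag_detail), each half the size
    of the input. `approx` is the low-frequency band; the three detail bands
    carry the high-frequency content (edges between rows, between columns,
    and along diagonals within each 2x2 block).
    """
    h, w = len(image), len(image[0])
    if h % 2 or w % 2:
        raise ValueError("image height and width must be even")
    approx, row_d, col_d, diag_d = [], [], [], []
    for y in range(0, h, 2):
        ra, rr, rc, rd = [], [], [], []
        for x in range(0, w, 2):
            a, b = image[y][x], image[y][x + 1]
            c, d = image[y + 1][x], image[y + 1][x + 1]
            ra.append((a + b + c + d) / 2)
            rr.append((a + b - c - d) / 2)  # top row minus bottom row
            rc.append((a - b + c - d) / 2)  # left column minus right column
            rd.append((a - b - c + d) / 2)  # diagonal difference
        approx.append(ra)
        row_d.append(rr)
        col_d.append(rc)
        diag_d.append(rd)
    return approx, row_d, col_d, diag_d
```

Flat regions produce zero detail coefficients, so the detail bands are non-zero only near intensity changes, which is why they are a natural source of anatomical boundary features.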

[170] Improved Brain Tumor Detection in MRI: Fuzzy Sigmoid Convolution in Deep Learning

Muhammad Irfan,Anum Nawaz,Riku Klen,Abdulhamit Subasi,Tomi Westerlund,Wei Chen

Main category: eess.IV

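TL;DR: The paper proposes a lightweight deep learning model based on fuzzy sigmoid convolution (FSC) for early brain tumor detection, substantially reducing the number of parameters while keeping classification accuracy high.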

Details

Motivation: Early detection and accurate diagnosis are essential to improving patient outcomes, but overparameterization limits the performance gains of existing CNN models. Method: An FSC module and two additional modules (top-of-the-funnel and middle-of-the-funnel) are introduced; a new convolutional operator enlarges the receptive field while preserving data integrity, and fuzzy sigmoid activation functions improve feature extraction. Result: Classification accuracy reaches 99.17%, 99.75%, and 99.89% on three benchmark datasets, with 100 times fewer parameters than conventional models. Conclusion: The study provides a lightweight, high-performance deep learning model for medical imaging applications. Abstract: Early detection and accurate diagnosis are essential to improving patient outcomes. The use of convolutional neural networks (CNNs) for tumor detection has shown promise, but existing models often suffer from overparameterization, which limits their performance gains. In this study, fuzzy sigmoid convolution (FSC) is introduced along with two additional modules: top-of-the-funnel and middle-of-the-funnel. The proposed methodology significantly reduces the number of trainable parameters without compromising classification accuracy. A novel convolutional operator is central to this approach, effectively dilating the receptive field while preserving input data integrity. This enables efficient feature map reduction and enhances the model's tumor detection capability. In the FSC-based model, fuzzy sigmoid activation functions are incorporated within convolutional layers to improve feature extraction and classification. The inclusion of fuzzy logic into the architecture improves its adaptability and robustness. Extensive experiments on three benchmark datasets demonstrate the superior performance and efficiency of the proposed model. The FSC-based architecture achieved classification accuracies of 99.17%, 99.75%, and 99.89% on three different datasets. The model employs 100 times fewer parameters than large-scale transfer learning architectures, highlighting its computational efficiency and suitability for detecting brain tumors early. This research offers lightweight, high-performance deep-learning models for medical imaging applications.

[171] White Light Specular Reflection Data Augmentation for Deep Learning Polyp Detection

Jose Angel Nuñez,Fabian Vazquez,Diego Adame,Xiaoyan Fu,Pengfei Gu,Bin Fu

Main category: eess.IV

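TL;DR: The paper proposes a new data augmentation method that adds artificial white light reflections to training images to improve deep learning models for colon polyp detection.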

Details

Motivation: Existing deep learning polyp detectors often mistake white light reflections from the endoscope for polyps, producing false positives. To address this, the study uses data augmentation to generate harder training scenarios. Method: A bank of artificial white light reflections is generated first, the regions of each training image where reflections should not be added are identified, and a sliding window method then adds artificial reflections to suitable regions to produce augmented images. Result: Experiments show that the new data augmentation method improves model performance. Conclusion: By giving the model more opportunities to make mistakes and learn from them, the new method improves polyp detection accuracy. Abstract: Colorectal cancer is one of the deadliest cancers today, but it can be prevented through early detection of malignant polyps in the colon, primarily via colonoscopies. While this method has saved many lives, human error remains a significant challenge, as missing a polyp could have fatal consequences for the patient. Deep learning (DL) polyp detectors offer a promising solution. However, existing DL polyp detectors often mistake white light reflections from the endoscope for polyps, which can lead to false positives. To address this challenge, in this paper, we propose a novel data augmentation approach that artificially adds more white light reflections to create harder training scenarios. Specifically, we first generate a bank of artificial lights using the training dataset. Then we find the regions of the training images where these artificial lights should not be added. Finally, we propose a sliding window method to add the artificial lights to suitable areas of the training images, resulting in augmented images. By providing the model with more opportunities to make mistakes, we hypothesize that it will also have more chances to learn from those mistakes, ultimately improving its performance in polyp detection. Experimental results demonstrate the effectiveness of our new data augmentation method.
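The three steps described above (a bank of artificial lights, a set of regions to avoid, and a sliding window that places lights only where they fit) can be sketched on plain 2D lists. Everything here is an assumption made for illustration: the abstract does not define how forbidden regions are found or how lights are blended, and `find_valid_positions` and `paste_reflection` are hypothetical helpers.

```python
from typing import List, Sequence, Tuple


def find_valid_positions(forbidden: Sequence[Sequence[int]],
                         patch_h: int, patch_w: int,
                         stride: int = 1) -> List[Tuple[int, int]]:
    """Slide a patch-sized window over a 0/1 forbidden mask and return the
    top-left corners of windows that cover no forbidden pixel."""
    h, w = len(forbidden), len(forbidden[0])
    positions = []
    for y in range(0, h - patch_h + 1, stride):
        for x in range(0, w - patch_w + 1, stride):
            blocked = any(
                forbidden[y + dy][x + dx]
                for dy in range(patch_h)
                for dx in range(patch_w)
            )
            if not blocked:
                positions.append((y, x))
    return positions


def paste_reflection(image: Sequence[Sequence[int]],
                     patch: Sequence[Sequence[int]],
                     top: int, left: int) -> List[List[int]]:
    """Return a copy of a grayscale image with a bright patch blended in by
    keeping the brighter of the two values at each pixel (assumed blend rule)."""
    out = [list(row) for row in image]
    for dy, patch_row in enumerate(patch):
        for dx, value in enumerate(patch_row):
            out[top + dy][left + dx] = max(out[top + dy][left + dx], value)
    return out
```

A typical use would be to pick one of the returned positions at random for each augmented copy, so that reflections appear in varied places but never inside the regions marked as forbidden.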

[172] Benchmarking Ophthalmology Foundation Models for Clinically Significant Age Macular Degeneration Detection

Benjamin A. Cohen,Jonathan Fhima,Meishar Meisel,Baskin Meital,Luis Filipe Nakayama,Eran Berkowitz,Joachim A. Behar

Main category: eess.IV

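TL;DR: ViTs pretrained with self-supervised learning (SSL) on natural images perform strongly on retinal image tasks, outperforming domain-specific models and a baseline without pretraining.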

Details

Motivation: To examine whether in-domain pretraining is necessary for retinal image tasks. Method: Six SSL-pretrained ViTs are evaluated for AMD identification on seven DFI datasets totaling 70,000 images. Result: iBOT pretrained on natural images performs best (AUROC 0.80-0.97), ahead of domain-specific models (0.78-0.96) and the baseline without pretraining (0.68-0.91). Conclusion: Foundation models perform well in AMD identification, which challenges the need for in-domain pretraining; the open Brazilian dataset BRAMD is also released. Abstract: Self-supervised learning (SSL) has enabled Vision Transformers (ViTs) to learn robust representations from large-scale natural image datasets, enhancing their generalization across domains. In retinal imaging, foundation models pretrained on either natural or ophthalmic data have shown promise, but the benefits of in-domain pretraining remain uncertain. To investigate this, we benchmark six SSL-pretrained ViTs on seven digital fundus image (DFI) datasets totaling 70,000 expert-annotated images for the task of moderate-to-late age-related macular degeneration (AMD) identification. Our results show that iBOT pretrained on natural images achieves the highest out-of-distribution generalization, with AUROCs of 0.80-0.97, outperforming domain-specific models, which achieved AUROCs of 0.78-0.96, and a baseline ViT-L with no pretraining, which achieved AUROCs of 0.68-0.91. These findings highlight the value of foundation models in improving AMD identification and challenge the assumption that in-domain pretraining is necessary. Furthermore, we release BRAMD, an open-access dataset (n=587) of DFIs with AMD labels from Brazil.

[173] Augmented Deep Contexts for Spatially Embedded Video Coding

Yifan Bian,Chuanbo Tang,Li Li,Dong Liu

Main category: eess.IV

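TL;DR: SEVC is a neural video codec that combines spatial and temporal references, addressing the limitations of temporal-only codecs when motion is large or new objects appear.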

Details

Motivation: Conventional neural video codecs rely only on temporal references and therefore perform poorly when motion is large or new objects appear. Method: SEVC compresses a low-resolution video to produce spatial references, combines spatial and temporal references to generate augmented motion vectors and hybrid contexts, and introduces a spatial-guided latent prior. Result: Experiments show that SEVC handles large motions and emerging objects effectively, reduces bitrate by a further 11.9%, and provides an additional low-resolution bitstream. Conclusion: By jointly optimizing over space and time, SEVC markedly improves video coding performance. Abstract: Most Neural Video Codecs (NVCs) only employ temporal references to generate temporal-only contexts and latent prior. These temporal-only NVCs fail to handle large motions or emerging objects due to limited contexts and misaligned latent prior. To relieve the limitations, we propose a Spatially Embedded Video Codec (SEVC), in which the low-resolution video is compressed for spatial references. Firstly, our SEVC leverages both spatial and temporal references to generate augmented motion vectors and hybrid spatial-temporal contexts. Secondly, to address the misalignment issue in latent prior and enrich the prior information, we introduce a spatial-guided latent prior augmented by multiple temporal latent representations. Finally, we design a joint spatial-temporal optimization to learn quality-adaptive bit allocation for spatial references, further boosting rate-distortion performance. Experimental results show that our SEVC effectively alleviates the limitations in handling large motions or emerging objects, and also reduces 11.9% more bitrate than the previous state-of-the-art NVC while providing an additional low-resolution bitstream. Our code and model are available at https://github.com/EsakaK/SEVC.

[174] OcularAge: A Comparative Study of Iris and Periocular Images for Pediatric Age Estimation

Naveenkumar G Venkataswamy,Poorna Ravi,Stephanie Schuckers,Masudul H. Imtiaz

Main category: eess.IV

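TL;DR: The study compares iris and periocular images for estimating the age of children aged 4 to 16 and finds that periocular models outperform iris models, with a mean absolute error of 1.33 years and a classification accuracy of 83.82%.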

Details

Motivation: Estimating children's age is difficult because physiological changes are subtle and longitudinal data are scarce; existing work focuses mostly on adult facial features, and the periocular and iris regions of children have received little study. Method: Using a longitudinal dataset of more than 21,000 near-infrared images, a multi-task deep learning framework combines age prediction with age-group classification to explore how different CNN architectures handle pediatric eye images. Result: Periocular models outperform iris models, with a mean absolute error of 1.33 years and a classification accuracy of 83.82%; the models are robust across imaging sensors, and inference takes less than 10 milliseconds per image. Conclusion: The study is the first to show that children's periocular images support reliable age estimation, provides a benchmark for designing pediatric biometric systems, and demonstrates potential for real-time use. Abstract: Estimating a child's age from ocular biometric images is challenging due to subtle physiological changes and the limited availability of longitudinal datasets. Although most biometric age estimation studies have focused on facial features and adult subjects, pediatric-specific analysis, particularly of the iris and periocular regions, remains relatively unexplored. This study presents a comparative evaluation of iris and periocular images for estimating the ages of children aged between 4 and 16 years. We utilized a longitudinal dataset comprising more than 21,000 near-infrared (NIR) images, collected from 288 pediatric subjects over eight years using two different imaging sensors. A multi-task deep learning framework was employed to jointly perform age prediction and age-group classification, enabling a systematic exploration of how different convolutional neural network (CNN) architectures, particularly those adapted for non-square ocular inputs, capture the complex variability inherent in pediatric eye images. The results show that periocular models consistently outperform iris-based models, achieving a mean absolute error (MAE) of 1.33 years and an age-group classification accuracy of 83.82%. These results mark the first demonstration that reliable age estimation is feasible from children's ocular images, enabling privacy-preserving age checks in child-centric applications. This work establishes the first longitudinal benchmark for pediatric ocular age estimation, providing a foundation for designing robust, child-focused biometric systems. The developed models proved resilient across different imaging sensors, confirming their potential for real-world deployment. They also achieved inference speeds of less than 10 milliseconds per image on resource-constrained VR headsets, demonstrating their suitability for real-time applications.

cs.CR [Back]

[175] Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs

Chetan Pathade

Main category: cs.CR

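TL;DR: This paper systematically studies jailbreak strategies against several state-of-the-art large language models (LLMs), analyzes the success rates of more than 1,400 adversarial prompts, and proposes layered mitigation strategies.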

Details

Motivation: Despite their capabilities, LLMs remain vulnerable to adversarial attacks such as prompt injection and jailbreaks, so their security and defenses need systematic study. Method: More than 1,400 adversarial prompts are categorized and analyzed, their effectiveness is tested on GPT-4, Claude 2, Mistral 7B, and Vicuna, and their generalizability and construction logic are examined. Result: The study reveals how vulnerable different LLMs are to adversarial attacks and confirms that the attack strategies generalize. Conclusion: Layered mitigation strategies are proposed, and combining red teaming with sandboxing is recommended to strengthen LLM security. Abstract: Large Language Models (LLMs) are increasingly integrated into consumer and enterprise applications. Despite their capabilities, they remain susceptible to adversarial attacks such as prompt injection and jailbreaks that override alignment safeguards. This paper provides a systematic investigation of jailbreak strategies against various state-of-the-art LLMs. We categorize over 1,400 adversarial prompts, analyze their success against GPT-4, Claude 2, Mistral 7B, and Vicuna, and examine their generalizability and construction logic. We further propose layered mitigation strategies and recommend a hybrid red-teaming and sandboxing approach for robust LLM security.

cs.AI [Back]

[176] Towards Artificial Intelligence Research Assistant for Expert-Involved Learning

Tianyu Liu,Simeng Han,Xiao Luo,Hanchen Wang,Pan Lu,Biqing Zhu,Yuge Wang,Keyi Li,Jiapeng Chen,Rihao Qu,Yufeng Liu,Xinyue Cui,Aviv Yaish,Yuhang Chen,Minsheng Hao,Chuhan Li,Kexing Li,Arman Cohan,Hua Xu,Mark Gerstein,James Zou,Hongyu Zhao

Main category: cs.AI

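TL;DR: ARIEL is a multimodal dataset for evaluating and improving the text summarization and figure interpretation abilities of LLMs and LMMs in biomedical research; expert evaluation and optimization strategies expose the models' strengths and limitations.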

Details

Motivation: Although LLMs and LMMs hold promise for scientific research, their reliability and specific contributions in biomedical applications have not been studied sufficiently. Method: Open datasets of biomedical articles and figures are created, and model performance is improved through expert evaluation, prompt engineering, fine-tuning, and test-time compute scaling. Result: The models perform well on text summarization and figure interpretation but still have limitations; LMM Agents show potential for generating scientific hypotheses from multimodal inputs. Conclusion: The study offers practical guidance and directions for improvement for deploying large language and multimodal models in biomedical research. Abstract: Large Language Models (LLMs) and Large Multi-Modal Models (LMMs) have emerged as transformative tools in scientific research, yet their reliability and specific contributions to biomedical applications remain insufficiently characterized. In this study, we present ARtificial Intelligence research assistant for Expert-involved Learning (ARIEL), a multimodal dataset designed to benchmark and enhance two critical capabilities of LLMs and LMMs in biomedical research: summarizing extensive scientific texts and interpreting complex biomedical figures. To facilitate rigorous assessment, we create two open-source sets comprising biomedical articles and figures with designed questions. We systematically benchmark both open- and closed-source foundation models, incorporating expert-driven human evaluations conducted by doctoral-level experts. Furthermore, we improve model performance through targeted prompt engineering and fine-tuning strategies for summarizing research papers, and apply test-time computational scaling to enhance the reasoning capabilities of LMMs, achieving superior accuracy compared to human-expert corrections. We also explore the potential of using LMM Agents to generate scientific hypotheses from diverse multimodal inputs. Overall, our results delineate clear strengths and highlight significant limitations of current foundation models, providing actionable insights and guiding future advancements in deploying large-scale language and multi-modal models within biomedical research.

[177] CRAFT: Cultural Russian-Oriented Dataset Adaptation for Focused Text-to-Image Generation

Viacheslav Vasilev,Vladimir Arkhipkin,Julia Agafonova,Tatiana Nikulina,Evelina Mironova,Alisa Shichanina,Nikolai Gerasimenko,Mikhail Shoytov,Denis Dimitrov

Main category: cs.AI

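TL;DR: The paper examines the gaps in current text-to-image generation models' knowledge of individual cultures, proposes a method for building datasets based on the cultural code, and validates it using Russian culture as an example.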

Details

Motivation: Existing models perform well on Western cultural data but understand individual cultures (such as Russian culture) poorly, which leads to lower generation quality and stereotyping. Method: The authors propose a method for collecting and processing cultural-code data, build a Russian cultural dataset, and test its effect on the Kandinsky 3.1 model. Result: Human evaluation shows that the model's awareness of Russian culture improves. Conclusion: Introducing a cultural-code dataset can effectively improve generation quality in a specific cultural domain. Abstract: Despite the fact that popular text-to-image generation models cope well with international and general cultural queries, they have a significant knowledge gap regarding individual cultures. This is due to the content of existing large training datasets collected on the Internet, which are predominantly based on Western European or American popular culture. Meanwhile, the lack of cultural adaptation of the model can lead to incorrect results, a decrease in the generation quality, and the spread of stereotypes and offensive content. In an effort to address this issue, we examine the concept of cultural code and recognize the critical importance of its understanding by modern image generation models, an issue that has not been sufficiently addressed in the research community to date. We propose the methodology for collecting and processing the data necessary to form a dataset based on the cultural code, in particular the Russian one. We explore how the collected data affects the quality of generations in the national domain and analyze the effectiveness of our approach using the Kandinsky 3.1 text-to-image model. Human evaluation results demonstrate an increase in the level of awareness of Russian culture in the model.

[178] Enigme: Generative Text Puzzles for Evaluating Reasoning in Language Models

John Hawkins

Main category: cs.AI

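TL;DR: The paper discusses the limitations of transformer-decoder language models in generating reasoning and presents enigme, an open-source library that generates text puzzles for evaluating models' reasoning ability.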

Details

Motivation: To understand the limitations of transformer-decoder models in natural language command understanding and reasoning, in order to improve their use as general-purpose intelligence systems. Method: By analyzing the latent variable structure of these models, the author designs reasoning tasks that probe the boundary of their capacity, and develops the open-source library enigme to generate text puzzles. Result: The enigme library is presented for training and evaluating the reasoning ability of transformer-decoder models and future AI architectures. Conclusion: Purpose-built reasoning tasks allow more effective evaluation and improvement of model reasoning, supporting the use of these models in general-purpose intelligence systems. Abstract: Transformer-decoder language models are a core innovation in text based generative artificial intelligence. These models are being deployed as general-purpose intelligence systems in many applications. Central to their utility is the capacity to understand natural language commands and exploit the reasoning embedded in human text corpora to apply some form of reasoning process to a wide variety of novel tasks. To understand the limitations of this approach to generating reasoning we argue that we need to consider the architectural constraints of these systems. Consideration of the latent variable structure of transformer-decoder models allows us to design reasoning tasks that should probe the boundary of their capacity to reason. We present enigme, an open-source library for generating text-based puzzles to be used in training and evaluating reasoning skills within transformer-decoder models and future AI architectures.

Table of Contents

cs.CV

[1] Histo-Miner: Deep Learning based Tissue Features Extraction Pipeline from H&E Whole Slide Images of Cutaneous Squamous Cell Carcinoma

[2] Comparison of Visual Trackers for Biomechanical Analysis of Running

[3] Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers

[4] False Promises in Medical Imaging AI? Assessing Validity of Outperformance Claims

[5] Hyb-KAN ViT: Hybrid Kolmogorov-Arnold Networks Augmented Vision Transformer

[6] Lightweight RGB-D Salient Object Detection from a Speed-Accuracy Tradeoff Perspective

[7] Vision-Language-Action Models: Concepts, Progress, Applications and Challenges

[8] Replay to Remember (R2R): An Efficient Uncertainty-driven Unsupervised Continual Learning Framework Using Generative Replay

[9] Convex Relaxation for Robust Vanishing Point Estimation in Manhattan World

[10] DetReIDX: A Stress-Test Dataset for Real-World UAV-Based Person Recognition

[11] Are Synthetic Corruptions A Reliable Proxy For Real-World Corruptions?

[12] Seeing Cells Clearly: Evaluating Machine Vision Strategies for Microglia Centroid Detection in 3D Images

[13] ORXE: Orchestrating Experts for Dynamically Configurable Efficiency

[14] Mix-QSAM: Mixed-Precision Quantization of the Segment Anything Model

[15] Auto-regressive transformation for image alignment

[16] Learning from Loss Landscape: Generalizable Mixed-Precision Quantization via Adaptive Sharpness-Aware Gradient Aligning

[17] Cross-Branch Orthogonality for Improved Generalization in Face Deepfake Detection

[18] OWT: A Foundational Organ-Wise Tokenization Framework for Medical Imaging

[19] Pro2SAM: Mask Prompt to SAM with Grid Points for Weakly Supervised Object Localization

[20] SpatialPrompting: Keyframe-driven Zero-Shot Spatial Reasoning with Off-the-Shelf Multimodal Large Language Models

[21] GlyphMastero: A Glyph Encoder for High-Fidelity Scene Text Editing

[22] A Simple Detector with Frame Dynamics is a Strong Tracker

[23] Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models

[24] Canny2Palm: Realistic and Controllable Palmprint Generation for Large-scale Pre-training

[25] FF-PNet: A Pyramid Network Based on Feature and Field for Brain Image Registration

[26] Building-Guided Pseudo-Label Learning for Cross-Modal Building Damage Mapping

[27] T2VTextBench: A Human Evaluation Benchmark for Textual Control in Video Generation Models

[28] An Efficient Method for Accurate Pose Estimation and Error Correction of Cuboidal Objects

[29] ViCTr: Vital Consistency Transfer for Pathology Aware Image Synthesis

[30] CAG-VLM: Fine-Tuning of a Large-Scale Model to Recognize Angiographic Images for Next-Generation Diagnostic Systems

[31] DenseGrounding: Improving Dense Language-Vision Semantics for Ego-Centric 3D Visual Grounding

[32] ReAlign: Bilingual Text-to-Motion Generation via Step-Aware Reward-Guided Alignment

[33] Federated Deconfounding and Debiasing Learning for Out-of-Distribution Generalization

[34] StabStitch++: Unsupervised Online Video Stitching with Spatiotemporal Bidirectional Warps

[35] Automated Thoracolumbar Stump Rib Detection and Analysis in a Large CT Cohort

[36] Driving with Context: Online Map Matching for Complex Roads Using Lane Markings and Scenario Recognition

[37] Adaptive Contextual Embedding for Robust Far-View Borehole Detection

[38] SOAP: Style-Omniscient Animatable Portraits

[39] Split Matching for Inductive Zero-shot Semantic Segmentation

[40] xTrace: A Facial Expressive Behaviour Analysis Tool for Continuous Affect Recognition

[41] UncertainSAM: Fast and Efficient Uncertainty Quantification of the Segment Anything Model

[42] ULFine: Unbiased Lightweight Fine-tuning for Foundation-Model-Assisted Long-Tailed Semi-Supervised Learning

[43] FG-CLIP: Fine-Grained Visual and Textual Alignment

[44] Visual Affordances: Enabling Robots to Understand Object Functionality

[45] PIDiff: Image Customization for Personalized Identities with Diffusion Models

[46] Nonlinear Motion-Guided and Spatio-Temporal Aware Network for Unsupervised Event-Based Optical Flow

[47] DispBench: Benchmarking Disparity Estimation to Synthetic Corruptions

[48] MDE-Edit: Masked Dual-Editing for Multi-Object Image Editing via Diffusion Models

[49] Automated vision-based assistance tools in bronchoscopy: stenosis severity estimation

[50] Probabilistic Embeddings for Frozen Vision-Language Models: Uncertainty Quantification with Gaussian Process Latent Variable Models

[51] PaniCar: Securing the Perception of Advanced Driving Assistance Systems Against Emergency Vehicle Lighting

[52] Biomed-DPT: Dual Modality Prompt Tuning for Biomedical Vision-Language Models

[53] EAM: Enhancing Anything with Diffusion Transformers for Blind Super-Resolution

[54] HQC-NBV: A Hybrid Quantum-Classical View Planning Approach

[55] Diffusion Model Quantization: A Review

[56] Does CLIP perceive art the same way we do?

[57] PADriver: Towards Personalized Autonomous Driving

[58] PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes

[59] PRE-Mamba: A 4D State Space Model for Ultra-High-Frequent Event Camera Deraining

[60] Mapping User Trust in Vision Language Models: Research Landscape, Challenges, and Prospects

[61] Feature-Augmented Deep Networks for Multiscale Building Segmentation in High-Resolution UAV and Satellite Imagery

[62] Aesthetics Without Semantics

[63] Progressive Inertial Poser: Progressive Real-Time Kinematic Chain Estimation for 3D Full-Body Pose from Three IMU Sensors

[64] Hearing and Seeing Through CLIP: A Framework for Self-Supervised Sound Source Localization

[65] Joint Super-Resolution and Segmentation for 1-m Impervious Surface Area Mapping in China's Yangtze River Economic Belt

[66] Threshold Modulation for Online Test-Time Adaptation of Spiking Neural Networks

[67] GeomHair: Reconstruction of Hair Strands from Colorless 3D Scans

[68] EDmamba: A Simple yet Effective Event Denoising Method with State Space Model

[69] TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation

[70] PillarMamba: Learning Local-Global Context for Roadside Point Cloud via Hybrid State Space Model

[71] Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding

[72] StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant

[73] SITE: towards Spatial Intelligence Thorough Evaluation

[74] Generating Physically Stable and Buildable LEGO Designs from Text

[75] Flow-GRPO: Training Flow Matching Models via Online RL

[76] Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation

[77] DiffusionSfM: Predicting Structure and Motion via Ray Origin and Endpoint Diffusion

[78] 3D Scene Generation: A Survey