cs.CV

[1] Histo-Miner: Deep Learning based Tissue Features Extraction Pipeline from H&E Whole Slide Images of Cutaneous Squamous Cell Carcinoma

Lucas Sancéré,Carina Lorenz,Doris Helbig,Oana-Diana Persa,Sonja Dengler,Alexander Kreuter,Martim Laimer,Anne Fröhlich,Jennifer Landsberg,Johannes Brägelmann,Katarzyna Bozek

Main category: cs.CV

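TL;DR: Histo-Miner is a deep learning-based tool for analyzing whole-slide images of skin tissue. It comes with two annotated datasets for nucleus segmentation, nucleus classification, and tumor region segmentation, compares favorably with the state of the art, and can be used to predict patient response to immunotherapy.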

Details

Motivation: Labeled datasets and open-source analysis pipelines for skin tissue are lacking, particularly for cutaneous squamous cell carcinoma (cSCC).

Method: Convolutional neural networks and vision transformers are trained on two annotated datasets (47,392 cell nuclei and 144 tumor-segmented WSIs) for nucleus segmentation, nucleus classification, and tumor region segmentation.

Result: The models reach an mPQ of 0.569 for nucleus segmentation, a macro-averaged F1 of 0.832 for nucleus classification, and an mIoU of 0.884 for tumor region segmentation. A feature vector derived from these predictions can be used to predict immunotherapy response.

Conclusion: Histo-Miner is applicable to clinically relevant scenarios, offering interpretable predictions and biological insight for the treatment of cSCC patients.

Abstract: Recent advancements in digital pathology have enabled comprehensive analysis of Whole-Slide Images (WSI) from tissue samples, leveraging high-resolution microscopy and computational capabilities. Despite this progress, there is a lack of labeled datasets and open source pipelines specifically tailored for analysis of skin tissue. Here we propose Histo-Miner, a deep learning-based pipeline for analysis of skin WSIs and generate two datasets with labeled nuclei and tumor regions. We develop our pipeline for the analysis of patient samples of cutaneous squamous cell carcinoma (cSCC), a frequent non-melanoma skin cancer. Utilizing the two datasets, comprising 47,392 annotated cell nuclei and 144 tumor-segmented WSIs respectively, both from cSCC patients, Histo-Miner employs convolutional neural networks and vision transformers for nucleus segmentation and classification as well as tumor region segmentation. Performance of trained models compares favorably with the state of the art, with multi-class Panoptic Quality (mPQ) of 0.569 for nucleus segmentation, macro-averaged F1 of 0.832 for nucleus classification and mean Intersection over Union (mIoU) of 0.884 for tumor region segmentation. From these predictions we generate a compact feature vector summarizing tissue morphology and cellular interactions, which can be used for various downstream tasks. Here, we use Histo-Miner to predict cSCC patient response to immunotherapy based on pre-treatment WSIs from 45 patients. Histo-Miner identifies percentages of lymphocytes, the granulocyte to lymphocyte ratio in tumor vicinity and the distances between granulocytes and plasma cells in tumors as predictive features for therapy response. This highlights the applicability of Histo-Miner to clinically relevant scenarios, providing direct interpretation of the classification and insights into the underlying biology.
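For readers unfamiliar with the mIoU metric quoted for tumor region segmentation, the following is a minimal sketch of how mean Intersection over Union is commonly computed. The function name `mean_iou`, the flat label lists, and the choice to skip classes absent from both prediction and ground truth are illustrative assumptions; this is not the evaluation code of Histo-Miner.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union across classes (illustrative sketch).

    pred and target are equal-length flat sequences of integer class labels.
    Classes that appear in neither sequence are skipped (an assumption).
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    # Return 0.0 when no class is present at all, to avoid dividing by zero.
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: 0 = background, 1 = tumor (hypothetical labels).
pred = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 0, 0]
score = mean_iou(pred, target, num_classes=2)
```

In the toy example each class has two pixels in the intersection and four in the union, so both per-class IoUs are 0.5 and so is their mean.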

[2] Comparison of Visual Trackers for Biomechanical Analysis of Running

Luis F. Gomez,Gonzalo Garrido-Lopez,Julian Fierrez,Aythami Morales,Ruben Tolosana,Javier Rueda,Enrique Navarro

Main category: cs.CV

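TL;DR: The paper evaluates six pose trackers for biomechanical analysis of sprinting and proposes a post-processing module that reduces error. The results suggest that joint-based models are promising for biomechanical analysis.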

Details

Motivation: The study assesses how accurate pose trackers are for biomechanical analysis of sprinting, comparing them against expert annotations to gauge their practical value.

Method: Six pose trackers (two point trackers and four joint trackers) are evaluated on more than 5870 frames, and a post-processing module is proposed for outlier detection and fusion prediction of the joint angles.

Result: Joint-based models yield root mean squared errors ranging from 11.41° to 4.37°. With the post-processing module these drop to 6.99° and 3.88°, respectively, indicating suitability for biomechanical analysis.

Conclusion: Pose tracking approaches are promising for the biomechanical analysis of sprinting, but still need improvement where high accuracy is required.

Abstract: Human pose estimation has witnessed significant advancements in recent years, mainly due to the integration of deep learning models, the availability of a vast amount of data, and large computational resources. These developments have led to highly accurate body tracking systems, which have direct applications in sports analysis and performance evaluation. This work analyzes the performance of six trackers: two point trackers and four joint trackers for biomechanical analysis in sprints. The proposed framework compares the results obtained from these pose trackers with the manual annotations of biomechanical experts for more than 5870 frames. The experimental framework employs forty sprints from five professional runners, focusing on three key angles in sprint biomechanics: trunk inclination, hip flex extension, and knee flex extension. We propose a post-processing module for outlier detection and fusion prediction in the joint angles. The experimental results demonstrate that using joint-based models yields root mean squared errors ranging from 11.41° to 4.37°. When integrated with the post-processing modules, these errors can be reduced to 6.99° and 3.88°, respectively. The experimental findings suggest that human pose tracking approaches can be valuable resources for the biomechanical analysis of running. However, there is still room for improvement in applications where high accuracy is required.
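To make the reported error figures concrete, here is a minimal sketch of how a joint angle can be derived from three 2D keypoints and how the root mean squared error against expert annotations can be computed. The function names, the 2D keypoint format, and the sample numbers are assumptions for illustration; the paper's exact angle definitions and evaluation protocol are not given in the abstract.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at vertex b formed by the 2D points a-b-c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))


def rmse(predicted, reference):
    """Root mean squared error between two equal-length angle sequences."""
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical hip, knee, and ankle keypoints forming a right angle at the knee.
knee_angle = joint_angle((1.0, 0.0), (0.0, 0.0), (0.0, 1.0))

# Hypothetical tracker output versus expert annotation, in degrees.
error = rmse([90.0, 100.0], [93.0, 96.0])
```

Here `knee_angle` is 90° and `error` is the square root of 12.5, about 3.54°.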

[3] Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers

Divyansh Srivastava,Xiang Zhang,He Wen,Chenru Wen,Zhuowen Tu

Main category: cs.CV

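TL;DR: LayouSyn is a new text-to-layout generation method that combines lightweight open-source language models with a diffusion Transformer architecture to generate scene layouts in an open-vocabulary setting, outperforming existing methods.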

Details

Motivation: Existing scene layout generation methods are either closed-vocabulary or depend on proprietary large language models, which limits their modeling capabilities and their broader use in controllable image generation.

Method: Lightweight open-source language models extract scene elements from text prompts, and a novel aspect-aware diffusion Transformer architecture performs conditional layout generation.

Result: LayouSyn performs strongly on spatial and numerical reasoning benchmarks and shows potential both in combination with large language models and in image editing.

Conclusion: LayouSyn offers an efficient solution for open-vocabulary scene layout generation and broadens the scope of controllable image generation.

Abstract: We present Lay-Your-Scene (shorthand LayouSyn), a novel text-to-layout generation pipeline for natural scenes. Prior scene layout generation methods are either closed-vocabulary or use proprietary large language models for open-vocabulary generation, limiting their modeling capabilities and broader applicability in controllable image generation. In this work, we propose to use lightweight open-source language models to obtain scene elements from text prompts and a novel aspect-aware diffusion Transformer architecture trained in an open-vocabulary manner for conditional layout generation. Extensive experiments demonstrate that LayouSyn outperforms existing methods and achieves state-of-the-art performance on challenging spatial and numerical reasoning benchmarks. Additionally, we present two applications of LayouSyn. First, we show that coarse initialization from large language models can be seamlessly combined with our method to achieve better results. Second, we present a pipeline for adding objects to images, demonstrating the potential of LayouSyn in image editing applications.

[4] False Promises in Medical Imaging AI? Assessing Validity of Outperformance Claims

Evangelia Christodoulou,Annika Reinke,Pascaline Andrè,Patrick Godau,Piotr Kalinowski,Rola Houhou,Selen Erkan,Carole H. Sudre,Ninon Burgos,Sofiène Boutaj,Sophie Loizillon,Maëlys Solal,Veronika Cheplygina,Charles Heitz,Michal Kozubek,Michela Antonelli,Nicola Rieke,Antoine Gilson,Leon D. Mayer,Minu D. Tizabi,M. Jorge Cardoso,Amber Simpson,Annette Kopp-Schneider,Gaël Varoquaux,Olivier Colliot,Lena Maier-Hein

Main category: cs.CV

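TL;DR: Performance comparisons in medical imaging AI often rest on mean performance alone, which leads to frequent false claims of superiority. A Bayesian analysis finds that more than 80% of papers claim their new method is superior, while 86% of classification papers and 53% of segmentation papers carry a high probability (>5%) that the claim is false.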

Details

Motivation: Performance comparisons in medical imaging AI research often rely on mean performance alone, which can produce false claims of superiority and misdirect future research. The paper sets out to test how reliable these claims are.

Method: A Bayesian approach combines reported results with empirically estimated model congruence to quantify the probability that an outperformance claim is false.

Result: More than 80% of papers claim that their new method outperforms the state of the art. A high probability (>5%) of a false outperformance claim was found in 86% of classification papers and 53% of segmentation papers.

Conclusion: Current benchmarking practice in medical imaging AI has a critical flaw: outperformance claims are frequently unsubstantiated and risk misdirecting research.

Abstract: Performance comparisons are fundamental in medical imaging Artificial Intelligence (AI) research, often driving claims of superiority based on relative improvements in common performance metrics. However, such claims frequently rely solely on empirical mean performance. In this paper, we investigate whether newly proposed methods genuinely outperform the state of the art by analyzing a representative cohort of medical imaging papers. We quantify the probability of false claims based on a Bayesian approach that leverages reported results alongside empirically estimated model congruence to estimate whether the relative ranking of methods is likely to have occurred by chance. According to our results, the majority (>80%) of papers claim outperformance when introducing a new method. Our analysis further revealed a high probability (>5%) of false outperformance claims in 86% of classification papers and 53% of segmentation papers. These findings highlight a critical flaw in current benchmarking practices: claims of outperformance in medical imaging AI are frequently unsubstantiated, posing a risk of misdirecting future research efforts.

[5] Hyb-KAN ViT: Hybrid Kolmogorov-Arnold Networks Augmented Vision Transformer

Sainath Dey,Mitul Goswami,Jashika Sethi,Prasant Kumar Pattnaik

Main category: cs.CV

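TL;DR: Proposes the Hyb-KAN ViT framework, which combines wavelet-based spectral decomposition with spline-optimized activation functions to address the limitations of MLPs in ViTs and enable efficient multi-scale modeling.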

Details

Motivation: Address the inherent limitations of MLPs in Vision Transformers by exploiting the edge-detection capabilities of wavelet functions and the modularity of the ViT architecture.

Method: Introduces Eff-KAN (spline functions in place of MLP layers) and Wav-KAN (wavelet transforms for multi-resolution feature extraction), integrated into the ViT encoder layers and classification heads.

Result: Achieves state-of-the-art performance on ImageNet-1K, COCO, and ADE20K; ablations validate the wavelet-driven spectral priors and the spline-based efficiency.

Conclusion: Hyb-KAN ViT offers a new paradigm for balancing parameter efficiency and multi-scale representation in vision architectures.

Abstract: This study addresses the inherent limitations of Multi-Layer Perceptrons (MLPs) in Vision Transformers (ViTs) by introducing Hybrid Kolmogorov-Arnold Network (KAN)-ViT (Hyb-KAN ViT), a novel framework that integrates wavelet-based spectral decomposition and spline-optimized activation functions. Prior work has failed to focus on the prebuilt modularity of the ViT architecture and integration of edge detection capabilities of Wavelet functions. We propose two key modules: Efficient-KAN (Eff-KAN), which replaces MLP layers with spline functions, and Wavelet-KAN (Wav-KAN), which leverages orthogonal wavelet transforms for multi-resolution feature extraction. These modules are systematically integrated in ViT encoder layers and classification heads to enhance spatial-frequency modeling while mitigating computational bottlenecks. Experiments on ImageNet-1K (Image Recognition), COCO (Object Detection and Instance Segmentation), and ADE20K (Semantic Segmentation) demonstrate state-of-the-art performance with Hyb-KAN ViT. Ablation studies validate the efficacy of wavelet-driven spectral priors in segmentation and spline-based efficiency in detection tasks. The framework establishes a new paradigm for balancing parameter efficiency and multi-scale representation in vision architectures.

[6] Lightweight RGB-D Salient Object Detection from a Speed-Accuracy Tradeoff Perspective

Songsong Duan,Xi Yang,Nannan Wang,Xinbo Gao

Main category: cs.CV

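TL;DR: SATNet is a lightweight RGB-D salient object detection network that balances efficiency and accuracy through improvements in three areas: depth quality, modality fusion, and feature representation.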

Details

Motivation: Existing RGB-D methods either sacrifice efficiency by using large backbones or, when lightweight, struggle to reach high accuracy. SATNet aims to balance efficiency and performance.

Method: The Depth Anything Model generates high-quality depth maps; a Decoupled Attention Module (DAM) explores consistency within and between modalities; a Dual Information Representation Module (DIRM) enlarges the feature space; and a Dual Feature Aggregation Module (DFAM) serves in the decoder.

Result: On five public RGB-D SOD datasets, SATNet outperforms existing heavyweight CNN-based models with only 5.2 M parameters and a speed of 415 FPS.

Conclusion: SATNet achieves both efficiency and high accuracy within a lightweight framework, offering a new direction for RGB-D salient object detection.

Abstract: Current RGB-D methods usually leverage large-scale backbones to improve accuracy but sacrifice efficiency. Meanwhile, several existing lightweight methods struggle to achieve high-precision performance. To balance efficiency and performance, we propose a Speed-Accuracy Tradeoff Network (SATNet) for Lightweight RGB-D SOD from three fundamental perspectives: depth quality, modality fusion, and feature representation. Concerning depth quality, we introduce the Depth Anything Model to generate high-quality depth maps, which effectively alleviates the multi-modal gaps in the current datasets. For modality fusion, we propose a Decoupled Attention Module (DAM) to explore the consistency within and between modalities. Here, the multi-modal features are decoupled into dual-view feature vectors to project discriminable information of feature maps. For feature representation, we develop a Dual Information Representation Module (DIRM) with a bi-directional inverted framework to enlarge the limited feature space generated by the lightweight backbones. DIRM models texture features and saliency features to enrich the feature space, and employs two-way prediction heads to optimize its parameters through bi-directional backpropagation. Finally, we design a Dual Feature Aggregation Module (DFAM) in the decoder to aggregate texture and saliency features. Extensive experiments on five public RGB-D SOD datasets indicate that the proposed SATNet outperforms state-of-the-art (SOTA) CNN-based heavyweight models and achieves a lightweight framework with 5.2 M parameters and 415 FPS.

[7] Vision-Language-Action Models: Concepts, Progress, Applications and Challenges

Ranjan Sapkota,Yang Cao,Konstantinos I. Roumeliotis,Manoj Karkee

Main category: cs.CV

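TL;DR: This review surveys recent progress in Vision-Language-Action (VLA) models, covering their conceptual foundations, architectural innovations, application domains, and open challenges with proposed solutions.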

Details

Motivation: VLA models aim to unify perception, natural language understanding, and action execution, moving artificial intelligence toward more capable and adaptive systems.

Method: A rigorous literature review covers more than 80 VLA models published in the past three years, focusing on architectural innovations, efficient training strategies, and real-time inference acceleration.

Result: The review summarizes VLA applications in robotics, autonomous driving, medicine, and other domains, and proposes solutions to challenges such as real-time control and generalization.

Conclusion: VLA models are positioned to become a core technology for future intelligent robotics and artificial general intelligence, but issues such as ethics and scalability remain to be solved.

Abstract: Vision-Language-Action (VLA) models mark a transformative advancement in artificial intelligence, aiming to unify perception, natural language understanding, and embodied action within a single computational framework. This foundational review presents a comprehensive synthesis of recent advancements in Vision-Language-Action models, systematically organized across five thematic pillars that structure the landscape of this rapidly evolving field. We begin by establishing the conceptual foundations of VLA systems, tracing their evolution from cross-modal learning architectures to generalist agents that tightly integrate vision-language models (VLMs), action planners, and hierarchical controllers. Our methodology adopts a rigorous literature review framework, covering over 80 VLA models published in the past three years. Key progress areas include architectural innovations, parameter-efficient training strategies, and real-time inference accelerations. We explore diverse application domains such as humanoid robotics, autonomous vehicles, medical and industrial robotics, precision agriculture, and augmented reality navigation. The review further addresses major challenges across real-time control, multimodal action representation, system scalability, generalization to unseen tasks, and ethical deployment risks. Drawing from the state-of-the-art, we propose targeted solutions including agentic AI adaptation, cross-embodiment generalization, and unified neuro-symbolic planning. In our forward-looking discussion, we outline a future roadmap where VLA models, VLMs, and agentic AI converge to power socially aligned, adaptive, and general-purpose embodied agents. This work serves as a foundational reference for advancing intelligent, real-world robotics and artificial general intelligence. Keywords: Vision-language-action, Agentic AI, AI Agents, Vision-language Models

[8] Replay to Remember (R2R): An Efficient Uncertainty-driven Unsupervised Continual Learning Framework Using Generative Replay

Sriram Mandalika,Harsha Vardhan,Athira Nambiar

Main category: cs.CV

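TL;DR: Proposes R2R, an uncertainty-driven unsupervised continual learning framework that uses generative replay and a cluster-level uncertainty feedback mechanism to substantially reduce catastrophic forgetting, reaching state-of-the-art performance on several datasets.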

Details

Motivation: Tackle catastrophic forgetting in continual learning for neural networks with a framework that needs no pretraining and relies on unlabeled data and generative replay.

Method: A cluster-level uncertainty-driven feedback mechanism and a VLM-powered generative replay module, combined with dynamic thresholding, make use of unlabeled data and synthetic labeled data.

Result: R2R reaches state-of-the-art performance of 98.13% on CIFAR-10 and 73.06% on CIFAR-100, among other datasets, surpassing prior methods by over 4.36%.

Conclusion: The R2R framework improves knowledge retention and offers a new direction for unsupervised continual learning.

Abstract: Continual Learning entails progressively acquiring knowledge from new data while retaining previously acquired knowledge, thereby mitigating "Catastrophic Forgetting" in neural networks. Our work presents a novel uncertainty-driven Unsupervised Continual Learning framework using Generative Replay, namely "Replay to Remember (R2R)". The proposed R2R architecture efficiently uses unlabelled and synthetic labelled data in a balanced proportion using a cluster-level uncertainty-driven feedback mechanism and a VLM-powered generative replay module. Unlike traditional memory-buffer methods that depend on pretrained models and pseudo-labels, our R2R framework operates without any prior training. It leverages visual features from unlabeled data and adapts continuously using clustering-based uncertainty estimation coupled with dynamic thresholding. Concurrently, a generative replay mechanism along with DeepSeek-R1 powered CLIP VLM produces labelled synthetic data representative of past experiences, resembling biological visual thinking that replays memory to remember and act in new, unseen tasks. Extensive experimental analyses are carried out on the CIFAR-10, CIFAR-100, CINIC-10, SVHN and TinyImageNet datasets. Our proposed R2R approach improves knowledge retention, achieving a state-of-the-art performance of 98.13%, 73.06%, 93.41%, 95.18%, 59.74%, respectively, surpassing state-of-the-art performance by over 4.36%.

[9] Convex Relaxation for Robust Vanishing Point Estimation in Manhattan World

Bangyan Liao,Zhenjun Zhao,Haoang Li,Yi Zhou,Yingping Zeng,Hao Li,Peidong Liu

Main category: cs.CV

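TL;DR: The paper proposes GlobustVP, a globally optimal method based on convex relaxation for vanishing point estimation in a Manhattan world, balancing efficiency, robustness, and global optimality.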

Details

Motivation: Existing vanishing point estimation methods are either sub-optimal or pursue global optimality at a high computational cost. The paper addresses this with convex relaxation techniques.

Method: A "soft" association scheme, realized through a truncated multi-selection error, allows joint estimation of vanishing point locations and line-VP associations. The problem is reformulated as a QCQP and relaxed into a convex SDP, which is solved by the iterative solver GlobustVP.

Result: Experiments on synthetic and real-world data show that GlobustVP achieves a favorable balance between efficiency, robustness, and global optimality compared with previous work.

Conclusion: GlobustVP is an efficient and robust vanishing point estimation method, and the code is open source.

Abstract: Determining the vanishing points (VPs) in a Manhattan world, as a fundamental task in many 3D vision applications, consists of jointly inferring the line-VP association and locating each VP. Existing methods are, however, either sub-optimal solvers or pursuing global optimality at a significant cost of computing time. In contrast to prior works, we introduce convex relaxation techniques to solve this task for the first time. Specifically, we employ a "soft" association scheme, realized via a truncated multi-selection error, that allows for joint estimation of VPs' locations and line-VP associations. This approach leads to a primal problem that can be reformulated into a quadratically constrained quadratic programming (QCQP) problem, which is then relaxed into a convex semidefinite programming (SDP) problem. To solve this SDP problem efficiently, we present a globally optimal outlier-robust iterative solver (called GlobustVP), which independently searches for one VP and its associated lines in each iteration, treating other lines as outliers. After each independent update of all VPs, the mutual orthogonality between the three VPs in a Manhattan world is reinforced via local refinement. Extensive experiments on both synthetic and real-world data demonstrate that GlobustVP achieves a favorable balance between efficiency, robustness, and global optimality compared to previous works. The code is publicly available at https://github.com/WU-CVGL/GlobustVP.

[10] DetReIDX: A Stress-Test Dataset for Real-World UAV-Based Person Recognition

Kailash A. Hambarde,Nzakiese Mbongo,Pavan Kumar MP,Satish Mekewad,Carolina Fernandes,Gökhan Silahtaroğlu,Alice Nithya,Pawan Wasnik,MD. Rashidunnabi,Pranita Samale,Hugo Proença

Main category: cs.CV

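TL;DR: DetReIDX is a large-scale aerial-ground person dataset designed to stress-test person re-identification (ReID) under real-world conditions. It contains multi-session data with many sources of variability and shows that existing methods degrade sharply under these conditions.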

Details

Motivation: Current person re-identification technology performs poorly under complex real-world conditions, and public datasets do not model those conditions realistically, which limits progress.

Method: The DetReIDX dataset is built from multi-session recordings with varied conditions (such as changes in clothing and lighting) and is annotated with soft biometric attributes and multitask labels.

Result: Existing methods degrade sharply under DetReIDX conditions, by up to 80% in detection accuracy and by over 70% in Rank-1 ReID.

Conclusion: DetReIDX provides a more challenging benchmark for real-world person re-identification and can help drive progress in the field.

Abstract: Person reidentification (ReID) technology has been considered to perform relatively well under controlled, ground-level conditions, but it breaks down when deployed in challenging real-world settings. Evidently, this is due to extreme data variability factors such as resolution, viewpoint changes, scale variations, occlusions, and appearance shifts from clothing or session drifts. Moreover, the publicly available data sets do not realistically incorporate such kinds and magnitudes of variability, which limits the progress of this technology. This paper introduces DetReIDX, a large-scale aerial-ground person dataset that was explicitly designed as a stress test for ReID under real-world conditions. DetReIDX is a multi-session set that includes over 13 million bounding boxes from 509 identities, collected in seven university campuses from three continents, with drone altitudes between 5.8 and 120 meters. More importantly, as a key novelty, DetReIDX subjects were recorded in (at least) two sessions on different days, with changes in clothing, daylight and location, making it suitable to actually evaluate long-term person ReID. In addition, data were annotated with 16 soft biometric attributes and multitask labels for detection, tracking, ReID, and action recognition. In order to provide empirical evidence of DetReIDX's usefulness, we considered the specific tasks of human detection and ReID, where SOTA methods catastrophically degrade performance (up to 80% in detection accuracy and over 70% in Rank-1 ReID) when exposed to DetReIDX's conditions. The dataset, annotations, and official evaluation protocols are publicly available at https://www.it.ubi.pt/DetReIDX/

[11] Are Synthetic Corruptions A Reliable Proxy For Real-World Corruptions?

Shashank Agnihotri,David Schader,Nico Sharei,Mehmet Ege Kaçar,Margret Keuper

Main category: cs.CV

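TL;DR: The paper asks whether synthetic corruptions are a reliable proxy for real-world corruptions when testing the robustness of deep learning models, and examines their correlation in a large-scale benchmarking study.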

Details

Motivation: Deep learning models are vulnerable to distribution shifts, and collecting real-world data to test robustness is costly, so the reliability of synthetic corruptions is worth studying.

Method: A large-scale benchmarking study of semantic segmentation models compares performance on real-world corruption datasets and synthetic corruption datasets.

Result: Mean performance is strongly correlated, supporting the use of synthetic corruptions for robustness evaluation; corruption-specific correlations are also analyzed.

Conclusion: Synthetic corruptions can serve as a reliable proxy for real-world corruptions, though suitability varies by corruption type.

Abstract: Deep learning (DL) models are widely used in real-world applications but remain vulnerable to distribution shifts, especially due to weather and lighting changes. Collecting diverse real-world data for testing the robustness of DL models is resource-intensive, making synthetic corruptions an attractive alternative for robustness testing. However, are synthetic corruptions a reliable proxy for real-world corruptions? To answer this, we conduct the largest benchmarking study on semantic segmentation models, comparing performance on real-world corruptions and synthetic corruptions datasets. Our results reveal a strong correlation in mean performance, supporting the use of synthetic corruptions for robustness evaluation. We further analyze corruption-specific correlations, providing key insights to understand when synthetic corruptions succeed in representing real-world corruptions. Open-source Code: https://github.com/shashankskagnihotri/benchmarking_robustness/tree/segmentation_david/semantic_segmentation

[12] Seeing Cells Clearly: Evaluating Machine Vision Strategies for Microglia Centroid Detection in 3D Images

Youjia Zhang

Main category: cs.CV

TL;DR: Compares three tools (ilastik, 3D Morph, and Omnipose) for locating microglia centroids in 3D microscope images.

Details

Motivation: The shape of microglia matters for brain health research, and accurately identifying their center points is a key step.

Method: Three tools (ilastik, 3D Morph, and Omnipose) are tested on 3D images and their results compared.

Result: Each tool identifies microglia in a different way, which affects the information obtained from the images.

Conclusion: The choice of tool influences research findings, so it should be matched to the needs of the study.

Abstract: Microglia are important cells in the brain, and their shape can tell us a lot about brain health. In this project, I test three different tools for finding the center points of microglia in 3D microscope images. The tools include ilastik, 3D Morph, and Omnipose. I look at how well each one finds the cells and how their results compare. My findings show that each tool sees the cells in its own way, and this can affect the kind of information we get from the images.

[13] ORXE: Orchestrating Experts for Dynamically Configurable Efficiency

Qingyuan Wang,Guoxin Wang,Barry Cardiff,Deepu John

Main category: cs.CV

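TL;DR: ORXE is a modular, adaptable framework that makes the efficiency of AI models configurable in real time by dynamically adjusting inference pathways.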

Details

Motivation: Conventional approaches require complex metamodel training, whereas ORXE aims to keep development simple while remaining efficient and flexible.

Method: A collection of pre-trained experts and a confidence-based gating mechanism allocate computational resources dynamically for each input.

Result: On image classification tasks, ORXE outperforms individual experts and other dynamic models in most cases.

Conclusion: ORXE can be extended to other applications, providing a scalable solution for diverse deployment scenarios.

Abstract: This paper presents ORXE, a modular and adaptable framework for achieving real-time configurable efficiency in AI models. By leveraging a collection of pre-trained experts with diverse computational costs and performance levels, ORXE dynamically adjusts inference pathways based on the complexity of input samples. Unlike conventional approaches that require complex metamodel training, ORXE achieves high efficiency and flexibility without complicating the development process. The proposed system utilizes a confidence-based gating mechanism to allocate appropriate computational resources for each input. ORXE also supports adjustments to the preference between inference cost and prediction performance across a wide range during runtime. We implemented a training-free ORXE system for image classification tasks, evaluating its efficiency and accuracy across various devices. The results demonstrate that ORXE achieves superior performance compared to individual experts and other dynamic models in most cases. This approach can be extended to other applications, providing a scalable solution for diverse real-world deployment scenarios.
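The idea of confidence-based gating over experts of increasing cost can be sketched as a simple cascade. Everything below is a hypothetical illustration: the function `cascade_predict`, the toy experts `small_expert` and `large_expert`, and the single shared threshold are assumptions, and ORXE's actual gating rule is not described in the abstract.

```python
def cascade_predict(x, experts, threshold):
    """Run experts from cheapest to most expensive and stop early.

    experts: list of callables, each returning (label, confidence).
    Returns (label, confidence, index of the expert that answered).
    The last expert always answers, whatever its confidence.
    """
    for index, expert in enumerate(experts):
        label, confidence = expert(x)
        if confidence >= threshold or index == len(experts) - 1:
            return label, confidence, index


# Toy stand-ins for pre-trained models (hypothetical behavior).
def small_expert(x):
    # Cheap model: confident on easy inputs, unsure on hard ones.
    return ("cat", 0.6 if x == "hard" else 0.95)


def large_expert(x):
    # Expensive model: confident on everything.
    return ("dog" if x == "hard" else "cat", 0.9)


experts = [small_expert, large_expert]
easy = cascade_predict("easy", experts, threshold=0.8)
hard = cascade_predict("hard", experts, threshold=0.8)
cheap = cascade_predict("hard", experts, threshold=0.5)
```

With the threshold at 0.8 the easy input is answered by the small expert and the hard input is escalated to the large one. Lowering the threshold to 0.5 keeps the hard input on the small expert, which illustrates how a single runtime parameter can trade prediction quality for inference cost without retraining.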

[14] Mix-QSAM: Mixed-Precision Quantization of the Segment Anything Model

Navin Ranjan,Andreas Savakis

Main category: cs.CV

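TL;DR: Mix-QSAM is a mixed-precision post-training quantization framework that uses a layer-wise importance score and a cross-layer synergy metric to improve quantization of the SAM model, raising accuracy for deployment on resource-constrained devices.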

Details

Motivation: SAM has high computational and memory demands, and existing fixed bit-width quantization methods fall short in both accuracy and efficiency.

Method: A layer-wise importance score and a cross-layer synergy metric are introduced, and optimal bit-widths are assigned by solving an Integer Quadratic Programming problem.

Result: Under 6-bit and 4-bit mixed-precision settings, average precision improves by up to 20%.

Conclusion: Mix-QSAM consistently outperforms existing PTQ methods while maintaining computational efficiency.

Abstract: The Segment Anything Model (SAM) is a popular vision foundation model; however, its high computational and memory demands make deployment on resource-constrained devices challenging. While Post-Training Quantization (PTQ) is a practical approach for reducing computational overhead, existing PTQ methods rely on fixed bit-width quantization, leading to suboptimal accuracy and efficiency. To address this limitation, we propose Mix-QSAM, a mixed-precision PTQ framework for SAM. First, we introduce a layer-wise importance score, derived using Kullback-Leibler (KL) divergence, to quantify each layer's contribution to the model's output. Second, we introduce cross-layer synergy, a novel metric based on causal mutual information, to capture dependencies between adjacent layers. This ensures that highly interdependent layers maintain similar bit-widths, preventing abrupt precision mismatches that degrade feature propagation and numerical stability. Using these metrics, we formulate an Integer Quadratic Programming (IQP) problem to determine optimal bit-width allocation under model size and bit-operation constraints, assigning higher precision to critical layers while minimizing bit-width in less influential layers. Experimental results demonstrate that Mix-QSAM consistently outperforms existing PTQ methods on instance segmentation and object detection tasks, achieving up to 20% higher average precision under 6-bit and 4-bit mixed-precision settings, while maintaining computational efficiency.
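One plausible reading of a KL-based layer importance score is to compare the model's full-precision output distribution with the output obtained when a single layer is quantized. The sketch below follows that reading. The function names, the layer names, and the toy probability vectors are all hypothetical, and the abstract does not specify how Mix-QSAM actually constructs its score.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def rank_layers(reference, perturbed_outputs):
    """Sort layer names by how much quantizing each one shifts the output.

    reference: output distribution of the full-precision model.
    perturbed_outputs: dict of layer name -> output distribution obtained
    with only that layer quantized (an assumed measurement procedure).
    """
    scores = {
        name: kl_divergence(reference, out)
        for name, out in perturbed_outputs.items()
    }
    return sorted(scores, key=scores.get, reverse=True), scores


# Hypothetical output distributions over three classes.
full_precision = [0.7, 0.2, 0.1]
perturbed = {
    "encoder.block0": [0.69, 0.21, 0.10],  # barely changes the output
    "encoder.block7": [0.40, 0.40, 0.20],  # changes the output a lot
}
order, scores = rank_layers(full_precision, perturbed)
```

Under this sketch, `encoder.block7` ranks first, so it would be a candidate for a higher bit-width, while `encoder.block0` could tolerate a lower one.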

[15] Auto-regressive transformation for image alignment

Kanggeon Lee,Soochahn Lee,Kyoung Mu Lee

Main category: cs.CV

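TL;DR: Proposes ART, a new method that iteratively refines image alignment using an auto-regressive framework and multi-scale features, markedly improving accuracy in feature-sparse regions.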

Details

Motivation: Existing methods perform poorly in feature-sparse regions, under extreme scale and field-of-view differences, and with large deformations, so a more robust solution is needed.

Method: ART uses multi-scale features within an auto-regressive framework, iteratively refining the transformation with randomly sampled points and guidance from cross-attention layers.

Result: Experiments across diverse datasets show that ART significantly outperforms existing methods.

Conclusion: ART is a powerful and broadly applicable new method for precise image alignment.

Abstract: Existing methods for image alignment struggle in cases involving feature-sparse regions, extreme scale and field-of-view differences, and large deformations, often resulting in suboptimal accuracy. Robustness to these challenges improves through iterative refinement of the transformation field while focusing on critical regions in multi-scale image representations. We thus propose Auto-Regressive Transformation (ART), a novel method that iteratively estimates the coarse-to-fine transformations within an auto-regressive framework. Leveraging hierarchical multi-scale features, our network refines the transformations using randomly sampled points at each scale. By incorporating guidance from the cross-attention layer, the model focuses on critical regions, ensuring accurate alignment even in challenging, feature-limited conditions. Extensive experiments across diverse datasets demonstrate that ART significantly outperforms state-of-the-art methods, establishing it as a powerful new method for precise image alignment with broad applicability.

[16] Learning from Loss Landscape: Generalizable Mixed-Precision Quantization via Adaptive Sharpness-Aware Gradient Aligning

Lianbo Ma,Jianlun Ma,Yuee Zhou,Guoyang Xie,Qiang He,Zhichao Lu

Main category: cs.CV

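TL;DR: Proposes a new method that searches for quantization policies on a small dataset and generalizes them to large-scale datasets, addressing the high computational cost of existing mixed-precision quantization methods.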

Details

Motivation: Existing mixed-precision quantization methods require an expensive policy search on large-scale datasets.

Method: Quantization policies are searched on a small dataset and generalized to large datasets, using three techniques: sharpness-aware minimization, implicit gradient direction alignment, and an adaptive perturbation radius.

Result: With the policy searched on CIFAR10, the method achieves equivalent accuracy on ImageNet at significantly lower computational cost, improving efficiency by up to 150% over the baselines.

Conclusion: The method simplifies the quantization process: it needs no large-scale quantization fine-tuning, only model weight adjustment, and both theory and experiments support its effectiveness and efficiency.

Abstract: Mixed Precision Quantization (MPQ) has become an essential technique for optimizing neural networks by determining the optimal bitwidth per layer. Existing MPQ methods, however, face a major hurdle: they require a computationally expensive search for quantization policies on large-scale datasets. To resolve this issue, we introduce a novel approach that first searches for quantization policies on small datasets and then generalizes them to large-scale datasets. This approach simplifies the process, eliminating the need for large-scale quantization fine-tuning and only necessitating model weight adjustment. Our method is characterized by three key techniques: sharpness-aware minimization for enhanced quantization generalization, implicit gradient direction alignment to handle gradient conflicts among different optimization objectives, and an adaptive perturbation radius to accelerate optimization. Both theoretical analysis and experimental results validate our approach. Using the CIFAR10 dataset (just 0.5% of the size of ImageNet training data) for MPQ policy search, we achieved equivalent accuracy on ImageNet with a significantly lower computational cost, while improving efficiency by up to 150% over the baselines.

[17] Cross-Branch Orthogonality for Improved Generalization in Face Deepfake Detection

Tharindu Fernando,Clinton Fookes,Sridha Sridharan,Simon Denman

Main category: cs.CV

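TL;DR: The paper proposes a new strategy that exploits coarse-to-fine spatial information, semantic information, and their interactions, combined with a feature orthogonality-based disentanglement strategy, to substantially improve deepfake detection.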

Details

Motivation: Rapid progress in deepfake generation is eroding public trust in multimedia content, and existing detectors struggle with novel deepfake types because they rely on specific forgery artifacts.

Method: The strategy combines spatial and semantic information and introduces feature orthogonality-based disentanglement to ensure feature distinctiveness and reduce redundancy.

Result: On three public benchmarks, the method outperforms the current best methods in cross-dataset evaluation, by 5% on Celeb-DF and 7% on DFDC.

Conclusion: By integrating multiple features and disentangling them, the method substantially improves the generalization and performance of deepfake detection.

Abstract: Remarkable advancements in generative AI technology have given rise to a spectrum of novel deepfake categories with unprecedented leaps in their realism, and deepfakes are increasingly becoming a nuisance to law enforcement authorities and the general public. In particular, we observe alarming levels of confusion, deception, and loss of faith regarding multimedia content within society caused by face deepfakes, and existing deepfake detectors are struggling to keep up with the pace of improvements in deepfake generation. This is primarily due to their reliance on specific forgery artifacts, which limits their ability to generalise and detect novel deepfake types. To combat the spread of malicious face deepfakes, this paper proposes a new strategy that leverages coarse-to-fine spatial information, semantic information, and their interactions while ensuring feature distinctiveness and reducing the redundancy of the modelled features. A novel feature orthogonality-based disentanglement strategy is introduced to ensure branch-level and cross-branch feature disentanglement, which allows us to integrate multiple feature vectors without adding complexity to the feature space or compromising generalisation. Comprehensive experiments on three public benchmarks: FaceForensics++, Celeb-DF, and the Deepfake Detection Challenge (DFDC) show that these design choices enable the proposed approach to outperform current state-of-the-art methods by 5% on the Celeb-DF dataset and 7% on the DFDC dataset in a cross-dataset evaluation setting.
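A common way to encourage orthogonality between feature vectors is to penalize their squared cosine similarity. The sketch below shows that generic technique across branches. It is an assumption about the general form such a term might take, with hypothetical function names and toy vectors, and it is not the loss used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def orthogonality_penalty(branch_features):
    """Mean squared cosine similarity over all pairs of branch vectors.

    Returns 0.0 when every pair is orthogonal and 1.0 when every pair is
    parallel. Requires at least two branches.
    """
    n = len(branch_features)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    total = sum(
        cosine(branch_features[i], branch_features[j]) ** 2 for i, j in pairs
    )
    return total / len(pairs)


# Hypothetical 2D feature vectors from two branches.
orthogonal = orthogonality_penalty([[1.0, 0.0], [0.0, 1.0]])
redundant = orthogonality_penalty([[1.0, 0.0], [1.0, 0.0]])
partial = orthogonality_penalty([[1.0, 0.0], [1.0, 1.0]])
```

Adding a term like this to a training loss pushes branches toward non-redundant features, since the penalty vanishes only when the branch vectors are mutually orthogonal.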

[18] OWT: A Foundational Organ-Wise Tokenization Framework for Medical Imaging

Sifan Song,Siyeop Yoon,Pengfei Jin,Sekeun Kim,Matthew Tivnan,Yujin Oh,Runqi Meng,Ling Chen,Zhiliang Lyu,Dufan Wu,Ning Guo,Xiang Li,Quanzheng Li

Main category: cs.CV

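TL;DR: The paper proposes an Organ-Wise Tokenization (OWT) framework, trained with a Token Group-based Reconstruction (TGR) paradigm, that addresses the limitations of conventional holistic embeddings in medical imaging and improves interpretability and generalization.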

Details

Motivation: Conventional representation learning relies on holistic, black-box embeddings that entangle semantic components, limiting interpretability and generalization; the problem is especially acute in medical imaging. Method: Proposes the OWT framework, which explicitly decomposes an image into separable token groups, each corresponding to a specific organ or semantic entity, and trains it with the TGR paradigm. Result: On CT and MRI datasets, OWT performs strongly on image reconstruction and segmentation and enables new applications such as semantic-level generation and retrieval. Conclusion: OWT serves as a foundational framework for semantically disentangled representation learning, with broad scalability and applicability to medical imaging and beyond. Abstract: Recent advances in representation learning often rely on holistic, black-box embeddings that entangle multiple semantic components, limiting interpretability and generalization. These issues are especially critical in medical imaging. To address these limitations, we propose an Organ-Wise Tokenization (OWT) framework with a Token Group-based Reconstruction (TGR) training paradigm. Unlike conventional approaches that produce holistic features, OWT explicitly disentangles an image into separable token groups, each corresponding to a distinct organ or semantic entity. Our design ensures each token group encapsulates organ-specific information, boosting interpretability, generalization, and efficiency while allowing fine-grained control in downstream tasks. Experiments on CT and MRI datasets demonstrate the effectiveness of OWT in not only achieving strong image reconstruction and segmentation performance, but also enabling novel semantic-level generation and retrieval applications that are out of reach for standard holistic embedding methods. These findings underscore the potential of OWT as a foundational framework for semantically disentangled representation learning, offering broad scalability and applicability to real-world medical imaging scenarios and beyond.

[19] Pro2SAM: Mask Prompt to SAM with Grid Points for Weakly Supervised Object Localization

Xi Yang,Songsong Duan,Nannan Wang,Xinbo Gao

Main category: cs.CV

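TL;DR: The paper proposes Pro2SAM, which uses SAM's zero-shot generalization and fine-grained segmentation together with grid-point prompts and a Global Token Transformer to improve weakly supervised object localization.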

Details

Motivation: CAM and self-attention maps cannot learn pixel-level fine-grained information in weakly supervised object localization. Method: Builds on SAM's zero-shot capability: GTFormer generates a coarse-grained foreground map as a mask prompt, and grid-point prompts maximize the probability of obtaining the foreground mask. Result: Achieves 84.03% and 66.85% Top-1 localization accuracy on CUB-200-2011 and ILSVRC, respectively. Conclusion: Pro2SAM improves weakly supervised object localization performance. Abstract: Weakly Supervised Object Localization (WSOL), which aims to localize objects by only using image-level labels, has attracted much attention because of its low annotation cost in real applications. Current studies focus on the Class Activation Map (CAM) of CNN and the self-attention map of transformer to identify the region of objects. However, both CAM and self-attention maps cannot learn pixel-level fine-grained information on the foreground objects, which hinders the further advance of WSOL. To address this problem, we leverage the capability of zero-shot generalization and fine-grained segmentation in Segment Anything Model (SAM) to boost the activation of integral object regions. Further, to alleviate the semantic ambiguity issue that arises in single point prompt-based SAM, we propose an innovative mask prompt to SAM (Pro2SAM) network with grid points for the WSOL task. First, we devise a Global Token Transformer (GTFormer) to generate a coarse-grained foreground map as a flexible mask prompt, where the GTFormer jointly embeds patch tokens and novel global tokens to learn foreground semantics. Secondly, we deliver grid points as dense prompts into SAM to maximize the probability of foreground mask, which avoids missing objects as can happen with a single point/box prompt. Finally, we propose a pixel-level similarity metric to realize the mask matching from mask prompt to SAM, where the mask with the highest score is viewed as the final localization map. Experiments show that the proposed Pro2SAM achieves state-of-the-art performance on both CUB-200-2011 and ILSVRC, with 84.03% and 66.85% Top-1 Loc, respectively.
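
The prompt-and-match steps can be sketched in a few lines. The abstract does not define the pixel-level similarity metric, so the example below substitutes plain intersection over union (IoU) as a stand-in. The function names, the grid layout at cell centres, and the toy masks are all assumptions for illustration, and no real SAM API is called.

```python
def grid_points(width, height, n_per_side):
    """Evenly spaced (x, y) prompt points at the cell centres of an n x n grid."""
    step_x = width / n_per_side
    step_y = height / n_per_side
    return [((i + 0.5) * step_x, (j + 0.5) * step_y)
            for j in range(n_per_side) for i in range(n_per_side)]

def mask_iou(mask_a, mask_b):
    """IoU between two binary masks given as nested lists of 0/1.

    Stand-in for the paper's pixel-level similarity metric, which the
    abstract does not specify.
    """
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    return inter / union if union else 0.0

def best_matching_mask(prompt_mask, candidate_masks):
    """Return (index, score) of the candidate most similar to the mask prompt."""
    scores = [mask_iou(prompt_mask, m) for m in candidate_masks]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Toy 2x2 example: a coarse foreground map and two candidate masks.
coarse_map = [[1, 1], [0, 0]]
candidates = [[[0, 0], [1, 1]], [[1, 1], [1, 0]]]
print(grid_points(4, 4, 2))
print(best_matching_mask(coarse_map, candidates))
```

In a real pipeline the candidate masks would come from the segmentation model, one or more per grid point; here they are hard-coded.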

[20] SpatialPrompting: Keyframe-driven Zero-Shot Spatial Reasoning with Off-the-Shelf Multimodal Large Language Models

Shun Taguchi,Hideki Deguchi,Takumi Hamazaki,Hiroyuki Sakai

Main category: cs.CV

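TL;DR: The SpatialPrompting framework uses off-the-shelf multimodal large language models for zero-shot 3D spatial reasoning, without expensive 3D fine-tuning or specialized inputs.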

Details

Motivation: Existing methods depend on expensive 3D fine-tuning and specialized inputs (such as point clouds or voxel features), which limits flexibility and scalability. Method: Uses a keyframe-driven prompt generation strategy that selects keyframes with metrics such as vision-language similarity and Mahalanobis distance and combines them with camera pose data. Result: Achieves state-of-the-art zero-shot performance on benchmark datasets such as ScanQA and SQA3D. Conclusion: SpatialPrompting offers a simpler, more scalable approach to 3D spatial reasoning that needs neither specialized inputs nor fine-tuning. Abstract: This study introduces SpatialPrompting, a novel framework that harnesses the emergent reasoning capabilities of off-the-shelf multimodal large language models to achieve zero-shot spatial reasoning in three-dimensional (3D) environments. Unlike existing methods that rely on expensive 3D-specific fine-tuning with specialized 3D inputs such as point clouds or voxel-based features, SpatialPrompting employs a keyframe-driven prompt generation strategy. This framework uses metrics such as vision-language similarity, Mahalanobis distance, field of view, and image sharpness to select a diverse and informative set of keyframes from image sequences and then integrates them with corresponding camera pose data to effectively abstract spatial relationships and infer complex 3D structures. The proposed framework not only establishes a new paradigm for flexible spatial reasoning that utilizes intuitive visual and positional cues but also achieves state-of-the-art zero-shot performance on benchmark datasets, such as ScanQA and SQA3D, across several metrics. The proposed method effectively eliminates the need for specialized 3D inputs and fine-tuning, offering a simpler and more scalable alternative to conventional approaches.
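
For reference, the Mahalanobis distance named among the keyframe selection metrics has the standard definition below. The abstract does not say which quantity it is applied to (image features, camera poses, or something else), so $x$, $\mu$, and $\Sigma$ are generic symbols here rather than the paper's notation.

```latex
% Mahalanobis distance of a vector x from a distribution with
% mean \mu and covariance matrix \Sigma (standard definition):
d_M(x) = \sqrt{(x - \mu)^{\top} \, \Sigma^{-1} \, (x - \mu)}

% With \Sigma = I it reduces to the Euclidean distance:
d_M(x) = \lVert x - \mu \rVert_2
```

Unlike the Euclidean distance, it discounts directions in which the reference data already varies widely, which makes it a natural test of whether a candidate frame adds something new to the frames already selected.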

[21] GlyphMastero: A Glyph Encoder for High-Fidelity Scene Text Editing

Tong Wang,Ting Liu,Xiaochao Qu,Chengjing Wu,Luoqi Liu,Xiaolin Hu

Main category: cs.CV

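TL;DR: GlyphMastero is a specialized glyph encoder that improves the quality and accuracy of scene text editing through a glyph attention module and multi-scale OCR feature fusion.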

Details

Motivation: Existing diffusion-based methods perform poorly on complex characters (such as Chinese) and struggle to preserve stroke-level precision and style consistency. Method: Proposes GlyphMastero, which combines a glyph attention module with a multi-scale feature pyramid network to provide cross-level, multi-scale glyph-aware guidance. Result: Improves sentence accuracy by 18.02% over existing methods and reduces the text-region Fréchet inception distance by 53.28%. Conclusion: Through fine-grained glyph modeling and multi-scale feature fusion, GlyphMastero improves scene text editing performance. Abstract: Scene text editing, a subfield of image editing, requires modifying texts in images while preserving style consistency and visual coherence with the surrounding environment. While diffusion-based methods have shown promise in text generation, they still struggle to produce high-quality results. These methods often generate distorted or unrecognizable characters, particularly when dealing with complex characters like Chinese. In such systems, characters are composed of intricate stroke patterns and spatial relationships that must be precisely maintained. We present GlyphMastero, a specialized glyph encoder designed to guide the latent diffusion model for generating texts with stroke-level precision. Our key insight is that existing methods, despite using pretrained OCR models for feature extraction, fail to capture the hierarchical nature of text structures - from individual strokes to stroke-level interactions to overall character-level structure. To address this, our glyph encoder explicitly models and captures the cross-level interactions between local-level individual characters and global-level text lines through our novel glyph attention module. Meanwhile, our model implements a feature pyramid network to fuse the multi-scale OCR backbone features at the global-level. Through these cross-level and multi-scale fusions, we obtain more detailed glyph-aware guidance, enabling precise control over the scene text generation process. Our method achieves an 18.02% improvement in sentence accuracy over the state-of-the-art multi-lingual scene text editing baseline, while simultaneously reducing the text-region Fréchet inception distance by 53.28%.

[22] A Simple Detector with Frame Dynamics is a Strong Tracker

Chenxu Peng,Chenxu Wang,Minrui Zou,Danyang Li,Zhengpeng Yang,Yimian Dai,Ming-Ming Cheng,Xiang Li

Main category: cs.CV

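TL;DR: Proposes an infrared tiny-object tracking method that improves performance through global detection and motion-aware learning and performs strongly in the Anti-UAV Challenge.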

Details

Motivation: Existing trackers depend on cropped template regions and have limited motion modeling capability, so they handle tiny targets poorly. Method: Combines global detection with motion-aware learning, introducing frame dynamics features and a trajectory constraint filtering strategy. Result: Outperforms existing methods on multiple metrics and achieves leading results in the Anti-UAV Challenge. Conclusion: The method improves the performance and robustness of infrared tiny-object tracking. Abstract: Infrared object tracking plays a crucial role in Anti-Unmanned Aerial Vehicle (Anti-UAV) applications. Existing trackers often depend on cropped template regions and have limited motion modeling capabilities, which pose challenges when dealing with tiny targets. To address this, we propose a simple yet effective infrared tiny-object tracker that enhances tracking performance by integrating global detection and motion-aware learning with temporal priors. Our method is based on object detection and achieves significant improvements through two key innovations. First, we introduce frame dynamics, leveraging frame difference and optical flow to encode both prior target features and motion characteristics at the input level, enabling the model to better distinguish the target from background clutter. Second, we propose a trajectory constraint filtering strategy in the post-processing stage, utilizing spatio-temporal priors to suppress false positives and enhance tracking robustness. Extensive experiments show that our method consistently outperforms existing approaches across multiple metrics in challenging infrared UAV tracking scenarios. Notably, we achieve state-of-the-art performance in the 4th Anti-UAV Challenge, securing 1st place in Track 1 and 2nd place in Track 2.
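
Of the two motion cues named in the abstract, frame difference is the simpler one to illustrate. The sketch below computes an absolute per-pixel difference between consecutive grayscale frames and stacks it with the current frame as a second input channel. How the paper actually encodes these cues (and the optical flow component) is not described in the abstract, so the function names and the two-channel layout are assumptions.

```python
def frame_difference(prev_frame, curr_frame):
    """Absolute per-pixel intensity difference between two grayscale frames."""
    return [[abs(c - p) for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev_frame, curr_frame)]

def stack_dynamics(curr_frame, diff):
    """Pair each pixel with its difference value as a 2-channel detector input.

    Hypothetical input layout; a real detector would receive a tensor.
    """
    return [[(c, d) for c, d in zip(curr_row, diff_row)]
            for curr_row, diff_row in zip(curr_frame, diff)]

# Toy 2x2 frames: a static background with one pixel that brightens.
prev_frame = [[10, 10], [10, 10]]
curr_frame = [[10, 10], [10, 200]]

diff = frame_difference(prev_frame, curr_frame)
print(diff)
print(stack_dynamics(curr_frame, diff))
```

A tiny moving target that is nearly invisible in a single frame stands out in the difference channel, because the static background cancels to zero.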

[23] Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models

Yunxin Li,Zhenyu Liu,Zitao Li,Xuanyu Zhang,Zhenran Xu,Xinyu Chen,Haoyuan Shi,Shenyuan Jiang,Xintong Wang,Jifang Wang,Shouzheng Huang,Xinping Zhao,Borui Jiang,Lanqing Hong,Longyue Wang,Zhuotao Tian,Baoxing Huai,Wenhan Luo,Weihua Luo,Zheng Zhang,Baotian Hu,Min Zhang

Main category: cs.CV

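TL;DR: This survey traces the development of multimodal reasoning research, from early modular approaches to unified multimodal large language models, and discusses the future direction of native large multimodal reasoning models (N-LMRMs).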

Details

Motivation: As AI systems are increasingly deployed in open, uncertain, and multimodal environments, reasoning becomes key to robust and adaptive behavior. Large Multimodal Reasoning Models (LMRMs) support complex reasoning by integrating multiple modalities, but still face challenges in generalization, reasoning depth, and agentic behavior. Method: Organized around a four-stage developmental roadmap, the survey reviews the evolution from task-specific modules to unified multimodal LLMs and analyzes techniques such as Multimodal Chain-of-Thought (MCoT) and reinforcement learning. Result: The survey finds that unified frameworks perform better at multimodal reasoning, but generalization and deep reasoning remain open problems. Experimental cases (such as OpenAI O3 and O4-mini) support the potential of N-LMRMs. Conclusion: Future research should focus on native large multimodal reasoning models (N-LMRMs) to achieve scalable, agentic, and adaptive reasoning and planning in complex real-world environments. Abstract: Reasoning lies at the heart of intelligence, shaping the ability to make decisions, draw conclusions, and generalize across domains. In artificial intelligence, as systems increasingly operate in open, uncertain, and multimodal environments, reasoning becomes essential for enabling robust and adaptive behavior. Large Multimodal Reasoning Models (LMRMs) have emerged as a promising paradigm, integrating modalities such as text, images, audio, and video to support complex reasoning capabilities and aiming to achieve comprehensive perception, precise understanding, and deep reasoning. As research advances, multimodal reasoning has rapidly evolved from modular, perception-driven pipelines to unified, language-centric frameworks that offer more coherent cross-modal understanding. While instruction tuning and reinforcement learning have improved model reasoning, significant challenges remain in omni-modal generalization, reasoning depth, and agentic behavior. To address these issues, we present a comprehensive and structured survey of multimodal reasoning research, organized around a four-stage developmental roadmap that reflects the field's shifting design philosophies and emerging capabilities. First, we review early efforts based on task-specific modules, where reasoning was implicitly embedded across stages of representation, alignment, and fusion. Next, we examine recent approaches that unify reasoning into multimodal LLMs, with advances such as Multimodal Chain-of-Thought (MCoT) and multimodal reinforcement learning enabling richer and more structured reasoning chains. Finally, drawing on empirical insights from challenging benchmarks and experimental cases of OpenAI O3 and O4-mini, we discuss the conceptual direction of native large multimodal reasoning models (N-LMRMs), which aim to support scalable, agentic, and adaptive reasoning and planning in complex, real-world environments.

[24] Canny2Palm: Realistic and Controllable Palmprint Generation for Large-scale Pre-training

Xingzeng Lan,Xing Duan,Chen Chen,Weiyu Lin,Bo Wang

Main category: cs.CV

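TL;DR: Proposes Canny2Palm, a method that generates virtual palmprints by combining the Canny edge detector with a Pix2Pix network, addressing the scarcity of palmprint data and surpassing existing methods in recognition accuracy.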

Details

Motivation: Palmprint recognition is a secure and privacy-friendly biometric method, but the scarcity of palmprint data limits improvements in recognition accuracy. Method: Extracts palm textures with the Canny edge detector, generates realistic virtual palmprints with a Pix2Pix network, and creates new identities by re-assembling textures. Result: On open-set palmprint recognition benchmarks, models pre-trained on Canny2Palm synthetic data achieve up to 7.2% higher identification accuracy than existing methods, and performance keeps improving with large-scale synthetic data. Conclusion: Canny2Palm generates realistic palmprint data with controllable diversity, showing potential for large-scale pre-training. Abstract: Palmprint recognition is a secure and privacy-friendly method of biometric identification. One of the major challenges to improve palmprint recognition accuracy is the scarcity of palmprint data. Recently, a popular line of research revolves around the synthesis of virtual palmprints for large-scale pre-training purposes. In this paper, we propose a novel synthesis method named Canny2Palm that extracts palm textures with the Canny edge detector and uses them to condition a Pix2Pix network for realistic palmprint generation. By re-assembling palmprint textures from different identities, we are able to create new identities by seeding the generator with new assemblies. Canny2Palm not only synthesizes realistic data following the distribution of real palmprints but also enables controllable diversity to generate large-scale new identities. On open-set palmprint recognition benchmarks, models pre-trained with Canny2Palm synthetic data outperform the state-of-the-art with up to 7.2% higher identification accuracy. Moreover, the performance of models pre-trained with Canny2Palm continues to improve given 10,000 synthetic IDs while those with existing methods already saturate, demonstrating the potential of our method for large-scale pre-training.

[25] FF-PNet: A Pyramid Network Based on Feature and Field for Brain Image Registration

Ying Zhang,Shuai Guo,Chenxi Sun,Yuchen Zhu,Jinhai Xiang

Main category: cs.CV

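TL;DR: Proposes FF-PNet, a pyramid registration network based on features and deformation fields that extracts coarse- and fine-grained features in parallel, improving the efficiency and accuracy of medical image registration.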

Details

Motivation: Existing models are inefficient at extracting coarse- and fine-grained features in parallel; FF-PNet aims to address this. Method: Designs a Residual Feature Fusion Module (RFFM) and a Residual Deformation Field Fusion Module (RDFFM) that handle feature extraction and image deformation in parallel. Result: Experiments on the LPBA and OASIS datasets show that FF-PNet outperforms existing methods on metrics such as the Dice Similarity Coefficient. Conclusion: Without attention mechanisms or multilayer perceptrons, using only conventional convolutional networks and the proposed modules, FF-PNet improves registration accuracy. Abstract: In recent years, deformable medical image registration techniques have made significant progress. However, existing models still lack efficiency in parallel extraction of coarse and fine-grained features. To address this, we construct a new pyramid registration network based on feature and deformation field (FF-PNet). For coarse-grained feature extraction, we design a Residual Feature Fusion Module (RFFM); for fine-grained image deformation, we propose a Residual Deformation Field Fusion Module (RDFFM). Through the parallel operation of these two modules, the model can effectively handle complex image deformations. It is worth emphasizing that the encoding stage of FF-PNet only employs traditional convolutional neural networks without any attention mechanisms or multilayer perceptrons, yet it still achieves remarkable improvements in registration accuracy, fully demonstrating the superior feature decoding capabilities of RFFM and RDFFM. We conducted extensive experiments on the LPBA and OASIS datasets. The results show our network consistently outperforms popular methods in metrics like the Dice Similarity Coefficient.
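
The Dice Similarity Coefficient used as the headline metric here is defined as 2|A ∩ B| / (|A| + |B|) for two binary masks A and B. A minimal sketch on flat 0/1 lists follows; the function name and the convention of returning 1.0 for two empty masks are choices made for this example, not taken from the paper.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two flat binary masks of equal length."""
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Two empty masks agree perfectly; this convention is an assumption.
    return 2.0 * intersection / total if total else 1.0

# Toy example: a warped segmentation compared against the fixed-image label.
warped_label = [1, 1, 0, 0]
fixed_label = [1, 0, 0, 0]
print(dice_coefficient(warped_label, fixed_label))
```

In registration benchmarks the score is typically computed per anatomical structure after warping the moving image's labels, then averaged.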

Jiepan Li,He Huang,Yu Sheng,Yujun Guo,Wei He

Main category: cs.CV

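TL;DR: Proposes a building-guided pseudo-label learning framework for assessing building damage from pre-disaster optical and post-disaster SAR images, improving accuracy through multi-model fusion and low-uncertainty pseudo-label training.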

Details

Motivation: Accurate building damage assessment is essential for disaster response and recovery planning, but existing methods face challenges with multi-modal remote sensing images. Method: 1. Train building extraction models on pre-disaster optical images; 2. Generate pseudo-probabilities through multi-model fusion and test-time augmentation; 3. Introduce a building-guided low-uncertainty pseudo-label refinement strategy. Result: On the 2025 IEEE GRSS Data Fusion Contest dataset, the method achieved the highest mIoU score (54.28%) and first place. Conclusion: Building-guided pseudo-label learning improves the accuracy and reliability of building damage assessment. Abstract: Accurate building damage assessment using bi-temporal multi-modal remote sensing images is essential for effective disaster response and recovery planning. This study proposes a novel Building-Guided Pseudo-Label Learning Framework to address the challenges of mapping building damage from pre-disaster optical and post-disaster SAR images. First, we train a series of building extraction models using pre-disaster optical images and building labels. To enhance building segmentation, we employ multi-model fusion and test-time augmentation strategies to generate pseudo-probabilities, followed by a low-uncertainty pseudo-label training method for further refinement. Next, a change detection model is trained on bi-temporal cross-modal images and damaged building labels. To improve damage classification accuracy, we introduce a building-guided low-uncertainty pseudo-label refinement strategy, which leverages building priors from the previous step to guide pseudo-label generation for damaged buildings, reducing uncertainty and enhancing reliability. Experimental results on the 2025 IEEE GRSS Data Fusion Contest dataset demonstrate the effectiveness of our approach, which achieved the highest mIoU score (54.28%) and secured first place in the competition.
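
The abstract names the ingredients (multi-model fusion, low-uncertainty pseudo-labels, building guidance) without giving formulas. One plausible reading, sketched per pixel below: average the class probabilities from several models, keep the predicted class only when its fused confidence passes a threshold, and force pixels outside the predicted building mask to background. The threshold value, the `IGNORE` label, the background class index, and every function name are assumptions for illustration.

```python
IGNORE = -1  # hypothetical label for pixels excluded from the training loss

def fuse_probabilities(per_model_probs):
    """Average the class-probability vectors that several models give one pixel."""
    n_models = len(per_model_probs)
    n_classes = len(per_model_probs[0])
    return [sum(p[c] for p in per_model_probs) / n_models
            for c in range(n_classes)]

def low_uncertainty_label(fused, threshold=0.9):
    """Keep the argmax class only if its fused probability reaches the threshold."""
    best = max(range(len(fused)), key=fused.__getitem__)
    return best if fused[best] >= threshold else IGNORE

def building_guided(label, is_building, background=0):
    """Assumed guidance rule: pixels outside buildings cannot carry damage labels."""
    return label if is_building else background

# Two models agree confidently on class 0 for one pixel, and disagree on another.
confident = fuse_probabilities([[0.9, 0.1], [1.0, 0.0]])
uncertain = fuse_probabilities([[0.6, 0.4], [0.4, 0.6]])
print(low_uncertainty_label(confident))  # kept as a pseudo-label
print(low_uncertainty_label(uncertain))  # dropped as too uncertain
```

Dropping uncertain pixels trades label coverage for label purity, which tends to matter more when the pseudo-labels feed a second round of training.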

[27] T2VTextBench: A Human Evaluation Benchmark for Textual Control in Video Generation Models

Xuyang Guo,Jiayan Huo,Zhenmei Shi,Zhao Song,Jiahao Zhang,Jiale Zhao

Main category: cs.CV

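TL;DR: T2VTextBench is the first benchmark for evaluating on-screen text fidelity and temporal consistency in text-to-video models; it finds that existing models perform poorly at generating legible, consistent text.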

Details

Motivation: Although text-to-video generation has advanced in fidelity and instruction following, models' ability to render precise on-screen text (such as captions or mathematical formulas) remains largely untested, which poses challenges for applications that require textual accuracy. Method: Proposes the T2VTextBench benchmark, whose prompts combine complex text strings with dynamic scene changes to evaluate each model's ability to maintain detailed instructions across frames. Result: Evaluates ten state-of-the-art systems and finds that most struggle to generate legible, consistent text. Conclusion: Current video generators have a clear weakness in text handling, which gives future research a concrete direction. Abstract: Thanks to recent advancements in scalable deep architectures and large-scale pretraining, text-to-video generation has achieved unprecedented capabilities in producing high-fidelity, instruction-following content across a wide range of styles, enabling applications in advertising, entertainment, and education. However, these models' ability to render precise on-screen text, such as captions or mathematical formulas, remains largely untested, posing significant challenges for applications requiring exact textual accuracy. In this work, we introduce T2VTextBench, the first human-evaluation benchmark dedicated to evaluating on-screen text fidelity and temporal consistency in text-to-video models. Our suite of prompts integrates complex text strings with dynamic scene changes, testing each model's ability to maintain detailed instructions across frames. We evaluate ten state-of-the-art systems, ranging from open-source solutions to commercial offerings, and find that most struggle to generate legible, consistent text. These results highlight a critical gap in current video generators and provide a clear direction for future research aimed at enhancing textual manipulation in video synthesis.

[28] An Efficient Method for Accurate Pose Estimation and Error Correction of Cuboidal Objects

Utsav Rai,Hardik Mehta,Vismay Vakharia,Aditya Choudhary,Amit Parmar,Rolif Lima,Kaushik Das

Main category: cs.CV

TL;DR: Proposes an efficient method for precise pose estimation of cuboid-shaped objects that reduces target pose error while saving time.

Details

Motivation: Conventional pose estimation methods leave small pose errors and add time overhead when autonomously picking cuboidal objects. Method: Proposes a linear-time approach to pose error estimation and correction as an alternative to refining global point cloud registration with local registration algorithms. Result: The method estimates and corrects the pose of cuboidal objects efficiently and precisely. Conclusion: The proposed linear-time method reduces pose error and improves efficiency. Abstract: The proposed system outlined in this paper is a solution to a use case that requires the autonomous picking of cuboidal objects from an organized or unorganized pile with high precision. This paper presents an efficient method for precise pose estimation of cuboid-shaped objects, which aims to reduce errors in target pose in a time-efficient manner. Typical pose estimation methods like global point cloud registrations are prone to minor pose errors for which local registration algorithms are generally used to improve pose accuracy. However, due to the execution time overhead and uncertainty in the error of the final achieved pose, an alternate, linear time approach is proposed for pose error estimation and correction. This paper presents an overview of the solution followed by a detailed description of individual modules of the proposed algorithm.

[29] ViCTr: Vital Consistency Transfer for Pathology Aware Image Synthesis

Onkar Susladkar,Gayatri Deshmukh,Yalcin Tur,Ulas Bagci

Main category: cs.CV

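TL;DR: ViCTr is a new two-stage framework that combines a rectified flow trajectory with a Tweedie-corrected diffusion process for high-fidelity, pathology-aware medical image synthesis.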

Details

Motivation: Medical image synthesis is hampered by limited annotated pathological data, modality domain gaps, and the complexity of modeling diffuse pathologies such as liver cirrhosis. Method: ViCTr uses two stages: pretraining that preserves critical anatomical structures, followed by adversarial fine-tuning to control pathology severity. It supports one-step sampling, reducing the number of inference steps. Result: Performs strongly on multiple datasets: an MFID of 17.01 for cirrhosis synthesis, 28% lower than existing approaches, and a 3.8% mDSC improvement in segmentation performance. Conclusion: ViCTr is the first method to achieve fine-grained, pathology-aware MRI synthesis, filling a gap in AI-driven medical imaging research. Abstract: Synthesizing medical images remains challenging due to limited annotated pathological data, modality domain gaps, and the complexity of representing diffuse pathologies such as liver cirrhosis. Existing methods often struggle to maintain anatomical fidelity while accurately modeling pathological features, frequently relying on priors derived from natural images or inefficient multi-step sampling. In this work, we introduce ViCTr (Vital Consistency Transfer), a novel two-stage framework that combines a rectified flow trajectory with a Tweedie-corrected diffusion process to achieve high-fidelity, pathology-aware image synthesis. First, we pretrain ViCTr on the ATLAS-8k dataset using Elastic Weight Consolidation (EWC) to preserve critical anatomical structures. We then fine-tune the model adversarially with Low-Rank Adaptation (LoRA) modules for precise control over pathology severity. By reformulating Tweedie's formula within a linear trajectory framework, ViCTr supports one-step sampling, reducing inference from 50 steps to just 4, without sacrificing anatomical realism. We evaluate ViCTr on BTCV (CT), AMOS (MRI), and CirrMRI600+ (cirrhosis) datasets. Results demonstrate state-of-the-art performance, achieving a Medical Frechet Inception Distance (MFID) of 17.01 for cirrhosis synthesis (28% lower than existing approaches) and improving nnUNet segmentation by +3.8% mDSC when used for data augmentation. Radiologist reviews indicate that ViCTr-generated liver cirrhosis MRIs are clinically indistinguishable from real scans. To our knowledge, ViCTr is the first method to provide fine-grained, pathology-aware MRI synthesis with graded severity control, closing a critical gap in AI-driven medical imaging research.
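
For readers unfamiliar with the correction step, Tweedie's formula in its standard Gaussian form is given below. It is the textbook statement only; the paper's reformulation within a linear (rectified flow) trajectory is not given in the abstract and is not reproduced here. The symbols $x$, $y$, $\sigma$, and $\varepsilon$ are generic notation chosen for this note.

```latex
% Observation model: a clean sample x corrupted by Gaussian noise.
y = x + \sigma \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)

% Tweedie's formula: the posterior mean of x given y depends only on
% the score (gradient of the log-density) of the noisy marginal p(y).
\mathbb{E}[x \mid y] = y + \sigma^{2} \, \nabla_{y} \log p(y)
```

Because a diffusion model learns an estimate of the score, the formula gives a denoised estimate of the clean image from any noisy state in a single evaluation, which is the property that few-step samplers build on.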

[30] CAG-VLM: Fine-Tuning of a Large-Scale Model to Recognize Angiographic Images for Next-Generation Diagnostic Systems

Yuto Nakamura,Satoshi Kodera,Haruki Settai,Hiroki Shinohara,Masatsugu Tamura,Tomohiro Noguchi,Tatsuki Furusawa,Ryo Takizawa,Tempei Kabayama,Norihiko Takeda

Main category: cs.CV

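TL;DR: The paper proposes a two-stage AI-supported pipeline that generates clinical reports and treatment recommendations from coronary angiography (CAG) images, validated with a bilingual dataset and fine-tuned vision-language models (VLMs).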

Details

Motivation: Interpretation of coronary angiography (CAG) and the resulting treatment planning depend on experts; AI support could improve efficiency and accuracy. Method: 1. Build a bilingual CAG image-report dataset; 2. Train a CNN for key-frame detection and left/right laterality classification; 3. Fine-tune three VLMs and evaluate their performance. Result: Gemma3 w/LoRA achieves the best clinician rating (7.20/10) and is designated CAG-VLM. Conclusion: Fine-tuned VLMs can effectively assist in generating clinical reports and treatment recommendations from CAG images. Abstract: Coronary angiography (CAG) is the gold-standard imaging modality for evaluating coronary artery disease, but its interpretation and subsequent treatment planning rely heavily on expert cardiologists. To enable AI-based decision support, we introduce a two-stage, physician-curated pipeline and a bilingual (Japanese/English) CAG image-report dataset. First, we sample 14,686 frames from 539 exams and annotate them for key-frame detection and left/right laterality; a ConvNeXt-Base CNN trained on this data achieves 0.96 F1 on laterality classification, even on low-contrast frames. Second, we apply the CNN to 243 independent exams, extract 1,114 key frames, and pair each with its pre-procedure report and expert-validated diagnostic and treatment summary, yielding a parallel corpus. We then fine-tune three open-source VLMs (PaliGemma2, Gemma3, and ConceptCLIP-enhanced Gemma3) via LoRA and evaluate them using VLScore and cardiologist review. Although PaliGemma2 w/LoRA attains the highest VLScore, Gemma3 w/LoRA achieves the top clinician rating (mean 7.20/10); we designate this best-performing model as CAG-VLM. These results demonstrate that specialized, fine-tuned VLMs can effectively assist cardiologists in generating clinical reports and treatment recommendations from CAG images.

[31] DenseGrounding: Improving Dense Language-Vision Semantics for Ego-Centric 3D Visual Grounding

Henry Zheng,Hao Shi,Qihang Peng,Yong Xien Chng,Rui Huang,Yepeng Weng,Zhongchao Shi,Gao Huang

Main category: cs.CV

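TL;DR: DenseGrounding addresses two main challenges in 3D visual grounding by enhancing visual and textual semantics, improving performance and winning awards in a CVPR 2024 challenge.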

Details

Motivation: Enabling intelligent agents to understand and interact with 3D environments through natural language is crucial for robotics and human-computer interaction, but existing methods fall short in visual semantics and textual context. Method: DenseGrounding combines a Hierarchical Scene Semantic Enhancer (which strengthens visual features) with a Language Semantic Enhancer (which uses large language models to enrich text descriptions), improving cross-modal alignment. Result: Experiments show that DenseGrounding outperforms existing methods in overall accuracy, with improvements of 5.81% and 7.56%, and it won 1st place and the Innovation Award in a CVPR 2024 challenge. Conclusion: By enhancing visual and textual semantics, DenseGrounding addresses the challenges of 3D visual grounding and advances the field. Abstract: Enabling intelligent agents to comprehend and interact with 3D environments through natural language is crucial for advancing robotics and human-computer interaction. A fundamental task in this field is ego-centric 3D visual grounding, where agents locate target objects in real-world 3D spaces based on verbal descriptions. However, this task faces two significant challenges: (1) loss of fine-grained visual semantics due to sparse fusion of point clouds with ego-centric multi-view images, (2) limited textual semantic context due to arbitrary language descriptions. We propose DenseGrounding, a novel approach designed to address these issues by enhancing both visual and textual semantics. For visual features, we introduce the Hierarchical Scene Semantic Enhancer, which retains dense semantics by capturing fine-grained global scene features and facilitating cross-modal alignment. For text descriptions, we propose a Language Semantic Enhancer that leverages large language models to provide rich context and diverse language descriptions with additional context during model training. Extensive experiments show that DenseGrounding significantly outperforms existing methods in overall accuracy, with improvements of 5.81% and 7.56% when trained on the comprehensive full dataset and smaller mini subset, respectively, further advancing the SOTA in ego-centric 3D visual grounding. Our method also achieves 1st place and receives the Innovation Award in the CVPR 2024 Autonomous Grand Challenge Multi-view 3D Visual Grounding Track, validating its effectiveness and robustness.

[32] ReAlign: Bilingual Text-to-Motion Generation via Step-Aware Reward-Guided Alignment

Wanjiang Weng,Xiaofeng Tan,Hongsong Wang,Pan Zhou

Main category: cs.CV

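TL;DR: Proposes the BiHumanML3D bilingual dataset and the BiMD model, and combines them with the ReAlign method to improve the quality and semantic consistency of bilingual text-to-motion generation.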

Details

Motivation: Bilingual text-to-motion generation lacks datasets and suffers from semantic inconsistency. Method: Builds the BiHumanML3D dataset, proposes the BiMD model and the ReAlign method, and uses a reward model to guide the generation process. Result: Experiments show that the approach improves text-motion alignment and motion quality. Conclusion: BiHumanML3D and BiMD provide an effective solution for bilingual text-to-motion generation. Abstract: Bilingual text-to-motion generation, which synthesizes 3D human motions from bilingual text inputs, holds immense potential for cross-linguistic applications in gaming, film, and robotics. However, this task faces critical challenges: the absence of bilingual motion-language datasets and the misalignment between text and motion distributions in diffusion models, leading to semantically inconsistent or low-quality motions. To address these challenges, we propose BiHumanML3D, a novel bilingual human motion dataset, which establishes a crucial benchmark for bilingual text-to-motion generation models. Furthermore, we propose a Bilingual Motion Diffusion model (BiMD), which leverages cross-lingual aligned representations to capture semantics, thereby achieving a unified bilingual model. Building upon this, we propose the Reward-guided sampling Alignment (ReAlign) method, comprising a step-aware reward model to assess alignment quality during sampling and a reward-guided strategy that directs the diffusion process toward an optimally aligned distribution. This reward model integrates step-aware tokens and combines a text-aligned module for semantic consistency and a motion-aligned module for realism, refining noisy motions at each timestep to balance probability density and alignment. Experiments demonstrate that our approach significantly improves text-motion alignment and motion quality compared to existing state-of-the-art methods. Project page: https://wengwanjiang.github.io/ReAlign-page/.

[33] Federated Deconfounding and Debiasing Learning for Out-of-Distribution Generalization

Zhuang Qi,Sijin Zhou,Lei Meng,Han Hu,Han Yu,Xiangxu Meng

Main category: cs.CV

TL;DR: FedDDL addresses attribute bias in federated learning by constructing a causal graph and counterfactual samples, improving model performance.

Details

Motivation: Attribute bias in federated learning degrades model performance, and existing methods neither analyze the inference paths comprehensively nor remove the interference of confounding factors. Method: FedDDL builds a causal graph to analyze inference and designs an intra-client deconfounding module and an inter-client debiasing module that generate counterfactual samples and causal prototypes. Result: On 2 benchmark datasets, FedDDL achieves 4.5% higher average Top-1 accuracy than existing methods. Conclusion: FedDDL eliminates confounding paths and improves the model's ability to focus on main objects, outperforming existing methods. Abstract: Attribute bias in federated learning (FL) typically leads local models to optimize inconsistently due to the learning of non-causal associations, resulting in degraded performance. Existing methods either use data augmentation for increasing sample diversity or knowledge distillation for learning invariant representations to address this problem. However, they lack a comprehensive analysis of the inference paths, and the interference from confounding factors limits their performance. To address these limitations, we propose the Federated Deconfounding and Debiasing Learning (FedDDL) method. It constructs a structured causal graph to analyze the model inference process, and performs backdoor adjustment to eliminate confounding paths. Specifically, we design an intra-client deconfounding learning module for computer vision tasks to decouple background and objects, generating counterfactual samples that establish a connection between the background and any label, which stops the model from using the background to infer the label. Moreover, we design an inter-client debiasing learning module to construct causal prototypes to reduce the proportion of the background in prototype components. Notably, it bridges the gap between heterogeneous representations via causal prototypical regularization. Extensive experiments on 2 benchmarking datasets demonstrate that FedDDL significantly enhances the model capability to focus on main objects in unseen data, leading to 4.5% higher Top-1 Accuracy on average over 9 state-of-the-art existing methods.
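
The backdoor adjustment mentioned in the abstract is a standard result from causal inference, stated below in generic notation. Reading $X$ as the image, $Y$ as the label, and $Z$ as the background confounder is an interpretation of the abstract, not notation confirmed by the paper.

```latex
% Backdoor adjustment: if Z blocks every backdoor path from X to Y
% and contains no descendant of X, the interventional distribution is
P\bigl(Y \mid \mathrm{do}(X = x)\bigr)
  = \sum_{z} P\bigl(Y \mid X = x,\, Z = z\bigr)\, P(Z = z)

% Contrast with ordinary conditioning, which weights by P(z | x)
% and therefore inherits the confounding:
P(Y \mid X = x) = \sum_{z} P\bigl(Y \mid X = x,\, Z = z\bigr)\, P(Z = z \mid X = x)
```

Pairing each object with backgrounds drawn independently of its label, as the counterfactual samples described above do, approximates the first sum: the background is weighted by its marginal frequency rather than by how often it co-occurs with the object.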

[34] StabStitch++: Unsupervised Online Video Stitching with Spatiotemporal Bidirectional Warps

Lang Nie,Chunyu Lin,Kang Liao,Yun Zhang,Shuaicheng Liu,Yao Zhao

Main category: cs.CV

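TL;DR: The paper proposes the StabStitch++ framework, which addresses warping shake in video stitching by achieving spatial stitching and temporal stabilization simultaneously through unsupervised learning.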

Details

Motivation: In video stitching, even when the input videos are stable, temporally unsmooth warps can cause visible jitter (warping shake) in the stitched video and degrade the viewing experience. Method: Designs a bidirectional decomposition module to disentangle the homography transformation, integrates spatial and temporal warps, and proposes a warp smoothing model that jointly encourages content alignment, trajectory smoothness, and online collaboration. Result: StabStitch++ surpasses existing solutions in stitching performance, robustness, and efficiency, and supports real-time online video stitching. Conclusion: By optimizing alignment and stabilization simultaneously, StabStitch++ improves video stitching quality and advances the field. Abstract: We retarget video stitching to an emerging issue, named warping shake, which unveils the temporal content shakes induced by sequentially unsmooth warps when extending image stitching to video stitching. Even if the input videos are stable, the stitched video can inevitably cause undesired warping shakes and affect the visual experience. To address this issue, we propose StabStitch++, a novel video stitching framework to realize spatial stitching and temporal stabilization with unsupervised learning simultaneously. First, different from existing learning-based image stitching solutions that typically warp one image to align with another, we suppose a virtual midplane between original image planes and project them onto it. Concretely, we design a differentiable bidirectional decomposition module to disentangle the homography transformation and incorporate it into our spatial warp, evenly spreading alignment burdens and projective distortions across two views. Then, inspired by camera paths in video stabilization, we derive the mathematical expression of stitching trajectories in video stitching by elaborately integrating spatial and temporal warps. Finally, a warp smoothing model is presented to produce stable stitched videos with a hybrid loss to simultaneously encourage content alignment, trajectory smoothness, and online collaboration. Compared with StabStitch that sacrifices alignment for stabilization, StabStitch++ makes no compromise and optimizes both of them simultaneously, especially in the online mode. To establish an evaluation benchmark and train the learning framework, we build a video stitching dataset with a rich diversity in camera motions and scenes. Experiments exhibit that StabStitch++ surpasses current solutions in stitching performance, robustness, and efficiency, offering compelling advancements in this field by building a real-time online video stitching system.

[35] Automated Thoracolumbar Stump Rib Detection and Analysis in a Large CT Cohort

Hendrik Möller,Hanna Schön,Alina Dima,Benjamin Keinert-Weth,Robert Graf,Matan Atad,Johannes Paetzold,Friederike Jungmann,Rickmer Braren,Florian Kofler,Bjoern Menze,Daniel Rueckert,Jan S. Kirschke

Main category: cs.CV

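TL;DR: The study automates thoracolumbar stump rib detection with a high-resolution deep learning model and quantitatively analyzes rib morphology, with segmentation clearly outperforming existing models.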

Details

Motivation: Thoracolumbar stump ribs are an important indicator of thoracolumbar transitional vertebrae or enumeration anomalies. Existing studies mostly rely on manual, qualitative assessment, so this study aims for automated, quantitative analysis. Method: A high-resolution deep learning model is trained for rib segmentation, and an iterative algorithm with piecewise linear interpolation is used to assess rib length. Result: Segmentation improves significantly (Dice score 0.997 vs. 0.779), and rib length assessment succeeds in 98.2% of cases. Stump ribs differ significantly in morphology from full-length ribs. Conclusion: The automated method differentiates stump ribs from regular ribs with an F1-score of 0.84; the model weights and masks are publicly released. Abstract: Thoracolumbar stump ribs are one of the essential indicators of thoracolumbar transitional vertebrae or enumeration anomalies. While some studies manually assess these anomalies and describe the ribs qualitatively, this study aims to automate thoracolumbar stump rib detection and analyze their morphology quantitatively. To this end, we train a high-resolution deep-learning model for rib segmentation and show significant improvements compared to existing models (Dice score 0.997 vs. 0.779, p-value < 0.01). In addition, we use an iterative algorithm and piece-wise linear interpolation to assess the length of the ribs, showing a success rate of 98.2%. When analyzing morphological features, we show that stump ribs articulate more posteriorly at the vertebrae (-19.2 ± 3.8 vs. -13.8 ± 2.5, p-value < 0.01), are thinner (260.6 ± 103.4 vs. 563.6 ± 127.1, p-value < 0.01), and are oriented more downwards and sideways within the first centimeters in contrast to full-length ribs. We show that with partially visible ribs, these features can achieve an F1-score of 0.84 in differentiating stump ribs from regular ones. We publish the model weights and masks for public use.
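The two quantities reported above, the Dice score and a rib length obtained from piecewise linear interpolation, can be illustrated with a minimal sketch. It assumes the segmentation is given as a set of voxel indices and the rib centerline as an ordered list of 3D points; the function names and sample values are hypothetical and are not taken from the paper, whose iterative centerline extraction is not reproduced here.

```python
import math


def dice_score(pred_voxels, true_voxels):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two sets of voxel indices."""
    if not pred_voxels and not true_voxels:
        return 1.0  # both masks empty: treat as perfect agreement
    overlap = len(pred_voxels & true_voxels)
    return 2.0 * overlap / (len(pred_voxels) + len(true_voxels))


def polyline_length(points):
    """Length of a piecewise linear curve through consecutive 3D points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


# Hypothetical centerline samples (in mm), ordered from the vertebra outward.
centerline = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
rib_length_mm = polyline_length(centerline)

# Hypothetical voxel index sets for a predicted and a reference mask.
dice = dice_score({1, 2, 3}, {2, 3, 4})
```

Denser centerline sampling makes the polyline a closer approximation of the true curved length, which is the usual reason for interpolating between detected points.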

[36] Driving with Context: Online Map Matching for Complex Roads Using Lane Markings and Scenario Recognition

Xin Bi,Zhichao Li,Yuxuan Xia,Panpan Tong,Lijuan Zhang,Yang Chen,Junsheng Fu

Main category: cs.CV

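TL;DR: Proposes an online map matching method based on a Hidden Markov Model with multiple probability factors, which markedly improves matching accuracy in complex road networks.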

Details

Motivation: Existing online map matching methods are error-prone in complex road networks, especially in multilevel road areas. Method: An HMM with multiple probability factors is constructed, with the factors designed from lane markings and scenario recognition to achieve high-accuracy matching. Result: In road tests in Europe and China, the method reaches F1 scores of 98.04% and 94.60% respectively, significantly outperforming benchmark methods. Conclusion: The method performs well in complex road networks, particularly in multilevel road areas, and is of practical value. Abstract: Accurate online map matching is fundamental to vehicle navigation and the activation of intelligent driving functions. Current online map matching methods are prone to errors in complex road networks, especially in multilevel road areas. To address this challenge, we propose an online Standard Definition (SD) map matching method by constructing a Hidden Markov Model (HMM) with multiple probability factors. Our proposed method can achieve accurate map matching even in complex road networks by carefully leveraging lane markings and scenario recognition in the design of the probability factors. First, the lane markings are generated by a multi-lane tracking method and associated with the SD map using HMM to build an enriched SD map. In areas covered by the enriched SD map, the vehicle can re-localize itself by performing Iterative Closest Point (ICP) registration for the lane markings. Then, the probability factor accounting for the lane marking detection can be obtained using the association probability between adjacent lanes and roads. Second, the driving scenario recognition model is applied to generate the emission probability factor of scenario recognition, which improves the performance of map matching on elevated roads and ordinary urban roads underneath them. We validate our method through extensive road tests in Europe and China, and the experimental results show that our proposed method effectively improves the online map matching accuracy as compared to other existing methods, especially in multilevel road areas. Specifically, the experiments show that our proposed method achieves $F_1$ scores of 98.04% and 94.60% on the Zenseact Open Dataset and test data of multilevel road areas in Shanghai respectively, significantly outperforming benchmark methods. The implementation is available at https://github.com/TRV-Lab/LMSR-OMM.
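The idea of an HMM whose emission probability is a product of several factors can be sketched with a small Viterbi decoder. This is an illustrative toy, not the released implementation: the state names, the two factors (a position factor and a scenario factor), and all probabilities are assumptions chosen to mimic an elevated road running above a ground-level road.

```python
import math


def viterbi(states, start_p, trans_p, emission_factors):
    """Most likely state sequence for an HMM with multi-factor emissions.

    emission_factors[t][s] is a list of factor probabilities for state s at
    time t; the emission probability is their product. All probabilities
    are assumed strictly positive so that logarithms are defined.
    """
    def log_emit(t, s):
        return sum(math.log(f) for f in emission_factors[t][s])

    score = {s: math.log(start_p[s]) + log_emit(0, s) for s in states}
    back = []
    for t in range(1, len(emission_factors)):
        new_score, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: score[r] + math.log(trans_p[r][s]))
            new_score[s] = score[prev] + math.log(trans_p[prev][s]) + log_emit(t, s)
            pointers[s] = prev
        score = new_score
        back.append(pointers)

    # Backtrack from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


states = ["elevated", "ground"]
start_p = {"elevated": 0.5, "ground": 0.5}
trans_p = {
    "elevated": {"elevated": 0.9, "ground": 0.1},
    "ground": {"elevated": 0.1, "ground": 0.9},
}
# Per time step: [position factor, scenario factor]. The position factor is
# ambiguous because the roads overlap; the scenario factor breaks the tie.
step = {"elevated": [0.5, 0.8], "ground": [0.5, 0.2]}
matched = viterbi(states, start_p, trans_p, [step, step, step])
```

With position alone the two roads are indistinguishable in this toy; the extra factor is what makes the decoder settle on the elevated road.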

[37] Adaptive Contextual Embedding for Robust Far-View Borehole Detection

Xuesong Liu,Tianyu Hao,Emmett J. Ientilucci

Main category: cs.CV

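TL;DR: Proposes an EMA-based adaptive detection method for densely distributed tiny boreholes in far-view imagery, substantially improving on YOLO baselines.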

Details

Motivation: Existing methods perform poorly on boreholes because of their small scale, dense arrangement, and limited distinctive visual features; better detection is needed for operational safety and efficiency. Method: Combines adaptive augmentation, embedding stabilization, and contextual refinement, using EMA-based statistical updates for stable feature extraction. Result: Substantially outperforms baseline YOLO architectures on a quarry-site dataset. Conclusion: The method performs well in complex industrial scenarios and suits small-object detection. Abstract: In controlled blasting operations, accurately detecting densely distributed tiny boreholes from far-view imagery is critical for operational safety and efficiency. However, existing detection methods often struggle due to small object scales, highly dense arrangements, and limited distinctive visual features of boreholes. To address these challenges, we propose an adaptive detection approach that builds upon existing architectures (e.g., YOLO) by explicitly leveraging consistent embedding representations derived through exponential moving average (EMA)-based statistical updates. Our method introduces three synergistic components: (1) adaptive augmentation utilizing dynamically updated image statistics to robustly handle illumination and texture variations; (2) embedding stabilization to ensure consistent and reliable feature extraction; and (3) contextual refinement leveraging spatial context for improved detection accuracy. The pervasive use of EMA in our method is particularly advantageous given the limited visual complexity and small scale of boreholes, allowing stable and robust representation learning even under challenging visual conditions. Experiments on a challenging proprietary quarry-site dataset demonstrate substantial improvements over baseline YOLO-based architectures, highlighting our method's effectiveness in realistic and complex industrial scenarios.
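The EMA-based statistical update mentioned above follows a standard recurrence, sketched here for a single running statistic. The momentum value and the batch means are assumptions for illustration; the abstract does not state which statistics or decay rates the method uses.

```python
def ema_update(running, new_value, momentum=0.9):
    """One exponential-moving-average step: keep `momentum` of the old value."""
    return momentum * running + (1.0 - momentum) * new_value


# Hypothetical per-batch image brightness means fed into a running estimate.
running_mean = 0.5
for batch_mean in [0.7, 0.3, 0.6]:
    running_mean = ema_update(running_mean, batch_mean)
```

A momentum close to 1 makes the estimate change slowly, which is what gives EMA statistics their stabilizing effect under noisy inputs.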

[38] SOAP: Style-Omniscient Animatable Portraits

Tingting Liao,Yujian Zheng,Adilbek Karmanov,Liwen Hu,Leyang Jin,Yuliang Xiu,Hao Li

Main category: cs.CV

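TL;DR: SOAP generates animatable 3D avatars from a single portrait using a multiview diffusion model and adaptive optimization, supporting multiple styles while preserving detail.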

Details

Motivation: Addresses style limitations and insufficient animation control when generating 3D avatars from a single image. Method: Uses a multiview diffusion model and an adaptive optimization pipeline that deforms the FLAME mesh while maintaining its topology and rigging. Result: The generated 3D avatars support FACS-based animation, preserve details such as hairstyles and accessories, and outperform state-of-the-art techniques. Conclusion: SOAP performs strongly in single-view head modeling and image-to-3D generation; the code and data are publicly available. Abstract: Creating animatable 3D avatars from a single image remains challenging due to style limitations (realistic, cartoon, anime) and difficulties in handling accessories or hairstyles. While 3D diffusion models advance single-view reconstruction for general objects, outputs often lack animation controls or suffer from artifacts because of the domain gap. We propose SOAP, a style-omniscient framework to generate rigged, topology-consistent avatars from any portrait. Our method leverages a multiview diffusion model trained on 24K 3D heads with multiple styles and an adaptive optimization pipeline to deform the FLAME mesh while maintaining topology and rigging via differentiable rendering. The resulting textured avatars support FACS-based animation, integrate with eyeballs and teeth, and preserve details like braided hair or accessories. Extensive experiments demonstrate the superiority of our method over state-of-the-art techniques for both single-view head modeling and diffusion-based Image-to-3D generation. Our code and data are publicly available for research purposes at https://github.com/TingtingLiao/soap.

[39] Split Matching for Inductive Zero-shot Semantic Segmentation

Jialei Chen,Xu Zheng,Dongyue Li,Chong Yi,Seigo Ito,Danda Pani Paudel,Luc Van Gool,Hiroshi Murase,Daisuke Deguchi

Main category: cs.CV

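TL;DR: The paper proposes Split Matching (SM), a new assignment strategy that addresses the misclassification of unseen categories by Hungarian matching in zero-shot semantic segmentation, combined with a multi-scale feature enhancement module to improve performance.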

Details

Motivation: In zero-shot semantic segmentation, conventional Hungarian matching needs full supervision and tends to misclassify unseen categories as background, which limits model performance. Method: Split Matching decouples Hungarian matching into independent optimization for seen classes and latent classes, and clusters CLIP features to generate pseudo masks and region-level embeddings. A Multi-scale Feature Enhancement (MFE) module refines the decoder features. Result: Achieves state-of-the-art performance on standard benchmarks. Conclusion: SM is the first method to introduce decoupled Hungarian matching in inductive zero-shot semantic segmentation, and it markedly improves segmentation of unseen categories. Abstract: Zero-shot Semantic Segmentation (ZSS) aims to segment categories that are not annotated during training. While fine-tuning vision-language models has achieved promising results, these models often overfit to seen categories due to the lack of supervision for unseen classes. As an alternative to fully supervised approaches, query-based segmentation has shown great potential in ZSS, as it enables object localization without relying on explicit labels. However, conventional Hungarian matching, a core component in query-based frameworks, needs full supervision and often misclassifies unseen categories as background in the setting of ZSS. To address this issue, we propose Split Matching (SM), a novel assignment strategy that decouples Hungarian matching into two components: one for seen classes in annotated regions and another for latent classes in unannotated regions (referred to as unseen candidates). Specifically, we partition the queries into seen and candidate groups, enabling each to be optimized independently according to its available supervision. To discover unseen candidates, we cluster CLIP dense features to generate pseudo masks and extract region-level embeddings using CLS tokens. Matching is then conducted separately for the two groups based on both class-level similarity and mask-level consistency. Additionally, we introduce a Multi-scale Feature Enhancement (MFE) module that refines decoder features through residual multi-scale aggregation, improving the model's ability to capture spatial details across resolutions. SM is the first to introduce decoupled Hungarian matching under the inductive ZSS setting, and achieves state-of-the-art performance on two standard benchmarks.
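The decoupling described above, where seen queries are matched only to annotated targets and candidate queries only to pseudo-mask targets, can be sketched as two independent assignment problems. This is a toy under stated assumptions: the cost matrix values are invented, the paper's cost combines class-level similarity and mask-level consistency (not modeled here), and a brute-force search over permutations stands in for the Hungarian algorithm so the example needs only the standard library.

```python
from itertools import permutations


def best_assignment(cost):
    """Minimum-cost one-to-one assignment of every target (column) to a
    distinct query (row), found by brute force. Requires rows >= columns."""
    n_queries = len(cost)
    n_targets = len(cost[0]) if cost else 0
    best_rows, best_total = (), float("inf")
    for rows in permutations(range(n_queries), n_targets):
        total = sum(cost[r][t] for t, r in enumerate(rows))
        if total < best_total:
            best_rows, best_total = rows, total
    return [(r, t) for t, r in enumerate(best_rows)]


def split_match(cost, seen_queries, cand_queries, seen_targets, cand_targets):
    """Match each query group only against its own group of targets."""
    def match_group(queries, targets):
        sub_cost = [[cost[q][t] for t in targets] for q in queries]
        return [(queries[r], targets[c]) for r, c in best_assignment(sub_cost)]

    return match_group(seen_queries, seen_targets), match_group(cand_queries, cand_targets)


# Hypothetical costs: 4 queries (rows) x 3 targets (columns).
# Targets 0 and 1 are annotated seen-class masks; target 2 is a pseudo mask.
cost = [
    [0.1, 0.9, 0.5],
    [0.8, 0.2, 0.5],
    [0.5, 0.5, 0.7],
    [0.5, 0.5, 0.3],
]
seen_pairs, cand_pairs = split_match(
    cost, seen_queries=[0, 1], cand_queries=[2, 3],
    seen_targets=[0, 1], cand_targets=[2],
)
```

Because the two groups never compete for the same targets, a candidate query cannot be pushed to the background class merely because all annotated targets were taken by seen queries.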

[40] xTrace: A Facial Expressive Behaviour Analysis Tool for Continuous Affect Recognition

Mani Kumar Tellamekala,Shashank Jaiswal,Thomas Smith,Timur Alamev,Gary McKeown,Anthony Brown,Michel Valstar

Main category: cs.CV

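TL;DR: This paper introduces xTrace, a tool that analyses facial expressive behaviour and predicts dimensional emotions (valence and arousal), addressing the shortage of large-scale labelled datasets and the difficulty of efficient feature extraction.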

Details

Motivation: Building a system that analyses facial expressive behaviour in real time in naturalistic settings faces two main challenges: the lack of large-scale labelled datasets and the difficulty of extracting efficient and robust facial features. Method: xTrace is trained on the largest facial affect video dataset, containing about 450k videos, and uses explainable and efficient facial affect descriptors. Result: On a validation set of 50k videos, xTrace achieves a mean CCC of 0.86 and a mean absolute error of 0.13, outperforming existing tools. Conclusion: xTrace performs well in affect recognition, robustness, and uncertainty estimation, and suits the analysis of a wide range of naturalistic expressions. Abstract: Recognising expressive behaviours in face videos is a long-standing challenge in Affective Computing. Despite significant advancements in recent years, it remains a challenge to build a robust and reliable system for naturalistic and in-the-wild facial expressive behaviour analysis in real time. This paper addresses two key challenges in building such a system: (1) the paucity of large-scale labelled facial affect video datasets with extensive coverage of the 2D emotion space, and (2) the difficulty of extracting facial video features that are discriminative, interpretable, robust, and computationally efficient. Toward addressing these challenges, we introduce xTrace, a robust tool for facial expressive behaviour analysis and predicting continuous values of dimensional emotions, namely valence and arousal, from in-the-wild face videos. To address challenge (1), our affect recognition model is trained on the largest facial affect video data set, containing ~450k videos that cover most emotion zones in the dimensional emotion space, making xTrace highly versatile in analysing a wide spectrum of naturalistic expressive behaviours. To address challenge (2), xTrace uses facial affect descriptors that are not only explainable, but can also achieve a high degree of accuracy and robustness with low computational complexity. The key components of xTrace are benchmarked against three existing tools: MediaPipe, OpenFace, and Augsburg Affect Toolbox. On an in-the-wild validation set composed of 50k videos, xTrace achieves 0.86 mean CCC and 0.13 mean absolute error values. We present a detailed error analysis of affect predictions from xTrace, illustrating (a) its ability to recognise emotions with high accuracy across most bins in the 2D emotion space, (b) its robustness to non-frontal head pose angles, and (c) a strong correlation between its uncertainty estimates and its accuracy.
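The two reported metrics can be made concrete with a short sketch. The concordance correlation coefficient (CCC) is computed here with the standard population-variance definition, and the sample values are invented for illustration; how the paper averages the metric over videos and over valence and arousal is not stated in the abstract.

```python
def concordance_cc(pred, target):
    """CCC = 2*cov / (var_pred + var_target + (mean_pred - mean_target)^2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    return 2.0 * cov / (var_p + var_t + (mean_p - mean_t) ** 2)


def mean_absolute_error(pred, target):
    """Average absolute difference between predictions and targets."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


# Hypothetical valence predictions that track the target with a constant offset.
target = [0.5, 1.5]
pred = [0.0, 1.0]
ccc = concordance_cc(pred, target)
mae = mean_absolute_error(pred, target)
```

Unlike Pearson correlation, which would be 1 for this offset prediction, CCC penalizes the shift in mean, which is why it is commonly preferred for continuous valence and arousal.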

[41] UncertainSAM: Fast and Efficient Uncertainty Quantification of the Segment Anything Model

Timo Kaiser,Thomas Norrenbrock,Bodo Rosenhahn

Main category: cs.CV

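TL;DR: This paper proposes USAM, an uncertainty quantification model based on a Bayesian entropy formulation, to quantify the uncertainty of SAM; it performs well across several datasets.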

Details

Motivation: Because SAM is class-agnostic, conventional uncertainty quantification methods are hard to apply, so a new approach is needed to quantify its uncertainty. Method: USAM is a lightweight post-hoc method based on a Bayesian entropy formulation that jointly accounts for aleatoric, epistemic, and task uncertainty. Result: USAM shows superior predictive capabilities on the SA-V, MOSE, ADE20k, DAVIS, and COCO datasets. Conclusion: USAM is a computationally cheap, easy-to-use uncertainty quantification alternative that can support user prompting, semi-supervised pipelines, and the tradeoff between accuracy and cost efficiency. Abstract: The introduction of the Segment Anything Model (SAM) has paved the way for numerous semantic segmentation applications. For several tasks, quantifying the uncertainty of SAM is of particular interest. However, the ambiguous nature of the class-agnostic foundation model SAM challenges current uncertainty quantification (UQ) approaches. This paper presents a theoretically motivated uncertainty quantification model based on a Bayesian entropy formulation jointly respecting aleatoric, epistemic, and the newly introduced task uncertainty. We use this formulation to train USAM, a lightweight post-hoc UQ method. Our model traces the root of uncertainty back to under-parameterised models, insufficient prompts or image ambiguities. Our proposed deterministic USAM demonstrates superior predictive capabilities on the SA-V, MOSE, ADE20k, DAVIS, and COCO datasets, offering a computationally cheap and easy-to-use UQ alternative that can support user-prompting, enhance semi-supervised pipelines, or balance the tradeoff between accuracy and cost efficiency.

[42] ULFine: Unbiased Lightweight Fine-tuning for Foundation-Model-Assisted Long-Tailed Semi-Supervised Learning

Enhao Zhang,Chaohua Li,Chuanxing Geng,Songcan Chen

Main category: cs.CV

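TL;DR: The paper explores how visual foundation models such as CLIP affect Long-Tailed Semi-Supervised Learning (LTSSL) using three strategies (LP, LFT, FFT), finding that FFT degrades performance while LP and LFT bring little benefit to tail classes. The authors propose ULFine, which markedly reduces training cost and improves performance.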

Details

Motivation: Studies the use of visual foundation models in long-tailed semi-supervised learning and addresses the weak effect of existing strategies on tail classes. Method: Analyzes the impact of foundation models on LTSSL with three strategies (LP, LFT, FFT) and proposes ULFine, which reduces bias through adaptive fitting and dual-logit fusion. Result: FFT degrades performance, and LP and LFT bring little benefit to tail classes; ULFine markedly reduces training cost and improves prediction accuracy. Conclusion: ULFine effectively mitigates bias in long-tailed semi-supervised learning and markedly improves model performance. Abstract: Based on the success of large-scale visual foundation models like CLIP in various downstream tasks, this paper makes an initial attempt to explore their impact on Long-Tailed Semi-Supervised Learning (LTSSL) by employing the foundation model with three strategies: Linear Probing (LP), Lightweight Fine-Tuning (LFT), and Full Fine-Tuning (FFT). Our analysis presents the following insights: i) Compared to LTSSL algorithms trained from scratch, FFT results in a decline in model performance, whereas LP and LFT, although boosting overall model performance, exhibit negligible benefits to tail classes. ii) LP produces numerous false pseudo-labels due to underlearned training data, while LFT can reduce the number of these false labels but becomes overconfident about them owing to biased fitting of the training data. This exacerbates the pseudo-labeled and classifier biases inherent in LTSSL, limiting performance improvement in the tail classes. With these insights, we propose an Unbiased Lightweight Fine-tuning strategy, ULFine, which mitigates the overconfidence via confidence-aware adaptive fitting of textual prototypes and counteracts the pseudo-labeled and classifier biases via complementary fusion of dual logits. Extensive experiments demonstrate that ULFine reduces training costs by more than ten times and substantially increases prediction accuracies compared to state-of-the-art methods.

[43] FG-CLIP: Fine-Grained Visual and Textual Alignment

Chunyu Xie,Bin Wang,Fanjing Kong,Jincheng Li,Dawei Liang,Gengshen Zhang,Dawei Leng,Yuhui Yin

Main category: cs.CV

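TL;DR: FG-CLIP improves CLIP on fine-grained understanding tasks by generating long caption-image pairs, building a high-quality dataset, and introducing hard negative samples.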

Details

Motivation: CLIP excels in multimodal tasks but is limited in fine-grained understanding because it relies on coarse-grained captions. Method: (1) Generate 1.6 billion long caption-image pairs; (2) build a high-quality dataset of 12 million images with 40 million region annotations; (3) introduce 10 million hard negative samples. Result: FG-CLIP outperforms CLIP and other state-of-the-art methods on fine-grained understanding, open-vocabulary object detection, and other tasks. Conclusion: FG-CLIP effectively improves fine-grained understanding; the data, code, and models are publicly available. Abstract: Contrastive Language-Image Pre-training (CLIP) excels in multimodal tasks such as image-text retrieval and zero-shot classification but struggles with fine-grained understanding due to its focus on coarse-grained short captions. To address this, we propose Fine-Grained CLIP (FG-CLIP), which enhances fine-grained understanding through three key innovations. First, we leverage large multimodal models to generate 1.6 billion long caption-image pairs for capturing global-level semantic details. Second, a high-quality dataset is constructed with 12 million images and 40 million region-specific bounding boxes aligned with detailed captions to ensure precise, context-rich representations. Third, 10 million hard fine-grained negative samples are incorporated to improve the model's ability to distinguish subtle semantic differences. Corresponding training methods are meticulously designed for these data. Extensive experiments demonstrate that FG-CLIP outperforms the original CLIP and other state-of-the-art methods across various downstream tasks, including fine-grained understanding, open-vocabulary object detection, image-text retrieval, and general multimodal benchmarks. These results highlight FG-CLIP's effectiveness in capturing fine-grained image details and improving overall model performance. The related data, code, and models are available at https://github.com/360CVGroup/FG-CLIP.

[44] Visual Affordances: Enabling Robots to Understand Object Functionality

Tommaso Apicella,Alessio Xompero,Andrea Cavallaro

Main category: cs.CV

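TL;DR: The paper proposes a unified formulation for visual affordance prediction, addressing reproducibility problems caused by inconsistent task definitions, and introduces the Affordance Sheet to improve transparency.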

Details

Motivation: Existing affordance prediction methods for human-robot interaction define the task inconsistently, which makes comparative benchmarks unfair and unreliable; a unified formulation and a transparent methodology are needed. Method: Proposes a unified formulation for visual affordance prediction, systematically reviews existing methods and datasets, and introduces the Affordance Sheet to document the solution, the datasets, and the validation. Result: By linking visual affordance prediction to the physical world (for example, object weight), the framework bridges the gap between perception and robot actuation. Conclusion: The framework offers a transparent and reproducible approach to affordance prediction and accounts for complete information about objects and how the robot interacts with them. Abstract: Human-robot interaction for assistive technologies relies on the prediction of affordances, which are the potential actions a robot can perform on objects. Predicting object affordances from visual perception is formulated differently for tasks such as grasping detection, affordance classification, affordance segmentation, and hand-object interaction synthesis. In this work, we highlight the reproducibility issue in these redefinitions, making comparative benchmarks unfair and unreliable. To address this problem, we propose a unified formulation for visual affordance prediction, provide a comprehensive and systematic review of previous works highlighting strengths and limitations of methods and datasets, and analyse what challenges reproducibility. To favour transparency, we introduce the Affordance Sheet, a document to detail the proposed solution, the datasets, and the validation. As the physical properties of an object influence the interaction with the robot, we present a generic framework that links visual affordance prediction to the physical world. Using the weight of an object as an example for this framework, we discuss how estimating object mass can affect the affordance prediction. Our approach bridges the gap between affordance perception and robot actuation, and accounts for the complete information about objects of interest and how the robot interacts with them to accomplish its task.

[45] PIDiff: Image Customization for Personalized Identities with Diffusion Models

Jinyu Gu,Haipeng Liu,Meng Wang,Yang Wang

Main category: cs.CV

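TL;DR: PIDiff is a fine-tuning-based diffusion model for personalized text-to-image generation that uses the W+ space and an identity-tailored fine-tuning strategy to avoid semantic entanglement and achieve accurate feature extraction and localization.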

Details

Motivation: When incorporating a specific identity into an image, existing methods fail to disentangle identity and background information, so generated images lose key identity characteristics and show reduced diversity. Method: Combines the W+ space with a diffusion model and uses an identity-tailored fine-tuning strategy and a cross-attention block to avoid semantic interference while retaining the generation capability of the pre-trained model. Result: Experiments validate the effectiveness of PIDiff in preserving identity information and generation diversity. Conclusion: By improving feature extraction and localization, PIDiff resolves the semantic interference between identity and background information and improves generation quality. Abstract: Text-to-image generation for personalized identities aims at incorporating the specific identity into images using a text prompt and an identity image. Based on the powerful generative capabilities of DDPMs, many previous works adopt additional prompts, such as text embeddings and CLIP image embeddings, to represent the identity information, while they fail to disentangle the identity information and background information. As a result, the generated images not only lose key identity characteristics but also suffer from significantly reduced diversity. To address this issue, previous works have combined the W+ space from StyleGAN with diffusion models, leveraging this space to provide a more accurate and comprehensive representation of identity features through multi-level feature extraction. However, the entanglement of identity and background information in in-the-wild images during training prevents accurate identity localization, resulting in severe semantic interference between identity and background. In this paper, we propose a novel fine-tuning-based diffusion model for personalized-identity text-to-image generation, named PIDiff, which leverages the W+ space and an identity-tailored fine-tuning strategy to avoid semantic entanglement and achieves accurate feature extraction and localization. Style editing can also be achieved by PIDiff through preserving the characteristics of identity features in the W+ space, which vary from coarse to fine. Through the combination of the proposed cross-attention block and parameter optimization strategy, PIDiff preserves the identity information and maintains the generation capability for in-the-wild images of the pre-trained model during inference. Our experimental results validate the effectiveness of our method in this task.

[46] Nonlinear Motion-Guided and Spatio-Temporal Aware Network for Unsupervised Event-Based Optical Flow

Zuntao Liu,Hao Zhuang,Junjie Jiang,Yuhang Song,Zheng Fang

Main category: cs.CV

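TL;DR: Proposes E-NMSTFlow, an unsupervised event-based optical flow network focused on long-time sequences, which improves performance by exploiting rich spatio-temporal information and accurate nonlinear motion.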

Details

Motivation: Most existing event-based optical flow methods adopt frame-based techniques that ignore the spatio-temporal characteristics of events, and they assume linear motion between events, which increases errors in long-time sequences. Method: Designs a Spatio-Temporal Motion Feature Aware (STMFA) module and an Adaptive Motion Feature Enhancement (AMFE) module, and introduces a nonlinear motion compensation loss. Result: The method ranks first among unsupervised learning methods on the MVSEC and DSEC-Flow datasets. Conclusion: By fully exploiting spatio-temporal information and accurate nonlinear motion, E-NMSTFlow markedly improves the accuracy of event-based optical flow estimation. Abstract: Event cameras have the potential to capture continuous motion information over time and space, making them well-suited for optical flow estimation. However, most existing learning-based methods for event-based optical flow adopt frame-based techniques, ignoring the spatio-temporal characteristics of events. Additionally, these methods assume linear motion between consecutive events within the loss time window, which increases optical flow errors in long-time sequences. In this work, we observe that rich spatio-temporal information and accurate nonlinear motion between events are crucial for event-based optical flow estimation. Therefore, we propose E-NMSTFlow, a novel unsupervised event-based optical flow network focusing on long-time sequences. We propose a Spatio-Temporal Motion Feature Aware (STMFA) module and an Adaptive Motion Feature Enhancement (AMFE) module, both of which utilize rich spatio-temporal information to learn spatio-temporal data associations. Meanwhile, we propose a nonlinear motion compensation loss that utilizes the accurate nonlinear motion between events to improve the unsupervised learning of our network. Extensive experiments demonstrate the effectiveness and superiority of our method. Remarkably, our method ranks first among unsupervised learning methods on the MVSEC and DSEC-Flow datasets. Our project page is available at https://wynelio.github.io/E-NMSTFlow.

[47] DispBench: Benchmarking Disparity Estimation to Synthetic Corruptions

Shashank Agnihotri,Amaan Ansari,Annika Dackermann,Fabian Rösch,Margret Keuper

Main category: cs.CV

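TL;DR: The paper introduces DispBench, a comprehensive benchmarking tool for systematically assessing the reliability of disparity estimation methods, filling the gap left by the absence of a standardized evaluation in this area.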

Details

Motivation: Deep learning performs well on disparity estimation, but its susceptibility to distribution shifts and adversarial attacks raises concerns about reliability and generalization, and the lack of a standardized benchmark hinders progress. Method: DispBench evaluates the robustness of disparity estimation methods against synthetic image corruptions (such as adversarial attacks and distribution shifts) across multiple datasets and diverse corruption scenarios. Result: The authors conduct the most extensive performance and robustness analysis of disparity estimation methods to date, uncovering key correlations between accuracy, reliability, and generalization. Conclusion: DispBench provides a standardized tool for evaluating the robustness of disparity estimation methods and helps advance the field. Abstract: Deep learning (DL) has surpassed human performance on standard benchmarks, driving its widespread adoption in computer vision tasks. One such task is disparity estimation, estimating the disparity between matching pixels in stereo image pairs, which is crucial for safety-critical applications like medical surgeries and autonomous navigation. However, DL-based disparity estimation methods are highly susceptible to distribution shifts and adversarial attacks, raising concerns about their reliability and generalization. Despite these concerns, a standardized benchmark for evaluating the robustness of disparity estimation methods remains absent, hindering progress in the field. To address this gap, we introduce DispBench, a comprehensive benchmarking tool for systematically assessing the reliability of disparity estimation methods. DispBench evaluates robustness against synthetic image corruptions such as adversarial attacks and out-of-distribution shifts caused by 2D Common Corruptions across multiple datasets and diverse corruption scenarios. We conduct the most extensive performance and robustness analysis of disparity estimation methods to date, uncovering key correlations between accuracy, reliability, and generalization. Open-source code for DispBench: https://github.com/shashankskagnihotri/benchmarking_robustness/tree/disparity_estimation/final/disparity_estimation

[48] MDE-Edit: Masked Dual-Editing for Multi-Object Image Editing via Diffusion Models

Hongyang Zhu,Haipeng Liu,Bo Fu,Yang Wang

Main category: cs.CV

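TL;DR: MDE-Edit is a training-free method that optimizes the noise latent feature in diffusion models with a dual-loss design, enabling precise editing in complex multi-object scenes.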

Details

Motivation: Multi-object editing in complex scenes suffers from inaccurate localization and attribute-object mismatch, which existing methods struggle to resolve. Method: MDE-Edit optimizes the noise latent feature with an Object Alignment Loss (OAL) and a Color Consistency Loss (CCL) to keep edits localized and coherent. Result: Experiments show that MDE-Edit outperforms existing methods in editing accuracy and visual quality. Conclusion: MDE-Edit offers a robust solution for complex multi-object image editing tasks. Abstract: Multi-object editing aims to modify multiple objects or regions in complex scenes while preserving structural coherence. This task faces significant challenges in scenarios involving overlapping or interacting objects: (1) Inaccurate localization of target objects due to attention misalignment, leading to incomplete or misplaced edits; (2) Attribute-object mismatch, where color or texture changes fail to align with intended regions due to cross-attention leakage, creating semantic conflicts (e.g., color bleeding into non-target areas). Existing methods struggle with these challenges: approaches relying on global cross-attention mechanisms suffer from attention dilution and spatial interference between objects, while mask-based methods fail to bind attributes to geometrically accurate regions due to feature entanglement in multi-object scenarios. To address these limitations, we propose a training-free, inference-stage optimization approach that enables precise localized image manipulation in complex multi-object scenes, named MDE-Edit. MDE-Edit optimizes the noise latent feature in diffusion models via two key losses: Object Alignment Loss (OAL) aligns multi-layer cross-attention with segmentation masks for precise object positioning, and Color Consistency Loss (CCL) amplifies target attribute attention within masks while suppressing leakage to adjacent regions. This dual-loss design ensures localized and coherent multi-object edits. Extensive experiments demonstrate that MDE-Edit outperforms state-of-the-art methods in editing accuracy and visual quality, offering a robust solution for complex multi-object image manipulation tasks.

[49] Automated vision-based assistance tools in bronchoscopy: stenosis severity estimation

Clara Tomasini,Javier Rodriguez-Puigvert,Dinora Polanco,Manuel Viñuales,Luis Riazuelo,Ana Cristina Murillo

Main category: cs.CV

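TL;DR: Proposes an automated method for estimating subglottic stenosis severity from bronchoscopy images, without relying on CT scans, reducing subjectivity and radiation exposure.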

Details

Motivation: Subglottic stenosis is currently assessed by subjective visual inspection or CT scans, and consistent, automated methods are lacking. Method: Exploits the illumination decline effect in bronchoscopy images to segment and track the airway lumen and build a 3D model used to measure the narrowing. Result: The method is validated on real bronchoscopy data; its results are consistent with CT scans and expert estimations and are repeatable. Conclusion: The method automates the assessment, shortens diagnosis time, reduces radiation exposure, and comes with the first public dataset. Abstract: Purpose: Subglottic stenosis refers to the narrowing of the subglottis, the airway between the vocal cords and the trachea. Its severity is typically evaluated by estimating the percentage of obstructed airway. This estimation can be obtained from CT data or through visual inspection by experts exploring the region. However, visual inspections are inherently subjective, leading to less consistent and robust diagnoses. No public methods or datasets are currently available for automated evaluation of this condition from bronchoscopy video. Methods: We propose a pipeline for automated subglottic stenosis severity estimation during the bronchoscopy exploration, without requiring the physician to traverse the stenosed region. Our approach exploits the physical effect of illumination decline in endoscopy to segment and track the lumen and obtain a 3D model of the airway. This 3D model is obtained from a single frame and is used to measure the airway narrowing. Results: Our pipeline is the first to enable automated and robust subglottic stenosis severity measurement using bronchoscopy images. The results show consistency with ground-truth estimations from CT scans and expert estimations, and reliable repeatability across multiple estimations on the same patient. Our evaluation is performed on our new Subglottic Stenosis Dataset of real bronchoscopy procedure data. Conclusion: We demonstrate how to automate evaluation of subglottic stenosis severity using only bronchoscopy. Our approach can assist with and shorten diagnosis and monitoring procedures, with automated and repeatable estimations and less exploration time, and spare patients radiation exposure as no CT is required. Additionally, we release the first public benchmark for subglottic stenosis severity assessment.
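The severity measure mentioned in the abstract, the percentage of obstructed airway, can be written as a simple ratio. The sketch below assumes severity is computed from two cross-sectional areas, one at the narrowest point and one at a healthy reference section; the function name, the area-based definition, and the sample values are assumptions, since the abstract does not specify how the measurement is taken from the 3D model.

```python
def obstruction_percent(stenotic_area, reference_area):
    """Percentage of the reference airway cross-section that is obstructed."""
    if reference_area <= 0:
        raise ValueError("reference_area must be positive")
    if not 0 <= stenotic_area <= reference_area:
        raise ValueError("stenotic_area must lie between 0 and reference_area")
    return 100.0 * (1.0 - stenotic_area / reference_area)


# Hypothetical cross-sectional areas in square millimetres.
severity = obstruction_percent(stenotic_area=30.0, reference_area=120.0)
```

Because the result is a ratio of two areas from the same reconstruction, it does not depend on the absolute scale of the 3D model, only on the two sections being measured consistently.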

[50] Probabilistic Embeddings for Frozen Vision-Language Models: Uncertainty Quantification with Gaussian Process Latent Variable Models

Aishwarya Venkataramanan,Paul Bodesheim,Joachim Denzler

Main category: cs.CV

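TL;DR: GroVE is a post-hoc method that obtains probabilistic embeddings from frozen vision-language models (VLMs) using a Gaussian Process Latent Variable Model (GPLVM), addressing the inability of deterministic embeddings to capture uncertainty.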

Details

Motivation: Deterministic embeddings from standard VLMs struggle with ambiguity in visual and textual descriptions and with many-to-many correspondences; existing approaches require large training datasets and do not leverage the strong representations of pre-trained VLMs. Method: GroVE builds on GPLVM to learn a shared low-dimensional latent space, optimized through single-modal embedding reconstruction and cross-modal alignment objectives, and generates uncertainty-aware probabilistic embeddings. Result: GroVE achieves state-of-the-art uncertainty calibration on tasks including cross-modal retrieval, visual question answering, and active learning. Conclusion: GroVE is an efficient post-hoc method that produces high-quality probabilistic embeddings without retraining the VLM. Abstract: Vision-Language Models (VLMs) learn joint representations by mapping images and text into a shared latent space. However, recent research highlights that deterministic embeddings from standard VLMs often struggle to capture the uncertainties arising from the ambiguities in visual and textual descriptions and the multiple possible correspondences between images and texts. Existing approaches tackle this by learning probabilistic embeddings during VLM training, which demands large datasets and does not leverage the powerful representations already learned by large-scale VLMs like CLIP. In this paper, we propose GroVE, a post-hoc approach to obtaining probabilistic embeddings from frozen VLMs. GroVE builds on Gaussian Process Latent Variable Model (GPLVM) to learn a shared low-dimensional latent space where image and text inputs are mapped to a unified representation, optimized through single-modal embedding reconstruction and cross-modal alignment objectives. Once trained, the Gaussian Process model generates uncertainty-aware probabilistic embeddings. Evaluation shows that GroVE achieves state-of-the-art uncertainty calibration across multiple downstream tasks, including cross-modal retrieval, visual question answering, and active learning.

[51] PaniCar: Securing the Perception of Advanced Driving Assistance Systems Against Emergency Vehicle Lighting

Elad Feldman,Jacob Shams,Dudi Biton,Alfred Chen,Shaoyuan Xie,Satoru Koda,Yisroel Mirsky,Asaf Shabtai,Yuval Elovici,Ben Nassi

Main category: cs.CV

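TL;DR: The study finds that strong light sources such as emergency vehicle lighting degrade object detection in autonomous vehicles, and proposes a new framework, Caracetamol, to stabilize detection.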

Details

Motivation: Object detection in autonomous vehicles degrades under emergency vehicle lighting, which creates a safety hazard that calls for a solution. Method: The authors evaluate multiple ADASs, object detectors, and emergency lighting patterns, and propose the Caracetamol framework. Result: Caracetamol substantially improves detection confidence and stability while meeting real-time processing requirements. Conclusion: Caracetamol effectively mitigates the interference of emergency vehicle lighting with object detection and improves driving safety. Abstract: The safety of autonomous cars has come under scrutiny in recent years, especially after 16 documented incidents involving Teslas (with autopilot engaged) crashing into parked emergency vehicles (police cars, ambulances, and firetrucks). While previous studies have revealed that strong light sources often introduce flare artifacts in the captured image, which degrade the image quality, the impact of flare on object detection performance remains unclear. In this research, we unveil PaniCar, a digital phenomenon that causes an object detector's confidence score to fluctuate below detection thresholds when exposed to activated emergency vehicle lighting. This vulnerability poses a significant safety risk, and can cause autonomous vehicles to fail to detect objects near emergency vehicles. In addition, this vulnerability could be exploited by adversaries to compromise the security of advanced driving assistance systems (ADASs). We assess seven commercial ADASs (Tesla Model 3, "manufacturer C", HP, Pelsee, AZDOME, Imagebon, Rexing), four object detectors (YOLO, SSD, RetinaNet, Faster R-CNN), and 14 patterns of emergency vehicle lighting to understand the influence of various technical and environmental factors. We also evaluate four SOTA flare removal methods and show that their performance and latency are insufficient for real-time driving constraints. To mitigate this risk, we propose Caracetamol, a robust framework designed to enhance the resilience of object detectors against the effects of activated emergency vehicle lighting. Our evaluation shows that on YOLOv3 and Faster R-CNN, Caracetamol improves the models' average confidence of car detection by 0.20, the lower confidence bound by 0.33, and reduces the fluctuation range by 0.33. 
In addition, Caracetamol is capable of processing frames at a rate of between 30-50 FPS, enabling real-time ADAS car detection.
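
The three quantities reported for Caracetamol (average confidence, lower confidence bound, fluctuation range) can be illustrated on a per-frame confidence series. The following is a minimal sketch, assuming that the lower bound is the minimum score and the fluctuation range is the maximum minus the minimum over a clip. The abstract does not define these terms, so the definitions, the function name `confidence_stats`, and the sample scores are all assumptions for illustration.

```python
from statistics import mean


def confidence_stats(scores, threshold=0.5):
    """Summarize per-frame detection confidences for one tracked object.

    Assumed definitions (not taken from the paper): the lower bound is the
    minimum score and the fluctuation range is max - min.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    lo, hi = min(scores), max(scores)
    return {
        "mean": mean(scores),
        "lower_bound": lo,
        "range": hi - lo,
        # Fraction of frames in which the score falls below the threshold,
        # i.e. frames where the object would not be reported.
        "missed": sum(s < threshold for s in scores) / len(scores),
    }


# Hypothetical confidences for a car parked next to flashing lights.
stats = confidence_stats([0.9, 0.3, 0.8, 0.4])
```

A detector whose score oscillates like this can have an acceptable mean while still dropping the object in half of the frames, which is why the lower bound and range are reported alongside the average.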

[52] Biomed-DPT: Dual Modality Prompt Tuning for Biomedical Vision-Language Models

Wei Peng,Kang Liu,Jianchen Hu,Meng Zhang

Main category: cs.CV

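TL;DR: Biomed-DPT is a knowledge-enhanced dual-modality prompt tuning technique that combines text and vision prompts for biomedical image classification and outperforms existing methods in few-shot settings.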

Details

Motivation: Current prompt learning methods rely mainly on text prompts and ignore the particular structures of biomedical images, such as complex anatomy and subtle pathological features; Biomed-DPT aims to fill this gap with dual-modality prompts. Method: Biomed-DPT designs text prompts (clinical templates and domain-adapted prompts) and extracts clinical knowledge through knowledge distillation; for the vision prompt it introduces a zero vector as a soft prompt to re-weight attention. Result: Average accuracy of 66.14% across 11 biomedical image datasets, with 78.06% on base classes and 75.97% on novel classes, surpassing CoOp. Conclusion: Dual-modality prompts and knowledge enhancement substantially improve biomedical image classification and offer an effective approach to few-shot learning. Abstract: Prompt learning is one of the most effective paradigms for adapting pre-trained vision-language models (VLMs) to biomedical image classification tasks in few-shot scenarios. However, most current prompt learning methods use only text prompts and ignore the particular structures (such as the complex anatomical structures and subtle pathological features) in biomedical images. In this work, we propose Biomed-DPT, a knowledge-enhanced dual modality prompt tuning technique. In designing the text prompt, Biomed-DPT constructs a dual prompt including the template-driven clinical prompts and the large language model (LLM)-driven domain-adapted prompts, then extracts the clinical knowledge from the domain-adapted prompts through the knowledge distillation technique. In designing the vision prompt, Biomed-DPT introduces the zero vector as a soft prompt to leverage attention re-weighting so that the focus on non-diagnostic regions and the recognition of non-critical pathological features are avoided. Biomed-DPT achieves an average classification accuracy of 66.14% across 11 biomedical image datasets covering 9 modalities and 10 organs, with performance reaching 78.06% in base classes and 75.97% in novel classes, surpassing the Context Optimization (CoOp) method by 6.20%, 3.78%, and 8.04%, respectively. Our code is available at https://github.com/Kanyooo/Biomed-DPT.

Haizhen Xie,Kunpeng Du,Qiangyu Yan,Sen Lu,Jianhong Han,Hanting Chen,Hailin Hu,Jie Hu

Main category: cs.CV

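TL;DR: EAM is a new DiT-based blind super-resolution (BSR) method that introduces the Ψ-DiT block and a progressive masked image modeling strategy, substantially improving image restoration and achieving state-of-the-art results on multiple datasets.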

Details

Motivation: Guiding BSR with pre-trained T2I diffusion models has become the mainstream approach, but conventional U-Net architectures are limited in performance while DiT shows greater potential; EAM aims to improve BSR with DiT and new techniques. Method: The authors propose the Ψ-DiT block, which uses a triple-flow architecture with low-resolution latent control; a progressive masked image modeling strategy that lowers training cost; and a subject-aware prompt generation strategy that makes better use of T2I priors. Result: EAM performs strongly on multiple datasets, surpassing existing methods in both quantitative metrics and visual quality. Conclusion: By combining DiT with these techniques, EAM substantially improves BSR performance and provides an efficient solution for image restoration. Abstract: Utilizing pre-trained Text-to-Image (T2I) diffusion models to guide Blind Super-Resolution (BSR) has become a predominant approach in the field. While T2I models have traditionally relied on U-Net architectures, recent advancements have demonstrated that Diffusion Transformers (DiT) achieve significantly higher performance in this domain. In this work, we introduce Enhancing Anything Model (EAM), a novel BSR method that leverages DiT and outperforms previous U-Net-based approaches. We introduce a novel block, $\Psi$-DiT, which effectively guides the DiT to enhance image restoration. This block employs a low-resolution latent as a separable flow injection control, forming a triple-flow architecture that effectively leverages the prior knowledge embedded in the pre-trained DiT. To fully exploit the prior guidance capabilities of T2I models and enhance their generalization in BSR, we introduce a progressive Masked Image Modeling strategy, which also reduces training costs. Additionally, we propose a subject-aware prompt generation strategy that employs a robust multi-modal model in an in-context learning framework. This strategy automatically identifies key image areas, provides detailed descriptions, and optimizes the utilization of T2I diffusion priors. Our experiments demonstrate that EAM achieves state-of-the-art results across multiple datasets, outperforming existing methods in both quantitative metrics and visual quality.

[54] HQC-NBV: A Hybrid Quantum-Classical View Planning Approach

Xiaotong Yu,Chang Wen Chen

Main category: cs.CV

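TL;DR: HQC-NBV is a hybrid quantum-classical framework for efficient camera view planning that improves exploration efficiency and computational scalability in complex environments.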

Details

Motivation: Classical methods fall short in computational scalability and solution optimality in complex scenes. Method: A hybrid quantum-classical framework built on a Hamiltonian formulation and a parameter-centric variational ansatz, using quantum properties to explore the parameter space. Result: Experiments show up to 49.2% higher exploration efficiency than classical methods. Conclusion: The work is a notable step toward integrating quantum computing into robotic perception systems and offers a paradigm-shifting solution for robot vision tasks. Abstract: Efficient view planning is a fundamental challenge in computer vision and robotic perception, critical for tasks ranging from search and rescue operations to autonomous navigation. While classical approaches, including sampling-based and deterministic methods, have shown promise in planning camera viewpoints for scene exploration, they often struggle with computational scalability and solution optimality in complex settings. This study introduces HQC-NBV, a hybrid quantum-classical framework for view planning that leverages quantum properties to efficiently explore the parameter space while maintaining robustness and scalability. We propose a specific Hamiltonian formulation with multi-component cost terms and a parameter-centric variational ansatz with bidirectional alternating entanglement patterns that capture the hierarchical dependencies between viewpoint parameters. Comprehensive experiments demonstrate that quantum-specific components provide measurable performance advantages. Compared to the classical methods, our approach achieves up to 49.2% higher exploration efficiency across diverse environments. Our analysis of entanglement architecture and coherence-preserving terms provides insights into the mechanisms of quantum advantage in robotic exploration tasks. This work represents a significant advancement in integrating quantum computing into robotic perception systems, offering a paradigm-shifting solution for various robot vision tasks.

[55] Diffusion Model Quantization: A Review

Qian Zeng,Chenggong Hu,Mingli Song,Jie Song

Main category: cs.CV

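TL;DR: This survey reviews recent progress in diffusion model quantization, covering quantization techniques, challenges, and future research directions.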

Details

Motivation: Quantization is a key technique for deploying diffusion models efficiently on resource-constrained edge devices. Method: The survey presents a taxonomy of diffusion model quantization techniques, analyzes their principles, and compares representative methods qualitatively and quantitatively. Result: Dataset benchmarks and visual analysis are used to assess quantization quality and its effects. Conclusion: The authors outline future research directions for quantizing generative models in practical applications and provide links to related resources. Abstract: Recent success of large text-to-image models has empirically underscored the exceptional performance of diffusion models in generative tasks. To facilitate their efficient deployment on resource-constrained edge devices, model quantization has emerged as a pivotal technique for both compression and acceleration. This survey offers a thorough review of the latest advancements in diffusion model quantization, encapsulating and analyzing the current state of the art in this rapidly advancing domain. First, we provide an overview of the key challenges encountered in the quantization of diffusion models, including those based on U-Net architectures and Diffusion Transformers (DiT). We then present a comprehensive taxonomy of prevalent quantization techniques, engaging in an in-depth discussion of their underlying principles. Subsequently, we perform a meticulous analysis of representative diffusion model quantization schemes from both qualitative and quantitative perspectives. From a quantitative standpoint, we rigorously benchmark a variety of methods using widely recognized datasets, delivering an extensive evaluation of the most recent and impactful research in the field. From a qualitative standpoint, we categorize and synthesize the effects of quantization errors, elucidating these impacts through both visual analysis and trajectory examination. In conclusion, we outline prospective avenues for future research, proposing novel directions for the quantization of generative models in practical applications. The list of related papers, corresponding codes, pre-trained models and comparison results are publicly available at the survey project homepage https://github.com/TaylorJocelyn/Diffusion-Model-Quantization.
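
For readers unfamiliar with the basic operation the survey builds on, the sketch below shows plain asymmetric uniform quantization of a list of weights and the matching dequantization. It is a generic textbook scheme, not one of the diffusion-specific methods the survey benchmarks; the function names and sample weights are illustrative assumptions.

```python
def quantize(values, num_bits=8):
    """Asymmetric uniform quantization: map floats to ints in [0, 2^b - 1]."""
    qmax = (1 << num_bits) - 1
    lo, hi = min(values), max(values)
    # One quantization step; fall back to 1.0 for a constant input.
    scale = (hi - lo) / qmax if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Map integer codes back to approximate float values."""
    return [c * scale + lo for c in codes]


# Hypothetical weights quantized to 4 bits (16 levels).
weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
codes, scale, offset = quantize(weights, num_bits=4)
restored = dequantize(codes, scale, offset)

# The rounding error per value is at most half a quantization step.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
```

Lowering `num_bits` widens the step `scale` and therefore the worst-case error, which is the accuracy versus compression trade-off that quantization methods try to manage.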

[56] Does CLIP perceive art the same way we do?

Andrea Asperti,Leonardo Dessì,Maria Chiara Tonetti,Nico Wu

Main category: cs.CV

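TL;DR: The study examines how CLIP's understanding of paintings, both human-made and AI-generated, matches or differs from human perception, evaluates it along several dimensions such as content and style, and discusses its potential use in generative art.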

Details

Motivation: To explore whether CLIP can understand and interpret artworks the way humans do, especially when extracting high-level semantic and stylistic information. Method: Targeted probing tasks compare CLIP's responses with human annotations and expert benchmarks across content, scene understanding, artistic style, and other dimensions. Result: CLIP's visual representations show both strengths and limitations, particularly regarding aesthetic cues and artistic intent. Conclusion: Multimodal systems need deeper interpretability, especially in creative domains where nuance and subjectivity are central. Abstract: CLIP has emerged as a powerful multimodal model capable of connecting images and text through joint embeddings, but to what extent does it "see" the same way humans do - especially when interpreting artworks? In this paper, we investigate CLIP's ability to extract high-level semantic and stylistic information from paintings, including both human-created and AI-generated imagery. We evaluate its perception across multiple dimensions: content, scene understanding, artistic style, historical period, and the presence of visual deformations or artifacts. By designing targeted probing tasks and comparing CLIP's responses to human annotations and expert benchmarks, we explore its alignment with human perceptual and contextual understanding. Our findings reveal both strengths and limitations in CLIP's visual representations, particularly in relation to aesthetic cues and artistic intent. We further discuss the implications of these insights for using CLIP as a guidance mechanism during generative processes, such as style transfer or prompt-based image synthesis. Our work highlights the need for deeper interpretability in multimodal systems, especially when applied to creative domains where nuance and subjectivity play a central role.

[57] PADriver: Towards Personalized Autonomous Driving

Genghua Kou,Fan Jia,Weixin Mao,Yingfei Liu,Yucheng Zhao,Ziheng Zhang,Osamu Yoshie,Tiancai Wang,Ying Li,Xiangyu Zhang

Main category: cs.CV

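TL;DR: PADriver is a personalized autonomous driving framework built on a multimodal large language model; on the closed-loop benchmark PAD-Highway it outperforms existing methods.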

Details

Motivation: To address scene understanding, danger level estimation, and action decision in personalized autonomous driving. Method: A multimodal large language model processes streaming video frames and personalized textual prompts; a closed-loop benchmark, PAD-Highway, is built for evaluation. Result: PADriver outperforms existing methods on PAD-Highway and supports various driving modes. Conclusion: PADriver offers an efficient and flexible solution for personalized autonomous driving. Abstract: In this paper, we propose PADriver, a novel closed-loop framework for personalized autonomous driving (PAD). Built upon a Multi-modal Large Language Model (MLLM), PADriver takes streaming frames and personalized textual prompts as inputs. It autoregressively performs scene understanding, danger level estimation and action decision. The predicted danger level reflects the risk of the potential action and provides an explicit reference for the final action, which corresponds to the preset personalized prompt. Moreover, we construct a closed-loop benchmark named PAD-Highway based on the Highway-Env simulator to comprehensively evaluate the decision performance under traffic rules. The dataset contains 250 hours of video with high-quality annotation to facilitate the development of PAD behavior analysis. Experimental results on the constructed benchmark show that PADriver outperforms state-of-the-art approaches on different evaluation metrics, and enables various driving modes.

[58] PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes

Ahmed Abdelreheem,Filippo Aleotti,Jamie Watson,Zawar Qureshi,Abdelrahman Eldesokey,Peter Wonka,Gabriel Brostow,Sara Vicente,Guillermo Garcia-Hernando

Main category: cs.CV

TL;DR: The paper introduces a new task, language-guided object placement in 3D scenes, together with a benchmark and dataset.

Details

Motivation: To address object placement in 3D scenes from language prompts, a gap left by existing language-guided localization tasks. Method: The authors propose a new benchmark, evaluation protocol, and dataset, and develop a baseline method. Result: The task is challenging, and the benchmark can be used to evaluate generalist 3D LLMs. Conclusion: The task and benchmark could become an important tool for evaluating 3D LLMs. Abstract: We introduce the novel task of Language-Guided Object Placement in Real 3D Scenes. Our model is given a 3D scene's point cloud, a 3D asset, and a textual prompt broadly describing where the 3D asset should be placed. The task here is to find a valid placement for the 3D asset that respects the prompt. Compared with other language-guided localization tasks in 3D scenes such as grounding, this task has specific challenges: it is ambiguous because it has multiple valid solutions, and it requires reasoning about 3D geometric relationships and free space. We inaugurate this task by proposing a new benchmark and evaluation protocol. We also introduce a new dataset for training 3D LLMs on this task, as well as the first method to serve as a non-trivial baseline. We believe that this challenging task and our new benchmark could become part of the suite of benchmarks used to evaluate and compare generalist 3D LLM models.

[59] PRE-Mamba: A 4D State Space Model for Ultra-High-Frequent Event Camera Deraining

Ciyu Ruan,Ruishan Guo,Zihang Gong,Jingao Xu,Wenhan Yang,Xinlei Chen

Main category: cs.CV

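TL;DR: PRE-Mamba is a point-based event camera deraining framework that models dual temporal scales and multiple spatial scales, improving deraining while preserving high temporal precision.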

Details

Motivation: Event cameras suffer from dense noise in rain, and existing methods trade off temporal precision, deraining effectiveness, and computational efficiency. Method: The framework combines a 4D event cloud representation, a Spatio-Temporal Decoupling and Fusion module (STDF), and a Multi-Scale State Space Model (MS3M), with frequency-domain regularization. Result: Strong results on EventRain-27K (0.95 SR, 0.91 NR, 0.4s/M events) with only 0.26M parameters. Conclusion: PRE-Mamba generalizes well across rain intensities, viewpoints, and snowy conditions. Abstract: Event cameras excel in high temporal resolution and dynamic range but suffer from dense noise in rainy conditions. Existing event deraining methods face trade-offs between temporal precision, deraining effectiveness, and computational efficiency. In this paper, we propose PRE-Mamba, a novel point-based event camera deraining framework that fully exploits the spatiotemporal characteristics of raw events and rain. Our framework introduces a 4D event cloud representation that integrates dual temporal scales to preserve high temporal precision, a Spatio-Temporal Decoupling and Fusion module (STDF) that enhances deraining capability by enabling shallow decoupling and interaction of temporal and spatial information, and a Multi-Scale State Space Model (MS3M) that captures deeper rain dynamics across dual-temporal and multi-spatial scales with linear computational complexity. Enhanced by frequency-domain regularization, PRE-Mamba achieves superior performance (0.95 SR, 0.91 NR, and 0.4s/M events) with only 0.26M parameters on EventRain-27K, a comprehensive dataset with labeled synthetic and real-world sequences. Moreover, our method generalizes well across varying rain intensities, viewpoints, and even snowy conditions.

[60] Mapping User Trust in Vision Language Models: Research Landscape, Challenges, and Prospects

Agnese Chiatti,Sara Bernardini,Lara Shibelski Godoy Piccolo,Viola Schiaffonati,Matteo Matteucci

Main category: cs.CV

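TL;DR: This survey reviews trust dynamics in user interactions with vision language models (VLMs), proposes a multi-disciplinary taxonomy, and summarizes preliminary requirements for future research.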

Details

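Motivation: With VLMs being widely adopted, users need protection and guidance on when to trust these systems. Method: The survey reviews related studies through a multi-disciplinary taxonomy (cognitive science capabilities, collaboration modes, and agent behaviours) and combines them with findings from a user workshop. Result: It summarizes trust dynamics in user-VLM interactions and states preliminary requirements for future studies. Conclusion: Future research should track how user trust changes over time in order to improve VLM reliability and user acceptance. Abstract: The rapid adoption of Vision Language Models (VLMs), pre-trained on large image-text and video-text datasets, calls for protecting and informing users about when to trust these systems. This survey reviews studies on trust dynamics in user-VLM interactions, through a multi-disciplinary taxonomy encompassing different cognitive science capabilities, collaboration modes, and agent behaviours. Literature insights and findings from a workshop with prospective VLM users inform preliminary requirements for future VLM trust studies.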

[61] Feature-Augmented Deep Networks for Multiscale Building Segmentation in High-Resolution UAV and Satellite Imagery

Chintan B. Maniyar,Minakshi Kumar,Gengchen Mai

Main category: cs.CV

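TL;DR: The paper proposes a deep learning framework for multiscale building segmentation that uses feature augmentation and optimized training policies to substantially improve building segmentation accuracy in RGB imagery.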

Details

Motivation: Building segmentation in high-resolution RGB imagery remains difficult because of spectral similarity between buildings and non-building features, shadows, and irregular geometry. Method: Using a multi-sensor dataset, the inputs are augmented with PCA, VDVI, MBI, and Sobel edge features and fed to a Res-U-Net trained with optimized policies (layer freezing, cyclical learning rates, and SuperConvergence). Result: On WorldView-3 imagery the model reaches 96.5% overall accuracy, an F1-score of 0.86, and an IoU of 0.80, outperforming existing RGB-based benchmarks. Conclusion: Combining multi-resolution imagery, feature augmentation, and optimized training yields robust building segmentation for remote sensing. Abstract: Accurate building segmentation from high-resolution RGB imagery remains challenging due to spectral similarity with non-building features, shadows, and irregular building geometries. In this study, we present a comprehensive deep learning framework for multiscale building segmentation using RGB aerial and satellite imagery with spatial resolutions ranging from 0.4m to 2.7m. We curate a diverse, multi-sensor dataset and introduce feature-augmented inputs by deriving secondary representations including Principal Component Analysis (PCA), Visible Difference Vegetation Index (VDVI), Morphological Building Index (MBI), and Sobel edge filters from RGB channels. These features guide a Res-U-Net architecture in learning complex spatial patterns more effectively. We also propose training policies incorporating layer freezing, cyclical learning rates, and SuperConvergence to reduce training time and resource usage. Evaluated on a held-out WorldView-3 image, our model achieves an overall accuracy of 96.5%, an F1-score of 0.86, and an Intersection over Union (IoU) of 0.80, outperforming existing RGB-based benchmarks. This study demonstrates the effectiveness of combining multi-resolution imagery, feature augmentation, and optimized training strategies for robust building segmentation in remote sensing applications.
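
Two of the ingredients named in this abstract have simple closed forms and can be sketched directly. The block below computes VDVI for one RGB pixel using the commonly cited definition (2G - R - B) / (2G + R + B), and IoU and F1 for a pair of binary masks. The abstract gives no formulas or averaging scheme, so the definitions, function names, and sample values here are assumptions rather than the authors' implementation.

```python
def vdvi(r, g, b):
    """Visible-band difference vegetation index for one RGB pixel."""
    denom = 2 * g + r + b
    # Guard against division by zero on pure black pixels.
    return (2 * g - r - b) / denom if denom else 0.0


def iou_and_f1(pred, truth):
    """IoU and F1 for two flat binary masks (1 = building, 0 = background)."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if t and not p)
    if tp + fp + fn == 0:
        return 1.0, 1.0  # both masks empty: treat as perfect agreement
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return iou, f1


# Hypothetical values: a green (vegetated) pixel and two tiny masks.
green_pixel_index = vdvi(50, 200, 50)
iou, f1 = iou_and_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

High VDVI marks vegetation, which is one way such a channel can help separate green areas from rooftops that look similar in raw RGB.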

[62] Aesthetics Without Semantics

C. Alejandro Parraga,Olivier Penacchio,Marcos Muňoz Gonzalez,Bogdan Raducanu,Xavier Otazu

Main category: cs.CV

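TL;DR: By building a Minimum Semantic Content (MSC) database, the paper addresses the bias toward beautiful images in existing aesthetics databases and shows how complex the relationship between image features and aesthetic ratings is.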

Details

Motivation: Existing aesthetics databases are biased toward beautiful images, which limits the study of aesthetic judgement; the authors address this by balancing the aesthetic range of the images. Method: They build the MSC database of 10,426 images balanced between beautiful and ugly, each rated by 100 observers, and analyze aesthetic relationships using image metrics. Result: Adding ugly images can change or even invert the relationship between image features and aesthetic ratings, exposing a limitation of prior work. Conclusion: Ignoring ugly images in aesthetics research can lead to misreading the link between image content and aesthetic judgement or to missing important effects. Abstract: While it is easy for human observers to judge an image as beautiful or ugly, aesthetic decisions result from a combination of entangled perceptual and cognitive (semantic) factors, making the understanding of aesthetic judgements particularly challenging from a scientific point of view. Furthermore, our research shows a prevailing bias in current databases, which include mostly beautiful images, further complicating the study and prediction of aesthetic responses. We address these limitations by creating a database of images with minimal semantic content and devising, and next exploiting, a method to generate images on the ugly side of aesthetic valuations. The resulting Minimum Semantic Content (MSC) database consists of a large and balanced collection of 10,426 images, each evaluated by 100 observers. We next use established image metrics to demonstrate how augmenting an image set biased towards beautiful images with ugly images can modify, or even invert, an observed relationship between image features and aesthetics valuation. Taken together, our study reveals that works in empirical aesthetics attempting to link image content and aesthetic judgements may magnify, underestimate, or simply miss interesting effects due to a limitation of the range of aesthetic values they consider.

[63] Progressive Inertial Poser: Progressive Real-Time Kinematic Chain Estimation for 3D Full-Body Pose from Three IMU Sensors

Zunjie Zhu,Yan Zhao,Yihan Hu,Guoxiang Wang,Hai Qiu,Bolun Zheng,Chenggang Yan,Feng Xu

Main category: cs.CV

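TL;DR: The paper proposes ProgIP, a full-body pose estimation method that uses only three IMU sensors (head and wrists) and combines neural networks with a human dynamics model, outperforming existing methods.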

Details

Motivation: To reduce hardware complexity and make full-body pose estimation more practical for virtual reality. Method: An encoder combining a Transformer Encoder and a bidirectional LSTM, a multi-layer perceptron decoder, and multi-stage progressive network estimation. Result: On multiple public datasets the method outperforms comparable methods and approaches those that use six IMU sensors. Conclusion: ProgIP achieves accurate full-body pose estimation while reducing hardware requirements. Abstract: The motion capture system that supports full-body virtual representation is of key significance for virtual reality. Compared to vision-based systems, full-body pose estimation from sparse tracking signals is not limited by environmental conditions or recording range. However, previous works either face the challenge of wearing additional sensors on the pelvis and lower-body or rely on external visual sensors to obtain global positions of key joints. To improve the practicality of the technology for virtual reality applications, we estimate full-body poses using only inertial data obtained from three Inertial Measurement Unit (IMU) sensors worn on the head and wrists, thereby reducing the complexity of the hardware system. In this work, we propose a method called Progressive Inertial Poser (ProgIP) for human pose estimation, which combines neural network estimation with a human dynamics model, considers the hierarchical structure of the kinematic chain, and employs a multi-stage progressive network estimation with increased depth to reconstruct full-body motion in real time. The encoder combines Transformer Encoder and bidirectional LSTM (TE-biLSTM) to flexibly capture the temporal dependencies of the inertial sequence, while the decoder based on multi-layer perceptrons (MLPs) transforms high-dimensional features and accurately projects them onto Skinned Multi-Person Linear (SMPL) model parameters. Quantitative and qualitative experimental results on multiple public datasets show that our method outperforms state-of-the-art methods with the same inputs, and is comparable to recent works using six IMU sensors.

[64] Hearing and Seeing Through CLIP: A Framework for Self-Supervised Sound Source Localization

Sooyoung Park,Arda Senocak,Joon Son Chung

Main category: cs.CV

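TL;DR: The paper proposes a self-supervised method that extends CLIP to sound source localization without explicit text input. Audio-driven embeddings and a contrastive audio-visual correspondence objective yield more complete and compact localization.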

Details

Motivation: To explore how the cross-modal alignment knowledge of pre-trained multimodal foundation models such as CLIP can improve sound source localization. Method: A framework maps audio into tokens compatible with CLIP's text encoder to produce audio-driven embeddings, and aligns visual features with them through a contrastive audio-visual correspondence objective. Result: The method outperforms existing approaches on five tasks and generalizes well in zero-shot settings. Conclusion: Self-supervision together with an LLM-guided extension improves both the performance and the generalization of sound source localization. Abstract: Large-scale vision-language models demonstrate strong multimodal alignment and generalization across diverse tasks. Among them, CLIP stands out as one of the most successful approaches. In this work, we extend the application of CLIP to sound source localization, proposing a self-supervised method that operates without explicit text input. We introduce a framework that maps audio into tokens compatible with CLIP's text encoder, producing audio-driven embeddings. These embeddings are used to generate sounding region masks, from which visual features are extracted and aligned with the audio embeddings through a contrastive audio-visual correspondence objective. Our findings show that the alignment knowledge of a pre-trained multimodal foundation model enables our method to generate more complete and compact localization for sounding objects. We further propose an LLM-guided extension that distills object-aware audio-visual scene understanding into the model during training to enhance alignment. Extensive experiments across five diverse tasks demonstrate that our method, in all variants, outperforms state-of-the-art approaches and achieves strong generalization in zero-shot settings.

[65] Joint Super-Resolution and Segmentation for 1-m Impervious Surface Area Mapping in China's Yangtze River Economic Belt

Jie Deng,Danfeng Hong,Chenyu Li,Naoto Yokoya

Main category: cs.CV

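TL;DR: JointSeg is a framework that combines super-resolution and segmentation to produce 1-meter impervious surface area (ISA) maps from Sentinel-2 imagery, outperforming traditional approaches.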

Details

Motivation: Producing high-resolution ISA maps with traditional methods is costly and inflexible; JointSeg offers a scalable and affordable alternative. Method: JointSeg is trained on multimodal cross-resolution inputs and gradually enhances resolution from 10m to 1m while preserving spatial textures and fusing cross-scale features. Result: Applied to the Yangtze River Economic Belt, the ISA-1 product reaches an F1-score of 85.71%, outperforms other datasets, and is robust across different terrain. Conclusion: JointSeg has clear advantages for high-resolution ISA mapping, captures urbanization dynamics, and suits diverse landscapes. Abstract: We propose a novel joint framework by integrating super-resolution and segmentation, called JointSeg, which enables the generation of 1-meter ISA maps directly from freely available Sentinel-2 imagery. JointSeg was trained on multimodal cross-resolution inputs, offering a scalable and affordable alternative to traditional approaches. This synergistic design enables gradual resolution enhancement from 10m to 1m while preserving fine-grained spatial textures, and ensures high classification fidelity through effective cross-scale feature fusion. This method has been successfully applied to the Yangtze River Economic Belt (YREB), a region characterized by complex urban-rural patterns and diverse topography. As a result, a comprehensive ISA mapping product for 2021, referred to as ISA-1, was generated, covering an area of over 2.2 million square kilometers. Quantitative comparisons against the 10m ESA WorldCover and other benchmark products reveal that ISA-1 achieves an F1-score of 85.71%, outperforming bilinear-interpolation-based segmentation by 9.5%, and surpassing other ISA datasets by 21.43%-61.07%. In densely urbanized areas (e.g., Suzhou, Nanjing), ISA-1 reduces ISA overestimation through improved discrimination of green spaces and water bodies. Conversely, in mountainous regions (e.g., Ganzi, Zhaotong), it identifies significantly more ISA due to its enhanced ability to detect fragmented anthropogenic features such as rural roads and sparse settlements, demonstrating its robustness across diverse landscapes. Moreover, we present biennial ISA maps from 2017 to 2023, capturing spatiotemporal urbanization dynamics across representative cities. 
The results highlight distinct regional growth patterns: rapid expansion in upstream cities, moderate growth in midstream regions, and saturation in downstream metropolitan areas.

[66] Threshold Modulation for Online Test-Time Adaptation of Spiking Neural Networks

Kejie Zhao,Wenjia Hua,Aiersi Tuerhong,Luziwei Leng,Yuxin Ma,Qinghua Guo

Main category: cs.CV

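TL;DR: The paper proposes Threshold Modulation (TM), an online test-time adaptation framework that improves the generalization of spiking neural networks (SNNs) under distribution shifts while keeping computational cost low.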

Details

Motivation: SNNs need to adapt to distribution shifts after deployment, and existing methods are not well suited to them. Method: TM dynamically adjusts the firing threshold through normalization inspired by neuronal dynamics and is compatible with neuromorphic hardware. Result: Experiments on benchmark datasets confirm that TM improves the robustness of SNNs. Conclusion: TM is a practical approach to online test-time adaptation of SNNs and offers ideas for future neuromorphic chip design. Abstract: Recently, spiking neural networks (SNNs), deployed on neuromorphic chips, provide highly efficient solutions on edge devices in different scenarios. However, their ability to adapt to distribution shifts after deployment has become a crucial challenge. Online test-time adaptation (OTTA) offers a promising solution by enabling models to dynamically adjust to new data distributions without requiring source data or labeled target samples. Nevertheless, existing OTTA methods are largely designed for traditional artificial neural networks and are not well-suited for SNNs. To address this gap, we propose a low-power, neuromorphic chip-friendly online test-time adaptation framework, aiming to enhance model generalization under distribution shifts. The proposed approach is called Threshold Modulation (TM), which dynamically adjusts the firing threshold through neuronal dynamics-inspired normalization, being more compatible with neuromorphic hardware. Experimental results on benchmark datasets demonstrate the effectiveness of this method in improving the robustness of SNNs against distribution shifts while maintaining low computational cost. The proposed method offers a practical solution for online test-time adaptation of SNNs, providing inspiration for the design of future neuromorphic chips. The demo code is available at github.com/NneurotransmitterR/TM-OTTA-SNN.
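
To make the idea of adjusting a firing threshold concrete, here is a toy leaky integrate-and-fire neuron whose threshold is rescaled when the input statistics shift. This is not the paper's TM rule, which the abstract does not specify; the update in `modulated_threshold`, the leak value, and the input sequences are hypothetical and chosen only to show why moving the threshold can stand in for normalizing the input.

```python
def lif_spike_count(inputs, threshold, leak=0.9):
    """Count spikes of a leaky integrate-and-fire neuron with hard reset."""
    v, spikes = 0.0, 0
    for x in inputs:
        v = leak * v + x      # leaky integration of the input current
        if v >= threshold:    # fire, then reset the membrane potential
            spikes += 1
            v = 0.0
    return spikes


def modulated_threshold(base_threshold, source_mean, target_mean):
    """Hypothetical rule: scale the threshold by the input-mean ratio."""
    return base_threshold * target_mean / source_mean


source = [0.5] * 12    # inputs seen during training
shifted = [1.0] * 12   # same signal after a 2x intensity shift

baseline = lif_spike_count(source, threshold=1.0)
drifted = lif_spike_count(shifted, threshold=1.0)
adapted = lif_spike_count(shifted, modulated_threshold(1.0, 0.5, 1.0))
```

With the fixed threshold the shifted input makes the neuron fire on every step, while the rescaled threshold restores the original firing pattern without touching any weights.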

[67] GeomHair: Reconstruction of Hair Strands from Colorless 3D Scans

Rachmadio Noval Lazuardi,Artem Sevastopolsky,Egor Zakharov,Matthias Niessner,Vanessa Sklyarova

Main category: cs.CV

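TL;DR: The paper proposes a new method that reconstructs hair strands directly from colorless 3D scans using multi-modal hair orientation extraction, addressing strand reconstruction for complex hairstyles.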

Details

Motivation: Hair strand reconstruction is a fundamental problem in computer vision and graphics, with uses in high-fidelity digital avatar synthesis, animation, and AR/VR. Existing methods rely on RGB captures, which are sensitive to the environment and make it hard to extract strand orientation for complex hairstyles. Method: The method finds sharp surface features in the scan geometry, estimates hair orientation with a neural 2D line detector, and incorporates a diffusion prior trained on a dataset of synthetic hair scans. Result: It accurately reconstructs both simple and intricate hairstyles without relying on color information. Conclusion: The authors introduce Strands400, a dataset with hair strand geometry from 400 subjects, to support further research. Abstract: We propose a novel method that reconstructs hair strands directly from colorless 3D scans by leveraging multi-modal hair orientation extraction. Hair strand reconstruction is a fundamental problem in computer vision and graphics that can be used for high-fidelity digital avatar synthesis, animation, and AR/VR applications. However, accurately recovering hair strands from raw scan data remains challenging due to human hair's complex and fine-grained structure. Existing methods typically rely on RGB captures, which can be sensitive to the environment and can be a challenging domain for extracting the orientation of guiding strands, especially in the case of challenging hairstyles. To reconstruct the hair purely from the observed geometry, our method finds sharp surface features directly on the scan and estimates strand orientation through a neural 2D line detector applied to the renderings of scan shading. Additionally, we incorporate a diffusion prior trained on a diverse set of synthetic hair scans, refined with an improved noise schedule, and adapted to the reconstructed contents via a scan-specific text prompt. We demonstrate that this combination of supervision signals enables accurate reconstruction of both simple and intricate hairstyles without relying on color information. To facilitate further research, we introduce Strands400, the largest publicly available dataset of hair strands with detailed surface geometry extracted from real-world data, which contains reconstructed hair strands from the scans of 400 subjects.

[68] EDmamba: A Simple yet Effective Event Denoising Method with State Space Model

Ciyu Ruan,Zihang Gong,Ruishan Guo,Jingao Xu,Xinlei Chen

Main category: cs.CV

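TL;DR: The paper proposes an event denoising framework based on State Space Models (SSMs) that resolves the trade-off between computationally heavy and lightweight methods, achieving both high accuracy and high efficiency.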

Details

Motivation: The high dynamic range and low latency of event cameras make them well suited to high-speed vision, but their noisy output calls for efficient denoising that preserves real-time processing. Method: Events are represented as 4D event clouds; a Coarse Feature Extraction (CFE) module extracts features from geometric and polarity-aware subspaces, and a Spatial Mamba (S-SSM) and a Temporal Mamba (T-SSM) model local geometric structure and global temporal dynamics. Result: The model has 88.89K parameters, an inference time of 0.0685s per 100K events, and an accuracy of 0.982, which is 2.08% more accurate and 36 times faster than Transformer-based methods. Conclusion: The framework balances accuracy and efficiency in event denoising and is an effective solution for real-time high-speed vision applications. Abstract: Event cameras excel in high-speed vision due to their high temporal resolution, high dynamic range, and low power consumption. However, as dynamic vision sensors, their output is inherently noisy, making efficient denoising essential to preserve their ultra-low latency and real-time processing capabilities. Existing event denoising methods struggle with a critical dilemma: computationally intensive approaches compromise the sensor's high-speed advantage, while lightweight methods often lack robustness across varying noise levels. To address this, we propose a novel event denoising framework based on State Space Models (SSMs). Our approach represents events as 4D event clouds and includes a Coarse Feature Extraction (CFE) module that extracts embedding features from both geometric and polarity-aware subspaces. The model is further composed of two essential components: a Spatial Mamba (S-SSM) that models local geometric structures and a Temporal Mamba (T-SSM) that captures global temporal dynamics, efficiently propagating spatiotemporal features across events. Experiments demonstrate that our method achieves state-of-the-art accuracy and efficiency, with 88.89K parameters, 0.0685s per 100K events inference time, and a 0.982 accuracy score, outperforming Transformer-based methods by 2.08% in denoising accuracy while running 36 times faster.

[69] TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation

Haokun Lin,Teng Wang,Yixiao Ge,Yuying Ge,Zhichao Lu,Ying Wei,Qingfu Zhang,Zhenan Sun,Ying Shan

Main category: cs.CV

TL;DR: TokLIP is a visual tokenizer that semanticizes vector-quantized (VQ) tokens and incorporates CLIP-level semantics, addressing the high computational overhead and limited comprehension performance of multimodal unification.

Details

Motivation: Addresses the high computational overhead and insufficient semantic understanding that existing methods (such as Chameleon and Emu3) face in multimodal unification. Method: Combines a low-level discrete VQ tokenizer with a ViT-based token encoder and disentangles the comprehension and generation objectives, so that advanced VQ tokenizers can be applied directly. Result: TokLIP shows strong data efficiency and semantic understanding while also improving generative capacity, making it suitable for comprehension and generation with autoregressive Transformers. Conclusion: TokLIP offers an efficient, semantically rich solution for multimodal tasks; the code and models are open source. Abstract: Pioneering token-based works such as Chameleon and Emu3 have established a foundation for multimodal unification but face challenges of high training computational overhead and limited comprehension performance due to a lack of high-level semantics. In this paper, we introduce TokLIP, a visual tokenizer that enhances comprehension by semanticizing vector-quantized (VQ) tokens and incorporating CLIP-level semantics while enabling end-to-end multimodal autoregressive training with standard VQ tokens. TokLIP integrates a low-level discrete VQ tokenizer with a ViT-based token encoder to capture high-level continuous semantics. Unlike previous approaches (e.g., VILA-U) that discretize high-level features, TokLIP disentangles training objectives for comprehension and generation, allowing the direct application of advanced VQ tokenizers without the need for tailored quantization operations. Our empirical results demonstrate that TokLIP achieves exceptional data efficiency, empowering visual tokens with high-level semantic understanding while enhancing low-level generative capacity, making it well-suited for autoregressive Transformers in both comprehension and generation tasks. The code and models are available at https://github.com/TencentARC/TokLIP.
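
For readers unfamiliar with vector quantization, the following is a minimal sketch of the generic nearest-codebook lookup that produces a discrete VQ token. It illustrates standard VQ only and is not TokLIP's tokenizer; the function name `quantize` and the toy codebook are assumptions.

```python
from typing import Sequence


def quantize(vector: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to `vector`.

    Generic VQ step: the continuous feature is replaced by the id of its
    nearest code under squared Euclidean distance.
    """
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))


# Toy 2D codebook with three codes (illustrative values only).
toy_codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
token_id = quantize((0.9, 0.8), toy_codebook)
```

In a token-based multimodal model, ids like `token_id` are what the autoregressive Transformer predicts; per the abstract, TokLIP additionally passes such tokens through a ViT-based encoder to recover continuous high-level semantics.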

[70] PillarMamba: Learning Local-Global Context for Roadside Point Cloud via Hybrid State Space Model

Zhang Zhang,Chao Sun,Chao Yue,Da Wen,Tianze Wang,Jianghao Leng

Main category: cs.CV

TL;DR: Proposes PillarMamba, a Mamba-based framework for 3D object detection on roadside point clouds; a Cross-stage State-space Group and a Hybrid State-space Block improve the network's expressiveness and computational efficiency, and the method performs strongly on the DAIR-V2X-I benchmark.

Details

Motivation: 3D object detection on roadside point clouds remains underexplored, and a point cloud detector's performance depends heavily on the network's receptive field and its ability to use scene context. Mamba, with its efficient global receptive field, opens new possibilities here. Method: Proposes the PillarMamba framework, which uses a Cross-stage State-space Group (CSG) for cross-stage feature fusion and a Hybrid State-space Block (HSB) that combines local convolution and residual attention to address disrupted local connections and forgotten historical relationships. Result: The method outperforms the state of the art on the DAIR-V2X-I benchmark. Conclusion: Through local-global context modeling, PillarMamba improves 3D object detection on roadside point clouds and offers an effective solution for intelligent transport systems. Abstract: Serving the Intelligent Transport System (ITS) and Vehicle-to-Everything (V2X) tasks, roadside perception has received increasing attention in recent years, as it can extend the perception range of connected vehicles and improve traffic safety. However, roadside point cloud oriented 3D object detection has not been effectively explored. To some extent, the key to the performance of a point cloud detector lies in the receptive field of the network and the ability to effectively utilize the scene context. The recent emergence of Mamba, based on State Space Model (SSM), has shaken up the traditional convolution and transformers that have long been the foundational building blocks, due to its efficient global receptive field. In this work, we introduce Mamba to pillar-based roadside point cloud perception and propose a framework based on Cross-stage State-space Group (CSG), called PillarMamba. It enhances the expressiveness of the network and achieves efficient computation through cross-stage feature fusion. However, due to the limitations of scan directions, the state space model suffers from disrupted local connections and forgotten historical relationships. To address this, we propose the Hybrid State-space Block (HSB) to obtain the local-global context of roadside point cloud. Specifically, it enhances neighborhood connections through local convolution and preserves historical memory through residual attention. The proposed method outperforms the state-of-the-art methods on the popular large scale roadside benchmark: DAIR-V2X-I. The code will be released soon.
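
As background for "pillar-based" perception, the sketch below shows the usual first step of such pipelines: grouping points into vertical columns (pillars) on a ground-plane grid. It is a generic illustration, not PillarMamba's code; the function name `pillarize` and the grid size are assumptions.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in meters
PillarKey = Tuple[int, int]         # integer grid cell on the x-y plane


def pillarize(points: List[Point], pillar_size: float) -> Dict[PillarKey, List[Point]]:
    """Group points into vertical pillars by their x-y grid cell.

    The z coordinate is ignored when choosing the cell, so each pillar spans
    the full height of the scene.
    """
    pillars: Dict[PillarKey, List[Point]] = defaultdict(list)
    for x, y, z in points:
        key = (math.floor(x / pillar_size), math.floor(y / pillar_size))
        pillars[key].append((x, y, z))
    return dict(pillars)
```

Each non-empty pillar is then typically encoded into a feature vector, giving a 2D pseudo-image that sequence or convolution blocks can scan.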

[71] Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding

Han Xiao,Yina Xie,Guanxin Tan,Yinghao Chen,Rui Hu,Ke Wang,Aojun Zhou,Hao Li,Hao Shao,Xudong Lu,Peng Gao,Yafei Wen,Xiaoxin Chen,Shuai Ren,Hongsheng Li

Main category: cs.CV

TL;DR: Proposes a novel approach that uses markup languages to generate structured document representations, addressing insufficient contextual information and limited understanding of spatial relationships in visual document understanding.

Details

Motivation: Visual document understanding is challenging because it requires integrating visual perception with textual comprehension; existing datasets lack detailed contextual information, which leads to hallucinations and poor understanding of spatial relationships. Method: Proposes a pipeline that adaptively generates markup languages (such as Markdown, JSON, HTML, and TikZ) to build structured document representations, and introduces two fine-grained datasets, DocMark-Pile and DocMark-Instruct. Result: Experiments show that the method significantly outperforms existing MLLMs on multiple visual document understanding benchmarks and improves reasoning and comprehension in complex visual scenarios. Conclusion: Structured representations and richer datasets effectively improve visual document understanding; the code and models are open source. Abstract: Visual Document Understanding has become essential with the increase of text-rich visual content. This field poses significant challenges due to the need for effective integration of visual perception and textual comprehension, particularly across diverse document types with complex layouts. Moreover, existing fine-tuning datasets for this domain often fall short in providing the detailed contextual information for robust understanding, leading to hallucinations and limited comprehension of spatial relationships among visual elements. To address these challenges, we propose an innovative pipeline that utilizes adaptive generation of markup languages, such as Markdown, JSON, HTML, and TikZ, to build highly structured document representations and deliver contextually-grounded responses. We introduce two fine-grained structured datasets: DocMark-Pile, comprising approximately 3.8M pretraining data pairs for document parsing, and DocMark-Instruct, featuring 624k fine-tuning data annotations for grounded instruction following. Extensive experiments demonstrate that our proposed model significantly outperforms existing state-of-the-art MLLMs across a range of visual document understanding benchmarks, facilitating advanced reasoning and comprehension capabilities in complex visual scenarios. Our code and models are released at https://github.com/Euphoria16/DocMark.

[72] StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant

Haibo Wang,Bo Feng,Zhengfeng Lai,Mingze Xu,Shiyu Li,Weifeng Ge,Afshin Dehghan,Meng Cao,Ping Huang

Main category: cs.CV

TL;DR: StreamBridge is a framework that turns offline Video-LLMs into streaming-capable models, addressing the challenges of real-time multi-turn understanding and proactive response.

Details

Motivation: Adapting existing models to online scenarios exposes two problems: limited multi-turn real-time understanding and the lack of a proactive response mechanism. Method: Combines a memory buffer with a round-decayed compression strategy to support long-context multi-turn interaction, and uses a decoupled, lightweight activation model to enable continuous proactive responses. Result: StreamBridge significantly improves the streaming understanding of offline Video-LLMs, outperforming GPT-4o and Gemini 1.5 Pro, while also performing well on standard video understanding tasks. Conclusion: StreamBridge offers an efficient solution for streaming with Video-LLMs and stands out in both performance and adaptability. Abstract: We present StreamBridge, a simple yet effective framework that seamlessly transforms offline Video-LLMs into streaming-capable models. It addresses two fundamental challenges in adapting existing models into online scenarios: (1) limited capability for multi-turn real-time understanding, and (2) lack of proactive response mechanisms. Specifically, StreamBridge incorporates (1) a memory buffer combined with a round-decayed compression strategy, supporting long-context multi-turn interactions, and (2) a decoupled, lightweight activation model that can be effortlessly integrated into existing Video-LLMs, enabling continuous proactive responses. To further support StreamBridge, we construct Stream-IT, a large-scale dataset tailored for streaming video understanding, featuring interleaved video-text sequences and diverse instruction formats. Extensive experiments show that StreamBridge significantly improves the streaming understanding capabilities of offline Video-LLMs across various tasks, outperforming even proprietary models such as GPT-4o and Gemini 1.5 Pro. Simultaneously, it achieves competitive or superior performance on standard video understanding benchmarks.
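
One plausible reading of "a memory buffer combined with a round-decayed compression strategy" is sketched below: when the buffer exceeds a token budget, the oldest dialogue rounds are compressed first, so recent rounds keep full detail. The pairwise average pooling, the scalar "tokens", and the names `pool_pairs` and `compress_rounds` are assumptions for illustration, not the paper's implementation.

```python
from typing import List


def pool_pairs(tokens: List[float]) -> List[float]:
    """Halve a token list by averaging adjacent pairs (a lone last token is kept)."""
    pooled = []
    for i in range(0, len(tokens), 2):
        chunk = tokens[i:i + 2]
        pooled.append(sum(chunk) / len(chunk))
    return pooled


def compress_rounds(rounds: List[List[float]], budget: int) -> List[List[float]]:
    """Compress the earliest rounds first until the total token count fits `budget`.

    Hypothetical policy: older rounds decay to coarser summaries before any
    newer round is touched.
    """
    rounds = [list(r) for r in rounds]  # work on a copy
    while sum(len(r) for r in rounds) > budget:
        for i, r in enumerate(rounds):
            if len(r) > 1:
                rounds[i] = pool_pairs(r)
                break
        else:
            break  # every round is already a single token; nothing left to compress
    return rounds
```

In a real system each token would be an embedding vector rather than a float, but the ordering logic (oldest round decays first) would be the same.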

[73] SITE: towards Spatial Intelligence Thorough Evaluation

Wenqi Wang,Reuben Tan,Pengyue Zhu,Jianwei Yang,Zhengyuan Yang,Lijuan Wang,Andrey Kolobov,Jianfeng Gao,Boqing Gong

Main category: cs.CV

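TL;DR: SITE is a standardized multi-choice visual question-answering benchmark for evaluating the spatial intelligence of large vision-language models, covering multiple visual modalities and spatial intelligence factors. Experiments show that leading models lag behind human experts in basic abilities such as spatial orientation, and that spatial reasoning ability correlates positively with performance on an embodied AI task.

Details

Motivation: Spatial intelligence (SI) is an important component of cognition that spans many disciplines. Existing work lacks a standardized evaluation tool, so a benchmark that evaluates SI comprehensively is needed. Method: Designs the SITE dataset by combining a survey of 31 existing datasets with classification systems from cognitive science, and adds two new task types (view-taking and dynamic scenes). Result: Leading models lag behind human experts in basic abilities such as spatial orientation, and spatial reasoning ability correlates positively with performance on an embodied AI task. Conclusion: SITE provides a standardized tool for evaluating spatial intelligence, exposes the shortcomings of current models in spatial reasoning, and confirms the link between spatial reasoning and embodied AI performance. Abstract: Spatial intelligence (SI) represents a cognitive ability encompassing the visualization, manipulation, and reasoning about spatial relationships, underpinning disciplines from neuroscience to robotics. We introduce SITE, a benchmark dataset towards SI Thorough Evaluation in a standardized format of multi-choice visual question-answering, designed to assess large vision-language models' spatial intelligence across diverse visual modalities (single-image, multi-image, and video) and SI factors (figural to environmental scales, spatial visualization and orientation, intrinsic and extrinsic, static and dynamic). Our approach to curating the benchmark combines a bottom-up survey of 31 existing datasets and a top-down strategy drawing upon three classification systems in cognitive science, which prompt us to design two novel types of tasks about view-taking and dynamic scenes. Extensive experiments reveal that leading models fall behind human experts, especially in spatial orientation, a fundamental SI factor. Moreover, we demonstrate a positive correlation between a model's spatial reasoning proficiency and its performance on an embodied AI task.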

[74] Generating Physically Stable and Buildable LEGO Designs from Text

Ava Pun,Kangle Deng,Ruixuan Liu,Deva Ramanan,Changliu Liu,Jun-Yan Zhu

Main category: cs.CV

TL;DR: LegoGPT is the first method to generate physically stable LEGO models from text prompts; it combines a large-scale dataset with a language model and uses physical constraints to improve design stability.

Details

Motivation: Tackles the challenge of generating physically stable LEGO models from text, yielding designs that are both practical and visually appealing. Method: Builds a large-scale dataset, trains an autoregressive language model, and applies physical constraints together with a rollback mechanism to improve the designs. Result: Generates stable, diverse, and aesthetically pleasing LEGO designs that can be assembled by humans and by robots; the dataset and code are released. Conclusion: LegoGPT offers an efficient solution for text-to-LEGO design with practical application potential. Abstract: We introduce LegoGPT, the first approach for generating physically stable LEGO brick models from text prompts. To achieve this, we construct a large-scale, physically stable dataset of LEGO designs, along with their associated captions, and train an autoregressive large language model to predict the next brick to add via next-token prediction. To improve the stability of the resulting designs, we employ an efficient validity check and physics-aware rollback during autoregressive inference, which prunes infeasible token predictions using physics laws and assembly constraints. Our experiments show that LegoGPT produces stable, diverse, and aesthetically pleasing LEGO designs that align closely with the input text prompts. We also develop a text-based LEGO texturing method to generate colored and textured designs. We show that our designs can be assembled manually by humans and automatically by robotic arms. We also release our new dataset, StableText2Lego, containing over 47,000 LEGO structures of over 28,000 unique 3D objects accompanied by detailed captions, along with our code and models at the project website: https://avalovelace1.github.io/LegoGPT/.
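
The "validity check and physics-aware rollback" loop can be illustrated with a toy sketch. Everything below is a simplification chosen for illustration: bricks are unit cubes on a small grid, the physics test is reduced to "a brick must rest on the ground or on another brick", and the names `is_valid`, `is_supported`, and `build` are hypothetical. The real system reasons about full brick shapes and structural stability.

```python
from typing import Iterable, List, Set, Tuple

Brick = Tuple[int, int, int]  # toy stand-in: a unit brick at grid cell (x, y, z)


def is_valid(brick: Brick, placed: Set[Brick], size: int) -> bool:
    """Assembly constraints: the brick is inside the grid and does not collide."""
    in_bounds = all(0 <= c < size for c in brick)
    return in_bounds and brick not in placed


def is_supported(brick: Brick, placed: Set[Brick]) -> bool:
    """Toy physics check: the brick sits on the ground or directly on another brick."""
    x, y, z = brick
    return z == 0 or (x, y, z - 1) in placed


def build(proposals: Iterable[Brick], size: int = 4, max_bricks: int = 10) -> List[Brick]:
    """Consume proposed bricks, pruning invalid ones and rolling back unstable ones."""
    placed: List[Brick] = []
    for brick in proposals:
        if len(placed) >= max_bricks:
            break
        if not is_valid(brick, set(placed), size):
            continue  # prune the infeasible prediction and move to the next proposal
        placed.append(brick)
        if not is_supported(brick, set(placed[:-1])):
            placed.pop()  # roll back to the last stable state
    return placed
```

In an actual autoregressive setting, `proposals` would be samples from the language model conditioned on the bricks placed so far, and a rollback would resume sampling from the restored prefix.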

[75] Flow-GRPO: Training Flow Matching Models via Online RL

Jie Liu,Gongye Liu,Jiajun Liang,Yangguang Li,Jiaheng Liu,Xintao Wang,Pengfei Wan,Di Zhang,Wanli Ouyang

Main category: cs.CV

TL;DR: Flow-GRPO is the first method to integrate online reinforcement learning (RL) into flow matching models; an ODE-to-SDE conversion and a Denoising Reduction strategy improve sampling efficiency and performance, with strong results on text-to-image tasks.

Details

Motivation: Brings reinforcement learning to flow matching models to improve accuracy and efficiency in generation tasks. Method: Uses an ODE-to-SDE conversion to enable stochastic sampling for exploration and a Denoising Reduction strategy that cuts the number of training denoising steps. Result: Accuracy improves markedly on compositional generation and visual text rendering (e.g., GenEval rises from 63% to 95%), while image quality and diversity are preserved. Conclusion: Flow-GRPO improves generation performance while avoiding reward hacking, and has broad application potential. Abstract: We propose Flow-GRPO, the first method integrating online reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential Equation (SDE) that matches the original model's marginal distribution at all timesteps, enabling statistical sampling for RL exploration; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original inference timestep number, significantly improving sampling efficiency without performance degradation. Empirically, Flow-GRPO is effective across multiple text-to-image tasks. For complex compositions, RL-tuned SD3.5 generates nearly perfect object counts, spatial relations, and fine-grained attributes, boosting GenEval accuracy from $63\%$ to $95\%$. In visual text rendering, its accuracy improves from $59\%$ to $92\%$, significantly enhancing text generation. Flow-GRPO also achieves substantial gains in human preference alignment. Notably, little to no reward hacking occurred, meaning rewards did not increase at the cost of image quality or diversity, and both remained stable in our experiments.
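
The abstract does not spell out the RL objective, but GRPO-style methods typically score each sample relative to the other samples drawn for the same prompt. The sketch below shows that group-relative advantage computation in isolation; the function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions, and the ODE-to-SDE sampler itself is not reproduced here.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of samples generated for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples that
    beat their siblings get a positive signal without a learned value function.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four images sampled for one prompt, two of which satisfy the prompt.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Stochastic sampling matters here because a deterministic ODE sampler would yield the same image for the same noise, leaving no within-group variation to compare; this is the role the abstract assigns to the ODE-to-SDE conversion.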

Chao Liao,Liyang Liu,Xun Wang,Zhengxiong Luo,Xinyu Zhang,Wenliang Zhao,Jie Wu,Liang Li,Zhi Tian,Weilin Huang

Main category: cs.CV

TL;DR: Mogao is a unified framework that enables interleaved multi-modal generation through a causal approach, combining the strengths of autoregressive and diffusion models and performing strongly on multi-modal understanding and generation tasks.

Details

Motivation: Unified models for image understanding and generation have progressed, but remain limited to single-modal generation. Mogao aims to extend this paradigm through interleaved multi-modal generation. Method: Mogao improves the architecture with a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance, combined with an efficient training strategy. Result: Experiments show that Mogao reaches state-of-the-art performance in multi-modal understanding and text-to-image generation and produces high-quality interleaved outputs. Conclusion: As an omni-modal foundation model, Mogao opens new directions for zero-shot image editing and compositional generation and advances unified multi-modal systems. Abstract: Recent progress in unified models for image understanding and generation has been impressive, yet most approaches remain limited to single-modal generation conditioned on multiple modalities. In this paper, we present Mogao, a unified framework that advances this paradigm by enabling interleaved multi-modal generation through a causal approach. Mogao integrates a set of key technical improvements in architecture design, including a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance, which allow it to harness the strengths of both autoregressive models for text generation and diffusion models for high-quality image synthesis. These practical improvements also make Mogao particularly effective at processing arbitrary interleaved sequences of text and images. To further unlock the potential of unified models, we introduce an efficient training strategy on a large-scale, in-house dataset specifically curated for joint text and image generation. Extensive experiments show that Mogao not only achieves state-of-the-art performance in multi-modal understanding and text-to-image generation, but also excels in producing high-quality, coherent interleaved outputs. Its emergent capabilities in zero-shot image editing and compositional generation highlight Mogao as a practical omni-modal foundation model, paving the way for future development and scaling of unified multi-modal systems.

[77] DiffusionSfM: Predicting Structure and Motion via Ray Origin and Endpoint Diffusion

Qitao Zhao,Amy Lin,Jeff Tan,Jason Y. Zhang,Deva Ramanan,Shubham Tulsiani

Main category: cs.CV

TL;DR: Proposes DiffusionSfM, a data-driven multi-view reasoning approach that infers 3D scene geometry and camera poses directly from multi-view images, outperforming classical and learning-based methods.

Details

Motivation: Current Structure-from-Motion (SfM) methods usually follow a two-stage pipeline that combines learned or geometric pairwise reasoning with a global optimization step. This work aims for a more direct multi-view reasoning approach. Method: The DiffusionSfM framework parameterizes scene geometry and camera poses as pixel-wise ray origins and endpoints in a global frame and predicts them with a Transformer-based denoising diffusion model. Result: Experiments on synthetic and real datasets show that DiffusionSfM outperforms classical and learning-based methods and naturally models uncertainty. Conclusion: DiffusionSfM provides an efficient and robust multi-view reasoning approach and opens new possibilities for 3D reconstruction. Abstract: Current Structure-from-Motion (SfM) methods typically follow a two-stage pipeline, combining learned or geometric pairwise reasoning with a subsequent global optimization step. In contrast, we propose a data-driven multi-view reasoning approach that directly infers 3D scene geometry and camera poses from multi-view images. Our framework, DiffusionSfM, parameterizes scene geometry and cameras as pixel-wise ray origins and endpoints in a global frame and employs a transformer-based denoising diffusion model to predict them from multi-view inputs. To address practical challenges in training diffusion models with missing data and unbounded scene coordinates, we introduce specialized mechanisms that ensure robust learning. We empirically validate DiffusionSfM on both synthetic and real datasets, demonstrating that it outperforms classical and learning-based approaches while naturally modeling uncertainty.
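
The "pixel-wise ray origins and endpoints" parameterization can be made concrete with a small sketch: for a known pinhole camera and per-pixel depth, the ray origin is the camera center and the endpoint is the back-projected 3D point, both in the world frame. The pinhole model, the z-depth convention, and the function name `pixel_ray` are assumptions; the paper predicts these quantities with a diffusion model rather than computing them from known cameras.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def pixel_ray(
    u: float, v: float, depth: float,
    fx: float, fy: float, cx: float, cy: float,
    rotation: Sequence[Sequence[float]],  # 3x3 camera-to-world rotation
    center: Vec3,                         # camera center in the world frame
) -> Tuple[Vec3, Vec3]:
    """Return (ray origin, ray endpoint) in world coordinates for one pixel."""
    # Back-project the pixel to a 3D point in the camera frame (z-depth convention).
    p_cam = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Rotate into the world frame and translate by the camera center.
    endpoint = tuple(
        sum(rotation[i][j] * p_cam[j] for j in range(3)) + center[i]
        for i in range(3)
    )
    return tuple(center), endpoint


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
origin, endpoint = pixel_ray(50, 50, 2.0, 100.0, 100.0, 50.0, 50.0, identity, (1.0, 2.0, 3.0))
```

Read in reverse, this is why the representation is convenient: all origins of one image coincide at the camera center, and the endpoints give the scene geometry, so camera pose and structure are both recoverable from the predicted rays.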

[78] 3D Scene Generation: A Survey

Beichen Wen,Haozhe Xie,Zhaoxi Chen,Fangzhou Hong,Ziwei Liu

Main category: cs.CV

TL;DR: Surveys recent advances in 3D scene generation across four paradigms (procedural generation, neural 3D-based generation, image-based generation, and video-based generation) and discusses challenges and future directions.

Details

Motivation: 3D scene generation has broad applications in immersive media, robotics, and related fields, but early methods offered limited diversity, motivating the use of deep learning to improve generation quality. Method: Analyzes the technical foundations, trade-offs, and representative results of the four paradigms (procedural generation, neural 3D-based generation, image-based generation, and video-based generation). Result: Summarizes the performance of current methods, datasets, evaluation protocols, and downstream applications, and identifies key challenges in generation capacity, 3D representation, and other areas. Conclusion: Future directions include higher fidelity, physics-aware generation, and unified perception-generation models; a project page tracks ongoing progress. Abstract: 3D scene generation seeks to synthesize spatially structured, semantically meaningful, and photorealistic environments for applications such as immersive media, robotics, autonomous driving, and embodied AI. Early methods based on procedural rules offered scalability but limited diversity. Recent advances in deep generative models (e.g., GANs, diffusion models) and 3D representations (e.g., NeRF, 3D Gaussians) have enabled the learning of real-world scene distributions, improving fidelity, diversity, and view consistency. Recent advances like diffusion models bridge 3D scene synthesis and photorealism by reframing generation as image or video synthesis problems. This survey provides a systematic overview of state-of-the-art approaches, organizing them into four paradigms: procedural generation, neural 3D-based generation, image-based generation, and video-based generation. We analyze their technical foundations, trade-offs, and representative results, and review commonly used datasets, evaluation protocols, and downstream applications. We conclude by discussing key challenges in generation capacity, 3D representation, data and annotations, and evaluation, and outline promising directions including higher fidelity, physics-aware and interactive generation, and unified perception-generation models. This review organizes recent advances in 3D scene generation and highlights promising directions at the intersection of generative AI, 3D vision, and embodied intelligence. To track ongoing developments, we maintain an up-to-date project page: https://github.com/hzxie/Awesome-3D-Scene-Generation.

[79] SVAD: From Single Image to 3D Avatar via Synthetic Data Generation with Video Diffusion and Data Augmentation

Yonwoo Choi

Main category: cs.CV

TL;DR: SVAD combines video diffusion models with 3D Gaussian Splatting to generate high-quality animatable 3D human avatars from a single image, addressing the limitations of existing methods.

Details

Motivation: Existing single-image approaches to 3D human avatars fall short: 3DGS requires multi-view data, while video diffusion models struggle with consistency and identity preservation. Method: SVAD generates synthetic training data with video diffusion, enhances it with identity preservation and image restoration modules, and uses the result to train 3DGS avatars. Result: SVAD outperforms existing single-image methods in identity consistency and detail preservation, supports real-time rendering, and does not depend on dense training data. Conclusion: By combining diffusion models with 3DGS, SVAD offers a new approach to high-fidelity avatar generation from a single image. Abstract: Creating high-quality animatable 3D human avatars from a single image remains a significant challenge in computer vision due to the inherent difficulty of reconstructing complete 3D information from a single viewpoint. Current approaches face a clear limitation: 3D Gaussian Splatting (3DGS) methods produce high-quality results but require multiple views or video sequences, while video diffusion models can generate animations from single images but struggle with consistency and identity preservation. We present SVAD, a novel approach that addresses these limitations by leveraging complementary strengths of existing techniques. Our method generates synthetic training data through video diffusion, enhances it with identity preservation and image restoration modules, and utilizes this refined data to train 3DGS avatars. Comprehensive evaluations demonstrate that SVAD outperforms state-of-the-art (SOTA) single-image methods in maintaining identity consistency and fine details across novel poses and viewpoints, while enabling real-time rendering capabilities. Through our data augmentation pipeline, we overcome the dependency on dense monocular or multi-view training data typically required by traditional 3DGS approaches. Extensive quantitative and qualitative comparisons show our method achieves superior performance across multiple metrics against baseline models. By effectively combining the generative power of diffusion models with both the high-quality results and rendering efficiency of 3DGS, our work establishes a new approach for high-fidelity avatar generation from a single image input.

cs.GR [Back]

[80] ChannelExplorer: Exploring Class Separability Through Activation Channel Visualization

Md Rahat-uz-Zaman,Bei Wang,Paul Rosen

Main category: cs.GR

TL;DR: ChannelExplorer is an interactive visualization tool for analyzing how different layers and activation channels of deep neural networks contribute to class separability.

Details

Motivation: To understand the internal behavior of DNNs, in particular how different layers and activation channels affect class separability. Method: Analyzes activation patterns through three coordinated views (a scatterplot, a Jaccard similarity view, and a heatmap) and supports a variety of model architectures. Result: The tool proves useful for generating class hierarchies, finding mislabeled images, identifying activation channel contributions, and locating latent states. Conclusion: ChannelExplorer offers an effective data-driven approach to analyzing DNN behavior, and an evaluation with expert users supports its practicality. Abstract: Deep neural networks (DNNs) achieve state-of-the-art performance in many vision tasks, yet understanding their internal behavior remains challenging, particularly how different layers and activation channels contribute to class separability. We introduce ChannelExplorer, an interactive visual analytics tool for analyzing image-based outputs across model layers, emphasizing data-driven insights over architecture analysis for exploring class separability. ChannelExplorer summarizes activations across layers and visualizes them using three primary coordinated views: a Scatterplot View to reveal inter- and intra-class confusion, a Jaccard Similarity View to quantify activation overlap, and a Heatmap View to inspect activation channel patterns. Our technique supports diverse model architectures, including CNNs, GANs, ResNet and Stable Diffusion models. We demonstrate the capabilities of ChannelExplorer through four use-case scenarios: (1) generating class hierarchy in ImageNet, (2) finding mislabeled images, (3) identifying activation channel contributions, and (4) locating latent states' position in Stable Diffusion model. Finally, we evaluate the tool with expert users.
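
To illustrate what a Jaccard Similarity View might quantify, the sketch below computes the overlap between the sets of channels that two classes activate in one layer. The thresholding rule, the per-class mean activations, and the function names are assumptions; the tool's actual summarization of activations may differ.

```python
from typing import Sequence, Set


def active_channels(activations: Sequence[float], threshold: float = 0.5) -> Set[int]:
    """Indices of channels whose summarized activation exceeds `threshold` (assumed rule)."""
    return {i for i, a in enumerate(activations) if a > threshold}


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical mean activations of a 4-channel layer for two classes.
cat_channels = active_channels([0.9, 0.1, 0.8, 0.0])
dog_channels = active_channels([0.7, 0.6, 0.2, 0.0])
overlap = jaccard(cat_channels, dog_channels)
```

A high overlap at a given layer suggests the two classes are not yet separated there, while a drop in overlap at deeper layers indicates where the network starts to tell them apart.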

[81] Multimodal Benchmarking and Recommendation of Text-to-Image Generation Models

Kapil Wanaskar,Gaytri Jena,Magdalini Eirinaki

Main category: cs.GR

TL;DR: Presents an open-source, unified evaluation framework for text-to-image generation models, focusing on the impact of metadata-augmented prompts and validating it through quantitative and qualitative analysis.

Details

Motivation: To study how metadata-augmented prompts affect the performance of text-to-image generation models and to provide a unified evaluation method. Method: Uses the DeepFashion-MultiModal dataset and evaluates generated images with several quantitative metrics (such as Weighted Score, CLIP similarity, LPIPS, and FID) together with qualitative analysis. Result: Structured metadata enrichment markedly improves the visual realism, semantic fidelity, and robustness of generated images. Conclusion: Although not a traditional recommender system, the framework provides task-specific recommendations for model selection and prompt design based on the evaluation metrics. Abstract: This work presents an open-source unified benchmarking and evaluation framework for text-to-image generation models, with a particular focus on the impact of metadata-augmented prompts. Leveraging the DeepFashion-MultiModal dataset, we assess generated outputs through a comprehensive set of quantitative metrics, including Weighted Score, CLIP (Contrastive Language Image Pre-training)-based similarity, LPIPS (Learned Perceptual Image Patch Similarity), FID (Frechet Inception Distance), and retrieval-based measures, as well as qualitative analysis. Our results demonstrate that structured metadata enrichments greatly enhance visual realism, semantic fidelity, and model robustness across diverse text-to-image architectures. While not a traditional recommender system, our framework enables task-specific recommendations for model selection and prompt design based on evaluation metrics.

[82] MeshGen: Generating PBR Textured Mesh with Render-Enhanced Auto-Encoder and Generative Data Augmentation

Zilong Chen,Yikai Wang,Wenqiang Sun,Feng Wang,Yiwen Chen,Huaping Liu

Main category: cs.GR

TL;DR: MeshGen is an advanced image-to-3D generation pipeline that uses a novel point-to-shape auto-encoder and a multi-view ControlNet to address the shortcomings of existing 3D generative models in geometric detail and texture consistency.

Details

Motivation: Existing 3D generative models fall short in auto-encoder performance, controllability, generalization, and texture consistency; MeshGen aims to overcome these limitations. Method: MeshGen uses a render-enhanced point-to-shape auto-encoder, geometric augmentation and generative rendering augmentation, and a multi-view ControlNet with a PBR decomposer to preserve geometric detail and texture consistency. Result: Experiments show that MeshGen clearly outperforms existing methods in both shape and texture generation, setting a new quality standard for 3D mesh generation. Conclusion: MeshGen addresses key problems in 3D generation and sets a new standard for high-quality 3D mesh generation. Abstract: In this paper, we introduce MeshGen, an advanced image-to-3D pipeline that generates high-quality 3D meshes with detailed geometry and physically based rendering (PBR) textures. Addressing the challenges faced by existing 3D native diffusion models, such as suboptimal auto-encoder performance, limited controllability, poor generalization, and inconsistent image-based PBR texturing, MeshGen employs several key innovations to overcome these limitations. We pioneer a render-enhanced point-to-shape auto-encoder that compresses meshes into a compact latent space by designing perceptual optimization with ray-based regularization. This ensures that the 3D shapes are accurately represented and reconstructed to preserve geometric details within the latent space. To address data scarcity and image-shape misalignment, we further propose geometric augmentation and generative rendering augmentation techniques, which enhance the model's controllability and generalization ability, allowing it to perform well even with limited public datasets. For the texture generation, MeshGen employs a reference attention-based multi-view ControlNet for consistent appearance synthesis. This is further complemented by our multi-view PBR decomposer that estimates PBR components and a UV inpainter that fills invisible areas, ensuring a seamless and consistent texture across the 3D mesh. Our extensive experiments demonstrate that MeshGen largely outperforms previous methods in both shape and texture generation, setting a new standard for the quality of 3D meshes generated with PBR textures. See our code at https://github.com/heheyas/MeshGen, project page https://heheyas.github.io/MeshGen

[83] GSsplat: Generalizable Semantic Gaussian Splatting for Novel-view Synthesis in 3D Scenes

Feng Xiao,Hongbin Xu,Wanlin Liang,Wenxiong Kang

Main category: cs.GR

TL;DR: Proposes a generalizable semantic Gaussian Splatting method (GSsplat) for efficient novel-view synthesis, addressing the speed and segmentation performance limitations of existing methods.

Details

Motivation: Current methods for novel-view synthesis are limited in speed and semantic segmentation performance. Method: Predicts the positions and attributes of scene-adaptive Gaussian distributions, designs a hybrid network to extract color and semantic information, and proposes an offset learning module and a point-level interaction module. Result: With multi-view inputs, GSsplat achieves the fastest speed and state-of-the-art semantic synthesis performance. Conclusion: GSsplat is an efficient, high-performing method for semantic novel-view synthesis. Abstract: The semantic synthesis of unseen scenes from multiple viewpoints is crucial for research in 3D scene understanding. Current methods are capable of rendering novel-view images and semantic maps by reconstructing generalizable Neural Radiance Fields. However, they often suffer from limitations in speed and segmentation performance. We propose a generalizable semantic Gaussian Splatting method (GSsplat) for efficient novel-view synthesis. Our model predicts the positions and attributes of scene-adaptive Gaussian distributions from the input in one pass, replacing the densification and pruning processes of traditional scene-specific Gaussian Splatting. In the multi-task framework, a hybrid network is designed to extract color and semantic information and predict Gaussian parameters. To augment the spatial perception of Gaussians for high-quality rendering, we put forward a novel offset learning module through group-based supervision and a point-level interaction module with spatial unit aggregation. When evaluated with varying numbers of multi-view inputs, GSsplat achieves state-of-the-art performance for semantic synthesis at the fastest speed.

[84] Crafting Physical Adversarial Examples by Combining Differentiable and Physically Based Renders

Yuqiu Liu,Huanqian Yan,Xiaopei Zhu,Xiaolin Hu,Liang Tang,Hang Su,Chen Lv

Main category: cs.GR

TL;DR: Proposes PAV-Camou, a new method for generating robust adversarial camouflage for real vehicles, addressing the poor physical-world performance of existing methods.

Details

Motivation: Existing adversarial camouflage methods perform poorly in the physical world, mainly because training examples lack photorealism and because suitable physical realization methods are missing. Method: Adjusts the mapping from 2D map coordinates to the 3D model to reduce texture distortion, and combines two renderers to produce adversarial examples with realistic lighting and texture. Result: The generated camouflage performs well in both the digital and the physical world under a range of environmental conditions. Conclusion: PAV-Camou effectively addresses the physical realization of adversarial camouflage and provides a practical tool for testing the robustness of autonomous driving systems. Abstract: Recently we have witnessed progress in hiding road vehicles against object detectors through adversarial camouflage in the digital world. The extension of this technique to the physical world is crucial for testing the robustness of autonomous driving systems. However, existing methods do not show good performances when applied to the physical world. This is partly due to insufficient photorealism in training examples, and lack of proper physical realization methods for camouflage. To generate a robust adversarial camouflage suitable for real vehicles, we propose a novel method called PAV-Camou. We propose to adjust the mapping from the coordinates in the 2D map to those of the corresponding 3D model. This process is critical for mitigating texture distortion and ensuring the camouflage's effectiveness when applied in the real world. Then we combine two renderers with different characteristics to obtain adversarial examples that are photorealistic and closely mimic real-world lighting and texture properties. The method ensures that the generated textures remain effective under diverse environmental conditions. Our adversarial camouflage can be optimized and printed in the form of 2D patterns, allowing for direct application on real vehicles. Extensive experiments demonstrated that our proposed method achieved good performance in both the digital world and the physical world.

[85] SGCR: Spherical Gaussians for Efficient 3D Curve Reconstruction

Xinran Yang,Donghao Ji,Yuanqi Li,Jie Guo,Yanwen Guo,Junyuan Xie

Main category: cs.GR

TL;DR: Proposes Spherical Gaussians, a new representation that improves 3D geometric boundary reconstruction, and builds on it the SGCR algorithm, which markedly improves the efficiency and accuracy of 3D edge reconstruction.

Details

Motivation: 3D Gaussian Splatting has advanced photorealistic 3D scene generation but is weak at defining accurate 3D geometric structures. The authors propose the Spherical Gaussians representation to address this. Method: Represents 3D geometric boundaries with Spherical Gaussians optimized under a view-based rendering loss, and proposes the SGCR algorithm to extract parametric curves directly from the aligned Spherical Gaussians. Result: SGCR outperforms existing methods in 3D edge reconstruction while remaining efficient. Conclusion: Spherical Gaussians and SGCR offer a new solution for efficient 3D reconstruction and improve the accuracy of recovered 3D geometric structures. Abstract: Neural rendering techniques have made substantial progress in generating photo-realistic 3D scenes. The latest 3D Gaussian Splatting technique has achieved high quality novel view synthesis as well as fast rendering speed. However, 3D Gaussians lack proficiency in defining accurate 3D geometric structures despite their explicit primitive representations. This is due to the fact that Gaussian's attributes are primarily tailored and fine-tuned for rendering diverse 2D images by their anisotropic nature. To pave the way for efficient 3D reconstruction, we present Spherical Gaussians, a simple and effective representation for 3D geometric boundaries, from which we can directly reconstruct 3D feature curves from a set of calibrated multi-view images. Spherical Gaussians is optimized from grid initialization with a view-based rendering loss, where a 2D edge map is rendered at a specific view and then compared to the ground-truth edge map extracted from the corresponding image, without the need for any 3D guidance or supervision. Since Spherical Gaussians serve as an intermediary for robust edge representation, we further introduce a novel optimization-based algorithm called SGCR to directly extract accurate parametric curves from aligned Spherical Gaussians. We demonstrate that SGCR outperforms existing state-of-the-art methods in 3D edge reconstruction while enjoying great efficiency.

[86] WIR3D: Visually-Informed and Geometry-Aware 3D Shape Abstraction

Richard Liu,Daniel Fu,Noah Tan,Itai Lang,Rana Hanocka

Main category: cs.GR

TL;DR: WIR3D abstracts 3D shapes with a sparse set of visually meaningful curves, optimizing Bezier curves under the guidance of a pre-trained CLIP model in two phases that capture coarse geometry and fine-grained features.

Details

Motivation: To abstract the geometric and visual features of 3D shapes efficiently with sparse curves while supporting user control and downstream applications. Method: Optimizes Bezier curve parameters in two phases, guided by CLIP activations and a localized keypoint loss for fine details, with a neural SDF loss to maintain fidelity to the surface. Result: The method is applied successfully to 3D shapes of varying complexity and texture and supports feature control and shape deformation. Conclusion: WIR3D provides an efficient and controllable approach to 3D shape abstraction suited to a range of applications. Abstract: We present WIR3D, a technique for abstracting 3D shapes through a sparse set of visually meaningful curves in 3D. We optimize the parameters of Bezier curves such that they faithfully represent both the geometry and salient visual features (e.g. texture) of the shape from arbitrary viewpoints. We leverage the intermediate activations of a pre-trained foundation model (CLIP) to guide our optimization process. We divide our optimization into two phases: one for capturing the coarse geometry of the shape, and the other for representing fine-grained features. Our second phase supervision is spatially guided by a novel localized keypoint loss. This spatial guidance enables user control over abstracted features. We ensure fidelity to the original surface through a neural SDF loss, which allows the curves to be used as intuitive deformation handles. We successfully apply our method for shape abstraction over a broad dataset of shapes with varying complexity, geometric structure, and texture, and demonstrate downstream applications for feature control and shape deformation.
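
Since the optimized quantities here are Bezier curve parameters, a short sketch of how a curve is evaluated from its control points may help. The cubic degree and the function name `cubic_bezier` are assumptions; the abstract does not state the degree used.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def cubic_bezier(p0: Vec3, p1: Vec3, p2: Vec3, p3: Vec3, t: float) -> Vec3:
    """Evaluate a cubic Bezier curve in 3D at parameter t in [0, 1].

    B(t) = (1-t)^3 p0 + 3(1-t)^2 t p1 + 3(1-t) t^2 p2 + t^3 p3
    """
    s = 1.0 - t
    weights = (s ** 3, 3 * s ** 2 * t, 3 * s * t ** 2, t ** 3)  # Bernstein basis
    points = (p0, p1, p2, p3)
    return tuple(
        sum(w * p[axis] for w, p in zip(weights, points))
        for axis in range(3)
    )


# Example control points forming an arch in the z = 0 plane.
control = ((0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (1.0, 0.0, 0.0))
midpoint = cubic_bezier(*control, 0.5)
```

Because every curve point is a smooth function of the control points, a loss defined on rendered curves can be backpropagated to the control points, which is what makes this kind of curve optimization tractable in an autodiff framework.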

[87] ADD: Physics-Based Motion Imitation with Adversarial Differential Discriminators

Ziyu Zhang,Sergey Bashkirov,Dun Yang,Michael Taylor,Xue Bin Peng

Main category: cs.GR

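TL;DR: Proposes a new adversarial multi-objective optimization technique that needs no manual weight tuning and applies to multi-objective problems including motion tracking.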

Details

Motivation: Existing methods rely on manually tuned aggregation functions, which are time-consuming, demand domain expertise, and limit applicability. Method: An adversarial differential discriminator guides the optimization effectively from only a single positive sample. Result: The technique achieves high-fidelity motion tracking comparable to existing methods without manually tuned reward functions. Conclusion: The method has broad potential for multi-objective optimization, particularly motion tracking. Abstract: Multi-objective optimization problems, which require the simultaneous optimization of multiple terms, are prevalent across numerous applications. Existing multi-objective optimization methods often rely on manually tuned aggregation functions to formulate a joint optimization target. The performance of such hand-tuned methods is heavily dependent on careful weight selection, a time-consuming and laborious process. These limitations also arise in the setting of reinforcement-learning-based motion tracking for physically simulated characters, where intricately crafted reward functions are typically used to achieve high-fidelity results. Such solutions not only require domain expertise and significant manual adjustment, but also limit the applicability of the resulting reward function across diverse skills. To bridge this gap, we present a novel adversarial multi-objective optimization technique that is broadly applicable to a range of multi-objective optimization problems, including motion tracking. The proposed adversarial differential discriminator receives a single positive sample, yet is still effective at guiding the optimization process. We demonstrate that our technique can enable characters to closely replicate a variety of acrobatic and agile behaviors, achieving comparable quality to state-of-the-art motion-tracking methods, without relying on manually tuned reward functions. Results are best visualized through https://youtu.be/rz8BYCE9E2w.

[88] Inter-Diffusion Generation Model of Speakers and Listeners for Effective Communication

Jinhe Huang,Yongkang Cheng,Yuming Hang,Gaoge Han,Jinewei Li,Jing Zhang,Xingjian Gu

Main category: cs.GR

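TL;DR: Proposes an inter-diffusion generation model that, for the first time, brings listeners' full-body gestures into the generation framework; its inter-diffusion mechanism captures the complex interaction patterns between speakers and listeners and markedly improves the naturalness and synchronization of generated gestures.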

Details

Motivation: Existing work focuses mainly on speakers' gesture generation, overlooking the key role of listeners and leaving the dynamic interaction between the two underexplored. Method: Built on a diffusion model architecture, the approach introduces interaction conditions and a GAN model to increase the denoising step size, generating speaker gestures dynamically while responding in real time to listener feedback. Result: Experiments show the model outperforms existing methods in naturalness, coherence, and speech-gesture synchronization, with marked gains in both user ratings and objective metrics. Conclusion: The model offers stronger support for effective communication and produces interaction scenarios closer to real human communication. Abstract: Full-body gestures play a pivotal role in natural interactions and are crucial for achieving effective communication. Nevertheless, most existing studies primarily focus on the gesture generation of speakers, overlooking the vital role of listeners in the interaction process and failing to fully explore the dynamic interaction between them. This paper innovatively proposes an Inter-Diffusion Generation Model of Speakers and Listeners for Effective Communication. For the first time, we integrate the full-body gestures of listeners into the generation framework. By devising a novel inter-diffusion mechanism, this model can accurately capture the complex interaction patterns between speakers and listeners during communication. In the model construction process, based on the advanced diffusion model architecture, we innovatively introduce interaction conditions and the GAN model to increase the denoising step size. As a result, when generating gesture sequences, the model can not only dynamically generate based on the speaker's speech information but also respond in realtime to the listener's feedback, enabling synergistic interaction between the two. Abundant experimental results demonstrate that compared with the current state-of-the-art gesture generation methods, the model we proposed has achieved remarkable improvements in the naturalness, coherence, and speech-gesture synchronization of the generated gestures. In the subjective evaluation experiments, users highly praised the generated interaction scenarios, believing that they are closer to real life human communication situations. Objective index evaluations also show that our model outperforms the baseline methods in multiple key indicators, providing more powerful support for effective communication.

[89] Improving Global Motion Estimation in Sparse IMU-based Motion Capture with Physics

Xinyu Yi,Shaohua Pan,Feng Xu

Main category: cs.GR

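TL;DR: Combines a physics-based optimization scheme with 6 IMUs to capture human global and local motion more accurately and to estimate 3D contact, contact forces, and more.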

Details

Motivation: To address the difficulty of reconstructing human global motion from IMUs, especially motion in the z direction. Method: A physics-based optimization scheme built on multiple contacts, combined with a gravity constraint that refines global orientation and local pose estimation. Result: Experiments show more accurate capture of both local poses and global motion, along with estimates of 3D contact, forces, and related quantities. Conclusion: Deep integration of physics markedly improves the accuracy and capabilities of IMU-based motion capture. Abstract: By learning human motion priors, motion capture can be achieved by 6 inertial measurement units (IMUs) in recent years with the development of deep learning techniques, even though the sensor inputs are sparse and noisy. However, human global motions are still challenging to be reconstructed by IMUs. This paper aims to solve this problem by involving physics. It proposes a physical optimization scheme based on multiple contacts to enable physically plausible translation estimation in the full 3D space where the z-directional motion is usually challenging for previous works. It also considers gravity in local pose estimation which well constrains human global orientations and refines local pose estimation in a joint estimation manner. Experiments demonstrate that our method achieves more accurate motion capture for both local poses and global motions. Furthermore, by deeply integrating physics, we can also estimate 3D contact, contact forces, joint torques, and interacting proxy surfaces.

[90] An Active Contour Model for Silhouette Vectorization using Bézier Curves

Luis Alvarez,Jean-Michel Morel

Main category: cs.GR

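TL;DR: Proposes an active contour model based on cubic Bézier curves for silhouette vectorization that significantly outperforms existing methods.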

Details

Motivation: To address shortcomings of existing silhouette vectorization methods in accuracy and curve-length optimization. Method: Minimizes the distance between the Bézier curves and the silhouette boundary, optimizing endpoint locations, tangent orientations, and curve parameters. Result: Significantly reduces the average distance to the silhouette boundary relative to vectorizations produced by Inkscape, Adobe Illustrator, and a curvature-based method. Conclusion: The method performs well in accuracy and curve-length optimization and accepts an initial vectorization from any method. Abstract: In this paper, we propose an active contour model for silhouette vectorization using cubic Bézier curves. Among the end points of the Bézier curves, we distinguish between corner and regular points where the orientation of the tangent vector is prescribed. By minimizing the distance of the Bézier curves to the silhouette boundary, the active contour model optimizes the location of the Bézier curves end points, the orientation of the tangent vectors in the regular points, and the estimation of the Bézier curve parameters. This active contour model can use the silhouette vectorization obtained by any method as an initial guess. The proposed method significantly reduces the average distance between the silhouette boundary and its vectorization obtained by the world-class graphic software Inkscape, Adobe Illustrator, and a curvature-based vectorization method, which we introduce for comparison. Our method also allows us to impose additional regularity on the Bézier curves by reducing their lengths.
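
The quantity this abstract reports, the average distance between a silhouette boundary and its Bézier vectorization, can be illustrated with a short sketch. This is not the authors' active contour model: the function names and the sampled nearest-point approximation are assumptions made for illustration only.

```python
import math

def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a 2D cubic Bezier curve at parameter t in [0, 1]."""
    s = 1.0 - t
    # Bernstein basis weights for a cubic curve.
    b0, b1, b2, b3 = s ** 3, 3 * s * s * t, 3 * s * t * t, t ** 3
    x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
    y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
    return (x, y)

def mean_distance_to_curve(boundary, ctrl, samples=200):
    """Average distance from each boundary point to the nearest sampled
    curve point (a hypothetical, discretized stand-in for the fitting
    error that a vectorization method would try to reduce)."""
    curve = [cubic_bezier(*ctrl, i / samples) for i in range(samples + 1)]
    total = 0.0
    for bx, by in boundary:
        total += min(math.hypot(bx - cx, by - cy) for cx, cy in curve)
    return total / len(boundary)

# Toy data: control points on a straight segment, boundary points one unit above its ends.
ctrl = ((0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0))
boundary = [(0.0, 1.0), (3.0, 1.0)]
error = mean_distance_to_curve(boundary, ctrl)
```

An optimizer would adjust the end points and tangent directions in `ctrl` to lower this error; the sketch only evaluates it for fixed control points.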

[91] Time of the Flight of the Gaussians: Optimizing Depth Indirectly in Dynamic Radiance Fields

Runfeng Li,Mikhail Okunev,Zixuan Guo,Anh Ha Duong,Christian Richardt,Matthew O'Toole,James Tompkin

Main category: cs.GR

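TL;DR: Proposes a dynamic scene reconstruction method for monocular continuous-wave time-of-flight (C-ToF) cameras that matches or exceeds the accuracy of neural volumetric methods and is 100x faster.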

Details

Motivation: Fast, high-fidelity dynamic 3D reconstruction from a single viewpoint is a major challenge in computer vision, and in C-ToF sensing depth is not measured directly, which makes optimization harder. Method: Two heuristics are incorporated into the optimization to improve the accuracy of the scene geometry represented by Gaussians. Result: Experiments show accurate reconstructions under constrained C-ToF sensing conditions, including fast motion. Conclusion: The method performs well for dynamic scene reconstruction, particularly for fast-moving scenes. Abstract: We present a method to reconstruct dynamic scenes from monocular continuous-wave time-of-flight (C-ToF) cameras using raw sensor samples that achieves similar or better accuracy than neural volumetric approaches and is 100x faster. Quickly achieving high-fidelity dynamic 3D reconstruction from a single viewpoint is a significant challenge in computer vision. In C-ToF radiance field reconstruction, the property of interest, depth, is not directly measured, causing an additional challenge. This problem has a large and underappreciated impact upon the optimization when using a fast primitive-based scene representation like 3D Gaussian splatting, which is commonly used with multi-view data to produce satisfactory results and is brittle in its optimization otherwise. We incorporate two heuristics into the optimization to improve the accuracy of scene geometry represented by Gaussians. Experimental results show that our approach produces accurate reconstructions under constrained C-ToF sensing conditions, including for fast motions like swinging baseball bats. https://visual.cs.brown.edu/gftorf
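
To see why depth is only an indirect quantity in C-ToF sensing, the following sketch applies the textbook four-sample phase relationship, d = c·φ / (4π·f_mod). It is a generic illustration rather than the paper's pipeline: the sample ordering, sign convention, and the 30 MHz modulation frequency are assumptions.

```python
import math

C = 299_792_458.0  # speed of light in m/s

def phase_from_samples(q0, q90, q180, q270):
    """Phase offset from four correlation samples taken 90 degrees apart.
    Sign and ordering conventions vary between sensors; this is one choice."""
    return math.atan2(q90 - q270, q0 - q180) % (2.0 * math.pi)

def depth_from_phase(phase, f_mod):
    """Depth implied by a round-trip phase shift at modulation frequency f_mod."""
    return C * phase / (4.0 * math.pi * f_mod)

def unambiguous_range(f_mod):
    """Largest depth representable before the phase wraps around."""
    return C / (2.0 * f_mod)

# Simulate ideal, noise-free samples for a surface 2 m away (assumed values).
f_mod = 30e6
true_depth = 2.0
true_phase = 4.0 * math.pi * f_mod * true_depth / C
samples = [math.cos(true_phase - k * math.pi / 2.0) for k in range(4)]
recovered = depth_from_phase(phase_from_samples(*samples), f_mod)
```

Because the sensor records correlation samples and depth only emerges after this nonlinear phase step, an optimizer that fits raw samples constrains depth indirectly, which is the difficulty the abstract points to.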

cs.CL [Back]

Yusen Wu,Junwu Xiong,Xiaotie Deng

Main category: cs.CL

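TL;DR: Proposes HSII, a benchmark for evaluating large language models (LLMs) on complex social tasks, and introduces a sociologically grounded task-leveling framework and the HSII-Dataset.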

Details

Motivation: Existing benchmarks do not systematically evaluate LLMs on multi-user, multi-turn social tasks; this work fills that gap. Method: The HSII benchmark has four stages (format parsing, target selection, target switching conversation, and stable conversation); the work also studies the COT method and introduces the COT-complexity metric. Result: Experiments show that HSII evaluates LLMs' social capabilities effectively and that COT improves performance at a higher computational cost. Conclusion: HSII is well suited to evaluating LLMs' social skills, and COT-complexity balances correctness against efficiency. Abstract: Expanding the application of large language models (LLMs) to societal life, instead of primary function only as auxiliary assistants to communicate with only one person at a time, necessitates LLMs' capabilities to independently play roles in multi-user, multi-turn social agent tasks within complex social settings. However, currently the capability has not been systematically measured with available benchmarks. To address this gap, we first introduce an agent task leveling framework grounded in sociological principles. Concurrently, we propose a novel benchmark, How Social Is It (we call it HSII below), designed to assess LLM's social capabilities in comprehensive social agents tasks and benchmark representative models. HSII comprises four stages: format parsing, target selection, target switching conversation, and stable conversation, which collectively evaluate the communication and task completion capabilities of LLMs within realistic social interaction scenarios dataset, HSII-Dataset. The dataset is derived step by step from news dataset. We perform an ablation study by doing clustering to the dataset. Additionally, we investigate the impact of chain of thought (COT) method on enhancing LLMs' social performance. Since COT cost more computation, we further introduce a new statistical metric, COT-complexity, to quantify the efficiency of certain LLMs with COTs for specific social tasks and strike a better trade-off between measurement of correctness and efficiency. Various results of our experiments demonstrate that our benchmark is well-suited for evaluating social skills in LLMs.

[93] Adaptive Token Boundaries: Integrating Human Chunking Mechanisms into Multimodal LLMs

Dongxing Yu

Main category: cs.CL

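TL;DR: Examines how multimodal large language models (MLLMs) differ from human cognition in integrating information and proposes a dynamic cross-modal tokenization framework that significantly improves model performance.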

Details

Motivation: Current MLLMs lag well behind human cognition in integrating multimodal information; the study aims to improve models by emulating human cross-modal chunking. Method: Compares human and model behavior on visual-linguistic tasks and proposes a dynamic cross-modal tokenization framework with adaptive boundaries, hierarchical representations, and alignment mechanisms. Result: The framework significantly outperforms existing models on benchmark tasks (+7.8% on VQA, +5.3% on complex scene description), with error patterns and attention distributions closer to those of humans. Conclusion: The study deepens understanding of the relationship between human cognition and AI and provides empirical support for building more cognitively plausible AI systems. Abstract: Recent advancements in multimodal large language models (MLLMs) have demonstrated remarkable capabilities in processing diverse data types, yet significant disparities persist between human cognitive processes and computational approaches to multimodal information integration. This research presents a systematic investigation into the parallels between human cross-modal chunking mechanisms and token representation methodologies in MLLMs. Through empirical studies comparing human performance patterns with model behaviors across visual-linguistic tasks, we demonstrate that conventional static tokenization schemes fundamentally constrain current models' capacity to simulate the dynamic, context-sensitive nature of human information processing. We propose a novel framework for dynamic cross-modal tokenization that incorporates adaptive boundaries, hierarchical representations, and alignment mechanisms grounded in cognitive science principles. Quantitative evaluations demonstrate that our approach yields statistically significant improvements over state-of-the-art models on benchmark tasks (+7.8% on Visual Question Answering, +5.3% on Complex Scene Description) while exhibiting more human-aligned error patterns and attention distributions. These findings contribute to the theoretical understanding of the relationship between human cognition and artificial intelligence, while providing empirical evidence for developing more cognitively plausible AI systems.

[94] Language translation, and change of accent for speech-to-speech task using diffusion model

Abhishek Mishra,Ritesh Sur Chowdhury,Vartul Bahuguna,Isha Pandey,Ganesh Ramakrishnan

Main category: cs.CL

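TL;DR: Proposes a unified approach that performs speech translation and accent adaptation simultaneously, using a diffusion model to generate the target speech.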

Details

Motivation: Cross-cultural communication requires handling language translation and accent adaptation together, yet existing research pays little attention to the combined task. Method: Reformulates the problem as conditional generation and uses a diffusion model to generate Mel spectrograms of the target speech. Result: The approach is more parameter-efficient and effective than traditional pipelines and enables joint optimization of translation and accent adaptation. Conclusion: The unified framework offers a novel solution for speech-to-speech translation with accent adaptation. Abstract: Speech-to-speech translation (S2ST) aims to convert spoken input in one language to spoken output in another, typically focusing on either language translation or accent adaptation. However, effective cross-cultural communication requires handling both aspects simultaneously - translating content while adapting the speaker's accent to match the target language context. In this work, we propose a unified approach for simultaneous speech translation and change of accent, a task that remains underexplored in current literature. Our method reformulates the problem as a conditional generation task, where target speech is generated based on phonemes and guided by target speech features. Leveraging the power of diffusion models, known for high-fidelity generative capabilities, we adapt text-to-image diffusion strategies by conditioning on source speech transcriptions and generating Mel spectrograms representing the target speech with desired linguistic and accentual attributes. This integrated framework enables joint optimization of translation and accent adaptation, offering a more parameter-efficient and effective model compared to traditional pipelines.

[95] A Comparative Benchmark of a Moroccan Darija Toxicity Detection Model (Typica.ai) and Major LLM-Based Moderation APIs (OpenAI, Mistral, Anthropic)

Hicham Assoudi

Main category: cs.CL

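TL;DR: Typica.ai's Moroccan Darija toxicity detection model outperforms the mainstream LLM-based moderation APIs from OpenAI, Mistral, and Anthropic Claude, especially on culturally grounded toxic content.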

Details

Motivation: To evaluate Typica.ai's custom model against mainstream LLM moderation APIs for toxicity detection in Moroccan Darija, focusing on culturally grounded toxic content such as implicit insults, sarcasm, and culturally specific aggression. Method: On a balanced test set drawn from the OMCD_Typica.ai_Mix dataset, compares the Typica.ai model with the OpenAI, Mistral, and Anthropic Claude APIs on precision, recall, F1-score, and accuracy. Result: The Typica.ai model performs best, underscoring the importance of culturally adapted models for content moderation. Conclusion: Culturally adapted models are essential for reliable content moderation, especially for underrepresented languages and culturally specific content. Abstract: This paper presents a comparative benchmark evaluating the performance of Typica.ai's custom Moroccan Darija toxicity detection model against major LLM-based moderation APIs: OpenAI (omni-moderation-latest), Mistral (mistral-moderation-latest), and Anthropic Claude (claude-3-haiku-20240307). We focus on culturally grounded toxic content, including implicit insults, sarcasm, and culturally specific aggression often overlooked by general-purpose systems. Using a balanced test set derived from the OMCD_Typica.ai_Mix dataset, we report precision, recall, F1-score, and accuracy, offering insights into challenges and opportunities for moderation in underrepresented languages. Our results highlight Typica.ai's superior performance, underlining the importance of culturally adapted models for reliable content moderation.
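
The four reported metrics follow directly from a binary confusion matrix. The sketch below shows the standard definitions on made-up labels (1 = toxic, 0 = non-toxic); the function name and the toy data are illustrative assumptions, not values from the paper.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, F1, and accuracy for binary labels (1 = toxic)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Toy balanced test set: three toxic and three non-toxic examples.
scores = binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
```

On a balanced test set such as the one the abstract describes, accuracy is not dominated by the majority class, so it can be read alongside F1 without the usual imbalance caveat.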

[96] Rethinking Multimodal Sentiment Analysis: A High-Accuracy, Simplified Fusion Architecture

Nischal Mandal,Yang Li

Main category: cs.CL

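TL;DR: Proposes a lightweight multimodal sentiment analysis model whose simple feature-fusion strategy performs well in resource-constrained environments.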

Details

Motivation: To study multimodal sentiment analysis, which infers human emotion from language, audio, and visual signals, while avoiding the computational overhead of complex models. Method: Using the IEMOCAP dataset, modality-specific encoders (fully connected layers with dropout) feed a simple concatenation followed by a dense fusion layer that captures cross-modal interactions. Result: Reaches 92% accuracy on six-class emotion classification. Conclusion: With careful feature engineering and a modular design, a lightweight model can match or outperform more complex models in resource-constrained environments. Abstract: Multimodal sentiment analysis, a pivotal task in affective computing, seeks to understand human emotions by integrating cues from language, audio, and visual signals. While many recent approaches leverage complex attention mechanisms and hierarchical architectures, we propose a lightweight, yet effective fusion-based deep learning model tailored for utterance-level emotion classification. Using the benchmark IEMOCAP dataset, which includes aligned text, audio-derived numeric features, and visual descriptors, we design a modality-specific encoder using fully connected layers followed by dropout regularization. The modality-specific representations are then fused using simple concatenation and passed through a dense fusion layer to capture cross-modal interactions. This streamlined architecture avoids computational overhead while preserving performance, achieving a classification accuracy of 92% across six emotion categories. Our approach demonstrates that with careful feature engineering and modular design, simpler fusion strategies can outperform or match more complex models, particularly in resource-constrained environments.

[97] Prediction-powered estimators for finite population statistics in highly imbalanced textual data: Public hate crime estimation

Hannes Waldetoft,Jakob Torgander,Måns Magnusson

Main category: cs.CL

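TL;DR: Combines a Transformer neural network with survey sampling estimators to efficiently estimate target variables in text data, applied to Swedish hate crime statistics.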

Details

Motivation: To address the need for manual annotation of the target variable in text data and make estimation more efficient. Method: Uses predictions from a Transformer neural network as an auxiliary variable together with the Hansen-Hurwitz estimator, difference estimation, and stratified random sampling estimation. Result: The method's effectiveness is demonstrated on Swedish hate crime statistics, with substantially less time spent on manual annotation. Conclusion: When labeled data is available, the method yields efficient estimates and reduces the manual annotation burden. Abstract: Estimating population parameters in finite populations of text documents can be challenging when obtaining the labels for the target variable requires manual annotation. To address this problem, we combine predictions from a transformer encoder neural network with well-established survey sampling estimators using the model predictions as an auxiliary variable. The applicability is demonstrated in Swedish hate crime statistics based on Swedish police reports. Estimates of the yearly number of hate crimes and the police's under-reporting are derived using the Hansen-Hurwitz estimator, difference estimation, and stratified random sampling estimation. We conclude that if labeled training data is available, the proposed method can provide very efficient estimates with reduced time spent on manual annotation.
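
Two of the named estimators can be written in a few lines, which makes the role of the model predictions as an auxiliary variable concrete. This is a minimal sketch of the textbook forms under simple assumptions (simple random sampling for the difference estimator, with-replacement draws for Hansen-Hurwitz); the function names and toy numbers are hypothetical, and the paper's exact sampling design may differ.

```python
def difference_estimate(predictions, sample_idx, sample_labels):
    """Difference estimator of a population total under simple random sampling.
    Starts from the sum of model predictions over the whole population and
    corrects it with the scaled prediction error seen on the labeled sample."""
    n_pop = len(predictions)
    n_sample = len(sample_idx)
    correction = sum(y - predictions[i] for i, y in zip(sample_idx, sample_labels))
    return sum(predictions) + n_pop / n_sample * correction

def hansen_hurwitz_estimate(sample_labels, draw_probs):
    """Hansen-Hurwitz estimator of a total for with-replacement sampling,
    where draw_probs[k] is the per-draw selection probability of unit k.
    Setting those probabilities proportional to model scores is an assumption."""
    n_sample = len(sample_labels)
    return sum(y / p for y, p in zip(sample_labels, draw_probs)) / n_sample

# Toy population of four documents with predicted hate-crime probabilities.
predictions = [0.9, 0.1, 0.8, 0.2]
diff_total = difference_estimate(predictions, sample_idx=[0, 1], sample_labels=[1, 0])
hh_total = hansen_hurwitz_estimate([1, 0, 1], [0.5, 0.25, 0.5])
```

The better the predictions track the true labels, the smaller the correction term varies from sample to sample, which is why a good classifier lets the same precision be reached with fewer manually annotated documents.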

[98] ChatGPT for automated grading of short answer questions in mechanical ventilation

Tejas Jade,Alex Yartsev

Main category: cs.CL

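TL;DR: Evaluates ChatGPT 4o for automated grading of short answer questions in a postgraduate medical course and finds that its marks differ substantially from those of human graders, especially on analytic items; caution is advised.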

Details

Motivation: To explore whether large language models (LLMs) are suitable for automated grading of standardised short answer questions (SAQs), particularly in medical education. Method: ChatGPT 4o graded 557 short-answer responses from 215 students; the marks were compared with human grades using several statistical measures of agreement. Result: ChatGPT's marks were systematically lower than human marks (mean bias -1.34), agreement was poor (ICC1 = 0.086, Cohen's kappa = -0.0786), and disagreement was greatest on analytic items. Conclusion: LLM-based automated grading is not recommended for high-stakes assessment because agreement with human graders is insufficient. Abstract: Standardised tests using short answer questions (SAQs) are common in postgraduate education. Large language models (LLMs) simulate conversational language and interpret unstructured free-text responses in ways aligning with applying SAQ grading rubrics, making them attractive for automated grading. We evaluated ChatGPT 4o to grade SAQs in a postgraduate medical setting using data from 215 students (557 short-answer responses) enrolled in an online course on mechanical ventilation (2020-2024). Deidentified responses to three case-based scenarios were presented to ChatGPT with a standardised grading prompt and rubric. Outputs were analysed using mixed-effects modelling, variance component analysis, intraclass correlation coefficients (ICCs), Cohen's kappa, Kendall's W, and Bland-Altman statistics. ChatGPT awarded systematically lower marks than human graders with a mean difference (bias) of -1.34 on a 10-point scale. ICC values indicated poor individual-level agreement (ICC1 = 0.086), and Cohen's kappa (-0.0786) suggested no meaningful agreement. Variance component analysis showed minimal variability among the five ChatGPT sessions (G-value = 0.87), indicating internal consistency but divergence from the human grader. The poorest agreement was observed for evaluative and analytic items, whereas checklist and prescriptive rubric items had less disagreement. We caution against the use of LLMs in grading postgraduate coursework. Over 60% of ChatGPT-assigned grades differed from human grades by more than acceptable boundaries for high-stakes assessments.
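
Two of the agreement statistics named in this abstract, the Bland-Altman bias and Cohen's kappa, are simple enough to sketch. The code below uses the standard definitions on invented marks; it is only an illustration of how such figures are computed, not a reproduction of the study's analysis.

```python
from collections import Counter
from statistics import mean, stdev

def bland_altman(model_marks, human_marks):
    """Mean difference (bias) and 95% limits of agreement between two raters."""
    diffs = [m - h for m, h in zip(model_marks, human_marks)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias, bias - spread, bias + spread

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on categorical grades."""
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if both raters assigned grades independently.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented marks on a 10-point scale (not data from the study).
bias, lower, upper = bland_altman([5, 6, 7], [6, 8, 8])
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

A negative bias means the first rater marks lower on average, and a kappa at or below zero means agreement is no better than chance, which is how the reported -1.34 and -0.0786 should be read.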

[99] FRAME: Feedback-Refined Agent Methodology for Enhancing Medical Research Insights

Chengzhang Yu,Yiming Zhang,Zhixin Liu,Zenghui Ding,Yining Sun,Zhanpeng Jin

Main category: cs.CL

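TL;DR: The FRAME framework improves the quality of generated medical papers through iterative refinement and structured feedback, significantly outperforming conventional approaches across multiple models and evaluation dimensions.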

Details

Motivation: To address the knowledge synthesis and quality assurance challenges that large language models face in scientific research and medical paper generation. Method: FRAME comprises a structured dataset construction method, a tripartite agent architecture (Generator, Evaluator, Reflector), and a comprehensive evaluation framework. Result: FRAME improves over conventional approaches across multiple models (9.91% average gain with DeepSeek V3); generated papers approach human-authored quality and are particularly strong at synthesizing future research directions. Conclusion: FRAME offers an efficient and rigorous approach to automated medical paper generation that could advance medical research. Abstract: The automation of scientific research through large language models (LLMs) presents significant opportunities but faces critical challenges in knowledge synthesis and quality assurance. We introduce Feedback-Refined Agent Methodology (FRAME), a novel framework that enhances medical paper generation through iterative refinement and structured feedback. Our approach comprises three key innovations: (1) A structured dataset construction method that decomposes 4,287 medical papers into essential research components through iterative refinement; (2) A tripartite architecture integrating Generator, Evaluator, and Reflector agents that progressively improve content quality through metric-driven feedback; and (3) A comprehensive evaluation framework that combines statistical metrics with human-grounded benchmarks. Experimental results demonstrate FRAME's effectiveness, achieving significant improvements over conventional approaches across multiple models (9.91% average gain with DeepSeek V3, comparable improvements with GPT-4o Mini) and evaluation dimensions. Human evaluation confirms that FRAME-generated papers achieve quality comparable to human-authored works, with particular strength in synthesizing future research directions. The results demonstrated our work could efficiently assist medical research by building a robust foundation for automated medical research paper generation while maintaining rigorous academic standards.

[100] Scientific Hypothesis Generation and Validation: Methods, Datasets, and Future Directions

Adithya Kulkarni,Fatimah Alotaibi,Xinyue Zeng,Longfeng Wu,Tong Zeng,Barry Menglong Yao,Minqian Liu,Shuaicheng Zhang,Lifu Huang,Dawei Zhou

Main category: cs.CL

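TL;DR: Surveys the use of large language models (LLMs) for scientific hypothesis generation and validation, covering a range of methods and techniques, contrasting early symbolic systems with modern LLM pipelines, and outlining future directions and ethical considerations.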

Details

Motivation: To examine how LLMs advance scientific discovery through information synthesis, latent relationship discovery, and reasoning augmentation, and to systematize existing methods and techniques. Method: Analyzes symbolic frameworks, generative models, hybrid systems, and multi-agent architectures, along with techniques such as retrieval-augmented generation, knowledge-graph completion, simulation, causal inference, and tool-assisted reasoning. Result: Maps LLM applications and datasets across biomedicine, materials science, environmental science, and social science, and introduces new resources such as AHTech and CSKG-600. Conclusion: Outlines future directions including novelty-aware generation, multimodal-symbolic integration, human-in-the-loop systems, and ethical safeguards, positioning LLMs as powerful tools for scientific discovery. Abstract: Large Language Models (LLMs) are transforming scientific hypothesis generation and validation by enabling information synthesis, latent relationship discovery, and reasoning augmentation. This survey provides a structured overview of LLM-driven approaches, including symbolic frameworks, generative models, hybrid systems, and multi-agent architectures. We examine techniques such as retrieval-augmented generation, knowledge-graph completion, simulation, causal inference, and tool-assisted reasoning, highlighting trade-offs in interpretability, novelty, and domain alignment. We contrast early symbolic discovery systems (e.g., BACON, KEKADA) with modern LLM pipelines that leverage in-context learning and domain adaptation via fine-tuning, retrieval, and symbolic grounding. For validation, we review simulation, human-AI collaboration, causal modeling, and uncertainty quantification, emphasizing iterative assessment in open-world contexts. The survey maps datasets across biomedicine, materials science, environmental science, and social science, introducing new resources like AHTech and CSKG-600. Finally, we outline a roadmap emphasizing novelty-aware generation, multimodal-symbolic integration, human-in-the-loop systems, and ethical safeguards, positioning LLMs as agents for principled, scalable scientific discovery.

[101] Advancing Conversational Diagnostic AI with Multimodal Reasoning

Khaled Saab,Jan Freyberg,Chunjong Park,Tim Strother,Yong Cheng,Wei-Hung Weng,David G. T. Barrett,David Stutz,Nenad Tomasev,Anil Palepu,Valentin Liévin,Yash Sharma,Roma Ruparel,Abdullah Ahmed,Elahe Vedadi,Kimberly Kanada,Cian Hughes,Yun Liu,Geoff Brown,Yang Gao,Sean Li,S. Sara Mahdavi,James Manyika,Katherine Chou,Yossi Matias,Avinatan Hassidim,Dale R. Webster,Pushmeet Kohli,S. M. Ali Eslami,Joëlle Barral,Adam Rodman,Vivek Natarajan,Mike Schaekermann,Tao Tu,Alan Karthikesalingam,Ryutaro Tanno

Main category: cs.CL

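TL;DR: AMIE gains the ability to handle multimodal data, which improves its diagnostic conversation performance; it outperforms primary care physicians in a simulated study.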

Details

Motivation: To evaluate LLMs in multimodal medical consultations, reflecting the real-world requirements of remote care. Method: Gemini 2.0 Flash powers a state-aware dialogue framework that controls conversation flow dynamically and directs follow-up questions by uncertainty. Result: AMIE outperforms primary care physicians on both multimodal and non-multimodal evaluation axes, including diagnostic accuracy. Conclusion: Multimodal conversational diagnostic AI has made progress, but real-world deployment needs further research. Abstract: Large Language Models (LLMs) have demonstrated great potential for conducting diagnostic conversations but evaluation has been largely limited to language-only interactions, deviating from the real-world requirements of remote care delivery. Instant messaging platforms permit clinicians and patients to upload and discuss multimodal medical artifacts seamlessly in medical consultation, but the ability of LLMs to reason over such data while preserving other attributes of competent diagnostic conversation remains unknown. Here we advance the conversational diagnosis and management performance of the Articulate Medical Intelligence Explorer (AMIE) through a new capability to gather and interpret multimodal data, and reason about this precisely during consultations. Leveraging Gemini 2.0 Flash, our system implements a state-aware dialogue framework, where conversation flow is dynamically controlled by intermediate model outputs reflecting patient states and evolving diagnoses. Follow-up questions are strategically directed by uncertainty in such patient states, leading to a more structured multimodal history-taking process that emulates experienced clinicians. We compared AMIE to primary care physicians (PCPs) in a randomized, blinded, OSCE-style study of chat-based consultations with patient actors. We constructed 105 evaluation scenarios using artifacts like smartphone skin photos, ECGs, and PDFs of clinical documents across diverse conditions and demographics. Our rubric assessed multimodal capabilities and other clinically meaningful axes like history-taking, diagnostic accuracy, management reasoning, communication, and empathy. Specialist evaluation showed AMIE to be superior to PCPs on 7/9 multimodal and 29/32 non-multimodal axes (including diagnostic accuracy). The results show clear progress in multimodal conversational diagnostic AI, but real-world translation needs further research.

[102] A Comparative Analysis of Ethical and Safety Gaps in LLMs using Relative Danger Coefficient

Yehor Tereshchenko,Mika Hämäläinen

Main category: cs.CL

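TL;DR: Compares the ethical performance of several AI models, proposes a new harm metric, the Relative Danger Coefficient (RDC), and stresses the importance of human oversight in high-stakes scenarios.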

Details

Motivation: 探讨AI和LLMs快速发展带来的伦理问题，如安全性、滥用和歧视。 Method: 对DeepSeek-V3、GPT系列和Gemini等模型进行伦理性能比较分析，并提出RDC指标。 Result: 揭示了不同模型在伦理表现上的差异，并展示了RDC的应用价值。 Conclusion: 强调人类监督的必要性，并建议进一步研究AI伦理问题。 Abstract: Artificial Intelligence (AI) and Large Language Models (LLMs) have rapidly evolved in recent years, showcasing remarkable capabilities in natural language understanding and generation. However, these advancements also raise critical ethical questions regarding safety, potential misuse, discrimination and overall societal impact. This article provides a comparative analysis of the ethical performance of various AI models, including the brand new DeepSeek-V3(R1 with reasoning and without), various GPT variants (4o, 3.5 Turbo, 4 Turbo, o1/o3 mini) and Gemini (1.5 flash, 2.0 flash and 2.0 flash exp) and highlights the need for robust human oversight, especially in situations with high stakes. Furthermore, we present a new metric for calculating harm in LLMs called Relative Danger Coefficient (RDC).

Paul Landes,Jimeng Sun,Adam Cross

Main category: cs.CL

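TL;DR: Proposes a method combining traditional deep learning with large language models (LLMs) to automatically extract social determinants of health (SDoH) from clinical text; it beats the previous reference point by 10 points on multilabel classification and speeds up execution 12x.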

Details

Motivation: Social determinants of health (SDoH) strongly influence health status, but traditional methods are inefficient and costly, so a more efficient automated solution is needed. Method: Combines traditional deep learning with LLMs, augments the dataset with synthetic data, and reduces expensive LLM processing. Result: The models beat the previous reference point by 10 points on multilabel SDoH classification, run 12x faster, and perform well on a dataset supplemented with synthetic data. Conclusion: The method offers an efficient and precise solution for automatic SDoH prediction, particularly for at-risk patients. Abstract: Social Determinants of Health (SDoH) are economic, social and personal circumstances that affect or influence an individual's health status. SDoHs have shown to be correlated to wellness outcomes, and therefore, are useful to physicians in diagnosing diseases and in decision-making. In this work, we automatically extract SDoHs from clinical text using traditional deep learning and Large Language Models (LLMs) to find the advantages and disadvantages of each on an existing publicly available dataset. Our models outperform a previous reference point on a multilabel SDoH classification by 10 points, and we present a method and model to drastically speed up classification (12X execution time) by eliminating expensive LLM processing. The method we present combines a more nimble and efficient solution that leverages the power of the LLM for precision and traditional deep learning methods for efficiency. We also show highly performant results on a dataset supplemented with synthetic data and several traditional deep learning models that outperform LLMs. Our models and methods offer the next iteration of automatic prediction of SDoHs that impact at-risk patients.

[104] AI-Generated Fall Data: Assessing LLMs and Diffusion Model for Wearable Fall Detection

Sana Alamgeer,Yasine Souissi,Anne H. H. Ngu

Main category: cs.CL

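TL;DR: Explores using large language models (LLMs) to generate synthetic fall data to offset the scarcity of real data, evaluating how different models perform and how the generated data affects fall detection.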

Details

Motivation: Real fall data, especially from elderly individuals, is scarce, which makes training fall detection systems difficult; the study therefore explores the potential of synthetic data. Method: Evaluates text-to-motion models (T2M, SATO, ParCo) and text-to-text models (GPT4o, GPT4, Gemini) for generating synthetic data, combines the output with real datasets, and measures performance with an LSTM model. It also compares how well LLM-generated and diffusion-generated data align with the real data distribution. Result: The effectiveness of synthetic data depends on dataset characteristics: LLM-generated data works best at low sampling frequency (20Hz) and is unstable at high frequency (200Hz). Diffusion-generated data aligns most closely with real data but does not consistently improve model performance. Conclusion: Optimizing synthetic data requires attention to sensor placement and fall representation; the study offers guidance for synthetic data generation for fall detection models. Abstract: Training fall detection systems is challenging due to the scarcity of real-world fall data, particularly from elderly individuals. To address this, we explore the potential of Large Language Models (LLMs) for generating synthetic fall data. This study evaluates text-to-motion (T2M, SATO, ParCo) and text-to-text models (GPT4o, GPT4, Gemini) in simulating realistic fall scenarios. We generate synthetic datasets and integrate them with four real-world baseline datasets to assess their impact on fall detection performance using a Long Short-Term Memory (LSTM) model. Additionally, we compare LLM-generated synthetic data with a diffusion-based method to evaluate their alignment with real accelerometer distributions. Results indicate that dataset characteristics significantly influence the effectiveness of synthetic data, with LLM-generated data performing best in low-frequency settings (e.g., 20Hz) while showing instability in high-frequency datasets (e.g., 200Hz). While text-to-motion models produce more realistic biomechanical data than text-to-text models, their impact on fall detection varies. Diffusion-based synthetic data demonstrates the closest alignment to real data but does not consistently enhance model performance. An ablation study further confirms that the effectiveness of synthetic data depends on sensor placement and fall representation. These findings provide insights into optimizing synthetic data generation for fall detection models.

[105] Personalized Risks and Regulatory Strategies of Large Language Models in Digital Advertising

Haoyang Feng,Yanjun Dai,Yuan Gao

Main category: cs.CL

TL;DR: This paper studies the personalization risks and regulatory strategies of large language models in digital advertising and proposes an ad recommendation algorithm that combines a BERT model with privacy protection.

Details

Motivation: Examines how advertising recommendation systems can be combined with user privacy protection and data security measures to address personalization risks in real-world operation. Method: Combines the BERT model with an attention mechanism to build an algorithmic model for personalized ad recommendation and user privacy protection, covering data preprocessing, feature selection, semantic embedding, and local model training with data encryption. Result: Experiments show that BERT-based ad delivery improves click-through and conversion rates, while the privacy protection mechanisms reduce the risk of user data leakage. Conclusion: Large language models show promise for ad recommendation, but must be paired with privacy protection measures to deliver safe and effective personalization. Abstract: Although large language models have demonstrated the potential for personalized advertising recommendations in experimental environments, in actual operations, how advertising recommendation systems can be combined with measures such as user privacy protection and data security is still an area worthy of in-depth discussion. To this end, this paper studies the personalized risks and regulatory strategies of large language models in digital advertising. This study first outlines the principles of Large Language Model (LLM), especially the self-attention mechanism based on the Transformer architecture, and how to enable the model to understand and generate natural language text. Then, the BERT (Bidirectional Encoder Representations from Transformers) model and the attention mechanism are combined to construct an algorithmic model for personalized advertising recommendations and user factor risk protection. The specific steps include: data collection and preprocessing, feature selection and construction, using large language models such as BERT for advertising semantic embedding, and ad recommendations based on user portraits. Then, local model training and data encryption are used to ensure the security of user privacy and avoid the leakage of personal data. This paper designs an experiment for personalized advertising recommendation based on a large language model of BERT and verifies it with real user data. The experimental results show that BERT-based advertising push can effectively improve the click-through rate and conversion rate of advertisements. At the same time, through local model training and privacy protection mechanisms, the risk of user privacy leakage can be reduced to a certain extent.

[106] Fine-Tuning Large Language Models and Evaluating Retrieval Methods for Improved Question Answering on Building Codes

Mohammad Aqib,Mohd Hamza,Qipei Mei,Ying Hei Chui

Main category: cs.CL

TL;DR: The paper presents a retrieval-augmented generation (RAG) question-answering system for handling the complexity of building-code queries, focusing on the effect of retrieval methods and language model fine-tuning.

Details

Motivation: Building codes are complex and frequently updated, and manual lookup is inefficient, so an automated QA system is needed to improve query efficiency and accuracy. Method: Within a RAG framework, evaluates different retrieval methods (such as Elasticsearch) on the National Building Code of Canada (NBCC) and applies domain-specific fine-tuning to several language models. Result: Experiments show that Elasticsearch is the most robust retriever and that fine-tuned language models generate more relevant answers. Conclusion: Combining an effective retriever with a fine-tuned language model lets a RAG system better handle the complexity of building codes. Abstract: Building codes are regulations that establish standards for the design, construction, and safety of buildings to ensure structural integrity, fire protection, and accessibility. They are often extensive, complex, and subject to frequent updates, making manual querying challenging and time-consuming. Key difficulties include navigating large volumes of text, interpreting technical language, and identifying relevant clauses across different sections. A potential solution is to build a Question-Answering (QA) system that answers user queries based on building codes. Among the various methods for building a QA system, Retrieval-Augmented Generation (RAG) stands out in performance. RAG consists of two components: a retriever and a language model. This study focuses on identifying a suitable retriever method for building codes and optimizing the generational capability of the language model using fine-tuning techniques. We conducted a detailed evaluation of various retrieval methods by performing the retrieval on the National Building Code of Canada (NBCC) and explored the impact of domain-specific fine-tuning on several language models using the dataset derived from NBCC. Our analysis included a comparative assessment of different retrievers and the performance of both pre-trained and fine-tuned models to determine the efficacy and domain-specific adaptation of language models using fine-tuning on the NBCC dataset. Experimental results showed that Elasticsearch proved to be the most robust retriever among all. The findings also indicate that fine-tuning language models on an NBCC-specific dataset can enhance their ability to generate contextually relevant responses. When combined with context retrieved by a powerful retriever like Elasticsearch, this improvement in LLM performance can optimize the RAG system, enabling it to better navigate the complexities of the NBCC.
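
The two RAG components named in this abstract, a retriever and a language model, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the term-overlap scorer is a crude stand-in for a real retriever such as Elasticsearch, the clauses are invented (not NBCC text), and the language model call is left out, with only the prompt it would receive being built.

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def score(query, passage):
    """Count occurrences of query terms in the passage (a crude lexical score)."""
    counts = Counter(tokenize(passage))
    return sum(counts[t] for t in set(tokenize(query)))

def retrieve(query, passages, k=2):
    """Return the k passages with the highest lexical overlap."""
    return sorted(passages, key=lambda p: score(query, p), reverse=True)[:k]

def build_prompt(query, contexts):
    """Assemble retrieved context and the question for the language model."""
    joined = "\n".join(f"- {c}" for c in contexts)
    return f"Context:\n{joined}\n\nQuestion: {query}\nAnswer:"

# Hypothetical clauses, invented for illustration (not NBCC text).
CLAUSES = [
    "Exit stairs shall have a minimum clear width.",
    "Guards are required where the difference in elevation exceeds a limit.",
    "Fire separations shall have a fire-resistance rating.",
]
QUESTION = "What fire-resistance rating do fire separations need?"
top = retrieve(QUESTION, CLAUSES, k=1)
prompt = build_prompt(QUESTION, top)
```

In a full system the `prompt` string would be passed to the (possibly fine-tuned) language model; swapping the scorer for a stronger retriever changes only `retrieve`.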

[107] Reward-SQL: Boosting Text-to-SQL via Stepwise Reasoning and Process-Supervised Rewards

Yuxin Zhang,Meihao Fan,Ju Fan,Mingyang Yi,Yuyu Luo,Jian Tan,Guoliang Li

Main category: cs.CL

TL;DR: The Reward-SQL framework improves Text-to-SQL performance by incorporating Process Reward Models (PRMs). It follows a staged strategy (cold start, then PRM supervision), identifies the best integration approach (GRPO training combined with PRM-guided inference), and achieves substantial gains on the BIRD benchmark.

Details

Motivation: PRMs can improve reasoning accuracy in Text-to-SQL, but misuse can distort the reasoning trajectory. Reward-SQL systematically explores how to integrate PRMs effectively. Method: A staged strategy: 1) a cold start trains the model to decompose SQL into structured reasoning chains (Chain-of-CTEs); 2) four PRM integration strategies are studied, with GRPO (PRM as an online training signal) combined with PRM-guided inference working best. Result: On the BIRD benchmark, Reward-SQL gives models supervised by a 7B PRM a 13.1% performance gain; under GRPO, the model based on Qwen2.5-Coder-7B-Instruct reaches 68.9% accuracy on the development set, outperforming baselines of the same size. Conclusion: Reward-SQL effectively improves Text-to-SQL reasoning through reward-based supervision and validates a sound PRM integration strategy. The code is open source. Abstract: Recent advances in large language models (LLMs) have significantly improved performance on the Text-to-SQL task by leveraging their powerful reasoning capabilities. To enhance accuracy during the reasoning process, external Process Reward Models (PRMs) can be introduced during training and inference to provide fine-grained supervision. However, if misused, PRMs may distort the reasoning trajectory and lead to suboptimal or incorrect SQL generation. To address this challenge, we propose Reward-SQL, a framework that systematically explores how to incorporate PRMs into the Text-to-SQL reasoning process effectively. Our approach follows a "cold start, then PRM supervision" paradigm. Specifically, we first train the model to decompose SQL queries into structured stepwise reasoning chains using common table expressions (Chain-of-CTEs), establishing a strong and interpretable reasoning baseline. Then, we investigate four strategies for integrating PRMs, and find that combining PRM as an online training signal (GRPO) with PRM-guided inference (e.g., best-of-N sampling) yields the best results. Empirically, on the BIRD benchmark, Reward-SQL enables models supervised by a 7B PRM to achieve a 13.1% performance gain across various guidance strategies. Notably, our GRPO-aligned policy model based on Qwen2.5-Coder-7B-Instruct achieves 68.9% accuracy on the BIRD development set, outperforming all baseline methods under the same model size. These results demonstrate the effectiveness of Reward-SQL in leveraging reward-based supervision for Text-to-SQL reasoning. Our code is publicly available.
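
To make the idea of decomposing a query into a chain of common table expressions concrete, here is a small sketch using Python's built-in `sqlite3` module. The table, data, and query are invented for illustration and are not taken from the paper or from BIRD; each CTE plays the role of one reasoning step that a process reward model could, in principle, score separately.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'ann', 30.0), (2, 'bob', 5.0),
                          (3, 'ann', 20.0), (4, 'cy', 40.0);
""")

# Each CTE is one reasoning step; later steps build on earlier ones.
CHAIN_OF_CTES = """
WITH totals AS (              -- step 1: total spend per customer
    SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer
),
big_spenders AS (             -- step 2: keep customers at or above a threshold
    SELECT customer, total FROM totals WHERE total >= 40
)
SELECT customer FROM big_spenders ORDER BY total DESC, customer  -- final step
"""

rows = [r[0] for r in conn.execute(CHAIN_OF_CTES)]
print(rows)
```

Because every step is a named intermediate table, each one can be inspected on its own (for example with `SELECT * FROM totals`), which is what makes the chain interpretable.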

[108] REVEAL: Multi-turn Evaluation of Image-Input Harms for Vision LLM

Madhur Jindal,Saurabh Deshpande

Main category: cs.CL

TL;DR: The paper introduces the REVEAL framework for evaluating the safety of vision large language models (VLLMs), finding that defect rates in multi-turn conversations are significantly higher than in single-turn evaluation, with GPT-4o showing the most balanced performance.

Details

Motivation: Traditional safety evaluation frameworks cannot handle the complexity of VLLMs in multi-modal, multi-turn conversations, so new methods are needed. Method: Proposes the REVEAL framework, comprising automated image mining, synthetic adversarial data generation, multi-turn conversational expansion, and comprehensive harm assessment. Result: Five VLLMs were evaluated; multi-turn conversations yield higher defect rates, GPT-4o shows the most balanced performance, and Llama-3.2 has the highest multi-turn defect rate. Conclusion: Multi-turn conversations expose deeper vulnerabilities in VLLMs, and contextual defenses need strengthening, especially against misinformation. Abstract: Vision Large Language Models (VLLMs) represent a significant advancement in artificial intelligence by integrating image-processing capabilities with textual understanding, thereby enhancing user interactions and expanding application domains. However, their increased complexity introduces novel safety and ethical challenges, particularly in multi-modal and multi-turn conversations. Traditional safety evaluation frameworks, designed for text-based, single-turn interactions, are inadequate for addressing these complexities. To bridge this gap, we introduce the REVEAL (Responsible Evaluation of Vision-Enabled AI LLMs) Framework, a scalable and automated pipeline for evaluating image-input harms in VLLMs. REVEAL includes automated image mining, synthetic adversarial data generation, multi-turn conversational expansion using crescendo attack strategies, and comprehensive harm assessment through evaluators like GPT-4o. We extensively evaluated five state-of-the-art VLLMs, GPT-4o, Llama-3.2, Qwen2-VL, Phi3.5V, and Pixtral, across three important harm categories: sexual harm, violence, and misinformation. Our findings reveal that multi-turn interactions result in significantly higher defect rates compared to single-turn evaluations, highlighting deeper vulnerabilities in VLLMs. Notably, GPT-4o demonstrated the most balanced performance as measured by our Safety-Usability Index (SUI), followed closely by Pixtral. Additionally, misinformation emerged as a critical area requiring enhanced contextual defenses. Llama-3.2 exhibited the highest multi-turn (MT) defect rate (16.55%) while Qwen2-VL showed the highest MT refusal rate (19.1%).

[109] Advanced Deep Learning Approaches for Automated Recognition of Cuneiform Symbols

Shahad Elshehaby,Alavikunhu Panthakkan,Hussain Al-Ahmad,Mina Al-Saad

Main category: cs.CL

TL;DR: This paper presents an automated deep-learning method for recognizing and interpreting cuneiform characters, evaluates five different models on performance metrics, and finds that two of them perform strongly.

Details

Motivation: Explores the application of deep learning to deciphering ancient cuneiform, combining computational linguistics and archaeology to offer new perspectives on understanding and preserving human history. Method: Trains five deep-learning models, evaluates them on a cuneiform dataset, and tests them on the Code of Hammurabi. Result: Two models perform strongly, accurately recognizing the Akkadian meanings of cuneiform symbols and providing precise English translations. Conclusion: Deep learning shows potential for deciphering ancient scripts; future work will further optimize performance through ensemble and stacking methods. Abstract: This paper presents a thoroughly automated method for identifying and interpreting cuneiform characters via advanced deep-learning algorithms. Five distinct deep-learning models were trained on a comprehensive dataset of cuneiform characters and evaluated according to critical performance metrics, including accuracy and precision. Two models demonstrated outstanding performance and were subsequently assessed using cuneiform symbols from the Hammurabi law acquisition, notably Hammurabi Law 1. Each model effectively recognized the relevant Akkadian meanings of the symbols and delivered precise English translations. Future work will investigate ensemble and stacking approaches to optimize performance, utilizing hybrid architectures to improve detection accuracy and reliability. This research explores the linguistic relationships between Akkadian, an ancient Mesopotamian language, and Arabic, emphasizing their historical and cultural linkages. This study demonstrates the capability of deep learning to decipher ancient scripts by merging computational linguistics with archaeology, therefore providing significant insights for the comprehension and conservation of human history.

[110] SOAEsV2-7B/72B: Full-Pipeline Optimization for State-Owned Enterprise LLMs via Continual Pre-Training, Domain-Progressive SFT and Distillation-Enhanced Speculative Decoding

Jingyang Deng,Ran Chen,Jo-Ku Cheng,Jinwen Ma

Main category: cs.CL

TL;DR: The study proposes a development framework for domain-specific large language models (LLMs) for Chinese state-owned assets and enterprises, addressing limited model capacity, over-reliance on domain data, and inefficient inference. A three-phase approach (continual pre-training, domain-progressive fine-tuning, and distillation-enhanced inference) significantly improves domain performance while preserving general capabilities.

Details

Motivation: Current approaches to building domain-specific LLMs suffer from constrained model capacity, excessive reliance on domain data, and inefficient inference, calling for a more comprehensive solution. Method: A three-phase framework: 1) continual pre-training to integrate domain knowledge; 2) domain-progressive fine-tuning (from weakly relevant data to expert-annotated data); 3) distillation-enhanced inference acceleration. Result: The domain pre-training phase retains 99.8% of general capabilities while significantly improving domain performance (1.08× in Rouge-1, 1.17× in BLEU-4). Progressive fine-tuning outperforms single-stage training (1.02× in Rouge-1, 1.06× in BLEU-4). Inference is 1.39-1.52× faster. Conclusion: The study provides a full-pipeline optimization method for domain-specific LLMs that balances general capabilities and domain performance and addresses the limitations of existing approaches. Abstract: This study addresses key challenges in developing domain-specific large language models (LLMs) for Chinese state-owned assets and enterprises (SOAEs), where current approaches face three limitations: 1) constrained model capacity that limits knowledge integration and cross-task adaptability; 2) excessive reliance on domain-specific supervised fine-tuning (SFT) data, which neglects the broader applicability of general language patterns; and 3) inefficient inference acceleration for large models processing long contexts. In this work, we propose SOAEsV2-7B/72B, a specialized LLM series developed via a three-phase framework: 1) continual pre-training integrates domain knowledge while retaining base capabilities; 2) domain-progressive SFT employs curriculum-based learning strategy, transitioning from weakly relevant conversational data to expert-annotated SOAEs datasets to optimize domain-specific tasks; 3) distillation-enhanced speculative decoding accelerates inference via logit distillation between 72B target and 7B draft models, achieving 1.39-1.52× speedup without quality loss. Experimental results demonstrate that our domain-specific pre-training phase maintains 99.8% of original general language capabilities while significantly improving domain performance, resulting in a 1.08× improvement in Rouge-1 score and a 1.17× enhancement in BLEU-4 score. Ablation studies further show that domain-progressive SFT outperforms single-stage training, achieving 1.02× improvement in Rouge-1 and 1.06× in BLEU-4. Our work introduces a comprehensive, full-pipeline approach for optimizing SOAEs LLMs, bridging the gap between general language capabilities and domain-specific expertise.
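
Speculative decoding, as named in this abstract, pairs a small draft model with a large target model. The toy sketch below shows the greedy variant under loud assumptions: both "models" are deterministic arithmetic functions invented for illustration, tokens are digits, and the logit distillation described in the abstract is not modeled. It only demonstrates why the output matches what the target alone would produce while needing fewer verification rounds.

```python
def target_next(seq):
    """Toy 'large' model: deterministic next token from the sequence."""
    return (sum(seq) * 7 + len(seq)) % 10

def draft_next(seq):
    """Toy 'small' model: agrees with the target unless len(seq) % 4 == 0."""
    t = target_next(seq)
    return t if len(seq) % 4 else (t + 1) % 10

def greedy_decode(prompt, n_new):
    """Reference: decode with the target model only, one token per step."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq

def speculative_decode(prompt, n_new, k=3):
    seq, target_len = list(prompt), len(prompt) + n_new
    verify_rounds = 0
    while len(seq) < target_len:
        # 1) The draft model proposes k tokens autoregressively.
        proposal, ctx = [], list(seq)
        for _ in range(k):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2) The target checks the proposals (one batched pass in a real system).
        verify_rounds += 1
        for tok in proposal:
            expected = target_next(seq)
            seq.append(expected)        # the target's token is always the one kept
            if tok != expected or len(seq) == target_len:
                break                   # first mismatch (or full length) ends the round
    return seq, verify_rounds

out, rounds = speculative_decode([1, 2], 12)
print(out == greedy_decode([1, 2], 12), rounds)
```

The speedup in practice depends on how often the draft agrees with the target, which is presumably what distilling the target's logits into the draft is meant to increase.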

[111] Flower Across Time and Media: Sentiment Analysis of Tang Song Poetry and Visual Correspondence

Shuai Gong,Tiange Zhou

Main category: cs.CL

TL;DR: The study uses BERT-based sentiment analysis to quantify emotional patterns in floral imagery in Tang and Song poetry and compares them with developments in decorative arts, revealing synergies between literary expression and artistic representation.

Details

Motivation: Fills the research gap on the systematic correlation between literary emotion and visual culture. Method: Applies a fine-tuned BERT model to analyze peony and plum blossom imagery in Tang and Song poetry, cross-validating against visual evidence from textiles, ceramics, and other material culture. Result: Finds measurable shifts in the emotional connotations of floral imagery between the Tang and Song periods and reveals previously unrecognized synergies between literature and artistic representation. Conclusion: The study offers a new perspective on Tang and Song cultural expression and shows the potential of combining computational humanities with traditional sinological methods. Abstract: The Tang (618 to 907) and Song (960 to 1279) dynasties witnessed an extraordinary flourishing of Chinese cultural expression, where floral motifs served as a dynamic medium for both poetic sentiment and artistic design. While previous scholarship has examined these domains independently, the systematic correlation between evolving literary emotions and visual culture remains underexplored. This study addresses that gap by employing BERT-based sentiment analysis to quantify emotional patterns in floral imagery across Tang Song poetry, then validating these patterns against contemporaneous developments in decorative arts. Our approach builds upon recent advances in computational humanities while remaining grounded in traditional sinological methods. By applying a fine-tuned BERT model to analyze peony and plum blossom imagery in classical poetry, we detect measurable shifts in emotional connotations between the Tang and Song periods. These textual patterns are then cross-referenced with visual evidence from textiles, ceramics, and other material culture, revealing previously unrecognized synergies between literary expression and artistic representation.

[112] Osiris: A Lightweight Open-Source Hallucination Detection System

Alex Shan,John Bauer,Christopher D. Manning

Main category: cs.CL

TL;DR: The paper proposes a supervised fine-tuning approach to detecting hallucinations in RAG systems; a 7B-parameter model achieves better recall than GPT-4o on the RAGTruth benchmark.

Details

Motivation: Hallucinations keep RAG systems from being deployed in production, and existing detection methods are costly and hard to scale. Method: Builds a perturbed multi-hop QA dataset with induced hallucinations and performs supervised fine-tuning on it. Result: A 7B-parameter model achieves better recall than GPT-4o on the RAGTruth benchmark, with competitive precision and accuracy. Conclusion: The approach offers an efficient, low-cost solution for hallucination detection. Abstract: Retrieval-Augmented Generation (RAG) systems have gained widespread adoption by application builders because they leverage sources of truth to enable Large Language Models (LLMs) to generate more factually sound responses. However, hallucinations, instances of LLM responses that are unfaithful to the provided context, often prevent these systems from being deployed in production environments. Current hallucination detection methods typically involve human evaluation or the use of closed-source models to review RAG system outputs for hallucinations. Both human evaluators and closed-source models suffer from scaling issues due to their high costs and slow inference speeds. In this work, we introduce a perturbed multi-hop QA dataset with induced hallucinations. Via supervised fine-tuning on our dataset, we achieve better recall with a 7B model than GPT-4o on the RAGTruth hallucination detection benchmark and offer competitive performance on precision and accuracy, all while using a fraction of the parameters. Code is released at our repository.

[113] Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards

Manveer Singh Tamber,Forrest Sheng Bao,Chenyu Xu,Ge Luo,Suleman Kazi,Minseok Bae,Miaoran Li,Ofer Mendelevitch,Renyi Qu,Jimmy Lin

Main category: cs.CL

TL;DR: The paper examines LLM hallucination, proposes the FaithJudge method to improve evaluation, and introduces a new leaderboard.

Details

Motivation: LLMs still frequently hallucinate in RAG, and existing evaluation methods such as HHEM have shortcomings. Method: Proposes FaithJudge, an LLM-as-a-judge approach guided by a small number of human annotations, to improve automated evaluation. Result: FaithJudge substantially outperforms current methods, and a new hallucination leaderboard is introduced. Conclusion: FaithJudge provides a more reliable benchmark for evaluating LLM hallucinations in RAG. Abstract: Hallucinations remain a persistent challenge for LLMs. RAG aims to reduce hallucinations by grounding responses in contexts. However, even when provided context, LLMs still frequently introduce unsupported information or contradictions. This paper presents our efforts to measure LLM hallucinations with a focus on summarization tasks, assessing how often various LLMs introduce hallucinations when summarizing documents. We discuss Vectara's existing LLM hallucination leaderboard, based on the Hughes Hallucination Evaluation Model (HHEM). While HHEM and Vectara's Hallucination Leaderboard have garnered great research interest, we examine challenges faced by HHEM and current hallucination detection methods by analyzing the effectiveness of these methods on existing hallucination datasets. To address these limitations, we propose FaithJudge, an LLM-as-a-judge approach guided by few-shot human hallucination annotations, which substantially improves automated LLM hallucination evaluation over current methods. We introduce an enhanced hallucination leaderboard centered on FaithJudge, alongside our current hallucination leaderboard, enabling more reliable benchmarking of LLMs for hallucinations in RAG.

[114] An Open-Source Dual-Loss Embedding Model for Semantic Retrieval in Higher Education

Ramteja Sajja,Yusuf Sermet,Ibrahim Demir

Main category: cs.CL

TL;DR: The paper presents two open-source embedding models optimized for educational question answering, using a synthetic dataset and a dual-loss training strategy to improve semantic retrieval performance.

Details

Motivation: Existing semantic retrieval systems are ill-suited to the distinctive linguistic and structural characteristics of academic content, so specially optimized embedding models are needed. Method: Builds a synthetic dataset of 3,197 sentence pairs and evaluates two training strategies: a baseline model fine-tuned with MNRL and a dual-loss model combining MNRL with CosineSimilarityLoss. Result: Both fine-tuned models outperform open-source baselines, and the dual-loss model narrows the gap with high-performing proprietary embeddings (such as OpenAI's text-embedding-3 series). Conclusion: The study contributes reusable education-domain embedding models and a semantic retrieval framework, supporting downstream applications such as academic chatbots and RAG systems. Abstract: Recent advances in AI have catalyzed the adoption of intelligent educational tools, yet many semantic retrieval systems remain ill-suited to the unique linguistic and structural characteristics of academic content. This study presents two open-source embedding models fine-tuned for educational question answering, particularly in the context of course syllabi. A synthetic dataset of 3,197 sentence pairs, spanning synonymous terminology, paraphrased questions, and implicit-explicit mappings, was constructed through a combination of manual curation and large language model (LLM)-assisted generation. Two training strategies were evaluated: (1) a baseline model fine-tuned using MultipleNegativesRankingLoss (MNRL), and (2) a dual-loss model that combines MNRL with CosineSimilarityLoss to improve both semantic ranking and similarity calibration. Evaluations were conducted on 28 university course syllabi using a fixed set of natural language questions categorized into course, faculty, and teaching assistant information. Results demonstrate that both fine-tuned models outperform strong open-source baselines, including all-MiniLM-L6-v2 and multi-qa-MiniLM-L6-cos-v1, and that the dual-loss model narrows the performance gap with high-performing proprietary embeddings such as OpenAI's text-embedding-3 series. This work contributes reusable, domain-aligned embedding models and provides a replicable framework for educational semantic retrieval, supporting downstream applications such as academic chatbots, retrieval-augmented generation (RAG) systems, and learning management system (LMS) integrations.
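
The two objectives named in this abstract can be written out in a few lines. The sketch below is a plain-Python approximation, not the authors' training code: the ranking term is an in-batch cross-entropy over scaled cosine similarities, the calibration term is a mean squared error against target similarity scores, and the `scale` and `weight` values are assumptions (the abstract does not say how the two losses are weighted).

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mnrl(anchors, positives, scale=20.0):
    """In-batch ranking loss: each anchor should score its own positive highest."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [scale * cosine(a, p) for p in positives]
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_sum_exp - logits[i]   # cross-entropy with target index i
    return total / len(anchors)

def cosine_similarity_loss(anchors, others, labels):
    """Mean squared error between cosine similarity and a target score."""
    errs = [(cosine(a, o) - y) ** 2 for a, o, y in zip(anchors, others, labels)]
    return sum(errs) / len(errs)

def dual_loss(anchors, positives, labels, weight=1.0):
    # `weight` is a hypothetical mixing coefficient, not a value from the paper.
    return mnrl(anchors, positives) + weight * cosine_similarity_loss(anchors, positives, labels)

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]      # each positive matches its anchor
swapped = [[0.0, 1.0], [1.0, 0.0]]      # positives paired with the wrong anchor
print(dual_loss(anchors, aligned, [1.0, 1.0]) < dual_loss(anchors, swapped, [1.0, 1.0]))
```

The ranking term only cares about relative order within the batch, while the cosine term pushes absolute similarity values toward the labels, which is one plausible reading of "semantic ranking and similarity calibration".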

[115] Chain-of-Thought Tokens are Computer Program Variables

Fangwei Zhu,Peiyi Wang,Zhifang Sui

Main category: cs.CL

TL;DR: The study finds that key tokens in chain-of-thought (CoT), such as those storing intermediate results, are critical to model performance, while other tokens may be unnecessary. CoT tokens may behave like program variables, though with potential drawbacks.

Details

Motivation: Investigates the inner mechanism of chain-of-thought (CoT) in large language models (LLMs), particularly its role in complex reasoning tasks. Method: Empirically studies the role of CoT tokens on two compositional tasks (multi-digit multiplication and dynamic programming), including preserving only key tokens, replacing them with latent forms, and randomly intervening on token values. Result: Preserving only the tokens that store intermediate results achieves performance comparable to full CoT; CoT tokens may function like program variables, but with potential problems. Conclusion: CoT tokens may act like variables during reasoning, but the mechanism needs further study to optimize performance and avoid drawbacks. Abstract: Chain-of-thoughts (CoT) requires large language models (LLMs) to generate intermediate steps before reaching the final answer, and has been proven effective to help LLMs solve complex reasoning tasks. However, the inner mechanism of CoT still remains largely unclear. In this paper, we empirically study the role of CoT tokens in LLMs on two compositional tasks: multi-digit multiplication and dynamic programming. While CoT is essential for solving these problems, we find that preserving only tokens that store intermediate results would achieve comparable performance. Furthermore, we observe that storing intermediate results in an alternative latent form will not affect model performance. We also randomly intervene some values in CoT, and notice that subsequent CoT tokens and the final answer would change correspondingly. These findings suggest that CoT tokens may function like variables in computer programs but with potential drawbacks like unintended shortcuts and computational complexity limits between tokens. The code and data are available at https://github.com/solitaryzero/CoTs_are_Variables.
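
The analogy between CoT tokens and program variables can be made concrete with an ordinary program. The sketch below is an analogy only, written under the assumption that the intermediate results of multi-digit multiplication are the partial products; it does not involve a language model. Overwriting one stored value changes the final answer, mirroring the intervention experiment described in the abstract.

```python
def multiply_with_cot(a, b, intervene=None):
    """Multiply a*b via partial products, recording each as a named step.

    `intervene` optionally maps a step index to an overwritten value,
    loosely mimicking an intervention on one intermediate result.
    """
    trace, total = [], 0
    for i, digit in enumerate(reversed(str(b))):
        partial = a * int(digit) * 10 ** i   # the "variable" stored at step i
        if intervene and i in intervene:
            partial = intervene[i]
        trace.append(partial)
        total += partial
    return trace, total

trace, answer = multiply_with_cot(123, 45)
print(trace, answer)                                     # [615, 4920] 5535
print(multiply_with_cot(123, 45, intervene={0: 600})[1])  # 5520
```

Only the stored partial products matter for the final sum; any surrounding narration would be redundant, which parallels the finding that keeping just the intermediate-result tokens preserves performance.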

[116] Rethinking the Relationship between the Power Law and Hierarchical Structures

Kai Nakaishi,Ryo Yoshida,Kohei Kajikawa,Koji Hukushima,Yohei Oseki

Main category: cs.CL

TL;DR: Through statistical analysis of corpora, the paper examines power laws in natural language and their relationship to hierarchical structure, finds that the assumptions behind the existing argument do not hold for syntactic structures, and calls for rethinking the link between power laws and hierarchical structures.

Details

Motivation: Tests whether the power-law decay of correlation supports the hypothesis of hierarchical structure in language, specifically in syntax, and examines its applicability to child language and animal signals. Method: Analyzes mutual information, deviations from probabilistic context-free grammars (PCFGs), and other properties of parse trees in English corpora. Result: The assumptions do not hold for syntactic structures, and the argument is difficult to extend to child language and animal signals. Conclusion: The relationship between the power law and hierarchical structures needs to be reconsidered; the existing interpretation may be insufficient. Abstract: Statistical analysis of corpora provides an approach to quantitatively investigate natural languages. This approach has revealed that several power laws consistently emerge across different corpora and languages, suggesting the universal principles underlying languages. Particularly, the power-law decay of correlation has been interpreted as evidence for underlying hierarchical structures in syntax, semantics, and discourse. This perspective has also been extended to child languages and animal signals. However, the argument supporting this interpretation has not been empirically tested. To address this problem, this study examines the validity of the argument for syntactic structures. Specifically, we test whether the statistical properties of parse trees align with the implicit assumptions in the argument. Using English corpora, we analyze the mutual information, deviations from probabilistic context-free grammars (PCFGs), and other properties in parse trees, as well as in the PCFG that approximates these trees. Our results indicate that the assumptions do not hold for syntactic structures and that it is difficult to apply the proposed argument to child languages and animal signals, highlighting the need to reconsider the relationship between the power law and hierarchical structures.
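
The correlation decay discussed in this abstract is usually measured as the mutual information between symbols separated by a distance d. The sketch below is a minimal plug-in estimator over a flat symbol sequence; it is an assumption about the kind of quantity involved, not the paper's procedure, which works on parse trees and may use different estimators. Plug-in estimates are biased upward on small samples.

```python
import math
from collections import Counter

def mutual_information(seq, d):
    """Plug-in estimate of I(X_t; X_{t+d}) in bits from one symbol sequence."""
    pairs = list(zip(seq, seq[d:]))
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)
    right = Counter(y for _, y in pairs)
    return sum(
        (c / n) * math.log2(c * n / (left[x] * right[y]))
        for (x, y), c in joint.items()
    )

alternating = "ab" * 50
print(mutual_information(alternating, 1))   # close to 1 bit: the next symbol is fully determined
print(mutual_information("a" * 20, 1))      # 0.0: a constant sequence carries no information
```

Plotting this quantity against d on log-log axes is the usual way to check whether the decay looks like a power law rather than an exponential.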

[117] Latent Preference Coding: Aligning Large Language Models via Discrete Latent Codes

Zhuocheng Gong,Jian Guan,Wei Wu,Huishuai Zhang,Dongyan Zhao

Main category: cs.CL

TL;DR: The paper proposes Latent Preference Coding (LPC), a new framework that models the implicit factors behind human preferences and their combinations without relying on pre-defined reward functions. Experiments show that LPC consistently improves several alignment algorithms and increases robustness to noise in the data.

Details

Motivation: Despite the remarkable success of large language models (LLMs), aligning their generations with human preferences remains a key challenge. Existing methods usually rely on an explicit or implicit reward function and overlook the complex, multifaceted nature of human preferences. Method: LPC models the implicit factors behind preferences and their combinations using discrete latent codes, automatically inferring the factors and their importance from data without pre-defined reward functions or hand-crafted combination weights. Result: Experiments on multiple benchmarks show that LPC consistently improves three alignment algorithms (DPO, SimPO, and IPO) and makes alignment more robust to noise in the data. Conclusion: LPC provides a unified representation for multifaceted preference factors, paving the way for more robust and versatile alignment techniques and supporting the responsible deployment of LLMs. Abstract: Large language models (LLMs) have achieved remarkable success, yet aligning their generations with human preferences remains a critical challenge. Existing approaches to preference modeling often rely on an explicit or implicit reward function, overlooking the intricate and multifaceted nature of human preferences that may encompass conflicting factors across diverse tasks and populations. To address this limitation, we introduce Latent Preference Coding (LPC), a novel framework that models the implicit factors as well as their combinations behind holistic preferences using discrete latent codes. LPC seamlessly integrates with various offline alignment algorithms, automatically inferring the underlying factors and their importance from data without relying on pre-defined reward functions and hand-crafted combination weights. Extensive experiments on multiple benchmarks demonstrate that LPC consistently improves upon three alignment algorithms (DPO, SimPO, and IPO) using three base models (Mistral-7B, Llama3-8B, and Llama3-8B-Instruct). Furthermore, deeper analysis reveals that the learned latent codes effectively capture the differences in the distribution of human preferences and significantly enhance the robustness of alignment against noise in data. By providing a unified representation for the multifarious preference factors, LPC paves the way towards developing more robust and versatile alignment techniques for the responsible deployment of powerful LLMs.
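
As a reference point for the offline alignment algorithms this abstract mentions, the standard DPO loss for a single preference pair can be sketched as below. This is the well-known baseline objective only; the discrete latent codes of LPC are not shown, and the argument names and the `beta` value are illustrative assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities.

    logp_* come from the policy being trained, ref_* from the frozen reference model.
    """
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return math.log1p(math.exp(-beta * margin))   # equals -log(sigmoid(beta * margin))

# Policy identical to the reference: zero margin, loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy raises the chosen response relative to the reference: loss drops below log(2).
print(dpo_loss(-8.0, -12.0, -10.0, -12.0))
```

The single scalar margin is the implicit reward the abstract refers to; a method that models several preference factors would need to replace or condition this one number.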

[118] Rethinking Invariance in In-context Learning

Lizhe Fang,Yifei Wang,Khashayar Gatmiry,Lei Fang,Yisen Wang

Main category: cs.CL

TL;DR: InvICL is a new in-context learning (ICL) method that addresses the inability of existing methods to achieve information non-leakage and context interdependence simultaneously, and it performs strongly on most benchmark datasets.

Details

Motivation: Standard ICL is sensitive to the order of context examples, and existing invariant variants cannot satisfy information non-leakage and context interdependence at the same time. Method: Proposes InvICL, an algorithm designed to satisfy both information non-leakage and context interdependence while making ICL permutation invariant. Result: InvICL outperforms previous methods on most benchmark datasets and generalizes better across varying input lengths. Conclusion: By satisfying both key design elements, InvICL improves the performance and stability of ICL. Abstract: In-Context Learning (ICL) has emerged as a pivotal capability of auto-regressive large language models, yet it is hindered by a notable sensitivity to the ordering of context examples regardless of their mutual independence. To address this issue, recent studies have introduced several variant algorithms of ICL that achieve permutation invariance. However, many of these do not exhibit comparable performance with the standard auto-regressive ICL algorithm. In this work, we identify two crucial elements in the design of an invariant ICL algorithm: information non-leakage and context interdependence, which are not simultaneously achieved by any of the existing methods. These investigations lead us to the proposed Invariant ICL (InvICL), a methodology designed to achieve invariance in ICL while ensuring the two properties. Empirically, our findings reveal that InvICL surpasses previous models, both invariant and non-invariant, in most benchmark datasets, showcasing superior generalization capabilities across varying input lengths. Code is available at https://github.com/PKU-ML/InvICL.

Cedric Waterschoot,Nava Tintarev,Francesco Barile

Main category: cs.CL

TL;DR: The study investigates the conditions under which large language models (LLMs) correctly execute social choice-based aggregation strategies in a zero-shot setting and how prompt formatting affects accuracy. Performance deteriorates beyond 100 ratings, but in-context learning can significantly improve it.

Details

Motivation: Explores the conditions for applying LLMs in group recommender systems (GRS), in particular how prompt formatting and group complexity affect model performance under zero-shot learning. Method: Experimentally analyzes different LLMs under different prompting conditions (such as in-context learning and explanation generation), along with the effects of group complexity (number of users and items) and the formatting of preferences. Result: Performance deteriorates beyond 100 ratings, but in-context learning significantly improves results at higher group complexity; prompt formatting (such as listing ratings per user or per item) affects accuracy. Conclusion: Future research should account for the effect of group complexity on LLM performance; smaller models can generate group recommendations effectively under the right conditions, reducing compute cost and resource needs. Abstract: Large Language Models (LLMs) are increasingly applied in recommender systems aimed at both individuals and groups. Previously, Group Recommender Systems (GRS) often used social choice-based aggregation strategies to derive a single recommendation based on the preferences of multiple people. In this paper, we investigate under which conditions language models can perform these strategies correctly based on zero-shot learning and analyse whether the formatting of the group scenario in the prompt affects accuracy. We specifically focused on the impact of group complexity (number of users and items), different LLMs, different prompting conditions, including In-Context learning or generating explanations, and the formatting of group preferences. Our results show that performance starts to deteriorate when considering more than 100 ratings. However, not all language models were equally sensitive to growing group complexity. Additionally, we showed that In-Context Learning (ICL) can significantly increase the performance at higher degrees of group complexity, while adding other prompt modifications, specifying domain cues or prompting for explanations, did not impact accuracy. We conclude that future research should include group complexity as a factor in GRS evaluation due to its effect on LLM performance. Furthermore, we showed that formatting the group scenarios differently, such as rating lists per user or per item, affected accuracy. All in all, our study implies that smaller LLMs are capable of generating group recommendations under the right conditions, making the case for using smaller models that require less computing power and costs.
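
Social choice-based aggregation strategies turn several users' ratings into one group recommendation. The abstract does not name the strategies it tests, so the three below (additive, least misery, most pleasure) are assumed examples from the group-recommendation literature, and the ratings are invented. A function like this gives the ground truth against which an LLM's answer could be checked.

```python
def aggregate(ratings, strategy):
    """ratings: {user: {item: score}}; returns the recommended item."""
    items = sorted(next(iter(ratings.values())))
    per_item = {i: [r[i] for r in ratings.values()] for i in items}
    scorers = {
        "additive": sum,         # highest total rating
        "least_misery": min,     # highest minimum rating
        "most_pleasure": max,    # highest maximum rating
    }
    score = scorers[strategy]
    # Ties go to the alphabetically first item (an arbitrary choice for this sketch).
    return max(items, key=lambda i: score(per_item[i]))

# Hypothetical group of three users rating three items.
RATINGS = {
    "u1": {"A": 10, "B": 6, "C": 7},
    "u2": {"A": 1, "B": 6, "C": 7},
    "u3": {"A": 10, "B": 7, "C": 2},
}
for name in ("additive", "least_misery", "most_pleasure"):
    print(name, aggregate(RATINGS, name))
```

With 3 users and 3 items the scenario has 9 ratings; the degradation reported in the abstract concerns scenarios with more than 100.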

[120] Scalable Multi-Stage Influence Function for Large Language Models via Eigenvalue-Corrected Kronecker-Factored Parameterization

Yuntai Bao,Xuhong Zhang,Tianyu Du,Xinkui Zhao,Jiang Zong,Hao Peng,Jianwei Yin

Main category: cs.CL

TL;DR: The paper proposes a multi-stage influence function that attributes the predictions of fine-tuned large language models (LLMs) to pre-training data, using EK-FAC parameterization for efficiency.

Details

Motivation: Existing methods cannot compute multi-stage influence and do not scale to billion-parameter LLMs, so a new method is needed to explain where the predictions of fine-tuned models come from. Method: Proposes the multi-stage influence function, combined with EK-FAC parameterization for efficient approximation. Result: Experiments validate the scalability of the EK-FAC approximation and the effectiveness of the multi-stage influence function, and case studies demonstrate its interpretive power. Conclusion: The multi-stage influence function is a scalable, effective tool for explaining LLM predictions. Abstract: Pre-trained large language models (LLMs) are commonly fine-tuned to adapt to downstream tasks. Since the majority of knowledge is acquired during pre-training, attributing the predictions of fine-tuned LLMs to their pre-training data may provide valuable insights. Influence functions have been proposed as a means to explain model predictions based on training data. However, existing approaches fail to compute "multi-stage" influence and lack scalability to billion-scale LLMs. In this paper, we propose the multi-stage influence function to attribute the downstream predictions of fine-tuned LLMs to pre-training data under the full-parameter fine-tuning paradigm. To enhance the efficiency and practicality of our multi-stage influence function, we leverage Eigenvalue-corrected Kronecker-Factored (EK-FAC) parameterization for efficient approximation. Empirical results validate the superior scalability of EK-FAC approximation and the effectiveness of our multi-stage influence function. Additionally, case studies on a real-world LLM, dolly-v2-3b, demonstrate its interpretive power, with exemplars illustrating insights provided by multi-stage influence estimates. Our code is public at https://github.com/colored-dye/multi_stage_influence_function.
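
For orientation, the classical single-stage influence function that this line of work builds on can be written as follows. This is the textbook form for a model trained to a minimizer of the empirical loss, shown here as background; it is not the multi-stage estimator proposed in the paper, and sign conventions vary between references.

```latex
\mathcal{I}(z_{\text{train}}, z_{\text{test}})
  = -\,\nabla_\theta L(z_{\text{test}}, \hat\theta)^{\top}
     \, H_{\hat\theta}^{-1} \,
     \nabla_\theta L(z_{\text{train}}, \hat\theta),
\qquad
H_{\hat\theta} = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta^{2} L(z_i, \hat\theta)
```

The inverse Hessian is the expensive part for large models, which is where a Kronecker-factored approximation such as EK-FAC comes in.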

[121] G-FOCUS: Towards a Robust Method for Assessing UI Design Persuasiveness

Jaehyun Jeon,Janghan Yoon,Minsoo Kim,Sumin Shim,Yejin Choi,Hanbin Kim,Youngjae Yu

Main category: cs.CL

TL;DR: The paper introduces the WiserUI-Bench benchmark and the G-FOCUS inference strategy for assessing the persuasiveness of UI designs, aiming to complement traditional A/B testing.

Details

Motivation: A/B testing is costly and time-consuming, and existing vision-language models (VLMs) do not effectively assess comparative persuasiveness in UI analysis. Method: Introduces WiserUI-Bench, a benchmark of 300 real-world UI image pairs, and proposes G-FOCUS, an inference-time reasoning strategy that reduces position bias and improves evaluation accuracy. Result: Experiments show that G-FOCUS surpasses existing inference strategies in consistency and accuracy. Conclusion: The study offers a scalable, VLM-driven evaluation approach for UI design optimization that complements A/B testing. Abstract: Evaluating user interface (UI) design effectiveness extends beyond aesthetics to influencing user behavior, a principle central to Design Persuasiveness. A/B testing is the predominant method for determining which UI variations drive higher user engagement, but it is costly and time-consuming. While recent Vision-Language Models (VLMs) can process automated UI analysis, current approaches focus on isolated design attributes rather than comparative persuasiveness, the key factor in optimizing user interactions. To address this, we introduce WiserUI-Bench, a benchmark designed for Pairwise UI Design Persuasiveness Assessment task, featuring 300 real-world UI image pairs labeled with A/B test results and expert rationales. Additionally, we propose G-FOCUS, a novel inference-time reasoning strategy that enhances VLM-based persuasiveness assessment by reducing position bias and improving evaluation accuracy. Experimental results show that G-FOCUS surpasses existing inference strategies in consistency and accuracy for pairwise UI evaluation. Through promoting VLM-driven evaluation of UI persuasiveness, our work offers an approach to complement A/B testing, propelling progress in scalable UI preference modeling and design optimization. Code and data will be released publicly.
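
Position bias in pairwise evaluation means the verdict depends on which option is shown first. A generic way to detect it, sketched below, is to query the judge in both orders and accept only verdicts that agree. This is a common order-swap check, not the G-FOCUS strategy itself, whose internals the abstract does not describe; `judge` is a hypothetical callable standing in for a VLM.

```python
def debiased_preference(judge, ui_a, ui_b):
    """Ask the judge in both presentation orders; accept only consistent verdicts.

    `judge(first, second)` is a hypothetical callable returning "first" or "second".
    """
    forward = judge(ui_a, ui_b)
    backward = judge(ui_b, ui_a)
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    return "inconsistent"   # the verdict flipped with order: a sign of position bias

# A judge that always prefers whatever is shown first is flagged as inconsistent.
always_first = lambda first, second: "first"
print(debiased_preference(always_first, "design A", "design B"))

# A stand-in judge that looks at content (here, simply string length) is order-independent.
content_judge = lambda first, second: "first" if len(first) > len(second) else "second"
print(debiased_preference(content_judge, "long design A", "B"))
```

The fraction of "inconsistent" outcomes over a benchmark gives a simple consistency measure alongside accuracy.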

[122] Image-Text Relation Prediction for Multilingual Tweets

Matīss Rikters,Edison Marrese-Taylor

Main category: cs.CL

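TL;DR: This paper examines how well multilingual vision-language models predict image-text relations across languages, and builds a balanced benchmark dataset of Latvian Twitter posts with their English translations.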

Details

Motivation: To study the ambiguity of image-text relations on social media and how multilingual models perform on this task. Method: Uses multilingual vision-language models, and builds and tests a Latvian-English benchmark dataset. Result: More recent vision-language models perform better on this task, but room for improvement remains. Conclusion: Multilingual vision-language models show promise for image-text relation prediction but need further improvement. Abstract: Various social networks have been allowing media uploads for over a decade now. Still, it has not always been clear what is their relation with the posted text or even if there is any at all. In this work, we explore how multilingual vision-language models tackle the task of image-text relation prediction in different languages, and construct a dedicated balanced benchmark data set from Twitter posts in Latvian along with their manual translations into English. We compare our results to previous work and show that the more recently released vision-language model checkpoints are becoming increasingly capable at this task, but there is still much room for further improvement.

[123] Teochew-Wild: The First In-the-wild Teochew Dataset with Orthographic Annotations

Linrong Pan,Chenglong Jiang,Gaoze Hou,Ying Gao

Main category: cs.CL

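TL;DR: The paper describes the construction of the Teochew-Wild corpus, containing 18.9 hours of Teochew speech data, which supports automatic speech recognition and text-to-speech tasks.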

Details

Motivation: To provide the first publicly available, accurately annotated speech dataset for Teochew, a low-resource language, and to advance related research and applications. Method: Builds a Teochew corpus covering multiple forms of expression, and provides text processing tools and resources. Result: Experiments validate the corpus's effectiveness for automatic speech recognition and text-to-speech tasks. Conclusion: Teochew-Wild is the first publicly available Teochew dataset with accurate orthographic annotations and an important resource for related research. Abstract: This paper reports the construction of the Teochew-Wild, a speech corpus of the Teochew dialect. The corpus includes 18.9 hours of in-the-wild Teochew speech data from multiple speakers, covering both formal and colloquial expressions, with precise orthographic and pinyin annotations. Additionally, we provide supplementary text processing tools and resources to propel research and applications in speech tasks for this low-resource language, such as automatic speech recognition (ASR) and text-to-speech (TTS). To the best of our knowledge, this is the first publicly available Teochew dataset with accurate orthographic annotations. We conduct experiments on the corpus, and the results validate its effectiveness in ASR and TTS tasks.

[124] Performance Evaluation of Large Language Models in Bangla Consumer Health Query Summarization

Ajwad Abrar,Farzana Tabassum,Sabbir Ahmed

Main category: cs.CL

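TL;DR: The study evaluates the zero-shot performance of nine large language models at summarizing Bangla consumer health queries, finding that Mixtral-8x22b-Instruct performs best and is comparable to the fine-tuned Bangla T5 model.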

Details

Motivation: Consumer health queries in Bangla, a low-resource language, often contain extraneous information that hampers efficient medical responses, so effective summarization methods are needed. Method: Uses the BanglaCHQ-Summ dataset (2,350 query-summary pairs) and ROUGE metrics to compare nine LLMs against the fine-tuned Bangla T5 model. Result: Mixtral-8x22b-Instruct performs best on ROUGE-1 and ROUGE-L, while Bangla T5 leads on ROUGE-2; zero-shot LLMs can rival fine-tuned models. Conclusion: Zero-shot LLMs show promise for low-resource languages and offer a scalable solution for medical query summarization. Abstract: Consumer Health Queries (CHQs) in Bengali (Bangla), a low-resource language, often contain extraneous details, complicating efficient medical responses. This study investigates the zero-shot performance of nine advanced large language models (LLMs): GPT-3.5-Turbo, GPT-4, Claude-3.5-Sonnet, Llama3-70b-Instruct, Mixtral-8x22b-Instruct, Gemini-1.5-Pro, Qwen2-72b-Instruct, Gemma-2-27b, and Athene-70B, in summarizing Bangla CHQs. Using the BanglaCHQ-Summ dataset comprising 2,350 annotated query-summary pairs, we benchmarked these LLMs using ROUGE metrics against Bangla T5, a fine-tuned state-of-the-art model. Mixtral-8x22b-Instruct emerged as the top performing model in ROUGE-1 and ROUGE-L, while Bangla T5 excelled in ROUGE-2. The results demonstrate that zero-shot LLMs can rival fine-tuned models, achieving high-quality summaries even without task-specific training. This work underscores the potential of LLMs in addressing challenges in low-resource languages, providing scalable solutions for healthcare query summarization.
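
For readers unfamiliar with the metric named in this entry, ROUGE-1 scores a candidate summary by its unigram overlap with a reference summary. The following is a minimal sketch, assuming plain whitespace tokenization; the function name `rouge1_f1` is illustrative and not taken from the paper, and real evaluations would normally use a dedicated ROUGE package with tokenization suited to Bangla.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference summary."""
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

ROUGE-2 follows the same pattern over bigrams, and ROUGE-L replaces the overlap count with the length of the longest common subsequence.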

[125] Reliably Bounding False Positives: A Zero-Shot Machine-Generated Text Detection Framework via Multiscaled Conformal Prediction

Xiaowei Zhu,Yubing Ren,Yanan Cao,Xixun Lin,Fang Fang,Yangxi Li

Main category: cs.CL

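TL;DR: The paper proposes a zero-shot machine-generated text detection framework based on Multiscaled Conformal Prediction (MCP), aiming to address high false positive rates (FPRs) while improving detection performance.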

Details

Motivation: The rapid advance of large language models has raised concerns about potential misuse, and existing detection methods focus too heavily on accuracy while neglecting the societal risks of high false positive rates. Method: Uses Conformal Prediction (CP) to bound the false positive rate, and proposes the Multiscaled Conformal Prediction (MCP) framework, combined with the high-quality RealDet dataset, to improve detection performance. Result: Empirical results show that MCP effectively constrains FPRs, significantly improves detection performance, and increases robustness against adversarial attacks. Conclusion: The MCP framework offers an effective solution to the trade-off between false positive rate and detection performance, with potential for practical use. Abstract: The rapid advancement of large language models has raised significant concerns regarding their potential misuse by malicious actors. As a result, developing effective detectors to mitigate these risks has become a critical priority. However, most existing detection methods focus excessively on detection accuracy, often neglecting the societal risks posed by high false positive rates (FPRs). This paper addresses this issue by leveraging Conformal Prediction (CP), which effectively constrains the upper bound of FPRs. While directly applying CP constrains FPRs, it also leads to a significant reduction in detection performance. To overcome this trade-off, this paper proposes a Zero-Shot Machine-Generated Text Detection Framework via Multiscaled Conformal Prediction (MCP), which both enforces the FPR constraint and improves detection performance. This paper also introduces RealDet, a high-quality dataset that spans a wide range of domains, ensuring realistic calibration and enabling superior detection performance when combined with MCP. Empirical evaluations demonstrate that MCP effectively constrains FPRs, significantly enhances detection performance, and increases robustness against adversarial attacks across multiple detectors and datasets.
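
The idea of bounding the false positive rate with conformal prediction can be illustrated with the standard split-conformal threshold. This is a minimal sketch of plain CP only, not the paper's multiscaled variant: it assumes a detector that outputs a higher score for more machine-like text and a calibration set of scores from human-written text; the function names are hypothetical.

```python
import math


def conformal_threshold(human_scores, alpha):
    """Score threshold calibrated on human-written texts.

    If a new human-written text is exchangeable with the calibration texts,
    its score exceeds the returned threshold with probability at most alpha.
    """
    n = len(human_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the calibration quantile
    if k > n:
        # Too few calibration samples to guarantee this alpha: flag nothing.
        return math.inf
    return sorted(human_scores)[k - 1]


def is_machine_generated(score, threshold):
    """Flag a text only when its detector score is strictly above the threshold."""
    return score > threshold
```

With ten calibration scores and `alpha = 0.2`, the threshold is the ninth-smallest score, so at most one of the ten calibration texts would be flagged. The guarantee is marginal over the calibration draw, which is one reason a realistic calibration set matters.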

[126] Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders

Boyi Deng,Yu Wan,Yidan Zhang,Baosong Yang,Fuli Feng

Main category: cs.CL

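TL;DR: The paper studies the mechanisms behind the multilingual capabilities of large language models (LLMs), proposes a new approach based on Sparse Autoencoders (SAEs), identifies language-specific features, and controls the generated language through feature ablation and enhancement.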

Details

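Motivation: Existing methods for studying multilingual capabilities (such as neuron-based or internal-activation-based analysis) suffer from limitations such as superposition and layer-wise activation variance, so more reliable analysis tools are needed. Method: Uses Sparse Autoencoders (SAEs) to decompose LLM activations, proposes a new metric to assess the monolinguality of features, and validates it through feature ablation and enhancement experiments. Result: Some SAE features are strongly related to specific languages, and ablating them significantly affects ability in only one language; some languages have synergistic features whose joint ablation has a larger effect; these features can be used to strengthen control over the generated language. Conclusion: SAEs provide a finer-grained tool for analyzing multilingual capabilities, and discovering and manipulating language-specific features opens a new route to controlling the language LLMs generate. Abstract: The mechanisms behind multilingual capabilities in Large Language Models (LLMs) have been examined using neuron-based or internal-activation-based methods. However, these methods often face challenges such as superposition and layer-wise activation variance, which limit their reliability. Sparse Autoencoders (SAEs) offer a more nuanced analysis by decomposing the activations of LLMs into a sparse linear combination of SAE features. We introduce a novel metric to assess the monolinguality of features obtained from SAEs, discovering that some features are strongly related to specific languages. Additionally, we show that ablating these SAE features only significantly reduces abilities in one language of LLMs, leaving others almost unaffected. Interestingly, we find some languages have multiple synergistic SAE features, and ablating them together yields greater improvement than ablating individually. Moreover, we leverage these SAE-derived language-specific features to enhance steering vectors, achieving control over the language generated by LLMs.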
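
The notions of decomposing an activation into SAE features and ablating one of them can be sketched in a few lines. This is an illustrative toy in pure Python, assuming a ReLU encoder and a linear decoder whose rows are feature directions; the function names and weights are hypothetical, and the paper's exact ablation procedure may differ.

```python
def sae_encode(x, w_enc, b_enc):
    """Feature activations f = ReLU(W_enc x + b_enc)."""
    pre = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(w_enc, b_enc)]
    return [max(0.0, p) for p in pre]


def sae_decode(f, w_dec, b_dec):
    """Reconstruction x_hat = sum_i f_i * W_dec[i] + b_dec."""
    return [
        b_dec[j] + sum(f[i] * w_dec[i][j] for i in range(len(f)))
        for j in range(len(b_dec))
    ]


def ablate_feature(x, idx, w_enc, b_enc, w_dec):
    """Remove one feature's contribution from the original activation.

    Subtracting f[idx] * W_dec[idx] from x (instead of replacing x with the
    reconstruction) leaves the SAE's reconstruction error untouched.
    """
    f = sae_encode(x, w_enc, b_enc)
    return [xj - f[idx] * w_dec[idx][j] for j, xj in enumerate(x)]
```

In a real model, `x` would be a residual-stream activation at some layer and `idx` a feature that scores highly on a monolinguality metric; the ablated vector would then be written back before the forward pass continues.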

[127] A Benchmark Dataset and a Framework for Urdu Multimodal Named Entity Recognition

Hussain Ahmad,Qingyang Zeng,Jing Wan

Main category: cs.CL

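TL;DR: The paper proposes the U-MNER framework and the Twitter2015-Urdu dataset, addressing data scarcity and the lack of benchmarks in Urdu Multimodal Named Entity Recognition (MNER), and achieves the best performance by fusing textual and visual information.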

Details

Motivation: Because annotated multimodal data for languages such as Urdu is scarce and standardized baselines are lacking, the study aims to fill this gap and advance MNER research for low-resource languages. Method: Uses the U-MNER framework, which extracts textual and visual features with Urdu-BERT and ResNet and aligns the information through a Cross-Modal Fusion Module. Result: The model achieves the best performance on the Twitter2015-Urdu dataset. Conclusion: The U-MNER framework lays the groundwork for MNER research in low-resource languages and can be extended in future work. Abstract: The emergence of multimodal content, particularly text and images on social media, has positioned Multimodal Named Entity Recognition (MNER) as an increasingly important area of research within Natural Language Processing. Despite progress in high-resource languages such as English, MNER remains underexplored for low-resource languages like Urdu. The primary challenges include the scarcity of annotated multimodal datasets and the lack of standardized baselines. To address these challenges, we introduce the U-MNER framework and release the Twitter2015-Urdu dataset, a pioneering resource for Urdu MNER. Adapted from the widely used Twitter2015 dataset, it is annotated with Urdu-specific grammar rules. We establish benchmark baselines by evaluating both text-based and multimodal models on this dataset, providing comparative analyses to support future research on Urdu MNER. The U-MNER framework integrates textual and visual context using Urdu-BERT for text embeddings and ResNet for visual feature extraction, with a Cross-Modal Fusion Module to align and fuse information. Our model achieves state-of-the-art performance on the Twitter2015-Urdu dataset, laying the groundwork for further MNER research in low-resource languages.

[128] QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation

Mengze Hong,Wailing Ng,Di Jiang,Chen Jason Zhang

Main category: cs.CL

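TL;DR: QualBench is the first multi-domain QA benchmark dedicated to localized assessment of Chinese large language models (LLMs), with over 17,000 questions across six vertical domains grounded in 24 Chinese qualification exams. Qwen2.5 outperforms GPT-4o, highlighting the importance of localized knowledge.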

Details

Motivation: Existing benchmarks lack coverage of vertical domains and offer little evaluation targeted at the Chinese working context, a gap that a qualification-exam framework can fill. Method: Uses qualification exams as a unified evaluation framework to build the QualBench dataset, covering 17,000+ questions across six domains and grounded in 24 Chinese qualification exams. Result: Qwen2.5 outperforms GPT-4o, and Chinese LLMs consistently surpass non-Chinese models; the best score of 75.26% reveals gaps in the domain coverage of current models. Conclusion: Localized knowledge is crucial to LLM performance; future work can pursue multi-domain RAG knowledge enhancement and vertical-domain LLM training with Federated Learning. Abstract: The rapid advancement of Chinese large language models (LLMs) underscores the need for domain-specific evaluations to ensure reliable applications. However, existing benchmarks often lack coverage in vertical domains and offer limited insights into the Chinese working context. Leveraging qualification exams as a unified framework for human expertise evaluation, we introduce QualBench, the first multi-domain Chinese QA benchmark dedicated to localized assessment of Chinese LLMs. The dataset includes over 17,000 questions across six vertical domains, with data selections grounded in 24 Chinese qualifications to closely align with national policies and working standards. Through comprehensive evaluation, the Qwen2.5 model outperformed the more advanced GPT-4o, with Chinese LLMs consistently surpassing non-Chinese models, highlighting the importance of localized domain knowledge in meeting qualification requirements. The best performance of 75.26% reveals the current gaps in domain coverage within model capabilities. Furthermore, we present the failure of LLM collaboration with crowdsourcing mechanisms and suggest the opportunities for multi-domain RAG knowledge enhancement and vertical domain LLM training with Federated Learning.

[129] T-T: Table Transformer for Tagging-based Aspect Sentiment Triplet Extraction

Kun Peng,Chaodong Tong,Cong Cao,Hao Peng,Qian Li,Guanlin Wu,Lei Jiang,Yanbing Liu,Philip S. Yu

Main category: cs.CL

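TL;DR: The paper proposes a Transformer-based Table-Transformer (T-T) model to address the overly long sequences and unfair local attention that arise in ASTE, using a stripe attention mechanism and a loop-shift strategy for efficient relation learning.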

Details

Motivation: Existing methods improve performance by designing downstream relation learning modules, but using transformer layers directly faces the challenges of overly long sequences and unfair local attention interaction. Method: Proposes the T-T model, which introduces a stripe attention mechanism and a loop-shift strategy, restricting global attention to local windows and enabling interaction between different windows. Result: Experiments show that T-T achieves state-of-the-art performance at lower computational cost. Conclusion: As a downstream relation learning module, T-T effectively addresses these challenges in ASTE and significantly improves model performance. Abstract: Aspect sentiment triplet extraction (ASTE) aims to extract triplets composed of aspect terms, opinion terms, and sentiment polarities from given sentences. The table tagging method is a popular approach to addressing this task, which encodes a sentence into a 2-dimensional table, allowing for the tagging of relations between any two words. Previous efforts have focused on designing various downstream relation learning modules to better capture interactions between tokens in the table, revealing that a stronger capability to capture relations can lead to greater improvements in the model. Motivated by this, we attempt to directly utilize transformer layers as downstream relation learning modules. Due to the powerful semantic modeling capability of transformers, it is foreseeable that this will lead to excellent improvement. However, owing to the quadratic relation between the length of the table and the length of the input sentence sequence, using transformers directly faces two challenges: overly long table sequences and unfair local attention interaction. To address these challenges, we propose a novel Table-Transformer (T-T) for the tagging-based ASTE method. Specifically, we introduce a stripe attention mechanism with a loop-shift strategy to tackle these challenges. The former modifies the global attention mechanism to only attend to a 2-dimensional local attention window, while the latter facilitates interaction between different attention windows. Extensive and comprehensive experiments demonstrate that the T-T, as a downstream relation learning module, achieves state-of-the-art performance with lower computational costs.

[130] Toward Reasonable Parrots: Why Large Language Models Should Argue with Us by Design

Elena Musi,Nadin Kokciyan,Khalid Al-Khatib,Davide Ceolin,Emmanuelle Dietz,Klara Gutekunst,Annette Hautli-Janisz,Cristian Manuel Santibañez Yañez,Jodi Schneider,Jonas Scholz,Cor Steging,Jacky Visser,Henning Wachsmuth

Main category: cs.CL

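TL;DR: The paper argues for developing conversational technology that supports argumentative processes, contends that current large language models (LLMs) are inadequate for this purpose, and proposes an ideal design for enhancing argumentative skills.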

Details

Motivation: Current LLMs cannot effectively support argumentation, so the technology needs to be redesigned to foster critical thinking. Method: Introduces the concept of 'reasonable parrots', based on principles from argumentation theory (relevance, responsibility, freedom) and on argumentative dialogical moves. Result: An ideal technology design should incorporate the basic principles of argumentation theory, treating LLMs as tools rather than replacements. Conclusion: The principles of argumentation theory should be the foundation of LLM-based technology that supports critical thinking and argumentative processes. Abstract: In this position paper, we advocate for the development of conversational technology that is inherently designed to support and facilitate argumentative processes. We argue that, at present, large language models (LLMs) are inadequate for this purpose, and we propose an ideal technology design aimed at enhancing argumentative skills. This involves re-framing LLMs as tools to exercise our critical thinking rather than replacing them. We introduce the concept of 'reasonable parrots' that embody the fundamental principles of relevance, responsibility, and freedom, and that interact through argumentative dialogical moves. These principles and moves arise out of millennia of work in argumentation theory and should serve as the starting point for LLM-based technology that incorporates basic principles of argumentation.

[131] ICon: In-Context Contribution for Automatic Data Selection

Yixin Yang,Qingxiu Dong,Linli Yao,Fangwei Zhu,Zhifang Sui

Main category: cs.CL

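TL;DR: ICon is a gradient-free method based on in-context learning for efficiently selecting instruction-tuning data, and it outperforms existing methods.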

Details

Motivation: Existing data selection methods rely on computationally expensive gradients or hand-designed heuristics and fail to fully exploit the intrinsic attributes of the data. Method: ICon uses the implicit fine-tuning nature of in-context learning to measure sample contribution, without gradient computation or manually designed indicators. Result: Across multiple benchmarks, data selected by ICon significantly improves model performance, outperforming both the full dataset and other selection methods. Conclusion: ICon offers an efficient data selection method with reduced human bias, suitable for tuning large language models. Abstract: Data selection for instruction tuning is essential for improving the performance of Large Language Models (LLMs) and reducing training cost. However, existing automated selection methods either depend on computationally expensive gradient-based measures or manually designed heuristics, which may fail to fully exploit the intrinsic attributes of data. In this paper, we propose In-context Learning for Contribution Measurement (ICon), a novel gradient-free method that takes advantage of the implicit fine-tuning nature of in-context learning (ICL) to measure sample contribution without gradient computation or manual indicators engineering. ICon offers a computationally efficient alternative to gradient-based methods and reduces human inductive bias inherent in heuristic-based approaches. ICon comprises three components and identifies high-contribution data by assessing performance shifts under implicit learning through ICL. Extensive experiments on three LLMs across 12 benchmarks and 5 pairwise evaluation sets demonstrate the effectiveness of ICon. Remarkably, on LLaMA3.1-8B, models trained on 15% of ICon-selected data outperform full datasets by 5.42% points and exceed the best performance of widely used selection methods by 2.06% points. We further analyze high-contribution samples selected by ICon, which show both diverse tasks and appropriate difficulty levels, rather than just the hardest ones.

[132] Frame In, Frame Out: Do LLMs Generate More Biased News Headlines than Humans?

Valeria Pastorino,Nafise Sadat Moosavi

Main category: cs.CL

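TL;DR: The study finds that news content generated by large language models (LLMs) tends to show more pronounced framing bias than human-written content on politically and socially sensitive topics, with significant variation across model architectures.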

Details

Motivation: To examine the framing bias that LLMs may introduce or amplify when generating news content, and to assess its potential impact on public perception. Method: Analyzes news content generated by out-of-the-box and fine-tuned LLMs and compares its framing with that of human authors. Result: LLMs exhibit more pronounced framing bias on sensitive topics, and framing tendencies differ significantly between models. Conclusion: Effective post-training mitigation strategies and tighter evaluation frameworks are needed to keep automated news content balanced. Abstract: Framing in media critically shapes public perception by selectively emphasizing some details while downplaying others. With the rise of large language models in automated news and content creation, there is growing concern that these systems may introduce or even amplify framing biases compared to human authors. In this paper, we explore how framing manifests in both out-of-the-box and fine-tuned LLM-generated news content. Our analysis reveals that, particularly in politically and socially sensitive contexts, LLMs tend to exhibit more pronounced framing than their human counterparts. In addition, we observe significant variation in framing tendencies across different model architectures, with some models displaying notably higher biases. These findings point to the need for effective post-training mitigation strategies and tighter evaluation frameworks to ensure that automated news content upholds the standards of balanced reporting.

[133] Crosslingual Reasoning through Test-Time Scaling

Zheng-Xin Yong,M. Farid Adilazuarda,Jonibek Mansurov,Ruochen Zhang,Niklas Muennighoff,Carsten Eickhoff,Genta Indra Winata,Julia Kreutzer,Stephen H. Bach,Alham Fikri Aji

Main category: cs.CL

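TL;DR: The study finds that English-centric large language models perform strongly on multilingual mathematical reasoning, but their reasoning ability remains limited in low-resource languages and out-of-domain tasks.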

Details

Motivation: To investigate how well English reasoning finetuning generalizes across languages, and the mechanisms behind it. Method: Studies model performance under scaled-up inference compute and long chain-of-thoughts (CoTs), and analyzes the resulting reasoning patterns. Result: Models perform better in high-resource languages but poorly in low-resource languages and out-of-domain tasks. Conclusion: English-centric models should be used to reason in high-resource languages, while further work is needed on low-resource languages and out-of-domain reasoning. Abstract: Reasoning capabilities of large language models are primarily studied for English, even when pretrained models are multilingual. In this work, we investigate to what extent English reasoning finetuning with long chain-of-thoughts (CoTs) can generalize across languages. First, we find that scaling up inference compute for English-centric reasoning language models (RLMs) improves multilingual mathematical reasoning across many languages including low-resource languages, to an extent where they outperform models twice their size. Second, we reveal that while English-centric RLM's CoTs are naturally predominantly English, they consistently follow a quote-and-think pattern to reason about quoted non-English inputs. Third, we discover an effective strategy to control the language of long CoT reasoning, and we observe that models reason better and more efficiently in high-resource languages. Finally, we observe poor out-of-domain reasoning generalization, in particular from STEM to cultural commonsense knowledge, even for English. Overall, we demonstrate the potentials, study the mechanisms and outline the limitations of crosslingual generalization of English reasoning test-time scaling. We conclude that practitioners should let English-centric RLMs reason in high-resource languages, while further work is needed to improve reasoning in low-resource languages and out-of-domain contexts.

[134] Reasoning Models Don't Always Say What They Think

Yanda Chen,Joe Benton,Ansh Radhakrishnan,Jonathan Uesato,Carson Denison,John Schulman,Arushi Somani,Peter Hase,Misha Wagner,Fabien Roger,Vlad Mikulik,Samuel R. Bowman,Jan Leike,Jared Kaplan,Ethan Perez

Main category: cs.CL

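TL;DR: The paper evaluates how useful chain-of-thought (CoT) is for AI safety, finding that CoT monitoring can partly reveal a model's intentions but has limited effect and is unreliable for catching rare or catastrophic behaviors.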

Details

Motivation: To study whether CoT faithfully reflects a model's reasoning process, in order to assess its potential for AI safety monitoring. Method: Tests CoT faithfulness across 6 reasoning hints presented in the prompts, and analyzes the effect of reinforcement learning on faithfulness. Result: The rate at which CoTs reveal hint usage is often below 20%; outcome-based reinforcement learning initially improves faithfulness but then plateaus; and when reinforcement learning increases hint usage (reward hacking), the propensity to verbalize the hints does not increase. Conclusion: CoT monitoring is promising during training and evaluation but is not sufficient to rule out undesired behaviors, and it works poorly in settings where CoT reasoning is not necessary. Abstract: Chain-of-thought (CoT) offers a potential boon for AI safety as it allows monitoring a model's CoT to try to understand its intentions and reasoning processes. However, the effectiveness of such monitoring hinges on CoTs faithfully representing models' actual reasoning processes. We evaluate CoT faithfulness of state-of-the-art reasoning models across 6 reasoning hints presented in the prompts and find: (1) for most settings and models tested, CoTs reveal their usage of hints in at least 1% of examples where they use the hint, but the reveal rate is often below 20%, (2) outcome-based reinforcement learning initially improves faithfulness but plateaus without saturating, and (3) when reinforcement learning increases how frequently hints are used (reward hacking), the propensity to verbalize them does not increase, even without training against a CoT monitor. These results suggest that CoT monitoring is a promising way of noticing undesired behaviors during training and evaluations, but that it is not sufficient to rule them out. They also suggest that in settings like ours where CoT reasoning is not necessary, test-time monitoring of CoTs is unlikely to reliably catch rare and catastrophic unexpected behaviors.

[135] TransProQA: an LLM-based literary Translation evaluation metric with Professional Question Answering

Ran Zhang,Wei Zhao,Lieve Macken,Steffen Eger

Main category: cs.CL

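TL;DR: TransProQA is an LLM-based, reference-free QA framework designed for literary translation evaluation; it substantially outperforms existing metrics and approaches human-level evaluation.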

Details

Motivation: Existing evaluation metrics prioritize mechanical accuracy over artistic expression, which could lead to a long-term decline in translation quality and cultural authenticity. Method: TransProQA integrates insights from professional literary translators and researchers, focusing on key elements of literary quality assessment such as literary devices, cultural understanding, and authorial voice. Result: TransProQA gains up to 0.07 in correlation (ACC-EQ and Kendall's tau), surpasses state-of-the-art metrics by over 15 points in adequacy assessments, and approaches human-level evaluation. Conclusion: TransProQA is broadly applicable and can serve as a training-free literary evaluation tool, especially for texts that require local processing. Abstract: The impact of Large Language Models (LLMs) has extended into literary domains. However, existing evaluation metrics prioritize mechanical accuracy over artistic expression and tend to overrate machine translation (MT) as being superior to experienced professional human translation. In the long run, this bias could result in a permanent decline in translation quality and cultural authenticity. In response to the urgent need for a specialized literary evaluation metric, we introduce TransProQA, a novel, reference-free, LLM-based question-answering (QA) framework designed specifically for literary translation evaluation. TransProQA uniquely integrates insights from professional literary translators and researchers, focusing on critical elements in literary quality assessment such as literary devices, cultural understanding, and authorial voice. Our extensive evaluation shows that while literary-finetuned XCOMET-XL yields marginal gains, TransProQA substantially outperforms current metrics, achieving up to 0.07 gain in correlation (ACC-EQ and Kendall's tau) and surpassing the best state-of-the-art (SOTA) metrics by over 15 points in adequacy assessments. Incorporating professional translator insights as weights further improves performance, highlighting the value of translator inputs. Notably, TransProQA approaches human-level evaluation performance comparable to trained linguistic annotators. It demonstrates broad applicability to open-source models such as LLaMA3.3-70b and Qwen2.5-32b, indicating its potential as an accessible and training-free literary evaluation metric and a valuable tool for evaluating texts that require local processing due to copyright or ethical considerations.

[136] Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data

Yudong Wang,Zixuan Fu,Jie Cai,Peijun Tang,Hongya Lyu,Yewei Fang,Zhi Zheng,Jie Zhou,Guoyang Zeng,Chaojun Xiao,Xu Han,Zhiyuan Liu

Main category: cs.CL

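TL;DR: The paper proposes an efficient data verification strategy and a method for optimizing seed data selection, and builds an efficient data filtering pipeline that significantly improves data quality and training efficiency.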

Details

Motivation: With the rapid development of large language models (LLMs), data quality has become a key factor in model performance, yet existing methods lack an efficient data verification strategy and rely on subjective expertise to select seed data. Method: Introduces an efficient verification strategy that quickly evaluates the impact of data on LLM training; building on the assumption that high-quality seed data benefits training, optimizes the selection of positive and negative samples and proposes an efficient data filtering pipeline; uses a lightweight fastText-based classifier. Result: Applied to the FineWeb and Chinese FineWeb datasets, the pipeline produces the higher-quality Ultra-FineWeb dataset, containing approximately 1 trillion English tokens and 120 billion Chinese tokens. LLMs trained on it improve significantly on multiple benchmark tasks. Conclusion: The proposed pipeline effectively improves data quality and training efficiency, validating its practicality for LLM training. Abstract: Data quality has become a key factor in enhancing model performance with the rapid development of large language models (LLMs). Model-driven data filtering has increasingly become a primary approach for acquiring high-quality data. However, it still faces two main challenges: (1) the lack of an efficient data verification strategy makes it difficult to provide timely feedback on data quality; and (2) the selection of seed data for training classifiers lacks clear criteria and relies heavily on human expertise, introducing a degree of subjectivity. To address the first challenge, we introduce an efficient verification strategy that enables rapid evaluation of the impact of data on LLM training with minimal computational cost. To tackle the second challenge, we build upon the assumption that high-quality seed data is beneficial for LLM training, and by integrating the proposed verification strategy, we optimize the selection of positive and negative samples and propose an efficient data filtering pipeline. This pipeline not only improves filtering efficiency, classifier quality, and robustness, but also significantly reduces experimental and inference costs. In addition, to efficiently filter high-quality data, we employ a lightweight classifier based on fastText, and successfully apply the filtering pipeline to two widely-used pre-training corpora, FineWeb and Chinese FineWeb datasets, resulting in the creation of the higher-quality Ultra-FineWeb dataset. Ultra-FineWeb contains approximately 1 trillion English tokens and 120 billion Chinese tokens. Empirical results demonstrate that the LLMs trained on Ultra-FineWeb exhibit significant performance improvements across multiple benchmark tasks, validating the effectiveness of our pipeline in enhancing both data quality and training efficiency.

[137] clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations

Chalamalasetti Kranti,Sherzod Hakimov,David Schlangen

Main category: cs.CL

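TL;DR: This paper proposes the clem todd framework for systematically evaluating dialogue systems under consistent conditions; it supports many combinations of user simulators and dialogue systems and provides uniform datasets, evaluation metrics, and computational constraints.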

Details

Motivation: Existing research often evaluates the components of dialogue systems in isolation, limiting the generalisability of insights across architectures and configurations. Method: Proposes the clem todd framework, which supports flexible integration of existing or newly developed user simulators and dialogue systems and ensures consistent evaluation conditions. Result: Re-evaluating existing systems and integrating new ones demonstrates clem todd's flexibility and yields actionable insights into how architecture, scale, and prompting strategies affect dialogue performance. Conclusion: clem todd offers practical guidance for building efficient and effective conversational AI systems and shows the value of unified evaluation. Abstract: The emergence of instruction-tuned large language models (LLMs) has advanced the field of dialogue systems, enabling both realistic user simulations and robust multi-turn conversational agents. However, existing research often evaluates these components in isolation, either focusing on a single user simulator or a specific system design, limiting the generalisability of insights across architectures and configurations. In this work, we propose clem todd (chat-optimized LLMs for task-oriented dialogue systems development), a flexible framework for systematically evaluating dialogue systems under consistent conditions. clem todd enables detailed benchmarking across combinations of user simulators and dialogue systems, whether existing models from literature or newly developed ones. It supports plug-and-play integration and ensures uniform datasets, evaluation metrics, and computational constraints. We showcase clem todd's flexibility by re-evaluating existing task-oriented dialogue systems within this unified setup and integrating three newly proposed dialogue systems into the same evaluation pipeline. Our results provide actionable insights into how architecture, scale, and prompting strategies affect dialogue performance, offering practical guidance for building efficient and effective conversational AI systems.

[138] UKElectionNarratives: A Dataset of Misleading Narratives Surrounding Recent UK General Elections

Fatima Haouari,Carolina Scarton,Nicolò Faggiani,Nikolaos Nikolaidis,Bonka Kotseva,Ibrahim Abu Farha,Jens Linge,Kalina Bontcheva

Main category: cs.CL

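TL;DR: The paper proposes a taxonomy for detecting misleading narratives in elections, builds UKElectionNarratives, the first human-annotated dataset of such narratives, and evaluates how well pre-trained and large language models (such as GPT-4o) detect them.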

Details

Motivation: Misleading narratives shape public opinion during elections, so they need to be detected accurately. Method: Proposes a taxonomy, builds an annotated dataset, and evaluates pre-trained and large language models. Result: Constructs the UKElectionNarratives dataset and reports model performance at detecting misleading narratives. Conclusion: Discusses potential use cases and recommends directions for future research. Abstract: Misleading narratives play a crucial role in shaping public opinion during elections, as they can influence how voters perceive candidates and political parties. This entails the need to detect these narratives accurately. To address this, we introduce the first taxonomy of common misleading narratives that circulated during recent elections in Europe. Based on this taxonomy, we construct and analyse UKElectionNarratives: the first dataset of human-annotated misleading narratives which circulated during the UK General Elections in 2019 and 2024. We also benchmark Pre-trained and Large Language Models (focusing on GPT-4o), studying their effectiveness in detecting election-related misleading narratives. Finally, we discuss potential use cases and make recommendations for future research directions using the proposed codebook and dataset.

[139] Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging

Shiqi Chen,Jinghan Zhang,Tongyao Zhu,Wei Liu,Siyang Gao,Miao Xiong,Manling Li,Junxian He

Main category: cs.CL

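TL;DR: This paper explores using model merging to bring the reasoning capabilities of large language models (LLMs) into vision-language models (VLMs), and reveals how perception and reasoning are distributed within the model and how merging changes them.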

Details

Motivation: Vision-language models (VLMs) combine visual perception with the reasoning capabilities of language models, but how the two abilities are combined and contribute remains poorly understood. Method: Merges models across modalities to incorporate LLM reasoning capabilities into VLMs without additional training. Result: Experiments show that merging significantly improves reasoning; perception is encoded mainly in the early layers, while reasoning is driven by the middle-to-late layers. After merging, all layers contribute to reasoning, while the distribution of perception remains largely unchanged. Conclusion: Model merging offers a new tool for multimodal integration and interpretation and sheds light on the mechanisms of perception and reasoning. Abstract: Vision-Language Models (VLMs) combine visual perception with the general capabilities, such as reasoning, of Large Language Models (LLMs). However, the mechanisms by which these two abilities can be combined and contribute remain poorly understood. In this work, we explore to compose perception and reasoning through model merging that connects parameters of different models. Unlike previous works that often focus on merging models of the same kind, we propose merging models across modalities, enabling the incorporation of the reasoning capabilities of LLMs into VLMs. Through extensive experiments, we demonstrate that model merging offers a successful pathway to transfer reasoning abilities from LLMs to VLMs in a training-free manner. Moreover, we utilize the merged models to understand the internal mechanism of perception and reasoning and how merging affects it. We find that perception capabilities are predominantly encoded in the early layers of the model, whereas reasoning is largely facilitated by the middle-to-late layers. After merging, we observe that all layers begin to contribute to reasoning, whereas the distribution of perception abilities across layers remains largely unchanged. These observations shed light on the potential of model merging as a tool for multimodal integration and interpretation.

[140] ComPO: Preference Alignment via Comparison Oracles

Peter Chen,Xi Chen,Wotao Yin,Tianyi Lin

Main category: cs.CL

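TL;DR: This paper proposes a new preference alignment method based on comparison oracles that addresses the verbosity and likelihood displacement issues of existing direct alignment methods, and validates it experimentally.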

Details

Motivation: Existing direct alignment methods suffer from verbosity and likelihood displacement, which stem from noisy preference pairs that induce similar likelihoods for preferred and dispreferred responses. Method: Proposes a preference alignment method based on comparison oracles, provides a convergence guarantee for its basic scheme, and improves the scheme with heuristics. Result: Experiments across multiple base and instruction-tuned models and benchmarks show that the method effectively improves LLM performance and serves as an alternative to existing direct alignment methods. Conclusion: The paper highlights the importance of designing specialized methods for preference pairs with distinct likelihood margins, complementing recent findings. Abstract: Direct alignment methods are increasingly used for aligning large language models (LLMs) with human preferences. However, these methods suffer from the issues of verbosity and likelihood displacement, which can be driven by the noisy preference pairs that induce similar likelihood for preferred and dispreferred responses. The contributions of this paper are two-fold. First, we propose a new preference alignment method based on comparison oracles and provide the convergence guarantee for its basic scheme. Second, we improve our method using some heuristics and conduct the experiments to demonstrate the flexibility and compatibility of practical scheme in improving the performance of LLMs using noisy preference pairs. Evaluations are conducted across multiple base and instruction-tuned models (Mistral-7B, Llama-3-8B and Gemma-2-9B) with benchmarks (AlpacaEval 2, MT-Bench and Arena-Hard). Experimental results show the effectiveness of our method as an alternative to addressing the limitations of existing direct alignment methods. A highlight of our work is that we evidence the importance of designing specialized methods for preference pairs with distinct likelihood margin, which complements the recent findings in Razin et al. (2025).

cs.RO [Back]

[141] Steerable Scene Generation with Post Training and Inference-Time Search

Nicholas Pfaff,Hongkai Dai,Sergey Zakharov,Shun Iwase,Russ Tedrake

Main category: cs.RO

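TL;DR: Proposes a diffusion-based scene generation method that trains a unified model to predict objects and their poses, supports reinforcement-learning post training and conditional generation, and targets robot simulation tasks.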

Details

Motivation: Manually building highly complex 3D scenes that meet task requirements is costly, and such scenes are scarce, so an automated generation method is needed. Method: Trains a diffusion model to predict objects and their SE(3) poses, and refines the generated results with reinforcement-learning post training, conditional generation, or inference-time search. Result: Produces over 44 million SE(3) scenes covering five environments, with MCTS-based search and simulation used to verify physical feasibility. Conclusion: The method supports goal-directed scene generation that is scalable and physically feasible, and suits diverse robot tasks. Abstract: Training robots in simulation requires diverse 3D scenes that reflect the specific challenges of downstream tasks. However, scenes that satisfy strict task requirements, such as high-clutter environments with plausible spatial arrangement, are rare and costly to curate manually. Instead, we generate large-scale scene data using procedural models that approximate realistic environments for robotic manipulation, and adapt it to task-specific goals. We do this by training a unified diffusion-based generative model that predicts which objects to place from a fixed asset library, along with their SE(3) poses. This model serves as a flexible scene prior that can be adapted using reinforcement learning-based post training, conditional generation, or inference-time search, steering generation toward downstream objectives even when they differ from the original data distribution. Our method enables goal-directed scene synthesis that respects physical feasibility and scales across scene types. We introduce a novel MCTS-based inference-time search strategy for diffusion models, enforce feasibility via projection and simulation, and release a dataset of over 44 million SE(3) scenes spanning five diverse environments. Website with videos, code, data, and model weights: https://steerable-scene-generation.github.io/

[142] X-Driver: Explainable Autonomous Driving with Vision-Language Models

Wei Liu,Jiyuan Zhang,Binxiong Zheng,Yufeng Hu,Yingzhan Lin,Zengfeng Zeng

Main category: cs.RO

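TL;DR: X-Driver is an end-to-end autonomous driving framework built on multi-modal large language models (MLLMs) that uses Chain-of-Thought and autoregressive modeling to improve perception and decision-making, outperforming existing methods in closed-loop tests.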

Details

Motivation: Existing end-to-end autonomous driving frameworks have low success rates in closed-loop tests, which limits real-world deployment. Method: Proposes the X-Driver framework, which combines Chain-of-Thought and autoregressive modeling and uses multi-modal large language models to improve performance. Result: Validated in the CARLA simulation environment, X-Driver surpasses current state-of-the-art methods in closed-loop performance and improves the interpretability of decisions. Conclusion: Structured reasoning is essential for end-to-end driving, and X-Driver provides a strong baseline for future closed-loop autonomous driving research. Abstract: End-to-end autonomous driving has advanced significantly, offering benefits such as system simplicity and stronger driving performance in both open-loop and closed-loop settings than conventional pipelines. However, existing frameworks still suffer from low success rates in closed-loop evaluations, highlighting their limitations in real-world deployment. In this paper, we introduce X-Driver, a unified multi-modal large language model (MLLM) framework designed for closed-loop autonomous driving, leveraging Chain-of-Thought (CoT) and autoregressive modeling to enhance perception and decision-making. We validate X-Driver across multiple autonomous driving tasks using public benchmarks in the CARLA simulation environment, including Bench2Drive. Our experimental results demonstrate superior closed-loop performance, surpassing the current state-of-the-art (SOTA) while improving the interpretability of driving decisions. These findings underscore the importance of structured reasoning in end-to-end driving and establish X-Driver as a strong baseline for future research in closed-loop autonomous driving.

[143] D-CODA: Diffusion for Coordinated Dual-Arm Data Augmentation

I-Chun Arthur Liu,Jason Chen,Gaurav Sukhatme,Daniel Seita

Main category: cs.RO

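TL;DR: D-CODA is an offline data augmentation method for bimanual imitation learning that uses a diffusion model to generate viewpoint-consistent wrist-camera images and action labels.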

Details

Motivation: Learning bimanual manipulation is challenging because of its high dimensionality and coordination requirements, and conventional data collection is costly, so scalable data augmentation is needed. Method: Proposes D-CODA, which uses a diffusion model to generate viewpoint-consistent images and action labels, and applies constrained optimization to keep the augmented states feasible for bimanual coordination. Result: Across 5 simulated and 3 real-world tasks, with 2250 simulation trials and 300 real-world trials, D-CODA outperforms the baselines. Conclusion: D-CODA shows the potential of scalable data augmentation for bimanual imitation learning. Abstract: Learning bimanual manipulation is challenging due to its high dimensionality and the tight coordination required between two arms. Eye-in-hand imitation learning, which uses wrist-mounted cameras, simplifies perception by focusing on task-relevant views. However, collecting diverse demonstrations remains costly, motivating the need for scalable data augmentation. While prior work has explored visual augmentation in single-arm settings, extending these approaches to bimanual manipulation requires generating viewpoint-consistent observations across both arms and producing corresponding action labels that are both valid and feasible. In this work, we propose Diffusion for COordinated Dual-arm Data Augmentation (D-CODA), a method for offline data augmentation tailored to eye-in-hand bimanual imitation learning that trains a diffusion model to synthesize novel, viewpoint-consistent wrist-camera images for both arms while simultaneously generating joint-space action labels. It employs constrained optimization to ensure that augmented states involving gripper-to-object contacts adhere to constraints suitable for bimanual coordination. We evaluate D-CODA on 5 simulated and 3 real-world tasks. Our results across 2250 simulation trials and 300 real-world trials demonstrate that it outperforms baselines and ablations, showing its potential for scalable data augmentation in eye-in-hand bimanual manipulation. Our project website is at: https://dcodaaug.github.io/D-CODA/.

Mattia Sartori,Chetna Singhal,Neelabhro Roy,Davide Brunelli,James Gross

Main category: cs.RO

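TL;DR: This paper proposes an AI-aided, vision-based autonomous navigation method for nano-drones that combines edge computing with onboard planning to work around resource constraints and achieve real-time obstacle avoidance.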

Details

Motivation: Nano-drones face resource constraints in autonomous navigation and high-level tasks, so an efficient method for safe flight is needed. Method: Combines a deep learning object detector running on the edge with an onboard planning algorithm to achieve real-time obstacle avoidance. Result: Experiments show the drone can be commanded at about 8 frames per second, the model reaches 60.8 COCO mAP, and the drone avoids the obstacle and reaches its target. Conclusion: The method offers a feasible route to autonomous exploration with nano-drones and is compatible with real-time navigation requirements. Abstract: The miniaturisation of sensors and processors, the advancements in connected edge intelligence, and the exponential interest in Artificial Intelligence are driving the rise of autonomous nano-size drones in the Internet of Robotic Things ecosystem. However, achieving safe autonomous navigation and high-level tasks such as exploration and surveillance with these tiny platforms is extremely challenging due to their limited resources. This work focuses on enabling the safe and autonomous flight of a pocket-size, 30-gram platform called Crazyflie 2.1 in a partially known environment. We propose a novel AI-aided, vision-based reactive planning method for obstacle avoidance under the Integrated Sensing, Computing and Communication paradigm. We deal with the constraints of the nano-drone by splitting the navigation task into two parts: a deep learning-based object detector runs on the edge (external hardware) while the planning algorithm is executed onboard. The results show the ability to command the drone at $\sim8$ frames-per-second and a model performance reaching a COCO mean-average-precision of $60.8$. Field experiments demonstrate the feasibility of the solution with the drone flying at a top speed of $1$ m/s while steering away from an obstacle placed in an unknown position and reaching the target destination. The outcome highlights the compatibility of the communication delay and the model performance with the requirements of the real-time navigation task. We provide a feasible alternative to a fully onboard implementation that can be extended to autonomous exploration with nano-drones.

[145] The City that Never Settles: Simulation-based LiDAR Dataset for Long-Term Place Recognition Under Extreme Structural Changes

Hyunho Song,Dongjae Lee,Seunghun Oh,Minwoo Jung,Ayoung Kim

Main category: cs.RO

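TL;DR: The paper presents CNS, a simulation-based dataset addressing the challenge that large-scale construction and demolition pose to long-term place recognition, and proposes the symmetric metric TCR_sym.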

Details

Motivation: Existing datasets do not adequately reflect major changes in outdoor environments, especially the structural changes caused by construction and demolition. Method: Builds the CNS dataset with the CARLA simulator and proposes the symmetric metric TCR_sym to measure structural change consistently. Result: CNS contains more extensive changes than existing benchmarks, and existing LiDAR place recognition methods degrade substantially on it. Conclusion: CNS fills a gap in existing data and underscores the need for more robust algorithms that can cope with major environmental change. Abstract: Large-scale construction and demolition significantly challenge long-term place recognition (PR) by drastically reshaping urban and suburban environments. Existing datasets predominantly reflect limited or indoor-focused changes, failing to adequately represent extensive outdoor transformations. To bridge this gap, we introduce the City that Never Settles (CNS) dataset, a simulation-based dataset created using the CARLA simulator, capturing major structural changes (such as building construction and demolition) across diverse maps and sequences. Additionally, we propose TCR_sym, a symmetric version of the original TCR metric, enabling consistent measurement of structural changes irrespective of source-target ordering. Quantitative comparisons demonstrate that CNS encompasses more extensive transformations than current real-world benchmarks. Evaluations of state-of-the-art LiDAR-based PR methods on CNS reveal substantial performance degradation, underscoring the need for robust algorithms capable of handling significant environmental changes. Our dataset is available at https://github.com/Hyunho111/CNS_dataset.

[146] Multi-Objective Reinforcement Learning for Adaptive Personalized Autonomous Driving

Hendrik Surmann,Jorge de Heuvel,Maren Bennewitz

Main category: cs.RO

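TL;DR: Proposes an autonomous driving approach based on multi-objective reinforcement learning (MORL) that adapts to user driving-style preferences at runtime without retraining the policy.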

Details

Motivation: Existing autonomous driving approaches rely on predefined driving styles or continuous user feedback, which makes dynamic, context-dependent preference adjustment hard to support. Method: Uses MORL with preference-driven optimization, encoding preferences as continuous weight vectors that modulate interpretable driving-style objectives such as efficiency and comfort. Result: Validated in the CARLA simulator, the agent adjusts its driving behavior dynamically according to preferences while maintaining collision avoidance and route completion performance. Conclusion: The approach supports dynamic preference adaptation, which benefits user trust and satisfaction. Abstract: Human drivers exhibit individual preferences regarding driving style. Adapting autonomous vehicles to these preferences is essential for user trust and satisfaction. However, existing end-to-end driving approaches often rely on predefined driving styles or require continuous user feedback for adaptation, limiting their ability to support dynamic, context-dependent preferences. We propose a novel approach using multi-objective reinforcement learning (MORL) with preference-driven optimization for end-to-end autonomous driving that enables runtime adaptation to driving style preferences. Preferences are encoded as continuous weight vectors to modulate behavior along interpretable style objectives (including efficiency, comfort, speed, and aggressiveness) without requiring policy retraining. Our single-policy agent integrates vision-based perception in complex mixed-traffic scenarios and is evaluated in diverse urban environments using the CARLA simulator. Experimental results demonstrate that the agent dynamically adapts its driving behavior according to changing preferences while maintaining performance in terms of collision avoidance and route completion.
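
One simple way to picture preferences encoded as continuous weight vectors is linear scalarization, where a weight vector turns the per-objective rewards into a single number. The sketch below assumes that form; the `scalarize` helper, the normalization step, and the reward values are hypothetical, and the paper's preference-conditioned single policy may combine objectives differently.

```python
def scalarize(rewards, weights):
    """Weighted sum of per-objective rewards, with weights normalized to sum to 1.

    Hypothetical helper: both arguments are dicts keyed by objective name.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(rewards[name] * weights[name] / total for name in weights)


# Made-up rewards for one simulation step, using the objectives named in the abstract.
step_rewards = {"efficiency": 1.0, "comfort": 0.0, "speed": 0.5, "aggressiveness": 0.0}

# Two preference vectors applied to the same step without touching the policy.
comfort_first = {"efficiency": 1, "comfort": 2, "speed": 1, "aggressiveness": 0}
speed_first = {"efficiency": 1, "comfort": 0, "speed": 3, "aggressiveness": 0}

r_comfort = scalarize(step_rewards, comfort_first)
r_speed = scalarize(step_rewards, speed_first)
```

The same step is scored differently under the two preference vectors, which is the signal a preference-conditioned agent would use to shift its behavior at runtime.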

eess.AS [Back]

[147] From Dialect Gaps to Identity Maps: Tackling Variability in Speaker Verification

Abdulhady Abas Abdullah,Soran Badawi,Dana A. Abdullah,Dana Rasul Hamad,Hanan Abdulrahman Taher,Sabat Salih Muhamad,Aram Mahmood Ahmed,Bryar A. Hassan,Sirwan Abdolwahed Aula,Tarik A. Rashid

Main category: eess.AS

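TL;DR: This paper studies the complexity and challenges of speaker recognition across the dialects of Kurdish, proposes ways to improve system accuracy, and verifies the effectiveness of cross-dialect training.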

Details

Motivation: Large phonetic and lexical differences between Kurdish dialects pose particular challenges for speaker recognition systems, and solutions are needed to improve recognition performance. Method: Uses advanced machine learning approaches, data augmentation strategies, and dialect-specific corpora, combined with cross-dialect training. Result: Dialect-specific strategies together with cross-dialect training significantly improve speaker recognition accuracy. Conclusion: Combining dialect-specific methods with cross-dialect training is an effective way to improve Kurdish speaker recognition systems. Abstract: The complexity and difficulties of Kurdish speaker detection among its several dialects are investigated in this work. Because of its great phonetic and lexical differences, Kurdish, with several dialects including Kurmanji, Sorani, and Hawrami, poses special challenges for speaker recognition systems. The main difficulties in building a strong speaker identification system capable of precisely identifying speakers across several dialects are investigated in this work. To raise the accuracy and dependability of these systems, it also suggests solutions like sophisticated machine learning approaches, data augmentation tactics, and the building of thorough dialect-specific corpora. The results show that customized strategies for every dialect together with cross-dialect training greatly enhance recognition performance.

cs.CR [Back]

[148] Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs

Chetan Pathade

Main category: cs.CR

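TL;DR: This paper systematically studies jailbreak strategies against several state-of-the-art LLMs, analyzes the success rates of over 1,400 adversarial prompts, and proposes layered mitigation strategies and a hybrid red-teaming and sandboxing approach.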

Details

Motivation: Despite their capabilities, LLMs remain vulnerable to adversarial attacks such as prompt injection and jailbreaks, so their security and mitigations need systematic study. Method: Categorizes and analyzes over 1,400 adversarial prompts, tests their success against GPT-4, Claude 2, Mistral 7B, and Vicuna, and examines their generalizability and construction logic. Result: Reveals how vulnerable different LLMs are to jailbreak attacks and proposes mitigation strategies. Conclusion: Recommends a hybrid red-teaming and sandboxing approach to strengthen LLM security. Abstract: Large Language Models (LLMs) are increasingly integrated into consumer and enterprise applications. Despite their capabilities, they remain susceptible to adversarial attacks such as prompt injection and jailbreaks that override alignment safeguards. This paper provides a systematic investigation of jailbreak strategies against various state-of-the-art LLMs. We categorize over 1,400 adversarial prompts, analyze their success against GPT-4, Claude 2, Mistral 7B, and Vicuna, and examine their generalizability and construction logic. We further propose layered mitigation strategies and recommend a hybrid red-teaming and sandboxing approach for robust LLM security.

cs.AI [Back]

[149] Towards Artificial Intelligence Research Assistant for Expert-Involved Learning

Tianyu Liu,Simeng Han,Xiao Luo,Hanchen Wang,Pan Lu,Biqing Zhu,Yuge Wang,Keyi Li,Jiapeng Chen,Rihao Qu,Yufeng Liu,Xinyue Cui,Aviv Yaish,Yuhang Chen,Minsheng Hao,Chuhan Li,Kexing Li,Arman Cohan,Hua Xu,Mark Gerstein,James Zou,Hongyu Zhao

Main category: cs.AI

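TL;DR: The paper introduces ARIEL, a multimodal dataset for evaluating and improving the text summarization and figure interpretation abilities of LLMs and LMMs in biomedical research, and uses expert evaluation and optimization strategies to show the models' strengths and limitations.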

Details

Motivation: LLMs and LMMs show promise for scientific research, but their reliability and specific contributions in biomedical applications have not been adequately studied. Method: Creates open-source sets of biomedical articles and figures, improves model performance through expert evaluation, prompt engineering, and fine-tuning, and tests the ability of LMM Agents to generate scientific hypotheses. Result: The models perform well on text summarization and figure interpretation but still have limitations, which offers guidance for future research. Conclusion: The study clarifies the strengths and weaknesses of current foundation models and provides actionable recommendations for applying large language and multimodal models in biomedical research. Abstract: Large Language Models (LLMs) and Large Multi-Modal Models (LMMs) have emerged as transformative tools in scientific research, yet their reliability and specific contributions to biomedical applications remain insufficiently characterized. In this study, we present ARtificial Intelligence research assistant for Expert-involved Learning (ARIEL), a multimodal dataset designed to benchmark and enhance two critical capabilities of LLMs and LMMs in biomedical research: summarizing extensive scientific texts and interpreting complex biomedical figures. To facilitate rigorous assessment, we create two open-source sets comprising biomedical articles and figures with designed questions. We systematically benchmark both open- and closed-source foundation models, incorporating expert-driven human evaluations conducted by doctoral-level experts. Furthermore, we improve model performance through targeted prompt engineering and fine-tuning strategies for summarizing research papers, and apply test-time computational scaling to enhance the reasoning capabilities of LMMs, achieving superior accuracy compared to human-expert corrections. We also explore the potential of using LMM Agents to generate scientific hypotheses from diverse multimodal inputs. Overall, our results delineate clear strengths and highlight significant limitations of current foundation models, providing actionable insights and guiding future advancements in deploying large-scale language and multi-modal models within biomedical research.

[150] CRAFT: Cultural Russian-Oriented Dataset Adaptation for Focused Text-to-Image Generation

Viacheslav Vasilev,Vladimir Arkhipkin,Julia Agafonova,Tatiana Nikulina,Evelina Mironova,Alisa Shichanina,Nikolai Gerasimenko,Mikhail Shoytov,Denis Dimitrov

Main category: cs.AI

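TL;DR: Text-to-image generation models handle general cultural queries well but have significant knowledge gaps about individual cultures, mainly because training data leans toward Western European and American culture. This paper proposes a method for building a dataset based on a cultural code (here, the Russian one) and shows experimentally that it improves the model's cultural awareness.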

Details

Motivation: Because their training data leans toward Western European and American culture, existing text-to-image models understand other cultures poorly and may generate incorrect or stereotyped content. This paper aims to fill that research gap. Method: Proposes a method for collecting and processing data to build a dataset based on a cultural code (here, the Russian one), and validates it on the Kandinsky 3.1 model. Result: Experiments show that the method increases the model's awareness of Russian culture. Conclusion: A dataset built on a cultural code can effectively improve a text-to-image model's understanding of a specific culture. Abstract: Despite the fact that popular text-to-image generation models cope well with international and general cultural queries, they have a significant knowledge gap regarding individual cultures. This is due to the content of existing large training datasets collected on the Internet, which are predominantly based on Western European or American popular culture. Meanwhile, the lack of cultural adaptation of the model can lead to incorrect results, a decrease in the generation quality, and the spread of stereotypes and offensive content. In an effort to address this issue, we examine the concept of cultural code and recognize the critical importance of its understanding by modern image generation models, an issue that has not been sufficiently addressed in the research community to date. We propose a methodology for collecting and processing the data necessary to form a dataset based on the cultural code, in particular the Russian one. We explore how the collected data affects the quality of generations in the national domain and analyze the effectiveness of our approach using the Kandinsky 3.1 text-to-image model. Human evaluation results demonstrate an increase in the level of awareness of Russian culture in the model.

[151] Enigme: Generative Text Puzzles for Evaluating Reasoning in Language Models

John Hawkins

Main category: cs.AI

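TL;DR: The paper discusses the limitations of transformer-decoder language models in generating reasoning and presents enigme, an open-source tool that generates text puzzles for evaluating a model's reasoning ability.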

Details

Motivation: To understand the limitations of transformer-decoder models in natural language command understanding and reasoning tasks, in order to improve their use as general-purpose intelligence systems. Method: Analyzes the latent variable structure of the models to design reasoning tasks that probe the boundary of their ability, and develops the enigme library to generate text puzzles for evaluating reasoning. Result: Presents enigme, an open-source tool for training and evaluating the reasoning skills of transformer-decoder models and future AI architectures. Conclusion: Purpose-built reasoning tasks and tools allow a deeper evaluation and improvement of the reasoning ability of transformer-decoder models. Abstract: Transformer-decoder language models are a core innovation in text-based generative artificial intelligence. These models are being deployed as general-purpose intelligence systems in many applications. Central to their utility is the capacity to understand natural language commands and exploit the reasoning embedded in human text corpora to apply some form of reasoning process to a wide variety of novel tasks. To understand the limitations of this approach to generating reasoning we argue that we need to consider the architectural constraints of these systems. Consideration of the latent variable structure of transformer-decoder models allows us to design reasoning tasks that should probe the boundary of their capacity to reason. We present enigme, an open-source library for generating text-based puzzles to be used in training and evaluating reasoning skills within transformer-decoder models and future AI architectures.

eess.IV [Back]

[152] Rethinking Boundary Detection in Deep Learning-Based Medical Image Segmentation

Yi Lin,Dong Zhang,Xiao Fang,Yufan Chen,Kwang-Ting Cheng,Hao Chen

Main category: eess.IV

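TL;DR: Proposes a new network architecture named CTO that combines CNNs, ViT, and edge detection operators and markedly improves the accuracy of boundary segmentation in medical images.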

Details

Motivation: Current medical image segmentation methods still struggle with boundary regions, and a more precise solution is needed. Method: CTO uses a dual-stream encoder (CNN and StitchViT) and a boundary-guided decoder, with edge detection operators generating boundary masks that guide decoding. Result: On multiple medical image datasets, CTO achieves state-of-the-art accuracy while keeping model complexity competitive. Conclusion: CTO improves boundary segmentation without additional data or labels and is both efficient and practical. Abstract: Medical image segmentation is a pivotal task within the realms of medical image analysis and computer vision. While current methods have shown promise in accurately segmenting major regions of interest, the precise segmentation of boundary areas remains challenging. In this study, we propose a novel network architecture named CTO, which combines Convolutional Neural Networks (CNNs), Vision Transformer (ViT) models, and explicit edge detection operators to tackle this challenge. CTO surpasses existing methods in terms of segmentation accuracy and strikes a better balance between accuracy and efficiency, without the need for additional data inputs or label injections. Specifically, CTO adheres to the canonical encoder-decoder network paradigm, with a dual-stream encoder network comprising a mainstream CNN stream for capturing local features and an auxiliary StitchViT stream for integrating long-range dependencies. Furthermore, to enhance the model's ability to learn boundary areas, we introduce a boundary-guided decoder network that employs binary boundary masks generated by dedicated edge detection operators to provide explicit guidance during the decoding process. We validate the performance of CTO through extensive experiments conducted on seven challenging medical image segmentation datasets, namely ISIC 2016, PH2, ISIC 2018, CoNIC, LiTS17, and BTCV. Our experimental results unequivocally demonstrate that CTO achieves state-of-the-art accuracy on these datasets while maintaining competitive model complexity. The codes have been released at: https://github.com/xiaofang007/CTO.
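
To make the idea of a binary boundary mask produced by an edge detection operator concrete, here is a minimal sketch that runs a Sobel operator over a binary segmentation mask. The choice of Sobel, the `sobel_boundary` name, and applying the operator to a mask rather than to image features are assumptions; the abstract only says that dedicated edge detection operators are used.

```python
def sobel_boundary(mask):
    """Return a binary map that is 1 wherever the Sobel gradient of `mask` is non-zero.

    `mask` is a 2D list of 0/1 values; the one-pixel image border is left at 0.
    """
    h, w = len(mask), len(mask[0])
    gx_k = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal gradient kernel
    gy_k = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical gradient kernel
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(gx_k[i][j] * mask[y + i - 1][x + j - 1]
                     for i in range(3) for j in range(3))
            gy = sum(gy_k[i][j] * mask[y + i - 1][x + j - 1]
                     for i in range(3) for j in range(3))
            out[y][x] = 1 if (gx != 0 or gy != 0) else 0
    return out


# Toy 7x7 mask with a 3x3 foreground square in the middle.
mask = [[1 if 2 <= y <= 4 and 2 <= x <= 4 else 0 for x in range(7)] for y in range(7)]
edges = sobel_boundary(mask)
```

The interior pixel of the square has zero gradient and stays 0, while pixels on either side of the contour are marked, giving a thin band that a decoder could use as boundary guidance.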

[153] Advancing 3D Medical Image Segmentation: Unleashing the Potential of Planarian Neural Networks in Artificial Intelligence

Ziyuan Huang,Kevin Huggins,Srikar Bellur

Main category: eess.IV

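TL;DR: PNN-UNet is a deep learning method for 3D medical image segmentation modeled on the planarian neural network structure, and it outperforms the conventional UNet and its variants.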

Details

Motivation: Inspired by the planarian neural network, the study aims to design a more effective network structure for 3D medical image segmentation. Method: PNN-UNet uses a Deep-UNet and a Wide-UNet as the nerve cords and a densely connected autoencoder as the brain, forming a distinct architecture. Result: On a 3D MRI hippocampus dataset, PNN-UNet outperforms the baseline UNet and other variants in image segmentation. Conclusion: The bio-inspired design of PNN-UNet improves 3D medical image segmentation and suggests new directions for future research. Abstract: Our study presents PNN-UNet as a method for constructing deep neural networks that replicate the planarian neural network (PNN) structure in the context of 3D medical image data. Planarians typically have a cerebral structure comprising two neural cords, where the cerebrum acts as a coordinator, and the neural cords serve slightly different purposes within the organism's neurological system. Accordingly, PNN-UNet comprises a Deep-UNet and a Wide-UNet as the nerve cords, with a densely connected autoencoder performing the role of the brain. This distinct architecture offers advantages over both monolithic (UNet) and modular networks (Ensemble-UNet). Our outcomes on a 3D MRI hippocampus dataset, with and without data augmentation, demonstrate that PNN-UNet outperforms the baseline UNet and several other UNet variants in image segmentation.

[154] Advanced 3D Imaging Approach to TSV/TGV Metrology and Inspection Using Only Optical Microscopy

Gugeong Sung

Main category: eess.IV

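TL;DR: Proposes an approach that combines hybrid field microscopy with photometric stereo for inspecting silicon and glass vias, overcoming the limitations of conventional optical microscopy.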

Details

Motivation: Conventional optical microscopy is limited to surface inspection and struggles to visualize the internal structures of silicon and glass vias. Method: Combines photometric stereo with conventional optical microscopy and uses multiple lighting conditions for 3D reconstruction, which improves the detection of micro-scale defects. Result: Experiments show that the method captures intricate surface details and internal structures and improves the cost-effectiveness and accuracy of inspection. Conclusion: The method advances silicon and glass via inspection with high accuracy and repeatability. Abstract: This paper introduces an innovative approach to silicon and glass via inspection, which combines hybrid field microscopy with photometric stereo. Conventional optical microscopy techniques are generally limited to superficial inspections and struggle to effectively visualize the internal structures of silicon and glass vias. By utilizing various lighting conditions for 3D reconstruction, the proposed method surpasses these limitations. By integrating photometric stereo into traditional optical microscopy, the proposed method not only enhances the capability to detect micro-scale defects but also provides a detailed visualization of depth and edge abnormality, which are typically not visible with conventional optical microscopy inspection. The experimental results demonstrated that the proposed method effectively captures intricate surface details and internal structures. Quantitative comparisons between the reconstructed models and actual measurements demonstrate the capability of the proposed method to significantly improve the silicon and glass via inspection process. As a result, the proposed method achieves enhanced cost-effectiveness while maintaining high accuracy and repeatability, suggesting substantial advancements in silicon and glass via inspection techniques.
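
For readers unfamiliar with photometric stereo, the sketch below shows the textbook Lambertian version for a single pixel: with three known light directions and three measured intensities, the scaled normal g = albedo * n solves the linear system L g = I. The Lambertian model, the three-light setup, and the helper names are assumptions for illustration and are not taken from the paper, whose hybrid field microscopy pipeline is not described in the abstract.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve_normal(lights, intensities):
    """Solve L g = I with Cramer's rule and return (albedo, unit normal)."""
    d = det3(lights)
    if abs(d) < 1e-12:
        raise ValueError("light directions must be linearly independent")
    g = []
    for col in range(3):
        m = [row[:] for row in lights]
        for r in range(3):
            m[r][col] = intensities[r]   # replace one column with the measurements
        g.append(det3(m) / d)
    albedo = math.sqrt(sum(v * v for v in g))
    if albedo == 0:
        return 0.0, [0.0, 0.0, 0.0]      # fully dark pixel: normal is undefined
    return albedo, [v / albedo for v in g]


# Three unit light directions and intensities synthesized for a pixel facing
# straight up (normal [0, 0, 1]) with albedo 0.5; all values are made up.
lights = [[0.0, 0.0, 1.0], [0.6, 0.0, 0.8], [0.0, 0.6, 0.8]]
intensities = [0.5, 0.4, 0.4]
albedo, normal = solve_normal(lights, intensities)
```

Repeating this per pixel yields a normal map, which can then be integrated into a depth map; with more than three lights the same system is usually solved by least squares.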

[155] MoRe-3DGSMR: Motion-resolved reconstruction framework for free-breathing pulmonary MRI based on 3D Gaussian representation

Tengya Peng,Ruyi Zha,Qing Zou

Main category: eess.IV

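TL;DR: Proposes an unsupervised motion-resolved reconstruction framework based on a 3D Gaussian representation for high-resolution free-breathing pulmonary MRI, using data smoothing and spatial transformation to reconstruct high-quality images.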

Details

Motivation: To address the challenges of motion-resolved, 3D isotropic reconstruction in free-breathing pulmonary MRI and improve image quality. Method: Acquires data with a golden-angle radial sampling trajectory, then uses a 3D Gaussian representation and a convolutional neural network that estimates deformation vector fields to reconstruct multiple respiratory phases. Result: Validated on six datasets, the method gives higher image quality than existing methods, with better signal-to-noise and contrast-to-noise ratios. Conclusion: The method provides an accurate, unsupervised solution for clinical pulmonary MRI. Abstract: This study presents an unsupervised, motion-resolved reconstruction framework for high-resolution, free-breathing pulmonary magnetic resonance imaging (MRI), utilizing a three-dimensional Gaussian representation (3DGS). The proposed method leverages 3DGS to address the challenges of motion-resolved 3D isotropic pulmonary MRI reconstruction by enabling data smoothing between voxels for continuous spatial representation. Pulmonary MRI data acquisition is performed using a golden-angle radial sampling trajectory, with respiratory motion signals extracted from the center of k-space in each radial spoke. Based on the estimated motion signal, the k-space data is sorted into multiple respiratory phases. A 3DGS framework is then applied to reconstruct a reference image volume from the first motion state. Subsequently, a patient-specific convolutional neural network is trained to estimate the deformation vector fields (DVFs), which are used to generate the remaining motion states through spatial transformation of the reference volume. The proposed reconstruction pipeline is evaluated on six datasets from six subjects and benchmarked against three state-of-the-art reconstruction methods. The experimental findings demonstrate that the proposed reconstruction framework effectively reconstructs high-resolution, motion-resolved pulmonary MR images. Compared with existing approaches, it achieves superior image quality, reflected by higher signal-to-noise ratio and contrast-to-noise ratio. The proposed unsupervised 3DGS-based reconstruction method enables accurate motion-resolved pulmonary MRI with isotropic spatial resolution. Its superior performance in image quality metrics over state-of-the-art methods highlights its potential as a robust solution for clinical pulmonary MR imaging.
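
The step of sorting k-space data into respiratory phases can be pictured with a small binning routine. The sketch below assigns each radial spoke to a phase by the rank of its motion-signal amplitude so that every phase receives the same number of spokes; the equal-count rule, the `bin_spokes` name, and the toy signal are assumptions, since the abstract does not say how the bins are defined.

```python
def bin_spokes(motion_signal, num_phases):
    """Group spoke indices into respiratory phases by amplitude rank (equal-count bins).

    Hypothetical helper: `motion_signal[i]` is the respiratory amplitude
    estimated for spoke i, for example from the centre of k-space.
    """
    order = sorted(range(len(motion_signal)), key=lambda i: motion_signal[i])
    phases = [[] for _ in range(num_phases)]
    for rank, spoke in enumerate(order):
        # Lowest amplitudes go to phase 0, highest to the last phase.
        phases[rank * num_phases // len(order)].append(spoke)
    return [sorted(p) for p in phases]


# Made-up amplitudes for six spokes, sorted into three phases.
signal = [0.1, 0.9, 0.5, 0.2, 0.8, 0.4]
phases = bin_spokes(signal, 3)
```

Under these assumptions, phase 0 would play the role of the first motion state from which the reference volume is reconstructed, and the remaining phases would be the states reached by deforming it.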

[156] ADNP-15: An Open-Source Histopathological Dataset for Neuritic Plaque Segmentation in Human Brain Whole Slide Images with Frequency Domain Image Enhancement for Stain Normalization

Chenxi Zhao,Jianqiang Li,Qing Zhao,Jing Bai,Susana Boluda,Benoit Delatour,Lev Stimmer,Daniel Racoceanu,Gabriel Jimenez,Guanghui Fu

Main category: eess.IV

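TL;DR: The study presents ADNP-15, an open-source dataset for identifying and segmenting neuritic plaques, and evaluates five deep learning models and four stain normalization techniques. It also proposes a new image enhancement method that improves segmentation accuracy.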

Details

Motivation: Neuritic plaque segmentation in Alzheimer's disease (AD) is hampered by a shortage of annotated data and by staining variation, so effective stain normalization and enhancement techniques are needed. Method: Introduces the ADNP-15 dataset, evaluates five deep learning models across four stain normalization techniques, and proposes a new image enhancement method. Result: Experiments show that the proposed enhancement method significantly improves model generalization and segmentation accuracy. Conclusion: All datasets and code are open-source, giving AD research transparent and reproducible tools and supporting further progress in the field. Abstract: Alzheimer's Disease (AD) is a neurodegenerative disorder characterized by amyloid-beta plaques and tau neurofibrillary tangles, which serve as key histopathological features. The identification and segmentation of these lesions are crucial for understanding AD progression but remain challenging due to the lack of large-scale annotated datasets and the impact of staining variations on automated image analysis. Deep learning has emerged as a powerful tool for pathology image segmentation; however, model performance is significantly influenced by variations in staining characteristics, necessitating effective stain normalization and enhancement techniques. In this study, we address these challenges by introducing an open-source dataset (ADNP-15) of neuritic plaques (i.e., amyloid deposits combined with a crown of dystrophic tau-positive neurites) in human brain whole slide images. We establish a comprehensive benchmark by evaluating five widely adopted deep learning models across four stain normalization techniques, providing deeper insights into their influence on neuritic plaque segmentation. Additionally, we propose a novel image enhancement method that improves segmentation accuracy, particularly in complex tissue structures, by enhancing structural details and mitigating staining inconsistencies. Our experimental results demonstrate that this enhancement strategy significantly boosts model generalization and segmentation accuracy. All datasets and code are open-source, ensuring transparency and reproducibility while enabling further advancements in the field.

[157] Direct Image Classification from Fourier Ptychographic Microscopy Measurements without Reconstruction

Navya Sonal Agarwal,Jan Philipp Schneider,Kanchana Vaishnavi Gandikota,Syed Muhammad Kazim,John Meshreki,Ivo Ihrke,Michael Moeller

Main category: eess.IV

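TL;DR: The paper proposes classifying images directly from FPM measurements, which avoids the computational cost of high-resolution reconstruction, and shows that a CNN markedly improves classification performance.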

Details

Motivation: FPM delivers high resolution over a wide field of view, but reconstruction is computationally expensive and too slow for uses such as medical applications, which motivates classifying directly on the measurements. Method: Uses a convolutional neural network (CNN) to extract information from FPM measurement sequences for classification, and explores multiplexing to reduce the amount of data. Result: Classification on measurement sequences outperforms classification on a single image by up to 12% and is more efficient than high-resolution reconstruction; multiplexing reduces the data volume while maintaining classification accuracy. Conclusion: Classifying directly on FPM measurements is feasible, and CNNs with multiplexing improve efficiency and performance for practical uses such as medicine. Abstract: The computational imaging technique of Fourier Ptychographic Microscopy (FPM) enables high-resolution imaging with a wide field of view and can serve as an extremely valuable tool, e.g. in the classification of cells in medical applications. However, reconstructing a high-resolution image from tens or even hundreds of measurements is computationally expensive, particularly for a wide field of view. Therefore, in this paper, we investigate the idea of classifying the image content in the FPM measurements directly without performing a reconstruction step first. We show that Convolutional Neural Networks (CNN) can extract meaningful information from measurement sequences, significantly outperforming the classification on a single band-limited image (by up to 12%) while being significantly more efficient than a reconstruction of a high-resolution image. Furthermore, we demonstrate that a learned multiplexing of several raw measurements allows maintaining the classification accuracy while reducing the amount of data (and consequently also the acquisition time) significantly.

[158] RepSNet: A Nucleus Instance Segmentation model based on Boundary Regression and Structural Re-parameterization

Shengchun Xiong,Xiangru Li,Yunpeng Zhong,Wanfen Peng

Main category: eess.IV

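TL;DR: RepSNet is a neural network model based on nucleus boundary regression and structural re-parameterization for nucleus instance segmentation and classification in H&E-stained histopathological images, addressing the challenges of computational efficiency and overlapping targets.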

Details

Motivation: Pathological diagnosis is the gold standard for tumor diagnosis, and nucleus instance segmentation is a key step in digital pathology analysis, but computational efficiency and the handling of overlapping targets remain the main challenges. Method: RepSNet estimates, for each pixel, the boundary position information (BPI) of its parent nucleus from local and contextual information, aggregates the BPIs with a boundary voting mechanism (BVM) to estimate nucleus boundaries, and obtains instances through connected component analysis. The model uses a re-parameterizable encoder-decoder structure that improves feature aggregation and reduces the computational burden at inference. Result: Experiments show that RepSNet outperforms several typical benchmark models. Conclusion: By integrating macroscopic information and optimizing its structure, RepSNet improves both the accuracy and the computational efficiency of nucleus instance segmentation. Abstract: Pathological diagnosis is the gold standard for tumor diagnosis, and nucleus instance segmentation is a key step in digital pathology analysis and pathological diagnosis. However, the computational efficiency of the model and the treatment of overlapping targets are the major challenges in the studies of this problem. To this end, a neural network model RepSNet was designed based on a nucleus boundary regression and a structural re-parameterization scheme for segmenting and classifying the nuclei in H&E-stained histopathological images. First, RepSNet estimates the boundary position information (BPI) of the parent nucleus for each pixel. The BPI estimation incorporates the local information of the pixel and the contextual information of the parent nucleus. Then, the nucleus boundary is estimated by aggregating the BPIs from a series of pixels using a proposed boundary voting mechanism (BVM), and the instance segmentation results are computed from the estimated nucleus boundary using a connected component analysis procedure. The BVM intrinsically achieves a kind of synergistic belief enhancement among the BPIs from various pixels. Therefore, different from the methods available in literature that obtain nucleus boundaries based on a direct pixel recognition scheme, RepSNet computes its boundary decisions based on guidance from macroscopic information using an integration mechanism. In addition, RepSNet employs a re-parameterizable encoder-decoder structure. This model can not only aggregate features from receptive fields with various scales, which helps improve segmentation accuracy, but also reduce the parameter count and computational burden in the model inference phase through the structural re-parameterization technique. Extensive experiments demonstrated the superiority of RepSNet compared to several typical benchmark models.
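
The final step described for RepSNet, turning an estimated boundary into instances with connected component analysis, can be illustrated with a minimal sketch. It assumes the boundary mask is already available (the voting mechanism itself is not reproduced here) and uses 4-connectivity; the function name `label_instances` and the list-of-lists mask format are hypothetical choices for illustration, not the authors' implementation.

```python
from collections import deque


def label_instances(foreground, boundary):
    """Label 4-connected foreground regions, treating boundary pixels as separators.

    foreground, boundary: 2D lists of 0/1 with the same shape (assumed inputs).
    Returns (labels, count), where labels[r][c] is 0 for background/boundary
    pixels and 1..count for instance pixels.
    """
    height, width = len(foreground), len(foreground[0])
    labels = [[0] * width for _ in range(height)]
    count = 0
    for r in range(height):
        for c in range(width):
            if not foreground[r][c] or boundary[r][c] or labels[r][c]:
                continue
            # Start a new instance and flood-fill it breadth-first.
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and foreground[ny][nx]
                            and not boundary[ny][nx]
                            and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count
```

In this sketch two touching nuclei are separated only if the boundary mask contains a closed line of pixels between them, which is why the quality of the boundary estimate matters for overlapping targets.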

[159] MDAA-Diff: CT-Guided Multi-Dose Adaptive Attention Diffusion Model for PET Denoising

Xiaolong Niu,Zanting Ye,Xu Han,Yanchao Huang,Hao Sun,Hubing Wu,Lijun Lu

Main category: eess.IV

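TL;DR: Proposes MDAA-Diff, a CT-guided multi-dose adaptive attention denoising diffusion model for low-dose PET denoising that combines anatomical guidance with dose-level adaptation to improve denoising quality.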

Details

Motivation: High-dose radiotracers increase radiation risk, while existing work mostly targets single-dose denoising and neglects inter-patient differences in dose response and the anatomical constraints available from CT images. Method: A CT-guided high-frequency wavelet attention module extracts anatomical boundary features, and a dose-adaptive attention module dynamically integrates the dose level into the attention weights. Result: On 18F-FDG and 68Ga-FAPI datasets, MDAA-Diff outperforms existing methods under low-dose conditions while preserving diagnostic quality. Conclusion: By combining anatomical guidance with dose adaptation, MDAA-Diff offers an effective solution for low-dose PET denoising. Abstract: Acquiring high-quality Positron Emission Tomography (PET) images requires administering high-dose radiotracers, which increases radiation exposure risks. Generating standard-dose PET (SPET) from low-dose PET (LPET) has become a potential solution. However, previous studies have primarily focused on single low-dose PET denoising, neglecting two critical factors: discrepancies in dose response caused by inter-patient variability, and complementary anatomical constraints derived from CT images. In this work, we propose a novel CT-Guided Multi-dose Adaptive Attention Denoising Diffusion Model (MDAA-Diff) for multi-dose PET denoising. Our approach integrates anatomical guidance and dose-level adaptation to achieve superior denoising performance under low-dose conditions. Specifically, this approach incorporates a CT-Guided High-frequency Wavelet Attention (HWA) module, which uses wavelet transforms to separate high-frequency anatomical boundary features from CT images. These extracted features are then incorporated into PET imaging through an adaptive weighted fusion mechanism to enhance edge details. Additionally, we propose the Dose-Adaptive Attention (DAA) module, a dose-conditioned enhancement mechanism that dynamically integrates dose levels into channel-spatial attention weight calculation. Extensive experiments on 18F-FDG and 68Ga-FAPI datasets demonstrate that MDAA-Diff outperforms state-of-the-art approaches in preserving diagnostic quality under reduced-dose conditions. Our code is publicly available.

[160] Improved Brain Tumor Detection in MRI: Fuzzy Sigmoid Convolution in Deep Learning

Muhammad Irfan,Anum Nawaz,Riku Klen,Abdulhamit Subasi,Tomi Westerlund,Wei Chen

Main category: eess.IV

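TL;DR: Proposes a method based on fuzzy sigmoid convolution (FSC), combined with top-of-the-funnel and middle-of-the-funnel modules, that substantially reduces the number of trainable parameters while maintaining classification accuracy.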

Details

Motivation: Early detection and accurate diagnosis are essential to improving patient outcomes, but existing CNN models are overparameterized, which limits their performance gains. Method: Introduces FSC and two additional modules, using a novel convolutional operator that dilates the receptive field while preserving input data integrity, and incorporates fuzzy logic to improve feature extraction and classification. Result: On three benchmark datasets the FSC model reaches classification accuracies of 99.17%, 99.75%, and 99.89%, with 100 times fewer parameters than large-scale transfer learning architectures. Conclusion: The work offers lightweight, high-performance deep learning models for medical imaging, with efficient computation and early brain tumor detection capability. Abstract: Early detection and accurate diagnosis are essential to improving patient outcomes. The use of convolutional neural networks (CNNs) for tumor detection has shown promise, but existing models often suffer from overparameterization, which limits their performance gains. In this study, fuzzy sigmoid convolution (FSC) is introduced along with two additional modules: top-of-the-funnel and middle-of-the-funnel. The proposed methodology significantly reduces the number of trainable parameters without compromising classification accuracy. A novel convolutional operator is central to this approach, effectively dilating the receptive field while preserving input data integrity. This enables efficient feature map reduction and enhances the model's tumor detection capability. In the FSC-based model, fuzzy sigmoid activation functions are incorporated within convolutional layers to improve feature extraction and classification. The inclusion of fuzzy logic into the architecture improves its adaptability and robustness. Extensive experiments on three benchmark datasets demonstrate the superior performance and efficiency of the proposed model. The FSC-based architecture achieved classification accuracies of 99.17%, 99.75%, and 99.89% on three different datasets. The model employs 100 times fewer parameters than large-scale transfer learning architectures, highlighting its computational efficiency and suitability for detecting brain tumors early. This research offers lightweight, high-performance deep-learning models for medical imaging applications.

[161] White Light Specular Reflection Data Augmentation for Deep Learning Polyp Detection

Jose Angel Nuñez,Fabian Vazquez,Diego Adame,Xiaoyan Fu,Pengfei Gu,Bin Fu

Main category: eess.IV

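TL;DR: Proposes a new data augmentation method that adds artificial white light reflections to training images to improve deep learning models for colon polyp detection.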

Details

Motivation: Early detection of colorectal cancer relies on colonoscopy, but human error can cause polyps to be missed. Existing deep learning polyp detectors often mistake white light reflections from the endoscope for polyps, producing false positives. Method: Build a bank of artificial white light reflections, identify the image regions where reflections should not be added, and use a sliding window to place reflections in suitable regions, yielding augmented images. Result: Experiments show that the new augmentation method improves model performance. Conclusion: Giving the model more opportunities to make mistakes, and to learn from them, improves polyp detection accuracy. Abstract: Colorectal cancer is one of the deadliest cancers today, but it can be prevented through early detection of malignant polyps in the colon, primarily via colonoscopies. While this method has saved many lives, human error remains a significant challenge, as missing a polyp could have fatal consequences for the patient. Deep learning (DL) polyp detectors offer a promising solution. However, existing DL polyp detectors often mistake white light reflections from the endoscope for polyps, which can lead to false positives. To address this challenge, in this paper, we propose a novel data augmentation approach that artificially adds more white light reflections to create harder training scenarios. Specifically, we first generate a bank of artificial lights using the training dataset. Then we find the regions of the training images to which these artificial lights should not be added. Finally, we propose a sliding window method to add the artificial light to suitable areas of the training images, resulting in augmented images. By providing the model with more opportunities to make mistakes, we hypothesize that it will also have more chances to learn from those mistakes, ultimately improving its performance in polyp detection. Experimental results demonstrate the effectiveness of our new data augmentation method.

[162] Benchmarking Ophthalmology Foundation Models for Clinically Significant Age Macular Degeneration Detection

Benjamin A. Cohen,Jonathan Fhima,Meishar Meisel,Baskin Meital,Luis Filipe Nakayama,Eran Berkowitz,Joachim A. Behar

Main category: eess.IV

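TL;DR: Self-supervised learning (SSL) lets ViTs learn robust representations from large-scale natural image data and generalize across domains. In retinal imaging, models pretrained on natural or ophthalmic data both perform well, but the value of in-domain pretraining is unclear. This paper finds experimentally that iBOT pretrained on natural images performs best on AMD identification, ahead of domain-specific models and a baseline with no pretraining.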

Details

Motivation: To examine whether in-domain pretraining is necessary for retinal imaging tasks and how self-supervised pretrained models perform. Method: Benchmarks six SSL-pretrained ViTs on seven DFI datasets (70,000 expert-annotated images in total) for AMD identification. Result: iBOT, pretrained on natural images, achieves the best out-of-distribution generalization (AUROC 0.80-0.97), ahead of domain-specific models (0.78-0.96) and a baseline with no pretraining (0.68-0.91). Conclusion: The study highlights the value of foundation models for AMD identification, challenges the assumed need for in-domain pretraining, and releases BRAMD, an open dataset from Brazil. Abstract: Self-supervised learning (SSL) has enabled Vision Transformers (ViTs) to learn robust representations from large-scale natural image datasets, enhancing their generalization across domains. In retinal imaging, foundation models pretrained on either natural or ophthalmic data have shown promise, but the benefits of in-domain pretraining remain uncertain. To investigate this, we benchmark six SSL-pretrained ViTs on seven digital fundus image (DFI) datasets totaling 70,000 expert-annotated images for the task of moderate-to-late age-related macular degeneration (AMD) identification. Our results show that iBOT pretrained on natural images achieves the highest out-of-distribution generalization, with AUROCs of 0.80-0.97, outperforming domain-specific models, which achieved AUROCs of 0.78-0.96, and a baseline ViT-L with no pretraining, which achieved AUROCs of 0.68-0.91. These findings highlight the value of foundation models in improving AMD identification and challenge the assumption that in-domain pretraining is necessary. Furthermore, we release BRAMD, an open-access dataset (n=587) of DFIs with AMD labels from Brazil.
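
The comparison above is reported in AUROC. As a reminder of what that number measures, here is a minimal sketch that computes AUROC as the probability that a randomly chosen positive image is scored above a randomly chosen negative one (ties count one half). The function name and the toy labels and scores are illustrative assumptions, not data from the paper.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 ground-truth labels (1 = AMD present, assumed).
    scores: iterable of model scores, higher meaning more likely positive.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The quadratic loop is fine for a sketch; a rank-based formulation gives the same value in O(n log n) for large evaluation sets.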

[163] Augmented Deep Contexts for Spatially Embedded Video Coding

Yifan Bian,Chuanbo Tang,Li Li,Dong Liu

Main category: eess.IV

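TL;DR: SEVC is a neural video codec that combines spatial and temporal references, using augmented motion vectors and hybrid spatial-temporal contexts to address the limitations of temporal-only codecs under large motions or emerging objects.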

Details

Motivation: Existing neural video codecs rely only on temporal references, so they perform poorly under large motions or when new objects appear. Method: SEVC uses spatial and temporal references to generate augmented motion vectors and hybrid spatial-temporal contexts, and introduces a spatial-guided latent prior and a joint spatial-temporal optimization. Result: SEVC handles large motions and emerging objects better and reduces bitrate by 11.9% more than the previous state-of-the-art NVC. Conclusion: Combining spatial and temporal references improves the performance and adaptability of neural video coding. Abstract: Most Neural Video Codecs (NVCs) only employ temporal references to generate temporal-only contexts and latent prior. These temporal-only NVCs fail to handle large motions or emerging objects due to limited contexts and misaligned latent prior. To relieve the limitations, we propose a Spatially Embedded Video Codec (SEVC), in which the low-resolution video is compressed for spatial references. Firstly, our SEVC leverages both spatial and temporal references to generate augmented motion vectors and hybrid spatial-temporal contexts. Secondly, to address the misalignment issue in latent prior and enrich the prior information, we introduce a spatial-guided latent prior augmented by multiple temporal latent representations. Finally, we design a joint spatial-temporal optimization to learn quality-adaptive bit allocation for spatial references, further boosting rate-distortion performance. Experimental results show that our SEVC effectively alleviates the limitations in handling large motions or emerging objects, and also reduces 11.9% more bitrate than the previous state-of-the-art NVC while providing an additional low-resolution bitstream. Our code and model are available at https://github.com/EsakaK/SEVC.

[164] OcularAge: A Comparative Study of Iris and Periocular Images for Pediatric Age Estimation

Naveenkumar G Venkataswamy,Poorna Ravi,Stephanie Schuckers,Masudul H. Imtiaz

Main category: eess.IV

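TL;DR: This study compares iris and periocular images for age estimation in children aged 4 to 16, finds that periocular models outperform iris models, and shows their potential for privacy-preserving applications.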

Details

Motivation: Estimating a child's age is difficult because physiological changes are subtle and longitudinal datasets are scarce; existing work focuses mostly on adult facial features, leaving children's periocular and iris regions little studied. Method: Uses a longitudinal dataset of more than 21,000 near-infrared images and a multi-task deep learning framework for joint age prediction and age-group classification, comparing different CNN architectures. Result: Periocular models reach a mean absolute error of 1.33 years and an age-group classification accuracy of 83.82%, and remain robust across imaging sensors. Conclusion: The study is the first to show that children's ocular images support reliable age estimation, provides a foundation for child-focused biometric systems, and demonstrates potential for real-time use. Abstract: Estimating a child's age from ocular biometric images is challenging due to subtle physiological changes and the limited availability of longitudinal datasets. Although most biometric age estimation studies have focused on facial features and adult subjects, pediatric-specific analysis, particularly of the iris and periocular regions, remains relatively unexplored. This study presents a comparative evaluation of iris and periocular images for estimating the ages of children aged between 4 and 16 years. We utilized a longitudinal dataset comprising more than 21,000 near-infrared (NIR) images, collected from 288 pediatric subjects over eight years using two different imaging sensors. A multi-task deep learning framework was employed to jointly perform age prediction and age-group classification, enabling a systematic exploration of how different convolutional neural network (CNN) architectures, particularly those adapted for non-square ocular inputs, capture the complex variability inherent in pediatric eye images. The results show that periocular models consistently outperform iris-based models, achieving a mean absolute error (MAE) of 1.33 years and an age-group classification accuracy of 83.82%. These results mark the first demonstration that reliable age estimation is feasible from children's ocular images, enabling privacy-preserving age checks in child-centric applications. This work establishes the first longitudinal benchmark for pediatric ocular age estimation, providing a foundation for designing robust, child-focused biometric systems. The developed models proved resilient across different imaging sensors, confirming their potential for real-world deployment. They also achieved inference speeds of less than 10 milliseconds per image on resource-constrained VR headsets, demonstrating their suitability for real-time applications.

cs.IR [Back]

[165] HiPerRAG: High-Performance Retrieval Augmented Generation for Scientific Insights

Ozan Gokdemir,Carlo Siebenschuh,Alexander Brace,Azton Wells,Brian Hsu,Kyle Hippe,Priyanka V. Setty,Aswathy Ajith,J. Gregory Pauloski,Varuni Sastry,Sam Foreman,Huihuo Zheng,Heng Ma,Bharat Kale,Nicholas Chia,Thomas Gibbs,Michael E. Papka,Thomas Brettin,Francis J. Alexander,Anima Anandkumar,Ian Foster,Rick Stevens,Venkatram Vishwanath,Arvind Ramanathan

Main category: cs.IR

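TL;DR: HiPerRAG uses high performance computing (HPC) to scale RAG to more than 3.6 million scientific articles, improving retrieval accuracy and performing strongly on scientific question answering.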

Details

Motivation: The rapid growth of scientific literature leads to underused findings, duplicated effort, and limited cross-disciplinary collaboration; RAG can improve the factuality of LLMs when processing scientific information. Method: HiPerRAG combines Oreo, a multimodal document parsing model, with ColTrast, a query-aware encoder fine-tuning algorithm that uses contrastive learning and late-interaction techniques to improve retrieval. Result: Reaches 90% accuracy on SciQ and 76% on PubMedQA, outperforming PubMedGPT and GPT-4. Conclusion: HiPerRAG uses HPC to integrate scientific knowledge at scale and foster interdisciplinary innovation. Abstract: The volume of scientific literature is growing exponentially, leading to underutilized discoveries, duplicated efforts, and limited cross-disciplinary collaboration. Retrieval Augmented Generation (RAG) offers a way to assist scientists by improving the factuality of Large Language Models (LLMs) in processing this influx of information. However, scaling RAG to handle millions of articles introduces significant challenges, including the high computational costs associated with parsing documents and embedding scientific knowledge, as well as the algorithmic complexity of aligning these representations with the nuanced semantics of scientific content. To address these issues, we introduce HiPerRAG, a RAG workflow powered by high performance computing (HPC) to index and retrieve knowledge from more than 3.6 million scientific articles. At its core are Oreo, a high-throughput model for multimodal document parsing, and ColTrast, a query-aware encoder fine-tuning algorithm that enhances retrieval accuracy by using contrastive learning and late-interaction techniques. HiPerRAG delivers robust performance on existing scientific question answering benchmarks and two new benchmarks introduced in this work, achieving 90% accuracy on SciQ and 76% on PubMedQA, outperforming both domain-specific models like PubMedGPT and commercial LLMs such as GPT-4. Scaling to thousands of GPUs on the Polaris, Sunspot, and Frontier supercomputers, HiPerRAG delivers million document-scale RAG workflows for unifying scientific knowledge and fostering interdisciplinary innovation.

[166] Prompt-Based LLMs for Position Bias-Aware Reranking in Personalized Recommendations

Md Aminul Islam,Ahmed Sayeed Faruk

Main category: cs.IR

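TL;DR: Proposes a hybrid framework that combines a traditional recommendation model with a large language model (LLM) to rerank recommendations, and studies the limitations of LLMs in ranking tasks, such as position bias and weak modeling of ranking context.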

Details

Motivation: LLMs show promise for prompt-based recommendation but suffer from limited context windows, position bias, and related issues, so their practical effect and possible mitigations need study. Method: A hybrid framework combines a traditional recommendation model with an LLM that reranks the top-k items through structured prompts; the effects of reordering user history and of instructional prompts on position bias are evaluated. Result: Randomizing user history improves ranking quality, but LLM-based reranking does not outperform the base model, and explicit instructions do not reduce position bias. Conclusion: LLMs have limited ability to model ranking context and mitigate bias, and further work on improvements is needed. Abstract: Recommender systems are essential for delivering personalized content across digital platforms by modeling user preferences and behaviors. Recently, large language models (LLMs) have been adopted for prompt-based recommendation due to their ability to generate personalized outputs without task-specific training. However, LLM-based methods face limitations such as limited context window size, inefficient pointwise and pairwise prompting, and difficulty handling listwise ranking due to token constraints. LLMs can also be sensitive to position bias, as they may overemphasize earlier items in the prompt regardless of their true relevance. To address and investigate these issues, we propose a hybrid framework that combines a traditional recommendation model with an LLM for reranking top-k items using structured prompts. We evaluate the effects of user history reordering and instructional prompts for mitigating position bias. Experiments on MovieLens-100K show that randomizing user history improves ranking quality, but LLM-based reranking does not outperform the base model. Explicit instructions to reduce position bias are also ineffective. Our evaluations reveal limitations in LLMs' ability to model ranking context and mitigate bias. Our code is publicly available at https://github.com/aminul7506/LLMForReRanking.
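
The history-randomization idea can be sketched as a prompt builder that shuffles the user's history before listing the top-k candidates for the LLM to rerank. The prompt wording, the function name `build_rerank_prompt`, and the movie titles are all hypothetical; the paper's actual structured prompt is not given in the abstract.

```python
import random


def build_rerank_prompt(history, candidates, seed=None, shuffle_history=True):
    """Build a listwise reranking prompt (hypothetical template).

    history: titles the user interacted with, in their original order.
    candidates: top-k titles produced by the base recommender.
    shuffle_history: randomize the history order to weaken position bias.
    """
    items = list(history)  # copy so the caller's list is not mutated
    if shuffle_history:
        random.Random(seed).shuffle(items)
    lines = ["The user recently watched:"]
    lines += [f"- {title}" for title in items]
    lines.append("Rank the following candidates from most to least relevant:")
    lines += [f"{i}. {title}" for i, title in enumerate(candidates, start=1)]
    return "\n".join(lines)


prompt = build_rerank_prompt(
    ["Alien", "Heat", "Fargo"], ["Se7en", "Casino"], seed=0
)
print(prompt)
```

Passing a fixed `seed` keeps an experiment reproducible, while drawing a fresh permutation per request averages out any advantage that early history items would otherwise receive.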

cs.MM [Back]

[167] SSH-Net: A Self-Supervised and Hybrid Network for Noisy Image Watermark Removal

Wenyang Liu,Jianjun Gao,Kim-Hui Yap

Main category: cs.MM

TL;DR: SSH-Net is a self-supervised hybrid network that removes visible watermarks and noise from images without requiring paired datasets.

Details

Motivation: Existing methods depend on paired datasets that are hard to obtain in practice, which motivates a self-supervised approach. Method: A dual-network design in which the upper network handles noise and the lower network handles watermarks and noise together, combining CNN and Transformer components. Result: SSH-Net removes watermarks and noise without relying on paired data. Conclusion: SSH-Net offers a practical self-supervised solution for watermark removal. Abstract: Visible watermark removal is challenging due to its inherent complexities and the noise carried within images. Existing methods primarily rely on supervised learning approaches that require paired datasets of watermarked and watermark-free images, which are often impractical to obtain in real-world scenarios. To address this challenge, we propose SSH-Net, a Self-Supervised and Hybrid Network specifically designed for noisy image watermark removal. SSH-Net synthesizes reference watermark-free images using the watermark distribution in a self-supervised manner and adopts a dual-network design to address the task. The upper network, focused on the simpler task of noise removal, employs a lightweight CNN-based architecture, while the lower network, designed to handle the more complex task of simultaneously removing watermarks and noise, incorporates Transformer blocks to model long-range dependencies and capture intricate image features. To enhance the model's effectiveness, a shared CNN-based feature encoder is introduced before dual networks to extract common features that both networks can leverage. Our code will be available at https://github.com/wenyang001/SSH-Net.

eess.SP [Back]

[168] Integrated Image Reconstruction and Target Recognition based on Deep Learning Technique

Cien Zhang,Jiaming Zhang,Jiajun He,Okan Yurduseven

Main category: eess.SP

TL;DR: Att-ClassiGAN adds attention gate modules to ClassiGAN, improving image reconstruction and target classification in computational microwave imaging.

Details

Motivation: Conventional microwave imaging suffers from heavy hardware requirements and slow data acquisition; computational microwave imaging (CMI) improves on this but still faces a computational bottleneck at the image reconstruction stage. Method: Attention gate modules are added to ClassiGAN to focus dynamically on important features and refine feature extraction and information identification. Result: Att-ClassiGAN significantly reduces reconstruction time and outperforms current advanced methods in NMSE, SSIM, and classification performance. Conclusion: The attention mechanism improves CMI performance, and Att-ClassiGAN provides an efficient solution for computational microwave imaging. Abstract: Computational microwave imaging (CMI) has gained attention as an alternative to conventional microwave imaging techniques, addressing limitations such as a hardware-intensive physical layer and slow data acquisition, to name a few. Despite these advantages, CMI still encounters notable computational bottlenecks, especially during the image reconstruction stage. In this setting, both image recovery and object classification present significant processing demands. To address these challenges, our previous work introduced ClassiGAN, which is a generative deep learning model designed to simultaneously reconstruct images and classify targets using only back-scattered signals. In this study, we build upon that framework by incorporating attention gate modules into ClassiGAN. These modules are intended to refine feature extraction and improve the identification of relevant information. By dynamically focusing on important features and suppressing irrelevant ones, the attention mechanism enhances the overall model performance. The proposed architecture, named Att-ClassiGAN, significantly reduces the reconstruction time compared to traditional CMI approaches. Furthermore, it outperforms current advanced methods, delivering improved Normalized Mean Squared Error (NMSE), higher Structural Similarity Index (SSIM), and better classification outcomes for the reconstructed targets.

cs.LG [Back]

[169] When Bad Data Leads to Good Models

Kenneth Li,Yida Chen,Fernanda Viégas,Martin Wattenberg

Main category: cs.LG

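TL;DR: Studies how the proportion of toxic data in pretraining affects post-training control, finding that more toxic data raises base-model toxicity but makes it easier to remove, yielding a better balance between toxicity control and general capability.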

Details

Motivation: To re-examine how data quality affects model performance and explore pre- and post-training co-design, in particular the role of toxic data in model controllability. Method: A toy experiment studies how data composition affects feature geometry, and Olmo-1B models are trained with varying ratios of toxic data to evaluate linear representations and detoxification. Result: Toxic data increases base-model toxicity but makes it easier to remove with post-training techniques such as ITI, giving better results on Toxigen and Real Toxicity Prompts. Conclusion: Once post-training is taken into account, toxic data may help build more controllable models that balance reduced toxicity with general capability. Abstract: In large language model (LLM) pretraining, data quality is believed to determine model quality. In this paper, we re-examine the notion of "quality" from the perspective of pre- and post-training co-design. Specifically, we explore the possibility that pre-training on more toxic data can lead to better control in post-training, ultimately decreasing a model's output toxicity. First, we use a toy experiment to study how data composition affects the geometry of features in the representation space. Next, through controlled experiments with Olmo-1B models trained on varying ratios of clean and toxic data, we find that the concept of toxicity enjoys a less entangled linear representation as the proportion of toxic data increases. Furthermore, we show that although toxic data increases the generational toxicity of the base model, it also makes the toxicity easier to remove. Evaluations on Toxigen and Real Toxicity Prompts demonstrate that models trained on toxic data achieve a better trade-off between reducing generational toxicity and preserving general capabilities when detoxifying techniques such as inference-time intervention (ITI) are applied. Our findings suggest that, with post-training taken into account, bad data may lead to good models.

[170] ConCISE: Confidence-guided Compression in Step-by-step Efficient Reasoning

Ziqing Qiao,Yongheng Deng,Jiali Zeng,Dong Wang,Lai Wei,Fandong Meng,Jie Zhou,Ju Ren,Yaoxue Zhang

Main category: cs.LG

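TL;DR: The ConCISE framework reinforces the model's confidence during reasoning to cut redundant steps, substantially shortening outputs while maintaining task accuracy.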

Details

Motivation: Large reasoning models (LRMs) perform well on complex reasoning tasks but often produce verbose outputs with redundant content, which increases compute cost and degrades user experience. Existing compression methods either disrupt reasoning coherence or fail to intervene effectively during generation. Method: ConCISE uses Confidence Injection to stabilize intermediate steps and Early Stopping to terminate reasoning once confidence is sufficient, simplifying the reasoning chain. Result: Fine-tuning LRMs on ConCISE-generated data reduces output length by up to about 50% (under SimPO) while maintaining high task accuracy, and outperforms existing baselines on multiple reasoning benchmarks. Conclusion: Confidence-guided compression addresses the verbosity of LRMs and improves reasoning efficiency and user experience. Abstract: Large Reasoning Models (LRMs) perform strongly in complex reasoning tasks via Chain-of-Thought (CoT) prompting, but often suffer from verbose outputs caused by redundant content, increasing computational overhead, and degrading user experience. Existing compression methods either operate post-hoc pruning, risking disruption to reasoning coherence, or rely on sampling-based selection, which fails to intervene effectively during generation. In this work, we introduce a confidence-guided perspective to explain the emergence of redundant reflection in LRMs, identifying two key patterns: Confidence Deficit, where the model reconsiders correct steps due to low internal confidence, and Termination Delay, where reasoning continues even after reaching a confident answer. Based on this analysis, we propose ConCISE (Confidence-guided Compression In Step-by-step Efficient Reasoning), a framework that simplifies reasoning chains by reinforcing the model's confidence during inference, thus preventing the generation of redundant reflection steps. It integrates Confidence Injection to stabilize intermediate steps and Early Stopping to terminate reasoning when confidence is sufficient. Extensive experiments demonstrate that fine-tuning LRMs on ConCISE-generated data yields significantly shorter outputs, reducing length by up to approximately 50% under SimPO, while maintaining high task accuracy. ConCISE consistently outperforms existing baselines across multiple reasoning benchmarks.
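
The Early Stopping component can be pictured with a minimal sketch: reasoning steps are kept until one of them carries a confidence at or above a threshold, and everything after it is dropped. The per-step confidence values, the threshold of 0.9, and the function name are assumptions made for illustration; the abstract does not specify how ConCISE measures confidence.

```python
def truncate_reasoning(steps, threshold=0.9):
    """Keep reasoning steps up to and including the first confident one.

    steps: list of (text, confidence) pairs; the confidence scores are
           placeholders for whatever detector a real system would use.
    threshold: hypothetical confidence level that ends the reasoning.
    """
    kept = []
    for text, confidence in steps:
        kept.append(text)
        if confidence >= threshold:
            break  # "Termination Delay" is avoided by stopping here
    return kept


chain = [
    ("Compute 12 * 7 = 84.", 0.55),
    ("Add 16 to get 100.", 0.95),
    ("Wait, let me double-check the multiplication.", 0.97),
]
print(truncate_reasoning(chain))
```

If no step ever reaches the threshold, the sketch simply returns the whole chain, so it never shortens reasoning that the model itself is unsure about.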

[171] General Transform: A Unified Framework for Adaptive Transform to Enhance Representations

Gekko Budiutama,Shunsuke Daimon,Hirofumi Nishi,Yu-ichiro Matsushita

Main category: cs.LG

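TL;DR: Proposes General Transform (GT), an adaptive transform that learns a data-driven mapping, outperforms conventional transforms, and applies to a range of machine learning tasks.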

Details

Motivation: Choosing a conventional discrete transform depends on knowledge of the dataset's properties, and this lack of adaptability limits effectiveness. Method: General Transform (GT) learns a data-driven mapping that adapts to the dataset and task. Result: Models using GT outperform conventional transform-based approaches on computer vision and natural language processing tasks. Conclusion: GT is an effective adaptive transform for diverse learning scenarios. Abstract: Discrete transforms, such as the discrete Fourier transform, are widely used in machine learning to improve model performance by extracting meaningful features. However, with numerous transforms available, selecting an appropriate one often depends on understanding the dataset's properties, making the approach less effective when such knowledge is unavailable. In this work, we propose General Transform (GT), an adaptive transform-based representation designed for machine learning applications. Unlike conventional transforms, GT learns data-driven mapping tailored to the dataset and task of interest. Here, we demonstrate that models incorporating GT outperform conventional transform-based approaches across computer vision and natural language processing tasks, highlighting its effectiveness in diverse learning scenarios.

[172] CodeMixBench: Evaluating Large Language Models on Code Generation with Code-Mixed Prompts

Manik Sheokand,Parth Sawant

Main category: cs.LG

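TL;DR: CodeMixBench is a new benchmark for evaluating LLM code generation from multilingual code-mixed prompts, filling the gap left by existing benchmarks that cover only English.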

Details

Motivation: Existing benchmarks such as HumanEval and MBPP evaluate code generation only from English prompts, overlooking the code-mixed language that multilingual developers actually use. Method: Built on BigCodeBench, CodeMixBench introduces controlled code-mixing (CMD) into the natural language parts of prompts across three language pairs (Hinglish, Spanish-English, and Chinese Pinyin-English) and evaluates a range of open-source code generation models. Result: Code-mixed prompts consistently lower Pass@1, with larger drops for smaller models at higher CMD levels. Conclusion: CodeMixBench provides a realistic evaluation framework for multilingual code generation and exposes new challenges in building code generation models that are robust across linguistic settings. Abstract: Large Language Models (LLMs) have achieved remarkable success in code generation tasks, powering various applications like code completion, debugging, and programming assistance. However, existing benchmarks such as HumanEval, MBPP, and BigCodeBench primarily evaluate LLMs on English-only prompts, overlooking the real-world scenario where multilingual developers often use code-mixed language while interacting with LLMs. To address this gap, we introduce CodeMixBench, a novel benchmark designed to evaluate the robustness of LLMs on code generation from code-mixed prompts. Built upon BigCodeBench, CodeMixBench introduces controlled code-mixing (CMD) into the natural language parts of prompts across three language pairs: Hinglish (Hindi-English), Spanish-English, and Chinese Pinyin-English. We comprehensively evaluate a diverse set of open-source code generation models ranging from 1.5B to 15B parameters. Our results show that code-mixed prompts consistently degrade Pass@1 performance compared to their English-only counterparts, with performance drops increasing under higher CMD levels for smaller models. CodeMixBench provides a realistic evaluation framework for studying multilingual code generation and highlights new challenges and directions for building robust code generation models that generalize well across diverse linguistic settings.
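
Pass@1 is the headline metric here. A common way to compute it is the unbiased pass@k estimator introduced alongside HumanEval, which for n samples with c passing gives 1 - C(n-c, k) / C(n, k) and reduces to c/n at k = 1. Whether CodeMixBench uses exactly this estimator is an assumption; the sketch below only shows the formula.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples, c: number that pass all tests,
    k: number of samples the metric pretends to draw (1 <= k <= n).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 10 samples and 3 passing, pass@1 is the plain pass rate 3/10.
print(round(pass_at_k(10, 3, 1), 4))
```

The benchmark-level score is then the mean of this per-problem value over all problems, computed separately for English-only and code-mixed prompts.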

[173] Understanding In-context Learning of Addition via Activation Subspaces

Xinyan Hu,Kayo Yin,Michael I. Jordan,Jacob Steinhardt,Lijie Chen

Main category: cs.LG

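TL;DR: Studies how modern transformer models implement in-context learning in the forward pass, finding that Llama-3-8B attains high accuracy on an add-k task and localizing its few-shot ability to three attention heads with a novel optimization approach.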

Details

Motivation: To understand how language models extract signals from a few examples, form a prediction rule, and implement that rule in the forward pass. Method: Designs a structured task (add an integer k to the input), localizes the key attention heads with an optimization approach, and analyzes the low-dimensional subspace carrying the signal. Result: Llama-3-8B attains high accuracy on the task; the signal lies in a six-dimensional subspace, with four dimensions tracking the unit digit and two tracking overall magnitude. Conclusion: Tracking low-dimensional subspaces across the forward pass can reveal fine-grained computational structure in the model. Abstract: To perform in-context learning, language models must extract signals from individual few-shot examples, aggregate these into a learned prediction rule, and then apply this rule to new examples. How is this implemented in the forward pass of modern transformer models? To study this, we consider a structured family of few-shot learning tasks for which the true prediction rule is to add an integer $k$ to the input. We find that Llama-3-8B attains high accuracy on this task for a range of $k$, and localize its few-shot ability to just three attention heads via a novel optimization approach. We further show the extracted signals lie in a six-dimensional subspace, where four of the dimensions track the unit digit and the other two dimensions track overall magnitude. We finally examine how these heads extract information from individual few-shot examples, identifying a self-correction mechanism in which mistakes from earlier examples are suppressed by later examples. Our results demonstrate how tracking low-dimensional subspaces across a forward pass can provide insight into fine-grained computational structures.
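
To make the task family concrete, the following sketch builds a few-shot prompt whose hidden rule is "add k to the input". The `x -> y` line format and the function name are hypothetical; the abstract does not state the exact prompt template used with Llama-3-8B.

```python
def make_addition_prompt(k, examples, query):
    """Build a few-shot prompt for the add-k task (hypothetical format).

    k: the hidden integer offset the model must infer from the examples.
    examples: input integers shown with their answers x + k.
    query: the final input, left unanswered for the model to complete.
    """
    lines = [f"{x} -> {x + k}" for x in examples]
    lines.append(f"{query} ->")
    return "\n".join(lines)


def expected_answer(k, query):
    """Ground-truth completion under the add-k rule."""
    return query + k


print(make_addition_prompt(3, [1, 5, 12], 7))
```

Varying `k` across prompts while keeping the format fixed is what lets an analysis isolate the part of the model's activations that encodes the offset itself.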

[174] Revealing Weaknesses in Text Watermarking Through Self-Information Rewrite Attacks

Yixin Cheng,Hongcheng Guo,Yangming Li,Leonid Sigal

Main category: cs.LG

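TL;DR: Shows that the common design of embedding text watermarks in high-entropy tokens is exploitable, and proposes SIRA, a generic and efficient paraphrasing attack that achieves high success rates at low cost, calling for more robust watermarking.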

Details

Motivation: Embedding watermarks in high-entropy tokens looks harmless but can be exploited by attackers, so the robustness of current text watermarking algorithms needs to be assessed. Method: The Self-Information Rewrite Attack (SIRA) computes each token's self-information to identify potential pattern tokens and attacks them in a targeted way. Result: SIRA achieves nearly 100% attack success on seven recent watermarking methods at a cost of only 0.88 USD per million tokens, without access to the watermark algorithms or the watermarked LLM. Conclusion: Current watermarking algorithms share a widespread vulnerability, and more robust solutions are urgently needed. Abstract: Text watermarking aims to subtly embed statistical signals into text by controlling the Large Language Model (LLM)'s sampling process, enabling watermark detectors to verify that the output was generated by the specified model. The robustness of these watermarking algorithms has become a key factor in evaluating their effectiveness. Current text watermarking algorithms embed watermarks in high-entropy tokens to ensure text quality. In this paper, we reveal that this seemingly benign design can be exploited by attackers, posing a significant risk to the robustness of the watermark. We introduce a generic efficient paraphrasing attack, the Self-Information Rewrite Attack (SIRA), which leverages the vulnerability by calculating the self-information of each token to identify potential pattern tokens and perform targeted attack. Our work exposes a widely prevalent vulnerability in current watermarking algorithms. The experimental results show SIRA achieves nearly 100% attack success rates on seven recent watermarking methods with only 0.88 USD per million tokens cost. Our approach does not require any access to the watermark algorithms or the watermarked LLM and can seamlessly transfer to any LLM as the attack model, even mobile-level models. Our findings highlight the urgent need for more robust watermarking.
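
The self-information of a token t with probability p(t) is I(t) = -log2 p(t), so rare tokens score high. The sketch below illustrates the scoring step under loud simplifications: a smoothed unigram model built from a reference text stands in for the LLM that SIRA would use, and masking the top-scoring tokens is an assumed stand-in for the targeted rewrite, which the abstract does not detail. All names are hypothetical.

```python
import math
from collections import Counter


def self_information(tokens, counts, total):
    """Per-token self-information in bits under an add-one-smoothed unigram model."""
    vocab = len(counts) + 1  # +1 reserves probability mass for unseen tokens
    return [-math.log2((counts.get(t, 0) + 1) / (total + vocab)) for t in tokens]


def mask_high_information(tokens, reference_tokens, top_fraction=0.3, mask="[MASK]"):
    """Replace the highest self-information tokens with a mask placeholder.

    reference_tokens: corpus used to estimate token probabilities (assumed).
    top_fraction: hypothetical share of tokens treated as potential pattern tokens.
    """
    counts = Counter(reference_tokens)
    info = self_information(tokens, counts, len(reference_tokens))
    n_mask = max(1, round(len(tokens) * top_fraction))
    ranked = sorted(range(len(tokens)), key=lambda i: info[i], reverse=True)
    to_mask = set(ranked[:n_mask])
    return [mask if i in to_mask else t for i, t in enumerate(tokens)]


reference = "the cat sat on the mat the end".split()
print(mask_high_information(["the", "zebra", "sat"], reference))
```

In a full attack the masked positions would be regenerated by a paraphrasing model, so the watermark signal concentrated in those tokens is lost while the low-information scaffolding of the sentence is preserved.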

[175] Scalable Chain of Thoughts via Elastic Reasoning

Yuhui Xu,Hanze Dong,Lei Wang,Doyen Sahoo,Junnan Li,Caiming Xiong

Main category: cs.LG

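TL;DR: Elastic Reasoning splits reasoning into a thinking phase and a solution phase with independently allocated budgets, improving reliability under tight resource constraints.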

Details

Motivation: Large reasoning models (LRMs) perform well on complex tasks, but their uncontrolled output length is a problem for real-world deployment. Method: Elastic Reasoning allocates separate budgets to two phases, thinking and solution, and trains the model with a lightweight budget-constrained rollout strategy. Result: On mathematical and programming benchmarks, Elastic Reasoning performs robustly under strict budget constraints and produces more concise and efficient reasoning. Conclusion: Elastic Reasoning offers a practical solution for controllable reasoning. Abstract: Large reasoning models (LRMs) have achieved remarkable progress on complex tasks by generating extended chains of thought (CoT). However, their uncontrolled output lengths pose significant challenges for real-world deployment, where inference-time budgets on tokens, latency, or compute are strictly constrained. We propose Elastic Reasoning, a novel framework for scalable chain of thoughts that explicitly separates reasoning into two phases--thinking and solution--with independently allocated budgets. At test time, Elastic Reasoning prioritizes the completeness of solution segments, significantly improving reliability under tight resource constraints. To train models that are robust to truncated thinking, we introduce a lightweight budget-constrained rollout strategy, integrated into GRPO, which teaches the model to reason adaptively when the thinking process is cut short and generalizes effectively to unseen budget constraints without additional training. Empirical results on mathematical (AIME, MATH500) and programming (LiveCodeBench, Codeforces) benchmarks demonstrate that Elastic Reasoning performs robustly under strict budget constraints, while incurring significantly lower training cost than baseline methods. Remarkably, our approach also produces more concise and efficient reasoning even in unconstrained settings. Elastic Reasoning offers a principled and practical solution to the pressing challenge of controllable reasoning at scale.
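
The two-budget idea can be sketched as follows: the thinking stream is cut at its own budget, an end-of-thinking marker is appended, and the solution then gets a separate budget that the thinking phase cannot consume. Pre-generated token lists stand in for a real model here, and the `</think>` marker and function name are hypothetical.

```python
def elastic_generate(thinking_tokens, solution_tokens,
                     think_budget, solution_budget, end_think="</think>"):
    """Assemble an output with independent thinking and solution budgets.

    thinking_tokens / solution_tokens: stand-ins for what a model would emit
    in each phase (a real system would generate the solution conditioned on
    the truncated thinking).
    """
    thinking = thinking_tokens[:think_budget]      # thinking may be cut short
    solution = solution_tokens[:solution_budget]   # solution budget is reserved
    return thinking + [end_think] + solution


output = elastic_generate(
    ["try", "x=2", "check", "retry", "x=3"], ["answer:", "3"],
    think_budget=3, solution_budget=2,
)
print(output)
```

Because the solution budget is reserved up front, a long or rambling thinking phase can no longer crowd out the final answer, which is the failure mode a single shared token limit would produce.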

[176] Research on Anomaly Detection Methods Based on Diffusion Models

Yi Chen

Main category: cs.LG

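TL;DR: Proposes a novel diffusion-model-based anomaly detection framework that improves performance through multi-scale feature extraction and attention mechanisms, and performs well on image and audio data.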

Details

Motivation: Traditional anomaly detection methods struggle with complex, high-dimensional data, while diffusion models show promise for modeling data distributions.

Method: Uses diffusion probabilistic models (DPMs) to model the distribution of normal data, reconstructs inputs via reverse diffusion, and combines reconstruction errors with semantic discrepancies as anomaly indicators.

Result: On benchmark datasets including MVTec AD and UrbanSound8K, the method outperforms existing techniques, with higher accuracy and robustness.

Conclusion: Diffusion models are effective for anomaly detection and provide an efficient solution for real-world applications.

Abstract: Anomaly detection is a fundamental task in machine learning and data mining, with significant applications in cybersecurity, industrial fault diagnosis, and clinical disease monitoring. Traditional methods, such as statistical modeling and machine learning-based approaches, often face challenges in handling complex, high-dimensional data distributions. In this study, we explore the potential of diffusion models for anomaly detection, proposing a novel framework that leverages the strengths of diffusion probabilistic models (DPMs) to effectively identify anomalies in both image and audio data. The proposed method models the distribution of normal data through a diffusion process and reconstructs input data via reverse diffusion, using a combination of reconstruction errors and semantic discrepancies as anomaly indicators. To enhance the framework's performance, we introduce multi-scale feature extraction, attention mechanisms, and wavelet-domain representations, enabling the model to capture fine-grained structures and global dependencies in the data. Extensive experiments on benchmark datasets, including MVTec AD and UrbanSound8K, demonstrate that our method outperforms state-of-the-art anomaly detection techniques, achieving superior accuracy and robustness across diverse data modalities. This research highlights the effectiveness of diffusion models in anomaly detection and provides a robust and efficient solution for real-world applications.
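The scoring step described in the abstract, combining a reconstruction error with a semantic discrepancy, can be illustrated without a diffusion model at all. The sketch below assumes the reconstruction and the feature vectors are already available, and it makes several choices the abstract does not specify: mean squared error for the reconstruction term, cosine distance for the semantic term, and a fixed mixing `weight`. All function names are hypothetical.

```python
import math


def mean_squared_error(x, x_hat):
    """Average squared difference between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both feature vectors are nonzero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def anomaly_score(x, x_hat, feat, feat_hat, weight=0.5):
    """Weighted sum of reconstruction error and semantic discrepancy.

    The weighting scheme is an illustrative assumption, not the paper's.
    """
    reconstruction = mean_squared_error(x, x_hat)
    semantic = cosine_distance(feat, feat_hat)
    return weight * reconstruction + (1.0 - weight) * semantic


# A sample that is reconstructed perfectly versus one that is not.
normal = anomaly_score([0.2, 0.4], [0.2, 0.4], [1.0, 0.0], [1.0, 0.0])
anomalous = anomaly_score([0.2, 0.4], [0.9, 0.1], [1.0, 0.0], [0.0, 1.0])
print(normal, anomalous)
```

A model trained only on normal data should reconstruct normal inputs closely, so a larger score indicates an input the model could not reproduce well.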

[177] Concept-Based Unsupervised Domain Adaptation

Xinyue Xu,Yueying Hu,Hui Tang,Yi Qin,Lu Mi,Hao Wang,Xiaomeng Li

Main category: cs.LG

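TL;DR: Proposes the CUDA framework, which uses adversarial training and a relaxation threshold to improve the robustness and generalization of concept bottleneck models under domain adaptation.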

Details

Motivation: Conventional Concept Bottleneck Models (CBMs) assume that training and test data share the same distribution, and their performance degrades under domain shift. CUDA aims to address this problem.

Method: CUDA aligns concept representations via adversarial training, introduces a relaxation threshold that tolerates cross-domain differences, requires no labeled concept data in the target domain, and integrates concept learning with theoretical guarantees.

Result: Experiments show that CUDA significantly outperforms existing CBM and domain adaptation methods on real-world datasets.

Conclusion: CUDA improves CBM performance under domain adaptation while preserving interpretability, and sets new benchmarks for domain adaptation.

Abstract: Concept Bottleneck Models (CBMs) enhance interpretability by explaining predictions through human-understandable concepts but typically assume that training and test data share the same distribution. This assumption often fails under domain shifts, leading to degraded performance and poor generalization. To address these limitations and improve the robustness of CBMs, we propose the Concept-based Unsupervised Domain Adaptation (CUDA) framework. CUDA is designed to: (1) align concept representations across domains using adversarial training, (2) introduce a relaxation threshold to allow minor domain-specific differences in concept distributions, thereby preventing performance drop due to over-constraints of these distributions, (3) infer concepts directly in the target domain without requiring labeled concept data, enabling CBMs to adapt to diverse domains, and (4) integrate concept learning into conventional domain adaptation (DA) with theoretical guarantees, improving interpretability and establishing new benchmarks for DA. Experiments demonstrate that our approach significantly outperforms the state-of-the-art CBM and DA methods on real-world datasets.

[178] MTL-UE: Learning to Learn Nothing for Multi-Task Learning

Yi Yu,Song Xia,Siyuan Yang,Chenqi Kong,Wenhan Yang,Shijian Lu,Yap-Peng Tan,Alex C. Kot

Main category: cs.LG

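TL;DR: MTL-UE is the first unified framework for generating unlearnable examples for multi-task data and models; it improves attack performance through a generator-based structure and embedding regularization.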

Details

Motivation: Existing unlearnable strategies mainly target single-task learning, while multi-task learning (MTL) data and models, despite their growing importance, have received little attention.

Method: Designs a generator-based structure that introduces label priors and class-wise feature embeddings, combined with intra-task and inter-task embedding regularization.

Result: Performs strongly across 4 MTL datasets, 3 base UE methods, 5 model backbones, and 5 MTL task-weighting strategies.

Conclusion: MTL-UE is effective and versatile for multi-task learning and offers a new approach to generating unlearnable examples.

Abstract: Most existing unlearnable strategies focus on preventing unauthorized users from training single-task learning (STL) models with personal data. Nevertheless, the paradigm has recently shifted towards multi-task data and multi-task learning (MTL), targeting generalist and foundation models that can handle multiple tasks simultaneously. Despite their growing importance, MTL data and models have been largely neglected while pursuing unlearnable strategies. This paper presents MTL-UE, the first unified framework for generating unlearnable examples for multi-task data and MTL models. Instead of optimizing perturbations for each sample, we design a generator-based structure that introduces label priors and class-wise feature embeddings, which leads to much better attacking performance. In addition, MTL-UE incorporates intra-task and inter-task embedding regularization to increase inter-class separation and suppress intra-class variance, which greatly enhances the attack robustness. Furthermore, MTL-UE is versatile, with good support for dense prediction tasks in MTL. It is also plug-and-play, allowing existing surrogate-dependent unlearnable methods to be integrated with little adaptation. Extensive experiments show that MTL-UE achieves superior attacking performance consistently across 4 MTL datasets, 3 base UE methods, 5 model backbones, and 5 MTL task-weighting strategies.

Table of Contents

cs.CV

[1] Histo-Miner: Deep Learning based Tissue Features Extraction Pipeline from H&E Whole Slide Images of Cutaneous Squamous Cell Carcinoma

[2] Comparison of Visual Trackers for Biomechanical Analysis of Running

[3] Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers

[4] False Promises in Medical Imaging AI? Assessing Validity of Outperformance Claims

[5] Hyb-KAN ViT: Hybrid Kolmogorov-Arnold Networks Augmented Vision Transformer

[6] Lightweight RGB-D Salient Object Detection from a Speed-Accuracy Tradeoff Perspective

[7] Vision-Language-Action Models: Concepts, Progress, Applications and Challenges

[8] Replay to Remember (R2R): An Efficient Uncertainty-driven Unsupervised Continual Learning Framework Using Generative Replay

[9] Convex Relaxation for Robust Vanishing Point Estimation in Manhattan World

[10] DetReIDX: A Stress-Test Dataset for Real-World UAV-Based Person Recognition

[11] Are Synthetic Corruptions A Reliable Proxy For Real-World Corruptions?

[12] Seeing Cells Clearly: Evaluating Machine Vision Strategies for Microglia Centroid Detection in 3D Images

[13] ORXE: Orchestrating Experts for Dynamically Configurable Efficiency

[14] Mix-QSAM: Mixed-Precision Quantization of the Segment Anything Model

[15] Auto-regressive transformation for image alignment

[16] Learning from Loss Landscape: Generalizable Mixed-Precision Quantization via Adaptive Sharpness-Aware Gradient Aligning

[17] Cross-Branch Orthogonality for Improved Generalization in Face Deepfake Detection

[18] OWT: A Foundational Organ-Wise Tokenization Framework for Medical Imaging

[19] Pro2SAM: Mask Prompt to SAM with Grid Points for Weakly Supervised Object Localization

[20] SpatialPrompting: Keyframe-driven Zero-Shot Spatial Reasoning with Off-the-Shelf Multimodal Large Language Models

[21] GlyphMastero: A Glyph Encoder for High-Fidelity Scene Text Editing

[22] A Simple Detector with Frame Dynamics is a Strong Tracker

[23] Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models

[24] Canny2Palm: Realistic and Controllable Palmprint Generation for Large-scale Pre-training

[25] FF-PNet: A Pyramid Network Based on Feature and Field for Brain Image Registration

[26] Building-Guided Pseudo-Label Learning for Cross-Modal Building Damage Mapping

[27] T2VTextBench: A Human Evaluation Benchmark for Textual Control in Video Generation Models

[28] An Efficient Method for Accurate Pose Estimation and Error Correction of Cuboidal Objects

[29] ViCTr: Vital Consistency Transfer for Pathology Aware Image Synthesis

[30] CAG-VLM: Fine-Tuning of a Large-Scale Model to Recognize Angiographic Images for Next-Generation Diagnostic Systems

[31] DenseGrounding: Improving Dense Language-Vision Semantics for Ego-Centric 3D Visual Grounding

[32] ReAlign: Bilingual Text-to-Motion Generation via Step-Aware Reward-Guided Alignment

[33] Federated Deconfounding and Debiasing Learning for Out-of-Distribution Generalization

[34] StabStitch++: Unsupervised Online Video Stitching with Spatiotemporal Bidirectional Warps

[35] Automated Thoracolumbar Stump Rib Detection and Analysis in a Large CT Cohort

[36] Driving with Context: Online Map Matching for Complex Roads Using Lane Markings and Scenario Recognition

[37] Adaptive Contextual Embedding for Robust Far-View Borehole Detection

[38] SOAP: Style-Omniscient Animatable Portraits

[39] Split Matching for Inductive Zero-shot Semantic Segmentation

[40] xTrace: A Facial Expressive Behaviour Analysis Tool for Continuous Affect Recognition

[41] UncertainSAM: Fast and Efficient Uncertainty Quantification of the Segment Anything Model

[42] ULFine: Unbiased Lightweight Fine-tuning for Foundation-Model-Assisted Long-Tailed Semi-Supervised Learning

[43] FG-CLIP: Fine-Grained Visual and Textual Alignment

[44] Visual Affordances: Enabling Robots to Understand Object Functionality

[45] PIDiff: Image Customization for Personalized Identities with Diffusion Models

[46] Nonlinear Motion-Guided and Spatio-Temporal Aware Network for Unsupervised Event-Based Optical Flow

[47] DispBench: Benchmarking Disparity Estimation to Synthetic Corruptions

[48] MDE-Edit: Masked Dual-Editing for Multi-Object Image Editing via Diffusion Models

[49] Automated vision-based assistance tools in bronchoscopy: stenosis severity estimation

[50] Probabilistic Embeddings for Frozen Vision-Language Models: Uncertainty Quantification with Gaussian Process Latent Variable Models

[51] PaniCar: Securing the Perception of Advanced Driving Assistance Systems Against Emergency Vehicle Lighting

[52] Biomed-DPT: Dual Modality Prompt Tuning for Biomedical Vision-Language Models

[53] EAM: Enhancing Anything with Diffusion Transformers for Blind Super-Resolution

[54] HQC-NBV: A Hybrid Quantum-Classical View Planning Approach

[55] Diffusion Model Quantization: A Review

[56] Does CLIP perceive art the same way we do?

[57] PADriver: Towards Personalized Autonomous Driving

[58] PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes

[59] PRE-Mamba: A 4D State Space Model for Ultra-High-Frequent Event Camera Deraining

[60] Mapping User Trust in Vision Language Models: Research Landscape, Challenges, and Prospects

[61] Feature-Augmented Deep Networks for Multiscale Building Segmentation in High-Resolution UAV and Satellite Imagery

[62] Aesthetics Without Semantics

[63] Progressive Inertial Poser: Progressive Real-Time Kinematic Chain Estimation for 3D Full-Body Pose from Three IMU Sensors

[64] Hearing and Seeing Through CLIP: A Framework for Self-Supervised Sound Source Localization

[65] Joint Super-Resolution and Segmentation for 1-m Impervious Surface Area Mapping in China's Yangtze River Economic Belt

[66] Threshold Modulation for Online Test-Time Adaptation of Spiking Neural Networks

[67] GeomHair: Reconstruction of Hair Strands from Colorless 3D Scans

[68] EDmamba: A Simple yet Effective Event Denoising Method with State Space Model

[69] TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation

[70] PillarMamba: Learning Local-Global Context for Roadside Point Cloud via Hybrid State Space Model

[71] Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding

[72] StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant

[73] SITE: towards Spatial Intelligence Thorough Evaluation

[74] Generating Physically Stable and Buildable LEGO Designs from Text

[75] Flow-GRPO: Training Flow Matching Models via Online RL

[76] Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation

[77] DiffusionSfM: Predicting Structure and Motion via Ray Origin and Endpoint Diffusion

[78] 3D Scene Generation: A Survey