cs.CV

[1] Data extraction and processing methods to aid the study of driving behaviors at intersections in naturalistic driving

Shrinivas Pundlik,Seonggyu Choe,Patrick Baker,Chen-Yuan Lee,Naser Al-Madi,Alex R. Bowers,Gang Luo

Main category: cs.CV

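TL;DR: The paper describes automated methods for extracting and characterizing driver head scans at intersections from naturalistic driving data. It combines video, GPS, and vehicle data, develops custom tools and AI models, and validates them with high accuracy.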

Details

Motivation: Naturalistic driving studies produce large volumes of diverse data, so automated processing is needed to analyze driver behavior efficiently, particularly during critical maneuvers at intersections.

Method: Tools were developed to mark intersections and synchronize video with location data. AI models detect head pose and scene objects, and a rule-based algorithm infers the intersection type and maneuver.

Result: The automated algorithm detected intersection signage and maneuvers correctly in 100% and 94% of instances, respectively. The error in detecting vehicle entry into the intersection was small, and the estimated intersection bounds overlapped closely with ground truth.

Conclusion: The automated methods process naturalistic driving data efficiently and provide reliable support for studies of driver behavior.

Abstract: Naturalistic driving studies use devices in participants' own vehicles to record daily driving over many months. Because the recorded data are diverse and extensive, automated processing is necessary. This report describes methods to extract and characterize driver head scans at intersections from data collected by an in-car recording system that logged vehicle speed, GPS location, scene videos, and cabin videos. Custom tools were developed to mark the intersections, synchronize location and video data, and clip the cabin and scene videos for +/-100 meters from the intersection location. A custom-developed head pose detection AI model for wide-angle head turns was run on the cabin videos to estimate the driver head pose, from which head scans >20 deg were computed in the horizontal direction. The scene videos were processed using a YOLO object detection model to detect traffic lights, stop signs, pedestrians, and other vehicles on the road. Turning maneuvers were independently detected using vehicle self-motion patterns. Stop lines on the road surface were detected using changing intensity patterns over time as the vehicle moved. The information obtained from processing the scene videos, along with the speed data, was used in a rule-based algorithm to infer the intersection type, maneuver, and bounds. We processed 190 intersections from 3 vehicles driven in cities and suburban areas in Massachusetts and California. The automated video processing algorithm correctly detected intersection signage and maneuvers in 100% and 94% of instances, respectively. The median [IQR] error in detecting vehicle entry into the intersection was 1.1 [0.4-4.9] meters and 0.2 [0.1-0.54] seconds. The median overlap between ground truth and estimated intersection bounds was 0.88 [0.82-0.93].
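The abstract reports an overlap score between ground-truth and estimated intersection bounds but does not define it. As a minimal sketch, assuming the bounds are 1D intervals along the route (in meters relative to the intersection location) and that overlap means intersection-over-union, the score could be computed as follows. The function name and the sample values are hypothetical.

```python
def interval_iou(a, b):
    """Intersection-over-union of two 1D intervals given as (start, end).

    Assumption: this is one plausible reading of "overlap"; the abstract
    does not say which overlap measure the authors used.
    """
    a0, a1 = min(a), max(a)
    b0, b1 = min(b), max(b)
    inter = max(0.0, min(a1, b1) - max(a0, b0))
    union = (a1 - a0) + (b1 - b0) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical bounds, in meters relative to the marked intersection location.
ground_truth = (-12.0, 18.0)
estimated = (-10.0, 20.0)
overlap = interval_iou(ground_truth, estimated)
print(round(overlap, 3))
```

A score of 1.0 means the two intervals coincide, and 0.0 means they do not intersect.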

[2] From Events to Enhancement: A Survey on Event-Based Imaging Technologies

Yunfan Lu,Xiaogang Xu,Pengteng Li,Yusheng Wang,Yi Cui,Huizai Yao,Hui Xiong

Main category: cs.CV

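TL;DR: This survey reviews recent advances, challenges, and applications of event cameras in imaging, covering sensor characteristics, image enhancement tasks, and advanced tasks such as light field estimation.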

Details

Motivation: Event cameras have become a disruptive imaging technology thanks to their high dynamic range and low latency, but a comprehensive study of recent advances and challenges is missing, which limits their potential in universal imaging applications.

Method: The survey introduces the physical model and characteristics of event sensors, analyzes how image/video enhancement tasks interact with events, and explores advanced tasks that use events to capture richer light information.

Result: It summarizes the advantages and application potential of event cameras in imaging tasks and raises new challenges and open questions for the field.

Conclusion: Event cameras have broad prospects in imaging, but technical and application challenges remain; future research should focus on their further development and optimization.

Abstract: Event cameras offering high dynamic range and low latency have emerged as disruptive technologies in imaging. Despite growing research on leveraging these benefits for different imaging tasks, a comprehensive study of recent advances and challenges is still lacking. This limits the broader understanding of how to utilize events in universal imaging applications. In this survey, we first introduce a physical model and the characteristics of different event sensors as the foundation. Following this, we highlight the advancement and interaction of image/video enhancement tasks with events. Additionally, we explore advanced tasks, which capture richer light information with events, e.g., light field estimation, multi-view generation, and photometric. Finally, we discuss new challenges and open questions, offering a perspective for this rapidly evolving field. More continuously updated resources are at this link: https://github.com/yunfanLu/Awesome-Event-Imaging

[3] MDDFNet: Mamba-based Dynamic Dual Fusion Network for Traffic Sign Detection

TianYi Yu

Main category: cs.CV

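TL;DR: The paper proposes MDDFNet, a new object detection network that addresses two challenges in traffic sign detection: insufficiently diverse feature extraction and multi-scale object detection.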

Details

Motivation: Traffic sign detection is a key task in autonomous driving, but existing methods fall short in feature diversity and multi-scale handling.

Method: MDDFNet combines a dynamic dual fusion module with a Mamba-based backbone. The former uses multiple branches to integrate spatial and semantic information; the latter adaptively fuses global and local features.

Result: Experiments on the TT100K dataset show that MDDFNet outperforms existing methods while retaining real-time processing capability.

Conclusion: MDDFNet performs well on small traffic sign detection, which confirms its effectiveness.

Abstract: The detection of small objects, especially traffic signs, is a critical sub-task in object detection and autonomous driving. Despite significant progress in previous research, two main challenges remain. First, the extracted features lack diversity. Second, the detection process struggles to effectively handle objects of varying sizes or scales. These problems are also prevalent in general object detection tasks. To address these challenges, we propose a novel object detection network, Mamba-based Dynamic Dual Fusion Network (MDDFNet), for traffic sign detection. The network integrates a dynamic dual fusion module and a Mamba-based backbone to simultaneously tackle the aforementioned issues. Specifically, the dynamic dual fusion module utilizes multiple branches to consolidate various spatial and semantic information, thus enhancing feature diversity. The Mamba-based backbone leverages global feature fusion and local feature interaction, combining features in an adaptive manner to generate unique classification characteristics. Extensive experiments conducted on the TT100K (Tsinghua-Tencent 100K) dataset demonstrate that MDDFNet outperforms other state-of-the-art detectors, maintaining the real-time processing capabilities of single-stage models while achieving superior performance. This confirms the effectiveness of MDDFNet in detecting small traffic signs.

[4] DetoxAI: a Python Toolkit for Debiasing Deep Learning Models in Computer Vision

Ignacy Stępka,Lukasz Sztukiewicz,Michał Wiliński,Jerzy Stefanowski

Main category: cs.CV

TL;DR: DetoxAI is an open-source Python library that improves the fairness of deep learning vision classifiers through post-hoc debiasing.

Details

Motivation: Existing machine learning fairness solutions mainly target tabular data and are poorly suited to vision classification tasks that rely on deep learning, so a dedicated tool is needed to fill this gap.

Method: DetoxAI implements state-of-the-art debiasing algorithms, fairness metrics, and visualization tools, and supports debiasing via interventions in internal representations.

Result: DetoxAI demonstrates practical value for engineers and researchers, including visualization and quantitative evaluation of debiasing effects.

Conclusion: DetoxAI provides a practical tool for fairness in vision classification tasks and fills a gap left by existing solutions.

Abstract: While machine learning fairness has made significant progress in recent years, most existing solutions focus on tabular data and are poorly suited for vision-based classification tasks, which rely heavily on deep learning. To bridge this gap, we introduce DetoxAI, an open-source Python library for improving fairness in deep learning vision classifiers through post-hoc debiasing. DetoxAI implements state-of-the-art debiasing algorithms, fairness metrics, and visualization tools. It supports debiasing via interventions in internal representations and includes attribution-based visualization tools and quantitative algorithmic fairness metrics to show how bias is mitigated. This paper presents the motivation, design, and use cases of DetoxAI, demonstrating its tangible value to engineers and researchers.

[5] Learning 3D Persistent Embodied World Models

Siyuan Zhou,Yilun Du,Yuncong Yang,Lei Han,Peihao Chen,Dit-Yan Yeung,Chuang Gan

Main category: cs.CV

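TL;DR: The paper proposes a persistent world model with explicit memory to address the limitations of existing video models in long-horizon planning, using a 3D spatial map to achieve more consistent simulation of the future.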

Details

Motivation: Existing video models have no memory of unobserved parts of a scene, which leads to inconsistent long-horizon planning, especially in complex environments that are only partially observed.

Method: A video diffusion model predicts future RGB-D video, which is aggregated into a persistent 3D map of the environment; conditioning the video model on this map yields more consistent simulation.

Result: The model faithfully simulates both seen and unseen parts of the world, improving the accuracy of long-horizon planning.

Conclusion: The proposed persistent world model effectively supports planning and policy learning in embodied applications, showing its potential for agent tasks.

Abstract: The ability to simulate the effects of future actions on the world is a crucial ability of intelligent embodied agents, enabling agents to anticipate the effects of their actions and make plans accordingly. While a large body of existing work has explored how to construct such world models using video models, they are often myopic in nature, without any memory of a scene not captured by currently observed images, preventing agents from making consistent long-horizon plans in complex environments where many parts of the scene are partially observed. We introduce a new persistent embodied world model with an explicit memory of previously generated content, enabling much more consistent long-horizon simulation. During generation time, our video diffusion model predicts RGB-D video of the future observations of the agent. This generation is then aggregated into a persistent 3D map of the environment. By conditioning the video model on this 3D spatial map, we illustrate how this enables video world models to faithfully simulate both seen and unseen parts of the world. Finally, we illustrate the efficacy of such a world model in downstream embodied applications, enabling effective planning and policy learning.

[6] Preliminary Explorations with GPT-4o(mni) Native Image Generation

Pu Cao,Feng Zhou,Junyi Ji,Qingye Kong,Zhixiang Lv,Mingjian Zhang,Xuekun Zhao,Siqi Wu,Yinghui Lin,Qing Song,Lu Yang

Main category: cs.CV

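TL;DR: GPT-4o shows strong multimodal generation capabilities but remains limited in spatial reasoning, knowledge-intensive tasks, and consistent temporal prediction.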

Details

Motivation: To explore the capabilities of GPT-4o across multiple tasks and evaluate its performance in image generation and related tasks.

Method: A task taxonomy and a set of test samples were constructed, and GPT-4o was evaluated qualitatively across six task categories.

Result: GPT-4o performs very well in general-purpose synthesis tasks but falls short in spatial reasoning, knowledge-intensive tasks, and consistent temporal prediction.

Conclusion: GPT-4o marks substantial progress in multimodal generation but does not yet meet the standard for reliable use in professional or safety-critical domains.

Abstract: Recently, the visual generation ability of GPT-4o(mni) has been unlocked by OpenAI. It demonstrates a remarkable generation capability with excellent multimodal condition understanding and varied task instructions. In this paper, we aim to explore the capabilities of GPT-4o across various tasks. Inspired by previous studies, we constructed a task taxonomy along with a carefully curated set of test samples to conduct a comprehensive qualitative test. Benefiting from GPT-4o's powerful multimodal comprehension, its image-generation process demonstrates abilities surpassing those of traditional image-generation tasks. Thus, regarding the dimensions of model capabilities, we evaluate its performance across six task categories: traditional image generation tasks, discriminative tasks, knowledge-based generation, commonsense-based generation, spatially-aware image generation, and temporally-aware image generation. These tasks not only assess the quality and conditional alignment of the model's outputs but also probe deeper into GPT-4o's understanding of real-world concepts. Our results reveal that GPT-4o performs impressively well in general-purpose synthesis tasks, showing strong capabilities in text-to-image generation, visual stylization, and low-level image processing. However, significant limitations remain in its ability to perform precise spatial reasoning, instruction-grounded generation, and consistent temporal prediction. Furthermore, when faced with knowledge-intensive or domain-specific scenarios, such as scientific illustrations or mathematical plots, the model often exhibits hallucinations, factual errors, or structural inconsistencies. These findings suggest that while GPT-4o marks a substantial advancement in unified multimodal generation, there is still a long way to go before it can be reliably applied to professional or safety-critical domains.

[7] Apply Hierarchical-Chain-of-Generation to Complex Attributes Text-to-3D Generation

Yiming Qin,Zhu Xu,Yang Liu

Main category: cs.CV

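TL;DR: HCoG is an automated method that decomposes long descriptions and generates 3D objects part by part in occlusion order, addressing the problems existing text-to-3D models have with objects that carry complex attributes.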

Details

Motivation: Existing text-to-3D models struggle to understand long descriptions and to generate occluded parts, which leads to wrong attribute binding and unstable generation quality.

Method: A large language model decomposes the long description into part blocks that are generated in occlusion order; precise attribute binding is achieved through Gaussian kernel optimization and label elimination.

Result: HCoG generates structurally coherent, attribute-faithful complex 3D objects.

Conclusion: HCoG's automated approach markedly improves the generation quality of text-to-3D models.

Abstract: Recent text-to-3D models can render high-quality assets, yet they still stumble on objects with complex attributes. The key obstacles are: (1) existing text-to-3D approaches typically lift text-to-image models to extract semantics via text encoders, while the text encoder exhibits limited comprehension ability for long descriptions, leading to deviated cross-attention focus and subsequently wrong attribute binding in generated results. (2) Occluded object parts demand a disciplined generation order and explicit part disentanglement. Though some works introduce manual efforts to alleviate the above issues, their quality is unstable and highly reliant on manual information. To tackle the above problems, we propose an automated method, Hierarchical-Chain-of-Generation (HCoG). It leverages a large language model to decompose the long description into blocks representing different object parts, and orders them from inside out according to occlusions, forming a hierarchical chain. Within each block we first coarsely create components, then precisely bind attributes via target-region localization and corresponding 3D Gaussian kernel optimization. Between blocks, we introduce Gaussian Extension and Label Elimination to seamlessly generate new parts by extending new Gaussian kernels, re-assigning semantic labels, and eliminating unnecessary kernels, ensuring that only relevant parts are added without disrupting previously optimized parts. Experiments confirm that HCoG yields structurally coherent, attribute-faithful 3D objects with complex attributes. The code is available at https://github.com/Wakals/GASCOL.

[8] Occupancy World Model for Robots

Zhang Zhang,Qiang Zhang,Wei Cui,Shuai Shi,Yijie Guo,Gang Han,Wen Zhao,Jingkai Sun,Jiahang Cao,Jiaxu Wang,Hao Cheng,Xiaozhu Ju,Zhengping Che,Renjing Xu,Jian Tang

Main category: cs.CV

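TL;DR: The paper proposes RoboOccWorld, a new framework for forecasting the evolution of indoor 3D occupancy scenes. It combines a spatio-temporal receptive field with an autoregressive transformer and outperforms existing methods in experiments.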

Details

Motivation: Existing methods concentrate on structured outdoor road scenes and neglect forecasting 3D occupancy dynamics for robots in indoor scenes. This work aims to fill that gap.

Method: The RoboOccWorld framework includes Conditional Causal State Attention (CCSA) and Hybrid Spatio-Temporal Aggregation (HSTA) to exploit spatio-temporal cues from historical observations.

Result: Experiments show that RoboOccWorld outperforms existing methods on the indoor 3D occupancy scene evolution prediction task.

Conclusion: The paper fills the gap in indoor 3D occupancy scene prediction and demonstrates the effectiveness of the new framework.

Abstract: Understanding and forecasting scene evolutions deeply affect the exploration and decision-making of embodied agents. While traditional methods simulate scene evolutions through trajectory prediction of potential instances, current works use the occupancy world model as a generative framework for describing fine-grained overall scene dynamics. However, existing methods concentrate on outdoor structured road scenes, while ignoring the exploration of forecasting 3D occupancy scene evolutions for robots in indoor scenes. In this work, we explore a new framework for learning the scene evolutions of observed fine-grained occupancy and propose an occupancy world model, called RoboOccWorld, based on a combined spatio-temporal receptive field and a guided autoregressive transformer to forecast the scene evolutions. We propose the Conditional Causal State Attention (CCSA), which utilizes camera poses of the next state as conditions to guide the autoregressive transformer to adapt to and understand indoor robotics scenarios. In order to effectively exploit the spatio-temporal cues from historical observations, Hybrid Spatio-Temporal Aggregation (HSTA) is proposed to obtain the combined spatio-temporal receptive field based on multi-scale spatio-temporal windows. In addition, we restructure the OccWorld-ScanNet benchmark based on local annotations to facilitate the evaluation of the indoor 3D occupancy scene evolution prediction task. Experimental results demonstrate that our RoboOccWorld outperforms state-of-the-art methods in the indoor 3D occupancy scene evolution prediction task. The code will be released soon.

[9] Exploring Convolutional Neural Networks for Rice Grain Classification: An Explainable AI Approach

Muhammad Junaid Asif,Hamza Khan,Rabia Tehseen,Syed Tahir Hussain Rizvi,Mujtaba Asad,Shazia Saqib,Rana Fayyaz Ahmad

Main category: cs.CV

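TL;DR: The paper proposes an automatic framework based on a convolutional neural network (CNN) for efficiently classifying rice varieties and validates it with performance metrics and explainability techniques.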

Details

Motivation: Rice is an important staple food worldwide. Its quality inspection and classification have traditionally relied on manual work, which is inefficient and error-prone, so an automated solution is needed.

Method: A convolutional neural network (CNN) classifies rice varieties, and model performance is evaluated with accuracy, recall, precision, and F1-score.

Result: The CNN model performed well in training and validation, with high classification accuracy, a perfect area under the ROC curve, and a confusion matrix showing very few misclassifications.

Conclusion: The automated framework classifies rice efficiently, and the LIME and SHAP techniques improve the model's explainability, providing reliable support for practical use.

Abstract: Rice is an essential staple food worldwide that is important in promoting international trade, economic growth, and nutrition. Asian countries such as China, India, Pakistan, Thailand, Vietnam, and Indonesia are notable for their significant contribution to the cultivation and utilization of rice. These nations are also known for cultivating different rice grains, including short and long grains. These sizes are further classified as basmati, jasmine, kainat saila, ipsala, arborio, etc., catering to diverse culinary preferences and cultural traditions. For both local and international trade, inspecting and maintaining the quality of rice grains to satisfy customers and preserve a country's reputation is necessary. Manual quality checking and classification is a laborious and time-consuming process. It is also highly prone to mistakes. Therefore, an automatic solution must be proposed for the effective and efficient classification of different varieties of rice grains. This research paper presents an automatic framework based on a convolutional neural network (CNN) for classifying different varieties of rice grains. We evaluated the proposed model based on performance metrics such as accuracy, recall, precision, and F1-Score. The CNN model underwent rigorous training and validation, achieving a remarkable accuracy rate and a perfect area under each class's Receiver Operating Characteristic (ROC) curve. The confusion matrix analysis confirmed the model's effectiveness in distinguishing between the different rice varieties, indicating minimal misclassifications. Additionally, the integration of explainability techniques such as LIME (Local Interpretable Model-agnostic Explanations) and SHAP (SHapley Additive exPlanations) provided valuable insights into the model's decision-making process, revealing how specific features of the rice grains influenced classification outcomes.
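The metrics named in the abstract (accuracy, precision, recall, F1-score) can all be read off a confusion matrix. The sketch below uses the standard definitions; the two-class matrix is made-up illustration data, not results from the paper.

```python
def per_class_metrics(confusion):
    """Return (accuracy, [(precision, recall, f1), ...]) from a confusion matrix.

    confusion[i][j] is the number of samples whose true class is i and
    whose predicted class is j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[i][i] for i in range(n)) / total
    metrics = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                       # true k, predicted other
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, true other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return accuracy, metrics


# Hypothetical counts for two rice varieties (illustration only).
confusion = [[8, 2],
             [1, 9]]
accuracy, metrics = per_class_metrics(confusion)
for k, (p, r, f1) in enumerate(metrics):
    print(f"class {k}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
print(f"accuracy={accuracy:.3f}")
```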

[10] Web2Grasp: Learning Functional Grasps from Web Images of Hand-Object Interactions

Hongyi Chen,Yunchao Yao,Yufei Ye,Zhixuan Xu,Homanga Bharadhwaj,Jiashun Wang,Shubham Tulsiani,Zackory Erickson,Jeffrey Ichnowski

Main category: cs.CV

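TL;DR: The paper proposes extracting human grasp information from web images to train a functional grasping model, avoiding the need for costly robot demonstrations, and validates the approach in simulation and in the real world.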

Details

Motivation: Functional grasping is essential for multi-finger robot hands, but most existing methods either focus on power grasps that simply hold an object still or rely on costly demonstration data.

Method: Human hand-object interaction 3D meshes are reconstructed from RGB images, the human hand is retargeted to robot hands and aligned with accurate 3D object models, and a simulator is used to expand the dataset.

Result: In simulation, the model achieves a 75.8% success rate on seen objects and 61.8% across all objects; simulator-augmented data raises the success rate to 83.4%, and real-world tests reach 85%.

Conclusion: Web image data can effectively train a functional grasping model; simulator augmentation further improves performance and enables effective sim-to-real transfer.

Abstract: Functional grasp is essential for enabling dexterous multi-finger robot hands to manipulate objects effectively. However, most prior work either focuses on power grasping, which simply involves holding an object still, or relies on costly teleoperated robot demonstrations to teach robots how to grasp each object functionally. Instead, we propose extracting human grasp information from web images since they depict natural and functional object interactions, thereby bypassing the need for curated demonstrations. We reconstruct human hand-object interaction (HOI) 3D meshes from RGB images, retarget the human hand to multi-finger robot hands, and align the noisy object mesh with its accurate 3D shape. We show that these relatively low-quality HOI data from inexpensive web sources can effectively train a functional grasping model. To further expand the grasp dataset for seen and unseen objects, we use the initially-trained grasping policy with web data in the IsaacGym simulator to generate physically feasible grasps while preserving functionality. We train the grasping model on 10 object categories and evaluate it on 9 unseen objects, including challenging items such as syringes, pens, spray bottles, and tongs, which are underrepresented in existing datasets. The model trained on the web HOI dataset achieves a 75.8% success rate on seen objects and 61.8% across all objects in simulation, with a 6.7% improvement in success rate and a 1.8x increase in functionality ratings over baselines. Simulator-augmented data further boosts performance from 61.8% to 83.4%. The sim-to-real transfer to the LEAP Hand achieves an 85% success rate. Project website is at: https://webgrasp.github.io/.

[11] Real-Time Privacy Preservation for Robot Visual Perception

Minkyu Choi,Yunhao Yang,Neel P. Bhatt,Kushagra Gupta,Sahil Shah,Aditya Rai,David Fridovich-Keil,Ufuk Topcu,Sandeep P. Chinchali

Main category: cs.CV

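TL;DR: PCVS uses a logical specification and blurring to conceal sensitive objects in real-time video streams, ensuring privacy protection, and performs well on multiple datasets.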

Details

Motivation: Existing privacy-preserving methods cannot guarantee complete concealment of sensitive objects and are unsuitable for real-time video streams, so a method that guarantees privacy in real time is needed.

Method: PCVS combines a logical specification with a detection model to blur sensitive objects and uses a conformal prediction approach to establish a theoretical lower bound.

Result: PCVS achieves a specification satisfaction rate above 95% on multiple datasets, and the observed rate stays above the theoretical bound.

Conclusion: PCVS protects privacy effectively without disrupting normal robot operation and is suitable for real-time scenarios.

Abstract: Many robots (e.g., iRobot's Roomba) operate based on visual observations from live video streams, and such observations may inadvertently include privacy-sensitive objects, such as personal identifiers. Existing approaches for preserving privacy rely on deep learning models, differential privacy, or cryptography. They lack guarantees for the complete concealment of all sensitive objects. Guaranteeing concealment requires post-processing techniques and thus is inadequate for real-time video streams. We develop a method for privacy-constrained video streaming, PCVS, that conceals sensitive objects within real-time video streams. PCVS takes a logical specification constraining the existence of privacy-sensitive objects, e.g., never show faces when a person exists. It uses a detection model to evaluate the existence of these objects in each incoming frame. Then, it blurs out a subset of objects such that the existence of the remaining objects satisfies the specification. We then propose a conformal prediction approach to (i) establish a theoretical lower bound on the probability of the existence of these objects in a sequence of frames satisfying the specification and (ii) update the bound with the arrival of each subsequent frame. Quantitative evaluations show that PCVS achieves over 95 percent specification satisfaction rate in multiple datasets, significantly outperforming other methods. The satisfaction rate is consistently above the theoretical bounds across all datasets, indicating that the established bounds hold. Additionally, we deploy PCVS on robots in real-time operation and show that the robots operate normally without being compromised when PCVS conceals objects.
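The per-frame step in the abstract (detect objects, then blur a subset so that what remains satisfies the specification) can be illustrated with a deliberately simplified rule format. This is a sketch under stated assumptions, not the PCVS implementation: the specification is reduced to hypothetical `(trigger, hidden)` pairs meaning "never show `hidden` when `trigger` exists", detections are hypothetical `(label, box)` tuples, and the conformal prediction bound is not modeled.

```python
def objects_to_blur(detections, rules):
    """Return indices of detections that must be blurred in this frame.

    detections: list of (label, box) tuples from some detector (assumed format).
    rules: list of (trigger, hidden) pairs; if an object labeled `trigger`
           is present, every object labeled `hidden` must be concealed.
    """
    present = {label for label, _ in detections}
    hidden_labels = {hidden for trigger, hidden in rules if trigger in present}
    return [i for i, (label, _) in enumerate(detections) if label in hidden_labels]


# The abstract's example rule: never show faces when a person exists.
rules = [("person", "face")]
frame = [
    ("person", (0, 0, 50, 100)),
    ("face", (10, 5, 30, 25)),
    ("chair", (60, 40, 90, 100)),
]
print(objects_to_blur(frame, rules))
```

Only the face detection is selected here; the person and chair boxes are left untouched, which matches the idea of blurring a subset rather than every detected object.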

[12] GaMNet: A Hybrid Network with Gabor Fusion and NMamba for Efficient 3D Glioma Segmentation

Chengwei Ye,Huanzhen Zhang,Yufei Lin,Kangsheng Wang,Linuo Xu,Shuyan Liu

Main category: cs.CV

TL;DR: GaMNet combines the NMamba module with a multi-scale CNN for glioma segmentation, improving both accuracy and efficiency.

Details

Motivation: Existing CNN and Transformer models for glioma segmentation either lack context modeling or are computationally heavy, which makes real-time use on mobile medical devices difficult.

Method: The network integrates the NMamba module for global modeling and a multi-scale CNN for local feature extraction, and applies Gabor filters to improve interpretability.

Result: GaMNet improves segmentation accuracy while reducing parameters and computation time, notably lowering false positives and false negatives.

Conclusion: GaMNet shows higher reliability for clinical diagnosis and outperforms existing methods.

Abstract: Gliomas are aggressive brain tumors that pose serious health risks. Deep learning aids in lesion segmentation, but CNN and Transformer-based models often lack context modeling or demand heavy computation, limiting real-time use on mobile medical devices. We propose GaMNet, integrating the NMamba module for global modeling and a multi-scale CNN for efficient local feature extraction. To improve interpretability and mimic the human visual system, we apply Gabor filters at multiple scales. Our method achieves high segmentation accuracy with fewer parameters and faster computation. Extensive experiments show GaMNet outperforms existing methods, notably reducing false positives and negatives, which enhances the reliability of clinical diagnosis.
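For readers unfamiliar with Gabor filters, the sketch below builds a small multi-scale, multi-orientation bank using the standard real-valued 2D Gabor formula (a Gaussian envelope times a cosine carrier). It is 2D for brevity, whereas GaMNet targets 3D volumes, and the kernel sizes, sigmas, wavelengths, and number of orientations are illustrative assumptions rather than values from the paper.

```python
import math


def gabor_kernel(ksize, sigma, theta, lambd, gamma=0.5, psi=0.0):
    """Real part of a 2D Gabor filter as a ksize x ksize list of lists.

    ksize is assumed to be odd so the kernel has a center pixel.
    """
    half = ksize // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinate frame by theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr * xr + (gamma * yr) ** 2) / (2.0 * sigma ** 2))
            carrier = math.cos(2.0 * math.pi * xr / lambd + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def gabor_bank(scales, orientations=4):
    """One kernel per (scale, orientation); scales are (ksize, sigma, lambd)."""
    bank = []
    for ksize, sigma, lambd in scales:
        for k in range(orientations):
            theta = k * math.pi / orientations
            bank.append(gabor_kernel(ksize, sigma, theta, lambd))
    return bank


# Three hypothetical scales, four orientations each.
bank = gabor_bank([(7, 2.0, 4.0), (11, 3.0, 6.0), (15, 4.0, 8.0)])
print(len(bank), len(bank[0]), len(bank[-1]))
```

Because each kernel has a fixed, closed-form shape tuned to one orientation and spatial frequency, its response is easier to interpret than that of a learned convolution, which is the usual argument for using such filters.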

[13] X-Transfer Attacks: Towards Super Transferable Adversarial Attacks on CLIP

Hanxun Huang,Sarah Erfani,Yige Li,Xingjun Ma,James Bailey

Main category: cs.CV

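TL;DR: The paper proposes X-Transfer, a new attack method that exposes a universal adversarial vulnerability in CLIP models and achieves efficient attacks by dynamically selecting surrogate models.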

Details

Motivation: As CLIP models are increasingly used in diverse downstream tasks and large vision-language models, their vulnerability to adversarial perturbations has become a critical concern.

Method: X-Transfer uses dynamic surrogate model selection to generate a Universal Adversarial Perturbation (UAP) that transfers across data, domains, models, and tasks.

Result: X-Transfer significantly outperforms existing UAP methods and sets a new benchmark for adversarial transferability across CLIP models.

Conclusion: X-Transfer demonstrates a universal adversarial vulnerability in CLIP models and offers an important reference for future defense research.

Abstract: As Contrastive Language-Image Pre-training (CLIP) models are increasingly adopted for diverse downstream tasks and integrated into large vision-language models (VLMs), their susceptibility to adversarial perturbations has emerged as a critical concern. In this work, we introduce X-Transfer, a novel attack method that exposes a universal adversarial vulnerability in CLIP. X-Transfer generates a Universal Adversarial Perturbation (UAP) capable of deceiving various CLIP encoders and downstream VLMs across different samples, tasks, and domains. We refer to this property as super transferability: a single perturbation achieving cross-data, cross-domain, cross-model, and cross-task adversarial transferability simultaneously. This is achieved through surrogate scaling, a key innovation of our approach. Unlike existing methods that rely on fixed surrogate models, which are computationally intensive to scale, X-Transfer employs an efficient surrogate scaling strategy that dynamically selects a small subset of suitable surrogates from a large search space. Extensive evaluations demonstrate that X-Transfer significantly outperforms previous state-of-the-art UAP methods, establishing a new benchmark for adversarial transferability across CLIP models. The code is publicly available in our GitHub repository: https://github.com/HanxunH/XTransferBench
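To make the term "universal" concrete: a UAP is a single perturbation that is added unchanged to every input, usually under a norm budget. The sketch below shows only that application step, on flattened images with pixel values in [0, 1]. It does not implement X-Transfer's surrogate scaling or any optimization, and the L-infinity budget of 8/255 and the sample values are assumptions for illustration.

```python
def project_linf(delta, eps):
    """Clamp each perturbation entry to the L-infinity ball [-eps, eps]."""
    return [max(-eps, min(eps, d)) for d in delta]


def apply_uap(image, delta, eps):
    """Add the same bounded perturbation to an image and keep pixels in [0, 1]."""
    bounded = project_linf(delta, eps)
    return [max(0.0, min(1.0, p + d)) for p, d in zip(image, bounded)]


# One hypothetical perturbation shared by all images.
delta = [0.2, -0.2, 0.05, -0.01]
eps = 8 / 255  # assumed budget, not taken from the abstract
images = [
    [0.0, 0.5, 1.0, 0.3],
    [0.9, 0.05, 0.5, 0.5],
]
adversarial = [apply_uap(img, delta, eps) for img in images]
for clean, adv in zip(images, adversarial):
    print(max(abs(a - c) for a, c in zip(adv, clean)))
```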

[14] OXSeg: Multidimensional attention UNet-based lip segmentation using semi-supervised lip contours

Hanie Moghaddasi,Christina Chambers,Sarah N. Mattson,Jeffrey R. Wozniak,Claire D. Coles,Raja Mukherjee,Michael Suttie

Main category: cs.CV

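TL;DR: The paper proposes a lip segmentation method that combines attention UNets with multidimensional input, substantially improving segmentation accuracy and performing well in the diagnosis of fetal alcohol syndrome (FAS).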

Details

Motivation: Existing lip segmentation methods are limited by the availability of lip contours in the training data and are sensitive to image quality, lighting, and skin tone, which leads to inaccurate boundary detection.

Method: Local binary patterns extract micro-patterns from facial images to build multidimensional inputs, and sequential attention UNets reconstruct the lip contour. A mask generation method based on anatomical landmarks is introduced to improve segmentation accuracy.

Result: For lip segmentation, the mean Dice score is 84.75% and the mean pixel accuracy is 99.77%; for FAS identification, a GAN classifier reaches 98.55% accuracy.

Conclusion: The method substantially improves lip segmentation accuracy, especially around the Cupid's bow, and offers a new perspective on lip-related characteristics of FAS.

Abstract: Lip segmentation plays a crucial role in various domains, such as lip synchronization, lipreading, and diagnostics. However, the effectiveness of supervised lip segmentation is constrained by the availability of lip contours in the training phase. A further challenge with lip segmentation is its reliance on image quality, lighting, and skin tone, leading to inaccuracies in the detected boundaries. To address these challenges, we propose a sequential lip segmentation method that integrates attention UNet and multidimensional input. We unravel the micro-patterns in facial images using local binary patterns to build multidimensional inputs. Subsequently, the multidimensional inputs are fed into sequential attention UNets, where the lip contour is reconstructed. We introduce a mask generation method that uses a few anatomical landmarks and estimates the complete lip contour to improve segmentation accuracy. This mask has been utilized in the training phase for lip segmentation. To evaluate the proposed method, we use facial images to segment the upper lips and subsequently assess lip-related facial anomalies in subjects with fetal alcohol syndrome (FAS). Using the proposed lip segmentation method, we achieved a mean Dice score of 84.75% and a mean pixel accuracy of 99.77% in upper lip segmentation. To further evaluate the method, we implemented classifiers to identify those with FAS. Using a generative adversarial network (GAN), we reached an accuracy of 98.55% in identifying FAS in one of the study populations. This method could be used to improve lip segmentation accuracy, especially around Cupid's bow, and shed light on distinct lip-related characteristics of FAS.
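The two segmentation metrics quoted in the abstract can be computed from binary masks as sketched below, using the standard definitions (Dice = 2|A∩B| / (|A| + |B|); pixel accuracy = fraction of pixels labeled correctly). The tiny masks are made-up illustration data.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as lists of rows."""
    inter = sum(p & t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    size = sum(map(sum, pred)) + sum(map(sum, target))
    return 2.0 * inter / size if size else 1.0  # two empty masks agree fully


def pixel_accuracy(pred, target):
    """Fraction of pixels where prediction and target agree."""
    pairs = [(p, t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(p == t for p, t in pairs) / len(pairs)


# Hypothetical 3x3 masks: 1 = lip pixel, 0 = background.
pred = [[0, 1, 1],
        [0, 1, 0],
        [0, 0, 0]]
target = [[0, 1, 1],
          [0, 1, 1],
          [0, 0, 0]]
print(round(dice_score(pred, target), 4), round(pixel_accuracy(pred, target), 4))
```

Pixel accuracy counts background pixels as correct, so it tends to sit well above Dice when the foreground region is small; that is one plausible reason for the gap between the two reported figures.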

[15] Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments

Pranav Guruprasad,Yangyue Wang,Sudipta Chowdhury,Harshvardhan Sikka

Main category: cs.CV

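TL;DR: MultiNet v0.2 is a benchmark for evaluating the zero-shot generalization of vision-language-action (VLA) models, and it reveals the limitations of current models on out-of-distribution tasks.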

Details

Motivation: To systematically evaluate the zero-shot generalization of VLA models, especially their performance in out-of-distribution environments.

Method: Using the MultiNet v0.2 benchmark, the authors evaluate several VLA and VLM models, including GPT-4o, GPT-4.1, and OpenVLA, on Procgen tasks.

Result: The models show limited zero-shot generalization; VLA models perform better owing to their architectural design, and VLM models improve markedly when constrained appropriately.

Conclusion: VLA models generalize better than the other models, but all models need improvement, and prompt engineering has a significant effect on performance.

Abstract: Vision-language-action (VLA) models represent an important step toward general-purpose robotic systems by integrating visual perception, language understanding, and action execution. However, systematic evaluation of these models, particularly their zero-shot generalization capabilities in out-of-distribution (OOD) environments, remains limited. In this paper, we introduce MultiNet v0.2, a comprehensive benchmark designed to evaluate and analyze the generalization performance of state-of-the-art VLM and VLA models (including GPT-4o, GPT-4.1, OpenVLA, Pi0 Base, and Pi0 FAST) on diverse procedural tasks from the Procgen benchmark. Our analysis reveals several critical insights: (1) all evaluated models exhibit significant limitations in zero-shot generalization to OOD tasks, with performance heavily influenced by factors such as action representation and task complexity; (2) VLAs generally outperform other models due to their robust architectural design; and (3) VLM variants demonstrate substantial improvements when constrained appropriately, highlighting the sensitivity of model performance to precise prompt engineering.

[16] Prompt to Polyp: Clinically-Aware Medical Image Synthesis with Diffusion Models

Mikhail Chaichuk,Sushant Gautam,Steven Hicks,Elena Tutubalina

Main category: cs.CV

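TL;DR: The paper studies two approaches to text-to-image generation in the medical domain, fine-tuning large pre-trained models versus training small domain-specific models, and proposes an optimized architecture, MSDM.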

Details

Motivation: To address data scarcity in healthcare AI while preserving patient privacy.

Method: The study compares fine-tuning large pre-trained models (FLUX, Kandinsky) with training a small domain-specific model (MSDM), which integrates a clinical text encoder, a variational autoencoder, and cross-attention mechanisms.

Result: The large models produce higher-fidelity images, but MSDM comes close at lower computational cost.

Conclusion: MSDM shows promise for medical image generation, especially in resource-limited settings.

Abstract: The generation of realistic medical images from text descriptions has significant potential to address data scarcity challenges in healthcare AI while preserving patient privacy. This paper presents a comprehensive study of text-to-image synthesis in the medical domain, comparing two distinct approaches: (1) fine-tuning large pre-trained latent diffusion models and (2) training small, domain-specific models. We introduce a novel model named MSDM, an optimized architecture based on Stable Diffusion that integrates a clinical text encoder, variational autoencoder, and cross-attention mechanisms to better align medical text prompts with generated images. Our study compares two approaches: fine-tuning large pre-trained models (FLUX, Kandinsky) versus training compact domain-specific models (MSDM). Evaluation across colonoscopy (MedVQA-GI) and radiology (ROCOv2) datasets reveals that while large models achieve higher fidelity, our optimized MSDM delivers comparable quality with lower computational costs. Quantitative metrics and qualitative evaluations by medical experts reveal strengths and limitations of each approach.

[17] Steepest Descent Density Control for Compact 3D Gaussian Splatting

Peihao Wang,Yuehao Wang,Dilin Wang,Sreyas Mohan,Zhiwen Fan,Lemeng Wu,Ruisi Cai,Yu-Ying Yeh,Zhangyang Wang,Qiang Liu,Rakesh Ranjan

Main category: cs.CV

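TL;DR: 3D Gaussian Splatting (3DGS) is an efficient technique for real-time, high-resolution novel view synthesis, but redundant point clouds cause memory and performance problems. The paper proposes the SteepGS framework, which optimizes density control to cut the number of Gaussian points by about 50% and improve efficiency.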

Details

Motivation: Point cloud redundancy in 3DGS causes high memory usage and slower performance, limiting its use on resource-constrained devices.

Method: The paper proposes a theoretical framework for analyzing density control, optimizes how Gaussian points are generated and how their parameters are updated, and introduces the SteepGS strategy.

Result: SteepGS reduces the number of Gaussian points by about 50% while maintaining rendering quality, significantly improving efficiency and scalability.

Conclusion: SteepGS resolves the redundancy problem of 3DGS through better density control, providing an efficient solution for resource-constrained devices.

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful technique for real-time, high-resolution novel view synthesis. By representing scenes as a mixture of Gaussian primitives, 3DGS leverages GPU rasterization pipelines for efficient rendering and reconstruction. To optimize scene coverage and capture fine details, 3DGS employs a densification algorithm to generate additional points. However, this process often leads to redundant point clouds, resulting in excessive memory usage, slower performance, and substantial storage demands, posing significant challenges for deployment on resource-constrained devices. To address this limitation, we propose a theoretical framework that demystifies and improves density control in 3DGS. Our analysis reveals that splitting is crucial for escaping saddle points. Through an optimization-theoretic approach, we establish the necessary conditions for densification, determine the minimal number of offspring Gaussians, identify the optimal parameter update direction, and provide an analytical solution for normalizing offspring opacity. Building on these insights, we introduce SteepGS, incorporating steepest density control, a principled strategy that minimizes loss while maintaining a compact point cloud. SteepGS achieves a ~50% reduction in Gaussian points without compromising rendering quality, significantly enhancing both efficiency and scalability.

[18] ReactDance: Progressive-Granular Representation for Long-Term Coherent Reactive Dance Generation

Jingzhong Lin,Yuanyuan Qi,Xinru Li,Wenxuan Huang,Xiangfeng Xu,Bangyan Li,Xuejiao Wang,Gaoqi He

Main category: cs.CV

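TL;DR: ReactDance is a new diffusion-based framework that uses a multi-scale disentangled motion representation and a local-context sampling strategy to address the shortcomings of existing reactive dance generation methods in interaction fidelity, synchronization, and temporal consistency.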

Details

Motivation: Existing methods overemphasize global constraints and optimization and overlook local information such as fine-grained spatial interactions and localized temporal context, leading to insufficient interaction fidelity and temporal consistency. Method: Proposes GRFSQ (a multi-scale disentangled motion representation) and BLC (a sampling strategy based on local block causal masking), combined with Layer-Decoupled Classifier-free Guidance for multi-scale control. Result: On standard benchmarks, ReactDance outperforms existing methods and achieves state-of-the-art performance. Conclusion: Through its multi-scale representation and local-context strategy, ReactDance significantly improves the interaction fidelity and temporal consistency of reactive dance generation. Abstract: Reactive dance generation (RDG) produces follower movements conditioned on a guiding dancer and music while ensuring spatial coordination and temporal coherence. However, existing methods overemphasize global constraints and optimization, overlooking local information, such as fine-grained spatial interactions and localized temporal context. Therefore, we present ReactDance, a novel diffusion-based framework for high-fidelity RDG with long-term coherence and multi-scale controllability. Unlike existing methods that struggle with interaction fidelity, synchronization, and temporal consistency in duet synthesis, our approach introduces two key innovations: 1) Group Residual Finite Scalar Quantization (GRFSQ), a multi-scale disentangled motion representation that captures interaction semantics from coarse body rhythms to fine-grained joint dynamics, and 2) Blockwise Local Context (BLC), a sampling strategy eliminating error accumulation in long sequence generation via local block causal masking and periodic positional encoding. Built on the decoupled multi-scale GRFSQ representation, we implement a diffusion model with Layer-Decoupled Classifier-free Guidance (LDCFG), allowing granular control over motion semantics across scales. Extensive experiments on standard benchmarks demonstrate that ReactDance surpasses existing methods, achieving state-of-the-art performance.

[19] QuickSplat: Fast 3D Surface Reconstruction via Learned Gaussian Initialization

Yueh-Cheng Liu,Lukas Höllein,Matthias Nießner,Angela Dai

Main category: cs.CV

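TL;DR: QuickSplat uses data-driven priors to generate dense initializations for 2D Gaussian splatting optimization, accelerating the reconstruction of large-scale indoor scenes and improving geometric accuracy.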

Details

Motivation: Existing methods based on volumetric rendering optimize slowly and struggle with low-texture regions. Method: Learns data-driven priors to generate initializations, and jointly estimates scene parameter updates and the densification of the Gaussians. Result: Experiments show an 8x runtime speedup and up to 48% lower depth error. Conclusion: QuickSplat's data-driven optimization significantly improves reconstruction efficiency and accuracy. Abstract: Surface reconstruction is fundamental to computer vision and graphics, enabling applications in 3D modeling, mixed reality, robotics, and more. Existing approaches based on volumetric rendering obtain promising results, but optimize on a per-scene basis, resulting in a slow optimization that can struggle to model under-observed or textureless regions. We introduce QuickSplat, which learns data-driven priors to generate dense initializations for 2D Gaussian splatting optimization of large-scale indoor scenes. This provides a strong starting point for the reconstruction, which accelerates the convergence of the optimization and improves the geometry of flat wall structures. We further learn to jointly estimate the densification and update of the scene parameters during each iteration; our proposed densifier network predicts new Gaussians based on the rendering gradients of existing ones, removing the need for densification heuristics. Extensive experiments on large-scale indoor scene reconstruction demonstrate the superiority of our data-driven optimization. Concretely, we accelerate runtime by 8x, while decreasing depth errors by up to 48% in comparison to state-of-the-art methods.

[20] Enhancing Satellite Object Localization with Dilated Convolutions and Attention-aided Spatial Pooling

Seraj Al Mahmud Mostafa,Chenxi Wang,Jia Yue,Yuta Hozumi,Jianwu Wang

Main category: cs.CV

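TL;DR: The paper proposes YOLO-DCAP, an enhanced YOLOv5 model that addresses challenges in object localization in satellite imagery, including multi-scale features and noise interference. Experiments show that it significantly outperforms the base model and existing methods.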

Details

Motivation: Object localization in satellite imagery faces high variability, low resolution, and noise interference, especially in complex scenarios such as gravity waves, mesospheric bores, and ocean eddies. Method: YOLO-DCAP introduces a Multi-scale Dilated Residual Convolution (MDRC) block and an Attention-aided Spatial Pooling (AaSP) module to strengthen multi-scale feature extraction and attention to globally relevant spatial regions. Result: Across three satellite datasets, YOLO-DCAP improves mAP50 by an average of 20.95% and IoU by 32.23% over the base model, and also outperforms existing methods. Conclusion: YOLO-DCAP is robust and generalizes well for object localization in complex satellite imagery; the code is open source. Abstract: Object localization in satellite imagery is particularly challenging due to the high variability of objects, low spatial resolution, and interference from noise and dominant features such as clouds and city lights. In this research, we focus on three satellite datasets: upper atmospheric Gravity Waves (GW), mesospheric Bores (Bore), and Ocean Eddies (OE), each presenting its own unique challenges. These challenges include the variability in the scale and appearance of the main object patterns, where the size, shape, and feature extent of objects of interest can differ significantly. To address these challenges, we introduce YOLO-DCAP, a novel enhanced version of YOLOv5 designed to improve object localization in these complex scenarios. YOLO-DCAP incorporates a Multi-scale Dilated Residual Convolution (MDRC) block to capture multi-scale features at scale with varying dilation rates, and an Attention-aided Spatial Pooling (AaSP) module to focus on the global relevant spatial regions, enhancing feature selection. These structural improvements help to better localize objects in satellite imagery. Experimental results demonstrate that YOLO-DCAP significantly outperforms both the YOLO base model and state-of-the-art approaches, achieving an average improvement of 20.95% in mAP50 and 32.23% in IoU over the base model, and 7.35% and 9.84% respectively over state-of-the-art alternatives, consistently across all three satellite datasets. These consistent gains across all three satellite datasets highlight the robustness and generalizability of the proposed approach. Our code is open sourced at https://github.com/AI-4-atmosphere-remote-sensing/satellite-object-localization.
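
The mAP50 and IoU figures reported here both rest on box overlap. As a minimal sketch (not the authors' evaluation code), intersection-over-union for axis-aligned boxes and the 0.5 threshold behind mAP50 might look like the following. The `(x1, y1, x2, y2)` box format and the function names `box_iou` and `is_match_at_50` are assumptions made for illustration.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2). Hypothetical helper."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match_at_50(pred, gt):
    """A prediction counts as a match for mAP50 when its IoU with the ground truth is >= 0.5."""
    return box_iou(pred, gt) >= 0.5
```

A full mAP50 computation would additionally rank predictions by confidence and integrate the precision-recall curve; only the overlap test is shown here.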

[21] A Preliminary Study for GPT-4o on Image Restoration

Hao Yang,Yan Yang,Ruikun Zhang,Liyuan Pan

Main category: cs.CV

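TL;DR: GPT-4o performs well on image restoration tasks but has problems with pixel-level structural fidelity. Experiments show that its outputs can serve as visual priors that improve existing dehazing networks, offering guidance for future image restoration pipelines.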

Details

Motivation: To study GPT-4o's potential in image restoration and fill the gap left by the absence of a systematic evaluation. Method: Experimentally evaluates GPT-4o on several restoration tasks (dehazing, deraining, low-light enhancement) and analyzes how well its outputs work as visual priors. Result: GPT-4o outputs are visually appealing but lack pixel-level structural fidelity; used as visual priors, they substantially improve existing dehazing networks. Conclusion: GPT-4o shows potential for image restoration; the study offers a baseline framework and data for future research and may help drive innovation in image generation. Abstract: OpenAI's GPT-4o model, integrating multi-modal inputs and outputs within an autoregressive architecture, has demonstrated unprecedented performance in image generation. In this work, we investigate its potential impact on the image restoration community. We present the first systematic evaluation of GPT-4o across diverse restoration tasks. Our experiments reveal that, although restoration outputs from GPT-4o are visually appealing, they often lack pixel-level structural fidelity when compared to ground-truth images. Common issues are variations in image proportions, shifts in object positions and quantities, and changes in viewpoint. To address this, taking image dehazing, deraining, and low-light enhancement as representative case studies, we show that GPT-4o's outputs can serve as powerful visual priors, substantially enhancing the performance of existing dehazing networks. It offers practical guidelines and a baseline framework to facilitate the integration of GPT-4o into future image restoration pipelines. We hope the study on GPT-4o image restoration will accelerate innovation in the broader field of image generation. To support further research, we will release GPT-4o-restored images from over 10 widely used image restoration datasets.

[22] Looking Beyond Language Priors: Enhancing Visual Comprehension and Attention in Multimodal Models

Aarti Ghatkesar,Uddeshya Upadhyay,Ganesh Venkatesh

Main category: cs.CV

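TL;DR: The paper examines the challenge of deeply aligning vision and language in Multimodal Large Language Models (MLLMs) and proposes methods that strengthen visual understanding and let it guide language generation, yielding strong results on visually dependent tasks.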

Details

Motivation: To address MLLMs' underuse of visual input and over-reliance on language priors, and to improve deep alignment between vision and language. Method: Analyzes how MLLMs build visual understanding internally, then introduces techniques that deepen understanding of visual content and ensure that visual information guides language generation. Result: The model performs strongly on visually dependent tasks, predicts visually dependent tokens better, and gains 10 points on visually challenging tasks. Conclusion: The proposed methods effectively improve the multimodal understanding of MLLMs and achieve deep alignment between vision and language. Abstract: Achieving deep alignment between vision and language remains a central challenge for Multimodal Large Language Models (MLLMs). These models often fail to fully leverage visual input, defaulting to strong language priors. Our approach first provides insights into how MLLMs internally build visual understanding of image regions and then introduces techniques to amplify this capability. Specifically, we explore techniques designed both to deepen the model's understanding of visual content and to ensure that these visual insights actively guide language generation. We demonstrate the superior multimodal understanding of our resultant model through a detailed upstream analysis quantifying its ability to predict visually-dependent tokens as well as a 10-point boost on visually challenging tasks.

Faizan Farooq Khan,Jun Chen,Youssef Mohamed,Chun-Mei Feng,Mohamed Elhoseiny

Main category: cs.CV

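TL;DR: The paper proposes VR-RAG, a framework for open-vocabulary bird species recognition that combines visual and textual knowledge to improve classification of novel species.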

Details

Motivation: Open-vocabulary recognition is challenging in computer vision, especially in nature, where new species are continually discovered. Traditional methods are evaluated under a closed-vocabulary paradigm, which limits their applicability to real-world scenarios. Method: Proposes the VR-RAG framework, which combines textual knowledge distilled from Wikipedia by GPT-4o with multimodal vision-language encoders and reranks candidate species by visual similarity. Result: Across five benchmarks, VR-RAG improves the average performance of the large multimodal model QWEN2.5-VL by 15.4% and clearly outperforms conventional approaches. Conclusion: By combining encyclopedic knowledge with visual recognition, the work advances open-vocabulary recognition and offers a flexible, scalable solution for biodiversity monitoring. Abstract: Open-vocabulary recognition remains a challenging problem in computer vision, as it requires identifying objects from an unbounded set of categories. This is particularly relevant in nature, where new species are discovered every year. In this work, we focus on open-vocabulary bird species recognition, where the goal is to classify species based on their descriptions without being constrained to a predefined set of taxonomic categories. Traditional benchmarks like CUB-200-2011 and Birdsnap have been evaluated in a closed-vocabulary paradigm, limiting their applicability to real-world scenarios where novel species continually emerge. We show that the performance of current systems when evaluated under settings closely aligned with open-vocabulary drops by a huge margin. To address this gap, we propose a scalable framework integrating structured textual knowledge from Wikipedia articles of 11,202 bird species distilled via GPT-4o into concise, discriminative summaries. We propose Visual Re-ranking Retrieval-Augmented Generation (VR-RAG), a novel, retrieval-augmented generation framework that uses visual similarities to rerank the top m candidates retrieved by a set of multimodal vision language encoders. This allows for the recognition of unseen taxa. Extensive experiments across five established classification benchmarks show that our approach is highly effective. By integrating VR-RAG, we improve the average performance of state-of-the-art Large Multi-Modal Model QWEN2.5-VL by 15.4% across five benchmarks. Our approach outperforms conventional VLM-based approaches, which struggle with unseen species. By bridging the gap between encyclopedic knowledge and visual recognition, our work advances open-vocabulary recognition, offering a flexible, scalable solution for biodiversity monitoring and ecological research.

[24] Semantic Style Transfer for Enhancing Animal Facial Landmark Detection

Anadil Hussein,Anna Zamansky,George Martvel

Main category: cs.CV

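TL;DR: The paper studies the use of Neural Style Transfer (NST) to improve the training of animal facial landmark detectors; refining the style transfer procedure improves both the quality of the generated images and model performance.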

Details

Motivation: To explore Neural Style Transfer for animal facial landmark detection in order to improve model robustness and accuracy. Method: Applies style transfer to cropped facial images, proposes Supervised Style Transfer (SST) to handle annotation misalignment, and uses the style-transferred images for data augmentation. Result: The refined style transfer improves the structural consistency of generated images, SST retains up to 98% of baseline accuracy, and augmentation further improves robustness. Conclusion: Semantic style transfer is an effective augmentation strategy that can be generalized to other species and landmark detection models. Abstract: Neural Style Transfer (NST) is a technique for applying the visual characteristics of one image onto another while preserving structural content. Traditionally used for artistic transformations, NST has recently been adapted, e.g., for domain adaptation and data augmentation. This study investigates the use of this technique for enhancing the training of animal facial landmark detectors. As a case study, we use a recently introduced Ensemble Landmark Detector for 48 anatomical cat facial landmarks and the CatFLW dataset it was trained on, making three main contributions. First, we demonstrate that applying style transfer to cropped facial images rather than full-body images enhances structural consistency, improving the quality of generated images. Second, replacing training images with style-transferred versions raised challenges of annotation misalignment, but Supervised Style Transfer (SST) - which selects style sources based on landmark accuracy - retained up to 98% of baseline accuracy. Finally, augmenting the dataset with style-transferred images further improved robustness, outperforming traditional augmentation methods. These findings establish semantic style transfer as an effective augmentation strategy for enhancing the performance of facial landmark detection models for animals and beyond. While this study focuses on cat facial landmarks, the proposed method can be generalized to other species and landmark detection models.

[25] The Moon's Many Faces: A Single Unified Transformer for Multimodal Lunar Reconstruction

Tom Sander,Moritz Tenthoff,Kay Wohlfarth,Christian Wöhler

Main category: cs.CV

TL;DR: The paper proposes a unified multimodal learning architecture for reflectance parameter estimation and 3D reconstruction from lunar images.

Details

Motivation: Multimodal learning has rarely been applied to planetary science, yet reflectance parameter estimation and 3D reconstruction from lunar images can be framed as a multimodal problem. Method: A single unified transformer architecture learns shared representations across grayscale images, digital elevation models, surface normals, and albedo maps, and supports translation from any input modality to any target modality. Result: The model predicts digital elevation models and albedo maps simultaneously, achieving 3D reconstruction of planetary surfaces while disentangling photometric parameters from height information. Conclusion: The foundation model learns physically plausible relations across the four modalities; adding more input modalities in the future could support tasks such as photometric normalization and co-registration. Abstract: Multimodal learning is an emerging research topic across multiple disciplines but has rarely been applied to planetary science. In this contribution, we identify that reflectance parameter estimation and image-based 3D reconstruction of lunar images can be formulated as a multimodal learning problem. We propose a single, unified transformer architecture trained to learn shared representations between multiple sources like grayscale images, digital elevation models, surface normals, and albedo maps. The architecture supports flexible translation from any input modality to any target modality. Predicting DEMs and albedo maps from grayscale images simultaneously solves the task of 3D reconstruction of planetary surfaces and disentangles photometric parameters and height information. Our results demonstrate that our foundation model learns physically plausible relations across these four modalities. Adding more input modalities in the future will enable tasks such as photometric normalization and co-registration.

[26] Lost in OCR Translation? Vision-Based Approaches to Robust Document Retrieval

Alexander Most,Joseph Winjum,Ayan Biswas,Shawn Jones,Nishath Rajiv Ranasinghe,Dan O'Malley,Manish Bhattarai

Main category: cs.CV

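TL;DR: The study compares a vision-based RAG system (ColPali) with traditional OCR-based RAG systems across document qualities, finds that OCR-based RAG generalizes better, and discusses the trade-off between computational efficiency and semantic accuracy.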

Details

Motivation: Traditional RAG systems rely on OCR to process documents, but OCR can introduce errors, especially on low-quality documents. The study aims to compare visual-embedding and OCR-based approaches. Method: Systematically compares ColPali (visual embedding) with OCR-based pipelines (Llama 3.2 and Nougat OCR) across varying document qualities, and introduces a semantic answer evaluation benchmark. Result: The visual-embedding approach performs well on documents it was fine-tuned on, but OCR-based approaches generalize better to unseen documents. Conclusion: The study offers practical guidance for RAG practitioners choosing between OCR-dependent and vision-based systems, who must weigh computational efficiency against semantic accuracy. Abstract: Retrieval-Augmented Generation (RAG) has become a popular technique for enhancing the reliability and utility of Large Language Models (LLMs) by grounding responses in external documents. Traditional RAG systems rely on Optical Character Recognition (OCR) to first process scanned documents into text. However, even state-of-the-art OCRs can introduce errors, especially in degraded or complex documents. Recent vision-language approaches, such as ColPali, propose direct visual embedding of documents, eliminating the need for OCR. This study presents a systematic comparison between a vision-based RAG system (ColPali) and more traditional OCR-based pipelines utilizing Llama 3.2 (90B) and Nougat OCR across varying document qualities. Beyond conventional retrieval accuracy metrics, we introduce a semantic answer evaluation benchmark to assess end-to-end question-answering performance. Our findings indicate that while vision-based RAG performs well on documents it has been fine-tuned on, OCR-based RAG is better able to generalize to unseen documents of varying quality. We highlight the key trade-offs between computational efficiency and semantic accuracy, offering practical guidance for RAG practitioners in selecting between OCR-dependent and vision-based document retrieval systems in production environments.

[27] TeGA: Texture Space Gaussian Avatars for High-Resolution Dynamic Head Modeling

Gengyan Li,Paulo Gotardo,Timo Bolkart,Stephan Garbin,Kripasindhu Sarkar,Abhimitra Meka,Alexandros Lattas,Thabo Beeler

Main category: cs.CV

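TL;DR: The paper presents a high-detail 3D head avatar model based on 3D Gaussian splatting; a new deformable Gaussian encoding and fitting procedure markedly improves rendering quality and detail.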

Details

Motivation: Existing animatable 3D head avatars lose detail when motion estimation is inaccurate, and memory limits restrict the number of 3D Gaussians, lowering rendering quality. The paper aims to solve both problems. Method: Reconstructs a high-quality model from multiview input video; a mesh-based 3D morphable model provides a coarse deformation layer, 3D Gaussians are embedded in the continuous UVD tangent space of the mesh, and a novel UVD deformation field captures subtle, localized motion. Result: The model greatly increases the number and quality of 3D Gaussians, supports rendering at 4K resolution, and preserves appearance detail while capturing facial motion and high-frequency features such as skin wrinkling. Conclusion: The proposed deformable Gaussian encoding and fitting procedure improves avatar detail and animation quality, offering a better solution for telepresence, extended reality, and entertainment applications. Abstract: Sparse volumetric reconstruction and rendering via 3D Gaussian splatting have recently enabled animatable 3D head avatars that are rendered under arbitrary viewpoints with impressive photorealism. Today, such photoreal avatars are seen as a key component in emerging applications in telepresence, extended reality, and entertainment. Building a photoreal avatar requires estimating the complex non-rigid motion of different facial components as seen in input video images; due to inaccurate motion estimation, animatable models typically present a loss of fidelity and detail when compared to their non-animatable counterparts, built from an individual facial expression. Also, recent state-of-the-art models are often affected by memory limitations that reduce the number of 3D Gaussians used for modeling, leading to lower detail and quality. To address these problems, we present a new high-detail 3D head avatar model that improves upon the state of the art, largely increasing the number of 3D Gaussians and modeling quality for rendering at 4K resolution. Our high-quality model is reconstructed from multiview input video and builds on top of a mesh-based 3D morphable model, which provides a coarse deformation layer for the head. Photoreal appearance is modelled by 3D Gaussians embedded within the continuous UVD tangent space of this mesh, allowing for more effective densification where most needed. Additionally, these Gaussians are warped by a novel UVD deformation field to capture subtle, localized motion. Our key contribution is the novel deformable Gaussian encoding and overall fitting procedure that allows our head model to preserve appearance detail, while capturing facial motion and other transient high-frequency features such as skin wrinkling.

[28] InstanceGen: Image Generation with Instance-level Instructions

Etai Sella,Yanir Kleiman,Hadar Averbuch-Elor

Main category: cs.CV

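TL;DR: The paper proposes a technique that couples image-based structural guidance with LLM-based instance-level instructions to improve text-to-image generation for complex prompts.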

Details

Motivation: Despite rapid progress in generative models, pretrained text-to-image models still struggle with complex prompts that combine multiple objects and instance-level attributes. Method: Uses the fine-grained structural initialization that image generation models can provide, coupled with LLM-based instance-level instructions, to generate images that better match the prompt. Result: The generated images adhere more closely to all parts of the text prompt, including object counts, instance-level attributes, and spatial relations between instances. Conclusion: Combining structural guidance with instructions significantly improves image generation quality for complex prompts. Abstract: Despite rapid advancements in the capabilities of generative models, pretrained text-to-image models still struggle in capturing the semantics conveyed by complex prompts that compound multiple objects and instance-level attributes. Consequently, we are witnessing growing interest in integrating additional structural constraints, typically in the form of coarse bounding boxes, to better guide the generation process in such challenging cases. In this work, we take the idea of structural guidance a step further by making the observation that contemporary image generation models can directly provide a plausible fine-grained structural initialization. We propose a technique that couples this image-based structural guidance with LLM-based instance-level instructions, yielding output images that adhere to all parts of the text prompt, including object counts, instance-level attributes, and spatial relations between instances.

[29] Fine-Tuning Video-Text Contrastive Model for Primate Behavior Retrieval from Unlabeled Raw Videos

Giulio Cesare Mastrocinque Santo,Patrícia Izar,Irene Delval,Victor de Napole Gregolin,Nina S. T. Hirata

Main category: cs.CV

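TL;DR: The researchers fine-tune a pre-trained video-text foundation model to retrieve useful clips from videos of wild capuchin monkeys, using multimodal large language models and vision-language models to handle noisy data, and obtain substantial performance gains.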

Details

Motivation: Studying the behavior of wild capuchin monkeys requires retrieving useful clips from large amounts of unlabeled video, but the video and audio data are very noisy and conventional methods work poorly. Method: A two-stage approach: 1) an automated data preprocessing pipeline extracts clean video-text pairs; 2) a pre-trained X-CLIP model is fine-tuned with LoRA. Result: On the domain data, Hits@5 improves by 167% for the 16-frame model and by 114% for the 8-frame model; NDCG@K results show that the model ranks behaviors effectively, which the raw pre-trained models cannot do. Conclusion: The method substantially improves retrieval of useful video clips from noisy data and provides a practical tool for wildlife behavior research. Abstract: Video recordings of nonhuman primates in their natural habitat are a common source for studying their behavior in the wild. We fine-tune pre-trained video-text foundational models for the specific domain of capuchin monkeys, with the goal of developing useful computational models to help researchers to retrieve useful clips from videos. We focus on the challenging problem of training a model based solely on raw, unlabeled video footage, using weak audio descriptions sometimes provided by field collaborators. We leverage recent advances in Multimodal Large Language Models (MLLMs) and Vision-Language Models (VLMs) to address the extremely noisy nature of both video and audio content. Specifically, we propose a two-fold approach: an agentic data treatment pipeline and a fine-tuning process. The data processing pipeline automatically extracts clean and semantically aligned video-text pairs from the raw videos, which are subsequently used to fine-tune Microsoft's pre-trained X-CLIP model through Low-Rank Adaptation (LoRA). We obtained an uplift in Hits@5 of 167% for the 16-frame model and an uplift of 114% for the 8-frame model on our domain data. Moreover, based on NDCG@K results, our model is able to rank well most of the considered behaviors, while the tested raw pre-trained models are not able to rank them at all. The code will be made available upon acceptance.
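
To make the retrieval metrics concrete, here is a minimal sketch of Hits@K, NDCG@K, and relative uplift under common textbook definitions. The paper's exact definitions are not given in the abstract, so the binary-hit convention, the log2 discount, and the function names below are assumptions rather than the authors' evaluation code.

```python
import math


def hits_at_k(ranked_relevance, k):
    """1.0 if any of the top-k retrieved items is relevant, else 0.0 (assumed definition)."""
    return 1.0 if any(r > 0 for r in ranked_relevance[:k]) else 0.0


def dcg_at_k(ranked_relevance, k):
    """Discounted cumulative gain with the usual log2(rank + 1) discount."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(ranked_relevance[:k]))


def ndcg_at_k(ranked_relevance, k):
    """DCG normalized by the DCG of the ideal (relevance-sorted) ranking."""
    ideal = dcg_at_k(sorted(ranked_relevance, reverse=True), k)
    return dcg_at_k(ranked_relevance, k) / ideal if ideal > 0 else 0.0


def relative_uplift_percent(baseline, improved):
    """Relative improvement in percent; a score that rises from 0.1 to 0.2 is a 100% uplift."""
    return (improved - baseline) / baseline * 100.0
```

Each `ranked_relevance` list holds the relevance of the retrieved clips in ranked order for one text query; per-query scores would then be averaged over all queries.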

[30] HyperspectralMAE: The Hyperspectral Imagery Classification Model using Fourier-Encoded Dual-Branch Masked Autoencoder

Wooyoung Jeong,Hyun Jae Park,Seonghun Jeong,Jong Wook Jang,Tae Hoon Lim,Dae Seoung Kim

Main category: cs.CV

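TL;DR: Proposes HyperspectralMAE, a Transformer-based foundation model pre-trained with a dual masking strategy (over the spatial and spectral dimensions) and harmonic Fourier positional embeddings to learn robust representations of hyperspectral data.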

Details

Motivation: The high dimensionality of hyperspectral data is challenging and calls for a model that handles spatial and spectral information together. Method: A dual masking strategy (50% of spatial patches and 50% of spectral bands), combined with harmonic Fourier positional embeddings and an MSE+SAM reconstruction objective. Result: The model achieves state-of-the-art transfer-learning accuracy on the Indian Pines benchmark. Conclusion: Dual masking and wavelength-aware embeddings significantly improve hyperspectral image reconstruction and downstream task performance. Abstract: Hyperspectral imagery provides rich spectral detail but poses unique challenges because of its high dimensionality in both spatial and spectral domains. We propose HyperspectralMAE, a Transformer-based foundation model for hyperspectral data that employs a dual masking strategy: during pre-training we randomly occlude 50% of spatial patches and 50% of spectral bands. This forces the model to learn representations capable of reconstructing missing information across both dimensions. To encode spectral order, we introduce learnable harmonic Fourier positional embeddings based on wavelength. The reconstruction objective combines mean-squared error (MSE) with the spectral angle mapper (SAM) to balance pixel-level accuracy and spectral-shape fidelity. The resulting model contains about $1.8\times10^{8}$ parameters and produces 768-dimensional embeddings, giving it sufficient capacity for transfer learning. We pre-trained HyperspectralMAE on two large hyperspectral corpora (NASA EO-1 Hyperion, ~1,600 scenes, ~$3\times10^{11}$ pixel spectra; and DLR EnMAP Level-0, ~1,300 scenes, ~$3\times10^{11}$ pixel spectra) and fine-tuned it for land-cover classification on the Indian Pines benchmark. HyperspectralMAE achieves state-of-the-art transfer-learning accuracy on Indian Pines, confirming that masked dual-dimensional pre-training yields robust spectral-spatial representations. These results demonstrate that dual masking and wavelength-aware embeddings advance hyperspectral image reconstruction and downstream analysis.
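
The MSE+SAM reconstruction objective can be illustrated for a single pixel spectrum. This is a minimal sketch, not the authors' loss: the abstract does not say how the two terms are weighted, so the `sam_weight` parameter and all function names are assumptions.

```python
import math


def mse(pred, target):
    """Mean-squared error between two spectra of equal length."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def spectral_angle(pred, target, eps=1e-12):
    """Spectral angle mapper: the angle in radians between two spectra viewed as vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    cos = dot / max(norm_p * norm_t, eps)
    # Clamp so rounding error cannot push the value outside acos's domain.
    return math.acos(max(-1.0, min(1.0, cos)))


def reconstruction_loss(pred, target, sam_weight=1.0):
    """MSE plus a weighted spectral angle; sam_weight is a hypothetical hyperparameter."""
    return mse(pred, target) + sam_weight * spectral_angle(pred, target)
```

The two terms are complementary: a spectrum scaled by a constant keeps a spectral angle near zero but has a nonzero MSE, while a spectrum with the right magnitude and the wrong shape is penalized by the angle term.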

[31] DiGIT: Multi-Dilated Gated Encoder and Central-Adjacent Region Integrated Decoder for Temporal Action Detection Transformer

Ho-Joong Kim,Yearang Lee,Jung-Ho Hong,Seong-Whan Lee

Main category: cs.CV

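TL;DR: The paper proposes DiGIT, a temporal action detection method whose multi-dilated gated encoder and central-adjacent region integrated decoder address the redundancy and insufficient context of existing query-based detectors.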

Details

Motivation: Existing query-based temporal action detectors directly reuse object detection architectures, which leads to redundant multi-scale features and a limited ability to capture temporal context. Method: Replaces the multi-scale deformable attention encoder with a multi-dilated gated encoder to reduce redundant information, and introduces a central-adjacent region integrated decoder with an improved sampling strategy. Result: DiGIT achieves state-of-the-art performance on THUMOS14, ActivityNet v1.3, and HACS-Segment. Conclusion: By redesigning the encoder and decoder, DiGIT effectively improves temporal action detection. Abstract: In this paper, we examine a key limitation in query-based detectors for temporal action detection (TAD), which arises from their direct adaptation of architectures originally designed for object detection. Despite the effectiveness of the existing models, they struggle to fully address the unique challenges of TAD, such as the redundancy in multi-scale features and the limited ability to capture sufficient temporal context. To address these issues, we propose a multi-dilated gated encoder and central-adjacent region integrated decoder for temporal action detection transformer (DiGIT). Our approach replaces the existing encoder that consists of multi-scale deformable attention and feedforward network with our multi-dilated gated encoder. Our proposed encoder reduces the redundant information caused by multi-level features while maintaining the ability to capture fine-grained and long-range temporal information. Furthermore, we introduce a central-adjacent region integrated decoder that leverages a more comprehensive sampling strategy for deformable cross-attention to capture the essential information. Extensive experiments demonstrate that DiGIT achieves state-of-the-art performance on THUMOS14, ActivityNet v1.3, and HACS-Segment. Code is available at: https://github.com/Dotori-HJ/DiGIT

[32] Semantic-Space-Intervened Diffusive Alignment for Visual Classification

Zixuan Li,Lei Meng,Guoqing Chao,Wei Wu,Xiaoshuo Yan,Yimeng Yang,Zhuang Qi,Xiangxu Meng

Main category: cs.CV

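TL;DR: Proposes SeDA, a new method that uses semantic-space intervention and a bi-stage diffusion framework to progressively align visual and textual features, improving cross-modal classification performance.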

Details

Motivation: Existing methods typically project visual features onto the textual feature distribution in a single mapping step, but differences between the two modalities in sample distribution and feature value range make effective alignment hard. Method: SeDA uses a semantic space as a bridge and aligns visual and textual features progressively through a bi-stage diffusion framework (a Diffusion-Controlled Semantic Learner and a Diffusion-Controlled Semantic Translator), together with a Progressive Feature Interaction Network. Result: Experiments show that SeDA achieves strong cross-modal feature alignment and outperforms existing methods. Conclusion: Through semantic-space intervention and progressive alignment, SeDA effectively addresses the challenge of cross-modal feature alignment. Abstract: Cross-modal alignment is an effective approach to improving visual classification. Existing studies typically enforce a one-step mapping that uses deep neural networks to project the visual features to mimic the distribution of textual features. However, they typically face difficulties in finding such a projection due to differences between the two modalities in both the distribution of class-wise samples and the range of their feature values. To address this issue, this paper proposes a novel Semantic-Space-Intervened Diffusive Alignment method, termed SeDA, which models a semantic space as a bridge in the visual-to-textual projection, considering that both types of features share the same class-level information in classification. More importantly, a bi-stage diffusion framework is developed to enable the progressive alignment between the two modalities. Specifically, SeDA first employs a Diffusion-Controlled Semantic Learner to model the semantic feature space of visual features by constraining the interactive features of the diffusion model and the category centers of visual features. In the later stage of SeDA, the Diffusion-Controlled Semantic Translator focuses on learning the distribution of textual features from the semantic space. Meanwhile, the Progressive Feature Interaction Network introduces stepwise feature interactions at each alignment step, progressively integrating textual information into mapped features. Experimental results show that SeDA achieves stronger cross-modal feature alignment, leading to superior performance over existing methods across multiple scenarios.

[33] You Are Your Best Teacher: Semi-Supervised Surgical Point Tracking with Cycle-Consistent Self-Distillation

Valay Bundele,Mehran Hosseinzadeh,Hendrik Lensch

Main category: cs.CV

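TL;DR: SurgTracker is a semi-supervised framework that adapts synthetic-trained point trackers to surgical video through filtered self-distillation, addressing domain shift and the lack of labeled data.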

Details

Motivation: Synthetic datasets have driven significant progress in point tracking, but when trackers are deployed on real surgical video, complex tissue deformation, occlusion, and lighting variation make domain shift and the lack of labeled data especially severe. Method: SurgTracker generates pseudo-labels online with a fixed teacher model and filters out inconsistent trajectories with a cycle consistency constraint, enforcing geometric consistency and stable training supervision. Result: On the STIR benchmark, SurgTracker improves tracking performance using only 80 unlabeled videos. Conclusion: SurgTracker shows potential for robust adaptation in high-shift, data-scarce domains. Abstract: Synthetic datasets have enabled significant progress in point tracking by providing large-scale, densely annotated supervision. However, deploying these models in real-world domains remains challenging due to domain shift and lack of labeled data, issues that are especially severe in surgical videos, where scenes exhibit complex tissue deformation, occlusion, and lighting variation. While recent approaches adapt synthetic-trained trackers to natural videos using teacher ensembles or augmentation-heavy pseudo-labeling pipelines, their effectiveness in high-shift domains like surgery remains unexplored. This work presents SurgTracker, a semi-supervised framework for adapting synthetic-trained point trackers to surgical video using filtered self-distillation. Pseudo-labels are generated online by a fixed teacher (identical in architecture and initialization to the student) and are filtered using a cycle consistency constraint to discard temporally inconsistent trajectories. This simple yet effective design enforces geometric consistency and provides stable supervision throughout training, without the computational overhead of maintaining multiple teachers. Experiments on the STIR benchmark show that SurgTracker improves tracking performance using only 80 unlabeled videos, demonstrating its potential for robust adaptation in high-shift, data-scarce domains.
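
The cycle-consistency filter can be sketched in a few lines: track a point forward in time, track the result backward, and keep the pseudo-label only if the round trip lands close to the start. This is an illustrative sketch under stated assumptions, not SurgTracker's implementation. The `track_forward` and `track_backward` callables stand in for the teacher tracker, and the `max_error` pixel threshold is a hypothetical value.

```python
import math


def cycle_consistent_mask(start_points, track_forward, track_backward, max_error=2.0):
    """Return one boolean per start point: True if the forward-backward round trip
    ends within max_error (in pixels) of where it began.

    track_forward / track_backward are placeholders for a tracker that maps an
    (x, y) point in one frame to its estimated (x, y) location in another frame.
    """
    keep = []
    for p in start_points:
        q = track_forward(p)        # teacher's prediction in the later frame
        p_back = track_backward(q)  # track that prediction back to the first frame
        keep.append(math.dist(p, p_back) <= max_error)
    return keep
```

Only trajectories whose mask entry is `True` would be used as pseudo-labels for the student; the rest are discarded as temporally inconsistent.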

[34] Dome-DETR: DETR with Density-Oriented Feature-Query Manipulation for Efficient Tiny Object Detection

Zhangchi Hu,Peixi Wu,Jie Chen,Huyue Zhu,Yijun Wang,Yansong Peng,Hebei Li,Xiaoyan Sun

Main category: cs.CV

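TL;DR: Dome-DETR is an efficient tiny object detection framework that uses density-oriented feature-query manipulation to reduce redundant computation and improve performance.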

Details

Motivation: Existing tiny object detection methods suffer from inefficient feature use and high computational cost. Method: Introduces DeFE to generate compact foreground masks, combined with the MWAS sparse attention mechanism and PAQI adaptive query allocation. Result: Gains of 3.3 AP on AI-TOD-V2 and 2.5 AP on VisDrone. Conclusion: Dome-DETR reaches state-of-the-art performance while remaining lightweight and efficient. Abstract: Tiny object detection plays a vital role in drone surveillance, remote sensing, and autonomous systems, enabling the identification of small targets across vast landscapes. However, existing methods suffer from inefficient feature leverage and high computational costs due to redundant feature processing and rigid query allocation. To address these challenges, we propose Dome-DETR, a novel framework with Density-Oriented Feature-Query Manipulation for Efficient Tiny Object Detection. To reduce feature redundancies, we introduce a lightweight Density-Focal Extractor (DeFE) to produce clustered compact foreground masks. Leveraging these masks, we incorporate Masked Window Attention Sparsification (MWAS) to focus computational resources on the most informative regions via sparse attention. Besides, we propose Progressive Adaptive Query Initialization (PAQI), which adaptively modulates query density across spatial areas for better query allocation. Extensive experiments demonstrate that Dome-DETR achieves state-of-the-art performance (+3.3 AP on AI-TOD-V2 and +2.5 AP on VisDrone) while maintaining low computational complexity and a compact model size. Code will be released upon acceptance.

[35] kFuse: A novel density based agglomerative clustering

Huan Yan,Junjie Hu

Main category: cs.CV

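TL;DR: This paper proposes kFuse, a density-based agglomerative clustering method that significantly improves clustering accuracy and stability through natural-neighbor sub-cluster partitioning, boundary connectivity, and density similarity assessment.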

Details

Motivation: Existing agglomerative clustering methods require additional parameters and give unstable results, making them hard to adapt to different datasets without prior knowledge. Method: kFuse comprises four key steps: natural-neighbor sub-cluster partitioning, boundary connectivity computation, density similarity assessment, and merging rules. Result: Experiments show that kFuse performs well on synthetic and real-world datasets. Conclusion: By jointly considering adjacent samples, distances, and densities, kFuse significantly improves clustering quality. Abstract: Agglomerative clustering has emerged as a vital tool in data analysis due to its intuitive and flexible characteristics. However, existing agglomerative clustering methods often involve additional parameters for sub-cluster partitioning and inter-cluster similarity assessment. This necessitates different parameter settings across various datasets, which is challenging in the absence of prior knowledge. Moreover, existing agglomerative clustering techniques are constrained by the calculation method of connection distance, leading to unstable clustering results. To address these issues, this paper introduces a novel density-based agglomerative clustering method, termed kFuse. kFuse comprises four key components: (1) sub-cluster partitioning based on natural neighbors; (2) determination of boundary connectivity between sub-clusters through the computation of adjacent samples and shortest distances; (3) assessment of density similarity between sub-clusters via the calculation of mean density and variance; and (4) establishment of merging rules between sub-clusters based on boundary connectivity and density similarity. kFuse requires the specification of the number of clusters only at the final merging stage. Additionally, by comprehensively considering adjacent samples, distances, and densities among different sub-clusters, kFuse significantly enhances accuracy during the merging phase, thereby greatly improving its identification capability. Experimental results on both synthetic and real-world datasets validate the effectiveness of kFuse.

[36] Automating Infrastructure Surveying: A Framework for Geometric Measurements and Compliance Assessment Using Point Cloud Data

Amin Ghafourian,Andrew Lee,Dechen Gao,Tyler Beer,Kin Yen,Iman Soltani

Main category: cs.CV

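TL;DR: The paper proposes a framework for automated geometric measurement and compliance assessment from point cloud data, combining deep learning with geometric and signal processing techniques; applied to ADA compliance assessment, it is validated as accurate and reliable.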

Details

Motivation: Use automation to improve the efficiency, accuracy, and scalability of infrastructure surveying and compliance assessment. Method: Combines deep learning-based detection and segmentation with geometric and signal processing techniques, using a newly collected, large annotated dataset for model training and evaluation. Result: Experiments validate the method's accuracy and reliability, significantly reducing manual effort and improving assessment consistency. Conclusion: The framework lays the groundwork for broader applications in infrastructure surveying and automated construction evaluation, promoting wider adoption of point cloud data in these domains. Abstract: Automation can play a prominent role in improving efficiency, accuracy, and scalability in infrastructure surveying and assessing construction and compliance standards. This paper presents a framework for automation of geometric measurements and compliance assessment using point cloud data. The proposed approach integrates deep learning-based detection and segmentation, in conjunction with geometric and signal processing techniques, to automate surveying tasks. As a proof of concept, we apply this framework to automatically evaluate the compliance of curb ramps with the Americans with Disabilities Act (ADA), demonstrating the utility of point cloud data in survey automation. The method leverages a newly collected, large annotated dataset of curb ramps, made publicly available as part of this work, to facilitate robust model training and evaluation. Experimental results, including comparison with manual field measurements of several ramps, validate the accuracy and reliability of the proposed method, highlighting its potential to significantly reduce manual effort and improve consistency in infrastructure assessment. Beyond ADA compliance, the proposed framework lays the groundwork for broader applications in infrastructure surveying and automated construction evaluation, promoting wider adoption of point cloud data in these domains. The annotated database, manual ramp survey data, and developed algorithms are publicly available on the project's GitHub page: https://github.com/Soltanilara/SurveyAutomation.
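
As a rough illustration of the kind of geometric measurement such a framework automates, the sketch below fits a least-squares plane to points already segmented as a ramp surface and reports its slope. This is not the authors' pipeline: the function name is hypothetical, and the 8.33% (1:12) limit is an assumed example threshold rather than a value stated in the abstract.

```python
import numpy as np

MAX_SLOPE_PERCENT = 8.33  # assumed example limit (1:12), not from the paper

def plane_slope_percent(points):
    """Fit z = a*x + b*y + c by least squares and return the slope
    magnitude of the fitted plane in percent."""
    pts = np.asarray(points, dtype=float)
    A = np.c_[pts[:, 0], pts[:, 1], np.ones(len(pts))]
    coeffs, *_ = np.linalg.lstsq(A, pts[:, 2], rcond=None)
    a, b = coeffs[0], coeffs[1]
    return 100.0 * float(np.hypot(a, b))

# Synthetic ramp surface rising 0.08 m per metre along x.
xs, ys = np.meshgrid(np.linspace(0, 1, 5), np.linspace(0, 1, 5))
ramp = np.c_[xs.ravel(), ys.ravel(), 0.08 * xs.ravel()]
slope = plane_slope_percent(ramp)
compliant = slope <= MAX_SLOPE_PERCENT
```

Real scans would need the segmentation and noise handling that the paper attributes to its learned and signal processing components.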

[37] A review of advancements in low-light image enhancement using deep learning

Fangxue Liu,Lei Fan

Main category: cs.CV

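TL;DR: This review surveys deep-learning-based low-light image enhancement methods since 2020, examines their impact on downstream vision tasks, and proposes future research directions.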

Details

Motivation: Computer vision algorithms degrade in low-light environments, so a systematic study of deep learning methods for low-light image enhancement and their effectiveness is needed. Method: Analyzes recent (from 2020) low-light image enhancement methods and their mechanisms in detail, with illustrations, and evaluates their impact on downstream tasks. Result: Summarizes the strengths and limitations of different enhancement techniques and analyzes how much they improve subsequent vision tasks. Conclusion: The review serves as a reference for choosing low-light image enhancement techniques and optimizing vision task performance in low-light conditions, and points out potential directions for future research. Abstract: In low-light environments, the performance of computer vision algorithms often deteriorates significantly, adversely affecting key vision tasks such as segmentation, detection, and classification. With the rapid advancement of deep learning, its application to low-light image processing has attracted widespread attention and seen significant progress in recent years. However, there remains a lack of comprehensive surveys that systematically examine how recent deep-learning-based low-light image enhancement methods function and evaluate their effectiveness in enhancing downstream vision tasks. To address this gap, this review provides a detailed elaboration on how various recent approaches (from 2020) operate and their enhancement mechanisms, supplemented with clear illustrations. It also investigates the impact of different enhancement techniques on subsequent vision tasks, critically analyzing their strengths and limitations. Additionally, it proposes future research directions. This review serves as a useful reference for determining low-light image enhancement techniques and optimizing vision task performance in low-light conditions.

[38] Describe Anything in Medical Images

Xi Xiao,Yunbei Zhang,Thanh-Huy Nguyen,Ba-Thinh Lam,Janet Wang,Jihun Hamm,Tianyang Wang,Xingjian Li,Xiao Wang,Hao Xu,Tianming Liu,Min Xu

Main category: cs.CV

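TL;DR: MedDAM is a region-specific captioning framework for medical images that, using expert-designed prompts and an evaluation benchmark, significantly outperforms existing models.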

Details

Motivation: Existing localized captioning models have not been widely applied to medicine, yet medical diagnosis depends on regional detail, so a dedicated solution is needed. Method: MedDAM uses large vision-language models together with expert-designed prompts and an evaluation protocol to generate region descriptions. Result: Across multiple datasets, MedDAM outperforms leading models, confirming the importance of region-level semantic alignment. Conclusion: MedDAM provides a promising foundation for clinical vision-language integration. Abstract: Localized image captioning has made significant progress with models like the Describe Anything Model (DAM), which can generate detailed region-specific descriptions without explicit region-text supervision. However, such capabilities have yet to be widely applied to specialized domains like medical imaging, where diagnostic interpretation relies on subtle regional findings rather than global understanding. To mitigate this gap, we propose MedDAM, the first comprehensive framework leveraging large vision-language models for region-specific captioning in medical images. MedDAM employs medical expert-designed prompts tailored to specific imaging modalities and establishes a robust evaluation benchmark comprising a customized assessment protocol, data pre-processing pipeline, and specialized QA template library. This benchmark evaluates both MedDAM and other adaptable large vision-language models, focusing on clinical factuality through attribute-level verification tasks, thereby circumventing the absence of ground-truth region-caption pairs in medical datasets. Extensive experiments on the VinDr-CXR, LIDC-IDRI, and SkinCon datasets demonstrate MedDAM's superiority over leading peers (including GPT-4o, Claude 3.7 Sonnet, LLaMA-3.2 Vision, Qwen2.5-VL, GPT4RoI, and OMG-LLaVA) in the task, revealing the importance of region-level semantic alignment in medical image understanding and establishing MedDAM as a promising foundation for clinical vision-language integration.

[39] Image Segmentation via Variational Model Based Tailored UNet: A Deep Variational Framework

Kaili Qi,Wenli Yang,Ye Li,Zhongyi Huang

Main category: cs.CV

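TL;DR: Proposes VM_TUNet, a hybrid framework combining a variational model with UNet that offers both mathematical interpretability and adaptive feature learning; experiments show strong performance on boundary segmentation.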

Details

Motivation: Traditional variational models are mathematically interpretable but sensitive to parameters and computationally expensive, while deep learning models such as UNet are lightweight but lack theoretical interpretability and need large amounts of labeled data. The goal is to combine the strengths of both. Method: Proposes VM_TUNet, which integrates the fourth-order modified Cahn-Hilliard equation with UNet, introduces a data-driven operator to replace manual parameter tuning, and uses TFPM to enforce high-precision boundary preservation. Result: On benchmark datasets, VM_TUNet achieves better segmentation performance than existing methods, especially for fine boundary delineation. Conclusion: VM_TUNet successfully combines the advantages of variational models and deep learning, providing an efficient and interpretable solution for image segmentation. Abstract: Traditional image segmentation methods, such as variational models based on partial differential equations (PDEs), offer strong mathematical interpretability and precise boundary modeling, but often suffer from sensitivity to parameter settings and high computational costs. In contrast, deep learning models such as UNet, which are relatively lightweight in parameters, excel in automatic feature extraction but lack theoretical interpretability and require extensive labeled data. To harness the complementary strengths of both paradigms, we propose Variational Model Based Tailored UNet (VM_TUNet), a novel hybrid framework that integrates the fourth-order modified Cahn-Hilliard equation with the deep learning backbone of UNet, combining the interpretability and edge-preserving properties of variational methods with the adaptive feature learning of neural networks. Specifically, a data-driven operator is introduced to replace manual parameter tuning, and we incorporate the tailored finite point method (TFPM) to enforce high-precision boundary preservation. Experimental results on benchmark datasets demonstrate that VM_TUNet achieves superior segmentation performance compared to existing approaches, especially for fine boundary delineation.

[40] Accelerating Diffusion Transformer via Increment-Calibrated Caching with Channel-Aware Singular Value Decomposition

Zhiyuan Chen,Keyi Li,Yifan Jia,Le Ye,Yufei Ma

Main category: cs.CV

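TL;DR: Proposes a training-free DiT acceleration method that uses increment-calibrated caching and channel-aware SVD to substantially cut computation while preserving generation quality.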

Details

Motivation: The high computational complexity of diffusion models limits deployment, and existing caching methods lack correction, which can degrade quality. Method: Proposes increment-calibrated caching and channel-aware SVD, generating calibration parameters from the pre-trained model. Result: Outperforms existing methods under a similar compute budget, eliminating more than 45% of computation with an FID increase of less than 0.06. Conclusion: The method is efficient, needs no additional training, and significantly improves the practicality of DiT. Abstract: Diffusion transformer (DiT) models have achieved remarkable success in image generation, thanks to their exceptional generative capabilities and scalability. Nonetheless, the iterative nature of diffusion models (DMs) results in high computational complexity, posing challenges for deployment. Although existing cache-based acceleration methods try to utilize the inherent temporal similarity to skip redundant computations of DiT, the lack of correction may induce potential quality degradation. In this paper, we propose increment-calibrated caching, a training-free method for DiT acceleration, where the calibration parameters are generated from the pre-trained model itself with low-rank approximation. To deal with the possible correction failure arising from outlier activations, we introduce channel-aware Singular Value Decomposition (SVD), which further strengthens the calibration effect. Experimental results show that our method consistently achieves better performance than existing naive caching methods with a similar computation resource budget. When compared with 35-step DDIM, our method eliminates more than 45% computation and improves IS by 12 at the cost of less than 0.06 FID increase. Code is available at https://github.com/ccccczzy/icc.
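
One way to picture caching with a low-rank correction is the sketch below, which is an interpretation under stated assumptions rather than the paper's algorithm. It assumes a single linear layer `y = W x`, a cached output from an earlier step, and a correction computed from a truncated SVD of `W` applied to the input increment. All function names are hypothetical, and the channel-aware variant is not modelled.

```python
import numpy as np

def low_rank_factors(W, rank):
    """Truncated SVD of W, returned as two thin factors A @ B."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # shape (out, rank)
    B = Vt[:rank]                # shape (rank, in)
    return A, B

def calibrated_output(y_cached, x_cached, x_new, A, B):
    """Reuse the cached output and correct it with the low-rank
    image of the input increment."""
    return y_cached + A @ (B @ (x_new - x_cached))

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16))
x_prev = rng.standard_normal(16)
x_next = x_prev + 0.1 * rng.standard_normal(16)
y_prev = W @ x_prev                       # computed and cached earlier
A, B = low_rank_factors(W, rank=4)
y_naive = y_prev                          # plain caching: reuse as-is
y_cal = calibrated_output(y_prev, x_prev, x_next, A, B)
err_naive = np.linalg.norm(W @ x_next - y_naive)
err_cal = np.linalg.norm(W @ x_next - y_cal)
```

For a linear layer the corrected error never exceeds the naive caching error, because the residual only involves the discarded singular directions; real DiT blocks are nonlinear, so this guarantee does not carry over directly.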

[41] Dual-level Fuzzy Learning with Patch Guidance for Image Ordinal Regression

Chunlai Dong,Haochao Ying,Qibo Qiu,Jinhong Wang,Danny Chen,Jian Wu

Main category: cs.CV

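TL;DR: Proposes DFPG, a dual-level fuzzy learning framework that uses patch guidance to learn precise feature-based grading boundaries from ambiguous ordinal labels.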

Details

Motivation: Current methods rely only on image-level ordinal labels and overlook fine-grained patch-level features, whereas human experts rely on patch-level features for decisions. Method: Proposes patch-labeling and filtering strategies that let the model focus on patch-level features with only image-level ordinal labels, and designs a dual-level fuzzy learning module that handles label ambiguity from both patch-wise and channel-wise perspectives. Result: Performs strongly on multiple image ordinal regression datasets, with a notable ability to distinguish samples from difficult-to-classify categories. Conclusion: The DFPG framework effectively addresses the under-use of patch-level features in ordinal regression and significantly improves performance. Abstract: Ordinal regression bridges regression and classification by assigning objects to ordered classes. While human experts rely on discriminative patch-level features for decisions, current approaches are limited by the availability of only image-level ordinal labels, overlooking fine-grained patch-level characteristics. In this paper, we propose a Dual-level Fuzzy Learning with Patch Guidance framework, named DFPG, that learns precise feature-based grading boundaries from ambiguous ordinal labels, with patch-level supervision. Specifically, we propose patch-labeling and filtering strategies to enable the model to focus on patch-level features exclusively with only image-level ordinal labels available. We further design a dual-level fuzzy learning module, which leverages fuzzy logic to quantitatively capture and handle label ambiguity from both patch-wise and channel-wise perspectives. Extensive experiments on various image ordinal regression datasets demonstrate the superiority of our proposed method, further confirming its ability to distinguish samples from difficult-to-classify categories. The code is available at https://github.com/ZJUMAI/DFPG-ord.

[42] Automated Knot Detection and Pairing for Wood Analysis in the Timber Industry

Guohao Lin,Shidong Pan,Rasul Khanbayov,Changxi Yang,Ani Khaloian-Sarnaghi,Andriy Kovryga

Main category: cs.CV

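TL;DR: This paper proposes a lightweight, automated machine-learning pipeline for detecting and pairing knots in wood, significantly improving efficiency and accuracy.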

Details

Motivation: Knots in wood are critical to aesthetics and structural integrity, but traditional manual annotation is inefficient, so automation is needed. Method: Uses YOLOv8l for knot detection and pairs knots through multidimensional feature extraction and a triplet neural network. Result: The detection stage reaches an mAP@0.5 of 0.887 and the pairing stage an accuracy of 0.85. Conclusion: Experiments validate the method and demonstrate the potential of AI in wood science and industry. Abstract: Knots in wood are critical to both aesthetics and structural integrity, making their detection and pairing essential in timber processing. However, traditional manual annotation is labor-intensive and inefficient, necessitating automation. This paper proposes a lightweight and fully automated pipeline for knot detection and pairing based on machine learning techniques. In the detection stage, high-resolution surface images of wooden boards were collected using industrial-grade cameras, and a large-scale dataset was manually annotated and preprocessed. After transfer learning, YOLOv8l achieves an mAP@0.5 of 0.887. In the pairing stage, detected knots were analyzed and paired based on multidimensional feature extraction. A triplet neural network was used to map the features into a latent space, enabling clustering algorithms to identify and pair corresponding knots. The triplet network with learnable weights achieved a pairing accuracy of 0.85. Further analysis revealed that the distances from the knot's start and end points to the bottom of the wooden board, and the longitudinal coordinates, play crucial roles in achieving high pairing accuracy. Our experiments validate the effectiveness of the proposed solution, demonstrating the potential of AI in advancing wood science and industry.
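
The pairing stage relies on a triplet network, whose usual training objective can be sketched as follows. This is the standard triplet margin loss, shown here as an assumption about the setup: the abstract does not state the exact loss, margin, or distance used, and the embeddings below are made up.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on Euclidean distances: pull the
    matching knot (positive) closer to the anchor than a non-matching
    knot (negative) by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return float(np.maximum(d_pos - d_neg + margin, 0.0).mean())

anchor = np.array([[0.0, 0.0]])
positive = np.array([[0.0, 1.0]])
easy_negative = np.array([[3.0, 4.0]])   # already far away
hard_negative = np.array([[1.0, 0.0]])   # as close as the positive
loss_easy = triplet_loss(anchor, positive, easy_negative)
loss_hard = triplet_loss(anchor, positive, hard_negative)
```

Once embeddings are trained this way, matching knots lie close together in the latent space, which is what allows a clustering step to pair them.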

[43] RefRef: A Synthetic Dataset and Benchmark for Reconstructing Refractive and Reflective Objects

Yue Yin,Enze Tao,Weijian Deng,Dylan Campbell

Main category: cs.CV

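TL;DR: The paper introduces the RefRef dataset and benchmark for reconstructing scenes with refractive and reflective objects from posed images, and proposes an oracle method based on accurate light path computation along with an approach that avoids its assumptions.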

Details

Motivation: Existing 3D reconstruction and novel view synthesis methods mainly target opaque Lambertian objects and cannot handle refractive and reflective materials, and relevant datasets are lacking. Method: Proposes the RefRef dataset (150 scenes), an oracle method that computes light paths from geometry and refractive indices, and an approach that avoids these assumptions. Result: Experiments show that existing methods fall far behind the oracle on RefRef, highlighting the difficulty of the task. Conclusion: The RefRef dataset and benchmark provide a new tool for reconstructing refractive and reflective scenes, and existing methods still need improvement. Abstract: Modern 3D reconstruction and novel view synthesis approaches have demonstrated strong performance on scenes with opaque Lambertian objects. However, most assume straight light paths and therefore cannot properly handle refractive and reflective materials. Moreover, datasets specialized for these effects are limited, stymieing efforts to evaluate performance and develop suitable techniques. In this work, we introduce a synthetic RefRef dataset and benchmark for reconstructing scenes with refractive and reflective objects from posed images. Our dataset has 50 such objects of varying complexity, from single-material convex shapes to multi-material non-convex shapes, each placed in three different background types, resulting in 150 scenes. We also propose an oracle method that, given the object geometry and refractive indices, calculates accurate light paths for neural rendering, and an approach based on this that avoids these assumptions. We benchmark these against several state-of-the-art methods and show that all methods lag significantly behind the oracle, highlighting the challenges of the task and dataset.
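
To make "light paths given geometry and refractive indices" concrete, the sketch below bends a single ray at one interface using the vector form of Snell's law. It is a textbook building block, not the paper's oracle renderer; the function name and the example indices (1.0 for air, 1.5 for glass) are illustrative assumptions.

```python
import numpy as np

def refract(direction, normal, n_in, n_out):
    """Refract a ray at an interface using Snell's law in vector form.
    `normal` must point against the incoming ray. Returns the unit
    refracted direction, or None on total internal reflection."""
    d = np.asarray(direction, dtype=float)
    n = np.asarray(normal, dtype=float)
    d = d / np.linalg.norm(d)
    n = n / np.linalg.norm(n)
    eta = n_in / n_out
    cos_i = -float(np.dot(n, d))
    sin2_t = eta ** 2 * (1.0 - cos_i ** 2)
    if sin2_t > 1.0:
        return None  # total internal reflection
    cos_t = np.sqrt(1.0 - sin2_t)
    return eta * d + (eta * cos_i - cos_t) * n

up = np.array([0.0, 0.0, 1.0])
# Ray hitting the surface at 45 degrees, going from air into glass.
incoming = np.array([np.sin(np.pi / 4), 0.0, -np.cos(np.pi / 4)])
bent = refract(incoming, up, n_in=1.0, n_out=1.5)
# Ray at 60 degrees trying to leave glass into air: no transmitted ray.
steep = np.array([np.sin(np.pi / 3), 0.0, -np.cos(np.pi / 3)])
blocked = refract(steep, up, n_in=1.5, n_out=1.0)
```

A renderer would chain this step at every interface a ray crosses, which is why methods that assume straight light paths struggle on such scenes.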

[44] PICD: Versatile Perceptual Image Compression with Diffusion Rendering

Tongda Xu,Jiahao Li,Bin Li,Yan Wang,Ya-Qin Zhang,Yan Lu

Main category: cs.CV

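TL;DR: Proposes PICD, a versatile perceptual screen image compression method based on diffusion rendering that works for both screen and natural images and improves text and image compression quality through hierarchical integration of conditional information.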

Details

Motivation: Existing perceptual image compression methods produce noticeable artifacts on screen content such as text, so a general codec that handles both screen content and natural images is needed. Method: Proposes the PICD framework, which encodes text and image separately and renders them into one image with a diffusion model. Diffusion rendering integrates conditional information at three levels: domain-level fine-tuning, adaptor-level control, and instance-level guidance. Result: PICD surpasses existing perceptual codecs in text accuracy and perceptual quality, and also works as a perceptual codec for natural images. Conclusion: PICD is an efficient, versatile compression method for screen and natural images that significantly improves compression quality. Abstract: Recently, perceptual image compression has achieved significant advancements, delivering high visual quality at low bitrates for natural images. However, for screen content, existing methods often produce noticeable artifacts when compressing text. To tackle this challenge, we propose versatile perceptual screen image compression with diffusion rendering (PICD), a codec that works well for both screen and natural images. More specifically, we propose a compression framework that encodes the text and image separately, and renders them into one image using a diffusion model. For this diffusion rendering, we integrate conditional information into diffusion models at three distinct levels: (1) Domain level: We fine-tune the base diffusion model using text content prompts with screen content. (2) Adaptor level: We develop an efficient adaptor to control the diffusion model using compressed image and text as input. (3) Instance level: We apply instance-wise guidance to further enhance the decoding process. Empirically, our PICD surpasses existing perceptual codecs in terms of both text accuracy and perceptual quality. Additionally, without text conditions, our approach serves effectively as a perceptual codec for natural images.

[45] Decoupling Multi-Contrast Super-Resolution: Pairing Unpaired Synthesis with Implicit Representations

Hongyu Rui,Yinzhe Wu,Fanwen Wang,Jiahao Huang,Liutao Yang,Zi Wang,Guang Yang

Main category: cs.CV

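TL;DR: Proposes a modular multi-contrast super-resolution (MCSR) framework that needs no paired training data and supports arbitrary upscaling, achieving high-quality reconstruction through two stages: unpaired cross-modal synthesis and unsupervised super-resolution.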

Details

Motivation: The multi-contrast nature of MRI offers an opportunity for cross-modal enhancement, but existing methods depend on paired data and fixed resolutions, which makes them hard to use in clinical settings. Method: Two stages: Unpaired Cross-Modal Synthesis (U-CMS) and Unsupervised Super-Resolution (U-SR), using implicit neural representations (INRs) for scale-agnostic, anatomically faithful reconstruction. Result: Outperforms existing baselines at 4x and 8x upscaling, with higher fidelity and anatomical consistency. Conclusion: The framework has potential for scalable, subject-specific, and data-efficient use in real clinical settings. Abstract: Magnetic Resonance Imaging (MRI) is critical for clinical diagnostics but is often limited by long acquisition times and low signal-to-noise ratios, especially in modalities like diffusion and functional MRI. The multi-contrast nature of MRI presents a valuable opportunity for cross-modal enhancement, where high-resolution (HR) modalities can serve as references to boost the quality of their low-resolution (LR) counterparts, motivating the development of Multi-Contrast Super-Resolution (MCSR) techniques. Prior work has shown that leveraging complementary contrasts can improve SR performance; however, effective feature extraction and fusion across modalities with varying resolutions remains a major challenge. Moreover, existing MCSR methods often assume fixed resolution settings and all require large, perfectly paired training datasets, conditions rarely met in real-world clinical environments. To address these challenges, we propose a novel Modular Multi-Contrast Super-Resolution (MCSR) framework that eliminates the need for paired training data and supports arbitrary upscaling. Our method decouples the MCSR task into two stages: (1) Unpaired Cross-Modal Synthesis (U-CMS), which translates a high-resolution reference modality into a synthesized version of the target contrast, and (2) Unsupervised Super-Resolution (U-SR), which reconstructs the final output using implicit neural representations (INRs) conditioned on spatial coordinates. This design enables scale-agnostic and anatomically faithful reconstruction by bridging unpaired cross-modal synthesis with unsupervised resolution enhancement. Experiments show that our method achieves superior performance at 4x and 8x upscaling, with improved fidelity and anatomical consistency over existing baselines. Our framework demonstrates strong potential for scalable, subject-specific, and data-efficient MCSR in real-world clinical settings.

[46] Towards Facial Image Compression with Consistency Preserving Diffusion Prior

Yimin Zhou,Yichong Xia,Bin Chen,Baoyi An,Haoqian Wang,Zhi Wang,Yaowei Wang,Zikun Zhou

Main category: cs.CV

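TL;DR: FaSDiff is a facial image compression method based on a Stable Diffusion prior that preserves consistency through frequency enhancement and improves reconstructed image quality at low bit rates.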

Details

Motivation: Existing facial image compression methods reconstruct poorly at low bit rates and preserve too little high-frequency information, which hurts downstream applications. Method: FaSDiff combines a high-frequency-sensitive compressor with a low-frequency enhancement module and uses the diffusion prior to optimize visual prompts and semantic consistency. Result: Experiments show that FaSDiff outperforms existing methods in both human visual quality and machine vision accuracy. Conclusion: Through frequency enhancement and a diffusion prior, FaSDiff effectively balances visual quality and machine performance. Abstract: With the widespread application of facial image data across various domains, the efficient storage and transmission of facial images has garnered significant attention. However, the existing learned face image compression methods often produce unsatisfactory reconstructed image quality at low bit rates. Simply adapting diffusion-based compression methods to facial compression tasks results in reconstructed images that perform poorly in downstream applications due to insufficient preservation of high-frequency information. To further explore the diffusion prior in facial image compression, we propose Facial Image Compression with a Stable Diffusion Prior (FaSDiff), a method that preserves consistency through frequency enhancement. FaSDiff employs a high-frequency-sensitive compressor in an end-to-end framework to capture fine image details and produce robust visual prompts. Additionally, we introduce a hybrid low-frequency enhancement module that disentangles low-frequency facial semantics and stably modulates the diffusion prior alongside visual prompts. The proposed modules allow FaSDiff to leverage diffusion priors for superior human visual perception while minimizing performance loss in machine vision due to semantic inconsistency. Extensive experiments show that FaSDiff outperforms state-of-the-art methods in balancing human visual quality and machine vision accuracy. The code will be released after the paper is accepted.

[47] Register and CLS tokens yield a decoupling of local and global features in large ViTs

Alexander Lappe,Martin A. Giese

Main category: cs.CV

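TL;DR: The attention maps of DINOv2 exhibit artifacts caused by redundant information storage, hurting performance and interpretability. Register tokens yield cleaner attention maps, but global and local features remain disconnected. The CLS token shows a similar problem, so attention maps of large ViTs must be interpreted with care.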

Details

Motivation: Address the attention map artifacts in DINOv2 caused by redundant information storage, in order to improve model performance and interpretability. Method: Examines register tokens, which were introduced to store global information, analyzes their effect on the relationship between global and local features, and investigates the analogous behavior of the CLS token. Result: Register tokens make attention maps cleaner, but global information is extracted mainly from register tokens, which disconnects local and global features. The CLS token shows a similar problem. Conclusion: Attention maps of large ViTs must be interpreted with care; attributing the problem to register and CLS tokens points the way towards more interpretable vision models. Abstract: Recent work has shown that the attention maps of the widely popular DINOv2 model exhibit artifacts, which hurt both model interpretability and performance on dense image tasks. These artifacts emerge due to the model repurposing patch tokens with redundant local information for the storage of global image information. To address this problem, additional register tokens have been incorporated in which the model can store such information instead. We carefully examine the influence of these register tokens on the relationship between global and local image features, showing that while register tokens yield cleaner attention maps, these maps do not accurately reflect the integration of local image information in large models. Instead, global information is dominated by information extracted from register tokens, leading to a disconnect between local and global features. Inspired by these findings, we show that the CLS token itself, which can be interpreted as a register, leads to a very similar phenomenon in models without explicit register tokens. Our work shows that care must be taken when interpreting attention maps of large ViTs. Further, by clearly attributing the faulty behaviour to register and CLS tokens, we show a path towards more interpretable vision models.

[48] Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI

Benjamin Raphael Ernhofer,Daniil Prokhorov,Jannica Langner,Dominik Bollmann

Main category: cs.CV

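TL;DR: The paper proposes a vision-language framework for interacting with automotive infotainment systems and releases the open-source dataset AutomotiveUI-Bench-4K. With a synthetic data pipeline and a LoRA fine-tuned Molmo-7B model, it achieves strong cross-domain performance.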

Details

Motivation: Modern automotive infotainment systems need intelligent, adaptive solutions to cope with frequent UI updates and diverse design variations. Method: Uses a vision-language framework with a synthetic data pipeline and a LoRA fine-tuned Molmo-7B model to build the Evaluative Large Action Model (ELAM). Result: ELAM performs strongly on AutomotiveUI-Bench-4K, improves on the baseline by 5.2% on ScreenSpot, and reaches 80.4% average accuracy on ScreenSpot. Conclusion: The study shows that data collection and fine-tuning can drive AI progress in automotive UI understanding; the method is low-cost and suitable for consumer-grade GPUs. Abstract: Modern automotive infotainment systems require intelligent and adaptive solutions to handle frequent User Interface (UI) updates and diverse design variations. We introduce a vision-language framework for understanding and interacting with automotive infotainment systems, enabling seamless adaptation across different UI designs. To further support research in this field, we release AutomotiveUI-Bench-4K, an open-source dataset of 998 images with 4,208 annotations. Additionally, we present a synthetic data pipeline to generate training data. We fine-tune a Molmo-7B-based model using Low-Rank Adaptation (LoRA), incorporating reasoning generated by our pipeline along with visual grounding and evaluation capabilities. The fine-tuned Evaluative Large Action Model (ELAM) achieves strong performance on AutomotiveUI-Bench-4K (model and dataset are available on Hugging Face) and demonstrates strong cross-domain generalization, including a +5.2% improvement on ScreenSpot over the baseline model. Notably, our approach achieves 80.4% average accuracy on ScreenSpot, closely matching or even surpassing specialized models for desktop, mobile, and web, such as ShowUI, despite being trained for the infotainment domain. This research investigates how data collection and subsequent fine-tuning can lead to AI-driven progress within automotive UI understanding and interaction. The applied method is cost-efficient and fine-tuned models can be deployed on consumer-grade GPUs.

[49] Examining the Source of Defects from a Mechanical Perspective for 3D Anomaly Detection

Hanzhe Liang,Aoran Wang,Jie Zhou,Xin Jin,Can Gao,Jinbao Wang

Main category: cs.CV

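TL;DR: The paper proposes MC4AD, a Mechanics Complementary framework for 3D anomaly detection that improves detection by simulating internal and external corrective forces; combined with a diverse anomaly generation module and a corrective force prediction network, it achieves strong performance with few parameters.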

Details

Motivation: Existing anomaly detection methods focus mainly on structural features and ignore the causes of anomalies. The paper aims to address 3D anomaly detection more comprehensively by simulating the internal and external corrective forces associated with anomalies. Method: (1) A Diverse Anomaly-Generation (DA-Gen) module simulates various anomalies; (2) a Corrective Force Prediction Network (CFP-Net) generates point-level corrective forces; (3) a symmetric loss and an overall loss constrain the corrective forces. Result: On the proposed dataset and five existing datasets, the method obtains nine state-of-the-art results with the fewest parameters and the fastest inference speed. Conclusion: Through mechanics complementarity and corrective force simulation, the MC4AD framework significantly improves the performance and efficiency of 3D anomaly detection. Abstract: In this paper, we go beyond identifying anomalies only in structural terms and think about better anomaly detection motivated by anomaly causes. Most anomalies are regarded as the result of unpredictable defective forces from internal and external sources, and their opposite forces are sought to correct the anomalies. We introduce a Mechanics Complementary framework for 3D anomaly detection (MC4AD) to generate internal and external corrective forces for each point. A Diverse Anomaly-Generation (DA-Gen) module is first proposed to simulate various anomalies. Then, we present a Corrective Force Prediction Network (CFP-Net) with complementary representations for point-level representation to simulate the different contributions of internal and external corrective forces. A combined loss is proposed, including a new symmetric loss and an overall loss, to constrain the corrective forces properly. As a highlight, we consider 3D anomaly detection in industry more comprehensively, creating a hierarchical quality control strategy based on a three-way decision and contributing a dataset named Anomaly-IntraVariance with intraclass variance to evaluate the model. On the proposed dataset and five existing datasets, we obtained nine state-of-the-art results with the minimum parameters and the fastest inference speed. The source is available at https://github.com/hzzzzzhappy/MC4AD

[50] DFEN: Dual Feature Equalization Network for Medical Image Segmentation

Jianjian Yin,Yi Chen,Chengyu Li,Zhichao Zheng,Yanhui Gu,Junsheng Zhou

Main category: cs.CV

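TL;DR: Proposes a dual feature equalization network based on Swin Transformer and CNN that strengthens pixel feature representations through image-level and class-level feature equalization, addressing misclassification of boundary pixels and of regions with few class pixels.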

Details

Motivation: Existing medical image segmentation methods do not account for the unequal contextual feature information received by boundary pixels and by regions with few class pixels, which leads to misclassification. Method: Designs image-level and class-level feature equalization modules on a hybrid Swin Transformer and CNN architecture to strengthen pixel feature representations. Result: Achieves state-of-the-art performance on multiple datasets (BUSI, ISIC2017, ACDC, and PH²). Conclusion: The proposed dual feature equalization network effectively addresses unequal feature information and improves segmentation performance. Abstract: Current methods for medical image segmentation primarily focus on extracting contextual feature information from the perspective of the whole image. While these methods have shown effective performance, none of them take into account the fact that pixels at the boundary and regions with a low number of class pixels capture more contextual feature information from other classes, leading to misclassification of pixels due to unequal contextual feature information. In this paper, we propose a dual feature equalization network based on the hybrid architecture of Swin Transformer and Convolutional Neural Network, aiming to augment the pixel feature representations by image-level equalization feature information and class-level equalization feature information. Firstly, the image-level feature equalization module is designed to equalize the contextual information of pixels within the image. Secondly, we aggregate regions of the same class to equalize the pixel feature representations of the corresponding class by a class-level feature equalization module. Finally, the pixel feature representations are enhanced by learning weights for image-level equalization feature information and class-level equalization feature information. In addition, Swin Transformer is utilized as both the encoder and decoder, thereby bolstering the ability of the model to capture long-range dependencies and spatial correlations. We conducted extensive experiments on Breast Ultrasound Images (BUSI), International Skin Imaging Collaboration (ISIC2017), Automated Cardiac Diagnosis Challenge (ACDC) and PH$^2$ datasets. The experimental results demonstrate that our method achieves state-of-the-art performance. Our code is publicly available at https://github.com/JianJianYin/DFEN.

[51] CGTrack: Cascade Gating Network with Hierarchical Feature Aggregation for UAV Tracking

Weihong Li,Xiaoqiong Liu,Heng Fan,Libo Zhang

Main category: cs.CV

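TL;DR: CGTrack is a new UAV tracker that combines explicit and implicit techniques to expand network capacity within a coarse-to-fine framework, addressing the limited capacity of lightweight networks in UAV tracking.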

Details

Motivation: UAV tracking is critical in real-world robotics applications, but integrating lightweight networks often causes a significant drop in network capacity, which worsens challenges such as occlusion and viewpoint changes. Method: Proposes a Hierarchical Feature Cascade (HFC) module and a Lightweight Gated Center Head (LGCH), which use feature reuse and gating mechanisms to increase network capacity and strengthen feature representation. Result: On three UAV tracking benchmarks, CGTrack achieves state-of-the-art performance while running fast. Conclusion: Through its module design, CGTrack effectively addresses the network capacity problem in UAV tracking with strong performance and efficiency. Abstract: Recent advancements in visual object tracking have markedly improved the capabilities of unmanned aerial vehicle (UAV) tracking, which is a critical component in real-world robotics applications. While the integration of hierarchical lightweight networks has become a prevalent strategy for enhancing efficiency in UAV tracking, it often results in a significant drop in network capacity, which further exacerbates challenges in UAV scenarios, such as frequent occlusions and extreme changes in viewing angles. To address these issues, we introduce a novel family of UAV trackers, termed CGTrack, which combines explicit and implicit techniques to expand network capacity within a coarse-to-fine framework. Specifically, we first introduce a Hierarchical Feature Cascade (HFC) module that leverages the spirit of feature reuse to increase network capacity by integrating the deep semantic cues with the rich spatial information, incurring minimal computational costs while enhancing feature representation. Based on this, we design a novel Lightweight Gated Center Head (LGCH) that utilizes gating mechanisms to decouple target-oriented coordinates from previously expanded features, which contain dense local discriminative information. Extensive experiments on three challenging UAV tracking benchmarks demonstrate that CGTrack achieves state-of-the-art performance while running fast. Code will be available at https://github.com/Nightwatch-Fox11/CGTrack.

[52] Achieving 3D Attention via Triplet Squeeze and Excitation Block

Maan Alhazmi,Abdulrahman Altahhan

Main category: cs.CV

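TL;DR: The paper proposes TripSE, a new attention mechanism that combines Triplet attention with Squeeze-and-Excitation, and validates it on the ResNet18, DenseNet, and ConvNeXt architectures, with the strongest results on ConvNeXt.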

Details

Motivation: CNN models still have potential in vision tasks, particularly in facial expression recognition (FER), and a new attention mechanism can further improve their performance. Method: Four TripSE variants are proposed and applied to the ResNet18, DenseNet, and ConvNeXt architectures to validate their versatility and effect. Result: TripSE boosts model performance; ConvNeXt with TripSE reaches 78.27% accuracy on the FER2013 dataset, a new record for this dataset. Conclusion: The TripSE mechanism applies broadly to CNN models, performs especially well with the ConvNeXt architecture, and offers a new solution for FER tasks. Abstract: The emergence of ConvNeXt and its variants has reaffirmed the conceptual and structural suitability of CNN-based models for vision tasks, re-establishing them as key players in image classification in general, and in facial expression recognition (FER) in particular. In this paper, we propose a new set of models that build on these advancements by incorporating a new set of attention mechanisms that combines Triplet attention with Squeeze-and-Excitation (TripSE) in four different variants. We demonstrate the effectiveness of these variants by applying them to the ResNet18, DenseNet and ConvNeXt architectures to validate their versatility and impact. Our study shows that incorporating a TripSE block in these CNN models boosts their performance, particularly for the ConvNeXt architecture, indicating its utility. We evaluate the proposed mechanisms and associated models across four datasets, namely CIFAR100, ImageNet, FER2013 and AffectNet, where ConvNeXt with TripSE achieves state-of-the-art results with an accuracy of 78.27% on the popular FER2013 dataset, a new feat for this dataset.
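
As background for the abstract above, the following is a minimal sketch of a standard Squeeze-and-Excitation block only. It does not implement TripSE: the abstract does not say how the four variants combine Triplet attention with SE, so that part is left out. The function name, the nested-list layout, and the bias-free weights are assumptions made for illustration.

```python
import math


def squeeze_excite(feature_map, w_reduce, w_expand):
    """Standard SE channel gating (illustrative, not the paper's TripSE).

    feature_map: list of C channels, each an H x W list of lists.
    w_reduce:    R x C weight matrix (bottleneck layer, no bias).
    w_expand:    C x R weight matrix (expansion layer, no bias).
    """
    # Squeeze: global average pooling yields one scalar per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation, step 1: bottleneck linear layer followed by ReLU.
    hidden = [
        max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w_reduce
    ]
    # Excitation, step 2: expansion layer followed by a sigmoid gate in (0, 1).
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w_expand
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    return [
        [[gate * value for value in row] for row in channel]
        for gate, channel in zip(gates, feature_map)
    ]
```

With all-zero weights every gate is sigmoid(0) = 0.5, so each channel is simply halved, which is a convenient sanity check.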

[53] Task-Adapter++: Task-specific Adaptation with Order-aware Alignment for Few-shot Action Recognition

Congqi Cao,Peiheng Han,Yueran Zhang,Yating Yu,Qinyi Lv,Lingtong Min,Yanning Zhang

Main category: cs.CV

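TL;DR: The paper proposes Task-Adapter++, a parameter-efficient dual adaptation method that addresses the problems of pre-trained models in few-shot action recognition and achieves state-of-the-art performance on 5 benchmarks.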

Details

Motivation: Current few-shot action recognition methods weaken the generalization of the pre-trained model, under-explore task-specific information, overlook semantic order information, and use cross-modal alignment that ignores temporal coupling. Method: The paper designs a task-specific adapter for the image encoder, uses large language models to generate sub-action descriptions, introduces semantic order adapters, and develops a fine-grained cross-modal alignment strategy. Result: The method achieves state-of-the-art performance on 5 benchmarks. Conclusion: Through dual adaptation and fine-grained alignment, Task-Adapter++ substantially improves few-shot action recognition. Abstract: Large-scale pre-trained models have achieved remarkable success in language and image tasks, leading an increasing number of studies to explore the application of pre-trained image models, such as CLIP, in the domain of few-shot action recognition (FSAR). However, current methods generally suffer from several problems: 1) Direct fine-tuning often undermines the generalization capability of the pre-trained model; 2) The exploration of task-specific information is insufficient in the visual tasks; 3) The semantic order information is typically overlooked during text modeling; 4) Existing cross-modal alignment techniques ignore the temporal coupling of multimodal information. To address these, we propose Task-Adapter++, a parameter-efficient dual adaptation method for both image and text encoders. Specifically, to make full use of the variations across different few-shot learning tasks, we design a task-specific adaptation for the image encoder so that the most discriminative information can be well noticed during feature extraction. Furthermore, we leverage large language models (LLMs) to generate detailed sequential sub-action descriptions for each action class, and introduce semantic order adapters into the text encoder to effectively model the sequential relationships between these sub-actions. Finally, we develop an innovative fine-grained cross-modal alignment strategy that actively maps visual features to reside in the same temporal stage as semantic descriptions. Extensive experiments fully demonstrate the effectiveness and superiority of the proposed method, which consistently achieves state-of-the-art performance on 5 benchmarks. The code is open-sourced at https://github.com/Jaulin-Bage/Task-Adapter-pp.

[54] From Pixels to Perception: Interpretable Predictions via Instance-wise Grouped Feature Selection

Moritz Vandenhirtz,Julia E. Vogt

Main category: cs.CV

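TL;DR: The paper proposes a method for interpretable predictions through instance-wise sparsification of input images, learning masks over semantic regions rather than individual pixels and determining the sparsity level dynamically.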

Details

Motivation: Understanding the decision-making process of machine learning models gives insight into the task, the data, and the reasons for model failures. Method: Masks are learned over semantic regions, and the sparsity level is determined dynamically for each instance. Result: On semi-synthetic and natural image datasets, the method produces more understandable predictions than existing benchmarks. Conclusion: The method generates predictions that align better with human understanding than prior techniques. Abstract: Understanding the decision-making process of machine learning models provides valuable insights into the task, the data, and the reasons behind a model's failures. In this work, we propose a method that performs inherently interpretable predictions through the instance-wise sparsification of input images. To align the sparsification with human perception, we learn the masking in the space of semantically meaningful pixel regions rather than at the pixel level. Additionally, we introduce an explicit way to dynamically determine the required level of sparsity for each instance. We show empirically on semi-synthetic and natural image datasets that our inherently interpretable classifier produces more meaningful, human-understandable predictions than state-of-the-art benchmarks.

[55] Document Image Rectification Bases on Self-Adaptive Multitask Fusion

Heng Li,Xiangping Wu,Qingcai Chen

Main category: cs.CV

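TL;DR: The paper proposes SalmRec, a self-adaptive multi-task fusion network for document image rectification that improves performance through inter-task feature aggregation and a gating mechanism.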

Details

Motivation: Existing methods overlook the complementary features and interactions between tasks, which limits document image rectification quality. Method: The SalmRec network includes an inter-task feature aggregation module and a gating mechanism to improve feature complementarity and reduce interference. Result: Rectification performance improves significantly on three benchmark datasets (DIR300, DocUNet, and DocReal). Conclusion: Through task synergy and feature optimization, SalmRec improves the accuracy and robustness of document image rectification. Abstract: Deformed document image rectification is essential for real-world document understanding tasks, such as layout analysis and text recognition. However, current multi-task methods -- such as background removal, 3D coordinate prediction, and text line segmentation -- often overlook the complementary features between tasks and their interactions. To address this gap, we propose a self-adaptive learnable multi-task fusion rectification network named SalmRec. This network incorporates an inter-task feature aggregation module that adaptively improves the perception of geometric distortions, enhances feature complementarity, and reduces negative interference. We also introduce a gating mechanism to balance features both within global tasks and between local tasks effectively. Experimental results on two English benchmarks (DIR300 and DocUNet) and one Chinese benchmark (DocReal) demonstrate that our method significantly improves rectification performance. Ablation studies further highlight the positive impact of different tasks on dewarping and the effectiveness of our proposed module.

[56] Towards Better Cephalometric Landmark Detection with Diffusion Data Generation

Dongqian Guo,Wencheng Han,Pang Lyu,Yuxi Zhou,Jianbing Shen

Main category: cs.CV

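TL;DR: The paper proposes a data generation method that produces diverse cephalometric X-ray images with annotations without human intervention, and uses large-scale vision detection models to improve detection accuracy.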

Details

Motivation: Cephalometric landmark detection is essential for orthodontic diagnosis and treatment planning, but data scarcity and the high cost of manual annotation limit deep learning methods. Method: New annotations are constructed from anatomical priors, a diffusion model generates realistic X-ray images for them, and medical text prompts control the attributes of the generated samples. Result: Training with the generated data improves performance, raising the Success Detection Rate (SDR) by 6.5% to 82.2%. Conclusion: By generating diverse data, the method addresses data scarcity and improves the accuracy of cephalometric landmark detection. Abstract: Cephalometric landmark detection is essential for orthodontic diagnostics and treatment planning. Nevertheless, the scarcity of samples in data collection and the extensive effort required for manual annotation have significantly impeded the availability of diverse datasets. This limitation has restricted the effectiveness of deep learning-based detection methods, particularly those based on large-scale vision models. To address these challenges, we have developed an innovative data generation method capable of producing diverse cephalometric X-ray images along with corresponding annotations without human intervention. To achieve this, our approach begins by constructing new cephalometric landmark annotations using anatomical priors. Then, we employ a diffusion-based generator to create realistic X-ray images that correspond closely with these annotations. To achieve precise control in producing samples with different attributes, we introduce a novel prompt cephalometric X-ray image dataset. This dataset includes real cephalometric X-ray images and detailed medical text prompts describing the images. By leveraging these detailed prompts, our method improves the generation process to control different styles and attributes. Facilitated by the large, diverse generated data, we introduce large-scale vision detection models into the cephalometric landmark detection task to improve accuracy. Experimental results demonstrate that training with the generated data substantially enhances the performance. Compared to methods without using the generated data, our approach improves the Success Detection Rate (SDR) by 6.5%, attaining a notable 82.2%. All code and data are available at: https://um-lab.github.io/cepha-generation
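
The Success Detection Rate reported above is commonly computed as the fraction of landmarks whose predicted position lies within a fixed radius of the ground truth. The sketch below assumes that definition; the 2 mm default threshold and the `mm_per_pixel` scale are illustrative assumptions, since the abstract does not state which threshold the 82.2% figure refers to.

```python
import math


def success_detection_rate(predicted, ground_truth, threshold_mm=2.0, mm_per_pixel=1.0):
    """Fraction of landmarks whose radial error is within threshold_mm.

    predicted, ground_truth: equal-length, non-empty lists of (x, y) pixel coordinates.
    threshold_mm and mm_per_pixel are assumed values, not taken from the paper.
    """
    hits = 0
    for (px, py), (gx, gy) in zip(predicted, ground_truth):
        # Radial (Euclidean) error, converted from pixels to millimetres.
        error_mm = math.hypot(px - gx, py - gy) * mm_per_pixel
        if error_mm <= threshold_mm:
            hits += 1
    return hits / len(predicted)
```

Papers in this area often report SDR at several radii, so the threshold is exposed as a parameter rather than fixed.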

[57] Noise-Consistent Siamese-Diffusion for Medical Image Synthesis and Segmentation

Kunpeng Qiu,Zhiqiang Gao,Zhiying Zhou,Mingjie Sun,Yongxin Guo

Main category: cs.CV

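TL;DR: The Siamese-Diffusion model uses a dual-component design (Mask-Diffusion and Image-Diffusion) and a noise consistency loss to improve the morphological fidelity of synthetic data for medical image segmentation, which improves segmentation model performance.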

Details

Motivation: Annotated data for medical image segmentation is scarce, and images generated by traditional diffusion models have low fidelity, which harms the robustness and reliability of segmentation models. Method: The paper proposes Siamese-Diffusion, comprising Mask-Diffusion and Image-Diffusion, with a noise consistency loss to improve morphological fidelity; only Mask-Diffusion is used at sampling time to preserve diversity and scalability. Result: Siamese-Diffusion improves SANet and UNet; on the Polyps dataset, for example, mDice and mIoU rise by 3.6% and 4.4%. Conclusion: Siamese-Diffusion addresses data scarcity and low morphological fidelity in medical image segmentation and provides more reliable synthetic data for segmentation models. Abstract: Deep learning has revolutionized medical image segmentation, yet its full potential remains constrained by the paucity of annotated datasets. While diffusion models have emerged as a promising approach for generating synthetic image-mask pairs to augment these datasets, they paradoxically suffer from the same data scarcity challenges they aim to mitigate. Traditional mask-only models frequently yield low-fidelity images due to their inability to adequately capture morphological intricacies, which can critically compromise the robustness and reliability of segmentation models. To alleviate this limitation, we introduce Siamese-Diffusion, a novel dual-component model comprising Mask-Diffusion and Image-Diffusion. During training, a Noise Consistency Loss is introduced between these components to enhance the morphological fidelity of Mask-Diffusion in the parameter space. During sampling, only Mask-Diffusion is used, ensuring diversity and scalability. Comprehensive experiments demonstrate the superiority of our method. Siamese-Diffusion boosts SANet's mDice and mIoU by 3.6% and 4.4% on the Polyps dataset, while UNet improves by 1.52% and 1.64% on ISIC2018. Code is available at GitHub.

[58] Camera-Only Bird's Eye View Perception: A Neural Approach to LiDAR-Free Environmental Mapping for Autonomous Vehicles

Anupkumar Bochare

Main category: cs.CV

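TL;DR: The paper proposes a camera-only perception framework that combines YOLOv11 object detection with DepthAnythingV2 depth estimation to achieve 360-degree scene understanding, with performance approaching LiDAR.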

Details

Motivation: Traditional autonomous driving perception systems depend on expensive LiDAR sensors; this paper explores a low-cost camera-only alternative. Method: The Lift-Splat-Shoot architecture is extended with YOLOv11 object detection and DepthAnythingV2 monocular depth estimation to generate BEV maps. Result: On the OpenLane-V2 and NuScenes datasets, the method reaches up to 85% road segmentation accuracy and 85-90% vehicle detection rates, with an average positional error of 1.2 meters. Conclusion: Deep learning can extract rich spatial information from camera inputs, enabling low-cost and accurate autonomous navigation. Abstract: Autonomous vehicle perception systems have traditionally relied on costly LiDAR sensors to generate precise environmental representations. In this paper, we propose a camera-only perception framework that produces Bird's Eye View (BEV) maps by extending the Lift-Splat-Shoot architecture. Our method combines YOLOv11-based object detection with DepthAnythingV2 monocular depth estimation across multi-camera inputs to achieve comprehensive 360-degree scene understanding. We evaluate our approach on the OpenLane-V2 and NuScenes datasets, achieving up to 85% road segmentation accuracy and 85-90% vehicle detection rates when compared against LiDAR ground truth, with average positional errors limited to 1.2 meters. These results highlight the potential of deep learning to extract rich spatial information using only camera inputs, enabling cost-efficient autonomous navigation without sacrificing accuracy.
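
To make the camera-to-BEV idea concrete, here is a deliberately simplified sketch of placing one detected pixel into a top-down grid using a pinhole camera model and a single depth value. It is not the paper's Lift-Splat-Shoot extension, which operates on image features and is not described in enough detail in the abstract to reproduce. The function name, grid size, cell size, and the assumption that depth is measured along the optical axis of a forward-facing camera at the grid origin are all hypothetical.

```python
def pixel_to_bev_cell(u, depth_m, fx, cx,
                      cell_size_m=0.5, grid_half_width_m=50.0, max_forward_m=100.0):
    """Map an image column and a depth estimate to a (row, col) BEV grid cell.

    u:       horizontal pixel coordinate of the detection (e.g. box centre).
    depth_m: estimated depth along the camera's optical axis, in metres.
    fx, cx:  pinhole intrinsics (focal length and principal point, in pixels).
    Returns None when the point falls outside the assumed grid extent.
    """
    # Pinhole back-projection: lateral offset grows linearly with depth.
    lateral_m = (u - cx) * depth_m / fx
    forward_m = depth_m
    if not (-grid_half_width_m <= lateral_m < grid_half_width_m):
        return None
    if not (0.0 <= forward_m < max_forward_m):
        return None
    # Rows index distance ahead of the camera; columns index left-to-right position.
    row = int(forward_m // cell_size_m)
    col = int((lateral_m + grid_half_width_m) // cell_size_m)
    return row, col
```

A multi-camera system would additionally apply each camera's extrinsic pose before binning, which this sketch omits.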

[59] Photovoltaic Defect Image Generator with Boundary Alignment Smoothing Constraint for Domain Shift Mitigation

Dongying Li,Binyi Su,Hua Zhang,Yong Li,Haiyong Chen

Main category: cs.CV

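TL;DR: PDIG is a photovoltaic defect image generator based on Stable Diffusion that improves generation quality through semantic concept embedding and an industrial style adaptor, outperforming existing methods.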

Details

Motivation: Defect detection for photovoltaic cells is critical to intelligent manufacturing, but scarce data makes model training difficult, and existing generative methods suffer from instability, limited diversity, and domain shift. Method: PDIG combines a Semantic Concept Embedding (SCE) module and a Lightweight Industrial Style Adaptor (LISA), and uses Text-Image Dual-Space Constraints (TIDSC) to improve generation quality. Result: PDIG improves FID by 19.16 points over the second-best method and significantly improves downstream defect detection. Conclusion: By combining Stable Diffusion with domain adaptation techniques, PDIG addresses the scarcity of photovoltaic defect data and generates diverse, high-quality images. Abstract: Accurate defect detection of photovoltaic (PV) cells is critical for ensuring quality and efficiency in intelligent PV manufacturing systems. However, the scarcity of rich defect data poses substantial challenges for effective model training. While existing methods have explored generative models to augment datasets, they often suffer from instability, limited diversity, and domain shifts. To address these issues, we propose PDIG, a Photovoltaic Defect Image Generator based on Stable Diffusion (SD). PDIG leverages the strong priors learned from large-scale datasets to enhance generation quality under limited data. Specifically, we introduce a Semantic Concept Embedding (SCE) module that incorporates text-conditioned priors to capture the relational concepts between defect types and their appearances. To further enrich the domain distribution, we design a Lightweight Industrial Style Adaptor (LISA), which injects industrial defect characteristics into the SD model through cross-disentangled attention. At inference, we propose a Text-Image Dual-Space Constraints (TIDSC) module, enforcing the quality of generated images via positional consistency and spatial smoothing alignment. Extensive experiments demonstrate that PDIG achieves superior realism and diversity compared to state-of-the-art methods. Specifically, our approach improves Frechet Inception Distance (FID) by 19.16 points over the second-best method and significantly enhances the performance of downstream defect detection tasks.
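
For readers unfamiliar with the metric, the standard definition of the Frechet Inception Distance is given below. The symbols are introduced here for illustration: $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the mean and covariance of Inception features for real and generated images, respectively. Lower is better, so an improvement of 19.16 points means a lower FID.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```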

[60] BrainSegDMlF: A Dynamic Fusion-enhanced SAM for Brain Lesion Segmentation

Hongming Wang,Yifeng Wu,Huimin Huang,Hongtao Wu,Jia-Xuan Jiang,Xiaodong Zhang,Hao Zheng,Xian Wu,Yefeng Zheng,Jinping Xu,Jing Cheng

Main category: cs.CV

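TL;DR: The paper proposes BrainSegDMLF, a fully automated brain lesion segmentation model that addresses the shortcomings of existing methods in multi-modal data integration, small-lesion detection, and automatic segmentation.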

Details

Motivation: Brain lesion segmentation is important and challenging in medical imaging, and existing methods are limited in multi-modal data use, sensitivity to small lesions, and automatic segmentation. Method: The BrainSegDMLF model includes a Dynamic Modal Interactive Fusion (DMIF) module, a layer-by-layer upsampling decoder, and automatic segmentation mask generation. Result: The model integrates multi-modal data more comprehensively, improves the detection of small lesions, and achieves fully automatic segmentation. Conclusion: BrainSegDMLF performs well on brain lesion segmentation and addresses the shortcomings of existing methods. Abstract: The segmentation of substantial brain lesions is a significant and challenging task in the field of medical image segmentation. Substantial brain lesions in brain imaging exhibit high heterogeneity, with indistinct boundaries between lesion regions and normal brain tissue. Small lesions in single slices are difficult to identify, making the accurate and reproducible segmentation of abnormal regions, as well as their feature description, highly complex. Existing methods have the following limitations: 1) They rely solely on single-modal information for learning, neglecting the multi-modal information commonly used in diagnosis. This hampers the ability to comprehensively acquire brain lesion information from multiple perspectives and prevents the effective integration and utilization of multi-modal data inputs, thereby limiting a holistic understanding of lesions. 2) They are constrained by the amount of data available, leading to low sensitivity to small lesions and difficulty in detecting subtle pathological changes. 3) Current SAM-based models rely on external prompts, which cannot achieve automatic segmentation and, to some extent, affect diagnostic efficiency. To address these issues, we have developed a large-scale fully automated segmentation model specifically designed for brain lesion segmentation, named BrainSegDMLF. This model has the following features: 1) Dynamic Modal Interactive Fusion (DMIF) module that processes and integrates multi-modal data during the encoding process, providing the SAM encoder with more comprehensive modal information. 2) Layer-by-Layer Upsampling Decoder, enabling the model to extract rich low-level and high-level features even with limited data, thereby detecting the presence of small lesions. 3) Automatic segmentation masks, allowing the model to generate lesion masks automatically without requiring manual prompts.

[61] MM-Skin: Enhancing Dermatology Vision-Language Model with an Image-Text Dataset Derived from Textbooks

Wenqi Zeng,Yuqi Sun,Chenxi Ma,Weimin Tan,Bo Yan

Main category: cs.CV

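TL;DR: The paper introduces the MM-Skin dataset and the SkinVL model, which address the lack of specialized text descriptions for dermatology vision-language models (VLMs) and perform well across multiple tasks.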

Details

Motivation: Current multimodal dermatology datasets lack specialized text descriptions, which limits the development of dermatology VLMs. Method: The authors build the MM-Skin dataset, covering 3 imaging modalities and nearly 10k high-quality image-text pairs, generate over 27k diverse VQA samples, and develop SkinVL, a dermatology-specific VLM, on top of it. Result: SkinVL outperforms general and medical VLMs on VQA, supervised fine-tuning, and zero-shot classification tasks. Conclusion: MM-Skin and SkinVL make an important contribution to the development of dermatology VLMs. Abstract: Medical vision-language models (VLMs) have shown promise as clinical assistants across various medical fields. However, specialized dermatology VLMs capable of delivering professional and detailed diagnostic analysis remain underdeveloped, primarily due to less specialized text descriptions in current dermatology multimodal datasets. To address this issue, we propose MM-Skin, the first large-scale multimodal dermatology dataset that encompasses 3 imaging modalities (clinical, dermoscopic, and pathological) and nearly 10k high-quality image-text pairs collected from professional textbooks. In addition, we generate over 27k diverse, instruction-following vision question answering (VQA) samples (9 times the size of the current largest dermatology VQA dataset). Leveraging public datasets and MM-Skin, we developed SkinVL, a dermatology-specific VLM designed for precise and nuanced skin disease interpretation. Comprehensive benchmark evaluations of SkinVL on VQA, supervised fine-tuning (SFT) and zero-shot classification tasks across 8 datasets reveal its exceptional performance for skin diseases in comparison to both general and medical VLM models. The introduction of MM-Skin and SkinVL offers a meaningful contribution to advancing the development of clinical dermatology VLM assistants. MM-Skin is available at https://github.com/ZwQ803/MM-Skin

[62] DiffLocks: Generating 3D Hair from a Single Image using Diffusion Models

Radu Alexandru Rosu,Keyu Wu,Yao Feng,Youyi Zheng,Michael J. Black

Main category: cs.CV

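TL;DR: DiffLocks is a new framework that reconstructs a wide variety of detailed 3D hairstyles directly from a single image, using a large synthetic hair dataset and a diffusion-transformer model.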

Details

Motivation: Existing methods are limited by scarce data and low-dimensional representations, struggle to reconstruct complex hairstyles such as curly hair, and require post-processing; DiffLocks aims to address these problems. Method: 1. Create a synthetic dataset containing 40K hairstyles; 2. Use a diffusion-transformer model to generate 3D strands from a single image without post-processing. Result: DiffLocks is the first to reconstruct highly curled hairstyles (such as afro hairstyles) from a single image, and it generalizes well. Conclusion: Through its data generation and model design, DiffLocks improves the diversity and detail of single-image 3D hair reconstruction. Abstract: We address the task of generating 3D hair geometry from a single image, which is challenging due to the diversity of hairstyles and the lack of paired image-to-3D hair data. Previous methods are primarily trained on synthetic data and cope with the limited amount of such data by using low-dimensional intermediate representations, such as guide strands and scalp-level embeddings, that require post-processing to decode, upsample, and add realism. These approaches fail to reconstruct detailed hair, struggle with curly hair, or are limited to handling only a few hairstyles. To overcome these limitations, we propose DiffLocks, a novel framework that enables detailed reconstruction of a wide variety of hairstyles directly from a single image. First, we address the lack of 3D hair data by automating the creation of the largest synthetic hair dataset to date, containing 40K hairstyles. Second, we leverage the synthetic hair dataset to learn an image-conditioned diffusion-transformer model that generates accurate 3D strands from a single frontal image. By using a pretrained image backbone, our method generalizes to in-the-wild images despite being trained only on synthetic data. Our diffusion model predicts a scalp texture map in which any point in the map contains the latent code for an individual hair strand. These codes are directly decoded to 3D strands without post-processing techniques. Representing individual strands, instead of guide strands, enables the transformer to model the detailed spatial structure of complex hairstyles. With this, DiffLocks can recover highly curled hair, like afro hairstyles, from a single image for the first time. Data and code are available at https://radualexandru.github.io/difflocks/

[63] Adapting a Segmentation Foundation Model for Medical Image Classification

Pengfei Gu,Haoteng Tang,Islam A. Ebeid,Jose A. Nunez,Fabian Vazquez,Diego Adame,Marcus Zhan,Huimin Li,Bin Fu,Danny Z. Chen

Main category: cs.CV

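TL;DR: The paper proposes a new framework that adapts the Segment Anything Model (SAM) to medical image classification by freezing the SAM encoder weights and introducing a Spatially Localized Channel Attention (SLCA) mechanism to improve classification performance.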

Details

Motivation: Although SAM performs well on image segmentation, its use in medical image classification is underexplored, so an effective adaptation method is needed. Method: The SAM encoder extracts features, the proposed SLCA mechanism computes spatially localized attention weights from them, and these weights are integrated into classification models. Result: Experiments on three public medical image classification datasets demonstrate the method's effectiveness and data efficiency. Conclusion: The framework successfully adapts SAM to medical image classification, and the SLCA mechanism improves classification performance. Abstract: Recent advancements in foundation models, such as the Segment Anything Model (SAM), have shown strong performance in various vision tasks, particularly image segmentation, due to their impressive zero-shot segmentation capabilities. However, effectively adapting such models for medical image classification is still a less explored topic. In this paper, we introduce a new framework to adapt SAM for medical image classification. First, we utilize the SAM image encoder as a feature extractor to capture segmentation-based features that convey important spatial and contextual details of the image, while freezing its weights to avoid unnecessary overhead during training. Next, we propose a novel Spatially Localized Channel Attention (SLCA) mechanism to compute spatially localized attention weights for the feature maps. The features extracted from SAM's image encoder are processed through SLCA to compute attention weights, which are then integrated into deep learning classification models to enhance their focus on spatially relevant or meaningful regions of the image, thus improving classification performance. Experimental results on three public medical image classification datasets demonstrate the effectiveness and data-efficiency of our approach.

[64] VIN-NBV: A View Introspection Network for Next-Best-View Selection for Resource-Efficient 3D Reconstruction

Noah Frahm,Dongxu Zhao,Andrea Dunn Beltran,Ron Alterovitz,Jan-Michael Frahm,Junier Oliva,Roni Sengupta

Main category: cs.CV

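TL;DR: The paper proposes a Next Best View (NBV) algorithm based on a View Introspection Network (VIN) that selects views by predicting their direct improvement to reconstruction quality, which improves 3D reconstruction.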

Details

Motivation: Existing NBV algorithms usually rely on prior scene knowledge or on maximizing coverage, but in scenes with complex geometry and self-occlusion, maximizing coverage does not directly improve reconstruction quality. Method: The VIN is designed to predict a view improvement score through 3D-aware featurization and imitation learning, and a greedy policy selects the best view. Result: Under limits on the number of acquisitions or on time in motion, VIN-NBV improves reconstruction quality by about 30% over a coverage maximization baseline. Conclusion: By predicting view improvement scores directly, VIN-NBV improves the efficiency and quality of 3D reconstruction in complex scenes. Abstract: Next Best View (NBV) algorithms aim to acquire an optimal set of images using minimal resources, time, or number of captures to enable efficient 3D reconstruction of a scene. Existing approaches often rely on prior scene knowledge or additional image captures and often develop policies that maximize coverage. Yet, for many real scenes with complex geometry and self-occlusions, coverage maximization does not lead to better reconstruction quality directly. In this paper, we propose the View Introspection Network (VIN), which is trained to predict the reconstruction quality improvement of views directly, and the VIN-NBV policy, a greedy sequential sampling-based policy in which, at each acquisition step, we sample multiple query views and choose the one with the highest VIN-predicted improvement score. We design the VIN to perform 3D-aware featurization of the reconstruction built from prior acquisitions, and for each query view create a feature that can be decoded into an improvement score. We then train the VIN using imitation learning to predict the reconstruction improvement score. We show that VIN-NBV improves reconstruction quality by ~30% over a coverage maximization baseline when operating with constraints on the number of acquisitions or the time in motion.
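
The greedy sequential policy described in the abstract can be sketched as a short loop. This is an illustration of the control flow only: the four callables (`sample_view`, `predict_improvement`, `acquire`) and the default `num_queries` are hypothetical placeholders, and `predict_improvement` merely stands in for the trained VIN, whose architecture is not reproduced here.

```python
import random


def greedy_nbv(reconstruction, sample_view, predict_improvement, acquire,
               budget, num_queries=16, seed=0):
    """Greedy next-best-view loop (sketch, not the authors' implementation).

    sample_view(rng)                 -> a candidate query view
    predict_improvement(recon, view) -> predicted gain in reconstruction quality
    acquire(recon, view)             -> updated reconstruction after capturing view
    budget                           -> number of acquisitions allowed
    """
    rng = random.Random(seed)
    selected = []
    for _ in range(budget):
        # Sample several candidate views for this acquisition step.
        queries = [sample_view(rng) for _ in range(num_queries)]
        # Score every candidate against the current reconstruction, keep the best.
        best = max(queries, key=lambda view: predict_improvement(reconstruction, view))
        # Capture the chosen view and fold it into the reconstruction.
        reconstruction = acquire(reconstruction, best)
        selected.append(best)
    return selected, reconstruction
```

Because scores are recomputed after each acquisition, a view that was valuable early can score low later once the reconstruction already covers what it would add.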

cs.GR [Back]

[65] MonetGPT: Solving Puzzles Enhances MLLMs' Image Retouching Skills

Niladri Shekhar Dutt,Duygu Ceylan,Niloy J. Mitra

Main category: cs.GR

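TL;DR: The paper explores how a multimodal large language model (MLLM) can provide professional-grade retouching suggestions and operation sequences for raw photographs, avoiding the tendency of generative editing to change object identity.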

Details

Motivation: Traditional retouching tools are conservative but professional, while generative editing is hard to control; MLLMs may combine the strengths of both to provide controllable, professional retouching. Method: The MLLM is trained to understand image processing operations by solving visual puzzles, and a reasoning dataset synthesized from expert-edited photos is used to fine-tune the model to plan and execute edit sequences. Result: The MLLM generates explainable sequences of retouching operations that preserve object details and resolution, with advantages over existing generative and traditional methods. Conclusion: MLLMs show potential for controllable, professional retouching and suit scenarios where identity must be preserved. Abstract: Retouching is an essential task in post-manipulation of raw photographs. Generative editing, guided by text or strokes, provides a new tool accessible to users but can easily change the identity of the original objects in unacceptable and unpredictable ways. In contrast, although traditional procedural edits, as commonly supported by photoediting tools (e.g., Gimp, Lightroom), are conservative, they are still preferred by professionals. Unfortunately, professional quality retouching involves many individual procedural editing operations that are challenging for most novices to plan. In this paper, we ask if a multimodal large language model (MLLM) can be taught to critique raw photographs, suggest suitable remedies, and finally realize them with a given set of pre-authored procedural image operations. We demonstrate that MLLMs can be first made aware of the underlying image processing operations, by training them to solve specially designed visual puzzles. Subsequently, such an operation-aware MLLM can both plan and propose edit sequences. To facilitate training, given a set of expert-edited photos, we synthesize a reasoning dataset by procedurally manipulating the expert edits and then grounding a pretrained LLM on the visual adjustments, to synthesize reasoning for finetuning. The proposed retouching operations are, by construction, understandable by the users, preserve object details and resolution, and can be optionally overridden. We evaluate our setup on a variety of test examples and show advantages, in terms of explainability and identity preservation, over existing generative and other procedural alternatives. Code, data, models, and supplementary results can be found via our project website at https://monetgpt.github.io.

[66] Anymate: A Dataset and Baselines for Learning 3D Object Rigging

Yufan Deng,Yuhao Zhang,Chen Geng,Shangzhe Wu,Jiajun Wu

Main category: cs.GR

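TL;DR: The paper presents the Anymate dataset, a large-scale collection of 230K 3D assets with expert-crafted rigging and skinning information, and a learning-based auto-rigging framework built on it that significantly outperforms existing methods.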

Details

Motivation: Traditional automatic rigging and skinning methods rely on geometric heuristics and struggle with complex geometry, while existing data-driven methods are limited by insufficient training data. Method: The paper proposes a learning-based auto-rigging framework with three sequential modules (joint prediction, connectivity prediction, and skinning weight prediction) and designs several architectures as baselines for comparison. Result: The proposed models significantly outperform existing methods and provide a basis for comparing future automatic rigging and skinning methods. Conclusion: The Anymate dataset and the proposed framework provide a new benchmark and toolset for automatic rigging and skinning. Abstract: Rigging and skinning are essential steps to create realistic 3D animations, often requiring significant expertise and manual effort. Traditional attempts at automating these processes rely heavily on geometric heuristics and often struggle with objects of complex geometry. Recent data-driven approaches show potential for better generality, but are often constrained by limited training data. We present the Anymate Dataset, a large-scale dataset of 230K 3D assets paired with expert-crafted rigging and skinning information -- 70 times larger than existing datasets. Using this dataset, we propose a learning-based auto-rigging framework with three sequential modules for joint, connectivity, and skinning weight prediction. We systematically design and experiment with various architectures as baselines for each module and conduct comprehensive evaluations on our dataset to compare their performance. Our models significantly outperform existing methods, providing a foundation for comparing future methods in automated rigging and skinning. Code and dataset can be found at https://anymate3d.github.io/.

cs.CL [Back]

[67] KG-HTC: Integrating Knowledge Graphs into LLMs for Effective Zero-shot Hierarchical Text Classification

Qianbo Zang,Christophe Zgrzendek,Igor Tchappi,Afshin Khadangi,Johannes Sedlmeir

Main category: cs.CL

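TL;DR: KG-HTC combines knowledge graphs with large language models, using a RAG approach to improve hierarchical text classification in the zero-shot setting.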

Details

Motivation: To address the lack of annotated data, large label spaces, and long-tail distributions in hierarchical text classification. Method: Relevant subgraphs are retrieved from knowledge graphs for the input text to strengthen the LLM's understanding of label semantics. Result: KG-HTC significantly outperforms baselines on the WoS, DBpedia, and Amazon datasets, especially at deeper label levels. Conclusion: Incorporating structured knowledge is an effective way to handle the challenges of hierarchical text classification. Abstract: Hierarchical Text Classification (HTC) involves assigning documents to labels organized within a taxonomy. Most previous research on HTC has focused on supervised methods. However, in real-world scenarios, employing supervised HTC can be challenging due to a lack of annotated data. Moreover, HTC often faces issues with large label spaces and long-tail distributions. In this work, we present Knowledge Graphs for zero-shot Hierarchical Text Classification (KG-HTC), which aims to address these challenges of HTC in applications by integrating knowledge graphs with Large Language Models (LLMs) to provide structured semantic context during classification. Our method retrieves relevant subgraphs from knowledge graphs related to the input text using a Retrieval-Augmented Generation (RAG) approach. KG-HTC helps LLMs understand label semantics at various hierarchy levels. We evaluate KG-HTC on three open-source HTC datasets: WoS, DBpedia, and Amazon. Our experimental results show that KG-HTC significantly outperforms three baselines in the strict zero-shot setting, particularly achieving substantial improvements at deeper levels of the hierarchy. This evaluation demonstrates the effectiveness of incorporating structured knowledge into LLMs to address HTC's challenges in large label spaces and long-tailed label distributions. Our code is available at: https://github.com/QianboZang/KG-HTC.

[68] Privacy-Preserving Transformers: SwiftKey's Differential Privacy Implementation

Abdelrahman Abouelenin,Mohamed Abdelrehim,Raffy Fahim,Amr Hendy,Mohamed Afify

Main category: cs.CL

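TL;DR: The paper trains a transformer with differential privacy (DP) for language modeling in SwiftKey; experiments balance model size, run-time speed, and accuracy, and show small gains in next-word prediction and accuracy over the existing GRU model.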

Details

Motivation: To study how a transformer model can improve language modeling while preserving privacy, and how to balance model size and run-time efficiency. Method: Two-stage training is used: a seed model is first trained on general data and then fine-tuned with differential privacy on typing data; the GPT2 architecture is scaled down to fit the size requirement. Result: The model outperforms the existing GRU model in next-word prediction and accuracy, with a manageable increase in memory and speed cost. Conclusion: Differential privacy and two-stage training yield an efficient, privacy-preserving transformer language model. Abstract: In this paper we train a transformer using differential privacy (DP) for language modeling in SwiftKey. We run multiple experiments to balance the trade-off between the model size, run-time speed and accuracy. We show that we get small and consistent gains in next-word prediction and accuracy, with a graceful increase in memory and speed cost, compared to the production GRU. This is obtained by scaling down a GPT2 architecture to fit the required size and by a two-stage training process that builds a seed model on general data and DP-finetunes it on typing data. The transformer is integrated using ONNX, offering both flexibility and efficiency.
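
The abstract does not say which DP mechanism is used for the fine-tuning stage. A common choice for DP training of neural networks is DP-SGD, so the sketch below shows one DP-SGD gradient step under that assumption: clip each example's gradient, sum, add Gaussian noise, and average. The function name and parameters are illustrative, and privacy accounting (tracking the resulting epsilon) is not shown.

```python
import math
import random


def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """One DP-SGD gradient estimate (assumed mechanism, not confirmed by the paper).

    per_example_grads: non-empty list of equal-length gradient vectors, one per example.
    clip_norm:         maximum L2 norm allowed for any single example's gradient.
    noise_multiplier:  Gaussian noise std is noise_multiplier * clip_norm.
    rng:               a random.Random instance used to draw the noise.
    """
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        # Clip: rescale the gradient so its L2 norm is at most clip_norm.
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            total[i] += g * scale
    # Noise: add Gaussian noise calibrated to the clipping bound, then average.
    sigma = noise_multiplier * clip_norm
    batch_size = len(per_example_grads)
    return [(t + rng.gauss(0.0, sigma)) / batch_size for t in total]
```

Clipping bounds how much any one user's data can move the model, and the noise masks what remains; larger noise multipliers give stronger privacy at some cost in accuracy.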

[69] Exploration of COVID-19 Discourse on Twitter: American Politician Edition

Cindy Kim,Daniela Puchall,Jiangyi Liang,Jiwon Kim

Main category: cs.CL

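TL;DR: By analyzing tweets from political figures of both major US parties, the paper studies partisan differences during the COVID-19 pandemic and finds that Democrats focus more on casualties and medical advice, while Republicans focus more on political responsibilities and media updates.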

Details

Motivation: To examine how the COVID-19 pandemic has affected politics worldwide and how new terminology and public opinion have further polarized partisan stances. Method: Tweets from political figures of both US parties are collected, and bag-of-words, bigram, and TF-IDF models are used to analyze keywords, topics, and sentiment. Result: Democrats are more concerned with pandemic casualties and medical recommendations, while Republicans are more focused on political responsibilities and media updates. Conclusion: The paper proposes a systematic, classifier-based approach to predict a tweet's political stance from its COVID-19 related terms. Abstract: The advent of the COVID-19 pandemic has undoubtedly affected the political scene worldwide, and the introduction of new terminology and public opinions regarding the virus has further polarized partisan stances. Using a collection of tweets gathered from leading American political figures online (Republican and Democratic), we explored the partisan differences in approach, response, and attitude towards handling the international crisis. Bag-of-words, bigram, and TF-IDF models were used to identify and analyze keywords, topics, and overall sentiments from each party. Results suggest that Democrats are more concerned with the casualties of the pandemic and give more medical precautions and recommendations to the public, whereas Republicans are more invested in political responsibilities such as keeping the public updated through media and carefully watching the progress of the virus. We propose a systematic approach to predict and distinguish a tweet's political stance (left or right leaning) based on its COVID-19 related terms using different classification algorithms on different language models.
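
As an illustration of two of the text representations named in the abstract, the sketch below builds bigrams and plain TF-IDF weights from tokenized documents. It uses the textbook TF-IDF formulation (term frequency times the log of inverse document frequency, no smoothing); the paper may have used a different variant or a library implementation, and the function names are chosen here for illustration.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Adjacent word pairs, joined with a space."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def tf_idf(documents):
    """Unsmoothed TF-IDF weights for a list of non-empty token lists.

    Returns one {term: weight} dict per document. A term that appears in
    every document gets weight 0, since log(N / N) = 0.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

Passing the output of `bigrams` to `tf_idf` in place of single tokens gives bigram-level weights, which is one way to surface party-specific phrases rather than isolated words.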

[70] Assessing Robustness to Spurious Correlations in Post-Training Language Models

Julia Shuieh,Prasann Singhal,Apaar Shanker,John Heyer,George Pu,Samuel Denton

Main category: cs.CL

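TL;DR: The paper evaluates three post-training algorithms (SFT, DPO, KTO) across tasks and spurious-correlation conditions, finding that the methods perform differently by task and that no single strategy is best.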

Details

Motivation: To study how supervised and preference-based fine-tuning behave under spurious correlations (such as biases or dataset artifacts), in order to improve the alignment of large language models. Method: Compare SFT, DPO, and KTO under different spuriousness conditions on mathematical reasoning, instruction-following, and document-grounded question answering tasks. Result: Preference-based methods (DPO/KTO) are relatively robust in mathematical reasoning, while SFT is stronger in complex, context-intensive tasks; performance often, though not always, degrades when spuriousness is high. Conclusion: The choice of post-training strategy should depend on the task type and the nature of the spurious correlations; there is no universally best method. Abstract: Supervised and preference-based fine-tuning techniques have become popular for aligning large language models (LLMs) with user intent and correctness criteria. However, real-world training data often exhibits spurious correlations -- arising from biases, dataset artifacts, or other "shortcut" features -- that can compromise a model's performance or generalization. In this paper, we systematically evaluate three post-training algorithms -- Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and KTO (Kahneman-Tversky Optimization) -- across a diverse set of synthetic tasks and spuriousness conditions. Our tasks span mathematical reasoning, constrained instruction-following, and document-grounded question answering. We vary the degree of spurious correlation (10% vs. 90%) and investigate two forms of artifacts: "Feature Ambiguity" and "Distributional Narrowness." Our results show that the models often but not always degrade under higher spuriousness. The preference-based methods (DPO/KTO) can demonstrate relative robustness in mathematical reasoning tasks. By contrast, SFT maintains stronger performance in complex, context-intensive tasks. These findings highlight that no single post-training strategy universally outperforms in all scenarios; the best choice depends on the type of target task and the nature of spurious correlations.

[71] TopicVD: A Topic-Based Dataset of Video-Guided Multimodal Machine Translation for Documentaries

Jinze Lv,Jian Chen,Zi Long,Xianghua Fu,Yin Chen

Main category: cs.CL

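TL;DR: The paper presents TopicVD, a dataset for video-supported multimodal machine translation (MMT), and an MMT model built on a cross-modal bidirectional attention module. Experiments show that visual information improves translation performance, but performance drops in out-of-domain scenarios, which calls for further work on domain adaptation.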

Details

Motivation: Existing MMT datasets consist mostly of static images or short video clips and cannot meet real-world needs such as documentary translation, so TopicVD was built to advance research. Method: Collect video-subtitle pairs from documentaries, categorize them by topic, and propose an MMT model based on a cross-modal bidirectional attention module. Result: Visual information improves translation performance, but out-of-domain performance is poor; global context effectively improves translation. Conclusion: The TopicVD dataset and the proposed MMT model open a new direction for video-supported MMT research, but domain adaptation methods need further work. Abstract: Most existing multimodal machine translation (MMT) datasets are predominantly composed of static images or short video clips, lacking extensive video data across diverse domains and topics. As a result, they fail to meet the demands of real-world MMT tasks, such as documentary translation. In this study, we developed TopicVD, a topic-based dataset for video-supported multimodal machine translation of documentaries, aiming to advance research in this field. We collected video-subtitle pairs from documentaries and categorized them into eight topics, such as economy and nature, to facilitate research on domain adaptation in video-guided MMT. Additionally, we preserved their contextual information to support research on leveraging the global context of documentaries in video-guided MMT. To better capture the shared semantics between text and video, we propose an MMT model based on a cross-modal bidirectional attention module. Extensive experiments on the TopicVD dataset demonstrate that visual information consistently improves the performance of the NMT model in documentary translation. However, the MMT model's performance significantly declines in out-of-domain scenarios, highlighting the need for effective domain adaptation methods. Additionally, experiments demonstrate that global context can effectively improve translation performance. Dataset and our implementations are available at https://github.com/JinzeLv/TopicVD

[72] Insertion Language Models: Sequence Generation with Arbitrary-Position Insertions

Dhruvesh Patel,Aishwarya Sahoo,Avinash Amballa,Tahira Naseem,Tim G. J. Rudner,Andrew McCallum

Main category: cs.CL

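TL;DR: Insertion Language Models (ILMs) insert tokens at arbitrary positions to address the weaknesses of autoregressive models (ARMs) and masked diffusion models (MDMs) on complex constraints and sequential dependencies, and outperform both on planning tasks.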

Details

Motivation: ARMs and MDMs are limited when sequences involve complex constraints or non-sequential dependencies, so a more flexible generation method is needed. Method: Proposes ILMs, which jointly select the insertion position and the vocabulary element and insert one token at a time, trained with a tailored network parameterization and a denoising objective. Result: ILMs outperform ARMs and MDMs on planning tasks, perform on par with ARMs in unconditional text generation, and are more flexible than MDMs in arbitrary-length text infilling. Conclusion: ILMs offer a flexible and efficient approach to sequence generation, particularly suited to complex constraints and non-sequential dependencies. Abstract: Autoregressive models (ARMs), which predict subsequent tokens one-by-one "from left to right," have achieved significant success across a wide range of sequence generation tasks. However, they struggle to accurately represent sequences that require satisfying sophisticated constraints or whose sequential dependencies are better addressed by out-of-order generation. Masked Diffusion Models (MDMs) address some of these limitations, but the process of unmasking multiple tokens simultaneously in MDMs can introduce incoherences, and MDMs cannot handle arbitrary infilling constraints when the number of tokens to be filled in is not known in advance. In this work, we introduce Insertion Language Models (ILMs), which learn to insert tokens at arbitrary positions in a sequence -- that is, they select jointly both the position and the vocabulary element to be inserted. By inserting tokens one at a time, ILMs can represent strong dependencies between tokens, and their ability to generate sequences in arbitrary order allows them to accurately model sequences where token dependencies do not follow a left-to-right sequential structure. To train ILMs, we propose a tailored network parameterization and use a simple denoising objective. Our empirical evaluation demonstrates that ILMs outperform both ARMs and MDMs on common planning tasks. Furthermore, we show that ILMs outperform MDMs and perform on par with ARMs in an unconditional text generation task while offering greater flexibility than MDMs in arbitrary-length text infilling.
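The core idea of choosing the position and the token jointly, one insertion at a time, can be sketched with a toy generation loop. Everything here is an assumption for illustration: `toy_score` is a hypothetical stand-in for the learned model (it simply rewards insertions that keep the sequence a subsequence of a known target), and stopping at the target length replaces whatever stopping mechanism the paper uses.

```python
def is_subsequence(seq, target):
    # True if seq can be obtained from target by deleting tokens.
    it = iter(target)
    return all(tok in it for tok in seq)

def toy_score(seq, pos, tok, target):
    """Hypothetical stand-in for a learned ILM: score 1.0 if inserting
    tok at pos keeps seq consistent with the target, else 0.0."""
    candidate = seq[:pos] + [tok] + seq[pos:]
    return 1.0 if is_subsequence(candidate, target) else 0.0

def generate(vocab, target, start):
    seq = list(start)
    while len(seq) < len(target):
        # Joint choice over every (position, token) pair.
        candidates = [(pos, tok) for pos in range(len(seq) + 1) for tok in vocab]
        pos, tok = max(candidates, key=lambda pt: toy_score(seq, pt[0], pt[1], target))
        if toy_score(seq, pos, tok, target) == 0.0:
            break  # no consistent insertion exists
        seq.insert(pos, tok)  # one token per step
    return seq

target = ["the", "cat", "sat", "down"]
result = generate(vocab=target, target=target, start=["cat", "down"])
```

Starting from the partial sequence `["cat", "down"]`, the loop fills in the missing tokens out of left-to-right order, which is the infilling behavior the abstract contrasts with ARMs and MDMs.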

[73] Sparse Attention Remapping with Clustering for Efficient LLM Decoding on PIM

Zehao Fan,Garrett Gagnon,Zhenyu Liu,Liu Liu

Main category: cs.CL

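TL;DR: STARC is a sparsity-optimized data mapping scheme for PIM architectures that clusters KV pairs by semantic similarity, substantially reducing latency and energy during LLM decoding while preserving model accuracy.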

Details

Motivation: During autoregressive decoding, Transformer models hit a performance bottleneck from memory-access pressure on the KV cache, and existing PIM architectures struggle with dynamic sparse access patterns. Method: STARC clusters KV pairs by semantic similarity and maps them to contiguous memory regions, then uses precomputed centroids for selective attention, reducing data movement and workload imbalance. Result: On an HBM-PIM system, STARC reduces attention-layer latency by 19%-31% and energy by 19%-27% relative to token-wise sparsity methods, with larger gains over full KV cache retrieval under a KV cache budget. Conclusion: STARC enables efficient long-context LLM inference on PIM architectures while remaining hardware-friendly and preserving model accuracy. Abstract: Transformer-based models are the foundation of modern machine learning, but their execution, particularly during autoregressive decoding in large language models (LLMs), places significant pressure on memory systems due to frequent memory accesses and growing key-value (KV) caches. This creates a bottleneck in memory bandwidth, especially as context lengths increase. Processing-in-memory (PIM) architectures are a promising solution, offering high internal bandwidth and compute parallelism near memory. However, current PIM designs are primarily optimized for dense attention and struggle with the dynamic, irregular access patterns introduced by modern KV cache sparsity techniques. Consequently, they suffer from workload imbalance, reducing throughput and resource utilization. In this work, we propose STARC, a novel sparsity-optimized data mapping scheme tailored specifically for efficient LLM decoding on PIM architectures. STARC clusters KV pairs by semantic similarity and maps them to contiguous memory regions aligned with PIM bank structures. During decoding, queries retrieve relevant tokens at cluster granularity by matching against precomputed centroids, enabling selective attention and parallel processing without frequent reclustering or data movement overhead. Experiments on the HBM-PIM system show that, compared to common token-wise sparsity methods, STARC reduces attention-layer latency by 19%--31% and energy consumption by 19%--27%. Under a KV cache budget of 1024, it achieves up to 54%--74% latency reduction and 45%--67% energy reduction compared to full KV cache retrieval. Meanwhile, STARC maintains model accuracy comparable to state-of-the-art sparse attention methods, demonstrating its effectiveness in enabling efficient and hardware-friendly long-context LLM inference on PIM architectures.
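Retrieval at cluster granularity, as described in the abstract, might look roughly like the sketch below. It assumes the KV pairs are already grouped into clusters (the clustering step and the mapping to PIM banks are not shown), uses a dot product to match the query against centroids, and applies ordinary scaled softmax attention to the selected clusters only. These choices are assumptions for illustration, not STARC's actual implementation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def centroid(vectors):
    # Component-wise mean; in the paper's setting this would be precomputed.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def select_clusters(query, centroids, top_k):
    # Rank clusters by query-centroid similarity and keep the top_k.
    ranked = sorted(range(len(centroids)),
                    key=lambda i: dot(query, centroids[i]), reverse=True)
    return ranked[:top_k]

def sparse_attention(query, clusters, top_k):
    """clusters: list of clusters, each a list of (key, value) pairs."""
    centroids = [centroid([k for k, _ in c]) for c in clusters]
    chosen = select_clusters(query, centroids, top_k)
    pairs = [kv for i in chosen for kv in clusters[i]]  # whole clusters only
    scores = [dot(query, k) / math.sqrt(len(query)) for k, _ in pairs]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
    z = sum(exps)
    dim = len(pairs[0][1])
    out = [sum(e / z * v[d] for e, (_, v) in zip(exps, pairs)) for d in range(dim)]
    return chosen, out

# Toy 2-D keys and values, invented for the example.
clusters = [
    [([1.0, 0.0], [1.0, 0.0]), ([0.9, 0.1], [1.0, 0.0])],
    [([0.0, 1.0], [0.0, 1.0]), ([0.1, 0.9], [0.0, 1.0])],
]
chosen, out = sparse_attention([1.0, 0.0], clusters, top_k=1)
```

Because whole clusters are fetched together, the tokens a query needs sit in one contiguous region, which is the property the abstract credits for avoiding irregular per-token accesses.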

[74] Tell Me Who Your Students Are: GPT Can Generate Valid Multiple-Choice Questions When Students' (Mis)Understanding Is Hinted

Machi Shimmei,Masaki Uto,Yuichiroh Matsubayashi,Kentaro Inui,Aditi Mallavarapu,Noboru Matsuda

Main category: cs.CL

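TL;DR: AnaQuest is a novel prompting technique for generating multiple-choice questions (MCQs) with a pre-trained large language model, where the choice items are sentence-level assertions about complex concepts. It integrates formative and summative assessment, and its questions are closer to human-crafted ones in difficulty and discrimination.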

Details

Motivation: To develop a technique that generates high-quality multiple-choice questions for educational assessment while reducing the burden of writing questions by hand. Method: AnaQuest analyzes students' answers to open-ended questions to generate correct and incorrect assertions, which form the MCQs. IRT is used to compare the item characteristics of questions from AnaQuest, ChatGPT, and human authors. Result: Experts rated questions from AnaQuest and ChatGPT to be as valid as human-crafted ones, but AnaQuest's questions were closer to human-crafted ones in difficulty and discrimination. Conclusion: AnaQuest is an effective MCQ generation technique; in particular, its incorrect options (foils) resemble human-crafted items more closely than those produced by ChatGPT. Abstract: The primary goal of this study is to develop and evaluate an innovative prompting technique, AnaQuest, for generating multiple-choice questions (MCQs) using a pre-trained large language model. In AnaQuest, the choice items are sentence-level assertions about complex concepts. The technique integrates formative and summative assessments. In the formative phase, students answer open-ended questions for target concepts in free text. For summative assessment, AnaQuest analyzes these responses to generate both correct and incorrect assertions. To evaluate the validity of the generated MCQs, Item Response Theory (IRT) was applied to compare item characteristics between MCQs generated by AnaQuest, a baseline ChatGPT prompt, and human-crafted items. An empirical study found that expert instructors rated MCQs generated by both AI models to be as valid as those created by human instructors. However, IRT-based analysis revealed that AnaQuest-generated questions - particularly those with incorrect assertions (foils) - more closely resembled human-crafted items in terms of difficulty and discrimination than those produced by ChatGPT.
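The item characteristics compared in the abstract, difficulty and discrimination, are the two parameters of the standard two-parameter logistic (2PL) IRT model. The sketch below shows that model's response function; whether the study used 2PL specifically is an assumption, since the abstract only names IRT, difficulty, and discrimination.

```python
import math

def p_correct(theta, a, b):
    """2PL item response function.

    theta: student ability
    a:     item discrimination (slope at theta == b)
    b:     item difficulty (ability at which P(correct) = 0.5)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A student whose ability equals the item difficulty has a 50% chance.
at_difficulty = p_correct(theta=0.0, a=1.5, b=0.0)

# A more discriminating item separates students above the difficulty more sharply.
sharp = p_correct(theta=1.0, a=2.0, b=0.0)
flat = p_correct(theta=1.0, a=0.5, b=0.0)
```

Comparing fitted `a` and `b` values across question sources is one way to judge how closely generated items behave like human-crafted ones.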

[75] Symbol-based entity marker highlighting for enhanced text mining in materials science with generative AI

Junhyeong Lee,Jong Min Yuk,Chan-Woo Lee

Main category: cs.CL

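TL;DR: Proposes a novel hybrid text-mining framework that combines the strengths of multi-step and direct methods to convert unstructured scientific text into structured data, and introduces an entity marker to improve entity recognition.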

Details

Motivation: Existing multi-step and direct methods each have limitations when applied on their own, so a more effective way to extract structured data from scientific literature is needed. Method: Proposes a hybrid framework that first converts raw text into entity-recognized text and then into structured form, and introduces an entity marker to improve performance. Result: On three benchmark datasets, entity recognition consistently outperforms previous approaches, with up to a 58% improvement in entity-level F1 score and up to an 83% improvement in relation-level F1 score compared to the direct approach. Conclusion: The hybrid framework combines the strengths of both methods and improves the quality of the resulting structured data. Abstract: The construction of experimental datasets is essential for expanding the scope of data-driven scientific discovery. Recent advances in natural language processing (NLP) have facilitated automatic extraction of structured data from unstructured scientific literature. While existing approaches (multi-step and direct methods) offer valuable capabilities, they also come with limitations when applied independently. Here, we propose a novel hybrid text-mining framework that integrates the advantages of both methods to convert unstructured scientific text into structured data. Our approach first transforms raw text into entity-recognized text, and subsequently into structured form. Furthermore, beyond the overall data structuring framework, we also enhance entity recognition performance by introducing an entity marker, a simple yet effective technique that uses symbolic annotations to highlight target entities. Specifically, our entity marker-based hybrid approach not only consistently outperforms previous entity recognition approaches across three benchmark datasets (MatScholar, SOFC, and SOFC slot NER) but also improves the quality of final structured data, yielding up to a 58% improvement in entity-level F1 score and up to 83% improvement in relation-level F1 score compared to the direct approach.

[76] Elastic Weight Consolidation for Full-Parameter Continual Pre-Training of Gemma2

Vytenis Šliogeris,Povilas Daniušis,Artūras Nakvosas

Main category: cs.CL

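TL;DR: An experiment applying elastic weight consolidation (EWC) during autoregressive pre-training of the 2-billion-parameter Gemma2 large language model, to mitigate catastrophic forgetting and improve learning of the new task.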

Details

Motivation: To investigate how EWC can support continual learning in a multilingual setting while avoiding catastrophic forgetting. Method: Apply EWC to the Gemma2 model, pre-train on the Lithuanian portion of the CulturaX dataset, and evaluate on several language understanding benchmarks. Result: EWC not only mitigates catastrophic forgetting but is also potentially beneficial for learning the new task. Conclusion: EWC has practical value for continual learning with large language models. Abstract: This technical report describes an experiment on autoregressive pre-training of the Gemma2 2 billion parameter large language model (LLM) with 10% on the Lithuanian language component of CulturaX from the point of view of continual learning. We apply elastic weight consolidation (EWC) to the full set of the model's parameters and investigate language understanding benchmarks, consisting of Arc, Belebele, Gsm8K, Hellaswag, MMLU, TruthfulQA, and Winogrande sets (both in English and Lithuanian versions), and perplexity benchmarks. We empirically demonstrate that EWC regularisation allows us not only to mitigate catastrophic forgetting effects but also that it is potentially beneficial for learning of the new task with LLMs.
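For readers unfamiliar with EWC, the regulariser in its standard form adds a quadratic penalty, weighted by a diagonal Fisher information estimate, that pulls each parameter toward its value after the earlier training stage. The sketch below shows that standard form on flat lists of floats; the numbers are invented, and how the report estimates the Fisher terms or sets the strength `lam` is not stated in the abstract.

```python
def ewc_penalty(params, anchor_params, fisher, lam):
    """Standard EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    params:        current parameters (theta)
    anchor_params: parameters after the earlier stage (theta*)
    fisher:        diagonal Fisher information estimates (F)
    lam:           regularisation strength (hypothetical value below)
    """
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )

def total_loss(task_loss, params, anchor_params, fisher, lam):
    # New-task loss plus the penalty for drifting on important parameters.
    return task_loss + ewc_penalty(params, anchor_params, fisher, lam)

# Toy values: the first parameter is important (high Fisher) but unchanged,
# the second is unimportant (low Fisher) and has moved by 1.0.
penalty = ewc_penalty([1.0, 2.0], [1.0, 1.0], [10.0, 0.5], lam=2.0)
loss = total_loss(3.0, [1.0, 2.0], [1.0, 1.0], [10.0, 0.5], lam=2.0)
```

Parameters with a large Fisher value are expensive to move, so the new-language training is steered toward directions that matter less for the original capabilities.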

[77] Summarisation of German Judgments in conjunction with a Class-based Evaluation

Bianca Steffes,Nils Torben Wiedemann,Alexander Gratz,Pamela Hochreither,Jana Elina Meyer,Katharina Luise Schilke

Main category: cs.CL

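TL;DR: Summaries of German judgments are generated automatically by fine-tuning a decoder-based large language model, with legal entity information added to improve results, but summary quality is not yet sufficient for practical use.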

Details

Motivation: To give legal experts an automated tool that eases the work of summarising long legal documents. Method: Fine-tune a decoder-based large language model, enriching the judgments with legal entity information before training. Result: Legal entities help the model find the relevant content, but the generated summaries do not yet meet the standard needed for practical use. Conclusion: The approach shows promise, but summary quality must improve further to meet practical needs. Abstract: The automated summarisation of long legal documents can be a great aid for legal experts in their daily work. We automatically create summaries (guiding principles) of German judgments by fine-tuning a decoder-based large language model. We enrich the judgments with information about legal entities before the training. For the evaluation of the created summaries, we define a set of evaluation classes which allows us to measure their language, pertinence, completeness and correctness. Our results show that employing legal entities helps the generative model to find the relevant content, but the quality of the created summaries is not yet sufficient for a use in practice.

[78] NeoQA: Evidence-based Question Answering with Generated News Events

Max Glockner,Xiang Jiang,Leonardo F. R. Ribeiro,Iryna Gurevych,Markus Dreyer

Main category: cs.CL

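TL;DR: NeoQA is a new benchmark for evaluating large language models (LLMs) in retrieval-augmented generation (RAG); it uses fictional news events and entities so that models cannot rely on pretraining knowledge.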

Details

Motivation: Existing benchmarks go stale quickly, making it hard to tell whether a model answers from retrieved evidence or from pretraining knowledge. Method: Build timelines and knowledge bases of fictional news events and generate question-answer pairs, so that models must answer from retrieved evidence alone. Result: LLMs struggle with subtle mismatches between questions and evidence, and tend to take shortcuts when evidence is missing. Conclusion: NeoQA exposes key limitations of LLMs in evidence-based reasoning and offers a new platform for future research. Abstract: Evaluating Retrieval-Augmented Generation (RAG) in large language models (LLMs) is challenging because benchmarks can quickly become stale. Questions initially requiring retrieval may become answerable from pretraining knowledge as newer models incorporate more recent information during pretraining, making it difficult to distinguish evidence-based reasoning from recall. We introduce NeoQA (News Events for Out-of-training Question Answering), a benchmark designed to address this issue. To construct NeoQA, we generated timelines and knowledge bases of fictional news events and entities along with news articles and Q&A pairs to prevent LLMs from leveraging pretraining knowledge, ensuring that no prior evidence exists in their training data. We propose our dataset as a new platform for evaluating evidence-based question answering, as it requires LLMs to generate responses exclusively from retrieved evidence and only when sufficient evidence is available. NeoQA enables controlled evaluation across various evidence scenarios, including cases with missing or misleading details. Our findings indicate that LLMs struggle to distinguish subtle mismatches between questions and evidence, and suffer from short-cut reasoning when key information required to answer a question is missing from the evidence, underscoring key limitations in evidence-based reasoning.

[79] Towards Developmentally Plausible Rewards: Communicative Success as a Learning Signal for Interactive Language Models

Lennart Stöpler,Rufat Asadli,Mitja Nikolaus,Ryan Cotterell,Alex Warstadt

Main category: cs.CL

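TL;DR: Proposes an interactive training method for language models that mimics child language acquisition through a reward in single-turn dialogue, though no improvement on linguistic evaluations has been observed yet.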

Details

Motivation: Inspired by child language acquisition, to explore the potential benefits of interactive learning for training language models. Method: Fine-tune language models with reinforcement learning, using a reward signal from a question-answering task that indirectly reflects grammaticality. Result: Constraints on the communication channel change speaker behavior, but linguistic evaluations do not improve. Conclusion: Changes to the task design and training configuration could let future work test the effect of interactive learning on language models more fully. Abstract: We propose a method for training language models in an interactive setting inspired by child language acquisition. In our setting, a speaker attempts to communicate some information to a listener in a single-turn dialogue and receives a reward if communicative success is achieved. Unlike earlier related work using image--caption data for interactive reference games, we operationalize communicative success in a more abstract language-only question--answering setting. First, we present a feasibility study demonstrating that our reward provides an indirect signal about grammaticality. Second, we conduct experiments using reinforcement learning to fine-tune language models. We observe that cognitively plausible constraints on the communication channel lead to interpretable changes in speaker behavior. However, we do not yet see improvements on linguistic evaluations from our training regime. We outline potential modifications to the task design and training configuration that could better position future work to use our methodology to observe the benefits of interaction on language learning in computational cognitive models.

[80] An Exploratory Analysis on the Explanatory Potential of Embedding-Based Measures of Semantic Transparency for Malay Word Recognition

M. Maziyah Mohamed,R. H. Baayen

Main category: cs.CL

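TL;DR: The study explores embedding-based measures of semantic transparency and their effect on reading, and assesses them with clustering analysis and linear discriminant analysis.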

Details

Motivation: Semantic transparency is crucial for word recognition, but its computational operationalization is still debated. The study explores embedding-based measures of semantic transparency and their effect on reading. Method: Run a t-SNE clustering analysis on 4,226 Malay prefixed words and derive five simple measures, then test their predictive power with Linear Discriminant Analysis and Generalized Additive Mixed Models. Result: All measures predict lexical decision latencies, and the measure based on the correlation with the centroid gives the best fit. Conclusion: Embedding-based measures of semantic transparency are effective, and the correlation between a word and its centroid is the best predictor. Abstract: Studies of morphological processing have shown that semantic transparency is crucial for word recognition. Its computational operationalization is still under discussion. Our primary objectives are to explore embedding-based measures of semantic transparency, and assess their impact on reading. First, we explored the geometry of complex words in semantic space. To do so, we conducted a t-distributed Stochastic Neighbor Embedding clustering analysis on 4,226 Malay prefixed words. Several clusters were observed for complex words varied by their prefix class. Then, we derived five simple measures, and investigated whether they were significant predictors of lexical decision latencies. Two sets of Linear Discriminant Analyses were run in which the prefix of a word is predicted from either word embeddings or shift vectors (i.e., a vector subtraction of the base word from the derived word). The accuracy with which the model predicts the prefix of a word indicates the degree of transparency of the prefix. Three further measures were obtained by comparing embeddings between each word and all other words containing the same prefix (i.e., centroid), between each word and the shift from their base word, and between each word and the predicted word of the Functional Representations of Affixes in Compositional Semantic Space model. In a series of Generalized Additive Mixed Models, all measures predicted decision latencies after accounting for word frequency, word length, and morphological family size. The model that included the correlation between each word and their centroid as a predictor provided the best fit to the data.
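Two of the quantities described in the abstract, the shift vector and the word-to-centroid correlation, can be sketched in a few lines. The embeddings below are invented three-dimensional toys with placeholder word names, the centroid is taken over the other words with the same prefix, and Pearson correlation is assumed as the comparison; the abstract does not specify the exact correlation measure.

```python
import math

def shift_vector(derived, base):
    # Derived-word embedding minus base-word embedding.
    return [d - b for d, b in zip(derived, base)]

def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def pearson(x, y):
    # Pearson correlation; undefined (division by zero) for constant vectors.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def centroid_correlation(word, same_prefix_embeddings):
    """Correlate a word's embedding with the centroid of all OTHER words
    that carry the same prefix (leave-one-out, an assumption)."""
    others = [v for w, v in same_prefix_embeddings.items() if w != word]
    return pearson(same_prefix_embeddings[word], centroid(others))

# Placeholder words and toy embeddings, not real Malay data.
prefix_group = {
    "word_a": [1.0, 2.0, 3.0],
    "word_b": [2.0, 4.0, 6.0],
    "word_c": [4.0, 8.0, 12.0],
}
shift = shift_vector([3.0, 1.0, 2.0], [1.0, 1.0, 1.0])
corr = centroid_correlation("word_a", prefix_group)
```

A high correlation means the word sits close to what is typical for its prefix, which is the intuition behind treating it as a transparency measure.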

[81] Exploring the Feasibility of Multilingual Grammatical Error Correction with a Single LLM up to 9B parameters: A Comparative Study of 17 Models

Dawid Wisniewski,Antoni Solarski,Artur Nowakowski

Main category: cs.CL

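TL;DR: The study evaluates 17 popular models on grammatical error correction in English, German, Italian, and Swedish, and finds Gemma 9B to be the best-performing model at present.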

Details

Motivation: To explore how well a single model performs on multilingual grammatical error correction. Method: Analyze the outputs of 17 models in four languages, focusing on reducing grammatical errors while keeping changes small. Result: Six models improve grammatical correctness in all four languages, with Gemma 9B performing best. Conclusion: The study identifies common problems among multilingual grammatical error correction models and recommends models suited to multilingual tasks. Abstract: Recent language models can successfully solve various language-related tasks, and many understand inputs stated in different languages. In this paper, we explore the performance of 17 popular models used to correct grammatical issues in texts stated in English, German, Italian, and Swedish when using a single model to correct texts in all those languages. We analyze the outputs generated by these models, focusing on decreasing the number of grammatical errors while keeping the changes small. The conclusions drawn help us understand what problems occur among those models and which models can be recommended for multilingual grammatical error correction tasks. We list six models that improve grammatical correctness in all four languages and show that Gemma 9B is currently the best performing one for the languages considered.

[82] Do Not Change Me: On Transferring Entities Without Modification in Neural Machine Translation -- a Multilingual Perspective

Dawid Wisniewski,Mikolaj Pokrywka,Zofia Rostek

Main category: cs.CL

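TL;DR: The paper examines how well current machine translation models preserve entities (such as URLs and IBAN numbers) during translation, analyzes translation quality across four languages, and proposes a new multilingual synthetic dataset.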

Details

Motivation: Machine translation models perform well in most scenarios but still have trouble with specific entities (such as URLs and IBAN numbers); the paper studies this problem. Method: Examine how well popular NMT models, including OPUS and Google Translate, preserve entities across four languages (English, German, Polish, Ukrainian), and analyze the causes of errors. Result: Some categories, such as emojis, pose significant challenges for the models; the paper also proposes a multilingual synthetic dataset of 36,000 sentences. Conclusion: The paper exposes weaknesses of machine translation models in entity preservation and provides a new dataset to support future research. Abstract: Current machine translation models provide us with high-quality outputs in most scenarios. However, they still face some specific problems, such as detecting which entities should not be changed during translation. In this paper, we explore the abilities of popular NMT models, including models from the OPUS project, Google Translate, MADLAD, and EuroLLM, to preserve entities such as URL addresses, IBAN numbers, or emails when producing translations between four languages: English, German, Polish, and Ukrainian. We investigate the quality of popular NMT models in terms of accuracy, discuss errors made by the models, and examine the reasons for errors. Our analysis highlights specific categories, such as emojis, that pose significant challenges for many models considered. In addition to the analysis, we propose a new multilingual synthetic dataset of 36,000 sentences that can help assess the quality of entity transfer across nine categories and four aforementioned languages.

[83] Unilogit: Robust Machine Unlearning for LLMs Using Uniform-Target Self-Distillation

Stefan Vasilev,Christian Herold,Baohao Liao,Seyyed Hadi Hashemi,Shahram Khadivi,Christof Monz

Main category: cs.CL

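TL;DR: Unilogit is a new self-distillation method for machine unlearning in large language models that dynamically adjusts target logits to reach a uniform probability, requires no additional hyperparameters, and outperforms existing methods.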

Details

Motivation: To address the challenge of selectively forgetting specific information while preserving overall model utility, as required by data privacy regulations such as GDPR. Method: Dynamically adjust target logits, using the current model's outputs to produce more accurate self-distillation targets and avoiding reliance on static hyperparameters or starting model outputs. Result: Strong results on public benchmarks and an in-house e-commerce dataset, balancing forget and retain objectives and outperforming methods such as NPO and UnDIAL. Conclusion: Unilogit is robust across a range of scenarios and is a practical, effective approach to machine unlearning. Abstract: This paper introduces Unilogit, a novel self-distillation method for machine unlearning in Large Language Models. Unilogit addresses the challenge of selectively forgetting specific information while maintaining overall model utility, a critical task in compliance with data privacy regulations like GDPR. Unlike prior methods that rely on static hyperparameters or starting model outputs, Unilogit dynamically adjusts target logits to achieve a uniform probability for the target token, leveraging the current model's outputs for more accurate self-distillation targets. This approach not only eliminates the need for additional hyperparameters but also enhances the model's ability to approximate the golden targets. Extensive experiments on public benchmarks and an in-house e-commerce dataset demonstrate Unilogit's superior performance in balancing forget and retain objectives, outperforming state-of-the-art methods such as NPO and UnDIAL. Our analysis further reveals Unilogit's robustness across various scenarios, highlighting its practical applicability and effectiveness in achieving efficacious machine unlearning.

[84] Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information

Joshua Harris,Fan Grayson,Felix Feldman,Timothy Laurence,Toby Nonnenmacher,Oliver Higgins,Leo Loman,Selina Patel,Thomas Finnie,Samuel Collins,Michael Borowitz

Main category: cs.CL

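TL;DR: The paper introduces PubHealthBench, a new benchmark for evaluating large language models (LLMs) on UK Government public health information, and finds that the latest private models do very well on multiple-choice questions but still fall short in free-form responses.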

Details

Motivation: As LLMs become widely used, understanding how accurate their knowledge is in specific domains such as public health becomes essential, especially where public health could be affected. Method: Build PubHealthBench, a benchmark of over 8000 questions created via an automated pipeline, and assess 24 LLMs on multiple-choice questions and free-form responses. Result: The latest private LLMs (such as GPT-4.5) score above 90% on multiple-choice questions, but no model scores above 75% in the free-form setup. Conclusion: Advanced LLMs do well on public health information, but additional safeguards are still needed for free-form responses. Abstract: As Large Language Models (LLMs) become widely accessible, a detailed understanding of their knowledge within specific domains becomes necessary for successful real world use. This is particularly critical in public health, where failure to retrieve relevant, accurate, and current information could significantly impact UK residents. However, currently little is known about LLM knowledge of UK Government public health information. To address this issue, this paper introduces a new benchmark, PubHealthBench, with over 8000 questions for evaluating LLMs' Multiple Choice Question Answering (MCQA) and free form responses to public health queries, created via an automated pipeline. We also release a new dataset of the extracted UK Government public health guidance documents used as source text for PubHealthBench. Assessing 24 LLMs on PubHealthBench we find the latest private LLMs (GPT-4.5, GPT-4.1 and o1) have a high degree of knowledge, achieving >90% in the MCQA setup, and outperform humans with cursory search engine use. However, in the free form setup we see lower performance with no model scoring >75%. Therefore, whilst there are promising signs that state of the art (SOTA) LLMs are an increasingly accurate source of public health information, additional safeguards or tools may still be needed when providing free form responses on public health topics.

[85] Attention on Multiword Expressions: A Multilingual Study of BERT-based Models with Regard to Idiomaticity and Microsyntax

Iuliia Zaitova,Vitalii Hirak,Badr M. Abdullah,Dietrich Klakow,Bernd Möbius,Tania Avgustinova

Main category: cs.CL

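TL;DR: The study analyzes the attention patterns of fine-tuned BERT-based encoder models towards multiword expressions (MWEs), comparing idioms with microsyntactic units (MSUs).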

Details

Motivation: To determine whether fine-tuning BERT-based models on specific tasks changes their attention to MWEs, and how that attention differs between semantic and syntactic tasks. Method: Using monolingual models and datasets in six Indo-European languages, examine attention scores to MWEs in pre-trained and fine-tuned BERT-based models. Result: Fine-tuning significantly affects how attention is allocated to MWEs: models fine-tuned on semantic tasks spread attention to idioms more evenly across layers, while models fine-tuned on syntactic tasks attend more to MSUs in the lower layers. Conclusion: Task type shapes how BERT-based models allocate attention to MWEs, with semantic and syntactic tasks producing different attention distributions. Abstract: This study analyzes the attention patterns of fine-tuned encoder-only models based on the BERT architecture (BERT-based models) towards two distinct types of Multiword Expressions (MWEs): idioms and microsyntactic units (MSUs). Idioms present challenges in semantic non-compositionality, whereas MSUs demonstrate unconventional syntactic behavior that does not conform to standard grammatical categorizations. We aim to understand whether fine-tuning BERT-based models on specific tasks influences their attention to MWEs, and how this attention differs between semantic and syntactic tasks. We examine attention scores to MWEs in both pre-trained and fine-tuned BERT-based models. We utilize monolingual models and datasets in six Indo-European languages - English, German, Dutch, Polish, Russian, and Ukrainian. Our results show that fine-tuning significantly influences how models allocate attention to MWEs. Specifically, models fine-tuned on semantic tasks tend to distribute attention to idiomatic expressions more evenly across layers. Models fine-tuned on syntactic tasks show an increase in attention to MSUs in the lower layers, corresponding with syntactic processing requirements.

[86] Multimodal Sentiment Analysis on CMU-MOSEI Dataset using Transformer-based Models

Jugal Gajjar,Kaustik Ranaware

Main category: cs.CL

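TL;DR: The paper presents a Transformer-based multimodal sentiment analysis method that integrates text, audio, and visual modalities through early fusion and reports strong results on the CMU-MOSEI dataset.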

Details

Motivation: To explore how effective early fusion is for multimodal sentiment analysis, and how well Transformer architectures capture cross-modal interactions. Method: Use BERT-based encoders to extract embeddings for the text, audio, and visual modalities, concatenate them for early-fusion classification, and train with the Adam optimizer and early stopping. Result: The model reaches 97.87% 7-class accuracy and a 0.9682 F1-score on the test set, with an MAE of 0.1060. Conclusion: Early fusion combined with a Transformer architecture works well for multimodal sentiment analysis; future work may compare fusion strategies or improve interpretability. Abstract: This project performs multimodal sentiment analysis using the CMU-MOSEI dataset, using transformer-based models with early fusion to integrate text, audio, and visual modalities. We employ BERT-based encoders for each modality, extracting embeddings that are concatenated before classification. The model achieves strong performance, with 97.87% 7-class accuracy and a 0.9682 F1-score on the test set, demonstrating the effectiveness of early fusion in capturing cross-modal interactions. The training utilized Adam optimization (lr=1e-4), dropout (0.3), and early stopping to ensure generalization and robustness. Results highlight the superiority of transformer architectures in modeling multimodal sentiment, with a low MAE (0.1060) indicating precise sentiment intensity prediction. Future work may compare fusion strategies or enhance interpretability. This approach utilizes multimodal learning by effectively combining linguistic, acoustic, and visual cues for sentiment analysis.
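Early fusion as described here amounts to concatenating the per-modality embeddings into one feature vector before a single classifier sees them. The sketch below shows only that step with a hand-written linear scorer; the embedding sizes, weights, and two-class setup are invented for illustration, and the encoders, dropout, and training loop from the abstract are omitted.

```python
def early_fusion(text_emb, audio_emb, visual_emb):
    # Concatenate modality embeddings into one joint feature vector.
    return list(text_emb) + list(audio_emb) + list(visual_emb)

def linear_scores(features, weights, bias):
    # One score per class: w . x + b
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]

def predict(features, weights, bias):
    scores = linear_scores(features, weights, bias)
    return max(range(len(scores)), key=scores.__getitem__)

# Toy embeddings (hypothetical sizes: 2 text, 1 audio, 2 visual dimensions).
fused = early_fusion([1.0, 0.0], [0.5], [2.0, 2.0])

# Hypothetical weights for a 2-class classifier over the 5 fused features.
weights = [
    [1.0, 0.0, 0.0, 0.0, 0.0],  # class 0 looks only at the text features
    [0.0, 0.0, 0.0, 1.0, 1.0],  # class 1 looks only at the visual features
]
label = predict(fused, weights, bias=[0.0, 0.0])
```

Because fusion happens before classification, a single set of weights can combine cues from all three modalities, in contrast to late fusion, where each modality would be classified separately and the decisions merged afterwards.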

[87] LLMs Get Lost In Multi-Turn Conversation

Philippe Laban,Hiroaki Hayashi,Yingbo Zhou,Jennifer Neville

Main category: cs.CL

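TL;DR: LLMs perform significantly worse in multi-turn conversations than in single-turn ones, with an average drop of 39%, mainly because they make early assumptions and rely too heavily on premature solution attempts.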

Details

Motivation: To study LLM performance in multi-turn conversation, filling the gap left by evaluations that focus on single-turn, fully-specified tasks. Method: Run large-scale simulation experiments comparing LLM performance in single-turn and multi-turn settings, and analyze 200,000+ simulated conversations. Result: Performance drops by 39% on average in multi-turn conversations, mainly because of early assumptions and over-reliance on premature solution attempts. Conclusion: LLMs easily get lost in multi-turn conversations and struggle to recover, so their conversation handling needs to improve. Abstract: Large Language Models (LLMs) are conversational interfaces. As such, LLMs have the potential to assist their users not only when they can fully specify the task at hand, but also to help them define, explore, and refine what they need through multi-turn conversational exchange. Although analysis of LLM conversation logs has confirmed that underspecification occurs frequently in user instructions, LLM evaluation has predominantly focused on the single-turn, fully-specified instruction setting. In this work, we perform large-scale simulation experiments to compare LLM performance in single- and multi-turn settings. Our experiments confirm that all the top open- and closed-weight LLMs we test exhibit significantly lower performance in multi-turn conversations than single-turn, with an average drop of 39% across six generation tasks. Analysis of 200,000+ simulated conversations decomposes the performance degradation into two components: a minor loss in aptitude and a significant increase in unreliability. We find that LLMs often make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely. In simpler terms, we discover that *when LLMs take a wrong turn in a conversation, they get lost and do not recover*.

[88] Towards Robust Few-Shot Text Classification Using Transformer Architectures and Dual Loss Strategies

Xu Han,Yumeng Sun,Weiqiang Huang,Hongye Zheng,Junliang Du

Main category: cs.CL

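TL;DR: This paper proposes a strategy that combines adaptive fine-tuning, contrastive learning, and regularization optimization to improve Transformer-based models on few-shot text classification. Experiments on the FewRel 2.0 dataset show strong results, especially in the 5-shot setting, where the approach captures text features more effectively and improves classification accuracy.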

Details

Motivation: Few-shot text classification is valuable in low-resource environments, but the standard cross-entropy loss struggles to learn to separate categories with fuzzy semantic boundaries or complex feature distributions. Method: Combine adaptive fine-tuning, contrastive learning, and regularization optimization, adding a contrastive loss and a regularization loss to strengthen generalization. Result: T5-small, DeBERTa-v3, and RoBERTa-base perform well on few-shot tasks, especially in the 5-shot setting. Conclusion: Transformer models or generative architectures with stronger self-attention mechanisms help improve the stability and accuracy of few-shot classification. Abstract: Few-shot text classification has important application value in low-resource environments. This paper proposes a strategy that combines adaptive fine-tuning, contrastive learning, and regularization optimization to improve the classification performance of Transformer-based models. Experiments on the FewRel 2.0 dataset show that T5-small, DeBERTa-v3, and RoBERTa-base perform well in few-shot tasks, especially in the 5-shot setting, which can more effectively capture text features and improve classification accuracy. The experiment also found that there are significant differences in the classification difficulty of different relationship categories. Some categories have fuzzy semantic boundaries or complex feature distributions, making it difficult for the standard cross entropy loss to learn the discriminative information required to distinguish categories. By introducing contrastive loss and regularization loss, the generalization ability of the model is enhanced, effectively alleviating the overfitting problem in few-shot environments. In addition, the research results show that the use of Transformer models or generative architectures with stronger self-attention mechanisms can help improve the stability and accuracy of few-shot classification.

[89] Can Prompting LLMs Unlock Hate Speech Detection across Languages? A Zero-shot and Few-shot Study

Faeze Ghorbanpour,Daryna Dementieva,Alexander Fraser

Main category: cs.CL

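TL;DR: Evaluates multilingual LLMs (such as LLaMA and Aya) on hate speech detection with zero-shot and few-shot prompting. They generalize better than fine-tuned encoder models on functional tests, but prompt design is critical to performance.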

Details

Motivation: Existing hate speech detection methods overlook linguistic diversity, and the capabilities of multilingual LLMs remain underexplored. Method: Evaluate LLM prompting techniques across eight non-English languages and compare them with fine-tuned encoder models. Result: Zero-shot and few-shot prompting lag behind fine-tuned encoders on most real-world evaluation sets but generalize better on functional tests; prompt design has a large effect on performance. Conclusion: Multilingual LLMs show promise for hate speech detection, but prompts need to be tailored to each language. Abstract: Despite growing interest in automated hate speech detection, most existing approaches overlook the linguistic diversity of online content. Multilingual instruction-tuned large language models such as LLaMA, Aya, Qwen, and BloomZ offer promising capabilities across languages, but their effectiveness in identifying hate speech through zero-shot and few-shot prompting remains underexplored. This work evaluates LLM prompting-based detection across eight non-English languages, utilizing several prompting techniques and comparing them to fine-tuned encoder models. We show that while zero-shot and few-shot prompting lag behind fine-tuned encoder models on most of the real-world evaluation sets, they achieve better generalization on functional tests for hate speech detection. Our study also reveals that prompt design plays a critical role, with each language often requiring customized prompting techniques to maximize performance.

[90] A Scaling Law for Token Efficiency in LLM Fine-Tuning Under Fixed Compute Budgets

Ryan Lagasse,Aidan Kiernans,Avijit Ghosh,Shiri Dori-Hacohen

Main category: cs.CL

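TL;DR: Proposes a scaling law for LLM fine-tuning that accounts for data composition, and finds that dataset volume (the number of examples and their average token length) significantly affects performance.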

Details

Motivation: Conventional approaches measure training data only by total token count, ignoring the role that data composition (number of examples and average length) plays in model performance. Method: Run experiments on the BRICC dataset and subsets of MMLU to test how data composition affects token efficiency, and adjust the scaling law accordingly. Result: Data composition significantly affects token efficiency, supporting a more refined scaling law for resource-constrained LLM fine-tuning. Conclusion: The study highlights the importance of data composition and offers more useful guidance for practical LLM fine-tuning. Abstract: We introduce a scaling law for fine-tuning large language models (LLMs) under fixed compute budgets that explicitly accounts for data composition. Conventional approaches measure training data solely by total tokens, yet the number of examples and their average token length, which we term *dataset volume*, play a decisive role in model performance. Our formulation is tuned following established procedures. Experiments on the BRICC dataset [salavati2024reducing] and subsets of the MMLU dataset [hendrycks2021measuringmassivemultitasklanguage], evaluated under multiple subsampling strategies, reveal that data composition significantly affects token efficiency. These results motivate refined scaling laws for practical LLM fine-tuning in resource-constrained settings.
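To make the notion of data composition concrete, the sketch below shows two hypothetical datasets that are indistinguishable by total token count yet differ in the number of examples and the average token length. The helper name and the token counts are illustrative assumptions; the scaling law itself is not reproduced here because the abstract does not state its functional form.

```python
def composition(token_counts):
    """Summarize a dataset by example count, average length, and total tokens."""
    n = len(token_counts)
    total = sum(token_counts)
    return {"examples": n, "avg_len": total / n, "total_tokens": total}


# Two hypothetical fine-tuning sets with the same token budget.
many_short = [50] * 400   # 400 examples of 50 tokens each
few_long = [400] * 50     # 50 examples of 400 tokens each

a = composition(many_short)
b = composition(few_long)
same_budget = a["total_tokens"] == b["total_tokens"]
```

A law that takes only `total_tokens` as input would predict the same performance for both sets, which is the limitation the paper targets.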

[91] Estimating Quality in Therapeutic Conversations: A Multi-Dimensional Natural Language Processing Framework

Alice Rueda,Argyrios Perivolaris,Niloy Roy,Dylan Weston,Sarmed Shaya,Zachary Cote,Martin Ivanov,Bazen G. Teferra,Yuqi Wu,Sirisha Rambhatla,Divya Sharma,Andrew Greenshaw,Rakesh Jetly,Yanbo Zhang,Bo Cao,Reza Samavi,Sridhar Krishnan,Venkat Bhat

Main category: cs.CL

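TL;DR: Proposes a multi-dimensional NLP framework that objectively classifies engagement quality in counseling sessions, reaching high accuracy with multiple feature types and classifiers and showing improved performance after data augmentation.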

Details

Motivation: Engagement between client and therapist is a key determinant of therapeutic success, so an objective way to assess engagement quality is needed. Method: Extract 42 features (conversational dynamics, semantic similarity, sentiment classification, question detection) from 253 counseling transcripts, then train and evaluate several classifiers (RF, Cat-Boost, SVM). Result: On balanced, non-augmented data, RF reached the highest accuracy and SVM the highest AUC; after data augmentation performance improved markedly, with RF reaching 88.9% accuracy and 94.6% AUC. Conclusion: The framework is scalable, supports future multimodal extensions, and can give clinicians real-time feedback to improve therapy quality. Abstract: Engagement between client and therapist is a critical determinant of therapeutic success. We propose a multi-dimensional natural language processing (NLP) framework that objectively classifies engagement quality in counseling sessions based on textual transcripts. Using 253 motivational interviewing transcripts (150 high-quality, 103 low-quality), we extracted 42 features across four domains: conversational dynamics, semantic similarity as topic alignment, sentiment classification, and question detection. Classifiers, including Random Forest (RF), Cat-Boost, and Support Vector Machines (SVM), were hyperparameter tuned and trained using a stratified 5-fold cross-validation and evaluated on a holdout test set. On balanced (non-augmented) data, RF achieved the highest classification accuracy (76.7%), and SVM achieved the highest AUC (85.4%). After SMOTE-Tomek augmentation, performance improved significantly: RF achieved up to 88.9% accuracy, 90.0% F1-score, and 94.6% AUC, while SVM reached 81.1% accuracy, 83.1% F1-score, and 93.6% AUC. The augmented data results reflect the potential of the framework in future larger-scale applications. Feature contribution revealed conversational dynamics and semantic similarity between clients and therapists were among the top contributors, led by words uttered by the client (mean and standard deviation). The framework was robust across the original and augmented datasets and demonstrated consistent improvements in F1 scores and recall. While currently text-based, the framework supports future multimodal extensions (e.g., vocal tone, facial affect) for more holistic assessments. This work introduces a scalable, data-driven method for evaluating engagement quality of the therapy session, offering clinicians real-time feedback to enhance the quality of both virtual and in-person therapeutic interactions.

[92] Query-driven Document-level Scientific Evidence Extraction from Biomedical Studies

Massimiliano Pronesti,Joao Bettencourt-Silva,Paul Flanagan,Alessandra Pascale,Oisin Redmond,Anya Belz,Yufang Hou

Main category: cs.CL

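TL;DR: Proposes URCA, a retrieval-augmented generation framework for extracting scientific evidence, evaluated on the new CochraneForest dataset, to address clinical questions with conflicting evidence.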

Details

Motivation: Extracting scientific evidence from biomedical studies is crucial for answering clinical research questions, especially when the evidence conflicts. Method: Build the CochraneForest dataset (202 annotated forest plots) and propose the URCA framework, which combines retrieval and generation. Result: URCA outperforms the best existing methods by up to 10.3% in F1 score, though the complexity of the dataset remains a challenge. Conclusion: CochraneForest is a challenging testbed for automated evidence synthesis systems; URCA performs well but leaves room for improvement. Abstract: Extracting scientific evidence from biomedical studies for clinical research questions (e.g., Does stem cell transplantation improve quality of life in patients with medically refractory Crohn's disease compared to placebo?) is a crucial step in synthesising biomedical evidence. In this paper, we focus on the task of document-level scientific evidence extraction for clinical questions with conflicting evidence. To support this task, we create a dataset called CochraneForest, leveraging forest plots from Cochrane systematic reviews. It comprises 202 annotated forest plots, associated clinical research questions, full texts of studies, and study-specific conclusions. Building on CochraneForest, we propose URCA (Uniform Retrieval Clustered Augmentation), a retrieval-augmented generation framework designed to tackle the unique challenges of evidence extraction. Our experiments show that URCA outperforms the best existing methods by up to 10.3% in F1 score on this task. However, the results also underscore the complexity of CochraneForest, establishing it as a challenging testbed for advancing automated evidence synthesis systems.

cs.DL [Back]

[93] Differentiating Emigration from Return Migration of Scholars Using Name-Based Nationality Detection Models

Faeze Ghorbanpour,Thiago Zordan Malaguth,Aliakbar Akbaritabar

Main category: cs.DL

TL;DR: Proposes name-based nationality detection to address missing nationality data in migration research, and validates it empirically.

Details

Motivation: Because of privacy concerns, most web and digital trace data lack nationality information, which creates challenges for migration research, in particular a left-censoring problem. Method: Train a character-based machine learning model for nationality classification on 2.6 million name-nationality pairs collected from Wikipedia. Result: The model reaches a weighted F1 score of 84% at the broadest granularity and 67% at the country level. The empirical study finds that name-based nationality identifies return migration more accurately than academic origin. Conclusion: The method addresses the left-censoring problem in migration research and matters most for countries with a diverse academic workforce. Abstract: Most web and digital trace data do not include information about an individual's nationality due to privacy concerns. The lack of data on nationality can create challenges for migration research. It can lead to a left-censoring issue since we are uncertain about the migrant's country of origin. Once we observe an emigration event, if we know the nationality, we can differentiate it from return migration. We propose methods to detect the nationality with the least available data, i.e., full names. We use the detected nationality in comparison with the country of academic origin, which is a common approach in studying the migration of researchers. We gathered 2.6 million unique name-nationality pairs from Wikipedia and categorized them into families of nationalities with three granularity levels to use as our training data. Using a character-based machine learning model, we achieved a weighted F1 score of 84% for the broadest and 67% for the most granular, country-level categorization. In our empirical study, we used the trained and tested model to assign nationality to 8+ million scholars' full names in Scopus data. Our results show that using the country of first publication as a proxy for nationality underestimates the size of return flows, especially for countries with a more diverse academic workforce, such as the USA, Australia, and Canada. We found that around 48% of emigration from the USA was return migration once we used the country of name origin, in contrast to 33% based on academic origin. In the most recent period, 79% of scholars whose affiliation has consistently changed from the USA to China, and are considered emigrants, have Chinese names in contrast to 41% with a Chinese academic origin. Our proposed methods for addressing left-censoring issues are beneficial for other research that uses digital trace data to study migration.
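The abstract says only that the classifier is character-based. One common way to feed names to such a model is to count character n-grams, sketched below; the boundary markers `^` and `$`, the default `n=3`, and the function names are illustrative assumptions rather than the authors' feature pipeline.

```python
from collections import Counter


def char_ngrams(name, n=3):
    """Lowercase the name, mark its boundaries, and slide a window of n characters."""
    padded = "^" + name.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngram_features(full_name, n=3):
    """Bag of character n-grams over every token of a full name."""
    counts = Counter()
    for token in full_name.split():
        counts.update(char_ngrams(token, n))
    return counts


# Hypothetical input; the resulting counts would be passed to any standard classifier.
features = ngram_features("Maria Rossi")
```

Working at the character level lets a model score names it has never seen, which is useful when only full names are available.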

physics.med-ph [Back]

[94] Towards order of magnitude X-ray dose reduction in breast cancer imaging using phase contrast and deep denoising

Ashkan Pakzad,Robert Turnbull,Simon J. Mutch,Thomas A. Leatham,Darren Lockie,Jane Fox,Beena Kumar,Daniel Häsermann,Christopher J. Hall,Anton Maksimenko,Benedicta D. Arhatari,Yakov I. Nesterets,Amir Entezam,Seyedamir T. Taba,Patrick C. Brennan,Timur E. Gureyev,Harry M. Quiney

Main category: physics.med-ph

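TL;DR: Combines deep learning-based image denoising with phase-contrast computed tomography (PCT) to substantially reduce radiation dose while preserving breast cancer imaging quality.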

Details

Motivation: Current breast cancer screening methods (X-ray mammography and digital breast tomosynthesis) have limited sensitivity and specificity and cause patient discomfort; breast CT is a promising alternative, but its dose requirements limit its use. Method: Apply deep learning-based denoising to PCT images of full fresh mastectomy samples and assess the achievable dose reduction and the resulting image quality. Result: The radiation dose can be reduced by a factor of 16 or more without loss of image quality (spatial resolution and contrast-to-noise ratio), as also confirmed by expert observers. Conclusion: The technique lays the groundwork for live patient PCT breast cancer imaging at synchrotron facilities and has clinical potential. Abstract: Breast cancer is the most frequently diagnosed human cancer in the United States at present. Early detection is crucial for its successful treatment. X-ray mammography and digital breast tomosynthesis are currently the main methods for breast cancer screening. However, both have known limitations in terms of their sensitivity and specificity to breast cancers, while also frequently causing patient discomfort due to the requirement for breast compression. Breast computed tomography is a promising alternative; however, to obtain high-quality images, the X-ray dose needs to be sufficiently high. As the breast is highly radiosensitive, dose reduction is particularly important. Phase-contrast computed tomography (PCT) has been shown to produce higher-quality images at lower doses and has no need for breast compression. It is demonstrated in the present study that, when imaging full fresh mastectomy samples with PCT, deep learning-based image denoising can further reduce the radiation dose by a factor of 16 or more, without any loss of image quality. The image quality has been assessed both in terms of objective metrics, such as spatial resolution and contrast-to-noise ratio, as well as in an observer study by experienced medical imaging specialists and radiologists. This work was carried out in preparation for live patient PCT breast cancer imaging, initially at specialized synchrotron facilities.
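A rough sketch of why a 16-fold dose reduction is demanding. It assumes quantum-limited (Poisson) noise, where relative noise scales with the inverse square root of the dose, and uses one common definition of contrast-to-noise ratio; the paper's exact metric definitions are not given in the abstract, and the pixel values below are made up.

```python
import math
import statistics


def cnr(roi, background):
    """Contrast-to-noise ratio: mean difference divided by background noise."""
    contrast = abs(statistics.fmean(roi) - statistics.fmean(background))
    return contrast / statistics.pstdev(background)


def noise_increase(dose_reduction_factor):
    """Under the Poisson assumption, noise grows with the square root of the dose cut."""
    return math.sqrt(dose_reduction_factor)


# Hypothetical pixel values from a feature and from uniform background tissue.
example_cnr = cnr([10.0, 10.0], [0.0, 2.0])

# A 16x dose cut implies about 4x more noise that the denoiser must remove.
factor = noise_increase(16)
```

Under that assumption, keeping the contrast-to-noise ratio unchanged at one sixteenth of the dose means the denoiser has to suppress roughly a fourfold increase in noise without blurring fine detail.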

quant-ph [Back]

[95] Efficient Quantum Convolutional Neural Networks for Image Classification: Overcoming Hardware Constraints

Peter Röseler,Oliver Schaudt,Helmut Berg,Christian Bauckhage,Matthias Koch

Main category: quant-ph

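TL;DR: Quantum convolutional neural networks (QCNNs), with reduced input dimensionality and an automated design framework, achieve efficient image classification on NISQ devices with accuracy well above the traditional baseline.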

Details

Motivation: Quantum computing opens new opportunities for neural network architectures, but the hardware limits of current NISQ devices hinder practical use of QCNNs. Method: Introduce an encoding scheme that reduces input dimensionality, and design an automated framework for selecting the building blocks of QCNNs, parameterized quantum circuits (PQCs). Result: A 49-qubit QCNN processes MNIST images directly and reaches 96.08% classification accuracy, compared with 71.74% for the traditional baseline. Conclusion: The study supports the potential of quantum computing for image classification and points to directions for future quantum neural networks. Abstract: While classical convolutional neural networks (CNNs) have revolutionized image classification, the emergence of quantum computing presents new opportunities for enhancing neural network architectures. Quantum CNNs (QCNNs) leverage quantum mechanical properties and hold potential to outperform classical approaches. However, their implementation on current noisy intermediate-scale quantum (NISQ) devices remains challenging due to hardware limitations. In our research, we address this challenge by introducing an encoding scheme that significantly reduces the input dimensionality. We demonstrate that a primitive QCNN architecture with 49 qubits is sufficient to directly process 28 × 28 pixel MNIST images, eliminating the need for classical dimensionality reduction pre-processing. Additionally, we propose an automated framework based on expressibility, entanglement, and complexity characteristics to identify the building blocks of QCNNs, parameterized quantum circuits (PQCs). Our approach demonstrates advantages in accuracy and convergence speed with a similar parameter count compared to both hybrid QCNNs and classical CNNs. We validated our experiments on IBM's Heron r2 quantum processor, achieving 96.08% classification accuracy, surpassing the 71.74% benchmark of traditional approaches under identical training conditions. These results represent one of the first implementations of image classifications on real quantum hardware and validate the potential of quantum computing in this area.

cs.RO [Back]

[96] Adaptive Stress Testing Black-Box LLM Planners

Neeloy Chakraborty,John Pohovey,Melkior Ornik,Katherine Driggs-Campbell

Main category: cs.RO

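TL;DR: Proposes a method that uses Adaptive Stress Testing (AST) with Monte-Carlo Tree Search (MCTS) to detect hallucinations in large language models (LLMs), improving reliability in safety-critical scenarios.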

Details

Motivation: LLMs perform well on decision-making tasks, but hallucinations can produce unsafe outputs, so such failures must be detected in safety-critical scenarios. Method: Run a manual case study of how different kinds of perturbation affect LLMs, then propose a method that combines AST with MCTS to search the space of prompt perturbations efficiently. Result: Offline analyses can be used to generate prompts that influence model uncertainty and to inform real-time trust assessments. Conclusion: The method finds scenarios and prompts that drive LLMs to high uncertainty, supporting more reliable use in safety-critical tasks. Abstract: Large language models (LLMs) have recently demonstrated success in generalizing across decision-making tasks including planning, control and prediction, but their tendency to hallucinate unsafe and undesired outputs poses risks. We argue that detecting such failures is necessary, especially in safety-critical scenarios. Existing black-box methods often detect hallucinations by identifying inconsistencies across multiple samples. Many of these approaches typically introduce prompt perturbations like randomizing detail order or generating adversarial inputs, with the intuition that a confident model should produce stable outputs. We first perform a manual case study showing that other forms of perturbations (e.g., adding noise, removing sensor details) cause LLMs to hallucinate in a driving environment. We then propose a novel method for efficiently searching the space of prompt perturbations using Adaptive Stress Testing (AST) with Monte-Carlo Tree Search (MCTS). Our AST formulation enables discovery of scenarios and prompts that cause language models to act with high uncertainty. By generating MCTS prompt perturbation trees across diverse scenarios, we show that offline analyses can be used at runtime to automatically generate prompts that influence model uncertainty, and to inform real-time trust assessments of an LLM.
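The selection step of an MCTS over prompt perturbations can be sketched with the standard UCB1 rule. This is a generic illustration, not the authors' AST formulation: treating the accumulated reward as a measure of model uncertainty, the perturbation names, and the exploration constant are all assumptions.

```python
import math


def ucb1(total_reward, visits, parent_visits, c=1.4):
    """Standard UCB1 score: average reward plus an exploration bonus."""
    if visits == 0:
        return float("inf")  # always try an unvisited perturbation first
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select_perturbation(children, c=1.4):
    """Pick the child perturbation with the highest UCB1 score."""
    parent_visits = sum(ch["visits"] for ch in children.values()) or 1
    return max(
        children,
        key=lambda name: ucb1(children[name]["reward"], children[name]["visits"], parent_visits, c),
    )


# Hypothetical statistics; reward stands in for observed model uncertainty.
children = {
    "add_noise": {"reward": 3.0, "visits": 5},
    "remove_sensor_detail": {"reward": 4.5, "visits": 5},
    "shuffle_detail_order": {"reward": 0.0, "visits": 0},
}
next_perturbation = select_perturbation(children)
```

The exploration bonus is what keeps the search from repeatedly testing the first perturbation that happened to raise uncertainty.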

[97] Learning to Drive Anywhere with Model-Based Reannotation

Noriaki Hirose,Lydia Ignatova,Kyle Stachowicz,Catherine Glossop,Sergey Levine,Dhruv Shah

Main category: cs.RO

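TL;DR: Proposes the MBRA framework, which uses a model to relabel passively collected data, and trains the LogoNav navigation policy to navigate over distances exceeding 300 meters.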

Details

Motivation: Address the limited generalization of visual navigation policies by using large-scale passive data to compensate for the scarcity of high-quality data. Method: MBRA relabels passive data with a short-horizon model and uses the result to train LogoNav, a long-horizon navigation policy. Result: LogoNav navigates over more than 300 meters in previously unseen environments, and its generalization is validated across multiple cities and settings. Conclusion: Combining MBRA and LogoNav with passive data substantially improves the generalization and real-world performance of navigation policies. Abstract: Developing broadly generalizable visual navigation policies for robots is a significant challenge, primarily constrained by the availability of large-scale, diverse training data. While curated datasets collected by researchers offer high quality, their limited size restricts policy generalization. To overcome this, we explore leveraging abundant, passively collected data sources, including large volumes of crowd-sourced teleoperation data and unlabeled YouTube videos, despite their potential for lower quality or missing action labels. We propose Model-Based ReAnnotation (MBRA), a framework that utilizes a learned short-horizon, model-based expert model to relabel or generate high-quality actions for these passive datasets. This relabeled data is then distilled into LogoNav, a long-horizon navigation policy conditioned on visual goals or GPS waypoints. We demonstrate that LogoNav, trained using MBRA-processed data, achieves state-of-the-art performance, enabling robust navigation over distances exceeding 300 meters in previously unseen indoor and outdoor environments. Our extensive real-world evaluations, conducted across a fleet of robots (including quadrupeds) in six cities on three continents, validate the policy's ability to generalize and navigate effectively even amidst pedestrians in crowded settings.

[98] 3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks

Vineet Bhat,Yu-Hsiang Lan,Prashanth Krishnamurthy,Ramesh Karri,Farshad Khorrami

Main category: cs.RO

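TL;DR: Proposes 3D-CAVLA, an improved vision-language-action model that integrates chain-of-thought reasoning, depth perception, and task-oriented region of interest detection to improve scene context awareness. It achieves a 98.1% average success rate in the LIBERO simulation environment and performs well on zero-shot tasks.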

Details

Motivation: Robotic 3D manipulation requires turning semantic and visual perception into low-level control, and existing vision-language models (VLMs) leave room for improvement in scene context awareness. Method: Propose 3D-CAVLA, which integrates chain-of-thought reasoning, depth perception, and task-oriented region of interest detection to strengthen scene context awareness. Result: In the LIBERO simulation environment the average success rate is 98.1%, with an absolute improvement of 8.8% on unseen (zero-shot) tasks. Conclusion: By strengthening scene awareness, 3D-CAVLA improves the performance and adaptability of robotic manipulation, particularly on zero-shot tasks. Abstract: Robotic manipulation in 3D requires learning an N-degree-of-freedom joint space trajectory of a robot manipulator. Robots must possess semantic and visual perception abilities to transform real-world mappings of their workspace into the low-level control necessary for object manipulation. Recent work has demonstrated the capabilities of fine-tuning large Vision-Language Models (VLMs) to learn the mapping between RGB images, language instructions, and joint space control. These models typically take as input RGB images of the workspace and language instructions, and are trained on large datasets of teleoperated robot demonstrations. In this work, we explore methods to improve the scene context awareness of a popular recent Vision-Language-Action model by integrating chain-of-thought reasoning, depth perception, and task-oriented region of interest detection. Our experiments in the LIBERO simulation environment show that our proposed model, 3D-CAVLA, improves the success rate across various LIBERO task suites, achieving an average success rate of 98.1%. We also evaluate the zero-shot capabilities of our method, demonstrating that 3D scene awareness leads to robust learning and adaptation for completely unseen tasks. 3D-CAVLA achieves an absolute improvement of 8.8% on unseen tasks. We will open-source our code and the unseen tasks dataset to promote community-driven research here: https://3d-cavla.github.io

[99] TREND: Tri-teaching for Robust Preference-based Reinforcement Learning with Demonstrations

Shuaiyi Huang,Mara Levy,Anubhav Gupta,Daniel Ekpo,Ruijie Zheng,Abhinav Shrivastava

Main category: cs.RO

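TL;DR: The TREND framework combines a few expert demonstrations with a tri-teaching strategy to reduce noise in preference feedback and improve preference-based reinforcement learning.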

Details

Motivation: Preference feedback is often noisy, which hurts learning, so an effective noise mitigation method is needed. Method: TREND trains three reward models that teach one another, each passing its small-loss preference pairs to a peer for parameter updates, and needs only one to three expert demonstrations. Result: On robotic manipulation tasks it reaches up to 90% success rates even with noise levels as high as 40%. Conclusion: TREND is robust to noisy feedback and improves preference-based reinforcement learning. Abstract: Preference feedback collected by human or VLM annotators is often noisy, presenting a significant challenge for preference-based reinforcement learning that relies on accurate preference labels. To address this challenge, we propose TREND, a novel framework that integrates few-shot expert demonstrations with a tri-teaching strategy for effective noise mitigation. Our method trains three reward models simultaneously, where each model views its small-loss preference pairs as useful knowledge and teaches such useful pairs to its peer network for updating the parameters. Remarkably, our approach requires as few as one to three expert demonstrations to achieve high performance. We evaluate TREND on various robotic manipulation tasks, achieving up to 90% success rates even with noise levels as high as 40%, highlighting its effective robustness in handling noisy preference feedback. Project page: https://shuaiyihuang.github.io/publications/TREND.
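The small-loss selection at the heart of tri-teaching can be sketched as follows. The abstract states that each reward model teaches its small-loss preference pairs to a peer; the ring-shaped peer assignment (model m teaches model m+1), the `keep_ratio` value, and the function names are assumptions made for illustration.

```python
def small_loss_indices(losses, keep_ratio):
    """Indices of the preference pairs with the smallest loss under one model."""
    k = max(1, int(len(losses) * keep_ratio))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(ranked[:k])


def tri_teaching_assignments(losses_per_model, keep_ratio=0.5):
    """Map each peer model to the pairs selected for it by its teacher (assumed ring order)."""
    n = len(losses_per_model)
    return {
        (m + 1) % n: small_loss_indices(losses_per_model[m], keep_ratio)
        for m in range(n)
    }


# Hypothetical per-pair losses from three reward models on four preference pairs.
losses = [
    [0.1, 0.9, 0.2, 0.8],
    [0.2, 0.7, 0.9, 0.1],
    [0.9, 0.1, 0.3, 0.8],
]
assignments = tri_teaching_assignments(losses)
```

The intuition is that pairs with corrupted labels tend to incur high loss, so filtering through a different model's low-loss set keeps one model's mistakes from reinforcing themselves.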

[100] Let Humanoids Hike! Integrative Skill Development on Complex Trails

Kwan-Yee Lin,Stella X. Yu

Main category: cs.RO

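TL;DR: Proposes LEGO-H, a learning framework that trains humanoid robots to hike complex trails by integrating visual perception, decision making, and motor execution.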

Details

Motivation: Current humanoid research falls short for hiking: locomotion work lacks long-term goals and situational awareness, while semantic navigation ignores real-world embodiment and terrain variability. Method: LEGO-H combines a temporal vision transformer with hierarchical reinforcement learning to anticipate future local goals that guide movement, and uses latent representations of movement patterns with hierarchical metric learning for policy transfer. Result: LEGO-H handles diverse physical and environmental challenges without relying on predefined motion patterns, showing versatility and robustness. Conclusion: Hiking is a compelling testbed for embodied autonomy, and LEGO-H provides a baseline for future humanoid development. Abstract: Hiking on complex trails demands balance, agility, and adaptive decision-making over unpredictable terrain. Current humanoid research remains fragmented and inadequate for hiking: locomotion focuses on motor skills without long-term goals or situational awareness, while semantic navigation overlooks real-world embodiment and local terrain variability. We propose training humanoids to hike on complex trails, driving integrative skill development across visual perception, decision making, and motor execution. We develop a learning framework, LEGO-H, that enables a vision-equipped humanoid robot to hike complex trails autonomously. We introduce two technical innovations: 1) a temporal vision transformer variant, tailored to a Hierarchical Reinforcement Learning framework, anticipates future local goals to guide movement, seamlessly integrating locomotion with goal-directed navigation; 2) latent representations of joint movement patterns, combined with hierarchical metric learning, enhance the Privileged Learning scheme and enable smooth policy transfer from privileged training to onboard execution. These components allow LEGO-H to handle diverse physical and environmental challenges without relying on predefined motion patterns. Experiments across varied simulated trails and robot morphologies highlight LEGO-H's versatility and robustness, positioning hiking as a compelling testbed for embodied autonomy and LEGO-H as a baseline for future humanoid development.

cs.LG [Back]

[101] Harnessing LLMs Explanations to Boost Surrogate Models in Tabular Data Classification

Ruxue Shi,Hengrui Gu,Xu Shen,Xin Wang

Main category: cs.LG

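TL;DR: Proposes an explanation-guided framework based on large language models (LLMs) that improves the performance and interpretability of tabular prediction, validated experimentally.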

Details

Motivation: Existing LLM-based methods suffer from high resource requirements, suboptimal demonstration selection, and limited interpretability, which restrict their practical use. Method: The framework has three stages: LLM explanation generation, explanation-guided demonstration selection, and explanation-guided interpretable SLM prediction. Result: Average accuracy improves by 5.31% across multiple tabular datasets. Conclusion: Explanation guidance improves both the performance and the interpretability of tabular prediction and has practical potential. Abstract: Large Language Models (LLMs) have shown remarkable ability in solving complex tasks, making them a promising tool for enhancing tabular learning. However, existing LLM-based methods suffer from high resource requirements, suboptimal demonstration selection, and limited interpretability, which largely hinder their prediction performance and application in the real world. To overcome these problems, we propose a novel in-context learning framework for tabular prediction. The core idea is to leverage the explanations generated by LLMs to guide a smaller, locally deployable Surrogate Language Model (SLM) to make interpretable tabular predictions. Specifically, our framework mainly involves three stages: (i) Post Hoc Explanation Generation, where LLMs are utilized to generate explanations for question-answer pairs in candidate demonstrations, providing insights into the reasoning behind the answer. (ii) Post Hoc Explanation-Guided Demonstrations Selection, which utilizes explanations generated by LLMs to guide the process of demonstration selection from candidate demonstrations. (iii) Post Hoc Explanation-Guided Interpretable SLM Prediction, which utilizes the demonstrations obtained in step (ii) as in-context examples and merges corresponding explanations as rationales to improve the performance of SLM and guide the model to generate interpretable outputs. Experimental results highlight the framework's effectiveness, with an average accuracy improvement of 5.31% across various tabular datasets in diverse domains.

[102] BMMDetect: A Multimodal Deep Learning Framework for Comprehensive Biomedical Misconduct Detection

Yize Zhou,Jie Zhang,Meijie Wang,Lun Yu

Main category: cs.LG

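TL;DR: BMMDetect is a multimodal deep learning framework for detecting academic misconduct in biomedical research. It integrates journal metadata, semantic embeddings, and GPT-4o-mined textual attributes, and improves detection performance over single-modality baselines.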

Details

Motivation: Existing methods are algorithmically narrow and rely on fragmented analytical pipelines, making academic misconduct in biomedical research hard to detect. Method: BMMDetect integrates journal metadata (such as the SJR index), semantic embeddings (PubMedBERT), and GPT-4o-mined textual attributes (such as methodological statistics and data anomalies), using multimodal fusion to reduce detection bias. Result: BMMDetect reaches 74.33% AUC, 8.6% above single-modality baselines, and transfers across biomedical subfields. Conclusion: BMMDetect offers a scalable, interpretable tool for safeguarding research integrity. Abstract: Academic misconduct detection in biomedical research remains challenging due to algorithmic narrowness in existing methods and fragmented analytical pipelines. We present BMMDetect, a multimodal deep learning framework that integrates journal metadata (SJR, institutional data), semantic embeddings (PubMedBERT), and GPT-4o-mined textual attributes (methodological statistics, data anomalies) for holistic manuscript evaluation. Key innovations include: (1) multimodal fusion of domain-specific features to reduce detection bias; (2) quantitative evaluation of feature importance, identifying journal authority metrics (e.g., SJR-index) and textual anomalies (e.g., statistical outliers) as dominant predictors; and (3) the BioMCD dataset, a large-scale benchmark with 13,160 retracted articles and 53,411 controls. BMMDetect achieves 74.33% AUC, outperforming single-modality baselines by 8.6%, and demonstrates transferability across biomedical subfields. This work advances scalable, interpretable tools for safeguarding research integrity.

[103] Short-circuiting Shortcuts: Mechanistic Investigation of Shortcuts in Text Classification

Leon Eshuijs,Shihan Wang,Antske Fokkens

Main category: cs.LG

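TL;DR: Studies how language models rely on spurious correlations (shortcuts) when making decisions, and proposes a new method, HTA, to detect and mitigate them.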

Details

Motivation: Investigate how shortcuts are processed inside language models in order to understand how they affect predictions. Method: Use actor names in movie reviews as controllable shortcuts, identify the relevant attention heads with mechanistic interpretability methods, and propose HTA to trace decisions back to input tokens. Result: Specific attention heads commit to a label based on shortcuts before the full input is processed; HTA detects these shortcuts and enables targeted mitigation. Conclusion: HTA provides a new tool for understanding and intervening on shortcut reliance in language models. Abstract: Reliance on spurious correlations (shortcuts) has been shown to underlie many of the successes of language models. Previous work focused on identifying the input elements that impact prediction. We investigate how shortcuts are actually processed within the model's decision-making mechanism. We use actor names in movie reviews as controllable shortcuts with known impact on the outcome. We use mechanistic interpretability methods and identify specific attention heads that focus on shortcuts. These heads gear the model towards a label before processing the complete input, effectively making premature decisions that bypass contextual analysis. Based on these findings, we introduce Head-based Token Attribution (HTA), which traces intermediate decisions back to input tokens. We show that HTA is effective in detecting shortcuts in LLMs and enables targeted mitigation by selectively deactivating shortcut-related attention heads.

[104] Automated Learning of Semantic Embedding Representations for Diffusion Models

Limai Jiang,Yunpeng Cai

Main category: cs.LG

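TL;DR: Proposes a multi-level denoising autoencoder framework that expands the representation capacity of denoising diffusion models (DDMs), acquiring embedding representations on the denoising Markov chain through self-conditional diffusion learning. Experiments show the resulting embeddings are semantically stronger than those of existing self-supervised methods in most cases.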

Details

Motivation: Denoising diffusion models (DDMs) excel at generation, but efficient representation learning for them is lacking; this work aims to expand their representation capacity so they can serve a wider range of deep learning tasks. Method: Use a multi-level denoising autoencoder framework with sequentially consistent Diffusion Transformers and a timestep-dependent encoder, acquiring embeddings through self-conditional diffusion learning. Result: The learned embeddings surpass state-of-the-art self-supervised representation learning methods in most cases. Conclusion: DDMs are suitable not only for generative tasks but also, potentially, for general-purpose deep learning applications. Abstract: Generative models capture the true distribution of data, yielding semantically rich representations. Denoising diffusion models (DDMs) exhibit superior generative capabilities, though efficient representation learning for them is lacking. In this work, we employ a multi-level denoising autoencoder framework to expand the representation capacity of DDMs, which introduces sequentially consistent Diffusion Transformers and an additional timestep-dependent encoder to acquire embedding representations on the denoising Markov chain through self-conditional diffusion learning. Intuitively, the encoder, conditioned on the entire diffusion process, compresses high-dimensional data into directional vectors in latent space under different noise levels, facilitating the learning of image embeddings across all timesteps. To verify the semantic adequacy of embeddings generated through this approach, extensive experiments are conducted on various datasets, demonstrating that optimally learned embeddings by DDMs surpass state-of-the-art self-supervised representation learning methods in most cases, achieving remarkable discriminative semantic representation quality. Our work justifies that DDMs are not only suitable for generative tasks, but also potentially advantageous for general-purpose deep learning applications.

[105] Improving Generalizability of Kolmogorov-Arnold Networks via Error-Correcting Output Codes

Youngjoon Lee,Jinu Gong,Joonhyuk Kang

Main category: cs.LG

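TL;DR: Integrates ECOC into the KAN framework, improving multi-class performance by decomposing it into multiple binary tasks, with strong results on medical image classification.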

Details

Motivation: Improve the robustness and generalization of KAN on multi-class tasks, particularly in healthcare AI applications. Method: Combine ECOC with KAN, turning multi-class classification into multiple binary tasks decoded by Hamming distance. Result: The method outperforms vanilla KAN on a blood cell classification dataset and remains stable across hyperparameter settings. Conclusion: ECOC improves KAN performance on multi-class medical image classification and has practical value. Abstract: Kolmogorov-Arnold Networks (KAN) offer universal function approximation using univariate spline compositions without nonlinear activations. In this work, we integrate Error-Correcting Output Codes (ECOC) into the KAN framework to transform multi-class classification into multiple binary tasks, improving robustness via Hamming-distance decoding. Our proposed KAN with ECOC method outperforms vanilla KAN on a challenging blood cell classification dataset, achieving higher accuracy under diverse hyperparameter settings. Ablation studies further confirm that ECOC consistently enhances performance across FastKAN and FasterKAN variants. These results demonstrate that ECOC integration significantly boosts KAN generalizability in critical healthcare AI applications. To the best of our knowledge, this is the first integration of ECOC with KAN for enhancing multi-class medical image classification performance.
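Hamming-distance decoding, the step that gives ECOC its robustness, can be sketched in a few lines. The four-class, seven-bit codebook below is a hypothetical example chosen so that every pair of codewords differs in four positions; it is not the code matrix used in the paper, and the class names are placeholders.

```python
def hamming(a, b):
    """Number of positions where two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def ecoc_decode(bits, codebook):
    """Return the class whose codeword is closest to the predicted bits."""
    return min(codebook, key=lambda label: hamming(bits, codebook[label]))


# Hypothetical codebook: each class is one row, each column is one binary task.
CODEBOOK = {
    "class_0": "0000000",
    "class_1": "1111000",
    "class_2": "0011110",
    "class_3": "1100110",
}

# Seven binary predictions for one sample, with one bit wrong relative to class_2.
predicted_bits = "0011100"
decoded = ecoc_decode(predicted_bits, CODEBOOK)
```

Because the codewords are at least four bits apart, a single wrong binary classifier still leaves the correct class as the nearest codeword, which is the sense in which the output code is error-correcting.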

[106] Wasserstein Distances Made Explainable: Insights into Dataset Shifts and Transport Phenomena

Philip Naumann,Jacob Kauffmann,Grégoire Montavon

Main category: cs.LG

TL;DR: Proposes a new Explainable AI method that efficiently and accurately attributes Wasserstein distances to different components of the data.

Details

Motivation: Computing the Wasserstein distance or inspecting its transport map alone may not reveal which factors make the distance high or low. Method: An Explainable AI approach that attributes Wasserstein distances to data subgroups, input features, or interpretable subspaces. Result: The method is highly accurate across diverse datasets and Wasserstein distance specifications, and its practical utility is shown in two use cases. Conclusion: The method is an effective tool for understanding what contributes to a Wasserstein distance and is broadly applicable. Abstract: Wasserstein distances provide a powerful framework for comparing data distributions. They can be used to analyze processes over time or to detect inhomogeneities within data. However, simply calculating the Wasserstein distance or analyzing the corresponding transport map (or coupling) may not be sufficient for understanding what factors contribute to a high or low Wasserstein distance. In this work, we propose a novel solution based on Explainable AI that allows us to efficiently and accurately attribute Wasserstein distances to various data components, including data subgroups, input features, or interpretable subspaces. Our method achieves high accuracy across diverse datasets and Wasserstein distance specifications, and its practical utility is demonstrated in two use cases.
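As a simple illustration of what attributing a Wasserstein distance to input features can mean, the sketch below uses the fact that a squared-Euclidean transport cost under a fixed one-to-one matching splits additively across features. This is not the authors' Explainable AI method, which the abstract does not detail; the matching is supplied by hand and assumed optimal, and all names and data are hypothetical.

```python
def feature_contributions(source, target, matching):
    """Per-feature share of the mean squared transport cost under a given matching."""
    dims = len(source[0])
    contrib = [0.0] * dims
    for i, j in matching:
        for d in range(dims):
            contrib[d] += (source[i][d] - target[j][d]) ** 2 / len(matching)
    return contrib


# Hypothetical datasets that differ only in the second feature.
source = [[0.0, 0.0], [1.0, 0.0]]
target = [[0.0, 3.0], [1.0, 3.0]]
matching = [(0, 0), (1, 1)]  # assumed optimal coupling between the two samples

contrib = feature_contributions(source, target, matching)
squared_w2 = sum(contrib)  # squared 2-Wasserstein distance under this matching
```

Here the whole distance is attributed to the second feature, which matches the way the two hypothetical datasets were constructed.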

[107] Brain Hematoma Marker Recognition Using Multitask Learning: SwinTransformer and Swin-Unet

Kodai Hirata,Tsuyoshi Okita

Main category: cs.LG

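TL;DR: MTL-Swin-Unet combines classification and semantic segmentation through multi-task learning, using image reconstruction and semantic segmentation to enhance the image representation and address spurious correlations.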

Details

Motivation: Address the spurious-correlation problem and improve the image representation. Method: Multi-task learning (MTL) with a Swin-Unet architecture, enhancing the image representation through semantic segmentation and image reconstruction. Result: Best F-value without covariate shift (test slices from the same patient); best AUC under covariate shift (test slices from different patients). Conclusion: MTL-Swin-Unet performs well at multi-task learning and representation enhancement and suits different data-distribution scenarios. Abstract: This paper proposes MTL-Swin-Unet, a multi-task learning method that uses transformers for classification and semantic segmentation. For spurious-correlation problems, this method allows us to enhance the image representation with two other image representations: the representation obtained by semantic segmentation and the representation obtained by image reconstruction. In our experiments, the proposed method outperformed other classifiers in F-value when the test data included slices from the same patient (no covariate shift). Similarly, when the test data did not include slices from the same patient (covariate shift setting), the proposed method outperformed them in AUC.

eess.SP

[108] ECGDeDRDNet: A deep learning-based method for Electrocardiogram noise removal using a double recurrent dense network

Sainan Xiao,Wangdong Yang,Buwen Cao,Jintao Wu

Main category: eess.SP

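TL;DR: Proposes ECGDeDRDNet, a deep learning-based ECG denoising framework with a double recurrent dense network architecture that combines ECG waveform information with the estimated clean image to significantly improve denoising.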

Details

Motivation: ECG signals are often corrupted by noise such as baseline wander, muscle artifacts, and electrode motion, which degrades diagnosis, so effective denoising is needed. Method: A double recurrent architecture that combines LSTM layers with DenseNet blocks and iteratively reuses both the ECG waveform and the estimated clean image. Result: On the MIT-BIH dataset, PSNR and SSIM exceed those of conventional image denoising methods, and SNR and RMSE exceed those of classical ECG denoising techniques. Conclusion: By combining temporal and spatial information through its double recurrent architecture, ECGDeDRDNet significantly improves ECG denoising. Abstract: Electrocardiogram (ECG) signals are frequently corrupted by noise, such as baseline wander (BW), muscle artifacts (MA), and electrode motion (EM), which significantly degrade their diagnostic utility. To address this issue, we propose ECGDeDRDNet, a deep learning-based ECG Denoising framework leveraging a Double Recurrent Dense Network architecture. In contrast to traditional approaches, we introduce a double recurrent scheme to enhance information reuse from both ECG waveforms and the estimated clean image. For ECG waveform processing, our basic model employs LSTM layers cascaded with DenseNet blocks. The estimated clean ECG image, obtained by subtracting predicted noise components from the noisy input, is iteratively fed back into the model. This dual recurrent architecture enables comprehensive utilization of both temporal waveform features and spatial image details, leading to more effective noise suppression. Experimental results on the MIT-BIH dataset demonstrate that our method achieves superior performance compared to conventional image denoising methods in terms of PSNR and SSIM while also surpassing classical ECG denoising techniques in both SNR and RMSE.
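
The waveform metrics named in this abstract can be sketched as follows. These are common textbook definitions of RMSE and output SNR for a denoised signal against a clean reference; the abstract does not state the exact formulas the authors used, so treat the definitions and function names as assumptions.

```python
import math


def rmse(clean, denoised):
    """Root-mean-square error between a clean reference and a denoised signal."""
    n = len(clean)
    return math.sqrt(sum((c - d) ** 2 for c, d in zip(clean, denoised)) / n)


def snr_db(clean, denoised):
    """Output SNR in dB: clean-signal power over residual-error power."""
    signal_power = sum(c * c for c in clean)
    residual_power = sum((c - d) ** 2 for c, d in zip(clean, denoised))
    return 10.0 * math.log10(signal_power / residual_power)


clean = [1.0, -1.0, 1.0, -1.0]
denoised = [1.1, -0.9, 0.9, -1.1]
print(rmse(clean, denoised), snr_db(clean, denoised))
```

Lower RMSE and higher SNR both indicate that the denoised waveform is closer to the reference; the two are tied together through the residual power, so they usually move in step.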

[109] A New k-Space Model for Non-Cartesian Fourier Imaging

Chin-Cheng Chan,Justin P. Haldar

Main category: eess.SP

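TL;DR: The paper proposes a new model based on a Fourier-domain basis expansion to overcome the computational cost, convergence speed, and artifact limitations of the conventional voxel-based model.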

Details

Motivation: The conventional voxel-based model has longstanding problems (high computational cost, slow convergence, artifacts) as well as possibly overlooked issues with its approximation, periodicity, and nullspace characteristics. Method: A new Fourier-domain basis expansion model replaces the standard image-domain voxel-based approach. Result: In non-Cartesian MRI reconstruction, the new model reduces artifacts, lowers computational complexity, and improves convergence. Conclusion: The new model improves on the conventional approach in image quality and computational efficiency and offers a fresh way to tackle longstanding problems. Abstract: For the past several decades, it has been popular to reconstruct Fourier imaging data using model-based approaches that can easily incorporate physical constraints and advanced regularization/machine learning priors. The most common modeling approach is to represent the continuous image as a linear combination of shifted "voxel" basis functions. Although well-studied and widely-deployed, this voxel-based model is associated with longstanding limitations, including high computational costs, slow convergence, and a propensity for artifacts. In this work, we reexamine this model from a fresh perspective, identifying new issues that may have been previously overlooked (including undesirable approximation, periodicity, and nullspace characteristics). Our insights motivate us to propose a new model that is more resilient to the limitations (old and new) of the previous approach. Specifically, the new model is based on a Fourier-domain basis expansion rather than the standard image-domain voxel-based approach. Illustrative results, which are presented in the context of non-Cartesian MRI reconstruction, demonstrate that the new model enables improved image quality (reduced artifacts) and/or reduced computational complexity (faster computations and improved convergence).

q-bio.PE

[110] Evolutionary ecology of words

Reiji Suzuki,Takaya Arita

Main category: q-bio.PE

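TL;DR: The paper proposes a large language model (LLM)-based model of the evolutionary ecology of words that extends evolutionary game theory and agent-based models, showing the emergence and evolution of diverse words and open-ended interaction options.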

Details

Motivation: Use the rich linguistic expressions of LLMs to extend evolutionary game theory and agent-based models, and study how words evolve and diversify in a spatial environment. Method: Agents move in a spatial environment, each carrying a short word or phrase generated by an LLM; the LLM decides the outcome of an interaction from the relationship between the two words, the loser's word is replaced, and word mutations may occur. Result: In preliminary experiments starting from a population of well-known species, species emerged both gradually and in a punctuated-equilibrium manner, ending in diverse populations with one dominant type. A long-term experiment showed the coexistence of diverse species. Conclusion: The model demonstrates diverse and adaptive word evolution and suggests that LLMs have potential for simulating complex ecological dynamics. Abstract: We propose a model for the evolutionary ecology of words as one attempt to extend evolutionary game theory and agent-based models by utilizing the rich linguistic expressions of Large Language Models (LLMs). Our model enables the emergence and evolution of diverse and infinite options for interactions among agents. Within the population, each agent possesses a short word (or phrase) generated by an LLM and moves within a spatial environment. When agents become adjacent, the outcome of their interaction is determined by the LLM based on the relationship between their words, with the loser's word being replaced by the winner's. Word mutations, also based on LLM outputs, may occur. We conducted preliminary experiments assuming that "strong animal species" would survive. The results showed that from an initial population consisting of well-known species, many species emerged both gradually and in a punctuated equilibrium manner. Each trial demonstrated the unique evolution of diverse populations, with one type of large species becoming dominant, such as terrestrial animals, marine life, or extinct species, which were ecologically specialized and adapted ones across diverse extreme habitats. We also conducted a long-term experiment with a large population, demonstrating the emergence and coexistence of diverse species.
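
The interaction loop described in this abstract can be sketched as a small agent-based simulation. Everything specific here is an assumption for illustration: the `STRENGTH` table is a hypothetical stand-in for the LLM's judgement, mutation draws from the same fixed vocabulary rather than from LLM output, and the grid is a small torus with random-walk movement.

```python
import random

# Hypothetical strength table standing in for the LLM's win/lose judgement.
STRENGTH = {"mouse": 1, "wolf": 2, "lion": 3}
MOVES = [(0, 0), (0, 1), (0, -1), (1, 0), (-1, 0)]


def judge(word_a, word_b):
    """Return the winning word (stand-in for an LLM call)."""
    return word_a if STRENGTH[word_a] >= STRENGTH[word_b] else word_b


def adjacent(p, q, size):
    """True if two cells coincide or touch on a wrap-around grid."""
    dx = abs(p[0] - q[0])
    dy = abs(p[1] - q[1])
    return min(dx, size - dx) + min(dy, size - dy) <= 1


def step(positions, words, size, rng, mutation_rate):
    """One tick: random-walk every agent, resolve contests, then mutate."""
    for i, (x, y) in enumerate(positions):
        dx, dy = rng.choice(MOVES)
        positions[i] = ((x + dx) % size, (y + dy) % size)
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if adjacent(positions[i], positions[j], size):
                # The loser's word is replaced by the winner's.
                words[i] = words[j] = judge(words[i], words[j])
    for i in range(len(words)):
        if rng.random() < mutation_rate:
            words[i] = rng.choice(sorted(STRENGTH))


def simulate(n_agents=30, size=6, steps=50, seed=0, mutation_rate=0.0):
    """Run the toy model and return (initial_words, final_words)."""
    rng = random.Random(seed)
    positions = [(rng.randrange(size), rng.randrange(size)) for _ in range(n_agents)]
    words = [rng.choice(sorted(STRENGTH)) for _ in range(n_agents)]
    initial = list(words)
    for _ in range(steps):
        step(positions, words, size, rng, mutation_rate)
    return initial, words


initial, final = simulate()
print({w: final.count(w) for w in STRENGTH})
```

With a fixed strength ordering and no mutation, the strongest word can only spread, which is why open-ended vocabularies and LLM-judged outcomes are what make the paper's setting interesting.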

cs.SI

[111] From Millions of Tweets to Actionable Insights: Leveraging LLMs for User Profiling

Vahid Rahimzadeh,Ali Hamzehpour,Azadeh Shakery,Masoud Asadpour

Main category: cs.SI

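TL;DR: Proposes a two-stage, large language model (LLM)-based user profiling method that builds interpretable profiles from domain-defining statements and outperforms existing methods by 9.8%.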

Details

Motivation: Existing user profiling techniques lack transferability and interpretability and depend on large labeled datasets or fixed categories, which limits adaptability. Method: A two-stage approach: (1) semi-supervised filtering with a domain-specific knowledge base; (2) generation of abstractive and extractive user profiles, using LLM knowledge to reduce the need for manual labeling. Result: On a Persian political Twitter dataset, the method outperforms existing methods by 9.8%. Conclusion: The method produces flexible, adaptable, and interpretable user profiles suited to a range of social network tasks. Abstract: Social media user profiling through content analysis is crucial for tasks like misinformation detection, engagement prediction, hate speech monitoring, and user behavior modeling. However, existing profiling techniques, including tweet summarization, attribute-based profiling, and latent representation learning, face significant limitations: they often lack transferability, produce non-interpretable features, require large labeled datasets, or rely on rigid predefined categories that limit adaptability. We introduce a novel large language model (LLM)-based approach that leverages domain-defining statements, which serve as key characteristics outlining the important pillars of a domain as foundations for profiling. Our two-stage method first employs semi-supervised filtering with a domain-specific knowledge base, then generates both abstractive (synthesized descriptions) and extractive (representative tweet selections) user profiles. By harnessing LLMs' inherent knowledge with minimal human validation, our approach is adaptable across domains while reducing the need for large labeled datasets. Our method generates interpretable natural language user profiles, condensing extensive user data into a scale that unlocks LLMs' reasoning and knowledge capabilities for downstream social network tasks. We contribute a Persian political Twitter (X) dataset and an LLM-based evaluation framework with human validation. Experimental results show our method significantly outperforms state-of-the-art LLM-based and traditional methods by 9.8%, demonstrating its effectiveness in creating flexible, adaptable, and interpretable user profiles.

cs.NE

[112] How to Train Your Metamorphic Deep Neural Network

Thomas Sommariva,Simone Calderara,Angelo Porrello

Main category: cs.NE

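TL;DR: NeuMeta is an INR-based method that learns a continuous weight manifold to generate networks of varying width and depth. This paper proposes a training algorithm that extends NeuMeta to full-network metamorphosis with minimal accuracy loss.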

Details

Motivation: The original NeuMeta is effective only for the final layers of a model, which limits its applicability. This paper aims to extend it to full-network metamorphosis. Method: Block-wise incremental training, INR initialization, and strategies for replacing batch normalization. Result: The resulting metamorphic networks stay competitive across a wide range of compression ratios and suit efficient deployment. Conclusion: The method offers a scalable solution for adaptable and efficient deployment of deep models. Abstract: Neural Metamorphosis (NeuMeta) is a recent paradigm for generating neural networks of varying width and depth. Based on Implicit Neural Representation (INR), NeuMeta learns a continuous weight manifold, enabling the direct generation of compressed models, including those with configurations not seen during training. While promising, the original formulation of NeuMeta proves effective only for the final layers of the underlying model, limiting its broader applicability. In this work, we propose a training algorithm that extends the capabilities of NeuMeta to enable full-network metamorphosis with minimal accuracy degradation. Our approach follows a structured recipe comprising block-wise incremental training, INR initialization, and strategies for replacing batch normalization. The resulting metamorphic networks maintain competitive accuracy across a wide range of compression ratios, offering a scalable solution for adaptable and efficient deployment of deep models. The code is available at: https://github.com/TSommariva/HTTY_NeuMeta.

cs.AI

[113] Prompted Meta-Learning for Few-shot Knowledge Graph Completion

Han Wu,Jie Yin

Main category: cs.AI

TL;DR: Proposes PromptMeta, a new framework that combines meta-semantics with relational information for few-shot knowledge graph completion.

Details

Motivation: Existing methods focus mainly on relational information and overlook the rich semantics in knowledge graphs; PromptMeta aims to close this gap. Method: PromptMeta comprises a meta-semantic prompt pool and a learnable fusion prompt that dynamically combines meta-semantics with task-specific relational information. Result: Experiments on two benchmark datasets demonstrate the method's effectiveness. Conclusion: By integrating meta-semantics and relational information, PromptMeta improves few-shot knowledge graph completion. Abstract: Few-shot knowledge graph completion (KGC) has obtained significant attention due to its practical applications in real-world scenarios, where new knowledge often emerges with limited available data. While most existing methods for few-shot KGC have predominantly focused on leveraging relational information, rich semantics inherent in KGs have been largely overlooked. To address this gap, we propose a novel prompted meta-learning (PromptMeta) framework that seamlessly integrates meta-semantics with relational information for few-shot KGC. PromptMeta has two key innovations: (1) a meta-semantic prompt pool that captures and consolidates high-level meta-semantics, enabling effective knowledge transfer and adaptation to rare and newly emerging relations; and (2) a learnable fusion prompt that dynamically combines meta-semantic information with task-specific relational information tailored to different few-shot tasks. Both components are optimized together with model parameters within a meta-learning framework. Extensive experiments on two benchmark datasets demonstrate the effectiveness of our approach.

[114] Neuro-Symbolic Concepts

Jiayuan Mao,Joshua B. Tenenbaum,Jiajun Wu

Main category: cs.AI

TL;DR: This article presents a concept-centric paradigm for building agents that can learn continually and reason flexibly.

Details

Motivation: To improve the learning efficiency and generalization of agents across tasks in different domains, the authors propose an approach based on neuro-symbolic concepts. Method: The agent uses neuro-symbolic concepts (such as object, relation, and action concepts) that are grounded in sensory inputs and actuation outputs and are compositional. Concepts are represented by a combination of symbolic programs and neural networks, supporting efficient learning and reasoning. Result: The framework offers data-efficient learning, compositional generalization, continual learning, and zero-shot transfer across 2D images, videos, 3D scenes, and robotic manipulation tasks. Conclusion: The concept-centric framework is an effective approach to multi-task learning and cross-domain generalization for agents. Abstract: This article presents a concept-centric paradigm for building agents that can learn continually and reason flexibly. The concept-centric agent utilizes a vocabulary of neuro-symbolic concepts. These concepts, such as object, relation, and action concepts, are grounded on sensory inputs and actuation outputs. They are also compositional, allowing for the creation of novel concepts through their structural combination. To facilitate learning and reasoning, the concepts are typed and represented using a combination of symbolic programs and neural network representations. Leveraging such neuro-symbolic concepts, the agent can efficiently learn and recombine them to solve various tasks across different domains, ranging from 2D images and videos to 3D scenes and robotic manipulation tasks. This concept-centric framework offers several advantages, including data efficiency, compositional generalization, continual learning, and zero-shot transfer.

[115] ArtRAG: Retrieval-Augmented Generation with Structured Context for Visual Art Understanding

Shuai Wang,Ivona Najdenkoska,Hongyi Zhu,Stevan Rudinac,Monika Kackovic,Nachoem Wijnberg,Marcel Worring

Main category: cs.AI

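TL;DR: ArtRAG is a training-free framework that combines structured knowledge with retrieval-augmented generation (RAG) for multi-perspective artwork explanation and outperforms several heavily trained baselines.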

Details

Motivation: Existing multimodal large language models (MLLMs) lack the nuanced cultural, historical, and stylistic understanding that artwork explanation requires, so richer context is needed. Method: ArtRAG builds an Art Context Knowledge Graph (ACKG) and uses a multi-granular retriever to select relevant subgraphs that guide the MLLM's generation. Result: On the SemArt and Artpedia datasets, ArtRAG outperforms baseline models, and human evaluation confirms that its output is more coherent and culturally rich. Conclusion: By combining structured knowledge with retrieval-augmented generation, ArtRAG markedly improves the quality and richness of artwork explanations. Abstract: Understanding visual art requires reasoning across multiple perspectives -- cultural, historical, and stylistic -- beyond mere object recognition. While recent multimodal large language models (MLLMs) perform well on general image captioning, they often fail to capture the nuanced interpretations that fine art demands. We propose ArtRAG, a novel, training-free framework that combines structured knowledge with retrieval-augmented generation (RAG) for multi-perspective artwork explanation. ArtRAG automatically constructs an Art Context Knowledge Graph (ACKG) from domain-specific textual sources, organizing entities such as artists, movements, themes, and historical events into a rich, interpretable graph. At inference time, a multi-granular structured retriever selects semantically and topologically relevant subgraphs to guide generation. This enables MLLMs to produce contextually grounded, culturally informed art descriptions. Experiments on the SemArt and Artpedia datasets show that ArtRAG outperforms several heavily trained baselines. Human evaluations further confirm that ArtRAG generates coherent, insightful, and culturally enriched interpretations.

[116] Why Are You Wrong? Counterfactual Explanations for Language Grounding with 3D Objects

Tobias Preintner,Weixuan Yuan,Qi Huang,Adrian König,Thomas Bäck,Elena Raponi,Niki van Stein

Main category: cs.AI

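TL;DR: The paper proposes a method that generates counterfactual examples to explain a model's incorrect predictions in 3D object referent identification, helping practitioners understand and improve the model.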

Details

Motivation: The variability of language descriptions and the spatial relationships of 3D objects lead to incorrect model predictions, and little research has addressed why. Method: Generate counterfactual examples that are similar to the original description but would have produced a correct prediction, exposing model weaknesses or description bias. Result: Experiments with the ShapeTalk dataset and three models show that the generated counterfactuals keep the structure of the original description and are semantically similar and meaningful. Conclusion: The method makes model behavior more interpretable, helps practitioners interact with such systems, and guides model improvement. Abstract: Combining natural language and geometric shapes is an emerging research area with multiple applications in robotics and language-assisted design. A crucial task in this domain is object referent identification, which involves selecting a 3D object given a textual description of the target. Variability in language descriptions and spatial relationships of 3D objects makes this a complex task, increasing the need to better understand the behavior of neural network models in this domain. However, limited research has been conducted in this area. Specifically, when a model makes an incorrect prediction despite being provided with a seemingly correct object description, practitioners are left wondering: "Why is the model wrong?". In this work, we present a method answering this question by generating counterfactual examples. Our method takes a misclassified sample, which includes two objects and a text description, and generates an alternative yet similar formulation that would have resulted in a correct prediction by the model. We have evaluated our approach with data from the ShapeTalk dataset along with three distinct models. Our counterfactual examples maintain the structure of the original description and are semantically similar and meaningful. They reveal weaknesses in the description and model bias, and they enhance the understanding of the model's behavior. These insights help practitioners to better interact with systems and help engineers to improve models.

q-bio.QM

[117] Multimodal Integrated Knowledge Transfer to Large Language Models through Preference Optimization with Biomedical Applications

Da Wu,Zhanliang Wang,Quan Nguyen,Zhuoran Xu,Kai Wang

Main category: q-bio.QM

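TL;DR: The MINT framework uses preference optimization to align unimodal large models with domain knowledge from multimodal biomedical data, improving their performance on tasks with text-only or image-only inputs.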

Details

Motivation: The scarcity of high-quality multimodal biomedical data limits how effectively pretrained large models can be fine-tuned for specialized tasks. Method: MINT uses a multimodal machine learning model to generate a preference dataset and optimizes unimodal models through the ORPO framework. Result: On rare genetic disease prediction and tissue type classification, MINT-optimized models outperform other methods. Conclusion: MINT offers an effective strategy for aligning unimodal large models with multimodal domain knowledge. Abstract: The scarcity of high-quality multimodal biomedical data limits the ability to effectively fine-tune pretrained Large Language Models (LLMs) for specialized biomedical tasks. To address this challenge, we introduce MINT (Multimodal Integrated kNowledge Transfer), a framework that aligns unimodal large decoder models with domain-specific decision patterns from multimodal biomedical data through preference optimization. While MINT supports different optimization techniques, we primarily implement it with the Odds Ratio Preference Optimization (ORPO) framework as its backbone. This strategy enables the aligned LLMs to perform predictive tasks using text-only or image-only inputs while retaining knowledge learnt from multimodal data. MINT leverages an upstream multimodal machine learning (MML) model trained on high-quality multimodal data to transfer domain-specific insights to downstream text-only or image-only LLMs. We demonstrate its effectiveness through two key applications: (1) Rare genetic disease prediction from texts, where MINT uses a multimodal encoder model, trained on facial photos and clinical notes, to generate a preference dataset for aligning a lightweight Llama 3.2-3B-Instruct. Despite relying on text input only, the MINT-derived model outperforms models trained with SFT, RAG, or DPO, and even outperforms Llama 3.1-405B-Instruct. (2) Tissue type classification using cell nucleus images, where MINT uses a vision-language foundation model as the preference generator, containing knowledge learnt from both text and histopathological images to align downstream image-only models. The resulting MINT-derived model significantly improves the performance of Llama 3.2-Vision-11B-Instruct on tissue type classification.
In summary, MINT provides an effective strategy to align unimodal LLMs with high-quality multimodal expertise through preference optimization.

eess.IV

[118] Image Restoration via Multi-domain Learning

Xingyu Jiang,Ning Gao,Xiuhui Zhang,Hongkun Dou,Shaowen Fu,Xiaoqing Zhong,Hongjue Li,Yue Deng

Main category: eess.IV

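TL;DR: The paper proposes SWFormer, a new image restoration framework that uses multi-domain learning and a modified Transformer structure and performs strongly across many restoration tasks.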

Details

Motivation: Natural images suffer many kinds of degradation from atmospheric and imaging conditions; existing Transformer methods are complex and do not fully exploit what different degradations have in common. Method: A Spatial-Wavelet-Fourier multi-domain structure and multi-scale learning, applied to the Token Mixer and the Feed-Forward Network. Result: Outperforms existing methods on ten restoration tasks while balancing restoration performance, parameter count, computational cost, and inference latency. Conclusion: SWFormer stands out in multi-domain learning and efficiency and offers a new direction for image restoration. Abstract: Due to adverse atmospheric and imaging conditions, natural images suffer from various degradation phenomena. Consequently, image restoration has emerged as a key solution and garnered substantial attention. Although recent Transformer architectures have demonstrated impressive success across various restoration tasks, their considerable model complexity poses significant challenges for both training and real-time deployment. Furthermore, instead of investigating the commonalities among different degradations, most existing restoration methods focus on modifying the Transformer under limited restoration priors. In this work, we first review various degradation phenomena from a multi-domain perspective, identifying common priors. Then, we introduce a novel restoration framework, which integrates multi-domain learning into the Transformer. Specifically, in the Token Mixer, we propose a Spatial-Wavelet-Fourier multi-domain structure that facilitates local-region-global multi-receptive field modeling to replace vanilla self-attention. Additionally, in the Feed-Forward Network, we incorporate multi-scale learning to fuse multi-domain features at different resolutions. Comprehensive experimental results across ten restoration tasks, such as dehazing, desnowing, motion deblurring, defocus deblurring, rain streak/raindrop removal, cloud removal, shadow removal, underwater enhancement and low-light enhancement, demonstrate that our proposed model outperforms state-of-the-art methods and achieves a favorable trade-off among restoration performance, parameter size, computational cost and inference latency. The code is available at: https://github.com/deng-ai-lab/SWFormer.

[119] StereoINR: Cross-View Geometry Consistent Stereo Super Resolution with Implicit Neural Representation

Yi Liu,Xinyi Liu,Panwang Xia,Qiong Wu,Yi Wan,Yongjun Zhang

Main category: eess.IV

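TL;DR: StereoINR is a stereo image super-resolution method based on implicit neural representation that removes the fixed-scale restriction and markedly improves geometric consistency through cross-view information fusion.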

Details

Motivation: Existing stereo super-resolution methods lack cross-view geometric consistency and adaptive multi-scale capability, which limits performance. Method: StereoINR models stereo image pairs as continuous implicit representations and fuses cross-view information using spatial warping and cross-attention. Result: Experiments show that StereoINR performs well at scales both inside and outside the training distribution, with markedly better geometric consistency. Conclusion: StereoINR provides a unified solution for arbitrary-scale stereo super-resolution. Abstract: Stereo image super-resolution (SSR) aims to enhance high-resolution details by leveraging information from stereo image pairs. However, the upsampling methods used in existing SSR approaches (e.g., pixel shuffle) often overlook cross-view geometric consistency and are limited to fixed-scale upsampling. The key issue is that previous upsampling methods use convolution to independently process deep features of different views, lacking cross-view and non-local information perception, making it difficult to select beneficial information from multi-view scenes adaptively. In this work, we propose Stereo Implicit Neural Representation (StereoINR), which innovatively models stereo image pairs as continuous implicit representations. This continuous representation breaks through the scale limitations, providing a unified solution for arbitrary-scale stereo super-resolution reconstruction of left-right views. Furthermore, by incorporating spatial warping and cross-attention mechanisms, StereoINR enables effective cross-view information fusion and achieves significant improvements in pixel-level geometric consistency. Extensive experiments across multiple datasets show that StereoINR outperforms state-of-the-art SSR methods at out-of-training-distribution scales and matches them within training-distribution scales.

[120] Guidance for Intra-cardiac Echocardiography Manipulation to Maintain Continuous Therapy Device Tip Visibility

Jaeyoung Huh,Ankur Kapoor,Young-Ho Kim

Main category: eess.IV

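TL;DR: Proposes an AI-driven tracking model that keeps the therapy device tip continuously visible in intra-cardiac echocardiography (ICE) imaging, using a hybrid dataset and a pretrained model to improve accuracy.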

Details

Motivation: During manual ICE catheter manipulation it is hard to keep the therapy device tip continuously visible, and the frequent adjustments required reduce procedural efficiency. Method: Combine clinical ICE sequences with synthetic data augmentation into a hybrid dataset, and use a pretrained ultrasound foundation model with a transformer-based network for feature extraction and prediction. Result: The model achieves an incident (entry) angle error of 3.32 degrees and a rotation angle error of 12.76 degrees, laying the groundwork for real-time robotic ICE catheter adjustment. Conclusion: The AI framework aims to reduce operator workload while keeping the device visible; future work will expand the clinical dataset to improve generalization. Abstract: Intra-cardiac Echocardiography (ICE) plays a critical role in Electrophysiology (EP) and Structural Heart Disease (SHD) interventions by providing real-time visualization of intracardiac structures. However, maintaining continuous visibility of the therapy device tip remains a challenge due to frequent adjustments required during manual ICE catheter manipulation. To address this, we propose an AI-driven tracking model that estimates the device tip incident angle and passing point within the ICE imaging plane, ensuring continuous visibility and facilitating robotic ICE catheter control. A key innovation of our approach is the hybrid dataset generation strategy, which combines clinical ICE sequences with synthetic data augmentation to enhance model robustness. We collected ICE images in a water chamber setup, equipping both the ICE catheter and device tip with electromagnetic (EM) sensors to establish precise ground-truth locations. Synthetic sequences were created by overlaying catheter tips onto real ICE images, preserving motion continuity while simulating diverse anatomical scenarios. The final dataset consists of 5,698 ICE-tip image pairs, ensuring comprehensive training coverage. Our model architecture integrates a pretrained ultrasound (US) foundation model, trained on 37.4M echocardiography images, for feature extraction. A transformer-based network processes sequential ICE frames, leveraging historical passing points and incident angles to improve prediction accuracy. Experimental results demonstrate that our method achieves an entry angle error of 3.32 degrees and a rotation angle error of 12.76 degrees.
This AI-driven framework lays the foundation for real-time robotic ICE catheter adjustments, minimizing operator workload while ensuring consistent therapy device visibility. Future work will focus on expanding clinical datasets to further enhance model generalization.

[121] Score-based Self-supervised MRI Denoising

Jiachen Tu,Yaokun Shi,Fan Lam

Main category: eess.IV

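TL;DR: C2S is a score-based self-supervised MRI denoising framework that learns directly from noisy data through a generalized denoising score matching loss, needs no high-SNR labels, and performs well across multiple noise levels and contrasts.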

Details

Motivation: Noise in MRI degrades image quality and diagnostic accuracy, while existing self-supervised denoising methods tend to oversmooth details and underperform supervised methods. Method: The C2S framework learns directly from noisy data with a generalized denoising score matching (GDSM) loss, and adds a noise-level reparameterization and a detail refinement extension. Result: On the M4Raw and fastMRI datasets, C2S is the best of the self-supervised methods and competitive with supervised ones. Conclusion: C2S offers an effective self-supervised solution for MRI denoising without high-SNR labels, suited to multiple noise levels and contrasts. Abstract: Magnetic resonance imaging (MRI) is a powerful noninvasive diagnostic imaging tool that provides unparalleled soft tissue contrast and anatomical detail. Noise contamination, especially in accelerated and/or low-field acquisitions, can significantly degrade image quality and diagnostic accuracy. Supervised learning based denoising approaches have achieved impressive performance but require high signal-to-noise ratio (SNR) labels, which are often unavailable. Self-supervised learning holds promise to address the label scarcity issue, but existing self-supervised denoising methods tend to oversmooth fine spatial features and often yield inferior performance to supervised methods. We introduce Corruption2Self (C2S), a novel score-based self-supervised framework for MRI denoising. At the core of C2S is a generalized denoising score matching (GDSM) loss, which extends denoising score matching to work directly with noisy observations by modeling the conditional expectation of higher-SNR images given further corrupted observations. This allows the model to effectively learn denoising across multiple noise levels directly from noisy data. Additionally, we incorporate a reparameterization of noise levels to stabilize training and enhance convergence, and introduce a detail refinement extension to balance noise reduction with the preservation of fine spatial features. Moreover, C2S can be extended to multi-contrast denoising by leveraging complementary information across different MRI contrasts.
We demonstrate that our method achieves state-of-the-art performance among self-supervised methods and competitive results compared to supervised counterparts across varying noise conditions and MRI contrasts on the M4Raw and fastMRI datasets.
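
For background on the term "denoising score matching" used in this abstract, the standard (non-generalized) formulation for Gaussian corruption can be written as below. This is textbook material, not the paper's GDSM loss, whose exact form is not given here; the symbols $x$, $\tilde{x}$, $\sigma$, $\varepsilon$, and $s_\theta$ are notation assumed for this sketch.

```latex
% Clean image x, corrupted observation \tilde{x} = x + \sigma\varepsilon,
% with \varepsilon \sim \mathcal{N}(0, I); s_\theta is the learned score network.
\begin{align}
  \mathcal{L}_{\mathrm{DSM}}(\theta)
    &= \mathbb{E}_{x,\varepsilon}
       \left[ \left\| s_\theta(\tilde{x}) + \frac{\varepsilon}{\sigma} \right\|_2^2 \right], \\
  \mathbb{E}\left[ x \mid \tilde{x} \right]
    &= \tilde{x} + \sigma^2 \, \nabla_{\tilde{x}} \log p_\sigma(\tilde{x}).
\end{align}
```

The second line (Tweedie's formula) is what links a learned score to a denoised estimate: once $s_\theta(\tilde{x}) \approx \nabla_{\tilde{x}} \log p_\sigma(\tilde{x})$, the posterior-mean image follows directly from the noisy observation.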

[122] UltraGauss: Ultrafast Gaussian Reconstruction of 3D Ultrasound Volumes

Mark C. Eid,Ana I. L. Namburete,João F. Henriques

Main category: eess.IV

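TL;DR: UltraGauss is a new ultrasound-specific Gaussian Splatting framework that models the physics of ultrasound image formation for efficient and accurate 2D-to-3D reconstruction, improving both reconstruction speed and image quality.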

Details

Motivation: Interpretation of conventional 2D ultrasound is operator-dependent, and existing 3D reconstruction methods are computationally expensive and inconsistent with ultrasound physics. Method: The UltraGauss framework builds on Gaussian Splatting, models the intersection of the probe plane with 3D space, and improves GPU parallelization and numerical stability. Result: On real clinical data, UltraGauss reaches state-of-the-art reconstruction within 5 minutes and 0.99 SSIM within 20 minutes, and expert clinicians rate its reconstructions the most realistic. Conclusion: UltraGauss offers an efficient and accurate solution for ultrasound reconstruction, and the authors plan to release their implementation. Abstract: Ultrasound imaging is widely used due to its safety, affordability, and real-time capabilities, but its 2D interpretation is highly operator-dependent, leading to variability and increased cognitive demand. 2D-to-3D reconstruction mitigates these challenges by providing standardized volumetric views, yet existing methods are often computationally expensive, memory-intensive, or incompatible with ultrasound physics. We introduce UltraGauss: the first ultrasound-specific Gaussian Splatting framework, extending view synthesis techniques to ultrasound wave propagation. Unlike conventional perspective-based splatting, UltraGauss models probe-plane intersections in 3D, aligning with acoustic image formation. We derive an efficient rasterization boundary formulation for GPU parallelization and introduce a numerically stable covariance parametrization, improving computational efficiency and reconstruction accuracy. On real clinical ultrasound data, UltraGauss achieves state-of-the-art reconstructions in 5 minutes, reaching 0.99 SSIM within 20 minutes on a single GPU. A survey of expert clinicians confirms UltraGauss' reconstructions are the most realistic among competing methods. Our CUDA implementation will be released upon publication.

[123] V-EfficientNets: Vector-Valued Efficiently Scaled Convolutional Neural Network Models

Guilherme Vieira Neto,Marcos Eduardo Valle

Main category: eess.IV

TL;DR: V-EfficientNets extend EfficientNet to vector-valued data and perform strongly on a medical image classification task, reaching an average accuracy of 99.46%.

Details

Motivation: Traditional neural networks do not treat multidimensional data as coherent entities and must learn inter-channel relationships, whereas vector-valued neural networks handle such data more naturally. Method: Extend EfficientNet, which jointly balances network width, depth, and resolution, to process vector-valued data. Result: An average accuracy of 99.46% on the ALL-IDB2 dataset, with fewer parameters and better results than existing models. Conclusion: V-EfficientNets are efficient and accurate for medical image classification and outperform conventional approaches. Abstract: EfficientNet models are convolutional neural networks optimized for parameter allocation by jointly balancing network width, depth, and resolution. Renowned for their exceptional accuracy, these models have become a standard for image classification tasks across diverse computer vision benchmarks. While traditional neural networks learn correlations between feature channels during training, vector-valued neural networks inherently treat multidimensional data as coherent entities, taking for granted the inter-channel relationships. This paper introduces vector-valued EfficientNets (V-EfficientNets), a novel extension of EfficientNet designed to process arbitrary vector-valued data. The proposed models are evaluated on a medical image classification task, achieving an average accuracy of 99.46% on the ALL-IDB2 dataset for detecting acute lymphoblastic leukemia. V-EfficientNets demonstrate remarkable efficiency, significantly reducing parameters while outperforming state-of-the-art models, including the original EfficientNet. The source code is available at https://github.com/mevalle/v-nets.

[124] Equivariant Imaging Biomarkers for Robust Unsupervised Segmentation of Histopathology

Fuyao Chen,Yuexi Du,Tal Zeevi,Nicha C. Dvornek,John A. Onofrey

Main category: eess.IV

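TL;DR: Proposes an unsupervised segmentation method based on symmetric convolutional kernels for extracting robust, equivariant histopathology biomarkers, addressing the lack of rotation and reflection invariance in conventional machine learning models.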

Details

Motivation: Manual pathology analysis is time-consuming and subjective, and existing machine learning models lack invariance to image rotation and reflection, which limits generalization. Method: Unsupervised segmentation with symmetric convolutional kernels extracts equivariant biomarkers, validated on prostate tissue micro-array images. Result: The extracted biomarkers are more robust to rotation than those from models with standard convolution kernels, improving robustness and generalizability. Conclusion: The method holds promise for improving the accuracy and consistency of machine learning models in digital pathology and for extending to the diagnosis and prognosis of other cancers. Abstract: Histopathology evaluation of tissue specimens through microscopic examination is essential for accurate disease diagnosis and prognosis. However, traditional manual analysis by specially trained pathologists is time-consuming, labor-intensive, cost-inefficient, and prone to inter-rater variability, potentially affecting diagnostic consistency and accuracy. As digital pathology images continue to proliferate, there is a pressing need for automated analysis to address these challenges. Recent advancements in artificial intelligence-based tools, such as machine learning (ML) models, have significantly enhanced the precision and efficiency of analyzing histopathological slides. However, despite their impressive performance, ML models are invariant only to translation, lacking invariance to rotation and reflection. This limitation restricts their ability to generalize effectively, particularly in histopathology, where images intrinsically lack meaningful orientation. In this study, we develop robust, equivariant histopathological biomarkers through a novel symmetric convolutional kernel via unsupervised segmentation. The approach is validated using prostate tissue micro-array (TMA) images from 50 patients in the Gleason 2019 Challenge public dataset. The biomarkers extracted through this approach demonstrate enhanced robustness and generalizability against rotation compared to models using standard convolution kernels, holding promise for enhancing the accuracy, consistency, and robustness of ML models in digital pathology. Ultimately, this work aims to improve diagnostic and prognostic capabilities of histopathology beyond prostate cancer through equivariant imaging.
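
One simple way to see why a symmetric kernel gives rotation and reflection equivariance is sketched below: a kernel averaged over the eight rotations and flips of the square commutes with those transforms of the input. This is a generic construction chosen for illustration, assuming NumPy; the abstract does not specify how the authors build their kernel, and the function names are hypothetical.

```python
import numpy as np


def symmetrize(kernel):
    """Average a square kernel over the 8 rotations/reflections of the square."""
    variants = []
    for k in range(4):
        rotated = np.rot90(kernel, k)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))
    return np.mean(variants, axis=0)


def correlate_valid(image, kernel):
    """Plain 'valid' 2-D cross-correlation (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
kernel = symmetrize(rng.normal(size=(3, 3)))

# Filtering a rotated image equals rotating the filtered image.
lhs = correlate_valid(np.rot90(image), kernel)
rhs = np.rot90(correlate_valid(image, kernel))
print(np.allclose(lhs, rhs))
```

A standard learned kernel has no such symmetry, so its feature maps change in an uncontrolled way when a slide is rotated, which matters for tissue images that have no canonical orientation.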

[125] Hybrid Learning: A Novel Combination of Self-Supervised and Supervised Learning for MRI Reconstruction without High-Quality Training Reference

Haoyang Pei,Ding Xia,Xiang Xu,William Moore,Yao Wang,Hersh Chandarana,Li Feng

Main category: eess.IV

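TL;DR: Proposes a hybrid learning method that combines self-supervised and supervised learning to improve MRI reconstruction when high-quality reference images are unavailable.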

Details

Motivation: Conventional supervised learning requires high-quality reference images, while self-supervised learning degrades at high acceleration rates; hybrid learning is proposed to address both limitations. Method: Training has two stages: the first uses self-supervised learning to generate improved images from noisy or undersampled data; the second uses those images as pseudo-ground truths for supervised learning to further refine reconstruction performance. Result: For spiral-UTE lung MRI, hybrid learning improved image quality over both self-supervised and conventional supervised methods; for 3D T1 mapping of the brain, it achieved more accurate T1 quantification than self-supervised learning. Conclusion: Hybrid learning offers an effective solution for MRI reconstruction when high-quality reference data are lacking and may support broader clinical use of deep learning. Abstract: Purpose: Deep learning has demonstrated strong potential for MRI reconstruction, but conventional supervised learning methods require high-quality reference images, which are often unavailable in practice. Self-supervised learning offers an alternative, yet its performance degrades at high acceleration rates. To overcome these limitations, we propose hybrid learning, a novel two-stage training framework that combines self-supervised and supervised learning for robust image reconstruction. Methods: Hybrid learning is implemented in two sequential stages. In the first stage, self-supervised learning is employed to generate improved images from noisy or undersampled reference data. These enhanced images then serve as pseudo-ground truths for the second stage, which uses supervised learning to refine reconstruction performance and support higher acceleration rates. We evaluated hybrid learning in two representative applications: (1) accelerated 0.55T spiral-UTE lung MRI using noisy reference data, and (2) 3D T1 mapping of the brain without access to fully sampled ground truth. Results: For spiral-UTE lung MRI, hybrid learning consistently improved image quality over both self-supervised and conventional supervised methods across different acceleration rates, as measured by SSIM and NMSE. For 3D T1 mapping, hybrid learning achieved superior T1 quantification accuracy across a wide dynamic range, outperforming self-supervised learning in all tested conditions. Conclusions: Hybrid learning provides a practical and effective solution for training deep MRI reconstruction networks when only low-quality or incomplete reference data are available. It enables improved image quality and accurate quantitative mapping across different applications and field strengths, representing a promising technique toward broader clinical deployment of deep learning-based MRI.
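
The abstract reports image quality in terms of SSIM and NMSE without giving formulas. As a hedged illustration, NMSE is commonly defined as the squared error between reconstruction and reference divided by the squared norm of the reference; the sketch below assumes that definition, which may differ in normalization from the one the authors used. The function name `nmse` is hypothetical.

```python
import numpy as np

def nmse(reconstruction, reference):
    """Normalized mean squared error: ||recon - ref||^2 / ||ref||^2.

    Assumption: a common definition, not necessarily the paper's.
    np.abs keeps the computation valid for complex-valued MRI data.
    """
    error_energy = np.sum(np.abs(reconstruction - reference) ** 2)
    reference_energy = np.sum(np.abs(reference) ** 2)
    return error_energy / reference_energy

reference = np.array([[1.0, 2.0], [3.0, 4.0]])
perfect_score = nmse(reference, reference)               # identical images
blank_score = nmse(np.zeros_like(reference), reference)  # all-zero reconstruction
```

Lower values are better: a perfect reconstruction scores 0, and an all-zero reconstruction scores 1 under this definition.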

[126] Predicting Diabetic Macular Edema Treatment Responses Using OCT: Dataset and Methods of APTOS Competition

Weiyi Zhang,Peranut Chotcomwongse,Yinwen Li,Pusheng Xu,Ruijie Yao,Lianhao Zhou,Yuxuan Zhou,Hui Feng,Qiping Zhou,Xinyue Wang,Shoujin Huang,Zihao Jin,Florence H. T. Chung,Shujun Wang,Yalin Zheng,Mingguang He,Danli Shi,Paisan Ruamviboonsuk

Main category: eess.IV

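TL;DR: Through the Asia-Pacific Tele-Ophthalmology Society (APTOS) Big Data Competition, this study uses OCT images to predict how patients with diabetic macular edema (DME) respond to anti-VEGF therapy, with the aim of enabling personalized treatment.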

Details

Motivation: Treatment responses in diabetic macular edema (DME) vary widely, so patient stratification is needed to predict therapeutic benefit and enable personalized treatment. Method: The authors organized a competition and provided a dataset of tens of thousands of OCT images from 2,000 patients, labeled across four sub-tasks, to evaluate the predictive accuracy of AI models. Result: 170 teams registered and 41 reached the final round; the top-performing team achieved an AUC of 80.06%, indicating the potential of AI for personalized treatment. Conclusion: AI can predict DME treatment response from OCT images and support clinical decision-making, with personalized treatment as a future goal. Abstract: Diabetic macular edema (DME) significantly contributes to visual impairment in diabetic patients. Treatment responses to intravitreal therapies vary, highlighting the need for patient stratification to predict therapeutic benefits and enable personalized strategies. To our knowledge, this study is the first to explore pre-treatment stratification for predicting DME treatment responses. To advance this research, we organized the 2nd Asia-Pacific Tele-Ophthalmology Society (APTOS) Big Data Competition in 2021. The competition focused on improving predictive accuracy for anti-VEGF therapy responses using ophthalmic OCT images. We provided a dataset containing tens of thousands of OCT images from 2,000 patients with labels across four sub-tasks. This paper details the competition's structure, dataset, leading methods, and evaluation metrics. The competition attracted strong scientific community participation, with 170 teams initially registering and 41 reaching the final round. The top-performing team achieved an AUC of 80.06%, highlighting the potential of AI in personalized DME treatment and clinical decision-making.

[127] S2MNet: Speckle-To-Mesh Net for Three-Dimensional Cardiac Morphology Reconstruction via Echocardiogram

Xilin Gong,Yongkai Chen,Shushan Wu,Fang Wang,Ping Ma,Wenxuan Zhong

Main category: eess.IV

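TL;DR: Proposes S2MNet, a deep learning framework that reconstructs continuous, high-fidelity 3D heart models by integrating six routinely acquired 2D echocardiogram slices.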

Details

Motivation: 2D echocardiography is widely used but cannot fully assess the three-dimensional anatomy and function of the heart, while existing 3D echocardiography suffers from reduced resolution, limited availability, and high cost. Method: S2MNet sidesteps the difficulty of acquiring training data by simulating six 2D echocardiogram images from slices of a given 3D heart mesh, and introduces a deformation field-based method to avoid spatial discontinuities and structural artifacts. Result: On clinically collected echocardiograms, the estimated left ventricular volume is strongly correlated with doctor-measured GLPS, consistent with medical theory. Conclusion: This association supports the reliability of the proposed 3D reconstruction method, which relies only on routinely acquired 2D views. Abstract: Echocardiogram is the most commonly used imaging modality in cardiac assessment due to its non-invasive nature, real-time capability, and cost-effectiveness. Despite its advantages, most clinical echocardiograms provide only two-dimensional views, limiting the ability to fully assess cardiac anatomy and function in three dimensions. While three-dimensional echocardiography exists, it often suffers from reduced resolution, limited availability, and higher acquisition costs. To overcome these challenges, we propose a deep learning framework S2MNet that reconstructs continuous and high-fidelity 3D heart models by integrating six slices of routinely acquired 2D echocardiogram views. Our method has three advantages. First, our method avoids the difficulty of acquiring training data by simulating six 2D echocardiogram images from corresponding slices of a given 3D heart mesh. Second, we introduce a deformation field-based method, which avoids spatial discontinuities or structural artifacts in 3D echocardiogram reconstructions. We validate our method using clinically collected echocardiograms and demonstrate that our estimated left ventricular volume, a key clinical indicator of cardiac function, is strongly correlated with the doctor-measured GLPS, a clinical measurement that should demonstrate a negative correlation with LVE in medical theory. This association confirms the reliability of our proposed 3D construction method.

[128] The Application of Deep Learning for Lymph Node Segmentation: A Systematic Review

Jingguo Qu,Xinyang Han,Man-Lik Chui,Yao Pu,Simon Takadiyi Gunda,Ziman Chen,Jing Qin,Ann Dorothy King,Winnie Chiu-Wing Chu,Jing Cai,Michael Tin-Cheung Ying

Main category: eess.IV

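TL;DR: Reviews the application of deep learning to lymph node segmentation, discusses the methodologies of different architectures, and outlines future research directions.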

Details

Motivation: Traditional lymph node segmentation methods are constrained by manual delineation and variable operator proficiency; deep learning offers new possibilities for improving accuracy. Method: The review evaluates deep learning architectures such as convolutional neural networks, encoder-decoder networks, and transformers for analyzing medical imaging data across modalities. Result: Despite progress, challenges remain, including the diversity of lymph node shapes, the scarcity of accurately labeled datasets, and the lack of methods that generalize across imaging modalities. Conclusion: The authors present this as the first comprehensive overview of deep learning for lymph node segmentation and propose future directions including multimodal fusion, transfer learning, and large-scale pre-trained models. Abstract: Automatic lymph node segmentation is the cornerstone for advances in computer vision tasks for early detection and staging of cancer. Traditional segmentation methods are constrained by manual delineation and variability in operator proficiency, limiting their ability to achieve high accuracy. The introduction of deep learning technologies offers new possibilities for improving the accuracy of lymph node image analysis. This study evaluates the application of deep learning in lymph node segmentation and discusses the methodologies of various deep learning architectures such as convolutional neural networks, encoder-decoder networks, and transformers in analyzing medical imaging data across different modalities. Despite the advancements, the field still confronts challenges like the shape diversity of lymph nodes, the scarcity of accurately labeled datasets, and the inadequate development of methods that are robust and generalizable across different imaging modalities. To the best of our knowledge, this is the first study that provides a comprehensive overview of the application of deep learning techniques in the lymph node segmentation task. Furthermore, this study also explores potential future research directions, including multimodal fusion techniques, transfer learning, and the use of large-scale pre-trained models to overcome current limitations while enhancing cancer diagnosis and treatment planning strategies.

[129] Topo-VM-UNetV2: Encoding Topology into Vision Mamba UNet for Polyp Segmentation

Diego Adame,Jose A. Nunez,Fabian Vazquez,Nayeli Gurrola,Huimin Li,Haoteng Tang,Bin Fu,Pengfei Gu

Main category: eess.IV

TL;DR: Proposes Topo-VM-UNetV2, which encodes topological features into the Mamba-based VM-UNetV2 model to improve polyp segmentation accuracy.

Details

Motivation: CNNs and Transformers both have limitations for polyp segmentation: CNNs struggle to model long-range dependencies, and Transformers incur quadratic computational complexity. Mamba is effective but fails to capture topological features, which leads to inaccurate boundary delineation. Method: The method has two stages: (1) VM-UNetV2 generates probability maps, from which topology attention maps are computed; (2) the topology attention maps are integrated into the SDI module of VM-UNetV2 to form a Topo-SDI module that enhances the segmentation results. Result: Experiments on five public polyp segmentation datasets demonstrate the effectiveness of the method. Conclusion: Encoding topological features into VM-UNetV2 improves polyp segmentation; the code will be made publicly available. Abstract: Convolutional neural network (CNN) and Transformer-based architectures are two dominant deep learning models for polyp segmentation. However, CNNs have limited capability for modeling long-range dependencies, while Transformers incur quadratic computational complexity. Recently, State Space Models such as Mamba have been recognized as a promising approach for polyp segmentation because they not only model long-range interactions effectively but also maintain linear computational complexity. However, Mamba-based architectures still struggle to capture topological features (e.g., connected components, loops, voids), leading to inaccurate boundary delineation and polyp segmentation. To address these limitations, we propose a new approach called Topo-VM-UNetV2, which encodes topological features into the Mamba-based state-of-the-art polyp segmentation model, VM-UNetV2. Our method consists of two stages: Stage 1: VM-UNetV2 is used to generate probability maps (PMs) for the training and test images, which are then used to compute topology attention maps. Specifically, we first compute persistence diagrams of the PMs, then generate persistence score maps by assigning persistence values (i.e., the difference between death and birth times) of each topological feature to its birth location, and finally transform persistence scores into attention weights using the sigmoid function. Stage 2: These topology attention maps are integrated into the semantics and detail infusion (SDI) module of VM-UNetV2 to form a topology-guided semantics and detail infusion (Topo-SDI) module for enhancing the segmentation results. Extensive experiments on five public polyp segmentation datasets demonstrate the effectiveness of our proposed method. The code will be made publicly available.
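
The last two steps of Stage 1 (persistence score map, then sigmoid) can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the persistence diagram has already been computed by some topology library and is supplied as hypothetical `(row, col, birth, death)` tuples, and it assumes that scores accumulate when two features share a birth location. The function name `topology_attention` is also hypothetical.

```python
import numpy as np

def topology_attention(shape, features):
    """Turn persistence pairs into a sigmoid attention map.

    features: iterable of (row, col, birth, death), where (row, col) is the
    birth location of a topological feature (assumed input format).
    """
    scores = np.zeros(shape)
    for row, col, birth, death in features:
        # Persistence = death - birth, placed at the feature's birth location.
        # Assumption: scores add up if several features are born at one pixel.
        scores[row, col] += death - birth
    # Sigmoid maps scores to attention weights in (0, 1).
    return 1.0 / (1.0 + np.exp(-scores))

# Two made-up features on a 4x4 probability map.
attention = topology_attention((4, 4), [(1, 2, 0.2, 0.9), (3, 0, 0.5, 0.6)])
```

Under this literal reading, pixels with no topological feature receive the neutral weight sigmoid(0) = 0.5, and more persistent features receive weights closer to 1.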

cs.HC [Back]

[130] An empathic GPT-based chatbot to talk about mental disorders with Spanish teenagers

Alba María Mármol-Romero,Manuel García-Vega,Miguel Ángel García-Cumbreras,Arturo Montejo-Ráez

Main category: cs.HC

TL;DR: Presents a chatbot-based system that uses a self-disclosure technique to raise awareness of certain mental disorders among young Spanish people.

Details

Motivation: To use technology to help teenagers better understand and pay attention to mental health issues. Method: The dialogue engine mixes closed and open conversations, uses the GPT-3 language model for the open part, and adapts the conversation to the user's sensitivity to a specific disorder. Result: The system was of interest to teenagers and could help raise their awareness of mental disorders. Conclusion: Chatbot systems show potential for mental health education among teenagers. Abstract: This paper presents a chatbot-based system to engage young Spanish people in the awareness of certain mental disorders through a self-disclosure technique. The study was carried out in a population of teenagers aged between 12 and 18 years. The dialogue engine mixes closed and open conversations, so certain controlled messages are sent to focus the chat on a specific disorder, which will change over time. Once a set of trial questions is answered, the system can initiate the conversation on the disorder under the focus according to the user's sensitivity to that disorder, in an attempt to establish a more empathetic communication. Then, an open conversation based on the GPT-3 language model is initiated, allowing the user to express themselves with more freedom. The results show that these systems are of interest to young people and could help them become aware of certain mental disorders.

Table of Contents

cs.CV [Back]

[1] Data extraction and processing methods to aid the study of driving behaviors at intersections in naturalistic driving

[2] From Events to Enhancement: A Survey on Event-Based Imaging Technologies

[3] MDDFNet: Mamba-based Dynamic Dual Fusion Network for Traffic Sign Detection

[4] DetoxAI: a Python Toolkit for Debiasing Deep Learning Models in Computer Vision

[5] Learning 3D Persistent Embodied World Models

[6] Preliminary Explorations with GPT-4o(mni) Native Image Generation

[7] Apply Hierarchical-Chain-of-Generation to Complex Attributes Text-to-3D Generation

[8] Occupancy World Model for Robots

[9] Exploring Convolutional Neural Networks for Rice Grain Classification: An Explainable AI Approach

[10] Web2Grasp: Learning Functional Grasps from Web Images of Hand-Object Interactions

[11] Real-Time Privacy Preservation for Robot Visual Perception

[12] GaMNet: A Hybrid Network with Gabor Fusion and NMamba for Efficient 3D Glioma Segmentation

[13] X-Transfer Attacks: Towards Super Transferable Adversarial Attacks on CLIP

[14] OXSeg: Multidimensional attention UNet-based lip segmentation using semi-supervised lip contours

[15] Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments

[16] Prompt to Polyp: Clinically-Aware Medical Image Synthesis with Diffusion Models

[17] Steepest Descent Density Control for Compact 3D Gaussian Splatting

[18] ReactDance: Progressive-Granular Representation for Long-Term Coherent Reactive Dance Generation

[19] QuickSplat: Fast 3D Surface Reconstruction via Learned Gaussian Initialization

[20] Enhancing Satellite Object Localization with Dilated Convolutions and Attention-aided Spatial Pooling

[21] A Preliminary Study for GPT-4o on Image Restoration

[22] Looking Beyond Language Priors: Enhancing Visual Comprehension and Attention in Multimodal Models

[23] VR-RAG: Open-vocabulary Species Recognition with RAG-Assisted Large Multi-Modal Models

[24] Semantic Style Transfer for Enhancing Animal Facial Landmark Detection

[25] The Moon's Many Faces: A Single Unified Transformer for Multimodal Lunar Reconstruction

[26] Lost in OCR Translation? Vision-Based Approaches to Robust Document Retrieval

[27] TeGA: Texture Space Gaussian Avatars for High-Resolution Dynamic Head Modeling

[28] InstanceGen: Image Generation with Instance-level Instructions

[29] Fine-Tuning Video-Text Contrastive Model for Primate Behavior Retrieval from Unlabeled Raw Videos

[30] HyperspectralMAE: The Hyperspectral Imagery Classification Model using Fourier-Encoded Dual-Branch Masked Autoencoder

[31] DiGIT: Multi-Dilated Gated Encoder and Central-Adjacent Region Integrated Decoder for Temporal Action Detection Transformer

[32] Semantic-Space-Intervened Diffusive Alignment for Visual Classification

[33] You Are Your Best Teacher: Semi-Supervised Surgical Point Tracking with Cycle-Consistent Self-Distillation

[34] Dome-DETR: DETR with Density-Oriented Feature-Query Manipulation for Efficient Tiny Object Detection

[35] kFuse: A novel density based agglomerative clustering

[36] Automating Infrastructure Surveying: A Framework for Geometric Measurements and Compliance Assessment Using Point Cloud Data

[37] A review of advancements in low-light image enhancement using deep learning

[38] Describe Anything in Medical Images

[39] Image Segmentation via Variational Model Based Tailored UNet: A Deep Variational Framework

[40] Accelerating Diffusion Transformer via Increment-Calibrated Caching with Channel-Aware Singular Value Decomposition

[41] Dual-level Fuzzy Learning with Patch Guidance for Image Ordinal Regression

[42] Automated Knot Detection and Pairing for Wood Analysis in the Timber Industry

[43] RefRef: A Synthetic Dataset and Benchmark for Reconstructing Refractive and Reflective Objects

[44] PICD: Versatile Perceptual Image Compression with Diffusion Rendering

[45] Decoupling Multi-Contrast Super-Resolution: Pairing Unpaired Synthesis with Implicit Representations

[46] Towards Facial Image Compression with Consistency Preserving Diffusion Prior

[47] Register and CLS tokens yield a decoupling of local and global features in large ViTs

[48] Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI

[49] Examining the Source of Defects from a Mechanical Perspective for 3D Anomaly Detection

[50] DFEN: Dual Feature Equalization Network for Medical Image Segmentation

[51] CGTrack: Cascade Gating Network with Hierarchical Feature Aggregation for UAV Tracking

[52] Achieving 3D Attention via Triplet Squeeze and Excitation Block

[53] Task-Adapter++: Task-specific Adaptation with Order-aware Alignment for Few-shot Action Recognition

[54] From Pixels to Perception: Interpretable Predictions via Instance-wise Grouped Feature Selection

[55] Document Image Rectification Bases on Self-Adaptive Multitask Fusion

[56] Towards Better Cephalometric Landmark Detection with Diffusion Data Generation

[57] Noise-Consistent Siamese-Diffusion for Medical Image Synthesis and Segmentation

[58] Camera-Only Bird's Eye View Perception: A Neural Approach to LiDAR-Free Environmental Mapping for Autonomous Vehicles

[59] Photovoltaic Defect Image Generator with Boundary Alignment Smoothing Constraint for Domain Shift Mitigation

[60] BrainSegDMlF: A Dynamic Fusion-enhanced SAM for Brain Lesion Segmentation

[61] MM-Skin: Enhancing Dermatology Vision-Language Model with an Image-Text Dataset Derived from Textbooks

[62] DiffLocks: Generating 3D Hair from a Single Image using Diffusion Models

[63] Adapting a Segmentation Foundation Model for Medical Image Classification

[64] VIN-NBV: A View Introspection Network for Next-Best-View Selection for Resource-Efficient 3D Reconstruction

cs.GR [Back]

[65] MonetGPT: Solving Puzzles Enhances MLLMs' Image Retouching Skills

[66] Anymate: A Dataset and Baselines for Learning 3D Object Rigging

cs.CL [Back]

[67] KG-HTC: Integrating Knowledge Graphs into LLMs for Effective Zero-shot Hierarchical Text Classification

[68] Privacy-Preserving Transformers: SwiftKey's Differential Privacy Implementation

[69] Exploration of COVID-19 Discourse on Twitter: American Politician Edition

[70] Assessing Robustness to Spurious Correlations in Post-Training Language Models

[71] TopicVD: A Topic-Based Dataset of Video-Guided Multimodal Machine Translation for Documentaries

[72] Insertion Language Models: Sequence Generation with Arbitrary-Position Insertions

[73] Sparse Attention Remapping with Clustering for Efficient LLM Decoding on PIM

[74] Tell Me Who Your Students Are: GPT Can Generate Valid Multiple-Choice Questions When Students' (Mis)Understanding Is Hinted

[75] Symbol-based entity marker highlighting for enhanced text mining in materials science with generative AI

[76] Elastic Weight Consolidation for Full-Parameter Continual Pre-Training of Gemma2