Skip to content

Table of Contents

cs.CV [Back]

[1] MilChat: Introducing Chain of Thought Reasoning and GRPO to a Multimodal Small Language Model for Remote Sensing

Aybora Koksal,A. Aydin Alatan

Main category: cs.CV

TL;DR: MilChat是一种轻量级多模态语言模型,专为分析偏远地区的遥感图像设计,尤其在军事领域表现优异。通过专家验证的数据集MilData和强化学习优化,其在开放描述和分类任务中显著优于通用模型。

Details Motivation: 当前多模态大语言模型在专业领域(如军事遥感)的适应性和效率有限,需要一种轻量级且高效的解决方案。 Method: 使用2B参数的开源MLLM进行监督微调,结合链式思维推理注释和GRPO强化学习,优化模型对军事关键特征的检测能力。 Result: 在MilData基准测试中,召回率达80%,精确率达98%,显著优于通用模型和现有遥感适应方法。 Conclusion: MilChat通过针对性微调和强化学习,证明了在专业领域的高效性和准确性,为实际应用提供了有力工具。 Abstract: Remarkable capabilities in understanding and generating text-image content have been demonstrated by recent advancements in multimodal large language models (MLLMs). However, their effectiveness in specialized domains-particularly those requiring resource-efficient and domain-specific adaptations-has remained limited. In this work, a lightweight multimodal language model termed MilChat is introduced, specifically adapted to analyze remote sensing imagery in secluded areas, including challenging missile launch sites. A new dataset, MilData, was compiled by verifying hundreds of aerial images through expert review, and subtle military installations were highlighted via detailed captions. Supervised fine-tuning on a 2B-parameter open-source MLLM with chain-of-thought (CoT) reasoning annotations was performed, enabling more accurate and interpretable explanations. Additionally, Group Relative Policy Optimization (GRPO) was leveraged to enhance the model's ability to detect critical domain-specific cues-such as defensive layouts and key military structures-while minimizing false positives on civilian scenes. Through empirical evaluations, it has been shown that MilChat significantly outperforms both larger, general-purpose multimodal models and existing remote sensing-adapted approaches on open-ended captioning and classification metrics. Over 80% recall and 98% precision were achieved on the newly proposed MilData benchmark, underscoring the potency of targeted fine-tuning and reinforcement learning in specialized real-world applications.

[2] Vision Foundation Model Embedding-Based Semantic Anomaly Detection

Max Peter Ronecker,Matthew Foutter,Amine Elhafsi,Daniele Gammelli,Ihor Barakaiev,Marco Pavone,Daniel Watzenig

Main category: cs.CV

TL;DR: 该论文提出了一种基于视觉基础模型的语义异常检测框架,通过比较运行时图像的局部嵌入与安全场景数据库,实现了高性能的异常检测与定位。

Details Motivation: 语义异常可能导致自主系统推理失败,因此需要一种有效的方法来检测这些异常。 Method: 提出两种框架变体:基于原始网格嵌入和基于实例分割的对象中心表示,并引入过滤机制减少误报。 Result: 在CARLA模拟异常测试中,基于实例分割的方法性能接近GPT-4o,并能精确定位异常。 Conclusion: 视觉基础模型的嵌入在自主系统实时异常检测中具有潜在应用价值。 Abstract: Semantic anomalies are contextually invalid or unusual combinations of familiar visual elements that can cause undefined behavior and failures in system-level reasoning for autonomous systems. This work explores semantic anomaly detection by leveraging the semantic priors of state-of-the-art vision foundation models, operating directly on the image. We propose a framework that compares local vision embeddings from runtime images to a database of nominal scenarios in which the autonomous system is deemed safe and performant. In this work, we consider two variants of the proposed framework: one using raw grid-based embeddings, and another leveraging instance segmentation for object-centric representations. To further improve robustness, we introduce a simple filtering mechanism to suppress false positives. Our evaluations on CARLA-simulated anomalies show that the instance-based method with filtering achieves performance comparable to GPT-4o, while providing precise anomaly localization. These results highlight the potential utility of vision embeddings from foundation models for real-time anomaly detection in autonomous systems.

[3] RDD: Robust Feature Detector and Descriptor using Deformable Transformer

Gonglin Chen,Tianwen Fu,Haiwei Chen,Wenbin Teng,Hanyuan Xiao,Yajie Zhao

Main category: cs.CV

TL;DR: 论文提出了一种基于可变形Transformer的鲁棒关键点检测与描述方法(RDD),通过可变形自注意力机制捕捉全局上下文和几何不变性,显著提升了在稀疏匹配任务中的性能。

Details Motivation: 在结构从运动和SLAM中,特征检测与描述在显著视角变化等挑战性场景下仍存在问题。现有方法未能有效学习长距离关系中的视觉线索。 Method: 提出RDD方法,利用可变形Transformer的可变形自注意力机制,聚焦关键位置,降低搜索空间复杂度并建模几何不变性。同时结合Air-to-Ground和MegaDepth数据集进行训练。 Result: RDD在稀疏匹配任务中优于所有现有方法,并支持半稠密匹配。论文还引入了两个新基准测试,分别针对大视角/尺度变化和空对地场景。 Conclusion: RDD通过全局上下文和几何不变性建模,显著提升了关键点检测与描述的鲁棒性,适用于复杂场景。 Abstract: As a core step in structure-from-motion and SLAM, robust feature detection and description under challenging scenarios such as significant viewpoint changes remain unresolved despite their ubiquity. While recent works have identified the importance of local features in modeling geometric transformations, these methods fail to learn the visual cues present in long-range relationships. We present Robust Deformable Detector (RDD), a novel and robust keypoint detector/descriptor leveraging the deformable transformer, which captures global context and geometric invariance through deformable self-attention mechanisms. Specifically, we observed that deformable attention focuses on key locations, effectively reducing the search space complexity and modeling the geometric invariance. Furthermore, we collected an Air-to-Ground dataset for training in addition to the standard MegaDepth dataset. Our proposed method outperforms all state-of-the-art keypoint detection/description methods in sparse matching tasks and is also capable of semi-dense matching. To ensure comprehensive evaluation, we introduce two challenging benchmarks: one emphasizing large viewpoint and scale variations, and the other being an Air-to-Ground benchmark -- an evaluation setting that has recently gaining popularity for 3D reconstruction across different altitudes.

[4] Visually Interpretable Subtask Reasoning for Visual Question Answering

Yu Cheng,Arushi Goel,Hakan Bilen

Main category: cs.CV

TL;DR: VISTAR是一种基于子任务驱动的训练框架,通过生成文本和视觉解释提升多模态大语言模型(MLLMs)的可解释性和推理能力,同时提高准确性。

Details Motivation: 解决现有方法在复杂视觉问题中计算成本高且准确性低的问题。 Method: 通过微调MLLMs生成结构化的子任务思维推理序列(Subtask-of-Thought rationales),无需依赖外部模型。 Result: 在两个基准测试中,VISTAR显著提高了推理准确性并保持了可解释性。 Conclusion: VISTAR为复杂视觉问题的多步推理提供了一种高效且可解释的解决方案。 Abstract: Answering complex visual questions like `Which red furniture can be used for sitting?' requires multi-step reasoning, including object recognition, attribute filtering, and relational understanding. Recent work improves interpretability in multimodal large language models (MLLMs) by decomposing tasks into sub-task programs, but these methods are computationally expensive and less accurate due to poor adaptation to target data. To address this, we introduce VISTAR (Visually Interpretable Subtask-Aware Reasoning Model), a subtask-driven training framework that enhances both interpretability and reasoning by generating textual and visual explanations within MLLMs. Instead of relying on external models, VISTAR fine-tunes MLLMs to produce structured Subtask-of-Thought rationales (step-by-step reasoning sequences). Experiments on two benchmarks show that VISTAR consistently improves reasoning accuracy while maintaining interpretability. Our code and dataset will be available at https://github.com/ChengJade/VISTAR.

[5] Multi-modal wound classification using wound image and location by Xception and Gaussian Mixture Recurrent Neural Network (GMRNN)

Ramin Mousa,Ehsan Matbooe,Hakimeh Khojasteh,Amirali Bengari,Mohammadmahdi Vahediahmar

Main category: cs.CV

TL;DR: 提出了一种基于迁移学习的多模态AI模型,结合Xception和GMRNN架构,用于伤口分类,显著提高了诊断准确性。

Details Motivation: 急性及难愈伤口的有效诊断对临床护理至关重要,传统方法常因感染、血管疾病等因素导致效果不佳。AI工具可加速医学图像解读,提升早期检测能力。 Method: 通过迁移学习算法提取特征,并结合位置特征,构建多模态网络,对糖尿病、压力、手术和静脉溃疡进行分类。 Result: 实验结果显示,伤口分类准确率在78.77%至100%之间,显著优于传统深度神经网络。 Conclusion: 该方法在常见伤口类型的分类中表现出卓越的准确性,为临床诊断提供了高效工具。 Abstract: The effective diagnosis of acute and hard-to-heal wounds is crucial for wound care practitioners to provide effective patient care. Poor clinical outcomes are often linked to infection, peripheral vascular disease, and increasing wound depth, which collectively exacerbate these comorbidities. However, diagnostic tools based on Artificial Intelligence (AI) speed up the interpretation of medical images and improve early detection of disease. In this article, we propose a multi-modal AI model based on transfer learning (TL), which combines two state-of-the-art architectures, Xception and GMRNN, for wound classification. The multi-modal network is developed by concatenating the features extracted by a transfer learning algorithm and location features to classify the wound types of diabetic, pressure, surgical, and venous ulcers. The proposed method is comprehensively compared with deep neural networks (DNN) for medical image analysis. The experimental results demonstrate a notable wound-class classifications (containing only diabetic, pressure, surgical, and venous) vary from 78.77 to 100\% in various experiments. The results presented in this study showcase the exceptional accuracy of the proposed methodology in accurately classifying the most commonly occurring wound types using wound images and their corresponding locations.

[6] Topology-Guided Knowledge Distillation for Efficient Point Cloud Processing

Luu Tung Hai,Thinh D. Le,Zhicheng Ding,Qing Tian,Truong-Son Hy

Main category: cs.CV

TL;DR: 提出了一种新颖的蒸馏框架,通过拓扑感知表示和梯度引导知识蒸馏,将高性能点云模型压缩为轻量级模型,显著减少模型大小和推理时间。

Details Motivation: 解决高性能点云模型(如Point Transformer V3)在资源受限环境中部署的挑战。 Method: 利用拓扑感知表示和梯度引导知识蒸馏,从高容量教师模型向轻量级学生模型传递知识。 Result: 在Nuscenes、SemanticKITTI和Waymo数据集上表现优异,模型大小减少约16倍,推理时间降低近1.9倍,并在NuScenes上达到知识蒸馏技术的SOTA性能。 Conclusion: 该方法有效解决了点云模型在资源受限环境中的部署问题,同时保持了高性能。 Abstract: Point cloud processing has gained significant attention due to its critical role in applications such as autonomous driving and 3D object recognition. However, deploying high-performance models like Point Transformer V3 in resource-constrained environments remains challenging due to their high computational and memory demands. This work introduces a novel distillation framework that leverages topology-aware representations and gradient-guided knowledge distillation to effectively transfer knowledge from a high-capacity teacher to a lightweight student model. Our approach captures the underlying geometric structures of point clouds while selectively guiding the student model's learning process through gradient-based feature alignment. Experimental results in the Nuscenes, SemanticKITTI, and Waymo datasets demonstrate that the proposed method achieves competitive performance, with an approximately 16x reduction in model size and a nearly 1.9x decrease in inference time compared to its teacher model. Notably, on NuScenes, our method achieves state-of-the-art performance among knowledge distillation techniques trained solely on LiDAR data, surpassing prior knowledge distillation baselines in segmentation performance. Our implementation is available publicly at: https://github.com/HySonLab/PointDistill

[7] Sleep Position Classification using Transfer Learning for Bed-based Pressure Sensors

Olivier Papillon,Rafik Goubran,James Green,Julien Larivière-Chartier,Caitlin Higginson,Frank Knoefel,Rébecca Robillard

Main category: cs.CV

TL;DR: 利用预训练的Vision Transformer模型(ViTMAE和ViTPose)对低分辨率压力敏感垫数据进行睡眠姿势分类,优于传统方法。

Details Motivation: 睡眠姿势影响睡眠质量和睡眠障碍(如呼吸暂停),但临床环境中标记数据稀缺,需高效解决方案。 Method: 采用迁移学习,利用预训练的ViTMAE和ViTPose模型,对低分辨率数据进行分类。 Result: 在112晚患者数据上表现优异,验证了高分辨率数据集的效果。 Conclusion: 该方法在临床环境中具有实际应用潜力。 Abstract: Bed-based pressure-sensitive mats (PSMs) offer a non-intrusive way of monitoring patients during sleep. We focus on four-way sleep position classification using data collected from a PSM placed under a mattress in a sleep clinic. Sleep positions can affect sleep quality and the prevalence of sleep disorders, such as apnea. Measurements were performed on patients with suspected sleep disorders referred for assessments at a sleep clinic. Training deep learning models can be challenging in clinical settings due to the need for large amounts of labeled data. To overcome the shortage of labeled training data, we utilize transfer learning to adapt pre-trained deep learning models to accurately estimate sleep positions from a low-resolution PSM dataset collected in a polysomnography sleep lab. Our approach leverages Vision Transformer models pre-trained on ImageNet using masked autoencoding (ViTMAE) and a pre-trained model for human pose estimation (ViTPose). These approaches outperform previous work from PSM-based sleep pose classification using deep learning (TCN) as well as traditional machine learning models (SVM, XGBoost, Random Forest) that use engineered features. We evaluate the performance of sleep position classification from 112 nights of patient recordings and validate it on a higher resolution 13-patient dataset. Despite the challenges of differentiating between sleep positions from low-resolution PSM data, our approach shows promise for real-world deployment in clinical settings

[8] Now you see it, Now you don't: Damage Label Agreement in Drone & Satellite Post-Disaster Imagery

Thomas Manzini,Priyankari Perali,Jayesh Tripathi,Robin Murphy

Main category: cs.CV

TL;DR: 本文通过对比卫星和无人机图像对15,814栋建筑的损坏标签,发现29.02%的标签不一致,且两种来源的分布差异显著,这可能对机器学习损坏评估系统的部署带来风险和潜在危害。

Details Motivation: 目前尚无研究探讨无人机和卫星图像在建筑损坏评估中的标签一致性。现有研究因标签标准不同、建筑位置未对齐和数据量不足而受限。 Method: 本研究克服了这些限制,通过使用相同的损坏标签标准和建筑位置,对比了三次飓风中的15,814栋建筑的数据。 Result: 分析发现,卫星标签比无人机标签至少低估了20.43%的损坏(p<1.2x10^-117),且两种标签的分布差异显著(p<5.1x10^-175)。 Conclusion: 这种差异表明,基于其中一种标签训练的计算机视觉和机器学习模型可能无法准确反映实际情况,从而带来伦理风险和社会危害。为此,本文提出了四条建议以提高可靠性和透明度。 Abstract: This paper audits damage labels derived from coincident satellite and drone aerial imagery for 15,814 buildings across Hurricanes Ian, Michael, and Harvey, finding 29.02% label disagreement and significantly different distributions between the two sources, which presents risks and potential harms during the deployment of machine learning damage assessment systems. Currently, there is no known study of label agreement between drone and satellite imagery for building damage assessment. The only prior work that could be used to infer if such imagery-derived labels agree is limited by differing damage label schemas, misaligned building locations, and low data quantities. This work overcomes these limitations by comparing damage labels using the same damage label schemas and building locations from three hurricanes, with the 15,814 buildings representing 19.05 times more buildings considered than the most relevant prior work. The analysis finds satellite-derived labels significantly under-report damage by at least 20.43% compared to drone-derived labels (p<1.2x10^-117), and satellite- and drone-derived labels represent significantly different distributions (p<5.1x10^-175). This indicates that computer vision and machine learning (CV/ML) models trained on at least one of these distributions will misrepresent actual conditions, as the differing satellite and drone-derived distributions cannot simultaneously represent the distribution of actual conditions in a scene. This potential misrepresentation poses ethical risks and potential societal harm if not managed. To reduce the risk of future societal harms, this paper offers four recommendations to improve reliability and transparency to decisio-makers when deploying CV/ML damage assessment systems in practice

[9] JSover: Joint Spectrum Estimation and Multi-Material Decomposition from Single-Energy CT Projections

Qing Wu,Hongjiang Wei,Jingyi Yu,S. Kevin Zhou,Yuyao Zhang

Main category: cs.CV

TL;DR: JSover是一种新型的单能CT多材料分解(SEMMD)框架,通过一步法直接从投影数据中重建多材料组成和估计能量谱,显著提高了分解的准确性和可靠性。

Details Motivation: 传统SEMMD方法采用两步法,忽略了组织能量依赖性衰减,导致严重的非线性束硬化伪影和噪声。JSover旨在通过一步法解决这一问题。 Method: JSover结合物理先验知识,直接从单能CT投影中联合重建多材料组成和估计能量谱,并引入隐式神经表示(INR)作为无监督深度学习求解器。 Result: 实验表明,JSover在模拟和真实CT数据集上均优于现有SEMMD方法,具有更高的准确性和计算效率。 Conclusion: JSover通过一步法和INR的引入,显著提升了SEMMD的性能,为临床提供了更可靠的定量分析工具。 Abstract: Multi-material decomposition (MMD) enables quantitative reconstruction of tissue compositions in the human body, supporting a wide range of clinical applications. However, traditional MMD typically requires spectral CT scanners and pre-measured X-ray energy spectra, significantly limiting clinical applicability. To this end, various methods have been developed to perform MMD using conventional (i.e., single-energy, SE) CT systems, commonly referred to as SEMMD. Despite promising progress, most SEMMD methods follow a two-step image decomposition pipeline, which first reconstructs monochromatic CT images using algorithms such as FBP, and then performs decomposition on these images. The initial reconstruction step, however, neglects the energy-dependent attenuation of human tissues, introducing severe nonlinear beam hardening artifacts and noise into the subsequent decomposition. This paper proposes JSover, a fundamentally reformulated one-step SEMMD framework that jointly reconstructs multi-material compositions and estimates the energy spectrum directly from SECT projections. By explicitly incorporating physics-informed spectral priors into the SEMMD process, JSover accurately simulates a virtual spectral CT system from SE acquisitions, thereby improving the reliability and accuracy of decomposition. Furthermore, we introduce implicit neural representation (INR) as an unsupervised deep learning solver for representing the underlying material maps. The inductive bias of INR toward continuous image patterns constrains the solution space and further enhances estimation quality. Extensive experiments on both simulated and real CT datasets show that JSover outperforms state-of-the-art SEMMD methods in accuracy and computational efficiency.

[10] SLAG: Scalable Language-Augmented Gaussian Splatting

Laszlo Szilagyi,Francis Engelmann,Jeannette Bohg

Main category: cs.CV

TL;DR: SLAG是一个多GPU框架,用于语言增强的高斯泼溅,提升大规模场景嵌入的速度和可扩展性。

Details Motivation: 解决时间敏感和大数据量的机器人应用场景中快速编码和计算资源受限的挑战。 Method: 集成2D视觉语言模型特征到3D场景,通过归一化加权平均计算语言嵌入,无需损失函数。 Result: 在16-GPU设置下比OpenGaussian快18倍,同时在ScanNet和LERF数据集上保持嵌入质量。 Conclusion: SLAG提供了一种高效且可扩展的语言增强场景表示方法,适用于资源受限的机器人应用。 Abstract: Language-augmented scene representations hold great promise for large-scale robotics applications such as search-and-rescue, smart cities, and mining. Many of these scenarios are time-sensitive, requiring rapid scene encoding while also being data-intensive, necessitating scalable solutions. Deploying these representations on robots with limited computational resources further adds to the challenge. To address this, we introduce SLAG, a multi-GPU framework for language-augmented Gaussian splatting that enhances the speed and scalability of embedding large scenes. Our method integrates 2D visual-language model features into 3D scenes using SAM and CLIP. Unlike prior approaches, SLAG eliminates the need for a loss function to compute per-Gaussian language embeddings. Instead, it derives embeddings from 3D Gaussian scene parameters via a normalized weighted average, enabling highly parallelized scene encoding. Additionally, we introduce a vector database for efficient embedding storage and retrieval. Our experiments show that SLAG achieves an 18 times speedup in embedding computation on a 16-GPU setup compared to OpenGaussian, while preserving embedding quality on the ScanNet and LERF datasets. For more details, visit our project website: https://slag-project.github.io/.

[11] Asynchronous Multi-Object Tracking with an Event Camera

Angus Apps,Ziwei Wang,Vladimir Perejogin,Timothy Molloy,Robert Mahony

Main category: cs.CV

TL;DR: AEMOT算法通过异步处理事件相机数据,检测和跟踪多个动态物体,性能优于现有方法37%。

Details Motivation: 事件相机在动态环境中具有低延迟、高时间分辨率和高动态范围的优势,适合多目标跟踪。 Method: AEMOT通过识别一致光流区域检测特征,使用AEB跟踪器构建候选对象,并通过分类验证阶段筛选对象。 Result: 在Bee Swarm数据集上,AEMOT的精确率和召回率超过其他事件跟踪算法37%。 Conclusion: AEMOT在动态多目标跟踪中表现优异,代码和数据集将开源。 Abstract: Events cameras are ideal sensors for enabling robots to detect and track objects in highly dynamic environments due to their low latency output, high temporal resolution, and high dynamic range. In this paper, we present the Asynchronous Event Multi-Object Tracking (AEMOT) algorithm for detecting and tracking multiple objects by processing individual raw events asynchronously. AEMOT detects salient event blob features by identifying regions of consistent optical flow using a novel Field of Active Flow Directions built from the Surface of Active Events. Detected features are tracked as candidate objects using the recently proposed Asynchronous Event Blob (AEB) tracker in order to construct small intensity patches of each candidate object. A novel learnt validation stage promotes or discards candidate objects based on classification of their intensity patches, with promoted objects having their position, velocity, size, and orientation estimated at their event rate. We evaluate AEMOT on a new Bee Swarm Dataset, where it tracks dozens of small bees with precision and recall performance exceeding that of alternative event-based detection and tracking algorithms by over 37%. Source code and the labelled event Bee Swarm Dataset will be open sourced

[12] MoKD: Multi-Task Optimization for Knowledge Distillation

Zeeshan Hayder,Ali Cheraghian,Lars Petersson,Mehrtash Harandi

Main category: cs.CV

TL;DR: MoKD通过多任务优化解决知识蒸馏中的梯度冲突和梯度主导问题,提升了模型性能。

Details Motivation: 解决知识蒸馏中任务目标和教师指导之间的平衡问题,以及师生模型知识表示差异的挑战。 Method: 将知识蒸馏重新定义为多目标优化问题,并引入子空间学习框架改进知识传递。 Result: 在ImageNet-1K和COCO数据集上,MoKD表现优于现有方法,达到最先进性能。 Conclusion: MoKD在知识蒸馏中实现了高效且高性能的模型训练。 Abstract: Compact models can be effectively trained through Knowledge Distillation (KD), a technique that transfers knowledge from larger, high-performing teacher models. Two key challenges in Knowledge Distillation (KD) are: 1) balancing learning from the teacher's guidance and the task objective, and 2) handling the disparity in knowledge representation between teacher and student models. To address these, we propose Multi-Task Optimization for Knowledge Distillation (MoKD). MoKD tackles two main gradient issues: a) Gradient Conflicts, where task-specific and distillation gradients are misaligned, and b) Gradient Dominance, where one objective's gradient dominates, causing imbalance. MoKD reformulates KD as a multi-objective optimization problem, enabling better balance between objectives. Additionally, it introduces a subspace learning framework to project feature representations into a high-dimensional space, improving knowledge transfer. Our MoKD is demonstrated to outperform existing methods through extensive experiments on image classification using the ImageNet-1K dataset and object detection using the COCO dataset, achieving state-of-the-art performance with greater efficiency. To the best of our knowledge, MoKD models also achieve state-of-the-art performance compared to models trained from scratch.

[13] Empowering Vision Transformers with Multi-Scale Causal Intervention for Long-Tailed Image Classification

Xiaoshuo Yan,Zhaochuan Li,Lei Meng,Zhuang Qi,Wei Wu,Zixuan Li,Xiangxu Meng

Main category: cs.CV

TL;DR: 本文提出了一种名为TSCNet的两阶段因果建模方法,旨在解决视觉变换器(ViT)在长尾分类中因全局特征表示而难以建模细粒度因果关联的问题。

Details Motivation: 现有因果模型在ViT上的性能提升有限,因其全局特征表示难以捕捉细粒度特征与预测的关联,导致尾部类别分类困难。 Method: TSCNet通过多尺度因果干预分两阶段建模:1) 层次因果表示学习(HCRL),解耦背景与对象并进行干预;2) 反事实对数偏差校准(CLBC),优化决策边界。 Result: 实验表明,TSCNet能有效消除数据不平衡引入的多种偏差,性能优于现有方法。 Conclusion: TSCNet通过细粒度因果建模和反事实校准,显著提升了ViT在长尾分类中的表现。 Abstract: Causal inference has emerged as a promising approach to mitigate long-tail classification by handling the biases introduced by class imbalance. However, along with the change of advanced backbone models from Convolutional Neural Networks (CNNs) to Visual Transformers (ViT), existing causal models may not achieve an expected performance gain. This paper investigates the influence of existing causal models on CNNs and ViT variants, highlighting that ViT's global feature representation makes it hard for causal methods to model associations between fine-grained features and predictions, which leads to difficulties in classifying tail classes with similar visual appearance. To address these issues, this paper proposes TSCNet, a two-stage causal modeling method to discover fine-grained causal associations through multi-scale causal interventions. Specifically, in the hierarchical causal representation learning stage (HCRL), it decouples the background and objects, applying backdoor interventions at both the patch and feature level to prevent model from using class-irrelevant areas to infer labels which enhances fine-grained causal representation. In the counterfactual logits bias calibration stage (CLBC), it refines the optimization of model's decision boundary by adaptive constructing counterfactual balanced data distribution to remove the spurious associations in the logits caused by data distribution. Extensive experiments conducted on various long-tail benchmarks demonstrate that the proposed TSCNet can eliminate multiple biases introduced by data imbalance, which outperforms existing methods.

[14] Monocular Depth Guided Occlusion-Aware Disparity Refinement via Semi-supervised Learning in Laparoscopic Images

Ziteng Liu,Dongdong He,Chenghong Zhang,Wenpeng Gao,Yili Fu

Main category: cs.CV

TL;DR: 本文提出了一种深度引导的遮挡感知视差细化网络(DGORNet),通过利用不受遮挡影响的单目深度信息来优化视差图,解决了腹腔镜立体图像视差估计中的遮挡和标记数据稀缺问题。

Details Motivation: 腹腔镜立体图像视差估计中的遮挡问题和标记数据稀缺是主要挑战,需要一种能够有效利用未标记数据并提升视差图质量的方法。 Method: 提出DGORNet,结合位置嵌入(PE)模块提供显式空间上下文,并引入光流差异损失(OFDLoss)利用视频帧的时间连续性提升鲁棒性。 Result: 在SCARED数据集上,DGORNet在端到端误差(EPE)和均方根误差(RMSE)上优于现有方法,尤其在遮挡和无纹理区域表现突出。 Conclusion: DGORNet通过位置嵌入和光流差异损失的结合,显著提升了腹腔镜手术视差估计的空间和时间一致性,为解决视差估计和数据限制提供了实用方案。 Abstract: Occlusion and the scarcity of labeled surgical data are significant challenges in disparity estimation for stereo laparoscopic images. To address these issues, this study proposes a Depth Guided Occlusion-Aware Disparity Refinement Network (DGORNet), which refines disparity maps by leveraging monocular depth information unaffected by occlusion. A Position Embedding (PE) module is introduced to provide explicit spatial context, enhancing the network's ability to localize and refine features. Furthermore, we introduce an Optical Flow Difference Loss (OFDLoss) for unlabeled data, leveraging temporal continuity across video frames to improve robustness in dynamic surgical scenes. Experiments on the SCARED dataset demonstrate that DGORNet outperforms state-of-the-art methods in terms of End-Point Error (EPE) and Root Mean Squared Error (RMSE), particularly in occlusion and texture-less regions. Ablation studies confirm the contributions of the Position Embedding and Optical Flow Difference Loss, highlighting their roles in improving spatial and temporal consistency. These results underscore DGORNet's effectiveness in enhancing disparity estimation for laparoscopic surgery, offering a practical solution to challenges in disparity estimation and data limitations.

[15] Unsupervised Raindrop Removal from a Single Image using Conditional Diffusion Models

Lhuqita Fazry,Valentino Vito

Main category: cs.CV

TL;DR: 提出了一种基于扩散模型的单图像雨滴去除新方法。

Details Motivation: 单图像雨滴去除任务具有挑战性,现有方法多依赖GAN,而扩散模型在图像修复中表现优异。 Method: 采用扩散模型进行图像修复,结合雨滴区域检测。 Result: 实现了基于扩散模型的雨滴去除,性能优于传统方法。 Conclusion: 扩散模型为单图像雨滴去除提供了新的有效解决方案。 Abstract: Raindrop removal is a challenging task in image processing. Removing raindrops while relying solely on a single image further increases the difficulty of the task. Common approaches include the detection of raindrop regions in the image, followed by performing a background restoration process conditioned on those regions. While various methods can be applied for the detection step, the most common architecture used for background restoration is the Generative Adversarial Network (GAN). Recent advances in the use of diffusion models have led to state-of-the-art image inpainting techniques. In this paper, we introduce a novel technique for raindrop removal from a single image using diffusion-based image inpainting.

[16] ADC-GS: Anchor-Driven Deformable and Compressed Gaussian Splatting for Dynamic Scene Reconstruction

He Huang,Qi Yang,Mufan Liu,Yiling Xu,Zhu Li

Main category: cs.CV

TL;DR: ADC-GS提出了一种基于锚点的动态场景重建方法,通过分层处理和速率失真优化,显著提升了渲染速度和存储效率。

Details Motivation: 现有4D高斯泼溅方法忽略了相邻高斯基元间的冗余,导致性能不佳。 Method: ADC-GS采用锚点驱动结构,结合分层粗到细的管道和速率失真优化。 Result: 渲染速度提升300%-800%,存储效率达到最优,且不影响渲染质量。 Conclusion: ADC-GS是一种高效且紧凑的动态场景表示方法。 Abstract: Existing 4D Gaussian Splatting methods rely on per-Gaussian deformation from a canonical space to target frames, which overlooks redundancy among adjacent Gaussian primitives and results in suboptimal performance. To address this limitation, we propose Anchor-Driven Deformable and Compressed Gaussian Splatting (ADC-GS), a compact and efficient representation for dynamic scene reconstruction. Specifically, ADC-GS organizes Gaussian primitives into an anchor-based structure within the canonical space, enhanced by a temporal significance-based anchor refinement strategy. To reduce deformation redundancy, ADC-GS introduces a hierarchical coarse-to-fine pipeline that captures motions at varying granularities. Moreover, a rate-distortion optimization is adopted to achieve an optimal balance between bitrate consumption and representation fidelity. Experimental results demonstrate that ADC-GS outperforms the per-Gaussian deformation approaches in rendering speed by 300%-800% while achieving state-of-the-art storage efficiency without compromising rendering quality. The code is released at https://github.com/H-Huang774/ADC-GS.git.

[17] Visual Watermarking in the Era of Diffusion Models: Advances and Challenges

Junxian Duan,Jiyang Guang,Wenkui Yang,Ran He

Main category: cs.CV

TL;DR: 论文探讨了生成式AI技术(如Stable Diffusion)对视觉内容的潜在滥用风险,提出利用扩散模型嵌入不可感知且鲁棒的水印作为解决方案,并分析了其优势与挑战。

Details Motivation: 随着生成式AI技术的进步,视觉内容容易被滥用,引发版权问题。传统检测方法难以应对复杂操作,需要更有效的水印技术保护数字内容所有权。 Method: 通过扩散模型学习特征,嵌入不可感知且鲁棒的水印,提升检测准确性。研究扩散模型与水印技术的结合,分析其鲁棒性和应用。 Result: 扩散模型能够有效提升水印的鲁棒性和检测准确性,为数字内容保护提供创新解决方案。 Conclusion: 在生成式AI时代,开发创新的水印技术对保护数字内容所有权至关重要,扩散模型为此提供了新的可能性。 Abstract: As generative artificial intelligence technologies like Stable Diffusion advance, visual content becomes more vulnerable to misuse, raising concerns about copyright infringement. Visual watermarks serve as effective protection mechanisms, asserting ownership and deterring unauthorized use. Traditional deepfake detection methods often rely on passive techniques that struggle with sophisticated manipulations. In contrast, diffusion models enhance detection accuracy by allowing for the effective learning of features, enabling the embedding of imperceptible and robust watermarks. We analyze the strengths and challenges of watermark techniques related to diffusion models, focusing on their robustness and application in watermark generation. By exploring the integration of advanced diffusion models and watermarking security, we aim to advance the discourse on preserving watermark robustness against evolving forgery threats. It emphasizes the critical importance of developing innovative solutions to protect digital content and ensure the preservation of ownership rights in the era of generative AI.

[18] Object detection in adverse weather conditions for autonomous vehicles using Instruct Pix2Pix

Unai Gurbindo,Axel Brando,Jaume Abella,Caroline König

Main category: cs.CV

TL;DR: 提出了一种利用扩散模型Instruct Pix2Pix生成天气增强数据的方法,以提升目标检测模型在恶劣天气下的鲁棒性。

Details Motivation: 恶劣天气条件下目标检测系统的鲁棒性对自动驾驶技术发展至关重要。 Method: 使用Instruct Pix2Pix生成天气增强数据,并在CARLA模拟器和真实数据集(BDD100K、ACDC)上进行实验验证。 Result: 量化了目标检测模型在恶劣天气下的性能差距,并证明定制数据增强策略能显著提升模型鲁棒性。 Conclusion: 为提升感知系统在恶劣环境中的可靠性奠定了基础,并为自动驾驶技术的未来发展提供了路径。 Abstract: Enhancing the robustness of object detection systems under adverse weather conditions is crucial for the advancement of autonomous driving technology. This study presents a novel approach leveraging the diffusion model Instruct Pix2Pix to develop prompting methodologies that generate realistic datasets with weather-based augmentations aiming to mitigate the impact of adverse weather on the perception capabilities of state-of-the-art object detection models, including Faster R-CNN and YOLOv10. Experiments were conducted in two environments, in the CARLA simulator where an initial evaluation of the proposed data augmentation was provided, and then on the real-world image data sets BDD100K and ACDC demonstrating the effectiveness of the approach in real environments. The key contributions of this work are twofold: (1) identifying and quantifying the performance gap in object detection models under challenging weather conditions, and (2) demonstrating how tailored data augmentation strategies can significantly enhance the robustness of these models. This research establishes a solid foundation for improving the reliability of perception systems in demanding environmental scenarios, and provides a pathway for future advancements in autonomous driving.

[19] HMPNet: A Feature Aggregation Architecture for Maritime Object Detection from a Shipborne Perspective

Yu Zhang,Fengyuan Liu,Juan Lyu,Yi Wei,Changdong Yu

Main category: cs.CV

TL;DR: 论文提出Navigation12数据集和HMPNet模型,用于船舶视角下的目标检测,HMPNet在精度和计算效率上优于现有方法。

Details Motivation: 解决海上目标检测数据稀缺问题,提升船舶视觉感知能力。 Method: 提出Navigation12数据集,设计HMPNet模型,包含动态调制主干、多尺度特征聚合结构和共享权重检测器。 Result: HMPNet在mAP上优于YOLOv11n 3.3%,参数减少23%。 Conclusion: HMPNet为海上目标检测提供了高效解决方案,具有实际应用潜力。 Abstract: In the realm of intelligent maritime navigation, object detection from a shipborne perspective is paramount. Despite the criticality, the paucity of maritime-specific data impedes the deployment of sophisticated visual perception techniques, akin to those utilized in autonomous vehicular systems, within the maritime context. To bridge this gap, we introduce Navigation12, a novel dataset annotated for 12 object categories under diverse maritime environments and weather conditions. Based upon this dataset, we propose HMPNet, a lightweight architecture tailored for shipborne object detection. HMPNet incorporates a hierarchical dynamic modulation backbone to bolster feature aggregation and expression, complemented by a matrix cascading poly-scale neck and a polymerization weight sharing detector, facilitating efficient multi-scale feature aggregation. Empirical evaluations indicate that HMPNet surpasses current state-of-the-art methods in terms of both accuracy and computational efficiency, realizing a 3.3% improvement in mean Average Precision over YOLOv11n, the prevailing model, and reducing parameters by 23%.

[20] G-MSGINet: A Grouped Multi-Scale Graph-Involution Network for Contactless Fingerprint Recognition

Santhoshkumar Peddi,Soham Bandyopadhyay,Debasis Samanta

Main category: cs.CV

TL;DR: G-MSGINet是一个高效的无接触指纹识别框架,通过GMSGI层联合实现细节定位和身份嵌入,无需复杂预处理,显著提升性能。

Details Motivation: 现有方法依赖多分支架构或复杂预处理,限制了实际应用中的扩展性和泛化能力。 Method: 引入GMSGI层,结合像素级卷积、动态多尺度核生成和图关系建模,通过端到端优化逐步细化特征。 Result: 在三个数据集上,F1分数达0.83±0.02,Rank-1准确率97.0%-99.1%,EER低至0.5%,参数和计算量大幅减少。 Conclusion: G-MSGINet在性能和效率上显著优于现有方法,适用于实际生物识别场景。 Abstract: This paper presents G-MSGINet, a unified and efficient framework for robust contactless fingerprint recognition that jointly performs minutiae localization and identity embedding directly from raw input images. Existing approaches rely on multi-branch architectures, orientation labels, or complex preprocessing steps, which limit scalability and generalization across real-world acquisition scenarios. In contrast, the proposed architecture introduces the GMSGI layer, a novel computational module that integrates grouped pixel-level involution, dynamic multi-scale kernel generation, and graph-based relational modelling into a single processing unit. Stacked GMSGI layers progressively refine both local minutiae-sensitive features and global topological representations through end-to-end optimization. The architecture eliminates explicit orientation supervision and adapts graph connectivity directly from learned kernel descriptors, thereby capturing meaningful structural relationships among fingerprint regions without fixed heuristics. Extensive experiments on three benchmark datasets, namely PolyU, CFPose, and Benchmark 2D/3D, demonstrate that G-MSGINet consistently achieves minutiae F1-scores in the range of $0.83\pm0.02$ and Rank-1 identification accuracies between 97.0% and 99.1%, while maintaining an Equal Error Rate (EER) as low as 0.5%. These results correspond to improvements of up to 4.8% in F1-score and 1.4% in Rank-1 accuracy when compared to prior methods, using only 0.38 million parameters and 6.63 giga floating-point operations, which represents up to ten times fewer parameters than competitive baselines. This highlights the scalability and effectiveness of G-MSGINet in real-world contactless biometric recognition scenarios.

[21] Removing Watermarks with Partial Regeneration using Semantic Information

Krti Tallam,John Kevin Cava,Caleb Geniesse,N. Benjamin Erichson,Michael W. Mahoney

Main category: cs.CV

TL;DR: 论文提出了一种名为SemanticRegen的三阶段攻击方法,能够有效擦除最先进的语义和隐形水印,同时保持图像的视觉质量。

Details Motivation: 随着AI生成图像的普及,隐形水印成为版权和来源保护的主要手段,但其对抗自适应攻击的鲁棒性尚未充分研究。 Method: SemanticRegen通过(i)使用视觉语言模型获取细粒度描述,(ii)零样本分割提取前景掩码,(iii)基于LLM引导的扩散模型仅修复背景,从而保留显著对象和风格线索。 Result: 在四种水印系统(TreeRing、StegaStamp、StableSig、DWT/DCT)上测试,SemanticRegen成功击败TreeRing水印(p=0.10>0.05),并将其他方案的比特准确率降至0.75以下,同时保持高感知质量(mSSIM=0.94)。 Conclusion: 研究揭示了当前水印防御与自适应语义攻击能力之间的差距,强调了开发对内容保留再生攻击具有鲁棒性的水印算法的紧迫性。 Abstract: As AI-generated imagery becomes ubiquitous, invisible watermarks have emerged as a primary line of defense for copyright and provenance. The newest watermarking schemes embed semantic signals - content-aware patterns that are designed to survive common image manipulations - yet their true robustness against adaptive adversaries remains under-explored. We expose a previously unreported vulnerability and introduce SemanticRegen, a three-stage, label-free attack that erases state-of-the-art semantic and invisible watermarks while leaving an image's apparent meaning intact. Our pipeline (i) uses a vision-language model to obtain fine-grained captions, (ii) extracts foreground masks with zero-shot segmentation, and (iii) inpaints only the background via an LLM-guided diffusion model, thereby preserving salient objects and style cues. Evaluated on 1,000 prompts across four watermarking systems - TreeRing, StegaStamp, StableSig, and DWT/DCT - SemanticRegen is the only method to defeat the semantic TreeRing watermark (p = 0.10 > 0.05) and reduces bit-accuracy below 0.75 for the remaining schemes, all while maintaining high perceptual quality (masked SSIM = 0.94 +/- 0.01). We further introduce masked SSIM (mSSIM) to quantify fidelity within foreground regions, showing that our attack achieves up to 12 percent higher mSSIM than prior diffusion-based attackers. These results highlight an urgent gap between current watermark defenses and the capabilities of adaptive, semantics-aware adversaries, underscoring the need for watermarking algorithms that are resilient to content-preserving regenerative attacks.

[22] EventDiff: A Unified and Efficient Diffusion Model Framework for Event-based Video Frame Interpolation

Hanle Zheng,Xujie Han,Zegang Peng,Shangbin Zhang,Guangxun Du,Zhuo Zou,Xilin Wang,Jibin Wu,Hao Guo,Lei Deng

Main category: cs.CV

TL;DR: EventDiff是一种基于事件的扩散模型框架,用于视频帧插值(VFI),通过直接在潜在空间中进行去噪扩散过程,提高了在多样化和挑战性场景中的鲁棒性,并在多个数据集上实现了最先进的性能。

Details Motivation: 视频帧插值在涉及大运动、遮挡和光照变化的条件下具有挑战性。现有基于事件的VFI方法依赖显式运动建模,在细微运动场景下可能影响高保真图像重建。扩散模型通过去噪过程重建帧,避免了显式运动估计的需求。 Method: 提出EventDiff框架,包括一个新颖的事件-帧混合自动编码器(HAE)和轻量级的时空交叉注意力(STCA)模块,通过两阶段训练策略(先预训练HAE,再与扩散模型联合优化)实现。 Result: 在Vimeo90K-Triplet上PSNR提升1.98dB,在SNU-FILM任务中表现优异,比现有扩散方法快4.24倍且PSNR增益达5.72dB。 Conclusion: EventDiff通过潜在空间去噪扩散过程,显著提升了VFI的性能和效率,适用于多样化和挑战性场景。 Abstract: Video Frame Interpolation (VFI) is a fundamental yet challenging task in computer vision, particularly under conditions involving large motion, occlusion, and lighting variation. Recent advancements in event cameras have opened up new opportunities for addressing these challenges. While existing event-based VFI methods have succeeded in recovering large and complex motions by leveraging handcrafted intermediate representations such as optical flow, these designs often compromise high-fidelity image reconstruction under subtle motion scenarios due to their reliance on explicit motion modeling. Meanwhile, diffusion models provide a promising alternative for VFI by reconstructing frames through a denoising process, eliminating the need for explicit motion estimation or warping operations. In this work, we propose EventDiff, a unified and efficient event-based diffusion model framework for VFI. EventDiff features a novel Event-Frame Hybrid AutoEncoder (HAE) equipped with a lightweight Spatial-Temporal Cross Attention (STCA) module that effectively fuses dynamic event streams with static frames. Unlike previous event-based VFI methods, EventDiff performs interpolation directly in the latent space via a denoising diffusion process, making it more robust across diverse and challenging VFI scenarios. Through a two-stage training strategy that first pretrains the HAE and then jointly optimizes it with the diffusion model, our method achieves state-of-the-art performance across multiple synthetic and real-world event VFI datasets. The proposed method outperforms existing state-of-the-art event-based VFI methods by up to 1.98dB in PSNR on Vimeo90K-Triplet and shows superior performance in SNU-FILM tasks with multiple difficulty levels. Compared to the emerging diffusion-based VFI approach, our method achieves up to 5.72dB PSNR gain on Vimeo90K-Triplet and 4.24X faster inference.

[23] Congenital Heart Disease recognition using Deep Learning/Transformer models

Aidar Amangeldi,Vladislav Yarovenko,Angsar Taigonyrov

Main category: cs.CV

TL;DR: 利用双模态(声音和图像)深度学习方法提高先天性心脏病(CHD)诊断准确率,分别在ZCHSound和DICOM胸部X光数据集上达到73.9%和80.72%的准确率。

Details Motivation: 先天性心脏病是婴儿发病和死亡的主要原因,现有非侵入性筛查方法存在假阴性问题,亟需更有效的诊断手段。 Method: 采用双模态(声音和图像)深度学习方法,自动提取特征以辅助医生诊断CHD。 Result: 在ZCHSound数据集上准确率为73.9%,在DICOM胸部X光数据集上准确率为80.72%。 Conclusion: 双模态深度学习方法在CHD诊断中表现出潜力,但仍需进一步提升准确率。 Abstract: Congenital Heart Disease (CHD) remains a leading cause of infant morbidity and mortality, yet non-invasive screening methods often yield false negatives. Deep learning models, with their ability to automatically extract features, can assist doctors in detecting CHD more effectively. In this work, we investigate the use of dual-modality (sound and image) deep learning methods for CHD diagnosis. We achieve 73.9% accuracy on the ZCHSound dataset and 80.72% accuracy on the DICOM Chest X-ray dataset.

[24] Identifying Memorization of Diffusion Models through p-Laplace Analysis

Jonathan Brokman,Amit Giloni,Omer Hofman,Roman Vainshtein,Hisashi Kojima,Guy Gilboa

Main category: cs.CV

TL;DR: 本文探讨了扩散模型能否利用估计的分数函数计算高阶微分(p-Laplace算子),并用于识别记忆的训练数据。通过数值近似方法,展示了其在概率分布特征识别中的有效性,并在高斯混合模型和图像生成模型中验证了结果。

Details Motivation: 扩散模型是目前领先的图像生成模型,但其分数函数(即扰动数据样本的对数概率梯度)的估计是否能用于计算高阶微分(如p-Laplace算子)尚不明确。本文旨在探索这一可能性,并利用p-Laplace算子识别记忆的训练数据。 Method: 提出了一种基于学习到的分数函数的数值p-Laplace近似方法,并在高斯混合模型的结构化案例中进行了分析。随后将方法扩展到图像生成模型,首次实现了基于p-Laplace算子的记忆识别。 Result: 研究表明,p-Laplace算子能有效识别概率分布的关键特征,并在高斯混合模型和图像生成模型中验证了其用于记忆识别的可行性。 Conclusion: 本文证明了扩散模型的分数函数可用于计算p-Laplace算子,并成功应用于记忆识别任务,为生成模型的分析提供了新工具。 Abstract: Diffusion models, today's leading image generative models, estimate the score function, i.e. the gradient of the log probability of (perturbed) data samples, without direct access to the underlying probability distribution. This work investigates whether the estimated score function can be leveraged to compute higher-order differentials, namely p-Laplace operators. We show here these operators can be employed to identify memorized training data. We propose a numerical p-Laplace approximation based on the learned score functions, showing its effectiveness in identifying key features of the probability landscape. We analyze the structured case of Gaussian mixture models, and demonstrate the results carry-over to image generative models, where memorization identification based on the p-Laplace operator is performed for the first time.

[25] CNN and ViT Efficiency Study on Tiny ImageNet and DermaMNIST Datasets

Aidar Amangeldi,Angsar Taigonyrov,Muhammad Huzaid Jawad,Chinedu Emmanuel Mbonu

Main category: cs.CV

TL;DR: 比较卷积和Transformer架构在医学和通用图像分类任务中的性能,发现适当微调的Vision Transformers在性能、推理速度和参数效率上优于ResNet-18。

Details Motivation: 评估卷积和Transformer架构在资源受限环境中的适用性,目标是降低推理延迟和模型复杂度。 Method: 以ResNet-18为基线,对四种Vision Transformer变体(Tiny、Small、Base、Large)进行微调,并在DermatologyMNIST和TinyImageNet上进行测试。 Result: 微调后的Vision Transformers性能匹配或超越基线,推理更快且参数更少。 Conclusion: Vision Transformers在资源受限环境中具有部署潜力。 Abstract: This study evaluates the trade-offs between convolutional and transformer-based architectures on both medical and general-purpose image classification benchmarks. We use ResNet-18 as our baseline and introduce a fine-tuning strategy applied to four Vision Transformer variants (Tiny, Small, Base, Large) on DermatologyMNIST and TinyImageNet. Our goal is to reduce inference latency and model complexity with acceptable accuracy degradation. Through systematic hyperparameter variations, we demonstrate that appropriately fine-tuned Vision Transformers can match or exceed the baseline's performance, achieve faster inference, and operate with fewer parameters, highlighting their viability for deployment in resource-constrained environments.

[26] Few-shot Novel Category Discovery

Chunming Li,Shidong Wang,Haofeng Zhang

Main category: cs.CV

TL;DR: 论文提出了一种名为FSNCD的新设置,结合半监督层次聚类(SHC)和不确定性感知K-means聚类(UKC),以解决新类别发现(NCD)在现实场景中的应用限制。

Details Motivation: 现有NCD方法在现实场景中应用受限,而少量标注数据可以缓解这一问题。论文旨在通过少量支持样本,实现模型在已知类别识别和新类别聚类任务间的灵活切换。 Method: 提出FSNCD框架,结合SHC和UKC方法,利用少量支持样本和先验知识,提升模型在开放集场景中的适应性。 Result: 在五个常用数据集上的实验表明,该方法在不同任务设置和场景下均达到领先性能。 Conclusion: FSNCD框架通过结合半监督和不确定性感知聚类,有效提升了新类别发现的灵活性和性能,适用于更广泛的现实场景。 Abstract: The recently proposed Novel Category Discovery (NCD) adapt paradigm of transductive learning hinders its application in more real-world scenarios. In fact, few labeled data in part of new categories can well alleviate this burden, which coincides with the ease that people can label few of new category data. Therefore, this paper presents a new setting in which a trained agent is able to flexibly switch between the tasks of identifying examples of known (labelled) classes and clustering novel (completely unlabeled) classes as the number of query examples increases by leveraging knowledge learned from only a few (handful) support examples. Drawing inspiration from the discovery of novel categories using prior-based clustering algorithms, we introduce a novel framework that further relaxes its assumptions to the real-world open set level by unifying the concept of model adaptability in few-shot learning. We refer to this setting as Few-Shot Novel Category Discovery (FSNCD) and propose Semi-supervised Hierarchical Clustering (SHC) and Uncertainty-aware K-means Clustering (UKC) to examine the model's reasoning capabilities. Extensive experiments and detailed analysis on five commonly used datasets demonstrate that our methods can achieve leading performance levels across different task settings and scenarios.

Yanbin Wei,Xuehao Wang,Zhan Zhuang,Yang Chen,Shuhao Chen,Yulong Zhang,Yu Zhang,James Kwok

Main category: cs.CV

TL;DR: 论文提出了一种名为Graph Vision Network (GVN)的框架,首次将视觉感知引入MPNNs,提升了链接预测任务的性能。

Details Motivation: 现有MPNNs和结构特征在链接预测中表现良好,但视觉感知的潜力被忽视,作者希望通过视觉增强提升性能。 Method: 提出GVN及其高效变体E-GVN,将视觉结构感知融入MPNNs。 Result: 在七个链接预测数据集上,GVN均表现出性能提升,并在大规模图上取得新的SOTA结果。 Conclusion: GVN为链接预测开辟了新方向,展示了视觉增强的潜力。 Abstract: Message-passing graph neural networks (MPNNs) and structural features (SFs) are cornerstones for the link prediction task. However, as a common and intuitive mode of understanding, the potential of visual perception has been overlooked in the MPNN community. For the first time, we equip MPNNs with vision structural awareness by proposing an effective framework called Graph Vision Network (GVN), along with a more efficient variant (E-GVN). Extensive empirical results demonstrate that with the proposed frameworks, GVN consistently benefits from the vision enhancement across seven link prediction datasets, including challenging large-scale graphs. Such improvements are compatible with existing state-of-the-art (SOTA) methods and GVNs achieve new SOTA results, thereby underscoring a promising novel direction for link prediction.

[28] IrrMap: A Large-Scale Comprehensive Dataset for Irrigation Method Mapping

Nibir Chandra Mandal,Oishee Bintey Hoque,Abhijin Adiga,Samarth Swarup,Mandy Wilson,Lu Feng,Yangfeng Ji,Miaomiao Zhang,Geoffrey Fox,Madhav Marathe

Main category: cs.CV

TL;DR: IrrMap是一个用于灌溉方法映射的大规模数据集,包含多分辨率卫星影像和辅助数据,覆盖美国西部多个州的农田,支持深度学习模型训练和基准测试。

Details Motivation: 提供首个大规模灌溉方法映射数据集,填补研究空白,支持农业和地理空间分析。 Method: 利用Landsat和Sentinel卫星影像,结合作物类型、土地利用等辅助数据,构建标准化数据集,并提供数据生成管道。 Result: 数据集覆盖168万多个农场和1410万英亩土地,包含多种灌溉方法分布和空间模式分析。 Conclusion: IrrMap为灌溉研究提供了丰富资源,并开源数据集和工具以促进进一步探索。 Abstract: We introduce IrrMap, the first large-scale dataset (1.1 million patches) for irrigation method mapping across regions. IrrMap consists of multi-resolution satellite imagery from LandSat and Sentinel, along with key auxiliary data such as crop type, land use, and vegetation indices. The dataset spans 1,687,899 farms and 14,117,330 acres across multiple western U.S. states from 2013 to 2023, providing a rich and diverse foundation for irrigation analysis and ensuring geospatial alignment and quality control. The dataset is ML-ready, with standardized 224x224 GeoTIFF patches, the multiple input modalities, carefully chosen train-test-split data, and accompanying dataloaders for seamless deep learning model training andbenchmarking in irrigation mapping. The dataset is also accompanied by a complete pipeline for dataset generation, enabling researchers to extend IrrMap to new regions for irrigation data collection or adapt it with minimal effort for other similar applications in agricultural and geospatial analysis. We also analyze the irrigation method distribution across crop groups, spatial irrigation patterns (using Shannon diversity indices), and irrigated area variations for both LandSat and Sentinel, providing insights into regional and resolution-based differences. To promote further exploration, we openly release IrrMap, along with the derived datasets, benchmark models, and pipeline code, through a GitHub repository: https://github.com/Nibir088/IrrMap and Data repository: https://huggingface.co/Nibir/IrrMap, providing comprehensive documentation and implementation details.

[29] Ultra Lowrate Image Compression with Semantic Residual Coding and Compression-aware Diffusion

Anle Ke,Xu Zhang,Tong Chen,Ming Lu,Chao Zhou,Jiawen Gu,Zhan Ma

Main category: cs.CV

TL;DR: ResULIC是一种基于残差信号引导的超低码率图像压缩方法,通过语义残差编码和压缩感知扩散模型提升重建质量和编码效率。

Details Motivation: 现有多模态大模型图像压缩框架在重建保真度和编码效率上表现不佳,需要更优的整合方法。 Method: 提出语义残差编码(SRC)捕捉语义差异,并设计压缩感知扩散模型(CDM)优化码率与扩散步长的对齐。 Result: 实验显示ResULIC在LPIPS和FID指标上分别节省80.7%和66.3%的BD-rate,优于现有方法。 Conclusion: ResULIC通过残差信号和扩散模型的协同优化,显著提升了超低码率图像压缩的性能。 Abstract: Existing multimodal large model-based image compression frameworks often rely on a fragmented integration of semantic retrieval, latent compression, and generative models, resulting in suboptimal performance in both reconstruction fidelity and coding efficiency. To address these challenges, we propose a residual-guided ultra lowrate image compression named ResULIC, which incorporates residual signals into both semantic retrieval and the diffusion-based generation process. Specifically, we introduce Semantic Residual Coding (SRC) to capture the semantic disparity between the original image and its compressed latent representation. A perceptual fidelity optimizer is further applied for superior reconstruction quality. Additionally, we present the Compression-aware Diffusion Model (CDM), which establishes an optimal alignment between bitrates and diffusion time steps, improving compression-reconstruction synergy. Extensive experiments demonstrate the effectiveness of ResULIC, achieving superior objective and subjective performance compared to state-of-the-art diffusion-based methods with - 80.7%, -66.3% BD-rate saving in terms of LPIPS and FID. Project page is available at https: //njuvision.github.io/ResULIC/.

[30] Disruptive Transformation of Artworks in Master-Disciple Relationships: The Case of Ukiyo-e Artworks

Honna Shinichi,Akira Matsui

Main category: cs.CV

TL;DR: 论文通过机器学习对浮世绘进行定量分析,发现其整体创造力随文化成熟下降,但风格创造力保持高水平。

Details Motivation: 传统艺术研究依赖主观判断,机器学习为东方绘画(如浮世绘)提供了新的定量分析方法。 Method: 使用11,000张高分辨率浮世绘图像,基于网络计算创造力,分析作品和艺术家的创造力。 Result: 浮世绘整体创造力随文化成熟下降,但风格创造力分段化并保持高水平。 Conclusion: 研究为浮世绘和东方艺术分析提供了新视角,揭示了其在文化历史中的演变。 Abstract: Artwork research has long relied on human sensibility and subjective judgment, but recent developments in machine learning have enabled the quantitative assessment of features that humans could not discover. In Western paintings, comprehensive analyses have been conducted from various perspectives in conjunction with large databases, but such extensive analysis has not been sufficiently conducted for Eastern paintings. Then, we focus on Ukiyo-e, a traditional Japanese art form, as a case study of Eastern paintings, and conduct a quantitative analysis of creativity in works of art using 11,000 high-resolution images. This involves using the concept of calculating creativity from networks to analyze both the creativity of the artwork and that of the artists. As a result, In terms of Ukiyo-e as a whole, it was found that the creativity of its appearance has declined with the maturation of culture, but in terms of style, it has become more segmented with the maturation of culture and has maintained a high level of creativity. This not only provides new insights into the study of Ukiyo-e but also shows how Ukiyo-e has evolved within the ongoing cultural history, playing a culturally significant role in the analysis of Eastern art.

[31] FauForensics: Boosting Audio-Visual Deepfake Detection with Facial Action Units

Jian Wang,Baoyuan Wu,Li Liu,Qingshan Liu

Main category: cs.CV

TL;DR: 论文提出了一种名为FauForensics的新框架,利用生物不变的面部动作单元(FAUs)作为抗伪造特征,结合多模态融合模块,显著提升了深度伪造检测的性能和跨数据集泛化能力。

Details Motivation: 生成式AI的快速发展导致音频-视觉深度伪造威胁增加,现有单模态检测方法难以应对多模态伪造,需要更鲁棒的解决方案。 Method: 引入生物不变的面部动作单元(FAUs)作为抗伪造特征,设计多模态融合模块动态对齐时空唇-音频关系,减少特征异质性。 Result: 在FakeAVCeleb和LAV-DF数据集上取得最优性能,平均比现有方法提升4.83%,并表现出卓越的跨数据集泛化能力。 Conclusion: FauForensics框架通过生物不变特征和多模态动态对齐,显著提升了深度伪造检测的鲁棒性和泛化性。 Abstract: The rapid evolution of generative AI has increased the threat of realistic audio-visual deepfakes, demanding robust detection methods. Existing solutions primarily address unimodal (audio or visual) forgeries but struggle with multimodal manipulations due to inadequate handling of heterogeneous modality features and poor generalization across datasets. To this end, we propose a novel framework called FauForensics by introducing biologically invariant facial action units (FAUs), which is a quantitative descriptor of facial muscle activity linked to emotion physiology. It serves as forgery-resistant representations that reduce domain dependency while capturing subtle dynamics often disrupted in synthetic content. Besides, instead of comparing entire video clips as in prior works, our method computes fine-grained frame-wise audiovisual similarities via a dedicated fusion module augmented with learnable cross-modal queries. It dynamically aligns temporal-spatial lip-audio relationships while mitigating multi-modal feature heterogeneity issues. Experiments on FakeAVCeleb and LAV-DF show state-of-the-art (SOTA) performance and superior cross-dataset generalizability with up to an average of 4.83\% than existing methods.

[32] Knowledge-Informed Deep Learning for Irrigation Type Mapping from Remote Sensing

Oishee Bintey Hoque,Nibir Chandra Mandal,Abhijin Adiga,Samarth Swarup,Sayjro Kossi Nouwakpo,Amanda Wilson,Madhav Marathe

Main category: cs.CV

TL;DR: KIIM是一种基于Swin-Transformer的新方法,通过多模态信息和加权集成显著提升了灌溉分类的准确性,尤其在滴灌分类上表现突出。

Details Motivation: 现有基于卫星图像光谱特征的模型因农业景观复杂性和训练数据不足而效果不佳,亟需更高效的灌溉分类方法。 Method: KIIM结合了作物到灌溉概率的投影矩阵、空间注意力图、双向交叉注意力和加权集成,并采用两阶段迁移学习。 Result: 实验显示KIIM在五个美国州的灌溉分类中IoU提升22.9%,滴灌分类提升71.4%,且仅需40%训练数据即可达到基线性能。 Conclusion: KIIM显著降低了大规模灌溉分类对人工标注的依赖,提升了可行性和成本效益。 Abstract: Accurate mapping of irrigation methods is crucial for sustainable agricultural practices and food systems. However, existing models that rely solely on spectral features from satellite imagery are ineffective due to the complexity of agricultural landscapes and limited training data, making this a challenging problem. We present Knowledge-Informed Irrigation Mapping (KIIM), a novel Swin-Transformer based approach that uses (i) a specialized projection matrix to encode crop to irrigation probability, (ii) a spatial attention map to identify agricultural lands from non-agricultural lands, (iii) bi-directional cross-attention to focus complementary information from different modalities, and (iv) a weighted ensemble for combining predictions from images and crop information. Our experimentation on five states in the US shows up to 22.9\% (IoU) improvement over baseline with a 71.4% (IoU) improvement for hard-to-classify drip irrigation. In addition, we propose a two-phase transfer learning approach to enhance cross-state irrigation mapping, achieving a 51% IoU boost in a state with limited labeled data. The ability to achieve baseline performance with only 40% of the training data highlights its efficiency, reducing the dependency on extensive manual labeling efforts and making large-scale, automated irrigation mapping more feasible and cost-effective.

[33] An incremental algorithm for non-convex AI-enhanced medical image processing

Elena Morotti

Main category: cs.CV

TL;DR: 提出了一种结合深度学习和增量模型优化的混合框架incDG,用于高效解决非凸正则化逆问题,尤其在医学影像中表现优异。

Details Motivation: 非凸正则化逆问题因其复杂性和局部极小值难以解决,但在医学影像等领域能提供高质量、任务导向的解。 Method: incDG结合深度神经网络生成初始解,并通过非凸变分求解器进行增量迭代优化,融合AI效率和模型优化理论保证。 Result: 在多种数据集上验证,incDG在医学影像去模糊和断层重建中优于传统迭代方法和深度学习,且无需真实数据训练仍保持高性能。 Conclusion: incDG是一种高效、稳健的工具,适用于解决影像及其他领域的非凸逆问题。 Abstract: Solving non-convex regularized inverse problems is challenging due to their complex optimization landscapes and multiple local minima. However, these models remain widely studied as they often yield high-quality, task-oriented solutions, particularly in medical imaging, where the goal is to enhance clinically relevant features rather than merely minimizing global error. We propose incDG, a hybrid framework that integrates deep learning with incremental model-based optimization to efficiently approximate the $\ell_0$-optimal solution of imaging inverse problems. Built on the Deep Guess strategy, incDG exploits a deep neural network to generate effective initializations for a non-convex variational solver, which refines the reconstruction through regularized incremental iterations. This design combines the efficiency of Artificial Intelligence (AI) tools with the theoretical guarantees of model-based optimization, ensuring robustness and stability. We validate incDG on TpV-regularized optimization tasks, demonstrating its effectiveness in medical image deblurring and tomographic reconstruction across diverse datasets, including synthetic images, brain CT slices, and chest-abdomen scans. Results show that incDG outperforms both conventional iterative solvers and deep learning-based methods, achieving superior accuracy and stability. Moreover, we confirm that training incDG without ground truth does not significantly degrade performance, making it a practical and powerful tool for solving non-convex inverse problems in imaging and beyond.

[34] A computer vision-based model for occupancy detection using low-resolution thermal images

Xue Cui,Vincent Gbouna Zakka,Minhyun Lee

Main category: cs.CV

TL;DR: 论文提出了一种基于低分辨率热成像和计算机视觉技术的占用检测模型,解决了传统RGB图像带来的隐私问题,同时优化了计算资源需求。

Details Motivation: 传统HVAC系统基于固定时间表运行,未考虑实际占用情况,而RGB图像检测占用会引发隐私问题。低分辨率热成像提供了一种非侵入性解决方案。 Method: 研究采用低分辨率热成像和计算机视觉技术,通过迁移学习微调YOLOv5模型,开发占用检测模型。 Result: 模型性能优异,精确度、召回率和mAP50值接近1.000。 Conclusion: 该模型不仅解决了隐私问题,还降低了计算资源需求,为HVAC系统的占用检测提供了高效方案。 Abstract: Occupancy plays an essential role in influencing the energy consumption and operation of heating, ventilation, and air conditioning (HVAC) systems. Traditional HVAC typically operate on fixed schedules without considering occupancy. Advanced occupant-centric control (OCC) adopted occupancy status in regulating HVAC operations. RGB images combined with computer vision (CV) techniques are widely used for occupancy detection, however, the detailed facial and body features they capture raise significant privacy concerns. Low-resolution thermal images offer a non-invasive solution that mitigates privacy issues. The study developed an occupancy detection model utilizing low-resolution thermal images and CV techniques, where transfer learning was applied to fine-tune the You Only Look Once version 5 (YOLOv5) model. The developed model ultimately achieved satisfactory performance, with precision, recall, mAP50, and mAP50 values approaching 1.000. The contributions of this model lie not only in mitigating privacy concerns but also in reducing computing resource demands.

[35] FAD: Frequency Adaptation and Diversion for Cross-domain Few-shot Learning

Ruixiao Shi,Fu Feng,Yucheng Xie,Jing Wang,Xin Geng

Main category: cs.CV

TL;DR: 论文提出了一种频率感知框架FAD,通过显式建模和调制频谱成分,提升跨域小样本学习的泛化能力。

Details Motivation: 跨域小样本学习中,空间域方法可能忽略频谱差异,限制了模型的泛化能力。 Method: FAD框架利用离散傅里叶变换将特征转换到频域,分为低、中、高频带,并通过专用卷积分支进行针对性适配。 Result: 在Meta-Dataset基准测试中,FAD显著优于现有方法,验证了频域表示和分频带适配的有效性。 Conclusion: 频域表示和分频带适配是提升跨域小样本学习泛化能力的关键。 Abstract: Cross-domain few-shot learning (CD-FSL) requires models to generalize from limited labeled samples under significant distribution shifts. While recent methods enhance adaptability through lightweight task-specific modules, they operate solely in the spatial domain and overlook frequency-specific variations that are often critical for robust transfer. We observe that spatially similar images across domains can differ substantially in their spectral representations, with low and high frequencies capturing complementary semantic information at coarse and fine levels. This indicates that uniform spatial adaptation may overlook these spectral distinctions, thus constraining generalization. To address this, we introduce Frequency Adaptation and Diversion (FAD), a frequency-aware framework that explicitly models and modulates spectral components. At its core is the Frequency Diversion Adapter, which transforms intermediate features into the frequency domain using the discrete Fourier transform (DFT), partitions them into low, mid, and high-frequency bands via radial masks, and reconstructs each band using inverse DFT (IDFT). Each frequency band is then adapted using a dedicated convolutional branch with a kernel size tailored to its spectral scale, enabling targeted and disentangled adaptation across frequencies. Extensive experiments on the Meta-Dataset benchmark demonstrate that FAD consistently outperforms state-of-the-art methods on both seen and unseen domains, validating the utility of frequency-domain representations and band-wise adaptation for improving generalization in CD-FSL.

[36] STORYANCHORS: Generating Consistent Multi-Scene Story Frames for Long-Form Narratives

Bo Wang,Haoyang Huang,Zhiyin Lu,Fengyuan Liu,Guoqing Ma,Jianlong Yuan,Yuan Zhang,Nan Duan

Main category: cs.CV

TL;DR: StoryAnchors是一个统一的框架,用于生成高质量、多场景且具有强时间一致性的故事帧。它通过双向故事生成器和特定条件区分标准视频合成,提升场景多样性和叙事丰富性。

Details Motivation: 解决多场景故事帧生成中的时间一致性、角色连续性和场景过渡问题,同时增强叙事多样性和编辑灵活性。 Method: 采用双向故事生成器整合过去和未来上下文,引入Multi-Event Story Frame Labeling和Progressive Story Frame Training,捕捉叙事流和事件动态。 Result: 在一致性、叙事连贯性和场景多样性上优于现有开源模型,叙事一致性和故事丰富性与GPT-4o相当。 Conclusion: StoryAnchors为故事驱动帧生成提供了可扩展、灵活且高度可编辑的基础,推动了该领域的边界。 Abstract: This paper introduces StoryAnchors, a unified framework for generating high-quality, multi-scene story frames with strong temporal consistency. The framework employs a bidirectional story generator that integrates both past and future contexts to ensure temporal consistency, character continuity, and smooth scene transitions throughout the narrative. Specific conditions are introduced to distinguish story frame generation from standard video synthesis, facilitating greater scene diversity and enhancing narrative richness. To further improve generation quality, StoryAnchors integrates Multi-Event Story Frame Labeling and Progressive Story Frame Training, enabling the model to capture both overarching narrative flow and event-level dynamics. This approach supports the creation of editable and expandable story frames, allowing for manual modifications and the generation of longer, more complex sequences. Extensive experiments show that StoryAnchors outperforms existing open-source models in key areas such as consistency, narrative coherence, and scene diversity. Its performance in narrative consistency and story richness is also on par with GPT-4o. Ultimately, StoryAnchors pushes the boundaries of story-driven frame generation, offering a scalable, flexible, and highly editable foundation for future research.

[37] DArFace: Deformation Aware Robustness for Low Quality Face Recognition

Sadaf Gulshad,Abdullah Aldahlawi Thakaa

Main category: cs.CV

TL;DR: DArFace提出了一种新的面部识别框架,通过模拟真实低质量图像的全局和局部变形,提升模型在低质量图像上的鲁棒性。

Details Motivation: 现有面部识别系统在高质量数据上表现良好,但在低质量图像(如监控视频)中性能下降,主要原因是忽略了局部非刚性变形。 Method: DArFace通过对抗训练模拟全局变换和局部弹性变形,并引入对比目标确保身份一致性。 Result: 在TinyFace、IJB-B和IJB-C等低质量基准测试中,DArFace显著优于现有方法。 Conclusion: DArFace通过建模局部变形,显著提升了面部识别系统在低质量图像中的鲁棒性。 Abstract: Facial recognition systems have achieved remarkable success by leveraging deep neural networks, advanced loss functions, and large-scale datasets. However, their performance often deteriorates in real-world scenarios involving low-quality facial images. Such degradations, common in surveillance footage or standoff imaging include low resolution, motion blur, and various distortions, resulting in a substantial domain gap from the high-quality data typically used during training. While existing approaches attempt to address robustness by modifying network architectures or modeling global spatial transformations, they frequently overlook local, non-rigid deformations that are inherently present in real-world settings. In this work, we introduce DArFace, a Deformation-Aware robust Face recognition framework that enhances robustness to such degradations without requiring paired high- and low-quality training samples. Our method adversarially integrates both global transformations (e.g., rotation, translation) and local elastic deformations during training to simulate realistic low-quality conditions. Moreover, we introduce a contrastive objective to enforce identity consistency across different deformed views. Extensive evaluations on low-quality benchmarks including TinyFace, IJB-B, and IJB-C demonstrate that DArFace surpasses state-of-the-art methods, with significant gains attributed to the inclusion of local deformation modeling.

[38] DHECA-SuperGaze: Dual Head-Eye Cross-Attention and Super-Resolution for Unconstrained Gaze Estimation

Franko Šikić,Donik Vršnak,Sven Lončarić

Main category: cs.CV

TL;DR: DHECA-SuperGaze是一种基于深度学习的方法,通过超分辨率和双头眼交叉注意力模块改进无约束视线估计,显著降低了角度误差。

Details Motivation: 无约束视线估计在现实场景中面临挑战,主要由于低分辨率图像和现有方法对头眼交互建模不足。 Method: 提出DHECA-SuperGaze,结合超分辨率和双头眼交叉注意力模块,处理多尺度图像并进行双向特征优化。 Result: 在Gaze360和GFIE数据集上,静态和时序配置中角度误差分别降低0.48°-3.00°和1.53°-3.99°。 Conclusion: 该方法在性能和泛化能力上优于现有技术,同时修正了Gaze360数据集中的标注错误。 Abstract: Unconstrained gaze estimation is the process of determining where a subject is directing their visual attention in uncontrolled environments. Gaze estimation systems are important for a myriad of tasks such as driver distraction monitoring, exam proctoring, accessibility features in modern software, etc. However, these systems face challenges in real-world scenarios, partially due to the low resolution of in-the-wild images and partially due to insufficient modeling of head-eye interactions in current state-of-the-art (SOTA) methods. This paper introduces DHECA-SuperGaze, a deep learning-based method that advances gaze prediction through super-resolution (SR) and a dual head-eye cross-attention (DHECA) module. Our dual-branch convolutional backbone processes eye and multiscale SR head images, while the proposed DHECA module enables bidirectional feature refinement between the extracted visual features through cross-attention mechanisms. Furthermore, we identified critical annotation errors in one of the most diverse and widely used gaze estimation datasets, Gaze360, and rectified the mislabeled data. Performance evaluation on Gaze360 and GFIE datasets demonstrates superior within-dataset performance of the proposed method, reducing angular error (AE) by 0.48{\deg} (Gaze360) and 2.95{\deg} (GFIE) in static configurations, and 0.59{\deg} (Gaze360) and 3.00{\deg} (GFIE) in temporal settings compared to prior SOTA methods. Cross-dataset testing shows improvements in AE of more than 1.53{\deg} (Gaze360) and 3.99{\deg} (GFIE) in both static and temporal settings, validating the robust generalization properties of our approach.

[39] Visual Image Reconstruction from Brain Activity via Latent Representation

Yukiyasu Kamitani,Misato Tanaka,Ken Shirakawa

Main category: cs.CV

TL;DR: 综述回顾了视觉图像重建领域的进展,从早期分类方法到基于深度神经网络和生成模型的复杂重建技术,探讨了当前挑战和未来方向。

Details Motivation: 探索如何从大脑活动中解码视觉内容,以理解神经编码并推动临床应用。 Method: 整合深度神经网络和生成模型,利用分层潜在表示、组合策略和模块化架构。 Result: 实现了对主观视觉体验的详细重建,但仍面临零样本泛化和主观感知建模的挑战。 Conclusion: 需多样化数据集、改进评估指标并关注伦理问题,视觉图像重建在神经科学和临床应用中潜力巨大。 Abstract: Visual image reconstruction, the decoding of perceptual content from brain activity into images, has advanced significantly with the integration of deep neural networks (DNNs) and generative models. This review traces the field's evolution from early classification approaches to sophisticated reconstructions that capture detailed, subjective visual experiences, emphasizing the roles of hierarchical latent representations, compositional strategies, and modular architectures. Despite notable progress, challenges remain, such as achieving true zero-shot generalization for unseen images and accurately modeling the complex, subjective aspects of perception. We discuss the need for diverse datasets, refined evaluation metrics aligned with human perceptual judgments, and compositional representations that strengthen model robustness and generalizability. Ethical issues, including privacy, consent, and potential misuse, are underscored as critical considerations for responsible development. Visual image reconstruction offers promising insights into neural coding and enables new psychological measurements of visual experiences, with applications spanning clinical diagnostics and brain-machine interfaces.

[40] TT-DF: A Large-Scale Diffusion-Based Dataset and Benchmark for Human Body Forgery Detection

Wenkui Yang,Zhida Zhang,Xiaoqiang Zhou,Junxian Duan,Jie Cao

Main category: cs.CV

TL;DR: 论文介绍了TikTok-DeepFake(TT-DF)数据集,专注于人体伪造检测,并提出了一种新的检测模型TOF-Net,在实验中表现优异。

Details Motivation: 由于人体伪造检测领域缺乏数据集和方法,作者旨在填补这一空白,提供全面的数据集和高效的检测模型。 Method: 作者构建了TT-DF数据集,包含多种伪造方法和压缩版本,并提出了TOF-Net模型,利用时空不一致性和光流分布差异进行检测。 Result: TOF-Net在TT-DF数据集上表现优于现有面部伪造检测模型。 Conclusion: TT-DF数据集和TOF-Net模型为人体伪造检测提供了重要资源和方法,推动了该领域的发展。 Abstract: The emergence and popularity of facial deepfake methods spur the vigorous development of deepfake datasets and facial forgery detection, which to some extent alleviates the security concerns about facial-related artificial intelligence technologies. However, when it comes to human body forgery, there has been a persistent lack of datasets and detection methods, due to the later inception and complexity of human body generation methods. To mitigate this issue, we introduce TikTok-DeepFake (TT-DF), a novel large-scale diffusion-based dataset containing 6,120 forged videos with 1,378,857 synthetic frames, specifically tailored for body forgery detection. TT-DF offers a wide variety of forgery methods, involving multiple advanced human image animation models utilized for manipulation, two generative configurations based on the disentanglement of identity and pose information, as well as different compressed versions. The aim is to simulate any potential unseen forged data in the wild as comprehensively as possible, and we also furnish a benchmark on TT-DF. Additionally, we propose an adapted body forgery detection model, Temporal Optical Flow Network (TOF-Net), which exploits the spatiotemporal inconsistencies and optical flow distribution differences between natural data and forged data. Our experiments demonstrate that TOF-Net achieves favorable performance on TT-DF, outperforming current state-of-the-art extendable facial forgery detection models. For our TT-DF dataset, please refer to https://github.com/HashTAG00002/TT-DF.

[41] A Survey of 3D Reconstruction with Event Cameras: From Event-based Geometry to Neural 3D Rendering

Chuanzhi Xu,Haoxian Zhou,Langyi Chen,Haodong Chen,Ying Zhou,Vera Chung,Qiang Qu

Main category: cs.CV

TL;DR: 该论文综述了基于事件相机的3D重建技术,分类总结了现有方法,并指出了当前研究的局限性和未来方向。

Details Motivation: 事件相机因其异步捕捉像素亮度变化的能力,在3D重建中表现出潜力,尤其是在极端环境下。本文旨在提供首个专注于事件相机3D重建的全面综述。 Method: 将现有工作按输入模态(立体、单目、多模态)和重建方法(几何、深度学习、神经渲染)分类,并整理相关公开数据集。 Result: 总结了事件相机3D重建的现状,包括技术分类、数据集和当前研究的局限性。 Conclusion: 本文为事件驱动3D重建提供了全面参考,并指出了未来研究方向,如数据可用性、动态场景处理等。 Abstract: Event cameras have emerged as promising sensors for 3D reconstruction due to their ability to capture per-pixel brightness changes asynchronously. Unlike conventional frame-based cameras, they produce sparse and temporally rich data streams, which enable more accurate 3D reconstruction and open up the possibility of performing reconstruction in extreme environments such as high-speed motion, low light, or high dynamic range scenes. In this survey, we provide the first comprehensive review focused exclusively on 3D reconstruction using event cameras. The survey categorises existing works into three major types based on input modality - stereo, monocular, and multimodal systems, and further classifies them by reconstruction approach, including geometry-based, deep learning-based, and recent neural rendering techniques such as Neural Radiance Fields and 3D Gaussian Splatting. Methods with a similar research focus were organised chronologically into the most subdivided groups. We also summarise public datasets relevant to event-based 3D reconstruction. Finally, we highlight current research limitations in data availability, evaluation, representation, and dynamic scene handling, and outline promising future research directions. This survey aims to serve as a comprehensive reference and a roadmap for future developments in event-driven 3D reconstruction.

[42] VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models

Pritam Sarkar,Ali Etemad

Main category: cs.CV

TL;DR: 论文提出了一个名为VCRBench的新基准,用于评估大型视频语言模型(LVLMs)在视频因果推理中的能力,并提出了分解方法RRD以提升性能。

Details Motivation: 当前缺乏专门用于评估视频因果推理的基准,导致LVLMs在此任务上的能力未被充分探索。 Method: 通过创建VCRBench基准,测试LVLMs对事件序列的因果推理能力,并提出RRD方法将任务分解为视频识别和因果推理两部分。 Result: 实验显示LVLMs在长范围因果推理上表现不佳,而RRD方法能显著提升性能(最高25.2%)。 Conclusion: LVLMs在复杂视频因果推理中主要依赖语言知识,RRD方法为解决此类问题提供了有效途径。 Abstract: Despite recent advances in video understanding, the capabilities of Large Video Language Models (LVLMs) to perform video-based causal reasoning remains underexplored, largely due to the absence of relevant and dedicated benchmarks for evaluating causal reasoning in visually grounded and goal-driven settings. To fill this gap, we introduce a novel benchmark named Video-based long-form Causal Reasoning (VCRBench). We create VCRBench using procedural videos of simple everyday activities, where the steps are deliberately shuffled with each clip capturing a key causal event, to test whether LVLMs can identify, reason about, and correctly sequence the events needed to accomplish a specific goal. Moreover, the benchmark is carefully designed to prevent LVLMs from exploiting linguistic shortcuts, as seen in multiple-choice or binary QA formats, while also avoiding the challenges associated with evaluating open-ended QA. Our evaluation of state-of-the-art LVLMs on VCRBench suggests that these models struggle with video-based long-form causal reasoning, primarily due to their difficulty in modeling long-range causal dependencies directly from visual observations. As a simple step toward enabling such capabilities, we propose Recognition-Reasoning Decomposition (RRD), a modular approach that breaks video-based causal reasoning into two sub-tasks of video recognition and causal reasoning. Our experiments on VCRBench show that RRD significantly boosts accuracy on VCRBench, with gains of up to 25.2%. Finally, our thorough analysis reveals interesting insights, for instance, that LVLMs primarily rely on language knowledge for complex video-based long-form causal reasoning tasks.

[43] A Deep Learning-Driven Framework for Inhalation Injury Grading Using Bronchoscopy Images

Yifan Li,Alan W Pang,Jo Woon Chong

Main category: cs.CV

TL;DR: 论文提出了一种基于深度学习的框架,利用支气管镜图像和机械通气时长作为客观指标,改进吸入性损伤的分级诊断。通过增强的StarGAN生成高质量合成图像,显著提升了分类性能。

Details Motivation: 传统方法(如AIS)依赖主观评估且与临床结果相关性弱,导致吸入性损伤的诊断和分级面临挑战。 Method: 提出增强的StarGAN,结合Patch Loss和SSIM Loss生成高质量合成图像,并用Swin Transformer评估分类性能。 Result: 增强的StarGAN生成的数据集将分类准确率提升至77.78%(提高11.11%),FID得分最低(30.06),且生成的图像被烧伤外科医生认可为具有临床相关性。 Conclusion: 增强的StarGAN能有效解决医学影像数据稀缺问题,提升吸入性损伤分级的准确性。 Abstract: Inhalation injuries face a challenge in clinical diagnosis and grading due to the limitations of traditional methods, such as Abbreviated Injury Score (AIS), which rely on subjective assessments and show weak correlations with clinical outcomes. This study introduces a novel deep learning-based framework for grading inhalation injuries using bronchoscopy images with the duration of mechanical ventilation as an objective metric. To address the scarcity of medical imaging data, we propose enhanced StarGAN, a generative model that integrates Patch Loss and SSIM Loss to improve synthetic images' quality and clinical relevance. The augmented dataset generated by enhanced StarGAN significantly improved classification performance when evaluated using the Swin Transformer, achieving an accuracy of 77.78%, an 11.11% improvement over the original dataset. Image quality was assessed using the Fr\'echet Inception Distance (FID), where Enhanced StarGAN achieved the lowest FID of 30.06, outperforming baseline models. Burn surgeons confirmed the realism and clinical relevance of the generated images, particularly the preservation of bronchial structures and color distribution. These results highlight the potential of enhanced StarGAN in addressing data limitations and improving classification accuracy for inhalation injury grading.

[44] Attention-based Generative Latent Replay: A Continual Learning Approach for WSI Analysis

Pratibha Kumari,Daniel Reisenbüchler,Afshin Bozorgpour,Nadine S. Schaadt,Friedrich Feuerhake,Dorit Merhof

Main category: cs.CV

TL;DR: 提出了一种基于注意力的生成潜在重放持续学习框架(AGLR-CL),用于解决全切片图像(WSI)分类中的域偏移问题,无需显式存储原始数据。

Details Motivation: WSI分类在计算病理学中应用广泛,但受限于不同器官、疾病或机构差异导致的域偏移。 Method: 采用高斯混合模型(GMMs)合成WSI表示和补丁计数分布,结合注意力过滤选择关键补丁嵌入。 Result: 在多个公共数据集上验证了AGLR-CL的性能,其表现优于无缓冲方法,且与有缓冲方法相当。 Conclusion: AGLR-CL是一种隐私保护且高效的域增量持续学习方法,适用于WSI分类。 Abstract: Whole slide image (WSI) classification has emerged as a powerful tool in computational pathology, but remains constrained by domain shifts, e.g., due to different organs, diseases, or institution-specific variations. To address this challenge, we propose an Attention-based Generative Latent Replay Continual Learning framework (AGLR-CL), in a multiple instance learning (MIL) setup for domain incremental WSI classification. Our method employs Gaussian Mixture Models (GMMs) to synthesize WSI representations and patch count distributions, preserving knowledge of past domains without explicitly storing original data. A novel attention-based filtering step focuses on the most salient patch embeddings, ensuring high-quality synthetic samples. This privacy-aware strategy obviates the need for replay buffers and outperforms other buffer-free counterparts while matching the performance of buffer-based solutions. We validate AGLR-CL on clinically relevant biomarker detection and molecular status prediction across multiple public datasets with diverse centers, organs, and patient cohorts. Experimental results confirm its ability to retain prior knowledge and adapt to new domains, offering an effective, privacy-preserving avenue for domain incremental continual learning in WSI classification.

[45] Dynamic Snake Upsampling Operater and Boundary-Skeleton Weighted Loss for Tubular Structure Segmentation

Yiqi Chen,Ganghai Huang,Sheng Zhang,Jianglin Dai

Main category: cs.CV

TL;DR: 本文提出了一种动态蛇形上采样操作符和边界-骨架加权损失,用于提升管状拓扑结构的准确分割。

Details Motivation: 管状拓扑结构(如裂隙和血管)的精确分割对下游定量分析和建模至关重要,但传统上采样操作符难以处理其纤细性和形态曲率。 Method: 设计了基于自适应采样域的动态蛇形上采样操作符,动态调整采样步长,并沿蛇形路径选择子像素采样点。同时提出了一种基于掩模类别比和距离场的骨架-边界加权损失。 Result: 实验表明,该方法在多个领域数据集和骨干网络中显著提升了像素级分割精度和拓扑一致性。 Conclusion: 动态蛇形上采样和边界-骨架加权损失有效解决了管状结构分割中的挑战,提升了分割结果的准确性和连续性。 Abstract: Accurate segmentation of tubular topological structures (e.g., fissures and vasculature) is critical in various fields to guarantee dependable downstream quantitative analysis and modeling. However, in dense prediction tasks such as semantic segmentation and super-resolution, conventional upsampling operators cannot accommodate the slenderness of tubular structures and the curvature of morphology. This paper introduces a dynamic snake upsampling operators and a boundary-skeleton weighted loss tailored for topological tubular structures. Specifically, we design a snake upsampling operators based on an adaptive sampling domain, which dynamically adjusts the sampling stride according to the feature map and selects a set of subpixel sampling points along the serpentine path, enabling more accurate subpixel-level feature recovery for tubular structures. Meanwhile, we propose a skeleton-to-boundary increasing weighted loss that trades off main body and boundary weight allocation based on mask class ratio and distance field, preserving main body overlap while enhancing focus on target topological continuity and boundary alignment precision. Experiments across various domain datasets and backbone networks show that this plug-and-play dynamic snake upsampling operator and boundary-skeleton weighted loss boost both pixel-wise segmentation accuracy and topological consistency of results.

[46] Leveraging Segment Anything Model for Source-Free Domain Adaptation via Dual Feature Guided Auto-Prompting

Zheang Huai,Hui Tang,Yi Li,Zhuangzhuang Chen,Xiaomeng Li

Main category: cs.CV

TL;DR: 本文提出了一种基于Segment Anything Model(SAM)的双特征引导(DFG)自动提示方法,用于解决源自由域适应(SFDA)分割任务中的域差距问题。

Details Motivation: 源自由域适应(SFDA)在分割任务中面临域差距挑战,现有方法生成的边界框提示不准确。本文探索SAM的潜力,通过自动寻找准确的边界框提示来解决这一问题。 Method: 提出DFG自动提示方法,分两阶段:1)特征聚合阶段,初步适应目标域并构建特征分布;2)基于目标模型特征和SAM特征逐步扩展边界框提示,并通过连通性分析后处理伪标签。 Result: 在3D和2D数据集上的实验表明,DFG方法优于传统方法。 Conclusion: DFG方法通过双特征引导和自动提示,有效解决了SFDA分割任务中的域差距问题,性能显著提升。 Abstract: Source-free domain adaptation (SFDA) for segmentation aims at adapting a model trained in the source domain to perform well in the target domain with only the source model and unlabeled target data.Inspired by the recent success of Segment Anything Model (SAM) which exhibits the generality of segmenting images of various modalities and in different domains given human-annotated prompts like bounding boxes or points, we for the first time explore the potentials of Segment Anything Model for SFDA via automatedly finding an accurate bounding box prompt. We find that the bounding boxes directly generated with existing SFDA approaches are defective due to the domain gap.To tackle this issue, we propose a novel Dual Feature Guided (DFG) auto-prompting approach to search for the box prompt. Specifically, the source model is first trained in a feature aggregation phase, which not only preliminarily adapts the source model to the target domain but also builds a feature distribution well-prepared for box prompt search. In the second phase, based on two feature distribution observations, we gradually expand the box prompt with the guidance of the target model feature and the SAM feature to handle the class-wise clustered target features and the class-wise dispersed target features, respectively. To remove the potentially enlarged false positive regions caused by the over-confident prediction of the target model, the refined pseudo-labels produced by SAM are further postprocessed based on connectivity analysis. Experiments on 3D and 2D datasets indicate that our approach yields superior performance compared to conventional methods. Code is available at https://github.com/zheangh/DFG.

[47] The RaspGrade Dataset: Towards Automatic Raspberry Ripeness Grading with Deep Learning

Mohamed Lamine Mekhalfi,Paul Chippendale,Fabio Poiesi,Samuele Bonecher,Gilberto Osler,Nicola Zancanella

Main category: cs.CV

TL;DR: 研究探讨了计算机视觉在快速、准确、非侵入性食品质量评估中的应用,专注于工业环境中实时将覆盆子分为五个等级的新挑战。

Details Motivation: 解决工业环境中覆盆子实时分级的挑战,提高食品质量评估的效率和准确性。 Method: 使用名为RaspGrade的专用数据集,进行实例分割实验以获取水果级掩码。 Result: 实验显示某些覆盆子等级因颜色相似和遮挡难以分类,而其他等级基于颜色更易区分。 Conclusion: RaspGrade数据集可用于进一步研究,但某些分级挑战仍需解决。 Abstract: This research investigates the application of computer vision for rapid, accurate, and non-invasive food quality assessment, focusing on the novel challenge of real-time raspberry grading into five distinct classes within an industrial environment as the fruits move along a conveyor belt. To address this, a dedicated dataset of raspberries, namely RaspGrade, was acquired and meticulously annotated. Instance segmentation experiments revealed that accurate fruit-level masks can be obtained; however, the classification of certain raspberry grades presents challenges due to color similarities and occlusion, while others are more readily distinguishable based on color. The acquired and annotated RaspGrade dataset is accessible on HuggingFace at: https://huggingface.co/datasets/FBK-TeV/RaspGrade.

Haroon Wahab,Hassan Ugail,Irfan Mehmood

Main category: cs.CV

TL;DR: 论文提出了一种名为DFA-CON的对比学习框架,用于检测侵犯版权或伪造的AI生成艺术作品。

Details Motivation: 生成式AI工具在视觉内容创作中的广泛应用引发了版权侵权和伪造的严重问题,训练数据集中常包含受版权保护的作品,导致模型可能侵犯版权。 Method: DFA-CON通过对比学习框架学习判别性表示空间,区分原始艺术作品及其伪造版本,并在多种攻击类型(如修复、风格迁移、对抗扰动和cutmix)上进行训练。 Result: 评估结果显示,DFA-CON在大多数攻击类型上表现出鲁棒的检测性能,优于现有的预训练基础模型。 Conclusion: DFA-CON为解决AI生成艺术作品的版权侵权和伪造问题提供了一种有效的解决方案。 Abstract: Recent proliferation of generative AI tools for visual content creation-particularly in the context of visual artworks-has raised serious concerns about copyright infringement and forgery. The large-scale datasets used to train these models often contain a mixture of copyrighted and non-copyrighted artworks. Given the tendency of generative models to memorize training patterns, they are susceptible to varying degrees of copyright violation. Building on the recently proposed DeepfakeArt Challenge benchmark, this work introduces DFA-CON, a contrastive learning framework designed to detect copyright-infringing or forged AI-generated art. DFA-CON learns a discriminative representation space, posing affinity among original artworks and their forged counterparts within a contrastive learning framework. The model is trained across multiple attack types, including inpainting, style transfer, adversarial perturbation, and cutmix. Evaluation results demonstrate robust detection performance across most attack types, outperforming recent pretrained foundation models. Code and model checkpoints will be released publicly upon acceptance.

[49] Reinforcement Learning meets Masked Video Modeling : Trajectory-Guided Adaptive Token Selection

Ayush K. Rai,Kyle Min,Tarun Krishna,Feiyan Hu,Alan F. Smeaton,Noel E. O'Connor

Main category: cs.CV

TL;DR: 论文提出了一种新的轨迹感知自适应令牌采样器(TATS),用于改进视频建模中的掩码策略,并结合MAE框架和PPO优化方法,在动作识别任务中表现出色。

Details Motivation: 现有视频建模中的掩码策略(如随机或基于运动先验的方法)存在局限性,需要更通用且高效的解决方案。 Method: 提出TATS方法,动态建模令牌运动轨迹,并与MAE框架结合,使用PPO进行联合优化。 Result: 在多个基准测试中(如Something-Something v2等),TATS表现出高效性和泛化能力,支持高掩码率而不影响性能。 Conclusion: TATS是一种通用且高效的视频建模方法,显著提升了掩码策略的效果和下游任务的性能。 Abstract: Masked video modeling~(MVM) has emerged as a highly effective pre-training strategy for visual foundation models, whereby the model reconstructs masked spatiotemporal tokens using information from visible tokens. However, a key challenge in such approaches lies in selecting an appropriate masking strategy. Previous studies have explored predefined masking techniques, including random and tube-based masking, as well as approaches that leverage key motion priors, optical flow and semantic cues from externally pre-trained models. In this work, we introduce a novel and generalizable Trajectory-Aware Adaptive Token Sampler (TATS), which models the motion dynamics of tokens and can be seamlessly integrated into the masked autoencoder (MAE) framework to select motion-centric tokens in videos. Additionally, we propose a unified training strategy that enables joint optimization of both MAE and TATS from scratch using Proximal Policy Optimization (PPO). We show that our model allows for aggressive masking without compromising performance on the downstream task of action recognition while also ensuring that the pre-training remains memory efficient. Extensive experiments of the proposed approach across four benchmarks, including Something-Something v2, Kinetics-400, UCF101, and HMDB51, demonstrate the effectiveness, transferability, generalization, and efficiency of our work compared to other state-of-the-art methods.

[50] Thermal Detection of People with Mobility Restrictions for Barrier Reduction at Traffic Lights Controlled Intersections

Xiao Ni,Carsten Kuehnel,Xiaoyi Jiang

Main category: cs.CV

TL;DR: 提出了一种基于热成像的交通信号灯系统,通过动态调整信号时长和触发听觉信号,提升对行动不便和视力障碍人群的支持,同时解决了RGB摄像头的隐私和恶劣天气问题。

Details Motivation: 现有RGB摄像头交通系统忽视行动不便人群需求,且在恶劣天气或低能见度下性能受限,隐私问题突出。 Method: 构建热成像数据集TD4PWMR,开发YOLO-Thermal模型,结合特征提取和注意力机制提升热成像检测精度。 Result: YOLO-Thermal优于现有检测器,系统有效提升无障碍交叉路口体验。 Conclusion: 热成像系统在隐私和恶劣条件下表现优越,为无障碍交通提供可行方案。 Abstract: Rapid advances in deep learning for computer vision have driven the adoption of RGB camera-based adaptive traffic light systems to improve traffic safety and pedestrian comfort. However, these systems often overlook the needs of people with mobility restrictions. Moreover, the use of RGB cameras presents significant challenges, including limited detection performance under adverse weather or low-visibility conditions, as well as heightened privacy concerns. To address these issues, we propose a fully automated, thermal detector-based traffic light system that dynamically adjusts signal durations for individuals with walking impairments or mobility burden and triggers the auditory signal for visually impaired individuals, thereby advancing towards barrier-free intersection for all users. To this end, we build the thermal dataset for people with mobility restrictions (TD4PWMR), designed to capture diverse pedestrian scenarios, particularly focusing on individuals with mobility aids or mobility burden under varying environmental conditions, such as different lighting, weather, and crowded urban settings. While thermal imaging offers advantages in terms of privacy and robustness to adverse conditions, it also introduces inherent hurdles for object detection due to its lack of color and fine texture details and generally lower resolution of thermal images. To overcome these limitations, we develop YOLO-Thermal, a novel variant of the YOLO architecture that integrates advanced feature extraction and attention mechanisms for enhanced detection accuracy and robustness in thermal imaging. Experiments demonstrate that the proposed thermal detector outperforms existing detectors, while the proposed traffic light system effectively enhances barrier-free intersection. The source codes and dataset are available at https://github.com/leon2014dresden/YOLO-THERMAL.

[51] ReSurgSAM2: Referring Segment Anything in Surgical Video via Credible Long-term Tracking

Haofeng Liu,Mingqi Gao,Xuxiao Luo,Ziyue Wang,Guanyi Qin,Junde Wu,Yueming Jin

Main category: cs.CV

TL;DR: ReSurgSAM2是一个两阶段的手术场景分割框架,结合了Segment Anything Model 2和长时记忆跟踪,显著提升了效率和准确性。

Details Motivation: 现有手术分割方法效率低且跟踪时间短,难以适应复杂手术场景。 Method: 采用两阶段框架:文本引导的目标检测和基于可靠初始帧的长时记忆跟踪。 Result: ReSurgSAM2在61.2 FPS下实时运行,准确性和效率显著优于现有方法。 Conclusion: ReSurgSAM2为手术场景分割提供了高效且可靠的解决方案。 Abstract: Surgical scene segmentation is critical in computer-assisted surgery and is vital for enhancing surgical quality and patient outcomes. Recently, referring surgical segmentation is emerging, given its advantage of providing surgeons with an interactive experience to segment the target object. However, existing methods are limited by low efficiency and short-term tracking, hindering their applicability in complex real-world surgical scenarios. In this paper, we introduce ReSurgSAM2, a two-stage surgical referring segmentation framework that leverages Segment Anything Model 2 to perform text-referred target detection, followed by tracking with reliable initial frame identification and diversity-driven long-term memory. For the detection stage, we propose a cross-modal spatial-temporal Mamba to generate precise detection and segmentation results. Based on these results, our credible initial frame selection strategy identifies the reliable frame for the subsequent tracking. Upon selecting the initial frame, our method transitions to the tracking stage, where it incorporates a diversity-driven memory mechanism that maintains a credible and diverse memory bank, ensuring consistent long-term tracking. Extensive experiments demonstrate that ReSurgSAM2 achieves substantial improvements in accuracy and efficiency compared to existing methods, operating in real-time at 61.2 FPS. Our code and datasets will be available at https://github.com/jinlab-imvr/ReSurgSAM2.

[52] A Large-scale Benchmark on Geological Fault Delineation Models: Domain Shift, Training Dynamics, Generalizability, Evaluation and Inferential Behavior

Jorge Quesada,Chen Zhou,Prithwijit Chowdhury,Mohammad Alotaibi,Ahmad Mustafa,Yusufjon Kumamnov,Mohit Prabhushankar,Ghassan AlRegib

Main category: cs.CV

TL;DR: 本文首次大规模评估了地震解释中领域偏移策略,揭示了当前微调实践的脆弱性,并提出了改进方向。

Details Motivation: 地震解释中机器学习模型的泛化能力缺乏系统性研究,领域偏移、微调策略和评估协议不一致是主要障碍。 Method: 通过200多个模型在三个异构数据集(FaultSeg3D、CRACKS、Thebe)上评估预训练、微调和联合训练策略。 Result: 研究发现当前微调实践脆弱,存在灾难性遗忘问题,并揭示了性能解释的挑战。 Conclusion: 研究为地震解释工作流中模型部署提供了指导,提出了开发更通用、可解释模型的方向。 Abstract: Machine learning has taken a critical role in seismic interpretation workflows, especially in fault delineation tasks. However, despite the recent proliferation of pretrained models and synthetic datasets, the field still lacks a systematic understanding of the generalizability limits of these models across seismic data representing a variety of geologic, acquisition and processing settings. Distributional shifts between different data sources, limitations in fine-tuning strategies and labeled data accessibility, and inconsistent evaluation protocols all represent major roadblocks in the deployment of reliable and robust models in real-world exploration settings. In this paper, we present the first large-scale benchmarking study explicitly designed to provide answers and guidelines for domain shift strategies in seismic interpretation. Our benchmark encompasses over $200$ models trained and evaluated on three heterogeneous datasets (synthetic and real data) including FaultSeg3D, CRACKS, and Thebe. We systematically assess pretraining, fine-tuning, and joint training strategies under varying degrees of domain shift. Our analysis highlights the fragility of current fine-tuning practices, the emergence of catastrophic forgetting, and the challenges of interpreting performance in a systematic manner. We establish a robust experimental baseline to provide insights into the tradeoffs inherent to current fault delineation workflows, and shed light on directions for developing more generalizable, interpretable and effective machine learning models for seismic interpretation. The insights and analyses reported provide a set of guidelines on the deployment of fault delineation models within seismic interpretation workflows.

[53] PrePrompt: Predictive prompting for class incremental learning

Libo Huang,Zhulin An,Chuanguang Yang,Boyu Diao,Fei Wang,Yan Zeng,Zhifeng Hao,Yongjun Xu

Main category: cs.CV

TL;DR: PrePrompt提出了一种新的CIL框架,通过预测任务特定提示来避免相关性策略的局限性,并通过特征翻译平衡稳定性和可塑性。

Details Motivation: 现有基于相关性策略的CIL方法难以用少量可训练提示拟合所有任务的特征空间,存在固有局限性。 Method: PrePrompt将CIL分解为两阶段预测框架:任务特定提示预测和标签预测,并通过特征翻译缓解历史数据缺失导致的偏差。 Result: 实验表明PrePrompt在多个基准测试中优于现有基于提示的CIL方法。 Conclusion: PrePrompt通过预测提示和动态平衡机制,显著提升了CIL性能,代码将在接受后发布。 Abstract: Class Incremental Learning (CIL) based on pre-trained models offers a promising direction for open-world continual learning. Existing methods typically rely on correlation-based strategies, where an image's classification feature is used as a query to retrieve the most related key prompts and select the corresponding value prompts for training. However, these approaches face an inherent limitation: fitting the entire feature space of all tasks with only a few trainable prompts is fundamentally challenging. We propose Predictive Prompting (PrePrompt), a novel CIL framework that circumvents correlation-based limitations by leveraging pre-trained models' natural classification ability to predict task-specific prompts. Specifically, PrePrompt decomposes CIL into a two-stage prediction framework: task-specific prompt prediction followed by label prediction. While theoretically appealing, this framework risks bias toward recent classes due to missing historical data for older classifier calibration. PrePrompt then mitigates this by incorporating feature translation, dynamically balancing stability and plasticity. Experiments across multiple benchmarks demonstrate PrePrompt's superiority over state-of-the-art prompt-based CIL methods. The code will be released upon acceptance.

[54] MESSI: A Multi-Elevation Semantic Segmentation Image Dataset of an Urban Environment

Barak Pinkovich,Boaz Matalon,Ehud Rivlin,Hector Rotstein

Main category: cs.CV

TL;DR: MESSI数据集包含2525张无人机拍摄的密集城市环境图像,支持多高度语义分割研究,并提供多种标注信息。

Details Motivation: 研究深度对语义分割的影响,并覆盖无人机3D飞行中的视觉多样性。 Method: 使用多个神经网络模型进行语义分割,并提供数据集标注细节。 Result: MESSI数据集可用于训练深度神经网络,支持语义分割及其他应用。 Conclusion: MESSI将公开发布,作为无人机图像语义分割的评估基准。 Abstract: This paper presents a Multi-Elevation Semantic Segmentation Image (MESSI) dataset comprising 2525 images taken by a drone flying over dense urban environments. MESSI is unique in two main features. First, it contains images from various altitudes, allowing us to investigate the effect of depth on semantic segmentation. Second, it includes images taken from several different urban regions (at different altitudes). This is important since the variety covers the visual richness captured by a drone's 3D flight, performing horizontal and vertical maneuvers. MESSI contains images annotated with location, orientation, and the camera's intrinsic parameters and can be used to train a deep neural network for semantic segmentation or other applications of interest (e.g., localization, navigation, and tracking). This paper describes the dataset and provides annotation details. It also explains how semantic segmentation was performed using several neural network models and shows several relevant statistics. MESSI will be published in the public domain to serve as an evaluation benchmark for semantic segmentation using images captured by a drone or similar vehicle flying over a dense urban environment.

[55] Rejoining fragmented ancient bamboo slips with physics-driven deep learning

Jinchi Zhu,Zhou Zhao,Hailong Lei,Xiaoguang Wang,Jialiang Lu,Jing Li,Qianqian Tang,Jiachen Shen,Gui-Song Xia,Bo Du,Yongchao Xu

Main category: cs.CV

TL;DR: WisePanda是一个基于物理原理的深度学习框架,用于拼接破碎的竹简,显著提高了匹配准确性和效率。

Details Motivation: 竹简是记录东亚古代文明的重要媒介,但许多出土竹简破碎成不规则碎片,拼接困难,影响内容理解。 Method: 结合断裂和材料退化的物理原理,自动生成合成训练数据,训练匹配网络,无需手动配对样本。 Result: Top-50匹配准确率从36%提升至52%,拼接效率提高约20倍。 Conclusion: 物理驱动的深度学习显著提升模型性能,为古代文物修复提供了新范式。 Abstract: Bamboo slips are a crucial medium for recording ancient civilizations in East Asia, and offers invaluable archaeological insights for reconstructing the Silk Road, studying material culture exchanges, and global history. However, many excavated bamboo slips have been fragmented into thousands of irregular pieces, making their rejoining a vital yet challenging step for understanding their content. Here we introduce WisePanda, a physics-driven deep learning framework designed to rejoin fragmented bamboo slips. Based on the physics of fracture and material deterioration, WisePanda automatically generates synthetic training data that captures the physical properties of bamboo fragmentations. This approach enables the training of a matching network without requiring manually paired samples, providing ranked suggestions to facilitate the rejoining process. Compared to the leading curve matching method, WisePanda increases Top-50 matching accuracy from 36\% to 52\%. Archaeologists using WisePanda have experienced substantial efficiency improvements (approximately 20 times faster) when rejoining fragmented bamboo slips. This research demonstrates that incorporating physical principles into deep learning models can significantly enhance their performance, transforming how archaeologists restore and study fragmented artifacts. WisePanda provides a new paradigm for addressing data scarcity in ancient artifact restoration through physics-driven machine learning.

[56] Unsupervised Out-of-Distribution Detection in Medical Imaging Using Multi-Exit Class Activation Maps and Feature Masking

Yu-Jen Chen,Xueyang Li,Yiyu Shi,Tsung-Yi Ho

Main category: cs.CV

TL;DR: 论文提出了一种基于多出口类激活图(MECAM)的无监督OOD检测框架,通过特征掩码和多分辨率CAM提升检测鲁棒性。

Details Motivation: 医学影像中OOD检测对模型可靠性至关重要,研究发现ID数据的CAM通常聚焦于预测相关区域,而OOD数据缺乏此类激活。 Method: 利用多出口网络生成不同分辨率和深度的CAM,结合特征掩码技术,通过ID和OOD数据在掩码后特征变化的差异进行检测。 Result: 在多个ID和OOD数据集(如ISIC19、PathMNIST、RSNA Pneumonia等)上验证,MECAM优于现有方法。 Conclusion: 多出口网络和特征掩码技术为医学影像中的无监督OOD检测提供了新思路,有望提升临床模型的可靠性和可解释性。 Abstract: Out-of-distribution (OOD) detection is essential for ensuring the reliability of deep learning models in medical imaging applications. This work is motivated by the observation that class activation maps (CAMs) for in-distribution (ID) data typically emphasize regions that are highly relevant to the model's predictions, whereas OOD data often lacks such focused activations. By masking input images with inverted CAMs, the feature representations of ID data undergo more substantial changes compared to those of OOD data, offering a robust criterion for differentiation. In this paper, we introduce a novel unsupervised OOD detection framework, Multi-Exit Class Activation Map (MECAM), which leverages multi-exit CAMs and feature masking. By utilizing mult-exit networks that combine CAMs from varying resolutions and depths, our method captures both global and local feature representations, thereby enhancing the robustness of OOD detection. We evaluate MECAM on multiple ID datasets, including ISIC19 and PathMNIST, and test its performance against three medical OOD datasets, RSNA Pneumonia, COVID-19, and HeadCT, and one natural image OOD dataset, iSUN. Comprehensive comparisons with state-of-the-art OOD detection methods validate the effectiveness of our approach. Our findings emphasize the potential of multi-exit networks and feature masking for advancing unsupervised OOD detection in medical imaging, paving the way for more reliable and interpretable models in clinical practice.

[57] Leveraging Multi-Modal Information to Enhance Dataset Distillation

Zhe Li,Hadrien Reynaud,Bernhard Kainz

Main category: cs.CV

TL;DR: 论文提出两种改进数据集蒸馏的方法:基于文本的监督和对象中心掩码,通过结合文本信息和对象级优化提升蒸馏数据集质量。

Details Motivation: 现有方法主要关注视觉表示优化,但结合多模态信息和细化对象级信息可以显著提升蒸馏数据集的质量。 Method: 引入两种策略:1)特征拼接,将文本嵌入与视觉特征融合;2)文本匹配,通过基于文本的对齐损失确保语义一致性。同时,使用分割掩码隔离目标对象,并提出两种对象中心损失函数。 Result: 实验表明,结合文本指导和对象中心掩码显著提升了数据集蒸馏的效果,生成的数据集在下游任务中表现更优。 Conclusion: 通过多模态信息和对象级优化,可以显著提升数据集蒸馏的质量和实用性。 Abstract: Dataset distillation aims to create a compact and highly representative synthetic dataset that preserves the knowledge of a larger real dataset. While existing methods primarily focus on optimizing visual representations, incorporating additional modalities and refining object-level information can significantly improve the quality of distilled datasets. In this work, we introduce two key enhancements to dataset distillation: caption-guided supervision and object-centric masking. To integrate textual information, we propose two strategies for leveraging caption features: the feature concatenation, where caption embeddings are fused with visual features at the classification stage, and caption matching, which introduces a caption-based alignment loss during training to ensure semantic coherence between real and synthetic data. Additionally, we apply segmentation masks to isolate target objects and remove background distractions, introducing two loss functions designed for object-centric learning: masked feature alignment loss and masked gradient matching loss. Comprehensive evaluations demonstrate that integrating caption-based guidance and object-centric masking enhances dataset distillation, leading to synthetic datasets that achieve superior performance on downstream tasks.

[58] Boosting Zero-shot Stereo Matching using Large-scale Mixed Images Sources in the Real World

Yuran Wang,Yingping Liang,Ying Fu

Main category: cs.CV

TL;DR: 提出了一种名为BooSTer的新框架,利用视觉基础模型和大规模混合图像源(合成、真实和单视图图像)解决立体匹配中标注数据稀缺和域差距问题。

Details Motivation: 立体匹配方法依赖密集像素级标注数据,但获取成本高,且合成与真实图像间的域差距带来挑战。 Method: 结合单目深度估计和扩散模型生成立体匹配数据;利用伪单目深度标签和动态尺度不变损失进行监督;引入视觉基础模型提取特征。 Result: 在基准数据集上表现优异,显著提升了有限标注数据和域转移场景下的准确性。 Conclusion: BooSTer框架有效解决了标注数据稀缺和域差距问题,提升了立体匹配的准确性和泛化能力。 Abstract: Stereo matching methods rely on dense pixel-wise ground truth labels, which are laborious to obtain, especially for real-world datasets. The scarcity of labeled data and domain gaps between synthetic and real-world images also pose notable challenges. In this paper, we propose a novel framework, \textbf{BooSTer}, that leverages both vision foundation models and large-scale mixed image sources, including synthetic, real, and single-view images. First, to fully unleash the potential of large-scale single-view images, we design a data generation strategy combining monocular depth estimation and diffusion models to generate dense stereo matching data from single-view images. Second, to tackle sparse labels in real-world datasets, we transfer knowledge from monocular depth estimation models, using pseudo-mono depth labels and a dynamic scale- and shift-invariant loss for additional supervision. Furthermore, we incorporate vision foundation model as an encoder to extract robust and transferable features, boosting accuracy and generalization. Extensive experiments on benchmark datasets demonstrate the effectiveness of our approach, achieving significant improvements in accuracy over existing methods, particularly in scenarios with limited labeled data and domain shifts.

[59] WaveGuard: Robust Deepfake Detection and Source Tracing via Dual-Tree Complex Wavelet and Graph Neural Networks

Ziyuan He,Zhiqing Guo,Liejun Wang,Gaobo Yang,Yunfeng Diao,Dan Ma

Main category: cs.CV

TL;DR: WaveGuard是一种主动水印框架,通过频域嵌入和图结构一致性增强鲁棒性和不可感知性,用于应对Deepfake技术的隐私和身份盗窃风险。

Details Motivation: Deepfake技术带来的隐私侵犯和身份盗窃风险日益增加,需要一种更鲁棒且不可感知的水印方法。 Method: 使用双树复小波变换(DT-CWT)在高频子带嵌入水印,并通过结构一致性图神经网络(SC-GNN)保持视觉质量,同时设计了注意力模块以提高嵌入精度。 Result: 在面部交换和重演任务中,WaveGuard在鲁棒性和视觉质量上均优于现有方法。 Conclusion: WaveGuard为应对Deepfake威胁提供了一种有效的水印解决方案,代码已开源。 Abstract: Deepfake technology poses increasing risks such as privacy invasion and identity theft. To address these threats, we propose WaveGuard, a proactive watermarking framework that enhances robustness and imperceptibility via frequency-domain embedding and graph-based structural consistency. Specifically, we embed watermarks into high-frequency sub-bands using Dual-Tree Complex Wavelet Transform (DT-CWT) and employ a Structural Consistency Graph Neural Network (SC-GNN) to preserve visual quality. We also design an attention module to refine embedding precision. Experimental results on face swap and reenactment tasks demonstrate that WaveGuard outperforms state-of-the-art methods in both robustness and visual quality. Code is available at https://github.com/vpsg-research/WaveGuard.

[60] OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning

Zhaochen Su,Linjie Li,Mingyang Song,Yunzhuo Hao,Zhengyuan Yang,Jun Zhang,Guanjie Chen,Jiawei Gu,Juntao Li,Xiaoye Qu,Yu Cheng

Main category: cs.CV

TL;DR: OpenThinkIMG是一个开源框架,用于增强大型视觉语言模型(LVLMs)的动态视觉工具调用能力,通过强化学习(V-ToolRL)显著提升了任务表现。

Details Motivation: 当前缺乏标准化基础设施,阻碍了视觉工具的集成和交互数据的生成,限制了LVLMs的动态问题解决能力。 Method: 提出OpenThinkIMG框架,包括标准化工具接口、轨迹生成和强化学习(V-ToolRL)训练方法。 Result: 在图表推理任务中,V-ToolRL训练的模型比监督学习基线表现更好,甚至超过GPT-4.1。 Conclusion: OpenThinkIMG为动态视觉推理提供了基础框架,有望推动AI在图像思维方面的发展。 Abstract: While humans can flexibly leverage interactive visual cognition for complex problem-solving, enabling Large Vision-Language Models (LVLMs) to learn similarly adaptive behaviors with visual tools remains challenging. A significant hurdle is the current lack of standardized infrastructure, which hinders integrating diverse tools, generating rich interaction data, and training robust agents effectively. To address these gaps, we introduce OpenThinkIMG, the first open-source, comprehensive end-to-end framework for tool-augmented LVLMs. It features standardized vision tool interfaces, scalable trajectory generation for policy initialization, and a flexible training environment. Furthermore, considering supervised fine-tuning (SFT) on static demonstrations offers limited policy generalization for dynamic tool invocation, we propose a novel reinforcement learning (RL) framework V-ToolRL to train LVLMs to learn adaptive policies for invoking external vision tools. V-ToolRL enables LVLMs to autonomously discover optimal tool-usage strategies by directly optimizing for task success using feedback from tool interactions. We empirically validate V-ToolRL on challenging chart reasoning tasks. Our RL-trained agent, built upon a Qwen2-VL-2B, significantly outperforms its SFT-initialized counterpart (+28.83 points) and surpasses established supervised tool-learning baselines like Taco and CogCom by an average of +12.7 points. Notably, it also surpasses prominent closed-source models like GPT-4.1 by +8.68 accuracy points. We hope OpenThinkIMG can serve as a foundational framework for advancing dynamic, tool-augmented visual reasoning, helping the community develop AI agents that can genuinely "think with images".

[61] DLO-Splatting: Tracking Deformable Linear Objects Using 3D Gaussian Splatting

Holly Dinkel,Marcel Büsching,Alberta Longhini,Brian Coltin,Trey Smith,Danica Kragic,Mårten Björkman,Timothy Bretl

Main category: cs.CV

TL;DR: DLO-Splatting算法通过多视角RGB图像和夹爪状态信息预测-更新滤波估计可变形线性物体的3D形状,结合3D高斯渲染损失优化。

Details Motivation: 现有视觉方法在复杂场景(如打结)中表现不佳,需结合动态模型和视觉信息提升形状估计精度。 Method: 采用基于位置的动力学模型,结合形状平滑和刚性阻尼校正预测形状,通过3D高斯渲染损失迭代优化。 Result: 初步实验在打结场景中表现优于纯视觉方法。 Conclusion: DLO-Splatting在复杂场景中展现出潜力,结合动态模型与视觉信息是有效方向。 Abstract: This work presents DLO-Splatting, an algorithm for estimating the 3D shape of Deformable Linear Objects (DLOs) from multi-view RGB images and gripper state information through prediction-update filtering. The DLO-Splatting algorithm uses a position-based dynamics model with shape smoothness and rigidity dampening corrections to predict the object shape. Optimization with a 3D Gaussian Splatting-based rendering loss iteratively renders and refines the prediction to align it with the visual observations in the update step. Initial experiments demonstrate promising results in a knot tying scenario, which is challenging for existing vision-only methods.

[62] SkillFormer: Unified Multi-View Video Understanding for Proficiency Estimation

Edoardo Bianchi,Antonio Liotta

Main category: cs.CV

TL;DR: SkillFormer是一种参数高效的架构,用于从第一人称和第三人称视频中统一评估多视角技能水平,通过跨视角融合模块和低秩适应技术显著提升了效率和准确性。

Details Motivation: 评估复杂活动中的人类技能水平在体育、康复和培训中有重要应用,但现有方法在多视角融合和计算效率上存在不足。 Method: 基于TimeSformer架构,SkillFormer引入跨视角融合模块(CrossViewFusion),结合多头交叉注意力、可学习门控和自适应自校准,并采用低秩适应技术减少训练参数。 Result: 在EgoExo4D数据集上,SkillFormer在多视角设置中达到最先进精度,参数减少4.5倍,训练周期减少3.75倍。 Conclusion: SkillFormer证明了多视角融合在细粒度技能评估中的价值,同时显著提升了计算效率。 Abstract: Assessing human skill levels in complex activities is a challenging problem with applications in sports, rehabilitation, and training. In this work, we present SkillFormer, a parameter-efficient architecture for unified multi-view proficiency estimation from egocentric and exocentric videos. Building on the TimeSformer backbone, SkillFormer introduces a CrossViewFusion module that fuses view-specific features using multi-head cross-attention, learnable gating, and adaptive self-calibration. We leverage Low-Rank Adaptation to fine-tune only a small subset of parameters, significantly reducing training costs. In fact, when evaluated on the EgoExo4D dataset, SkillFormer achieves state-of-the-art accuracy in multi-view settings while demonstrating remarkable computational efficiency, using 4.5x fewer parameters and requiring 3.75x fewer training epochs than prior baselines. It excels in multiple structured tasks, confirming the value of multi-view integration for fine-grained skill assessment.

[63] Calibration and Uncertainty for multiRater Volume Assessment in multiorgan Segmentation (CURVAS) challenge results

Meritxell Riera-Marin,Sikha O K,Julia Rodriguez-Comas,Matthias Stefan May,Zhaohong Pan,Xiang Zhou,Xiaokun Liang,Franciskus Xaverius Erick,Andrea Prenner,Cedric Hemon,Valentin Boussot,Jean-Louis Dillenseger,Jean-Claude Nunes,Abdul Qayyum,Moona Mazher,Steven A Niederer,Kaisar Kushibar,Carlos Martin-Isla,Petia Radeva,Karim Lekadir,Theodore Barfoot,Luis C. Garcia Peraza Herrera,Ben Glocker,Tom Vercauteren,Lucas Gago,Justin Englemann,Joy-Marie Kleiss,Anton Aubanell,Andreu Antolin,Javier Garcia-Lopez,Miguel A. Gonzalez Ballester,Adrian Galdran

Main category: cs.CV

TL;DR: 论文提出CURVAS方法,强调多标注者在医学图像分割中的重要性,通过评估模型校准和不确定性,提升分割模型的可靠性和临床适用性。

Details Motivation: 解决医学图像分割中标注变异性、校准和不确定性估计等关键挑战,确保深度学习模型的可靠性和临床适用性。 Method: 创建CURVAS方法,利用多标注者建立更全面的基准,评估模型在DSC、ECE和CRPS等指标上的表现。 Result: 模型校准与结果质量强相关,预训练和多样化数据集增强模型鲁棒性,最佳模型表现高DSC和良好校准。 Conclusion: 多标注者基准、校准评估和不确定性感知是开发可靠医学图像分割模型的关键。 Abstract: Deep learning (DL) has become the dominant approach for medical image segmentation, yet ensuring the reliability and clinical applicability of these models requires addressing key challenges such as annotation variability, calibration, and uncertainty estimation. This is why we created the Calibration and Uncertainty for multiRater Volume Assessment in multiorgan Segmentation (CURVAS), which highlights the critical role of multiple annotators in establishing a more comprehensive ground truth, emphasizing that segmentation is inherently subjective and that leveraging inter-annotator variability is essential for robust model evaluation. Seven teams participated in the challenge, submitting a variety of DL models evaluated using metrics such as Dice Similarity Coefficient (DSC), Expected Calibration Error (ECE), and Continuous Ranked Probability Score (CRPS). By incorporating consensus and dissensus ground truth, we assess how DL models handle uncertainty and whether their confidence estimates align with true segmentation performance. Our findings reinforce the importance of well-calibrated models, as better calibration is strongly correlated with the quality of the results. Furthermore, we demonstrate that segmentation models trained on diverse datasets and enriched with pre-trained knowledge exhibit greater robustness, particularly in cases deviating from standard anatomical structures. Notably, the best-performing models achieved high DSC and well-calibrated uncertainty estimates. This work underscores the need for multi-annotator ground truth, thorough calibration assessments, and uncertainty-aware evaluations to develop trustworthy and clinically reliable DL-based medical image segmentation models.

[64] SPAST: Arbitrary Style Transfer with Style Priors via Pre-trained Large-scale Model

Zhanjie Zhang,Quanwei Zhang,Junsheng Luan,Mengyuan Yang,Yun Wang,Lei Zhao

Main category: cs.CV

TL;DR: SPAST框架通过局部-全局窗口风格化模块和风格先验损失,实现高质量风格迁移并减少推理时间。

Details Motivation: 现有方法要么生成低质量图像,要么推理时间长且难以保留内容结构。 Method: 设计局部-全局窗口风格化模块(LGWSSM)和风格先验损失,结合预训练大模型的风格先验。 Result: 实验证明SPAST能生成高质量风格化图像且推理时间更短。 Conclusion: SPAST在质量和效率上优于现有方法。 Abstract: Given an arbitrary content and style image, arbitrary style transfer aims to render a new stylized image which preserves the content image's structure and possesses the style image's style. Existing arbitrary style transfer methods are based on either small models or pre-trained large-scale models. The small model-based methods fail to generate high-quality stylized images, bringing artifacts and disharmonious patterns. The pre-trained large-scale model-based methods can generate high-quality stylized images but struggle to preserve the content structure and cost long inference time. To this end, we propose a new framework, called SPAST, to generate high-quality stylized images with less inference time. Specifically, we design a novel Local-global Window Size Stylization Module (LGWSSM)tofuse style features into content features. Besides, we introduce a novel style prior loss, which can dig out the style priors from a pre-trained large-scale model into the SPAST and motivate the SPAST to generate high-quality stylized images with short inference time.We conduct abundant experiments to verify that our proposed method can generate high-quality stylized images and less inference time compared with the SOTA arbitrary style transfer methods.

[65] Controllable Image Colorization with Instance-aware Texts and Masks

Yanru An,Ling Gui,Qiang Hu,Chunlei Cai,Tianxiao Ye,Xiaoyun Zhang,Yanfeng Wang

Main category: cs.CV

TL;DR: 提出了一种基于扩散模型的实例感知图像着色方法MT-Color,通过像素级掩码注意力机制和实例掩码文本引导模块解决颜色溢出和绑定错误问题,并构建了专用数据集GPT-color。

Details Motivation: 当前主流图像着色模型存在颜色溢出和绑定错误问题,且无法实现实例级着色。 Method: 设计了像素级掩码注意力机制和实例掩码文本引导模块,采用多实例采样策略,并构建了数据集GPT-color。 Result: 定性和定量实验表明,模型和数据集优于现有方法。 Conclusion: MT-Color方法在实例级着色任务中表现优异,解决了颜色溢出和绑定错误问题。 Abstract: Recently, the application of deep learning in image colorization has received widespread attention. The maturation of diffusion models has further advanced the development of image colorization models. However, current mainstream image colorization models still face issues such as color bleeding and color binding errors, and cannot colorize images at the instance level. In this paper, we propose a diffusion-based colorization method MT-Color to achieve precise instance-aware colorization with use-provided guidance. To tackle color bleeding issue, we design a pixel-level mask attention mechanism that integrates latent features and conditional gray image features through cross-attention. We use segmentation masks to construct cross-attention masks, preventing pixel information from exchanging between different instances. We also introduce an instance mask and text guidance module that extracts instance masks and text representations of each instance, which are then fused with latent features through self-attention, utilizing instance masks to form self-attention masks to prevent instance texts from guiding the colorization of other areas, thus mitigating color binding errors. Furthermore, we apply a multi-instance sampling strategy, which involves sampling each instance region separately and then fusing the results. Additionally, we have created a specialized dataset for instance-level colorization tasks, GPT-color, by leveraging large visual language models on existing image datasets. Qualitative and quantitative experiments show that our model and dataset outperform previous methods and datasets.

[66] TiMo: Spatiotemporal Foundation Model for Satellite Image Time Series

Xiaolei Qin,Di Wang,Jing Zhang,Fengxiang Wang,Xin Su,Bo Du,Liangpei Zhang

Main category: cs.CV

TL;DR: TiMo是一种新型的分层视觉Transformer基础模型,专为卫星图像时间序列(SITS)分析设计,通过动态捕捉多尺度时空关系提升下游任务性能。

Details Motivation: 现有时空基础模型依赖普通视觉Transformer,未能显式捕捉多尺度时空关系,限制了其在下游任务中的有效性。 Method: 提出TiMo模型,采用时空陀螺注意力机制动态捕捉多尺度时空模式,并利用MillionST数据集进行预训练。 Result: 在多项时空任务(如森林砍伐监测、土地覆盖分割等)中,TiMo优于现有方法。 Conclusion: TiMo通过显式建模多尺度时空关系,显著提升了SITS分析的性能。 Abstract: Satellite image time series (SITS) provide continuous observations of the Earth's surface, making them essential for applications such as environmental management and disaster assessment. However, existing spatiotemporal foundation models rely on plain vision transformers, which encode entire temporal sequences without explicitly capturing multiscale spatiotemporal relationships between land objects. This limitation hinders their effectiveness in downstream tasks. To overcome this challenge, we propose TiMo, a novel hierarchical vision transformer foundation model tailored for SITS analysis. At its core, we introduce a spatiotemporal gyroscope attention mechanism that dynamically captures evolving multiscale patterns across both time and space. For pre-training, we curate MillionST, a large-scale dataset of one million images from 100,000 geographic locations, each captured across 10 temporal phases over five years, encompassing diverse geospatial changes and seasonal variations. Leveraging this dataset, we adapt masked image modeling to pre-train TiMo, enabling it to effectively learn and encode generalizable spatiotemporal representations.Extensive experiments across multiple spatiotemporal tasks-including deforestation monitoring, land cover segmentation, crop type classification, and flood detection-demonstrate TiMo's superiority over state-of-the-art methods. Code, model, and dataset will be released at https://github.com/MiliLab/TiMo.

[67] Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving

Zongchuang Zhao,Haoyu Fu,Dingkang Liang,Xin Zhou,Dingyuan Zhang,Hongwei Xie,Bing Wang,Xiang Bai

Main category: cs.CV

TL;DR: 论文提出NuInteract数据集和DriveMonkey框架,解决LVLMs在3D场景理解中的不足,显著提升3D视觉定位任务性能。

Details Motivation: 现有LVLMs在自动驾驶场景中缺乏全面的场景理解和2D-3D映射关系,且3D物体定位与指令理解结合不足。 Method: 引入NuInteract数据集(150万对多视角图像语言数据),并提出DriveMonkey框架,结合空间处理器提升3D感知。 Result: DriveMonkey在3D视觉定位任务上比通用LVLMs提升9.86%。 Conclusion: DriveMonkey和NuInteract为LVLMs在自动驾驶中的3D场景理解提供了有效解决方案。 Abstract: The Large Visual-Language Models (LVLMs) have significantly advanced image understanding. Their comprehension and reasoning capabilities enable promising applications in autonomous driving scenarios. However, existing research typically focuses on front-view perspectives and partial objects within scenes, struggling to achieve comprehensive scene understanding. Meanwhile, existing LVLMs suffer from the lack of mapping relationship between 2D and 3D and insufficient integration of 3D object localization and instruction understanding. To tackle these limitations, we first introduce NuInteract, a large-scale dataset with over 1.5M multi-view image language pairs spanning dense scene captions and diverse interactive tasks. Furthermore, we propose DriveMonkey, a simple yet effective framework that seamlessly integrates LVLMs with a spatial processor using a series of learnable queries. The spatial processor, designed as a plug-and-play component, can be initialized with pre-trained 3D detectors to improve 3D perception. Our experiments show that DriveMonkey outperforms general LVLMs, especially achieving a 9.86% notable improvement on the 3D visual grounding task. The dataset and code will be released at https://github.com/zc-zhao/DriveMonkey.

[68] Advancing Food Nutrition Estimation via Visual-Ingredient Feature Fusion

Huiyan Qi,Bin Zhu,Chong-Wah Ngo,Jingjing Chen,Ee-Peng Lim

Main category: cs.CV

TL;DR: FastFood数据集和VIF²方法通过融合视觉与成分特征提升营养估计准确性。

Details Motivation: 营养估计对健康饮食至关重要,但缺乏带营养标注的数据集限制了进展。 Method: 提出VIF²方法,结合视觉与成分特征,并通过同义词替换和重采样增强成分鲁棒性。 Result: 在FastFood和Nutrition5k数据集上验证了方法的有效性,支持成分信息的重要性。 Conclusion: VIF²方法显著提升了营养估计的准确性,证明了成分信息的关键作用。 Abstract: Nutrition estimation is an important component of promoting healthy eating and mitigating diet-related health risks. Despite advances in tasks such as food classification and ingredient recognition, progress in nutrition estimation is limited due to the lack of datasets with nutritional annotations. To address this issue, we introduce FastFood, a dataset with 84,446 images across 908 fast food categories, featuring ingredient and nutritional annotations. In addition, we propose a new model-agnostic Visual-Ingredient Feature Fusion (VIF$^2$) method to enhance nutrition estimation by integrating visual and ingredient features. Ingredient robustness is improved through synonym replacement and resampling strategies during training. The ingredient-aware visual feature fusion module combines ingredient features and visual representation to achieve accurate nutritional prediction. During testing, ingredient predictions are refined using large multimodal models by data augmentation and majority voting. Our experiments on both FastFood and Nutrition5k datasets validate the effectiveness of our proposed method built in different backbones (e.g., Resnet, InceptionV3 and ViT), which demonstrates the importance of ingredient information in nutrition estimation. https://huiyanqi.github.io/fastfood-nutrition-estimation/.

[69] Towards Autonomous UAV Visual Object Search in City Space: Benchmark and Agentic Methodology

Yatai Ji,Zhengqiu Zhu,Yong Zhao,Beidan Liu,Chen Gao,Yihao Zhao,Sihang Qiu,Yue Hu,Quanjun Yin,Yong Li

Main category: cs.CV

TL;DR: 论文提出了CityAVOS数据集和PRPSearcher方法,用于解决无人机在复杂城市环境中自主视觉目标搜索的挑战,显著提升了搜索成功率和效率。

Details Motivation: 现有方法在复杂城市环境中表现不佳,存在语义冗余、相似物体区分和探索-利用困境等问题,需要新的数据集和方法来支持AVOS任务。 Method: 提出PRPSearcher方法,基于多模态大语言模型,构建三种专用地图(动态语义地图、3D认知地图和3D不确定性地图),并结合去噪机制和IPT提示机制。 Result: PRPSearcher在CityAVOS数据集上表现优于基线方法(平均提升37.69%成功率、28.96%路径效率,减少30.69%平均搜索步数和46.40%导航误差)。 Conclusion: 尽管PRPSearcher表现优异,但与人类相比仍有差距,未来需提升语义推理和空间探索能力。该工作为未来目标搜索研究奠定了基础。 Abstract: Aerial Visual Object Search (AVOS) tasks in urban environments require Unmanned Aerial Vehicles (UAVs) to autonomously search for and identify target objects using visual and textual cues without external guidance. Existing approaches struggle in complex urban environments due to redundant semantic processing, similar object distinction, and the exploration-exploitation dilemma. To bridge this gap and support the AVOS task, we introduce CityAVOS, the first benchmark dataset for autonomous search of common urban objects. This dataset comprises 2,420 tasks across six object categories with varying difficulty levels, enabling comprehensive evaluation of UAV agents' search capabilities. To solve the AVOS tasks, we also propose PRPSearcher (Perception-Reasoning-Planning Searcher), a novel agentic method powered by multi-modal large language models (MLLMs) that mimics human three-tier cognition. Specifically, PRPSearcher constructs three specialized maps: an object-centric dynamic semantic map enhancing spatial perception, a 3D cognitive map based on semantic attraction values for target reasoning, and a 3D uncertainty map for balanced exploration-exploitation search. Also, our approach incorporates a denoising mechanism to mitigate interference from similar objects and utilizes an Inspiration Promote Thought (IPT) prompting mechanism for adaptive action planning. Experimental results on CityAVOS demonstrate that PRPSearcher surpasses existing baselines in both success rate and search efficiency (on average: +37.69% SR, +28.96% SPL, -30.69% MSS, and -46.40% NE). While promising, the performance gap compared to humans highlights the need for better semantic reasoning and spatial exploration capabilities in AVOS tasks. This work establishes a foundation for future advances in embodied target search. Dataset and source code are available at https://anonymous.4open.science/r/CityAVOS-3DF8.

cs.GR [Back]

[70] PosterO: Structuring Layout Trees to Enable Language Models in Generalized Content-Aware Layout Generation

HsiaoYuan Hsu,Yuxin Peng

Main category: cs.GR

TL;DR: PosterO提出了一种基于布局中心的方法,利用大语言模型(LLMs)的布局知识生成多样化海报设计,通过SVG树结构和上下文学习实现,并在实验中表现出色。

Details Motivation: 现有方法侧重于图像增强,忽视了布局多样性和形状变化元素的处理,无法适应广义设计需求。 Method: 将布局结构化为SVG树,结合设计意图向量化和层次节点表示,利用LLMs进行上下文学习预测新布局树。 Result: PosterO能生成视觉吸引力的布局,在多个基准测试中达到最新性能。 Conclusion: PosterO在广义设置下表现优异,并构建了PStylish7数据集以推动进一步研究。 Abstract: In poster design, content-aware layout generation is crucial for automatically arranging visual-textual elements on the given image. With limited training data, existing work focused on image-centric enhancement. However, this neglects the diversity of layouts and fails to cope with shape-variant elements or diverse design intents in generalized settings. To this end, we proposed a layout-centric approach that leverages layout knowledge implicit in large language models (LLMs) to create posters for omnifarious purposes, hence the name PosterO. Specifically, it structures layouts from datasets as trees in SVG language by universal shape, design intent vectorization, and hierarchical node representation. Then, it applies LLMs during inference to predict new layout trees by in-context learning with intent-aligned example selection. After layout trees are generated, we can seamlessly realize them into poster designs by editing the chat with LLMs. Extensive experimental results have demonstrated that PosterO can generate visually appealing layouts for given images, achieving new state-of-the-art performance across various benchmarks. To further explore PosterO's abilities under the generalized settings, we built PStylish7, the first dataset with multi-purpose posters and various-shaped elements, further offering a challenging test for advanced research.

[71] Monocular Online Reconstruction with Enhanced Detail Preservation

Songyin Wu,Zhaoyang Lv,Yufeng Zhu,Duncan Frost,Zhengqin Li,Ling-Qi Yan,Carl Ren,Richard Newcombe,Zhao Dong

Main category: cs.GR

TL;DR: 提出了一种基于3D高斯分布的在线密集映射框架,用于从单目图像流中重建逼真细节,解决了深度图依赖和局部全局一致性问题。

Details Motivation: 解决单目在线重建中不依赖深度图的高斯分布问题,并确保重建地图的局部和全局一致性。 Method: 引入分层高斯管理模块和全局一致性优化模块,提出多级占用哈希体素(MOHV)结构。 Result: 相比现有RGB-only和RGB-D方法,实现了更高质量的重建和高效计算。 Conclusion: 框架具有通用性和可扩展性,能与多种跟踪系统无缝集成。 Abstract: We propose an online 3D Gaussian-based dense mapping framework for photorealistic details reconstruction from a monocular image stream. Our approach addresses two key challenges in monocular online reconstruction: distributing Gaussians without relying on depth maps and ensuring both local and global consistency in the reconstructed maps. To achieve this, we introduce two key modules: the Hierarchical Gaussian Management Module for effective Gaussian distribution and the Global Consistency Optimization Module for maintaining alignment and coherence at all scales. In addition, we present the Multi-level Occupancy Hash Voxels (MOHV), a structure that regularizes Gaussians for capturing details across multiple levels of granularity. MOHV ensures accurate reconstruction of both fine and coarse geometries and textures, preserving intricate details while maintaining overall structural integrity. Compared to state-of-the-art RGB-only and even RGB-D methods, our framework achieves superior reconstruction quality with high computational efficiency. Moreover, it integrates seamlessly with various tracking systems, ensuring generality and scalability.

[72] ACT-R: Adaptive Camera Trajectories for 3D Reconstruction from Single Image

Yizhi Wang,Mingrui Zhao,Ali Mahdavi-Amiri,Hao Zhang

Main category: cs.GR

TL;DR: 提出自适应视角规划方法,通过动态相机轨迹优化多视角合成,提升单视角3D重建的遮挡揭示和3D一致性。

Details Motivation: 解决传统多视角合成中无序视角生成导致的遮挡问题和3D不一致性。 Method: 计算自适应相机轨迹(ACT),最大化遮挡区域可见性,并结合视频扩散模型生成新视角,用于多视角3D重建。 Result: 在GSO数据集上显著提升3D重建效果,定量和定性均优于现有方法。 Conclusion: 自适应视角规划有效提升遮挡揭示和3D一致性,无需运行时训练,高效且性能优越。 Abstract: We introduce adaptive view planning to multi-view synthesis, aiming to improve both occlusion revelation and 3D consistency for single-view 3D reconstruction. Instead of generating an unordered set of views independently or simultaneously, we generate a sequence of views, leveraging temporal consistency to enhance 3D coherence. Most importantly, our view sequence is not determined by a pre-determined camera setup. Instead, we compute an adaptive camera trajectory (ACT), specifically, an orbit of camera views, which maximizes the visibility of occluded regions of the 3D object to be reconstructed. Once the best orbit is found, we feed it to a video diffusion model to generate novel views around the orbit, which in turn, are passed to a multi-view 3D reconstruction model to obtain the final reconstruction. Our multi-view synthesis pipeline is quite efficient since it involves no run-time training/optimization, only forward inferences by applying the pre-trained models for occlusion analysis and multi-view synthesis. Our method predicts camera trajectories that reveal occlusions effectively and produce consistent novel views, significantly improving 3D reconstruction over SOTA on the unseen GSO dataset, both quantitatively and qualitatively.

[73] M3G: Multi-Granular Gesture Generator for Audio-Driven Full-Body Human Motion Synthesis

Zhizhuo Yin,Yuk Hang Tsui,Pan Hui

Main category: cs.GR

TL;DR: 提出了一种名为M3G的新框架,用于从音频生成全身手势,解决了现有方法因固定粒度而无法建模不同手势模式的问题。

Details Motivation: 现有系统因手势令牌的固定粒度而无法建模不同手势模式,限制了虚拟角色创建中自然和表达性手势的生成。 Method: 提出M3G框架,包括多粒度VQ-VAE(MGVQ-VAE)和多粒度令牌预测器,从音频提取信息并预测不同时间粒度的运动令牌。 Result: M3G在生成自然和表达性全身手势方面优于现有方法。 Conclusion: M3G通过多粒度建模显著提升了音频驱动手势生成的质量。 Abstract: Generating full-body human gestures encompassing face, body, hands, and global movements from audio is a valuable yet challenging task in virtual avatar creation. Previous systems focused on tokenizing the human gestures framewisely and predicting the tokens of each frame from the input audio. However, one observation is that the number of frames required for a complete expressive human gesture, defined as granularity, varies among different human gesture patterns. Existing systems fail to model these gesture patterns due to the fixed granularity of their gesture tokens. To solve this problem, we propose a novel framework named Multi-Granular Gesture Generator (M3G) for audio-driven holistic gesture generation. In M3G, we propose a novel Multi-Granular VQ-VAE (MGVQ-VAE) to tokenize motion patterns and reconstruct motion sequences from different temporal granularities. Subsequently, we proposed a multi-granular token predictor that extracts multi-granular information from audio and predicts the corresponding motion tokens. Then M3G reconstructs the human gestures from the predicted tokens using the MGVQ-VAE. Both objective and subjective experiments demonstrate that our proposed M3G framework outperforms the state-of-the-art methods in terms of generating natural and expressive full-body human gestures.

[74] Claycode: Stylable and Deformable 2D Scannable Codes

Marco Maida,Alberto Crescini,Marco Perronet,Elena Camuffo

Main category: cs.GR

TL;DR: Claycode是一种新型2D可扫描码,支持高度定制化和变形,基于树结构编码,优于传统二维码。

Details Motivation: 传统矩阵式二维码(如QR码)在高度定制化和变形场景下表现不佳,Claycode旨在解决这一问题。 Method: 通过树结构编码信息,将比特映射到拓扑树中,并以目标多边形边界内的嵌套彩色区域呈现。解码时从摄像头流中实时提取和解释。 Result: Claycode在高度定制化和变形场景下表现优异,功能不受影响,且对变形容忍度高。 Conclusion: Claycode为2D可扫描码提供了更灵活的设计空间和更强的适应性。 Abstract: This paper introduces Claycode, a novel 2D scannable code designed for extensive stylization and deformation. Unlike traditional matrix-based codes (e.g., QR codes), Claycodes encode their message in a tree structure. During the encoding process, bits are mapped into a topology tree, which is then depicted as a nesting of color regions drawn within the boundaries of a target polygon shape. When decoding, Claycodes are extracted and interpreted in real-time from a camera stream. We detail the end-to-end pipeline and show that Claycodes allow for extensive stylization without compromising their functionality. We then empirically demonstrate Claycode's high tolerance to heavy deformations, outperforming traditional 2D scannable codes in scenarios where they typically fail.

[75] CAD-Coder:Text-Guided CAD Files Code Generation

Changqi He,Shuhan Zhang,Liguo Zhang,Jiajun Miao

Main category: cs.GR

TL;DR: CAD-Coder是一个将自然语言指令转换为可编辑CAD脚本代码的框架,解决了现有生成方法缺乏交互性和几何标注的问题。

Details Motivation: 传统CAD依赖专家手工绘制或修改现有库文件,无法快速个性化;现有生成方法缺乏交互性和几何标注,限制了实际应用。 Method: 提出CAD-Coder框架,通过自然语言指令生成可执行的CAD脚本代码,构建包含29,130个Dxf文件及其脚本代码的数据集。 Result: 在多种2D/3D CAD生成任务中表现优于现有方法,提供可编辑的草图及几何标注。 Conclusion: CAD-Coder实现了交互式生成可编辑CAD文件,为个性化设计提供了高效解决方案。 Abstract: Computer-aided design (CAD) is a way to digitally create 2D drawings and 3D models of real-world products. Traditional CAD typically relies on hand-drawing by experts or modifications of existing library files, which doesn't allow for rapid personalization. With the emergence of generative artificial intelligence, convenient and efficient personalized CAD generation has become possible. However, existing generative methods typically produce outputs that lack interactive editability and geometric annotations, limiting their practical applications in manufacturing. To enable interactive generative CAD, we propose CAD-Coder, a framework that transforms natural language instructions into CAD script codes, which can be executed in Python environments to generate human-editable CAD files (.Dxf). To facilitate the generation of editable CAD sketches with annotation information, we construct a comprehensive dataset comprising 29,130 Dxf files with their corresponding script codes, where each sketch preserves both editability and geometric annotations. We evaluate CAD-Coder on various 2D/3D CAD generation tasks against existing methods, demonstrating superior interactive capabilities while uniquely providing editable sketches with geometric annotations.

cs.CL [Back]

[76] Polysemy of Synthetic Neurons Towards a New Type of Explanatory Categorical Vector Spaces

Michael Pichat,William Pogrund,Paloma Pichat,Judicael Poumay,Armanouche Gasparian,Samuel Demarchi,Martin Corbet,Alois Georgeon,Michael Veillet-Guillem

Main category: cs.CL

TL;DR: 论文提出了一种几何方法,将神经元定义为具有非正交基的分类向量空间,通过神经元内注意力过程识别关键分类区域,提升语言模型效率。

Details Motivation: 探索人工神经网络中多义神经元的本质,提出更高效的几何定义方法。 Method: 将神经元定义为分类向量空间,利用非正交基和神经元内注意力识别关键分类区域。 Result: 发现更均匀且位于分类子维度交叉点的关键区域,提升模型效率。 Conclusion: 几何定义方法为理解多义神经元提供了新视角,并优化了语言模型性能。 Abstract: The polysemantic nature of synthetic neurons in artificial intelligence language models is currently understood as the result of a necessary superposition of distributed features within the latent space. We propose an alternative approach, geometrically defining a neuron in layer n as a categorical vector space with a non-orthogonal basis, composed of categorical sub-dimensions extracted from preceding neurons in layer n-1. This categorical vector space is structured by the activation space of each neuron and enables, via an intra-neuronal attention process, the identification and utilization of a critical categorical zone for the efficiency of the language model - more homogeneous and located at the intersection of these different categorical sub-dimensions.

[77] A Tale of Two Identities: An Ethical Audit of Human and AI-Crafted Personas

Pranav Narayanan Venkit,Jiayi Li,Yingfan Zhou,Sarah Rajtmajer,Shomir Wilson

Main category: cs.CL

TL;DR: 论文研究了LLM生成的合成人物在种族身份表征上的问题,揭示了其过度强调种族标记和文化编码语言,导致刻板印象等社会技术危害。

Details Motivation: 随着LLM在数据有限领域(如健康、隐私和HCI)中生成合成人物的应用增多,需了解这些叙述如何表征少数群体身份。 Method: 采用混合方法(细读、词汇分析和参数化创造力框架)比较1512个LLM生成人物与人类撰写回答。 Result: LLM生成的人物过度突出种族标记,文化编码语言过多,导致刻板印象、异化等危害。 Conclusion: 提出算法他者化概念,建议设计叙事感知评估指标和社区中心验证协议。 Abstract: As LLMs (large language models) are increasingly used to generate synthetic personas particularly in data-limited domains such as health, privacy, and HCI, it becomes necessary to understand how these narratives represent identity, especially that of minority communities. In this paper, we audit synthetic personas generated by 3 LLMs (GPT4o, Gemini 1.5 Pro, Deepseek 2.5) through the lens of representational harm, focusing specifically on racial identity. Using a mixed methods approach combining close reading, lexical analysis, and a parameterized creativity framework, we compare 1512 LLM generated personas to human-authored responses. Our findings reveal that LLMs disproportionately foreground racial markers, overproduce culturally coded language, and construct personas that are syntactically elaborate yet narratively reductive. These patterns result in a range of sociotechnical harms, including stereotyping, exoticism, erasure, and benevolent bias, that are often obfuscated by superficially positive narrations. We formalize this phenomenon as algorithmic othering, where minoritized identities are rendered hypervisible but less authentic. Based on these findings, we offer design recommendations for narrative-aware evaluation metrics and community-centered validation protocols for synthetic identity generation.

[78] Joint Detection of Fraud and Concept Drift inOnline Conversations with LLM-Assisted Judgment

Ali Senol,Garima Agrawal,Huan Liu

Main category: cs.CL

TL;DR: 提出了一种两阶段检测框架,结合集成分类模型和概念漂移分析,以提高数字通信平台中虚假交互检测的准确性和可解释性。

Details Motivation: 数字通信平台中的虚假交互问题尚未得到充分解决,传统静态异常检测方法难以适应动态对话变化,易产生误判。 Method: 采用两阶段框架:1) 使用集成分类模型识别可疑对话;2) 通过概念漂移分析(OCDD)和大型语言模型(LLM)判断是否为欺诈行为。 Result: 在社交工程聊天场景数据集上验证,框架显著提高了检测准确性和实时性。 Conclusion: 该方法优于传统的双LLM基线,在欺诈检测中更具实用性和解释性。 Abstract: Detecting fake interactions in digital communication platforms remains a challenging and insufficiently addressed problem. These interactions may appear as harmless spam or escalate into sophisticated scam attempts, making it difficult to flag malicious intent early. Traditional detection methods often rely on static anomaly detection techniques that fail to adapt to dynamic conversational shifts. One key limitation is the misinterpretation of benign topic transitions referred to as concept drift as fraudulent behavior, leading to either false alarms or missed threats. We propose a two stage detection framework that first identifies suspicious conversations using a tailored ensemble classification model. To improve the reliability of detection, we incorporate a concept drift analysis step using a One Class Drift Detector (OCDD) to isolate conversational shifts within flagged dialogues. When drift is detected, a large language model (LLM) assesses whether the shift indicates fraudulent manipulation or a legitimate topic change. In cases where no drift is found, the behavior is inferred to be spam like. We validate our framework using a dataset of social engineering chat scenarios and demonstrate its practical advantages in improving both accuracy and interpretability for real time fraud detection. To contextualize the trade offs, we compare our modular approach against a Dual LLM baseline that performs detection and judgment using different language models.

[79] CrashSage: A Large Language Model-Centered Framework for Contextual and Interpretable Traffic Crash Analysis

Hao Zhen,Jidong J. Yang

Main category: cs.CL

TL;DR: CrashSage是一个基于大型语言模型(LLM)的框架,通过表格转文本、数据增强、模型微调和可解释性技术,提升交通事故分析的准确性和透明度。

Details Motivation: 全球每年因交通事故造成巨大生命和经济损失,现有方法难以捕捉复杂关系和上下文信息,亟需更有效的分析工具。 Method: 1. 表格转文本策略;2. 上下文感知数据增强;3. 微调LLaMA3-8B模型;4. 梯度可解释性技术。 Result: CrashSage在事故严重性推断上优于基线方法,并提供更透明的决策解释。 Conclusion: CrashSage为交通安全研究提供了新思路,通过LLM技术实现更精准和可解释的分析。 Abstract: Road crashes claim over 1.3 million lives annually worldwide and incur global economic losses exceeding \$1.8 trillion. Such profound societal and financial impacts underscore the urgent need for road safety research that uncovers crash mechanisms and delivers actionable insights. Conventional statistical models and tree ensemble approaches typically rely on structured crash data, overlooking contextual nuances and struggling to capture complex relationships and underlying semantics. Moreover, these approaches tend to incur significant information loss, particularly in narrative elements related to multi-vehicle interactions, crash progression, and rare event characteristics. This study presents CrashSage, a novel Large Language Model (LLM)-centered framework designed to advance crash analysis and modeling through four key innovations. First, we introduce a tabular-to-text transformation strategy paired with relational data integration schema, enabling the conversion of raw, heterogeneous crash data into enriched, structured textual narratives that retain essential structural and relational context. Second, we apply context-aware data augmentation using a base LLM model to improve narrative coherence while preserving factual integrity. Third, we fine-tune the LLaMA3-8B model for crash severity inference, demonstrating superior performance over baseline approaches, including zero-shot, zero-shot with chain-of-thought prompting, and few-shot learning, with multiple models (GPT-4o, GPT-4o-mini, LLaMA3-70B). Finally, we employ a gradient-based explainability technique to elucidate model decisions at both the individual crash level and across broader risk factor dimensions. This interpretability mechanism enhances transparency and enables targeted road safety interventions by providing deeper insights into the most influential factors.

[80] Unpacking Robustness in Inflectional Languages: Adversarial Evaluation and Mechanistic Insights

Paweł Walkowiak,Marek Klonowski,Marcin Oleksy,Arkadiusz Janz

Main category: cs.CL

TL;DR: 本文研究了对抗性攻击在屈折语言中的表现,提出了一种基于Edge Attribution Patching(EAP)的新评估协议,并通过波兰语和英语的平行语料库分析了屈折变化对模型鲁棒性的影响。

Details Motivation: 大多数对抗性攻击方法主要针对非屈折语言(如英语),而屈折语言(如波兰语)中的表现尚未充分研究。本文旨在填补这一空白。 Method: 设计了基于EAP的新评估协议,利用波兰语和英语的平行语料库,创建了基于MultiEmo数据集的基准,分析模型在攻击下的行为。 Result: 揭示了屈折变化对模型鲁棒性的影响,并识别了模型中与屈折相关的机制性元素。 Conclusion: 屈折语言中的对抗性攻击表现与非屈折语言不同,新评估协议为理解模型行为提供了新视角。 Abstract: Various techniques are used in the generation of adversarial examples, including methods such as TextBugger which introduce minor, hardly visible perturbations to words leading to changes in model behaviour. Another class of techniques involves substituting words with their synonyms in a way that preserves the text's meaning but alters its predicted class, with TextFooler being a prominent example of such attacks. Most adversarial example generation methods are developed and evaluated primarily on non-inflectional languages, typically English. In this work, we evaluate and explain how adversarial attacks perform in inflectional languages. To explain the impact of inflection on model behaviour and its robustness under attack, we designed a novel protocol inspired by mechanistic interpretability, based on Edge Attribution Patching (EAP) method. The proposed evaluation protocol relies on parallel task-specific corpora that include both inflected and syncretic variants of texts in two languages -- Polish and English. To analyse the models and explain the relationship between inflection and adversarial robustness, we create a new benchmark based on task-oriented dataset MultiEmo, enabling the identification of mechanistic inflection-related elements of circuits within the model and analyse their behaviour under attack.

[81] Enhanced Urdu Intent Detection with Large Language Models and Prototype-Informed Predictive Pipelines

Faiza Hassan,Summra Saleem,Kashif Javed,Muhammad Nabeel Asim,Abdur Rehman,Andreas Dengel

Main category: cs.CL

TL;DR: 本文提出了一种针对乌尔都语的意图检测方法LLMPIA,结合对比学习和原型注意力机制,显著提升了在少样本和相同类测试中的性能。

Details Motivation: 乌尔都语作为第十大语言,缺乏基于少样本学习的意图检测方法,传统方法局限于训练集中的类别。 Method: 采用对比学习利用未标记数据重新训练预训练语言模型,结合原型注意力机制构建LLMPIA管道。 Result: 在ATIS和Web Queries数据集上,LLMPIA在少样本和相同类测试中均取得显著提升的F1分数。 Conclusion: LLMPIA为乌尔都语意图检测提供了高效解决方案,显著优于现有方法。 Abstract: Multifarious intent detection predictors are developed for different languages, including English, Chinese and French, however, the field remains underdeveloped for Urdu, the 10th most spoken language. In the realm of well-known languages, intent detection predictors utilize the strategy of few-shot learning and prediction of unseen classes based on the model training on seen classes. However, Urdu language lacks few-shot strategy based intent detection predictors and traditional predictors are focused on prediction of the same classes which models have seen in the train set. To empower Urdu language specific intent detection, this introduces a unique contrastive learning approach that leverages unlabeled Urdu data to re-train pre-trained language models. This re-training empowers LLMs representation learning for the downstream intent detection task. Finally, it reaps the combined potential of pre-trained LLMs and the prototype-informed attention mechanism to create a comprehensive end-to-end LLMPIA intent detection pipeline. Under the paradigm of proposed predictive pipeline, it explores the potential of 6 distinct language models and 13 distinct similarity computation methods. The proposed framework is evaluated on 2 public benchmark datasets, namely ATIS encompassing 5836 samples and Web Queries having 8519 samples. Across ATIS dataset under 4-way 1 shot and 4-way 5 shot experimental settings LLMPIA achieved 83.28% and 98.25% F1-Score and on Web Queries dataset produced 76.23% and 84.42% F1-Score, respectively. In an additional case study on the Web Queries dataset under same classes train and test set settings, LLMPIA outperformed state-of-the-art predictor by 53.55% F1-Score.

[82] Scaling Laws for Speculative Decoding

Siyuan Yan,Mo Zhu,Guo-qing Jiang,Jianfei Wang,Jiaxing Chen,Wentai Zhang,Xiang Liao,Xiao Cui,Chen Zhang,Zhuoran Song,Ran Zhu

Main category: cs.CL

TL;DR: 本文研究了通过密集LLM架构的推测解码技术,发现了控制解码效率的对数线性缩放定律,并提出了Scylla方法,显著提升了推理任务的解码速度和接受率。

Details Motivation: 大型语言模型(LLM)在推理密集型架构中需要高效解码,但现有方法在解码效率的缩放规律上研究不足。 Method: 通过密集LLM架构研究推测解码技术,发现对数线性缩放定律,并开发了Scylla方法。 Result: Scylla在解码接受率和速度上优于EAGLE2和EAGLE3,工业部署中解码吞吐量提升2倍。 Conclusion: 系统性缩放对高效LLM推理具有变革潜力,Scylla为推理任务提供了显著性能提升。 Abstract: The escalating demand for efficient decoding in large language models (LLMs) is particularly critical for reasoning-intensive architectures like OpenAI-o3 and DeepSeek-R1, which depend on extended chain-of-thought reasoning. This study investigates speculative decoding techniques through dense LLM architectures to establish foundational insights for accelerating reasoning tasks. While speculative decoding methods leveraging parallel draft-verification cycles have emerged as promising acceleration techniques, the scaling laws governing decoding efficiency remain under-explored compared to conventional backbone LLMs developed through Pretraining->SFT->RLHF training paradigms. In this work, we discover Log-linear Scaling Laws (Theorem 1.1, 1.2 and 1.3) governing draft model acceptance rate (or decoding speed) across three dimensions: pretraining token volume, draft model capacity, and decoding batch size. Building on these laws, we achieve Scylla, which coordinates multi-dimensional scaling for popular LLMs (Llama2/3, Qwen2.5). Empirical validation shows Scylla achieves 1.5-2.2 higher acceptance rate than EAGLE2 and 0.3 higher than EAGLE3 at temperature T = 0, with peak performance gains on summarization and QA tasks (Figure 2). Industrial inference engine deployments demonstrate 2X decoding throughput improvements over EAGLE2 (Table 5), validating the transformative potential of systematic scaling for efficient LLM inference. Code will be released later.

[83] Boosting Performance on ARC is a Matter of Perspective

Daniel Franzen,Jan Disselhoff,David Hartmann

Main category: cs.CL

TL;DR: 通过任务特定数据增强和深度优先搜索算法,结合LLM生成和评分功能,在ARC-AGI评测中取得71.6%的高分,具有透明、可复现和低成本优势。

Details Motivation: 解决大型语言模型在抽象推理任务(如ARC-AGI)中的局限性。 Method: 在训练、生成和评分阶段使用任务特定数据增强,结合深度优先搜索算法生成多样化候选解,并利用LLM作为生成器和评分器。 Result: 在ARC-AGI评测中得分71.6%(286.5/400任务),性能领先于公开方法。 Conclusion: 该方法在透明性、可复现性和低成本方面表现突出,尽管闭源方法得分更高。 Abstract: The Abstraction and Reasoning Corpus (ARC-AGI) poses a significant challenge for large language models (LLMs), exposing limitations in their abstract reasoning abilities. In this work, we leverage task-specific data augmentations throughout the training, generation, and scoring phases, and employ a depth-first search algorithm to generate diverse, high-probability candidate solutions. Furthermore, we utilize the LLM not only as a generator but also as a scorer, using its output probabilities to select the most promising solutions. Our method achieves a score of 71.6% (286.5/400 solved tasks) on the public ARC-AGI evaluation set, demonstrating state-of-the-art performance among publicly available approaches. While concurrent closed-source work has reported higher scores, our method distinguishes itself through its transparency, reproducibility, and remarkably low inference cost, averaging only around 2ct per task on readily available hardware (we assume a price of 36ct/hour for a Nvidia 4090 GPU).

[84] Scalable LLM Math Reasoning Acceleration with Low-rank Distillation

Harry Dong,Bilge Acun,Beidi Chen,Yuejie Chi

Main category: cs.CL

TL;DR: Caprese是一种低成本蒸馏方法,用于恢复因高效推理方法部署而丢失的数学能力,同时保持语言任务性能。

Details Motivation: 大型语言模型(LLM)在数学推理上需要大量计算资源,现有高效推理方法虽在语言任务上表现良好,但会严重降低数学性能。 Method: 提出Caprese方法,通过少量额外参数(约1%)和20K合成训练样本,在不扰动原始权重的情况下恢复数学能力。 Result: Caprese显著减少了活跃参数数量(如Gemma 2 9B和Llama 3.1 8B减少约2B),并降低了延迟(Qwen 2.5 14B生成2048个token延迟减少11%以上)。 Conclusion: Caprese能够高效恢复数学能力,同时不影响语言任务性能,并显著提升推理效率。 Abstract: Due to long generations, large language model (LLM) math reasoning demands significant computational resources and time. While many existing efficient inference methods have been developed with excellent performance preservation on language tasks, they often severely degrade math performance. In this paper, we propose Caprese, a low-cost distillation method to recover lost capabilities from deploying efficient inference methods, focused primarily in feedforward blocks. With original weights unperturbed, roughly 1% of additional parameters, and only 20K synthetic training samples, we are able to recover much if not all of the math capabilities lost from efficient inference for thinking LLMs and without harm to language tasks for instruct LLMs. Moreover, Caprese slashes the number of active parameters (~2B cut for Gemma 2 9B and Llama 3.1 8B) and integrates cleanly into existing model layers to reduce latency (>11% reduction to generate 2048 tokens with Qwen 2.5 14B) while encouraging response brevity.

[85] Graph Laplacian Wavelet Transformer via Learnable Spectral Decomposition

Andrew Kiruluta,Eric Lundy,Priscilla Burity

Main category: cs.CL

TL;DR: 论文提出了一种名为Graph Wavelet Transformer (GWT)的新架构,用于替代传统的点积自注意力机制,以解决其在计算和内存上的二次复杂度问题。

Details Motivation: 现有的序列到序列模型在处理结构化语言任务时,依赖点积自注意力机制,导致计算和内存的二次复杂度。 Method: GWT通过引入可学习的多尺度小波变换,基于显式图拉普拉斯矩阵(源自语法或语义解析)来替代点积自注意力。 Result: 分析表明,多尺度谱分解为图结构序列建模提供了一种可解释、高效且表达能力强的替代方案。 Conclusion: GWT是一种高效且表达能力强的替代方案,适用于图结构序列建模。 Abstract: Existing sequence to sequence models for structured language tasks rely heavily on the dot product self attention mechanism, which incurs quadratic complexity in both computation and memory for input length N. We introduce the Graph Wavelet Transformer (GWT), a novel architecture that replaces this bottleneck with a learnable, multi scale wavelet transform defined over an explicit graph Laplacian derived from syntactic or semantic parses. Our analysis shows that multi scale spectral decomposition offers an interpretable, efficient, and expressive alternative to quadratic self attention for graph structured sequence modeling.

[86] QoSBERT: An Uncertainty-Aware Approach based on Pre-trained Language Models for Service Quality Prediction

Ziliang Wang,Xiaohong Zhang,Ze Shi Li,Meng Yan

Main category: cs.CL

TL;DR: QoSBERT是一个基于预训练语言模型的QoS预测框架,通过语义回归和不确定性估计提升预测准确性和可信度。

Details Motivation: 传统QoS模型依赖手动特征工程且缺乏预测置信度信息,限制了其在实际应用中的可靠性。 Method: QoSBERT将用户服务元数据编码为自然语言描述,结合蒙特卡洛Dropout进行不确定性估计,并使用注意力池化和多层感知机回归器优化预测。 Result: 在标准数据集上,QoSBERT在响应时间和吞吐量预测上分别降低了11.7%和6.9%的MAE,并提供校准的置信区间。 Conclusion: QoSBERT不仅提高了预测准确性,还提供了可靠的不确定性量化,为服务选择和优化提供了更可信的数据支持。 Abstract: Accurate prediction of Quality of Service (QoS) metrics is fundamental for selecting and managing cloud based services. Traditional QoS models rely on manual feature engineering and yield only point estimates, offering no insight into the confidence of their predictions. In this paper, we propose QoSBERT, the first framework that reformulates QoS prediction as a semantic regression task based on pre trained language models. Unlike previous approaches relying on sparse numerical features, QoSBERT automatically encodes user service metadata into natural language descriptions, enabling deep semantic understanding. Furthermore, we integrate a Monte Carlo Dropout based uncertainty estimation module, allowing for trustworthy and risk-aware service quality prediction, which is crucial yet underexplored in existing QoS models. QoSBERT applies attentive pooling over contextualized embeddings and a lightweight multilayer perceptron regressor, fine tuned jointly to minimize absolute error. We further exploit the resulting uncertainty estimates to select high quality training samples, improving robustness in low resource settings. On standard QoS benchmark datasets, QoSBERT achieves an average reduction of 11.7% in MAE and 6.7% in RMSE for response time prediction, and 6.9% in MAE for throughput prediction compared to the strongest baselines, while providing well calibrated confidence intervals for robust and trustworthy service quality estimation. Our approach not only advances the accuracy of service quality prediction but also delivers reliable uncertainty quantification, paving the way for more trustworthy, data driven service selection and optimization.

[87] Efficient Fairness Testing in Large Language Models: Prioritizing Metamorphic Relations for Bias Detection

Suavis Giramata,Madhusudan Srinivasan,Venkat Naidu Gudivada,Upulee Kanewala

Main category: cs.CL

TL;DR: 本文提出了一种基于句子多样性的蜕变关系(MRs)优先级排序方法,用于高效检测大语言模型(LLMs)中的公平性问题。实验表明,该方法在故障检测率和时间效率上优于随机和基于距离的优先级排序。

Details Motivation: 随着大语言模型(LLMs)的广泛应用,其输出中的公平性和潜在偏见问题日益突出。由于测试用例数量庞大,如何高效检测公平性问题成为关键挑战。 Method: 采用句子多样性为基础的方法,计算并排序蜕变关系(MRs),以优化故障检测。 Result: 实验结果显示,该方法在故障检测率上比随机优先级排序提高了22%,比基于距离的优先级排序提高了12%,同时在首次故障发现时间上分别减少了15%和8%。 Conclusion: 多样性基础的MR优先级排序方法在提升LLMs公平性测试效果的同时,显著降低了计算成本,验证了其有效性。 Abstract: Large Language Models (LLMs) are increasingly deployed in various applications, raising critical concerns about fairness and potential biases in their outputs. This paper explores the prioritization of metamorphic relations (MRs) in metamorphic testing as a strategy to efficiently detect fairness issues within LLMs. Given the exponential growth of possible test cases, exhaustive testing is impractical; therefore, prioritizing MRs based on their effectiveness in detecting fairness violations is crucial. We apply a sentence diversity-based approach to compute and rank MRs to optimize fault detection. Experimental results demonstrate that our proposed prioritization approach improves fault detection rates by 22% compared to random prioritization and 12% compared to distance-based prioritization, while reducing the time to the first failure by 15% and 8%, respectively. Furthermore, our approach performs within 5% of fault-based prioritization in effectiveness, while significantly reducing the computational cost associated with fault labeling. These results validate the effectiveness of diversity-based MR prioritization in enhancing fairness testing for LLMs.

[88] Evaluating Financial Sentiment Analysis with Annotators Instruction Assisted Prompting: Enhancing Contextual Interpretation and Stock Prediction Accuracy

A M Muntasir Rahman,Ajim Uddin,Guiling "Grace" Wang

Main category: cs.CL

TL;DR: 论文提出了一种名为AIAP的新型评估提示方法,通过整合人类标注者的任务指令,改善金融情感分析(FSA)中LLMs的性能,并在新数据集WSBS上验证了其有效性。

Details Motivation: 金融情感分析(FSA)中LLMs的表现受限于现有基准数据集(如Financial Phrasebank)中未定义的情感类别和标注者的主观性,导致评估不公平。 Method: 引入Annotators' Instruction Assisted Prompt(AIAP),将人类标注者的详细任务指令整合到LLMs的提示框架中,以标准化情感理解。 Result: 实验显示,AIAP显著提升了LLMs的性能(最高提升9.08),并提出了一种基于模型置信度的情感索引方法,增强了股票价格预测。 Conclusion: AIAP通过改进任务定义和评估方法,提升了FSA的性能,并凸显了WSB作为金融文本来源的重要性。 Abstract: Financial sentiment analysis (FSA) presents unique challenges to LLMs that surpass those in typical sentiment analysis due to the nuanced language used in financial contexts. The prowess of these models is often undermined by the inherent subjectivity of sentiment classifications in existing benchmark datasets like Financial Phrasebank. These datasets typically feature undefined sentiment classes that reflect the highly individualized perspectives of annotators, leading to significant variability in annotations. This variability results in an unfair expectation for LLMs during benchmarking, where they are tasked to conjecture the subjective viewpoints of human annotators without sufficient context. In this paper, we introduce the Annotators' Instruction Assisted Prompt, a novel evaluation prompt designed to redefine the task definition of FSA for LLMs. By integrating detailed task instructions originally intended for human annotators into the LLMs' prompt framework, AIAP aims to standardize the understanding of sentiment across both human and machine interpretations, providing a fair and context-rich foundation for sentiment analysis. We utilize a new dataset, WSBS, derived from the WallStreetBets subreddit to demonstrate how AIAP significantly enhances LLM performance by aligning machine operations with the refined task definitions. Experimental results demonstrate that AIAP enhances LLM performance significantly, with improvements up to 9.08. This context-aware approach not only yields incremental gains in performance but also introduces an innovative sentiment-indexing method utilizing model confidence scores. This method enhances stock price prediction models and extracts more value from the financial sentiment analysis, underscoring the significance of WSB as a critical source of financial text. Our research offers insights into both improving FSA through better evaluation methods.

[89] The Sound of Populism: Distinct Linguistic Features Across Populist Variants

Yu Wang,Runxi Yu,Zhongyuan Wang,Jing He

Main category: cs.CL

TL;DR: 研究通过结合LIWC特征和RoBERTa模型,分析美国政治演讲中的民粹主义语言风格,发现民粹主义修辞具有直接、自信的特点,且右翼民粹主义情感更强烈。

Details Motivation: 探索民粹主义在政治演讲中的语言表现,揭示其情感和风格特征。 Method: 结合LIWC情感分析工具和RoBERTa模型,分析美国总统就职和国情咨文演讲。 Result: 民粹主义修辞具有直接、自信的特点,右翼民粹主义情感更强烈,左翼和反精英主义则相对克制。 Conclusion: 民粹主义语言风格具有战略性和多样性,右翼民粹主义更倾向于情感化表达。 Abstract: This study explores the sound of populism by integrating the classic Linguistic Inquiry and Word Count (LIWC) features, which capture the emotional and stylistic tones of language, with a fine-tuned RoBERTa model, a state-of-the-art context-aware language model trained to detect nuanced expressions of populism. This approach allows us to uncover the auditory dimensions of political rhetoric in U.S. presidential inaugural and State of the Union addresses. We examine how four key populist dimensions (i.e., left-wing, right-wing, anti-elitism, and people-centrism) manifest in the linguistic markers of speech, drawing attention to both commonalities and distinct tonal shifts across these variants. Our findings reveal that populist rhetoric consistently features a direct, assertive ``sound" that forges a connection with ``the people'' and constructs a charismatic leadership persona. However, this sound is not simply informal but strategically calibrated. Notably, right-wing populism and people-centrism exhibit a more emotionally charged discourse, resonating with themes of identity, grievance, and crisis, in contrast to the relatively restrained emotional tones of left-wing and anti-elitist expressions.

[90] Recovering Event Probabilities from Large Language Model Embeddings via Axiomatic Constraints

Jian-Qiao Zhu,Haijiang Yan,Thomas L. Griffiths

Main category: cs.CL

TL;DR: 论文探讨了如何从LLM嵌入中恢复符合概率论公理的相干事件概率,提出了一种基于变分自编码器的方法,实验表明其效果优于模型直接生成的概率。

Details Motivation: LLM生成的事件概率存在不连贯性,违反概率论公理,因此需要探索如何从嵌入中恢复相干概率。 Method: 通过扩展的变分自编码器(VAE)在潜在空间中强制概率论公理约束(如加法规则),使事件概率自然生成。 Result: 实验表明,从嵌入中恢复的概率比模型直接生成的更连贯,且更接近真实概率。 Conclusion: 该方法能有效恢复相干事件概率,为不确定性事件提供更准确的估计。 Abstract: Rational decision-making under uncertainty requires coherent degrees of belief in events. However, event probabilities generated by Large Language Models (LLMs) have been shown to exhibit incoherence, violating the axioms of probability theory. This raises the question of whether coherent event probabilities can be recovered from the embeddings used by the models. If so, those derived probabilities could be used as more accurate estimates in events involving uncertainty. To explore this question, we propose enforcing axiomatic constraints, such as the additive rule of probability theory, in the latent space learned by an extended variational autoencoder (VAE) applied to LLM embeddings. This approach enables event probabilities to naturally emerge in the latent space as the VAE learns to both reconstruct the original embeddings and predict the embeddings of semantically related events. We evaluate our method on complementary events (i.e., event A and its complement, event not-A), where the true probabilities of the two events must sum to 1. Experiment results on open-weight language models demonstrate that probabilities recovered from embeddings exhibit greater coherence than those directly reported by the corresponding models and align closely with the true probabilities.

[91] Development of a WAZOBIA-Named Entity Recognition System

S. E Emedem,I. E Onyenwe,E. G Onyedinma

Main category: cs.CL

TL;DR: 该研究开发了一个名为WAZOBIA-NER的系统,专门针对尼日利亚的三种主要语言(豪萨语、约鲁巴语和伊博语)进行命名实体识别(NER),填补了资源匮乏语言的空白。

Details Motivation: 尽管非洲语言在计算语言学中受到越来越多的关注,但现有的NER系统主要集中在英语、欧洲语言和其他少数全球语言上,导致资源匮乏的语言存在显著差距。 Method: 研究通过编译标注数据集,并探索了条件随机场(CRF)、双向长短期记忆网络(BiLSTM)、双向编码器表示(BERT)和循环神经网络(RNN)等机器学习与深度学习技术,评估其在识别人名、组织和地点三类实体中的有效性。系统还利用OCR技术处理图像文本。 Result: 系统在精确率、召回率、F1分数和准确率上分别达到0.9511、0.9400、0.9564和0.9301,表现出色。 Conclusion: 研究表明,利用现有NLP框架和迁移学习,可以为资源匮乏的非洲语言构建强大的NER工具。 Abstract: Named Entity Recognition NER is very crucial for various natural language processing applications, including information extraction, machine translation, and sentiment analysis. Despite the ever-increasing interest in African languages within computational linguistics, existing NER systems focus mainly on English, European, and a few other global languages, leaving a significant gap for under-resourced languages. This research presents the development of a WAZOBIA-NER system tailored for the three most prominent Nigerian languages: Hausa, Yoruba, and Igbo. This research begins with a comprehensive compilation of annotated datasets for each language, addressing data scarcity and linguistic diversity challenges. Exploring the state-of-the-art machine learning technique, Conditional Random Fields (CRF) and deep learning models such as Bidirectional Long Short-Term Memory (BiLSTM), Bidirectional Encoder Representation from Transformers (Bert) and fine-tune with a Recurrent Neural Network (RNN), the study evaluates the effectiveness of these approaches in recognizing three entities: persons, organizations, and locations. The system utilizes optical character recognition (OCR) technology to convert textual images into machine-readable text, thereby enabling the Wazobia system to accept both input text and textual images for extraction purposes. The system achieved a performance of 0.9511 in precision, 0.9400 in recall, 0.9564 in F1-score, and 0.9301 in accuracy. The model's evaluation was conducted across three languages, with precision, recall, F1-score, and accuracy as key assessment metrics. The Wazobia-NER system demonstrates that it is feasible to build robust NER tools for under-resourced African languages using current NLP frameworks and transfer learning.

[92] PLHF: Prompt Optimization with Few-Shot Human Feedback

Chun-Pai Yang,Kan Zheng,Shou-De Lin

Main category: cs.CL

TL;DR: PLHF是一种基于人类反馈的提示优化框架,通过单轮反馈高效优化LLM提示,优于现有方法。

Details Motivation: 现有方法难以处理输出质量难以评估的任务,缺乏明确指标导致提示优化困难。 Method: PLHF采用类似RLHF的技术,引入评估模块作为质量指标,仅需单轮人类反馈完成优化。 Result: 在公共和工业数据集上,PLHF表现优于现有输出评分策略。 Conclusion: PLHF为复杂任务提供了一种高效、低成本的提示优化解决方案。 Abstract: Automatic prompt optimization frameworks are developed to obtain suitable prompts for large language models (LLMs) with respect to desired output quality metrics. Although existing approaches can handle conventional tasks such as fixed-solution question answering, defining the metric becomes complicated when the output quality cannot be easily assessed by comparisons with standard golden samples. Consequently, optimizing the prompts effectively and efficiently without a clear metric becomes a critical challenge. To address the issue, we present PLHF (which stands for "P"rompt "L"earning with "H"uman "F"eedback), a few-shot prompt optimization framework inspired by the well-known RLHF technique. Different from naive strategies, PLHF employs a specific evaluator module acting as the metric to estimate the output quality. PLHF requires only a single round of human feedback to complete the entire prompt optimization process. Empirical results on both public and industrial datasets show that PLHF outperforms prior output grading strategies for LLM prompt optimizations.

[93] Implementing Long Text Style Transfer with LLMs through Dual-Layered Sentence and Paragraph Structure Extraction and Mapping

Yusen Wu,Xiaotie Deng

Main category: cs.CL

TL;DR: 论文提出了一种基于零样本学习的分层框架ZeroStylus,用于长文本风格迁移,结合句子级风格适应和段落级结构连贯性,显著提升了风格一致性、内容保留和表达质量。

Details Motivation: 解决长文本风格迁移中句子级和段落级的风格一致性与结构连贯性问题,避免对平行语料或LLM微调的依赖。 Method: 提出分层框架ZeroStylus,通过句子和段落模板库的动态构建,实现上下文感知的转换,保留句子间逻辑关系。 Result: 实验显示,ZeroStylus在风格一致性、内容保留和表达质量上优于基线方法(6.90 vs. 6.70),并通过消融研究验证了分层模板的必要性。 Conclusion: ZeroStylus为无需平行语料或LLM微调的长文本风格迁移提供了新方法,证明了分层模板在风格迁移中的有效性。 Abstract: This paper addresses the challenge in long-text style transfer using zero-shot learning of large language models (LLMs), proposing a hierarchical framework that combines sentence-level stylistic adaptation with paragraph-level structural coherence. We argue that in the process of effective paragraph-style transfer, to preserve the consistency of original syntactic and semantic information, it is essential to perform style transfer not only at the sentence level but also to incorporate paragraph-level semantic considerations, while ensuring structural coherence across inter-sentential relationships. Our proposed framework, ZeroStylus, operates through two systematic phases: hierarchical template acquisition from reference texts and template-guided generation with multi-granular matching. The framework dynamically constructs sentence and paragraph template repositories, enabling context-aware transformations while preserving inter-sentence logical relationships. Experimental evaluations demonstrate significant improvements over baseline methods, with structured rewriting achieving 6.90 average score compared to 6.70 for direct prompting approaches in tri-axial metrics assessing style consistency, content preservation, and expression quality. Ablation studies validate the necessity of both template hierarchies during style transfer, showing higher content preservation win rate against sentence-only approaches through paragraph-level structural encoding, as well as direct prompting method through sentence-level pattern extraction and matching. The results establish new capabilities for coherent long-text style transfer without requiring parallel corpora or LLM fine-tuning.

[94] BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning

Yuyang Liu,Liuzhenghao Lv,Xiancheng Zhang,Li Yuan,Yonghong Tian

Main category: cs.CL

TL;DR: BioProBench是首个大规模、多任务的生物协议理解与推理基准,包含五个核心任务,评估12种主流LLM,发现其在深层次推理和结构化生成任务上表现不佳。

Details Motivation: 生物协议对生命科学研究至关重要,但目前LLM在这些高度专业化、准确性要求高的文本上的系统性评估有限。 Method: 基于27K原始协议构建BioProBench,生成556K高质量结构化实例,评估12种LLM在五个核心任务上的表现。 Result: 实验显示LLM在表面理解任务表现良好,但在深层次推理和结构化生成任务上表现较差,开源模型在某些任务上接近闭源模型水平。 Conclusion: 生物协议中的程序推理对当前LLM仍是挑战,BioProBench为诊断局限性提供了标准化框架,指导AI系统开发。 Abstract: Biological protocols are fundamental to reproducible and safe life science research. While LLMs excel on general tasks, their systematic evaluation on these highly specialized, accuracy-critical, and inherently procedural texts remains limited. In this work, we present BioProBench, the first large-scale, integrated multi-task benchmark for biological protocol understanding and reasoning. While limited benchmarks have touched upon specific aspects like protocol QA, BioProBench provides a comprehensive suite of five core tasks: Protocol Question Answering, Step Ordering, Error Correction, Protocol Generation, and Protocol Reasoning, enabling a holistic evaluation of LLMs on procedural biological texts. Built upon 27K original protocols, it yields nearly 556K high-quality structured instances. We evaluate 12 mainstream open/closed-source LLMs on BioProBench. Experimental results reveal that while top models preform well on surface understanding tasks, struggle significantly with deep reasoning and structured generation tasks like ordering and generation. Furthermore, model comparisons reveal diverse performance: certain open-source models approach closed-source levels on some tasks, yet bio-specific small models lag behind general LLMs, indicating limitations on complex procedural content. Overall, our findings underscore that procedural reasoning within biological protocols represents a significant challenge for current LLMs. BioProBench serves as a standardized framework to diagnose these specific limitations and guide the development of AI systems better equipped for safely automating complex scientific procedures. The code and data are available at: https://github.com/YuyangSunshine/bioprotocolbench and https://huggingface.co/datasets/GreatCaptainNemo/BioProBench.

[95] TSLFormer: A Lightweight Transformer Model for Turkish Sign Language Recognition Using Skeletal Landmarks

Kutay Ertürk,Furkan Altınışık,İrem Sarıaltın,Ömer Nezih Gerek

Main category: cs.CL

TL;DR: TSLFormer是一种轻量级且鲁棒的土耳其手语识别模型,将手势视为有序的字符串语言,仅使用3D关节位置作为输入,通过序列到序列翻译实现高效识别。

Details Motivation: 研究动机在于开发一种高效、轻量的手语识别模型,适用于实时和移动辅助通信系统,同时减少输入数据的维度。 Method: 方法基于3D关节位置(通过Mediapipe提取),采用序列到序列的Transformer架构,利用自注意力机制捕捉手势序列的时序关系。 Result: 在AUTSL数据集(36,000样本,227个单词)上表现优异,计算成本低。 Conclusion: 结论表明基于关节的输入足以支持实时移动辅助通信系统,为听力障碍者提供有效帮助。 Abstract: This study presents TSLFormer, a light and robust word-level Turkish Sign Language (TSL) recognition model that treats sign gestures as ordered, string-like language. Instead of using raw RGB or depth videos, our method only works with 3D joint positions - articulation points - extracted using Google's Mediapipe library, which focuses on the hand and torso skeletal locations. This creates efficient input dimensionality reduction while preserving important semantic gesture information. Our approach revisits sign language recognition as sequence-to-sequence translation, inspired by the linguistic nature of sign languages and the success of transformers in natural language processing. Since TSLFormer uses the self-attention mechanism, it effectively captures temporal co-occurrence within gesture sequences and highlights meaningful motion patterns as words unfold. Evaluated on the AUTSL dataset with over 36,000 samples and 227 different words, TSLFormer achieves competitive performance with minimal computational cost. These results show that joint-based input is sufficient for enabling real-time, mobile, and assistive communication systems for hearing-impaired individuals.

[96] TrumorGPT: Graph-Based Retrieval-Augmented Large Language Model for Fact-Checking

Ching Nam Hang,Pei-Duo Yu,Chee Wei Tan

Main category: cs.CL

TL;DR: TrumorGPT是一种基于生成式人工智能的解决方案,用于健康领域的事实核查,旨在区分真实的健康谣言(“trumors”)和虚假信息。

Details Motivation: 在社交媒体时代,虚假信息的快速传播导致“信息流行病”,对社会构成重大威胁。TrumorGPT旨在解决这一问题,增强健康领域信息的可信度。 Method: TrumorGPT利用大型语言模型(LLM)进行少样本学习,构建语义健康知识图谱并进行语义推理。采用基于图的检索增强生成(GraphRAG)技术,解决LLM的幻觉问题和静态训练数据的局限性。 Result: 在广泛的医疗数据集上评估,TrumorGPT在公共卫生声明的事实核查中表现出色,能够基于最新数据提供准确结果。 Conclusion: TrumorGPT为对抗健康相关虚假信息提供了重要工具,提升了数字信息时代的信任和准确性。 Abstract: In the age of social media, the rapid spread of misinformation and rumors has led to the emergence of infodemics, where false information poses a significant threat to society. To combat this issue, we introduce TrumorGPT , a novel generative artificial intelligence solution designed for fact-checking in the health domain. TrumorGPT aims to distinguish "trumors", which are health-related rumors that turn out to be true, providing a crucial tool in differentiating between mere speculation and verified facts. This framework leverages a large language model (LLM) with few-shot learning for semantic health knowledge graph construction and semantic reasoning. TrumorGPT incorporates graph-based retrieval-augmented generation (GraphRAG) to address the hallucination issue common in LLMs and the limitations of static training data. GraphRAG involves accessing and utilizing information from regularly updated semantic health knowledge graphs that consist of the latest medical news and health information, ensuring that fact-checking by TrumorGPT is based on the most recent data. Evaluating with extensive healthcare datasets, TrumorGPT demonstrates superior performance in fact-checking for public health claims. Its ability to effectively conduct fact-checking across various platforms marks a critical step forward in the fight against health-related misinformation, enhancing trust and accuracy in the digital information age.

[97] LongCodeBench: Evaluating Coding LLMs at 1M Context Windows

Stefano Rando,Luca Romani,Alessio Sampieri,Yuta Kyuragi,Luca Franco,Fabio Galasso,Tatsunori Hashimoto,John Yang

Main category: cs.CL

TL;DR: 论文介绍了LongCodeBench(LCB),一个用于测试长上下文模型中代码理解和修复能力的基准,发现所有模型在长上下文任务中表现均下降。

Details Motivation: 现代长上下文模型的上下文长度快速增长,但缺乏现实的长上下文基准测试,尤其是代码理解和修复任务。 Method: 通过从真实GitHub问题中构建QA和bug修复任务(LongCodeQA和LongSWE-Bench),分层设计基准以评估不同规模模型。 Result: 所有模型在长上下文任务中表现显著下降,例如Claude 3.5 Sonnet从29%降至3%,Qwen2.5从70.2%降至40%。 Conclusion: 长上下文仍是当前模型的弱点,需要进一步改进。 Abstract: Context lengths for models have grown rapidly, from thousands to millions of tokens in just a few years. The extreme context sizes of modern long-context models have made it difficult to construct realistic long-context benchmarks -- not only due to the cost of collecting million-context tasks but also in identifying realistic scenarios that require significant contexts. We identify code comprehension and repair as a natural testbed and challenge task for long-context models and introduce LongCodeBench (LCB), a benchmark to test LLM coding abilities in long-context scenarios. Our benchmark tests both the comprehension and repair capabilities of LCLMs in realistic and important settings by drawing from real-world GitHub issues and constructing QA (LongCodeQA) and bug fixing (LongSWE-Bench) tasks. We carefully stratify the complexity of our benchmark, enabling us to evaluate models across different scales -- ranging from Qwen2.5 14B Instruct to Google's flagship Gemini model. We find that long-context remains a weakness for all models, with performance drops such as from 29% to 3% for Claude 3.5 Sonnet, or from 70.2% to 40% for Qwen2.5.

[98] DeltaEdit: Enhancing Sequential Editing in Large Language Models by Controlling Superimposed Noise

Ding Cao,Yuchen Cai,Rongxi Guo,Xuesong He,Guiquan Liu

Main category: cs.CL

TL;DR: DeltaEdit通过动态正交约束策略优化更新参数,有效减少编辑间的干扰,显著提升长期编辑成功率和模型泛化能力。

Details Motivation: 现有连续知识编辑方法在长期编辑后成功率显著下降,模型输出偏离目标,称为叠加噪声累积问题。 Method: 提出DeltaEdit方法,通过动态正交约束策略优化更新参数,减少编辑间干扰。 Result: DeltaEdit在编辑成功率和泛化能力保留上显著优于现有方法。 Conclusion: DeltaEdit能确保模型在长期连续编辑下保持稳定可靠的性能。 Abstract: Sequential knowledge editing techniques aim to continuously update the knowledge in large language models at a low cost, preventing the models from generating outdated or incorrect information. However, existing sequential editing methods suffer from a significant decline in editing success rates after long-term editing. Through theoretical analysis and experiments, we identify that as the number of edits increases, the model's output increasingly deviates from the desired target, leading to a drop in editing success rates. We refer to this issue as the accumulation of superimposed noise problem. To address this, we identify the factors contributing to this deviation and propose DeltaEdit, a novel method that optimizes update parameters through a dynamic orthogonal constraints strategy, effectively reducing interference between edits to mitigate deviation. Experimental results demonstrate that DeltaEdit significantly outperforms existing methods in edit success rates and the retention of generalization capabilities, ensuring stable and reliable model performance even under extensive sequential editing.

[99] SEM: Reinforcement Learning for Search-Efficient Large Language Models

Zeyang Sha,Shiwen Cui,Weiqiang Wang

Main category: cs.CL

TL;DR: 本文提出SEM框架,通过强化学习优化大语言模型(LLM)的搜索行为,减少冗余搜索并保持准确性。

Details Motivation: 现有方法导致LLM频繁冗余搜索,效率低下且成本高,需优化其搜索决策能力。 Method: 结合MuSiQue和MMLU数据集,设计结构化推理模板,采用GRPO方法训练模型区分是否需要搜索。 Result: 实验表明,SEM显著减少冗余搜索,同时保持或提升多基准测试的答案准确性。 Conclusion: SEM框架提升了LLM的推理效率,并使其更明智地利用外部知识。 Abstract: Recent advancements in Large Language Models(LLMs) have demonstrated their capabilities not only in reasoning but also in invoking external tools, particularly search engines. However, teaching models to discern when to invoke search and when to rely on their internal knowledge remains a significant challenge. Existing reinforcement learning approaches often lead to redundant search behaviors, resulting in inefficiencies and over-cost. In this paper, we propose SEM, a novel post-training reinforcement learning framework that explicitly trains LLMs to optimize search usage. By constructing a balanced dataset combining MuSiQue and MMLU, we create scenarios where the model must learn to distinguish between questions it can answer directly and those requiring external retrieval. We design a structured reasoning template and employ Group Relative Policy Optimization(GRPO) to post-train the model's search behaviors. Our reward function encourages accurate answering without unnecessary search while promoting effective retrieval when needed. Experimental results demonstrate that our method significantly reduces redundant search operations while maintaining or improving answer accuracy across multiple challenging benchmarks. This framework advances the model's reasoning efficiency and extends its capability to judiciously leverage external knowledge.

[100] Re$^2$: A Consistency-ensured Dataset for Full-stage Peer Review and Multi-turn Rebuttal Discussions

Daoze Zhang,Zhijian Bao,Sihang Du,Zhiyi Zhao,Kuangling Zhang,Dezheng Bao,Yang Yang

Main category: cs.CL

TL;DR: 论文提出了一个名为Re^2的大规模一致性保障的同行评审数据集,旨在解决现有数据集在多样性、数据质量和交互支持方面的不足,以帮助减轻评审负担并提升作者自我评估能力。

Details Motivation: 同行评审系统因投稿量激增和重复提交低质量稿件而面临压力,现有数据集存在多样性不足、数据质量低且缺乏交互支持的问题。 Method: 构建了包含19,926篇初始投稿、70,668条评审意见和53,818条反驳的Re^2数据集,并将反驳和讨论阶段建模为多轮对话范式。 Result: Re^2数据集支持静态评审任务和动态交互式LLM助手,为作者提供更实用的指导,并有望缓解评审负担。 Conclusion: Re^2数据集通过提供高质量、多样化的评审数据和多轮对话支持,为改进同行评审系统和LLM辅助工具提供了重要资源。 Abstract: Peer review is a critical component of scientific progress in the fields like AI, but the rapid increase in submission volume has strained the reviewing system, which inevitably leads to reviewer shortages and declines review quality. Besides the growing research popularity, another key factor in this overload is the repeated resubmission of substandard manuscripts, largely due to the lack of effective tools for authors to self-evaluate their work before submission. Large Language Models (LLMs) show great promise in assisting both authors and reviewers, and their performance is fundamentally limited by the quality of the peer review data. However, existing peer review datasets face three major limitations: (1) limited data diversity, (2) inconsistent and low-quality data due to the use of revised rather than initial submissions, and (3) insufficient support for tasks involving rebuttal and reviewer-author interactions. To address these challenges, we introduce the largest consistency-ensured peer review and rebuttal dataset named Re^2, which comprises 19,926 initial submissions, 70,668 review comments, and 53,818 rebuttals from 24 conferences and 21 workshops on OpenReview. Moreover, the rebuttal and discussion stage is framed as a multi-turn conversation paradigm to support both traditional static review tasks and dynamic interactive LLM assistants, providing more practical guidance for authors to refine their manuscripts and helping alleviate the growing review burden. Our data and code are available in https://anonymous.4open.science/r/ReviewBench_anon/.

[101] Assessing and Mitigating Medical Knowledge Drift and Conflicts in Large Language Models

Weiyi Wu,Xinwen Xu,Chongyang Gao,Xingjian Diao,Siting Li,Lucas A. Salas,Jiang Gui

Main category: cs.CL

TL;DR: 该研究探讨了大型语言模型(LLMs)在医疗指南动态变化中的表现,发现其难以拒绝过时建议且常支持矛盾指导。通过两种缓解策略(检索增强生成和偏好微调)的组合,显著提升了模型性能。

Details Motivation: LLMs在医疗领域潜力巨大,但面临适应快速变化医学知识的挑战,可能导致过时或矛盾的治疗建议。 Method: 研究开发了DriftMedQA基准模拟指南演变,评估了七种先进LLMs的时序可靠性,并探索了检索增强生成和偏好微调两种缓解策略。 Result: 评估显示LLMs难以拒绝过时建议且常支持矛盾指导。两种策略单独使用均能提升性能,组合效果最佳。 Conclusion: 需提升LLMs对时间变化的鲁棒性,以确保其在临床实践中的可靠性。 Abstract: Large Language Models (LLMs) have great potential in the field of health care, yet they face great challenges in adapting to rapidly evolving medical knowledge. This can lead to outdated or contradictory treatment suggestions. This study investigated how LLMs respond to evolving clinical guidelines, focusing on concept drift and internal inconsistencies. We developed the DriftMedQA benchmark to simulate guideline evolution and assessed the temporal reliability of various LLMs. Our evaluation of seven state-of-the-art models across 4,290 scenarios demonstrated difficulties in rejecting outdated recommendations and frequently endorsing conflicting guidance. Additionally, we explored two mitigation strategies: Retrieval-Augmented Generation and preference fine-tuning via Direct Preference Optimization. While each method improved model performance, their combination led to the most consistent and reliable results. These findings underscore the need to improve LLM robustness to temporal shifts to ensure more dependable applications in clinical practice.

[102] Task-Adaptive Semantic Communications with Controllable Diffusion-based Data Regeneration

Fupei Guo,Achintha Wijesinghe,Songyang Zhang,Zhi Ding

Main category: cs.CL

TL;DR: 提出了一种基于扩散模型的任务自适应语义通信框架,能根据下游任务动态调整语义信息传递。

Details Motivation: 下一代网络需要从比特级数据传输转向语义传递以提高带宽效率,同时需适应不同下游任务的需求。 Method: 通过扩散模型初始化通用语义表示传输,接收端生成任务相关提示反馈,发送端结合注意力机制更新语义传输。 Result: 测试表明该方法能自适应保留关键任务相关信息,同时保持高压缩效率。 Conclusion: 该框架为语义通信提供了一种高效且任务自适应的解决方案。 Abstract: Semantic communications represent a new paradigm of next-generation networking that shifts bit-wise data delivery to conveying the semantic meanings for bandwidth efficiency. To effectively accommodate various potential downstream tasks at the receiver side, one should adaptively convey the most critical semantic information. This work presents a novel task-adaptive semantic communication framework based on diffusion models that is capable of dynamically adjusting the semantic message delivery according to various downstream tasks. Specifically, we initialize the transmission of a deep-compressed general semantic representation from the transmitter to enable diffusion-based coarse data reconstruction at the receiver. The receiver identifies the task-specific demands and generates textual prompts as feedback. Integrated with the attention mechanism, the transmitter updates the semantic transmission with more details to better align with the objectives of the intended receivers. Our test results demonstrate the efficacy of the proposed method in adaptively preserving critical task-relevant information for semantic communications while preserving high compression efficiency.

[103] Large Language Models and Arabic Content: A Review

Haneh Rhel,Dmitri Roussinov

Main category: cs.CL

TL;DR: 本文概述了大型语言模型(LLMs)在阿拉伯语自然语言处理(NLP)中的应用,探讨了其挑战、现有资源及技术改进方法。

Details Motivation: 阿拉伯语资源稀缺且语言复杂,研究旨在探索LLMs如何应对这些挑战并提升阿拉伯语NLP任务的表现。 Method: 通过综述早期预训练阿拉伯语模型、微调技术和提示工程,分析其在多样化阿拉伯语任务和方言中的应用。 Result: 研究表明,LLMs在多语言语料库训练下对阿拉伯语NLP任务表现显著,技术改进进一步提升了模型性能。 Conclusion: LLMs在阿拉伯语NLP中的应用呈上升趋势,未来需更多资源和工具支持其发展。 Abstract: Over the past three years, the rapid advancement of Large Language Models (LLMs) has had a profound impact on multiple areas of Artificial Intelligence (AI), particularly in Natural Language Processing (NLP) across diverse languages, including Arabic. Although Arabic is considered one of the most widely spoken languages across 27 countries in the Arabic world and used as a second language in some other non-Arabic countries as well, there is still a scarcity of Arabic resources, datasets, and tools. Arabic NLP tasks face various challenges due to the complexities of the Arabic language, including its rich morphology, intricate structure, and diverse writing standards, among other factors. Researchers have been actively addressing these challenges, demonstrating that pre-trained Large Language Models (LLMs) trained on multilingual corpora achieve significant success in various Arabic NLP tasks. This study provides an overview of using large language models (LLMs) for the Arabic language, highlighting early pre-trained Arabic Language models across various NLP applications and their ability to handle diverse Arabic content tasks and dialects. It also provides an overview of how techniques like finetuning and prompt engineering can enhance the performance of these models. Additionally, the study summarizes common Arabic benchmarks and datasets while presenting our observations on the persistent upward trend in the adoption of LLMs.

[104] TiSpell: A Semi-Masked Methodology for Tibetan Spelling Correction covering Multi-Level Error with Data Augmentation

Yutong Liu,Feng Xiao,Ziyue Zhang,Yongbin Yu,Cheng Huang,Fan Gao,Xiangxiang Wang,Ma-bao Ban,Manping Fan,Thupten Tsering,Cheng Huang,Gadeng Luosang,Renzeng Duojie,Nyima Tashi

Main category: cs.CL

TL;DR: 提出了一种多级藏文拼写纠错方法TiSpell,结合字符和音节级纠错,通过半掩码策略和数据增强提升性能。

Details Motivation: 现有方法主要关注单级纠错,缺乏对字符和音节级的有效整合,且缺乏针对藏文的开源数据集和增强方法。 Method: 使用无标签文本生成多级错误数据,提出半掩码模型TiSpell,支持字符和音节级纠错。 Result: 在模拟和真实数据上,TiSpell优于基线模型,与最先进方法性能相当。 Conclusion: TiSpell通过数据增强和半掩码策略,有效解决了藏文多级拼写纠错问题。 Abstract: Multi-level Tibetan spelling correction addresses errors at both the character and syllable levels within a unified model. Existing methods focus mainly on single-level correction and lack effective integration of both levels. Moreover, there are no open-source datasets or augmentation methods tailored for this task in Tibetan. To tackle this, we propose a data augmentation approach using unlabeled text to generate multi-level corruptions, and introduce TiSpell, a semi-masked model capable of correcting both character- and syllable-level errors. Although syllable-level correction is more challenging due to its reliance on global context, our semi-masked strategy simplifies this process. We synthesize nine types of corruptions on clean sentences to create a robust training set. Experiments on both simulated and real-world data demonstrate that TiSpell, trained on our dataset, outperforms baseline models and matches the performance of state-of-the-art approaches, confirming its effectiveness.

[105] FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning

Zhehao Zhang,Weijie Xu,Fanyou Wu,Chandan K. Reddy

Main category: cs.CL

TL;DR: 论文提出FalseReject资源,用于解决大语言模型(LLMs)因安全对齐导致的过度拒绝良性查询问题,通过多智能体交互框架生成多样化提示,并通过监督微调显著减少不必要的拒绝。

Details Motivation: 大语言模型在安全对齐过程中常过度拒绝良性查询,影响其在敏感场景的实用性。 Method: 提出FalseReject资源,包含16k看似有毒的查询和结构化响应;采用图引导的多智能体交互框架生成多样化提示;提供训练数据集和人工标注测试集。 Result: 在29个SOTA LLMs上的实验表明,FalseReject显著减少不必要的拒绝,同时保持安全性和语言能力。 Conclusion: FalseReject通过监督微调有效解决过度拒绝问题,提升模型实用性。 Abstract: Safety alignment approaches in large language models (LLMs) often lead to the over-refusal of benign queries, significantly diminishing their utility in sensitive scenarios. To address this challenge, we introduce FalseReject, a comprehensive resource containing 16k seemingly toxic queries accompanied by structured responses across 44 safety-related categories. We propose a graph-informed adversarial multi-agent interaction framework to generate diverse and complex prompts, while structuring responses with explicit reasoning to aid models in accurately distinguishing safe from unsafe contexts. FalseReject includes training datasets tailored for both standard instruction-tuned models and reasoning-oriented models, as well as a human-annotated benchmark test set. Our extensive benchmarking on 29 state-of-the-art (SOTA) LLMs reveals persistent over-refusal challenges. Empirical results demonstrate that supervised finetuning with FalseReject substantially reduces unnecessary refusals without compromising overall safety or general language capabilities.

[106] HYPERNYM MERCURY: Token Optimization through Semantic Field Constriction and Reconstruction from Hypernyms. A New Text Compression Method

Chris Forrester,Octavia Sulea

Main category: cs.CL

TL;DR: 本文提出了一种新颖的文本表示方法和语义压缩技术,可减少90%以上的token,同时保持高语义相似性。

Details Motivation: 在NLP和下一代智能AI中,计算优化是一个重要任务,尤其是通过减少LLM提示的token数量来提高效率。 Method: 采用专利待决的文本表示方案和首次提出的词级语义压缩技术,支持无损压缩和粒度控制。 Result: 在开源数据(如《德古拉》)上测试,结果表明该方法在多类型和模型下均有效。 Conclusion: 该技术显著减少了token数量,同时保持了语义完整性,具有广泛的应用潜力。 Abstract: Compute optimization using token reduction of LLM prompts is an emerging task in the fields of NLP and next generation, agentic AI. In this white paper, we introduce a novel (patent pending) text representation scheme and a first-of-its-kind word-level semantic compression of paragraphs that can lead to over 90\% token reduction, while retaining high semantic similarity to the source text. We explain how this novel compression technique can be lossless and how the detail granularity is controllable. We discuss benchmark results over open source data (i.e. Bram Stoker's Dracula available through Project Gutenberg) and show how our results hold at the paragraph level, across multiple genres and models.

[107] Are LLMs complicated ethical dilemma analyzers?

Jiashen,Du,Jesse Yao,Allen Liu,Zhekai Zhang

Main category: cs.CL

TL;DR: 论文研究了大型语言模型(LLMs)是否能够模拟人类伦理推理,并通过基准数据集和复合指标框架评估了多个前沿LLMs的表现。结果显示LLMs在词汇和结构对齐上优于非专家人类,但在历史背景和复杂解决策略方面表现不足。

Details Motivation: 探讨LLMs是否能够模拟人类伦理推理,并作为人类判断的可信代理。 Method: 引入包含196个真实伦理困境的基准数据集,使用复合指标框架(BLEU、Damerau-Levenshtein距离等)评估多个LLMs的表现。 Result: LLMs在词汇和结构对齐上优于非专家人类,但在历史背景和复杂解决策略方面表现不足。GPT-4o-mini表现最一致。 Conclusion: LLMs在伦理决策中表现出潜力,但仍存在局限性,尤其是在需要上下文抽象的领域。 Abstract: One open question in the study of Large Language Models (LLMs) is whether they can emulate human ethical reasoning and act as believable proxies for human judgment. To investigate this, we introduce a benchmark dataset comprising 196 real-world ethical dilemmas and expert opinions, each segmented into five structured components: Introduction, Key Factors, Historical Theoretical Perspectives, Resolution Strategies, and Key Takeaways. We also collect non-expert human responses for comparison, limited to the Key Factors section due to their brevity. We evaluate multiple frontier LLMs (GPT-4o-mini, Claude-3.5-Sonnet, Deepseek-V3, Gemini-1.5-Flash) using a composite metric framework based on BLEU, Damerau-Levenshtein distance, TF-IDF cosine similarity, and Universal Sentence Encoder similarity. Metric weights are computed through an inversion-based ranking alignment and pairwise AHP analysis, enabling fine-grained comparison of model outputs to expert responses. Our results show that LLMs generally outperform non-expert humans in lexical and structural alignment, with GPT-4o-mini performing most consistently across all sections. However, all models struggle with historical grounding and proposing nuanced resolution strategies, which require contextual abstraction. Human responses, while less structured, occasionally achieve comparable semantic similarity, suggesting intuitive moral reasoning. These findings highlight both the strengths and current limitations of LLMs in ethical decision-making.

[108] Putting It All into Context: Simplifying Agents with LCLMs

Mingjian Jiang,Yangjun Ruan,Luis Lastras,Pavan Kapanipathi,Tatsunori Hashimoto

Main category: cs.CL

TL;DR: 研究表明,在复杂任务(如SWE-bench)中,仅使用长上下文语言模型(LCLM)并优化提示即可达到与复杂代理架构相当的效果,而无需额外脚手架或工具。

Details Motivation: 探讨语言模型代理架构中复杂脚手架的必要性,验证简化方法在复杂任务中的表现。 Method: 使用长上下文语言模型(如Gemini-1.5-Pro和Gemini-2.5-Pro),仅通过优化提示完成任务,无需复杂脚手架或工具。 Result: Gemini-1.5-Pro在SWE-Bench-Verified上达到38%的解决率,与复杂代理架构(32%)相当;Gemini-2.5-Pro则达到50.8%。 Conclusion: 简化方法在某些任务中表现优异,表明复杂脚手架并非总是必要,模型能力的提升是关键。 Abstract: Recent advances in language model (LM) agents have demonstrated significant potential for automating complex real-world tasks. To make progress on these difficult tasks, LM agent architectures have become increasingly complex, often incorporating multi-step retrieval tools, multiple agents, and scaffolding adapted to the underlying LM. In this work, we investigate whether all of this complexity is necessary, or if parts of these scaffolds can be removed on challenging tasks like SWE-bench. We show that in the case of SWE-bench, simply putting the entire environment into the context of a long context language model (LCLM) and properly prompting the model makes it competitive with carefully tuned, complex agent scaffolds. We show that a Gemini-1.5-Pro model without any scaffolding or tools achieves 38% on SWE-Bench-Verified, comparable with approaches using carefully tuned agent scaffolds (32%). While the unscaffolded approach with Gemini-1.5-Pro falls short of the strongest agentic architectures, we demonstrate that the more capable Gemini-2.5-Pro using the same unscaffolded approach directly attains a 50.8% solve rate. Additionally, a two-stage approach combining Gemini-1.5-Pro with Claude-3.7 achieves a competitive 48.6% solve rate.

[109] ALOHA: Empowering Multilingual Agent for University Orientation with Hierarchical Retrieval

Mingxu Tao,Bowen Tang,Mingxuan Ma,Yining Zhang,Hourun Li,Feifan Wen,Hao Ma,Jia Yang

Main category: cs.CL

TL;DR: ALOHA系统通过分层检索和多语言支持解决了LLMs在校园信息检索中的不足,提供更准确、及时和用户友好的服务。

Details Motivation: 现有LLMs和搜索引擎在校园特定信息检索中表现不足,尤其是多语言和实时性需求。 Method: 提出ALOHA系统,结合分层检索和多语言代理,并集成外部API提供交互服务。 Result: 系统在多语言查询中表现优于商业聊天机器人和搜索引擎,已服务超过12,000人。 Conclusion: ALOHA系统有效解决了校园信息检索的痛点,具有实际应用价值。 Abstract: The rise of Large Language Models~(LLMs) revolutionizes information retrieval, allowing users to obtain required answers through complex instructions within conversations. However, publicly available services remain inadequate in addressing the needs of faculty and students to search campus-specific information. It is primarily due to the LLM's lack of domain-specific knowledge and the limitation of search engines in supporting multilingual and timely scenarios. To tackle these challenges, we introduce ALOHA, a multilingual agent enhanced by hierarchical retrieval for university orientation. We also integrate external APIs into the front-end interface to provide interactive service. The human evaluation and case study show our proposed system has strong capabilities to yield correct, timely, and user-friendly responses to the queries in multiple languages, surpassing commercial chatbots and search engines. The system has been deployed and has provided service for more than 12,000 people.

[110] Fusing Bidirectional Chains of Thought and Reward Mechanisms A Method for Enhancing Question-Answering Capabilities of Large Language Models for Chinese Intangible Cultural Heritage

Ruilin Liu,Zhixiao Zhao,Jieqiong Li,Chang Liu,Dongbo Wang

Main category: cs.CL

TL;DR: 本文提出了一种结合双向思维链和奖励机制的新训练方法,用于解决领域特定大语言模型在微调过程中面临的偏见、知识继承错误和灾难性遗忘问题。实验证明该方法在问答任务中优于现有方法,并具有跨领域适应性。

Details Motivation: 大型语言模型在领域特定任务中微调时面临偏见、知识继承错误和灾难性遗忘等问题,亟需一种有效方法解决这些挑战。 Method: 提出了一种结合双向思维链(前向推理与反向提问)和奖励机制的训练方法,基于ICH-Qwen模型进行优化。 Result: 实验结果表明,该方法在准确性、Bleu-4和Rouge-L分数上优于0-shot、逐步推理、知识蒸馏和问题增强方法,且具有跨领域适应性。 Conclusion: 该方法为领域特定语言模型的训练提供了有效解决方案,并展示了在多领域的潜在应用价值。 Abstract: The rapid development of large language models (LLMs) has provided significant support and opportunities for the advancement of domain-specific LLMs. However, fine-tuning these large models using Intangible Cultural Heritage (ICH) data inevitably faces challenges such as bias, incorrect knowledge inheritance, and catastrophic forgetting. To address these issues, we propose a novel training method that integrates a bidirectional chains of thought and a reward mechanism. This method is built upon ICH-Qwen, a large language model specifically designed for the field of intangible cultural heritage. The proposed method enables the model to not only perform forward reasoning but also enhances the accuracy of the generated answers by utilizing reverse questioning and reverse reasoning to activate the model's latent knowledge. Additionally, a reward mechanism is introduced during training to optimize the decision-making process. This mechanism improves the quality of the model's outputs through structural and content evaluations with different weighting schemes. We conduct comparative experiments on ICH-Qwen, with results demonstrating that our method outperforms 0-shot, step-by-step reasoning, knowledge distillation, and question augmentation methods in terms of accuracy, Bleu-4, and Rouge-L scores on the question-answering task. Furthermore, the paper highlights the effectiveness of combining the bidirectional chains of thought and reward mechanism through ablation experiments. In addition, a series of generalizability experiments are conducted, with results showing that the proposed method yields improvements on various domain-specific datasets and advanced models in areas such as Finance, Wikidata, and StrategyQA. This demonstrates that the method is adaptable to multiple domains and provides a valuable approach for model training in future applications across diverse fields.

[111] Exploiting Text Semantics for Few and Zero Shot Node Classification on Text-attributed Graph

Yuxiang Wang,Xiao Yan,Shiyu Jin,Quanqing Xu,Chuang Hu,Yuanyuan Zhu,Bo Du,Jia Wu,Jiawei Jiang

Main category: cs.CL

TL;DR: 本文提出了一种名为TSA的文本语义增强方法,通过引入更多文本语义监督信号,提升了TAG上少样本和零样本节点分类的准确性。

Details Motivation: 现有方法主要基于图增强技术,而文本增强技术未被充分探索,因此需要一种新的方法来利用文本语义信息提升分类性能。 Method: 设计了两种文本语义增强技术:正语义匹配和负语义对比,分别通过匹配相似文本和构造相反语义文本来提供更多参考。 Result: 在5个数据集上与13个基线方法对比,TSA表现最优,通常比最佳基线方法准确率提升5%以上。 Conclusion: TSA通过增强文本语义监督信号,显著提升了TAG上节点分类的性能。 Abstract: Text-attributed graph (TAG) provides a text description for each graph node, and few- and zero-shot node classification on TAGs have many applications in fields such as academia and social networks. Existing work utilizes various graph-based augmentation techniques to train the node and text embeddings, while text-based augmentations are largely unexplored. In this paper, we propose Text Semantics Augmentation (TSA) to improve accuracy by introducing more text semantic supervision signals. Specifically, we design two augmentation techniques, i.e., positive semantics matching and negative semantics contrast, to provide more reference texts for each graph node or text description. Positive semantic matching retrieves texts with similar embeddings to match with a graph node. Negative semantic contrast adds a negative prompt to construct a text description with the opposite semantics, which is contrasted with the original node and text. We evaluate TSA on 5 datasets and compare with 13 state-of-the-art baselines. The results show that TSA consistently outperforms all baselines, and its accuracy improvements over the best-performing baseline are usually over 5%.

[112] A Head to Predict and a Head to Question: Pre-trained Uncertainty Quantification Heads for Hallucination Detection in LLM Outputs

Artem Shelmanov,Ekaterina Fadeeva,Akim Tsvigun,Ivan Tsvigun,Zhuohan Xie,Igor Kiselev,Nico Daheim,Caiqi Zhang,Artem Vazhentsev,Mrinmaya Sachan,Preslav Nakov,Timothy Baldwin

Main category: cs.CL

TL;DR: 论文提出了一种预训练的不确定性量化(UQ)模块,用于检测大型语言模型(LLMs)的幻觉问题,显著优于无监督方法。

Details Motivation: LLMs容易生成虚假信息(幻觉),且难以被用户识别,因此需要一种可靠的方法来量化输出的不确定性。 Method: 通过设计基于Transformer架构的预训练UQ模块,利用LLM的注意力图提取特征,增强不确定性捕捉能力。 Result: 实验表明,该方法在幻觉检测上表现优异,支持跨语言泛化,并发布了预训练的UQ模块。 Conclusion: 预训练的UQ模块为LLMs的幻觉问题提供了高效解决方案,具有广泛适用性。 Abstract: Large Language Models (LLMs) have the tendency to hallucinate, i.e., to sporadically generate false or fabricated information. This presents a major challenge, as hallucinations often appear highly convincing and users generally lack the tools to detect them. Uncertainty quantification (UQ) provides a framework for assessing the reliability of model outputs, aiding in the identification of potential hallucinations. In this work, we introduce pre-trained UQ heads: supervised auxiliary modules for LLMs that substantially enhance their ability to capture uncertainty compared to unsupervised UQ methods. Their strong performance stems from the powerful Transformer architecture in their design and informative features derived from LLM attention maps. Experimental evaluation shows that these heads are highly robust and achieve state-of-the-art performance in claim-level hallucination detection across both in-domain and out-of-domain prompts. Moreover, these modules demonstrate strong generalization to languages they were not explicitly trained on. We pre-train a collection of UQ heads for popular LLM series, including Mistral, Llama, and Gemma 2. We publicly release both the code and the pre-trained heads.

[113] Large Language Model Psychometrics: A Systematic Review of Evaluation, Validation, and Enhancement

Haoran Ye,Jing Jin,Yuhang Xie,Xin Zhang,Guojie Song

Main category: cs.CL

TL;DR: 该论文提出了LLM心理测量学这一新兴交叉学科,旨在利用心理测量工具和理论评估和改进大语言模型,以解决传统评估方法的不足。

Details Motivation: 大语言模型的快速发展超越了传统评估方法,需要新的方法来衡量人类心理特质,并建立以人为中心的评估体系。 Method: 通过整合心理测量学的工具、理论和原则,系统性地探索其在制定基准、拓宽评估范围、改进方法和验证结果中的作用。 Result: 论文提供了一个结构化框架,帮助跨学科研究者全面理解LLM心理测量学,并提出了未来评估范式的可行建议。 Conclusion: 目标是推动与人类水平AI对齐的评估方法,促进以人为中心的AI系统发展,造福社会。 Abstract: The rapid advancement of large language models (LLMs) has outpaced traditional evaluation methodologies. It presents novel challenges, such as measuring human-like psychological constructs, navigating beyond static and task-specific benchmarks, and establishing human-centered evaluation. These challenges intersect with Psychometrics, the science of quantifying the intangible aspects of human psychology, such as personality, values, and intelligence. This survey introduces and synthesizes an emerging interdisciplinary field of LLM Psychometrics, which leverages psychometric instruments, theories, and principles to evaluate, understand, and enhance LLMs. We systematically explore the role of Psychometrics in shaping benchmarking principles, broadening evaluation scopes, refining methodologies, validating results, and advancing LLM capabilities. This paper integrates diverse perspectives to provide a structured framework for researchers across disciplines, enabling a more comprehensive understanding of this nascent field. Ultimately, we aim to provide actionable insights for developing future evaluation paradigms that align with human-level AI and promote the advancement of human-centered AI systems for societal benefit. A curated repository of LLM psychometric resources is available at https://github.com/valuebyte-ai/Awesome-LLM-Psychometrics.

[114] Enhancing Cache-Augmented Generation (CAG) with Adaptive Contextual Compression for Scalable Knowledge Integration

Rishabh Agrawal,Himanshu Kumar

Main category: cs.CL

TL;DR: 论文提出了一种名为自适应上下文压缩(ACC)的技术,结合混合CAG-RAG框架,以优化大语言模型(LLMs)的知识集成效率。

Details Motivation: 尽管缓存增强生成(CAG)在减少检索延迟和简化系统设计方面表现出潜力,但在处理大规模动态知识库时仍面临挑战。 Method: 引入自适应上下文压缩(ACC)技术动态管理上下文输入,并提出混合CAG-RAG框架,结合选择性检索以补充预加载知识。 Result: 实验表明,该方法显著提升了可扩展性、效率和多跳推理性能。 Conclusion: 该研究为实际知识集成问题提供了有效的解决方案。 Abstract: The rapid progress in large language models (LLMs) has paved the way for novel approaches in knowledge-intensive tasks. Among these, Cache-Augmented Generation (CAG) has emerged as a promising alternative to Retrieval-Augmented Generation (RAG). CAG minimizes retrieval latency and simplifies system design by preloading knowledge into the model's context. However, challenges persist in scaling CAG to accommodate large and dynamic knowledge bases effectively. This paper introduces Adaptive Contextual Compression (ACC), an innovative technique designed to dynamically compress and manage context inputs, enabling efficient utilization of the extended memory capabilities of modern LLMs. To further address the limitations of standalone CAG, we propose a Hybrid CAG-RAG Framework, which integrates selective retrieval to augment preloaded contexts in scenarios requiring additional information. Comprehensive evaluations on diverse datasets highlight the proposed methods' ability to enhance scalability, optimize efficiency, and improve multi-hop reasoning performance, offering practical solutions for real-world knowledge integration challenges.

[115] Evaluating the Effectiveness of Black-Box Prompt Optimization as the Scale of LLMs Continues to Grow

Ziyu Zhou,Yihang Wu,Jingyuan Yang,Zhan Xiao,Rongjun Li

Main category: cs.CL

TL;DR: 黑盒提示优化方法在大型语言模型(LLM)上的效果随模型规模增大而减弱。

Details Motivation: 研究黑盒提示优化方法在超大规模LLM(如DeepSeek V3和Gemini 2.0 Flash)上的有效性。 Method: 选择三种黑盒优化方法,在多个NLU和NLG数据集上评估其在大型LLM上的表现,并进一步分析模型规模的影响。 Result: 黑盒优化方法对超大规模LLM的改进有限,且其效果随模型规模增大而降低。 Conclusion: 模型规模是影响黑盒提示优化效果的主要因素,超大规模LLM可能不再需要此类优化。 Abstract: Black-Box prompt optimization methods have emerged as a promising strategy for refining input prompts to better align large language models (LLMs), thereby enhancing their task performance. Although these methods have demonstrated encouraging results, most studies and experiments have primarily focused on smaller-scale models (e.g., 7B, 14B) or earlier versions (e.g., GPT-3.5) of LLMs. As the scale of LLMs continues to increase, such as with DeepSeek V3 (671B), it remains an open question whether these black-box optimization techniques will continue to yield significant performance improvements for models of such scale. In response to this, we select three well-known black-box optimization methods and evaluate them on large-scale LLMs (DeepSeek V3 and Gemini 2.0 Flash) across four NLU and NLG datasets. The results show that these black-box prompt optimization methods offer only limited improvements on these large-scale LLMs. Furthermore, we hypothesize that the scale of the model is the primary factor contributing to the limited benefits observed. To explore this hypothesis, we conducted experiments on LLMs of varying sizes (Qwen 2.5 series, ranging from 7B to 72B) and observed an inverse scaling law, wherein the effectiveness of black-box optimization methods diminished as the model size increased.

[116] AM-Thinking-v1: Advancing the Frontier of Reasoning at 32B Scale

Yunjie Ji,Xiaoyu Tian,Sitong Zhao,Haotian Wang,Shuaiting Chen,Yiping Peng,Han Zhao,Xiangang Li

Main category: cs.CL

TL;DR: AM-Thinking-v1是一个32B密集语言模型,在开源社区中表现出色,尤其在数学和编程能力上领先。

Details Motivation: 展示开源社区在32B规模模型上也能实现高性能,平衡性能和实用性。 Method: 基于开源Qwen2.5-32B模型,通过监督微调和强化学习的后训练流程优化。 Result: 在AIME 2024、2025和LiveCodeBench上分别取得85.3、74.4和70.3的高分。 Conclusion: 开源模型在中等规模下也能实现高性能,为协作创新提供新方向。 Abstract: We present AM-Thinking-v1, a 32B dense language model that advances the frontier of reasoning, embodying the collaborative spirit of open-source innovation. Outperforming DeepSeek-R1 and rivaling leading Mixture-of-Experts (MoE) models like Qwen3-235B-A22B and Seed1.5-Thinking, AM-Thinking-v1 achieves impressive scores of 85.3 on AIME 2024, 74.4 on AIME 2025, and 70.3 on LiveCodeBench, showcasing state-of-the-art mathematical and coding capabilities among open-source models of similar scale. Built entirely from the open-source Qwen2.5-32B base model and publicly available queries, AM-Thinking-v1 leverages a meticulously crafted post-training pipeline - combining supervised fine-tuning and reinforcement learning - to deliver exceptional reasoning capabilities. This work demonstrates that the open-source community can achieve high performance at the 32B scale, a practical sweet spot for deployment and fine-tuning. By striking a balance between top-tier performance and real-world usability, we hope AM-Thinking-v1 inspires further collaborative efforts to harness mid-scale models, pushing reasoning boundaries while keeping accessibility at the core of innovation. We have open-sourced our model on \href{https://huggingface.co/a-m-team/AM-Thinking-v1}{Hugging Face}.

[117] On the Geometry of Semantics in Next-token Prediction

Yize Zhao,Christos Thrampoulidis

Main category: cs.CL

TL;DR: 论文探讨了现代语言模型如何通过简单的下一个词预测(NTP)目标学习潜在语义和语法概念,揭示了NTP优化通过奇异值分解(SVD)隐式编码这些概念。

Details Motivation: 研究NTP训练目标如何引导模型提取和编码潜在语义和语法概念,以理解语言模型如何从简单目标中学习复杂结构。 Method: 通过分析NTP优化过程,发现模型隐式地通过SVD分解中心化的数据稀疏矩阵来编码概念,并利用谱聚类方法识别语义。 Result: 研究发现最重要的SVD因子在训练早期被学习,支持使用谱聚类方法(包括k-means和新提出的基于象限的方法)识别语义。 Conclusion: 论文连接了分布语义、神经崩溃几何和神经网络训练动态,揭示了NTP隐式偏差如何塑造语言模型中的意义表示。 Abstract: Modern language models demonstrate a remarkable ability to capture linguistic meaning despite being trained solely through next-token prediction (NTP). We investigate how this conceptually simple training objective leads models to extract and encode latent semantic and grammatical concepts. Our analysis reveals that NTP optimization implicitly guides models to encode concepts via singular value decomposition (SVD) factors of a centered data-sparsity matrix that captures next-word co-occurrence patterns. While the model never explicitly constructs this matrix, learned word and context embeddings effectively factor it to capture linguistic structure. We find that the most important SVD factors are learned first during training, motivating the use of spectral clustering of embeddings to identify human-interpretable semantics, including both classical k-means and a new orthant-based method directly motivated by our interpretation of concepts. Overall, our work bridges distributional semantics, neural collapse geometry, and neural network training dynamics, providing insights into how NTP's implicit biases shape the emergence of meaning representations in language models.

[118] Alignment Drift in CEFR-prompted LLMs for Interactive Spanish Tutoring

Mina Almasi,Ross Deans Kristensen-McLachlan

Main category: cs.CL

TL;DR: 本文探讨了大型语言模型(LLMs)作为第二语言学习自适应导师的潜力,通过系统提示评估其是否能生成适合学生能力水平的文本。研究发现提示方法虽有效,但长期交互中易出现对齐漂移。

Details Motivation: 研究LLMs在第二语言学习中作为自适应导师的可行性,探索系统提示是否能有效控制文本难度以适应不同学生水平。 Method: 使用7B至12B参数的开源LLMs模拟西班牙语师生对话,通过CEFR分级提示控制文本难度,评估模型输出效果。 Result: 系统提示可约束模型输出,但长期交互中提示方法易失效(对齐漂移)。 Conclusion: LLMs在个性化自适应导师方面具有潜力,但需改进提示方法以应对长期交互挑战。 Abstract: This paper investigates the potentials of Large Language Models (LLMs) as adaptive tutors in the context of second-language learning. In particular, we evaluate whether system prompting can reliably constrain LLMs to generate only text appropriate to the student's competence level. We simulate full teacher-student dialogues in Spanish using instruction-tuned, open-source LLMs ranging in size from 7B to 12B parameters. Dialogues are generated by having an LLM alternate between tutor and student roles with separate chat histories. The output from the tutor model is then used to evaluate the effectiveness of CEFR-based prompting to control text difficulty across three proficiency levels (A1, B1, C1). Our findings suggest that while system prompting can be used to constrain model outputs, prompting alone is too brittle for sustained, long-term interactional contexts - a phenomenon we term alignment drift. Our results provide insights into the feasibility of LLMs for personalized, proficiency-aligned adaptive tutors and provide a scalable method for low-cost evaluation of model performance without human participants.

[119] Towards Contamination Resistant Benchmarks

Rahmatullah Musawi,Sheng Lu

Main category: cs.CL

TL;DR: 本文提出了一种基于凯撒密码的污染抵抗基准,用于更严格地评估大语言模型(LLMs),揭示了当前模型的局限性。

Details Motivation: 由于污染问题影响LLM评估的可靠性,需要开发一种抵抗污染的基准来更准确地评估模型的真实能力。 Method: 提出基于凯撒密码的污染抵抗基准,并在不同设置下测试广泛使用的LLMs。 Result: LLMs在控制污染的情况下表现不佳,揭示了其真实能力的不足。 Conclusion: 该工作为开发污染抵抗基准提供了贡献,有助于更严格地评估LLMs并揭示其局限性。 Abstract: The rapid development of large language models (LLMs) has transformed the landscape of natural language processing. Evaluating LLMs properly is crucial for understanding their potential and addressing concerns such as safety. However, LLM evaluation is confronted by various factors, among which contamination stands out as a key issue that undermines the reliability of evaluations. In this work, we introduce the concept of contamination resistance to address this challenge. We propose a benchmark based on Caesar ciphers (e.g., "ab" to "bc" when the shift is 1), which, despite its simplicity, is an excellent example of a contamination resistant benchmark. We test this benchmark on widely used LLMs under various settings, and we find that these models struggle with this benchmark when contamination is controlled. Our findings reveal issues in current LLMs and raise important questions regarding their true capabilities. Our work contributes to the development of contamination resistant benchmarks, enabling more rigorous LLM evaluation and offering insights into the true capabilities and limitations of LLMs.

[120] Accelerating Chain-of-Thought Reasoning: When Goal-Gradient Importance Meets Dynamic Skipping

Ren Zhuang,Ben Wang,Shuifa Sun

Main category: cs.CL

TL;DR: 论文提出Adaptive GoGI-Skip框架,通过动态压缩Chain-of-Thought(CoT)提示,结合Goal-Gradient Importance(GoGI)和Adaptive Dynamic Skipping(ADS),显著提升推理效率并保持准确性。

Details Motivation: 当前CoT压缩技术依赖通用重要性指标和静态压缩率,可能导致关键信息丢失或无法适应不同推理复杂度。 Method: 提出GoGI指标衡量中间表示对最终答案损失的梯度影响,结合ADS机制动态调节压缩率。 Result: 在MATH数据上训练,跨领域泛化能力强,平均减少45%的CoT标记,推理速度提升1.6-2.0倍,同时保持高准确性。 Conclusion: Adaptive GoGI-Skip在效率和准确性权衡上优于现有基线,推动了CoT推理的先进技术。 Abstract: Large Language Models leverage Chain-of-Thought (CoT) prompting for complex tasks, but their reasoning traces are often excessively verbose and inefficient, leading to significant computational costs and latency. Current CoT compression techniques typically rely on generic importance metrics and static compression rates, which may inadvertently remove functionally critical tokens or fail to adapt to varying reasoning complexity. To overcome these limitations, we propose Adaptive GoGI-Skip, a novel framework learning dynamic CoT compression via supervised fine-tuning. This approach introduces two synergistic innovations: (1) Goal-Gradient Importance (GoGI), a novel metric accurately identifying functionally relevant tokens by measuring the gradient influence of their intermediate representations on the final answer loss, and (2) Adaptive Dynamic Skipping (ADS), a mechanism dynamically regulating the compression rate based on runtime model uncertainty while ensuring local coherence through an adaptive N-token constraint. To our knowledge, this is the first work unifying a goal-oriented, gradient-based importance metric with dynamic, uncertainty-aware skipping for CoT compression. Trained on compressed MATH data, Adaptive GoGI-Skip demonstrates strong cross-domain generalization across diverse reasoning benchmarks including AIME, GPQA, and GSM8K. It achieves substantial efficiency gains - reducing CoT token counts by over 45% on average and delivering 1.6-2.0 times inference speedups - while maintaining high reasoning accuracy. Notably, it significantly outperforms existing baselines by preserving accuracy even at high effective compression rates, advancing the state of the art in the CoT reasoning efficiency-accuracy trade-off.

[121] TUMS: Enhancing Tool-use Abilities of LLMs with Multi-structure Handlers

Aiyao He,Sijia Cui,Shuai Xu,Yanna Wang,Bo Xu

Main category: cs.CL

TL;DR: TUMS框架通过将工具级处理转为参数级处理,提升LLMs的工具使用能力,显著提高了任务执行效果。

Details Motivation: LLMs在工具集成中面临非可执行操作和参数错误的问题,TUMS旨在解决这些问题。 Method: TUMS框架包含意图识别器、任务分解器、子任务处理器和执行器,实现参数级处理。 Result: 在ToolQA基准测试中,TUMS在简单和困难任务上分别平均提升19.6%和50.6%。 Conclusion: TUMS框架有效提升LLMs的工具使用能力,为工具增强型LLMs的未来研究提供启示。 Abstract: Recently, large language models(LLMs) have played an increasingly important role in solving a wide range of NLP tasks, leveraging their capabilities of natural language understanding and generating. Integration with external tools further enhances LLMs' effectiveness, providing more precise, timely, and specialized responses. However, LLMs still encounter difficulties with non-executable actions and improper actions, which are primarily attributed to incorrect parameters. The process of generating parameters by LLMs is confined to the tool level, employing the coarse-grained strategy without considering the different difficulties of various tools. To address this issue, we propose TUMS, a novel framework designed to enhance the tool-use capabilities of LLMs by transforming tool-level processing into parameter-level processing. Specifically, our framework consists of four key components: (1) an intent recognizer that identifies the user's intent to help LLMs better understand the task; (2) a task decomposer that breaks down complex tasks into simpler subtasks, each involving a tool call; (3) a subtask processor equipped with multi-structure handlers to generate accurate parameters; and (4) an executor. Our empirical studies have evidenced the effectiveness and efficiency of the TUMS framework with an average of 19.6\% and 50.6\% improvement separately on easy and hard benchmarks of ToolQA, meanwhile, we demonstrated the key contribution of each part with ablation experiments, offering more insights and stimulating future research on Tool-augmented LLMs.

[122] Hakim: Farsi Text Embedding Model

Mehran Sarmadi,Morteza Alikhani,Erfan Zinvandi,Zahra Pourbahman

Main category: cs.CL

TL;DR: 本文介绍了Hakim,一种新型波斯语文本嵌入模型,性能提升8.5%,并引入三个新数据集,适用于聊天机器人和检索增强生成系统。

Details Motivation: 波斯语在大规模嵌入研究中代表性不足,本文旨在填补这一空白。 Method: 提出基于BERT架构的基线模型和RetroMAE模型,并引入三个新数据集支持训练。 Result: Hakim在FaMTEB基准上表现优异,性能提升8.5%,适用于多种波斯语NLP任务。 Conclusion: 这些贡献为波斯语语言理解奠定了基础。 Abstract: Recent advancements in text embedding have significantly improved natural language understanding across many languages, yet Persian remains notably underrepresented in large-scale embedding research. In this paper, we present Hakim, a novel state-of-the-art Persian text embedding model that achieves a 8.5% performance improvement over existing approaches on the FaMTEB benchmark, outperforming all previously developed Persian language models. As part of this work, we introduce three new datasets - Corpesia, Pairsia-sup, and Pairsia-unsup - to support supervised and unsupervised training scenarios. Additionally, Hakim is designed for applications in chatbots and retrieval-augmented generation (RAG) systems, particularly addressing retrieval tasks that require incorporating message history within these systems. We also propose a new baseline model built on the BERT architecture. Our language model consistently achieves higher accuracy across various Persian NLP tasks, while the RetroMAE-based model proves particularly effective for textual information retrieval applications. Together, these contributions establish a new foundation for advancing Persian language understanding.

[123] A document processing pipeline for the construction of a dataset for topic modeling based on the judgments of the Italian Supreme Court

Matteo Marulli,Glauco Panattoni,Marco Bertini

Main category: cs.CL

TL;DR: 为了解决意大利法律研究中缺乏公开数据集的问题,研究者开发了一个文档处理流程,生成适用于主题建模的匿名数据集,并显著提升了主题建模的效果。

Details Motivation: 意大利法律研究中缺乏公开数据集,限制了最高法院判决中法律主题的分析。 Method: 开发了一个集成文档布局分析(YOLOv8x)、光学字符识别和文本匿名化的处理流程。 Result: DLA模块和OCR检测器表现优异,数据集显著提升了主题建模的多样性和一致性。BERTopic和大型语言模型生成的主题标签和摘要获得了高评分。 Conclusion: 该流程有效解决了数据缺乏问题,并提升了主题建模的质量,为法律研究提供了有力工具。 Abstract: Topic modeling in Italian legal research is hindered by the lack of public datasets, limiting the analysis of legal themes in Supreme Court judgments. To address this, we developed a document processing pipeline that produces an anonymized dataset optimized for topic modeling. The pipeline integrates document layout analysis (YOLOv8x), optical character recognition, and text anonymization. The DLA module achieved a mAP@50 of 0.964 and a mAP@50-95 of 0.800. The OCR detector reached a mAP@50-95 of 0.9022, and the text recognizer (TrOCR) obtained a character error rate of 0.0047 and a word error rate of 0.0248. Compared to OCR-only methods, our dataset improved topic modeling with a diversity score of 0.6198 and a coherence score of 0.6638. We applied BERTopic to extract topics and used large language models to generate labels and summaries. Outputs were evaluated against domain expert interpretations. Claude Sonnet 3.7 achieved a BERTScore F1 of 0.8119 for labeling and 0.9130 for summarization.

[124] IterKey: Iterative Keyword Generation with LLMs for Enhanced Retrieval Augmented Generation

Kazuki Hayashi,Hidetaka Kamigaito,Shinya Kouda,Taro Watanabe

Main category: cs.CL

TL;DR: IterKey是一个基于LLM的迭代关键词生成框架,通过稀疏检索增强RAG,平衡了准确性和可解释性。

Details Motivation: 解决密集检索方法缺乏可解释性和稀疏检索方法无法完全捕捉查询意图的问题。 Method: 采用三阶段LLM驱动流程:生成检索关键词、基于检索文档生成答案、验证答案,失败时迭代优化关键词。 Result: 在四个QA任务中,IterKey比BM25-based RAG和简单基线方法提升了5%至20%的准确率,性能接近密集检索方法。 Conclusion: IterKey是一种新颖的BM25-based方法,通过LLM迭代优化RAG,有效平衡了准确性和可解释性。 Abstract: Retrieval-Augmented Generation (RAG) has emerged as a way to complement the in-context knowledge of Large Language Models (LLMs) by integrating external documents. However, real-world applications demand not only accuracy but also interpretability. While dense retrieval methods provide high accuracy, they lack interpretability; conversely, sparse retrieval methods offer transparency but often fail to capture the full intent of queries due to their reliance on keyword matching. To address these issues, we introduce IterKey, an LLM-driven iterative keyword generation framework that enhances RAG via sparse retrieval. IterKey consists of three LLM-driven stages: generating keywords for retrieval, generating answers based on retrieved documents, and validating the answers. If validation fails, the process iteratively repeats with refined keywords. Across four QA tasks, experimental results show that IterKey achieves 5% to 20% accuracy improvements over BM25-based RAG and simple baselines. Its performance is comparable to dense retrieval-based RAG and prior iterative query refinement methods using dense models. In summary, IterKey is a novel BM25-based approach leveraging LLMs to iteratively refine RAG, effectively balancing accuracy with interpretability.

[125] RepCali: High Efficient Fine-tuning Via Representation Calibration in Latent Space for Pre-trained Language Models

Fujun Zhang,XiangDong Su

Main category: cs.CL

TL;DR: 论文提出了一种名为RepCali的方法,通过在校准块中调整预训练语言模型(PLMs)的潜在空间表示,以解决编码器与解码器之间的表示差异问题。

Details Motivation: 现有PLMs在微调后仍存在编码器输出与解码器输入之间的表示不匹配问题,限制了模型性能。 Method: 提出RepCali方法,在编码器后加入校准块,调整潜在空间表示,作为解码器输入。 Result: 在8个任务(包括中英文数据集)的25个PLM模型上验证,RepCali显著提升了性能,优于基准微调方法。 Conclusion: RepCali是一种通用、即插即用的方法,能有效提升PLMs在下游任务中的表现。 Abstract: Fine-tuning pre-trained language models (PLMs) has become a dominant paradigm in applying PLMs to downstream tasks. However, with limited fine-tuning, PLMs still struggle with the discrepancies between the representation obtained from the PLMs' encoder and the optimal input to the PLMs' decoder. This paper tackles this challenge by learning to calibrate the representation of PLMs in the latent space. In the proposed representation calibration method (RepCali), we integrate a specific calibration block to the latent space after the encoder and use the calibrated output as the decoder input. The merits of the proposed RepCali include its universality to all PLMs with encoder-decoder architectures, its plug-and-play nature, and ease of implementation. Extensive experiments on 25 PLM-based models across 8 tasks (including both English and Chinese datasets) demonstrate that the proposed RepCali offers desirable enhancements to PLMs (including LLMs) and significantly improves the performance of downstream tasks. Comparison experiments across 4 benchmark tasks indicate that RepCali is superior to the representative fine-tuning baselines.

[126] Large Language Models Meet Stance Detection: A Survey of Tasks, Methods, Applications, Challenges and Future Directions

Lata Pangtey,Anukriti Bhatnagar,Shubhi Bansal,Shahid Shafi Dar,Nagendra Kumar

Main category: cs.CL

TL;DR: 本文综述了基于大语言模型(LLM)的立场检测研究,系统分析了其方法、数据集、应用及挑战,并提出新的分类法。

Details Motivation: 现有调查缺乏对LLM在立场检测中应用的全面覆盖,本文旨在填补这一空白。 Method: 通过系统分析,提出基于学习方式、数据模态和目标关系的分类法,并讨论评估技术和数据集。 Result: 总结了LLM在立场检测中的优势与局限,并探讨了关键应用领域。 Conclusion: 指出了未来研究方向,如可解释性推理和低资源适应,为下一代立场检测系统提供指导。 Abstract: Stance detection is essential for understanding subjective content across various platforms such as social media, news articles, and online reviews. Recent advances in Large Language Models (LLMs) have revolutionized stance detection by introducing novel capabilities in contextual understanding, cross-domain generalization, and multimodal analysis. Despite these progressions, existing surveys often lack comprehensive coverage of approaches that specifically leverage LLMs for stance detection. To bridge this critical gap, our review article conducts a systematic analysis of stance detection, comprehensively examining recent advancements of LLMs transforming the field, including foundational concepts, methodologies, datasets, applications, and emerging challenges. We present a novel taxonomy for LLM-based stance detection approaches, structured along three key dimensions: 1) learning methods, including supervised, unsupervised, few-shot, and zero-shot; 2) data modalities, such as unimodal, multimodal, and hybrid; and 3) target relationships, encompassing in-target, cross-target, and multi-target scenarios. Furthermore, we discuss the evaluation techniques and analyze benchmark datasets and performance trends, highlighting the strengths and limitations of different architectures. Key applications in misinformation detection, political analysis, public health monitoring, and social media moderation are discussed. Finally, we identify critical challenges such as implicit stance expression, cultural biases, and computational constraints, while outlining promising future directions, including explainable stance reasoning, low-resource adaptation, and real-time deployment frameworks. Our survey highlights emerging trends, open challenges, and future directions to guide researchers and practitioners in developing next-generation stance detection systems powered by large language models.

[127] Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning?

Md Tahmid Rahman Laskar,Mohammed Saidul Islam,Ridwan Mahbub,Ahmed Masry,Mizanur Rahman,Amran Bhuiyan,Mir Tafseer Nayeem,Shafiq Joty,Enamul Hoque,Jimmy Huang

Main category: cs.CL

TL;DR: 论文评估了13种开源大型视觉语言模型(LVLM)作为图表理解任务的自动评估工具,发现部分模型性能接近GPT-4,但存在位置偏好和长度偏差等问题。

Details Motivation: 现有大型视觉语言模型(LVLM)在图表相关任务中的评估成本高且耗时,限制了实际应用。开源LVLM作为评估工具可能提供成本效益高的解决方案。 Method: 设计了成对和点对评估任务,涵盖事实正确性、信息量和相关性等标准,并分析格式遵循、位置一致性、长度偏差和指令遵循等特性。 Result: 实验结果显示,部分开源LVLM评估性能接近GPT-4(约80%一致),但其他模型表现较差(低于10%一致)。 Conclusion: 开源LVLM可作为图表任务的成本效益评估工具,但仍需解决位置偏好和长度偏差等问题。 Abstract: Charts are ubiquitous as they help people understand and reason with data. Recently, various downstream tasks, such as chart question answering, chart2text, and fact-checking, have emerged. Large Vision-Language Models (LVLMs) show promise in tackling these tasks, but their evaluation is costly and time-consuming, limiting real-world deployment. While using LVLMs as judges to assess the chart comprehension capabilities of other LVLMs could streamline evaluation processes, challenges like proprietary datasets, restricted access to powerful models, and evaluation costs hinder their adoption in industrial settings. To this end, we present a comprehensive evaluation of 13 open-source LVLMs as judges for diverse chart comprehension and reasoning tasks. We design both pairwise and pointwise evaluation tasks covering criteria like factual correctness, informativeness, and relevancy. Additionally, we analyze LVLM judges based on format adherence, positional consistency, length bias, and instruction-following. We focus on cost-effective LVLMs (<10B parameters) suitable for both research and commercial use, following a standardized evaluation protocol and rubric to measure the LVLM judge's accuracy. Experimental results reveal notable variability: while some open LVLM judges achieve GPT-4-level evaluation performance (about 80% agreement with GPT-4 judgments), others struggle (below ~10% agreement). Our findings highlight that state-of-the-art open-source LVLMs can serve as cost-effective automatic evaluators for chart-related tasks, though biases such as positional preference and length bias persist.

[128] LCES: Zero-shot Automated Essay Scoring via Pairwise Comparisons Using Large Language Models

Takumi Shibata,Yuichi Miyamura

Main category: cs.CL

TL;DR: 论文提出了一种基于大语言模型(LLM)的比较性论文评分方法(LCES),通过成对比较任务提高零样本自动评分的准确性。

Details Motivation: 现有零样本方法直接生成绝对分数,容易因模型偏差和不一致评分偏离人工评估,需改进。 Method: 将自动评分任务转化为成对比较,利用LLM判断两篇论文优劣,再通过RankNet将比较结果转换为连续分数。 Result: 实验表明,LCES在准确性上优于传统零样本方法,且计算高效,适用于不同LLM模型。 Conclusion: LCES为现实中的零样本自动评分提供了一种高效且稳健的解决方案。 Abstract: Recent advances in large language models (LLMs) have enabled zero-shot automated essay scoring (AES), providing a promising way to reduce the cost and effort of essay scoring in comparison with manual grading. However, most existing zero-shot approaches rely on LLMs to directly generate absolute scores, which often diverge from human evaluations owing to model biases and inconsistent scoring. To address these limitations, we propose LLM-based Comparative Essay Scoring (LCES), a method that formulates AES as a pairwise comparison task. Specifically, we instruct LLMs to judge which of two essays is better, collect many such comparisons, and convert them into continuous scores. Considering that the number of possible comparisons grows quadratically with the number of essays, we improve scalability by employing RankNet to efficiently transform LLM preferences into scalar scores. Experiments using AES benchmark datasets show that LCES outperforms conventional zero-shot methods in accuracy while maintaining computational efficiency. Moreover, LCES is robust across different LLM backbones, highlighting its applicability to real-world zero-shot AES.

[129] Reassessing Graph Linearization for Sequence-to-sequence AMR Parsing: On the Advantages and Limitations of Triple-Based Encoding

Jeongwoo Kang,Maximin Coavoux,Cédric Lopez,Didier Schwab

Main category: cs.CL

TL;DR: 论文提出了一种基于三元组的线性化方法,用于改进AMR图的序列化表示,解决了Penman编码在处理深图和节点重入时的局限性。

Details Motivation: Penman编码在序列化AMR图时存在两个主要问题:深图中相关节点距离过远,以及需要逆角色处理节点重入,导致关系类型数量翻倍。 Method: 提出了一种基于三元组的线性化方法,并与Penman编码进行效率比较。 Result: 结果表明,三元组编码在表示图结构方面有潜力,但仍需改进以匹敌Penman的简洁性和嵌套结构的明确表示。 Conclusion: 三元组编码是一种有前景的替代方案,但需进一步优化以完全替代Penman编码。 Abstract: Sequence-to-sequence models are widely used to train Abstract Meaning Representation (Banarescu et al., 2013, AMR) parsers. To train such models, AMR graphs have to be linearized into a one-line text format. While Penman encoding is typically used for this purpose, we argue that it has limitations: (1) for deep graphs, some closely related nodes are located far apart in the linearized text (2) Penman's tree-based encoding necessitates inverse roles to handle node re-entrancy, doubling the number of relation types to predict. To address these issues, we propose a triple-based linearization method and compare its efficiency with Penman linearization. Although triples are well suited to represent a graph, our results suggest room for improvement in triple encoding to better compete with Penman's concise and explicit representation of a nested graph structure.

[130] Are We Paying Attention to Her? Investigating Gender Disambiguation and Attention in Machine Translation

Chiara Manna,Afra Alishahi,Frédéric Blain,Eva Vanmassenhove

Main category: cs.CL

TL;DR: 论文提出了一种新的评估指标MPA,用于衡量NMT系统对性别线索的依赖程度,发现现有模型更倾向于忽略性别线索而依赖统计性别刻板印象。

Details Motivation: 传统评估指标未能充分捕捉NMT系统对上下文性别线索的整合程度,因此需要一种更精确的评估方法。 Method: 提出Minimal Pair Accuracy (MPA)指标,通过最小对句子(仅性别代词不同)评估模型对性别线索的依赖。 Result: 实验显示NMT模型大多忽略性别线索,依赖刻板印象;在反刻板情况下,模型更倾向于考虑男性线索而忽略女性线索。 Conclusion: 性别信息在模型中编码存在差异,男性线索引发更分散的响应,而女性线索则更集中和专门化。 Abstract: While gender bias in modern Neural Machine Translation (NMT) systems has received much attention, traditional evaluation metrics do not to fully capture the extent to which these systems integrate contextual gender cues. We propose a novel evaluation metric called Minimal Pair Accuracy (MPA), which measures the reliance of models on gender cues for gender disambiguation. MPA is designed to go beyond surface-level gender accuracy metrics by focusing on whether models adapt to gender cues in minimal pairs -- sentence pairs that differ solely in the gendered pronoun, namely the explicit indicator of the target's entity gender in the source language (EN). We evaluate a number of NMT models on the English-Italian (EN--IT) language pair using this metric, we show that they ignore available gender cues in most cases in favor of (statistical) stereotypical gender interpretation. We further show that in anti-stereotypical cases, these models tend to more consistently take masculine gender cues into account while ignoring the feminine cues. Furthermore, we analyze the attention head weights in the encoder component and show that while all models encode gender information to some extent, masculine cues elicit a more diffused response compared to the more concentrated and specialized responses to feminine gender cues.

[131] Small but Significant: On the Promise of Small Language Models for Accessible AIED

Yumou Wei,Paulo Carvalho,John Stamper

Main category: cs.CL

TL;DR: 论文指出GPT和大型语言模型(LLMs)在AIED领域占据主导地位,但忽视了小型语言模型(SLMs)的潜力。通过实验证明SLMs(如Phi-2)在解决教育关键问题(如知识组件发现)时同样有效,呼吁更多关注SLM-based AIED方法。

Details Motivation: 当前AIED领域过度依赖GPT等资源密集型LLMs,可能忽视了SLMs在资源受限机构中提供高质量AI工具的潜力。 Method: 通过实验验证SLMs(如Phi-2)在知识组件发现任务中的表现,无需复杂提示策略。 Result: SLMs在解决AIED关键挑战时表现有效,证明了其潜力。 Conclusion: 呼吁AIED领域更多关注和开发基于SLMs的方法,以实现更公平和可负担的AI教育工具。 Abstract: GPT has become nearly synonymous with large language models (LLMs), an increasingly popular term in AIED proceedings. A simple keyword-based search reveals that 61% of the 76 long and short papers presented at AIED 2024 describe novel solutions using LLMs to address some of the long-standing challenges in education, and 43% specifically mention GPT. Although LLMs pioneered by GPT create exciting opportunities to strengthen the impact of AI on education, we argue that the field's predominant focus on GPT and other resource-intensive LLMs (with more than 10B parameters) risks neglecting the potential impact that small language models (SLMs) can make in providing resource-constrained institutions with equitable and affordable access to high-quality AI tools. Supported by positive results on knowledge component (KC) discovery, a critical challenge in AIED, we demonstrate that SLMs such as Phi-2 can produce an effective solution without elaborate prompting strategies. Hence, we call for more attention to developing SLM-based AIED approaches.

[132] Enhancing Thyroid Cytology Diagnosis with RAG-Optimized LLMs and Pa-thology Foundation Models

Hussien Al-Asi,Jordan P Reynolds,Shweta Agarwal,Bryan J Dangott,Aziza Nassar,Zeynettin Akkus

Main category: cs.CL

TL;DR: 该研究探索了结合检索增强生成(RAG)和病理学基础模型的LLMs在甲状腺细胞学诊断中的应用,显著提高了诊断效率和准确性。

Details Motivation: 解决甲状腺细胞学诊断中的挑战,如解释标准化和诊断准确性,通过AI技术提升病理学实践。 Method: 利用RAG动态检索相关知识库,结合病理学基础模型优化特征提取和分类能力。 Result: RAG增强的LLMs显著提升诊断一致性,基础模型UNI的AUC达到0.73-0.93。 Conclusion: 该AI驱动方法为甲状腺细胞病理学提供了高效、可解释的辅助工具,展现了临床应用潜力。 Abstract: Advancements in artificial intelligence (AI) are transforming pathology by integrat-ing large language models (LLMs) with retrieval-augmented generation (RAG) and domain-specific foundation models. This study explores the application of RAG-enhanced LLMs coupled with pathology foundation models for thyroid cytology diagnosis, addressing challenges in cytological interpretation, standardization, and diagnostic accuracy. By leveraging a curated knowledge base, RAG facilitates dy-namic retrieval of relevant case studies, diagnostic criteria, and expert interpreta-tion, improving the contextual understanding of LLMs. Meanwhile, pathology foun-dation models, trained on high-resolution pathology images, refine feature extrac-tion and classification capabilities. The fusion of these AI-driven approaches en-hances diagnostic consistency, reduces variability, and supports pathologists in dis-tinguishing benign from malignant thyroid lesions. Our results demonstrate that integrating RAG with pathology-specific LLMs significantly improves diagnostic efficiency and interpretability, paving the way for AI-assisted thyroid cytopathology, with foundation model UNI achieving AUC 0.73-0.93 for correct prediction of surgi-cal pathology diagnosis from thyroid cytology samples.

[133] Automatic Task Detection and Heterogeneous LLM Speculative Decoding

Danying Ge,Jianhua Gao,Qizhi Jiang,Yifei Feng,Weixing Ji

Main category: cs.CL

TL;DR: 提出了一种针对下游任务优化的推测解码算法,通过任务分区和异构草稿模型分配,提升解码速度和接受率。

Details Motivation: 现有推测解码方法在下游任务中面临解码速度和接受率的权衡问题,草稿模型能力有限导致效率难以保证。 Method: 提出自动任务分区和分配方法,将下游任务分类并分配给异构草稿模型,结合在线轻量级提示分类器动态路由提示。 Result: 实验显示,该方法比传统推测解码提升草稿准确率6%至50%,推理速度提升1.10x至2.64x。 Conclusion: 该方法有效解决了下游任务中的效率问题,显著提升了推测解码的性能。 Abstract: Speculative decoding, which combines a draft model with a target model, has emerged as an effective approach to accelerate large language model (LLM) inference. However, existing methods often face a trade-off between the acceptance rate and decoding speed in downstream tasks due to the limited capacity of the draft model, making it difficult to ensure efficiency across diverse tasks. To address this problem, we propose a speculative decoding algorithm tailored for downstream task optimization. It includes an automatic task partitioning and assigning method, which automatically categorizes downstream tasks into different sub-tasks and assigns them to a set of heterogeneous draft models. Each draft model is aligned with the target model using task-specific data, thereby enhancing the consistency of inference results. In addition, our proposed method incorporates an online lightweight prompt classifier to dynamically route prompts to the appropriate draft model. Experimental results demonstrate that the proposed method improves draft accuracy by 6% to 50% over vanilla speculative decoding, while achieving a speedup of 1.10x to 2.64x in LLM inference.

[134] Scaling Context, Not Parameters: Training a Compact 7B Language Model for Efficient Long-Context Processing

Chen Wu,Yin Song

Main category: cs.CL

TL;DR: MegaBeam-Mistral-7B是一个支持512K标记上下文长度的语言模型,解决了长上下文训练的实际限制,并在多个长上下文基准测试中表现优异。

Details Motivation: 解决长上下文训练的实际限制,支持合规监控和验证等现实任务。 Method: 开发了一个7B参数的语言模型,支持512K标记的上下文长度,并在HELMET、RULER和BABILong等基准上进行评估。 Result: 模型在HELMET上表现出优异的情境学习能力,在RULER上展示了强大的检索和追踪能力,并在BABILong上实现了竞争性的长程推理能力。 Conclusion: MegaBeam-Mistral-7B是目前唯一无需RAG或针对性微调即可在512K上下文长度下实现竞争性性能的开源模型,已发布为Apache 2.0许可证下的开源项目。 Abstract: We present MegaBeam-Mistral-7B, a language model that supports 512K-token context length. Our work addresses practical limitations in long-context training, supporting real-world tasks such as compliance monitoring and verification. Evaluated on three long-context benchmarks, our 7B-parameter model demonstrates superior in-context learning performance on HELMET and robust retrieval and tracing capability on RULER. It is currently the only open model to achieve competitive long-range reasoning on BABILong at 512K context length without RAG or targeted fine-tuning. Released as fully open source under the Apache 2.0 license, the model has been downloaded over 100,000 times on Hugging Face. Model available at: https://huggingface.co/aws-prototyping/MegaBeam-Mistral-7B-512k

[135] Revealing economic facts: LLMs know more than they say

Marcus Buckmann,Quynh Anh Nguyen,Edward Hill

Main category: cs.CL

TL;DR: 研究发现,大型语言模型的隐藏状态可用于估计和填补经济与金融统计数据,效果优于模型直接输出的文本。

Details Motivation: 探索大型语言模型(LLMs)的隐藏状态是否能够捕捉并补充经济与金融统计数据,以提升数据估计的准确性。 Method: 使用简单的线性模型训练开源LLMs的隐藏状态,并通过学习曲线分析和迁移学习方法优化估计效果。 Result: 隐藏状态比模型直接输出的文本包含更丰富的经济信息,且仅需少量标注数据即可训练。迁移学习方法进一步提高了估计准确性。 Conclusion: LLMs的隐藏状态在经济与金融数据的高分辨率重建和填补任务中具有实际应用价值。 Abstract: We investigate whether the hidden states of large language models (LLMs) can be used to estimate and impute economic and financial statistics. Focusing on county-level (e.g. unemployment) and firm-level (e.g. total assets) variables, we show that a simple linear model trained on the hidden states of open-source LLMs outperforms the models' text outputs. This suggests that hidden states capture richer economic information than the responses of the LLMs reveal directly. A learning curve analysis indicates that only a few dozen labelled examples are sufficient for training. We also propose a transfer learning method that improves estimation accuracy without requiring any labelled data for the target variable. Finally, we demonstrate the practical utility of hidden-state representations in super-resolution and data imputation tasks.

[136] Adaptive Schema-aware Event Extraction with Retrieval-Augmented Generation

Sheng Liang,Hang Lv,Zhihao Wen,Yaxiong Wu,Yongyue Zhang,Hao Wang,Yong Liu

Main category: cs.CL

TL;DR: 论文提出了一种自适应模式感知事件抽取(ASEE)方法,结合模式改写和检索增强生成,解决了现有事件抽取中的模式固定和评估基准缺失问题,并在多领域数据集上验证了其有效性。

Details Motivation: 现有事件抽取方法存在模式固定和缺乏联合模式匹配与抽取评估基准的问题,且大语言模型在实际部署中存在模式幻觉和上下文窗口限制。 Method: 提出ASEE方法,结合模式改写和检索增强生成技术,动态检索改写后的模式并生成目标结构。 Result: 在多领域数据集MD-SEE上验证,ASEE表现出强适应性,显著提升了事件抽取的准确性。 Conclusion: ASEE通过自适应模式处理解决了事件抽取中的关键问题,为实际应用提供了有效解决方案。 Abstract: Event extraction (EE) is a fundamental task in natural language processing (NLP) that involves identifying and extracting event information from unstructured text. Effective EE in real-world scenarios requires two key steps: selecting appropriate schemas from hundreds of candidates and executing the extraction process. Existing research exhibits two critical gaps: (1) the rigid schema fixation in existing pipeline systems, and (2) the absence of benchmarks for evaluating joint schema matching and extraction. Although large language models (LLMs) offer potential solutions, their schema hallucination tendencies and context window limitations pose challenges for practical deployment. In response, we propose Adaptive Schema-aware Event Extraction (ASEE), a novel paradigm combining schema paraphrasing with schema retrieval-augmented generation. ASEE adeptly retrieves paraphrased schemas and accurately generates targeted structures. To facilitate rigorous evaluation, we construct the Multi-Dimensional Schema-aware Event Extraction (MD-SEE) benchmark, which systematically consolidates 12 datasets across diverse domains, complexity levels, and language settings. Extensive evaluations on MD-SEE show that our proposed ASEE demonstrates strong adaptability across various scenarios, significantly improving the accuracy of event extraction.

[137] NurValues: Real-World Nursing Values Evaluation for Large Language Models in Clinical Context

Ben Yao,Qiuchi Li,Yazhou Zhang,Siyu Yang,Bohan Zhang,Prayag Tiwari,Jing Qin

Main category: cs.CL

TL;DR: 论文提出了首个护理价值对齐基准,包含五个核心价值维度,并收集了1100个真实护理行为实例,通过LLM生成对立案例,构建了Easy-Level和Hard-Level数据集。评估了23个SoTA LLM,发现DeepSeek-V3和Claude 3.5 Sonnet表现最佳,Justice是最难评估的维度,上下文学习显著提升对齐效果。

Details Motivation: 为临床环境中开发对价值敏感的LLM提供基础,通过量化护理价值对齐填补研究空白。 Method: 从国际护理规范中提炼五个核心价值维度,收集真实护理行为实例并生成对立案例,构建两个难度级别的数据集,评估23个LLM的表现。 Result: DeepSeek-V3在Easy-Level表现最佳(94.55),Claude 3.5 Sonnet在Hard-Level领先(89.43),Justice是最难评估的维度,上下文学习显著提升对齐效果。 Conclusion: 该研究为临床环境中开发价值敏感的LLM提供了重要基准和数据支持,未来可扩展至其他医疗领域。 Abstract: This work introduces the first benchmark for nursing value alignment, consisting of five core value dimensions distilled from international nursing codes: Altruism, Human Dignity, Integrity, Justice, and Professionalism. The benchmark comprises 1,100 real-world nursing behavior instances collected through a five-month longitudinal field study across three hospitals of varying tiers. These instances are annotated by five clinical nurses and then augmented with LLM-generated counterfactuals with reversed ethic polarity. Each original case is paired with a value-aligned and a value-violating version, resulting in 2,200 labeled instances that constitute the Easy-Level dataset. To increase adversarial complexity, each instance is further transformed into a dialogue-based format that embeds contextual cues and subtle misleading signals, yielding a Hard-Level dataset. We evaluate 23 state-of-the-art (SoTA) LLMs on their alignment with nursing values. Our findings reveal three key insights: (1) DeepSeek-V3 achieves the highest performance on the Easy-Level dataset (94.55), where Claude 3.5 Sonnet outperforms other models on the Hard-Level dataset (89.43), significantly surpassing the medical LLMs; (2) Justice is consistently the most difficult nursing value dimension to evaluate; and (3) in-context learning significantly improves alignment. This work aims to provide a foundation for value-sensitive LLMs development in clinical settings. The dataset and the code are available at https://huggingface.co/datasets/Ben012345/NurValues.

[138] Probability Consistency in Large Language Models: Theoretical Foundations Meet Empirical Discrepancies

Xiaoliang Luo,Xinyi Xu,Michael Ramscar,Bradley C. Love

Main category: cs.CL

TL;DR: 论文证明自回归大语言模型(LLM)在不同分词顺序下学习一致概率分布的可能性,并揭示了现有研究中方法论的缺陷。

Details Motivation: 研究LLM在不同分词顺序下是否能学习一致概率分布,为LLM学习机制提供理论基础。 Method: 通过理论证明和实验验证,重新训练GPT-2模型,分析不同分词顺序(正向、反向、随机排列)下的表现。 Result: 发现模型在不同分词顺序下存在系统性偏差,尤其是随机排列与正向、反向模型差异显著。 Conclusion: 研究为理解LLM的位置偏差提供了新视角,并提出了检测概率分布不一致性的方法。 Abstract: Can autoregressive large language models (LLMs) learn consistent probability distributions when trained on sequences in different token orders? We prove formally that for any well-defined probability distribution, sequence perplexity is invariant under any factorization, including forward, backward, or arbitrary permutations. This result establishes a rigorous theoretical foundation for studying how LLMs learn from data and defines principled protocols for empirical evaluation. Applying these protocols, we show that prior studies examining ordering effects suffer from critical methodological flaws. We retrain GPT-2 models across forward, backward, and arbitrary permuted orders on scientific text. We find systematic deviations from theoretical invariance across all orderings with arbitrary permutations strongly deviating from both forward and backward models, which largely (but not completely) agreed with one another. Deviations were traceable to differences in self-attention, reflecting positional and locality biases in processing. Our theoretical and empirical results provide novel avenues for understanding positional biases in LLMs and suggest methods for detecting when LLMs' probability distributions are inconsistent and therefore untrustworthy.

[139] AC-Reason: Towards Theory-Guided Actual Causality Reasoning with Large Language Models

Yanxi Zhang,Xin Cong,Zhong Zhang,Xiao Liu,Dongyan Zhao,Yesai Wu

Main category: cs.CL

TL;DR: 论文提出AC-Reason框架,结合形式因果理论提升LLM在因果推理中的表现,并引入AC-Bench基准验证其有效性。

Details Motivation: 现有LLM方法缺乏形式因果理论支持,导致解释性不足,需改进。 Method: 提出AC-Reason框架,通过理论指导的算法识别因果事件并回答查询,无需显式构建因果图。 Result: AC-Reason显著提升LLM性能,在BBH-CJ和AC-Bench上均优于基线,GPT-4结合AC-Reason表现最佳。 Conclusion: 将形式因果理论融入LLM高度有效,AC-Reason算法贡献最大性能提升。 Abstract: Actual causality (AC), a fundamental aspect of causal reasoning (CR), is responsible for attribution and responsibility assignment in real-world scenarios. However, existing LLM-based methods lack grounding in formal AC theory, resulting in limited interpretability. Therefore, we propose AC-Reason, a semi-formal reasoning framework that identifies causally relevant events within an AC scenario, infers the values of their formal causal factors (e.g., sufficiency, necessity, and normality), and answers AC queries via a theory-guided algorithm with explanations. While AC-Reason does not explicitly construct a causal graph, it operates over variables in the underlying causal structure to support principled reasoning. To enable comprehensive evaluation, we introduce AC-Bench, a new benchmark built upon and substantially extending Big-Bench Hard Causal Judgment (BBH-CJ). AC-Bench comprises ~1K carefully annotated samples, each with detailed reasoning steps and focuses solely on actual causation. The case study shows that synthesized samples in AC-Bench present greater challenges for LLMs. Extensive experiments on BBH-CJ and AC-Bench show that AC-Reason consistently improves LLM performance over baselines. On BBH-CJ, all tested LLMs surpass the average human rater accuracy of 69.60%, with GPT-4 + AC-Reason achieving 75.04%. On AC-Bench, GPT-4 + AC-Reason again achieves the highest accuracy of 71.82%. AC-Bench further enables fine-grained analysis of reasoning faithfulness, revealing that only Qwen-2.5-72B-Instruct, Claude-3.5-Sonnet, and GPT-4o exhibit faithful reasoning, whereas GPT-4 tends to exploit shortcuts. Finally, our ablation study proves that integrating AC theory into LLMs is highly effective, with the proposed algorithm contributing the most significant performance gains.

[140] Aya Vision: Advancing the Frontier of Multilingual Multimodality

Saurabh Dash,Yiyang Nan,John Dang,Arash Ahmadian,Shivalika Singh,Madeline Smith,Bharat Venkitesh,Vlad Shmyhlo,Viraat Aryabumi,Walter Beller-Morales,Jeremy Pekmez,Jason Ozuzu,Pierre Richemond,Acyr Locatelli,Nick Frosst,Phil Blunsom,Aidan Gomez,Ivan Zhang,Marzieh Fadaee,Manoj Govindassamy,Sudip Roy,Matthias Gallé,Beyza Ermis,Ahmet Üstün,Sara Hooker

Main category: cs.CL

TL;DR: 论文提出了一种多模态语言模型构建方法,通过合成注释框架和跨模态模型合并技术,解决了多语言多模态数据稀缺和灾难性遗忘问题,并在性能上超越了多个大型模型。

Details Motivation: 构建多模态语言模型面临多模态对齐、高质量指令数据稀缺以及引入视觉后文本能力退化等挑战,多语言环境下问题更为复杂。 Method: 开发了合成注释框架以生成高质量多语言多模态指令数据,并提出跨模态模型合并技术以减少灾难性遗忘。 Result: Aya-Vision-8B和Aya-Vision-32B在性能上超越了多个大型模型,如Qwen-2.5-VL-7B和LLaMA-3.2-90B-Vision。 Conclusion: 该研究在多模态多语言领域取得了进展,提供了高效计算和高性能的技术方案。 Abstract: Building multimodal language models is fundamentally challenging: it requires aligning vision and language modalities, curating high-quality instruction data, and avoiding the degradation of existing text-only capabilities once vision is introduced. These difficulties are further magnified in the multilingual setting, where the need for multimodal data in different languages exacerbates existing data scarcity, machine translation often distorts meaning, and catastrophic forgetting is more pronounced. To address the aforementioned challenges, we introduce novel techniques spanning both data and modeling. First, we develop a synthetic annotation framework that curates high-quality, diverse multilingual multimodal instruction data, enabling Aya Vision models to produce natural, human-preferred responses to multimodal inputs across many languages. Complementing this, we propose a cross-modal model merging technique that mitigates catastrophic forgetting, effectively preserving text-only capabilities while simultaneously enhancing multimodal generative performance. Aya-Vision-8B achieves best-in-class performance compared to strong multimodal models such as Qwen-2.5-VL-7B, Pixtral-12B, and even much larger Llama-3.2-90B-Vision. We further scale this approach with Aya-Vision-32B, which outperforms models more than twice its size, such as Molmo-72B and LLaMA-3.2-90B-Vision. Our work advances multilingual progress on the multi-modal frontier, and provides insights into techniques that effectively bend the need for compute while delivering extremely high performance.

[141] HealthBench: Evaluating Large Language Models Towards Improved Human Health

Rahul K. Arora,Jason Wei,Rebecca Soskin Hicks,Preston Bowman,Joaquin Quiñonero-Candela,Foivos Tsimpourlas,Michael Sharman,Meghan Shah,Andrea Vallone,Alex Beutel,Johannes Heidecke,Karan Singhal

Main category: cs.CL

TL;DR: HealthBench是一个开源基准测试,用于评估大型语言模型在医疗领域的性能和安全性,包含5000次多轮对话和48562条独特评分标准,反映模型在医疗场景中的进步。

Details Motivation: 开发一个更真实、开放的评估工具,以衡量语言模型在医疗健康领域的表现,推动模型发展和应用。 Method: 通过多轮对话和医生创建的评分标准,评估模型在不同医疗场景和行为维度(如准确性、沟通能力)的表现。 Result: 模型性能稳步提升(如GPT-4o得分32%,最新模型达60%),小模型表现尤为突出(如GPT-4.1 nano优于GPT-4o且成本更低)。 Conclusion: HealthBench为医疗领域模型的发展和应用提供了重要基准,有望推动技术进步以造福人类健康。 Abstract: We present HealthBench, an open-source benchmark measuring the performance and safety of large language models in healthcare. HealthBench consists of 5,000 multi-turn conversations between a model and an individual user or healthcare professional. Responses are evaluated using conversation-specific rubrics created by 262 physicians. Unlike previous multiple-choice or short-answer benchmarks, HealthBench enables realistic, open-ended evaluation through 48,562 unique rubric criteria spanning several health contexts (e.g., emergencies, transforming clinical data, global health) and behavioral dimensions (e.g., accuracy, instruction following, communication). HealthBench performance over the last two years reflects steady initial progress (compare GPT-3.5 Turbo's 16% to GPT-4o's 32%) and more rapid recent improvements (o3 scores 60%). Smaller models have especially improved: GPT-4.1 nano outperforms GPT-4o and is 25 times cheaper. We additionally release two HealthBench variations: HealthBench Consensus, which includes 34 particularly important dimensions of model behavior validated via physician consensus, and HealthBench Hard, where the current top score is 32%. We hope that HealthBench grounds progress towards model development and applications that benefit human health.

cs.LG [Back]

[142] Large Language Models for Computer-Aided Design: A Survey

Licheng Zhang,Bach Le,Naveed Akhtar,Siew-Kei Lam,Tuan Ngo

Main category: cs.LG

TL;DR: 本文首次系统综述了大型语言模型(LLMs)与计算机辅助设计(CAD)的结合,探讨了LLMs在CAD中的六大应用领域,并提出了未来发展方向。

Details Motivation: CAD作为3D建模的行业标准,在现代设计中日益复杂,而LLMs的潜力尚未被充分探索,因此需要系统研究其结合的可能性。 Method: 文章首先概述CAD的工业意义和LLMs的基础,随后分类探讨LLMs在CAD中的六大应用领域,并提出未来研究方向。 Result: 提出了LLMs在CAD中的六大关键应用领域,并总结了当前研究的进展与局限。 Conclusion: LLMs与CAD的结合具有巨大潜力,未来研究方向包括技术创新和应用扩展,有望推动CAD技术的未来发展。 Abstract: Large Language Models (LLMs) have seen rapid advancements in recent years, with models like ChatGPT and DeepSeek, showcasing their remarkable capabilities across diverse domains. While substantial research has been conducted on LLMs in various fields, a comprehensive review focusing on their integration with Computer-Aided Design (CAD) remains notably absent. CAD is the industry standard for 3D modeling and plays a vital role in the design and development of products across different industries. As the complexity of modern designs increases, the potential for LLMs to enhance and streamline CAD workflows presents an exciting frontier. This article presents the first systematic survey exploring the intersection of LLMs and CAD. We begin by outlining the industrial significance of CAD, highlighting the need for AI-driven innovation. Next, we provide a detailed overview of the foundation of LLMs. We also examine both closed-source LLMs as well as publicly available models. The core of this review focuses on the various applications of LLMs in CAD, providing a taxonomy of six key areas where these models are making considerable impact. Finally, we propose several promising future directions for further advancements, which offer vast opportunities for innovation and are poised to shape the future of CAD technology. Github: https://github.com/lichengzhanguom/LLMs-CAD-Survey-Taxonomy

[143] A Reproduction Study: The Kernel PCA Interpretation of Self-Attention Fails Under Scrutiny

Karahan Sarıtaş,Çağatay Yıldız

Main category: cs.LG

TL;DR: 本文重新审视了自注意力机制实现核主成分分析(KPCA)的近期观点,发现其缺乏实证支持。

Details Motivation: 验证自注意力机制是否如Teo等人(2024)所述实现了KPCA。 Method: 通过分析自注意力值向量与KPCA特征向量的对齐性、重构损失的解读以及Gram矩阵特征值统计的可复现性。 Result: 发现自注意力值与KPCA特征向量无显著对应关系,重构损失解读错误,且特征值统计不可复现。 Conclusion: 自注意力的KPCA解释缺乏实证依据。 Abstract: In this reproduction study, we revisit recent claims that self-attention implements kernel principal component analysis (KPCA) (Teo et al., 2024), positing that (i) value vectors $V$ capture the eigenvectors of the Gram matrix of the keys, and (ii) that self-attention projects queries onto the principal component axes of the key matrix $K$ in a feature space. Our analysis reveals three critical inconsistencies: (1) No alignment exists between learned self-attention value vectors and what is proposed in the KPCA perspective, with average similarity metrics (optimal cosine similarity $\leq 0.32$, linear CKA (Centered Kernel Alignment) $\leq 0.11$, kernel CKA $\leq 0.32$) indicating negligible correspondence; (2) Reported decreases in reconstruction loss $J_\text{proj}$, arguably justifying the claim that the self-attention minimizes the projection error of KPCA, are misinterpreted, as the quantities involved differ by orders of magnitude ($\sim\!10^3$); (3) Gram matrix eigenvalue statistics, introduced to justify that $V$ captures the eigenvector of the gram matrix, are irreproducible without undocumented implementation-specific adjustments. Across 10 transformer architectures, we conclude that the KPCA interpretation of self-attention lacks empirical support.

[144] Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders

Dong Shu,Xuansheng Wu,Haiyan Zhao,Mengnan Du,Ninghao Liu

Main category: cs.LG

TL;DR: GradSAE通过结合输出梯度信息,识别稀疏自编码器中对模型输出最具因果影响的潜在特征。

Details Motivation: 传统稀疏自编码器分析方法仅依赖输入激活,忽略了潜在特征对模型输出的因果影响。 Method: 提出GradSAE方法,利用输出梯度信息识别高因果影响的潜在特征。 Result: 验证了高因果影响的潜在特征对模型操控更有效。 Conclusion: GradSAE为稀疏自编码器分析提供了更有效的工具,提升了模型解释和操控能力。 Abstract: Sparse Autoencoders (SAEs) have recently emerged as powerful tools for interpreting and steering the internal representations of large language models (LLMs). However, conventional approaches to analyzing SAEs typically rely solely on input-side activations, without considering the causal influence between each latent feature and the model's output. This work is built on two key hypotheses: (1) activated latents do not contribute equally to the construction of the model's output, and (2) only latents with high causal influence are effective for model steering. To validate these hypotheses, we propose Gradient Sparse Autoencoder (GradSAE), a simple yet effective method that identifies the most influential latents by incorporating output-side gradient information.

[145] Optimizing Retrieval-Augmented Generation: Analysis of Hyperparameter Impact on Performance and Efficiency

Adel Ammar,Anis Koubaa,Omer Nacar,Wadii Boulila

Main category: cs.LG

TL;DR: 本文分析了检索增强生成(RAG)系统中超参数对速度和性能的影响,揭示了速度与精度之间的权衡,并展示了优化配置如何显著提升检索质量。

Details Motivation: 大型语言模型虽任务性能高,但易产生幻觉或依赖过时知识。RAG通过结合外部搜索弥补这些不足,但需平衡计算成本与准确性。 Method: 研究涵盖Chroma和Faiss向量存储、分块策略、交叉编码器重排序及温度参数,评估六项指标:忠实性、答案正确性、答案相关性、上下文精度、上下文召回和答案相似性。 Result: Chroma查询速度快13%,Faiss检索精度更高;固定长度分块优于语义分割;重排序提升质量但增加运行时。优化配置下,上下文精度达99%。 Conclusion: RAG系统可通过超参数优化实现高检索精度,对下游任务(如医疗临床决策)有重要意义。 Abstract: Large language models achieve high task performance yet often hallucinate or rely on outdated knowledge. Retrieval-augmented generation (RAG) addresses these gaps by coupling generation with external search. We analyse how hyperparameters influence speed and quality in RAG systems, covering Chroma and Faiss vector stores, chunking policies, cross-encoder re-ranking, and temperature, and we evaluate six metrics: faithfulness, answer correctness, answer relevancy, context precision, context recall, and answer similarity. Chroma processes queries 13% faster, whereas Faiss yields higher retrieval precision, revealing a clear speed-accuracy trade-off. Naive fixed-length chunking with small windows and minimal overlap outperforms semantic segmentation while remaining the quickest option. Re-ranking provides modest gains in retrieval quality yet increases runtime by roughly a factor of 5, so its usefulness depends on latency constraints. These results help practitioners balance computational cost and accuracy when tuning RAG systems for transparent, up-to-date responses. Finally, we re-evaluate the top configurations with a corrective RAG workflow and show that their advantages persist when the model can iteratively request additional evidence. We obtain a near-perfect context precision (99%), which demonstrates that RAG systems can achieve extremely high retrieval accuracy with the right combination of hyperparameters, with significant implications for applications where retrieval quality directly impacts downstream task performance, such as clinical decision support in healthcare.

[146] Fréchet Power-Scenario Distance: A Metric for Evaluating Generative AI Models across Multiple Time-Scales in Smart Grids

Yuting Cai,Shaohuai Liu,Chao Tian,Le Xie

Main category: cs.LG

TL;DR: 提出了一种基于Fréchet距离的新指标,用于评估智能电网中生成AI模型产生的合成数据质量,优于传统欧氏距离方法。

Details Motivation: 生成AI模型在智能电网中产生大量合成数据,但传统欧氏距离指标无法有效评估其质量差异。 Method: 提出了一种基于Fréchet距离的指标,通过在学习的特征空间中估计数据集间的距离,从分布角度评估生成质量。 Result: 实证结果表明,该指标在不同时间尺度和模型中表现优越,提升了智能电网数据驱动决策的可靠性。 Conclusion: 新方法为评估生成数据质量提供了更有效的工具,推动了智能电网中生成AI的应用。 Abstract: Generative artificial intelligence (AI) models in smart grids have advanced significantly in recent years due to their ability to generate large amounts of synthetic data, which would otherwise be difficult to obtain in the real world due to confidentiality constraints. A key challenge in utilizing such synthetic data is how to assess the data quality produced from such generative models. Traditional Euclidean distance-based metrics only reflect pair-wise relations between two individual samples, and could fail in evaluating quality differences between groups of synthetic datasets. In this work, we propose a novel metric based on the Fr\'{e}chet Distance (FD) estimated between two datasets in a learned feature space. The proposed method evaluates the quality of generation from a distributional perspective. Empirical results demonstrate the superiority of the proposed metric across timescales and models, enhancing the reliability of data-driven decision-making in smart grid operations.

[147] Memorization-Compression Cycles Improve Generalization

Fangyuan Yu

Main category: cs.LG

TL;DR: 论文提出通过压缩内部表征提升泛化能力,引入信息瓶颈语言建模(IBLM)目标,并提出GAPT训练算法,显著提升模型性能。

Details Motivation: 探索数据扩展和表征压缩对泛化的影响,并模拟生物学习与睡眠交替的机制。 Method: 提出IBLM目标,将语言建模转化为约束优化问题,并设计GAPT算法动态切换记忆与压缩阶段。 Result: GAPT在GPT-2预训练中降低MBE 50%,提升交叉熵4.8%,OOD泛化提升35%,并显著减少灾难性遗忘。 Conclusion: 表征压缩与动态阶段切换(如GAPT)可显著提升模型泛化能力,模拟生物学习机制。 Abstract: We prove theoretically that generalization improves not only through data scaling but also by compressing internal representations. To operationalize this insight, we introduce the Information Bottleneck Language Modeling (IBLM) objective, which reframes language modeling as a constrained optimization problem: minimizing representation entropy subject to optimal prediction performance. Empirically, we observe an emergent memorization-compression cycle during LLM pretraining, evidenced by oscillation positive/negative gradient alignment between cross-entropy and Matrix-Based Entropy (MBE), a measure of representation entropy. This pattern closely mirrors the predictive-compressive trade-off prescribed by IBLM and also parallels the biological alternation between awake learning and sleep consolidation. Motivated by this observation, we propose Gated Phase Transition (GAPT), a training algorithm that adaptively switches between memorization and compression phases. When applied to GPT-2 pretraining on FineWeb dataset, GAPT reduces MBE by 50% and improves cross-entropy by 4.8%. GAPT improves OOD generalizatino by 35% in a pretraining task on arithmetic multiplication. In a setting designed to simulate catastrophic forgetting, GAPT reduces interference by compressing and separating representations, achieving a 97% improvement in separation - paralleling the functional role of sleep consolidation.

[148] CodePDE: An Inference Framework for LLM-driven PDE Solver Generation

Shanda Li,Tanya Marwah,Junhong Shen,Weiwei Sun,Andrej Risteski,Yiming Yang,Ameet Talwalkar

Main category: cs.LG

TL;DR: CodePDE是一个基于大型语言模型(LLM)的PDE求解框架,通过代码生成任务解决PDE问题,无需特定任务调优即可实现高性能。

Details Motivation: 传统数值求解器依赖专家知识且计算成本高,而基于神经网络的求解器需要大量训练数据且缺乏可解释性。 Method: 将PDE求解视为代码生成任务,利用LLM的推理能力、调试、自我优化和测试时扩展策略。 Result: CodePDE在多个代表性PDE问题上实现超人类性能,并提供了对LLM生成求解器的系统性分析。 Conclusion: LLM在PDE求解中展现出潜力,但也存在局限性,为未来模型设计提供了新视角。 Abstract: Partial differential equations (PDEs) are fundamental to modeling physical systems, yet solving them remains a complex challenge. Traditional numerical solvers rely on expert knowledge to implement and are computationally expensive, while neural-network-based solvers require large training datasets and often lack interpretability. In this work, we frame PDE solving as a code generation task and introduce CodePDE, the first inference framework for generating PDE solvers using large language models (LLMs). Leveraging advanced inference-time algorithms and scaling strategies, CodePDE unlocks critical capacities of LLM for PDE solving: reasoning, debugging, selfrefinement, and test-time scaling -- all without task-specific tuning. CodePDE achieves superhuman performance across a range of representative PDE problems. We also present a systematic empirical analysis of LLM generated solvers, analyzing their accuracy, efficiency, and numerical scheme choices. Our findings highlight the promise and the current limitations of LLMs in PDE solving, offering a new perspective on solver design and opportunities for future model development. Our code is available at https://github.com/LithiumDA/CodePDE.

[149] Decoupled Multimodal Prototypes for Visual Recognition with Missing Modalities

Jueqing Lu,Yuanyuan Qi,Xiaohao Yang,Shujie Zhou,Lan Du

Main category: cs.LG

TL;DR: 提出了一种基于解耦原型的多模态输出头,动态适应缺失模态场景,显著提升性能。

Details Motivation: 现实应用中多模态数据常缺失,现有方法假设所有模态可用,导致性能下降。 Method: 引入缺失感知的类原型输出头,动态适应不同缺失场景,兼容现有提示方法。 Result: 实验表明,该方法在多种缺失场景和缺失率下显著提升性能。 Conclusion: 解耦原型输出头有效解决了多模态缺失问题,兼容性强且性能优越。 Abstract: Multimodal learning enhances deep learning models by enabling them to perceive and understand information from multiple data modalities, such as visual and textual inputs. However, most existing approaches assume the availability of all modalities, an assumption that often fails in real-world applications. Recent works have introduced learnable missing-case-aware prompts to mitigate performance degradation caused by missing modalities while reducing the need for extensive model fine-tuning. Building upon the effectiveness of missing-case-aware handling for missing modalities, we propose a novel decoupled prototype-based output head, which leverages missing-case-aware class-wise prototypes tailored for each individual modality. This approach dynamically adapts to different missing modality scenarios and can be seamlessly integrated with existing prompt-based methods. Extensive experiments demonstrate that our proposed output head significantly improves performance across a wide range of missing-modality scenarios and varying missing rates.

[150] Efficient Unstructured Pruning of Mamba State-Space Models for Resource-Constrained Environments

Ibne Farabi Shihab,Sanjeda Akter,Anuj Sharma

Main category: cs.LG

TL;DR: 提出了一种针对Mamba模型的无结构化剪枝框架,实现70%参数减少且保留95%性能。

Details Motivation: Mamba模型参数量大,难以在资源受限环境中部署。 Method: 结合梯度感知的幅度剪枝、迭代剪枝计划和全局剪枝策略。 Result: 在多个基准测试中实现高效能且性能损失极小。 Conclusion: 该框架揭示了Mamba架构的冗余性和鲁棒性,拓宽了其应用范围。 Abstract: State-space models (SSMs), particularly the Mamba architecture, have emerged as powerful alternatives to Transformers for sequence modeling, offering linear-time complexity and competitive performance across diverse tasks. However, their large parameter counts pose significant challenges for deployment in resource-constrained environments. We propose a novel unstructured pruning framework tailored for Mamba models that achieves up to 70\% parameter reduction while retaining over 95\% of the original performance. Our approach integrates three key innovations: (1) a gradient-aware magnitude pruning technique that combines weight magnitude and gradient information to identify less critical parameters, (2) an iterative pruning schedule that gradually increases sparsity to maintain model stability, and (3) a global pruning strategy that optimizes parameter allocation across the entire model. Through extensive experiments on WikiText-103, Long Range Arena, and ETT time-series benchmarks, we demonstrate significant efficiency gains with minimal performance degradation. Our analysis of pruning effects on Mamba's components reveals critical insights into the architecture's redundancy and robustness, enabling practical deployment in resource-constrained settings while broadening Mamba's applicability.

[151] GradMix: Gradient-based Selective Mixup for Robust Data Augmentation in Class-Incremental Learning

Minsu Kim,Seong-Hyeon Hwang,Steven Euijong Whang

Main category: cs.LG

TL;DR: 论文提出GradMix,一种基于梯度的选择性数据增强方法,用于缓解持续学习中的灾难性遗忘问题。

Details Motivation: 持续学习中,新知识的获取与旧知识的保持是一大挑战。现有方法如经验回放虽有效,但随机混合样本可能损害旧任务知识。 Method: GradMix通过梯度选择性混合样本,仅混合有益类对,避免有害类对,以减少灾难性遗忘。 Result: 实验表明,GradMix在多个真实数据集上优于基线方法,显著减少了旧知识的遗忘。 Conclusion: GradMix为持续学习提供了一种有效的数据增强策略,显著提升了模型性能。 Abstract: In the context of continual learning, acquiring new knowledge while maintaining previous knowledge presents a significant challenge. Existing methods often use experience replay techniques that store a small portion of previous task data for training. In experience replay approaches, data augmentation has emerged as a promising strategy to further improve the model performance by mixing limited previous task data with sufficient current task data. However, we theoretically and empirically analyze that training with mixed samples from random sample pairs may harm the knowledge of previous tasks and cause greater catastrophic forgetting. We then propose GradMix, a robust data augmentation method specifically designed for mitigating catastrophic forgetting in class-incremental learning. GradMix performs gradient-based selective mixup using a class-based criterion that mixes only samples from helpful class pairs and not from detrimental class pairs for reducing catastrophic forgetting. Our experiments on various real datasets show that GradMix outperforms data augmentation baselines in accuracy by minimizing the forgetting of previous knowledge.

cs.SI [Back]

[152] NAZM: Network Analysis of Zonal Metrics in Persian Poetic Tradition

Kourosh Shahnazari,Seyed Moein Ayyoubzadeh

Main category: cs.SI

TL;DR: 该研究通过构建多维相似性网络,模拟古典波斯诗人的影响力动态,结合语义、词汇、风格、主题和韵律特征,识别关键诗人、风格中心和桥梁诗人,并通过社区检测算法揭示文学流派。

Details Motivation: 结合计算语言学和文学研究,提供数据驱动的波斯文学新视角,区分经典意义与文本间影响,突出结构重要性较高的非知名人物。 Method: 使用Ganjoor语料库的严格数据集,构建加权相似性矩阵和聚合图,计算多种中心性指标,并应用Louvain社区检测算法。 Result: 揭示了与已知文学流派(如Sabk-e Hindi、Sabk-e Khorasani等)密切相关的诗人集群,提供了对波斯诗歌传统的新理解。 Conclusion: 该研究为诗歌传统提供了可解释且可扩展的模型,支持数字人文学科的回顾性反思和前瞻性研究。 Abstract: This study formalizes a computational model to simulate classical Persian poets' dynamics of influence through constructing a multi-dimensional similarity network. Using a rigorously curated dataset based on Ganjoor's corpus, we draw upon semantic, lexical, stylistic, thematic, and metrical features to demarcate each poet's corpus. Each is contained within weighted similarity matrices, which are then appended to generate an aggregate graph showing poet-to-poet influence. Further network investigation is carried out to identify key poets, style hubs, and bridging poets by calculating degree, closeness, betweenness, eigenvector, and Katz centrality measures. Further, for typological insight, we use the Louvain community detection algorithm to demarcate clusters of poets sharing both style and theme coherence, which correspond closely to acknowledged schools of literature like Sabk-e Hindi, Sabk-e Khorasani, and the Bazgasht-e Adabi phenomenon. Our findings provide a new data-driven view of Persian literature distinguished between canonical significance and interextual influence, thus highlighting relatively lesser-known figures who hold great structural significance. Combining computational linguistics with literary study, this paper produces an interpretable and scalable model for poetic tradition, enabling retrospective reflection as well as forward-looking research within digital humanities.

cs.SD [Back]

[153] Not that Groove: Zero-Shot Symbolic Music Editing

Li Zhang

Main category: cs.SD

TL;DR: 论文提出了一种基于零样本提示的LLM方法,用于符号音乐编辑,解决了音频生成在音乐制作中的局限性问题。

Details Motivation: 现有AI音乐生成多关注音频,但因其灵活性不足,在音乐制作行业应用有限。本文旨在通过符号音乐编辑提供更高灵活性。 Method: 利用零样本提示的LLM方法编辑鼓点节奏,设计了创新的格式以连接LLM与音乐,并提供标注数据集以支持评估。 Result: 证明LLM能有效编辑鼓点节奏,且评估数据集与音乐家判断高度一致。 Conclusion: 该方法为符号音乐编辑提供了灵活且有效的解决方案,填补了AI音乐生成领域的空白。 Abstract: Most work in AI music generation focused on audio, which has seen limited use in the music production industry due to its rigidity. To maximize flexibility while assuming only textual instructions from producers, we are among the first to tackle symbolic music editing. We circumvent the known challenge of lack of labeled data by proving that LLMs with zero-shot prompting can effectively edit drum grooves. The recipe of success is a creatively designed format that interfaces LLMs and music, while we facilitate evaluation by providing an evaluation dataset with annotated unit tests that highly aligns with musicians' judgment.

cond-mat.mtrl-sci [Back]

[154] Image-Guided Microstructure Optimization using Diffusion Models: Validated with Li-Mn-rich Cathode Precursors

Geunho Choi,Changhwan Lee,Jieun Kim,Insoo Ye,Keeyoung Jung,Inchul Park

Main category: cond-mat.mtrl-sci

TL;DR: 论文提出了一种基于AI的闭环框架,通过图像生成和分析优化锂离子电池正极前驱体的微观结构设计。

Details Motivation: 微观结构对材料性能至关重要,但由于难以量化、预测和优化,很少被作为设计变量。本文旨在解决这一问题。 Method: 结合扩散图像生成模型、定量图像分析流程和粒子群优化算法,提取形态特征并预测合成条件。 Result: 框架能准确预测特定合成条件下的微观结构,并通过实验验证了预测与合成的结构一致性。 Conclusion: 该框架为数据驱动的材料设计提供了实用策略,支持正向预测和逆向设计,推动了自主微观结构工程的发展。 Abstract: Microstructure often dictates materials performance, yet it is rarely treated as an explicit design variable because microstructure is hard to quantify, predict, and optimize. Here, we introduce an image centric, closed-loop framework that makes microstructural morphology into a controllable objective and demonstrate its use case with Li- and Mn-rich layered oxide cathode precursors. This work presents an integrated, AI driven framework for the predictive design and optimization of lithium-ion battery cathode precursor synthesis. This framework integrates a diffusion-based image generation model, a quantitative image analysis pipeline, and a particle swarm optimization (PSO) algorithm. By extracting key morphological descriptors such as texture, sphericity, and median particle size (D50) from SEM images, the platform accurately predicts SEM like morphologies resulting from specific coprecipitation conditions, including reaction time-, solution concentration-, and pH-dependent structural changes. Optimization then pinpoints synthesis parameters that yield user defined target morphologies, as experimentally validated by the close agreement between predicted and synthesized structures. This framework offers a practical strategy for data driven materials design, enabling both forward prediction and inverse design of synthesis conditions and paving the way toward autonomous, image guided microstructure engineering.

cs.RO [Back]

[155] UniSkill: Imitating Human Videos via Cross-Embodiment Skill Representations

Hanjung Kim,Jaehyun Kang,Hyolim Kang,Meedeum Cho,Seon Joo Kim,Youngwoon Lee

Main category: cs.RO

TL;DR: UniSkill框架通过无标签的大规模跨体现视频数据学习技能表示,实现从人类视频提示到机器人策略的有效迁移。

Details Motivation: 模仿是人类学习的基本机制,但将其应用于机器人因视觉和物理能力的差异而面临挑战。现有方法依赖对齐数据,但大规模收集困难。 Method: 提出UniSkill框架,从无标签的跨体现视频数据中学习技能表示,支持从人类视频到机器人策略的迁移。 Result: 实验表明,UniSkill能在仿真和真实环境中成功指导机器人执行动作,即使面对未见过的视频提示。 Conclusion: UniSkill为跨体现技能学习提供了一种有效方法,无需对齐数据即可实现技能迁移。 Abstract: Mimicry is a fundamental learning mechanism in humans, enabling individuals to learn new tasks by observing and imitating experts. However, applying this ability to robots presents significant challenges due to the inherent differences between human and robot embodiments in both their visual appearance and physical capabilities. While previous methods bridge this gap using cross-embodiment datasets with shared scenes and tasks, collecting such aligned data between humans and robots at scale is not trivial. In this paper, we propose UniSkill, a novel framework that learns embodiment-agnostic skill representations from large-scale cross-embodiment video data without any labels, enabling skills extracted from human video prompts to effectively transfer to robot policies trained only on robot data. Our experiments in both simulation and real-world environments show that our cross-embodiment skills successfully guide robots in selecting appropriate actions, even with unseen video prompts. The project website can be found at: https://kimhanjung.github.io/UniSkill.

cs.CY [Back]

[156] Multimodal Assessment of Classroom Discourse Quality: A Text-Centered Attention-Based Multi-Task Learning Approach

Ruikun Hou,Babette Bühler,Tim Fütterer,Efe Bozkir,Peter Gerjets,Ulrich Trautwein,Enkelejda Kasneci

Main category: cs.CY

TL;DR: 论文提出了一种多模态融合架构,用于评估课堂话语质量,结合文本、音频和视频数据,并通过实验验证了其有效性。

Details Motivation: 传统课堂话语评估依赖人工编码,耗时且成本高。现有AI技术多关注单一句子分析,缺乏对整个课程段话语质量的评估。 Method: 采用注意力机制捕捉多模态交互,多任务学习联合预测三个话语组件的质量分数,并将任务建模为有序分类问题。 Result: 在GTI德国数据集上,模型表现与人类评分者一致性相当,整体Quadratic Weighted Kappa得分为0.384。 Conclusion: 研究为自动化话语质量评估奠定了基础,支持教师通过多维反馈提升专业发展。 Abstract: Classroom discourse is an essential vehicle through which teaching and learning take place. Assessing different characteristics of discursive practices and linking them to student learning achievement enhances the understanding of teaching quality. Traditional assessments rely on manual coding of classroom observation protocols, which is time-consuming and costly. Despite many studies utilizing AI techniques to analyze classroom discourse at the utterance level, investigations into the evaluation of discursive practices throughout an entire lesson segment remain limited. To address this gap, our study proposes a novel text-centered multimodal fusion architecture to assess the quality of three discourse components grounded in the Global Teaching InSights (GTI) observation protocol: Nature of Discourse, Questioning, and Explanations. First, we employ attention mechanisms to capture inter- and intra-modal interactions from transcript, audio, and video streams. Second, a multi-task learning approach is adopted to jointly predict the quality scores of the three components. Third, we formulate the task as an ordinal classification problem to account for rating level order. The effectiveness of these designed elements is demonstrated through an ablation study on the GTI Germany dataset containing 92 videotaped math lessons. Our results highlight the dominant role of text modality in approaching this task. Integrating acoustic features enhances the model's consistency with human ratings, achieving an overall Quadratic Weighted Kappa score of 0.384, comparable to human inter-rater reliability (0.326). Our study lays the groundwork for the future development of automated discourse quality assessment to support teacher professional development through timely feedback on multidimensional discourse practices.

cs.IR [Back]

[157] OMGM: Orchestrate Multiple Granularities and Modalities for Efficient Multimodal Retrieval

Wei Yang,Jingjing Fu,Rui Wang,Jinyu Wang,Lei Song,Jiang Bian

Main category: cs.IR

TL;DR: 提出了一种多模态RAG系统,通过粗到细的多步检索提升KB-VQA任务的效果。

Details Motivation: 现有方法未充分利用多模态和知识粒度之间的潜在交互,导致检索效果受限。 Method: 采用粗到细的多步检索策略,包括初始粗粒度对齐、多模态融合重排序和文本重排序。 Result: 在InfoSeek和Encyclopedic-VQA基准测试中取得最优检索性能和竞争性回答结果。 Conclusion: 该方法有效提升了KB-VQA系统的性能,展示了多模态和多粒度交互的重要性。 Abstract: Vision-language retrieval-augmented generation (RAG) has become an effective approach for tackling Knowledge-Based Visual Question Answering (KB-VQA), which requires external knowledge beyond the visual content presented in images. The effectiveness of Vision-language RAG systems hinges on multimodal retrieval, which is inherently challenging due to the diverse modalities and knowledge granularities in both queries and knowledge bases. Existing methods have not fully tapped into the potential interplay between these elements. We propose a multimodal RAG system featuring a coarse-to-fine, multi-step retrieval that harmonizes multiple granularities and modalities to enhance efficacy. Our system begins with a broad initial search aligning knowledge granularity for cross-modal retrieval, followed by a multimodal fusion reranking to capture the nuanced multimodal information for top entity selection. A text reranker then filters out the most relevant fine-grained section for augmented generation. Extensive experiments on the InfoSeek and Encyclopedic-VQA benchmarks show our method achieves state-of-the-art retrieval performance and highly competitive answering results, underscoring its effectiveness in advancing KB-VQA systems.

cs.CE [Back]

[158] Improving Unsupervised Task-driven Models of Ventral Visual Stream via Relative Position Predictivity

Dazhong Rong,Hao Dong,Xing Gao,Jiyu Wei,Di Hong,Yaoyao Hao,Qinming He,Yueming Wang

Main category: cs.CE

TL;DR: 论文提出了一种结合相对位置预测和对比学习的新方法,以更全面地建模腹侧视觉流(VVS),并验证了其在物体识别和脑相似性上的提升。

Details Motivation: 现有方法仅关注VVS的物体识别功能,忽略了其在位置感知(如相对位置预测)中的作用。论文旨在填补这一空白。 Method: 提出了一种结合相对位置预测和对比学习的无监督任务驱动方法,以更符合生物现实的方式建模VVS。 Result: 实验表明,新方法显著提升了物体识别的下游性能,同时增强了相对位置预测能力,并提高了模型的脑相似性。 Conclusion: 研究从计算角度为VVS参与位置感知(尤其是相对位置预测)提供了有力证据。 Abstract: Based on the concept that ventral visual stream (VVS) mainly functions for object recognition, current unsupervised task-driven methods model VVS by contrastive learning, and have achieved good brain similarity. However, we believe functions of VVS extend beyond just object recognition. In this paper, we introduce an additional function involving VVS, named relative position (RP) prediction. We first theoretically explain contrastive learning may be unable to yield the model capability of RP prediction. Motivated by this, we subsequently integrate RP learning with contrastive learning, and propose a new unsupervised task-driven method to model VVS, which is more inline with biological reality. We conduct extensive experiments, demonstrating that: (i) our method significantly improves downstream performance of object recognition while enhancing RP predictivity; (ii) RP predictivity generally improves the model brain similarity. Our results provide strong evidence for the involvement of VVS in location perception (especially RP prediction) from a computational perspective.

q-bio.QM [Back]

[159] CellVerse: Do Large Language Models Really Understand Cell Biology?

Fan Zhang,Tianyu Liu,Zhihong Zhu,Hao Wu,Haixin Wang,Donghao Zhou,Yefeng Zheng,Kun Wang,Xian Wu,Pheng-Ann Heng

Main category: q-bio.QM

TL;DR: CellVerse是一个统一的语言中心问答基准,用于评估大型语言模型(LLMs)在单细胞多组学数据分析任务中的表现,发现现有模型在细胞生物学领域的应用仍有显著挑战。

Details Motivation: 当前缺乏对LLMs在语言驱动的单细胞分析任务中性能的全面评估,CellVerse旨在填补这一空白。 Method: 通过整合四种单细胞多组学数据,设计三个层次的分析任务(细胞类型注释、药物反应预测和扰动分析),并系统评估14种开源和闭源LLMs的性能。 Result: 实验结果显示,现有专家模型表现不佳,通用模型(如Qwen、Llama等)初步具备理解能力,但整体性能仍有较大提升空间,尤其在药物反应预测任务中表现不佳。 Conclusion: CellVerse为通过自然语言推进细胞生物学研究奠定了基础,揭示了LLMs在该领域的应用挑战和潜力。 Abstract: Recent studies have demonstrated the feasibility of modeling single-cell data as natural languages and the potential of leveraging powerful large language models (LLMs) for understanding cell biology. However, a comprehensive evaluation of LLMs' performance on language-driven single-cell analysis tasks still remains unexplored. Motivated by this challenge, we introduce CellVerse, a unified language-centric question-answering benchmark that integrates four types of single-cell multi-omics data and encompasses three hierarchical levels of single-cell analysis tasks: cell type annotation (cell-level), drug response prediction (drug-level), and perturbation analysis (gene-level). Going beyond this, we systematically evaluate the performance across 14 open-source and closed-source LLMs ranging from 160M to 671B on CellVerse. Remarkably, the experimental results reveal: (1) Existing specialist models (C2S-Pythia) fail to make reasonable decisions across all sub-tasks within CellVerse, while generalist models such as Qwen, Llama, GPT, and DeepSeek family models exhibit preliminary understanding capabilities within the realm of cell biology. (2) The performance of current LLMs falls short of expectations and has substantial room for improvement. Notably, in the widely studied drug response prediction task, none of the evaluated LLMs demonstrate significant performance improvement over random guessing. CellVerse offers the first large-scale empirical demonstration that significant challenges still remain in applying LLMs to cell biology. By introducing CellVerse, we lay the foundation for advancing cell biology through natural languages and hope this paradigm could facilitate next-generation single-cell analysis.

cs.AI [Back]

[160] Arrow-Guided VLM: Enhancing Flowchart Understanding via Arrow Direction Encoding

Takamitsu Omasa,Ryo Koshihara,Masumi Morishige

Main category: cs.AI

TL;DR: 提出了一种七阶段流程,通过箭头感知检测、OCR提取文本和结构化提示,显著提升了VLM对流程图的解析准确率。

Details Motivation: 现有视觉语言模型(VLM)常误解流程图的箭头和拓扑结构,需改进其对这类图表的理解能力。 Method: 采用七阶段流程,分三步:箭头感知检测、OCR提取节点文本、构建结构化提示指导VLM。 Result: 在90个问题的基准测试中,准确率从80%提升至89%,尤其对下一步查询效果显著(100%准确率)。 Conclusion: 方法通过显式编码箭头提升了VLM性能,但依赖检测器和OCR精度,未来将扩展评估集和测试其他图表类型。 Abstract: Flowcharts are indispensable tools in software design and business-process analysis, yet current vision-language models (VLMs) frequently misinterpret the directional arrows and graph topology that set these diagrams apart from natural images. We introduce a seven-stage pipeline grouped into three broader processes: (1) arrow-aware detection of nodes and arrow endpoints; (2) optical character recognition (OCR) to extract node text; and (3) construction of a structured prompt that guides the VLMs. Tested on a 90-question benchmark distilled from 30 annotated flowcharts, the method raises overall accuracy from 80 % to 89 % (+9 percentage points) without any task-specific fine-tuning. The gain is most pronounced for next-step queries (25/30 -> 30/30; 100 %, +17 pp); branch-result questions improve more modestly, and before-step questions remain difficult. A parallel evaluation with an LLM-as-a-Judge protocol shows the same trends, reinforcing the advantage of explicit arrow encoding. Limitations include dependence on detector and OCR precision, the small evaluation set, and residual errors at nodes with multiple incoming edges. Future work will enlarge the benchmark with synthetic and handwritten flowcharts and assess the approach on Business Process Model and Notation (BPMN) and Unified Modeling Language (UML).

[161] Visually Guided Decoding: Gradient-Free Hard Prompt Inversion with Language Models

Donghoon Kim,Minji Bae,Kyuhong Shim,Byonghyo Shim

Main category: cs.AI

TL;DR: VGD是一种无需梯度的方法,利用LLMs和CLIP指导生成连贯且语义对齐的文本提示,优于现有技术。

Details Motivation: 现有文本提示生成方法(如软硬提示技术)效果有限,缺乏解释性和连贯性。 Method: VGD结合LLMs的文本生成能力和CLIP评分,确保提示与视觉概念对齐。 Result: 实验表明VGD在生成可理解和上下文相关提示方面优于现有技术。 Conclusion: VGD提升了文本提示生成的解释性、泛化性和灵活性,无需额外训练。 Abstract: Text-to-image generative models like DALL-E and Stable Diffusion have revolutionized visual content creation across various applications, including advertising, personalized media, and design prototyping. However, crafting effective textual prompts to guide these models remains challenging, often requiring extensive trial and error. Existing prompt inversion approaches, such as soft and hard prompt techniques, are not so effective due to the limited interpretability and incoherent prompt generation. To address these issues, we propose Visually Guided Decoding (VGD), a gradient-free approach that leverages large language models (LLMs) and CLIP-based guidance to generate coherent and semantically aligned prompts. In essence, VGD utilizes the robust text generation capabilities of LLMs to produce human-readable prompts. Further, by employing CLIP scores to ensure alignment with user-specified visual concepts, VGD enhances the interpretability, generalization, and flexibility of prompt generation without the need for additional training. Our experiments demonstrate that VGD outperforms existing prompt inversion techniques in generating understandable and contextually relevant prompts, facilitating more intuitive and controllable interactions with text-to-image models.

[162] Decoding Neighborhood Environments with Large Language Models

Andrew Cart,Shaohu Zhang,Melanie Escue,Xugui Zhou,Haitao Zhao,Prashanth BusiReddyGari,Beiyu Lin,Shuang Li

Main category: cs.AI

TL;DR: 研究探讨了利用大型语言模型(LLMs)如ChatGPT和Gemini解码邻里环境的可行性,通过训练YOLOv11模型和评估四种LLMs,实现了高精度检测环境指标。

Details Motivation: 传统邻里环境评估方法资源密集且难以规模化,机器学习虽具潜力但数据标注和模型可访问性限制了其应用。 Method: 训练YOLOv11模型检测六种环境指标,并评估四种LLMs的可行性和鲁棒性,采用多数投票策略提升准确性。 Result: YOLOv11模型平均准确率达99.13%,四种LLMs通过多数投票实现超过88%的准确率。 Conclusion: LLMs无需训练即可成为解码邻里环境的有效工具,展示了其规模化应用的潜力。 Abstract: Neighborhood environments include physical and environmental conditions such as housing quality, roads, and sidewalks, which significantly influence human health and well-being. Traditional methods for assessing these environments, including field surveys and geographic information systems (GIS), are resource-intensive and challenging to evaluate neighborhood environments at scale. Although machine learning offers potential for automated analysis, the laborious process of labeling training data and the lack of accessible models hinder scalability. This study explores the feasibility of large language models (LLMs) such as ChatGPT and Gemini as tools for decoding neighborhood environments (e.g., sidewalk and powerline) at scale. We train a robust YOLOv11-based model, which achieves an average accuracy of 99.13% in detecting six environmental indicators, including streetlight, sidewalk, powerline, apartment, single-lane road, and multilane road. We then evaluate four LLMs, including ChatGPT, Gemini, Claude, and Grok, to assess their feasibility, robustness, and limitations in identifying these indicators, with a focus on the impact of prompting strategies and fine-tuning. We apply majority voting with the top three LLMs to achieve over 88% accuracy, which demonstrates LLMs could be a useful tool to decode the neighborhood environment without any training effort.

[163] TRAIL: Trace Reasoning and Agentic Issue Localization

Darshan Deshpande,Varun Gangal,Hersh Mehta,Jitin Krishnan,Anand Kannappan,Rebecca Qian

Main category: cs.AI

TL;DR: 论文提出了一种用于评估代理工作流复杂痕迹的动态方法,并引入了一个错误分类法,同时发布了包含148条人工标注痕迹的数据集TRAIL。

Details Motivation: 随着代理工作流的广泛应用,现有依赖人工分析的评估方法无法应对复杂性和规模的增加,亟需更高效的评估工具。 Method: 论文提出了一个错误分类法,并构建了基于真实应用场景的标注数据集TRAIL,用于评估代理工作流的性能。 Result: 实验表明,现代长上下文LLM在痕迹调试中表现不佳,最佳模型Gemini-2.5-pro在TRAIL上的得分仅为11%。 Conclusion: 论文强调了动态评估方法的重要性,并提供了公开数据集和代码以推动未来研究。 Abstract: The increasing adoption of agentic workflows across diverse domains brings a critical need to scalably and systematically evaluate the complex traces these systems generate. Current evaluation methods depend on manual, domain-specific human analysis of lengthy workflow traces - an approach that does not scale with the growing complexity and volume of agentic outputs. Error analysis in these settings is further complicated by the interplay of external tool outputs and language model reasoning, making it more challenging than traditional software debugging. In this work, we (1) articulate the need for robust and dynamic evaluation methods for agentic workflow traces, (2) introduce a formal taxonomy of error types encountered in agentic systems, and (3) present a set of 148 large human-annotated traces (TRAIL) constructed using this taxonomy and grounded in established agentic benchmarks. To ensure ecological validity, we curate traces from both single and multi-agent systems, focusing on real-world applications such as software engineering and open-world information retrieval. Our evaluations reveal that modern long context LLMs perform poorly at trace debugging, with the best Gemini-2.5-pro model scoring a mere 11% on TRAIL. Our dataset and code are made publicly available to support and accelerate future research in scalable evaluation for agentic workflows.

[164] LLM-based Prompt Ensemble for Reliable Medical Entity Recognition from EHRs

K M Sajjadul Islam,Ayesha Siddika Nipu,Jiawei Wu,Praveen Madiraju

Main category: cs.AI

TL;DR: 论文探讨了基于提示的大语言模型(如GPT-4o和DeepSeek-R1)在电子健康记录(EHR)中识别医疗实体的方法,其中GPT-4o结合提示集成方法表现最佳。

Details Motivation: 电子健康记录(EHRs)中的非结构化临床文本需要高效提取关键医疗实体(如问题、测试和治疗),以支持下游临床应用。 Method: 使用大语言模型(LLMs)进行基于提示的医疗实体识别,包括零样本、少样本和集成方法。 Result: GPT-4o结合提示集成方法在分类任务中表现最佳,F1分数为0.95,召回率为0.98,优于DeepSeek-R1。 Conclusion: 提示集成方法通过嵌入相似性和多数投票提高了可靠性,GPT-4o在此任务中表现优越。 Abstract: Electronic Health Records (EHRs) are digital records of patient information, often containing unstructured clinical text. Named Entity Recognition (NER) is essential in EHRs for extracting key medical entities like problems, tests, and treatments to support downstream clinical applications. This paper explores prompt-based medical entity recognition using large language models (LLMs), specifically GPT-4o and DeepSeek-R1, guided by various prompt engineering techniques, including zero-shot, few-shot, and an ensemble approach. Among all strategies, GPT-4o with prompt ensemble achieved the highest classification performance with an F1-score of 0.95 and recall of 0.98, outperforming DeepSeek-R1 on the task. The ensemble method improved reliability by aggregating outputs through embedding-based similarity and majority voting.

cs.DL [Back]

[165] SciCom Wiki: Fact-Checking and FAIR Knowledge Distribution for Scientific Videos and Podcasts

Tim Wittenborg,Constantin Sebastian Tremel,Niklas Stehr,Oliver Karras,Markus Stocker,Sören Auer

Main category: cs.DL

TL;DR: 论文提出SciCom Wiki平台,支持科学传播知识基础设施(SciCom KI),通过FAIR媒体表示和神经符号计算事实核查工具应对视频和播客中的信息泛滥与虚假信息。

Details Motivation: 民主社会需要可靠信息,但视频和播客既是信息传播媒介,也是虚假信息载体。现有SciCom KI碎片化且无法应对内容泛滥。 Method: 基于Wikibase构建开源平台,通过53名利益相关者调查、11次访谈和14名参与者评估原型,开发神经符号计算事实核查工具。 Result: 平台和工具通过专家访谈和用户调查验证了必要性和可用性,但SciCom KI在FAIR知识和协作系统方面仍不足。 Conclusion: SciCom Wiki可作为核心知识节点,但需协作努力以应对信息泛滥。 Abstract: Democratic societies need accessible, reliable information. Videos and Podcasts have established themselves as the medium of choice for civic dissemination, but also as carriers of misinformation. The emerging Science Communication Knowledge Infrastructure (SciCom KI) curating non-textual media is still fragmented and not adequately equipped to scale against the content flood. Our work sets out to support the SciCom KI with a central, collaborative platform, the SciCom Wiki, to facilitate FAIR (findable, accessible, interoperable, reusable) media representation and the fact-checking of their content, particularly for videos and podcasts. Building an open-source service system centered around Wikibase, we survey requirements from 53 stakeholders, refine these in 11 interviews, and evaluate our prototype based on these requirements with another 14 participants. To address the most requested feature, fact-checking, we developed a neurosymbolic computational fact-checking approach, converting heterogenous media into knowledge graphs. This increases machine-readability and allows comparing statements against equally represented ground-truth. Our computational fact-checking tool was iteratively evaluated through 10 expert interviews, a public user survey with 43 participants verified the necessity and usability of our tool. Overall, our findings identified several needs to systematically support the SciCom KI. The SciCom Wiki, as a FAIR digital library complementing our neurosymbolic computational fact-checking framework, was found suitable to address the raised requirements. Further, we identified that the SciCom KI is severely underdeveloped regarding FAIR knowledge and related systems facilitating its collaborative creation and curation. Our system can provide a central knowledge node, yet a collaborative effort is required to scale against the imminent (mis-)information flood.

cs.CR [Back]

[166] A Large-Scale Empirical Analysis of Custom GPTs' Vulnerabilities in the OpenAI Ecosystem

Sunday Oyinlola Ogundoyin,Muhammad Ikram,Hassan Jameel Asghar,Benjamin Zi Hao Zhao,Dali Kaafar

Main category: cs.CR

TL;DR: 研究分析了14,904个自定义GPT模型,发现95%以上存在安全漏洞,包括角色扮演攻击、系统提示泄漏和钓鱼内容生成等,呼吁加强安全措施。

Details Motivation: 随着自定义GPT模型的广泛应用,其安全漏洞问题日益突出,但现有研究缺乏大规模实证分析。 Method: 通过分析14,904个自定义GPT模型,评估其对七种可攻击威胁的易感性,并引入多指标排名系统研究安全风险与模型流行度的关系。 Result: 95%以上的自定义GPT缺乏足够的安全保护,最常见漏洞包括角色扮演(96.51%)、系统提示泄漏(92.20%)和钓鱼(91.22%)。 Conclusion: 研究揭示了自定义GPT模型的普遍安全漏洞,强调需加强安全措施和内容审核以确保安全部署。 Abstract: Millions of users leverage generative pretrained transformer (GPT)-based language models developed by leading model providers for a wide range of tasks. To support enhanced user interaction and customization, many platforms-such as OpenAI-now enable developers to create and publish tailored model instances, known as custom GPTs, via dedicated repositories or application stores. These custom GPTs empower users to browse and interact with specialized applications designed to meet specific needs. However, as custom GPTs see growing adoption, concerns regarding their security vulnerabilities have intensified. Existing research on these vulnerabilities remains largely theoretical, often lacking empirical, large-scale, and statistically rigorous assessments of associated risks. In this study, we analyze 14,904 custom GPTs to assess their susceptibility to seven exploitable threats, such as roleplay-based attacks, system prompt leakage, phishing content generation, and malicious code synthesis, across various categories and popularity tiers within the OpenAI marketplace. We introduce a multi-metric ranking system to examine the relationship between a custom GPT's popularity and its associated security risks. Our findings reveal that over 95% of custom GPTs lack adequate security protections. The most prevalent vulnerabilities include roleplay-based vulnerabilities (96.51%), system prompt leakage (92.20%), and phishing (91.22%). Furthermore, we demonstrate that OpenAI's foundational models exhibit inherent security weaknesses, which are often inherited or amplified in custom GPTs. These results highlight the urgent need for enhanced security measures and stricter content moderation to ensure the safe deployment of GPT-based applications.

[167] Where the Devil Hides: Deepfake Detectors Can No Longer Be Trusted

Shuaiwei Yuan,Junyu Dong,Yuezun Li

Main category: cs.CR

TL;DR: 论文探讨了第三方数据提供者可能通过投毒数据在Deepfake检测器中植入后门的风险,并提出了一种生成隐蔽且有效触发器的解决方案。

Details Motivation: 随着AI生成技术的发展,Deepfake检测器依赖第三方数据集训练,但数据提供者可能恶意投毒,导致检测器被植入后门,影响其可靠性。 Method: 开发了一种触发器生成器,可合成密码控制、语义抑制、自适应且不可见的触发器,并通过脏标签和干净标签两种投毒场景植入后门。 Result: 实验证明该方法在隐蔽性和有效性上优于基线方法。 Conclusion: 研究揭示了Deepfake检测器的潜在安全风险,并提出了一种隐蔽的后门植入方法,强调了数据来源安全的重要性。 Abstract: With the advancement of AI generative techniques, Deepfake faces have become incredibly realistic and nearly indistinguishable to the human eye. To counter this, Deepfake detectors have been developed as reliable tools for assessing face authenticity. These detectors are typically developed on Deep Neural Networks (DNNs) and trained using third-party datasets. However, this protocol raises a new security risk that can seriously undermine the trustfulness of Deepfake detectors: Once the third-party data providers insert poisoned (corrupted) data maliciously, Deepfake detectors trained on these datasets will be injected ``backdoors'' that cause abnormal behavior when presented with samples containing specific triggers. This is a practical concern, as third-party providers may distribute or sell these triggers to malicious users, allowing them to manipulate detector performance and escape accountability. This paper investigates this risk in depth and describes a solution to stealthily infect Deepfake detectors. Specifically, we develop a trigger generator, that can synthesize passcode-controlled, semantic-suppression, adaptive, and invisible trigger patterns, ensuring both the stealthiness and effectiveness of these triggers. Then we discuss two poisoning scenarios, dirty-label poisoning and clean-label poisoning, to accomplish the injection of backdoors. Extensive experiments demonstrate the effectiveness, stealthiness, and practicality of our method compared to several baselines.

cs.AR [Back]

[168] SpNeRF: Memory Efficient Sparse Volumetric Neural Rendering Accelerator for Edge Devices

Yipu Zhang,Jiawei Liang,Jian Peng,Jiang Xu,Wei Zhang

Main category: cs.AR

TL;DR: SpNeRF是一种软硬件协同设计的稀疏体素神经渲染解决方案,通过预处理和在线解码步骤减少内存占用,同时保持渲染质量。

Details Motivation: 神经渲染在AR/VR应用中需要高质量输出,但大体积网格数据和不规则访问模式限制了边缘设备的实时处理能力。现有方法未充分解决大体积网格导致的内存访问问题。 Method: 提出预处理步骤(哈希映射)和在线解码步骤(位图掩码),设计专用硬件架构以支持稀疏体素网格处理。 Result: 实验显示SpNeRF平均减少21.07倍内存占用,速度提升最高95.1倍,能效提升最高625.6倍。 Conclusion: SpNeRF通过软硬件协同设计显著提升了稀疏体素神经渲染的效率和性能。 Abstract: Neural rendering has gained prominence for its high-quality output, which is crucial for AR/VR applications. However, its large voxel grid data size and irregular access patterns challenge real-time processing on edge devices. While previous works have focused on improving data locality, they have not adequately addressed the issue of large voxel grid sizes, which necessitate frequent off-chip memory access and substantial on-chip memory. This paper introduces SpNeRF, a software-hardware co-design solution tailored for sparse volumetric neural rendering. We first identify memory-bound rendering inefficiencies and analyze the inherent sparsity in the voxel grid data of neural rendering. To enhance efficiency, we propose novel preprocessing and online decoding steps, reducing the memory size for voxel grid. The preprocessing step employs hash mapping to support irregular data access while maintaining a minimal memory size. The online decoding step enables efficient on-chip sparse voxel grid processing, incorporating bitmap masking to mitigate PSNR loss caused by hash collisions. To further optimize performance, we design a dedicated hardware architecture supporting our sparse voxel grid processing technique. Experimental results demonstrate that SpNeRF achieves an average 21.07$\times$ reduction in memory size while maintaining comparable PSNR levels. When benchmarked against Jetson XNX, Jetson ONX, RT-NeRF.Edge and NeuRex.Edge, our design achieves speedups of 95.1$\times$, 63.5$\times$, 1.5$\times$ and 10.3$\times$, and improves energy efficiency by 625.6$\times$, 529.1$\times$, 4$\times$, and 4.4$\times$, respectively.

eess.IV [Back]

[169] Evaluation of UAV-Based RGB and Multispectral Vegetation Indices for Precision Agriculture in Palm Tree Cultivation

Alavikunhu Panthakkan,S M Anzar,K. Sherin,Saeed Al Mansoori,Hussain Al-Ahmad

Main category: eess.IV

TL;DR: 该研究评估了无人机多光谱和RGB成像在迪拜棕榈树种植区的植被健康监测效果,发现RGB植被指数性能接近多光谱指数,为大规模农业监测提供了低成本方案。

Details Motivation: 精准农业需要高效且经济的植被监测方法,以提升作物产量并实现可持续农业。 Method: 使用配备多光谱传感器的无人机,计算NDVI和SAVI等指数,同时评估RGB指数(如VARI和MGRVI)的植被分类和胁迫检测能力。 Result: RGB指数在多光谱性能相近的情况下,显著降低了成本,适合大规模应用。 Conclusion: RGB成像在精准农业中具有潜力,可推动数据驱动的作物管理决策,同时提升无人机农业应用的效率和可扩展性。 Abstract: Precision farming relies on accurate vegetation monitoring to enhance crop productivity and promote sustainable agricultural practices. This study presents a comprehensive evaluation of UAV-based imaging for vegetation health assessment in a palm tree cultivation region in Dubai. By comparing multispectral and RGB image data, we demonstrate that RGBbased vegetation indices offer performance comparable to more expensive multispectral indices, providing a cost-effective alternative for large-scale agricultural monitoring. Using UAVs equipped with multispectral sensors, indices such as NDVI and SAVI were computed to categorize vegetation into healthy, moderate, and stressed conditions. Simultaneously, RGB-based indices like VARI and MGRVI delivered similar results in vegetation classification and stress detection. Our findings highlight the practical benefits of integrating RGB imagery into precision farming, reducing operational costs while maintaining accuracy in plant health monitoring. This research underscores the potential of UAVbased RGB imaging as a powerful tool for precision agriculture, enabling broader adoption of data-driven decision-making in crop management. By leveraging the strengths of both multispectral and RGB imaging, this work advances the state of UAV applications in agriculture, paving the way for more efficient and scalable farming solutions.

[170] Pose Estimation for Intra-cardiac Echocardiography Catheter via AI-Based Anatomical Understanding

Jaeyoung Huh,Ankur Kapoor,Young-Ho Kim

Main category: eess.IV

TL;DR: 提出了一种基于Vision Transformer的解剖感知姿态估计系统,用于从ICE图像中确定导管位置和方向,无需外部跟踪传感器。

Details Motivation: 现有导航方法依赖易受干扰的电磁跟踪或需手动调整,限制了ICE在心脏手术中的应用。 Method: 使用ViT模型处理ICE图像,通过16x16嵌入和Transformer网络预测位置和方向,训练于851例临床数据集。 Result: 实验显示平均位置误差9.48 mm,方向误差(16.13°, 8.98°, 10.47°),验证了模型准确性。 Conclusion: 该系统提高了手术效率,减少操作负担,支持无跟踪实时导管定位,可独立或补充现有系统。 Abstract: Intra-cardiac Echocardiography (ICE) plays a crucial role in Electrophysiology (EP) and Structural Heart Disease (SHD) interventions by providing high-resolution, real-time imaging of cardiac structures. However, existing navigation methods rely on electromagnetic (EM) tracking, which is susceptible to interference and position drift, or require manual adjustments based on operator expertise. To overcome these limitations, we propose a novel anatomy-aware pose estimation system that determines the ICE catheter position and orientation solely from ICE images, eliminating the need for external tracking sensors. Our approach leverages a Vision Transformer (ViT)-based deep learning model, which captures spatial relationships between ICE images and anatomical structures. The model is trained on a clinically acquired dataset of 851 subjects, including ICE images paired with position and orientation labels normalized to the left atrium (LA) mesh. ICE images are patchified into 16x16 embeddings and processed through a transformer network, where a [CLS] token independently predicts position and orientation via separate linear layers. The model is optimized using a Mean Squared Error (MSE) loss function, balancing positional and orientational accuracy. Experimental results demonstrate an average positional error of 9.48 mm and orientation errors of (16.13 deg, 8.98 deg, 10.47 deg) across x, y, and z axes, confirming the model accuracy. Qualitative assessments further validate alignment between predicted and target views within 3D cardiac meshes. This AI-driven system enhances procedural efficiency, reduces operator workload, and enables real-time ICE catheter localization for tracking-free procedures. The proposed method can function independently or complement existing mapping systems like CARTO, offering a transformative approach to ICE-guided interventions.

[171] Computationally Efficient Diffusion Models in Medical Imaging: A Comprehensive Review

Abdullah,Tao Huang,Ickjai Lee,Euijoon Ahn

Main category: eess.IV

TL;DR: 本文探讨了扩散模型在生成式人工智能中的高效性和推理时间,重点分析了DDPM、LDM和WDM三种模型在自然和医学成像中的应用及其计算复杂性。

Details Motivation: 扩散模型在生成高质量合成图像方面表现出色,但训练和生成的高计算成本仍是主要挑战。本文旨在优化其效率和推理时间,尤其是在医学成像中的应用。 Method: 研究分析了三种关键扩散模型(DDPM、LDM和WDM)的框架,并讨论了它们在自然和医学成像中填补的计算复杂性缺口。 Result: 扩散模型在医学成像中能生成快速、可靠且高质量的图像,对疾病诊断至关重要。 Conclusion: 尽管扩散模型在医学成像中具有潜力,但仍存在局限性,未来研究需进一步优化其效率和扩展应用。 Abstract: The diffusion model has recently emerged as a potent approach in computer vision, demonstrating remarkable performances in the field of generative artificial intelligence. Capable of producing high-quality synthetic images, diffusion models have been successfully applied across a range of applications. However, a significant challenge remains with the high computational cost associated with training and generating these models. This study focuses on the efficiency and inference time of diffusion-based generative models, highlighting their applications in both natural and medical imaging. We present the most recent advances in diffusion models by categorizing them into three key models: the Denoising Diffusion Probabilistic Model (DDPM), the Latent Diffusion Model (LDM), and the Wavelet Diffusion Model (WDM). These models play a crucial role in medical imaging, where producing fast, reliable, and high-quality medical images is essential for accurate analysis of abnormalities and disease diagnosis. We first investigate the general framework of DDPM, LDM, and WDM and discuss the computational complexity gap filled by these models in natural and medical imaging. We then discuss the current limitations of these models as well as the opportunities and future research directions in medical imaging.

[172] Skeleton-Guided Diffusion Model for Accurate Foot X-ray Synthesis in Hallux Valgus Diagnosis

Midi Wan,Pengfei Li,Yizhuo Liang,Di Wu,Yushan Pan,Guangzhen Zhu,Hao Wang

Main category: eess.IV

TL;DR: 论文提出了一种骨骼约束条件扩散模型(SCCDM)和基于骨骼标志的足部评估方法(KCC),用于提升医学图像合成的准确性和临床适用性。

Details Motivation: 拇外翻(Hallux valgus)影响全球约19%的人口,需要频繁的负重X光评估,现有模型在图像保真度、骨骼一致性和物理约束方面存在不足。 Method: 提出SCCDM模型,结合多尺度特征提取和注意力机制,并引入KCC方法评估足部骨骼标志。 Result: SCCDM将结构相似性指数(SSIM)提升5.72%(0.794),峰值信噪比(PSNR)提升18.34%(21.40 dB),结合KCC后平均得分达0.85。 Conclusion: SCCDM和KCC的组合显著提升了医学图像合成的质量和临床适用性,代码已开源。 Abstract: Medical image synthesis plays a crucial role in providing anatomically accurate images for diagnosis and treatment. Hallux valgus, which affects approximately 19% of the global population, requires frequent weight-bearing X-rays for assessment, placing additional strain on both patients and healthcare providers. Existing X-ray models often struggle to balance image fidelity, skeletal consistency, and physical constraints, particularly in diffusion-based methods that lack skeletal guidance. We propose the Skeletal-Constrained Conditional Diffusion Model (SCCDM) and introduce KCC, a foot evaluation method utilizing skeletal landmarks. SCCDM incorporates multi-scale feature extraction and attention mechanisms, improving the Structural Similarity Index (SSIM) by 5.72% (0.794) and Peak Signal-to-Noise Ratio (PSNR) by 18.34% (21.40 dB). When combined with KCC, the model achieves an average score of 0.85, demonstrating strong clinical applicability. The code is available at https://github.com/midisec/SCCDM.

[173] An integrated language-vision foundation model for conversational diagnostics and triaging in primary eye care

Zhi Da Soh,Yang Bai,Kai Yu,Yang Zhou,Xiaofeng Lei,Sahil Thakur,Zann Lee,Lee Ching Linette Phang,Qingsheng Peng,Can Can Xue,Rachel Shujuan Chong,Quan V. Hoang,Lavanya Raghavan,Yih Chung Tham,Charumathi Sabanayagam,Wei-Chi Wu,Ming-Chih Ho,Jiangnan He,Preeti Gupta,Ecosse Lamoureux,Seang Mei Saw,Vinay Nangia,Songhomitra Panda-Jonas,Jie Xu,Ya Xing Wang,Xinxing Xu,Jost B. Jonas,Tien Yin Wong,Rick Siow Mong Goh,Yong Liu,Ching-Yu Cheng

Main category: eess.IV

TL;DR: Meta-EyeFM是一个结合大型语言模型(LLM)和视觉基础模型(VFM)的多功能基础模型,用于眼科疾病评估。通过路由机制和低秩适应(LoRA)微调,实现了高精度的疾病检测、严重程度区分和常见体征识别。

Details Motivation: 当前深度学习模型多为任务专用且缺乏用户友好界面,Meta-EyeFM旨在解决这一问题,提供多功能、高精度的眼科诊断支持。 Method: 结合LLM和VFM,采用路由机制分配任务,并通过LoRA微调VFM以检测疾病、区分严重程度和识别体征。 Result: 模型在路由任务中达到100%准确率,疾病检测准确率≥82.2%,严重程度区分≥89%,体征识别≥76%,优于Gemini-1.5-flash和ChatGPT-4o LMMs。 Conclusion: Meta-EyeFM在眼科诊断中表现出色,可作为初级眼科护理的决策支持工具或在线LLM用于眼底评估。 Abstract: Current deep learning models are mostly task specific and lack a user-friendly interface to operate. We present Meta-EyeFM, a multi-function foundation model that integrates a large language model (LLM) with vision foundation models (VFMs) for ocular disease assessment. Meta-EyeFM leverages a routing mechanism to enable accurate task-specific analysis based on text queries. Using Low Rank Adaptation, we fine-tuned our VFMs to detect ocular and systemic diseases, differentiate ocular disease severity, and identify common ocular signs. The model achieved 100% accuracy in routing fundus images to appropriate VFMs, which achieved $\ge$ 82.2% accuracy in disease detection, $\ge$ 89% in severity differentiation, $\ge$ 76% in sign identification. Meta-EyeFM was 11% to 43% more accurate than Gemini-1.5-flash and ChatGPT-4o LMMs in detecting various eye diseases and comparable to an ophthalmologist. This system offers enhanced usability and diagnostic performance, making it a valuable decision support tool for primary eye care or an online LLM for fundus evaluation.

[174] GNCAF: A GNN-based Neighboring Context Aggregation Framework for Tertiary Lymphoid Structures Semantic Segmentation in WSI

Lei Su

Main category: eess.IV

TL;DR: 本文提出了一种基于图神经网络的邻域上下文聚合框架(GNCAF),用于端到端地分割WSI中的三级淋巴结构(TLS)区域和成熟阶段,显著提升了分割性能。

Details Motivation: 现有方法依赖细胞代理任务且需额外后处理,无法充分利用邻域上下文信息。 Method: GNCAF通过多跳邻域上下文聚合和自注意力机制,增强目标补丁的分割能力。 Result: 在TCGA-COAD和INHOUSE-PAAD数据集上,GNCAF在mF1和mIoU上分别提升22.08%和26.57%。 Conclusion: GNCAF不仅适用于TLS分割,还可扩展至淋巴结转移分割任务。 Abstract: Tertiary lymphoid structures (TLS) are organized clusters of immune cells, whose maturity and area can be quantified in whole slide image (WSI) for various prognostic tasks. Existing methods for assessing these characteristics typically rely on cell proxy tasks and require additional post-processing steps. In this work, We focus on a novel task-TLS Semantic Segmentation (TLS-SS)-which segments both the regions and maturation stages of TLS in WSI in an end-to-end manner. Due to the extensive scale of WSI and patch-based segmentation strategies, TLS-SS necessitates integrating from neighboring patches to guide target patch (target) segmentation. Previous techniques often employ on multi-resolution approaches, constraining the capacity to leverage the broader neighboring context while tend to preserve coarse-grained information. To address this, we propose a GNN-based Neighboring Context Aggregation Framework (GNCAF), which progressively aggregates multi-hop neighboring context from the target and employs a self-attention mechanism to guide the segmentation of the target. GNCAF can be integrated with various segmentation models to enhance their ability to perceive contextual information outside of the patch. We build two TLS-SS datasets, called TCGA-COAD and INHOUSE-PAAD, and make the former (comprising 225 WSIs and 5041 TLSs) publicly available. Experiments on these datasets demonstrate the superiority of GNCAF, achieving a maximum of 22.08% and 26.57% improvement in mF1 and mIoU, respectively. Additionally, we also validate the task scalability of GNCAF on segmentation of lymph node metastases.

[175] A portable diagnosis model for Keratoconus using a smartphone

Yifan Li,Myeongjun Kim,Yanjing Jin,Peter Ho,Jo Woon Chong

Main category: eess.IV

TL;DR: 提出了一种基于智能手机的便携式圆锥角膜(KC)诊断框架,通过两阶段检测流程实现高精度分类和可视化。

Details Motivation: 传统Placido盘地形图依赖专业设备,限制了可及性,因此开发便携式解决方案。 Method: 利用智能手机屏幕显示Placido盘,捕获角膜反射,采用两阶段检测流程(WSVM分类和彩色图可视化)。 Result: 在模拟眼球模型上验证,分类准确率最高达92.93%,且支持多款智能手机。 Conclusion: 该方法为KC诊断提供了便携、高精度的替代方案。 Abstract: Keratoconus (KC) is a progressive corneal disorder characterized by localized thinning and protrusion, leading to visual distortion. While Placido disc-based topography remains a standard in clinical diagnostics, its dependence on specialized equipment limits accessibility. In this paper, we propose a portable, smartphone-based diagnostic framework that captures corneal reflections of a Placido disc displayed on a phone screen and applies a two-stage detection pipeline, then validate on 3D-printed emulated eyeball models that simulate normal, moderate, and severe KC stages based on anterior chamber depth (ACD). The first step of the two-stage detection pipeline is classifying different stages of KC with features including height and width of extracted reflections using weighted support vector machine (WSVM). It achieves a maximum accuracy of 92.93%, and maintains over 90% accuracy across multiple smartphone models, including the Galaxy Z Flip 3, iPhone 15 Pro, and iPhone 16 Pro. For the second step, we visualize the KC-affected protrusion regions on the corneas with color maps based on inter-disc distance, that provides an intuitive representation of disease severity and localization. Moreover, we validate the ability of the extracted features to differentiate between KC stages with ANOVA and Omega Squared, with significant p-values (e.g., $p < 10^{-6}$) and large effect sizes ($\\omega^2$ up to 0.8398) among classes.

[176] VIViT: Variable-Input Vision Transformer Framework for 3D MR Image Segmentation

Badhan Kumar Das,Ajay Singh,Gengyan Zhao,Han Liu,Thomas J. Re,Dorin Comaniciu,Eli Gibson,Andreas Maier

Main category: eess.IV

TL;DR: 论文提出了一种基于Transformer的框架VIViT,用于处理多对比度MR数据的自监督预训练和分割微调,解决了现有方法对固定输入模态的限制。

Details Motivation: 现实中的MR研究通常包含不同对比度的数据,而现有深度学习方法需要固定输入模态,限制了大规模预训练和下游任务的适应性。 Method: 提出VIViT框架,支持自监督预训练和可变对比度的分割微调,最大化数据利用率并适应不同输入需求。 Result: 在脑梗死和脑肿瘤分割任务中,VIViT分别取得0.624和0.883的平均Dice分数,优于现有CNN和ViT模型。 Conclusion: VIViT框架在异构MR数据任务中表现出更好的适应性和性能,为实际应用提供了有效解决方案。 Abstract: Self-supervised pretrain techniques have been widely used to improve the downstream tasks' performance. However, real-world magnetic resonance (MR) studies usually consist of different sets of contrasts due to different acquisition protocols, which poses challenges for the current deep learning methods on large-scale pretrain and different downstream tasks with different input requirements, since these methods typically require a fixed set of input modalities or, contrasts. To address this challenge, we propose variable-input ViT (VIViT), a transformer-based framework designed for self-supervised pretraining and segmentation finetuning for variable contrasts in each study. With this ability, our approach can maximize the data availability in pretrain, and can transfer the learned knowledge from pretrain to downstream tasks despite variations in input requirements. We validate our method on brain infarct and brain tumor segmentation, where our method outperforms current CNN and ViT-based models with a mean Dice score of 0.624 and 0.883 respectively. These results highlight the efficacy of our design for better adaptability and performance on tasks with real-world heterogeneous MR data.