cs.CV

[1] Can a Large Language Model Assess Urban Design Quality? Evaluating Walkability Metrics Across Expertise Levels

Chenyi Cai,Kosuke Kuriyama,Youlong Gu,Filip Biljecki,Pieter Herthogs

Main category: cs.CV

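TL;DR: The study examines how integrating expert knowledge can improve the ability of multimodal large language models (MLLMs) to assess urban walkability, and finds that expert knowledge improves the model's consistency and accuracy.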

Details

Motivation: Urban street environments are vital to human activity, but the role of expert knowledge when MLLMs are used to assess their quality has not been fully explored.

Method: Walkability metrics from the literature are combined with expert knowledge to design prompts of varying clarity, which are then used to test how well ChatGPT-4 evaluates the walkability of street view images.

Result: MLLMs can produce assessments from general knowledge, but their scores tend to be optimistic and prone to misjudgment; once expert knowledge is integrated, the evaluations become more consistent and concentrated.

Conclusion: Expert knowledge can significantly improve the reliability of MLLMs in urban design evaluation, though their limitations must be kept in mind.

Abstract: Urban street environments are vital to supporting human activity in public spaces. The emergence of big data, such as street view images (SVIs) combined with multimodal large language models (MLLMs), is transforming how researchers and practitioners investigate, measure, and evaluate semantic and visual elements of urban environments. Considering the low threshold for creating automated evaluative workflows using MLLMs, it is crucial to explore both the risks and opportunities associated with these probabilistic models. In particular, the extent to which the integration of expert knowledge can influence the performance of MLLMs in evaluating the quality of urban design has not been fully explored. This study sets out an initial exploration of how integrating more formal and structured representations of expert urban design knowledge into the input prompts of an MLLM (ChatGPT-4) can enhance the model's capability and reliability in evaluating the walkability of built environments using SVIs. We collect walkability metrics from the existing literature and categorize them using relevant ontologies. We then select a subset of these metrics, focusing on the subthemes of pedestrian safety and attractiveness, and develop prompts for the MLLM accordingly. We analyze the MLLM's ability to evaluate SVI walkability subthemes through prompts with varying levels of clarity and specificity regarding evaluation criteria. Our experiments demonstrate that MLLMs are capable of providing assessments and interpretations based on general knowledge and can support the automation of multimodal image-text evaluations. However, they generally provide more optimistic scores and can make mistakes when interpreting the provided metrics, resulting in incorrect evaluations. By integrating expert knowledge, the MLLM's evaluative performance exhibits higher consistency and concentration.

[2] Legilimens: Performant Video Analytics on the System-on-Chip Edge

Murali Ramanujam,Yinwei Dai,Kyle Jamieson,Ravi Netravali

Main category: cs.CV

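TL;DR: Legilimens is a continuous learning system for mobile edge devices such as drones and dashcams. It keeps a base model in device memory and adapts it cheaply to new scenes, which substantially lowers retraining cost and raises accuracy.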

Details

Motivation: Existing systems rely on the spare compute of traditional, memory-constrained edge servers, whereas mobile edge devices offer abundant unified memory but weaker compute. An efficient way to achieve high-accuracy video analytics on such devices is needed.

Method: Legilimens introduces three compute-efficient techniques: selecting high-utility data samples, updating the base model without complete retraining, and time-sharing compute resources.

Result: Compared to existing systems, Legilimens lowers retraining costs by 2.8-10x and raises accuracy by 18-45%.

Conclusion: Legilimens offers an efficient, lightweight solution for continuous learning on mobile edge devices.

Abstract: Continually retraining models has emerged as a primary technique to enable high-accuracy video analytics on edge devices. Yet, existing systems employ such adaptation by relying on the spare compute resources that traditional (memory-constrained) edge servers afford. In contrast, mobile edge devices such as drones and dashcams offer a fundamentally different resource profile: weak(er) compute with abundant unified memory pools. We present Legilimens, a continuous learning system for the mobile edge's System-on-Chip GPUs. Our driving insight is that visually distinct scenes that require retraining exhibit substantial overlap in model embeddings; if captured into a base model on device memory, specializing to each new scene can become lightweight, requiring very few samples. To practically realize this approach, Legilimens presents new, compute-efficient techniques to (1) select high-utility data samples for retraining specialized models, (2) update the base model without complete retraining, and (3) time-share compute resources between retraining and live inference for maximal accuracy. Across diverse workloads, Legilimens lowers retraining costs by 2.8-10x compared to existing systems, resulting in 18-45% higher accuracies.

[3] Emotion Recognition in Contemporary Dance Performances Using Laban Movement Analysis

Muhammad Turab,Philippe Colantoni,Damien Muselet,Alain Tremeau

Main category: cs.CV

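TL;DR: A new framework for emotion recognition in contemporary dance that improves Laban Movement Analysis (LMA) feature descriptors, combines quantitative and qualitative features, and analyzes feature impact with explainable machine learning methods.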

Details

Motivation: Improve existing emotion recognition methods by capturing both quantitative and qualitative characteristics of dance movement, raising recognition accuracy and broadening applicability.

Method: Features are extracted from 3D keypoint data; existing LMA descriptors are improved and new ones are introduced; classifiers including Random Forests and Support Vector Machines are trained.

Result: The highest accuracy reaches 96.85%, a marked improvement in emotion recognition for contemporary dance.

Conclusion: The framework has broad application prospects in dance performance analysis, training, and human-computer interaction.

Abstract: This paper presents a novel framework for emotion recognition in contemporary dance by improving existing Laban Movement Analysis (LMA) feature descriptors and introducing robust, novel descriptors that capture both quantitative and qualitative aspects of the movement. Our approach extracts expressive characteristics from 3D keypoint data of professional dancers performing contemporary dance under various emotional states, and trains multiple classifiers, including Random Forests and Support Vector Machines. Additionally, we provide an in-depth explanation of features and their impact on model predictions using explainable machine learning methods. Overall, our study improves emotion recognition in contemporary dance and offers promising applications in performance analysis, dance training, and human-computer interaction, with a highest accuracy of 96.85%.

[4] Dance Style Recognition Using Laban Movement Analysis

Muhammad Turab,Philippe Colantoni,Damien Muselet,Alain Tremeau

Main category: cs.CV

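TL;DR: A new method for dance style recognition that combines 3D pose estimation, human mesh reconstruction, and floor-aware modeling, captures temporal context with a sliding window, and reaches 99.18% classification accuracy.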

Details

Motivation: Existing dance style recognition methods fail to capture temporal context and dynamic transitions, which limits performance.

Method: LMA features are extracted by combining 3D pose estimation, human mesh reconstruction, and floor-aware modeling; a sliding window captures temporal context; machine learning classifiers are then trained on the result.

Result: The highest classification accuracy is 99.18%, indicating that adding temporal context significantly improves performance.

Conclusion: By introducing temporal context, the proposed method significantly improves the accuracy and interpretability of dance style recognition.

Abstract: The growing interest in automated movement analysis has presented new challenges in recognition of complex human activities including dance. This study focuses on dance style recognition using features extracted using Laban Movement Analysis. Previous studies for dance style recognition often focus on cross-frame movement analysis, which limits the ability to capture temporal context and dynamic transitions between movements. This gap highlights the need for a method that can add temporal context to LMA features. For this, we introduce a novel pipeline which combines 3D pose estimation, 3D human mesh reconstruction, and floor-aware body modeling to effectively extract LMA features. To address the temporal limitation, we propose a sliding window approach that captures movement evolution across time in features. These features are then used to train various machine learning methods for classification, and explainable AI methods are used to evaluate the contribution of each feature to classification performance. Our proposed method achieves a highest classification accuracy of 99.18%, which shows that the addition of temporal context significantly improves dance style recognition performance.
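The sliding-window idea in this entry can be illustrated with a minimal sketch. It is not the authors' pipeline: the window length, the stride, the choice of mean and standard deviation as window statistics, and the function name `sliding_window_features` are all assumptions made for illustration, and the toy per-frame vectors merely stand in for real LMA features.

```python
from statistics import mean, pstdev


def sliding_window_features(frames, window=4, stride=2):
    """Summarize per-frame feature vectors over overlapping windows.

    frames: list of equal-length lists of floats (one vector per frame).
    Returns one vector per window: the per-dimension mean followed by the
    per-dimension population standard deviation (an assumed choice of
    statistics, not taken from the paper).
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    out = []
    for start in range(0, len(frames) - window + 1, stride):
        chunk = frames[start:start + window]
        columns = list(zip(*chunk))  # transpose: one tuple per feature dimension
        out.append([mean(c) for c in columns] + [pstdev(c) for c in columns])
    return out


# Toy input: 6 frames with 2 hypothetical per-frame features each.
frames = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0], [5.0, 1.0]]
windows = sliding_window_features(frames, window=4, stride=2)
```

Each output vector describes how the features behave over a short span of frames rather than at a single frame, and it is this windowed vector that a downstream classifier would consume.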

[5] Geolocating Earth Imagery from ISS: Integrating Machine Learning with Astronaut Photography for Enhanced Geographic Mapping

Vedika Srivastava,Hemant Kumar Singh,Jaisal Singh

Main category: cs.CV

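TL;DR: The paper proposes a machine-learning approach to geolocating images taken from the International Space Station (ISS), using three image processing pipelines (a neural network, SIFT, and GPT-4) to identify geographic features with high accuracy.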

Details

Motivation: The specific Earth locations shown in ISS photographs often go unidentified; the study aims to fill this gap and make geolocation of space imagery more efficient.

Method: Three image processing pipelines are used: a neural network approach, a SIFT approach, and a GPT-4 model, each handling natural and man-made geographic features in high-resolution ISS imagery.

Result: Evaluated on a dataset of over 140 ISS images, the neural network approach performed well at matching geographic features, SIFT excelled on zoomed-in images, and GPT-4 provided rich geographic descriptions.

Conclusion: The study improves the accuracy and efficiency of geolocating space imagery, with significance for remote sensing, Earth observation, and environmental monitoring.

Abstract: This paper presents a novel approach to geolocating images captured from the International Space Station (ISS) using advanced machine learning algorithms. Despite having precise ISS coordinates, the specific Earth locations depicted in astronaut-taken photographs often remain unidentified. Our research addresses this gap by employing three distinct image processing pipelines: a neural-network-based approach, a SIFT-based method, and a GPT-4 model. Each pipeline is tailored to process high-resolution ISS imagery, identifying both natural and man-made geographical features. Through extensive evaluation on a diverse dataset of over 140 ISS images, our methods demonstrate significant promise in automated geolocation with varied levels of success. The NN approach showed a high success rate in accurately matching geographical features, while the SIFT pipeline excelled in processing zoomed-in images. The GPT-4 model provided enriched geographical descriptions alongside location predictions. This research contributes to the fields of remote sensing and Earth observation by enhancing the accuracy and efficiency of geolocating space-based imagery, thereby aiding environmental monitoring and global mapping efforts.

[6] MemeBLIP2: A novel lightweight multimodal system to detect harmful memes

Jiaqi Liu,Ran Tong,Aowei Shen,Shuzheng Li,Changlin Yang,Lisha Xu

Main category: cs.CV

TL;DR: MemeBLIP2 is a lightweight multimodal system that detects harmful memes effectively by combining image and text features.

Details

Motivation: Memes often combine images with brief text to convey humor or opinions, but some contain harmful content such as hate speech, so effective detection methods are needed.

Method: The system builds on the BLIP-2 vision-language model and adds modules that align image and text representations and fuse them for better classification.

Result: Evaluated on the PrideMM dataset, MemeBLIP2 captures subtle cues across modalities and improves harmful content detection.

Conclusion: Through multimodal feature fusion, MemeBLIP2 significantly improves harmful meme detection, especially for ironic or culturally specific content.

Abstract: Memes often merge visuals with brief text to share humor or opinions, yet some memes contain harmful messages such as hate speech. In this paper, we introduce MemeBLIP2, a lightweight multimodal system that detects harmful memes by combining image and text features effectively. We build on previous studies by adding modules that align image and text representations into a shared space and fuse them for better classification. Using BLIP-2 as the core vision-language model, our system is evaluated on the PrideMM dataset. The results show that MemeBLIP2 can capture subtle cues in both modalities, even in cases with ironic or culturally specific content, thereby improving the detection of harmful material.

[7] T2ID-CAS: Diffusion Model and Class Aware Sampling to Mitigate Class Imbalance in Neck Ultrasound Anatomical Landmark Detection

Manikanta Varaganti,Amulya Vankayalapati,Nour Awad,Gregory R. Dion,Laura J. Brattain

Main category: cs.CV

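TL;DR: The paper proposes T2ID-CAS, which combines a text-to-image latent diffusion model with class-aware sampling to address class imbalance in neck ultrasound, significantly improving object detection performance.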

Details

Motivation: Neck ultrasound is vital in airway management, but class imbalance for key structures in datasets (such as tracheal rings and vocal folds) challenges object detection models.

Method: T2ID-CAS combines a text-to-image latent diffusion model with class-aware sampling to generate high-quality synthetic samples that strengthen the representation of minority classes.

Result: With YOLOv9, T2ID-CAS reaches a mean Average Precision of 88.2, significantly surpassing the baseline of 66.

Conclusion: T2ID-CAS is a computationally efficient and scalable solution for class imbalance in AI-assisted ultrasound-guided interventions.

Abstract: Neck ultrasound (US) plays a vital role in airway management by providing non-invasive, real-time imaging that enables rapid and precise interventions. Deep learning-based anatomical landmark detection in neck US can further facilitate procedural efficiency. However, class imbalance within datasets, where key structures like tracheal rings and vocal folds are underrepresented, presents significant challenges for object detection models. To address this, we propose T2ID-CAS, a hybrid approach that combines a text-to-image latent diffusion model with class-aware sampling to generate high-quality synthetic samples for underrepresented classes. This approach, rarely explored in the ultrasound domain, improves the representation of minority classes. Experimental results using YOLOv9 for anatomical landmark detection in neck US demonstrated that T2ID-CAS achieved a mean Average Precision of 88.2, significantly surpassing the baseline of 66. This highlights its potential as a computationally efficient and scalable solution for mitigating class imbalance in AI-assisted ultrasound-guided interventions.
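Class-aware sampling, in its generic form, can be sketched as follows. This shows only the common two-stage scheme (pick a class uniformly, then pick an example of that class) and is not claimed to match how T2ID-CAS couples sampling with diffusion-generated images; the function name `class_aware_sample` and the label strings are hypothetical.

```python
import random
from collections import Counter, defaultdict


def class_aware_sample(labels, num_samples, seed=0):
    """Return dataset indices drawn by class-aware sampling.

    Each draw first picks a class uniformly at random, then picks one
    example of that class uniformly (with replacement), so rare classes
    are drawn about as often as frequent ones.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    classes = sorted(by_class)
    return [rng.choice(by_class[rng.choice(classes)]) for _ in range(num_samples)]


# Toy imbalanced label list; the class names are illustrative only.
labels = ["common_structure"] * 90 + ["tracheal_ring"] * 8 + ["vocal_fold"] * 2
picked = class_aware_sample(labels, num_samples=3000)
counts = Counter(labels[i] for i in picked)
```

With this scheme the three classes appear in roughly equal proportion among the sampled indices even though one of them makes up only 2% of the data, which is the rebalancing effect the abstract relies on.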

[8] Subject Information Extraction for Novelty Detection with Domain Shifts

Yangyang Qu,Dazhi Fu,Jicong Fan

Main category: cs.CV

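TL;DR: The paper proposes a new unsupervised novelty detection method that handles domain shift by separating subject information from background variation, significantly improving detection performance.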

Details

Motivation: Existing methods assume that training and testing data come from the same domain, but domain shift is common in practice and causes normal data to be misclassified as novel. The paper aims to solve this problem.

Method: The method minimizes the mutual information between subject and background representations, models background variation with a deep Gaussian mixture model, and performs detection on the subject representations alone.

Result: Experiments show that the method performs well on unseen domains and outperforms baselines, especially under substantial domain shift.

Conclusion: The method effectively addresses domain shift and improves the generalization of novelty detection.

Abstract: Unsupervised novelty detection (UND), aimed at identifying novel samples, is essential in fields like medical diagnosis, cybersecurity, and industrial quality control. Most existing UND methods assume that the training data and testing normal data originate from the same domain and only consider the distribution variation between training data and testing data. However, in real scenarios, it is common for normal testing and training data to originate from different domains, a challenge known as domain shift. The discrepancies between training and testing data often lead to incorrect classification of normal data as novel by existing methods. A typical situation is that testing normal data and training data describe the same subject, yet they differ in the background conditions. To address this problem, we introduce a novel method that separates subject information from background variation encapsulating the domain information to enhance detection performance under domain shifts. The proposed method minimizes the mutual information between the representations of the subject and background while modelling the background variation using a deep Gaussian mixture model, where the novelty detection is conducted on the subject representations solely and hence is not affected by the variation of domains. Extensive experiments demonstrate that our model generalizes effectively to unseen domains and significantly outperforms baseline methods, especially under substantial domain shifts between training and testing data.

Ezra Engel,Lishan Li,Chris Hudy,Robert Schleusner

Main category: cs.CV

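TL;DR: The paper explores multi-modal transfer learning for dynamic facial expression recognition on the DFEW dataset, combining pretrained networks to improve classification accuracy.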

Details

Motivation: Facial expression recognition (FER) has important applications in many fields, but accurate classification is challenging because the relevant changes in facial features are subtle.

Method: Pretrained ResNets, OpenPose, and OmniVec networks are combined to study how cross-temporal, multi-modal features affect classification accuracy.

Result: The multi-modal feature generators slightly improve the accuracy of the transformer-based classification model.

Conclusion: Multi-modal transfer learning shows potential for FER, but the improvement is limited.

Abstract: Facial expression recognition (FER) is a subset of computer vision with important applications for human-computer interaction, healthcare, and customer service. FER represents a challenging problem space because accurate classification requires a model to differentiate between subtle changes in facial features. In this paper, we examine the use of multi-modal transfer learning to improve performance on a challenging video-based FER dataset, Dynamic Facial Expression in-the-Wild (DFEW). Using a combination of pretrained ResNets, OpenPose, and OmniVec networks, we explore the impact of cross-temporal, multi-modal features on classification accuracy. Ultimately, we find that these finely-tuned multi-modal feature generators modestly improve accuracy of our transformer-based classification model.

[10] Embracing Collaboration Over Competition: Condensing Multiple Prompts for Visual In-Context Learning

Jinpeng Wang,Tianci Luo,Yaohua Zha,Yan Feng,Ruisheng Luo,Bin Chen,Tao Dai,Long Chen,Yaowei Wang,Shu-Tao Xia

Main category: cs.CV

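TL;DR: The paper proposes Condenser, a lightweight plugin that addresses prompt selection in visual in-context learning by letting multiple prompts collaborate, and it outperforms existing methods.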

Details

Motivation: Current prompt selection methods for visual in-context learning (VICL) assume a single ideal prompt exists, but in practice several suitable prompts may exist that each fall short when used alone.

Method: The paper proposes prompt condensation: the Condenser plugin integrates fine-grained context from multiple prompts and is optimized end-to-end.

Result: Experiments show that Condenser outperforms existing methods on benchmark tasks, with better context compression, scalability, and computational efficiency.

Conclusion: Condenser provides an efficient and scalable solution for VICL, and its code is open-sourced.

Abstract: Visual In-Context Learning (VICL) enables adaptively solving vision tasks by leveraging pixel demonstrations, mimicking human-like task completion through analogy. Prompt selection is critical in VICL, but current methods assume the existence of a single "ideal" prompt in a pool of candidates, which in practice may not hold true. Multiple suitable prompts may exist, but individually they often fall short, leading to difficulties in selection and the exclusion of useful context. To address this, we propose a new perspective: prompt condensation. Rather than relying on a single prompt, candidate prompts collaborate to efficiently integrate informative contexts without sacrificing resolution. We devise Condenser, a lightweight external plugin that compresses relevant fine-grained context across multiple prompts. Optimized end-to-end with the backbone, Condenser ensures accurate integration of contextual cues. Experiments demonstrate Condenser outperforms state-of-the-art methods across benchmark tasks, showing superior context compression, scalability with more prompts, and enhanced computational efficiency compared to ensemble methods, positioning it as a highly competitive solution for VICL. Code is open-sourced at https://github.com/gimpong/CVPR25-Condenser.

[11] CoCoDiff: Diversifying Skeleton Action Features via Coarse-Fine Text-Co-Guided Latent Diffusion

Zhifu Zhao,Hanyang Hua,Jianan Li,Shaoxin Wu,Fu Li,Yangtao Zhou,Yang Li

Main category: cs.CV

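TL;DR: CoCoDiff uses a diffusion model with multi-granularity text guidance to generate diverse yet semantically consistent features, improving performance on action recognition tasks.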

Details

Motivation: Existing methods increase feature diversity by expanding the sample space, which is inefficient and semantically inconsistent. CoCoDiff aims to solve these problems.

Method: A latent diffusion model generates diverse action representations, and a coarse-fine text co-guidance strategy ensures semantic consistency.

Result: The method reaches SOTA performance on several skeleton-based action recognition benchmarks.

Conclusion: As a plug-and-play module with no additional inference cost, CoCoDiff significantly improves model performance.

Abstract: In action recognition tasks, feature diversity is essential for enhancing model generalization and performance. Existing methods typically promote feature diversity by expanding the training data in the sample space, which often leads to inefficiencies and semantic inconsistencies. To overcome these problems, we propose a novel Coarse-fine text co-guidance Diffusion model (CoCoDiff). CoCoDiff generates diverse yet semantically consistent features in the latent space by leveraging diffusion and multi-granularity textual guidance. Specifically, our approach feeds spatio-temporal features extracted from skeleton sequences into a latent diffusion model to generate diverse action representations. Meanwhile, we introduce a coarse-fine text co-guided strategy that leverages textual information from large language models (LLMs) to ensure semantic consistency between the generated features and the original inputs. Notably, CoCoDiff operates as a plug-and-play auxiliary module during training, incurring no additional inference cost. Extensive experiments demonstrate that CoCoDiff achieves SOTA performance on skeleton-based action recognition benchmarks, including NTU RGB+D, NTU RGB+D 120 and Kinetics-Skeleton.

Zexin Ji,Beiji Zou,Xiaoyan Kui,Hua Li,Pierre Vera,Su Ruan

Main category: cs.CV

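TL;DR: A Mamba-based feature extraction and adaptive multi-level feature fusion method for tumor segmentation in multi-modal 3D medical images, addressing the limitations of traditional CNN and Transformer methods.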

Details

Motivation: Multi-modal 3D medical image segmentation is challenged by variations in image intensity and tumor morphology. Traditional CNN methods struggle to capture global features, while Transformer methods are computationally expensive. The Mamba model combines linear scalability with long-range modeling, but existing methods still fall short on multi-modal feature fusion.

Method: Modality-specific Mamba encoders extract long-range relevant features, and a bi-level synergistic integration block dynamically fuses multi-modal and multi-level features through modality attention and channel attention. The decoder combines deep semantic information with fine-grained details to produce the segmentation map.

Result: Experiments on PET/CT and MRI multi-sequence datasets show that the method outperforms existing CNN, Transformer, and Mamba methods.

Conclusion: The method performs well on multi-modal 3D medical image tumor segmentation and effectively addresses the challenges of feature extraction and fusion.

Abstract: Multi-modal 3D medical image segmentation aims to accurately identify tumor regions across different modalities, facing challenges from variations in image intensity and tumor morphology. Traditional convolutional neural network (CNN)-based methods struggle with capturing global features, while Transformer-based methods, despite effectively capturing global context, encounter high computational costs in 3D medical image segmentation. The Mamba model combines linear scalability with long-distance modeling, making it a promising approach for visual representation learning. However, Mamba-based 3D multi-modal segmentation still struggles to leverage modality-specific features and fuse complementary information effectively. In this paper, we propose a Mamba-based feature extraction and adaptive multilevel feature fusion method for 3D tumor segmentation using multi-modal medical images. We first develop the modality-specific Mamba encoder to efficiently extract long-range relevant features that represent anatomical and pathological structures present in each modality. Moreover, we design a bi-level synergistic integration block that dynamically merges multi-modal and multi-level complementary features by the modality attention and channel attention learning. Lastly, the decoder combines deep semantic information with fine-grained details to generate the tumor segmentation map. Experimental results on medical image datasets (PET/CT and MRI multi-sequence) show that our approach achieves competitive performance compared to the state-of-the-art CNN, Transformer, and Mamba-based approaches.

[13] Can We Achieve Efficient Diffusion without Self-Attention? Distilling Self-Attention into Convolutions

ZiYi Dong,Chengxing Zhou,Weijian Deng,Pengxu Wei,Xiangyang Ji,Liang Lin

Main category: cs.CV

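TL;DR: The paper proposes ΔConvFusion, which replaces conventional self-attention modules with pyramid convolution blocks, sharply reducing computational cost while preserving generation quality.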

Details

Motivation: The study finds that self-attention in pre-trained diffusion models mainly exhibits localized patterns resembling convolutional inductive biases, challenging the conventional view that global interaction is essential to self-attention.

Method: A layer-wise analysis reveals the localized nature of self-attention; ΔConvBlocks then replace the self-attention modules, distilling the attention patterns into localized convolutional operations.

Result: ΔConvFusion reduces computational cost by 6929× while matching transformer-based methods in performance, and is 5.42× more efficient than LinFusion.

Conclusion: Localized convolutional operations can effectively replace self-attention, substantially improving efficiency without sacrificing generation quality.

Abstract: Contemporary diffusion models built upon U-Net or Diffusion Transformer (DiT) architectures have revolutionized image generation through transformer-based attention mechanisms. The prevailing paradigm has commonly employed self-attention with quadratic computational complexity to handle global spatial relationships in complex images, thereby synthesizing high-fidelity images with coherent visual semantics. Contrary to conventional wisdom, our systematic layer-wise analysis reveals an interesting discrepancy: self-attention in pre-trained diffusion models predominantly exhibits localized attention patterns, closely resembling convolutional inductive biases. This suggests that global interactions in self-attention may be less critical than commonly assumed. Driven by this, we propose ΔConvFusion to replace conventional self-attention modules with Pyramid Convolution Blocks (ΔConvBlocks). By distilling attention patterns into localized convolutional operations while keeping other components frozen, ΔConvFusion achieves performance comparable to transformer-based counterparts while reducing computational cost by 6929× and surpassing LinFusion by 5.42× in efficiency, all without compromising generative fidelity.
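One simple way to quantify how "localized" an attention map is can be sketched as below. The metric (the share of attention mass falling within a small spatial neighborhood of each query) is an assumption chosen for illustration; the entry does not state which measure the authors' layer-wise analysis uses, and the function name `local_attention_mass` is hypothetical.

```python
def local_attention_mass(attn, width, radius=1):
    """Average fraction of attention mass within `radius` grid cells
    (Chebyshev distance) of each query token.

    attn: N x N list of rows, each row summing to 1; tokens are laid out
    row-major on a grid that is `width` cells wide.
    """
    n = len(attn)
    total = 0.0
    for q, row in enumerate(attn):
        qy, qx = divmod(q, width)
        for k, weight in enumerate(row):
            ky, kx = divmod(k, width)
            if max(abs(qy - ky), abs(qx - kx)) <= radius:
                total += weight
    return total / n


# Two toy attention maps on a 3 x 3 token grid (9 tokens).
n = 9
uniform = [[1.0 / n] * n for _ in range(n)]  # fully global attention
identity = [[1.0 if k == q else 0.0 for k in range(n)] for q in range(n)]  # fully local
global_score = local_attention_mass(uniform, width=3)
local_score = local_attention_mass(identity, width=3)
```

A score near 1 means almost all attention stays inside the neighborhood, which is the regime in which a convolution with a matching receptive field could plausibly stand in for the attention layer.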

[14] Learning Multi-view Multi-class Anomaly Detection

Qianzi Yu,Yang Cao,Yu Kang

Main category: cs.CV

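TL;DR: The MVMCAD model significantly improves multi-view multi-class anomaly detection through multi-view information integration, a semi-frozen encoder, an anomaly amplification module, and a cross-feature loss.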

Details

Motivation: Existing MCAD models perform poorly in multi-view scenarios because they fail to model the relationships and complementary information among views.

Method: The paper proposes a semi-frozen encoder, an Anomaly Amplification Module (AAM), and a Cross-Feature Loss to integrate multi-view information and amplify anomaly signals.

Result: On the Real-IAD dataset, image-level and pixel-level detection reach SOTA performance of 91.0/88.6/82.1 and 99.1/43.9/48.2/95.2, respectively.

Conclusion: MVMCAD performs strongly on multi-view multi-class anomaly detection, validating the approach.

Abstract: The latest trend in anomaly detection is to train a unified model instead of training a separate model for each category. However, existing multi-class anomaly detection (MCAD) models perform poorly in multi-view scenarios because they often fail to effectively model the relationships and complementary information among different views. In this paper, we introduce a Multi-View Multi-Class Anomaly Detection model (MVMCAD), which integrates information from multiple views to accurately identify anomalies. Specifically, we propose a semi-frozen encoder, where a pre-encoder prior enhancement mechanism is added before the frozen encoder, enabling stable cross-view feature modeling and efficient adaptation for improved anomaly detection. Furthermore, we propose an Anomaly Amplification Module (AAM) that models global token interactions and suppresses normal regions to enhance anomaly signals, leading to improved detection performance in multi-view settings. Finally, we propose a Cross-Feature Loss that aligns shallow encoder features with deep decoder features and vice versa, enhancing the model's sensitivity to anomalies at different semantic levels under multi-view scenarios. Extensive experiments on the Real-IAD dataset for multi-view multi-class anomaly detection validate the effectiveness of our approach, achieving state-of-the-art performance of 91.0/88.6/82.1 and 99.1/43.9/48.2/95.2 at the image level and the pixel level, respectively.

[15] CMD: Constraining Multimodal Distribution for Domain Adaptation in Stereo Matching

Zhelun Shen,Zhuo Li,Chenming Wu,Zhibo Rao,Lina Liu,Yuchao Dai,Liangjun Zhang

Main category: cs.CV

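TL;DR: The paper proposes CMD, a new method that constrains multimodal distributions to address the generalization problem of stereo matching under unsupervised domain adaptation.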

Details

Motivation: In unsupervised domain adaptation, the conventional soft argmin and smooth L1 loss lead to multimodal disparity distributions in the target domain, which degrades generalization.

Method: Uncertainty-regularized minimization and an anisotropic soft argmin are introduced to encourage the network to produce unimodal disparity distributions in the target domain.

Result: Experiments show that the method significantly improves generalization across several representative stereo matching networks.

Conclusion: CMD effectively addresses the multimodal distribution problem in unsupervised domain adaptation and improves prediction accuracy.

Abstract: Recently, learning-based stereo matching methods have achieved great improvement in public benchmarks, where soft argmin and smooth L1 loss play a core contribution to their success. However, in unsupervised domain adaptation scenarios, we observe that these two operations often yield multimodal disparity probability distributions in target domains, resulting in degraded generalization. In this paper, we propose a novel approach, Constrain Multi-modal Distribution (CMD), to address this issue. Specifically, we introduce uncertainty-regularized minimization and anisotropic soft argmin to encourage the network to produce predominantly unimodal disparity distributions in the target domain, thereby improving prediction accuracy. Experimentally, we apply the proposed method to multiple representative stereo-matching networks and conduct domain adaptation from synthetic data to unlabeled real-world scenes. Results consistently demonstrate improved generalization in both top-performing and domain-adaptable stereo-matching models. The code for CMD will be available at: https://github.com/gallenszl/CMD.
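Why a multimodal disparity distribution hurts can be seen from the standard soft argmin itself, sketched below in plain Python. This shows only the conventional operation (the expected disparity under a softmax over negated matching costs); the paper's uncertainty-regularized minimization and anisotropic soft argmin are not reproduced here, and the toy cost curves are invented for illustration.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_argmin(costs, temperature=1.0):
    """Expected disparity under softmax(-cost); list index = disparity in pixels."""
    probs = softmax([-c for c in costs], temperature)
    return sum(d * p for d, p in enumerate(probs))


# Unimodal cost curve: one clear minimum at disparity 2.
unimodal = [9.0, 4.0, 0.0, 4.0, 9.0, 9.0, 9.0, 9.0, 9.0]
# Bimodal cost curve: two equally good minima at disparities 1 and 7.
bimodal = [9.0, 0.0, 9.0, 9.0, 9.0, 9.0, 9.0, 0.0, 9.0]

d_uni = soft_argmin(unimodal)  # close to 2, the true minimum
d_bi = soft_argmin(bimodal)    # close to 4, which matches neither minimum
```

In the bimodal case the expectation lands midway between the two candidate disparities, so the prediction is wrong for both. Pushing the distribution toward a single mode, as the entry describes, removes this averaging failure.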

[16] The Dual Power of Interpretable Token Embeddings: Jailbreaking Attacks and Defenses for Diffusion Model Unlearning

Siyi Chen,Yimeng Zhang,Sijia Liu,Qing Qu

Main category: cs.CV

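TL;DR: The paper proposes an interpretable attack that uses orthogonal attack token embeddings to reveal why unlearned models still retain harmful concepts, and designs a corresponding defense.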

Details

Motivation: Despite their strong generalization, diffusion models have been found to memorize and generate harmful content. Existing fine-tuning methods cannot fully remove these harmful concepts, and existing attacks lack interpretability, which hinders the development of defense strategies.

Method: The attack learns an orthogonal set of interpretable attack token embeddings that can be decomposed into human-understandable textual elements. A defense method is then designed on top of these embeddings.

Result: Experiments show that the attack token embeddings are robust and transferable, and that the defense is effective against both existing attacks and the proposed attack.

Conclusion: The interpretable attack reveals why unlearned models retain harmful concepts, and an effective defense strategy is proposed.

Abstract: Despite the remarkable generalization capabilities of diffusion models, recent studies have shown that these models can memorize and generate harmful content when prompted with specific text instructions. Although fine-tuning approaches have been developed to mitigate this issue by unlearning harmful concepts, these methods can be easily circumvented through jailbreaking attacks. This indicates that the harmful concept has not been fully erased from the model. However, existing attack methods, while effective, lack interpretability regarding why unlearned models still retain the concept, thereby hindering the development of defense strategies. In this work, we address these limitations by proposing an attack method that learns an orthogonal set of interpretable attack token embeddings. The attack token embeddings can be decomposed into human-interpretable textual elements, revealing that unlearned models still retain the target concept through implicit textual components. Furthermore, these attack token embeddings are robust and transferable across text prompts, initial noises, and unlearned models. Finally, leveraging this diverse set of embeddings, we design a defense method applicable to both our proposed attack and existing attack methods. Experimental results demonstrate the effectiveness of both our attack and defense strategies.

[17] AGHI-QA: A Subjective-Aligned Dataset and Metric for AI-Generated Human Images

Yunhao Li,Sijing Wu,Wei Sun,Zhichao Zhang,Yucheng Zhu,Zicheng Zhang,Huiyu Duan,Xiongkuo Min,Guangtao Zhai

Main category: cs.CV

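TL;DR: The paper introduces AGHI-QA, the first large-scale benchmark dataset for quality assessment of AI-generated human images (AGHIs), and develops AGHI-Assessor, a new quality assessment method that combines a multimodal model with human body features.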

Details

Motivation: Existing image quality assessment (IQA) methods provide only global quality scores and cannot evaluate complex structures such as the human body at a fine-grained level, while AI-generated human images often contain anatomical and textural distortions.

Method: The AGHI-QA dataset of 4,000 images is built, with multidimensional annotations collected through a subjective study; AGHI-Assessor combines a multimodal model with human body features to predict quality.

Result: AGHI-Assessor significantly outperforms existing IQA methods and leading multimodal models in multidimensional quality assessment and structural distortion detection.

Conclusion: AGHI-QA and AGHI-Assessor provide effective tools for assessing the quality of AI-generated human images, filling a gap left by existing methods.

Abstract: The rapid development of text-to-image (T2I) generation approaches has attracted extensive interest in evaluating the quality of generated images, leading to the development of various quality assessment methods for general-purpose T2I outputs. However, existing image quality assessment (IQA) methods are limited to providing global quality scores, failing to deliver fine-grained perceptual evaluations for structurally complex subjects like humans, which is a critical challenge considering the frequent anatomical and textural distortions in AI-generated human images (AGHIs). To address this gap, we introduce AGHI-QA, the first large-scale benchmark specifically designed for quality assessment of AGHIs. The dataset comprises 4,000 images generated from 400 carefully crafted text prompts using 10 state-of-the-art T2I models. We conduct a systematic subjective study to collect multidimensional annotations, including perceptual quality scores, text-image correspondence scores, and visible and distorted body part labels. Based on AGHI-QA, we evaluate the strengths and weaknesses of current T2I methods in generating human images from multiple dimensions. Furthermore, we propose AGHI-Assessor, a novel quality metric that integrates the large multimodal model (LMM) with domain-specific human features for precise quality prediction and identification of visible and distorted body parts in AGHIs. Extensive experimental results demonstrate that AGHI-Assessor showcases state-of-the-art performance, significantly outperforming existing IQA methods in multidimensional quality assessment and surpassing leading LMMs in detecting structural distortions in AGHIs.

[18] An Evaluation of a Visual Question Answering Strategy for Zero-shot Facial Expression Recognition in Still Images

Modesto Castrillón-Santana,Oliverio J Santana,David Freire-Obregón,Daniel Hernández-Sosa,Javier Lorenzo-Navarro

Main category: cs.CV

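TL;DR: The paper explores how visual language models (VLMs) can improve zero-shot facial expression recognition (FER), and its experiments confirm strong performance for some VLMs.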

Details

Motivation: Existing deep learning models perform poorly in zero-shot FER, which motivates exploring how knowledge from visual language models can improve performance.

Method: A Visual Question Answering strategy is adopted to evaluate a range of locally executed VLMs, which are compared against existing FER models.

Result: Some VLMs perform very well in zero-shot FER, indicating their potential to improve FER generalization.

Conclusion: Further research on applying VLMs to FER is needed to improve model generalization.

Abstract: Facial expression recognition (FER) is a key research area in computer vision and human-computer interaction. Despite recent advances in deep learning, challenges persist, especially in generalizing to new scenarios. In fact, zero-shot FER significantly reduces the performance of state-of-the-art FER models. To address this problem, the community has recently started to explore the integration of knowledge from Large Language Models for visual tasks. In this work, we evaluate a broad collection of locally executed Visual Language Models (VLMs), avoiding the lack of task-specific knowledge by adopting a Visual Question Answering strategy. We compare the proposed pipeline with state-of-the-art FER models, both integrating and excluding VLMs, evaluating well-known FER benchmarks: AffectNet, FERPlus, and RAF-DB. The results show excellent performance for some VLMs in zero-shot FER scenarios, indicating the need for further exploration to improve FER generalization.

[19] Text-Conditioned Diffusion Model for High-Fidelity Korean Font Generation

Abdul Sami,Avinash Kumar,Irfanullah Memon,Youngwon Jo,Muhammad Rizwan,Jaeyoung Choi

Main category: cs.CV

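TL;DR: A diffusion-based automatic font generation method that produces high-quality Korean fonts from a single sample, addressing the instability and poor detail capture of traditional methods.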

Details

Motivation: Traditional automatic font generation methods (such as GANs and VAEs) are unstable during training, prone to mode collapse, and struggle to capture font details, especially for handwritten styles in complex scripts such as Korean and Chinese.

Method: A diffusion model incrementally refines noisy images, combined with a pre-trained style encoder and a text encoder, and a perceptual loss is used to improve generation quality.

Result: Tested on more than 2000 Korean characters, the model generates accurate, detailed fonts and outperforms benchmark methods.

Conclusion: The method is a reliable tool for generating Korean fonts in different styles and overcomes the limitations of traditional methods.

Abstract: Automatic font generation (AFG) is the process of creating a new font using only a few examples of the style images. Generating fonts for complex languages like Korean and Chinese, particularly in handwritten styles, presents significant challenges. Traditional AFG methods based on generative adversarial networks (GANs) and variational autoencoders (VAEs) are usually unstable during training and often face mode collapse problems. They also struggle to capture fine details within font images. To address these problems, we present a diffusion-based AFG method which generates high-quality, diverse Korean font images using only a single reference image, focusing on handwritten and printed styles. Our approach refines noisy images incrementally, ensuring stable training and visually appealing results. A key innovation is our text encoder, which processes phonetic representations to generate accurate and contextually correct characters, even for unseen characters. We used a pre-trained style encoder from DG FONT to effectively and accurately encode the style images. To further enhance the generation quality, we used a perceptual loss that guides the model to focus on the global style of generated images. Experimental results on over 2000 Korean characters demonstrate that our model consistently generates accurate and detailed font images and outperforms benchmark methods, making it a reliable tool for generating authentic Korean fonts across different styles.

[20] Simple Visual Artifact Detection in Sora-Generated Videos

Misora Sugiyama,Hirokatsu Kataoka

Main category: cs.CV

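TL;DR: Analyzes visual artifacts commonly found in videos generated by OpenAI's Sora, proposes a multi-label classification framework, and trains several 2D CNN architectures, with ResNet-50 performing best at 94.14% accuracy.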

Details

Motivation: As video generation models advance, understanding their limitations and ensuring their safe deployment becomes essential. Method: Using 300 manually annotated frames extracted from 15 Sora-generated videos, several 2D CNN architectures (ResNet-50, EfficientNet-B3/B4, ViT-Base) were trained. Result: The ResNet-50 model reached an average accuracy of 94.14% on the multi-label classification task. Conclusion: The study supports video quality evaluation, visual risk identification, and safety. Abstract: The December 2024 release of OpenAI's Sora, a powerful video generation model driven by natural language prompts, highlights a growing convergence between large language models (LLMs) and video synthesis. As these multimodal systems evolve into video-enabled LLMs (VidLLMs), capable of interpreting, generating, and interacting with visual content, understanding their limitations and ensuring their safe deployment becomes essential. This study investigates visual artifacts frequently found and reported in Sora-generated videos, which can compromise quality, mislead viewers, or propagate disinformation. We propose a multi-label classification framework targeting four common artifact label types: label 1: boundary / edge defects, label 2: texture / noise issues, label 3: movement / joint anomalies, and label 4: object mismatches / disappearances. Using a dataset of 300 manually annotated frames extracted from 15 Sora-generated videos, we trained multiple 2D CNN architectures (ResNet-50, EfficientNet-B3 / B4, ViT-Base). The best-performing model, based on ResNet-50, achieved an average multi-label classification accuracy of 94.14%. This work supports the broader development of VidLLMs by contributing to (1) the creation of datasets for video quality evaluation, (2) interpretable artifact-based analysis beyond language metrics, and (3) the identification of visual risks relevant to factuality and safety.
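The abstract reports an "average multi-label classification accuracy" over four artifact labels without spelling out the formula. One plausible reading, sketched below, is per-label binary accuracy averaged over the labels. The function names and the toy annotations are hypothetical, and the paper may define the metric differently.

```python
def per_label_accuracy(y_true, y_pred):
    """Return one accuracy value per label column.

    Each row is a frame; each column is a 0/1 flag for one artifact label.
    """
    n_labels = len(y_true[0])
    accuracies = []
    for j in range(n_labels):
        correct = sum(1 for t, p in zip(y_true, y_pred) if t[j] == p[j])
        accuracies.append(correct / len(y_true))
    return accuracies


def mean_label_accuracy(y_true, y_pred):
    """Average the per-label accuracies (assumed definition, see lead-in)."""
    accuracies = per_label_accuracy(y_true, y_pred)
    return sum(accuracies) / len(accuracies)


# Toy data: 3 frames x 4 labels (boundary, texture, movement, object).
y_true = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1]]
score = mean_label_accuracy(y_true, y_pred)
```

In this toy case the first three labels are predicted perfectly and the fourth is right on one frame out of three, so the average is pulled down by a single weak label. That is one reason to inspect per-label values alongside the mean.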

[21] UniBiomed: A Universal Foundation Model for Grounded Biomedical Image Interpretation

Linshan Wu,Yuxiang Nie,Sunan He,Jiaxin Zhuang,Hao Chen

Main category: cs.CV

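TL;DR: UniBiomed is a universal foundation model built on a multimodal large language model (MLLM) and the Segment Anything Model (SAM) for grounded interpretation of biomedical images; it unifies clinical text generation and target segmentation, with the aim of improving diagnostic efficiency.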

Details

Motivation: Conventional AI approaches to biomedical image analysis rely on disjointed training (e.g., LLMs for text generation and segmentation models for target extraction), which makes deployment inflexible and fails to exploit holistic biomedical information. Method: UniBiomed integrates an MLLM with SAM to unify clinical text generation and biomedical object segmentation, supporting a wide range of tasks across 10 biomedical imaging modalities. Result: Validated on 84 internal and external datasets, UniBiomed achieves state-of-the-art performance in segmentation, disease recognition, region-aware diagnosis, visual question answering, and report generation. Conclusion: UniBiomed is a breakthrough for biomedical AI, providing automated, end-to-end grounded interpretation for more accurate and efficient image analysis. Abstract: Multi-modal interpretation of biomedical images opens up novel opportunities in biomedical image analysis. Conventional AI approaches typically rely on disjointed training, i.e., Large Language Models (LLMs) for clinical text generation and segmentation models for target extraction, which results in inflexible real-world deployment and a failure to leverage holistic biomedical information. To this end, we introduce UniBiomed, the first universal foundation model for grounded biomedical image interpretation. UniBiomed is based on a novel integration of Multi-modal Large Language Model (MLLM) and Segment Anything Model (SAM), which effectively unifies the generation of clinical texts and the segmentation of corresponding biomedical objects for grounded interpretation. In this way, UniBiomed is capable of tackling a wide range of biomedical tasks across ten diverse biomedical imaging modalities. To develop UniBiomed, we curate a large-scale dataset comprising over 27 million triplets of images, annotations, and text descriptions across ten imaging modalities. Extensive validation on 84 internal and external datasets demonstrated that UniBiomed achieves state-of-the-art performance in segmentation, disease recognition, region-aware diagnosis, visual question answering, and report generation. Moreover, unlike previous models that rely on clinical experts to pre-diagnose images and manually craft precise textual or visual prompts, UniBiomed can provide automated and end-to-end grounded interpretation for biomedical image analysis. This represents a novel paradigm shift in clinical workflows, which will significantly improve diagnostic efficiency. In summary, UniBiomed represents a novel breakthrough in biomedical AI, unlocking powerful grounded interpretation capabilities for more accurate and efficient biomedical image analysis.

[22] Towards Improved Cervical Cancer Screening: Vision Transformer-Based Classification and Interpretability

Khoa Tuan Nguyen,Ho-min Park,Gaeun Oh,Joris Vankerschaver,Wesley De Neve

Main category: cs.CV

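TL;DR: Proposes a new approach to cervical cell image classification based on the EVA-02 transformer model; a four-step pipeline improves performance, reaching an F1-score of 0.85227 and outperforming the baseline model.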

Details

Motivation: Improve cell image classification performance in cervical cancer screening and provide interpretable grounds for model decisions. Method: A four-step pipeline: fine-tuning EVA-02, feature extraction, feature selection using multiple models, and training an artificial neural network with optional loss weighting. Result: The best model achieves an F1-score of 0.85227, outperforming the baseline model (0.84878); Kernel SHAP analysis identifies key features. Conclusion: The method outperforms the baseline in both performance and interpretability, offering an effective tool for cervical cancer screening. Abstract: We propose a novel approach to cervical cell image classification for cervical cancer screening using the EVA-02 transformer model. We developed a four-step pipeline: fine-tuning EVA-02, feature extraction, selecting important features through multiple machine learning models, and training a new artificial neural network with optional loss weighting for improved generalization. With this design, our best model achieved an F1-score of 0.85227, outperforming the baseline EVA-02 model (0.84878). We also utilized Kernel SHAP analysis and identified key features correlating with cell morphology and staining characteristics, providing interpretable insights into the decision-making process of the fine-tuned model. Our code is available at https://github.com/Khoa-NT/isbi2025_ps3c.

[23] Vision-Language Model-Based Semantic-Guided Imaging Biomarker for Early Lung Cancer Detection

Luoting Zhuang,Seyed Mohammad Hossein Tabatabaei,Ramin Salehi-Rad,Linh M. Tran,Denise R. Aberle,Ashley E. Prosper,William Hsu

Main category: cs.CV

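TL;DR: The study integrates semantic features from radiologists' assessments and uses a CLIP model to predict lung cancer, improving the model's interpretability and generalization.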

Details

Motivation: Existing machine learning models rely on manual annotation, have limited interpretability, and are sensitive to imaging variations, which limits their clinical use. Method: Using multiple datasets, a parameter-efficient fine-tuning approach aligns imaging and semantic features to predict lung cancer diagnosis within one year. Result: The model achieves an AUROC of 0.90 and an AUPRC of 0.78, outperforming baseline models, and can explain nodule characteristics. Conclusion: The method accurately classifies lung nodules and provides explainable outputs suitable for clinical settings. Abstract: Objective: A number of machine learning models have utilized semantic features, deep features, or both to assess lung nodule malignancy. However, their reliance on manual annotation during inference, limited interpretability, and sensitivity to imaging variations hinder their application in real-world clinical settings. Thus, this research aims to integrate semantic features derived from radiologists' assessments of nodules, allowing the model to learn clinically relevant, robust, and explainable features for predicting lung cancer. Methods: We obtained 938 low-dose CT scans from the National Lung Screening Trial with 1,246 nodules and semantic features. The Lung Image Database Consortium dataset contains 1,018 CT scans, with 2,625 lesions annotated for nodule characteristics. Three external datasets were obtained from UCLA Health, the LUNGx Challenge, and the Duke Lung Cancer Screening. We fine-tuned a pretrained Contrastive Language-Image Pretraining (CLIP) model with a parameter-efficient fine-tuning approach to align imaging and semantic features and predict the one-year lung cancer diagnosis. Results: We evaluated the performance of the one-year diagnosis of lung cancer with AUROC and AUPRC and compared it to three state-of-the-art models. Our model demonstrated an AUROC of 0.90 and AUPRC of 0.78, outperforming baseline state-of-the-art models on external datasets. Using CLIP, we also obtained predictions on semantic features, such as nodule margin (AUROC: 0.81), nodule consistency (0.81), and pleural attachment (0.84), that can be used to explain model predictions. Conclusion: Our approach accurately classifies lung nodules as benign or malignant, providing explainable outputs that help clinicians understand the underlying meaning of model predictions. This approach also prevents the model from learning shortcuts and generalizes across clinical settings.

[24] Nexus-Gen: A Unified Model for Image Understanding, Generation, and Editing

Hong Zhang,Zhongjie Duan,Xingjun Wang,Yingda Chen,Yuze Zhao,Yu Zhang

Main category: cs.CV

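TL;DR: Nexus-Gen is a unified multimodal large language model that combines the language reasoning of LLMs with the image synthesis of diffusion models to narrow the performance gap between existing open-source unified models and domain-specific architectures.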

Details

Motivation: Existing open-source unified models show performance gaps in multimodal understanding and generation, calling for a more effective unified framework. Method: Dual-phase alignment training: (1) the LLM learns to predict image embeddings; (2) the vision decoder reconstructs images from those embeddings. A prefilled autoregression strategy is introduced to avoid error accumulation. Result: Nexus-Gen attains integrated capabilities for image understanding, generation, and editing. Conclusion: Through dual-phase training and the prefilling strategy, Nexus-Gen integrates multimodal capabilities, and its models and code are open-sourced to advance the field. Abstract: Unified multimodal large language models (MLLMs) aim to integrate multimodal understanding and generation abilities through a single framework. Despite their versatility, existing open-source unified models exhibit performance gaps against domain-specific architectures. To bridge this gap, we present Nexus-Gen, a unified model that synergizes the language reasoning capabilities of LLMs with the image synthesis power of diffusion models. To align the embedding space of the LLM and diffusion model, we conduct a dual-phase alignment training process. (1) The autoregressive LLM learns to predict image embeddings conditioned on multimodal inputs, while (2) the vision decoder is trained to reconstruct high-fidelity images from these embeddings. While training the LLM, we identified a critical discrepancy between the autoregressive paradigm's training and inference phases, where error accumulation in continuous embedding space severely degrades generation quality. To avoid this issue, we introduce a prefilled autoregression strategy that prefills the input sequence with position-embedded special tokens instead of continuous embeddings. Through dual-phase training, Nexus-Gen has developed the integrated capability to comprehensively address image understanding, generation, and editing tasks. All models, datasets, and code are published at https://github.com/modelscope/Nexus-Gen.git to facilitate further advancements across the field.

[25] Revisiting Diffusion Autoencoder Training for Image Reconstruction Quality

Pramook Khungurn,Sukit Seripanitkarn,Phonphrm Thawatdamrongkit,Supasorn Suwajanakorn

Main category: cs.CV

TL;DR: Proposes a new training method for diffusion autoencoders (DAEs) that improves image reconstruction quality through two-phase training.

Details

Motivation: Conventional DAEs use a linear noise schedule that spends too many steps at high noise levels, which leads to blurry images; phased training can improve detail recovery. Method: Two-phase training: in the first phase the DAE is trained at the highest noise level as a vanilla autoencoder, and in the second phase the noise schedule is adjusted to focus on low-noise details. Result: The generated images are more accurate in both structure and detail while preserving useful properties of the latent code. Conclusion: The new method significantly improves the image reconstruction quality of DAEs. Abstract: Diffusion autoencoders (DAEs) are typically formulated as a noise prediction model and trained with a linear-$\beta$ noise schedule that spends many of its sampling steps at high noise levels. Because high noise levels are associated with recovering large-scale image structures and low noise levels with recovering details, this configuration can result in low-quality and blurry images. However, it should be possible to improve details while spending fewer steps recovering structures because the latent code should already contain structural information. Based on this insight, we propose a new DAE training method that improves the quality of reconstructed images. We divide training into two phases. In the first phase, the DAE is trained as a vanilla autoencoder by always setting the noise level to the highest, forcing the encoder and decoder to populate the latent code with structural information. In the second phase, we incorporate a noise schedule that spends more time in the low-noise region, allowing the DAE to learn how to perfect the details. Our method results in images that have accurate high-level structures and low-level details while still preserving useful properties of the latent codes.
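The claim that a linear-$\beta$ schedule spends many of its steps at high noise levels can be checked numerically. The sketch below assumes the commonly used DDPM settings (1000 steps, $\beta$ from 1e-4 to 0.02), which are not stated in the abstract, and treats a step as "noise-dominated" when the cumulative signal weight $\bar\alpha_t$ falls below 0.5, an illustrative threshold chosen here.

```python
def linear_beta_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal weights alpha_bar_t for a linear-beta schedule.

    Defaults are the widely used DDPM values (an assumption, not from the paper).
    """
    alpha_bar = 1.0
    values = []
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        alpha_bar *= 1.0 - beta  # alpha_bar_t = prod_{s<=t} (1 - beta_s)
        values.append(alpha_bar)
    return values


def fraction_noise_dominated(alpha_bars, threshold=0.5):
    """Share of steps whose signal weight alpha_bar_t is below the threshold."""
    return sum(1 for a in alpha_bars if a < threshold) / len(alpha_bars)


alpha_bars = linear_beta_alpha_bar()
high_noise_share = fraction_noise_dominated(alpha_bars)
```

Under these assumptions well over half of the steps are noise-dominated. A second-phase schedule of the kind the abstract describes would shift that share toward the low-noise region.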

[26] IDDM: Bridging Synthetic-to-Real Domain Gap from Physics-Guided Diffusion for Real-world Image Dehazing

Shijun Zhou,Yajing Liu,Chunhui Hao,Zhiyuan Liu,Jiandong Tian

Main category: cs.CV

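TL;DR: The paper proposes a diffusion-based image dehazing method (IDDM) that incorporates the atmospheric scattering model into noise diffusion, addressing the poor real-world generalization of dehazing algorithms trained on synthetic data.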

Details

Motivation: Dehazing algorithms trained on synthetic data perform poorly in real-world scenes, mainly because of the domain gap between synthetic and real hazy images. Method: IDDM introduces the atmospheric scattering model into the noise diffusion process and uses a specialized training strategy in which the diffusion model and the physical model jointly guide dehazing. Result: Experiments show that IDDM, trained on synthetic data, generalizes effectively to real-world scenes and dehazes better than existing methods. Conclusion: By combining a physical model with a diffusion model, IDDM substantially improves the real-world generalization of dehazing algorithms. Abstract: Due to the domain gap between real-world and synthetic hazy images, current data-driven dehazing algorithms trained on synthetic datasets perform well on synthetic data but struggle to generalize to real-world scenarios. To address this challenge, we propose Image Dehazing Diffusion Models (IDDM), a novel diffusion process that incorporates the atmospheric scattering model into noise diffusion. IDDM aims to use the gradual haze formation process to help the denoising U-Net robustly learn the distribution of clear images from the conditional input hazy images. We design a specialized training strategy centered around IDDM. Diffusion models are leveraged to bridge the domain gap from synthetic to real-world, while the atmospheric scattering model provides physical guidance for haze formation. During the forward process, IDDM simultaneously introduces haze and noise into clear images, and then robustly separates them during the sampling process. By training with physics-guided information, IDDM shows the ability of domain generalization, and effectively restores real-world hazy images despite being trained on synthetic datasets. Extensive experiments demonstrate the effectiveness of our method through both quantitative and qualitative comparisons with state-of-the-art approaches.
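For readers unfamiliar with the atmospheric scattering model named in the abstract, the sketch below shows its standard single-pixel form, $I = J\,t + A\,(1 - t)$ with transmission $t = e^{-\beta d}$, and its algebraic inversion. This illustrates only the physical model, not the IDDM diffusion process; the function names, the airlight value, and the clamp `t_min` are hypothetical choices.

```python
import math


def transmission(depth, beta=1.0):
    """Transmission t = exp(-beta * depth) for scattering coefficient beta."""
    return math.exp(-beta * depth)


def add_haze(clear, depth, airlight=0.9, beta=1.0):
    """Hazy intensity I = J * t + A * (1 - t) for a single pixel value J."""
    t = transmission(depth, beta)
    return clear * t + airlight * (1.0 - t)


def remove_haze(hazy, depth, airlight=0.9, beta=1.0, t_min=0.05):
    """Invert the model: J = (I - A * (1 - t)) / t.

    t is clamped at t_min (an illustrative choice) so that very distant
    pixels do not cause division by a value close to zero.
    """
    t = max(transmission(depth, beta), t_min)
    return (hazy - airlight * (1.0 - t)) / t


hazy_pixel = add_haze(0.3, depth=0.5)
restored_pixel = remove_haze(hazy_pixel, depth=0.5)
far_pixel = add_haze(0.3, depth=50.0)  # approaches the airlight value
```

The inversion is exact only when depth and airlight are known. On real images both are unknown, which is where a learned model such as the one described in the abstract comes in.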

[27] Comparison of Different Deep Neural Network Models in the Cultural Heritage Domain

Teodor Boyadzhiev,Gabriele Lagani,Luca Ciampi,Giuseppe Amato,Krassimira Ivanova

Main category: cs.CV

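TL;DR: Compares the knowledge-transfer ability of convolutional neural networks and Transformer architectures on cultural heritage tasks, finding that DenseNet has the best efficiency-computability ratio.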

Details

Motivation: Examine the use of computer vision and deep learning for preserving cultural heritage and improving visitor experiences, and compare the performance of the two mainstream deep learning architectures. Method: The architectures VGG, ResNet, DenseNet, Visual Transformer, Swin Transformer, and PoolFormer are tested on their ability to transfer knowledge from a generic dataset (such as ImageNet) to cultural heritage tasks. Result: DenseNet has the best efficiency-computability ratio. Conclusion: DenseNet is the best choice for knowledge transfer in cultural heritage tasks because of its computational efficiency. Abstract: The integration of computer vision and deep learning is an essential part of documenting and preserving cultural heritage, as well as improving visitor experiences. In recent years, two deep learning paradigms have been established in the field of computer vision: convolutional neural networks and transformer architectures. The present study compares representatives of these two paradigms in their ability to transfer knowledge from a generic dataset, such as ImageNet, to cultural heritage specific tasks. Tests of the architectures VGG, ResNet, DenseNet, Visual Transformer, Swin Transformer, and PoolFormer showed that DenseNet is the best in terms of efficiency-computability ratio.

[28] Static or Dynamic: Towards Query-Adaptive Token Selection for Video Question Answering

Yumeng Shi,Quanyu Long,Wenya Wang

Main category: cs.CV

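TL;DR: Proposes a token selection strategy called EXPLORE-THEN-SELECT that adaptively balances static and dynamic information to make token use more efficient in video question answering.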

Details

Motivation: Address the memory inefficiency and performance degradation caused by the large number of tokens in long videos, and the failure of existing methods to account for how different queries need different amounts of static and dynamic information. Method: The EXPLORE-THEN-SELECT strategy first explores token allocations between static and dynamic frames, then selects the best token combination using a query-aware attention-based metric. Result: Performance improves significantly (by up to 5.8%) on multiple video question answering benchmarks. Conclusion: The framework improves performance without model updates and is easy to integrate into a variety of video-language models. Abstract: Video question answering benefits from the rich information available in videos, enabling a wide range of applications. However, the large volume of tokens generated from longer videos presents significant challenges to memory efficiency and model performance. To alleviate this issue, existing works propose to compress video inputs, but usually overlook the varying importance of static and dynamic information across different queries, leading to inefficient token usage within limited budgets. To tackle this, we propose a novel token selection strategy, EXPLORE-THEN-SELECT, that adaptively adjusts the static and dynamic information needed based on question requirements. Our framework first explores different token allocations between static frames, which preserve spatial details, and dynamic frames, which capture temporal changes. Next, it employs a query-aware attention-based metric to select the optimal token combination without model updates. Our proposed framework is plug-and-play and can be seamlessly integrated within diverse video-language models. Extensive experiments show that our method achieves significant performance improvements (up to 5.8%) across various video question answering benchmarks.

[29] Adapting In-Domain Few-Shot Segmentation to New Domains without Retraining

Qi Fan,Kaiqi Liu,Nian Liu,Hisham Cholakkal,Rao Muhammad Anwer,Wenbin Li,Yang Gao

Main category: cs.CV

TL;DR: Proposes a retraining-free method (ISA) that adaptively adjusts model structures to tackle cross-domain few-shot segmentation.

Details

Motivation: Cross-domain few-shot segmentation (CD-FSS) faces diverse target domains and limited support data, and existing methods usually require costly retraining. Method: Domain-specific model structures are adaptively identified with a structure Fisher score and then progressively trained with hierarchically constructed support samples. Result: The method performs strongly on multiple CD-FSS benchmarks, validating its effectiveness. Conclusion: ISA addresses domain shift and gives existing models flexible cross-domain adaptation capabilities. Abstract: Cross-domain few-shot segmentation (CD-FSS) aims to segment objects of novel classes in new domains, which is often challenging due to the diverse characteristics of target domains and the limited availability of support data. Most CD-FSS methods redesign and retrain in-domain FSS models using various domain-generalization techniques, which are effective but costly to train. To address these issues, we propose adapting informative model structures of the well-trained FSS model for target domains by learning domain characteristics from few-shot labeled support samples during inference, thereby eliminating the need for retraining. Specifically, we first adaptively identify domain-specific model structures by measuring parameter importance using a novel structure Fisher score in a data-dependent manner. Then, we progressively train the selected informative model structures with hierarchically constructed training samples, progressing from fewer to more support shots. The resulting Informative Structure Adaptation (ISA) method effectively addresses domain shifts and equips existing well-trained in-domain FSS models with flexible adaptation capabilities for new domains, eliminating the need to redesign or retrain CD-FSS models on base data. Extensive experiments validate the effectiveness of our method, demonstrating superior performance across multiple CD-FSS benchmarks.

[30] Diff-Prompt: Diffusion-Driven Prompt Generator with Mask Supervision

Weicai Yan,Wang Lin,Zirun Guo,Ye Wang,Fangming Feng,Xiaoda Yang,Zehan Wang,Tao Jin

Main category: cs.CV

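TL;DR: The paper proposes Diff-Prompt, which uses a diffusion model to generate rich, fine-grained prompt information and improve performance on complex downstream tasks.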

Details

Motivation: Existing prompt learning methods perform poorly on complex, fine-grained tasks because they directly optimize the prompt generation parameters, which limits the richness and specificity of the prompts. Method: Three stages: (1) train a Mask-VAE to compress masks; (2) train a prompt generator in latent space with an improved DiT; (3) align the denoising process in semantic space and fine-tune the model. Result: On referring expression comprehension, Diff-Prompt improves R@1 by up to 8.87 and R@5 by up to 14.05 over the foundation model and outperforms other methods. Conclusion: Diff-Prompt demonstrates the potential of generative models for prompt generation and markedly improves performance on complex tasks. Abstract: Prompt learning has demonstrated promising results in fine-tuning pre-trained multimodal models. However, the performance improvement is limited when applied to more complex and fine-grained tasks. The reason is that most existing methods directly optimize the parameters involved in the prompt generation process through loss backpropagation, which constrains the richness and specificity of the prompt representations. In this paper, we propose Diffusion-Driven Prompt Generator (Diff-Prompt), aiming to use the diffusion model to generate rich and fine-grained prompt information for complex downstream tasks. Specifically, our approach consists of three stages. In the first stage, we train a Mask-VAE to compress the masks into latent space. In the second stage, we leverage an improved Diffusion Transformer (DiT) to train a prompt generator in the latent space, using the masks for supervision. In the third stage, we align the denoising process of the prompt generator with the pre-trained model in the semantic space, and use the generated prompts to fine-tune the model. We conduct experiments on a complex pixel-level downstream task, referring expression comprehension, and compare our method with various parameter-efficient fine-tuning approaches. Diff-Prompt achieves a maximum improvement of 8.87 in R@1 and 14.05 in R@5 compared to the foundation model and also outperforms other state-of-the-art methods across multiple metrics. The experimental results validate the effectiveness of our approach and highlight the potential of using generative models for prompt generation. Code is available at https://github.com/Kelvin-ywc/diff-prompt.

[31] SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding

Chenkai Zhang,Yiming Lei,Zeming Liu,Haitao Leng,ShaoGuo Liu,Tingting Gao,Qingjie Liu,Yunhong Wang

Main category: cs.CV

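TL;DR: The paper introduces SeriesBench, a benchmark for evaluating how well multimodal large language models (MLLMs) understand narrative-driven series, and proposes the PC-DCoT framework to improve model performance.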

Details

Motivation: Existing benchmarks focus on standalone videos and visual elements, whereas real-world videos often carry complex, continuous narratives that require deeper narrative understanding. Method: SeriesBench is built from a curated, diverse set of drama series using a long-span narrative annotation method and a full-information transformation approach; the PC-DCoT framework is proposed to strengthen analysis of plot structures and character relationships. Result: Experiments show that existing MLLMs still struggle with narrative understanding, while PC-DCoT improves their performance. Conclusion: SeriesBench and PC-DCoT underline the need to improve narrative understanding in models and offer guidance for the future development of MLLMs. Abstract: With the rapid development of Multi-modal Large Language Models (MLLMs), an increasing number of benchmarks have been established to evaluate the video understanding capabilities of these models. However, these benchmarks focus on standalone videos and mainly assess "visual elements" like human actions and object states. In reality, contemporary videos often encompass complex and continuous narratives, typically presented as a series. To address this challenge, we propose SeriesBench, a benchmark consisting of 105 carefully curated narrative-driven series, covering 28 specialized tasks that require deep narrative understanding. Specifically, we first select a diverse set of drama series spanning various genres. Then, we introduce a novel long-span narrative annotation method, combined with a full-information transformation approach to convert manual annotations into diverse task formats. To further enhance model capacity for detailed analysis of plot structures and character relationships within series, we propose a novel narrative reasoning framework, PC-DCoT. Extensive results on SeriesBench indicate that existing MLLMs still face significant challenges in understanding narrative-driven series, while PC-DCoT enables these MLLMs to achieve performance improvements. Overall, our SeriesBench and PC-DCoT highlight the critical necessity of advancing model capabilities to understand narrative-driven series, guiding the future development of MLLMs. SeriesBench is publicly available at https://github.com/zackhxn/SeriesBench-CVPR2025.

[32] Rethinking Visual Layer Selection in Multimodal LLMs

Haoran Chen,Junyan Lin,Xinhao Chen,Yue Fan,Xin Jin,Hui Su,Jianfeng Dong,Jinlan Fu,Xiaoyu Shen

Main category: cs.CV

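TL;DR: The paper proposes a method based on layer-wise representation similarity to group the visual layers of CLIP-ViT and evaluates their effect on multimodal large language model (MLLM) performance. Experiments show that different tasks need visual features from different depths, and that a lightweight fusion approach performs best.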

Details

Motivation: Visual layer selection in MLLMs currently lacks systematic analysis and relies mostly on empirical heuristics; this work aims to put visual representation learning on a more principled footing through a layer-wise study. Method: A Layer-wise Representation Similarity approach groups CLIP-ViT layers into shallow, middle, and deep categories, and experiments are run on LLaVA-style models of different sizes. Result: (1) Deep layers are essential for OCR tasks; (2) shallow and middle layers perform better on reasoning tasks such as counting and positioning; (3) a lightweight fusion approach outperforms single-layer selection on most datasets. Conclusion: This is the first systematic study of visual layer selection in MLLMs and lays the groundwork for further research on visual representation learning. Abstract: Multimodal large language models (MLLMs) have achieved impressive performance across a wide range of tasks, typically using CLIP-ViT as their visual encoder due to its strong text-image alignment capabilities. While prior studies suggest that different CLIP-ViT layers capture different types of information, with shallower layers focusing on fine visual details and deeper layers aligning more closely with textual semantics, most MLLMs still select visual features based on empirical heuristics rather than systematic analysis. In this work, we propose a Layer-wise Representation Similarity approach to group CLIP-ViT layers with similar behaviors into shallow, middle, and deep categories and assess their impact on MLLM performance. Building on this foundation, we revisit the visual layer selection problem in MLLMs at scale, training LLaVA-style models ranging from 1.4B to 7B parameters. Through extensive experiments across 10 datasets and 4 tasks, we find that: (1) deep layers are essential for OCR tasks; (2) shallow and middle layers substantially outperform deep layers on reasoning tasks involving counting, positioning, and object localization; (3) a lightweight fusion of features across shallow, middle, and deep layers consistently outperforms specialized fusion baselines and single-layer selections, achieving gains on 9 out of 10 datasets. Our work offers the first principled study of visual layer selection in MLLMs, laying the groundwork for deeper investigations into visual representation learning for MLLMs.

[33] VR-FuseNet: A Fusion of Heterogeneous Fundus Data and Explainable Deep Network for Diabetic Retinopathy Classification

Shamim Rahim Refat,Ziyan Shirin Raha,Shuvashis Sarker,Faika Fairuj Preotee,MD. Musfikur Rahman,Tashreef Muhammad,Mohammad Shafiul Islam

Main category: cs.CV

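TL;DR: The paper proposes VR-FuseNet, a hybrid deep learning model for automated diabetic retinopathy detection that combines the strengths of VGG19 and ResNet50V2, reaches 91.824% accuracy, and uses XAI techniques to make the model more interpretable.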

Details

Motivation: Diabetic retinopathy is a leading cause of blindness among diabetic patients, and existing methods suffer from dataset imbalance and limited generalization, so more accurate and efficient automated detection is needed. Method: The VR-FuseNet model combines VGG19 and ResNet50V2; a hybrid dataset is used, with SMOTE and CLAHE applied as preprocessing. Result: The model reaches 91.824% accuracy, outperforms the individual architectures, and provides visual explanations through XAI techniques. Conclusion: VR-FuseNet performs well on diabetic retinopathy classification, and the XAI techniques improve its clinical usefulness. Abstract: Diabetic retinopathy is a severe eye condition caused by diabetes in which the retinal blood vessels are damaged; it can lead to vision loss and blindness if not treated. Early and accurate detection is key to intervention and to stopping the disease from progressing. To address this disease properly, this paper presents a comprehensive approach for automated diabetic retinopathy detection by proposing a new hybrid deep learning model called VR-FuseNet. Diabetic retinopathy is a major eye disease and a leading cause of blindness, especially among diabetic patients, so accurate and efficient automated detection methods are required. To address the limitations of existing methods, including dataset imbalance, diversity, and generalization issues, this paper presents a hybrid dataset created from five publicly available diabetic retinopathy datasets. Essential preprocessing techniques such as SMOTE for class balancing and CLAHE for image enhancement are applied systematically to the dataset to improve its robustness and generalizability. The proposed VR-FuseNet model combines the strengths of two state-of-the-art convolutional neural networks: VGG19, which captures fine-grained spatial features, and ResNet50V2, which is known for its deep hierarchical feature extraction. This fusion improves the diagnostic performance and achieves an accuracy of 91.824%. The model outperforms the individual architectures on all performance metrics, demonstrating the effectiveness of hybrid feature extraction in diabetic retinopathy classification tasks. To make the proposed model more clinically useful and interpretable, this paper incorporates multiple XAI techniques. These techniques generate visual explanations that clearly indicate the retinal features affecting the model's prediction, such as microaneurysms, hemorrhages, and exudates, so that clinicians can interpret and validate them.
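The abstract mentions SMOTE for class balancing without describing how it is applied. As background, the sketch below shows the core SMOTE step on toy feature vectors: a synthetic minority sample is drawn on the line segment between a minority point and one of its minority-class neighbours. For brevity it uses only the single nearest neighbour, whereas standard SMOTE picks among k nearest neighbours. All names and data are hypothetical, and this is not the paper's pipeline.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def smote_sample(minority, rng):
    """Create one synthetic point between a minority sample and its nearest
    minority neighbour (k = 1 simplification of SMOTE)."""
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    neighbor = min(others, key=lambda q: squared_distance(base, q))
    lam = rng.random()  # interpolation weight in [0, 1)
    return [a + lam * (b - a) for a, b in zip(base, neighbor)]


minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
rng = random.Random(0)
synthetic = [smote_sample(minority_points, rng) for _ in range(5)]
```

Because each synthetic point is a convex combination of two real minority samples, it stays inside the region the minority class already occupies.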

[34] Multiview Point Cloud Registration via Optimization in an Autoencoder Latent Space

Luc Vedrenne,Sylvain Faisan,Denis Fortun

Main category: cs.CV

TL;DR: POLAR是一种多视角点云刚性配准方法，通过潜在空间转换和优化策略，解决了现有方法在大视角、高退化和大初始角度下的局限性。

Details

Motivation: Existing multiview point cloud registration methods scale poorly and cope badly with heavy degradation and large transformations. Method: The registration problem is transposed into the latent space of a pretrained autoencoder, a loss that accounts for degradations is designed, and a multistart optimization strategy is used. Result: POLAR significantly outperforms existing methods on synthetic and real data. Conclusion: POLAR is an efficient, robust multiview point cloud registration method suited to challenging scenes. Abstract: Point cloud rigid registration is a fundamental problem in 3D computer vision. In the multiview case, we aim to find a set of 6D poses to align a set of objects. Methods based on pairwise registration rely on a subsequent synchronization algorithm, which makes them poorly scalable with the number of views. Generative approaches overcome this limitation, but are based on Gaussian Mixture Models and use an Expectation-Maximization algorithm. Hence, they are not well suited to handle large transformations. Moreover, most existing methods cannot handle high levels of degradations. In this paper, we introduce POLAR (POint cloud LAtent Registration), a multiview registration method able to efficiently deal with a large number of views, while being robust to a high level of degradations and large initial angles. To achieve this, we transpose the registration problem into the latent space of a pretrained autoencoder, design a loss taking degradations into account, and develop an efficient multistart optimization strategy. Our proposed method significantly outperforms state-of-the-art approaches on synthetic and real data. POLAR is available at github.com/pypolar/polar or as a standalone package which can be installed with pip install polaregistration.

[35] Quaternion Nuclear Norms Over Frobenius Norms Minimization for Robust Matrix Completion

Yu Guo,Guoqing Chen,Tieyong Zeng,Qiyu Jin,Michael Kwok-Po Ng

Main category: cs.CV

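TL;DR: The paper proposes QNOF, a new nonconvex approximation of the rank of quaternion matrices, and demonstrates its advantages in robust quaternion matrix completion.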

Details

Motivation: Recovering hidden structure in multi-dimensional data representations is challenging, and quaternion matrices are a promising way to model such data. Method: QNOF is proposed as a nonconvex approximation of quaternion matrix rank; quaternion singular value decomposition simplifies the problem, and the approach is extended to robust matrix completion. Result: QNOF performs strongly in numerical experiments, outperforming existing quaternion methods. Conclusion: QNOF is an effective parameter-free, scale-invariant method suited to multi-dimensional data recovery problems. Abstract: Recovering hidden structures from incomplete or noisy data remains a pervasive challenge across many fields, particularly where multi-dimensional data representation is essential. Quaternion matrices, with their ability to naturally model multi-dimensional data, offer a promising framework for this problem. This paper introduces the quaternion nuclear norm over the Frobenius norm (QNOF) as a novel nonconvex approximation for the rank of quaternion matrices. QNOF is parameter-free and scale-invariant. Utilizing quaternion singular value decomposition, we prove that solving the QNOF can be simplified to solving the singular value $L_1/L_2$ problem. Additionally, we extend the QNOF to robust quaternion matrix completion, employing the alternating direction method of multipliers (ADMM) to derive solutions that guarantee weak convergence under mild conditions. Extensive numerical experiments validate the proposed model's superiority, consistently outperforming state-of-the-art quaternion methods.
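Since the nuclear norm is the sum of the singular values and the Frobenius norm is their Euclidean norm, the ratio described in the abstract can be illustrated directly on a vector of singular values. The sketch below works with real singular values supplied by hand and does not compute a quaternion SVD; the function name is hypothetical. It is meant only to show why the ratio is scale-invariant and why it behaves like a rank surrogate.

```python
import math


def nuclear_over_frobenius(singular_values):
    """Ratio ||s||_1 / ||s||_2 for a vector of singular values."""
    l1 = sum(abs(s) for s in singular_values)
    l2 = math.sqrt(sum(s * s for s in singular_values))
    if l2 == 0.0:
        raise ValueError("ratio is undefined for the zero matrix")
    return l1 / l2


rank_one = nuclear_over_frobenius([3.0, 0.0, 0.0])        # one nonzero value
rank_four = nuclear_over_frobenius([1.0, 1.0, 1.0, 1.0])  # four equal values
scaled_a = nuclear_over_frobenius([2.0, 1.0, 0.5])
scaled_b = nuclear_over_frobenius([20.0, 10.0, 5.0])      # same values x 10
```

For r equal nonzero singular values the ratio is sqrt(r), so it grows with rank, and multiplying the matrix by a constant leaves it unchanged. No scale parameter needs tuning, which is consistent with the parameter-free property the abstract claims.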

[36] Robust Orthogonal NMF with Label Propagation for Image Clustering

Jingjing Liu,Nian Wu,Xianchao Xiu,Jianhua Zhang

Main category: cs.CV

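TL;DR: Proposes RONMF, a robust orthogonal non-negative matrix factorization method that combines label propagation with graph Laplacian regularization to improve robustness to noise.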

Details

Motivation: Existing NMF methods are sensitive to noise and make poor use of limited supervised information. Method: The RONMF framework introduces a non-convex structure, orthogonal constraints, and an ADMM-based optimization algorithm. Result: On 8 public image datasets the method outperforms existing approaches and shows strong robustness. Conclusion: RONMF is an efficient, robust improvement on NMF suited to image clustering. Abstract: Non-negative matrix factorization (NMF) is a popular unsupervised learning approach widely used in image clustering. However, in real-world clustering scenarios, most existing NMF methods are highly sensitive to noise corruption and are unable to effectively leverage limited supervised information. To overcome these drawbacks, we propose a unified non-convex framework with label propagation called robust orthogonal nonnegative matrix factorization (RONMF). This method not only considers the graph Laplacian and label propagation as regularization terms but also introduces a more effective non-convex structure to measure the reconstruction error and imposes orthogonal constraints on the basis matrix to reduce the noise corruption, thereby achieving higher robustness. To solve RONMF, we develop an alternating direction method of multipliers (ADMM)-based optimization algorithm. In particular, all subproblems have closed-form solutions, which ensures its efficiency. Experimental evaluations on eight public image datasets demonstrate that the proposed RONMF outperforms state-of-the-art NMF methods across various standard metrics and shows excellent robustness. The code will be available at https://github.com/slinda-liu.

[37] GarmentDiffusion: 3D Garment Sewing Pattern Generation with Multimodal Diffusion Transformers

Xinyu Li,Qi Yao,Yuanda Wang

Main category: cs.CV

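TL;DR: GarmentDiffusion is a new generative model that produces centimeter-precise, vectorized 3D sewing patterns from multimodal inputs (text, image, and incomplete sewing patterns) and generates them 100x faster than SewingGPT.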

Details

Motivation: Existing sewing pattern generation methods rely on a single input modality or are not efficient enough, so a more efficient multimodal approach is needed. Method: Encodes 3D sewing pattern parameters into compact edge token representations and uses a diffusion transformer to denoise all edge tokens simultaneously. Result: Achieves new state-of-the-art results on DressCodeData and GarmentCodeData and generates patterns 100x faster than SewingGPT. Conclusion: GarmentDiffusion offers an efficient multimodal solution for sewing pattern generation, with substantially better speed and precision. Abstract: Garment sewing patterns are fundamental design elements that bridge the gap between design concepts and practical manufacturing. The generative modeling of sewing patterns is crucial for creating diversified garments. However, existing approaches are limited either by reliance on a single input modality or by suboptimal generation efficiency. In this work, we present GarmentDiffusion, a new generative model capable of producing centimeter-precise, vectorized 3D sewing patterns from multimodal inputs (text, image, and incomplete sewing pattern). Our method efficiently encodes 3D sewing pattern parameters into compact edge token representations, achieving a sequence length that is $10\times$ shorter than that of the autoregressive SewingGPT in DressCode. By employing a diffusion transformer, we simultaneously denoise all edge tokens along the temporal axis, while maintaining a constant number of denoising steps regardless of dataset-specific edge and panel statistics. With all of our design choices combined, the sewing pattern generation speed is accelerated by $100\times$ compared to SewingGPT. We achieve new state-of-the-art results on DressCodeData, as well as on the largest sewing pattern dataset, namely GarmentCodeData. The project website is available at https://shenfu-research.github.io/Garment-Diffusion/.

[38] CAE-DFKD: Bridging the Transferability Gap in Data-Free Knowledge Distillation

Zherui Zhang,Changwei Wang,Rongtao Xu,Wenhao Xu,Shibiao Xu,Yu Zhang,Li Guo

Main category: cs.CV

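TL;DR: Proposes CAE-DFKD, a data-free knowledge distillation method that works at the embedding level to address the weak generalization of existing methods, with strong efficiency and performance.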

Details

Motivation: Existing data-free knowledge distillation methods focus mainly on image recognition performance and neglect the transferability of the learned representations. This paper aims to improve generalization through embedding-level changes. Method: Proposes Category-Aware Embedding Data-Free Knowledge Distillation (CAE-DFKD), which alters the generator training paradigm to improve representation learning at the embedding level. Result: CAE-DFKD performs well on image recognition tasks and clearly outperforms existing methods in the transferability of learned representations. Conclusion: CAE-DFKD is efficient and flexible for data-free knowledge distillation and offers a new direction for model generalization and representation transfer. Abstract: Data-Free Knowledge Distillation (DFKD) enables the knowledge transfer from the given pre-trained teacher network to the target student model without access to the real training data. Existing DFKD methods focus primarily on improving image recognition performance on associated datasets, often neglecting the crucial aspect of the transferability of learned representations. In this paper, we propose Category-Aware Embedding Data-Free Knowledge Distillation (CAE-DFKD), which addresses, at the embedding level, the limitations of previous methods that rely on image-level techniques to improve model generalization but fail when directly applied to DFKD. The superiority and flexibility of CAE-DFKD are extensively evaluated, including: (i) significant efficiency advantages resulting from altering the generator training paradigm; (ii) competitive performance with existing DFKD state-of-the-art methods on image recognition tasks; (iii) remarkable transferability of data-free learned representations demonstrated in downstream tasks.

[39] DGSolver: Diffusion Generalist Solver with Universal Posterior Sampling for Image Restoration

Hebaixu Wang,Jing Zhang,Haonan Guo,Di Wang,Jiayi Ma,Bo Du

Main category: cs.CV

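TL;DR: DGSolver is a diffusion generalist solver that improves the accuracy and efficiency of image restoration with high-order solvers and a queue-based accelerated sampling strategy, and uses universal posterior sampling to refine noise estimation.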

Details

Motivation: Existing methods speed up inference by reducing sampling steps, but large step intervals introduce cumulative errors, and it is hard to balance the commonality of degradation representations against restoration quality. Method: Derives the exact ordinary differential equations for generalist diffusion models, designs high-order solvers with a queue-based accelerated sampling strategy, and integrates universal posterior sampling to refine noise estimation. Result: DGSolver outperforms existing methods in restoration accuracy, stability, and scalability. Conclusion: Through efficient sampling and improved noise estimation, DGSolver substantially improves image restoration performance. Abstract: Diffusion models have achieved remarkable progress in universal image restoration. While existing methods speed up inference by reducing sampling steps, substantial step intervals often introduce cumulative errors. Moreover, they struggle to balance the commonality of degradation representations and restoration quality. To address these challenges, we introduce DGSolver, a diffusion generalist solver with universal posterior sampling. We first derive the exact ordinary differential equations for generalist diffusion models and tailor high-order solvers with a queue-based accelerated sampling strategy to improve both accuracy and efficiency. We then integrate universal posterior sampling to better approximate manifold-constrained gradients, yielding a more accurate noise estimation and correcting errors in inverse inference. Extensive experiments show that DGSolver outperforms state-of-the-art methods in restoration accuracy, stability, and scalability, both qualitatively and quantitatively. Code and models will be available at https://github.com/MiliLab/DGSolver.

[40] ClassWise-CRF: Category-Specific Fusion for Enhanced Semantic Segmentation of Remote Sensing Imagery

Qinfeng Zhu,Yunxi Jiang,Lei Fan

Main category: cs.CV

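TL;DR: ClassWise-CRF is a result-level, category-specific fusion architecture that improves semantic segmentation through a two-stage process (expert network selection and adaptive weighted fusion); its effectiveness is validated on two remote sensing datasets.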

Details

Motivation: Multi-network fusion lacks category-specific optimization; the goal is to improve semantic segmentation of remote sensing imagery. Method: A two-stage process: (1) a greedy algorithm selects expert networks that perform well on specific categories; (2) predictions are fused with adaptive weights based on per-category segmentation performance, followed by CRF refinement for spatial consistency and boundary accuracy. Result: mIoU improves by 1.00%/0.68% (validation/test) on LoveDA and by 0.87%/0.91% (validation/test) on Vaihingen. Conclusion: ClassWise-CRF is effective and general for semantic segmentation of remote sensing images, and the code is open source. Abstract: We propose a result-level category-specific fusion architecture called ClassWise-CRF. This architecture employs a two-stage process: first, it selects expert networks that perform well in specific categories from a pool of candidate networks using a greedy algorithm; second, it integrates the segmentation predictions of these selected networks by adaptively weighting their contributions based on their segmentation performance in each category. Inspired by Conditional Random Field (CRF), the ClassWise-CRF architecture treats the segmentation predictions from multiple networks as confidence vector fields. It leverages segmentation metrics (such as Intersection over Union) from the validation set as priors and employs an exponential weighting strategy to fuse the category-specific confidence scores predicted by each network. This fusion method dynamically adjusts the weights of each network for different categories, achieving category-specific optimization. Building on this, the architecture further optimizes the fused results using unary and pairwise potentials in CRF to ensure spatial consistency and boundary accuracy. To validate the effectiveness of ClassWise-CRF, we conducted experiments on two remote sensing datasets, LoveDA and Vaihingen, using eight classic and advanced semantic segmentation networks. The results show that the ClassWise-CRF architecture significantly improves segmentation performance: on the LoveDA dataset, the mean Intersection over Union (mIoU) metric increased by 1.00% on the validation set and by 0.68% on the test set; on the Vaihingen dataset, the mIoU improved by 0.87% on the validation set and by 0.91% on the test set. These results fully demonstrate the effectiveness and generality of the ClassWise-CRF architecture in semantic segmentation of remote sensing images. The full code is available at https://github.com/zhuqinfeng1999/ClassWise-CRF.
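The category-specific exponential weighting described in the abstract can be sketched for a single pixel as follows. This is an assumption-laden illustration, not the authors' code: the exact weighting formula, the sharpness parameter `alpha`, and the function names `classwise_weights` and `fuse_pixel` are all hypothetical, and the CRF refinement step is omitted entirely.

```python
import math


def classwise_weights(val_iou, alpha=1.0):
    """Per-class fusion weights from validation IoU.

    val_iou[n][c] is the validation IoU of network n on class c.
    Assumed form: weight proportional to exp(alpha * IoU), normalized
    across networks separately for each class.
    """
    num_classes = len(val_iou[0])
    weights = [[0.0] * num_classes for _ in val_iou]
    for c in range(num_classes):
        scores = [math.exp(alpha * net[c]) for net in val_iou]
        total = sum(scores)
        for n, score in enumerate(scores):
            weights[n][c] = score / total
    return weights


def fuse_pixel(confidences, weights):
    """Fuse per-network class confidences for one pixel; return scores and label."""
    num_classes = len(confidences[0])
    fused = [
        sum(weights[n][c] * confidences[n][c] for n in range(len(confidences)))
        for c in range(num_classes)
    ]
    label = max(range(num_classes), key=fused.__getitem__)
    return fused, label


# Two hypothetical networks, two classes: network 0 is the class-0 expert,
# network 1 is the class-1 expert (made-up IoU values).
val_iou = [[0.9, 0.2], [0.3, 0.8]]
weights = classwise_weights(val_iou, alpha=5.0)

# Made-up per-pixel confidence vectors from each network.
confidences = [[0.6, 0.4], [0.3, 0.7]]
fused, label = fuse_pixel(confidences, weights)
print(fused, label)
```

With this form, each network dominates the fused score only for the classes on which it validated well, which is the behavior the abstract attributes to the fusion stage.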

[41] Consistency-aware Fake Videos Detection on Short Video Platforms

Junxi Wang,Jize Liu,Na Zhang,Yaxiong Wang

Main category: cs.CV

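TL;DR: Proposes a new method that uses cross-modal contradictions to detect fake news, improving detection with cross-modal consistency learning and a multi-modal collaborative diagnosis module.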

Details

Motivation: Existing fake news detection methods do not fully exploit cross-modal inconsistency as a discriminative feature, which limits detection accuracy. Method: Proposes Cross-modal Consistency Learning (CMCL) and Multi-modal Collaborative Diagnosis (MMCD) modules, covering pseudo-label generation, consistency diagnosis, multimodal feature fusion, and probability score fusion. Result: Performs strongly on the FakeSV and FakeTT benchmarks. Conclusion: Explicitly exploiting cross-modal contradictions substantially improves fake news detection accuracy. Abstract: This paper focuses on detecting fake news on short video platforms. While significant research efforts have been devoted to this task with notable progress in recent years, current detection accuracy remains suboptimal due to the rapid evolution of content manipulation and generation technologies. Existing approaches typically employ a cross-modal fusion strategy that directly combines raw video data with metadata inputs before applying a classification layer. However, our empirical observations reveal a critical oversight: manipulated content frequently exhibits inter-modal inconsistencies that could serve as valuable discriminative features, yet remain underutilized in contemporary detection frameworks. Motivated by this insight, we propose a novel detection paradigm that explicitly identifies and leverages cross-modal contradictions as discriminative cues. Our approach consists of two core modules: Cross-modal Consistency Learning (CMCL) and Multi-modal Collaborative Diagnosis (MMCD). CMCL includes Pseudo-label Generation (PLG) and Cross-modal Consistency Diagnosis (CMCD). In PLG, a Multimodal Large Language Model is used to generate pseudo-labels for evaluating cross-modal semantic consistency. Then, CMCD extracts [CLS] tokens and computes cosine loss to quantify cross-modal inconsistencies. MMCD further integrates multimodal features through Multimodal Feature Fusion (MFF) and Probability Scores Fusion (PSF). MFF employs a co-attention mechanism to enhance semantic interactions across different modalities, while a Transformer is utilized for comprehensive feature fusion. Meanwhile, PSF further integrates the fake news probability scores obtained in the previous step. Extensive experiments on established benchmarks (FakeSV and FakeTT) demonstrate that our model exhibits outstanding performance in fake video detection.
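The abstract says that CMCD computes a cosine loss between [CLS] tokens but does not give the formula. The sketch below shows one common form of such a loss (the same shape as a cosine embedding loss), applied to two plain feature vectors. The margin, the pseudo-label convention, and the function names are assumptions made for illustration and are not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def cosine_consistency_loss(u, v, consistent, margin=0.0):
    """Assumed loss: pull consistent pairs together, push inconsistent pairs apart.

    consistent=True  -> loss = 1 - cos(u, v)
    consistent=False -> loss = max(0, cos(u, v) - margin)
    """
    cos = cosine_similarity(u, v)
    if consistent:
        return 1.0 - cos
    return max(0.0, cos - margin)


# Stand-ins for [CLS] embeddings from two modalities (made-up values).
video_cls = [1.0, 0.0]
text_cls_aligned = [1.0, 0.0]
text_cls_orthogonal = [0.0, 1.0]

print(cosine_consistency_loss(video_cls, text_cls_aligned, consistent=True))
print(cosine_consistency_loss(video_cls, text_cls_orthogonal, consistent=True))
print(cosine_consistency_loss(video_cls, text_cls_aligned, consistent=False))
```

In this sketch the `consistent` flag plays the role of the pseudo-label that the abstract says is generated by a multimodal large language model.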

[42] MagicPortrait: Temporally Consistent Face Reenactment with 3D Geometric Guidance

Mengting Wei,Yante Li,Tuomas Varanka,Yan Jiang,Licai Sun,Guoying Zhao

Main category: cs.CV

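TL;DR: Proposes a video face reenactment method that integrates a 3D face parametric model into a latent diffusion framework to improve shape consistency and motion control.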

Details

Motivation: Existing video face generation methods fall short in shape consistency and motion control and need more precise extraction of geometry and motion features. Method: Uses the FLAME model as the 3D face parametric representation, enhances the latent diffusion model with depth maps, normal maps, and rendering maps, and combines identity and motion features through a multi-layer fusion module. Result: Generates high-quality face animations on benchmark datasets, models expression and head pose changes precisely, and generalizes well to out-of-domain images. Conclusion: Using a 3D face parametric model yields more precise face reenactment, with broad application potential. Abstract: In this paper, we propose a method for video face reenactment that integrates a 3D face parametric model into a latent diffusion framework, aiming to improve shape consistency and motion control in existing video-based face generation approaches. Our approach employs the FLAME (Faces Learned with an Articulated Model and Expressions) model as the 3D face parametric representation, providing a unified framework for modeling face expressions and head pose. This enables precise extraction of detailed face geometry and motion features from driving videos. Specifically, we enhance the latent diffusion model with rich 3D expression and detailed pose information by incorporating depth maps, normal maps, and rendering maps derived from FLAME sequences. A multi-layer face movements fusion module with integrated self-attention mechanisms is used to combine identity and motion latent features within the spatial domain. By utilizing the 3D face parametric model as motion guidance, our method enables parametric alignment of face identity between the reference image and the motion captured from the driving video. Experimental results on benchmark datasets show that our method excels at generating high-quality face animations with precise expression and head pose variation modeling. In addition, it demonstrates strong generalization performance on out-of-domain images. Code is publicly available at https://github.com/weimengting/MagicPortrait.

[43] SAM4EM: Efficient memory-based two stage prompt-free segment anything model adapter for complex 3D neuroscience electron microscopy stacks

Uzair Shah,Marco Agus,Daniya Boges,Vanessa Chiappini,Mahmood Alzubaidi,Jens Schneider,Markus Hadwiger,Pierre J. Magistretti,Mowafa Househ,Corrado Calì

Main category: cs.CV

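TL;DR: SAM4EM is a new approach based on the Segment Anything Model (SAM) for 3D segmentation of complex neural structures in electron microscopy (EM) data; a prompt-free adapter and a dual-stage fine-tuning strategy substantially improve segmentation accuracy.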

Details

Motivation: 3D segmentation of complex neural structures (such as mitochondria, glia, and synapses) in electron microscopy data is difficult, especially when annotated data is limited. Method: Develops a prompt-free adapter that uses two-stage mask decoding to generate prompt embeddings, a dual-stage fine-tuning method based on LoRA, and a 3D memory attention mechanism to keep segmentation consistent. Result: On neuroscience segmentation benchmarks, SAM4EM outperforms existing methods on complex structures such as glia and post-synaptic densities. Conclusion: SAM4EM provides an efficient and accurate solution for segmenting complex neural structures in electron microscopy data; code and models are open source. Abstract: We present SAM4EM, a novel approach for 3D segmentation of complex neural structures in electron microscopy (EM) data by leveraging the Segment Anything Model (SAM) alongside advanced fine-tuning strategies. Our contributions include the development of a prompt-free adapter for SAM using two-stage mask decoding to automatically generate prompt embeddings, a dual-stage fine-tuning method based on Low-Rank Adaptation (LoRA) for enhancing segmentation with limited annotated data, and a 3D memory attention mechanism to ensure segmentation consistency across 3D stacks. We further release a unique benchmark dataset for the segmentation of astrocytic processes and synapses. We evaluated our method on challenging neuroscience segmentation benchmarks, specifically targeting mitochondria, glia, and synapses, with significant accuracy improvements over state-of-the-art (SOTA) methods, including recent SAM-based adapters developed for the medical domain and other vision transformer-based approaches. Experimental results indicate that our approach outperforms existing solutions in the segmentation of complex processes like glia and post-synaptic densities. Our code and models are available at https://github.com/Uzshah/SAM4EM.

[44] Black-Box Visual Prompt Engineering for Mitigating Object Hallucination in Large Vision Language Models

Sangmin Woo,Kang Zhou,Yun Zhou,Shuai Wang,Sheng Guan,Haibo Ding,Lin Lee Cheong

Main category: cs.CV

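TL;DR: The Black-Box Visual Prompt Engineering (BBVPE) framework dynamically selects the best visual prompt and substantially reduces object hallucination in large vision language models (LVLMs).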

Details

Motivation: Large vision language models (LVLMs) often hallucinate objects, which undermines their reliability; a mitigation that needs no access to model internals is required. Method: Proposes the BBVPE framework, which uses a pool of candidate visual prompts and a router model to dynamically select the best prompt, and applies to both open-source and proprietary LVLMs. Result: BBVPE effectively reduces object hallucination on the POPE and CHAIR benchmarks. Conclusion: BBVPE is a model-agnostic solution that substantially improves LVLM reliability. Abstract: Large Vision Language Models (LVLMs) often suffer from object hallucination, which undermines their reliability. Surprisingly, we find that simple object-based visual prompting -- overlaying visual cues (e.g., bounding box, circle) on images -- can significantly mitigate such hallucination; however, different visual prompts (VPs) vary in effectiveness. To address this, we propose Black-Box Visual Prompt Engineering (BBVPE), a framework to identify optimal VPs that enhance LVLM responses without needing access to model internals. Our approach employs a pool of candidate VPs and trains a router model to dynamically select the most effective VP for a given input image. This black-box approach is model-agnostic, making it applicable to both open-source and proprietary LVLMs. Evaluations on benchmarks such as POPE and CHAIR demonstrate that BBVPE effectively reduces object hallucination.

[45] Iterative Trajectory Exploration for Multimodal Agents

Pengxiang Li,Zhi Gao,Bofei Zhang,Yapeng Mi,Xiaojian Ma,Chenrui Shi,Tao Yuan,Yuwei Wu,Yunde Jia,Song-Chun Zhu,Qing Li

Main category: cs.CV

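TL;DR: SPORT is an online self-exploration method for multimodal agents that refines agent trajectories through step-wise preference optimization, with no expert annotation.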

Details

Motivation: Existing agents need large amounts of expert data for fine-tuning to adapt to new environments; SPORT aims to reduce this dependence by generating its own tasks and learning from solving them. Method: SPORT operates through four iterative components: task synthesis, step sampling, step verification, and preference tuning. Result: The SPORT Agent improves by 6.41% on the GTA benchmark and 3.64% on the GAIA benchmark. Conclusion: SPORT effectively improves the generalization and performance of multimodal agents. Abstract: Multimodal agents, which integrate a controller (e.g., a large language model) with external tools, have demonstrated remarkable capabilities in tackling complex tasks. However, existing agents need to collect a large amount of expert data for fine-tuning to adapt to new environments. In this paper, we propose an online self-exploration method for multimodal agents, namely SPORT, via step-wise preference optimization to refine the trajectories of agents, which automatically generates tasks and learns from solving the generated tasks, without any expert annotation. SPORT operates through four iterative components: task synthesis, step sampling, step verification, and preference tuning. First, we synthesize multi-modal tasks using language models. Then, we introduce a novel search scheme, where step sampling and step verification are executed alternately to solve each generated task. We employ a verifier to provide AI feedback to construct step-wise preference data. The data is subsequently used to update the controller's policy through preference tuning, producing a SPORT Agent. By interacting with real environments, the SPORT Agent evolves into a more refined and capable system. Evaluations on the GTA and GAIA benchmarks show that the SPORT Agent achieves 6.41% and 3.64% improvements, underscoring the generalization and effectiveness introduced by our method. The project page is https://SPORT-Agents.github.io.

[46] eNCApsulate: NCA for Precision Diagnosis on Capsule Endoscopes

Henry John Krumb,Anirban Mukhopadhyay

Main category: cs.CV

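TL;DR: Neural Cellular Automata (NCA) enable efficient bleeding segmentation and depth estimation for wireless capsule endoscopy, achieving reliable diagnosis on a miniaturized device for the first time.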

Details

Motivation: Wireless capsule endoscopy generates large amounts of video data and localizing the capsule is difficult, so lightweight deep learning models that can run on the capsule are needed. Method: Distills a large foundation model into the NCA architecture and implements efficient image processing on the ESP32 microcontroller. Result: NCA beat other portable models in parameter storage and accuracy, and optimizations on the ESP32-S3 markedly speed up inference. Conclusion: NCA models run successfully on miniaturized devices, opening a path to precise diagnosis and localization on the capsule. Abstract: Wireless Capsule Endoscopy is a non-invasive imaging method for the entire gastrointestinal tract, and is a pain-free alternative to traditional endoscopy. It generates extensive video data that requires significant review time, and localizing the capsule after ingestion is a challenge. Techniques like bleeding detection and depth estimation can help with localization of pathologies, but deep learning models are typically too large to run directly on the capsule. Neural Cellular Automata (NCA) for bleeding segmentation and depth estimation are trained on capsule endoscopic images. For monocular depth estimation, we distill a large foundation model into the lean NCA architecture, by treating the outputs of the foundation model as pseudo ground truth. We then port the trained NCA to the ESP32 microcontroller, enabling efficient image processing on hardware as small as a camera capsule. NCA are more accurate (Dice) than other portable segmentation models, while requiring more than 100x fewer parameters stored in memory than other small-scale models. The visual results of NCA depth estimation look convincing, and in some cases beat the realism and detail of the pseudo ground truth. Runtime optimizations on the ESP32-S3 accelerate the average inference speed significantly, by more than a factor of 3. With several algorithmic adjustments and distillation, it is possible to eNCApsulate NCA models into microcontrollers that fit into wireless capsule endoscopes. This is the first work that enables reliable bleeding segmentation and depth estimation on a miniaturized device, paving the way for precise diagnosis combined with visual odometry as a means of precise localization of the capsule -- on the capsule.

[47] Cascade Detector Analysis and Application to Biomedical Microscopy

Thomas L. Athey,Shashata Sawmya,Nir Shavit

Main category: cs.CV

TL;DR: Proposes an efficient cascade-detector method for identifying sparse objects that substantially reduces computation time.

Details

Motivation: As computer vision models and biomedical datasets grow, more efficient inference algorithms are needed. Method: Uses cascade detectors on multiresolution images and, given object prevalence and detectors of known accuracy, derives the cascade's accuracy and expected number of classifier calls. Result: In fluorescent cell detection, organelle segmentation, and tissue segmentation, the multi-level detector achieves comparable performance in 30-75% less time. Conclusion: The approach is compatible with a variety of computer vision models and data domains and has broad application potential. Abstract: As both computer vision models and biomedical datasets grow in size, there is an increasing need for efficient inference algorithms. We utilize cascade detectors to efficiently identify sparse objects in multiresolution images. Given an object's prevalence and a set of detectors at different resolutions with known accuracies, we derive the accuracy and expected number of classifier calls of a cascade detector. These results generalize across number of dimensions and number of cascade levels. Finally, we compare one- and two-level detectors in fluorescent cell detection, organelle segmentation, and tissue segmentation across various microscopy modalities. We show that the multi-level detector achieves comparable performance in 30-75% less time. Our work is compatible with a variety of computer vision models and data domains.
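To make the idea of an expected number of classifier calls concrete, here is a back-of-the-envelope sketch for a two-level cascade. It is not the paper's derivation: it assumes a coarse detector runs once per coarse tile, the fine detector runs on every fine tile inside each flagged coarse tile, every call costs the same, and the prevalence is given at the coarse-tile level. The function name and all numbers are hypothetical.

```python
def expected_calls_two_level(n_fine, group_size, prevalence, tpr, fpr):
    """Expected classifier calls for a simple two-level cascade.

    n_fine      -- number of fine-resolution tiles in the image
    group_size  -- fine tiles covered by one coarse tile
    prevalence  -- probability that a coarse tile contains an object
    tpr, fpr    -- true/false positive rates of the coarse detector
    """
    if n_fine % group_size != 0:
        raise ValueError("n_fine must be a multiple of group_size")
    n_coarse = n_fine // group_size
    # Probability that the coarse detector flags a tile, whether correctly or not.
    p_flag = prevalence * tpr + (1.0 - prevalence) * fpr
    # One coarse call per coarse tile, plus group_size fine calls per flagged tile.
    return n_coarse * (1.0 + group_size * p_flag)


# Made-up example: sparse objects, 1000 fine tiles grouped in tens.
single_level_calls = 1000
cascade_calls = expected_calls_two_level(
    n_fine=1000, group_size=10, prevalence=0.05, tpr=0.9, fpr=0.1
)
print(single_level_calls, cascade_calls)
```

Under these assumptions the cascade saves calls whenever objects are sparse and the coarse detector's false positive rate is low; the saving shrinks as either the prevalence or the false positive rate grows.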

[48] Mcity Data Engine: Iterative Model Improvement Through Open-Vocabulary Data Selection

Daniel Bogdoll,Rajanikant Patnaik Ananta,Abeyankar Giridharan,Isabel Moore,Gregory Stevens,Henry X. Liu

Main category: cs.CV

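TL;DR: The Mcity Data Engine is an open-source system for selecting and labeling long-tail class samples from large amounts of unlabeled data, supporting the full development cycle from data acquisition to model deployment.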

Details

Motivation: As data volumes grow, selecting and labeling suitable samples for training machine learning models becomes harder, especially in Intelligent Transportation Systems (ITS). Existing industrial data engines are mostly proprietary, and open-source solutions are lacking. Method: The Mcity Data Engine provides a modular system that supports open-vocabulary data selection, focuses on rare and novel classes, and covers the full pipeline from data acquisition to model deployment. Result: All code is public on GitHub under an MIT license, giving researchers and the open-source community a usable tool. Conclusion: The Mcity Data Engine fills the gap in open-source data engines, is particularly suited to long-tail class data, and supports ITS research and applications. Abstract: With an ever-increasing availability of data, it has become more and more challenging to select and label appropriate samples for the training of machine learning models. It is especially difficult to detect long-tail classes of interest in large amounts of unlabeled data. This holds especially true for Intelligent Transportation Systems (ITS), where vehicle fleets and roadside perception systems generate an abundance of raw data. While industrial, proprietary data engines for such iterative data selection and model training processes exist, researchers and the open-source community suffer from a lack of an openly available system. We present the Mcity Data Engine, which provides modules for the complete data-based development cycle, beginning at the data acquisition phase and ending at the model deployment stage. The Mcity Data Engine focuses on rare and novel classes through an open-vocabulary data selection process. All code is publicly available on GitHub under an MIT license: https://github.com/mcity/mcity_data_engine

[49] Diffusion-based Adversarial Identity Manipulation for Facial Privacy Protection

Liqin Wang,Qianyue Hu,Wei Lu,Xiangyang Luo

Main category: cs.CV

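TL;DR: DiffAIM is a diffusion-based adversarial face generation method for protecting user privacy; it manipulates facial identity to produce natural, highly transferable adversarial faces.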

Details

Motivation: Existing facial privacy protection methods struggle to generate natural face images; DiffAIM aims to solve this and prevent misuse by malicious face recognition systems. Method: Manipulates facial identity in the low-dimensional latent space of a diffusion model by injecting gradient-based adversarial identity guidance during the reverse diffusion process, optimizing for identity convergence toward a target and semantic divergence from the source. Result: DiffAIM shows stronger black-box attack transferability and better visual quality across experiments and is effective against commercial face recognition APIs. Conclusion: DiffAIM offers an effective and natural-looking way to protect facial privacy, with practical application potential. Abstract: The success of face recognition (FR) systems has led to serious privacy concerns due to potential unauthorized surveillance and user tracking on social networks. Existing methods for enhancing privacy fail to generate natural face images that can protect facial privacy. In this paper, we propose diffusion-based adversarial identity manipulation (DiffAIM) to generate natural and highly transferable adversarial faces against malicious FR systems. To be specific, we manipulate facial identity within the low-dimensional latent space of a diffusion model. This involves iteratively injecting gradient-based adversarial identity guidance during the reverse diffusion process, progressively steering the generation toward the desired adversarial faces. The guidance is optimized for identity convergence towards a target while promoting semantic divergence from the source, facilitating effective impersonation while maintaining visual naturalness. We further incorporate structure-preserving regularization to preserve facial structure consistency during manipulation. Extensive experiments on both face verification and identification tasks demonstrate that compared with the state-of-the-art, DiffAIM achieves stronger black-box attack transferability while maintaining superior visual quality. We also demonstrate the effectiveness of the proposed approach for commercial FR APIs, including Face++ and Aliyun.

[50] HoloTime: Taming Video Diffusion Models for Panoramic 4D Scene Generation

Haiyang Zhou,Wangbo Yu,Jiawen Guan,Xinhua Cheng,Yonghong Tian,Li Yuan

Main category: cs.CV

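TL;DR: The HoloTime framework generates panoramic videos with video diffusion models and combines them with a 4D scene reconstruction method to improve immersive VR/AR experiences.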

Details

Motivation: Existing diffusion models focus mainly on static 3D scenes or object-level dynamics, which limits immersive VR/AR experiences. Method: Proposes the HoloTime framework, comprising the 360World dataset, the Panoramic Animator, and Panoramic Space-Time Reconstruction. Result: Outperforms existing methods in both panoramic video generation and 4D scene reconstruction. Conclusion: HoloTime can create more realistic immersive environments and improve the VR/AR user experience. Abstract: The rapid advancement of diffusion models holds the promise of revolutionizing the application of VR and AR technologies, which typically require scene-level 4D assets for user experience. Nonetheless, existing diffusion models predominantly concentrate on modeling static 3D scenes or object-level dynamics, constraining their capacity to provide truly immersive experiences. To address this issue, we propose HoloTime, a framework that integrates video diffusion models to generate panoramic videos from a single prompt or reference image, along with a 360-degree 4D scene reconstruction method that seamlessly transforms the generated panoramic video into 4D assets, enabling a fully immersive 4D experience for users. Specifically, to tame video diffusion models for generating high-fidelity panoramic videos, we introduce the 360World dataset, the first comprehensive collection of panoramic videos suitable for downstream 4D scene reconstruction tasks. With this curated dataset, we propose Panoramic Animator, a two-stage image-to-video diffusion model that can convert panoramic images into high-quality panoramic videos. Following this, we present Panoramic Space-Time Reconstruction, which leverages a space-time depth estimation method to transform the generated panoramic videos into 4D point clouds, enabling the optimization of a holistic 4D Gaussian Splatting representation to reconstruct spatially and temporally consistent 4D scenes. To validate the efficacy of our method, we conducted a comparative analysis with existing approaches, revealing its superiority in both panoramic video generation and 4D scene reconstruction. This demonstrates our method's capability to create more engaging and realistic immersive environments, thereby enhancing user experiences in VR and AR applications.

[51] Visual Text Processing: A Comprehensive Review and Unified Evaluation

Yan Shu,Weichao Zeng,Fangmin Zhao,Zeyu Chen,Zhenhang Li,Xiaomeng Yang,Yu Zhou,Paolo Rota,Xiang Bai,Lianwen Jin,Xu-Cheng Yin,Nicu Sebe

Main category: cs.CV

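TL;DR: This survey reviews recent advances in visual text processing, frames two key questions, and introduces a new benchmark, VTPBench, and an evaluation metric, VTPScore, to support future work in the field.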

Details

Motivation: Visual text carries important semantic information in document and scene images, but its unique properties still pose challenges; processing models need to capture and use textual features effectively. Method: Analyzes recent advances in visual text processing from multiple perspectives, proposes the VTPBench benchmark and the VTPScore metric, and runs an empirical study based on MLLMs. Result: Current techniques leave substantial room for improvement, and the new benchmark and metric give future research a reliable tool. Conclusion: The work aims to serve as a foundational resource for visual text processing and to foster future exploration and innovation. Abstract: Visual text is a crucial component in both document and scene images, conveying rich semantic information and attracting significant attention in the computer vision community. Beyond traditional tasks such as text detection and recognition, visual text processing has witnessed rapid advancements driven by the emergence of foundation models, including text image reconstruction and text image manipulation. Despite significant progress, challenges remain due to the unique properties that differentiate text from general objects. Effectively capturing and leveraging these distinct textual characteristics is essential for developing robust visual text processing models. In this survey, we present a comprehensive, multi-perspective analysis of recent advancements in visual text processing, focusing on two key questions: (1) What textual features are most suitable for different visual text processing tasks? (2) How can these distinctive text features be effectively incorporated into processing frameworks? Furthermore, we introduce VTPBench, a new benchmark that encompasses a broad range of visual text processing datasets. Leveraging the advanced visual quality assessment capabilities of multimodal large language models (MLLMs), we propose VTPScore, a novel evaluation metric designed to ensure fair and reliable evaluation. Our empirical study with more than 20 specific models reveals substantial room for improvement in the current techniques. Our aim is to establish this work as a fundamental resource that fosters future exploration and innovation in the dynamic field of visual text processing. The relevant repository is available at https://github.com/shuyansy/Visual-Text-Processing-survey.

[52] Enhancing Self-Supervised Fine-Grained Video Object Tracking with Dynamic Memory Prediction

Zihan Zhou,Changrui Dai,Aibo Song,Xiaolin Fang

Main category: cs.CV

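TL;DR: Proposes a Dynamic Memory Prediction (DMP) framework that uses multiple reference frames to directly enhance frame reconstruction and improve video analysis accuracy.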

Details

Motivation: Existing frame reconstruction methods do not fully exploit multiple reference frames in complex situations such as occlusion or fast motion. Method: The DMP framework comprises a memory engine that dynamically selects reference frames and a bidirectional target prediction network that uses multiple reference frames to improve robustness. Result: Experiments show that the algorithm outperforms existing self-supervised techniques on object segmentation and keypoint tracking. Conclusion: By using multiple reference frames dynamically, the DMP framework substantially improves frame reconstruction and tracking accuracy. Abstract: Successful video analysis relies on accurate recognition of pixels across frames, and frame reconstruction methods based on video correspondence learning are popular due to their efficiency. Existing frame reconstruction methods, while efficient, neglect the value of direct involvement of multiple reference frames for reconstruction and decision-making aspects, especially in complex situations such as occlusion or fast movement. In this paper, we introduce a Dynamic Memory Prediction (DMP) framework that innovatively utilizes multiple reference frames to concisely and directly enhance frame reconstruction. Its core component is a Reference Frame Memory Engine that dynamically selects frames based on object pixel features to improve tracking accuracy. In addition, a Bidirectional Target Prediction Network is built to utilize multiple reference frames to improve the robustness of the model. Through experiments, our algorithm outperforms the state-of-the-art self-supervised techniques on two fine-grained video object tracking tasks: object segmentation and keypoint tracking.

Abu Mohammed Raisuddin,Jesper Holmblad,Hamed Haghighi,Yuri Poledna,Maikol Funk Drechsler,Valentina Donzella,Eren Erdal Aksoy

Main category: cs.CV

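TL;DR: Presents REHEARSE-3D, a large-scale, multi-modal emulated rain dataset for advancing 3D point cloud de-raining research, and evaluates the performance of several models on it.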

Details

Motivation: Sensor degradation, such as rain interference in LiDAR point clouds, is a major challenge in autonomous driving, and weather-aware systems are needed to improve safety. Method: Releases the REHEARSE-3D dataset, which contains high-resolution LiDAR and 4D Radar point clouds annotated with rain-characteristic information, for benchmarking raindrop detection and removal. Result: The dataset is the largest point-wise annotated dataset and the first to combine high-resolution LiDAR with 4D Radar data, and it supports performance evaluation of a range of models. Conclusion: REHEARSE-3D is an important resource for 3D point cloud de-raining research; the dataset and benchmark models will be made public. Abstract: Sensor degradation poses a significant challenge in autonomous driving. During heavy rainfall, the interference from raindrops can adversely affect the quality of LiDAR point clouds, resulting in, for instance, inaccurate point measurements. This, in turn, can potentially lead to safety concerns if autonomous driving systems are not weather-aware, i.e., if they are unable to discern such changes. In this study, we release a new, large-scale, multi-modal emulated rain dataset, REHEARSE-3D, to promote research advancements in 3D point cloud de-raining. Distinct from the most relevant competitors, our dataset is unique in several respects. First, it is the largest point-wise annotated dataset, and second, it is the only one with high-resolution LiDAR data (LiDAR-256) enriched with 4D Radar point clouds logged in both daytime and nighttime conditions in a controlled weather environment. Furthermore, REHEARSE-3D involves rain-characteristic information, which is of significant value not only for sensor noise modeling but also for analyzing the impact of weather at a point level. Leveraging REHEARSE-3D, we benchmark raindrop detection and removal in fused LiDAR and 4D Radar point clouds. Our comprehensive study further evaluates the performance of various statistical and deep-learning models. Upon publication, the dataset and benchmark models will be made publicly available at: https://sporsho.github.io/REHEARSE3D.

[54] Vision Transformers in Precision Agriculture: A Comprehensive Survey

Saber Mehdipour,Seyed Abolghasem Mirroshandel,Seyed Amirhossein Tabatabaei

Main category: cs.CV

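TL;DR: This survey reviews the use of Vision Transformers (ViTs) in precision agriculture across classification, detection, and segmentation tasks, compares ViTs with traditional CNNs, and discusses technical challenges and future research directions.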

Details

Motivation: Traditional plant disease detection methods are limited in scalability and accuracy; ViTs are a promising alternative because they handle long-range dependencies well and scale better for visual tasks. Method: The survey introduces the foundational ViT architecture and its transition from NLP to computer vision, analyzes how ViTs reduce the inductive biases of traditional models, and compares the performance of ViTs and CNNs. Result: ViTs show potential in precision agriculture but face technical challenges in data requirements, computational cost, and model interpretability. Conclusion: ViTs are poised to transform smart agriculture; future research should focus on resolving these technical challenges and advancing real-world deployment. Abstract: Detecting plant diseases is a crucial aspect of modern agriculture - it plays a key role in maintaining crop health and increasing overall yield. Traditional approaches, though still valuable, often rely on manual inspection or conventional machine learning techniques, both of which face limitations in scalability and accuracy. Recently, Vision Transformers (ViTs) have emerged as a promising alternative, offering benefits such as improved handling of long-range dependencies and better scalability for visual tasks. This survey explores the application of ViTs in precision agriculture, covering tasks from classification to detection and segmentation. We begin by introducing the foundational architecture of ViTs and discuss their transition from Natural Language Processing (NLP) to computer vision. The discussion includes the concept of inductive bias in traditional models like Convolutional Neural Networks (CNNs), and how ViTs mitigate these biases. We provide a comprehensive review of recent literature, focusing on key methodologies, datasets, and performance metrics. The survey also includes a comparative analysis of CNNs and ViTs, with a look at hybrid models and performance enhancements. Technical challenges - such as data requirements, computational demands, and model interpretability - are addressed alongside potential solutions. Finally, we outline potential research directions and technological advancements that could further support the integration of ViTs in real-world agricultural settings. Our goal with this study is to offer practitioners and researchers a deeper understanding of how ViTs are poised to transform smart and precision agriculture.

Shiying Li,Xingqun Qi,Bingkun Yang,Chen Weile,Zezhao Tian,Muyi Sun,Qifeng Liu,Man Zhang,Zhenan Sun

Main category: cs.CV

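TL;DR: The paper proposes VividListener, a framework for generating listener head dynamics with nuanced emotions and expressive reactions, addressing the limitations of existing methods in long-sequence modeling and emotional intensity control.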

Details

Motivation: Existing work focuses mainly on short-term generation of listener behavior and lacks fine-grained control over motion variations and emotional intensity; large-scale dialogue datasets with multi-modal annotations are also lacking. Method: The authors first collect ListenerX, a large-scale multi-turn conversation dataset, and then propose the VividListener framework, which comprises a Responsive Interaction Module (RIM) and Emotional Intensity Tags (EIT) for modeling listener dynamics under multi-modal conditions. Result: Experiments on the ListenerX dataset show that VividListener achieves state-of-the-art performance and generates expressive, controllable listener dynamics. Conclusion: Guided by multi-modal conditions, VividListener enables fine-grained control and emotional expression in listener dynamics, offering a new approach to dialogue modeling. Abstract: Generating responsive listener head dynamics with nuanced emotions and expressive reactions is crucial for practical dialogue modeling in various virtual avatar animations. Previous studies mainly focus on the direct short-term production of listener behavior. They overlook the fine-grained control over motion variations and emotional intensity, especially in long-sequence modeling. Moreover, the lack of long-term and large-scale paired speaker-listener corpora including head dynamics and fine-grained multi-modality annotations (e.g., text-based expression descriptions, emotional intensity) also limits the application of dialogue modeling. Therefore, we first collect a new large-scale multi-turn dataset of 3D dyadic conversation containing more than 1.4M valid frames for multi-modal responsive interaction, dubbed ListenerX. Additionally, we propose VividListener, a novel framework enabling fine-grained, expressive and controllable listener dynamics modeling. This framework leverages multi-modal conditions as guiding principles for fostering coherent interactions between speakers and listeners. Specifically, we design the Responsive Interaction Module (RIM) to adaptively represent the multi-modal interactive embeddings. RIM ensures the listener dynamics achieve fine-grained semantic coordination with textual descriptions and adjustments, while preserving expressive reaction with speaker behavior. Meanwhile, we design the Emotional Intensity Tags (EIT) for emotion intensity editing with multi-modal information integration, applying to both text descriptions and listener motion amplitude. Extensive experiments conducted on our newly collected ListenerX dataset demonstrate that VividListener achieves state-of-the-art performance, realizing expressive and controllable listener dynamics.

[56] Common3D: Self-Supervised Learning of 3D Morphable Models for Common Objects in Neural Feature Space

Leonhard Sommer,Olaf Dünkel,Christian Theobalt,Adam Kortylewski

Main category: cs.CV

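TL;DR: Common3D is a fully self-supervised method that learns 3D morphable models (3DMMs) from object-centric videos, addressing the problem that 3DMMs are available for only a few specific object categories.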

Details

Motivation: 3DMMs are usually available only for specific object categories (such as faces or human bodies) and require demanding 3D data acquisition and category-specific training. Common3D aims to learn 3DMMs for common objects without supervised data. Method: Common3D represents an object as a learned 3D template mesh plus a deformation field parameterized by an image-conditioned neural network, represents appearance with neural features rather than RGB colors, and trains the appearance features with a contrastive objective. Result: Common3D outperforms related methods on 3D object pose estimation and semantic correspondence, and is the first fully self-supervised method that can solve multiple vision tasks in a zero-shot manner. Conclusion: Common3D offers an effective self-supervised approach to learning 3DMMs for common objects and markedly improves multi-task performance. Abstract: 3D morphable models (3DMMs) are a powerful tool to represent the possible shapes and appearances of an object category. Given a single test image, 3DMMs can be used to solve various tasks, such as predicting the 3D shape, pose, semantic correspondence, and instance segmentation of an object. Unfortunately, 3DMMs are only available for very few object categories that are of particular interest, like faces or human bodies, as they require a demanding 3D data acquisition and category-specific training process. In contrast, we introduce a new method, Common3D, that learns 3DMMs of common objects in a fully self-supervised manner from a collection of object-centric videos. For this purpose, our model represents objects as a learned 3D template mesh and a deformation field that is parameterized as an image-conditioned neural network. Different from prior works, Common3D represents the object appearance with neural features instead of RGB colors, which enables the learning of more generalizable representations through an abstraction from pixel intensities. Importantly, we train the appearance features using a contrastive objective by exploiting the correspondences defined through the deformable template mesh. This leads to higher quality correspondence features compared to related works and a significantly improved model performance at estimating 3D object pose and semantic correspondence. Common3D is the first completely self-supervised method that can solve various vision tasks in a zero-shot manner.

[57] Anatomical Similarity as a New Metric to Evaluate Brain Generative Models

Bahram Jafrasteh,Wei Peng,Cheng Wan,Yimin Luo,Ehsan Adeli,Qingyu Zhao

Main category: cs.CV

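TL;DR: The study proposes WASABI, a new metric for assessing the anatomical realism of synthetic brain MRIs; it compares the distributions of real and synthetic anatomies using volumetric measures and the Wasserstein distance.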

Details

Motivation: Despite progress in generative models for MRI synthesis, existing evaluation methods are insensitive to anatomical fidelity, which limits their clinical usefulness. Method: Brain MRIs are segmented with the SynthSeg tool, the volume of each brain region is computed, and the distributions of real and synthetic anatomies are compared with the multivariate Wasserstein distance. Result: WASABI is more sensitive than traditional image-level metrics in quantifying anatomical discrepancies, even when the synthetic images have near-perfect visual quality. Conclusion: The study recommends treating anatomical fidelity as a key criterion for evaluating synthetic MRIs, going beyond visual inspection and conventional metrics. Abstract: Generative models enhance neuroimaging through data augmentation, quality improvement, and rare condition studies. Despite advances in realistic synthetic MRIs, evaluations focus on texture and perception, lacking sensitivity to crucial anatomical fidelity. This study proposes a new metric, called WASABI (Wasserstein-Based Anatomical Brain Index), to assess the anatomical realism of synthetic brain MRIs. WASABI leverages SynthSeg, a deep learning-based brain parcellation tool, to derive volumetric measures of brain regions in each MRI and uses the multivariate Wasserstein distance to compare distributions between real and synthetic anatomies. Based on controlled experiments on two real datasets and synthetic MRIs from five generative models, WASABI demonstrates higher sensitivity in quantifying anatomical discrepancies compared to traditional image-level metrics, even when synthetic images achieve near-perfect visual quality. Our findings advocate for shifting the evaluation paradigm beyond visual inspection and conventional metrics, emphasizing anatomical fidelity as a crucial benchmark for clinically meaningful brain MRI synthesis. Our code is available at https://github.com/BahramJafrasteh/wasabi-mri.
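To illustrate the idea of comparing regional volume distributions with a Wasserstein distance, here is a minimal sketch. It is not the authors' implementation: WASABI uses a multivariate Wasserstein distance over SynthSeg-derived volumes, whereas this sketch covers only the simpler one-dimensional case for a single region. The function name and the volume values are hypothetical.

```python
def wasserstein_1d(real, synthetic):
    """1-Wasserstein distance between two equal-size 1D samples.

    For empirical distributions with the same number of points, the optimal
    transport plan pairs the sorted values, so the distance is the mean
    absolute difference between the sorted samples.
    """
    if len(real) != len(synthetic):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(real), sorted(synthetic))
    return sum(abs(r - s) for r, s in pairs) / len(real)


# Hypothetical volumes (cm^3) of one brain region in real and synthetic MRIs.
real_volumes = [3.9, 4.1, 4.0, 4.3]
synthetic_volumes = [3.5, 3.6, 3.4, 3.7]
distance = wasserstein_1d(real_volumes, synthetic_volumes)
```

A larger distance means the synthetic volumes for that region are distributed less like the real ones, regardless of how realistic the images look.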

[58] Anomaly-Driven Approach for Enhanced Prostate Cancer Segmentation

Alessia Hu,Regina Beets-Tan,Lishan Cai,Eduardo Pooch

Main category: cs.CV

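TL;DR: The study proposes adU-Net, a deep learning segmentation framework that incorporates anomaly detection to improve the identification of clinically significant prostate cancer (csPCa) and that outperforms the baseline model.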

Details

Motivation: MRI plays an important role in identifying csPCa, but automated methods face challenges such as data imbalance, variable tumor sizes, and a lack of annotated data. Method: The study introduces adU-Net, which integrates anomaly maps derived from biparametric MRI sequences into the segmentation framework; the anomaly maps are generated with Fixed-Point GAN and guide the model toward potentially cancerous regions. Result: On the external test set, adU-Net achieves an average score (the mean of AUROC and AP) of 0.618, outperforming the nnU-Net baseline (0.605). Conclusion: Incorporating anomaly detection into segmentation improves generalization and performance, particularly with ADC-based anomaly maps, and offers a new direction for automated csPCa identification. Abstract: Magnetic Resonance Imaging (MRI) plays an important role in identifying clinically significant prostate cancer (csPCa), yet automated methods face challenges such as data imbalance, variable tumor sizes, and a lack of annotated data. This study introduces Anomaly-Driven U-Net (adU-Net), which incorporates anomaly maps derived from biparametric MRI sequences into a deep learning-based segmentation framework to improve csPCa identification. We conduct a comparative analysis of anomaly detection methods and evaluate the integration of anomaly maps into the segmentation pipeline. Anomaly maps, generated using Fixed-Point GAN reconstruction, highlight deviations from normal prostate tissue, guiding the segmentation model to potential cancerous regions. We compare the performance by using the average score, computed as the mean of the AUROC and Average Precision (AP). On the external test set, adU-Net achieves the best average score of 0.618, outperforming the baseline nnU-Net model (0.605). The results demonstrate that incorporating anomaly detection into segmentation improves generalization and performance, particularly with ADC-based anomaly maps, offering a promising direction for automated csPCa identification.
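The average score described above is the mean of AUROC and Average Precision. The sketch below shows that arithmetic on toy data. It assumes, for simplicity, that both metrics are computed over the same list of binary labels and scores; the paper's evaluation protocol may compute them at different levels. All function names and values are hypothetical.

```python
def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits


def average_score(labels, scores):
    """Mean of AUROC and AP, the summary score quoted in the abstract."""
    return (auroc(labels, scores) + average_precision(labels, scores)) / 2


# Toy predictions: 1 = csPCa present, 0 = absent.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_average = average_score(toy_labels, toy_scores)
```

On the toy data, AUROC is 0.75 and AP is about 0.833, so the average score is about 0.792.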

[59] A simple and effective approach for body part recognition on CT scans based on projection estimation

Franko Hrzic,Mohammadreza Movahhedi,Ophelie Lavoie-Gagne,Ata Kiapour

Main category: cs.CV

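TL;DR: The paper proposes a simple and effective method, based on 2D X-ray-like estimation of 3D CT scans, for identifying 14 distinct body regions; it significantly outperforms the compared methods.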

Details

Motivation: Labelling CT data is difficult and time-consuming, and existing body-region labels often neglect other anatomical regions present in the scan. Method: Estimated 2D images are used to identify body regions, and the approach is compared against 2.5D, 3D, and foundation model approaches. Result: The EffNet-B0 model performs best, with an F1-Score of 0.980 ± 0.016, significantly outperforming the other methods. Conclusion: The method is an effective tool for building high-quality medical datasets. Abstract: It is well known that machine learning models require a large amount of annotated data to obtain optimal performance. Labelling Computed Tomography (CT) data can be a particularly challenging task due to its volumetric nature and often missing and/or incomplete associated meta-data. Even inspecting one CT scan requires additional computer software or, in the case of programming languages, additional programming libraries. This study proposes a simple, yet effective approach based on 2D X-ray-like estimation of 3D CT scans for body region identification. Although body region is commonly associated with the CT scan, it often describes only the focused major body region, neglecting other anatomical regions present in the observed CT. In the proposed approach, estimated 2D images were utilized to identify 14 distinct body regions, providing valuable information for constructing a high-quality medical dataset. To evaluate the effectiveness of the proposed method, it was compared against 2.5D, 3D and foundation model (MI2) based approaches. Our approach outperformed the others with statistical significance: the best-performing model, EffNet-B0, achieved an F1-Score of 0.980 ± 0.016, compared with 0.840 ± 0.114 (2.5D DenseNet-161), 0.854 ± 0.096 (3D VoxCNN), and 0.852 ± 0.104 (MI2 foundation model). The utilized dataset comprised three different clinical centers and counted 15,622 CT scans (44,135 labels).
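One way to obtain an X-ray-like 2D image from a 3D CT volume is to collapse the volume along one axis. The sketch below uses a mean-intensity projection on a tiny nested-list volume. The abstract does not state which projection the authors use, so the averaging rule, the function name, and the `[z][y][x]` indexing are assumptions made for illustration.

```python
def mean_projection(volume, axis):
    """Collapse a 3D volume indexed as volume[z][y][x] into a 2D image.

    Each output pixel is the mean intensity along the chosen axis.
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    if axis == 0:  # average over z -> image indexed [y][x]
        return [[sum(volume[z][y][x] for z in range(depth)) / depth
                 for x in range(width)] for y in range(height)]
    if axis == 1:  # average over y -> image indexed [z][x]
        return [[sum(volume[z][y][x] for y in range(height)) / height
                 for x in range(width)] for z in range(depth)]
    if axis == 2:  # average over x -> image indexed [z][y]
        return [[sum(volume[z][y][x] for x in range(width)) / width
                 for y in range(height)] for z in range(depth)]
    raise ValueError("axis must be 0, 1, or 2")


# Tiny 2x2x2 toy volume standing in for a CT scan.
toy_volume = [[[0, 2], [4, 6]], [[8, 10], [12, 14]]]
front_view = mean_projection(toy_volume, axis=1)
```

The resulting 2D image can then be passed to an ordinary 2D image classifier, which is far cheaper than running a 3D network over the full volume.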

[60] Why Compress What You Can Generate? When GPT-4o Generation Ushers in Image Compression Fields

Yixin Gao,Xiaohan Pan,Xin Li,Zhibo Chen

Main category: cs.CV

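TL;DR: The paper explores the potential of AIGC foundation models such as GPT-4o for image compression, proposing compression based on textual and multimodal coding and using a structure raster-scan prompt engineering mechanism to maintain semantic and structural consistency.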

Details

Motivation: The rapid development of AIGC foundation models suggests a new approach to image compression: achieving efficient compression through generation rather than conventional coding. GPT-4o's strong cross-modality generation capabilities motivate exploring its use in image compression. Method: The study investigates two compression paradigms, textual coding and multimodal coding (text plus an extremely low-resolution image), using GPT-4o image generation in place of conventional pixel-level compression. It proposes a structure raster-scan prompt engineering mechanism that transforms the image into textual space to serve as the generation condition. Result: Experiments show that combining structural raster-scan prompts with GPT-4o image generation outperforms recent multimodal/generative image compression methods at ultra-low bitrates. Conclusion: AIGC generation has considerable potential in image compression, particularly when paired with the structure raster-scan prompt engineering mechanism. Abstract: The rapid development of AIGC foundation models has revolutionized the paradigm of image compression, which paves the way for the abandonment of most pixel-level transform and coding, compelling us to ask: why compress what you can generate if the AIGC foundation model is powerful enough to faithfully generate intricate structure and fine-grained details from nothing more than some compact descriptors, i.e., texts or cues? Fortunately, recent GPT-4o image generation of OpenAI has achieved impressive cross-modality generation, editing, and design capabilities, which motivates us to answer the above question by exploring its potential in image compression fields. In this work, we investigate two typical compression paradigms: textual coding and multimodal coding (i.e., text + extremely low-resolution image), where all/most pixel-level information is generated instead of compressed via the advanced GPT-4o image generation function. The essential challenge lies in how to maintain semantic and structure consistency during the decoding process. To overcome this, we propose a structure raster-scan prompt engineering mechanism to transform the image into textual space, which is compressed as the condition of GPT-4o image generation. Extensive experiments have shown that the combination of our designed structural raster-scan prompts and GPT-4o's image generation function achieves impressive performance compared with recent multimodal/generative image compression at ultra-low bitrate, further indicating the potential of AIGC generation in image compression fields.
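To give a sense of why textual coding lands in the ultra-low-bitrate regime, the sketch below computes the bits per pixel of a compressed text description. The description text, the 1024x1024 image size, and the use of zlib as the lossless text compressor are all assumptions for illustration; the abstract does not say which compressor the authors use.

```python
import zlib


def bits_per_pixel(payload, width, height):
    """Bitrate of a compressed payload relative to the image it describes."""
    return 8 * len(payload) / (width * height)


# Hypothetical raster-scan style description of a 1024x1024 image.
description = (
    "Row 1: clear blue sky, a few thin clouds on the left. "
    "Row 2: snow-capped mountain ridge across the full width. "
    "Row 3: pine forest on the left, a small lake on the right."
)
payload = zlib.compress(description.encode("utf-8"), 9)
bpp = bits_per_pixel(payload, 1024, 1024)
```

A description of a few hundred bytes spread over roughly a million pixels yields a bitrate far below what pixel-level codecs typically operate at, which is the regime the paper targets.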

[61] Early Exit and Multi Stage Knowledge Distillation in VLMs for Video Summarization

Anas Anwarul Haq Khan,Utkarsh Verma,Prateek Chanda,Ganesh Ramakrishnan

Main category: cs.CV

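TL;DR: DEEVISum is a lightweight, efficient vision language model that combines multi-modal prompts with Multi Stage Knowledge Distillation (MSKD) and Early Exit (EE) to balance performance and efficiency in video summarization.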

Details

Motivation: Large models are computationally expensive for video summarization, so the paper proposes a lightweight, efficient alternative. Method: The model combines multi-modal prompts (textual and audio-derived signals) and applies MSKD and EE to balance performance and efficiency. Result: On the TVSum dataset, the best model, PaLI Gemma2 3B + MSKD, reaches an F1 score of 61.1; EE cuts inference time by roughly 21% at the cost of a 1.3-point drop in F1. Conclusion: DEEVISum approaches the performance of much larger models while keeping a lower computational cost; the code and processed dataset are publicly released. Abstract: We introduce DEEVISum (Distilled Early Exit Vision language model for Summarization), a lightweight, efficient, and scalable vision language model designed for segment wise video summarization. Leveraging multi modal prompts that combine textual and audio derived signals, DEEVISum incorporates Multi Stage Knowledge Distillation (MSKD) and Early Exit (EE) to strike a balance between performance and efficiency. MSKD offers a 1.33% absolute F1 improvement over baseline distillation (0.5%), while EE reduces inference time by approximately 21% with a 1.3 point drop in F1. Evaluated on the TVSum dataset, our best model PaLI Gemma2 3B + MSKD achieves an F1 score of 61.1, competitive with significantly larger models, all while maintaining a lower computational footprint. We publicly release our code and processed dataset to support further research.
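The Early Exit idea can be sketched as a loop over intermediate prediction heads that stops as soon as one is confident enough. This is a generic confidence-threshold variant, not DEEVISum's actual exit criterion, which the abstract does not specify. The function name, the head interface, and the toy heads are hypothetical.

```python
def early_exit_predict(features, exit_heads, threshold=0.9):
    """Evaluate exit heads in order and stop at the first confident one.

    Each head maps features to (score, confidence). Returns the score and
    the 1-based depth at which inference stopped. If no head reaches the
    threshold, the last head's prediction is used.
    """
    if not exit_heads:
        raise ValueError("at least one exit head is required")
    for depth, head in enumerate(exit_heads, start=1):
        score, confidence = head(features)
        if confidence >= threshold:
            break
    return score, depth


# Toy heads standing in for classifiers attached to successive layers.
toy_heads = [
    lambda f: (0.4, 0.60),  # shallow layer: low confidence
    lambda f: (0.7, 0.95),  # middle layer: confident enough to exit
    lambda f: (0.8, 0.99),  # final layer: never reached in this example
]
importance, exit_depth = early_exit_predict(None, toy_heads)
```

Exiting at depth 2 of 3 skips the remaining layers, which is where the inference-time saving comes from; the price is a prediction that may differ slightly from the full model's.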

[62] 3D Stylization via Large Reconstruction Model

Ipek Oztas,Duygu Ceylan,Aysegul Dundar

Main category: cs.CV

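TL;DR: The paper proposes a 3D appearance stylization method that requires no training or optimization: features from a visual style image are injected into the attention blocks of a large reconstruction model, yielding efficient, high-quality stylized 3D generation.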

Details

Motivation: With the success of text- or image-guided 3D generators, users want more control over the generation process, appearance stylization in particular. The paper aims to stylize the appearance of generated 3D assets from a reference image while keeping them visually consistent across viewpoints. Method: Certain attention blocks in large reconstruction models capture appearance features; injecting features from a visual style image into these blocks produces 3D appearance stylization. The method needs no training or test-time optimization. Result: Quantitative and qualitative evaluations show that the method achieves superior 3D appearance stylization, significantly improving efficiency while maintaining high visual quality. Conclusion: The method is simple yet effective, offering an efficient, high-quality solution for 3D appearance stylization. Abstract: With the growing success of text or image guided 3D generators, users demand more control over the generation process, appearance stylization being one of them. Given a reference image, this requires adapting the appearance of a generated 3D asset to reflect the visual style of the reference while maintaining visual consistency from multiple viewpoints. To tackle this problem, we draw inspiration from the success of 2D stylization methods that leverage the attention mechanisms in large image generation models to capture and transfer visual style. In particular, we probe if large reconstruction models, commonly used in the context of 3D generation, have a similar capability. We discover that certain attention blocks in these models capture the appearance-specific features. By injecting features from a visual style image to such blocks, we develop a simple yet effective 3D appearance stylization method. Our method does not require training or test time optimization. Through both quantitative and qualitative evaluations, we demonstrate that our approach achieves superior results in terms of 3D appearance stylization, significantly improving efficiency while maintaining high-quality visual outcomes.

[63] Active Light Modulation to Counter Manipulation of Speech Visual Content

Hadleigh Schwartz,Xiaofeng Yan,Charles J. Carver,Xia Zhou

Main category: cs.CV

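TL;DR: Spotlight is a low-overhead, unobtrusive system that embeds dynamic physical signatures into video via imperceptible modulated light, protecting live speech videos from visual falsification of speaker identity and facial motion.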

Details

Motivation: High-profile speech videos are easy targets for falsification, and existing detection methods operate mainly in the digital domain; Spotlight aims to provide more reliable protection through physical signatures. Method: Spotlight generates compact, pose-invariant video features based on locality-sensitive hashing and embeds the signatures into video (>200 bps) with an optical modulation scheme. Result: Experiments show that Spotlight achieves AUCs ≥ 0.99 and a 100% true positive rate in detecting falsified videos, and is highly robust to recording conditions and post-processing techniques. Conclusion: Spotlight effectively protects video integrity through physical signatures and holds up across a range of scenarios and attacks. Abstract: High-profile speech videos are prime targets for falsification, owing to their accessibility and influence. This work proposes Spotlight, a low-overhead and unobtrusive system for protecting live speech videos from visual falsification of speaker identity and lip and facial motion. Unlike predominant falsification detection methods operating in the digital domain, Spotlight creates dynamic physical signatures at the event site and embeds them into all video recordings via imperceptible modulated light. These physical signatures encode semantically-meaningful features unique to the speech event, including the speaker's identity and facial motion, and are cryptographically-secured to prevent spoofing. The signatures can be extracted from any video downstream and validated against the portrayed speech content to check its integrity. Key elements of Spotlight include (1) a framework for generating extremely compact (i.e., 150-bit), pose-invariant speech video features, based on locality-sensitive hashing; and (2) an optical modulation scheme that embeds >200 bps into video while remaining imperceptible both in video and live. Prototype experiments on extensive video datasets show Spotlight achieves AUCs ≥ 0.99 and an overall true positive rate of 100% in detecting falsified videos. Further, Spotlight is highly robust across recording conditions, video post-processing techniques, and white-box adversarial attacks on its video feature extraction methodologies.
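Locality-sensitive hashing can compress a feature vector into a short bit string whose Hamming distance reflects similarity. The sketch below uses random-hyperplane LSH to produce a 150-bit signature, matching the signature length quoted in the abstract. The choice of the random-hyperplane family, the feature dimension, and all names are assumptions; the abstract does not say which LSH scheme or features Spotlight uses.

```python
import random


def make_hyperplanes(dim, n_bits=150, seed=0):
    """Draw n_bits random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return [
        1 if sum(w * v for w, v in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    ]


def hamming(a, b):
    """Number of positions at which two signatures differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical 8-dimensional facial-motion feature vector.
planes = make_hyperplanes(dim=8)
feature = [0.2, -1.3, 0.7, 0.0, 2.1, -0.4, 0.9, -1.1]
signature = lsh_signature(feature, planes)
```

Similar feature vectors produce signatures with a small Hamming distance, so a verifier can compare the signature recovered from the light signal with one recomputed from the video content and flag large mismatches.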

[64] Differentiable Room Acoustic Rendering with Multi-View Vision Priors

Derong Jin,Ruohan Gao

Main category: cs.CV

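TL;DR: The AV-DAR framework combines visual cues with acoustic beam tracing to achieve efficient, interpretable, and accurate room acoustic rendering, significantly outperforming prior methods.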

Details

Motivation: Spatial audio is essential to the realism of virtual environments, but existing methods rely on either data-demanding learning or computationally expensive physics-based modeling. Method: The proposed AV-DAR framework extracts visual cues from multi-view images and combines them with acoustic beam tracing for physics-based modeling. Result: Validated in six real-world environments, AV-DAR significantly outperforms prior methods and is data-efficient, with relative gains of 16.6% to 50.9%. Conclusion: AV-DAR provides an efficient, interpretable, and accurate solution for room acoustic rendering with potential for practical use. Abstract: An immersive acoustic experience enabled by spatial audio is just as crucial as the visual aspect in creating realistic virtual environments. However, existing methods for room impulse response estimation rely either on data-demanding learning-based models or computationally expensive physics-based modeling. In this work, we introduce Audio-Visual Differentiable Room Acoustic Rendering (AV-DAR), a framework that leverages visual cues extracted from multi-view images and acoustic beam tracing for physics-based room acoustic rendering. Experiments across six real-world environments from two datasets demonstrate that our multimodal, physics-based approach is efficient, interpretable, and accurate, significantly outperforming a series of prior methods. Notably, on the Real Acoustic Field dataset, AV-DAR achieves comparable performance to models trained on 10 times more data while delivering relative gains ranging from 16.6% to 50.9% when trained at the same scale.

[65] COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning

Xindi Wu,Hee Seung Hwang,Polina Kirichenko,Olga Russakovsky

Main category: cs.CV

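TL;DR: COMPACT is a data-efficient training recipe for multimodal large language models that controls the compositional complexity of the training data, significantly improving performance on complex vision-language tasks.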

Details

Motivation: Multimodal large language models perform well on simple vision-language tasks but struggle on complex tasks that require multiple capabilities, possibly because conventional Visual Instruction Tuning (VIT) has emphasized data volume over compositional complexity. Method: COMPACT generates a dataset that explicitly controls the compositional complexity of the training examples, so that models learn complex capabilities more efficiently. Result: Across benchmarks, COMPACT matches the LLaVA-665k VIT while using less than 10% of its data budget, and substantially outperforms it on complex tasks. Conclusion: COMPACT offers a scalable, data-efficient visual compositional tuning recipe that significantly improves performance on complex vision-language tasks. Abstract: Multimodal Large Language Models (MLLMs) excel at simple vision-language tasks but struggle when faced with complex tasks that require multiple capabilities, such as simultaneously recognizing objects, counting them, and understanding their spatial relationships. This might be partially the result of the fact that Visual Instruction Tuning (VIT), a critical training step for MLLMs, has traditionally focused on scaling data volume, but not the compositional complexity of training examples. We propose COMPACT (COMPositional Atomic-to-complex visual Capability Tuning), which generates a training dataset explicitly controlling for the compositional complexity of the training examples. The data from COMPACT allows MLLMs to train on combinations of atomic capabilities to learn complex capabilities more efficiently. Across all benchmarks, COMPACT achieves comparable performance to the LLaVA-665k VIT while using less than 10% of its data budget, and even outperforms it on several, especially those involving complex multi-capability tasks. For example, COMPACT achieves a substantial 83.3% improvement on MMStar and 94.0% improvement on MM-Vet compared to the full-scale VIT on particularly complex questions that require four or more atomic capabilities. COMPACT offers a scalable, data-efficient, visual compositional tuning recipe to improve on complex visual-language tasks.
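Controlling compositional complexity amounts to choosing how many atomic capabilities each training question must combine. The sketch below enumerates capability combinations of a given size. The capability names are hypothetical (the abstract mentions recognizing objects, counting, and spatial relationships as examples), and the real pipeline additionally generates a question for each combination, which is not shown.

```python
from itertools import combinations

# Hypothetical atomic capabilities; the paper's actual list may differ.
ATOMIC_CAPABILITIES = [
    "object recognition",
    "counting",
    "spatial relationship",
    "color",
]


def capability_combinations(capabilities, k):
    """All unordered sets of k atomic capabilities (complexity level k)."""
    return list(combinations(capabilities, k))


# Complexity 2: every pair of capabilities a question could require.
pairs = capability_combinations(ATOMIC_CAPABILITIES, 2)
# Complexity 3: every triple.
triples = capability_combinations(ATOMIC_CAPABILITIES, 3)
```

Sampling evenly across values of k, rather than letting simple questions dominate, is what gives explicit control over how complex the training examples are.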

[66] A Survey of Interactive Generative Video

Jiwen Yu,Yiran Qin,Haoxuan Che,Quande Liu,Xintao Wang,Pengfei Wan,Di Zhang,Kun Gai,Hao Chen,Xihui Liu

Main category: cs.CV

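TL;DR: The paper defines Interactive Generative Video (IGV) and surveys its applications in gaming, embodied AI, and autonomous driving. It proposes a framework of five modules and analyzes the technical challenges and future directions.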

Details

Motivation: Meeting the demand for high-quality, interactive video content and advancing IGV technology. Method: The paper surveys IGV application domains and proposes a framework comprising five modules: Generation, Control, Memory, Dynamics, and Intelligence. Result: It summarizes the potential of IGV in gaming, embodied AI, and autonomous driving, and identifies the technical challenges. Conclusion: The systematic analysis should support the future development of IGV and advance it toward more sophisticated and practical applications. Abstract: Interactive Generative Video (IGV) has emerged as a crucial technology in response to the growing demand for high-quality, interactive video content across various domains. In this paper, we define IGV as a technology that combines generative capabilities to produce diverse high-quality video content with interactive features that enable user engagement through control signals and responsive feedback. We survey the current landscape of IGV applications, focusing on three major domains: 1) gaming, where IGV enables infinite exploration in virtual worlds; 2) embodied AI, where IGV serves as a physics-aware environment synthesizer for training agents in multimodal interaction with dynamically evolving scenes; and 3) autonomous driving, where IGV provides closed-loop simulation capabilities for safety-critical testing and validation. To guide future development, we propose a comprehensive framework that decomposes an ideal IGV system into five essential modules: Generation, Control, Memory, Dynamics, and Intelligence. Furthermore, we systematically analyze the technical challenges and future directions in realizing each component for an ideal IGV system, such as achieving real-time generation, enabling open-domain control, maintaining long-term coherence, simulating accurate physics, and integrating causal reasoning. We believe that this systematic analysis will facilitate future research and development in the field of IGV, ultimately advancing the technology toward more sophisticated and practical applications.

[67] ReVision: High-Quality, Low-Cost Video Generation with Explicit 3D Physics Modeling for Complex Motion and Interaction

Qihao Liu,Ju He,Qihang Yu,Liang-Chieh Chen,Alan Yuille

Main category: cs.CV

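TL;DR: ReVision is a plug-and-play framework that integrates parameterized 3D physical knowledge into a pretrained video generation model, significantly improving its ability to generate videos with complex motion and interactions.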

Details

Motivation: Current video generation still struggles with complex motion and interactions, and needs solutions with higher quality and controllability. Method: ReVision has three stages: 1) a video diffusion model generates a coarse video; 2) 2D and 3D features are extracted to build a 3D object-centric representation, which is refined by a parameterized physical prior model; 3) the refined motion sequence is fed back into the video diffusion model to generate motion-consistent video. Result: Validated on Stable Video Diffusion, ReVision significantly improves motion fidelity and coherence, and with only 1.5B parameters it outperforms a state-of-the-art model with over 13B parameters. Conclusion: By incorporating 3D physical knowledge, even a small video diffusion model can generate complex motion and interactions with greater realism and controllability, offering a promising solution for physically plausible video generation. Abstract: In recent years, video generation has seen significant advancements. However, challenges still persist in generating complex motions and interactions. To address these challenges, we introduce ReVision, a plug-and-play framework that explicitly integrates parameterized 3D physical knowledge into a pretrained conditional video generation model, significantly enhancing its ability to generate high-quality videos with complex motion and interactions. Specifically, ReVision consists of three stages. First, a video diffusion model is used to generate a coarse video. Next, we extract a set of 2D and 3D features from the coarse video to construct a 3D object-centric representation, which is then refined by our proposed parameterized physical prior model to produce an accurate 3D motion sequence. Finally, this refined motion sequence is fed back into the same video diffusion model as additional conditioning, enabling the generation of motion-consistent videos, even in scenarios involving complex actions and interactions. We validate the effectiveness of our approach on Stable Video Diffusion, where ReVision significantly improves motion fidelity and coherence. Remarkably, with only 1.5B parameters, it even outperforms a state-of-the-art video generation model with over 13B parameters on complex video generation by a substantial margin. Our results suggest that, by incorporating 3D physical knowledge, even a relatively small video diffusion model can generate complex motions and interactions with greater realism and controllability, offering a promising solution for physically plausible video generation.

cs.GR

[68] Transcending Dimensions using Generative AI: Real-Time 3D Model Generation in Augmented Reality

Majid Behravan,Maryam Haghani,Denis Gracanin

Main category: cs.GR

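TL;DR: The study combines generative AI with AR to simplify the 3D modeling workflow, so that non-expert users can easily generate and manipulate 3D models.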

Details

Motivation: Traditional 3D modeling has a high barrier to entry, requiring specialized software and skills, which limits its use by ordinary users. Method: The system uses AI models such as Shap-E and object detection methods such as Mask R-CNN to convert 2D images into 3D models within AR environments. Result: The 35 participants gave an overall SUS score of 69.64; those who used AR/VR technologies more frequently rated the system higher (80.71). Conclusion: The system has potential applications in gaming, education, and AR-based e-commerce, giving non-expert users an intuitive 3D modeling tool. Abstract: Traditional 3D modeling requires technical expertise, specialized software, and time-intensive processes, making it inaccessible for many users. Our research aims to lower these barriers by combining generative AI and augmented reality (AR) into a cohesive system that allows users to easily generate, manipulate, and interact with 3D models in real time, directly within AR environments. Utilizing cutting-edge AI models like Shap-E, we address the complex challenges of transforming 2D images into 3D representations in AR environments. Key challenges such as object isolation, handling intricate backgrounds, and achieving seamless user interaction are tackled through advanced object detection methods, such as Mask R-CNN. Evaluation results from 35 participants reveal an overall System Usability Scale (SUS) score of 69.64, with participants who engaged with AR/VR technologies more frequently rating the system significantly higher, at 80.71. This research is particularly relevant for applications in gaming, education, and AR-based e-commerce, offering intuitive model creation for users without specialized skills.

[69] GauSS-MI: Gaussian Splatting Shannon Mutual Information for Active 3D Reconstruction

Yuhan Xie,Yixi Cai,Yinqiang Zhang,Lei Yang,Jia Pan

Main category: cs.GR

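TL;DR: The paper proposes Gaussian Splatting Shannon Mutual Information (GauSS-MI), a real-time method for quantifying visual uncertainty that is used to select the next best view in active 3D reconstruction.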

Details

Motivation: Current 3D reconstruction techniques such as NeRF and 3DGS have markedly improved rendering quality, but efficiently selecting the most informative viewpoints for input images remains a challenge. Existing studies focus mostly on geometric completeness and overlook visual uncertainty within the reconstruction model. Method: The paper proposes a probabilistic model that quantifies the visual uncertainty of each Gaussian, and uses Shannon Mutual Information to formulate the GauSS-MI criterion, which assesses the visual mutual information of novel viewpoints in real time to select the next best view. Result: The system shows superior visual quality and reconstruction efficiency in both simulated and real-world scenes. Conclusion: GauSS-MI effectively addresses view selection and visual uncertainty quantification in active 3D reconstruction, improving both reconstruction quality and efficiency. Abstract: This research tackles the challenge of real-time active view selection and uncertainty quantification on visual quality for active 3D reconstruction. Visual quality is a critical aspect of 3D reconstruction. Recent advancements such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have notably enhanced the image rendering quality of reconstruction models. Nonetheless, the efficient and effective acquisition of input images for reconstruction (specifically, the selection of the most informative viewpoint) remains an open challenge, which is crucial for active reconstruction. Existing studies have primarily focused on evaluating geometric completeness and exploring unobserved or unknown regions, without direct evaluation of the visual uncertainty within the reconstruction model. To address this gap, this paper introduces a probabilistic model that quantifies visual uncertainty for each Gaussian. Leveraging Shannon Mutual Information, we formulate a criterion, Gaussian Splatting Shannon Mutual Information (GauSS-MI), for real-time assessment of visual mutual information from novel viewpoints, facilitating the selection of next best view. GauSS-MI is implemented within an active reconstruction system integrated with a view and motion planner. Extensive experiments across various simulated and real-world scenes showcase the superior visual quality and reconstruction efficiency performance of the proposed system.

[70] PhysicsFC: Learning User-Controlled Skills for a Physics-Based Football Player Controller

Minsu Kim,Eunho Jung,Yoonsang Lee

Main category: cs.GR

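TL;DR: PhysicsFC is a physics-based method for controlling simulated football players that supports seamless transitions among multiple skills (dribbling, trapping, moving, and kicking), with interactive control achieved through skill-specific policies and a state machine.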

Details

Motivation: The goal is to control a physically simulated football player across diverse skills while ensuring natural transitions between them, improving the interactive experience. Method: Skill-specific policies generate latent variables and are trained with tailored reward designs and initialization methods (such as DEGCL and STI); a finite state machine (FSM) provides interactive control. Result: Several interactive scenarios, including 11v11 games, demonstrate that the skill policies and transitions are natural and controllable. Conclusion: PhysicsFC performs effectively for physics-based football player control, supporting complex skills and natural transitions across diverse scenarios. Abstract: We propose PhysicsFC, a method for controlling physically simulated football player characters to perform a variety of football skills--such as dribbling, trapping, moving, and kicking--based on user input, while seamlessly transitioning between these skills. Our skill-specific policies, which generate latent variables for each football skill, are trained using an existing physics-based motion embedding model that serves as a foundation for reproducing football motions. Key features include a tailored reward design for the Dribble policy, a two-phase reward structure combined with projectile dynamics-based initialization for the Trap policy, and a Data-Embedded Goal-Conditioned Latent Guidance (DEGCL) method for the Move policy. Using the trained skill policies, the proposed football player finite state machine (PhysicsFC FSM) allows users to interactively control the character. To ensure smooth and agile transitions between skill policies, as defined in the FSM, we introduce the Skill Transition-Based Initialization (STI), which is applied during the training of each skill policy. We develop several interactive scenarios to showcase PhysicsFC's effectiveness, including competitive trapping and dribbling, give-and-go plays, and 11v11 football games, where multiple PhysicsFC agents produce natural and controllable physics-based football player behaviors. Quantitative evaluations further validate the performance of individual skill policies and the transitions between them, using the presented metrics and experimental designs.
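A finite state machine that switches between skill policies can be sketched as a transition table keyed by the current skill and an input event. The four state names follow the skills listed in the abstract; the events and the transition table itself are hypothetical and are not the PhysicsFC FSM.

```python
# Hypothetical transition table: (current skill, event) -> next skill.
TRANSITIONS = {
    ("Move", "ball_received"): "Trap",
    ("Trap", "ball_controlled"): "Dribble",
    ("Dribble", "kick_pressed"): "Kick",
    ("Dribble", "ball_lost"): "Move",
    ("Kick", "ball_released"): "Move",
}


def step(state, event):
    """Return the next skill; stay in the current skill if no rule matches."""
    return TRANSITIONS.get((state, event), state)


def run(state, events):
    """Feed a sequence of events through the state machine."""
    for event in events:
        state = step(state, event)
    return state


# A player receives the ball, controls it, shoots, and returns to moving.
final_state = run(
    "Move",
    ["ball_received", "ball_controlled", "kick_pressed", "ball_released"],
)
```

In a full system each state would invoke the corresponding trained policy every frame; the table only decides which policy is active, which is why smooth hand-offs between policies matter.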

[71] LSNIF: Locally-Subdivided Neural Intersection Function

Shin Fujieda,Chih-Chen Kao,Takahiro Harada

Main category: cs.GR

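TL;DR: LSNIF is a new neural representation that replaces the traditional BVH for ray casting; it achieves efficient rendering through sparse hash grid encoding and a tailored loss function, reducing memory footprint by up to 106.2x.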

Details

Motivation: Traditional BVHs consume a large amount of memory in ray tracing; LSNIF aims to reduce memory requirements through a neural representation while maintaining rendering efficiency. Method: Sparse hash grid encoding, geometry voxelization, and scene-agnostic training data collection are used; the network outputs visibility, hit-point information, and material indices. Result: LSNIF handles hit-point queries from arbitrary viewpoints, supports all ray types, and reduces memory footprint by up to 106.2x compared to a compressed BVH. Conclusion: LSNIF is an efficient neural alternative to the BVH that is suitable for rendering a variety of scenes. Abstract: Neural representations have shown the potential to accelerate ray casting in a conventional ray-tracing-based rendering pipeline. We introduce a novel approach called Locally-Subdivided Neural Intersection Function (LSNIF) that replaces bottom-level BVHs used as traditional geometric representations with a neural network. Our method introduces a sparse hash grid encoding scheme incorporating geometry voxelization, a scene-agnostic training data collection, and a tailored loss function. It enables the network to output not only visibility but also hit-point information and material indices. LSNIF can be trained offline for a single object, allowing us to use LSNIF as a replacement for its corresponding BVH. With these designs, the network can handle hit-point queries from any arbitrary viewpoint, supporting all types of rays in the rendering pipeline. We demonstrate that LSNIF can render a variety of scenes, including real-world scenes designed for other path tracers, while achieving a memory footprint reduction of up to 106.2x compared to a compressed BVH.

cs.CL [Back]

[72] Waking Up an AI: A Quantitative Framework for Prompt-Induced Phase Transition in Large Language Models

Makoto Sato

Main category: cs.CL

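TL;DR: The paper proposes a method for quantitatively analyzing the cognitive behavior of large language models (LLMs). Using a Transition-Inducing Prompt (TIP) and a Transition Quantifying Prompt (TQP), it finds that LLM behavior under semantically fused prompts differs from human intuitive behavior.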

Details

Motivation: To explore the cognitive dynamics of intuitive human thinking and, by comparing the cognitive behavior of humans and LLMs, reveal whether LLMs can replicate human intuitive processes. Method: A TIP is designed to trigger a change in LLM behavior, a TQP quantitatively evaluates that change, and controlled experiments compare the effect of semantically fused versus non-fused prompts on LLM responses. Result: LLMs show no human-like difference in responsiveness between semantically fused and non-fused prompts, suggesting they may lack the conceptual integration found in human intuition. Conclusion: The method provides a reproducible tool for quantifying LLM cognitive behavior and reveals key differences between LLMs and humans in intuition and conceptual leaps. Abstract: What underlies intuitive human thinking? One approach to this question is to compare the cognitive dynamics of humans and large language models (LLMs). However, such a comparison requires a method to quantitatively analyze AI cognitive behavior under controlled conditions. While anecdotal observations suggest that certain prompts can dramatically change LLM behavior, these observations have remained largely qualitative. Here, we propose a two-part framework to investigate this phenomenon: a Transition-Inducing Prompt (TIP) that triggers a rapid shift in LLM responsiveness, and a Transition Quantifying Prompt (TQP) that evaluates this change using a separate LLM. Through controlled experiments, we examined how LLMs react to prompts embedding two semantically distant concepts (e.g., mathematical aperiodicity and traditional crafts)--either fused together or presented separately--by changing their linguistic quality and affective tone. Whereas humans tend to experience heightened engagement when such concepts are meaningfully blended, producing a novel concept--a form of conceptual fusion--current LLMs showed no significant difference in responsiveness between semantically fused and non-fused prompts. This suggests that LLMs may not yet replicate the conceptual integration processes seen in human intuition. Our method enables fine-grained, reproducible measurement of cognitive responsiveness, and may help illuminate key differences in how intuition and conceptual leaps emerge in artificial versus human minds.

[73] Analyzing Feedback Mechanisms in AI-Generated MCQs: Insights into Readability, Lexical Properties, and Levels of Challenge

Antoun Yaacoub,Zainab Assaghir,Lionel Prevost,Jérôme Da-Rugna

Main category: cs.CL

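TL;DR: This study analyzes the linguistic characteristics (readability, lexical richness, and adaptability) of feedback generated by Google's Gemini 1.5-flash text model for computer science multiple-choice questions, revealing a dynamic interaction between feedback tone and question difficulty.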

Details

Motivation: Although AI-generated feedback has great potential in education, a comprehensive understanding of its linguistic characteristics remains limited; this study aims to fill that gap. Method: More than 1,200 MCQs with varying difficulty levels and feedback tones were analyzed, linguistic metrics were computed, and a RoBERTa-based multi-task learning model was trained to predict those metrics. Result: The model predicts readability and vocabulary richness well (MAE of 2.0 and 0.03, respectively), and significant interaction effects were found between feedback tone and question difficulty. Conclusion: The study offers insights for developing personalized AI feedback mechanisms while emphasizing ethical considerations in their design. Abstract: Artificial Intelligence (AI)-generated feedback in educational settings has garnered considerable attention due to its potential to enhance learning outcomes. However, a comprehensive understanding of the linguistic characteristics of AI-generated feedback, including readability, lexical richness, and adaptability across varying challenge levels, remains limited. This study delves into the linguistic and structural attributes of feedback generated by Google's Gemini 1.5-flash text model for computer science multiple-choice questions (MCQs). A dataset of over 1,200 MCQs was analyzed, considering three difficulty levels (easy, medium, hard) and three feedback tones (supportive, neutral, challenging). Key linguistic metrics, such as length, readability scores (Flesch-Kincaid Grade Level), vocabulary richness, and lexical density, were computed and examined. A fine-tuned RoBERTa-based multi-task learning (MTL) model was trained to predict these linguistic properties, achieving a Mean Absolute Error (MAE) of 2.0 for readability and 0.03 for vocabulary richness. The findings reveal significant interaction effects between feedback tone and question difficulty, demonstrating the dynamic adaptation of AI-generated feedback within diverse educational contexts. These insights contribute to the development of more personalized and effective AI-driven feedback mechanisms, highlighting the potential for improved learning outcomes while underscoring the importance of ethical considerations in their design and deployment.
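Illustration: the abstract names two quantities that are easy to make concrete, the Flesch-Kincaid Grade Level and the Mean Absolute Error. The following is a minimal sketch using the standard published grade-level formula; the vowel-group syllable counter and the function names are simplifying assumptions for illustration, not the authors' pipeline.

```python
import re

def count_syllables(word: str) -> int:
    # Heuristic (assumption): one syllable per run of consecutive vowels.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(text: str) -> float:
    # Standard formula: 0.39 * words/sentences + 11.8 * syllables/words - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59

def mean_absolute_error(y_true, y_pred) -> float:
    # Average absolute difference between reference and predicted values.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

An MAE of 2.0 on readability would therefore mean the predicted grade level is off by about two school grades on average.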

[74] Nested Named-Entity Recognition on Vietnamese COVID-19: Dataset and Experiments

Ngoc C. Lê,Hai-Chung Nguyen-Phung,Thu-Huong Pham Thi,Hue Vu,Phuong-Thao Nguyen Thi,Thu-Thuy Tran,Hong-Nhung Le Thi,Thuy-Duong Nguyen-Thi,Thanh-Huy Nguyen

Main category: cs.CL

TL;DR: The study presents a named-entity recognition (NER) system to assist COVID-19 prevention in Vietnam, along with a manually annotated Vietnamese dataset.

Details

Motivation: The COVID-19 pandemic caused major losses worldwide. Vietnam prevented the spread effectively by tracing, localizing, and quarantining contacts, but doing this manually is inefficient. Method: The study applies NER, defines new entity types, and builds a manually annotated Vietnamese dataset. Result: An NER system is presented together with a Vietnamese dataset covering the new entity types. Conclusion: NER can effectively assist COVID-19 prevention in Vietnam and improve efficiency. Abstract: The COVID-19 pandemic caused great losses worldwide; efforts have been made to prevent it, but many countries have failed. In Vietnam, tracing, localizing, and quarantining people who have been in contact with patients contributes to effective disease prevention. However, this is done by hand and takes a lot of work. In this research, we describe a named-entity recognition (NER) study that assists in the prevention of the COVID-19 pandemic in Vietnam. We also present our manually annotated COVID-19 dataset for the nested named-entity recognition task in Vietnamese, which defines new entity types used in our system.

[75] ViQA-COVID: COVID-19 Machine Reading Comprehension Dataset for Vietnamese

Hai-Chung Nguyen-Phung,Ngoc C. Lê,Van-Chien Nguyen,Hang Thi Nguyen,Thuy Phuong Thi Nguyen

Main category: cs.CL

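TL;DR: The paper introduces ViQA-COVID, the first COVID-19 multi-span extraction machine reading comprehension (MRC) dataset for Vietnamese, intended to support disease prevention and promote Vietnamese and multilingual MRC research.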

Details

Motivation: The severe impact of COVID-19 on the global economy and society, and the need for AI in disease prevention, motivated the creation of the ViQA-COVID dataset. Method: The ViQA-COVID dataset is constructed to support the development of MRC models for processing COVID-19-related information. Result: ViQA-COVID is the first Vietnamese COVID-19 multi-span extraction MRC dataset and can be used to build models and systems. Conclusion: ViQA-COVID fills a gap in Vietnamese MRC datasets and supports multilingual MRC research. Abstract: Two years after its appearance, COVID-19 has negatively affected people and normal life around the world. As of May 2022, there were more than 522 million cases and six million deaths worldwide (including nearly ten million cases and over forty-three thousand deaths in Vietnam). The economy and society are both severely affected. The Omicron variant of COVID-19 has broken through countries' disease prevention measures and rapidly increased the number of infections. Resource overload in treatment and epidemic prevention is happening all over the world. The application of artificial intelligence (AI) to support people at this time is therefore extremely necessary. Many studies have applied AI to COVID-19 prevention with very useful results, and studies on machine reading comprehension (MRC) are among them. Realizing this, we created the first MRC dataset about COVID-19 for Vietnamese, ViQA-COVID, which can be used to build models and systems that contribute to disease prevention. In addition, ViQA-COVID is also the first multi-span extraction MRC dataset for Vietnamese; we hope that it can help promote MRC studies in Vietnamese and in multilingual settings.

[76] HYPEROFA: Expanding LLM Vocabulary to New Languages via Hypernetwork-Based Embedding Initialization

Enes Özeren,Yihong Liu,Hinrich Schütze

Main category: cs.CL

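TL;DR: HYPEROFA proposes a hypernetwork-based adaptive token embedding initialization method to improve pre-trained language models on low-resource languages; it outperforms random initialization and matches or exceeds OFA.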

Details

Motivation: Pre-trained language models perform poorly on low-resource languages, mainly because of limited pre-training data. Existing methods such as OFA are effective but restrict the expressiveness of target-language token embeddings. Method: HYPEROFA uses a hypernetwork to map an external multilingual word vector space to the pre-trained model's token embedding space, generating more flexible initial embeddings for target-language tokens. Result: Experiments show that HYPEROFA outperforms random initialization and matches or exceeds OFA in both continual pre-training convergence and downstream task performance. Conclusion: HYPEROFA improves model performance on low-resource languages through adaptive initialization; the code is publicly available. Abstract: Many pre-trained language models (PLMs) exhibit suboptimal performance on mid- and low-resource languages, largely due to limited exposure to these languages during pre-training. A common strategy to address this is to introduce new tokens specific to the target languages, initialize their embeddings, and apply continual pre-training on target-language data. Among such methods, OFA (Liu et al., 2024a) proposes a similarity-based subword embedding initialization heuristic that is both effective and efficient. However, OFA restricts target-language token embeddings to be convex combinations of a fixed number of source-language embeddings, which may limit expressiveness. To overcome this limitation, we propose HYPEROFA, a hypernetwork-based approach for more adaptive token embedding initialization. The hypernetwork is trained to map from an external multilingual word vector space to the PLM's token embedding space using source-language tokens. Once trained, it can generate flexible embeddings for target-language tokens, serving as a good starting point for continual pre-training. Experiments demonstrate that HYPEROFA consistently outperforms the random initialization baseline and matches or exceeds the performance of OFA in both continual pre-training convergence and downstream task performance. We make the code publicly available.
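Illustration: the limitation the abstract attributes to OFA, that a target-token embedding is a convex combination of a fixed number of source embeddings, can be sketched as below. The neighbor selection by cosine similarity and the softmax weighting are assumptions made for this sketch, and `convex_init` is a hypothetical name; OFA's actual heuristic may differ in detail.

```python
import math

def cosine(u, v):
    # Cosine similarity; returns 0.0 if either vector has zero norm.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def convex_init(target_vec, source_vecs, source_embs, k=2):
    """Initialize one target-token embedding from its k most similar source tokens.

    target_vec / source_vecs live in the external word-vector space;
    source_embs are the corresponding PLM token embeddings.
    """
    sims = sorted(((cosine(target_vec, sv), i) for i, sv in enumerate(source_vecs)),
                  reverse=True)[:k]
    exps = [math.exp(s) for s, _ in sims]
    total = sum(exps)
    weights = [e / total for e in exps]  # non-negative, sum to 1
    dim = len(source_embs[0])
    emb = [sum(w * source_embs[i][d] for w, (_, i) in zip(weights, sims))
           for d in range(dim)]
    return emb, weights
```

Because the weights are non-negative and sum to one, every coordinate of the result stays inside the range spanned by the chosen source embeddings, which is the expressiveness limit that a learned hypernetwork mapping is meant to lift.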

[77] Kill two birds with one stone: generalized and robust AI-generated text detection via dynamic perturbations

Yinghan Zhou,Juan Wen,Wanli Peng,Yiming Xue,Ziwei Zhang,Zhengxian Wu

Main category: cs.CL

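TL;DR: The paper proposes a new AI-generated text detection method (DP-Net) that uses dynamic perturbations and reinforcement learning to address generalization and robustness together; experiments show strong performance in cross-domain scenarios and under adversarial attacks.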

Details

Motivation: As large language models become widespread, the risk of misusing AI-generated text (AIGT) grows, and existing methods fail to address generalization and robustness at the same time. Method: Robustness is treated as a form of domain shift, and DP-Net optimizes detection through a dynamic perturbation mechanism driven by reinforcement learning. Result: DP-Net shows superior generalization in three cross-domain scenarios and the best robustness under two text adversarial attacks. Conclusion: DP-Net offers a unified and efficient solution for AIGT detection; the code is open source. Abstract: The growing popularity of large language models has raised concerns regarding the potential to misuse AI-generated text (AIGT). It becomes increasingly critical to establish an excellent AIGT detection method with high generalization and robustness. However, existing methods either focus on model generalization or concentrate on robustness. A unified mechanism that simultaneously addresses the challenges of generalization and robustness is less explored. In this paper, we argue that robustness can be viewed as a specific form of domain shift, and empirically reveal an intrinsic mechanism for model generalization of the AIGT detection task. Then, we propose a novel AIGT detection method (DP-Net) via dynamic perturbations introduced by reinforcement learning with elaborated reward and action. Experimentally, extensive results show that the proposed DP-Net significantly outperforms some state-of-the-art AIGT detection methods for generalization capacity in three cross-domain scenarios. Meanwhile, DP-Net achieves the best robustness under two text adversarial attacks. The code is publicly available at https://github.com/CAU-ISS-Lab/AIGT-Detection-Evade-Detection/tree/main/DP-Net.

[78] Context-Enhanced Contrastive Search for Improved LLM Text Generation

Jaydip Sen,Rohit Pandey,Hetvi Waghela

Main category: cs.CL

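TL;DR: The paper proposes CECS, an improved contrastive search algorithm that uses dynamic contextual importance weighting, multi-level contrastive search, and related techniques to significantly improve the coherence and relevance of generated text.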

Details

Motivation: Although large language models have made remarkable progress in natural language processing, generating high-quality text that balances coherence, diversity, and relevance remains challenging, and traditional decoding methods suffer from repetition or incoherence. Method: Context-Enhanced Contrastive Search (CECS) is proposed, combining dynamic contextual importance weighting, multi-level contrastive search, and adaptive temperature control. Result: Experiments show that CECS outperforms existing contrastive search techniques on BLEU, ROUGE, and semantic similarity, with significantly improved coherence and relevance. Conclusion: CECS performs well at generating high-quality text and is suited to practical applications such as legal document drafting, customer service chatbots, and content marketing. Abstract: Recently, Large Language Models (LLMs) have demonstrated remarkable advancements in Natural Language Processing (NLP). However, generating high-quality text that balances coherence, diversity, and relevance remains challenging. Traditional decoding methods, such as beam search and top-k sampling, often struggle with either repetitive or incoherent outputs, particularly in tasks that require long-form text generation. To address these limitations, the paper proposes a novel enhancement of the well-known Contrastive Search algorithm, Context-Enhanced Contrastive Search (CECS) with contextual calibration. The proposed scheme introduces several novelties, including dynamic contextual importance weighting, multi-level Contrastive Search, and adaptive temperature control, to optimize the balance between fluency, creativity, and precision. The performance of CECS is evaluated using several standard metrics such as BLEU, ROUGE, and semantic similarity. Experimental results demonstrate significant improvements in both coherence and relevance of the generated texts, with CECS outperforming the existing Contrastive Search techniques. The proposed algorithm has several potential applications in the real world, including legal document drafting, customer service chatbots, and content marketing.
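Illustration: the baseline that CECS builds on, plain Contrastive Search, picks among the top-k candidates the token that balances model confidence against similarity to the tokens already generated. The sketch below shows one such selection step; the data layout and the name `contrastive_step` are assumptions for illustration, and none of the CECS additions (contextual weighting, multi-level search, adaptive temperature) are modeled because the abstract does not specify them.

```python
import math

def cosine(u, v):
    # Cosine similarity; returns 0.0 if either vector has zero norm.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def contrastive_step(candidates, context_states, alpha=0.6):
    """Pick the next token from top-k candidates.

    candidates: list of (token, probability, hidden_state) tuples.
    context_states: hidden states of the tokens generated so far.
    score = (1 - alpha) * probability - alpha * max similarity to context.
    """
    best_token, best_score = None, -math.inf
    for token, prob, state in candidates:
        degeneration_penalty = max(cosine(state, c) for c in context_states)
        score = (1 - alpha) * prob - alpha * degeneration_penalty
        if score > best_score:
            best_token, best_score = token, score
    return best_token
```

With `alpha=0` the step reduces to greedy decoding; larger values push the choice away from tokens whose representations repeat the context.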

[79] ConformalNL2LTL: Translating Natural Language Instructions into Temporal Logic Formulas with Conformal Correctness Guarantees

Jun Wang,David Smith Sundarsingh,Jyotirmoy V. Deshmukh,Yiannis Kantaros

Main category: cs.CL

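TL;DR: ConformalNL2LTL is a new natural-language-to-LTL translation method that combines question answering with uncertainty quantification to achieve user-defined translation success rates.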

Details

Motivation: To reduce the manual effort and expertise needed to define LTL tasks while providing correctness guarantees. Method: Formulas are built iteratively by solving open-vocabulary question-answering problems with an LLM, and conformal prediction (CP) quantifies the uncertainty of the answers. Result: ConformalNL2LTL reaches the user-specified translation accuracy while minimizing the help rate. Conclusion: The method is shown to be effective both theoretically and empirically, offering a reliable solution for NL-to-LTL translation. Abstract: Linear Temporal Logic (LTL) has become a prevalent specification language for robotic tasks. To mitigate the significant manual effort and expertise required to define LTL-encoded tasks, several methods have been proposed for translating Natural Language (NL) instructions into LTL formulas, which, however, lack correctness guarantees. To address this, we introduce a new NL-to-LTL translation method, called ConformalNL2LTL, that can achieve user-defined translation success rates over unseen NL commands. Our method constructs LTL formulas iteratively by addressing a sequence of open-vocabulary Question-Answering (QA) problems with LLMs. To enable uncertainty-aware translation, we leverage conformal prediction (CP), a distribution-free uncertainty quantification tool for black-box models. CP enables our method to assess the uncertainty in LLM-generated answers, allowing it to proceed with translation when sufficiently confident and request help otherwise. We provide both theoretical and empirical results demonstrating that ConformalNL2LTL achieves user-specified translation accuracy while minimizing help rates.
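Illustration: the "proceed when confident, ask for help otherwise" behavior can be sketched with generic split conformal prediction. This is a minimal sketch under assumptions: the nonconformity score `1 - probability`, the singleton-set rule, and all function names are illustrative choices, not the paper's exact construction.

```python
import math

def conformal_threshold(cal_scores, alpha):
    # Split conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest
    # calibration score; infinite if that rank exceeds n.
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(cal_scores)[rank - 1]

def prediction_set(answer_probs, qhat):
    # Keep every candidate answer whose nonconformity (1 - p) is within qhat.
    return {a for a, p in answer_probs.items() if 1 - p <= qhat}

def decide(answer_probs, qhat):
    # Proceed only when exactly one answer survives; otherwise return None,
    # which stands for "request help" in this sketch.
    candidates = prediction_set(answer_probs, qhat)
    if len(candidates) == 1:
        return next(iter(candidates))
    return None
```

Here `cal_scores` would hold `1 - p(true answer)` on held-out calibration questions, and `alpha` is the tolerated miscoverage, so a smaller `alpha` yields larger prediction sets and more help requests.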

[80] Param$Δ$ for Direct Weight Mixing: Post-Train Large Language Model at Zero Cost

Sheng Cao,Mingrui Wu,Karthik Prasad,Yuandong Tian,Zechun Liu

Main category: cs.CL

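TL;DR: This paper proposes $Param\Delta$, a new method that transfers the knowledge of an existing post-trained model directly to a newly updated base model with no additional training, greatly reducing the compute cost and data requirements of post-training.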

Details

Motivation: Post-training requires large amounts of high-quality data and compute and carries a risk of overfitting. This paper aims to address these issues with an efficient alternative to post-training. Method: The difference between the post-trained model weights and the base model weights is computed and applied to the newly updated base model to transfer knowledge. The formula is $\Theta_{\text{Param}\Delta} = \Theta_\text{post} - \Theta_\text{base} + \Theta'_\text{base}$. Result: Validated on several models (such as Llama3 and Qwen), the $Param\Delta$ model performs close to traditionally post-trained models (for example, reaching about 95% of the Llama3.1-inst model's performance). Conclusion: $Param\Delta$ gives the open-weight model community a zero-cost alternative to post-training and accelerates iterative model development. Abstract: The post-training phase of large language models is essential for enhancing capabilities such as instruction-following, reasoning, and alignment with human preferences. However, it demands extensive high-quality data and poses risks like overfitting, alongside significant computational costs due to repeated post-training and evaluation after each base model update. This paper introduces $Param\Delta$, a novel method that streamlines post-training by transferring knowledge from an existing post-trained model to a newly updated base model with ZERO additional training. By computing the difference between post-trained model weights ($\Theta_\text{post}$) and base model weights ($\Theta_\text{base}$), and adding this to the updated base model ($\Theta'_\text{base}$), we define $Param\Delta$ Model as: $\Theta_{\text{Param}\Delta} = \Theta_\text{post} - \Theta_\text{base} + \Theta'_\text{base}$. This approach surprisingly equips the new base model with post-trained capabilities, achieving performance comparable to direct post-training. We performed analysis on Llama3, Llama3.1, Qwen, and DeepSeek-distilled models. Results indicate $Param\Delta$ Model effectively replicates traditional post-training. For example, the $Param\Delta$ Model obtained from 70B Llama3-inst, Llama3-base, Llama3.1-base models attains approximately 95\% of Llama3.1-inst model's performance on average. $Param\Delta$ brings a new perspective on how to fully leverage models in the open-weight community, where checkpoints for base and instruct models are readily available and frequently updated, by providing a cost-free framework to accelerate the iterative cycle of model development.
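Illustration: the weight-mixing formula $\Theta_{\text{Param}\Delta} = \Theta_\text{post} - \Theta_\text{base} + \Theta'_\text{base}$ is plain element-wise arithmetic over matching parameters. The sketch below applies it to checkpoints represented as dictionaries of flat float lists; that representation and the name `param_delta` are assumptions for illustration, and real checkpoints would use tensors with identical names and shapes.

```python
def param_delta(theta_post, theta_base, theta_base_new):
    """Return theta_post - theta_base + theta_base_new, parameter by parameter.

    Each argument maps a parameter name to a flat list of floats.
    All three checkpoints must share the same parameter names.
    """
    if not (theta_post.keys() == theta_base.keys() == theta_base_new.keys()):
        raise ValueError("checkpoints must have identical parameter names")
    merged = {}
    for name in theta_post:
        merged[name] = [
            post - base + new_base
            for post, base, new_base in zip(
                theta_post[name], theta_base[name], theta_base_new[name]
            )
        ]
    return merged
```

No gradient step appears anywhere in the procedure, which is what makes the transfer training-free: the post-training delta is computed once and added to whichever updated base checkpoint shares the architecture.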

[81] WebEvolver: Enhancing Web Agent Self-Improvement with Coevolving World Model

Tianqing Fang,Hongming Zhang,Zhisong Zhang,Kaixin Ma,Wenhao Yu,Haitao Mi,Dong Yu

Main category: cs.CL

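TL;DR: Proposes a new framework incorporating a World Model LLM that strengthens exploration and the use of pre-trained knowledge to overcome performance stagnation in autonomous learning; experiments show a 10% performance gain.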

Details

Motivation: Existing self-improvement methods hit a performance plateau, mainly because of limited exploration of the environment and insufficient use of pre-trained knowledge. Method: A co-evolving World Model LLM is introduced to predict environment states and generate self-instructed training data, and it also serves as an imagination engine during inference. Result: Experiments in several real-world web environments show a 10% performance gain without relying on closed-source models. Conclusion: Integrating world models is key to sustained adaptability in autonomous agents. Abstract: Agent self-improvement, where the backbone Large Language Model (LLM) of the agent is trained on trajectories sampled autonomously based on its own policy, has emerged as a promising approach for enhancing performance. Recent advancements, particularly in web environments, face a critical limitation: their performance will reach a stagnation point during autonomous learning cycles, hindering further improvement. We argue that this stems from limited exploration of the web environment and insufficient exploitation of pre-trained web knowledge in LLMs. To improve the performance of self-improvement, we propose a novel framework that introduces a co-evolving World Model LLM. This world model predicts the next observation based on the current observation and action within the web environment. Leveraging LLMs' pretrained knowledge of abundant web content, the World Model serves dual roles: (1) as a virtual web server generating self-instructed training data to continuously refine the agent's policy, and (2) as an imagination engine during inference, enabling look-ahead simulation to guide action selection for the agent LLM. Experiments in real-world web environments (Mind2Web-Live, WebVoyager, and GAIA-web) show a 10% performance gain over existing self-evolving agents, demonstrating the efficacy and generalizability of our approach, without using any distillation from more powerful closed-source models. Our work establishes the necessity of integrating world models into autonomous agent frameworks to unlock sustained adaptability.

[82] Durghotona GPT: A Web Scraping and Large Language Model Based Framework to Generate Road Accident Dataset Automatically in Bangladesh

MD Thamed Bin Zaman Chowdhury,Moazzem Hossain,Md. Ridwanul Islam

Main category: cs.CL

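TL;DR: The paper proposes 'Durghotona GPT', a framework that combines web scraping and large language models (LLMs) to automatically generate comprehensive road accident datasets from major newspapers in Bangladesh.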

Details

Motivation: Road accidents cause major financial losses and societal problems, and accurate, timely data is essential for predicting and mitigating them. Method: Accident news is collected from three major newspapers and processed with GPT-4, GPT-3.5, and Llama-3 to extract information and categorize reports. Result: Llama-3 performs comparably to GPT-4, reaching 89% accuracy, which makes it a cost-effective alternative. The framework substantially improves data quality and availability. Conclusion: The framework can support traffic safety analysis, urban planning, and public health applications; future work will expand data collection methods and refine the LLMs. Abstract: Road accidents pose significant concerns globally. They lead to large financial losses, injuries, disabilities, and societal challenges. Accurate and timely accident data is essential for predicting and mitigating these events. This paper presents a novel framework named 'Durghotona GPT' that integrates web scraping and Large Language Models (LLMs) to automate the generation of comprehensive accident datasets from prominent national dailies in Bangladesh. The authors collected accident reports from three major newspapers: Prothom Alo, Dhaka Tribune, and The Daily Star. The collected news was then processed using the newest available LLMs: GPT-4, GPT-3.5, and Llama-3. The framework efficiently extracts relevant information, categorizes reports, and compiles detailed datasets. Thus, this framework overcomes limitations of manual data collection methods such as delays, errors, and communication gaps. The authors' evaluation demonstrates that Llama-3, an open-source model, performs comparably to GPT-4. It achieved 89% accuracy in the authors' evaluation. Therefore, it can be considered a cost-effective alternative for similar tasks. The results suggest that the framework developed by the authors can drastically enhance the quality and availability of accident data. As a result, it can support critical applications in traffic safety analysis, urban planning, and public health. The authors also developed an interface for 'Durghotona GPT' for ease of use as part of this paper. Future work will focus on expanding data collection methods and refining LLMs to further increase dataset accuracy and applicability.

[83] Creating and Evaluating Code-Mixed Nepali-English and Telugu-English Datasets for Abusive Language Detection Using Traditional and Deep Learning Models

Manish Pandey,Nageshwar Prasad Yadav,Mokshada Adduru,Sawan Rai

Main category: cs.CL

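TL;DR: The study presents an approach to abusive language detection in Telugu-English and Nepali-English code-mixed text, evaluated with a range of machine learning, deep learning, and large language models, filling a gap in abuse detection for low-resource languages.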

Details

Motivation: As multilingual users on social media increase, detecting abusive language in code-mixed text becomes more challenging, and low-resource languages such as Telugu and Nepali remain understudied. Method: A dataset of 2,000 Telugu-English and 5,000 Nepali-English code-mixed comments is built, and multiple models (Logistic Regression, Random Forest, SVM, Neural Networks, LSTM, CNN, and LLMs) are evaluated and tuned. Result: The study analyzes the challenges of abusive language detection in code-mixed text and compares the performance of different computational approaches, providing a benchmark for NLP research on low-resource languages. Conclusion: The study fills a gap in abuse detection for low-resource languages and supports stronger content moderation strategies for multilingual social media environments. Abstract: With the growing presence of multilingual users on social media, detecting abusive language in code-mixed text has become increasingly challenging. Code-mixed communication, where users seamlessly switch between English and their native languages, poses difficulties for traditional abuse detection models, as offensive content may be context-dependent or obscured by linguistic blending. While abusive language detection has been extensively explored for high-resource languages like English and Hindi, low-resource languages such as Telugu and Nepali remain underrepresented, leaving gaps in effective moderation. In this study, we introduce a novel, manually annotated dataset of 2 thousand Telugu-English and 5 thousand Nepali-English code-mixed comments, categorized as abusive and non-abusive, collected from various social media platforms. The dataset undergoes rigorous preprocessing before being evaluated across multiple Machine Learning (ML), Deep Learning (DL), and Large Language Models (LLMs). We experimented with models including Logistic Regression, Random Forest, Support Vector Machines (SVM), Neural Networks (NN), LSTM, CNN, and LLMs, optimizing their performance through hyperparameter tuning, and evaluated them using 10-fold cross-validation and statistical significance testing (t-test). Our findings provide key insights into the challenges of detecting abusive language in code-mixed settings and offer a comparative analysis of computational approaches. This study contributes to advancing NLP for low-resource languages by establishing benchmarks for abusive language detection in Telugu-English and Nepali-English code-mixed text. The dataset and insights can aid in the development of more robust moderation strategies for multilingual social media environments.

[84] UrbanPlanBench: A Comprehensive Urban Planning Benchmark for Evaluating Large Language Models

Yu Zheng,Longyi Liu,Yuming Lin,Jie Feng,Guozhen Zhang,Depeng Jin,Yong Li

Main category: cs.CL

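TL;DR: The paper introduces the UrbanPlanBench benchmark and the UrbanPlanText dataset to evaluate LLMs in urban planning, finds that their performance is uneven across areas, and improves it through fine-tuning.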

Details

Motivation: To explore the potential of LLMs in urban planning and fill a gap in existing research. Method: The UrbanPlanBench benchmark and the UrbanPlanText dataset of more than 30,000 instruction pairs are introduced and used for evaluation and fine-tuning. Result: LLMs perform poorly on tasks such as understanding planning regulations; fine-tuning improves performance, but there is still room for improvement. Conclusion: The resources are released publicly to promote the integration of LLMs into urban planning and foster human-machine collaboration. Abstract: The advent of Large Language Models (LLMs) holds promise for revolutionizing various fields traditionally dominated by human expertise. Urban planning, a professional discipline that fundamentally shapes our daily surroundings, is one such field heavily relying on multifaceted domain knowledge and experience of human experts. The extent to which LLMs can assist human practitioners in urban planning remains largely unexplored. In this paper, we introduce a comprehensive benchmark, UrbanPlanBench, tailored to evaluate the efficacy of LLMs in urban planning, which encompasses fundamental principles, professional knowledge, and management and regulations, aligning closely with the qualifications expected of human planners. Through extensive evaluation, we reveal a significant imbalance in the acquisition of planning knowledge among LLMs, with even the most proficient models falling short of meeting professional standards. For instance, we observe that 70% of LLMs achieve subpar performance in understanding planning regulations compared to other aspects. Besides the benchmark, we present the largest-ever supervised fine-tuning (SFT) dataset, UrbanPlanText, comprising over 30,000 instruction pairs sourced from urban planning exams and textbooks. Our findings demonstrate that fine-tuned models exhibit enhanced performance in memorization tests and comprehension of urban planning knowledge, while there exists significant room for improvement, particularly in tasks requiring domain-specific terminology and reasoning. By making our benchmark, dataset, and associated evaluation and fine-tuning toolsets publicly available at https://github.com/tsinghua-fib-lab/PlanBench, we aim to catalyze the integration of LLMs into practical urban planning, fostering a symbiotic collaboration between human expertise and machine intelligence.

[85] Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts

Hanhua Hong,Chenghao Xiao,Yang Wang,Yiqi Liu,Wenge Rong,Chenghua Lin

Main category: cs.CL

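TL;DR: Proposes an inversion-learning method that automatically generates highly effective, model-specific evaluation prompts, addressing the sensitivity of LLM-based evaluation to prompt design.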

Details

Motivation: Human evaluation suffers from inconsistency and bias, while LLM-based evaluation is sensitive to prompt design, so a more robust and efficient evaluation method is needed. Method: Inversion learning maps model outputs back to their input instructions, automatically generating evaluation prompts. Result: The method needs only a single evaluation sample and no manual prompt design, improving efficiency and robustness. Conclusion: The work offers a new direction for robust and efficient LLM-based evaluation. Abstract: Evaluating natural language generation (NLG) systems is challenging due to the diversity of valid outputs. While human evaluation is the gold standard, it suffers from inconsistencies, lack of standardisation, and demographic biases, limiting reproducibility. LLM-based evaluation offers a scalable alternative but is highly sensitive to prompt design, where small variations can lead to significant discrepancies. In this work, we propose an inversion learning method that learns effective reverse mappings from model outputs back to their input instructions, enabling the automatic generation of highly effective, model-specific evaluation prompts. Our method requires only a single evaluation sample and eliminates the need for time-consuming manual prompt engineering, thereby improving both efficiency and robustness. Our work contributes toward a new direction for more robust and efficient LLM-based evaluation.

[86] LLM Enhancer: Merged Approach using Vector Embedding for Reducing Large Language Model Hallucinations with External Knowledge

Naheed Rayhan,Md. Ashrafuzzaman

Main category: cs.CL

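TL;DR: The LLM ENHANCER system integrates multiple online data sources (such as Google and Wikipedia) to improve LLM accuracy and reduce hallucinations while keeping responses natural.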

Details

Motivation: Although LLMs such as ChatGPT perform well across many tasks, their use in critical scenarios is limited by inaccurate information and poor use of external knowledge. Method: The system acquires data in parallel, uses custom agent tools to manage the information flow, and selects the most relevant information with vector embeddings before passing it to the LLM. Result: LLM ENHANCER reduces hallucinations in LLMs while preserving response accuracy and naturalness. Conclusion: The system offers a workable solution for reliable use of LLMs in critical scenarios. Abstract: Large Language Models (LLMs), such as ChatGPT, have demonstrated the capability to generate human-like, natural responses across a range of tasks, including task-oriented dialogue and question answering. However, their application in real-world, critical scenarios is often hindered by a tendency to produce inaccurate information and a limited ability to leverage external knowledge sources. This paper introduces the LLM ENHANCER system, designed to integrate multiple online sources such as Google, Wikipedia, and DuckDuckGo to enhance data accuracy. The LLMs employed within this system are open source. The data acquisition process for the LLM ENHANCER system operates in parallel, utilizing custom agent tools to manage the flow of information. Vector embeddings are used to identify the most pertinent information, which is subsequently supplied to the LLM for user interaction. The LLM ENHANCER system mitigates hallucinations in chat-based LLMs while preserving response naturalness and accuracy.
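Illustration: selecting "the most pertinent information" with vector embeddings usually comes down to ranking retrieved passages by similarity to the query embedding. The sketch below uses cosine similarity over precomputed vectors; the embedding model is out of scope, and `top_k_passages` and the tuple layout are hypothetical names chosen for this example, not the system's actual interface.

```python
import math

def cosine(u, v):
    # Cosine similarity; returns 0.0 if either vector has zero norm.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k_passages(query_vec, passages, k=2):
    """Return the texts of the k passages most similar to the query.

    passages: list of (text, embedding) pairs gathered from the sources.
    """
    ranked = sorted(passages, key=lambda p: cosine(query_vec, p[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

The selected texts would then be placed in the LLM prompt as context, which keeps the prompt short while grounding the answer in retrieved material.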

[87] Detecting Manipulated Contents Using Knowledge-Grounded Inference

Mark Huasong Meng,Ruizhe Wang,Meng Xu,Chuan Yan,Guangdong Bai

Main category: cs.CL

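TL;DR: Manicod is a tool for detecting zero-day manipulated content; it performs real-time contextual analysis using retrieval-augmented generation (RAG) and a large language model (LLM) and outperforms existing methods.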

Details

Motivation: Existing methods rely on intrinsic knowledge acquired during training or on manually curated context, so they cannot effectively detect zero-day manipulated content. Method: Manicod sources contextual information from mainstream search engines, vectorizes it through RAG, and lets the LLM perform inference to produce a decision and an explanation. Result: On a dataset of 4270 pieces of manipulated fake news, Manicod achieves an F1 score of 0.856 and outperforms existing methods by up to 1.9x in F1 score on their benchmarks. Conclusion: Manicod performs well at detecting zero-day manipulated content and outperforms traditional methods. Abstract: The detection of manipulated content, a prevalent form of fake news, has been widely studied in recent years. While existing solutions have been proven effective in fact-checking and analyzing fake news based on historical events, the reliance on either intrinsic knowledge obtained during training or manually curated context hinders them from tackling zero-day manipulated content, which can only be recognized with real-time contextual information. In this work, we propose Manicod, a tool designed for detecting zero-day manipulated content. Manicod first sources contextual information about the input claim from mainstream search engines, and subsequently vectorizes the context for the large language model (LLM) through retrieval-augmented generation (RAG). The LLM-based inference can produce a "truthful" or "manipulated" decision and offer a textual explanation for the decision. To validate the effectiveness of Manicod, we also propose a dataset comprising 4270 pieces of manipulated fake news derived from 2500 recent real-world news headlines. Manicod achieves an overall F1 score of 0.856 on this dataset and outperforms existing methods by up to 1.9x in F1 score on their benchmarks on fact-checking and claim verification.
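Illustration: the headline number here is an F1 score over binary "truthful" versus "manipulated" decisions. The sketch below computes it from confusion counts, treating "manipulated" as the positive class; that choice of positive class and the function name are assumptions for illustration, since the abstract does not say how the overall score is aggregated.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    # F1 is the harmonic mean of precision and recall, which simplifies to
    # 2*TP / (2*TP + FP + FN). Returns 0.0 when there are no positives at all.
    denominator = 2 * true_positives + false_positives + false_negatives
    if denominator == 0:
        return 0.0
    return 2 * true_positives / denominator
```

Because F1 ignores true negatives, it rewards a detector for catching manipulated items without being inflated by the many claims that are simply truthful.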

[88] Small or Large? Zero-Shot or Finetuned? Guiding Language Model Choice for Specialized Applications in Healthcare

Lovedeep Gondara,Jonathan Simkin,Graham Sayle,Shebnum Devji,Gregory Arbour,Raymond Ng

Main category: cs.CL

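TL;DR: The study examines key questions in language model selection: whether finetuning is needed versus zero-shot use, the advantage of domain-adjacent over generic pretrained models, the value of domain-specific pretraining, and how Small Language Models (SLMs) compare with Large Language Models (LLMs) on specific tasks.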

Details

Motivation: To guide language model selection, especially for specialized-domain tasks, by comparing the performance and resource cost of SLMs and LLMs. Method: Using electronic pathology reports, three classification scenarios of varying difficulty and data size are evaluated, comparing zero-shot and finetuned SLMs with a zero-shot LLM. Result: Finetuning significantly improves SLM performance, letting SLMs surpass the zero-shot LLM; domain-adjacent pretrained models perform better; domain-specific pretraining helps markedly on complex tasks. Conclusion: Finetuned SLMs outperform a zero-shot LLM on specialized tasks with better resource efficiency, showing they remain valuable in the LLM era. Abstract: This study aims to guide language model selection by investigating: 1) the necessity of finetuning versus zero-shot usage, 2) the benefits of domain-adjacent versus generic pretrained models, 3) the value of further domain-specific pretraining, and 4) the continued relevance of Small Language Models (SLMs) compared to Large Language Models (LLMs) for specific tasks. Using electronic pathology reports from the British Columbia Cancer Registry (BCCR), three classification scenarios with varying difficulty and data size are evaluated. Models include various SLMs and an LLM. SLMs are evaluated both zero-shot and finetuned; the LLM is evaluated zero-shot only. Finetuning significantly improved SLM performance across all scenarios compared to their zero-shot results. The zero-shot LLM outperformed zero-shot SLMs but was consistently outperformed by finetuned SLMs. Domain-adjacent SLMs generally performed better than the generic SLM after finetuning, especially on harder tasks. Further domain-specific pretraining yielded modest gains on easier tasks but significant improvements on the complex, data-scarce task. The results highlight the critical role of finetuning for SLMs in specialized domains, enabling them to surpass zero-shot LLM performance on targeted classification tasks. Pretraining on domain-adjacent or domain-specific data provides further advantages, particularly for complex problems or limited finetuning data. While LLMs offer strong zero-shot capabilities, their performance on these specific tasks did not match that of appropriately finetuned SLMs. In the era of LLMs, SLMs remain relevant and effective, offering a potentially superior performance-resource trade-off compared to LLMs.

[89] Automatic Legal Writing Evaluation of LLMs

Ramon Pires,Roseval Malaquias Junior,Rodrigo Nogueira

Main category: cs.CL

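TL;DR: The paper introduces oab-bench, a legal-domain benchmark based on the Brazilian Bar Examination, for evaluating the legal writing of large language models. Claude-3.5 Sonnet performs best, and the paper also explores the potential of LLMs as automated judges.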

Details

Motivation: Test datasets for legal writing that are public, frequently updated, and accompanied by comprehensive evaluation guidelines are scarce; the Brazilian Bar Examination meets these requirements. Method: Build the oab-bench benchmark of 105 questions with evaluation guidelines, test four LLMs on it, and study how reliable LLMs are as automated judges. Result: Claude-3.5 Sonnet averages 7.93 out of 10 and passes all exams; frontier models such as OpenAI's o1 correlate strongly with human scores. Conclusion: oab-bench provides an effective evaluation tool for the legal domain, and LLMs show potential for automated evaluation of legal writing. Abstract: Despite the recent advances in Large Language Models, benchmarks for evaluating legal writing remain scarce due to the inherent complexity of assessing open-ended responses in this domain. One of the key challenges in evaluating language models on domain-specific tasks is finding test datasets that are public, frequently updated, and contain comprehensive evaluation guidelines. The Brazilian Bar Examination meets these requirements. We introduce oab-bench, a benchmark comprising 105 questions across seven areas of law from recent editions of the exam. The benchmark includes comprehensive evaluation guidelines and reference materials used by human examiners to ensure consistent grading. We evaluate the performance of four LLMs on oab-bench, finding that Claude-3.5 Sonnet achieves the best results with an average score of 7.93 out of 10, passing all 21 exams. We also investigated whether LLMs can serve as reliable automated judges for evaluating legal writing. Our experiments show that frontier models like OpenAI's o1 achieve a strong correlation with human scores when evaluating approved exams, suggesting their potential as reliable automated evaluators despite the inherently subjective nature of legal writing assessment. The source code and the benchmark -- containing questions, evaluation guidelines, model-generated responses, and their respective automated evaluations -- are publicly available.

[90] Pretraining Large Brain Language Model for Active BCI: Silent Speech

Jinzhao Zhou,Zehong Cao,Yiqun Duan,Connor Barkley,Daniel Leong,Xiaowei Jiang,Quoc-Toan Nguyen,Ziyi Zhao,Thomas Do,Yu-Cheng Chang,Sheng-Fu Liang,Chin-teng Lin

Main category: cs.CL

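TL;DR: The paper proposes a silent speech decoding method for active brain-computer interfaces (BCI). It pretrains a Large Brain Language Model (LBLM) with a new Future Spectro-Temporal Prediction (FSTP) paradigm and substantially improves classification performance.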

Details

Motivation: Traditional BCI systems are limited in naturalness and flexibility; the paper aims to make BCI more practical through silent speech decoding. Method: Propose the LBLM model, pretrained with the FSTP paradigm to learn representations from unlabeled EEG data, then finetuned on downstream tasks. Result: In the cross-session setting, LBLM reaches 47.0% accuracy on semantic-level classification and 39.6% on word-level classification, clearly ahead of the baselines. Conclusion: The work offers a new solution for silent speech decoding in active BCI systems and contributes a new dataset. Abstract: This paper explores silent speech decoding in active brain-computer interface (BCI) systems, which offer more natural and flexible communication than traditional BCI applications. We collected a new silent speech dataset of over 120 hours of electroencephalogram (EEG) recordings from 12 subjects, capturing 24 commonly used English words for language model pretraining and decoding. Following the recent success of pretraining large models with self-supervised paradigms to enhance EEG classification performance, we propose Large Brain Language Model (LBLM) pretrained to decode silent speech for active BCI. To pretrain LBLM, we propose Future Spectro-Temporal Prediction (FSTP) pretraining paradigm to learn effective representations from unlabeled EEG data. Unlike existing EEG pretraining methods that mainly follow a masked-reconstruction paradigm, our proposed FSTP method employs autoregressive modeling in temporal and frequency domains to capture both temporal and spectral dependencies from EEG signals. After pretraining, we finetune our LBLM on downstream tasks, including word-level and semantic-level classification. Extensive experiments demonstrate significant performance gains of the LBLM over fully-supervised and pretrained baseline models. For instance, in the difficult cross-session setting, our model achieves 47.0% accuracy on semantic-level classification and 39.6% in word-level classification, outperforming baseline methods by 5.4% and 7.3%, respectively. Our research advances silent speech decoding in active BCI systems, offering an innovative solution for EEG language model pretraining and a new dataset for fundamental research.

[91] Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math

Haoran Xu,Baolin Peng,Hany Awadalla,Dongdong Chen,Yen-Chun Chen,Mei Gao,Young Jin Kim,Yunsheng Li,Liliang Ren,Yelong Shen,Shuohang Wang,Weijian Xu,Jianfeng Gao,Weizhu Chen

Main category: cs.CL

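TL;DR: The paper presents a systematic four-step training recipe that substantially improves the reasoning ability of small language models (SLMs), surpassing larger models on math reasoning tasks.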

Details

Motivation: Chain-of-Thought (CoT) markedly improves reasoning in large language models (LLMs), but improving reasoning in small language models (SLMs) remains hard because of their limited capacity. The paper targets this gap. Method: A four-step training recipe applied to Phi-4-Mini: large-scale mid-training, supervised fine-tuning, Rollout DPO, and reinforcement learning with verifiable reward. Result: Phi-4-Mini-Reasoning performs strongly on math reasoning tasks, exceeding larger models such as DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B. Conclusion: A carefully designed training recipe with high-quality CoT data can unlock strong reasoning in small models. Abstract: Chain-of-Thought (CoT) significantly enhances formal reasoning capabilities in Large Language Models (LLMs) by training them to explicitly generate intermediate reasoning steps. While LLMs readily benefit from such techniques, improving reasoning in Small Language Models (SLMs) remains challenging due to their limited model capacity. Recent work by Deepseek-R1 demonstrates that distillation from LLM-generated synthetic data can substantially improve the reasoning ability of SLM. However, the detailed modeling recipe is not disclosed. In this work, we present a systematic training recipe for SLMs that consists of four steps: (1) large-scale mid-training on diverse distilled long-CoT data, (2) supervised fine-tuning on high-quality long-CoT data, (3) Rollout DPO leveraging a carefully curated preference dataset, and (4) Reinforcement Learning (RL) with Verifiable Reward. We apply our method on Phi-4-Mini, a compact 3.8B-parameter model. The resulting Phi-4-Mini-Reasoning model exceeds, on math reasoning tasks, much larger reasoning models, e.g., outperforming DeepSeek-R1-Distill-Qwen-7B by 3.2 points and DeepSeek-R1-Distill-Llama-8B by 7.7 points on Math-500. Our results validate that a carefully designed training recipe, with large-scale high-quality CoT data, is effective to unlock strong reasoning capabilities even in resource-constrained small models.

[92] Memorization and Knowledge Injection in Gated LLMs

Xu Pan,Ely Hahami,Zechen Zhang,Haim Sompolinsky

Main category: cs.CL

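TL;DR: The MEGa framework injects event memories directly into LLM weights, addressing the difficulty LLMs have with continually learning and integrating new knowledge.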

Details

Motivation: LLMs handle continual learning and memory integration poorly, in contrast to the human ability to keep learning. Existing methods mostly rely on large context windows or external memory buffers, and are rarely tested on everyday life events. Method: Propose the MEGa framework, which stores memories in low-rank weights and activates the relevant ones through a gating mechanism. Result: On datasets of fictional characters and Wikipedia events, MEGa outperforms baselines and mitigates catastrophic forgetting. Conclusion: Inspired by the complementary memory system of the human brain, MEGa offers a new direction for continual learning in LLMs. Abstract: Large Language Models (LLMs) currently struggle to sequentially add new memories and integrate new knowledge. These limitations contrast with the human ability to continuously learn from new experiences and acquire knowledge throughout life. Most existing approaches add memories either through large context windows or external memory buffers (e.g., Retrieval-Augmented Generation), and studies on knowledge injection rarely test scenarios resembling everyday life events. In this work, we introduce a continual learning framework, Memory Embedded in Gated LLMs (MEGa), which injects event memories directly into the weights of LLMs. Each memory is stored in a dedicated set of gated low-rank weights. During inference, a gating mechanism activates relevant memory weights by matching query embeddings to stored memory embeddings. This enables the model to both recall entire memories and answer related questions. On two datasets - fictional characters and Wikipedia events - MEGa outperforms baseline approaches in mitigating catastrophic forgetting. Our model draws inspiration from the complementary memory system of the human brain.
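
The gating step described in the MEGa abstract (matching a query embedding to stored memory embeddings, then activating per-memory low-rank weights) can be illustrated with a minimal sketch. The softmax-over-cosine gate, the `temperature` value, and the function names `gate_weights` and `gated_weight` are assumptions made for illustration; the abstract does not give the exact gating rule.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def gate_weights(query_emb, memory_embs, temperature=0.1):
    """Softmax over query-to-memory similarities (assumed gating rule)."""
    sims = [cosine(query_emb, m) for m in memory_embs]
    top = max(sims)
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gated_weight(base, adapters, gates):
    """Return base + sum_i gates[i] * (A_i @ B_i) for low-rank pairs (A_i, B_i)."""
    out = [row[:] for row in base]
    for gate, (a, b) in zip(gates, adapters):
        delta = matmul(a, b)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                out[i][j] += gate * value
    return out
```

With one adapter pair per stored memory, a query close to a single memory embedding drives that memory's gate toward 1, so the effective weight is dominated by the matching low-rank update.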

[93] Talk Before You Retrieve: Agent-Led Discussions for Better RAG in Medical QA

Xuanzhao Dong,Wenhui Zhu,Hao Wang,Xiwen Chen,Peijie Qiu,Rui Yin,Yi Su,Yalin Wang

Main category: cs.CL

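TL;DR: Discuss-RAG is a module built on collaborative agent reasoning that improves medical question answering by simulating multi-turn brainstorming and refining the retrieved content.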

Details

Motivation: Medical QA is challenging for large language models, and existing Retrieval-Augmented Generation (RAG) systems model reasoning behavior poorly and depend on low-quality medical corpora. Method: Propose the Discuss-RAG module, in which a summarizer agent coordinates a team of medical experts to simulate multi-turn brainstorming, and a decision-making agent evaluates the retrieved snippets. Result: On four medical QA benchmark datasets, Discuss-RAG outperforms MedRAG, with answer accuracy gains of up to 16.67% on BioASQ and 12.20% on PubMedQA. Conclusion: Collaborative agent reasoning lets Discuss-RAG improve medical QA performance and address the limitations of existing RAG systems. Abstract: Medical question answering (QA) is a reasoning-intensive task that remains challenging for large language models (LLMs) due to hallucinations and outdated domain knowledge. Retrieval-Augmented Generation (RAG) provides a promising post-training solution by leveraging external knowledge. However, existing medical RAG systems suffer from two key limitations: (1) a lack of modeling for human-like reasoning behaviors during information retrieval, and (2) reliance on suboptimal medical corpora, which often results in the retrieval of irrelevant or noisy snippets. To overcome these challenges, we propose Discuss-RAG, a plug-and-play module designed to enhance the medical QA RAG system through collaborative agent-based reasoning. Our method introduces a summarizer agent that orchestrates a team of medical experts to emulate multi-turn brainstorming, thereby improving the relevance of retrieved content. Additionally, a decision-making agent evaluates the retrieved snippets before their final integration. Experimental results on four benchmark medical QA datasets show that Discuss-RAG consistently outperforms MedRAG, especially significantly improving answer accuracy by up to 16.67% on BioASQ and 12.20% on PubMedQA. The code is available at: https://github.com/LLM-VLM-GSL/Discuss-RAG.

[94] BiasGuard: A Reasoning-enhanced Bias Detection Tool For Large Language Models

Zhiting Fan,Ruizhe Chen,Zuozhu Liu

Main category: cs.CL

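TL;DR: BiasGuard is a new bias detection tool that uses a two-stage approach to analyze inputs and reason over fairness specifications, outperforming existing tools.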

Details

Motivation: Existing methods, such as fairness classifiers and LLM-based judges, are limited in understanding intent and lack criteria for fairness judgment, so bias detection needs improvement. Method: BiasGuard uses two stages: the first initializes explicit reasoning based on fairness specifications, and the second uses reinforcement learning to strengthen reasoning and judgment. Result: Experiments on five datasets show that BiasGuard beats existing tools in accuracy and in reducing over-fairness misjudgments. Conclusion: BiasGuard demonstrates the importance of reasoning-enhanced decision-making and the effectiveness of its two-stage optimization pipeline. Abstract: Identifying bias in LLM-generated content is a crucial prerequisite for ensuring fairness in LLMs. Existing methods, such as fairness classifiers and LLM-based judges, face limitations related to difficulties in understanding underlying intentions and the lack of criteria for fairness judgment. In this paper, we introduce BiasGuard, a novel bias detection tool that explicitly analyzes inputs and reasons through fairness specifications to provide accurate judgments. BiasGuard is implemented through a two-stage approach: the first stage initializes the model to explicitly reason based on fairness specifications, while the second stage leverages reinforcement learning to enhance its reasoning and judgment capabilities. Our experiments, conducted across five datasets, demonstrate that BiasGuard outperforms existing tools, improving accuracy and reducing over-fairness misjudgments. We also highlight the importance of reasoning-enhanced decision-making and provide evidence for the effectiveness of our two-stage optimization pipeline.

[95] Confidence in Large Language Model Evaluation: A Bayesian Approach to Limited-Sample Challenges

Xiao Xiao,Yu Su,Sijing Zhang,Zhang Chen,Yadong Chen,Tian Liu

Main category: cs.CL

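TL;DR: The paper proposes a Bayesian approach to assessing large language model (LLM) capability that integrates prior knowledge through probabilistic inference, addressing the limitations of small-sample settings.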

Details

Motivation: Conventional evaluation frameworks rely on deterministic scalar metrics, whereas LLM outputs are probabilistic, so a more flexible evaluation method is needed. Method: Treat model capability as a latent variable, use a curated query set to elicit discriminative responses, and formalize model ranking as a Bayesian hypothesis testing problem. Result: In experiments with GPT-series models, the method discriminates between models better than conventional evaluation and stays statistically robust with fewer samples. Conclusion: By combining Bayesian inference with real-world deployment constraints, the work advances LLM evaluation methodology. Abstract: Large language models (LLMs) exhibit probabilistic output characteristics, yet conventional evaluation frameworks rely on deterministic scalar metrics. This study introduces a Bayesian approach for LLM capability assessment that integrates prior knowledge through probabilistic inference, addressing limitations under limited-sample regimes. By treating model capabilities as latent variables and leveraging a curated query set to induce discriminative responses, we formalize model ranking as a Bayesian hypothesis testing problem over mutually exclusive capability intervals. Experimental evaluations with GPT-series models demonstrate that the proposed method achieves superior discrimination compared to conventional evaluation methods. Results indicate that even with reduced sample sizes, the approach maintains statistical robustness while providing actionable insights, such as probabilistic statements about a model's likelihood of surpassing specific baselines. This work advances LLM evaluation methodologies by bridging Bayesian inference with practical constraints in real-world deployment scenarios.
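
The idea of testing hypotheses over mutually exclusive capability intervals can be sketched under a simple Beta-Binomial model of pass/fail answers. This model, the uniform prior, the interval edges, and the name `posterior_over_intervals` are illustrative assumptions; the abstract does not specify the likelihood or prior the authors use.

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x, computed in log space for stability."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def interval_mass(a, b, lo, hi, steps=2000):
    """Midpoint-rule integral of the Beta(a, b) density over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(beta_pdf(lo + (i + 0.5) * width, a, b) for i in range(steps)) * width

def posterior_over_intervals(correct, total, edges, prior=(1.0, 1.0)):
    """Posterior probability that the latent accuracy falls in each interval.

    The prior is Beta(prior[0], prior[1]); observing `correct` successes in
    `total` queries gives a Beta posterior by conjugacy.
    """
    a = prior[0] + correct
    b = prior[1] + total - correct
    return [interval_mass(a, b, lo, hi) for lo, hi in zip(edges, edges[1:])]

# Example: 18 of 20 curated queries answered correctly.
masses = posterior_over_intervals(18, 20, [0.0, 0.5, 0.7, 1.0])
```

Each entry of `masses` is the posterior probability of one capability interval, which is the kind of statement ("the model's accuracy exceeds 0.7 with probability p") that remains meaningful even with only 20 samples.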

[96] Does the Prompt-based Large Language Model Recognize Students' Demographics and Introduce Bias in Essay Scoring?

Kaixun Yang,Mladen Raković,Dragan Gašević,Guanliang Chen

Main category: cs.CL

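TL;DR: The study asks whether prompt-based large language models (such as GPT-4o) are biased against disadvantaged groups in automated essay scoring. It finds that the model can infer students' demographics and that scoring bias is linked to how accurately it predicts them.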

Details

Motivation: Traditional fine-tuning requires technical expertise, which limits its use by educators. Prompt-based tools make automated scoring more accessible, but prior work has shown bias in fine-tuned models, and this study checks whether similar problems exist in the prompt-based paradigm. Method: Using a public dataset of over 25,000 student argumentative essays, design prompts that elicit demographic inferences (gender, first-language background) from GPT-4o, assess scoring fairness, and use multivariate regression to analyze how demographic prediction ability affects scores. Result: (i) The prompted model can partly infer students' demographics; (ii) scoring bias is more pronounced when the model correctly predicts first-language background; (iii) scoring error for non-native speakers increases when the model correctly identifies them. Conclusion: Scoring in the prompt-based paradigm is still biased, especially against non-native speakers, and needs further work to reduce unfairness. Abstract: Large Language Models (LLMs) are widely used in Automated Essay Scoring (AES) due to their ability to capture semantic meaning. Traditional fine-tuning approaches required technical expertise, limiting accessibility for educators with limited technical backgrounds. However, prompt-based tools like ChatGPT have made AES more accessible, enabling educators to obtain machine-generated scores using natural-language prompts (i.e., the prompt-based paradigm). Despite advancements, prior studies have shown bias in fine-tuned LLMs, particularly against disadvantaged groups. It remains unclear whether such biases persist or are amplified in the prompt-based paradigm with cutting-edge tools. Since such biases are believed to stem from the demographic information embedded in pre-trained models (i.e., the ability of LLMs' text embeddings to predict demographic attributes), this study explores the relationship between the model's predictive power of students' demographic attributes based on their written works and its predictive bias in the scoring task in the prompt-based paradigm. Using a publicly available dataset of over 25,000 students' argumentative essays, we designed prompts to elicit demographic inferences (i.e., gender, first-language background) from GPT-4o and assessed fairness in automated scoring. Then we conducted multivariate regression analysis to explore the impact of the model's ability to predict demographics on its scoring outcomes. Our findings revealed that (i) prompt-based LLMs can somewhat infer students' demographics, particularly their first-language backgrounds, from their essays; (ii) scoring biases are more pronounced when the LLM correctly predicts students' first-language background than when it does not; and (iii) scoring error for non-native English speakers increases when the LLM correctly identifies them as non-native.

[97] Retrieval-Enhanced Few-Shot Prompting for Speech Event Extraction

Máté Gedeon

Main category: cs.CL

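TL;DR: The paper proposes a modular framework for Speech Event Extraction (SpeechEE) that combines high-performance ASR with LLM prompting enhanced by semantic search, substantially improving event trigger and argument classification.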

Details

Motivation: Speech Event Extraction (SpeechEE) is a challenging task at the intersection of ASR and NLP that requires identifying structured event information in spoken language. Method: A modular pipeline framework that combines ASR with semantic search-enhanced LLM prompting, classifies speech segments through a hybrid filtering mechanism, and uses few-shot LLM prompting to extract event triggers and arguments dynamically. Result: With o1-mini, the system reaches 63.3% F1 on trigger classification and 27.8% F1 on argument classification, ahead of prior benchmarks. Conclusion: Pipeline approaches paired with retrieval-augmented LLMs can rival end-to-end systems while staying interpretable and modular, pointing toward future hybrid models that combine textual and acoustic features. Abstract: Speech Event Extraction (SpeechEE) is a challenging task that lies at the intersection of Automatic Speech Recognition (ASR) and Natural Language Processing (NLP), requiring the identification of structured event information from spoken language. In this work, we present a modular, pipeline-based SpeechEE framework that integrates high-performance ASR with semantic search-enhanced prompting of Large Language Models (LLMs). Our system first classifies speech segments likely to contain events using a hybrid filtering mechanism including rule-based, BERT-based, and LLM-based models. It then employs few-shot LLM prompting, dynamically enriched via semantic similarity retrieval, to identify event triggers and extract corresponding arguments. We evaluate the pipeline using multiple LLMs (Llama3-8B, GPT-4o-mini, and o1-mini) highlighting significant performance gains with o1-mini, which achieves 63.3% F1 on trigger classification and 27.8% F1 on argument classification, outperforming prior benchmarks. Our results demonstrate that pipeline approaches, when empowered by retrieval-augmented LLMs, can rival or exceed end-to-end systems while maintaining interpretability and modularity. This work provides practical insights into LLM-driven event extraction and opens pathways for future hybrid models combining textual and acoustic features.
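
The "few-shot prompting, dynamically enriched via semantic similarity retrieval" step can be pictured as: pick the annotated examples closest to the transcribed utterance and place them ahead of it in the prompt. In the sketch below, token-level Jaccard overlap stands in for the semantic embedding similarity used in the paper, and the prompt layout, the `example_bank` structure, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Token-set overlap; a crude stand-in for embedding similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def build_prompt(utterance, example_bank, k=2):
    """Prepend the k most similar annotated examples to the query utterance."""
    nearest = sorted(
        example_bank,
        key=lambda ex: jaccard(utterance, ex["text"]),
        reverse=True,
    )[:k]
    shots = "\n\n".join(
        f"Text: {ex['text']}\nEvents: {ex['events']}" for ex in nearest
    )
    return f"{shots}\n\nText: {utterance}\nEvents:"

# Hypothetical annotated examples (not taken from the paper's data).
bank = [
    {"text": "the army attacked the city at dawn", "events": "Attack(trigger=attacked)"},
    {"text": "she was born in a small village", "events": "Be-Born(trigger=born)"},
    {"text": "rebels attacked the convoy near the city", "events": "Attack(trigger=attacked)"},
]
prompt = build_prompt("soldiers attacked the city last night", bank, k=2)
```

Because the shots change per utterance, the LLM sees demonstrations of the event types most likely to be relevant, which is the motivation for retrieval over a fixed few-shot set.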

[98] The Distribution of Dependency Distance and Hierarchical Distance in Contemporary Written Japanese and Its Influencing Factors

Linxuan Wang,Shuiyuan Yu

Main category: cs.CL

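TL;DR: The study examines the relationship between dependency distance (DD) and hierarchical distance (HD) in Japanese and finds that predicate valency is the underlying cause of the trade-off between MDD and MHD.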

Details

Motivation: To explore the relationship between DD and HD in Japanese and the cognitive mechanism behind it. Method: Compare the probability distributions of DD and HD with and without sentence length fixed, and analyze how MDD and MHD change with sentence length and how they correlate. Result: Predicate valency is the key factor behind the MDD-MHD trade-off, and it affects the distribution of HD more than that of DD. Conclusion: Native speakers of Japanese regulate linear and hierarchical complexity through predicate valency, and a valency threshold determines the relative sizes of MDD and MHD. Abstract: To explore the relationship between dependency distance (DD) and hierarchical distance (HD) in Japanese, we compared the probability distributions of DD and HD with and without sentence length fixed, and analyzed the changes in mean dependency distance (MDD) and mean hierarchical distance (MHD) as sentence length increases, along with their correlation coefficient based on the Balanced Corpus of Contemporary Written Japanese. It was found that the valency of the predicates is the underlying factor behind the trade-off relation between MDD and MHD in Japanese. Native speakers of Japanese regulate the linear complexity and hierarchical complexity through the valency of the predicates, and the relative sizes of MDD and MHD depend on whether the threshold of valency has been reached. Apart from the cognitive load, the valency of the predicates also affects the probability distributions of DD and HD. The effect of the valency of the predicates on the distribution of HD is greater than on that of DD, which leads to differences in their probability distributions and causes the mean of MDD to be lower than that of MHD.
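
For readers unfamiliar with the two measures, a small sketch shows how MDD and MHD are typically computed from a dependency tree. It assumes the common definitions (DD is the absolute difference between the linear positions of a word and its head; HD is the number of dependency edges from a word up to the root; both means exclude the root), which the abstract does not restate. The `heads` encoding (1-indexed head positions, 0 for the root) is an illustrative convention.

```python
def mean_dependency_distance(heads):
    """Mean |position - head position| over all non-root words."""
    distances = [abs(i - h) for i, h in enumerate(heads, start=1) if h != 0]
    return sum(distances) / len(distances)

def depth(heads, i):
    """Number of dependency edges from word i (1-indexed) up to the root."""
    steps = 0
    while heads[i - 1] != 0:
        i = heads[i - 1]
        steps += 1
    return steps

def mean_hierarchical_distance(heads):
    """Mean depth over all non-root words."""
    depths = [depth(heads, i) for i, h in enumerate(heads, start=1) if h != 0]
    return sum(depths) / len(depths)

# Flat tree: words 1 and 3 both attach to word 2 (the root).
flat = [2, 0, 2]
# Chain: word 1 -> word 2 -> word 3 (the root).
chain = [2, 3, 0]
```

The two toy trees have the same MDD (1.0) but different MHD (1.0 for the flat tree, 1.5 for the chain), which shows that a sentence can change its hierarchical complexity without changing its linear complexity. This is the kind of trade-off the paper attributes to predicate valency.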

[99] RWKV-X: A Linear Complexity Hybrid Language Model

Haowen Hou,Zhiyi Huang,Kaifeng Tan,Rongchang Lu,Fei Richard Yu

Main category: cs.CL

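TL;DR: RWKV-X is a new hybrid architecture that combines RWKV's efficiency at short-range modeling with a sparse attention mechanism for capturing long-range context, giving linear-time training and constant-time inference decoding.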

Details

Motivation: Existing hybrid approaches rely on full attention layers and therefore keep quadratic complexity; the paper proposes a more efficient model. Method: Combine RWKV's short-range modeling with a sparse attention mechanism to obtain linear-time training and constant-time inference decoding. Result: After continual pretraining on 64K-token sequences, RWKV-X reaches near-perfect accuracy on the 64K passkey retrieval benchmark, beats RWKV-7 on long-context tasks, and keeps strong performance on short-context tasks. Conclusion: RWKV-X is a scalable and efficient language modeling backbone that can decode sequences of up to 1 million tokens with stable speed and memory; the code and checkpoints are public. Abstract: In this paper, we introduce RWKV-X, a novel hybrid architecture that combines the efficiency of RWKV for short-range modeling with a sparse attention mechanism designed to capture long-range context. Unlike previous hybrid approaches that rely on full attention layers and retain quadratic complexity, RWKV-X achieves linear-time complexity in training and constant-time complexity in inference decoding. We demonstrate that RWKV-X, when continually pretrained on 64K-token sequences, achieves near-perfect accuracy on the 64K passkey retrieval benchmark. It consistently outperforms prior RWKV-7 models on long-context benchmarks, while maintaining strong performance on short-context tasks. These results highlight RWKV-X as a scalable and efficient backbone for general-purpose language modeling, capable of decoding sequences up to 1 million tokens with stable speed and memory usage. To facilitate further research and analysis, we have made the checkpoints and the associated code publicly accessible at: https://github.com/howard-hou/RWKV-X.

[100] Homa at SemEval-2025 Task 5: Aligning Librarian Records with OntoAligner for Subject Tagging

Hadi Bayrami Asl Tekanlou,Jafar Razmara,Mahsa Sanaei,Mostafa Rahgouy,Hamed Babaei Giglou

Main category: cs.CL

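TL;DR: The Homa system uses the OntoAligner toolkit with RAG techniques to recast subject tagging as an alignment task, matching technical records to GND categories by semantic similarity.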

Details

Motivation: To automate subject tagging of TIBKAT technical records and improve the efficiency and accuracy of subject tagging in digital libraries. Method: Use the OntoAligner toolkit together with RAG techniques to semantically align records with GND categories. Result: Experiments show the strengths and limitations of the method and the potential of alignment techniques for subject tagging. Conclusion: Semantic alignment is a promising route to subject tagging and offers digital libraries a new approach. Abstract: This paper presents our system, Homa, for SemEval-2025 Task 5: Subject Tagging, which focuses on automatically assigning subject labels to technical records from TIBKAT using the Gemeinsame Normdatei (GND) taxonomy. We leverage OntoAligner, a modular ontology alignment toolkit, to address this task by integrating retrieval-augmented generation (RAG) techniques. Our approach formulates the subject tagging problem as an alignment task, where records are matched to GND categories based on semantic similarity. We evaluate OntoAligner's adaptability for subject indexing and analyze its effectiveness in handling multilingual records. Experimental results demonstrate the strengths and limitations of this method, highlighting the potential of alignment techniques for improving subject tagging in digital libraries.

[101] Advancing Arabic Reverse Dictionary Systems: A Transformer-Based Approach with Dataset Construction Guidelines

Serry Sibaee,Samar Ahmed,Abdullah Al Harbi,Omer Nacar,Adel Ammar,Yasser Habashi,Wadii Boulila

Main category: cs.CL

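TL;DR: The paper proposes a transformer-based semi-encoder neural network architecture for the Arabic reverse dictionary task that outperforms existing methods, and provides dataset construction standards and a software library.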

Details

Motivation: To fill the gap in reverse dictionary systems for Arabic NLP and improve the efficiency and accuracy of Arabic word retrieval. Method: A semi-encoder neural network architecture with geometrically decreasing layers, evaluated with Arabic-specific pretrained models such as ARBERTv2. Result: ARBERTv2 achieves the best ranking score (0.0644), and the paper proposes eight dataset construction standards. Conclusion: The work provides tools and theoretical grounding for Arabic computational linguistics, with applications in language learning, academic writing, and professional communication. Abstract: This study addresses the critical gap in Arabic natural language processing by developing an effective Arabic Reverse Dictionary (RD) system that enables users to find words based on their descriptions or meanings. We present a novel transformer-based approach with a semi-encoder neural network architecture featuring geometrically decreasing layers that achieves state-of-the-art results for Arabic RD tasks. Our methodology incorporates a comprehensive dataset construction process and establishes formal quality standards for Arabic lexicographic definitions. Experiments with various pre-trained models demonstrate that Arabic-specific models significantly outperform general multilingual embeddings, with ARBERTv2 achieving the best ranking score (0.0644). Additionally, we provide a formal abstraction of the reverse dictionary task that enhances theoretical understanding and develop a modular, extensible Python library (RDTL) with configurable training pipelines. Our analysis of dataset quality reveals important insights for improving Arabic definition construction, leading to eight specific standards for building high-quality reverse dictionary resources. This work contributes significantly to Arabic computational linguistics and provides valuable tools for language learning, academic writing, and professional communication in Arabic.

[102] Improving Informally Romanized Language Identification

Adrian Benton,Alexander Gutkin,Christo Kirov,Brian Roark

Main category: cs.CL

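TL;DR: By improving how synthetic training sets are built, the paper raises language identification (LID) accuracy on romanized text, with particular gains in handling spelling variation.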

Details

Motivation: Romanized text (for example, Indian languages whose native scripts are non-Latin) has high spelling variability, which makes language identification difficult. Method: Synthesize training samples that incorporate natural spelling variation and combine them with available naturally occurring samples. Result: On a test set covering 20 Indic languages, F1 rises from 74.7% to 88.2%. Conclusion: Training on synthetic data combined with natural samples substantially improves language identification. Abstract: The Latin script is often used to informally write languages with non-Latin native scripts. In many cases (e.g., most languages in India), there is no conventional spelling of words in the Latin script, hence there will be high spelling variability in written text. Such romanization renders languages that are normally easily distinguished based on script highly confusable, such as Hindi and Urdu. In this work, we increase language identification (LID) accuracy for romanized text by improving the methods used to synthesize training sets. We find that training on synthetic samples which incorporate natural spelling variation yields higher LID system accuracy than including available naturally occurring examples in the training set, or even training higher capacity models. We demonstrate new state-of-the-art LID performance on romanized text from 20 Indic languages in the Bhasha-Abhijnaanam evaluation set (Madhani et al., 2023a), improving test F1 from the reported 74.7% (using a pretrained neural model) to 85.4% using a linear classifier trained solely on synthetic data and 88.2% when also training on available harvested text.

[103] TartuNLP at SemEval-2025 Task 5: Subject Tagging as Two-Stage Information Retrieval

Aleksei Dorkin,Kairit Sirts

Main category: cs.CL

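TL;DR: The paper proposes a two-stage information retrieval system for assigning subject tags to library records, with a marked improvement in recall.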

Details

Motivation: To help librarians assign subject tags to documents more efficiently and accurately. Method: Build a two-stage retrieval system from two types of encoder models: a bi-encoder extracts coarse-grained candidate tags in the first stage, and a cross-encoder re-ranks them at fine granularity in the second. Result: Recall improves significantly over single-stage methods, and the system does well in the qualitative evaluation. Conclusion: Two-stage retrieval is effective and competitive for subject tag assignment. Abstract: We present our submission to the Task 5 of SemEval-2025 that aims to aid librarians in assigning subject tags to the library records by producing a list of likely relevant tags for a given document. We frame the task as an information retrieval problem, where the document content is used to retrieve subject tags from a large subject taxonomy. We leverage two types of encoder models to build a two-stage information retrieval system -- a bi-encoder for coarse-grained candidate extraction at the first stage, and a cross-encoder for fine-grained re-ranking at the second stage. This approach proved effective, demonstrating significant improvements in recall compared to single-stage methods and showing competitive results according to qualitative evaluation.
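
The retrieve-then-re-rank structure can be sketched with toy scorers. In this sketch a bag-of-words cosine stands in for the bi-encoder (document and tag are embedded independently) and a token-coverage score stands in for the cross-encoder (document and tag are scored as a pair). Both scorers, the sample tags, and all function names are placeholders, not the models used in the submission.

```python
import math
from collections import Counter

def embed(text):
    """Toy 'bi-encoder': an independent bag-of-words vector per text."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def first_stage(doc, tags, k):
    """Coarse candidate extraction: top-k tags by embedding similarity."""
    doc_vec = embed(doc)
    ranked = sorted(tags, key=lambda t: cosine(doc_vec, embed(t)), reverse=True)
    return ranked[:k]

def pair_score(doc, tag):
    """Toy 'cross-encoder': share of the tag's tokens found in the document."""
    doc_tokens = set(doc.lower().split())
    tag_tokens = tag.lower().split()
    return sum(tok in doc_tokens for tok in tag_tokens) / len(tag_tokens)

def two_stage(doc, tags, k=3, top_n=2):
    """Retrieve k candidates cheaply, then re-rank only those k jointly."""
    candidates = first_stage(doc, tags, k)
    return sorted(candidates, key=lambda t: pair_score(doc, t), reverse=True)[:top_n]

tags = ["machine translation", "organic chemistry", "neural networks", "resource economics"]
result = two_stage("neural machine translation for low resource languages", tags)
```

The design point is cost: tag embeddings can be precomputed once for a large taxonomy, so the expensive pairwise scorer only has to run on the `k` surviving candidates.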

[104] Precision Where It Matters: A Novel Spike Aware Mixed-Precision Quantization Strategy for LLaMA-based Language Models

Lucas Maisonnave,Cyril Moineau,Olivier Bichler,Fabrice Rastello

Main category: cs.CL

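TL;DR: The paper proposes a mixed-precision quantization method for the LLaMA architecture that identifies the specific projection layers where activation spikes concentrate and keeps them in higher precision, substantially improving quantized performance.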

Details

Motivation: The size of large language models (LLMs) makes deployment and inference difficult, and existing quantization methods rest on questionable assumptions about activation outliers. Method: A mixed-precision quantization method that keeps the projection layers where activation spikes concentrate in higher precision (FP16 or FP8) and quantizes the rest of the model to lower bit-widths. Result: On LLaMA2, LLaMA3, and Mistral models, the method beats existing techniques, with clear gains in perplexity and zero-shot accuracy, particularly for 8-bit per-tensor quantization. Conclusion: Architecture-specific quantization strategies outperform general-purpose ones, offering a new route to efficient LLM deployment. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in various natural language processing tasks. However, their size presents significant challenges for deployment and inference. This paper investigates the quantization of LLMs, focusing on the LLaMA architecture and its derivatives. We challenge existing assumptions about activation outliers in LLMs and propose a novel mixed-precision quantization approach tailored for LLaMA-like models. Our method leverages the observation that activation spikes in LLaMA architectures are predominantly concentrated in specific projection layers. By applying higher precision (FP16 or FP8) to these layers while quantizing the rest of the model to lower bit-widths, we achieve superior performance compared to existing quantization techniques. Experimental results on LLaMA2, LLaMA3, and Mistral models demonstrate significant improvements in perplexity and zero-shot accuracy, particularly for 8-bit per-tensor quantization. Our approach outperforms general-purpose methods designed to handle outliers across all architecture types, highlighting the benefits of architecture-specific quantization strategies. This research contributes to the ongoing efforts to make LLMs more efficient and deployable, potentially enabling their use in resource-constrained environments. Our findings emphasize the importance of considering model-specific characteristics in developing effective quantization pipelines for state-of-the-art language models by identifying and targeting a small number of projections that concentrate activation spikes.
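
Why a single activation spike hurts per-tensor quantization, and why exempting the affected layer helps, can be seen in a small numeric sketch. It uses symmetric 8-bit per-tensor fake quantization on plain Python lists. Selecting layers by a magnitude threshold is an assumption made for illustration (the paper identifies specific projection layers, which the abstract does not name), and `proj_a`/`proj_b` are hypothetical layer names.

```python
def quantize_per_tensor(values, bits=8):
    """Symmetric per-tensor fake quantization: one scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid a zero scale
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [q * scale for q in quantized]

def mean_abs_error(original, approx):
    return sum(abs(a - b) for a, b in zip(original, approx)) / len(original)

def mixed_precision(layers, spike_threshold):
    """Keep spiky layers untouched (high precision); quantize the rest."""
    out = {}
    for name, acts in layers.items():
        if max(abs(v) for v in acts) > spike_threshold:
            out[name] = list(acts)  # stands in for an FP16/FP8 layer
        else:
            out[name] = quantize_per_tensor(acts)
    return out

layers = {
    "proj_a": [0.1, -0.2, 0.05, 0.15],   # well-behaved activations
    "proj_b": [0.1, -0.2, 0.05, 900.0],  # one large activation spike
}
err_spiky = mean_abs_error(layers["proj_b"], quantize_per_tensor(layers["proj_b"]))
err_clean = mean_abs_error(layers["proj_a"], quantize_per_tensor(layers["proj_a"]))
mixed = mixed_precision(layers, spike_threshold=100.0)
```

In `proj_b` the spike sets the scale, so every small activation rounds to zero and the error on the ordinary values is large. Leaving only that layer in higher precision removes the problem while the remaining layers still get the memory savings of 8-bit storage.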

[105] DNB-AI-Project at SemEval-2025 Task 5: An LLM-Ensemble Approach for Automated Subject Indexing

Lisa Kluge,Maximilian Kähler

Main category: cs.CL

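TL;DR: The paper describes a system built for SemEval-2025 Task 5 that uses LLMs for automated subject tagging, combining few-shot prompting with post-processing steps. It ranks fourth in the quantitative ranking but achieves the best result in the qualitative evaluation by experts.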

Details

Motivation: To build an automated subject tagging system for a technical library's open-access catalog, improving tagging efficiency and accuracy. Method: Few-shot prompting combined with post-processing steps (vocabulary mapping, ensemble voting, and relevance ranking) that refine the keywords generated by the LLMs. Result: The system is fourth in the quantitative ranking but best in the qualitative evaluation by experts. Conclusion: The approach shows promise for automated subject tagging, particularly in expert assessment. Abstract: This paper presents our system developed for the SemEval-2025 Task 5: LLMs4Subjects: LLM-based Automated Subject Tagging for a National Technical Library's Open-Access Catalog. Our system relies on prompting a selection of LLMs with varying examples of intellectually annotated records and asking the LLMs to similarly suggest keywords for new records. This few-shot prompting technique is combined with a series of post-processing steps that map the generated keywords to the target vocabulary, aggregate the resulting subject terms to an ensemble vote and, finally, rank them as to their relevance to the record. Our system is fourth in the quantitative ranking in the all-subjects track, but achieves the best result in the qualitative ranking conducted by subject indexing experts.
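
The post-processing chain (map free-text keywords to the target vocabulary, aggregate into an ensemble vote, rank) can be sketched as follows. The exact-match lookup, the ranking by vote count, and the sample vocabulary are simplifying assumptions; the abstract does not describe how the actual mapping or relevance ranking is implemented.

```python
from collections import Counter

def map_to_vocabulary(keyword, vocabulary):
    """Normalize a free-text keyword and look it up; None if unmapped."""
    return vocabulary.get(keyword.strip().lower())

def ensemble_vote(suggestions, vocabulary):
    """Count how many models (or prompts) proposed each vocabulary term.

    `suggestions` holds one keyword list per model; each model votes at most
    once per term, even if several of its keywords map to the same term.
    """
    votes = Counter()
    for keywords in suggestions:
        terms = {map_to_vocabulary(k, vocabulary) for k in keywords}
        terms.discard(None)
        votes.update(terms)
    # Rank by vote count (descending), then alphabetically for stable ties.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical mapping from keywords to controlled subject terms.
vocabulary = {
    "machine learning": "Maschinelles Lernen",
    "ml": "Maschinelles Lernen",
    "robotics": "Robotik",
}
suggestions = [
    ["ML", "robotics"],
    ["Machine Learning", "quantum foam"],
    ["ml", "machine learning"],
]
ranking = ensemble_vote(suggestions, vocabulary)
```

Keywords that cannot be mapped (here "quantum foam") are dropped, so the ensemble can only output terms from the controlled vocabulary.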

[106] Robust Misinformation Detection by Visiting Potential Commonsense Conflict

Bing Wang,Ximing Li,Changchun Li,Bingrui Zhao,Bo Fu,Renchu Guan,Shengsheng Wang

Main category: cs.CL

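TL;DR: The paper proposes MD-PCC, a plug-and-play augmentation method for automatic detection of online misinformation that augments articles with constructed commonsense-conflict expressions, and validates it on multiple datasets.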

Details

Motivation: The growth of Internet technology has led to widespread misinformation with negative effects across many domains. Demand for automatic misinformation detection is rising, but existing methods make little use of commonsense conflict. Method: Use the commonsense reasoning tool COMET to construct commonsense-conflict expressions that augment the articles, and collect a new commonsense-oriented dataset, CoMis. Result: On four public benchmark datasets and CoMis, MD-PCC outperforms existing baselines. Conclusion: By exploiting commonsense conflict, MD-PCC improves misinformation detection and opens a new direction for future research. Abstract: The development of Internet technology has led to an increased prevalence of misinformation, causing severe negative effects across diverse domains. To mitigate this challenge, Misinformation Detection (MD), aiming to detect online misinformation automatically, emerges as a rapidly growing research topic in the community. In this paper, we propose a novel plug-and-play augmentation method for the MD task, namely Misinformation Detection with Potential Commonsense Conflict (MD-PCC). We take inspiration from the prior studies indicating that fake articles are more likely to involve commonsense conflict. Accordingly, we construct commonsense expressions for articles, serving to express potential commonsense conflicts inferred by the difference between extracted commonsense triplet and golden ones inferred by the well-established commonsense reasoning tool COMET. These expressions are then specified for each article as augmentation. Any specific MD methods can be then trained on those commonsense-augmented articles. Besides, we also collect a novel commonsense-oriented dataset named CoMis, whose all fake articles are caused by commonsense conflict. We integrate MD-PCC with various existing MD backbones and compare them across both 4 public benchmark datasets and CoMis. Empirical results demonstrate that MD-PCC can consistently outperform the existing MD baselines.

[107] RDF-Based Structured Quality Assessment Representation of Multilingual LLM Evaluations

Jonas Gwozdz,Andreas Both

Main category: cs.CL

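TL;DR: The paper proposes an RDF-based framework for assessing the reliability of multilingual large language models (LLMs) under knowledge conflicts, focusing on knowledge leakage, error detection, and multilingual consistency.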

Details

Motivation: As LLMs become common knowledge interfaces, assessing their reliability when given conflicting information becomes important. Method: Capture model responses under four context conditions (complete, incomplete, conflicting, and no context) in German and English, and analyze them in a structured way with the RDF framework. Result: The framework supports comprehensive analysis of knowledge leakage, error detection, and multilingual consistency, and a fire safety experiment reveals key patterns in context prioritization and language-specific performance. Conclusion: The proposed RDF framework is effective for assessing LLM behavior under knowledge conflicts, and its vocabulary covers every assessment facet of the 28-question study. Abstract: Large Language Models (LLMs) increasingly serve as knowledge interfaces, yet systematically assessing their reliability with conflicting information remains difficult. We propose an RDF-based framework to assess multilingual LLM quality, focusing on knowledge conflicts. Our approach captures model responses across four distinct context conditions (complete, incomplete, conflicting, and no-context information) in German and English. This structured representation enables the comprehensive analysis of knowledge leakage (where models favor training data over provided context), error detection, and multilingual consistency. We demonstrate the framework through a fire safety domain experiment, revealing critical patterns in context prioritization and language-specific performance, and demonstrating that our vocabulary was sufficient to express every assessment facet encountered in the 28-question study.

[108] Meeseeks: An Iterative Benchmark Evaluating LLMs Multi-Turn Instruction-Following Ability

Jiaming Wang

Main category: cs.CL

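TL;DR: Meeseeks is a new benchmark that simulates realistic human-LLM interaction through an iterative feedback process, supports self-correction, and comprehensively evaluates LLM instruction-following ability.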

Details

Motivation: Existing instruction-following benchmarks are mostly single-turn or do not allow self-correction, so they do not reflect real interaction patterns. Meeseeks aims to fill this gap. Method: Uses an iterative feedback process that lets models self-correct based on the requirements they failed, and evaluates them with 38 capability tags across three dimensions (intent recognition, content validation, and output structure validation). Result: Meeseeks provides valuable insights into LLM instruction-following ability in practical applications. Conclusion: Through a more realistic interaction design and comprehensive evaluation, Meeseeks improves the assessment of LLM instruction-following ability. Abstract: The ability to follow instructions accurately is fundamental for Large Language Models (LLMs) to serve as reliable agents in real-world applications. While existing instruction-following benchmarks are either single-turn or introduce new requirements in each turn without allowing self-correction, Meeseeks simulates realistic human-LLM interactions through an iterative feedback process. This design enables models to self-correct based on specific requirement failures, better reflecting real-world end-user usage patterns. The benchmark implements a comprehensive evaluation system with 38 capability tags organized across three dimensions: Intent Recognition, Granular Content Validation, and Output Structure Validation. Through rigorous evaluation across LLMs, Meeseeks provides valuable insights into LLMs' instruction-following capabilities in practical applications.

[109] Sadeed: Advancing Arabic Diacritization Through Small Language Model

Zeina Aldallal,Sara Chrouf,Khalil Hennara,Mohamed Motaism Hamed,Muhammad Hreden,Safwan AlModhayan

Main category: cs.CL

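TL;DR: Sadeed is a decoder-only language model fine-tuned from Kuwain 1.5B for Arabic diacritization; it outperforms traditional models, and the authors also propose a new benchmark, SadeedDiac-25.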

Details

Motivation: Arabic diacritization is challenging in natural language processing because of the language's morphological complexity, and it needs efficient and fairly evaluated solutions. Method: Fine-tunes Kuwain 1.5B on high-quality diacritized datasets, with a data-cleaning and normalization pipeline. Result: Sadeed performs well with limited computational resources, outperforms traditional models, and is accompanied by a new benchmark. Conclusion: Sadeed and SadeedDiac-25 provide a solid foundation for Arabic NLP applications. Abstract: Arabic text diacritization remains a persistent challenge in natural language processing due to the language's morphological richness. In this paper, we introduce Sadeed, a novel approach based on a fine-tuned decoder-only language model adapted from Kuwain 1.5B (Hennara et al., 2025), a compact model originally trained on diverse Arabic corpora. Sadeed is fine-tuned on carefully curated, high-quality diacritized datasets, constructed through a rigorous data-cleaning and normalization pipeline. Despite utilizing modest computational resources, Sadeed achieves competitive results compared to proprietary large language models and outperforms traditional models trained on similar domains. Additionally, we highlight key limitations in current benchmarking practices for Arabic diacritization. To address these issues, we introduce SadeedDiac-25, a new benchmark designed to enable fairer and more comprehensive evaluation across diverse text genres and complexity levels. Together, Sadeed and SadeedDiac-25 provide a robust foundation for advancing Arabic NLP applications, including machine translation, text-to-speech, and language learning tools.

[110] 20min-XD: A Comparable Corpus of Swiss News Articles

Michelle Wastl,Jannis Vamvas,Selena Calleri,Rico Sennrich

Main category: cs.CL

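TL;DR: 20min-XD is a French-German comparable corpus of news articles containing around 15,000 automatically aligned article pairs, and it is publicly available.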

Details

Motivation: To build a cross-lingual, document-level comparable corpus that supports NLP applications and linguistic research. Method: Collects data from the Swiss news outlet 20 Minuten/20 minutes and aligns articles automatically based on semantic similarity. Result: The corpus covers a broad range of cross-lingual similarity, from near-translations to loosely related articles. Conclusion: 20min-XD is a valuable resource for a variety of NLP tasks and linguistic studies. Abstract: We present 20min-XD (20 Minuten cross-lingual document-level), a French-German, document-level comparable corpus of news articles, sourced from the Swiss online news outlet 20 Minuten/20 minutes. Our dataset comprises around 15,000 article pairs spanning 2015 to 2024, automatically aligned based on semantic similarity. We detail the data collection process and alignment methodology. Furthermore, we provide a qualitative and quantitative analysis of the corpus. The resulting dataset exhibits a broad spectrum of cross-lingual similarity, ranging from near-translations to loosely related articles, making it valuable for various NLP applications and broad linguistically motivated studies. We publicly release the dataset in document- and sentence-aligned versions, along with code for the described experiments.
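
The abstract says only that article pairs are "automatically aligned based on semantic similarity". As a rough illustration of what such an alignment step could look like, the sketch below greedily pairs documents by cosine similarity of their embedding vectors. The function names, the toy vectors, and the 0.5 threshold are assumptions made for this example; this is not the authors' pipeline.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(fr_vecs, de_vecs, threshold=0.5):
    """Greedy one-to-one alignment in order of decreasing similarity.

    The threshold is a hypothetical cut-off: pairs below it stay unaligned.
    """
    scored = sorted(
        ((cosine(f, d), i, j)
         for i, f in enumerate(fr_vecs)
         for j, d in enumerate(de_vecs)),
        reverse=True,
    )
    used_fr, used_de, pairs = set(), set(), []
    for score, i, j in scored:
        if score < threshold:
            break  # all remaining candidates are even less similar
        if i in used_fr or j in used_de:
            continue  # each article is aligned at most once
        used_fr.add(i)
        used_de.add(j)
        pairs.append((i, j, score))
    return pairs


# Toy "embeddings" standing in for French and German article vectors.
fr = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
de = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
pairs = align(fr, de)
```

Lowering the threshold in a scheme like this would admit more loosely related pairs, which is one way a corpus could end up spanning near-translations to loosely related articles.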

[111] Investigating the Effect of Parallel Data in the Cross-Lingual Transfer for Vision-Language Encoders

Andrei-Alexandru Manea,Jindřich Libovický

Main category: cs.CL

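TL;DR: Studies how well an already trained encoder can be transferred with parallel data for multilingual vision-language tasks; machine-translated task data work best on average, but authentic parallel data do better in some languages.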

Details

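Motivation: Most pre-trained vision-language models and downstream task data are available only in English, so multilingual tasks usually rely on cross-lingual transfer; this work explores an alternative approach. Method: Transfers an already trained encoder using parallel data and studies the effect of the parallel data's domain and the number of languages. Result: Machine-translated task data are best on average, but authentic parallel data perform better in some languages; multilingual training benefits most languages. Conclusion: The type of parallel data and the number of languages significantly affect transfer quality, and multilingual training is broadly beneficial. Abstract: Most pre-trained Vision-Language (VL) models and training data for the downstream tasks are only available in English. Therefore, multilingual VL tasks are solved using cross-lingual transfer: fine-tuning a multilingual pre-trained model or transferring the text encoder using parallel data. We study the alternative approach: transferring an already trained encoder using parallel data. We investigate the effect of two properties of the parallel data, its domain and the number of languages, which were out of focus in previous work. Our results show that even though machine-translated task data are the best on average, caption-like authentic parallel data outperformed them in some languages. Further, we show that most languages benefit from multilingual training.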

[112] Enhancing Health Mention Classification Performance: A Study on Advancements in Parameter Efficient Tuning

Reem Abdel-Salam,Mary Adewunmi

Main category: cs.CL

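TL;DR: The paper proposes improving health mention classification (HMC) by combining part-of-speech tagging information with PEFT techniques, which significantly improves performance and model efficiency.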

Details

Motivation: Health mention classification is essential for real-time tracking on social media and for public health monitoring, but its complexity (for example, figurative language and descriptive terminology) limits the effectiveness of traditional methods. Method: The study uses part-of-speech tagging information, improved PEFT techniques, and combinations of the two, with experiments on three datasets (RHDM, PHM, and Illness). Result: Combining part-of-speech tagging information with PEFT techniques significantly improves F1-score while using smaller models and more efficient training. Conclusion: The method offers an efficient and accurate approach to health mention classification on social media while reducing model size and training cost. Abstract: Health Mention Classification (HMC) plays a critical role in leveraging social media posts for real-time tracking and public health monitoring. Nevertheless, HMC presents significant challenges due to its intricate nature, primarily stemming from the contextual aspects of health mentions, such as figurative language and descriptive terminology, which do not explicitly reflect a personal ailment. To address this problem, we argue that clearer mentions can be achieved through conventional fine-tuning with enhanced parameters of biomedical natural language processing (NLP) methods. In this study, we explore different techniques such as the utilisation of part-of-speech (POS) tagger information, improving on PEFT techniques, and different combinations thereof. Extensive experiments are conducted on three widely used datasets: RHDM, PHM, and Illness. The results show that incorporating POS tagger information and leveraging PEFT techniques significantly improves performance in terms of F1-score compared to state-of-the-art methods across all three datasets, while utilising smaller models and efficient training. Furthermore, the findings highlight the effectiveness of incorporating POS tagger information and leveraging PEFT techniques for HMC. In conclusion, the proposed methodology presents a potentially effective approach to accurately classifying health mentions in social media posts while optimising the model size and training efficiency.

[113] Investigating Literary Motifs in Ancient and Medieval Novels with Large Language Models

Emelie Hallenberg

Main category: cs.CL

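TL;DR: Uses fine-tuned large language models to study shared and differing literary motifs in Greek love novels, finding that some motifs persist while others fluctuate in frequency, which points to trends or external influences.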

Details

Motivation: To examine what literary motifs Greek love novels share and where they differ, in order to reveal their evolution or external influences. Method: Analyzes the texts with fine-tuned large language models to extract and compare literary motifs. Result: Some motifs run through the whole corpus, while others fluctuate in frequency, indicating trends or external influences. Conclusion: The method extracts literary motifs effectively and provides data for both quantitative and qualitative analysis. Abstract: The Greek fictional narratives often termed love novels or romances, ranging from the first century CE to the middle of the 15th century, have long been considered similar in many ways, not least in the use of particular literary motifs. By applying fine-tuned large language models, this study aims to investigate exactly which motifs the texts in this corpus have in common, and in which ways they differ from each other. The results show that while some motifs persist throughout the corpus, others fluctuate in frequency, indicating certain trends or external influences. In conclusion, the method proves able to adequately extract literary motifs according to a set definition, providing data for both quantitative and qualitative analyses.

[114] Improving Retrieval-Augmented Neural Machine Translation with Monolingual Data

Maxime Bouthors,Josep Crego,François Yvon

Main category: cs.CL

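TL;DR: This paper explores how target-language monolingual corpora can improve retrieval-augmented neural machine translation (RANMT) systems by designing improved cross-lingual retrieval systems; experiments show better performance than conventional translation-memory-based methods.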

Details

Motivation: Conventional RANMT systems rely on bilingual corpora, but in many settings target-language monolingual corpora are more plentiful. This work aims to use those resources to improve translation quality. Method: Designs improved cross-lingual retrieval systems trained with both sentence-level and word-level matching objectives, and runs experiments with two RANMT architectures. Result: The new method outperforms conventional translation-memory-based methods and general-purpose cross-lingual retrievers in both controlled and real-world settings. Conclusion: Cross-lingual retrieval over target-language monolingual corpora significantly improves the translation performance of RANMT systems. Abstract: Conventional retrieval-augmented neural machine translation (RANMT) systems leverage bilingual corpora, e.g., translation memories (TMs). Yet, in many settings, in-domain monolingual target-side corpora are often available. This work explores ways to take advantage of such resources by retrieving relevant segments directly in the target language, based on a source-side query. For this, we design improved cross-lingual retrieval systems, trained with both sentence-level and word-level matching objectives. In our experiments with two RANMT architectures, we first demonstrate the benefits of such cross-lingual objectives in a controlled setting, obtaining translation performance that surpasses standard TM-based models. We then showcase our method in a real-world setup, where the target monolingual resources far exceed the amount of parallel data, and observe large improvements from our new techniques, which outperform both the baseline setting and general-purpose cross-lingual retrievers.

[115] MAC-Tuning: LLM Multi-Compositional Problem Reasoning with Enhanced Knowledge Boundary Awareness

Junsheng Huang,Zhitao He,Sandeep Polisetty,Qingyun Wang,May Fung

Main category: cs.CL

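TL;DR: The paper proposes a new method, MAC-Tuning, that improves the confidence estimation of large language models (LLMs) when they answer multiple problems at once; experiments show it outperforms baseline methods.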

Details

Motivation: As large language models are widely deployed, their tendency to generate false facts (hallucination) is an increasingly prominent problem. Existing work mainly studies confidence estimation in the single-problem setting, while model awareness of its knowledge boundary in the multi-problem setting remains underexplored. Method: Proposes MAC-Tuning, which separates the learning of answer prediction and confidence estimation during fine-tuning on instruction data. Result: Experiments show that MAC-Tuning improves average precision over baselines by up to 25%. Conclusion: MAC-Tuning significantly improves LLM confidence estimation in the multi-problem setting and offers a new approach to the hallucination problem. Abstract: With the widespread application of large language models (LLMs), the issue of generating non-existing facts, known as hallucination, has garnered increasing attention. Previous research on enhancing LLM confidence estimation mainly focuses on the single-problem setting. However, LLM awareness of its internal parameterized knowledge boundary under the more challenging multi-problem setting, which requires answering multiple problems accurately at the same time, remains underexplored. To bridge this gap, we introduce a novel method, Multiple Answers and Confidence Stepwise Tuning (MAC-Tuning), that separates the learning of answer prediction and confidence estimation during fine-tuning on instruction data. Extensive experiments demonstrate that our method outperforms baselines by up to 25% in average precision.
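
The headline number is reported in average precision (AP). For readers less familiar with the metric, here is a minimal sketch of AP for confidence estimation, assuming each answer comes with a confidence score and a correctness label. The function name and the toy data are illustrative only and are not taken from the paper.

```python
def average_precision(confidences, correct):
    """Average precision of confidence scores against correctness labels.

    Answers are ranked by confidence (highest first); AP is the mean of
    precision@k taken at every rank k that holds a correct answer.
    """
    ranked = sorted(zip(confidences, correct), key=lambda p: p[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, is_correct) in enumerate(ranked, start=1):
        if is_correct:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0


# Toy example: four answers with model confidences and correctness labels.
ap = average_precision([0.9, 0.8, 0.3, 0.6], [True, False, True, True])
```

A model whose confidence ranks all correct answers above all incorrect ones reaches an AP of 1.0, so a higher AP indicates confidence that tracks correctness more closely.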

[116] WebThinker: Empowering Large Reasoning Models with Deep Research Capability

Xiaoxi Li,Jiajie Jin,Guanting Dong,Hongjin Qian,Yutao Zhu,Yongkang Wu,Ji-Rong Wen,Zhicheng Dou

Main category: cs.CL

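TL;DR: WebThinker is a deep research agent that improves the performance of large reasoning models on complex tasks by dynamically searching for and integrating web information.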

Details

Motivation: Existing large reasoning models depend on static knowledge and struggle with complex, knowledge-intensive tasks, especially research report generation that requires synthesizing diverse web information. Method: Proposes WebThinker, which includes a Deep Web Explorer module and an Autonomous Think-Search-and-Draft strategy, combined with RL training to improve model performance. Result: On several complex reasoning benchmarks and scientific report generation tasks, WebThinker significantly outperforms existing methods and proprietary systems. Conclusion: WebThinker improves the reliability and applicability of large reasoning models in complex scenarios and paves the way for more capable deep research systems. Abstract: Large reasoning models (LRMs), such as OpenAI-o1 and DeepSeek-R1, demonstrate impressive long-horizon reasoning capabilities. However, their reliance on static internal knowledge limits their performance on complex, knowledge-intensive tasks and hinders their ability to produce comprehensive research reports requiring synthesis of diverse web information. To address this, we propose WebThinker, a deep research agent that empowers LRMs to autonomously search the web, navigate web pages, and draft research reports during the reasoning process. WebThinker integrates a Deep Web Explorer module, enabling LRMs to dynamically search, navigate, and extract information from the web when encountering knowledge gaps. It also employs an Autonomous Think-Search-and-Draft strategy, allowing the model to seamlessly interleave reasoning, information gathering, and report writing in real time. To further enhance research tool utilization, we introduce an RL-based training strategy via iterative online Direct Preference Optimization (DPO). Extensive experiments on complex reasoning benchmarks (GPQA, GAIA, WebWalkerQA, HLE) and scientific report generation tasks (Glaive) demonstrate that WebThinker significantly outperforms existing methods and strong proprietary systems. Our approach enhances LRM reliability and applicability in complex scenarios, paving the way for more capable and versatile deep research systems. The code is available at https://github.com/RUC-NLPIR/WebThinker.
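
The training strategy is built on Direct Preference Optimization. As background, the sketch below computes the standard per-pair DPO loss from sequence log-probabilities. It illustrates only the generic loss; the iterative online procedure and how WebThinker constructs preference pairs are not covered, and the argument names and the beta value of 0.1 are assumptions for this example.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities.

    Each argument is log p(response | prompt) under either the policy being
    trained or the frozen reference model. beta scales the implicit reward.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)) written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))


# With identical policy and reference models the loss is log(2).
neutral = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Raising the chosen response's likelihood relative to the reference lowers it.
improved = dpo_loss(-10.0, -15.0, -12.0, -15.0)
```

The loss falls as the policy assigns relatively more probability to the preferred response than the reference model does, which is the signal a preference-based training loop would optimize.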

[117] How Real Are Synthetic Therapy Conversations? Evaluating Fidelity in Prolonged Exposure Dialogues

Suhas BN,Dominik Mattioli,Saeed Abdullah,Rosa I. Arriaga,Chris W. Wiese,Andrew M. Sherrill

Main category: cs.CL

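TL;DR: The paper examines the potential of synthetic dialogue data for PTSD therapy and finds that synthetic data are structurally close to real conversations but fall short on clinically important markers.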

Details

Motivation: To address privacy concerns, limited access, and high annotation costs for medical data, and to explore whether synthetic data are viable for training clinical models. Method: Systematically compares real and synthetic dialogues on linguistic, structural, and treatment-protocol metrics, and introduces a PE-specific evaluation framework. Result: Synthetic data are structurally close to real conversations (e.g., speaker switch ratio 0.98 vs. 0.99) but fail to capture key clinical markers (e.g., distress monitoring). Conclusion: Synthetic data can ease data scarcity and privacy problems, but more comprehensive evaluation metrics are needed to account for the clinical dynamics they miss. Abstract: The growing adoption of synthetic data in healthcare is driven by privacy concerns, limited access to real-world data, and the high cost of annotation. This work explores the use of synthetic Prolonged Exposure (PE) therapeutic conversations for Post-Traumatic Stress Disorder (PTSD) as a scalable alternative for training and evaluating clinical models. We systematically compare real and synthetic dialogues using linguistic, structural, and protocol-specific metrics, including turn-taking patterns and treatment fidelity. We also introduce and evaluate PE-specific metrics derived from linguistic analysis and semantic modeling, offering a novel framework for assessing clinical fidelity beyond surface fluency. Our findings show that although synthetic data holds promise for mitigating data scarcity and protecting patient privacy, it can struggle to capture the subtle dynamics of therapeutic interactions. In our dataset, synthetic dialogues match structural features of real-world dialogues (e.g., speaker switch ratio: 0.98 vs. 0.99); however, synthetic interactions do not adequately reflect key fidelity markers (e.g., distress monitoring). We highlight gaps in existing evaluation frameworks and advocate for fidelity-aware metrics that go beyond surface fluency to uncover clinically significant failures. Our findings clarify where synthetic data can effectively complement real-world datasets, and where critical limitations remain.
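
The abstract cites a "speaker switch ratio" without defining it. One plausible reading, assumed here purely for illustration, is the fraction of adjacent turn pairs in which the speaker changes. The sketch below computes that quantity; the paper's exact definition may differ.

```python
def speaker_switch_ratio(speakers):
    """Fraction of adjacent turn pairs where the speaker changes.

    `speakers` is the sequence of speaker labels, one per turn. This
    definition is an assumption; the abstract does not spell it out.
    """
    if len(speakers) < 2:
        return 0.0
    switches = sum(1 for a, b in zip(speakers, speakers[1:]) if a != b)
    return switches / (len(speakers) - 1)


# Toy transcript: T = therapist, P = patient; one turn pair has no switch.
ratio = speaker_switch_ratio(["T", "P", "T", "T", "P"])
```

Under this reading, values near 1.0 (like the 0.98 and 0.99 reported) would mean the dialogue alternates almost strictly between the two speakers, a purely structural property that says nothing about clinical content.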

[118] DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition

Z. Z. Ren,Zhihong Shao,Junxiao Song,Huajian Xin,Haocheng Wang,Wanjia Zhao,Liyue Zhang,Zhe Fu,Qihao Zhu,Dejian Yang,Z. F. Wu,Zhibin Gou,Shirong Ma,Hongxuan Tang,Yuxuan Liu,Wenjun Gao,Daya Guo,Chong Ruan

Main category: cs.CL

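TL;DR: DeepSeek-Prover-V2 is an open-source large language model designed for formal theorem proving in Lean 4; its initialization data come from a recursive theorem-proving pipeline, it combines informal and formal mathematical reasoning, and it performs strongly on several benchmarks.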

Details

Motivation: To improve large language model performance in formal theorem proving by combining informal and formal mathematical reasoning. Method: Uses DeepSeek-V3 to decompose complex problems into subgoals and to generate chain-of-thought reasoning, which provides the cold start for reinforcement learning. Result: Reaches an 88.9% pass ratio on MiniF2F-test, solves 49 problems from PutnamBench, and performs well on ProverBench and AIME problems. Conclusion: DeepSeek-Prover-V2 performs strongly in formal theorem proving and narrows the gap between formal and informal reasoning. Abstract: We introduce DeepSeek-Prover-V2, an open-source large language model designed for formal theorem proving in Lean 4, with initialization data collected through a recursive theorem proving pipeline powered by DeepSeek-V3. The cold-start training procedure begins by prompting DeepSeek-V3 to decompose complex problems into a series of subgoals. The proofs of resolved subgoals are synthesized into a chain-of-thought process, combined with DeepSeek-V3's step-by-step reasoning, to create an initial cold start for reinforcement learning. This process enables us to integrate both informal and formal mathematical reasoning into a unified model. The resulting model, DeepSeek-Prover-V2-671B, achieves state-of-the-art performance in neural theorem proving, reaching 88.9% pass ratio on the MiniF2F-test and solving 49 out of 658 problems from PutnamBench. In addition to standard benchmarks, we introduce ProverBench, a collection of 325 formalized problems, to enrich our evaluation, including 15 selected problems from the recent AIME competitions (years 24-25). Further evaluation on these 15 AIME problems shows that the model successfully solves 6 of them. In comparison, DeepSeek-V3 solves 8 of these problems using majority voting, highlighting that the gap between formal and informal mathematical reasoning in large language models is substantially narrowing.

[119] TRUST: An LLM-Based Dialogue System for Trauma Understanding and Structured Assessments

Sichang Tu,Abigail Powers,Stephen Doogan,Jinho D. Choi

Main category: cs.CL

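TL;DR: The study develops TRUST, an LLM-based dialogue system for standardized diagnostic interviews and assessments, filling a technical gap in mental healthcare.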

Details

Motivation: To address the accessibility of mental health services and explore the use of LLMs in clinical diagnosis. Method: Proposes the TRUST framework, which incorporates a dialogue acts schema for clinical interviews, and tests it with a patient simulation approach. Result: Expert evaluation shows that TRUST performs comparably to real clinical interviews, at the level of an average clinician. Conclusion: The TRUST framework could improve access to mental health services; communication style and response appropriateness can be improved further. Abstract: Objectives: While Large Language Models (LLMs) have been widely used to assist clinicians and support patients, no existing work has explored dialogue systems for standard diagnostic interviews and assessments. This study aims to bridge the gap in mental healthcare accessibility by developing an LLM-powered dialogue system that replicates clinician behavior. Materials and Methods: We introduce TRUST, a framework of cooperative LLM modules capable of conducting formal diagnostic interviews and assessments for Post-Traumatic Stress Disorder (PTSD). To guide the generation of appropriate clinical responses, we propose a Dialogue Acts schema specifically designed for clinical interviews. Additionally, we develop a patient simulation approach based on real-life interview transcripts to replace time-consuming and costly manual testing by clinicians. Results: A comprehensive set of evaluation metrics is designed to assess the dialogue system from both the agent and patient simulation perspectives. Expert evaluations by conversation and clinical specialists show that TRUST performs comparably to real-life clinical interviews. Discussion: Our system performs at the level of average clinicians, with room for future enhancements in communication styles and response appropriateness. Conclusions: Our TRUST framework shows its potential to facilitate mental healthcare availability.

cs.CR

[120] A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage

Rui Xin,Niloofar Mireshghallah,Shuyue Stella Li,Michael Duan,Hyunwoo Kim,Yejin Choi,Yulia Tsvetkov,Sewoong Oh,Pang Wei Koh

Main category: cs.CR

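TL;DR: The paper challenges the privacy protection offered by existing sanitization methods for sensitive text, proposes a new framework for evaluating re-identification attacks, and shows that current methods give a false sense of privacy.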

Details

Motivation: Existing sanitization methods look only at leakage of explicit identifiers and ignore subtle textual markers that can lead to re-identification, so a more thorough privacy risk assessment is needed. Method: Proposes a new framework that evaluates re-identification attacks, using auxiliary information (such as social activities) to infer sensitive attributes, and tests commercial tools. Result: Azure's PII removal tool fails to protect 74% of information in the MedQA dataset; differential privacy helps but reduces data utility. Conclusion: Current sanitization techniques give a false sense of privacy, and more robust methods are needed to prevent semantic-level information leakage. Abstract: Sanitizing sensitive text data typically involves removing personally identifiable information (PII) or generating synthetic data under the assumption that these methods adequately protect privacy; however, their effectiveness is often assessed only by measuring the leakage of explicit identifiers, ignoring nuanced textual markers that can lead to re-identification. We challenge this illusion of privacy by proposing a new framework that evaluates re-identification attacks to quantify individual privacy risks upon data release. Our approach shows that seemingly innocuous auxiliary information (such as routine social activities) can be used to infer sensitive attributes like age or substance use history from sanitized data. For instance, we demonstrate that Azure's commercial PII removal tool fails to protect 74% of information in the MedQA dataset. Although differential privacy mitigates these risks to some extent, it significantly reduces the utility of the sanitized text for downstream tasks. Our findings indicate that current sanitization techniques offer a false sense of privacy, highlighting the need for more robust methods that protect against semantic-level information leakage.

[121] Cert-SSB: Toward Certified Sample-Specific Backdoor Defense

Ting Qiao,Yingjia Wang,Xing Liu,Sixing Wu,Jianbing Li,Yiming Li

Main category: cs.CR

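TL;DR: The paper proposes Cert-SSB, a sample-specific certified backdoor defense that improves defense performance by optimizing the noise magnitude for each sample and dynamically adjusting the certification region.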

Details

Motivation: Deep neural networks (DNNs) are vulnerable to backdoor attacks, and existing defenses assume that all samples are equidistant from the decision boundary, which leads to poor certification performance. Method: Cert-SSB uses stochastic gradient ascent to optimize the noise magnitude for each sample, trains multiple smoothed models and aggregates their predictions, and introduces a dynamic certification method. Result: Experiments show that Cert-SSB improves defense performance on multiple benchmark datasets. Conclusion: Through sample-specific noise optimization and dynamic certification, Cert-SSB significantly strengthens defense against backdoor attacks. Abstract: Deep neural networks (DNNs) are vulnerable to backdoor attacks, where an attacker manipulates a small portion of the training data to implant hidden backdoors into the model. The compromised model behaves normally on clean samples but misclassifies backdoored samples into the attacker-specified target class, posing a significant threat to real-world DNN applications. Currently, several empirical defense methods have been proposed to mitigate backdoor attacks, but they are often bypassed by more advanced backdoor techniques. In contrast, certified defenses based on randomized smoothing have shown promise by adding random noise to training and testing samples to counteract backdoor attacks. In this paper, we reveal that existing randomized smoothing defenses implicitly assume that all samples are equidistant from the decision boundary. However, this assumption may not hold in practice, leading to suboptimal certification performance. To address this issue, we propose a sample-specific certified backdoor defense method, termed Cert-SSB. Cert-SSB first employs stochastic gradient ascent to optimize the noise magnitude for each sample, ensuring a sample-specific noise level that is then applied to multiple poisoned training sets to retrain several smoothed models. After that, Cert-SSB aggregates the predictions of multiple smoothed models to generate the final robust prediction. In this case, existing certification methods become inapplicable because the optimized noise varies across samples. To overcome this challenge, we introduce a storage-update-based certification method, which dynamically adjusts each sample's certification region to improve certification performance. We conduct extensive experiments on multiple benchmark datasets, demonstrating the effectiveness of our proposed method. Our code is available at https://github.com/NcepuQiaoTing/Cert-SSB.
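
To make the "aggregate the predictions of multiple smoothed models" step more concrete, the sketch below shows a generic randomized-smoothing vote with a per-sample noise level. It is a toy under stated assumptions: the classifiers are hand-written threshold functions, the noise level `sigma` is simply passed in instead of being optimized by gradient ascent, and no certified radius is computed. It is not the Cert-SSB implementation.

```python
import random
from collections import Counter


def smoothed_predict(classifier, x, sigma, n_samples, rng):
    """Majority vote of one classifier over Gaussian-perturbed copies of x."""
    votes = Counter(
        classifier([xi + rng.gauss(0.0, sigma) for xi in x])
        for _ in range(n_samples)
    )
    return votes.most_common(1)[0][0]


def aggregate(classifiers, x, sigma, n_samples=200, seed=0):
    """Majority vote across several smoothed models for one sample.

    `sigma` stands in for a sample-specific noise level; different inputs
    could be given different values.
    """
    rng = random.Random(seed)
    votes = Counter(
        smoothed_predict(c, x, sigma, n_samples, rng) for c in classifiers
    )
    return votes.most_common(1)[0][0]


# Three hypothetical binary classifiers with slightly different thresholds.
models = [lambda v, t=t: int(sum(v) > t) for t in (0.0, 0.1, -0.1)]
label_pos = aggregate(models, x=[2.0, 1.0], sigma=0.25)
label_neg = aggregate(models, x=[-2.0, -1.0], sigma=0.25)
```

Because the noise level is an argument of the vote, a sample far from the decision boundary could tolerate a larger `sigma` than one close to it, which is the intuition behind making the noise sample-specific.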

cs.HC

[122] Adaptive 3D UI Placement in Mixed Reality Using Deep Reinforcement Learning

Feiyu Lu,Mengyu Chen,Hsiang Hsu,Pranav Deshpande,Cheng Yao Wang,Blair MacIntyre

Main category: cs.HC

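TL;DR: The paper explores how reinforcement learning (RL) can dynamically optimize 3D content placement in mixed reality (MR) to adapt to the user's pose and changes in the environment.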

Details

Motivation: Dynamic placement of virtual content in MR is a challenging problem, and traditional optimization methods struggle to adapt to real-time changes. Method: Applies reinforcement learning, using user pose and environment information, to place 3D content continuously. Result: Preliminary experiments suggest that RL can optimize content placement effectively and improve the user experience. Conclusion: RL shows potential in MR, and future work can study personalized UI and content placement optimization. Abstract: Mixed Reality (MR) could assist users' tasks by continuously integrating virtual content with their view of the physical environment. However, where and how to place this content to best support users has been a challenging problem due to the dynamic nature of MR experiences. In contrast to prior work that investigates optimization-based methods, we are exploring how reinforcement learning (RL) could assist with continuous 3D content placement that is aware of users' poses and their surrounding environments. Through an initial exploration and preliminary evaluation, our results demonstrate the potential of RL to position content that maximizes the reward for users on the go. We further identify future directions for research that could harness the power of RL for personalized and optimized UI and content placement in MR.

q-bio.CB

[123] Glucagon and insulin production in pancreatic cells modeled using Petri nets and Boolean networks

Kamila Barylska,Frank Delaplace,Anna Gogolińska,Ewa Pańkowska

Main category: q-bio.CB

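TL;DR: The paper presents Petri net models of glucose regulation, focusing on the secretion mechanisms of insulin and glucagon and their interaction, and analyzes the models' dynamic behavior.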

Details

Motivation: To better understand the complex glucose regulation processes involved in diabetes, the authors aim to build a Petri net model of whole-body glucose regulation. Method: The authors create Petri net models of insulin and glucagon secretion, analyze their dynamic behavior, and convert the combined system into a Boolean network. Result: Models of insulin and glucagon secretion were built successfully and show how the two interact at different blood glucose levels. Conclusion: The models provide a basis for understanding the mechanisms of diabetes and show the potential of Petri nets for modeling complex biological systems. Abstract: Diabetes is a chronic disease of civilization characterized by a constantly elevated concentration of glucose in the blood. Many processes are involved in glucose regulation, and their interactions are very complex. To better understand those processes, we set ourselves the goal of creating a Petri net model of glucose regulation in the whole body. So far we have created a model of glycolysis and glucose synthesis in the liver, as well as general overview models of glucose regulation in a healthy and a diabetic person. In this paper we introduce Petri net models of insulin secretion in pancreatic beta cells and of glucagon secretion in pancreatic alpha cells. These two hormones have mutually opposite effects: insulin prevents hyperglycemia, and glucagon prevents hypoglycemia. Understanding the mechanisms of insulin and glucagon secretion constitutes the basis for understanding diabetes. We also present a model in which both processes occur together, depending on the blood glucose level. The dynamics of each model are analysed. Additionally, we transform the overall insulin and glucagon secretion system into a Boolean network, following standard transformation rules.
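
To give a feel for what a Boolean-network view of this system looks like, the sketch below encodes the opposing roles of the two hormones as four Boolean variables with synchronous updates. The variables and update rules are a deliberately simplified toy invented for this example; they are not the network derived in the paper.

```python
def step(state):
    """One synchronous update of a toy glucose-regulation Boolean network.

    Assumed rules: high glucose triggers insulin, low glucose triggers
    glucagon, and each hormone clears the condition that triggered it.
    """
    return {
        "glucose_high": state["glucose_high"] and not state["insulin"],
        "glucose_low": state["glucose_low"] and not state["glucagon"],
        "insulin": state["glucose_high"],
        "glucagon": state["glucose_low"],
    }


def run_to_fixed_point(state, max_steps=10):
    """Iterate until the state stops changing; return the visited states."""
    trajectory = [state]
    for _ in range(max_steps):
        nxt = step(trajectory[-1])
        if nxt == trajectory[-1]:
            break
        trajectory.append(nxt)
    return trajectory


# Start from hyperglycemia with no hormone secreted yet.
start = {"glucose_high": True, "glucose_low": False,
         "insulin": False, "glucagon": False}
trajectory = run_to_fixed_point(start)
```

In this toy, a hyperglycemic start switches insulin on, insulin clears the high-glucose flag, and the system settles into a resting state with neither hormone active; starting from `glucose_low` gives the mirror-image trajectory through glucagon.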

cond-mat.mtrl-sci

[124] Towards Space Group Determination from EBSD Patterns: The Role of Deep Learning and High-throughput Dynamical Simulations

Alfred Yan,Muhammad Nur Talha Kilic,Gert Nolze,Ankit Agrawal,Alok Choudhary,Roberto dos Reis,Vinayak Dravid

Main category: cond-mat.mtrl-sci

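TL;DR: The paper proposes a deep-learning-based approach to crystal symmetry classification that combines electron backscatter diffraction (EBSD) with neural networks and achieves high-accuracy predictions on both simulated and experimental data.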

Details

Motivation: Designing new materials depends on understanding structure-property relationships, but synthesis far outpaces characterization, so fast, scalable methods for crystal symmetry analysis are urgently needed. Method: Uses Kikuchi diffraction and deep learning, training neural networks to classify space group symmetry and handling experimental data with an unsupervised domain adaptation method. Result: The neural networks exceed 90% classification accuracy on simulated and experimental data, supporting the feasibility of the approach. Conclusion: Deep learning combined with EBSD offers an effective route to high-throughput crystal structure characterization. Abstract: The design of novel materials hinges on the understanding of structure-property relationships. However, our capability to synthesize a large number of materials has outpaced the ability and speed needed to characterize them. While the overall chemical constituents can be readily known during synthesis, the structural evolution and characterization of newly synthesized samples remains a bottleneck for the ultimate goal of high-throughput nanomaterials discovery. Thus, scalable methods for crystal symmetry determination that can analyze a large volume of material samples within a short time frame are especially needed. Kikuchi diffraction in the SEM is a promising technique for this due to its sensitivity to dynamical scattering, which may provide information beyond just the seven crystal systems and fourteen Bravais lattices. After diffraction patterns are collected from material samples, deep learning methods may be able to classify the space group symmetries using the patterns as input, which, paired with the elemental composition, would help enable the determination of the crystal structure. To investigate the feasibility of this solution, neural networks were trained to predict the space group type of background-corrected EBSD patterns. Our networks were first trained and tested on an artificial dataset of EBSD patterns of 5,148 different cubic phases, created through physics-based dynamical simulations. Next, Maximum Classifier Discrepancy, an unsupervised deep learning-based domain adaptation method, was utilized to train neural networks to make predictions for experimental EBSD patterns. We introduce a relabeling scheme, which enables our models to achieve accuracy scores higher than 90% on simulated and experimental data, suggesting that neural networks are capable of making predictions of crystal symmetry from an EBSD pattern.

cs.SE

[125] CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation

Sizhe Wang,Zhengren Wang,Dongsheng Ma,Yongan Yu,Rui Ling,Zhiyu Li,Feiyu Xiong,Wentao Zhang

Main category: cs.SE

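TL;DR: CodeFlowBench is a benchmark of 5258 problems that evaluates LLMs on multi-turn, iterative code reuse; experiments show that LLMs perform considerably worse in the multi-turn setting.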

Details

Motivation: Real-world development needs readable, extensible, and testable code, achieved through modular components and multi-turn reuse. CodeFlowBench aims to evaluate LLMs on this kind of task. Method: Extracts problems from Codeforces, decomposes them into function-level subproblems through an automated pipeline, and designs an evaluation framework for multi-turn code reuse. Result: LLMs perform poorly in the multi-turn pattern; for example, o1-mini's pass@1 drops from 37.8% in single-turn to 20.8% in multi-turn. Conclusion: CodeFlowBench provides a comprehensive benchmark for multi-turn, iterative code generation and exposes the difficulty LLMs have with structurally complex problems. Abstract: Real-world development demands code that is readable, extensible, and testable, which is achieved by organizing the implementation into modular components and iteratively reusing pre-implemented code. We term this iterative, multi-turn process codeflow and introduce CodeFlowBench, the first benchmark designed for comprehensively evaluating LLMs' ability to perform codeflow, namely to implement new functionality by reusing existing functions over multiple turns. CodeFlowBench comprises 5258 problems drawn from Codeforces and is continuously updated via an automated pipeline that decomposes each problem into a series of function-level subproblems based on its dependency tree, with each subproblem paired with unit tests. We further propose a novel evaluation framework with tasks and metrics tailored to multi-turn code reuse to assess model performance. In experiments across various LLMs under both multi-turn and single-turn patterns, we observe models' poor performance on CodeFlowBench, with a substantial performance drop in the iterative codeflow scenario. For instance, o1-mini achieves a pass@1 of 20.8% in the multi-turn pattern versus 37.8% in the single-turn pattern. Further analysis shows that different models excel at different dependency depths, yet all struggle to correctly solve structurally complex problems, highlighting challenges for current LLMs to serve as code generation tools when performing codeflow. Overall, CodeFlowBench offers a comprehensive benchmark and new insights into LLM capabilities for multi-turn, iterative code generation, guiding future advances in code generation tasks.
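
The abstract describes decomposing each problem into function-level subproblems "based on its dependency tree". One natural way to turn such a tree into a sequence of turns, sketched below, is a topological order in which every function is requested only after the functions it reuses. The function names and the dependency map are hypothetical, and the benchmark's actual pipeline may order turns differently.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def turn_order(dependencies):
    """Order subproblems so that each comes after everything it depends on.

    `dependencies` maps a function name to the set of functions it reuses.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical dependency tree for one problem.
deps = {
    "solve": {"parse_input", "count_paths"},
    "count_paths": {"build_graph"},
    "build_graph": {"parse_input"},
    "parse_input": set(),
}
order = turn_order(deps)
```

With an order like this, each turn can present the functions written in earlier turns as existing code to reuse, and the depth of a function in the tree gives a simple measure of how much prior work it builds on.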

[126] SWE-smith: Scaling Data for Software Engineering Agents

John Yang,Kilian Leret,Carlos E. Jimenez,Alexander Wettig,Kabir Khandpur,Yanzhe Zhang,Binyuan Hui,Ofir Press,Ludwig Schmidt,Diyi Yang

Main category: cs.SE

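TL;DR: SWE-smith is a pipeline for generating software engineering training data at scale, addressing the problem that existing datasets are small and hard to extend.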

Details

Motivation: Existing software engineering training datasets are small, complex to build, and storage-hungry, which limits the application and scaling of language models. Method: SWE-smith automatically constructs execution environments and synthesizes task instances (which break existing tests), producing a dataset of 50k instances. Result: The resulting SWE-agent-LM-32B reaches 40.2% Pass@1 on the SWE-bench Verified benchmark, the best result among open-source models. Conclusion: Open-sourcing SWE-smith lowers the barrier to research on automated software engineering and advances the field. Abstract: Despite recent progress in Language Models (LMs) for software engineering, collecting training data remains a significant pain point. Existing datasets are small, with at most 1,000s of training instances from 11 or fewer GitHub repositories. The procedures to curate such datasets are often complex, necessitating hundreds of hours of human labor; companion execution environments also take up several terabytes of storage, severely limiting their scalability and usability. To address this pain point, we introduce SWE-smith, a novel pipeline for generating software engineering training data at scale. Given any Python codebase, SWE-smith constructs a corresponding execution environment, then automatically synthesizes 100s to 1,000s of task instances that break existing test(s) in the codebase. Using SWE-smith, we create a dataset of 50k instances sourced from 128 GitHub repositories, an order of magnitude larger than all previous works. We train SWE-agent-LM-32B, achieving a 40.2% Pass@1 resolve rate on the SWE-bench Verified benchmark, state of the art among open-source models. We open-source SWE-smith (collection procedure, task instances, trajectories, models) to lower the barrier of entry for research in LM systems for automated software engineering. All assets are available at https://swesmith.com.

cs.LG [Back]

[127] Multimodal Large Language Models for Medicine: A Comprehensive Survey

Jiarui Ye,Hao Tang

Main category: cs.LG

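TL;DR: This survey reviews multimodal large language models (MLLMs) in medicine and healthcare, covering background, working principles, three main application directions (medical reporting, diagnosis, and treatment), data modes and evaluation benchmarks, and open challenges with possible solutions.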

Details

Motivation: With the release of GPT-4, the strong capabilities of MLLMs on multimodal tasks have drawn wide attention, and researchers have begun to explore their potential in medicine and healthcare. Method: The survey reviews 330 related papers, summarizes the application directions, data modes, and evaluation methods of MLLMs in healthcare, and analyzes specific cases. Result: MLLMs show notable capabilities in medical reporting, diagnosis, and treatment; the survey also presents six mainstream data modes with their corresponding evaluation benchmarks. Conclusion: Although MLLMs hold promise for medicine, they still face challenges that require further research. Abstract: MLLMs have recently become a focal point in the field of artificial intelligence research. Building on the strong capabilities of LLMs, MLLMs are adept at addressing complex multi-modal tasks. With the release of GPT-4, MLLMs have gained substantial attention from different domains. Researchers have begun to explore the potential of MLLMs in the medical and healthcare domain. In this paper, we first introduce the background and fundamental concepts related to LLMs and MLLMs, while emphasizing the working principles of MLLMs. Subsequently, we summarize three main directions of application within healthcare: medical reporting, medical diagnosis, and medical treatment. Our findings are based on a comprehensive review of 330 recent papers in this area. We illustrate the remarkable capabilities of MLLMs in these domains by providing specific examples. For data, we present six mainstream modes of data along with their corresponding evaluation benchmarks. At the end of the survey, we discuss the challenges faced by MLLMs in the medical and healthcare domain and propose feasible methods to mitigate or overcome these issues.

[128] Sparse-to-Sparse Training of Diffusion Models

Inês Cardoso Oliveira,Decebal Constantin Mocanu,Luis A. Leiva

Main category: cs.LG

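TL;DR: The paper applies the sparse-to-sparse training paradigm to diffusion models (DMs) for the first time, aiming to improve both training and inference efficiency. Experiments show that sparse DMs can match or even outperform dense models while substantially reducing parameters and computation.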

Details

Motivation: Diffusion models perform well on generative tasks but demand heavy computational resources. Prior work has mostly targeted inference efficiency; this paper is the first to explore sparse training to improve both training and inference efficiency. Method: Using the sparse-to-sparse paradigm, sparse DMs (Latent Diffusion and ChiroDiff) are trained on unconditional generation tasks with three methods (Static-DM, RigL-DM, and MagRan-DM) to study how sparsity affects performance. Result: Sparse DMs match or outperform dense models while substantially reducing trainable parameters and FLOPs, and the paper identifies safe and effective values for sparse training. Conclusion: Sparse-to-sparse training is an effective way to improve the efficiency of diffusion models and opens a new direction for future work. Abstract: Diffusion models (DMs) are a powerful type of generative model that have achieved state-of-the-art results in various image synthesis tasks and have shown potential in other domains, such as natural language processing and temporal data modeling. Despite their stable training dynamics and ability to produce diverse high-quality samples, DMs are notorious for requiring significant computational resources, both in the training and inference stages. Previous work has focused mostly on increasing the efficiency of model inference. This paper introduces, for the first time, the paradigm of sparse-to-sparse training to DMs, with the aim of improving both training and inference efficiency. We focus on unconditional generation and train sparse DMs from scratch (Latent Diffusion and ChiroDiff) on six datasets using three different methods (Static-DM, RigL-DM, and MagRan-DM) to study the effect of sparsity on model performance. Our experiments show that sparse DMs are able to match and often outperform their dense counterparts, while substantially reducing the number of trainable parameters and FLOPs. We also identify safe and effective values to perform sparse-to-sparse training of DMs.
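
To illustrate what sparse-to-sparse training can look like, here is a minimal sketch of one prune-and-regrow topology update that keeps the number of active weights fixed. The abstract does not describe the internals of Static-DM, RigL-DM, or MagRan-DM. The sketch assumes the usual reading of these names in the dynamic sparse training literature: a static mask never changes, a RigL-style update drops small-magnitude weights and regrows where gradients are largest, and a magnitude/random update regrows at random positions. All function and variable names are hypothetical.

```python
import random


def prune_and_regrow(weights, mask, grads, drop_fraction, grow="gradient", rng=None):
    """One topology update on flat lists; the count of active weights is preserved.

    grow="gradient": regrow where |grad| is largest (RigL-style, assumed).
    grow="random":   regrow at random inactive positions (assumed MagRan-style).
    """
    rng = rng or random.Random(0)
    active = [i for i, m in enumerate(mask) if m]
    inactive = [i for i, m in enumerate(mask) if not m]
    n_update = min(int(drop_fraction * len(active)), len(inactive))

    # Drop the active connections with the smallest magnitude.
    dropped = sorted(active, key=lambda i: abs(weights[i]))[:n_update]

    # Grow the same number of connections among previously inactive positions.
    if grow == "gradient":
        grown = sorted(inactive, key=lambda i: abs(grads[i]), reverse=True)[:n_update]
    elif grow == "random":
        grown = rng.sample(inactive, n_update)
    else:
        raise ValueError("grow must be 'gradient' or 'random'")

    new_weights, new_mask = list(weights), list(mask)
    for i in dropped:
        new_mask[i], new_weights[i] = 0, 0.0
    for i in grown:
        new_mask[i], new_weights[i] = 1, 0.0  # new connections start at zero
    return new_weights, new_mask


# Toy layer with 3 of 6 weights active (50% sparsity).
weights = [0.9, -0.05, 0.0, 0.4, 0.0, 0.0]
mask = [1, 1, 0, 1, 0, 0]
grads = [0.1, 0.2, 0.7, 0.0, 0.05, 0.3]
_, updated_mask = prune_and_regrow(weights, mask, grads, drop_fraction=0.34)
print(updated_mask)
```

A static-mask variant would simply skip this update and train with the initial mask throughout.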

cs.IR [Back]

[129] Don't Retrieve, Generate: Prompting LLMs for Synthetic Training Data in Dense Retrieval

Aarush Sinha

Main category: cs.IR

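TL;DR: The paper proposes an end-to-end LLM-based pipeline that generates both queries and hard negatives, removing the dependence on traditional, computationally intensive mining methods (such as BM25 or cross-encoders) while performing on par with them.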

Details

Motivation: Traditional hard negative mining methods (such as BM25 or CE) are computationally costly and require full corpus access; the author seeks a simpler and more efficient alternative. Method: An LLM generates both the queries and the hard negatives with no dependence on the corpus, and the pipeline is compared against traditional BM25 and CE mining. Result: Experiments show that the proposed LLM pipeline performs on par with the BM25 and CE methods, validating its effectiveness. Conclusion: LLM-generated hard negatives simplify the training pipeline without falling behind traditional methods, offering a new route to efficient retriever training. Abstract: Training effective dense retrieval models often relies on hard negative (HN) examples mined from the document corpus via methods like BM25 or cross-encoders (CE), processes that can be computationally demanding and require full corpus access. This paper introduces a different approach, an end-to-end pipeline where a Large Language Model (LLM) first generates a query from a passage, and then generates a hard negative example using only that query text. This corpus-free negative generation contrasts with standard mining techniques. We evaluated this "LLM Query → LLM HN" approach against traditional "LLM Query → BM25 HN" and "LLM Query → CE HN" pipelines using E5-Base and GTE-Base models on several BEIR benchmark datasets. Our results show the proposed all-LLM pipeline achieves performance identical to both the BM25 and the computationally intensive CE baselines across nDCG@10, Precision@10, and Recall@100 metrics. This demonstrates that our corpus-free negative generation method matches the effectiveness of complex, corpus-dependent mining techniques, offering a potentially simpler and more efficient pathway for training high-performance retrievers without sacrificing results. We make the dataset including the queries and the hard negatives for all three methods publicly available at https://huggingface.co/collections/chungimungi/arxiv-hard-negatives-68027bbc601ff6cc8eb1f449.
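
The three metrics named in the abstract (nDCG@10, Precision@10, Recall@100) can be sketched for a single query as follows. This is a generic illustration, not the paper's evaluation code: it assumes the linear-gain form of DCG and treats any relevance above zero as relevant. The document ids and relevance labels are hypothetical.

```python
from math import log2


def dcg_at_k(gains, k):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(g / log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k for one query; relevance maps doc id -> graded relevance."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids]
    ideal_dcg = dcg_at_k(sorted(relevance.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def precision_at_k(ranked_ids, relevance, k=10):
    """Fraction of the top-k cutoff occupied by relevant documents."""
    return sum(1 for d in ranked_ids[:k] if relevance.get(d, 0) > 0) / k


def recall_at_k(ranked_ids, relevance, k=100):
    """Fraction of all relevant documents found in the top k."""
    n_relevant = sum(1 for r in relevance.values() if r > 0)
    hits = sum(1 for d in ranked_ids[:k] if relevance.get(d, 0) > 0)
    return hits / n_relevant if n_relevant else 0.0


# Hypothetical ranking for one query with two relevant documents.
ranking = ["d3", "d1", "d7", "d2"]
labels = {"d1": 1, "d2": 1}
print(ndcg_at_k(ranking, labels), precision_at_k(ranking, labels), recall_at_k(ranking, labels))
```

Benchmark scores are then the mean of each metric over all queries in a dataset.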

eess.IV [Back]

[130] Light Weight CNN for classification of Brain Tumors from MRI Images

Natnael Alemayehu

Main category: eess.IV

TL;DR: Proposes a lightweight CNN-based deep learning model for multi-class classification of brain tumors in MRI scans, reaching 98.78% accuracy.

Details

Motivation: To develop an efficient, low-complexity model that supports early clinical diagnosis of brain tumors. Method: Image preprocessing (normalization, data augmentation, and cropping) is combined with CNN architecture optimization (hyperparameter tuning via Keras Tuner), evaluated with 5-fold cross-validation. Result: The model achieves a classification accuracy of 98.78%. Conclusion: The method offers an effective, low-complexity solution for clinical diagnosis. Abstract: This study presents a convolutional neural network (CNN)-based approach for the multi-class classification of brain tumors using magnetic resonance imaging (MRI) scans. We utilize a publicly available dataset containing MRI images categorized into four classes: glioma, meningioma, pituitary tumor, and no tumor. Our primary objective is to build a lightweight deep learning model that can automatically classify brain tumor types with high accuracy. To achieve this goal, we incorporate image preprocessing steps, including normalization, data augmentation, and a cropping technique designed to reduce background noise and emphasize relevant regions. The CNN architecture is optimized through hyperparameter tuning using Keras Tuner, enabling systematic exploration of network parameters. To ensure reliable evaluation, we apply 5-fold cross-validation, where each hyperparameter configuration is evaluated across multiple data splits to mitigate overfitting. Experimental results demonstrate that the proposed model achieves a classification accuracy of 98.78%, indicating its potential as a diagnostic aid in clinical settings. The proposed method offers a low-complexity yet effective solution for assisting in early brain tumor diagnosis.
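
The evaluation protocol described above, where each hyperparameter configuration is scored across several data splits, can be sketched in plain Python. This is not the paper's Keras Tuner setup: `train_and_score` is a hypothetical stand-in for training the CNN on the training indices and returning validation accuracy, and the configuration names and values are invented for illustration.

```python
import random
from statistics import mean


def k_fold_splits(n_samples, k=5, seed=0):
    """Return k (train_indices, val_indices) pairs that partition the data."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for held_out in range(k):
        val_idx = folds[held_out]
        train_idx = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        splits.append((train_idx, val_idx))
    return splits


def select_config(configs, n_samples, train_and_score, k=5):
    """Pick the configuration with the best mean validation score over k folds."""
    splits = k_fold_splits(n_samples, k)
    mean_scores = {
        name: mean(train_and_score(cfg, tr, va) for tr, va in splits)
        for name, cfg in configs.items()
    }
    return max(mean_scores, key=mean_scores.get), mean_scores


# Hypothetical stand-in: a real version would train a model and return accuracy.
def toy_train_and_score(cfg, train_idx, val_idx):
    return 1.0 - abs(cfg["lr"] - 0.001) * 100


toy_configs = {"small_lr": {"lr": 0.0005}, "mid_lr": {"lr": 0.001}, "large_lr": {"lr": 0.01}}
best_name, scores = select_config(toy_configs, n_samples=100, train_and_score=toy_train_and_score)
print(best_name)
```

Averaging over folds makes the selected configuration less dependent on any single train/validation split.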

[131] Gradient Attention Map Based Verification of Deep Convolutional Neural Networks with Application to X-ray Image Datasets

Omid Halimi Milani,Amanda Nikho,Lauren Mills,Marouane Tliba,Ahmet Enis Cetin,Mohammed H. Elnagar

Main category: eess.IV

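TL;DR: Proposes a comprehensive verification framework that uses gradient attention maps, early convolutional feature maps, and a garbage class to assess whether a deep learning model is suitable for given medical images, ensuring reliable predictions.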

Details

Motivation: To address unreliable predictions that arise when deep learning models are applied to medical images whose distribution differs from the training data, and thereby improve the safety of patient care. Method: (1) Use Gradient Attention Maps (GAM) to analyze attention patterns; (2) extend verification to early convolutional feature maps to capture structural deviations; (3) add a garbage class to reject out-of-distribution inputs. Result: Experiments show that the approach effectively identifies unsuitable models and inputs, improving the reliability of deep learning in medical imaging. Conclusion: The comprehensive verification framework is an effective tool for safely deploying deep learning in medical imaging. Abstract: Deep learning models have great potential in medical imaging, including orthodontics and skeletal maturity assessment. However, applying a model to data different from its training set can lead to unreliable predictions that may impact patient care. To address this, we propose a comprehensive verification framework that evaluates model suitability through multiple complementary strategies. First, we introduce a Gradient Attention Map (GAM)-based approach that analyzes attention patterns using Grad-CAM and compares them via similarity metrics such as IoU, Dice Similarity, SSIM, Cosine Similarity, Pearson Correlation, KL Divergence, and Wasserstein Distance. Second, we extend verification to early convolutional feature maps, capturing structural misalignments missed by attention alone. Finally, we incorporate an additional garbage class into the classification model to explicitly reject out-of-distribution inputs. Experimental results demonstrate that these combined methods effectively identify unsuitable models and inputs, promoting safer and more reliable deployment of deep learning in medical imaging.
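
Three of the similarity metrics listed in the abstract (IoU, Dice, and cosine similarity) can be sketched on small attention maps. This is a generic illustration rather than the authors' implementation. It assumes attention maps are given as nested lists with values in [0, 1] and that a fixed threshold of 0.5 is used to binarize them. The abstract specifies neither, and the example maps are hypothetical.

```python
from math import sqrt


def binarize(attention, threshold=0.5):
    """Turn a 2D attention map into a 0/1 mask (threshold is an assumption)."""
    return [[1 if v >= threshold else 0 for v in row] for row in attention]


def _flatten(grid):
    return [v for row in grid for v in row]


def iou(mask_a, mask_b):
    """Intersection over union of two binary masks."""
    a, b = _flatten(mask_a), _flatten(mask_b)
    inter = sum(x & y for x, y in zip(a, b))
    union = sum(x | y for x, y in zip(a, b))
    return inter / union if union else 1.0


def dice(mask_a, mask_b):
    """Dice similarity coefficient of two binary masks."""
    a, b = _flatten(mask_a), _flatten(mask_b)
    inter = sum(x & y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 2 * inter / total if total else 1.0


def cosine_similarity(map_a, map_b):
    """Cosine similarity of two real-valued attention maps."""
    a, b = _flatten(map_a), _flatten(map_b)
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical attention maps: one from in-distribution data, one from a new input.
reference_map = [[0.9, 0.8, 0.1], [0.2, 0.7, 0.0]]
candidate_map = [[0.6, 0.1, 0.0], [0.1, 0.9, 0.8]]
ref_mask, cand_mask = binarize(reference_map), binarize(candidate_map)
print(iou(ref_mask, cand_mask), dice(ref_mask, cand_mask))
```

Low overlap between a new input's attention mask and the reference pattern would be one signal that the model is attending to unexpected regions.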

[132] LoC-LIC: Low Complexity Learned Image Coding Using Hierarchical Feature Transforms

Ayman A. Ameen,Thomas Richter,André Kaup

Main category: eess.IV

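TL;DR: Proposes an approach that lowers the complexity of learned image compression models through hierarchical feature extraction, significantly reducing computational requirements.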

Details

Motivation: Current learned image compression models are highly complex and computationally demanding, which limits their wider adoption. Method: Hierarchical feature extraction transforms use fewer channels for high-spatial-resolution inputs and feature maps, while feature maps with many channels are given reduced spatial dimensions, lowering the computational load without sacrificing performance. Result: Forward-pass complexity falls from 1256 kMAC/Pixel to 270 kMAC/Pixel. Conclusion: The method makes it feasible for learned image compression models to run efficiently on a range of devices and encourages new architectures in image compression. Abstract: Current learned image compression models typically exhibit high complexity, which demands significant computational resources. To overcome these challenges, we propose an innovative approach that employs hierarchical feature extraction transforms to significantly reduce complexity while preserving bit rate reduction efficiency. Our novel architecture achieves this by using fewer channels for high spatial resolution inputs/feature maps. On the other hand, feature maps with a large number of channels have reduced spatial dimensions, thereby cutting down on computational load without sacrificing performance. This strategy effectively reduces the forward pass complexity from 1256 kMAC/Pixel to just 270 kMAC/Pixel. As a result, the reduced complexity model can open the way for learned image compression models to operate efficiently across various devices and pave the way for the development of new architectures in image compression technology.
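
A short calculation shows why pairing high spatial resolution with few channels (and many channels with low resolution) reduces kMAC/pixel. The sketch uses the standard multiply-accumulate count for a convolution. The layer shapes are hypothetical and are not taken from the LoC-LIC architecture.

```python
def conv_kmac_per_pixel(in_channels, out_channels, kernel_size, downsample):
    """kMAC per input-image pixel for one convolution layer.

    downsample is the ratio between the image side length and the side length
    of this layer's output grid (e.g. 4 means the output is 1/4 resolution).
    """
    macs_per_output_position = kernel_size * kernel_size * in_channels * out_channels
    output_positions_per_pixel = 1.0 / (downsample ** 2)
    return macs_per_output_position * output_positions_per_pixel / 1000.0


# Hypothetical 5x5 layers operating at 1/4 of the image resolution.
wide = conv_kmac_per_pixel(192, 192, kernel_size=5, downsample=4)    # many channels
narrow = conv_kmac_per_pixel(48, 48, kernel_size=5, downsample=4)    # few channels
deep = conv_kmac_per_pixel(192, 192, kernel_size=5, downsample=16)   # many channels, low resolution
print(wide, narrow, deep)

# Reduction reported in the abstract.
print(f"{1256 / 270:.2f}x fewer MACs, {100 * (1 - 270 / 1256):.1f}% reduction")
```

Because cost scales with the product of input and output channels, cutting channels by 4x at a fixed resolution cuts MACs by 16x, and keeping 192 channels only at 1/16 resolution has the same effect. The reported change from 1256 to 270 kMAC/Pixel corresponds to roughly a 4.65x reduction.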

econ.GN [Back]

[133] Who Gets the Callback? Generative AI and Gender Bias

Sugat Chaturvedi,Rochana Chaturvedi

Main category: econ.GN

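TL;DR: The study finds that most open-source large language models exhibit gender bias in hiring, tending to recommend male candidates, especially for higher-wage roles.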

Details

Motivation: To examine gender bias in generative AI (particularly large language models) used for hiring and to reveal its potential impact on fairness and diversity in the labor market. Method: Several mid-sized open-source LLMs are tested for gender bias on a dataset of 332,044 real-world online job postings, and model behavior is analyzed under simulated recruiter identities (Big Five personality traits and the perspectives of historical figures). Result: Models favor men more for higher-wage roles; callback rates for women are lower in male-dominated occupations and higher in female-associated ones, indicating occupational segregation. Conclusion: AI-driven hiring may perpetuate gender bias in the labor market, so fairness and diversity need attention. Abstract: Generative artificial intelligence (AI), particularly large language models (LLMs), is being rapidly deployed in recruitment and for candidate shortlisting. We audit several mid-sized open-source LLMs for gender bias using a dataset of 332,044 real-world online job postings. For each posting, we prompt the model to recommend whether an equally qualified male or female candidate should receive an interview callback. We find that most models tend to favor men, especially for higher-wage roles. Mapping job descriptions to the Standard Occupational Classification system, we find lower callback rates for women in male-dominated occupations and higher rates in female-associated ones, indicating occupational segregation. A comprehensive analysis of linguistic features in job ads reveals strong alignment of model recommendations with traditional gender stereotypes. To examine the role of recruiter identity, we steer model behavior by infusing Big Five personality traits and simulating the perspectives of historical figures. We find that less agreeable personas reduce stereotyping, consistent with an agreeableness bias in LLMs. Our findings highlight how AI-driven hiring may perpetuate biases in the labor market and have implications for fairness and diversity within firms.
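
The audit quantity discussed above, the share of postings for which the model recommends the female candidate, can be computed per occupation group with a few lines. This is a minimal sketch with made-up records; the group labels and data layout are hypothetical and do not reflect the paper's dataset format.

```python
from collections import defaultdict


def female_callback_rate_by_group(records):
    """records: iterable of (group, recommended), recommended in {"female", "male"}."""
    counts = defaultdict(lambda: [0, 0])  # group -> [female callbacks, total postings]
    for group, recommended in records:
        counts[group][0] += recommended == "female"
        counts[group][1] += 1
    return {group: female / total for group, (female, total) in counts.items()}


# Hypothetical model recommendations for postings in two occupation groups.
toy_records = [
    ("male_dominated", "male"),
    ("male_dominated", "male"),
    ("male_dominated", "female"),
    ("male_dominated", "male"),
    ("female_associated", "female"),
    ("female_associated", "female"),
    ("female_associated", "male"),
    ("female_associated", "female"),
]
rates = female_callback_rate_by_group(toy_records)
print(rates)
```

With equally qualified candidates, a rate of 0.5 in every group would indicate parity; systematic deviations by occupation group are the pattern the abstract describes as occupational segregation.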

cs.RO [Back]

[134] LLM-Empowered Embodied Agent for Memory-Augmented Task Planning in Household Robotics

Marc Glocker,Peter Hönig,Matthias Hirschmanner,Markus Vincze

Main category: cs.RO

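TL;DR: The paper presents an LLM-driven robotic system for autonomous household object management that uses multi-agent orchestration and memory-augmented task planning.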

Details

Motivation: To tackle the complexity of autonomous object management by robots in household environments, combining LLMs with memory augmentation to improve task planning and long-term tracking. Method: The system uses three specialized agents (routing, task planning, and knowledge base) together with RAG and Grounded SAM, enabling in-context learning without explicit training and semantic scene understanding. Result: Across three household scenarios the system shows high task planning accuracy and improved memory recall; Qwen2.5 performs best for the specialized agents and LLaMA3.1 for routing. Conclusion: By combining LLMs with memory augmentation, the system improves the autonomy and accuracy of household object management and shows practical potential. Abstract: We present an embodied robotic system with an LLM-driven agent-orchestration architecture for autonomous household object management. The system integrates memory-augmented task planning, enabling robots to execute high-level user commands while tracking past actions. It employs three specialized agents: a routing agent, a task planning agent, and a knowledge base agent, each powered by task-specific LLMs. By leveraging in-context learning, our system avoids the need for explicit model training. RAG enables the system to retrieve context from past interactions, enhancing long-term object tracking. A combination of Grounded SAM and LLaMa3.2-Vision provides robust object detection, facilitating semantic scene understanding for task planning. Evaluation across three household scenarios demonstrates high task planning accuracy and an improvement in memory recall due to RAG. Specifically, Qwen2.5 yields best performance for specialized agents, while LLaMA3.1 excels in routing tasks. The source code is available at: https://github.com/marc1198/chat-hsr.

Pranav Saxena,Nishant Raghuvanshi,Neena Goveas

Main category: cs.RO

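TL;DR: UAV-VLN is an end-to-end framework that combines large language models (LLMs) with visual perception so that UAVs can navigate unseen environments from natural language instructions.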

Details

Motivation: To address the challenge in AI-guided autonomy of enabling UAVs to navigate efficiently in unseen environments from natural language instructions. Method: The framework integrates the common-sense reasoning of LLMs with object detection from a vision model, and uses cross-modal grounding to parse instructions and plan trajectories. Result: The system performs well across diverse indoor and outdoor scenarios, with significant improvements in instruction-following accuracy and trajectory efficiency. Conclusion: LLM-driven vision-language interfaces offer a safe, intuitive, and generalizable approach to autonomous UAV navigation. Abstract: A core challenge in AI-guided autonomy is enabling agents to navigate realistically and effectively in previously unseen environments based on natural language commands. We propose UAV-VLN, a novel end-to-end Vision-Language Navigation (VLN) framework for Unmanned Aerial Vehicles (UAVs) that seamlessly integrates Large Language Models (LLMs) with visual perception to facilitate human-interactive navigation. Our system interprets free-form natural language instructions, grounds them into visual observations, and plans feasible aerial trajectories in diverse environments. UAV-VLN leverages the common-sense reasoning capabilities of LLMs to parse high-level semantic goals, while a vision model detects and localizes semantically relevant objects in the environment. By fusing these modalities, the UAV can reason about spatial relationships, disambiguate references in human instructions, and plan context-aware behaviors with minimal task-specific supervision. To ensure robust and interpretable decision-making, the framework includes a cross-modal grounding mechanism that aligns linguistic intent with visual context. We evaluate UAV-VLN across diverse indoor and outdoor navigation scenarios, demonstrating its ability to generalize to novel instructions and environments with minimal task-specific training. Our results show significant improvements in instruction-following accuracy and trajectory efficiency, highlighting the potential of LLM-driven vision-language interfaces for safe, intuitive, and generalizable UAV autonomy.

[136] RoboGround: Robotic Manipulation with Grounded Vision-Language Priors

Haifeng Huang,Xinyi Chen,Yilun Chen,Hao Li,Xiaoshen Han,Zehan Wang,Tai Wang,Jiangmiao Pang,Zhou Zhao

Main category: cs.RO

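TL;DR: The paper presents RoboGround, a robotic manipulation system built on an intermediate representation (grounding masks) that combines spatial guidance with pretrained vision-language models to significantly improve policy generalization.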

Details

Motivation: To explore the potential of intermediate representations in robotic manipulation, in particular how grounding masks can improve policy generalization by providing spatial guidance and drawing on large-scale vision-language models. Method: The RoboGround system uses grounding masks as an intermediate representation to guide policy networks, and an automated pipeline generates large-scale simulated data to increase diversity. Result: Experiments show that grounding masks as an intermediate representation significantly improve the generalization of robot policies. Conclusion: Grounding masks are an effective intermediate representation that markedly improves generalization in robotic manipulation tasks. Abstract: Recent advancements in robotic manipulation have highlighted the potential of intermediate representations for improving policy generalization. In this work, we explore grounding masks as an effective intermediate representation, balancing two key advantages: (1) effective spatial guidance that specifies target objects and placement areas while also conveying information about object shape and size, and (2) broad generalization potential driven by large-scale vision-language models pretrained on diverse grounding datasets. We introduce RoboGround, a grounding-aware robotic manipulation system that leverages grounding masks as an intermediate representation to guide policy networks in object manipulation tasks. To further explore and enhance generalization, we propose an automated pipeline for generating large-scale, simulated data with a diverse set of objects and instructions. Extensive experiments show the value of our dataset and the effectiveness of grounding masks as intermediate guidance, significantly enhancing the generalization abilities of robot policies.

cs.AI [Back]

[137] Phi-4-reasoning Technical Report

Marah Abdin,Sahaj Agarwal,Ahmed Awadallah,Vidhisha Balachandran,Harkirat Behl,Lingjiao Chen,Gustavo de Rosa,Suriya Gunasekar,Mojan Javaheripi,Neel Joshi,Piero Kauffmann,Yash Lara,Caio César Teodoro Mendes,Arindam Mitra,Besmira Nushi,Dimitris Papailiopoulos,Olli Saarikivi,Shital Shah,Vaishnavi Shrivastava,Vibhav Vineet,Yue Wu,Safoora Yousefi,Guoqing Zheng

Main category: cs.AI

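TL;DR: Phi-4-reasoning is a 14B-parameter reasoning model optimized with supervised fine-tuning and reinforcement learning; it performs strongly on complex reasoning tasks and even outperforms much larger models.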

Details

Motivation: The work aims to improve language model performance on complex reasoning tasks through carefully designed data and training methods. Method: The model is trained with supervised fine-tuning (SFT) and outcome-based reinforcement learning (RL), and generates detailed reasoning chains. Result: Phi-4-reasoning and its RL-enhanced variant perform strongly across many reasoning tasks, approaching or exceeding much larger models. Conclusion: The study shows that data selection and training methodology are critical to reasoning model performance, and it points to directions for improving evaluation. Abstract: We introduce Phi-4-reasoning, a 14-billion parameter reasoning model that achieves strong performance on complex reasoning tasks. Trained via supervised fine-tuning of Phi-4 on a carefully curated set of "teachable" prompts (selected for the right level of complexity and diversity) and reasoning demonstrations generated using o3-mini, Phi-4-reasoning generates detailed reasoning chains that effectively leverage inference-time compute. We further develop Phi-4-reasoning-plus, a variant enhanced through a short phase of outcome-based reinforcement learning that offers higher performance by generating longer reasoning traces. Across a wide range of reasoning tasks, both models outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B and approach the performance levels of the full DeepSeek-R1 model. Our comprehensive evaluations span benchmarks in math and scientific reasoning, coding, algorithmic problem solving, planning, and spatial understanding. Interestingly, we observe a non-trivial transfer of improvements to general-purpose benchmarks as well. In this report, we provide insights into our training data, our training methodologies, and our evaluations. We show that the benefit of careful data curation for supervised fine-tuning (SFT) extends to reasoning language models, and can be further amplified by reinforcement learning (RL). Finally, our evaluation points to opportunities for improving how we assess the performance and robustness of reasoning models.

[138] AdaR1: From Long-CoT to Hybrid-CoT via Bi-Level Adaptive Reasoning Optimization

Haotian Luo,Haiying He,Yibo Wang,Jinluan Yang,Rui Liu,Naiqiang Tan,Xiaochun Cao,Dacheng Tao,Li Shen

Main category: cs.AI

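TL;DR: The paper proposes an adaptive reasoning framework that merges long and short reasoning models and applies bi-level preference training, significantly reducing inference cost while maintaining performance.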

Details

Motivation: Current long-reasoning models perform well on complex tasks but incur heavy inference overhead, and the need for long reasoning varies markedly across problems, which calls for adaptive strategies. Method: A hybrid reasoning model is built by merging long and short CoT models, and bi-level preference training (group-level and instance-level) teaches it to choose a suitable reasoning style. Result: On five mathematical datasets, average reasoning length drops by more than 50%, significantly lowering inference cost while maintaining performance. Conclusion: Adaptive strategies can effectively optimize the reasoning efficiency of large language models and merit further exploration. Abstract: Recently, long-thought reasoning models have achieved strong performance on complex reasoning tasks, but often incur substantial inference overhead, making efficiency a critical concern. Our empirical analysis reveals that the benefit of using Long-CoT varies across problems: while some problems require elaborate reasoning, others show no improvement, or even degraded accuracy. This motivates adaptive reasoning strategies that tailor reasoning depth to the input. However, prior work primarily reduces redundancy within long reasoning paths, limiting exploration of more efficient strategies beyond the Long-CoT paradigm. To address this, we propose a novel two-stage framework for adaptive and efficient reasoning. First, we construct a hybrid reasoning model by merging long and short CoT models to enable diverse reasoning styles. Second, we apply bi-level preference training to guide the model to select suitable reasoning styles (group-level), and prefer concise and correct reasoning within each style group (instance-level). Experiments demonstrate that our method significantly reduces inference costs compared to other baseline approaches, while maintaining performance. Notably, on five mathematical datasets, the average length of reasoning is reduced by more than 50%, highlighting the potential of adaptive strategies to optimize reasoning efficiency in large language models. Our code is coming soon at https://github.com/StarDewXXX/AdaR1

Table of Contents

cs.CV [Back]

[1] Can a Large Language Model Assess Urban Design Quality? Evaluating Walkability Metrics Across Expertise Levels

[2] Legilimens: Performant Video Analytics on the System-on-Chip Edge

[3] Emotion Recognition in Contemporary Dance Performances Using Laban Movement Analysis

[4] Dance Style Recognition Using Laban Movement Analysis

[5] Geolocating Earth Imagery from ISS: Integrating Machine Learning with Astronaut Photography for Enhanced Geographic Mapping

[6] MemeBLIP2: A novel lightweight multimodal system to detect harmful memes

[7] T2ID-CAS: Diffusion Model and Class Aware Sampling to Mitigate Class Imbalance in Neck Ultrasound Anatomical Landmark Detection

[8] Subject Information Extraction for Novelty Detection with Domain Shifts

[9] Multi-modal Transfer Learning for Dynamic Facial Emotion Recognition in the Wild

[10] Embracing Collaboration Over Competition: Condensing Multiple Prompts for Visual In-Context Learning

[11] CoCoDiff: Diversifying Skeleton Action Features via Coarse-Fine Text-Co-Guided Latent Diffusion

[12] Mamba Based Feature Extraction And Adaptive Multilevel Feature Fusion For 3D Tumor Segmentation From Multi-modal Medical Image

[13] Can We Achieve Efficient Diffusion without Self-Attention? Distilling Self-Attention into Convolutions

[14] Learning Multi-view Multi-class Anomaly Detection

[15] CMD: Constraining Multimodal Distribution for Domain Adaptation in Stereo Matching

[16] The Dual Power of Interpretable Token Embeddings: Jailbreaking Attacks and Defenses for Diffusion Model Unlearning

[17] AGHI-QA: A Subjective-Aligned Dataset and Metric for AI-Generated Human Images

[18] An Evaluation of a Visual Question Answering Strategy for Zero-shot Facial Expression Recognition in Still Images

[19] Text-Conditioned Diffusion Model for High-Fidelity Korean Font Generation

[20] Simple Visual Artifact Detection in Sora-Generated Videos

[21] UniBiomed: A Universal Foundation Model for Grounded Biomedical Image Interpretation

[22] Towards Improved Cervical Cancer Screening: Vision Transformer-Based Classification and Interpretability

[23] Vision-Language Model-Based Semantic-Guided Imaging Biomarker for Early Lung Cancer Detection

[24] Nexus-Gen: A Unified Model for Image Understanding, Generation, and Editing

[25] Revisiting Diffusion Autoencoder Training for Image Reconstruction Quality

[26] IDDM: Bridging Synthetic-to-Real Domain Gap from Physics-Guided Diffusion for Real-world Image Dehazing

[27] Comparison of Different Deep Neural Network Models in the Cultural Heritage Domain

[28] Static or Dynamic: Towards Query-Adaptive Token Selection for Video Question Answering

[29] Adapting In-Domain Few-Shot Segmentation to New Domains without Retraining

[30] Diff-Prompt: Diffusion-Driven Prompt Generator with Mask Supervision

[31] SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding

[32] Rethinking Visual Layer Selection in Multimodal LLMs

[33] VR-FuseNet: A Fusion of Heterogeneous Fundus Data and Explainable Deep Network for Diabetic Retinopathy Classification

[34] Multiview Point Cloud Registration via Optimization in an Autoencoder Latent Space

[35] Quaternion Nuclear Norms Over Frobenius Norms Minimization for Robust Matrix Completion

[36] Robust Orthogonal NMF with Label Propagation for Image Clustering

[37] GarmentDiffusion: 3D Garment Sewing Pattern Generation with Multimodal Diffusion Transformers

[38] CAE-DFKD: Bridging the Transferability Gap in Data-Free Knowledge Distillation

[39] DGSolver: Diffusion Generalist Solver with Universal Posterior Sampling for Image Restoration

[40] ClassWise-CRF: Category-Specific Fusion for Enhanced Semantic Segmentation of Remote Sensing Imagery

[41] Consistency-aware Fake Videos Detection on Short Video Platforms

[42] MagicPortrait: Temporally Consistent Face Reenactment with 3D Geometric Guidance

[43] SAM4EM: Efficient memory-based two stage prompt-free segment anything model adapter for complex 3D neuroscience electron microscopy stacks

[44] Black-Box Visual Prompt Engineering for Mitigating Object Hallucination in Large Vision Language Models

[45] Iterative Trajectory Exploration for Multimodal Agents

[46] eNCApsulate: NCA for Precision Diagnosis on Capsule Endoscopes

[47] Cascade Detector Analysis and Application to Biomedical Microscopy

[48] Mcity Data Engine: Iterative Model Improvement Through Open-Vocabulary Data Selection

[49] Diffusion-based Adversarial Identity Manipulation for Facial Privacy Protection

[50] HoloTime: Taming Video Diffusion Models for Panoramic 4D Scene Generation

[51] Visual Text Processing: A Comprehensive Review and Unified Evaluation

[52] Enhancing Self-Supervised Fine-Grained Video Object Tracking with Dynamic Memory Prediction

[53] REHEARSE-3D: A Multi-modal Emulated Rain Dataset for 3D Point Cloud De-raining

[54] Vision Transformers in Precision Agriculture: A Comprehensive Survey

[55] VividListener: Expressive and Controllable Listener Dynamics Modeling for Multi-Modal Responsive Interaction

[56] Common3D: Self-Supervised Learning of 3D Morphable Models for Common Objects in Neural Feature Space

[57] Anatomical Similarity as a New Metric to Evaluate Brain Generative Models

[58] Anomaly-Driven Approach for Enhanced Prostate Cancer Segmentation

[59] A simple and effective approach for body part recognition on CT scans based on projection estimation

[60] Why Compress What You Can Generate? When GPT-4o Generation Ushers in Image Compression Fields

[61] Early Exit and Multi Stage Knowledge Distillation in VLMs for Video Summarization

[62] 3D Stylization via Large Reconstruction Model

[63] Active Light Modulation to Counter Manipulation of Speech Visual Content

[64] Differentiable Room Acoustic Rendering with Multi-View Vision Priors

[65] COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning

[66] A Survey of Interactive Generative Video

[67] ReVision: High-Quality, Low-Cost Video Generation with Explicit 3D Physics Modeling for Complex Motion and Interaction

cs.GR [Back]

[68] Transcending Dimensions using Generative AI: Real-Time 3D Model Generation in Augmented Reality

[69] GauSS-MI: Gaussian Splatting Shannon Mutual Information for Active 3D Reconstruction

[70] PhysicsFC: Learning User-Controlled Skills for a Physics-Based Football Player Controller

[71] LSNIF: Locally-Subdivided Neural Intersection Function

cs.CL [Back]

[72] Waking Up an AI: A Quantitative Framework for Prompt-Induced Phase Transition in Large Language Models

[73] Analyzing Feedback Mechanisms in AI-Generated MCQs: Insights into Readability, Lexical Properties, and Levels of Challenge

[74] Nested Named-Entity Recognition on Vietnamese COVID-19: Dataset and Experiments

[75] ViQA-COVID: COVID-19 Machine Reading Comprehension Dataset for Vietnamese

[76] HYPEROFA: Expanding LLM Vocabulary to New Languages via Hypernetwork-Based Embedding Initialization