cs.CV

[1] DMM: Building a Versatile Image Generation Model via Distillation-Based Model Merging

Tianhui Song,Weixin Feng,Shuai Wang,Xubin Li,Tiezheng Ge,Bo Zheng,Limin Wang

Main category: cs.CV

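TL;DR: The paper proposes DMM, a style-promptable image generation method that uses style vectors to generate images in arbitrary styles, addressing the redundancy and storage problems that arise when merging multiple models.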

Details

Motivation: The proliferation of text-to-image generation models causes parameter redundancy and high storage costs, so a method is needed to unify the capabilities of multiple models.

Method: Proposes a style-promptable generation pipeline and a score-distillation-based model merging paradigm (DMM), and redefines the merging goals and evaluation protocols.

Result: Experiments show that DMM effectively consolidates the knowledge of multiple teacher models and achieves controllable arbitrary-style generation.

Conclusion: DMM offers an efficient and controllable solution for multi-model merging.

Abstract: The success of text-to-image (T2I) generation models has spurred a proliferation of numerous model checkpoints fine-tuned from the same base model on various specialized datasets. This overwhelming specialized model production introduces new challenges for high parameter redundancy and huge storage cost, thereby necessitating the development of effective methods to consolidate and unify the capabilities of diverse powerful models into a single one. A common practice in model merging adopts static linear interpolation in the parameter space to achieve the goal of style mixing. However, it neglects the features of T2I generation task that numerous distinct models cover sundry styles which may lead to incompatibility and confusion in the merged model. To address this issue, we introduce a style-promptable image generation pipeline which can accurately generate arbitrary-style images under the control of style vectors. Based on this design, we propose the score distillation based model merging paradigm (DMM), compressing multiple models into a single versatile T2I model. Moreover, we rethink and reformulate the model merging task in the context of T2I generation, by presenting new merging goals and evaluation protocols. Our experiments demonstrate that DMM can compactly reorganize the knowledge from multiple teacher models and achieve controllable arbitrary-style generation.
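For context, the static linear interpolation baseline that the abstract contrasts DMM with can be sketched as a weighted average of checkpoints. This is a minimal illustration and not DMM itself: the function name `merge_linear`, the toy parameter name `unet.w`, and the use of flat Python lists in place of tensors are all assumptions made for the example.

```python
def merge_linear(state_dicts, weights):
    """Weighted average of checkpoints that share the same parameter names.

    Each checkpoint is a dict mapping a parameter name to a flat list of
    floats (a stand-in for a tensor in this toy example).
    """
    if len(state_dicts) != len(weights) or not state_dicts:
        raise ValueError("need one weight per checkpoint")
    total = sum(weights)
    merged = {}
    for name, first in state_dicts[0].items():
        merged[name] = [
            sum(w * sd[name][i] for sd, w in zip(state_dicts, weights)) / total
            for i in range(len(first))
        ]
    return merged


# Two hypothetical fine-tuned checkpoints of the same base model.
realistic = {"unet.w": [1.0, 2.0]}
anime = {"unet.w": [3.0, 6.0]}

merged = merge_linear([realistic, anime], [0.5, 0.5])
print(merged)  # {'unet.w': [2.0, 4.0]}
```

Because the mixing weights are fixed once at merge time, every prompt sees the same blend of styles, which is the behaviour the style-vector design is meant to avoid.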

[2] Geographical Context Matters: Bridging Fine and Coarse Spatial Information to Enhance Continental Land Cover Mapping

Babak Ghassemi,Cassio Fraga-Dantas,Raffaele Gaetano,Dino Ienco,Omid Ghorbanzadeh,Emma Izquierdo-Verdiguier,Francesco Vuolo

Main category: cs.CV

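TL;DR: BRIDGE-LC is a new deep learning framework that improves the accuracy and scalability of land cover classification by integrating multi-scale geospatial information.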

Details

Motivation: Existing machine learning methods overlook geospatial metadata when analyzing Earth Observation data, which limits their performance in large-scale applications.

Method: Proposes the BRIDGE-LC framework, which combines fine-grained (latitude/longitude) and coarse-grained (biogeographical region) spatial information and performs efficient classification with a multi-layer perceptron architecture.

Result: Experiments show that integrating geospatial information significantly improves classification performance, with the best results when fine- and coarse-grained information are used together.

Conclusion: BRIDGE-LC provides an efficient and accurate solution for large-scale land cover mapping.

Abstract: Land use and land cover mapping from Earth Observation (EO) data is a critical tool for sustainable land and resource management. While advanced machine learning and deep learning algorithms excel at analyzing EO imagery data, they often overlook crucial geospatial metadata information that could enhance scalability and accuracy across regional, continental, and global scales. To address this limitation, we propose BRIDGE-LC (Bi-level Representation Integration for Disentangled GEospatial Land Cover), a novel deep learning framework that integrates multi-scale geospatial information into the land cover classification process. By simultaneously leveraging fine-grained (latitude/longitude) and coarse-grained (biogeographical region) spatial information, our lightweight multi-layer perceptron architecture learns from both during training but only requires fine-grained information for inference, allowing it to disentangle region-specific from region-agnostic land cover features while maintaining computational efficiency. To assess the quality of our framework, we use an open-access in-situ dataset and adopt several competing classification approaches commonly considered for large-scale land cover mapping. We evaluated all approaches through two scenarios: an extrapolation scenario in which training data encompasses samples from all biogeographical regions, and a leave-one-region-out scenario where one region is excluded from training. We also explore the spatial representation learned by our model, highlighting a connection between its internal manifold and the geographical information used during training. Our results demonstrate that integrating geospatial information improves land cover mapping performance, with the most substantial gains achieved by jointly leveraging both fine- and coarse-grained spatial information.

[3] WORLDMEM: Long-term Consistent World Simulation with Memory

Zeqi Xiao,Yushi Lan,Yifan Zhou,Wenqi Ouyang,Shuai Yang,Yanhong Zeng,Xingang Pan

Main category: cs.CV

TL;DR: The WorldMem framework enhances scene generation with a memory bank and an attention mechanism to address long-term 3D spatial consistency.

Details

Motivation: Addresses the lack of long-term 3D spatial consistency in world simulation caused by a limited temporal context window.

Method: Uses a memory bank that stores memory frames and states (e.g., poses and timestamps) and extracts relevant information through an attention mechanism.

Result: The method is validated in both virtual and real scenarios; it accurately reconstructs scenes and captures dynamic changes.

Conclusion: WorldMem improves the long-term consistency of scene generation and supports dynamic world modeling.

Abstract: World simulation has gained increasing popularity due to its ability to model virtual environments and predict the consequences of actions. However, the limited temporal context window often leads to failures in maintaining long-term consistency, particularly in preserving 3D spatial consistency. In this work, we present WorldMem, a framework that enhances scene generation with a memory bank consisting of memory units that store memory frames and states (e.g., poses and timestamps). By employing a memory attention mechanism that effectively extracts relevant information from these memory frames based on their states, our method is capable of accurately reconstructing previously observed scenes, even under significant viewpoint or temporal gaps. Furthermore, by incorporating timestamps into the states, our framework not only models a static world but also captures its dynamic evolution over time, enabling both perception and interaction within the simulated world. Extensive experiments in both virtual and real scenarios validate the effectiveness of our approach.

[4] InstantCharacter: Personalize Any Characters with a Scalable Diffusion Transformer Framework

Jiale Tao,Yanbing Zhang,Qixun Wang,Yiji Cheng,Haofan Wang,Xu Bai,Zhengguang Zhou,Ruihuang Li,Linqing Wang,Chunyu Wang,Qin Lin,Qinglin Lu

Main category: cs.CV

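TL;DR: InstantCharacter is a scalable diffusion-transformer-based framework for character customization that addresses the poor generalization and low image quality of existing methods.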

Details

Motivation: Existing learning-based methods generalize poorly, while optimization-based methods require subject-specific fine-tuning, which degrades textual controllability.

Method: Proposes InstantCharacter, which uses a diffusion transformer and a scalable adapter to process open-domain character features, and builds a large-scale character dataset for training.

Result: The framework generates high-fidelity, text-controllable, character-consistent images and performs well across diverse character appearances, poses, and styles.

Conclusion: InstantCharacter sets a new benchmark for character-driven image generation; the code is open source.

Abstract: Current learning-based subject customization approaches, predominantly relying on U-Net architectures, suffer from limited generalization ability and compromised image quality. Meanwhile, optimization-based methods require subject-specific fine-tuning, which inevitably degrades textual controllability. To address these challenges, we propose InstantCharacter, a scalable framework for character customization built upon a foundation diffusion transformer. InstantCharacter demonstrates three fundamental advantages: first, it achieves open-domain personalization across diverse character appearances, poses, and styles while maintaining high-fidelity results. Second, the framework introduces a scalable adapter with stacked transformer encoders, which effectively processes open-domain character features and seamlessly interacts with the latent space of modern diffusion transformers. Third, to effectively train the framework, we construct a large-scale character dataset containing 10-million-level samples. The dataset is systematically organized into paired (multi-view character) and unpaired (text-image combinations) subsets. This dual-data structure enables simultaneous optimization of identity consistency and textual editability through distinct learning pathways. Qualitative experiments demonstrate the advanced capabilities of InstantCharacter in generating high-fidelity, text-controllable, and character-consistent images, setting a new benchmark for character-driven image generation. Our source code is available at https://github.com/Tencent/InstantCharacter.

[5] NTIRE 2025 Challenge on Event-Based Image Deblurring: Methods and Results

Lei Sun,Andrea Alfarano,Peiqi Duan,Shaolin Su,Kaiwei Wang,Boxin Shi,Radu Timofte,Danda Pani Paudel,Luc Van Gool,Qinglin Liu,Wei Yu,Xiaoqian Lv,Lu Yang,Shuigen Wang,Shengping Zhang,Xiangyang Ji,Long Bao,Yuqiang Yang,Jinao Song,Ziyi Wang,Shuang Wen,Heng Sun,Kean Liu,Mingchen Zhong,Senyan Xu,Zhijing Sun,Jiaying Zhu,Chengjie Ge,Xingbo Wang,Yidi Liu,Xin Lu,Xueyang Fu,Zheng-Jun Zha,Dawei Fan,Dafeng Zhang,Yong Yang,Siru Zhang,Qinghua Yang,Hao Kang,Huiyuan Fu,Heng Zhang,Hongyuan Yu,Zhijuan Huang,Shuoyan Wei,Feng Li,Runmin Cong,Weiqi Luo,Mingyun Lin,Chenxu Jiang,Hongyi Liu,Lei Yu,Weilun Li,Jiajun Zhai,Tingting Lin,Shuang Ma,Sai Zhou,Zhanwen Liu,Yang Wang,Eiffel Chong,Nuwan Bandara,Thivya Kandappu,Archan Misra,Yihang Chen,Zhan Li,Weijun Yuan,Wenzhuo Wang,Boyang Yao,Zhanglu Chen,Yijing Sun,Tianjiao Wan,Zijian Gao,Qisheng Xu,Kele Xu,Yukun Zhang,Yu He,Xiaoyan Xie,Tao Fu,Yashu Gautamkumar Patel,Vihar Ramesh Jain,Divesh Basina,Rishik Ashili,Manish Kumar Manjhi,Sourav Kumar,Prinon Benny,Himanshu Ghunawat,B Sri Sairam Gautam,Anett Varghese,Abhishek Yadav

Main category: cs.CV

TL;DR: The NTIRE 2025 challenge focuses on event-based image deblurring; 15 teams submitted valid results, advancing event-based vision research.

Details

Motivation: Design high-performance event-based image deblurring methods that overcome the limits of conventional approaches.

Method: Events and images are both used as inputs, with no restrictions on computational complexity or model size.

Result: 15 teams submitted results, with performance quantified by PSNR.

Conclusion: The challenge offers new insights for event-based vision research and drives technical progress.

Abstract: This paper presents an overview of NTIRE 2025 the First Challenge on Event-Based Image Deblurring, detailing the proposed methodologies and corresponding results. The primary goal of the challenge is to design an event-based method that achieves high-quality image deblurring, with performance quantitatively assessed using Peak Signal-to-Noise Ratio (PSNR). Notably, there are no restrictions on computational complexity or model size. The task focuses on leveraging both events and images as inputs for single-image deblurring. A total of 199 participants registered, among whom 15 teams successfully submitted valid results, offering valuable insights into the current state of event-based image deblurring. We anticipate that this challenge will drive further advancements in event-based vision research.
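Since the challenge ranks entries by PSNR, a short sketch of the standard definition may help: PSNR = 10 * log10(MAX^2 / MSE), where MSE is the mean squared error between the reference and the restored image. The pixel values below are made up, and images are represented as flat lists of 8-bit intensities purely for illustration; the challenge's exact evaluation script is not described in the abstract.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized images."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 2x2 "images" flattened to lists (illustrative values only).
sharp = [0, 128, 255, 64]
deblurred = [0, 130, 250, 64]
print(f"PSNR: {psnr(sharp, deblurred):.2f} dB")
```

Higher values mean the restored image is closer to the reference; identical images give infinite PSNR.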

[6] Sparsity Outperforms Low-Rank Projections in Few-Shot Adaptation

Nairouz Mrabah,Nicolas Richet,Ismail Ben Ayed,Éric Granger

Main category: cs.CV

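TL;DR: Proposes a Sparse Optimization (SO) framework that dynamically adjusts a small number of parameters to address overfitting and computational constraints when adapting vision-language models (VLMs) to new domains with few samples.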

Details

Motivation: Address the shortcomings of existing low-rank reparameterization methods in generalization and hyperparameter tuning.

Method: Adopts two paradigms, local sparsity with global density and local randomness with global importance, adjusting parameters dynamically to reduce overfitting.

Result: Experiments on 11 datasets confirm SO's superior few-shot adaptation performance together with lower memory overhead.

Conclusion: SO performs strongly in few-shot domain adaptation and is both efficient and stable.

Abstract: Adapting Vision-Language Models (VLMs) to new domains with few labeled samples remains a significant challenge due to severe overfitting and computational constraints. State-of-the-art solutions, such as low-rank reparameterization, mitigate these issues but often struggle with generalization and require extensive hyperparameter tuning. In this paper, a novel Sparse Optimization (SO) framework is proposed. Unlike low-rank approaches that typically constrain updates to a fixed subspace, our SO method leverages high sparsity to dynamically adjust very few parameters. We introduce two key paradigms. First, we advocate for "local sparsity and global density", which updates a minimal subset of parameters per iteration while maintaining overall model expressiveness. As a second paradigm, we advocate for "local randomness and global importance", which sparsifies the gradient using random selection while pruning the first moment based on importance. This combination significantly mitigates overfitting and ensures stable adaptation in low-data regimes. Extensive experiments on 11 diverse datasets show that SO achieves state-of-the-art few-shot adaptation performance while reducing memory overhead.
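The two paradigms can be loosely illustrated with a toy update step: a random mask sparsifies the gradient, and a momentum buffer (first moment) is pruned to its largest entries. This is only a sketch of the idea, not the paper's optimizer. The abstract does not say how importance is scored, so magnitude `|m|` is an assumption here, as are the function name `sparse_step`, the keep ratios, and the plain momentum update.

```python
import random


def sparse_step(params, grads, moment, rng,
                keep_grad=0.25, keep_moment=0.25, lr=0.1, beta=0.9):
    """One toy update with a randomly masked gradient and a pruned first moment."""
    n = len(params)

    # Local randomness: keep a random subset of gradient entries this iteration.
    k_grad = max(1, int(keep_grad * n))
    kept = set(rng.sample(range(n), k_grad))
    sparse_grads = [g if i in kept else 0.0 for i, g in enumerate(grads)]

    # Exponential moving average of the sparsified gradient (first moment).
    moment = [beta * m + (1.0 - beta) * g for m, g in zip(moment, sparse_grads)]

    # Global importance (assumed here to be magnitude): keep the top entries.
    k_mom = max(1, int(keep_moment * n))
    top = set(sorted(range(n), key=lambda i: abs(moment[i]), reverse=True)[:k_mom])
    moment = [m if i in top else 0.0 for i, m in enumerate(moment)]

    # Only parameters with a surviving moment entry actually move.
    params = [p - lr * m for p, m in zip(params, moment)]
    return params, moment


rng = random.Random(0)
params, moment = [0.0] * 8, [0.0] * 8
params, moment = sparse_step(params, [1.0] * 8, moment, rng)
print(sum(1 for p in params if p != 0.0), "of", len(params), "parameters updated")
```

Each step touches only a few parameters, but because the random mask changes every iteration, different parameters can be reached over the course of training.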

[7] 3D-PointZshotS: Geometry-Aware 3D Point Cloud Zero-Shot Semantic Segmentation Narrowing the Visual-Semantic Gap

Minmin Yang,Huantao Ren,Senem Velipasalar

Main category: cs.CV

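TL;DR: 3D-PointZshotS is a geometry-aware zero-shot 3D point cloud segmentation framework that uses latent geometric prototypes (LGPs) to enhance feature generation and alignment, improving transfer from seen to unseen classes.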

Details

Motivation: Existing zero-shot 3D point cloud segmentation methods transfer poorly from seen to unseen classes and from semantic space to visual space.

Method: Proposes 3D-PointZshotS, which integrates LGPs into the generator through cross-attention, introduces a self-consistency loss to make features more robust, and re-represents visual and semantic features in a shared space.

Result: On ScanNet, SemanticKITTI, and S3DIS, the method outperforms four baselines on harmonic mIoU.

Conclusion: Through geometry awareness and feature alignment, 3D-PointZshotS significantly improves zero-shot 3D point cloud segmentation.

Abstract: Existing zero-shot 3D point cloud segmentation methods often struggle with limited transferability from seen classes to unseen classes and from semantic to visual space. To alleviate this, we introduce 3D-PointZshotS, a geometry-aware zero-shot segmentation framework that enhances both feature generation and alignment using latent geometric prototypes (LGPs). Specifically, we integrate LGPs into a generator via a cross-attention mechanism, enriching semantic features with fine-grained geometric details. To further enhance stability and generalization, we introduce a self-consistency loss, which enforces feature robustness against point-wise perturbations. Additionally, we re-represent visual and semantic features in a shared space, bridging the semantic-visual gap and facilitating knowledge transfer to unseen classes. Experiments on three real-world datasets, namely ScanNet, SemanticKITTI, and S3DIS, demonstrate that our method achieves superior performance over four baselines in terms of harmonic mIoU. The code is available at https://github.com/LexieYang/3D-PointZshotS.
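The harmonic mIoU used for evaluation is, as commonly defined in generalized zero-shot segmentation, the harmonic mean of the mIoU on seen classes and the mIoU on unseen classes. The sketch below assumes that definition; the numbers are illustrative and are not results from the paper.

```python
def harmonic_miou(miou_seen, miou_unseen):
    """Harmonic mean of seen-class and unseen-class mIoU (both in [0, 1])."""
    if miou_seen + miou_unseen == 0:
        return 0.0
    return 2.0 * miou_seen * miou_unseen / (miou_seen + miou_unseen)


# A model that does well on seen classes but poorly on unseen ones
# is penalised more than an arithmetic mean would suggest.
print(harmonic_miou(0.6, 0.2))
print(harmonic_miou(0.4, 0.4))
```

The harmonic mean rewards balanced performance, so a method cannot score well by excelling on seen classes alone.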

[8] DG-MVP: 3D Domain Generalization via Multiple Views of Point Clouds for Classification

Huantao Ren,Minmin Yang,Senem Velipasalar

Main category: cs.CV

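TL;DR: The paper proposes a new method for 3D point cloud domain generalization that improves cross-domain performance through multi-view 2D projections and a convolutional model.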

Details

Motivation: Existing 3D point cloud classification relies on annotated data, and different domains (e.g., CAD models versus LiDAR data) differ substantially, which limits generalization.

Method: Uses multi-view 2D projections to mitigate missing points and designs a convolutional model to extract features.

Result: Performs well on the PointDA-10 and Sim-to-Real benchmarks, outperforming the baselines.

Conclusion: The method effectively improves cross-domain generalization for point clouds, especially when transferring from synthetic to real data.

Abstract: Deep neural networks have achieved significant success in 3D point cloud classification while relying on large-scale, annotated point cloud datasets, which are labor-intensive to build. Compared to capturing data with LiDAR sensors and then performing annotation, it is relatively easier to sample point clouds from CAD models. Yet, data sampled from CAD models is regular, and does not suffer from occlusion and missing points, which are very common for LiDAR data, creating a large domain shift. Therefore, it is critical to develop methods that can generalize well across different point cloud domains. In this paper, we focus on the 3D point cloud domain generalization problem. Existing 3D domain generalization methods employ point-based backbones to extract point cloud features. Yet, by analyzing point utilization of point-based methods and observing the geometry of point clouds from different domains, we have found that a large number of point features are discarded by point-based methods through the max-pooling operation. This is a significant waste especially considering the fact that domain generalization is more challenging than supervised learning, and point clouds are already affected by missing points and occlusion to begin with. To address these issues, we propose a novel method for 3D point cloud domain generalization, which can generalize to unseen domains of point clouds. Our proposed method employs multiple 2D projections of a 3D point cloud to alleviate the issue of missing points and involves a simple yet effective convolution-based model to extract features. The experiments, performed on the PointDA-10 and Sim-to-Real benchmarks, demonstrate the effectiveness of our proposed method, which outperforms different baselines, and can transfer well from synthetic domain to real-world domain.

[9] AdaVid: Adaptive Video-Language Pretraining

Chaitanya Patel,Juan Carlos Niebles,Ehsan Adeli

Main category: cs.CV

TL;DR: AdaVid is a video encoding framework that dynamically adjusts its compute; adaptive transformer blocks let it perform well on both short- and long-video tasks.

Details

Motivation: Addresses the compute constraints existing video encoders face on edge devices, and extends them to long videos.

Method: Uses adaptive transformer blocks that adjust the hidden embedding dimension, combined with a lightweight hierarchical network for long videos.

Result: On short-video tasks, matches the standard model with half the compute; on long-video tasks, balances efficiency and accuracy.

Conclusion: AdaVid provides a flexible, efficient solution for video encoding in resource-constrained settings.

Abstract: Contrastive video-language pretraining has demonstrated great success in learning rich and robust video representations. However, deploying such video encoders on compute-constrained edge devices remains challenging due to their high computational demands. Additionally, existing models are typically trained to process only short video clips, often limited to 4 to 64 frames. In this paper, we introduce AdaVid, a flexible architectural framework designed to learn efficient video encoders that can dynamically adapt their computational footprint based on available resources. At the heart of AdaVid is an adaptive transformer block, inspired by Matryoshka Representation Learning, which allows the model to adjust its hidden embedding dimension at inference time. We show that AdaVid-EgoVLP, trained on video-narration pairs from the large-scale Ego4D dataset, matches the performance of the standard EgoVLP on short video-language benchmarks using only half the compute, and even outperforms EgoVLP when given equal computational resources. We further explore the trade-off between frame count and compute on the challenging Diving48 classification benchmark, showing that AdaVid enables the use of more frames without exceeding computational limits. To handle longer videos, we also propose a lightweight hierarchical network that aggregates short clip features, achieving a strong balance between compute efficiency and accuracy across several long video benchmarks.
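One way to picture an embedding dimension that is adjustable at inference time is a dense layer whose leading sub-block can be used on its own, in the nested spirit of Matryoshka Representation Learning. The sketch below is an assumption about how such sharing could look, not AdaVid's actual block: the function `linear_prefix`, the prefix-slicing scheme, and the toy weights are all hypothetical.

```python
def linear_prefix(x, weight, bias, dim):
    """Apply a dense layer using only the first `dim` input and output features.

    `weight[o][i]` is the weight from input feature i to output feature o.
    Smaller `dim` values reuse the top-left block of the same weight matrix.
    """
    if not 0 < dim <= len(bias):
        raise ValueError("dim must be between 1 and the full width")
    return [
        sum(weight[o][i] * x[i] for i in range(dim)) + bias[o]
        for o in range(dim)
    ]


# Toy 4-wide layer (values are illustrative only).
W = [
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
b = [0.0, 0.0, 0.0, 0.0]
x = [1.0, 1.0, 1.0, 1.0]

full = linear_prefix(x, W, b, dim=4)   # 16 multiplications
small = linear_prefix(x, W, b, dim=2)  # 4 multiplications
print(full, small)
```

In this toy layer, halving the width cuts the multiplication count to a quarter, which hints at why a narrower setting can free budget for more frames.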

[10] Event Quality Score (EQS): Assessing the Realism of Simulated Event Camera Streams via Distances in Latent Space

Kaustav Chanda,Aayush Atul Verma,Arpitsinh Vaghela,Yezhou Yang,Bharatesh Chakravarthi

Main category: cs.CV

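TL;DR: The paper proposes the Event Quality Score (EQS), which measures how closely simulated event data resembles real event data, with the aim of narrowing the sim-to-real gap.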

Details

Motivation: Adoption of event cameras in deep learning is limited by the scarcity of high-quality labeled datasets, and existing simulated data fails to mimic real event data accurately.

Method: Proposes EQS, built on activations of the RVT architecture, and validates it by comparing simulated and real data.

Result: Experiments show that a higher EQS implies better generalization to real data after training on simulated data.

Conclusion: EQS can be used to improve event camera simulators and reduce the sim-to-real gap.

Abstract: Event cameras promise a paradigm shift in vision sensing with their low latency, high dynamic range, and asynchronous nature of events. Unfortunately, the scarcity of high-quality labeled datasets hinders their widespread adoption in deep learning-driven computer vision. To mitigate this, several simulators have been proposed to generate synthetic event data for training models for detection and estimation tasks. However, the fundamentally different sensor design of event cameras compared to traditional frame-based cameras poses a challenge for accurate simulation. As a result, most simulated data fail to mimic data captured by real event cameras. Inspired by existing work on using deep features for image comparison, we introduce event quality score (EQS), a quality metric that utilizes activations of the RVT architecture. Through sim-to-real experiments on the DSEC driving dataset, it is shown that a higher EQS implies improved generalization to real-world data after training on simulated events. Thus, optimizing for EQS can lead to developing more realistic event camera simulators, effectively reducing the simulation gap. EQS is available at https://github.com/eventbasedvision/EQS.

Andy Dimnaku,Dominic Yurk,Zhiyuan Gao,Arun Padmanabhan,Mandar Aras,Yaser Abu-Mostafa

Main category: cs.CV

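TL;DR: This paper presents a new AI-based navigation system that helps locate the inferior vena cava (IVC) of the heart during ultrasound examinations and works across ultrasound devices of varying quality.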

Details

Motivation: Conventional cardiac ultrasound depends on expert operators and high-end equipment, which limits its use outside hospitals. AI navigation systems can help novice operators acquire standardized views.

Method: The system uses a decision model trained offline, combining binary classification with a new localization algorithm to annotate the spatial location of the IVC in real time.

Result: The model performs well on high-quality hospital ultrasound videos and achieves zero-shot performance on a low-cost handheld device (Butterfly iQ).

Conclusion: The system could extend ultrasound diagnostics beyond hospitals; it is currently in clinical trials and is available in the Butterfly iQ app.

Abstract: Ultrasound imaging of the heart (echocardiography) is widely used to diagnose cardiac diseases. However, obtaining an echocardiogram requires an expert sonographer and a high-quality ultrasound imaging device, which are generally only available in hospitals. Recently, AI-based navigation models and algorithms have been used to aid novice sonographers in acquiring the standardized cardiac views necessary to visualize potential disease pathologies. These navigation systems typically rely on directional guidance to predict the necessary rotation of the ultrasound probe. This paper demonstrates a novel AI navigation system that builds on a decision model for identifying the inferior vena cava (IVC) of the heart. The decision model is trained offline using cardiac ultrasound videos and employs binary classification to determine whether the IVC is present in a given ultrasound video. The underlying model integrates a novel localization algorithm that leverages the learned feature representations to annotate the spatial location of the IVC in real-time. Our model demonstrates strong localization performance on traditional high-quality hospital ultrasound videos, as well as impressive zero-shot performance on lower-quality ultrasound videos from a more affordable Butterfly iQ handheld ultrasound machine. This capability facilitates the expansion of ultrasound diagnostics beyond hospital settings. Currently, the guidance system is undergoing clinical trials and is available on the Butterfly iQ app.

[12] Post-Hurricane Debris Segmentation Using Fine-Tuned Foundational Vision Models

Kooshan Amini,Yuhao Liu,Jamie Ellen Padgett,Guha Balakrishnan,Ashok Veeraraghavan

Main category: cs.CV

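TL;DR: The study fine-tunes pre-trained vision models to build a general hurricane debris segmentation method that uses a small, high-quality dataset and performs well on a hurricane event unseen during training.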

Details

Motivation: Timely, accurate detection of hurricane debris is critical for disaster response and community recovery, but existing methods are limited by poor cross-region applicability and data scarcity.

Method: Fine-tunes pre-trained vision models on a dataset of about 1,200 manually annotated hurricane images, combined with multi-annotator label aggregation and visual prompt engineering.

Result: The fCLIPSeg model achieves a Dice score of 0.70 on data from Hurricane Ida, which was excluded from training, with almost no false positives in debris-free areas.

Conclusion: The study presents the first general debris segmentation model that needs only standard RGB imagery, suitable for large-scale post-disaster assessment and recovery planning.

Abstract: Timely and accurate detection of hurricane debris is critical for effective disaster response and community resilience. While post-disaster aerial imagery is readily available, robust debris segmentation solutions applicable across multiple disaster regions remain limited. Developing a generalized solution is challenging due to varying environmental and imaging conditions that alter debris' visual signatures across different regions, further compounded by the scarcity of training data. This study addresses these challenges by fine-tuning pre-trained foundational vision models, achieving robust performance with a relatively small, high-quality dataset. Specifically, this work introduces an open-source dataset comprising approximately 1,200 manually annotated aerial RGB images from Hurricanes Ian, Ida, and Ike. To mitigate human biases and enhance data quality, labels from multiple annotators are strategically aggregated and visual prompt engineering is employed. The resulting fine-tuned model, named fCLIPSeg, achieves a Dice score of 0.70 on data from Hurricane Ida, a disaster event entirely excluded during training, with virtually no false positives in debris-free areas. This work presents the first event-agnostic debris segmentation model requiring only standard RGB imagery during deployment, making it well-suited for rapid, large-scale post-disaster impact assessments and recovery planning.
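The Dice score reported above measures overlap between a predicted mask and the ground truth: Dice = 2|A ∩ B| / (|A| + |B|). A minimal sketch on flattened binary masks follows; the masks are made up, and returning 1.0 when both masks are empty is a convention assumed here rather than something the abstract specifies.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 lists."""
    if len(pred) != len(target):
        raise ValueError("masks must be the same size")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# Toy debris masks (1 = debris pixel), illustrative values only.
predicted = [1, 1, 0, 0, 1, 0]
annotated = [1, 0, 0, 0, 1, 1]
print(round(dice_score(predicted, annotated), 3))  # 0.667
```

A score of 1.0 means the masks coincide exactly and 0.0 means they do not overlap at all.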

[13] Privacy-Preserving Operating Room Workflow Analysis using Digital Twins

Alejandra Perez,Han Zhang,Yu-Chun Ku,Lalithkumar Seenivasan,Roger Soberanis,Jose L. Porras,Richard Day,Jeff Jopling,Peter Najjar,Mathias Unberath

Main category: cs.CV

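TL;DR: Proposes a two-stage, privacy-preserving method for operating room video analysis that performs event detection on de-identified Digital Twins (DT), matching or exceeding models that use raw RGB video.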

Details

Motivation: Optimizing operating room workflows requires automatic event recognition, but privacy concerns limit the use of computer vision, so privacy-preserving methods are needed.

Method: A two-stage approach: (1) use vision foundation models to generate de-identified Digital Twins; (2) use the SafeOR model to process segmentation masks and depth maps for event detection.

Result: On 38 simulated surgical trials, the DT-based approach performs on par with or better than raw RGB video models.

Conclusion: Digital Twins support privacy-preserving OR workflow analysis, facilitate cross-institution data sharing, and may improve model generalizability.

Abstract: Purpose: The operating room (OR) is a complex environment where optimizing workflows is critical to reduce costs and improve patient outcomes. The use of computer vision approaches for the automatic recognition of perioperative events enables identification of bottlenecks for OR optimization. However, privacy concerns limit the use of computer vision for automated event detection from OR videos, which makes privacy-preserving approaches needed for OR workflow analysis. Methods: We propose a two-stage pipeline for privacy-preserving OR video analysis and event detection. In the first stage, we leverage vision foundation models for depth estimation and semantic segmentation to generate de-identified Digital Twins (DT) of the OR from conventional RGB videos. In the second stage, we employ the SafeOR model, a fused two-stream approach that processes segmentation masks and depth maps for OR event detection. We evaluate this method on an internal dataset of 38 simulated surgical trials with five event classes. Results: Our results indicate that this DT-based approach to the OR event detection model achieves performance on par and sometimes even better than raw RGB video-based models on detecting OR events. Conclusion: DTs enable privacy-preserving OR workflow analysis, facilitating the sharing of de-identified data across institutions and they can potentially enhance model generalizability by mitigating domain-specific appearance differences.

[14] Contour Field based Elliptical Shape Prior for the Segment Anything Model

Xinyu Zhao,Jun Liu,Faqiang Wang,Li Cui,Yuping Duan

Main category: cs.CV

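TL;DR: Proposes a new method that integrates an elliptical shape prior into SAM image segmentation using variational methods to improve segmentation accuracy.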

Details

Motivation: Existing deep learning methods such as SAM are inefficient at producing elliptical segmentation results and need improvement.

Method: Builds a parameterized elliptical contour field, combines image features with the elliptical prior, and uses a dual algorithm to design a new SAM network structure.

Result: Experiments on specific datasets show that the modified SAM segments more accurately than the original.

Conclusion: Introducing the elliptical prior significantly improves SAM on elliptical-shape segmentation tasks.

Abstract: The elliptical shape prior information plays a vital role in improving the accuracy of image segmentation for specific tasks in medical and natural images. Existing deep learning-based segmentation methods, including the Segment Anything Model (SAM), often struggle to produce segmentation results with elliptical shapes efficiently. This paper proposes a new approach to integrate the prior of elliptical shapes into the deep learning-based SAM image segmentation techniques using variational methods. The proposed method establishes a parameterized elliptical contour field, which constrains the segmentation results to align with predefined elliptical contours. Utilizing the dual algorithm, the model seamlessly integrates image features with elliptical priors and spatial regularization priors, thereby greatly enhancing segmentation accuracy. By decomposing SAM into four mathematical sub-problems, we integrate the variational ellipse prior to design a new SAM network structure, ensuring that the segmentation output of SAM consists of elliptical regions. Experimental results on some specific image datasets demonstrate an improvement over the original SAM.

[15] Parsimonious Dataset Construction for Laparoscopic Cholecystectomy Structure Segmentation

Yuning Zhou,Henry Badgery,Matthew Read,James Bailey,Catherine Davey

Main category: cs.CV

TL;DR: The paper uses active learning to select key frames from surgical videos, building a high-quality, affordable semantic segmentation dataset.

Details

Motivation: High annotation costs in the medical domain hinder deep learning applications.

Method: With active learning, the DNN selects the most informative data during training and improves progressively.

Result: Experiments show that half of the data achieves performance close to the full dataset (0.4349 vs. 0.4374 mIoU).

Conclusion: Active learning effectively lowers annotation cost while maintaining model performance.

Abstract: Labeling has always been expensive in the medical context, which has hindered related deep learning application. Our work introduces active learning in surgical video frame selection to construct a high-quality, affordable Laparoscopic Cholecystectomy dataset for semantic segmentation. Active learning allows the Deep Neural Networks (DNNs) learning pipeline to include the dataset construction workflow, which means DNNs trained by existing dataset will identify the most informative data from the newly collected data. At the same time, DNNs' performance and generalization ability improve over time when the newly selected and annotated data are included in the training data. We assessed different data informativeness measurements and found the deep features distances select the most informative data in this task. Our experiments show that with half of the data selected by active learning, the DNNs achieve almost the same performance with 0.4349 mean Intersection over Union (mIoU) compared to the same DNNs trained on the full dataset (0.4374 mIoU) on the critical anatomies and surgical instruments.
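The abstract reports that deep feature distances were the best informativeness measure but does not say how they are turned into a selection rule. One plausible reading, shown here purely as an assumption, is greedy farthest-first selection: repeatedly pick the unlabeled frame whose feature vector is farthest from everything already labeled or chosen. The function name, frame IDs, and 2-D features are all hypothetical.

```python
import math


def select_frames(labeled_features, pool, budget):
    """Greedy farthest-first selection of frame IDs from an unlabeled pool.

    `labeled_features` is a non-empty list of feature vectors already annotated;
    `pool` maps a frame ID to its feature vector.
    """
    if not labeled_features:
        raise ValueError("need at least one labeled feature vector")
    reference = list(labeled_features)
    remaining = dict(pool)
    chosen = []
    for _ in range(min(budget, len(remaining))):
        # Informativeness = distance to the nearest already-covered frame.
        best = max(
            remaining,
            key=lambda fid: min(math.dist(remaining[fid], r) for r in reference),
        )
        chosen.append(best)
        reference.append(remaining.pop(best))
    return chosen


# Toy 2-D "deep features" for four unlabeled frames.
labeled = [(0.0, 0.0)]
unlabeled = {"a": (0.1, 0.0), "b": (5.0, 5.0), "c": (5.1, 5.0), "d": (-4.0, 0.0)}
print(select_frames(labeled, unlabeled, budget=2))  # ['c', 'd']
```

Frame `b` is skipped because it sits next to the already chosen `c`, which shows how distance-based selection avoids annotating near-duplicate video frames.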

[16] Prompt-Driven and Training-Free Forgetting Approach and Dataset for Large Language Models

Zhenyu Yu,Mohd Yamani Inda Idris,Pei Wang

Main category: cs.CV

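TL;DR: Proposes an automatic dataset creation framework based on prompt-based layered editing and training-free local feature removal, for privacy-compliant selective unlearning in diffusion models.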

Details

Motivation: The wide use of diffusion models in image generation increases demand for privacy-compliant unlearning, but existing methods struggle with selective unlearning amid high-dimensional, complex features.

Method: Builds the ForgetMe dataset, introduces the Entangled evaluation metric, and applies LoRA fine-tuning to Stable Diffusion for selective unlearning.

Result: Validates the effectiveness of the ForgetMe dataset and the Entangled metric, providing a benchmark for selective unlearning.

Conclusion: The work provides a scalable, adaptable solution for privacy-preserving generative AI.

Abstract: The widespread adoption of diffusion models in image generation has increased the demand for privacy-compliant unlearning. However, due to the high-dimensional nature and complex feature representations of diffusion models, achieving selective unlearning remains challenging, as existing methods struggle to remove sensitive information while preserving the consistency of non-sensitive regions. To address this, we propose an Automatic Dataset Creation Framework based on prompt-based layered editing and training-free local feature removal, constructing the ForgetMe dataset and introducing the Entangled evaluation metric. The Entangled metric quantifies unlearning effectiveness by assessing the similarity and consistency between the target and background regions and supports both paired (Entangled-D) and unpaired (Entangled-S) image data, enabling unsupervised evaluation. The ForgetMe dataset encompasses a diverse set of real and synthetic scenarios, including CUB-200-2011 (Birds), Stanford-Dogs, ImageNet, and a synthetic cat dataset. We apply LoRA fine-tuning on Stable Diffusion to achieve selective unlearning on this dataset and validate the effectiveness of both the ForgetMe dataset and the Entangled metric, establishing them as benchmarks for selective unlearning. Our work provides a scalable and adaptable solution for advancing privacy-preserving generative AI.

[17] CM3AE: A Unified RGB Frame and Event-Voxel/-Frame Pre-training Framework

Wentao Wu,Xiao Wang,Chenglong Li,Bo Jiang,Jin Tang,Bin Luo,Qi Liu

Main category: cs.CV

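TL;DR: The paper proposes CM3AE, a new pre-training framework for RGB-Event perception that improves model performance through multi-modal fusion and contrastive learning.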

Details

Motivation: Existing event-data pre-training methods connect only weakly to RGB frames, limiting their use in multi-modal fusion.

Method: Designs a multi-modal fusion reconstruction module and a contrastive learning strategy; inputs include RGB images, event images, and event voxels.

Result: Builds a large-scale dataset and validates CM3AE on five downstream tasks.

Conclusion: CM3AE strongly supports event-based and RGB-Event fusion tasks; code and models will be released.

Abstract: Event cameras have attracted increasing attention in recent years due to their advantages in high dynamic range, high temporal resolution, low power consumption, and low latency. Some researchers have begun exploring pre-training directly on event data. Nevertheless, these efforts often fail to establish strong connections with RGB frames, limiting their applicability in multi-modal fusion scenarios. To address these issues, we propose a novel CM3AE pre-training framework for the RGB-Event perception. This framework accepts multi-modalities/views of data as input, including RGB images, event images, and event voxels, providing robust support for both event-based and RGB-event fusion based downstream tasks. Specifically, we design a multi-modal fusion reconstruction module that reconstructs the original image from fused multi-modal features, explicitly enhancing the model's ability to aggregate cross-modal complementary information. Additionally, we employ a multi-modal contrastive learning strategy to align cross-modal feature representations in a shared latent space, which effectively enhances the model's capability for multi-modal understanding and capturing global dependencies. We construct a large-scale dataset containing 2,535,759 RGB-Event data pairs for the pre-training. Extensive experiments on five downstream tasks fully demonstrated the effectiveness of CM3AE. Source code and pre-trained models will be released on https://github.com/Event-AHU/CM3AE.

[18] 3DResT: A Strong Baseline for Semi-Supervised 3D Referring Expression Segmentation

Wenxin Chen,Mengxue Qu,Weitai Kang,Yan Yan,Yao Zhao,Yunchao Wei

Main category: cs.CV

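TL;DR: The paper proposes 3DResT, a semi-supervised learning framework for 3D Referring Expression Segmentation (3D-RES) that uses pseudo-labels efficiently through TSCS and QDW and significantly improves performance.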

Details

Motivation: 3D-RES typically requires extensive instance-level annotations, which are costly and time-consuming. Existing semi-supervised methods do not make full use of pseudo-labels, which limits model performance. Method: Proposes the 3DResT framework with TSCS (Teacher-Student Consistency-Based Sampling) and QDW (Quality-Driven Dynamic Weighting), which improve the use of high-quality and low-quality pseudo-labels respectively. Result: With only 1% labeled data, 3DResT improves mIoU by 8.34 points over the fully supervised method. Conclusion: By using pseudo-labels efficiently, 3DResT substantially reduces annotation cost and improves performance, providing a new baseline for semi-supervised 3D-RES. Abstract: 3D Referring Expression Segmentation (3D-RES) typically requires extensive instance-level annotations, which are time-consuming and costly. Semi-supervised learning (SSL) mitigates this by using limited labeled data alongside abundant unlabeled data, improving performance while reducing annotation costs. SSL uses a teacher-student paradigm where the teacher generates high-confidence-filtered pseudo-labels to guide the student. However, in the context of 3D-RES, where each label corresponds to a single mask and labeled data is scarce, existing SSL methods treat high-quality pseudo-labels merely as auxiliary supervision, which limits the model's learning potential. The reliance on high-confidence thresholds for filtering often results in potentially valuable pseudo-labels being discarded, restricting the model's ability to leverage the abundant unlabeled data. Therefore, we identify two critical challenges in semi-supervised 3D-RES, namely, inefficient utilization of high-quality pseudo-labels and wastage of useful information from low-quality pseudo-labels. In this paper, we introduce the first semi-supervised learning framework for 3D-RES, presenting a robust baseline method named 3DResT. To address these challenges, we propose two novel designs called Teacher-Student Consistency-Based Sampling (TSCS) and Quality-Driven Dynamic Weighting (QDW). TSCS aids in the selection of high-quality pseudo-labels, integrating them into the labeled dataset to strengthen the labeled supervision signals. QDW preserves low-quality pseudo-labels by dynamically assigning them lower weights, allowing for the effective extraction of useful information rather than discarding them. Extensive experiments conducted on the widely used benchmark demonstrate the effectiveness of our method. Notably, with only 1% labeled data, 3DResT achieves an mIoU improvement of 8.34 points compared to the fully supervised method.
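
The idea of down-weighting low-quality pseudo-labels instead of discarding them can be sketched as a weighted average of per-sample losses. The abstract does not give the actual QDW weighting function, so the version below simply uses a clipped quality score as the weight; the function name, the `floor` parameter, and the example numbers are hypothetical.

```python
def weighted_pseudo_label_loss(losses, qualities, floor=0.0):
    """Average per-sample losses, weighting each by its pseudo-label quality.

    `qualities` are assumed to lie in [0, 1]; `floor` is a hypothetical
    minimum weight so that no pseudo-label is dropped outright.
    """
    weights = [max(floor, min(1.0, q)) for q in qualities]
    total_weight = sum(weights)
    if total_weight == 0.0:
        return 0.0
    return sum(w * l for w, l in zip(weights, losses)) / total_weight


# Two pseudo-labelled samples: one trusted fully, one trusted half as much.
loss = weighted_pseudo_label_loss([1.0, 3.0], [1.0, 0.5])
```

A hard confidence threshold corresponds to weights of exactly 0 or 1; the soft weights here keep a reduced contribution from the uncertain sample.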

[19] AdaQual-Diff: Diffusion-Based Image Restoration via Adaptive Quality Prompting

Xin Su,Chen Wu,Yu Zhang,Chen Lyu,Zhuoran Zheng

Main category: cs.CV

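TL;DR: AdaQual-Diff is a diffusion-based framework that integrates perceptual quality assessment directly into the generative restoration process, addressing the limitations of conventional methods when restoring images with complex real-world degradations.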

Details

Motivation: Conventional methods rely on indirect cues and adapt poorly to complex degradations, which leads to weak restoration results. Method: An Adaptive Quality Prompting mechanism adjusts the prompt structure according to regional quality scores, allocating computational resources dynamically. Result: AdaQual-Diff achieves visually superior restorations on both synthetic and real-world datasets. Conclusion: By combining quality guidance with content-specific conditioning, AdaQual-Diff achieves fine-grained control over regional restoration intensity without additional parameters or inference iterations. Abstract: Restoring images afflicted by complex real-world degradations remains challenging, as conventional methods often fail to adapt to the unique mixture and severity of artifacts present. This stems from a reliance on indirect cues which poorly capture the true perceptual quality deficit. To address this fundamental limitation, we introduce AdaQual-Diff, a diffusion-based framework that integrates perceptual quality assessment directly into the generative restoration process. Our approach establishes a mathematical relationship between regional quality scores from DeQAScore and optimal guidance complexity, implemented through an Adaptive Quality Prompting mechanism. This mechanism systematically modulates prompt structure according to measured degradation severity: regions with lower perceptual quality receive computationally intensive, structurally complex prompts with precise restoration directives, while higher quality regions receive minimal prompts focused on preservation rather than intervention. The technical core of our method lies in the dynamic allocation of computational resources proportional to degradation severity, creating a spatially-varying guidance field that directs the diffusion process with mathematical precision. By combining this quality-guided approach with content-specific conditioning, our framework achieves fine-grained control over regional restoration intensity without requiring additional parameters or inference iterations. Experimental results demonstrate that AdaQual-Diff achieves visually superior restorations across diverse synthetic and real-world datasets.

[20] Robo-SGG: Exploiting Layout-Oriented Normalization and Restitution for Robust Scene Graph Generation

Changsheng Lv,Mengshi Qi,Zijian Fu,Huadong Ma

Main category: cs.CV

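TL;DR: The paper proposes Robo-SGG, which uses layout-oriented normalization and restitution to make scene graph generation more robust on corrupted images.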

Details

Motivation: Existing scene graph generation methods degrade on corrupted images; the core challenge is the domain shift between clean and corrupted images. Method: Uses layout information, which is domain-invariant, to enhance visual features, and proposes Layout-Oriented Restitution and a Layout-Embedded Encoder (LEE). Result: On the VG-C dataset, mR@50 improves by a relative 5.6% to 8.0%, setting a new state of the art. Conclusion: As a plug-and-play module, Robo-SGG significantly improves the robustness of scene graph generation on corrupted images. Abstract: In this paper, we introduce a novel method named Robo-SGG, i.e., Layout-Oriented Normalization and Restitution for Robust Scene Graph Generation. Compared to the existing SGG setting, robust scene graph generation aims to perform inference on a diverse range of corrupted images, with the core challenge being the domain shift between the clean and corrupted images. Existing SGG methods suffer from degraded performance due to compromised visual features, e.g., corruption interference or occlusions. To obtain robust visual features, we exploit the layout information, which is domain-invariant, to enhance the efficacy of existing SGG methods on corrupted images. Specifically, we employ Instance Normalization (IN) to filter out the domain-specific features and recover the unchangeable structural features, i.e., the positional and semantic relationships among objects, by the proposed Layout-Oriented Restitution. Additionally, we propose a Layout-Embedded Encoder (LEE) that augments the existing object and predicate encoders within the SGG framework, enriching the robust positional and semantic features of objects and predicates. Note that our proposed Robo-SGG module is designed as a plug-and-play component, which can be easily integrated into any baseline SGG model. Extensive experiments demonstrate that by integrating the state-of-the-art method into our proposed Robo-SGG, we achieve relative improvements of 5.6%, 8.0%, and 6.5% in mR@50 for PredCls, SGCls, and SGDet tasks on the VG-C dataset, respectively, and achieve new state-of-the-art performance on corruption scene graph generation benchmarks (VG-C and GQA-C). We will release our source code and model.
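
To make the role of Instance Normalization concrete, here is a minimal pure-Python sketch of normalizing a single feature channel of a single sample. It is a generic illustration of the operation, not the Robo-SGG implementation; the function name and the toy 2x2 channel are assumptions.

```python
import math


def instance_norm(channel, eps=1e-5):
    """Normalize one channel of one sample to zero mean and unit variance."""
    values = [v for row in channel for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [[(v - mean) * inv_std for v in row] for row in channel]


# A toy channel and a "corrupted" copy with a contrast and brightness change.
clean = [[1.0, 2.0], [3.0, 4.0]]
shifted = [[2.0 * v + 10.0 for v in row] for row in clean]
norm_clean = instance_norm(clean)
norm_shifted = instance_norm(shifted)
```

Because the per-instance mean and variance are removed, the two normalized channels agree up to the small `eps` term, which is the sense in which the operation filters out instance-specific statistics while keeping spatial structure.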

[21] SAM-Based Building Change Detection with Distribution-Aware Fourier Adaptation and Edge-Constrained Warping

Yun-Cheng Li,Sen Lei,Yi-Tao Zhao,Heng-Chao Li,Jun Li,Antonio Plaza

Main category: cs.CV

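TL;DR: FAEWNet is a SAM-based network that uses distribution-aware Fourier adaptation and edge-constrained warping to address the domain gap and noise interference in building change detection.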

Details

Motivation: Building change detection is essential for urbanization monitoring, disaster assessment, and military reconnaissance, but existing methods perform poorly because of the domain gap, imbalanced distributions, and noise interference. Method: FAEWNet extracts features with the SAM encoder, uses a Distribution-Aware Fourier Aggregated Adapter to aggregate task-oriented change information, and designs a new flow module that improves edge extraction and change perception. Result: Achieves state-of-the-art results on the LEVIR-CD, S2Looking, and WHU-CD datasets. Conclusion: FAEWNet effectively addresses key problems in building change detection, improving detection accuracy and edge recognition. Abstract: Building change detection remains challenging for urban development, disaster assessment, and military reconnaissance. While foundation models like Segment Anything Model (SAM) show strong segmentation capabilities, SAM is limited in the task of building change detection due to domain gap issues. Existing adapter-based fine-tuning approaches face challenges with imbalanced building distribution, resulting in poor detection of subtle changes and inaccurate edge extraction. Additionally, bi-temporal misalignment in change detection, typically addressed by optical flow, remains vulnerable to background noise. This affects the detection of building changes and compromises both detection accuracy and edge recognition. To tackle these challenges, we propose a new SAM-Based Network with Distribution-Aware Fourier Adaptation and Edge-Constrained Warping (FAEWNet) for building change detection. FAEWNet utilizes the SAM encoder to extract rich visual features from remote sensing images. To guide SAM in focusing on specific ground objects in remote sensing scenes, we propose a Distribution-Aware Fourier Aggregated Adapter to aggregate task-oriented changed information. This adapter not only effectively addresses the domain gap issue, but also pays attention to the distribution of changed buildings. Furthermore, to mitigate noise interference and misalignment in height offset estimation, we design a novel flow module that refines building edge extraction and enhances the perception of changed buildings. Our state-of-the-art results on the LEVIR-CD, S2Looking and WHU-CD datasets highlight the effectiveness of FAEWNet. The code is available at https://github.com/SUPERMAN123000/FAEWNet.

[22] Packing Input Frame Context in Next-Frame Prediction Models for Video Generation

Lvmin Zhang,Maneesh Agrawala

Main category: cs.CV

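TL;DR: FramePack is a neural network structure for training next-frame prediction models for video generation. It compresses input frames to keep the context length fixed, which improves training efficiency, and it introduces an anti-drifting sampling method.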

Details

Motivation: To address the computational bottleneck and error accumulation that arise when generating long videos. Method: Compresses input frames to a fixed context length, adopts an anti-drifting sampling method (generating frames in inverted temporal order), and supports fine-tuning existing video diffusion models. Result: Larger training batch sizes, less error accumulation, and better visual quality. Conclusion: FramePack handles long videos efficiently, improves generation quality, and can be used to fine-tune existing models. Abstract: We present a neural network structure, FramePack, to train next-frame (or next-frame-section) prediction models for video generation. FramePack compresses input frames to make the transformer context length a fixed number regardless of the video length. As a result, we are able to process a large number of frames using video diffusion with a computation bottleneck similar to image diffusion. This also makes the training video batch sizes significantly higher (batch sizes become comparable to image diffusion training). We also propose an anti-drifting sampling method that generates frames in inverted temporal order with early-established endpoints to avoid exposure bias (error accumulation over iterations). Finally, we show that existing video diffusion models can be finetuned with FramePack, and their visual quality may be improved because the next-frame prediction supports more balanced diffusion schedulers with less extreme flow shift timesteps.
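
One way to see how compressing older frames can keep the context bounded regardless of video length is a geometric token budget. The abstract does not state FramePack's actual compression schedule, so the halving rule, the `base_tokens` value, and the function name below are purely assumptions for illustration.

```python
def context_tokens(num_frames, base_tokens=1024, ratio=2):
    """Total context size when frame i (0 = most recent) gets base_tokens // ratio**i tokens.

    Hypothetical schedule: each older frame gets a fraction of the previous
    frame's budget, so very old frames eventually contribute zero tokens.
    """
    return sum(base_tokens // ratio ** i for i in range(num_frames))


short_video = context_tokens(1)
long_video = context_tokens(1000)
```

Under this assumed schedule the total is a truncated geometric series, so it stays below `2 * base_tokens` however many frames are supplied.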

[23] RoPETR: Improving Temporal Camera-Only 3D Detection by Integrating Enhanced Rotary Position Embedding

Hang Ji,Tao Ni,Xufeng Huang,Tao Luo,Xin Zhan,Junbo Chen

Main category: cs.CV

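TL;DR: An improvement to the StreamPETR framework that focuses on velocity estimation; a customized positional embedding strategy achieves an NDS of 70.86% on the NuScenes dataset.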

Details

Motivation: StreamPETR performs well at 3D bounding box detection, but velocity estimation is a bottleneck on the NuScenes dataset and lowers the overall NDS. Method: Proposes a customized positional embedding strategy to strengthen temporal modeling. Result: On the NuScenes test set, the improved method with a ViT-L backbone achieves an NDS of 70.86%, a new record for camera-only 3D object detection. Conclusion: Improving velocity estimation significantly raises StreamPETR's performance on NuScenes to a state-of-the-art level. Abstract: This technical report introduces a targeted improvement to the StreamPETR framework, specifically aimed at enhancing velocity estimation, a critical factor influencing the overall NuScenes Detection Score. While StreamPETR exhibits strong 3D bounding box detection performance, as reflected by its high mean Average Precision, our analysis identified velocity estimation as a substantial bottleneck when evaluated on the NuScenes dataset. To overcome this limitation, we propose a customized positional embedding strategy tailored to enhance temporal modeling capabilities. Experimental evaluations conducted on the NuScenes test set demonstrate that our improved approach achieves a state-of-the-art NDS of 70.86% using the ViT-L backbone, setting a new benchmark for camera-only 3D object detection.

[24] AdaptoVision: A Multi-Resolution Image Recognition Model for Robust and Scalable Classification

Md. Sanaullah Chowdhury,Lameya Sabrin

Main category: cs.CV

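TL;DR: AdaptoVision is a new CNN architecture that lowers computational complexity while maintaining high performance by streamlining its structure and reducing the parameter count.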

Details

Motivation: To design a CNN architecture that balances computational complexity and classification accuracy efficiently and suits resource-constrained environments. Method: Combines enhanced residual units, depth-wise separable convolutions, and hierarchical skip connections to reduce parameters and computation. Result: Reaches state of the art on the BreakHis dataset, and achieves 95.3% and 85.77% accuracy on CIFAR-10 and CIFAR-100 respectively. Conclusion: AdaptoVision has a streamlined structure and strong feature extraction, making it suitable for real-time and resource-constrained scenarios. Abstract: This paper introduces AdaptoVision, a novel convolutional neural network (CNN) architecture designed to efficiently balance computational complexity and classification accuracy. By leveraging enhanced residual units, depth-wise separable convolutions, and hierarchical skip connections, AdaptoVision significantly reduces parameter count and computational requirements while preserving competitive performance across various benchmark and medical image datasets. Extensive experimentation demonstrates that AdaptoVision achieves state-of-the-art performance on the BreakHis dataset and comparable accuracy levels, notably 95.3% on CIFAR-10 and 85.77% on CIFAR-100, without relying on any pretrained weights. The model's streamlined architecture and strategic simplifications promote effective feature extraction and robust generalization, making it particularly suitable for deployment in real-time and resource-constrained environments.
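
The parameter saving from depth-wise separable convolutions can be checked with a short calculation. The sketch below compares weight counts (biases ignored) for a standard convolution and for a depth-wise convolution followed by a 1x1 point-wise convolution; the layer sizes are arbitrary example values, not AdaptoVision's actual configuration.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    """Weights in a depth-wise k x k conv plus a 1x1 point-wise conv (bias ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 conv that mixes channels
    return depthwise + pointwise


# Example layer: 3x3 kernel, 64 input channels, 128 output channels.
standard = standard_conv_params(3, 64, 128)
separable = separable_conv_params(3, 64, 128)
reduction = standard / separable
```

For this example layer the standard convolution has 73,728 weights against 8,768 for the separable version, roughly an 8.4x reduction.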

[25] Two Tasks, One Goal: Uniting Motion and Planning for Excellent End To End Autonomous Driving Performance

Lin Liu,Ziying Song,Hongyu Pan,Lei Yang,Caiyan Jia

Main category: cs.CV

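TL;DR: TTOG is a novel two-stage trajectory generation framework that unifies planning and motion tasks, which earlier methods treated separately, and achieves state-of-the-art performance on multiple datasets.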

Details

Motivation: Conventional autonomous driving methods decouple planning and motion tasks, overlooking the benefit that planning could gain from the out-of-distribution data encountered in motion tasks. Unifying the tasks raises challenges such as building shared contextual representations and handling the unobservable states of other vehicles. Method: Proposes the TTOG framework, which first generates a diverse set of trajectory candidates and then refines them using vehicle state information. A state estimator trained on ego-vehicle data handles the unobservable states of other vehicles, and ECSA is introduced to improve the generalization of scene representations. Result: On the nuScenes dataset, TTOG reduces the L2 distance by 36.06%; on the Bench2Drive dataset, the driving score (DS) improves by 22%, significantly outperforming existing baselines. Conclusion: By unifying planning and motion tasks, TTOG significantly improves autonomous driving performance, confirming the effectiveness of the framework. Abstract: End-to-end autonomous driving has made impressive progress in recent years. Previous end-to-end autonomous driving approaches often decouple planning and motion tasks, treating them as separate modules. This separation overlooks the potential benefits that planning can gain from learning out-of-distribution data encountered in motion tasks. However, unifying these tasks poses significant challenges, such as constructing shared contextual representations and handling the unobservability of other vehicles' states. To address these challenges, we propose TTOG, a novel two-stage trajectory generation framework. In the first stage, a diverse set of trajectory candidates is generated, while the second stage focuses on refining these candidates through vehicle state information. To mitigate the issue of unavailable surrounding vehicle states, TTOG employs a self-vehicle data-trained state estimator, subsequently extended to other vehicles. Furthermore, we introduce ECSA (equivariant context-sharing scene adapter) to enhance the generalization of scene representations across different agents. Experimental results demonstrate that TTOG achieves state-of-the-art performance across both planning and motion tasks. Notably, on the challenging open-loop nuScenes dataset, TTOG reduces the L2 distance by 36.06%. Furthermore, on the closed-loop Bench2Drive dataset, our approach achieves a 22% improvement in the driving score (DS), significantly outperforming existing baselines.

[26] Accurate Tracking of Arabidopsis Root Cortex Cell Nuclei in 3D Time-Lapse Microscopy Images Based on Genetic Algorithm

Yu Song,Tatsuaki Goh,Yinhao Li,Jiahua Dong,Shunsuke Miyashima,Yutaro Iwamoto,Yohei Kondo,Keiji Nakajima,Yen-wei Chen

Main category: cs.CV

TL;DR: Proposes a genetic-algorithm-based method for tracking cell nuclei that handles the densely arranged cells of Arabidopsis root tips.

Details

Motivation: Arabidopsis is an important model for studying plant physiology and development, but existing cell tracking software performs poorly when cells are densely arranged. Method: Uses a genetic algorithm together with the spatial relationships of Arabidopsis root cells to track nuclei in two stages, from coarse to fine. Result: On long-duration live imaging data, the method tracks nuclei accurately with only minor manual correction. Conclusion: The method is the first to successfully address the long-standing problem of tracking nuclei in Arabidopsis root tips. Abstract: Arabidopsis is a widely used model plant to gain basic knowledge on plant physiology and development. Live imaging is an important technique to visualize and quantify elemental processes in plant development. To uncover novel theories underlying plant growth and cell division, accurate cell tracking on live imaging is of utmost importance. The commonly used cell tracking software, TrackMate, adopts a tracking-by-detection approach, which applies Laplacian of Gaussian (LoG) for blob detection and a Linear Assignment Problem (LAP) tracker for tracking. However, these components do not perform sufficiently well when cells are densely arranged. To alleviate this problem, we propose an accurate tracking method based on a genetic algorithm (GA) using knowledge of Arabidopsis root cellular patterns and spatial relationships among volumes. Our method can be described as a coarse-to-fine method, in which we first conduct relatively easy line-level tracking of cell nuclei, then perform complicated nuclear tracking based on the known linear arrangement of cell files and the spatial relationships between nuclei. Our method has been evaluated on a long-time live imaging dataset of Arabidopsis root tips, and with minor manual rectification, it accurately tracks nuclei. To the best of our knowledge, this research represents the first successful attempt to address a long-standing problem in the field of time-lapse microscopy in the root meristem by proposing an accurate tracking method for Arabidopsis root nuclei.

[27] TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials

Bofei Zhang,Zirui Shang,Zhi Gao,Wang Zhang,Rui Xie,Xiaojian Ma,Tao Yuan,Xinxiao Wu,Song-Chun Zhu,Qing Li

Main category: cs.CV

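TL;DR: The TongUI framework builds generalized GUI agents by learning from multimodal web tutorials, addressing the shortage of trajectory data in GUI agent development.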

Details

Motivation: The main challenge in developing generalized GUI agents is the lack of trajectory data across operating systems and applications, since manual annotation is expensive. Method: Proposes the TongUI framework, which crawls online GUI tutorials (such as videos and articles) and processes them into trajectory data, producing the GUI-Net dataset (143K trajectories). Qwen2.5-VL-3B/7B models are fine-tuned on this dataset. Result: The TongUI agent performs well on benchmarks, with improvements of about 10%, confirming the effectiveness of the GUI-Net dataset. Conclusion: The TongUI framework and the GUI-Net dataset significantly improve GUI agent performance; the code, dataset, and models will be open-sourced. Abstract: Building Graphical User Interface (GUI) agents is a promising research direction, which simulates human interaction with computers or mobile phones to perform diverse GUI tasks. However, a major challenge in developing generalized GUI agents is the lack of sufficient trajectory data across various operating systems and applications, mainly due to the high cost of manual annotations. In this paper, we propose the TongUI framework that builds generalized GUI agents by learning from rich multimodal web tutorials. Concretely, we crawl and process online GUI tutorials (such as videos and articles) into GUI agent trajectory data, through which we produce the GUI-Net dataset containing 143K trajectory data across five operating systems and more than 200 applications. We develop the TongUI agent by fine-tuning Qwen2.5-VL-3B/7B models on GUI-Net, which show remarkable performance improvements on commonly used grounding and navigation benchmarks, outperforming baseline agents by about 10% on multiple benchmarks, showing the effectiveness of the GUI-Net dataset and underscoring the significance of our TongUI framework. We will fully open-source the code, the GUI-Net dataset, and the trained models soon.

[28] HSS-IAD: A Heterogeneous Same-Sort Industrial Anomaly Detection Dataset

Qishan Wang,Shuyong Gao,Junjie Hu,Jiawen Yu,Xuan Tong,You Li,Wenqiang Zhang

Main category: cs.CV

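TL;DR: The paper introduces the HSS-IAD dataset, which addresses the shortcomings of existing industrial anomaly detection datasets in real-world settings, and evaluates multi-class anomaly detection methods on it.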

Details

Motivation: Existing Industrial Anomaly Detection (IAD) datasets contain unrelated classes and unrealistic defects, which limits the practical use of Multi-class Unsupervised Anomaly Detection (MUAD) methods. Method: Proposes the HSS-IAD dataset, which contains 8,580 images of metallic-like industrial parts with precise anomaly annotations, plus foreground images for synthetic anomaly generation. Result: Popular IAD methods are evaluated on HSS-IAD under multi-class and class-separated settings, demonstrating the dataset's potential. Conclusion: HSS-IAD is expected to bridge the gap between existing datasets and real factory conditions; the dataset is publicly available. Abstract: Multi-class Unsupervised Anomaly Detection algorithms (MUAD) are receiving increasing attention due to their relatively low deployment costs and improved training efficiency. However, the real-world effectiveness of MUAD methods is questioned due to limitations in current Industrial Anomaly Detection (IAD) datasets. These datasets contain numerous classes that are unlikely to be produced by the same factory and fail to cover multiple structures or appearances. Additionally, the defects do not reflect real-world characteristics. Therefore, we introduce the Heterogeneous Same-Sort Industrial Anomaly Detection (HSS-IAD) dataset, which contains 8,580 images of metallic-like industrial parts and precise anomaly annotations. These parts exhibit variations in structure and appearance, with subtle defects that closely resemble the base materials. We also provide foreground images for synthetic anomaly generation. Finally, we evaluate popular IAD methods on this dataset under multi-class and class-separated settings, demonstrating its potential to bridge the gap between existing datasets and real factory conditions. The dataset is available at https://github.com/Qiqigeww/HSS-IAD-Dataset.

[29] Collaborative Perception Datasets for Autonomous Driving: A Review

Naibang Wang,Deyong Shang,Yan Gong,Xiaoxi Hu,Ziying Song,Lei Yang,Yuhan Huang,Xiaoyu Wang,Jianli Lu

Main category: cs.CV

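TL;DR: This paper is a review of collaborative perception datasets for autonomous driving. It is the first to summarize and compare existing resources systematically from a multi-dimensional perspective, and it points out future research directions.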

Details

Motivation: Collaborative perception can improve perception accuracy, safety, and robustness in autonomous driving, but the lack of a systematic summary and comparative analysis of existing datasets hinders effective resource use and the standardization of model evaluation. Method: Categorizes and compares existing collaborative perception datasets from a multi-dimensional perspective, covering cooperation paradigms, data sources, scenarios, sensor configurations, and supported tasks. Result: Provides a detailed comparative analysis and summarizes key challenges, including dataset scalability, diversity, domain adaptation, standardization, privacy, and the integration of large language models. Conclusion: The review supports collaborative perception research with resources, points out future research directions, and maintains a continuously updated online repository of datasets and literature. Abstract: Collaborative perception has attracted growing interest from academia and industry due to its potential to enhance perception accuracy, safety, and robustness in autonomous driving through multi-agent information fusion. With the advancement of Vehicle-to-Everything (V2X) communication, numerous collaborative perception datasets have emerged, varying in cooperation paradigms, sensor configurations, data sources, and application scenarios. However, the absence of systematic summarization and comparative analysis hinders effective resource utilization and standardization of model evaluation. As the first comprehensive review focused on collaborative perception datasets, this work reviews and compares existing resources from a multi-dimensional perspective. We categorize datasets based on cooperation paradigms, examine their data sources and scenarios, and analyze sensor modalities and supported tasks. A detailed comparative analysis is conducted across multiple dimensions. We also outline key challenges and future directions, including dataset scalability, diversity, domain adaptation, standardization, privacy, and the integration of large language models. To support ongoing research, we provide a continuously updated online repository of collaborative perception datasets and related literature: https://github.com/frankwnb/Collaborative-Perception-Datasets-for-Autonomous-Driving.

[30] Unsupervised Cross-Domain 3D Human Pose Estimation via Pseudo-Label-Guided Global Transforms

Jingjing Liu,Zhiyong Wang,Xinyu Fan,Amirhossein Dadashzadeh,Honghai Liu,Majid Mirmehdi

Main category: cs.CV

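TL;DR: Proposes a new framework that uses a global transformation module and a pseudo-label generation module to address domain shift in cross-scenario 3D human pose estimation.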

Details

Motivation: Existing methods lose performance in cross-scenario inference because of domain differences such as camera viewpoint and position. Method: Combines a Pseudo-Label Generation Module with a Global Transformation Module, uses a human-centered coordinate system to align pose positions across domains, and adds a Pose Augmentor to improve generalization. Result: Outperforms existing methods on several cross-dataset benchmarks and even surpasses the model trained on the target domain. Conclusion: The method effectively addresses domain adaptation in cross-scenario 3D pose estimation and delivers significant performance gains. Abstract: Existing 3D human pose estimation methods often suffer in performance when applied to cross-scenario inference, due to domain shifts in characteristics such as camera viewpoint, position, posture, and body size. Among these factors, camera viewpoints and locations have been shown to contribute significantly to the domain gap by influencing the global positions of human poses. To address this, we propose a novel framework that explicitly conducts global transformations between pose positions in the camera coordinate systems of source and target domains. We start with a Pseudo-Label Generation Module that is applied to the 2D poses of the target dataset to generate pseudo-3D poses. Then, a Global Transformation Module leverages a human-centered coordinate system as a novel bridging mechanism to seamlessly align the positional orientations of poses across disparate domains, ensuring consistent spatial referencing. To further enhance generalization, a Pose Augmentor is incorporated to address variations in human posture and body size. This process is iterative, allowing refined pseudo-labels to progressively improve guidance for domain adaptation. Our method is evaluated on various cross-dataset benchmarks, including Human3.6M, MPI-INF-3DHP, and 3DPW. The proposed method outperforms state-of-the-art approaches and even outperforms the target-trained model.

[31] SmartFreeEdit: Mask-Free Spatial-Aware Image Editing with Complex Instruction Understanding

Qianqian Sun,Jixiang Luo,Dell Zhang,Xuelong Li

Main category: cs.CV

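TL;DR: SmartFreeEdit is an end-to-end framework that combines a multimodal large language model with a hypergraph-enhanced inpainting architecture to perform precise, mask-free image editing from natural language instructions.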

Details

Motivation: Conventional methods struggle with spatial reasoning, precise region segmentation, and semantic consistency, especially in complex scenes. Method: Introduces region-aware tokens and a mask embedding paradigm to improve spatial understanding; designs a reasoning segmentation pipeline to optimize mask generation; and uses a hypergraph-augmented inpainting module to preserve structural integrity and semantic coherence. Result: On the Reason-Edit benchmark, SmartFreeEdit outperforms existing methods in segmentation accuracy, instruction adherence, and visual quality preservation. Conclusion: SmartFreeEdit addresses the problem of focusing only on local information and improves the global consistency of edited images. Abstract: Recent advancements in image editing have utilized large-scale multimodal models to enable intuitive, natural instruction-driven interactions. However, conventional methods still face significant challenges, particularly in spatial reasoning, precise region segmentation, and maintaining semantic consistency, especially in complex scenes. To overcome these challenges, we introduce SmartFreeEdit, a novel end-to-end framework that integrates a multimodal large language model (MLLM) with a hypergraph-enhanced inpainting architecture, enabling precise, mask-free image editing guided exclusively by natural language instructions. The key innovations of SmartFreeEdit include: (1) the introduction of region-aware tokens and a mask embedding paradigm that enhance the spatial understanding of complex scenes; (2) a reasoning segmentation pipeline designed to optimize the generation of editing masks based on natural language instructions; and (3) a hypergraph-augmented inpainting module that ensures the preservation of both structural integrity and semantic coherence during complex edits, overcoming the limitations of local-based image generation. Extensive experiments on the Reason-Edit benchmark demonstrate that SmartFreeEdit surpasses current state-of-the-art methods across multiple evaluation metrics, including segmentation accuracy, instruction adherence, and visual quality preservation, while addressing the issue of local information focus and improving global consistency in the edited image. Our project will be available at https://github.com/smileformylove/SmartFreeEdit.

[32] Self-Supervised Pre-training with Combined Datasets for 3D Perception in Autonomous Driving

Shumin Wang,Zhuoran Yang,Lidian Wang,Zhipeng Tang,Heng Li,Lehan Pan,Sha Zhang,Jie Peng,Jianmin Ji,Yanyong Zhang

Main category: cs.CV

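TL;DR: The paper proposes a self-supervised framework that pre-trains 3D perception models on large-scale unlabeled data and uses a prompt adapter to reduce dataset bias, significantly improving downstream task performance.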

Details

Motivation: Inspired by pre-trained models in NLP and 2D vision, the paper explores the potential of large-scale data pre-training for 3D perception in autonomous driving. Method: Proposes a self-supervised pre-training framework that learns 3D representations from unlabeled data, combined with a prompt adapter for domain adaptation. Result: The model improves significantly on 3D object detection, BEV segmentation, 3D object tracking, and occupancy prediction, and performance rises steadily as the data volume grows. Conclusion: The work demonstrates the continuing potential of large-scale data pre-training for 3D perception models in autonomous driving; the code will be open-sourced to encourage further research. Abstract: The significant achievements of pre-trained models leveraging large volumes of data in the fields of NLP and 2D vision inspire us to explore the potential of extensive data pre-training for 3D perception in autonomous driving. Toward this goal, this paper proposes to utilize massive unlabeled data from heterogeneous datasets to pre-train 3D perception models. We introduce a self-supervised pre-training framework that learns effective 3D representations from scratch on unlabeled data, combined with a prompt adapter based domain adaptation strategy to reduce dataset bias. The approach significantly improves model performance on downstream tasks such as 3D object detection, BEV segmentation, 3D object tracking, and occupancy prediction, and shows a steady performance increase as the training data volume scales up, demonstrating the potential to continually benefit 3D perception models for autonomous driving. We will release the source code to inspire further investigations in the community.

[33] NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results

Xin Li,Yeying Jin,Xin Jin,Zongwei Wu,Bingchen Li,Yufei Wang,Wenhan Yang,Yu Li,Zhibo Chen,Bihan Wen,Robby T. Tan,Radu Timofte,Qiyu Rong,Hongyuan Jing,Mengmeng Zhang,Jinglong Li,Xiangyu Lu,Yi Ren,Yuting Liu,Meng Zhang,Xiang Chen,Qiyuan Guan,Jiangxin Dong,Jinshan Pan,Conglin Gou,Qirui Yang,Fangpu Zhang,Yunlong Lin,Sixiang Chen,Guoxi Huang,Ruirui Lin,Yan Zhang,Jingyu Yang,Huanjing Yue,Jiyuan Chen,Qiaosi Yi,Hongjun Wang,Chenxi Xie,Shuai Li,Yuhui Wu,Kaiyi Ma,Jiakui Hu,Juncheng Li,Liwen Pan,Guangwei Gao,Wenjie Li,Zhenyu Jin,Heng Guo,Zhanyu Ma,Yubo Wang,Jinghua Wang,Wangzhi Xing,Anjusree Karnavar,Diqi Chen,Mohammad Aminul Islam,Hao Yang,Ruikun Zhang,Liyuan Pan,Qianhao Luo,Xin Cao,Han Zhou,Yan Min,Wei Dong,Jun Chen,Taoyi Wu,Weijia Dou,Yu Wang,Shengjie Zhao,Yongcheng Huang,Xingyu Han,Anyan Huang,Hongtao Wu,Hong Wang,Yefeng Zheng,Abhijeet Kumar,Aman Kumar,Marcos V. Conde,Paula Garrido,Daniel Feijoo,Juan C. Benito,Guanglu Dong,Xin Lin,Siyuan Liu,Tianheng Zheng,Jiayu Zhong,Shouyi Wang,Xiangtai Li,Lanqing Guo,Lu Qi,Chao Ren,Shuaibo Wang,Shilong Zhang,Wanyu Zhou,Yunze Wu,Qinzhong Tan,Jieyuan Pei,Zhuoxuan Li,Jiayu Wang,Haoyu Bian,Haoran Sun,Subhajit Paul,Ni Tang,Junhao Huang,Zihan Cheng,Hongyun Zhu,Yuehan Wu,Kaixin Deng,Hang Ouyang,Tianxin Xiao,Fan Yang,Zhizun Luo,Zeyu Xiao,Zhuoyuan Li,Nguyen Pham Hoang Le,An Dinh Thien,Son T. Luu,Kiet Van Nguyen,Ronghua Xu,Xianmin Tian,Weijian Zhou,Jiacheng Zhang,Yuqian Chen,Yihang Duan,Yujie Wu,Suresh Raikwar,Arsh Garg,Kritika,Jianhua Zheng,Xiaoshan Ma,Ruolin Zhao,Yongyu Yang,Yongsheng Liang,Guiming Huang,Qiang Li,Hongbin Zhang,Xiangyu Zheng,A. N. Rajagopalan

Main category: cs.CV

TL;DR: This paper reviews the NTIRE 2025 Challenge on day and night raindrop removal for dual-focused images, describing its dataset and participation.

Details

Motivation: To establish a new benchmark for raindrop removal under varying lighting and focus conditions. Method: Uses the Raindrop Clarity dataset, which covers diverse raindrop degradation types and is split into training, validation, and test subsets. Result: Of 361 participants, 32 teams submitted solutions, which achieved SOTA performance on the dataset. Conclusion: The challenge provides a powerful new benchmark for the raindrop removal task. Abstract: This paper reviews the NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images. This challenge received a wide range of impressive solutions, which were developed and evaluated using our collected real-world Raindrop Clarity dataset. Unlike existing deraining datasets, our Raindrop Clarity dataset is more diverse and challenging in degradation types and contents, which include day raindrop-focused, day background-focused, night raindrop-focused, and night background-focused degradations. This dataset is divided into three subsets for competition: 14,139 images for training, 240 images for validation, and 731 images for testing. The primary objective of this challenge is to establish a new and powerful benchmark for the task of removing raindrops under varying lighting and focus conditions. There were a total of 361 participants in the competition, and 32 teams submitted valid solutions and fact sheets for the final testing phase. These submissions achieved state-of-the-art (SOTA) performance on the Raindrop Clarity dataset. The project can be found at https://lixinustc.github.io/CVPR-NTIRE2025-RainDrop-Competition.github.io/.

[34] Post-pre-training for Modality Alignment in Vision-Language Foundation Models

Shin'ya Yamaguchi,Dewei Feng,Sekitoshi Kanai,Kazuki Adachi,Daiki Chijiwa

Main category: cs.CV

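TL;DR: CLIP-Refine is a post-pre-training method that reduces the modality gap in CLIP models with one epoch of training on a small dataset while preserving zero-shot performance.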

Details

Motivation: CLIP models have a modality gap in their multi-modal feature space, which hurts downstream task performance. Existing methods are costly or degrade zero-shot performance. Method: Proposes two techniques, random feature alignment (RaFA) and hybrid contrastive-distillation (HyCD), which align the feature space by training on small datasets. Result: Experiments show that CLIP-Refine reduces the modality gap and improves zero-shot performance. Conclusion: CLIP-Refine is an efficient post-pre-training method that addresses the modality gap without harming zero-shot performance. Abstract: Contrastive language image pre-training (CLIP) is an essential component of building modern vision-language foundation models. While CLIP demonstrates remarkable zero-shot performance on downstream tasks, the multi-modal feature spaces still suffer from a modality gap, which is a gap between image and text feature clusters and limits downstream task performance. Although existing works attempt to address the modality gap by modifying pre-training or fine-tuning, they struggle with heavy training costs with large datasets or degradations of zero-shot performance. This paper presents CLIP-Refine, a post-pre-training method for CLIP models at a phase between pre-training and fine-tuning. CLIP-Refine aims to align the feature space with 1 epoch of training on small image-text datasets without zero-shot performance degradations. To this end, we introduce two techniques: random feature alignment (RaFA) and hybrid contrastive-distillation (HyCD). RaFA aligns the image and text features to follow a shared prior distribution by minimizing the distance to random reference vectors sampled from the prior. HyCD updates the model with hybrid soft labels generated by combining ground-truth image-text pair labels and outputs from the pre-trained CLIP model. This helps the model both retain past knowledge and learn new knowledge to align features. Our extensive experiments with multiple classification and retrieval tasks show that CLIP-Refine succeeds in mitigating the modality gap and improving the zero-shot performance.
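
The hybrid soft labels described for HyCD can be illustrated with a simple convex combination of a one-hot ground-truth label and a teacher distribution. The abstract does not specify how the two are combined, so the mixing coefficient `alpha`, the function names, and the toy logits below are assumptions made for this sketch.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hybrid_soft_labels(teacher_logits, true_index, alpha=0.5):
    """Mix a one-hot ground-truth label with the teacher's softmax output.

    `alpha` is a hypothetical mixing weight: 1.0 gives pure ground truth,
    0.0 gives pure distillation from the frozen pre-trained model.
    """
    teacher = softmax(teacher_logits)
    return [alpha * (1.0 if j == true_index else 0.0) + (1.0 - alpha) * p
            for j, p in enumerate(teacher)]


# One image scored against two candidate captions; caption 0 is the true pair.
labels = hybrid_soft_labels([0.0, 0.0], true_index=0, alpha=0.5)
```

The result is still a valid probability distribution, so it can serve directly as the target of a cross-entropy loss on the student's similarity scores.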

[35] Mask Image Watermarking

Runyi Hu,Jie Zhang,Shiqian Zhao,Nils Lukas,Jiwei Li,Qing Guo,Han Qiu,Tianwei Zhang

Main category: cs.CV

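TL;DR: MaskMark is a simple, efficient, and flexible image watermarking framework with two variants: MaskMark-D supports global watermark embedding, watermark localization, and local watermark extraction, while MaskMark-ED focuses on local watermark embedding and extraction with stronger robustness in small regions. Built on the classical Encoder-Distortion-Decoder training paradigm, MaskMark adds a masking mechanism and a localization module to achieve high-performance watermark extraction and localization. Experiments show that MaskMark outperforms existing baselines on multiple tasks at low computational cost.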

Details

Motivation: Existing watermarking techniques underperform on global and local watermark embedding and extraction; the work also aims to improve robustness and flexibility. Method: Building on the Encoder-Distortion-Decoder paradigm, MaskMark-D introduces a masking mechanism in the decoding stage to support local extraction, and MaskMark-ED also adds the mask to the encoding stage to strengthen local robustness. Result: MaskMark achieves the best performance on global and local watermark extraction, localization, and multi-watermark embedding, at 1/15 the computational cost of WAM. Conclusion: MaskMark is an efficient, flexible, high-performance watermarking framework suited to a range of applications. Abstract: We present MaskMark, a simple, efficient and flexible framework for image watermarking. MaskMark has two variants: MaskMark-D, which supports global watermark embedding, watermark localization, and local watermark extraction for applications such as tamper detection, and MaskMark-ED, which focuses on local watermark embedding and extraction with enhanced robustness in small regions, enabling localized image protection. Built upon the classical Encoder-Distortion-Decoder training paradigm, MaskMark-D introduces a simple masking mechanism during the decoding stage to support both global and local watermark extraction. A mask is applied to the watermarked image before extraction, allowing the decoder to focus on selected regions and learn local extraction. A localization module is also integrated into the decoder to identify watermark regions during inference, reducing interference from irrelevant content and improving accuracy. MaskMark-ED extends this design by incorporating the mask into the encoding stage as well, guiding the encoder to embed the watermark in designated local regions for enhanced robustness. Comprehensive experiments show that MaskMark achieves state-of-the-art performance in global watermark extraction, local watermark extraction, watermark localization, and multi-watermark embedding. It outperforms all existing baselines, including the recent leading model WAM for local watermarking, while preserving high visual quality of the watermarked images. MaskMark is also flexible: by adjusting the distortion layer, it can adapt to different robustness requirements with just a few steps of fine-tuning. Moreover, our approach is efficient and easy to optimize, requiring only 20 hours on a single A6000 GPU with just 1/15 the computational cost of WAM.

[36] Privacy Protection Against Personalized Text-to-Image Synthesis via Cross-image Consistency Constraints

Guanyu Wang,Kailong Wang,Yihao Huang,Mingyi Zhou,Zhang Qing cnwatcher,Geguang Pu,Li Li

Main category: cs.CV

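TL;DR: The paper proposes a Cross-image Anti-Personalization (CAP) framework that strengthens privacy protection by enforcing style consistency across images, together with a dynamic ratio adjustment strategy that improves attack effectiveness.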

Details

Motivation: With the rapid progress of diffusion models and personalization techniques, public images can be used to generate highly realistic impersonations, which raises privacy concerns. Existing anti-personalization methods overlook the relationships among multiple images and so leave the potential of group-level privacy protection unused. Method: The CAP framework strengthens privacy protection through cross-image style consistency and introduces a dynamic ratio adjustment strategy to balance the influence of the consistency loss. Result: On the CelebHQ and VGGFace2 benchmarks, CAP substantially improves on existing methods. Conclusion: By taking a group-level view and exploiting multi-image relationships, CAP offers a more effective approach to anti-personalization. Abstract: The rapid advancement of diffusion models and personalization techniques has made it possible to recreate individual portraits from just a few publicly available images. While such capabilities empower various creative applications, they also introduce serious privacy concerns, as adversaries can exploit them to generate highly realistic impersonations. To counter these threats, anti-personalization methods have been proposed, which add adversarial perturbations to published images to disrupt the training of personalization models. However, existing approaches largely overlook the intrinsic multi-image nature of personalization and instead adopt a naive strategy of applying perturbations independently, as commonly done in single-image settings. This neglects the opportunity to leverage inter-image relationships for stronger privacy protection. Therefore, we advocate for a group-level perspective on privacy protection against personalization. Specifically, we introduce Cross-image Anti-Personalization (CAP), a novel framework that enhances resistance to personalization by enforcing style consistency across perturbed images. Furthermore, we develop a dynamic ratio adjustment strategy that adaptively balances the impact of the consistency loss throughout the attack iterations. Extensive experiments on the classical CelebHQ and VGGFace2 benchmarks show that CAP substantially improves existing methods.

[37] LAD-Reasoner: Tiny Multimodal Models are Good Reasoners for Logical Anomaly Detection

Weijia Li,Guanglei Chu,Jiong Chen,Guo-Sen Xie,Caifeng Shan,Fang Zhao

Main category: cs.CV

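TL;DR: The paper proposes a new task, RLAD, and a framework, LAD-Reasoner, which extend traditional anomaly detection with logical reasoning and use a small multimodal language model for efficient, interpretable detection.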

Details

Motivation: Existing industrial anomaly detection methods rely on complex modules or pipeline designs, which limits practical deployment and interpretability, so a more efficient solution is needed. Method: A two-stage training paradigm first applies SFT for fine-grained visual understanding and then GRPO to refine logical anomaly detection and reasoning, with no need to build CoT data. Result: On the MVTec LOCO AD dataset, LAD-Reasoner matches the performance of a much larger model and produces more concise, interpretable rationales. Conclusion: LAD-Reasoner reduces reliance on large models and complex pipelines and provides transparent, interpretable logical anomaly detection. Abstract: Recent advances in industrial anomaly detection have highlighted the need for deeper logical anomaly analysis, where unexpected relationships among objects, counts, and spatial configurations must be identified and explained. Existing approaches often rely on large-scale external reasoning modules or elaborate pipeline designs, hindering practical deployment and interpretability. To address these limitations, we introduce a new task, Reasoning Logical Anomaly Detection (RLAD), which extends traditional anomaly detection by incorporating logical reasoning. We propose a new framework, LAD-Reasoner, a customized tiny multimodal language model built on Qwen2.5-VL 3B. Our approach leverages a two-stage training paradigm that first employs Supervised Fine-Tuning (SFT) for fine-grained visual understanding, followed by Group Relative Policy Optimization (GRPO) to refine logical anomaly detection and enforce coherent, human-readable reasoning. Crucially, reward signals are derived from both the detection accuracy and the structural quality of the outputs, obviating the need for building chain of thought (CoT) reasoning data. Experiments on the MVTec LOCO AD dataset show that LAD-Reasoner, though significantly smaller, matches the performance of Qwen2.5-VL-72B in accuracy and F1 score, and further excels in producing concise and interpretable rationales. This unified design reduces reliance on large models and complex pipelines, while offering transparent and interpretable insights into logical anomaly detection. Code and data will be released.
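
As a rough illustration of the reward design mentioned above, the sketch below combines a detection-accuracy term with a structural-quality term and then computes group-relative advantages in the way GRPO is commonly described: each sampled response's reward is normalized by the mean and standard deviation of its group. The reward weights and the `Reasoning:` / `Answer:` output format are hypothetical; the source does not specify them.

```python
import statistics

def reward(predicted_label, true_label, output_text):
    """Hypothetical reward: 1.0 for a correct detection plus 0.5 if the
    output follows an assumed 'Reasoning: ... Answer: ...' structure."""
    accuracy = 1.0 if predicted_label == true_label else 0.0
    well_formed = output_text.startswith("Reasoning:") and "Answer:" in output_text
    return accuracy + (0.5 if well_formed else 0.0)

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses for one image whose true label is "anomaly".
samples = [
    ("anomaly", "Reasoning: two pushpins share a slot. Answer: anomaly"),
    ("normal", "Reasoning: all items present. Answer: normal"),
    ("anomaly", "anomaly"),
    ("normal", "normal"),
]
rewards = [reward(label, "anomaly", text) for label, text in samples]
advantages = group_relative_advantages(rewards)
```

Since advantages are computed relative to the group, no learned value model and no annotated reasoning traces are required; a response only needs to score better than its siblings to be reinforced.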

[38] Stronger, Steadier & Superior: Geometric Consistency in Depth VFM Forges Domain Generalized Semantic Segmentation

Siyu Chen,Ting Han,Changshe Zhang,Xin Luo,Meiliu Wu,Guorong Cai,Jinhe Su

Main category: cs.CV

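TL;DR: DepthForge is a new DGSS framework that integrates vision foundation models (VFMs) with depth information, significantly improving generalization performance and geometric consistency.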

Details

Motivation: Visual cues are easily disturbed, whereas geometric information such as depth is more stable, so the paper studies how to combine depth information with VFMs to improve performance. Method: The DepthForge framework combines visual features from DINOv2 or EVA02 with depth information from Depth Anything V2, and introduces depth-aware learnable tokens and a depth refinement decoder. Result: The method performs strongly across multiple DGSS settings and datasets, especially under extreme conditions such as night and snow. Conclusion: DepthForge uses depth information to strengthen the geometric consistency of VFMs, significantly improving generalization and performance. Abstract: Vision Foundation Models (VFMs) have delivered remarkable performance in Domain Generalized Semantic Segmentation (DGSS). However, recent methods often overlook the fact that visual cues are susceptible, whereas the underlying geometry remains stable, rendering depth information more robust. In this paper, we investigate the potential of integrating depth information with features from VFMs, to improve the geometric consistency within an image and boost the generalization performance of VFMs. We propose a novel fine-tuning DGSS framework, named DepthForge, which integrates the visual cues from frozen DINOv2 or EVA02 and depth cues from frozen Depth Anything V2. In each layer of the VFMs, we incorporate depth-aware learnable tokens to continuously decouple domain-invariant visual and spatial information, thereby enhancing depth awareness and attention of the VFMs. Finally, we develop a depth refinement decoder and integrate it into the model architecture to adaptively refine multi-layer VFM features and depth-aware learnable tokens. Extensive experiments are conducted based on various DGSS settings and five different datasets as unseen target domains. The qualitative and quantitative results demonstrate that our method significantly outperforms alternative approaches with stronger performance, steadier visual-spatial attention, and superior generalization ability. In particular, DepthForge exhibits outstanding performance under extreme conditions (e.g., night and snow). Code is available at https://github.com/anonymouse-xzrptkvyqc/DepthForge.

[39] Set You Straight: Auto-Steering Denoising Trajectories to Sidestep Unwanted Concepts

Leyang Li,Shilin Lu,Yan Ren,Adams Wai-Kin Kong

Main category: cs.CV

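TL;DR: ANT is a new finetuning framework that automatically steers denoising trajectories away from harmful or inappropriate content, addressing the limitations of existing concept erasure methods.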

Details

Motivation: Text-to-image models must be deployed ethically, which requires preventing the generation of harmful or inappropriate content. Method: By reversing the condition direction of classifier-free guidance, ANT derives a trajectory-aware objective that preserves the integrity of the early-stage score function field, and it uses an augmentation-enhanced weight saliency map to precisely identify critical parameters. Result: ANT achieves state-of-the-art results in both single-concept and multi-concept erasure while producing high-quality, safe outputs. Conclusion: ANT is an efficient and general method that significantly improves concept erasure while maintaining generative fidelity. Abstract: Ensuring the ethical deployment of text-to-image models requires effective techniques to prevent the generation of harmful or inappropriate content. While concept erasure methods offer a promising solution, existing finetuning-based approaches suffer from notable limitations. Anchor-free methods risk disrupting sampling trajectories, leading to visual artifacts, while anchor-based methods rely on the heuristic selection of anchor concepts. To overcome these shortcomings, we introduce a finetuning framework, dubbed ANT, which Automatically guides deNoising Trajectories to avoid unwanted concepts. ANT is built on a key insight: reversing the condition direction of classifier-free guidance during mid-to-late denoising stages enables precise content modification without sacrificing early-stage structural integrity. This inspires a trajectory-aware objective that preserves the integrity of the early-stage score function field, which steers samples toward the natural image manifold, without relying on heuristic anchor concept selection. For single-concept erasure, we propose an augmentation-enhanced weight saliency map to precisely identify the critical parameters that most significantly contribute to the unwanted concept, enabling more thorough and efficient erasure. For multi-concept erasure, our objective function offers a versatile plug-and-play solution that significantly boosts performance. Extensive experiments demonstrate that ANT achieves state-of-the-art results in both single and multi-concept erasure, delivering high-quality, safe outputs without compromising the generative fidelity. Code is available at https://github.com/lileyang1210/ANT
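
The key insight stated above, reversing the condition direction of classifier-free guidance only in the mid-to-late denoising stages, can be sketched at sampling time as follows. This illustrates the insight, not ANT's finetuning objective; the switch point `reverse_after` and the function name are assumptions. Standard classifier-free guidance combines the two noise predictions as `eps_uncond + scale * (eps_cond - eps_uncond)`, and the sketch flips the sign of the conditional direction once the early steps are done.

```python
def guided_noise(eps_uncond, eps_cond, scale, step, total_steps, reverse_after=0.3):
    """Classifier-free guidance whose condition direction flips after an
    early phase.

    step runs from 0 (noisiest) to total_steps - 1 (cleanest).
    reverse_after is a hypothetical fraction of the trajectory during
    which the usual guidance direction is kept.
    """
    progress = step / max(total_steps - 1, 1)
    sign = 1.0 if progress < reverse_after else -1.0
    return [u + sign * scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Early step: steer toward the condition. Late step: steer away from it.
early = guided_noise([0.0], [1.0], scale=2.0, step=0, total_steps=50)
late = guided_noise([0.0], [1.0], scale=2.0, step=49, total_steps=50)
```

Keeping the usual direction in the earliest steps leaves the coarse image structure untouched, which is the reason the abstract gives for restricting the reversal to later stages.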

[40] EarthGPT-X: Enabling MLLMs to Flexibly and Comprehensively Understand Multi-Source Remote Sensing Imagery

Wei Zhang,Miaoxin Cai,Yaqian Ning,Tong Zhang,Yin Zhuang,He Chen,Jun Li,Xuerui Mao

Main category: cs.CV

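TL;DR: The paper proposes EarthGPT-X, a spatial multimodal large language model (MLLM) for understanding and interacting with multi-source imagery in remote sensing (RS).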

Details

Motivation: Remote sensing images differ substantially from natural images, so existing natural-image spatial models are hard to apply directly to the RS domain; in addition, current RS MLLMs offer limited interaction modes and interpretation levels, which restricts their practical use. Method: The authors develop a multi-modal content integration method, a cross-domain one-stage fusion training strategy, and a pixel perception module, and unify referring and grounding tasks in a visual prompting framework. Result: Experiments show that EarthGPT-X performs strongly on multi-grained tasks and interacts flexibly across modalities, advancing MLLMs in the RS field. Conclusion: With its new methods and unified framework, EarthGPT-X improves the understanding of and interaction with remote sensing imagery and suggests new directions for MLLM applications in RS. Abstract: Recent advances in the visual-language area have developed natural multi-modal large language models (MLLMs) for spatial reasoning through visual prompting. However, due to remote sensing (RS) imagery containing abundant geospatial information that differs from natural images, it is challenging to effectively adapt natural spatial models to the RS domain. Moreover, current RS MLLMs are limited in overly narrow interpretation levels and interaction manner, hindering their applicability in real-world scenarios. To address those challenges, a spatial MLLM named EarthGPT-X is proposed, enabling a comprehensive understanding of multi-source RS imagery, such as optical, synthetic aperture radar (SAR), and infrared. EarthGPT-X offers zoom-in and zoom-out insight, and possesses flexible multi-grained interactive abilities. Moreover, EarthGPT-X unifies two types of critical spatial tasks (i.e., referring and grounding) into a visual prompting framework. To achieve these versatile capabilities, several key strategies are developed. The first is the multi-modal content integration method, which enhances the interplay between images, visual prompts, and text instructions. Subsequently, a cross-domain one-stage fusion training strategy is proposed, utilizing the large language model (LLM) as a unified interface for multi-source multi-task learning. Furthermore, by incorporating a pixel perception module, the referring and grounding tasks are seamlessly unified within a single framework. In addition, the experiments conducted demonstrate the superiority of the proposed EarthGPT-X in multi-grained tasks and its impressive flexibility in multi-modal interaction, revealing significant advancements of MLLM in the RS field.

[41] TSGS: Improving Gaussian Splatting for Transparent Surface Reconstruction via Normal and De-lighting Priors

Mingwei Li,Pu Pang,Hehe Fan,Hua Huang,Yi Yang

Main category: cs.CV

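TL;DR: TSGS separates geometry learning from appearance refinement to resolve the transparency-depth dilemma in transparent surface reconstruction, significantly improving geometric accuracy and rendering quality.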

Details

Motivation: Transparent surface reconstruction is essential for tasks such as robotic manipulation, but existing methods such as 3DGS produce large depth estimation errors on transparent materials. Method: TSGS works in two stages: the geometry learning stage uses specular-suppressed inputs, and the appearance refinement stage improves visual fidelity through anisotropic specular modeling; depth is extracted with a sliding-window method. Result: On the TransLab dataset, TSGS significantly outperforms existing methods, reducing chamfer distance by 37.3% and improving F1 score by 8.0%. Conclusion: TSGS achieves accurate geometric reconstruction and realistic rendering of transparent objects within the 3DGS framework; the code and dataset will be released. Abstract: Reconstructing transparent surfaces is essential for tasks such as robotic manipulation in labs, yet it poses a significant challenge for 3D reconstruction techniques like 3D Gaussian Splatting (3DGS). These methods often encounter a transparency-depth dilemma, where the pursuit of photorealistic rendering through standard $\alpha$-blending undermines geometric precision, resulting in considerable depth estimation errors for transparent materials. To address this issue, we introduce Transparent Surface Gaussian Splatting (TSGS), a new framework that separates geometry learning from appearance refinement. In the geometry learning stage, TSGS focuses on geometry by using specular-suppressed inputs to accurately represent surfaces. In the second stage, TSGS improves visual fidelity through anisotropic specular modeling, crucially maintaining the established opacity to ensure geometric accuracy. To enhance depth inference, TSGS employs a first-surface depth extraction method. This technique uses a sliding window over $\alpha$-blending weights to pinpoint the most likely surface location and calculates a robust weighted average depth. To evaluate the transparent surface reconstruction task under realistic conditions, we collect a TransLab dataset that includes complex transparent laboratory glassware. Extensive experiments on TransLab show that TSGS achieves accurate geometric reconstruction and realistic rendering of transparent objects simultaneously within the efficient 3DGS framework. Specifically, TSGS significantly surpasses current leading methods, achieving a 37.3% reduction in chamfer distance and an 8.0% improvement in F1 score compared to the top baseline. The code and dataset will be released at https://longxiang-ai.github.io/TSGS/.
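
The sliding-window depth extraction described in the abstract can be sketched for a single ray. The blending weights follow standard front-to-back $\alpha$-compositing, where sample $i$ receives weight $\alpha_i \prod_{j<i}(1-\alpha_j)$. The window size and the rule for picking the window (largest summed weight, earliest on ties) are assumptions; the source only says that the window pinpoints the most likely surface location.

```python
def blending_weights(alphas):
    """Front-to-back alpha-compositing weights for samples sorted by depth."""
    weights, transmittance = [], 1.0
    for a in alphas:
        weights.append(a * transmittance)
        transmittance *= 1.0 - a
    return weights

def windowed_depth(depths, alphas, window=2):
    """Weighted average depth inside the window with the largest summed
    blending weight (hypothetical selection rule; earliest window on ties)."""
    weights = blending_weights(alphas)
    n = len(weights)
    window = min(window, n)
    best_start, best_sum = 0, -1.0
    for start in range(n - window + 1):
        s = sum(weights[start:start + window])
        if s > best_sum:
            best_start, best_sum = start, s
    if best_sum <= 0.0:
        return None  # the ray hit nothing opaque enough to define a surface
    w = weights[best_start:best_start + window]
    d = depths[best_start:best_start + window]
    return sum(wi * di for wi, di in zip(w, d)) / best_sum

# A glass wall near depth 1.0 in front of an opaque background near 3.0.
depths = [1.0, 1.1, 2.0, 2.5, 3.0]
alphas = [0.4, 0.4, 0.1, 0.1, 0.5]
surface_depth = windowed_depth(depths, alphas)
```

In this example the average over all samples would land between the glass and the background, while the windowed estimate stays on the glass, which is the kind of depth error on transparent materials the method targets.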

[42] Hybrid Dense-UNet201 Optimization for Pap Smear Image Segmentation Using Spider Monkey Optimization

Ach Khozaimi,Isnani Darti,Syaiful Anam,Wuryansari Muharini Kusumawinahyu

Main category: cs.CV

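TL;DR: The paper proposes Dense-UNet201, a hybrid model that combines DenseNet201 and U-Net and is optimized with the spider monkey optimization (SMO) algorithm, for Pap smear image segmentation, with significantly improved performance.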

Details

Motivation: Traditional segmentation models struggle with the complex structures and variations in Pap smear images, so a more effective method is needed. Method: A pretrained DenseNet201 serves as the encoder of the U-Net, and the model is optimized with a modified SMO algorithm. Result: Dense-UNet201 performs well on the SIPaKMeD dataset, reaching 96.16% accuracy, 91.63% IoU, and a 95.63% Dice coefficient. Conclusion: The work demonstrates the effectiveness of image preprocessing, pretrained models, and metaheuristic optimization in medical image analysis and offers new insights into cervical cell segmentation. Abstract: Pap smear image segmentation is crucial for cervical cancer diagnosis. However, traditional segmentation models often struggle with complex cellular structures and variations in pap smear images. This study proposes a hybrid Dense-UNet201 optimization approach that integrates a pretrained DenseNet201 as the encoder for the U-Net architecture and optimizes it using the spider monkey optimization (SMO) algorithm. The Dense-UNet201 model excelled at feature extraction. The SMO was modified to handle categorical and discrete parameters. The SIPaKMeD dataset was used in this study and evaluated using key performance metrics, including loss, accuracy, Intersection over Union (IoU), and Dice coefficient. The experimental results showed that Dense-UNet201 outperformed U-Net, Res-UNet50, and Efficient-UNetB0. SMO Dense-UNet201 achieved a segmentation accuracy of 96.16%, an IoU of 91.63%, and a Dice coefficient score of 95.63%. These findings underscore the effectiveness of image preprocessing, pretrained models, and metaheuristic optimization in improving medical image analysis and provide new insights into cervical cell segmentation methods.
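
For readers less familiar with the two overlap metrics reported above, the sketch below computes IoU and the Dice coefficient on flattened binary masks. It is a generic illustration and not the study's evaluation code. For a single pair of masks the two metrics are linked by Dice = 2·IoU / (1 + IoU), and the reported figures are consistent with that identity: an IoU of 91.63% corresponds to a Dice score of about 95.63%.

```python
def iou_and_dice(pred, target):
    """IoU and Dice for two flattened binary masks of equal length."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_n = sum(1 for p in pred if p)
    target_n = sum(1 for t in target if t)
    union = pred_n + target_n - inter
    # Two empty masks agree perfectly, so both metrics are defined as 1.0.
    iou = inter / union if union else 1.0
    dice = 2 * inter / (pred_n + target_n) if (pred_n + target_n) else 1.0
    return iou, dice

def dice_from_iou(iou):
    """Identity linking the two metrics for a single mask pair."""
    return 2 * iou / (1 + iou)

iou, dice = iou_and_dice([1, 1, 1, 0], [1, 1, 0, 1])
reported_check = dice_from_iou(0.9163)
```

Note that the identity is exact only per mask pair; when both metrics are averaged over a dataset, the relation holds approximately.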

[43] Saliency-Aware Diffusion Reconstruction for Effective Invisible Watermark Removal

Inzamamul Alam,Md Tanvir Islam,Simon S. Woo

Main category: cs.CV

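TL;DR: This paper proposes SADRE, a new framework that combines adaptive noise injection with diffusion-based reconstruction to remove watermarks from digital content while preserving image quality.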

Details

Motivation: Existing watermark embedding techniques lack robustness and cannot keep pace with the growing volume of digital content, which motivates stronger watermark removal methods. Method: SADRE combines adaptive noise injection, region-specific perturbations, and diffusion-based reconstruction; saliency masks guide the noise injection, and a reverse diffusion process restores the image. Result: Experiments show that SADRE removes watermarks effectively across diverse scenarios and balances watermark disruption against image quality better than existing techniques. Conclusion: SADRE offers a flexible and reliable solution for watermark removal and sets a new benchmark for the field. Abstract: As digital content becomes increasingly ubiquitous, the need for robust watermark removal techniques has grown due to the inadequacy of existing embedding techniques, which lack robustness. This paper introduces a novel Saliency-Aware Diffusion Reconstruction (SADRE) framework for watermark elimination on the web, combining adaptive noise injection, region-specific perturbations, and advanced diffusion-based reconstruction. SADRE disrupts embedded watermarks by injecting targeted noise into latent representations guided by saliency masks while preserving essential image features. A reverse diffusion process ensures high-fidelity image restoration, leveraging adaptive noise levels determined by watermark strength. Our framework is theoretically grounded with stability guarantees and achieves robust watermark removal across diverse scenarios. Empirical evaluations on state-of-the-art (SOTA) watermarking techniques demonstrate SADRE's superiority in balancing watermark disruption and image quality. SADRE sets a new benchmark for watermark elimination, offering a flexible and reliable solution for real-world web content. Code is available at https://github.com/inzamamulDU/SADRE.

[44] TwoSquared: 4D Generation from 2D Image Pairs

Lu Sang,Zehranaz Canfes,Dongliang Cao,Riccardo Marin,Florian Bernard,Daniel Cremers

Main category: cs.CV

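TL;DR: TwoSquared decomposes 4D dynamic object generation into two steps: 1) an image-to-3D module based on an existing 3D generative model, and 2) a physically inspired deformation module that predicts intermediate motion.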

Details

Motivation: Despite remarkable progress in generative AI, 4D dynamic object generation remains challenging, mainly because high-quality training data is limited and computing requirements are heavy. Method: Starting from two 2D RGB images, TwoSquared splits the problem into an image-to-3D module and a physically inspired deformation module. Result: Experiments show that TwoSquared produces texture-consistent and geometry-consistent 4D sequences from 2D images alone. Conclusion: TwoSquared needs no templates or object-class-specific priors, accepts in-the-wild images as input, and successfully addresses 4D generation. Abstract: Despite the astonishing progress in generative AI, 4D dynamic object generation remains an open challenge. With limited high-quality training data and heavy computing requirements, the combination of hallucinating unseen geometry together with unseen movement poses great challenges to generative models. In this work, we propose TwoSquared as a method to obtain a 4D physically plausible sequence starting from only two 2D RGB images corresponding to the beginning and end of the action. Instead of directly solving the 4D generation problem, TwoSquared decomposes the problem into two steps: 1) an image-to-3D module generation based on the existing generative model trained on high-quality 3D assets, and 2) a physically inspired deformation module to predict intermediate movements. To this end, our method does not require templates or object-class-specific prior knowledge and can take in-the-wild images as input. In our experiments, we demonstrate that TwoSquared is capable of producing texture-consistent and geometry-consistent 4D sequences only given 2D images.

[45] Image-Editing Specialists: An RLAIF Approach for Diffusion Models

Elior Benarous,Yilun Du,Heng Yang

Main category: cs.CV

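TL;DR: The paper proposes a reinforcement-learning-based training method for image-editing diffusion models that needs no extensive human annotation and significantly improves the realism and semantic alignment of edits.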

Details

Motivation: Image editing must preserve structure and align semantically with the prompt, and the work aims to do so with less reliance on extensive human annotation. Method: An online reinforcement learning framework, combined with visual prompts, enables precise and structurally coherent edits. Result: The models achieve high-fidelity edits in complex scenes with only a few reference images and training steps. Conclusion: The method proves efficient and versatile in both image editing and robotics simulation. Abstract: We present a novel approach to training specialized instruction-based image-editing diffusion models, addressing key challenges in structural preservation with input images and semantic alignment with user prompts. We introduce an online reinforcement learning framework that aligns the diffusion model with human preferences without relying on extensive human annotations or curating a large dataset. Our method significantly improves the realism and alignment with instructions in two ways. First, the proposed models achieve precise and structurally coherent modifications in complex scenes while maintaining high fidelity in instruction-irrelevant areas. Second, they capture fine nuances in the desired edit by leveraging a visual prompt, enabling detailed control over visual edits without lengthy textual prompts. This approach simplifies users' efforts to achieve highly specific edits, requiring only 5 reference images depicting a certain concept for training. Experimental results demonstrate that our models can perform intricate edits in complex scenes, after just 10 training steps. Finally, we showcase the versatility of our method by applying it to robotics, where enhancing the visual realism of simulated environments through targeted sim-to-real image edits improves their utility as proxies for real-world settings.

[46] High-Fidelity Image Inpainting with Multimodal Guided GAN Inversion

Libo Zhang,Yongsheng Yu,Jiali Yao,Heng Fan

Main category: cs.CV

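TL;DR: MMInvertFill is a new GAN inversion method for image inpainting that uses a multimodal guided encoder and the F&W+ latent space to address two problems in existing methods: inconsistency in unmasked regions and single-modality input.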

Details

Motivation: Existing GAN inversion methods ignore the hard constraint on unmasked regions and consider only a single input modality, which degrades performance. Method: The paper proposes a multimodal guided encoder and the F&W+ latent space, combined with a pre-modulation and a Soft-update Mean Latent module, to improve structure and texture generation. Result: On six datasets, MMInvertFill outperforms existing methods both qualitatively and quantitatively and completes out-of-domain images effectively. Conclusion: Through multimodal guidance and latent-space design, MMInvertFill significantly improves inpainting performance and generalization. Abstract: Generative Adversarial Network (GAN) inversion has demonstrated excellent performance in image inpainting that aims to restore lost or damaged image texture using its unmasked content. Previous GAN inversion-based methods usually utilize well-trained GAN models as effective priors to generate the realistic regions for missing holes. Despite excellence, they ignore a hard constraint that the unmasked regions in the input and the output should be the same, resulting in a gap between GAN inversion and image inpainting and thus degrading the performance. Besides, existing GAN inversion approaches often consider a single modality of the input image, neglecting other auxiliary cues in images for improvements. Addressing these problems, we propose a novel GAN inversion approach, dubbed MMInvertFill, for image inpainting. MMInvertFill contains primarily a multimodal guided encoder with a pre-modulation and a GAN generator with F&W+ latent space. Specifically, the multimodal encoder aims to enhance the multi-scale structures with additional semantic segmentation edge texture modalities through a gated mask-aware attention module. Afterwards, a pre-modulation is presented to encode these structures into style vectors. To mitigate issues of conspicuous color discrepancy and semantic inconsistency, we introduce the F&W+ latent space to bridge the gap between GAN inversion and image inpainting. Furthermore, in order to reconstruct faithful and photorealistic images, we devise a simple yet effective Soft-update Mean Latent module to capture more diversified in-domain patterns for generating high-fidelity textures for massive corruptions. In our extensive experiments on six challenging datasets, we show that our MMInvertFill qualitatively and quantitatively outperforms other state-of-the-arts and it supports the completion of out-of-domain images effectively.

[47] Computer-Aided Design of Personalized Occlusal Positioning Splints Using Multimodal 3D Data

Agnieszka Anna Tomaka,Leszek Luchowski,Michał Tarnawski,Dariusz Pojda

Main category: cs.CV

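TL;DR: This paper presents a method for designing customized occlusal splints with computer-aided design and digital technology, using 3D modeling and virtual embossing for precise positioning, and evaluates its accuracy.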

Details

Motivation: The work connects clinical needs with digital dental practice to address the precision and personalization problems of traditional occlusal splint design. Method: Splints are generated from 3D models and a transformation matrix, virtual embossing resolves surface conflicts, and the transformation matrix is acquired through clinical tools and intraoral devices. Result: Profile and surface deviation analysis shows that the designed splints are highly accurate, enabling reproducible, patient-specific splint fabrication. Conclusion: The method offers a new approach to precise splint design and multimodal image registration. Abstract: Contemporary digital technology has a pivotal role in the design of customized medical appliances, including occlusal splints used in the treatment of stomatognathic system dysfunctions. We present an approach to computer-aided design and precision assessment of positioning occlusal splints, bridging clinical concepts with current digital dental practice. In our model, a 3D splint is generated based on a transformation matrix that represents the therapeutic change in mandibular position, defined by a specialist using a virtual patient model reconstructed from intraoral scans, CBCT, 3D facial scans and plaster model digitisation. The paper introduces a novel method for generating splints that accurately reproduce occlusal conditions in the therapeutic position, including a mechanism for resolving surface conflicts through virtual embossing. We demonstrate how transformation matrices can be acquired through clinical tools and intraoral devices, and evaluate the accuracy of the designed and printed splints using profile and surface deviation analysis. The proposed method enables reproducible, patient-specific splint fabrication and opens new possibilities in diagnostics, multimodal image registration and quantification of occlusal discrepancies.

[48] SC3EF: A Joint Self-Correlation and Cross-Correspondence Estimation Framework for Visible and Thermal Image Registration

Xi Tong,Xing Luo,Jiangxin Yang,Yanpeng Cao

Main category: cs.CV

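TL;DR: The paper proposes a novel joint Self-Correlation and Cross-Correspondence Estimation Framework (SC3EF) for RGB-T image registration that combines local features with global contextual cues, significantly improving registration accuracy.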

Details

Motivation: Multispectral imaging is critical for intelligent transportation, but the modality differences between visible and thermal images make accurate registration difficult. Method: A convolution-transformer-based pipeline extracts local features and encodes global correlations, and a hierarchical optical flow estimation decoder progressively refines the dense correspondence maps. Result: The method outperforms existing approaches on representative RGB-T datasets and performs well in challenging scenarios such as large parallax, occlusion, and adverse weather. Conclusion: SC3EF is effective and generalizes well for multimodal image registration in complex scenes. Abstract: Multispectral imaging plays a critical role in a range of intelligent transportation applications, including advanced driver assistance systems (ADAS), traffic monitoring, and night vision. However, accurate visible and thermal (RGB-T) image registration poses a significant challenge due to the considerable modality differences. In this paper, we present a novel joint Self-Correlation and Cross-Correspondence Estimation Framework (SC3EF), leveraging both local representative features and global contextual cues to effectively generate RGB-T correspondences. For this purpose, we design a convolution-transformer-based pipeline to extract local representative features and encode global correlations of intra-modality for inter-modality correspondence estimation between unaligned visible and thermal images. After merging the local and global correspondence estimation results, we further employ a hierarchical optical flow estimation decoder to progressively refine the estimated dense correspondence maps. Extensive experiments demonstrate the effectiveness of our proposed method, outperforming the current state-of-the-art (SOTA) methods on representative RGB-T datasets. Furthermore, it also shows competitive generalization capabilities across challenging scenarios, including large parallax, severe occlusions, adverse weather, and other cross-modal datasets (e.g., RGB-N and RGB-D).

[49] Tree-NeRV: A Tree-Structured Neural Representation for Efficient Non-Uniform Video Encoding

Jiancheng Zhao,Yifan Zhan,Qingtian Zhu,Mingze Ma,Muyao Niu,Zunian Wan,Xiang Ji,Yinqiang Zheng

Main category: cs.CV

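TL;DR: Tree-NeRV is a tree-structured video encoding method that improves compression efficiency and reconstruction quality through non-uniform sampling and a dynamic optimization strategy.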

Details

Motivation: Existing NeRV methods do not fully exploit temporal redundancy, which leads to suboptimal rate-distortion performance. Method: Tree-NeRV organizes feature representations in a binary search tree and introduces an optimization-driven dynamic sampling strategy. Result: Experiments show that Tree-NeRV surpasses prior uniform-sampling methods in compression efficiency and reconstruction quality. Conclusion: Non-uniform sampling and dynamic optimization significantly improve video encoding performance. Abstract: Implicit Neural Representations for Videos (NeRV) have emerged as a powerful paradigm for video representation, enabling direct mappings from frame indices to video frames. However, existing NeRV-based methods do not fully exploit temporal redundancy, as they rely on uniform sampling along the temporal axis, leading to suboptimal rate-distortion (RD) performance. To address this limitation, we propose Tree-NeRV, a novel tree-structured feature representation for efficient and adaptive video encoding. Unlike conventional approaches, Tree-NeRV organizes feature representations within a Binary Search Tree (BST), enabling non-uniform sampling along the temporal axis. Additionally, we introduce an optimization-driven sampling strategy, dynamically allocating higher sampling density to regions with greater temporal variation. Extensive experiments demonstrate that Tree-NeRV achieves superior compression efficiency and reconstruction quality, outperforming prior uniform sampling-based methods. Code will be released.
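
To make the idea of non-uniform temporal sampling concrete, the sketch below stores feature vectors at irregularly spaced timestamps and answers a query by locating the two neighboring timestamps with a binary search. Two simplifications are assumptions, not details from the paper: a sorted list searched with `bisect` stands in for the binary search tree, and neighboring features are combined by linear interpolation.

```python
import bisect

class TemporalFeatureStore:
    """Features stored at non-uniform timestamps, kept in sorted order."""

    def __init__(self):
        self.times = []
        self.features = []

    def insert(self, t, feature):
        """Insert a feature at time t, preserving timestamp order."""
        i = bisect.bisect_left(self.times, t)
        self.times.insert(i, t)
        self.features.insert(i, list(feature))

    def query(self, t):
        """Interpolate between the two stored features that bracket t."""
        if not self.times:
            raise ValueError("store is empty")
        i = bisect.bisect_left(self.times, t)
        if i == 0:
            return list(self.features[0])
        if i == len(self.times):
            return list(self.features[-1])
        t0, t1 = self.times[i - 1], self.times[i]
        w = (t - t0) / (t1 - t0)
        return [(1 - w) * a + w * b
                for a, b in zip(self.features[i - 1], self.features[i])]

# Dense samples where the video changes quickly, sparse where it is static.
store = TemporalFeatureStore()
for t, f in [(0.0, [0.0]), (0.1, [1.0]), (1.0, [10.0])]:
    store.insert(t, f)
mid_fast = store.query(0.05)
mid_slow = store.query(0.55)
```

A sampling strategy like the one described would add timestamps where reconstruction error is high, so storage is spent on segments with large temporal variation rather than spread evenly.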

[50] Second-order Optimization of Gaussian Splats with Importance Sampling

Hamza Pehlivan,Andrea Boscolo Camiletto,Lin Geng Foo,Marc Habermann,Christian Theobalt

Main category: cs.CV

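TL;DR: The paper proposes a second-order optimization strategy based on Levenberg-Marquardt and Conjugate Gradient that significantly speeds up the training of 3D Gaussian Splatting.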

Details

Motivation: 3D Gaussian Splatting relies on first-order optimizers such as Adam, which leads to long training times and limits its efficiency. Method: Exploiting the sparsity of the Jacobian, the paper proposes a matrix-free, GPU-parallelized LM optimization combined with sampling strategies and a learning-rate heuristic. Result: The method is about 6× faster than Adam and 3× faster than standard LM when the Gaussian count is low, and remains competitive for moderate counts. Conclusion: The proposed second-order strategy significantly improves the training efficiency of 3D Gaussian Splatting. Abstract: 3D Gaussian Splatting (3DGS) is widely used for novel view synthesis due to its high rendering quality and fast inference time. However, 3DGS predominantly relies on first-order optimizers such as Adam, which leads to long training times. To address this limitation, we propose a novel second-order optimization strategy based on Levenberg-Marquardt (LM) and Conjugate Gradient (CG), which we specifically tailor towards Gaussian Splatting. Our key insight is that the Jacobian in 3DGS exhibits significant sparsity since each Gaussian affects only a limited number of pixels. We exploit this sparsity by proposing a matrix-free and GPU-parallelized LM optimization. To further improve its efficiency, we propose sampling strategies for both the camera views and loss function and, consequently, the normal equation, significantly reducing the computational complexity. In addition, we increase the convergence rate of the second-order approximation by introducing an effective heuristic to determine the learning rate that avoids the expensive computation cost of line search methods. As a result, our method achieves a $3\times$ speedup over standard LM and outperforms Adam by $\sim 6\times$ when the Gaussian count is low while remaining competitive for moderate counts. Project Page: https://vcai.mpi-inf.mpg.de/projects/LM-IS
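
A matrix-free Levenberg-Marquardt step can be sketched in a few lines. The step solves the damped normal equation $(J^\top J + \lambda I)\,\delta = -J^\top r$ with conjugate gradient, touching the sparse Jacobian $J$ only through products $Jv$ and $J^\top u$, so $J^\top J$ is never formed. This is the textbook formulation with identity damping, not the paper's GPU implementation; the sampling strategies and learning-rate heuristic are not reproduced, and the sparse-row format is a hypothetical choice for the example.

```python
def jvp(J_rows, v):
    """J @ v, where each row of J is a list of (column, value) pairs."""
    return [sum(val * v[c] for c, val in row) for row in J_rows]

def jtvp(J_rows, u, n):
    """J^T @ u for the same sparse-row format; n is the parameter count."""
    out = [0.0] * n
    for row, ui in zip(J_rows, u):
        for c, val in row:
            out[c] += val * ui
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lm_step(J_rows, residuals, n, damping=1e-3, iters=50, tol=1e-12):
    """Solve (J^T J + damping * I) delta = -J^T r by conjugate gradient."""
    def apply_A(v):
        jtjv = jtvp(J_rows, jvp(J_rows, v), n)
        return [a + damping * b for a, b in zip(jtjv, v)]

    b = [-g for g in jtvp(J_rows, residuals, n)]
    x = [0.0] * n
    r = b[:]
    p = r[:]
    rs = dot(r, r)
    for _ in range(iters):
        if rs < tol:
            break
        Ap = apply_A(p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# Two residuals, each depending on a single parameter (a sparse Jacobian).
J = [[(0, 2.0)], [(1, 3.0)]]
delta = lm_step(J, residuals=[2.0, 3.0], n=2, damping=0.0)
```

Because each residual row lists only the parameters it depends on, the cost of one CG iteration scales with the number of nonzeros in the Jacobian, which is the property the abstract attributes to each Gaussian affecting only a limited number of pixels.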

[51] Efficient Masked Image Compression with Position-Indexed Self-Attention

Chengjie Dai,Tiantian Song,Hui Tang,Fangdong Chen,Bowei Yang,Guanghua Song

Main category: cs.CV

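TL;DR: The paper proposes an image compression method based on a position-indexed self-attention mechanism that encodes and decodes only the visible parts of a masked image, significantly reducing computational cost.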

Details

Motivation: Existing methods structure the bitstream after encoding, which causes redundant computation, and in traditional methods even masked-out unimportant regions still take part in computation. Method: A position-indexed self-attention mechanism processes only the visible parts of the masked image. Result: Computational cost is significantly lower than that of existing semantic-structured compression methods. Conclusion: The method removes the redundant computation and improves compression efficiency. Abstract: In recent years, image compression for high-level vision tasks has attracted considerable attention from researchers. Given that object information in images plays a far more crucial role in downstream tasks than background information, some studies have proposed semantically structuring the bitstream to selectively transmit and reconstruct only the information required by these tasks. However, such methods structure the bitstream after encoding, meaning that the coding process still relies on the entire image, even though much of the encoded information will not be transmitted. This leads to redundant computations. Traditional image compression methods require a two-dimensional image as input, and even if the unimportant regions of the image are set to zero by applying a semantic mask, these regions still participate in subsequent computations as part of the image. To address such limitations, we propose an image compression method based on a position-indexed self-attention mechanism that encodes and decodes only the visible parts of the masked image. Compared to existing semantic-structured compression methods, our approach can significantly reduce computational costs.
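
The computational saving described above comes from dropping masked patches before any processing while remembering where each kept patch came from. The sketch below shows only that gather and scatter bookkeeping, under the assumption that the image has already been split into a flat list of patches. The attention layers themselves, and how the paper injects the position index into them, are not reproduced, and the function names are hypothetical.

```python
def gather_visible(patches, mask):
    """Keep only visible patches, paired with their original positions."""
    return [(i, p) for i, (p, m) in enumerate(zip(patches, mask)) if m]

def scatter_back(indexed_patches, num_patches, fill=None):
    """Place processed patches back at their original positions."""
    out = [fill] * num_patches
    for i, p in indexed_patches:
        out[i] = p
    return out

# Six patches, of which the semantic mask marks three as visible.
patches = ["p0", "p1", "p2", "p3", "p4", "p5"]
mask = [0, 1, 1, 0, 0, 1]
visible = gather_visible(patches, mask)
restored = scatter_back(visible, len(patches), fill="blank")
```

Since self-attention cost grows quadratically with the number of tokens, processing 3 of 6 patches in this toy case needs roughly a quarter of the pairwise attention computations of the full image.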

[52] Disentangling Polysemantic Channels in Convolutional Neural Networks

Robin Hesse,Jonas Fischer,Simone Schaub-Meyer,Stefan Roth

Main category: cs.CV

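TL;DR: Proposes an algorithm that disentangles polysemantic channels into multiple single-concept channels, improving the interpretability of CNNs.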

Details

Motivation: Polysemantic channels in CNNs encode multiple concepts and are therefore hard to interpret. Method: Restructures CNN weights by analyzing the distinct activation patterns in the previous layer, disentangling polysemantic channels. Result: Polysemantic features are successfully disentangled, improving CNN interpretability and feature visualizations. Conclusion: The method strengthens the mechanistic interpretability of CNNs and improves explanatory techniques. Abstract: Mechanistic interpretability is concerned with analyzing individual components in a (convolutional) neural network (CNN) and how they form larger circuits representing decision mechanisms. These investigations are challenging since CNNs frequently learn polysemantic channels that encode distinct concepts, making them hard to interpret. To address this, we propose an algorithm to disentangle a specific kind of polysemantic channel into multiple channels, each responding to a single concept. Our approach restructures weights in a CNN, exploiting the fact that different concepts within the same channel exhibit distinct activation patterns in the previous layer. By disentangling these polysemantic features, we enhance the interpretability of CNNs, ultimately improving explanatory techniques such as feature visualizations.

[53] Rethinking Temporal Fusion with a Unified Gradient Descent View for 3D Semantic Occupancy Prediction

Dubing Chen,Huan Zheng,Jin Fang,Xingping Dong,Xianfei Li,Wenlong Liao,Tao He,Pai Peng,Jianbing Shen

Main category: cs.CV

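TL;DR: GDFusion is a temporal fusion method for vision-based 3D semantic occupancy prediction that explores temporal cues and fusion strategies, significantly improving performance while reducing memory consumption.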

Details

Motivation: Temporal fusion within the VisionOcc framework is underexplored; the work focuses on both temporal cues and fusion strategies. Method: Proposes GDFusion, which identifies and exploits three temporal cues (scene-level consistency, motion calibration, and geometric complementation) and fuses temporal signals across heterogeneous representations by reinterpreting the vanilla RNN formulation. Result: On nuScenes, GDFusion significantly outperforms baselines; on the Occ3D benchmark it improves mIoU by 1.4%-4.8% and reduces memory consumption by 27%-72%. Conclusion: By systematically integrating temporal cues with an efficient fusion strategy, GDFusion delivers a substantial improvement for vision-based 3D semantic occupancy prediction. Abstract: We present GDFusion, a temporal fusion method for vision-based 3D semantic occupancy prediction (VisionOcc). GDFusion opens up the underexplored aspects of temporal fusion within the VisionOcc framework, focusing on both temporal cues and fusion strategies. It systematically examines the entire VisionOcc pipeline, identifying three fundamental yet previously overlooked temporal cues: scene-level consistency, motion calibration, and geometric complementation. These cues capture diverse facets of temporal evolution and make distinct contributions across various modules in the VisionOcc framework. To effectively fuse temporal signals across heterogeneous representations, we propose a novel fusion strategy by reinterpreting the formulation of vanilla RNNs. This reinterpretation leverages gradient descent on features to unify the integration of diverse temporal information, seamlessly embedding the proposed temporal cues into the network. Extensive experiments on nuScenes demonstrate that GDFusion significantly outperforms established baselines. Notably, on the Occ3D benchmark, it achieves 1.4\%-4.8\% mIoU improvements and reduces memory consumption by 27\%-72\%.

[54] Vision and Language Integration for Domain Generalization

Yanmei Wang,Xiyao Liu,Fupeng Chu,Zhi Han

Main category: cs.CV

TL;DR: VLCA combines language space and vision space, using semantic space as a bridge domain, to address domain generalization.

Details

Motivation: Domain gaps make it hard to find a reliable common feature space for images, whereas language offers more comprehensive expressive elements. Method: In language space, word-vector distances capture semantic relations between categories; in vision space, low-rank approximation uncovers the common pattern of same-class sample features; finally, language and vision representations are aligned in a multimodal space. Result: Experiments demonstrate the effectiveness of the method. Conclusion: By combining language space and vision space, VLCA improves domain generalization. Abstract: Domain generalization aims at training on source domains to uncover a domain-invariant feature space, allowing the model to generalize robustly to unknown target domains. However, due to domain gaps, it is hard to find a reliable common image feature space, and the reason for that is the lack of suitable basic units for images. Unlike images in vision space, language has comprehensive expression elements that can effectively convey semantics. Inspired by the semantic completeness of language and the intuitiveness of images, we propose VLCA, which combines language space and vision space, and connects the multiple image domains by using semantic space as the bridge domain. Specifically, in language space, by taking advantage of the completeness of language basic units, we capture the semantic representation of the relations between categories through word vector distance. Then, in vision space, by taking advantage of the intuitiveness of image features, the common pattern of sample features with the same class is explored through low-rank approximation. In the end, the language representation is aligned with the vision representation through the multimodal space of text and image. Experiments demonstrate the effectiveness of the proposed method.

[55] MathPhys-Guided Coarse-to-Fine Anomaly Synthesis with SQE-Driven Bi-Level Optimization for Anomaly Detection

Long Qian,Bingke Zhu,Yingying Chen,Ming Tang,Jinqiao Wang

Main category: cs.CV

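TL;DR: Proposes a synthetic anomaly generation method guided by math-physics models, using a coarse-to-fine refinement strategy and a Synthesis Quality Estimator (SQE) to improve anomaly detection, and reports state-of-the-art results on several benchmarks.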

Details

Motivation: Real defect images are scarce and unpredictable, and anomalies produced by existing synthesis methods lack physical consistency, which hurts model generalization. Method: Generates defect masks from physical models, refines them in two stages (npcF and npcF++), and weights training samples with the SQE. Result: Achieves state-of-the-art image-level and pixel-level performance on MVTec AD, VisA, and BTAD. Conclusion: The proposed MaPhC2F and BiSQAD improve the generalization and accuracy of anomaly detection. Abstract: Anomaly detection is a crucial task in computer vision, yet collecting real-world defect images is inherently difficult due to the rarity and unpredictability of anomalies. Consequently, researchers have turned to synthetic methods for training data augmentation. However, existing synthetic strategies (e.g., naive cut-and-paste or inpainting) overlook the underlying physical causes of defects, leading to inconsistent, low-fidelity anomalies that hamper model generalization to real-world complexities. In this thesis, we introduced a novel pipeline that generates synthetic anomalies through Math-Physics model guidance, refines them via a Coarse-to-Fine approach and employs a bi-level optimization strategy with a Synthesis Quality Estimator (SQE). By incorporating physical modeling of cracks, corrosion, and deformation, our method produces realistic defect masks, which are subsequently enhanced in two phases. The first stage (npcF) enforces a PDE-based consistency to achieve a globally coherent anomaly structure, while the second stage (npcF++) further improves local fidelity using wavelet transforms and boundary synergy blocks. Additionally, we leverage SQE-driven weighting, ensuring that high-quality synthetic samples receive greater emphasis during training. To validate our approach, we conducted comprehensive experiments on three widely adopted industrial anomaly detection benchmarks: MVTec AD, VisA, and BTAD. Across these datasets, the proposed pipeline achieves state-of-the-art (SOTA) results in both image-AUROC and pixel-AUROC, confirming the effectiveness of our MaPhC2F and BiSQAD.

[56] Enhancing Cocoa Pod Disease Classification via Transfer Learning and Ensemble Methods: Toward Robust Predictive Modeling

Devina Anduyan,Nyza Cabillo,Navy Gultiano,Mark Phil Pacot

Main category: cs.CV

TL;DR: Presents an ensemble-based approach that classifies cocoa pod diseases by combining transfer learning with three ensemble strategies (Bagging, Boosting, and Stacking).

Details

Motivation: Improve the accuracy and robustness of cocoa pod disease classification to provide a reliable solution for precision agriculture and automated crop disease management. Method: Uses pre-trained convolutional neural networks (VGG16, VGG19, ResNet50, ResNet101, InceptionV3, and Xception) as base learners, combined with the Bagging, Boosting, and Stacking ensemble strategies. Result: Bagging reaches 100% test accuracy, outperforming Boosting (97%) and Stacking (92%). Conclusion: Combining transfer learning with ensemble techniques improves model generalization and reliability, offering an effective tool for precision agriculture. Abstract: This study presents an ensemble-based approach for cocoa pod disease classification by integrating transfer learning with three ensemble learning strategies: Bagging, Boosting, and Stacking. Pre-trained convolutional neural networks, including VGG16, VGG19, ResNet50, ResNet101, InceptionV3, and Xception, were fine-tuned and employed as base learners to detect three disease categories: Black Pod Rot, Pod Borer, and Healthy. A balanced dataset of 6,000 cocoa pod images was curated and augmented to ensure robustness against variations in lighting, orientation, and disease severity. The performance of each ensemble method was evaluated using accuracy, precision, recall, and F1-score. Experimental results show that Bagging consistently achieved superior classification performance with a test accuracy of 100%, outperforming Boosting (97%) and Stacking (92%). The findings confirm that combining transfer learning with ensemble techniques improves model generalization and reliability, making it a promising direction for precision agriculture and automated crop disease management.
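
For readers unfamiliar with Bagging, the sketch below shows its two generic ingredients: bootstrap resampling of the training set and majority voting over base-learner predictions. The abstract does not describe how the study configured its ensemble, so this is a textbook illustration rather than the authors' pipeline; `bootstrap_sample` and `majority_vote` are hypothetical helper names, and the class labels are taken from the abstract.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement (one bag per base learner)."""
    return [rng.choice(data) for _ in data]


def majority_vote(predictions):
    """Return the label predicted by the most base learners."""
    return Counter(predictions).most_common(1)[0][0]


# Each base learner would be trained on its own bag; here only the vote is shown.
rng = random.Random(0)
bag = bootstrap_sample(list(range(10)), rng)
votes = ["Healthy", "Black Pod Rot", "Healthy"]
label = majority_vote(votes)
```

In a tie, `Counter.most_common` returns the label that was seen first, so a real system would want an explicit tie-breaking rule.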

[57] All-in-One Transferring Image Compression from Human Perception to Multi-Machine Perception

Jiancheng Zhao,Xiang Ji,Zhuoxiao Li,Zunian Wan,Weihang Ran,Mingze Ma,Muyao Niu,Yifan Zhan,Cheng-Ching Tseng,Yinqiang Zheng

Main category: cs.CV

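TL;DR: Proposes an asymmetric adaptor framework that supports multi-task adaptation within a single model, addressing the inefficiency and lack of task interaction in existing methods.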

Details

Motivation: Existing methods typically adapt learned image compression (LIC) models to downstream tasks one task at a time, which is inefficient and lacks task interaction. Method: Introduces a shared adaptor that learns general semantic features and task-specific adaptors that preserve task-level distinctions, requiring only lightweight plug-in modules and a frozen base codec. Result: Outperforms full fine-tuning and other parameter-efficient fine-tuning baselines on the PASCAL-Context benchmark. Conclusion: The results validate the effectiveness of multi-vision transferring while maintaining compression efficiency. Abstract: Efficiently transferring a Learned Image Compression (LIC) model from human perception to machine perception is an emerging challenge in vision-centric representation learning. Existing approaches typically adapt LIC to downstream tasks in a single-task manner, which is inefficient, lacks task interaction, and results in multiple task-specific bitstreams. To address these limitations, we propose an asymmetric adaptor framework that supports multi-task adaptation within a single model. Our method introduces a shared adaptor to learn general semantic features and task-specific adaptors to preserve task-level distinctions. With only lightweight plug-in modules and a frozen base codec, our method achieves strong performance across multiple tasks while maintaining compression efficiency. Experiments on the PASCAL-Context benchmark demonstrate that our method outperforms both Fully Fine-Tuned and other Parameter Efficient Fine-Tuned (PEFT) baselines, validating the effectiveness of multi-vision transferring.

[58] Hierarchical Feature Learning for Medical Point Clouds via State Space Model

Guoqing Zhang,Jingyun Yang,Yang Li

Main category: cs.CV

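TL;DR: Presents a hierarchical feature learning framework based on state space models (SSMs) for medical point cloud understanding and validates it on MedPointS, a newly built large-scale dataset.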

Details

Motivation: Medical point clouds have great potential in disease diagnosis and treatment, but little research addresses them, so efficient point cloud learning methods are needed. Method: Down-samples the input into multiple levels with farthest point sampling, aggregates multi-scale structural information with KNN queries, and introduces coordinate-order and inside-out scanning strategies to serialize the points; Point SSM blocks then compute features progressively to capture local patterns and long-range dependencies. Result: Experiments on MedPointS show strong performance on anatomy classification, completion, and segmentation. Conclusion: The proposed SSM framework performs well on medical point cloud tasks; the dataset and code are publicly available. Abstract: Deep learning-based point cloud modeling has been widely investigated as an indispensable component of general shape analysis. Recently, transformers and state space models (SSMs) have shown promising capacities in point cloud learning. However, limited research has been conducted on medical point clouds, which have great potential in disease diagnosis and treatment. This paper presents an SSM-based hierarchical feature learning framework for medical point cloud understanding. Specifically, we down-sample input into multiple levels through farthest point sampling. At each level, we perform a series of k-nearest neighbor (KNN) queries to aggregate multi-scale structural information. To assist SSM in processing point clouds, we introduce coordinate-order and inside-out scanning strategies for efficient serialization of irregular points. Point features are calculated progressively from short neighbor sequences and long point sequences through vanilla and group Point SSM blocks, to capture both local patterns and long-range dependencies. To evaluate the proposed method, we build a large-scale medical point cloud dataset named MedPointS for anatomy classification, completion, and segmentation. Extensive experiments conducted on MedPointS demonstrate that our method achieves superior performance across all tasks. The dataset is available at https://flemme-docs.readthedocs.io/en/latest/medpoints.html. Code is merged to a public medical imaging platform: https://github.com/wlsdzyzl/flemme.
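
Farthest point sampling, the down-sampling step named in this abstract, is a standard greedy procedure. The sketch below is a generic plain-Python version and not code from the linked repository; the function name, the `start` parameter, and the toy coordinates are assumptions for illustration.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Greedily pick k point indices, each farthest from those already picked."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [start]
    # Distance from every point to its nearest selected point so far.
    min_d = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=lambda i: min_d[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            min_d[i] = min(min_d[i], math.dist(p, points[nxt]))
    return selected


cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.5, 0.0, 0.0)]
level1 = farthest_point_sampling(cloud, 3)  # [0, 2, 1]
```

Applying the function again to the selected subset yields the next, coarser level of the hierarchy. This version costs O(n * k) distance evaluations.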

[59] Pose and Facial Expression Transfer by using StyleGAN

Petr Jahoda,Jan Cech

Main category: cs.CV

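TL;DR: Proposes a method that transfers pose and expression from a source face image to a target face image without manual labeling, and supports synthesis of random identities.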

Details

Motivation: Enable automatic transfer of pose and expression between face images without requiring manual labeling. Method: Two encoders and a mapping network project the inputs into the latent space of StyleGAN2, which generates the output; training is self-supervised from video sequences. Result: The model synthesizes random identities with controllable pose and expression and achieves close-to-real-time performance. Conclusion: The method is efficient at pose and expression transfer and needs no manual intervention. Abstract: We propose a method to transfer pose and expression between face images. Given a source and target face portrait, the model produces an output image in which the pose and expression of the source face image are transferred onto the target identity. The architecture consists of two encoders and a mapping network that projects the two inputs into the latent space of StyleGAN2, which finally generates the output. The training is self-supervised from video sequences of many individuals. Manual labeling is not required. Our model enables the synthesis of random identities with controllable pose and expression. Close-to-real-time performance is achieved.

[60] Riemannian Patch Assignment Gradient Flows

Daniel Gonzalez-Alvarado,Fabio Schlindwein,Jonas Cassel,Laura Steingruber,Stefania Petra,Christoph Schnörr

Main category: cs.CV

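TL;DR: Introduces a graph-based data labeling approach that achieves consistent labelings through dynamic interaction and geometric numerical integration.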

Details

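Motivation: Resolve inconsistencies among initial local labelings of data on graphs, reaching global consistency through dynamic interaction and regularization. Method: Uses a dictionary of competing labeled patches together with patch assignment variables, and achieves consistency by geometric numerical integration of a Riemannian ascent flow. Result: Experiments illustrate properties of the approach, including uncertainty quantification of label assignments. Conclusion: The approach yields consistent labelings of graph data through dynamic interaction and geometric integration. Abstract: This paper introduces patch assignment flows for metric data labeling on graphs. Labelings are determined by regularizing initial local labelings through the dynamic interaction of both labels and label assignments across the graph, entirely encoded by a dictionary of competing labeled patches and mediated by patch assignment variables. Maximal consistency of patch assignments is achieved by geometric numerical integration of a Riemannian ascent flow, as a critical point of a Lagrangian action functional. Experiments illustrate properties of the approach, including uncertainty quantification of label assignments.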

[61] TTRD3: Texture Transfer Residual Denoising Dual Diffusion Model for Remote Sensing Image Super-Resolution

Yide Liu,Haijiang Sun,Xiaowen Zhang,Qiaoyuan Liu,Zhouchang Chen,Chongzhuo Xiao

Main category: cs.CV

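TL;DR: Proposes TTRD3, a model for remote sensing image super-resolution that addresses multi-scale feature extraction, semantic consistency, and the balance between geometric accuracy and visual quality.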

Details

Motivation: Existing remote sensing super-resolution methods struggle with multi-scale feature extraction, semantic inconsistency, and an imbalance between geometric accuracy and visual quality. Method: TTRD3 contributes three components: a Multi-scale Feature Aggregation Block (MFAB), a Sparse Texture Transfer Guidance (STTG) module, and a Residual Denoising Dual Diffusion Model (RDDM). Result: TTRD3 outperforms existing methods on several remote sensing datasets, improving LPIPS by 1.43% and FID by 3.67%. Conclusion: Through multi-scale feature extraction, texture transfer, and dual diffusion, TTRD3 improves remote sensing image super-resolution. Abstract: Remote Sensing Image Super-Resolution (RSISR) reconstructs high-resolution (HR) remote sensing images from low-resolution inputs to support fine-grained ground object interpretation. Existing methods face three key challenges: (1) Difficulty in extracting multi-scale features from spatially heterogeneous RS scenes, (2) Limited prior information causing semantic inconsistency in reconstructions, and (3) Trade-off imbalance between geometric accuracy and visual quality. To address these issues, we propose the Texture Transfer Residual Denoising Dual Diffusion Model (TTRD3) with three innovations: First, a Multi-scale Feature Aggregation Block (MFAB) employing parallel heterogeneous convolutional kernels for multi-scale feature extraction. Second, a Sparse Texture Transfer Guidance (STTG) module that transfers HR texture priors from reference images of similar scenes. Third, a Residual Denoising Dual Diffusion Model (RDDM) framework combining residual diffusion for deterministic reconstruction and noise diffusion for diverse generation. Experiments on multi-source RS datasets demonstrate TTRD3's superiority over state-of-the-art methods, achieving 1.43% LPIPS improvement and 3.67% FID enhancement compared to best-performing baselines. Code/model: https://github.com/LED-666/TTRD3.

[62] Prototypes are Balanced Units for Efficient and Effective Partially Relevant Video Retrieval

WonJun Moon,Cheol-Ho Cho,Woojin Jun,Minho Shim,Taeoh Kim,Inwoong Lee,Dongyoon Wee,Jae-Pil Heo

Main category: cs.CV

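TL;DR: Proposes a prototypical PRVR framework that encodes the diverse contexts in a video into a fixed number of prototypes, adds strategies to strengthen text association and video understanding, and uses cross-modal and uni-modal reconstruction tasks to keep the prototypes searchable and accurate.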

Details

Motivation: In partially relevant video retrieval (PRVR), achieving both search accuracy and efficiency is challenging because diverse context representations increase computational and memory costs. Method: Proposes a prototypical PRVR framework that encodes the diverse contexts in a video into a fixed number of prototypes, with strategies for text association and video understanding plus cross-modal and uni-modal reconstruction tasks. Result: Extensive evaluations on TVR, ActivityNet-Captions, and QVHighlights validate the approach without sacrificing efficiency. Conclusion: Prototype encoding and the reconstruction tasks together balance accuracy and efficiency in PRVR. Abstract: In a retrieval system, simultaneously achieving search accuracy and efficiency is inherently challenging. This challenge is particularly pronounced in partially relevant video retrieval (PRVR), where incorporating more diverse context representations at varying temporal scales for each video enhances accuracy but increases computational and memory costs. To address this dichotomy, we propose a prototypical PRVR framework that encodes diverse contexts within a video into a fixed number of prototypes. We then introduce several strategies to enhance text association and video understanding within the prototypes, along with an orthogonal objective to ensure that the prototypes capture a diverse range of content. To keep the prototypes searchable via text queries while accurately encoding video contexts, we implement cross- and uni-modal reconstruction tasks. The cross-modal reconstruction task aligns the prototypes with textual features within a shared space, while the uni-modal reconstruction task preserves all video contexts during encoding. Additionally, we employ a video mixing technique to provide weak guidance to further align prototypes and associated textual representations. Extensive evaluations on TVR, ActivityNet-Captions, and QVHighlights validate the effectiveness of our approach without sacrificing efficiency.

[63] Event-Enhanced Blurry Video Super-Resolution

Dachun Kai,Yueyi Zhang,Jin Wang,Zeyu Xiao,Zhiwei Xiong,Xiaoyan Sun

Main category: cs.CV

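TL;DR: Proposes Ev-DeblurVSR, an event-enhanced method for blurry video super-resolution that fuses frame and event information to improve detail restoration and temporal consistency.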

Details

Motivation: Existing blurry video super-resolution methods lack motion information and high-frequency details, so they struggle to restore sharp details at high resolution. Method: Introduces event signals and designs a reciprocal feature deblurring module and a hybrid deformable alignment module that combine frame and event information. Result: Achieves the best performance on synthetic and real-world datasets; on real data it is 2.59 dB more accurate and 7.28x faster than FMA-Net. Conclusion: Event signals allow Ev-DeblurVSR to substantially improve blurry video super-resolution. Abstract: In this paper, we tackle the task of blurry video super-resolution (BVSR), aiming to generate high-resolution (HR) videos from low-resolution (LR) and blurry inputs. Current BVSR methods often fail to restore sharp details at high resolutions, resulting in noticeable artifacts and jitter due to insufficient motion information for deconvolution and the lack of high-frequency details in LR frames. To address these challenges, we introduce event signals into BVSR and propose a novel event-enhanced network, Ev-DeblurVSR. To effectively fuse information from frames and events for feature deblurring, we introduce a reciprocal feature deblurring module that leverages motion information from intra-frame events to deblur frame features while reciprocally using global scene context from the frames to enhance event features. Furthermore, to enhance temporal consistency, we propose a hybrid deformable alignment module that fully exploits the complementary motion information from inter-frame events and optical flow to improve motion estimation in the deformable alignment process. Extensive evaluations demonstrate that Ev-DeblurVSR establishes a new state-of-the-art performance on both synthetic and real-world datasets. Notably, on real data, our method is +2.59 dB more accurate and 7.28$\times$ faster than the recent best BVSR baseline FMA-Net. Code: https://github.com/DachunKai/Ev-DeblurVSR.

[64] Expert Kernel Generation Network Driven by Contextual Mapping for Hyperspectral Image Classification

Guandong Li,Mengxia Ye

Main category: cs.CV

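TL;DR: EKGNet is a model built on an improved 3D-DenseNet that uses a context-aware mapping network and a dynamic kernel generation module to address overfitting and limited generalization in hyperspectral image classification.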

Details

Motivation: Hyperspectral image classification faces high-dimensional data, sparse ground-object distributions, and spectral redundancy; conventional methods overfit easily and generalize poorly. Method: EKGNet combines a context-aware mapping module with a dynamic kernel generation mechanism that generates convolution kernel weights on the fly, forming an adaptive expert convolution system. Result: Outperforms mainstream methods on the IN, UP, and KSC datasets. Conclusion: The dynamic convolution system increases representation capability without increasing network depth or width. Abstract: Deep neural networks face several challenges in hyperspectral image classification, including high-dimensional data, sparse distribution of ground objects, and spectral redundancy, which often lead to classification overfitting and limited generalization capability. To more efficiently adapt to ground object distributions while extracting image features without introducing excessive parameters and skipping redundant information, this paper proposes EKGNet based on an improved 3D-DenseNet model, consisting of a context-aware mapping network and a dynamic kernel generation module. The context-aware mapping module translates global contextual information of hyperspectral inputs into instructions for combining base convolutional kernels, while the dynamic kernels are composed of K groups of base convolutions, analogous to K different types of experts specializing in fundamental patterns across various dimensions. The mapping module and dynamic kernel generation mechanism form a tightly coupled system - the former generates meaningful combination weights based on inputs, while the latter constructs an adaptive expert convolution system using these weights. This dynamic approach enables the model to focus more flexibly on key spatial structures when processing different regions, rather than relying on the fixed receptive field of a single static convolutional kernel. EKGNet enhances model representation capability through a 3D dynamic expert convolution system without increasing network depth or width. The proposed method demonstrates superior performance on IN, UP, and KSC datasets, outperforming mainstream hyperspectral image classification approaches.
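
The "combination weights over K base kernels" idea can be sketched in a few lines. This is an illustration under stated assumptions, not EKGNet itself: kernels are flattened to plain lists, the context scores are given rather than produced by a mapping network, and softmax normalization is an assumed choice that the abstract does not confirm.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_kernels(base_kernels, context_scores):
    """Blend K base kernels (flattened lists) into one input-dependent kernel."""
    weights = softmax(context_scores)  # assumed normalization
    size = len(base_kernels[0])
    return [
        sum(w * kernel[i] for w, kernel in zip(weights, base_kernels))
        for i in range(size)
    ]


experts = [[1.0, 0.0], [0.0, 1.0]]           # K = 2 toy "expert" kernels
blended = combine_kernels(experts, [0.0, 0.0])  # equal scores give the average
```

Since the blend is linear in the kernels, a single convolution with the blended kernel gives the same result as a weighted sum of K separate convolutions, which is why this style of dynamic convolution adds capacity without adding depth or width.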

[65] NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation

Xiangyan Liu,Jinjie Ni,Zijian Wu,Chao Du,Longxu Dou,Haonan Wang,Tianyu Pang,Michael Qizhe Shieh

Main category: cs.CV

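TL;DR: NoisyRollout is a simple yet effective reinforcement learning approach that mixes trajectories from clean and distorted images to strengthen the exploration capabilities of vision-language models.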

Details

Motivation: Current vision-language models fall short in policy exploration and visual perception, which limits their reasoning. Method: Proposes NoisyRollout, which mixes trajectories from clean and distorted images, introduces a vision-oriented inductive bias, and applies a noise annealing schedule. Result: With only 2.1K training samples, it achieves state-of-the-art performance on 5 out-of-domain benchmarks while preserving in-domain performance. Conclusion: By introducing visual diversity, NoisyRollout improves the model's exploration and performance. Abstract: Recent advances in reinforcement learning (RL) have strengthened the reasoning capabilities of vision-language models (VLMs). However, enhancing policy exploration to more effectively scale test-time compute remains underexplored in VLMs. In addition, VLMs continue to struggle with imperfect visual perception, which in turn affects the subsequent reasoning process. To this end, we propose NoisyRollout, a simple yet effective RL approach that mixes trajectories from both clean and moderately distorted images to introduce targeted diversity in visual perception and the resulting reasoning patterns. Without additional training cost, NoisyRollout enhances the exploration capabilities of VLMs by incorporating a vision-oriented inductive bias. Furthermore, NoisyRollout employs a noise annealing schedule that gradually reduces distortion strength over training, ensuring benefit from noisy signals early while maintaining training stability and scalability in later stages. With just 2.1K training samples, NoisyRollout achieves state-of-the-art performance among open-source RL-tuned models on 5 out-of-domain benchmarks spanning both reasoning and perception tasks, while preserving comparable or even better in-domain performance.
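
A noise annealing schedule of the kind described here can be sketched as follows. The abstract only says that distortion strength is gradually reduced over training, so the linear decay, the rollout counts, and the `distort` callback below are all assumptions for illustration, not the paper's actual schedule or API.

```python
def distortion_strength(step, total_steps, initial=1.0, final=0.0):
    """Linearly anneal distortion strength from `initial` to `final` (assumed shape)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return initial + (final - initial) * frac


def build_rollout_inputs(image, distort, step, total_steps, n_clean=2, n_noisy=2):
    """Return the inputs for one mixed rollout group: clean copies plus distorted ones."""
    strength = distortion_strength(step, total_steps)
    clean = [image] * n_clean
    noisy = [distort(image, strength) for _ in range(n_noisy)]
    return clean + noisy


# Stand-in distortion that just records the strength applied to the image.
tag = lambda img, s: (img, s)
group = build_rollout_inputs("img", tag, step=50, total_steps=100)
```

Early in training the distorted inputs differ strongly from the clean ones, and by the final step the two sets coincide, which matches the stated goal of benefiting from noisy signals early while keeping later training stable.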

[66] Imaging for All-Day Wearable Smart Glasses

Michael Goesele,Daniel Andersen,Yujia Chen,Simon Green,Eddy Ilg,Chao Li,Johnson Liu,Grace Kuo,Logan Wan,Richard Newcombe

Main category: cs.CV

TL;DR: Analyzes the fundamental limits of imaging from smart glasses and proposes a distributed imaging approach that reduces camera module size.

Details

Motivation: Smart glasses must be all-day wearable with a small form factor, yet image quality is constrained by arbitrary environments and wearer motion, so new approaches are needed. Method: Systematically analyzes the limits of imaging from smart glasses, proposes a distributed imaging approach, and evaluates it experimentally. Result: The distributed approach allows smaller individual camera modules than a standard monolithic camera design; its properties are demonstrated on synthetic data and on images from two prototypes. Conclusion: Distributed imaging is a viable route toward small, high-performing smart glasses. Abstract: In recent years smart glasses technology has rapidly advanced, opening up entirely new areas for mobile computing. We expect future smart glasses will need to be all-day wearable, adopting a small form factor to meet the requirements of volume, weight, fashionability and social acceptability, which puts significant constraints on the space of possible solutions. Additional challenges arise due to the fact that smart glasses are worn in arbitrary environments while their wearer moves and performs everyday activities. In this paper, we systematically analyze the space of imaging from smart glasses and derive several fundamental limits that govern this imaging domain. We discuss the impact of these limits on achievable image quality and camera module size -- comparing in particular to related devices such as mobile phones. We then propose a novel distributed imaging approach that allows minimizing the size of the individual camera modules when compared to a standard monolithic camera design. Finally, we demonstrate the properties of this novel approach in a series of experiments using synthetic data as well as images captured with two different prototype implementations.

[67] ArtistAuditor: Auditing Artist Style Pirate in Text-to-Image Generation Models

Linkang Du,Zheng Zhu,Min Chen,Zhou Su,Shouling Ji,Peng Cheng,Jiming Chen,Zhikun Zhang

Main category: cs.CV

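TL;DR: ArtistAuditor is a method for auditing data use in text-to-image generation models; it analyzes style features to decide whether a model was fine-tuned on a specific artist's works.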

Details

Motivation: Existing protections (perturbations or watermarks) are not viable once the artwork or model has already been published; the goal is to protect artists' copyright in that setting. Method: A style extractor produces multi-granularity style representations, and a trained discriminator makes the auditing decision. Result: On six combinations of models and datasets, AUC values exceed 0.937. Conclusion: ArtistAuditor performs well in real-world cases and is open-sourced, providing a practical tool for protecting art copyright. Abstract: Text-to-image models based on diffusion processes, such as DALL-E, Stable Diffusion, and Midjourney, are capable of transforming texts into detailed images and have widespread applications in art and design. As such, amateur users can easily imitate professional-level paintings by collecting an artist's work and fine-tuning the model, leading to concerns about artworks' copyright infringement. To tackle these issues, previous studies either add visually imperceptible perturbation to the artwork to change its underlying styles (perturbation-based methods) or embed post-training detectable watermarks in the artwork (watermark-based methods). However, when the artwork or the model has been published online, i.e., modification to the original artwork or model retraining is not feasible, these strategies might not be viable. To this end, we propose a novel method for data-use auditing in the text-to-image generation model. The general idea of ArtistAuditor is to identify if a suspicious model has been finetuned using the artworks of specific artists by analyzing the features related to the style. Concretely, ArtistAuditor employs a style extractor to obtain the multi-granularity style representations and treats artworks as samplings of an artist's style. Then, ArtistAuditor queries a trained discriminator to gain the auditing decisions. The experimental results on six combinations of models and datasets show that ArtistAuditor can achieve high AUC values (> 0.937). By studying ArtistAuditor's transferability and core modules, we provide valuable insights into the practical implementation. Finally, we demonstrate the effectiveness of ArtistAuditor in real-world cases by an online platform Scenario. ArtistAuditor is open-sourced at https://github.com/Jozenn/ArtistAuditor.

[68] EchoWorld: Learning Motion-Aware World Models for Echocardiography Probe Guidance

Yang Yue,Yulin Wang,Haojun Jiang,Pan Liu,Shiji Song,Gao Huang

Main category: cs.CV

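TL;DR: EchoWorld is a world-modeling framework for echocardiography probe guidance that combines pre-training and fine-tuning with visual-motion sequences to significantly improve guidance precision.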

Details

Motivation: Echocardiography depends on experienced operators, and building AI-assisted or autonomous scanning systems is hard because models must understand heart anatomy and the complex relation between probe motion and visual signals. Method: EchoWorld uses a world-modeling pre-training strategy that predicts masked anatomical regions and the visual outcomes of probe adjustments; fine-tuning adds a motion-aware attention mechanism that integrates historical data. Result: Trained on more than one million ultrasound images, EchoWorld significantly reduces guidance errors and outperforms existing visual backbones and guidance frameworks. Conclusion: By encoding anatomical knowledge and motion dynamics, EchoWorld achieves precise probe guidance and offers an effective solution for AI-assisted echocardiography. Abstract: Echocardiography is crucial for cardiovascular disease detection but relies heavily on experienced sonographers. Echocardiography probe guidance systems, which provide real-time movement instructions for acquiring standard plane images, offer a promising solution for AI-assisted or fully autonomous scanning. However, developing effective machine learning models for this task remains challenging, as they must grasp heart anatomy and the intricate interplay between probe motion and visual signals. To address this, we present EchoWorld, a motion-aware world modeling framework for probe guidance that encodes anatomical knowledge and motion-induced visual dynamics, while effectively leveraging past visual-motion sequences to enhance guidance precision. EchoWorld employs a pre-training strategy inspired by world modeling principles, where the model predicts masked anatomical regions and simulates the visual outcomes of probe adjustments. Built upon this pre-trained model, we introduce a motion-aware attention mechanism in the fine-tuning stage that effectively integrates historical visual-motion data, enabling precise and adaptive probe guidance. Trained on more than one million ultrasound images from over 200 routine scans, EchoWorld effectively captures key echocardiographic knowledge, as validated by qualitative analysis. Moreover, our method significantly reduces guidance errors compared to existing visual backbones and guidance frameworks, excelling in both single-frame and sequential evaluation protocols. Code is available at https://github.com/LeapLabTHU/EchoWorld.

[69] SkyReels-V2: Infinite-length Film Generative Model

Guibin Chen,Dixuan Lin,Jiangping Yang,Chunze Lin,Juncheng Zhu,Mingyuan Fan,Hao Zhang,Sheng Chen,Zheng Chen,Chengchen Ma,Weiming Xiong,Wei Wang,Nuo Pang,Kang Kang,Zhiheng Xu,Yuzhe Jin,Yupeng Liang,Yubing Song,Peng Zhao,Boyuan Xu,Di Qiu,Debang Li,Zhengcong Fei,Yang Li,Yahui Zhou

Main category: cs.CV

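TL;DR: SkyReels-V2 is an infinite-length film generative model that combines a multi-modal large language model, multi-stage pretraining, reinforcement learning, and a diffusion forcing framework to address motion quality, duration, and shot awareness in video generation.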

Details

Motivation: Existing video generation models are limited in motion quality, video duration, and shot awareness, which hinders professional film-style long-video synthesis. Method: Designs a structural video representation that combines a multi-modal LLM with sub-expert models, trains a unified video captioner, and applies progressive-resolution pretraining followed by a four-stage post-training enhancement. Result: SkyReels-V2 efficiently synthesizes high-quality long videos with markedly better motion quality and visual fidelity. Conclusion: SkyReels-V2 offers a new solution for long-video generation and advances professional film-style synthesis. Abstract: Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: compromises in motion dynamics to enhance temporal visual quality, constrained video duration (5-10 seconds) to prioritize resolution, and inadequate shot-aware generation stemming from general-purpose MLLMs' inability to interpret cinematic grammar, such as shot composition, actor expressions, and camera motions. These intertwined limitations hinder realistic long-form synthesis and professional film-style generation. To address these limitations, we propose SkyReels-V2, an Infinite-length Film Generative Model, that synergizes Multi-modal Large Language Model (MLLM), Multi-stage Pretraining, Reinforcement Learning, and Diffusion Forcing Framework. Firstly, we design a comprehensive structural representation of video that combines the general descriptions by the Multi-modal LLM and the detailed shot language by sub-expert models. Aided with human annotation, we then train a unified Video Captioner, named SkyCaptioner-V1, to efficiently label the video data. Secondly, we establish progressive-resolution pretraining for the fundamental video generation, followed by a four-stage post-training enhancement: Initial concept-balanced Supervised Fine-Tuning (SFT) improves baseline quality; Motion-specific Reinforcement Learning (RL) training with human-annotated and synthetic distortion data addresses dynamic artifacts; Our diffusion forcing framework with non-decreasing noise schedules enables long-video synthesis in an efficient search space; Final high-quality SFT refines visual fidelity. All the code and models are available at https://github.com/SkyworkAI/SkyReels-V2.

[70] Effective Dual-Region Augmentation for Reduced Reliance on Large Amounts of Labeled Data

Prasanna Reddy Pulakurthi,Majid Rabbani,Celso M. de Melo,Sohail A. Dianat,Raghuveer M. Rao

Main category: cs.CV

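TL;DR: Proposes a novel dual-region augmentation method that reduces reliance on large-scale labeled data and improves model robustness and adaptability.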

Details

Motivation: Reduce the dependence of computer vision tasks on large-scale labeled data while improving model robustness in cross-domain tasks. Method: Applies random noise perturbations to foreground objects and spatially shuffles background regions to increase training-data diversity. Result: Significantly outperforms existing methods on the PACS dataset, with effectiveness also validated on Market-1501 and DukeMTMC-reID. Conclusion: Through structured data augmentation, the method offers a scalable way to reduce reliance on manually annotated data. Abstract: This paper introduces a novel dual-region augmentation approach designed to reduce reliance on large-scale labeled datasets while improving model robustness and adaptability across diverse computer vision tasks, including source-free domain adaptation (SFDA) and person re-identification (ReID). Our method performs targeted data transformations by applying random noise perturbations to foreground objects and spatially shuffling background patches. This effectively increases the diversity of the training data, improving model robustness and generalization. Evaluations on the PACS dataset for SFDA demonstrate that our augmentation strategy consistently outperforms existing methods, achieving significant accuracy improvements in both single-target and multi-target adaptation settings. By augmenting training data through structured transformations, our method enables model generalization across domains, providing a scalable solution for reducing reliance on manually annotated datasets. Furthermore, experiments on Market-1501 and DukeMTMC-reID datasets validate the effectiveness of our approach for person ReID, surpassing traditional augmentation techniques.
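As a rough illustration of the dual-region idea described in this abstract, the sketch below adds Gaussian noise to foreground pixels and shuffles background patches. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `dual_region_augment`, the grayscale nested-list image format, the binary foreground mask, and the `patch`, `noise_std`, and `seed` parameters are all hypothetical choices made for illustration.

```python
import random


def dual_region_augment(image, mask, patch=2, noise_std=0.1, seed=0):
    """Hypothetical sketch: image is an H x W nested list of floats,
    mask is an H x W nested list of 0/1 where 1 marks foreground."""
    rng = random.Random(seed)
    h, w = len(image), len(image[0])

    # Top-left corners of a regular patch grid; any leftover border is untouched.
    corners = [(r, c)
               for r in range(0, h - patch + 1, patch)
               for c in range(0, w - patch + 1, patch)]
    # Keep only patches that contain no foreground pixel at all.
    background = [(r, c) for r, c in corners
                  if all(mask[r + i][c + j] == 0
                         for i in range(patch) for j in range(patch))]

    # Copy the background patches out, shuffle them, and paste them back.
    patches = [[[image[r + i][c + j] for j in range(patch)]
                for i in range(patch)] for r, c in background]
    rng.shuffle(patches)
    out = [row[:] for row in image]
    for (r, c), p in zip(background, patches):
        for i in range(patch):
            for j in range(patch):
                out[r + i][c + j] = p[i][j]

    # Perturb foreground pixels with zero-mean Gaussian noise.
    for r in range(h):
        for c in range(w):
            if mask[r][c] == 1:
                out[r][c] += rng.gauss(0.0, noise_std)
    return out
```

Only patches free of foreground pixels are shuffled in this sketch, so the object stays in place while its surrounding context changes.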

[71] Enhancing Person-to-Person Virtual Try-On with Multi-Garment Virtual Try-Off

Riza Velioglu,Petra Bevandic,Robin Chan,Barbara Hammer

Main category: cs.CV

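TL;DR: Introduces TryOffDiff, a diffusion-based VTOFF method that extracts standardized garment images from clothed individuals and performs strongly on multi-garment tasks.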

Details

Motivation: Virtual Try-On (VTON) and Virtual Try-Off (VTOFF) are important applications of computer vision in fashion. The challenge in VTOFF is extracting standardized garment images from clothed individuals, and existing methods fall short on multi-garment tasks. Method: TryOffDiff builds on a latent diffusion framework with SigLIP image conditioning to capture garment texture, shape, and patterns; class-specific embeddings enable multi-garment VTOFF. Result: Achieves state-of-the-art results on VITON-HD and strong performance on DressCode, covering upper-body garments, lower-body garments, and dresses; when paired with VTON models, it reduces the transfer of unwanted attributes such as skin color. Conclusion: TryOffDiff is the first multi-garment VTOFF model and provides higher-quality garment images for virtual try-on. Abstract: Computer vision is transforming fashion through Virtual Try-On (VTON) and Virtual Try-Off (VTOFF). VTON generates images of a person in a specified garment using a target photo and a standardized garment image, while a more challenging variant, Person-to-Person Virtual Try-On (p2p-VTON), uses a photo of another person wearing the garment. VTOFF, on the other hand, extracts standardized garment images from clothed individuals. We introduce TryOffDiff, a diffusion-based VTOFF model. Built on a latent diffusion framework with SigLIP image conditioning, it effectively captures garment properties like texture, shape, and patterns. TryOffDiff achieves state-of-the-art results on VITON-HD and strong performance on the DressCode dataset, covering upper-body, lower-body, and dresses. Enhanced with class-specific embeddings, it pioneers multi-garment VTOFF, the first of its kind. When paired with VTON models, it improves p2p-VTON by minimizing unwanted attribute transfer, such as skin color. Code is available at: https://rizavelioglu.github.io/tryoffdiff/

[72] EventVAD: Training-Free Event-Aware Video Anomaly Detection

Yihua Shao,Haojin He,Sijie Li,Siyu Chen,Xinwei Long,Fanhu Zeng,Yuxuan Fan,Muyang Zhang,Ziyang Yan,Ao Ma,Xiaochen Wang,Hao Tang,Yan Wang,Shuyan Li

Main category: cs.CV

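TL;DR: EventVAD combines dynamic graph architectures with multimodal large language models (MLLMs), using event awareness and temporal reasoning to detect video anomalies without training data while outperforming existing methods.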

Details

Motivation: Supervised methods need large amounts of training data and generalize poorly, while training-free methods struggle to localize fine-grained visual transitions and diverse events. Method: EventVAD captures event features with dynamic spatiotemporal graph modeling, detects event boundaries from unsupervised statistical features, and guides MLLM reasoning with a hierarchical prompting strategy. Result: On the UCF-Crime and XD-Violence datasets, EventVAD with a 7B MLLM achieves SOTA performance in the training-free setting. Conclusion: By combining dynamic graph architectures with MLLMs, EventVAD effectively addresses generalization and localization in video anomaly detection. Abstract: Video Anomaly Detection (VAD) focuses on identifying anomalies within videos. Supervised methods require a large amount of in-domain training data and often struggle to generalize to unseen anomalies. In contrast, training-free methods leverage the intrinsic world knowledge of large language models (LLMs) to detect anomalies but face challenges in localizing fine-grained visual transitions and diverse events. Therefore, we propose EventVAD, an event-aware video anomaly detection framework that combines tailored dynamic graph architectures and multimodal LLMs through temporal-event reasoning. Specifically, EventVAD first employs dynamic spatiotemporal graph modeling with time-decay constraints to capture event-aware video features. Then, it performs adaptive noise filtering and uses signal ratio thresholding to detect event boundaries via unsupervised statistical features. The statistical boundary detection module reduces the complexity of processing long videos for MLLMs and improves their temporal reasoning through event consistency. Finally, it utilizes a hierarchical prompting strategy to guide MLLMs in performing reasoning before determining final decisions. We conducted extensive experiments on the UCF-Crime and XD-Violence datasets. The results demonstrate that EventVAD with a 7B MLLM achieves state-of-the-art (SOTA) in training-free settings, outperforming strong baselines that use 7B or larger MLLMs.

[73] RF-DETR Object Detection vs YOLOv12 : A Study of Transformer-based and CNN-based Architectures for Single-Class and Multi-Class Greenfruit Detection in Complex Orchard Environments Under Label Ambiguity

Ranjan Sapkota,Rahul Harsha Cheppally,Ajay Sharda,Manoj Karkee

Main category: cs.CV

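TL;DR: Compares RF-DETR and YOLOv12 for detecting green fruit in complex orchard environments: RF-DETR is stronger at global context modeling and localization, while YOLOv12 is better at computational efficiency and local feature extraction.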

Details

Motivation: Evaluate the two models in complex orchard environments (label ambiguity, occlusion, and background blending) to support precision agriculture applications. Method: Tests RF-DETR (DINOv2 backbone with deformable attention) and YOLOv12 (CNN-based attention) on a custom dataset with single-class and two-class annotations. Result: RF-DETR achieves the highest mAP50 in single-class detection (0.9464) and the highest mAP@50 in multi-class detection (0.8298); YOLOv12 variants score highest on mAP@50:95 and suit fast-response scenarios. Conclusion: RF-DETR suits agricultural applications that require high accuracy, while YOLOv12 suits settings with limited compute. Abstract: This study conducts a detailed comparison of the RF-DETR object detection base model and YOLOv12 object detection model configurations for detecting greenfruits in a complex orchard environment marked by label ambiguity, occlusions, and background blending. A custom dataset was developed featuring both single-class (greenfruit) and multi-class (occluded and non-occluded greenfruits) annotations to assess model performance under dynamic real-world conditions. The RF-DETR object detection model, utilizing a DINOv2 backbone and deformable attention, excelled in global context modeling, effectively identifying partially occluded or ambiguous greenfruits. In contrast, YOLOv12 leveraged CNN-based attention for enhanced local feature extraction, optimizing it for computational efficiency and edge deployment. RF-DETR achieved the highest mean Average Precision (mAP50) of 0.9464 in single-class detection, proving its superior ability to localize greenfruits in cluttered scenes. Although YOLOv12N recorded the highest mAP@50:95 of 0.7620, RF-DETR consistently outperformed in complex spatial scenarios. For multi-class detection, RF-DETR led with an mAP@50 of 0.8298, showing its capability to differentiate between occluded and non-occluded fruits, while YOLOv12L scored highest in mAP@50:95 with 0.6622, indicating better classification in detailed occlusion contexts. Training dynamics analysis highlighted RF-DETR's swift convergence, particularly in single-class settings where it plateaued within 10 epochs, demonstrating the efficiency of transformer-based architectures in adapting to dynamic visual data.
These findings validate RF-DETR's effectiveness for precision agricultural applications, with YOLOv12 suited for fast-response scenarios. Index Terms: RF-DETR object detection, YOLOv12, YOLOv13, YOLOv14, YOLOv15, YOLOE, YOLO World, YOLO, You Only Look Once, Roboflow, Detection Transformers, CNNs
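The two metrics quoted in this entry differ only in how strictly a predicted box must overlap the ground truth: mAP@50 counts a detection as correct at an intersection-over-union (IoU) of at least 0.5, while mAP@50:95 averages over IoU thresholds from 0.50 to 0.95 in steps of 0.05, following the COCO convention. The sketch below shows just that overlap test; the helper names and the `(x1, y1, x2, y2)` box format are assumptions, and the full precision-recall integration behind mAP is not reproduced.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, gt_box, threshold):
    """A prediction matches a ground-truth box if IoU reaches the threshold."""
    return iou(pred_box, gt_box) >= threshold


# IoU thresholds averaged by mAP@50:95 (0.50, 0.55, ..., 0.95).
COCO_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]
```

A box with IoU 0.7 counts as a hit for mAP@50 but as a miss at the stricter thresholds, which is one reason a model can lead on one metric and trail on the other.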

[74] UniEdit-Flow: Unleashing Inversion and Editing in the Era of Flow Models

Guanlong Jiao,Biqing Huang,Kuan-Chieh Wang,Renjie Liao

Main category: cs.CV

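TL;DR: Proposes a predictor-corrector framework for inversion and editing in flow models, comprising the Uni-Inv inversion method and the Uni-Edit editing method, that is efficient and broadly applicable.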

Details

Motivation: Inversion and editing methods designed for diffusion models work poorly on flow models, whose straight-line trajectories open the door to new solutions. Method: Proposes Uni-Inv for inversion and Uni-Edit for editing, both built on a predictor-corrector framework that is tuning-free and model-agnostic. Result: Experiments show that Uni-Inv and Uni-Edit perform strongly and generalize across various generative models, including in low-cost settings. Conclusion: The framework offers an efficient, general solution for inversion and editing in flow models. Abstract: Flow matching models have emerged as a strong alternative to diffusion models, but existing inversion and editing methods designed for diffusion are often ineffective or inapplicable to them. The straight-line, non-crossing trajectories of flow models pose challenges for diffusion-based approaches but also open avenues for novel solutions. In this paper, we introduce a predictor-corrector-based framework for inversion and editing in flow models. First, we propose Uni-Inv, an effective inversion method designed for accurate reconstruction. Building on this, we extend the concept of delayed injection to flow models and introduce Uni-Edit, a region-aware, robust image editing approach. Our methodology is tuning-free, model-agnostic, efficient, and effective, enabling diverse edits while ensuring strong preservation of edit-irrelevant regions. Extensive experiments across various generative models demonstrate the superiority and generalizability of Uni-Inv and Uni-Edit, even under low-cost settings. Project page: https://uniedit-flow.github.io/

[75] Probing and Inducing Combinational Creativity in Vision-Language Models

Yongqian Peng,Yuxi Ma,Mengmeng Wang,Yuxuan Wang,Yizhou Wang,Chi Zhang,Yixin Zhu,Zilong Zheng

Main category: cs.CV

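TL;DR: Studies the combinational creativity of vision-language models (VLMs), proposes the IEI framework to evaluate it, and validates the framework experimentally.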

Details

Motivation: Examine whether VLMs exhibit combinational creativity rather than mere pattern matching, and provide a theoretical basis for evaluating artificial creativity. Method: Proposes the Identification-Explanation-Implication (IEI) framework and builds the CreativeMashup dataset to validate it. Result: In comprehension tasks, the best VLMs surpass average human performance but fall short of experts; in generation tasks, incorporating the IEI framework significantly improves creative quality. Conclusion: The IEI framework provides a theoretical basis for evaluating artificial creativity and guides creative generation in VLMs. Abstract: The ability to combine existing concepts into novel ideas stands as a fundamental hallmark of human intelligence. Recent advances in Vision-Language Models (VLMs) like GPT-4V and DALLE-3 have sparked debate about whether their outputs reflect combinational creativity--defined by M. A. Boden (1998) as synthesizing novel ideas through combining existing concepts--or sophisticated pattern matching of training data. Drawing inspiration from cognitive science, we investigate the combinational creativity of VLMs from the lens of concept blending. We propose the Identification-Explanation-Implication (IEI) framework, which decomposes creative processes into three levels: identifying input spaces, extracting shared attributes, and deriving novel semantic implications. To validate this framework, we curate CreativeMashup, a high-quality dataset of 666 artist-generated visual mashups annotated according to the IEI framework. Through extensive experiments, we demonstrate that in comprehension tasks, the best VLMs have surpassed average human performance while falling short of expert-level understanding; in generation tasks, incorporating our IEI framework into the generation pipeline significantly enhances the creative quality of VLM outputs. Our findings establish both a theoretical foundation for evaluating artificial creativity and practical guidelines for improving creative generation in VLMs.

[76] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models

Haojian Huang,Haodong Chen,Shengqiong Wu,Meng Luo,Jinlan Fu,Xinya Du,Hanwang Zhang,Hao Fei

Main category: cs.CV

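TL;DR: VistaDPO is a novel framework that hierarchically optimizes video-text alignment to address misalignment with human intuition and video hallucination in large video models (LVMs).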

Details

Motivation: Existing large video models (LVMs) suffer from misalignment with human intuition and from video hallucination, and need a finer-grained alignment method. Method: VistaDPO optimizes video-text alignment at three levels (instance, temporal, and perceptive) and builds the VistaDPO-7k dataset to support training. Result: Experiments show that VistaDPO significantly improves LVM performance on video hallucination, video QA, and captioning tasks. Conclusion: VistaDPO effectively mitigates video-language misalignment and offers a more reliable approach to video understanding. Abstract: Large Video Models (LVMs) built upon Large Language Models (LLMs) have shown promise in video understanding but often suffer from misalignment with human intuition and video hallucination issues. To address these challenges, we introduce VistaDPO, a novel framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization. VistaDPO enhances text-video preference alignment across three hierarchical levels: i) Instance Level, aligning overall video content with responses; ii) Temporal Level, aligning video temporal semantics with event descriptions; and iii) Perceptive Level, aligning spatial objects with language tokens. Given the lack of datasets for fine-grained video-language preference alignment, we construct VistaDPO-7k, a dataset of 7.2K QA pairs annotated with chosen and rejected responses, along with spatial-temporal grounding information such as timestamps, keyframes, and bounding boxes. Extensive experiments on benchmarks such as Video Hallucination, Video QA, and Captioning performance tasks demonstrate that VistaDPO significantly improves the performance of existing LVMs, effectively mitigating video-language misalignment and hallucination. The code and data are available at https://github.com/HaroldChen19/VistaDPO.
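VistaDPO builds on Direct Preference Optimization. For orientation, the standard single-pair DPO objective is sketched below. This is only the generic loss: the instance-, temporal-, and perceptive-level terms described in the abstract are not reproduced, and the function name, argument names, and the `beta` default are assumptions made for illustration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the policy being trained
    and under a frozen reference model.
    """
    # How much more the policy prefers the chosen response than the reference does.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference model.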

[77] Low-hallucination Synthetic Captions for Large-Scale Vision-Language Model Pre-training

Xinsong Zhang,Yarong Zeng,Xinting Huang,Hu Hu,Runquan Xie,Han Hu,Zhanhui Kang

Main category: cs.CV

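TL;DR: Proposes a scalable synthetic caption generation technique for vision-language model pre-training and shows that large-scale low-hallucination synthetic captions serve two purposes: substituting for real-world data and improving model performance.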

Details

Motivation: Vision-language model pre-training currently depends on high-quality image-text pairs, but data scarcity and saturation limit progress in the field. Method: Proposes a new pipeline for generating high-quality, low-hallucination synthetic captions and substantially reduces hallucinations with a continuous DPO method. Result: Synthetic captions yield a performance gain of at least 6.2% across 35 vision-language tasks and substantially lower FID scores in the text-to-image domain. Conclusion: Synthetic captions are an effective alternative for pre-training and significantly improve model performance; a low-hallucination synthetic caption dataset will also be released. Abstract: In recent years, the field of vision-language model pre-training has experienced rapid advancements, driven primarily by the continuous enhancement of textual capabilities in large language models. However, existing training paradigms for multimodal large language models heavily rely on high-quality image-text pairs. As models and data scales grow exponentially, the availability of such meticulously curated data has become increasingly scarce and saturated, thereby severely limiting further advancements in this domain. This study investigates scalable caption generation techniques for vision-language model pre-training and demonstrates that large-scale low-hallucination synthetic captions can serve dual purposes: 1) acting as a viable alternative to real-world data for pre-training paradigms and 2) achieving superior performance enhancement when integrated into vision-language models through empirical validation. This paper presents three key contributions: 1) a novel pipeline for generating high-quality, low-hallucination, and knowledge-rich synthetic captions. Our continuous DPO methodology yields remarkable results in reducing hallucinations. Specifically, the non-hallucination caption rate on a held-out test set increases from 48.2% to 77.9% for a 7B-size model. 2) Comprehensive empirical validation reveals that our synthetic captions confer superior pre-training advantages over their counterparts. Across 35 vision language tasks, the model trained with our data achieves a significant performance gain of at least 6.2% compared to alt-text pairs and other previous work. Meanwhile, it also offers considerable support in the text-to-image domain.
With our dataset, the FID score is reduced by 17.1 on a real-world validation benchmark and 13.3 on the MSCOCO validation benchmark. 3) We will release Hunyuan-Recap100M, a low-hallucination and knowledge-intensive synthetic caption dataset.

[78] Science-T2I: Addressing Scientific Illusions in Image Synthesis

Jialuo Li,Wenhao Chai,Xingyu Fu,Haiyang Xu,Saining Xie

Main category: cs.CV

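TL;DR: Proposes a new approach to integrating scientific knowledge into generative models, using the Science-T2I dataset and the SciScore evaluation model to significantly improve the scientific realism and consistency of generated images.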

Details

Motivation: Existing generative models lack realism and consistency when synthesizing images in scientific domains, so scientific knowledge needs to be introduced. Method: (1) Builds the Science-T2I dataset; (2) develops the SciScore evaluation model; (3) proposes a two-stage fine-tuning framework (supervised fine-tuning followed by masked online fine-tuning). Result: SciScore approaches human-level evaluation (a 5% improvement), and applying the method to FLUX improves performance by more than 50%. Conclusion: The approach sets a new standard for evaluating the scientific realism of generated content and significantly improves generative models in scientific domains. Abstract: We present a novel approach to integrating scientific knowledge into generative models, enhancing their realism and consistency in image synthesis. First, we introduce Science-T2I, an expert-annotated adversarial dataset comprising 20k adversarial image pairs with 9k prompts, covering a wide range of distinct scientific knowledge categories. Leveraging Science-T2I, we present SciScore, an end-to-end reward model that refines the assessment of generated images based on scientific knowledge, which is achieved by augmenting both the scientific comprehension and visual capabilities of a pre-trained CLIP model. Additionally, based on SciScore, we propose a two-stage training framework, comprising a supervised fine-tuning phase and a masked online fine-tuning phase, to incorporate scientific knowledge into existing generative models. Through comprehensive experiments, we demonstrate the effectiveness of our framework in establishing new standards for evaluating the scientific realism of generated content. Specifically, SciScore attains performance comparable to human-level, demonstrating a 5% improvement similar to evaluations conducted by experienced human evaluators. Furthermore, by applying our proposed fine-tuning method to FLUX, we achieve a performance enhancement exceeding 50% on SciScore.

[79] PCBEAR: Pose Concept Bottleneck for Explainable Action Recognition

Jongseo Lee,Wooil Lee,Gyeong-Moon Park,Seong Tae Kim,Jinwoo Choi

Main category: cs.CV

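TL;DR: PCBEAR proposes a concept bottleneck framework based on human pose sequences for explainable action recognition, capturing motion dynamics through static and dynamic pose concepts while maintaining high classification performance.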

Details

Motivation: Existing video XAI methods struggle to capture motion dynamics and temporal dependencies; PCBEAR aims to provide explainable action recognition through human pose sequences. Method: PCBEAR uses human skeleton poses as motion-aware structured concepts and automatically discovers static and dynamic pose concepts through clustering. Result: Validated on KTH, Penn-Action, and HAA500, PCBEAR maintains high classification performance while providing interpretable, motion-driven explanations. Conclusion: PCBEAR combines strong performance with interpretability and supports test-time interventions for debugging and improving model behavior. Abstract: Human action recognition (HAR) has achieved impressive results with deep learning models, but their decision-making process remains opaque due to their black-box nature. Ensuring interpretability is crucial, especially for real-world applications requiring transparency and accountability. Existing video XAI methods primarily rely on feature attribution or static textual concepts, both of which struggle to capture motion dynamics and temporal dependencies essential for action understanding. To address these challenges, we propose Pose Concept Bottleneck for Explainable Action Recognition (PCBEAR), a novel concept bottleneck framework that introduces human pose sequences as motion-aware, structured concepts for video action recognition. Unlike methods based on pixel-level features or static textual descriptions, PCBEAR leverages human skeleton poses, which focus solely on body movements, providing robust and interpretable explanations of motion dynamics. We define two types of pose-based concepts: static pose concepts for spatial configurations at individual frames, and dynamic pose concepts for motion patterns across multiple frames. To construct these concepts, PCBEAR applies clustering to video pose sequences, allowing for automatic discovery of meaningful concepts without manual annotation. We validate PCBEAR on KTH, Penn-Action, and HAA500, showing that it achieves high classification performance while offering interpretable, motion-driven explanations. Our method provides both strong predictive performance and human-understandable insights into the model's reasoning process, enabling test-time interventions for debugging and improving model behavior.

[80] $\texttt{Complex-Edit}$: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark

Siwei Yang,Mude Hui,Bingchen Zhao,Yuyin Zhou,Nataniel Ruiz,Cihang Xie

Main category: cs.CV

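TL;DR: $\texttt{Complex-Edit}$ is a comprehensive benchmark for evaluating instruction-based image editing models, with diverse instructions generated automatically by GPT-4o and a multi-dimensional set of evaluation metrics. It finds that open-source models significantly underperform closed-source models and that complex instructions reduce a model's ability to retain key elements and aesthetic quality.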

Details

Motivation: The performance of existing image editing models under complex instructions has not been evaluated systematically, so a comprehensive benchmark is needed to fill the gap. Method: Uses GPT-4o to generate diverse instructions automatically, composes atomic tasks into complex instructions through a "Chain-of-Edit" pipeline, and designs a VLM-based auto-evaluation pipeline. Result: Open-source models significantly underperform closed-source models; complex instructions impair the retention of key elements and aesthetic quality; executing complex instructions step by step degrades performance; a Best-of-N strategy improves results; and training on synthetic data makes edited images look increasingly synthetic. Conclusion: $\texttt{Complex-Edit}$ provides a systematic tool for evaluating image editing models, exposes the gap between open-source and closed-source models, and identifies how complex instructions and synthetic data affect model behavior. Abstract: We introduce $\texttt{Complex-Edit}$, a comprehensive benchmark designed to systematically evaluate instruction-based image editing models across instructions of varying complexity. To develop this benchmark, we harness GPT-4o to automatically collect a diverse set of editing instructions at scale. Our approach follows a well-structured "Chain-of-Edit" pipeline: we first generate individual atomic editing tasks independently and then integrate them to form cohesive, complex instructions. Additionally, we introduce a suite of metrics to assess various aspects of editing performance, along with a VLM-based auto-evaluation pipeline that supports large-scale assessments. Our benchmark yields several notable insights: 1) Open-source models significantly underperform relative to proprietary, closed-source models, with the performance gap widening as instruction complexity increases; 2) Increased instructional complexity primarily impairs the models' ability to retain key elements from the input images and to preserve the overall aesthetic quality; 3) Decomposing a complex instruction into a sequence of atomic steps, executed in a step-by-step manner, substantially degrades performance across multiple metrics; 4) A straightforward Best-of-N selection strategy improves results for both direct editing and the step-by-step sequential approach; and 5) We observe a "curse of synthetic data": when synthetic data is involved in model training, the edited images from such models tend to appear increasingly synthetic as the complexity of the editing instructions rises -- a phenomenon that intriguingly also manifests in the latest GPT-4o outputs.
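The Best-of-N selection mentioned in this abstract can be pictured as sampling several candidate edits and keeping the one an automatic scorer rates highest. The sketch below is a generic illustration, not the benchmark's code: `best_of_n`, `edit_fn`, and `score_fn` are hypothetical names, and in practice `edit_fn` would wrap an editing model and `score_fn` an evaluator such as a VLM-based metric.

```python
import random


def best_of_n(edit_fn, score_fn, image, instruction, n=4, seed=0):
    """Sample n candidate edits and return the highest-scoring one.

    edit_fn(image, instruction, rng) -> candidate   (hypothetical editor)
    score_fn(candidate) -> float                    (hypothetical scorer)
    """
    rng = random.Random(seed)
    # Each call is expected to draw fresh randomness from rng.
    candidates = [edit_fn(image, instruction, rng) for _ in range(n)]
    # max() keeps the first candidate on ties.
    return max(candidates, key=score_fn)
```

The cost is n editor calls plus n scorer calls per instruction, so n trades compute for quality.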

[81] St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World

Haiwen Feng,Junyi Zhang,Qianqian Wang,Yufei Ye,Pengcheng Yu,Michael J. Black,Trevor Darrell,Angjoo Kanazawa

Main category: cs.CV

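TL;DR: St4RTrack is a framework that handles dynamic 3D reconstruction and point tracking simultaneously, unifying the two tasks efficiently by predicting pointmaps and using a reprojection loss.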

Details

Motivation: Conventional methods treat dynamic 3D reconstruction and point tracking as separate tasks and overlook their deep connection. Method: Proposes the St4RTrack framework, which unifies reconstruction and tracking of dynamic content by predicting pointmaps and using a reprojection loss. Result: The framework's effectiveness and efficiency are validated on a new extensive benchmark. Conclusion: St4RTrack offers an efficient, unified solution for dynamic 3D reconstruction and tracking; the code and model will be released. Abstract: Dynamic 3D reconstruction and point tracking in videos are typically treated as separate tasks, despite their deep connection. We propose St4RTrack, a feed-forward framework that simultaneously reconstructs and tracks dynamic video content in a world coordinate frame from RGB inputs. This is achieved by predicting two appropriately defined pointmaps for a pair of frames captured at different moments. Specifically, we predict both pointmaps at the same moment, in the same world, capturing both static and dynamic scene geometry while maintaining 3D correspondences. Chaining these predictions through the video sequence with respect to a reference frame naturally computes long-range correspondences, effectively combining 3D reconstruction with 3D tracking. Unlike prior methods that rely heavily on 4D ground truth supervision, we employ a novel adaptation scheme based on a reprojection loss. We establish a new extensive benchmark for world-frame reconstruction and tracking, demonstrating the effectiveness and efficiency of our unified, data-driven framework. Our code, model, and benchmark will be released.
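A reprojection loss, as named in this abstract, generally compares where predicted 3D points land in the image with where they were observed in 2D. The sketch below shows the generic form under simplifying assumptions that are not taken from the paper: a pinhole camera with known intrinsics `fx`, `fy`, `cx`, `cy`, points already expressed in the camera frame with positive depth, and a mean Euclidean pixel error. The function names are hypothetical.

```python
def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point (x, y, z), z > 0."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def reprojection_loss(points_cam, pixels, fx, fy, cx, cy):
    """Mean pixel distance between projected 3D points and observed 2D points."""
    total = 0.0
    for point, (u, v) in zip(points_cam, pixels):
        pu, pv = project(point, fx, fy, cx, cy)
        total += ((pu - u) ** 2 + (pv - v) ** 2) ** 0.5
    return total / len(points_cam)
```

Because the supervision signal lives in 2D, a loss of this kind can be computed without 4D ground truth, which is the motivation the abstract gives for using it.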

[82] Training-Free Hierarchical Scene Understanding for Gaussian Splatting with Superpoint Graphs

Shaohui Dai,Yansong Qu,Zheyan Li,Xinyang Li,Shengchuan Zhang,Liujuan Cao

Main category: cs.CV

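TL;DR: Proposes a training-free method that builds a superpoint graph directly from Gaussian primitives to achieve consistent 3D semantics, significantly improving efficiency and performance.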

Details

Motivation: Existing methods require iterative optimization over 2D semantic feature maps, which is inefficient and yields inconsistent 3D semantics, so a more efficient and consistent method is needed. Method: Builds a superpoint graph that partitions the scene into semantically coherent regions and designs an efficient reprojection strategy that lifts 2D semantic features onto the superpoints, avoiding iterative multi-view training. Result: Achieves state-of-the-art performance on open-vocabulary segmentation, with semantic field reconstruction over 30 times faster. Conclusion: The method is efficient and consistent, supports open-vocabulary perception, and provides a structured foundation for scene understanding. Abstract: Bridging natural language and 3D geometry is a crucial step toward flexible, language-driven scene understanding. While recent advances in 3D Gaussian Splatting (3DGS) have enabled fast and high-quality scene reconstruction, research has also explored incorporating open-vocabulary understanding into 3DGS. However, most existing methods require iterative optimization over per-view 2D semantic feature maps, which not only results in inefficiencies but also leads to inconsistent 3D semantics across views. To address these limitations, we introduce a training-free framework that constructs a superpoint graph directly from Gaussian primitives. The superpoint graph partitions the scene into spatially compact and semantically coherent regions, forming view-consistent 3D entities and providing a structured foundation for open-vocabulary understanding. Based on the graph structure, we design an efficient reprojection strategy that lifts 2D semantic features onto the superpoints, avoiding costly multi-view iterative training. The resulting representation ensures strong 3D semantic coherence and naturally supports hierarchical understanding, enabling both coarse- and fine-grained open-vocabulary perception within a unified semantic field. Extensive experiments demonstrate that our method achieves state-of-the-art open-vocabulary segmentation performance, with semantic field reconstruction completed over $30\times$ faster. Our code will be available at https://github.com/Atrovast/THGS.

[83] AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis

Khiem Vuong,Anurag Ghosh,Deva Ramanan,Srinivasa Narasimhan,Shubham Tulsiani

Main category: cs.CV

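TL;DR: Proposes a hybrid dataset framework that combines pseudo-synthetic renderings with real ground-level images to handle the extreme viewpoint variation in aerial-ground geometric reconstruction, significantly improving algorithm performance.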

Details

Motivation: Existing learning-based methods struggle with the extreme viewpoint variation between ground and aerial images, mainly because high-quality, co-registered training datasets are lacking. Method: Proposes a scalable framework that combines pseudo-synthetic renderings of 3D city meshes with real ground-level images to bridge the domain gap. Result: Performance on real-world zero-shot tasks improves significantly (for example, the share of aerial-ground pairs localized within 5 degrees of camera rotation error rises from under 5% to nearly 56%). Conclusion: The method improves camera estimation and scene reconstruction and also benefits downstream tasks such as novel-view synthesis, showing practical value. Abstract: We explore the task of geometric reconstruction of images captured from a mixture of ground and aerial views. Current state-of-the-art learning-based approaches fail to handle the extreme viewpoint variation between aerial-ground image pairs. Our hypothesis is that the lack of high-quality, co-registered aerial-ground datasets for training is a key reason for this failure. Such data is difficult to assemble precisely because it is difficult to reconstruct in a scalable way. To overcome this challenge, we propose a scalable framework combining pseudo-synthetic renderings from 3D city-wide meshes (e.g., Google Earth) with real, ground-level crowd-sourced images (e.g., MegaDepth). The pseudo-synthetic data simulates a wide range of aerial viewpoints, while the real, crowd-sourced images help improve visual fidelity for ground-level images where mesh-based renderings lack sufficient detail, effectively bridging the domain gap between real images and pseudo-synthetic renderings. Using this hybrid dataset, we fine-tune several state-of-the-art algorithms and achieve significant improvements on real-world, zero-shot aerial-ground tasks. For example, we observe that baseline DUSt3R localizes fewer than 5% of aerial-ground pairs within 5 degrees of camera rotation error, while fine-tuning with our data raises accuracy to nearly 56%, addressing a major failure point in handling large viewpoint changes. Beyond camera estimation and scene reconstruction, our dataset also improves performance on downstream tasks like novel-view synthesis in challenging aerial-ground scenarios, demonstrating the practical value of our approach in real-world applications.
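The "within 5 degrees of camera rotation error" figure in this abstract is an accuracy-at-threshold metric. One common way to compute it, sketched here under assumptions, is to take the geodesic angle between predicted and ground-truth rotation matrices and count the fraction of pairs below the threshold. The paper's exact evaluation protocol is not given in the abstract, and the function names below are hypothetical.

```python
import math


def rotation_error_deg(r_pred, r_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_pred^T R_gt) equals the sum of elementwise products.
    trace = sum(r_pred[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))


def accuracy_within(errors_deg, threshold_deg=5.0):
    """Fraction of pairs whose rotation error is at most the threshold."""
    return sum(e <= threshold_deg for e in errors_deg) / len(errors_deg)


def rot_z(deg):
    """Rotation about the z axis, used here only to build test inputs."""
    t = math.radians(deg)
    return [[math.cos(t), -math.sin(t), 0.0],
            [math.sin(t), math.cos(t), 0.0],
            [0.0, 0.0, 1.0]]
```

For example, `rotation_error_deg(rot_z(0), rot_z(10))` is about 10 degrees, so that pair would not count toward accuracy at the 5-degree threshold.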

[84] Digital Twin Generation from Visual Data: A Survey

Andrew Melnik,Benjamin Alt,Giang Nguyen,Artur Wilkowski,Maciej Stefańczyk,Qirui Wu,Sinan Harms,Helge Rhodin,Manolis Savva,Michael Beetz

Main category: cs.CV

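TL;DR: Surveys recent progress in generating digital twins from videos, discusses applications in robotics, media content creation, and design and construction, analyzes the strengths and weaknesses of several approaches, and proposes future research directions.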

Details

Motivation: Digital twins have broad application potential across many fields, but existing methods face challenges such as occlusions, lighting variations, and scalability, which calls for a systematic review. Method: Analyzes approaches including 3D Gaussian Splatting, generative in-painting, semantic segmentation, and foundation models, and compares their advantages and limitations. Result: Summarizes current state-of-the-art methods and their potential and limitations in real-world applications. Conclusion: The survey gives a comprehensive overview of digital twin research and points to future research directions. Abstract: This survey explores recent developments in generating digital twins from videos. Such digital twins can be used for robotics applications, media content creation, or design and construction works. We analyze various approaches, including 3D Gaussian Splatting, generative in-painting, semantic segmentation, and foundation models, highlighting their advantages and limitations. Additionally, we discuss challenges such as occlusions, lighting variations, and scalability, as well as potential future research directions. This survey aims to provide a comprehensive overview of state-of-the-art methodologies and their implications for real-world applications. Awesome list: https://github.com/ndrwmlnk/awesome-digital-twins

[85] Personalized Text-to-Image Generation with Auto-Regressive Models

Kaiyue Sun,Xian Liu,Yao Teng,Xihui Liu

Main category: cs.CV

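TL;DR: Explores the potential of auto-regressive models for personalized image synthesis, proposes a two-stage training strategy, and shows that its results are comparable to those of diffusion models.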

Details

Motivation: Personalized image synthesis is an important application of text-to-image generation, but the potential of auto-regressive models in this area remains underexplored. Method: Uses a two-stage training strategy that combines optimization of text embeddings with fine-tuning of transformer layers. Result: Experiments show that the method is comparable to leading diffusion-based methods in subject fidelity and prompt following. Conclusion: Auto-regressive models show promise for personalized image generation and offer a new direction for future research. Abstract: Personalized image synthesis has emerged as a pivotal application in text-to-image generation, enabling the creation of images featuring specific subjects in diverse contexts. While diffusion models have dominated this domain, auto-regressive models, with their unified architecture for text and image modeling, remain underexplored for personalized image generation. This paper investigates the potential of optimizing auto-regressive models for personalized image synthesis, leveraging their inherent multimodal capabilities to perform this task. We propose a two-stage training strategy that combines optimization of text embeddings and fine-tuning of transformer layers. Our experiments on the auto-regressive model demonstrate that this method achieves comparable subject fidelity and prompt following to the leading diffusion-based personalization methods. The results highlight the effectiveness of auto-regressive models in personalized image generation, offering a new direction for future research in this area.

[86] ODHSR: Online Dense 3D Reconstruction of Humans and Scenes from Monocular Videos

Zetong Zhang,Manuel Kaufmann,Lixin Xue,Jie Song,Martin R. Oswald

Main category: cs.CV

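TL;DR: Proposes a new unified framework for online reconstruction of photorealistic scenes and humans from monocular video, combining camera tracking, human pose estimation, and scene reconstruction.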

Details

Motivation: Existing methods require pre-calibrated camera and human poses and long training times; the goal is efficient, online 3D reconstruction. Method: Uses 3D Gaussian Splatting to learn Gaussian primitives, designs reconstruction-based camera tracking and human pose estimation modules, and introduces a human deformation module and occlusion-aware rendering. Result: On the EMDB and NeuMan datasets, performance in camera tracking, pose estimation, and novel view synthesis is superior to or on par with existing methods. Conclusion: The framework achieves efficient, online reconstruction of photorealistic scenes and humans and has broad application potential. Abstract: Creating a photorealistic scene and human reconstruction from a single monocular in-the-wild video figures prominently in the perception of a human-centric 3D world. Recent neural rendering advances have enabled holistic human-scene reconstruction but require pre-calibrated camera and human poses, and days of training time. In this work, we introduce a novel unified framework that simultaneously performs camera tracking, human pose estimation and human-scene reconstruction in an online fashion. 3D Gaussian Splatting is utilized to learn Gaussian primitives for humans and scenes efficiently, and reconstruction-based camera tracking and human pose estimation modules are designed to enable holistic understanding and effective disentanglement of pose and appearance. Specifically, we design a human deformation module to reconstruct the details and enhance generalizability to out-of-distribution poses faithfully. Aiming to learn the spatial correlation between human and scene accurately, we introduce occlusion-aware human silhouette rendering and monocular geometric priors, which further improve reconstruction quality. Experiments on the EMDB and NeuMan datasets demonstrate superior or on-par performance with existing methods in camera tracking, human pose estimation, novel view synthesis and runtime. Our project page is at https://eth-ait.github.io/ODHSR.

[87] Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling

Tsung-Han Wu,Heekyung Lee,Jiaxin Ge,Joseph E. Gonzalez,Trevor Darrell,David M. Chan

Main category: cs.CV

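TL;DR: REVERSE is a unified framework that substantially reduces hallucination in vision-language models through hallucination-aware training and on-the-fly self-verification.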

Details

Motivation: Vision-language models excel at visual understanding but suffer from visual hallucinations, generating descriptions of nonexistent objects or actions, which poses risks in safety-critical applications. Existing methods either rely on heuristic adjustments or are complicated and tend to reject outputs rather than refine them. Method: REVERSE combines hallucination-aware training with on-the-fly self-verification, using a dataset of over 1.3M semi-synthetic samples and an inference-time retrospective resampling technique to detect and revise hallucinations dynamically. Result: REVERSE outperforms the best existing methods by up to 12% on CHAIR-MSCOCO and 28% on HaloQuest. Conclusion: The REVERSE framework effectively reduces hallucination in vision-language models and offers a more efficient solution. Abstract: Vision-Language Models (VLMs) excel at visual understanding but often suffer from visual hallucinations, where they generate descriptions of nonexistent objects, actions, or concepts, posing significant risks in safety-critical applications. Existing hallucination mitigation methods typically follow one of two paradigms: generation adjustment, which modifies decoding behavior to align text with visual inputs, and post-hoc verification, where external models assess and correct outputs. While effective, generation adjustment methods often rely on heuristics and lack correction mechanisms, while post-hoc verification is complicated, typically requiring multiple models and tending to reject outputs rather than refine them. In this work, we introduce REVERSE, a unified framework that integrates hallucination-aware training with on-the-fly self-verification. By leveraging a new hallucination-verification dataset containing over 1.3M semi-synthetic samples, along with a novel inference-time retrospective resampling technique, our approach enables VLMs to both detect hallucinations during generation and dynamically revise those hallucinations. Our evaluations show that REVERSE achieves state-of-the-art hallucination reduction, outperforming the best existing methods by up to 12% on CHAIR-MSCOCO and 28% on HaloQuest. Our dataset, model, and code are available at: https://reverse-vlm.github.io.

[88] IMAGGarment-1: Fine-Grained Garment Generation for Controllable Fashion Design

Fei Shen,Jian Yu,Cong Wang,Xin Jiang,Xiaoyu Du,Jinhui Tang

Main category: cs.CV

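TL;DR: IMAGGarment-1 is a fine-grained garment generation framework that uses a two-stage training strategy to achieve high-fidelity garment synthesis with precise control over silhouette, color, and logo.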

Details

Motivation: Existing methods are limited in multi-conditional controllability; the work targets the needs of personalized fashion design and digital apparel applications. Method: A two-stage training strategy: the first stage models global appearance with a mixed attention module and a color adapter; the second stage enhances local details with an adaptive appearance-aware module. Result: The method outperforms existing baselines in structural stability, color fidelity, and local controllability. Conclusion: IMAGGarment-1 provides an efficient solution for multi-conditional garment generation, and the GarmentBench dataset is released to support research. Abstract: This paper presents IMAGGarment-1, a fine-grained garment generation (FGG) framework that enables high-fidelity garment synthesis with precise control over silhouette, color, and logo placement. Unlike existing methods that are limited to single-condition inputs, IMAGGarment-1 addresses the challenges of multi-conditional controllability in personalized fashion design and digital apparel applications. Specifically, IMAGGarment-1 employs a two-stage training strategy to separately model global appearance and local details, while enabling unified and controllable generation through end-to-end inference. In the first stage, we propose a global appearance model that jointly encodes silhouette and color using a mixed attention module and a color adapter. In the second stage, we present a local enhancement model with an adaptive appearance-aware module to inject user-defined logos and spatial constraints, enabling accurate placement and visual consistency. To support this task, we release GarmentBench, a large-scale dataset comprising over 180K garment samples paired with multi-level design conditions, including sketches, color references, logo placements, and textual prompts. Extensive experiments demonstrate that our method outperforms existing baselines, achieving superior structural stability, color fidelity, and local controllability performance. The code and model are available at https://github.com/muzishen/IMAGGarment-1.

[89] Single-Shot Shape and Reflectance with Spatial Polarization Multiplexing

Tomoki Ichikawa,Ryo Kawahara,Ko Nishino

Main category: cs.CV

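TL;DR: Proposes spatial polarization multiplexing (SPM) for reconstructing object shape and reflectance from a single polarimetric image, and applies it to dynamic surface recovery.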

Details

Motivation: Conventional single-pattern structured light enables single-shot shape reconstruction, but reflectance is hard to recover because incident light is not sampled over angles and the projected pattern is entangled with the surface color texture. Method: A spatially multiplexed polarization pattern is designed that is decoded for shape reconstruction by quantizing AoLP values; projecting differently polarized light within a local region enables single-shot ellipsometry of linear polarization, which separates specular and diffuse reflections for BRDF estimation. Result: Experiments show that the method recovers the shape, the Mueller matrix, and the BRDF from a single-shot polarimetric image and applies to dynamic surfaces. Conclusion: SPM achieves efficient single-shot shape and reflectance reconstruction while retaining the natural surface appearance, and is suitable for dynamic scenes. Abstract: We propose spatial polarization multiplexing (SPM) for reconstructing object shape and reflectance from a single polarimetric image and demonstrate its application to dynamic surface recovery. Although single-pattern structured light enables single-shot shape reconstruction, the reflectance is challenging to recover due to the lack of angular sampling of incident light and the entanglement of the projected pattern and the surface color texture. We design a spatially multiplexed pattern of polarization that can be robustly and uniquely decoded for shape reconstruction by quantizing the AoLP values. At the same time, our spatial-multiplexing enables single-shot ellipsometry of linear polarization by projecting differently polarized light within a local region, which separates the specular and diffuse reflections for BRDF estimation. We achieve this spatial polarization multiplexing with a constrained de Bruijn sequence. Unlike single-pattern structured light with intensity and color, our polarization pattern is invisible to the naked eye and retains the natural surface appearance which is essential for accurate appearance modeling and also interaction with people. We experimentally validate our method on real data. The results show that our method can recover the shape, the Mueller matrix, and the BRDF from a single-shot polarimetric image. We also demonstrate the application of our method to dynamic surfaces.

[90] PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

Jang Hyun Cho,Andrea Madotto,Effrosyni Mavroudi,Triantafyllos Afouras,Tushar Nagarajan,Muhammad Maaz,Yale Song,Tengyu Ma,Shuming Hu,Suyog Jain,Miguel Martin,Huiyu Wang,Hanoona Rasheed,Peize Sun,Po-Yao Huang,Daniel Bolya,Nikhila Ravi,Shashank Jain,Tammy Stark,Shane Moon,Babak Damavandi,Vivian Lee,Andrew Westbury,Salman Khan,Philipp Krähenbühl,Piotr Dollár,Lorenzo Torresani,Kristen Grauman,Christoph Feichtenhofer

Main category: cs.CV

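TL;DR: The study presents a fully open and reproducible Perception Language Model (PLM) framework that addresses the closed-source nature of high-performing vision-language models and fills critical data gaps in video understanding with large-scale synthetic data and human-labeled data.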

Details

Motivation: Most high-performing vision-language models are closed-source, which limits the transparency of scientific progress. The study aims to advance image and video understanding research through an open framework and transparent methods. Method: Analyzes standard training pipelines without distillation from proprietary models, explores large-scale synthetic data, and releases 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Result: Introduces PLM-VideoBench, an evaluation suite focused on reasoning about the "what", "where", "when", and "how" of a video, and provides data, training recipes, code, and models for full reproducibility. Conclusion: Through open data and transparent methods, the study provides a measurable and reproducible foundation for scientific progress in video understanding. Abstract: Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design and training recipe. The research community has responded by using distillation from black-box models to label training data, achieving strong benchmark results, at the cost of measurable scientific progress. However, without knowing the details of the teacher model and its data sources, scientific progress remains difficult to measure. In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding. We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Additionally, we introduce PLM-VideoBench, a suite for evaluating challenging video understanding tasks focusing on the ability to reason about "what", "where", "when", and "how" of a video. We make our work fully reproducible by providing data, training recipes, code & models.

[91] Perception Encoder: The best visual embeddings are not at the output of the network

Daniel Bolya,Po-Yao Huang,Peize Sun,Jang Hyun Cho,Andrea Madotto,Chen Wei,Tengyu Ma,Jiale Zhi,Jathushan Rajasegaran,Hanoona Rasheed,Junke Wang,Marco Monteiro,Hu Xu,Shiyu Dong,Nikhila Ravi,Daniel Li,Piotr Dollár,Christoph Feichtenhofer

Main category: cs.CV

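TL;DR: PE is a high-performance encoder trained with vision-language contrastive learning that serves many vision tasks; its core idea is to extract general embeddings from intermediate layers, drawn out by two alignment methods for different task families.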

Details

Motivation: Traditional vision encoders need pretraining objectives tailored to each task, whereas PE aims to produce general embeddings from a single contrastive objective, simplifying the pipeline and improving performance. Method: Trained with vision-language contrastive learning, then combined with language alignment (for multimodal language modeling) and spatial alignment (for dense prediction) to draw general embeddings out of the network's intermediate layers. Result: PE reaches state-of-the-art performance in zero-shot classification, retrieval, Q&A, and spatial tasks such as detection and depth estimation. Conclusion: PE demonstrates the potential of a single contrastive objective; the models, code, and a new dataset are released to foster research. Abstract: We introduce Perception Encoder (PE), a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each tailored to specific downstream tasks such as classification, captioning, or localization. Surprisingly, after scaling our carefully tuned image pretraining recipe and refining with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network. To draw them out, we introduce two alignment methods, language alignment for multimodal language modeling, and spatial alignment for dense prediction. Together with the core contrastive checkpoint, our PE family of models achieves state-of-the-art performance on a wide variety of tasks, including zero-shot image and video classification and retrieval; document, image, and video Q&A; and spatial tasks such as detection, depth estimation, and tracking. To foster further research, we are releasing our models, code, and a novel dataset of synthetically and human-annotated videos.

cs.GR

[92] Prototype-Guided Diffusion for Digital Pathology: Achieving Foundation Model Performance with Minimal Clinical Data

Ekaterina Redekop,Mara Pleasure,Vedrana Ivezic,Zichen Wang,Kimberly Flores,Anthony Sisk,William Speier,Corey Arnold

Main category: cs.GR

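TL;DR: Proposes a prototype-guided diffusion model that generates high-quality synthetic pathology data, reducing reliance on real samples while preserving downstream performance.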

Details

Motivation: To investigate the relationship between dataset size and performance, and to reduce reliance on large-scale real pathology data. Method: A prototype-guided diffusion model generates synthetic pathology data, which is used for self-supervised learning. Result: Features trained on the synthetic data perform comparably to those trained on large real-world datasets, and a hybrid of synthetic and real data performs better still. Conclusion: Generative AI can create pathology training data efficiently and reduce the need for clinical data. Abstract: Foundation models in digital pathology use massive datasets to learn useful compact feature representations of complex histology images. However, there is limited transparency into what drives the correlation between dataset size and performance, raising the question of whether simply adding more data to increase performance is always necessary. In this study, we propose a prototype-guided diffusion model to generate high-fidelity synthetic pathology data at scale, enabling large-scale self-supervised learning and reducing reliance on real patient samples while preserving downstream performance. Using guidance from histological prototypes during sampling, our approach ensures biologically and diagnostically meaningful variations in the generated data. We demonstrate that self-supervised features trained on our synthetic dataset achieve competitive performance despite using ~60x-760x less data than models trained on large real-world datasets. Notably, models trained using our synthetic data showed statistically comparable or better performance across multiple evaluation metrics and tasks, even when compared to models trained on orders of magnitude larger datasets. Our hybrid approach, combining synthetic and real data, further enhanced performance, achieving top results in several evaluations. These findings underscore the potential of generative AI to create compelling training data for digital pathology, significantly reducing the reliance on extensive clinical datasets and highlighting the efficiency of our approach.

[93] One Model to Rig Them All: Diverse Skeleton Rigging with UniRig

Jia-Peng Zhang,Cheng-Feng Pu,Meng-Hao Guo,Yan-Pei Cao,Shi-Min Hu

Main category: cs.GR

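TL;DR: UniRig is a unified framework for automatic skeletal rigging, built on a large autoregressive model and a bone-point cross-attention mechanism, that markedly improves rigging and motion accuracy.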

Details

Motivation: The rapid growth of 3D content creation calls for automated rigging solutions that can handle complex and diverse 3D models. Method: UniRig encodes the hierarchical relationships of a skeleton with a Skeleton Tree Tokenization method and combines an autoregressive model with a cross-attention mechanism to generate high-quality skeletons and skinning weights. Result: On challenging datasets, UniRig improves rigging accuracy by 215% and motion accuracy by 194% over existing methods. Conclusion: By automating the rigging process, UniRig can substantially improve the efficiency and reach of animation production. Abstract: The rapid evolution of 3D content creation, encompassing both AI-powered methods and traditional workflows, is driving an unprecedented demand for automated rigging solutions that can keep pace with the increasing complexity and diversity of 3D models. We introduce UniRig, a novel, unified framework for automatic skeletal rigging that leverages the power of large autoregressive models and a bone-point cross-attention mechanism to generate both high-quality skeletons and skinning weights. Unlike previous methods that struggle with complex or non-standard topologies, UniRig accurately predicts topologically valid skeleton structures thanks to a new Skeleton Tree Tokenization method that efficiently encodes hierarchical relationships within the skeleton. To train and evaluate UniRig, we present Rig-XL, a new large-scale dataset of over 14,000 rigged 3D models spanning a wide range of categories. UniRig significantly outperforms state-of-the-art academic and commercial methods, achieving a 215% improvement in rigging accuracy and a 194% improvement in motion accuracy on challenging datasets. Our method works seamlessly across diverse object categories, from detailed anime characters to complex organic and inorganic structures, demonstrating its versatility and robustness. By automating the tedious and time-consuming rigging process, UniRig has the potential to speed up animation pipelines with unprecedented ease and efficiency. Project Page: https://zjp-shadow.github.io/works/UniRig/

[94] UniPhys: Unified Planner and Controller with Diffusion for Flexible Physics-Based Character Control

Yan Wu,Korrawe Karunratanakul,Zhengyi Luo,Siyu Tang

Main category: cs.GR

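TL;DR: UniPhys is a diffusion-based behavior cloning framework that unifies motion planning and control in a single model and generates natural, physically plausible character motion.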

Details

Motivation: Existing methods suffer from degraded motion quality under long-horizon control with diverse guidance signals and require task-specific fine-tuning. Method: Combines a diffusion model with a physics simulator and trains with the Diffusion Forcing paradigm to handle noisy motion histories and discrepancies introduced by the simulator. Result: UniPhys outperforms prior methods in motion naturalness, generalization, and robustness. Conclusion: UniPhys offers an efficient solution for generating diverse, physically plausible character motion. Abstract: Generating natural and physically plausible character motion remains challenging, particularly for long-horizon control with diverse guidance signals. While prior work combines high-level diffusion-based motion planners with low-level physics controllers, these systems suffer from domain gaps that degrade motion quality and require task-specific fine-tuning. To tackle this problem, we introduce UniPhys, a diffusion-based behavior cloning framework that unifies motion planning and control into a single model. UniPhys enables flexible, expressive character motion conditioned on multi-modal inputs such as text, trajectories, and goals. To address accumulated prediction errors over long sequences, UniPhys is trained with the Diffusion Forcing paradigm, learning to denoise noisy motion histories and handle discrepancies introduced by the physics simulator. This design allows UniPhys to robustly generate physically plausible, long-horizon motions. Through guided sampling, UniPhys generalizes to a wide range of control signals, including unseen ones, without requiring task-specific fine-tuning. Experiments show that UniPhys outperforms prior methods in motion naturalness, generalization, and robustness across diverse control tasks.

[95] SOPHY: Generating Simulation-Ready Objects with Physical Materials

Junyi Cao,Evangelos Kalogerakis

Main category: cs.GR

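TL;DR: SOPHY is a generative model for 3D physics-aware shape synthesis that jointly generates shape, texture, and material properties for use in dynamic environments.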

Details

Motivation: Existing 3D generative models focus only on static geometry or physics-agnostic animation, whereas SOPHY aims to generate objects that are ready for simulation and interaction. Method: Introduces a dataset of 3D objects annotated with detailed physical material attributes and an efficient annotation pipeline, and jointly models shape and material properties. Result: Experiments show that jointly modeling shape and material properties improves the realism and fidelity of generated shapes and improves generative geometry evaluation metrics. Conclusion: SOPHY supports text-driven generation and single-image reconstruction of physically plausible 3D objects, making them more useful in dynamic environments. Abstract: We present SOPHY, a generative model for 3D physics-aware shape synthesis. Unlike existing 3D generative models that focus solely on static geometry or 4D models that produce physics-agnostic animations, our approach jointly synthesizes shape, texture, and material properties related to physics-grounded dynamics, making the generated objects ready for simulations and interactive, dynamic environments. To train our model, we introduce a dataset of 3D objects annotated with detailed physical material attributes, along with an annotation pipeline for efficient material annotation. Our method enables applications such as text-driven generation of interactive, physics-aware 3D objects and single-image reconstruction of physically plausible shapes. Furthermore, our experiments demonstrate that jointly modeling shape and material properties enhances the realism and fidelity of generated shapes, improving performance on generative geometry evaluation metrics.

[96] StorySets: Ordering Curves and Dimensions for Visualizing Uncertain Sets and Multi-Dimensional Discrete Data

Markus Wallinger,Annika Bonerath,Wouter Meulemans,Martin Nöllenburg,Stephen Kobourov,Alexander Wolff

Main category: cs.GR

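TL;DR: Proposes a method for visualizing uncertain set systems in which elements are vertical glyphs and sets are x-monotone curves, with optimizations that reduce visual complexity.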

Details

Motivation: Existing set visualization methods assume certain membership and cannot represent uncertainty, so a new visualization method is needed. Method: Elements are represented by vertical glyphs and sets by x-monotone curves, and the number of curve turns and crossings is optimized. Result: The optimization algorithms reduce visual complexity and make set containment easy to see. Conclusion: The method is flexible and applies to both uncertain sets and multi-dimensional discrete data. Abstract: We propose a method for visualizing uncertain set systems, which differs from previous set visualization approaches that are based on certainty (an element either belongs to a set or not). Our method is inspired by storyline visualizations and parallel coordinate plots: (a) each element is represented by a vertical glyph, subdivided into bins that represent different levels of uncertainty; (b) each set is represented by an x-monotone curve that traverses element glyphs through the bins representing the level of uncertainty of their membership. Our implementation also includes optimizations to reduce visual complexity captured by the number of turns for the set curves and the number of crossings. Although several of the natural underlying optimization problems are NP-hard in theory (e.g., optimal element order, optimal set order), in practice, we can compute near-optimal solutions with respect to curve crossings with the help of a new exact algorithm for optimally ordering set curves within each element's bins. With these optimizations, the proposed method makes it easy to see set containment (the smaller set's curve is strictly below the larger set's curve). A brief design-space exploration using uncertain set-membership data, as well as multi-dimensional discrete data, shows the flexibility of the proposed approach.

[97] ARAP-GS: Drag-driven As-Rigid-As-Possible 3D Gaussian Splatting Editing with Diffusion Prior

Xiao Han,Runze Tian,Yifei Tong,Fenggen Yu,Dingyao Liu,Yan Zhang

Main category: cs.GR

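TL;DR: ARAP-GS is a drag-driven 3D Gaussian Splatting (3DGS) editing framework based on ARAP deformation; it is the first to apply ARAP deformation directly to 3D Gaussians, enabling flexible editing while a diffusion prior preserves visual quality.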

Details

Motivation: Drag-driven editing is little studied for 3DGS because it is hard to preserve shape coherence and visual continuity during deformation. Method: ARAP deformation is applied directly to 3D Gaussians, and a diffusion prior for image super-resolution is incorporated into the iterative optimization to maintain multi-view consistency. Result: Experiments show that ARAP-GS outperforms existing methods and is efficient, taking 10 to 20 minutes per scene. Conclusion: ARAP-GS provides an efficient, high-quality solution for drag-driven 3DGS editing. Abstract: Drag-driven editing has become popular among designers for its ability to modify complex geometric structures through simple and intuitive manipulation, allowing users to adjust and reshape content with minimal technical skill. This drag operation has been incorporated into numerous methods to facilitate the editing of 2D images and 3D meshes in design. However, few studies have explored drag-driven editing for the widely-used 3D Gaussian Splatting (3DGS) representation, as deforming 3DGS while preserving shape coherence and visual continuity remains challenging. In this paper, we introduce ARAP-GS, a drag-driven 3DGS editing framework based on As-Rigid-As-Possible (ARAP) deformation. Unlike previous 3DGS editing methods, we are the first to apply ARAP deformation directly to 3D Gaussians, enabling flexible, drag-driven geometric transformations. To preserve scene appearance after deformation, we incorporate an advanced diffusion prior for image super-resolution within our iterative optimization process. This approach enhances visual quality while maintaining multi-view consistency in the edited results. Experiments show that ARAP-GS outperforms current methods across diverse 3D scenes, demonstrating its effectiveness and superiority for drag-driven 3DGS editing. Additionally, our method is highly efficient, requiring only 10 to 20 minutes to edit a scene on a single RTX 3090 GPU.
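
For readers unfamiliar with ARAP, the classical energy from the mesh-deformation literature is sketched here as background. The abstract does not give the formulation ARAP-GS uses on Gaussians, so the neighborhoods $\mathcal{N}(i)$ and weights $w_{ij}$ below are the generic ones and should be read as assumptions, not the paper's definitions.

```latex
% Classical ARAP energy: p_i are rest positions, p'_i deformed positions,
% R_i is the best-fitting rotation for the neighborhood of point i.
E(\mathbf{p}') = \sum_{i} \sum_{j \in \mathcal{N}(i)} w_{ij}
  \left\| (\mathbf{p}'_i - \mathbf{p}'_j) - \mathbf{R}_i (\mathbf{p}_i - \mathbf{p}_j) \right\|^2
```

Minimizing this energy while pinning the dragged points keeps each local neighborhood close to a rigid motion, which is the usual reason ARAP preserves shape under large edits.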

[98] CAGE-GS: High-fidelity Cage Based 3D Gaussian Splatting Deformation

Yifei Tong,Runze Tian,Xiao Han,Dingyao Liu,Fenggen Yu,Yan Zhang

Main category: cs.GR

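TL;DR: CAGE-GS is a cage-based 3DGS deformation method in which a target shape guides the geometric transformation of the source scene, while a Jacobian matrix-based strategy preserves texture fidelity.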

Details

Motivation: 3DGS is increasingly popular as a 3D representation of real scenes, but preserving fine details during deformation remains a challenge. Method: A deformation cage is learned from the target shape, and a Jacobian matrix-based strategy updates the covariance parameters of each Gaussian. Result: On public datasets and newly proposed scenes, CAGE-GS significantly outperforms existing techniques in efficiency and deformation quality. Conclusion: CAGE-GS is a flexible and efficient 3DGS deformation method that accommodates many target shape representations. Abstract: As 3D Gaussian Splatting (3DGS) gains popularity as a 3D representation of real scenes, enabling user-friendly deformation to create novel scenes while preserving fine details from the original 3DGS has attracted significant research attention. We introduce CAGE-GS, a cage-based 3DGS deformation method that seamlessly aligns a source 3DGS scene with a user-defined target shape. Our approach learns a deformation cage from the target, which guides the geometric transformation of the source scene. While the cages effectively control structural alignment, preserving the textural appearance of 3DGS remains challenging due to the complexity of covariance parameters. To address this, we employ a Jacobian matrix-based strategy to update the covariance parameters of each Gaussian, ensuring texture fidelity post-deformation. Our method is highly flexible, accommodating various target shape representations, including texts, images, point clouds, meshes and 3DGS models. Extensive experiments and ablation studies on both public datasets and newly proposed scenes demonstrate that our method significantly outperforms existing techniques in both efficiency and deformation quality.
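
The Jacobian-based covariance update can be illustrated with the standard first-order rule for pushing a Gaussian through a smooth map: the mean moves to f(mean) and the covariance becomes J Σ Jᵀ, where J is the Jacobian of f at the mean. The sketch below is a minimal illustration under that assumption; the abstract does not spell out CAGE-GS's exact update, and the function names (`numerical_jacobian`, `deform_gaussian`, `stretch_x`) are hypothetical.

```python
import numpy as np

def numerical_jacobian(deform, x, eps=1e-5):
    """Central-difference Jacobian of a 3D deformation at point x."""
    x = np.asarray(x, dtype=float)
    J = np.zeros((3, 3))
    for k in range(3):
        step = np.zeros(3)
        step[k] = eps
        J[:, k] = (deform(x + step) - deform(x - step)) / (2.0 * eps)
    return J

def deform_gaussian(deform, mean, cov):
    """First-order update: mean -> f(mean), cov -> J cov J^T."""
    J = numerical_jacobian(deform, mean)
    return deform(mean), J @ cov @ J.T

def stretch_x(p):
    """Toy deformation (an assumption for the demo): double the x coordinate."""
    return np.array([2.0 * p[0], p[1], p[2]])

new_mean, new_cov = deform_gaussian(stretch_x, np.array([1.0, 0.0, 0.0]), np.eye(3))
print(new_mean)
print(np.round(new_cov, 6))
```

For a real cage deformation, `deform` would be the cage's interpolation function; an analytic Jacobian would normally replace the finite-difference one for speed.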

[99] AAA-Gaussians: Anti-Aliased and Artifact-Free 3D Gaussian Rendering

Michael Steiner,Thomas Köhler,Lukas Radl,Felix Windisch,Dieter Schmalstieg,Markus Steinberger

Main category: cs.GR

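TL;DR: 3D Gaussian Splatting (3DGS) performs well in 3D reconstruction but still suffers from aliasing, projection artifacts, and view inconsistencies. By introducing full 3D evaluation of Gaussians, an adaptive 3D smoothing filter, and a stable view-space bounding method, the proposed method markedly improves rendering quality.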

Details

Motivation: To address the aliasing, artifacts, and view inconsistencies that arise in 3DGS from the simplification of treating splats as 2D entities. Method: Introduces an adaptive 3D smoothing filter and a stable view-space bounding method, and extends tile-based culling to 3D using screen-space planes. Result: The method achieves state-of-the-art quality on in-distribution evaluation sets, significantly outperforms other approaches on out-of-distribution views, and effectively removes aliasing, distortions, and popping artifacts. Conclusion: Full 3D evaluation of Gaussians together with these optimizations yields real-time, artifact-free, high-quality rendering. Abstract: Although 3D Gaussian Splatting (3DGS) has revolutionized 3D reconstruction, it still faces challenges such as aliasing, projection artifacts, and view inconsistencies, primarily due to the simplification of treating splats as 2D entities. We argue that incorporating full 3D evaluation of Gaussians throughout the 3DGS pipeline can effectively address these issues while preserving rasterization efficiency. Specifically, we introduce an adaptive 3D smoothing filter to mitigate aliasing and present a stable view-space bounding method that eliminates popping artifacts when Gaussians extend beyond the view frustum. Furthermore, we promote tile-based culling to 3D with screen-space planes, accelerating rendering and reducing sorting costs for hierarchical rasterization. Our method achieves state-of-the-art quality on in-distribution evaluation sets and significantly outperforms other approaches for out-of-distribution views. Our qualitative evaluations further demonstrate the effective removal of aliasing, distortions, and popping artifacts, ensuring real-time, artifact-free rendering.

[100] 3D-PNAS: 3D Industrial Surface Anomaly Synthesis with Perlin Noise

Yifeng Cheng,Juan Du

Main category: cs.GR

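TL;DR: Proposes 3D-PNAS, a 3D anomaly generation method based on Perlin noise and surface parameterization, to address the scarcity of defect samples in 3D data for industrial anomaly detection.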

Details

Motivation: Defect samples in 3D data are scarce in industrial anomaly detection, which limits the use of 3D data in quality inspection, while existing methods focus mainly on 2D anomaly generation. Method: The point cloud is projected onto a 2D plane, multi-scale noise values are sampled from a Perlin noise field, and the point cloud is perturbed along its normal direction to generate realistic 3D surface anomalies. Result: Experiments show that the method generates diverse defect patterns and produces geometrically plausible anomalies across different object types. Conclusion: 3D-PNAS is an effective tool for 3D anomaly generation, and a codebase and visualization toolkit are provided to facilitate future research. Abstract: Large pretrained vision foundation models have shown significant potential in various vision tasks. However, for industrial anomaly detection, the scarcity of real defect samples poses a critical challenge in leveraging these models. While 2D anomaly generation has significantly advanced with established generative models, the adoption of 3D sensors in industrial manufacturing has made leveraging 3D data for surface quality inspection an emerging trend. In contrast to 2D techniques, 3D anomaly generation remains largely unexplored, limiting the potential of 3D data in industrial quality inspection. To address this gap, we propose a novel yet simple 3D anomaly generation method, 3D-PNAS, based on Perlin noise and surface parameterization. Our method generates realistic 3D surface anomalies by projecting the point cloud onto a 2D plane, sampling multi-scale noise values from a Perlin noise field, and perturbing the point cloud along its normal direction. Through comprehensive visualization experiments, we demonstrate how key parameters, including noise scale, perturbation strength, and octaves, provide fine-grained control over the generated anomalies, enabling the creation of diverse defect patterns from pronounced deformations to subtle surface variations. Additionally, our cross-category experiments show that the method produces consistent yet geometrically plausible anomalies across different object types, adapting to their specific surface characteristics. We also provide a comprehensive codebase and visualization toolkit to facilitate future research.
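
The three steps named in the abstract (project to 2D, sample multi-scale Perlin noise, perturb along normals) can be sketched roughly as follows. This is not the authors' code: the 2D projection is assumed to be given as `uv` coordinates (for the flat demo patch it is simply the x, y coordinates), the noise is a compact Perlin-style gradient noise written for this example, and all function and parameter names are hypothetical.

```python
import numpy as np

def perlin2d(uv, seed=0, grid=64):
    """Perlin-style 2D gradient noise at points uv of shape (N, 2)."""
    rng = np.random.default_rng(seed)
    angles = rng.uniform(0.0, 2.0 * np.pi, size=(grid, grid))
    grads = np.stack([np.cos(angles), np.sin(angles)], axis=-1)  # unit gradients
    cell = np.floor(uv).astype(int)
    frac = uv - cell
    fade = frac**3 * (frac * (frac * 6 - 15) + 10)  # 6t^5 - 15t^4 + 10t^3
    total = np.zeros(len(uv))
    for dx in (0, 1):
        for dy in (0, 1):
            corner = (cell + [dx, dy]) % grid  # wrap the lattice periodically
            g = grads[corner[:, 0], corner[:, 1]]
            offset = frac - [dx, dy]
            wx = fade[:, 0] if dx else 1 - fade[:, 0]
            wy = fade[:, 1] if dy else 1 - fade[:, 1]
            total += wx * wy * np.sum(g * offset, axis=1)
    return total

def fractal_noise(uv, octaves=3, scale=4.0, seed=0):
    """Sum octaves: each one doubles the frequency and halves the amplitude."""
    total, amp, freq = np.zeros(len(uv)), 1.0, scale
    for o in range(octaves):
        total += amp * perlin2d(uv * freq, seed=seed + o)
        amp *= 0.5
        freq *= 2.0
    return total

def perturb_along_normals(points, normals, uv, strength=0.01, **noise_kwargs):
    """Displace each point along its normal by strength * noise(uv)."""
    noise = fractal_noise(uv, **noise_kwargs)
    return points + strength * noise[:, None] * normals

# Demo on a flat patch: uv is just (x, y), normals all point along +z.
rng = np.random.default_rng(1)
xy = rng.uniform(0.0, 1.0, size=(500, 2))
points = np.column_stack([xy, np.zeros(500)])
normals = np.tile([0.0, 0.0, 1.0], (500, 1))
deformed = perturb_along_normals(points, normals, uv=xy, strength=0.02)
```

In this sketch `scale`, `strength`, and `octaves` play the roles of the noise scale, perturbation strength, and octaves that the abstract describes as the controls over anomaly appearance.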

[101] Real-time High-fidelity Gaussian Human Avatars with Position-based Interpolation of Spatially Distributed MLPs

Youyi Zhan,Tianjia Shao,Yin Yang,Kun Zhou

Main category: cs.GR

TL;DR: Proposes a new Gaussian human avatar representation that renders high-fidelity, pose-dependent appearance details in real time.

Details

Motivation: Existing methods either struggle to capture pose-dependent details or are computationally expensive and cannot render in real time. Method: Spatially distributed MLPs are used; Gaussian offset bases and their linear combinations enable learning of high-frequency signals, and control points constrain the distribution of the Gaussians. Result: Compared with existing methods, the approach achieves higher-fidelity appearance details and significantly faster rendering. Conclusion: The method markedly improves the appearance quality of Gaussian human avatars while keeping rendering real-time. Abstract: Many works have succeeded in reconstructing Gaussian human avatars from multi-view videos. However, they either struggle to capture pose-dependent appearance details with a single MLP, or rely on a computationally intensive neural network to reconstruct high-fidelity appearance but with rendering performance degraded to non-real-time. We propose a novel Gaussian human avatar representation that can reconstruct high-fidelity pose-dependent appearance with details and meanwhile can be rendered in real time. Our Gaussian avatar is empowered by spatially distributed MLPs which are explicitly located at different positions on the human body. The parameters stored in each Gaussian are obtained by interpolating from the outputs of its nearby MLPs based on their distances. To avoid undesirably smooth changes in Gaussian properties caused by interpolation, for each Gaussian we define a set of Gaussian offset bases, and a linear combination of the bases represents the Gaussian property offsets relative to the neutral properties. Then we propose to let the MLPs output a set of coefficients corresponding to the bases. In this way, although Gaussian coefficients are derived from interpolation and change smoothly, the Gaussian offset bases are learned freely without constraints. The smoothly varying coefficients combined with freely learned bases can still produce distinctly different Gaussian property offsets, allowing the ability to learn high-frequency spatial signals. We further use control points to constrain the Gaussians distributed on a surface layer rather than allowing them to be irregularly distributed inside the body, to help the human avatar generalize better when animated under novel poses. Compared to the state-of-the-art method, our method achieves better appearance quality with finer details while the rendering speed is significantly faster under novel views and novel poses.
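
The interpolation scheme described in the abstract (coefficients blended from nearby MLPs by distance, then combined with per-Gaussian offset bases) might look roughly like the sketch below. Several details are assumptions made for illustration: inverse-distance weights over the k nearest MLPs stand in for the paper's unspecified weighting, plain arrays stand in for the MLP outputs, and every name is hypothetical.

```python
import numpy as np

def interpolate_coefficients(gauss_pos, mlp_pos, mlp_coeffs, k=3, eps=1e-8):
    """Blend the outputs of the k nearest MLPs with inverse-distance weights.

    gauss_pos: (G, 3), mlp_pos: (M, 3), mlp_coeffs: (M, B) -> returns (G, B)
    """
    dists = np.linalg.norm(gauss_pos[:, None, :] - mlp_pos[None, :, :], axis=-1)
    nearest = np.argsort(dists, axis=1)[:, :k]              # (G, k) MLP indices
    near_d = np.take_along_axis(dists, nearest, axis=1)     # (G, k) distances
    weights = 1.0 / (near_d + eps)
    weights /= weights.sum(axis=1, keepdims=True)           # rows sum to 1
    return np.einsum("gk,gkb->gb", weights, mlp_coeffs[nearest])

def property_offsets(coeffs, bases):
    """Per-Gaussian linear combination: coeffs (G, B), bases (G, B, D) -> (G, D)."""
    return np.einsum("gb,gbd->gd", coeffs, bases)

rng = np.random.default_rng(0)
mlp_pos = rng.normal(size=(4, 3))        # 4 MLP anchor positions on the body
mlp_coeffs = rng.normal(size=(4, 2))     # stand-in for MLP outputs, B = 2
gauss_pos = rng.normal(size=(5, 3))
gauss_pos[0] = mlp_pos[2]                # one Gaussian sits exactly on an anchor
bases = rng.normal(size=(5, 2, 3))       # freely learned per-Gaussian bases, D = 3

coeffs = interpolate_coefficients(gauss_pos, mlp_pos, mlp_coeffs)
offsets = property_offsets(coeffs, bases)
print(offsets.shape)
```

The point of the split is visible in the shapes: `coeffs` varies smoothly with position because it is interpolated, while `bases` is indexed per Gaussian and can differ arbitrarily between neighbors, so the resulting offsets need not be smooth.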

[102] GSAC: Leveraging Gaussian Splatting for Photorealistic Avatar Creation with Unity Integration

Rendong Zhang,Alexandra Watkins,Nilanjan Sarkar

Main category: cs.GR

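TL;DR: The paper proposes an end-to-end avatar creation method based on 3D Gaussian Splatting (3DGS) that takes monocular video as input to create photorealistic avatars efficiently and scalably, with integration into the Unity engine.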

Details

Motivation: Existing avatar creation techniques are costly, slow, and inefficient, and cannot meet the needs of real-time applications, so a more efficient and realistic solution is needed. Method: Combines 3D Gaussian Splatting with customized preprocessing to generate photorealistic avatars from monocular video, and develops a Unity-integrated editing tool. Result: Experiments validate the effectiveness of the preprocessing pipeline and demonstrate the versatility of Gaussian avatars in Unity, supporting the scalability and practicality of the approach. Conclusion: The method offers an efficient, photorealistic avatar creation solution for VR/AR applications with broad application potential. Abstract: Photorealistic avatars have become essential for immersive applications in virtual reality (VR) and augmented reality (AR), enabling lifelike interactions in areas such as training simulations, telemedicine, and virtual collaboration. These avatars bridge the gap between the physical and digital worlds, improving the user experience through realistic human representation. However, existing avatar creation techniques face significant challenges, including high costs, long creation times, and limited utility in virtual applications. Manual methods, such as MetaHuman, require extensive time and expertise, while automatic approaches, such as NeRF-based pipelines, often lack efficiency and detailed facial expression fidelity, and cannot be rendered at a speed sufficient for real-time applications. By combining several cutting-edge techniques, we introduce an end-to-end 3D Gaussian Splatting (3DGS) avatar creation pipeline that leverages monocular video input to create a scalable and efficient photorealistic avatar directly compatible with the Unity game engine. Our pipeline incorporates a novel Gaussian splatting technique with customized preprocessing that enables the use of "in the wild" monocular video capture, detailed facial expression reconstruction and embedding within a fully rigged avatar model. Additionally, we present a Unity-integrated Gaussian Splatting Avatar Editor, offering a user-friendly environment for VR/AR application development. Experimental results validate the effectiveness of our preprocessing pipeline in standardizing custom data for 3DGS training and demonstrate the versatility of Gaussian avatars in Unity, highlighting the scalability and practicality of our approach.

[103] CompGS++: Compressed Gaussian Splatting for Static and Dynamic Scene Representation

Xiangrui Liu,Xinju Wu,Shiqi Wang,Zhu Li,Sam Kwong

Main category: cs.GR

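TL;DR: CompGS++ is a new framework that uses compact Gaussian primitives to reduce the data volume of 3D scene modeling and substantially improve compression performance.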

Details

Motivation: Gaussian splatting performs well for 3D scene modeling, but its data volume is large and redundant, making transmission over existing Internet infrastructure difficult. Method: Spatial and temporal primitive prediction modules eliminate redundancy between primitives, and a rate-constrained optimization module reduces parameter redundancy within primitives. Result: On multiple benchmark datasets, CompGS++ significantly outperforms existing methods, achieving efficient compression and accurate modeling. Conclusion: CompGS++ provides an efficient solution for 3D immersive visual communication; the code will be made publicly available. Abstract: Gaussian splatting demonstrates proficiency for 3D scene modeling but suffers from substantial data volume due to inherent primitive redundancy. To enable future photorealistic 3D immersive visual communication applications, significant compression is essential for transmission over the existing Internet infrastructure. Hence, we propose Compressed Gaussian Splatting (CompGS++), a novel framework that leverages compact Gaussian primitives to achieve accurate 3D modeling with substantial size reduction for both static and dynamic scenes. Our design is based on the principle of eliminating redundancy both between and within primitives. Specifically, we develop a comprehensive prediction paradigm to address inter-primitive redundancy through spatial and temporal primitive prediction modules. The spatial primitive prediction module establishes predictive relationships for scene primitives and enables most primitives to be encoded as compact residuals, substantially reducing the spatial redundancy. We further devise a temporal primitive prediction module to handle dynamic scenes, which exploits primitive correlations across timestamps to effectively reduce temporal redundancy. Moreover, we devise a rate-constrained optimization module that jointly minimizes reconstruction error and rate consumption. This module effectively eliminates parameter redundancy within primitives and enhances the overall compactness of scene representations. Comprehensive evaluations across multiple benchmark datasets demonstrate that CompGS++ significantly outperforms existing methods, achieving superior compression performance while preserving accurate scene modeling. Our implementation will be made publicly available on GitHub to facilitate further research.
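
Jointly minimizing reconstruction error and rate consumption is conventionally written as a Lagrangian rate-distortion objective. The form below is the generic one from the compression literature, given as a hedged illustration; the abstract does not state CompGS++'s exact loss, and the symbols $D$, $R$, $\lambda$, and $\theta$ are assumptions introduced here.

```latex
% Generic rate-distortion objective: D is the reconstruction error of the
% rendered views, R the estimated bitrate of the coded primitives, and
% \lambda > 0 trades quality against size.
\mathcal{L}(\theta) = D(\theta) + \lambda \, R(\theta)
```

Sweeping $\lambda$ traces out a rate-distortion curve: larger values favor smaller representations at the cost of reconstruction quality.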

[104] HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation

Wenqi Dong,Bangbang Yang,Zesong Yang,Yuan Li,Tao Hu,Hujun Bao,Yuewen Ma,Zhaopeng Cui

Main category: cs.GR

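TL;DR: HiScene is a hierarchical framework that bridges 2D image generation and 3D object generation to produce high-fidelity scenes that support interactive editing.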

Details

Motivation: Existing 3D generation methods are limited in object categories and editing flexibility. Method: A hierarchical approach treats the scene as a complex object under isometric views, combined with video-diffusion-based amodal completion and shape prior injection. Result: Experiments show that HiScene produces more natural object arrangements and complete instances suitable for interactive applications. Conclusion: HiScene improves the flexibility and quality of 3D scene generation while maintaining physical plausibility and alignment with user inputs. Abstract: Scene-level 3D generation represents a critical frontier in multimedia and computer graphics, yet existing approaches either suffer from limited object categories or lack editing flexibility for interactive applications. In this paper, we present HiScene, a novel hierarchical framework that bridges the gap between 2D image generation and 3D object generation and delivers high-fidelity scenes with compositional identities and aesthetic scene content. Our key insight is treating scenes as hierarchical "objects" under isometric views, where a room functions as a complex object that can be further decomposed into manipulatable items. This hierarchical approach enables us to generate 3D content that aligns with 2D representations while maintaining compositional structure. To ensure completeness and spatial alignment of each decomposed instance, we develop a video-diffusion-based amodal completion technique that effectively handles occlusions and shadows between objects, and introduce shape prior injection to ensure spatial coherence within the scene. Experimental results demonstrate that our method produces more natural object arrangements and complete object instances suitable for interactive applications, while maintaining physical plausibility and alignment with user inputs.

cs.CL [Back]

[105] Unmasking the Reality of PII Masking Models: Performance Gaps and the Call for Accountability

Devansh Singh,Sundaraparipurnan Narayanan

Main category: cs.CL

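TL;DR: The paper examines privacy masking for data privacy, analyzes the limitations of NER-based PII detection models, and shows experimentally the privacy exposure risk these models carry.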

Details

Motivation: To expose the limitations of existing PII detection models, which stem from their datasets and NER approaches, and the privacy risks they pose in real-world use. Method: Curate a dataset of 17K semi-synthetic sentences covering 16 PII types, and test model performance across five NER detection dimensions plus one adversarial context. Result: Although the models are widely downloaded and used, they show significant shortcomings in PII detection and masking, leading to privacy exposure. Conclusion: The paper highlights gaps in how model performance is measured and calls for more contextual disclosure in model cards. Abstract: Privacy Masking is a critical concept under data privacy involving anonymization and de-anonymization of personally identifiable information (PII). Privacy masking techniques rely on Named Entity Recognition (NER) approaches from NLP to identify and classify named entities in text. NER approaches, however, have several limitations including (a) content sensitivity including ambiguous, polysemic, context dependent or domain specific content, (b) phrasing variabilities including nicknames and aliases, informal expressions, alternative representations, emerging expressions, evolving naming conventions and (c) formats or syntax variations, typos, misspellings. However, there are a couple of PII datasets that have been widely used by researchers and the open-source community to train models on PII detection or masking. These datasets have been used to train models including Piiranha and Starpii, which have been downloaded over 300k and 580k times, respectively, on HuggingFace. We examine the quality of the PII masking by these models given the limitations of the datasets and of the NER approaches. We curate a dataset of 17K unique, semi-synthetic sentences containing 16 types of PII by compiling information from across multiple jurisdictions including India, the U.K. and the U.S. We generate sentences (using language models) containing these PII at five different NER detection feature dimensions, namely (1) Basic Entity Recognition, (2) Contextual Entity Disambiguation, (3) NER in Noisy & Real-World Data, (4) Evolving & Novel Entities Detection and (5) Cross-Lingual or Multi-Lingual NER, and in one adversarial context. We present the results and exhibit the privacy exposure caused by such model use (considering the extent of lifetime downloads of these models). We conclude by highlighting the gaps in measuring performance of the models and the need for contextual disclosure in model cards for such models.

[106] Learning Optimal Prompt Ensemble for Multi-source Visual Prompt Transfer

Enming Zhang,Liwen Cao,Yanru Wu,Zijie Zhao,Guan Wang,Yang Li

Main category: cs.CL

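TL;DR: HGPrompt adaptively ensembles multi-source prompts by optimizing two objectives, transferability and stability, improving generalization to new tasks.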

Details

Motivation: Pre-trained prompts are knowledge assets, and combining prompts from multiple sources can improve generalization, but naive aggregation tends to cause representation collapse. Method: Introduce an information-theoretic metric to evaluate the transferability of prompt-induced features, and propose Gradient Alignment Regularization to stabilize knowledge transfer. Result: The method achieves state-of-the-art performance on the VTAB benchmark, validating the effectiveness of multi-source prompt transfer. Conclusion: Through adaptive ensembling and stability optimization, HGPrompt substantially improves multi-source prompt transfer. Abstract: Prompt tuning has emerged as a lightweight strategy for adapting foundation models to downstream tasks, particularly in resource-constrained systems. As pre-trained prompts have become valuable intellectual assets, combining multiple source prompts offers a promising approach to enhance generalization to new tasks by leveraging complementary knowledge from diverse sources. However, naive aggregation of these prompts often leads to representation collapse due to mutual interference, undermining their collective potential. To address these challenges, we propose HGPrompt, an adaptive framework for multi-source prompt transfer that learns optimal ensemble weights by jointly optimizing dual objectives: transferability and stability. Specifically, we first introduce an information-theoretic metric to evaluate the transferability of prompt-induced features on the target task, capturing the intrinsic alignment between the feature representations. Additionally, we propose a novel Gradient Alignment Regularization to mitigate gradient conflicts among prompts, enabling stable and coherent knowledge transfer from multiple sources while suppressing interference. Extensive experiments on the large-scale VTAB benchmark demonstrate that HGPrompt achieves state-of-the-art performance, validating its effectiveness in multi-source prompt transfer.
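As a rough illustration of the two ingredients named in the abstract, a weighted ensemble of source prompts and a penalty on conflicting gradients, the sketch below uses softmax weights and a pairwise cosine penalty. Both choices are assumptions made here for illustration; the abstract does not give HGPrompt's transferability metric or the exact form of its Gradient Alignment Regularization, and the vectors are made-up toy values.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble(prompts, scores):
    """Weighted sum of source prompt vectors (weights = softmax of scores)."""
    weights = softmax(scores)
    dim = len(prompts[0])
    return [sum(w * p[i] for w, p in zip(weights, prompts)) for i in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def conflict_penalty(grads):
    """Sum of max(0, -cos) over gradient pairs: nonzero only when two
    source prompts pull the target loss in opposing directions."""
    penalty = 0.0
    for i in range(len(grads)):
        for j in range(i + 1, len(grads)):
            penalty += max(0.0, -cosine(grads[i], grads[j]))
    return penalty

# Toy example: two source prompts with equal (placeholder) transferability scores.
mixed = ensemble([[1.0, 0.0], [0.0, 1.0]], scores=[0.0, 0.0])
opposed = conflict_penalty([[1.0, 0.0], [-1.0, 0.0]])
orthogonal = conflict_penalty([[1.0, 0.0], [0.0, 1.0]])
print(mixed, opposed, orthogonal)
```

A penalty of this shape leaves orthogonal or aligned gradients untouched and only discourages directly opposing updates, which matches the abstract's stated goal of suppressing interference between sources.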

[107] Socrates or Smartypants: Testing Logic Reasoning Capabilities of Large Language Models with Logic Programming-based Test Oracles

Zihao Xu,Junchen Ding,Yiling Lou,Kun Zhang,Dong Gong,Yuekang Li

Main category: cs.CL

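TL;DR: The paper introduces SmartyPat-Bench, a logical fallacy benchmark built from real Reddit posts, and SmartyPat, an automated framework for generating logical fallacies. Experiments show that SmartyPat generates high-quality fallacies and reveal how LLMs perform at fallacy detection.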

Details

Motivation: Existing datasets for evaluating logical reasoning are too simplistic or unnatural to meet current needs. Method: Propose the SmartyPat-Bench dataset and the SmartyPat framework, which uses logic programming and LLMs to generate high-quality logical fallacies. Result: SmartyPat generates high-quality fallacies, and experiments show that LLMs perform better with structured reasoning. Conclusion: SmartyPat and SmartyPat-Bench provide more natural and diverse tools for evaluating logical reasoning and shed light on LLM reasoning capabilities. Abstract: Large Language Models (LLMs) have achieved significant progress in language understanding and reasoning. Evaluating and analyzing their logical reasoning abilities has therefore become essential. However, existing datasets and benchmarks are often limited to overly simplistic, unnatural, or contextually constrained examples. In response to the growing demand, we introduce SmartyPat-Bench, a challenging, naturally expressed, and systematically labeled benchmark derived from real-world high-quality Reddit posts containing subtle logical fallacies. Unlike existing datasets and benchmarks, it provides more detailed annotations of logical fallacies and features more diverse data. To further scale up the study and address the limitations of manual data collection and labeling (such as fallacy-type imbalance and labor-intensive annotation), we introduce SmartyPat, an automated framework powered by logic programming-based oracles. SmartyPat utilizes Prolog rules to systematically generate logically fallacious statements, which are then refined into fluent natural-language sentences by LLMs, ensuring precise fallacy representation. Extensive evaluation demonstrates that SmartyPat produces fallacies comparable in subtlety and quality to human-generated content and significantly outperforms baseline methods. Finally, experiments reveal nuanced insights into LLM capabilities, highlighting that while excessive reasoning steps hinder fallacy detection accuracy, structured reasoning enhances fallacy categorization performance.

[108] Exploring the Impact of Personality Traits on Conversational Recommender Systems: A Simulation with Large Language Models

Xiaoyan Zhao,Yang Deng,Wenjie Wang,Hongzhan Lin,Hong Cheng,Rui Zhang,See-Kiong Ng,Tat-Seng Chua

Main category: cs.CL

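TL;DR: The paper proposes an LLM-based personality-aware user simulation (PerCRS) to study how personality traits affect the outcomes of conversational recommender systems.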

Details

Motivation: Personality traits significantly influence user interaction behavior, but this has not been studied systematically in conversational recommender systems (CRSs). Method: Introduce PerCRS, in which a user agent simulates customizable personality traits and preferences and a system agent has persuasion capability, so that the pair simulates realistic interaction. Multi-aspect evaluation is used to ensure robustness. Result: Experiments show that state-of-the-art LLMs can generate diverse user responses aligned with specified personality traits, prompting CRSs to adjust their recommendation strategies dynamically. Conclusion: The study provides empirical evidence on how personality traits affect conversational recommendation outcomes. Abstract: Conversational Recommender Systems (CRSs) engage users in multi-turn interactions to deliver personalized recommendations. The emergence of large language models (LLMs) further enhances these systems by enabling more natural and dynamic user interactions. However, a key challenge remains in understanding how personality traits shape conversational recommendation outcomes. Psychological evidence highlights the influence of personality traits on user interaction behaviors. To address this, we introduce an LLM-based personality-aware user simulation for CRSs (PerCRS). The user agent induces customizable personality traits and preferences, while the system agent possesses the persuasion capability to simulate realistic interaction in CRSs. We incorporate multi-aspect evaluation to ensure robustness and conduct extensive analysis from both user and system perspectives. Experimental results demonstrate that state-of-the-art LLMs can effectively generate diverse user responses aligned with specified personality traits, thereby prompting CRSs to dynamically adjust their recommendation strategies. Our experimental analysis offers empirical insights into the impact of personality traits on the outcomes of conversational recommender systems.

[109] How to Detect and Defeat Molecular Mirage: A Metric-Driven Benchmark for Hallucination in LLM-based Molecular Comprehension

Hao Li,Liuzhenghao Lv,He Cao,Zijing Liu,Zhiyuan Yan,Yu Wang,Yonghong Tian,Yu Li,Li Yuan

Main category: cs.CL

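TL;DR: The paper analyzes hallucination in LLMs on molecular comprehension tasks and proposes the Mol-Hallu evaluation metric and the HRPP post-processing method to reduce hallucination and improve model reliability.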

Details

Motivation: Large language models hallucinate in scientific domains such as molecular comprehension, causing errors in tasks like drug design; this must be addressed to improve reliability. Method: Propose the Mol-Hallu metric to measure the degree of hallucination, and design the HRPP post-processing stage to reduce it. Result: Experiments show that HRPP is effective on both decoder-only and encoder-decoder molecular LLMs. Conclusion: The study offers important insights into mitigating hallucination and improving the reliability of LLMs in scientific applications. Abstract: Large language models are increasingly used in scientific domains, especially for molecular understanding and analysis. However, existing models are affected by hallucination issues, resulting in errors in drug design and utilization. In this paper, we first analyze the sources of hallucination in LLMs for molecular comprehension tasks, specifically the knowledge shortcut phenomenon observed in the PubChem dataset. To evaluate hallucination in molecular comprehension tasks with computational efficiency, we introduce Mol-Hallu, a novel free-form evaluation metric that quantifies the degree of hallucination based on the scientific entailment relationship between generated text and actual molecular properties. Utilizing the Mol-Hallu metric, we reassess and analyze the extent of hallucination in various LLMs performing molecular comprehension tasks. Furthermore, the Hallucination Reduction Post-processing stage (HRPP) is proposed to alleviate molecular hallucinations. Experiments show the effectiveness of HRPP on decoder-only and encoder-decoder molecular LLMs. Our findings provide critical insights into mitigating hallucination and improving the reliability of LLMs in scientific applications.

Xingguang Ji,Jiakang Wang,Hongzhi Zhang,Jingyuan Zhang,Haonan Zhou,Chenxi Sun,Yahui Liu,Qi Wang,Fuzheng Zhang

Main category: cs.CL

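TL;DR: Capybara-OMNI is a lightweight, efficient multimodal large language model (MLLM) that supports text, image, video, and audio, presented together with its detailed framework design, data construction, and training recipe.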

Details

Motivation: Because creating and training on multimodal data pairs is complex, building a strong MLLM remains compute- and time-intensive, so a more efficient approach is needed. Method: Develop an MLLM step by step through a detailed framework design, data construction, and training recipe, and provide dedicated benchmarks for verifying multimodal understanding. Result: The model performs competitively among models of the same scale on multimodal benchmarks, and a chat version was further developed to strengthen interaction. Conclusion: Capybara-OMNI and its chat version are released with model weights, part of the training data, and inference code, giving the community a practical resource. Abstract: With the development of Multimodal Large Language Models (MLLMs), numerous outstanding accomplishments have emerged within the open-source community. Due to the complexity of creating and training multimodal data pairs, building powerful MLLMs is still a computationally intensive and time-consuming process. In this work, we introduce Capybara-OMNI, an MLLM that trains in a lightweight and efficient manner and supports understanding text, image, video, and audio modalities. We present in detail the framework design, the data construction, and the training recipe, to develop an MLLM step-by-step to obtain competitive performance. We also provide exclusive benchmarks utilized in our experiments to show how to properly verify understanding capabilities across different modalities. Results show that by following our guidance, we can efficiently build an MLLM that achieves competitive performance among models of the same scale on various multimodal benchmarks. Additionally, to enhance the multimodal instruction following and conversational capabilities of the model, we further discuss how to train the chat version upon an MLLM understanding model, which is more in line with user habits for tasks like real-time interaction with humans. We publicly disclose the Capybara-OMNI model, along with its chat-based version. The disclosure includes the model weights, a portion of the training data, and the inference code, which are made available on GitHub.

[111] Data Metabolism: An Efficient Data Design Schema For Vision Language Model

Jingyuan Zhang,Hongzhi Zhang,Zhou Haonan,Chenxi Sun,Xingguang Ji,Jiakang Wang,Fanheng Kong,Yahui Liu,Qi Wang,Fuzheng Zhang

Main category: cs.CL

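TL;DR: The paper proposes a data-centric framework, Data Metabolism, for building vision language models (VLMs) and improving them continuously through data iteration. The released Capybara-VL model performs strongly across many tasks, even surpassing some larger open-source models.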

Details

Motivation: Data curation is crucial for training strong vision language models, but existing approaches are not systematic. The paper addresses this with the Data Metabolism framework. Method: Introduce the concept of Data Metabolism and build a closed-loop system of data curation and iteration, with a detailed description of the data processing methods and a user-specific data flywheel. Result: Capybara-VL performs strongly across many tasks, surpassing larger open-source models and matching leading proprietary models. Conclusion: A data-centric framework can substantially improve model performance, demonstrating the potential of training smaller, more efficient VLMs. Abstract: Data curation plays a crucial role in training powerful Visual Language Models (VLMs). In this work, we introduce the concept of Data Metabolism and present our data-centric framework to build VLMs throughout the development lifecycle. Starting from a standard model architecture, we discuss and provide insights into two crucial development steps: data curation and iteration, forming a closed-loop system that continuously improves model performance. We show a detailed codebook on how to process existing massive datasets and build a user-specific data flywheel. As a demonstration, we release a VLM, named Capybara-VL, which excels in typical multimodal tasks (e.g., visual question answering, scientific reasoning, and text-rich tasks). Despite its relatively compact size, Capybara-VL surpasses several open-source models that are up to 10 times larger in size. Moreover, it achieves results that are on par with those of several leading proprietary models, demonstrating its remarkable competitiveness. These results highlight the power of our data-centric framework and the potential of training smaller and more efficient VLMs.

[112] ChatGPT as Linguistic Equalizer? Quantifying LLM-Driven Lexical Shifts in Academic Writing

Dingkang Lin,Naixuan Zhao,Dan Tian,Jiang Li

Main category: cs.CL

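TL;DR: ChatGPT significantly increased the lexical complexity of academic writing by non-native English-speaking (NNES) scholars, reducing language barriers and promoting academic equity.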

Details

Motivation: To examine whether ChatGPT helps non-native English-speaking scholars overcome language barriers in academic writing and thereby promotes equity in global academia. Method: Using 2.8 million articles from the OpenAlex database (2020-2024), quantify lexical complexity with MTLD and identify causal effects with a DID design. Result: ChatGPT significantly increased lexical complexity in abstracts by NNES authors, with the strongest effects in preprints, technology- and biology-related fields, and lower-tier journals. Conclusion: ChatGPT reduces linguistic disparities and gives global academia a more level playing field. Abstract: The advent of ChatGPT has profoundly reshaped scientific research practices, particularly in academic writing, where non-native English-speakers (NNES) historically face linguistic barriers. This study investigates whether ChatGPT mitigates these barriers and fosters equity by analyzing lexical complexity shifts across 2.8 million articles from OpenAlex (2020-2024). Using the Measure of Textual Lexical Diversity (MTLD) to quantify vocabulary sophistication and a difference-in-differences (DID) design to identify causal effects, we demonstrate that ChatGPT significantly enhances lexical complexity in NNES-authored abstracts, even after accounting for article-level controls, authorship patterns, and venue norms. Notably, the impact is most pronounced in preprint papers, technology- and biology-related fields and lower-tier journals. These findings provide causal evidence that ChatGPT reduces linguistic disparities and promotes equity in global academia.
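The difference-in-differences (DID) design mentioned in the abstract compares the before/after change in a treated group with the change in a control group. The sketch below shows only the basic two-by-two estimate; the scores are made-up illustrative values, not data from the paper, and the paper's actual specification (with controls for article-level variables, authorship patterns, and venue norms) is not reproduced here.

```python
def mean(xs):
    return sum(xs) / len(xs)

def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Basic 2x2 difference-in-differences:
    (treated change) minus (control change)."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Hypothetical lexical-diversity scores before and after ChatGPT's release.
nnes_pre = [60.0, 62.0, 61.0]    # treated group: NNES-authored abstracts
nnes_post = [66.0, 68.0, 67.0]
nes_pre = [70.0, 71.0, 72.0]     # control group: native-speaker abstracts
nes_post = [71.0, 72.0, 73.0]

effect = did_estimate(nnes_pre, nnes_post, nes_pre, nes_post)
print(effect)
```

Subtracting the control group's change removes any trend shared by both groups, so the remaining difference is attributed to the treatment under the usual parallel-trends assumption.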

[113] Has the Creativity of Large-Language Models peaked? An analysis of inter- and intra-LLM variability

Jennifer Haase,Paul H. P. Hanel,Sebastian Pokutta

Main category: cs.CL

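TL;DR: The study finds that although LLMs outperform the average human on creative tasks, their creativity has not improved over time, and their outputs are markedly inconsistent.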

Details

Motivation: To examine whether LLM creativity has improved over time and how stable LLM creative output is. Method: Evaluate 14 LLMs (such as GPT-4 and Claude) on two creativity tests, the DAT and the AUT. Result: LLM creativity has not increased, and GPT-4 performed worse than before; some models beat the human average but very rarely reach top human levels; outputs from the same model vary widely. Conclusion: More nuanced evaluation frameworks are needed, with attention to model selection, prompt design, and repeated assessment, to gauge the creative potential of LLMs accurately. Abstract: Following the widespread adoption of ChatGPT in early 2023, numerous studies reported that large language models (LLMs) can match or even surpass human performance in creative tasks. However, it remains unclear whether LLMs have become more creative over time, and how consistent their creative output is. In this study, we evaluated 14 widely used LLMs (including GPT-4, Claude, Llama, Grok, Mistral, and DeepSeek) across two validated creativity assessments: the Divergent Association Task (DAT) and the Alternative Uses Task (AUT). Contrary to expectations, we found no evidence of increased creative performance over the past 18-24 months, with GPT-4 performing worse than in previous studies. For the more widely used AUT, all models performed on average better than the average human, with GPT-4o and o3-mini performing best. However, only 0.28% of LLM-generated responses reached the top 10% of human creativity benchmarks. Beyond inter-model differences, we document substantial intra-model variability: the same LLM, given the same prompt, can produce outputs ranging from below-average to original. This variability has important implications for both creativity research and practical applications. Ignoring such variability risks misjudging the creative potential of LLMs, either inflating or underestimating their capabilities. The choice of prompts affected LLMs differently. Our findings underscore the need for more nuanced evaluation frameworks and highlight the importance of model selection, prompt design, and repeated assessment when using Generative AI (GenAI) tools in creative contexts.
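The Divergent Association Task asks for words that are as unrelated to each other as possible, and it is commonly scored as the average pairwise semantic distance between the words, scaled by 100. The sketch below shows that scoring idea under stated assumptions: the two-dimensional vectors are toy placeholders rather than real word embeddings, and the abstract does not describe the exact scoring pipeline used in this study.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def dat_style_score(vectors):
    """Average pairwise cosine distance, scaled by 100."""
    pairs = list(combinations(vectors, 2))
    return 100.0 * sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)

# Toy embeddings (hypothetical): two unrelated words and one near-duplicate.
unrelated = dat_style_score([[1.0, 0.0], [0.0, 1.0]])
with_duplicate = dat_style_score([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
print(unrelated, with_duplicate)
```

Repeating the scoring over many samples from the same model and prompt is one simple way to quantify the intra-model variability the abstract describes.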

[114] AttentionDefense: Leveraging System Prompt Attention for Explainable Defense Against Novel Jailbreaks

Charlotte Siska,Anush Sankaran

Main category: cs.CL

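TL;DR: The paper proposes AttentionDefense, a new method that uses the system-prompt attention of a small language model (SLM) to detect and explain adversarial prompts, offering a cheaper and explainable defense strategy.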

Details

Motivation: Although language models (LMs) perform well in many domains, they are vulnerable to malicious inputs (jailbreaks) that make them deviate from intended behavior. Existing defenses struggle to explain the root of the malicious behavior, so a more transparent and efficient solution is needed. Method: Propose AttentionDefense, which uses the system-prompt attention of a small language model (SLM) to characterize adversarial prompts. The method is evaluated on existing jailbreak benchmark datasets and on a newly generated dataset of variants. Result: AttentionDefense matches or exceeds the detection performance of text-embedding-based classifiers and GPT-4 zero-shot detectors, at lower computational cost. Conclusion: AttentionDefense is an efficient, explainable, and computationally cheap defense suited to practical use. Abstract: In the past few years, Language Models (LMs) have shown par-human capabilities in several domains. Despite their practical applications and exceeding user consumption, they are susceptible to jailbreaks when malicious input exploits the LM's weaknesses, causing it to deviate from its intended behavior. Current defensive strategies either classify the input prompt as adversarial or prevent LMs from generating harmful outputs. However, it is challenging to explain the reason behind the malicious nature of the jailbreak, which results in a wide variety of closed-box approaches. In this research, we propose and demonstrate that system-prompt attention from Small Language Models (SLMs) can be used to characterize adversarial prompts, providing a novel, explainable, and cheaper defense approach called AttentionDefense. Our research suggests that the attention mechanism is an integral component in understanding and explaining how LMs respond to malicious input that is not captured in the semantic meaning of text embeddings. The proposed AttentionDefense is evaluated against existing jailbreak benchmark datasets. Ablation studies show that SLM-based AttentionDefense has equivalent or better jailbreak detection performance compared to text embedding-based classifiers and GPT-4 zero-shot detectors. To further validate the efficacy of the proposed approach, we generate a dataset of novel jailbreak variants of the existing benchmark dataset using a closed-loop LLM-based multi-agent system. We demonstrate that the proposed AttentionDefense approach performs robustly on this novel jailbreak dataset while existing approaches suffer in performance. Additionally, for practical purposes AttentionDefense is an ideal solution as it has the computation requirements of a small LM but the performance of an LLM detector.

[115] A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis

Xin Gao,Qizhi Pei,Zinan Tang,Yu Li,Honglin Lin,Jiang Wu,Conghui He,Lijun Wu

Main category: cs.CL

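TL;DR: The GRA framework achieves high-quality data synthesis through collaboration among multiple small LLMs (generation, review, adjudication), challenging the need for large LLMs.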

Details

Motivation: To address the high cost, inefficiency, and potential bias of large LLMs in data synthesis, and to explore whether collaborating small LLMs are a feasible alternative. Method: Propose the GRA framework, in which small LLMs in three roles (Generator, Reviewer, Adjudicator) collaborate to synthesize data. Result: Data produced by GRA matches or exceeds the quality of data from large LLMs such as Qwen-2.5-72B-Instruct. Conclusion: Strategic coordination of small LLMs can replace large LLMs for efficient, high-quality data synthesis. Abstract: While data synthesis and distillation are promising strategies to enhance small language models, current approaches heavily rely on Large Language Models (LLMs), which suffer from high computational costs, environmental inefficiency, and potential biases inherited from monolithic architectures. In contrast, smaller LLMs are more accessible and sustainable, but their individual capabilities often fall short in generating high-quality, diverse, and reliable data. Inspired by collaborative human processes (e.g., peer review), we propose GRA, a framework involving multiple small LLMs that aggregates specialized roles across them to achieve the iterative refinement and quality control typically achieved by a single large LLM. In this collaborative framework, multiple small LLMs assume distinct roles (Generator, Reviewer, and Adjudicator) to simulate a peer-review-inspired data synthesis pipeline. The Generator proposes initial data samples, the Reviewer critiques their quality and diversity, and the Adjudicator resolves conflicts to finalize the output. By decomposing the synthesis process into specialized sub-tasks, collaborative small LLMs can achieve data-level parity with large LLM-based distillation. Through experiments across multiple benchmarks, we demonstrate that GRA-produced data matches or exceeds the quality of single large LLM outputs, e.g., Qwen-2.5-72B-Instruct. Our results challenge the necessity of monolithic large models for high-quality data synthesis, advocating instead for strategic coordination of smaller agents. Our datasets, models, and code are publicly available at https://github.com/GX-XinGao/GRA.
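The Generator, Reviewer, and Adjudicator roles described in the abstract can be pictured as a small control loop. The sketch below is only a schematic under loud assumptions: the candidate list and the three rule-based functions are hypothetical stand-ins for small LLM calls, and the rule that the Adjudicator is consulted only when reviewers disagree is one plausible reading of "resolves conflicts", not the GRA implementation.

```python
def reviewer_strict(sample):
    """Hypothetical reviewer: accepts only samples phrased as questions."""
    return sample.endswith("?")

def reviewer_lenient(sample):
    """Hypothetical reviewer: accepts any sample with at least two words."""
    return len(sample.split()) >= 2

def adjudicator(sample):
    """Hypothetical tie-breaker: accepts samples of 15+ characters."""
    return len(sample) >= 15

def synthesize(candidates, reviewers, adjudicate):
    """Keep a sample if all reviewers accept it; drop it if all reject it;
    otherwise let the adjudicator resolve the conflict."""
    accepted = []
    for sample in candidates:
        votes = [review(sample) for review in reviewers]
        if all(votes):
            accepted.append(sample)
        elif any(votes) and adjudicate(sample):
            accepted.append(sample)
    return accepted

# Stand-in for the Generator's proposals (a real system would call an LLM).
candidates = [
    "What is 2 + 2?",
    "asdf",
    "Explain recursion",
    "Go on",
    "Why is the sky blue?",
]
final = synthesize(candidates, [reviewer_strict, reviewer_lenient], adjudicator)
print(final)
```

Splitting acceptance into independent votes plus a separate tie-breaker keeps each role simple, which is the kind of task decomposition the abstract credits for letting small models reach parity with one large model.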

[116] The Other Side of the Coin: Exploring Fairness in Retrieval-Augmented Generation

Zheng Zhang,Ning Li,Qi Liu,Rui Li,Weibo Gao,Qingyang Mao,Zhenya Huang,Baosheng Yu,Dacheng Tao

Main category: cs.CL

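TL;DR: RAG reduces hallucination in LLMs by bringing in external knowledge, but its effect on fairness has been unclear. Experiments find that small-scale LLMs (<8B) become less fair under RAG, and the paper proposes two methods, FairFT and FairFilter, to improve fairness.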

Details

Motivation: To study how RAG affects the fairness of LLMs and to address the fairness degradation seen in small-scale LLMs under RAG. Method: Analyze experimentally the effect of LLM scale, retrievers, and retrieval sources, and propose two methods: FairFT (a fairness-aligned retriever) and FairFilter (a fairness filtering mechanism). Result: Small-scale LLMs are less fair under RAG; FairFT and FairFilter improve fairness without hurting performance. Conclusion: RAG harms the fairness of small-scale LLMs, and the proposed methods effectively mitigate the problem. Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by retrieving relevant documents from external knowledge sources. By referencing this external knowledge, RAG effectively reduces the generation of factually incorrect content and addresses hallucination issues within LLMs. Recently, there has been growing attention to improving the performance and efficiency of RAG systems from various perspectives. While these advancements have yielded significant results, the application of RAG in domains with considerable societal implications raises a critical question about fairness: What impact does the introduction of the RAG paradigm have on the fairness of LLMs? To address this question, we conduct extensive experiments by varying the LLMs, retrievers, and retrieval sources. Our experimental analysis reveals that the scale of the LLMs plays a significant role in influencing fairness outcomes within the RAG framework. When the model scale is smaller than 8B, the integration of retrieval mechanisms often exacerbates unfairness in small-scale LLMs (e.g., LLaMA3.2-1B, Mistral-7B, and LLaMA3-8B). To mitigate the fairness issues introduced by RAG for small-scale LLMs, we propose two approaches, FairFT and FairFilter. Specifically, in FairFT, we align the retriever with the LLM in terms of fairness, enabling it to retrieve documents that facilitate fairer model outputs. In FairFilter, we propose a fairness filtering mechanism to filter out biased content after retrieval. Finally, we validate our proposed approaches on real-world datasets, demonstrating their effectiveness in improving fairness while maintaining performance.

[117] Cross-Document Cross-Lingual Natural Language Inference via RST-enhanced Graph Fusion and Interpretability Prediction

Mengying Yuan,Wangzi Xuan,Fei Li

Main category: cs.CL

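TL;DR: The paper proposes a new paradigm for cross-document cross-lingual natural language inference (CDCL-NLI), builds a high-quality dataset of 1,110 instances spanning 26 languages, and proposes a method that combines RST-enhanced graph fusion with interpretability prediction. Experiments show that the method outperforms traditional NLI models and large language models.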

Details

Motivation: Cross-document cross-lingual natural language inference (CDCL-NLI) is still largely unexplored; the paper aims to extend traditional NLI to multi-document, multilingual scenarios. Method: Propose a method combining RST-enhanced graph fusion with interpretability prediction, using RGAT for cross-document context modeling and a structure-aware semantic alignment mechanism based on lexical chains for cross-lingual understanding. Result: Experiments show that the method significantly outperforms traditional NLI models (such as DocNLI and R2F) and large language models (such as Llama3 and GPT-4o). Conclusion: The paper offers a new perspective on NLI research and is expected to stimulate work on cross-document cross-lingual context understanding, semantic retrieval, and interpretable inference. The dataset and code are publicly available. Abstract: Natural Language Inference (NLI) is a fundamental task in both natural language processing and information retrieval. While NLI has developed many sub-directions such as sentence-level NLI, document-level NLI and cross-lingual NLI, Cross-Document Cross-Lingual NLI (CDCL-NLI) remains largely unexplored. In this paper, we propose a novel paradigm for CDCL-NLI that extends traditional NLI capabilities to multi-document, multilingual scenarios. To support this task, we construct a high-quality CDCL-NLI dataset including 1,110 instances and spanning 26 languages. To build a baseline for this task, we also propose an innovative method that integrates RST-enhanced graph fusion and interpretability prediction. Our method employs RST (Rhetorical Structure Theory) on RGAT (Relation-aware Graph Attention Network) for cross-document context modeling, coupled with a structure-aware semantic alignment mechanism based on lexical chains for cross-lingual understanding. For NLI interpretability, we develop an EDU-level attribution framework that generates extractive explanations. Extensive experiments demonstrate our approach's superior performance, achieving significant improvements over both traditional NLI models such as DocNLI and R2F, as well as LLMs like Llama3 and GPT-4o. Our work sheds light on the study of NLI and will bring research interest on cross-document cross-lingual context understanding, semantic retrieval and interpretability inference. Our dataset and code are available at https://anonymous.4open.science/r/CDCL-NLI-637E/ (CDCL-NLI link for peer review).

Haiqi Zhang,Zhengyuan Zhu,Zeyu Zhang,Chengkai Li

Main category: cs.CL

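TL;DR: LLMTaxo is a framework that uses large language models to automatically build a taxonomy of factual claims from social media, generating topics at multiple levels of granularity to help users understand social media content more effectively.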

Details

Motivation: With the explosive growth of social media content, analyzing and understanding online discourse has become complex, and an automated way to categorize factual claims is needed. Method: The LLMTaxo framework uses large language models to generate topics at multiple granularities; different models are tested on three distinct datasets, with specially designed taxonomy evaluation metrics. Result: Experiments show that LLMTaxo effectively categorizes factual claims from social media and that certain models perform better on specific datasets. Conclusion: LLMTaxo is an effective tool for categorizing social media content, and model performance can be optimized further in future work. Abstract: With the vast expansion of content on social media platforms, analyzing and comprehending online discourse has become increasingly complex. This paper introduces LLMTaxo, a novel framework leveraging large language models for the automated construction of a taxonomy of factual claims from social media by generating topics from multi-level granularities. This approach aids stakeholders in more effectively navigating the social media landscapes. We implement this framework with different models across three distinct datasets and introduce specially designed taxonomy evaluation metrics for a comprehensive assessment. With the evaluations from both human evaluators and GPT-4, the results indicate that LLMTaxo effectively categorizes factual claims from social media, and reveals that certain models perform better on specific datasets.

[119] Reconstructing Sepsis Trajectories from Clinical Case Reports using LLMs: the Textual Time Series Corpus for Sepsis

Shahriar Noroozizadeh,Jeremy C. Weiss

Main category: cs.CL

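TL;DR: The paper builds an LLM-based pipeline for extracting time-localized findings from clinical case reports and uses it to generate an open-access textual time series corpus for Sepsis-3. Validation shows high recovery rates and strong temporal ordering.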

Details

Motivation: Clinical case reports and discharge summaries are complete and accurate but are usually finalized after the encounter, while structured data streams are available sooner but incomplete. Training models on more complete and temporally fine-grained data requires a new approach. Method: Build a pipeline that uses large language models to phenotype, extract, and annotate time-localized findings in case reports, and apply it to generate an open-access corpus for Sepsis-3. Result: Validation shows high recovery rates (event match rate: 0.753 to 0.755) and strong temporal ordering (concordance: 0.932). Conclusion: The study demonstrates the ability of LLMs to time-localize clinical findings, points out their limitations in temporal reconstruction, and suggests multimodal integration as a direction for improvement. Abstract: Clinical case reports and discharge summaries may be the most complete and accurate summarization of patient encounters, yet they are finalized, i.e., timestamped after the encounter. Complementary structured data streams become available sooner but suffer from incompleteness. To train models and algorithms on more complete and temporally fine-grained data, we construct a pipeline to phenotype, extract, and annotate time-localized findings within case reports using large language models. We apply our pipeline to generate an open-access textual time series corpus for Sepsis-3 comprising 2,139 case reports from the Pubmed-Open Access (PMOA) Subset. To validate our system, we apply it on PMOA and timeline annotations from I2B2/MIMIC-IV and compare the results to physician-expert annotations. We show high recovery rates of clinical findings (event match rates: O1-preview 0.755, Llama 3.3 70B Instruct 0.753) and strong temporal ordering (concordance: O1-preview 0.932, Llama 3.3 70B Instruct 0.932). Our work characterizes the ability of LLMs to time-localize clinical findings in text, illustrating the limitations of LLM use for temporal reconstruction and providing several potential avenues of improvement via multimodal integration.
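The two validation numbers quoted above, an event match rate and a temporal concordance, can be sketched as follows. The definitions used here are assumptions chosen for illustration (exact-string matching of event names, and a pairwise ordering agreement in the style of a concordance index); the paper's own matching and scoring procedures may differ, and the timelines are made up.

```python
from itertools import combinations

def event_match_rate(reference, predicted):
    """Fraction of reference events that also appear in the prediction."""
    matched = [event for event in reference if event in predicted]
    return len(matched) / len(reference)

def concordance(reference, predicted):
    """Among pairs of shared events with different reference times,
    the fraction whose predicted order agrees with the reference order."""
    shared = [event for event in reference if event in predicted]
    agree = 0
    total = 0
    for a, b in combinations(shared, 2):
        ref_diff = reference[a] - reference[b]
        pred_diff = predicted[a] - predicted[b]
        if ref_diff == 0:
            continue  # skip pairs that are tied in the reference timeline
        total += 1
        if ref_diff * pred_diff > 0:
            agree += 1
    return agree / total if total else float("nan")

# Hypothetical timelines: event name -> hours since admission.
reference = {"fever": 0, "hypotension": 6, "antibiotics": 8, "intubation": 24}
predicted = {"fever": 0, "hypotension": 10, "antibiotics": 8, "discharge": 240}

match_rate = event_match_rate(reference, predicted)
order_agreement = concordance(reference, predicted)
print(match_rate, order_agreement)
```

Reporting the two numbers separately distinguishes whether an extraction pipeline misses findings from whether it places the findings it does recover in the wrong order.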

Yuxi Ma,Yongqian Peng,Yixin Zhu

Main category: cs.CL

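TL;DR: The study presents the first large-scale analysis of how social groups are represented in the language of Chinese state-controlled media (1950-2019), finding marked differences from Western patterns and especially dramatic shifts for gender and economic class.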

Details

Motivation: To explore how revolutionary social transformations are reflected in official linguistic representations of social groups, filling a gap in research on the relationship between language and social structure in non-Western contexts. Method: Use diachronic word embeddings to analyze language patterns in Chinese state-controlled media at multiple temporal resolutions. Result: Chinese representations of social groups differ significantly from Western ones; representations of gender and economic class shift dramatically with historical transformations, while stereotypes of ethnicity, age, and body type remain relatively stable. Conclusion: Official discourse encodes social structure through language, underscoring the importance of non-Western perspectives in computational social science. Abstract: Language encodes societal beliefs about social groups through word patterns. While computational methods like word embeddings enable quantitative analysis of these patterns, studies have primarily examined gradual shifts in Western contexts. We present the first large-scale computational analysis of Chinese state-controlled media (1950-2019) to examine how revolutionary social transformations are reflected in official linguistic representations of social groups. Using diachronic word embeddings at multiple temporal resolutions, we find that Chinese representations differ significantly from Western counterparts, particularly regarding economic status, ethnicity, and gender. These representations show distinct evolutionary dynamics: while stereotypes of ethnicity, age, and body type remain remarkably stable across political upheavals, representations of gender and economic classes undergo dramatic shifts tracking historical transformations. This work advances our understanding of how officially sanctioned discourse encodes social structure through language while highlighting the importance of non-Western perspectives in computational social science.

[121] A Comprehensive Survey of Reward Models: Taxonomy, Applications, Challenges, and Future

Jialun Zhong,Wei Shen,Yanzeng Li,Songyang Gao,Hua Lu,Yicheng Chen,Yang Zhang,Wei Zhou,Jinjie Gu,Lei Zou

Main category: cs.CL

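TL;DR: This survey reviews the use of reward models (RMs) to enhance large language models (LLMs), covering preference collection, reward modeling, and usage, and discusses open challenges and future research directions.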

Details

Motivation: As proxies for human preferences, RMs can guide LLM behavior, but research on them is still immature; the paper aims to give beginners a comprehensive introduction and to facilitate future work. Method: Surveys related research, analyzing RMs from the three perspectives of preference collection, reward modeling, and usage, and introduces their applications and evaluation benchmarks. Result: Summarizes the state of RM research, application scenarios, and evaluation methods, and identifies current challenges. Conclusion: The paper offers a comprehensive entry point to the RM field and outlines future research directions; the related resources are publicly available. Abstract: Reward models (RMs) have demonstrated impressive potential for enhancing Large Language Models (LLMs), as RMs can serve as a proxy for human preferences, providing signals to guide LLMs' behavior in various tasks. In this paper, we provide a comprehensive overview of relevant research, exploring RMs from the perspectives of preference collection, reward modeling, and usage. Next, we introduce the applications of RMs and discuss the benchmarks for evaluation. Furthermore, we conduct an in-depth analysis of the challenges existing in the field and dive into the potential research directions. This paper is dedicated to providing beginners with a comprehensive introduction to RMs and facilitating future studies. The resources are publicly available on GitHub at https://github.com/JLZhong23/awesome-reward-models.

[122] Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time

Wang Yang,Xiang Yue,Vipin Chaudhary,Xiaotian Han

Main category: cs.CL

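TL;DR: Proposes Speculative Thinking, a training-free framework in which a large model guides a smaller one at the reasoning level during inference, substantially improving accuracy while shortening outputs.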

Details

Motivation: Existing methods rely on costly training pipelines and still produce long, inefficient outputs, so a more efficient way to improve reasoning models is needed. Method: Builds on two observations: (1) reasoning-supportive tokens such as "wait" often appear after structural delimiters; (2) larger models control reflective behavior better. Delegating reflective steps to the larger model improves the smaller model's reasoning accuracy and output efficiency. Result: On MATH500, the 1.5B model's accuracy rises from 83.2% to 89.4% and its output length falls by 15.7%; the non-reasoning model Qwen-2.5-7B-Instruct improves from 74.0% to 81.8%. Conclusion: Speculative Thinking is an efficient, training-free method that significantly improves reasoning models while reducing output length. Abstract: Recent advances leverage post-training to enhance model reasoning performance, which typically requires costly training pipelines and still suffers from inefficient, overly lengthy outputs. We introduce Speculative Thinking, a training-free framework that enables large reasoning models to guide smaller ones during inference at the reasoning level, distinct from speculative decoding, which operates at the token level. Our approach is based on two observations: (1) reasoning-supportive tokens such as "wait" frequently appear after structural delimiters like "\n\n", serving as signals for reflection or continuation; and (2) larger models exhibit stronger control over reflective behavior, reducing unnecessary backtracking while improving reasoning quality. By strategically delegating reflective steps to a more capable model, our method significantly boosts the reasoning accuracy of reasoning models while shortening their output. With the assistance of the 32B reasoning model, the 1.5B model's accuracy on MATH500 increases from 83.2% to 89.4%, marking a substantial improvement of 6.2%. Simultaneously, the average output length is reduced from 5439 tokens to 4583 tokens, representing a 15.7% decrease. Moreover, when applied to a non-reasoning model (Qwen-2.5-7B-Instruct), our framework boosts its accuracy from 74.0% to 81.8% on the same benchmark, achieving a relative improvement of 7.8%.
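The reported numbers mix absolute and relative changes, so a quick arithmetic check may help when comparing them. The sketch below only recomputes figures quoted in the abstract; the helper names are illustrative and not from the paper. Note that the 7.8% quoted for Qwen-2.5-7B-Instruct matches the absolute difference in percentage points, while the relative change works out to roughly 10.5%.

```python
def absolute_gain(before: float, after: float) -> float:
    """Difference between two percentages, in percentage points."""
    return after - before


def relative_change(before: float, after: float) -> float:
    """Change of `after` relative to `before`, in percent."""
    return (after - before) / before * 100.0


# Figures quoted in the abstract (MATH500).
gain_small = absolute_gain(83.2, 89.4)        # 1.5B reasoning model
length_change = relative_change(5439, 4583)   # average output tokens
gain_instruct = absolute_gain(74.0, 81.8)     # Qwen-2.5-7B-Instruct
rel_instruct = relative_change(74.0, 81.8)

print(f"1.5B accuracy gain: {gain_small:.1f} points")
print(f"output length change: {length_change:.1f}%")
print(f"7B-Instruct gain: {gain_instruct:.1f} points ({rel_instruct:.1f}% relative)")
```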

[123] HM-RAG: Hierarchical Multi-Agent Multimodal Retrieval Augmented Generation

Pei Liu,Xin Liu,Ruoyu Yao,Junming Liu,Siyuan Meng,Ding Wang,Jun Ma

Main category: cs.CL

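TL;DR: HM-RAG is a new hierarchical multi-agent multimodal RAG framework that uses collaborating agents to synthesize knowledge from heterogeneous data dynamically, markedly improving its ability to answer complex queries.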

Details

Motivation: Conventional single-agent RAG is limited on complex queries that require coordinated reasoning across heterogeneous data; HM-RAG targets this problem. Method: Uses a three-tier architecture: a Decomposition Agent (semantic-aware query rewriting), Multi-source Retrieval Agents (parallel, modality-specific retrieval), and a Decision Agent (consistency voting to integrate results). Result: On the ScienceQA and CrisisMMD benchmarks, answer accuracy improves by 12.95% and question classification accuracy by 3.56%. Conclusion: Through its modular architecture and multi-agent collaboration, HM-RAG significantly improves multimodal reasoning and knowledge synthesis in RAG systems. Abstract: While Retrieval-Augmented Generation (RAG) augments Large Language Models (LLMs) with external knowledge, conventional single-agent RAG remains fundamentally limited in resolving complex queries demanding coordinated reasoning across heterogeneous data ecosystems. We present HM-RAG, a novel Hierarchical Multi-agent Multimodal RAG framework that pioneers collaborative intelligence for dynamic knowledge synthesis across structured, unstructured, and graph-based data. The framework comprises a three-tiered architecture with specialized agents: a Decomposition Agent that dissects complex queries into contextually coherent sub-tasks via semantic-aware query rewriting and schema-guided context augmentation; Multi-source Retrieval Agents that carry out parallel, modality-specific retrieval using plug-and-play modules designed for vector, graph, and web-based databases; and a Decision Agent that uses consistency voting to integrate multi-source answers and resolve discrepancies in retrieval results through Expert Model Refinement. This architecture attains comprehensive query understanding by combining textual, graph-relational, and web-derived evidence, resulting in a remarkable 12.95% improvement in answer accuracy and a 3.56% boost in question classification accuracy over baseline RAG systems on the ScienceQA and CrisisMMD benchmarks. Notably, HM-RAG establishes state-of-the-art results in zero-shot settings on both datasets. Its modular architecture ensures seamless integration of new data modalities while maintaining strict data governance, marking a significant advancement in addressing the critical challenges of multimodal reasoning and knowledge synthesis in RAG systems.
Code is available at https://github.com/ocean-luna/HMRAG.
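The Decision Agent's consistency voting can be pictured with a minimal sketch. This is not the authors' implementation: the function name, the answer normalization, and the `min_agreement` threshold are assumptions made for illustration, and the abstract does not say how disagreement is escalated to Expert Model Refinement beyond naming that step.

```python
from collections import Counter


def consistency_vote(answers, min_agreement=0.5):
    """Majority vote over answers returned by several retrieval agents.

    Returns (answer, needs_refinement). `needs_refinement` is True when no
    answer is backed by more than `min_agreement` of the agents; in that case
    a caller could hand the question to a stronger model (hypothetical hook).
    """
    if not answers:
        raise ValueError("at least one answer is required")
    normalized = [a.strip().lower() for a in answers]
    top_answer, votes = Counter(normalized).most_common(1)[0]
    agreed = votes / len(normalized) > min_agreement
    return top_answer, not agreed


# Hypothetical outputs of vector, graph, and web retrieval agents.
print(consistency_vote(["B", "b ", "C"]))   # two of three agree
print(consistency_vote(["A", "B", "C"]))    # no majority, refinement needed
```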

[124] Span-level Emotion-Cause-Category Triplet Extraction with Instruction Tuning LLMs and Data Augmentation

Xiangju Li,Dong Yang,Xiaogang Zhu,Faliang Huang,Peng Zhang,Zhongying Zhao

Main category: cs.CL

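TL;DR: The paper proposes a fine-grained, LLM-based approach to extracting emotion-cause-category triplets, using instruction tuning and data augmentation to improve performance significantly.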

Details

Motivation: To address the difficulties existing methods have with redundant information retrieval and with determining emotion categories, especially when emotions are expressed implicitly or ambiguously. Method: Fine-tunes large language models with task-specific triplet extraction instructions and low-rank adaptation, combined with prompt-based data augmentation that generates high-quality synthetic training data. Result: Experiments show an improvement of at least 12.8% on span-level emotion-cause-category triplet extraction metrics. Conclusion: The method is effective and robust and offers a new direction for emotion cause analysis research. Abstract: Span-level emotion-cause-category triplet extraction represents a novel and complex challenge within emotion cause analysis. This task involves identifying emotion spans, cause spans, and their associated emotion categories within the text to form structured triplets. While prior research has predominantly concentrated on clause-level emotion-cause pair extraction and span-level emotion-cause detection, these methods often confront challenges originating from redundant information retrieval and difficulty in accurately determining emotion categories, particularly when emotions are expressed implicitly or ambiguously. To overcome these challenges, this study explores a fine-grained approach to span-level emotion-cause-category triplet extraction and introduces an innovative framework that leverages instruction tuning and data augmentation techniques based on large language models. The proposed method employs task-specific triplet extraction instructions and utilizes low-rank adaptation to fine-tune large language models, eliminating the necessity for intricate task-specific architectures. Furthermore, a prompt-based data augmentation strategy is developed to address data scarcity by guiding large language models in generating high-quality synthetic training data. Extensive experimental evaluations demonstrate that the proposed approach significantly outperforms existing baseline methods, achieving at least a 12.8% improvement in span-level emotion-cause-category triplet extraction metrics. The results demonstrate the method's effectiveness and robustness, offering a promising avenue for advancing research in emotion cause analysis. The source code is available at https://github.com/zxgnlp/InstruDa-LLM.

[125] Can the capability of Large Language Models be described by human ability? A Meta Study

Mingrui Zan,Yunquan Zhang,Boyang Zhang,Fangming Liu,Daning Cheng

Main category: cs.CL

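TL;DR: By analyzing the performance of more than 80 LLMs on 37 evaluation benchmarks, the paper examines how closely LLM capabilities resemble human abilities. It finds that some LLM capabilities can be described with human ability metrics, that some abilities considered related in humans are nearly uncorrelated in LLMs, and that capabilities vary significantly with parameter scale.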

Details

Motivation: To examine how similar LLM capabilities are to human abilities, in order to clarify what LLMs can actually do. Method: Collects performance data from more than 80 LLMs on 37 evaluation benchmarks, categorizes the benchmarks into 6 primary and 11 secondary aspects of human ability, and runs a clustering analysis. Result: (1) Some LLM capabilities can be described with human ability metrics; (2) some abilities that are interrelated in humans are nearly uncorrelated in LLMs; (3) LLM capabilities vary significantly with parameter scale. Conclusion: LLM capabilities partly overlap with human abilities but differ substantially, and they depend on model scale. Abstract: Users of Large Language Models (LLMs) often perceive these models as intelligent entities with human-like capabilities. However, the extent to which LLMs' capabilities truly approximate human abilities remains a topic of debate. In this paper, to characterize the capabilities of LLMs in relation to human capabilities, we collected performance data from over 80 models across 37 evaluation benchmarks. The evaluation benchmarks are categorized into 6 primary abilities and 11 sub-abilities from the perspective of human ability. We then clustered the performance rankings into several categories and compared these clustering results with classifications based on human ability aspects. Our findings lead to the following conclusions: 1. We have confirmed that certain capabilities of LLMs with fewer than 10 billion parameters can indeed be described using human ability metrics; 2. While some abilities are considered interrelated in humans, they appear nearly uncorrelated in LLMs; 3. The capabilities possessed by LLMs vary significantly with the parameter scale of the model.

[126] Meta-Evaluating Local LLMs: Rethinking Performance Metrics for Serious Games

Andrés Isaza-Giraldo,Paulo Bala,Lucas Pereira

Main category: cs.CL

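TL;DR: The study examines how reliably five small LLMs evaluate player responses in the game En-join, revealing trade-offs between sensitivity and specificity and stressing the importance of context-aware evaluation frameworks.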

Details

Motivation: Evaluating open-ended responses in serious games is subjective, and the accuracy and consistency of small LLMs as evaluators remain unclear. Method: Systematically compares five small LLMs across different evaluation scenarios using traditional binary classification metrics (accuracy, true positive rate, and true negative rate). Result: Some models excel at identifying correct responses, while others suffer from false positives or inconsistent evaluations. Conclusion: Deploying LLMs as evaluators calls for context-aware frameworks and careful model selection; the work offers insights into the trustworthiness of AI-driven assessment tools. Abstract: The evaluation of open-ended responses in serious games presents a unique challenge, as correctness is often subjective. Large Language Models (LLMs) are increasingly being explored as evaluators in such contexts, yet their accuracy and consistency remain uncertain, particularly for smaller models intended for local execution. This study investigates the reliability of five small-scale LLMs when assessing player responses in En-join, a game that simulates decision-making within energy communities. By leveraging traditional binary classification metrics (including accuracy, true positive rate, and true negative rate), we systematically compare these models across different evaluation scenarios. Our results highlight the strengths and limitations of each model, revealing trade-offs between sensitivity, specificity, and overall performance. We demonstrate that while some models excel at identifying correct responses, others struggle with false positives or inconsistent evaluations. The findings highlight the need for context-aware evaluation frameworks and careful model selection when deploying LLMs as evaluators. This work contributes to the broader discourse on the trustworthiness of AI-driven assessment tools, offering insights into how different LLM architectures handle subjective evaluation tasks.
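For readers less familiar with the three metrics named in the abstract, here is a small sketch of how accuracy, true positive rate (sensitivity), and true negative rate (specificity) are computed from binary judgments. The labels below are made-up illustrative data, not results from the study.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, true positive rate, and true negative rate for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "tpr": tp / (tp + fn) if (tp + fn) else 0.0,  # sensitivity
        "tnr": tn / (tn + fp) if (tn + fp) else 0.0,  # specificity
    }


# Hypothetical data: 1 = response judged correct, 0 = judged incorrect.
human_labels = [1, 1, 1, 0, 0, 0, 1, 0]
model_labels = [1, 1, 0, 0, 1, 0, 1, 0]
print(binary_metrics(human_labels, model_labels))
```

A model that accepts almost every answer would score a high true positive rate and a low true negative rate, which is the kind of trade-off the abstract describes.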

[127] QM-ToT: A Medical Tree of Thoughts Reasoning Framework for Quantized Model

Zongxian Yang,Jiayu Qian,Zhi-An Huang,Kay Chen Tan

Main category: cs.CL

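TL;DR: The QM-ToT framework decomposes complex medical problems along tree-of-thought reasoning paths, significantly improving the performance of quantized models on the MedQA-USMLE dataset.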

Details

Motivation: To address the performance drop of large language models on specialized biomedical tasks caused by complex terminology and the sensitivity of clinical data. Method: Proposes the Quantized Medical Tree of Thought (QM-ToT) framework, which combines tree-of-thought reasoning paths with evaluator layers to decompose medical problems. Result: Accuracy rises from 34% to 50% for LLaMA2-70b and from 58.77% to 69.49% for LLaMA-3.1-8b; the data distillation method yields an 86.27% improvement. Conclusion: QM-ToT is the first to show the potential of tree-of-thought reasoning on complex biomedical tasks, laying a foundation for deploying high-performing quantized models in resource-constrained settings. Abstract: Large language models (LLMs) face significant challenges in specialized biomedical tasks due to the inherent complexity of medical reasoning and the sensitive nature of clinical data. Existing LLMs often struggle with intricate medical terminology and the need for accurate clinical insights, leading to performance reduction when quantized for resource-constrained deployment. To address these issues, we propose Quantized Medical Tree of Thought (QM-ToT), a path-based reasoning framework. QM-ToT leverages a Tree of Thought (ToT) reasoning approach to decompose complex medical problems into manageable subtasks, coupled with evaluator assessment layers. This framework facilitates substantial performance improvements in INT4-quantized models on the challenging MedQA-USMLE dataset. Specifically, we demonstrate a remarkable accuracy increase from 34% to 50% for the LLaMA2-70b model and from 58.77% to 69.49% for LLaMA-3.1-8b. We also propose an effective data distillation method based on ToT. Compared to the traditional distillation method, we achieve an improvement of 86.27% while using only 3.9% of the data. This work, for the first time, showcases the potential of ToT to significantly enhance performance on complex biomedical tasks, establishing a crucial foundation for future advances in deploying high-performing quantized LLMs in resource-limited medical settings.

[128] You've Changed: Detecting Modification of Black-Box Large Language Models

Alden Dima,James Foulds,Shimei Pan,Philip Feldman

Main category: cs.CL

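TL;DR: Proposes a method for monitoring changes in large language models (LLMs) by comparing the distributions of linguistic and psycholinguistic features of generated text.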

Details

Motivation: Because LLMs are usually served through an API, developers struggle to detect changes in their behavior and need an efficient way to monitor them. Method: Uses a statistical test to compare the feature distributions of two text samples and decide whether they are equivalent, thereby identifying changes in the LLM. Result: Experiments show that simple text features combined with a statistical test can distinguish between language models and can also be used to detect prompt injection attacks. Conclusion: The method enables efficient LLM change monitoring and avoids computationally expensive benchmark evaluations. Abstract: Large Language Models (LLMs) are often provided as a service via an API, making it challenging for developers to detect changes in their behavior. We present an approach to monitor LLMs for changes by comparing the distributions of linguistic and psycholinguistic features of generated text. Our method uses a statistical test to determine whether the distributions of features from two samples of text are equivalent, allowing developers to identify when an LLM has changed. We demonstrate the effectiveness of our approach using five OpenAI completion models and Meta's Llama 3 70B chat model. Our results show that simple text features coupled with a statistical test can distinguish between language models. We also explore the use of our approach to detect prompt injection attacks. Our work enables frequent LLM change monitoring and avoids computationally expensive benchmark evaluations.
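The general idea of testing whether two batches of generated text come from the same distribution can be sketched as follows. The abstract does not name the features or the statistical test, so both choices here are assumptions: the feature is mean word length per text, and the test is a two-sample permutation test on the difference of means.

```python
import random
from statistics import mean


def mean_word_length(text):
    """A deliberately simple text feature: average word length."""
    words = text.split()
    return mean(len(w) for w in words) if words else 0.0


def permutation_p_value(sample_a, sample_b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of sample means."""
    rng = random.Random(seed)
    observed = abs(mean(sample_a) - mean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_perm + 1)


# Hypothetical usage: texts sampled from the same API on two dates.
last_week = ["the cat sat on the mat", "a dog ran in the park"] * 4
today = ["consequently, institutional frameworks proliferate considerably"] * 8
p = permutation_p_value([mean_word_length(t) for t in last_week],
                        [mean_word_length(t) for t in today])
print(f"p-value: {p:.4f}")
```

A small p-value suggests that the feature distribution has shifted between the two samples; in practice one would track several features and correct for multiple comparisons.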

Anna-Carolina Haensch

Main category: cs.CL

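TL;DR: The study analyzes how users employ large language models (LLMs) as mental health tools, finding that nearly 20% of comments reflect personal use with positive attitudes, alongside concerns about privacy and the lack of professional oversight.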

Details

Motivation: To explore the role and potential of generative AI chatbots such as ChatGPT in informal mental health support. Method: Analyzes more than 10,000 TikTok comments, using a tiered coding schema and supervised classification models to identify user experiences and attitudes. Result: Nearly 20% of comments reflect personal use, and those users are positive; the main benefits cited are accessibility and emotional support, while concerns about privacy and the lack of professional oversight remain. Conclusion: AI is increasingly relevant to mental health support, but clinical and ethical scrutiny needs to be strengthened. Abstract: The emergence of generative AI chatbots such as ChatGPT has prompted growing public and academic interest in their role as informal mental health support tools. While early rule-based systems have been around for several years, large language models (LLMs) offer new capabilities in conversational fluency, empathy simulation, and availability. This study explores how users engage with LLMs as mental health tools by analyzing over 10,000 TikTok comments from videos referencing LLMs as mental health tools. Using a self-developed tiered coding schema and supervised classification models, we identify user experiences, attitudes, and recurring themes. Results show that nearly 20% of comments reflect personal use, with these users expressing overwhelmingly positive attitudes. Commonly cited benefits include accessibility, emotional support, and perceived therapeutic value. However, concerns around privacy, generic responses, and the lack of professional oversight remain prominent. It is important to note that the user feedback does not indicate which therapeutic framework, if any, the LLM-generated output aligns with. While the findings underscore the growing relevance of AI in everyday practices, they also highlight the urgent need for clinical and ethical scrutiny in the use of AI for mental health support.

[130] Paging Dr. GPT: Extracting Information from Clinical Notes to Enhance Patient Predictions

David Anderson,Michaela Anderson,Margret Bjarnadottir,Stephen Mahar,Shriyan Reyya

Main category: cs.CL

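TL;DR: The study investigates using GPT-4o-mini's answers to clinical questions as features, combined with tabular data from electronic medical records, to improve the accuracy of patient mortality prediction.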

Details

Motivation: Traditional predictive models fail to exploit the unstructured information in clinical notes, a gap that large language models such as GPT may fill. Method: Uses data from 14,011 first-time admissions in the MIMIC-IV Note dataset and feeds GPT responses as input features into logistic regression models. Result: GPT-based models outperform models trained on standard tabular data; combining both raises AUC by an average of 5.1 percentage points and positive predictive value for the highest-risk decile by 29.9 percent. Conclusion: Large language models add significant value to clinical prediction tasks, especially in domains where unstructured text remains underutilized. Abstract: There is a long history of building predictive models in healthcare using tabular data from electronic medical records. However, these models fail to extract the information found in unstructured clinical notes, which document diagnosis, treatment, progress, medications, and care plans. In this study, we investigate how answers generated by GPT-4o-mini (ChatGPT) to simple clinical questions about patients, when given access to the patient's discharge summary, can support patient-level mortality prediction. Using data from 14,011 first-time admissions to the Coronary Care or Cardiovascular Intensive Care Units in the MIMIC-IV Note dataset, we implement a transparent framework that uses GPT responses as input features in logistic regression models. Our findings demonstrate that GPT-based models alone can outperform models trained on standard tabular data, and that combining both sources of information yields even greater predictive power, increasing AUC by an average of 5.1 percentage points and increasing positive predictive value by 29.9 percent for the highest-risk decile. These results highlight the value of integrating large language models (LLMs) into clinical prediction tasks and underscore the broader potential for using LLMs in any domain where unstructured text data remains an underutilized resource.

[131] GOAT-TTS: LLM-based Text-To-Speech Generation Optimized via A Dual-Branch Architecture

Yaodong Song,Hongjie Chen,Jie Lian,Yuxin Zhang,Guangmin Xia,Zehan Li,Genliang Zhao,Jian Kang,Yongxiang Li,Jie Li

Main category: cs.CL

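TL;DR: GOAT-TTS is an LLM-based text-to-speech approach with a dual-branch architecture that addresses three problems of existing models: loss of acoustic characteristics, dependence on aligned data, and forgetting of text comprehension.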

Details

Motivation: Current LLM-based TTS suffers from loss of acoustic characteristics, dependence on aligned data, and forgetting of text comprehension. Method: Proposes a dual-branch architecture: a modality-alignment branch that captures continuous acoustic embeddings, and a speech-generation branch that predicts speech tokens through modular fine-tuning. Result: GOAT-TTS performs comparably to state-of-the-art TTS models and validates the efficacy of synthesized dialect speech data. Conclusion: The dual-branch design resolves key problems of LLM-based TTS and demonstrates efficiency and practicality. Abstract: While large language models (LLMs) have revolutionized text-to-speech (TTS) synthesis through discrete tokenization paradigms, current architectures exhibit fundamental tensions between three critical dimensions: 1) irreversible loss of acoustic characteristics caused by quantization of speech prompts; 2) stringent dependence on precisely aligned prompt speech-text pairs that limit real-world deployment; and 3) catastrophic forgetting of the LLM's native text comprehension during optimization for speech token generation. To address these challenges, we propose an LLM-based text-to-speech Generation approach Optimized via a novel dual-branch ArchiTecture (GOAT-TTS). Our framework introduces two key innovations: (1) The modality-alignment branch combines a speech encoder and projector to capture continuous acoustic embeddings, enabling bidirectional correlation between paralinguistic features (language, timbre, emotion) and semantic text representations without transcript dependency; (2) The speech-generation branch employs modular fine-tuning on top-k layers of an LLM for speech token prediction while freezing the bottom-k layers to preserve foundational linguistic knowledge. Moreover, multi-token prediction is introduced to support real-time streaming TTS synthesis. Experimental results demonstrate that our GOAT-TTS achieves performance comparable to state-of-the-art TTS models while validating the efficacy of synthesized dialect speech data.

[132] Streamlining Biomedical Research with Specialized LLMs

Linqing Chen,Weilei Wang,Yubin Xia,Wentao Wu,Peng Xu,Zilong Bai,Jie Fang,Chaobo Xu,Ran Hu,Licong Xu,Haoran Hua,Jing Sun,Hanmeng Zhong,Jin Liu,Tian Qiu,Haowen Liu,Meng Hu,Xiuwen Li,Fei Gao,Yong Gu,Tao Shi,Chaochao Wang,Jianping Lu,Cheng Sun,Yixin Wang,Shengjie Yang,Yuancheng Li,Lu Jin,Lisha Zhang,Fu Bian,Zhongkai Ye,Lidong Pei,Changyang Tu

Main category: cs.CL

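TL;DR: The paper proposes a new system that combines domain-specific large language models with advanced information retrieval techniques to deliver comprehensive, context-aware responses.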

Details

Motivation: To improve dialogue generation quality, support real-time high-fidelity interaction, and raise the research efficiency of professionals in the biomedical and pharmaceutical domains. Method: Integrates domain-specific large language models with information retrieval techniques, enabling seamless interaction among components and cross-validation of outputs. Result: The system improves response precision, supports access to multimodal data such as images and tables, and speeds up R&D decision-making. Conclusion: The system offers an efficient interaction platform for the biomedical and pharmaceutical domains and accelerates the R&D process. Abstract: In this paper, we propose a novel system that integrates state-of-the-art, domain-specific large language models with advanced information retrieval techniques to deliver comprehensive and context-aware responses. Our approach facilitates seamless interaction among diverse components, enabling cross-validation of outputs to produce accurate, high-quality responses enriched with relevant data, images, tables, and other modalities. We demonstrate the system's capability to enhance response precision by leveraging a robust question-answering model, significantly improving the quality of dialogue generation. The system provides an accessible platform for real-time, high-fidelity interactions, allowing users to benefit from efficient human-computer interaction, precise retrieval, and simultaneous access to a wide range of literature and data. This dramatically improves the research efficiency of professionals in the biomedical and pharmaceutical domains and facilitates faster, more informed decision-making throughout the R&D process. Furthermore, the system proposed in this paper is available at https://synapse-chat.patsnap.com.

[133] Benchmarking Biopharmaceuticals Retrieval-Augmented Generation Evaluation

Hanmeng Zhong,Linqing Chen,Weilei Wang,Wentao Wu

Main category: cs.CL

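TL;DR: The paper introduces BRAGE, the first benchmark for evaluating retrieval-augmented large language models (LLMs) in the biopharmaceutical domain, and proposes a citation-based classification method to assess LLMs' Query and Reference Understanding Capability (QRUC).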

Details

Motivation: There is no benchmark designed specifically for evaluating LLMs in biopharmaceuticals, and traditional QA metrics fall short in open-ended retrieval-augmented QA. Method: Proposes the BRAGE benchmark and a citation-based classification method for evaluating the QRUC of LLMs. Result: Experiments reveal a significant gap in the biopharmaceutical QRUC of mainstream LLMs, which needs improvement. Conclusion: BRAGE provides a new tool for evaluating LLMs in the biopharmaceutical domain; further work is needed to improve their QRUC. Abstract: Recently, the application of the retrieval-augmented Large Language Models (LLMs) in specific domains has gained significant attention, especially in biopharmaceuticals. However, in this context, there is no benchmark specifically designed for biopharmaceuticals to evaluate LLMs. In this paper, we introduce the Biopharmaceuticals Retrieval-Augmented Generation Evaluation (BRAGE), the first benchmark tailored for evaluating LLMs' Query and Reference Understanding Capability (QRUC) in the biopharmaceutical domain, available in English, French, German and Chinese. In addition, traditional Question-Answering (QA) metrics like accuracy and exact match fall short in the open-ended retrieval-augmented QA scenarios. To address this, we propose a citation-based classification method to evaluate the QRUC of LLMs to understand the relationship between queries and references. We apply this method to evaluate the mainstream LLMs on BRAGE. Experimental results show that there is a significant gap in the biopharmaceutical QRUC of mainstream LLMs, and their QRUC needs to be improved.

[134] Propaganda via AI? A Study on Semantic Backdoors in Large Language Models

Nay Myat Min,Long H. Pham,Yige Li,Jun Sun

Main category: cs.CL

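TL;DR: The paper proposes RAVEN, a black-box framework for detecting semantic backdoor attacks in large language models (LLMs), which identifies hidden concept-level triggers through semantic entropy and cross-model consistency analysis.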

Details

Motivation: Traditional defenses focus only on explicit token-level anomalies and overlook covert, meaning-based backdoors, so concept-level auditing is urgently needed. Method: Probes multiple models with structured topic-perspective prompts, clusters responses via bidirectional entailment, and uses cross-model comparison to detect anomalously uniform outputs. Result: Uncovers previously undetected semantic backdoors across several LLM families (GPT-4o, Llama, DeepSeek, Mistral), demonstrating feasibility. Conclusion: Semantic backdoors pose a practical threat; RAVEN provides an effective tool for concept-level auditing, and the security of deployed models deserves further attention. Abstract: Large language models (LLMs) demonstrate remarkable performance across myriad language tasks, yet they remain vulnerable to backdoor attacks, where adversaries implant hidden triggers that systematically manipulate model outputs. Traditional defenses focus on explicit token-level anomalies and therefore overlook semantic backdoors: covert triggers embedded at the conceptual level (e.g., ideological stances or cultural references) that rely on meaning-based cues rather than lexical oddities. We first show, in a controlled finetuning setting, that such semantic backdoors can be implanted with only a small poisoned corpus, establishing their practical feasibility. We then formalize the notion of semantic backdoors in LLMs and introduce a black-box detection framework, RAVEN (short for "Response Anomaly Vigilance for uncovering semantic backdoors"), which combines semantic entropy with cross-model consistency analysis. The framework probes multiple models with structured topic-perspective prompts, clusters the sampled responses via bidirectional entailment, and flags anomalously uniform outputs; cross-model comparison isolates model-specific anomalies from corpus-wide biases. Empirical evaluations across diverse LLM families (GPT-4o, Llama, DeepSeek, Mistral) uncover previously undetected semantic backdoors, providing the first proof-of-concept evidence of these hidden vulnerabilities and underscoring the urgent need for concept-level auditing of deployed language models. We open-source our code and data at https://github.com/NayMyatMin/RAVEN.
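To make the "anomalously uniform outputs" idea concrete, here is a minimal sketch of entropy over semantic clusters. It is not the RAVEN code: it assumes the bidirectional-entailment clustering has already assigned a cluster ID to each sampled response, and the `threshold` value is a hypothetical placeholder.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Shannon entropy (in nats) of the distribution over meaning clusters."""
    n = len(cluster_ids)
    if n == 0:
        raise ValueError("at least one response is required")
    counts = Counter(cluster_ids)
    return sum(-(c / n) * math.log(c / n) for c in counts.values())


def is_suspiciously_uniform(cluster_ids, threshold=0.3):
    """Flag a prompt whose sampled responses collapse into few meanings.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return semantic_entropy(cluster_ids) < threshold


# Hypothetical cluster assignments for ten sampled responses per model.
model_a = ["c1"] * 10                          # every sample says the same thing
model_b = ["c1", "c2", "c3", "c1", "c2", "c4", "c3", "c1", "c2", "c4"]
print(semantic_entropy(model_a), is_suspiciously_uniform(model_a))
print(round(semantic_entropy(model_b), 3), is_suspiciously_uniform(model_b))
```

Low entropy on its own could reflect a bias shared by all models, which is why the abstract pairs it with cross-model comparison: a prompt would be of interest when one model is uniform and its peers are not.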

[135] Reimagining Urban Science: Scaling Causal Inference with Large Language Models

Yutong Xia,Ao Qu,Yunhan Zheng,Yihong Tang,Dingyi Zhuang,Yuxuan Liang,Cathy Wu,Roger Zimmermann,Jinhua Zhao

Main category: cs.CL

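TL;DR: The paper proposes the AutoUrbanCI framework, which uses LLMs to improve urban causal research by addressing problems such as hypothesis generation and data complexity.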

Details

Motivation: Urban causal research faces inefficient hypothesis generation, data complexity, and methodological fragility; LLMs offer a new way forward. Method: Proposes the AutoUrbanCI framework, made up of four modular agents: hypothesis generation, data engineering, experiment design and execution, and results interpretation with policy recommendations. Result: The framework aims to improve rigor and transparency and to promote human-AI collaboration, equity, and accountability. Conclusion: Calls for AI-augmented workflows that broaden participation, improve reproducibility, and enable more inclusive urban causal reasoning. Abstract: Urban causal research is essential for understanding the complex dynamics of cities and informing evidence-based policies. However, it is challenged by the inefficiency and bias of hypothesis generation, barriers to multimodal data complexity, and the methodological fragility of causal experimentation. Recent advances in large language models (LLMs) present an opportunity to rethink how urban causal analysis is conducted. This Perspective examines current urban causal research by analyzing taxonomies that categorize research topics, data sources, and methodological approaches to identify structural gaps. We then introduce an LLM-driven conceptual framework, AutoUrbanCI, composed of four distinct modular agents responsible for hypothesis generation, data engineering, experiment design and execution, and results interpretation with policy recommendations. We propose evaluation criteria for rigor and transparency and reflect on implications for human-AI collaboration, equity, and accountability. We call for a new research agenda that embraces AI-augmented workflows not as replacements for human expertise but as tools to broaden participation, improve reproducibility, and unlock more inclusive forms of urban causal reasoning.

[136] Mathematical Capabilities of Large Language Models in Finnish Matriculation Examination

Mika Setälä,Pieta Sikström,Ville Heilala,Tommi Kärkkäinen

Main category: cs.CL

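TL;DR: The study evaluates the mathematical reasoning of large language models (LLMs) and finds that their capabilities have improved rapidly, with some models achieving perfect scores.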

Details

Motivation: To assess the progress of LLMs in mathematics, particularly on the high-stakes Finnish matriculation examination. Method: Uses the Finnish matriculation examination as a testbed to evaluate the mathematical capabilities of various LLMs. Result: Initial performance was moderate, but as models evolved, some reached near-perfect or perfect scores, on par with top students. Conclusion: The mathematical proficiency of LLMs is advancing rapidly, showing their potential in educational assessment. Abstract: Large language models (LLMs) have shown increasing promise in educational settings, yet their mathematical reasoning has been considered evolving. This study evaluates the mathematical capabilities of various LLMs using the Finnish matriculation examination, a high-stakes digital test for upper secondary education. Initial tests yielded moderate performance corresponding to mid-range grades, but later evaluations demonstrated substantial improvements as the language models evolved. Remarkably, some models achieved near-perfect or perfect scores, matching top student performance and qualifying for university admission. Our findings highlight the rapid advances in the mathematical proficiency of LLMs and illustrate their potential to also support educational assessments at scale.

[137] A Large-Language Model Framework for Relative Timeline Extraction from PubMed Case Reports

Jing Wang,Jeremy C Weiss

Main category: cs.CL

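TL;DR: The paper presents a system that converts case reports into structured time series data, compares manual and large language model (LLM) annotations, and reports LLM performance on event recall and temporal concordance.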

Details

Motivation: The timing of clinical events is essential for analyzing patient trajectories, but existing electronic health records and clinical reports lack structured temporal data. Method: Develops a system that converts case reports into textual time series (pairs of events and timestamps) and compares manual and LLM annotations. Result: The LLM shows moderate event recall (0.80) but high temporal concordance (0.95). Conclusion: The study provides a benchmark for temporal analytics on the PubMed open-access corpus. Abstract: Timing of clinical events is central to characterization of patient trajectories, enabling analyses such as process tracing, forecasting, and causal reasoning. However, structured electronic health records capture few data elements critical to these tasks, while clinical reports lack temporal localization of events in structured form. We present a system that transforms case reports into textual time series: structured pairs of textual events and timestamps. We contrast manual and large language model (LLM) annotations (n=320 and n=390 respectively) of ten randomly-sampled PubMed open-access (PMOA) case reports (N=152,974) and assess inter-LLM agreement (n=3,103; N=93). We find that the LLM models have moderate event recall (O1-preview: 0.80) but high temporal concordance among identified events (O1-preview: 0.95). By establishing the task, annotation, and assessment systems, and by demonstrating high concordance, this work may serve as a benchmark for leveraging the PMOA corpus for temporal analytics.
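The two reported quantities, event recall and temporal concordance, can be illustrated with a short sketch. The definitions below are common ones and are assumptions here: events are matched by exact string equality (the paper's matching procedure is not described in the abstract), and concordance is the fraction of event pairs, ordered differently in time by the reference, whose order the prediction preserves.

```python
from itertools import combinations


def event_recall(reference, predicted):
    """Share of reference events that also appear in the prediction."""
    ref = set(reference)
    if not ref:
        raise ValueError("reference must contain at least one event")
    return len(ref & set(predicted)) / len(ref)


def temporal_concordance(ref_times, pred_times):
    """Pairwise order agreement over events found in both annotations.

    Both arguments map event text to a timestamp (e.g. hours relative to
    admission). Pairs tied in the reference are skipped; pairs tied only in
    the prediction count as discordant.
    """
    shared = sorted(set(ref_times) & set(pred_times))
    concordant = comparable = 0
    for a, b in combinations(shared, 2):
        ref_diff = ref_times[a] - ref_times[b]
        if ref_diff == 0:
            continue
        comparable += 1
        if ref_diff * (pred_times[a] - pred_times[b]) > 0:
            concordant += 1
    return concordant / comparable if comparable else float("nan")


# Hypothetical annotations of one case report (hours relative to admission).
manual = {"fever": -24, "admission": 0, "antibiotics": 2, "discharge": 120}
llm = {"fever": -48, "admission": 0, "antibiotics": 0, "discharge": 96, "rash": 5}
print(event_recall(manual, llm))
print(round(temporal_concordance(manual, llm), 3))
```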

Muhammad Ahmad,Muhammad Waqas,Ildar Batyrshin,Grigori Sidorov

Main category: cs.CL

TL;DR: The paper proposes an AI-driven NLP framework that detects drug use and overdose symptoms from social media data, reaching accuracy as high as 98%.

Details

Motivation: Traditional research methods have limitations for monitoring drug misuse, whereas social media provides real-time data and a new avenue for public health surveillance. Method: Uses a hybrid annotation strategy (LLMs and human annotators) together with traditional ML models, neural networks, and transformer-based models. Result: The framework achieves 98% accuracy in multi-class and 97% in multi-label classification, outperforming baseline models by up to 8%. Conclusion: AI frameworks show potential for public health surveillance and personalized intervention. Abstract: Drug overdose remains a critical global health issue, often driven by misuse of opioids, painkillers, and psychiatric medications. Traditional research methods face limitations, whereas social media offers real-time insights into self-reported substance use and overdose symptoms. This study proposes an AI-driven NLP framework trained on annotated social media data to detect commonly used drugs and associated overdose symptoms. Using a hybrid annotation strategy with LLMs and human annotators, we applied traditional ML models, neural networks, and advanced transformer-based models. Our framework achieved 98% accuracy in multi-class and 97% in multi-label classification, outperforming baseline models by up to 8%. These findings highlight the potential of AI for supporting public health surveillance and personalized intervention strategies.

[139] Replicating ReLM Results: Validating Large Language Models with ReLM

Reece Adamson,Erin Song

Main category: cs.CL

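TL;DR: ReLM uses formal languages to evaluate and control large language models (LLMs) for memorization, bias, and zero-shot performance, addressing the shortcomings of existing approaches.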

Details

Motivation: Existing evaluation approaches are slow, imprecise, and may introduce biases of their own, yet evaluating LLM behavior is essential in production settings. Method: Applies ReLM, which evaluates and controls LLMs using formal languages, and reproduces key results from the original paper. Result: Confirms the effectiveness of ReLM for evaluating LLM behavior and expands on its applications. Conclusion: ReLM offers an efficient and precise evaluation tool for the field of systems for machine learning. Abstract: Validating Large Language Models with ReLM explores the application of formal languages to evaluate and control Large Language Models (LLMs) for memorization, bias, and zero-shot performance. Current approaches for evaluating these types of behavior are often slow, imprecise, costly, or introduce biases of their own, but are necessary due to the importance of this behavior when productionizing LLMs. This project reproduces key results from the original ReLM paper and expounds on the approach and applications with an emphasis on the relevance to the field of systems for machine learning.

[140] A Method for Handling Negative Similarities in Explainable Graph Spectral Clustering of Text Documents -- Extended Version

Mieczysław A. Kłopotek,Sławomir T. Wierzchoń,Bartłomiej Starosta,Dariusz Czerski,Piotr Borkowski

Main category: cs.CL

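TL;DR: This paper studies negative similarities in graph spectral clustering, discusses solutions for combinatorial and normalized Laplacians, and compares the advantages and disadvantages of six methods.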

Details

Motivation: Document embeddings such as doc2vec and GloVe differ from the traditional Term Vector Space and can produce negative similarities, which graph spectral clustering must handle. Method: Analyzes solutions for combinatorial and normalized Laplacians and experimentally compares six different methods. Result: GloVe embeddings frequently cause normalized-Laplacian graph spectral clustering to fail, while methods that cure negative similarities improve accuracy. Conclusion: Handling negative similarities not only improves clustering accuracy but also extends the applicability of the explanation methods previously proposed by the authors. Abstract: This paper investigates the problem of Graph Spectral Clustering with negative similarities, resulting from document embeddings different from the traditional Term Vector Space (like doc2vec, GloVe, etc.). Solutions for combinatorial Laplacians and normalized Laplacians are discussed. An experimental investigation shows the advantages and disadvantages of 6 different solutions proposed in the literature and in this research. The research demonstrates that GloVe embeddings frequently cause failures of normalized Laplacian based GSC due to negative similarities. Furthermore, application of methods curing similarity negativity leads to accuracy improvement for both combinatorial and normalized Laplacian based GSC. It also leads to applicability for GloVe embeddings of explanation methods developed originally by the authors for Term Vector Space embeddings.
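
To see why negative similarities are a problem, the following minimal sketch builds the combinatorial Laplacian L = D - S for a toy similarity matrix with negative entries. The matrix values and the `shift_to_nonnegative` remedy (mapping cosine similarities from [-1, 1] to [0, 1]) are illustrative assumptions, not necessarily one of the six methods the paper evaluates.

```python
import numpy as np

def combinatorial_laplacian(S):
    """L = D - S, where D is the diagonal matrix of row sums of S."""
    return np.diag(S.sum(axis=1)) - S

def shift_to_nonnegative(S):
    """Hypothetical remedy: map cosine similarities from [-1, 1] to [0, 1]."""
    shifted = (S + 1.0) / 2.0
    np.fill_diagonal(shifted, 0.0)  # keep the graph free of self-loops
    return shifted

# Toy symmetric similarity matrix with negative entries (made-up values).
S = np.array([[ 0.0,  0.8, -0.6],
              [ 0.8,  0.0, -0.5],
              [-0.6, -0.5,  0.0]])

raw_eigs = np.linalg.eigvalsh(combinatorial_laplacian(S))
fixed_eigs = np.linalg.eigvalsh(combinatorial_laplacian(shift_to_nonnegative(S)))
degrees = S.sum(axis=1)

print("raw Laplacian eigenvalues:  ", raw_eigs)
print("shifted Laplacian eigenvalues:", fixed_eigs)
print("raw degrees:", degrees)
```

With the raw matrix, the third node has a negative degree, so the Laplacian is no longer positive semidefinite and the D^{-1/2} factor of the normalized Laplacian is not real-valued. After the shift, all weights are nonnegative and the usual spectral properties hold again.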

[141] Position: The Most Expensive Part of an LLM should be its Training Data

Nikhil Kandpal,Colin Raffel

Main category: cs.CL

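TL;DR: The paper argues that the producers of large language model (LLM) training data should be compensated, and estimates that the cost of producing that data far exceeds the cost of training the model itself.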

Details

Motivation: The human labor behind LLM training data is overlooked and uncompensated; the paper aims to quantify this cost and promote fairer practices. Method: Studies 64 LLMs released between 2016 and 2024 and estimates the labor cost of producing their training data from scratch. Result: Even under conservative wage estimates, training data costs are 10-1000 times the cost of training the models. Conclusion: There is a massive gap between the value of training data and the compensation paid for it, and future research should pursue fairer practices. Abstract: Training a state-of-the-art Large Language Model (LLM) is an increasingly expensive endeavor due to growing computational, hardware, energy, and engineering demands. Yet, an often-overlooked (and seldom paid) expense is the human labor behind these models' training data. Every LLM is built on an unfathomable amount of human effort: trillions of carefully written words sourced from books, academic papers, codebases, social media, and more. This position paper aims to assign a monetary value to this labor and argues that the most expensive part of producing an LLM should be the compensation provided to training data producers for their work. To support this position, we study 64 LLMs released between 2016 and 2024, estimating what it would cost to pay people to produce their training datasets from scratch. Even under highly conservative estimates of wage rates, the costs of these models' training datasets are 10-1000 times larger than the costs to train the models themselves, representing a significant financial liability for LLM providers. In the face of the massive gap between the value of training data and the lack of compensation for its creation, we highlight and discuss research directions that could enable fairer practices in the future.

[142] On Linear Representations and Pretraining Data Frequency in Language Models

Jack Merullo,Noah A. Smith,Sarah Wiegreffe,Yanai Elazar

Main category: cs.CL

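TL;DR: Studies the relationship between pretraining data frequency and linear representations in language models, finds that the two are highly correlated, and proposes a method for predicting pretraining data frequency.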

Details

Motivation: Investigate how pretraining data shapes the linear representations of language models, in order to reveal the relationship between data frequency and model behavior. Method: Analyzes the correlation between pretraining data frequency and linear representations, and trains a regression model to predict data frequency. Result: Linear representations are highly correlated with data frequency, and the regression model predicts pretraining term frequency with low error. Conclusion: The strength of linear representations reflects properties of the pretraining data and suggests new ways to improve model behavior. Abstract: Pretraining data has a direct impact on the behaviors and quality of language models (LMs), but we only understand the most basic principles of this relationship. While most work focuses on pretraining data's effect on downstream task behavior, we investigate its relationship to LM representations. Previous work has discovered that, in language models, some concepts are encoded 'linearly' in the representations, but what factors cause these representations to form? We study the connection between pretraining data frequency and models' linear representations of factual relations. We find evidence that the formation of linear representations is strongly connected to pretraining term frequencies; specifically for subject-relation-object fact triplets, both subject-object co-occurrence frequency and in-context learning accuracy for the relation are highly correlated with linear representations. This is the case across all phases of pretraining. In OLMo-7B and GPT-J, we discover that a linear representation consistently (but not exclusively) forms when the subjects and objects within a relation co-occur at least 1k and 2k times, respectively, regardless of when these occurrences happen during pretraining. Finally, we train a regression model on measurements of linear representation quality in fully-trained LMs that can predict how often a term was seen in pretraining. Our model achieves low error even on inputs from a different model with a different pretraining dataset, providing a new method for estimating properties of the otherwise-unknown training data of closed-data models. We conclude that the strength of linear representations in LMs contains signal about the models' pretraining corpora that may provide new avenues for controlling and improving model behavior: particularly, manipulating the models' training data to meet specific frequency thresholds.

[143] SLURG: Investigating the Feasibility of Generating Synthetic Online Fallacious Discourse

Cal Blanco,Gavin Dsouza,Hugo Lin,Chelsey Rush

Main category: cs.CL

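TL;DR: The paper explores how logical fallacies are defined and inferred for the automatic detection of manipulation on social media, noting a prevalence of misinformation in forums about the Ukrainian-Russian conflict. It proposes SLURG, which uses a large language model to generate synthetic fallacious comments, and finds that the model can replicate the syntactic patterns of real data.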

Details

Motivation: Existing datasets for automatic fallacy detection mostly cover formal linguistic domains (such as political debates or news reports), whereas online forum language is diverse and non-standardized and is poorly covered by existing methods. Method: Proposes SLURG, which uses the DeepHermes-3-Mistral-24B large language model to generate synthetic fallacious forum comments, with high-quality few-shot prompts to improve the model's imitation of forum language diversity. Result: LLMs can replicate the syntactic patterns of real data, and high-quality few-shot prompts improve their ability to mimic the vocabulary diversity of forums. Conclusion: SLURG offers a new direction for fallacy detection in online forums, and LLMs show potential for generating synthetic data. Abstract: In our paper we explore the definition and extrapolation of fallacies as they pertain to the automatic detection of manipulation on social media. In particular we explore how these logical fallacies might appear in the real world, i.e., internet forums. We discovered a prevalence of misinformation / misguided intention in discussion boards specifically centered around the Ukrainian Russian Conflict which serves to narrow the domain of our task. Although automatic fallacy detection has gained attention recently, most datasets use unregulated fallacy taxonomies or are limited to formal linguistic domains like political debates or news reports. Online discourse, however, often features non-standardized and diverse language not captured in these domains. We present Shady Linguistic Utterance Replication-Generation (SLURG) to address these limitations, exploring the feasibility of generating synthetic fallacious forum-style comments using large language models (LLMs), specifically DeepHermes-3-Mistral-24B. Our findings indicate that LLMs can replicate the syntactic patterns of real data and that high-quality few-shot prompts enhance LLMs' ability to mimic the vocabulary diversity of online forums.

[144] Integrating Structural and Semantic Signals in Text-Attributed Graphs with BiGTex

Azadeh Beiranvand,Seyed Mehdi Vahidipour

Main category: cs.CL

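TL;DR: BiGTex is a novel architecture that tightly integrates GNNs and LLMs through bidirectional Graph-Text Fusion Units, enabling two-way interaction between textual and structural information and achieving state-of-the-art performance on node classification and link prediction.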

Details

Motivation: Text-attributed graphs (TAGs) require capturing both the semantics of node texts and the structural dependencies of the graph, and GNNs and LLMs each cover only one side. Method: Proposes the BiGTex architecture, in which stacked Graph-Text Fusion Units allow information to flow in both directions; it is trained with parameter-efficient fine-tuning (LoRA) while the LLM stays frozen. Result: On five benchmark datasets, BiGTex achieves state-of-the-art performance in node classification and generalizes to link prediction. Conclusion: Through bidirectional fusion units and parameter-efficient fine-tuning, BiGTex combines the strengths of GNNs and LLMs and provides an effective solution for representation learning on text-attributed graphs. Abstract: Text-attributed graphs (TAGs) present unique challenges in representation learning by requiring models to capture both the semantic richness of node-associated texts and the structural dependencies of the graph. While graph neural networks (GNNs) excel at modeling topological information, they lack the capacity to process unstructured text. Conversely, large language models (LLMs) are proficient in text understanding but are typically unaware of graph structure. In this work, we propose BiGTex (Bidirectional Graph Text), a novel architecture that tightly integrates GNNs and LLMs through stacked Graph-Text Fusion Units. Each unit allows for mutual attention between textual and structural representations, enabling information to flow in both directions, text influencing structure and structure guiding textual interpretation. The proposed architecture is trained using parameter-efficient fine-tuning (LoRA), keeping the LLM frozen while adapting to task-specific signals. Extensive experiments on five benchmark datasets demonstrate that BiGTex achieves state-of-the-art performance in node classification and generalizes effectively to link prediction. An ablation study further highlights the importance of soft prompting and bi-directional attention in the model's success.
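
For readers unfamiliar with LoRA, the sketch below shows the general low-rank update it relies on: the pretrained weight W stays frozen and only two small matrices A and B are trained. The dimensions, rank, and scaling factor are arbitrary illustrative values; the abstract does not state BiGTex's actual LoRA configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 8, 16, 2, 4  # illustrative sizes, not from the paper

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))               # trainable up-projection, zero-initialized

def lora_forward(x):
    """Frozen path plus scaled low-rank update: W x + (alpha / rank) * B A x."""
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.normal(size=d_in)
y = lora_forward(x)

trainable = A.size + B.size
print(f"trainable parameters: {trainable} of {W.size} in the full matrix")
```

Because B starts at zero, the adapted layer initially reproduces the frozen layer exactly, and training only has to learn rank * (d_in + d_out) parameters per adapted matrix.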

[145] Can Pre-training Indicators Reliably Predict Fine-tuning Outcomes of LLMs?

Hansi Zeng,Kai Hui,Honglei Zhuang,Zhen Qin,Zhenrui Yue,Hamed Zamani,Dana Alon

Main category: cs.CL

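TL;DR: The paper frames the prediction of a pre-training checkpoint's downstream fine-tuning performance as a pairwise classification task and introduces proxy metrics that address the misleading nature of conventional perplexity.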

Details

Motivation: It is unclear how well pre-training metrics such as perplexity predict performance at a fixed model size, which hinders model selection and development. Method: Casts pre-training checkpoint selection as a pairwise classification task, builds a dataset of 50 1B-parameter LLM variants, and introduces new unsupervised and supervised proxy metrics. Result: The new metrics reduce the relative performance prediction error rate by over 50% and prove useful in specific scenarios. Conclusion: The study opens a more efficient path to designing pre-training schemes optimized for various downstream tasks. Abstract: While metrics available during pre-training, such as perplexity, correlate well with model performance in scaling-law studies, their predictive capacities at a fixed model size remain unclear, hindering effective model selection and development. To address this gap, we formulate the task of selecting pre-training checkpoints to maximize downstream fine-tuning performance as a pairwise classification problem: predicting which of two LLMs, differing in their pre-training, will perform better after supervised fine-tuning (SFT). We construct a dataset using 50 1B-parameter LLM variants with systematically varied pre-training configurations, e.g., objectives or data, and evaluate them on diverse downstream tasks after SFT. We first conduct a study and demonstrate that the conventional perplexity is a misleading indicator. As such, we introduce novel unsupervised and supervised proxy metrics derived from pre-training that successfully reduce the relative performance prediction error rate by over 50%. Despite the inherent complexity of this task, we demonstrate the practical utility of our proposed proxies in specific scenarios, paving the way for more efficient design of pre-training schemes optimized for various downstream tasks.
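
The pairwise framing can be made concrete with a small sketch: for every pair of checkpoints, check whether a proxy metric orders them the same way as their measured post-SFT performance. The function name, checkpoint names, and all numbers below are hypothetical; the paper's own proxies and evaluation protocol are not specified in the abstract.

```python
from itertools import combinations

def pairwise_error_rate(proxy, downstream):
    """Fraction of model pairs that the proxy orders opposite to downstream performance.

    Both arguments map model name -> score, where higher means "predicted better"
    (proxy) or "measured better" (downstream). Pairs tied downstream are skipped.
    """
    errors = total = 0
    for a, b in combinations(proxy, 2):
        if downstream[a] == downstream[b]:
            continue
        total += 1
        if (proxy[a] > proxy[b]) != (downstream[a] > downstream[b]):
            errors += 1
    return errors / total if total else 0.0

# Made-up values: negated perplexity as the proxy (lower perplexity = higher score).
neg_perplexity = {"ckpt_a": -3.1, "ckpt_b": -2.9, "ckpt_c": -3.4}
sft_accuracy = {"ckpt_a": 0.62, "ckpt_b": 0.58, "ckpt_c": 0.55}

rate = pairwise_error_rate(neg_perplexity, sft_accuracy)
print(f"pairwise error rate: {rate:.3f}")
```

In this toy data the proxy misorders one of the three pairs (ckpt_a versus ckpt_b). A better proxy metric is one that drives this error rate down.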

[146] Accelerating Clinical NLP at Scale with a Hybrid Framework with Reduced GPU Demands: A Case Study in Dementia Identification

Jianlin Shi,Qiwei Gan,Elizabeth Hanchrow,Annie Bowles,John Stanley,Adam P. Bress,Jordana B. Cohen,Patrick R. Alba

Main category: cs.CL

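TL;DR: Proposes a hybrid NLP framework combining rule-based filtering, an SVM classifier, and a BERT model to process large-scale clinical text efficiently, and validates it in a dementia identification case study.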

Details

Motivation: Transformer-based NLP methods demand high computational resources, which limits the accessibility of clinical NLP. Method: Combines rule-based filtering, an SVM classifier, and a BERT model, applied to 2.1 billion clinical notes from 4.9 million veterans. Result: Patient-level precision of 0.90, recall of 0.84, and F1-score of 0.87; the approach identified over three times as many dementia cases as structured data methods. Conclusion: The hybrid NLP framework is feasible for large-scale clinical text analysis and offers an efficient solution for healthcare organizations with limited computational resources. Abstract: Clinical natural language processing (NLP) is increasingly in demand in both clinical research and operational practice. However, most of the state-of-the-art solutions are transformer-based and require high computational resources, limiting their accessibility. We propose a hybrid NLP framework that integrates rule-based filtering, a Support Vector Machine (SVM) classifier, and a BERT-based model to improve efficiency while maintaining accuracy. We applied this framework in a dementia identification case study involving 4.9 million veterans with incident hypertension, analyzing 2.1 billion clinical notes. At the patient level, our method achieved a precision of 0.90, a recall of 0.84, and an F1-score of 0.87. Additionally, this NLP approach identified over three times as many dementia cases as structured data methods. All processing was completed in approximately two weeks using a single machine with dual A40 GPUs. This study demonstrates the feasibility of hybrid NLP solutions for large-scale clinical text analysis, making state-of-the-art methods more accessible to healthcare organizations with limited computational resources.
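
As a quick consistency check on the reported metrics, the F1-score is the harmonic mean of precision and recall. The short sketch below recomputes it from the two values given in the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1 = f1_score(0.90, 0.84)
print(round(f1, 2))  # 0.87
```

The result, 2 * 0.90 * 0.84 / 1.74 ≈ 0.869, rounds to the 0.87 stated in the abstract.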

[147] Beyond Text: Characterizing Domain Expert Needs in Document Research

Sireesh Gururaja,Nupoor Gandhi,Jeremiah Milbauer,Emma Strubell

Main category: cs.CL

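TL;DR: The study examines the gap between NLP systems for document tasks and what experts actually need, finds that text-centered methods neglect a document's social context, and calls on the NLP community to improve tool design.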

Details

Motivation: Examine whether NLP systems can model document research the way experts conceptualize and perform it, especially regarding a document's social context and individual needs. Method: Interviews sixteen domain experts across two domains, analyzes their document research processes, and compares them with existing NLP systems. Result: Expert processes are idiosyncratic, iterative, and dependent on social context; approaches that treat the document as an object rather than a container for text better reflect these priorities but are less accessible outside their research communities. Conclusion: Calls on the NLP community to build document tools that are accessible, personalizable, iterative, and socially aware. Abstract: Working with documents is a key part of almost any knowledge work, from contextualizing research in a literature review to reviewing legal precedent. Recently, as their capabilities have expanded, primarily text-based NLP systems have often been billed as able to assist or even automate this kind of work. But to what extent are these systems able to model these tasks as experts conceptualize and perform them now? In this study, we interview sixteen domain experts across two domains to understand their processes of document research, and compare them to the current state of NLP systems. We find that our participants' processes are idiosyncratic, iterative, and rely extensively on the social context of a document in addition to its content; existing approaches in NLP and adjacent fields that explicitly center the document as an object, rather than as merely a container for text, tend to better reflect our participants' priorities, though they are often less accessible outside their research communities. We call on the NLP community to more carefully consider the role of the document in building useful tools that are accessible, personalizable, iterative, and socially aware.

[148] BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

Jason Wei,Zhiqing Sun,Spencer Papay,Scott McKinney,Jeffrey Han,Isa Fulford,Hyung Won Chung,Alex Tachard Passos,William Fedus,Amelia Glaese

Main category: cs.CL

TL;DR: BrowseComp is a benchmark for evaluating web browsing ability, comprising 1,266 questions that require persistent searching of the web.

Details

Motivation: Provide a simple yet challenging benchmark that measures an agent's persistence and creativity in browsing the web. Method: Designs 1,266 questions that require searching for hard-to-find information, with predicted answers that are short and easy to verify. Result: BrowseComp sidesteps the complexities of a true user query distribution and focuses on measuring a core capability. Conclusion: BrowseComp is a useful benchmark, analogous to what programming competitions are for coding agents. Abstract: We present BrowseComp, a simple yet challenging benchmark for measuring the ability of agents to browse the web. BrowseComp comprises 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple and easy-to-use, as predicted answers are short and easily verifiable against reference answers. BrowseComp for browsing agents can be seen as analogous to how programming competitions are an incomplete but useful benchmark for coding agents. While BrowseComp sidesteps challenges of a true user query distribution, like generating long answers or resolving ambiguity, it measures the important core capability of exercising persistence and creativity in finding information. BrowseComp can be found at https://github.com/openai/simple-evals.

[149] Evaluating the Diversity and Quality of LLM Generated Content

Alexander Shypula,Shuo Li,Botong Zhang,Vishakh Padmakumar,Kayo Yin,Osbert Bastani

Main category: cs.CL

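TL;DR: The study finds that preference-tuning techniques (such as RLHF methods like PPO and GRPO, and alternatives like DPO) reduce lexical and syntactic diversity but increase effective semantic diversity, because they produce more high-quality outputs. Smaller models are more parameter-efficient at generating unique content.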

Details

Motivation: Preference tuning is reported to reduce diversity, yet such models are widely deployed in applications that need diverse outputs. Method: Introduces a framework for measuring effective semantic diversity and evaluates models on open-ended tasks that require no human intervention. Result: Preference-tuned models (especially RL-trained ones) show greater effective semantic diversity than SFT or base models, and smaller models are more parameter-efficient. Conclusion: Preference tuning reduces syntactic diversity while preserving semantic diversity, and smaller models generate unique content more efficiently. Abstract: Recent work suggests that preference-tuning techniques--including Reinforcement Learning from Human Preferences (RLHF) methods like PPO and GRPO, as well as alternatives like DPO--reduce diversity, creating a dilemma given that such models are widely deployed in applications requiring diverse outputs. To address this, we introduce a framework for measuring effective semantic diversity--diversity among outputs that meet quality thresholds--which better reflects the practical utility of large language models (LLMs). Using open-ended tasks that require no human intervention, we find counterintuitive results: although preference-tuned models--especially those trained via RL--exhibit reduced lexical and syntactic diversity, they produce greater effective semantic diversity than SFT or base models, not from increasing diversity among high-quality outputs, but from generating more high-quality outputs overall. We discover that preference tuning reduces syntactic diversity while preserving semantic diversity--revealing a distinction between diversity in form and diversity in content that traditional metrics often overlook. Our analysis further shows that smaller models are consistently more parameter-efficient at generating unique content within a fixed sampling budget, offering insights into the relationship between model scaling and diversity. These findings have important implications for applications that require diverse yet high-quality outputs, from creative assistance to synthetic data generation.
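
The idea of "diversity among outputs that meet quality thresholds" can be illustrated with a toy calculation. The metric below (distinct semantic labels among passing samples, divided by the sampling budget) is an assumed simplification for illustration, not the paper's exact definition, and the samples are made up.

```python
def effective_semantic_diversity(samples, quality_threshold):
    """Distinct semantic labels among samples that pass the quality bar, per sample drawn.

    samples: list of (semantic_label, quality_score) pairs.
    """
    passing = {label for label, quality in samples if quality >= quality_threshold}
    return len(passing) / len(samples)

# Hypothetical samples: the base model is more varied but mostly low quality.
base_model = [("A", 0.9), ("B", 0.2), ("C", 0.3), ("D", 0.1)]
tuned_model = [("A", 0.9), ("A", 0.8), ("B", 0.9), ("C", 0.7)]

base_raw = len({label for label, _ in base_model})
tuned_raw = len({label for label, _ in tuned_model})
base_eff = effective_semantic_diversity(base_model, 0.5)
tuned_eff = effective_semantic_diversity(tuned_model, 0.5)

print(f"raw distinct meanings: base={base_raw}, tuned={tuned_raw}")
print(f"effective diversity:   base={base_eff}, tuned={tuned_eff}")
```

Here the base model yields four distinct meanings but only one clears the quality bar, while the tuned model yields three distinct meanings that all pass, mirroring the abstract's observation that the gain comes from generating more high-quality outputs overall.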

[150] Memorization vs. Reasoning: Updating LLMs with New Knowledge

Aochong Oliver Li,Tanya Goyal

Main category: cs.CL

TL;DR: The paper proposes the Knowledge Update Playground (KUP) framework and the memory conditioned training (MCT) method to address the challenge of updating the knowledge of large language models.

Details

Motivation: Existing methods mainly target entity substitutions and fail to capture complex real-world dynamics, so a more comprehensive evaluation framework and method for knowledge updates is needed. Method: Introduces KUP, an automatic pipeline that simulates knowledge updates, and MCT, which conditions tokens in the update corpus on self-generated "memory" tokens during training. Result: The KUP benchmark is highly challenging, with the best CPT models scoring below 2% on indirect probing (reasoning); MCT significantly outperforms CPT baselines, improving direct probing (memorization) by up to 25.4%. Conclusion: KUP and MCT provide effective tools for LLM knowledge updating and improve both memorization of and reasoning over new knowledge. Abstract: Large language models (LLMs) encode vast amounts of pre-trained knowledge in their parameters, but updating them as real-world information evolves remains a challenge. Existing methodologies and benchmarks primarily target entity substitutions, failing to capture the full breadth of complex real-world dynamics. In this paper, we introduce Knowledge Update Playground (KUP), an automatic pipeline for simulating realistic knowledge updates reflected in an evidence corpus. KUP's evaluation framework includes direct and indirect probes to test both memorization of updated facts and reasoning over them, for any update learning method. Next, we present a lightweight method called memory conditioned training (MCT), which conditions tokens in the update corpus on self-generated "memory" tokens during training. Our strategy encourages LLMs to surface and reason over newly memorized knowledge at inference. Our results on two strong LLMs show that (1) the KUP benchmark is highly challenging, with the best CPT models achieving $<2\%$ in the indirect probing setting (reasoning) and (2) MCT training significantly outperforms prior continued pre-training (CPT) baselines, improving direct probing (memorization) results by up to $25.4\%$.

[151] Memorization: A Close Look at Books

Iris Ma,Ian Domingo,Alberto Krone-Martins,Pierre Baldi,Cristina V. Lopes

Main category: cs.CL

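TL;DR: The study examines whether entire books can be extracted from LLMs (such as Llama 3 70B) and finds that the "prefix-prompting" technique can reconstruct Alice's Adventures in Wonderland with high similarity, although success varies with book popularity. It also shows that mitigations are undone in the instruction-tuned Llama 3.1 and explains why.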

Details

Motivation: Explore how far book content can be extracted from LLMs, assess the effectiveness of existing mitigation strategies, and study how fine-tuning affects verbatim memorization. Method: Uses the Llama 3 70B family of models and the "prefix-prompting" technique to auto-regressively reconstruct book text from the first 500 tokens, and analyzes extraction rates and their correlation with popularity. Result: Alice's Adventures in Wonderland is extracted with high similarity, and extraction rates for other books correlate with popularity; mitigations are undone in Llama 3.1 through changes to only a small fraction of weights. Conclusion: Current mitigation strategies have limits, fine-tuning significantly affects the retrieval of memorized text, and improved methods need further study. Abstract: To what extent can entire books be extracted from LLMs? Using the Llama 3 70B family of models, and the "prefix-prompting" extraction technique, we were able to auto-regressively reconstruct, with a very high level of similarity, one entire book (Alice's Adventures in Wonderland) from just the first 500 tokens. We were also able to obtain high extraction rates on several other books, piece-wise. However, these successes do not extend uniformly to all books. We show that extraction rates of books correlate with book popularity and thus, likely duplication in the training data. We also confirm the undoing of mitigations in the instruction-tuned Llama 3.1, following recent work (Nasr et al., 2025). We further find that this undoing comes from changes to only a tiny fraction of weights concentrated primarily in the lower transformer blocks. Our results provide evidence of the limits of current regurgitation mitigation strategies and introduce a framework for studying how fine-tuning affects the retrieval of verbatim memorization in aligned LLMs.

[152] ELAB: Extensive LLM Alignment Benchmark in Persian Language

Zahra Pourbahman,Fatemeh Rajabi,Mohammadhossein Sadeghi,Omid Ghahroodi,Somaye Bakhshaei,Arash Amini,Reza Kazemi,Mahdieh Soleymani Baghshah

Main category: cs.CL

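TL;DR: This paper presents an ethical alignment evaluation framework for Persian large language models (LLMs) covering safety, fairness, and social norms, filling the gap left by existing frameworks in the Persian linguistic and cultural context.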

Details

Motivation: Existing LLM evaluation frameworks do not adequately account for the specifics of Persian language and culture, so a culturally adapted evaluation method is needed. Method: Builds a unified evaluation framework for Persian LLMs by translating existing datasets (such as Anthropic Red Teaming data) and creating new ones (such as ProhibiBench-fa and SafeBench-fa). Result: Provides a public leaderboard that evaluates Persian LLMs on safety, fairness, and social norms. Conclusion: The framework offers a culturally grounded method for evaluating the ethical alignment of Persian LLMs and fills a gap in existing research. Abstract: This paper presents a comprehensive evaluation framework for aligning Persian Large Language Models (LLMs) with critical ethical dimensions, including safety, fairness, and social norms. It addresses the gaps in existing LLM evaluation frameworks by adapting them to Persian linguistic and cultural contexts. This benchmark creates three types of Persian-language benchmarks: (i) translated data, (ii) new data generated synthetically, and (iii) new naturally collected data. We translate Anthropic Red Teaming data, AdvBench, HarmBench, and DecodingTrust into Persian. Furthermore, we create ProhibiBench-fa, SafeBench-fa, FairBench-fa, and SocialBench-fa as new datasets to address harmful and prohibited content in indigenous culture. Moreover, we collect an extensive dataset, GuardBench-fa, to consider Persian cultural norms. By combining these datasets, our work establishes a unified framework for evaluating Persian LLMs, offering a new approach to culturally grounded alignment evaluation. A systematic evaluation of Persian LLMs is performed across the three alignment aspects: safety (avoiding harmful content), fairness (mitigating biases), and social norms (adhering to culturally accepted behaviors). We present a publicly available leaderboard that benchmarks Persian LLMs with respect to safety, fairness, and social norms at: https://huggingface.co/spaces/MCILAB/LLM_Alignment_Evaluation.

[153] CDF-RAG: Causal Dynamic Feedback for Adaptive Retrieval-Augmented Generation

Elahe Khatibi,Ziyu Wang,Amir M. Rahmani

Main category: cs.CL

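TL;DR: The CDF-RAG framework improves RAG through dynamic feedback and causal reasoning, increasing the causal consistency and accuracy of generated content.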

Details

Motivation: Existing RAG frameworks rely on semantic-similarity retrieval and struggle to distinguish causal relationships from spurious associations, which can make generated content misleading. Method: CDF-RAG iteratively refines queries, retrieves structured causal graphs, performs multi-hop causal reasoning, and validates responses against causal pathways to ensure logical coherence. Result: Evaluated on four datasets, CDF-RAG outperforms existing RAG methods in response accuracy and causal correctness. Conclusion: CDF-RAG improves the causal consistency and factual accuracy of generated content, and the code is open source. Abstract: Retrieval-Augmented Generation (RAG) has significantly enhanced large language models (LLMs) in knowledge-intensive tasks by incorporating external knowledge retrieval. However, existing RAG frameworks primarily rely on semantic similarity and correlation-driven retrieval, limiting their ability to distinguish true causal relationships from spurious associations. This results in responses that may be factually grounded but fail to establish cause-and-effect mechanisms, leading to incomplete or misleading insights. To address this issue, we introduce Causal Dynamic Feedback for Adaptive Retrieval-Augmented Generation (CDF-RAG), a framework designed to improve causal consistency, factual accuracy, and explainability in generative reasoning. CDF-RAG iteratively refines queries, retrieves structured causal graphs, and enables multi-hop causal reasoning across interconnected knowledge sources. Additionally, it validates responses against causal pathways, ensuring logically coherent and factually grounded outputs. We evaluate CDF-RAG on four diverse datasets, demonstrating its ability to improve response accuracy and causal correctness over existing RAG-based methods. Our code is publicly available at https://github.com/elakhatibi/CDF-RAG.

[154] MetaSynth: Meta-Prompting-Driven Agentic Scaffolds for Diverse Synthetic Data Generation

Haris Riaz,Sourav Bhabesh,Vinayak Arannil,Miguel Ballesteros,Graham Horwood

Main category: cs.CL

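TL;DR: MetaSynth generates diverse synthetic data through meta-prompting and successfully adapts Mistral-7B-v0.3 to the Finance and Biomedicine domains with notable performance gains.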

Details

Motivation: Investigate how synthetic data can adapt LLMs to specific domains, addressing the low diversity of synthetic data. Method: Proposes MetaSynth, in which meta-prompting coordinates multiple expert LLM agents to generate diverse synthetic data. Result: With only 25 million tokens of MetaSynth data, Mistral-7B-v0.3 improves by up to 4.08% in Finance and 13.75% in Biomedicine. Conclusion: The diverse synthetic data produced by MetaSynth can adapt LLMs to specific domains without mixing in real data. Abstract: Recent smaller language models such as Phi-3.5 and Phi-4 rely on synthetic data generated using larger language models. Questions remain about leveraging synthetic data for other use cases, such as adapting LLMs to specific domains. A key limitation of synthetic data is low diversity, which negatively impacts its downstream applicability for improving other models. To address this, we propose MetaSynth, a method for generating synthetic data that enhances diversity through meta-prompting, where a language model orchestrates multiple "expert" LLM agents to collaboratively generate data. Using only 25 million tokens of synthetic data generated with MetaSynth, we successfully adapt a well-trained LLM (Mistral-7B-v0.3) to two specialized domains (Finance and Biomedicine) without compromising the capabilities of the resulting model in general tasks. In addition, we evaluate the diversity of our synthetic data using seven automated metrics, and find that it approaches the diversity of LLM pre-training corpora. Continually pre-training Mistral-7B-v0.3 with MetaSynth notably outperforms the base LLM, showing improvements of up to 4.08% in Finance and 13.75% in Biomedicine. The same model shows degraded performance when trained on data generated using a template prompt, even when the template includes prior generations and varying in-context exemplars of real data. Our findings suggest that a few million tokens of diverse synthetic data, without mixing in any real data, are sufficient for effective domain adaptation when using MetaSynth.

[155] Identifying and Mitigating the Influence of the Prior Distribution in Large Language Models

Liyi Zhang,Veniamin Veselovsky,R. Thomas McCoy,Thomas L. Griffiths

Main category: cs.CL

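TL;DR: The study finds that large language models (LLMs) perform poorly on some deterministic tasks, but that interventions can improve their performance.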

Details

Motivation: LLMs perform poorly on deterministic tasks because of their implicit prior distribution; the study explores interventions that improve performance. Method: Prompts the model not to rely on prior knowledge, and uses mechanistic interpretability techniques to localize and adjust the influence of the prior. Result: The interventions substantially improve task performance, and after fine-tuning the errors are no longer correlated with the prior. Conclusion: The study offers effective ways to manipulate how much LLMs rely on their priors, which may reduce hallucination. Abstract: Large language models (LLMs) sometimes fail to respond appropriately to deterministic tasks -- such as counting or forming acronyms -- because the implicit prior distribution they have learned over sequences of tokens influences their responses. In this work, we show that, in at least some cases, LLMs actually compute the information needed to perform these tasks correctly, and we identify some interventions that can allow them to access this information to improve their performance. First, we show that simply prompting the language model to not rely on its prior knowledge leads to dramatic improvements in prior-dominated tasks. We then use mechanistic interpretability techniques to localize the prior within the LLM and manipulate the extent to which that prior influences its responses. Specifically, we show that it is possible to identify layers of the underlying neural network that correlate with the prior probability of a response and that lightweight finetuning of these layers with basic prompts on prior-dominated tasks achieves high performance on held-out answers. These results suggest that the information required to produce a correct response is contained within the representations of the problems formed by the models. Furthermore, we show that this finetuning is significantly more effective for prior-dominated tasks, and that the error after finetuning is no longer correlated with the prior. Our results suggest that it may be possible to define effective methods for manipulating the extent to which LLMs rely upon their priors in solving problems, potentially increasing their performance in settings where LLMs hallucinate for reasons related to the prior probability of token sequences.

[156] GeoSense: Evaluating Identification and Application of Geometric Principles in Multimodal Reasoning

Liangyu Xu,Yingxiu Zhao,Jingyun Wang,Yingyao Wang,Bu Pi,Chen Wang,Mingliang Zhang,Jihao Gu,Xiang Li,Xiaoyong Zhu,Jun Song,Bo Zheng

Main category: cs.CL

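TL;DR: GeoSense is a bilingual benchmark for evaluating the reasoning of multimodal large language models on geometry problems; it finds that existing models are bottlenecked by the identification and application of geometric principles.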

Details

Motivation: Existing benchmarks do not jointly assess the geometric reasoning of multimodal large language models, in particular the identification and application of geometric principles. Method: Proposes GeoSense, which includes a five-level framework of geometric principles, 1,789 annotated problems, and an innovative evaluation strategy. Result: Gemini-2.0-pro-flash performs best with an overall score of 65.3, but identifying and applying geometric principles remains a bottleneck. Conclusion: GeoSense can help guide future improvements in the geometric reasoning of multimodal large language models. Abstract: Geometry problem-solving (GPS), a challenging task requiring both visual comprehension and symbolic reasoning, effectively measures the reasoning capabilities of multimodal large language models (MLLMs). Humans exhibit strong reasoning ability in this task through accurate identification and adaptive application of geometric principles within visual contexts. However, existing benchmarks fail to jointly assess both dimensions of the human-like geometric reasoning mechanism in MLLMs, remaining a critical gap in assessing their ability to tackle GPS. To this end, we introduce GeoSense, the first comprehensive bilingual benchmark designed to systematically evaluate the geometric reasoning abilities of MLLMs through the lens of geometric principles. GeoSense features a five-level hierarchical framework of geometric principles spanning plane and solid geometry, an intricately annotated dataset of 1,789 problems, and an innovative evaluation strategy. Through extensive experiments on GeoSense with various open-source and closed-source MLLMs, we observe that Gemini-2.0-pro-flash performs best, achieving an overall score of $65.3$. Our in-depth analysis reveals that the identification and application of geometric principles remain a bottleneck for leading MLLMs, jointly hindering their reasoning abilities. These findings underscore GeoSense's potential to guide future advancements in MLLMs' geometric reasoning capabilities, paving the way for more robust and human-like reasoning in artificial intelligence.

[157] Towards Characterizing Subjectivity of Individuals through Modeling Value Conflicts and Trade-offs

Younghun Lee,Dan Goldwasser

Main category: cs.CL

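TL;DR: The paper proposes the SOLAR framework, which uses LLMs to characterize the individual subjectivity and moral judgments of social media users and improves inference by observing value conflicts and trade-offs.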

Details

Motivation: Explore whether LLMs can capture individual-level subjectivity, particularly in the moral judgments of social media users. Method: Proposes the SOLAR framework, which infers individual subjectivity by analyzing value conflicts and trade-offs in user-generated texts. Result: Experiments show that SOLAR improves overall inference results, performs better on controversial situations, and can explain individuals' value preferences. Conclusion: SOLAR captures individual subjectivity effectively and offers a new perspective on users' moral judgments. Abstract: Large Language Models (LLMs) not only have solved complex reasoning problems but also exhibit remarkable performance in tasks that require subjective decision making. Existing studies suggest that LLM generations can be subjectively grounded to some extent, yet exploring whether LLMs can account for individual-level subjectivity has not been sufficiently studied. In this paper, we characterize subjectivity of individuals on social media and infer their moral judgments using LLMs. We propose a framework, SOLAR (Subjective Ground with Value Abstraction), that observes value conflicts and trade-offs in the user-generated texts to better represent subjective ground of individuals. Empirical results show that our framework improves overall inference results as well as performance on controversial situations. Additionally, we qualitatively show that SOLAR provides explanations about individuals' value preferences, which can further account for their judgments.

[158] Scaling Instruction-Tuned LLMs to Million-Token Contexts via Hierarchical Synthetic Data Generation

Linda He,Jue Wang,Maurice Weber,Shang Zhu,Ben Athiwaratkun,Ce Zhang

Main category: cs.CL

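TL;DR: This paper proposes a new post-training synthetic data generation strategy for efficiently extending the context window of large language models (LLMs) while preserving their general task performance.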

Details

Motivation: LLMs struggle with long-context reasoning, mainly because computational complexity scales quadratically with sequence length and because annotated long-context data is scarce and expensive. Open-source work that systematically studies long-context data is lacking, and there is no openly available instruction-tuning dataset with contexts exceeding 100K tokens. Method: A post-training synthetic data generation strategy, combined with a step-by-step rotary position embedding (RoPE) scaling training strategy, that extends to arbitrarily long context lengths. Result: With a context length of up to 1M tokens, the model performs well on the RULER benchmark and InfiniteBench while maintaining robust performance on general language tasks. Conclusion: The approach effectively addresses the scarcity of long-context data and offers a practical route to long-context reasoning in LLMs. Abstract: Large Language Models (LLMs) struggle with long-context reasoning, not only due to the quadratic scaling of computational complexity with sequence length but also because of the scarcity and expense of annotating long-context data. There has been barely any open-source work that systematically ablates long-context data, nor is there any openly available instruction tuning dataset with contexts surpassing 100K tokens. To bridge this gap, we introduce a novel post-training synthetic data generation strategy designed to efficiently extend the context window of LLMs while preserving their general task performance. Our approach scalably extends to arbitrarily long context lengths, unconstrained by the length of available real-world data, which effectively addresses the scarcity of raw long-context data. Through a step-by-step rotary position embedding (RoPE) scaling training strategy, we demonstrate that our model, with a context length of up to 1M tokens, performs well on the RULER benchmark and InfiniteBench and maintains robust performance on general language tasks.
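The abstract does not say which RoPE scaling variant is used. As a rough illustration only, the sketch below shows linear position interpolation, one well-known way to scale RoPE: positions are divided by a scale factor so that a longer context reuses the angle range seen in training. The function name `rope_angles` and the default `base=10000.0` are assumptions for this example, not details from the paper.

```python
def rope_angles(position, dim, base=10000.0, scale=1.0):
    """Rotation angles for one token position under RoPE.

    Each pair of embedding dimensions i rotates by
    (position / scale) * base ** (-2 * i / dim).
    scale > 1 is linear position interpolation (an assumed variant,
    not necessarily the one used in the paper).
    """
    return [(position / scale) * base ** (-2 * i / dim) for i in range(dim // 2)]


# With scale=4, position 8192 receives the same angles that
# position 2048 receives without scaling.
stretched = rope_angles(8192, dim=8, scale=4.0)
original = rope_angles(2048, dim=8)
print(stretched == original)
```

A step-by-step schedule would, under this reading, raise `scale` in stages while training on progressively longer sequences.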

[159] Persona-judge: Personalized Alignment of Large Language Models via Token-level Self-judgment

Xiaotian Zhang,Ruizhe Chen,Yang Feng,Zuozhu Liu

Main category: cs.CL

TL;DR: Proposes Persona-judge, which achieves training-free personalized alignment by using the model's intrinsic preference judgment capability.

Details

Motivation: Existing methods rely on reward signals and additional annotated data, which makes them costly and hard to adapt to diverse human values. Method: Persona-judge uses the model's intrinsic preference judgment capability: a draft model generates candidate tokens, and a judge model embodying another preference cross-validates them. Result: Experiments show that Persona-judge is scalable and computationally efficient for personalized alignment. Conclusion: Persona-judge opens a new path toward adaptive personalized alignment. Abstract: Aligning language models with human preferences presents significant challenges, particularly in achieving personalization without incurring excessive computational costs. Existing methods rely on reward signals and additional annotated data, limiting their scalability and adaptability to diverse human values. To address these challenges, we introduce Persona-judge, a novel discriminative paradigm that enables training-free personalized alignment with unseen preferences. Instead of optimizing policy parameters through external reward feedback, Persona-judge leverages the intrinsic preference judgment capabilities of the model. Specifically, a draft model generates candidate tokens conditioned on a given preference, while a judge model, embodying another preference, cross-validates whether the predicted tokens should be accepted. Experimental results demonstrate that Persona-judge, using the inherent preference evaluation mechanisms of the model, offers a scalable and computationally efficient solution to personalized alignment, paving the way for more adaptive customized alignment.
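The draft-and-judge loop described in the abstract can be pictured with a toy sketch. Everything here is hypothetical: the two "models" are hand-written probability tables, and the acceptance rule (accept the draft's top token when the judge assigns it at least `threshold` probability, otherwise fall back to the judge's top token) is an assumption, since the abstract does not state the actual rule.

```python
def judge_decode(draft_next, judge_next, prompt, steps, threshold=0.2):
    """Toy token-level draft/judge decoding (assumed acceptance rule)."""
    tokens = list(prompt)
    accepted = 0
    for _ in range(steps):
        draft_probs = draft_next(tokens)
        candidate = max(draft_probs, key=draft_probs.get)  # draft's top token
        judge_probs = judge_next(tokens)
        if judge_probs.get(candidate, 0.0) >= threshold:
            tokens.append(candidate)   # judge accepts the draft token
            accepted += 1
        else:
            # judge rejects: use the judge's own top token instead
            tokens.append(max(judge_probs, key=judge_probs.get))
    return tokens, accepted


# Hypothetical next-token tables standing in for two preference-conditioned models.
def draft(tokens):
    return {"brief": 0.6, "formal": 0.4} if len(tokens) % 2 == 0 else {"casual": 0.8, "polite": 0.2}


def judge(tokens):
    return {"brief": 0.5, "formal": 0.5} if len(tokens) % 2 == 0 else {"casual": 0.1, "polite": 0.9}


print(judge_decode(draft, judge, prompt=[], steps=2))
```

In this toy run the first draft token passes the judge and the second is replaced, so no parameters are updated at any point, which is the sense in which such a scheme is training-free.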

[160] ACoRN: Noise-Robust Abstractive Compression in Retrieval-Augmented Language Models

Singon Kim,Gunho Jung,Seong-Whan Lee

Main category: cs.CL

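TL;DR: The paper proposes ACoRN, which makes abstractive compression models more robust to noisy retrieval through fine-grained document categorization and two additional training steps.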

Details

Motivation: Retrieved documents often contain irrelevant or misleading information, and abstractive compression models tend to omit key content, especially in long contexts. Method: Proposes ACoRN, which adds offline data augmentation and a fine-tuning step so that summaries are centered on key information. Result: Experiments show that T5-large trained with ACoRN improves EM and F1 scores while preserving the answer string. Conclusion: ACoRN performs well when many noisy documents are retrieved, making it suitable for real-world use. Abstract: Abstractive compression utilizes smaller language models to condense query-relevant context, reducing computational costs in retrieval-augmented generation (RAG). However, retrieved documents often include information that is either irrelevant to answering the query or misleading due to factually incorrect content, despite having high relevance scores. This behavior indicates that abstractive compressors are more likely to omit important information essential for the correct answer, especially in long contexts where attention dispersion occurs. To address this issue, we categorize retrieved documents in a more fine-grained manner and propose Abstractive Compression Robust against Noise (ACoRN), which introduces two novel training steps. First, we use offline data augmentation on the training dataset to enhance compressor robustness against two distinct types of retrieval noise. Second, since the language model-based compressor cannot fully utilize information from multiple retrieved documents and exhibits positional bias, we perform finetuning to generate summaries centered around key information that directly supports the correct answer. Our experiments demonstrate that T5-large, trained with ACoRN as a compressor, improves EM and F1 scores while preserving the answer string, which could serve as direct evidence. ACoRN excels on datasets with many accuracy-reducing documents, making it highly useful in real-world scenarios.
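For readers unfamiliar with the EM and F1 scores mentioned above, here is a minimal sketch of how exact match and token-level F1 are commonly computed for short-answer QA. The normalization steps (lowercasing, stripping punctuation and English articles) follow a widespread convention and are an assumption here; the paper's exact evaluation script may differ.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    # Multiset intersection counts tokens shared by prediction and gold.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(token_f1("in Paris France", "Paris"))
```

The second call illustrates why F1 is reported alongside EM: a prediction that contains the gold answer plus extra words gets partial credit rather than zero.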

[161] GRAIL: Gradient-Based Adaptive Unlearning for Privacy and Copyright in LLMs

Kun-Woo Kim,Ji-Hoon Park,Ju-Min Han,Seong-Whan Lee

Main category: cs.CL

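TL;DR: GRAIL is a gradient-based multi-domain unlearning framework that precisely removes sensitive information from large language models while preserving critical knowledge.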

Details

Motivation: Large language models may learn sensitive information, raising social and legal concerns, and existing methods cannot handle knowledge that is interwoven across multiple domains. Method: GRAIL uses gradient information from multiple domains to separate the unlearning scope from the retention scope, and applies an adaptive parameter-wise localization strategy to selectively remove targeted knowledge. Result: GRAIL matches existing methods on unlearning success while improving knowledge retention by up to 17%. Conclusion: GRAIL offers a new paradigm for managing sensitive information in large-scale pre-trained language models. Abstract: Large Language Models (LLMs) trained on extensive datasets often learn sensitive information, which raises significant social and legal concerns under principles such as the "Right to be forgotten." Retraining entire models from scratch to remove undesired information is both costly and impractical. Furthermore, existing single-domain unlearning methods fail to address multi-domain scenarios, where knowledge is interwoven across domains such as privacy and copyright, creating overlapping representations that lead to excessive knowledge removal or degraded performance. To tackle these issues, we propose GRAIL (GRadient-based AdaptIve unLearning), a novel multi-domain unlearning framework. GRAIL leverages gradient information from multiple domains to precisely distinguish the unlearning scope from the retention scope, and applies an adaptive parameter-wise localization strategy to selectively remove targeted knowledge while preserving critical parameters for each domain. Experimental results on unlearning benchmarks show that GRAIL achieves unlearning success on par with the existing approaches, while also demonstrating up to 17% stronger knowledge retention success compared to the previous state-of-the-art method. Our findings establish a new paradigm for effectively managing and regulating sensitive information in large-scale pre-trained language models.

[162] Data-efficient LLM Fine-tuning for Code Generation

Weijie Lv,Xuan Xia,Sheng-Jun Huang

Main category: cs.CL

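TL;DR: Proposes a data selection strategy and a dynamic packing technique that improve both the training efficiency and the performance of code-generation LLMs.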

Details

Motivation: Open-source code-generation models lag behind closed-source ones, and existing approaches that fine-tune on large amounts of synthetic data are inefficient. Method: A data selection strategy that prioritizes complex data while keeping the sampled subset consistent with the original distribution, combined with a dynamic packing technique that optimizes tokenization. Result: Training on 40% of the data improves performance while reducing training time and peak GPU memory. Conclusion: Optimizing data selection and tokenization improves both model performance and training efficiency. Abstract: Large language models (LLMs) have demonstrated significant potential in code generation tasks. However, there remains a performance gap between open-source and closed-source models. To address this gap, existing approaches typically generate large amounts of synthetic data for fine-tuning, which often leads to inefficient training. In this work, we propose a data selection strategy in order to improve the effectiveness and efficiency of training for code-based LLMs. By prioritizing data complexity and ensuring that the sampled subset aligns with the distribution of the original dataset, our sampling strategy effectively selects high-quality data. Additionally, we optimize the tokenization process through a "dynamic pack" technique, which minimizes padding tokens and reduces computational resource consumption. Experimental results show that when training on 40% of the OSS-Instruct dataset, the DeepSeek-Coder-Base-6.7B model achieves an average performance of 66.9%, surpassing the 66.1% performance with the full dataset. Moreover, training time is reduced from 47 minutes to 34 minutes, and the peak GPU memory decreases from 61.47 GB to 42.72 GB during a single epoch. Similar improvements are observed with the CodeLlama-Python-7B model on the Evol-Instruct dataset. By optimizing both data selection and tokenization, our approach not only improves model performance but also increases training efficiency.
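To make the idea of minimizing padding concrete, the sketch below packs tokenized samples into fixed-length sequences with a first-fit-decreasing heuristic. This is a generic bin-packing illustration under assumed names (`pack_first_fit`, `padding_tokens`), not the paper's "dynamic pack" algorithm, whose details the abstract does not give.

```python
def pack_first_fit(lengths, max_len):
    """Group sample indices so each group's total token count is <= max_len.

    Uses first-fit decreasing: longest samples first, each placed in the
    first group that still has room.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    groups, free = [], []          # free[b] = remaining room in groups[b]
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sample {i} has {n} tokens, more than max_len={max_len}")
        for b, room in enumerate(free):
            if n <= room:
                groups[b].append(i)
                free[b] -= n
                break
        else:                      # no existing group fits: open a new one
            groups.append([i])
            free.append(max_len - n)
    return groups


def padding_tokens(groups, lengths, max_len):
    """Total padding needed to fill every packed sequence to max_len."""
    return sum(max_len - sum(lengths[i] for i in g) for g in groups)


lengths = [5, 3, 4, 2, 6]          # token counts of five hypothetical samples
groups = pack_first_fit(lengths, max_len=8)
print(groups, padding_tokens(groups, lengths, max_len=8))
```

Padding each of these five samples to 8 tokens on its own would cost 20 padding tokens; packing them into three sequences costs 4.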

[163] Why and How LLMs Hallucinate: Connecting the Dots with Subsequence Associations

Yiyou Sun,Yu Gai,Lijie Chen,Abhilasha Ravichander,Yejin Choi,Dawn Song

Main category: cs.CL

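TL;DR: The paper proposes a subsequence association framework for systematically tracing and understanding hallucinations in large language models (LLMs), explains their causes through theoretical and empirical analysis, and introduces a tracing algorithm that outperforms standard attribution techniques.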

Details

Motivation: Large language models (LLMs) frequently produce hallucinated content that deviates from facts or context, and the causes are complex and hard to diagnose. The paper aims to systematically understand and trace these hallucinations. Method: Proposes a subsequence association framework, analyzes decoder-only transformers as subsequence embedding models, and designs a tracing algorithm that identifies causal subsequences by analyzing hallucination probabilities across randomized input contexts. Result: Experiments show the method outperforms standard attribution techniques in identifying hallucination causes and agrees with evidence from the model's training corpus. Conclusion: The paper provides a unified perspective on hallucinations and a robust framework for tracing and analyzing them. Abstract: Large language models (LLMs) frequently generate hallucinations (content that deviates from factual accuracy or provided context), posing challenges for diagnosis due to the complex interplay of underlying causes. This paper introduces a subsequence association framework to systematically trace and understand hallucinations. Our key insight is that hallucinations arise when dominant hallucinatory associations outweigh faithful ones. Through theoretical and empirical analyses, we demonstrate that decoder-only transformers effectively function as subsequence embedding models, with linear layers encoding input-output associations. We propose a tracing algorithm that identifies causal subsequences by analyzing hallucination probabilities across randomized input contexts. Experiments show our method outperforms standard attribution techniques in identifying hallucination causes and aligns with evidence from the model's training corpus. This work provides a unified perspective on hallucinations and a robust framework for their tracing and analysis.
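The idea of comparing hallucination probabilities across randomized contexts can be illustrated with a very small sketch. It is not the authors' algorithm: it scores single tokens rather than subsequences, uses random token dropping as the randomization, and replaces the model with a hypothetical `hallucinates` callback. It assumes the input tokens are distinct.

```python
import random


def trace_tokens(tokens, hallucinates, trials=200, keep_prob=0.5, seed=0):
    """Score each token by P(hallucination | token kept) - P(hallucination | token dropped)."""
    rng = random.Random(seed)
    kept = {t: [0, 0] for t in tokens}      # [hallucination count, trial count]
    dropped = {t: [0, 0] for t in tokens}
    for _ in range(trials):
        context = [t for t in tokens if rng.random() < keep_prob]
        outcome = 1 if hallucinates(context) else 0
        for t in tokens:
            bucket = kept if t in context else dropped
            bucket[t][0] += outcome
            bucket[t][1] += 1

    def rate(pair):
        return pair[0] / pair[1] if pair[1] else 0.0

    return {t: rate(kept[t]) - rate(dropped[t]) for t in tokens}


# Hypothetical stand-in for a model: it "hallucinates" whenever the word
# "Sydney" survives in the randomized context.
tokens = ["Sydney", "is", "in", "Australia"]
scores = trace_tokens(tokens, lambda ctx: "Sydney" in ctx)
print(max(scores, key=scores.get))
```

Tokens whose presence reliably raises the hallucination rate receive scores near 1, while unrelated tokens stay near 0.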

[164] KODIS: A Multicultural Dispute Resolution Dialogue Corpus

James Hale,Sushrita Rakshit,Kushal Chawla,Jeanne M. Brett,Jonathan Gratch

Main category: cs.CL

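TL;DR: KODIS is a dyadic dispute-resolution corpus containing thousands of dialogues from over 75 countries, built to study theories of culture and conflict.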

Details

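Motivation: Grounded in a theoretical model of culture and conflict, the work studies how emotional expression affects dispute escalation and how this differs across cultures. Method: Participants take part in a typical customer service dispute designed by experts, and rich dispositional, process, and outcome measures are collected. Result: Initial analysis supports theories that anger expressions lead to escalation and highlights cultural differences in emotional expression. Conclusion: The KODIS corpus and its data collection framework are an important resource for the research community. Abstract: We present KODIS, a dyadic dispute resolution corpus containing thousands of dialogues from over 75 countries. Motivated by a theoretical model of culture and conflict, participants engage in a typical customer service dispute designed by experts to evoke strong emotions and conflict. The corpus contains a rich set of dispositional, process, and outcome measures. The initial analysis supports theories of how anger expressions lead to escalatory spirals and highlights cultural differences in emotional expression. We make this corpus and data collection framework available to the community.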

[165] Pandora: A Code-Driven Large Language Model Agent for Unified Reasoning Across Diverse Structured Knowledge

Yongrui Chen,Junhao He,Linbo Fu,Shenyu Zhang,Rihui Jin,Xinbang Dai,Jiaqi Li,Dehai Min,Nan Hu,Yuxin Zhang,Guilin Qi,Yi Huang,Tongtong Wu

Main category: cs.CL

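TL;DR: The paper proposes Pandora, a unified structured knowledge reasoning framework that builds a unified knowledge representation on Python's Pandas API and uses an LLM to generate reasoning steps and executable code, yielding significant performance gains.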

Details

Motivation: Existing methods struggle to transfer knowledge across structured knowledge reasoning tasks or to align with LLM pre-training, which limits their performance. Method: Builds a unified knowledge representation with Python's Pandas API, uses an LLM to generate reasoning steps and executable code, and draws demonstrations from a memory of training examples to facilitate knowledge transfer. Result: On four benchmarks, Pandora outperforms existing unified frameworks and competes effectively with task-specific methods. Conclusion: By combining a unified representation with LLMs, Pandora improves both the performance and the generality of structured knowledge reasoning. Abstract: Unified Structured Knowledge Reasoning (USKR) aims to answer natural language questions (NLQs) by using structured sources such as tables, databases, and knowledge graphs in a unified way. Existing USKR methods either rely on employing task-specific strategies or custom-defined representations, which struggle to leverage the knowledge transfer between different SKR tasks or align with the prior of LLMs, thereby limiting their performance. This paper proposes a novel USKR framework named Pandora, which takes advantage of Python's Pandas API to construct a unified knowledge representation for alignment with LLM pre-training. It employs an LLM to generate textual reasoning steps and executable Python code for each question. Demonstrations are drawn from a memory of training examples that cover various SKR tasks, facilitating knowledge transfer. Extensive experiments on four benchmarks involving three SKR tasks demonstrate that Pandora outperforms existing unified frameworks and competes effectively with task-specific methods.

[166] Chinese-Vicuna: A Chinese Instruction-following Llama-based Model

Chenghao Fan,Zhenyi Lu,Jie Tian

Main category: cs.CL

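TL;DR: Chinese-Vicuna is an open-source, resource-efficient Chinese language model that fine-tunes the LLaMA architecture with LoRA to close the gap in Chinese instruction-following capability.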

Details

Motivation: To address the lack of Chinese instruction-following capability and to enable efficient deployment in low-resource environments. Method: Fine-tunes the LLaMA architecture with LoRA, combining hybrid datasets (BELLE and Guanaco) with 4-bit quantization (QLoRA). Result: Competitive performance on tasks such as translation, code generation, and domain-specific Q&A, with support for applications in fields like healthcare and law. Conclusion: Through its modular design and open-source ecosystem, Chinese-Vicuna provides a versatile foundation for Chinese LLM applications. Abstract: Chinese-Vicuna is an open-source, resource-efficient language model designed to bridge the gap in Chinese instruction-following capabilities by fine-tuning Meta's LLaMA architecture using Low-Rank Adaptation (LoRA). Targeting low-resource environments, it enables cost-effective deployment on consumer GPUs (e.g., RTX-2080Ti for 7B models) and supports domain-specific adaptation in fields like healthcare and law. By integrating hybrid datasets (BELLE and Guanaco) and 4-bit quantization (QLoRA), the model achieves competitive performance in tasks such as translation, code generation, and domain-specific Q&A. The project provides a comprehensive toolkit for model conversion, CPU inference, and multi-turn dialogue interfaces, emphasizing accessibility for researchers and developers. Evaluations indicate competitive performance across medical tasks, multi-turn dialogue coherence, and real-time legal updates. Chinese-Vicuna's modular design, open-source ecosystem, and community-driven enhancements position it as a versatile foundation for Chinese LLM applications.

[167] Out of Sight Out of Mind, Out of Sight Out of Mind: Measuring Bias in Language Models Against Overlooked Marginalized Groups in Regional Contexts

Fatma Elsafoury,David Hartmann

Main category: cs.CL

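TL;DR: This paper studies language-model bias against overlooked marginalized groups and low-resource languages worldwide, presenting the first analysis of bias in 23 language models across 270 marginalized groups.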

Details

Motivation: Existing research focuses mainly on bias in the English-speaking world and overlooks marginalized groups and low-resource languages elsewhere, so unfairness in language models remains insufficiently addressed. Method: Examines bias in 23 language models across 270 marginalized groups and compares how a low-resource dialect (Egyptian Arabic) and Modern Standard Arabic affect bias measurement. Result: LMs show higher bias against many marginalized groups than against dominant groups, except for Arabic LMs, where bias related to religion and ethnicity is high against both. Intersectional bias is also higher against non-binary, LGBTQIA+, and Black women. Conclusion: Developing more inclusive language models requires broadening research to cover more overlooked marginalized groups and low-resource languages. Abstract: We know that language models (LMs) form biases and stereotypes of minorities, leading to unfair treatments of members of these groups, thanks to research mainly in the US and the broader English-speaking world. As the negative behavior of these models has severe consequences for society and individuals, industry and academia are actively developing methods to reduce the bias in LMs. However, there are many under-represented groups and languages that have been overlooked so far. This includes marginalized groups that are specific to individual countries and regions in the English speaking and Western world, but crucially also almost all marginalized groups in the rest of the world. The UN estimates that between 600 million and 1.2 billion people worldwide are members of marginalized groups and in need of special protection. If we want to develop inclusive LMs that work for everyone, we have to broaden our understanding to include overlooked marginalized groups and low-resource languages and dialects. In this work, we contribute to this effort with the first study investigating offensive stereotyping bias in 23 LMs for 270 marginalized groups from Egypt, the remaining 21 Arab countries, Germany, the UK, and the US. Additionally, we investigate the impact of low-resource languages and dialects on the study of bias in LMs, demonstrating the limitations of current bias metrics, as we measure significantly higher bias when using the Egyptian Arabic dialect versus Modern Standard Arabic. Our results show that LMs indeed exhibit higher bias against many marginalized groups in comparison to dominant groups. However, this is not the case for Arabic LMs, where the bias is high against both marginalized and dominant groups in relation to religion and ethnicity. Our results also show higher intersectional bias against Non-binary, LGBTQIA+ and Black women.

[168] Enhancing the Geometric Problem-Solving Ability of Multimodal LLMs via Symbolic-Neural Integration

Yicheng Pan,Zhenrong Zhang,Pengfei Hu,Jiefeng Ma,Jun Du,Jianshu Zhang,Quan Liu,Jianqing Gao,Feng Ma

Main category: cs.CL

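TL;DR: GeoGen is a pipeline that automatically generates step-by-step solution paths for geometry problems, using symbolic reasoning to produce high-quality data; the GeoLogic model trained on this data strengthens the logical reasoning of MLLMs and reduces hallucinations.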

Details

Motivation: To address the lack of step-by-step solution data and the reasoning hallucinations of MLLMs in geometry problem solving. Method: Proposes the GeoGen pipeline to generate high-quality data and trains the GeoLogic model, which works with a symbolic system to verify reasoning. Result: Experiments show the approach significantly improves the performance of MLLMs on geometric reasoning tasks. Conclusion: Combining the strengths of LLMs and symbolic systems yields a more reliable and interpretable approach to geometry problem solving. Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have achieved remarkable progress in general domains and demonstrated promise in multimodal mathematical reasoning. However, applying MLLMs to geometry problem solving (GPS) remains challenging due to the lack of accurate step-by-step solution data and severe hallucinations during reasoning. In this paper, we propose GeoGen, a pipeline that can automatically generate step-wise reasoning paths for geometry diagrams. By leveraging precise symbolic reasoning, GeoGen produces large-scale, high-quality question-answer pairs. To further enhance the logical reasoning ability of MLLMs, we train GeoLogic, a Large Language Model (LLM) using synthetic data generated by GeoGen. Serving as a bridge between natural language and symbolic systems, GeoLogic enables symbolic tools to help verify MLLM outputs, making the reasoning process more rigorous and alleviating hallucinations. Experimental results show that our approach consistently improves the performance of MLLMs, achieving remarkable results on benchmarks for geometric reasoning tasks. This improvement stems from our integration of the strengths of LLMs and symbolic systems, which enables a more reliable and interpretable approach for the GPS task. Codes are available at https://github.com/ycpNotFound/GeoGen.

[169] Assesing LLMs in Art Contexts: Critique Generation and Theory of Mind Evaluation

Takaya Arita,Wenxian Zheng,Reiji Suzuki,Fuminori Akiba

Main category: cs.CL

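TL;DR: The study examines how large language models (LLMs) perform on art critique and Theory of Mind (ToM) tasks, finding that with carefully designed prompts LLMs can generate near-human critiques and show some reasoning ability in complex situations.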

Details

Motivation: To explore how LLMs perform on art-related tasks, especially critique generation and mental-state reasoning, in order to assess their cognitive potential and limitations. Method: Builds a system grounded in art criticism theory to generate critiques and evaluates them in a Turing test-style setup; designs new ToM tasks and tests 41 LLMs. Result: LLM-generated critiques are close to human ones in style and interpretation; ToM performance varies with the model and task complexity. Conclusion: Under specific conditions LLMs can exhibit behavior that resembles understanding, but their abilities remain bounded by task design and prompting. Abstract: This study explored how large language models (LLMs) perform in two areas related to art: writing critiques of artworks and reasoning about mental states (Theory of Mind, or ToM) in art-related situations. For the critique generation part, we built a system that combines Noel Carroll's evaluative framework with a broad selection of art criticism theories. The model was prompted to first write a full-length critique and then shorter, more coherent versions using a step-by-step prompting process. These AI-generated critiques were then compared with those written by human experts in a Turing test-style evaluation. In many cases, human subjects had difficulty telling which was which, and the results suggest that LLMs can produce critiques that are not only plausible in style but also rich in interpretation, as long as they are carefully guided. In the second part, we introduced new simple ToM tasks based on situations involving interpretation, emotion, and moral tension, which can appear in the context of art. These go beyond standard false-belief tests and allow for more complex, socially embedded forms of reasoning. We tested 41 recent LLMs and found that their performance varied across tasks and models. In particular, tasks that involved affective or ambiguous situations tended to reveal clearer differences. Taken together, these results help clarify how LLMs respond to complex interpretative challenges, revealing both their cognitive limitations and potential. While our findings do not directly contradict the so-called Generative AI Paradox (the idea that LLMs can produce expert-like output without genuine understanding), they suggest that, depending on how LLMs are instructed, such as through carefully designed prompts, these models may begin to show behaviors that resemble understanding more closely than we might assume.

[170] SMARTe: Slot-based Method for Accountable Relational Triple extraction

Xue Wen Tan,Stanley Kok

Main category: cs.CL

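TL;DR: SMARTe is a slot attention-based method for relational triple extraction that emphasizes interpretability while achieving performance comparable to state-of-the-art models.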

Details

Motivation: Existing methods focus on optimizing performance with little understanding of the models' internal mechanisms, and they rely on complex preprocessing, which makes systems opaque. Method: SMARTe uses a slot attention mechanism and a set prediction formulation to consolidate information into distinct slots so that predictions are traceable. Result: On the NYT and WebNLG datasets, SMARTe performs comparably to the best models and demonstrates its interpretability through attention heatmaps. Conclusion: SMARTe improves interpretability while maintaining performance and points to directions for future research. Abstract: Relational Triple Extraction (RTE) is a fundamental task in Natural Language Processing (NLP). However, prior research has primarily focused on optimizing model performance, with limited efforts to understand the internal mechanisms driving these models. Many existing methods rely on complex preprocessing to induce specific interactions, often resulting in opaque systems that may not fully align with their theoretical foundations. To address these limitations, we propose SMARTe: a Slot-based Method for Accountable Relational Triple extraction. SMARTe introduces intrinsic interpretability through a slot attention mechanism and frames the task as a set prediction problem. Slot attention consolidates relevant information into distinct slots, ensuring all predictions can be explicitly traced to learned slot representations and the tokens contributing to each predicted relational triple. While emphasizing interpretability, SMARTe achieves performance comparable to state-of-the-art models. Evaluations on the NYT and WebNLG datasets demonstrate that adding interpretability does not compromise performance. Furthermore, we conducted qualitative assessments to showcase the explanations provided by SMARTe, using attention heatmaps that map to their respective tokens. We conclude with a discussion of our findings and propose directions for future research.

[171] Can LLMs reason over extended multilingual contexts? Towards long-context evaluation beyond retrieval and haystacks

Amey Hengle,Prasoon Bajpai,Soham Dan,Tanmoy Chakraborty

Main category: cs.CL

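TL;DR: MLRBench is a new multilingual long-context reasoning benchmark that goes beyond retrieval-centric evaluation to assess multi-hop inference, aggregation, and epistemic reasoning.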

Details

Motivation: Existing benchmarks focus too heavily on retrieval and neglect reasoning, and they are prone to data leakage and short-circuiting. Method: Develops MLRBench, a set of multilingual tasks that evaluate multi-hop inference and aggregation, designed to be leakage-resistant and scalable. Result: Experiments reveal a pronounced gap between high- and low-resource languages; in multilingual settings LLMs effectively use less than 30% of their claimed context length. Conclusion: MLRBench points to improvements in the evaluation and training of multilingual LLMs and is open-sourced to support future research. Abstract: Existing multilingual long-context benchmarks, often based on the popular needle-in-a-haystack test, primarily evaluate a model's ability to locate specific information buried within irrelevant texts. However, such a retrieval-centric approach is myopic and inherently limited, as successful recall alone does not indicate a model's capacity to reason over extended contexts. Moreover, these benchmarks are susceptible to data leakage, short-circuiting, and risk making the evaluation a priori identifiable. To address these limitations, we introduce MLRBench, a new synthetic benchmark for multilingual long-context reasoning. Unlike existing benchmarks, MLRBench goes beyond surface-level retrieval by including tasks that assess multi-hop inference, aggregation, and epistemic reasoning. Spanning seven languages, MLRBench is designed to be parallel, resistant to leakage, and scalable to arbitrary context lengths. Our extensive experiments with an open-weight large language model (LLM) reveal a pronounced gap between high- and low-resource languages, particularly for tasks requiring the model to aggregate multiple facts or predict the absence of information. We also find that, in multilingual settings, LLMs effectively utilize less than 30% of their claimed context length. Although off-the-shelf Retrieval Augmented Generation helps alleviate this to a certain extent, it does not solve the long-context problem. We open-source MLRBench to enable future research in improved evaluation and training of multilingual LLMs.

[172] ViClaim: A Multilingual Multilabel Dataset for Automatic Claim Detection in Videos

Patrick Giedemann,Pius von Däniken,Jan Deriu,Alvaro Rodrigo,Anselmo Peñas,Mark Cieliebak

Main category: cs.CL

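TL;DR: ViClaim is a multilingual, multi-topic dataset of video transcripts for claim detection in videos; experiments show strong model performance but limited generalization.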

Details

Motivation: Video content plays a growing role in communication and misinformation, yet existing research focuses mainly on written text and overlooks the complexity of video transcripts. Method: Builds the ViClaim dataset of 1,798 annotated video transcripts covering three languages and six topics, develops a custom annotation tool, and runs experiments with multilingual language models. Result: Models perform well in cross-validation (macro F1 up to 0.896) but generalize poorly to unseen topics. Conclusion: ViClaim provides a foundation for misinformation detection in video and reveals the complexity of claim detection in video transcripts. Abstract: The growing influence of video content as a medium for communication and misinformation underscores the urgent need for effective tools to analyze claims in multilingual and multi-topic settings. Existing efforts in misinformation detection largely focus on written text, leaving a significant gap in addressing the complexity of spoken text in video transcripts. We introduce ViClaim, a dataset of 1,798 annotated video transcripts across three languages (English, German, Spanish) and six topics. Each sentence in the transcripts is labeled with three claim-related categories: fact-check-worthy, fact-non-check-worthy, or opinion. We developed a custom annotation tool to facilitate the highly complex annotation process. Experiments with state-of-the-art multilingual language models demonstrate strong performance in cross-validation (macro F1 up to 0.896) but reveal challenges in generalization to unseen topics, particularly for distinct domains. Our findings highlight the complexity of claim detection in video transcripts. ViClaim offers a robust foundation for advancing misinformation detection in video-based communication, addressing a critical gap in multimodal analysis.

[173] Are AI agents the new machine translation frontier? Challenges and opportunities of single- and multi-agent systems for multilingual digital communication

Vicent Briva-Iglesias

Main category: cs.CL

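TL;DR: This paper discusses the potential of single-agent and multi-agent systems for machine translation (MT), suggesting that multi-agent systems may outperform traditional approaches in complex scenarios.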

Details

Motivation: The rapid development of artificial intelligence (AI) has brought a new paradigm to machine translation, but its application remains underexplored. The paper analyzes the potential of single- and multi-agent systems in MT for improving multilingual digital communication. Method: A pilot study in legal MT using a multi-agent system with four specialized AI agents: translation, adequacy review, fluency review, and final editing. Result: The findings suggest that multi-agent systems may improve domain adaptability and contextual awareness, with translation quality superior to traditional MT or single-agent systems. Conclusion: Multi-agent systems show considerable promise for MT and lay the groundwork for future research and integration into professional translation workflows. Abstract: The rapid evolution of artificial intelligence (AI) has introduced AI agents as a disruptive paradigm across various industries, yet their application in machine translation (MT) remains underexplored. This paper describes and analyses the potential of single- and multi-agent systems for MT, reflecting on how they could enhance multilingual digital communication. While single-agent systems are well-suited for simpler translation tasks, multi-agent systems, which involve multiple specialized AI agents collaborating in a structured manner, may offer a promising solution for complex scenarios requiring high accuracy, domain-specific knowledge, and contextual awareness. To demonstrate the feasibility of multi-agent workflows in MT, we are conducting a pilot study in legal MT. The study employs a multi-agent system involving four specialized AI agents for (i) translation, (ii) adequacy review, (iii) fluency review, and (iv) final editing. Our findings suggest that multi-agent systems may have the potential to significantly improve domain-adaptability and contextual awareness, with superior translation quality to traditional MT or single-agent systems. This paper also sets the stage for future research into multi-agent applications in MT, integration into professional translation workflows, and shares a demo of the system analyzed in the paper.

[174] Information Gain-Guided Causal Intervention for Autonomous Debiasing Large Language Models

Zhouhao Sun,Xiao Ding,Li Du,Yunpeng Xu,Yixuan Ma,Yang Zhao,Bing Qin,Ting Liu

Main category: cs.CL

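TL;DR: The paper proposes an information gain-guided causal intervention debiasing framework (IGCIDB) that combines causal mechanisms with information theory to automatically balance the distribution of instruction-tuning datasets, improving the generalizability of large language models.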

Details

Motivation: Current large language models (LLMs) may still exploit dataset biases during inference, which hurts generalization. Existing debiasing methods, whether based on prior knowledge or on in-context learning, have limited effectiveness. Method: Proposes the IGCIDB framework: (1) an information gain-guided causal intervention method automatically balances the dataset distribution; (2) standard supervised fine-tuning is run on the debiased dataset. Result: Experiments show that IGCIDB debiases effectively and improves the generalizability of LLMs across tasks. Conclusion: By combining causal intervention with information theory, the IGCIDB framework addresses dataset bias and improves model generalization. Abstract: Despite significant progress, recent studies indicate that current large language models (LLMs) may still capture dataset biases and utilize them during inference, leading to the poor generalizability of LLMs. However, due to the diversity of dataset biases and the insufficient nature of bias suppression based on in-context learning, the effectiveness of previous prior knowledge-based debiasing methods and in-context learning based automatic debiasing methods is limited. To address these challenges, we explore the combination of causal mechanisms with information theory and propose an information gain-guided causal intervention debiasing (IGCIDB) framework. This framework first utilizes an information gain-guided causal intervention method to automatically and autonomously balance the distribution of the instruction-tuning dataset. Subsequently, it employs a standard supervised fine-tuning process to train LLMs on the debiased dataset. Experimental results show that IGCIDB can effectively debias LLMs to improve their generalizability across different tasks.
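As background for the term "information gain", the sketch below computes how much a surface feature tells us about the label, IG(Y; B) = H(Y) - H(Y | B). A feature with high gain is a shortcut a model could exploit; a balanced dataset drives the gain toward zero. This only illustrates the quantity itself under made-up toy data, not the IGCIDB intervention procedure.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(features, labels):
    """H(labels) minus the feature-conditioned entropy H(labels | feature)."""
    n = len(labels)
    groups = {}
    for f, y in zip(features, labels):
        groups.setdefault(f, []).append(y)
    conditional = sum(len(ys) / n * entropy(ys) for ys in groups.values())
    return entropy(labels) - conditional


labels = ["entail", "entail", "contradict", "contradict"]
has_negation_biased = [0, 0, 1, 1]    # toy feature that perfectly predicts the label
has_negation_balanced = [0, 1, 0, 1]  # same feature after balancing the data
print(information_gain(has_negation_biased, labels))
print(information_gain(has_negation_balanced, labels))
```

In the biased toy set the feature carries one full bit about the label, while in the balanced set it carries none.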

[175] Benchmarking Multi-National Value Alignment for Large Language Models

Chengyi Ju,Weijie Shi,Chengzhong Liu,Jiaming Ji,Jipeng Zhang,Ruiyuan Zhang,Jia Zhu,Jiajie Xu,Yaodong Yang,Sirui Han,Yike Guo

Main category: cs.CL

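TL;DR: NaVAB is a new benchmark for evaluating how well large language models (LLMs) align with the values of five major nations (China, the United States, the United Kingdom, France, and Germany); its datasets are built through an automated pipeline, and it can be combined with alignment techniques to reduce value conflicts.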

Details

Motivation: Existing work focuses mainly on ethical reviews and ignores the diversity of national values, and existing benchmarks rely on manually designed questionnaires that do not scale. Method: Proposes the NaVAB benchmark, which includes a national value extraction pipeline, modeling with instruction tagging, screening of value-related topics, and dataset generation with a Conflict Reduction mechanism. Result: Experiments show that NaVAB helps identify misalignment between LLMs and national values, and that conflicts can be reduced by combining it with alignment techniques. Conclusion: NaVAB provides a scalable way to evaluate and improve the alignment of LLMs with national values and shows potential for practical use. Abstract: Do Large Language Models (LLMs) hold positions that conflict with your country's values? Occasionally they do! However, existing works primarily focus on ethical reviews, failing to capture the diversity of national values, which encompass broader policy, legal, and moral considerations. Furthermore, current benchmarks that rely on spectrum tests using manually designed questionnaires are not easily scalable. To address these limitations, we introduce NaVAB, a comprehensive benchmark to evaluate the alignment of LLMs with the values of five major nations: China, the United States, the United Kingdom, France, and Germany. NaVAB implements a national value extraction pipeline to efficiently construct value assessment datasets. Specifically, we propose a modeling procedure with instruction tagging to process raw data sources, a screening process to filter value-related topics and a generation process with a Conflict Reduction mechanism to filter non-conflicting values. We conduct extensive experiments on various LLMs across countries, and the results provide insights into assisting in the identification of misaligned scenarios. Moreover, we demonstrate that NaVAB can be combined with alignment techniques to effectively reduce value concerns by aligning LLMs' values with the target country.

[176] MAIN: Mutual Alignment Is Necessary for instruction tuning

Fanyi Yang,Jianfeng Liu,Xin Zhang,Haoyu Liu,Xixin Cao,Yuefeng Zhan,Hao Sun,Weiwei Deng,Feng Sun,Qi Zhang

Main category: cs.CL

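TL;DR: The paper proposes MAIN, a mutual alignment framework that improves instruction tuning of LLMs by ensuring that instructions and responses are aligned with each other.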

Details

Motivation: Current instruction-tuning methods overlook instruction-response alignment when scaling up data generation, even though the quality of an instruction-response pair depends on how well the two are aligned. Method: Proposes the Mutual Alignment Framework (MAIN), which enforces coherence between instruction and response through mutual constraints. Result: Experiments show that LLaMA and Mistral models fine-tuned with MAIN outperform traditional methods on multiple benchmarks. Conclusion: Instruction-response alignment is critical to scalable, high-quality instruction tuning of LLMs. Abstract: Instruction tuning has enabled large language models (LLMs) to achieve remarkable performance, but its success heavily depends on the availability of large-scale, high-quality instruction-response pairs. However, current methods for scaling up data generation often overlook a crucial aspect: the alignment between instructions and responses. We hypothesize that high-quality instruction-response pairs are not defined by the individual quality of each component, but by the extent of their alignment with each other. To address this, we propose a Mutual Alignment Framework (MAIN) that ensures coherence between the instruction and response through mutual constraints. Experiments demonstrate that models such as LLaMA and Mistral, fine-tuned within this framework, outperform traditional methods across multiple benchmarks. This approach underscores the critical role of instruction-response alignment in enabling scalable and high-quality instruction tuning for LLMs.

[177] ConExion: Concept Extraction with Large Language Models

Ebrahim Norouzi,Sven Hertling,Harald Sack

Main category: cs.CL

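TL;DR: This paper proposes a method for extracting concepts from documents with pre-trained large language models (LLMs). Unlike conventional methods, it extracts all concepts related to a specific domain rather than only the important keyphrases. Experiments show that it improves on existing techniques in F1 score.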

Details

Motivation: Conventional methods extract only the important keyphrases in a document, whereas this work aims to extract all concepts relevant to a specific domain in order to support ontology evaluation and learning. Method: Uses pre-trained LLMs for concept extraction and explores the potential of unsupervised prompting. Result: On two widely used benchmark datasets, the method improves on existing techniques in F1 score. Conclusion: LLMs perform well at concept extraction; the extracted concepts can support ontology evaluation and learning, and the code and datasets are publicly available. Abstract: In this paper, an approach for concept extraction from documents using pre-trained large language models (LLMs) is presented. Compared with conventional methods that extract keyphrases summarizing the important information discussed in a document, our approach tackles a more challenging task of extracting all present concepts related to the specific domain, not just the important ones. Through comprehensive evaluations on two widely used benchmark datasets, we demonstrate that our method improves the F1 score compared to state-of-the-art techniques. Additionally, we explore the potential of using prompts within these models for unsupervised concept extraction. The extracted concepts are intended to support domain coverage evaluation of ontologies and facilitate ontology learning, highlighting the effectiveness of LLMs in concept extraction tasks. Our source code and datasets are publicly available at https://github.com/ISE-FIZKarlsruhe/concept_extraction.

[178] Are Retrials All You Need? Enhancing Large Language Model Reasoning Without Verbalized Feedback

Nearchos Potamitis,Akhil Arora

Main category: cs.CL

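TL;DR: The paper proposes a feedback-free "retrial" mechanism that simplifies reasoning frameworks, and finds that simple methods often outperform complex ones.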

Details

Motivation: Current iterative reasoning strategies incur additional computational cost, so the authors look for a more efficient alternative. Method: Introduces a "retrials without feedback" mechanism that lets the model retry after an incorrect answer is identified. Result: Simple retrial methods often outperform complex reasoning frameworks at lower computational cost. Conclusion: Complex methods are not necessarily better; simple, efficient methods may be superior. Abstract: Recent advancements in large language models (LLMs) have catalyzed the development of general-purpose autonomous agents, demonstrating remarkable performance in complex reasoning tasks across various domains. This surge has spurred the evolution of a plethora of prompt-based reasoning frameworks. A recent focus has been on iterative reasoning strategies that refine outputs through self-evaluation and verbalized feedback. However, these strategies require additional computational complexity to enable models to recognize and correct their mistakes, leading to a significant increase in their cost. In this work, we introduce the concept of "retrials without feedback", an embarrassingly simple yet powerful mechanism for enhancing reasoning frameworks by allowing LLMs to retry problem-solving attempts upon identifying incorrect answers. Unlike conventional iterative refinement methods, our method does not require explicit self-reflection or verbalized feedback, simplifying the refinement process. Our findings indicate that simpler retrial-based approaches often outperform more sophisticated reasoning frameworks, suggesting that the benefits of complex methods may not always justify their computational costs. By challenging the prevailing assumption that more intricate reasoning strategies inherently lead to better performance, our work offers new insights into how simpler, more efficient approaches can achieve optimal results. So, are retrials all you need?
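
The retrial mechanism described above amounts to a loop that re-invokes the solver until a checker accepts the answer, without passing anything about the failed attempt back in. The sketch below is an illustration under stated assumptions rather than the authors' implementation: `attempt` stands in for an LLM call, `is_correct` for whatever check identifies an incorrect answer, and both names are hypothetical.

```python
def solve_with_retrials(attempt, is_correct, max_retrials=5):
    """Call attempt() until is_correct(answer) holds or the budget is spent.

    Nothing about earlier failures is passed back to attempt(), so there is
    no self-reflection or verbalized feedback step.
    Returns (answer, attempts_used); answer is None if every attempt failed.
    """
    for used in range(1, max_retrials + 1):
        answer = attempt()
        if is_correct(answer):
            return answer, used
    return None, max_retrials

# Deterministic stand-in for a stochastic model: a fixed stream of candidates.
candidates = iter([41, 43, 42])
answer, used = solve_with_retrials(lambda: next(candidates), lambda a: a == 42)

# A solver that never succeeds exhausts its budget.
failed = solve_with_retrials(lambda: 0, lambda a: a == 1, max_retrials=2)
```

The only cost parameter is `max_retrials`, which is what makes the approach easy to compare against feedback-based refinement at a fixed budget.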

[179] Estimating Optimal Context Length for Hybrid Retrieval-augmented Multi-document Summarization

Adithya Pratapa,Teruko Mitamura

Main category: cs.CL

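TL;DR: Proposes a hybrid method for multi-document summarization that combines retrieval-augmented systems with long context windows and improves performance by estimating the optimal retrieval length.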

Details

Motivation: Long-context language models underperform on multi-document summarization, and retrieval-augmented systems are sensitive to retrieval length, so a more effective approach is needed. Method: Combines retrieval augmentation with long-context models and estimates the optimal retrieval length from generated silver references. Result: The method performs well on multi-document summarization, outperforms strong baselines, and works across model classes and sizes. Conclusion: The method effectively improves the performance of long-context language models and generalizes to new ones. Abstract: Recent advances in long-context reasoning abilities of language models led to interesting applications in large-scale multi-document summarization. However, prior work has shown that these long-context models are not effective at their claimed context windows. To this end, retrieval-augmented systems provide an efficient and effective alternative. However, their performance can be highly sensitive to the choice of retrieval context length. In this work, we present a hybrid method that combines retrieval-augmented systems with long-context windows supported by recent language models. Our method first estimates the optimal retrieval length as a function of the retriever, summarizer, and dataset. On a randomly sampled subset of the dataset, we use a panel of LLMs to generate a pool of silver references. We use these silver references to estimate the optimal context length for a given RAG system configuration. Our results on the multi-document summarization task showcase the effectiveness of our method across model classes and sizes. We compare against length estimates from strong long-context benchmarks such as RULER and HELMET. Our analysis also highlights the effectiveness of our estimation method for very long-context LMs and its generalization to new classes of LMs.
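
The estimation step can be pictured as a small search: summarize a sampled subset at several candidate context lengths, score each summary against the pooled silver references, and keep the length with the best mean score. The sketch below makes several assumptions that the abstract does not confirm: the scoring metric is a simple unigram F1, the candidate lengths and texts are invented, and the function names are hypothetical.

```python
def unigram_f1(candidate, reference):
    """Token-overlap F1 between two whitespace-tokenized strings."""
    c, r = candidate.lower().split(), reference.lower().split()
    overlap = sum(min(c.count(w), r.count(w)) for w in set(c))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(c), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

def estimate_context_length(summaries_by_length, silver_refs):
    """Pick the context length whose summaries best match the silver references.

    summaries_by_length: {length: [one summary per sampled example]}
    silver_refs: [[silver references for example 0], [for example 1], ...]
    """
    def mean_score(summaries):
        per_example = [
            sum(unigram_f1(s, ref) for ref in refs) / len(refs)
            for s, refs in zip(summaries, silver_refs)
        ]
        return sum(per_example) / len(per_example)

    return max(summaries_by_length, key=lambda n: mean_score(summaries_by_length[n]))

# Invented toy data: one sampled example with two silver references.
silver_refs = [["the storm closed the port", "a storm shut the port"]]
summaries_by_length = {
    8000: ["the port"],
    32000: ["the storm closed the port"],
    128000: ["markets rallied on tuesday"],
}
best_length = estimate_context_length(summaries_by_length, silver_refs)
```

Because the estimate depends on the summaries actually produced, it is specific to one retriever, summarizer, and dataset combination, matching the abstract's description of the optimal length as a function of all three.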

[180] Sparks of Science: Hypothesis Generation Using Structured Paper Data

Charles O'Neill,Tirthankar Ghosal,Roberta Răileanu,Mike Walmsley,Thang Bui,Kevin Schawinski,Ioana Ciucă

Main category: cs.CL

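TL;DR: The paper introduces the HypoGen dataset for scientific hypothesis generation, which uses a Bit-Flip-Spark schema and chain-of-reasoning to improve hypothesis quality.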

Details

Motivation: Current foundation models struggle to generate scientific hypotheses that are both novel and feasible, and no dedicated dataset frames scientific hypothesis generation as a structured task. Method: Introduces the HypoGen dataset (about 5,500 structured problem-hypothesis pairs), which uses a Bit-Flip-Spark schema with chain-of-reasoning and treats hypothesis generation as conditional language generation. Result: Experiments show that models fine-tuned on HypoGen improve the novelty, feasibility, and overall quality of generated hypotheses. Conclusion: The HypoGen dataset provides effective support for scientific hypothesis generation and could be extended to other fields. Abstract: Generating novel and creative scientific hypotheses is a cornerstone in achieving Artificial General Intelligence. Large language and reasoning models have the potential to aid in the systematic creation, selection, and validation of scientifically informed hypotheses. However, current foundation models often struggle to produce scientific ideas that are both novel and feasible. One reason is the lack of a dedicated dataset that frames Scientific Hypothesis Generation (SHG) as a Natural Language Generation (NLG) task. In this paper, we introduce HypoGen, the first dataset of approximately 5500 structured problem-hypothesis pairs extracted from top-tier computer science conferences structured with a Bit-Flip-Spark schema, where the Bit is the conventional assumption, the Spark is the key insight or conceptual leap, and the Flip is the resulting counterproposal. HypoGen uniquely integrates an explicit Chain-of-Reasoning component that reflects the intellectual process from Bit to Flip. We demonstrate that framing hypothesis generation as conditional language modelling, with the model fine-tuned on Bit-Flip-Spark and the Chain-of-Reasoning (and where, at inference, we only provide the Bit), leads to improvements in the overall quality of the hypotheses. Our evaluation employs automated metrics and LLM judge rankings for overall quality assessment. We show that by fine-tuning on our HypoGen dataset we improve the novelty, feasibility, and overall quality of the generated hypotheses. The HypoGen dataset is publicly available at huggingface.co/datasets/UniverseTBD/hypogen-dr1.
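
To make the Bit-Flip-Spark schema concrete, the sketch below shows one way a record and its conditional-generation framing might be laid out. The field names, the text template, and the ordering of the generated fields are all assumptions for illustration (the actual HypoGen column names are not given here), and the example record is an invented placeholder rather than a dataset entry.

```python
from dataclasses import dataclass

@dataclass
class HypothesisRecord:
    # Hypothetical field names mirroring the roles named in the abstract.
    bit: str                 # the conventional assumption
    spark: str               # the key insight or conceptual leap
    flip: str                # the resulting counterproposal
    chain_of_reasoning: str  # the reasoning that leads from Bit to Flip

def inference_prompt(record):
    """At inference time only the Bit is given to the model."""
    return f"Bit: {record.bit}\n"

def training_text(record):
    """Fine-tuning text: the Bit as condition, then the fields to generate."""
    return (
        inference_prompt(record)
        + f"Spark: {record.spark}\n"
        + f"Chain-of-Reasoning: {record.chain_of_reasoning}\n"
        + f"Flip: {record.flip}\n"
    )

# Invented placeholder record, not taken from HypoGen.
example = HypothesisRecord(
    bit="Sequence models must process tokens one at a time.",
    spark="Attention can relate all positions in parallel.",
    flip="Replace recurrence with attention over the whole sequence.",
    chain_of_reasoning="Recurrence limits parallelism; attention removes it.",
)
prompt = inference_prompt(example)
full_text = training_text(example)
```

Keeping the inference prompt as a strict prefix of the training text is what makes this conditional language modelling: the model learns to continue from the Bit alone.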

[181] Accommodate Knowledge Conflicts in Retrieval-augmented LLMs: Towards Reliable Response Generation in the Wild

Jiatai Wang,Zhiwei Xu,Di Jin,Xuewen Yang,Tao Li

Main category: cs.CL

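TL;DR: The paper proposes the Swin-VIB framework, which uses variational information bottleneck models to address the uncertainty of large language models (LLMs) under knowledge conflicts and to make response generation more reliable.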

Details

Motivation: LLMs in information retrieval systems face knowledge conflicts between internal memory and externally retrieved information, leading to unreliable responses and uncertain decisions. Method: Proposes the Swin-VIB framework, which integrates variational information bottleneck models to adaptively augment retrieved information and guide LLM response generation. Result: Swin-VIB is validated on single-choice, open-ended question-answering, and retrieval-augmented generation tasks, improving single-choice accuracy by at least 7.54%. Conclusion: Swin-VIB effectively addresses knowledge conflicts in LLMs and improves the reliability and accuracy of response generation. Abstract: The proliferation of large language models (LLMs) has significantly advanced information retrieval systems, particularly in response generation (RG). Unfortunately, LLMs often face knowledge conflicts between internal memory and retrieved external information, arising from misinformation, biases, or outdated knowledge. These conflicts undermine response reliability and introduce uncertainty in decision-making. In this work, we analyze how LLMs navigate knowledge conflicts from an information-theoretic perspective and reveal that when conflicting and supplementary information exhibit significant differences, LLMs confidently resolve their preferences. However, when the distinction is ambiguous, LLMs experience heightened uncertainty. Based on this insight, we propose Swin-VIB, a novel framework that integrates a pipeline of variational information bottleneck models into adaptive augmentation of retrieved information and guiding LLM preference in response generation. Extensive experiments on single-choice, open-ended question-answering (QA), and retrieval augmented generation (RAG) validate our theoretical findings and demonstrate the efficacy of Swin-VIB. Notably, our method improves single-choice task accuracy by at least 7.54% over competitive baselines.

[182] SHA256 at SemEval-2025 Task 4: Selective Amnesia -- Constrained Unlearning for Large Language Models via Knowledge Isolation

Saransh Agrawal,Kuan-Hao Huang

Main category: cs.CL

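TL;DR: The paper proposes a targeted unlearning method for large language models (LLMs) that uses causal mediation analysis and layer-specific optimization to remove sensitive information while preserving model performance.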

Details

Motivation: To address the memorization of sensitive information during LLM training and avoid privacy risks in public deployment. Method: A two-stage approach: causal mediation analysis locates the critical layers (layers 0-5), and constrained optimization with a joint loss function then performs targeted unlearning. Result: Ranked 2nd in the 1B model track of SemEval-2025 Task 4 while retaining 88% of baseline MMLU accuracy. Conclusion: Causally informed layer optimization provides a new paradigm for efficient, precise unlearning in LLMs and significantly strengthens data privacy protection in AI systems. Abstract: Large language models (LLMs) frequently memorize sensitive information during training, posing risks when deploying publicly accessible models. Current machine unlearning methods struggle to selectively remove specific data associations without degrading overall model capabilities. This paper presents our solution to SemEval-2025 Task 4 on targeted unlearning, which introduces a two-stage methodology that combines causal mediation analysis with layer-specific optimization. Through systematic causal tracing experiments on OLMo architectures (1B and 7B parameters), we identify the critical role of the first few transformer layers (layers 0-5) in storing subject-attribute associations within MLP modules. Building on this insight, we develop a constrained optimization approach that freezes upper layers while applying a novel joint loss function to lower layers, simultaneously maximizing forget set loss via output token cross-entropy penalties and minimizing retain set deviation through adaptive regularization. Our method achieves 2nd place in the 1B model track, demonstrating strong task performance while maintaining 88% of baseline MMLU accuracy. These results establish causal-informed layer optimization as a promising paradigm for efficient, precise unlearning in LLMs, offering a significant step forward in addressing data privacy concerns in AI systems.

[183] ChatEXAONEPath: An Expert-level Multimodal Large Language Model for Histopathology Using Whole Slide Images

Sangwook Kim,Soonyoung Lee,Jongseong Jang

Main category: cs.CL

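TL;DR: This paper presents ChatEXAONEPath, an expert-level multimodal large language model for histopathology diagnosis based on whole slide images (WSIs), and demonstrates its potential through a retrieval-based data generation pipeline and an AI-based evaluation protocol.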

Details

Motivation: Existing multimodal large language models (MLLMs) for histopathology cannot fully understand clinical context because they rely on the limited information in public datasets; developing WSI-level MLLMs is important for scalability and applicability. Method: Builds a retrieval-based data generation pipeline from 10,094 pairs of WSIs and histopathology reports from TCGA, and designs an AI-based evaluation protocol to verify the model's understanding of multimodal information. Result: ChatEXAONEPath reaches a diagnostic acceptance rate of 62.9% on 1,134 pairs of WSIs and reports, and can understand WSIs and clinical context across multiple cancer types. Conclusion: By integrating multimodal information, the model may help clinicians comprehensively understand complex morphology and improve the efficiency of cancer diagnosis. Abstract: Recent studies have made significant progress in developing large language models (LLMs) in the medical domain, which can answer expert-level questions and demonstrate the potential to assist clinicians in real-world clinical scenarios. Studies have also witnessed the importance of integrating various modalities with the existing LLMs for a better understanding of complex clinical contexts, which are innately multi-faceted. Although studies have demonstrated the ability of multimodal LLMs in histopathology to answer questions from given images, they lack a thorough understanding of clinical context due to the patch-level data with limited information from public datasets. Thus, developing WSI-level MLLMs is significant in terms of the scalability and applicability of MLLMs in histopathology. In this study, we introduce an expert-level MLLM for histopathology using WSIs, dubbed ChatEXAONEPath. We present a retrieval-based data generation pipeline using 10,094 pairs of WSIs and histopathology reports from The Cancer Genome Atlas (TCGA). We also showcase an AI-based evaluation protocol for a comprehensive understanding of the medical context from given multimodal information and evaluate generated answers compared to the original histopathology reports. We demonstrate the ability of diagnosing the given histopathology images using ChatEXAONEPath with the acceptance rate of 62.9% from 1,134 pairs of WSIs and reports. Our proposed model can understand pan-cancer WSIs and clinical context from various cancer types. We argue that our proposed model has the potential to assist clinicians by comprehensively understanding complex morphology of WSIs for cancer diagnosis through the integration of multiple modalities.

[184] Aspect-Based Summarization with Self-Aspect Retrieval Enhanced Generation

Yichao Feng,Shuai Zhao,Yueqiu Li,Luwei Xiao,Xiaobao Wu,Anh Tuan Luu

Main category: cs.CL

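TL;DR: Proposes a self-aspect retrieval enhanced framework for aspect-based summarization that addresses the resource constraints and limited generalizability of traditional methods while optimizing token usage and reducing hallucination.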

Details

Motivation: Traditional summarization methods are resource-intensive and generalize poorly; LLMs need no training but depend on prompt engineering and face token limits and hallucination. Method: Uses an embedding-driven retrieval mechanism to identify relevant text segments, removes unrelated content, and generates the summary strictly based on the given aspect. Result: Achieves strong performance on benchmark datasets and effectively mitigates the token limit problem. Conclusion: The framework outperforms existing methods in both performance and efficiency and offers a new approach to aspect-based summarization. Abstract: Aspect-based summarization aims to generate summaries tailored to specific aspects, addressing the resource constraints and limited generalizability of traditional summarization approaches. Recently, large language models have shown promise in this task without the need for training. However, they rely excessively on prompt engineering and face token limits and hallucination challenges, especially with in-context learning. To address these challenges, in this paper, we propose a novel framework for aspect-based summarization: Self-Aspect Retrieval Enhanced Summary Generation. Rather than relying solely on in-context learning, given an aspect, we employ an embedding-driven retrieval mechanism to identify its relevant text segments. This approach extracts the pertinent content while avoiding unnecessary details, thereby mitigating the challenge of token limits. Moreover, our framework optimizes token usage by deleting unrelated parts of the text and ensuring that the model generates output strictly based on the given aspect. With extensive experiments on benchmark datasets, we demonstrate that our framework not only achieves superior performance but also effectively mitigates the token limitation problem.
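
The retrieval step can be sketched as ranking text segments by similarity to the aspect and keeping only the top few, so that the summarizer sees a shorter, aspect-focused input. The code below is a minimal illustration, not the paper's pipeline: a bag-of-words count vector stands in for the learned embedding, and the function names, `top_k` parameter, and example segments are assumptions.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in embedding: bag-of-words counts (a real system would use a learned encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_for_aspect(aspect, segments, top_k=2):
    """Return the top_k segments most similar to the aspect, in document order."""
    query = embed(aspect)
    ranked = sorted(
        range(len(segments)),
        key=lambda i: cosine(query, embed(segments[i])),
        reverse=True,
    )
    return [segments[i] for i in sorted(ranked[:top_k])]

# Invented example document split into segments.
segments = [
    "The battery life is about ten hours",
    "The screen is bright and sharp",
    "Charging the battery takes one hour",
]
kept = retrieve_for_aspect("battery life", segments, top_k=2)
```

Dropping the unrelated segment before generation is what reduces token usage; restoring document order for the kept segments is a design choice made here so the summarizer reads them in context.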

[185] Accuracy is Not Agreement: Expert-Aligned Evaluation of Crash Narrative Classification Models

Sudesh Ramesh Bhagat,Ibne Farabi Shihab,Anuj Sharma

Main category: cs.CL

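TL;DR: The study finds a counterintuitive relationship between the technical accuracy of deep learning models and their agreement with experts: LLMs are less accurate but agree more closely with experts.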

Details

Motivation: To examine evaluation criteria for deep learning models in safety-critical NLP applications, with emphasis on the importance of expert agreement. Method: Evaluates five DL models and four LLMs, using Cohen's Kappa, PCA, and SHAP to analyze model-expert agreement. Result: Models with high accuracy show lower agreement with experts; LLMs rely more on contextual and temporal cues and agree more closely with experts. Conclusion: Expert agreement should complement accuracy as a model evaluation metric, and LLMs show promise for crash analysis. Abstract: This study explores the relationship between deep learning (DL) model accuracy and expert agreement in the classification of crash narratives. We evaluate five DL models -- including BERT variants, the Universal Sentence Encoder (USE), and a zero-shot classifier -- against expert-labeled data and narrative text. The analysis is further extended to four large language models (LLMs): GPT-4, LLaMA 3, Qwen, and Claude. Our results reveal a counterintuitive trend: models with higher technical accuracy often exhibit lower agreement with domain experts, whereas LLMs demonstrate greater expert alignment despite relatively lower accuracy scores. To quantify and interpret model-expert agreement, we employ Cohen's Kappa, Principal Component Analysis (PCA), and SHAP-based explainability techniques. Findings indicate that expert-aligned models tend to rely more on contextual and temporal language cues, rather than location-specific keywords. These results underscore that accuracy alone is insufficient for evaluating models in safety-critical NLP applications. We advocate for incorporating expert agreement as a complementary metric in model evaluation frameworks and highlight the promise of LLMs as interpretable, scalable tools for crash analysis pipelines.
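
Cohen's Kappa, the agreement statistic used above, corrects raw agreement for the agreement two raters would reach by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e the chance agreement implied by each rater's label frequencies. The sketch below implements that standard formula; the label sequences are invented toy data, not results from the study.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    n = len(rater_a)
    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Invented toy labels: an expert and a model agree on 3 of 4 narratives.
expert = ["Y", "Y", "N", "N"]
model = ["Y", "N", "N", "N"]
kappa = cohens_kappa(expert, model)
```

Here raw agreement is 0.75 but chance agreement is 0.5, so kappa is 0.5. This gap between a raw match rate and chance-corrected agreement is one reason accuracy and expert agreement can diverge.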

[186] Retrieval-Augmented Generation with Conflicting Evidence

Han Wang,Archiki Prasad,Elias Stengel-Eskin,Mohit Bansal

Main category: cs.CL

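TL;DR: The paper proposes MADAM-RAG, a new method that handles ambiguity and misinformation in retrieval-augmented generation through multi-agent debate, and introduces the RAMDocs dataset to simulate complex scenarios. Experiments show that the method outperforms existing baselines on multiple tasks.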

Details

Motivation: Existing retrieval-augmented generation systems typically handle ambiguous queries, conflicting information, and noise in isolation, one problem at a time, and lack a unified approach. Method: Introduces the RAMDocs dataset to simulate complex scenarios and designs MADAM-RAG, which aggregates answers and filters out misinformation through multi-agent debate. Result: MADAM-RAG improves by up to 11.40% on AmbigDocs and up to 15.80% on FaithEval, while RAMDocs remains challenging for existing baselines. Conclusion: MADAM-RAG begins to address multi-source conflicts, but room for improvement remains when the supporting evidence is imbalanced. Abstract: Large language model (LLM) agents are increasingly employing retrieval-augmented generation (RAG) to improve the factuality of their responses. However, in practice, these systems often need to handle ambiguous user queries and potentially conflicting information from multiple sources while also suppressing inaccurate information from noisy or irrelevant documents. Prior work has generally studied and addressed these challenges in isolation, considering only one aspect at a time, such as handling ambiguity or robustness to noise and misinformation. We instead consider multiple factors simultaneously, proposing (i) RAMDocs (Retrieval with Ambiguity and Misinformation in Documents), a new dataset that simulates complex and realistic scenarios for conflicting evidence for a user query, including ambiguity, misinformation, and noise; and (ii) MADAM-RAG, a multi-agent approach in which LLM agents debate over the merits of an answer over multiple rounds, allowing an aggregator to collate responses corresponding to disambiguated entities while discarding misinformation and noise, thereby handling diverse sources of conflict jointly. We demonstrate the effectiveness of MADAM-RAG using both closed and open-source models on AmbigDocs -- which requires presenting all valid answers for ambiguous queries -- improving over strong RAG baselines by up to 11.40% and on FaithEval -- which requires suppressing misinformation -- where we improve by up to 15.80% (absolute) with Llama3.3-70B-Instruct. Furthermore, we find that RAMDocs poses a challenge for existing RAG baselines (Llama3.3-70B-Instruct only obtains 32.60 exact match score). While MADAM-RAG begins to address these conflicting factors, our analysis indicates that a substantial gap remains, especially when increasing the level of imbalance in supporting evidence and misinformation.

[187] LLMs Meet Finance: Fine-Tuning Foundation Models for the Open FinLLM Leaderboard

Varun Rao,Youran Sun,Mahendra Kumar,Tejas Mutneja,Agastya Mukherjee,Haizhao Yang

Main category: cs.CL

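TL;DR: The paper studies the application of large language models (LLMs) to financial tasks, fine-tuning foundation models with several techniques to improve performance and demonstrating the potential of LLMs in finance.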

Details

Motivation: To explore the suitability of large language models for financial tasks and test whether their performance can be improved. Method: Uses the Open FinLLM Leaderboard as a benchmark and fine-tunes Qwen2.5 and Deepseek-R1 with supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning (RL). Result: The fine-tuned models show substantial gains across a range of financial tasks, and the data scaling law in the financial domain is measured. Conclusion: Large language models have considerable potential in financial applications. Abstract: This paper investigates the application of large language models (LLMs) to financial tasks. We fine-tuned foundation models using the Open FinLLM Leaderboard as a benchmark. Building on Qwen2.5 and Deepseek-R1, we employed techniques including supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning (RL) to enhance their financial capabilities. The fine-tuned models demonstrated substantial performance gains across a wide range of financial tasks. Moreover, we measured the data scaling law in the financial domain. Our work demonstrates the potential of LLMs in financial applications.

[188] Energy-Based Reward Models for Robust Language Model Alignment

Anamika Lochab,Ruqi Zhang

Main category: cs.CL

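TL;DR: EBRM is a lightweight post-hoc framework that improves the robustness and generalization of reward models by explicitly modeling the reward distribution, without retraining.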

Details

Motivation: Conventional reward models struggle to capture complex human preferences and to generalize to unseen data; EBRM aims to address both problems. Method: EBRM explicitly models the reward distribution using conflict-aware data filtering, label-noise-aware contrastive training, and hybrid initialization. Result: Experiments show significant gains in robustness and generalization, with up to a 5.97% improvement on safety-critical alignment tasks. Conclusion: EBRM is an efficient, scalable way to enhance reward models that improves alignment quality and delays reward hacking. Abstract: Reward models (RMs) are essential for aligning Large Language Models (LLMs) with human preferences. However, they often struggle with capturing complex human preferences and generalizing to unseen data. To address these challenges, we introduce Energy-Based Reward Model (EBRM), a lightweight post-hoc refinement framework that enhances RM robustness and generalization. EBRM models the reward distribution explicitly, capturing uncertainty in human preferences and mitigating the impact of noisy or misaligned annotations. It achieves this through conflict-aware data filtering, label-noise-aware contrastive training, and hybrid initialization. Notably, EBRM enhances RMs without retraining, making it computationally efficient and adaptable across different models and tasks. Empirical evaluations on RM benchmarks demonstrate significant improvements in both robustness and generalization, achieving up to a 5.97% improvement in safety-critical alignment tasks compared to standard RMs. Furthermore, reinforcement learning experiments confirm that our refined rewards enhance alignment quality, effectively delaying reward hacking. These results demonstrate our approach as a scalable and effective enhancement for existing RMs and alignment pipelines. The code is available at EBRM.

[189] Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo

João Loula,Benjamin LeBrun,Li Du,Ben Lipkin,Clemente Pasti,Gabriel Grand,Tianyu Liu,Yahya Emara,Marjorie Freedman,Jason Eisner,Ryan Cotterell,Vikash Mansinghka,Alexander K. Lew,Tim Vieira,Timothy J. O'Donnell

Main category: cs.CL

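TL;DR: Proposes an architecture based on sequential Monte Carlo (SMC) for controlled language model generation that flexibly incorporates domain- and problem-specific constraints, and shows across several domains that it lets small models outperform larger ones.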

Details

Motivation: Many language model applications need text that satisfies syntactic or semantic constraints, but sampling exactly from the resulting distribution is generally intractable. Method: Develops an SMC-based architecture that supports flexible constraints at inference time and dynamically reallocates computation. Result: On Python code generation, text-to-SQL, goal inference, and molecule synthesis, small open-source models outperform larger ones. Conclusion: The SMC approach approximates the posterior distribution more closely and offers a simple, programmable way to solve controlled generation problems. Abstract: A wide range of LM applications require generating text that conforms to syntactic or semantic constraints. Imposing such constraints can be naturally framed as probabilistic conditioning, but exact generation from the resulting distribution -- which can differ substantially from the LM's base distribution -- is generally intractable. In this work, we develop an architecture for controlled LM generation based on sequential Monte Carlo (SMC). Our SMC framework allows us to flexibly incorporate domain- and problem-specific constraints at inference time, and efficiently reallocate computational resources in light of new information during the course of generation. By comparing to a number of alternatives and ablations on four challenging domains -- Python code generation for data science, text-to-SQL, goal inference, and molecule synthesis -- we demonstrate that, with little overhead, our approach allows small open-source language models to outperform models over 8x larger, as well as closed-source, fine-tuned ones. In support of the probabilistic perspective, we show that these performance improvements are driven by better approximation to the posterior distribution. Our system builds on the framework of Lew et al. (2023) and integrates with its language model probabilistic programming language, giving users a simple, programmable way to apply SMC to a broad variety of controlled generation problems.
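
The core SMC idea, extending a population of partial sequences, reweighting them by how well they satisfy the constraint, and resampling so that computation moves to promising candidates, can be shown on a toy problem. The sketch below is a generic textbook-style particle filter, not the paper's system: the uniform two-token "language model", the constraint (no two consecutive "b" tokens), the resampling threshold, and all function names are assumptions for illustration.

```python
import random

def smc_generate(propose, potential, n_particles=50, steps=5, seed=0):
    """Toy sequential Monte Carlo over token sequences.

    propose(prefix, rng) -> next token sampled from the base model.
    potential(prefix)    -> non-negative weight factor for the newest token;
                            0 means the constraint is violated.
    Returns (particles, weights).
    """
    rng = random.Random(seed)
    particles = [[] for _ in range(n_particles)]
    weights = [1.0] * n_particles
    for _ in range(steps):
        # Extend every particle with a base-model token and reweight it.
        for i, particle in enumerate(particles):
            particle.append(propose(particle, rng))
            weights[i] *= potential(particle)
        total = sum(weights)
        if total == 0:
            raise RuntimeError("all particles violated the constraint")
        # Resample when the effective sample size falls below half the population.
        ess = total ** 2 / sum(w * w for w in weights)
        if ess < n_particles / 2:
            chosen = rng.choices(particles, weights=weights, k=n_particles)
            particles = [list(p) for p in chosen]
            weights = [total / n_particles] * n_particles
    return particles, weights

def uniform_lm(prefix, rng):
    """Hypothetical base model: uniform over a two-token vocabulary."""
    return rng.choice(["a", "b"])

def no_double_b(prefix):
    """Constraint: the sequence must not end in two consecutive 'b' tokens."""
    return 0.0 if prefix[-2:] == ["b", "b"] else 1.0

particles, weights = smc_generate(uniform_lm, no_double_b)
```

Particles that violate the constraint receive weight zero and are replaced at the next resampling step by copies of surviving particles, which is the reallocation of computation the abstract refers to.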

[190] CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training

Shizhe Diao,Yu Yang,Yonggan Fu,Xin Dong,Dan Su,Markus Kliegl,Zijia Chen,Peter Belcak,Yoshi Suhara,Hongxu Yin,Mostofa Patwary,Yingyan Lin,Jan Kautz,Pavlo Molchanov

Main category: cs.CL

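TL;DR: CLIMB is an automated framework that optimizes pre-training data mixtures through clustering and iterative search to improve model performance.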

Details

Motivation: Pre-training data usually lacks explicit domain divisions and manual labeling is costly, so an automated way to optimize the data mixture is needed. Method: CLIMB embeds and clusters large-scale datasets in a semantic space, then iteratively searches for the optimal mixture using a proxy model and a predictor. Result: A 1B model trained on 400B tokens exceeds Llama-3.2-1B by 2.0%, and domain-specific optimization yields a 5% improvement. Conclusion: CLIMB provides an efficient method for optimizing data mixtures, and the ClimbLab and ClimbMix datasets are released for research. Abstract: Pre-training datasets are typically collected from web content and lack inherent domain divisions. For instance, widely used datasets like Common Crawl do not include explicit domain labels, while manually curating labeled datasets such as The Pile is labor-intensive. Consequently, identifying an optimal pre-training data mixture remains a challenging problem, despite its significant benefits for pre-training performance. To address these challenges, we propose CLustering-based Iterative Data Mixture Bootstrapping (CLIMB), an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. Specifically, CLIMB embeds and clusters large-scale datasets in a semantic space and then iteratively searches for optimal mixtures using a smaller proxy model and a predictor. When continuously trained on 400B tokens with this mixture, our 1B model exceeds the state-of-the-art Llama-3.2-1B by 2.0%. Moreover, we observe that optimizing for a specific domain (e.g., Social Sciences) yields a 5% improvement over random sampling. Finally, we introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset designed for efficient pre-training that delivers superior performance under an equal token budget. We analyze the final data mixture, elucidating the characteristics of an optimal data mixture. Our data is available at: https://research.nvidia.com/labs/lpr/climb/

cs.LG

[191] Simplifying Graph Transformers

Liheng Ma,Soumyasundar Pal,Yingxue Zhang,Philip H. S. Torr,Mark Coates

Main category: cs.LG

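TL;DR: The paper proposes three simple modifications that make the plain Transformer applicable to graph learning without major architectural changes, with strong results on multiple graph datasets.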

Details

Motivation: Existing Graph Transformer architectures are complex, which makes it hard to adopt advances in Transformer training directly, so a simpler approach is needed. Method: Proposes three modifications: (1) simplified $L_2$ attention; (2) adaptive root-mean-square normalization; (3) a relative positional encoding bias with a shared encoder. Result: Significant gains on a variety of graph datasets and strong expressiveness on a graph isomorphism benchmark. Conclusion: With simple modifications, the plain Transformer can be applied effectively to graph learning with strong performance. Abstract: Transformers have attained outstanding performance across various modalities, employing scaled-dot-product (SDP) attention mechanisms. Researchers have attempted to migrate Transformers to graph learning, but most advanced Graph Transformers are designed with major architectural differences, either integrating message-passing or incorporating sophisticated attention mechanisms. These complexities prevent the easy adoption of Transformer training advances. We propose three simple modifications to the plain Transformer to render it applicable to graphs without introducing major architectural distortions. Specifically, we advocate for the use of (1) simplified $L_2$ attention to measure the magnitude closeness of tokens; (2) adaptive root-mean-square normalization to preserve token magnitude information; and (3) a relative positional encoding bias with a shared encoder. Significant performance gains across a variety of graph datasets justify the effectiveness of our proposed modifications. Furthermore, empirical evaluation on the expressiveness benchmark reveals noteworthy realized expressiveness in graph isomorphism testing.
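
To illustrate how distance-based attention differs from scaled-dot-product attention, the sketch below scores each key by the negative squared $L_2$ distance to the query before the softmax. The abstract does not give the paper's exact "simplified $L_2$ attention" formula, so the scoring rule, the $\sqrt{d}$ scaling, and the function name are assumptions; the point is only that such a score rewards keys close to the query in magnitude.

```python
import math

def l2_attention(queries, keys, values):
    """Attention with logits -||q - k||^2 / sqrt(d) (an assumed scoring rule)."""
    d = len(queries[0])
    outputs = []
    for q in queries:
        logits = [
            -sum((qi - ki) ** 2 for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys
        ]
        # Numerically stable softmax over the keys.
        peak = max(logits)
        exps = [math.exp(l - peak) for l in logits]
        norm = sum(exps)
        attn = [e / norm for e in exps]
        # Weighted sum of value vectors.
        outputs.append(
            [sum(a * v[j] for a, v in zip(attn, values)) for j in range(len(values[0]))]
        )
    return outputs

# The second key has the larger dot product with the query (10 vs. 1), so
# dot-product attention would favor it; the distance-based score instead
# favors the first key, whose magnitude matches the query.
queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [10.0, 0.0]]
values = [[1.0], [0.0]]
out = l2_attention(queries, keys, values)
```

In this toy case the output is essentially the first value vector, showing the sensitivity to token magnitude that the abstract describes as measuring "magnitude closeness".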

[192] Quantum Computing Supported Adversarial Attack-Resilient Autonomous Vehicle Perception Module for Traffic Sign Classification

Reek Majumder,Mashrur Chowdhury,Sakib Mahmud Khan,Zadid Khan,Fahim Ahmad,Frank Ngeni,Gurcan Comert,Judith Mwakalonge,Dimitra Michalaka

Main category: cs.LG

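TL;DR: The paper studies the robustness of hybrid classical-quantum deep learning (HCQ-DL) models under adversarial attacks and shows that they outperform classical models in autonomous vehicle perception modules.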

Details

Motivation: Adversarial attacks can cause deep learning models to misclassify inputs, with severe consequences for autonomous vehicle perception modules, so models that are robust to such attacks are essential. Method: Builds HCQ-DL models with AlexNet and VGG-16 as feature extractors, tests over 1000 quantum circuits, and evaluates performance under three adversarial attacks (PGD, FGSA, GA). Result: HCQ-DL models exceed 95% accuracy with no attack and 91% under GA and FGSA attacks, higher than classical DL models. Under the PGD attack, the HCQ-DL model reaches 85% accuracy while classical DL models stay below 21%. Conclusion: HCQ-DL models are more accurate and robust under adversarial attacks and are suitable for autonomous vehicle perception modules. Abstract: Deep learning (DL)-based image classification models are essential for autonomous vehicle (AV) perception modules since incorrect categorization might have severe repercussions. Adversarial attacks are widely studied cyberattacks that can lead DL models to predict inaccurate output, such as incorrectly classified traffic signs by the perception module of an autonomous vehicle. In this study, we create and compare hybrid classical-quantum deep learning (HCQ-DL) models with classical deep learning (C-DL) models to demonstrate robustness against adversarial attacks for perception modules. We used transfer learning models, AlexNet and VGG-16, as feature extractors before feeding the extracted features into the quantum system. We tested over 1000 quantum circuits in our HCQ-DL models for projected gradient descent (PGD), fast gradient sign attack (FGSA), and gradient attack (GA), which are three well-known untargeted adversarial approaches. We evaluated the performance of all models during adversarial attacks and no-attack scenarios. Our HCQ-DL models maintain accuracy above 95% during a no-attack scenario and above 91% for GA and FGSA attacks, which is higher than C-DL models. During the PGD attack, our AlexNet-based HCQ-DL model maintained an accuracy of 85% compared to C-DL models that achieved accuracies below 21%. Our results highlight that the HCQ-DL models provide improved accuracy for traffic sign classification under adversarial settings compared to their classical counterparts.

[193] VLMGuard-R1: Proactive Safety Alignment for VLMs via Reasoning-Driven Prompt Optimization

Menglan Chen,Xianghe Pang,Jingjing Dong,WenHao Wang,Yaxin Du,Siheng Chen

Main category: cs.LG

TL;DR: Proposes VLMGuard-R1, a prompt rewriting framework based on multimodal reasoning, to improve the safety of vision-language models.

Details

Motivation: The multimodal complexity of vision-language models (VLMs) can give rise to subtle threats that conventional safeguards miss, and cross-modal reasoning is needed to preempt them. Method: Designs the VLMGuard-R1 framework, in which a reasoning-guided rewriter dynamically interprets text-image interactions and produces safer prompts while leaving the model's core parameters unchanged. Result: Experiments on three benchmarks with five VLMs show that VLMGuard-R1 outperforms four baselines, with a 43.59% increase in average safety on the SIUO benchmark. Conclusion: Through multimodal reasoning-driven prompt rewriting, VLMGuard-R1 significantly improves VLM safety without modifying model parameters. Abstract: Aligning Vision-Language Models (VLMs) with safety standards is essential to mitigate risks arising from their multimodal complexity, where integrating vision and language unveils subtle threats beyond the reach of conventional safeguards. Inspired by the insight that reasoning across modalities is key to preempting intricate vulnerabilities, we propose a novel direction for VLM safety: multimodal reasoning-driven prompt rewriting. To this end, we introduce VLMGuard-R1, a proactive framework that refines user inputs through a reasoning-guided rewriter, dynamically interpreting text-image interactions to deliver refined prompts that bolster safety across diverse VLM architectures without altering their core parameters. To achieve this, we devise a three-stage reasoning pipeline to synthesize a dataset that trains the rewriter to infer subtle threats, enabling tailored, actionable responses over generic refusals. Extensive experiments across three benchmarks with five VLMs reveal that VLMGuard-R1 outperforms four baselines. In particular, VLMGuard-R1 achieves a remarkable 43.59% increase in average safety across five models on the SIUO benchmark.

Advait Gadhikar,Tom Jacobs,Chao Zhou,Rebekka Burkholz

Main category: cs.LG

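TL;DR: Proposes Sign-In, a dynamic reparameterization that induces parameter sign flips to improve the performance of sparse neural networks trained from scratch (PaI).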

Details

Motivation: The performance gap between training sparse neural networks from scratch (PaI) and dense-to-sparse training is a roadblock for efficient deep learning. According to the Lottery Ticket Hypothesis, PaI hinges on finding a problem-specific parameter initialization, and getting the parameter signs right is crucial. Method: Proposes Sign-In, which uses a dynamic reparameterization to induce sign flips that PaI otherwise fails to find. Result: Experiments and theory suggest that Sign-In improves PaI performance, but it does not fully close the gap between PaI and dense-to-sparse training. Conclusion: Sign-In is an orthogonal method that offers a new route to narrowing the gap between PaI and dense-to-sparse training, though further work is needed. Abstract: The performance gap between training sparse neural networks from scratch (PaI) and dense-to-sparse training presents a major roadblock for efficient deep learning. According to the Lottery Ticket Hypothesis, PaI hinges on finding a problem specific parameter initialization. As we show, to this end, determining correct parameter signs is sufficient. Yet, they remain elusive to PaI. To address this issue, we propose Sign-In, which employs a dynamic reparameterization that provably induces sign flips. Such sign flips are complementary to the ones that dense-to-sparse training can accomplish, rendering Sign-In as an orthogonal method. While our experiments and theory suggest performance improvements of PaI, they also carve out the main open challenge to close the gap between PaI and dense-to-sparse training.

[195] ALT: A Python Package for Lightweight Feature Representation in Time Series Classification

Balázs P. Halmos,Balázs Hajós,Vince Á. Molnár,Marcell T. Kurbucz,Antal Jakovác

Main category: cs.LG

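TL;DR: ALT is an open-source Python package for efficient and accurate time series classification (TSC) that uses an adaptive algorithm to improve performance.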

Details

Motivation: Improve on the linear law-based transformation (LLT) algorithm so that patterns at different temporal scales are captured more effectively. Method: The adaptive law-based transformation (ALT) algorithm transforms raw time series data into a linearly separable feature space. Result: ALT performs well on real-world datasets with low computational overhead and is applicable to physics and related domains. Conclusion: ALT is efficient, scalable, and easy to use for TSC tasks. Abstract: We introduce ALT, an open-source Python package created for efficient and accurate time series classification (TSC). The package implements the adaptive law-based transformation (ALT) algorithm, which transforms raw time series data into a linearly separable feature space using variable-length shifted time windows. This adaptive approach enhances its predecessor, the linear law-based transformation (LLT), by effectively capturing patterns of varying temporal scales. The software is implemented for scalability, interpretability, and ease of use, achieving state-of-the-art performance with minimal computational overhead. Extensive benchmarking on real-world datasets demonstrates the utility of ALT for diverse TSC tasks in physics and related domains.

[196] MIB: A Mechanistic Interpretability Benchmark

Aaron Mueller,Atticus Geiger,Sarah Wiegreffe,Dana Arad,Iván Arcuschin,Adam Belfki,Yik Siu Chan,Jaden Fiotto-Kaufman,Tal Haklay,Michael Hanna,Jing Huang,Rohan Gupta,Yaniv Nikankin,Hadas Orgad,Nikhil Prakash,Anja Reusch,Aruna Sankaranarayanan,Shun Shao,Alessandro Stolfo,Martin Tutek,Amir Zur,David Bau,Yonatan Belinkov

Main category: cs.LG

TL;DR: MIB is a new benchmark for evaluating mechanistic interpretability methods, with two tracks spanning multiple tasks and models.

Details

Motivation: Provide lasting and meaningful evaluation standards for mechanistic interpretability methods. Method: Proposes the MIB benchmark, with a circuit localization track and a causal variable localization track, spanning four tasks and five models. Result: For circuit localization, attribution and mask optimization methods perform best; for causal variable localization, the supervised DAS method performs best, and SAE features are not better than standard neurons. Conclusion: MIB enables meaningful comparison of methods and increases confidence that the field has made real progress. Abstract: How can we know whether new mechanistic interpretability methods achieve real improvements? In pursuit of meaningful and lasting evaluation standards, we propose MIB, a benchmark with two tracks spanning four tasks and five models. MIB favors methods that precisely and concisely recover relevant causal pathways or specific causal variables in neural language models. The circuit localization track compares methods that locate the model components - and connections between them - most important for performing a task (e.g., attribution patching or information flow routes). The causal variable localization track compares methods that featurize a hidden vector, e.g., sparse autoencoders (SAEs) or distributed alignment search (DAS), and locate model features for a causal variable relevant to the task. Using MIB, we find that attribution and mask optimization methods perform best on circuit localization. For causal variable localization, we find that the supervised DAS method performs best, while SAE features are not better than neurons, i.e., standard dimensions of hidden vectors. These findings illustrate that MIB enables meaningful comparisons of methods, and increases our confidence that there has been real progress in the field.

cs.CR

[197] Provable Secure Steganography Based on Adaptive Dynamic Sampling

Kaiyi Pang

Main category: cs.CR

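TL;DR: Proposes a provably secure steganography scheme that needs no explicit access to the generative model's distribution, works in black-box scenarios, and performs comparably to existing white-box methods.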

Details

Motivation: Current provably secure steganography methods require explicit access to the generative model's distribution, which limits their practicality in black-box scenarios. Method: A dynamic sampling strategy lets the generative model embed secret messages without disrupting its normal generation process. Result: Evaluations on three real-world datasets and three LLMs show that the black-box method is comparable to white-box methods in efficiency and capacity while avoiding degradation of the model's generated outputs. Conclusion: The scheme offers a practical and efficient solution for secure steganography in black-box scenarios. Abstract: The security of private communication is increasingly at risk due to widespread surveillance. Steganography, a technique for embedding secret messages within innocuous carriers, enables covert communication over monitored channels. Provably Secure Steganography (PSS) is state of the art for making stego carriers indistinguishable from normal ones by ensuring computational indistinguishability between stego and cover distributions. However, current PSS methods often require explicit access to the distribution of the generative model for both sender and receiver, limiting their practicality in black-box scenarios. In this paper, we propose a provably secure steganography scheme that does not require access to explicit model distributions for both sender and receiver. Our method incorporates a dynamic sampling strategy, enabling generative models to embed secret messages within multiple sampling choices without disrupting the normal generation process of the model. Extensive evaluations on three real-world datasets and three LLMs demonstrate that our black-box method is comparable with existing white-box steganography methods in terms of efficiency and capacity while eliminating the degradation of steganography in model generated outputs.

cs.SE

[198] A Phenomenological Approach to Analyzing User Queries in IT Systems Using Heidegger's Fundamental Ontology

Maksim Vishnevskiy

Main category: cs.SE

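TL;DR: Proposes a novel analytical IT system grounded in Heidegger's Fundamental Ontology that distinguishes beings from Being, uses two modally distinct languages for user input and internal analysis, and bridges them with a phenomenological reduction module.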

Details

Motivation: Contemporary systems are limited to categorical analysis and cannot uncover deeper ontological patterns in query processing; an approach based on Heidegger's phenomenology aims to address this. Method: The system uses two languages, a categorical language of beings and an existential language of Being, bridged by a phenomenological reduction module, to analyze user queries and identify recursive and self-referential structures. Result: The system provides actionable insights in categorical terms and helps resolve logical traps in complex interactions, such as metaphor usage in IT contexts. Conclusion: Realizing the system would pave the way for a universal query analysis tool, but the language of Being still needs to be formalized to achieve full computability. Abstract: This paper presents a novel research analytical IT system grounded in Martin Heidegger's Fundamental Ontology, distinguishing between beings (das Seiende) and Being (das Sein). The system employs two modally distinct, descriptively complete languages: a categorical language of beings for processing user inputs and an existential language of Being for internal analysis. These languages are bridged via a phenomenological reduction module, enabling the system to analyze user queries (including questions, answers, and dialogues among IT specialists), identify recursive and self-referential structures, and provide actionable insights in categorical terms. Unlike contemporary systems limited to categorical analysis, this approach leverages Heidegger's phenomenological existential analysis to uncover deeper ontological patterns in query processing, aiding in resolving logical traps in complex interactions, such as metaphor usage in IT contexts. The path to full realization involves formalizing the language of Being by a research team based on Heidegger's Fundamental Ontology; given the existing completeness of the language of beings, this reduces the system's computability to completeness, paving the way for a universal query analysis tool. The paper presents the system's architecture, operational principles, technical implementation, use cases--including a case based on real IT specialist dialogues--comparative evaluation with existing tools, and its advantages and limitations.

cs.IR

[199] Specialized text classification: an approach to classifying Open Banking transactions

Duc Tuyen TA,Wajdi Ben Saad,Ji Young Oh

Main category: cs.IR

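TL;DR: Introduces a language-based Open Banking transaction classification system focused on the French market and French-language text, using domain-specific techniques and knowledge to improve performance.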

Details

Motivation: The PSD2 regulation and the Open Banking framework give banks and fintechs an opportunity to use transaction descriptions to understand customer behavior, yet domain-specific natural language processing applications remain underexplored in banking. Method: The system covers data collection, labeling, preprocessing, modeling, and evaluation, and focuses on training a domain-specific language model on French banking data. Result: The system shows enhanced performance and efficiency compared with generic approaches. Conclusion: The study provides an effective solution for domain-specific NLP applications in banking and shows the benefit of a tailored approach. Abstract: With the introduction of the PSD2 regulation in the EU which established the Open Banking framework, a new window of opportunities has opened for banks and fintechs to explore and enrich bank transaction descriptions with the aim of building a better understanding of customer behavior, while using this understanding to prevent fraud, reduce risks and offer more competitive and tailored services. And although the usage of natural language processing models and techniques has seen incredible progress in various applications and domains over the past few years, custom applications based on domain-specific text corpus remain unaddressed especially in the banking sector. In this paper, we introduce a language-based Open Banking transaction classification system with a focus on the French market and French-language text. The system encompasses data collection, labeling, preprocessing, modeling, and evaluation stages. Unlike previous studies that focus on general classification approaches, this system is specifically tailored to address the challenges posed by training a language model with a specialized text corpus (Banking data in the French context). By incorporating language-specific techniques and domain knowledge, the proposed system demonstrates enhanced performance and efficiency compared to generic approaches.

[200] A Human-AI Comparative Analysis of Prompt Sensitivity in LLM-Based Relevance Judgment

Negar Arabzadeh,Charles L. A. Clarke

Main category: cs.IR

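TL;DR: Examines the robustness and reliability of large language models (LLMs) as relevance judges in information retrieval tasks, analyzing how prompt sensitivity affects the task.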

Details

Motivation: Assess how LLMs perform at automated relevance judgment, in particular how prompt variations affect the results, to gauge their reliability. Method: Prompts written by human experts and generated by LLMs are collected, three LLMs are used as judges to label document/query pairs, and the labels are compared with the official TREC human labels. Result: The study analyzes how prompt variations affect agreement with human labels, compares human- and LLM-generated prompts, and assesses differences among LLMs as judges. Conclusion: The study supports the potential of LLMs for relevance judgment while underscoring the importance of prompt design, and releases its data and prompts for future research. Abstract: Large Language Models (LLMs) are increasingly used to automate relevance judgments for information retrieval (IR) tasks, often demonstrating agreement with human labels that approaches inter-human agreement. To assess the robustness and reliability of LLM-based relevance judgments, we systematically investigate the impact of prompt sensitivity on the task. We collected prompts for relevance assessment from 15 human experts and 15 LLMs across three tasks (binary, graded, and pairwise), yielding 90 prompts in total. After filtering out unusable prompts from three humans and three LLMs, we employed the remaining 72 prompts with three different LLMs as judges to label document/query pairs from two TREC Deep Learning Datasets (2020 and 2021). We compare LLM-generated labels with TREC official human labels using Cohen's $\kappa$ and pairwise agreement measures. In addition to investigating the impact of prompt variations on agreement with human labels, we compare human- and LLM-generated prompts and analyze differences among different LLMs as judges. We also compare human- and LLM-generated prompts with the standard UMBRELA prompt used for relevance assessment by Bing and TREC 2024 Retrieval Augmented Generation (RAG) Track. To support future research in LLM-based evaluation, we release all data and prompts at https://github.com/Narabzad/prompt-sensitivity-relevance-judgements/.
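
The agreement measure named in this abstract, Cohen's $\kappa$, can be computed in a few lines. The sketch below is a plain-Python illustration of the standard definition; the label lists are invented for the example and are not taken from the paper's data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    # Cohen's kappa: chance-corrected agreement between two annotators.
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both annotators labeled independently
    # according to their own label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)

# Hypothetical binary relevance labels for eight query/document pairs.
human_labels = [1, 0, 1, 1, 0, 0, 1, 0]
llm_labels = [1, 0, 1, 0, 0, 1, 1, 0]
kappa = cohens_kappa(human_labels, llm_labels)
print(kappa)  # 0.5: raw agreement is 0.75, chance agreement is 0.5
```

Raw agreement alone would report 0.75 here; the chance correction is what makes the value comparable across prompts whose label distributions differ.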

[201] Benchmarking LLM-based Relevance Judgment Methods

Negar Arabzadeh,Charles L. A. Clarke

Main category: cs.IR

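TL;DR: Systematically compares multiple LLM-based relevance assessment methods, including binary judgments, graded assessments, pairwise preference methods, and nugget-based methods, and examines how well they align with human preferences.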

Details

Motivation: Prior work has mainly focused on replicating graded human judgments through prompting strategies, with little exploration of alternative assessment methods or comprehensive comparison. Method: Multiple LLM-based assessment methods are compared, including binary, graded, pairwise preference, and nugget-based methods, using Kendall correlations and alignment with human preferences. Result: Experiments are run on the TREC Deep Learning and ANTIQUE datasets; judgments from an open-source and a commercial model are released, and code and data are publicly available. Conclusion: The study provides a comprehensive comparison of LLM-based relevance assessment methods as a reference for future research. Abstract: Large Language Models (LLMs) are increasingly deployed in both academic and industry settings to automate the evaluation of information seeking systems, particularly by generating graded relevance judgments. Previous work on LLM-based relevance assessment has primarily focused on replicating graded human relevance judgments through various prompting strategies. However, there has been limited exploration of alternative assessment methods or comprehensive comparative studies. In this paper, we systematically compare multiple LLM-based relevance assessment methods, including binary relevance judgments, graded relevance assessments, pairwise preference-based methods, and two nugget-based evaluation methods: document-agnostic and document-dependent. In addition to a traditional comparison based on system rankings using Kendall correlations, we also examine how well LLM judgments align with human preferences, as inferred from relevance grades. We conduct extensive experiments on datasets from three TREC Deep Learning tracks 2019, 2020 and 2021 as well as the ANTIQUE dataset, which focuses on non-factoid open-domain question answering. As part of our data release, we include relevance judgments generated by both an open-source (Llama3.2b) and a commercial (gpt-4o) model. Our goal is to reproduce various LLM-based relevance judgment methods to provide a comprehensive comparison. All code, data, and resources are publicly available in our GitHub Repository at https://github.com/Narabzad/llm-relevance-judgement-comparison.
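
The system-ranking comparison mentioned in this abstract rests on Kendall correlation. As a rough illustration, the sketch below computes Kendall's tau-a between system scores obtained under two sets of judgments; the scores are made up, and the abstract does not say which tau variant the paper uses.

```python
from itertools import combinations

def kendall_tau_a(scores_x, scores_y):
    # Kendall's tau-a: (concordant - discordant) pairs over all pairs.
    if len(scores_x) != len(scores_y) or len(scores_x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_x)), 2):
        product = (scores_x[i] - scores_x[j]) * (scores_y[i] - scores_y[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(scores_x) * (len(scores_x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical effectiveness scores for four retrieval systems, evaluated
# once with human judgments and once with LLM judgments.
scores_human = [0.70, 0.65, 0.60, 0.50]
scores_llm = [0.72, 0.60, 0.66, 0.40]
tau = kendall_tau_a(scores_human, scores_llm)
print(tau)  # 5 concordant pairs, 1 discordant pair out of 6
```

A tau near 1 means the LLM judgments would rank the systems almost exactly as the human judgments do, even if the absolute scores differ.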

[202] Towards Lossless Token Pruning in Late-Interaction Retrieval Models

Yuxuan Zong,Benjamin Piwowarski

Main category: cs.IR

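TL;DR: Proposes an approach based on regularization losses for pruning redundant document tokens without affecting retrieval scores, substantially reducing memory footprint.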

Details

Motivation: Late-interaction neural IR models such as ColBERT need a large amount of memory to store contextual representations of document tokens, and existing pruning methods cannot guarantee that retrieval scores are unaffected. Method: Three regularization losses and two pruning strategies are introduced to achieve token pruning at high pruning ratios. Result: Experiments show the approach preserves ColBERT's performance while using only 30% of the tokens. Conclusion: The approach offers an efficient token pruning solution for neural IR models that does not degrade performance. Abstract: Late interaction neural IR models like ColBERT offer a competitive effectiveness-efficiency trade-off across many benchmarks. However, they require a huge memory space to store the contextual representation for all the document tokens. Some works have proposed using either heuristics or statistics-based techniques to prune tokens from each document. This however doesn't guarantee that the removed tokens have no impact on the retrieval score. Our work uses a principled approach to define how to prune tokens without impacting the score between a document and a query. We introduce three regularization losses, that induce a solution with high pruning ratios, as well as two pruning strategies. We study them experimentally (in and out-domain), showing that we can preserve ColBERT's performance while using only 30% of the tokens.
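
To see why some document tokens can be dropped without changing a late-interaction score, consider the MaxSim scoring commonly associated with ColBERT: each query token contributes only its best-matching document token. The sketch below uses tiny hand-made 2-D vectors; it illustrates the scoring rule for one query only and does not reproduce the paper's regularization losses or pruning strategies.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    # Late-interaction score: for each query token, take the maximum
    # similarity over all document tokens, then sum over query tokens.
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

def used_doc_tokens(query_vecs, doc_vecs):
    # Indices of document tokens that achieve the maximum for some query token.
    used = set()
    for q in query_vecs:
        sims = [dot(q, d) for d in doc_vecs]
        used.add(sims.index(max(sims)))
    return used

# Hypothetical embeddings: two query tokens, three document tokens.
query = [[1.0, 0.0], [0.0, 1.0]]
document = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

full_score = maxsim_score(query, document)
kept = used_doc_tokens(query, document)
pruned_document = [document[i] for i in sorted(kept)]
pruned_score = maxsim_score(query, pruned_document)
print(full_score, pruned_score, kept)
```

The third document token never wins the max for this query, so removing it leaves the score unchanged. A lossless guarantee has to hold for every possible query, which is the harder condition the paper targets.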

[203] Building Russian Benchmark for Evaluation of Information Retrieval Models

Grigory Kovalev,Mikhail Tikhomirov,Evgeny Kozhevnikov,Max Kornilov,Natalia Loukachevitch

Main category: cs.IR

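TL;DR: RusBEIR is a comprehensive benchmark for zero-shot evaluation of Russian-language information retrieval models, comprising 17 datasets and supporting systematic comparison of lexical and neural models.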

Details

Motivation: Provide a unified, open-source evaluation framework for Russian-language information retrieval research, filling a gap in the field. Method: Adapted, translated, and newly created datasets are integrated, and lexical models (such as BM25) are compared with neural models (such as mE5-large and BGE-M3). Result: Lexical models depend on preprocessing in morphologically rich languages, and BM25 is a strong baseline for full-document retrieval; neural models perform better on most datasets but are limited by input size in long-document retrieval. Conclusion: RusBEIR is an important tool for Russian-language information retrieval research. Abstract: We introduce RusBEIR, a comprehensive benchmark designed for zero-shot evaluation of information retrieval (IR) models in the Russian language. Comprising 17 datasets from various domains, it integrates adapted, translated, and newly created datasets, enabling systematic comparison of lexical and neural models. Our study highlights the importance of preprocessing for lexical models in morphologically rich languages and confirms BM25 as a strong baseline for full-document retrieval. Neural models, such as mE5-large and BGE-M3, demonstrate superior performance on most datasets, but face challenges with long-document retrieval due to input size constraints. RusBEIR offers a unified, open-source framework that promotes research in Russian-language information retrieval.

[204] FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents

Nandan Thakur,Jimmy Lin,Sam Havens,Michael Carbin,Omar Khattab,Andrew Drozdov

Main category: cs.IR

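TL;DR: FreshStack is a reusable framework for automatically building information retrieval (IR) evaluation benchmarks from community questions and answers, covering corpus collection, nugget generation, and hybrid retrieval techniques.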

Details

Motivation: Build realistic, scalable, and uncontaminated IR and RAG evaluation benchmarks, and expose where existing models fall short on fast-growing and niche topics. Method: Datasets are built through automatic corpus collection, nugget generation from questions and answers, and nugget-level support using a fusion of retrieval techniques. Result: Existing retrieval models significantly underperform oracle approaches on the FreshStack datasets, and on some topics rerankers do not clearly improve accuracy. Conclusion: FreshStack provides a tool for building high-quality IR evaluation benchmarks and reveals headroom for improving existing models. Abstract: We introduce FreshStack, a reusable framework for automatically building information retrieval (IR) evaluation benchmarks from community-asked questions and answers. FreshStack conducts the following steps: (1) automatic corpus collection from code and technical documentation, (2) nugget generation from community-asked questions and answers, and (3) nugget-level support, retrieving documents using a fusion of retrieval techniques and hybrid architectures. We use FreshStack to build five datasets on fast-growing, recent, and niche topics to ensure the tasks are sufficiently challenging. On FreshStack, existing retrieval models, when applied out-of-the-box, significantly underperform oracle approaches on all five topics, denoting plenty of headroom to improve IR quality. In addition, we identify cases where rerankers do not clearly improve first-stage retrieval accuracy (two out of five topics). We hope that FreshStack will facilitate future work toward constructing realistic, scalable, and uncontaminated IR and RAG evaluation benchmarks. FreshStack datasets are available at: https://fresh-stack.github.io.

Haoxuan Li,Yi Bin,Yunshan Ma,Guoqing Wang,Yang Yang,See-Kiong Ng,Tat-Seng Chua

Main category: cs.IR

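TL;DR: SemCORE is a new generative cross-modal retrieval framework that strengthens semantic understanding through structured natural language identifiers and a generative semantic verification strategy, outperforming existing methods.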

Details

Motivation: Traditional cross-modal retrieval relies on embedding-based similarity; generative retrieval is promising but suffers from insufficient semantic information. Method: Proposes the SemCORE framework, comprising a Structured natural language IDentifier (SID) and a Generative Semantic Verification (GSV) strategy, and supporting both text-to-image and image-to-text retrieval. Result: On benchmark datasets, SemCORE improves Recall@1 for text-to-image retrieval by 8.65 points on average. Conclusion: By strengthening semantic understanding, SemCORE improves the performance of generative cross-modal retrieval. Abstract: Cross-modal retrieval (CMR) is a fundamental task in multimedia research, focused on retrieving semantically relevant targets across different modalities. While traditional CMR methods match text and image via embedding-based similarity calculations, recent advancements in pre-trained generative models have established generative retrieval as a promising alternative. This paradigm assigns each target a unique identifier and leverages a generative model to directly predict identifiers corresponding to input queries without explicit indexing. Despite its great potential, current generative CMR approaches still face semantic information insufficiency in both identifier construction and generation processes. To address these limitations, we propose a novel unified Semantic-enhanced generative Cross-mOdal REtrieval framework (SemCORE), designed to unleash the semantic understanding capabilities in generative cross-modal retrieval task. Specifically, we first construct a Structured natural language IDentifier (SID) that effectively aligns target identifiers with generative models optimized for natural language comprehension and generation. Furthermore, we introduce a Generative Semantic Verification (GSV) strategy enabling fine-grained target discrimination. Additionally, to the best of our knowledge, SemCORE is the first framework to simultaneously consider both text-to-image and image-to-text retrieval tasks within generative cross-modal retrieval. Extensive experiments demonstrate that our framework outperforms state-of-the-art generative cross-modal retrieval methods. Notably, SemCORE achieves substantial improvements across benchmark datasets, with an average increase of 8.65 points in Recall@1 for text-to-image retrieval.
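
For readers unfamiliar with the metric in the reported gain, Recall@1 is the share of queries whose top-ranked result is the correct target, and a gain in "points" is a difference on the percentage scale. The sketch below assumes one relevant target per query; the identifiers and rankings are hypothetical.

```python
def recall_at_k(ranked_ids_per_query, relevant_id_per_query, k=1):
    # Fraction of queries whose single relevant target appears in the top k.
    if len(ranked_ids_per_query) != len(relevant_id_per_query):
        raise ValueError("need one ranking and one relevant id per query")
    hits = sum(
        relevant in ranked[:k]
        for ranked, relevant in zip(ranked_ids_per_query, relevant_id_per_query)
    )
    return hits / len(relevant_id_per_query)

# Hypothetical text-to-image results for four queries.
relevant = ["img_a", "img_b", "img_c", "img_d"]
baseline_rankings = [
    ["img_a", "img_x"], ["img_y", "img_b"], ["img_c", "img_z"], ["img_w", "img_d"],
]
improved_rankings = [
    ["img_a", "img_x"], ["img_b", "img_y"], ["img_c", "img_z"], ["img_w", "img_d"],
]
baseline_r1 = recall_at_k(baseline_rankings, relevant, k=1)
improved_r1 = recall_at_k(improved_rankings, relevant, k=1)
gain_in_points = 100 * (improved_r1 - baseline_r1)
print(baseline_r1, improved_r1, gain_in_points)  # 0.5 0.75 25.0
```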

cs.CY

[206] Large Language Model-Based Knowledge Graph System Construction for Sustainable Development Goals: An AI-Based Speculative Design Perspective

Yi-De Lin,Guan-Ze Liao

Main category: cs.CY

TL;DR: Develops an AI-powered knowledge graph system that analyzes SDG interconnections and proposes new goals, offering fresh insights for policymakers.

Details

Motivation: As 2030 nears and SDG progress lags, innovative strategies are needed to accelerate the goals. Method: Uses official SDG texts, Elsevier's keyword dataset, and TED Talk transcripts, combined with AI-speculative design, large language models, and retrieval-augmented generation. Result: Goal 10 and Goal 16 are strongly associated, while Goal 6 has minimal coverage; the knowledge graph reveals new central nodes; six potential new goals are proposed. Conclusion: The speculative-AI framework offers fresh insights for policymakers and lays groundwork for future multimodal and cross-system SDG applications. Abstract: From 2000 to 2015, the UN's Millennium Development Goals guided global priorities. The subsequent Sustainable Development Goals (SDGs) adopted a more dynamic approach, with annual indicator updates. As 2030 nears and progress lags, innovative acceleration strategies are critical. This study develops an AI-powered knowledge graph system to analyze SDG interconnections, discover potential new goals, and visualize them online. Using official SDG texts, Elsevier's keyword dataset, and 1,127 TED Talk transcripts (2020-2023), a pilot on 269 talks from 2023 applies AI-speculative design, large language models, and retrieval-augmented generation. Key findings include: (1) Heatmap analysis reveals strong associations between Goal 10 and Goal 16, and minimal coverage of Goal 6. (2) In the knowledge graph, simulated dialogue over time reveals new central nodes, showing how richer data supports divergent thinking and goal clarity. (3) Six potential new goals are proposed, centered on equity, resilience, and technology-driven inclusion. This speculative-AI framework offers fresh insights for policymakers and lays groundwork for future multimodal and cross-system SDG applications.

[207] Knowledge Acquisition on Mass-shooting Events via LLMs for AI-Driven Justice

Benign John Ihugba,Afsana Nasrin,Ling Wu,Lin Li,Lijun Qian,Xishuang Dong

Main category: cs.CY

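TL;DR: Presents the first dataset for knowledge acquisition on mass-shooting events, using NER to extract key entities (such as offenders, victims, and locations) with LLMs such as GPT-4o. Experiments show GPT-4o performs best, with o1-mini as a resource-efficient alternative.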

Details

Motivation: Mass-shooting events generate large volumes of unstructured text, and few prior studies have automated the extraction of key information to support legal and investigative work. Method: Named entity recognition (NER) is applied with LLMs (such as GPT-4o) using few-shot prompting to extract key entities from sources including news articles and police reports. Result: GPT-4o performs best on the NER task, and o1-mini delivers competitive performance as a resource-efficient option; increasing the shot count improves all models, most of all GPT-4o and o1-mini. Conclusion: The study provides an efficient tool for information extraction on mass-shooting events, with GPT-4o and o1-mini each suited to different settings. Abstract: Mass-shooting events pose a significant challenge to public safety, generating large volumes of unstructured textual data that hinder effective investigations and the formulation of public policy. Despite the urgency, few prior studies have effectively automated the extraction of key information from these events to support legal and investigative efforts. This paper presents the first dataset designed for knowledge acquisition on mass-shooting events through the application of named entity recognition (NER) techniques. It focuses on identifying key entities such as offenders, victims, locations, and criminal instruments, that are vital for legal and investigative purposes. The NER process is powered by Large Language Models (LLMs) using few-shot prompting, facilitating the efficient extraction and organization of critical information from diverse sources, including news articles, police reports, and social media. Experimental results on real-world mass-shooting corpora demonstrate that GPT-4o is the most effective model for mass-shooting NER, achieving the highest Micro Precision, Micro Recall, and Micro F1-scores. Meanwhile, o1-mini delivers competitive performance, making it a resource-efficient alternative for less complex NER tasks. It is also observed that increasing the shot count enhances the performance of all models, but the gains are more substantial for GPT-4o and o1-mini, highlighting their superior adaptability to few-shot learning scenarios.
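
The micro-averaged scores reported in this abstract pool true positives, false positives, and false negatives over all documents before computing precision, recall, and F1. The sketch below assumes exact-match scoring on (type, text) pairs; the matching rule, entity types, and example entities are illustrative assumptions, not the paper's protocol or data.

```python
def micro_prf(gold_docs, pred_docs):
    # Pool counts over all documents, then compute the three scores once.
    if len(gold_docs) != len(pred_docs):
        raise ValueError("need one prediction set per gold document")
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        gold, pred = set(gold), set(pred)
        tp += len(gold & pred)   # entities predicted with exact type and text
        fp += len(pred - gold)   # predicted entities not in the gold set
        fn += len(gold - pred)   # gold entities the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical gold and predicted entities for two short documents.
gold = [
    [("LOCATION", "Springfield"), ("INSTRUMENT", "handgun")],
    [("LOCATION", "Riverton")],
]
pred = [
    [("LOCATION", "Springfield"), ("INSTRUMENT", "rifle")],
    [("LOCATION", "Riverton")],
]
precision, recall, f1 = micro_prf(gold, pred)
print(precision, recall, f1)  # 2 TP, 1 FP, 1 FN: each score is 2/3
```

Because counts are pooled, frequent entity types dominate micro scores; macro averaging would weight each type equally instead.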

[208] How Large Language Models Are Changing MOOC Essay Answers: A Comparison of Pre- and Post-LLM Responses

Leo Leppänen,Lili Aunimo,Arto Hellas,Jukka K. Nurminen,Linda Mannila

Main category: cs.CY

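TL;DR: Studies the effect of ChatGPT's release on online education by analyzing changes in student essays from a MOOC, finding significant changes in length and style but not in the topics discussed.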

Details

Motivation: Examine the practical effects of large language models such as ChatGPT on online education, particularly potential changes in academic integrity and in how students learn. Method: A multi-year dataset of student essays from a MOOC is analyzed, comparing length, style, and keyword frequency before and after ChatGPT's release. Result: After ChatGPT's release, the length and style of student essays changed significantly and AI-related keywords became more prevalent, but the topics did not necessarily change. Conclusion: Large language models have had a quantifiable effect on online education without changing the core themes students discuss. Abstract: The release of ChatGPT in late 2022 caused a flurry of activity and concern in the academic and educational communities. Some see the tool's ability to generate human-like text that passes at least cursory inspections for factual accuracy "often enough" as heralding a golden age of information retrieval and computer-assisted learning. Some, on the other hand, worry the tool may lead to unprecedented levels of academic dishonesty and cheating. In this work, we quantify some of the effects of the emergence of Large Language Models (LLMs) on online education by analyzing a multi-year dataset of student essay responses from a free university-level MOOC on AI ethics. Our dataset includes essays submitted both before and after ChatGPT's release. We find that the launch of ChatGPT coincided with significant changes in both the length and style of student essays, mirroring observations in other contexts such as academic publishing. We also observe -- as expected based on related public discourse -- changes in prevalence of key content words related to AI and LLMs, but not necessarily the general themes or topics discussed in the student essays as identified through (dynamic) topic modeling.

Georgina Curto,Svetlana Kiritchenko,Muhammad Hammad Fahim Siddiqui,Isar Nejadgholi,Kathleen C. Fraser

Main category: cs.CY

TL;DR: Aims to identify and track bias against people living in poverty (aporophobia) using social media data, in support of poverty-mitigation policy.

Details

Motivation: Eradicating poverty is the first of the UN Sustainable Development Goals, but societal bias against people living in poverty obstructs the design and implementation of related policies. Method: In collaboration with non-profits and governmental organizations, English tweets are collected and annotated, and classifiers are built to detect aporophobia automatically. Result: A taxonomy of aporophobia is devised and classifiers are trained, but automatic detection remains challenging. Conclusion: The study lays the groundwork for identifying and reducing aporophobia on social media at scale. Abstract: Eradicating poverty is the first goal in the United Nations Sustainable Development Goals. However, aporophobia -- the societal bias against people living in poverty -- constitutes a major obstacle to designing, approving and implementing poverty-mitigation policies. This work presents an initial step towards operationalizing the concept of aporophobia to identify and track harmful beliefs and discriminative actions against poor people on social media. In close collaboration with non-profits and governmental organizations, we conduct data collection and exploration. Then we manually annotate a corpus of English tweets from five world regions for the presence of (1) direct expressions of aporophobia, and (2) statements referring to or criticizing aporophobic views or actions of others, to comprehensively characterize the social media discourse related to bias and discrimination against the poor. Based on the annotated data, we devise a taxonomy of categories of aporophobic attitudes and actions expressed through speech on social media. Finally, we train several classifiers and identify the main challenges for automatic detection of aporophobia in social networks. This work paves the way towards identifying, tracking, and mitigating aporophobic views on social media at scale.

eess.IV

[210] Regist3R: Incremental Registration with Stereo Foundation Model

Sidun Liu,Wenyu Li,Peng Qiao,Yong Dou

Main category: eess.IV

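TL;DR: Regist3R is a novel stereo foundation model for efficient, scalable incremental reconstruction that addresses the computational cost and global-alignment error of multi-view 3D reconstruction.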

Details

Motivation: Multi-view 3D reconstruction remains challenging in computer vision, and existing methods such as DUSt3R incur high computational cost and cumulative error when scaled to multi-view scenarios. Method: Regist3R adopts an incremental reconstruction paradigm suited to large-scale 3D reconstruction from unordered, many-view image collections. Result: On public datasets, Regist3R performs comparably to optimization-based methods with much better computational efficiency and outperforms existing multi-view reconstruction models; its effectiveness is also shown on an oblique aerial dataset. Conclusion: Regist3R shows potential for large-scale scene reconstruction in applications such as urban modeling and aerial mapping. Abstract: Multi-view 3D reconstruction has remained an essential yet challenging problem in the field of computer vision. While DUSt3R and its successors have achieved breakthroughs in 3D reconstruction from unposed images, these methods exhibit significant limitations when scaling to multi-view scenarios, including high computational cost and cumulative error induced by global alignment. To address these challenges, we propose Regist3R, a novel stereo foundation model tailored for efficient and scalable incremental reconstruction. Regist3R leverages an incremental reconstruction paradigm, enabling large-scale 3D reconstructions from unordered and many-view image collections. We evaluate Regist3R on public datasets for camera pose estimation and 3D reconstruction. Our experiments demonstrate that Regist3R achieves comparable performance with optimization-based methods while significantly improving computational efficiency, and outperforms existing multi-view reconstruction models. Furthermore, to assess its performance in real-world applications, we introduce a challenging oblique aerial dataset which has long spatial spans and hundreds of views. The results highlight the effectiveness of Regist3R. We also demonstrate the first attempt to reconstruct large-scale scenes encompassing thousands of views through pointmap-based foundation models, showcasing its potential for practical applications in large-scale 3D reconstruction tasks, including urban modeling, aerial mapping, and beyond.

[211] TUMLS: Trustful Fully Unsupervised Multi-Level Segmentation for Whole Slide Images of Histology

Walid Rehamnia,Alexandra Getmanskaya,Evgeniy Vasilyev,Vadim Turlapov

Main category: eess.IV

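TL;DR: Proposes TUMLS, an unsupervised multi-level segmentation method that addresses obstacles to AI in digital pathology such as costly annotation, high computational demands, and the lack of uncertainty estimation. It extracts features with an autoencoder and selects representative patches for unsupervised nuclei segmentation, improving workflow efficiency and transparency.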

Details

Motivation: Current AI methods in histopathology face costly annotation, high computational demands, and a lack of uncertainty estimation, which limit their practical use. Method: TUMLS uses an autoencoder as a feature extractor to identify tissue types in low-resolution data, then selects representative patches for unsupervised nuclei segmentation without using any ML algorithms. Result: On the UPENN-GBM dataset the autoencoder reaches an MSE of 0.0016; on the MoNuSeg dataset nuclei segmentation achieves an F1 score of 77.46% and a Jaccard score of 63.35%, outperforming other unsupervised approaches. Conclusion: Through unsupervised multi-level segmentation, TUMLS improves the efficiency and transparency of the digital pathology workflow. Abstract: Digital pathology, augmented by artificial intelligence (AI), holds significant promise for improving the workflow of pathologists. However, challenges such as the labor-intensive annotation of whole slide images (WSIs), high computational demands, and trust concerns arising from the absence of uncertainty estimation in predictions hinder the practical application of current AI methodologies in histopathology. To address these issues, we present a novel trustful fully unsupervised multi-level segmentation methodology (TUMLS) for WSIs. TUMLS adopts an autoencoder (AE) as a feature extractor to identify the different tissue types within low-resolution training data. It selects representative patches from each identified group based on an uncertainty measure and then does unsupervised nuclei segmentation in their respective higher-resolution space without using any ML algorithms. Crucially, this solution integrates seamlessly into clinicians' workflows, transforming the examination of a whole WSI into a review of concise, interpretable cross-level insights. This integration significantly enhances and accelerates the workflow while ensuring transparency. We evaluated our approach using the UPENN-GBM dataset, where the AE achieved a mean squared error (MSE) of 0.0016. Additionally, nucleus segmentation is assessed on the MoNuSeg dataset, outperforming all unsupervised approaches with an F1 score of 77.46% and a Jaccard score of 63.35%. These results demonstrate the efficacy of TUMLS in advancing the field of digital pathology.
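
The two segmentation scores quoted in this abstract are both overlap measures. As a rough guide to how they relate, the sketch below computes pixel-level F1 (Dice) and Jaccard on tiny binary masks. Pixel-level scoring is an assumption here; the paper's evaluation may be object-level or averaged per image, and the masks are invented.

```python
def overlap_scores(pred_mask, true_mask):
    # Count pixel-level true positives, false positives, and false negatives.
    tp = fp = fn = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    union = tp + fp + fn
    if union == 0:
        return 1.0, 1.0  # both masks empty: treat as perfect agreement
    f1 = 2 * tp / (2 * tp + fp + fn)
    jaccard = tp / union
    return f1, jaccard

# Hypothetical 2x3 binary masks (1 = nucleus pixel).
predicted = [[1, 1, 0],
             [0, 1, 0]]
reference = [[1, 0, 0],
             [0, 1, 1]]
f1, jaccard = overlap_scores(predicted, reference)
print(f1, jaccard)  # 2 TP, 1 FP, 1 FN: F1 = 4/6, Jaccard = 2/4
```

For a single pair of masks the two are tied by F1 = 2J / (1 + J), which is why an F1 near 77% and a Jaccard near 63% are consistent with each other.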

[212] Towards Cardiac MRI Foundation Models: Comprehensive Visual-Tabular Representations for Whole-Heart Assessment and Beyond

Yundi Zhang,Paul Hager,Che Liu,Suprosanna Shit,Chen Chen,Daniel Rueckert,Jiazhen Pan

Main category: eess.IV

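TL;DR: ViTa is a multi-modal framework that combines cardiac magnetic resonance imaging with patient-level factors to provide comprehensive cardiac health assessment and disease risk prediction.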

Details

Motivation: Cardiac magnetic resonance imaging (CMR) is the gold standard but does not capture patient-level factors, which limits how complete disease risk assessment can be. Method: ViTa integrates 3D+T imaging data with tabular patient data and learns a shared latent representation that supports multiple downstream tasks. Result: Using data from 42,000 UK Biobank participants, ViTa supports cardiac phenotype prediction, segmentation, and disease classification. Conclusion: ViTa moves toward a universal, patient-specific understanding of cardiac health, with potential for clinical utility and scalability. Abstract: Cardiac magnetic resonance imaging is the gold standard for non-invasive cardiac assessment, offering rich spatio-temporal views of the cardiac anatomy and physiology. Patient-level health factors, such as demographic, metabolic, and lifestyle factors, are known to substantially influence cardiovascular health and disease risk, yet remain uncaptured by CMR alone. To holistically understand cardiac health and to enable the best possible interpretation of an individual's disease risk, CMR and patient-level factors must be jointly exploited within an integrated framework. Recent multi-modal approaches have begun to bridge this gap, yet they often rely on limited spatio-temporal data and focus on isolated clinical tasks, thereby hindering the development of a comprehensive representation for cardiac health evaluation. To overcome these limitations, we introduce ViTa, a step toward foundation models that delivers a comprehensive representation of the heart and a precise interpretation of individual disease risk. Leveraging data from 42,000 UK Biobank participants, ViTa integrates 3D+T cine stacks from short-axis and long-axis views, enabling a complete capture of the cardiac cycle. These imaging data are then fused with detailed tabular patient-level factors, enabling context-aware insights. This multi-modal paradigm supports a wide spectrum of downstream tasks, including cardiac phenotype and physiological feature prediction, segmentation, and classification of cardiac and metabolic diseases within a single unified framework. By learning a shared latent representation that bridges rich imaging features and patient context, ViTa moves beyond traditional, task-specific models toward a universal, patient-specific understanding of cardiac health, highlighting its potential to advance clinical utility and scalability in cardiac analysis.

[213] NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement: Methods and Results

Xin Li,Kun Yuan,Bingchen Li,Fengbin Guan,Yizhen Shao,Zihao Yu,Xijun Wang,Yiting Lu,Wei Luo,Suhang Yao,Ming Sun,Chao Zhou,Zhibo Chen,Radu Timofte,Yabin Zhang,Ao-Xiang Zhang,Tianwu Zhi,Jianzhao Liu,Yang Li,Jingwen Xu,Yiting Liao,Yushen Zuo,Mingyang Wu,Renjie Li,Shengyun Zhong,Zhengzhong Tu,Yufan Liu,Xiangguang Chen,Zuowei Cao,Minhao Tang,Shan Liu,Kexin Zhang,Jingfen Xie,Yan Wang,Kai Chen,Shijie Zhao,Yunchen Zhang,Xiangkai Xu,Hong Gao,Ji Shi,Yiming Bao,Xiugang Dong,Xiangsheng Zhou,Yaofeng Tu,Ying Liang,Yiwen Wang,Xinning Chai,Yuxuan Zhang,Zhengxue Cheng,Yingsheng Qin,Yucai Yang,Rong Xie,Li Song,Wei Sun,Kang Fu,Linhan Cao,Dandan Zhu,Kaiwei Zhang,Yucheng Zhu,Zicheng Zhang,Menghan Hu,Xiongkuo Min,Guangtao Zhai,Zhi Jin,Jiawei Wu,Wei Wang,Wenjian Zhang,Yuhai Lan,Gaoxiong Yi,Hengyuan Na,Wang Luo,Di Wu,MingYin Bai,Jiawang Du,Zilong Lu,Zhenyu Jiang,Hui Zeng,Ziguan Cui,Zongliang Gan,Guijin Tang,Xinglin Xie,Kehuan Song,Xiaoqiang Lu,Licheng Jiao,Fang Liu,Xu Liu,Puhua Chen,Ha Thu Nguyen,Katrien De Moor,Seyed Ali Amirshahi,Mohamed-Chaker Larabi,Qi Tang,Linfeng He,Zhiyong Gao,Zixuan Gao,Guohua Zhang,Zhiye Huang,Yi Deng,Qingmiao Jiang,Lu Chen,Yi Yang,Xi Liao,Nourine Mohammed Nadir,Yuxuan Jiang,Qiang Zhu,Siyue Teng,Fan Zhang,Shuyuan Zhu,Bing Zeng,David Bull,Meiqin Liu,Chao Yao,Yao Zhao

Main category: eess.IV

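TL;DR: The NTIRE 2025 challenge focuses on quality assessment and enhancement for short-form user-generated content (UGC) video. It comprises two tracks, Efficient Video Quality Assessment (KVQ) and Diffusion-based Image Super-Resolution (KwaiSR), and aims to advance lightweight models and improve user experience.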

Details

Motivation: The challenge aims to reduce reliance on computationally expensive components and to drive research on user experience for short-form UGC platforms such as Kwai and TikTok. Method: Track 1 develops lightweight video quality assessment models; Track 2 introduces the KwaiSR dataset, which contains synthetic image pairs and real-world images. Result: The challenge attracted 266 participants and received 18 valid submissions, significantly advancing research on short-form UGC video quality assessment and image super-resolution. Conclusion: The challenge advanced the related techniques, and the project is publicly available. Abstract: This paper presents a review of the NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement. The challenge comprises two tracks: (i) Efficient Video Quality Assessment (KVQ), and (ii) Diffusion-based Image Super-Resolution (KwaiSR). Track 1 aims to advance the development of lightweight and efficient video quality assessment (VQA) models, with an emphasis on eliminating reliance on model ensembles, redundant weights, and other computationally expensive components used in previous IQA/VQA competitions. Track 2 introduces a new short-form UGC (S-UGC) dataset tailored for single image super-resolution, i.e., the KwaiSR dataset. It consists of 1,800 synthetically generated S-UGC image pairs and 1,900 real-world S-UGC images, which are split into training, validation, and test sets using a ratio of 8:1:1. The primary objective of the challenge is to drive research that benefits the user experience of short-form UGC platforms such as Kwai and TikTok. This challenge attracted 266 participants and received 18 valid final submissions with corresponding fact sheets, significantly contributing to the progress of short-form UGC VQA and image super-resolution. The project is publicly available at https://github.com/lixinustc/KVQE-ChallengeCVPR-NTIRE2025.

cs.HC

[214] MobilePoser: Real-Time Full-Body Pose Estimation and 3D Human Translation from IMUs in Mobile Consumer Devices

Vasco Xu,Chenfeng Gao,Henry Hoffmann,Karan Ahuja

Main category: cs.HC

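TL;DR: MobilePoser is a real-time system that estimates full-body pose and global translation from any available subset of the IMUs already present in consumer devices, achieving high accuracy through a multi-stage deep neural network followed by physics-based optimization.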

Details

Motivation: As motion capture migrates to lower-fidelity IMUs in devices such as phones, watches, and earbuds, sensor noise and drift compromise online performance and temporal consistency and cause loss of global translation. Method: MobilePoser uses a multi-stage deep neural network for kinematic pose estimation, followed by a physics-based motion optimizer. Result: The system achieves state-of-the-art accuracy while remaining lightweight. Conclusion: MobilePoser shows unique potential in areas such as health, gaming, and indoor navigation. Abstract: There has been a continued trend towards minimizing instrumentation for full-body motion capture, going from specialized rooms and equipment, to arrays of worn sensors and recently sparse inertial pose capture methods. However, as these techniques migrate towards lower-fidelity IMUs on ubiquitous commodity devices, like phones, watches, and earbuds, challenges arise including compromised online performance, temporal consistency, and loss of global translation due to sensor noise and drift. Addressing these challenges, we introduce MobilePoser, a real-time system for full-body pose and global translation estimation using any available subset of IMUs already present in these consumer devices. MobilePoser employs a multi-stage deep neural network for kinematic pose estimation followed by a physics-based motion optimizer, achieving state-of-the-art accuracy while remaining lightweight. We conclude with a series of demonstrative applications to illustrate the unique potential of MobilePoser across a variety of fields, such as health and wellness, gaming, and indoor navigation, to name a few.

[215] Multimodal LLM Augmented Reasoning for Interpretable Visual Perception Analysis

Shravan Chaudhari,Trilokya Akula,Yoon Kim,Tom Blake

Main category: cs.HC

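TL;DR: The study examines the use of multimodal large language models (MLLMs) for visual perception tasks and proposes an annotation-free analytical framework for assessing the utility of MLLMs as cognitive assistants.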

Details

Motivation: Drawing on principles from psychology and cognitive science, the work studies how well MLLMs can explain visual perception, with the aim of improving human reasoning and uncovering biases in existing datasets. Method: Complexity principles from psychology and cognitive science guide the MLLMs in interpreting visual content, within a proposed annotation-free analytical framework. Result: The work lays a principled basis for quantifying the interpretability of MLLMs and shows their potential for HCI tasks. Conclusion: The study opens a new direction for applying MLLMs to visual perception and highlights their role in improving human cognitive capability. Abstract: In this paper, we advance the study of AI-augmented reasoning in the context of Human-Computer Interaction (HCI), psychology and cognitive science, focusing on the critical task of visual perception. Specifically, we investigate the applicability of Multimodal Large Language Models (MLLMs) in this domain. To this end, we leverage established principles and explanations from psychology and cognitive science related to complexity in human visual perception. We use them as guiding principles for the MLLMs to compare and interpret visual content. Our study aims to benchmark MLLMs across various explainability principles relevant to visual perception. Unlike recent approaches that primarily employ advanced deep learning models to predict complexity metrics from visual content, our work does not seek to develop merely another predictive model. Instead, we propose a novel annotation-free analytical framework to assess the utility of MLLMs as cognitive assistants for HCI tasks, using visual perception as a case study. The primary goal is to pave the way for principled study in quantifying and evaluating the interpretability of MLLMs for applications in improving human reasoning capability and uncovering biases in existing perception datasets annotated by humans.

eess.AS

[216] EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting

Guanrou Yang,Chen Yang,Qian Chen,Ziyang Ma,Wenxi Chen,Wen Wang,Tianrui Wang,Yifan Yang,Zhikang Niu,Wenrui Liu,Fan Yu,Zhihao Du,Zhifu Gao,ShiLiang Zhang,Xie Chen

Main category: eess.AS

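TL;DR: EmoVoice is a novel emotion-controllable TTS model that uses LLMs for fine-grained emotion control and improves content consistency by outputting phoneme tokens and audio tokens in parallel.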

Details

Motivation: Existing TTS models fall short in controlling emotional expression, and EmoVoice aims to address this. Method: EmoVoice uses LLMs for natural-language emotion control, adopts a phoneme boost design, and introduces the high-quality emotion dataset EmoVoice-DB. Result: It achieves SOTA performance on the English EmoVoice-DB test set and the Chinese Secap test set, and the reliability of emotion evaluation metrics is examined. Conclusion: EmoVoice performs strongly in emotion-controllable TTS; the dataset, code, and models will be released. Abstract: Human speech goes beyond the mere transfer of information; it is a profound exchange of emotions and a connection between individuals. While Text-to-Speech (TTS) models have made huge progress, they still face challenges in controlling the emotional expression in the generated speech. In this work, we propose EmoVoice, a novel emotion-controllable TTS model that exploits large language models (LLMs) to enable fine-grained freestyle natural language emotion control, and a phoneme boost variant design that makes the model output phoneme tokens and audio tokens in parallel to enhance content consistency, inspired by chain-of-thought (CoT) and chain-of-modality (CoM) techniques. In addition, we introduce EmoVoice-DB, a high-quality 40-hour English emotion dataset featuring expressive speech and fine-grained emotion labels with natural language descriptions. EmoVoice achieves state-of-the-art performance on the English EmoVoice-DB test set using only synthetic training data, and on the Chinese Secap test set using our in-house data. We further investigate the reliability of existing emotion evaluation metrics and their alignment with human perceptual preferences, and explore using the SOTA multimodal LLMs GPT-4o-audio and Gemini to assess emotional speech. Demo samples are available at https://anonymous.4open.science/r/EmoVoice-DF55. Dataset, code, and checkpoints will be released.

cs.RO

[217] RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins

Yao Mu,Tianxing Chen,Zanxin Chen,Shijia Peng,Zhiqian Lan,Zeyu Gao,Zhixuan Liang,Qiaojun Yu,Yude Zou,Mingkun Xu,Lunkai Lin,Zhiqiang Xie,Mingyu Ding,Ping Luo

Main category: cs.RO

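TL;DR: RoboTwin is a generative digital twin framework that uses 3D generative foundation models and large language models to provide diverse expert datasets and a real-world-aligned evaluation platform for dual-arm robotic tasks.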

Details

Motivation: High-quality demonstration data and real-world-aligned evaluation benchmarks are scarce for dual-arm coordination and complex object manipulation. Method: RoboTwin generates varied digital twins of objects from single 2D images and uses a spatial relation-aware code generation framework to break down tasks and generate precise robotic movement code. Result: Validated on the COBOT Magic Robot platform, pre-trained policies improve success rates by over 70% on single-arm tasks and over 40% on dual-arm tasks. Conclusion: The RoboTwin framework offers standardized evaluation and better alignment between simulated and real-world performance for developing dual-arm robotic manipulation systems. Abstract: In the rapidly advancing field of robotics, dual-arm coordination and complex object manipulation are essential capabilities for developing advanced autonomous systems. However, the scarcity of diverse, high-quality demonstration data and real-world-aligned evaluation benchmarks severely limits such development. To address this, we introduce RoboTwin, a generative digital twin framework that uses 3D generative foundation models and large language models to produce diverse expert datasets and provide a real-world-aligned evaluation platform for dual-arm robotic tasks. Specifically, RoboTwin creates varied digital twins of objects from single 2D images, generating realistic and interactive scenarios. It also introduces a spatial relation-aware code generation framework that combines object annotations with large language models to break down tasks, determine spatial constraints, and generate precise robotic movement code. Our framework offers a comprehensive benchmark with both simulated and real-world data, enabling standardized evaluation and better alignment between simulated training and real-world performance. We validated our approach using the open-source COBOT Magic Robot platform. Policies pre-trained on RoboTwin-generated data and fine-tuned with limited real-world samples demonstrate significant potential for enhancing dual-arm robotic manipulation systems by improving success rates by over 70% for single-arm tasks and over 40% for dual-arm tasks compared to models trained solely on real-world data.

[218] Explainable Scene Understanding with Qualitative Representations and Graph Neural Networks

Nassim Belmecheri,Arnaud Gotlieb,Nadjib Lazaar,Helge Spieker

Main category: cs.RO

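TL;DR: This paper studies combining graph neural networks (GNNs) with Qualitative Explainable Graphs (QXGs) for scene understanding in automated driving, proposes a novel GNN architecture, and shows superior performance on the nuScenes dataset.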

Details

Motivation: Scene understanding underpins decision-making in automated driving, but existing approaches are limited to analysing single relation chains and ignore the broader scene context. Method: A novel GNN architecture processes entire graph structures to identify relevant objects in traffic scenes. Result: Experiments on the nuScenes dataset show that the method outperforms baseline methods and handles class imbalance effectively. Conclusion: Combining qualitative representations with deep learning shows potential for scene understanding in autonomous driving. Abstract: This paper investigates the integration of graph neural networks (GNNs) with Qualitative Explainable Graphs (QXGs) for scene understanding in automated driving. Scene understanding is the basis for any further reactive or proactive decision-making. Scene understanding and related reasoning is inherently an explanation task: why is another traffic participant doing something, what or who caused their actions? While previous work demonstrated QXGs' effectiveness using shallow machine learning models, these approaches were limited to analysing single relation chains between object pairs, disregarding the broader scene context. We propose a novel GNN architecture that processes entire graph structures to identify relevant objects in traffic scenes. We evaluate our method on the nuScenes dataset enriched with DriveLM's human-annotated relevance labels. Experimental results show that our GNN-based approach achieves superior performance compared to baseline methods. The model effectively handles the inherent class imbalance in relevant object identification tasks while considering the complete spatial-temporal relationships between all objects in the scene. Our work demonstrates the potential of combining qualitative representations with deep learning approaches for explainable scene understanding in autonomous driving systems.

[219] UncAD: Towards Safe End-to-end Autonomous Driving via Online Map Uncertainty

Pengxuan Yang,Yupeng Zheng,Qichao Zhang,Kefei Zhu,Zebin Xing,Qiao Lin,Yun-Fu Liu,Zhiguo Su,Dongbin Zhao

Main category: cs.RO

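TL;DR: UncAD proposes a new paradigm that estimates online map uncertainty in the perception module to improve autonomous driving safety, reducing collision rates and drivable area conflict rates.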

Details

Motivation: Current end-to-end autonomous driving methods rely on deterministic modeling of online maps, which can introduce erroneous perception information and compromise planning safety. Method: UncAD estimates online map uncertainty, uses that uncertainty to guide multi-modal trajectory generation, and proposes an uncertainty-collision-aware planning selection strategy. Result: On the nuScenes dataset, UncAD adds only 1.9% more parameters while reducing collision rates by up to 26% and drivable area conflict rates by up to 42%. Conclusion: UncAD improves autonomous driving safety through uncertainty modeling at a small computational overhead. Abstract: End-to-end autonomous driving aims to produce planning trajectories from raw sensors directly. Currently, most approaches integrate perception, prediction, and planning modules into a fully differentiable network, promising great scalability. However, these methods typically rely on deterministic modeling of online maps in the perception module for guiding or constraining vehicle planning, which may incorporate erroneous perception information and further compromise planning safety. To address this issue, we delve into the importance of online map uncertainty for enhancing autonomous driving safety and propose a novel paradigm named UncAD. Specifically, UncAD first estimates the uncertainty of the online map in the perception module. It then leverages the uncertainty to guide motion prediction and planning modules to produce multi-modal trajectories. Finally, to achieve safer autonomous driving, UncAD proposes an uncertainty-collision-aware planning selection strategy according to the online map uncertainty to evaluate and select the best trajectory. In this study, we incorporate UncAD into various state-of-the-art (SOTA) end-to-end methods. Experiments on the nuScenes dataset show that integrating UncAD, with only a 1.9% increase in parameters, can reduce collision rates by up to 26% and drivable area conflict rate by up to 42%. Codes, pre-trained models, and demo videos can be accessed at https://github.com/pengxuanyang/UncAD.
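The selection step described in the abstract can be illustrated with a small sketch. This is not the UncAD implementation: the `MapElement` type, the `sigma` field, the margin rule, and the penalty function are all assumptions made here for illustration. The idea shown is only that a candidate trajectory is penalized more when it passes close to a map element whose predicted position is uncertain.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple

Point = Tuple[float, float]


@dataclass
class MapElement:
    """Hypothetical predicted map point (e.g. on a lane boundary)."""
    position: Point
    sigma: float  # assumed positional uncertainty (standard deviation, metres)


def risk(trajectory: Sequence[Point], elements: Sequence[MapElement],
         base_margin: float = 0.5) -> float:
    """Sum of safety-margin violations; the margin grows with map uncertainty."""
    total = 0.0
    for x, y in trajectory:
        for e in elements:
            d = math.hypot(x - e.position[0], y - e.position[1])
            margin = base_margin + 2.0 * e.sigma  # assumed inflation rule
            if d < margin:
                total += margin - d
    return total


def select_trajectory(candidates: Sequence[Sequence[Point]],
                      elements: Sequence[MapElement]) -> int:
    """Return the index of the candidate with the lowest risk (ties: first)."""
    return min(range(len(candidates)), key=lambda i: risk(candidates[i], elements))


straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
swerve = [(0.0, 0.0), (1.0, -0.5), (2.0, -1.0), (3.0, -1.0)]
uncertain_boundary = [MapElement(position=(2.0, 1.0), sigma=0.5)]
best = select_trajectory([straight, swerve], uncertain_boundary)
```

With a confident map (`sigma=0.0`) both candidates have zero risk and the first is kept; with an uncertain boundary the inflated margin makes the swerving candidate preferable.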

[220] Taccel: Scaling Up Vision-based Tactile Robotics via High-performance GPU Simulation

Yuyang Li,Wenxin Du,Chang Yu,Puhao Li,Zihang Zhao,Tengyu Liu,Chenfanfu Jiang,Yixin Zhu,Siyuan Huang

Main category: cs.RO

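TL;DR: Taccel is a high-performance simulation platform for tactile sensors that integrates IPC and ABD to deliver fast and accurate tactile simulation, accelerating tactile robotics research.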

Details

Motivation: Tactile sensing is key to advanced robotic manipulation, but vision-based tactile sensors (VBTSs) lack efficient simulation tools, which limits the scale of research. Method: Taccel combines IPC and ABD to simulate robots, tactile sensors, and objects, providing high-speed parallel computation and precise physics simulation. Result: Taccel simulates at 18 times real-time speed across large numbers of parallel environments, and experiments validate its accuracy and practicality. Conclusion: Taccel provides a powerful tool for tactile robotics research and may advance how robots understand and interact with their physical environment. Abstract: Tactile sensing is crucial for achieving human-level robotic capabilities in manipulation tasks. Vision-based tactile sensors (VBTSs) have emerged as a promising solution, offering high spatial resolution and cost-effectiveness by sensing contact through camera-captured deformation patterns of elastic gel pads. However, these sensors' complex physical characteristics and visual signal processing requirements present unique challenges for robotic applications. The lack of efficient and accurate simulation tools for VBTSs has significantly limited the scale and scope of tactile robotics research. Here we present Taccel, a high-performance simulation platform that integrates Incremental Potential Contact (IPC) and Affine Body Dynamics (ABD) to model robots, tactile sensors, and objects with both accuracy and unprecedented speed, achieving an 18-fold acceleration over real-time across thousands of parallel environments. Unlike previous simulators that operate at sub-real-time speeds with limited parallelization, Taccel provides precise physics simulation and realistic tactile signals while supporting flexible robot-sensor configurations through user-friendly APIs. Through extensive validation in object recognition, robotic grasping, and articulated object manipulation, we demonstrate precise simulation and successful sim-to-real transfer. These capabilities position Taccel as a powerful tool for scaling up tactile robotics research and development. By enabling large-scale simulation and experimentation with tactile sensing, Taccel accelerates the development of more capable robotic systems, potentially transforming how robots interact with and understand their physical environment.

[221] ViTa-Zero: Zero-shot Visuotactile Object 6D Pose Estimation

Hongyu Li,James Akl,Srinath Sridhar,Tye Brady,Taskin Padir

Main category: cs.RO

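TL;DR: ViTa-Zero is a zero-shot visuotactile pose estimation framework that substantially improves object 6D pose estimation accuracy in robotic manipulation tasks through physics-constrained optimization and feasibility checking.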

Details

Motivation: Existing methods that combine visual and tactile information generalize poorly because data is limited, and ViTa-Zero aims to address this. Method: A visual model serves as the backbone, and physical constraints derived from tactile and proprioceptive observations (modeled as a spring-mass system) drive feasibility checking and test-time optimization. Result: In real-robot experiments, ViTa-Zero improves AUC of ADD-S by 55% and ADD by 60%, and lowers position error by 80%. Conclusion: Through physics-constrained optimization, ViTa-Zero improves the accuracy and robustness of pose estimation across a range of manipulation scenarios. Abstract: Object 6D pose estimation is a critical challenge in robotics, particularly for manipulation tasks. While prior research combining visual and tactile (visuotactile) information has shown promise, these approaches often struggle with generalization due to the limited availability of visuotactile data. In this paper, we introduce ViTa-Zero, a zero-shot visuotactile pose estimation framework. Our key innovation lies in leveraging a visual model as its backbone and performing feasibility checking and test-time optimization based on physical constraints derived from tactile and proprioceptive observations. Specifically, we model the gripper-object interaction as a spring-mass system, where tactile sensors induce attractive forces, and proprioception generates repulsive forces. We validate our framework through experiments on a real-world robot setup, demonstrating its effectiveness across representative visual backbones and manipulation scenarios, including grasping, object picking, and bimanual handover. Compared to the visual models, our approach overcomes some drastic failure modes while tracking the in-hand object pose. In our experiments, our approach shows an average increase of 55% in AUC of ADD-S and 60% in ADD, along with an 80% lower position error compared to FoundationPose.
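The spring-mass idea in the abstract can be sketched in a toy 2D setting. Everything below is an assumption made for illustration, not the paper's formulation: the object is a circle, only its translation is refined, and the function names, stiffness constants, and tolerance are hypothetical. Tactile contact points pull the object surface toward them, while gripper points that penetrate the object push it away.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def net_force(center: Point, radius: float, contacts: Sequence[Point],
              gripper_points: Sequence[Point],
              k_att: float = 1.0, k_rep: float = 1.0) -> Point:
    """Spring-like force on the object centre (toy model)."""
    fx = fy = 0.0
    for px, py in contacts:  # tactile contacts attract the surface
        dx, dy = px - center[0], py - center[1]
        dist = math.hypot(dx, dy)
        if dist > radius:
            stretch = dist - radius
            fx += k_att * stretch * dx / dist
            fy += k_att * stretch * dy / dist
    for qx, qy in gripper_points:  # penetrating gripper points repel
        dx, dy = center[0] - qx, center[1] - qy
        dist = math.hypot(dx, dy)
        if 0.0 < dist < radius:
            depth = radius - dist
            fx += k_rep * depth * dx / dist
            fy += k_rep * depth * dy / dist
    return fx, fy


def is_feasible(center: Point, radius: float, contacts: Sequence[Point],
                gripper_points: Sequence[Point], tol: float = 1e-3) -> bool:
    """Feasible if contacts touch the object and nothing penetrates it."""
    touching = all(math.hypot(p[0] - center[0], p[1] - center[1]) <= radius + tol
                   for p in contacts)
    clear = all(math.hypot(q[0] - center[0], q[1] - center[1]) >= radius - tol
                for q in gripper_points)
    return touching and clear


def refine(center: Point, radius: float, contacts: Sequence[Point],
           gripper_points: Sequence[Point],
           step: float = 0.5, iters: int = 200) -> Point:
    """Test-time refinement: move the centre along the net force."""
    cx, cy = center
    for _ in range(iters):
        fx, fy = net_force((cx, cy), radius, contacts, gripper_points)
        cx, cy = cx + step * fx, cy + step * fy
    return cx, cy


visual_estimate = (3.0, 0.0)  # hypothetical pose from the visual backbone
refined = refine(visual_estimate, 1.0, contacts=[(0.0, 0.0)], gripper_points=[])
```

In this toy setup the visual estimate fails the feasibility check because the sensed contact is far from the object surface, and the refinement pulls the object until the surface meets the contact point.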

cs.AI

[222] Towards Conversational AI for Human-Machine Collaborative MLOps

George Fatouros,Georgios Makridis,George Kousiouris,John Soldatos,Anargyros Tsadimas,Dimosthenis Kyriazis

Main category: cs.AI

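TL;DR: The paper presents a conversational agent system based on large language models (LLMs) that aims to make human-machine collaboration in Machine Learning Operations (MLOps) more efficient.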

Details

Motivation: Complex MLOps platforms such as Kubeflow are hard to access, and the goal is to let users with different technical backgrounds use advanced ML tools easily. Method: The Swarm Agent is an extensible architecture that integrates a KFP Agent, a MinIO Agent, and a RAG Agent to manage ML workflows through natural language interaction. Result: Through context-aware processing and iterative reasoning loops, the system reduces the complexity of MLOps and improves usability. Conclusion: The conversational MLOps assistant lowers the technical barrier to entry for users at different skill levels. Abstract: This paper presents a Large Language Model (LLM) based conversational agent system designed to enhance human-machine collaboration in Machine Learning Operations (MLOps). We introduce the Swarm Agent, an extensible architecture that integrates specialized agents to create and manage ML workflows through natural language interactions. The system leverages a hierarchical, modular design incorporating a KubeFlow Pipelines (KFP) Agent for ML pipeline orchestration, a MinIO Agent for data management, and a Retrieval-Augmented Generation (RAG) Agent for domain-specific knowledge integration. Through iterative reasoning loops and context-aware processing, the system enables users with varying technical backgrounds to discover, execute, and monitor ML pipelines; manage datasets and artifacts; and access relevant documentation, all via intuitive conversational interfaces. Our approach addresses the accessibility gap in complex MLOps platforms like Kubeflow, making advanced ML tools broadly accessible while maintaining the flexibility to extend to other platforms. The paper describes the architecture and implementation details, and demonstrates how this conversational MLOps assistant reduces complexity and lowers barriers to entry for users across diverse technical skill levels.

[223] ZeroSumEval: Scaling LLM Evaluation with Inter-Model Competition

Haidar Khan,Hisham A. Alyahya,Yazeed Alnumay,M Saiful Bari,Bülent Yener

Main category: cs.AI

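TL;DR: ZeroSumEval is a novel competition-based evaluation protocol that uses zero-sum games to evaluate large language models (LLMs) dynamically, avoiding the overfitting, high cost, and bias of traditional methods.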

Details

Motivation: Traditional evaluation methods (static benchmarks, human assessment, or model-based evaluation) suffer from overfitting, high costs, and biases, so a more dynamic and standardized evaluation framework is needed. Method: ZeroSumEval evaluates LLMs with zero-sum games spanning several game types (security challenges, classic games, knowledge tests, and persuasion challenges) to assess capabilities such as strategic reasoning, planning, knowledge application, and creativity. Result: Experiments show that frontier models such as GPT and Claude do well at common games and question answering, but struggle in games that require creativity and novel questions, and cannot reliably jailbreak each other. Conclusion: ZeroSumEval provides a standardized and extensible framework for LLM evaluation and exposes model limitations on creative tasks. Abstract: Evaluating the capabilities of Large Language Models (LLMs) has traditionally relied on static benchmark datasets, human assessments, or model-based evaluations - methods that often suffer from overfitting, high costs, and biases. ZeroSumEval is a novel competition-based evaluation protocol that leverages zero-sum games to assess LLMs with dynamic benchmarks that resist saturation. ZeroSumEval encompasses a diverse suite of games, including security challenges (PyJail), classic games (Chess, Liar's Dice, Poker), knowledge tests (MathQuiz), and persuasion challenges (Gandalf, Debate). These games are designed to evaluate a range of AI capabilities such as strategic reasoning, planning, knowledge application, and creativity. Building upon recent studies that highlight the effectiveness of game-based evaluations for LLMs, ZeroSumEval enhances these approaches by providing a standardized and extensible framework. To demonstrate this, we conduct extensive experiments with >7000 simulations across 7 games and 13 models. Our results show that while frontier models from the GPT and Claude families can play common games and answer questions, they struggle to play games that require creating novel and challenging questions. We also observe that models cannot reliably jailbreak each other and fail generally at tasks requiring creativity. We release our code at https://github.com/facebookresearch/ZeroSumEval.

[224] WebLists: Extracting Structured Information From Complex Interactive Websites Using Executable LLM Agents

Arth Bohra,Manvel Saroyan,Danil Melkozerov,Vahe Karufanyan,Gabriel Maher,Pascal Weinberger,Artem Harutyunyan,Giovanni Campagna

Main category: cs.AI

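TL;DR: The paper introduces the WebLists benchmark and the BardeenAgent framework for large-scale structured data extraction, with a substantial performance improvement.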

Details

Motivation: Current web agent research focuses mainly on navigation and transaction tasks, with little work on structured data extraction. Method: The BardeenAgent framework converts agent execution into repeatable programs and exploits the regular structure of HTML to extract data. Result: BardeenAgent achieves 66% recall on the WebLists benchmark, a substantial improvement. Conclusion: The BardeenAgent framework addresses large-scale data extraction effectively and outperforms existing methods. Abstract: Most recent web agent research has focused on navigation and transaction tasks, with little emphasis on extracting structured data at scale. We present WebLists, a benchmark of 200 data-extraction tasks across four common business and enterprise use-cases. Each task requires an agent to navigate to a webpage, configure it appropriately, and extract complete datasets with well-defined schemas. We show that both LLMs with search capabilities and SOTA web agents struggle with these tasks, with a recall of 3% and 31%, respectively, despite higher performance on question-answering tasks. To address this challenge, we propose BardeenAgent, a novel framework that enables web agents to convert their execution into repeatable programs, and replay them at scale across pages with similar structure. BardeenAgent is also the first LLM agent to take advantage of the regular structure of HTML. In particular, BardeenAgent constructs a generalizable CSS selector to capture all relevant items on the page, then fits the operations to extract the data. On the WebLists benchmark, BardeenAgent achieves 66% recall overall, more than doubling the performance of SOTA web agents, and reducing cost per output row by 3x.
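To make the notion of a generalizable CSS selector concrete, here is a minimal sketch of one way such a selector could be derived from two example items on a page. It is not BardeenAgent's algorithm: the function name, the restriction to child combinators, and the rule of dropping only `:nth-child(...)` qualifiers are assumptions chosen for illustration.

```python
import re

_NTH_CHILD = re.compile(r":nth-child\(\d+\)")


def generalize_selector(selector_a: str, selector_b: str) -> str:
    """Merge two item-specific selectors into one that matches both.

    Assumes both selectors use only the child combinator '>' and differ
    at most in their :nth-child(...) positions.
    """
    parts_a = [p.strip() for p in selector_a.split(">")]
    parts_b = [p.strip() for p in selector_b.split(">")]
    if len(parts_a) != len(parts_b):
        raise ValueError("selectors must have the same depth")
    merged = []
    for a, b in zip(parts_a, parts_b):
        if a == b:
            merged.append(a)
            continue
        a_base, b_base = _NTH_CHILD.sub("", a), _NTH_CHILD.sub("", b)
        if a_base != b_base:
            raise ValueError(f"cannot generalize {a!r} and {b!r}")
        merged.append(a_base)  # drop the position so every sibling matches
    return " > ".join(merged)


first_row = "ul.results > li:nth-child(1) > a.title"
fourth_row = "ul.results > li:nth-child(4) > a.title"
general = generalize_selector(first_row, fourth_row)
```

The merged selector `ul.results > li > a.title` matches the title link in every list row rather than in one specific row, which is the property a replayable extraction program needs.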

[225] Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning

Baining Zhao,Ziyou Wang,Jianjie Fang,Chen Gao,Fanhang Man,Jinqiang Cui,Xin Wang,Xinlei Chen,Yong Li,Wenwu Zhu

Main category: cs.AI

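TL;DR: The Embodied-R framework combines vision-language models (VLMs) with small language models (LMs) and uses reinforcement learning to achieve efficient spatial reasoning, matching advanced multimodal models.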

Details

Motivation: The paper explores how pretrained models can acquire high-level spatial reasoning ability from visual observations. Method: It combines a VLM and an LM, trained with reinforcement learning (RL) and a novel reward system that considers logical consistency. Result: A 3B LM trained on only 5k samples performs strongly on spatial reasoning tasks and shows systematic analysis and contextual integration. Conclusion: Embodied-R shows that small models can reason efficiently through a collaborative framework and RL training, suggesting new directions for model design. Abstract: Humans can perceive and reason about spatial relationships from sequential visual observations, such as egocentric video streams. However, how pretrained models acquire such abilities, especially high-level reasoning, remains unclear. This paper introduces Embodied-R, a collaborative framework combining large-scale Vision-Language Models (VLMs) for perception and small-scale Language Models (LMs) for reasoning. Using Reinforcement Learning (RL) with a novel reward system considering think-answer logical consistency, the model achieves slow-thinking capabilities with limited computational resources. After training on only 5k embodied video samples, Embodied-R with a 3B LM matches state-of-the-art multimodal reasoning models (OpenAI-o1, Gemini-2.5-pro) on both in-distribution and out-of-distribution embodied spatial reasoning tasks. Embodied-R also exhibits emergent thinking patterns such as systematic analysis and contextual integration. We further explore research questions including response length, training on VLM, strategies for reward design, and differences in model generalization after SFT (Supervised Fine-Tuning) and RL training.

[226] Antidistillation Sampling

Yash Savani,Asher Trockman,Zhili Feng,Avi Schwarzschild,Alexander Robey,Marc Finzi,J. Zico Kolter

Main category: cs.AI

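TL;DR: The extended reasoning traces produced by frontier models can be used for model distillation, and an antidistillation sampling strategy can limit how effective that distillation is.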

Details

Motivation: The goal is to prevent model-generated reasoning traces from being used for distillation while preserving model performance. Method: Reasoning traces are poisoned by modifying the model's next-token probability distribution. Result: Antidistillation sampling makes distillation significantly less effective without harming the model's practical utility. Conclusion: Antidistillation sampling is an effective way to protect a model's knowledge from being distilled. Abstract: Frontier models that generate extended reasoning traces inadvertently produce rich token sequences that can facilitate model distillation. Recognizing this vulnerability, model owners may seek sampling strategies that limit the effectiveness of distillation without compromising model performance. Antidistillation sampling provides exactly this capability. By strategically modifying a model's next-token probability distribution, antidistillation sampling poisons reasoning traces, rendering them significantly less effective for distillation while preserving the model's practical utility. For further details, see https://antidistillation.com.
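The abstract states only that the next-token distribution is modified; it does not give the rule. The sketch below shows the general shape such a modification could take, under loudly stated assumptions: `distill_scores` is a hypothetical per-token estimate of how useful each token would be to a distilling student, and `lam` is a hypothetical trade-off weight. Neither comes from the paper, and how such scores would be computed is left open here.

```python
import math
import random
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def adjusted_distribution(logits: Sequence[float],
                          distill_scores: Sequence[float],
                          lam: float) -> List[float]:
    """Down-weight tokens with high (hypothetical) distillation value.

    lam = 0 recovers the model's original sampling distribution.
    """
    return softmax([l - lam * s for l, s in zip(logits, distill_scores)])


def sample_token(probs: Sequence[float], rng: random.Random) -> int:
    """Draw one token index from the given distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]          # toy next-token logits
distill_scores = [1.0, 0.0, 0.0]  # assumed: token 0 helps a student most
base = softmax(logits)
poisoned = adjusted_distribution(logits, distill_scores, lam=2.0)
token = sample_token(poisoned, random.Random(0))
```

The design tension the abstract points to is visible even in this toy: a larger `lam` moves the sampling distribution further from the model's own, so any practical choice has to balance protection against the quality of the sampled text.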

[227] Sleep-time Compute: Beyond Inference Scaling at Test-time

Kevin Lin,Charlie Snell,Yu Wang,Charles Packer,Sarah Wooders,Ion Stoica,Joseph E. Gonzalez

Main category: cs.AI

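TL;DR: The paper proposes "sleep-time compute", which pre-computes offline for queries users are likely to ask and thereby significantly reduces test-time compute requirements. Experiments show that the method cuts compute by about 5x on several tasks and improves accuracy.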

Details

Motivation: The high latency and cost of test-time compute are major obstacles to large language models (LLMs) solving difficult problems, so a way to reduce this overhead is needed. Method: Sleep-time compute reduces test-time compute by anticipating the queries users might ask about a context and pre-computing useful quantities. Experiments use the modified Stateful GSM-Symbolic and Stateful AIME tasks, plus the extended Multi-Query GSM-Symbolic task. Result: Sleep-time compute reduces test-time compute by about 5x on Stateful GSM-Symbolic and Stateful AIME and raises accuracy by up to 13% and 18%, respectively. Multi-Query GSM-Symbolic further lowers the cost per query by 2.5x. Conclusion: Sleep-time compute effectively reduces test-time compute requirements and improves model performance, especially when user queries are predictable. Abstract: Scaling test-time compute has emerged as a key ingredient for enabling large language models (LLMs) to solve difficult problems, but comes with high latency and inference cost. We introduce sleep-time compute, which allows models to "think" offline about contexts before queries are presented: by anticipating what queries users might ask and pre-computing useful quantities, we can significantly reduce the compute requirements at test-time. To demonstrate the efficacy of our method, we create modified versions of two reasoning tasks - Stateful GSM-Symbolic and Stateful AIME. We find that sleep-time compute can reduce the amount of test-time compute needed to achieve the same accuracy by ~5x on Stateful GSM-Symbolic and Stateful AIME and that by scaling sleep-time compute we can further increase accuracy by up to 13% on Stateful GSM-Symbolic and 18% on Stateful AIME. Furthermore, we introduce Multi-Query GSM-Symbolic, which extends GSM-Symbolic by including multiple related queries per context. By amortizing sleep-time compute across related queries about the same context using Multi-Query GSM-Symbolic, we can decrease the average cost per query by 2.5x. We then conduct additional analysis to understand when sleep-time compute is most effective, finding the predictability of the user query to be well correlated with the efficacy of sleep-time compute. Finally, we conduct a case-study of applying sleep-time compute to a realistic agentic SWE task.
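The amortization argument in the abstract is simple arithmetic: a one-off sleep-time cost for a context is shared by every query asked about that context. The sketch below works through it with made-up numbers. The function name, the use of a single "cost" unit for both phases, and all values are assumptions for illustration and are not taken from the paper's measurements.

```python
def average_cost_per_query(sleep_cost: float, test_cost_per_query: float,
                           num_queries: int) -> float:
    """Average cost when one sleep-time pass is shared by num_queries queries."""
    if num_queries < 1:
        raise ValueError("num_queries must be at least 1")
    return sleep_cost / num_queries + test_cost_per_query


# Hypothetical numbers: without sleep-time compute each query costs 600 units;
# with it, a 1000-unit offline pass lets each query finish in 200 units.
baseline_per_query = 600.0
with_sleep_one_query = average_cost_per_query(1000.0, 200.0, num_queries=1)
with_sleep_ten_queries = average_cost_per_query(1000.0, 200.0, num_queries=10)
saving_factor = baseline_per_query / with_sleep_ten_queries
```

In this toy setting a single query makes the offline pass a net loss (1200 versus 600 units), while ten related queries bring the average down to 300 units, which is why the benefit depends on how many predictable queries share a context.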

[228] Readable Twins of Unreadable Models

Krzysztof Pancerz,Piotr Kulicki,Michał Kalisz,Andrzej Burda,Maciej Stanisławski,Jaromir Sarzyński

Main category: cs.AI

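TL;DR: The paper proposes a method for converting unreadable deep learning models into readable information flow models, as a step toward explainable deep learning systems.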

Details

Motivation: Building responsible AI systems requires explainability, yet deep learning models are usually hard to explain, so a way to convert them into a readable form is needed. Method: Borrowing the concept of digital twins of physical objects, the paper proposes creating readable twins (imprecise information flow models) and describes the full procedure for converting a deep learning model into an information flow model. Result: The approach is illustrated with a handwritten digit classification model on the MNIST data set. Conclusion: The method offers a new route to explainability for deep learning models and supports building more responsible AI systems. Abstract: Creating responsible artificial intelligence (AI) systems is an important issue in contemporary research and development work on AI. One of the characteristics of responsible AI systems is their explainability. In the paper, we are interested in explainable deep learning (XDL) systems. On the basis of the creation of digital twins of physical objects, we introduce the idea of creating readable twins (in the form of imprecise information flow models) for unreadable deep learning models. The complete procedure for switching from the deep learning model (DLM) to the imprecise information flow model (IIFM) is presented. The proposed approach is illustrated with an example of a deep learning classification model for image recognition of handwritten digits from the MNIST data set.

Table of Contents

cs.CV

[1] DMM: Building a Versatile Image Generation Model via Distillation-Based Model Merging

[2] Geographical Context Matters: Bridging Fine and Coarse Spatial Information to Enhance Continental Land Cover Mapping

[3] WORLDMEM: Long-term Consistent World Simulation with Memory

[4] InstantCharacter: Personalize Any Characters with a Scalable Diffusion Transformer Framework

[5] NTIRE 2025 Challenge on Event-Based Image Deblurring: Methods and Results

[6] Sparsity Outperforms Low-Rank Projections in Few-Shot Adaptation

[7] 3D-PointZshotS: Geometry-Aware 3D Point Cloud Zero-Shot Semantic Segmentation Narrowing the Visual-Semantic Gap

[8] DG-MVP: 3D Domain Generalization via Multiple Views of Point Clouds for Classification

[9] AdaVid: Adaptive Video-Language Pretraining

[10] Event Quality Score (EQS): Assessing the Realism of Simulated Event Camera Streams via Distances in Latent Space

[11] Decision-based AI Visual Navigation for Cardiac Ultrasounds

[12] Post-Hurricane Debris Segmentation Using Fine-Tuned Foundational Vision Models

[13] Privacy-Preserving Operating Room Workflow Analysis using Digital Twins

[14] Contour Field based Elliptical Shape Prior for the Segment Anything Model

[15] Parsimonious Dataset Construction for Laparoscopic Cholecystectomy Structure Segmentation

[16] Prompt-Driven and Training-Free Forgetting Approach and Dataset for Large Language Models

[17] CM3AE: A Unified RGB Frame and Event-Voxel/-Frame Pre-training Framework

[18] 3DResT: A Strong Baseline for Semi-Supervised 3D Referring Expression Segmentation

[19] AdaQual-Diff: Diffusion-Based Image Restoration via Adaptive Quality Prompting

[20] Robo-SGG: Exploiting Layout-Oriented Normalization and Restitution for Robust Scene Graph Generation

[21] SAM-Based Building Change Detection with Distribution-Aware Fourier Adaptation and Edge-Constrained Warping

[22] Packing Input Frame Context in Next-Frame Prediction Models for Video Generation

[23] RoPETR: Improving Temporal Camera-Only 3D Detection by Integrating Enhanced Rotary Position Embedding

[24] AdaptoVision: A Multi-Resolution Image Recognition Model for Robust and Scalable Classification

[25] Two Tasks, One Goal: Uniting Motion and Planning for Excellent End To End Autonomous Driving Performance

[26] Accurate Tracking of Arabidopsis Root Cortex Cell Nuclei in 3D Time-Lapse Microscopy Images Based on Genetic Algorithm

[27] TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials

[28] HSS-IAD: A Heterogeneous Same-Sort Industrial Anomaly Detection Dataset

[29] Collaborative Perception Datasets for Autonomous Driving: A Review

[30] Unsupervised Cross-Domain 3D Human Pose Estimation via Pseudo-Label-Guided Global Transforms

[31] SmartFreeEdit: Mask-Free Spatial-Aware Image Editing with Complex Instruction Understanding

[32] Self-Supervised Pre-training with Combined Datasets for 3D Perception in Autonomous Driving

[33] NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results

[34] Post-pre-training for Modality Alignment in Vision-Language Foundation Models

[35] Mask Image Watermarking

[36] Privacy Protection Against Personalized Text-to-Image Synthesis via Cross-image Consistency Constraints

[37] LAD-Reasoner: Tiny Multimodal Models are Good Reasoners for Logical Anomaly Detection

[38] Stronger, Steadier & Superior: Geometric Consistency in Depth VFM Forges Domain Generalized Semantic Segmentation

[39] Set You Straight: Auto-Steering Denoising Trajectories to Sidestep Unwanted Concepts

[40] EarthGPT-X: Enabling MLLMs to Flexibly and Comprehensively Understand Multi-Source Remote Sensing Imagery

[41] TSGS: Improving Gaussian Splatting for Transparent Surface Reconstruction via Normal and De-lighting Priors

[42] Hybrid Dense-UNet201 Optimization for Pap Smear Image Segmentation Using Spider Monkey Optimization

[43] Saliency-Aware Diffusion Reconstruction for Effective Invisible Watermark Removal

[44] TwoSquared: 4D Generation from 2D Image Pairs

[45] Image-Editing Specialists: An RLAIF Approach for Diffusion Models

[46] High-Fidelity Image Inpainting with Multimodal Guided GAN Inversion

[47] Computer-Aided Design of Personalized Occlusal Positioning Splints Using Multimodal 3D Data

[48] SC3EF: A Joint Self-Correlation and Cross-Correspondence Estimation Framework for Visible and Thermal Image Registration

[49] Tree-NeRV: A Tree-Structured Neural Representation for Efficient Non-Uniform Video Encoding

[50] Second-order Optimization of Gaussian Splats with Importance Sampling

[51] Efficient Masked Image Compression with Position-Indexed Self-Attention

[52] Disentangling Polysemantic Channels in Convolutional Neural Networks

[53] Rethinking Temporal Fusion with a Unified Gradient Descent View for 3D Semantic Occupancy Prediction

[54] Vision and Language Integration for Domain Generalization

[55] MathPhys-Guided Coarse-to-Fine Anomaly Synthesis with SQE-Driven Bi-Level Optimization for Anomaly Detection

[56] Enhancing Cocoa Pod Disease Classification via Transfer Learning and Ensemble Methods: Toward Robust Predictive Modeling

[57] All-in-One Transferring Image Compression from Human Perception to Multi-Machine Perception

[58] Hierarchical Feature Learning for Medical Point Clouds via State Space Model

[59] Pose and Facial Expression Transfer by using StyleGAN

[60] Riemannian Patch Assignment Gradient Flows

[61] TTRD3: Texture Transfer Residual Denoising Dual Diffusion Model for Remote Sensing Image Super-Resolution

[62] Prototypes are Balanced Units for Efficient and Effective Partially Relevant Video Retrieval

[63] Event-Enhanced Blurry Video Super-Resolution

[64] Expert Kernel Generation Network Driven by Contextual Mapping for Hyperspectral Image Classification

[65] NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation

[66] Imaging for All-Day Wearable Smart Glasses

[67] ArtistAuditor: Auditing Artist Style Pirate in Text-to-Image Generation Models

[68] EchoWorld: Learning Motion-Aware World Models for Echocardiography Probe Guidance

[69] SkyReels-V2: Infinite-length Film Generative Model

[70] Effective Dual-Region Augmentation for Reduced Reliance on Large Amounts of Labeled Data

[71] Enhancing Person-to-Person Virtual Try-On with Multi-Garment Virtual Try-Off

[72] EventVAD: Training-Free Event-Aware Video Anomaly Detection

[73] RF-DETR Object Detection vs YOLOv12 : A Study of Transformer-based and CNN-based Architectures for Single-Class and Multi-Class Greenfruit Detection in Complex Orchard Environments Under Label Ambiguity

[74] UniEdit-Flow: Unleashing Inversion and Editing in the Era of Flow Models

[75] Probing and Inducing Combinational Creativity in Vision-Language Models

[76] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models

[77] Low-hallucination Synthetic Captions for Large-Scale Vision-Language Model Pre-training

[78] Science-T2I: Addressing Scientific Illusions in Image Synthesis