cs.CV

[1] DMM: Building a Versatile Image Generation Model via Distillation-Based Model Merging

Tianhui Song,Weixin Feng,Shuai Wang,Xubin Li,Tiezheng Ge,Bo Zheng,Limin Wang

Main category: cs.CV

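TL;DR: The paper proposes a style-promptable image generation pipeline and a distillation-based model merging paradigm (DMM) that compresses multiple text-to-image models into one versatile model, addressing parameter redundancy and high storage cost.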

Details

Motivation: The growing variety of text-to-image generation models leads to parameter redundancy and high storage cost, so an effective way to consolidate the capabilities of multiple models into one is needed. Method: Introduces a style-promptable image generation pipeline and uses the score distillation based model merging paradigm (DMM) to compress multiple models into a single versatile model. Result: Experiments show that DMM efficiently consolidates knowledge from multiple teacher models and achieves controllable arbitrary-style generation. Conclusion: DMM offers a new formulation and new evaluation protocols for model merging in text-to-image generation. Abstract: The success of text-to-image (T2I) generation models has spurred a proliferation of model checkpoints fine-tuned from the same base model on various specialized datasets. This overwhelming specialized model production introduces new challenges of high parameter redundancy and huge storage cost, thereby necessitating the development of effective methods to consolidate and unify the capabilities of diverse powerful models into a single one. A common practice in model merging adopts static linear interpolation in the parameter space to achieve the goal of style mixing. However, it neglects a feature of the T2I generation task: numerous distinct models cover sundry styles, which may lead to incompatibility and confusion in the merged model. To address this issue, we introduce a style-promptable image generation pipeline which can accurately generate arbitrary-style images under the control of style vectors. Based on this design, we propose the score distillation based model merging paradigm (DMM), compressing multiple models into a single versatile T2I model. Moreover, we rethink and reformulate the model merging task in the context of T2I generation, by presenting new merging goals and evaluation protocols. Our experiments demonstrate that DMM can compactly reorganize the knowledge from multiple teacher models and achieve controllable arbitrary-style generation.
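For orientation, the static linear interpolation baseline that the abstract contrasts DMM with can be sketched as below. This is not the DMM method itself; the function name `merge_linear`, the dict-of-lists parameter layout, and the toy values are assumptions made purely for illustration.

```python
def merge_linear(state_dicts, weights):
    """Weighted average of parameter dicts that share names and shapes.

    Each state dict maps a parameter name to a flat list of floats
    (a hypothetical stand-in for real checkpoint tensors).
    """
    if len(state_dicts) != len(weights) or not state_dicts:
        raise ValueError("need one weight per state dict")
    total = sum(weights)
    merged = {}
    for name, first in state_dicts[0].items():
        merged[name] = [
            sum(w * sd[name][i] for sd, w in zip(state_dicts, weights)) / total
            for i in range(len(first))
        ]
    return merged


# Two toy "checkpoints" fine-tuned from the same base model.
model_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
model_b = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}
merged = merge_linear([model_a, model_b], [0.5, 0.5])
```

Because the interpolation weights are fixed once, every prompt sees the same blend of styles, which is the limitation the abstract points to.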

[2] Geographical Context Matters: Bridging Fine and Coarse Spatial Information to Enhance Continental Land Cover Mapping

Babak Ghassemi,Cassio Fraga-Dantas,Raffaele Gaetano,Dino Ienco,Omid Ghorbanzadeh,Emma Izquierdo-Verdiguier,Francesco Vuolo

Main category: cs.CV

TL;DR: The BRIDGE-LC framework integrates multi-scale geospatial information to improve the accuracy and scalability of land cover classification.

Details

Motivation: Existing deep learning methods often ignore geospatial metadata, which limits the cross-regional accuracy of land cover classification. Method: Proposes the BRIDGE-LC framework, which combines fine-grained and coarse-grained spatial information in a multi-layer perceptron architecture for efficient classification. Result: Experiments show that integrating geospatial information clearly improves classification performance, especially when both granularities are used together. Conclusion: BRIDGE-LC provides an efficient and accurate solution for large-scale land cover classification. Abstract: Land use and land cover mapping from Earth Observation (EO) data is a critical tool for sustainable land and resource management. While advanced machine learning and deep learning algorithms excel at analyzing EO imagery data, they often overlook crucial geospatial metadata information that could enhance scalability and accuracy across regional, continental, and global scales. To address this limitation, we propose BRIDGE-LC (Bi-level Representation Integration for Disentangled GEospatial Land Cover), a novel deep learning framework that integrates multi-scale geospatial information into the land cover classification process. By simultaneously leveraging fine-grained (latitude/longitude) and coarse-grained (biogeographical region) spatial information, our lightweight multi-layer perceptron architecture learns from both during training but only requires fine-grained information for inference, allowing it to disentangle region-specific from region-agnostic land cover features while maintaining computational efficiency. To assess the quality of our framework, we use an open-access in-situ dataset and adopt several competing classification approaches commonly considered for large-scale land cover mapping. We evaluated all approaches through two scenarios: an extrapolation scenario in which training data encompasses samples from all biogeographical regions, and a leave-one-region-out scenario where one region is excluded from training. We also explore the spatial representation learned by our model, highlighting a connection between its internal manifold and the geographical information used during training. Our results demonstrate that integrating geospatial information improves land cover mapping performance, with the most substantial gains achieved by jointly leveraging both fine- and coarse-grained spatial information.

[3] WORLDMEM: Long-term Consistent World Simulation with Memory

Zeqi Xiao,Yushi Lan,Yifan Zhou,Wenqi Ouyang,Shuai Yang,Yanhong Zeng,Xingang Pan

Main category: cs.CV

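TL;DR: The WorldMem framework introduces a memory bank and a memory attention mechanism to address long-term consistency and 3D spatial consistency in world simulation, while also supporting the modeling of dynamic evolution.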

Details

Motivation: The limited temporal context window in world simulation makes long-term consistency and 3D spatial consistency hard to maintain, so a method is needed to improve the accuracy of scene generation. Method: Proposes the WorldMem framework, which includes a memory bank that stores memory frames and states (such as poses and timestamps) and extracts relevant information through a memory attention mechanism. Result: Experiments confirm that WorldMem accurately reconstructs scenes in both virtual and real scenarios and supports modeling of dynamic evolution. Conclusion: Through its memory mechanism, WorldMem effectively addresses long-term consistency in world simulation and supports dynamic interaction. Abstract: World simulation has gained increasing popularity due to its ability to model virtual environments and predict the consequences of actions. However, the limited temporal context window often leads to failures in maintaining long-term consistency, particularly in preserving 3D spatial consistency. In this work, we present WorldMem, a framework that enhances scene generation with a memory bank consisting of memory units that store memory frames and states (e.g., poses and timestamps). By employing a memory attention mechanism that effectively extracts relevant information from these memory frames based on their states, our method is capable of accurately reconstructing previously observed scenes, even under significant viewpoint or temporal gaps. Furthermore, by incorporating timestamps into the states, our framework not only models a static world but also captures its dynamic evolution over time, enabling both perception and interaction within the simulated world. Extensive experiments in both virtual and real scenarios validate the effectiveness of our approach.

[4] InstantCharacter: Personalize Any Characters with a Scalable Diffusion Transformer Framework

Jiale Tao,Yanbing Zhang,Qixun Wang,Yiji Cheng,Haofan Wang,Xu Bai,Zhengguang Zhou,Ruihuang Li,Linqing Wang,Chunyu Wang,Qin Lin,Qinglin Lu

Main category: cs.CV

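TL;DR: InstantCharacter is a scalable diffusion-transformer-based framework for character customization that addresses the poor generalization and low image quality of existing methods.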

Details

Motivation: Existing U-Net-based learning methods generalize poorly, while optimization-based methods require subject-specific fine-tuning and offer weak textual controllability. Method: Adopts a diffusion transformer architecture, introduces a scalable adapter and a large-scale character dataset (on the order of 10 million samples), and optimizes identity consistency and textual editability through two data subsets. Result: Experiments show that InstantCharacter generates high-fidelity, text-controllable, and character-consistent images. Conclusion: InstantCharacter sets a new benchmark for character-driven image generation. Abstract: Current learning-based subject customization approaches, predominantly relying on U-Net architectures, suffer from limited generalization ability and compromised image quality. Meanwhile, optimization-based methods require subject-specific fine-tuning, which inevitably degrades textual controllability. To address these challenges, we propose InstantCharacter, a scalable framework for character customization built upon a foundation diffusion transformer. InstantCharacter demonstrates three fundamental advantages: first, it achieves open-domain personalization across diverse character appearances, poses, and styles while maintaining high-fidelity results. Second, the framework introduces a scalable adapter with stacked transformer encoders, which effectively processes open-domain character features and seamlessly interacts with the latent space of modern diffusion transformers. Third, to effectively train the framework, we construct a large-scale character dataset containing 10-million-level samples. The dataset is systematically organized into paired (multi-view character) and unpaired (text-image combinations) subsets. This dual-data structure enables simultaneous optimization of identity consistency and textual editability through distinct learning pathways. Qualitative experiments demonstrate the advanced capabilities of InstantCharacter in generating high-fidelity, text-controllable, and character-consistent images, setting a new benchmark for character-driven image generation. Our source code is available at https://github.com/Tencent/InstantCharacter.

[5] NTIRE 2025 Challenge on Event-Based Image Deblurring: Methods and Results

Lei Sun,Andrea Alfarano,Peiqi Duan,Shaolin Su,Kaiwei Wang,Boxin Shi,Radu Timofte,Danda Pani Paudel,Luc Van Gool,Qinglin Liu,Wei Yu,Xiaoqian Lv,Lu Yang,Shuigen Wang,Shengping Zhang,Xiangyang Ji,Long Bao,Yuqiang Yang,Jinao Song,Ziyi Wang,Shuang Wen,Heng Sun,Kean Liu,Mingchen Zhong,Senyan Xu,Zhijing Sun,Jiaying Zhu,Chengjie Ge,Xingbo Wang,Yidi Liu,Xin Lu,Xueyang Fu,Zheng-Jun Zha,Dawei Fan,Dafeng Zhang,Yong Yang,Siru Zhang,Qinghua Yang,Hao Kang,Huiyuan Fu,Heng Zhang,Hongyuan Yu,Zhijuan Huang,Shuoyan Wei,Feng Li,Runmin Cong,Weiqi Luo,Mingyun Lin,Chenxu Jiang,Hongyi Liu,Lei Yu,Weilun Li,Jiajun Zhai,Tingting Lin,Shuang Ma,Sai Zhou,Zhanwen Liu,Yang Wang,Eiffel Chong,Nuwan Bandara,Thivya Kandappu,Archan Misra,Yihang Chen,Zhan Li,Weijun Yuan,Wenzhuo Wang,Boyang Yao,Zhanglu Chen,Yijing Sun,Tianjiao Wan,Zijian Gao,Qisheng Xu,Kele Xu,Yukun Zhang,Yu He,Xiaoyan Xie,Tao Fu,Yashu Gautamkumar Patel,Vihar Ramesh Jain,Divesh Basina,Rishik Ashili,Manish Kumar Manjhi,Sourav Kumar,Prinon Benny,Himanshu Ghunawat,B Sri Sairam Gautam,Anett Varghese,Abhishek Yadav

Main category: cs.CV

TL;DR: The NTIRE 2025 challenge focuses on high-quality event-based image deblurring; 15 teams submitted valid results, advancing event-based vision research.

Details

Motivation: Design event-based methods that achieve high-quality image deblurring, with performance quantified by PSNR. Method: Use both events and images as inputs for single-image deblurring, with no restrictions on computational complexity or model size. Result: 199 participants registered and 15 teams submitted valid results, offering insight into the current state of event-based image deblurring. Conclusion: The challenge is expected to drive further progress in event-based vision research. Abstract: This paper presents an overview of the NTIRE 2025 First Challenge on Event-Based Image Deblurring, detailing the proposed methodologies and corresponding results. The primary goal of the challenge is to design an event-based method that achieves high-quality image deblurring, with performance quantitatively assessed using Peak Signal-to-Noise Ratio (PSNR). Notably, there are no restrictions on computational complexity or model size. The task focuses on leveraging both events and images as inputs for single-image deblurring. A total of 199 participants registered, among whom 15 teams successfully submitted valid results, offering valuable insights into the current state of event-based image deblurring. We anticipate that this challenge will drive further advancements in event-based vision research.
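Since the challenge ranks entries by PSNR, a minimal sketch of the standard definition, PSNR = 10 log10(MAX^2 / MSE), may help. The flat-list image layout and the 8-bit peak value of 255 are assumptions; the challenge's exact evaluation script is not described in the abstract.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


sharp = [100, 100, 100, 100]
deblurred = [110, 110, 110, 110]
score = psnr(sharp, deblurred)  # MSE is 100 here
```

Higher values indicate a restored image closer to the sharp reference.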

[6] Sparsity Outperforms Low-Rank Projections in Few-Shot Adaptation

Nairouz Mrabah,Nicolas Richet,Ismail Ben Ayed,Éric Granger

Main category: cs.CV

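TL;DR: Proposes a Sparse Optimization (SO) framework that dynamically adjusts a small number of parameters to address overfitting and computational constraints in few-shot domain adaptation of vision-language models (VLMs).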

Details

Motivation: Existing methods such as low-rank reparameterization struggle with generalization and hyperparameter tuning, so a more efficient and stable adaptation scheme is needed. Method: Adopts two paradigms, local sparsity with global density and local randomness with global importance, to adjust parameters dynamically and reduce overfitting. Result: Validated on 11 datasets, SO achieves the best few-shot adaptation performance while reducing memory overhead. Conclusion: The SO framework clearly improves few-shot adaptation of VLMs and is both efficient and stable. Abstract: Adapting Vision-Language Models (VLMs) to new domains with few labeled samples remains a significant challenge due to severe overfitting and computational constraints. State-of-the-art solutions, such as low-rank reparameterization, mitigate these issues but often struggle with generalization and require extensive hyperparameter tuning. In this paper, a novel Sparse Optimization (SO) framework is proposed. Unlike low-rank approaches that typically constrain updates to a fixed subspace, our SO method leverages high sparsity to dynamically adjust very few parameters. We introduce two key paradigms. First, we advocate for local sparsity and global density, which updates a minimal subset of parameters per iteration while maintaining overall model expressiveness. As a second paradigm, we advocate for local randomness and global importance, which sparsifies the gradient using random selection while pruning the first moment based on importance. This combination significantly mitigates overfitting and ensures stable adaptation in low-data regimes. Extensive experiments on 11 diverse datasets show that SO achieves state-of-the-art few-shot adaptation performance while reducing memory overhead.
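The two paradigms can be illustrated with a toy update step. This is only a sketch under strong assumptions, not the authors' optimizer: it uses a plain momentum update instead of a full Adam-style rule, treats moment magnitude as the "importance" score, and the name `sparse_step` and all hyperparameters are hypothetical.

```python
import random


def sparse_step(params, grads, moment, lr=0.1, beta=0.9,
                keep_grad=2, keep_moment=2, rng=random):
    """One toy sparse update over flat lists of floats."""
    n = len(params)
    # Local randomness: keep a random subset of gradient entries this step.
    kept = set(rng.sample(range(n), keep_grad))
    sparse_grad = [g if i in kept else 0.0 for i, g in enumerate(grads)]
    # First-moment (momentum) update driven by the sparsified gradient.
    moment = [beta * m + (1 - beta) * g for m, g in zip(moment, sparse_grad)]
    # Global importance (assumed: magnitude): prune all but the largest entries.
    order = sorted(range(n), key=lambda i: abs(moment[i]), reverse=True)
    top = set(order[:keep_moment])
    moment = [m if i in top else 0.0 for i, m in enumerate(moment)]
    # Only parameters with a surviving moment entry actually move.
    params = [p - lr * m for p, m in zip(params, moment)]
    return params, moment


old_params = [1.0, 1.0, 1.0, 1.0]
new_params, new_moment = sparse_step(
    old_params, [1.0, 2.0, 3.0, 4.0], [0.0] * 4, rng=random.Random(0)
)
```

Each step touches only a few parameters, while different random subsets across iterations let the update reach many parameters over time.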

[7] 3D-PointZshotS: Geometry-Aware 3D Point Cloud Zero-Shot Semantic Segmentation Narrowing the Visual-Semantic Gap

Minmin Yang,Huantao Ren,Senem Velipasalar

Main category: cs.CV

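TL;DR: 3D-PointZshotS is a geometry-aware zero-shot 3D point cloud segmentation framework that improves feature generation and alignment through latent geometric prototypes (LGPs) and a self-consistency loss, outperforming baseline methods on several datasets.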

Details

Motivation: Address the limited transferability of existing zero-shot 3D point cloud segmentation methods from seen to unseen classes and from semantic to visual space. Method: Introduces latent geometric prototypes (LGPs) and integrates them into the generator via cross-attention, uses a self-consistency loss to make features more robust, and re-represents visual and semantic features in a shared space. Result: On ScanNet, SemanticKITTI, and S3DIS, 3D-PointZshotS outperforms four baselines in harmonic mIoU. Conclusion: Through geometry-aware feature generation and alignment, 3D-PointZshotS clearly improves the performance and transferability of zero-shot 3D point cloud segmentation. Abstract: Existing zero-shot 3D point cloud segmentation methods often struggle with limited transferability from seen classes to unseen classes and from semantic to visual space. To alleviate this, we introduce 3D-PointZshotS, a geometry-aware zero-shot segmentation framework that enhances both feature generation and alignment using latent geometric prototypes (LGPs). Specifically, we integrate LGPs into a generator via a cross-attention mechanism, enriching semantic features with fine-grained geometric details. To further enhance stability and generalization, we introduce a self-consistency loss, which enforces feature robustness against point-wise perturbations. Additionally, we re-represent visual and semantic features in a shared space, bridging the semantic-visual gap and facilitating knowledge transfer to unseen classes. Experiments on three real-world datasets, namely ScanNet, SemanticKITTI, and S3DIS, demonstrate that our method achieves superior performance over four baselines in terms of harmonic mIoU. The code is available at https://github.com/LexieYang/3D-PointZshotS.
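Harmonic mIoU, as commonly used in generalized zero-shot segmentation, is the harmonic mean of the mIoU over seen classes and the mIoU over unseen classes. Assuming that definition applies here (the abstract does not spell it out), it can be sketched as:

```python
def harmonic_miou(seen_miou, unseen_miou):
    """Harmonic mean of seen-class and unseen-class mIoU (both in [0, 1])."""
    if seen_miou + unseen_miou == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2.0 * seen_miou * unseen_miou / (seen_miou + unseen_miou)


h = harmonic_miou(0.6, 0.3)  # example values chosen only for illustration
```

The harmonic mean penalizes a model that does well on seen classes but poorly on unseen ones, which is why it suits zero-shot evaluation.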

[8] DG-MVP: 3D Domain Generalization via Multiple Views of Point Clouds for Classification

Huantao Ren,Minmin Yang,Senem Velipasalar

Main category: cs.CV

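TL;DR: The paper proposes a new 3D point cloud domain generalization method that uses multi-view 2D projections and a convolutional model to handle missing points and occlusion, and it outperforms baselines in experiments.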

Details

Motivation: Address 3D point cloud domain generalization, especially the domain shift from CAD models to LiDAR data, and the waste of features caused by max-pooling in existing methods. Method: Uses multi-view 2D projections to mitigate missing points and a convolutional model to extract features. Result: Outperforms baseline methods on the PointDA-10 and Sim-to-Real benchmarks and transfers well from the synthetic to the real-world domain. Conclusion: The proposed method has a clear advantage on 3D point cloud domain generalization and can handle missing data and occlusion. Abstract: Deep neural networks have achieved significant success in 3D point cloud classification while relying on large-scale, annotated point cloud datasets, which are labor-intensive to build. Compared to capturing data with LiDAR sensors and then performing annotation, it is relatively easier to sample point clouds from CAD models. Yet, data sampled from CAD models is regular, and does not suffer from occlusion and missing points, which are very common for LiDAR data, creating a large domain shift. Therefore, it is critical to develop methods that can generalize well across different point cloud domains. Existing 3D domain generalization methods employ point-based backbones to extract point cloud features. Yet, by analyzing point utilization of point-based methods and observing the geometry of point clouds from different domains, we have found that a large number of point features are discarded by point-based methods through the max-pooling operation. This is a significant waste especially considering the fact that domain generalization is more challenging than supervised learning, and point clouds are already affected by missing points and occlusion to begin with. To address these issues, we propose a novel method for 3D point cloud domain generalization, which can generalize to unseen domains of point clouds. Our proposed method employs multiple 2D projections of a 3D point cloud to alleviate the issue of missing points and involves a simple yet effective convolution-based model to extract features. The experiments, performed on the PointDA-10 and Sim-to-Real benchmarks, demonstrate the effectiveness of our proposed method, which outperforms different baselines, and can transfer well from synthetic domain to real-world domain.
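To make the idea of multiple 2D projections concrete, here is a minimal sketch that renders axis-aligned orthographic depth maps from a point cloud. The choice of three axis-aligned views, the grid size, and the assumption that coordinates are normalized to [0, 1) are illustrative only; the abstract does not specify the paper's actual projection setup.

```python
def project_depth(points, axis, size=4):
    """Orthographic depth map of points in [0, 1)^3, viewed along one axis.

    Empty cells stay None; occupied cells keep the nearest depth value.
    """
    u, v = [a for a in range(3) if a != axis]
    depth = [[None] * size for _ in range(size)]
    for p in points:
        row = min(int(p[u] * size), size - 1)
        col = min(int(p[v] * size), size - 1)
        d = p[axis]
        if depth[row][col] is None or d < depth[row][col]:
            depth[row][col] = d
    return depth


points = [(0.1, 0.1, 0.9), (0.1, 0.1, 0.2), (0.9, 0.5, 0.5)]
# One depth image per viewing axis; a 2D CNN could consume each view.
views = [project_depth(points, axis) for axis in range(3)]
```

A region that is sparse or occluded from one viewpoint may still be covered in another view, which is the intuition behind combining several projections.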

[9] AdaVid: Adaptive Video-Language Pretraining

Chaitanya Patel,Juan Carlos Niebles,Ehsan Adeli

Main category: cs.CV

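TL;DR: AdaVid is a video encoding framework that adapts dynamically to available compute; its adaptive transformer block adjusts computational demand, it performs strongly on short-video tasks, and it supports long-video processing.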

Details

Motivation: Address the high computational demands of existing video encoders on edge devices and their restriction to short clips. Method: Introduces an adaptive transformer block and a lightweight hierarchical network to adjust the hidden embedding dimension dynamically and to aggregate short-clip features. Result: Matches standard EgoVLP on short video-language benchmarks with half the compute, and balances compute and accuracy on long-video tasks. Conclusion: AdaVid provides an efficient and flexible video encoding solution suited to different compute budgets and video lengths. Abstract: Contrastive video-language pretraining has demonstrated great success in learning rich and robust video representations. However, deploying such video encoders on compute-constrained edge devices remains challenging due to their high computational demands. Additionally, existing models are typically trained to process only short video clips, often limited to 4 to 64 frames. In this paper, we introduce AdaVid, a flexible architectural framework designed to learn efficient video encoders that can dynamically adapt their computational footprint based on available resources. At the heart of AdaVid is an adaptive transformer block, inspired by Matryoshka Representation Learning, which allows the model to adjust its hidden embedding dimension at inference time. We show that AdaVid-EgoVLP, trained on video-narration pairs from the large-scale Ego4D dataset, matches the performance of the standard EgoVLP on short video-language benchmarks using only half the compute, and even outperforms EgoVLP when given equal computational resources. We further explore the trade-off between frame count and compute on the challenging Diving48 classification benchmark, showing that AdaVid enables the use of more frames without exceeding computational limits. To handle longer videos, we also propose a lightweight hierarchical network that aggregates short clip features, achieving a strong balance between compute efficiency and accuracy across several long video benchmarks.

[10] Event Quality Score (EQS): Assessing the Realism of Simulated Event Camera Streams via Distances in Latent Space

Kaustav Chanda,Aayush Atul Verma,Arpitsinh Vaghela,Yezhou Yang,Bharatesh Chakravarthi

Main category: cs.CV

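TL;DR: The paper proposes an event quality score (EQS) for assessing the realism of simulated event data and shows experimentally that a higher EQS improves generalization to real data.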

Details

Motivation: Event cameras lack high-quality labeled data for deep learning vision tasks, and data from existing simulators differs substantially from real data, limiting wider adoption. Method: Uses activations of the RVT architecture to define the event quality score (EQS) as a measure of the realism of simulated data. Result: Experiments on the DSEC driving dataset show that simulated data with a higher EQS clearly improves model performance on real data. Conclusion: EQS is an effective tool for developing more realistic event camera simulators and helps narrow the gap between simulated and real data. Abstract: Event cameras promise a paradigm shift in vision sensing with their low latency, high dynamic range, and asynchronous nature of events. Unfortunately, the scarcity of high-quality labeled datasets hinders their widespread adoption in deep learning-driven computer vision. To mitigate this, several simulators have been proposed to generate synthetic event data for training models for detection and estimation tasks. However, the fundamentally different sensor design of event cameras compared to traditional frame-based cameras poses a challenge for accurate simulation. As a result, most simulated data fail to mimic data captured by real event cameras. Inspired by existing work on using deep features for image comparison, we introduce event quality score (EQS), a quality metric that utilizes activations of the RVT architecture. Through sim-to-real experiments on the DSEC driving dataset, it is shown that a higher EQS implies improved generalization to real-world data after training on simulated events. Thus, optimizing for EQS can lead to developing more realistic event camera simulators, effectively reducing the simulation gap. EQS is available at https://github.com/eventbasedvision/EQS.

Andy Dimnaku,Dominic Yurk,Zhiyuan Gao,Arun Padmanabhan,Mandar Aras,Yaser Abu-Mostafa

Main category: cs.CV

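TL;DR: This paper presents a novel AI-based navigation system that helps localize the inferior vena cava (IVC) during cardiac ultrasound and works across ultrasound devices of differing quality.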

Details

Motivation: Conventional cardiac ultrasound depends on expert sonographers and high-quality devices, which limits its use outside hospitals. AI navigation systems can help novice operators acquire standardized views. Method: The system combines a decision model trained offline with a real-time localization algorithm; binary classification determines whether the IVC is present, and its location is annotated. Result: The model performs strongly on both high-quality hospital videos and low-cost ultrasound videos, including zero-shot performance on the latter. Conclusion: The system could extend ultrasound diagnostics beyond hospitals; it is currently in clinical trials and has been integrated into the Butterfly iQ app. Abstract: Ultrasound imaging of the heart (echocardiography) is widely used to diagnose cardiac diseases. However, obtaining an echocardiogram requires an expert sonographer and a high-quality ultrasound imaging device, which are generally only available in hospitals. Recently, AI-based navigation models and algorithms have been used to aid novice sonographers in acquiring the standardized cardiac views necessary to visualize potential disease pathologies. These navigation systems typically rely on directional guidance to predict the necessary rotation of the ultrasound probe. This paper demonstrates a novel AI navigation system that builds on a decision model for identifying the inferior vena cava (IVC) of the heart. The decision model is trained offline using cardiac ultrasound videos and employs binary classification to determine whether the IVC is present in a given ultrasound video. The underlying model integrates a novel localization algorithm that leverages the learned feature representations to annotate the spatial location of the IVC in real-time. Our model demonstrates strong localization performance on traditional high-quality hospital ultrasound videos, as well as impressive zero-shot performance on lower-quality ultrasound videos from a more affordable Butterfly iQ handheld ultrasound machine. This capability facilitates the expansion of ultrasound diagnostics beyond hospital settings. Currently, the guidance system is undergoing clinical trials and is available on the Butterfly iQ app.

[12] Post-Hurricane Debris Segmentation Using Fine-Tuned Foundational Vision Models

Kooshan Amini,Yuhao Liu,Jamie Ellen Padgett,Guha Balakrishnan,Ashok Veeraraghavan

Main category: cs.CV

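TL;DR: The study fine-tunes pre-trained vision models to build a general hurricane debris segmentation method from a small, high-quality dataset, and it performs well on an unseen hurricane event.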

Details

Motivation: Timely and accurate hurricane debris detection is critical for disaster response and community recovery, but existing methods are limited by environmental variation and data scarcity. Method: Introduces an open-source dataset of about 1,200 manually annotated RGB images, improves data quality through multi-annotator label aggregation and visual prompt engineering, and fine-tunes a pre-trained model to obtain fCLIPSeg. Result: The model reaches a Dice score of 0.70 on Hurricane Ida data unseen during training, with almost no false positives in debris-free areas. Conclusion: The model is the first general debris segmentation model that needs only standard RGB imagery, making it suitable for large-scale post-disaster assessment and recovery planning. Abstract: Timely and accurate detection of hurricane debris is critical for effective disaster response and community resilience. While post-disaster aerial imagery is readily available, robust debris segmentation solutions applicable across multiple disaster regions remain limited. Developing a generalized solution is challenging due to varying environmental and imaging conditions that alter debris' visual signatures across different regions, further compounded by the scarcity of training data. This study addresses these challenges by fine-tuning pre-trained foundational vision models, achieving robust performance with a relatively small, high-quality dataset. Specifically, this work introduces an open-source dataset comprising approximately 1,200 manually annotated aerial RGB images from Hurricanes Ian, Ida, and Ike. To mitigate human biases and enhance data quality, labels from multiple annotators are strategically aggregated and visual prompt engineering is employed. The resulting fine-tuned model, named fCLIPSeg, achieves a Dice score of 0.70 on data from Hurricane Ida, a disaster event entirely excluded during training, with virtually no false positives in debris-free areas. This work presents the first event-agnostic debris segmentation model requiring only standard RGB imagery during deployment, making it well-suited for rapid, large-scale post-disaster impact assessments and recovery planning.
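The Dice score reported above is the standard overlap measure 2|A ∩ B| / (|A| + |B|) between predicted and ground-truth masks. A minimal sketch on flat binary masks follows; returning 1.0 when both masks are empty is a common convention assumed here, not something the abstract states.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 lists."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


predicted = [1, 1, 0, 0]
annotated = [1, 0, 0, 0]
score = dice_score(predicted, annotated)  # 2 * 1 / (2 + 1)
```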

[13] Privacy-Preserving Operating Room Workflow Analysis using Digital Twins

Alejandra Perez,Han Zhang,Yu-Chun Ku,Lalithkumar Seenivasan,Roger Soberanis,Jose L. Porras,Richard Day,Jeff Jopling,Peter Najjar,Mathias Unberath

Main category: cs.CV

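TL;DR: Proposes a privacy-preserving two-stage approach to operating room video analysis that uses Digital Twins for event detection and performs on par with raw video.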

Details

Motivation: Optimizing operating room workflows requires event recognition, but privacy concerns limit the use of computer vision, so privacy-preserving methods are needed. Method: A two-stage pipeline: first generate Digital Twins (via depth estimation and semantic segmentation), then detect events with the SafeOR model (two-stream fusion). Result: The Digital Twin approach matches or exceeds raw video in event detection performance. Conclusion: Digital Twins protect privacy, facilitate data sharing, and may improve model generalization. Abstract: Purpose: The operating room (OR) is a complex environment where optimizing workflows is critical to reduce costs and improve patient outcomes. The use of computer vision approaches for the automatic recognition of perioperative events enables identification of bottlenecks for OR optimization. However, privacy concerns limit the use of computer vision for automated event detection from OR videos, which makes privacy-preserving approaches needed for OR workflow analysis. Methods: We propose a two-stage pipeline for privacy-preserving OR video analysis and event detection. In the first stage, we leverage vision foundation models for depth estimation and semantic segmentation to generate de-identified Digital Twins (DT) of the OR from conventional RGB videos. In the second stage, we employ the SafeOR model, a fused two-stream approach that processes segmentation masks and depth maps for OR event detection. We evaluate this method on an internal dataset of 38 simulated surgical trials with five event classes. Results: Our results indicate that this DT-based approach to the OR event detection model achieves performance on par and sometimes even better than raw RGB video-based models on detecting OR events. Conclusion: DTs enable privacy-preserving OR workflow analysis, facilitating the sharing of de-identified data across institutions and they can potentially enhance model generalizability by mitigating domain-specific appearance differences.

[14] Contour Field based Elliptical Shape Prior for the Segment Anything Model

Xinyu Zhao,Jun Liu,Faqiang Wang,Li Cui,Yuping Duan

Main category: cs.CV

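TL;DR: The paper proposes a new way to integrate an elliptical shape prior into the deep-learning-based SAM image segmentation technique, using variational methods to improve segmentation accuracy.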

Details

Motivation: Existing deep learning methods such as SAM are inefficient at producing elliptical segmentation results, yet elliptical shape priors are vital for segmentation tasks on medical and natural images. Method: Builds a parameterized elliptical contour field, uses a dual algorithm to combine image features with the elliptical prior and spatial regularization priors, decomposes SAM into four mathematical sub-problems, and integrates the variational ellipse prior to design a new network structure. Result: Experiments on specific image datasets show an improvement over the original SAM. Conclusion: Integrating the elliptical shape prior clearly improves SAM's segmentation accuracy and suits tasks that require elliptical segmentation. Abstract: The elliptical shape prior information plays a vital role in improving the accuracy of image segmentation for specific tasks in medical and natural images. Existing deep learning-based segmentation methods, including the Segment Anything Model (SAM), often struggle to produce segmentation results with elliptical shapes efficiently. This paper proposes a new approach to integrate the prior of elliptical shapes into the deep learning-based SAM image segmentation techniques using variational methods. The proposed method establishes a parameterized elliptical contour field, which constrains the segmentation results to align with predefined elliptical contours. Utilizing the dual algorithm, the model seamlessly integrates image features with elliptical priors and spatial regularization priors, thereby greatly enhancing segmentation accuracy. By decomposing SAM into four mathematical sub-problems, we integrate the variational ellipse prior to design a new SAM network structure, ensuring that the segmentation output of SAM consists of elliptical regions. Experimental results on some specific image datasets demonstrate an improvement over the original SAM.

[15] Parsimonious Dataset Construction for Laparoscopic Cholecystectomy Structure Segmentation

Yuning Zhou,Henry Badgery,Matthew Read,James Bailey,Catherine Davey

Main category: cs.CV

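TL;DR: The paper proposes an active-learning-based method for selecting key frames from laparoscopic cholecystectomy videos to build a high-quality dataset, substantially reducing annotation cost.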

Details

Motivation: Annotation is expensive in the medical domain, which hinders deep learning applications. This work aims to cut annotation cost through active learning while improving model performance. Method: Uses an active learning strategy in which already-trained DNNs select the most informative new data, and evaluates different measures of data informativeness. Result: With only half of the data, selected by active learning, the model (0.4349 mIoU) performs close to training on the full dataset (0.4374 mIoU). Conclusion: Active learning effectively reduces annotation needs while maintaining model performance and suits medical image segmentation tasks. Abstract: Labeling has always been expensive in the medical context, which has hindered related deep learning applications. Our work introduces active learning in surgical video frame selection to construct a high-quality, affordable Laparoscopic Cholecystectomy dataset for semantic segmentation. Active learning allows the Deep Neural Networks (DNNs) learning pipeline to include the dataset construction workflow, which means DNNs trained on an existing dataset will identify the most informative data from the newly collected data. At the same time, DNNs' performance and generalization ability improve over time when the newly selected and annotated data are included in the training data. We assessed different data informativeness measurements and found that deep feature distances select the most informative data in this task. Our experiments show that with half of the data selected by active learning, the DNNs achieve almost the same performance, with 0.4349 mean Intersection over Union (mIoU), compared to the same DNNs trained on the full dataset (0.4374 mIoU) on the critical anatomies and surgical instruments.
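One way to read "deep feature distances" as an informativeness measure is a greedy farthest-first selection: repeatedly pick the unlabeled frame whose feature vector is farthest from everything already covered. The abstract does not give the exact criterion, so the rule below, the name `select_informative`, and the toy 2D features are all assumptions for illustration.

```python
import math


def select_informative(labeled, unlabeled, budget):
    """Greedily pick indices of unlabeled feature vectors far from known ones.

    `labeled` must be non-empty; features are equal-length tuples of floats.
    """
    reference = list(labeled)
    pool = list(range(len(unlabeled)))
    chosen = []
    for _ in range(min(budget, len(pool))):
        # Informativeness (assumed): distance to the nearest covered feature.
        best = max(
            pool,
            key=lambda i: min(math.dist(unlabeled[i], r) for r in reference),
        )
        chosen.append(best)
        pool.remove(best)
        reference.append(unlabeled[best])  # the pick now counts as covered
    return chosen


labeled_feats = [(0.0, 0.0)]
unlabeled_feats = [(0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (-4.0, 0.0)]
picked = select_informative(labeled_feats, unlabeled_feats, budget=2)
```

In this toy run the near-duplicate frame at (5.0, 5.0) is skipped once (5.1, 5.0) has been chosen, which mirrors the goal of annotating fewer, more diverse frames.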

[16] Prompt-Driven and Training-Free Forgetting Approach and Dataset for Large Language Models

Zhenyu Yu,Mohd Yamani Inda Idris,Pei Wang

Main category: cs.CV

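TL;DR: Proposes an automatic dataset creation framework based on prompt-based layered editing and training-free local feature removal, builds the ForgetMe dataset, and introduces the Entangled evaluation metric to quantify selective unlearning in diffusion models.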

Details

Motivation: The wide use of diffusion models in image generation has increased the demand for privacy-compliant unlearning, but existing methods struggle to achieve selective unlearning over high-dimensional, complex features. Method: Builds the ForgetMe dataset using prompt-based layered editing and training-free local feature removal, introduces the Entangled evaluation metric, and applies LoRA fine-tuning on Stable Diffusion to achieve selective unlearning. Result: Validates the effectiveness of the ForgetMe dataset and the Entangled metric, providing a scalable solution for selective unlearning. Conclusion: The work provides a scalable and adaptable solution for privacy-preserving generative AI. Abstract: The widespread adoption of diffusion models in image generation has increased the demand for privacy-compliant unlearning. However, due to the high-dimensional nature and complex feature representations of diffusion models, achieving selective unlearning remains challenging, as existing methods struggle to remove sensitive information while preserving the consistency of non-sensitive regions. To address this, we propose an Automatic Dataset Creation Framework based on prompt-based layered editing and training-free local feature removal, constructing the ForgetMe dataset and introducing the Entangled evaluation metric. The Entangled metric quantifies unlearning effectiveness by assessing the similarity and consistency between the target and background regions and supports both paired (Entangled-D) and unpaired (Entangled-S) image data, enabling unsupervised evaluation. The ForgetMe dataset encompasses a diverse set of real and synthetic scenarios, including CUB-200-2011 (Birds), Stanford-Dogs, ImageNet, and a synthetic cat dataset. We apply LoRA fine-tuning on Stable Diffusion to achieve selective unlearning on this dataset and validate the effectiveness of both the ForgetMe dataset and the Entangled metric, establishing them as benchmarks for selective unlearning. Our work provides a scalable and adaptable solution for advancing privacy-preserving generative AI.

[17] CM3AE: A Unified RGB Frame and Event-Voxel/-Frame Pre-training Framework

Wentao Wu,Xiao Wang,Chenglong Li,Bo Jiang,Jin Tang,Bin Luo,Qi Liu

Main category: cs.CV

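TL;DR: Proposes CM3AE, a new pre-training framework for RGB-Event perception that strengthens cross-modal information aggregation through a multi-modal fusion reconstruction module and a contrastive learning strategy.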

Details

Motivation: Existing event-data pre-training methods fail to establish strong connections with RGB frames, which limits their use in multi-modal fusion scenarios. Method: Designs a multi-modal fusion reconstruction module and a contrastive learning strategy, with RGB images, event images, and event voxels as inputs. Result: Builds a large-scale dataset of 2,535,759 RGB-Event data pairs and validates CM3AE on five downstream tasks. Conclusion: CM3AE provides strong support for event-based and RGB-Event fusion downstream tasks; code and pre-trained models will be open-sourced. Abstract: Event cameras have attracted increasing attention in recent years due to their advantages in high dynamic range, high temporal resolution, low power consumption, and low latency. Some researchers have begun exploring pre-training directly on event data. Nevertheless, these efforts often fail to establish strong connections with RGB frames, limiting their applicability in multi-modal fusion scenarios. To address these issues, we propose a novel CM3AE pre-training framework for RGB-Event perception. This framework accepts multi-modalities/views of data as input, including RGB images, event images, and event voxels, providing robust support for both event-based and RGB-event fusion based downstream tasks. Specifically, we design a multi-modal fusion reconstruction module that reconstructs the original image from fused multi-modal features, explicitly enhancing the model's ability to aggregate cross-modal complementary information. Additionally, we employ a multi-modal contrastive learning strategy to align cross-modal feature representations in a shared latent space, which effectively enhances the model's capability for multi-modal understanding and capturing global dependencies. We construct a large-scale dataset containing 2,535,759 RGB-Event data pairs for the pre-training. Extensive experiments on five downstream tasks fully demonstrated the effectiveness of CM3AE. Source code and pre-trained models will be released on https://github.com/Event-AHU/CM3AE.

[18] 3DResT: A Strong Baseline for Semi-Supervised 3D Referring Expression Segmentation

Wenxin Chen,Mengxue Qu,Weitai Kang,Yan Yan,Yao Zhao,Yunchao Wei

Main category: cs.CV

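TL;DR: The paper proposes 3DResT, a semi-supervised learning framework that uses pseudo-labels efficiently through TSCS and QDW and significantly improves performance on the 3D-RES task.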

Details

Motivation: The 3D-RES task requires large amounts of annotated data, which is costly. Existing semi-supervised methods make insufficient use of pseudo-labels, which limits model performance. Method: The 3DResT framework is proposed, with two designs: TSCS (selecting high-quality pseudo-labels) and QDW (dynamically weighting low-quality pseudo-labels). Result: With only 1% labeled data, 3DResT improves mIoU by 8.34 points over the fully supervised method. Conclusion: By making better use of pseudo-labels, 3DResT significantly reduces annotation cost and improves performance. Abstract: 3D Referring Expression Segmentation (3D-RES) typically requires extensive instance-level annotations, which are time-consuming and costly. Semi-supervised learning (SSL) mitigates this by using limited labeled data alongside abundant unlabeled data, improving performance while reducing annotation costs. SSL uses a teacher-student paradigm where a teacher generates high-confidence-filtered pseudo-labels to guide a student. However, in the context of 3D-RES, where each label corresponds to a single mask and labeled data is scarce, existing SSL methods treat high-quality pseudo-labels merely as auxiliary supervision, which limits the model's learning potential. The reliance on high-confidence thresholds for filtering often results in potentially valuable pseudo-labels being discarded, restricting the model's ability to leverage the abundant unlabeled data. Therefore, we identify two critical challenges in semi-supervised 3D-RES, namely, inefficient utilization of high-quality pseudo-labels and wastage of useful information from low-quality pseudo-labels. In this paper, we introduce the first semi-supervised learning framework for 3D-RES, presenting a robust baseline method named 3DResT. To address these challenges, we propose two novel designs called Teacher-Student Consistency-Based Sampling (TSCS) and Quality-Driven Dynamic Weighting (QDW). TSCS aids in the selection of high-quality pseudo-labels, integrating them into the labeled dataset to strengthen the labeled supervision signals. QDW preserves low-quality pseudo-labels by dynamically assigning them lower weights, allowing for the effective extraction of useful information rather than discarding them. Extensive experiments conducted on the widely used benchmark demonstrate the effectiveness of our method. Notably, with only 1% labeled data, 3DResT achieves an mIoU improvement of 8.34 points compared to the fully supervised method.
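
Illustrative sketch (not from the paper): the idea behind quality-driven weighting, keeping low-quality pseudo-labels but letting them contribute less to the loss, can be shown in a few lines of Python. The abstract does not give the weighting rule, so the power-law weight `quality ** gamma`, the function name `weighted_pseudo_label_loss`, and the `gamma` parameter are all assumptions made for illustration.

```python
def weighted_pseudo_label_loss(losses, qualities, gamma=1.0):
    """Weighted mean of per-sample losses on pseudo-labelled data.

    losses:    per-sample loss values
    qualities: per-sample quality scores in [0, 1] (hypothetical, e.g. a
               teacher-student agreement score)
    gamma:     hypothetical exponent; larger values suppress low-quality
               samples more strongly
    """
    if len(losses) != len(qualities):
        raise ValueError("losses and qualities must have the same length")
    # Low-quality samples are kept, but with a smaller weight.
    weights = [q ** gamma for q in qualities]
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    return sum(w * l for w, l in zip(weights, losses)) / total_weight
```

With equal quality scores this reduces to the ordinary mean loss; a sample with quality 0 is effectively ignored, which matches hard filtering as a limiting case.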

[19] AdaQual-Diff: Diffusion-Based Image Restoration via Adaptive Quality Prompting

Xin Su,Chen Wu,Yu Zhang,Chen Lyu,Zhuoran Zheng

Main category: cs.CV

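TL;DR: AdaQual-Diff is a diffusion-based framework that tackles the restoration of images with complex degradations by integrating perceptual quality assessment directly into the generative restoration process.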

Details

Motivation: Conventional methods rely on indirect cues and struggle to adapt to the unique mixture and severity of complex degradations, which leads to poor restoration. Method: AdaQual-Diff uses an Adaptive Quality Prompting mechanism that adjusts prompt structure according to regional quality scores, dynamically allocating computational resources. Result: Experiments show that AdaQual-Diff achieves visually superior restorations on synthetic and real-world datasets. Conclusion: Through quality guidance, AdaQual-Diff achieves fine-grained control over regional restoration intensity without additional parameters or inference iterations. Abstract: Restoring images afflicted by complex real-world degradations remains challenging, as conventional methods often fail to adapt to the unique mixture and severity of artifacts present. This stems from a reliance on indirect cues which poorly capture the true perceptual quality deficit. To address this fundamental limitation, we introduce AdaQual-Diff, a diffusion-based framework that integrates perceptual quality assessment directly into the generative restoration process. Our approach establishes a mathematical relationship between regional quality scores from DeQAScore and optimal guidance complexity, implemented through an Adaptive Quality Prompting mechanism. This mechanism systematically modulates prompt structure according to measured degradation severity: regions with lower perceptual quality receive computationally intensive, structurally complex prompts with precise restoration directives, while higher quality regions receive minimal prompts focused on preservation rather than intervention. The technical core of our method lies in the dynamic allocation of computational resources proportional to degradation severity, creating a spatially-varying guidance field that directs the diffusion process with mathematical precision. By combining this quality-guided approach with content-specific conditioning, our framework achieves fine-grained control over regional restoration intensity without requiring additional parameters or inference iterations. Experimental results demonstrate that AdaQual-Diff achieves visually superior restorations across diverse synthetic and real-world datasets.

[20] Robo-SGG: Exploiting Layout-Oriented Normalization and Restitution for Robust Scene Graph Generation

Changsheng Lv,Mengshi Qi,Zijian Fu,Huadong Ma

Main category: cs.CV

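TL;DR: The paper proposes Robo-SGG, which improves the robustness of scene graph generation on corrupted images through layout-oriented normalization and restitution.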

Details

Motivation: Existing scene graph generation methods degrade on corrupted images because visual features are compromised; layout information, which is domain-invariant, is used to improve robustness. Method: Instance Normalization filters out domain-specific features, Layout-Oriented Restitution recovers structural features, and a Layout-Embedded Encoder enriches object and predicate features. Result: On the VG-C dataset, mR@50 on the PredCls, SGCls, and SGDet tasks improves by a relative 5.6%, 8.0%, and 6.5%, respectively, reaching a new SOTA. Conclusion: As a plug-and-play module, Robo-SGG significantly improves scene graph generation on corrupted images. Abstract: In this paper, we introduce a novel method named Robo-SGG, i.e., Layout-Oriented Normalization and Restitution for Robust Scene Graph Generation. Compared to the existing SGG setting, the robust scene graph generation aims to perform inference on a diverse range of corrupted images, with the core challenge being the domain shift between the clean and corrupted images. Existing SGG methods suffer from degraded performance due to compromised visual features, e.g., corruption interference or occlusions. To obtain robust visual features, we exploit the layout information, which is domain-invariant, to enhance the efficacy of existing SGG methods on corrupted images. Specifically, we employ Instance Normalization (IN) to filter out the domain-specific feature and recover the unchangeable structural features, i.e., the positional and semantic relationships among objects, by the proposed Layout-Oriented Restitution. Additionally, we propose a Layout-Embedded Encoder (LEE) that augments the existing object and predicate encoders within the SGG framework, enriching the robust positional and semantic features of objects and predicates. Note that our proposed Robo-SGG module is designed as a plug-and-play component, which can be easily integrated into any baseline SGG model. Extensive experiments demonstrate that by integrating the state-of-the-art method into our proposed Robo-SGG, we achieve relative improvements of 5.6%, 8.0%, and 6.5% in mR@50 for PredCls, SGCls, and SGDet tasks on the VG-C dataset, respectively, and achieve new state-of-the-art performance in corruption scene graph generation benchmark (VG-C and GQA-C). We will release our source code and model.
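
Illustrative sketch (not from the paper): Instance Normalization standardizes each channel of a single image over its spatial positions, which removes per-image statistics such as brightness and contrast that tend to be domain-specific. The plain-Python version below shows only this standard operation, without learnable scale and shift; the function name `instance_norm` and the nested-list layout are assumptions, and the paper's Layout-Oriented Restitution step is not reproduced here.

```python
import math


def instance_norm(feature_map, eps=1e-5):
    """Normalize each channel of ONE image to zero mean and unit variance.

    feature_map: list of channels, each channel a 2D list (H x W) of floats.
    eps:         small constant for numerical stability.
    """
    normalized = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        normalized.append([[(v - mean) * scale for v in row] for row in channel])
    return normalized
```

A channel and a brightness-shifted, contrast-scaled copy of it (for example `2 * x + 10`) normalize to nearly the same values, which is why the operation is often used to suppress domain-specific appearance.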

[21] SAM-Based Building Change Detection with Distribution-Aware Fourier Adaptation and Edge-Constrained Warping

Yun-Cheng Li,Sen Lei,Yi-Tao Zhao,Heng-Chao Li,Jun Li,Antonio Plaza

Main category: cs.CV

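TL;DR: FAEWNet is a SAM-based network that combines distribution-aware Fourier adaptation with edge-constrained warping for building change detection, addressing domain gap and noise interference.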

Details

Motivation: Building change detection has important applications in urban development, disaster assessment, and military reconnaissance, but existing approaches (such as SAM) perform poorly because of domain gap and noise interference. Method: FAEWNet extracts features with the SAM encoder, introduces a distribution-aware Fourier adapter to aggregate change information, and designs a flow module to refine edge extraction and alignment. Result: State-of-the-art results are achieved on the LEVIR-CD, S2Looking, and WHU-CD datasets. Conclusion: FAEWNet effectively addresses domain gap and noise in building change detection and significantly improves detection accuracy and edge recognition. Abstract: Building change detection remains challenging for urban development, disaster assessment, and military reconnaissance. While foundation models like Segment Anything Model (SAM) show strong segmentation capabilities, SAM is limited in the task of building change detection due to domain gap issues. Existing adapter-based fine-tuning approaches face challenges with imbalanced building distribution, resulting in poor detection of subtle changes and inaccurate edge extraction. Additionally, bi-temporal misalignment in change detection, typically addressed by optical flow, remains vulnerable to background noise. This affects the detection of building changes and compromises both detection accuracy and edge recognition. To tackle these challenges, we propose a new SAM-Based Network with Distribution-Aware Fourier Adaptation and Edge-Constrained Warping (FAEWNet) for building change detection. FAEWNet utilizes the SAM encoder to extract rich visual features from remote sensing images. To guide SAM in focusing on specific ground objects in remote sensing scenes, we propose a Distribution-Aware Fourier Aggregated Adapter to aggregate task-oriented changed information. This adapter not only effectively addresses the domain gap issue, but also pays attention to the distribution of changed buildings. Furthermore, to mitigate noise interference and misalignment in height offset estimation, we design a novel flow module that refines building edge extraction and enhances the perception of changed buildings. Our state-of-the-art results on the LEVIR-CD, S2Looking and WHU-CD datasets highlight the effectiveness of FAEWNet. The code is available at https://github.com/SUPERMAN123000/FAEWNet.

[22] Packing Input Frame Context in Next-Frame Prediction Models for Video Generation

Lvmin Zhang,Maneesh Agrawala

Main category: cs.CV

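TL;DR: FramePack is a neural network structure for training next-frame prediction models for video generation; it compresses input frames to a fixed context length, which improves training efficiency and supports larger batches.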

Details

Motivation: To address the context length that varies with video length in video generation, and to reduce computational bottlenecks and error accumulation. Method: FramePack compresses input frames to fix the context length, uses an anti-drifting sampling method to generate frames, and supports fine-tuning of existing video diffusion models. Result: More frames can be processed, training batch sizes increase significantly, and visual quality improves. Conclusion: FramePack offers an efficient, higher-quality approach to video generation that can be applied to improve existing models. Abstract: We present a neural network structure, FramePack, to train next-frame (or next-frame-section) prediction models for video generation. The FramePack compresses input frames to make the transformer context length a fixed number regardless of the video length. As a result, we are able to process a large number of frames using video diffusion with computation bottleneck similar to image diffusion. This also makes the training video batch sizes significantly higher (batch sizes become comparable to image diffusion training). We also propose an anti-drifting sampling method that generates frames in inverted temporal order with early-established endpoints to avoid exposure bias (error accumulation over iterations). Finally, we show that existing video diffusion models can be finetuned with FramePack, and their visual quality may be improved because the next-frame prediction supports more balanced diffusion schedulers with less extreme flow shift timesteps.
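
Illustrative sketch (not from the paper): one way a context length can stay bounded regardless of video length is to give older frames progressively fewer tokens. The abstract does not say how frames are compressed, so the geometric schedule below, the function name `packed_context_length`, and the values `base_tokens=1536` and `ratio=2` are assumptions chosen only to show why the total converges.

```python
def packed_context_length(num_frames, base_tokens=1536, ratio=2):
    """Total tokens when the frame of age k gets base_tokens // ratio**k tokens.

    num_frames:  number of past frames available as context
    base_tokens: hypothetical token count for the most recent frame
    ratio:       hypothetical compression factor per step back in time
    """
    total = 0
    for age in range(num_frames):
        tokens = base_tokens // (ratio ** age)
        if tokens == 0:
            # Frames this old contribute nothing under this schedule.
            break
        total += tokens
    return total
```

Because the per-frame budgets form a geometric series, the total stays below `base_tokens * ratio / (ratio - 1)` (here twice `base_tokens`) however many frames are supplied.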

[23] RoPETR: Improving Temporal Camera-Only 3D Detection by Integrating Enhanced Rotary Position Embedding

Hang Ji,Tao Ni,Xufeng Huang,Tao Luo,Xin Zhan,Junbo Chen

Main category: cs.CV

TL;DR: This report proposes an improvement to the StreamPETR framework that focuses on velocity estimation and significantly raises the NuScenes Detection Score.

Details

Motivation: StreamPETR performs well on 3D bounding box detection, but velocity estimation on the NuScenes dataset is a bottleneck that limits overall performance. Method: A customized positional embedding strategy is adopted to strengthen temporal modeling. Result: On the NuScenes test set, the improved method reaches an NDS of 70.86% with a ViT-L backbone, setting a new benchmark for camera-only 3D object detection. Conclusion: By improving velocity estimation, the method substantially raises StreamPETR's performance and reaches the state of the art among camera-only 3D detectors. Abstract: This technical report introduces a targeted improvement to the StreamPETR framework, specifically aimed at enhancing velocity estimation, a critical factor influencing the overall NuScenes Detection Score. While StreamPETR exhibits strong 3D bounding box detection performance, as reflected by its high mean Average Precision, our analysis identified velocity estimation as a substantial bottleneck when evaluated on the NuScenes dataset. To overcome this limitation, we propose a customized positional embedding strategy tailored to enhance temporal modeling capabilities. Experimental evaluations conducted on the NuScenes test set demonstrate that our improved approach achieves a state-of-the-art NDS of 70.86% using the ViT-L backbone, setting a new benchmark for camera-only 3D object detection.

[24] AdaptoVision: A Multi-Resolution Image Recognition Model for Robust and Scalable Classification

Md. Sanaullah Chowdhury,Lameya Sabrin

Main category: cs.CV

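TL;DR: AdaptoVision is a new CNN architecture that balances computational complexity and classification accuracy through enhanced residual units, depth-wise separable convolutions, and hierarchical skip connections, making it suitable for resource-constrained environments.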

Details

Motivation: To design a CNN architecture that efficiently balances computational complexity and classification accuracy for real-time and resource-constrained environments. Method: Enhanced residual units, depth-wise separable convolutions, and hierarchical skip connections are used to reduce parameters and computation. Result: The model reaches SOTA on the BreakHis dataset and achieves 95.3% and 85.77% accuracy on CIFAR-10 and CIFAR-100, respectively. Conclusion: With a streamlined architecture and effective feature extraction, AdaptoVision performs well in resource-constrained environments. Abstract: This paper introduces AdaptoVision, a novel convolutional neural network (CNN) architecture designed to efficiently balance computational complexity and classification accuracy. By leveraging enhanced residual units, depth-wise separable convolutions, and hierarchical skip connections, AdaptoVision significantly reduces parameter count and computational requirements while preserving competitive performance across various benchmark and medical image datasets. Extensive experimentation demonstrates that AdaptoVision achieves state-of-the-art on BreakHis dataset and comparable accuracy levels, notably 95.3% on CIFAR-10 and 85.77% on CIFAR-100, without relying on any pretrained weights. The model's streamlined architecture and strategic simplifications promote effective feature extraction and robust generalization, making it particularly suitable for deployment in real-time and resource-constrained environments.

[25] Two Tasks, One Goal: Uniting Motion and Planning for Excellent End To End Autonomous Driving Performance

Lin Liu,Ziying Song,Hongyu Pan,Lei Yang,Caiyan Jia

Main category: cs.CV

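TL;DR: TTOG is a two-stage trajectory generation framework that unifies planning and motion tasks, addresses the challenges of shared contextual representation and unobservable vehicle states, and significantly improves autonomous driving performance.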

Details

Motivation: Conventional methods decouple planning and motion tasks and overlook the potential benefit of learning from the out-of-distribution data found in motion tasks. Unifying the tasks raises two challenges: shared contextual representation and unobservable vehicle states. Method: The TTOG framework is proposed: the first stage generates diverse trajectory candidates, and the second stage refines them using vehicle state information. A state estimator trained on ego-vehicle data handles the unobservable states of surrounding vehicles, and ECSA is introduced to improve the generalization of scene representations. Result: L2 distance is reduced by 36.06% on the nuScenes dataset, and the driving score improves by 22% on the Bench2Drive dataset, significantly outperforming existing baselines. Conclusion: By unifying planning and motion tasks and combining state estimation with improved scene representation, TTOG achieves a significant gain in autonomous driving performance. Abstract: End-to-end autonomous driving has made impressive progress in recent years. Former end-to-end autonomous driving approaches often decouple planning and motion tasks, treating them as separate modules. This separation overlooks the potential benefits that planning can gain from learning out-of-distribution data encountered in motion tasks. However, unifying these tasks poses significant challenges, such as constructing shared contextual representations and handling the unobservability of other vehicles' states. To address these challenges, we propose TTOG, a novel two-stage trajectory generation framework. In the first stage, a diverse set of trajectory candidates is generated, while the second stage focuses on refining these candidates through vehicle state information. To mitigate the issue of unavailable surrounding vehicle states, TTOG employs a self-vehicle data-trained state estimator, subsequently extended to other vehicles. Furthermore, we introduce ECSA (equivariant context-sharing scene adapter) to enhance the generalization of scene representations across different agents. Experimental results demonstrate that TTOG achieves state-of-the-art performance across both planning and motion tasks. Notably, on the challenging open-loop nuScenes dataset, TTOG reduces the L2 distance by 36.06%. Furthermore, on the closed-loop Bench2Drive dataset, our approach achieves a 22% improvement in the driving score (DS), significantly outperforming existing baselines.

[26] Accurate Tracking of Arabidopsis Root Cortex Cell Nuclei in 3D Time-Lapse Microscopy Images Based on Genetic Algorithm

Yu Song,Tatsuaki Goh,Yinhao Li,Jiahua Dong,Shunsuke Miyashima,Yutaro Iwamoto,Yohei Kondo,Keiji Nakajima,Yen-wei Chen

Main category: cs.CV

TL;DR: Proposes an accurate cell tracking method based on a genetic algorithm to handle densely arranged cells in live imaging of Arabidopsis root tips.

Details

Motivation: Arabidopsis is an important model for studying plant physiology and development, and live imaging is essential for uncovering the principles behind plant growth and cell division. Existing tracking software performs poorly when cells are densely arranged. Method: A genetic algorithm is combined with knowledge of the spatial relationships and linear arrangement of Arabidopsis root cells to perform coarse-to-fine tracking (line-level tracking first, then nucleus-level tracking). Result: Evaluated on a long-time live imaging dataset of Arabidopsis root tips, the method accurately tracks nuclei after minor manual correction. Conclusion: The method is presented as the first successful solution to the long-standing problem of tracking nuclei in live imaging of Arabidopsis root tips. Abstract: Arabidopsis is a widely used model plant to gain basic knowledge on plant physiology and development. Live imaging is an important technique to visualize and quantify elemental processes in plant development. To uncover novel theories underlying plant growth and cell division, accurate cell tracking on live imaging is of utmost importance. The commonly used cell tracking software, TrackMate, adopts a tracking-by-detection approach, which applies Laplacian of Gaussian (LoG) for blob detection and a Linear Assignment Problem (LAP) tracker for tracking. However, these components do not perform sufficiently well when cells are densely arranged. To alleviate the problems mentioned above, we propose an accurate tracking method based on a Genetic Algorithm (GA) using knowledge of Arabidopsis root cellular patterns and spatial relationship among volumes. Our method can be described as a coarse-to-fine method, in which we first conducted relatively easy line-level tracking of cell nuclei, then performed complicated nuclear tracking based on known linear arrangement of cell files and their spatial relationship between nuclei. Our method has been evaluated on a long-time live imaging dataset of Arabidopsis root tips, and with minor manual rectification, it accurately tracks nuclei. To the best of our knowledge, this research represents the first successful attempt to address a long-standing problem in the field of time-lapse microscopy in the root meristem by proposing an accurate tracking method for Arabidopsis root nuclei.

[27] TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials

Bofei Zhang,Zirui Shang,Zhi Gao,Wang Zhang,Rui Xie,Xiaojian Ma,Tao Yuan,Xinxiao Wu,Song-Chun Zhu,Qing Li

Main category: cs.CV

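TL;DR: The TongUI framework builds generalized GUI agents by learning from multimodal web tutorials and creates the GUI-Net dataset of 143K trajectories, yielding significant performance gains.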

Details

Motivation: The main obstacle to building generalized GUI agents is the lack of trajectory data across operating systems and applications, since manual annotation is expensive. Method: Online GUI tutorials (such as videos and articles) are crawled and processed into trajectory data to build the GUI-Net dataset, on which Qwen2.5-VL-3B/7B models are fine-tuned. Result: The TongUI agent performs strongly on multiple benchmarks, with gains of about 10%. Conclusion: The effectiveness of the GUI-Net dataset and the TongUI framework is validated; the code, dataset, and models will be open-sourced. Abstract: Building Graphical User Interface (GUI) agents is a promising research direction, which simulates human interaction with computers or mobile phones to perform diverse GUI tasks. However, a major challenge in developing generalized GUI agents is the lack of sufficient trajectory data across various operating systems and applications, mainly due to the high cost of manual annotations. In this paper, we propose the TongUI framework that builds generalized GUI agents by learning from rich multimodal web tutorials. Concretely, we crawl and process online GUI tutorials (such as videos and articles) into GUI agent trajectory data, through which we produce the GUI-Net dataset containing 143K trajectory data across five operating systems and more than 200 applications. We develop the TongUI agent by fine-tuning Qwen2.5-VL-3B/7B models on GUI-Net, which show remarkable performance improvements on commonly used grounding and navigation benchmarks, outperforming baseline agents by about 10% on multiple benchmarks, showing the effectiveness of the GUI-Net dataset and underscoring the significance of our TongUI framework. We will fully open-source the code, the GUI-Net dataset, and the trained models soon.

[28] HSS-IAD: A Heterogeneous Same-Sort Industrial Anomaly Detection Dataset

Qishan Wang,Shuyong Gao,Junjie Hu,Jiawen Yu,Xuan Tong,You Li,Wenqiang Zhang

Main category: cs.CV

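TL;DR: The paper introduces the HSS-IAD dataset, which addresses limitations of existing industrial anomaly detection datasets, and evaluates popular methods under multi-class and class-separated settings.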

Details

Motivation: Existing industrial anomaly detection datasets do not reflect real factory conditions, which limits the practical effectiveness of multi-class unsupervised anomaly detection algorithms. Method: The HSS-IAD dataset is proposed, containing 8,580 images of metallic-like industrial parts with precise anomaly annotations, together with foreground images for synthetic anomaly generation. Result: Popular industrial anomaly detection methods are evaluated, showing the dataset's potential to bridge the gap between existing datasets and real factory conditions. Conclusion: The HSS-IAD dataset provides data closer to real-world scenarios for research on multi-class unsupervised anomaly detection. Abstract: Multi-class Unsupervised Anomaly Detection algorithms (MUAD) are receiving increasing attention due to their relatively low deployment costs and improved training efficiency. However, the real-world effectiveness of MUAD methods is questioned due to limitations in current Industrial Anomaly Detection (IAD) datasets. These datasets contain numerous classes that are unlikely to be produced by the same factory and fail to cover multiple structures or appearances. Additionally, the defects do not reflect real-world characteristics. Therefore, we introduce the Heterogeneous Same-Sort Industrial Anomaly Detection (HSS-IAD) dataset, which contains 8,580 images of metallic-like industrial parts and precise anomaly annotations. These parts exhibit variations in structure and appearance, with subtle defects that closely resemble the base materials. We also provide foreground images for synthetic anomaly generation. Finally, we evaluate popular IAD methods on this dataset under multi-class and class-separated settings, demonstrating its potential to bridge the gap between existing datasets and real factory conditions. The dataset is available at https://github.com/Qiqigeww/HSS-IAD-Dataset.

[29] Collaborative Perception Datasets for Autonomous Driving: A Review

Naibang Wang,Deyong Shang,Yan Gong,Xiaoxi Hu,Ziying Song,Lei Yang,Yuhan Huang,Xiaoyu Wang,Jianli Lu

Main category: cs.CV

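TL;DR: This paper reviews collaborative perception datasets for autonomous driving, classifying and comparing them along multiple dimensions, and points out future research directions.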

Details

Motivation: Collaborative perception has great potential in autonomous driving, but the lack of a systematic summary and comparative analysis of existing datasets hinders resource utilization and the standardization of model evaluation. Method: Existing datasets are classified and compared from a multi-dimensional perspective, covering cooperation paradigms, data sources, scenarios, sensor configurations, and supported tasks. Result: A detailed comparative analysis is provided, and key challenges are identified, including dataset scalability, diversity, domain adaptation, standardization, privacy, and the integration of large language models. Conclusion: The paper provides a resource for collaborative perception research and proposes future directions. Abstract: Collaborative perception has attracted growing interest from academia and industry due to its potential to enhance perception accuracy, safety, and robustness in autonomous driving through multi-agent information fusion. With the advancement of Vehicle-to-Everything (V2X) communication, numerous collaborative perception datasets have emerged, varying in cooperation paradigms, sensor configurations, data sources, and application scenarios. However, the absence of systematic summarization and comparative analysis hinders effective resource utilization and standardization of model evaluation. As the first comprehensive review focused on collaborative perception datasets, this work reviews and compares existing resources from a multi-dimensional perspective. We categorize datasets based on cooperation paradigms, examine their data sources and scenarios, and analyze sensor modalities and supported tasks. A detailed comparative analysis is conducted across multiple dimensions. We also outline key challenges and future directions, including dataset scalability, diversity, domain adaptation, standardization, privacy, and the integration of large language models. To support ongoing research, we provide a continuously updated online repository of collaborative perception datasets and related literature: https://github.com/frankwnb/Collaborative-Perception-Datasets-for-Autonomous-Driving.

[30] Unsupervised Cross-Domain 3D Human Pose Estimation via Pseudo-Label-Guided Global Transforms

Jingjing Liu,Zhiyong Wang,Xinyu Fan,Amirhossein Dadashzadeh,Honghai Liu,Majid Mirmehdi

Main category: cs.CV

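TL;DR: Proposes a new framework that addresses domain shift in cross-scenario 3D human pose estimation through global transformations and pseudo-label generation.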

Details

Motivation: Existing methods lose performance in cross-scenario inference because of domain shifts in camera viewpoint, position, and similar factors, so the global alignment of poses needs to be addressed. Method: The framework comprises a Pseudo-Label Generation Module, a Global Transformation Module, and a Pose Augmentor, and achieves domain adaptation by iteratively refining pseudo-labels. Result: The method outperforms existing approaches on several cross-dataset benchmarks and even surpasses the target-trained model. Conclusion: The method effectively addresses domain shift in cross-scenario 3D pose estimation and improves generalization. Abstract: Existing 3D human pose estimation methods often suffer in performance when applied to cross-scenario inference, due to domain shifts in characteristics such as camera viewpoint, position, posture, and body size. Among these factors, camera viewpoints and locations have been shown to contribute significantly to the domain gap by influencing the global positions of human poses. To address this, we propose a novel framework that explicitly conducts global transformations between pose positions in the camera coordinate systems of source and target domains. We start with a Pseudo-Label Generation Module that is applied to the 2D poses of the target dataset to generate pseudo-3D poses. Then, a Global Transformation Module leverages a human-centered coordinate system as a novel bridging mechanism to seamlessly align the positional orientations of poses across disparate domains, ensuring consistent spatial referencing. To further enhance generalization, a Pose Augmentor is incorporated to address variations in human posture and body size. This process is iterative, allowing refined pseudo-labels to progressively improve guidance for domain adaptation. Our method is evaluated on various cross-dataset benchmarks, including Human3.6M, MPI-INF-3DHP, and 3DPW. The proposed method outperforms state-of-the-art approaches and even outperforms the target-trained model.

[31] SmartFreeEdit: Mask-Free Spatial-Aware Image Editing with Complex Instruction Understanding

Qianqian Sun,Jixiang Luo,Dell Zhang,Xuelong Li

Main category: cs.CV

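TL;DR: SmartFreeEdit is an end-to-end framework that combines a multimodal large language model with a hypergraph-enhanced inpainting architecture to perform precise, mask-free image editing from natural language instructions.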

Details

Motivation: Conventional methods struggle with spatial reasoning, precise region segmentation, and semantic consistency, especially in complex scenes. Method: Region-aware tokens and a mask embedding paradigm are introduced to improve spatial understanding; a reasoning segmentation pipeline is designed to optimize mask generation; a hypergraph-augmented inpainting module preserves structural and semantic consistency. Result: On the Reason-Edit benchmark, SmartFreeEdit outperforms existing methods in segmentation accuracy, instruction adherence, and visual quality. Conclusion: SmartFreeEdit addresses the problem of focusing only on local information and improves the global consistency of edited images. Abstract: Recent advancements in image editing have utilized large-scale multimodal models to enable intuitive, natural instruction-driven interactions. However, conventional methods still face significant challenges, particularly in spatial reasoning, precise region segmentation, and maintaining semantic consistency, especially in complex scenes. To overcome these challenges, we introduce SmartFreeEdit, a novel end-to-end framework that integrates a multimodal large language model (MLLM) with a hypergraph-enhanced inpainting architecture, enabling precise, mask-free image editing guided exclusively by natural language instructions. The key innovations of SmartFreeEdit include: (1) the introduction of region-aware tokens and a mask embedding paradigm that enhance the spatial understanding of complex scenes; (2) a reasoning segmentation pipeline designed to optimize the generation of editing masks based on natural language instructions; and (3) a hypergraph-augmented inpainting module that ensures the preservation of both structural integrity and semantic coherence during complex edits, overcoming the limitations of local-based image generation. Extensive experiments on the Reason-Edit benchmark demonstrate that SmartFreeEdit surpasses current state-of-the-art methods across multiple evaluation metrics, including segmentation accuracy, instruction adherence, and visual quality preservation, while addressing the issue of local information focus and improving global consistency in the edited image. Our project will be available at https://github.com/smileformylove/SmartFreeEdit.

[32] Self-Supervised Pre-training with Combined Datasets for 3D Perception in Autonomous Driving

Shumin Wang,Zhuoran Yang,Lidian Wang,Zhipeng Tang,Heng Li,Lehan Pan,Sha Zhang,Jie Peng,Jianmin Ji,Yanyong Zhang

Main category: cs.CV

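TL;DR: This paper proposes a self-supervised framework that pre-trains 3D perception models on large-scale unlabeled data and uses a prompt adapter to reduce dataset bias, significantly improving downstream autonomous driving tasks.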

Details

Motivation: Inspired by pre-trained models in NLP and 2D vision, the paper explores the potential of large-scale data pre-training for 3D perception. Method: A self-supervised pre-training framework is combined with a prompt-adapter-based domain adaptation strategy to learn effective 3D representations from unlabeled data. Result: The model improves significantly on 3D object detection, BEV segmentation, 3D object tracking, and occupancy prediction, and performance rises steadily as the data volume grows. Conclusion: The work shows the continuing potential of large-scale data pre-training for 3D perception models in autonomous driving; the code will be open-sourced to support community research. Abstract: The significant achievements of pre-trained models leveraging large volumes of data in the field of NLP and 2D vision inspire us to explore the potential of extensive data pre-training for 3D perception in autonomous driving. Toward this goal, this paper proposes to utilize massive unlabeled data from heterogeneous datasets to pre-train 3D perception models. We introduce a self-supervised pre-training framework that learns effective 3D representations from scratch on unlabeled data, combined with a prompt adapter based domain adaptation strategy to reduce dataset bias. The approach significantly improves model performance on downstream tasks such as 3D object detection, BEV segmentation, 3D object tracking, and occupancy prediction, and shows steady performance increase as the training data volume scales up, demonstrating the potential to continually benefit 3D perception models for autonomous driving. We will release the source code to inspire further investigations in the community.

[33] NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results

Xin Li,Yeying Jin,Xin Jin,Zongwei Wu,Bingchen Li,Yufei Wang,Wenhan Yang,Yu Li,Zhibo Chen,Bihan Wen,Robby T. Tan,Radu Timofte,Qiyu Rong,Hongyuan Jing,Mengmeng Zhang,Jinglong Li,Xiangyu Lu,Yi Ren,Yuting Liu,Meng Zhang,Xiang Chen,Qiyuan Guan,Jiangxin Dong,Jinshan Pan,Conglin Gou,Qirui Yang,Fangpu Zhang,Yunlong Lin,Sixiang Chen,Guoxi Huang,Ruirui Lin,Yan Zhang,Jingyu Yang,Huanjing Yue,Jiyuan Chen,Qiaosi Yi,Hongjun Wang,Chenxi Xie,Shuai Li,Yuhui Wu,Kaiyi Ma,Jiakui Hu,Juncheng Li,Liwen Pan,Guangwei Gao,Wenjie Li,Zhenyu Jin,Heng Guo,Zhanyu Ma,Yubo Wang,Jinghua Wang,Wangzhi Xing,Anjusree Karnavar,Diqi Chen,Mohammad Aminul Islam,Hao Yang,Ruikun Zhang,Liyuan Pan,Qianhao Luo,Xin Cao,Han Zhou,Yan Min,Wei Dong,Jun Chen,Taoyi Wu,Weijia Dou,Yu Wang,Shengjie Zhao,Yongcheng Huang,Xingyu Han,Anyan Huang,Hongtao Wu,Hong Wang,Yefeng Zheng,Abhijeet Kumar,Aman Kumar,Marcos V. Conde,Paula Garrido,Daniel Feijoo,Juan C. Benito,Guanglu Dong,Xin Lin,Siyuan Liu,Tianheng Zheng,Jiayu Zhong,Shouyi Wang,Xiangtai Li,Lanqing Guo,Lu Qi,Chao Ren,Shuaibo Wang,Shilong Zhang,Wanyu Zhou,Yunze Wu,Qinzhong Tan,Jieyuan Pei,Zhuoxuan Li,Jiayu Wang,Haoyu Bian,Haoran Sun,Subhajit Paul,Ni Tang,Junhao Huang,Zihan Cheng,Hongyun Zhu,Yuehan Wu,Kaixin Deng,Hang Ouyang,Tianxin Xiao,Fan Yang,Zhizun Luo,Zeyu Xiao,Zhuoyuan Li,Nguyen Pham Hoang Le,An Dinh Thien,Son T. Luu,Kiet Van Nguyen,Ronghua Xu,Xianmin Tian,Weijian Zhou,Jiacheng Zhang,Yuqian Chen,Yihang Duan,Yujie Wu,Suresh Raikwar,Arsh Garg,Kritika,Jianhua Zheng,Xiaoshan Ma,Ruolin Zhao,Yongyu Yang,Yongsheng Liang,Guiming Huang,Qiang Li,Hongbin Zhang,Xiangyu Zheng,A. N. Rajagopalan

Main category: cs.CV

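TL;DR: This paper reviews the NTIRE 2025 Challenge on day and night raindrop removal for dual-focused images and describes the diverse Raindrop Clarity dataset and its use in the competition.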

Details

Motivation: To establish a strong benchmark for raindrop removal under varying lighting and focus conditions. Method: The diverse Raindrop Clarity dataset was collected and split into training, validation, and test subsets; the challenge attracted 361 participants. Result: 32 teams submitted valid solutions, which achieved state-of-the-art performance on the dataset. Conclusion: The challenge provides a new benchmark for raindrop removal and shows the value of a diverse dataset. Abstract: This paper reviews the NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images. This challenge received a wide range of impressive solutions, which are developed and evaluated using our collected real-world Raindrop Clarity dataset. Unlike existing deraining datasets, our Raindrop Clarity dataset is more diverse and challenging in degradation types and contents, which includes day raindrop-focused, day background-focused, night raindrop-focused, and night background-focused degradations. This dataset is divided into three subsets for competition: 14,139 images for training, 240 images for validation, and 731 images for testing. The primary objective of this challenge is to establish a new and powerful benchmark for the task of removing raindrops under varying lighting and focus conditions. There were a total of 361 participants in the competition, and 32 teams submitted valid solutions and fact sheets for the final testing phase. These submissions achieved state-of-the-art (SOTA) performance on the Raindrop Clarity dataset. The project can be found at https://lixinustc.github.io/CVPR-NTIRE2025-RainDrop-Competition.github.io/.

[34] Post-pre-training for Modality Alignment in Vision-Language Foundation Models

Shin'ya Yamaguchi,Dewei Feng,Sekitoshi Kanai,Kazuki Adachi,Daiki Chijiwa

Main category: cs.CV

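TL;DR: CLIP-Refine is a post-pre-training method that reduces the modality gap in CLIP models through random feature alignment and hybrid contrastive-distillation, improving zero-shot performance.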

Details

Motivation: CLIP models have a modality gap in the multi-modal feature space that hurts downstream performance, and existing remedies are either costly or degrade zero-shot performance. Method: CLIP-Refine is proposed, combining random feature alignment (RaFA) and hybrid contrastive-distillation (HyCD) with 1 epoch of training on small datasets. Result: Experiments show that CLIP-Refine effectively reduces the modality gap and improves zero-shot performance. Conclusion: CLIP-Refine is an efficient post-pre-training method that improves CLIP models without heavy additional training cost. Abstract: Contrastive language image pre-training (CLIP) is an essential component of building modern vision-language foundation models. While CLIP demonstrates remarkable zero-shot performance on downstream tasks, the multi-modal feature spaces still suffer from a modality gap, which is a gap between image and text feature clusters and limits downstream task performance. Although existing works attempt to address the modality gap by modifying pre-training or fine-tuning, they struggle with heavy training costs with large datasets or degradations of zero-shot performance. This paper presents CLIP-Refine, a post-pre-training method for CLIP models at a phase between pre-training and fine-tuning. CLIP-Refine aims to align the feature space with 1 epoch training on small image-text datasets without zero-shot performance degradations. To this end, we introduce two techniques: random feature alignment (RaFA) and hybrid contrastive-distillation (HyCD). RaFA aligns the image and text features to follow a shared prior distribution by minimizing the distance to random reference vectors sampled from the prior. HyCD updates the model with hybrid soft labels generated by combining ground-truth image-text pair labels and outputs from the pre-trained CLIP model. This contributes to achieving both maintaining the past knowledge and learning new knowledge to align features. Our extensive experiments with multiple classification and retrieval tasks show that CLIP-Refine succeeds in mitigating the modality gap and improving the zero-shot performance.
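
Illustrative sketch (not from the paper): the two techniques described in the abstract can be written down as simple loss and label computations. The details here are assumptions made for illustration: a standard Gaussian prior, one shared reference vector per image-text pair, a squared L2 distance, a linear blend with weight `alpha`, and the function names `rafa_loss` and `hybrid_soft_labels`. The paper's exact formulation may differ.

```python
import random


def rafa_loss(image_feats, text_feats, rng):
    """Mean squared distance of paired features to shared random references.

    image_feats, text_feats: lists of equal-length feature vectors (paired).
    rng: a random.Random instance used to sample the references.
    """
    total = 0.0
    count = 0
    for img, txt in zip(image_feats, text_feats):
        # One reference per pair, drawn from an assumed N(0, 1) prior.
        ref = [rng.gauss(0.0, 1.0) for _ in img]
        total += sum((a - r) ** 2 for a, r in zip(img, ref))
        total += sum((b - r) ** 2 for b, r in zip(txt, ref))
        count += 2 * len(img)
    return total / count if count else 0.0


def hybrid_soft_labels(teacher_probs, true_index, alpha=0.5):
    """Blend a one-hot ground-truth label with teacher output probabilities.

    alpha is a hypothetical mixing weight for the ground-truth term.
    """
    return [
        alpha * (1.0 if i == true_index else 0.0) + (1.0 - alpha) * p
        for i, p in enumerate(teacher_probs)
    ]
```

Pulling both modalities toward the same reference is what shrinks the distance between paired image and text features, while the blended labels keep part of the training signal tied to the pre-trained model's predictions.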

[35] Mask Image Watermarking

Runyi Hu,Jie Zhang,Shiqian Zhao,Nils Lukas,Jiwei Li,Qing Guo,Han Qiu,Tianwei Zhang

Main category: cs.CV

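TL;DR: MaskMark is a simple, efficient, and flexible image watermarking framework that supports global and local watermark embedding and extraction and outperforms existing baselines.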

Details

Motivation: Image watermarking needs to support both global and local watermark embedding and extraction while improving robustness and visual quality. Method: Built on the Encoder-Distortion-Decoder training paradigm, MaskMark-D uses a masking mechanism in the decoding stage to support both global and local watermark extraction; MaskMark-ED additionally brings the mask into the encoding stage to make local watermarks more robust. Result: MaskMark achieves state-of-the-art performance in global watermark extraction, local watermark extraction, watermark localization, and multi-watermark embedding, at 1/15 the computational cost of WAM. Conclusion: MaskMark is an efficient, flexible, and high-performing image watermarking framework suited to a range of applications. Abstract: We present MaskMark, a simple, efficient and flexible framework for image watermarking. MaskMark has two variants: MaskMark-D, which supports global watermark embedding, watermark localization, and local watermark extraction for applications such as tamper detection, and MaskMark-ED, which focuses on local watermark embedding and extraction with enhanced robustness in small regions, enabling localized image protection. Built upon the classical Encoder-Distortion-Decoder training paradigm, MaskMark-D introduces a simple masking mechanism during the decoding stage to support both global and local watermark extraction. A mask is applied to the watermarked image before extraction, allowing the decoder to focus on selected regions and learn local extraction. A localization module is also integrated into the decoder to identify watermark regions during inference, reducing interference from irrelevant content and improving accuracy. MaskMark-ED extends this design by incorporating the mask into the encoding stage as well, guiding the encoder to embed the watermark in designated local regions for enhanced robustness. Comprehensive experiments show that MaskMark achieves state-of-the-art performance in global watermark extraction, local watermark extraction, watermark localization, and multi-watermark embedding. It outperforms all existing baselines, including the recent leading model WAM for local watermarking, while preserving high visual quality of the watermarked images. MaskMark is also flexible: by adjusting the distortion layer, it can adapt to different robustness requirements with just a few steps of fine-tuning. Moreover, our approach is efficient and easy to optimize, requiring only 20 hours on a single A6000 GPU with just 1/15 the computational cost of WAM.

[36] Privacy Protection Against Personalized Text-to-Image Synthesis via Cross-image Consistency Constraints

Guanyu Wang,Kailong Wang,Yihao Huang,Mingyi Zhou,Zhang Qing cnwatcher,Geguang Pu,Li Li

Main category: cs.CV

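TL;DR: The paper proposes CAP, a cross-image anti-personalization framework that strengthens privacy protection by enforcing style consistency across images.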

Details

Motivation: Rapid progress in diffusion models and personalization techniques makes it possible to recreate a person's portrait from public images, which creates privacy risks. Existing methods ignore the relationships among multiple images and so miss the potential of group-level privacy protection. Method: The paper proposes the CAP framework, which strengthens resistance to personalization through cross-image style consistency, and develops a dynamic ratio adjustment strategy to balance the influence of the consistency loss. Result: On the CelebHQ and VGGFace2 benchmarks, CAP substantially improves existing methods. Conclusion: With a group-level privacy protection strategy, CAP improves resistance to personalization and offers a new direction for privacy protection. Abstract: The rapid advancement of diffusion models and personalization techniques has made it possible to recreate individual portraits from just a few publicly available images. While such capabilities empower various creative applications, they also introduce serious privacy concerns, as adversaries can exploit them to generate highly realistic impersonations. To counter these threats, anti-personalization methods have been proposed, which add adversarial perturbations to published images to disrupt the training of personalization models. However, existing approaches largely overlook the intrinsic multi-image nature of personalization and instead adopt a naive strategy of applying perturbations independently, as commonly done in single-image settings. This neglects the opportunity to leverage inter-image relationships for stronger privacy protection. Therefore, we advocate for a group-level perspective on privacy protection against personalization. Specifically, we introduce Cross-image Anti-Personalization (CAP), a novel framework that enhances resistance to personalization by enforcing style consistency across perturbed images. Furthermore, we develop a dynamic ratio adjustment strategy that adaptively balances the impact of the consistency loss throughout the attack iterations. Extensive experiments on the classical CelebHQ and VGGFace2 benchmarks show that CAP substantially improves existing methods.

[37] LAD-Reasoner: Tiny Multimodal Models are Good Reasoners for Logical Anomaly Detection

Weijia Li,Guanglei Chu,Jiong Chen,Guo-Sen Xie,Caifeng Shan,Fang Zhao

Main category: cs.CV

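TL;DR: The paper introduces a new task, RLAD, and a framework, LAD-Reasoner, which adds logical reasoning to traditional anomaly detection and, with two-stage training, matches the performance of much larger models.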

Details

Motivation: Existing industrial anomaly detection methods rely on complex modules or pipeline designs, which hinders practical deployment and interpretability, so a more efficient solution is needed. Method: The paper proposes LAD-Reasoner, built on Qwen2.5-VL 3B and trained in two stages (SFT, then GRPO), with rewards that combine detection accuracy and output quality. Result: On the MVTec LOCO AD dataset, LAD-Reasoner matches the performance of Qwen2.5-VL-72B and produces more concise and interpretable rationales. Conclusion: LAD-Reasoner reduces reliance on large models and complex pipelines and offers a transparent, interpretable approach to logical anomaly detection. Abstract: Recent advances in industrial anomaly detection have highlighted the need for deeper logical anomaly analysis, where unexpected relationships among objects, counts, and spatial configurations must be identified and explained. Existing approaches often rely on large-scale external reasoning modules or elaborate pipeline designs, hindering practical deployment and interpretability. To address these limitations, we introduce a new task, Reasoning Logical Anomaly Detection (RLAD), which extends traditional anomaly detection by incorporating logical reasoning. We propose a new framework, LAD-Reasoner, a customized tiny multimodal language model built on Qwen2.5-VL 3B. Our approach leverages a two-stage training paradigm that first employs Supervised Fine-Tuning (SFT) for fine-grained visual understanding, followed by Group Relative Policy Optimization (GRPO) to refine logical anomaly detection and enforce coherent, human-readable reasoning. Crucially, reward signals are derived from both the detection accuracy and the structural quality of the outputs, obviating the need for building chain of thought (CoT) reasoning data. Experiments on the MVTec LOCO AD dataset show that LAD-Reasoner, though significantly smaller, matches the performance of Qwen2.5-VL-72B in accuracy and F1 score, and further excels in producing concise and interpretable rationales. This unified design reduces reliance on large models and complex pipelines, while offering transparent and interpretable insights into logical anomaly detection. Code and data will be released.

[38] Stronger, Steadier & Superior: Geometric Consistency in Depth VFM Forges Domain Generalized Semantic Segmentation

Siyu Chen,Ting Han,Changshe Zhang,Xin Luo,Meiliu Wu,Guorong Cai,Jinhe Su

Main category: cs.CV

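TL;DR: DepthForge combines vision foundation models (VFMs) with depth information to improve domain generalized semantic segmentation (DGSS), performing especially well under extreme conditions.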

Details

Motivation: Visual cues are easily disturbed, whereas geometric information such as depth is more stable, so the paper studies how to combine depth information with VFMs to improve generalization. Method: The paper proposes DepthForge, which integrates visual features from DINOv2 or EVA02 with depth cues from Depth Anything V2 and strengthens the model with depth-aware learnable tokens and a depth refinement decoder. Result: Across multiple DGSS settings and five unseen target datasets, DepthForge significantly outperforms alternative approaches, particularly under extreme conditions. Conclusion: By augmenting VFMs with depth information, DepthForge significantly improves the generalization and robustness of DGSS. Abstract: Vision Foundation Models (VFMs) have delivered remarkable performance in Domain Generalized Semantic Segmentation (DGSS). However, recent methods often overlook the fact that visual cues are susceptible, whereas the underlying geometry remains stable, rendering depth information more robust. In this paper, we investigate the potential of integrating depth information with features from VFMs, to improve the geometric consistency within an image and boost the generalization performance of VFMs. We propose a novel fine-tuning DGSS framework, named DepthForge, which integrates the visual cues from frozen DINOv2 or EVA02 and depth cues from frozen Depth Anything V2. In each layer of the VFMs, we incorporate depth-aware learnable tokens to continuously decouple domain-invariant visual and spatial information, thereby enhancing depth awareness and attention of the VFMs. Finally, we develop a depth refinement decoder and integrate it into the model architecture to adaptively refine multi-layer VFM features and depth-aware learnable tokens. Extensive experiments are conducted based on various DGSS settings and five different datasets as unseen target domains. The qualitative and quantitative results demonstrate that our method significantly outperforms alternative approaches with stronger performance, steadier visual-spatial attention, and superior generalization ability. In particular, DepthForge exhibits outstanding performance under extreme conditions (e.g., night and snow). Code is available at https://github.com/anonymouse-xzrptkvyqc/DepthForge.

[39] Set You Straight: Auto-Steering Denoising Trajectories to Sidestep Unwanted Concepts

Leyang Li,Shilin Lu,Yan Ren,Adams Wai-Kin Kong

Main category: cs.CV

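TL;DR: ANT is a new finetuning framework that automatically steers denoising trajectories away from harmful content, addressing the limitations of existing concept erasure methods.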

Details

Motivation: Deploying text-to-image models ethically requires techniques that prevent the generation of harmful or inappropriate content. Existing methods have limitations, such as disrupting sampling trajectories or relying on heuristic anchor concept selection. Method: ANT reverses the condition direction of classifier-free guidance to modify content precisely in the mid-to-late denoising stages while preserving early-stage structural integrity. It uses a weight saliency map to identify critical parameters for efficient erasure. Result: Experiments show that ANT achieves state-of-the-art results in both single- and multi-concept erasure, producing high-quality, safe outputs without compromising generative fidelity. Conclusion: ANT provides an efficient method that needs no heuristic anchor concept selection, significantly improves concept erasure performance, and is suited to ethical deployment. Abstract: Ensuring the ethical deployment of text-to-image models requires effective techniques to prevent the generation of harmful or inappropriate content. While concept erasure methods offer a promising solution, existing finetuning-based approaches suffer from notable limitations. Anchor-free methods risk disrupting sampling trajectories, leading to visual artifacts, while anchor-based methods rely on the heuristic selection of anchor concepts. To overcome these shortcomings, we introduce a finetuning framework, dubbed ANT, which Automatically guides deNoising Trajectories to avoid unwanted concepts. ANT is built on a key insight: reversing the condition direction of classifier-free guidance during mid-to-late denoising stages enables precise content modification without sacrificing early-stage structural integrity. This inspires a trajectory-aware objective that preserves the integrity of the early-stage score function field, which steers samples toward the natural image manifold, without relying on heuristic anchor concept selection. For single-concept erasure, we propose an augmentation-enhanced weight saliency map to precisely identify the critical parameters that most significantly contribute to the unwanted concept, enabling more thorough and efficient erasure. For multi-concept erasure, our objective function offers a versatile plug-and-play solution that significantly boosts performance. Extensive experiments demonstrate that ANT achieves state-of-the-art results in both single and multi-concept erasure, delivering high-quality, safe outputs without compromising the generative fidelity. Code is available at https://github.com/lileyang1210/ANT
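
As a rough illustration of the key insight in this abstract, the sketch below applies standard classifier-free guidance early in denoising and flips the sign of the conditional direction later on. The function name, the `progress` variable, and the `switch_at` threshold are hypothetical. The abstract does not say how the reversal is scheduled, and ANT itself is a finetuning objective rather than this sampling-time rule.

```python
def guided_noise(eps_uncond, eps_cond, scale, progress, switch_at=0.3):
    """Combine unconditional and conditional noise predictions.

    progress: fraction of denoising already completed, from 0.0 (pure noise)
    to 1.0 (final image). Before `switch_at` (a hypothetical threshold) this
    is ordinary classifier-free guidance; afterwards the conditional
    direction is reversed so samples move away from the concept.
    """
    sign = 1.0 if progress < switch_at else -1.0
    return [u + sign * scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

Keeping the early steps unchanged mirrors the abstract's point that early-stage structure should be preserved, while the reversed direction in later steps changes the content.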

[40] EarthGPT-X: Enabling MLLMs to Flexibly and Comprehensively Understand Multi-Source Remote Sensing Imagery

Wei Zhang,Miaoxin Cai,Yaqian Ning,Tong Zhang,Yin Zhuang,He Chen,Jun Li,Xuerui Mao

Main category: cs.CV

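TL;DR: EarthGPT-X is a spatial multimodal large language model (MLLM) that addresses spatial reasoning challenges in remote sensing (RS) through comprehensive understanding of multi-source RS imagery and flexible interaction.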

Details

Motivation: Because RS imagery differs substantially from natural images, existing natural-image spatial models adapt poorly to the RS domain, and current RS MLLMs offer limited interaction modes and interpretation levels, which restricts real-world use. Method: The paper proposes a multi-modal content integration method and a cross-domain one-stage fusion training strategy, and adds a pixel perception module that unifies referring and grounding tasks in a visual prompting framework. Result: Experiments show that EarthGPT-X performs strongly on multi-grained tasks and interacts flexibly across modalities, advancing MLLMs in the RS field. Conclusion: EarthGPT-X addresses the challenges of the RS domain and demonstrates the potential of MLLMs in remote sensing. Abstract: Recent advances in the visual-language area have developed natural multi-modal large language models (MLLMs) for spatial reasoning through visual prompting. However, because remote sensing (RS) imagery contains abundant geospatial information that differs from natural images, it is challenging to effectively adapt natural spatial models to the RS domain. Moreover, current RS MLLMs are limited by overly narrow interpretation levels and interaction modes, hindering their applicability in real-world scenarios. To address these challenges, a spatial MLLM named EarthGPT-X is proposed, enabling a comprehensive understanding of multi-source RS imagery, such as optical, synthetic aperture radar (SAR), and infrared. EarthGPT-X offers zoom-in and zoom-out insight, and possesses flexible multi-grained interactive abilities. Moreover, EarthGPT-X unifies two types of critical spatial tasks (i.e., referring and grounding) into a visual prompting framework. To achieve these versatile capabilities, several key strategies are developed. The first is the multi-modal content integration method, which enhances the interplay between images, visual prompts, and text instructions. Subsequently, a cross-domain one-stage fusion training strategy is proposed, utilizing the large language model (LLM) as a unified interface for multi-source multi-task learning. Furthermore, by incorporating a pixel perception module, the referring and grounding tasks are seamlessly unified within a single framework. Experiments demonstrate the superiority of the proposed EarthGPT-X in multi-grained tasks and its impressive flexibility in multi-modal interaction, revealing significant advancements of MLLMs in the RS field.

[41] TSGS: Improving Gaussian Splatting for Transparent Surface Reconstruction via Normal and De-lighting Priors

Mingwei Li,Pu Pang,Hehe Fan,Hua Huang,Yi Yang

Main category: cs.CV

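TL;DR: TSGS separates geometry learning from appearance refinement to resolve the transparency-depth dilemma in transparent surface reconstruction, markedly improving geometric accuracy and rendering quality.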

Details

Motivation: Transparent surface reconstruction is essential for tasks such as robotic manipulation in labs, but existing methods such as 3DGS produce depth estimation errors on transparent materials. Method: TSGS works in two stages: the geometry learning stage uses specular-suppressed inputs, and the appearance refinement stage improves visual fidelity through anisotropic specular modeling; depth is extracted with a sliding-window method. Result: On the TransLab dataset, TSGS significantly surpasses existing methods, reducing chamfer distance by 37.3% and improving F1 score by 8.0%. Conclusion: TSGS achieves accurate geometric reconstruction and realistic rendering of transparent objects within the 3DGS framework; the code and dataset will be released. Abstract: Reconstructing transparent surfaces is essential for tasks such as robotic manipulation in labs, yet it poses a significant challenge for 3D reconstruction techniques like 3D Gaussian Splatting (3DGS). These methods often encounter a transparency-depth dilemma, where the pursuit of photorealistic rendering through standard $\alpha$-blending undermines geometric precision, resulting in considerable depth estimation errors for transparent materials. To address this issue, we introduce Transparent Surface Gaussian Splatting (TSGS), a new framework that separates geometry learning from appearance refinement. In the geometry learning stage, TSGS focuses on geometry by using specular-suppressed inputs to accurately represent surfaces. In the second stage, TSGS improves visual fidelity through anisotropic specular modeling, crucially maintaining the established opacity to ensure geometric accuracy. To enhance depth inference, TSGS employs a first-surface depth extraction method. This technique uses a sliding window over $\alpha$-blending weights to pinpoint the most likely surface location and calculates a robust weighted average depth. To evaluate the transparent surface reconstruction task under realistic conditions, we collect a TransLab dataset that includes complex transparent laboratory glassware. Extensive experiments on TransLab show that TSGS achieves accurate geometric reconstruction and realistic rendering of transparent objects simultaneously within the efficient 3DGS framework. Specifically, TSGS significantly surpasses current leading methods, achieving a 37.3% reduction in chamfer distance and an 8.0% improvement in F1 score compared to the top baseline. The code and dataset will be released at https://longxiang-ai.github.io/TSGS/.
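
The first-surface depth extraction described in this abstract can be sketched for a single ray as follows. This is an illustrative reading, not the TSGS implementation: the window size, the choice of the window with the largest summed blending weight as the "most likely surface", and the tie-breaking toward the nearest window are all assumptions, and the function name is hypothetical.

```python
def first_surface_depth(depths, weights, window=3):
    """Estimate a surface depth along one ray.

    depths:  sample depths ordered from near to far
    weights: alpha-blending weights of the same samples
    Slides a fixed-size window over the weights, keeps the window with the
    largest total weight (the earliest one on ties), and returns the
    weight-averaged depth inside that window. Returns None when every
    weight is zero.
    """
    if len(depths) != len(weights):
        raise ValueError("depths and weights must have the same length")
    n = len(depths)
    window = min(window, n)
    best_start, best_sum = 0, -1.0
    for start in range(n - window + 1):
        total = sum(weights[start:start + window])
        if total > best_sum:  # strict '>' keeps the nearest window on ties
            best_start, best_sum = start, total
    if best_sum <= 0.0:
        return None
    end = best_start + window
    weighted = sum(d * w for d, w in zip(depths[best_start:end],
                                         weights[best_start:end]))
    return weighted / best_sum
```

Averaging only inside the chosen window keeps far-away samples behind a transparent surface from dragging the estimate backwards, which is the failure mode the abstract attributes to standard $\alpha$-blending.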

[42] Hybrid Dense-UNet201 Optimization for Pap Smear Image Segmentation Using Spider Monkey Optimization

Ach Khozaimi,Isnani Darti,Syaiful Anam,Wuryansari Muharini Kusumawinahyu

Main category: cs.CV

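TL;DR: The paper proposes Dense-UNet201, a hybrid of DenseNet201 and U-Net tuned with a modified spider monkey optimization (SMO) algorithm, which markedly improves Pap smear image segmentation.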

Details

Motivation: Traditional segmentation models struggle with the complex cellular structures and variations in Pap smear images, so a more effective method is needed. Method: DenseNet201 serves as the encoder of a U-Net, the model is optimized with a modified SMO algorithm, and performance is evaluated on the SIPaKMeD dataset. Result: Dense-UNet201 outperforms the other models, with a segmentation accuracy of 96.16%, an IoU of 91.63%, and a Dice coefficient of 95.63%. Conclusion: The approach demonstrates the effectiveness of pretrained models and metaheuristic optimization in medical image analysis and offers new insights into cervical cell segmentation. Abstract: Pap smear image segmentation is crucial for cervical cancer diagnosis. However, traditional segmentation models often struggle with complex cellular structures and variations in pap smear images. This study proposes a hybrid Dense-UNet201 optimization approach that integrates a pretrained DenseNet201 as the encoder for the U-Net architecture and optimizes it using the spider monkey optimization (SMO) algorithm. The Dense-UNet201 model excelled at feature extraction. The SMO was modified to handle categorical and discrete parameters. The SIPaKMeD dataset was used in this study and evaluated using key performance metrics, including loss, accuracy, Intersection over Union (IoU), and Dice coefficient. The experimental results showed that Dense-UNet201 outperformed U-Net, Res-UNet50, and Efficient-UNetB0. SMO Dense-UNet201 achieved a segmentation accuracy of 96.16%, an IoU of 91.63%, and a Dice coefficient score of 95.63%. These findings underscore the effectiveness of image preprocessing, pretrained models, and metaheuristic optimization in improving medical image analysis and provide new insights into cervical cell segmentation methods.
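
For readers unfamiliar with the two overlap metrics reported in this entry, the sketch below computes IoU and the Dice coefficient for a pair of flattened binary masks using their standard definitions. The function names and the convention of returning 1.0 for two empty masks are choices made here, not details from the paper.

```python
def iou(pred, target):
    """Intersection over Union of two binary masks given as 0/1 sequences."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return 1.0 if union == 0 else inter / union  # empty-vs-empty: assumed 1.0


def dice(pred, target):
    """Dice coefficient: 2|A and B| / (|A| + |B|) for binary masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 1.0 if size == 0 else 2.0 * inter / size
```

For a single mask pair the two metrics are tied by Dice = 2·IoU / (1 + IoU). Plugging in the reported IoU of 91.63% gives roughly 95.63%, consistent with the reported Dice score, although the identity need not hold exactly for scores averaged over many images.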

[43] Saliency-Aware Diffusion Reconstruction for Effective Invisible Watermark Removal

Inzamamul Alam,Md Tanvir Islam,Simon S. Woo

Main category: cs.CV

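TL;DR: The paper proposes SADRE, a framework for watermark elimination on the web that combines adaptive noise injection, region-specific perturbations, and diffusion-based reconstruction to disrupt watermarks while preserving essential image features.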

Details

Motivation: As digital content becomes ubiquitous and existing watermark embedding techniques lack robustness, more effective watermark removal methods are needed. Method: SADRE injects targeted noise into latent representations under the guidance of saliency masks and uses a reverse diffusion process for high-fidelity image restoration. Result: Experiments show that SADRE balances watermark disruption and image quality better than existing techniques, offering a reliable solution for real-world web content. Conclusion: SADRE sets a new benchmark for watermark elimination, with both theoretical and practical value. Abstract: As digital content becomes increasingly ubiquitous, the need for robust watermark removal techniques has grown due to the inadequacy of existing embedding techniques, which lack robustness. This paper introduces a novel Saliency-Aware Diffusion Reconstruction (SADRE) framework for watermark elimination on the web, combining adaptive noise injection, region-specific perturbations, and advanced diffusion-based reconstruction. SADRE disrupts embedded watermarks by injecting targeted noise into latent representations guided by saliency masks while preserving essential image features. A reverse diffusion process ensures high-fidelity image restoration, leveraging adaptive noise levels determined by watermark strength. Our framework is theoretically grounded with stability guarantees and achieves robust watermark removal across diverse scenarios. Empirical evaluations on state-of-the-art (SOTA) watermarking techniques demonstrate SADRE's superiority in balancing watermark disruption and image quality. SADRE sets a new benchmark for watermark elimination, offering a flexible and reliable solution for real-world web content. Code is available at https://github.com/inzamamulDU/SADRE.

[44] TwoSquared: 4D Generation from 2D Image Pairs

Lu Sang,Zehranaz Canfes,Dongliang Cao,Riccardo Marin,Florian Bernard,Daniel Cremers

Main category: cs.CV

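TL;DR: TwoSquared decomposes 4D dynamic object generation into two steps, first generating 3D models and then predicting physically plausible deformations, and needs only two 2D images to produce a 4D sequence.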

Details

Motivation: 4D dynamic object generation is challenging because high-quality training data is scarce and computing requirements are heavy. Method: 1) an image-to-3D module based on an existing 3D generative model; 2) a physically inspired deformation module that predicts intermediate movements. Result: Experiments show that TwoSquared can produce texture-consistent and geometry-consistent 4D sequences from 2D images alone. Conclusion: TwoSquared requires no templates or object-class-specific priors and accepts in-the-wild images as input. Abstract: Despite the astonishing progress in generative AI, 4D dynamic object generation remains an open challenge. With limited high-quality training data and heavy computing requirements, the combination of hallucinating unseen geometry together with unseen movement poses great challenges to generative models. In this work, we propose TwoSquared as a method to obtain a 4D physically plausible sequence starting from only two 2D RGB images corresponding to the beginning and end of the action. Instead of directly solving the 4D generation problem, TwoSquared decomposes the problem into two steps: 1) an image-to-3D generation module based on an existing generative model trained on high-quality 3D assets, and 2) a physically inspired deformation module to predict intermediate movements. As a result, our method does not require templates or object-class-specific prior knowledge and can take in-the-wild images as input. In our experiments, we demonstrate that TwoSquared is capable of producing texture-consistent and geometry-consistent 4D sequences given only 2D images.

[45] Image-Editing Specialists: An RLAIF Approach for Diffusion Models

Elior Benarous,Yilun Du,Heng Yang

Main category: cs.CV

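TL;DR: The paper proposes a reinforcement-learning-based training method for image-editing diffusion models that improves structural preservation and semantic alignment, achieving fine-grained edits in complex scenes with only a few reference images and training steps.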

Details

Motivation: The paper addresses the challenges of structural preservation and semantic alignment in image editing while reducing dependence on extensive human annotation or large datasets. Method: An online reinforcement learning framework, combined with a visual prompt, produces precise and structurally coherent edits while maintaining high fidelity. Result: The models perform intricate edits in complex scenes with only 5 reference images and 10 training steps, significantly improving realism and alignment with instructions. Conclusion: The method reduces user effort, suits editing in complex scenes, and shows potential for applications in robotics. Abstract: We present a novel approach to training specialized instruction-based image-editing diffusion models, addressing key challenges in structural preservation with input images and semantic alignment with user prompts. We introduce an online reinforcement learning framework that aligns the diffusion model with human preferences without relying on extensive human annotations or curating a large dataset. Our method significantly improves the realism and alignment with instructions in two ways. First, the proposed models achieve precise and structurally coherent modifications in complex scenes while maintaining high fidelity in instruction-irrelevant areas. Second, they capture fine nuances in the desired edit by leveraging a visual prompt, enabling detailed control over visual edits without lengthy textual prompts. This approach simplifies users' efforts to achieve highly specific edits, requiring only 5 reference images depicting a certain concept for training. Experimental results demonstrate that our models can perform intricate edits in complex scenes, after just 10 training steps. Finally, we showcase the versatility of our method by applying it to robotics, where enhancing the visual realism of simulated environments through targeted sim-to-real image edits improves their utility as proxies for real-world settings.

[46] High-Fidelity Image Inpainting with Multimodal Guided GAN Inversion

Libo Zhang,Yongsheng Yu,Jiali Yao,Heng Fan

Main category: cs.CV

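TL;DR: MMInvertFill is a GAN inversion method for image inpainting that uses a multimodal guided encoder and the F&W+ latent space to address existing methods' shortcomings in consistency and in the use of multimodal cues.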

Details

Motivation: Existing GAN inversion methods for image inpainting ignore the consistency constraint on unmasked regions and make little use of multimodal information, which degrades performance. Method: The paper proposes a multimodal guided encoder with a pre-modulation module, combined with the F&W+ latent space and a Soft-update Mean Latent module, to improve inpainting quality. Result: Experiments on six datasets show that MMInvertFill outperforms existing methods both qualitatively and quantitatively and handles out-of-domain images effectively. Conclusion: Through multimodal guidance and latent-space design, MMInvertFill significantly improves the performance and generalization of image inpainting. Abstract: Generative Adversarial Network (GAN) inversion has demonstrated excellent performance in image inpainting that aims to restore lost or damaged image texture using its unmasked content. Previous GAN inversion-based methods usually utilize well-trained GAN models as effective priors to generate the realistic regions for missing holes. Despite their strong results, they ignore a hard constraint that the unmasked regions in the input and the output should be the same, resulting in a gap between GAN inversion and image inpainting and thus degrading the performance. Besides, existing GAN inversion approaches often consider a single modality of the input image, neglecting other auxiliary cues in images for improvements. Addressing these problems, we propose a novel GAN inversion approach, dubbed MMInvertFill, for image inpainting. MMInvertFill contains primarily a multimodal guided encoder with a pre-modulation and a GAN generator with F&W+ latent space. Specifically, the multimodal encoder aims to enhance the multi-scale structures with additional semantic segmentation edge texture modalities through a gated mask-aware attention module. Afterwards, a pre-modulation is presented to encode these structures into style vectors. To mitigate issues of conspicuous color discrepancy and semantic inconsistency, we introduce the F&W+ latent space to bridge the gap between GAN inversion and image inpainting. Furthermore, in order to reconstruct faithful and photorealistic images, we devise a simple yet effective Soft-update Mean Latent module to capture more diversified in-domain patterns for generating high-fidelity textures for massive corruptions. In our extensive experiments on six challenging datasets, we show that our MMInvertFill qualitatively and quantitatively outperforms other state-of-the-art methods and that it supports the completion of out-of-domain images effectively.

[47] Computer-Aided Design of Personalized Occlusal Positioning Splints Using Multimodal 3D Data

Agnieszka Anna Tomaka,Leszek Luchowski,Michał Tarnawski,Dariusz Pojda

Main category: cs.CV

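TL;DR: The paper presents a digital method for designing personalized occlusal splints and assessing their precision, bridging clinical requirements with digital dental practice.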

Details

Motivation: Modern digital technology plays an important role in designing customized medical appliances such as occlusal splints, but precise methods that connect clinical concepts with digital practice are lacking. Method: A 3D splint is generated from a virtual patient model (built from intraoral scans, CBCT, 3D facial scans, and other data), with a transformation matrix representing the therapeutic change in mandibular position and virtual embossing resolving surface conflicts. Result: The paper shows how transformation matrices can be acquired with clinical tools and evaluates splint accuracy through profile and surface deviation analysis. Conclusion: The method enables reproducible, patient-specific splint fabrication and opens new possibilities in diagnostics, multimodal image registration, and quantification of occlusal discrepancies. Abstract: Contemporary digital technology has a pivotal role in the design of customized medical appliances, including occlusal splints used in the treatment of stomatognathic system dysfunctions. We present an approach to computer-aided design and precision assessment of positioning occlusal splints, bridging clinical concepts with current digital dental practice. In our model, a 3D splint is generated based on a transformation matrix that represents the therapeutic change in mandibular position, defined by a specialist using a virtual patient model reconstructed from intraoral scans, CBCT, 3D facial scans and plaster model digitisation. The paper introduces a novel method for generating splints that accurately reproduce occlusal conditions in the therapeutic position, including a mechanism for resolving surface conflicts through virtual embossing. We demonstrate how transformation matrices can be acquired through clinical tools and intraoral devices, and evaluate the accuracy of the designed and printed splints using profile and surface deviation analysis. The proposed method enables reproducible, patient-specific splint fabrication and opens new possibilities in diagnostics, multimodal image registration and quantification of occlusal discrepancies.

[48] SC3EF: A Joint Self-Correlation and Cross-Correspondence Estimation Framework for Visible and Thermal Image Registration

Xi Tong,Xing Luo,Jiangxin Yang,Yanpeng Cao

Main category: cs.CV

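TL;DR: The paper proposes SC3EF, a framework that combines local features with global contextual information to address the challenges of RGB-T image registration.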

Details

Motivation: Multispectral imaging is critical in intelligent transportation, but RGB-T image registration is difficult because of the large modality differences. Method: A convolution-transformer-based pipeline extracts local features and encodes global correlations, and a hierarchical optical flow estimation decoder refines the registration result. Result: The method outperforms existing approaches on RGB-T datasets and generalizes well to other cross-modal datasets. Conclusion: SC3EF performs strongly in RGB-T image registration and is suited to complex scenes. Abstract: Multispectral imaging plays a critical role in a range of intelligent transportation applications, including advanced driver assistance systems (ADAS), traffic monitoring, and night vision. However, accurate visible and thermal (RGB-T) image registration poses a significant challenge due to the considerable modality differences. In this paper, we present a novel joint Self-Correlation and Cross-Correspondence Estimation Framework (SC3EF), leveraging both local representative features and global contextual cues to effectively generate RGB-T correspondences. For this purpose, we design a convolution-transformer-based pipeline to extract local representative features and encode global correlations of intra-modality for inter-modality correspondence estimation between unaligned visible and thermal images. After merging the local and global correspondence estimation results, we further employ a hierarchical optical flow estimation decoder to progressively refine the estimated dense correspondence maps. Extensive experiments demonstrate the effectiveness of our proposed method, outperforming the current state-of-the-art (SOTA) methods on representative RGB-T datasets. Furthermore, it also shows competitive generalization capabilities across challenging scenarios, including large parallax, severe occlusions, adverse weather, and other cross-modal datasets (e.g., RGB-N and RGB-D).

[49] Tree-NeRV: A Tree-Structured Neural Representation for Efficient Non-Uniform Video Encoding

Jiancheng Zhao,Yifan Zhan,Qingtian Zhu,Mingze Ma,Muyao Niu,Zunian Wan,Xiang Ji,Yinqiang Zheng

Main category: cs.CV

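TL;DR: Tree-NeRV proposes a Binary Search Tree based video encoding method with non-uniform sampling, which better exploits temporal redundancy and improves compression efficiency and reconstruction quality.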

Details

Motivation: Existing NeRV methods sample uniformly in time and therefore fail to fully exploit temporal redundancy, leading to suboptimal rate-distortion performance. Method: Feature representations are organized in a Binary Search Tree, and an optimization-driven non-uniform sampling strategy dynamically allocates higher sampling density to regions with greater temporal variation. Result: Experiments show that Tree-NeRV outperforms prior uniform-sampling methods in compression efficiency and reconstruction quality. Conclusion: With non-uniform sampling and dynamic optimization, Tree-NeRV significantly improves video encoding performance. Abstract: Implicit Neural Representations for Videos (NeRV) have emerged as a powerful paradigm for video representation, enabling direct mappings from frame indices to video frames. However, existing NeRV-based methods do not fully exploit temporal redundancy, as they rely on uniform sampling along the temporal axis, leading to suboptimal rate-distortion (RD) performance. To address this limitation, we propose Tree-NeRV, a novel tree-structured feature representation for efficient and adaptive video encoding. Unlike conventional approaches, Tree-NeRV organizes feature representations within a Binary Search Tree (BST), enabling non-uniform sampling along the temporal axis. Additionally, we introduce an optimization-driven sampling strategy, dynamically allocating higher sampling density to regions with greater temporal variation. Extensive experiments demonstrate that Tree-NeRV achieves superior compression efficiency and reconstruction quality, outperforming prior uniform sampling-based methods. Code will be released.
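
To make the idea of BST-organized, non-uniformly sampled features concrete, here is a minimal sketch: features are stored at arbitrary timestamps in a binary search tree, and a query time is answered from its nearest stored neighbors. The linear interpolation between neighbors, the unbalanced tree, and all names are assumptions for illustration; the abstract does not describe how Tree-NeRV queries or balances its tree.

```python
class Node:
    """One stored sample: a timestamp and its feature vector."""

    def __init__(self, t, feat):
        self.t, self.feat = t, feat
        self.left = self.right = None


def insert(root, t, feat):
    """Insert (or overwrite) the feature stored at timestamp t."""
    if root is None:
        return Node(t, feat)
    if t < root.t:
        root.left = insert(root.left, t, feat)
    elif t > root.t:
        root.right = insert(root.right, t, feat)
    else:
        root.feat = feat
    return root


def neighbors(root, t):
    """Return the nodes with the largest key <= t and smallest key >= t."""
    lo = hi = None
    node = root
    while node is not None:
        if t == node.t:
            return node, node
        if t < node.t:
            hi, node = node, node.left
        else:
            lo, node = node, node.right
    return lo, hi


def query(root, t):
    """Feature at time t, linearly interpolated between stored neighbors."""
    lo, hi = neighbors(root, t)
    if lo is None and hi is None:
        raise ValueError("empty tree")
    if lo is None or lo is hi:
        return list(hi.feat)
    if hi is None:
        return list(lo.feat)
    w = (t - lo.t) / (hi.t - lo.t)
    return [(1.0 - w) * a + w * b for a, b in zip(lo.feat, hi.feat)]
```

Because keys can be placed anywhere on the time axis, more nodes can be inserted where the video changes quickly and fewer where it is static, which is the non-uniform sampling the abstract refers to.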

[50] Second-order Optimization of Gaussian Splats with Importance Sampling

Hamza Pehlivan,Andrea Boscolo Camiletto,Lin Geng Foo,Marc Habermann,Christian Theobalt

Main category: cs.CV

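TL;DR: The paper proposes a second-order optimization strategy based on Levenberg-Marquardt and Conjugate Gradient that substantially speeds up 3D Gaussian Splatting training.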

Details

Motivation: 3D Gaussian Splatting relies on first-order optimizers such as Adam, which leads to long training times. Method: Exploiting the sparsity of the Jacobian, the paper proposes a matrix-free, GPU-parallelized LM optimization, combined with sampling strategies and a learning-rate heuristic. Result: The method is 3 times faster than standard LM and about 6 times faster than Adam when the Gaussian count is low, while remaining competitive for moderate counts. Conclusion: The second-order optimization strategy significantly improves the training efficiency of 3D Gaussian Splatting. Abstract: 3D Gaussian Splatting (3DGS) is widely used for novel view synthesis due to its high rendering quality and fast inference time. However, 3DGS predominantly relies on first-order optimizers such as Adam, which leads to long training times. To address this limitation, we propose a novel second-order optimization strategy based on Levenberg-Marquardt (LM) and Conjugate Gradient (CG), which we specifically tailor towards Gaussian Splatting. Our key insight is that the Jacobian in 3DGS exhibits significant sparsity since each Gaussian affects only a limited number of pixels. We exploit this sparsity by proposing a matrix-free and GPU-parallelized LM optimization. To further improve its efficiency, we propose sampling strategies for both the camera views and loss function and, consequently, the normal equation, significantly reducing the computational complexity. In addition, we increase the convergence rate of the second-order approximation by introducing an effective heuristic to determine the learning rate that avoids the expensive computation cost of line search methods. As a result, our method achieves a $3\times$ speedup over standard LM and outperforms Adam by $\sim 6\times$ when the Gaussian count is low while remaining competitive for moderate counts. Project Page: https://vcai.mpi-inf.mpg.de/projects/LM-IS
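
The phrase "matrix-free LM with CG" can be illustrated with a small CPU sketch: one Levenberg-Marquardt step solves the damped normal equations $(J^\top J + \lambda I)\,\delta = -J^\top r$ with conjugate gradient, touching the Jacobian only through products $Jv$ and $J^\top u$. This is a generic textbook construction, not the paper's GPU implementation, and it omits the sampling strategies and the learning-rate heuristic. All function names are hypothetical, and the dense demo Jacobian stands in for the sparse one the abstract describes.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(apply_a, b, iters=50, tol=1e-20):
    """Solve A x = b for a symmetric positive definite operator apply_a."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x with x = 0
    p = list(r)
    rs = dot(r, r)
    for _ in range(iters):
        if rs < tol:
            break
        ap = apply_a(p)
        alpha = rs / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def lm_step(jvp, vjp, residual, damping):
    """One LM update: solve (J^T J + damping * I) delta = -J^T r by CG.

    jvp(v) returns J v and vjp(u) returns J^T u, so J^T J is never formed.
    """
    rhs = [-g for g in vjp(residual)]

    def apply_normal(v):
        jtjv = vjp(jvp(v))
        return [a + damping * b for a, b in zip(jtjv, v)]

    return conjugate_gradient(apply_normal, rhs)


def make_linear_ops(rows):
    """Demo helper: wrap an explicit Jacobian (list of rows) as jvp/vjp."""
    def jvp(v):
        return [dot(row, v) for row in rows]

    def vjp(u):
        return [sum(row[j] * ui for row, ui in zip(rows, u))
                for j in range(len(rows[0]))]

    return jvp, vjp
```

For a linear residual $r(x) = Jx - b$ starting from $x = 0$, a single undamped step recovers the least-squares solution, which makes a convenient sanity check before moving to a nonlinear problem.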

[51] Efficient Masked Image Compression with Position-Indexed Self-Attention

Chengjie Dai,Tiantian Song,Hui Tang,Fangdong Chen,Bowei Yang,Guanghua Song

Main category: cs.CV

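TL;DR: The paper proposes an image compression method based on position-indexed self-attention that encodes and decodes only the visible parts of a masked image, significantly reducing computational cost.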

Details

Motivation: Existing methods structure the bitstream after encoding, which causes redundant computation, and traditional methods still process unimportant regions even when they are masked out. Method: A position-indexed self-attention mechanism processes only the visible parts of the masked image. Result: Compared with existing semantic-structured compression methods, the approach significantly reduces computational cost. Conclusion: The new method addresses the redundant computation problem and improves compression efficiency. Abstract: In recent years, image compression for high-level vision tasks has attracted considerable attention from researchers. Given that object information in images plays a far more crucial role in downstream tasks than background information, some studies have proposed semantically structuring the bitstream to selectively transmit and reconstruct only the information required by these tasks. However, such methods structure the bitstream after encoding, meaning that the coding process still relies on the entire image, even though much of the encoded information will not be transmitted. This leads to redundant computations. Traditional image compression methods require a two-dimensional image as input, and even if the unimportant regions of the image are set to zero by applying a semantic mask, these regions still participate in subsequent computations as part of the image. To address such limitations, we propose an image compression method based on a position-indexed self-attention mechanism that encodes and decodes only the visible parts of the masked image. Compared to existing semantic-structured compression methods, our approach can significantly reduce computational costs.

[52] Disentangling Polysemantic Channels in Convolutional Neural Networks

Robin Hesse,Jonas Fischer,Simone Schaub-Meyer,Stefan Roth

Main category: cs.CV

TL;DR: Proposes an algorithm that disentangles polysemantic channels into monosemantic ones, improving the interpretability of CNNs.

Details

Motivation: Polysemantic channels in CNNs are hard to interpret and hinder understanding of the network's decision mechanisms. Method: Analyzes the distinct activation patterns in the previous layer and restructures the network weights to disentangle polysemantic channels into monosemantic ones. Result: Successfully disentangles polysemantic channels, improving CNN interpretability and feature visualizations. Conclusion: The method provides a new tool for mechanistic interpretation of CNNs and helps in understanding network decision processes. Abstract: Mechanistic interpretability is concerned with analyzing individual components in a (convolutional) neural network (CNN) and how they form larger circuits representing decision mechanisms. These investigations are challenging since CNNs frequently learn polysemantic channels that encode distinct concepts, making them hard to interpret. To address this, we propose an algorithm to disentangle a specific kind of polysemantic channel into multiple channels, each responding to a single concept. Our approach restructures weights in a CNN, exploiting the fact that different concepts within the same channel exhibit distinct activation patterns in the previous layer. By disentangling these polysemantic features, we enhance the interpretability of CNNs, ultimately improving explanatory techniques such as feature visualizations.

[53] Rethinking Temporal Fusion with a Unified Gradient Descent View for 3D Semantic Occupancy Prediction

Dubing Chen,Huan Zheng,Jin Fang,Xingping Dong,Xianfei Li,Wenlong Liao,Tao He,Pai Peng,Jianbing Shen

Main category: cs.CV

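TL;DR: GDFusion is a temporal fusion method for vision-based 3D semantic occupancy prediction (VisionOcc) that explores temporal cues and fusion strategies, significantly improving performance while reducing memory consumption.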

Details

Motivation: Temporal fusion in the VisionOcc framework is underexplored, in particular the roles of temporal cues and fusion strategies. Method: Proposes GDFusion, which identifies and exploits three temporal cues (scene-level consistency, motion calibration, and geometric complementation) and unifies temporal signals from heterogeneous representations through a fusion strategy that reinterprets RNNs as gradient descent on features. Result: Significantly outperforms baselines on nuScenes; on the Occ3D benchmark, mIoU improves by 1.4%-4.8% and memory consumption drops by 27%-72%. Conclusion: By effectively fusing temporal cues with its optimization-based strategy, GDFusion significantly improves the performance and efficiency of VisionOcc. Abstract: We present GDFusion, a temporal fusion method for vision-based 3D semantic occupancy prediction (VisionOcc). GDFusion opens up the underexplored aspects of temporal fusion within the VisionOcc framework, focusing on both temporal cues and fusion strategies. It systematically examines the entire VisionOcc pipeline, identifying three fundamental yet previously overlooked temporal cues: scene-level consistency, motion calibration, and geometric complementation. These cues capture diverse facets of temporal evolution and make distinct contributions across various modules in the VisionOcc framework. To effectively fuse temporal signals across heterogeneous representations, we propose a novel fusion strategy by reinterpreting the formulation of vanilla RNNs. This reinterpretation leverages gradient descent on features to unify the integration of diverse temporal information, seamlessly embedding the proposed temporal cues into the network. Extensive experiments on nuScenes demonstrate that GDFusion significantly outperforms established baselines. Notably, on the Occ3D benchmark, it achieves 1.4\%-4.8\% mIoU improvements and reduces memory consumption by 27\%-72\%.
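
One way to read "gradient descent on features" is shown below as a generic illustration, not GDFusion's actual formulation. A single gradient step on the assumed loss $\tfrac{1}{2}\lVert h - x_t \rVert^2$, starting from the previous state $h_{t-1}$ with step size $\eta$, gives $h_t = (1-\eta)\,h_{t-1} + \eta\,x_t$, which is a simple recurrent fusion rule. The loss and the name `gd_fuse` are assumptions made for this sketch.

```python
def gd_fuse(h_prev, x_t, eta):
    """One gradient-descent step on 0.5 * ||h - x_t||^2, starting from h_prev."""
    grad = [h - x for h, x in zip(h_prev, x_t)]        # gradient evaluated at h_prev
    return [h - eta * g for h, g in zip(h_prev, grad)]

# Fusing a constant observation repeatedly pulls the state towards it.
h = [0.0, 0.0]
first = gd_fuse(h, [1.0, 2.0], eta=0.5)
for _ in range(20):
    h = gd_fuse(h, [1.0, 2.0], eta=0.5)
```

Different temporal cues could then correspond to different losses whose gradients update the same feature state, which is the unifying view the abstract describes.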

[54] Vision and Language Integration for Domain Generalization

Yanmei Wang,Xiyao Liu,Fupeng Chu,Zhi Han

Main category: cs.CV

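TL;DR: The paper proposes VLCA, a method that combines language space and vision space and uses the semantic space as a bridge connecting multiple image domains, in order to address domain generalization.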

Details

Motivation: Domain gaps make it hard to find a reliable common image feature space, whereas language has more comprehensive expressive elements that can compensate for this. Method: In language space, word-vector distances capture a semantic representation of the relations between categories; in vision space, low-rank approximation explores the common pattern of same-class samples; finally, the language and vision representations are aligned in a multimodal space. Result: Experiments demonstrate the effectiveness of the method. Conclusion: By combining language and vision spaces, VLCA improves the model's generalization to unknown target domains. Abstract: Domain generalization aims at training on source domains to uncover a domain-invariant feature space, allowing the model to generalize robustly to unknown target domains. However, due to domain gaps, it is hard to find a reliable common image feature space, and the reason for that is the lack of suitable basic units for images. Unlike images in vision space, language has comprehensive expression elements that can effectively convey semantics. Inspired by the semantic completeness of language and the intuitiveness of images, we propose VLCA, which combines language space and vision space and connects the multiple image domains by using semantic space as the bridge domain. Specifically, in language space, by taking advantage of the completeness of language basic units, we capture the semantic representation of the relations between categories through word vector distance. Then, in vision space, by taking advantage of the intuitiveness of image features, the common pattern of sample features with the same class is explored through low-rank approximation. In the end, the language representation is aligned with the vision representation through the multimodal space of text and image. Experiments demonstrate the effectiveness of the proposed method.
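
The low-rank step can be illustrated with a rank-1 approximation computed by power iteration. This is a sketch under assumptions, not VLCA's procedure: the rows of `A` stand for hypothetical feature vectors of samples from one class, and the rank-1 reconstruction keeps only their dominant shared direction.

```python
import math

def rank1_approx(A, iters=50):
    """Best rank-1 approximation of A via power iteration on A^T A (A must be nonzero)."""
    rows, cols = len(A), len(A[0])
    v = [1.0 / math.sqrt(cols)] * cols                 # initial right singular vector guess
    for _ in range(iters):
        Av = [sum(A[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(A[i][j] * Av[i] for i in range(rows)) for j in range(cols)]  # A^T A v
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = [sum(A[i][j] * v[j] for j in range(cols)) for i in range(rows)]      # sigma * u
    return [[Av[i] * v[j] for j in range(cols)] for i in range(rows)]

# Three same-class feature vectors that share one direction exactly (rank 1).
features = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
common = rank1_approx(features)
```

For real features the matrix is only approximately low rank, and the gap between `features` and `common` is the sample-specific variation that the approximation discards.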

[55] MathPhys-Guided Coarse-to-Fine Anomaly Synthesis with SQE-Driven Bi-Level Optimization for Anomaly Detection

Long Qian,Bingke Zhu,Yingying Chen,Ming Tang,Jinqiao Wang

Main category: cs.CV

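TL;DR: Proposes a synthetic anomaly generation method based on math-physics models and bi-level optimization, significantly improving the generalization of anomaly detection models.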

Details

Motivation: Real defect images are hard to obtain, and existing synthesis methods ignore the physical causes of defects and therefore produce low-quality anomalies, so more realistic synthesis methods are needed. Method: Generates anomalies with math-physics models, refines them with a coarse-to-fine approach and an SQE-driven bi-level strategy, and uses PDEs and wavelet transforms to improve quality. Result: Achieves SOTA performance (image-level and pixel-level AUROC) on MVTec AD, VisA, and BTAD. Conclusion: MaPhC2F and BiSQAD effectively generate high-quality synthetic anomalies and significantly improve model performance. Abstract: Anomaly detection is a crucial task in computer vision, yet collecting real-world defect images is inherently difficult due to the rarity and unpredictability of anomalies. Consequently, researchers have turned to synthetic methods for training data augmentation. However, existing synthetic strategies (e.g., naive cut-and-paste or inpainting) overlook the underlying physical causes of defects, leading to inconsistent, low-fidelity anomalies that hamper model generalization to real-world complexities. In this thesis, we introduce a novel pipeline that generates synthetic anomalies through Math-Physics model guidance, refines them via a Coarse-to-Fine approach, and employs a bi-level optimization strategy with a Synthesis Quality Estimator (SQE). By incorporating physical modeling of cracks, corrosion, and deformation, our method produces realistic defect masks, which are subsequently enhanced in two phases. The first stage (npcF) enforces a PDE-based consistency to achieve a globally coherent anomaly structure, while the second stage (npcF++) further improves local fidelity using wavelet transforms and boundary synergy blocks. Additionally, we leverage SQE-driven weighting, ensuring that high-quality synthetic samples receive greater emphasis during training. To validate our approach, we conducted comprehensive experiments on three widely adopted industrial anomaly detection benchmarks: MVTec AD, VisA, and BTAD. Across these datasets, the proposed pipeline achieves state-of-the-art (SOTA) results in both image-AUROC and pixel-AUROC, confirming the effectiveness of our MaPhC2F and BiSQAD.

[56] Enhancing Cocoa Pod Disease Classification via Transfer Learning and Ensemble Methods: Toward Robust Predictive Modeling

Devina Anduyan,Nyza Cabillo,Navy Gultiano,Mark Phil Pacot

Main category: cs.CV

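TL;DR: The study proposes an ensemble-based method for cocoa pod disease classification that combines transfer learning with three ensemble strategies (Bagging, Boosting, and Stacking), uses pre-trained convolutional neural networks as base learners, and shows on a 6,000-image dataset that Bagging performs best (100% test accuracy).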

Details

Motivation: To improve the accuracy and robustness of cocoa pod disease classification by combining transfer learning with ensemble learning, in support of precision agriculture and automated disease management. Method: Uses pre-trained CNN models (such as VGG16 and ResNet50) as base learners, combined with three ensemble strategies (Bagging, Boosting, and Stacking), to classify cocoa pod diseases. Result: Bagging performs best with a test accuracy of 100%, ahead of Boosting (97%) and Stacking (92%). Conclusion: Combining transfer learning with ensemble learning significantly improves model performance and offers an effective solution for automated crop disease management. Abstract: This study presents an ensemble-based approach for cocoa pod disease classification by integrating transfer learning with three ensemble learning strategies: Bagging, Boosting, and Stacking. Pre-trained convolutional neural networks, including VGG16, VGG19, ResNet50, ResNet101, InceptionV3, and Xception, were fine-tuned and employed as base learners to detect three disease categories: Black Pod Rot, Pod Borer, and Healthy. A balanced dataset of 6,000 cocoa pod images was curated and augmented to ensure robustness against variations in lighting, orientation, and disease severity. The performance of each ensemble method was evaluated using accuracy, precision, recall, and F1-score. Experimental results show that Bagging consistently achieved superior classification performance with a test accuracy of 100%, outperforming Boosting (97%) and Stacking (92%). The findings confirm that combining transfer learning with ensemble techniques improves model generalization and reliability, making it a promising direction for precision agriculture and automated crop disease management.
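
As a reminder of what Bagging does mechanically, here is a minimal sketch under assumptions. Each base learner would be trained on a bootstrap resample of the training set, and the ensemble prediction is the per-sample majority vote. The CNN base learners are replaced by hard-coded, hypothetical predictions, and the helper names are illustrative.

```python
import random
from collections import Counter

def bootstrap_indices(n, rng):
    """Sample n training indices with replacement (one resample per base learner)."""
    return [rng.randrange(n) for _ in range(n)]

def majority_vote(predictions):
    """predictions: one label list per model; returns the majority label per sample."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

rng = random.Random(0)
resample = bootstrap_indices(10, rng)      # indices a base learner would train on

# Hypothetical outputs of three fine-tuned base learners on two test images.
preds = [
    ["Black Pod Rot", "Healthy"],
    ["Black Pod Rot", "Pod Borer"],
    ["Healthy", "Pod Borer"],
]
ensemble = majority_vote(preds)
```

Boosting and Stacking differ in how the learners are combined: Boosting trains them sequentially on reweighted errors, while Stacking feeds their outputs to a second-level model.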

[57] All-in-One Transferring Image Compression from Human Perception to Multi-Machine Perception

Jiancheng Zhao,Xiang Ji,Zhuoxiao Li,Zunian Wan,Weihang Ran,Mingze Ma,Muyao Niu,Yifan Zhan,Cheng-Ching Tseng,Yinqiang Zheng

Main category: cs.CV

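TL;DR: Proposes an asymmetric adaptor framework that supports multi-task adaptation within a single model, addressing the inefficiency and lack of task interaction in existing methods.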

Details

Motivation: Existing methods typically adapt a learned image compression (LIC) model to downstream tasks one task at a time, which is inefficient and lacks task interaction. Method: Introduces a shared adaptor that learns general semantic features and task-specific adaptors that preserve task-level distinctions, requiring only lightweight plug-in modules and a frozen base codec. Result: Outperforms full fine-tuning and other parameter-efficient fine-tuning baselines on the PASCAL-Context benchmark. Conclusion: Validates the effectiveness of multi-vision transferring while maintaining compression efficiency. Abstract: Efficiently transferring Learned Image Compression (LIC) models from human perception to machine perception is an emerging challenge in vision-centric representation learning. Existing approaches typically adapt LIC to downstream tasks in a single-task manner, which is inefficient, lacks task interaction, and results in multiple task-specific bitstreams. To address these limitations, we propose an asymmetric adaptor framework that supports multi-task adaptation within a single model. Our method introduces a shared adaptor to learn general semantic features and task-specific adaptors to preserve task-level distinctions. With only lightweight plug-in modules and a frozen base codec, our method achieves strong performance across multiple tasks while maintaining compression efficiency. Experiments on the PASCAL-Context benchmark demonstrate that our method outperforms both Fully Fine-Tuned and other Parameter Efficient Fine-Tuned (PEFT) baselines, validating the effectiveness of multi-vision transferring.

[58] Hierarchical Feature Learning for Medical Point Clouds via State Space Model

Guoqing Zhang,Jingyun Yang,Yang Li

Main category: cs.CV

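TL;DR: The paper proposes a hierarchical feature learning framework based on state space models (SSMs) for medical point cloud understanding and validates its superior performance on MedPointS, a newly built large-scale dataset.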

Details

Motivation: Medical point clouds have great potential in disease diagnosis and treatment, but little research has addressed them; the paper aims to fill this gap. Method: Aggregates multi-scale structural information through farthest point sampling and multi-level KNN queries, and adapts SSMs to point clouds with coordinate-order and inside-out scanning strategies. Result: Experiments on MedPointS show that the method performs strongly on classification, completion, and segmentation. Conclusion: The framework provides an effective tool for medical point cloud analysis; the dataset and code are publicly available. Abstract: Deep learning-based point cloud modeling has been widely investigated as an indispensable component of general shape analysis. Recently, transformers and state space models (SSMs) have shown promising capacities in point cloud learning. However, limited research has been conducted on medical point clouds, which have great potential in disease diagnosis and treatment. This paper presents an SSM-based hierarchical feature learning framework for medical point cloud understanding. Specifically, we down-sample input into multiple levels through farthest point sampling. At each level, we perform a series of k-nearest neighbor (KNN) queries to aggregate multi-scale structural information. To assist SSM in processing point clouds, we introduce coordinate-order and inside-out scanning strategies for efficient serialization of irregular points. Point features are calculated progressively from short neighbor sequences and long point sequences through vanilla and group Point SSM blocks, to capture both local patterns and long-range dependencies. To evaluate the proposed method, we build a large-scale medical point cloud dataset named MedPointS for anatomy classification, completion, and segmentation. Extensive experiments conducted on MedPointS demonstrate that our method achieves superior performance across all tasks. The dataset is available at https://flemme-docs.readthedocs.io/en/latest/medpoints.html. Code is merged to a public medical imaging platform: https://github.com/wlsdzyzl/flemme.
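
The two sampling primitives named in the abstract, farthest point sampling (FPS) and k-nearest-neighbor (KNN) queries, can be sketched in a few lines. This is a plain reference version written for readability, not the paper's code. It uses brute-force distances, and the point set is a made-up example.

```python
import math

def farthest_point_sampling(points, m, start=0):
    """Greedy FPS: repeatedly add the point farthest from the already chosen set."""
    chosen = [start]
    dist = [math.dist(p, points[start]) for p in points]   # distance to chosen set
    while len(chosen) < m:
        idx = max(range(len(points)), key=lambda i: dist[i])
        chosen.append(idx)
        dist = [min(d, math.dist(p, points[idx])) for d, p in zip(dist, points)]
    return chosen

def knn(points, query_idx, k):
    """Indices of the k nearest neighbours of points[query_idx], excluding itself."""
    q = points[query_idx]
    order = sorted((i for i in range(len(points)) if i != query_idx),
                   key=lambda i: math.dist(points[i], q))
    return order[:k]

points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0),
          (0.5, 0.0, 0.0), (8.0, 0.0, 0.0)]
centers = farthest_point_sampling(points, m=3)
neighbours = knn(points, query_idx=0, k=2)
```

Running FPS with decreasing `m` yields the multi-level down-sampling, and calling `knn` with several values of `k` around each center gives neighbor sequences at multiple scales.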

[59] Pose and Facial Expression Transfer by using StyleGAN

Petr Jahoda,Jan Cech

Main category: cs.CV

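TL;DR: Proposes a method that transfers facial pose and expression from a source image to a target image, trained in a self-supervised manner without manual annotation.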

Details

Motivation: To transfer facial pose and expression across identities while avoiding the need for manual annotation. Method: Uses two encoders and a mapping network to project the inputs into the latent space of StyleGAN2, which generates the output image; training is self-supervised on video sequences of many individuals. Result: The model can synthesize faces of random identities with controllable pose and expression, at close to real-time performance. Conclusion: The method is efficient, requires no labels, and is suitable for cross-identity synthesis of facial pose and expression. Abstract: We propose a method to transfer pose and expression between face images. Given a source and target face portrait, the model produces an output image in which the pose and expression of the source face image are transferred onto the target identity. The architecture consists of two encoders and a mapping network that projects the two inputs into the latent space of StyleGAN2, which finally generates the output. The training is self-supervised from video sequences of many individuals. Manual labeling is not required. Our model enables the synthesis of random identities with controllable pose and expression. Close-to-real-time performance is achieved.

[60] Riemannian Patch Assignment Gradient Flows

Daniel Gonzalez-Alvarado,Fabio Schlindwein,Jonas Cassel,Laura Steingruber,Stefania Petra,Christoph Schnörr

Main category: cs.CV

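TL;DR: The paper proposes a label assignment method for metric data on graphs that achieves label consistency through dynamic interaction and geometric numerical integration.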

Details

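Motivation: To resolve local inconsistencies in label assignment on graph data by improving consistency through global dynamic interaction. Method: Uses a dictionary of competing labeled patches and patch assignment variables, and reaches maximal consistency through geometric numerical integration of a Riemannian ascent flow. Result: Experiments illustrate the properties of the approach, including uncertainty quantification of label assignments. Conclusion: Through dynamic interaction and geometric optimization, the method improves the consistency and reliability of label assignment on graph data. Abstract: This paper introduces patch assignment flows for metric data labeling on graphs. Labelings are determined by regularizing initial local labelings through the dynamic interaction of both labels and label assignments across the graph, entirely encoded by a dictionary of competing labeled patches and mediated by patch assignment variables. Maximal consistency of patch assignments is achieved by geometric numerical integration of a Riemannian ascent flow, as a critical point of a Lagrangian action functional. Experiments illustrate properties of the approach, including uncertainty quantification of label assignments.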

[61] TTRD3: Texture Transfer Residual Denoising Dual Diffusion Model for Remote Sensing Image Super-Resolution

Yide Liu,Haijiang Sun,Xiaowen Zhang,Qiaoyuan Liu,Zhouchang Chen,Chongzhuo Xiao

Main category: cs.CV

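TL;DR: The paper proposes TTRD3, a model that addresses key challenges in remote sensing image super-resolution through multi-scale feature extraction, texture transfer, and a dual diffusion model.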

Details

Motivation: Existing methods fall short in extracting multi-scale features, maintaining semantic consistency, and balancing geometric accuracy against visual quality, so a more effective solution is needed. Method: TTRD3 comprises three modules: a Multi-scale Feature Aggregation Block (MFAB), Sparse Texture Transfer Guidance (STTG), and a Residual Denoising Dual Diffusion Model (RDDM). Result: Experiments show that TTRD3 improves LPIPS by 1.43% and FID by 3.67% over the best-performing existing methods. Conclusion: By combining multi-scale feature extraction, texture transfer, and a dual diffusion model, TTRD3 significantly improves remote sensing image super-resolution. Abstract: Remote Sensing Image Super-Resolution (RSISR) reconstructs high-resolution (HR) remote sensing images from low-resolution inputs to support fine-grained ground object interpretation. Existing methods face three key challenges: (1) Difficulty in extracting multi-scale features from spatially heterogeneous RS scenes, (2) Limited prior information causing semantic inconsistency in reconstructions, and (3) Trade-off imbalance between geometric accuracy and visual quality. To address these issues, we propose the Texture Transfer Residual Denoising Dual Diffusion Model (TTRD3) with three innovations: First, a Multi-scale Feature Aggregation Block (MFAB) employing parallel heterogeneous convolutional kernels for multi-scale feature extraction. Second, a Sparse Texture Transfer Guidance (STTG) module that transfers HR texture priors from reference images of similar scenes. Third, a Residual Denoising Dual Diffusion Model (RDDM) framework combining residual diffusion for deterministic reconstruction and noise diffusion for diverse generation. Experiments on multi-source RS datasets demonstrate TTRD3's superiority over state-of-the-art methods, achieving 1.43% LPIPS improvement and 3.67% FID enhancement compared to best-performing baselines. Code/model: https://github.com/LED-666/TTRD3.

[62] Prototypes are Balanced Units for Efficient and Effective Partially Relevant Video Retrieval

WonJun Moon,Cheol-Ho Cho,Woojin Jun,Minho Shim,Taeoh Kim,Inwoong Lee,Dongyoon Wee,Jae-Pil Heo

Main category: cs.CV

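TL;DR: The paper proposes a prototypical PRVR framework that encodes the diverse contexts of a video into a fixed number of prototypes, introduces several strategies to strengthen text association and video understanding, and uses cross-modal and uni-modal reconstruction tasks to keep the prototypes searchable while accurately encoding video content.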

Details

Motivation: In partially relevant video retrieval (PRVR), achieving both search accuracy and efficiency is challenging because diverse context representations increase computational and memory costs. Method: Proposes a prototypical PRVR framework that encodes video contexts into a fixed number of prototypes, introduces strategies for text association and video understanding, uses cross-modal and uni-modal reconstruction tasks, and applies a video mixing technique to provide weak guidance. Result: Extensive experiments on TVR, ActivityNet-Captions, and QVHighlights validate the effectiveness of the approach without sacrificing efficiency. Conclusion: Through prototype encoding and reconstruction tasks, the method balances accuracy and efficiency in PRVR. Abstract: In a retrieval system, simultaneously achieving search accuracy and efficiency is inherently challenging. This challenge is particularly pronounced in partially relevant video retrieval (PRVR), where incorporating more diverse context representations at varying temporal scales for each video enhances accuracy but increases computational and memory costs. To address this dichotomy, we propose a prototypical PRVR framework that encodes diverse contexts within a video into a fixed number of prototypes. We then introduce several strategies to enhance text association and video understanding within the prototypes, along with an orthogonal objective to ensure that the prototypes capture a diverse range of content. To keep the prototypes searchable via text queries while accurately encoding video contexts, we implement cross- and uni-modal reconstruction tasks. The cross-modal reconstruction task aligns the prototypes with textual features within a shared space, while the uni-modal reconstruction task preserves all video contexts during encoding. Additionally, we employ a video mixing technique to provide weak guidance to further align prototypes and associated textual representations. Extensive evaluations on TVR, ActivityNet-Captions, and QVHighlights validate the effectiveness of our approach without sacrificing efficiency.

[63] Event-Enhanced Blurry Video Super-Resolution

Dachun Kai,Yueyi Zhang,Jin Wang,Zeyu Xiao,Zhiwei Xiong,Xiaoyan Sun

Main category: cs.CV

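TL;DR: Proposes Ev-DeblurVSR, an event-based blurry video super-resolution method that fuses frame and event information to improve detail recovery and temporal consistency.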

Details

Motivation: Existing blurry video super-resolution methods lack motion information and high-frequency details, making it hard to restore sharp details at high resolution. Method: Introduces event signals and designs a reciprocal feature deblurring module and a hybrid deformable alignment module that combine frame and event information. Result: Achieves the best performance on synthetic and real-world datasets; on real data it is 2.59 dB more accurate and 7.28x faster than FMA-Net. Conclusion: Ev-DeblurVSR significantly improves blurry video super-resolution by using event signals. Abstract: In this paper, we tackle the task of blurry video super-resolution (BVSR), aiming to generate high-resolution (HR) videos from low-resolution (LR) and blurry inputs. Current BVSR methods often fail to restore sharp details at high resolutions, resulting in noticeable artifacts and jitter due to insufficient motion information for deconvolution and the lack of high-frequency details in LR frames. To address these challenges, we introduce event signals into BVSR and propose a novel event-enhanced network, Ev-DeblurVSR. To effectively fuse information from frames and events for feature deblurring, we introduce a reciprocal feature deblurring module that leverages motion information from intra-frame events to deblur frame features while reciprocally using global scene context from the frames to enhance event features. Furthermore, to enhance temporal consistency, we propose a hybrid deformable alignment module that fully exploits the complementary motion information from inter-frame events and optical flow to improve motion estimation in the deformable alignment process. Extensive evaluations demonstrate that Ev-DeblurVSR establishes a new state-of-the-art performance on both synthetic and real-world datasets. Notably, on real data, our method is +2.59 dB more accurate and 7.28$\times$ faster than the recent best BVSR baseline FMA-Net. Code: https://github.com/DachunKai/Ev-DeblurVSR.

[64] Expert Kernel Generation Network Driven by Contextual Mapping for Hyperspectral Image Classification

Guandong Li,Mengxia Ye

Main category: cs.CV

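TL;DR: EKGNet is a model based on an improved 3D-DenseNet that uses a context-aware mapping network and a dynamic kernel generation module to address high-dimensional data, sparse distribution, and spectral redundancy in hyperspectral image classification.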

Details

Motivation: Hyperspectral image classification faces high-dimensional data, sparse distribution of ground objects, and spectral redundancy, which lead to overfitting and limited generalization. Method: EKGNet combines a context-aware mapping module with a dynamic kernel generation module, dynamically generating combination weights for convolutional kernels to build an adaptive expert convolution system. Result: Performs strongly on the IN, UP, and KSC datasets, outperforming mainstream methods. Conclusion: EKGNet improves representation capability through a dynamic expert convolution system without increasing network depth or width. Abstract: Deep neural networks face several challenges in hyperspectral image classification, including high-dimensional data, sparse distribution of ground objects, and spectral redundancy, which often lead to classification overfitting and limited generalization capability. To more efficiently adapt to ground object distributions while extracting image features without introducing excessive parameters and skipping redundant information, this paper proposes EKGNet based on an improved 3D-DenseNet model, consisting of a context-aware mapping network and a dynamic kernel generation module. The context-aware mapping module translates global contextual information of hyperspectral inputs into instructions for combining base convolutional kernels, while the dynamic kernels are composed of K groups of base convolutions, analogous to K different types of experts specializing in fundamental patterns across various dimensions. The mapping module and dynamic kernel generation mechanism form a tightly coupled system: the former generates meaningful combination weights based on inputs, while the latter constructs an adaptive expert convolution system using these weights. This dynamic approach enables the model to focus more flexibly on key spatial structures when processing different regions, rather than relying on the fixed receptive field of a single static convolutional kernel. EKGNet enhances model representation capability through a 3D dynamic expert convolution system without increasing network depth or width. The proposed method demonstrates superior performance on IN, UP, and KSC datasets, outperforming mainstream hyperspectral image classification approaches.
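
The "context to combination weights to combined kernel" mechanism can be shown in one dimension. This sketch is an assumption-laden simplification of the 3D design described above: the context is a single global average, the mapping is a linear layer followed by softmax, and `W`, `b`, and the two base kernels are hypothetical values.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def dynamic_kernel(x, base_kernels, W, b):
    """Mix K base kernels with input-dependent weights."""
    context = sum(x) / len(x)                               # global average pooling
    weights = softmax([w * context + bi for w, bi in zip(W, b)])
    size = len(base_kernels[0])
    kernel = [sum(wk * k[i] for wk, k in zip(weights, base_kernels))
              for i in range(size)]
    return kernel, weights

def conv1d_valid(x, kernel):
    """Cross-correlation without padding, as used in deep learning 'convolutions'."""
    n = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(n)) for i in range(len(x) - n + 1)]

x = [1.0, 2.0, 3.0, 4.0]
experts = [[1.0, 0.0, -1.0], [1.0, 2.0, 1.0]]               # K = 2 base kernels
kernel, weights = dynamic_kernel(x, experts, W=[0.0, 0.0], b=[0.0, 0.0])
y = conv1d_valid(x, kernel)
```

With zero `W` and `b` the two experts are mixed equally; learned, nonzero parameters would make the mixture depend on the input, which is the adaptive behavior the abstract attributes to the expert system.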

[65] NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation

Xiangyan Liu,Jinjie Ni,Zijian Wu,Chao Du,Longxu Dou,Haonan Wang,Tianyu Pang,Michael Qizhe Shieh

Main category: cs.CV

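TL;DR: NoisyRollout is a simple yet effective reinforcement learning method that mixes trajectories from clean and distorted images to strengthen the exploration capabilities of vision-language models at no additional training cost.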

Details

Motivation: Current vision-language models fall short in policy exploration and visual perception, which limits their reasoning ability. Method: Proposes NoisyRollout, which combines trajectories from clean and distorted images, introduces a vision-oriented inductive bias, and applies a noise annealing schedule. Result: With only 2.1K training samples, it achieves state-of-the-art performance among open-source RL-tuned models on 5 out-of-domain benchmarks while preserving or improving in-domain performance. Conclusion: By introducing visual diversity, NoisyRollout significantly improves the exploration capabilities and performance of vision-language models. Abstract: Recent advances in reinforcement learning (RL) have strengthened the reasoning capabilities of vision-language models (VLMs). However, enhancing policy exploration to more effectively scale test-time compute remains underexplored in VLMs. In addition, VLMs continue to struggle with imperfect visual perception, which in turn affects the subsequent reasoning process. To this end, we propose NoisyRollout, a simple yet effective RL approach that mixes trajectories from both clean and moderately distorted images to introduce targeted diversity in visual perception and the resulting reasoning patterns. Without additional training cost, NoisyRollout enhances the exploration capabilities of VLMs by incorporating a vision-oriented inductive bias. Furthermore, NoisyRollout employs a noise annealing schedule that gradually reduces distortion strength over training, ensuring benefit from noisy signals early while maintaining training stability and scalability in later stages. With just 2.1K training samples, NoisyRollout achieves state-of-the-art performance among open-source RL-tuned models on 5 out-of-domain benchmarks spanning both reasoning and perception tasks, while preserving comparable or even better in-domain performance.

[66] Imaging for All-Day Wearable Smart Glasses

Michael Goesele,Daniel Andersen,Yujia Chen,Simon Green,Eddy Ilg,Chao Li,Johnson Liu,Grace Kuo,Logan Wan,Richard Newcombe

Main category: cs.CV

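TL;DR: The paper analyzes the fundamental limits of imaging from smart glasses and proposes a distributed imaging approach that reduces camera module size.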

Details

Motivation: Smart glasses must be all-day wearable, small, and fashionable, but image quality and camera size are in tension. Method: Proposes a distributed imaging approach and validates its properties experimentally. Result: The distributed imaging approach substantially reduces camera module size while preserving image quality. Conclusion: Distributed imaging offers a viable solution for smart glasses design. Abstract: In recent years smart glasses technology has rapidly advanced, opening up entirely new areas for mobile computing. We expect future smart glasses will need to be all-day wearable, adopting a small form factor to meet the requirements of volume, weight, fashionability and social acceptability, which puts significant constraints on the space of possible solutions. Additional challenges arise due to the fact that smart glasses are worn in arbitrary environments while their wearer moves and performs everyday activities. In this paper, we systematically analyze the space of imaging from smart glasses and derive several fundamental limits that govern this imaging domain. We discuss the impact of these limits on achievable image quality and camera module size -- comparing in particular to related devices such as mobile phones. We then propose a novel distributed imaging approach that allows minimizing the size of the individual camera modules when compared to a standard monolithic camera design. Finally, we demonstrate the properties of this novel approach in a series of experiments using synthetic data as well as images captured with two different prototype implementations.

[67] ArtistAuditor: Auditing Artist Style Pirate in Text-to-Image Generation Models

Linkang Du,Zheng Zhu,Min Chen,Zhou Su,Shouling Ji,Peng Cheng,Jiming Chen,Zhikun Zhang

Main category: cs.CV

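TL;DR: ArtistAuditor is a data-use auditing method for text-to-image generation models that analyzes style features to determine whether a model was fine-tuned on a specific artist's works; experiments show high AUC values (> 0.937), and its effectiveness is confirmed in real-world cases.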

Details

Motivation: To address the copyright infringement concerns that arise when text-to-image generation models (such as DALL-E and Stable Diffusion) are fine-tuned on artists' works without authorization. Method: Uses a style extractor to obtain multi-granularity style representations, treats artworks as samplings of an artist's style, and makes auditing decisions with a trained discriminator. Result: Across six combinations of models and datasets, AUC values all exceed 0.937, validating the method's effectiveness. Conclusion: ArtistAuditor offers a practical solution for data-use auditing and shows potential in real-world applications. Abstract: Text-to-image models based on diffusion processes, such as DALL-E, Stable Diffusion, and Midjourney, are capable of transforming texts into detailed images and have widespread applications in art and design. As such, amateur users can easily imitate professional-level paintings by collecting an artist's work and fine-tuning the model, leading to concerns about artworks' copyright infringement. To tackle these issues, previous studies either add visually imperceptible perturbation to the artwork to change its underlying styles (perturbation-based methods) or embed post-training detectable watermarks in the artwork (watermark-based methods). However, when the artwork or the model has been published online, i.e., modification to the original artwork or model retraining is not feasible, these strategies might not be viable. To this end, we propose a novel method for data-use auditing in the text-to-image generation model. The general idea of ArtistAuditor is to identify if a suspicious model has been finetuned using the artworks of specific artists by analyzing the features related to the style. Concretely, ArtistAuditor employs a style extractor to obtain the multi-granularity style representations and treats artworks as samplings of an artist's style. Then, ArtistAuditor queries a trained discriminator to gain the auditing decisions. The experimental results on six combinations of models and datasets show that ArtistAuditor can achieve high AUC values (> 0.937). By studying ArtistAuditor's transferability and core modules, we provide valuable insights into the practical implementation. Finally, we demonstrate the effectiveness of ArtistAuditor in real-world cases via the online platform Scenario.
ArtistAuditor is open-sourced at https://github.com/Jozenn/ArtistAuditor.

[68] EchoWorld: Learning Motion-Aware World Models for Echocardiography Probe Guidance

Yang Yue,Yulin Wang,Haojun Jiang,Pan Liu,Shiji Song,Gao Huang

Main category: cs.CV

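TL;DR: EchoWorld is a motion-aware, world-modeling-based framework for ultrasound probe guidance that improves guidance precision through pre-training and fine-tuning strategies.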

Details

Motivation: Echocardiography depends on experienced sonographers; AI-assisted or autonomous scanning systems could address this, but existing machine learning models struggle to grasp heart anatomy and the complex relationship between probe motion and visual signals. Method: Proposes the EchoWorld framework, whose world-modeling pre-training strategy predicts masked anatomical regions and the visual outcomes of probe adjustments; the fine-tuning stage adds a motion-aware attention mechanism that integrates historical data. Result: Trained on more than one million ultrasound images, EchoWorld significantly reduces guidance errors and outperforms existing visual backbones and guidance frameworks. Conclusion: Through motion awareness and world modeling, EchoWorld improves ultrasound probe guidance precision and offers a new direction for AI-assisted scanning. Abstract: Echocardiography is crucial for cardiovascular disease detection but relies heavily on experienced sonographers. Echocardiography probe guidance systems, which provide real-time movement instructions for acquiring standard plane images, offer a promising solution for AI-assisted or fully autonomous scanning. However, developing effective machine learning models for this task remains challenging, as they must grasp heart anatomy and the intricate interplay between probe motion and visual signals. To address this, we present EchoWorld, a motion-aware world modeling framework for probe guidance that encodes anatomical knowledge and motion-induced visual dynamics, while effectively leveraging past visual-motion sequences to enhance guidance precision. EchoWorld employs a pre-training strategy inspired by world modeling principles, where the model predicts masked anatomical regions and simulates the visual outcomes of probe adjustments. Built upon this pre-trained model, we introduce a motion-aware attention mechanism in the fine-tuning stage that effectively integrates historical visual-motion data, enabling precise and adaptive probe guidance. Trained on more than one million ultrasound images from over 200 routine scans, EchoWorld effectively captures key echocardiographic knowledge, as validated by qualitative analysis. Moreover, our method significantly reduces guidance errors compared to existing visual backbones and guidance frameworks, excelling in both single-frame and sequential evaluation protocols. Code is available at https://github.com/LeapLabTHU/EchoWorld.

[69] SkyReels-V2: Infinite-length Film Generative Model

Guibin Chen,Dixuan Lin,Jiangping Yang,Chunze Lin,Juncheng Zhu,Mingyuan Fan,Hao Zhang,Sheng Chen,Zheng Chen,Chengchen Ma,Weiming Xiong,Wei Wang,Nuo Pang,Kang Kang,Zhiheng Xu,Yuzhe Jin,Yupeng Liang,Yubing Song,Peng Zhao,Boyuan Xu,Di Qiu,Debang Li,Zhengcong Fei,Yang Li,Yahui Zhou

Main category: cs.CV

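TL;DR: SkyReels-V2 is an infinite-length film generative model that combines a multi-modal large language model, multi-stage pretraining, reinforcement learning, and a diffusion forcing framework to address motion quality, duration, and shot awareness in video generation.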

Details

Motivation: Existing video generation models are limited in motion quality, video duration, and shot awareness, and cannot meet the needs of professional film-style generation. Method: Designs a structural video representation that combines a multi-modal large language model with sub-expert models, optimizes motion through multi-stage pretraining and reinforcement learning, and uses a diffusion forcing framework for long-video synthesis. Result: SkyReels-V2 can generate high-quality, infinite-length videos with good motion dynamics in a professional film style. Conclusion: SkyReels-V2 offers a new solution for long-video generation and significantly improves video quality and motion dynamics. Abstract: Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: compromises in motion dynamics to enhance temporal visual quality, constrained video duration (5-10 seconds) to prioritize resolution, and inadequate shot-aware generation stemming from general-purpose MLLMs' inability to interpret cinematic grammar, such as shot composition, actor expressions, and camera motions. These intertwined limitations hinder realistic long-form synthesis and professional film-style generation. To address these limitations, we propose SkyReels-V2, an Infinite-length Film Generative Model, that synergizes Multi-modal Large Language Model (MLLM), Multi-stage Pretraining, Reinforcement Learning, and Diffusion Forcing Framework. Firstly, we design a comprehensive structural representation of video that combines the general descriptions by the Multi-modal LLM and the detailed shot language by sub-expert models. Aided with human annotation, we then train a unified Video Captioner, named SkyCaptioner-V1, to efficiently label the video data. Secondly, we establish progressive-resolution pretraining for the fundamental video generation, followed by a four-stage post-training enhancement: Initial concept-balanced Supervised Fine-Tuning (SFT) improves baseline quality; Motion-specific Reinforcement Learning (RL) training with human-annotated and synthetic distortion data addresses dynamic artifacts; Our diffusion forcing framework with non-decreasing noise schedules enables long-video synthesis in an efficient search space; Final high-quality SFT refines visual fidelity.
All the code and models are available at https://github.com/SkyworkAI/SkyReels-V2.

[70] Effective Dual-Region Augmentation for Reduced Reliance on Large Amounts of Labeled Data

Prasanna Reddy Pulakurthi,Majid Rabbani,Celso M. de Melo,Sohail A. Dianat,Raghuveer M. Rao

Main category: cs.CV

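TL;DR: Proposes a novel dual-region augmentation method that uses targeted data transformations to reduce reliance on large-scale labeled data and improve model robustness and adaptability.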

Details

Motivation: Reduce reliance on large-scale labeled data while improving model robustness and adaptability in computer vision tasks. Method: Apply random noise perturbations to foreground objects and spatially shuffle background regions to increase the diversity of the training data. Result: Significantly outperforms existing methods on the PACS dataset, and its effectiveness is also validated on the Market-1501 and DukeMTMC-reID datasets. Conclusion: By augmenting data through structured transformations, the method offers a scalable way to reduce reliance on manual annotation. Abstract: This paper introduces a novel dual-region augmentation approach designed to reduce reliance on large-scale labeled datasets while improving model robustness and adaptability across diverse computer vision tasks, including source-free domain adaptation (SFDA) and person re-identification (ReID). Our method performs targeted data transformations by applying random noise perturbations to foreground objects and spatially shuffling background patches. This effectively increases the diversity of the training data, improving model robustness and generalization. Evaluations on the PACS dataset for SFDA demonstrate that our augmentation strategy consistently outperforms existing methods, achieving significant accuracy improvements in both single-target and multi-target adaptation settings. By augmenting training data through structured transformations, our method enables model generalization across domains, providing a scalable solution for reducing reliance on manually annotated datasets. Furthermore, experiments on Market-1501 and DukeMTMC-reID datasets validate the effectiveness of our approach for person ReID, surpassing traditional augmentation techniques.
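The two transformations described in this abstract (noise on the foreground, shuffled background patches) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `dual_region_augment`, the patch size, the Gaussian noise model, and the rule that a patch counts as background only if it contains no foreground pixel are all assumptions made for illustration.

```python
import numpy as np

def dual_region_augment(image, fg_mask, patch=16, noise_std=0.1, rng=None):
    """Hypothetical sketch of a dual-region augmentation.

    image:   float array (H, W, C) with values in [0, 1]
    fg_mask: bool array (H, W), True on foreground pixels
    H and W must be divisible by `patch`.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w, _ = image.shape
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    gh, gw = h // patch, w // patch

    # Cut the image and the mask into a (gh, gw) grid of patches.
    patches = image.reshape(gh, patch, gw, patch, -1).swapaxes(1, 2)
    mask_patches = fg_mask.reshape(gh, patch, gw, patch).swapaxes(1, 2)

    # Assumption: a patch is "background" if it holds no foreground pixel.
    is_bg = ~mask_patches.any(axis=(2, 3))
    bg_idx = np.argwhere(is_bg)

    # Spatially shuffle the background patches among themselves.
    perm = rng.permutation(len(bg_idx))
    shuffled = patches.copy()
    shuffled[bg_idx[:, 0], bg_idx[:, 1]] = patches[bg_idx[perm, 0], bg_idx[perm, 1]]
    out = shuffled.swapaxes(1, 2).reshape(h, w, -1)

    # Perturb foreground pixels with Gaussian noise (assumed noise model).
    noise = rng.normal(0.0, noise_std, size=out.shape)
    out = np.where(fg_mask[..., None], out + noise, out)
    return np.clip(out, 0.0, 1.0)
```

Patches that touch the foreground stay in place, so the mask still lines up with the object after shuffling. A real pipeline would obtain `fg_mask` from a segmentation or saliency model, which the abstract does not specify.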

[71] Enhancing Person-to-Person Virtual Try-On with Multi-Garment Virtual Try-Off

Riza Velioglu,Petra Bevandic,Robin Chan,Barbara Hammer

Main category: cs.CV

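TL;DR: Introduces TryOffDiff, a diffusion-based VTOFF method that extracts standardized garment images from photos of clothed people and performs strongly on the VITON-HD and DressCode datasets.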

Details

Motivation: Address technical challenges in Virtual Try-On (VTON) and Virtual Try-Off (VTOFF), in particular the difficulty of extracting standardized garment images from photos of people. Method: Built on a latent diffusion framework with SigLIP image conditioning, TryOffDiff captures garment texture, shape, and patterns, and is the first to support multi-garment VTOFF. Result: Achieves state-of-the-art results on VITON-HD and strong performance on DressCode, and also improves p2p-VTON. Conclusion: TryOffDiff is the first multi-garment VTOFF model; paired with VTON models it reduces unwanted attribute transfer, providing a new tool for computer vision applications in fashion. Abstract: Computer vision is transforming fashion through Virtual Try-On (VTON) and Virtual Try-Off (VTOFF). VTON generates images of a person in a specified garment using a target photo and a standardized garment image, while a more challenging variant, Person-to-Person Virtual Try-On (p2p-VTON), uses a photo of another person wearing the garment. VTOFF, on the other hand, extracts standardized garment images from clothed individuals. We introduce TryOffDiff, a diffusion-based VTOFF model. Built on a latent diffusion framework with SigLIP image conditioning, it effectively captures garment properties like texture, shape, and patterns. TryOffDiff achieves state-of-the-art results on VITON-HD and strong performance on the DressCode dataset, covering upper-body, lower-body, and dresses. Enhanced with class-specific embeddings, it pioneers multi-garment VTOFF, the first of its kind. When paired with VTON models, it improves p2p-VTON by minimizing unwanted attribute transfer, such as skin color. Code is available at: https://rizavelioglu.github.io/tryoffdiff/

[72] EventVAD: Training-Free Event-Aware Video Anomaly Detection

Yihua Shao,Haojin He,Sijie Li,Siyu Chen,Xinwei Long,Fanhu Zeng,Yuxuan Fan,Muyang Zhang,Ziyang Yan,Ao Ma,Xiaochen Wang,Hao Tang,Yan Wang,Shuyan Li

Main category: cs.CV

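TL;DR: EventVAD combines dynamic graph architectures with multimodal large language models and uses event-aware reasoning for video anomaly detection, requiring no training data and outperforming existing methods.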

Details

Motivation: Existing supervised methods depend on large amounts of training data and generalize poorly, while training-free methods struggle to localize fine-grained visual transitions and diverse events. Method: EventVAD captures event features through dynamic spatiotemporal graph modeling, detects event boundaries from unsupervised statistical features, and guides MLLM reasoning with a hierarchical prompting strategy. Result: On the UCF-Crime and XD-Violence datasets, EventVAD with a 7B MLLM achieves SOTA performance in the training-free setting. Conclusion: Through event awareness and multimodal reasoning, EventVAD substantially improves the accuracy and efficiency of video anomaly detection. Abstract: Video Anomaly Detection (VAD) focuses on identifying anomalies within videos. Supervised methods require a large amount of in-domain training data and often struggle to generalize to unseen anomalies. In contrast, training-free methods leverage the intrinsic world knowledge of large language models (LLMs) to detect anomalies but face challenges in localizing fine-grained visual transitions and diverse events. Therefore, we propose EventVAD, an event-aware video anomaly detection framework that combines tailored dynamic graph architectures and multimodal LLMs through temporal-event reasoning. Specifically, EventVAD first employs dynamic spatiotemporal graph modeling with time-decay constraints to capture event-aware video features. Then, it performs adaptive noise filtering and uses signal ratio thresholding to detect event boundaries via unsupervised statistical features. The statistical boundary detection module reduces the complexity of processing long videos for MLLMs and improves their temporal reasoning through event consistency. Finally, it utilizes a hierarchical prompting strategy to guide MLLMs in performing reasoning before determining final decisions. We conducted extensive experiments on the UCF-Crime and XD-Violence datasets. The results demonstrate that EventVAD with a 7B MLLM achieves state-of-the-art (SOTA) in training-free settings, outperforming strong baselines that use 7B or larger MLLMs.

[73] RF-DETR Object Detection vs YOLOv12 : A Study of Transformer-based and CNN-based Architectures for Single-Class and Multi-Class Greenfruit Detection in Complex Orchard Environments Under Label Ambiguity

Ranjan Sapkota,Rahul Harsha Cheppally,Ajay Sharda,Manoj Karkee

Main category: cs.CV

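TL;DR: Compares RF-DETR and YOLOv12 for detecting green fruit in complex orchard environments; RF-DETR is stronger at global context modeling, while YOLOv12 is stronger in computational efficiency and local feature extraction.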

Details

Motivation: Address the challenges of green fruit detection in complex orchard environments, such as label ambiguity, occlusion, and background interference. Method: Evaluate RF-DETR (DINOv2 backbone with deformable attention) and YOLOv12 (CNN-based attention) on a custom dataset. Result: RF-DETR achieves the highest mAP50 in single-class detection (0.9464), while YOLOv12 achieves the best mAP@50:95 (0.7620). RF-DETR leads in multi-class detection (mAP@50 of 0.8298). Conclusion: RF-DETR is suited to precision agriculture, and YOLOv12 is suited to fast-response scenarios. Abstract: This study conducts a detailed comparison of the RF-DETR object detection base model and YOLOv12 object detection model configurations for detecting greenfruits in a complex orchard environment marked by label ambiguity, occlusions, and background blending. A custom dataset was developed featuring both single-class (greenfruit) and multi-class (occluded and non-occluded greenfruits) annotations to assess model performance under dynamic real-world conditions. The RF-DETR object detection model, utilizing a DINOv2 backbone and deformable attention, excelled in global context modeling, effectively identifying partially occluded or ambiguous greenfruits. In contrast, YOLOv12 leveraged CNN-based attention for enhanced local feature extraction, optimizing it for computational efficiency and edge deployment. RF-DETR achieved the highest mean Average Precision (mAP50) of 0.9464 in single-class detection, proving its superior ability to localize greenfruits in cluttered scenes. Although YOLOv12N recorded the highest mAP@50:95 of 0.7620, RF-DETR consistently outperformed in complex spatial scenarios. For multi-class detection, RF-DETR led with an mAP@50 of 0.8298, showing its capability to differentiate between occluded and non-occluded fruits, while YOLOv12L scored highest in mAP@50:95 with 0.6622, indicating better classification in detailed occlusion contexts. Training dynamics analysis highlighted RF-DETR's swift convergence, particularly in single-class settings where it plateaued within 10 epochs, demonstrating the efficiency of transformer-based architectures in adapting to dynamic visual data. These findings validate RF-DETR's effectiveness for precision agricultural applications, with YOLOv12 suited for fast-response scenarios. Index Terms: RF-DETR object detection, YOLOv12, YOLOv13, YOLOv14, YOLOv15, YOLOE, YOLO World, YOLO, You Only Look Once, Roboflow, Detection Transformers, CNNs

[74] UniEdit-Flow: Unleashing Inversion and Editing in the Era of Flow Models

Guanlong Jiao,Biqing Huang,Kuan-Chieh Wang,Renjie Liao

Main category: cs.CV

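TL;DR: Proposes a predictor-corrector framework for inversion and editing in flow matching models, comprising the Uni-Inv inversion method and the Uni-Edit editing method; it is efficient, general, and tuning-free.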

Details

Motivation: Existing inversion and editing methods designed for diffusion models are ineffective or inapplicable for flow matching models, so new methods are needed. Method: Propose the Uni-Inv inversion method and the Uni-Edit editing method, built on a predictor-corrector framework and supporting region-aware, robust image editing. Result: Experiments show that Uni-Inv and Uni-Edit perform well and generalize across various generative models, even under low-cost settings. Conclusion: The framework provides an efficient, general solution for inversion and editing in flow matching models. Abstract: Flow matching models have emerged as a strong alternative to diffusion models, but existing inversion and editing methods designed for diffusion are often ineffective or inapplicable to them. The straight-line, non-crossing trajectories of flow models pose challenges for diffusion-based approaches but also open avenues for novel solutions. In this paper, we introduce a predictor-corrector-based framework for inversion and editing in flow models. First, we propose Uni-Inv, an effective inversion method designed for accurate reconstruction. Building on this, we extend the concept of delayed injection to flow models and introduce Uni-Edit, a region-aware, robust image editing approach. Our methodology is tuning-free, model-agnostic, efficient, and effective, enabling diverse edits while ensuring strong preservation of edit-irrelevant regions. Extensive experiments across various generative models demonstrate the superiority and generalizability of Uni-Inv and Uni-Edit, even under low-cost settings. Project page: https://uniedit-flow.github.io/

[75] Probing and Inducing Combinational Creativity in Vision-Language Models

Yongqian Peng,Yuxi Ma,Mengmeng Wang,Yuxuan Wang,Yizhou Wang,Chi Zhang,Yixin Zhu,Zilong Zheng

Main category: cs.CV

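TL;DR: Investigates whether vision-language models (VLMs) exhibit combinational creativity, proposes the IEI framework to evaluate it, and validates the framework experimentally.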

Details

Motivation: Study whether VLMs exhibit combinational creativity, that is, the ability to generate new ideas by combining existing concepts, rather than merely pattern-matching their training data. Method: Propose the IEI (Identification-Explanation-Implication) framework, which decomposes the creative process into three levels, and build the CreativeMashup dataset for validation. Result: In comprehension tasks, the best VLMs surpass average humans but fall short of experts; in generation tasks, the IEI framework significantly improves the creative quality of VLM outputs. Conclusion: Provides a theoretical foundation for evaluating artificial creativity and practical guidelines for improving creative generation in VLMs. Abstract: The ability to combine existing concepts into novel ideas stands as a fundamental hallmark of human intelligence. Recent advances in Vision-Language Models (VLMs) like GPT-4V and DALLE-3 have sparked debate about whether their outputs reflect combinational creativity--defined by M. A. Boden (1998) as synthesizing novel ideas through combining existing concepts--or sophisticated pattern matching of training data. Drawing inspiration from cognitive science, we investigate the combinational creativity of VLMs from the lens of concept blending. We propose the Identification-Explanation-Implication (IEI) framework, which decomposes creative processes into three levels: identifying input spaces, extracting shared attributes, and deriving novel semantic implications. To validate this framework, we curate CreativeMashup, a high-quality dataset of 666 artist-generated visual mashups annotated according to the IEI framework. Through extensive experiments, we demonstrate that in comprehension tasks, the best VLMs have surpassed average human performance while falling short of expert-level understanding; in generation tasks, incorporating our IEI framework into the generation pipeline significantly enhances the creative quality of VLM outputs. Our findings establish both a theoretical foundation for evaluating artificial creativity and practical guidelines for improving creative generation in VLMs.

[76] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models

Haojian Huang,Haodong Chen,Shengqiong Wu,Meng Luo,Jinlan Fu,Xinya Du,Hanwang Zhang,Hao Fei

Main category: cs.CV

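TL;DR: VistaDPO is a novel framework that uses hierarchical spatial-temporal Direct Preference Optimization (DPO) to address misalignment with human intuition and video hallucination in Large Video Models (LVMs).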

Details

Motivation: Existing Large Video Models (LVMs) suffer from misalignment with human intuition and from video hallucination in video understanding, and need improvement. Method: VistaDPO optimizes text-video preference alignment at three levels (instance, temporal, and perceptive) and constructs the VistaDPO-7k dataset to support fine-grained alignment. Result: Experiments show that VistaDPO significantly improves LVM performance on video hallucination, video QA, and captioning tasks. Conclusion: VistaDPO effectively mitigates video-language misalignment and hallucination; the code and data are open-sourced. Abstract: Large Video Models (LVMs) built upon Large Language Models (LLMs) have shown promise in video understanding but often suffer from misalignment with human intuition and video hallucination issues. To address these challenges, we introduce VistaDPO, a novel framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization. VistaDPO enhances text-video preference alignment across three hierarchical levels: i) Instance Level, aligning overall video content with responses; ii) Temporal Level, aligning video temporal semantics with event descriptions; and iii) Perceptive Level, aligning spatial objects with language tokens. Given the lack of datasets for fine-grained video-language preference alignment, we construct VistaDPO-7k, a dataset of 7.2K QA pairs annotated with chosen and rejected responses, along with spatial-temporal grounding information such as timestamps, keyframes, and bounding boxes. Extensive experiments on benchmarks such as Video Hallucination, Video QA, and Captioning performance tasks demonstrate that VistaDPO significantly improves the performance of existing LVMs, effectively mitigating video-language misalignment and hallucination. The code and data are available at https://github.com/HaroldChen19/VistaDPO.
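For readers unfamiliar with the objective this entry builds on, the sketch below computes the standard DPO loss for a batch of chosen/rejected pairs. The abstract does not say how VistaDPO combines its three levels, so `hierarchical_dpo_loss` and its weights are purely hypothetical: a weighted sum is one plausible choice, not the paper's formulation.

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss, averaged over a batch of preference pairs.

    Inputs are sequence log-probabilities under the policy (logp_*) and
    under a frozen reference model (ref_*). beta scales the implicit reward.
    """
    logp_chosen, logp_rejected, ref_chosen, ref_rejected = (
        np.asarray(a, dtype=float)
        for a in (logp_chosen, logp_rejected, ref_chosen, ref_rejected)
    )
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    return float(np.mean(np.logaddexp(0.0, -margin)))

def hierarchical_dpo_loss(level_losses, weights):
    """Hypothetical combination: weighted sum of per-level DPO losses."""
    return float(sum(weights[name] * level_losses[name] for name in level_losses))
```

When the policy and the reference agree, the margin is zero and the loss equals log 2; it falls below that value as the policy raises the chosen response relative to the rejected one.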

[77] Low-hallucination Synthetic Captions for Large-Scale Vision-Language Model Pre-training

Xinsong Zhang,Yarong Zeng,Xinting Huang,Hu Hu,Runquan Xie,Han Hu,Zhanhui Kang

Main category: cs.CV

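TL;DR: Proposes a scalable synthetic caption generation technique for vision-language model pre-training and shows that large-scale, low-hallucination synthetic captions can replace real-world data and improve performance.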

Details

Motivation: High-quality image-text pairs are scarce and saturated, which limits further progress in multimodal large language models. Method: Propose a pipeline for generating high-quality, low-hallucination, knowledge-rich synthetic captions, and use a continuous DPO method to reduce hallucinations. Result: The synthetic captions significantly improve model performance, with a gain of at least 6.2% across 35 tasks, and reduce FID scores in the text-to-image domain. Conclusion: Synthetic captions are an effective alternative for pre-training; the Hunyuan-Recap100M dataset will be released. Abstract: In recent years, the field of vision-language model pre-training has experienced rapid advancements, driven primarily by the continuous enhancement of textual capabilities in large language models. However, existing training paradigms for multimodal large language models heavily rely on high-quality image-text pairs. As models and data scales grow exponentially, the availability of such meticulously curated data has become increasingly scarce and saturated, thereby severely limiting further advancements in this domain. This study investigates scalable caption generation techniques for vision-language model pre-training and demonstrates that large-scale low-hallucination synthetic captions can serve dual purposes: 1) acting as a viable alternative to real-world data for pre-training paradigms and 2) achieving superior performance enhancement when integrated into vision-language models through empirical validation. This paper presents three key contributions: 1) a novel pipeline for generating high-quality, low-hallucination, and knowledge-rich synthetic captions. Our continuous DPO methodology yields remarkable results in reducing hallucinations. Specifically, the non-hallucination caption rate on a held-out test set increases from 48.2% to 77.9% for a 7B-size model. 2) Comprehensive empirical validation reveals that our synthetic captions confer superior pre-training advantages over their counterparts. Across 35 vision language tasks, the model trained with our data achieves a significant performance gain of at least 6.2% compared to alt-text pairs and other previous work. Meanwhile, it also offers considerable support in the text-to-image domain. With our dataset, the FID score is reduced by 17.1 on a real-world validation benchmark and 13.3 on the MSCOCO validation benchmark. 3) We will release Hunyuan-Recap100M, a low-hallucination and knowledge-intensive synthetic caption dataset.

[78] Science-T2I: Addressing Scientific Illusions in Image Synthesis

Jialuo Li,Wenhao Chai,Xingyu Fu,Haiyang Xu,Saining Xie

Main category: cs.CV

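TL;DR: Proposes a new approach to integrating scientific knowledge into generative models, using the Science-T2I dataset and the SciScore reward model to improve the realism and consistency of image synthesis.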

Details

Motivation: Improve the realism and consistency of generative models in scientific domains. Method: Introduce the Science-T2I dataset and the SciScore reward model, and propose a two-stage training framework (supervised fine-tuning followed by masked online fine-tuning). Result: SciScore performs close to human level, with a 5% improvement; applying the method to FLUX improves performance by more than 50%. Conclusion: The approach sets a new standard for evaluating the scientific realism of generated content. Abstract: We present a novel approach to integrating scientific knowledge into generative models, enhancing their realism and consistency in image synthesis. First, we introduce Science-T2I, an expert-annotated adversarial dataset comprising 20k adversarial image pairs with 9k prompts, covering a wide range of distinct scientific knowledge categories. Leveraging Science-T2I, we present SciScore, an end-to-end reward model that refines the assessment of generated images based on scientific knowledge, which is achieved by augmenting both the scientific comprehension and visual capabilities of a pre-trained CLIP model. Additionally, based on SciScore, we propose a two-stage training framework, comprising a supervised fine-tuning phase and a masked online fine-tuning phase, to incorporate scientific knowledge into existing generative models. Through comprehensive experiments, we demonstrate the effectiveness of our framework in establishing new standards for evaluating the scientific realism of generated content. Specifically, SciScore attains performance comparable to human-level, demonstrating a 5% improvement similar to evaluations conducted by experienced human evaluators. Furthermore, by applying our proposed fine-tuning method to FLUX, we achieve a performance enhancement exceeding 50% on SciScore.

[79] PCBEAR: Pose Concept Bottleneck for Explainable Action Recognition

Jongseo Lee,Wooil Lee,Gyeong-Moon Park,Seong Tae Kim,Jinwoo Choi

Main category: cs.CV

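TL;DR: PCBEAR is a concept bottleneck framework based on human pose sequences for explainable action recognition; it captures motion dynamics through static and dynamic pose concepts, achieving both high performance and interpretability.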

Details

Motivation: Existing video XAI methods struggle to capture motion dynamics and temporal dependencies; PCBEAR aims to provide more transparent and explainable action recognition through human pose sequences. Method: PCBEAR uses human skeleton poses as motion-aware concepts and automatically discovers static and dynamic pose concepts through clustering, without manual annotation. Result: Validated on the KTH, Penn-Action, and HAA500 datasets, PCBEAR achieves high classification performance and provides interpretable, motion-driven explanations. Conclusion: PCBEAR combines high performance with interpretability and supports test-time interventions for debugging and improving model behavior. Abstract: Human action recognition (HAR) has achieved impressive results with deep learning models, but their decision-making process remains opaque due to their black-box nature. Ensuring interpretability is crucial, especially for real-world applications requiring transparency and accountability. Existing video XAI methods primarily rely on feature attribution or static textual concepts, both of which struggle to capture motion dynamics and temporal dependencies essential for action understanding. To address these challenges, we propose Pose Concept Bottleneck for Explainable Action Recognition (PCBEAR), a novel concept bottleneck framework that introduces human pose sequences as motion-aware, structured concepts for video action recognition. Unlike methods based on pixel-level features or static textual descriptions, PCBEAR leverages human skeleton poses, which focus solely on body movements, providing robust and interpretable explanations of motion dynamics. We define two types of pose-based concepts: static pose concepts for spatial configurations at individual frames, and dynamic pose concepts for motion patterns across multiple frames. To construct these concepts, PCBEAR applies clustering to video pose sequences, allowing for automatic discovery of meaningful concepts without manual annotation. We validate PCBEAR on KTH, Penn-Action, and HAA500, showing that it achieves high classification performance while offering interpretable, motion-driven explanations. Our method provides both strong predictive performance and human-understandable insights into the model's reasoning process, enabling test-time interventions for debugging and improving model behavior.
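The idea of discovering static and dynamic pose concepts by clustering can be sketched roughly as follows. The abstract does not name the clustering algorithm or the features, so everything here is an assumption: plain k-means, flattened 2D joint coordinates for static concepts, and fixed-length sliding windows of poses for dynamic concepts. The names `kmeans` and `pose_concepts` are hypothetical.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain k-means on rows of x; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]  # fancy index -> copy
    for _ in range(iters):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return dist.argmin(axis=1), centers

def pose_concepts(poses, k_static, k_dynamic, window=5):
    """poses: array (T, J, 2) of 2D joint positions for one video.

    Static concepts cluster single-frame poses; dynamic concepts cluster
    windows of `window` consecutive frames (assumed design).
    """
    t = poses.shape[0]
    static_feats = poses.reshape(t, -1)
    dynamic_feats = np.stack(
        [poses[i:i + window].reshape(-1) for i in range(t - window + 1)]
    )
    static_labels, _ = kmeans(static_feats, k_static)
    dynamic_labels, _ = kmeans(dynamic_feats, k_dynamic)
    return static_labels, dynamic_labels
```

In a concept bottleneck model, the resulting cluster assignments (or soft scores against the cluster centers) would feed a simple classifier, so that each prediction can be traced back to the pose concepts that drove it.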

[80] Complex-Edit: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark

Siwei Yang,Mude Hui,Bingchen Zhao,Yuyin Zhou,Nataniel Ruiz,Cihang Xie

Main category: cs.CV

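TL;DR: Introduces the Complex-Edit benchmark for evaluating instruction-based image editing models across instructions of varying complexity, revealing a performance gap between open-source and closed-source models and a negative effect of synthetic training data.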

Details

Motivation: Systematically evaluate image editing models across instructions of varying complexity and fill gaps left by existing benchmarks. Method: Use GPT-4o to automatically generate diverse editing instructions, build complex instructions through a "Chain-of-Edit" pipeline, and design a suite of metrics together with a VLM-based auto-evaluation pipeline. Result: Open-source models perform significantly worse than closed-source models; complex instructions impair the models' ability to retain key elements and aesthetic quality; decomposing complex instructions further degrades performance; a Best-of-N strategy improves results; and synthetic training data makes edited images look more synthetic. Conclusion: Complex-Edit provides a comprehensive evaluation tool for image editing models and exposes both the performance gap and the negative effect of synthetic data, offering a useful reference for future research. Abstract: We introduce Complex-Edit, a comprehensive benchmark designed to systematically evaluate instruction-based image editing models across instructions of varying complexity. To develop this benchmark, we harness GPT-4o to automatically collect a diverse set of editing instructions at scale. Our approach follows a well-structured "Chain-of-Edit" pipeline: we first generate individual atomic editing tasks independently and then integrate them to form cohesive, complex instructions. Additionally, we introduce a suite of metrics to assess various aspects of editing performance, along with a VLM-based auto-evaluation pipeline that supports large-scale assessments. Our benchmark yields several notable insights: 1) Open-source models significantly underperform relative to proprietary, closed-source models, with the performance gap widening as instruction complexity increases; 2) Increased instructional complexity primarily impairs the models' ability to retain key elements from the input images and to preserve the overall aesthetic quality; 3) Decomposing a complex instruction into a sequence of atomic steps, executed in a step-by-step manner, substantially degrades performance across multiple metrics; 4) A straightforward Best-of-N selection strategy improves results for both direct editing and the step-by-step sequential approach; and 5) We observe a "curse of synthetic data": when synthetic data is involved in model training, the edited images from such models tend to appear increasingly synthetic as the complexity of the editing instructions rises -- a phenomenon that intriguingly also manifests in the latest GPT-4o outputs.
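The Best-of-N selection strategy mentioned in insight 4 amounts to generating several candidate edits and keeping the one a scorer rates highest. The sketch below shows the pattern in generic form; `generate` and `score` are hypothetical stand-ins for an editing model and an evaluator such as the benchmark's VLM-based metric, neither of which is specified here.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def best_of_n(generate: Callable[[int], T], score: Callable[[T], float], n: int = 4) -> T:
    """Generate n candidates and return the one with the highest score.

    generate(i) produces the i-th candidate (for example, an edit sampled
    with seed i); score(candidate) returns a quality estimate.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(i) for i in range(n)]
    return max(candidates, key=score)
```

The cost grows linearly with n, since every candidate requires a full generation plus one scoring call.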

[81] St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World

Haiwen Feng,Junyi Zhang,Qianqian Wang,Yufei Ye,Pengcheng Yu,Michael J. Black,Trevor Darrell,Angjoo Kanazawa

Main category: cs.CV

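TL;DR: St4RTrack is a framework that performs dynamic 3D reconstruction and point tracking simultaneously by predicting pointmaps and adapting with a reprojection loss, without relying on 4D ground-truth supervision.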

Details

Motivation: Dynamic 3D reconstruction and point tracking are usually handled separately despite their deep connection; this work aims to unify the two tasks. Method: Predict pointmaps for a pair of frames that capture both static and dynamic scene geometry, and adapt the model with a reprojection loss. Result: The framework's effectiveness and efficiency are demonstrated on a newly established benchmark. Conclusion: St4RTrack provides a unified, data-driven framework that significantly improves dynamic 3D reconstruction and tracking. Abstract: Dynamic 3D reconstruction and point tracking in videos are typically treated as separate tasks, despite their deep connection. We propose St4RTrack, a feed-forward framework that simultaneously reconstructs and tracks dynamic video content in a world coordinate frame from RGB inputs. This is achieved by predicting two appropriately defined pointmaps for a pair of frames captured at different moments. Specifically, we predict both pointmaps at the same moment, in the same world, capturing both static and dynamic scene geometry while maintaining 3D correspondences. Chaining these predictions through the video sequence with respect to a reference frame naturally computes long-range correspondences, effectively combining 3D reconstruction with 3D tracking. Unlike prior methods that rely heavily on 4D ground truth supervision, we employ a novel adaptation scheme based on a reprojection loss. We establish a new extensive benchmark for world-frame reconstruction and tracking, demonstrating the effectiveness and efficiency of our unified, data-driven framework. Our code, model, and benchmark will be released.

[82] Training-Free Hierarchical Scene Understanding for Gaussian Splatting with Superpoint Graphs

Shaohui Dai,Yansong Qu,Zheyan Li,Xinyang Li,Shengchuan Zhang,Liujuan Cao

Main category: cs.CV

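TL;DR: Proposes a training-free method that builds a superpoint graph directly from Gaussian primitives, enabling efficient and consistent 3D semantic understanding.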

Details

Motivation: Existing methods rely on iterative multi-view optimization of 2D semantic features, which is inefficient and yields inconsistent semantics, so improvement is needed. Method: Build a superpoint graph that partitions the scene into semantically coherent regions, and design a reprojection strategy that lifts 2D semantic features onto the superpoints, avoiding multi-view training. Result: The method achieves the best performance on open-vocabulary segmentation, with semantic field reconstruction over 30× faster. Conclusion: The method is efficient and semantically consistent, and supports multi-granularity open-vocabulary understanding. Abstract: Bridging natural language and 3D geometry is a crucial step toward flexible, language-driven scene understanding. While recent advances in 3D Gaussian Splatting (3DGS) have enabled fast and high-quality scene reconstruction, research has also explored incorporating open-vocabulary understanding into 3DGS. However, most existing methods require iterative optimization over per-view 2D semantic feature maps, which not only results in inefficiencies but also leads to inconsistent 3D semantics across views. To address these limitations, we introduce a training-free framework that constructs a superpoint graph directly from Gaussian primitives. The superpoint graph partitions the scene into spatially compact and semantically coherent regions, forming view-consistent 3D entities and providing a structured foundation for open-vocabulary understanding. Based on the graph structure, we design an efficient reprojection strategy that lifts 2D semantic features onto the superpoints, avoiding costly multi-view iterative training. The resulting representation ensures strong 3D semantic coherence and naturally supports hierarchical understanding, enabling both coarse- and fine-grained open-vocabulary perception within a unified semantic field. Extensive experiments demonstrate that our method achieves state-of-the-art open-vocabulary segmentation performance, with semantic field reconstruction completed over 30× faster. Our code will be available at https://github.com/Atrovast/THGS.

[83] AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis

Khiem Vuong,Anurag Ghosh,Deva Ramanan,Srinivasa Narasimhan,Shubham Tulsiani

Main category: cs.CV

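TL;DR: Proposes a scalable framework that combines pseudo-synthetic renderings with real ground-level images to handle the extreme viewpoint changes in geometric reconstruction from ground and aerial images, significantly improving algorithm performance.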

Details

Motivation: Existing learning-based methods struggle with the extreme viewpoint variation between ground and aerial images, mainly because high-quality, co-registered datasets are lacking. Method: Combine pseudo-synthetic renderings from 3D city-wide meshes (e.g., Google Earth) with real ground-level images (e.g., MegaDepth) to build a hybrid dataset for fine-tuning algorithms. Result: Performance improves significantly on real-world zero-shot tasks; for example, the share of pairs localized within 5 degrees of camera rotation error rises from under 5% to nearly 56%. Conclusion: The approach improves camera estimation and scene reconstruction and also shows practical value in downstream tasks such as novel-view synthesis. Abstract: We explore the task of geometric reconstruction of images captured from a mixture of ground and aerial views. Current state-of-the-art learning-based approaches fail to handle the extreme viewpoint variation between aerial-ground image pairs. Our hypothesis is that the lack of high-quality, co-registered aerial-ground datasets for training is a key reason for this failure. Such data is difficult to assemble precisely because it is difficult to reconstruct in a scalable way. To overcome this challenge, we propose a scalable framework combining pseudo-synthetic renderings from 3D city-wide meshes (e.g., Google Earth) with real, ground-level crowd-sourced images (e.g., MegaDepth). The pseudo-synthetic data simulates a wide range of aerial viewpoints, while the real, crowd-sourced images help improve visual fidelity for ground-level images where mesh-based renderings lack sufficient detail, effectively bridging the domain gap between real images and pseudo-synthetic renderings. Using this hybrid dataset, we fine-tune several state-of-the-art algorithms and achieve significant improvements on real-world, zero-shot aerial-ground tasks. For example, we observe that baseline DUSt3R localizes fewer than 5% of aerial-ground pairs within 5 degrees of camera rotation error, while fine-tuning with our data raises accuracy to nearly 56%, addressing a major failure point in handling large viewpoint changes. Beyond camera estimation and scene reconstruction, our dataset also improves performance on downstream tasks like novel-view synthesis in challenging aerial-ground scenarios, demonstrating the practical value of our approach in real-world applications.
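The "within 5 degrees of camera rotation error" figure is an accuracy-at-threshold metric over rotation errors. A common way to measure such an error is the geodesic angle between the predicted and ground-truth rotation matrices, sketched below. Whether the paper uses exactly this definition is an assumption, and the helper names are hypothetical.

```python
import numpy as np

def rotation_error_deg(r_pred, r_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    cos_theta = (np.trace(r_pred.T @ r_gt) - 1.0) / 2.0
    # Clip guards against values slightly outside [-1, 1] from round-off.
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))

def accuracy_within(errors_deg, threshold_deg=5.0):
    """Fraction of pairs whose rotation error is at most the threshold."""
    return float(np.mean(np.asarray(errors_deg) <= threshold_deg))

def rot_z(angle_deg):
    """Rotation about the z-axis, used here only to build a toy example."""
    a = np.radians(angle_deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0,        0.0,       1.0]])
```

For instance, predictions that are off by 3 and 10 degrees about the z-axis give errors of roughly 3 and 10, so one of the two pairs counts as localized at the 5-degree threshold.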

[84] Digital Twin Generation from Visual Data: A Survey

Andrew Melnik,Benjamin Alt,Giang Nguyen,Artur Wilkowski,Maciej Stefańczyk,Qirui Wu,Sinan Harms,Helge Rhodin,Manolis Savva,Michael Beetz

Main category: cs.CV

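TL;DR: Surveys recent progress in generating digital twins from videos, analyzes the strengths and weaknesses of several approaches, and discusses challenges and future research directions.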

Details

Motivation: Digital twins can be used for robotics applications, media content creation, and design and construction work, but existing methods have limitations and call for a systematic summary. Method: Analyze approaches including 3D Gaussian Splatting, generative in-painting, semantic segmentation, and foundation models. Result: Summarizes the advantages and limitations of each approach and identifies challenges such as occlusions, lighting variations, and scalability. Conclusion: The survey provides a comprehensive overview of the digital twin field and outlines future research directions. Abstract: This survey explores recent developments in generating digital twins from videos. Such digital twins can be used for robotics applications, media content creation, or design and construction works. We analyze various approaches, including 3D Gaussian Splatting, generative in-painting, semantic segmentation, and foundation models, highlighting their advantages and limitations. Additionally, we discuss challenges such as occlusions, lighting variations, and scalability, as well as potential future research directions. This survey aims to provide a comprehensive overview of state-of-the-art methodologies and their implications for real-world applications. Awesome list: https://github.com/ndrwmlnk/awesome-digital-twins

[85] Personalized Text-to-Image Generation with Auto-Regressive Models

Kaiyue Sun,Xian Liu,Yao Teng,Xihui Liu

Main category: cs.CV

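TL;DR: Explores the potential of auto-regressive models for personalized image generation, proposes a two-stage training strategy, and shows experimentally that it performs on par with diffusion models.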

Details

Motivation: Personalized image synthesis is an important application of text-to-image generation, but auto-regressive models remain underexplored for it; this paper explores their potential. Method: Propose a two-stage training strategy that combines optimization of text embeddings with fine-tuning of transformer layers. Result: Experiments show that the method is comparable to leading diffusion-based methods in subject fidelity and prompt following. Conclusion: Auto-regressive models perform well in personalized image generation, offering a new direction for future research. Abstract: Personalized image synthesis has emerged as a pivotal application in text-to-image generation, enabling the creation of images featuring specific subjects in diverse contexts. While diffusion models have dominated this domain, auto-regressive models, with their unified architecture for text and image modeling, remain underexplored for personalized image generation. This paper investigates the potential of optimizing auto-regressive models for personalized image synthesis, leveraging their inherent multimodal capabilities to perform this task. We propose a two-stage training strategy that combines optimization of text embeddings and fine-tuning of transformer layers. Our experiments on the auto-regressive model demonstrate that this method achieves comparable subject fidelity and prompt following to the leading diffusion-based personalization methods. The results highlight the effectiveness of auto-regressive models in personalized image generation, offering a new direction for future research in this area.

[86] ODHSR: Online Dense 3D Reconstruction of Humans and Scenes from Monocular Videos

Zetong Zhang,Manuel Kaufmann,Lixin Xue,Jie Song,Martin R. Oswald

Main category: cs.CV

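TL;DR: Proposes a unified framework for online photorealistic reconstruction of scenes and humans from monocular video, using 3D Gaussian Splatting to perform camera tracking, human pose estimation, and scene reconstruction simultaneously.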

Details

Motivation: Existing methods require pre-calibrated camera and human poses and long training times; the goal is efficient, online reconstruction. Method: Use 3D Gaussian Splatting to learn Gaussian primitives, design reconstruction-based camera tracking and human pose estimation modules, and improve reconstruction quality with a human deformation module and occlusion-aware rendering. Result: On the EMDB and NeuMan datasets, performance in camera tracking, human pose estimation, and novel view synthesis is superior to or on par with existing methods. Conclusion: The framework achieves efficient, online photorealistic reconstruction of scenes and humans and has broad application potential. Abstract: Creating a photorealistic scene and human reconstruction from a single monocular in-the-wild video figures prominently in the perception of a human-centric 3D world. Recent neural rendering advances have enabled holistic human-scene reconstruction but require pre-calibrated camera and human poses, and days of training time. In this work, we introduce a novel unified framework that simultaneously performs camera tracking, human pose estimation and human-scene reconstruction in an online fashion. 3D Gaussian Splatting is utilized to learn Gaussian primitives for humans and scenes efficiently, and reconstruction-based camera tracking and human pose estimation modules are designed to enable holistic understanding and effective disentanglement of pose and appearance. Specifically, we design a human deformation module to faithfully reconstruct details and enhance generalizability to out-of-distribution poses. Aiming to learn the spatial correlation between human and scene accurately, we introduce occlusion-aware human silhouette rendering and monocular geometric priors, which further improve reconstruction quality. Experiments on the EMDB and NeuMan datasets demonstrate superior or on-par performance with existing methods in camera tracking, human pose estimation, novel view synthesis and runtime. Our project page is at https://eth-ait.github.io/ODHSR.

[87] Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling

Tsung-Han Wu,Heekyung Lee,Jiaxin Ge,Joseph E. Gonzalez,Trevor Darrell,David M. Chan

Main category: cs.CV

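TL;DR: REVERSE is a unified framework that substantially reduces hallucination in vision-language models through hallucination-aware training and on-the-fly self-verification.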

Details

Motivation: Vision-language models excel at visual understanding but suffer from visual hallucinations, generating descriptions of nonexistent objects or actions, which poses risks in safety-critical applications. Existing methods have limitations: generation adjustment relies on heuristics, while post-hoc verification is complicated and tends to reject outputs rather than correct them. Method: REVERSE combines hallucination-aware training with on-the-fly self-verification, using a dataset of over 1.3M semi-synthetic samples and an inference-time retrospective resampling technique so that the model can detect and dynamically revise hallucinations during generation. Result: REVERSE outperforms the best existing methods by up to 12% on CHAIR-MSCOCO and 28% on HaloQuest, achieving state-of-the-art hallucination reduction. Conclusion: Through a unified framework and new techniques, REVERSE substantially improves hallucination detection and correction in vision-language models, offering a more reliable option for safety-critical applications. Abstract: Vision-Language Models (VLMs) excel at visual understanding but often suffer from visual hallucinations, where they generate descriptions of nonexistent objects, actions, or concepts, posing significant risks in safety-critical applications. Existing hallucination mitigation methods typically follow one of two paradigms: generation adjustment, which modifies decoding behavior to align text with visual inputs, and post-hoc verification, where external models assess and correct outputs. While effective, generation adjustment methods often rely on heuristics and lack correction mechanisms, while post-hoc verification is complicated, typically requiring multiple models and tending to reject outputs rather than refine them. In this work, we introduce REVERSE, a unified framework that integrates hallucination-aware training with on-the-fly self-verification. By leveraging a new hallucination-verification dataset containing over 1.3M semi-synthetic samples, along with a novel inference-time retrospective resampling technique, our approach enables VLMs to both detect hallucinations during generation and dynamically revise those hallucinations. Our evaluations show that REVERSE achieves state-of-the-art hallucination reduction, outperforming the best existing methods by up to 12% on CHAIR-MSCOCO and 28% on HaloQuest. Our dataset, model, and code are available at: https://reverse-vlm.github.io.

[88] IMAGGarment-1: Fine-Grained Garment Generation for Controllable Fashion Design

Fei Shen,Jian Yu,Cong Wang,Xin Jiang,Xiaoyu Du,Jinhui Tang

Main category: cs.CV

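TL;DR: IMAGGarment-1 is a fine-grained garment generation framework that supports high-fidelity garment synthesis with precise control over silhouette, color, and logo placement.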

Details

Motivation: Addresses the limited multi-condition controllability of existing methods in personalized fashion design and digital apparel applications. Method: A two-stage training strategy models global appearance and local details separately, with unified, controllable generation through end-to-end inference. Result: Outperforms existing baselines in structural stability, color fidelity, and local controllability. Conclusion: IMAGGarment-1 offers an effective solution for multi-conditional garment generation, and the GarmentBench dataset is released to support research. Abstract: This paper presents IMAGGarment-1, a fine-grained garment generation (FGG) framework that enables high-fidelity garment synthesis with precise control over silhouette, color, and logo placement. Unlike existing methods that are limited to single-condition inputs, IMAGGarment-1 addresses the challenges of multi-conditional controllability in personalized fashion design and digital apparel applications. Specifically, IMAGGarment-1 employs a two-stage training strategy to separately model global appearance and local details, while enabling unified and controllable generation through end-to-end inference. In the first stage, we propose a global appearance model that jointly encodes silhouette and color using a mixed attention module and a color adapter. In the second stage, we present a local enhancement model with an adaptive appearance-aware module to inject user-defined logos and spatial constraints, enabling accurate placement and visual consistency. To support this task, we release GarmentBench, a large-scale dataset comprising over 180K garment samples paired with multi-level design conditions, including sketches, color references, logo placements, and textual prompts. Extensive experiments demonstrate that our method outperforms existing baselines, achieving superior structural stability, color fidelity, and local controllability performance. The code and model are available at https://github.com/muzishen/IMAGGarment-1.

[89] Single-Shot Shape and Reflectance with Spatial Polarization Multiplexing

Tomoki Ichikawa,Ryo Kawahara,Ko Nishino

Main category: cs.CV

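TL;DR: Proposes spatial polarization multiplexing (SPM), which reconstructs object shape and reflectance from a single polarimetric image, and applies it to dynamic surface recovery.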

Details

Motivation: Conventional single-pattern structured light enables single-shot shape reconstruction, but reflectance is hard to recover because incident light is not angularly sampled and the projected pattern is entangled with the surface color texture. Method: Designs a spatially multiplexed polarization pattern that is decoded for shape reconstruction by quantizing AoLP values, while differently polarized light within a local region separates specular and diffuse reflections for BRDF estimation. Result: Experiments confirm that the method recovers the shape, the Mueller matrix, and the BRDF from a single polarimetric image, and it is successfully applied to dynamic surfaces. Conclusion: SPM preserves the natural surface appearance while reconstructing shape and reflectance accurately, and it suits dynamic scenes. Abstract: We propose spatial polarization multiplexing (SPM) for reconstructing object shape and reflectance from a single polarimetric image and demonstrate its application to dynamic surface recovery. Although single-pattern structured light enables single-shot shape reconstruction, the reflectance is challenging to recover due to the lack of angular sampling of incident light and the entanglement of the projected pattern and the surface color texture. We design a spatially multiplexed pattern of polarization that can be robustly and uniquely decoded for shape reconstruction by quantizing the AoLP values. At the same time, our spatial-multiplexing enables single-shot ellipsometry of linear polarization by projecting differently polarized light within a local region, which separates the specular and diffuse reflections for BRDF estimation. We achieve this spatial polarization multiplexing with a constrained de Bruijn sequence. Unlike single-pattern structured light with intensity and color, our polarization pattern is invisible to the naked eye and retains the natural surface appearance which is essential for accurate appearance modeling and also interaction with people. We experimentally validate our method on real data. The results show that our method can recover the shape, the Mueller matrix, and the BRDF from a single-shot polarimetric image. We also demonstrate the application of our method to dynamic surfaces.

[90] PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

Jang Hyun Cho,Andrea Madotto,Effrosyni Mavroudi,Triantafyllos Afouras,Tushar Nagarajan,Muhammad Maaz,Yale Song,Tengyu Ma,Shuming Hu,Suyog Jain,Miguel Martin,Huiyu Wang,Hanoona Rasheed,Peize Sun,Po-Yao Huang,Daniel Bolya,Nikhila Ravi,Shashank Jain,Tammy Stark,Shane Moon,Babak Damavandi,Vivian Lee,Andrew Westbury,Salman Khan,Philipp Krähenbühl,Piotr Dollár,Lorenzo Torresani,Kristen Grauman,Christoph Feichtenhofer

Main category: cs.CV

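TL;DR: Proposes a fully open Perception Language Model (PLM) framework that avoids reliance on closed-source models, fills data gaps in video understanding with large-scale synthetic and human-labeled data, and releases the PLM-VideoBench evaluation suite.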

Details

Motivation: Most high-performing vision-language models are closed-source, which makes scientific progress hard to measure; the study aims to advance image and video understanding research through an open framework and transparent methods. Method: Analyzes standard training pipelines without distillation from proprietary models, explores large-scale synthetic data, and releases human-labeled fine-grained video question-answer pairs and spatio-temporally grounded video captions. Result: Releases 2.8M human-labeled video instances and introduces PLM-VideoBench, an evaluation suite focused on reasoning about the "what", "where", "when", and "how" of a video. Conclusion: By opening its data, training recipes, code, and models, the study provides a basis for transparent and reproducible video understanding research. Abstract: Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design and training recipe. The research community has responded by using distillation from black-box models to label training data, achieving strong benchmark results, at the cost of measurable scientific progress. However, without knowing the details of the teacher model and its data sources, scientific progress remains difficult to measure. In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding. We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Additionally, we introduce PLM-VideoBench, a suite for evaluating challenging video understanding tasks focusing on the ability to reason about "what", "where", "when", and "how" of a video. We make our work fully reproducible by providing data, training recipes, code & models.

[91] Perception Encoder: The best visual embeddings are not at the output of the network

Daniel Bolya,Po-Yao Huang,Peize Sun,Jang Hyun Cho,Andrea Madotto,Chen Wei,Tengyu Ma,Jiale Zhi,Jathushan Rajasegaran,Hanoona Rasheed,Junke Wang,Marco Monteiro,Hu Xu,Shiyu Dong,Nikhila Ravi,Daniel Li,Piotr Dollár,Christoph Feichtenhofer

Main category: cs.CV

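TL;DR: Perception Encoder (PE) is a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. With contrastive vision-language training alone, PE performs strongly on a wide range of downstream tasks, and two alignment methods are introduced to draw out the embeddings hidden in its intermediate layers.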

Details

Motivation: Traditional vision encoders rely on a variety of pretraining objectives, each designed for specific tasks. PE aims to produce general embeddings from contrastive vision-language training alone, simplifying model design and improving performance. Method: PE produces general embeddings through contrastive vision-language training and introduces language alignment and spatial alignment to draw out features hidden in intermediate layers. The model is built by scaling an image pretraining recipe and refining with a video data engine. Result: PE achieves state-of-the-art performance on zero-shot classification, retrieval, Q&A, and spatial tasks such as detection, depth estimation, and tracking. Conclusion: PE demonstrates the potential of contrastive vision-language training, simplifying model design while improving multi-task performance. The models, code, and a new dataset are being released to foster research. Abstract: We introduce Perception Encoder (PE), a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each tailored to specific downstream tasks such as classification, captioning, or localization. Surprisingly, after scaling our carefully tuned image pretraining recipe and refining with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network. To draw them out, we introduce two alignment methods, language alignment for multimodal language modeling, and spatial alignment for dense prediction. Together with the core contrastive checkpoint, our PE family of models achieves state-of-the-art performance on a wide variety of tasks, including zero-shot image and video classification and retrieval; document, image, and video Q&A; and spatial tasks such as detection, depth estimation, and tracking. To foster further research, we are releasing our models, code, and a novel dataset of synthetically and human-annotated videos.

cs.GR [Back]

[92] Prototype-Guided Diffusion for Digital Pathology: Achieving Foundation Model Performance with Minimal Clinical Data

Ekaterina Redekop,Mara Pleasure,Vedrana Ivezic,Zichen Wang,Kimberly Flores,Anthony Sisk,William Speier,Corey Arnold

Main category: cs.GR

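TL;DR: Proposes a prototype-guided diffusion model that generates high-quality synthetic pathology data, reducing reliance on real patient samples while preserving downstream task performance.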

Details

Motivation: Examines the correlation between dataset size and performance and tests whether generating synthetic data can reduce the need for large-scale real data. Method: Uses a prototype-guided diffusion model to generate synthetic pathology data for self-supervised learning, and also trains on a mix of synthetic and real data. Result: Features trained on the synthetic data perform comparably to those trained on large real-world datasets, and training on the hybrid data performs better still. Conclusion: Generative AI can efficiently create training data for pathology, significantly reducing reliance on clinical datasets. Abstract: Foundation models in digital pathology use massive datasets to learn useful compact feature representations of complex histology images. However, there is limited transparency into what drives the correlation between dataset size and performance, raising the question of whether simply adding more data to increase performance is always necessary. In this study, we propose a prototype-guided diffusion model to generate high-fidelity synthetic pathology data at scale, enabling large-scale self-supervised learning and reducing reliance on real patient samples while preserving downstream performance. Using guidance from histological prototypes during sampling, our approach ensures biologically and diagnostically meaningful variations in the generated data. We demonstrate that self-supervised features trained on our synthetic dataset achieve competitive performance despite using ~60x-760x less data than models trained on large real-world datasets. Notably, models trained using our synthetic data showed statistically comparable or better performance across multiple evaluation metrics and tasks, even when compared to models trained on orders of magnitude larger datasets. Our hybrid approach, combining synthetic and real data, further enhanced performance, achieving top results in several evaluations. These findings underscore the potential of generative AI to create compelling training data for digital pathology, significantly reducing the reliance on extensive clinical datasets and highlighting the efficiency of our approach.

[93] One Model to Rig Them All: Diverse Skeleton Rigging with UniRig

Jia-Peng Zhang,Cheng-Feng Pu,Meng-Hao Guo,Yan-Pei Cao,Shi-Min Hu

Main category: cs.GR

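TL;DR: UniRig is a unified framework for automatic skeletal rigging, built on large autoregressive models and a bone-point cross-attention mechanism, that substantially improves rigging and motion accuracy.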

Details

Motivation: With the rapid growth of 3D content creation, traditional rigging methods struggle with complex and non-standard topologies, so a more efficient, automated solution is needed. Method: UniRig uses a Skeleton Tree Tokenization method to encode the hierarchical relationships within a skeleton, combined with large autoregressive models and a bone-point cross-attention mechanism, to generate high-quality skeletons and skinning weights. Result: On challenging datasets, UniRig improves rigging accuracy by 215% and motion accuracy by 194%, significantly outperforming existing methods. Conclusion: By automating the rigging process, UniRig can speed up animation pipelines and applies to a wide range of complex models. Abstract: The rapid evolution of 3D content creation, encompassing both AI-powered methods and traditional workflows, is driving an unprecedented demand for automated rigging solutions that can keep pace with the increasing complexity and diversity of 3D models. We introduce UniRig, a novel, unified framework for automatic skeletal rigging that leverages the power of large autoregressive models and a bone-point cross-attention mechanism to generate both high-quality skeletons and skinning weights. Unlike previous methods that struggle with complex or non-standard topologies, UniRig accurately predicts topologically valid skeleton structures thanks to a new Skeleton Tree Tokenization method that efficiently encodes hierarchical relationships within the skeleton. To train and evaluate UniRig, we present Rig-XL, a new large-scale dataset of over 14,000 rigged 3D models spanning a wide range of categories. UniRig significantly outperforms state-of-the-art academic and commercial methods, achieving a 215% improvement in rigging accuracy and a 194% improvement in motion accuracy on challenging datasets. Our method works seamlessly across diverse object categories, from detailed anime characters to complex organic and inorganic structures, demonstrating its versatility and robustness. By automating the tedious and time-consuming rigging process, UniRig has the potential to speed up animation pipelines with unprecedented ease and efficiency. Project Page: https://zjp-shadow.github.io/works/UniRig/

[94] UniPhys: Unified Planner and Controller with Diffusion for Flexible Physics-Based Character Control

Yan Wu,Korrawe Karunratanakul,Zhengyi Luo,Siyu Tang

Main category: cs.GR

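TL;DR: UniPhys is a diffusion-based behavior cloning framework that unifies motion planning and control into a single model, generating natural and physically plausible character motion.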

Details

Motivation: Existing methods suffer from degraded motion quality under long-horizon control with diverse guidance signals and require task-specific fine-tuning. Method: Trains with the Diffusion Forcing paradigm, learning to denoise noisy motion histories and handle discrepancies introduced by the physics simulator, and supports multi-modal inputs such as text, trajectories, and goals. Result: Experiments show that UniPhys outperforms prior methods in motion naturalness, generalization, and robustness. Conclusion: UniPhys generates high-quality, long-horizon motion without task-specific fine-tuning. Abstract: Generating natural and physically plausible character motion remains challenging, particularly for long-horizon control with diverse guidance signals. While prior work combines high-level diffusion-based motion planners with low-level physics controllers, these systems suffer from domain gaps that degrade motion quality and require task-specific fine-tuning. To tackle this problem, we introduce UniPhys, a diffusion-based behavior cloning framework that unifies motion planning and control into a single model. UniPhys enables flexible, expressive character motion conditioned on multi-modal inputs such as text, trajectories, and goals. To address accumulated prediction errors over long sequences, UniPhys is trained with the Diffusion Forcing paradigm, learning to denoise noisy motion histories and handle discrepancies introduced by the physics simulator. This design allows UniPhys to robustly generate physically plausible, long-horizon motions. Through guided sampling, UniPhys generalizes to a wide range of control signals, including unseen ones, without requiring task-specific fine-tuning. Experiments show that UniPhys outperforms prior methods in motion naturalness, generalization, and robustness across diverse control tasks.

[95] SOPHY: Generating Simulation-Ready Objects with Physical Materials

Junyi Cao,Evangelos Kalogerakis

Main category: cs.GR

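TL;DR: SOPHY is a generative model for physics-aware 3D shapes that jointly synthesizes shape, texture, and material properties, making the objects suitable for dynamic environments.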

Details

Motivation: Existing 3D generative models focus only on static geometry or on physics-agnostic animations; SOPHY fills the gap in physics-aware synthesis. Method: Introduces a dataset annotated with physical material attributes, together with an annotation pipeline, and trains a model that jointly models shape and material properties. Result: Experiments show that joint modeling improves the realism and fidelity of generated shapes, and the method supports text-driven generation and single-image reconstruction. Conclusion: SOPHY provides physics-aware 3D object generation for dynamic environments and improves generation quality. Abstract: We present SOPHY, a generative model for 3D physics-aware shape synthesis. Unlike existing 3D generative models that focus solely on static geometry or 4D models that produce physics-agnostic animations, our approach jointly synthesizes shape, texture, and material properties related to physics-grounded dynamics, making the generated objects ready for simulations and interactive, dynamic environments. To train our model, we introduce a dataset of 3D objects annotated with detailed physical material attributes, along with an annotation pipeline for efficient material annotation. Our method enables applications such as text-driven generation of interactive, physics-aware 3D objects and single-image reconstruction of physically plausible shapes. Furthermore, our experiments demonstrate that jointly modeling shape and material properties enhances the realism and fidelity of generated shapes, improving performance on generative geometry evaluation metrics.

[96] StorySets: Ordering Curves and Dimensions for Visualizing Uncertain Sets and Multi-Dimensional Discrete Data

Markus Wallinger,Annika Bonerath,Wouter Meulemans,Martin Nöllenburg,Stephen Kobourov,Alexander Wolff

Main category: cs.GR

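TL;DR: Proposes a method for visualizing uncertain set systems that represents elements as vertical glyphs and sets as x-monotone curves, with optimizations to reduce visual complexity.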

Details

Motivation: Traditional set visualization assumes certainty and cannot represent uncertain membership, so a new method is needed. Method: Combines ideas from storyline visualizations and parallel coordinate plots: elements are drawn as vertical glyphs and sets as x-monotone curves, with the number of curve turns and crossings optimized. Result: A new exact algorithm for ordering set curves within each element's bins yields near-optimal solutions with respect to crossings, making set containment easy to see. Conclusion: The method is flexible and applies to both uncertain sets and multi-dimensional discrete data. Abstract: We propose a method for visualizing uncertain set systems, which differs from previous set visualization approaches that are based on certainty (an element either belongs to a set or not). Our method is inspired by storyline visualizations and parallel coordinate plots: (a) each element is represented by a vertical glyph, subdivided into bins that represent different levels of uncertainty; (b) each set is represented by an x-monotone curve that traverses element glyphs through the bins representing the level of uncertainty of their membership. Our implementation also includes optimizations to reduce visual complexity captured by the number of turns for the set curves and the number of crossings. Although several of the natural underlying optimization problems are NP-hard in theory (e.g., optimal element order, optimal set order), in practice, we can compute near-optimal solutions with respect to curve crossings with the help of a new exact algorithm for optimally ordering set curves within each element's bins. With these optimizations, the proposed method makes it easy to see set containment (the smaller set's curve is strictly below the larger set's curve). A brief design-space exploration using uncertain set-membership data, as well as multi-dimensional discrete data, shows the flexibility of the proposed approach.

[97] ARAP-GS: Drag-driven As-Rigid-As-Possible 3D Gaussian Splatting Editing with Diffusion Prior

Xiao Han,Runze Tian,Yifei Tong,Fenggen Yu,Dingyao Liu,Yan Zhang

Main category: cs.GR

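TL;DR: ARAP-GS is a drag-driven 3D Gaussian Splatting editing framework based on ARAP deformation; it is the first to apply ARAP deformation directly to 3D Gaussians, enabling flexible shape editing while a diffusion prior preserves visual quality.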

Details

Motivation: Drag-driven editing has been little studied for 3D Gaussian Splatting (3DGS), because deforming 3DGS while preserving shape coherence and visual continuity is challenging. Method: Proposes the ARAP-GS framework, which combines ARAP deformation with a diffusion prior inside an iterative optimization to support drag-driven 3DGS editing. Result: Experiments show that ARAP-GS outperforms existing methods across diverse 3D scenes and is efficient, taking 10 to 20 minutes per scene on a single RTX 3090 GPU. Conclusion: ARAP-GS provides an efficient, flexible solution for 3DGS editing with broad application potential. Abstract: Drag-driven editing has become popular among designers for its ability to modify complex geometric structures through simple and intuitive manipulation, allowing users to adjust and reshape content with minimal technical skill. This drag operation has been incorporated into numerous methods to facilitate the editing of 2D images and 3D meshes in design. However, few studies have explored drag-driven editing for the widely-used 3D Gaussian Splatting (3DGS) representation, as deforming 3DGS while preserving shape coherence and visual continuity remains challenging. In this paper, we introduce ARAP-GS, a drag-driven 3DGS editing framework based on As-Rigid-As-Possible (ARAP) deformation. Unlike previous 3DGS editing methods, we are the first to apply ARAP deformation directly to 3D Gaussians, enabling flexible, drag-driven geometric transformations. To preserve scene appearance after deformation, we incorporate an advanced diffusion prior for image super-resolution within our iterative optimization process. This approach enhances visual quality while maintaining multi-view consistency in the edited results. Experiments show that ARAP-GS outperforms current methods across diverse 3D scenes, demonstrating its effectiveness and superiority for drag-driven 3DGS editing. Additionally, our method is highly efficient, requiring only 10 to 20 minutes to edit a scene on a single RTX 3090 GPU.

[98] CAGE-GS: High-fidelity Cage Based 3D Gaussian Splatting Deformation

Yifei Tong,Runze Tian,Xiao Han,Dingyao Liu,Fenggen Yu,Yan Zhang

Main category: cs.GR

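TL;DR: CAGE-GS is a cage-based 3D Gaussian Splatting deformation method in which a target shape guides the geometric deformation of the source scene while texture fidelity is preserved.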

Details

Motivation: As 3D Gaussian Splatting (3DGS) gains popularity as a 3D representation of real scenes, preserving the fine details of the original 3DGS under deformation has become a research focus. Method: Learns a deformation cage from the target shape to guide the geometric deformation of the source scene, and uses a Jacobian matrix-based strategy to update the Gaussian covariance parameters so that texture is preserved. Result: Experiments on public datasets and newly proposed scenes show that the method significantly outperforms existing techniques in both efficiency and deformation quality. Conclusion: CAGE-GS is a flexible and efficient method that accommodates various target shape representations and delivers strong deformation quality. Abstract: As 3D Gaussian Splatting (3DGS) gains popularity as a 3D representation of real scenes, enabling user-friendly deformation to create novel scenes while preserving fine details from the original 3DGS has attracted significant research attention. We introduce CAGE-GS, a cage-based 3DGS deformation method that seamlessly aligns a source 3DGS scene with a user-defined target shape. Our approach learns a deformation cage from the target, which guides the geometric transformation of the source scene. While the cages effectively control structural alignment, preserving the textural appearance of 3DGS remains challenging due to the complexity of covariance parameters. To address this, we employ a Jacobian matrix-based strategy to update the covariance parameters of each Gaussian, ensuring texture fidelity post-deformation. Our method is highly flexible, accommodating various target shape representations, including texts, images, point clouds, meshes and 3DGS models. Extensive experiments and ablation studies on both public datasets and newly proposed scenes demonstrate that our method significantly outperforms existing techniques in both efficiency and deformation quality.
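
Illustrative sketch (not from the paper): the abstract does not spell out the Jacobian-based covariance update, so the snippet below assumes the standard first-order rule for a Gaussian under a locally linear map, Sigma' = J Sigma J^T, where J is the Jacobian of the deformation at the Gaussian's mean. The function names are hypothetical.

```python
# Assumption: first-order covariance update Sigma' = J @ Sigma @ J^T.
# All names here are hypothetical and not taken from CAGE-GS.

def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(a):
    """Transpose a 3x3 matrix given as nested lists."""
    return [[a[j][i] for j in range(3)] for i in range(3)]

def deform_covariance(jacobian, covariance):
    """Covariance of a Gaussian after the local linear map `jacobian`."""
    return matmul(matmul(jacobian, covariance), transpose(jacobian))

# Stretching the x-axis by 2 scales the x-variance by 4 and leaves y, z alone.
J = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sigma = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
new_sigma = deform_covariance(J, sigma)
```

Under this assumption the updated covariance stays symmetric and positive semi-definite whenever the input covariance is, which is why the rule is a common choice for transforming splats.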

[99] AAA-Gaussians: Anti-Aliased and Artifact-Free 3D Gaussian Rendering

Michael Steiner,Thomas Köhler,Lukas Radl,Felix Windisch,Dieter Schmalstieg,Markus Steinberger

Main category: cs.GR

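TL;DR: 3D Gaussian Splatting (3DGS) excels at 3D reconstruction but still suffers from aliasing, projection artifacts, and view inconsistencies. By introducing full 3D evaluation of Gaussians, an adaptive 3D smoothing filter, and a stable view-space bounding method, the paper addresses these issues and accelerates rendering.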

Details

Motivation: 3DGS suffers from aliasing, artifacts, and view inconsistencies because it simplifies splats to 2D entities; a fuller 3D evaluation is needed. Method: Proposes an adaptive 3D smoothing filter to mitigate aliasing, a stable view-space bounding method to eliminate popping artifacts, and tile-based culling in 3D with screen-space planes to accelerate rendering. Result: Achieves state-of-the-art quality on in-distribution evaluation sets, significantly outperforms other approaches on out-of-distribution views, and effectively removes aliasing, distortions, and popping artifacts. Conclusion: Full 3D evaluation of Gaussians and the improved rendering techniques yield real-time, artifact-free, high-quality rendering. Abstract: Although 3D Gaussian Splatting (3DGS) has revolutionized 3D reconstruction, it still faces challenges such as aliasing, projection artifacts, and view inconsistencies, primarily due to the simplification of treating splats as 2D entities. We argue that incorporating full 3D evaluation of Gaussians throughout the 3DGS pipeline can effectively address these issues while preserving rasterization efficiency. Specifically, we introduce an adaptive 3D smoothing filter to mitigate aliasing and present a stable view-space bounding method that eliminates popping artifacts when Gaussians extend beyond the view frustum. Furthermore, we promote tile-based culling to 3D with screen-space planes, accelerating rendering and reducing sorting costs for hierarchical rasterization. Our method achieves state-of-the-art quality on in-distribution evaluation sets and significantly outperforms other approaches for out-of-distribution views. Our qualitative evaluations further demonstrate the effective removal of aliasing, distortions, and popping artifacts, ensuring real-time, artifact-free rendering.

[100] 3D-PNAS: 3D Industrial Surface Anomaly Synthesis with Perlin Noise

Yifeng Cheng,Juan Du

Main category: cs.GR

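TL;DR: The paper proposes 3D-PNAS, a 3D anomaly generation method based on Perlin noise and surface parameterization, which addresses the lack of 3D anomaly data in industrial anomaly detection.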

Details

Motivation: Industrial anomaly detection lacks real defect samples, and 3D anomaly generation remains largely unexplored, which limits the use of 3D data in quality inspection. Method: Generates realistic 3D surface anomalies by projecting the point cloud onto a 2D plane, sampling multi-scale noise values from a Perlin noise field, and perturbing the point cloud along its normal direction. Result: Experiments show that the method generates diverse defect patterns and adapts to the surface characteristics of different object types. Conclusion: 3D-PNAS provides an effective tool for 3D anomaly generation and supports research on industrial quality inspection. Abstract: Large pretrained vision foundation models have shown significant potential in various vision tasks. However, for industrial anomaly detection, the scarcity of real defect samples poses a critical challenge in leveraging these models. While 2D anomaly generation has significantly advanced with established generative models, the adoption of 3D sensors in industrial manufacturing has made leveraging 3D data for surface quality inspection an emerging trend. In contrast to 2D techniques, 3D anomaly generation remains largely unexplored, limiting the potential of 3D data in industrial quality inspection. To address this gap, we propose a novel yet simple 3D anomaly generation method, 3D-PNAS, based on Perlin noise and surface parameterization. Our method generates realistic 3D surface anomalies by projecting the point cloud onto a 2D plane, sampling multi-scale noise values from a Perlin noise field, and perturbing the point cloud along its normal direction. Through comprehensive visualization experiments, we demonstrate how key parameters, including noise scale, perturbation strength, and octaves, provide fine-grained control over the generated anomalies, enabling the creation of diverse defect patterns from pronounced deformations to subtle surface variations. Additionally, our cross-category experiments show that the method produces consistent yet geometrically plausible anomalies across different object types, adapting to their specific surface characteristics. We also provide a comprehensive codebase and visualization toolkit to facilitate future research.
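
Illustrative sketch (not the authors' code): the three steps named in the abstract (2D projection, multi-octave Perlin sampling, displacement along normals) might look roughly like the following. The hash-based gradient function, the parameter names, and the use of precomputed 2D coordinates `uvs` in place of the paper's surface parameterization are all assumptions made for illustration.

```python
import math

def _gradient(ix, iy, seed=0):
    """Pseudo-random unit gradient at lattice point (ix, iy); hypothetical hash."""
    h = (ix * 374761393 + iy * 668265263 + seed * 144665) & 0xFFFFFFFF
    h = ((h ^ (h >> 13)) * 1274126177) & 0xFFFFFFFF
    angle = 2.0 * math.pi * (h / 2**32)
    return math.cos(angle), math.sin(angle)

def _fade(t):
    """Perlin's quintic smoothing curve 6t^5 - 15t^4 + 10t^3."""
    return t * t * t * (t * (t * 6.0 - 15.0) + 10.0)

def perlin2d(x, y, seed=0):
    """2D gradient (Perlin-style) noise; exactly zero at integer lattice points."""
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0

    def corner(dx, dy):
        gx, gy = _gradient(x0 + dx, y0 + dy, seed)
        return gx * (fx - dx) + gy * (fy - dy)

    u, v = _fade(fx), _fade(fy)
    top = corner(0, 0) + u * (corner(1, 0) - corner(0, 0))
    bottom = corner(0, 1) + u * (corner(1, 1) - corner(0, 1))
    return top + v * (bottom - top)

def fractal_noise(x, y, octaves=3, scale=1.0, seed=0):
    """Sum of octaves: frequency doubles and amplitude halves each octave."""
    total, amplitude, frequency = 0.0, 1.0, scale
    for octave in range(octaves):
        total += amplitude * perlin2d(x * frequency, y * frequency, seed + octave)
        amplitude *= 0.5
        frequency *= 2.0
    return total

def perturb_points(points, normals, uvs, strength=0.1, octaves=3, scale=1.0):
    """Displace each 3D point along its normal by strength * noise(u, v)."""
    moved = []
    for (px, py, pz), (nx, ny, nz), (u, v) in zip(points, normals, uvs):
        d = strength * fractal_noise(u, v, octaves, scale)
        moved.append((px + d * nx, py + d * ny, pz + d * nz))
    return moved

# Flat patch in the z = 0 plane: the 2D projection is simply (x, y).
points = [(0.3, 0.7, 0.0), (1.4, 0.2, 0.0)]
normals = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
uvs = [(x, y) for x, y, _ in points]
defect = perturb_points(points, normals, uvs, strength=0.05, octaves=4, scale=2.0)
```

In this sketch `scale`, `strength`, and `octaves` play the roles of the noise scale, perturbation strength, and octave count that the abstract lists as the controlling parameters.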

[101] Real-time High-fidelity Gaussian Human Avatars with Position-based Interpolation of Spatially Distributed MLPs

Youyi Zhan,Tianjia Shao,Yin Yang,Kun Zhou

Main category: cs.GR

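TL;DR: Proposes a Gaussian human avatar representation that uses spatially distributed MLPs and an offset basis to achieve high-fidelity pose-dependent appearance while supporting real-time rendering.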

Details

Motivation: Existing methods either struggle to capture pose-dependent details or are computationally expensive and cannot render in real time. Method: Represents Gaussian properties with spatially distributed MLPs and an offset basis, and uses control points to constrain the distribution of the Gaussians. Result: Preserves high-fidelity detail while rendering significantly faster than existing methods. Conclusion: The method outperforms the state of the art in both quality and efficiency. Abstract: Many works have succeeded in reconstructing Gaussian human avatars from multi-view videos. However, they either struggle to capture pose-dependent appearance details with a single MLP, or rely on a computationally intensive neural network to reconstruct high-fidelity appearance but with rendering performance degraded to non-real-time. We propose a novel Gaussian human avatar representation that can reconstruct high-fidelity pose-dependent appearance with details and meanwhile can be rendered in real time. Our Gaussian avatar is empowered by spatially distributed MLPs which are explicitly located at different positions on the human body. The parameters stored in each Gaussian are obtained by interpolating from the outputs of its nearby MLPs based on their distances. To avoid undesirably smooth changes in Gaussian properties during interpolation, for each Gaussian we define a set of Gaussian offset bases, and a linear combination of the bases represents the Gaussian property offsets relative to the neutral properties. Then we propose to let the MLPs output a set of coefficients corresponding to the basis. In this way, although Gaussian coefficients are derived from interpolation and change smoothly, the Gaussian offset basis is learned freely without constraints. The smoothly varying coefficients combined with freely learned basis can still produce distinctly different Gaussian property offsets, allowing the ability to learn high-frequency spatial signals. We further use control points to constrain the Gaussians distributed on a surface layer rather than allowing them to be irregularly distributed inside the body, to help the human avatar generalize better when animated under novel poses. Compared to the state-of-the-art method, our method achieves better appearance quality with finer details while the rendering speed is significantly faster under novel views and novel poses.
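
Illustrative sketch (not the authors' implementation): the abstract describes interpolating coefficients from nearby MLPs by distance and combining them with a per-Gaussian offset basis. The snippet below assumes normalized inverse-distance weights, which the abstract does not specify, and uses hypothetical names and toy numbers throughout.

```python
import math

def interpolation_weights(position, mlp_positions, eps=1e-8):
    """Normalized inverse-distance weights from one Gaussian to nearby MLPs.

    Assumption: inverse-distance weighting; the paper only says the
    interpolation is based on distances.
    """
    raw = [1.0 / (math.dist(position, p) + eps) for p in mlp_positions]
    total = sum(raw)
    return [w / total for w in raw]

def blend_coefficients(weights, mlp_outputs):
    """Smoothly varying coefficients: weighted average of the MLP outputs."""
    n = len(mlp_outputs[0])
    return [sum(w * out[k] for w, out in zip(weights, mlp_outputs))
            for k in range(n)]

def property_offset(coefficients, basis):
    """Offset from neutral properties: linear combination of basis vectors."""
    dim = len(basis[0])
    return [sum(c * b[d] for c, b in zip(coefficients, basis))
            for d in range(dim)]

# One Gaussian midway between two MLPs, with a freely chosen 2-vector basis.
gaussian_position = (0.0, 0.0, 0.0)
mlp_positions = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
mlp_outputs = [[1.0, 0.0], [0.0, 1.0]]   # coefficients predicted by each MLP
basis = [[2.0, 0.0], [0.0, 4.0]]         # per-Gaussian offset basis

weights = interpolation_weights(gaussian_position, mlp_positions)
coefficients = blend_coefficients(weights, mlp_outputs)
offset = property_offset(coefficients, basis)
```

The design point the abstract makes is visible here: neighboring Gaussians receive similar `coefficients`, but because each carries its own `basis`, their resulting offsets can still differ sharply.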

[102] GSAC: Leveraging Gaussian Splatting for Photorealistic Avatar Creation with Unity Integration

Rendong Zhang,Alexandra Watkins,Nilanjan Sarkar

Main category: cs.GR

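TL;DR: Proposes an end-to-end avatar creation pipeline based on 3D Gaussian Splatting (3DGS) that takes monocular video input to generate photorealistic avatars efficiently and scalably, with direct compatibility with the Unity engine.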

Details

Motivation: Photorealistic avatars are essential for VR/AR applications, but existing techniques are costly, slow, and inefficient, and cannot meet the needs of real-time applications. Method: Combines monocular video input with customized preprocessing and a 3D Gaussian Splatting technique to produce avatars embedded in a fully rigged model, and develops a Unity-integrated editor. Result: Experiments validate the effectiveness of the preprocessing pipeline and demonstrate the versatility of Gaussian avatars in Unity, confirming the scalability and practicality of the approach. Conclusion: The method provides an efficient, scalable way to generate photorealistic avatars for VR/AR applications. Abstract: Photorealistic avatars have become essential for immersive applications in virtual reality (VR) and augmented reality (AR), enabling lifelike interactions in areas such as training simulations, telemedicine, and virtual collaboration. These avatars bridge the gap between the physical and digital worlds, improving the user experience through realistic human representation. However, existing avatar creation techniques face significant challenges, including high costs, long creation times, and limited utility in virtual applications. Manual methods, such as MetaHuman, require extensive time and expertise, while automatic approaches, such as NeRF-based pipelines, often lack efficiency and detailed facial expression fidelity, and cannot be rendered at a speed sufficient for real-time applications. By involving several cutting-edge modern techniques, we introduce an end-to-end 3D Gaussian Splatting (3DGS) avatar creation pipeline that leverages monocular video input to create a scalable and efficient photorealistic avatar directly compatible with the Unity game engine. Our pipeline incorporates a novel Gaussian splatting technique with customized preprocessing that enables the use of "in the wild" monocular video capture, detailed facial expression reconstruction and embedding within a fully rigged avatar model. Additionally, we present a Unity-integrated Gaussian Splatting Avatar Editor, offering a user-friendly environment for VR/AR application development. Experimental results validate the effectiveness of our preprocessing pipeline in standardizing custom data for 3DGS training and demonstrate the versatility of Gaussian avatars in Unity, highlighting the scalability and practicality of our approach.

[103] CompGS++: Compressed Gaussian Splatting for Static and Dynamic Scene Representation

Xiangrui Liu,Xinju Wu,Shiqi Wang,Zhu Li,Sam Kwong

Main category: cs.GR

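TL;DR: CompGS++ proposes a compressed Gaussian splatting method that substantially reduces data volume by eliminating redundancy among and within Gaussian primitives while preserving accurate 3D scene modeling.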

Details

Motivation: Gaussian splatting performs well for 3D scene modeling, but its large and redundant data volume limits its use in immersive visual communication, so efficient compression is needed for transmission over existing Internet infrastructure. Method: Designs spatial and temporal primitive prediction modules to handle redundancy in static and dynamic scenes, respectively, and a rate-constrained optimization module to further remove parameter redundancy. Result: On multiple benchmark datasets, CompGS++ significantly outperforms existing methods, achieving efficient compression with accurate modeling. Conclusion: CompGS++ offers an efficient compression solution for 3D scene modeling; the implementation will be released on GitHub to facilitate further research. Abstract: Gaussian splatting demonstrates proficiency for 3D scene modeling but suffers from substantial data volume due to inherent primitive redundancy. To enable future photorealistic 3D immersive visual communication applications, significant compression is essential for transmission over the existing Internet infrastructure. Hence, we propose Compressed Gaussian Splatting (CompGS++), a novel framework that leverages compact Gaussian primitives to achieve accurate 3D modeling with substantial size reduction for both static and dynamic scenes. Our design is based on the principle of eliminating redundancy both between and within primitives. Specifically, we develop a comprehensive prediction paradigm to address inter-primitive redundancy through spatial and temporal primitive prediction modules. The spatial primitive prediction module establishes predictive relationships for scene primitives and enables most primitives to be encoded as compact residuals, substantially reducing the spatial redundancy. We further devise a temporal primitive prediction module to handle dynamic scenes, which exploits primitive correlations across timestamps to effectively reduce temporal redundancy. Moreover, we devise a rate-constrained optimization module that jointly minimizes reconstruction error and rate consumption. This module effectively eliminates parameter redundancy within primitives and enhances the overall compactness of scene representations. Comprehensive evaluations across multiple benchmark datasets demonstrate that CompGS++ significantly outperforms existing methods, achieving superior compression performance while preserving accurate scene modeling. Our implementation will be made publicly available on GitHub to facilitate further research.
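
The abstract says the rate-constrained optimization module jointly minimizes reconstruction error and rate consumption. One common way to write such a joint objective is a Lagrangian trade-off. This is a generic sketch, not the paper's stated loss; the symbols D (reconstruction error), R (rate, e.g. estimated bits) and λ (trade-off weight) are introduced here for illustration only.

```latex
\mathcal{L} = D + \lambda R, \qquad \lambda \ge 0
```

Under this assumed form, a larger λ favors smaller bitstreams at the cost of reconstruction quality, and λ = 0 reduces to plain reconstruction fitting.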

[104] HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation

Wenqi Dong,Bangbang Yang,Zesong Yang,Yuan Li,Tao Hu,Hujun Bao,Yuewen Ma,Zhaopeng Cui

Main category: cs.GR

TL;DR: HiScene proposes a hierarchical framework that bridges 2D image generation and 3D object generation to produce high-fidelity scenes.

Details

Motivation: Existing methods are limited in object categories or editing flexibility; HiScene aims to fill this gap. Method: Takes a hierarchical view that treats a scene as a complex, decomposable object, combining video-diffusion-based amodal completion with shape prior injection. Result: Experiments show that HiScene generates more natural object arrangements and complete instances suitable for interactive applications. Conclusion: HiScene improves the flexibility and quality of 3D scene generation while maintaining physical plausibility and alignment with user inputs. Abstract: Scene-level 3D generation represents a critical frontier in multimedia and computer graphics, yet existing approaches either suffer from limited object categories or lack editing flexibility for interactive applications. In this paper, we present HiScene, a novel hierarchical framework that bridges the gap between 2D image generation and 3D object generation and delivers high-fidelity scenes with compositional identities and aesthetic scene content. Our key insight is treating scenes as hierarchical "objects" under isometric views, where a room functions as a complex object that can be further decomposed into manipulatable items. This hierarchical approach enables us to generate 3D content that aligns with 2D representations while maintaining compositional structure. To ensure completeness and spatial alignment of each decomposed instance, we develop a video-diffusion-based amodal completion technique that effectively handles occlusions and shadows between objects, and introduce shape prior injection to ensure spatial coherence within the scene. Experimental results demonstrate that our method produces more natural object arrangements and complete object instances suitable for interactive applications, while maintaining physical plausibility and alignment with user inputs.

cs.CL [Back]

[105] Unmasking the Reality of PII Masking Models: Performance Gaps and the Call for Accountability

Devansh Singh,Sundaraparipurnan Narayanan

Main category: cs.CL

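TL;DR: The paper examines privacy masking and the limitations of the NER approaches behind it, evaluates the masking quality of the Piiranha and Starpii models, and calls for better performance measurement and model card disclosure.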

Details

Motivation: Privacy masking is critical to data privacy, but existing NER approaches are limited by content sensitivity, phrasing variability, and format variations, so their real-world effectiveness needs to be assessed. Method: Builds a dataset of 17K semi-synthetic sentences covering 16 PII types and tests the models across five NER detection dimensions and an adversarial setting. Result: The study finds that using these models can lead to privacy exposure and reveals shortcomings in performance measurement and model card disclosure. Conclusion: Performance measurement for these models needs to improve, and model cards should provide more contextual information. Abstract: Privacy Masking is a critical concept under data privacy involving anonymization and de-anonymization of personally identifiable information (PII). Privacy masking techniques rely on Named Entity Recognition (NER) approaches under NLP support in identifying and classifying named entities in each text. NER approaches, however, have several limitations including (a) content sensitivity including ambiguous, polysemic, context dependent or domain specific content, (b) phrasing variabilities including nicknames and alias, informal expressions, alternative representations, emerging expressions, evolving naming conventions and (c) formats or syntax variations, typos, misspellings. However, there are a couple of PII datasets that have been widely used by researchers and the open-source community to train models on PII detection or masking. These datasets have been used to train models including Piiranha and Starpii, which have been downloaded over 300k and 580k times on HuggingFace. We examine the quality of the PII masking by these models given the limitations of the datasets and of the NER approaches. We curate a dataset of 17K unique, semi-synthetic sentences containing 16 types of PII by compiling information from across multiple jurisdictions including India, U.K. and U.S. We generate sentences (using language models) containing these PII at five different NER detection feature dimensions - (1) Basic Entity Recognition, (2) Contextual Entity Disambiguation, (3) NER in Noisy & Real-World Data, (4) Evolving & Novel Entities Detection and (5) Cross-Lingual or multi-lingual NER - and 1 in adversarial context. We present the results and exhibit the privacy exposure caused by such model use (considering the extent of lifetime downloads of these models). We conclude by highlighting the gaps in measuring performance of the models and the need for contextual disclosure in model cards for such models.

[106] Learning Optimal Prompt Ensemble for Multi-source Visual Prompt Transfer

Enming Zhang,Liwen Cao,Yanru Wu,Zijie Zhao,Guan Wang,Yang Li

Main category: cs.CL

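TL;DR: HGPrompt is an adaptive multi-source prompt transfer framework that addresses representation collapse in multi-source prompt aggregation by jointly optimizing transferability and stability.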

Details

Motivation: Pre-trained prompts are knowledge assets, and combining prompts from multiple sources can improve generalization to new tasks, but naive aggregation leads to representation collapse. Method: Proposes HGPrompt, which uses an information-theoretic metric to evaluate transferability and introduces Gradient Alignment Regularization to stabilize knowledge transfer. Result: HGPrompt achieves state-of-the-art performance on the large-scale VTAB benchmark. Conclusion: HGPrompt effectively mitigates interference in multi-source prompt transfer and improves performance. Abstract: Prompt tuning has emerged as a lightweight adaptation strategy for adapting foundation models to downstream tasks, particularly in resource-constrained systems. As pre-trained prompts have become valuable intellectual assets, combining multiple source prompts offers a promising approach to enhance generalization to new tasks by leveraging complementary knowledge from diverse sources. However, naive aggregation of these prompts often leads to representation collapse due to mutual interference, undermining their collective potential. To address these challenges, we propose HGPrompt, an adaptive framework for multi-source prompt transfer that learns optimal ensemble weights by jointly optimizing dual objectives: transferability and stability. Specifically, we first introduce an information-theoretic metric to evaluate the transferability of prompt-induced features on the target task, capturing the intrinsic alignment between the feature representations. Additionally, we propose a novel Gradient Alignment Regularization to mitigate gradient conflicts among prompts, enabling stable and coherent knowledge transfer from multiple sources while suppressing interference. Extensive experiments on the large-scale VTAB benchmark demonstrate that HGPrompt achieves state-of-the-art performance, validating its effectiveness in multi-source prompt transfer.

[107] Socrates or Smartypants: Testing Logic Reasoning Capabilities of Large Language Models with Logic Programming-based Test Oracles

Zihao Xu,Junchen Ding,Yiling Lou,Kun Zhang,Dong Gong,Yuekang Li

Main category: cs.CL

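TL;DR: The paper introduces SmartyPat-Bench, a naturally expressed dataset, and SmartyPat, an automated framework, for evaluating the logical reasoning abilities of LLMs.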

Details

Motivation: Existing datasets and benchmarks for evaluating logical reasoning are too simplistic or unnatural. Method: Proposes SmartyPat-Bench (built from high-quality Reddit posts) and SmartyPat (an automated framework that combines Prolog rules with LLMs to generate logical fallacies). Result: Fallacies generated by SmartyPat are comparable to human-generated content, and SmartyPat significantly outperforms baseline methods; the experiments also characterize LLM performance on fallacy detection and categorization. Conclusion: SmartyPat-Bench and SmartyPat provide more natural and diverse tools for evaluating LLM logical reasoning and reveal the limitations of LLMs in complex reasoning. Abstract: Large Language Models (LLMs) have achieved significant progress in language understanding and reasoning. Evaluating and analyzing their logical reasoning abilities has therefore become essential. However, existing datasets and benchmarks are often limited to overly simplistic, unnatural, or contextually constrained examples. In response to the growing demand, we introduce SmartyPat-Bench, a challenging, naturally expressed, and systematically labeled benchmark derived from real-world high-quality Reddit posts containing subtle logical fallacies. Unlike existing datasets and benchmarks, it provides more detailed annotations of logical fallacies and features more diverse data. To further scale up the study and address the limitations of manual data collection and labeling - such as fallacy-type imbalance and labor-intensive annotation - we introduce SmartyPat, an automated framework powered by logic programming-based oracles. SmartyPat utilizes Prolog rules to systematically generate logically fallacious statements, which are then refined into fluent natural-language sentences by LLMs, ensuring precise fallacy representation. Extensive evaluation demonstrates that SmartyPat produces fallacies comparable in subtlety and quality to human-generated content and significantly outperforms baseline methods. Finally, experiments reveal nuanced insights into LLM capabilities, highlighting that while excessive reasoning steps hinder fallacy detection accuracy, structured reasoning enhances fallacy categorization performance.

[108] Exploring the Impact of Personality Traits on Conversational Recommender Systems: A Simulation with Large Language Models

Xiaoyan Zhao,Yang Deng,Wenjie Wang,Hongzhan Lin,Hong Cheng,Rui Zhang,See-Kiong Ng,Tat-Seng Chua

Main category: cs.CL

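TL;DR: The paper proposes PerCRS, an LLM-based personality-aware user simulation, to study how personality traits affect the outcomes of conversational recommender systems.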

Details

Motivation: Personality traits strongly influence user interaction behavior, but existing conversational recommender systems (CRSs) lack a systematic study of them. Method: PerCRS simulates user agents with customizable personality traits and preferences, pairs them with a system agent that has persuasion capability, and applies multi-aspect evaluation. Result: Experiments show that LLMs can generate diverse user responses aligned with specified personality traits, prompting CRSs to dynamically adjust their recommendation strategies. Conclusion: The study provides empirical evidence on how personality traits affect the outcomes of conversational recommender systems. Abstract: Conversational Recommender Systems (CRSs) engage users in multi-turn interactions to deliver personalized recommendations. The emergence of large language models (LLMs) further enhances these systems by enabling more natural and dynamic user interactions. However, a key challenge remains in understanding how personality traits shape conversational recommendation outcomes. Psychological evidence highlights the influence of personality traits on user interaction behaviors. To address this, we introduce an LLM-based personality-aware user simulation for CRSs (PerCRS). The user agent induces customizable personality traits and preferences, while the system agent possesses the persuasion capability to simulate realistic interaction in CRSs. We incorporate multi-aspect evaluation to ensure robustness and conduct extensive analysis from both user and system perspectives. Experimental results demonstrate that state-of-the-art LLMs can effectively generate diverse user responses aligned with specified personality traits, thereby prompting CRSs to dynamically adjust their recommendation strategies. Our experimental analysis offers empirical insights into the impact of personality traits on the outcomes of conversational recommender systems.

[109] How to Detect and Defeat Molecular Mirage: A Metric-Driven Benchmark for Hallucination in LLM-based Molecular Comprehension

Hao Li,Liuzhenghao Lv,He Cao,Zijing Liu,Zhiyuan Yan,Yu Wang,Yonghong Tian,Yu Li,Li Yuan

Main category: cs.CL

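TL;DR: The paper analyzes hallucination in LLMs on molecular comprehension tasks and proposes the Mol-Hallu evaluation metric and the HRPP post-processing stage to measure and reduce it.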

Details

Motivation: Existing LLMs hallucinate on molecular comprehension tasks, which undermines accuracy in drug design and utilization. Method: Proposes the Mol-Hallu metric to quantify the degree of hallucination and designs the HRPP post-processing stage to reduce it. Result: Experiments show that HRPP effectively reduces molecular hallucination across multiple LLMs. Conclusion: The study offers key insights for mitigating LLM hallucination in scientific applications. Abstract: Large language models are increasingly used in scientific domains, especially for molecular understanding and analysis. However, existing models are affected by hallucination issues, resulting in errors in drug design and utilization. In this paper, we first analyze the sources of hallucination in LLMs for molecular comprehension tasks, specifically the knowledge shortcut phenomenon observed in the PubChem dataset. To evaluate hallucination in molecular comprehension tasks with computational efficiency, we introduce Mol-Hallu, a novel free-form evaluation metric that quantifies the degree of hallucination based on the scientific entailment relationship between generated text and actual molecular properties. Utilizing the Mol-Hallu metric, we reassess and analyze the extent of hallucination in various LLMs performing molecular comprehension tasks. Furthermore, the Hallucination Reduction Post-processing stage (HRPP) is proposed to alleviate molecular hallucinations. Experiments show the effectiveness of HRPP on decoder-only and encoder-decoder molecular LLMs. Our findings provide critical insights into mitigating hallucination and improving the reliability of LLMs in scientific applications.

Xingguang Ji,Jiakang Wang,Hongzhi Zhang,Jingyuan Zhang,Haonan Zhou,Chenxi Sun,Yahui Liu,Qi Wang,Fuzheng Zhang

Main category: cs.CL

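TL;DR: Capybara-OMNI is a lightweight, efficient multimodal large language model (MLLM) that understands text, image, video, and audio, and reaches competitive performance through a detailed framework design, data construction, and training recipe.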

Details

Motivation: Because creating and training on multimodal data pairs is complex, building a strong MLLM remains compute- and time-intensive; the paper aims to present a lightweight, efficient training approach. Method: Details the framework design, data construction, and training recipe of Capybara-OMNI, and provides dedicated benchmarks for verifying multimodal understanding. Result: Experiments show that following this recipe efficiently yields an MLLM with competitive performance among models of the same scale. Conclusion: The Capybara-OMNI model and its chat version are released, including model weights, part of the training data, and inference code, to support the community. Abstract: With the development of Multimodal Large Language Models (MLLMs), numerous outstanding accomplishments have emerged within the open-source community. Due to the complexity of creating and training multimodal data pairs, it is still a computational and time-consuming process to build powerful MLLMs. In this work, we introduce Capybara-OMNI, an MLLM that trains in a lightweight and efficient manner and supports understanding text, image, video, and audio modalities. We present in detail the framework design, the data construction, and the training recipe, to develop an MLLM step-by-step to obtain competitive performance. We also provide exclusive benchmarks utilized in our experiments to show how to properly verify understanding capabilities across different modalities. Results show that by following our guidance, we can efficiently build an MLLM that achieves competitive performance among models of the same scale on various multimodal benchmarks. Additionally, to enhance the multimodal instruction following and conversational capabilities of the model, we further discuss how to train the chat version upon an MLLM understanding model, which is more in line with user habits for tasks like real-time interaction with humans. We publicly disclose the Capybara-OMNI model, along with its chat-based version. The disclosure includes the model weights, a portion of the training data, and the inference codes, which are made available on GitHub.

[111] Data Metabolism: An Efficient Data Design Schema For Vision Language Model

Jingyuan Zhang,Hongzhi Zhang,Zhou Haonan,Chenxi Sun,Xingguang Ji,Jiakang Wang,Fanheng Kong,Yahui Liu,Qi Wang,Fuzheng Zhang

Main category: cs.CL

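TL;DR: The paper proposes a data-centric framework called Data Metabolism for building vision language models (VLMs), using a closed-loop system to continuously improve model performance. The released Capybara-VL model performs well across tasks and even surpasses some larger open-source models.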

Details

Motivation: Data curation is critical to training strong VLMs, but existing approaches lack a systematic method; the paper addresses this with the Data Metabolism framework. Method: Uses a data-centric framework with two key steps, data curation and iteration, that form a closed loop, and details how to process large-scale datasets and build a user-specific data flywheel. Result: Capybara-VL performs well on multiple tasks, surpassing some larger open-source models and matching several leading proprietary models. Conclusion: The data-centric framework shows the potential of training smaller, more efficient VLMs and points to new directions for future research. Abstract: Data curation plays a crucial role in training powerful Visual Language Models (VLMs). In this work, we introduce the concept of Data Metabolism and present our data-centric framework to build VLMs throughout the development lifecycle. Starting from a standard model architecture, we discuss and provide insights into two crucial development steps: data curation and iteration, forming a closed-loop system that continuously improves model performance. We show a detailed codebook on how to process existing massive datasets and build user-specific data flywheel. As a demonstration, we release a VLM, named Capybara-VL, which excels in typical multimodal tasks (e.g., visual question answering, scientific reasoning, and text-rich tasks). Despite its relatively compact size, Capybara-VL surpasses several open-source models that are up to 10 times larger in size. Moreover, it achieves results that are on par with those of several leading proprietary models, demonstrating its remarkable competitiveness. These results highlight the power of our data-centric framework and the potential of training smaller and more efficient VLMs.

[112] ChatGPT as Linguistic Equalizer? Quantifying LLM-Driven Lexical Shifts in Academic Writing

Dingkang Lin,Naixuan Zhao,Dan Tian,Jiang Li

Main category: cs.CL

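TL;DR: ChatGPT significantly increases the lexical complexity of academic writing by non-native English speakers (NNES), reducing language barriers and promoting academic equity.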

Details

Motivation: To study whether ChatGPT helps NNES scholars overcome language barriers in academic writing and improves equity in global academia. Method: Quantifies lexical complexity with MTLD and applies a DID design to 2.8 million OpenAlex articles (2020-2024), controlling for article, authorship, and venue factors. Result: ChatGPT significantly raises the lexical complexity of NNES-authored abstracts, with the strongest effects in preprints, technology- and biology-related fields, and lower-tier journals. Conclusion: ChatGPT effectively reduces linguistic disparities and promotes equity in global academia. Abstract: The advent of ChatGPT has profoundly reshaped scientific research practices, particularly in academic writing, where non-native English-speakers (NNES) historically face linguistic barriers. This study investigates whether ChatGPT mitigates these barriers and fosters equity by analyzing lexical complexity shifts across 2.8 million articles from OpenAlex (2020-2024). Using the Measure of Textual Lexical Diversity (MTLD) to quantify vocabulary sophistication and a difference-in-differences (DID) design to identify causal effects, we demonstrate that ChatGPT significantly enhances lexical complexity in NNES-authored abstracts, even after controlling for article-level controls, authorship patterns, and venue norms. Notably, the impact is most pronounced in preprint papers, technology- and biology-related fields and lower-tier journals. These findings provide causal evidence that ChatGPT reduces linguistic disparities and promotes equity in global academia.
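
The two tools named in this abstract, MTLD and difference-in-differences, can be illustrated with a minimal Python sketch. It is not the authors' code: the 0.72 type-token-ratio threshold is the value commonly used for MTLD, the `<=` comparison is an implementation choice, and the DID function is the textbook two-group, two-period estimator without the article, authorship, and venue controls the study describes. All function names are hypothetical.

```python
def _mtld_one_direction(tokens, threshold=0.72):
    """MTLD for one reading direction: token count divided by factor count."""
    factors = 0.0
    seen = set()   # distinct tokens in the current segment
    length = 0     # tokens in the current segment
    for tok in tokens:
        seen.add(tok)
        length += 1
        # A factor is complete once the running type-token ratio hits the threshold.
        if len(seen) / length <= threshold:
            factors += 1.0
            seen = set()
            length = 0
    if length > 0:
        # Partial credit for the unfinished final segment.
        ttr = len(seen) / length
        factors += (1.0 - ttr) / (1.0 - threshold)
    return len(tokens) / factors if factors > 0 else float("inf")


def mtld(tokens, threshold=0.72):
    """Average of the forward and backward passes over a token list."""
    forward = _mtld_one_direction(tokens, threshold)
    backward = _mtld_one_direction(tokens[::-1], threshold)
    return (forward + backward) / 2.0


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Two-by-two difference-in-differences on group means."""
    return (treated_post - treated_pre) - (control_post - control_pre)


# Illustrative inputs only; none of these values come from the paper.
print(mtld("the cat sat on the mat and the dog sat on the rug".split()))
print(did_estimate(50.0, 60.0, 55.0, 58.0))
```

A higher MTLD means the text sustains lexical variety over longer stretches. The DID value is the change in the treated group beyond the change observed in the control group over the same period.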

[113] Has the Creativity of Large-Language Models peaked? An analysis of inter- and intra-LLM variability

Jennifer Haase,Paul H. P. Hanel,Sebastian Pokutta

Main category: cs.CL

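TL;DR: The study evaluates 14 large language models (LLMs) on creativity tasks and finds that their creativity has not improved over time and that variability is substantial both between and within models.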

Details

Motivation: To examine whether LLM performance on creativity tasks has improved over time and how stable and consistent their creative output is. Method: Tests 14 LLMs with two creativity assessments, the Divergent Association Task (DAT) and the Alternative Uses Task (AUT). Result: LLM creativity has not increased significantly, and GPT-4 performs worse than in earlier studies; on the AUT, models on average outperform the average human, but responses rarely reach top human levels. Intra-model variability is substantial. Conclusion: More nuanced evaluation frameworks are needed, with attention to model selection, prompt design, and repeated assessment, to accurately gauge the creative potential of LLMs. Abstract: Following the widespread adoption of ChatGPT in early 2023, numerous studies reported that large language models (LLMs) can match or even surpass human performance in creative tasks. However, it remains unclear whether LLMs have become more creative over time, and how consistent their creative output is. In this study, we evaluated 14 widely used LLMs -- including GPT-4, Claude, Llama, Grok, Mistral, and DeepSeek -- across two validated creativity assessments: the Divergent Association Task (DAT) and the Alternative Uses Task (AUT). Contrary to expectations, we found no evidence of increased creative performance over the past 18-24 months, with GPT-4 performing worse than in previous studies. For the more widely used AUT, all models performed on average better than the average human, with GPT-4o and o3-mini performing best. However, only 0.28% of LLM-generated responses reached the top 10% of human creativity benchmarks. Beyond inter-model differences, we document substantial intra-model variability: the same LLM, given the same prompt, can produce outputs ranging from below-average to original. This variability has important implications for both creativity research and practical applications. Ignoring such variability risks misjudging the creative potential of LLMs, either inflating or underestimating their capabilities. The choice of prompts affected LLMs differently. Our findings underscore the need for more nuanced evaluation frameworks and highlight the importance of model selection, prompt design, and repeated assessment when using Generative AI (GenAI) tools in creative contexts.

[114] AttentionDefense: Leveraging System Prompt Attention for Explainable Defense Against Novel Jailbreaks

Charlotte Siska,Anush Sankaran

Main category: cs.CL

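TL;DR: The paper proposes AttentionDefense, a defense that uses the system-prompt attention of small language models (SLMs) to identify adversarial prompts, offering an explainable and lower-cost approach.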

Details

Motivation: Current defenses struggle to explain why a jailbreak prompt is malicious, which has led to a variety of closed-box approaches; the work aims to use the attention mechanism to understand how language models respond to malicious input. Method: Proposes AttentionDefense, which characterizes adversarial prompts using SLM system-prompt attention, and validates it experimentally. Result: On existing jailbreak datasets, AttentionDefense matches or outperforms text-embedding-based classifiers and GPT-4 zero-shot detectors, and it remains robust on a novel jailbreak dataset. Conclusion: AttentionDefense is an efficient, explainable defense with the compute requirements of a small model and performance close to that of an LLM detector. Abstract: In the past few years, Language Models (LMs) have shown par-human capabilities in several domains. Despite their practical applications and exceeding user consumption, they are susceptible to jailbreaks when malicious input exploits the LM's weaknesses, causing it to deviate from its intended behavior. Current defensive strategies either classify the input prompt as adversarial or prevent LMs from generating harmful outputs. However, it is challenging to explain the reason behind the malicious nature of the jailbreak, which results in a wide variety of closed-box approaches. In this research, we propose and demonstrate that system-prompt attention from Small Language Models (SLMs) can be used to characterize adversarial prompts, providing a novel, explainable, and cheaper defense approach called AttentionDefense. Our research suggests that the attention mechanism is an integral component in understanding and explaining how LMs respond to malicious input that is not captured in the semantic meaning of text embeddings. The proposed AttentionDefense is evaluated against existing jailbreak benchmark datasets. Ablation studies show that SLM-based AttentionDefense has equivalent or better jailbreak detection performance compared to text embedding-based classifiers and GPT-4 zero-shot detectors. To further validate the efficacy of the proposed approach, we generate a dataset of novel jailbreak variants of the existing benchmark dataset using a closed-loop LLM-based multi-agent system. We demonstrate that the proposed AttentionDefense approach performs robustly on this novel jailbreak dataset while existing approaches suffer in performance. Additionally, for practical purposes AttentionDefense is an ideal solution as it has the computation requirements of a small LM but the performance of an LLM detector.

[115] A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis

Xin Gao,Qizhi Pei,Zinan Tang,Yu Li,Honglin Lin,Jiang Wu,Conghui He,Lijun Wu

Main category: cs.CL

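TL;DR: The paper proposes GRA, a framework in which multiple small LLMs collaborate in a simulated peer-review process to synthesize high-quality data, challenging the need for large LLMs.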

Details

Motivation: Current data synthesis methods depend on large LLMs, which bring high cost, inefficiency, and potential bias; small LLMs are more sustainable but individually not capable enough. Method: GRA assigns small LLMs three roles, Generator, Reviewer, and Adjudicator, and produces high-quality data through iterative refinement. Result: Experiments show that GRA-produced data matches or exceeds the quality of data from a large LLM such as Qwen-2.5-72B-Instruct. Conclusion: Strategic coordination of small LLMs can substitute for large LLMs, offering a new route to efficient and sustainable data synthesis. Abstract: While data synthesis and distillation are promising strategies to enhance small language models, current approaches heavily rely on Large Language Models (LLMs), which suffer from high computational costs, environmental inefficiency, and potential biases inherited from monolithic architectures. In contrast, smaller LLMs are more accessible and sustainable, but their individual capabilities often fall short in generating high-quality, diverse, and reliable data. Inspired by collaborative human processes (e.g., peer review), we propose a multiple small LLMs involved framework, GRA, that aggregates specialized roles across small LLMs to provide the iterative refinement and quality control typically achieved by a single large LLM. In this collaborative framework, multiple small LLMs assume distinct roles (Generator, Reviewer, and Adjudicator) to simulate a peer-review-inspired data synthesis pipeline. The Generator proposes initial data samples, the Reviewer critiques their quality and diversity, and the Adjudicator resolves conflicts to finalize the output. By decomposing the synthesis process into specialized sub-tasks, collaborative small LLMs can achieve data-level parity with large LLM-based distillation. Through experiments across multiple benchmarks, we demonstrate that GRA-produced data matches or exceeds the quality of single large LLM outputs, e.g., Qwen-2.5-72B-Instruct. Our results challenge the necessity of monolithic large models for high-quality data synthesis, advocating instead for strategic coordination of smaller agents. Our datasets, models, and code are publicly available at https://github.com/GX-XinGao/GRA.
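
The Generator, Reviewer, and Adjudicator roles described above can be sketched as a small control loop. This is an illustrative outline, not the GRA implementation: the roles are plain Python callables standing in for small LLMs, and the function name `synthesize`, the numeric review scores, and the `accept_at` threshold are all assumptions made for the example.

```python
from typing import Callable, List, Optional

Generator = Callable[[str], str]                  # seed -> candidate sample
Reviewer = Callable[[str], float]                 # sample -> score in [0, 1]
Adjudicator = Callable[[str, List[float]], bool]  # sample, scores -> keep?


def synthesize(seed: str,
               generate: Generator,
               reviewers: List[Reviewer],
               adjudicate: Adjudicator,
               accept_at: float = 0.5) -> Optional[str]:
    """One peer-review-style round: generate, review, adjudicate on conflict."""
    sample = generate(seed)
    scores = [review(sample) for review in reviewers]
    votes = [score >= accept_at for score in scores]
    if all(votes):
        return sample   # reviewers agree: accept
    if not any(votes):
        return None     # reviewers agree: reject
    # Reviewers disagree, so the adjudicator resolves the conflict.
    return sample if adjudicate(sample, scores) else None


# Stub roles for demonstration; a real pipeline would call small LLMs here.
result = synthesize(
    "write a question about fractions",
    generate=lambda seed: f"Q: {seed}?",
    reviewers=[lambda s: 0.9, lambda s: 0.2],
    adjudicate=lambda s, scores: sum(scores) / len(scores) >= 0.5,
)
print(result)
```

Keeping the roles as interchangeable callables mirrors the abstract's point that the synthesis process is decomposed into specialized sub-tasks, so each role can be served by a different small model.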

[116] The Other Side of the Coin: Exploring Fairness in Retrieval-Augmented Generation

Zheng Zhang,Ning Li,Qi Liu,Rui Li,Weibo Gao,Qingyang Mao,Zhenya Huang,Baosheng Yu,Dacheng Tao

Main category: cs.CL

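TL;DR: RAG reduces LLM hallucination by bringing in external knowledge but can worsen unfairness in small-scale LLMs; the study proposes FairFT and FairFilter to improve fairness.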

Details

Motivation: To examine how the RAG framework affects the fairness of LLMs, particularly for small-scale models. Method: Runs experiments that vary the LLMs, retrievers, and retrieval sources, and proposes two approaches: FairFT (aligning the retriever with the LLM on fairness) and FairFilter (a fairness filtering mechanism). Result: Small-scale LLMs (<8B) are more prone to unfairness under RAG; FairFT and FairFilter improve fairness while maintaining performance. Conclusion: RAG harms the fairness of small-scale LLMs, but the proposed approaches effectively mitigate the problem. Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by retrieving relevant documents from external knowledge sources. By referencing this external knowledge, RAG effectively reduces the generation of factually incorrect content and addresses hallucination issues within LLMs. Recently, there has been growing attention to improving the performance and efficiency of RAG systems from various perspectives. While these advancements have yielded significant results, the application of RAG in domains with considerable societal implications raises a critical question about fairness: What impact does the introduction of the RAG paradigm have on the fairness of LLMs? To address this question, we conduct extensive experiments by varying the LLMs, retrievers, and retrieval sources. Our experimental analysis reveals that the scale of the LLMs plays a significant role in influencing fairness outcomes within the RAG framework. When the model scale is smaller than 8B, the integration of retrieval mechanisms often exacerbates unfairness in small-scale LLMs (e.g., LLaMA3.2-1B, Mistral-7B, and LLaMA3-8B). To mitigate the fairness issues introduced by RAG for small-scale LLMs, we propose two approaches, FairFT and FairFilter. Specifically, in FairFT, we align the retriever with the LLM in terms of fairness, enabling it to retrieve documents that facilitate fairer model outputs. In FairFilter, we propose a fairness filtering mechanism to filter out biased content after retrieval. Finally, we validate our proposed approaches on real-world datasets, demonstrating their effectiveness in improving fairness while maintaining performance.

[117] Cross-Document Cross-Lingual Natural Language Inference via RST-enhanced Graph Fusion and Interpretability Prediction

Mengying Yuan,Wangzi Xuan,Fei Li

Main category: cs.CL

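TL;DR: The paper proposes a new cross-document cross-lingual natural language inference (CDCL-NLI) paradigm, builds a high-quality dataset, and introduces a method based on RST-enhanced graph fusion and interpretability prediction.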

Details

Motivation: CDCL-NLI remains largely unexplored, and this paper aims to fill that gap. Method: Uses an RST-enhanced RGAT for cross-document context modeling, combined with a structure-aware semantic alignment mechanism based on lexical chains, and develops an EDU-level attribution framework that generates extractive explanations. Result: Experiments show the method significantly outperforms traditional NLI models and large language models such as Llama3 and GPT-4o. Conclusion: The work opens a new direction for NLI research and encourages work on cross-document cross-lingual context understanding, semantic retrieval, and interpretable inference. Abstract: Natural Language Inference (NLI) is a fundamental task in both natural language processing and information retrieval. While NLI has developed many sub-directions such as sentence-level NLI, document-level NLI and cross-lingual NLI, Cross-Document Cross-Lingual NLI (CDCL-NLI) remains largely unexplored. In this paper, we propose a novel paradigm for CDCL-NLI that extends traditional NLI capabilities to multi-document, multilingual scenarios. To support this task, we construct a high-quality CDCL-NLI dataset including 1,110 instances and spanning 26 languages. To build a baseline for this task, we also propose an innovative method that integrates RST-enhanced graph fusion and interpretability prediction. Our method employs RST (Rhetorical Structure Theory) on RGAT (Relation-aware Graph Attention Network) for cross-document context modeling, coupled with a structure-aware semantic alignment mechanism based on lexical chains for cross-lingual understanding. For NLI interpretability, we develop an EDU-level attribution framework that generates extractive explanations. Extensive experiments demonstrate our approach's superior performance, achieving significant improvements over both traditional NLI models such as DocNLI and R2F, as well as LLMs like Llama3 and GPT-4o. Our work sheds light on the study of NLI and will bring research interest on cross-document cross-lingual context understanding, semantic retrieval and interpretability inference. Our dataset and code are available at https://anonymous.4open.science/r/CDCL-NLI-637E/ (CDCL-NLI link for peer review).

Haiqi Zhang,Zhengyuan Zhu,Zeyu Zhang,Chengkai Li

Main category: cs.CL

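TL;DR: LLMTaxo uses large language models to automatically construct a multi-granularity taxonomy of factual claims from social media, and experiments validate its effectiveness.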

Details

Motivation: The explosive growth of social media content makes online discourse harder to analyze, calling for more effective categorization methods. Method: Proposes the LLMTaxo framework, which uses large language models to generate topics at multiple levels of granularity, and tests different models on three datasets. Result: Experiments show that LLMTaxo effectively categorizes factual claims and that certain models perform better on specific datasets. Conclusion: LLMTaxo is an effective tool for categorizing factual claims on social media; model selection could be optimized in future work. Abstract: With the vast expansion of content on social media platforms, analyzing and comprehending online discourse has become increasingly complex. This paper introduces LLMTaxo, a novel framework leveraging large language models for the automated construction of taxonomy of factual claims from social media by generating topics from multi-level granularities. This approach aids stakeholders in more effectively navigating the social media landscapes. We implement this framework with different models across three distinct datasets and introduce specially designed taxonomy evaluation metrics for a comprehensive assessment. With the evaluations from both human evaluators and GPT-4, the results indicate that LLMTaxo effectively categorizes factual claims from social media, and reveals that certain models perform better on specific datasets.

[119] Reconstructing Sepsis Trajectories from Clinical Case Reports using LLMs: the Textual Time Series Corpus for Sepsis

Shahriar Noroozizadeh,Jeremy C. Weiss

Main category: cs.CL

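TL;DR: The paper presents a pipeline that uses large language models (LLMs) to extract time-localized findings from clinical case reports and discharge summaries, and produces an open-access textual time series corpus for Sepsis-3. Validation shows that LLMs time-localize clinical findings well, while also exposing their limitations.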

Details

Motivation: Clinical case reports and discharge summaries are complete and accurate but are usually finalized only after the patient encounter, whereas structured data streams are available sooner but incomplete. Training models on more complete, temporally fine-grained data requires a way to extract time-localized clinical findings from text. Method: Builds a pipeline that uses LLMs to phenotype, extract, and annotate time-localized findings in case reports, and generates a Sepsis-3 corpus of 2,139 case reports. The system is validated by comparing its timeline annotations on PMOA and I2B2/MIMIC-IV with physician-expert annotations. Result: LLMs achieve high recovery rates of clinical findings (event match rate: O1-preview 0.755, Llama 3.3 70B Instruct 0.753) and strong temporal ordering (concordance: O1-preview 0.932, Llama 3.3 70B Instruct 0.932). Conclusion: LLMs show promise for time-localizing clinical findings but have limitations; multimodal integration is a possible route to improvement. Abstract: Clinical case reports and discharge summaries may be the most complete and accurate summarization of patient encounters, yet they are finalized, i.e., timestamped after the encounter. Complementary structured data streams become available sooner but suffer from incompleteness. To train models and algorithms on more complete and temporally fine-grained data, we construct a pipeline to phenotype, extract, and annotate time-localized findings within case reports using large language models. We apply our pipeline to generate an open-access textual time series corpus for Sepsis-3 comprising 2,139 case reports from the Pubmed-Open Access (PMOA) Subset. To validate our system, we apply it on PMOA and timeline annotations from I2B2/MIMIC-IV and compare the results to physician-expert annotations. We show high recovery rates of clinical findings (event match rates: O1-preview--0.755, Llama 3.3 70B Instruct--0.753) and strong temporal ordering (concordance: O1-preview--0.932, Llama 3.3 70B Instruct--0.932). Our work characterizes the ability of LLMs to time-localize clinical findings in text, illustrating the limitations of LLM use for temporal reconstruction and providing several potential avenues of improvement via multimodal integration.
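
The two validation measures quoted above, event match rate and temporal concordance, can be illustrated with a simplified Python sketch. The paper's exact definitions are not given in the abstract, so this is an assumption-laden stand-in: events are matched by exact string equality (a real system would likely need fuzzy or semantic matching), and concordance is computed as a pairwise ordering agreement in the style of a c-index. The function names and example events are hypothetical.

```python
from itertools import combinations
from typing import Dict, Iterable


def event_match_rate(reference: Iterable[str], predicted: Iterable[str]) -> float:
    """Fraction of reference events that also appear among the predictions."""
    reference = list(reference)
    if not reference:
        return 0.0
    predicted_set = set(predicted)
    return sum(event in predicted_set for event in reference) / len(reference)


def concordance(reference_times: Dict[str, float],
                predicted_times: Dict[str, float]) -> float:
    """Share of event pairs whose predicted order agrees with the reference order.

    Only events present in both timelines are compared; pairs tied in the
    reference are skipped, and pairs tied in the prediction count as half.
    """
    shared = sorted(set(reference_times) & set(predicted_times))
    agree, total = 0.0, 0
    for a, b in combinations(shared, 2):
        ref_diff = reference_times[a] - reference_times[b]
        if ref_diff == 0:
            continue
        total += 1
        pred_diff = predicted_times[a] - predicted_times[b]
        if pred_diff == 0:
            agree += 0.5
        elif (pred_diff > 0) == (ref_diff > 0):
            agree += 1.0
    return agree / total if total else float("nan")


# Made-up timeline in hours from admission; not data from the corpus.
expert = {"fever": 0, "hypotension": 6, "intubation": 24}
model = {"fever": 0, "hypotension": 30, "intubation": 24}
print(event_match_rate(expert, model))
print(concordance(expert, model))
```

Separating the two measures distinguishes a model that finds the right events but misorders them from one that orders well but misses findings.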

Yuxi Ma,Yongqian Peng,Yixin Zhu

Main category: cs.CL

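TL;DR: Through computational analysis of language patterns in Chinese state-controlled media (1950-2019), the paper examines how social transformations are reflected in official representations of social groups and finds patterns that differ markedly from Western ones.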

Details

Motivation: To show how language encodes social structure, particularly how official discourse represents social groups in a non-Western context. Method: Applies diachronic word embeddings to analyze language patterns in Chinese state-controlled media. Result: Chinese representations of social groups differ significantly from Western ones; representations of some groups (such as gender and economic class) shift dramatically with historical transformations, while others (such as ethnicity and age) remain stable. Conclusion: The study underscores the importance of non-Western perspectives in computational social science and shows how official discourse encodes social structure through language. Abstract: Language encodes societal beliefs about social groups through word patterns. While computational methods like word embeddings enable quantitative analysis of these patterns, studies have primarily examined gradual shifts in Western contexts. We present the first large-scale computational analysis of Chinese state-controlled media (1950-2019) to examine how revolutionary social transformations are reflected in official linguistic representations of social groups. Using diachronic word embeddings at multiple temporal resolutions, we find that Chinese representations differ significantly from Western counterparts, particularly regarding economic status, ethnicity, and gender. These representations show distinct evolutionary dynamics: while stereotypes of ethnicity, age, and body type remain remarkably stable across political upheavals, representations of gender and economic classes undergo dramatic shifts tracking historical transformations. This work advances our understanding of how officially sanctioned discourse encodes social structure through language while highlighting the importance of non-Western perspectives in computational social science.

[121] A Comprehensive Survey of Reward Models: Taxonomy, Applications, Challenges, and Future

Jialun Zhong,Wei Shen,Yanzeng Li,Songyang Gao,Hua Lu,Yicheng Chen,Yang Zhang,Wei Zhou,Jinjie Gu,Lei Zou

Main category: cs.CL

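TL;DR: This survey reviews reward models (RMs) for enhancing large language models (LLMs), covering preference collection, reward modeling, and usage, and discusses applications, evaluation benchmarks, challenges, and future research directions.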

Details

Motivation: Reward models (RMs) act as a proxy for human preferences and can guide LLM behavior, but research on them has not been surveyed comprehensively. The paper aims to give beginners a thorough introduction to RMs and to facilitate future research. Method: The authors survey existing work, analyzing RMs from three perspectives (preference collection, reward modeling, and usage), and discuss applications, evaluation benchmarks, and challenges. Result: The survey summarizes the state of RM research, proposes future research directions, and releases the related resources publicly. Conclusion: RMs hold significant potential for LLMs; the paper offers a comprehensive guide to the area and encourages further exploration. Abstract: Reward Models (RMs) have demonstrated impressive potential for enhancing Large Language Models (LLMs), as RMs can serve as a proxy for human preferences, providing signals to guide LLMs' behavior in various tasks. In this paper, we provide a comprehensive overview of relevant research, exploring RMs from the perspectives of preference collection, reward modeling, and usage. Next, we introduce the applications of RMs and discuss the benchmarks for evaluation. Furthermore, we conduct an in-depth analysis of the challenges existing in the field and dive into the potential research directions. This paper is dedicated to providing beginners with a comprehensive introduction to RMs and facilitating future studies. The resources are publicly available on GitHub at https://github.com/JLZhong23/awesome-reward-models.

[122] Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time

Wang Yang,Xiang Yue,Vipin Chaudhary,Xiaotian Han

Main category: cs.CL

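TL;DR: Proposes Speculative Thinking, a training-free framework in which a large model guides a small model to improve reasoning performance, raising accuracy while shortening outputs.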

Details

Motivation: Existing post-training methods are costly and still produce overly long outputs, so a more efficient solution is needed. Method: A large model controls the small model's reflective behavior, reducing unnecessary backtracking and improving reasoning quality. Result: The 1.5B model's accuracy on MATH500 rises by 6.2 percentage points while its output length falls by 15.7%; a non-reasoning model's accuracy rises by 7.8 percentage points. Conclusion: Speculative Thinking effectively improves reasoning performance and applies to a range of models. Abstract: Recent advances leverage post-training to enhance model reasoning performance, which typically requires costly training pipelines and still suffers from inefficient, overly lengthy outputs. We introduce Speculative Thinking, a training-free framework that enables large reasoning models to guide smaller ones during inference at the reasoning level, distinct from speculative decoding, which operates at the token level. Our approach is based on two observations: (1) reasoning-supportive tokens such as "wait" frequently appear after structural delimiters like "\n\n", serving as signals for reflection or continuation; and (2) larger models exhibit stronger control over reflective behavior, reducing unnecessary backtracking while improving reasoning quality. By strategically delegating reflective steps to a more capable model, our method significantly boosts the reasoning accuracy of reasoning models while shortening their output. With the assistance of the 32B reasoning model, the 1.5B model's accuracy on MATH500 increases from 83.2% to 89.4%, an improvement of 6.2 percentage points. Simultaneously, the average output length is reduced from 5439 tokens to 4583 tokens, representing a 15.7% decrease. Moreover, when applied to a non-reasoning model (Qwen-2.5-7B-Instruct), our framework boosts its accuracy from 74.0% to 81.8% on the same benchmark, an absolute improvement of 7.8 percentage points.
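The delegation idea in this abstract can be illustrated with a toy loop. The sketch below is an assumption-laden illustration, not the authors' implementation: `speculative_thinking`, `toy_small`, `toy_large`, and the cue list are hypothetical names, and the two "models" are scripted stubs. It generates one paragraph at a time and, whenever the small model's next paragraph opens with a reflection cue such as "wait", lets the large model write that paragraph instead.

```python
# Assumed cue list; the abstract names only "wait" as an example.
REFLECTION_CUES = ("wait", "hmm", "alternatively")


def speculative_thinking(small_step, large_step, prompt, max_steps=8):
    """Generate paragraph by paragraph, delegating reflective paragraphs.

    small_step and large_step are callables mapping the current context to the
    next paragraph (an empty string means "finished").
    """
    context = prompt
    trace = []
    for _ in range(max_steps):
        draft = small_step(context)
        if not draft:
            break
        source = "small"
        # A paragraph that opens with a reflection cue is handed to the large model.
        if draft.lower().startswith(REFLECTION_CUES):
            draft = large_step(context)
            source = "large"
        trace.append((source, draft))
        context = context + "\n\n" + draft
    return trace


# Scripted stand-ins for real models (hypothetical, for illustration only).
SCRIPT = [
    "Compute 2 + 3 = 5.",
    "Wait, maybe I misread the question.",
    "So the answer is 5.",
]


def toy_small(context):
    index = context.count("\n\n")  # number of paragraphs generated so far
    return SCRIPT[index] if index < len(SCRIPT) else ""


def toy_large(context):
    return "The reading was correct, so continue."


trace = speculative_thinking(toy_small, toy_large, "What is 2 + 3?")
for source, paragraph in trace:
    print(f"[{source}] {paragraph}")
```

In a real system the two callables would wrap actual model inference, and the choice of cues and delimiters would follow the paper rather than this sketch.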

[123] HM-RAG: Hierarchical Multi-Agent Multimodal Retrieval Augmented Generation

Pei Liu,Xin Liu,Ruoyu Yao,Junming Liu,Siyuan Meng,Ding Wang,Jun Ma

Main category: cs.CL

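TL;DR: HM-RAG is a novel hierarchical multi-agent multimodal RAG framework that uses collaborative intelligence to resolve complex queries, substantially improving answer accuracy and classification performance.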

Details

Motivation: Conventional single-agent RAG is limited when handling complex queries that span heterogeneous data ecosystems, so a more effective collaborative reasoning approach is needed. Method: A three-tier architecture is used: a Decomposition Agent (semantic query rewriting), Multi-source Retrieval Agents (parallel, modality-specific retrieval), and a Decision Agent (consistency voting with Expert Model Refinement). Result: On the ScienceQA and CrisisMMD benchmarks, answer accuracy improves by 12.95% and question classification accuracy by 3.56%, with state-of-the-art results in zero-shot settings. Conclusion: Through its modular architecture and multi-agent collaboration, HM-RAG substantially improves multimodal reasoning and knowledge synthesis while supporting seamless integration of new data modalities. Abstract: While Retrieval-Augmented Generation (RAG) augments Large Language Models (LLMs) with external knowledge, conventional single-agent RAG remains fundamentally limited in resolving complex queries demanding coordinated reasoning across heterogeneous data ecosystems. We present HM-RAG, a novel Hierarchical Multi-agent Multimodal RAG framework that pioneers collaborative intelligence for dynamic knowledge synthesis across structured, unstructured, and graph-based data. The framework is composed of a three-tiered architecture with specialized agents: a Decomposition Agent that dissects complex queries into contextually coherent sub-tasks via semantic-aware query rewriting and schema-guided context augmentation; Multi-source Retrieval Agents that carry out parallel, modality-specific retrieval using plug-and-play modules designed for vector, graph, and web-based databases; and a Decision Agent that uses consistency voting to integrate multi-source answers and resolve discrepancies in retrieval results through Expert Model Refinement. This architecture attains comprehensive query understanding by combining textual, graph-relational, and web-derived evidence, resulting in a remarkable 12.95% improvement in answer accuracy and a 3.56% boost in question classification accuracy over baseline RAG systems on the ScienceQA and CrisisMMD benchmarks. Notably, HM-RAG establishes state-of-the-art results in zero-shot settings on both datasets. Its modular architecture ensures seamless integration of new data modalities while maintaining strict data governance, marking a significant advancement in addressing the critical challenges of multimodal reasoning and knowledge synthesis in RAG systems. Code is available at https://github.com/ocean-luna/HMRAG.
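The Decision Agent's consistency voting can be pictured with a small sketch. This is a simplified assumption, not the HM-RAG code: `decide` and `expert_stub` are hypothetical names, answers are compared after simple normalization, and the fallback merely stands in for the Expert Model Refinement step described in the abstract.

```python
from collections import Counter


def decide(answers, refine):
    """Return the strict-majority answer, or defer to `refine` when none exists.

    answers maps a retrieval source name to that source's answer string.
    refine is a callable that receives the raw answers and returns a final answer.
    """
    if not answers:
        raise ValueError("at least one answer is required")
    # Normalize so that trivial differences in case or spacing do not split votes.
    counts = Counter(answer.strip().lower() for answer in answers.values())
    top_answer, top_votes = counts.most_common(1)[0]
    if top_votes > len(answers) / 2:
        return top_answer
    return refine(answers)


def expert_stub(answers):
    # Hypothetical stand-in for Expert Model Refinement: trust the graph source.
    return answers["graph"]


agreed = decide({"vector": "B", "graph": "b ", "web": "C"}, expert_stub)
disputed = decide({"vector": "A", "graph": "B", "web": "C"}, expert_stub)
print(agreed, disputed)
```

A strict majority is one plausible voting rule among several; the paper may weight sources or compare answers semantically.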

[124] Span-level Emotion-Cause-Category Triplet Extraction with Instruction Tuning LLMs and Data Augmentation

Xiangju Li,Dong Yang,Xiaogang Zhu,Faliang Huang,Peng Zhang,Zhongying Zhao

Main category: cs.CL

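TL;DR: This paper proposes a fine-grained, LLM-based approach to extracting emotion-cause-category triplets, substantially improving performance through instruction tuning and data augmentation.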

Details

Motivation: Existing work on emotion cause analysis struggles with redundant information retrieval and inaccurate emotion category determination, especially when emotions are expressed implicitly or ambiguously. Method: The approach uses task-specific triplet extraction instructions, fine-tunes large language models with low-rank adaptation, and develops a prompt-based data augmentation strategy to generate high-quality synthetic data. Result: Experiments show an improvement of at least 12.8% on emotion-cause-category triplet extraction metrics over existing baselines. Conclusion: The method is effective and robust and offers a new direction for emotion cause analysis research. Abstract: Span-level emotion-cause-category triplet extraction represents a novel and complex challenge within emotion cause analysis. This task involves identifying emotion spans, cause spans, and their associated emotion categories within the text to form structured triplets. While prior research has predominantly concentrated on clause-level emotion-cause pair extraction and span-level emotion-cause detection, these methods often confront challenges originating from redundant information retrieval and difficulty in accurately determining emotion categories, particularly when emotions are expressed implicitly or ambiguously. To overcome these challenges, this study explores a fine-grained approach to span-level emotion-cause-category triplet extraction and introduces an innovative framework that leverages instruction tuning and data augmentation techniques based on large language models. The proposed method employs task-specific triplet extraction instructions and utilizes low-rank adaptation to fine-tune large language models, eliminating the necessity for intricate task-specific architectures. Furthermore, a prompt-based data augmentation strategy is developed to address data scarcity by guiding large language models in generating high-quality synthetic training data. Extensive experimental evaluations demonstrate that the proposed approach significantly outperforms existing baseline methods, achieving at least a 12.8% improvement in span-level emotion-cause-category triplet extraction metrics. The results demonstrate the method's effectiveness and robustness, offering a promising avenue for advancing research in emotion cause analysis. The source code is available at https://github.com/zxgnlp/InstruDa-LLM.

[125] Can the capability of Large Language Models be described by human ability? A Meta Study

Mingrui Zan,Yunquan Zhang,Boyang Zhang,Fangming Liu,Daning Cheng

Main category: cs.CL

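TL;DR: By analyzing the performance of more than 80 LLMs on 37 evaluation benchmarks, the paper examines how closely LLM capabilities resemble human abilities. It finds that some capabilities can be described with human ability metrics, that some abilities interrelated in humans are nearly uncorrelated in LLMs, and that capabilities vary significantly with parameter scale.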

Details

Motivation: To examine whether LLM capabilities truly approximate human abilities and to quantify their performance. Method: Performance data for more than 80 LLMs on 37 benchmarks are collected, categorized into 6 primary and 11 sub-abilities of human ability, and analyzed by clustering. Result: Some LLM capabilities can be described with human ability metrics; some abilities that are interrelated in humans are nearly uncorrelated in LLMs; and LLM capabilities vary significantly with parameter scale. Conclusion: Some LLM capabilities resemble human abilities, but there are notable differences, and capability is affected by parameter scale. Abstract: Users of Large Language Models (LLMs) often perceive these models as intelligent entities with human-like capabilities. However, the extent to which LLMs' capabilities truly approximate human abilities remains a topic of debate. In this paper, to characterize the capabilities of LLMs in relation to human capabilities, we collected performance data from over 80 models across 37 evaluation benchmarks. The evaluation benchmarks are categorized into 6 primary abilities and 11 sub-abilities from the perspective of human ability. We then clustered the performance rankings into several categories and compared these clustering results with classifications based on human ability aspects. Our findings lead to the following conclusions: 1. We have confirmed that certain capabilities of LLMs with fewer than 10 billion parameters can indeed be described using human ability metrics; 2. While some abilities are considered interrelated in humans, they appear nearly uncorrelated in LLMs; 3. The capabilities possessed by LLMs vary significantly with the parameter scale of the model.

[126] Meta-Evaluating Local LLMs: Rethinking Performance Metrics for Serious Games

Andrés Isaza-Giraldo,Paulo Bala,Lucas Pereira

Main category: cs.CL

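TL;DR: The study examines the reliability of five small LLMs in evaluating player responses in the game En-join, finds trade-offs between sensitivity and specificity, and stresses the importance of context-aware evaluation frameworks.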

Details

Motivation: Because the correctness of open-ended responses is often subjective, the study sets out to verify the accuracy and consistency of small LLMs as evaluators. Method: Traditional binary classification metrics (accuracy, true positive rate, true negative rate) are used to systematically compare five small LLMs across different evaluation scenarios. Result: The models vary in how well they identify correct responses; some are prone to false positives or inconsistent evaluations, so sensitivity must be traded off against specificity. Conclusion: Deploying LLMs as evaluators calls for careful model selection and context-aware evaluation frameworks; the work offers insights into the trustworthiness of AI-driven assessment tools. Abstract: The evaluation of open-ended responses in serious games presents a unique challenge, as correctness is often subjective. Large Language Models (LLMs) are increasingly being explored as evaluators in such contexts, yet their accuracy and consistency remain uncertain, particularly for smaller models intended for local execution. This study investigates the reliability of five small-scale LLMs when assessing player responses in En-join, a game that simulates decision-making within energy communities. By leveraging traditional binary classification metrics (including accuracy, true positive rate, and true negative rate), we systematically compare these models across different evaluation scenarios. Our results highlight the strengths and limitations of each model, revealing trade-offs between sensitivity, specificity, and overall performance. We demonstrate that while some models excel at identifying correct responses, others struggle with false positives or inconsistent evaluations. The findings highlight the need for context-aware evaluation frameworks and careful model selection when deploying LLMs as evaluators. This work contributes to the broader discourse on the trustworthiness of AI-driven assessment tools, offering insights into how different LLM architectures handle subjective evaluation tasks.
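The three metrics named in this abstract follow directly from the confusion-matrix counts. The sketch below is a minimal illustration with made-up labels (not data from the paper); `binary_metrics` is a hypothetical helper name, 1 marks a response that human raters consider correct, and the predictions represent an LLM evaluator's verdicts.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, true positive rate, and true negative rate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Sensitivity: share of truly correct responses the evaluator accepts.
        "tpr": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Specificity: share of truly incorrect responses the evaluator rejects.
        "tnr": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Illustrative labels only.
human_labels = [1, 1, 1, 0, 0, 0, 0, 1]
llm_verdicts = [1, 1, 0, 1, 0, 0, 1, 1]
metrics = binary_metrics(human_labels, llm_verdicts)
print(metrics)
```

An evaluator with a high true positive rate but a low true negative rate accepts most correct answers yet also lets many incorrect ones through, which is the false-positive problem the abstract describes.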

[127] QM-ToT: A Medical Tree of Thoughts Reasoning Framework for Quantized Model

Zongxian Yang,Jiayu Qian,Zhi-An Huang,Kay Chen Tan

Main category: cs.CL

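TL;DR: The paper proposes the QM-ToT framework, which improves quantized LLMs on biomedical tasks through path-based reasoning and evaluator layers, substantially raising accuracy on the MedQA-USMLE dataset.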

Details

Motivation: Existing LLMs perform poorly on complex medical tasks, and their performance drops noticeably when they are quantized for deployment, so a new method is needed to improve them. Method: The QM-ToT framework combines Tree of Thought reasoning to decompose medical problems with evaluator layers to improve quantized model performance. Result: LLaMA2-70b accuracy rises from 34% to 50% and LLaMA-3.1-8b from 58.77% to 69.49%; the ToT-based data distillation method achieves an 86.27% improvement using only 3.9% of the data. Conclusion: QM-ToT shows for the first time the potential of ToT on complex biomedical tasks, laying a foundation for deploying high-performing quantized LLMs in resource-limited medical settings. Abstract: Large language models (LLMs) face significant challenges in specialized biomedical tasks due to the inherent complexity of medical reasoning and the sensitive nature of clinical data. Existing LLMs often struggle with intricate medical terminology and the need for accurate clinical insights, leading to performance reduction when quantized for resource-constrained deployment. To address these issues, we propose Quantized Medical Tree of Thought (QM-ToT), a path-based reasoning framework. QM-ToT leverages a Tree of Thought (ToT) reasoning approach to decompose complex medical problems into manageable subtasks, coupled with evaluator assessment layers. This framework facilitates substantial performance improvements in INT4-quantized models on the challenging MedQA-USMLE dataset. Specifically, we demonstrate a remarkable accuracy increase from 34% to 50% for the LLaMA2-70b model and from 58.77% to 69.49% for LLaMA-3.1-8b. We also propose an effective data distillation method based on ToT. Compared to the traditional distillation method, we achieved an improvement of 86.27% while using only 3.9% of the data. This work, for the first time, showcases the potential of ToT to significantly enhance performance on complex biomedical tasks, establishing a crucial foundation for future advances in deploying high-performing quantized LLMs in resource-limited medical settings.

[128] You've Changed: Detecting Modification of Black-Box Large Language Models

Alden Dima,James Foulds,Shimei Pan,Philip Feldman

Main category: cs.CL

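TL;DR: Proposes a method for monitoring changes in large language models (LLMs) by comparing the distributions of linguistic and psycholinguistic features of generated text, and validates its effectiveness.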

Details

Motivation: Because LLMs are usually offered as a service through an API, developers find it hard to detect changes in their behavior, so an efficient way to monitor model changes is needed. Method: A statistical test compares the feature distributions of two text samples to determine whether the LLM has changed. Result: Experiments show that simple text features combined with a statistical test can distinguish between language models and can also detect prompt injection attacks. Conclusion: The method offers an efficient, computationally cheap way to monitor LLM changes frequently. Abstract: Large Language Models (LLMs) are often provided as a service via an API, making it challenging for developers to detect changes in their behavior. We present an approach to monitor LLMs for changes by comparing the distributions of linguistic and psycholinguistic features of generated text. Our method uses a statistical test to determine whether the distributions of features from two samples of text are equivalent, allowing developers to identify when an LLM has changed. We demonstrate the effectiveness of our approach using five OpenAI completion models and Meta's Llama 3 70B chat model. Our results show that simple text features coupled with a statistical test can distinguish between language models. We also explore the use of our approach to detect prompt injection attacks. Our work enables frequent LLM change monitoring and avoids computationally expensive benchmark evaluations.
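A two-sample comparison of this kind can be sketched in a few lines. The abstract does not say which features or which statistical test the authors use, so everything here is an assumption made for illustration: the feature is mean word length, the test is a permutation test on the difference of means, and `mean_word_length` and `permutation_p_value` are hypothetical helper names.

```python
import random
from statistics import mean


def mean_word_length(text):
    """One simple text feature: average number of characters per word."""
    words = text.split()
    return sum(len(word) for word in words) / len(words)


def permutation_p_value(sample_a, sample_b, n_perm=2000, seed=0):
    """Two-sided permutation test on the absolute difference of means."""
    rng = random.Random(seed)
    observed = abs(mean(sample_a) - mean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(sample_a)]) - mean(pooled[len(sample_a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


# Illustrative outputs collected from the same prompt set at two points in time.
earlier = ["the cat sat on a mat", "we go to the zoo", "it is a big red bus",
           "he has an old hat", "she ran up the hill", "my dog ate the pie"]
later = ["considerable organizational restructuring",
         "comprehensive methodological considerations",
         "unprecedented technological transformation",
         "multidimensional epistemological frameworks",
         "institutional infrastructure modernization",
         "interdisciplinary collaborative investigations"]

features_earlier = [mean_word_length(text) for text in earlier]
features_later = [mean_word_length(text) for text in later]
p_changed = permutation_p_value(features_earlier, features_later)
p_same = permutation_p_value(features_earlier, features_earlier)
print(p_changed, p_same)
```

A small p-value suggests the two samples come from different distributions, which a developer could treat as a signal that the served model has changed; a practical monitor would combine many features and correct for multiple comparisons.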

Anna-Carolina Haensch

Main category: cs.CL

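TL;DR: The study analyzes how users engage with large language models (LLMs) as mental health tools, finding generally positive attitudes alongside concerns about privacy and professional oversight.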

Details

Motivation: To explore the role of generative AI chatbots such as ChatGPT in informal mental health support and how users respond to them. Method: More than 10,000 TikTok comments are analyzed with a tiered coding schema and supervised classification models to identify user experiences and themes. Result: Nearly 20% of comments reflect personal use, and these users express positive attitudes; the main benefits are accessibility and emotional support, while privacy and the lack of professional oversight are the main concerns. Conclusion: AI is increasingly relevant to mental health support, but stronger clinical and ethical scrutiny is needed. Abstract: The emergence of generative AI chatbots such as ChatGPT has prompted growing public and academic interest in their role as informal mental health support tools. While early rule-based systems have been around for several years, large language models (LLMs) offer new capabilities in conversational fluency, empathy simulation, and availability. This study explores how users engage with LLMs as mental health tools by analyzing over 10,000 TikTok comments from videos referencing LLMs as mental health tools. Using a self-developed tiered coding schema and supervised classification models, we identify user experiences, attitudes, and recurring themes. Results show that nearly 20% of comments reflect personal use, with these users expressing overwhelmingly positive attitudes. Commonly cited benefits include accessibility, emotional support, and perceived therapeutic value. However, concerns around privacy, generic responses, and the lack of professional oversight remain prominent. It is important to note that the user feedback does not indicate which therapeutic framework, if any, the LLM-generated output aligns with. While the findings underscore the growing relevance of AI in everyday practices, they also highlight the urgent need for clinical and ethical scrutiny in the use of AI for mental health support.

[130] Paging Dr. GPT: Extracting Information from Clinical Notes to Enhance Patient Predictions

David Anderson,Michaela Anderson,Margret Bjarnadottir,Stephen Mahar,Shriyan Reyya

Main category: cs.CL

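TL;DR: The study examines whether GPT-4o-mini's answers to clinical questions, generated from patients' discharge summaries, improve mortality prediction. GPT-based models outperform traditional tabular-data models, and combining the two improves predictive performance further.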

Details

Motivation: Traditional predictive models fail to make full use of the information in unstructured clinical notes, a gap that large language models such as GPT may fill. Method: Using data from 14,011 first-time admissions in the MIMIC-IV Note dataset, GPT responses serve as input features to logistic regression models. Result: GPT-based models alone outperform traditional tabular-data models; combining both raises AUC by 5.1 percentage points on average and positive predictive value for the highest-risk decile by 29.9 percent. Conclusion: Large language models add significant value to clinical prediction tasks, especially where unstructured text data remains underused. Abstract: There is a long history of building predictive models in healthcare using tabular data from electronic medical records. However, these models fail to extract the information found in unstructured clinical notes, which document diagnosis, treatment, progress, medications, and care plans. In this study, we investigate how answers generated by GPT-4o-mini (ChatGPT) to simple clinical questions about patients, when given access to the patient's discharge summary, can support patient-level mortality prediction. Using data from 14,011 first-time admissions to the Coronary Care or Cardiovascular Intensive Care Units in the MIMIC-IV Note dataset, we implement a transparent framework that uses GPT responses as input features in logistic regression models. Our findings demonstrate that GPT-based models alone can outperform models trained on standard tabular data, and that combining both sources of information yields even greater predictive power, increasing AUC by an average of 5.1 percentage points and increasing positive predictive value by 29.9 percent for the highest-risk decile. These results highlight the value of integrating large language models (LLMs) into clinical prediction tasks and underscore the broader potential for using LLMs in any domain where unstructured text data remains an underutilized resource.

[131] GOAT-TTS: LLM-based Text-To-Speech Generation Optimized via A Dual-Branch Architecture

Yaodong Song,Hongjie Chen,Jie Lian,Yuxin Zhang,Guangmin Xia,Zehan Li,Genliang Zhao,Jian Kang,Yongxiang Li,Jie Li

Main category: cs.CL

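TL;DR: GOAT-TTS is an LLM-based dual-branch architecture that addresses three challenges of current TTS models: loss of acoustic characteristics, dependence on aligned data, and forgetting of language understanding.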

Details

Motivation: Current TTS models suffer from loss of acoustic characteristics, dependence on aligned data, and forgetting of language understanding, which limits practical use. Method: A dual-branch architecture is proposed: a modality-alignment branch captures continuous acoustic embeddings, and a speech-generation branch predicts speech tokens through modular fine-tuning. Result: GOAT-TTS performs comparably to state-of-the-art TTS models, and the efficacy of synthesized dialect speech data is validated. Conclusion: The dual-branch design addresses key TTS problems and offers a viable path to real-world deployment. Abstract: While large language models (LLMs) have revolutionized text-to-speech (TTS) synthesis through discrete tokenization paradigms, current architectures exhibit fundamental tensions between three critical dimensions: 1) irreversible loss of acoustic characteristics caused by quantization of speech prompts; 2) stringent dependence on precisely aligned prompt speech-text pairs that limit real-world deployment; and 3) catastrophic forgetting of the LLM's native text comprehension during optimization for speech token generation. To address these challenges, we propose an LLM-based text-to-speech Generation approach Optimized via a novel dual-branch ArchiTecture (GOAT-TTS). Our framework introduces two key innovations: (1) The modality-alignment branch combines a speech encoder and projector to capture continuous acoustic embeddings, enabling bidirectional correlation between paralinguistic features (language, timbre, emotion) and semantic text representations without transcript dependency; (2) The speech-generation branch employs modular fine-tuning on top-k layers of an LLM for speech token prediction while freezing the bottom-k layers to preserve foundational linguistic knowledge. Moreover, multi-token prediction is introduced to support real-time streaming TTS synthesis. Experimental results demonstrate that our GOAT-TTS achieves performance comparable to state-of-the-art TTS models while validating the efficacy of synthesized dialect speech data.

[132] Streamlining Biomedical Research with Specialized LLMs

Linqing Chen,Weilei Wang,Yubin Xia,Wentao Wu,Peng Xu,Zilong Bai,Jie Fang,Chaobo Xu,Ran Hu,Licong Xu,Haoran Hua,Jing Sun,Hanmeng Zhong,Jin Liu,Tian Qiu,Haowen Liu,Meng Hu,Xiuwen Li,Fei Gao,Yong Gu,Tao Shi,Chaochao Wang,Jianping Lu,Cheng Sun,Yixin Wang,Shengjie Yang,Yuancheng Li,Lu Jin,Lisha Zhang,Fu Bian,Zhongkai Ye,Lidong Pei,Changyang Tu

Main category: cs.CL

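TL;DR: Proposes a new system that combines domain-specific large language models with advanced information retrieval to provide comprehensive, context-aware responses and markedly improve dialogue generation quality.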

Details

Motivation: To improve research efficiency and decision speed in the biomedical and pharmaceutical domains by integrating multimodal data and cross-component interaction. Method: Domain-specific large language models are integrated with information retrieval techniques to produce multimodal outputs with cross-validation. Result: The system markedly improves response precision and dialogue generation quality and supports real-time, high-fidelity interaction. Conclusion: The system gives biomedical and pharmaceutical professionals an efficient research tool and speeds up R&D decision-making. Abstract: In this paper, we propose a novel system that integrates state-of-the-art, domain-specific large language models with advanced information retrieval techniques to deliver comprehensive and context-aware responses. Our approach facilitates seamless interaction among diverse components, enabling cross-validation of outputs to produce accurate, high-quality responses enriched with relevant data, images, tables, and other modalities. We demonstrate the system's capability to enhance response precision by leveraging a robust question-answering model, significantly improving the quality of dialogue generation. The system provides an accessible platform for real-time, high-fidelity interactions, allowing users to benefit from efficient human-computer interaction, precise retrieval, and simultaneous access to a wide range of literature and data. This dramatically improves the research efficiency of professionals in the biomedical and pharmaceutical domains and facilitates faster, more informed decision-making throughout the R&D process. Furthermore, the system proposed in this paper is available at https://synapse-chat.patsnap.com.

[133] Benchmarking Biopharmaceuticals Retrieval-Augmented Generation Evaluation

Hanmeng Zhong,Linqing Chen,Weilei Wang,Wentao Wu

Main category: cs.CL

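TL;DR: The paper introduces BRAGE, the first retrieval-augmented generation evaluation benchmark for the biopharmaceutical domain, and designs a citation-based classification method to evaluate LLMs' Query and Reference Understanding Capability (QRUC).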

Details

Motivation: No benchmark currently exists for evaluating retrieval-augmented LLMs specifically in the biopharmaceutical domain, and traditional QA metrics fall short in open-ended retrieval-augmented QA scenarios. Method: The authors introduce the BRAGE benchmark and design a citation-based classification method to evaluate QRUC. Result: Experiments show a significant gap in the biopharmaceutical QRUC of mainstream LLMs, which needs to be improved. Conclusion: BRAGE fills the gap in evaluating LLMs for the biopharmaceutical domain, and the proposed method provides a more effective way to evaluate open-ended retrieval-augmented QA. Abstract: Recently, the application of the retrieval-augmented Large Language Models (LLMs) in specific domains has gained significant attention, especially in biopharmaceuticals. However, in this context, there is no benchmark specifically designed for biopharmaceuticals to evaluate LLMs. In this paper, we introduce the Biopharmaceuticals Retrieval-Augmented Generation Evaluation (BRAGE), the first benchmark tailored for evaluating LLMs' Query and Reference Understanding Capability (QRUC) in the biopharmaceutical domain, available in English, French, German and Chinese. In addition, traditional Question-Answering (QA) metrics like accuracy and exact match fall short in the open-ended retrieval-augmented QA scenarios. To address this, we propose a citation-based classification method to evaluate the QRUC of LLMs to understand the relationship between queries and references. We apply this method to evaluate the mainstream LLMs on BRAGE. Experimental results show that there is a significant gap in the biopharmaceutical QRUC of mainstream LLMs, and their QRUC needs to be improved.

[134] Propaganda via AI? A Study on Semantic Backdoors in Large Language Models

Nay Myat Min,Long H. Pham,Yige Li,Jun Sun

Main category: cs.CL

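TL;DR: The paper presents RAVEN, a black-box framework for detecting semantic backdoor attacks in large language models (LLMs); using semantic entropy and cross-model consistency analysis, it exposes concept-level backdoor vulnerabilities that traditional defenses overlook.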

Details

Motivation: Large language models perform well across many tasks but remain vulnerable to backdoor attacks, especially semantic backdoors (triggered by concepts rather than specific words) that traditional defenses struggle to detect. Method: Semantic backdoors are implanted in a controlled fine-tuning setting, and the RAVEN framework is developed; it combines semantic entropy with cross-model consistency analysis, detecting anomalous outputs through topic-perspective prompts and clustering of multi-model responses. Result: Experiments show that RAVEN detects semantic backdoors across several LLM families (GPT-4o, Llama, and others), providing the first evidence that these hidden vulnerabilities exist. Conclusion: The study stresses the need for concept-level auditing and open-sources its code and data as tools for future defense research. Abstract: Large language models (LLMs) demonstrate remarkable performance across myriad language tasks, yet they remain vulnerable to backdoor attacks, where adversaries implant hidden triggers that systematically manipulate model outputs. Traditional defenses focus on explicit token-level anomalies and therefore overlook semantic backdoors: covert triggers embedded at the conceptual level (e.g., ideological stances or cultural references) that rely on meaning-based cues rather than lexical oddities. We first show, in a controlled finetuning setting, that such semantic backdoors can be implanted with only a small poisoned corpus, establishing their practical feasibility. We then formalize the notion of semantic backdoors in LLMs and introduce a black-box detection framework, RAVEN (short for "Response Anomaly Vigilance for uncovering semantic backdoors"), which combines semantic entropy with cross-model consistency analysis. The framework probes multiple models with structured topic-perspective prompts, clusters the sampled responses via bidirectional entailment, and flags anomalously uniform outputs; cross-model comparison isolates model-specific anomalies from corpus-wide biases. Empirical evaluations across diverse LLM families (GPT-4o, Llama, DeepSeek, Mistral) uncover previously undetected semantic backdoors, providing the first proof-of-concept evidence of these hidden vulnerabilities and underscoring the urgent need for concept-level auditing of deployed language models. We open-source our code and data at https://github.com/NayMyatMin/RAVEN.
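The clustering and entropy steps in this abstract can be illustrated with a small sketch. It is not the RAVEN implementation: `cluster_by_entailment`, `semantic_entropy`, and `toy_entails` are hypothetical names, and the entailment check is a keyword stub standing in for a real natural language inference model. Responses are grouped when each entails the other, and the entropy of the cluster sizes measures how varied the meanings are; entropy near zero corresponds to the "anomalously uniform outputs" the framework flags.

```python
import math


def cluster_by_entailment(responses, entails):
    """Greedy clustering: a response joins the first cluster whose
    representative it entails and is entailed by (bidirectional entailment)."""
    clusters = []
    for response in responses:
        for cluster in clusters:
            representative = cluster[0]
            if entails(response, representative) and entails(representative, response):
                cluster.append(response)
                break
        else:
            clusters.append([response])
    return clusters


def semantic_entropy(clusters):
    """Shannon entropy (in nats) of the distribution over meaning clusters."""
    total = sum(len(cluster) for cluster in clusters)
    return -sum((len(c) / total) * math.log(len(c) / total) for c in clusters)


def toy_entails(a, b):
    # Stub for an NLI model (assumption): same stance keyword means same meaning.
    return ("support" in a.lower()) == ("support" in b.lower())


uniform = ["I support the policy", "We support it fully",
           "Strong support here", "Support, without doubt"]
mixed = ["I support the policy", "I oppose the policy",
         "We support it", "We oppose it"]

h_uniform = semantic_entropy(cluster_by_entailment(uniform, toy_entails))
h_mixed = semantic_entropy(cluster_by_entailment(mixed, toy_entails))
print(h_uniform, h_mixed)
```

Low entropy on a contested topic is only a signal; the cross-model comparison described in the abstract is what separates a model-specific anomaly from a bias shared across models.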

[135] Reimagining Urban Science: Scaling Causal Inference with Large Language Models

Yutong Xia,Ao Qu,Yunhan Zheng,Yihong Tang,Dingyi Zhuang,Yuxuan Liang,Cathy Wu,Roger Zimmermann,Jinhua Zhao

Main category: cs.CL

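TL;DR: The paper proposes the AutoUrbanCI framework, which uses large language models (LLMs) to improve urban causal research by addressing hypothesis generation, data complexity, and related problems, with an emphasis on human-AI collaboration and equity.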

Details

Motivation: Urban causal research faces inefficient hypothesis generation, data complexity, and methodological fragility; LLMs offer a new opportunity to improve it. Method: The AutoUrbanCI framework comprises four modular agents: hypothesis generation, data engineering, experiment design and execution, and results interpretation with policy recommendations. Result: The framework aims to improve the rigor and transparency of research and to enable more inclusive urban causal reasoning. Conclusion: The authors call for using AI as a tool rather than a replacement, to broaden participation, improve reproducibility, and advance more equitable urban research. Abstract: Urban causal research is essential for understanding the complex dynamics of cities and informing evidence-based policies. However, it is challenged by the inefficiency and bias of hypothesis generation, barriers to multimodal data complexity, and the methodological fragility of causal experimentation. Recent advances in large language models (LLMs) present an opportunity to rethink how urban causal analysis is conducted. This Perspective examines current urban causal research by analyzing taxonomies that categorize research topics, data sources, and methodological approaches to identify structural gaps. We then introduce an LLM-driven conceptual framework, AutoUrbanCI, composed of four distinct modular agents responsible for hypothesis generation, data engineering, experiment design and execution, and results interpretation with policy recommendations. We propose evaluation criteria for rigor and transparency and reflect on implications for human-AI collaboration, equity, and accountability. We call for a new research agenda that embraces AI-augmented workflows not as replacements for human expertise but as tools to broaden participation, improve reproducibility, and unlock more inclusive forms of urban causal reasoning.

[136] Mathematical Capabilities of Large Language Models in Finnish Matriculation Examination

Mika Setälä,Pieta Sikström,Ville Heilala,Tommi Kärkkäinen

Main category: cs.CL

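TL;DR: The study evaluates the mathematical reasoning of large language models (LLMs) and finds that their mathematical ability improved markedly as the models evolved, with some achieving perfect scores.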

Details

Motivation: To explore the potential of LLMs in mathematics education and their use in high-stakes examinations. Method: The Finnish matriculation examination, a high-stakes digital test, is used to evaluate the mathematical capabilities of various LLMs. Result: Initial performance was moderate; later models improved substantially, and some achieved perfect scores, matching top student performance. Conclusion: The mathematical ability of LLMs is advancing rapidly, showing their potential for educational assessment at scale. Abstract: Large language models (LLMs) have shown increasing promise in educational settings, yet their mathematical reasoning has been considered evolving. This study evaluates the mathematical capabilities of various LLMs using the Finnish matriculation examination, a high-stakes digital test for upper secondary education. Initial tests yielded moderate performance corresponding to mid-range grades, but later evaluations demonstrated substantial improvements as the language models evolved. Remarkably, some models achieved near-perfect or perfect scores, matching top student performance and qualifying for university admission. Our findings highlight the rapid advances in the mathematical proficiency of LLMs and illustrate their potential to also support educational assessments at scale.

[137] A Large-Language Model Framework for Relative Timeline Extraction from PubMed Case Reports

Jing Wang,Jeremy C Weiss

Main category: cs.CL

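TL;DR: The paper presents a system that converts case reports into structured time series data and compares manual with LLM annotation, finding that LLMs achieve moderate event recall but high temporal concordance.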

Details

Motivation: Electronic health records lack timing information for key events, which limits the analysis of patient trajectories. Method: A system is developed to convert case reports into textual time series, and manual and LLM annotations are compared. Result: LLM event recall is moderate (0.80), but temporal concordance is high (0.95). Conclusion: The study provides a benchmark for temporal analytics on the PMOA corpus. Abstract: Timing of clinical events is central to characterization of patient trajectories, enabling analyses such as process tracing, forecasting, and causal reasoning. However, structured electronic health records capture few data elements critical to these tasks, while clinical reports lack temporal localization of events in structured form. We present a system that transforms case reports into textual time series: structured pairs of textual events and timestamps. We contrast manual and large language model (LLM) annotations (n=320 and n=390 respectively) of ten randomly-sampled PubMed open-access (PMOA) case reports (N=152,974) and assess inter-LLM agreement (n=3,103; N=93). We find that the LLMs have moderate event recall (O1-preview: 0.80) but high temporal concordance among identified events (O1-preview: 0.95). By establishing the task, annotation, and assessment systems, and by demonstrating high concordance, this work may serve as a benchmark for leveraging the PMOA corpus for temporal analytics.
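Temporal concordance can be read as the share of event pairs whose relative order matches between two annotations. The abstract does not spell out its exact definition, so the pairwise formulation below is an assumption in the style of a concordance index, and `temporal_concordance` is a hypothetical helper; timestamps are hours relative to admission and are illustrative only.

```python
from itertools import combinations


def temporal_concordance(reference, predicted):
    """Fraction of event pairs ordered the same way in both annotations.

    reference and predicted hold timestamps for the same matched events.
    Pairs tied in the reference are skipped; a pair tied only in the
    prediction counts as discordant.
    """
    concordant = 0
    comparable = 0
    for i, j in combinations(range(len(reference)), 2):
        ref_diff = reference[i] - reference[j]
        if ref_diff == 0:
            continue
        comparable += 1
        pred_diff = predicted[i] - predicted[j]
        if ref_diff * pred_diff > 0:
            concordant += 1
    return concordant / comparable if comparable else float("nan")


# Illustrative timestamps for four matched events.
expert_hours = [0, 24, 48, 72]
llm_hours = [0, 48, 24, 72]  # the two middle events are swapped
score = temporal_concordance(expert_hours, llm_hours)
print(round(score, 3))
```

Because only matched events enter the calculation, concordance can be high even when recall is moderate, which is consistent with the pattern reported in the abstract.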

Muhammad Ahmad,Muhammad Waqas,Ildar Batyrshin,Grigori Sidorov

Main category: cs.CL

TL;DR: An AI-driven NLP framework detects drug misuse and overdose symptoms from social media data, reaching up to 98% accuracy.

Details

Motivation: Traditional research methods have limitations for monitoring drug misuse, whereas social media provides real-time data. Method: A hybrid annotation strategy combining LLMs and human annotators is used, with traditional ML, neural network, and transformer-based models. Result: Accuracy reaches 98% for multi-class and 97% for multi-label classification, outperforming baseline models by up to 8%. Conclusion: AI shows potential for public health surveillance and personalized intervention. Abstract: Drug overdose remains a critical global health issue, often driven by misuse of opioids, painkillers, and psychiatric medications. Traditional research methods face limitations, whereas social media offers real-time insights into self-reported substance use and overdose symptoms. This study proposes an AI-driven NLP framework trained on annotated social media data to detect commonly used drugs and associated overdose symptoms. Using a hybrid annotation strategy with LLMs and human annotators, we applied traditional ML models, neural networks, and advanced transformer-based models. Our framework achieved 98% accuracy in multi-class and 97% in multi-label classification, outperforming baseline models by up to 8%. These findings highlight the potential of AI for supporting public health surveillance and personalized intervention strategies.

[139] Replicating ReLM Results: Validating Large Language Models with ReLM

Reece Adamson,Erin Song

Main category: cs.CL

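TL;DR: ReLM uses formal languages to evaluate and control large language models (LLMs) for memorization, bias, and zero-shot performance, addressing the slowness, imprecision, and added bias of existing approaches.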

Details

Motivation: Current methods for evaluating LLM behavior are slow, imprecise, or introduce bias of their own, yet this behavior matters greatly when LLMs are put into production. Method: The project applies ReLM, which evaluates and controls LLMs using formal languages, and reproduces key results from the original paper. Result: The reproduction validates the effectiveness of ReLM for evaluating LLM behavior and expands on its applications to systems for machine learning. Conclusion: ReLM offers an efficient and precise way to evaluate LLM behavior and is relevant to the field of systems for machine learning. Abstract: Validating Large Language Models with ReLM explores the application of formal languages to evaluate and control Large Language Models (LLMs) for memorization, bias, and zero-shot performance. Current approaches for evaluating these types of behavior are often slow, imprecise, costly, or introduce biases of their own, but are necessary due to the importance of this behavior when productionizing LLMs. This project reproduces key results from the original ReLM paper and expounds on the approach and applications with an emphasis on the relevance to the field of systems for machine learning.

[140] A Method for Handling Negative Similarities in Explainable Graph Spectral Clustering of Text Documents -- Extended Version

Mieczysław A. Kłopotek,Sławomir T. Wierzchoń,Bartłomiej Starosta,Dariusz Czerski,Piotr Borkowski

Main category: cs.CL

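TL;DR: This paper studies negative similarities in graph spectral clustering, compares six solutions, and analyzes their advantages and disadvantages. Experiments show that GloVe embeddings cause normalized Laplacian methods to fail, while methods that cure negative similarities improve accuracy.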

Details

Motivation: Document embeddings such as doc2vec and GloVe differ from the traditional Term Vector Space and produce negative similarities; the paper explores how graph spectral clustering can handle them. Method: Discusses solutions for combinatorial and normalized Laplacians and experimentally compares six different methods. Result: GloVe embeddings frequently cause normalized Laplacian methods to fail, but methods that cure negative similarities improve accuracy and extend the applicability of the authors' earlier explanation methods. Conclusion: Curing negative similarities improves the accuracy of graph spectral clustering and broadens the range of embeddings the method applies to. Abstract: This paper investigates the problem of Graph Spectral Clustering with negative similarities, resulting from document embeddings different from the traditional Term Vector Space (like doc2vec, GloVe, etc.). Solutions for combinatorial Laplacians and normalized Laplacians are discussed. An experimental investigation shows the advantages and disadvantages of 6 different solutions proposed in the literature and in this research. The research demonstrates that GloVe embeddings frequently cause failures of normalized Laplacian based GSC due to negative similarities. Furthermore, application of methods curing similarity negativity leads to accuracy improvement for both combinatorial and normalized Laplacian based GSC. It also leads to applicability for GloVe embeddings of explanation methods developed originally by the authors for Term Vector Space embeddings.
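To see why negative similarities can break the normalized Laplacian, consider a small sketch. With L = I - D^{-1/2} S D^{-1/2}, a node whose similarities sum to zero or less has no real D^{-1/2} entry. The remedy shown, shifting every similarity by a constant, is a generic illustration; the abstract does not say which six solutions the paper compares, and the matrix values below are made up.

```python
import math


def degrees(S):
    """Row sums of a similarity matrix, ignoring the diagonal."""
    n = len(S)
    return [sum(S[i][j] for j in range(n) if j != i) for i in range(n)]


def normalized_laplacian(S):
    """L = I - D^{-1/2} S D^{-1/2} for a graph without self-loops."""
    n = len(S)
    d = degrees(S)
    if any(x <= 0 for x in d):
        raise ValueError("non-positive degree: D^{-1/2} is undefined")
    inv_sqrt = [1.0 / math.sqrt(x) for x in d]
    return [
        [1.0 if i == j else -inv_sqrt[i] * S[i][j] * inv_sqrt[j] for j in range(n)]
        for i in range(n)
    ]


def shift(S, c=1.0):
    """Assumed remedy: add a constant so cosine values in [-1, 1] become non-negative."""
    return [[v + c for v in row] for row in S]


# Hypothetical cosine similarities between three documents.
S = [
    [1.0, 0.2, -0.6],
    [0.2, 1.0, -0.1],
    [-0.6, -0.1, 1.0],
]

try:
    normalized_laplacian(S)
    raw_ok = True
except ValueError:
    raw_ok = False  # degrees of nodes 0 and 2 are negative

L = normalized_laplacian(shift(S))  # succeeds once all degrees are positive
```

Simply clipping negatives to zero would not help here: node 2 has no positive similarity to any other node, so its degree would become zero and the same error would occur.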

[141] Position: The Most Expensive Part of an LLM should be its Training Data

Nikhil Kandpal,Colin Raffel

Main category: cs.CL

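TL;DR: The paper argues that producers of LLM training data should be compensated, estimates the data production cost for 64 LLMs, and finds that it far exceeds the cost of training the models.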

Details

Motivation: The value of the human labor behind LLM training data is overlooked; the paper aims to quantify this cost and push for fair compensation. Method: Studies 64 LLMs released between 2016 and 2024 and estimates the labor cost of producing their training data from scratch. Result: Data production costs are 10-1000 times the model training costs, highlighting a financial liability for LLM providers. Conclusion: Calls for future research toward fairer compensation practices for data production. Abstract: Training a state-of-the-art Large Language Model (LLM) is an increasingly expensive endeavor due to growing computational, hardware, energy, and engineering demands. Yet, an often-overlooked (and seldom paid) expense is the human labor behind these models' training data. Every LLM is built on an unfathomable amount of human effort: trillions of carefully written words sourced from books, academic papers, codebases, social media, and more. This position paper aims to assign a monetary value to this labor and argues that the most expensive part of producing an LLM should be the compensation provided to training data producers for their work. To support this position, we study 64 LLMs released between 2016 and 2024, estimating what it would cost to pay people to produce their training datasets from scratch. Even under highly conservative estimates of wage rates, the costs of these models' training datasets are 10-1000 times larger than the costs to train the models themselves, representing a significant financial liability for LLM providers. In the face of the massive gap between the value of training data and the lack of compensation for its creation, we highlight and discuss research directions that could enable fairer practices in the future.

[142] On Linear Representations and Pretraining Data Frequency in Language Models

Jack Merullo,Noah A. Smith,Sarah Wiegreffe,Yanai Elazar

Main category: cs.CL

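TL;DR: Studies the relationship between pretraining data frequency and linear representations in language models, finds that the formation of linear representations is strongly correlated with data frequency, and proposes a new method for predicting pretraining data frequency.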

Details

Motivation: To explore how pretraining data shapes the linear representations of language models and so understand the relationship between data frequency and model behavior. Method: Analyzes the correlation between pretraining data frequency and linear representations, and trains a regression model to predict pretraining data frequency. Result: The formation of linear representations is strongly correlated with data frequency, and the regression model predicts pretraining data frequency effectively. Conclusion: The strength of linear representations reveals properties of the pretraining data and offers new avenues for improving model behavior. Abstract: Pretraining data has a direct impact on the behaviors and quality of language models (LMs), but we only understand the most basic principles of this relationship. While most work focuses on pretraining data's effect on downstream task behavior, we investigate its relationship to LM representations. Previous work has discovered that, in language models, some concepts are encoded `linearly' in the representations, but what factors cause these representations to form? We study the connection between pretraining data frequency and models' linear representations of factual relations. We find evidence that the formation of linear representations is strongly connected to pretraining term frequencies; specifically for subject-relation-object fact triplets, both subject-object co-occurrence frequency and in-context learning accuracy for the relation are highly correlated with linear representations. This is the case across all phases of pretraining. In OLMo-7B and GPT-J, we discover that a linear representation consistently (but not exclusively) forms when the subjects and objects within a relation co-occur at least 1k and 2k times, respectively, regardless of when these occurrences happen during pretraining. Finally, we train a regression model on measurements of linear representation quality in fully-trained LMs that can predict how often a term was seen in pretraining. Our model achieves low error even on inputs from a different model with a different pretraining dataset, providing a new method for estimating properties of the otherwise-unknown training data of closed-data models.
We conclude that the strength of linear representations in LMs contains signal about the models' pretraining corpora that may provide new avenues for controlling and improving model behavior: particularly, manipulating the models' training data to meet specific frequency thresholds.

[143] SLURG: Investigating the Feasibility of Generating Synthetic Online Fallacious Discourse

Cal Blanco,Gavin Dsouza,Hugo Lin,Chelsey Rush

Main category: cs.CL

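TL;DR: The paper explores the definition and extrapolation of logical fallacies for automatic detection of manipulation on social media, focuses on misinformation in forums about the Ukrainian-Russian conflict, and proposes SLURG, which uses LLMs to generate synthetic fallacious comments.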

Details

Motivation: Existing datasets for automatic fallacy detection are mostly limited to formal linguistic domains and do not cover the non-standardized language of online forums; this gap needs to be filled. Method: Proposes SLURG, which uses large language models such as DeepHermes-3-Mistral-24B to generate synthetic fallacious forum comments, and tests its feasibility. Result: LLMs can replicate the syntactic patterns of real data, and high-quality few-shot prompts improve their ability to mimic the vocabulary diversity of forums. Conclusion: LLMs show potential for generating synthetic fallacious comments, providing a new tool for fallacy detection in online forums. Abstract: In our paper we explore the definition and extrapolation of fallacies as they pertain to the automatic detection of manipulation on social media. In particular we explore how these logical fallacies might appear in the real world, i.e., internet forums. We discovered a prevalence of misinformation / misguided intention in discussion boards specifically centered around the Ukrainian Russian Conflict which serves to narrow the domain of our task. Although automatic fallacy detection has gained attention recently, most datasets use unregulated fallacy taxonomies or are limited to formal linguistic domains like political debates or news reports. Online discourse, however, often features non-standardized and diverse language not captured in these domains. We present Shady Linguistic Utterance Replication-Generation (SLURG) to address these limitations, exploring the feasibility of generating synthetic fallacious forum-style comments using large language models (LLMs), specifically DeepHermes-3-Mistral-24B. Our findings indicate that LLMs can replicate the syntactic patterns of real data and that high-quality few-shot prompts enhance LLMs' ability to mimic the vocabulary diversity of online forums.

[144] Integrating Structural and Semantic Signals in Text-Attributed Graphs with BiGTex

Azadeh Beiranvand,Seyed Mehdi Vahidipour

Main category: cs.CL

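TL;DR: BiGTex is a bidirectional graph-text fusion architecture that combines the strengths of GNNs and LLMs and, with parameter-efficient fine-tuning, achieves state-of-the-art performance in node classification and generalizes to link prediction.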

Details

Motivation: To address the challenge of capturing both textual semantics and structural dependencies in text-attributed graphs (TAGs), compensating for the respective weaknesses of GNNs and LLMs. Method: Proposes the BiGTex architecture, which uses stacked Graph-Text Fusion Units to implement bidirectional attention between GNN and LLM representations, trained with LoRA for parameter-efficient fine-tuning. Result: On five benchmark datasets, BiGTex achieves state-of-the-art performance in node classification and generalizes effectively to link prediction. Conclusion: Through bidirectional fusion and soft prompting, BiGTex integrates textual and structural information and offers an effective solution for TAG representation learning. Abstract: Text-attributed graphs (TAGs) present unique challenges in representation learning by requiring models to capture both the semantic richness of node-associated texts and the structural dependencies of the graph. While graph neural networks (GNNs) excel at modeling topological information, they lack the capacity to process unstructured text. Conversely, large language models (LLMs) are proficient in text understanding but are typically unaware of graph structure. In this work, we propose BiGTex (Bidirectional Graph Text), a novel architecture that tightly integrates GNNs and LLMs through stacked Graph-Text Fusion Units. Each unit allows for mutual attention between textual and structural representations, enabling information to flow in both directions, text influencing structure and structure guiding textual interpretation. The proposed architecture is trained using parameter-efficient fine-tuning (LoRA), keeping the LLM frozen while adapting to task-specific signals. Extensive experiments on five benchmark datasets demonstrate that BiGTex achieves state-of-the-art performance in node classification and generalizes effectively to link prediction. An ablation study further highlights the importance of soft prompting and bi-directional attention in the model's success.

[145] Can Pre-training Indicators Reliably Predict Fine-tuning Outcomes of LLMs?

Hansi Zeng,Kai Hui,Honglei Zhuang,Zhen Qin,Zhenrui Yue,Hamed Zamani,Dana Alon

Main category: cs.CL

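TL;DR: The paper examines why pre-training indicators such as perplexity predict downstream performance poorly at a fixed model size, frames the selection of the best pre-training checkpoint as a pairwise classification problem, and introduces new proxy metrics that substantially reduce the performance prediction error rate.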

Details

Motivation: Pre-training indicators such as perplexity work well in scaling-law studies, but their ability to predict downstream performance at a fixed model size is unclear, which hinders model selection and development. Method: Formulates pre-training checkpoint selection as a pairwise classification task, builds a dataset of 50 1B-parameter LLM variants, evaluates their downstream performance after supervised fine-tuning (SFT), and proposes new unsupervised and supervised proxy metrics. Result: Conventional perplexity is a misleading indicator, while the proposed proxy metrics reduce the relative performance prediction error rate by over 50%. Conclusion: Despite the complexity of the task, the new proxies are practically useful in specific scenarios and point toward more efficient design of pre-training schemes. Abstract: While metrics available during pre-training, such as perplexity, correlate well with model performance in scaling-law studies, their predictive capacities at a fixed model size remain unclear, hindering effective model selection and development. To address this gap, we formulate the task of selecting pre-training checkpoints to maximize downstream fine-tuning performance as a pairwise classification problem: predicting which of two LLMs, differing in their pre-training, will perform better after supervised fine-tuning (SFT). We construct a dataset using 50 1B parameter LLM variants with systematically varied pre-training configurations, e.g., objectives or data, and evaluate them on diverse downstream tasks after SFT. We first conduct a study and demonstrate that the conventional perplexity is a misleading indicator. As such, we introduce novel unsupervised and supervised proxy metrics derived from pre-training that successfully reduce the relative performance prediction error rate by over 50%. Despite the inherent complexity of this task, we demonstrate the practical utility of our proposed proxies in specific scenarios, paving the way for more efficient design of pre-training schemes optimized for various downstream tasks.
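The pairwise framing can be made concrete with a short sketch: for every pair of checkpoints, a proxy metric "votes" for the one it expects to do better after SFT, and the error rate is the share of pairs where that vote disagrees with the measured downstream score. The function name `pairwise_error_rate` and all numbers are hypothetical, and the paper's exact error definition may differ.

```python
from itertools import combinations


def pairwise_error_rate(proxy, downstream, higher_is_better=True):
    """Share of checkpoint pairs where the proxy picks the wrong winner.

    Pairs with equal downstream scores are skipped (no ground-truth winner).
    A tie in the proxy counts as a vote for the second checkpoint.
    """
    errors = 0
    total = 0
    for a, b in combinations(range(len(proxy)), 2):
        if downstream[a] == downstream[b]:
            continue
        truth_a_wins = downstream[a] > downstream[b]
        if higher_is_better:
            pred_a_wins = proxy[a] > proxy[b]
        else:
            pred_a_wins = proxy[a] < proxy[b]
        total += 1
        errors += pred_a_wins != truth_a_wins
    return errors / total if total else 0.0


# Made-up values for four checkpoints: perplexity (lower is better)
# and accuracy after supervised fine-tuning.
perplexity = [10.0, 9.0, 8.0, 7.5]
sft_accuracy = [0.50, 0.55, 0.53, 0.60]

rate = pairwise_error_rate(perplexity, sft_accuracy, higher_is_better=False)
```

In this toy data perplexity misorders only the second and third checkpoints, so one of the six pairs is wrong.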

[146] Accelerating Clinical NLP at Scale with a Hybrid Framework with Reduced GPU Demands: A Case Study in Dementia Identification

Jianlin Shi,Qiwei Gan,Elizabeth Hanchrow,Annie Bowles,John Stanley,Adam P. Bress,Jordana B. Cohen,Patrick R. Alba

Main category: cs.CL

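TL;DR: Proposes a hybrid NLP framework combining rule-based filtering, an SVM classifier, and a BERT model to analyze large-scale clinical text efficiently and accurately, validated in a dementia identification case study.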

Details

Motivation: Current Transformer-based clinical NLP methods demand high computational resources, which limits their accessibility, so a more efficient solution is needed. Method: A hybrid NLP framework combining rule-based filtering, an SVM classifier, and a BERT model, applied to clinical data covering 4.9 million veterans. Result: Patient-level precision of 0.90, recall of 0.84, and F1-score of 0.87, identifying over three times as many dementia cases as structured data methods. Conclusion: The hybrid NLP framework gives healthcare organizations with limited resources an efficient and accessible solution for large-scale clinical text analysis. Abstract: Clinical natural language processing (NLP) is increasingly in demand in both clinical research and operational practice. However, most of the state-of-the-art solutions are transformer-based and require high computational resources, limiting their accessibility. We propose a hybrid NLP framework that integrates rule-based filtering, a Support Vector Machine (SVM) classifier, and a BERT-based model to improve efficiency while maintaining accuracy. We applied this framework in a dementia identification case study involving 4.9 million veterans with incident hypertension, analyzing 2.1 billion clinical notes. At the patient level, our method achieved a precision of 0.90, a recall of 0.84, and an F1-score of 0.87. Additionally, this NLP approach identified over three times as many dementia cases as structured data methods. All processing was completed in approximately two weeks using a single machine with dual A40 GPUs. This study demonstrates the feasibility of hybrid NLP solutions for large-scale clinical text analysis, making state-of-the-art methods more accessible to healthcare organizations with limited computational resources.
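One way such a hybrid could save GPU time is a cascade in which cheap stages handle most notes and only uncertain ones reach the expensive model. The sketch below assumes that routing; the abstract does not describe how the three components are connected. The keyword rule, the thresholds, and the two stub scorers (standing in for the SVM and the BERT-based model) are hypothetical. The final helper checks that the reported F1-score follows from the reported precision and recall.

```python
import re

# Hypothetical rule stage: skip notes that never mention the condition.
KEYWORDS = re.compile(r"\b(dementia|alzheimer)", re.IGNORECASE)


def cascade(note, cheap_score, expensive_label, low=0.2, high=0.8):
    """Return (label, stage) for one note, using the cheapest stage that can decide."""
    if not KEYWORDS.search(note):
        return False, "rules"
    score = cheap_score(note)
    if score <= low:
        return False, "cheap"
    if score >= high:
        return True, "cheap"
    # Only uncertain notes reach the expensive model.
    return expensive_label(note), "expensive"


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Stubs standing in for a trained SVM and a BERT-based classifier.
def stub_cheap_score(note):
    return 0.9 if "diagnosed" in note.lower() else 0.5


def stub_expensive_label(note):
    return "no evidence" not in note.lower()


r1 = cascade("BP stable today", stub_cheap_score, stub_expensive_label)
r2 = cascade("Diagnosed with dementia in 2019", stub_cheap_score, stub_expensive_label)
r3 = cascade("Family asks about dementia; no evidence on exam",
             stub_cheap_score, stub_expensive_label)

reported_f1 = round(f1(0.90, 0.84), 2)  # 2 * 0.90 * 0.84 / 1.74
```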

[147] Beyond Text: Characterizing Domain Expert Needs in Document Research

Sireesh Gururaja,Nupoor Gandhi,Jeremiah Milbauer,Emma Strubell

Main category: cs.CL

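TL;DR: The study examines the gap between NLP systems for document tasks and the actual needs of experts, and calls on the NLP community to pay more attention to the social context of documents and to personalization.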

Details

Motivation: To understand whether NLP systems can model document work the way experts perform it, and to compare existing technology with experts' actual needs. Method: Interviews sixteen experts across two domains, analyzes their document research processes, and compares them with the capabilities of current NLP systems. Result: Expert processes are idiosyncratic and iterative and rely on the social context of documents; some NLP approaches reflect expert needs more closely but are less accessible. Conclusion: Calls on the NLP community to build tools that attend to the social context of documents and are personalizable, iterative, and accessible. Abstract: Working with documents is a key part of almost any knowledge work, from contextualizing research in a literature review to reviewing legal precedent. Recently, as their capabilities have expanded, primarily text-based NLP systems have often been billed as able to assist or even automate this kind of work. But to what extent are these systems able to model these tasks as experts conceptualize and perform them now? In this study, we interview sixteen domain experts across two domains to understand their processes of document research, and compare them to the current state of NLP systems. We find that our participants' processes are idiosyncratic, iterative, and rely extensively on the social context of a document in addition to its content; existing approaches in NLP and adjacent fields that explicitly center the document as an object, rather than as merely a container for text, tend to better reflect our participants' priorities, though they are often less accessible outside their research communities. We call on the NLP community to more carefully consider the role of the document in building useful tools that are accessible, personalizable, iterative, and socially aware.

[148] BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

Jason Wei,Zhiqing Sun,Spencer Papay,Scott McKinney,Jeffrey Han,Isa Fulford,Hyung Won Chung,Alex Tachard Passos,William Fedus,Amelia Glaese

Main category: cs.CL

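TL;DR: BrowseComp is a simple yet challenging benchmark for evaluating agents' ability to browse the web, comprising 1,266 questions that require persistent searching of the internet.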

Details

Motivation: To provide a simple, easily verifiable benchmark that measures an agent's ability to find information in a complex web environment. Method: Designs 1,266 questions that require persistent internet search, with answers that are short and easy to verify. Result: BrowseComp effectively tests an agent's persistence and creativity in finding hard-to-find information. Conclusion: BrowseComp is a useful benchmark; although it does not fully model real user queries, it measures an important core capability. Abstract: We present BrowseComp, a simple yet challenging benchmark for measuring the ability of agents to browse the web. BrowseComp comprises 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple and easy-to-use, as predicted answers are short and easily verifiable against reference answers. BrowseComp for browsing agents can be seen as analogous to how programming competitions are an incomplete but useful benchmark for coding agents. While BrowseComp sidesteps challenges of a true user query distribution, like generating long answers or resolving ambiguity, it measures the important core capability of exercising persistence and creativity in finding information. BrowseComp can be found at https://github.com/openai/simple-evals.

[149] Evaluating the Diversity and Quality of LLM Generated Content

Alexander Shypula,Shuo Li,Botong Zhang,Vishakh Padmakumar,Kayo Yin,Osbert Bastani

Main category: cs.CL

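TL;DR: Preference-tuning techniques (such as RLHF, PPO, GRPO, and DPO) reduce lexical and syntactic diversity but increase effective semantic diversity, because they produce more high-quality outputs. Smaller models generate unique content more efficiently within a fixed sampling budget.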

Details

Motivation: To resolve the tension between preference tuning reducing diversity and real applications needing diverse outputs. Method: Introduces a framework for effective semantic diversity and evaluates models on open-ended tasks that require no human intervention. Result: Preference-tuned models, especially RL-trained ones, show greater effective semantic diversity than SFT or base models, and smaller models are more parameter-efficient. Conclusion: Preference tuning preserves semantic diversity better than expected and smaller models generate diverse content more efficiently, which matters for applications that need diverse, high-quality outputs. Abstract: Recent work suggests that preference-tuning techniques--including Reinforcement Learning from Human Preferences (RLHF) methods like PPO and GRPO, as well as alternatives like DPO--reduce diversity, creating a dilemma given that such models are widely deployed in applications requiring diverse outputs. To address this, we introduce a framework for measuring effective semantic diversity--diversity among outputs that meet quality thresholds--which better reflects the practical utility of large language models (LLMs). Using open-ended tasks that require no human intervention, we find counterintuitive results: although preference-tuned models--especially those trained via RL--exhibit reduced lexical and syntactic diversity, they produce greater effective semantic diversity than SFT or base models, not from increasing diversity among high-quality outputs, but from generating more high-quality outputs overall. We discover that preference tuning reduces syntactic diversity while preserving semantic diversity--revealing a distinction between diversity in form and diversity in content that traditional metrics often overlook. Our analysis further shows that smaller models are consistently more parameter-efficient at generating unique content within a fixed sampling budget, offering insights into the relationship between model scaling and diversity. These findings have important implications for applications that require diverse yet high-quality outputs, from creative assistance to synthetic data generation.
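The notion of "diversity among outputs that meet quality thresholds" can be illustrated with a small sketch. It assumes each sampled output already carries a quality score and a semantic cluster label, and it counts distinct clusters among the outputs that pass the threshold, normalized by the sampling budget. This is one plausible reading, not the paper's exact metric; the function name and the sample data are hypothetical.

```python
def effective_semantic_diversity(samples, threshold):
    """Distinct semantic clusters among valid outputs, per sample drawn.

    samples: list of (quality, cluster_id) pairs for one prompt.
    """
    if not samples:
        return 0.0
    valid_clusters = {cluster for quality, cluster in samples if quality >= threshold}
    return len(valid_clusters) / len(samples)


# Made-up samples. The "base" model produces four different ideas but only one
# passes the quality bar; the "tuned" model repeats one idea yet has three valid ones.
base_samples = [(0.9, "a"), (0.3, "b"), (0.2, "c"), (0.4, "d")]
tuned_samples = [(0.9, "a"), (0.8, "a"), (0.85, "b"), (0.9, "c")]

base_score = effective_semantic_diversity(base_samples, threshold=0.7)
tuned_score = effective_semantic_diversity(tuned_samples, threshold=0.7)
```

In this toy case the tuned model scores higher even though its raw outputs are less varied, which mirrors the abstract's point that the gain comes from generating more high-quality outputs overall.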

[150] Memorization vs. Reasoning: Updating LLMs with New Knowledge

Aochong Oliver Li,Tanya Goyal

Main category: cs.CL

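TL;DR: This paper introduces Knowledge Update Playground (KUP) and memory conditioned training (MCT) to address the challenge of updating knowledge in large language models (LLMs). KUP provides an automatic framework for evaluating knowledge updates, and MCT uses conditioned training to improve memorization of, and reasoning over, new knowledge.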

Details

Motivation: Existing methods mainly target entity substitutions and do not capture complex real-world dynamics, so more effective knowledge update methods are needed. Method: Proposes KUP as an automatic knowledge update evaluation framework and develops MCT, which conditions training on self-generated memory tokens to improve knowledge updating. Result: On the KUP benchmark, existing methods perform poorly on reasoning (<2%), while MCT significantly outperforms continued pre-training baselines, improving memorization results by up to 25.4%. Conclusion: KUP and MCT provide effective tools for updating LLM knowledge and improve memorization of, and reasoning over, new knowledge. Abstract: Large language models (LLMs) encode vast amounts of pre-trained knowledge in their parameters, but updating them as real-world information evolves remains a challenge. Existing methodologies and benchmarks primarily target entity substitutions, failing to capture the full breadth of complex real-world dynamics. In this paper, we introduce Knowledge Update Playground (KUP), an automatic pipeline for simulating realistic knowledge updates reflected in an evidence corpus. KUP's evaluation framework includes direct and indirect probes to test both memorization of updated facts and reasoning over them, for any update learning method. Next, we present a lightweight method called memory conditioned training (MCT), which conditions tokens in the update corpus on self-generated "memory" tokens during training. Our strategy encourages LLMs to surface and reason over newly memorized knowledge at inference. Our results on two strong LLMs show that (1) the KUP benchmark is highly challenging, with the best CPT models achieving $<2\%$ in the indirect probing setting (reasoning) and (2) MCT training significantly outperforms prior continued pre-training (CPT) baselines, improving direct probing (memorization) results by up to $25.4\%$.

[151] Memorization: A Close Look at Books

Iris Ma,Ian Domingo,Alberto Krone-Martins,Pierre Baldi,Cristina V. Lopes

Main category: cs.CL

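TL;DR: The study examines whether entire books can be extracted from LLMs such as Llama 3 70B and finds that "prefix-prompting" can reconstruct Alice's Adventures in Wonderland with high similarity, although success correlates with book popularity. It also confirms that mitigations are undone in the instruction-tuned Llama 3.1 and shows where the responsible weight changes are concentrated.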

Details

Motivation: To explore how far book content can be extracted from LLMs, where the limits lie, and how effective current mitigation strategies are. Method: Uses Llama 3 70B models with the "prefix-prompting" technique to reconstruct book content from the first 500 tokens, analyzes the correlation between extraction rate and book popularity, and studies how instruction tuning changes the weights. Result: Alice's Adventures in Wonderland is reconstructed with high similarity, but extraction rates vary with book popularity. Instruction tuning undoes mitigations, and the weight changes are concentrated in a small fraction of weights in the lower transformer blocks. Conclusion: Book extraction from current LLMs is uneven and depends on popularity, mitigation strategies have limits, and further study is needed on how fine-tuning affects retrieval of memorized text. Abstract: To what extent can entire books be extracted from LLMs? Using the Llama 3 70B family of models, and the "prefix-prompting" extraction technique, we were able to auto-regressively reconstruct, with a very high level of similarity, one entire book (Alice's Adventures in Wonderland) from just the first 500 tokens. We were also able to obtain high extraction rates on several other books, piece-wise. However, these successes do not extend uniformly to all books. We show that extraction rates of books correlate with book popularity and thus, likely duplication in the training data. We also confirm the undoing of mitigations in the instruction-tuned Llama 3.1, following recent work (Nasr et al., 2025). We further find that this undoing comes from changes to only a tiny fraction of weights concentrated primarily in the lower transformer blocks. Our results provide evidence of the limits of current regurgitation mitigation strategies and introduce a framework for studying how fine-tuning affects the retrieval of verbatim memorization in aligned LLMs.
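The auto-regressive reconstruction loop can be sketched as follows: start from a prefix, repeatedly ask the model to continue from the most recent window of tokens, append what it returns, and finally compare the result with the reference text. The model here is a stub that has "memorized" a toy token list perfectly; the window and step sizes, the function names, and the use of `difflib.SequenceMatcher` as the similarity measure are assumptions for illustration and not necessarily what the authors used.

```python
from difflib import SequenceMatcher


def reconstruct(prefix, continue_fn, target_len, window=500, step=50):
    """Grow a token list by feeding the last `window` tokens back to the model."""
    tokens = list(prefix)
    while len(tokens) < target_len:
        new_tokens = continue_fn(tokens[-window:], step)
        if not new_tokens:
            break  # the model produced nothing, so stop early
        tokens.extend(new_tokens)
    return tokens[:target_len]


def similarity(a, b):
    """Ratio in [0, 1] of matching tokens between two sequences."""
    return SequenceMatcher(None, a, b).ratio()


def make_memorizing_stub(book):
    """Stub model: finds the context in the book and returns what follows it."""
    def continue_fn(context, n):
        k = len(context)
        for start in range(len(book) - k + 1):
            if book[start:start + k] == context:
                return book[start + k:start + k + n]
        return []
    return continue_fn


# Toy "book" of 40 unique tokens and a 5-token prefix.
book = [f"w{i}" for i in range(40)]
model = make_memorizing_stub(book)
rebuilt = reconstruct(book[:5], model, target_len=len(book), window=5, step=3)
score = similarity(rebuilt, book)
```

A real model would drift from the source at some point, so the similarity would fall below 1 and the loop would typically be restarted from true text to measure piece-wise extraction.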

[152] ELAB: Extensive LLM Alignment Benchmark in Persian Language

Zahra Pourbahman,Fatemeh Rajabi,Mohammadhossein Sadeghi,Omid Ghahroodi,Somaye Bakhshaei,Arash Amini,Reza Kazemi,Mahdieh Soleymani Baghshah

Main category: cs.CL

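TL;DR: This paper presents a comprehensive evaluation framework for Persian large language models (LLMs), focused on ethical dimensions such as safety, fairness, and social norms, filling a gap left by existing frameworks with respect to Persian language and culture.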

Details

Motivation: Existing LLM evaluation frameworks are not adequately adapted to Persian linguistic and cultural contexts, so their results are neither accurate nor complete enough. Method: Builds a unified Persian LLM evaluation framework by translating existing datasets (such as Anthropic Red Teaming data and AdvBench), generating synthetic data (such as ProhibiBench-fa and SafeBench-fa), and collecting natural data (GuardBench-fa). Result: Presents a public leaderboard (https://huggingface.co/spaces/MCILAB/LLM_Alignment_Evaluation) that evaluates Persian LLMs on safety, fairness, and social norms. Conclusion: The framework offers a new approach to evaluating the ethical alignment of Persian LLMs and fills a gap in culturally grounded evaluation. Abstract: This paper presents a comprehensive evaluation framework for aligning Persian Large Language Models (LLMs) with critical ethical dimensions, including safety, fairness, and social norms. It addresses the gaps in existing LLM evaluation frameworks by adapting them to Persian linguistic and cultural contexts. This work creates three types of Persian-language benchmarks: (i) translated data, (ii) new data generated synthetically, and (iii) new naturally collected data. We translate Anthropic Red Teaming data, AdvBench, HarmBench, and DecodingTrust into Persian. Furthermore, we create ProhibiBench-fa, SafeBench-fa, FairBench-fa, and SocialBench-fa as new datasets to address harmful and prohibited content in indigenous culture. Moreover, we collect an extensive dataset, GuardBench-fa, to consider Persian cultural norms. By combining these datasets, our work establishes a unified framework for evaluating Persian LLMs, offering a new approach to culturally grounded alignment evaluation. A systematic evaluation of Persian LLMs is performed across the three alignment aspects: safety (avoiding harmful content), fairness (mitigating biases), and social norms (adhering to culturally accepted behaviors). We present a publicly available leaderboard that benchmarks Persian LLMs with respect to safety, fairness, and social norms at: https://huggingface.co/spaces/MCILAB/LLM_Alignment_Evaluation.

[153] CDF-RAG: Causal Dynamic Feedback for Adaptive Retrieval-Augmented Generation

Elahe Khatibi,Ziyu Wang,Amir M. Rahmani

Main category: cs.CL

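TL;DR: The CDF-RAG framework improves RAG through dynamic feedback and causal reasoning, raising the causal consistency and accuracy of generated content.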

Details

Motivation: Existing RAG frameworks rely on semantic-similarity retrieval and struggle to distinguish true causal relationships from spurious associations, so generated content may be incomplete or misleading. Method: CDF-RAG iteratively refines queries, retrieves structured causal graphs, performs multi-hop causal reasoning, and validates responses against causal pathways to ensure logical consistency. Result: Evaluated on four datasets, CDF-RAG improves response accuracy and causal correctness over existing RAG-based methods. Conclusion: CDF-RAG provides more reliable causal reasoning and generation for knowledge-intensive tasks. Abstract: Retrieval-Augmented Generation (RAG) has significantly enhanced large language models (LLMs) in knowledge-intensive tasks by incorporating external knowledge retrieval. However, existing RAG frameworks primarily rely on semantic similarity and correlation-driven retrieval, limiting their ability to distinguish true causal relationships from spurious associations. This results in responses that may be factually grounded but fail to establish cause-and-effect mechanisms, leading to incomplete or misleading insights. To address this issue, we introduce Causal Dynamic Feedback for Adaptive Retrieval-Augmented Generation (CDF-RAG), a framework designed to improve causal consistency, factual accuracy, and explainability in generative reasoning. CDF-RAG iteratively refines queries, retrieves structured causal graphs, and enables multi-hop causal reasoning across interconnected knowledge sources. Additionally, it validates responses against causal pathways, ensuring logically coherent and factually grounded outputs. We evaluate CDF-RAG on four diverse datasets, demonstrating its ability to improve response accuracy and causal correctness over existing RAG-based methods. Our code is publicly available at https://github.com/elakhatibi/CDF-RAG.

[154] MetaSynth: Meta-Prompting-Driven Agentic Scaffolds for Diverse Synthetic Data Generation

Haris Riaz,Sourav Bhabesh,Vinayak Arannil,Miguel Ballesteros,Graham Horwood

Main category: cs.CL

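TL;DR: MetaSynth generates diverse synthetic data through meta-prompting and successfully adapts Mistral-7B-v0.3 to the finance and biomedicine domains with notable performance gains.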

Details

Motivation: To address the low diversity of synthetic data and so improve its usefulness for adapting models to specific domains. Method: Proposes MetaSynth, in which a language model orchestrates multiple "expert" LLM agents that collaboratively generate data. Result: With only 25 million tokens of synthetic data, the model improves by up to 4.08% in finance and 13.75% in biomedicine. Conclusion: Diverse synthetic data generated by MetaSynth is sufficient for effective domain adaptation without mixing in real data. Abstract: Recent smaller language models such as Phi-3.5 and Phi-4 rely on synthetic data generated using larger language models. Questions remain about leveraging synthetic data for other use cases, such as adapting LLMs to specific domains. A key limitation of synthetic data is low diversity, which negatively impacts its downstream applicability for improving other models. To address this, we propose MetaSynth, a method for generating synthetic data that enhances diversity through meta-prompting, where a language model orchestrates multiple "expert" LLM agents to collaboratively generate data. Using only 25 million tokens of synthetic data generated with MetaSynth, we successfully adapt a well-trained LLM (Mistral-7B-v0.3) to two specialized domains, Finance and Biomedicine, without compromising the capabilities of the resulting model in general tasks. In addition, we evaluate the diversity of our synthetic data using seven automated metrics, and find that it approaches the diversity of LLM pre-training corpora. Continually pre-training Mistral-7B-v0.3 with MetaSynth notably outperforms the base LLM, showing improvements of up to 4.08% in Finance and 13.75% in Biomedicine. The same model shows degraded performance when trained on data generated using a template prompt, even when the template includes prior generations and varying in-context exemplars of real data. Our findings suggest that a few million tokens of diverse synthetic data, without mixing in any real data, are sufficient for effective domain adaptation when using MetaSynth.

[155] Identifying and Mitigating the Influence of the Prior Distribution in Large Language Models

Liyi Zhang,Veniamin Veselovsky,R. Thomas McCoy,Thomas L. Griffiths

Main category: cs.CL

TL;DR: Large language models (LLMs) perform poorly on some deterministic tasks, but targeted interventions can improve their performance.

Details

Motivation: LLMs perform poorly on deterministic tasks because they are influenced by an implicit prior distribution. Method: Prompts the model not to rely on prior knowledge, and uses mechanistic interpretability techniques to localize the prior and adjust its influence. Result: Lightweight fine-tuning markedly improves task performance, and the remaining errors are no longer correlated with the prior. Conclusion: The study offers effective ways to reduce LLMs' reliance on priors, which may help with hallucination. Abstract: Large language models (LLMs) sometimes fail to respond appropriately to deterministic tasks -- such as counting or forming acronyms -- because the implicit prior distribution they have learned over sequences of tokens influences their responses. In this work, we show that, in at least some cases, LLMs actually compute the information needed to perform these tasks correctly, and we identify some interventions that can allow them to access this information to improve their performance. First, we show that simply prompting the language model to not rely on its prior knowledge leads to dramatic improvements in prior-dominated tasks. We then use mechanistic interpretability techniques to localize the prior within the LLM and manipulate the extent to which that prior influences its responses. Specifically, we show that it is possible to identify layers of the underlying neural network that correlate with the prior probability of a response and that lightweight finetuning of these layers with basic prompts on prior-dominated tasks achieves high performance on held-out answers. These results suggest that the information required to produce a correct response is contained within the representations of the problems formed by the models. Furthermore, we show that this finetuning is significantly more effective for prior-dominated tasks, and that the error after finetuning is no longer correlated with the prior. Our results suggest that it may be possible to define effective methods for manipulating the extent to which LLMs rely upon their priors in solving problems, potentially increasing their performance in settings where LLMs hallucinate for reasons related to the prior probability of token sequences.

[156] GeoSense: Evaluating Identification and Application of Geometric Principles in Multimodal Reasoning

Liangyu Xu,Yingxiu Zhao,Jingyun Wang,Yingyao Wang,Bu Pi,Chen Wang,Mingliang Zhang,Jihao Gu,Xiang Li,Xiaoyong Zhu,Jun Song,Bo Zheng

Main category: cs.CL

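TL;DR: GeoSense is a bilingual benchmark for evaluating the geometric reasoning of multimodal large language models (MLLMs); it finds that identifying and applying geometric principles remains a bottleneck for existing models.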

Details

Motivation: Existing benchmarks do not fully assess the visual comprehension and symbolic reasoning of MLLMs in geometry problem-solving (GPS), so a more comprehensive evaluation tool is needed. Method: GeoSense uses a five-level hierarchical framework of geometric principles, contains 1,789 annotated problems, and tests MLLMs with a new evaluation strategy. Result: Gemini-2.0-pro-flash performs best with an overall score of 65.3, but identifying and applying geometric principles remains a bottleneck. Conclusion: GeoSense points the way toward improving geometric reasoning in MLLMs and toward more human-like reasoning in AI. Abstract: Geometry problem-solving (GPS), a challenging task requiring both visual comprehension and symbolic reasoning, effectively measures the reasoning capabilities of multimodal large language models (MLLMs). Humans exhibit strong reasoning ability in this task through accurate identification and adaptive application of geometric principles within visual contexts. However, existing benchmarks fail to jointly assess both dimensions of the human-like geometric reasoning mechanism in MLLMs, leaving a critical gap in assessing their ability to tackle GPS. To this end, we introduce GeoSense, the first comprehensive bilingual benchmark designed to systematically evaluate the geometric reasoning abilities of MLLMs through the lens of geometric principles. GeoSense features a five-level hierarchical framework of geometric principles spanning plane and solid geometry, an intricately annotated dataset of 1,789 problems, and an innovative evaluation strategy. Through extensive experiments on GeoSense with various open-source and closed-source MLLMs, we observe that Gemini-2.0-pro-flash performs best, achieving an overall score of $65.3$. Our in-depth analysis reveals that the identification and application of geometric principles remain a bottleneck for leading MLLMs, jointly hindering their reasoning abilities. These findings underscore GeoSense's potential to guide future advancements in MLLMs' geometric reasoning capabilities, paving the way for more robust and human-like reasoning in artificial intelligence.

[157] Towards Characterizing Subjectivity of Individuals through Modeling Value Conflicts and Trade-offs

Younghun Lee,Dan Goldwasser

Main category: cs.CL

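TL;DR: The paper proposes the SOLAR framework, which uses LLMs to analyze the individual subjectivity of social media users, particularly their moral judgments, and improves inference by modeling value conflicts and trade-offs.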

Details

Motivation: To explore whether LLMs can capture individual-level subjectivity, especially in the moral judgments of social media users. Method: Proposes the SOLAR framework, which characterizes individual subjectivity by analyzing value conflicts and trade-offs in user-generated text. Result: Experiments show that SOLAR improves inference results, especially in controversial situations, and can explain individuals' value preferences. Conclusion: SOLAR captures individual subjectivity effectively and offers a new way to understand users' moral judgments. Abstract: Large Language Models (LLMs) not only have solved complex reasoning problems but also exhibit remarkable performance in tasks that require subjective decision making. Existing studies suggest that LLM generations can be subjectively grounded to some extent, yet whether LLMs can account for individual-level subjectivity has not been sufficiently studied. In this paper, we characterize subjectivity of individuals on social media and infer their moral judgments using LLMs. We propose a framework, SOLAR (Subjective Ground with Value Abstraction), that observes value conflicts and trade-offs in the user-generated texts to better represent subjective ground of individuals. Empirical results show that our framework improves overall inference results as well as performance on controversial situations. Additionally, we qualitatively show that SOLAR provides explanations about individuals' value preferences, which can further account for their judgments.

[158] Scaling Instruction-Tuned LLMs to Million-Token Contexts via Hierarchical Synthetic Data Generation

Linda He,Jue Wang,Maurice Weber,Shang Zhu,Ben Athiwaratkun,Ce Zhang

Main category: cs.CL

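TL;DR: This paper proposes a new post-training synthetic data generation strategy that efficiently extends the context window of LLMs while preserving their general task performance.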

Details

Motivation: LLMs struggle with long-context reasoning because of high computational complexity and the scarcity of long-context data. Method: Synthetic data is generated to extend the context length, and the model is trained with a step-by-step rotary position embedding (RoPE) scaling strategy. Result: With a context length of up to 1M tokens, the model performs well on the RULER and InfiniteBench benchmarks while maintaining robust performance on general language tasks. Conclusion: The method addresses the scarcity of long-context data and offers a practical route to long-context reasoning in LLMs. Abstract: Large Language Models (LLMs) struggle with long-context reasoning, not only due to the quadratic scaling of computational complexity with sequence length but also because of the scarcity and expense of annotating long-context data. There has been barely any open-source work that systematically ablates long-context data, nor is there any openly available instruction tuning dataset with contexts surpassing 100K tokens. To bridge this gap, we introduce a novel post-training synthetic data generation strategy designed to efficiently extend the context window of LLMs while preserving their general task performance. Our approach scalably extends to arbitrarily long context lengths, unconstrained by the length of available real-world data, which effectively addresses the scarcity of raw long-context data. Through a step-by-step rotary position embedding (RoPE) scaling training strategy, we demonstrate that our model, with a context length of up to 1M tokens, performs well on the RULER benchmark and InfiniteBench and maintains robust performance on general language tasks.
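
The abstract does not say which RoPE scaling scheme is used, so the following is only a minimal sketch of one common option, linear position interpolation, in which positions are divided by a scale factor so that a longer context maps back into the position range seen during pre-training. The function names, the original context length of 4096, and the stage factors are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Tuple


def rope_angles(position: int, dim: int, scale: float = 1.0,
                base: float = 10000.0) -> List[float]:
    """Rotation angles for one token position.

    `scale` > 1 compresses positions (linear position interpolation);
    this is an assumed scheme, not necessarily the paper's.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    effective_pos = position / scale
    return [effective_pos * base ** (-2 * i / dim) for i in range(dim // 2)]


def rotate_pair(x: float, y: float, angle: float) -> Tuple[float, float]:
    """Apply the 2D rotation that RoPE uses on each pair of features."""
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)


# Hypothetical step-by-step schedule: each stage multiplies the context length.
ORIGINAL_CONTEXT = 4096          # assumption for illustration only
STAGE_FACTORS = [2.0, 4.0, 8.0]  # assumption for illustration only

for factor in STAGE_FACTORS:
    last_position = int(ORIGINAL_CONTEXT * factor) - 1
    # After scaling, the last position still falls inside the original range.
    assert last_position / factor < ORIGINAL_CONTEXT
```

With a scale of 2, position 8 receives the same angles as position 4 did without scaling, which is the sense in which the longer context is squeezed into the familiar range.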

[159] Persona-judge: Personalized Alignment of Large Language Models via Token-level Self-judgment

Xiaotian Zhang,Ruizhe Chen,Yang Feng,Zuozhu Liu

Main category: cs.CL

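TL;DR: Persona-judge achieves training-free personalized alignment by using the model's intrinsic preference judgment capabilities, addressing the reliance of existing methods on reward signals and additional annotated data.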

Details

Motivation: Existing methods rely on reward signals and additional annotated data, which makes them hard to adapt to diverse human values and computationally costly. Method: Persona-judge uses the model's intrinsic preference judgment: a draft model generates candidate tokens conditioned on one preference, and a judge model embodying another preference decides whether to accept them. Result: Experiments show that Persona-judge is a scalable and computationally efficient solution to personalized alignment. Conclusion: Persona-judge opens a path toward more adaptive customized alignment. Abstract: Aligning language models with human preferences presents significant challenges, particularly in achieving personalization without incurring excessive computational costs. Existing methods rely on reward signals and additional annotated data, limiting their scalability and adaptability to diverse human values. To address these challenges, we introduce Persona-judge, a novel discriminative paradigm that enables training-free personalized alignment with unseen preferences. Instead of optimizing policy parameters through external reward feedback, Persona-judge leverages the intrinsic preference judgment capabilities of the model. Specifically, a draft model generates candidate tokens conditioned on a given preference, while a judge model, embodying another preference, cross-validates whether the predicted tokens should be accepted. Experimental results demonstrate that Persona-judge, using the inherent preference evaluation mechanisms of the model, offers a scalable and computationally efficient solution to personalized alignment, paving the way for more adaptive customized alignment.
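
The draft-then-judge step can be pictured with a toy acceptance rule. The abstract does not give the exact criterion, so this sketch borrows the acceptance test from speculative decoding (accept with probability min(1, p_judge / p_draft)) purely as an assumption; the function names and the two-token distributions are hypothetical.

```python
from typing import Dict


def accept_probability(token: str, p_draft: Dict[str, float],
                       p_judge: Dict[str, float]) -> float:
    """Chance that the judge keeps a token proposed by the draft model.

    Assumed rule (speculative-decoding style), not confirmed by the paper.
    The drafted token always has p_draft[token] > 0 because it was sampled.
    """
    return min(1.0, p_judge.get(token, 0.0) / p_draft[token])


def judge_accepts(token: str, p_draft: Dict[str, float],
                  p_judge: Dict[str, float], u: float) -> bool:
    """Decide acceptance given a uniform random draw u in [0, 1)."""
    return u < accept_probability(token, p_draft, p_judge)


# Hypothetical next-token distributions under two different preferences.
p_draft = {"formal": 0.6, "casual": 0.4}   # draft model, preference A
p_judge = {"formal": 0.3, "casual": 0.7}   # judge model, preference B

formal_kept = judge_accepts("formal", p_draft, p_judge, u=0.4)
casual_kept = judge_accepts("casual", p_draft, p_judge, u=0.99)
```

Passing the random draw `u` in explicitly keeps the sketch deterministic; in a real decoding loop it would come from a random number generator, and a rejected token would trigger resampling.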

[160] ACoRN: Noise-Robust Abstractive Compression in Retrieval-Augmented Language Models

Singon Kim,Gunho Jung,Seong-Whan Lee

Main category: cs.CL

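TL;DR: The paper proposes ACoRN, which uses fine-grained document categorization and two additional training steps to improve the robustness and accuracy of abstractive compression models in retrieval-augmented generation.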

Details

Motivation: Retrieved documents often contain irrelevant or misleading information, which causes abstractive compression models to omit key content and hurts answer accuracy. Method: ACoRN combines offline data augmentation with a fine-tuning step to make the compressor robust to noise and focused on key information. Result: Experiments show that T5-large trained with ACoRN achieves better EM and F1 scores, especially on datasets with many noisy documents. Conclusion: ACoRN improves compressor performance in realistic settings, particularly when handling noisy documents. Abstract: Abstractive compression utilizes smaller language models to condense query-relevant context, reducing computational costs in retrieval-augmented generation (RAG). However, retrieved documents often include information that is either irrelevant to answering the query or misleading due to factually incorrect content, despite having high relevance scores. This behavior indicates that abstractive compressors are more likely to omit important information essential for the correct answer, especially in long contexts where attention dispersion occurs. To address this issue, we categorize retrieved documents in a more fine-grained manner and propose Abstractive Compression Robust against Noise (ACoRN), which introduces two novel training steps. First, we use offline data augmentation on the training dataset to enhance compressor robustness against two distinct types of retrieval noise. Second, since the language model-based compressor cannot fully utilize information from multiple retrieved documents and exhibits positional bias, we perform finetuning to generate summaries centered around key information that directly supports the correct answer. Our experiments demonstrate that T5-large, trained with ACoRN as a compressor, improves EM and F1 scores while preserving the answer string, which could serve as direct evidence. ACoRN excels on datasets with many accuracy-reducing documents, making it highly useful in real-world scenarios.
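
For readers unfamiliar with the EM and F1 metrics mentioned above, here is a minimal sketch of how they are commonly computed in question answering, following the widely used SQuAD-style normalization. Whether the paper uses exactly this normalization is an assumption, and `answer_preserved` is a hypothetical helper illustrating the "answer string is preserved in the summary" check.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def answer_preserved(summary: str, gold: str) -> bool:
    """Hypothetical check: does the compressed summary still contain the answer?"""
    return normalize(gold) in normalize(summary)
```

For example, the prediction "Paris, France" against the gold answer "Paris" scores 0 on EM but 2/3 on F1, which is why the two metrics are usually reported together.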

[161] GRAIL: Gradient-Based Adaptive Unlearning for Privacy and Copyright in LLMs

Kun-Woo Kim,Ji-Hoon Park,Ju-Min Han,Seong-Whan Lee

Main category: cs.CL

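TL;DR: GRAIL is a gradient-based multi-domain unlearning framework that precisely removes sensitive information from large language models while preserving critical knowledge.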

Details

Motivation: Large language models may contain sensitive information, and conventional approaches are costly and cannot handle knowledge that is interwoven across domains. Method: GRAIL uses gradient information from multiple domains and an adaptive parameter-wise localization strategy to selectively remove targeted knowledge. Result: GRAIL matches existing methods on unlearning success and achieves up to 17% stronger knowledge retention. Conclusion: GRAIL offers a new paradigm for managing sensitive information in large-scale pre-trained language models. Abstract: Large Language Models (LLMs) trained on extensive datasets often learn sensitive information, which raises significant social and legal concerns under principles such as the "Right to be forgotten." Retraining entire models from scratch to remove undesired information is both costly and impractical. Furthermore, existing single-domain unlearning methods fail to address multi-domain scenarios, where knowledge is interwoven across domains such as privacy and copyright, creating overlapping representations that lead to excessive knowledge removal or degraded performance. To tackle these issues, we propose GRAIL (GRadient-based AdaptIve unLearning), a novel multi-domain unlearning framework. GRAIL leverages gradient information from multiple domains to precisely distinguish the unlearning scope from the retention scope, and applies an adaptive parameter-wise localization strategy to selectively remove targeted knowledge while preserving critical parameters for each domain. Experimental results on unlearning benchmarks show that GRAIL achieves unlearning success on par with the existing approaches, while also demonstrating up to 17% stronger knowledge retention success compared to the previous state-of-the-art method. Our findings establish a new paradigm for effectively managing and regulating sensitive information in large-scale pre-trained language models.

[162] Data-efficient LLM Fine-tuning for Code Generation

Weijie Lv,Xuan Xia,Sheng-Jun Huang

Main category: cs.CL

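TL;DR: The paper proposes a data selection strategy and a dynamic packing technique that improve both the training efficiency and the performance of code generation LLMs.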

Details

Motivation: A performance gap remains between open-source and closed-source code generation models, and the usual remedy of fine-tuning on large amounts of synthetic data is inefficient. Method: A data selection strategy prioritizes complex data while keeping the sampled subset consistent with the original distribution, and a dynamic packing technique optimizes tokenization. Result: Training on 40% of the data improves performance (66.9% vs. 66.1%), reduces training time (34 minutes vs. 47 minutes), and lowers peak GPU memory (42.72 GB vs. 61.47 GB). Conclusion: Optimizing data selection and tokenization improves both model performance and training efficiency. Abstract: Large language models (LLMs) have demonstrated significant potential in code generation tasks. However, there remains a performance gap between open-source and closed-source models. To address this gap, existing approaches typically generate large amounts of synthetic data for fine-tuning, which often leads to inefficient training. In this work, we propose a data selection strategy in order to improve the effectiveness and efficiency of training for code-based LLMs. By prioritizing data complexity and ensuring that the sampled subset aligns with the distribution of the original dataset, our sampling strategy effectively selects high-quality data. Additionally, we optimize the tokenization process through a "dynamic pack" technique, which minimizes padding tokens and reduces computational resource consumption. Experimental results show that when training on 40% of the OSS-Instruct dataset, the DeepSeek-Coder-Base-6.7B model achieves an average performance of 66.9%, surpassing the 66.1% performance with the full dataset. Moreover, training time is reduced from 47 minutes to 34 minutes, and the peak GPU memory decreases from 61.47 GB to 42.72 GB during a single epoch. Similar improvements are observed with the CodeLlama-Python-7B model on the Evol-Instruct dataset. By optimizing both data selection and tokenization, our approach not only improves model performance but also improves training efficiency.
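
The abstract does not describe how "dynamic pack" works internally, so the sketch below only illustrates the general idea of reducing padding by packing several tokenized samples into one fixed-length row. It uses a first-fit-decreasing heuristic, which is an assumption chosen for simplicity; the function names and the sample lengths are hypothetical.

```python
from typing import List


def padding_without_packing(lengths: List[int], max_len: int) -> int:
    """Padding tokens when every sample occupies its own max_len row."""
    return sum(max_len - n for n in lengths)


def pack_first_fit_decreasing(lengths: List[int], max_len: int) -> List[List[int]]:
    """Greedily place each sample (longest first) into the first row with room."""
    rows: List[List[int]] = []
    for n in sorted(lengths, reverse=True):
        if n > max_len:
            raise ValueError("sample longer than max_len; truncate it first")
        for row in rows:
            if sum(row) + n <= max_len:
                row.append(n)
                break
        else:
            rows.append([n])  # no existing row had room, so open a new one
    return rows


def padding_after_packing(rows: List[List[int]], max_len: int) -> int:
    """Padding tokens left once samples share rows."""
    return sum(max_len - sum(row) for row in rows)


# Hypothetical token counts for six training samples.
lengths = [900, 300, 700, 120, 450, 60]
max_len = 1024
rows = pack_first_fit_decreasing(lengths, max_len)
```

On these made-up lengths, six separate rows shrink to three packed rows and padding drops from 3614 to 542 tokens. A real implementation would also need attention masks or position resets so that packed samples do not attend to one another.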

[163] Why and How LLMs Hallucinate: Connecting the Dots with Subsequence Associations

Yiyou Sun,Yu Gai,Lijie Chen,Abhilasha Ravichander,Yejin Choi,Dawn Song

Main category: cs.CL

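TL;DR: The paper proposes a subsequence association framework for systematically tracing and understanding hallucinations in large language models (LLMs), and validates it through theoretical and empirical analysis.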

Details

Motivation: Large language models (LLMs) frequently produce hallucinated content that deviates from factual accuracy or the given context, and the complex causes make diagnosis difficult. Method: The paper proposes a subsequence association framework and a tracing algorithm that identifies causal subsequences by analyzing how hallucination probabilities change across randomized input contexts. Result: Experiments show that the method outperforms standard attribution techniques in identifying hallucination causes and is consistent with evidence from the model's training corpus. Conclusion: The work offers a unified perspective on hallucinations and a robust framework for tracing and analyzing them. Abstract: Large language models (LLMs) frequently generate hallucinations (content that deviates from factual accuracy or provided context), posing challenges for diagnosis due to the complex interplay of underlying causes. This paper introduces a subsequence association framework to systematically trace and understand hallucinations. Our key insight is that hallucinations arise when dominant hallucinatory associations outweigh faithful ones. Through theoretical and empirical analyses, we demonstrate that decoder-only transformers effectively function as subsequence embedding models, with linear layers encoding input-output associations. We propose a tracing algorithm that identifies causal subsequences by analyzing hallucination probabilities across randomized input contexts. Experiments show our method outperforms standard attribution techniques in identifying hallucination causes and aligns with evidence from the model's training corpus. This work provides a unified perspective on hallucinations and a robust framework for their tracing and analysis.

[164] KODIS: A Multicultural Dispute Resolution Dialogue Corpus

James Hale,Sushrita Rakshit,Kushal Chawla,Jeanne M. Brett,Jonathan Gratch

Main category: cs.CL

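TL;DR: KODIS is a dyadic dispute resolution corpus containing thousands of dialogues from over 75 countries, built to study theories of culture and conflict.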

Details

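Motivation: Grounded in a theoretical model of culture and conflict, the work studies the relationship between emotional expression and conflict escalation. Method: Participants take part in a typical customer service dispute designed by experts, and a rich set of dispositional, process, and outcome measures is collected. Result: Initial analysis supports the theory that expressions of anger lead to escalation and highlights cultural differences in emotional expression. Conclusion: The corpus and data collection framework are made available to the community. Abstract: We present KODIS, a dyadic dispute resolution corpus containing thousands of dialogues from over 75 countries. Motivated by a theoretical model of culture and conflict, participants engage in a typical customer service dispute designed by experts to evoke strong emotions and conflict. The corpus contains a rich set of dispositional, process, and outcome measures. The initial analysis supports theories of how anger expressions lead to escalatory spirals and highlights cultural differences in emotional expression. We make this corpus and data collection framework available to the community.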

[165] Pandora: A Code-Driven Large Language Model Agent for Unified Reasoning Across Diverse Structured Knowledge

Yongrui Chen,Junhao He,Linbo Fu,Shenyu Zhang,Rihui Jin,Xinbang Dai,Jiaqi Li,Dehai Min,Nan Hu,Yuxin Zhang,Guilin Qi,Yi Huang,Tongtong Wu

Main category: cs.CL

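TL;DR: The paper proposes Pandora, a unified structured knowledge reasoning framework that uses Python's Pandas API to build a unified knowledge representation aligned with LLM pre-training, and improves performance by generating textual reasoning steps and executable code.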

Details

Motivation: Existing unified structured knowledge reasoning methods struggle to transfer knowledge across tasks or to align with the prior of LLMs, which limits their performance. Method: Pandora uses Python's Pandas API to build a unified knowledge representation, has an LLM generate textual reasoning steps and executable code, and draws demonstrations from training examples that cover multiple tasks. Result: On four benchmarks, Pandora outperforms existing unified frameworks and competes effectively with task-specific methods. Conclusion: Through a unified representation and alignment with LLMs, Pandora improves the performance and generalization of structured knowledge reasoning. Abstract: Unified Structured Knowledge Reasoning (USKR) aims to answer natural language questions (NLQs) by using structured sources such as tables, databases, and knowledge graphs in a unified way. Existing USKR methods rely on either task-specific strategies or custom-defined representations, which struggle to leverage the knowledge transfer between different SKR tasks or align with the prior of LLMs, thereby limiting their performance. This paper proposes a novel USKR framework named Pandora, which takes advantage of Python's Pandas API to construct a unified knowledge representation for alignment with LLM pre-training. It employs an LLM to generate textual reasoning steps and executable Python code for each question. Demonstrations are drawn from a memory of training examples that cover various SKR tasks, facilitating knowledge transfer. Extensive experiments on four benchmarks involving three SKR tasks demonstrate that Pandora outperforms existing unified frameworks and competes effectively with task-specific methods.

[166] Chinese-Vicuna: A Chinese Instruction-following Llama-based Model

Chenghao Fan,Zhenyi Lu,Jie Tian

Main category: cs.CL

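TL;DR: Chinese-Vicuna is an open-source, resource-efficient language model that fine-tunes the LLaMA architecture with LoRA to close the gap in Chinese instruction-following, and supports deployment in low-resource environments and domain adaptation.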

Details

Motivation: To address the lack of Chinese instruction-following capability and to support efficient deployment in low-resource environments and domain-specific applications. Method: The LLaMA architecture is fine-tuned with LoRA, combined with hybrid datasets (BELLE and Guanaco) and 4-bit quantization (QLoRA), to support domain adaptation and multi-task performance. Result: The model is competitive on translation, code generation, and domain-specific Q&A, and in particular on medical tasks, multi-turn dialogue, and real-time legal updates. Conclusion: With its modular design, open-source ecosystem, and community-driven enhancements, Chinese-Vicuna is a flexible foundation for Chinese LLM applications. Abstract: Chinese-Vicuna is an open-source, resource-efficient language model designed to bridge the gap in Chinese instruction-following capabilities by fine-tuning Meta's LLaMA architecture using Low-Rank Adaptation (LoRA). Targeting low-resource environments, it enables cost-effective deployment on consumer GPUs (e.g., RTX-2080Ti for 7B models) and supports domain-specific adaptation in fields like healthcare and law. By integrating hybrid datasets (BELLE and Guanaco) and 4-bit quantization (QLoRA), the model achieves competitive performance in tasks such as translation, code generation, and domain-specific Q&A. The project provides a comprehensive toolkit for model conversion, CPU inference, and multi-turn dialogue interfaces, emphasizing accessibility for researchers and developers. Evaluations indicate competitive performance across medical tasks, multi-turn dialogue coherence, and real-time legal updates. Chinese-Vicuna's modular design, open-source ecosystem, and community-driven enhancements position it as a versatile foundation for Chinese LLM applications.

[167] Out of Sight Out of Mind, Out of Sight Out of Mind: Measuring Bias in Language Models Against Overlooked Marginalized Groups in Regional Contexts

Fatma Elsafoury,David Hartmann

Main category: cs.CL

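TL;DR: The paper studies bias in language models (LMs) against marginalized groups, with a focus on overlooked groups in non-English-speaking countries and low-resource languages, analyzing 270 marginalized groups across 23 LMs.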

Details

Motivation: Existing research focuses mainly on English-speaking countries and the Western world, overlooking marginalized groups and low-resource languages elsewhere, so bias in LMs has not been addressed comprehensively. Method: The study analyzes bias in 23 LMs against 270 marginalized groups and compares how the Egyptian Arabic dialect and Modern Standard Arabic affect bias measurement. Result: LMs show higher bias against many marginalized groups, but Arabic LMs show high bias against both marginalized and dominant groups in relation to religion and ethnicity. Intersectional bias is also higher against non-binary people, LGBTQIA+ people, and Black women. Conclusion: Developing inclusive LMs requires broadening research to cover more marginalized groups and low-resource languages, and improving current bias metrics. Abstract: Thanks to research conducted mainly in the US and the broader English-speaking world, we know that language models (LMs) form biases and stereotypes about minorities, leading to unfair treatment of members of these groups. As the negative behavior of these models has severe consequences for society and individuals, industry and academia are actively developing methods to reduce the bias in LMs. However, there are many under-represented groups and languages that have been overlooked so far. This includes marginalized groups that are specific to individual countries and regions in the English-speaking and Western world, but crucially also almost all marginalized groups in the rest of the world. The UN estimates that between 600 million and 1.2 billion people worldwide are members of marginalized groups and in need of special protection. If we want to develop inclusive LMs that work for everyone, we have to broaden our understanding to include overlooked marginalized groups and low-resource languages and dialects. In this work, we contribute to this effort with the first study investigating offensive stereotyping bias in 23 LMs for 270 marginalized groups from Egypt, the remaining 21 Arab countries, Germany, the UK, and the US. Additionally, we investigate the impact of low-resource languages and dialects on the study of bias in LMs, demonstrating the limitations of current bias metrics, as we measure significantly higher bias when using the Egyptian Arabic dialect versus Modern Standard Arabic. Our results show that LMs indeed exhibit higher bias against many marginalized groups in comparison to dominant groups.
However, this is not the case for Arabic LMs, where the bias is high against both marginalized and dominant groups in relation to religion and ethnicity. Our results also show higher intersectional bias against Non-binary, LGBTQIA+ and Black women.

[168] Enhancing the Geometric Problem-Solving Ability of Multimodal LLMs via Symbolic-Neural Integration

Yicheng Pan,Zhenrong Zhang,Pengfei Hu,Jiefeng Ma,Jun Du,Jianshu Zhang,Quan Liu,Jianqing Gao,Feng Ma

Main category: cs.CL

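TL;DR: GeoGen is a pipeline that automatically generates step-by-step solution paths for geometry problems, using symbolic reasoning to produce high-quality data. GeoLogic is an LLM trained on GeoGen data to strengthen the geometric reasoning of multimodal large language models (MLLMs).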

Details

Motivation: In geometry problem solving (GPS), the lack of step-by-step solution data and hallucinations during reasoning limit the use of MLLMs. Method: The GeoGen pipeline generates high-quality geometry solution data, and the GeoLogic model is trained on it to let symbolic systems verify MLLM outputs. Result: Experiments show that the approach consistently improves MLLM performance on geometric reasoning tasks. Conclusion: Combining the strengths of LLMs and symbolic systems yields a more reliable and interpretable approach to GPS. Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have achieved remarkable progress in general domains and demonstrated promise in multimodal mathematical reasoning. However, applying MLLMs to geometry problem solving (GPS) remains challenging due to the lack of accurate step-by-step solution data and severe hallucinations during reasoning. In this paper, we propose GeoGen, a pipeline that can automatically generate step-wise reasoning paths for geometry diagrams. By leveraging precise symbolic reasoning, GeoGen produces large-scale, high-quality question-answer pairs. To further enhance the logical reasoning ability of MLLMs, we train GeoLogic, a Large Language Model (LLM), using synthetic data generated by GeoGen. Serving as a bridge between natural language and symbolic systems, GeoLogic enables symbolic tools to help verify MLLM outputs, making the reasoning process more rigorous and alleviating hallucinations. Experimental results show that our approach consistently improves the performance of MLLMs, achieving remarkable results on benchmarks for geometric reasoning tasks. This improvement stems from our integration of the strengths of LLMs and symbolic systems, which enables a more reliable and interpretable approach for the GPS task. Code is available at https://github.com/ycpNotFound/GeoGen.

[169] Assesing LLMs in Art Contexts: Critique Generation and Theory of Mind Evaluation

Takaya Arita,Wenxian Zheng,Reiji Suzuki,Fuminori Akiba

Main category: cs.CL

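TL;DR: The study examines how large language models (LLMs) perform in art-related settings, covering artwork critique generation and Theory of Mind (ToM) reasoning in art contexts. LLMs can produce critiques that are plausible in style and rich in interpretation, while their performance on complex ToM tasks varies.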

Details

Motivation: To explore the abilities of LLMs in art critique and ToM tasks, and to assess whether they can produce expert-like output without genuine understanding. Method: Noel Carroll's evaluative framework is combined with art criticism theories to generate critiques through step-by-step prompting, and new ToM tasks are developed and used to test 41 LLMs. Result: Human subjects had difficulty distinguishing AI-generated critiques from expert ones, and LLM performance on ToM tasks varied markedly, especially in affective or ambiguous situations. Conclusion: With carefully designed prompts, LLMs may show behavior that resembles understanding, but their cognitive limitations remain. Abstract: This study explored how large language models (LLMs) perform in two areas related to art: writing critiques of artworks and reasoning about mental states (Theory of Mind, or ToM) in art-related situations. For the critique generation part, we built a system that combines Noel Carroll's evaluative framework with a broad selection of art criticism theories. The model was prompted to first write a full-length critique and then shorter, more coherent versions using a step-by-step prompting process. These AI-generated critiques were then compared with those written by human experts in a Turing test-style evaluation. In many cases, human subjects had difficulty telling which was which, and the results suggest that LLMs can produce critiques that are not only plausible in style but also rich in interpretation, as long as they are carefully guided. In the second part, we introduced new simple ToM tasks based on situations involving interpretation, emotion, and moral tension, which can appear in the context of art. These go beyond standard false-belief tests and allow for more complex, socially embedded forms of reasoning. We tested 41 recent LLMs and found that their performance varied across tasks and models. In particular, tasks that involved affective or ambiguous situations tended to reveal clearer differences. Taken together, these results help clarify how LLMs respond to complex interpretative challenges, revealing both their cognitive limitations and potential.
While our findings do not directly contradict the so-called Generative AI Paradox (the idea that LLMs can produce expert-like output without genuine understanding), they suggest that, depending on how LLMs are instructed, such as through carefully designed prompts, these models may begin to show behaviors that resemble understanding more closely than we might assume.

[170] SMARTe: Slot-based Method for Accountable Relational Triple extraction

Xue Wen Tan,Stanley Kok

Main category: cs.CL

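TL;DR: SMARTe is a relational triple extraction method based on slot attention that emphasizes interpretability while achieving performance comparable to state-of-the-art models.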

Details

Motivation: Existing methods focus mainly on optimizing model performance, offer little insight into internal mechanisms, and rely on complex preprocessing that makes systems opaque. Method: SMARTe uses a slot attention mechanism and frames the task as a set prediction problem, so that every prediction can be traced to learned slot representations and the contributing tokens. Result: On the NYT and WebNLG datasets, SMARTe provides interpretability without compromising performance and presents its explanations through attention heatmaps. Conclusion: SMARTe shows that interpretability and high performance can coexist, and the paper proposes directions for future research. Abstract: Relational Triple Extraction (RTE) is a fundamental task in Natural Language Processing (NLP). However, prior research has primarily focused on optimizing model performance, with limited efforts to understand the internal mechanisms driving these models. Many existing methods rely on complex preprocessing to induce specific interactions, often resulting in opaque systems that may not fully align with their theoretical foundations. To address these limitations, we propose SMARTe: a Slot-based Method for Accountable Relational Triple extraction. SMARTe introduces intrinsic interpretability through a slot attention mechanism and frames the task as a set prediction problem. Slot attention consolidates relevant information into distinct slots, ensuring all predictions can be explicitly traced to learned slot representations and the tokens contributing to each predicted relational triple. While emphasizing interpretability, SMARTe achieves performance comparable to state-of-the-art models. Evaluations on the NYT and WebNLG datasets demonstrate that adding interpretability does not compromise performance. Furthermore, we conducted qualitative assessments to showcase the explanations provided by SMARTe, using attention heatmaps that map to their respective tokens. We conclude with a discussion of our findings and propose directions for future research.

[171] Can LLMs reason over extended multilingual contexts? Towards long-context evaluation beyond retrieval and haystacks

Amey Hengle,Prasoon Bajpai,Soham Dan,Tanmoy Chakraborty

Main category: cs.CL

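TL;DR: MLRBench is a new multilingual long-context reasoning benchmark that goes beyond retrieval-centric approaches to evaluate multi-hop inference, aggregation, and epistemic reasoning.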

Details

Motivation: Existing benchmarks are limited to evaluating retrieval, are susceptible to data leakage and short-circuiting, and cannot fully assess a model's long-context reasoning. Method: MLRBench comprises multilingual tasks designed to be parallel, leakage-resistant, and scalable, and evaluates multi-hop inference, aggregation, and epistemic reasoning. Result: Experiments reveal a pronounced gap between high- and low-resource languages, and in multilingual settings LLMs effectively use less than 30% of their claimed context length. Conclusion: MLRBench is open-sourced as a new tool for evaluating and training multilingual LLMs; off-the-shelf retrieval-augmented generation does not fully solve the long-context problem. Abstract: Existing multilingual long-context benchmarks, often based on the popular needle-in-a-haystack test, primarily evaluate a model's ability to locate specific information buried within irrelevant texts. However, such a retrieval-centric approach is myopic and inherently limited, as successful recall alone does not indicate a model's capacity to reason over extended contexts. Moreover, these benchmarks are susceptible to data leakage, short-circuiting, and risk making the evaluation a priori identifiable. To address these limitations, we introduce MLRBench, a new synthetic benchmark for multilingual long-context reasoning. Unlike existing benchmarks, MLRBench goes beyond surface-level retrieval by including tasks that assess multi-hop inference, aggregation, and epistemic reasoning. Spanning seven languages, MLRBench is designed to be parallel, resistant to leakage, and scalable to arbitrary context lengths. Our extensive experiments with an open-weight large language model (LLM) reveal a pronounced gap between high- and low-resource languages, particularly for tasks requiring the model to aggregate multiple facts or predict the absence of information. We also find that, in multilingual settings, LLMs effectively utilize less than 30% of their claimed context length. Although off-the-shelf Retrieval Augmented Generation helps alleviate this to a certain extent, it does not solve the long-context problem. We open-source MLRBench to enable future research in improved evaluation and training of multilingual LLMs.

[172] ViClaim: A Multilingual Multilabel Dataset for Automatic Claim Detection in Videos

Patrick Giedemann,Pius von Däniken,Jan Deriu,Alvaro Rodrigo,Anselmo Peñas,Mark Cieliebak

Main category: cs.CL

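TL;DR: ViClaim is a multilingual, multi-topic dataset of video transcripts for claim detection in support of misinformation detection in videos; experiments show strong model performance but limited generalization across topics.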

Details

Motivation: Video content is increasingly important as a medium for communication and misinformation, yet existing research focuses mostly on written text and does not handle the complexity of spoken language in video transcripts. Method: The ViClaim dataset contains 1,798 annotated video transcripts (three languages, six topics); a custom annotation tool was developed, and experiments were run with multilingual language models. Result: Models perform well in cross-validation (macro F1 up to 0.896) but generalize poorly to unseen topics. Conclusion: ViClaim provides a foundation for misinformation detection in videos, but cross-topic detection still needs improvement. Abstract: The growing influence of video content as a medium for communication and misinformation underscores the urgent need for effective tools to analyze claims in multilingual and multi-topic settings. Existing efforts in misinformation detection largely focus on written text, leaving a significant gap in addressing the complexity of spoken text in video transcripts. We introduce ViClaim, a dataset of 1,798 annotated video transcripts across three languages (English, German, Spanish) and six topics. Each sentence in the transcripts is labeled with three claim-related categories: fact-check-worthy, fact-non-check-worthy, or opinion. We developed a custom annotation tool to facilitate the highly complex annotation process. Experiments with state-of-the-art multilingual language models demonstrate strong performance in cross-validation (macro F1 up to 0.896) but reveal challenges in generalization to unseen topics, particularly for distinct domains. Our findings highlight the complexity of claim detection in video transcripts. ViClaim offers a robust foundation for advancing misinformation detection in video-based communication, addressing a critical gap in multimodal analysis.
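
Macro F1, the metric reported above, averages the per-class F1 scores so that each class counts equally regardless of how frequent it is. The sketch below treats each sentence as having a single label, which is a simplifying assumption (the dataset is described as multilabel), and the short label names and example predictions are hypothetical.

```python
from typing import List, Sequence


def macro_f1(gold: Sequence[str], pred: Sequence[str], labels: List[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical short names for the three categories in the abstract.
LABELS = ["fcw", "fnc", "opn"]  # fact-check-worthy, fact-non-check-worthy, opinion

gold = ["fcw", "fcw", "fnc", "opn", "opn", "opn"]
pred = ["fcw", "fnc", "fnc", "opn", "opn", "fcw"]
score = macro_f1(gold, pred, LABELS)
```

Here the per-class F1 values are 0.5, 2/3, and 0.8, so the macro average is about 0.656 even though plain accuracy is 4/6; the two numbers diverge more as class frequencies become more skewed.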

[173] Are AI agents the new machine translation frontier? Challenges and opportunities of single- and multi-agent systems for multilingual digital communication

Vicent Briva-Iglesias

Main category: cs.CL

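TL;DR: The paper discusses the potential of single-agent and multi-agent systems for machine translation and suggests that multi-agent systems may perform better in complex scenarios.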

Details

Motivation: To explore the underexplored potential of AI agents in machine translation for improving multilingual digital communication. Method: A pilot study in legal machine translation uses a multi-agent system in which four specialized AI agents collaborate. Result: The findings suggest that multi-agent systems may improve domain adaptability and contextual awareness and deliver higher translation quality than traditional MT or single-agent systems. Conclusion: Multi-agent systems could improve machine translation quality, and the paper lays groundwork for future research and integration into professional translation workflows. Abstract: The rapid evolution of artificial intelligence (AI) has introduced AI agents as a disruptive paradigm across various industries, yet their application in machine translation (MT) remains underexplored. This paper describes and analyses the potential of single- and multi-agent systems for MT, reflecting on how they could enhance multilingual digital communication. While single-agent systems are well-suited for simpler translation tasks, multi-agent systems, which involve multiple specialized AI agents collaborating in a structured manner, may offer a promising solution for complex scenarios requiring high accuracy, domain-specific knowledge, and contextual awareness. To demonstrate the feasibility of multi-agent workflows in MT, we are conducting a pilot study in legal MT. The study employs a multi-agent system involving four specialized AI agents for (i) translation, (ii) adequacy review, (iii) fluency review, and (iv) final editing. Our findings suggest that multi-agent systems may have the potential to significantly improve domain-adaptability and contextual awareness, with superior translation quality to traditional MT or single-agent systems. This paper also sets the stage for future research into multi-agent applications in MT, integration into professional translation workflows, and shares a demo of the system analyzed in the paper.

[174] Information Gain-Guided Causal Intervention for Autonomous Debiasing Large Language Models

Zhouhao Sun,Xiao Ding,Li Du,Yunpeng Xu,Yixuan Ma,Yang Zhao,Bing Qin,Ting Liu

Main category: cs.CL

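TL;DR: The paper proposes an information gain-guided causal intervention debiasing framework (IGCIDB) that combines causal mechanisms with information theory to automatically balance the distribution of instruction-tuning datasets and improve the generalizability of large language models.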

Details

Motivation: Current large language models (LLMs) may capture dataset biases and use them during inference, which leads to poor generalizability. Existing debiasing methods have limited effectiveness, so a new approach is needed. Method: The IGCIDB framework uses an information gain-guided causal intervention method to balance the dataset distribution and then trains LLMs with standard supervised fine-tuning. Result: Experiments show that IGCIDB debiases effectively and improves the generalizability of LLMs across different tasks. Conclusion: The IGCIDB framework is an effective solution to debiasing LLMs and improves their generalization performance. Abstract: Despite significant progress, recent studies indicate that current large language models (LLMs) may still capture dataset biases and utilize them during inference, leading to the poor generalizability of LLMs. However, due to the diversity of dataset biases and the insufficient nature of bias suppression based on in-context learning, the effectiveness of previous prior knowledge-based debiasing methods and in-context learning based automatic debiasing methods is limited. To address these challenges, we explore the combination of causal mechanisms with information theory and propose an information gain-guided causal intervention debiasing (IGCIDB) framework. This framework first utilizes an information gain-guided causal intervention method to automatically and autonomously balance the distribution of the instruction-tuning dataset. Subsequently, it employs a standard supervised fine-tuning process to train LLMs on the debiased dataset. Experimental results show that IGCIDB can effectively debias LLMs to improve their generalizability across different tasks.

[175] Benchmarking Multi-National Value Alignment for Large Language Models

Chengyi Ju,Weijie Shi,Chengzhong Liu,Jiaming Ji,Jipeng Zhang,Ruiyuan Zhang,Jia Zhu,Jiajie Xu,Yaodong Yang,Sirui Han,Yike Guo

Main category: cs.CL

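TL;DR: NaVAB is a benchmark for evaluating how well large language models (LLMs) align with the values of five major nations; its efficient data processing pipeline and conflict reduction mechanism help identify and reduce value conflicts.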

Details

Motivation: Existing research focuses mainly on ethical reviews and overlooks the diversity of national values, and existing benchmarks are hard to scale. Method: The NaVAB benchmark includes a national value extraction pipeline, a modeling procedure with instruction tagging, a screening process, and a conflict reduction mechanism. Result: Experiments show that NaVAB identifies value conflicts in LLMs and that combining it with alignment techniques reduces them. Conclusion: NaVAB is a practical tool for aligning LLMs with national values and helps reduce value conflicts. Abstract: Do Large Language Models (LLMs) hold positions that conflict with your country's values? Occasionally they do! However, existing works primarily focus on ethical reviews, failing to capture the diversity of national values, which encompass broader policy, legal, and moral considerations. Furthermore, current benchmarks that rely on spectrum tests using manually designed questionnaires are not easily scalable. To address these limitations, we introduce NaVAB, a comprehensive benchmark to evaluate the alignment of LLMs with the values of five major nations: China, the United States, the United Kingdom, France, and Germany. NaVAB implements a national value extraction pipeline to efficiently construct value assessment datasets. Specifically, we propose a modeling procedure with instruction tagging to process raw data sources, a screening process to filter value-related topics and a generation process with a Conflict Reduction mechanism to filter non-conflicting values. We conduct extensive experiments on various LLMs across countries, and the results provide insights into assisting in the identification of misaligned scenarios. Moreover, we demonstrate that NaVAB can be combined with alignment techniques to effectively reduce value concerns by aligning LLMs' values with the target country.

[176] MAIN: Mutual Alignment Is Necessary for instruction tuning

Fanyi Yang,Jianfeng Liu,Xin Zhang,Haoyu Liu,Xixin Cao,Yuefeng Zhan,Hao Sun,Weiwei Deng,Feng Sun,Qi Zhang

Main category: cs.CL

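TL;DR: Proposes a Mutual Alignment Framework (MAIN) that uses mutual constraints to keep instructions and responses aligned, improving the quality and scalability of instruction tuning.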

Details

Motivation: Current instruction-tuning methods overlook instruction-response alignment when scaling up data generation, yet high-quality alignment is critical to model performance. Method: Proposes the Mutual Alignment Framework (MAIN), which ensures instruction-response alignment through mutual constraints. Result: Experiments show that LLaMA and Mistral models tuned with MAIN outperform traditional methods on multiple benchmarks. Conclusion: Instruction-response alignment plays a key role in high-quality instruction tuning, and MAIN offers a scalable solution for LLMs. Abstract: Instruction tuning has enabled large language models (LLMs) to achieve remarkable performance, but its success heavily depends on the availability of large-scale, high-quality instruction-response pairs. However, current methods for scaling up data generation often overlook a crucial aspect: the alignment between instructions and responses. We hypothesize that high-quality instruction-response pairs are not defined by the individual quality of each component, but by the extent of their alignment with each other. To address this, we propose a Mutual Alignment Framework (MAIN) that ensures coherence between the instruction and response through mutual constraints. Experiments demonstrate that models such as LLaMA and Mistral, fine-tuned within this framework, outperform traditional methods across multiple benchmarks. This approach underscores the critical role of instruction-response alignment in enabling scalable and high-quality instruction tuning for LLMs.

[177] ConExion: Concept Extraction with Large Language Models

Ebrahim Norouzi,Sven Hertling,Harald Sack

Main category: cs.CL

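TL;DR: Proposes a concept extraction method based on pre-trained large language models (LLMs). Unlike traditional keyphrase extraction, it extracts all relevant concepts in a document, not just the important ones. Experiments show it improves F1 over existing techniques, and the paper explores the potential of unsupervised concept extraction.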

Details

Motivation: Traditional methods extract only the important keyphrases in a document; this work aims to extract all relevant concepts to support ontology evaluation and learning. Method: Uses pre-trained LLMs with prompting for unsupervised concept extraction. Result: Experiments on two benchmark datasets show an F1 improvement over existing techniques. Conclusion: LLMs perform well on concept extraction and support ontology evaluation and learning; the code and datasets are open source. Abstract: In this paper, an approach for concept extraction from documents using pre-trained large language models (LLMs) is presented. Compared with conventional methods that extract keyphrases summarizing the important information discussed in a document, our approach tackles a more challenging task of extracting all present concepts related to the specific domain, not just the important ones. Through comprehensive evaluations on two widely used benchmark datasets, we demonstrate that our method improves the F1 score compared to state-of-the-art techniques. Additionally, we explore the potential of using prompts within these models for unsupervised concept extraction. The extracted concepts are intended to support domain coverage evaluation of ontologies and facilitate ontology learning, highlighting the effectiveness of LLMs in concept extraction tasks. Our source code and datasets are publicly available at https://github.com/ISE-FIZKarlsruhe/concept_extraction.

[178] Are Retrials All You Need? Enhancing Large Language Model Reasoning Without Verbalized Feedback

Nearchos Potamitis,Akhil Arora

Main category: cs.CL

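TL;DR: Proposes a feedback-free "retrial" mechanism that simplifies reasoning frameworks and shows that it often outperforms more complex methods.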

Details

Motivation: Existing iterative reasoning strategies incur extra computational cost, and complex methods do not necessarily beat simple ones. Method: Introduces "retrials without feedback", which lets an LLM retry a problem after an answer is identified as incorrect. Result: Simple retrial often outperforms complex reasoning frameworks, suggesting the computational cost of complex methods is not always justified. Conclusion: The study challenges the assumption that more complex reasoning strategies are necessarily better and shows the potential of simpler, more efficient methods. Abstract: Recent advancements in large language models (LLMs) have catalyzed the development of general-purpose autonomous agents, demonstrating remarkable performance in complex reasoning tasks across various domains. This surge has spurred the evolution of a plethora of prompt-based reasoning frameworks. A recent focus has been on iterative reasoning strategies that refine outputs through self-evaluation and verbalized feedback. However, these strategies require additional computational complexity to enable models to recognize and correct their mistakes, leading to a significant increase in their cost. In this work, we introduce the concept of "retrials without feedback", an embarrassingly simple yet powerful mechanism for enhancing reasoning frameworks by allowing LLMs to retry problem-solving attempts upon identifying incorrect answers. Unlike conventional iterative refinement methods, our method does not require explicit self-reflection or verbalized feedback, simplifying the refinement process. Our findings indicate that simpler retrial-based approaches often outperform more sophisticated reasoning frameworks, suggesting that the benefits of complex methods may not always justify their computational costs. By challenging the prevailing assumption that more intricate reasoning strategies inherently lead to better performance, our work offers new insights into how simpler, more efficient approaches can achieve optimal results. So, are retrials all you need?
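
The retrial idea described above can be sketched as a plain loop: attempt, check, and try again from scratch, with nothing carried over between attempts. This is a minimal sketch under stated assumptions. `attempt` stands in for a call to an LLM-based solver and `is_correct` for whatever mechanism identifies an incorrect answer; both names and the retry budget are hypothetical, and the paper's actual setup may differ.

```python
def solve_with_retrials(attempt, is_correct, max_retrials=5):
    """Call attempt() until is_correct accepts the answer or the budget runs out.

    No feedback or self-reflection is passed from one attempt to the next.
    Returns (last_answer, attempts_used, solved).
    """
    answer = None
    for trial in range(1, max_retrials + 1):
        answer = attempt()
        if is_correct(answer):
            return answer, trial, True
    return answer, max_retrials, False


# Hypothetical stand-ins: a "solver" that replays canned answers, and an exact-match check.
canned_answers = iter([3, 5, 4])
result = solve_with_retrials(lambda: next(canned_answers), lambda a: a == 4)
print(result)
```

The only state kept across attempts is the attempt counter, which is the point of contrast with refinement methods that feed a critique of the previous answer back into the prompt.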

[179] Estimating Optimal Context Length for Hybrid Retrieval-augmented Multi-document Summarization

Adithya Pratapa,Teruko Mitamura

Main category: cs.CL

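TL;DR: Proposes a hybrid method that combines retrieval-augmented systems with long context windows and optimizes the retrieval length to improve multi-document summarization.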

Details

Motivation: Long-context models underperform in practice, and retrieval-augmented systems are sensitive to retrieval length, so a better solution is needed. Method: Estimates the optimal retrieval length by using LLMs to generate silver references, then uses it to set the context length of the retrieval-augmented system. Result: Performs well on multi-document summarization across model classes and sizes, and is compared against length estimates from long-context benchmarks such as RULER and HELMET. Conclusion: The method effectively optimizes retrieval length, works for very long-context models, and generalizes to new classes of LMs. Abstract: Recent advances in long-context reasoning abilities of language models led to interesting applications in large-scale multi-document summarization. However, prior work has shown that these long-context models are not effective at their claimed context windows. To this end, retrieval-augmented systems provide an efficient and effective alternative. However, their performance can be highly sensitive to the choice of retrieval context length. In this work, we present a hybrid method that combines retrieval-augmented systems with long-context windows supported by recent language models. Our method first estimates the optimal retrieval length as a function of the retriever, summarizer, and dataset. On a randomly sampled subset of the dataset, we use a panel of LLMs to generate a pool of silver references. We use these silver references to estimate the optimal context length for a given RAG system configuration. Our results on the multi-document summarization task showcase the effectiveness of our method across model classes and sizes. We compare against length estimates from strong long-context benchmarks such as RULER and HELMET. Our analysis also highlights the effectiveness of our estimation method for very long-context LMs and its generalization to new classes of LMs.
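
One way to read the estimation step above is as a small search: for each candidate retrieval length, run the system on a sampled subset and keep the length whose outputs score best against the silver references. The sketch below assumes exactly that and nothing more. `summarize`, `score`, the toy truncation "system", and the unigram-overlap metric are all hypothetical stand-ins, not the authors' components.

```python
from statistics import mean


def estimate_context_length(candidate_lengths, examples, summarize, score):
    """Return the candidate length with the best mean score against silver references.

    examples:  list of (documents, silver_references) pairs from a sampled subset
    summarize: summarize(documents, length) -> summary, standing in for the RAG system
    score:     score(summary, references) -> float, standing in for an automatic metric
    """
    def average_score(length):
        return mean(score(summarize(docs, length), refs) for docs, refs in examples)

    return max(candidate_lengths, key=average_score)


def truncate(words, length):
    """Toy 'system': the summary is simply the first `length` retrieved words."""
    return words[:length]


def overlap_f1(summary_words, references):
    """Toy metric: best unigram-overlap F1 between the summary and any reference."""
    best = 0.0
    for ref in references:
        common = len(set(summary_words) & set(ref))
        if common:
            p, r = common / len(set(summary_words)), common / len(set(ref))
            best = max(best, 2 * p * r / (p + r))
    return best


examples = [("a b c d e f g h".split(), [{"a", "b", "c"}])]
print(estimate_context_length([1, 3, 8], examples, truncate, overlap_f1))
```

In the toy data, too short a context misses reference content and too long a context dilutes it, which mirrors the sensitivity to retrieval length that motivates the estimate.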

[180] Sparks of Science: Hypothesis Generation Using Structured Paper Data

Charles O'Neill,Tirthankar Ghosal,Roberta Răileanu,Mike Walmsley,Thang Bui,Kevin Schawinski,Ioana Ciucă

Main category: cs.CL

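TL;DR: Introduces the HypoGen dataset for scientific hypothesis generation; its Bit-Flip-Spark schema and chain of reasoning improve the novelty and feasibility of generated hypotheses.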

Details

Motivation: Current foundation models struggle to generate scientific hypotheses that are both novel and feasible, and there is no dedicated dataset that frames scientific hypothesis generation as a natural language generation task. Method: Introduces HypoGen, a dataset of approximately 5500 structured problem-hypothesis pairs using the Bit-Flip-Spark schema and a chain of reasoning, and frames hypothesis generation as conditional language modelling. Result: Experiments show that fine-tuning on HypoGen improves the novelty, feasibility, and overall quality of generated hypotheses. Conclusion: HypoGen provides an effective resource for scientific hypothesis generation, and its applications can be extended in future work. Abstract: Generating novel and creative scientific hypotheses is a cornerstone in achieving Artificial General Intelligence. Large language and reasoning models have the potential to aid in the systematic creation, selection, and validation of scientifically informed hypotheses. However, current foundation models often struggle to produce scientific ideas that are both novel and feasible. One reason is the lack of a dedicated dataset that frames Scientific Hypothesis Generation (SHG) as a Natural Language Generation (NLG) task. In this paper, we introduce HypoGen, the first dataset of approximately 5500 structured problem-hypothesis pairs extracted from top-tier computer science conferences structured with a Bit-Flip-Spark schema, where the Bit is the conventional assumption, the Spark is the key insight or conceptual leap, and the Flip is the resulting counterproposal. HypoGen uniquely integrates an explicit Chain-of-Reasoning component that reflects the intellectual process from Bit to Flip. We demonstrate that framing hypothesis generation as conditional language modelling, with the model fine-tuned on Bit-Flip-Spark and the Chain-of-Reasoning (and where, at inference, we only provide the Bit), leads to improvements in the overall quality of the hypotheses. Our evaluation employs automated metrics and LLM judge rankings for overall quality assessment. We show that by fine-tuning on our HypoGen dataset we improve the novelty, feasibility, and overall quality of the generated hypotheses. The HypoGen dataset is publicly available at huggingface.co/datasets/UniverseTBD/hypogen-dr1.

[181] Accommodate Knowledge Conflicts in Retrieval-augmented LLMs: Towards Reliable Response Generation in the Wild

Jiatai Wang,Zhiwei Xu,Di Jin,Xuewen Yang,Tao Li

Main category: cs.CL

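TL;DR: Analyzes how large language models (LLMs) behave under knowledge conflicts and proposes Swin-VIB, a framework that uses variational information bottleneck models to refine retrieved information and make response generation more reliable. Experiments show clear gains in task accuracy.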

Details

Motivation: LLMs in information retrieval systems face knowledge conflicts that make responses unreliable and introduce uncertainty in decision-making; this work aims to address that. Method: Proposes Swin-VIB, which integrates variational information bottleneck models to adaptively augment retrieved information and guide LLM response generation. Result: Experiments validate the theoretical findings; Swin-VIB improves single-choice task accuracy by at least 7.54%. Conclusion: Swin-VIB effectively addresses knowledge conflicts in LLMs and improves the reliability of response generation. Abstract: The proliferation of large language models (LLMs) has significantly advanced information retrieval systems, particularly in response generation (RG). Unfortunately, LLMs often face knowledge conflicts between internal memory and retrieved external information, arising from misinformation, biases, or outdated knowledge. These conflicts undermine response reliability and introduce uncertainty in decision-making. In this work, we analyze how LLMs navigate knowledge conflicts from an information-theoretic perspective and reveal that when conflicting and supplementary information exhibit significant differences, LLMs confidently resolve their preferences. However, when the distinction is ambiguous, LLMs experience heightened uncertainty. Based on this insight, we propose Swin-VIB, a novel framework that integrates a pipeline of variational information bottleneck models into adaptive augmentation of retrieved information and guiding LLM preference in response generation. Extensive experiments on single-choice, open-ended question-answering (QA), and retrieval augmented generation (RAG) validate our theoretical findings and demonstrate the efficacy of Swin-VIB. Notably, our method improves single-choice task accuracy by at least 7.54% over competitive baselines.

[182] SHA256 at SemEval-2025 Task 4: Selective Amnesia -- Constrained Unlearning for Large Language Models via Knowledge Isolation

Saransh Agrawal,Kuan-Hao Huang

Main category: cs.CL

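TL;DR: Proposes a targeted unlearning method for large language models (LLMs) that uses causal mediation analysis and layer-specific optimization to remove sensitive information without significantly degrading model performance.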

Details

Motivation: LLMs tend to memorize sensitive information during training, and existing unlearning methods struggle to remove specific data selectively without harming overall capability. Method: A two-stage approach: causal mediation analysis identifies the critical layers (layers 0-5), then constrained optimization with a joint loss function applies targeted unlearning to the lower layers. Result: Ranked 2nd in the 1B model track while maintaining 88% of baseline MMLU accuracy. Conclusion: Causal-informed layer optimization offers a promising paradigm for efficient, precise unlearning in LLMs and helps address data privacy concerns in AI systems. Abstract: Large language models (LLMs) frequently memorize sensitive information during training, posing risks when deploying publicly accessible models. Current machine unlearning methods struggle to selectively remove specific data associations without degrading overall model capabilities. This paper presents our solution to SemEval-2025 Task 4 on targeted unlearning, which introduces a two-stage methodology that combines causal mediation analysis with layer-specific optimization. Through systematic causal tracing experiments on OLMo architectures (1B and 7B parameters), we identify the critical role of the first few transformer layers (layers 0-5) in storing subject-attribute associations within MLP modules. Building on this insight, we develop a constrained optimization approach that freezes upper layers while applying a novel joint loss function to lower layers, simultaneously maximizing forget set loss via output token cross-entropy penalties and minimizing retain set deviation through adaptive regularization. Our method achieves 2nd place in the 1B model track, demonstrating strong task performance while maintaining 88% of baseline MMLU accuracy. These results establish causal-informed layer optimization as a promising paradigm for efficient, precise unlearning in LLMs, offering a significant step forward in addressing data privacy concerns in AI systems.

[183] ChatEXAONEPath: An Expert-level Multimodal Large Language Model for Histopathology Using Whole Slide Images

Sangwook Kim,Soonyoung Lee,Jongseong Jang

Main category: cs.CL

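TL;DR: Proposes ChatEXAONEPath, a WSI-based multimodal large language model for histopathology diagnosis that integrates multimodal information to better understand clinical context.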

Details

Motivation: Existing multimodal LLMs in histopathology lack a thorough understanding of clinical context, mainly because of the limitations of public datasets. Method: Uses a retrieval-based data generation pipeline built on 10,094 pairs of WSIs and histopathology reports from TCGA, and develops an AI-based evaluation protocol for comprehensive understanding of medical context. Result: The model reaches a diagnostic acceptance rate of 62.9% on 1,134 pairs of WSIs and reports and can understand WSIs and clinical context across multiple cancer types. Conclusion: By integrating multimodal information, ChatEXAONEPath has the potential to help clinicians understand complex WSI morphology comprehensively and to improve cancer diagnosis. Abstract: Recent studies have made significant progress in developing large language models (LLMs) in the medical domain, which can answer expert-level questions and demonstrate the potential to assist clinicians in real-world clinical scenarios. Studies have also witnessed the importance of integrating various modalities with the existing LLMs for a better understanding of complex clinical contexts, which are innately multi-faceted by nature. Although studies have demonstrated the ability of multimodal LLMs in histopathology to answer questions from given images, they lack in understanding of thorough clinical context due to the patch-level data with limited information from public datasets. Thus, developing WSI-level MLLMs is significant in terms of the scalability and applicability of MLLMs in histopathology. In this study, we introduce an expert-level MLLM for histopathology using WSIs, dubbed as ChatEXAONEPath. We present a retrieval-based data generation pipeline using 10,094 pairs of WSIs and histopathology reports from The Cancer Genome Atlas (TCGA). We also showcase an AI-based evaluation protocol for a comprehensive understanding of the medical context from given multimodal information and evaluate generated answers compared to the original histopathology reports. We demonstrate the ability of diagnosing the given histopathology images using ChatEXAONEPath with the acceptance rate of 62.9% from 1,134 pairs of WSIs and reports. Our proposed model can understand pan-cancer WSIs and clinical context from various cancer types. We argue that our proposed model has the potential to assist clinicians by comprehensively understanding complex morphology of WSIs for cancer diagnosis through the integration of multiple modalities.

[184] Aspect-Based Summarization with Self-Aspect Retrieval Enhanced Generation

Yichao Feng,Shuai Zhao,Yueqiu Li,Luwei Xiao,Xiaobao Wu,Anh Tuan Luu

Main category: cs.CL

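TL;DR: Proposes a self-retrieval framework for aspect-based summarization that addresses the resource constraints and limited generalizability of traditional methods while optimizing token usage and reducing hallucination.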

Details

Motivation: Traditional summarization methods are resource-intensive and generalize poorly, while LLMs depend on prompt engineering and face token limits and hallucination. Method: Uses an embedding-driven retrieval mechanism to extract relevant text segments and delete unrelated content, so the summary is based strictly on the given aspect. Result: Performs strongly on benchmark datasets and effectively mitigates the token limit problem. Conclusion: The framework outperforms existing methods in both performance and practicality. Abstract: Aspect-based summarization aims to generate summaries tailored to specific aspects, addressing the resource constraints and limited generalizability of traditional summarization approaches. Recently, large language models have shown promise in this task without the need for training. However, they rely excessively on prompt engineering and face token limits and hallucination challenges, especially with in-context learning. To address these challenges, in this paper, we propose a novel framework for aspect-based summarization: Self-Aspect Retrieval Enhanced Summary Generation. Rather than relying solely on in-context learning, given an aspect, we employ an embedding-driven retrieval mechanism to identify its relevant text segments. This approach extracts the pertinent content while avoiding unnecessary details, thereby mitigating the challenge of token limits. Moreover, our framework optimizes token usage by deleting unrelated parts of the text and ensuring that the model generates output strictly based on the given aspect. With extensive experiments on benchmark datasets, we demonstrate that our framework not only achieves superior performance but also effectively mitigates the token limitation problem.
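
The retrieval step described above can be illustrated with a toy example: embed the aspect and each text segment, rank segments by cosine similarity, and keep only the top few for the summarizer. This is a minimal sketch, assuming a bag-of-words vector as a stand-in for a learned embedding model. The function names, the `top_k` parameter, and the sample segments are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a learned embedding model)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[token] * v[token] for token in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def retrieve_segments(aspect, segments, top_k=2):
    """Return the top_k segments most similar to the aspect, most similar first."""
    query = embed(aspect)
    ranked = sorted(segments, key=lambda seg: cosine(query, embed(seg)), reverse=True)
    return ranked[:top_k]


segments = [
    "the battery life lasts two days",
    "the screen is bright",
    "battery charging is slow",
]
print(retrieve_segments("battery life", segments))
```

Only the retrieved segments would then be passed to the summarizer, which is how retrieval reduces the number of tokens in the prompt.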

[185] Accuracy is Not Agreement: Expert-Aligned Evaluation of Crash Narrative Classification Models

Sudesh Ramesh Bhagat,Ibne Farabi Shihab,Anuj Sharma

Main category: cs.CL

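TL;DR: Finds a counterintuitive relationship between the technical accuracy of deep learning models and their agreement with experts: models with high accuracy agree less with experts, while large language models (LLMs), despite lower accuracy, align better with expert judgment.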

Details

Motivation: Examines evaluation criteria for deep learning models in safety-critical NLP applications, arguing that accuracy alone is insufficient and that expert agreement is needed as a complementary metric. Method: Evaluates five DL models (BERT variants, USE, and a zero-shot classifier) and four LLMs (GPT-4, LLaMA 3, Qwen, Claude), using Cohen's Kappa, PCA, and SHAP to analyze model-expert agreement. Result: High-accuracy models agree less with experts; expert-aligned models rely more on contextual and temporal language cues than on location-specific keywords. Conclusion: Recommends incorporating expert agreement into model evaluation frameworks and sees promise in LLMs as interpretable, scalable tools for safety-critical tasks. Abstract: This study explores the relationship between deep learning (DL) model accuracy and expert agreement in the classification of crash narratives. We evaluate five DL models -- including BERT variants, the Universal Sentence Encoder (USE), and a zero-shot classifier -- against expert-labeled data and narrative text. The analysis is further extended to four large language models (LLMs): GPT-4, LLaMA 3, Qwen, and Claude. Our results reveal a counterintuitive trend: models with higher technical accuracy often exhibit lower agreement with domain experts, whereas LLMs demonstrate greater expert alignment despite relatively lower accuracy scores. To quantify and interpret model-expert agreement, we employ Cohen's Kappa, Principal Component Analysis (PCA), and SHAP-based explainability techniques. Findings indicate that expert-aligned models tend to rely more on contextual and temporal language cues, rather than location-specific keywords. These results underscore that accuracy alone is insufficient for evaluating models in safety-critical NLP applications. We advocate for incorporating expert agreement as a complementary metric in model evaluation frameworks and highlight the promise of LLMs as interpretable, scalable tools for crash analysis pipelines.
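
Cohen's Kappa, the agreement statistic named above, corrects raw agreement for the agreement two raters would reach by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below implements that standard definition for two label sequences. The example labels are made up for illustration and are not data from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items on which the two raters give the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: probability of a match if each rater labeled independently
    # according to their own label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Degenerate case (both raters always use one identical label): kappa is
        # undefined, and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up labels: a model's predictions versus an expert's judgments on four narratives.
model = [1, 1, 0, 0]
expert = [1, 0, 0, 0]
print(cohens_kappa(model, expert))
```

Because chance agreement is subtracted out, a model can have high accuracy against dataset labels and still show a low kappa against a particular expert, which is the kind of gap the study examines.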

[186] Retrieval-Augmented Generation with Conflicting Evidence

Han Wang,Archiki Prasad,Elias Stengel-Eskin,Mohit Bansal

Main category: cs.CL

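TL;DR: Proposes the RAMDocs dataset and the MADAM-RAG method to jointly handle ambiguity in user queries, misinformation, and noise, substantially improving RAG performance.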

Details

Motivation: Existing RAG systems usually handle ambiguous queries or misinformation in isolation and cannot cope with multiple sources of conflict at once. Method: Proposes RAMDocs, a dataset that simulates complex conflict scenarios, and MADAM-RAG, a multi-agent approach that consolidates valid answers and discards misinformation through multi-round debate. Result: MADAM-RAG improves by up to 11.40% on AmbigDocs and up to 15.80% on FaithEval, while RAMDocs remains challenging for existing baselines (32.60 exact match). Conclusion: MADAM-RAG begins to address multi-factor conflicts, but there is room for improvement when supporting evidence is imbalanced. Abstract: Large language model (LLM) agents are increasingly employing retrieval-augmented generation (RAG) to improve the factuality of their responses. However, in practice, these systems often need to handle ambiguous user queries and potentially conflicting information from multiple sources while also suppressing inaccurate information from noisy or irrelevant documents. Prior work has generally studied and addressed these challenges in isolation, considering only one aspect at a time, such as handling ambiguity or robustness to noise and misinformation. We instead consider multiple factors simultaneously, proposing (i) RAMDocs (Retrieval with Ambiguity and Misinformation in Documents), a new dataset that simulates complex and realistic scenarios for conflicting evidence for a user query, including ambiguity, misinformation, and noise; and (ii) MADAM-RAG, a multi-agent approach in which LLM agents debate over the merits of an answer over multiple rounds, allowing an aggregator to collate responses corresponding to disambiguated entities while discarding misinformation and noise, thereby handling diverse sources of conflict jointly. We demonstrate the effectiveness of MADAM-RAG using both closed and open-source models on AmbigDocs -- which requires presenting all valid answers for ambiguous queries -- improving over strong RAG baselines by up to 11.40% and on FaithEval -- which requires suppressing misinformation -- where we improve by up to 15.80% (absolute) with Llama3.3-70B-Instruct. Furthermore, we find that RAMDocs poses a challenge for existing RAG baselines (Llama3.3-70B-Instruct only obtains 32.60 exact match score). While MADAM-RAG begins to address these conflicting factors, our analysis indicates that a substantial gap remains especially when increasing the level of imbalance in supporting evidence and misinformation.

[187] LLMs Meet Finance: Fine-Tuning Foundation Models for the Open FinLLM Leaderboard

Varun Rao,Youran Sun,Mahendra Kumar,Tejas Mutneja,Agastya Mukherjee,Haizhao Yang

Main category: cs.CL

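TL;DR: Studies the application of large language models (LLMs) to financial tasks, fine-tuning foundation models with several techniques to improve their financial capabilities and reporting substantial performance gains.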

Details

Motivation: Explore the potential of LLMs in finance and improve their performance on financial tasks. Method: Using the Open FinLLM Leaderboard as a benchmark, fine-tunes Qwen2.5 and Deepseek-R1 with supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning (RL). Result: The fine-tuned models show substantial gains across a range of financial tasks, and the data scaling law in the financial domain is measured. Conclusion: The study shows that LLMs have considerable potential in financial applications. Abstract: This paper investigates the application of large language models (LLMs) to financial tasks. We fine-tuned foundation models using the Open FinLLM Leaderboard as a benchmark. Building on Qwen2.5 and Deepseek-R1, we employed techniques including supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning (RL) to enhance their financial capabilities. The fine-tuned models demonstrated substantial performance gains across a wide range of financial tasks. Moreover, we measured the data scaling law in the financial domain. Our work demonstrates the potential of large language models (LLMs) in financial applications.

[188] Energy-Based Reward Models for Robust Language Model Alignment

Anamika Lochab,Ruqi Zhang

Main category: cs.CL

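TL;DR: EBRM is a lightweight post-hoc refinement framework that explicitly models the reward distribution to improve reward model robustness and generalization without retraining.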

Details

Motivation: Conventional reward models struggle to capture complex human preferences and generalize poorly; EBRM aims to address both problems. Method: EBRM models the reward distribution explicitly, using conflict-aware data filtering, label-noise-aware contrastive training, and hybrid initialization. Result: Experiments show up to a 5.97% improvement on safety-critical alignment tasks and effectively delayed reward hacking. Conclusion: EBRM is a scalable and effective enhancement for existing reward models and alignment pipelines. Abstract: Reward models (RMs) are essential for aligning Large Language Models (LLMs) with human preferences. However, they often struggle with capturing complex human preferences and generalizing to unseen data. To address these challenges, we introduce Energy-Based Reward Model (EBRM), a lightweight post-hoc refinement framework that enhances RM robustness and generalization. EBRM models the reward distribution explicitly, capturing uncertainty in human preferences and mitigating the impact of noisy or misaligned annotations. It achieves this through conflict-aware data filtering, label-noise-aware contrastive training, and hybrid initialization. Notably, EBRM enhances RMs without retraining, making it computationally efficient and adaptable across different models and tasks. Empirical evaluations on RM benchmarks demonstrate significant improvements in both robustness and generalization, achieving up to a 5.97% improvement in safety-critical alignment tasks compared to standard RMs. Furthermore, reinforcement learning experiments confirm that our refined rewards enhance alignment quality, effectively delaying reward hacking. These results demonstrate our approach as a scalable and effective enhancement for existing RMs and alignment pipelines. The code is available at EBRM.

[189] Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo

João Loula,Benjamin LeBrun,Li Du,Ben Lipkin,Clemente Pasti,Gabriel Grand,Tianyu Liu,Yahya Emara,Marjorie Freedman,Jason Eisner,Ryan Cotterell,Vikash Mansinghka,Alexander K. Lew,Tim Vieira,Timothy J. O'Donnell

Main category: cs.CL

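TL;DR: Proposes a sequential Monte Carlo (SMC) architecture for controlled language model generation that flexibly enforces syntactic or semantic constraints, and validates its efficiency on several domains.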

Details

Motivation: Many LM applications need text that satisfies syntactic or semantic constraints, but exact generation from the resulting distribution is generally intractable. Method: Uses a sequential Monte Carlo (SMC) framework that reallocates computation during generation and flexibly incorporates domain-specific constraints. Result: On four domains, including Python code generation and text-to-SQL, the approach lets small open-source models outperform much larger ones with little overhead. Conclusion: The gains come from better approximation of the posterior distribution; the system integrates with a probabilistic programming language, giving users a simple, programmable tool for controlled generation. Abstract: A wide range of LM applications require generating text that conforms to syntactic or semantic constraints. Imposing such constraints can be naturally framed as probabilistic conditioning, but exact generation from the resulting distribution -- which can differ substantially from the LM's base distribution -- is generally intractable. In this work, we develop an architecture for controlled LM generation based on sequential Monte Carlo (SMC). Our SMC framework allows us to flexibly incorporate domain- and problem-specific constraints at inference time, and efficiently reallocate computational resources in light of new information during the course of generation. By comparing to a number of alternatives and ablations on four challenging domains -- Python code generation for data science, text-to-SQL, goal inference, and molecule synthesis -- we demonstrate that, with little overhead, our approach allows small open-source language models to outperform models over 8x larger, as well as closed-source, fine-tuned ones. In support of the probabilistic perspective, we show that these performance improvements are driven by better approximation to the posterior distribution. Our system builds on the framework of Lew et al. (2023) and integrates with its language model probabilistic programming language, giving users a simple, programmable way to apply SMC to a broad variety of controlled generation problems.
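
To make the SMC idea above concrete, the following toy sketch runs a generic particle filter over token sequences: each particle is extended by one token from a proposal, reweighted by a constraint potential, and the population is resampled when the effective sample size drops. It is a minimal sketch under stated assumptions, not the authors' system. The uniform two-token "language model", the "no two consecutive b" constraint, and all function names are hypothetical.

```python
import random


def smc_generate(propose, incremental_potential, num_particles, num_steps, rng):
    """Toy sequential Monte Carlo over token sequences.

    propose(prefix, rng) -> next token sampled from the base model
    incremental_potential(prefix) -> weight factor contributed by the newest token
    Returns (particles, weights); weights are unnormalized.
    """
    particles = [[] for _ in range(num_particles)]
    weights = [1.0] * num_particles
    for _ in range(num_steps):
        for i, prefix in enumerate(particles):
            prefix.append(propose(prefix, rng))
            weights[i] *= incremental_potential(prefix)
        total = sum(weights)
        if total == 0:
            raise RuntimeError("every particle violates the constraint")
        # Effective sample size: low values mean a few particles hold most of the weight.
        ess = total ** 2 / sum(w * w for w in weights)
        if ess < num_particles / 2:
            # Resample in proportion to weight, copying lists so particles stay independent.
            chosen = rng.choices(particles, weights=weights, k=num_particles)
            particles = [list(p) for p in chosen]
            weights = [total / num_particles] * num_particles
    return particles, weights


def uniform_lm(prefix, rng):
    """Hypothetical base model: picks 'a' or 'b' uniformly, ignoring the prefix."""
    return rng.choice("ab")


def no_double_b(prefix):
    """Hypothetical constraint: weight 0 if the sequence now ends in 'bb', else 1."""
    return 0.0 if prefix[-2:] == ["b", "b"] else 1.0


particles, weights = smc_generate(uniform_lm, no_double_b, 50, 6, random.Random(0))
survivors = {"".join(p) for p, w in zip(particles, weights) if w > 0}
print(sorted(survivors))
```

Resampling is what reallocates computation: particles that have violated the constraint are dropped and replaced by copies of particles that are still viable, instead of being extended to the end and rejected.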

[190] CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training

Shizhe Diao,Yu Yang,Yonggan Fu,Xin Dong,Dan Su,Markus Kliegl,Zijia Chen,Peter Belcak,Yoshi Suhara,Hongxu Yin,Mostofa Patwary,Yingyan Lin,Jan Kautz,Pavlo Molchanov

Main category: cs.CL

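TL;DR: CLIMB is an automated framework that uses clustering and iterative search to optimize pre-training data mixtures, significantly improving model performance.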

Details

Motivation: Pre-training datasets usually lack explicit domain divisions and manual labeling is costly, which makes optimizing the data mixture difficult. Method: CLIMB embeds and clusters data in a semantic space, then iteratively optimizes the mixture using a proxy model and a predictor. Result: A 1B model trained on 400B tokens exceeds Llama-3.2-1B by 2.0%, and optimizing for a specific domain yields a 5% improvement. Conclusion: CLIMB and ClimbMix demonstrate the importance of data mixture optimization and provide resources for efficient pre-training. Abstract: Pre-training datasets are typically collected from web content and lack inherent domain divisions. For instance, widely used datasets like Common Crawl do not include explicit domain labels, while manually curating labeled datasets such as The Pile is labor-intensive. Consequently, identifying an optimal pre-training data mixture remains a challenging problem, despite its significant benefits for pre-training performance. To address these challenges, we propose CLustering-based Iterative Data Mixture Bootstrapping (CLIMB), an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. Specifically, CLIMB embeds and clusters large-scale datasets in a semantic space and then iteratively searches for optimal mixtures using a smaller proxy model and a predictor. When continuously trained on 400B tokens with this mixture, our 1B model exceeds the state-of-the-art Llama-3.2-1B by 2.0%. Moreover, we observe that optimizing for a specific domain (e.g., Social Sciences) yields a 5% improvement over random sampling. Finally, we introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset designed for efficient pre-training that delivers superior performance under an equal token budget. We analyze the final data mixture, elucidating the characteristics of an optimal data mixture. Our data is available at: https://research.nvidia.com/labs/lpr/climb/

cs.CY

[191] Large Language Model-Based Knowledge Graph System Construction for Sustainable Development Goals: An AI-Based Speculative Design Perspective

Yi-De Lin,Guan-Ze Liao

Main category: cs.CY

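TL;DR: Develops an AI knowledge graph system that analyzes SDG interconnections and proposes new goals, offering policymakers a fresh perspective.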

Details

Motivation: As 2030 approaches and SDG progress lags, innovative acceleration strategies are needed. Method: Combines official SDG texts, Elsevier's keyword dataset, and TED Talk transcripts, applying AI-speculative design and large language models. Result: Finds strong associations between Goal 10 and Goal 16 and minimal coverage of Goal 6; proposes six new goals centered on equity, resilience, and technology-driven inclusion. Conclusion: The speculative-AI framework offers new ideas for policymaking and supports future multimodal and cross-system SDG applications. Abstract: From 2000 to 2015, the UN's Millennium Development Goals guided global priorities. The subsequent Sustainable Development Goals (SDGs) adopted a more dynamic approach, with annual indicator updates. As 2030 nears and progress lags, innovative acceleration strategies are critical. This study develops an AI-powered knowledge graph system to analyze SDG interconnections, discover potential new goals, and visualize them online. Using official SDG texts, Elsevier's keyword dataset, and 1,127 TED Talk transcripts (2020-2023), a pilot on 269 talks from 2023 applies AI-speculative design, large language models, and retrieval-augmented generation. Key findings include: (1) Heatmap analysis reveals strong associations between Goal 10 and Goal 16, and minimal coverage of Goal 6. (2) In the knowledge graph, simulated dialogue over time reveals new central nodes, showing how richer data supports divergent thinking and goal clarity. (3) Six potential new goals are proposed, centered on equity, resilience, and technology-driven inclusion. This speculative-AI framework offers fresh insights for policymakers and lays groundwork for future multimodal and cross-system SDG applications.

[192] Knowledge Acquisition on Mass-shooting Events via LLMs for AI-Driven Justice

Benign John Ihugba,Afsana Nasrin,Ling Wu,Lin Li,Lijun Qian,Xishuang Dong

Main category: cs.CY

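TL;DR: Presents the first dataset for knowledge acquisition on mass-shooting events, using named entity recognition (NER) to extract key information such as offenders, victims, locations, and criminal instruments in support of legal and investigative work. Experiments show that GPT-4o and o1-mini perform best in few-shot settings.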

Details

Motivation: Mass-shooting events generate large volumes of unstructured text, and existing work struggles to automate the extraction of key information, which hampers investigations and public policy. Method: Applies named entity recognition (NER) with few-shot prompting of large language models (LLMs) to extract key entities from news articles, police reports, and social media. Result: GPT-4o performs best on the NER task, with o1-mini as a resource-efficient alternative. Increasing the shot count improves all models, and GPT-4o and o1-mini adapt best. Conclusion: The dataset and method provide an efficient tool for information extraction on mass-shooting events, with GPT-4o and o1-mini standing out in few-shot learning. Abstract: Mass-shooting events pose a significant challenge to public safety, generating large volumes of unstructured textual data that hinder effective investigations and the formulation of public policy. Despite the urgency, few prior studies have effectively automated the extraction of key information from these events to support legal and investigative efforts. This paper presents the first dataset designed for knowledge acquisition on mass-shooting events through the application of named entity recognition (NER) techniques. It focuses on identifying key entities such as offenders, victims, locations, and criminal instruments, that are vital for legal and investigative purposes. The NER process is powered by Large Language Models (LLMs) using few-shot prompting, facilitating the efficient extraction and organization of critical information from diverse sources, including news articles, police reports, and social media. Experimental results on real-world mass-shooting corpora demonstrate that GPT-4o is the most effective model for mass-shooting NER, achieving the highest Micro Precision, Micro Recall, and Micro F1-scores. Meanwhile, o1-mini delivers competitive performance, making it a resource-efficient alternative for less complex NER tasks. It is also observed that increasing the shot count enhances the performance of all models, but the gains are more substantial for GPT-4o and o1-mini, highlighting their superior adaptability to few-shot learning scenarios.
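
The micro-averaged precision, recall, and F1 scores mentioned above pool true positives, false positives, and false negatives over all documents before dividing. The sketch below computes them under the assumption of exact matching on (entity text, entity type) pairs. The matching rule, the type labels, and the placeholder entities are illustrative assumptions, since the abstract does not state the exact evaluation protocol.

```python
def micro_prf(gold, predicted):
    """Micro precision, recall, and F1 over documents.

    gold, predicted: lists with one set of (entity_text, entity_type) pairs per document.
    An entity counts as correct only if both text and type match exactly.
    """
    tp = fp = fn = 0
    for gold_entities, predicted_entities in zip(gold, predicted):
        tp += len(gold_entities & predicted_entities)
        fp += len(predicted_entities - gold_entities)
        fn += len(gold_entities - predicted_entities)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Placeholder entities, not drawn from the dataset.
gold = [
    {("Person A", "OFFENDER"), ("City X", "LOCATION")},
    {("Person B", "VICTIM")},
]
predicted = [
    {("Person A", "OFFENDER")},
    {("Person B", "VICTIM"), ("City Y", "LOCATION")},
]
print(micro_prf(gold, predicted))
```

Because counts are pooled before dividing, documents with many entities influence the micro scores more than documents with few.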

[193] How Large Language Models Are Changing MOOC Essay Answers: A Comparison of Pre- and Post-LLM Responses

Leo Leppänen,Lili Aunimo,Arto Hellas,Jukka K. Nurminen,Linda Mannila

Main category: cs.CY

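TL;DR: Studies the effect of ChatGPT's release on online education by analyzing changes in student essays from a MOOC, finding clear changes in essay length, style, and keyword use.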

Details

Motivation: Examines the practical impact of large language models such as ChatGPT on online education, especially on academic integrity and student writing behavior. Method: Analyzes a dataset of student essays from a MOOC on AI ethics, comparing submissions before and after ChatGPT's release. Result: After ChatGPT's release, essay length and style changed significantly and AI-related keywords became more frequent, but the topics discussed did not clearly change. Conclusion: Large language models have had a quantifiable effect on online education, particularly on writing behavior, without significantly changing the topics discussed. Abstract: The release of ChatGPT in late 2022 caused a flurry of activity and concern in the academic and educational communities. Some see the tool's ability to generate human-like text that passes at least cursory inspections for factual accuracy "often enough" a golden age of information retrieval and computer-assisted learning. Some, on the other hand, worry the tool may lead to unprecedented levels of academic dishonesty and cheating. In this work, we quantify some of the effects of the emergence of Large Language Models (LLMs) on online education by analyzing a multi-year dataset of student essay responses from a free university-level MOOC on AI ethics. Our dataset includes essays submitted both before and after ChatGPT's release. We find that the launch of ChatGPT coincided with significant changes in both the length and style of student essays, mirroring observations in other contexts such as academic publishing. We also observe -- as expected based on related public discourse -- changes in prevalence of key content words related to AI and LLMs, but not necessarily the general themes or topics discussed in the student essays as identified through (dynamic) topic modeling.

Georgina Curto,Svetlana Kiritchenko,Muhammad Hammad Fahim Siddiqui,Isar Nejadgholi,Kathleen C. Fraser

Main category: cs.CY

TL;DR: The paper aims to identify and track bias against people living in poverty (aporophobia) using social media data, in support of poverty eradication.

Details

Motivation: Eradicating poverty is the first of the UN Sustainable Development Goals, but societal bias against people living in poverty (aporophobia) obstructs the design and implementation of poverty-mitigation policies. Method: In collaboration with non-profits and governmental organizations, collect and annotate English tweets and build classifiers to detect aporophobia automatically. Result: A taxonomy of aporophobia was developed and classifiers were trained, revealing the main challenges for automatic detection. Conclusion: The study lays the groundwork for identifying and reducing aporophobia on social media at scale. Abstract: Eradicating poverty is the first goal in the United Nations Sustainable Development Goals. However, aporophobia, the societal bias against people living in poverty, constitutes a major obstacle to designing, approving and implementing poverty-mitigation policies. This work presents an initial step towards operationalizing the concept of aporophobia to identify and track harmful beliefs and discriminative actions against poor people on social media. In close collaboration with non-profits and governmental organizations, we conduct data collection and exploration. Then we manually annotate a corpus of English tweets from five world regions for the presence of (1) direct expressions of aporophobia, and (2) statements referring to or criticizing aporophobic views or actions of others, to comprehensively characterize the social media discourse related to bias and discrimination against the poor. Based on the annotated data, we devise a taxonomy of categories of aporophobic attitudes and actions expressed through speech on social media. Finally, we train several classifiers and identify the main challenges for automatic detection of aporophobia in social networks. This work paves the way towards identifying, tracking, and mitigating aporophobic views on social media at scale.

eess.AS [Back]

[195] EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting

Guanrou Yang,Chen Yang,Qian Chen,Ziyang Ma,Wenxi Chen,Wen Wang,Tianrui Wang,Yifan Yang,Zhikang Niu,Wenrui Liu,Fan Yu,Zhihao Du,Zhifu Gao,ShiLiang Zhang,Xie Chen

Main category: eess.AS

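TL;DR: EmoVoice is a new emotion-controllable TTS model that uses large language models for fine-grained emotion control and a phoneme boost design to improve content consistency.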

Details

Motivation: Current TTS models still struggle to control emotional expression; EmoVoice aims to address this. Method: Combine a large language model with a phoneme boost design to build EmoVoice, and construct EmoVoice-DB, a high-quality emotion dataset. Result: State-of-the-art performance on English and Chinese test sets, plus an investigation of the reliability of emotion evaluation metrics. Conclusion: EmoVoice performs strongly in emotion-controllable TTS; the dataset and code will be released. Abstract: Human speech goes beyond the mere transfer of information; it is a profound exchange of emotions and a connection between individuals. While Text-to-Speech (TTS) models have made huge progress, they still face challenges in controlling the emotional expression in the generated speech. In this work, we propose EmoVoice, a novel emotion-controllable TTS model that exploits large language models (LLMs) to enable fine-grained freestyle natural language emotion control, and a phoneme boost variant design that makes the model output phoneme tokens and audio tokens in parallel to enhance content consistency, inspired by chain-of-thought (CoT) and modality-of-thought (CoM) techniques. Besides, we introduce EmoVoice-DB, a high-quality 40-hour English emotion dataset featuring expressive speech and fine-grained emotion labels with natural language descriptions. EmoVoice achieves state-of-the-art performance on the English EmoVoice-DB test set using only synthetic training data, and on the Chinese Secap test set using our in-house data. We further investigate the reliability of existing emotion evaluation metrics and their alignment with human perceptual preferences, and explore using SOTA multimodal LLMs GPT-4o-audio and Gemini to assess emotional speech. Demo samples are available at https://anonymous.4open.science/r/EmoVoice-DF55. Dataset, code, and checkpoints will be released.

cs.HC [Back]

[196] MobilePoser: Real-Time Full-Body Pose Estimation and 3D Human Translation from IMUs in Mobile Consumer Devices

Vasco Xu,Chenfeng Gao,Henry Hoffmann,Karan Ahuja

Main category: cs.HC

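TL;DR: MobilePoser is a real-time system that uses the IMUs in consumer devices to estimate full-body pose and global translation, addressing problems caused by sensor noise and drift.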

Details

Motivation: As motion capture migrates to lower-fidelity IMUs in devices such as phones, watches, and earbuds, existing methods face challenges in online performance, temporal consistency, and global translation. Method: MobilePoser uses a multi-stage deep neural network for kinematic pose estimation, followed by a physics-based motion optimizer. Result: The system achieves state-of-the-art accuracy while remaining lightweight. Conclusion: MobilePoser shows distinctive potential across fields such as health, gaming, and indoor navigation. Abstract: There has been a continued trend towards minimizing instrumentation for full-body motion capture, going from specialized rooms and equipment, to arrays of worn sensors and recently sparse inertial pose capture methods. However, as these techniques migrate towards lower-fidelity IMUs on ubiquitous commodity devices, like phones, watches, and earbuds, challenges arise including compromised online performance, temporal consistency, and loss of global translation due to sensor noise and drift. Addressing these challenges, we introduce MobilePoser, a real-time system for full-body pose and global translation estimation using any available subset of IMUs already present in these consumer devices. MobilePoser employs a multi-stage deep neural network for kinematic pose estimation followed by a physics-based motion optimizer, achieving state-of-the-art accuracy while remaining lightweight. We conclude with a series of demonstrative applications to illustrate the unique potential of MobilePoser across a variety of fields, such as health and wellness, gaming, and indoor navigation, to name a few.

[197] Multimodal LLM Augmented Reasoning for Interpretable Visual Perception Analysis

Shravan Chaudhari,Trilokya Akula,Yoon Kim,Tom Blake

Main category: cs.HC

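TL;DR: The paper studies the use of multimodal large language models (MLLMs) for visual perception tasks and proposes an annotation-free analytical framework for assessing their utility as cognitive assistants in HCI.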

Details

Motivation: To explore the applicability of MLLMs to visual perception and, guided by principles from psychology and cognitive science, assess their explanatory ability, with the aim of improving human reasoning and uncovering biases in existing datasets. Method: Use principles from psychology and cognitive science to design an annotation-free analytical framework that benchmarks MLLMs on visual perception tasks. Result: A novel evaluation approach that provides a principled basis for quantifying the interpretability of MLLMs. Conclusion: The study offers a new perspective on applying MLLMs to HCI tasks and highlights their potential for improving human reasoning and uncovering dataset biases. Abstract: In this paper, we advance the study of AI-augmented reasoning in the context of Human-Computer Interaction (HCI), psychology and cognitive science, focusing on the critical task of visual perception. Specifically, we investigate the applicability of Multimodal Large Language Models (MLLMs) in this domain. To this end, we leverage established principles and explanations from psychology and cognitive science related to complexity in human visual perception. We use them as guiding principles for the MLLMs to compare and interpret visual content. Our study aims to benchmark MLLMs across various explainability principles relevant to visual perception. Unlike recent approaches that primarily employ advanced deep learning models to predict complexity metrics from visual content, our work does not seek to develop a mere new predictive model. Instead, we propose a novel annotation-free analytical framework to assess the utility of MLLMs as cognitive assistants for HCI tasks, using visual perception as a case study. The primary goal is to pave the way for principled study in quantifying and evaluating the interpretability of MLLMs for applications in improving human reasoning capability and uncovering biases in existing perception datasets annotated by humans.

cs.AI [Back]

[198] Towards Conversational AI for Human-Machine Collaborative MLOps

George Fatouros,Georgios Makridis,George Kousiouris,John Soldatos,Anargyros Tsadimas,Dimosthenis Kyriazis

Main category: cs.AI

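TL;DR: The paper proposes a conversational agent system based on large language models (LLMs) to make human-machine collaboration in machine learning operations (MLOps) more efficient. Its modular design and natural-language interaction simplify the use of complex MLOps tools.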

Details

Motivation: To address the accessibility gap in complex MLOps platforms such as Kubeflow, so that users with varying technical backgrounds can use advanced ML tools. Method: A hierarchical, modular architecture integrating a KFP Agent, a MinIO Agent, and a RAG Agent, operating through natural-language interaction and iterative reasoning. Result: The system reduces the complexity of MLOps and improves usability for users at different skill levels. Conclusion: The conversational MLOps assistant simplifies workflows and broadens the reach of ML tools. Abstract: This paper presents a Large Language Model (LLM) based conversational agent system designed to enhance human-machine collaboration in Machine Learning Operations (MLOps). We introduce the Swarm Agent, an extensible architecture that integrates specialized agents to create and manage ML workflows through natural language interactions. The system leverages a hierarchical, modular design incorporating a Kubeflow Pipelines (KFP) Agent for ML pipeline orchestration, a MinIO Agent for data management, and a Retrieval-Augmented Generation (RAG) Agent for domain-specific knowledge integration. Through iterative reasoning loops and context-aware processing, the system enables users with varying technical backgrounds to discover, execute, and monitor ML pipelines; manage datasets and artifacts; and access relevant documentation, all via intuitive conversational interfaces. Our approach addresses the accessibility gap in complex MLOps platforms like Kubeflow, making advanced ML tools broadly accessible while maintaining the flexibility to extend to other platforms. The paper describes the architecture and implementation details and demonstrates how this conversational MLOps assistant reduces complexity and lowers barriers to entry for users across diverse technical skill levels.

[199] ZeroSumEval: Scaling LLM Evaluation with Inter-Model Competition

Haidar Khan,Hisham A. Alyahya,Yazeed Alnumay,M Saiful Bari,Bülent Yener

Main category: cs.AI

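TL;DR: ZeroSumEval is a dynamic evaluation protocol based on zero-sum games for assessing the capabilities of large language models (LLMs), avoiding the overfitting, high cost, and bias of traditional methods.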

Details

Motivation: Traditional evaluation methods (static datasets, human assessment, or model-based evaluation) suffer from overfitting, high cost, and bias, so a more dynamic and standardized evaluation framework is needed. Method: ZeroSumEval uses a diverse suite of games (security challenges, classic games, knowledge tests, and persuasion challenges) to dynamically evaluate LLM capabilities such as strategic reasoning, planning, knowledge application, and creativity. Result: Frontier models such as the GPT and Claude families do well at common games and question answering but struggle in games that require creating novel questions, and they cannot reliably jailbreak each other. Conclusion: ZeroSumEval provides a standardized, extensible framework for dynamic evaluation and exposes current LLMs' limitations on tasks requiring creativity. Abstract: Evaluating the capabilities of Large Language Models (LLMs) has traditionally relied on static benchmark datasets, human assessments, or model-based evaluations, methods that often suffer from overfitting, high costs, and biases. ZeroSumEval is a novel competition-based evaluation protocol that leverages zero-sum games to assess LLMs with dynamic benchmarks that resist saturation. ZeroSumEval encompasses a diverse suite of games, including security challenges (PyJail), classic games (Chess, Liar's Dice, Poker), knowledge tests (MathQuiz), and persuasion challenges (Gandalf, Debate). These games are designed to evaluate a range of AI capabilities such as strategic reasoning, planning, knowledge application, and creativity. Building upon recent studies that highlight the effectiveness of game-based evaluations for LLMs, ZeroSumEval enhances these approaches by providing a standardized and extensible framework. To demonstrate this, we conduct extensive experiments with >7000 simulations across 7 games and 13 models. Our results show that while frontier models from the GPT and Claude families can play common games and answer questions, they struggle to play games that require creating novel and challenging questions. We also observe that models cannot reliably jailbreak each other and fail generally at tasks requiring creativity. We release our code at https://github.com/facebookresearch/ZeroSumEval.

[200] WebLists: Extracting Structured Information From Complex Interactive Websites Using Executable LLM Agents

Arth Bohra,Manvel Saroyan,Danil Melkozerov,Vahe Karufanyan,Gabriel Maher,Pascal Weinberger,Artem Harutyunyan,Giovanni Campagna

Main category: cs.AI

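TL;DR: The paper introduces the WebLists benchmark and the BardeenAgent framework for large-scale structured data extraction, substantially improving performance and reducing cost.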

Details

Motivation: Current web agent research focuses mainly on navigation and transaction tasks and neglects structured data extraction at scale. Method: BardeenAgent converts an agent's execution into a repeatable program and exploits the regular structure of HTML to generate generalizable CSS selectors. Result: BardeenAgent reaches 66% recall on the WebLists benchmark, more than double that of state-of-the-art web agents, and reduces cost per output row by 3x. Conclusion: BardeenAgent substantially improves web agents on structured data extraction and has practical value. Abstract: Most recent web agent research has focused on navigation and transaction tasks, with little emphasis on extracting structured data at scale. We present WebLists, a benchmark of 200 data-extraction tasks across four common business and enterprise use-cases. Each task requires an agent to navigate to a webpage, configure it appropriately, and extract complete datasets with well-defined schemas. We show that both LLMs with search capabilities and SOTA web agents struggle with these tasks, with a recall of 3% and 31%, respectively, despite higher performance on question-answering tasks. To address this challenge, we propose BardeenAgent, a novel framework that enables web agents to convert their execution into repeatable programs, and replay them at scale across pages with similar structure. BardeenAgent is also the first LLM agent to take advantage of the regular structure of HTML. In particular, BardeenAgent constructs a generalizable CSS selector to capture all relevant items on the page, then fits the operations to extract the data. On the WebLists benchmark, BardeenAgent achieves 66% recall overall, more than doubling the performance of SOTA web agents, and reducing cost per output row by 3x.

[201] Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning

Baining Zhao,Ziyou Wang,Jianjie Fang,Chen Gao,Fanhang Man,Jinqiang Cui,Xin Wang,Xinlei Chen,Yong Li,Wenwu Zhu

Main category: cs.AI

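TL;DR: Embodied-R combines vision-language models (VLMs) with small language models (LMs) and uses reinforcement learning to achieve efficient spatial reasoning, reaching state-of-the-art performance with few training samples.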

Details

Motivation: To explore how pretrained models acquire spatial reasoning, especially high-level reasoning, from visual observations. Method: The Embodied-R framework combines VLMs and LMs and is trained with reinforcement learning (RL) using a novel reward system that accounts for logical consistency. Result: After training on 5k samples, Embodied-R with a 3B LM matches OpenAI-o1 and Gemini-2.5-pro on spatial reasoning tasks and exhibits systematic analysis and contextual integration. Conclusion: Embodied-R shows the potential for efficient reasoning under limited computational resources, and the paper explores research questions such as reward design and model generalization. Abstract: Humans can perceive and reason about spatial relationships from sequential visual observations, such as egocentric video streams. However, how pretrained models acquire such abilities, especially high-level reasoning, remains unclear. This paper introduces Embodied-R, a collaborative framework combining large-scale Vision-Language Models (VLMs) for perception and small-scale Language Models (LMs) for reasoning. Using Reinforcement Learning (RL) with a novel reward system considering think-answer logical consistency, the model achieves slow-thinking capabilities with limited computational resources. After training on only 5k embodied video samples, Embodied-R with a 3B LM matches state-of-the-art multimodal reasoning models (OpenAI-o1, Gemini-2.5-pro) on both in-distribution and out-of-distribution embodied spatial reasoning tasks. Embodied-R also exhibits emergent thinking patterns such as systematic analysis and contextual integration. We further explore research questions including response length, training on VLM, strategies for reward design, and differences in model generalization after SFT (Supervised Fine-Tuning) and RL training.

[202] Antidistillation Sampling

Yash Savani,Asher Trockman,Zhili Feng,Avi Schwarzschild,Alexander Robey,Marc Finzi,J. Zico Kolter

Main category: cs.AI

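TL;DR: The paper proposes antidistillation sampling, which adjusts a model's next-token probability distribution to make its reasoning traces less useful for distillation while preserving model performance.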

Details

Motivation: Reasoning traces generated by frontier models can be used for model distillation, so model owners need a way to limit distillation effectiveness without hurting performance. Method: An antidistillation sampling strategy that poisons reasoning traces by modifying the model's next-token probability distribution. Result: Antidistillation sampling significantly reduces the effectiveness of distillation while preserving the model's utility. Conclusion: Antidistillation sampling is an effective way to hinder distillation without compromising model performance. Abstract: Frontier models that generate extended reasoning traces inadvertently produce rich token sequences that can facilitate model distillation. Recognizing this vulnerability, model owners may seek sampling strategies that limit the effectiveness of distillation without compromising model performance. Antidistillation sampling provides exactly this capability. By strategically modifying a model's next-token probability distribution, antidistillation sampling poisons reasoning traces, rendering them significantly less effective for distillation while preserving the model's practical utility. For further details, see https://antidistillation.com.

[203] Sleep-time Compute: Beyond Inference Scaling at Test-time

Kevin Lin,Charlie Snell,Yu Wang,Charles Packer,Sarah Wooders,Ion Stoica,Joseph E. Gonzalez

Main category: cs.AI

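TL;DR: The paper proposes sleep-time compute, which anticipates likely user queries and pre-computes useful quantities offline, significantly reducing test-time compute requirements.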

Details

Motivation: To address the high latency and inference cost of scaling test-time compute in large language models. Method: Introduce sleep-time compute, which pre-computes over contexts in anticipation of likely user queries, and evaluate it on two reasoning tasks (Stateful GSM-Symbolic and Stateful AIME). Result: Sleep-time compute reduces the test-time compute needed for the same accuracy by about 5x, and scaling sleep-time compute raises accuracy by up to 13% (Stateful GSM-Symbolic) and 18% (Stateful AIME). Multi-Query GSM-Symbolic further lowers the average cost per query by 2.5x. Conclusion: Sleep-time compute is most effective when user queries are predictable, and it is applicable to realistic tasks such as agentic software engineering. Abstract: Scaling test-time compute has emerged as a key ingredient for enabling large language models (LLMs) to solve difficult problems, but comes with high latency and inference cost. We introduce sleep-time compute, which allows models to "think" offline about contexts before queries are presented: by anticipating what queries users might ask and pre-computing useful quantities, we can significantly reduce the compute requirements at test-time. To demonstrate the efficacy of our method, we create modified versions of two reasoning tasks, Stateful GSM-Symbolic and Stateful AIME. We find that sleep-time compute can reduce the amount of test-time compute needed to achieve the same accuracy by about 5x on Stateful GSM-Symbolic and Stateful AIME, and that by scaling sleep-time compute we can further increase accuracy by up to 13% on Stateful GSM-Symbolic and 18% on Stateful AIME. Furthermore, we introduce Multi-Query GSM-Symbolic, which extends GSM-Symbolic by including multiple related queries per context. By amortizing sleep-time compute across related queries about the same context using Multi-Query GSM-Symbolic, we can decrease the average cost per query by 2.5x. We then conduct additional analysis to understand when sleep-time compute is most effective, finding the predictability of the user query to be well correlated with the efficacy of sleep-time compute. Finally, we conduct a case study of applying sleep-time compute to a realistic agentic SWE task.
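
The amortization argument in the abstract above can be made concrete with a small arithmetic sketch. It assumes a simple additive cost model in which one-off sleep-time compute for a context is shared equally by all queries about that context. The function name and the numbers are hypothetical, and the paper's actual cost accounting may weight sleep-time and test-time tokens differently.

```python
def avg_cost_per_query(sleep_cost, test_cost_per_query, n_queries):
    """Average cost per query when a one-off sleep-time cost is shared.

    sleep_cost: compute spent once per context before any query arrives.
    test_cost_per_query: compute spent answering each individual query.
    n_queries: number of related queries asked about the same context.
    """
    if n_queries < 1:
        raise ValueError("n_queries must be at least 1")
    return sleep_cost / n_queries + test_cost_per_query


# Hypothetical costs in arbitrary units: the shared term shrinks as more
# queries reuse the same pre-computed context.
single = avg_cost_per_query(sleep_cost=10.0, test_cost_per_query=2.0, n_queries=1)
shared = avg_cost_per_query(sleep_cost=10.0, test_cost_per_query=2.0, n_queries=10)
print(single, shared)
```

Under this model the benefit depends on how many queries actually reuse a context, which is consistent with the abstract's observation that query predictability matters.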

[204] Readable Twins of Unreadable Models

Krzysztof Pancerz,Piotr Kulicki,Michał Kalisz,Andrzej Burda,Maciej Stanisławski,Jaromir Sarzyński

Main category: cs.AI

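TL;DR: The paper proposes a method for converting unreadable deep learning models into readable imprecise information flow models, toward explainable deep learning systems.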

Details

Motivation: Responsible AI systems require explainability, but deep learning models are generally unreadable, so a way to convert them into an interpretable form is needed. Method: Building on the concept of digital twins of physical objects, create an imprecise information flow model (IIFM) as a readable twin of a deep learning model (DLM), with the DLM-to-IIFM conversion procedure described in detail. Result: The feasibility of the approach is illustrated with a classification model for handwritten digit recognition on the MNIST data set. Conclusion: The method offers a new route to explainability for deep learning models and supports building more responsible AI systems. Abstract: Creating responsible artificial intelligence (AI) systems is an important issue in contemporary AI research and development. One of the characteristics of responsible AI systems is their explainability. In this paper, we are interested in explainable deep learning (XDL) systems. On the basis of the creation of digital twins of physical objects, we introduce the idea of creating readable twins (in the form of imprecise information flow models) for unreadable deep learning models. The complete procedure for switching from the deep learning model (DLM) to the imprecise information flow model (IIFM) is presented. The proposed approach is illustrated with an example of a deep learning classification model for image recognition of handwritten digits from the MNIST data set.

eess.IV [Back]

[205] Regist3R: Incremental Registration with Stereo Foundation Model

Sidun Liu,Wenyu Li,Peng Qiao,Yong Dou

Main category: eess.IV

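TL;DR: Regist3R is a new stereo foundation model for efficient, scalable incremental reconstruction that addresses the computational cost and global-alignment error of multi-view 3D reconstruction.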

Details

Motivation: Multi-view 3D reconstruction remains challenging in computer vision; existing methods such as DUSt3R incur high computational cost and cumulative error when scaled to many views. Method: Regist3R adopts an incremental reconstruction paradigm suited to large-scale 3D reconstruction from unordered, many-view image collections. Result: Regist3R performs strongly on camera pose estimation and 3D reconstruction, significantly improves computational efficiency, and outperforms existing multi-view reconstruction models. Conclusion: Regist3R performs well on large-scale scene reconstruction and shows potential for practical applications such as urban modeling and aerial mapping. Abstract: Multi-view 3D reconstruction has remained an essential yet challenging problem in the field of computer vision. While DUSt3R and its successors have achieved breakthroughs in 3D reconstruction from unposed images, these methods exhibit significant limitations when scaling to multi-view scenarios, including high computational cost and cumulative error induced by global alignment. To address these challenges, we propose Regist3R, a novel stereo foundation model tailored for efficient and scalable incremental reconstruction. Regist3R leverages an incremental reconstruction paradigm, enabling large-scale 3D reconstructions from unordered and many-view image collections. We evaluate Regist3R on public datasets for camera pose estimation and 3D reconstruction. Our experiments demonstrate that Regist3R achieves comparable performance with optimization-based methods while significantly improving computational efficiency, and outperforms existing multi-view reconstruction models. Furthermore, to assess its performance in real-world applications, we introduce a challenging oblique aerial dataset which has long spatial spans and hundreds of views. The results highlight the effectiveness of Regist3R. We also demonstrate the first attempt to reconstruct large-scale scenes encompassing thousands of views through pointmap-based foundation models, showcasing its potential for practical applications in large-scale 3D reconstruction tasks, including urban modeling, aerial mapping, and beyond.

[206] TUMLS: Trustful Fully Unsupervised Multi-Level Segmentation for Whole Slide Images of Histology

Walid Rehamnia,Alexandra Getmanskaya,Evgeniy Vasilyev,Vadim Turlapov

Main category: eess.IV

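TL;DR: Proposes TUMLS, an unsupervised multi-level segmentation method that addresses challenges for AI in digital pathology such as heavy annotation requirements, high computational demands, and the lack of uncertainty estimation.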

Details

Motivation: Current AI methods in histopathology face high annotation cost, heavy computational demands, and a lack of uncertainty estimates for predictions, which limits their practical use. Method: TUMLS uses an autoencoder as a feature extractor to identify tissue types in low-resolution data, selects representative patches using an uncertainty measure, and performs unsupervised nuclei segmentation at higher resolution. Result: On the UPENN-GBM dataset, the autoencoder achieves an MSE of 0.0016; on the MoNuSeg dataset, nuclei segmentation achieves an F1 score of 77.46% and a Jaccard score of 63.35%, outperforming other unsupervised methods. Conclusion: TUMLS markedly improves the efficiency and transparency of digital pathology workflows and demonstrates its effectiveness for unsupervised segmentation. Abstract: Digital pathology, augmented by artificial intelligence (AI), holds significant promise for improving the workflow of pathologists. However, challenges such as the labor-intensive annotation of whole slide images (WSIs), high computational demands, and trust concerns arising from the absence of uncertainty estimation in predictions hinder the practical application of current AI methodologies in histopathology. To address these issues, we present a novel trustful fully unsupervised multi-level segmentation methodology (TUMLS) for WSIs. TUMLS adopts an autoencoder (AE) as a feature extractor to identify the different tissue types within low-resolution training data. It selects representative patches from each identified group based on an uncertainty measure and then does unsupervised nuclei segmentation in their respective higher-resolution space without using any ML algorithms. Crucially, this solution integrates seamlessly into clinicians' workflows, transforming the examination of a whole WSI into a review of concise, interpretable cross-level insights. This integration significantly enhances and accelerates the workflow while ensuring transparency. We evaluated our approach using the UPENN-GBM dataset, where the AE achieved a mean squared error (MSE) of 0.0016. Additionally, nucleus segmentation is assessed on the MoNuSeg dataset, outperforming all unsupervised approaches with an F1 score of 77.46% and a Jaccard score of 63.35%. These results demonstrate the efficacy of TUMLS in advancing the field of digital pathology.
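
The abstract above reports both an F1 score and a Jaccard score for segmentation. For readers less familiar with the two, the following is a minimal sketch of pixel-wise F1 (Dice) and Jaccard (IoU) on flattened binary masks. The function name and the toy masks are hypothetical, and the benchmark's official protocol (for example, object-level rather than pixel-level matching) may differ.

```python
def f1_and_jaccard(pred, truth):
    """Pixel-wise F1 (Dice) and Jaccard (IoU) for two flat binary masks.

    pred, truth: equal-length sequences of 0/1 values (flattened masks).
    Two empty masks are treated as a perfect match.
    """
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    f1 = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    jaccard = tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
    return f1, jaccard


# Toy 8-pixel masks (made up for illustration).
pred = [1, 1, 1, 0, 0, 0, 1, 0]
truth = [1, 1, 0, 0, 0, 1, 1, 0]
f1, jaccard = f1_and_jaccard(pred, truth)
print(f1, jaccard)

# For a single pair of masks the two metrics are linked by J = F1 / (2 - F1).
assert abs(jaccard - f1 / (2 - f1)) < 1e-12
```

The identity J = F1 / (2 - F1) holds per mask pair; scores averaged over many images need not satisfy it exactly.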

[207] Towards Cardiac MRI Foundation Models: Comprehensive Visual-Tabular Representations for Whole-Heart Assessment and Beyond

Yundi Zhang,Paul Hager,Che Liu,Suprosanna Shit,Chen Chen,Daniel Rueckert,Jiazhen Pan

Main category: eess.IV

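TL;DR: ViTa is a multi-modal framework that integrates cardiac magnetic resonance (CMR) imaging with patient-level factors to provide comprehensive cardiac health assessment and disease risk prediction.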

Details

Motivation: Existing methods do not fully exploit the combination of CMR and patient-level factors, limiting comprehensive assessment of cardiac health. Method: Using data from 42,000 UK Biobank participants, ViTa integrates 3D+T imaging with patient-level factors and learns a shared latent representation. Result: ViTa supports a range of downstream tasks, including cardiac phenotype prediction, segmentation, and disease classification, enabling more comprehensive analysis of cardiac health. Conclusion: Through multi-modal integration, ViTa offers a more comprehensive approach to cardiac health assessment, with clinical utility and potential for scaling. Abstract: Cardiac magnetic resonance (CMR) imaging is the gold standard for non-invasive cardiac assessment, offering rich spatio-temporal views of the cardiac anatomy and physiology. Patient-level health factors, such as demographic, metabolic, and lifestyle factors, are known to substantially influence cardiovascular health and disease risk, yet remain uncaptured by CMR alone. To holistically understand cardiac health and to enable the best possible interpretation of an individual's disease risk, CMR and patient-level factors must be jointly exploited within an integrated framework. Recent multi-modal approaches have begun to bridge this gap, yet they often rely on limited spatio-temporal data and focus on isolated clinical tasks, thereby hindering the development of a comprehensive representation for cardiac health evaluation. To overcome these limitations, we introduce ViTa, a step toward foundation models that delivers a comprehensive representation of the heart and a precise interpretation of individual disease risk. Leveraging data from 42,000 UK Biobank participants, ViTa integrates 3D+T cine stacks from short-axis and long-axis views, enabling a complete capture of the cardiac cycle. These imaging data are then fused with detailed tabular patient-level factors, enabling context-aware insights. This multi-modal paradigm supports a wide spectrum of downstream tasks, including cardiac phenotype and physiological feature prediction, segmentation, and classification of cardiac and metabolic diseases within a single unified framework. By learning a shared latent representation that bridges rich imaging features and patient context, ViTa moves beyond traditional, task-specific models toward a universal, patient-specific understanding of cardiac health, highlighting its potential to advance clinical utility and scalability in cardiac analysis.

[208] NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement: Methods and Results

Xin Li,Kun Yuan,Bingchen Li,Fengbin Guan,Yizhen Shao,Zihao Yu,Xijun Wang,Yiting Lu,Wei Luo,Suhang Yao,Ming Sun,Chao Zhou,Zhibo Chen,Radu Timofte,Yabin Zhang,Ao-Xiang Zhang,Tianwu Zhi,Jianzhao Liu,Yang Li,Jingwen Xu,Yiting Liao,Yushen Zuo,Mingyang Wu,Renjie Li,Shengyun Zhong,Zhengzhong Tu,Yufan Liu,Xiangguang Chen,Zuowei Cao,Minhao Tang,Shan Liu,Kexin Zhang,Jingfen Xie,Yan Wang,Kai Chen,Shijie Zhao,Yunchen Zhang,Xiangkai Xu,Hong Gao,Ji Shi,Yiming Bao,Xiugang Dong,Xiangsheng Zhou,Yaofeng Tu,Ying Liang,Yiwen Wang,Xinning Chai,Yuxuan Zhang,Zhengxue Cheng,Yingsheng Qin,Yucai Yang,Rong Xie,Li Song,Wei Sun,Kang Fu,Linhan Cao,Dandan Zhu,Kaiwei Zhang,Yucheng Zhu,Zicheng Zhang,Menghan Hu,Xiongkuo Min,Guangtao Zhai,Zhi Jin,Jiawei Wu,Wei Wang,Wenjian Zhang,Yuhai Lan,Gaoxiong Yi,Hengyuan Na,Wang Luo,Di Wu,MingYin Bai,Jiawang Du,Zilong Lu,Zhenyu Jiang,Hui Zeng,Ziguan Cui,Zongliang Gan,Guijin Tang,Xinglin Xie,Kehuan Song,Xiaoqiang Lu,Licheng Jiao,Fang Liu,Xu Liu,Puhua Chen,Ha Thu Nguyen,Katrien De Moor,Seyed Ali Amirshahi,Mohamed-Chaker Larabi,Qi Tang,Linfeng He,Zhiyong Gao,Zixuan Gao,Guohua Zhang,Zhiye Huang,Yi Deng,Qingmiao Jiang,Lu Chen,Yi Yang,Xi Liao,Nourine Mohammed Nadir,Yuxuan Jiang,Qiang Zhu,Siyue Teng,Fan Zhang,Shuyuan Zhu,Bing Zeng,David Bull,Meiqin Liu,Chao Yao,Yao Zhao

Main category: eess.IV

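TL;DR: The NTIRE 2025 challenge focuses on quality assessment and enhancement of short-form user-generated content (UGC) video, with two tracks: Efficient Video Quality Assessment (KVQ) and Diffusion-based Image Super-Resolution (KwaiSR).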

Details

Motivation: To drive research that improves the user experience of short-form UGC platforms such as Kwai and TikTok, and to reduce reliance on computationally expensive models. Method: Track 1 develops lightweight VQA models; Track 2 introduces the KwaiSR dataset of synthetic and real-world S-UGC images. Result: The challenge attracted 266 participants and 18 valid submissions, advancing research on short-form UGC video quality assessment and image super-resolution. Conclusion: The challenge advanced the relevant techniques, and the dataset and project are publicly available. Abstract: This paper presents a review of the NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement. The challenge comprises two tracks: (i) Efficient Video Quality Assessment (KVQ), and (ii) Diffusion-based Image Super-Resolution (KwaiSR). Track 1 aims to advance the development of lightweight and efficient video quality assessment (VQA) models, with an emphasis on eliminating reliance on model ensembles, redundant weights, and other computationally expensive components in the previous IQA/VQA competitions. Track 2 introduces a new short-form UGC dataset tailored for single image super-resolution, i.e., the KwaiSR dataset. It consists of 1,800 synthetically generated S-UGC image pairs and 1,900 real-world S-UGC images, which are split into training, validation, and test sets using a ratio of 8:1:1. The primary objective of the challenge is to drive research that benefits the user experience of short-form UGC platforms such as Kwai and TikTok. This challenge attracted 266 participants and received 18 valid final submissions with corresponding fact sheets, significantly contributing to the progress of short-form UGC VQA and image super-resolution. The project is publicly available at https://github.com/lixinustc/KVQE-ChallengeCVPR-NTIRE2025.

cs.CR [Back]

[209] Provable Secure Steganography Based on Adaptive Dynamic Sampling

Kaiyi Pang

Main category: cs.CR

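TL;DR: Proposes a black-box steganography scheme that requires no explicit access to model distributions and achieves efficient covert communication through a dynamic sampling strategy.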

Details

Motivation: Existing provably secure steganography methods require explicit access to the generative model's distribution, limiting their practicality in black-box settings. Method: A dynamic sampling strategy lets the generative model embed secret messages without disrupting its normal generation process. Result: Experiments on three real-world datasets and three LLMs show the method is comparable to white-box methods in efficiency and capacity while avoiding degradation of model output quality. Conclusion: The black-box scheme has practical potential and addresses the limitations of existing methods. Abstract: The security of private communication is increasingly at risk due to widespread surveillance. Steganography, a technique for embedding secret messages within innocuous carriers, enables covert communication over monitored channels. Provably Secure Steganography (PSS) is state of the art for making stego carriers indistinguishable from normal ones by ensuring computational indistinguishability between stego and cover distributions. However, current PSS methods often require explicit access to the distribution of the generative model for both sender and receiver, limiting their practicality in black-box scenarios. In this paper, we propose a provably secure steganography scheme that does not require access to explicit model distributions for both sender and receiver. Our method incorporates a dynamic sampling strategy, enabling generative models to embed secret messages within multiple sampling choices without disrupting the normal generation process of the model. Extensive evaluations on three real-world datasets and three LLMs demonstrate that our black-box method is comparable with existing white-box steganography methods in terms of efficiency and capacity while eliminating the degradation of steganography in model-generated outputs.

cs.IR [Back]

[210] Specialized text classification: an approach to classifying Open Banking transactions

Duc Tuyen TA,Wajdi Ben Saad,Ji Young Oh

Main category: cs.IR

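TL;DR: The paper presents a language-based Open Banking transaction classification system for the French market, focused on the challenges of a banking-specific text corpus.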

Details

Motivation: With the introduction of PSD2, Open Banking gives banks and fintechs the opportunity to use transaction descriptions to understand customer behavior, but handling domain-specific text corpora in banking remains largely unaddressed. Method: The system covers data collection, labeling, preprocessing, modeling, and evaluation, combining language-specific techniques with domain knowledge and tailored to French banking data. Result: The system outperforms generic approaches in performance and efficiency. Conclusion: The system offers an efficient solution for domain-specific text processing in banking and shows the importance of language-specific and domain knowledge. Abstract: With the introduction of the PSD2 regulation in the EU, which established the Open Banking framework, a new window of opportunities has opened for banks and fintechs to explore and enrich bank transaction descriptions with the aim of building a better understanding of customer behavior, while using this understanding to prevent fraud, reduce risks and offer more competitive and tailored services. Although the usage of natural language processing models and techniques has seen remarkable progress in various applications and domains over the past few years, custom applications based on domain-specific text corpora remain unaddressed, especially in the banking sector. In this paper, we introduce a language-based Open Banking transaction classification system with a focus on the French market and French-language text. The system encompasses data collection, labeling, preprocessing, modeling, and evaluation stages. Unlike previous studies that focus on general classification approaches, this system is specifically tailored to address the challenges posed by training a language model with a specialized text corpus (banking data in the French context). By incorporating language-specific techniques and domain knowledge, the proposed system demonstrates enhanced performance and efficiency compared to generic approaches.

[211] A Human-AI Comparative Analysis of Prompt Sensitivity in LLM-Based Relevance Judgment

Negar Arabzadeh,Charles L. A. Clarke

Main category: cs.IR

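TL;DR: The study examines the robustness and reliability of large language models (LLMs) for relevance judgment in information retrieval, analyzing the effect of prompt sensitivity on the task.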

Details

Motivation: To evaluate how LLMs perform on relevance judgment and to study how prompt variation affects the results. Method: Prompts were collected from 15 human experts and 15 LLMs; after filtering, the remaining 72 prompts were used to label TREC datasets, and the labels were compared with human labels. Result: LLM and human labels are compared using Cohen's κ and pairwise agreement measures, with analysis of prompt differences and of how different LLMs perform as judges. Conclusion: The study supports the use of LLMs for relevance judgment and releases data and prompts to support future research. Abstract: Large Language Models (LLMs) are increasingly used to automate relevance judgments for information retrieval (IR) tasks, often demonstrating agreement with human labels that approaches inter-human agreement. To assess the robustness and reliability of LLM-based relevance judgments, we systematically investigate the impact of prompt sensitivity on the task. We collected prompts for relevance assessment from 15 human experts and 15 LLMs across three tasks (binary, graded, and pairwise), yielding 90 prompts in total. After filtering out unusable prompts from three humans and three LLMs, we employed the remaining 72 prompts with three different LLMs as judges to label document/query pairs from two TREC Deep Learning Datasets (2020 and 2021). We compare LLM-generated labels with TREC official human labels using Cohen's $\kappa$ and pairwise agreement measures. In addition to investigating the impact of prompt variations on agreement with human labels, we compare human- and LLM-generated prompts and analyze differences among different LLMs as judges. We also compare human- and LLM-generated prompts with the standard UMBRELA prompt used for relevance assessment by Bing and TREC 2024 Retrieval Augmented Generation (RAG) Track. To support future research in LLM-based evaluation, we release all data and prompts at https://github.com/Narabzad/prompt-sensitivity-relevance-judgements/.
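
For reference, the agreement measure named in the abstract above can be sketched in a few lines. This is a minimal illustration of Cohen's κ for two label sequences; the function name `cohens_kappa` and the toy labels are assumptions made for the example and are not taken from the paper or its repository.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical binary relevance labels for eight query/document pairs.
human = [1, 0, 1, 1, 0, 0, 1, 0]
llm = [1, 0, 1, 0, 0, 0, 1, 1]
print(cohens_kappa(human, llm))  # observed 0.75, chance 0.5 -> kappa 0.5
```

Unlike raw agreement, κ discounts the agreement two judges would reach by chance given how often each uses every label, which matters when relevant documents are rare.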

[212] Benchmarking LLM-based Relevance Judgment Methods

Negar Arabzadeh,Charles L. A. Clarke

Main category: cs.IR

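TL;DR: The paper systematically compares several LLM-based relevance assessment methods, including binary judgments, graded assessments, pairwise preference methods, and two nugget-based methods, and checks experimentally how well they agree with human preferences.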

Details

Motivation: Existing work mainly focuses on replicating graded human relevance judgments through prompting strategies, with little exploration of alternative assessment methods or comprehensive comparison. Method: Several LLM-based relevance assessment methods (binary judgments, graded assessments, pairwise preferences, and nugget-based methods) are compared in experiments on the TREC and ANTIQUE datasets. Result: The experiments show that LLM assessment methods differ in how well they align with human preferences; the data and methods are publicly released. Conclusion: The paper provides a comprehensive comparison of LLM-based relevance assessment and promotes transparency and reproducibility in future research. Abstract: Large Language Models (LLMs) are increasingly deployed in both academic and industry settings to automate the evaluation of information seeking systems, particularly by generating graded relevance judgments. Previous work on LLM-based relevance assessment has primarily focused on replicating graded human relevance judgments through various prompting strategies. However, there has been limited exploration of alternative assessment methods or comprehensive comparative studies. In this paper, we systematically compare multiple LLM-based relevance assessment methods, including binary relevance judgments, graded relevance assessments, pairwise preference-based methods, and two nugget-based evaluation methods (document-agnostic and document-dependent). In addition to a traditional comparison based on system rankings using Kendall correlations, we also examine how well LLM judgments align with human preferences, as inferred from relevance grades. We conduct extensive experiments on datasets from three TREC Deep Learning tracks (2019, 2020, and 2021) as well as the ANTIQUE dataset, which focuses on non-factoid open-domain question answering. As part of our data release, we include relevance judgments generated by both an open-source (Llama3.2b) and a commercial (gpt-4o) model. Our goal is to reproduce various LLM-based relevance judgment methods to provide a comprehensive comparison. All code, data, and resources are publicly available in our GitHub Repository at https://github.com/Narabzad/llm-relevance-judgement-comparison.
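
The system-ranking comparison mentioned in the abstract above rests on Kendall correlation. The sketch below shows the basic tau-a variant over per-system scores; the function name `kendall_tau` and the effectiveness scores are hypothetical, and the paper may use a different variant (for example one that corrects for ties).

```python
from itertools import combinations


def kendall_tau(scores_x, scores_y):
    """Kendall tau-a: (concordant - discordant) / number of system pairs."""
    if len(scores_x) != len(scores_y) or len(scores_x) < 2:
        raise ValueError("need two equal-length score lists with at least 2 systems")
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_x)), 2):
        # A pair is concordant when both score lists order it the same way.
        direction = (scores_x[i] - scores_x[j]) * (scores_y[i] - scores_y[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    total_pairs = len(scores_x) * (len(scores_x) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical effectiveness of four systems under human and LLM judgments.
human_scores = [0.71, 0.65, 0.60, 0.52]
llm_scores = [0.69, 0.70, 0.58, 0.50]
tau = kendall_tau(human_scores, llm_scores)
print(round(tau, 3))  # one swapped pair out of six -> (5 - 1) / 6
```

A tau near 1 means the LLM judgments rank the systems almost as the human judgments do, even if the absolute scores differ.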

[213] Towards Lossless Token Pruning in Late-Interaction Retrieval Models

Yuxuan Zong,Benjamin Piwowarski

Main category: cs.IR

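TL;DR: The paper proposes an approach based on regularization losses for pruning document tokens without affecting retrieval scores; experiments show that ColBERT's performance can be preserved while using only 30% of the tokens.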

Details

Motivation: Neural IR models such as ColBERT need a large amount of memory to store contextual representations of document tokens, and existing pruning methods cannot guarantee that retrieval scores are unaffected. Method: Three regularization losses and two pruning strategies are introduced to achieve high pruning ratios without changing document-query scores. Result: Experiments show the method preserves ColBERT's performance while using only 30% of the tokens. Conclusion: Regularization losses combined with pruning strategies enable efficient token pruning that does not harm retrieval effectiveness. Abstract: Late interaction neural IR models like ColBERT offer a competitive effectiveness-efficiency trade-off across many benchmarks. However, they require a huge memory space to store the contextual representation for all the document tokens. Some works have proposed using either heuristics or statistical-based techniques to prune tokens from each document. This however doesn't guarantee that the removed tokens have no impact on the retrieval score. Our work uses a principled approach to define how to prune tokens without impacting the score between a document and a query. We introduce three regularization losses that induce a solution with high pruning ratios, as well as two pruning strategies. We study them experimentally (in-domain and out-of-domain), showing that we can preserve ColBERT's performance while using only 30% of the tokens.
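
To make the idea of score-preserving pruning concrete, the sketch below implements the standard late-interaction (MaxSim) score and shows that dropping a document token that is never the best match for any query token leaves the score unchanged. It is only an illustration for one fixed query: the paper's criterion has to hold for every possible query, and its losses and pruning strategies are not reproduced here. All names and vectors are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: sum over query tokens of the best document match."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def tokens_used(query_vecs, doc_vecs):
    """Indices of document tokens that are the arg-max for some query token."""
    used = set()
    for q in query_vecs:
        sims = [dot(q, d) for d in doc_vecs]
        used.add(sims.index(max(sims)))
    return used


# Toy 2-dimensional token embeddings (assumed values).
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.3]]

kept = sorted(tokens_used(query, doc))
pruned_doc = [doc[i] for i in kept]
print(kept)  # the third token never wins a max, so it can be dropped here
print(maxsim_score(query, doc) == maxsim_score(query, pruned_doc))
```

Because each query token contributes only its maximum similarity, any document token that is dominated for all queries is dead weight in the index, which is what makes lossless pruning possible in principle.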

[214] Building Russian Benchmark for Evaluation of Information Retrieval Models

Grigory Kovalev,Mikhail Tikhomirov,Evgeny Kozhevnikov,Max Kornilov,Natalia Loukachevitch

Main category: cs.IR

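TL;DR: RusBEIR is a comprehensive benchmark for zero-shot evaluation of Russian-language information retrieval models; it comprises 17 datasets and supports systematic comparison of lexical and neural models.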

Details

Motivation: To provide a unified evaluation benchmark for Russian-language information retrieval and fill a gap in the field. Method: The benchmark integrates 17 adapted, translated, and newly created datasets and compares lexical models (such as BM25) with neural models (such as mE5-large and BGE-M3). Result: Lexical models perform well after preprocessing, and BM25 is a strong baseline for full-document retrieval; neural models perform better on most datasets but are limited in long-document retrieval. Conclusion: RusBEIR provides an open-source framework for Russian-language IR research and advances the field. Abstract: We introduce RusBEIR, a comprehensive benchmark designed for zero-shot evaluation of information retrieval (IR) models in the Russian language. Comprising 17 datasets from various domains, it integrates adapted, translated, and newly created datasets, enabling systematic comparison of lexical and neural models. Our study highlights the importance of preprocessing for lexical models in morphologically rich languages and confirms BM25 as a strong baseline for full-document retrieval. Neural models, such as mE5-large and BGE-M3, demonstrate superior performance on most datasets, but face challenges with long-document retrieval due to input size constraints. RusBEIR offers a unified, open-source framework that promotes research in Russian-language information retrieval.

[215] FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents

Nandan Thakur,Jimmy Lin,Sam Havens,Michael Carbin,Omar Khattab,Andrew Drozdov

Main category: cs.IR

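TL;DR: FreshStack is a reusable framework for automatically building information retrieval (IR) evaluation benchmarks from community questions and answers. It automatically collects a corpus, generates nuggets from questions and answers, and retrieves documents with hybrid retrieval techniques, yielding five challenging datasets. Existing retrieval models perform poorly on them, indicating that IR quality still has room to improve.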

Details

Motivation: To build realistic, scalable, and uncontaminated evaluation benchmarks for IR and RAG and thereby advance information retrieval. Method: FreshStack works in three steps: automatic corpus collection, nugget generation from questions and answers, and document retrieval using a fusion of retrieval techniques. Result: Five datasets were built; existing retrieval models perform poorly on them, and rerankers fail to improve retrieval accuracy in some cases. Conclusion: FreshStack offers a new tool for building high-quality IR evaluation benchmarks and reveals room for improvement in existing models. Abstract: We introduce FreshStack, a reusable framework for automatically building information retrieval (IR) evaluation benchmarks from community-asked questions and answers. FreshStack conducts the following steps: (1) automatic corpus collection from code and technical documentation, (2) nugget generation from community-asked questions and answers, and (3) nugget-level support, retrieving documents using a fusion of retrieval techniques and hybrid architectures. We use FreshStack to build five datasets on fast-growing, recent, and niche topics to ensure the tasks are sufficiently challenging. On FreshStack, existing retrieval models, when applied out-of-the-box, significantly underperform oracle approaches on all five topics, denoting plenty of headroom to improve IR quality. In addition, we identify cases where rerankers do not clearly improve first-stage retrieval accuracy (two out of five topics). We hope that FreshStack will facilitate future work toward constructing realistic, scalable, and uncontaminated IR and RAG evaluation benchmarks. FreshStack datasets are available at: https://fresh-stack.github.io.

Haoxuan Li,Yi Bin,Yunshan Ma,Guoqing Wang,Yang Yang,See-Kiong Ng,Tat-Seng Chua

Main category: cs.IR

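TL;DR: The paper proposes SemCORE, a new framework that uses structured natural language identifiers and a generative semantic verification strategy to strengthen semantic understanding in generative cross-modal retrieval, significantly outperforming existing methods.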

Details

Motivation: Traditional cross-modal retrieval relies on embedding similarity, while generative retrieval, though promising, suffers from insufficient semantic information in both identifier construction and generation. Method: The SemCORE framework combines a Structured natural language IDentifier (SID) with a Generative Semantic Verification (GSV) strategy and handles both text-to-image and image-to-text retrieval. Result: Experiments show strong results on benchmark datasets, with Recall@1 for text-to-image retrieval improving by 8.65 points on average. Conclusion: By strengthening semantic understanding, SemCORE substantially improves generative cross-modal retrieval and suggests new directions for future research. Abstract: Cross-modal retrieval (CMR) is a fundamental task in multimedia research, focused on retrieving semantically relevant targets across different modalities. While traditional CMR methods match text and image via embedding-based similarity calculations, recent advancements in pre-trained generative models have established generative retrieval as a promising alternative. This paradigm assigns each target a unique identifier and leverages a generative model to directly predict identifiers corresponding to input queries without explicit indexing. Despite its great potential, current generative CMR approaches still face semantic information insufficiency in both identifier construction and generation processes. To address these limitations, we propose a novel unified Semantic-enhanced generative Cross-mOdal REtrieval framework (SemCORE), designed to unleash the semantic understanding capabilities in generative cross-modal retrieval task. Specifically, we first construct a Structured natural language IDentifier (SID) that effectively aligns target identifiers with generative models optimized for natural language comprehension and generation. Furthermore, we introduce a Generative Semantic Verification (GSV) strategy enabling fine-grained target discrimination. Additionally, to the best of our knowledge, SemCORE is the first framework to simultaneously consider both text-to-image and image-to-text retrieval tasks within generative cross-modal retrieval. Extensive experiments demonstrate that our framework outperforms state-of-the-art generative cross-modal retrieval methods. Notably, SemCORE achieves substantial improvements across benchmark datasets, with an average increase of 8.65 points in Recall@1 for text-to-image retrieval.
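
The Recall@1 figure quoted above is the share of queries whose correct target is ranked first, usually reported in percentage points. A minimal sketch follows, assuming exactly one gold target per query; the function name, query ids, and image ids are invented for the example.

```python
def recall_at_k(rankings, gold, k=1):
    """Fraction of queries whose gold target appears in the top-k ranked ids."""
    if not rankings:
        raise ValueError("rankings must not be empty")
    hits = sum(1 for qid, ranked in rankings.items() if gold[qid] in ranked[:k])
    return hits / len(rankings)


# Hypothetical text-to-image results: each query maps to ranked image ids.
rankings = {
    "q1": ["img3", "img7"],
    "q2": ["img2", "img5"],
    "q3": ["img9", "img1"],
    "q4": ["img4", "img8"],
}
gold = {"q1": "img3", "q2": "img5", "q3": "img1", "q4": "img4"}

print(100 * recall_at_k(rankings, gold, k=1))  # 2 of 4 queries hit at rank 1
print(100 * recall_at_k(rankings, gold, k=2))  # all 4 gold images are in the top 2
```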

cs.LG [Back]

[217] Simplifying Graph Transformers

Liheng Ma,Soumyasundar Pal,Yingxue Zhang,Philip H. S. Torr,Mark Coates

Main category: cs.LG

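TL;DR: The paper proposes three simple modifications to the Transformer that make it applicable to graph learning without complex architectural changes.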

Details

Motivation: Existing Graph Transformer architectures are complex, which makes it hard to adopt advances in Transformer training directly, so a simpler approach is needed. Method: (1) simplified L2 attention; (2) adaptive root-mean-square normalization; (3) a relative positional encoding bias with a shared encoder. Result: Performance improves significantly across a variety of graph datasets, and the model shows strong expressiveness on a graph isomorphism benchmark. Conclusion: The proposed modifications are simple and effective and offer a new route for applying Transformers to graph learning. Abstract: Transformers have attained outstanding performance across various modalities, employing scaled-dot-product (SDP) attention mechanisms. Researchers have attempted to migrate Transformers to graph learning, but most advanced Graph Transformers are designed with major architectural differences, either integrating message-passing or incorporating sophisticated attention mechanisms. These complexities prevent the easy adoption of Transformer training advances. We propose three simple modifications to the plain Transformer to render it applicable to graphs without introducing major architectural distortions. Specifically, we advocate for the use of (1) simplified $L_2$ attention to measure the magnitude closeness of tokens; (2) adaptive root-mean-square normalization to preserve token magnitude information; and (3) a relative positional encoding bias with a shared encoder. Significant performance gains across a variety of graph datasets justify the effectiveness of our proposed modifications. Furthermore, empirical evaluation on the expressiveness benchmark reveals noteworthy realized expressiveness in the graph isomorphism.
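
The contrast between scaled-dot-product attention and a distance-based alternative can be seen on a two-key toy case. The sketch below uses negative squared Euclidean distance as the attention logit; that form and the `sqrt(d)` scaling are assumptions made for illustration, since the abstract does not give the exact formula of the paper's simplified $L_2$ attention.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_weights(query, keys):
    """Scaled-dot-product attention weights for one query."""
    scale = math.sqrt(len(query))
    return softmax([sum(q * k for q, k in zip(query, key)) / scale for key in keys])


def l2_weights(query, keys):
    """Attention weights from negative squared distance (assumed form)."""
    scale = math.sqrt(len(query))
    return softmax(
        [-sum((q - k) ** 2 for q, k in zip(query, key)) / scale for key in keys]
    )


# Two keys with the same direction as the query but different magnitudes.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [3.0, 0.0]]
sdp = dot_product_weights(query, keys)
l2 = l2_weights(query, keys)
print(sdp)  # the larger-magnitude key receives more weight
print(l2)   # the key closest in magnitude receives more weight
```

Dot-product logits grow with key magnitude, whereas distance-based logits peak when the key matches the query in both direction and magnitude, which is one reading of "magnitude closeness" in the abstract.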

[218] VLMGuard-R1: Proactive Safety Alignment for VLMs via Reasoning-Driven Prompt Optimization

Menglan Chen,Xianghe Pang,Jingjing Dong,WenHao Wang,Yaxin Du,Siheng Chen

Main category: cs.LG

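TL;DR: The paper proposes VLMGuard-R1, a prompt rewriting method based on multimodal reasoning that improves the safety of vision-language models by dynamically interpreting text-image interactions; experiments show it clearly outperforms baseline methods.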

Details

Motivation: The multimodal complexity of vision-language models can give rise to subtle threats that conventional safeguards cannot handle, so a new safety alignment approach is needed. Method: The VLMGuard-R1 framework refines user inputs through a reasoning-guided rewriter; a three-stage reasoning pipeline synthesizes the dataset used to train the rewriter. Result: Across three benchmarks and five VLMs, VLMGuard-R1 outperforms four baselines, with a 43.59% increase in average safety on the SIUO benchmark. Conclusion: Through multimodal reasoning-driven prompt rewriting, VLMGuard-R1 improves VLM safety without modifying the models' core parameters. Abstract: Aligning Vision-Language Models (VLMs) with safety standards is essential to mitigate risks arising from their multimodal complexity, where integrating vision and language unveils subtle threats beyond the reach of conventional safeguards. Inspired by the insight that reasoning across modalities is key to preempting intricate vulnerabilities, we propose a novel direction for VLM safety: multimodal reasoning-driven prompt rewriting. To this end, we introduce VLMGuard-R1, a proactive framework that refines user inputs through a reasoning-guided rewriter, dynamically interpreting text-image interactions to deliver refined prompts that bolster safety across diverse VLM architectures without altering their core parameters. To achieve this, we devise a three-stage reasoning pipeline to synthesize a dataset that trains the rewriter to infer subtle threats, enabling tailored, actionable responses over generic refusals. Extensive experiments across three benchmarks with five VLMs reveal that VLMGuard-R1 outperforms four baselines. In particular, VLMGuard-R1 achieves a remarkable 43.59% increase in average safety across five models on the SIUO benchmark.

[219] Quantum Computing Supported Adversarial Attack-Resilient Autonomous Vehicle Perception Module for Traffic Sign Classification

Reek Majumder,Mashrur Chowdhury,Sakib Mahmud Khan,Zadid Khan,Fahim Ahmad,Frank Ngeni,Gurcan Comert,Judith Mwakalonge,Dimitra Michalaka

Main category: cs.LG

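TL;DR: The paper studies the robustness of hybrid classical-quantum deep learning (HCQ-DL) models under adversarial attacks; compared with classical deep learning (C-DL) models, HCQ-DL performs better on traffic sign classification.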

Details

Motivation: The perception modules of autonomous vehicles (AVs) rely on deep learning models, and adversarial attacks can cause serious misclassification. The study explores how HCQ-DL models perform under such attacks. Method: AlexNet and VGG-16 serve as feature extractors, combined with over 1000 quantum circuits to build HCQ-DL models, which are tested under three adversarial attacks: PGD, FGSA, and GA. Result: HCQ-DL exceeds 95% accuracy with no attack and 91% under FGSA and GA attacks, higher than C-DL models. Under the PGD attack, the AlexNet-based HCQ-DL model reaches 85% accuracy while C-DL models fall below 21%. Conclusion: HCQ-DL models are more robust under adversarial attacks and are suited to traffic sign classification in autonomous driving perception modules. Abstract: Deep learning (DL)-based image classification models are essential for autonomous vehicle (AV) perception modules since incorrect categorization might have severe repercussions. Adversarial attacks are widely studied cyberattacks that can lead DL models to predict inaccurate output, such as incorrectly classified traffic signs by the perception module of an autonomous vehicle. In this study, we create and compare hybrid classical-quantum deep learning (HCQ-DL) models with classical deep learning (C-DL) models to demonstrate robustness against adversarial attacks for perception modules. Before feeding them into the quantum system, we used transfer learning models, AlexNet and VGG-16, as feature extractors. We tested over 1000 quantum circuits in our HCQ-DL models for projected gradient descent (PGD), fast gradient sign attack (FGSA), and gradient attack (GA), which are three well-known untargeted adversarial approaches. We evaluated the performance of all models during adversarial attacks and no-attack scenarios. Our HCQ-DL models maintain accuracy above 95% during a no-attack scenario and above 91% for GA and FGSA attacks, which is higher than C-DL models. During the PGD attack, our AlexNet-based HCQ-DL model maintained an accuracy of 85% compared to C-DL models that achieved accuracies below 21%. Our results highlight that the HCQ-DL models provide improved accuracy for traffic sign classification under adversarial settings compared to their classical counterparts.
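
The fast gradient sign attack named in the abstract perturbs an input by a fixed step in the direction of the sign of the loss gradient. The sketch below applies that rule to a toy logistic-regression classifier, whose input gradient has a closed form. It is not the paper's setup: the weights, input, and step size are assumed values, and the paper attacks image classifiers rather than a two-feature linear model.

```python
import math


def predict(x, w, b):
    """Probability of class 1 under a logistic-regression model."""
    logit = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-logit))


def fgsm(x, y, w, b, eps):
    """One fast-gradient-sign step that increases the cross-entropy loss."""
    p = predict(x, w, b)
    # For logistic regression, d(loss)/dx = (p - y) * w.
    grad = [(p - y) * wi for wi in w]
    signs = [(g > 0) - (g < 0) for g in grad]
    return [xi + eps * s for xi, s in zip(x, signs)]


# Toy model and a correctly classified input with true label 1 (assumed values).
w, b = [2.0, -1.0], 0.0
x, y = [0.5, 0.2], 1
x_adv = fgsm(x, y, w, b, eps=0.5)

print(predict(x, w, b) > 0.5)      # clean input is classified as class 1
print(predict(x_adv, w, b) > 0.5)  # perturbed input no longer is
```

Projected gradient descent (PGD), also named in the abstract, repeats smaller steps of this kind and projects the result back into an allowed perturbation range after each step.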

Advait Gadhikar,Tom Jacobs,Chao Zhou,Rebekka Burkholz

Main category: cs.LG

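TL;DR: The paper proposes Sign-In, which uses dynamic reparameterization to address the initialization problem in training sparse neural networks, narrowing the performance gap with dense-to-sparse training.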

Details

Motivation: A performance gap between training sparse neural networks from scratch (PaI) and dense-to-sparse training hinders efficient deep learning. According to the Lottery Ticket Hypothesis, PaI depends on a problem-specific parameter initialization, but the correct parameter signs are hard to determine. Method: Sign-In uses a dynamic reparameterization that induces parameter sign flips, complementing those that dense-to-sparse training achieves. Result: Experiments and theory indicate that Sign-In improves PaI performance, although a gap to dense-to-sparse training remains. Conclusion: Sign-In is an effective approach to the PaI initialization problem, but fully closing the performance gap requires further research. Abstract: The performance gap between training sparse neural networks from scratch (PaI) and dense-to-sparse training presents a major roadblock for efficient deep learning. According to the Lottery Ticket Hypothesis, PaI hinges on finding a problem-specific parameter initialization. As we show, to this end, determining correct parameter signs is sufficient. Yet, they remain elusive to PaI. To address this issue, we propose Sign-In, which employs a dynamic reparameterization that provably induces sign flips. Such sign flips are complementary to the ones that dense-to-sparse training can accomplish, making Sign-In an orthogonal method. While our experiments and theory suggest performance improvements of PaI, they also carve out the main open challenge to close the gap between PaI and dense-to-sparse training.

[221] MIB: A Mechanistic Interpretability Benchmark

Aaron Mueller,Atticus Geiger,Sarah Wiegreffe,Dana Arad,Iván Arcuschin,Adam Belfki,Yik Siu Chan,Jaden Fiotto-Kaufman,Tal Haklay,Michael Hanna,Jing Huang,Rohan Gupta,Yaniv Nikankin,Hadas Orgad,Nikhil Prakash,Anja Reusch,Aruna Sankaranarayanan,Shun Shao,Alessandro Stolfo,Martin Tutek,Amir Zur,David Bau,Yonatan Belinkov

Main category: cs.LG

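TL;DR: MIB is a benchmark for evaluating mechanistic interpretability methods; it comprises two tracks and multiple models and compares how well methods localize causal pathways or causal variables.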

Details

Motivation: To provide lasting and meaningful evaluation standards for mechanistic interpretability methods, so that real improvements can be confirmed. Method: The MIB benchmark has two tracks, circuit localization and causal variable localization, spanning four tasks and five models, and compares the performance of different methods. Result: For circuit localization, attribution and mask optimization methods perform best; for causal variable localization, the supervised DAS method performs best, while SAE features are not better than neurons. Conclusion: MIB enables meaningful comparison of methods and supports the view that the field has made real progress. Abstract: How can we know whether new mechanistic interpretability methods achieve real improvements? In pursuit of meaningful and lasting evaluation standards, we propose MIB, a benchmark with two tracks spanning four tasks and five models. MIB favors methods that precisely and concisely recover relevant causal pathways or specific causal variables in neural language models. The circuit localization track compares methods that locate the model components - and connections between them - most important for performing a task (e.g., attribution patching or information flow routes). The causal variable localization track compares methods that featurize a hidden vector, e.g., sparse autoencoders (SAEs) or distributed alignment search (DAS), and locate model features for a causal variable relevant to the task. Using MIB, we find that attribution and mask optimization methods perform best on circuit localization. For causal variable localization, we find that the supervised DAS method performs best, while SAE features are not better than neurons, i.e., standard dimensions of hidden vectors. These findings illustrate that MIB enables meaningful comparisons of methods, and increases our confidence that there has been real progress in the field.

[222] ALT: A Python Package for Lightweight Feature Representation in Time Series Classification

Balázs P. Halmos,Balázs Hajós,Vince Á. Molnár,Marcell T. Kurbucz,Antal Jakovác

Main category: cs.LG

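TL;DR: ALT is an open-source Python package for efficient and accurate time series classification; its adaptive algorithm transforms raw data into a linearly separable feature space.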

Details

Motivation: To improve on the existing linear law-based transformation (LLT) so that patterns at different temporal scales are captured better. Method: The adaptive law-based transformation (ALT) algorithm processes the data with variable-length shifted time windows. Result: ALT performs well on real-world datasets, with low computational overhead and state-of-the-art performance. Conclusion: ALT suits time series classification tasks in physics and related domains and is scalable and easy to use. Abstract: We introduce ALT, an open-source Python package created for efficient and accurate time series classification (TSC). The package implements the adaptive law-based transformation (ALT) algorithm, which transforms raw time series data into a linearly separable feature space using variable-length shifted time windows. This adaptive approach enhances its predecessor, the linear law-based transformation (LLT), by effectively capturing patterns of varying temporal scales. The software is implemented for scalability, interpretability, and ease of use, achieving state-of-the-art performance with minimal computational overhead. Extensive benchmarking on real-world datasets demonstrates the utility of ALT for diverse TSC tasks in physics and related domains.

cs.RO [Back]

[223] RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins

Yao Mu,Tianxing Chen,Zanxin Chen,Shijia Peng,Zhiqian Lan,Zeyu Gao,Zhixuan Liang,Qiaojun Yu,Yude Zou,Mingkun Xu,Lunkai Lin,Zhiqiang Xie,Mingyu Ding,Ping Luo

Main category: cs.RO

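TL;DR: RoboTwin is a generative digital twin framework that uses 3D generative foundation models and large language models to produce diverse expert datasets for dual-arm robot tasks and to provide a real-world-aligned evaluation platform.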

Details

Motivation: To address the scarcity of high-quality demonstration data and real-world-aligned evaluation benchmarks for dual-arm coordination and complex object manipulation. Method: Diverse digital twin objects are created from single 2D images and combined with a spatial relation-aware code generation framework that decomposes tasks and generates precise robot motion code. Result: Validated on the COBOT Magic Robot platform; policies pre-trained on generated data and fine-tuned on real-world samples improve success rates by over 70% on single-arm tasks and over 40% on dual-arm tasks. Conclusion: RoboTwin significantly improves dual-arm robotic manipulation systems and provides a standardized evaluation benchmark. Abstract: In the rapidly advancing field of robotics, dual-arm coordination and complex object manipulation are essential capabilities for developing advanced autonomous systems. However, the scarcity of diverse, high-quality demonstration data and real-world-aligned evaluation benchmarks severely limits such development. To address this, we introduce RoboTwin, a generative digital twin framework that uses 3D generative foundation models and large language models to produce diverse expert datasets and provide a real-world-aligned evaluation platform for dual-arm robotic tasks. Specifically, RoboTwin creates varied digital twins of objects from single 2D images, generating realistic and interactive scenarios. It also introduces a spatial relation-aware code generation framework that combines object annotations with large language models to break down tasks, determine spatial constraints, and generate precise robotic movement code. Our framework offers a comprehensive benchmark with both simulated and real-world data, enabling standardized evaluation and better alignment between simulated training and real-world performance. We validated our approach using the open-source COBOT Magic Robot platform. Policies pre-trained on RoboTwin-generated data and fine-tuned with limited real-world samples demonstrate significant potential for enhancing dual-arm robotic manipulation systems by improving success rates by over 70% for single-arm tasks and over 40% for dual-arm tasks compared to models trained solely on real-world data.

[224] Explainable Scene Understanding with Qualitative Representations and Graph Neural Networks

Nassim Belmecheri,Arnaud Gotlieb,Nadjib Lazaar,Helge Spieker

Main category: cs.RO

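TL;DR: The paper studies combining graph neural networks (GNNs) with Qualitative Explainable Graphs (QXGs) for scene understanding in automated driving, proposes a new GNN architecture, and shows superior performance on the nuScenes dataset.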

Details

Motivation: Scene understanding underpins decision-making in automated driving, but existing approaches analyse only single relation chains between object pairs and ignore the broader scene context. Method: A new GNN architecture processes the entire graph structure to identify relevant objects in traffic scenes. Result: Experiments on the nuScenes dataset show the approach outperforms baseline methods, handles class imbalance effectively, and accounts for spatial-temporal relationships. Conclusion: Combining qualitative representations with deep learning shows promise for scene understanding in autonomous driving. Abstract: This paper investigates the integration of graph neural networks (GNNs) with Qualitative Explainable Graphs (QXGs) for scene understanding in automated driving. Scene understanding is the basis for any further reactive or proactive decision-making. Scene understanding and related reasoning is inherently an explanation task: why is another traffic participant doing something, what or who caused their actions? While previous work demonstrated QXGs' effectiveness using shallow machine learning models, these approaches were limited to analysing single relation chains between object pairs, disregarding the broader scene context. We propose a novel GNN architecture that processes entire graph structures to identify relevant objects in traffic scenes. We evaluate our method on the nuScenes dataset enriched with DriveLM's human-annotated relevance labels. Experimental results show that our GNN-based approach achieves superior performance compared to baseline methods. The model effectively handles the inherent class imbalance in relevant object identification tasks while considering the complete spatial-temporal relationships between all objects in the scene. Our work demonstrates the potential of combining qualitative representations with deep learning approaches for explainable scene understanding in autonomous driving systems.

[225] UncAD: Towards Safe End-to-end Autonomous Driving via Online Map Uncertainty

Pengxuan Yang,Yupeng Zheng,Qichao Zhang,Kefei Zhu,Zebin Xing,Qiao Lin,Yun-Fu Liu,Zhiguo Su,Dongbin Zhao

Main category: cs.RO

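TL;DR: UncAD proposes a new paradigm that estimates the uncertainty of the online map in the perception module and uses it to guide the prediction and planning modules in producing multi-modal trajectories, improving autonomous driving safety.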

Details

Motivation: Existing end-to-end autonomous driving methods rely on deterministically modeled online maps, which can introduce erroneous perception information and compromise planning safety. Method: UncAD first estimates the uncertainty of the online map in the perception module, uses it to guide the prediction and planning modules in producing multi-modal trajectories, and proposes an uncertainty-collision-aware planning selection strategy. Result: On the nuScenes dataset, UncAD adds only 1.9% more parameters while reducing collision rates by up to 26% and the drivable area conflict rate by up to 42%. Conclusion: By modeling uncertainty, UncAD significantly improves autonomous driving safety at low computational overhead. Abstract: End-to-end autonomous driving aims to produce planning trajectories from raw sensors directly. Currently, most approaches integrate perception, prediction, and planning modules into a fully differentiable network, promising great scalability. However, these methods typically rely on deterministic modeling of online maps in the perception module for guiding or constraining vehicle planning, which may incorporate erroneous perception information and further compromise planning safety. To address this issue, we delve into the importance of online map uncertainty for enhancing autonomous driving safety and propose a novel paradigm named UncAD. Specifically, UncAD first estimates the uncertainty of the online map in the perception module. It then leverages the uncertainty to guide motion prediction and planning modules to produce multi-modal trajectories. Finally, to achieve safer autonomous driving, UncAD proposes an uncertainty-collision-aware planning selection strategy according to the online map uncertainty to evaluate and select the best trajectory. In this study, we incorporate UncAD into various state-of-the-art (SOTA) end-to-end methods. Experiments on the nuScenes dataset show that integrating UncAD, with only a 1.9% increase in parameters, can reduce collision rates by up to 26% and drivable area conflict rate by up to 42%. Codes, pre-trained models, and demo videos can be accessed at https://github.com/pengxuanyang/UncAD.

[226] Taccel: Scaling Up Vision-based Tactile Robotics via High-performance GPU Simulation

Yuyang Li,Wenxin Du,Chang Yu,Puhao Li,Zihang Zhao,Tengyu Liu,Chenfanfu Jiang,Yixin Zhu,Siyuan Huang

Main category: cs.RO

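TL;DR: Taccel is a high-performance tactile simulation platform that integrates IPC and ABD to deliver fast and accurate tactile simulation across large-scale parallel environments, substantially accelerating tactile robotics research.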

Details

Motivation: Tactile sensing is key to advanced robotic manipulation, but vision-based tactile sensors (VBTSs) lack efficient simulation tools, which limits the scale of research. Method: Taccel combines IPC and ABD to simulate robots, tactile sensors, and objects, providing precise physics simulation and realistic tactile signals. Result: Taccel runs 18 times faster than real time across thousands of parallel environments, and experiments validate its simulation accuracy and sim-to-real transfer. Conclusion: Taccel is a powerful tool for scaling up tactile robotics research and improving how robots understand and interact with their physical environment. Abstract: Tactile sensing is crucial for achieving human-level robotic capabilities in manipulation tasks. Vision-based tactile sensors (VBTSs) have emerged as a promising solution, offering high spatial resolution and cost-effectiveness by sensing contact through camera-captured deformation patterns of elastic gel pads. However, these sensors' complex physical characteristics and visual signal processing requirements present unique challenges for robotic applications. The lack of efficient and accurate simulation tools for VBTS has significantly limited the scale and scope of tactile robotics research. Here we present Taccel, a high-performance simulation platform that integrates IPC and ABD to model robots, tactile sensors, and objects with both accuracy and unprecedented speed, achieving an 18-fold acceleration over real-time across thousands of parallel environments. Unlike previous simulators that operate at sub-real-time speeds with limited parallelization, Taccel provides precise physics simulation and realistic tactile signals while supporting flexible robot-sensor configurations through user-friendly APIs. Through extensive validation in object recognition, robotic grasping, and articulated object manipulation, we demonstrate precise simulation and successful sim-to-real transfer. These capabilities position Taccel as a powerful tool for scaling up tactile robotics research and development. By enabling large-scale simulation and experimentation with tactile sensing, Taccel accelerates the development of more capable robotic systems, potentially transforming how robots interact with and understand their physical environment.

[227] ViTa-Zero: Zero-shot Visuotactile Object 6D Pose Estimation

Hongyu Li,James Akl,Srinath Sridhar,Tye Brady,Taskin Padir

Main category: cs.RO

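TL;DR: ViTa-Zero is a zero-shot visuotactile pose estimation framework that uses physical constraints and test-time optimization to improve accuracy in robotic manipulation tasks.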

Details

Motivation: Existing methods that combine vision and touch generalize poorly because visuotactile data is limited; ViTa-Zero aims to address this. Method: A visual model serves as the backbone; the gripper-object interaction is modeled as a spring-mass system, and tactile and proprioceptive observations drive feasibility checking and optimization. Result: In experiments, ViTa-Zero shows an average increase of 55% in AUC of ADD-S and 60% in ADD, along with an 80% lower position error compared to FoundationPose. Conclusion: ViTa-Zero improves the accuracy and robustness of pose estimation through physical constraints and test-time optimization. Abstract: Object 6D pose estimation is a critical challenge in robotics, particularly for manipulation tasks. While prior research combining visual and tactile (visuotactile) information has shown promise, these approaches often struggle with generalization due to the limited availability of visuotactile data. In this paper, we introduce ViTa-Zero, a zero-shot visuotactile pose estimation framework. Our key innovation lies in leveraging a visual model as its backbone and performing feasibility checking and test-time optimization based on physical constraints derived from tactile and proprioceptive observations. Specifically, we model the gripper-object interaction as a spring-mass system, where tactile sensors induce attractive forces, and proprioception generates repulsive forces. We validate our framework through experiments on a real-world robot setup, demonstrating its effectiveness across representative visual backbones and manipulation scenarios, including grasping, object picking, and bimanual handover. Compared to the visual models, our approach overcomes some drastic failure modes while tracking the in-hand object pose. In our experiments, our approach shows an average increase of 55% in AUC of ADD-S and 60% in ADD, along with an 80% lower position error compared to FoundationPose.

cs.SE [Back]

[228] A Phenomenological Approach to Analyzing User Queries in IT Systems Using Heidegger's Fundamental Ontology

Maksim Vishnevskiy

Main category: cs.SE

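TL;DR: The paper proposes a new analytical IT system grounded in Heidegger's Fundamental Ontology. It distinguishes beings (das Seiende) from Being (das Sein), uses two modally distinct languages for processing user input and for internal analysis, and connects them through a phenomenological reduction module to uncover deeper ontological patterns.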

Details

Motivation: Existing systems are limited to categorical analysis and cannot handle logical traps in complex interactions (such as metaphor usage). The study uses Heidegger's phenomenological existential analysis to provide deeper insight into query processing. Method: The system uses two languages, a categorical language of beings and an existential language of Being, connected by a phenomenological reduction module, to analyze user queries and identify recursive and self-referential structures. Result: The system can uncover ontological patterns in query processing and help resolve logical traps in complex interactions; its advantages are shown through use cases and comparative evaluation. Conclusion: The study lays the groundwork for a universal query analysis tool, but the language of Being still needs further formalization to achieve full computability. Abstract: This paper presents a novel research analytical IT system grounded in Martin Heidegger's Fundamental Ontology, distinguishing between beings (das Seiende) and Being (das Sein). The system employs two modally distinct, descriptively complete languages: a categorical language of beings for processing user inputs and an existential language of Being for internal analysis. These languages are bridged via a phenomenological reduction module, enabling the system to analyze user queries (including questions, answers, and dialogues among IT specialists), identify recursive and self-referential structures, and provide actionable insights in categorical terms. Unlike contemporary systems limited to categorical analysis, this approach leverages Heidegger's phenomenological existential analysis to uncover deeper ontological patterns in query processing, aiding in resolving logical traps in complex interactions, such as metaphor usage in IT contexts. The path to full realization involves formalizing the language of Being by a research team based on Heidegger's Fundamental Ontology; given the existing completeness of the language of beings, this reduces the system's computability to completeness, paving the way for a universal query analysis tool. The paper presents the system's architecture, operational principles, technical implementation, use cases (including a case based on real IT specialist dialogues), comparative evaluation with existing tools, and its advantages and limitations.

Table of Contents

cs.CV [Back]

[1] DMM: Building a Versatile Image Generation Model via Distillation-Based Model Merging

[2] Geographical Context Matters: Bridging Fine and Coarse Spatial Information to Enhance Continental Land Cover Mapping

[3] WORLDMEM: Long-term Consistent World Simulation with Memory

[4] InstantCharacter: Personalize Any Characters with a Scalable Diffusion Transformer Framework

[5] NTIRE 2025 Challenge on Event-Based Image Deblurring: Methods and Results

[6] Sparsity Outperforms Low-Rank Projections in Few-Shot Adaptation

[7] 3D-PointZshotS: Geometry-Aware 3D Point Cloud Zero-Shot Semantic Segmentation Narrowing the Visual-Semantic Gap

[8] DG-MVP: 3D Domain Generalization via Multiple Views of Point Clouds for Classification

[9] AdaVid: Adaptive Video-Language Pretraining

[10] Event Quality Score (EQS): Assessing the Realism of Simulated Event Camera Streams via Distances in Latent Space

[11] Decision-based AI Visual Navigation for Cardiac Ultrasounds

[12] Post-Hurricane Debris Segmentation Using Fine-Tuned Foundational Vision Models

[13] Privacy-Preserving Operating Room Workflow Analysis using Digital Twins

[14] Contour Field based Elliptical Shape Prior for the Segment Anything Model

[15] Parsimonious Dataset Construction for Laparoscopic Cholecystectomy Structure Segmentation

[16] Prompt-Driven and Training-Free Forgetting Approach and Dataset for Large Language Models

[17] CM3AE: A Unified RGB Frame and Event-Voxel/-Frame Pre-training Framework

[18] 3DResT: A Strong Baseline for Semi-Supervised 3D Referring Expression Segmentation

[19] AdaQual-Diff: Diffusion-Based Image Restoration via Adaptive Quality Prompting

[20] Robo-SGG: Exploiting Layout-Oriented Normalization and Restitution for Robust Scene Graph Generation

[21] SAM-Based Building Change Detection with Distribution-Aware Fourier Adaptation and Edge-Constrained Warping

[22] Packing Input Frame Context in Next-Frame Prediction Models for Video Generation

[23] RoPETR: Improving Temporal Camera-Only 3D Detection by Integrating Enhanced Rotary Position Embedding

[24] AdaptoVision: A Multi-Resolution Image Recognition Model for Robust and Scalable Classification

[25] Two Tasks, One Goal: Uniting Motion and Planning for Excellent End To End Autonomous Driving Performance

[26] Accurate Tracking of Arabidopsis Root Cortex Cell Nuclei in 3D Time-Lapse Microscopy Images Based on Genetic Algorithm

[27] TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials

[28] HSS-IAD: A Heterogeneous Same-Sort Industrial Anomaly Detection Dataset

[29] Collaborative Perception Datasets for Autonomous Driving: A Review

[30] Unsupervised Cross-Domain 3D Human Pose Estimation via Pseudo-Label-Guided Global Transforms

[31] SmartFreeEdit: Mask-Free Spatial-Aware Image Editing with Complex Instruction Understanding

[32] Self-Supervised Pre-training with Combined Datasets for 3D Perception in Autonomous Driving

[33] NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results

[34] Post-pre-training for Modality Alignment in Vision-Language Foundation Models

[35] Mask Image Watermarking

[36] Privacy Protection Against Personalized Text-to-Image Synthesis via Cross-image Consistency Constraints

[37] LAD-Reasoner: Tiny Multimodal Models are Good Reasoners for Logical Anomaly Detection

[38] Stronger, Steadier & Superior: Geometric Consistency in Depth VFM Forges Domain Generalized Semantic Segmentation

[39] Set You Straight: Auto-Steering Denoising Trajectories to Sidestep Unwanted Concepts

[40] EarthGPT-X: Enabling MLLMs to Flexibly and Comprehensively Understand Multi-Source Remote Sensing Imagery

[41] TSGS: Improving Gaussian Splatting for Transparent Surface Reconstruction via Normal and De-lighting Priors

[42] Hybrid Dense-UNet201 Optimization for Pap Smear Image Segmentation Using Spider Monkey Optimization

[43] Saliency-Aware Diffusion Reconstruction for Effective Invisible Watermark Removal

[44] TwoSquared: 4D Generation from 2D Image Pairs

[45] Image-Editing Specialists: An RLAIF Approach for Diffusion Models

[46] High-Fidelity Image Inpainting with Multimodal Guided GAN Inversion

[47] Computer-Aided Design of Personalized Occlusal Positioning Splints Using Multimodal 3D Data

[48] SC3EF: A Joint Self-Correlation and Cross-Correspondence Estimation Framework for Visible and Thermal Image Registration

[49] Tree-NeRV: A Tree-Structured Neural Representation for Efficient Non-Uniform Video Encoding

[50] Second-order Optimization of Gaussian Splats with Importance Sampling

[51] Efficient Masked Image Compression with Position-Indexed Self-Attention

[52] Disentangling Polysemantic Channels in Convolutional Neural Networks

[53] Rethinking Temporal Fusion with a Unified Gradient Descent View for 3D Semantic Occupancy Prediction

[54] Vision and Language Integration for Domain Generalization

[55] MathPhys-Guided Coarse-to-Fine Anomaly Synthesis with SQE-Driven Bi-Level Optimization for Anomaly Detection

[56] Enhancing Cocoa Pod Disease Classification via Transfer Learning and Ensemble Methods: Toward Robust Predictive Modeling

[57] All-in-One Transferring Image Compression from Human Perception to Multi-Machine Perception

[58] Hierarchical Feature Learning for Medical Point Clouds via State Space Model

[59] Pose and Facial Expression Transfer by using StyleGAN

[60] Riemannian Patch Assignment Gradient Flows

[61] TTRD3: Texture Transfer Residual Denoising Dual Diffusion Model for Remote Sensing Image Super-Resolution

[62] Prototypes are Balanced Units for Efficient and Effective Partially Relevant Video Retrieval

[63] Event-Enhanced Blurry Video Super-Resolution

[64] Expert Kernel Generation Network Driven by Contextual Mapping for Hyperspectral Image Classification

[65] NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation

[66] Imaging for All-Day Wearable Smart Glasses

[67] ArtistAuditor: Auditing Artist Style Pirate in Text-to-Image Generation Models

[68] EchoWorld: Learning Motion-Aware World Models for Echocardiography Probe Guidance

[69] SkyReels-V2: Infinite-length Film Generative Model

[70] Effective Dual-Region Augmentation for Reduced Reliance on Large Amounts of Labeled Data

[71] Enhancing Person-to-Person Virtual Try-On with Multi-Garment Virtual Try-Off

[72] EventVAD: Training-Free Event-Aware Video Anomaly Detection

[73] RF-DETR Object Detection vs YOLOv12 : A Study of Transformer-based and CNN-based Architectures for Single-Class and Multi-Class Greenfruit Detection in Complex Orchard Environments Under Label Ambiguity

[74] UniEdit-Flow: Unleashing Inversion and Editing in the Era of Flow Models

[75] Probing and Inducing Combinational Creativity in Vision-Language Models

[76] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models

[77] Low-hallucination Synthetic Captions for Large-Scale Vision-Language Model Pre-training

[78] Science-T2I: Addressing Scientific Illusions in Image Synthesis