cs.CV

[1] Can ChatGPT Perform Image Splicing Detection? A Preliminary Study

Souradip Nath

Main category: cs.CV

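TL;DR: In the zero-shot setting GPT-4V exceeds 85% accuracy on image splicing detection, and Chain-of-Thought prompting gives the most balanced results across authentic and spliced images.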

Details Motivation: Investigate the zero-shot capabilities of GPT-4V in image forensics, specifically image splicing detection, and explore its generality and potential. Method: Evaluate GPT-4V on a subset of the CASIA v2.0 dataset using three prompting strategies (Zero-Shot, Few-Shot, Chain-of-Thought). Result: GPT-4V is competitive in the zero-shot setting, and CoT prompting gives the best balance between authentic and spliced images. Conclusion: Although it falls short of specialized models, GPT-4V's generalizability, interpretability, and encyclopedic reasoning make it a promising tool for image forensics. Abstract: Multimodal Large Language Models (MLLMs) like GPT-4V are capable of reasoning across text and image modalities, showing promise in a variety of complex vision-language tasks. In this preliminary study, we investigate the out-of-the-box capabilities of GPT-4V in the domain of image forensics, specifically, in detecting image splicing manipulations. Without any task-specific fine-tuning, we evaluate GPT-4V using three prompting strategies: Zero-Shot (ZS), Few-Shot (FS), and Chain-of-Thought (CoT), applied over a curated subset of the CASIA v2.0 splicing dataset. Our results show that GPT-4V achieves competitive detection performance in zero-shot settings (more than 85% accuracy), with CoT prompting yielding the most balanced trade-off across authentic and spliced images. Qualitative analysis further reveals that the model not only detects low-level visual artifacts but also draws upon real-world contextual knowledge such as object scale, semantic consistency, and architectural facts, to identify implausible composites. While GPT-4V lags behind specialized state-of-the-art splicing detection models, its generalizability, interpretability, and encyclopedic reasoning highlight its potential as a flexible tool in image forensics.

[2] CarboNeXT and CarboFormer: Dual Semantic Segmentation Architectures for Detecting and Quantifying Carbon Dioxide Emissions Using Optical Gas Imaging

Taminul Islam,Toqi Tahamid Sarker,Mohamed G Embaby,Khaled R Ahmed,Amer AbuGhazaleh

Main category: cs.CV

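TL;DR: CarboNeXT is a semantic segmentation framework for Optical Gas Imaging (OGI) that detects and quantifies CO₂ emissions, performs well in low-flow scenarios, and supports real-time monitoring.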

Details Motivation: CO₂ emissions are an important indicator for environmental monitoring and industrial processes such as livestock management, so efficient and accurate detection tools are needed. Method: Combine a multi-scale context aggregation network with UPerHead and auxiliary FCN components to form the CarboNeXT framework, and contribute two new datasets (CCR and RTA). Result: CarboNeXT reaches 88.46% mIoU on CCR and 92.95% mIoU on RTA at 60.95 FPS; the lightweight CarboFormer variant (5.07M parameters, 84.68 FPS) remains competitive and suits resource-constrained devices. Conclusion: The work provides efficient tools for CO₂ emission analysis, particularly for livestock and environmental monitoring. Abstract: Carbon dioxide (CO$_2$) emissions are critical indicators of both environmental impact and various industrial processes, including livestock management. We introduce CarboNeXT, a semantic segmentation framework for Optical Gas Imaging (OGI), designed to detect and quantify CO$_2$ emissions across diverse applications. Our approach integrates a multi-scale context aggregation network with UPerHead and auxiliary FCN components to effectively model both local details and global relationships in gas plume imagery. We contribute two novel datasets: (1) the Controlled Carbon Dioxide Release (CCR) dataset, which simulates gas leaks with systematically varied flow rates (10-100 SCCM), and (2) the Real Time Ankom (RTA) dataset, focusing on emissions from dairy cow rumen fluid in vitro experiments. Extensive evaluations demonstrate that CarboNeXT outperforms state-of-the-art methods, achieving 88.46% mIoU on CCR and 92.95% mIoU on RTA, with particular effectiveness in challenging low-flow scenarios. The model operates at 60.95 FPS, enabling real-time monitoring applications. Additionally, we propose CarboFormer, a lightweight variant with only 5.07M parameters that achieves 84.68 FPS, with competitive performance of 84.88% mIoU on CCR and 92.98% on RTA, making it suitable for resource-constrained platforms such as programmable drones. Our work advances both environmental sensing and precision livestock management by providing robust tools for CO$_2$ emission analysis, with a specific focus on livestock applications.
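The mIoU figures quoted for CarboNeXT and CarboFormer are mean intersection-over-union scores. As a reminder of how that metric is usually computed, here is a minimal sketch on flat label lists; the function name `mean_iou` and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration, not details taken from the paper.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that occur in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # a class absent from both is skipped (assumed convention)
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: class 0 = background, class 1 = gas plume (hypothetical labels).
pred = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 0, 0]
print(mean_iou(pred, target, num_classes=2))
```

In this toy case each class has 2 pixels in the intersection and 4 in the union, so both per-class IoUs and their mean are 0.5.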

[3] Scalable Generation of Spatial Transcriptomics from Histology Images via Whole-Slide Flow Matching

Tinglin Huang,Tianyu Liu,Mehrtash Babadi,Wengong Jin,Rex Ying

Main category: cs.CV

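TL;DR: STFlow is a flow-matching generative model that accounts for cell-cell interaction by modeling the joint distribution of gene expression over an entire slide, addressing the memory limits and independent per-spot prediction of existing methods.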

Details Motivation: Spatial transcriptomics (ST) is limited by low throughput and the need for specialized experimental facilities, and existing prediction methods neither model cell-cell interaction effectively nor use memory efficiently. Method: STFlow uses a flow-matching generative model together with an efficient slide-level encoder with local spatial attention, enabling whole-slide processing. Result: On the HEST-1k and STImage-1K4M benchmarks, STFlow substantially outperforms existing methods, with over 18% relative improvement over pathology foundation models. Conclusion: By improving both modeling and memory efficiency, STFlow offers a better solution for spatial transcriptomics prediction. Abstract: Spatial transcriptomics (ST) has emerged as a powerful technology for bridging histology imaging with gene expression profiling. However, its application has been limited by low throughput and the need for specialized experimental facilities. Prior works sought to predict ST from whole-slide histology images to accelerate this process, but they suffer from two major limitations. First, they do not explicitly model cell-cell interaction as they factorize the joint distribution of whole-slide ST data and predict the gene expression of each spot independently. Second, their encoders struggle with memory constraints due to the large number of spots (often exceeding 10,000) in typical ST datasets. Herein, we propose STFlow, a flow matching generative model that considers cell-cell interaction by modeling the joint distribution of gene expression of an entire slide. It also employs an efficient slide-level encoder with local spatial attention, enabling whole-slide processing without excessive memory overhead. On the recently curated HEST-1k and STImage-1K4M benchmarks, STFlow substantially outperforms state-of-the-art baselines and achieves over 18% relative improvements over the pathology foundation models.
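For readers unfamiliar with flow matching, the sketch below shows one common training target: a straight-line path between a noise sample `x0` and a data sample `x1`, with the velocity `x1 - x0` as the regression target. The abstract does not describe STFlow's actual path or parameterization, so the linear path and the helper names here are assumptions for illustration only.

```python
def flow_matching_pair(x0, x1, t):
    """Point on the linear path x_t = (1 - t) * x0 + t * x1 and its target velocity x1 - x0."""
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def fm_loss(predicted_velocity, target_velocity):
    """Mean squared error between a model's predicted velocity and the target velocity."""
    return sum((p - u) ** 2 for p, u in zip(predicted_velocity, target_velocity)) / len(target_velocity)


# x0 plays the role of noise, x1 of a gene-expression vector (toy values).
x_t, v = flow_matching_pair([0.0, 0.0], [2.0, 4.0], t=0.25)
print(x_t, v, fm_loss(v, v))
```

A model trained with this loss would take `x_t`, `t`, and the slide-level conditioning as inputs; at sampling time the learned velocity field is integrated from noise to data.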

[4] Seed Selection for Human-Oriented Image Reconstruction via Guided Diffusion

Yui Tatsumi,Ziyue Zeng,Hiroshi Watanabe

Main category: cs.CV

TL;DR: Proposes a seed selection method that picks the best seed from multiple candidates to improve image quality without increasing the bitrate.

Details Motivation: Conventional methods must transmit additional information to achieve scalability; a recent diffusion-based method avoids this, but its use of a single random seed can lead to suboptimal image quality. Method: Select the best seed among several candidates based on intermediate outputs from the early steps of the reverse diffusion process, which keeps the computational cost down. Result: Experiments show the method outperforms the baseline across multiple metrics. Conclusion: The method improves image quality without increasing the bitrate. Abstract: Conventional methods for scalable image coding for humans and machines require the transmission of additional information to achieve scalability. A recent diffusion-based method avoids this by generating human-oriented images from machine-oriented images without extra bitrate. This method, however, uses a single random seed, which may lead to suboptimal image quality. In this paper, we propose a seed selection method that identifies the optimal seed from multiple candidates to improve image quality without increasing the bitrate. To reduce computational cost, the selection is performed based on intermediate outputs obtained from early steps of the reverse diffusion process. Experimental results demonstrate that our method outperforms the baseline across multiple metrics.
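The control flow of early-step seed selection can be sketched as follows. Everything concrete here is a placeholder: `early_output` stands in for the first few reverse-diffusion steps, and `score` is a hypothetical quality criterion, since the abstract does not say which criterion the authors use.

```python
import random


def early_output(seed, steps, dim=8):
    """Stand-in for the first `steps` reverse-diffusion steps (hypothetical toy dynamics)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for _ in range(steps):
        x = [0.5 * v for v in x]  # toy denoising update, not a real sampler
    return x


def score(x, reference):
    """Placeholder criterion: higher is better (negative squared error to a reference)."""
    return -sum((a - b) ** 2 for a, b in zip(x, reference))


def select_seed(candidates, reference, early_steps=3):
    """Run only the early steps for every candidate seed and keep the best-scoring one."""
    return max(candidates, key=lambda s: score(early_output(s, early_steps), reference))


reference = [0.0] * 8  # hypothetical signal available at the decoder
best = select_seed(range(5), reference)
print(best)
```

Only the selected seed would then be run through the full reverse process, so the extra cost is limited to a few early steps per candidate.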

[5] Text2Stereo: Repurposing Stable Diffusion for Stereo Generation with Consistency Rewards

Aakash Garg,Libing Zeng,Andrii Tsarov,Nima Khademi Kalantari

Main category: cs.CV

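TL;DR: Proposes a diffusion-based text-to-stereo image generation method that builds on the priors of Stable Diffusion and improves quality through fine-tuning with consistency rewards.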

Details Motivation: Stereo image datasets with large baselines are scarce, so training a diffusion model from scratch is not feasible; an existing model is fine-tuned instead. Method: Leverage the priors of Stable Diffusion, fine-tune on stereo image datasets, and further tune the model with prompt alignment and stereo consistency reward functions. Result: Experiments show the method generates high-quality stereo images across diverse scenarios and outperforms existing methods. Conclusion: The method addresses the challenges of stereo image generation and shows the potential of diffusion models for this task. Abstract: In this paper, we propose a novel diffusion-based approach to generate stereo images given a text prompt. Since stereo image datasets with large baselines are scarce, training a diffusion model from scratch is not feasible. Therefore, we propose leveraging the strong priors learned by Stable Diffusion and fine-tuning it on stereo image datasets to adapt it to the task of stereo generation. To improve stereo consistency and text-to-image alignment, we further tune the model using prompt alignment and our proposed stereo consistency reward functions. Comprehensive experiments demonstrate the superiority of our approach in generating high-quality stereo images across diverse scenarios, outperforming existing methods.

[6] Speaking images. A novel framework for the automated self-description of artworks

Valentine Bernasconi,Gustavo Marfia

Main category: cs.CV

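TL;DR: Proposes a new framework that uses generative AI to automatically produce self-explaining videos from digitized artworks, and discusses the potential of AI in cultural heritage along with its cultural biases.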

Details Motivation: Breakthroughs in generative AI open new research perspectives in art and cultural heritage, where innovation is needed to ease access to digitized collections and highlight their content. Method: A framework built on open-source large language models, face detection, text-to-speech, and audio-to-animation models that automatically generates a self-explaining video from a digitized artwork. Result: Animated videos are generated automatically from artworks, and the process raises questions about cultural biases in AI models, the potential of digital images, and concerns within art history. Conclusion: The framework offers a new approach to the digital presentation of cultural heritage, while the cultural biases of AI and the ethical concerns of art history need attention. Abstract: Recent breakthroughs in generative AI have opened the door to new research perspectives in the domain of art and cultural heritage, where a large number of artifacts have been digitized. There is a need for innovation to ease the access and highlight the content of digital collections. Such innovations develop into creative explorations of the digital image in relation to its malleability and contemporary interpretation, in confrontation to the original historical object. Based on the concept of the autonomous image, we propose a new framework towards the production of self-explaining cultural artifacts using open-source large-language, face detection, text-to-speech and audio-to-animation models. The goal is to start from a digitized artwork and to automatically assemble a short video of the latter where the main character animates to explain its content. The whole process questions cultural biases encapsulated in large-language models, the potential of digital images and deepfakes of artworks for educational purposes, along with concerns of the field of art history regarding such creative diversions.

[7] MR.NAVI: Mixed-Reality Navigation Assistant for the Visually Impaired

Nicolas Pfitzer,Yifan Zhou,Marco Poggensee,Defne Kurtulus,Bessie Dominguez-Dager,Mihai Dusmanu,Marc Pollefeys,Zuria Bauer

Main category: cs.CV

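TL;DR: MR.NAVI is a mixed-reality system that improves spatial awareness for visually impaired users through real-time scene understanding and intuitive audio feedback.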

Details Motivation: More than 43 million people worldwide live with severe visual impairment and face major challenges when navigating unfamiliar environments. Method: The system combines computer vision algorithms (MobileNet object detection, RANSAC-based floor detection, DBSCAN clustering) with natural language processing to provide scene descriptions, collision avoidance, and navigation instructions, and integrates public transit APIs. Result: User studies show promising usability and effectiveness in unfamiliar environments. Conclusion: MR.NAVI offers a practical navigation solution for visually impaired users. Abstract: Over 43 million people worldwide live with severe visual impairment, facing significant challenges in navigating unfamiliar environments. We present MR.NAVI, a mixed reality system that enhances spatial awareness for visually impaired users through real-time scene understanding and intuitive audio feedback. Our system combines computer vision algorithms for object detection and depth estimation with natural language processing to provide contextual scene descriptions, proactive collision avoidance, and navigation instructions. The distributed architecture processes sensor data through MobileNet for object detection and employs RANSAC-based floor detection with DBSCAN clustering for obstacle avoidance. Integration with public transit APIs enables navigation with public transportation directions. Through our experiments with user studies, we evaluated both scene description and navigation functionalities in unfamiliar environments, showing promising usability and effectiveness.
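RANSAC-based floor detection is typically a plane fit on a depth point cloud. The sketch below is a generic, standard-library RANSAC plane fit on synthetic data, not the MR.NAVI implementation; the thresholds, iteration count, and synthetic scene are assumptions, and the DBSCAN clustering of the remaining obstacle points is left out.

```python
import random


def plane_through(p, q, r):
    """Unit normal n and offset d (n·x + d = 0) of the plane through three points; None if collinear."""
    u = [q[i] - p[i] for i in range(3)]
    v = [r[i] - p[i] for i in range(3)]
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]
    length = sum(c * c for c in n) ** 0.5
    if length < 1e-9:
        return None
    n = [c / length for c in n]
    return n, -sum(n[i] * p[i] for i in range(3))


def ransac_plane(points, iterations=200, threshold=0.02, seed=0):
    """Return the plane with the most inliers among randomly sampled 3-point hypotheses."""
    rng = random.Random(seed)
    best_plane, best_inliers = None, []
    for _ in range(iterations):
        plane = plane_through(*rng.sample(points, 3))
        if plane is None:
            continue  # degenerate (collinear) sample
        n, d = plane
        inliers = [pt for pt in points
                   if abs(n[0] * pt[0] + n[1] * pt[1] + n[2] * pt[2] + d) < threshold]
        if len(inliers) > len(best_inliers):
            best_plane, best_inliers = plane, inliers
    return best_plane, best_inliers


# Synthetic scene: a flat 10x10 floor grid at z = 0 plus 20 obstacle points above it.
rng = random.Random(1)
floor = [(0.1 * i, 0.1 * j, 0.0) for i in range(10) for j in range(10)]
obstacles = [(rng.uniform(0, 1), rng.uniform(0, 1), rng.uniform(0.3, 1.5)) for _ in range(20)]
(normal, offset), floor_points = ransac_plane(floor + obstacles)
obstacle_points = [pt for pt in floor + obstacles if pt not in floor_points]
print(len(floor_points), len(obstacle_points))
```

Points that are not floor inliers are the natural input to a clustering step such as DBSCAN, which would group them into individual obstacles.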

[8] DVD: A Comprehensive Dataset for Advancing Violence Detection in Real-World Scenarios

Dimitrios Kollias,Damith C. Senadeera,Jianian Zheng,Kaushal K. K. Yadav,Greg Slabaugh,Muhammad Awais,Xiaoyun Yang

Main category: cs.CV

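TL;DR: Introduces DVD, a large-scale violence detection database with frame-level annotations that addresses the coarse annotation, small scale, and limited diversity of existing databases.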

Details Motivation: Existing violence detection databases suffer from coarse annotations, small scale, and limited diversity, which restricts model generalization. Method: Introduce the DVD database, with 500 videos and 2.7M frames covering diverse environments, varying lighting conditions, multiple camera sources, complex social interactions, and rich metadata. Result: DVD better captures the complexity of real-world violent events. Conclusion: DVD provides higher-quality data for violence detection research. Abstract: Violence Detection (VD) has become an increasingly vital area of research. Existing automated VD efforts are hindered by the limited availability of diverse, well-annotated databases. Existing databases suffer from coarse video-level annotations, limited scale and diversity, and lack of metadata, restricting the generalization of models. To address these challenges, we introduce DVD, a large-scale (500 videos, 2.7M frames), frame-level annotated VD database with diverse environments, varying lighting conditions, multiple camera sources, complex social interactions, and rich metadata. DVD is designed to capture the complexities of real-world violent events.

[9] State Estimation and Control of Dynamic Systems from High-Dimensional Image Data

Ashik E Rasul,Hyung-Jin Yoon

Main category: cs.CV

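TL;DR: Proposes a neural architecture combining CNNs and GRUs to learn state representations from image sequences and actions, then trains a reinforcement learning agent with DQN, achieving accurate real-time estimation and control without access to ground-truth states.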

Details Motivation: Accurate state estimation is critical for optimal policy design in dynamic systems, but obtaining the true state is often impractical, which complicates policy learning. Method: Use a CNN to extract spatial features and a GRU to model temporal information, then train a reinforcement learning agent with DQN on the learned states. Result: Experiments show the method estimates states accurately in real time and achieves control without ground-truth state data. Conclusion: The method addresses the state estimation problem, and a quantitative evaluation shows how the accuracy of the learned states affects policy performance and control stability. Abstract: Accurate state estimation is critical for optimal policy design in dynamic systems. However, obtaining true system states is often impractical or infeasible, complicating the policy learning process. This paper introduces a novel neural architecture that integrates spatial feature extraction using convolutional neural networks (CNNs) and temporal modeling through gated recurrent units (GRUs), enabling effective state representation from sequences of images and corresponding actions. These learned state representations are used to train a reinforcement learning agent with a Deep Q-Network (DQN). Experimental results demonstrate that our proposed approach enables real-time, accurate estimation and control without direct access to ground-truth states. Additionally, we provide a quantitative evaluation methodology for assessing the accuracy of the learned states, highlighting their impact on policy performance and control stability.
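The DQN stage trains against temporal-difference targets computed from the learned state representation. The sketch below shows the standard one-step target, y = r + γ · max Q(s', a'), with no bootstrapping on terminal transitions; the function name and the toy numbers are illustrative, and the CNN and GRU encoder that would produce the states is not modeled here.

```python
def dqn_targets(rewards, next_q_values, dones, gamma=0.99):
    """One-step TD targets: r + gamma * max_a' Q(s', a'), or just r on terminal steps."""
    return [r + (0.0 if done else gamma * max(q))
            for r, q, done in zip(rewards, next_q_values, dones)]


# Two transitions with two actions each; the second transition ends the episode.
targets = dqn_targets(
    rewards=[1.0, 0.0],
    next_q_values=[[0.5, 2.0], [1.0, 3.0]],
    dones=[False, True],
    gamma=0.9,
)
print(targets)
```

In the first transition the target is 1.0 + 0.9 × 2.0; in the second it is the reward alone because the episode has terminated.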

[10] An Independent Discriminant Network Towards Identification of Counterfeit Images and Videos

Shayantani Kar,B. Shresth Bhimrajka,Aditya Kumar,Sahil Gupta,Sourav Ghosh,Subhamita Mukherjee,Shauvik Paul

Main category: cs.CV

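TL;DR: Proposes a discriminant network based on InceptionResNetV2 for detecting GAN-generated counterfeit images and videos, aimed at the spread of false content on online platforms.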

Details Motivation: The rapid spread of false images and videos is an emerging problem; such content can be used to hide evidence of crime, so effective detection methods are needed. Method: Build an independent discriminant network from a convolutional neural network based on InceptionResNetV2, and develop a platform where users can check for forged content. Result: The proposed discriminant network can identify GAN-generated images and videos. Conclusion: The work offers the forensics domain a potential tool for detecting counterfeit content and helping to identify criminal activity. Abstract: Rapid spread of false images and videos on online platforms is an emerging problem. Anyone may add, delete, clone or modify people and entities from an image using various editing software which are readily available. This generates false and misleading proof to hide the crime. Nowadays, these false and counterfeit images and videos are flooding the internet. These spread false information. Many methods are available in literature for detecting those counterfeit contents but new methods of counterfeiting are also evolving. Generative Adversarial Networks (GAN) are observed to be one effective method as it modifies the context and definition of images producing plausible results via image-to-image translation. This work uses an independent discriminant network that can identify GAN generated image or video. A discriminant network has been created using a convolutional neural network based on InceptionResNetV2. The article also proposes a platform where users can detect forged images and videos. This proposed work has the potential to help the forensics domain to detect counterfeit videos and hidden criminal evidence towards the identification of criminal activities.

[11] A Compendium of Autonomous Navigation using Object Detection and Tracking in Unmanned Aerial Vehicles

Mohit Arora,Pratyush Shukla,Shivali Chopra

Main category: cs.CV

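TL;DR: Reviews autonomous navigation techniques for unmanned aerial vehicles (UAVs), focusing on real-time object detection and tracking with computer vision algorithms and their applications in several fields.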

Details Motivation: UAVs play an important role in national security and surveillance, but their operation faces challenges such as signal quality and real-time processing; autonomous navigation through computer vision is key to addressing them. Method: Use computer vision and deep learning algorithms for object detection and tracking to enable real-time autonomous navigation of UAVs. Result: The paper summarizes algorithms proposed by various authors and shows their potential in areas such as disaster management and dense area exploration. Conclusion: Computer vision algorithms are an effective route to autonomous UAV navigation and can be further optimized for more complex scenarios. Abstract: Unmanned Aerial Vehicles (UAVs) are one of the most revolutionary inventions of the 21st century. At the core of a UAV lies the central processing system that uses wireless signals to control their movement. The most popular UAVs are quadcopters that use a set of four motors, arranged as two on either side with opposite spin. An autonomous UAV is called a drone. Drones have been in service in the US army since the 90's for covert missions critical to national security. It would not be wrong to claim that drones make up an integral part of the national security and provide the most valuable service during surveillance operations. While UAVs are controlled using wireless signals, there reside some challenges that disrupt the operation of such vehicles such as signal quality and range, real time processing, human expertise, robust hardware and data security. These challenges can be solved by programming UAVs to be autonomous, using object detection and tracking, through Computer Vision algorithms. Computer Vision is an interdisciplinary field that seeks the use of deep learning to gain a high-level understanding of digital images and videos for the purpose of automating the task of human visual system. Using computer vision, algorithms for detecting and tracking various objects can be developed suitable to the hardware so as to allow real time processing for immediate judgement. This paper attempts to review the various approaches several authors have proposed for the purpose of autonomous navigation of UAVs through various algorithms of object detection and tracking in real time, for the purpose of applications in various fields such as disaster management, dense area exploration, traffic vehicle surveillance etc.

[12] Can Vision Transformers with ResNet's Global Features Fairly Authenticate Demographic Faces?

Abu Sufian,Marco Leo,Cosimo Distante,Anirudha Ghosh,Debaditya Barman

Main category: cs.CV

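TL;DR: Studies the fairness and generalization of Vision Transformer (ViT) and ResNet features in biometric face authentication, proposes a new few-shot prototype network, and tests it on several demographic datasets.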

Details Motivation: Address fairness and generalization across demographic groups in biometric face authentication. Method: Combine global features from ViT and ResNet, design a few-shot prototype network, and test it on support and query sets. Result: The Microsoft Swin Transformer performs best among the three ViTs, and performance is assessed as the support set grows from one to five shots. Conclusion: The combined ViT and ResNet approach performs well on fairness and generalization; code and data are publicly available. Abstract: Biometric face authentication is crucial in computer vision, but ensuring fairness and generalization across demographic groups remains a big challenge. Therefore, we investigated whether Vision Transformer (ViT) and ResNet, leveraging pre-trained global features, can fairly authenticate different demographic faces while relying minimally on local features. In this investigation, we used three pre-trained state-of-the-art (SOTA) ViT foundation models from Facebook, Google, and Microsoft for global features as well as ResNet-18. We concatenated the features from ViT and ResNet, passed them through two fully connected layers, and trained on customized face image datasets to capture the local features. Then, we designed a novel few-shot prototype network with backbone features embedding. We also developed new demographic face image support and query datasets for this empirical study. The network's testing was conducted on this dataset in one-shot, three-shot, and five-shot scenarios to assess how performance improves as the size of the support set increases. We observed results across datasets with varying races/ethnicities, genders, and age groups. The Microsoft Swin Transformer backbone performed better among the three SOTA ViT for this task. The code and data are available at: https://github.com/Sufianlab/FairVitBio.
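A few-shot prototype network classifies a query by comparing its embedding with one prototype per identity, where each prototype is the mean of that identity's support embeddings. The sketch below uses squared Euclidean distance, which is the usual choice for prototypical networks but is an assumption here, as are the function names and toy embeddings; in the paper the embeddings would come from the concatenated ViT and ResNet features.

```python
def build_prototypes(support):
    """support maps label -> list of embedding vectors; returns label -> mean embedding."""
    return {label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in support.items()}


def classify(query, prototypes):
    """Assign the query to the label whose prototype is nearest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(prototypes, key=lambda label: sq_dist(query, prototypes[label]))


# Two identities with two support embeddings each (a 2-shot episode with toy 2-D embeddings).
support = {"person_a": [[0.0, 0.0], [0.0, 2.0]],
           "person_b": [[4.0, 4.0], [6.0, 4.0]]}
prototypes = build_prototypes(support)
print(classify([1.0, 1.0], prototypes), classify([4.0, 5.0], prototypes))
```

Moving from one-shot to five-shot simply means averaging more support embeddings per identity, which tends to make each prototype less sensitive to a single unrepresentative image.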

[13] Q-Ponder: A Unified Training Pipeline for Reasoning-based Visual Quality Assessment

Zhuoxuan Cai,Jian Zhang,Xinbin Yuan,Pengtao Jiang,Wenxiang Chen,Bowen Tang,Lujian Yao,Qiyuan Wang,Jinwen Chen,Bo Li

Main category: cs.CV

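TL;DR: Proposes a unified two-stage training framework that resolves the split between scoring and reasoning tasks in MLLM-based visual quality assessment and substantially improves performance.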

Details Motivation: Existing methods optimize quality scoring and reasoning descriptions separately, so models end up strong at one and weak at the other, which limits their potential. Method: Two-stage training: a cold-start stage distills data from a teacher model through expert-designed prompts, and a reinforcement learning fine-tuning stage introduces a new reward with GRPO to jointly optimize scoring accuracy and reasoning consistency. Result: Q-Ponder reaches SOTA performance on quality score regression benchmarks, with up to 6.5% higher SRCC on cross-domain datasets, and significantly outperforms its teacher model in description accuracy and reasonableness. Conclusion: The unified framework resolves the separation of scoring and reasoning tasks and shows potential to generalize across tasks. Abstract: Recent studies demonstrate that multimodal large language models (MLLMs) can proficiently evaluate visual quality through interpretable assessments. However, existing approaches typically treat quality scoring and reasoning descriptions as separate tasks with disjoint optimization objectives, leading to a trade-off: models adept at quality reasoning descriptions struggle with precise score regression, while score-focused models lack interpretability. This limitation hinders the full potential of MLLMs in visual quality assessment, where accuracy and interpretability should be mutually reinforcing. To address this, we propose a unified two-stage training framework comprising a cold-start stage and a reinforcement learning-based fine-tuning stage. Specifically, in the first stage, we distill high-quality data from a teacher model through expert-designed prompts, initializing reasoning capabilities via cross-entropy loss supervision. In the second stage, we introduce a novel reward with Group Relative Policy Optimization (GRPO) to jointly optimize scoring accuracy and reasoning consistency. We designate the models derived from these two stages as Q-Ponder-CI and Q-Ponder. Extensive experiments show that Q-Ponder achieves state-of-the-art (SOTA) performance on quality score regression benchmarks, delivering up to 6.5% higher SRCC on cross-domain datasets. Furthermore, Q-Ponder significantly outperforms description-based SOTA models, including its teacher model Qwen-2.5-VL-72B, particularly in description accuracy and reasonableness, demonstrating the generalization potential over diverse tasks.
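GRPO replaces a learned value baseline with a group-relative one: several responses are sampled for the same input, and each response's reward is standardized against the group. The sketch below shows only that normalization step. The reward values are made up, the population standard deviation and `eps` are implementation assumptions, and the paper's actual reward (which combines scoring accuracy and reasoning consistency) is not specified in the abstract.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the mean and standard deviation of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for three responses sampled for the same image (hypothetical values).
advantages = group_relative_advantages([1.0, 2.0, 3.0])
print(advantages)
```

Responses that score above the group mean receive positive advantages and are reinforced; when every response in a group gets the same reward, all advantages are zero and that group contributes no gradient signal.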

[14] TriPSS: A Tri-Modal Keyframe Extraction Framework Using Perceptual, Structural, and Semantic Representations

Mert Can Cakmak,Nitin Agarwal,Diwash Poudel

Main category: cs.CV

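TL;DR: TriPSS is a new tri-modal framework that fuses color features, deep structural embeddings, and semantic context for efficient video keyframe extraction, outperforming existing methods.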

Details Motivation: Keyframe extraction is critical for video summarization and retrieval, but existing methods struggle to capture the full richness of video content. Method: TriPSS combines color features in the CIELAB space, deep embeddings from ResNet-50, and semantic context from captions generated by Llama-3.2-11B-Vision-Instruct, fuses the multi-modal embeddings with PCA, and applies HDBSCAN clustering followed by a refinement stage. Result: On the TVSum20 and SumMe datasets, TriPSS outperforms traditional unimodal and earlier multi-modal methods and reaches state-of-the-art performance. Conclusion: By capturing nuanced visual and semantic information, TriPSS sets a new benchmark for video content understanding at scale. Abstract: Efficient keyframe extraction is critical for effective video summarization and retrieval, yet capturing the complete richness of video content remains challenging. In this work, we present TriPSS, a novel tri-modal framework that effectively integrates perceptual cues from color features in the CIELAB space, deep structural embeddings derived from ResNet-50, and semantic context from frame-level captions generated by Llama-3.2-11B-Vision-Instruct. By fusing these diverse modalities using principal component analysis, TriPSS constructs robust multi-modal embeddings that enable adaptive segmentation of video content via HDBSCAN clustering. A subsequent refinement stage incorporating quality assessment and duplicate filtering ensures that the final keyframe set is both concise and semantically rich. Comprehensive evaluations on benchmark datasets TVSum20 and SumMe demonstrate that TriPSS achieves state-of-the-art performance, substantially outperforming traditional unimodal and previous multi-modal methods. These results underscore TriPSS's ability to capture nuanced visual and semantic information, thereby setting a new benchmark for video content understanding in large-scale retrieval scenarios.

[15] Talk2SAM: Text-Guided Semantic Enhancement for Complex-Shaped Object Segmentation

Luka Vetoshkin,Dmitry Yudin

Main category: cs.CV

TL;DR: Talk2SAM improves segmentation of complex-shaped objects by adding text prompts, with clear gains in segmentation quality.

Details Motivation: Current segmentation models such as SAM and SAM-HQ perform poorly on complex-shaped objects with thin structures or fine boundaries, so a more flexible approach is needed. Method: Use CLIP embeddings of text prompts to produce semantic region features, project them into the DINO feature space, and pass them to SAM-HQ as additional prompts. Result: On the BIG, ThinObject5K, and DIS5K benchmarks, Talk2SAM improves on SAM-HQ by up to 5.9% IoU and 8.3% boundary IoU. Conclusion: Natural language guidance gives Talk2SAM a flexible and effective way to segment objects, especially where traditional prompting fails. Abstract: Segmenting objects with complex shapes, such as wires, bicycles, or structural grids, remains a significant challenge for current segmentation models, including the Segment Anything Model (SAM) and its high-quality variant SAM-HQ. These models often struggle with thin structures and fine boundaries, leading to poor segmentation quality. We propose Talk2SAM, a novel approach that integrates textual guidance to improve segmentation of such challenging objects. The method uses CLIP-based embeddings derived from user-provided text prompts to identify relevant semantic regions, which are then projected into the DINO feature space. These features serve as additional prompts for SAM-HQ, enhancing its ability to focus on the target object. Beyond improving segmentation accuracy, Talk2SAM allows user-controllable segmentation, enabling disambiguation of objects within a single bounding box based on textual input. We evaluate our approach on three benchmarks: BIG, ThinObject5K, and DIS5K. Talk2SAM consistently outperforms SAM-HQ, achieving up to +5.9% IoU and +8.3% boundary IoU improvements. Our results demonstrate that incorporating natural language guidance provides a flexible and effective means for precise object segmentation, particularly in cases where traditional prompt-based methods fail. The source code is available on GitHub: https://github.com/richlukich/Talk2SAM
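One simple way to picture "identifying relevant semantic regions" from a text prompt is to compare the text embedding with per-patch image embeddings and keep the patches that match well. The sketch below does this with cosine similarity and a fixed threshold. It is an illustrative stand-in rather than Talk2SAM's actual mechanism: the threshold, the function names, and the toy 2-D embeddings are all assumptions, and the projection into DINO feature space is not modeled.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def relevant_patches(text_embedding, patch_embeddings, threshold=0.5):
    """Indices of patches whose embedding is similar enough to the text embedding."""
    return [i for i, patch in enumerate(patch_embeddings)
            if cosine(text_embedding, patch) >= threshold]


text_embedding = [1.0, 0.0]                            # e.g. the prompt "bicycle" (toy vector)
patch_embeddings = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
print(relevant_patches(text_embedding, patch_embeddings))
```

The selected patch indices could then be turned into point or region prompts for a segmentation model, which is how text guidance can disambiguate several objects inside one bounding box.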

[16] Attention-based transformer models for image captioning across languages: An in-depth survey and evaluation

Israa A. Albadarneh,Bassam H. Hammo,Omar S. Al-Kadi

Main category: cs.CV

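TL;DR: Surveys attention-based image captioning models, focusing on transformer models in multilingual settings, and discusses the limitations of current models and future research directions.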

Details Motivation: Fill the gap in the literature on multilingual attention-based models and give researchers a comprehensive reference. Method: Categorize approaches into transformer-based, deep learning-based, and hybrid, and review benchmark datasets and evaluation metrics such as BLEU and METEOR. Result: Identifies limitations of current models, including semantic inconsistencies, scarce non-English data, and limited reasoning ability. Conclusion: Outlines future research directions, including multimodal learning and real-world applications such as AI assistants and healthcare, to guide the field. Abstract: Image captioning involves generating textual descriptions from input images, bridging the gap between computer vision and natural language processing. Recent advancements in transformer-based models have significantly improved caption generation by leveraging attention mechanisms for better scene understanding. While various surveys have explored deep learning-based approaches for image captioning, few have comprehensively analyzed attention-based transformer models across multiple languages. This survey reviews attention-based image captioning models, categorizing them into transformer-based, deep learning-based, and hybrid approaches. It explores benchmark datasets, discusses evaluation metrics such as BLEU, METEOR, CIDEr, and ROUGE, and highlights challenges in multilingual captioning. Additionally, this paper identifies key limitations in current models, including semantic inconsistencies, data scarcity in non-English languages, and limitations in reasoning ability. Finally, we outline future research directions, such as multimodal learning, real-time applications in AI-powered assistants, healthcare, and forensic analysis. This survey serves as a comprehensive reference for researchers aiming to advance the field of attention-based image captioning.

[17] AD-EE: Early Exiting for Fast and Reliable Vision-Language Models in Autonomous Driving

Lianming Huang,Haibo Hu,Yufei Cui,Jiacheng Zuo,Shangyu Wu,Nan Guan,Chun Jason Xue

Main category: cs.CV

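TL;DR: The AD-EE framework uses an early-exit mechanism and causal inference to improve the real-time performance of vision-language models in autonomous driving, reducing latency and improving detection accuracy.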

Details Motivation: Vision-language models in autonomous driving suffer from high latency and computational overhead, which hurts real-time performance. Method: Propose AD-EE, which combines domain characteristics of autonomous driving with causal inference to identify optimal exit layers. Result: On the Waymo and CODA datasets and in real-vehicle tests, latency drops by up to 57.58% and detection accuracy improves by up to 44%. Conclusion: AD-EE addresses the real-time limitations of vision-language models in autonomous driving with substantial performance gains. Abstract: With the rapid advancement of autonomous driving, deploying Vision-Language Models (VLMs) to enhance perception and decision-making has become increasingly common. However, the real-time application of VLMs is hindered by high latency and computational overhead, limiting their effectiveness in time-critical driving scenarios. This challenge is particularly evident when VLMs exhibit over-inference, continuing to process unnecessary layers even after confident predictions have been reached. To address this inefficiency, we propose AD-EE, an Early Exit framework that incorporates domain characteristics of autonomous driving and leverages causal inference to identify optimal exit layers. We evaluate our method on large-scale real-world autonomous driving datasets, including Waymo and the corner-case-focused CODA, as well as on a real vehicle running the Autoware Universe platform. Extensive experiments across multiple VLMs show that our method significantly reduces latency, with maximum improvements reaching up to 57.58%, and enhances object detection accuracy, with maximum gains of up to 44%.
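The general idea of early exiting can be sketched with a confidence rule: attach a prediction head to each candidate layer and stop at the first one whose top softmax probability clears a threshold. Note that this is the textbook confidence-threshold variant, swapped in for illustration; AD-EE itself selects exit layers using causal inference, and the threshold and logits below are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(per_layer_logits, threshold=0.9):
    """Return (predicted_class, exit_layer); falls back to the last layer if none is confident."""
    if not per_layer_logits:
        raise ValueError("need at least one exit head")
    for layer, logits in enumerate(per_layer_logits):
        probs = softmax(logits)
        confidence = max(probs)
        if confidence >= threshold:
            break  # confident enough: skip the remaining layers
    return probs.index(confidence), layer


# Logits from four hypothetical exit heads, growing more confident with depth.
per_layer_logits = [[0.1, 0.2, 0.0], [2.0, 0.0, 0.0], [6.0, 0.0, 0.0], [9.0, 0.0, 0.0]]
print(early_exit_predict(per_layer_logits))
```

In this toy run the first two heads stay below the 0.9 threshold and the third clears it, so the fourth layer is never evaluated; that skipped computation is where the latency savings come from.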

[18] A VLM-based Method for Visual Anomaly Detection in Robotic Scientific Laboratories

Shiwei Lin,Chenxu Wang,Xiaozhen Ding,Yi Wang,Boyuan Du,Lei Song,Chenggang Wang,Huaping Liu

Main category: cs.CV

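TL;DR: Proposes a visual reasoning method based on vision-language models (VLMs) that supports different levels of supervision through four progressively informative prompt configurations, for visual anomaly detection in scientific workflows.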

Details Motivation: In robotic scientific laboratories, visual anomaly detection is essential for identifying and resolving potential faults or deviations in time, and is a key factor in keeping experiments stable and safe. Method: A VLM-based visual reasoning approach with four progressively informative prompt configurations, plus a visual benchmark built for process anomaly detection in scientific workflows. Result: Experiments show detection accuracy improves as more contextual information is provided, confirming the effectiveness and adaptability of the approach; real-world validation also shows that first-person visual observation can identify process-level anomalies. Conclusion: The work provides a data-driven foundation and an evaluation framework for visual anomaly detection in scientific experiment workflows. Abstract: In robot scientific laboratories, visual anomaly detection is important for the timely identification and resolution of potential faults or deviations. It has become a key factor in ensuring the stability and safety of experimental processes. To address this challenge, this paper proposes a VLM-based visual reasoning approach that supports different levels of supervision through four progressively informative prompt configurations. To systematically evaluate its effectiveness, we construct a visual benchmark tailored for process anomaly detection in scientific workflows. Experiments on two representative vision-language models show that detection accuracy improves as more contextual information is provided, confirming the effectiveness and adaptability of the proposed reasoning approach for process anomaly detection in scientific workflows. Furthermore, real-world validations at selected experimental steps confirm that first-person visual observation can effectively identify process-level anomalies. This work provides both a data-driven foundation and an evaluation framework for vision anomaly detection in scientific experiment workflows.

[19] Object-level Self-Distillation for Vision Pretraining

Çağlar Hızlı,Çağatay Yıldız,Pekka Marttinen

Main category: cs.CV

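TL;DR: ODIS improves vision pretraining through object-level self-distillation, addressing the limitations of image-level self-distillation in multi-object scenes.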

Details Motivation: Existing vision pretraining methods assume each image contains a single object, which limits scaling to scene-centric datasets; ODIS aims to improve semantic understanding through object-level self-distillation. Method: ODIS uses object-aware cropping and masked attention to shift the self-distillation granularity from whole images to individual objects, turning a noisy scene-level task into simpler object-level sub-tasks. Result: ODIS achieves 82.6% k-NN accuracy on ImageNet1k with ViT-Large and improves representations at both the image and patch levels. Conclusion: Object-level self-distillation effectively improves vision pretraining, especially in complex scenes. Abstract: State-of-the-art vision pretraining methods rely on image-level self-distillation from object-centric datasets such as ImageNet, implicitly assuming each image contains a single object. This assumption does not always hold: many ImageNet images already contain multiple objects. Further, it limits scalability to scene-centric datasets that better mirror real-world complexity. We address these challenges by introducing Object-level Self-DIStillation (ODIS), a pretraining approach that shifts the self-distillation granularity from whole images to individual objects. Using object-aware cropping and masked attention, ODIS isolates object-specific regions, guiding the transformer toward semantically meaningful content and transforming a noisy, scene-level task into simpler object-level sub-tasks. We show that this approach improves visual representations both at the image and patch levels. Using masks at inference time, our method achieves an impressive $82.6\%$ $k$-NN accuracy on ImageNet1k with ViT-Large.
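Masked attention restricts which patches a token may attend to. A minimal way to see the effect is a softmax over attention scores in which masked positions receive zero weight, so only patches inside the object mask contribute. The sketch below covers a single query and is a generic illustration; how ODIS builds its object masks and wires them into the transformer is not described in the abstract.

```python
import math


def masked_attention_weights(scores, mask):
    """Softmax over the scores where mask is True; masked-out positions get weight 0."""
    kept = [s for s, keep in zip(scores, mask) if keep]
    if not kept:
        raise ValueError("mask must keep at least one position")
    peak = max(kept)
    exps = [math.exp(s - peak) if keep else 0.0 for s, keep in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


# Three patches: the first two belong to the object, the third is background.
weights = masked_attention_weights(scores=[1.0, 1.0, 5.0], mask=[True, True, False])
print(weights)
```

Without the mask, the high-scoring background patch would dominate; with it, the attention mass is split evenly between the two object patches.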

[20] Can Vision Language Models Infer Human Gaze Direction? A Controlled Study

Zory Zhang,Pinyuan Feng,Bingyang Wang,Tianwei Zhao,Suyang Yu,Qingying Gao,Hokin Deng,Ziqiao Ma,Yijiang Li,Dezhi Luo

Main category: cs.CV

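TL;DR: The study evaluates 111 vision language models (VLMs) on gaze inference and finds that most do no better than random guessing, while humans perform near ceiling. The few better-performing models are affected by task difficulty but are fairly robust to perceptual variations.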

Details Motivation: Gaze inference is a key ability in natural human interaction; the study assesses whether VLMs have a comparable ability, with the goal of advancing more natural human-AI interaction. Method: A controlled experiment compares 111 VLMs and 65 human participants on photos with manipulated difficulty and variability, and analyzes behavior with mixed-effects models. Result: 94 of the 111 VLMs failed to do better than random guessing, while humans performed near ceiling. The few better-performing models are affected by task difficulty but are fairly robust to changes in prompts and scene objects. Conclusion: VLMs currently lack gaze inference capability, but the behavior of a few models suggests that improvements could eventually enable more natural human-AI interaction. Abstract: Gaze-referential inference--the ability to infer what others are looking at--is a critical component of a theory of mind that underpins natural human-AI interaction. In a controlled study, we evaluated this skill across 111 Vision Language Models (VLMs) using photos taken with manipulated difficulty and variability, comparing performance with that of human participants (N = 65), and analyzed behaviors using mixed-effects models. We found that 94 of the 111 VLMs failed to do better than random guessing, while humans achieved near-ceiling accuracy. VLMs even respond with each choice almost equally frequently. Are they randomly guessing? Although most VLMs struggle, when we zoom in on five of the top-tier VLMs with above-chance performance, we find that their performance declined with increasing task difficulty but varied only slightly across different prompts and scene objects. These behavioral features cannot be explained by considering them as random guessers. Instead, they likely use a combination of heuristics and guessing such that their performance is subject to the task difficulty but robust to perceptual variations. This suggests that VLMs, lacking gaze inference capability, have yet to become technologies that can naturally interact with humans, but the potential remains.
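
To make "failed to do better than random guessing" concrete, here is a minimal sketch of a one-sided exact binomial test against chance on a multiple-choice task. This is a simpler stand-in, not the mixed-effects analysis used in the study; the function name and the number of answer choices are assumptions chosen for illustration.

```python
import math

def p_value_above_chance(n_correct, n_trials, n_choices):
    """One-sided exact binomial p-value: P(X >= n_correct) under pure guessing.

    Assumes every trial has `n_choices` equally likely options, so the
    chance accuracy is 1 / n_choices (an illustrative simplification).
    """
    p = 1.0 / n_choices
    return sum(
        math.comb(n_trials, k) * p**k * (1.0 - p) ** (n_trials - k)
        for k in range(n_correct, n_trials + 1)
    )

# Hypothetical numbers: 90 correct out of 100 four-choice trials.
if __name__ == "__main__":
    print(p_value_above_chance(90, 100, 4))
```

A small p-value indicates accuracy that guessing alone is unlikely to produce; a model whose p-value stays large is statistically indistinguishable from a random guesser on that test set.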

[21] SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing

Mingfei Chen,Zijun Cui,Xiulong Liu,Jinlin Xiang,Caleb Zheng,Jingyuan Li,Eli Shlizerman

Main category: cs.CV

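TL;DR: SAVVY-Bench is the first benchmark for spatial reasoning in dynamic 3D scenes with synchronized spatial audio. The paper also proposes SAVVY, a training-free reasoning pipeline that substantially improves the performance of AV-LLMs.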

Details Motivation: Existing AV-LLMs and benchmarks focus mainly on static or 2D scenes and leave dynamic 3D scenes with spatial audio largely unexplored. Method: SAVVY has two stages: AV-LLM-based tracking of object trajectories, followed by dynamic global map construction. The final answer is obtained through a coordinate transformation. Result: SAVVY substantially improves the performance of existing AV-LLMs and sets a new standard for dynamic 3D spatial reasoning. Conclusion: SAVVY-Bench and the SAVVY method fill a gap in dynamic 3D spatial reasoning and open a new direction for future research. Abstract: 3D spatial reasoning in dynamic, audio-visual environments is a cornerstone of human cognition yet remains largely unexplored by existing Audio-Visual Large Language Models (AV-LLMs) and benchmarks, which predominantly focus on static or 2D scenes. We introduce SAVVY-Bench, the first benchmark for 3D spatial reasoning in dynamic scenes with synchronized spatial audio. SAVVY-Bench is comprised of thousands of relationships involving static and moving objects, and requires fine-grained temporal grounding, consistent 3D localization, and multi-modal annotation. To tackle this challenge, we propose SAVVY, a novel training-free reasoning pipeline that consists of two stages: (i) Egocentric Spatial Tracks Estimation, which leverages AV-LLMs as well as other audio-visual methods to track the trajectories of key objects related to the query using both visual and spatial audio cues, and (ii) Dynamic Global Map Construction, which aggregates multi-modal queried object trajectories and converts them into a unified global dynamic map. Using the constructed map, a final QA answer is obtained through a coordinate transformation that aligns the global map with the queried viewpoint. Empirical evaluation demonstrates that SAVVY substantially enhances performance of state-of-the-art AV-LLMs, setting a new standard and stage for approaching dynamic 3D spatial reasoning in AV-LLMs.

[22] Better STEP, a format and dataset for boundary representation

Nafiseh Izadyar,Sai Chandra Madduri,Teseo Schneider

Main category: cs.CV

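TL;DR: The paper proposes an open, HDF5-based alternative to the STEP format, removing a barrier to using CAD data in machine learning pipelines, and provides an open-source library and Python package.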

Details Motivation: Existing CAD datasets are stored in STEP format, which requires a CAD kernel to process. The resulting per-node license costs limit their use in large learning pipelines. Method: An open format based on HDF5, with an open-source library and Python package for querying and processing the data, including standard functionality such as sampling, normals, and curvature. Result: The Fusion 360 and ABC datasets were converted, and four standard use cases (normal estimation, denoising, surface reconstruction, and segmentation) were used to verify data integrity and compliance with the original STEP files. Conclusion: The proposed HDF5 format and tooling make CAD data considerably more usable and scalable for machine learning. Abstract: Boundary representation (B-rep) generated from computer-aided design (CAD) is widely used in industry, with several large datasets available. However, the data in these datasets is represented in STEP format, requiring a CAD kernel to read and process it. This dramatically limits their scope and usage in large learning pipelines, as it constrains the possibility of deploying them on computing clusters due to the high cost of per-node licenses. This paper introduces an alternative format based on the open, cross-platform format HDF5 and a corresponding dataset for STEP files, paired with an open-source library to query and process them. Our Python package also provides standard functionalities such as sampling, normals, and curvature to ease integration in existing pipelines. To demonstrate the effectiveness of our format, we converted the Fusion 360 dataset and the ABC dataset. We developed four standard use cases (normal estimation, denoising, surface reconstruction, and segmentation) to assess the integrity of the data and its compliance with the original STEP files.

[23] Self-Predictive Dynamics for Generalization of Vision-based Reinforcement Learning

Kyungsoo Kim,Jeongsoo Ha,Yusung Kim

Main category: cs.CV

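TL;DR: SPD learns task-relevant features through two-way augmentation and significantly improves the generalization of vision-based reinforcement learning to complex and unseen observations.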

Details Motivation: Vision-based reinforcement learning needs efficient and robust image representations, especially when images contain distracting elements (such as shadows, clouds, and light) that were not seen during training. Method: SPD applies weak and strong augmentations in parallel and learns representations by predicting inverse and forward transitions across the two augmented versions. Result: On MuJoCo visual control tasks and a CARLA autonomous driving task, SPD outperforms previous work and significantly improves generalization to unseen observations. Conclusion: SPD efficiently extracts task-relevant features and handles complex and unseen observations, improving vision-based reinforcement learning performance. Abstract: Vision-based reinforcement learning requires efficient and robust representations of image-based observations, especially when the images contain distracting (task-irrelevant) elements such as shadows, clouds, and light. It becomes more important if those distractions are not exposed during training. We design a Self-Predictive Dynamics (SPD) method to extract task-relevant features efficiently, even in unseen observations after training. SPD uses weak and strong augmentations in parallel, and learns representations by predicting inverse and forward transitions across the two-way augmented versions. In a set of MuJoCo visual control tasks and an autonomous driving task (CARLA), SPD outperforms previous studies in complex observations, and significantly improves the generalization performance for unseen observations. Our code is available at https://github.com/unigary/SPD.

[24] Dream to Generalize: Zero-Shot Model-Based Reinforcement Learning for Unseen Visual Distractions

Jeongsoo Ha,Kyungsoo Kim,Yusung Kim

Main category: cs.CV

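TL;DR: Proposes Dr. G, a self-supervised method for zero-shot model-based reinforcement learning that improves robustness to visual distractions through dual contrastive learning and a recurrent state inverse dynamics model.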

Details Motivation: Existing MBRL algorithms degrade when faced with visual distractions, and such task-irrelevant distractions are common in real-world scenarios. Method: The encoder and world model are trained with dual contrastive learning, and a recurrent state inverse dynamics model is introduced to strengthen understanding of temporal structure. Result: On DeepMind Control and Robosuite, Dr. G improves performance over prior work by 117% and 14%, respectively. Conclusion: Dr. G effectively improves the generalization of MBRL under visual distractions and suits complex environments. Abstract: Model-based reinforcement learning (MBRL) has been used to efficiently solve vision-based control tasks in high-dimensional image observations. Although recent MBRL algorithms perform well in trained observations, they fail when faced with visual distractions in observations. These task-irrelevant distractions (e.g., clouds, shadows, and light) may be constantly present in real-world scenarios. In this study, we propose a novel self-supervised method, Dream to Generalize (Dr. G), for zero-shot MBRL. Dr. G trains its encoder and world model with dual contrastive learning which efficiently captures task-relevant features among multi-view data augmentations. We also introduce a recurrent state inverse dynamics model that helps the world model to better understand the temporal structure. The proposed methods can enhance the robustness of the world model against visual distractions. To evaluate the generalization performance, we first train Dr. G on simple backgrounds and then test it on complex natural video backgrounds in the DeepMind Control suite, and the randomizing environments in Robosuite. Dr. G yields a performance improvement of 117% and 14% over prior works, respectively. Our code is open-sourced and available at https://github.com/JeongsooHa/DrG.git

[25] Self-supervised One-Stage Learning for RF-based Multi-Person Pose Estimation

Seunghwan Shin,Yusung Kim

Main category: cs.CV

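TL;DR: Proposes an efficient, lightweight one-stage MPPE model that works on raw RF signals. It sub-groups the signals and embeds them with a shared single-layer CNN followed by multi-head attention, outperforms existing methods, and introduces a new self-supervised learning method.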

Details Motivation: Existing RF-based MPPE methods either rely on complex, time-consuming preprocessing or suffer from low accuracy and poor generalization. Method: RF signals are sub-grouped and embedded with a shared single-layer CNN followed by multi-head attention. A self-supervised learning method is also proposed that predicts the latent representations of masked data. Result: MPPE accuracy improves by up to 15 in PCKh@0.5, and the self-supervised method significantly improves performance in new locations or with obstacles in front of the antennas. Conclusion: The model is efficient and lightweight with strong performance, self-supervised learning improves generalization, and the code and dataset are open-sourced. Abstract: In the field of Multi-Person Pose Estimation (MPPE), Radio Frequency (RF)-based methods can operate effectively regardless of lighting conditions and obscured line-of-sight situations. Existing RF-based MPPE methods typically involve either 1) converting RF signals into heatmap images through complex preprocessing, or 2) applying a deep embedding network directly to raw RF signals. The first approach, while delivering decent performance, is computationally intensive and time-consuming. The second method, though simpler in preprocessing, results in lower MPPE accuracy and generalization performance. This paper proposes an efficient and lightweight one-stage MPPE model based on raw RF signals. By sub-grouping RF signals and embedding them using a shared single-layer CNN followed by multi-head attention, this model outperforms previous methods that embed all signals at once through a large and deep CNN. Additionally, we propose a new self-supervised learning (SSL) method that takes inputs from both one unmasked subgroup and the remaining masked subgroups to predict the latent representations of the masked data. Empirical results demonstrate that our model improves MPPE accuracy by up to 15 in PCKh@0.5 compared to previous methods using raw RF signals. In particular, the proposed SSL method has been shown to significantly enhance performance when placed in new locations or in front of obstacles at RF antennas, contributing to greater performance gains as the number of people increases. Our code and dataset are available on GitHub: https://github.com/sshnan7/SOSPE.
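
For readers unfamiliar with the PCKh@0.5 metric quoted above, the following is a minimal sketch of its usual definition: a predicted keypoint counts as correct when it lies within 0.5 of the person's head-segment length from the ground truth. The function name and data layout are assumptions for illustration, and the paper's exact evaluation protocol may differ.

```python
import math

def pckh(pred, gt, head_sizes, alpha=0.5):
    """Fraction of keypoints within alpha * head size of the ground truth.

    pred, gt: one list per person, each a list of (x, y) keypoints.
    head_sizes: head-segment length for each person, in the same units.
    The layout is a hypothetical one chosen for this sketch.
    """
    correct = 0
    total = 0
    for person_pred, person_gt, head in zip(pred, gt, head_sizes):
        for p, g in zip(person_pred, person_gt):
            total += 1
            if math.dist(p, g) <= alpha * head:
                correct += 1
    return correct / total if total else 0.0

if __name__ == "__main__":
    # One person, two keypoints, head size 10: the first is exact, the second is 10 units off.
    print(pckh([[(0, 0), (10, 0)]], [[(0, 0), (0, 0)]], [10.0]))
```

Normalizing by head size makes the threshold scale with how large each person appears, so the score is comparable across subjects at different distances.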

[26] SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning

Fanqi Kong,Weiqin Zu,Xinyu Chen,Yaodong Yang,Song-Chun Zhu,Xue Feng

Main category: cs.CV

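TL;DR: SIV-Bench is a new video benchmark for evaluating multimodal large language models (MLLMs) on Social Scene Understanding (SSU), Social State Reasoning (SSR), and Social Dynamics Prediction (SDP).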

Details Motivation: The complexity and multimodal nature of human social interaction challenge artificial intelligence, and new evaluation tools are needed to advance research. Method: SIV-Bench is built from 2,792 video clips and 8,792 question-answer pairs produced by a human-LLM collaborative pipeline. The data comes from TikTok and YouTube and covers a range of video genres and cultural backgrounds. Result: MLLMs handle SSU well but perform poorly on SSR and SDP, with Relation Inference (RI) as a particular bottleneck. Transcribed dialogue is critical for understanding complex social interactions. Conclusion: SIV-Bench offers key insights for developing more socially intelligent AI, and the dataset and code are publicly available. Abstract: The rich and multifaceted nature of human social interaction, encompassing multimodal cues, unobservable relations and mental states, and dynamical behavior, presents a formidable challenge for artificial intelligence. To advance research in this area, we introduce SIV-Bench, a novel video benchmark for rigorously evaluating the capabilities of Multimodal Large Language Models (MLLMs) across Social Scene Understanding (SSU), Social State Reasoning (SSR), and Social Dynamics Prediction (SDP). SIV-Bench features 2,792 video clips and 8,792 meticulously generated question-answer pairs derived from a human-LLM collaborative pipeline. It is originally collected from TikTok and YouTube, covering a wide range of video genres, presentation styles, and linguistic and cultural backgrounds. It also includes a dedicated setup for analyzing the impact of different textual cues: original on-screen text, added dialogue, or no text. Our comprehensive experiments on leading MLLMs reveal that while models adeptly handle SSU, they significantly struggle with SSR and SDP, where Relation Inference (RI) is an acute bottleneck, as further examined in our analysis. Our study also confirms the critical role of transcribed dialogue in aiding comprehension of complex social interactions. By systematically identifying current MLLMs' strengths and limitations, SIV-Bench offers crucial insights to steer the development of more socially intelligent AI. The dataset and code are available at https://kfq20.github.io/sivbench/.

[27] Coordinated Robustness Evaluation Framework for Vision-Language Models

Ashwin Ramesh Babu,Sajad Mousavi,Vineet Gundecha,Sahand Ghorbanpour,Avisek Naug,Antonio Guillen,Ricardo Luna Gutierrez,Soumyendu Sarkar

Main category: cs.CV

TL;DR: Proposes a generic adversarial attack on vision-language models that jointly perturbs the image and text modalities to evaluate model robustness.

Details Motivation: Vision-language models perform well on tasks such as image captioning and visual question answering but are not robust to small perturbations, which hinders real-world deployment. Method: A generic surrogate model is trained to produce a joint representation, which is then used to generate adversarial perturbations for both image and text. The attack is evaluated on visual question answering and visual reasoning tasks. Result: The method outperforms other multi-modal and single-modality attack strategies and successfully compromises the robustness of several state-of-the-art pre-trained models. Conclusion: The method effectively exposes the fragility of vision-language models and points to directions for improving their robustness. Abstract: Vision-language models, which integrate computer vision and natural language processing capabilities, have demonstrated significant advancements in tasks such as image captioning and visual question and answering. However, similar to traditional models, they are susceptible to small perturbations, posing a challenge to their robustness, particularly in deployment scenarios. Evaluating the robustness of these models requires perturbations in both the vision and language modalities to learn their inter-modal dependencies. In this work, we train a generic surrogate model that can take both image and text as input and generate a joint representation which is further used to generate adversarial perturbations for both the text and image modalities. This coordinated attack strategy is evaluated on the visual question and answering and visual reasoning datasets using various state-of-the-art vision-language models. Our results indicate that the proposed strategy outperforms other multi-modal attacks and single-modality attacks from the recent literature. Our results demonstrate their effectiveness in compromising the robustness of several state-of-the-art pre-trained multi-modal models such as instruct-BLIP, ViLT and others.

[28] Robustness Evaluation for Video Models with Reinforcement Learning

Ashwin Ramesh Babu,Sajad Mousavi,Vineet Gundecha,Sahand Ghorbanpour,Avisek Naug,Antonio Guillen,Ricardo Luna Gutierrez,Soumyendu Sarkar

Main category: cs.CV

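TL;DR: Proposes a multi-agent reinforcement learning approach for evaluating the robustness of video classification models, generating more effective attacks through spatial and temporal perturbations.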

Details Motivation: Evaluating the robustness of video classification models is harder than for image models because of the added temporal dimension and higher computational cost. Method: Spatial and temporal reinforcement learning agents cooperatively identify a video's sensitive regions and generate small, visually imperceptible perturbations. Result: The method outperforms existing approaches on the Lp metric and on average query count, supports custom distortion types, and is evaluated with 4 popular models on the HMDB-51 and UCF-101 datasets. Conclusion: The method performs well for robustness evaluation of video action recognition models and is more practical and flexible than prior approaches. Abstract: Evaluating the robustness of video classification models is very challenging, specifically when compared to image-based models. With their increased temporal dimension, there is a significant increase in complexity and computational cost. One of the key challenges is to keep the perturbations to a minimum to induce misclassification. In this work, we propose a multi-agent reinforcement learning approach (spatial and temporal) that cooperatively learns to identify the given video's sensitive spatial and temporal regions. The agents consider temporal coherence in generating fine perturbations, leading to a more effective and visually imperceptible attack. Our method outperforms the state-of-the-art solutions on the Lp metric and the average queries. Our method enables custom distortion types, making the robustness evaluation more relevant to the use case. We extensively evaluate 4 popular models for video action recognition on two popular datasets, HMDB-51 and UCF-101.

[29] LLMs Can Compensate for Deficiencies in Visual Representations

Sho Takishita,Jay Gala,Abdelrahman Mohamed,Kentaro Inui,Yova Kementchedjhieva

Main category: cs.CV

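TL;DR: The study finds that CLIP vision encoders supply semantic information in vision-language models and that the language decoder can compensate for their limitations.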

Details Motivation: To investigate whether the limitations of CLIP vision encoders are compensated for by the language decoder. Method: Controlled self-attention ablations on three CLIP-based vision-language models, using a purpose-built probing task. Result: CLIP visual representations provide semantic information, and the language decoder can largely make up for their deficiencies. Conclusion: Future architectures could offload more visual processing to the language decoder. Abstract: Many vision-language models (VLMs) that prove very effective at a range of multimodal tasks build on CLIP-based vision encoders, which are known to have various limitations. We investigate the hypothesis that the strong language backbone in VLMs compensates for possibly weak visual features by contextualizing or enriching them. Using three CLIP-based VLMs, we perform controlled self-attention ablations on a carefully designed probing task. Our findings show that despite known limitations, CLIP visual representations offer ready-to-read semantic information to the language decoder. However, in scenarios of reduced contextualization in the visual representations, the language decoder can largely compensate for the deficiency and recover performance. This suggests a dynamic division of labor in VLMs and motivates future architectures that offload more visual processing to the language decoder.

[30] BYO-Eval: Build Your Own Dataset for Fine-Grained Visual Assessment of Multimodal Language Models

Ludovic Arnould,Salim Khazem,Hugues Ali Mehenni

Main category: cs.CV

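TL;DR: Proposes an evaluation method for vision-language models based on synthetic images, which controls visual attributes to expose perception failures and replaces costly annotation of real images.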

Details Motivation: Existing benchmarks are expensive to annotate, risk information leakage, and cannot pinpoint the cause of failures, so a more precise evaluation method is needed. Method: Synthetic images are procedurally generated with gradually increasing task difficulty while other visual parameters are held constant, enabling systematic stress testing. Result: The approach enables fine-grained, interpretable evaluation of vision-language model capabilities. Conclusion: The method gives a more precise and controllable evaluation tool for vision-language models. Abstract: Visual Language Models (VLMs) are now sufficiently advanced to support a broad range of applications, including answering complex visual questions, and are increasingly expected to interact with images in varied ways. To evaluate them, current benchmarks often focus on specific domains (e.g., reading charts), constructing datasets of annotated real images paired with pre-defined Multiple Choice Questions (MCQs) to report aggregate accuracy scores. However, such benchmarks entail high annotation costs, risk information leakage, and do not clarify whether failures stem from limitations in visual perception, reasoning, or general knowledge. We propose a new evaluation methodology, inspired by ophthalmologic diagnostics, leveraging procedural generation of synthetic images to obtain control over visual attributes and precisely reveal perception failures in VLMs. Specifically, we build collections of images with gradually more challenging variations in the content of interest (e.g., number of objects in a counting task) while holding other visual parameters constant. This diagnostic allows systematic stress testing and fine-grained failure analysis, shifting the focus from coarse benchmarking toward targeted and interpretable assessment of VLM capabilities. Our code is available at https://github.com/byoeval/BYO-EVAL.
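
The idea of varying one attribute while holding the rest constant can be sketched as a generator of scene specifications for a counting task. This is an illustrative sketch only: the function, field names, and scene parameters are hypothetical and are not the BYO-EVAL library's API, and a real pipeline would pass each specification to a renderer.

```python
import random

def make_counting_series(max_objects, seed=0, canvas=(256, 256)):
    """Build scene specs in which only the object count changes.

    Background, shape, radius, and color stay fixed across the series,
    so a drop in accuracy can be attributed to the count alone.
    Positions are sampled at random and may overlap in this simple sketch.
    """
    rng = random.Random(seed)
    fixed = {
        "canvas": canvas,
        "background": "white",
        "shape": "circle",
        "radius": 10,
        "color": "black",
    }
    series = []
    for n in range(1, max_objects + 1):
        positions = [
            (rng.randrange(10, canvas[0] - 10), rng.randrange(10, canvas[1] - 10))
            for _ in range(n)
        ]
        series.append({
            **fixed,
            "n_objects": n,
            "positions": positions,
            "question": "How many circles are in the image?",
            "answer": n,
        })
    return series

if __name__ == "__main__":
    for scene in make_counting_series(3):
        print(scene["n_objects"], scene["positions"])
```

Because the ground-truth answer is known by construction, no manual annotation is needed, and accuracy can be plotted against the varied attribute to locate where a model starts to fail.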

[31] Structured Labeling Enables Faster Vision-Language Models for End-to-End Autonomous Driving

Hao Jiang,Chuan Hu,Yukang Shi,Yuan He,Ke Wang,Xi Zhang,Zhipeng Zhang

Main category: cs.CV

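TL;DR: The paper introduces NuScenes-S, a structured and concise benchmark dataset, and FastDrive, an efficient VLM, to address the redundancy and high computational cost of existing VLMs in autonomous driving.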

Details Motivation: Existing VLMs for autonomous driving suffer from redundant datasets and high computational cost, which limit inference speed and real-world deployment. Method: The paper introduces the structured dataset NuScenes-S and a compact VLM, FastDrive (0.9B parameters), that focuses on structured descriptions. Result: FastDrive performs well on the structured dataset, with roughly 20% higher accuracy on decision-making tasks and more than 10x faster inference than 7B-parameter models. Conclusion: A structured dataset combined with a compact VLM markedly improves both performance and efficiency on autonomous driving decision-making tasks. Abstract: Vision-Language Models (VLMs) offer a promising approach to end-to-end autonomous driving due to their human-like reasoning capabilities. However, troublesome gaps remain between current VLMs and real-world autonomous driving applications. One major limitation is that existing datasets with loosely formatted language descriptions are not machine-friendly and may introduce redundancy. Additionally, high computational cost and massive scale of VLMs hinder the inference speed and real-world deployment. To bridge the gap, this paper introduces a structured and concise benchmark dataset, NuScenes-S, which is derived from the NuScenes dataset and contains machine-friendly structured representations. Moreover, we present FastDrive, a compact VLM baseline with 0.9B parameters. In contrast to existing VLMs with over 7B parameters and unstructured language processing (e.g., LLaVA-1.5), FastDrive understands structured and concise descriptions and generates machine-friendly driving decisions with high efficiency. Extensive experiments show that FastDrive achieves competitive performance on the structured dataset, with approximately 20% accuracy improvement on decision-making tasks, while surpassing massive parameter baseline in inference speed with over 10x speedup. Additionally, ablation studies further focus on the impact of scene annotations (e.g., weather, time of day) on decision-making tasks, demonstrating their importance on decision-making tasks in autonomous driving.

[32] U-NetMN and SegNetMN: Modified U-Net and SegNet models for bimodal SAR image segmentation

Marwane Kzadri,Franco Alberto Cardillo,Nanée Chahinian,Carole Delenne,Renaud Hostache,Jamal Riffi

Main category: cs.CV

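TL;DR: The study evaluates the effect of mode normalization on U-Net and SegNet for SAR image segmentation and finds that it significantly accelerates convergence and improves stability.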

Details Motivation: SAR image segmentation is essential for remote sensing applications, but deep learning models struggle with convergence speed and stability because of the data's complex statistical distribution. Method: Mode normalization is integrated into U-Net and SegNet to reduce convergence time while maintaining performance. Result: Mode normalization significantly accelerates convergence, and the normalized models are more stable across different zones. Conclusion: Normalization effectively improves computational efficiency and generalization in SAR image segmentation. Abstract: Segmenting Synthetic Aperture Radar (SAR) images is crucial for many remote sensing applications, particularly water body detection. However, deep learning-based segmentation models often face challenges related to convergence speed and stability, mainly due to the complex statistical distribution of this type of data. In this study, we evaluate the impact of mode normalization on two widely used semantic segmentation models, U-Net and SegNet. Specifically, we integrate mode normalization to reduce convergence time while maintaining the performance of the baseline models. Experimental results demonstrate that mode normalization significantly accelerates convergence. Furthermore, cross-validation results indicate that normalized models exhibit increased stability in different zones. These findings highlight the effectiveness of normalization in improving computational efficiency and generalization in SAR image segmentation.

[33] Degradation-Aware Image Enhancement via Vision-Language Classification

Jie Cai,Kangning Yang,Jiaming Ding,Lan Fu,Ling Ouyang,Jiang Li,Jinglin Shen,Zibo Meng

Main category: cs.CV

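TL;DR: Proposes a framework built on a vision-language model (VLM) that automatically classifies image degradation into four types and restores image quality with dedicated models.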

Details Motivation: Image degradation harms visual quality and downstream tasks, so automated classification and restoration are needed. Method: A VLM classifies each image into one of four degradation types (A-D), and images in categories A, B, and C are restored with a dedicated model for that type. Result: Experiments show that the method classifies degradation types accurately and improves image quality through the specialized models. Conclusion: The method offers a scalable, automated solution for image enhancement tasks. Abstract: Image degradation is a prevalent issue in various real-world applications, affecting visual quality and downstream processing tasks. In this study, we propose a novel framework that employs a Vision-Language Model (VLM) to automatically classify degraded images into predefined categories. The VLM categorizes an input image into one of four degradation types: (A) super-resolution degradation (including noise, blur, and JPEG compression), (B) reflection artifacts, (C) motion blur, or (D) no visible degradation (high-quality image). Once classified, images assigned to categories A, B, or C undergo targeted restoration using dedicated models tailored for each specific degradation type. The final output is a restored image with improved visual quality. Experimental results demonstrate the effectiveness of our approach in accurately classifying image degradations and enhancing image quality through specialized restoration models. Our method presents a scalable and automated solution for real-world image enhancement tasks, leveraging the capabilities of VLMs in conjunction with state-of-the-art restoration techniques.
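
The classify-then-route flow described in the abstract can be sketched as a small dispatcher. Everything here is a hypothetical stand-in: `classify` represents the VLM call, `restorers` maps the category letters to restoration models, and none of these names come from the paper's code.

```python
def enhance(image, classify, restorers):
    """Route an image to a restoration model based on its degradation label.

    classify: callable returning "A", "B", "C", or "D" (stand-in for the VLM).
    restorers: dict mapping "A", "B", "C" to callables that restore an image.
    Category "D" means no visible degradation, so the image is returned as-is.
    """
    label = classify(image)
    if label == "D":
        return image
    if label not in restorers:
        raise ValueError(f"no restoration model for category {label!r}")
    return restorers[label](image)

if __name__ == "__main__":
    # Toy stand-ins: images are dicts and "restoration" just tags them.
    toy_classify = lambda img: img["kind"]
    toy_restorers = {
        "A": lambda img: {**img, "restored_by": "super_resolution"},
        "B": lambda img: {**img, "restored_by": "reflection_removal"},
        "C": lambda img: {**img, "restored_by": "motion_deblur"},
    }
    print(enhance({"kind": "B"}, toy_classify, toy_restorers))
```

Keeping the classifier and the restoration models behind plain callables means either side can be swapped without touching the routing logic.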

[34] Towards Reliable Identification of Diffusion-based Image Manipulations

Alex Costanzino,Woody Bayliss,Juil Sock,Marc Gorriz Blanch,Danijela Horak,Ivan Laptev,Philip Torr,Fabio Pizzati

Main category: cs.CV

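TL;DR: Proposes RADAR, a method for reliably detecting manipulated regions in images. It combines features from multiple image modalities with a contrastive loss and significantly improves detection accuracy and generalization.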

Details Motivation: Advances in diffusion models improve image editing quality but also increase the risk of misuse, so reliable methods for identifying image manipulations are needed. Method: RADAR builds on existing foundation models, combines features from multiple image modalities, and adds an auxiliary contrastive loss to isolate manipulated image regions. Result: On the BBC-PAIR benchmark, RADAR outperforms existing methods and detects edits made by both seen and unseen diffusion models. Conclusion: RADAR performs strongly at detecting and localizing image edits and provides an effective tool against misuse of diffusion models. Abstract: Changing facial expressions, gestures, or background details may dramatically alter the meaning conveyed by an image. Notably, recent advances in diffusion models greatly improve the quality of image manipulation while also opening the door to misuse. Identifying changes made to authentic images, thus, becomes an important task, constantly challenged by new diffusion-based editing tools. To this end, we propose a novel approach for ReliAble iDentification of inpainted AReas (RADAR). RADAR builds on existing foundation models and combines features from different image modalities. It also incorporates an auxiliary contrastive loss that helps to isolate manipulated image patches. We demonstrate these techniques to significantly improve both the accuracy of our method and its generalisation to a large number of diffusion models. To support realistic evaluation, we further introduce BBC-PAIR, a new comprehensive benchmark, with images tampered by 28 diffusion models. Our experiments show that RADAR achieves excellent results, outperforming the state-of-the-art in detecting and localising image edits made by both seen and unseen diffusion models. Our code, data and models will be publicly available at alex-costanzino.github.io/radar.

[35] S2GO: Streaming Sparse Gaussian Occupancy Prediction

Jinhyung Park,Yihan Hu,Chensheng Peng,Wenzhao Zheng,Kris Kitani,Wei Zhan

Main category: cs.CV

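TL;DR: S2GO proposes a sparse query-based 3D representation in which queries are propagated over time and decoded into semantic Gaussians, significantly improving both the accuracy and efficiency of 3D occupancy prediction.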

Details Motivation: Existing 3D occupancy prediction methods rely on dense representations (voxels or dense Gaussians), which are slow and poorly suited to capturing dynamic scenes. S2GO aims to solve these problems with sparse queries. Method: The scene is represented by a set of 3D queries propagated through time and decoded into semantic Gaussians at each timestep, with a denoising rendering objective guiding the queries and Gaussians. Result: On the nuScenes and KITTI benchmarks, S2GO outperforms GaussianWorld by 1.5 IoU with 5.9x faster inference. Conclusion: Sparse query-based methods are efficient and flexible for 3D occupancy prediction and outperform traditional dense representations. Abstract: Despite the demonstrated efficiency and performance of sparse query-based representations for perception, state-of-the-art 3D occupancy prediction methods still rely on voxel-based or dense Gaussian-based 3D representations. However, dense representations are slow, and they lack flexibility in capturing the temporal dynamics of driving scenes. Distinct from prior work, we instead summarize the scene into a compact set of 3D queries which are propagated through time in an online, streaming fashion. These queries are then decoded into semantic Gaussians at each timestep. We couple our framework with a denoising rendering objective to guide the queries and their constituent Gaussians in effectively capturing scene geometry. Owing to its efficient, query-based representation, S2GO achieves state-of-the-art performance on the nuScenes and KITTI occupancy benchmarks, outperforming prior art (e.g., GaussianWorld) by 1.5 IoU with 5.9x faster inference.

[36] OpenRR-5k: A Large-Scale Benchmark for Reflection Removal in the Wild

Jie Cai,Kangning Yang,Ling Ouyang,Lan Fu,Jiaming Ding,Jinglin Shen,Zibo Meng

Main category: cs.CV

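TL;DR: The paper presents a new benchmark for Single Image Reflection Removal (SIRR) containing 5,300 high-quality, pixel-aligned image pairs, and validates its usefulness.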

Details Motivation: Existing reflection removal methods are limited by the lack of large-scale, high-quality, and diverse datasets. Method: A dataset of 5,300 image pairs is built and split into training, validation, and test sets, and a U-Net-based model is trained for evaluation. Result: The dataset's usefulness is validated with five widely used metrics (PSNR, SSIM, LPIPS, DISTS, NIQE). Conclusion: The dataset and code will be released to support future research. Abstract: Removing reflections is a crucial task in computer vision, with significant applications in photography and image enhancement. Nevertheless, existing methods are constrained by the absence of large-scale, high-quality, and diverse datasets. In this paper, we present a novel benchmark for Single Image Reflection Removal (SIRR). We have developed a large-scale dataset containing 5,300 high-quality, pixel-aligned image pairs, each consisting of a reflection image and its corresponding clean version. Specifically, the dataset is divided into two parts: 5,000 images are used for training, and 300 images are used for validation. Additionally, we have included 100 real-world testing images without ground truth (GT) to further evaluate the practical performance of reflection removal methods. All image pairs are precisely aligned at the pixel level to guarantee accurate supervision. The dataset encompasses a broad spectrum of real-world scenarios, featuring various lighting conditions, object types, and reflection patterns, and is segmented into training, validation, and test sets to facilitate thorough evaluation. To validate the usefulness of our dataset, we train a U-Net-based model and evaluate it using five widely-used metrics, including PSNR, SSIM, LPIPS, DISTS, and NIQE. We will release both the dataset and the code on https://github.com/caijie0620/OpenRR-5k to facilitate future research in this field.
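
Of the five metrics listed, PSNR is the simplest to state: it is 10·log10(MAX² / MSE) between the clean reference and the restored image. The sketch below computes it on flat lists of 8-bit pixel values; the function name and input layout are assumptions for illustration, and pixel-aligned pairs like those in the dataset are what make this per-pixel comparison meaningful.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    reference, restored: flat sequences of pixel intensities (assumed layout).
    Returns math.inf when the images are identical.
    """
    if len(reference) != len(restored) or len(reference) == 0:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)

if __name__ == "__main__":
    clean = [10, 20, 30, 40]
    output = [12, 18, 33, 40]
    print(f"{psnr(clean, output):.2f} dB")
```

Higher PSNR means the restored image is closer to the ground truth. The 100 real-world test images without ground truth cannot be scored this way, which is presumably why a no-reference metric such as NIQE is also reported.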

[37] A Neural Network Model of Spatial and Feature-Based Attention

Ruoyang Hu,Robert A. Jacobs

Main category: cs.CV

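TL;DR: The paper proposes a neural network model inspired by human visual attention. It consists of two networks: one performs a simple task, and the other guides it through an attention mechanism to adapt to more complex tasks. The attention patterns the model learns resemble human visual attention.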

Details Motivation: To study the mechanisms of human visual attention and explore how a neural network model can emulate them. Method: A two-network model in which one network performs a simple task and the other processes contextual information and guides the first through attention. Result: The attention patterns the model learns correspond to human spatial and feature-based attention. Conclusion: Neural network models can emulate human visual attention, offering a promising direction for studying human cognition. Abstract: Visual attention is a mechanism closely intertwined with vision and memory. Top-down information influences visual processing through attention. We designed a neural network model inspired by aspects of human visual attention. This model consists of two networks: one serves as a basic processor performing a simple task, while the other processes contextual information and guides the first network through attention to adapt to more complex tasks. After training the model and visualizing the learned attention response, we discovered that the model's emergent attention patterns corresponded to spatial and feature-based attention. This similarity between human visual attention and attention in computer vision suggests a promising direction for studying human cognition using neural network models.

[38] Implicit Neural Representation for Video Restoration

Mary Aiyetigbo,Wanqi Yuan,Feng Luo,Nianyi Li

Main category: cs.CV

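TL;DR: VR-INR is a video restoration method based on implicit neural representations. Trained on a single upscaling factor (×4) only, it generalizes to arbitrary unseen super-resolution scales and performs zero-shot denoising.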

Details Motivation: Existing video restoration methods are typically trained for fixed upscaling factors and lack the flexibility to handle scales or degradations outside their training distribution. Method: A hierarchical spatial-temporal-texture encoding framework combined with multi-resolution implicit hash encoding enables adaptive decoding of high-resolution frames from low-resolution inputs at any magnification. Result: VR-INR maintains high-quality reconstructions at unseen scales and noise levels and significantly outperforms existing methods in sharpness, detail preservation, and denoising. Conclusion: VR-INR demonstrates the strong generalization of implicit neural representations for video restoration and offers a new way to handle varied degradations flexibly. Abstract: High-resolution (HR) videos play a crucial role in many computer vision applications. Although existing video restoration (VR) methods can significantly enhance video quality by exploiting temporal information across video frames, they are typically trained for fixed upscaling factors and lack the flexibility to handle scales or degradations beyond their training distribution. In this paper, we introduce VR-INR, a novel video restoration approach based on Implicit Neural Representations (INRs) that is trained only on a single upscaling factor ($\times 4$) but generalizes effectively to arbitrary, unseen super-resolution scales at test time. Notably, VR-INR also performs zero-shot denoising on noisy input, despite never having seen noisy data during training. Our method employs a hierarchical spatial-temporal-texture encoding framework coupled with multi-resolution implicit hash encoding, enabling adaptive decoding of high-resolution and noise-suppressed frames from low-resolution inputs at any desired magnification. Experimental results show that VR-INR consistently maintains high-quality reconstructions at unseen scales and noise during training, significantly outperforming state-of-the-art approaches in sharpness, detail preservation, and denoising efficacy.

[39] F2T2-HiT: A U-Shaped FFT Transformer and Hierarchical Transformer for Reflection Removal

Jie Cai,Kangning Yang,Ling Ouyang,Lan Fu,Jiaming Ding,Huiming Sun,Chiu Man Ho,Zibo Meng

Main category: cs.CV

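TL;DR: The paper proposes F2T2-HiT, a Transformer-based U-shaped architecture for Single Image Reflection Removal (SIRR) that combines Fast Fourier Transform and Hierarchical Transformer blocks, significantly improving the handling of complex reflections.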

Details Motivation: 现实场景中的反射具有多样性和复杂性,现有方法难以有效处理,因此需要一种更强大的技术来分离反射和背景。 Method: 采用U形架构,结合快速傅里叶变换(FFT)Transformer模块和分层Transformer模块,利用全局频率信息和多尺度特征提取。 Result: 在三个公开测试数据集上实现了最先进的性能,验证了方法的有效性。 Conclusion: F2T2-HiT架构在SIRR任务中表现出色,为复杂反射问题提供了创新解决方案。 Abstract: Single Image Reflection Removal (SIRR) technique plays a crucial role in image processing by eliminating unwanted reflections from the background. These reflections, often caused by photographs taken through glass surfaces, can significantly degrade image quality. SIRR remains a challenging problem due to the complex and varied reflections encountered in real-world scenarios. These reflections vary significantly in intensity, shapes, light sources, sizes, and coverage areas across the image, posing challenges for most existing methods to effectively handle all cases. To address these challenges, this paper introduces a U-shaped Fast Fourier Transform Transformer and Hierarchical Transformer (F2T2-HiT) architecture, an innovative Transformer-based design for SIRR. Our approach uniquely combines Fast Fourier Transform (FFT) Transformer blocks and Hierarchical Transformer blocks within a UNet framework. The FFT Transformer blocks leverage the global frequency domain information to effectively capture and separate reflection patterns, while the Hierarchical Transformer blocks utilize multi-scale feature extraction to handle reflections of varying sizes and complexities. Extensive experiments conducted on three publicly available testing datasets demonstrate state-of-the-art performance, validating the effectiveness of our approach.
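Why frequency-domain processing is "global" can be seen in a toy example: every Fourier coefficient depends on all input samples, so masking coefficients separates a signal into components without any local window. The sketch below is only an illustration on a 1D signal with a naive discrete Fourier transform; the function names and the hard low/high split are assumptions and do not describe the F2T2-HiT blocks.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT, returning the real part of the reconstructed signal."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def split_by_frequency(signal, cutoff):
    """Split a signal into low- and high-frequency parts with a hard spectral mask."""
    n = len(signal)
    spectrum = dft(signal)
    # min(k, n - k) is the frequency index accounting for the mirrored half.
    low = [c if min(k, n - k) <= cutoff else 0 for k, c in enumerate(spectrum)]
    high = [c - l for c, l in zip(spectrum, low)]
    return idft(low), idft(high)


signal = [1, 2, 3, 4, 3, 2, 1, 0]
low_part, high_part = split_by_frequency(signal, cutoff=1)
```

The two parts always sum back to the input, so nothing is lost by the split; a learned model would replace the fixed mask with learned operations on the spectrum.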

[40] FocusDiff: Advancing Fine-Grained Text-Image Alignment for Autoregressive Visual Generation through RL

Kaihang Pan,Wendong Bu,Yuruo Wu,Yang Wu,Kai Shen,Yunfei Li,Hang Zhao,Juncheng Li,Siliang Tang,Yueting Zhuang

Main category: cs.CV

TL;DR: The paper proposes FocusDiff, which uses reinforcement learning to strengthen fine-grained text-image semantic alignment and significantly outperforms existing methods.

Details Motivation: Existing models perform poorly on fine-grained text-image alignment and cannot achieve precise visual control. Method: A new dataset is constructed and a reinforcement learning algorithm is designed to emphasize subtle semantic differences between similar text-image pairs. Result: The approach achieves state-of-the-art performance on existing benchmarks and leads by a wide margin on PairComp. Conclusion: FocusDiff effectively addresses fine-grained alignment and provides more precise control for text-to-image generation. Abstract: Recent studies extend the autoregression paradigm to text-to-image generation, achieving performance comparable to diffusion models. However, our new PairComp benchmark -- featuring test cases of paired prompts with similar syntax but different fine-grained semantics -- reveals that existing models struggle with fine-grained text-image alignment, thus failing to realize precise control over visual tokens. To address this, we propose FocusDiff, which enhances fine-grained text-image semantic alignment by focusing on subtle differences between similar text-image pairs. We construct a new dataset of paired texts and images with similar overall expressions but distinct local semantics, further introducing a novel reinforcement learning algorithm to emphasize such fine-grained semantic differences for desired image generation. Our approach achieves state-of-the-art performance on existing text-to-image benchmarks and significantly outperforms prior methods on PairComp.

[41] MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning

Zikui Cai,Andrew Wang,Anirudh Satheesh,Ankit Nakhawa,Hyunwoo Jae,Keenan Powell,Minghui Liu,Neel Jay,Sungbin Oh,Xiyao Wang,Yongyuan Liang,Tom Goldstein,Furong Huang

Main category: cs.CV

TL;DR: MORSE-500 is a new video benchmark that addresses the shortcomings of existing multimodal reasoning benchmarks, covers six categories of reasoning, and supports continued evolution through programmatic generation.

Details Motivation: Existing benchmarks fall short in temporal dynamics, breadth of reasoning skills, and difficulty scalability. Method: 500 video clips with embedded questions spanning six reasoning categories are generated using Python scripts, generative video models, and real footage. Result: Experiments show that current state-of-the-art models perform markedly worse on abstract and planning tasks. Conclusion: MORSE-500 provides a transparent, reproducible, and forward-looking tool for multimodal reasoning research. Abstract: Despite rapid advances in vision-language models (VLMs), current benchmarks for multimodal reasoning fall short in three key dimensions. First, they overwhelmingly rely on static images, failing to capture the temporal complexity of real-world environments. Second, they narrowly focus on mathematical problem-solving, neglecting the broader spectrum of reasoning skills -- including abstract, physical, planning, spatial, and temporal capabilities -- required for robust multimodal intelligence. Third, many benchmarks quickly saturate, offering limited headroom for diagnosing failure modes or measuring continued progress. We introduce MORSE-500 (Multimodal Reasoning Stress-test Environment), a video benchmark composed of 500 fully scripted clips with embedded questions spanning six complementary reasoning categories. Each instance is programmatically generated using deterministic Python scripts (via Manim, Matplotlib, MoviePy), generative video models, and curated real footage. This script-driven design allows fine-grained control over visual complexity, distractor density, and temporal dynamics -- enabling difficulty to be scaled systematically as models improve. Unlike static benchmarks that become obsolete once saturated, MORSE-500 is built to evolve: its controllable generation pipeline supports the creation of arbitrarily challenging new instances, making it ideally suited for stress-testing next-generation models. Initial experiments with state-of-the-art systems -- including Gemini 2.5 Pro and OpenAI o3, which represent the strongest available at the time, alongside strong open-source models -- reveal substantial performance gaps across all categories, with particularly large deficits in abstract and planning tasks. We release the full dataset, generation scripts, and evaluation harness to support transparent, reproducible, and forward-looking multimodal reasoning research.

[42] Personalized Interpretability -- Interactive Alignment of Prototypical Parts Networks

Tomasz Michalski,Adam Wróbel,Andrea Bontempelli,Jakub Luśtyk,Mikolaj Kniejski,Stefano Teso,Andrea Passerini,Bartosz Zieliński,Dawid Rymarczyk

Main category: cs.CV

TL;DR: YoursProtoP is an interactive strategy that personalizes visual concepts through user supervision, resolving concept inconsistency and making model explanations easier to understand.

Details Motivation: Existing concept-based neural network explanations suffer from concept inconsistency and offer no mechanism for user feedback, so explanations do not match users' understanding. Method: YoursProtoP uses user supervision to adapt and split prototypical parts (visual concepts) so that they better match user preferences and understanding. Result: Validated on the FunnyBirds, CUB, CARS, and PETS datasets, YoursProtoP achieves concept consistency while preserving model accuracy. Conclusion: Through interactive personalization, YoursProtoP effectively resolves concept inconsistency and makes model explanations more user-friendly. Abstract: Concept-based interpretable neural networks have gained significant attention due to their intuitive and easy-to-understand explanations based on case-based reasoning, such as "this bird looks like those sparrows". However, a major limitation is that these explanations may not always be comprehensible to users due to concept inconsistency, where multiple visual features are inappropriately mixed (e.g., a bird's head and wings treated as a single concept). This inconsistency breaks the alignment between model reasoning and human understanding. Furthermore, users have specific preferences for how concepts should look, yet current approaches provide no mechanism for incorporating their feedback. To address these issues, we introduce YoursProtoP, a novel interactive strategy that enables the personalization of prototypical parts - the visual concepts used by the model - according to user needs. By incorporating user supervision, YoursProtoP adapts and splits concepts used for both prediction and explanation to better match the user's preferences and understanding. Through experiments on both the synthetic FunnyBirds dataset and a real-world scenario using the CUB, CARS, and PETS datasets in a comprehensive user study, we demonstrate the effectiveness of YoursProtoP in achieving concept consistency without compromising the accuracy of the model.

[43] FRAME: Pre-Training Video Feature Representations via Anticipation and Memory

Sethuraman TV,Savya Khosla,Vignesh Srinivasakumar,Jiahui Huang,Seoung Wug Oh,Simon Jenni,Derek Hoiem,Joon-Young Lee

Main category: cs.CV

TL;DR: FRAME is a self-supervised video frame encoder designed for dense video understanding. By predicting current and future DINO patch features it produces spatio-temporally consistent representations and outperforms existing image and video encoders on dense prediction tasks.

Details Motivation: Existing image encoders (such as DINO or CLIP) lack temporal awareness, while video models (such as VideoMAE) underperform on dense prediction tasks, so a video encoder that produces spatio-temporally consistent features is needed. Method: FRAME uses self-supervised learning to predict current and future DINO patch features from past and present RGB frames, yielding spatially precise and temporally coherent representations, and also supports language-driven tasks. Result: Across six dense prediction tasks on seven datasets, FRAME outperforms image encoders and existing self-supervised video models while keeping a compact architecture. Conclusion: FRAME fills a gap in dense video prediction and is, to the authors' knowledge, the first video encoder to leverage image-based models while outperforming them; it suits a range of downstream applications. Abstract: Dense video prediction tasks, such as object tracking and semantic segmentation, require video encoders that generate temporally consistent, spatially dense features for every frame. However, existing approaches fall short: image encoders like DINO or CLIP lack temporal awareness, while video models such as VideoMAE underperform compared to image encoders on dense prediction tasks. We address this gap with FRAME, a self-supervised video frame encoder tailored for dense video understanding. FRAME learns to predict current and future DINO patch features from past and present RGB frames, leading to spatially precise and temporally coherent representations. To our knowledge, FRAME is the first video encoder to leverage image-based models for dense prediction while outperforming them on tasks requiring fine-grained visual correspondence. As an auxiliary capability, FRAME aligns its class token with CLIP's semantic space, supporting language-driven tasks such as video classification. We evaluate FRAME across six dense prediction tasks on seven datasets, where it consistently outperforms image encoders and existing self-supervised video models. Despite its versatility, FRAME maintains a compact architecture suitable for a range of downstream applications.

[44] Layered Motion Fusion: Lifting Motion Segmentation to 3D in Egocentric Videos

Vadim Tschernezki,Diane Larlus,Andrea Vedaldi,Iro Laina

Main category: cs.CV

TL;DR: The paper examines how 3D techniques can improve dynamic segmentation by fusing 2D motion segmentation predictions into layered radiance fields (Layered Motion Fusion), uses test-time refinement to cope with the high complexity of dynamic videos, and surpasses the 2D baseline by a large margin.

Details Motivation: Current 3D techniques perform poorly on dynamic phenomena such as segmenting moving objects, while 2D techniques are widely used but limited by operating on independent views. The paper explores how 3D techniques can improve 2D analysis of dynamic scenes. Method: Layered Motion Fusion fuses motion segmentation predictions from a 2D model into layered radiance fields, and test-time refinement reduces the data complexity. Result: The 3D model's dynamic segmentation predictions substantially outperform the 2D baseline, showing that 3D techniques can enhance 2D analysis in dynamic scenes. Conclusion: 3D techniques can improve 2D analysis even for dynamic phenomena, offering a new solution for complex, realistic scenes. Abstract: Computer vision is largely based on 2D techniques, with 3D vision still relegated to a relatively narrow subset of applications. However, by building on recent advances in 3D models such as neural radiance fields, some authors have shown that 3D techniques can at last improve outputs extracted from independent 2D views, by fusing them into 3D and denoising them. This is particularly helpful in egocentric videos, where the camera motion is significant, but only under the assumption that the scene itself is static. In fact, as shown in the recent analysis conducted by EPIC Fields, 3D techniques are ineffective when it comes to studying dynamic phenomena, and, in particular, when segmenting moving objects. In this paper, we look into this issue in more detail. First, we propose to improve dynamic segmentation in 3D by fusing motion segmentation predictions from a 2D-based model into layered radiance fields (Layered Motion Fusion). However, the high complexity of long, dynamic videos makes it challenging to capture the underlying geometric structure, and, as a result, hinders the fusion of motion cues into the (incomplete) scene geometry. We address this issue through test-time refinement, which helps the model to focus on specific frames, thereby reducing the data complexity. This results in a synergy between motion fusion and the refinement, and in turn leads to segmentation predictions of the 3D model that surpass the 2D baseline by a large margin. This demonstrates that 3D techniques can enhance 2D analysis even for dynamic phenomena in a challenging and realistic setting.

[45] When Semantics Mislead Vision: Mitigating Large Multimodal Models Hallucinations in Scene Text Spotting and Understanding

Yan Shu,Hangui Lin,Yexin Liu,Yan Zhang,Gangyan Zeng,Yan Li,Yu Zhou,Ser-Nam Lim,Harry Yang,Nicu Sebe

Main category: cs.CV

TL;DR: The paper proposes a training-free framework that uses ZoomText and Grounded Layer Correction to reduce semantic hallucination in large multimodal models on visually ambiguous or non-semantic text.

Details Motivation: Large multimodal models are prone to semantic hallucination on visually ambiguous or non-semantic text, producing answers that are semantically plausible but visually incorrect. Method: The framework combines ZoomText (coarse-to-fine identification of text regions) with Grounded Layer Correction (using layers less prone to hallucination to guide decoding). Result: The method effectively reduces semantic hallucination and performs strongly on public benchmarks for scene text spotting and understanding. Conclusion: The framework substantially mitigates semantic hallucination while preserving semantic understanding of meaningful text. Abstract: Large Multimodal Models (LMMs) have achieved impressive progress in visual perception and reasoning. However, when confronted with visually ambiguous or non-semantic scene text, they often struggle to accurately spot and understand the content, frequently generating semantically plausible yet visually incorrect answers, which we refer to as semantic hallucination. In this work, we investigate the underlying causes of semantic hallucination and identify a key finding: Transformer layers in the LLM with stronger attention focus on scene text regions are less prone to producing semantic hallucinations. Thus, we propose a training-free semantic hallucination mitigation framework comprising two key components: (1) ZoomText, a coarse-to-fine strategy that identifies potential text regions without external detectors; and (2) Grounded Layer Correction, which adaptively leverages the internal representations from layers less prone to hallucination to guide decoding, correcting hallucinated outputs for non-semantic samples while preserving the semantics of meaningful ones. To enable rigorous evaluation, we introduce TextHalu-Bench, a benchmark of over 1,730 samples spanning both semantic and non-semantic cases, with manually curated question-answer pairs designed to probe model hallucinations. Extensive experiments demonstrate that our method not only effectively mitigates semantic hallucination but also achieves strong performance on public benchmarks for scene text spotting and understanding.

[46] EX-4D: EXtreme Viewpoint 4D Video Synthesis via Depth Watertight Mesh

Tao Hu,Haoyang Peng,Xiao Liu,Yuewen Ma

Main category: cs.CV

TL;DR: EX-4D proposes a new framework based on a Depth Watertight Mesh representation for generating high-quality, camera-controllable videos from monocular input, addressing geometric inconsistency and occlusion under extreme viewpoints.

Details Motivation: Existing methods often suffer degraded visual quality under extreme viewpoints because of geometric inconsistencies and occlusion artifacts at boundaries; EX-4D aims to solve these problems. Method: A Depth Watertight Mesh representation serves as a geometric prior, a simulated masking strategy generates training data, and a lightweight LoRA-based video diffusion adapter synthesizes high-quality videos. Result: Experiments show that EX-4D outperforms existing methods in physical consistency and extreme-view quality, enabling practical 4D video generation. Conclusion: With its geometric representation and training strategy, EX-4D substantially improves video generation quality under extreme viewpoints. Abstract: Generating high-quality camera-controllable videos from monocular input is a challenging task, particularly under extreme viewpoints. Existing methods often struggle with geometric inconsistencies and occlusion artifacts at boundaries, leading to degraded visual quality. In this paper, we introduce EX-4D, a novel framework that addresses these challenges through a Depth Watertight Mesh representation. The representation serves as a robust geometric prior by explicitly modeling both visible and occluded regions, ensuring geometric consistency under extreme camera poses. To overcome the lack of paired multi-view datasets, we propose a simulated masking strategy that generates effective training data only from monocular videos. Additionally, a lightweight LoRA-based video diffusion adapter is employed to synthesize high-quality, physically consistent, and temporally coherent videos. Extensive experiments demonstrate that EX-4D outperforms state-of-the-art methods in terms of physical consistency and extreme-view quality, enabling practical 4D video generation.

[47] On-the-fly Reconstruction for Large-Scale Novel View Synthesis from Unposed Images

Andreas Meuleman,Ishaan Shah,Alexandre Lanvin,Bernhard Kerbl,George Drettakis

Main category: cs.CV

TL;DR: The paper proposes an on-the-fly method that produces camera poses and a trained 3D Gaussian Splatting (3DGS) model, suitable for large scenes and wide-baseline captures.

Details Motivation: Existing approaches such as 3DGS and SLAM either take a long time to compute or perform poorly on large scenes and wide baselines. Method: Fast initial pose estimation is combined with direct sampling of Gaussian primitives, and an incremental approach handles large scenes. Result: The method processes a variety of capture scenarios and scene sizes on the fly while remaining competitive in speed and image quality. Conclusion: The method offers an efficient and general solution for on-the-fly 3D reconstruction. Abstract: Radiance field methods such as 3D Gaussian Splatting (3DGS) allow easy reconstruction from photos, enabling free-viewpoint navigation. Nonetheless, pose estimation using Structure from Motion and 3DGS optimization can still each take between minutes and hours of computation after capture is complete. SLAM methods combined with 3DGS are fast but struggle with wide camera baselines and large scenes. We present an on-the-fly method to produce camera poses and a trained 3DGS immediately after capture. Our method can handle dense and wide-baseline captures of ordered photo sequences and large-scale scenes. To do this, we first introduce fast initial pose estimation, exploiting learned features and a GPU-friendly mini bundle adjustment. We then introduce direct sampling of Gaussian primitive positions and shapes, incrementally spawning primitives where required, significantly accelerating training. These two efficient steps allow fast and robust joint optimization of poses and Gaussian primitives. Our incremental approach handles large-scale scenes by introducing scalable radiance field construction, progressively clustering 3DGS primitives, storing them in anchors, and offloading them from the GPU. Clustered primitives are progressively merged, keeping the required scale of 3DGS at any viewpoint. We evaluate our solution on a variety of datasets and show that our solution can provide on-the-fly processing of all the capture scenarios and scene sizes we target while remaining competitive with other methods that only handle specific capture styles or scene sizes in speed, image quality, or both.

[48] VoxelSplat: Dynamic Gaussian Splatting as an Effective Loss for Occupancy and Flow Prediction

Ziyue Zhu,Shenlong Wang,Jin Xie,Jiang-jiang Liu,Jingdong Wang,Jian Yang

Main category: cs.CV

TL;DR: The paper proposes VoxelSplat, a new regularization framework for 3D semantics and scene flow prediction in camera-based occupancy prediction. It strengthens semantic supervision through 2D projection and learns scene flow in a self-supervised manner, significantly improving performance.

Details Motivation: Predicting 3D semantics and scene flow in camera-based occupancy prediction faces challenges such as occlusions and unbalanced dynamic environments, so an effective way to improve model performance is needed. Method: VoxelSplat uses 3D Gaussian Splatting to provide additional semantic supervision via 2D projection and to learn scene flow in a self-supervised manner. Result: Experiments on benchmark datasets show that VoxelSplat significantly improves the accuracy of semantic occupancy and scene flow estimation. Conclusion: VoxelSplat integrates seamlessly into existing models and improves performance without increasing inference time, giving it broad applicability. Abstract: Recent advancements in camera-based occupancy prediction have focused on the simultaneous prediction of 3D semantics and scene flow, a task that presents significant challenges due to specific difficulties, e.g., occlusions and unbalanced dynamic environments. In this paper, we analyze these challenges and their underlying causes. To address them, we propose a novel regularization framework called VoxelSplat. This framework leverages recent developments in 3D Gaussian Splatting to enhance model performance in two key ways: (i) Enhanced Semantics Supervision through 2D Projection: During training, our method decodes sparse semantic 3D Gaussians from 3D representations and projects them onto the 2D camera view. This provides additional supervision signals in the camera-visible space, allowing 2D labels to improve the learning of 3D semantics. (ii) Scene Flow Learning: Our framework uses the predicted scene flow to model the motion of Gaussians, and is thus able to learn the scene flow of moving objects in a self-supervised manner using the labels of adjacent frames. Our method can be seamlessly integrated into various existing occupancy models, enhancing performance without increasing inference time. Extensive experiments on benchmark datasets demonstrate the effectiveness of VoxelSplat in improving the accuracy of both semantic occupancy and scene flow estimation. The project page and codes are available at https://zzy816.github.io/VoxelSplat-Demo/.
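The geometric core of "projecting 3D Gaussians onto the 2D camera view" is a camera projection of each Gaussian's center. The sketch below is a deliberately reduced illustration under stated assumptions: it uses a plain pinhole model, projects only the centers (a full splatting renderer also projects each Gaussian's covariance and alpha-blends the results), and the function name, the tuple layout, and the intrinsics values are hypothetical rather than taken from VoxelSplat.

```python
def project_centers(gaussians, fx, fy, cx, cy):
    """Project labeled 3D points (camera coordinates) to pixel coordinates.

    gaussians: iterable of (x, y, z, label) with z pointing forward from the camera.
    Returns a list of (u, v, label); points at or behind the camera are skipped.
    """
    projected = []
    for x, y, z, label in gaussians:
        if z <= 0:
            # Not in front of the camera, so it cannot receive 2D supervision.
            continue
        u = fx * x / z + cx
        v = fy * y / z + cy
        projected.append((u, v, label))
    return projected


# Hypothetical intrinsics and semantic labels, for illustration only.
centers = [
    (0.0, 0.0, 5.0, "road"),
    (1.0, 0.0, 2.0, "car"),
    (0.0, 1.0, -1.0, "tree"),  # behind the camera
]
pixels = project_centers(centers, fx=100.0, fy=100.0, cx=50.0, cy=40.0)
```

Once the semantic label of each projected Gaussian lands at a pixel, it can be compared against a 2D label map at that location, which is the sense in which 2D labels supervise 3D semantics.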

[49] PartCrafter: Structured 3D Mesh Generation via Compositional Latent Diffusion Transformers

Yuchen Lin,Chenguo Lin,Panwang Pan,Honglei Yan,Yiqiang Feng,Yadong Mu,Katerina Fragkiadaki

Main category: cs.CV

TL;DR: PartCrafter is a structured 3D generative model that jointly generates multiple semantically meaningful and geometrically distinct 3D meshes from a single RGB image, without relying on pre-segmented inputs.

Details Motivation: Existing methods either generate a single monolithic 3D shape or follow a two-stage pipeline (segment the image, then reconstruct each part); PartCrafter aims to address this with a unified compositional generation architecture. Method: Building on a pretrained 3D mesh diffusion transformer (DiT), it introduces a compositional latent space and a hierarchical attention mechanism for end-to-end part-aware generation. Result: Experiments show that PartCrafter outperforms existing methods at generating decomposable 3D meshes, including parts not visible in the input image. Conclusion: PartCrafter demonstrates the strength of part-aware generative priors for 3D understanding and synthesis; code and training data will be released. Abstract: We introduce PartCrafter, the first structured 3D generative model that jointly synthesizes multiple semantically meaningful and geometrically distinct 3D meshes from a single RGB image. Unlike existing methods that either produce monolithic 3D shapes or follow two-stage pipelines, i.e., first segmenting an image and then reconstructing each segment, PartCrafter adopts a unified, compositional generation architecture that does not rely on pre-segmented inputs. Conditioned on a single image, it simultaneously denoises multiple 3D parts, enabling end-to-end part-aware generation of both individual objects and complex multi-object scenes. PartCrafter builds upon a pretrained 3D mesh diffusion transformer (DiT) trained on whole objects, inheriting the pretrained weights, encoder, and decoder, and introduces two key innovations: (1) A compositional latent space, where each 3D part is represented by a set of disentangled latent tokens; (2) A hierarchical attention mechanism that enables structured information flow both within individual parts and across all parts, ensuring global coherence while preserving part-level detail during generation. To support part-level supervision, we curate a new dataset by mining part-level annotations from large-scale 3D object datasets. Experiments show that PartCrafter outperforms existing approaches in generating decomposable 3D meshes, including parts that are not directly visible in input images, demonstrating the strength of part-aware generative priors for 3D understanding and synthesis. Code and training data will be released.

[50] UniRes: Universal Image Restoration for Complex Degradations

Mo Zhou,Keren Ye,Mauricio Delbracio,Peyman Milanfar,Vishal M. Patel,Hossein Talebi

Main category: cs.CV

TL;DR: The paper proposes UniRes, a diffusion-based framework for restoring real-world images with complex degradations, which combines several specialized models to handle them end to end.

Details Motivation: Real-world image restoration faces many kinds of degradation, and existing methods generalize poorly to in-the-wild data, so complex degradations need to be addressed. Method: The UniRes framework combines several specialized models during the diffusion sampling steps and requires only well-isolated training data to restore complex degradations. Result: It performs well on both complex-degradation and single-degradation datasets, with the clearest gains on images with complex degradations. Conclusion: UniRes is flexible and effective for restoring complex degradations, is easy to extend, and supports a fidelity-quality trade-off. Abstract: Real-world image restoration is hampered by diverse degradations stemming from varying capture conditions, capture devices and post-processing pipelines. Existing works make improvements through simulating those degradations and leveraging image generative priors; however, generalization to in-the-wild data remains an unresolved problem. In this paper, we focus on complex degradations, i.e., arbitrary mixtures of multiple types of known degradations, which are frequently seen in the wild. A simple yet flexible diffusion-based framework, named UniRes, is proposed to address such degradations in an end-to-end manner. It combines several specialized models during the diffusion sampling steps, hence transferring the knowledge from several well-isolated restoration tasks to the restoration of complex in-the-wild degradations. This only requires well-isolated training data for several degradation types. The framework is flexible as extensions can be added through a unified formulation, and the fidelity-quality trade-off can be adjusted through a new paradigm. Our proposed method is evaluated on both complex-degradation and single-degradation image restoration datasets. Extensive qualitative and quantitative experimental results show consistent performance gain especially for images with complex degradations.

[51] Controlled Data Rebalancing in Multi-Task Learning for Real-World Image Super-Resolution

Shuchen Lin,Mingtao Feng,Weisheng Dong,Fangfang Wu,Jianqiao Luo,Yaonan Wang,Guangming Shi

Main category: cs.CV

TL;DR: The paper proposes an improved Real-SR approach that addresses task imbalance within a multi-task learning framework through task definition, imbalance quantification, and adaptive data rebalancing.

Details Motivation: Real-world image super-resolution (Real-SR) is challenging because of the complex degradation patterns in low-resolution images, and conventional methods struggle to balance how different degradation patterns are handled. Method: A new task definition framework segments the degradation space with parameter-specific boundaries, complemented by a focal-loss-based multi-task weighting mechanism and an adaptive data rebalancing strategy. Result: Experiments show that the method performs well across all degradation tasks. Conclusion: By coordinating task definition, imbalance quantification, and data rebalancing, the method significantly improves Real-SR performance. Abstract: Real-world image super-resolution (Real-SR) is a challenging problem due to the complex degradation patterns in low-resolution images. Unlike approaches that assume a broadly encompassing degradation space, we focus specifically on achieving an optimal balance in how SR networks handle different degradation patterns within a fixed degradation space. We propose an improved paradigm that frames Real-SR as a data-heterogeneous multi-task learning problem. Our work addresses task imbalance in the paradigm through coordinated advancements in task definition, imbalance quantification, and adaptive data rebalancing. Specifically, we introduce a novel task definition framework that segments the degradation space by setting parameter-specific boundaries for degradation operators, effectively reducing the task quantity while maintaining task discrimination. We then develop a focal loss based multi-task weighting mechanism that precisely quantifies task imbalance dynamics during model training. Furthermore, to prevent sporadic outlier samples from dominating the gradient optimization of the shared multi-task SR model, we strategically convert the quantified task imbalance into controlled data rebalancing through deliberate regulation of task-specific training volumes. Extensive quantitative and qualitative experiments demonstrate that our method achieves consistent superiority across all degradation tasks.
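The two-step idea of quantifying imbalance with a focal-style weight and then converting it into per-task training volumes can be sketched as follows. Everything specific here is an assumption made for illustration: the per-task scores normalized to [0, 1], the `(1 - score) ** gamma` form borrowed from focal loss, the even floor that keeps any task from being starved, and the function and task names. The abstract does not give the paper's actual formulas.

```python
def focal_task_weights(task_scores, gamma=2.0):
    """Map per-task scores in [0, 1] to normalized focal-style weights.

    A poorly handled task (low score) gets a weight proportional to
    (1 - score) ** gamma, so harder tasks are emphasized.
    """
    raw = {task: (1.0 - score) ** gamma for task, score in task_scores.items()}
    total = sum(raw.values())
    if total == 0.0:
        # Every task is already at the maximum score: fall back to uniform weights.
        return {task: 1.0 / len(raw) for task in raw}
    return {task: value / total for task, value in raw.items()}


def rebalance_volumes(weights, budget, floor_fraction=0.3):
    """Turn task weights into per-task sample counts for the next training round.

    A fixed share of the budget is split evenly so no task is starved;
    the remainder follows the focal-style weights.
    """
    floor = floor_fraction * budget / len(weights)
    flexible = (1.0 - floor_fraction) * budget
    return {task: round(floor + flexible * w) for task, w in weights.items()}


# Hypothetical degradation tasks and normalized quality scores.
scores = {"blur": 0.9, "noise": 0.5, "jpeg": 0.5}
weights = focal_task_weights(scores)
volumes = rebalance_volumes(weights, budget=1000)
```

Regulating sample counts rather than scaling the loss directly is one way to keep a few outlier samples from dominating the shared gradient, which is the motivation the abstract gives for rebalancing data volumes.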

[52] Hallucinate, Ground, Repeat: A Framework for Generalized Visual Relationship Detection

Shanmukha Vellamcheti,Sanjoy Kundu,Sathyanarayanan N. Aakur

Main category: cs.CV

TL;DR: The paper proposes an iterative visual grounding framework that uses large language models (LLMs) as structured relational priors to improve the generalization of visual relationship detection (VRD).

Details Motivation: Existing VRD models rely on a fixed predicate set and generalize poorly to novel interactions. The paper targets the visual grounding of relationship hypotheses that are semantically plausible but unannotated. Method: In an expectation-maximization (EM) style procedure, the method alternates between generating candidate scene graphs with an LLM (expectation) and training a visual model to align them with perceptual evidence (maximization). Result: On a new Visual Genome benchmark, the model achieves mean recall (mR@50) of 15.9, 13.1, and 11.7 in the seen, unseen, and mixed settings, outperforming the baselines. Conclusion: The results indicate that grounded LLM priors are promising for scalable open-world visual understanding. Abstract: Understanding relationships between objects is central to visual intelligence, with applications in embodied AI, assistive systems, and scene understanding. Yet, most visual relationship detection (VRD) models rely on a fixed predicate set, limiting their generalization to novel interactions. A key challenge is the inability to visually ground semantically plausible, but unannotated, relationships hypothesized from external knowledge. This work introduces an iterative visual grounding framework that leverages large language models (LLMs) as structured relational priors. Inspired by expectation-maximization (EM), our method alternates between generating candidate scene graphs from detected objects using an LLM (expectation) and training a visual model to align these hypotheses with perceptual evidence (maximization). This process bootstraps relational understanding beyond annotated data and enables generalization to unseen predicates. Additionally, we introduce a new benchmark for open-world VRD on Visual Genome with 21 held-out predicates and evaluate under three settings: seen, unseen, and mixed. Our model outperforms LLM-only, few-shot, and debiased baselines, achieving mean recall (mR@50) of 15.9, 13.1, and 11.7 on predicate classification on these three sets. These results highlight the promise of grounded LLM priors for scalable open-world visual understanding.
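For readers unfamiliar with the metric, mean recall at K (mR@K) averages recall per predicate class rather than over all triplets, so rare predicates count as much as frequent ones. The following is a simplified single-image sketch under stated assumptions: triplets are matched by exact string equality, whereas real benchmarks also match bounding boxes and aggregate over the whole dataset; the function name and example triplets are hypothetical.

```python
from collections import defaultdict


def mean_recall_at_k(ground_truth, ranked_predictions, k=50):
    """Mean over predicates of the recall achieved within the top-k predictions.

    ground_truth: list of (subject, predicate, object) triplets.
    ranked_predictions: list of triplets sorted by descending confidence.
    """
    top_k = set(ranked_predictions[:k])
    hits = defaultdict(int)
    totals = defaultdict(int)
    for triplet in ground_truth:
        predicate = triplet[1]
        totals[predicate] += 1
        if triplet in top_k:
            hits[predicate] += 1
    if not totals:
        return 0.0
    # Average the per-predicate recalls so each predicate has equal influence.
    return sum(hits[p] / totals[p] for p in totals) / len(totals)


gt = [
    ("man", "riding", "horse"),
    ("man", "wearing", "hat"),
    ("horse", "on", "grass"),
    ("dog", "on", "grass"),
]
preds = [
    ("man", "riding", "horse"),
    ("horse", "on", "grass"),
    ("man", "on", "horse"),
]
score = mean_recall_at_k(gt, preds, k=2)
```

Here "riding" is fully recalled, "wearing" is missed, and "on" is half recalled, which is why a class-averaged metric penalizes models that only get frequent predicates right.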

[53] Aerial Multi-View Stereo via Adaptive Depth Range Inference and Normal Cues

Yimei Liu,Yakun Ju,Yuan Rao,Hao Fan,Junyu Dong,Feng Gao,Qian Du

Main category: cs.CV

TL;DR: ADR-MVS is an adaptive depth range multi-view stereo method that integrates monocular geometric cues with multi-view depth estimation, addressing varying depth ranges and insensitive feature matching in aerial images and significantly improving the accuracy of 3D urban reconstruction.

Details Motivation: Existing multi-view stereo methods perform poorly on aerial images, mainly because they overlook key differences between aerial and close-range settings, such as varying depth ranges and feature matching on low-detail images. Method: ADR-MVS uses a depth range predictor to generate adaptive range maps, combines monocular geometric cues with a multi-view framework to progressively refine depth estimates, and adds normal-guided cost aggregation and depth refinement modules. Result: ADR-MVS achieves state-of-the-art performance on the WHU, LuoJia-MVS, and München datasets and shows superior computational efficiency. Conclusion: Through adaptive depth ranges and normal guidance, ADR-MVS significantly improves the accuracy and efficiency of 3D reconstruction from aerial images. Abstract: Three-dimensional digital urban reconstruction from multi-view aerial images is a critical application where deep multi-view stereo (MVS) methods outperform traditional techniques. However, existing methods commonly overlook the key differences between aerial and close-range settings, such as varying depth ranges along epipolar lines and insensitive feature-matching associated with low-detail aerial images. To address these issues, we propose an Adaptive Depth Range MVS (ADR-MVS), which integrates monocular geometric cues to improve multi-view depth estimation accuracy. The key component of ADR-MVS is the depth range predictor, which generates adaptive range maps from depth and normal estimates using cross-attention discrepancy learning. In the first stage, the range map derived from monocular cues breaks through predefined depth boundaries, improving feature-matching discriminability and mitigating convergence to local optima. In later stages, the inferred range maps are progressively narrowed, ultimately aligning with the cascaded MVS framework for precise depth regression. Moreover, a normal-guided cost aggregation operation is specially devised for aerial stereo images to improve geometric awareness within the cost volume. Finally, we introduce a normal-guided depth refinement module that surpasses existing RGB-guided techniques. Experimental results demonstrate that ADR-MVS achieves state-of-the-art performance on the WHU, LuoJia-MVS, and München datasets, while exhibiting superior computational efficiency.

[54] TissUnet: Improved Extracranial Tissue and Cranium Segmentation for Children through Adulthood

Markian Mandzak,Elvira Yang,Anna Zapaishchykova,Yu-Hui Chen,Lucas Heilbroner,John Zielke,Divyanshu Tak,Reza Mojahed-Yazdi,Francesca Romana Mussa,Zezhong Ye,Sridhar Vajapeyam,Viviana Benitez,Ralph Salloum,Susan N. Chi,Houman Sotoudeh,Jakob Seidlitz,Sabine Mueller,Hugo J. W. L. Aerts,Tina Y. Poussaint,Benjamin H. Kann

Main category: cs.CV

TL;DR: TissUnet is a deep learning model that segments extracranial tissues (skull bone, subcutaneous fat, and muscle) from routine three-dimensional T1-weighted MRI, with high accuracy and reproducibility validated across multiple datasets.

Details Motivation: Extracranial tissues visible on brain MRI may be valuable for health assessment and clinical decision-making, but existing tools have not been widely validated, particularly for developing brains or underlying pathology. Method: TissUnet was trained on 155 paired MRI-CT scans and validated on nine datasets covering a wide age range and including patients with brain tumors. Result: Against expert manual annotations, TissUnet achieved median Dice coefficients of 0.83 in healthy individuals and 0.81 in tumor cases, outperforming the previous state-of-the-art method. Conclusion: TissUnet segments extracranial tissues quickly and accurately, supporting large-scale studies of craniofacial morphology, treatment effects, and cardiometabolic risk. Abstract: Extracranial tissues visible on brain magnetic resonance imaging (MRI) may hold significant value for characterizing health conditions and clinical decision-making, yet they are rarely quantified. Current tools have not been widely validated, particularly in settings of developing brains or underlying pathology. We present TissUnet, a deep learning model that segments skull bone, subcutaneous fat, and muscle from routine three-dimensional T1-weighted MRI, with or without contrast enhancement. The model was trained on 155 paired MRI-computed tomography (CT) scans and validated across nine datasets covering a wide age range and including individuals with brain tumors. In comparison to AI-CT-derived labels from 37 MRI-CT pairs, TissUnet achieved a median Dice coefficient of 0.79 [IQR: 0.77-0.81] in a healthy adult cohort. In a second validation using expert manual annotations, median Dice was 0.83 [IQR: 0.83-0.84] in healthy individuals and 0.81 [IQR: 0.78-0.83] in tumor cases, outperforming the previous state-of-the-art method. Acceptability testing resulted in an 89% acceptance rate after adjudication by a tie-breaker (N=108 MRIs), and TissUnet demonstrated excellent performance in the blinded comparative review (N=45 MRIs), including both healthy and tumor cases in pediatric populations. TissUnet enables fast, accurate, and reproducible segmentation of extracranial tissues, supporting large-scale studies on craniofacial morphology, treatment effects, and cardiometabolic risk using standard brain T1w MRI.
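The Dice coefficient reported above measures the overlap between a predicted mask A and a reference mask B as 2|A∩B| / (|A| + |B|), and the median is then taken over scans. A minimal sketch follows; the flat 0/1 list representation, the convention of returning 1.0 for two empty masks, and the toy masks are assumptions for illustration and are not data from the study.

```python
from statistics import median


def dice_coefficient(pred, target):
    """Dice overlap between two binary masks given as equal-length 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size_sum == 0:
        # Both masks are empty; treating this as perfect agreement is a convention.
        return 1.0
    return 2.0 * intersection / size_sum


# Toy masks standing in for flattened segmentation volumes.
cases = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),
    ([1, 1, 1, 0], [1, 1, 1, 0]),
    ([0, 1, 1, 0], [1, 1, 0, 0]),
]
per_case = [dice_coefficient(p, t) for p, t in cases]
median_dice = median(per_case)
```

Reporting the median with an interquartile range, as the abstract does, makes the summary robust to a few scans where segmentation fails badly.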

[55] DriveAction: A Benchmark for Exploring Human-like Driving Decisions in VLA Models

Yuhan Hao,Zhengning Li,Lei Sun,Weilong Wang,Naixin Yi,Sheng Song,Caihong Qin,Mofan Zhou,Yifei Zhan,Peng Jia,Xianpeng Lang

Main category: cs.CV

TL;DR: DriveAction is an action-driven benchmark designed for VLA models, comprising 16,185 QA pairs built from 2,610 driving scenarios. It addresses the shortcomings of existing benchmarks in scenario diversity, action annotation, and evaluation protocols.

Details Motivation: Existing benchmarks lack scenario diversity, reliable action annotation, and evaluation protocols aligned with human preferences, which limits the use of VLA models in autonomous driving. Method: DriveAction generates QA pairs from real driving data, provides discrete action labels, and adopts a tree-structured evaluation framework that explicitly links vision, language, and action tasks. Result: Experiments show that the evaluated models need both vision and language input; removing either lowers accuracy (by 3.3% without vision, 4.1% without language, and 8.0% without both). Conclusion: DriveAction provides new insights and a rigorous foundation for human-like decision-making in autonomous driving and supports precise identification of model bottlenecks. Abstract: Vision-Language-Action (VLA) models have advanced autonomous driving, but existing benchmarks still lack scenario diversity, reliable action-level annotation, and evaluation protocols aligned with human preferences. To address these limitations, we introduce DriveAction, the first action-driven benchmark specifically designed for VLA models, comprising 16,185 QA pairs generated from 2,610 driving scenarios. DriveAction leverages real-world driving data proactively collected by users of production-level autonomous vehicles to ensure broad and representative scenario coverage, offers high-level discrete action labels collected directly from users' actual driving operations, and implements an action-rooted tree-structured evaluation framework that explicitly links vision, language, and action tasks, supporting both comprehensive and task-specific assessment. Our experiments demonstrate that state-of-the-art vision-language models (VLMs) require both vision and language guidance for accurate action prediction: on average, accuracy drops by 3.3% without vision input, by 4.1% without language input, and by 8.0% without either. Our evaluation supports precise identification of model bottlenecks with robust and consistent results, thus providing new insights and a rigorous foundation for advancing human-like decisions in autonomous driving.

[56] Pts3D-LLM: Studying the Impact of Token Structure for 3D Scene Understanding With Large Language Models

Hugues Thomas,Chen Chen,Jian Zhang

Main category: cs.CV

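TL;DR: This paper studies how 3D scene representation matters for multimodal large language models (MLLMs) and proposes a new approach that incorporates 3D point cloud features, significantly improving performance.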

Details Motivation: Existing approaches rely mainly on 2D image features and use varied tokenization schemes, and 3D token structure has not been studied systematically. Method: The paper proposes a new approach that enriches visual tokens with 3D point cloud features from a Sonata-pretrained Point Transformer V3 encoder. Result: Experiments show that adding 3D features significantly improves performance, and that point-based token structures can rival video-based ones when points are cleverly sampled and ordered. Conclusion: The systematic analysis and transparent reporting are an important contribution toward robust progress in 3D understanding. Abstract: Effectively representing 3D scenes for Multimodal Large Language Models (MLLMs) is crucial yet challenging. Existing approaches commonly rely only on 2D image features and use varied tokenization approaches. This work presents a rigorous study of 3D token structures, systematically comparing video-based and point-based representations while maintaining consistent model backbones and parameters. We propose a novel approach that enriches visual tokens by incorporating 3D point cloud features from a Sonata pretrained Point Transformer V3 encoder. Our experiments demonstrate that merging explicit 3D features significantly boosts performance. Furthermore, we show that point-based token structures can rival video-based ones when the points are cleverly sampled and ordered. Our best models from both structures achieve state-of-the-art results on multiple 3D understanding benchmarks. We emphasize our analysis of token structures as a key contribution, alongside transparent reporting of results averaged over multiple seeds, a practice we believe is vital for robust progress in the field.

[57] MoralCLIP: Contrastive Alignment of Vision-and-Language Representations with Moral Foundations Theory

Ana Carolina Condez,Diogo Tavares,João Magalhães

Main category: cs.CV

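TL;DR: MoralCLIP is a new embedding representation method that extends multimodal learning with Moral Foundations Theory (MFT), filling the gap in vision-language models' understanding of moral dimensions.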

Details Motivation: Existing vision-language models cannot interpret or reason about the moral dimensions of content, which is a key aspect of human cognition. Method: MoralCLIP integrates visual and textual moral cues into a unified embedding space to achieve cross-modal moral alignment, and is trained using the multi-label Social-Moral Image Database. Result: Experiments show that explicit moral supervision improves both unimodal and multimodal understanding of moral content. Conclusion: MoralCLIP lays a foundation for AI systems that recognize and align with human moral values. Abstract: Recent advances in vision-language models have enabled rich semantic understanding across modalities. However, these encoding methods lack the ability to interpret or reason about the moral dimensions of content, a crucial aspect of human cognition. In this paper, we address this gap by introducing MoralCLIP, a novel embedding representation method that extends multimodal learning with explicit moral grounding based on Moral Foundations Theory (MFT). Our approach integrates visual and textual moral cues into a unified embedding space, enabling cross-modal moral alignment. MoralCLIP is grounded on the multi-label dataset Social-Moral Image Database to identify co-occurring moral foundations in visual content. For MoralCLIP training, we design a moral data augmentation strategy to scale our annotated dataset to 15,000 image-text pairs labeled with MFT-aligned dimensions. Our results demonstrate that explicit moral supervision improves both unimodal and multimodal understanding of moral content, establishing a foundation for morally-aware AI systems capable of recognizing and aligning with human moral values.

[58] Token Transforming: A Unified and Training-Free Token Compression Framework for Vision Transformer Acceleration

Fanhu Zeng,Deli Yu,Zhenglun Kong,Hao Tang

Main category: cs.CV

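TL;DR: This paper proposes a general token transforming framework that unifies existing token compression methods, reduces information loss, and enables training-free acceleration.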

Details Motivation: Because vision transformers are computationally expensive, dynamic token compression has become an active research topic, but existing methods lose information and require post-training. Method: Token compression is unified as a token matrix transformation, and a many-to-many Token Transforming framework is proposed that retains more information and supports training-free acceleration. Result: Experiments show that the method cuts FLOPs by 40% and accelerates DeiT-S by 1.5x with only a 0.1% accuracy drop, and that it performs well on dense prediction tasks. Conclusion: The method offers a better computation-performance trade-off and substantially reduces computational cost and inference time. Abstract: Vision transformers have been widely explored in various vision tasks. Due to their heavy computational cost, much interest has arisen in compressing vision transformers dynamically at the token level. Current methods mainly focus on token pruning or merging to reduce the number of tokens, in which tokens are compressed exclusively, causing great information loss, so post-training is inevitably required to recover the performance. In this paper, we rethink token reduction and unify the process as an explicit form of token matrix transformation, in which all existing methods construct special forms of matrices within the framework. Furthermore, we propose a many-to-many Token Transforming framework that serves as a generalization of all existing methods and preserves the most information, even enabling training-free acceleration. We conduct extensive experiments to validate our framework. Specifically, we reduce FLOPs by 40% and accelerate DeiT-S by 1.5$\times$ with a marginal 0.1% accuracy drop. Furthermore, we extend the method to dense prediction tasks including segmentation, object detection, depth estimation, and language model generation. Results demonstrate that the proposed method consistently achieves substantial improvements, offering a better computation-performance trade-off, impressive budget reduction and inference acceleration.
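
The idea of viewing token reduction as a matrix transformation can be illustrated with a small sketch. This is only an illustration of the general idea, not the paper's algorithm: the function name, the example weights, and the toy 2-dimensional tokens are all assumptions. With N input tokens stacked as rows of X, a reduction to M tokens is Y = W X for an M-by-N matrix W; pruning, merging, and a many-to-many mix differ only in the shape of W.

```python
def transform_tokens(weights, tokens):
    """Compute Y = W X, where W is M x N and X holds N tokens of dimension D."""
    dim = len(tokens[0])
    return [
        [sum(w * tok[d] for w, tok in zip(row, tokens)) for d in range(dim)]
        for row in weights
    ]


tokens = [[1, 0], [3, 2], [5, 4]]          # three toy tokens (hypothetical values)

prune = [[1, 0, 0], [0, 0, 1]]             # keep tokens 0 and 2, drop token 1
merge = [[1, 0, 0], [0, 0.5, 0.5]]         # average tokens 1 and 2 into one
many_to_many = [[0.5, 0.5, 0], [0, 0.5, 0.5]]  # token 1 contributes to both outputs

pruned = transform_tokens(prune, tokens)
merged = transform_tokens(merge, tokens)
mixed = transform_tokens(many_to_many, tokens)
```

In the pruning and merging matrices each input token feeds at most one output, whereas in the many-to-many matrix a single input token can contribute to several outputs, which is the sense in which less information is discarded.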

[59] You Only Estimate Once: Unified, One-stage, Real-Time Category-level Articulated Object 6D Pose Estimation for Robotic Grasping

Jingshun Huang,Haitao Lin,Tianyu Wang,Yanwei Fu,Yu-Gang Jiang,Xiangyang Xue

Main category: cs.CV

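TL;DR: The paper proposes YOEO, a single-stage method for category-level pose estimation of articulated objects in robotic tasks, avoiding the high computational cost and poor real-time performance of multi-stage pipelines.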

Details Motivation: Existing methods are computationally expensive and perform poorly in real-time robotic tasks. Method: A unified network generates point-wise semantic labels and centroid offsets, a clustering algorithm separates instances, and poses and sizes are recovered by aligning the separated regions. Result: Pose estimation capability is validated on the GAPart dataset, and the method provides real-time feedback at 200Hz in a real-world setting. Conclusion: YOEO is efficient and practical for real-time robotic interaction tasks. Abstract: This paper addresses the problem of category-level pose estimation for articulated objects in robotic manipulation tasks. Recent works have shown promising results in estimating part pose and size at the category level. However, these approaches primarily follow a complex multi-stage pipeline that first segments part instances in the point cloud and then estimates the Normalized Part Coordinate Space (NPCS) representation for 6D poses. These approaches suffer from high computational costs and low performance in real-time robotic tasks. To address these limitations, we propose YOEO, a single-stage method that simultaneously outputs instance segmentation and NPCS representations in an end-to-end manner. We use a unified network to generate point-wise semantic labels and centroid offsets, allowing points from the same part instance to vote for the same centroid. We further utilize a clustering algorithm to distinguish points based on their estimated centroid distances. Finally, we separate the NPCS region of each instance and align the separated regions with the real point cloud to recover the final pose and size. Experimental results on the GAPart dataset demonstrate the pose estimation capabilities of our proposed single-shot method. We also deploy our synthetically-trained model in a real-world setting, providing real-time visual feedback at 200Hz, enabling a physical Kinova robot to interact with unseen articulated objects. This showcases the utility and effectiveness of our proposed method.

[60] Investigating the Relationship between Weighted Figure of Merit and Rosin's Measure

Bimal Kumar Ray

Main category: cs.CV

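TL;DR: The study finds that weighted figure of merit and Rosin's measure are theoretically independent and uncorrelated, so one cannot substitute for the other.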

Details Motivation: To settle whether weighted figure of merit can replace Rosin's measure when assessing polygonal approximations. Method: The two measures are compared through theoretical analysis, experiments, and a statistical test (Pearson's correlation coefficient). Result: Both the theoretical proofs and the experimental data show that the two measures are independent and uncorrelated. Conclusion: Weighted figure of merit cannot replace Rosin's measure, because the two lead to inconsistent conclusions. Abstract: Many studies have been conducted to solve the problem of approximating a digital boundary by piecewise straight-line segments for further processing required in computer vision applications. The authors of these studies compared their schemes to determine the best one. The initial measure used to assess the goodness of a polygonal approximation was figure of merit. Later, it was pointed out that this measure was not an appropriate metric, which is why Rosin, through mathematical analysis, introduced a measure called merit. However, this measure involves the optimal scheme of polygonal approximation, so it is time-consuming to compute when assessing the goodness of an approximation. This led many researchers to use weighted figure of merit as a substitute for Rosin's measure to compare sub-optimal schemes. An attempt is made in this communication to investigate whether the two measures, weighted figure of merit and Rosin's measure, are related so that one can be used instead of the other; toward this end, theoretical analysis, experimental investigation and statistical analysis are carried out. The mathematical formulas for weighted figure of merit and Rosin's measure are analyzed, and through proofs of theorems it is found that the two measures are theoretically independent of each other. The graphical analysis of experiments carried out using a public dataset supports the theoretical analysis. The statistical analysis using Pearson's correlation coefficient also establishes that the two measures are uncorrelated. This analysis leads one to conclude that if a sub-optimal scheme is found to be better (worse) than some other sub-optimal scheme as indicated by Rosin's measure, then the same conclusion cannot be drawn using weighted figure of merit, and so one cannot use weighted figure of merit instead of Rosin's measure.
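
Pearson's correlation coefficient, the statistic named in the abstract, can be computed as below. This is a minimal sketch of the textbook formula; the function name and the two score lists are hypothetical and are not data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r = cov(x, y) / (std(x) * std(y)) for two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when a sample has zero variance")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores of five approximation schemes under two measures.
measure_a = [1, 2, 3, 4, 5]
measure_b = [2, 1, 3, 1, 2]
r = pearson_r(measure_a, measure_b)
```

A value of r near zero, as in this made-up example, means that ranking schemes by one measure gives no linear indication of how they rank under the other.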

[61] Where Is The Ball: 3D Ball Trajectory Estimation From 2D Monocular Tracking

Puntawat Ponglertnapakorn,Supasorn Suwajanakorn

Main category: cs.CV

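TL;DR: Proposes a method for estimating 3D ball trajectories from 2D tracking sequences, resolving the 2D-to-3D ambiguity with an LSTM pipeline and a canonical 3D representation.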

Details Motivation: To resolve the ambiguity of estimating a 3D ball trajectory from 2D video and to generalize to arbitrary views and real-world scenes. Method: An LSTM-based pipeline uses a canonical 3D representation and intermediate representations to achieve view independence and reprojection consistency. Result: The method performs well on synthetic and real datasets and generalizes to real-world scenes despite being trained only on simulated data. Conclusion: The method has broad application potential in sports analysis and virtual replay. Abstract: We present a method for 3D ball trajectory estimation from a 2D tracking sequence. To overcome the ambiguity in 3D from 2D estimation, we design an LSTM-based pipeline that utilizes a novel canonical 3D representation that is independent of the camera's location to handle arbitrary views and a series of intermediate representations that encourage crucial invariance and reprojection consistency. We evaluated our method on four synthetic and three real datasets and conducted extensive ablation studies on our design choices. Despite training solely on simulated data, our method achieves state-of-the-art performance and can generalize to real-world scenarios with multiple trajectories, opening up a range of applications in sport analysis and virtual replay. Please visit our page: https://where-is-the-ball.github.io.

[62] Do Large Vision-Language Models Distinguish between the Actual and Apparent Features of Illusions?

Taiga Shinozaki,Tomoki Doi,Satoshi Nishida,Hitomi Yanaka

Main category: cs.CV

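TL;DR: The study examines whether large vision-language models (LVLMs) are as susceptible to visual illusions as humans, and introduces a new visual question answering (VQA) dataset that separates genuine from fake illusions.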

Details Motivation: Humans are susceptible to optical illusions, but it is unclear whether machines behave similarly. Prior studies did not distinguish actual from apparent features, which makes assessments of machine cognition ambiguous. Method: A VQA dataset containing genuine and fake illusions is built to evaluate how well LVLMs distinguish actual from apparent features. Result: LVLMs give the same answers for genuine and fake illusion questions, suggesting that their responses may rest on prior knowledge of illusions rather than genuine visual understanding. Conclusion: LVLMs may rely on prior knowledge rather than visual understanding to recognize illusions; the dataset provides a tool for future research. Abstract: Humans are susceptible to optical illusions, which serve as valuable tools for investigating sensory and cognitive processes. Inspired by human vision studies, research has begun exploring whether machines, such as large vision language models (LVLMs), exhibit similar susceptibilities to visual illusions. However, studies often have used non-abstract images and have not distinguished actual and apparent features, leading to ambiguous assessments of machine cognition. To address these limitations, we introduce a visual question answering (VQA) dataset, categorized into genuine and fake illusions, along with corresponding control images. Genuine illusions present discrepancies between actual and apparent features, whereas fake illusions have the same actual and apparent features even though they look illusory due to the similar geometric configuration. We evaluate the performance of LVLMs for genuine and fake illusion VQA tasks and investigate whether the models discern actual and apparent features. Our findings indicate that although LVLMs may appear to recognize illusions by correctly answering questions about both feature types, they predict the same answers for both Genuine Illusion and Fake Illusion VQA questions. This suggests that their responses might be based on prior knowledge of illusions rather than genuine visual understanding. The dataset is available at https://github.com/ynklab/FILM

[63] Robust sensor fusion against on-vehicle sensor staleness

Meng Fan,Yifan Zuo,Patrick Blaes,Harley Montgomery,Subhasis Das

Main category: cs.CV

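TL;DR: The paper proposes a new approach to temporally misaligned sensor data in autonomous driving, using timestamp offset features and a data augmentation strategy to improve the perception system.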

Details Motivation: Temporal misalignment of sensor data (staleness) leads to inconsistent object state estimates and severely degrades trajectory prediction, which is critical for the safety of autonomous driving. Method: Two components are proposed: (1) a timestamp offset feature for LiDAR and radar data relative to the camera; (2) data augmentation that simulates the sensor staleness patterns seen in deployment. Result: Experiments show that a conventional model degrades significantly when sensor data is stale, while the new approach performs well under both synchronized and stale conditions. Conclusion: The approach is model-agnostic and effectively improves the robustness of autonomous driving perception to temporally misaligned sensor data. Abstract: Sensor fusion is crucial for a performant and robust Perception system in autonomous vehicles, but sensor staleness, where data from different sensors arrives with varying delays, poses significant challenges. Temporal misalignment between sensor modalities leads to inconsistent object state estimates, severely degrading the quality of trajectory predictions that are critical for safety. We present a novel and model-agnostic approach to address this problem via (1) a per-point timestamp offset feature (for LiDAR and radar both relative to camera) that enables fine-grained temporal awareness in sensor fusion, and (2) a data augmentation strategy that simulates realistic sensor staleness patterns observed in deployed vehicles. Our method is integrated into a perspective-view detection model that consumes sensor data from multiple LiDARs, radars and cameras. We demonstrate that while a conventional model shows significant regressions when one sensor modality is stale, our approach reaches consistently good performance across both synchronized and stale conditions.
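
The two ingredients described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the point layout `(x, y, z, t_ms)`, the sweep buffer, both function names, and the fixed delay are all hypothetical, and the real staleness patterns are drawn from deployed vehicles rather than chosen by hand.

```python
def add_camera_relative_offsets(points, camera_time_ms):
    """Replace each point's absolute timestamp with its offset from the camera frame.

    points: list of (x, y, z, t_ms) tuples (hypothetical layout).
    """
    return [(x, y, z, t_ms - camera_time_ms) for (x, y, z, t_ms) in points]


def pick_stale_sweep(sweeps, camera_time_ms, delay_ms):
    """Augmentation step: return the newest sweep captured at or before
    camera_time_ms - delay_ms, mimicking a sensor whose data arrives late.

    sweeps: list of (sweep_time_ms, points), sorted by time ascending.
    """
    cutoff = camera_time_ms - delay_ms
    eligible = [sweep for sweep in sweeps if sweep[0] <= cutoff]
    if not eligible:
        raise ValueError("no sweep old enough for the requested delay")
    return eligible[-1]


sweeps = [
    (900, [(1.0, 0.0, 0.0, 900)]),
    (1000, [(1.1, 0.0, 0.0, 1000)]),
]
# Synchronized case: the LiDAR sweep matches the camera frame at t = 1000 ms.
fresh = add_camera_relative_offsets(pick_stale_sweep(sweeps, 1000, 0)[1], 1000)
# Stale case: the LiDAR data is 100 ms older than the camera frame.
stale = add_camera_relative_offsets(pick_stale_sweep(sweeps, 1000, 100)[1], 1000)
```

Because the offset travels with every point, a downstream model can in principle tell a fresh sweep from a stale one instead of assuming all modalities are synchronized.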

[64] GazeNLQ @ Ego4D Natural Language Queries Challenge 2025

Wei-Cheng Lin,Chih-Ming Lien,Chen Lo,Chia-Hung Yeh

Main category: cs.CV

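TL;DR: GazeNLQ uses gaze estimation to augment video representations and a contrastive-learning pretraining strategy to improve the accuracy of video segment retrieval for natural language queries.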

Details Motivation: Gaze is a non-verbal communication cue that reflects visual attention and reveals human intention and cognition, so it can be used to improve video retrieval. Method: GazeNLQ pretrains a gaze estimation model with contrastive learning and incorporates the estimated gaze into the video representation to improve localization accuracy. Result: GazeNLQ achieves R1@IoU0.3 and R1@IoU0.5 scores of 27.82 and 18.68, respectively. Conclusion: Gaze estimation improves the accuracy of video segment retrieval in GazeNLQ; the code is open source. Abstract: This report presents our solution to the Ego4D Natural Language Queries (NLQ) Challenge at CVPR 2025. Egocentric video captures the scene from the wearer's perspective, where gaze serves as a key non-verbal communication cue that reflects visual attention and offers insights into human intention and cognition. Motivated by this, we propose a novel approach, GazeNLQ, which leverages gaze to retrieve video segments that match given natural language queries. Specifically, we introduce a contrastive learning-based pretraining strategy for gaze estimation directly from video. The estimated gaze is used to augment video representations within the proposed model, thereby enhancing localization accuracy. Experimental results show that GazeNLQ achieves R1@IoU0.3 and R1@IoU0.5 scores of 27.82 and 18.68, respectively. Our code is available at https://github.com/stevenlin510/GazeNLQ.
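
For readers unfamiliar with the R1@IoU metric, the sketch below shows how it is commonly computed for temporal localization: a query counts as a hit when the top-ranked predicted segment overlaps the ground-truth segment with temporal IoU at or above the threshold. This reflects the usual definition rather than the challenge's official evaluation script; the function names and example segments are assumptions.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) time intervals in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, threshold):
    """Percentage of queries whose top-1 segment reaches the IoU threshold."""
    hits = sum(
        1 for pred, gt in zip(top1_preds, gts)
        if temporal_iou(pred, gt) >= threshold
    )
    return 100.0 * hits / len(gts)


# Two hypothetical queries: a partial overlap (IoU = 1/3) and an exact match.
top1_preds = [(0.0, 10.0), (20.0, 30.0)]
gts = [(5.0, 15.0), (20.0, 30.0)]
r1_at_03 = recall_at_1(top1_preds, gts, 0.3)
r1_at_05 = recall_at_1(top1_preds, gts, 0.5)
```

The stricter threshold can only lower the score, which is consistent with the reported 27.82 at IoU 0.3 versus 18.68 at IoU 0.5.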

[65] EASG-Bench: Video Q&A Benchmark with Egocentric Action Scene Graphs

Ivan Rodin,Tz-Ying Wu,Kyle Min,Sharath Nittur Sridhar,Antonino Furnari,Subarna Tripathi,Giovanni Maria Farinella

Main category: cs.CV

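TL;DR: EASG-Bench is a question-answering benchmark for egocentric video whose question-answer pairs are generated from spatio-temporal dynamic scene graphs. It evaluates language-only models and video large language models and reveals a performance gap on temporal questions.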

Details Motivation: To study the understanding of complex relationships in egocentric video and fill a research gap in long-context video understanding. Method: The paper proposes the EASG-Bench benchmark and a systematic evaluation framework, and tests language-only models and video large language models. Result: A performance gap is found for language-only models and video-LLMs, especially on temporal-ordering questions. Conclusion: EASG-Bench provides a tool and a direction for research on long-context video understanding. Abstract: We introduce EASG-Bench, a question-answering benchmark for egocentric videos where the question-answering pairs are created from spatio-temporally grounded dynamic scene graphs capturing intricate relationships among actors, actions, and objects. We propose a systematic evaluation framework and evaluate several language-only and video large language models (video-LLMs) on this benchmark. We observe a performance gap in language-only and video-LLMs, especially on questions focusing on temporal ordering, thus identifying a research gap in the area of long-context video understanding. To promote the reproducibility of our findings and facilitate further research, the benchmark and accompanying code are available at the following GitHub page: https://github.com/fpv-iplab/EASG-bench.

[66] LLIA -- Enabling Low-Latency Interactive Avatars: Real-Time Audio-Driven Portrait Video Generation with Diffusion Models

Haojie Yu,Zhaonian Wang,Yihan Pan,Meng Cheng,Hao Yang,Chao Wang,Tao Xie,Xiaoming Xu,Xiaoming Wei,Xunliang Cai

Main category: cs.CV

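TL;DR: Proposes a diffusion-based framework for audio-driven portrait video generation that achieves low-latency, high-fidelity real-time interaction through optimized generation and training strategies.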

Details Motivation: Diffusion models excel at virtual human generation, but their computational demands make it hard to meet the speed and latency requirements of real-time interactive applications. Method: The paper proposes variable-length video generation, a consistency model training strategy, model quantization and pipeline parallelism, and a new inference strategy for long-duration video generation. Result: On an NVIDIA RTX 4090D, the model reaches 78 FPS at 384x384 and 45 FPS at 512x512, with initial generation latencies of 140 ms and 215 ms, respectively. Conclusion: The approach achieves low latency and fluid two-way interaction while maintaining high fidelity. Abstract: Diffusion-based models have gained wide adoption in virtual human generation due to their outstanding expressiveness. However, their substantial computational requirements have constrained their deployment in real-time interactive avatar applications, where stringent speed, latency, and duration requirements are paramount. We present a novel audio-driven portrait video generation framework based on the diffusion model to address these challenges. Firstly, we propose robust variable-length video generation to reduce the minimum time required to generate the initial video clip or state transitions, which significantly enhances the user experience. Secondly, we propose a consistency model training strategy for Audio-Image-to-Video to ensure real-time performance, enabling a fast few-step generation. Model quantization and pipeline parallelism are further employed to accelerate the inference speed. To mitigate the stability loss incurred by the diffusion process and model quantization, we introduce a new inference strategy tailored for long-duration video generation. These methods ensure real-time performance and low latency while maintaining high-fidelity output. Thirdly, we incorporate class labels as a conditional input to seamlessly switch between speaking, listening, and idle states. Lastly, we design a novel mechanism for fine-grained facial expression control to exploit our model's inherent capacity. Extensive experiments demonstrate that our approach achieves low-latency, fluid, and authentic two-way communication. On an NVIDIA RTX 4090D, our model achieves a maximum of 78 FPS at a resolution of 384x384 and 45 FPS at a resolution of 512x512, with an initial video generation latency of 140 ms and 215 ms, respectively.

[67] NTIRE 2025 Challenge on HR Depth from Images of Specular and Transparent Surfaces

Pierluigi Zama Ramirez,Fabio Tosi,Luigi Di Stefano,Radu Timofte,Alex Costanzino,Matteo Poggi,Samuele Salti,Stefano Mattoccia,Zhe Zhang,Yang Yang,Wu Chen,Anlong Ming,Mingshuai Zhao,Mengying Yu,Shida Gao,Xiangfeng Wang,Feng Xue,Jun Shi,Yong Yang,Yong A,Yixiang Jin,Dingzhe Li,Aryan Shukla,Liam Frija-Altarac,Matthew Toews,Hui Geng,Tianjiao Wan,Zijian Gao,Qisheng Xu,Kele Xu,Zijian Zang,Jameer Babu Pinjari,Kuldeep Purohit,Mykola Lavreniuk,Jing Cao,Shenyi Li,Kui Jiang,Junjun Jiang,Yong Huang

Main category: cs.CV

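TL;DR: The NTIRE 2025 challenge focuses on depth estimation for high-resolution images and non-Lambertian surfaces, with stereo and single-image tracks. It attracted about 177 registered participants, and four teams in each track submitted models in the final stage.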

Details Motivation: To address two major open issues in depth estimation: high resolution and non-Lambertian surfaces. Method: A depth estimation challenge with two tracks, stereo and single-image. Result: About 177 participants registered, and four teams in each track submitted models in the final stage. Conclusion: The challenge advances research on depth estimation for high-resolution images and non-Lambertian surfaces. Abstract: This paper reports on the NTIRE 2025 challenge on HR Depth from Images of Specular and Transparent Surfaces, held in conjunction with the New Trends in Image Restoration and Enhancement (NTIRE) workshop at CVPR 2025. This challenge aims to advance the research on depth estimation, specifically to address two of the main open issues in the field: high-resolution and non-Lambertian surfaces. The challenge proposes two tracks on stereo and single-image depth estimation, attracting about 177 registered participants. In the final testing stage, four participating teams in each of the two tracks submitted their models and fact sheets.

[68] DeformCL: Learning Deformable Centerline Representation for Vessel Extraction in 3D Medical Image

Ziwei Zhao,Zhixing Zhang,Yuhang Liu,Zhao Zhang,Haojun Yu,Dong Wang,Liwei Wang

Main category: cs.CV

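TL;DR: DeformCL proposes a continuous representation based on deformable centerlines for accurate vessel extraction in 3D medical images, overcoming the limitations of traditional discrete representations.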

Details Motivation: Traditional discrete representations based on per-pixel classification (such as masks) tend to produce local fractures or scattered fragments in vessels, which harms the accuracy of clinical diagnosis. Method: DeformCL represents vessels with deformable centerlines, where centerline points act as nodes and edges capture spatial relationships, giving natural connectivity, noise robustness, and ease of interaction. A cascaded training pipeline is used to exploit these properties. Result: Experiments on four 3D vessel segmentation datasets show that DeformCL outperforms traditional methods, and clinical visualizations confirm its significance. Conclusion: DeformCL provides a better continuous representation for 3D vessel extraction with clear clinical value. The code is open source. Abstract: In the field of 3D medical imaging, accurately extracting and representing the blood vessels with curvilinear structures holds paramount importance for clinical diagnosis. Previous methods have commonly relied on discrete representations such as masks, often resulting in local fractures or scattered fragments due to the inherent limitations of the per-pixel classification paradigm. In this work, we introduce DeformCL, a new continuous representation based on Deformable Centerlines, where centerline points act as nodes connected by edges that capture spatial relationships. Compared with previous representations, DeformCL offers three key advantages: natural connectivity, noise robustness, and interaction facility. We present a comprehensive training pipeline structured in a cascaded manner to fully exploit these favorable properties of DeformCL. Extensive experiments on four 3D vessel segmentation datasets demonstrate the effectiveness and superiority of our method. Furthermore, the visualization of curved planar reformation images validates the clinical significance of the proposed framework. We release the code at https://github.com/barry664/DeformCL

[69] FuseUNet: A Multi-Scale Feature Fusion Method for U-like Networks

Quansong He,Xiangde Min,Kaishen Wang,Tao He

Main category: cs.CV

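TL;DR: The paper proposes a new multi-scale feature fusion method that treats UNet decoding as an initial value problem and uses the linear multistep method to improve feature fusion, addressing the limitations of conventional UNet skip connections.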

Details Motivation: Conventional UNet skip connections lack effective interaction between features at different scales and rely on simple concatenation or addition, which limits how efficiently information is integrated. Method: The UNet decoding process is modeled as an initial value problem, and the linear multistep method is used to achieve adaptive multi-scale feature fusion independently of the encoder and decoder architectures. Result: The method is validated on several medical image segmentation datasets, improving feature utilization and reducing network parameters while maintaining high performance. Conclusion: The method offers a general multi-scale feature fusion solution for the UNet family, with broad applicability and efficiency. Abstract: Medical image segmentation is a critical task in computer vision, with UNet serving as a milestone architecture. The typical component of the UNet family is the skip connection; however, these skip connections face two significant limitations: (1) they lack effective interaction between features at different scales, and (2) they rely on simple concatenation or addition operations, which constrain efficient information integration. While recent improvements to UNet have focused on enhancing encoder and decoder capabilities, these limitations remain overlooked. To overcome these challenges, we propose a novel multi-scale feature fusion method that reimagines the UNet decoding process as solving an initial value problem (IVP), treating skip connections as discrete nodes. By leveraging principles from the linear multistep method, we propose an adaptive ordinary differential equation method to enable effective multi-scale feature fusion. Our approach is independent of the encoder and decoder architectures, making it adaptable to various U-Net-like networks. Experiments on ACDC, KiTS2023, MSD brain tumor, and ISIC2017/2018 skin lesion segmentation datasets demonstrate improved feature utilization, reduced network parameters, and maintained high performance. The code is available at https://github.com/nayutayuki/FuseUNet.
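
As background on the linear multistep method mentioned above, here is a sketch of the textbook two-step Adams-Bashforth scheme for an initial value problem y' = f(t, y). It illustrates the numerical principle only; it is not FuseUNet's fusion rule, and the function name, the starter value `y1`, and the test equation are assumptions chosen for illustration.

```python
def adams_bashforth2(f, t0, y0, y1, h, steps):
    """Two-step Adams-Bashforth: y[n+2] = y[n+1] + h * (3/2 f[n+1] - 1/2 f[n]).

    y0 and y1 are the values at t0 and t0 + h; a multistep method needs
    more than one starting value, here supplied by the caller.
    """
    ys = [y0, y1]
    for n in range(steps):
        t_prev = t0 + n * h
        t_curr = t0 + (n + 1) * h
        # The update combines slopes from the two most recent nodes.
        y_next = ys[-1] + h * (1.5 * f(t_curr, ys[-1]) - 0.5 * f(t_prev, ys[-2]))
        ys.append(y_next)
    return ys


# y' = 2t with y(0) = 0 has the exact solution y = t^2.
trajectory = adams_bashforth2(lambda t, y: 2.0 * t, 0.0, 0.0, 0.25, 0.5, 2)
```

The key property is that each new value depends on several earlier nodes rather than only the latest one, which is loosely analogous to a decoder stage drawing on features from more than one previous scale.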

[70] High Throughput Event Filtering: The Interpolation-based DIF Algorithm Hardware Architecture

Marcin Kowalczyk,Tomasz Kryjak

Main category: cs.CV

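TL;DR: The paper proposes an FPGA-based hardware architecture of the DIF filter for handling noise in event vision sensors and releases a high-resolution event dataset. The architecture outperforms existing solutions in throughput and in the range of noise levels it handles.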

Details Motivation: Event vision is developing rapidly, but the sensor data stream contains substantial noise that depends on factors such as illumination and temperature, so efficient hardware solutions are needed. Method: An FPGA-based hardware architecture of the DIF filter is proposed and implemented, and a high-resolution event dataset is prepared for evaluation. Result: The architecture reaches a throughput of 403.39 MEPS (1280x720) and 428.45 MEPS (640x480), with average AUROC between 0.844 and 0.999; filtering quality is comparable to the state of the art, with much higher throughput. Conclusion: The DIF filter offers high throughput and works well across a wide range of noise levels, making it suitable for event vision. Abstract: In recent years, there has been rapid development in the field of event vision. It manifests itself both on the technical side, as better and better event sensors are available, and on the algorithmic side, as more and more applications of this technology are proposed and scientific papers are published. However, the data stream from these sensors typically contains a significant amount of noise, which varies depending on factors such as the degree of illumination in the observed scene or the temperature of the sensor. We propose a hardware architecture of the Distance-based Interpolation with Frequency Weights (DIF) filter and implement it on an FPGA chip. To evaluate the algorithm and compare it with other solutions, we have prepared a new high-resolution event dataset, which we are also releasing to the community. Our architecture achieved a throughput of 403.39 million events per second (MEPS) for a sensor resolution of 1280 x 720 and 428.45 MEPS for a resolution of 640 x 480. The average values of the Area Under the Receiver Operating Characteristic (AUROC) index ranged from 0.844 to 0.999, depending on the dataset, which is comparable to the state-of-the-art filtering solutions, but with much higher throughput and better operation over a wide range of noise levels.

[71] FontAdapter: Instant Font Adaptation in Visual Text Generation

Myungkyu Koo,Subin Kim,Sangkyung Kwak,Jaehyun Nam,Seojin Kim,Jinwoo Shin

Main category: cs.CV

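TL;DR: FontAdapter is a framework for quickly generating visual text in unseen fonts, achieving efficient font customization through two-stage curriculum learning.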

Details Motivation: Existing methods adapt inefficiently to unseen fonts outside the preset dictionary, making real-time customization impractical. Method: Two-stage curriculum learning is used: font attributes are first extracted from isolated glyphs and then integrated into natural backgrounds, with synthetic datasets constructed to support each stage. Result: FontAdapter generates high-quality text in unseen fonts without additional fine-tuning and supports several font customization tasks. Conclusion: FontAdapter is an efficient and versatile font customization framework for real-time generation and editing in unseen fonts. Abstract: Text-to-image diffusion models have significantly improved the seamless integration of visual text into diverse image contexts. Recent approaches further improve control over font styles through fine-tuning with predefined font dictionaries. However, adapting unseen fonts outside the preset is computationally expensive, often requiring tens of minutes, making real-time customization impractical. In this paper, we present FontAdapter, a framework that enables visual text generation in unseen fonts within seconds, conditioned on a reference glyph image. To this end, we find that direct training on font datasets fails to capture nuanced font attributes, limiting generalization to new glyphs. To overcome this, we propose a two-stage curriculum learning approach: FontAdapter first learns to extract font attributes from isolated glyphs and then integrates these styles into diverse natural backgrounds. To support this two-stage training scheme, we construct synthetic datasets tailored to each stage, leveraging large-scale online fonts effectively. Experiments demonstrate that FontAdapter enables high-quality, robust font customization across unseen fonts without additional fine-tuning during inference. Furthermore, it supports visual text editing, font style blending, and cross-lingual font transfer, positioning FontAdapter as a versatile framework for font customization tasks.

[72] Cross-View Multi-Modal Segmentation @ Ego-Exo4D Challenges 2025

Yuqian Fu,Runze Wang,Yanwei Fu,Danda Pani Paudel,Luc Van Gool

Main category: cs.CV

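TL;DR: Proposes a cross-view multi-modal object segmentation approach for the object correspondence task in the Ego-Exo4D Correspondence Challenges, fusing visual masks and textual descriptions to improve segmentation and adding a cross-view alignment module for robustness.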

Details Motivation: To solve object correspondence across perspectives (such as ego and exo views), in particular the challenge posed by the visual domain gap. Method: A multimodal condition fusion module and a cross-view object alignment module are proposed, using visual masks and textual descriptions as segmentation conditions. Result: The method ranked second on the Ego-Exo4D object correspondence benchmark. Conclusion: Multimodal fusion and cross-view alignment improve the accuracy and robustness of object segmentation. Abstract: In this report, we present a cross-view multi-modal object segmentation approach for the object correspondence task in the Ego-Exo4D Correspondence Challenges 2025. Given object queries from one perspective (e.g., ego view), the goal is to predict the corresponding object masks in another perspective (e.g., exo view). To tackle this task, we propose a multimodal condition fusion module that enhances object localization by leveraging both visual masks and textual descriptions as segmentation conditions. Furthermore, to address the visual domain gap between ego and exo views, we introduce a cross-view object alignment module that enforces object-level consistency across perspectives, thereby improving the model's robustness to viewpoint changes. Our proposed method ranked second on the leaderboard of the large-scale Ego-Exo4D object correspondence benchmark. Code will be made available at https://github.com/lovelyqian/ObjectRelator.

[73] ChronoTailor: Harnessing Attention Guidance for Fine-Grained Video Virtual Try-On

Jinjuan Wang,Wenzhang Sun,Ming Li,Yun Zheng,Fanyao Li,Zhulin Tao,Donglin Di,Hao Li,Wei Chen,Xianglin Huang

Main category: cs.CV

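TL;DR: ChronoTailor is a diffusion-based video virtual try-on framework that uses spatio-temporal attention and feature fusion to achieve temporal consistency and preserve garment details, significantly outperforming existing methods.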

Details Motivation: Existing video virtual try-on methods struggle to maintain continuity and reproduce garment details. Method: ChronoTailor combines region-aware spatial guidance with attention-driven temporal feature fusion, and uses multi-scale garment features and garment-pose alignment to ensure spatio-temporal continuity and detail preservation. Result: Experiments show that ChronoTailor significantly outperforms existing methods in spatio-temporal continuity and garment detail preservation. Conclusion: ChronoTailor addresses key problems in video virtual try-on through its spatio-temporal attention mechanism and feature fusion, and contributes a new dataset to support research. Abstract: Video virtual try-on aims to seamlessly replace the clothing of a person in a source video with a target garment. Despite significant progress in this field, existing approaches still struggle to maintain continuity and reproduce garment details. In this paper, we introduce ChronoTailor, a diffusion-based framework that generates temporally consistent videos while preserving fine-grained garment details. By employing a precise spatio-temporal attention mechanism to guide the integration of fine-grained garment features, ChronoTailor achieves robust try-on performance. First, ChronoTailor leverages region-aware spatial guidance to steer the evolution of spatial attention and employs an attention-driven temporal feature fusion mechanism to generate more continuous temporal features. This dual approach not only enables fine-grained local editing but also effectively mitigates artifacts arising from video dynamics. Second, ChronoTailor integrates multi-scale garment features to preserve low-level visual details and incorporates a garment-pose feature alignment to ensure temporal continuity during dynamic motion. Additionally, we collect StyleDress, a new dataset featuring intricate garments, varied environments, and diverse poses, which offers advantages over existing public datasets and will be made publicly available for research. Extensive experiments show that ChronoTailor maintains spatio-temporal continuity and preserves garment details during motion, significantly outperforming previous methods.

[74] Improved Allergy Wheal Detection for the Skin Prick Automated Test Device

Rembert Daems,Sven Seys,Valérie Hox,Adam Chaker,Glynnis De Greve,Winde Lemmens,Anne-Lise Poirrier,Eline Beckers,Zuzana Diamant,Carmen Dierickx,Peter W. Hellings,Caroline Huart,Claudia Jerin,Mark Jorissen,Hanne Oscé,Karolien Roux,Mark Thompson,Sophie Tombu,Saartje Uyttebroek,Andrzej Zarowski,Senne Gorris,Laura Van Gerven,Dirk Loeckx,Thomas Demeester

Main category: cs.CV

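TL;DR: The paper proposes an automated method that uses 32 images taken under different lighting conditions to detect and delineate wheals in skin prick tests, giving considerably higher accuracy than the conventional single-image approach.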

Details Motivation: The skin prick test (SPT) is the gold standard for diagnosing inhalant allergies, but conventional practice lacks consistency. The SPAT device captures images under multiple lighting conditions to improve diagnosis, and this unique data modality requires a custom-made method. Method: The study uses SPAT data from 868 patients with 10,416 manually annotated wheals. The method combines pixel-level neural network segmentation with an algorithmic approach for wheal detection and delineation. Result: On a hold-out validation set of 217 patients, using the 32 multi-lighting images gives considerably higher accuracy than a single conventionally lit image. Conclusion: Images under multiple lighting conditions substantially improve the accuracy of wheal detection and delineation in skin prick tests. Abstract: Background: The skin prick test (SPT) is the gold standard for diagnosing sensitization to inhalant allergies. The Skin Prick Automated Test (SPAT) device was designed for increased consistency in test results, and captures 32 images to be jointly used for allergy wheal detection and delineation, which leads to a diagnosis. Materials and Methods: Using SPAT data from 868 patients with suspected inhalant allergies, we designed an automated method to detect and delineate wheals on these images. To this end, 10,416 wheals were manually annotated by drawing detailed polygons along the edges. The unique data-modality of the SPAT device, with 32 images taken under distinct lighting conditions, requires a custom-made approach. Our proposed method consists of two parts: a neural network component that segments the wheals on the pixel level, followed by an algorithmic and interpretable approach for detecting and delineating the wheals. Results: We evaluate the performance of our method on a hold-out validation set of 217 patients. As a baseline we use a single conventionally lighted image per SPT as input to our method. Conclusion: Using the 32 SPAT images under various lighting conditions offers a considerably higher accuracy than a single image in conventional, uniform light.

[75] CryoFastAR: Fast Cryo-EM Ab Initio Reconstruction Made Easy

Jiakai Zhang,Shouchen Zhou,Haizhao Dai,Xinhang Liu,Peihao Wang,Zhiwen Fan,Yuan Pei,Jingyi Yu

Main category: cs.CV

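TL;DR: CryoFastAR is a geometric foundation model that predicts poses directly from noisy cryo-EM images, significantly accelerating pose estimation and 3D reconstruction in cryo-EM.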

Details Motivation: Pose estimation and 3D reconstruction in cryo-EM still depend on time-consuming iterative optimization, mainly because of low signal-to-noise ratios and distortions from the contrast transfer function. Method: The model integrates multi-view features and is trained on large-scale simulated cryo-EM data, with a progressive training strategy to improve robustness. Result: Experiments show that CryoFastAR significantly accelerates inference on both synthetic and real datasets while maintaining quality comparable to traditional methods. Conclusion: CryoFastAR offers fast and accurate pose estimation for cryo-EM and may help advance scientific imaging. Abstract: Pose estimation from unordered images is fundamental for 3D reconstruction, robotics, and scientific imaging. Recent geometric foundation models, such as DUSt3R, enable end-to-end dense 3D reconstruction but remain underexplored in scientific imaging fields like cryo-electron microscopy (cryo-EM) for near-atomic protein reconstruction. In cryo-EM, pose estimation and 3D reconstruction from unordered particle images still depend on time-consuming iterative optimization, primarily due to challenges such as low signal-to-noise ratios (SNR) and distortions from the contrast transfer function (CTF). We introduce CryoFastAR, the first geometric foundation model that can directly predict poses from Cryo-EM noisy images for Fast ab initio Reconstruction. By integrating multi-view features and training on large-scale simulated cryo-EM data with realistic noise and CTF modulations, CryoFastAR enhances pose estimation accuracy and generalization. To enhance training stability, we propose a progressive training strategy that first allows the model to extract essential features under simpler conditions before gradually increasing difficulty to improve robustness. Experiments show that CryoFastAR achieves comparable quality while significantly accelerating inference over traditional iterative approaches on both synthetic and real datasets.

[76] Domain-RAG: Retrieval-Guided Compositional Image Generation for Cross-Domain Few-Shot Object Detection

Yu Li,Xingyu Qiu,Yuqian Fu,Jie Chen,Tianwen Qian,Xu Zheng,Danda Pani Paudel,Yanwei Fu,Xuanjing Huang,Luc Van Gool,Yu-Gang Jiang

Main category: cs.CV

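TL;DR: The paper proposes Domain-RAG, a retrieval-guided generation framework that addresses domain alignment and visual realism in cross-domain few-shot object detection.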

Details Motivation: Cross-domain few-shot object detection (CD-FSOD) requires detecting novel objects from previously unseen domains, but existing data augmentation and generative methods struggle to satisfy both visual realism and domain alignment. Method: Domain-RAG has three stages: domain-aware background retrieval, domain-guided background generation, and foreground-background composition. It generates high-quality samples without additional training. Result: Experiments show that Domain-RAG performs well across multiple tasks, surpassing baselines and setting new state-of-the-art results. Conclusion: Domain-RAG provides an efficient, training-free solution for CD-FSOD that consistently improves performance. Abstract: Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel objects with only a handful of labeled samples from previously unseen domains. While data augmentation and generative methods have shown promise in few-shot learning, their effectiveness for CD-FSOD remains unclear due to the need for both visual realism and domain alignment. Existing strategies, such as copy-paste augmentation and text-to-image generation, often fail to preserve the correct object category or produce backgrounds coherent with the target domain, making them non-trivial to apply directly to CD-FSOD. To address these challenges, we propose Domain-RAG, a training-free, retrieval-guided compositional image generation framework tailored for CD-FSOD. Domain-RAG consists of three stages: domain-aware background retrieval, domain-guided background generation, and foreground-background composition. Specifically, the input image is first decomposed into foreground and background regions. We then retrieve semantically and stylistically similar images to guide a generative model in synthesizing a new background, conditioned on both the original and retrieved contexts. Finally, the preserved foreground is composed with the newly generated domain-aligned background to form the generated image. Without requiring any additional supervision or training, Domain-RAG produces high-quality, domain-consistent samples across diverse tasks, including CD-FSOD, remote sensing FSOD, and camouflaged FSOD. Extensive experiments show consistent improvements over strong baselines and establish new state-of-the-art results. Codes will be released upon acceptance.

[77] HMVLM: Multistage Reasoning-Enhanced Vision-Language Model for Long-Tailed Driving Scenarios

Daming Wang,Yuhao Song,Zijian He,Kangliang Chen,Xing Pan,Lu Deng,Weihao Gu

Main category: cs.CV

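TL;DR: HaoMo Vision-Language Model (HMVLM) is an end-to-end driving framework built on a fast-slow architecture. It improves performance through selective prompting and multi-stage reasoning, and placed 2nd in the 2025 Waymo Vision-based End-to-End Driving Challenge.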

Details Motivation: To propose a driving framework that combines a vision-language model with a cognitively inspired architecture in order to improve high-level decision-making in autonomous driving. Method: The framework uses selective five-view prompting, multi-stage chain-of-thought prompting, and spline-based trajectory post-processing. Result: Trained on the Waymo Open Dataset, HMVLM achieves an RFS of 7.7367, surpassing the public baseline by 2.77% and ranking 2nd. Conclusion: Through its prompting and reasoning pipeline, HMVLM improves decision-making and trajectory planning for autonomous driving. Abstract: We present HaoMo Vision-Language Model (HMVLM), an end-to-end driving framework that implements the slow branch of a cognitively inspired fast-slow architecture. A fast controller outputs low-level steering, throttle, and brake commands, while a slow planner, a large vision-language model, generates high-level intents such as "yield to pedestrian" or "merge after the truck" without compromising latency. HMVLM introduces three upgrades: (1) selective five-view prompting with an embedded 4s history of ego kinematics, (2) multi-stage chain-of-thought (CoT) prompting that enforces a Scene Understanding -> Driving Decision -> Trajectory Inference reasoning flow, and (3) spline-based trajectory post-processing that removes late-stage jitter and sharp turns. Trained on the Waymo Open Dataset, these upgrades enable HMVLM to achieve a Rater Feedback Score (RFS) of 7.7367, securing 2nd place in the 2025 Waymo Vision-based End-to-End (E2E) Driving Challenge and surpassing the public baseline by 2.77%.

[78] Unleashing the Potential of Consistency Learning for Detecting and Grounding Multi-Modal Media Manipulation

Yiheng Li,Yang Yang,Zichang Tan,Huan Liu,Weihua Chen,Xu Zhou,Zhen Lei

Main category: cs.CV

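TL;DR: The paper proposes CSCL, a new method that improves multi-modal media manipulation detection by capturing fine-grained consistency within local content.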

Details Motivation: Existing methods do not adequately explore fine-grained consistency within local content, which leads to inadequate perception of forgery details and unreliable results. Method: CSCL has an image branch and a text branch, each with two cascaded decoders (CCD and SCD) that capture within-modality contextual consistency and across-modality semantic consistency, respectively. Result: Experiments on the DGM4 datasets show that CSCL achieves state-of-the-art performance, especially in grounding manipulated content. Conclusion: Fine-grained consistency learning lets CSCL improve both detection and grounding of manipulations. Abstract: To tackle the threat of fake news, the task of detecting and grounding multi-modal media manipulation (DGM4) has received increasing attention. However, most state-of-the-art methods fail to explore the fine-grained consistency within local content, usually resulting in an inadequate perception of detailed forgery and unreliable results. In this paper, we propose a novel approach named Contextual-Semantic Consistency Learning (CSCL) to enhance the fine-grained perception ability of forgery for DGM4. Two branches for image and text modalities are established, each of which contains two cascaded decoders, i.e., Contextual Consistency Decoder (CCD) and Semantic Consistency Decoder (SCD), to capture within-modality contextual consistency and across-modality semantic consistency, respectively. Both CCD and SCD adhere to the same criteria for capturing fine-grained forgery details. To be specific, each module first constructs consistency features by leveraging additional supervision from the heterogeneous information of each token pair. Then, the forgery-aware reasoning or aggregating is adopted to deeply seek forgery cues based on the consistency features. Extensive experiments on DGM4 datasets prove that CSCL achieves new state-of-the-art performance, especially for the results of grounding manipulated content. Codes and weights are available at https://github.com/liyih/CSCL.

[79] Query Nearby: Offset-Adjusted Mask2Former enhances small-organ segmentation

Xin Zhang,Dongdong Meng,Sheng Li

Main category: cs.CV

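TL;DR: Proposes an improved Mask2Former that uses deformable attention and offset adjustment strategies to boost medical image segmentation, particularly on mid-sized and small organs.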

Details Motivation: Medical image segmentation is essential in clinical applications, but existing methods such as transformers perform poorly on mid-sized and small organs and demand substantial computing resources. Method: Mask2Former is combined with deformable attention and offset adjustment strategies; the 4th feature map provides a coarse organ location, and an FCN-based auxiliary head speeds up training. Result: The model reaches SOTA performance on the HaNSeg and SegRap2023 datasets, especially on mid-sized and small organs. Conclusion: The improved Mask2Former raises the accuracy and efficiency of medical image segmentation and suits clinical needs. Abstract: Medical segmentation plays an important role in clinical applications like radiation therapy and surgical guidance, but acquiring clinically acceptable results is difficult. In recent years, progress has been witnessed with the success of utilizing transformer-like models, such as combining the attention mechanism with CNN. In particular, transformer-based segmentation models can extract global information more effectively, compensating for the drawbacks of CNN modules that focus on local features. However, utilizing transformer architecture is not easy, because training transformer-based models can be resource-demanding. Moreover, due to the distinct characteristics in the medical field, especially when encountering mid-sized and small organs with compact regions, their results often seem unsatisfactory. For example, using ViT to segment medical images directly only gives a DSC of less than 50%, which is far lower than the clinically acceptable score of 80%. In this paper, we used Mask2Former with deformable attention to reduce computation and proposed offset adjustment strategies to encourage sampling points within the same organs during attention weights computation, thereby integrating compact foreground information better. Additionally, we utilized the 4th feature map in Mask2Former to provide a coarse location of organs, and employed an FCN-based auxiliary head to help train Mask2Former more quickly using Dice loss. We show that our model achieves SOTA (State-of-the-Art) performance on the HaNSeg and SegRap2023 datasets, especially on mid-sized and small organs. Our code is available at https://github.com/earis/Offsetadjustment_Background-location_Decoder_Mask2former.

[80] Rethinking Semi-supervised Segmentation Beyond Accuracy: Reliability and Robustness

Steven Landgraf,Markus Hillemann,Markus Ulrich

Main category: cs.CV

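TL;DR: The paper proposes a new evaluation metric, RSS, which combines accuracy, calibration, and uncertainty quality to assess the reliability of semi-supervised semantic segmentation models more holistically.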

Details Motivation: Existing semi-supervised segmentation work focuses only on accuracy and overlooks reliability and robustness, both of which are essential for safety-critical applications such as autonomous driving. Method: The paper introduces the Reliable Segmentation Score (RSS), which combines predictive accuracy, calibration, and uncertainty quality via a harmonic mean. Result: Experiments show that semi-supervised methods often trade reliability for accuracy; UniMatchV2 is robust but still has reliability shortcomings. Conclusion: The authors recommend adopting more holistic metrics such as RSS to better match real-world deployment needs. Abstract: Semantic segmentation is critical for scene understanding but demands costly pixel-wise annotations, attracting increasing attention to semi-supervised approaches to leverage abundant unlabeled data. While semi-supervised segmentation is often promoted as a path toward scalable, real-world deployment, it is astonishing that current evaluation protocols exclusively focus on segmentation accuracy, entirely overlooking reliability and robustness. These qualities, which ensure consistent performance under diverse conditions (robustness) and well-calibrated model confidences as well as meaningful uncertainties (reliability), are essential for safety-critical applications like autonomous driving, where models must handle unpredictable environments and avoid sudden failures at all costs. To address this gap, we introduce the Reliable Segmentation Score (RSS), a novel metric that combines predictive accuracy, calibration, and uncertainty quality measures via a harmonic mean. RSS penalizes deficiencies in any of its components, providing an easy and intuitive way of holistically judging segmentation models. Comprehensive evaluations of UniMatchV2 against its predecessor and a supervised baseline show that semi-supervised methods often trade reliability for accuracy. While out-of-domain evaluations demonstrate UniMatchV2's robustness, they further expose persistent reliability shortcomings. We advocate for a shift in evaluation protocols toward more holistic metrics like RSS to better align semi-supervised learning research with real-world deployment needs.
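The abstract states only that RSS combines its components "via a harmonic mean". A minimal sketch of why that choice penalizes a weak component is given below. The function name, the assumption that each component is already normalized to [0, 1] with higher meaning better, and the example values are illustrative assumptions, not the paper's definition.

```python
from statistics import harmonic_mean


def reliable_segmentation_score(accuracy, calibration, uncertainty_quality):
    """Harmonic mean of three component scores.

    Assumption: each component is already scaled to [0, 1], higher is better.
    How the paper normalizes its components is not stated in the abstract.
    """
    components = [accuracy, calibration, uncertainty_quality]
    if any(not 0.0 <= c <= 1.0 for c in components):
        raise ValueError("each component must lie in [0, 1]")
    return harmonic_mean(components)


# Two hypothetical models with the same arithmetic mean (0.8).
balanced = reliable_segmentation_score(0.8, 0.8, 0.8)
lopsided = reliable_segmentation_score(0.95, 0.95, 0.5)

# The harmonic mean ranks the lopsided model lower, because one weak
# component drags the whole score down.
print(round(balanced, 4), round(lopsided, 4))
```

An arithmetic mean would rate both hypothetical models equally, which is the behaviour a reliability-oriented metric wants to avoid.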

[81] FADE: Frequency-Aware Diffusion Model Factorization for Video Editing

Yixuan Zhu,Haolin Wang,Shilin Ma,Wenliang Zhao,Yansong Tang,Lei Chen,Jie Zhou

Main category: cs.CV

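TL;DR: FADE is a training-free video editing method that exploits the priors of pre-trained video diffusion models through frequency-aware factorization, enabling efficient, high-quality video editing.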

Details Motivation: Conventional image diffusion models struggle with video dynamics, while existing video diffusion models are computationally expensive, which prevents the direct application of image editing techniques. Method: After analyzing the attention patterns of the video model, the authors propose a frequency-aware factorization strategy and spectrum-guided modulation that optimize each component's role and prevent information leakage. Result: Experiments show that FADE produces high-quality, temporally coherent edits with good computational efficiency. Conclusion: FADE offers an efficient, training-free solution for video editing that improves editing quality and flexibility. Abstract: Recent advancements in diffusion frameworks have significantly enhanced video editing, achieving high fidelity and strong alignment with textual prompts. However, conventional approaches using image diffusion models fall short in handling video dynamics, particularly for challenging temporal edits like motion adjustments. While current video diffusion models produce high-quality results, adapting them for efficient editing remains difficult due to the heavy computational demands that prevent the direct application of previous image editing techniques. To overcome these limitations, we introduce FADE, a training-free yet highly effective video editing approach that fully leverages the inherent priors from pre-trained video diffusion models via frequency-aware factorization. Rather than simply using these models, we first analyze the attention patterns within the video model to reveal how video priors are distributed across different components. Building on these insights, we propose a factorization strategy to optimize each component's specialized role. Furthermore, we devise spectrum-guided modulation to refine the sampling trajectory with frequency domain cues, preventing information leakage and supporting efficient, versatile edits while preserving the basic spatial and temporal structure. Extensive experiments on real-world videos demonstrate that our method consistently delivers high-quality, realistic and temporally coherent editing results both qualitatively and quantitatively. Code is available at https://github.com/EternalEvan/FADE.

[82] MOGO: Residual Quantized Hierarchical Causal Transformer for High-Quality and Real-Time 3D Human Motion Generation

Dongjie Fu,Tengjiao Sun,Pengcheng Fang,Xiaohao Cai,Hansung Kim

Main category: cs.CV

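TL;DR: MOGO is an efficient, real-time 3D motion generation framework that uses MoSA-VQ and RQHC-Transformer to generate high-quality motion with low latency, and it performs well in experiments.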

Details Motivation: Existing transformer-based motion generation methods struggle to achieve high fidelity, real-time responsiveness, and scalability at the same time; MOGO aims to address this. Method: MOGO comprises MoSA-VQ (a scale-adaptive residual vector quantization module) and RQHC-Transformer (a residual quantized hierarchical causal transformer), and generates motion in a single forward pass. Result: On datasets including HumanML3D, MOGO achieves generation quality competitive with or superior to existing methods, with better real-time performance and zero-shot generalization. Conclusion: MOGO balances efficiency and quality in motion generation and has practical potential. Abstract: Recent advances in transformer-based text-to-motion generation have led to impressive progress in synthesizing high-quality human motion. Nevertheless, jointly achieving high fidelity, streaming capability, real-time responsiveness, and scalability remains a fundamental challenge. In this paper, we propose MOGO (Motion Generation with One-pass), a novel autoregressive framework tailored for efficient and real-time 3D motion generation. MOGO comprises two key components: (1) MoSA-VQ, a motion scale-adaptive residual vector quantization module that hierarchically discretizes motion sequences with learnable scaling to produce compact yet expressive representations; and (2) RQHC-Transformer, a residual quantized hierarchical causal transformer that generates multi-layer motion tokens in a single forward pass, significantly reducing inference latency. To enhance semantic fidelity, we further introduce a text condition alignment mechanism that improves motion decoding under textual control. Extensive experiments on benchmark datasets including HumanML3D, KIT-ML, and CMP demonstrate that MOGO achieves competitive or superior generation quality compared to state-of-the-art transformer-based methods, while offering substantial improvements in real-time performance, streaming generation, and generalization under zero-shot settings.

[83] Dy3DGS-SLAM: Monocular 3D Gaussian Splatting SLAM for Dynamic Environments

Mingrui Li,Yiming Zhou,Hongxing Zhou,Xinggang Hu,Florian Roemer,Hongyu Wang,Ahmad Osman

Main category: cs.CV

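TL;DR: Dy3DGS-SLAM is a 3D Gaussian Splatting SLAM method, the first to support monocular RGB input for tracking and reconstruction in dynamic scenes. It fuses optical flow and depth masks and combines a motion loss with a rendering loss to improve performance in dynamic environments.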

Details Motivation: Existing SLAM methods based on NeRF or 3D Gaussian Splatting perform poorly in dynamic environments, especially when they rely on RGB input alone; Dy3DGS-SLAM aims to solve this. Method: A probabilistic model fuses optical flow masks and depth masks, a motion loss constrains pose estimation, and a rendering loss removes dynamic interference. Result: Experiments show that Dy3DGS-SLAM matches or outperforms existing RGB-D methods in tracking and rendering in dynamic environments. Conclusion: Dy3DGS-SLAM provides an effective monocular RGB solution for SLAM in dynamic scenes. Abstract: Current Simultaneous Localization and Mapping (SLAM) methods based on Neural Radiance Fields (NeRF) or 3D Gaussian Splatting excel in reconstructing static 3D scenes but struggle with tracking and reconstruction in dynamic environments, such as real-world scenes with moving elements. Existing NeRF-based SLAM approaches addressing dynamic challenges typically rely on RGB-D inputs, with few methods accommodating pure RGB input. To overcome these limitations, we propose Dy3DGS-SLAM, the first 3D Gaussian Splatting (3DGS) SLAM method for dynamic scenes using monocular RGB input. To address dynamic interference, we fuse optical flow masks and depth masks through a probabilistic model to obtain a fused dynamic mask. With only a single network iteration, this can constrain tracking scales and refine rendered geometry. Based on the fused dynamic mask, we designed a novel motion loss to constrain the pose estimation network for tracking. In mapping, we use the rendering loss of dynamic pixels, color, and depth to eliminate transient interference and occlusion caused by dynamic objects. Experimental results demonstrate that Dy3DGS-SLAM achieves state-of-the-art tracking and rendering in dynamic environments, outperforming or matching existing RGB-D methods.

[84] Domain Adaptation in Agricultural Image Analysis: A Comprehensive Review from Shallow Models to Deep Learning

Xing Hu,Siyuan Chen,Dawei Zhang

Main category: cs.CV

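TL;DR: This review examines how domain adaptation (DA) techniques apply to agricultural image analysis, addressing domain shift caused by environmental differences, crop types, and data acquisition methods, and improving model generalization across regions and seasons.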

Details Motivation: In agricultural computer vision applications, domain shift limits model generalization, and DA techniques are a promising way to address it. Method: The paper systematically reviews recent advances in DA for agricultural imagery, covering shallow and deep learning models as well as supervised, semi-supervised, and unsupervised approaches, with a special focus on adversarial learning. Result: DA improves performance in tasks such as crop health monitoring, pest detection, and fruit recognition, particularly in complex agricultural environments. Conclusion: The review gives researchers a comprehensive framework for DA in agricultural image analysis, identifies current research gaps, and supports further development of DA methods. Abstract: With the increasing use of computer vision in agriculture, image analysis has become crucial for tasks like crop health monitoring and pest detection. However, significant domain shifts between source and target domains, due to environmental differences, crop types, and data acquisition methods, pose challenges. These domain gaps limit the ability of models to generalize across regions, seasons, and complex agricultural environments. This paper explores how Domain Adaptation (DA) techniques can address these challenges, focusing on their role in enhancing the cross-domain transferability of agricultural image analysis. DA has gained attention in agricultural vision tasks due to its potential to mitigate domain heterogeneity. The paper systematically reviews recent advances in DA for agricultural imagery, particularly its practical applications in complex agricultural environments. We examine the key drivers for adopting DA in agriculture, such as limited labeled data, weak model transferability, and dynamic environmental conditions. We also discuss its use in crop health monitoring, pest detection, and fruit recognition, highlighting improvements in performance across regions and seasons. The paper categorizes DA methods into shallow and deep learning models, with further divisions into supervised, semi-supervised, and unsupervised approaches. A special focus is given to adversarial learning-based DA methods, which have shown great promise in challenging agricultural scenarios. Finally, we review key public datasets in agricultural imagery, analyzing their value and limitations in DA research. This review provides a comprehensive framework for researchers, offering insights into current research gaps and supporting the advancement of DA methods in agricultural image analysis.

[85] MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks

Zonglin Wu,Yule Xue,Xin Wei,Yiren Song

Main category: cs.CV

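TL;DR: The paper introduces MCA-Bench, a unified multimodal CAPTCHA benchmarking suite for evaluating CAPTCHA security, and proposes design principles and open challenges.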

Details Motivation: Existing CAPTCHA schemes are diverse, but there is no unified, large-scale, multimodal benchmark for evaluating their security comprehensively. Method: Using a shared vision-language model backbone, a specialized cracking agent is fine-tuned for each CAPTCHA type, enabling cross-modal evaluation. Result: MCA-Bench maps the vulnerability spectrum of modern CAPTCHA designs and gives the first quantitative analysis of how challenge complexity, interaction depth, and model solvability interrelate. Conclusion: The paper proposes three actionable design principles and identifies key open challenges, laying the groundwork for CAPTCHA hardening and community collaboration. Abstract: As automated attack techniques rapidly advance, CAPTCHAs remain a critical defense mechanism against malicious bots. However, existing CAPTCHA schemes encompass a diverse range of modalities -- from static distorted text and obfuscated images to interactive clicks, sliding puzzles, and logic-based questions -- yet the community still lacks a unified, large-scale, multimodal benchmark to rigorously evaluate their security robustness. To address this gap, we introduce MCA-Bench, a comprehensive and reproducible benchmarking suite that integrates heterogeneous CAPTCHA types into a single evaluation protocol. Leveraging a shared vision-language model backbone, we fine-tune specialized cracking agents for each CAPTCHA category, enabling consistent, cross-modal assessments. Extensive experiments reveal that MCA-Bench effectively maps the vulnerability spectrum of modern CAPTCHA designs under varied attack settings, and crucially offers the first quantitative analysis of how challenge complexity, interaction depth, and model solvability interrelate. Based on these findings, we propose three actionable design principles and identify key open challenges, laying the groundwork for systematic CAPTCHA hardening, fair benchmarking, and broader community collaboration. Datasets and code are available online.

[86] Bootstrapping World Models from Dynamics Models in Multimodal Foundation Models

Yifu Qiu,Yftah Ziser,Anna Korhonen,Shay B. Cohen,Edoardo M. Ponti

Main category: cs.CV

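TL;DR: The paper studies whether vision-and-language foundation models possess a realistic world model and a dynamics model. It finds that a dynamics model is easier to acquire through supervised fine-tuning and can in turn bootstrap a world model. Two strategies (weakly supervised learning and inference-time verification) let the dynamics model improve the world model, which then performs strongly on image editing.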

Details Motivation: To examine whether vision-and-language foundation models possess a realistic world model and a dynamics model, and to study how a dynamics model can bootstrap a world model. Method: A dynamics model is trained through supervision and then used to bootstrap the world model via two strategies: weakly supervised learning from synthetic data and inference-time verification. Result: On the action-centric image editing task of Aurora-Bench, the best model improves on state-of-the-art image editing models by a margin of 15% on real-world subsets (judged by GPT4o) and achieves the best average human evaluation. Conclusion: Dynamics models can effectively bootstrap world models and improve image editing performance. Abstract: To what extent do vision-and-language foundation models possess a realistic world model (observation $\times$ action $\rightarrow$ observation) and a dynamics model (observation $\times$ observation $\rightarrow$ action), when actions are expressed through language? While open-source foundation models struggle with both, we find that fine-tuning them to acquire a dynamics model through supervision is significantly easier than acquiring a world model. In turn, dynamics models can be used to bootstrap world models through two main strategies: 1) weakly supervised learning from synthetic data and 2) inference time verification. Firstly, the dynamics model can annotate actions for unlabelled pairs of video frame observations to expand the training data. We further propose a new objective, where image tokens in observation pairs are weighted by their importance, as predicted by a recognition model. Secondly, the dynamics models can assign rewards to multiple samples of the world model to score them, effectively guiding search at inference time. We evaluate the world models resulting from both strategies through the task of action-centric image editing on Aurora-Bench. Our best model achieves a performance competitive with state-of-the-art image editing models, improving on them by a margin of $15\%$ on real-world subsets according to GPT4o-as-judge, and achieving the best average human evaluation across all subsets of Aurora-Bench.
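The second strategy in the abstract, inference-time verification, amounts to a best-of-N search: the world model proposes several candidate next observations and the dynamics model scores each one. A minimal sketch follows. The function names and the toy setting, in which observations are integer positions and actions are intended displacements, are assumptions made purely for illustration and do not reflect the paper's models.

```python
def select_best_candidate(observation, action, candidates, dynamics_score):
    """Return the candidate next observation that the scorer rates highest.

    dynamics_score(observation, candidate, action) is a hypothetical callable
    standing in for the dynamics model's reward.
    """
    return max(candidates, key=lambda c: dynamics_score(observation, c, action))


def toy_dynamics_score(observation, candidate, action):
    """Toy reward: how well the observed change matches the requested action."""
    inferred_action = candidate - observation
    return -abs(inferred_action - action)


# Toy usage: start at position 2, the action asks for a move of +3,
# and a hypothetical world model has sampled three candidate outcomes.
best = select_best_candidate(
    observation=2,
    action=3,
    candidates=[3, 5, 9],
    dynamics_score=toy_dynamics_score,
)
print(best)
```

The selection step is independent of how candidates are sampled, so the same wrapper applies to any generator that can produce multiple samples.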

[87] Enhancing Orthopox Image Classification Using Hybrid Machine Learning and Deep Learning Models

Alejandro Puente-Castro,Enrique Fernandez-Blanco,Daniel Rivero,Andres Molares-Ulloa

Main category: cs.CV

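TL;DR: Proposes a hybrid approach that combines machine learning models with pretrained deep learning models to classify Orthopoxvirus infections from medical images. It extracts deep features without data augmentation and classifies efficiently.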

Details Motivation: Traditional diagnostic methods are time-consuming and depend on expert interpretation, and existing datasets are small and biased, so automated and scalable solutions are needed. Method: Machine learning models are combined with pretrained deep learning models that extract deep feature representations, with no need for augmented data. Result: The approach performs well in both classification quality and computational cost, with strong generalization and robustness. Conclusion: It provides a scalable and interpretable solution suited to real-world clinical deployment. Abstract: Orthopoxvirus infections must be accurately classified from medical pictures for an easy and early diagnosis and epidemic prevention. The necessity for automated and scalable solutions is highlighted by the fact that traditional diagnostic techniques can be time-consuming and require expert interpretation, and that there are few and biased data sets of the different types of Orthopox. In order to improve classification performance and lower computational costs, a hybrid strategy is put forth in this paper that uses Machine Learning models combined with pretrained Deep Learning models to extract deep feature representations without the need for augmented data. The findings show that this feature extraction method, when paired with other state-of-the-art methods, produces excellent classification outcomes while preserving training and inference efficiency. The proposed approach demonstrates strong generalization and robustness across multiple evaluation settings, offering a scalable and interpretable solution for real-world clinical deployment.

[88] Restereo: Diffusion stereo video generation and restoration

Xingchang Huang,Ashish Kumar Singh,Florian Dubost,Cristina Nader Vasconcelos,Sakar Khattar,Liang Shi,Christian Theobalt,Cengiz Oztireli,Gurprit Singh

Main category: cs.CV

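TL;DR: The paper proposes a new method that uses a single model to both generate and enhance left-view and right-view stereo videos, and that works on low-quality inputs.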

Details Motivation: Most existing methods generate stereo videos from high-quality monocular videos, whereas this work targets stereo video generation and restoration from low-quality inputs. Method: The model is fine-tuned on degraded data for restoration and conditioned on warped masks for consistent stereo generation. Result: Experiments show that the method outperforms existing approaches in stereo video generation from low-resolution inputs. Conclusion: After fine-tuning on a relatively small synthetic dataset, the method can be applied to low-quality real-world videos for both stereo generation and restoration. Abstract: Stereo video generation has been gaining increasing attention with recent advancements in video diffusion models. However, most existing methods focus on generating 3D stereoscopic videos from monocular 2D videos. These approaches typically assume that the input monocular video is of high quality, making the task primarily about inpainting occluded regions in the warped video while preserving disoccluded areas. In this paper, we introduce a new pipeline that not only generates stereo videos but also enhances both left-view and right-view videos consistently with a single model. Our approach achieves this by fine-tuning the model on degraded data for restoration, as well as conditioning the model on warped masks for consistent stereo generation. As a result, our method can be fine-tuned on a relatively small synthetic stereo video dataset and applied to low-quality real-world videos, performing both stereo video generation and restoration. Experiments demonstrate that our method outperforms existing approaches both qualitatively and quantitatively in stereo video generation from low-resolution inputs.

[89] O-MaMa @ EgoExo4D Correspondence Challenge: Learning Object Mask Matching between Egocentric and Exocentric Views

Lorenzo Mur-Labadia,Maria Santos-Villafranca,Alejandro Perez-Yus,Jesus Bermudez-Cameo,Ruben Martinez-Cantin,Jose J. Guerrero

Main category: cs.CV

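TL;DR: Reframes cross-image segmentation as a mask matching task and proposes a method that combines semantic features, multi-view fusion, and contrastive learning.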

Details Motivation: To segment specific objects across different views and to improve the accuracy and robustness of that segmentation. Method: (1) A Mask-Context Encoder extracts object-level features; (2) an Ego↔Exo Cross-Attention fuses multi-view information; (3) a Mask Matching contrastive loss aligns features; (4) a Hard Negative Adjacent Mining strategy sharpens discrimination between nearby objects. Result: The summary reports efficient matching and accurate cross-view object segmentation; the abstract itself gives no quantitative results. Conclusion: Mask matching combined with multi-view fusion improves cross-image segmentation. Abstract: The goal of the correspondence task is to segment specific objects across different views. This technical report re-defines cross-image segmentation by treating it as a mask matching task. Our method consists of: (1) A Mask-Context Encoder that pools dense DINOv2 semantic features to obtain discriminative object-level representations from FastSAM mask candidates, (2) an Ego$\leftrightarrow$Exo Cross-Attention that fuses multi-perspective observations, (3) a Mask Matching contrastive loss that aligns cross-view features in a shared latent space, and (4) a Hard Negative Adjacent Mining strategy to encourage the model to better differentiate between nearby objects.
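The abstract does not spell out the form of the Mask Matching contrastive loss. As a rough illustration of how a contrastive objective can select the correct mask candidate, here is a sketch of a standard InfoNCE-style loss over cosine similarities. The function names, the temperature value, and the toy embeddings are assumptions, and the report's actual loss may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mask_matching_loss(query, candidates, positive_index, temperature=0.1):
    """InfoNCE-style loss: cross-entropy over similarity logits.

    query: embedding of the object mask in the source view.
    candidates: embeddings of mask candidates in the target view.
    positive_index: index of the candidate that shows the same object.
    """
    logits = [cosine(query, c) / temperature for c in candidates]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[positive_index]


# Toy embeddings: candidate 0 is aligned with the query, the others are not.
query = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]

loss_correct = mask_matching_loss(query, candidates, positive_index=0)
loss_wrong = mask_matching_loss(query, candidates, positive_index=1)
print(loss_correct < loss_wrong)
```

Under this formulation, hard negative mining corresponds to choosing which non-matching candidates populate the list, for example masks adjacent to the true object.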

[90] Sample-Specific Noise Injection For Diffusion-Based Adversarial Purification

Yuhao Sun,Jiacheng Zhang,Zesheng Ye,Chaowei Xiao,Feng Liu

Main category: cs.CV

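TL;DR: The paper proposes Sample-specific Score-aware Noise Injection (SSNI), which uses a pre-trained score network to adapt the noise level per sample and notably improves the accuracy and robustness of diffusion-based purification methods.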

Details Motivation: Existing diffusion-based purification methods use one fixed noise level for all samples, yet the optimal noise level can differ from sample to sample, which motivates a sample-specific noise injection framework. Method: SSNI uses a pre-trained score network to estimate how far a sample deviates from the clean data distribution (its score norm), and a reweighting function then adapts the noise level for each sample. Result: On CIFAR-10 and ImageNet-1K, SSNI notably improves the accuracy and robustness of existing diffusion-based purification methods. Conclusion: Allocating distinct noise levels to different samples is necessary in diffusion-based purification, and SSNI provides an effective way to do so. Abstract: Diffusion-based purification (DBP) methods aim to remove adversarial noise from the input sample by first injecting Gaussian noise through a forward diffusion process, and then recovering the clean example through a reverse generative process. In the above process, how much Gaussian noise is injected into the input sample is key to the success of DBP methods, which is controlled by a constant noise level $t^*$ for all samples in existing methods. In this paper, we discover that the optimal $t^*$ can differ from sample to sample. Intuitively, the cleaner a sample is, the less noise should be injected into it, and vice versa. Motivated by this finding, we propose a new framework, called Sample-specific Score-aware Noise Injection (SSNI). Specifically, SSNI uses a pre-trained score network to estimate how much a data point deviates from the clean data distribution (i.e., score norms). Then, based on the magnitude of score norms, SSNI applies a reweighting function to adaptively adjust $t^*$ for each sample, achieving sample-specific noise injections. Empirically, incorporating our framework with existing DBP methods results in a notable improvement in both accuracy and robustness on CIFAR-10 and ImageNet-1K, highlighting the necessity to allocate distinct noise levels to different samples in DBP methods. Our code is available at: https://github.com/tmlr-group/SSNI.
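The abstract describes the mechanism only at a high level: compute a score norm per sample, then pass it through a reweighting function to obtain a per-sample $t^*$. The sketch below illustrates that idea with a sigmoid reweighting. The sigmoid form, the parameter names, and all numeric defaults are hypothetical choices for illustration, not the functions used in SSNI.

```python
import math


def score_norm(score_vector):
    """Euclidean norm of a score estimate (from a hypothetical score network)."""
    return math.sqrt(sum(s * s for s in score_vector))


def sample_specific_t(norm, t_min=20, t_max=200, midpoint=1.0, sharpness=4.0):
    """Map a score norm to a per-sample noise level t*.

    Hypothetical sigmoid reweighting: a small norm (sample close to the clean
    distribution) gives t* near t_min, a large norm gives t* near t_max.
    """
    weight = 1.0 / (1.0 + math.exp(-sharpness * (norm - midpoint)))
    return round(t_min + weight * (t_max - t_min))


# Toy usage with made-up score vectors for a clean-looking sample
# and a heavily perturbed one.
t_clean = sample_specific_t(score_norm([0.1, 0.2]))
t_perturbed = sample_specific_t(score_norm([1.5, 1.5]))
print(t_clean, t_perturbed)
```

A constant-$t^*$ baseline corresponds to ignoring the norm and returning the same value for every sample.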

[91] HAVIR: HierArchical Vision to Image Reconstruction using CLIP-Guided Versatile Diffusion

Shiyi Zhang,Dong Liang,Hairong Zheng,Yihang Zhou

Main category: cs.CV

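TL;DR: HAVIR combines an AutoKL adapter and a CLIP adapter to reconstruct both the structure and the semantics of complex visual stimuli from fMRI data, outperforming existing models.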

Details Motivation: To address the challenge of accurately reconstructing complex visual stimuli from fMRI data, given their elemental density, diversity, spatial structure, and semantic information. Method: HAVIR contains an AutoKL Adapter that captures topological structure and a CLIP Adapter that extracts semantic information; Versatile Diffusion fuses the two to generate the image. Result: HAVIR reconstructs both the structure and the semantics of visual stimuli in complex scenes and outperforms existing models. Conclusion: HAVIR offers an effective approach to reconstructing complex visual stimuli by combining structural and semantic information. Abstract: Reconstructing visual information from brain activity bridges the gap between neuroscience and computer vision. Even though progress has been made in decoding images from fMRI using generative models, a challenge remains in accurately recovering highly complex visual stimuli. This difficulty stems from their elemental density and diversity, sophisticated spatial structures, and multifaceted semantic information. To address these challenges, we propose HAVIR that contains two adapters: (1) The AutoKL Adapter transforms fMRI voxels into a latent diffusion prior, capturing topological structures; (2) The CLIP Adapter converts the voxels to CLIP text and image embeddings, containing semantic information. These complementary representations are fused by Versatile Diffusion to generate the final reconstructed image. To extract the most essential semantic information from complex scenarios, the CLIP Adapter is trained with text captions describing the visual stimuli and their corresponding semantic images synthesized from these captions. The experimental results demonstrate that HAVIR effectively reconstructs both structural features and semantic information of visual stimuli even in complex scenarios, outperforming existing models.

[92] Tensor-to-Tensor Models with Fast Iterated Sum Features

Joscha Diehl,Rasheed Ibraheem,Leonard Schmitz,Yue Wu

Main category: cs.CV

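TL;DR: Proposes a new tensor-to-tensor layer (the FIS layer) built on the mathematical tool of "corner trees"; it has linear computational cost and suits image and higher-order tensor data.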

Details Motivation: In modern deep learning applications, the high dimensionality of images and higher-order tensors calls for subquadratic processing layers. Method: Using "corner trees" and a multiparameter generalization of iterated integrals (or sums), the authors build the FIS layer, which integrates seamlessly into neural networks. Result: The layer performs well in classification and anomaly detection. Replacing some ResNet layers reduces parameters and computation while keeping accuracy close (a difference of only 0.1%). The anomaly detection model reaches an average AUROC of 97.3% on the texture images of MVTec AD. Conclusion: The FIS layer offers an efficient and well-performing option for high-dimensional data. Abstract: Data in the form of images or higher-order tensors is ubiquitous in modern deep learning applications. Owing to their inherent high dimensionality, the need for subquadratic layers processing such data is even more pressing than for sequence data. We propose a novel tensor-to-tensor layer with linear cost in the input size, utilizing the mathematical gadget of "corner trees" from the field of permutation counting. In particular, for order-two tensors, we provide an image-to-image layer that can be plugged into image processing pipelines. On the one hand, our method can be seen as a higher-order generalization of state-space models. On the other hand, it is based on a multiparameter generalization of the signature of iterated integrals (or sums). The proposed tensor-to-tensor concept is used to build a neural network layer called the Fast Iterated Sums (FIS) layer which integrates seamlessly with other layer types. We demonstrate the usability of the FIS layer with both classification and anomaly detection tasks. By replacing some layers of a smaller ResNet architecture with FIS, a similar accuracy (with a difference of only 0.1%) was achieved in comparison to a larger ResNet while reducing the number of trainable parameters and multi-add operations. The FIS layer was also used to build an anomaly detection model that achieved an average AUROC of 97.3% on the texture images of the popular MVTec AD dataset. The processing and modelling codes are publicly available at https://github.com/diehlj/fast-iterated-sums.
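To give a feel for why iterated sums can be evaluated at linear cost, the sketch below handles only the simplest one-dimensional case, the order-two sum $\sum_{i<j} x_i y_j$, using a running prefix sum. This is not the paper's multiparameter corner-tree construction. It is a familiar sequence analogue chosen for illustration, and the function names are assumptions.

```python
def iterated_sum_naive(x, y):
    """Order-two iterated sum over all index pairs i < j, in quadratic time."""
    return sum(x[i] * y[j] for j in range(len(y)) for i in range(j))


def iterated_sum_linear(x, y):
    """Same quantity in linear time using a running prefix sum of x."""
    total = 0
    prefix = 0  # holds x[0] + ... + x[j-1] when position j is processed
    for x_j, y_j in zip(x, y):
        total += prefix * y_j
        prefix += x_j
    return total


x = [1, 2, 3]
y = [4, 5, 6]
naive = iterated_sum_naive(x, y)
fast = iterated_sum_linear(x, y)
print(naive, fast)
```

Both functions return 23 for this input: the pairs with i < j contribute 1·5 + 1·6 + 2·6.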

[93] SDS-Net: Shallow-Deep Synergism-detection Network for infrared small target detection

Taoran Yue,Xiaojin Lu,Jiaxi Cai,Yuanping Chen,Shibing Chu

Main category: cs.CV

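TL;DR: Proposes a shallow-deep synergistic detection network (SDS-Net) that improves the accuracy and computational efficiency of infrared small target detection through a dual-branch architecture and an adaptive feature fusion module.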

Details Motivation: Existing CNN methods overlook the heterogeneity between shallow and deep features, so the features collaborate poorly and detection performance suffers. Method: Uses a dual-branch architecture to model structural and semantic features separately, and adds an adaptive feature fusion module that dynamically models cross-layer feature correlations. Result: Outperforms existing methods on three public datasets while keeping computational complexity low and inference efficiency high. Conclusion: SDS-Net delivers superior performance in infrared small target detection and has broad application prospects. Abstract: Current CNN-based infrared small target detection (IRSTD) methods generally overlook the heterogeneity between shallow and deep features, leading to inefficient collaboration between shallow fine-grained structural information and deep high-level semantic representations. Additionally, the dependency relationships and fusion mechanisms across different feature hierarchies lack systematic modeling, which fails to fully exploit the complementarity of multilevel features. These limitations hinder IRSTD performance while incurring substantial computational costs. To address these challenges, this paper proposes a shallow-deep synergistic detection network (SDS-Net) that efficiently models multilevel feature representations to increase both the detection accuracy and computational efficiency in IRSTD tasks. SDS-Net introduces a dual-branch architecture that separately models the structural characteristics and semantic properties of features, effectively preserving shallow spatial details while capturing deep semantic representations, thereby achieving high-precision detection with significantly improved inference speed. Furthermore, the network incorporates an adaptive feature fusion module to dynamically model cross-layer feature correlations, enhancing overall feature collaboration and representation capability. Comprehensive experiments on three public datasets (NUAA-SIRST, NUDT-SIRST, and IRSTD-1K) demonstrate that SDS-Net outperforms state-of-the-art IRSTD methods while maintaining low computational complexity and high inference efficiency, showing superior detection performance and broad application prospects. Our code will be made public at https://github.com/PhysiLearn/SDS-Net.

[94] Full Conformal Adaptation of Medical Vision-Language Models

Julio Silva-Rodríguez,Leo Fillioux,Paul-Henry Cournède,Maria Vakalopoulou,Stergios Christodoulidis,Ismail Ben Ayed,Jose Dolz

Main category: cs.CV

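TL;DR: Studies the reliability of vision-language models (VLMs) in medical image analysis and proposes a new full conformal adaptation framework that substantially improves set efficiency.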

Details Motivation: VLMs show strong discriminative potential in medical image analysis, but their reliability has not been studied enough; this paper addresses that gap. Method: Proposes a full conformal adaptation framework, combined with the SS-Text solver, that uses a few-shot set for both adaptation and conformal prediction. Result: Across 9 adaptation tasks, the framework yields relative improvements of up to 27% in set efficiency while keeping the same coverage guarantees. Conclusion: The framework offers an effective way to make VLMs more reliable in the medical domain without requiring additional data. Abstract: Vision-language models (VLMs) pre-trained at large scale have shown unprecedented transferability capabilities and are being progressively integrated into medical image analysis. Although their discriminative potential has been widely explored, their reliability remains overlooked. This work investigates their behavior under the increasingly popular split conformal prediction (SCP) framework, which theoretically guarantees a given error level on output sets by leveraging a labeled calibration set. However, the zero-shot performance of VLMs is inherently limited, and common practice involves few-shot transfer learning pipelines, which cannot absorb the rigid exchangeability assumptions of SCP. To alleviate this issue, we propose full conformal adaptation, a novel setting for jointly adapting and conformalizing pre-trained foundation models, which operates transductively over each test data point using a few-shot adaptation set. Moreover, we complement this framework with SS-Text, a novel training-free linear probe solver for VLMs that alleviates the computational cost of such a transductive approach. We provide comprehensive experiments using 3 different modality-specialized medical VLMs and 9 adaptation tasks. Our framework requires exactly the same data as SCP, and provides consistent relative improvements of up to 27% on set efficiency while maintaining the same coverage guarantees.
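
For readers unfamiliar with the baseline, the following is a minimal sketch of standard split conformal prediction for classification, not the paper's full conformal adaptation or SS-Text. The non-conformity score (one minus the probability of the true class) and the function names are assumptions chosen for illustration.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Calibrate a score threshold on a labeled calibration set.

    cal_probs: list of per-class probability lists; cal_labels: true class indices.
    alpha: target error level (e.g. 0.1 for 90% coverage).
    """
    # Non-conformity score: 1 - probability assigned to the true class.
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration samples for this alpha: every class must be included.
        return float("inf")
    return scores[k - 1]


def prediction_set(probs, threshold):
    """Return all classes whose non-conformity score is within the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= threshold]
```

Set efficiency, the quantity the abstract reports improvements on, refers to how small these prediction sets are at a fixed coverage level.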

[95] WisWheat: A Three-Tiered Vision-Language Dataset for Wheat Management

Bowen Yuan,Selena Song,Javier Fernandez,Yadan Luo,Mahsa Baktashmotlagh,Zijian Wang

Main category: cs.CV

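TL;DR: WisWheat is a three-tiered dataset designed for wheat management tasks that significantly improves the performance of vision-language models (VLMs) on them.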

Details Motivation: Traditional wheat management relies on manual expert inspection, which is costly and hard to scale. Applying general-purpose vision-language models (VLMs) directly works poorly because they lack domain-specific knowledge. Method: Proposes the WisWheat dataset with a three-tiered design: a foundational pretraining dataset, a quantitative dataset, and an instruction fine-tuning dataset, used to strengthen VLMs on wheat management tasks. Result: Fine-tuned VLMs perform well on wheat stress and growth stage tasks, with accuracies of 79.2% and 84.6% respectively, surpassing general-purpose commercial models. Conclusion: WisWheat effectively improves VLM performance on wheat management tasks and provides a practical tool for intelligent agriculture. Abstract: Wheat management strategies play a critical role in determining yield. Traditional management decisions often rely on labour-intensive expert inspections, which are expensive, subjective and difficult to scale. Recently, Vision-Language Models (VLMs) have emerged as a promising solution to enable scalable, data-driven management support. However, due to a lack of domain-specific knowledge, directly applying VLMs to wheat management tasks results in poor quantification and reasoning capabilities, ultimately producing vague or even misleading management recommendations. In response, we propose WisWheat, a wheat-specific dataset with a three-layered design to enhance VLM performance on wheat management tasks: (1) a foundational pretraining dataset of 47,871 image-caption pairs for coarsely adapting VLMs to wheat morphology; (2) a quantitative dataset comprising 7,263 VQA-style image-question-answer triplets for quantitative trait measuring tasks; and (3) an Instruction Fine-tuning dataset with 4,888 samples targeting biotic and abiotic stress diagnosis and management plans for different phenological stages. Extensive experimental results demonstrate that fine-tuning open-source VLMs (e.g., Qwen2.5 7B) on our dataset leads to significant performance improvements. Specifically, the Qwen2.5 VL 7B fine-tuned on our wheat instruction dataset achieves accuracy scores of 79.2% and 84.6% on wheat stress and growth stage conversation tasks respectively, surpassing even general-purpose commercial models such as GPT-4o by a margin of 11.9% and 34.6%.

[96] Feedback Guidance of Diffusion Models

Koulischer Felix,Handke Florian,Deleu Johannes,Demeester Thomas,Ambrogioni Luca

Main category: cs.CV

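TL;DR: Proposes FeedBack Guidance (FBG), a method that adjusts the amount of guidance dynamically to improve sample fidelity in conditional diffusion models while avoiding the diversity loss and memorization associated with conventional Classifier-Free Guidance (CFG).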

Details Motivation: Conventional CFG improves sample fidelity in conditional diffusion models, but its fixed guidance amount can reduce diversity and induce memorization. FBG aims to solve this by adjusting the guidance amount dynamically. Method: FBG starts from a linear assumption (the conditional distribution is linearly corrupted by the unconditional one) and uses feedback on the predicted informativeness of the conditional signal to adjust the guidance amount dynamically instead of fixing it. Result: On ImageNet512x512, FBG significantly outperforms CFG and is competitive with Limited Interval Guidance (LIG), while resting on a stronger mathematical framework. In text-to-image generation, FBG automatically assigns higher guidance to complex prompts. Conclusion: FBG provides a way to adjust guidance dynamically, outperforms conventional fixed-guidance CFG, and can be combined with other guidance schemes. Abstract: While Classifier-Free Guidance (CFG) has become standard for improving sample fidelity in conditional diffusion models, it can harm diversity and induce memorization by applying constant guidance regardless of whether a particular sample needs correction. We propose FeedBack Guidance (FBG), which uses a state-dependent coefficient to self-regulate guidance amounts based on need. Our approach is derived from first principles by assuming the learned conditional distribution is linearly corrupted by the unconditional distribution, contrasting with CFG's implicit multiplicative assumption. Our scheme relies on feedback of its own predictions about the conditional signal informativeness to adapt guidance dynamically during inference, challenging the view of guidance as a fixed hyperparameter. The approach is benchmarked on ImageNet512x512, where it significantly outperforms Classifier-Free Guidance and is competitive with Limited Interval Guidance (LIG) while benefitting from a strong mathematical framework. On Text-To-Image generation, we demonstrate that, as anticipated, our approach automatically applies higher guidance scales for complex prompts than for simpler ones and that it can be easily combined with existing guidance schemes such as CFG or LIG.
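
The contrast between a fixed and a state-dependent guidance scale can be sketched as follows. The first function is the usual CFG combination of conditional and unconditional noise predictions. The second is a purely hypothetical placeholder for "a scale that depends on the current state"; it is not the FBG update rule, which the abstract does not spell out.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward (and past) the conditional one by a factor `scale`.
    scale = 0 gives the unconditional prediction, scale = 1 the conditional one.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def toy_state_dependent_scale(eps_uncond, eps_cond, base=1.0, gain=2.0):
    """HYPOTHETICAL rule for illustration only (not FBG): use a larger scale
    when the two predictions disagree more, measured by mean absolute difference.
    """
    diff = sum(abs(c - u) for u, c in zip(eps_uncond, eps_cond)) / len(eps_cond)
    return base + gain * diff


def guided_prediction(eps_uncond, eps_cond, scale_fn=None, fixed_scale=3.0):
    """Use a fixed scale (standard CFG) or recompute the scale at every call."""
    scale = fixed_scale if scale_fn is None else scale_fn(eps_uncond, eps_cond)
    return cfg_combine(eps_uncond, eps_cond, scale)
```

In a sampler, `guided_prediction` would be called once per denoising step, so a state-dependent `scale_fn` lets the guidance strength vary along the trajectory instead of being a single hyperparameter.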

[97] VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning

Zikang Wang,Boyu Chen,Zhengrong Yue,Yi Wang,Yu Qiao,Limin Wang,Yali Wang

Main category: cs.CV

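TL;DR: VideoChat-A1 proposes a new long-video agent paradigm that progressively selects relevant shots through chain-of-shot reasoning, mimicking human thinking and significantly improving long-video understanding.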

Details Motivation: Existing MLLMs perform poorly on long-video understanding because they ignore shot structure; mimicking the human thinking process is needed to improve performance. Method: Adopts a chain-of-shot reasoning paradigm that progressively selects relevant shots and performs multimodal reasoning over them in a coarse-to-fine partition. Result: Achieves the best results on mainstream long-video QA benchmarks, such as VideoMME (77.0) and EgoSchema (70.1), outperforming baseline models and some closed-source models. Conclusion: By mimicking the human thinking process, VideoChat-A1 significantly improves the accuracy and efficiency of long-video understanding. Abstract: The recent advance in video understanding has been driven by multimodal large language models (MLLMs). However, these MLLMs are good at analyzing short videos, while suffering from difficulties in understanding videos with a longer context. To address this difficulty, several agent paradigms have recently been proposed, using MLLMs as agents for retrieving extra contextual knowledge in a long video. However, most existing agents ignore the key fact that a long video is composed of multiple shots, i.e., to answer the user question from a long video, it is critical to deeply understand its relevant shots as a human would. Without such insight, these agents often mistakenly find redundant or even noisy temporal context, restricting their capacity for long video understanding. To fill this gap, we propose VideoChat-A1, a novel long video agent paradigm. Different from previous works, our VideoChat-A1 can deeply think with long videos, via a distinct chain-of-shot reasoning paradigm. More specifically, it can progressively select the shots relevant to the user question, and look into these shots in a coarse-to-fine partition. By multi-modal reasoning along the shot chain, VideoChat-A1 can effectively mimic the step-by-step human thinking process, allowing it to interactively discover preferable temporal context for thoughtful understanding in long videos. Extensive experiments show that our VideoChat-A1 achieves state-of-the-art performance on the mainstream long video QA benchmarks, e.g., it achieves 77.0 on VideoMME and 70.1 on EgoSchema, outperforming its strong baselines (e.g., Intern2.5VL-8B and InternVideo2.5-8B) by up to 10.8% and 6.2%. Compared to the leading closed-source GPT-4o and Gemini 1.5 Pro, VideoChat-A1 offers competitive accuracy, but with 7% input frames and 12% inference time on average.

[98] Bidirectional Image-Event Guided Low-Light Image Enhancement

Zhanwen Liu,Huanna Song,Yang Wang,Nan Yang,Shangyu Xie,Yisheng An,Xiangmo Zhao

Main category: cs.CV

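TL;DR: Proposes a bidirectionally guided low-light image enhancement framework (BiLIE) that uses frequency high-pass filtering and a bidirectional cross-attention fusion mechanism to address the global low-frequency noise and local structural discontinuities of event-camera data in low light, and provides a new dataset, RELIE.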

Details Motivation: Under extreme low light, traditional frame-based cameras lose image detail and suffer motion blur because of limited dynamic range and temporal resolution. Existing event-camera methods neglect the global low-frequency noise caused by dynamic lighting and the local structural discontinuities of sparse event data. Method: Proposes the BiLIE framework, which includes an Event Feature Enhancement (EFE) module based on frequency high-pass filtering and a Bidirectional Cross Attention Fusion (BCAF) mechanism, to suppress low-frequency noise and preserve high-frequency edges. A new dataset, RELIE, is also constructed. Result: BiLIE outperforms existing methods by 0.96 dB in PSNR and 0.03 in LPIPS. Conclusion: By suppressing noise and enhancing structural continuity, BiLIE significantly improves low-light image enhancement and is backed by a high-quality dataset. Abstract: Under extreme low-light conditions, traditional frame-based cameras, due to their limited dynamic range and temporal resolution, face detail loss and motion blur in captured images. To overcome this bottleneck, researchers have introduced event cameras and proposed event-guided low-light image enhancement algorithms. However, these methods neglect the influence of global low-frequency noise caused by dynamic lighting conditions and local structural discontinuities in sparse event data. To address these issues, we propose an innovative Bidirectional guided Low-light Image Enhancement framework (BiLIE). Specifically, to mitigate the significant low-frequency noise introduced by global illumination step changes, we introduce the frequency high-pass filtering-based Event Feature Enhancement (EFE) module at the event representation level to suppress the interference of low-frequency information, and preserve and highlight the high-frequency edges. Furthermore, we design a Bidirectional Cross Attention Fusion (BCAF) mechanism to acquire high-frequency structures and edges while suppressing structural discontinuities and local noise introduced by sparse event guidance, thereby generating smoother fused representations. Additionally, considering the poor visual quality and color bias in existing datasets, we provide a new dataset (RELIE), with high-quality ground truth through a reliable enhancement scheme. Extensive experimental results demonstrate that our proposed BiLIE outperforms state-of-the-art methods by 0.96 dB in PSNR and 0.03 in LPIPS.
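
The idea of suppressing low-frequency content while keeping high-frequency edges can be illustrated with a toy one-dimensional frequency-domain high-pass filter. This is only a sketch of the general technique: the EFE module operates on event representations and its actual design is not given in the abstract, and the naive DFT and the `cutoff` parameter here are assumptions made for illustration.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform, O(n^2); fine for short toy signals."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Inverse of dft()."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def high_pass(x, cutoff):
    """Zero every frequency bin whose index distance from DC is below `cutoff`."""
    n = len(x)
    X = dft(x)
    for k in range(n):
        # Bins k and n - k represent the same absolute frequency.
        if min(k, n - k) < cutoff:
            X[k] = 0
    return [v.real for v in idft(X)]
```

For example, a constant brightness offset (a global illumination level) lives entirely in the DC bin, so `high_pass(signal, 1)` removes it while leaving rapid alternations untouched.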

[99] CCLSTM: Coupled Convolutional Long-Short Term Memory Network for Occupancy Flow Forecasting

Peter Lengyel

Main category: cs.CV

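TL;DR: Proposes a lightweight convolutional LSTM architecture (CCLSTM) for predicting the future states of dynamic agents, avoiding existing methods' reliance on high-quality vectorized inputs and compute-heavy Transformers.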

Details Motivation: Existing methods depend on high-quality vectorized inputs and compute-intensive Transformer architectures, which are hard to obtain or expensive to deploy in practice. Method: Proposes CCLSTM, a lightweight architecture based on convolutional operations that needs neither vectorized inputs nor self-attention and captures spatio-temporal dynamics with a compact recurrent convolutional structure. Result: CCLSTM achieves state-of-the-art performance on occupancy flow metrics and ranks first in all metrics of the 2024 Waymo challenge. Conclusion: CCLSTM offers an efficient, lightweight solution suited to real-world autonomous driving. Abstract: Predicting future states of dynamic agents is a fundamental task in autonomous driving. An expressive representation for this purpose is Occupancy Flow Fields, which provide a scalable and unified format for modeling motion, spatial extent, and multi-modal future distributions. While recent methods have achieved strong results using this representation, they often depend on high-quality vectorized inputs, which are unavailable or difficult to generate in practice, and the use of transformer-based architectures, which are computationally intensive and costly to deploy. To address these issues, we propose Coupled Convolutional LSTM (CCLSTM), a lightweight, end-to-end trainable architecture based solely on convolutional operations. Without relying on vectorized inputs or self-attention mechanisms, CCLSTM effectively captures temporal dynamics and spatial occupancy-flow correlations using a compact recurrent convolutional structure. Despite its simplicity, CCLSTM achieves state-of-the-art performance on occupancy flow metrics and, as of this submission, ranks 1st in all metrics on the 2024 Waymo Occupancy and Flow Prediction Challenge leaderboard.

[100] CLaMR: Contextualized Late-Interaction for Multimodal Content Retrieval

David Wan,Han Wang,Elias Stengel-Eskin,Jaemin Cho,Mohit Bansal

Main category: cs.CV

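TL;DR: CLaMR is a multimodal retrieval system that improves video content retrieval by dynamically selecting the best combination of modalities, significantly outperforming conventional methods.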

Details Motivation: Online video is inherently multimodal, so conventional retrieval that treats modalities independently performs poorly; retrieval quality improves when the best modality combination is chosen dynamically. Method: Proposes CLaMR, which jointly encodes video frames, transcribed speech, on-screen text, and metadata, and introduces the MultiVENT 2.0++ dataset and a modality-aware loss. Result: On the MultiVENT 2.0++ and MSRVTT test sets, CLaMR significantly outperforms single-modality and multi-modality baselines; on MultiVENT 2.0++, nDCG@10 improves by 25.6 over the best single-modality retriever and by 35.4 over the best multi-modality retriever. Conclusion: Through dynamic modality selection and joint training, CLaMR performs strongly on multimodal video retrieval and also brings clear gains on downstream tasks such as long-video QA. Abstract: Online video web content is richly multimodal: a single video blends vision, speech, ambient audio, and on-screen text. Retrieval systems typically treat these modalities as independent retrieval sources, which can lead to noisy and subpar retrieval. We explore multimodal video content retrieval, where relevance can be scored from one particular modality or jointly across multiple modalities simultaneously. Consequently, an effective retriever must dynamically choose which modality (or set of modalities) best addresses the query. We introduce CLaMR, a multimodal, late-interaction retriever that jointly indexes 4 modalities: video frames, transcribed speech, on-screen text, and metadata. CLaMR jointly encodes all modalities with a unified multimodal backbone for improved contextualization and is trained to enhance dynamic modality selection via two key innovations. First, given the lack of training data for multimodal retrieval, we introduce MultiVENT 2.0++, a large-scale synthetic training dataset built on MultiVENT 2.0 (event-centric videos in various languages paired with queries) with modality-targeted queries. Next, we propose a modality-aware loss that jointly trains according to a standard contrastive objective alongside an objective for learning correct modality usage. On the test sets of MultiVENT 2.0++ and MSRVTT, conventional aggregation strategies, such as averaging similarities for baseline retrievers, degrade performance by introducing noise from irrelevant modalities. In contrast, CLaMR consistently outperforms existing retrievers: on MultiVENT 2.0++, CLaMR improves nDCG@10 by 25.6 over the best single-modality retriever and by 35.4 over the best multi-modality retriever. We illustrate CLaMR's downstream utility on long-video QA, retrieving relevant frames and obtaining a 3.50% boost over LanguageBind on Video-MME and 1.42% over dense sampling on LongVideoBench.
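
As background on the term "late interaction", here is a minimal sketch of the generic token-level scoring rule popularized by ColBERT: each query token is matched to its most similar document token and the maxima are summed. How CLaMR combines tokens across its four modalities is not detailed in the abstract, so the flat list of document vectors and the function names below are assumptions.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_vecs, doc_vecs):
    """Sum over query tokens of the best-matching document token similarity."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank_documents(query_vecs, docs):
    """docs: dict mapping a document id to its list of token vectors.
    Returns document ids sorted from highest to lowest score."""
    scored = {doc_id: late_interaction_score(query_vecs, vecs)
              for doc_id, vecs in docs.items()}
    return sorted(scored, key=scored.get, reverse=True)
```

Because every query token picks its own best match, a document whose token vectors come from several modalities can be matched through whichever modality carries the relevant evidence, which is one intuition behind indexing modalities jointly.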

[101] A Novel Large-scale Crop Dataset and Dual-stream Transformer Method for Fine-grained Hierarchical Crop Classification from Integrated Hyperspectral EnMAP Data and Multispectral Sentinel-2 Time Series

Wenyuan Li,Shunlin Liang,Yuxiang Zhang,Liqin Liu,Keyan Chen,Yongzhe Chen,Han Ma,Jianglei Xu,Yichuan Ma,Shikang Guan,Zhenwei Shi

Main category: cs.CV

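TL;DR: Proposes a dual-stream Transformer architecture that combines hyperspectral and multi-temporal satellite data for fine-grained crop classification, and validates its superior performance experimentally.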

Details Motivation: Fine-grained crop classification is essential for precision agriculture and food security monitoring, but related research is scarce because hyperspectral data are hard to acquire and crop annotation is costly. Method: Builds the H2Crop dataset by combining EnMAP hyperspectral data with Sentinel-2 time series, and proposes a dual-stream Transformer architecture that processes spectral-spatial and temporal features separately. Result: Adding hyperspectral data raises the F1 score by 4.2% on average and by up to 6.3%, and the method outperforms existing deep learning approaches. Conclusion: The work provides an important benchmark for fine-grained crop classification and hyperspectral image processing, and shows that hyperspectral data bring consistent benefits. Abstract: Fine-grained crop classification is crucial for precision agriculture and food security monitoring. It requires simultaneous capture of both phenological dynamics (obtained from multi-temporal satellite data like Sentinel-2) and subtle spectral variations (demanding nanometer-scale spectral resolution from hyperspectral imagery). Research combining these two modalities currently remains scarce due to challenges in hyperspectral data acquisition and crop type annotation costs. To address these issues, we construct a hierarchical hyperspectral crop dataset (H2Crop) by integrating 30m-resolution EnMAP hyperspectral data with Sentinel-2 time series. With over one million annotated field parcels organized in a four-tier crop taxonomy, H2Crop establishes a vital benchmark for fine-grained agricultural crop classification and hyperspectral image processing. We propose a dual-stream Transformer architecture that synergistically processes these modalities. It coordinates two specialized pathways: a spectral-spatial Transformer extracts fine-grained signatures from hyperspectral EnMAP data, while a temporal Swin Transformer extracts crop growth patterns from Sentinel-2 time series. The designed hierarchical classification heads with hierarchical fusion then simultaneously deliver multi-level classification across all taxonomic tiers. Experiments demonstrate that adding hyperspectral EnMAP data to Sentinel-2 time series yields a 4.2% average F1-score improvement (peaking at 6.3%). Extensive comparisons also confirm our method's higher accuracy over existing deep learning approaches for crop type classification and the consistent benefits of hyperspectral data across varying temporal windows and crop change scenarios. Codes and dataset will be available at https://github.com/flyakon/H2Crop and www.glass.hku.hk. Keywords: Crop type classification, precision agriculture, remote sensing, deep learning, hyperspectral data, Sentinel-2 time series, fine-grained crops

[102] Technical Report for Egocentric Mistake Detection for the HoloAssist Challenge

Constantin Patsch,Marsil Zakour,Yuankai Wu,Eckehard Steinbach

Main category: cs.CV

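TL;DR: Proposes an online mistake detection framework that covers both procedural and execution errors and uses a large language model to generate feedback; experiments confirm its effectiveness.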

Details Motivation: Online mistake detection is vital in domains such as industrial automation and education, but existing methods focus mainly on procedural errors and overlook other error types. Method: Develops an online mistake detection framework that handles both procedural and execution errors and uses a large language model to generate explanatory feedback. Result: On the HoloAssist benchmark, the approach placed second on the mistake detection task, confirming its effectiveness. Conclusion: The framework offers a more comprehensive solution for online mistake detection that is suitable for real-world use. Abstract: In this report, we address the task of online mistake detection, which is vital in domains like industrial automation and education, where real-time video analysis allows human operators to correct errors as they occur. While previous work focuses on procedural errors involving action order, broader error types must be addressed for real-world use. We introduce an online mistake detection framework that handles both procedural and execution errors (e.g., motor slips or tool misuse). Upon detecting an error, we use a large language model (LLM) to generate explanatory feedback. Experiments on the HoloAssist benchmark confirm the effectiveness of our approach, which placed second on the mistake detection task.

[103] SatelliteFormula: Multi-Modal Symbolic Regression from Remote Sensing Imagery for Physics Discovery

Zhenyu Yu,Mohd. Yamani Idna Idris,Pei Wang,Yuelong Xia,Fei Ma,Rizwan Qureshi

Main category: cs.CV

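TL;DR: SatelliteFormula is a novel symbolic regression framework that derives physically interpretable expressions directly from multi-spectral remote sensing imagery.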

Details Motivation: Traditional empirical indices and black-box learning models cannot cope with the high-dimensional complexity of multi-spectral data and lack physical interpretability. Method: Combines a Vision Transformer encoder for spatial-spectral feature extraction with physics-guided constraints that ensure consistency and interpretability. Result: Shows superior performance, stability, and generalization on benchmark datasets and remote sensing tasks. Conclusion: SatelliteFormula bridges the gap between data-driven learning and physical understanding, enabling interpretable modeling of complex environmental variables. Abstract: We propose SatelliteFormula, a novel symbolic regression framework that derives physically interpretable expressions directly from multi-spectral remote sensing imagery. Unlike traditional empirical indices or black-box learning models, SatelliteFormula combines a Vision Transformer-based encoder for spatial-spectral feature extraction with physics-guided constraints to ensure consistency and interpretability. Existing symbolic regression methods struggle with the high-dimensional complexity of multi-spectral data; our method addresses this by integrating transformer representations into a symbolic optimizer that balances accuracy and physical plausibility. Extensive experiments on benchmark datasets and remote sensing tasks demonstrate superior performance, stability, and generalization compared to state-of-the-art baselines. SatelliteFormula enables interpretable modeling of complex environmental variables, bridging the gap between data-driven learning and physical understanding.

[104] STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving

Christian Fruhwirth-Reisinger,Dušan Malić,Wei Lin,David Schinagl,Samuel Schulter,Horst Possegger

Main category: cs.CV

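TL;DR: STSBench is a scenario-based framework for evaluating the holistic understanding of vision-language models (VLMs) in autonomous driving; it generates multiple-choice questions from multi-view, multi-frame data and exposes the weaknesses of existing models in spatio-temporal reasoning.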

Details Motivation: Existing benchmarks mainly target semantic tasks on single-view images or videos, whereas autonomous driving requires multi-view and spatio-temporal reasoning, so a new evaluation framework is needed. Method: Automatically mines traffic scenarios from datasets, provides a user interface for human verification, and generates multiple-choice questions for model evaluation. Applied to the NuScenes dataset, it yields the STSnu benchmark. Result: STSnu contains 43 scenarios and 971 multiple-choice questions, and reveals that existing models lack spatio-temporal reasoning ability in complex environments. Conclusion: STSBench fills a gap in spatio-temporal evaluation and supports the development of more robust and explainable VLMs. Abstract: We introduce STSBench, a scenario-based framework to benchmark the holistic understanding of vision-language models (VLMs) for autonomous driving. The framework automatically mines pre-defined traffic scenarios from any dataset using ground-truth annotations, provides an intuitive user interface for efficient human verification, and generates multiple-choice questions for model evaluation. Applied to the NuScenes dataset, we present STSnu, the first benchmark that evaluates the spatio-temporal reasoning capabilities of VLMs based on comprehensive 3D perception. Existing benchmarks typically target off-the-shelf or fine-tuned VLMs for images or videos from a single viewpoint and focus on semantic tasks such as object recognition, dense captioning, risk assessment, or scene understanding. In contrast, STSnu evaluates driving expert VLMs for end-to-end driving, operating on videos from multi-view cameras or LiDAR. It specifically assesses their ability to reason about both ego-vehicle actions and complex interactions among traffic participants, a crucial capability for autonomous vehicles. The benchmark features 43 diverse scenarios spanning multiple views and frames, resulting in 971 human-verified multiple-choice questions. A thorough evaluation uncovers critical shortcomings in existing models' ability to reason about fundamental traffic dynamics in complex environments. These findings highlight the urgent need for architectural advances that explicitly model spatio-temporal reasoning. By addressing a core gap in spatio-temporal evaluation, STSBench enables the development of more robust and explainable VLMs for autonomous driving.

[105] GenIR: Generative Visual Feedback for Mental Image Retrieval

Diji Yang,Minghao Liu,Chung-Hsiang Lo,Yi Zhang,James Davis

Main category: cs.CV

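TL;DR: Proposes GenIR, a generative multi-round retrieval method for users who search for a mental image through multi-round interaction, and validates its advantage experimentally.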

Details Motivation: Existing vision-language models perform well on text-to-image retrieval, but real-world search is usually a multi-round interaction and lacks an intuitive feedback mechanism. Method: Proposes GenIR, which uses diffusion-based image generation to produce visual feedback in each round, helping users refine their queries intuitively. Result: GenIR significantly outperforms existing methods on multi-round mental image retrieval. Conclusion: The study contributes a new dataset and method for mental image retrieval and lays a foundation for future research. Abstract: Vision-language models (VLMs) have shown strong performance on text-to-image retrieval benchmarks. However, bridging this success to real-world applications remains a challenge. In practice, human search behavior is rarely a one-shot action. Instead, it is often a multi-round process guided by clues in mind, that is, a mental image ranging from vague recollections to vivid mental representations of the target image. Motivated by this gap, we study the task of Mental Image Retrieval (MIR), which targets the realistic yet underexplored setting where users refine their search for a mentally envisioned image through multi-round interactions with an image search engine. Central to successful interactive retrieval is the capability of machines to provide users with clear, actionable feedback; however, existing methods rely on indirect or abstract verbal feedback, which can be ambiguous, misleading, or ineffective for users to refine the query. To overcome this, we propose GenIR, a generative multi-round retrieval paradigm leveraging diffusion-based image generation to explicitly reify the AI system's understanding at each round. These synthetic visual representations provide clear, interpretable feedback, enabling users to refine their queries intuitively and effectively. We further introduce a fully automated pipeline to generate a high-quality multi-round MIR dataset. Experimental results demonstrate that GenIR significantly outperforms existing interactive methods in the MIR scenario. This work establishes a new task with a dataset and an effective generative retrieval method, providing a foundation for future research in this direction.

[106] Challenging Vision-Language Models with Surgical Data: A New Dataset and Broad Benchmarking Study

Leon Mayer,Tim Rädsch,Dominik Michael,Lucas Luttner,Amine Yamlahi,Evangelia Christodoulou,Patrick Godau,Marcel Knopp,Annika Reinke,Fiona Kolbinger,Lena Maier-Hein

Main category: cs.CV

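TL;DR: Evaluates vision-language models (VLMs) on laparoscopic surgery tasks and finds that they handle basic perception tasks well but perform poorly on tasks requiring medical knowledge, and that specialized medical VLMs currently trail generalist models.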

Details Motivation: To explore how well vision-language models apply to complex endoscopic surgery tasks and fill a research gap in this area. Method: Uses a range of state-of-the-art models, surgical datasets, and human annotations to evaluate VLMs on basic perception and advanced scene understanding tasks. Result: VLMs do well on basic tasks but poorly on tasks requiring medical knowledge; specialized medical VLMs underperform generalist models. Conclusion: VLMs need further optimization for the complexity of surgical environments; the findings point to directions for next-generation endoscopic AI systems. Abstract: While traditional computer vision models have historically struggled to generalize to endoscopic domains, the emergence of foundation models has shown promising cross-domain performance. In this work, we present the first large-scale study assessing the capabilities of Vision Language Models (VLMs) for endoscopic tasks with a specific focus on laparoscopic surgery. Using a diverse set of state-of-the-art models, multiple surgical datasets, and extensive human reference annotations, we address three key research questions: (1) Can current VLMs solve basic perception tasks on surgical images? (2) Can they handle advanced frame-based endoscopic scene understanding tasks? and (3) How do specialized medical VLMs compare to generalist models in this context? Our results reveal that VLMs can effectively perform basic surgical perception tasks, such as object counting and localization, with performance levels comparable to general domain tasks. However, their performance deteriorates significantly when the tasks require medical knowledge. Notably, we find that specialized medical VLMs currently underperform compared to generalist models across both basic and advanced surgical tasks, suggesting that they are not yet optimized for the complexity of surgical environments. These findings highlight the need for further advancements to enable VLMs to handle the unique challenges posed by surgery. Overall, our work provides important insights for the development of next-generation endoscopic AI systems and identifies key areas for improvement in medical visual language models.

[107] Optimizing Cloud-to-GPU Throughput for Deep Learning With Earth Observation Data

Akram Zaytar,Caleb Robinson,Girmaw Abebe Tadesse,Tammy Glazer,Gilles Hacheme,Anthony Ortiz,Rahul M Dodhia,Juan M Lavista Ferres

Main category: cs.CV

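TL;DR: By optimizing GeoTIFF data loading configurations, this paper markedly improves loading efficiency from both cloud and local storage, greatly raising GPU utilization while preserving training accuracy.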

Details Motivation: Standard PyTorch data loaders cannot keep modern GPUs busy when streaming GeoTIFF files directly from cloud storage, so the loading configuration needs to be optimized. Method: Systematically tests different loader configurations and data parameters, using Bayesian optimization to find the best settings for each storage type, with a focus on tile-aligned reads and worker thread pools. Result: The optimized configurations increase remote loading throughput by 20x and local throughput by 4x; on three public EO benchmarks, models trained with optimized remote loading reach the same accuracy as local training within identical time budgets, validation IoU improves by 6-15%, and GPU utilization reaches 85-95% (versus 0-30% with standard configurations). Conclusion: Optimizing GeoTIFF loading configurations can substantially improve training efficiency while preserving model performance, offering a practical solution for training on large-scale Earth observation data. Abstract: Training deep learning models on petabyte-scale Earth observation (EO) data requires separating compute resources from data storage. However, standard PyTorch data loaders cannot keep modern GPUs utilized when streaming GeoTIFF files directly from cloud storage. In this work, we benchmark GeoTIFF loading throughput from both cloud object storage and local SSD, systematically testing different loader configurations and data parameters. We focus on tile-aligned reads and worker thread pools, using Bayesian optimization to find optimal settings for each storage type. Our optimized configurations increase remote data loading throughput by 20x and local throughput by 4x compared to default settings. On three public EO benchmarks, models trained with optimized remote loading achieve the same accuracy as local training within identical time budgets. We improve validation IoU by 6-15% and maintain 85-95% GPU utilization versus 0-30% with standard configurations. Code is publicly available at https://github.com/microsoft/pytorch-cloud-geotiff-optimization
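
The two levers named in the abstract, tile-aligned reads and worker thread pools, can be sketched with the standard library alone. This is an illustrative sketch rather than the authors' code: the tile size, the window layout `(col_off, row_off, width, height)`, and the `read_tile` callback (which would wrap an actual GeoTIFF range request) are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def covering_tiles(col_off, row_off, width, height, tile):
    """Return (tile_row, tile_col) indices of every internal tile the window touches."""
    first_col = col_off // tile
    last_col = (col_off + width - 1) // tile
    first_row = row_off // tile
    last_row = (row_off + height - 1) // tile
    return [(r, c)
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)]


def aligned_window(col_off, row_off, width, height, tile):
    """Expand a window so its edges fall on tile boundaries.

    Reading the expanded window fetches whole tiles exactly once instead of
    decoding partial tiles.
    """
    tiles = covering_tiles(col_off, row_off, width, height, tile)
    (r0, c0), (r1, c1) = tiles[0], tiles[-1]
    return (c0 * tile, r0 * tile, (c1 - c0 + 1) * tile, (r1 - r0 + 1) * tile)


def fetch_tiles(tiles, read_tile, max_workers=8):
    """Fetch tiles concurrently; threads suit I/O-bound remote reads.

    read_tile is a HYPOTHETICAL callback taking (tile_row, tile_col).
    Results are returned in the same order as `tiles`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_tile, tiles))
```

With a 512-pixel tile, a 256x256 window at column 700 and row 300 touches two tiles, so `aligned_window(700, 300, 256, 256, 512)` expands it to `(512, 0, 512, 1024)`. The best value of `max_workers` depends on the storage backend, which is the kind of parameter the paper tunes with Bayesian optimization.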

[108] Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models

Zahra Babaiee,Peyman M. Kiasari,Daniela Rus,Radu Grosu

Main category: cs.CV

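TL;DR: Introduces the Visual Graph Arena (VGA) dataset for evaluating the visual abstraction ability of AI systems, and finds that current models have significant limitations in conceptual reasoning.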

Details Motivation: Multimodal large language models lack conceptualization in visual question answering, that is, the ability to recognize and reason about the same concept across different visual forms. Method: Designs six graph-based tasks that use diverse graph layouts (e.g., Kamada-Kawai and planar) to test whether models reason independently of visual form. Result: Humans achieve near-perfect accuracy, while models fail completely on isomorphism detection and show limited success on path/cycle tasks, suggesting pseudo-intelligent pattern matching. Conclusion: VGA provides a framework for evaluating and improving the conceptualization ability of AI vision models and exposes fundamental limitations in current models' visual understanding. Abstract: Recent advancements in multimodal large language models have driven breakthroughs in visual question answering. Yet, a critical gap persists: "conceptualization", the ability to recognize and reason about the same concept despite variations in visual form, a basic ability of human reasoning. To address this challenge, we introduce the Visual Graph Arena (VGA), a dataset featuring six graph-based tasks designed to evaluate and improve AI systems' capacity for visual abstraction. VGA uses diverse graph layouts (e.g., Kamada-Kawai vs. planar) to test reasoning independent of visual form. Experiments with state-of-the-art vision models and multimodal LLMs reveal a striking divide: humans achieved near-perfect accuracy across tasks, while models totally failed on isomorphism detection and showed limited success in path/cycle tasks. We further identify behavioral anomalies suggesting pseudo-intelligent pattern matching rather than genuine understanding. These findings underscore fundamental limitations in current AI models for visual understanding. By isolating the challenge of representation-invariant reasoning, the VGA provides a framework to drive progress toward human-like conceptualization in AI visual models. The Visual Graph Arena is available at: https://vga.csail.mit.edu/

[109] Movie Facts and Fibs (MF$^2$): A Benchmark for Long Movie Understanding

Emmanouil Zaranis,António Farinhas,Saul Santos,Beatriz Canaverde,Miguel Moura Ramos,Aditya K Surikuchi,André Viveiros,Baohao Liao,Elena Bueno-Benito,Nithin Sivakumaran,Pavlo Vasylenko,Shoubin Yu,Sonal Sannigrahi,Wafaa Mohammed,Ben Peters,Danae Sánchez Villegas,Elias Stengel-Eskin,Giuseppe Attanasio,Jaehong Yoon,Stella Frank,Alessandro Suglia,Chrysoula Zerva,Desmond Elliott,Mariella Dimiccoli,Mohit Bansal,Oswald Lanz,Raffaella Bernardi,Raquel Fernández,Sandro Pezzelle,Vlad Niculae,André F. T. Martins

Main category: cs.CV

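TL;DR: MF$^2$ is a new benchmark for evaluating whether models can comprehend, consolidate, and recall key narrative information from full-length movies; it exposes the shortcomings of current vision-language models in deep understanding.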

Details Motivation: Existing benchmarks either focus on peripheral details or rely on semi-automatically generated questions, and so fail to reflect genuine understanding; a more holistic evaluation tool is needed. Method: MF$^2$ includes over 50 full-length movies, each paired with manually constructed pairs of true and false claims; under a binary evaluation protocol, models must identify both the true and the false claim in each pair. Result: Current state-of-the-art models fall well short of human performance, highlighting their weakness in handling key narrative information. Conclusion: MF$^2$ exposes the limitations of vision-language models in deep comprehension and reasoning and points to directions for future work. Abstract: Despite recent progress in vision-language models (VLMs), holistic understanding of long-form video content remains a significant challenge, partly due to limitations in current benchmarks. Many focus on peripheral, "needle-in-a-haystack" details, encouraging context-insensitive retrieval over deep comprehension. Others rely on large-scale, semi-automatically generated questions (often produced by language models themselves) that are easier for models to answer but fail to reflect genuine understanding. In this paper, we introduce MF$^2$, a new benchmark for evaluating whether models can comprehend, consolidate, and recall key narrative information from full-length movies (50-170 minutes long). MF$^2$ includes over 50 full-length, open-licensed movies, each paired with manually constructed sets of claim pairs -- one true (fact) and one plausible but false (fib), totalling over 850 pairs. These claims target core narrative elements such as character motivations and emotions, causal chains, and event order, and refer to memorable moments that humans can recall without rewatching the movie. Instead of multiple-choice formats, we adopt a binary claim evaluation protocol: for each pair, models must correctly identify both the true and false claims. This reduces biases like answer ordering and enables a more precise assessment of reasoning. Our experiments demonstrate that both open-weight and closed state-of-the-art models fall well short of human performance, underscoring the relative ease of the task for humans and their superior ability to retain and reason over critical narrative information -- an ability current VLMs lack.
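The binary claim evaluation protocol described in the abstract can be illustrated with a small scoring sketch. This is an assumption-laden illustration rather than the benchmark's official scorer: the function names and the input format (one pair of booleans per claim pair, recording whether the model judged the fact and the fib to be true) are hypothetical.

```python
def pairwise_accuracy(verdicts):
    """Fraction of claim pairs where BOTH claims are judged correctly.

    verdicts: list of (says_fact_is_true, says_fib_is_true) booleans.
    A pair counts only if the fact is accepted and the fib is rejected.
    """
    correct = sum(1 for fact_ok, fib_ok in verdicts if fact_ok and not fib_ok)
    return correct / len(verdicts)


def per_claim_accuracy(verdicts):
    """Looser metric for comparison: each claim is scored independently."""
    correct = sum(int(fact_ok) + int(not fib_ok) for fact_ok, fib_ok in verdicts)
    return correct / (2 * len(verdicts))


if __name__ == "__main__":
    verdicts = [
        (True, False),   # both right
        (True, True),    # accepts everything: fact right, fib wrong
        (False, False),  # rejects everything: fact wrong, fib right
        (True, False),   # both right
    ]
    print(pairwise_accuracy(verdicts))
    print(per_claim_accuracy(verdicts))
```

In this toy input, a model that always answers "true" or always answers "false" scores 50% per claim but 0% per pair, which is one way to see why requiring both claims to be correct gives a stricter measure than scoring claims independently.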

[110] Bridging Perspectives: A Survey on Cross-view Collaborative Intelligence with Egocentric-Exocentric Vision

Yuping He,Yifei Huang,Guo Chen,Lidong Lu,Baoqi Pei,Jilan Xu,Tong Lu,Yoichi Sato

Main category: cs.CV

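TL;DR: This survey reviews video understanding from egocentric (first-person) and exocentric (third-person) perspectives, covering the synergy between the two views, application scenarios, research tasks, recent advances, and future directions.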

Details Motivation: Humans understand dynamic environments through the complementary cognition of first-person and third-person perspectives; enabling machines to exploit this synergy has become an important direction in video understanding. Method: Systematically organizes recent work into three directions: leveraging egocentric data to enhance exocentric understanding, utilizing exocentric data to improve egocentric analysis, and joint learning frameworks. Result: Reviews the relevant tasks, works, and benchmark datasets, and analyzes the limitations of current research. Conclusion: By synthesizing insights from both perspectives, the survey aims to advance video understanding and artificial intelligence, bringing machines closer to perceiving the world as humans do. Abstract: Perceiving the world from both egocentric (first-person) and exocentric (third-person) perspectives is fundamental to human cognition, enabling rich and complementary understanding of dynamic environments. In recent years, enabling machines to leverage the synergistic potential of these dual perspectives has emerged as a compelling research direction in video understanding. In this survey, we provide a comprehensive review of video understanding from both exocentric and egocentric viewpoints. We begin by highlighting the practical applications of integrating egocentric and exocentric techniques, envisioning their potential collaboration across domains. We then identify key research tasks to realize these applications. Next, we systematically organize and review recent advancements into three main research directions: (1) leveraging egocentric data to enhance exocentric understanding, (2) utilizing exocentric data to improve egocentric analysis, and (3) joint learning frameworks that unify both perspectives. For each direction, we analyze a diverse set of tasks and relevant works. Additionally, we discuss benchmark datasets that support research in both perspectives, evaluating their scope, diversity, and applicability. Finally, we discuss limitations in current works and propose promising future research directions. By synthesizing insights from both perspectives, our goal is to inspire advancements in video understanding and artificial intelligence, bringing machines closer to perceiving the world in a human-like manner. A GitHub repo of related works can be found at https://github.com/ayiyayi/Awesome-Egocentric-and-Exocentric-Vision.

[111] BecomingLit: Relightable Gaussian Avatars with Hybrid Neural Shading

Jonathan Schmidt,Simon Giebenhain,Matthias Niessner

Main category: cs.CV

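TL;DR: BecomingLit proposes a new method for reconstructing relightable, high-resolution head avatars that render at interactive rates, and it significantly outperforms existing methods in relighting and reenactment.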

Details Motivation: Existing methods for reconstructing relightable, high-resolution head avatars are either costly or fall short in quality, motivating a low-cost capture setup and an efficient representation. Method: Proposes a low-cost light stage capture setup and builds relightable avatars on 3D Gaussian primitives, using hybrid neural shading that combines a neural diffuse BRDF with an analytical specular term. Result: In experiments, the method significantly outperforms existing methods in relighting and reenactment, and avatars can be easily animated and controlled from monocular videos. Conclusion: BecomingLit offers an efficient, low-cost approach to reconstructing relightable head avatars with broad application potential. Abstract: We introduce BecomingLit, a novel method for reconstructing relightable, high-resolution head avatars that can be rendered from novel viewpoints at interactive rates. To this end, we propose a new low-cost light stage capture setup, tailored specifically towards capturing faces. Using this setup, we collect a novel dataset consisting of diverse multi-view sequences of numerous subjects under varying illumination conditions and facial expressions. By leveraging our new dataset, we introduce a new relightable avatar representation based on 3D Gaussian primitives that we animate with a parametric head model and an expression-dependent dynamics module. We propose a new hybrid neural shading approach, combining a neural diffuse BRDF with an analytical specular term. Our method reconstructs disentangled materials from our dynamic light stage recordings and enables all-frequency relighting of our avatars with both point lights and environment maps. In addition, our avatars can easily be animated and controlled from monocular videos. We validate our approach in extensive experiments on our dataset, where we consistently outperform existing state-of-the-art methods in relighting and reenactment by a significant margin.

[112] STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis

Jiatao Gu,Tianrong Chen,David Berthelot,Huangjie Zheng,Yuyang Wang,Ruixiang Zhang,Laurent Dinh,Miguel Angel Bautista,Josh Susskind,Shuangfei Zhai

Main category: cs.CV

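TL;DR: STARFlow is a scalable generative model based on normalizing flows that incorporates autoregressive Transformers and achieves strong performance in high-resolution image synthesis.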

Details Motivation: Addresses the efficiency and performance problems of normalizing flows in large-scale, high-resolution image generation. Method: Builds on TARFlow, combining a deep-shallow arrangement of Transformer blocks, modeling in the latent space of pretrained autoencoders, and a novel guidance algorithm. Result: Achieves competitive performance in class-conditional and text-conditional image generation, approaching the sample quality of diffusion models. Conclusion: To the authors' knowledge, this is the first demonstration that normalizing flows work effectively at this scale and resolution. Abstract: We present STARFlow, a scalable generative model based on normalizing flows that achieves strong performance in high-resolution image synthesis. The core of STARFlow is Transformer Autoregressive Flow (TARFlow), which combines the expressive power of normalizing flows with the structured modeling capabilities of Autoregressive Transformers. We first establish the theoretical universality of TARFlow for modeling continuous distributions. Building on this foundation, we introduce several key architectural and algorithmic innovations to significantly enhance scalability: (1) a deep-shallow design, wherein a deep Transformer block captures most of the model representational capacity, complemented by a few shallow Transformer blocks that are computationally efficient yet substantially beneficial; (2) modeling in the latent space of pretrained autoencoders, which proves more effective than direct pixel-level modeling; and (3) a novel guidance algorithm that significantly boosts sample quality. Crucially, our model remains an end-to-end normalizing flow, enabling exact maximum likelihood training in continuous spaces without discretization. STARFlow achieves competitive performance in both class-conditional and text-conditional image generation tasks, approaching state-of-the-art diffusion models in sample quality. To our knowledge, this work is the first successful demonstration of normalizing flows operating effectively at this scale and resolution.

[113] ExAct: A Video-Language Benchmark for Expert Action Analysis

Han Yi,Yulu Pan,Feihong He,Xinyu Liu,Benjamin Zhang,Oluwatumininu Oguntola,Gedas Bertasius

Main category: cs.CV

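TL;DR: ExAct is a new video-language benchmark for expert-level understanding of skilled human activities, containing 3521 expert-curated video question-answer pairs spanning 11 activities in 6 domains. Existing VLMs perform far below human experts.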

Details Motivation: To build a benchmark that tests precise understanding of human skills, driving progress of video-language models (VLMs) in physical and procedural domains. Method: Constructs a video question-answering dataset covering 11 physical activities, with five carefully designed candidate answers per question, requiring fine-grained, expert-level understanding. Result: The best-performing model, GPT-4o, reaches only 44.70% accuracy, well below the 82.02% achieved by human experts. Conclusion: ExAct is a useful tool for developing and evaluating VLMs; further work is needed to improve models' understanding of complex human skills. Abstract: We present ExAct, a new video-language benchmark for expert-level understanding of skilled physical human activities. Our new benchmark contains 3521 expert-curated video question-answer pairs spanning 11 physical activities in 6 domains: Sports, Bike Repair, Cooking, Health, Music, and Dance. ExAct requires the correct answer to be selected from five carefully designed candidate options, thus necessitating a nuanced, fine-grained, expert-level understanding of physical human skills. Evaluating the recent state-of-the-art VLMs on ExAct reveals a substantial performance gap relative to human expert performance. Specifically, the best-performing GPT-4o model achieves only 44.70% accuracy, well below the 82.02% attained by trained human specialists/experts. We believe that ExAct will be beneficial for developing and evaluating VLMs capable of precise understanding of human skills in various physical and procedural domains. Dataset and code are available at https://texaser.github.io/exact_project_page/

[114] CoMemo: LVLMs Need Image Context with Image Memory

Shi Liu,Weijie Su,Xizhou Zhu,Wenhai Wang,Jifeng Dai

Main category: cs.CV

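TL;DR: CoMemo proposes a dual-path architecture and a new positional encoding mechanism to address attention allocation and spatial-relationship problems in the visual processing of LVLMs.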

Details Motivation: Existing LVLM architectures suffer from uneven attention allocation and loss of spatial relationships when processing multimodal data. Method: Uses a dual-path architecture (a Context image path and an image Memory path) together with the RoPE-DHR positional encoding mechanism. Result: Outperforms conventional LVLM architectures on multiple benchmarks. Conclusion: CoMemo effectively improves LVLM performance on multimodal tasks. Abstract: Recent advancements in Large Vision-Language Models built upon Large Language Models have established aligning visual features with LLM representations as the dominant paradigm. However, inherited LLM architectural designs introduce suboptimal characteristics for multimodal processing. First, LVLMs exhibit a bimodal distribution in attention allocation, leading to the progressive neglect of middle visual content as context expands. Second, conventional positional encoding schemes fail to preserve vital 2D structural relationships when processing dynamic high-resolution images. To address these limitations, we propose CoMemo - a dual-path architecture that combines a Context image path with an image Memory path for visual processing, effectively alleviating visual information neglect. Additionally, we introduce RoPE-DHR, a novel positional encoding mechanism that employs thumbnail-based positional aggregation to maintain 2D spatial awareness while mitigating remote decay in extended sequences. Evaluations across seven benchmarks, including long-context comprehension, multi-image reasoning, and visual question answering, demonstrate CoMemo's superior performance compared to conventional LVLM architectures. Project page is available at https://lalbj.github.io/projects/CoMemo/.

[115] TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation

Muhammad Sohail Danish,Muhammad Akhtar Munir,Syed Roshaan Ali Shah,Muhammad Haris Khan,Rao Muhammad Anwer,Jorma Laaksonen,Fahad Shahbaz Khan,Salman Khan

Main category: cs.CV

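TL;DR: TerraFM is a self-supervised learning model that uses globally distributed Sentinel-1 and Sentinel-2 imagery, improving generalization on Earth observation tasks through multimodal fusion and contrastive learning.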

Details Motivation: Existing foundation models are limited by the scale, geographical coverage, and spectral diversity of their training data, making globally transferable representations hard to learn. Method: Combines multimodal imagery (radar and optical) under self-supervised learning, fusing inputs via modality-specific patch embeddings and adaptive cross-attention, and adds local-global contrastive learning with a dual-centering mechanism. Result: On GEO-Bench and Copernicus-Bench, TerraFM outperforms prior models on classification and segmentation tasks. Conclusion: Through multimodal fusion and its self-supervised training strategy, TerraFM substantially improves generalization on Earth observation tasks. Abstract: Modern Earth observation (EO) increasingly leverages deep learning to harness the scale and diversity of satellite imagery across sensors and regions. While recent foundation models have demonstrated promising generalization across EO tasks, many remain limited by the scale, geographical coverage, and spectral diversity of their training data, factors critical for learning globally transferable representations. In this work, we introduce TerraFM, a scalable self-supervised learning model that leverages globally distributed Sentinel-1 and Sentinel-2 imagery, combined with large spatial tiles and land-cover aware sampling to enrich spatial and semantic coverage. By treating sensing modalities as natural augmentations in our self-supervised approach, we unify radar and optical inputs via modality-specific patch embeddings and adaptive cross-attention fusion. Our training strategy integrates local-global contrastive learning and introduces a dual-centering mechanism that incorporates class-frequency-aware regularization to address long-tailed distributions in land cover. TerraFM achieves strong generalization on both classification and segmentation tasks, outperforming prior models on GEO-Bench and Copernicus-Bench. Our code and pretrained models are publicly available at: https://github.com/mbzuai-oryx/TerraFM

cs.GR [Back]

[116] Gen4D: Synthesizing Humans and Scenes in the Wild

Jerrin Bright,Zhibo Wang,Yuhao Chen,Sirisha Rambhatla,John Zelek,David Clausi

Main category: cs.GR

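TL;DR: Gen4D is an automated pipeline for generating diverse, photorealistic 4D human animations, combining motion encoding, diffusion-based Gaussian splatting, and background synthesis; the SportPAL synthetic dataset is built on top of it.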

Details Motivation: Data for in-the-wild activities is scarce, especially in uncommon domains such as sports, and existing synthetic-data approaches offer limited diversity. Method: Gen4D integrates motion encoding, avatar generation with diffusion-based Gaussian splatting, and background synthesis to generate diverse 4D animations automatically. Result: Produces the SportPAL dataset, covering baseball, ice hockey, and soccer, as a scalable synthetic-data foundation for in-the-wild vision tasks. Conclusion: Gen4D and SportPAL offer an efficient way to generate synthetic data without manual modeling. Abstract: Lack of input data for in-the-wild activities often results in low performance across various computer vision tasks. This challenge is particularly pronounced in uncommon human-centric domains like sports, where real-world data collection is complex and impractical. While synthetic datasets offer a promising alternative, existing approaches typically suffer from limited diversity in human appearance, motion, and scene composition due to their reliance on rigid asset libraries and hand-crafted rendering pipelines. To address this, we introduce Gen4D, a fully automated pipeline for generating diverse and photorealistic 4D human animations. Gen4D integrates expert-driven motion encoding, prompt-guided avatar generation using diffusion-based Gaussian splatting, and human-aware background synthesis to produce highly varied and lifelike human sequences. Based on Gen4D, we present SportPAL, a large-scale synthetic dataset spanning three sports: baseball, ice hockey, and soccer. Together, Gen4D and SportPAL provide a scalable foundation for constructing synthetic datasets tailored to in-the-wild human-centric vision tasks, with no need for manual 3D modeling or scene design.

[117] IGSM: Improved Geometric and Sensitivity Matching for Finetuning Pruned Diffusion Models

Caleb Zheng,Eli Shlizerman

Main category: cs.GR

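TL;DR: Proposes IGSM, a finetuning framework that uses a second-order Jacobian projection loss to improve pruned diffusion models and narrow their gap to dense models.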

Details Motivation: Pruned diffusion models lose quality because of their limited capacity, and existing methods do not fully exploit the dense model's knowledge during finetuning. Method: Proposes the IGSM framework, which introduces a second-order Jacobian projection loss to align the geometric and temporal dynamics of the pruned model with those of the dense model. Result: Experiments on several datasets show that IGSM substantially improves the sample quality of pruned models. Conclusion: IGSM is a general and efficient finetuning method applicable to diffusion models of multiple architectures. Abstract: Diffusion models achieve realistic outcomes across a wide range of generative tasks, but their high computational cost remains a major barrier to deployment. Model pruning has emerged as a promising strategy to reduce inference cost and enable lightweight diffusion models. While effective, pruned diffusion models are prone to quality reduction due to limited capacity. A key limitation of current pruning approaches is that pruned models are finetuned using the same objective as the dense model, typically denoising score matching (DSM). Since the dense model is accessible during finetuning, it warrants a more effective approach for knowledge transfer from the dense to the pruned model. Motivated by this aim, we revisit the finetuning stage and propose IGSM (Improved Geometric and Sensitivity Matching), a general-purpose finetuning framework that introduces a second-order Jacobian projection loss inspired by Finite-Time Lyapunov Exponents (FTLE). IGSM efficiently captures and aligns the geometric and the temporal dynamics of pruned models with their dense teachers using scalable second-order projections. Our approach is architecture-agnostic and applies to both U-Net- and Transformer-based diffusion models. Experiments on CIFAR-10, CelebA, LSUN-Church, and LSUN-Bedroom show that IGSM consistently narrows the performance gap between pruned and dense models, substantially improving sample quality. Code is available on GitHub: https://github.com/FATE4869/IGSM-Official

[118] AI-powered Contextual 3D Environment Generation: A Systematic Review

Miguel Silva,Alexandre Valle de Carvalho

Main category: cs.GR

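TL;DR: This paper systematically reviews generative AI for 3D scene generation, analyzing its characteristics, strengths, limitations, and potential for improvement, and discusses key challenges such as scene authenticity and the influence of textual inputs.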

Details Motivation: High-quality 3D environment generation is crucial for industries such as gaming, virtual reality, and film, but its reliance on manual processes makes it resource-intensive, motivating a study of generative AI techniques that could streamline it. Method: Systematically reviews existing generative AI techniques, analyzing their characteristics, strengths, limitations, and potential for improvement, and examines key challenges such as scene authenticity and the influence of textual inputs. Result: Advanced generative architectures produce high-quality 3D content at a high computational cost; multi-modal integration techniques such as cross-attention and latent space alignment facilitate text-to-3D tasks; and the quality and diversity of training data, together with comprehensive evaluation metrics, are critical to scalable, robust 3D scene generation. Conclusion: The review provides a comprehensive picture of the current state of AI-driven 3D content generation and a foundation for future research. Abstract: The generation of high-quality 3D environments is crucial for industries such as gaming, virtual reality, and cinema, yet remains resource-intensive due to the reliance on manual processes. This study performs a systematic review of existing generative AI techniques for 3D scene generation, analyzing their characteristics, strengths, limitations, and potential for improvement. By examining state-of-the-art approaches, it presents key challenges such as scene authenticity and the influence of textual inputs. Special attention is given to how AI can blend different stylistic domains while maintaining coherence, the impact of training data on output quality, and the limitations of current models. In addition, this review surveys existing evaluation metrics for assessing realism and explores how industry professionals incorporate AI into their workflows. The findings of this study aim to provide a comprehensive understanding of the current landscape and serve as a foundation for future research on AI-driven 3D content generation. Key findings include that advanced generative architectures enable high-quality 3D content creation at a high computational cost, effective multi-modal integration techniques like cross-attention and latent space alignment facilitate text-to-3D tasks, and the quality and diversity of training data combined with comprehensive evaluation metrics are critical to achieving scalable, robust 3D scene generation.

[119] ODE-GS: Latent ODEs for Dynamic Scene Extrapolation with 3D Gaussian Splatting

Daniel Wang,Patrick Rim,Tian Tian,Alex Wong,Ganesh Sundaramoorthi

Main category: cs.GR

TL;DR: ODE-GS combines 3D Gaussian Splatting with latent neural ODEs to forecast dynamic 3D scenes well beyond the time span seen during training.

Details Motivation: Existing neural rendering systems (NeRF- or 3DGS-based) perform poorly at temporal extrapolation; ODE-GS aims to address this. Method: Freezes a time-conditioned deformation model, trains a Transformer encoder to summarize past Gaussian trajectories into a latent state, and lets a neural ODE govern the evolution of that state. Result: On the D-NeRF and NVFI benchmarks, PSNR improves by up to 10 dB and LPIPS is halved relative to the strongest baselines. Conclusion: Continuous-time latent dynamics are an effective route to photorealistic prediction of complex 3D scenes. Abstract: We present ODE-GS, a novel method that unifies 3D Gaussian Splatting with latent neural ordinary differential equations (ODEs) to forecast dynamic 3D scenes far beyond the time span seen during training. Existing neural rendering systems - whether NeRF- or 3DGS-based - embed time directly in a deformation network and therefore excel at interpolation but collapse when asked to predict the future, where timestamps are strictly out-of-distribution. ODE-GS eliminates this dependency: after learning a high-fidelity, time-conditioned deformation model for the training window, we freeze it and train a Transformer encoder that summarizes past Gaussian trajectories into a latent state whose continuous evolution is governed by a neural ODE. Numerical integration of this latent flow yields smooth, physically plausible Gaussian trajectories that can be queried at any future instant and rendered in real time. Coupled with a variational objective and a lightweight second-derivative regularizer, ODE-GS attains state-of-the-art extrapolation on D-NeRF and NVFI benchmarks, improving PSNR by up to 10 dB and halving perceptual error (LPIPS) relative to the strongest baselines. Our results demonstrate that continuous-time latent dynamics are a powerful, practical route to photorealistic prediction of complex 3D scenes.

[120] Neural Visibility Cache for Real-Time Light Sampling

Jakub Bokšanský,Daniel Meister

Main category: cs.GR

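TL;DR: Proposes an online-trained neural cache that stores visibility between lights and 3D positions, combined with weighted reservoir sampling (WRS) for real-time rendering.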

Details Motivation: Direct illumination with many lights remains challenging in real-time scenarios. Method: Implements the neural cache as a multilayer perceptron (MLP) with multi-resolution hash-grid encoding, supporting online training and efficient inference. Result: Achieves efficient rendering at real-time frame rates and combines seamlessly with other real-time techniques such as ReSTIR. Conclusion: The method offers an efficient solution to the many-lights problem in real-time rendering. Abstract: Direct illumination with many lights is an inherent component of physically-based rendering, remaining challenging, especially in real-time scenarios. We propose an online-trained neural cache that stores visibility between lights and 3D positions. We feed light visibility to weighted reservoir sampling (WRS) to sample a light source. The cache is implemented as a fully-fused multilayer perceptron (MLP) with multi-resolution hash-grid encoding, enabling online training and efficient inference on modern GPUs at real-time frame rates. The cache can be seamlessly integrated into existing rendering frameworks and can be used in combination with other real-time techniques such as spatiotemporal reservoir sampling (ReSTIR).
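For readers unfamiliar with weighted reservoir sampling, the following is a minimal single-sample sketch of the standard streaming algorithm. It is not the paper's implementation: the function name is hypothetical, and the idea of forming each light's weight as an unshadowed contribution multiplied by a cached visibility estimate is an assumption about how the cache output might be used.

```python
import random


def weighted_reservoir_sample(candidates, weights, rng):
    """Pick one candidate with probability proportional to its weight.

    Streams over the candidates once, keeping a single reservoir slot and
    the running weight sum; zero-weight candidates can never be selected.
    """
    chosen = None
    weight_sum = 0.0
    for item, weight in zip(candidates, weights):
        if weight <= 0.0:
            continue
        weight_sum += weight
        # Replace the current sample with probability weight / weight_sum.
        if rng.random() * weight_sum < weight:
            chosen = item
    return chosen, weight_sum


if __name__ == "__main__":
    rng = random.Random(0)
    lights = ["key", "fill", "rim"]
    unshadowed = [4.0, 1.0, 2.0]   # hypothetical unshadowed contributions
    visibility = [0.0, 1.0, 0.5]   # hypothetical cached visibility estimates
    weights = [u * v for u, v in zip(unshadowed, visibility)]
    print(weighted_reservoir_sample(lights, weights, rng))
```

In this toy setup the brightest light is fully occluded, so its weight is zero and it is never chosen, which hints at why a visibility estimate could steer samples away from shadowed lights.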

[121] SurGSplat: Progressive Geometry-Constrained Gaussian Splatting for Surgical Scene Reconstruction

Yuchao Zheng,Jianing Zhang,Guochen Ning,Hongen Liao

Main category: cs.GR

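TL;DR: SurGSplat proposes progressively refining 3D Gaussian Splatting with geometric constraints to address 3D reconstruction failures caused by sparse features and inconsistent lighting in endoscopic scenes, with the aim of improving the accuracy and safety of surgical navigation.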

Details Motivation: In endoscopic surgical scenes, sparse features and inconsistent lighting cause existing SfM-based methods to fail, so an accurate and robust 3D reconstruction method is needed. Method: SurGSplat progressively refines 3D Gaussian Splatting under geometric constraints to reconstruct vascular structures and other critical features in detail. Result: Experiments show superior performance in novel view synthesis and pose estimation accuracy, providing high-fidelity reconstruction of surgical scenes. Conclusion: SurGSplat is an efficient, high-fidelity method for endoscopic surgical scene reconstruction that improves visual clarity and supports precise intraoperative decision-making. Abstract: Intraoperative navigation relies heavily on precise 3D reconstruction to ensure accuracy and safety during surgical procedures. However, endoscopic scenarios present unique challenges, including sparse features and inconsistent lighting, which render many existing Structure-from-Motion (SfM)-based methods inadequate and prone to reconstruction failure. To mitigate these constraints, we propose SurGSplat, a novel paradigm designed to progressively refine 3D Gaussian Splatting (3DGS) through the integration of geometric constraints. By enabling the detailed reconstruction of vascular structures and other critical features, SurGSplat provides surgeons with enhanced visual clarity, facilitating precise intraoperative decision-making. Experimental evaluations demonstrate that SurGSplat achieves superior performance in both novel view synthesis (NVS) and pose estimation accuracy, establishing it as a high-fidelity and efficient solution for surgical scene reconstruction. More information and results can be found on the page https://surgsplat.github.io/.

[122] Hardware Accelerated Neural Block Texture Compression with Cooperative Vectors

Belcour Laurent,Benyoub Anis

Main category: cs.GR

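TL;DR: This paper extends neural texture compression to low dynamic range block compression formats, achieving a higher compression ratio or higher quality at a fixed ratio, and improves runtime performance through hardware acceleration.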

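Details Motivation: Existing neural texture compression works well with high dynamic range block compression formats, but the viability of low dynamic range formats had not been demonstrated; this paper aims to show that they still support neural texture compression effectively and to optimize performance. Method: Uses existing block compression methods together with hardware texture filtering to store a neural representation of physically-based rendering (PBR) texture sets, and a tile-based rendering architecture that uses hardware matrix multiplication engines to improve runtime performance. Result: On an Intel B580, renders 4K texture sets (9 channels per asset) with anisotropic filtering using 28MB of VRAM per texture set in 0.55ms. Conclusion: Low dynamic range block compression formats are viable for neural texture compression and allow a higher compression ratio or higher quality, while hardware acceleration substantially improves performance. Abstract: In this work, we present an extension to the neural texture compression method of Weinreich and colleagues [2024]. Like them, we leverage existing block compression methods, which permit the use of hardware texture filtering, to store a neural representation of physically-based rendering (PBR) texture sets (including albedo, normal maps, roughness, etc.). However, we show that low dynamic range block compression formats still make the solution viable. Thanks to this, we show that we can achieve a higher compression ratio, or higher quality at a fixed compression ratio. We improve performance at runtime using a tile-based rendering architecture that leverages hardware matrix multiplication engines. Thanks to all this, we render 4K texture sets (9 channels per asset) with anisotropic filtering at 1080p using only 28MB of VRAM per texture set at 0.55ms on an Intel B580.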

cs.CL [Back]

[123] EvidenceOutcomes: a Dataset of Clinical Trial Publications with Clinically Meaningful Outcomes

Yiliang Zhou,Abigail M. Newbury,Gongbo Zhang,Betina Ross Idnay,Hao Liu,Chunhua Weng,Yifan Peng

Main category: cs.CL

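TL;DR: The paper presents EvidenceOutcomes, a new corpus annotated with clinically meaningful outcomes, addressing the neglect or oversimplification of the Outcome element in existing benchmarks.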

Details Motivation: In evidence-based medicine, the Outcome is the most complex PICO element, yet existing benchmarks often neglect or oversimplify it. Method: Develops an annotation guideline through iteration and expert discussion; three independent annotators then annotate 500 PubMed abstracts and 140 abstracts from the EBM-NLP corpus, and a PubMedBERT model is fine-tuned on the result. Result: The annotated corpus has an inter-rater agreement of 0.76, and the PubMedBERT model achieves F1-scores of 0.69 at the entity level and 0.76 at the token level. Conclusion: EvidenceOutcomes can serve as a shared benchmark for developing and testing future machine learning algorithms. Abstract: The fundamental process of evidence extraction and synthesis in evidence-based medicine involves extracting PICO (Population, Intervention, Comparison, and Outcome) elements from biomedical literature. However, Outcomes, being the most complex elements, are often neglected or oversimplified in existing benchmarks. To address this issue, we present EvidenceOutcomes, a novel, large, annotated corpus of clinically meaningful outcomes extracted from biomedical literature. We first developed a robust annotation guideline for extracting clinically meaningful outcomes from text through iteration and discussion with clinicians and Natural Language Processing experts. Then, three independent annotators annotated the Results and Conclusions sections of a randomly selected sample of 500 PubMed abstracts and 140 PubMed abstracts from the existing EBM-NLP corpus. This resulted in EvidenceOutcomes with high-quality annotations of an inter-rater agreement of 0.76. Additionally, our fine-tuned PubMedBERT model, applied to these 500 PubMed abstracts, achieved an F1-score of 0.69 at the entity level and 0.76 at the token level on the subset of 140 PubMed abstracts from the EBM-NLP corpus. EvidenceOutcomes can serve as a shared benchmark to develop and test future machine learning algorithms to extract clinically meaningful outcomes from biomedical abstracts.

[124] LLMs Can Also Do Well! Breaking Barriers in Semantic Role Labeling via Large Language Models

Xinxin Li,Huiyao Chen,Chengjun Liu,Jing Li,Meishan Zhang,Jun Yu,Min Zhang

Main category: cs.CL

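TL;DR: Using retrieval-augmented generation and a self-correction mechanism, this paper enables generative large language models to surpass encoder-decoder (BERT-like) models on semantic role labeling.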

Details Motivation: Although generative large language models excel at many NLP tasks, they still lag behind BERT-like models on semantic role labeling (SRL); this paper aims to close that gap. Method: Equips LLMs with retrieval-augmented generation (to draw on external linguistic knowledge) and self-correction (to fix inconsistent outputs). Result: On three widely used SRL benchmarks (CPB1.0, CoNLL-2009, and CoNLL-2012), the method achieves state-of-the-art performance in both Chinese and English. Conclusion: This is the first successful application of large language models to surpass encoder-decoder approaches in SRL. Abstract: Semantic role labeling (SRL) is a crucial task of natural language processing (NLP). Although generative decoder-based large language models (LLMs) have achieved remarkable success across various NLP tasks, they still lag behind state-of-the-art encoder-decoder (BERT-like) models in SRL. In this work, we seek to bridge this gap by equipping LLMs for SRL with two mechanisms: (a) retrieval-augmented generation and (b) self-correction. The first mechanism enables LLMs to leverage external linguistic knowledge such as predicate and argument structure descriptions, while the second allows LLMs to identify and correct inconsistent SRL outputs. We conduct extensive experiments on three widely-used benchmarks of SRL (CPB1.0, CoNLL-2009, and CoNLL-2012). Results demonstrate that our method achieves state-of-the-art performance in both Chinese and English, marking the first successful application of LLMs to surpass encoder-decoder approaches in SRL.

[125] Beyond RAG: Reinforced Reasoning Augmented Generation for Clinical Notes

Lo Pang-Yun Ting,Chengshuai Zhao,Yu-Hua Zeng,Yuan Jee Lim,Kun-Ta Chuang

Main category: cs.CL

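TL;DR: R2AG is a reinforcement-learning-trained retriever for generating long-form discharge instructions from pre-admission data; it retrieves reasoning paths from a medical knowledge graph to give the LLM semantic guidance, substantially improving generation quality.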

Details Motivation: Existing LLM-based methods fall short when generating long-form clinical notes from limited patient information and need more effective semantic guidance. Method: Proposes R2AG, which combines reinforcement learning with Group-Based Retriever Optimization (GRO) to improve retrieval quality and supplies explicit semantic guidance to the LLM through retrieved reasoning paths. Result: On the MIMIC-IV-Note dataset, R2AG outperforms baselines on both clinical efficacy and natural language generation metrics. Conclusion: R2AG fills semantic gaps in sparse-input scenarios and helps LLMs avoid clinical misinterpretation and follow more coherent reasoning. Abstract: Clinical note generation aims to automatically produce free-text summaries of a patient's condition and diagnostic process, with discharge instructions being a representative long-form example. While recent large language model (LLM)-based methods pre-trained on general clinical corpora show promise in clinical text generation, they fall short in producing long-form notes from limited patient information. In this paper, we propose R2AG, the first reinforced retriever for long-form discharge instruction generation based on pre-admission data. R2AG is trained with reinforcement learning to retrieve reasoning paths from a medical knowledge graph, providing explicit semantic guidance to the LLM. To bridge the information gap, we propose Group-Based Retriever Optimization (GRO) which improves retrieval quality with group-relative rewards, encouraging reasoning leaps for deeper inference by the LLM. Comprehensive experiments on the MIMIC-IV-Note dataset show that R2AG outperforms baselines in both clinical efficacy and natural language generation metrics. Further analysis reveals that R2AG fills semantic gaps in sparse input scenarios, and retrieved reasoning paths help LLMs avoid clinical misinterpretation by focusing on key evidence and following coherent reasoning.

[126] Advancing Decoding Strategies: Enhancements in Locally Typical Sampling for LLMs

Jaydip Sen,Saptarshi Sengupta,Subhasis Dasgupta

Main category: cs.CL

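TL;DR: This chapter proposes an improved locally typical sampling algorithm (ASTS) that uses dynamic entropy thresholding, multi-objective scoring, and reward-penalty adjustments to improve the fluency, diversity, and coherence of generated text.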

Details Motivation: Traditional decoding methods such as top-k and nucleus sampling struggle to balance fluency, diversity, and coherence in text generation, so improvements are needed. Method: Proposes Adaptive Semantic-Aware Typicality Sampling (ASTS), which combines dynamic entropy thresholding, multi-objective scoring, and reward-penalty adjustments. Result: Experiments show that ASTS outperforms existing sampling techniques across multiple benchmarks, reducing repetition and improving semantic alignment and fluency. Conclusion: ASTS is an efficient and effective decoding strategy for tasks that require high-quality text generation. Abstract: This chapter explores advancements in decoding strategies for large language models (LLMs), focusing on enhancing the Locally Typical Sampling (LTS) algorithm. Traditional decoding methods, such as top-k and nucleus sampling, often struggle to balance fluency, diversity, and coherence in text generation. To address these challenges, Adaptive Semantic-Aware Typicality Sampling (ASTS) is proposed as an improved version of LTS, incorporating dynamic entropy thresholding, multi-objective scoring, and reward-penalty adjustments. ASTS ensures contextually coherent and diverse text generation while maintaining computational efficiency. Its performance is evaluated across multiple benchmarks, including story generation and abstractive summarization, using metrics such as perplexity, MAUVE, and diversity scores. Experimental results demonstrate that ASTS outperforms existing sampling techniques by reducing repetition, enhancing semantic alignment, and improving fluency.
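As background for the baseline that ASTS extends, here is a minimal sketch of the filtering step in standard locally typical sampling. It covers only the baseline: the abstract does not specify how ASTS implements dynamic thresholding, multi-objective scoring, or reward-penalty adjustments, so none of that is attempted here. The function name and the fixed `tau` mass threshold are illustrative assumptions.

```python
import math


def locally_typical_filter(probs, tau=0.9):
    """Keep the tokens whose surprisal is closest to the distribution's entropy.

    probs: next-token probabilities (summing to 1).
    tau:   cumulative probability mass to retain (illustrative default).
    Returns {token_index: renormalized_probability} for the kept tokens.
    """
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Rank tokens by |surprisal - entropy|, most "typical" first.
    ranked = sorted(
        (i for i, p in enumerate(probs) if p > 0),
        key=lambda i: abs(-math.log(probs[i]) - entropy),
    )
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= tau:
            break
    return {i: probs[i] / mass for i in kept}


if __name__ == "__main__":
    probs = [0.5, 0.25, 0.125, 0.125]
    # With a small mass budget only the most typical token survives,
    # and here that is NOT the most probable one.
    print(locally_typical_filter(probs, tau=0.2))
    print(locally_typical_filter(probs, tau=0.7))
```

Unlike top-k or nucleus sampling, which always keep the most probable tokens, this filter ranks by closeness to the expected information content, so a highly probable but "too predictable" token can rank below a less probable one. Sampling then proceeds from the renormalized kept set.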

[127] taz2024full: Analysing German Newspapers for Gender Bias and Discrimination across Decades

Stefanie Urchs,Veronika Thurner,Matthias Aßenmacher,Christian Heumann,Stephanie Thiemichen

Main category: cs.CL

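TL;DR: taz2024full is the largest publicly available corpus of German newspaper articles to date, with over 1.8 million texts from 1980 to 2024, intended for research on issues such as gender bias.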

Details Motivation: 德语大规模语料资源有限,限制了语言趋势和社会问题(如性别偏见)的研究。 Method: 构建taz2024full语料库,并采用可扩展的结构化分析流程研究性别代表性。 Result: 研究发现男性在报道中持续被过度代表,但近年来逐渐趋于平衡。 Conclusion: 该语料库支持多种应用,促进德语NLP的包容性和可重复性研究。 Abstract: Open-access corpora are essential for advancing natural language processing (NLP) and computational social science (CSS). However, large-scale resources for German remain limited, restricting research on linguistic trends and societal issues such as gender bias. We present taz2024full, the largest publicly available corpus of German newspaper articles to date, comprising over 1.8 million texts from taz, spanning 1980 to 2024. As a demonstration of the corpus's utility for bias and discrimination research, we analyse gender representation across four decades of reporting. We find a consistent overrepresentation of men, but also a gradual shift toward more balanced coverage in recent years. Using a scalable, structured analysis pipeline, we provide a foundation for studying actor mentions, sentiment, and linguistic framing in German journalistic texts. The corpus supports a wide range of applications, from diachronic language analysis to critical media studies, and is freely available to foster inclusive and reproducible research in German-language NLP.

[128] Understanding Gender Bias in AI-Generated Product Descriptions

Markelle Kelly,Mohammad Tahaei,Padhraic Smyth,Lauren Wilcox

Main category: cs.CL

TL;DR: Studies gender bias in product descriptions generated by large language models (LLMs) for e-commerce, proposes new taxonomic categories, and analyzes how the bias appears in GPT-3.5 and an e-commerce-specific LLM.

Details Motivation: Gender bias in LLMs has been studied extensively in many domains, but e-commerce uses remain largely unexamined and may reveal new forms of algorithmic bias and harm. Method: Develops data-driven taxonomic categories of gender bias, situates them within existing general-purpose harms taxonomies, and quantitatively analyzes bias in GPT-3.5 and an e-commerce-specific LLM. Result: E-commerce settings exhibit distinctive forms of gender bias, such as assumptions about clothing size, stereotyped choices of which product features are advertised, and differences in the use of persuasive language. Conclusion: The study reveals under-explored dimensions of gender bias in e-commerce and adds to the understanding of three types of harm in current AI harms frameworks: exclusionary norms, stereotyping, and performance disparities. Abstract: While gender bias in large language models (LLMs) has been extensively studied in many domains, uses of LLMs in e-commerce remain largely unexamined and may reveal novel forms of algorithmic bias and harm. Our work investigates this space, developing data-driven taxonomic categories of gender bias in the context of product description generation, which we situate with respect to existing general purpose harms taxonomies. We illustrate how AI-generated product descriptions can uniquely surface gender biases in ways that require specialized detection and mitigation approaches. Further, we quantitatively analyze issues corresponding to our taxonomic categories in two models used for this task -- GPT-3.5 and an e-commerce-specific LLM -- demonstrating that these forms of bias commonly occur in practice. Our results illuminate unique, under-explored dimensions of gender bias, such as assumptions about clothing size, stereotypical bias in which features of a product are advertised, and differences in the use of persuasive language. These insights contribute to our understanding of three types of AI harms identified by current frameworks: exclusionary norms, stereotyping, and performance disparities, particularly for the context of e-commerce.

[129] Are Large Language Models Good Temporal Graph Learners?

Shenyang Huang,Ali Parviz,Emma Kondrup,Zachary Yang,Zifeng Ding,Michael Bronstein,Reihaneh Rabbany,Guillaume Rabusseau

Main category: cs.CL

TL;DR: TGTalker is a novel temporal graph learning framework designed for LLMs that predicts links on dynamic graphs and performs well on real-world networks.

Details Motivation: The application of LLMs to dynamic graphs (real-world evolving networks) remains underexplored, particularly for real-world temporal graphs. Method: TGTalker uses the recency bias in temporal graphs to extract structural information, converts it to natural language for the LLM, and uses temporal neighbors as additional information for prediction. Result: Across five real-world networks, TGTalker performs strongly on link prediction, outperforming popular models such as TGN and HTGN, and generates textual explanations for its predictions. Conclusion: TGTalker fills a gap in applying LLMs to dynamic graphs and opens new directions for explainability in temporal link prediction. Abstract: Large Language Models (LLMs) have recently driven significant advancements in Natural Language Processing and various other applications. While a broad range of literature has explored the graph-reasoning capabilities of LLMs, including their use of predictors on graphs, the application of LLMs to dynamic graphs -- real world evolving networks -- remains relatively unexplored. Recent work studies synthetic temporal graphs generated by random graph models, but applying LLMs to real-world temporal graphs remains an open question. To address this gap, we introduce Temporal Graph Talker (TGTalker), a novel temporal graph learning framework designed for LLMs. TGTalker utilizes the recency bias in temporal graphs to extract relevant structural information, converted to natural language for LLMs, while leveraging temporal neighbors as additional information for prediction. TGTalker demonstrates competitive link prediction capabilities compared to existing Temporal Graph Neural Network (TGNN) models. Across five real-world networks, TGTalker performs competitively with state-of-the-art temporal graph methods while consistently outperforming popular models such as TGN and HTGN. Furthermore, TGTalker generates textual explanations for each prediction, thus opening up exciting new directions in explainability and interpretability for temporal link prediction. The code is publicly available at https://github.com/shenyangHuang/TGTalker.

[130] Auto Review: Second Stage Error Detection for Highly Accurate Information Extraction from Phone Conversations

Ayesha Qamar,Arushi Raghuvanshi,Conal Sathi,Youngseo Son

Main category: cs.CL

TL;DR: Auto Review automates the post-call review stage of healthcare benefit verification phone calls with a second-stage pipeline, reducing manual review effort while maintaining high accuracy.

Details Motivation: Accuracy in healthcare benefit verification calls is critical to patient treatment, but existing systems are limited by speech recognition errors and domain-specific jargon. Method: Proposes a second-stage postprocessing pipeline that uses multiple ASR alternatives and pseudo-labeling, with no need for manually corrected transcripts. Result: Experiments show that general-purpose large language models and feature-based models substantially improve transcript quality, making Auto Review more efficient. Conclusion: The method addresses the speech recognition bottleneck and improves the accuracy and efficiency of automated review. Abstract: Automating benefit verification phone calls saves time in healthcare and helps patients receive treatment faster. It is critical to obtain highly accurate information in these phone calls, as it can affect a patient's healthcare journey. Given the noise in phone call transcripts, we have a two-stage system that involves a post-call review phase for potentially noisy fields, where human reviewers manually verify the extracted data, a labor-intensive task. To automate this stage, we introduce Auto Review, which significantly reduces manual effort while maintaining a high bar for accuracy. This system, being highly reliant on call transcripts, suffers a performance bottleneck due to automatic speech recognition (ASR) issues. This problem is further exacerbated by the use of domain-specific jargon in the calls. In this work, we propose a second-stage postprocessing pipeline for accurate information extraction. We improve accuracy by using multiple ASR alternatives and a pseudo-labeling approach that does not require manually corrected transcripts. Experiments with general-purpose large language models and feature-based model pipelines demonstrate substantial improvements in the quality of corrected call transcripts, thereby enhancing the efficiency of Auto Review.

[131] Homogeneous Keys, Heterogeneous Values: Exploiting Local KV Cache Asymmetry for Long-Context LLMs

Wanyun Cui,Mingwei Xu

Main category: cs.CL

TL;DR: Proposes AsymKV, a training-free compression framework that exploits an asymmetry in the KV cache (homogeneous keys, heterogeneous values) to make long-context modeling markedly more efficient.

Details Motivation: The quadratic complexity of attention in large language models (LLMs) limits efficient long-context modeling, and existing compression methods do not distinguish the different properties of keys and values. Method: Proposes the AsymKV framework, which combines homogeneity-based key merging with mathematically proven lossless value compression and requires no additional training. Result: On models such as LLaMA3.1-8B, AsymKV achieves an average LongBench score of 43.95, well above existing methods such as H$_2$O (38.89). Conclusion: The asymmetry in the KV cache is key to compression, and AsymKV's targeted design substantially improves long-context modeling performance. Abstract: Recent advances in Large Language Models (LLMs) have highlighted the critical importance of extending context length, yet the quadratic complexity of attention mechanisms poses significant challenges for efficient long-context modeling. KV cache compression has emerged as a key approach to address this challenge. Through extensive empirical analysis, we reveal a fundamental yet previously overlooked asymmetry in KV caches: while adjacent keys receive similar attention weights (local homogeneity), adjacent values demonstrate distinct heterogeneous distributions. This key-value asymmetry reveals a critical limitation in existing compression methods that treat keys and values uniformly. To address the limitation, we propose a training-free compression framework (AsymKV) that combines homogeneity-based key merging with a mathematically proven lossless value compression. Extensive experiments demonstrate that AsymKV consistently outperforms existing long-context methods across various tasks and base models. For example, on LLaMA3.1-8B, AsymKV achieves an average score of 43.95 on LongBench, surpassing SOTA methods like H$_2$O (38.89) by a large margin.

[132] SmoothRot: Combining Channel-Wise Scaling and Rotation for Quantization-Friendly LLMs

Patrik Czakó,Gábor Kertész,Sándor Szénási

Main category: cs.CL

TL;DR: SmoothRot is a novel post-training quantization technique that combines channel-wise scaling with Hadamard transformations to tackle activation outliers in 4-bit quantization of large language models (LLMs), significantly improving quantization accuracy.

Details Motivation: 4-bit quantization of LLMs is hampered by activation outliers, which hurt quantization accuracy. Method: SmoothRot combines channel-wise scaling with Hadamard transformations to turn extreme outliers into quantization-friendly activations. Result: On LLaMA2 7B, LLaMA3.1 8B, and Mistral 7B, SmoothRot narrows the performance gap between quantized and FP16 models by about 10-30% without adding inference latency. Conclusion: SmoothRot significantly improves the efficiency of 4-bit quantization for language generation and zero-shot reasoning tasks. Abstract: We present SmoothRot, a novel post-training quantization technique to enhance the efficiency of 4-bit quantization in Large Language Models (LLMs). SmoothRot addresses the critical challenge of massive activation outliers by integrating channel-wise scaling with Hadamard transformations. Our technique effectively transforms extreme outliers into quantization-friendly activations, significantly improving quantization accuracy. Experiments conducted on popular LLMs (LLaMA2 7B, LLaMA3.1 8B, and Mistral 7B) demonstrate that SmoothRot consistently reduces the performance gap between quantized and FP16 models by approximately 10-30% across language generation and zero-shot reasoning tasks, without introducing additional inference latency. Code is available at https://github.com/czakop/smoothrot.
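To make the combination of channel-wise scaling and rotation concrete, the following is a small pure-Python sketch under stated assumptions. It uses a SmoothQuant-style scale (with a hypothetical `alpha` parameter) followed by an orthonormal Hadamard rotation, and checks that the layer output is unchanged while the activation outlier shrinks. The exact scaling rule and ordering used by SmoothRot are not given in the abstract, so treat this as an illustration of the general idea, not the authors' implementation.

```python
import math


def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def hadamard(n):
    """Orthonormal Hadamard matrix (Sylvester construction); n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]


def max_abs(m):
    return max(abs(v) for row in m for v in row)


def smooth_and_rotate(x, w, alpha=0.5):
    """Transform activations x (tokens x channels) and weights w (channels x out).

    Assumes every channel has a nonzero activation and a nonzero weight.
    Returns (x', w') with x' @ w' == x @ w up to rounding.
    """
    n = len(w)
    # Per-channel scale that migrates outlier magnitude from x into w.
    s = [max(abs(r[j]) for r in x) ** alpha
         / max(abs(v) for v in w[j]) ** (1 - alpha) for j in range(n)]
    x_s = [[r[j] / s[j] for j in range(n)] for r in x]
    w_s = [[v * s[j] for v in w[j]] for j in range(n)]
    # Sylvester Hadamard matrices are symmetric, so H^T = H and H @ H = I.
    h = hadamard(n)
    return matmul(x_s, h), matmul(h, w_s)
```

Because the scaling and the rotation cancel inside the product, both can in principle be folded into the weights offline. This is one way a method of this kind could avoid extra inference latency.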

[133] Automatically Detecting Amusing Games in Wordle

Ronaldo Luo,Gary Liang,Cindy Liu,Adam Kabbara,Minahil Bakhtawar,Kina Kim,Michael Guerzhoy

Main category: cs.CL

TL;DR: Explores automatically predicting which Wordle games Reddit users find amusing, using GPT-3.5 to classify user reactions and extracting game features, and finds that amusement is predictable to some extent.

Details Motivation: To understand which Wordle games Reddit users find amusing and how game features relate to user reactions. Method: Scrapes about 80k user reactions from Reddit, classifies them with few-shot-prompted GPT-3.5, verifies agreement with human labels, and extracts game features to predict amusement. Result: Game features weakly predict user amusement, indicating that creative humor in Wordle games is measurable. Conclusion: User amusement at Wordle games can be predicted to some extent, which reveals a link between creativity and humor in the games. Abstract: We explore automatically predicting which Wordle games Reddit users find amusing. We scrape approximately 80k reactions by Reddit users to Wordle games, classify the reactions as expressing amusement or not with OpenAI's GPT-3.5 using few-shot prompting, and verify that GPT-3.5's labels roughly correspond to human labels. We then extract features from Wordle games that can predict user amusement. We demonstrate that the features indeed provide a (weak) signal that predicts user amusement as predicted by GPT-3.5. Our results indicate that user amusement at Wordle games can be predicted computationally to some extent. We explore which features of the game contribute to user amusement. We find that user amusement is predictable, indicating a measurable aspect of creativity infused into Wordle games through humor.

[134] MLLM-CL: Continual Learning for Multimodal Large Language Models

Hongbo Zhao,Fei Zhu,Rundong Wang,Gaofeng Meng,Zhaoxiang Zhang

Main category: cs.CL

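TL;DR: MLLM-CL is a new continual learning benchmark for multimodal large language models covering domain and ability continual learning; the accompanying method reduces catastrophic interference through parameter isolation and a routing mechanism and outperforms existing methods.

Details Motivation: Existing MLLMs struggle to continually integrate new knowledge and skills in dynamic real-world scenarios, and existing continual learning benchmarks and methods have limitations. Method: Proposes parameter isolation together with an MLLM-based routing mechanism to reduce catastrophic interference. Result: Experiments show that the approach integrates domain knowledge and functional abilities with minimal forgetting, significantly outperforming existing methods. Conclusion: MLLM-CL offers an effective solution for multimodal continual learning; adaptation to dynamic scenarios could be improved further in future work. Abstract: Recent Multimodal Large Language Models (MLLMs) excel in vision-language understanding but face challenges in adapting to dynamic real-world scenarios that require continuous integration of new knowledge and skills. While continual learning (CL) offers a potential solution, existing benchmarks and methods suffer from critical limitations. In this paper, we introduce MLLM-CL, a novel benchmark encompassing domain and ability continual learning, where the former focuses on independently and identically distributed (IID) evaluation across evolving mainstream domains, whereas the latter evaluates on non-IID scenarios with emerging model ability. Methodologically, we propose preventing catastrophic interference through parameter isolation, along with an MLLM-based routing mechanism. Extensive experiments demonstrate that our approach can integrate domain-specific knowledge and functional abilities with minimal forgetting, significantly outperforming existing methods.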

[135] Multidimensional Analysis of Specific Language Impairment Using Unsupervised Learning Through PCA and Clustering

Niruthiha Selvanayagam

Main category: cs.CL

TL;DR: Uses unsupervised machine learning to analyze children's language development trajectories and finds that SLI manifests mainly as reduced language production rather than syntactic complexity deficits, challenging traditional categorical diagnostic frameworks.

Details Motivation: Traditional diagnostic approaches may overlook subtle developmental patterns in SLI; the study uses unsupervised learning to uncover natural language development trajectories and inform early identification and intervention. Method: Analyzes narrative samples from 1,163 children aged 4-16 with PCA and clustering over 64 linguistic features to distinguish developmental trajectories and SLI profiles. Result: Two primary clusters emerge: high language production with low SLI prevalence, and limited production but higher syntactic complexity with higher SLI prevalence, supporting a continuum model of language abilities. Conclusion: SLI manifests mainly as reduced language production, and the results support the potential of unsupervised learning for refining diagnostic criteria and intervention strategies. Abstract: Specific Language Impairment (SLI) affects approximately 7 percent of children, presenting as isolated language deficits despite normal cognitive abilities, sensory systems, and supportive environments. Traditional diagnostic approaches often rely on standardized assessments, which may overlook subtle developmental patterns. This study aims to identify natural language development trajectories in children with and without SLI using unsupervised machine learning techniques, providing insights for early identification and targeted interventions. Narrative samples from 1,163 children aged 4-16 years across three corpora (Conti-Ramsden 4, ENNI, and Gillam) were analyzed using Principal Component Analysis (PCA) and clustering. A total of 64 linguistic features were evaluated to uncover developmental trajectories and distinguish linguistic profiles. Two primary clusters emerged: (1) high language production with low SLI prevalence, and (2) limited production but higher syntactic complexity with higher SLI prevalence. Additionally, boundary cases exhibited intermediate traits, supporting a continuum model of language abilities. Findings suggest SLI manifests primarily through reduced production capacity rather than syntactic complexity deficits. The results challenge categorical diagnostic frameworks and highlight the potential of unsupervised learning techniques for refining diagnostic criteria and intervention strategies.

[136] Improving LLMs with a knowledge from databases

Petr Máša

Main category: cs.CL

TL;DR: Proposes a method based on enhanced association rules: a ruleset is converted to text by a rule-to-text converter and supplied to the LLM as RAG input, significantly improving question answering over datasets.

Details Motivation: Asks whether interpretable ML methods, namely enhanced association rules, can improve answers based on a dataset or database while remaining safe, for example by being compatible with RAG. Method: Generates a ruleset from defined knowledge patterns, converts the rules to text with a rule-to-text converter, and includes the result as RAG input to the LLM. Result: Compared with ChatGPT (even when using agents), the method significantly improves answers to dataset-based questions. Conclusion: The method can be improved further, for example by incorporating other patterns or using rule mining as an agent. Abstract: Large language models (LLMs) are achieving significant progress almost every moment now. Many advanced techniques have been introduced and widely accepted, like retrieval-augmented generation (RAG), agents, and tools. Tools can query the database to answer questions from structured data files or perform groupings or other statistics. This unlocks huge opportunities, such as the ability to answer any question, but also poses threats, such as safety, because there is no control over the commands that are created. We would like to discuss whether we can create a new method that improves answers based on a dataset/database via some interpretable ML methods, namely enhanced association rules. The advantage would be if the method can also be used in some safe technique like RAG. Association rules have a sound history. Since the introduction of CN2 and Apriori, many enhancements have been made. In parallel, enhanced association rules have been introduced and evolved over the last 40 years. The general problem is typically that there are too many rules. There are some techniques for handling it, but when LLMs emerged, it turned out to be the best use case for the RAG technique for LLMs. We proposed a method that generates a ruleset based on defined knowledge patterns, then converts rules into text form via a rule-to-text converter, and includes the result as RAG input to the LLM. We compared this method with ChatGPT (even with using agents) and we have discovered a significant improvement in answering questions based on the dataset. We have also tried several strategies for how many rules to generate. We found this improvement interesting. Moreover, it can also be improved in many ways as future work, like incorporating other patterns, the use of rule mining as an agent, and many others.
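The rule-to-text step can be pictured with a short sketch. Everything here is hypothetical: the `Rule` fields, the sentence template, and the choice to keep the highest-confidence rules are assumptions made for illustration, since the abstract does not describe the converter's format or the rule-selection strategy.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """A mined association rule (hypothetical structure for illustration)."""
    antecedent: dict   # attribute -> value on the left-hand side
    consequent: dict   # attribute -> value on the right-hand side
    confidence: float  # share of antecedent rows that satisfy the consequent
    support: float     # share of all rows that satisfy the whole rule


def _conditions(items):
    return " and ".join(f"{attr} is {value}" for attr, value in items.items())


def rule_to_text(rule):
    """Render one rule as a plain-English sentence."""
    return (
        f"When {_conditions(rule.antecedent)}, then {_conditions(rule.consequent)} "
        f"in {rule.confidence:.0%} of cases (support {rule.support:.0%})."
    )


def build_rag_context(rules, max_rules=3):
    """Keep the highest-confidence rules and join them into a context block."""
    top = sorted(rules, key=lambda r: r.confidence, reverse=True)[:max_rules]
    return "\n".join(rule_to_text(r) for r in top)
```

The resulting text block would be prepended to the user's question as retrieved context. The `max_rules` cap reflects the abstract's remark that there are typically too many rules, so some selection strategy is needed.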

[137] Combating Misinformation in the Arab World: Challenges & Opportunities

Azza Abouzied,Firoj Alam,Raian Ali,Paolo Papotti

Main category: cs.CL

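TL;DR: Examines the misinformation challenges the Arab region faces due to geopolitics, linguistic diversity, and cultural nuances, and proposes building a more resilient information ecosystem through detection, tracking, mitigation, and community engagement.

Details Motivation: The Arab region's distinctive geopolitical, linguistic, and cultural context makes misinformation especially acute and calls for targeted solutions. Method: Combines connecting with grassroots fact-checking organizations, understanding cultural norms, promoting social correction, and building collaborative information networks. Result: Outlines a viable path toward a more resilient information ecosystem in the Arab world. Conclusion: Multi-dimensional collaboration and culturally sensitive approaches can address misinformation in the Arab region effectively. Abstract: Misinformation and disinformation pose significant risks globally, with the Arab region facing unique vulnerabilities due to geopolitical instabilities, linguistic diversity, and cultural nuances. We explore these challenges through the key facets of combating misinformation: detection, tracking, mitigation, and community engagement. We shed light on how connecting with grass-roots fact-checking organizations, understanding cultural norms, promoting social correction, and creating strong collaborative information networks can create opportunities for a more resilient information ecosystem in the Arab world.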

[138] UTSA-NLP at ArchEHR-QA 2025: Improving EHR Question Answering via Self-Consistency Prompting

Sara Shields-Menard,Zach Reimers,Joshua Gardner,David Perry,Anthony Rios

Main category: cs.CL

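TL;DR: The system uses large language models in two steps to answer clinical questions: selecting relevant sentences and then generating a short response. An 8B model outperforms a 70B model at identifying relevant information.

Details Motivation: Clinical questions need to be answered efficiently from electronic health records (EHRs). Method: A two-step approach: 1) select relevant sentences using few-shot prompting, self-consistency, and thresholding; 2) generate a short response. Result: An 8B model identifies relevant information better than a 70B model, and self-consistency with thresholding makes the decisions more reliable. Conclusion: Accurate sentence selection is critical for high-quality responses, and self-consistency with thresholding improves that selection. Abstract: We describe our system for the ArchEHR-QA Shared Task on answering clinical questions using electronic health records (EHRs). Our approach uses large language models in two steps: first, to find sentences in the EHR relevant to a clinician's question, and second, to generate a short, citation-supported response based on those sentences. We use few-shot prompting, self-consistency, and thresholding to improve the sentence classification step to decide which sentences are essential. We compare several models and find that a smaller 8B model performs better than a larger 70B model for identifying relevant information. Our results show that accurate sentence selection is critical for generating high-quality responses and that self-consistency with thresholding helps make these decisions more reliable.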
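A minimal sketch of how self-consistency with thresholding could work for the sentence-selection step: the classifier is sampled several times per sentence, and a sentence is kept only if the share of "essential" votes reaches a threshold. The label string, the `sample_label` callable (standing in for one sampled LLM call), and the default values are all assumptions; the abstract does not give the actual prompts, sample counts, or threshold.

```python
def vote_share(votes, positive="essential"):
    """Fraction of sampled labels that equal the positive label."""
    return sum(1 for v in votes if v == positive) / len(votes)


def select_sentences(sentences, sample_label, n_samples=5, threshold=0.6):
    """Return indices of sentences judged essential by thresholded voting.

    sample_label: callable taking a sentence and returning one sampled label;
                  in a real system this would be a stochastic LLM call
                  (hypothetical interface).
    """
    selected = []
    for idx, sentence in enumerate(sentences):
        votes = [sample_label(sentence) for _ in range(n_samples)]
        if vote_share(votes) >= threshold:
            selected.append(idx)
    return selected
```

Raising the threshold trades recall for precision: fewer sentences pass, but each one was labeled essential more consistently across samples.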

[139] SynthesizeMe! Inducing Persona-Guided Prompts for Personalized Reward Models in LLMs

Michael J Ryan,Omar Shaikh,Aditri Bhagirath,Daniel Frees,William Held,Diyi Yang

Main category: cs.CL

TL;DR: SynthesizeMe induces synthetic user personas from user interactions for personalized reward modeling, without relying on additional identity information.

Details Motivation: Existing methods rely on additional identity information, such as demographics or predefined preference categories, which limits flexibility; SynthesizeMe instead induces personas directly from user interactions. Method: SynthesizeMe generates and verifies reasoning that explains user preferences, induces synthetic user personas from that reasoning, and filters for informative prior interactions to build personalized prompts. Result: SynthesizeMe-induced prompts improve personalized LLM-as-a-judge accuracy by 4.4% on Chatbot Arena and, combined with a reward model, achieve top performance on the new PersonalRewardBench benchmark. Conclusion: SynthesizeMe provides a personalized reward modeling approach that needs no additional identity information and significantly improves model performance. Abstract: Recent calls for pluralistic alignment of Large Language Models (LLMs) encourage adapting models to diverse user preferences. However, most prior work on personalized reward models relies heavily on additional identity information, such as demographic details or a predefined set of preference categories. To this end, we introduce SynthesizeMe, an approach to inducing synthetic user personas from user interactions for personalized reward modeling. SynthesizeMe first generates and verifies reasoning to explain user preferences, then induces synthetic user personas from that reasoning, and finally filters to informative prior user interactions in order to build personalized prompts for a particular user. We show that using SynthesizeMe-induced prompts improves personalized LLM-as-a-judge accuracy by 4.4% on Chatbot Arena. Combining SynthesizeMe-derived prompts with a reward model achieves top performance on PersonalRewardBench: a new curation of user-stratified interactions with chatbots collected from 854 users of Chatbot Arena and PRISM.

[140] OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation

Ziyi Wang,Yuxuan Lu,Wenbo Li,Amirali Amini,Bo Sun,Yakov Bart,Weimin Lyu,Jiri Gesi,Tian Wang,Jing Huang,Yu Su,Upol Ehsan,Malihe Alikhani,Toby Jia-Jun Li,Lydia Chilton,Dakuo Wang

Main category: cs.CL

TL;DR: Introduces the OPERA dataset for evaluating how well large language models (LLMs) simulate a specific user's web behavior, filling the gap in high-quality public datasets.

Details Motivation: High-quality public datasets for evaluating how well LLMs simulate real user behavior are lacking, especially data that captures users' internal reasoning. Method: Collects user behavior data through an online questionnaire and a custom browser plugin to build OPERA, which covers user personas, browser observations, fine-grained actions, and just-in-time rationales. Result: Establishes the first benchmark for evaluating how well LLMs predict a user's next action and rationale given a persona and behavior history. Conclusion: OPERA lays the groundwork for future research on LLMs as personalized digital twins. Abstract: Can large language models (LLMs) accurately simulate the next web action of a specific user? While LLMs have shown promising capabilities in generating "believable" human behaviors, evaluating their ability to mimic real user behaviors remains an open challenge, largely due to the lack of high-quality, publicly available datasets that capture both the observable actions and the internal reasoning of an actual human user. To address this gap, we introduce OPERA, a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. OPERA is the first public dataset that comprehensively captures: user personas, browser observations, fine-grained web actions, and self-reported just-in-time rationales. We developed both an online questionnaire and a custom browser plugin to gather this dataset with high fidelity. Using OPERA, we establish the first benchmark to evaluate how well current LLMs can predict a specific user's next action and rationale with a given persona and history. This dataset lays the groundwork for future research into LLM agents that aim to act as personalized digital twins for humans.

[141] Mitigating Confounding in Speech-Based Dementia Detection through Weight Masking

Zhecheng Sheng,Xiruo Ding,Brian Hur,Changye Li,Trevor Cohen,Serguei Pakhomov

Main category: cs.CL

TL;DR: Studies how speaker gender confounds the detection of linguistic anomalies in Alzheimer's disease (AD) screening and proposes two methods, the Extended Confounding Filter and the Dual Filter, to remove gender confounding; experiments show that they reduce confounding at a slight cost in detection performance.

Details Motivation: Prior work has not adequately examined how the gender of speakers in AD patient transcripts affects detection, and gender may confound accurate identification of dementia. Method: Proposes the Extended Confounding Filter and the Dual Filter, which isolate and ablate gender-associated weights to reduce gender confounding. Result: Transformer models tend to overfit to the training data distribution; ablating gender-related weights yields a more deconfounded dementia classifier with slightly lower detection performance. Conclusion: Gender is an important confounder in dementia detection; removing gender-related weights improves robustness but must be weighed against a slight loss in detection performance. Abstract: Deep transformer models have been used to detect linguistic anomalies in patient transcripts for early Alzheimer's disease (AD) screening. While pre-trained neural language models (LMs) fine-tuned on AD transcripts perform well, little research has explored the effects of the gender of the speakers represented by these transcripts. This work addresses gender confounding in dementia detection and proposes two methods: the Extended Confounding Filter and the Dual Filter, which isolate and ablate weights associated with gender. We evaluate these methods on dementia datasets with first-person narratives from patients with cognitive impairment and healthy controls. Our results show transformer models tend to overfit to training data distributions. Disrupting gender-related weights results in a deconfounded dementia classifier, with the trade-off of slightly reduced dementia detection performance.

[142] Leveraging Self-Attention for Input-Dependent Soft Prompting in LLMs

Ananth Muppidi,Abhilash Nandy,Sambaran Bandyopadhyay

Main category: cs.CL

TL;DR: Proposes an input-dependent soft prompting method (ID-SPAM) that generates soft prompts dynamically with a self-attention mechanism, improving parameter efficiency and LLM performance on specific tasks.

Details Motivation: Fine-tuning large language models for specific tasks is computationally expensive and technically challenging, which motivates parameter-efficient fine-tuning methods. Method: Uses Input Dependent Soft Prompting with a self-Attention Mechanism (ID-SPAM) to generate soft prompts from the input tokens while keeping the number of trainable parameters small. Result: Outperforms existing techniques on various tasks and improves zero-shot domain transfer. Conclusion: ID-SPAM is a simple and efficient parameter-efficient fine-tuning method for adapting LLMs to specific tasks. Abstract: The performance of large language models in domain-specific tasks necessitates fine-tuning, which is computationally expensive and technically challenging. This paper focuses on parameter-efficient fine-tuning using soft prompting, a promising approach that adapts pre-trained models to downstream tasks by learning a small set of parameters. We propose a novel Input Dependent Soft Prompting technique with a self-Attention Mechanism (ID-SPAM) that generates soft prompts based on the input tokens and attends to different tokens with varying importance. Our method is simple and efficient, keeping the number of trainable parameters small. We show the merits of the proposed approach compared to state-of-the-art techniques on various tasks and show the improved zero-shot domain transfer capability.

[143] IYKYK: Using language models to decode extremist cryptolects

Christine de Kock,Arij Riabi,Zeerak Talat,Michael Sejr Schlichtkrull,Pranava Madhyastha,Ed Hovy

Main category: cs.CL

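TL;DR: Investigates how well current language technologies detect and interpret the cryptolects of extremist groups; general-purpose LLMs perform poorly, but domain adaptation and specialised prompting significantly improve performance.

Details Motivation: Extremist groups use complex in-group language (cryptolects) to exclude or mislead outsiders; the study evaluates whether current technologies can detect and interpret it. Method: Evaluates eight models across six tasks and improves performance with domain adaptation and specialised prompting techniques. Result: General-purpose LLMs cannot consistently detect or decode extremist language, but domain adaptation and specialised prompting significantly improve performance. Conclusion: The study offers important insights for developing automated moderation technologies and releases new labelled and unlabelled datasets. Abstract: Extremist groups develop complex in-group language, also referred to as cryptolects, to exclude or mislead outsiders. We investigate the ability of current language technologies to detect and interpret the cryptolects of two online extremist platforms. Evaluating eight models across six tasks, our results indicate that general purpose LLMs cannot consistently detect or decode extremist language. However, performance can be significantly improved by domain adaptation and specialised prompting techniques. These results provide important insights to inform the development and deployment of automated moderation technologies. We further develop and release novel labelled and unlabelled datasets, including 19.4M posts from extremist platforms and lexicons validated by human experts.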

[144] A Fictional Q&A Dataset for Studying Memorization and Knowledge Acquisition

John Kirchenbauer,Janny Mongkolsupawan,Yuxin Wen,Tom Goldstein,Daphne Ippolito

Main category: cs.CL

TL;DR: Proposes a new dataset for studying how language models memorize facts and verbatim sequences, using synthetic data about fictional events to tease apart forms of memorization.

Details Motivation: To understand how language models memorize facts and verbatim sequences from training data, a question existing research leaves open. Method: Builds a synthetic dataset of webtext-like documents about fictional events with question-answer pairs, and analyzes forms of memorization through training experiments. Result: Synthetic data effectively teases apart different forms of memorization, and the work also documents the challenges of building realistic fictional data. Conclusion: The dataset gives researchers a tool for studying memorization in language models, though the challenge of data realism remains. Abstract: When language models are trained on textual data, they acquire both knowledge about the structure of language and knowledge of facts about the world. At inference time, their knowledge of facts can be leveraged to solve interesting problems and perform useful knowledge work for users. It is well known that language models can verbatim memorize long sequences from their training data. However, it is much less well understood how language models memorize facts seen during training. In this work, we propose a new dataset to specifically empower researchers to study the dual processes of fact memorization and verbatim sequence memorization. The dataset consists of synthetically-generated, webtext-like documents about fictional events, as well as question-answer pairs about the events. We conduct training experiments showing how synthetic data about fictional events can be effective in teasing apart different forms of memorization. We also document the challenges in effectively building realistic, fictional synthetic data.

[145] Can LLMs Express Personality Across Cultures? Introducing CulturalPersonas for Evaluating Trait Alignment

Priyanka Dey,Yugal Khanter,Aayush Bothra,Jieyu Zhao,Emilio Ferrara

Main category: cs.CL

TL;DR: CulturalPersonas is a new large-scale benchmark for evaluating LLMs' personality expression in cultural contexts, using scenario-based questions from multiple countries to improve cultural alignment and expressiveness.

Details Motivation: As LLMs spread across interactive applications, culturally appropriate personality expression becomes essential, yet existing work overlooks the interplay between culture and personality. Method: Introduces the CulturalPersonas benchmark, with 3,000 questions grounded in everyday scenarios from six countries, and evaluates three LLMs in multiple-choice and open-ended formats. Result: CulturalPersonas improves alignment between LLMs and country-specific human personality distributions (over a 20% reduction in Wasserstein distance) and elicits more culturally coherent outputs. Conclusion: CulturalPersonas offers new directions for aligning LLMs to global norms of behavior and supports more socially intelligent and globally adaptive LLMs. Abstract: As LLMs become central to interactive applications, ranging from tutoring to mental health, the ability to express personality in culturally appropriate ways is increasingly important. While recent works have explored personality evaluation of LLMs, they largely overlook the interplay between culture and personality. To address this, we introduce CulturalPersonas, the first large-scale benchmark with human validation for evaluating LLMs' personality expression in culturally grounded, behaviorally rich contexts. Our dataset spans 3,000 scenario-based questions across six diverse countries, designed to elicit personality through everyday scenarios rooted in local values. We evaluate three LLMs, using both multiple-choice and open-ended response formats. Our results show that CulturalPersonas improves alignment with country-specific human personality distributions (over a 20% reduction in Wasserstein distance across models and countries) and elicits more expressive, culturally coherent outputs compared to existing benchmarks. CulturalPersonas surfaces meaningful modulated trait outputs in response to culturally grounded prompts, offering new directions for aligning LLMs to global norms of behavior. By bridging personality expression and cultural nuance, we envision that CulturalPersonas will pave the way for more socially intelligent and globally adaptive LLMs.
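The Wasserstein-distance result can be made concrete with a small sketch. For two distributions over the same ordered, unit-spaced categories (for example a 1-5 trait score scale, which is an assumption here), the 1-Wasserstein distance equals the sum of absolute differences between their cumulative distributions. How the paper bins personality scores is not stated in the abstract, so the setup and function names below are illustrative only.

```python
def wasserstein_1d(p, q):
    """1-Wasserstein distance between two distributions on the same
    ordered, unit-spaced support (e.g. scores 1..5)."""
    if len(p) != len(q):
        raise ValueError("distributions must share the same support")
    cdf_gap, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi      # running difference between the two CDFs
        total += abs(cdf_gap)   # mass that still has to move past this point
    return total


def relative_reduction(baseline, improved):
    """Fractional reduction of a distance relative to a baseline distance."""
    return (baseline - improved) / baseline
```

A reported 20% reduction would correspond to `relative_reduction(d_baseline, d_new) >= 0.2`, where both distances are measured against the same human distribution.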

[146] Zero-Shot Event Causality Identification via Multi-source Evidence Fuzzy Aggregation with Large Language Models

Zefan Zeng,Xingchen Hu,Qing Cheng,Weiping Ding,Wentao Li,Zhong Liu

Main category: cs.CL

TL;DR: MEFA is a zero-shot framework for event causality identification (ECI) based on multi-source evidence fuzzy aggregation; task decomposition and fuzzy aggregation significantly improve performance and reduce hallucination-induced errors.

Details Motivation: Existing ECI models depend on large-scale annotated data, and LLMs are prone to spurious causal links; MEFA aims to address both problems. Method: Decomposes causality reasoning into three main tasks and three auxiliary tasks, guides LLMs with carefully designed prompts, and integrates the evidence through fuzzy aggregation. Result: On three benchmarks, MEFA outperforms the second-best unsupervised baselines by 6.2% in F1-score and 9.3% in precision while significantly reducing hallucination-induced errors. Conclusion: The analysis verifies the effectiveness of task decomposition and fuzzy aggregation; MEFA performs strongly on zero-shot ECI. Abstract: Event Causality Identification (ECI) aims to detect causal relationships between events in textual contexts. Existing ECI models predominantly rely on supervised methodologies, suffering from dependence on large-scale annotated data. Although Large Language Models (LLMs) enable zero-shot ECI, they are prone to causal hallucination, that is, erroneously establishing spurious causal links. To address these challenges, we propose MEFA, a novel zero-shot framework based on Multi-source Evidence Fuzzy Aggregation. First, we decompose causality reasoning into three main tasks (temporality determination, necessity analysis, and sufficiency verification) complemented by three auxiliary tasks. Second, leveraging meticulously designed prompts, we guide LLMs to generate uncertain responses and deterministic outputs. Finally, we quantify the LLM's responses to the sub-tasks and employ fuzzy aggregation to integrate this evidence for causality scoring and causality determination. Extensive experiments on three benchmarks demonstrate that MEFA outperforms second-best unsupervised baselines by 6.2% in F1-score and 9.3% in precision, while significantly reducing hallucination-induced errors. In-depth analysis verifies the effectiveness of task decomposition and the superiority of fuzzy aggregation.

[147] A Unified Representation for Continuity and Discontinuity: Syntactic and Computational Motivations

Ratna Kandala,Prakash Mondal

Main category: cs.CL

TL;DR: 本文提出了一种统一的语法结构表示方法,适用于短语结构语法(PSG)、依存语法(DG)和范畴语法(CG),并从句法和计算复杂性的角度进行了分析。

Details Motivation: To integrate the basic principles of the three grammar formalisms PSG, DG, and CG through a unified representation that simplifies syntactic analysis and computational complexity. Method: Achieves the unified representation through a correspondence principle, illustrated with a discontinuous subordinate clause from Turkish. Result: The unified representation can simplify computational complexity and offers a new theoretical perspective on discontinuity in natural language. Conclusion: The approach not only integrates the basic tenets of PSG, DG, and CG but also has significant consequences for syntactic analysis and computational processing. Abstract: This paper advances a unified representation of linguistic structure for three grammar formalisms, namely, Phrase Structure Grammar (PSG), Dependency Grammar (DG) and Categorial Grammar (CG) from the perspective of syntactic and computational complexity considerations. The correspondence principle is proposed to enable a unified representation of the representational principles from PSG, DG, and CG. To that end, the paper first illustrates a series of steps in achieving a unified representation for a discontinuous subordinate clause from Turkish as an illustrative case. This affords a new way of approaching discontinuity in natural language from a theoretical point of view that unites and integrates the basic tenets of PSG, DG, and CG, with significant consequences for syntactic analysis. Then this paper demonstrates that a unified representation can simplify computational complexity with regard to the neurocognitive representation and processing of both continuous and discontinuous sentences vis-à-vis the basic principles of PSG, DG, and CG.

[148] When to use Graphs in RAG: A Comprehensive Analysis for Graph Retrieval-Augmented Generation

Zhishang Xiang,Chuanjie Wu,Qinggang Zhang,Shengyuan Chen,Zijin Hong,Xiao Huang,Jinsong Su

Main category: cs.CL

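TL;DR: GraphRAG augments LLM knowledge retrieval with graph structures, yet in practice it often underperforms traditional RAG. This paper proposes the GraphRAG-Bench benchmark to systematically evaluate GraphRAG across tasks and analyze the scenarios where it has an advantage.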

Details Motivation: Although GraphRAG can in theory make knowledge retrieval more coherent, it performs poorly in practice, so the scenarios where it applies and helps need to be clarified. Method: Proposes GraphRAG-Bench, a benchmark with tasks of increasing difficulty and evaluation of the entire pipeline, from graph construction to final generation. Result: Using the benchmark, the paper systematically analyzes the conditions under which GraphRAG surpasses traditional RAG and the reasons for its success. Conclusion: GraphRAG performs better in specific scenarios, and the study offers guidance for its practical application. Abstract: Graph retrieval-augmented generation (GraphRAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) with external knowledge. It leverages graphs to model the hierarchical structure between specific concepts, enabling more coherent and effective knowledge retrieval for accurate reasoning. Despite its conceptual promise, recent studies report that GraphRAG frequently underperforms vanilla RAG on many real-world tasks. This raises a critical question: Is GraphRAG really effective, and in which scenarios do graph structures provide measurable benefits for RAG systems? To address this, we propose GraphRAG-Bench, a comprehensive benchmark designed to evaluate GraphRAG models on both hierarchical knowledge retrieval and deep contextual reasoning. GraphRAG-Bench features a comprehensive dataset with tasks of increasing difficulty, covering fact retrieval, complex reasoning, contextual summarization, and creative generation, and a systematic evaluation across the entire pipeline, from graph construction and knowledge retrieval to final generation. Leveraging this novel benchmark, we systematically investigate the conditions when GraphRAG surpasses traditional RAG and the underlying reasons for its success, offering guidelines for its practical application. All related resources and analyses are collected for the community at https://github.com/GraphRAG-Bench/GraphRAG-Benchmark.

[149] Being Strong Progressively! Enhancing Knowledge Distillation of Large Language Models through a Curriculum Learning Framework

Lingyuan Liu,Mengxiang Zhang

Main category: cs.CL

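TL;DR: The paper proposes POCL, a curriculum learning framework based on the progressive overload principle, to improve knowledge distillation (KD) of large language models; introducing training samples in stages improves the stability and performance of the student model.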

Details Motivation: Existing KD methods for LLMs tend to cause shifts in the student model's distribution during training, leading to problems such as catastrophic forgetting and mode collapse. Method: Proposes the POCL framework, consisting of a difficulty measurer and a training scheduler, which introduces samples in stages from easy to hard while progressively raising the temperature of the loss function. Result: Experiments show that POCL significantly improves the performance of distilled student models and applies across various white-box KD methods and model families. Conclusion: The study demonstrates the effectiveness of sorted training samples in KD and offers a new approach to improving the stability and performance of distilled models. Abstract: Knowledge Distillation (KD) compresses large language models (LLMs) by transferring the teacher model's capabilities to a smaller student model, reducing inference cost and memory usage while maintaining performance. However, existing KD methods for LLMs often fail to prevent significant shifts in the student model's distribution during training, leading to issues such as catastrophic forgetting, mode collapse, and training-inference mismatch. To address these challenges, we propose a novel, plug-in curriculum learning framework inspired by the strength training principle of "progressive overload" (POCL), which can be seamlessly integrated into existing white-box KD approaches with minimal computational overhead. The framework comprises two core components: (1) a difficulty measurer that ranks and partitions training samples from easy to hard, and (2) a training scheduler that incrementally introduces these subsets into the distillation process at fixed intervals while applying loss functions with progressively rising temperatures. By starting with the easiest samples and progressively increasing the difficulty, the approach enhances both the stability and efficiency of learning. Extensive experiments in instruction-following settings demonstrate that POCL consistently improves the performance of distilled student models across various white-box KD methods and model families. Our findings highlight the effectiveness of sorted training samples in KD for LLMs. More generally, our work demonstrates how to structure training data within the KD process to enhance the stability and performance of distilled LLMs.
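
The two components described above (a difficulty measurer and a training scheduler) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the `difficulty` function is supplied by the caller, the training pool is assumed to grow cumulatively, and the temperature is assumed to rise linearly between two made-up endpoints.

```python
def partition_by_difficulty(samples, difficulty, num_stages):
    """Rank samples from easy to hard and split them into up to `num_stages` subsets."""
    ranked = sorted(samples, key=difficulty)
    if not ranked:
        return []
    size = -(-len(ranked) // num_stages)  # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]


def curriculum_schedule(subsets, epochs_per_stage, t_start=1.0, t_end=2.0):
    """Yield (epoch, training_pool, temperature) for each epoch.

    A new subset joins the pool every `epochs_per_stage` epochs (the fixed
    interval), and the loss temperature rises linearly from `t_start` to
    `t_end`. Both endpoints are hypothetical values.
    """
    pool = []
    epoch = 0
    last_stage = max(len(subsets) - 1, 1)
    for stage, subset in enumerate(subsets):
        pool = pool + subset  # assumption: earlier (easier) samples stay in the pool
        temperature = t_start + (stage / last_stage) * (t_end - t_start)
        for _ in range(epochs_per_stage):
            yield epoch, list(pool), temperature
            epoch += 1
```

In a real distillation run the yielded pool would feed the data loader and the temperature would be passed to the KD loss; both hooks are left out here because they depend on the training framework.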

[150] RKEFino1: A Regulation Knowledge-Enhanced Large Language Model

Yan Wang,Yueru He,Ruoyu Xiang,Jeff Zhao

Main category: cs.CL

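TL;DR: The paper proposes RKEFino1, which enhances an LLM with financial regulatory knowledge to address accuracy and compliance problems in digital regulatory reporting.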

Details Motivation: Large language models hold great promise for financial applications but face accuracy and compliance challenges, particularly in Digital Regulatory Reporting. Method: Built on Fino1 and fine-tuned with domain knowledge from XBRL, CDM, and MOF; the authors design knowledge-based QA, mathematical reasoning, and numerical named entity recognition tasks. Result: Experiments show that RKEFino1 performs well on compliance-critical financial tasks and generalizes well. Conclusion: RKEFino1 offers an effective solution for compliance tasks in finance, and the model has been released publicly. Abstract: Recent advances in large language models (LLMs) hold great promise for financial applications but introduce critical accuracy and compliance challenges in Digital Regulatory Reporting (DRR). To address these issues, we propose RKEFino1, a regulation knowledge-enhanced financial reasoning model built upon Fino1, fine-tuned with domain knowledge from XBRL, CDM, and MOF. We formulate two QA tasks (knowledge-based and mathematical reasoning) and introduce a novel Numerical NER task covering financial entities in both sentences and tables. Experimental results demonstrate the effectiveness and generalization capacity of RKEFino1 in compliance-critical financial tasks. We have released our model on Hugging Face.

[151] Large Language Models are Good Relational Learners

Fang Wu,Vijay Prakash Dwivedi,Jure Leskovec

Main category: cs.CL

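TL;DR: Rel-LLM proposes a graph neural network (GNN)-based encoder that generates structured relational prompts, addressing the shortcomings of traditional text serialization in relational deep learning (RDL).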

Details Motivation: Although large language models (LLMs) perform well across many domains, their application to relational deep learning remains underexplored; traditional methods disregard relational structure, introduce redundancy, and run into context-length limits. Method: Rel-LLM combines a GNN encoder with a retrieval-augmented generation (RAG) framework, extracting a local subgraph around each entity to generate structured prompts that preserve the database's relational structure. Result: Experiments show that Rel-LLM outperforms existing methods on key RDL tasks, providing a scalable and efficient way to integrate LLMs with structured data. Conclusion: Rel-LLM offers a new direction for applying LLMs to relational deep learning and overcomes the limitations of traditional methods. Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across various domains, yet their application to relational deep learning (RDL) remains underexplored. Existing approaches adapt LLMs by traversing relational links between entities in a database and converting the structured data into flat text documents. Still, this text-based serialization disregards critical relational structures, introduces redundancy, and often exceeds standard LLM context lengths. We introduce Rel-LLM, a novel architecture that utilizes a graph neural network (GNN)-based encoder to generate structured relational prompts for LLMs within a retrieval-augmented generation (RAG) framework. Unlike traditional text-based serialization approaches, our method preserves the inherent relational structure of databases while enabling LLMs to effectively process and reason over complex entity relationships. Specifically, the GNN encoder extracts a local subgraph around an entity to build feature representations that contain relevant entity relationships and temporal dependencies. These representations are transformed into structured prompts using a denormalization process, effectively allowing the LLM to reason over relational structures. Through extensive experiments, we demonstrate that Rel-LLM outperforms existing methods on key RDL tasks, offering a scalable and efficient approach to integrating LLMs with structured data sources. Code is available at https://github.com/smiles724/Rel-LLM.

[152] Do LLMs Really Forget? Evaluating Unlearning with Knowledge Correlation and Confidence Awareness

Rongzhe Wei,Peizhi Niu,Hans Hao-Hsun Hsu,Ruihan Wu,Haoteng Yin,Mohsen Ghassemi,Yifan Li,Vamsi K. Potluru,Eli Chien,Kamalika Chaudhuri,Olgica Milenkovic,Pan Li

Main category: cs.CL

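TL;DR: The paper proposes a new evaluation framework for knowledge unlearning that uses knowledge graphs and an inference protocol to assess unlearning in large language models more accurately.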

Details Motivation: Existing methods focus mainly on explicitly removing isolated facts and overlook latent inferential dependencies and the non-deterministic nature of knowledge, so unlearning effectiveness is overestimated. Method: Proposes an evaluation framework based on knowledge graphs with confidence scores, using LLMs as inference judges with carefully designed prompts calibrated against human evaluations. Result: Experiments show that the new framework assesses unlearning more realistically and rigorously, and that existing methods overestimate unlearning effectiveness. Conclusion: The framework provides a more accurate evaluation tool for knowledge unlearning and exposes the limitations of existing methods. Abstract: Machine unlearning techniques aim to mitigate unintended memorization in large language models (LLMs). However, existing approaches predominantly focus on the explicit removal of isolated facts, often overlooking latent inferential dependencies and the non-deterministic nature of knowledge within LLMs. Consequently, facts presumed forgotten may persist implicitly through correlated information. To address these challenges, we propose a knowledge unlearning evaluation framework that more accurately captures the implicit structure of real-world knowledge by representing relevant factual contexts as knowledge graphs with associated confidence scores. We further develop an inference-based evaluation protocol leveraging powerful LLMs as judges; these judges reason over the extracted knowledge subgraph to determine unlearning success. Our LLM judges utilize carefully designed prompts and are calibrated against human evaluations to ensure their trustworthiness and stability. Extensive experiments on our newly constructed benchmark demonstrate that our framework provides a more realistic and rigorous assessment of unlearning performance. Moreover, our findings reveal that current evaluation strategies tend to overestimate unlearning effectiveness. Our code is publicly available at https://github.com/Graph-COM/Knowledge_Unlearning.git.

[153] LLM-Symbolic Integration for Robust Temporal Tabular Reasoning

Atharv Kulkarni,Kushagra Dixit,Vivek Srikumar,Dan Roth,Vivek Gupta

Main category: cs.CL

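TL;DR: The TempTabQA-C dataset and a symbolic intermediate representation improve LLM performance on temporal tabular question answering; generating and executing SQL queries overcomes the limitations of traditional methods.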

Details Motivation: Traditional prompting methods for temporal tabular question answering suffer from memorization, sensitivity to table size, and degraded performance on complex queries. Method: Introduces the TempTabQA-C dataset and a symbolic intermediate representation that converts tables into database schemas, combined with adaptive few-shot prompting. Result: Experiments show significant improvements in robustness, scalability, and performance. Conclusion: The method sets a new benchmark for temporal reasoning with LLMs. Abstract: Temporal tabular question answering presents a significant challenge for Large Language Models (LLMs), requiring robust reasoning over structured data, which is a task where traditional prompting methods often fall short. These methods face challenges such as memorization, sensitivity to table size, and reduced performance on complex queries. To overcome these limitations, we introduce TempTabQA-C, a synthetic dataset designed for systematic and controlled evaluations, alongside a symbolic intermediate representation that transforms tables into database schemas. This structured approach allows LLMs to generate and execute SQL queries, enhancing generalization and mitigating biases. By incorporating adaptive few-shot prompting with contextually tailored examples, our method achieves superior robustness, scalability, and performance. Experimental results consistently highlight improvements across key challenges, setting a new benchmark for robust temporal reasoning with LLMs.
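
The table-to-schema idea can be illustrated with Python's built-in `sqlite3` module. The sketch below is only an assumption about what such a pipeline might look like: the `career` table, its columns, and the SQL string standing in for an LLM-generated query are all invented for the example and are not taken from TempTabQA-C.

```python
import sqlite3


def table_to_database(name, columns, rows):
    """Load one table into an in-memory SQLite database.

    `columns` is a list of (column_name, sql_type) pairs. Identifiers are
    interpolated directly, so this sketch assumes trusted table metadata.
    """
    conn = sqlite3.connect(":memory:")
    schema = ", ".join(f"{col} {sql_type}" for col, sql_type in columns)
    conn.execute(f"CREATE TABLE {name} ({schema})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f"INSERT INTO {name} VALUES ({placeholders})", rows)
    return conn


# A made-up temporal table: one row per role with its start and end year.
conn = table_to_database(
    "career",
    [("role", "TEXT"), ("start_year", "INTEGER"), ("end_year", "INTEGER")],
    [("Engineer", 2001, 2006), ("Manager", 2006, 2015), ("Director", 2015, 2020)],
)

# Stand-in for the SQL an LLM might generate for
# "Which role was held the longest?"
generated_sql = (
    "SELECT role, end_year - start_year AS years "
    "FROM career ORDER BY years DESC LIMIT 1"
)
longest = conn.execute(generated_sql).fetchone()
print(longest)
```

Because the database engine does the arithmetic, the answer does not depend on the model memorizing the table or on how many rows it has, which is the motivation the abstract gives for the symbolic route.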

[154] Writing-RL: Advancing Long-form Writing via Adaptive Curriculum Reinforcement Learning

Xuanyu Lei,Chenliang Li,Yuning Wu,Kaiming Liu,Weizhou Shen,Peng Li,Ming Yan,Ji Zhang,Fei Huang,Yang Liu

Main category: cs.CL

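TL;DR: The paper proposes Writing-RL, a reinforcement learning framework that improves long-form writing through adaptive curriculum design, going beyond the limits of traditional supervised fine-tuning.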

Details Motivation: Existing supervised fine-tuning approaches suffer from data saturation and restricted learning capacity, which limits gains in long-form writing. Method: The framework has three key components: a margin-aware data selection strategy, a pairwise comparison reward mechanism, and a dynamic reference scheduling approach. Result: Experiments show that the framework significantly improves long-form writing performance and, unexpectedly, generalizes to long-input reasoning tasks. Conclusion: The study offers a new perspective on long-context training and demonstrates the potential of reinforcement learning for long-form generation. Abstract: Recent advances in Large Language Models (LLMs) have enabled strong performance in long-form writing, yet existing supervised fine-tuning (SFT) approaches suffer from limitations such as data saturation and restricted learning capacity bounded by teacher signals. In this work, we present Writing-RL: an Adaptive Curriculum Reinforcement Learning framework to advance long-form writing capabilities beyond SFT. The framework consists of three key components: a Margin-aware Data Selection strategy that prioritizes samples with high learning potential, a Pairwise Comparison Reward mechanism that provides discriminative learning signals in the absence of verifiable rewards, and a Dynamic Reference Scheduling approach, which plays a particularly critical role by adaptively adjusting task difficulty based on evolving model performance. Experiments on 7B-scale writer models show that our RL framework largely improves long-form writing performance over strong SFT baselines. Furthermore, we observe that models trained with long-output RL generalize surprisingly well to long-input reasoning tasks, potentially offering a promising perspective for rethinking long-context training.

[155] BioMol-MQA: A Multi-Modal Question Answering Dataset For LLM Reasoning Over Bio-Molecular Interactions

Saptarshi Sengupta,Shuhua Yang,Paul Kwong Yu,Fali Wang,Suhang Wang

Main category: cs.CL

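TL;DR: BioMol-MQA is a new multimodal question-answering dataset designed to test LLMs' ability to retrieve from and reason over a multimodal knowledge graph, particularly in the polypharmacy domain.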

Details Motivation: Existing RAG-based LLMs mainly retrieve single-modality (text) information, whereas real-world problems such as healthcare require retrieval and reasoning over multimodal information (e.g., knowledge graphs, text, molecular structures). Method: Proposes the BioMol-MQA dataset, comprising a multimodal knowledge graph (text and molecular structures) and challenging questions designed to test LLM capabilities. Result: Benchmarks show that existing LLMs struggle to answer these questions and do well only when given background data, indicating the need for stronger RAG frameworks. Conclusion: BioMol-MQA highlights the importance of multimodal information retrieval and reasoning and points the way for future RAG frameworks. Abstract: Retrieval augmented generation (RAG) has shown great power in improving Large Language Models (LLMs). However, most existing RAG-based LLMs are dedicated to retrieving single modality information, mainly text, while for many real-world problems, such as healthcare, information relevant to queries can manifest in various modalities such as knowledge graph, text (clinical notes), and complex molecular structure. Thus, being able to retrieve relevant multi-modality domain-specific information, and reason and synthesize diverse knowledge to generate an accurate response is important. To address the gap, we present BioMol-MQA, a new question-answering (QA) dataset on polypharmacy, which is composed of two parts: (i) a multimodal knowledge graph (KG) with text and molecular structure for information retrieval; and (ii) challenging questions designed to test LLM capabilities in retrieving and reasoning over multimodal KG to answer questions. Our benchmarks indicate that existing LLMs struggle to answer these questions and do well only when given the necessary background data, signaling the necessity for strong RAG frameworks.

[156] dots.llm1 Technical Report

Bi Huo,Bin Tu,Cheng Qin,Da Zheng,Debing Zhang,Dongjie Zhang,En Li,Fu Guo,Jian Yao,Jie Lou,Junfeng Tian,Li Hu,Ran Zhu,Shengdong Chen,Shuo Liu,Su Guang,Te Wo,Weijun Zhang,Xiaoming Shi,Xinxin Peng,Xing Wu,Yawen Liu,Yuqiu Ji,Ze Wen,Zhenhai Liu,Zichao Li,Zilong Liao

Main category: cs.CL

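TL;DR: dots.llm1 is a large-scale MoE model that activates only part of its parameters (14B of 142B) for efficient training and inference, performs on par with Qwen2.5-72B, and uses no synthetic data during pretraining.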

Details Motivation: To explore scaling language models efficiently with MoE models while reducing training and inference costs. Method: Uses a carefully designed data processing pipeline, pretrains on 11.2T high-quality tokens, and open-sources intermediate training checkpoints. Result: dots.llm1 performs comparably to Qwen2.5-72B at lower cost. Conclusion: MoE models are an effective route to scaling language models efficiently, and the open-sourced checkpoints support further research. Abstract: Mixture of Experts (MoE) models have emerged as a promising paradigm for scaling language models efficiently by activating only a subset of parameters for each input token. In this report, we present dots.llm1, a large-scale MoE model that activates 14B parameters out of a total of 142B parameters, delivering performance on par with state-of-the-art models while reducing training and inference costs. Leveraging our meticulously crafted and efficient data processing pipeline, dots.llm1 achieves performance comparable to Qwen2.5-72B after pretraining on 11.2T high-quality tokens and post-training to fully unlock its capabilities. Notably, no synthetic data is used during pretraining. To foster further research, we open-source intermediate training checkpoints at every one trillion tokens, providing valuable insights into the learning dynamics of large language models.
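
For readers unfamiliar with sparse activation, the sketch below shows generic top-k expert gating in plain Python. It is a common MoE routing scheme used here purely for illustration; the abstract does not describe dots.llm1's router, so the gating rule, the value of `k`, and the toy scalar experts are all assumptions. Only the 14B and 142B figures come from the source.

```python
import math


def top_k_gate(router_logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)  # subtract for numerical stability
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: value / total for i, value in exps.items()}


def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = top_k_gate(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in gates.items())


# Share of parameters active per token, using the figures from the abstract.
active_fraction = 14 / 142
print(f"{active_fraction:.1%} of parameters active per token")
```

Experts that are not selected are never called, which is where the compute saving comes from: roughly one tenth of the parameters take part in each forward pass.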

[157] Discrete Minds in a Continuous World: Do Language Models Know Time Passes?

Minghan Wang,Ye Bai,Thuy-Trang Vu,Ehsan Shareghi,Gholamreza Haffari

Main category: cs.CL

TL;DR: LLMs can perceive time passage and adapt behavior, validated through experiments on token-time mapping, urgency response, and time pressure in dynamic tasks.

Details Motivation: To explore whether LLMs can perceive and adapt to the passage of time, a capability previously unexamined despite their proficiency in temporal reasoning tasks. Method: Three experiments: validating the Token-Time Hypothesis, testing response adaptation to urgency, and assessing behavior under time pressure in BombRush. Result: LLMs show awareness of time passage, linking tokens to physical time, with performance varying by model size and reasoning ability. Conclusion: This study provides a foundation for improving LLMs' temporal awareness in time-sensitive applications. Abstract: While Large Language Models (LLMs) excel at temporal reasoning tasks like event ordering and duration estimation, their ability to perceive the actual passage of time remains unexplored. We investigate whether LLMs perceive the passage of time and adapt their decision-making accordingly through three complementary experiments. First, we introduce the Token-Time Hypothesis, positing that LLMs can map discrete token counts to continuous wall-clock time, and validate this through a dialogue duration judgment task. Second, we demonstrate that LLMs could use this awareness to adapt their response length while maintaining accuracy when users express urgency in question answering tasks. Finally, we develop BombRush, an interactive navigation challenge that examines how LLMs modify behavior under progressive time pressure in dynamic environments. Our findings indicate that LLMs possess certain awareness of time passage, enabling them to bridge discrete linguistic tokens and continuous physical time, though this capability varies with model size and reasoning abilities. This work establishes a theoretical foundation for enhancing temporal awareness in LLMs for time-sensitive applications.

[158] MAPLE: Multi-Agent Adaptive Planning with Long-Term Memory for Table Reasoning

Ye Bai,Minghan Wang,Thuy-Trang Vu

Main category: cs.CL

TL;DR: The MAPLE framework significantly improves table question answering through multi-agent collaboration and a long-term memory mechanism.

Details Motivation: Existing methods lack error detection and experience reuse mechanisms and cannot emulate the complex reasoning process of humans. Method: Proposes the MAPLE framework with four components (Solver, Checker, Reflector, and Archiver) that collaborate in a feedback-driven multi-agent loop. Result: Outperforms existing methods on the WiKiTQ and TabFact datasets, reaching state-of-the-art performance. Conclusion: By mimicking the human problem-solving process, MAPLE effectively improves reasoning in table question answering. Abstract: Table-based question answering requires complex reasoning capabilities that current LLMs struggle to achieve with single-pass inference. Existing approaches, such as Chain-of-Thought reasoning and question decomposition, lack error detection mechanisms and discard problem-solving experiences, contrasting sharply with how humans tackle such problems. In this paper, we propose MAPLE (Multi-agent Adaptive Planning with Long-term mEmory), a novel framework that mimics human problem-solving through specialized cognitive agents working in a feedback-driven loop. MAPLE integrates 4 key components: (1) a Solver using the ReAct paradigm for reasoning, (2) a Checker for answer verification, (3) a Reflector for error diagnosis and strategy correction, and (4) an Archiver managing long-term memory for experience reuse and evolution. Experiments on WiKiTQ and TabFact demonstrate significant improvements over existing methods, achieving state-of-the-art performance across multiple LLM backbones.
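
The feedback-driven loop among the four components can be sketched as a small control function. This is a hedged illustration, not MAPLE's code: the agents are passed in as plain callables with hypothetical signatures, the memory is a dictionary keyed by the exact question text, and the round limit is an arbitrary choice.

```python
def solve_with_feedback(question, solver, checker, reflector, memory, max_rounds=3):
    """Run a Solver -> Checker -> Reflector loop, then archive the lessons.

    Assumed (hypothetical) signatures:
      solver(question, lessons) -> answer
      checker(question, answer) -> (accepted: bool, feedback: str)
      reflector(question, answer, feedback) -> lesson: str
    `memory` is a dict acting as a very simple long-term store.
    """
    # Archiver, read side: reuse lessons stored for this exact question.
    lessons = list(memory.get(question, []))
    answer = None
    for _ in range(max_rounds):
        answer = solver(question, lessons)               # Solver proposes an answer
        accepted, feedback = checker(question, answer)   # Checker verifies it
        if accepted:
            break
        # Reflector turns the failure into a lesson for the next attempt.
        lessons.append(reflector(question, answer, feedback))
    # Archiver, write side: keep the lessons for future questions.
    memory[question] = lessons
    return answer
```

With LLM-backed agents, each callable would wrap a prompt; exact-match lookup would likely be replaced by similarity search so that lessons transfer to related questions rather than only to repeats.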

[159] FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging

Zichen Tang,Haihong E,Ziyan Ma,Haoyang He,Jiacheng Liu,Zhongjun Yang,Zihua Rong,Rongjin Li,Kun Ji,Qing Huang,Xinyang Hu,Yang Liu,Qianhe Zheng

Main category: cs.CL

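TL;DR: FinanceReasoning is a new benchmark for evaluating large reasoning models (LRMs) on financial numerical reasoning problems, characterized by credibility, comprehensiveness, and challenge.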

Details Motivation: Existing financial reasoning benchmarks fall short in credibility, coverage, and difficulty, so a new benchmark is needed to evaluate and improve the financial reasoning of LRMs more accurately. Method: By updating and annotating questions, constructing Python-formatted functions, and designing hard problems, FinanceReasoning provides a comprehensive evaluation framework. Result: Experiments show that combining Reasoner and Programmer models significantly improves LRM performance (e.g., from 83.2% to 87.8% for DeepSeek-R1), though models still struggle with numerical precision. Conclusion: FinanceReasoning provides an important tool and direction for future research on domain-specific complex reasoning tasks. Abstract: We introduce FinanceReasoning, a novel benchmark designed to evaluate the reasoning capabilities of large reasoning models (LRMs) in financial numerical reasoning problems. Compared to existing benchmarks, our work provides three key advancements. (1) Credibility: We update 15.6% of the questions from four public datasets, annotating 908 new questions with detailed Python solutions and rigorously refining evaluation standards. This enables an accurate assessment of the reasoning improvements of LRMs. (2) Comprehensiveness: FinanceReasoning covers 67.8% of financial concepts and formulas, significantly surpassing existing datasets. Additionally, we construct 3,133 Python-formatted functions, which enhance LRMs' financial reasoning capabilities through refined knowledge (e.g., 83.2% → 91.6% for GPT-4o). (3) Challenge: Models are required to apply multiple financial formulas for precise numerical reasoning on 238 Hard problems. The best-performing model (i.e., OpenAI o1 with PoT) achieves 89.1% accuracy, yet LRMs still face challenges in numerical precision. We demonstrate that combining Reasoner and Programmer models can effectively enhance LRMs' performance (e.g., 83.2% → 87.8% for DeepSeek-R1). Our work paves the way for future research on evaluating and improving LRMs in domain-specific complex reasoning tasks.
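
As a rough idea of what a "Python-formatted function" for a financial formula might look like, the sketch below encodes compound-interest future value and a tolerance-based answer check. Both are illustrative assumptions: the function is not taken from the benchmark's 3,133 functions, and the relative tolerance is a made-up value rather than the paper's evaluation standard.

```python
import math


def future_value(present_value: float, annual_rate: float, years: int,
                 compounds_per_year: int = 1) -> float:
    """Future value of a lump sum under compound interest.

    FV = PV * (1 + r / m) ** (m * t), where r is the annual rate,
    m the number of compounding periods per year, and t the number of years.
    """
    periods = years * compounds_per_year
    return present_value * (1 + annual_rate / compounds_per_year) ** periods


def answers_match(predicted: float, reference: float, rel_tol: float = 0.002) -> bool:
    """Accept a numeric answer within a relative tolerance (hypothetical value)."""
    return math.isclose(predicted, reference, rel_tol=rel_tol)


# 1,000 invested at 5% for 2 years, compounded annually.
print(round(future_value(1000, 0.05, 2), 2))
```

Packaging formulas as documented functions gives a model something precise to retrieve and call, and a tolerance-based comparison is one way to separate rounding differences from genuine numerical errors.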

[160] Cross-lingual Collapse: How Language-Centric Foundation Models Shape Reasoning in Large Language Models

Cheonbok Park,Jeonghoon Kim,Joosung Lee,Sanghwan Bae,Jaegul Choo,Kangmin Yoo

Main category: cs.CL

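TL;DR: The paper finds that multilingual large models drift toward their dominant pre-training language during reasoning (Cross-lingual Collapse), and shows experimentally that this imbalance is amplified by training and is largely irreversible.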

Details Motivation: To explore the mechanism behind multilingual reasoning in large reasoning models (LRMs), in particular the Cross-lingual Collapse phenomenon. Method: Fine-tunes multilingual LRMs with GRPO in three languages (Chinese, Korean, and Ukrainian), monitoring task accuracy and language consistency. Result: GRPO rapidly amplifies pre-training language imbalances; a language consistency reward mitigates the drift but sacrifices accuracy; the collapse is severe and largely irreversible. Conclusion: Not all languages are treated equally when multilingual models reason; the study sheds light on the roles of reward shaping, data difficulty, and pre-training priors. Abstract: We identify Cross-lingual Collapse, a systematic drift in which the chain-of-thought (CoT) of a multilingual language model reverts to its dominant pre-training language even when the prompt is expressed in a different language. Recent large language models (LLMs) with reinforcement learning with verifiable reward (RLVR) have achieved strong logical reasoning performance by exposing their intermediate reasoning traces, giving rise to large reasoning models (LRMs). However, the mechanism behind multilingual reasoning in LRMs is not yet fully explored. To investigate the issue, we fine-tune multilingual LRMs with Group-Relative Policy Optimization (GRPO) on translated versions of the GSM8K and SimpleRL-Zoo datasets in three different languages: Chinese, Korean, and Ukrainian. During training, we monitor both task accuracy and language consistency of the reasoning chains. Our experiments reveal three key findings: (i) GRPO rapidly amplifies pre-training language imbalances, leading to the erosion of low-resource languages within just a few hundred updates; (ii) language consistency reward mitigates this drift but does so at the expense of an almost 5-10 pp drop in accuracy; and (iii) the resulting language collapse is severely damaging and largely irreversible, as subsequent fine-tuning struggles to steer the model back toward its original target-language reasoning capabilities. Together, these findings point to a remarkable conclusion: not all languages are trained equally for reasoning. Furthermore, our paper sheds light on the roles of reward shaping, data difficulty, and pre-training priors in eliciting multilingual reasoning.

[161] Route-and-Reason: Scaling Large Language Model Reasoning with Reinforced Model Router

Chenyang Shao,Xinyang Liu,Yutang Lin,Fengli Xu,Yong Li

Main category: cs.CL

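TL;DR: The paper proposes the R2-Reasoner framework, which dynamically allocates subtasks to models of different scales to cut API costs while maintaining accuracy.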

Details Motivation: Multi-step reasoning improves large language model (LLM) performance, but long reasoning chains increase cost. Small-scale language models (SLMs) can handle simple subtasks, but this requires accurate task decomposition and allocation. Method: Proposes the R2-Reasoner framework, comprising a task decomposer and a subtask allocator, trained with reinforcement learning to route subtasks dynamically. Result: Across four benchmarks, API costs drop by 86.85% while accuracy matches or surpasses the baseline. Conclusion: R2-Reasoner offers a viable approach to efficient, adaptive LLM reasoning; the code is open-source. Abstract: Multi-step reasoning has proven essential for enhancing the problem-solving capabilities of Large Language Models (LLMs) by decomposing complex tasks into intermediate steps, either explicitly or implicitly. Extending the reasoning chain at test time through deeper thought processes or broader exploration can further improve performance, but often incurs substantial costs due to the explosion in token usage. Yet, many reasoning steps are relatively simple and can be handled by more efficient smaller-scale language models (SLMs). This motivates hybrid approaches that allocate subtasks across models of varying capacities. However, realizing such collaboration requires accurate task decomposition and difficulty-aware subtask allocation, which is challenging. To address this, we propose R2-Reasoner, a novel framework that enables collaborative reasoning across heterogeneous LLMs by dynamically routing sub-tasks based on estimated complexity. At the core of our framework is a Reinforced Model Router, composed of a task decomposer and a subtask allocator. The task decomposer segments complex input queries into logically ordered subtasks, while the subtask allocator assigns each subtask to the most appropriate model, ranging from lightweight SLMs to powerful LLMs, balancing accuracy and efficiency. To train this router, we introduce a staged pipeline that combines supervised fine-tuning on task-specific datasets with the Group Relative Policy Optimization algorithm, enabling self-supervised refinement through iterative reinforcement learning. Extensive experiments across four challenging benchmarks demonstrate that R2-Reasoner reduces API costs by 86.85% while maintaining or surpassing baseline accuracy. Our framework paves the way for more cost-effective and adaptive LLM reasoning. The code is open-source at https://anonymous.4open.science/r/R2_Reasoner.
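
Difficulty-aware allocation can be illustrated with a rule-based stand-in for the learned router. Everything below is hypothetical: the model names, difficulty ceilings, and per-subtask costs are invented, the difficulty estimates are given as inputs rather than predicted, and the paper's router is trained with reinforcement learning rather than fixed thresholds.

```python
# Hypothetical model tiers: (name, highest difficulty handled, cost per subtask).
MODELS = [
    ("small-slm", 0.3, 1.0),
    ("medium-llm", 0.7, 5.0),
    ("large-llm", 1.0, 20.0),
]


def allocate(subtasks):
    """Assign each (text, difficulty) subtask to the cheapest adequate tier."""
    plan = []
    for text, difficulty in subtasks:
        # Fall back to the largest model if no tier's ceiling covers the difficulty.
        name, _, cost = MODELS[-1]
        for tier_name, ceiling, tier_cost in MODELS:
            if difficulty <= ceiling:
                name, cost = tier_name, tier_cost
                break
        plan.append((text, name, cost))
    return plan


def cost_saving(plan):
    """Fraction of cost saved versus sending every subtask to the largest model."""
    routed = sum(cost for _, _, cost in plan)
    baseline = len(plan) * MODELS[-1][2]
    return 1 - routed / baseline


subtasks = [("parse the question", 0.1), ("set up equations", 0.5),
            ("prove the bound", 0.9), ("format the answer", 0.2)]
plan = allocate(subtasks)
print([model for _, model, _ in plan], f"saving: {cost_saving(plan):.2%}")
```

The saving in this toy run depends entirely on the invented prices and is unrelated to the 86.85% reported in the abstract; the point is only that routing easy steps to cheaper models lowers cost when most steps are easy.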

[162] Generating Grounded Responses to Counter Misinformation via Learning Efficient Fine-Grained Critiques

Xiaofei Xu,Xiuzhen Zhang,Ke Deng

Main category: cs.CL

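TL;DR: MisMitiFact is an efficient fact-grounded framework for generating responses that counter misinformation; lightweight critique models reduce LLM hallucination and substantially raise generation efficiency.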

Details Motivation: Misinformation poses a serious threat to society, and manual fact-checking is costly and hard to scale, so automated solutions are needed. Method: Proposes the MisMitiFact framework, which uses lightweight, fine-grained critique models to correct LLM outputs and keep responses grounded in facts. Result: Experiments show that MisMitiFact generates responses of comparable quality to LLM self-feedback while increasing feedback generation throughput by about 5x. Conclusion: MisMitiFact is an efficient, low-cost solution for large-scale misinformation mitigation. Abstract: Fake news and misinformation pose a significant threat to society, making efficient mitigation essential. However, manual fact-checking is costly and lacks scalability. Large Language Models (LLMs) offer promise in automating counter-response generation to mitigate misinformation, but a critical challenge lies in their tendency to hallucinate non-factual information. Existing models mainly rely on LLM self-feedback to reduce hallucination, but this approach is computationally expensive. In this paper, we propose MisMitiFact, Misinformation Mitigation grounded in Facts, an efficient framework for generating fact-grounded counter-responses at scale. MisMitiFact generates simple critique feedback to refine LLM outputs, ensuring responses are grounded in evidence. We develop lightweight, fine-grained critique models trained on data sourced from readily available fact-checking sites to identify and correct errors in key elements such as numerals, entities, and topics in LLM generations. Experiments show that MisMitiFact generates counter-responses of comparable quality to LLMs' self-feedback while using significantly smaller critique models. Importantly, it achieves a ~5x increase in feedback generation throughput, making it highly suitable for cost-effective, large-scale misinformation mitigation. Code and LLM prompt templates are at https://github.com/xxfwin/MisMitiFact.

[163] LengClaro2023: A Dataset of Administrative Texts in Spanish with Plain Language adaptations

Belén Agüera-Marco,Itziar Gonzalez-Dios

Main category: cs.CL

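TL;DR: LengClaro2023 is a dataset of Spanish legal-administrative texts containing each original text and two simplified versions, for evaluating automatic text simplification systems in Spanish.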

Details Motivation: To provide an evaluation resource for Spanish automatic text simplification systems and explore possibilities for further improvement. Method: Based on frequently used procedures from the Spanish Social Security website, two simplified versions are produced for each text: one following the arText claro recommendations, the other adding recommendations from plain language guidelines. Result: A linguistic resource that can be used to evaluate Spanish ATS systems. Conclusion: LengClaro2023 provides a practical tool for Spanish text simplification research and supports further optimization of ATS systems. Abstract: In this work, we present LengClaro2023, a dataset of legal-administrative texts in Spanish. Based on the most frequently used procedures from the Spanish Social Security website, we have created for each text two simplified equivalents. The first version follows the recommendations provided by arText claro. The second version incorporates additional recommendations from plain language guidelines to explore further potential improvements in the system. The linguistic resource created in this work can be used for evaluating automatic text simplification (ATS) systems in Spanish.

[164] MoA: Heterogeneous Mixture of Adapters for Parameter-Efficient Fine-Tuning of Large Language Models

Jie Cao,Tianwei Lin,Hongyang He,Rolan Yan,Wenqiao Zhang,Juncheng Li,Dongping Zhang,Siliang Tang,Yueting Zhuang

Main category: cs.CL

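TL;DR: The paper proposes a heterogeneous Mixture-of-Adapters (MoA) approach that dynamically integrates PEFT adapter experts with diverse structures to improve LLM performance on downstream tasks.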

Details Motivation: Existing homogeneous MoE-LoRA architectures suffer from representation collapse and expert load imbalance, which limits the potential of LLMs. Method: Proposes the heterogeneous MoA approach, supporting Soft MoA (weighted fusion of expert outputs) and Sparse MoA (sparse activation of experts). Result: Experiments show that heterogeneous MoA outperforms homogeneous MoE-LoRA methods in both performance and parameter efficiency. Conclusion: Heterogeneous MoA effectively addresses the problems of homogeneous architectures and improves LLM performance on downstream tasks. Abstract: Recent studies integrate Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE) to further enhance the performance of parameter-efficient fine-tuning (PEFT) methods in Large Language Model (LLM) applications. Existing methods employ homogeneous MoE-LoRA architectures composed of LoRA experts with either similar or identical structures and capacities. However, these approaches often suffer from representation collapse and expert load imbalance, which negatively impact the potential of LLMs. To address these challenges, we propose a heterogeneous Mixture-of-Adapters (MoA) approach. This method dynamically integrates PEFT adapter experts with diverse structures, leveraging their complementary representational capabilities to foster expert specialization, thereby enhancing the effective transfer of pre-trained knowledge to downstream tasks. MoA supports two variants: (i) Soft MoA achieves fine-grained integration by performing a weighted fusion of all expert outputs; (ii) Sparse MoA activates adapter experts sparsely based on their contribution, achieving this with negligible performance degradation. Experimental results demonstrate that heterogeneous MoA outperforms homogeneous MoE-LoRA methods in both performance and parameter efficiency. Our project is available at https://github.com/DCDmllm/MoA.
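
The difference between the two variants can be shown with toy scalar adapters. This is a sketch under assumptions, not the MoA implementation: real adapters act on hidden-state tensors, the router here is just a list of logits, and the contribution threshold used for the sparse variant is a made-up value.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def soft_moa(x, adapters, router_logits):
    """Soft variant: weighted fusion of every adapter's output."""
    weights = softmax(router_logits)
    return sum(w * adapter(x) for w, adapter in zip(weights, adapters))


def sparse_moa(x, adapters, router_logits, threshold=0.2):
    """Sparse variant: run only adapters whose weight reaches `threshold`.

    The threshold rule is an assumption standing in for "contribution".
    """
    weights = softmax(router_logits)
    active = [(w, a) for w, a in zip(weights, adapters) if w >= threshold]
    if not active:  # always keep at least the top-weighted adapter
        active = [max(zip(weights, adapters), key=lambda pair: pair[0])]
    total = sum(w for w, _ in active)
    return sum((w / total) * adapter(x) for w, adapter in active)


# Three toy adapters with different "structures" (different functions of x).
adapters = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x]
print(soft_moa(3.0, adapters, [0.0, 0.0, 0.0]))
print(sparse_moa(3.0, adapters, [5.0, 5.0, -5.0]))
```

The soft variant always pays for every adapter, while the sparse variant skips low-weight adapters entirely and renormalizes the rest, trading a small change in output for less computation.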

[165] DynamicMind: A Tri-Mode Thinking System for Large Language Models

Wei Li,Yanbin Wei,Qiushi Huang,Jiangyue Yan,Yang Chen,James T. Kwok,Yu Zhang

Main category: cs.CL

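TL;DR: DynamicMind is a tri-mode thinking system that uses cognitive-inspired prompt engineering to let large language models autonomously choose among Fast, Normal, and Slow thinking modes, optimizing performance and resource use on zero-shot question answering tasks.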

Details Motivation: Modern LLMs struggle to adapt their reasoning depth to task complexity, which leads to poor performance or wasted resources. Method: Proposes a tri-mode thinking system (Fast, Normal, Slow) and introduces the Thinking Density metric and a lightweight Mind Router to optimize resource allocation. Result: On mathematical, commonsense, and scientific QA tasks, DynamicMind shows superior zero-shot QA capability while balancing performance and computational efficiency. Conclusion: The tri-mode thinking system effectively addresses the limited task adaptivity of LLMs and improves the efficiency and performance of zero-shot QA. Abstract: Modern large language models (LLMs) often struggle to dynamically adapt their reasoning depth to varying task complexities, leading to suboptimal performance or inefficient resource utilization. To address this, we introduce DynamicMind, a novel tri-mode thinking system. DynamicMind empowers LLMs to autonomously select between Fast, Normal, and Slow thinking modes for zero-shot question answering (ZSQA) tasks through cognitive-inspired prompt engineering. Our framework's core innovations include: (1) expanding the established dual-process framework of fast and slow thinking into a tri-mode thinking system involving a normal thinking mode to preserve the intrinsic capabilities of the LLM; (2) proposing the Thinking Density metric, which aligns computational resource allocation with problem complexity; and (3) developing the Thinking Mode Capacity (TMC) dataset and a lightweight Mind Router to predict the optimal thinking mode. Extensive experiments across diverse mathematical, commonsense, and scientific QA benchmarks demonstrate that DynamicMind achieves superior ZSQA capabilities while establishing an effective trade-off between performance and computational efficiency.

[166] IntentionESC: An Intention-Centered Framework for Enhancing Emotional Support in Dialogue Systems

Xinjie Zhang,Wenxuan Wang,Qin Jin

Main category: cs.CL

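TL;DR: The paper proposes the IntentionESC framework and the ICECoT mechanism, which make supporter intentions and emotional states explicit to improve emotional support conversation strategies, and validates them with automated annotation and evaluation.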

Details Motivation: In emotional support conversations, unclear intentions can lead supporters to use inappropriate strategies; clearly defined intentions are essential for guiding the support process. Method: Proposes the IntentionESC framework to define intentions, combines it with the ICECoT mechanism so that LLMs mimic human reasoning, designs an automated annotation pipeline to produce training data, and develops an evaluation scheme. Result: Experiments validate the framework, which generates higher-quality emotional support responses. Conclusion: IntentionESC and ICECoT improve the effectiveness of emotional support conversations and offer a new direction for applying LLMs to emotional support. Abstract: In emotional support conversations, unclear intentions can lead supporters to employ inappropriate strategies, inadvertently imposing their expectations or solutions on the seeker. Clearly defined intentions are essential for guiding both the supporter's motivations and the overall emotional support process. In this paper, we propose the Intention-centered Emotional Support Conversation (IntentionESC) framework, which defines the possible intentions of supporters in emotional support conversations, identifies key emotional state aspects for inferring these intentions, and maps them to appropriate support strategies. While Large Language Models (LLMs) excel at text generation, they fundamentally operate as probabilistic models trained on extensive datasets, lacking a true understanding of human thought processes and intentions. To address this limitation, we introduce the Intention Centric Chain-of-Thought (ICECoT) mechanism. ICECoT enables LLMs to mimic human reasoning by analyzing emotional states, inferring intentions, and selecting suitable support strategies, thereby generating more effective emotional support responses. To train the model with ICECoT and integrate expert knowledge, we design an automated annotation pipeline that produces high-quality training data. Furthermore, we develop a comprehensive evaluation scheme to assess emotional support efficacy and conduct extensive experiments to validate our framework. Our data and code are available at https://github.com/43zxj/IntentionESC_ICECoT.

[167] NameTag 3: A Tool and a Service for Multilingual/Multitagset NER

Jana Straková,Milan Straka

Main category: cs.CL

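TL;DR: NameTag 3 is an open-source tool and cloud service for multilingual, multidataset, and multitagset named entity recognition (NER), covering both flat and nested entities. It achieves state-of-the-art results on 21 test datasets in 15 languages and is available as a command-line tool or as a cloud service.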

Details Motivation: Provide an efficient, versatile NER tool that supports many languages and datasets while remaining open source and easy to access. Method: Flat NER is served by a single 355M-parameter fine-tuned model, and nested NER by a 126M-parameter fine-tuned model. Result: Strong results on 21 test datasets in 15 languages, with the source code and models openly released. Conclusion: NameTag 3 is a powerful and flexible NER tool suited to many languages and scenarios, offered both as open source and as a cloud service. Abstract: We introduce NameTag 3, an open-source tool and cloud-based web service for multilingual, multidataset, and multitagset named entity recognition (NER), supporting both flat and nested entities. NameTag 3 achieves state-of-the-art results on 21 test datasets in 15 languages and remains competitive on the rest, even against larger models. It is available as a command-line tool and as a cloud-based service, enabling use without local installation. The NameTag 3 web service currently provides flat NER for 17 languages, trained on 21 corpora and three NE tagsets, all powered by a single 355M-parameter fine-tuned model; and nested NER for Czech, powered by a 126M fine-tuned model. The source code is licensed under open-source MPL 2.0, while the models are distributed under non-commercial CC BY-NC-SA 4.0. Documentation is available at https://ufal.mff.cuni.cz/nametag, source code at https://github.com/ufal/nametag3, and trained models via https://lindat.cz. The REST service and the web application can be found at https://lindat.mff.cuni.cz/services/nametag/. A demonstration video is available at https://www.youtube.com/watch?v=-gaGnP0IV8A.

[168] Elementary Math Word Problem Generation using Large Language Models

Nimesh Ariyarathne,Harshani Bandara,Yasith Heshan,Omega Gamage,Surangika Ranathunga,Dilan Nayanajith,Yutharsan Sivapalan,Gayathri Lihinikaduarachchi,Tharoosha Vihidun,Meenambika Chandirakumar,Sanujen Premakumar,Sanjula Gathsara

Main category: cs.CL

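TL;DR: The paper presents a Math Word Problem (MWP) generation system based on large language models (LLMs) that needs no additional input: only the number of problems, the grade, and the question type.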

Details Motivation: Creating math word problems by hand is time-consuming, and existing deep learning methods require additional input. Method: Uses LLMs with different prompting strategies, diversity-enhancing techniques, and human feedback to improve model performance. Result: The generated problems are high in quality with few spelling and grammar issues, but matching the requested grade and question type remains a weakness. Conclusion: LLMs perform well at MWP generation but need further improvement to meet grade and question-type requirements. Abstract: Mathematics is often perceived as a complex subject by students, leading to high failure rates in exams. To improve Mathematics skills, it is important to provide sample questions for students to practice problem-solving. Manually creating Math Word Problems (MWPs) is time-consuming for tutors, because they have to type in natural language while adhering to grammar and spelling rules of the language. Existing Deep Learning techniques for MWP generation require a tutor to provide the initial portion of the MWP and/or additional information such as an equation. In this paper, we present an MWP generation system based on Large Language Models (LLMs) that overcomes the need for additional input: the only input to our system is the number of MWPs needed, the grade, and the type of question (e.g. addition, subtraction). Unlike the existing LLM-based solutions for MWP generation, we carried out an extensive set of experiments involving different LLMs, prompting strategies, techniques to improve the diversity of questions, as well as techniques that employ human feedback to improve LLM performance. Human and automated evaluations confirmed that the generated MWPs are high in quality, with minimal spelling and grammar issues. However, LLMs still struggle to generate questions that adhere to the specified grade and question type requirements.

[169] Let's Put Ourselves in Sally's Shoes: Shoes-of-Others Prefixing Improves Theory of Mind in Large Language Models

Kazutoshi Shinoda,Nobukatsu Hojo,Kyosuke Nishida,Yoshihiro Yamazaki,Keita Suzuki,Hiroaki Sugiyama,Kuniko Saito

Main category: cs.CL

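TL;DR: Proposes Shoes-of-Others (SoO) prefixing, a new inference-time method that improves the Theory of Mind performance of large language models and applies to a broader range of scenarios.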

Details Motivation: Existing inference-time methods for Theory of Mind are limited and apply only to specific contexts (those involving changes in the world state), so a more general method is needed. Method: SoO prefixing simply begins the model output with the prefix "Let's put ourselves in A's shoes.", where A is the target character's name. Result: On two benchmarks, SoO prefixing consistently improves model performance across five categories of mental states. Conclusion: SoO prefixing elicits faithful thoughts, which improves Theory of Mind performance, and it is more broadly applicable. Abstract: Recent studies have shown that Theory of Mind (ToM) in large language models (LLMs) has not reached human-level performance yet. Since fine-tuning LLMs on ToM datasets often degrades their generalization, several inference-time methods have been proposed to enhance ToM in LLMs. However, existing inference-time methods for ToM are specialized for inferring beliefs from contexts involving changes in the world state. In this study, we present a new inference-time method for ToM, Shoes-of-Others (SoO) prefixing, which makes fewer assumptions about contexts and is applicable to broader scenarios. SoO prefixing simply specifies the beginning of LLM outputs with "Let's put ourselves in A's shoes.", where A denotes the target character's name. We evaluate SoO prefixing on two benchmarks that assess ToM in conversational and narrative contexts without changes in the world state and find that it consistently improves ToM across five categories of mental states. Our analysis suggests that SoO prefixing elicits faithful thoughts, thereby improving the ToM performance.
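
As a rough illustration of SoO prefixing, the sketch below builds the fixed prefix that the model's output is forced to start with. Only the prefix sentence comes from the abstract; the function names and the prompt layout are hypothetical, and the way a response prefix is passed to a particular model API is left out because it depends on that API.

```python
def soo_prefix(character: str) -> str:
    """Return the Shoes-of-Others prefix for the target character."""
    return f"Let's put ourselves in {character}'s shoes."

def build_generation_input(context: str, question: str, character: str) -> dict:
    """Bundle the prompt and the forced start of the model's answer.

    The prompt layout here is an assumption; the method only requires
    that the model's output begins with the SoO prefix.
    """
    return {
        "prompt": f"{context}\n\nQuestion: {question}",
        "response_prefix": soo_prefix(character),
    }
```

The model then continues generating from `response_prefix` instead of starting its answer from scratch.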

[170] LTG at SemEval-2025 Task 10: Optimizing Context for Classification of Narrative Roles

Egil Rønningstad,Gaurav Negi

Main category: cs.CL

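TL;DR: Proposes a context selection strategy based on an entity-oriented heuristic that addresses the context limit in entity classification over long documents, performing on par with or better than large generative models.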

Details Motivation: Address the challenge of a limited context window when classifying entities in long documents. Method: Selects context with an entity-oriented heuristic and classifies with the XLM-RoBERTa model. Result: The approach is on par with, or outperforms, supervised fine-tuning of larger generative models. Conclusion: A simple context selection strategy can effectively improve classification with models that have a limited context window. Abstract: Our contribution to the SemEval 2025 shared task 10, subtask 1 on entity framing, tackles the challenge of providing the necessary segments from longer documents as context for classification with a masked language model. We show that a simple entity-oriented heuristic for context selection can enable text classification using models with a limited context window. Our context selection approach, combined with the XLM-RoBERTa language model, is on par with, or outperforms, Supervised Fine-Tuning with larger generative language models.

[171] Tau-Eval: A Unified Evaluation Framework for Useful and Private Text Anonymization

Gabriel Loiseau,Damien Sileo,Damien Riquet,Maxime Meyer,Marc Tommasi

Main category: cs.CL

TL;DR: Tau-Eval is an open-source framework for evaluating text anonymization methods in terms of privacy and utility task sensitivity.

Details Motivation: Text anonymization involves a complex trade-off between protecting privacy and preserving information, and there is no unified evaluation standard. Method: Presents the Tau-Eval framework, including a Python library, code, documentation, and tutorials. Result: Tau-Eval provides a practical tool for evaluating anonymization methods. Conclusion: Tau-Eval fills a gap in the evaluation of text anonymization and supports balancing privacy against utility. Abstract: Text anonymization is the process of removing or obfuscating information from textual data to protect the privacy of individuals. This process inherently involves a complex trade-off between privacy protection and information preservation, where stringent anonymization methods can significantly impact the text's utility for downstream applications. Evaluating the effectiveness of text anonymization proves challenging from both privacy and utility perspectives, as there is no universal benchmark that can comprehensively assess anonymization techniques across diverse and sometimes contradictory contexts. We present Tau-Eval, an open-source framework for benchmarking text anonymization methods through the lens of privacy and utility task sensitivity. A Python library, code, documentation and tutorials are publicly available.

[172] A Culturally-Rich Romanian NLP Dataset from "Who Wants to Be a Millionaire?" Videos

Alexandru-Gabriel Ganea,Antonia-Adelina Popovici,Adrian-Marius Dumitran

Main category: cs.CL

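TL;DR: The study builds a multilingual dataset from a Romanian game show and finds that LLMs perform better on international questions than on culture-specific ones, underscoring the effect of cultural context on model performance.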

Details Motivation: Examine how LLM performance varies across languages and cultural contexts, especially on culture-specific questions. Method: Builds the dataset by combining OCR, automated text extraction, and manual verification, then tests LLMs on Romanian-specific and international questions. Result: LLMs reach 80-95% accuracy on international questions and 50-75% on Romanian-specific cultural questions. Conclusion: Cultural context significantly affects LLM performance, and more culturally aware multilingual NLP systems are needed. Abstract: Large Language Models (LLMs) demonstrate varying performance across languages and cultural contexts. This study introduces a novel, culturally-rich, multilingual dataset derived from video recordings of the Romanian game show "Who Wants to Be a Millionaire?" (Vrei să fii Milionar?). We employed an innovative process combining optical character recognition (OCR), automated text extraction, and manual verification to collect question-answer pairs, enriching them with metadata including question domain (e.g., biology, history), cultural relevance (Romanian-specific vs. international), and difficulty. Benchmarking state-of-the-art LLMs, including Romanian-adapted models, on this dataset revealed significant performance disparities: models consistently achieve higher accuracy (80-95%) on international questions compared to Romanian-specific cultural questions (50-75%). We further investigate these differences through experiments involving machine translation of Romanian questions into English and cross-lingual tests using a comparable dataset in French. Our findings underscore the impact of cultural context and data source on LLM performance and offer practical insights for building robust, culturally-aware multilingual NLP systems, especially in educational domains. The dataset is publicly available at Hugging Face.

[173] Token Signature: Predicting Chain-of-Thought Gains with Token Decoding Feature in Large Language Models

Peijie Liu,Fengli Xu,Yong Li

Main category: cs.CL

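TL;DR: The paper proposes Dynamic CoT, which analyzes token probability distributions to choose dynamically between CoT and a direct answer, improving efficiency and reducing token consumption.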

Details Motivation: Chain-of-Thought (CoT) is effective on complex reasoning tasks, but its gains are inconsistent and its mechanism remains unclear. Method: Proposes indicators based on the token probability distribution to assess CoT effectiveness, combines them with a logistic regression model to choose dynamically between CoT and a direct answer (Dynamic CoT), and extends the approach to closed-source models. Result: The CoT effectiveness indicators reach 89.2% accuracy, and Dynamic CoT cuts token consumption by more than 35% while maintaining high accuracy. Conclusion: The work offers a new perspective on the mechanism of CoT and provides a framework for deploying it more efficiently. Abstract: The Chain-of-Thought (CoT) technique has proven effective in improving the performance of large language models (LLMs) on complex reasoning tasks. However, the performance gains are inconsistent across different tasks, and the underlying mechanism remains a long-standing research question. In this work, we make a preliminary observation that the monotonicity of token probability distributions may be correlated with the gains achieved through CoT reasoning. Leveraging this insight, we propose two indicators based on the token probability distribution to assess CoT effectiveness across different tasks. By combining instance-level indicators with a logistic regression model, we introduce Dynamic CoT, a method that dynamically selects between CoT and direct answer. Furthermore, we extend Dynamic CoT to closed-source models by transferring decision strategies learned from open-source models. Our indicators for assessing CoT effectiveness achieve an accuracy of 89.2%, and Dynamic CoT reduces token consumption by more than 35% while maintaining high accuracy. Overall, our work offers a novel perspective on the underlying mechanisms of CoT reasoning and provides a framework for its more efficient deployment.

[174] Unlocking Recursive Thinking of LLMs: Alignment via Refinement

Haoke Zhang,Xiaobo Liang,Cunxiang Wang,Juntao Li,Min Zhang

Main category: cs.CL

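TL;DR: AvR (Alignment via Refinement) is a new method that strengthens the recursive reasoning of LLMs through long-form Chain of Thought (CoT), using differentiable learning to optimize refinement-aware rewards and substantially improving model performance.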

Details Motivation: The recursive reasoning of current LLMs is limited, particularly without expert-curated data for distillation; AvR aims to unlock this potential through long-form CoT. Method: AvR introduces a refinement process that combines criticism and improvement actions, optimizes refinement-aware rewards with differentiable learning, and organizes the synthesized multi-round data into a long refinement thought. Result: AvR significantly outperforms conventional preference optimization methods; with only 3k synthetic samples it raises the win rate of LLaMA-3-8B-Instruct on AlpacaEval 2.0 by over 20%. Conclusion: Through refinement-aware rewards and long-form CoT, AvR effectively improves the recursive reasoning of LLMs with a clear performance advantage. Abstract: The OpenAI o1-series models have demonstrated that leveraging long-form Chain of Thought (CoT) can substantially enhance performance. However, the recursive thinking capabilities of Large Language Models (LLMs) remain limited, particularly in the absence of expert-curated data for distillation. In this paper, we propose AvR: Alignment via Refinement, a novel method aimed at unlocking the potential of LLMs for recursive reasoning through long-form CoT. AvR introduces a refinement process that integrates criticism and improvement actions, guided by differentiable learning techniques to optimize refinement-aware rewards. As a result, the synthesized multi-round data can be organized as a long refinement thought, further enabling test-time scaling. Experimental results show that AvR significantly outperforms conventional preference optimization methods. Notably, with only 3k synthetic samples, our method boosts the performance of the LLaMA-3-8B-Instruct model by over 20% in win rate on AlpacaEval 2.0. Our code is available on GitHub (https://github.com/Banner-Z/AvR.git).

Yu Li,Lehui Li,Zhihao Wu,Qingmin Liao,Jianye Hao,Kun Shao,Fengli Xu,Yong Li

Main category: cs.CL

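TL;DR: Proposes a comprehensive framework that uses a hierarchical search space, a predictive value model, and a hierarchical MCTS strategy to address three major challenges in LLM agent design, yielding notable performance gains.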

Details Motivation: Existing agent search methods under-use proven components, are costly to evaluate, and search inefficiently. Method: Introduces a hierarchical search space, a predictive value model, and a hierarchical MCTS strategy that together search efficiently given a task description. Result: An average performance gain of 8.34% across seven benchmarks, with faster search progress. Conclusion: The method improves both the performance and the efficiency of agent design and has broad application potential. Abstract: Large language model (LLM) agents have demonstrated strong capabilities across diverse domains. However, designing high-performing agentic systems remains challenging. Existing agent search methods suffer from three major limitations: (1) an emphasis on optimizing agentic workflows while under-utilizing proven human-designed components such as memory, planning, and tool use; (2) high evaluation costs, as each newly generated agent must be fully evaluated on benchmarks; and (3) inefficient search in a large search space. In this work, we introduce a comprehensive framework to address these challenges. First, we propose a hierarchical search space that jointly models agentic workflow and composable functional components, enabling richer agentic system designs. Building on this structured design space, we introduce a predictive value model that estimates agent performance given agentic system and task description, allowing for efficient, low-cost evaluation during the search process. Finally, we present a hierarchical Monte Carlo Tree Search (MCTS) strategy informed by uncertainty to guide the search. Experiments on seven benchmarks, covering embodied, math, web, tool, and game, show that our method achieves an average performance gain of 8.34% over state-of-the-art baselines and exhibits faster search progress with steeper improvement trajectories. Code repo is available at https://github.com/Ericccc02/AgentSwift.

[176] When to Trust Context: Self-Reflective Debates for Context Reliability

Zeqi Zhou,Fang Wu,Shayan Talaei,Haokai Zhao,Cheng Meixin,Tinson Xu,Amin Saberi,Yejin Choi

Main category: cs.CL

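TL;DR: The SR-DCR framework uses self-reflective debate to resolve conflicts between an LLM's parametric knowledge and its context, improving reliability.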

Details Motivation: LLMs become unreliable when their parametric knowledge conflicts with contextual input, producing factual inconsistencies or hallucinations. Method: Proposes SR-DCR, which combines token-level self-confidence with an asymmetric multi-agent debate in which critic, defender, and judge models assess the reliability of the context. Result: On the ClashEval benchmark, SR-DCR consistently improves robustness to misleading context while maintaining accuracy, outperforming classical debate and confidence-only baselines. Conclusion: SR-DCR is a lightweight and efficient way to resolve context conflicts in language models. Abstract: Large language models frequently encounter conflicts between their parametric knowledge and contextual input, often resulting in factual inconsistencies or hallucinations. We propose Self-Reflective Debate for Contextual Reliability (SR-DCR), a lightweight framework that integrates token-level self-confidence with an asymmetric multi-agent debate to adjudicate such conflicts. A critic, deprived of context, challenges a defender who argues from the given passage; a judge model evaluates the debate and determines the context's reliability. The final answer is selected by combining the verdict with model confidence. Experiments on the ClashEval benchmark demonstrate that SR-DCR consistently enhances robustness to misleading context while maintaining accuracy on trustworthy inputs, outperforming both classical debate and confidence-only baselines with minimal computational overhead. The code is available at https://github.com/smiles724/Self-Reflective-Debates.

[177] Large Language Models are Demonstration Pre-Selectors for Themselves

Jiarui Jin,Yuwei Wu,Haoxuan Li,Xiaoting He,Weinan Zhang,Yiming Yang,Yong Yu,Jun Wang,Mengyue Yang

Main category: cs.CL

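TL;DR: FEEDER is a new pre-selection framework that efficiently picks representative demonstrations, cutting the retrieval cost over large datasets and making in-context learning (ICL) and LLM fine-tuning more efficient.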

Details Motivation: Existing ICL methods select demonstrations using similarity or diversity scores, which is computationally expensive because retrieval over large datasets is repeated for each query. Method: Proposes the FEEDER framework, introduces "sufficiency" and "necessity" metrics, and designs a tree-based algorithm to select representative examples efficiently. Result: Experiments show that FEEDER reduces training data size by over 20% while maintaining performance and remains compatible with various downstream demonstration selection strategies. Conclusion: FEEDER makes ICL and LLM fine-tuning markedly more efficient without hurting performance. Abstract: In-context learning (ICL) with large language models (LLMs) delivers strong few-shot performance by choosing few-shot demonstrations from the entire training data. However, existing ICL methods, which rely on similarity or diversity scores to choose demonstrations, incur high computational costs due to repeated retrieval from large-scale datasets for each query. To this end, we propose FEEDER (FEw yet Essential Demonstration prE-selectoR), a novel pre-selection framework that identifies a representative subset of demonstrations containing the most representative examples in the training data, tailored to specific LLMs. To construct this subset, we introduce the "sufficiency" and "necessity" metrics in the pre-selection stage and design a tree-based algorithm to identify representative examples efficiently. Once pre-selected, this representative subset can effectively replace the full training data, improving efficiency while maintaining comparable performance in ICL. Additionally, our pre-selected subset also benefits fine-tuning LLMs, where we introduce a bi-level optimization method that enhances training efficiency without sacrificing performance. Experiments with LLMs ranging from 300M to 8B parameters show that FEEDER can reduce training data size by over 20% while maintaining performance and seamlessly integrating with various downstream demonstration selection strategies in ICL.

[178] MATP-BENCH: Can MLLM Be a Good Automated Theorem Prover for Multimodal Problems?

Zhitao He,Zongwei Lyu,Dazhong Chen,Dadi Guo,Yi R. Fung

Main category: cs.CL

TL;DR: MATP-BENCH is a new multimodal, multi-level, and multi-language benchmark for evaluating Multimodal Large Language Models (MLLMs) as automated theorem provers.

Details Motivation: Explore the potential of MLLMs as Automated Theorem Provers (ATPs) in the multimodal domain, which prior work has left underexplored. Method: Introduces the MATP-BENCH benchmark of 1056 multimodal math problems at high school, university, and competition level, with formalizations for several theorem-proving frameworks (Lean 4, Coq, and Isabelle). Result: Existing methods solve only a limited number of MATP-BENCH problems, so the benchmark poses an open challenge for automated theorem proving research. Conclusion: MATP-BENCH offers a new evaluation standard for multimodal automated theorem proving and exposes the limits of current techniques. Abstract: Numerous theorems, such as those in geometry, are often presented in multimodal forms (e.g., diagrams). Humans benefit from visual reasoning in such settings, using diagrams to gain intuition and guide the proof process. Modern Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in solving a wide range of mathematical problems. However, the potential of MLLMs as Automated Theorem Provers (ATPs), specifically in the multimodal domain, remains underexplored. In this paper, we introduce the Multimodal Automated Theorem Proving benchmark (MATP-BENCH), a new Multimodal, Multi-level, and Multi-language benchmark designed to evaluate MLLMs in this role as multimodal automated theorem provers. MATP-BENCH consists of 1056 multimodal theorems drawn from high school, university, and competition-level mathematics. All these multimodal problems are accompanied by formalizations in Lean 4, Coq and Isabelle, thus making the benchmark compatible with a wide range of theorem-proving frameworks. MATP-BENCH requires models to integrate sophisticated visual understanding with mastery of a broad spectrum of mathematical knowledge and rigorous symbolic reasoning to generate formal proofs. We use MATP-BENCH to evaluate a variety of advanced multimodal language models. Existing methods can only solve a limited number of the MATP-BENCH problems, indicating that this benchmark poses an open challenge for research on automated theorem proving.

[179] Hey, That's My Data! Label-Only Dataset Inference in Large Language Models

Chen Xiong,Zihao Wang,Rui Zhu,Tsung-Yi Ho,Pin-Yu Chen,Jingwei Xiong,Haixu Tang,Lucila Ohno-Machado

Main category: cs.CL

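TL;DR: CatShift is a label-only dataset-inference framework that exploits catastrophic forgetting in LLMs, using fine-tuning to detect whether a suspicious dataset was part of a model's training data.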

Details Motivation: Existing methods depend on the model's internal log probabilities, but many LLMs now withhold or obfuscate these signals, so label-only methods that do not rely on logits are needed. Method: CatShift fine-tunes on the suspicious dataset, observes how strongly the model's outputs shift, compares this with the shift for a known non-member validation set, and statistically decides whether the dataset was used in training. Result: Experiments on open-source and API-based LLMs validate CatShift's effectiveness when logits are inaccessible. Conclusion: CatShift offers a practical and robust solution for safeguarding proprietary data. Abstract: Large Language Models (LLMs) have revolutionized Natural Language Processing by excelling at interpreting, reasoning about, and generating human language. However, their reliance on large-scale, often proprietary datasets poses a critical challenge: unauthorized usage of such data can lead to copyright infringement and significant financial harm. Existing dataset-inference methods typically depend on log probabilities to detect suspicious training material, yet many leading LLMs have begun withholding or obfuscating these signals. This reality underscores the pressing need for label-only approaches capable of identifying dataset membership without relying on internal model logits. We address this gap by introducing CatShift, a label-only dataset-inference framework that capitalizes on catastrophic forgetting: the tendency of an LLM to overwrite previously learned knowledge when exposed to new data. If a suspicious dataset was previously seen by the model, fine-tuning on a portion of it triggers a pronounced post-tuning shift in the model's outputs; conversely, truly novel data elicits more modest changes. By comparing the model's output shifts for a suspicious dataset against those for a known non-member validation set, we statistically determine whether the suspicious set is likely to have been part of the model's original training corpus. Extensive experiments on both open-source and API-based LLMs validate CatShift's effectiveness in logit-inaccessible settings, offering a robust and practical solution for safeguarding proprietary data.

[180] Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models

Yingqi Hu,Zhuo Zhang,Jingyuan Zhang,Lizhen Qu,Zenglin Xu

Main category: cs.CL

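TL;DR: Federated fine-tuning of large language models (FedLLMs) improves model performance while preserving data privacy, but the memorization ability of LLMs leaves it vulnerable to data extraction attacks. The paper proposes extraction attack algorithms for FedLLMs and evaluates the risk under a more realistic threat model.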

Details Motivation: Investigate the privacy risks of FedLLMs, in particular extraction attacks that target previously unseen personally identifiable information (PII). Method: Designs extraction attack algorithms for FedLLMs that use contextual prefixes held by the attacker to generalize across clients, and evaluates them on a real-world legal dataset. Result: The attack extracts up to 56.57% of victim-exclusive PII, with "Address," "Birthday," and "Name" being the most vulnerable categories. Conclusion: The study underscores the urgency of robust defense strategies and contributes a new benchmark and evaluation framework for privacy-preserving federated learning. Abstract: Federated fine-tuning of large language models (FedLLMs) presents a promising approach for achieving strong model performance while preserving data privacy in sensitive domains. However, the inherent memorization ability of LLMs makes them vulnerable to training data extraction attacks. To investigate this risk, we introduce simple yet effective extraction attack algorithms specifically designed for FedLLMs. In contrast to prior "verbatim" extraction attacks, which assume access to fragments from all training data, our approach operates under a more realistic threat model, where the attacker only has access to a single client's data and aims to extract previously unseen personally identifiable information (PII) from other clients. This requires leveraging contextual prefixes held by the attacker to generalize across clients. To evaluate the effectiveness of our approaches, we propose two rigorous metrics (coverage rate and efficiency) and extend a real-world legal dataset with PII annotations aligned with CPIS, GDPR, and CCPA standards, achieving 89.9% human-verified precision. Experimental results show that our method can extract up to 56.57% of victim-exclusive PII, with "Address," "Birthday," and "Name" being the most vulnerable categories. Our findings underscore the pressing need for robust defense strategies and contribute a new benchmark and evaluation framework for future research in privacy-preserving federated learning.

[181] Zero-Shot Detection of LLM-Generated Code via Approximated Task Conditioning

Maor Ashkenazi,Ofir Brenner,Tal Furman Shohet,Eran Treister

Main category: cs.CL

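TL;DR: Proposes ATC, a zero-shot method based on task-conditioned probability distributions for detecting LLM-generated code; it needs neither the generator model nor the original task prompt and performs well across several programming languages.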

Details Motivation: Detecting LLM-generated code matters for security, intellectual property, and academic integrity, but existing methods work poorly on code. Method: Approximates the task that produced a code snippet and evaluates the entropy of the code tokens under this approximated task conditioning (ATC), using the task-conditioned distribution to separate LLM-generated from human-written code. Result: ATC achieves state-of-the-art results on benchmarks across programming languages such as Python, CPP, and Java. Conclusion: Task-level conditioning is important for detecting LLM-generated code, ATC is practical for real-world use, and the code and dataset are open source. Abstract: Detecting Large Language Model (LLM)-generated code is a growing challenge with implications for security, intellectual property, and academic integrity. We investigate the role of conditional probability distributions in improving zero-shot LLM-generated code detection, when considering both the code and the corresponding task prompt that generated it. Our key insight is that when evaluating the probability distribution of code tokens using an LLM, there is little difference between LLM-generated and human-written code. However, conditioning on the task reveals notable differences. This contrasts with natural language text, where differences exist even in the unconditional distributions. Leveraging this, we propose a novel zero-shot detection approach that approximates the original task used to generate a given code snippet and then evaluates token-level entropy under the approximated task conditioning (ATC). We further provide a mathematical intuition, contextualizing our method relative to previous approaches. ATC requires neither access to the generator LLM nor the original task prompts, making it practical for real-world applications. To the best of our knowledge, it achieves state-of-the-art results across benchmarks and generalizes across programming languages, including Python, CPP, and Java. Our findings highlight the importance of task-level conditioning for LLM-generated code detection. The supplementary materials and code are available at https://github.com/maorash/ATC, including the dataset gathering implementation, to foster further research in this area.
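
The token-level entropy that ATC evaluates can be sketched as follows. This is only an illustration of the quantity, not the authors' detector: it assumes the next-token distributions have already been obtained from some scoring LLM conditioned on an approximated task, and the function names are hypothetical. How the score is turned into a decision is not stated in the abstract and is deliberately left out.

```python
import math

def token_entropy(dist):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def mean_token_entropy(dists):
    """Average entropy over all code-token positions of a snippet.

    `dists` holds one probability distribution per code token, assumed
    to come from a scoring LLM conditioned on the approximated task.
    """
    return sum(token_entropy(d) for d in dists) / len(dists)
```

Per the abstract, the same score computed without task conditioning separates LLM-generated from human-written code poorly, which is why the conditioning step matters.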

[182] MIRIAD: Augmenting LLMs with millions of medical query-response pairs

Qinyue Zheng,Salman Abdullah,Sam Rawal,Cyril Zakka,Sophie Ostmeier,Maximilian Purk,Eduardo Reis,Eric J. Topol,Jure Leskovec,Michael Moor

Main category: cs.CL

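TL;DR: MIRIAD is a large-scale, curated corpus of medical QA pairs built with a semi-automated pipeline, intended to make LLMs more accurate and reliable in healthcare.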

Details Motivation: Existing RAG pipelines rely on unstructured medical text that is noisy and hard to exploit, and systematic ways of organizing medical knowledge are lacking. Method: MIRIAD uses a semi-automated pipeline (LLM generation, filtering, grounding, and human annotation) to build 5,821,948 medical QA pairs that package knowledge in a query-response format. Result: Experiments show that MIRIAD improves LLM accuracy by up to 6.7% and substantially improves the detection of medical hallucinations (F1 score up by 22.5 to 37%). Conclusion: MIRIAD provides a more reliable foundation for LLM applications in healthcare, supporting information retrieval, enhanced RAG, and knowledge-grounded chat interfaces. Abstract: LLMs are bound to transform healthcare with advanced decision support and flexible chat assistants. However, LLMs are prone to generate inaccurate medical content. To ground LLMs in high-quality medical knowledge, LLMs have been equipped with external knowledge via RAG, where unstructured medical knowledge is split into small text chunks that can be selectively retrieved and integrated into the LLM's context. Yet, existing RAG pipelines rely on raw, unstructured medical text, which can be noisy, uncurated and difficult for LLMs to effectively leverage. Systematic approaches to organize medical knowledge to best surface it to LLMs are generally lacking. To address these challenges, we introduce MIRIAD, a large-scale, curated corpus of 5,821,948 medical QA pairs, each rephrased from and grounded in a passage from peer-reviewed medical literature using a semi-automated pipeline combining LLM generation, filtering, grounding, and human annotation. Unlike prior medical corpora, which rely on unstructured text, MIRIAD encapsulates web-scale medical knowledge in an operationalized query-response format, which enables more targeted retrieval. Experiments on challenging medical QA benchmarks show that augmenting LLMs with MIRIAD improves accuracy by up to 6.7% compared to unstructured RAG baselines with the same source corpus and with the same amount of retrieved text. Moreover, MIRIAD improved the ability of LLMs to detect medical hallucinations by 22.5 to 37% (increase in F1 score). We further introduce MIRIAD-Atlas, an interactive map of MIRIAD spanning 56 medical disciplines, enabling clinical users to visually explore, search, and refine medical knowledge. MIRIAD promises to unlock a wealth of downstream applications, including medical information retrievers, enhanced RAG applications, and knowledge-grounded chat interfaces, which ultimately enables more reliable LLM applications in healthcare.

[183] Reinforcing Code Generation: Improving Text-to-SQL with Execution-Based Learning

Atharv Kulkarni,Vivek Srikumar

Main category: cs.CL

TL;DR: The study uses reinforcement learning to improve an LLM's SQL generation, using database execution feedback to raise accuracy.

Details Motivation: Explore whether a model's SQL generation can be tuned through interaction with a database engine rather than through supervised fine-tuning. Method: Frame the problem as reinforcement learning, with execution feedback as a scalar reward, optimized within the GRPO framework. Result: RL-tuning raises SQL generation accuracy from 31.49 to 49.83 and lowers the error rate from 25.43% to 14.71%, approaching the performance of a larger model. Conclusion: Execution feedback can effectively improve the symbolic reasoning capabilities of LLMs. Abstract: In this work, we study the problem of code generation with a large language model (LLM), with a focus on generating SQL queries from natural language questions. We ask: Instead of using supervised fine-tuning with text-code pairs, can we tune a model by having it interact with a database engine? We frame this problem as a reinforcement learning problem where the model receives execution-based feedback from the environment in the form of scalar rewards. These rewards penalize execution failures and assign positive values when a query returns a correct answer. We use the rewards within the Group Relative Policy Optimization (GRPO) framework. We use a tabular reasoning benchmark to test and evaluate our findings. We find that with only weak supervision in the form of question-answer pairs, RL-tuning improves the accuracy of model-generated SQL code from 31.49 to 49.83 while reducing error percentage from 25.43% to 14.71%. This improvement allowed the model to nearly match the performance of the larger SQLCoder-70B model. Our work demonstrates the potential of using execution-based feedback to improve symbolic reasoning capabilities of LLMs.
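The reward scheme in this abstract (penalize execution failures, reward correct answers, then compare samples within a group) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the reward values (-1.0, 0.0, 1.0), the function names, and the toy `city` table are assumptions, and only the group-relative normalization step of GRPO is shown, not the policy update.

```python
import sqlite3
from statistics import mean, pstdev


def execution_reward(conn, sql, gold_answer):
    """Hypothetical scalar reward: -1.0 if the query fails, 1.0 if it returns the gold answer, else 0.0."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return -1.0  # execution failure is penalized
    return 1.0 if rows == gold_answer else 0.0


def group_relative_advantages(rewards):
    """Normalize each reward by the mean and standard deviation of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]  # identical rewards carry no learning signal
    return [(r - mu) / sigma for r in rewards]


# Toy in-memory database standing in for the benchmark tables (assumed data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (name TEXT, population INTEGER)")
conn.executemany("INSERT INTO city VALUES (?, ?)", [("Oslo", 700000), ("Bergen", 290000)])

# Three sampled queries for the question "Which cities have more than 500,000 people?"
candidates = [
    "SELECT name FROM city WHERE population > 500000",  # correct answer
    "SELECT name FROM city",                             # runs, wrong answer
    "SELEC name FROM city",                              # syntax error
]
gold = [("Oslo",)]

rewards = [execution_reward(conn, q, gold) for q in candidates]
advantages = group_relative_advantages(rewards)
```

Only question-answer pairs are needed here: the gold SQL never appears, which matches the weak-supervision setting the abstract describes.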

[184] Bridging the Gap: In-Context Learning for Modeling Human Disagreement

Benedetta Muscato,Yue Li,Gizem Gezici,Zhixue Zhao,Fosca Giannotti

Main category: cs.CL

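TL;DR: The study examines whether large language models (LLMs) can capture multiple perspectives and annotator disagreement in subjective tasks. This proves viable in zero-shot settings but weak in few-shot settings, and prompt design and demonstration selection notably affect performance.

Details Motivation: Existing LLMs rely on aggregated labels (e.g., majority voting), which can mask human disagreement in subjective annotations; the study explores whether LLMs can reflect that disagreement. Method: Evaluate four open-source LLMs with in-context learning (ICL) in zero-shot and few-shot settings under three label modeling strategies (aggregated hard labels, disaggregated hard labels, and soft labels), and analyze the effect of demonstration selection and ordering methods. Result: Multi-perspective generation is viable in zero-shot settings, but few-shot setups struggle to capture the full spectrum of human judgments; prompt design and demonstration selection notably affect performance, while example ordering has limited impact. Conclusion: The study exposes the challenges of modeling subjectivity with LLMs and stresses the importance of building more perspective-aware, socially intelligent models. Abstract: Large Language Models (LLMs) have shown strong performance on NLP classification tasks. However, they typically rely on aggregated labels, often obtained via majority voting, which can obscure the human disagreement inherent in subjective annotations. This study examines whether LLMs can capture multiple perspectives and reflect annotator disagreement in subjective tasks such as hate speech and offensive language detection. We use in-context learning (ICL) in zero-shot and few-shot settings, evaluating four open-source LLMs across three label modeling strategies: aggregated hard labels, and disaggregated hard and soft labels. In few-shot prompting, we assess demonstration selection methods based on textual similarity (BM25, PLM-based), annotation disagreement (entropy), a combined ranking, and example ordering strategies (random vs. curriculum-based). Results show that multi-perspective generation is viable in zero-shot settings, while few-shot setups often fail to capture the full spectrum of human judgments. Prompt design and demonstration selection notably affect performance, though example ordering has limited impact. These findings highlight the challenges of modeling subjectivity with LLMs and the importance of building more perspective-aware, socially intelligent models.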
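The label modeling strategies and the entropy-based disagreement score mentioned in this abstract can be illustrated with a small sketch. The function names and the four example votes are assumptions made for illustration; the paper's exact definitions may differ.

```python
from collections import Counter
from math import log2


def aggregated_hard_label(annotations):
    """Majority vote: collapses all annotators into a single label."""
    return Counter(annotations).most_common(1)[0][0]


def soft_label(annotations, classes):
    """Fraction of annotators choosing each class (a probability distribution)."""
    counts = Counter(annotations)
    total = len(annotations)
    return {c: counts[c] / total for c in classes}


def disagreement_entropy(distribution):
    """Shannon entropy in bits; 0 means unanimous, higher means more disagreement."""
    return -sum(p * log2(p) for p in distribution.values() if p > 0)


# Hypothetical annotations for one post (the disaggregated hard labels).
votes = ["hate", "not_hate", "hate", "hate"]
classes = ["hate", "not_hate"]

majority = aggregated_hard_label(votes)
dist = soft_label(votes, classes)
entropy = disagreement_entropy(dist)
```

The majority vote hides the dissenting annotator, while the soft label and its entropy keep that information available for demonstration selection.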

[185] Phonetically-Augmented Discriminative Rescoring for Voice Search Error Correction

Christophe Van Gysel,Maggie Wu,Lyan Verwimp,Caglar Tirkaz,Marco Bertola,Zhihong Lei,Youssef Oualil

Main category: cs.CL

TL;DR: The paper proposes a phonetic correction system that generates phonetic alternatives and rescores them, significantly improving E2E ASR recognition of infrequent movie titles.

Details Motivation: E2E ASR models depend on expensive annotated training data yet recognize recent or infrequent movie titles poorly. Method: Propose a phonetic correction system with a phonetic search component and a rescorer component that generates and refines phonetic alternatives. Result: On movie-title benchmarks, word error rate drops by 4.4% to 7.6% relative. Conclusion: The method effectively improves ASR recognition of infrequent content. Abstract: End-to-end (E2E) Automatic Speech Recognition (ASR) models are trained using paired audio-text samples that are expensive to obtain, since high-quality ground-truth data requires human annotators. Voice search applications, such as digital media players, leverage ASR to allow users to search by voice as opposed to an on-screen keyboard. However, recent or infrequent movie titles may not be sufficiently represented in the E2E ASR system's training data, and hence, may suffer poor recognition. In this paper, we propose a phonetic correction system that consists of (a) a phonetic search based on the ASR model's output that generates phonetic alternatives that may not be considered by the E2E system, and (b) a rescorer component that combines the ASR model recognition and the phonetic alternatives, and selects a final system output. We find that our approach improves word error rate by between 4.4% and 7.6% relative on benchmarks of popular movie titles over a series of competitive baselines.

[186] Let's CONFER: A Dataset for Evaluating Natural Language Inference Models on CONditional InFERence and Presupposition

Tara Azin,Daniel Dumitrescu,Diana Inkpen,Raj Singh

Main category: cs.CL

TL;DR: The study examines how well natural language inference (NLI) models handle presupposition in conditional sentences and introduces the CONFER dataset for evaluation. NLI models perform poorly on this task, and fine-tuning brings limited gains.

Details Motivation: The ability of NLI models to handle fine-grained pragmatic inference, especially presupposition in conditionals, is underexplored, so a dedicated dataset and evaluation method are needed. Method: Introduce the CONFER dataset and evaluate four NLI models (including pre-trained models) and several large language models (LLMs) under zero-shot and few-shot prompting. Result: NLI models perform poorly on presuppositional reasoning in conditionals, and fine-tuning on existing NLI datasets did not yield clear gains. Conclusion: NLI models are limited in handling presupposition in conditionals; future work needs to improve model design or training methods. Abstract: Natural Language Inference (NLI) is the task of determining whether a sentence pair represents entailment, contradiction, or a neutral relationship. While NLI models perform well on many inference tasks, their ability to handle fine-grained pragmatic inferences, particularly presupposition in conditionals, remains underexplored. In this study, we introduce CONFER, a novel dataset designed to evaluate how NLI models process inference in conditional sentences. We assess the performance of four NLI models, including two pre-trained models, to examine their generalization to conditional reasoning. Additionally, we evaluate Large Language Models (LLMs), including GPT-4o, LLaMA, Gemma, and DeepSeek-R1, in zero-shot and few-shot prompting settings to analyze their ability to infer presuppositions with and without prior context. Our findings indicate that NLI models struggle with presuppositional reasoning in conditionals, and fine-tuning on existing NLI datasets does not necessarily improve their performance.

[187] semantic-features: A User-Friendly Tool for Studying Contextual Word Embeddings in Interpretable Semantic Spaces

Jwalanthi Ranganathan,Rohan Jha,Kanishka Misra,Kyle Mahowald

Main category: cs.CL

TL;DR: Introduces semantic-features, a library for studying the contextualized word embeddings of language models, and validates experimentally that the embeddings are sensitive to semantic interpretation.

Details Motivation: Study how well contextualized word embeddings support semantic interpretation, in particular how different syntactic constructions affect meaning. Method: Use the semantic-features library with a dataset of 450 sentence pairs to test how the two dative constructions affect semantic interpretation. Result: The word embeddings of three masked language models show the expected sensitivities. Conclusion: The semantic-features tool is practically useful for research on semantic interpretation. Abstract: We introduce semantic-features, an extensible, easy-to-use library based on Chronis et al. (2023) for studying contextualized word embeddings of LMs by projecting them into interpretable spaces. We apply this tool in an experiment where we measure the contextual effect of the choice of dative construction (prepositional or double object) on the semantic interpretation of utterances (Bresnan, 2007). Specifically, we test whether "London" in "I sent London the letter." is more likely to be interpreted as an animate referent (e.g., as the name of a person) than in "I sent the letter to London." To this end, we devise a dataset of 450 sentence pairs, one in each dative construction, with recipients being ambiguous with respect to person-hood vs. place-hood. By applying semantic-features, we show that the contextualized word embeddings of three masked language models show the expected sensitivities. This leaves us optimistic about the usefulness of our tool.

[188] Does It Run and Is That Enough? Revisiting Text-to-Chart Generation with a Multi-Agent Approach

James Ford,Anthony Rios

Main category: cs.CL

TL;DR: The paper proposes a lightweight multi-agent pipeline that sharply reduces execution errors in chart-generation code, but finds that chart aesthetics, semantic fidelity, and accessibility still need work.

Details Motivation: Investigate whether the roughly 15% failure rate of LLM-generated chart code stems from model limitations or from single-prompt design. Method: Use a multi-agent pipeline (drafting, execution, repair, judgment) built only on an off-the-shelf GPT-4o-mini model. Result: Execution error rates fall to 4.5% on Text2Chart31 and 4.6% on ChartX. Conclusion: Future work should focus on chart aesthetics, semantic fidelity, and accessibility rather than execution reliability alone. Abstract: Large language models can translate natural-language chart descriptions into runnable code, yet approximately 15% of the generated scripts still fail to execute, even after supervised fine-tuning and reinforcement learning. We investigate whether this persistent error rate stems from model limitations or from reliance on a single-prompt design. To explore this, we propose a lightweight multi-agent pipeline that separates drafting, execution, repair, and judgment, using only an off-the-shelf GPT-4o-mini model. On the Text2Chart31 benchmark, our system reduces execution errors to 4.5% within three repair iterations, outperforming the strongest fine-tuned baseline by nearly 5 percentage points while requiring significantly less compute. Similar performance is observed on the ChartX benchmark, with an error rate of 4.6%, demonstrating strong generalization. Under current benchmarks, execution success appears largely solved. However, manual review reveals that 6 out of 100 sampled charts contain hallucinations, and an LLM-based accessibility audit shows that only 33.3% (Text2Chart31) and 7.2% (ChartX) of generated charts satisfy basic colorblindness guidelines. These findings suggest that future work should shift focus from execution reliability toward improving chart aesthetics, semantic fidelity, and accessibility.
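The draft, execute, repair loop described in this abstract can be sketched as below. This is a schematic under stated assumptions, not the authors' pipeline: `draft` and `repair` are hypothetical callables standing in for LLM calls, the judgment agent is omitted, and the stub functions exist only to make the example runnable. Running generated code with `exec` is unsafe outside a sandbox.

```python
import traceback
from typing import Callable, Optional, Tuple


def run_script(code: str) -> Optional[str]:
    """Execute code in a fresh namespace; return None on success or the traceback text on failure."""
    try:
        exec(compile(code, "<chart>", "exec"), {"__name__": "__chart__"})
        return None
    except Exception:
        return traceback.format_exc()


def draft_execute_repair(
    description: str,
    draft: Callable[[str], str],
    repair: Callable[[str, str], str],
    max_repairs: int = 3,
) -> Tuple[Optional[str], int]:
    """Return (working code, repairs used), or (None, max_repairs) if every attempt fails."""
    code = draft(description)
    for attempt in range(max_repairs + 1):
        error = run_script(code)
        if error is None:
            return code, attempt
        if attempt < max_repairs:
            code = repair(code, error)  # feed the traceback back to the repair agent
    return None, max_repairs


# Stub agents (assumptions): the draft contains a typo, the repair step fixes it.
def stub_draft(description: str) -> str:
    return "values = [1, 2, 3]\ntotal = sum(valus)"


def stub_repair(code: str, error: str) -> str:
    return code.replace("valus", "values")


final_code, repairs_used = draft_execute_repair("bar chart of values", stub_draft, stub_repair)
```

The cap of three repairs mirrors the "within three repair iterations" figure in the abstract; the rest of the control flow is an assumption.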

[189] Detecting Voice Phishing with Precision: Fine-Tuning Small Language Models

Ju Yong Sim,Seong Hwan Kim

Main category: cs.CL

TL;DR: Fine-tunes Llama3 with purpose-designed voice phishing (VP) evaluation criteria and the Chain-of-Thought technique, and builds an adversarial test dataset; experiments show Llama3-8B performs best among small language models and approaches a GPT-4-based detector.

Details Motivation: Address the lack of high-quality transcript data for voice phishing detection and explore the potential of small language models for VP detection. Method: Fine-tune Llama3, design VP evaluation criteria and apply the Chain-of-Thought technique, and construct an adversarial test dataset. Result: Llama3-8B performs best among small language models, with performance close to a GPT-4-based detector. Conclusion: For VP detection with small language models, incorporating expert knowledge into the prompt is more effective than the Chain-of-Thought technique. Abstract: We develop a voice phishing (VP) detector by fine-tuning Llama3, a representative open-source, small language model (LM). In the prompt, we provide carefully-designed VP evaluation criteria and apply the Chain-of-Thought (CoT) technique. To evaluate the robustness of LMs and highlight differences in their performance, we construct an adversarial test dataset that places the models under challenging conditions. Moreover, to address the lack of VP transcripts, we create transcripts by referencing existing or new types of VP techniques. We compare cases where evaluation criteria are included, the CoT technique is applied, or both are used together. In the experiment, our results show that the Llama3-8B model, fine-tuned with a dataset that includes a prompt with VP evaluation criteria, yields the best performance among small LMs and is comparable to that of a GPT-4-based VP detector. These findings indicate that incorporating human expert knowledge into the prompt is more effective than using the CoT technique for small LMs in VP detection.

[190] Building Models of Neurological Language

Henry Watkins

Main category: cs.CL

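TL;DR: The report develops and evaluates domain-specific language models for neurology, shifting from a bespoke model toward retrieval-augmented generation (RAG) and representational models; it creates neurology-specific datasets and tools and provides scripts and containers for local deployment.

Details Motivation: Adapt to rapid advances in open-source and commercial medical LLMs and develop domain-specific language models for neurology that support secure local deployment. Method: Use retrieval-augmented generation (RAG) and representational models, create neurology-specific datasets (case reports, QA sets, textbook-derived data), and develop tools for multi-word expression extraction and graph-based analysis of medical terminology. Result: Reports performance metrics and graph community analysis results, and provides scripts and Docker containers for local deployment. Conclusion: Future work may develop multimodal models based on open-source architectures such as phi-4. Abstract: This report documents the development and evaluation of domain-specific language models for neurology. Initially focused on building a bespoke model, the project adapted to rapid advances in open-source and commercial medical LLMs, shifting toward leveraging retrieval-augmented generation (RAG) and representational models for secure, local deployment. Key contributions include the creation of neurology-specific datasets (case reports, QA sets, textbook-derived data), tools for multi-word expression extraction, and graph-based analyses of medical terminology. The project also produced scripts and Docker containers for local hosting. Performance metrics and graph community results are reported, and possible future work includes multimodal models built on open-source architectures like phi-4.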

[191] PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts

Hengzhi Li,Brendon Jiang,Alexander Naehu,Regan Song,Justin Zhang,Megan Tjandrasuwita,Chanakya Ekbote,Steven-Shine Chen,Adithya Balachandran,Wei Dai,Rebecca Chang,Paul Pu Liang

Main category: cs.CL

TL;DR: PuzzleWorld is a large-scale benchmark of 667 puzzles for evaluating open-ended, creative multimodal reasoning. Existing models perform poorly, but fine-tuning on reasoning traces improves performance.

Details Motivation: Study open-ended problem solving of the kind found in scientific discovery and exploratory data analysis, which existing benchmarks do not adequately evaluate. Method: Introduce the PuzzleWorld benchmark, whose puzzles carry detailed annotations, to evaluate step-by-step reasoning and multimodal ability. Result: Existing models perform poorly (1% to 14% final-answer accuracy), while fine-tuning on reasoning traces raises stepwise accuracy from 4% to 11%. Conclusion: PuzzleWorld offers a new benchmark for developing more general, open-ended, and creative reasoning systems. Abstract: Puzzlehunts are a genre of complex, multi-step puzzles lacking well-defined problem definitions. In contrast to conventional reasoning benchmarks consisting of tasks with clear instructions, puzzlehunts require models to discover the underlying problem structure from multimodal evidence and iterative reasoning, mirroring real-world domains such as scientific discovery, exploratory data analysis, or investigative problem-solving. Despite recent progress in foundation models, their performance on such open-ended settings remains largely untested. In this paper, we introduce PuzzleWorld, a large-scale benchmark of 667 puzzlehunt-style problems designed to assess step-by-step, open-ended, and creative multimodal reasoning. Each puzzle is annotated with the final solution, detailed reasoning traces, and cognitive skill labels, enabling holistic benchmarking and fine-grained diagnostic analysis. Most state-of-the-art models achieve only 1-2% final answer accuracy, with the best model solving only 14% of puzzles and reaching 40% stepwise accuracy. To demonstrate the value of our reasoning annotations, we show that fine-tuning a small model on reasoning traces improves stepwise reasoning from 4% to 11%, while training on final answers alone degrades performance to near zero. Our error analysis reveals that current models exhibit myopic reasoning, are bottlenecked by the limitations of language-based inference, and lack sketching capabilities crucial for visual and spatial reasoning. We release PuzzleWorld at https://github.com/MIT-MI/PuzzleWorld to support future work on building more general, open-ended, and creative reasoning systems.

[192] Can Theoretical Physics Research Benefit from Language Agents?

Sirui Lu,Zhijing Jin,Terry Jingchen Zhang,Pavel Kos,J. Ignacio Cirac,Bernhard Schölkopf

Main category: cs.CL

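TL;DR: The paper discusses the potential and challenges of large language models (LLMs) in theoretical physics research, outlines possible directions for future physics-specialized LLMs, and calls for collaboration between the physics and AI communities.

Details Motivation: LLMs are advancing rapidly across many domains, but their application to theoretical physics is not yet mature; the paper explores how LLMs combined with domain knowledge and tools could accelerate physics research. Method: Analyze current LLM capabilities for physics (such as mathematical reasoning and code generation) and identify key gaps in physical intuition, constraint satisfaction, and reliable reasoning. Result: Presents a vision of future physics-specialized LLMs that can handle multimodal data, propose testable hypotheses, and design experiments. Conclusion: Realizing this vision requires solving challenges such as physical consistency and verification methods; the authors call for collaboration between the physics and AI communities to advance scientific discovery. Abstract: Large Language Models (LLMs) are rapidly advancing across diverse domains, yet their application in theoretical physics research is not yet mature. This position paper argues that LLM agents can potentially help accelerate theoretical, computational, and applied physics when properly integrated with domain knowledge and toolboxes. We analyze current LLM capabilities for physics, from mathematical reasoning to code generation, identifying critical gaps in physical intuition, constraint satisfaction, and reliable reasoning. We envision future physics-specialized LLMs that could handle multimodal data, propose testable hypotheses, and design experiments. Realizing this vision requires addressing fundamental challenges: ensuring physical consistency and developing robust verification methods. We call for collaborative efforts between physics and AI communities to help advance scientific discovery in physics.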

[193] Explaining Matters: Leveraging Definitions and Semantic Expansion for Sexism Detection

Sahrish Khan,Arshad Jhumka,Gabriele Pergola

Main category: cs.CL

TL;DR: The paper proposes two prompt-based data augmentation techniques and an ensemble strategy to address data sparsity and linguistic nuance in online sexism detection, achieving the best performance on the EDOS dataset.

Details Motivation: Online sexism detection faces data sparsity and nuanced language; existing systems generalize poorly and suffer from class imbalance. Method: Propose two data augmentation techniques, Definition-based Data Augmentation (DDA) and Contextual Semantic Expansion (CSE), together with an ensemble strategy that improves classification reliability. Result: On the EDOS dataset, macro F1 improves by 1.5 points on binary classification (Task A) and by 4.1 points on fine-grained classification (Task C). Conclusion: The proposed methods address key problems in sexism detection and significantly improve model performance. Abstract: The detection of sexism in online content remains an open problem, as harmful language disproportionately affects women and marginalized groups. While automated systems for sexism detection have been developed, they still face two key challenges: data sparsity and the nuanced nature of sexist language. Even in large, well-curated datasets like the Explainable Detection of Online Sexism (EDOS), severe class imbalance hinders model generalization. Additionally, the overlapping and ambiguous boundaries of fine-grained categories introduce substantial annotator disagreement, reflecting the difficulty of interpreting nuanced expressions of sexism. To address these challenges, we propose two prompt-based data augmentation techniques: Definition-based Data Augmentation (DDA), which leverages category-specific definitions to generate semantically-aligned synthetic examples, and Contextual Semantic Expansion (CSE), which targets systematic model errors by enriching examples with task-specific semantic features. To further improve reliability in fine-grained classification, we introduce an ensemble strategy that resolves prediction ties by aggregating complementary perspectives from multiple language models. Our experimental evaluation on the EDOS dataset demonstrates state-of-the-art performance across all tasks, with notable improvements of macro F1 by 1.5 points for binary classification (Task A) and 4.1 points for fine-grained classification (Task C).

[194] Bridging External and Parametric Knowledge: Mitigating Hallucination of LLMs with Shared-Private Semantic Synergy in Dual-Stream Knowledge

Yi Sui,Chaozhuo Li,Chen Zhang,Dawei Song,Qiuchi Li

Main category: cs.CL

TL;DR: The DSSP-RAG framework uses a mixed-attention mechanism to resolve conflicts between external knowledge and the parametric knowledge of LLMs, improving retrieval-augmented generation.

Details Motivation: Traditional RAG methods degrade when external knowledge conflicts with an LLM's parametric knowledge, and they lack a mechanism for resolving such conflicts. Method: Propose the DSSP-RAG framework, which uses mixed attention to separate shared and private semantics, and introduce unsupervised hallucination detection and an Energy Quotient (EQ) to reduce noise. Result: Experiments show that DSSP-RAG effectively resolves conflicts, strengthens the complementarity of dual-stream knowledge, and outperforms the baselines. Conclusion: DSSP-RAG improves the stability and performance of RAG and offers a solution to the knowledge conflict problem. Abstract: Retrieval-augmented generation (RAG) is a cost-effective approach to mitigate the hallucination of Large Language Models (LLMs) by incorporating the retrieved external knowledge into the generation process. However, external knowledge may conflict with the parametric knowledge of LLMs. Furthermore, current LLMs lack inherent mechanisms for resolving such knowledge conflicts, making traditional RAG methods suffer from degraded performance and stability. Thus, we propose a Dual-Stream Knowledge-Augmented Framework for Shared-Private Semantic Synergy (DSSP-RAG). Central to the framework is a novel approach that refines self-attention into a mixed-attention, distinguishing shared and private semantics for a controlled internal-external knowledge integration. To effectively facilitate DSSP in RAG, we further introduce an unsupervised hallucination detection method based on cognitive uncertainty, ensuring the necessity of introducing knowledge, and an Energy Quotient (EQ) based on attention difference matrices to reduce noise in the retrieved external knowledge. Extensive experiments on benchmark datasets show that DSSP-RAG can effectively resolve conflicts and enhance the complementarity of dual-stream knowledge, leading to superior performance over strong baselines.

[195] Cartridges: Lightweight and general-purpose long context representations via self-study

Sabri Eyuboglu,Ryan Ehrlich,Simran Arora,Neel Guha,Dylan Zinsley,Emily Liu,Will Tennien,Atri Rudra,James Zou,Azalia Mirhoseini,Christopher Re

Main category: cs.CL

TL;DR: The paper proposes Cartridges, smaller KV caches trained offline that replace expensive in-context learning (ICL) at inference time, reducing memory consumption and raising throughput.

Details Motivation: When current large language models process long contexts, KV cache memory grows with input length, which makes serving costly. Method: Propose Cartridges, KV caches trained offline, together with the self-study training recipe, which generates synthetic conversations to optimize performance. Result: Cartridges match ICL performance while using 38.6x less memory and delivering 26.4x higher throughput, and they extend the effective context length. Conclusion: Cartridges combined with self-study are an efficient, economical alternative for long-context tasks. Abstract: Large language models are often used to answer queries grounded in large text corpora (e.g. codebases, legal documents, or chat histories) by placing the entire corpus in the context window and leveraging in-context learning (ICL). Although current models support contexts of 100K-1M tokens, this setup is costly to serve because the memory consumption of the KV cache scales with input length. We explore an alternative: training a smaller KV cache offline on each corpus. At inference time, we load this trained KV cache, which we call a Cartridge, and decode a response. Critically, the cost of training a Cartridge can be amortized across all the queries referencing the same corpus. However, we find that the naive approach of training the Cartridge with next-token prediction on the corpus is not competitive with ICL. Instead, we propose self-study, a training recipe in which we generate synthetic conversations about the corpus and train the Cartridge with a context-distillation objective. We find that Cartridges trained with self-study replicate the functionality of ICL, while being significantly cheaper to serve. On challenging long-context benchmarks, Cartridges trained with self-study match ICL performance while using 38.6x less memory and enabling 26.4x higher throughput. Self-study also extends the model's effective context length (e.g. from 128k to 484k tokens on MTOB) and, surprisingly, leads to Cartridges that can be composed at inference time without retraining.
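The claim that KV cache memory scales with input length follows from the standard per-token accounting for transformer decoding, sketched below. The model shape used here (32 layers, 8 KV heads, head dimension 128, 16-bit values) is a hypothetical configuration chosen for illustration and is not taken from the paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to store keys and values for seq_len tokens of a single sequence."""
    # The factor 2 accounts for storing both a key and a value vector per head and layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len


# Hypothetical model shape (an assumption, not from the paper).
full_context = kv_cache_bytes(seq_len=131072, n_layers=32, n_kv_heads=8, head_dim=128)
gib = full_context / 2**30  # size in GiB for a 131,072-token context
```

Because the cost is linear in `seq_len`, a trained cache with fewer token slots than the corpus shrinks serving memory proportionally, which is the lever the abstract describes.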

[196] AdvSumm: Adversarial Training for Bias Mitigation in Text Summarization

Mukur Gupta,Nikhil Reddy Varimalla,Nicholas Deas,Melanie Subbiah,Kathleen McKeown

Main category: cs.CL

TL;DR: AdvSumm is an adversarial training framework that reduces bias in text summarization through improved generalization while preserving summary quality.

Details Motivation: Large language models perform well at text summarization but inherit biases from pre-training data, leading to unfair or inappropriate outputs in downstream tasks. Method: AdvSumm introduces a Perturber component that applies gradient-guided, embedding-level perturbations to make sequence-to-sequence models more robust to input variations. Result: AdvSumm effectively reduces name-nationality bias and political framing bias in summaries and outperforms standard transformers and data augmentation techniques. Conclusion: AdvSumm reduces bias without sacrificing summary quality, offering an effective approach to bias mitigation in text summarization. Abstract: Large Language Models (LLMs) have achieved impressive performance in text summarization and are increasingly deployed in real-world applications. However, these systems often inherit associative and framing biases from pre-training data, leading to inappropriate or unfair outputs in downstream tasks. In this work, we present AdvSumm (Adversarial Summarization), a domain-agnostic training framework designed to mitigate bias in text summarization through improved generalization. Inspired by adversarial robustness, AdvSumm introduces a novel Perturber component that applies gradient-guided perturbations at the embedding level of Sequence-to-Sequence models, enhancing the model's robustness to input variations. We empirically demonstrate that AdvSumm effectively reduces different types of bias in summarization, specifically name-nationality bias and political framing bias, without compromising summarization quality. Compared to standard transformers and data augmentation techniques like back-translation, AdvSumm achieves stronger bias mitigation performance across benchmark datasets.

cs.CR [Back]

[197] SoK: Are Watermarks in LLMs Ready for Deployment?

Kieu Dang,Phung Lai,NhatHai Phan,Yelong Shen,Ruoming Jin,Abdallah Khreishah,My Thai

Main category: cs.CR

TL;DR: The paper examines watermarking for large language models (LLMs), analyzing its taxonomy, effectiveness, limitations, and future directions, and finds that current watermarks have yet to reach their full potential in real-world use.

Details Motivation: Because LLMs face risks such as model stealing attacks, watermarking is seen as a key means of protecting intellectual property and preventing misuse, yet its actual progress and effectiveness remain unclear. Method: Present a taxonomy of watermarks, design an intellectual property classifier, evaluate watermarks in both attack and attack-free environments, and analyze the limitations of existing techniques. Result: Experiments show that, despite wide attention, the negative impact of watermarks on LLM utility and downstream tasks limits their practical use. Conclusion: The study stresses the need for more practical watermark solutions tailored to real LLM deployment. Abstract: Large Language Models (LLMs) have transformed natural language processing, demonstrating impressive capabilities across diverse tasks. However, deploying these models introduces critical risks related to intellectual property violations and potential misuse, particularly as adversaries can imitate these models to steal services or generate misleading outputs. We specifically focus on model stealing attacks, as they are highly relevant to proprietary LLMs and pose a serious threat to their security, revenue, and ethical deployment. While various watermarking techniques have emerged to mitigate these risks, it remains unclear how far the community and industry have progressed in developing and deploying watermarks in LLMs. To bridge this gap, we aim to develop a comprehensive systematization for watermarks in LLMs by 1) presenting a detailed taxonomy for watermarks in LLMs, 2) proposing a novel intellectual property classifier to explore the effectiveness and impacts of watermarks on LLMs under both attack and attack-free environments, 3) analyzing the limitations of existing watermarks in LLMs, and 4) discussing practical challenges and potential future directions for watermarks in LLMs. Through extensive experiments, we show that despite promising research outcomes and significant attention from leading companies and the community to deploying watermarks, these techniques have yet to reach their full potential in real-world applications due to their unfavorable impacts on the model utility of LLMs and on downstream tasks. Our findings provide an insightful understanding of watermarks in LLMs, highlighting the need for practical watermark solutions tailored to LLM deployment.

[198] Robust Anti-Backdoor Instruction Tuning in LVLMs

Yuan Xun,Siyuan Liang,Xiaojun Jia,Xinwei Liu,Xiaochun Cao

Main category: cs.CR

TL;DR: The paper proposes a lightweight defense framework for large visual language models (LVLMs) that fine-tunes only adapter modules and text embedding layers and combines two regularizations to defend against stealthy backdoor attacks.

Details Motivation: Existing backdoor defenses usually target single-modal models or rely on supervisory knowledge during training, whereas in real-world scenarios defenders cannot modify frozen visual encoders or core LLM parameters and have no prior knowledge of the attack. Method: Propose the Robust Instruction Tuning framework, which fine-tunes adapter modules and text embedding layers under Input Diversity Regularization and Anomalous Activation Regularization. Result: On Flickr30k and MSCOCO, the defense is effective against seven attacks, cutting attack success rates to nearly zero with a training cost increase of less than 15%. Conclusion: The method needs neither core weights nor attack priors and achieves efficient defense through lightweight fine-tuning, making it practical to apply. Abstract: Large visual language models (LVLMs) have demonstrated excellent instruction-following capabilities, yet remain vulnerable to stealthy backdoor attacks when finetuned using contaminated data. Existing backdoor defense techniques are usually developed for single-modal visual or language models under fully parameter-adjustable settings or rely on supervisory knowledge during training. However, in real-world scenarios, defenders cannot modify frozen visual encoders or core LLM parameters, nor possess prior knowledge of unknown trigger patterns or target responses. Motivated by the empirical finding that LVLMs readily overfit to fixed, unknown triggers, which can embed malicious associations during adapter-level tuning, we aim to design a defense that operates without access to core weights or attack priors. To this end, we introduce a lightweight, certified-agnostic defense framework, Robust Instruction Tuning, that finetunes only adapter modules and text embedding layers under instruction tuning. Our method integrates two complementary regularizations: (1) Input Diversity Regularization, which perturbs trigger components across training samples to disrupt consistent spurious cues; and (2) Anomalous Activation Regularization, which dynamically sparsifies adapter weights exhibiting abnormally sharp activations linked to backdoor patterns. These mechanisms jointly guide the model toward learning semantically grounded representations rather than memorizing superficial trigger-response mappings. Extensive experiments against seven attacks on Flickr30k and MSCOCO demonstrate that our method reduces their attack success rate to nearly zero, with an increase in training cost of less than 15%.

[199] QA-HFL: Quality-Aware Hierarchical Federated Learning for Resource-Constrained Mobile Devices with Heterogeneous Image Quality

Sajid Hussain,Muhammad Sohail,Nauman Ali Khan

Main category: cs.CR

TL;DR: QA-HFL is a quality-aware hierarchical federated learning framework that handles heterogeneous image quality on mobile devices efficiently through a quality-weighted fusion mechanism and differential privacy protection. Experiments show it significantly outperforms existing methods.

Details Motivation: Address the effect of heterogeneous image quality on federated learning performance across resource-constrained mobile devices while protecting privacy. Method: Train local models specialized for different image quality levels, aggregate their features through a quality-weighted fusion mechanism, and add differential privacy protection. Result: Reaches 92.31% accuracy on MNIST and maintains 30.77% accuracy under privacy constraints, with low-end devices contributing the most (63.5%). Conclusion: Through device-specific regularization and intelligent client selection, QA-HFL significantly improves performance while keeping communication efficient. Abstract: This paper introduces QA-HFL, a quality-aware hierarchical federated learning framework that efficiently handles heterogeneous image quality across resource-constrained mobile devices. Our approach trains specialized local models for different image quality levels and aggregates their features using a quality-weighted fusion mechanism, while incorporating differential privacy protection. Experiments on MNIST demonstrate that QA-HFL achieves 92.31% accuracy after just three federation rounds, significantly outperforming state-of-the-art methods like FedRolex (86.42%). Under strict privacy constraints, our approach maintains 30.77% accuracy with formal differential privacy guarantees. Counter-intuitively, low-end devices contributed most significantly (63.5%) to the final model despite using 100 fewer parameters than high-end counterparts. Our quality-aware approach addresses accuracy decline through device-specific regularization, adaptive weighting, intelligent client selection, and server-side knowledge distillation, while maintaining efficient communication with a 4.71% compression ratio. Statistical analysis confirms that our approach significantly outperforms baseline methods (p < 0.01) under both standard and privacy-constrained conditions.

eess.AS [Back]

[200] Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning

Yangui Fang,Jing Peng,Xu Li,Yu Xi,Chengwei Zhang,Guohui Zhong,Kai Yu

Main category: eess.AS

TL;DR: Proposes a text-only fine-tuning strategy for low-resource domain adaptation of ASR that requires no additional audio data.

Details Motivation: Address the challenge of adapting Speech LLMs to new domains, especially in low-resource settings where paired speech-text data is scarce. Method: Fine-tune using only unpaired target-domain text, with a real-time evaluation mechanism introduced to preserve speech-text alignment. Result: Performs well on the LibriSpeech, SlideSpeech, and Medical datasets, close to full audio-text fine-tuning, and generalizes effectively to new domains. Conclusion: Text-only fine-tuning shows promise for low-resource domain adaptation of ASR and avoids catastrophic forgetting. Abstract: Recent advances in automatic speech recognition (ASR) have combined speech encoders with large language models (LLMs) through projection, forming Speech LLMs with strong performance. However, adapting them to new domains remains challenging, especially in low-resource settings where paired speech-text data is scarce. We propose a text-only fine-tuning strategy for Speech LLMs using unpaired target-domain text without requiring additional audio. To preserve speech-text alignment, we introduce a real-time evaluation mechanism during fine-tuning. This enables effective domain adaptation while maintaining source-domain performance. Experiments on LibriSpeech, SlideSpeech, and Medical datasets show that our method achieves competitive recognition performance, with minimal degradation compared to full audio-text fine-tuning. It also improves generalization to new domains without catastrophic forgetting, highlighting the potential of text-only fine-tuning for low-resource domain adaptation of ASR.

[201] Audio-Aware Large Language Models as Judges for Speaking Styles

Cheng-Han Chiang,Xiaofei Wang,Chung-Ching Lin,Kevin Lin,Linjie Li,Radu Kopetz,Yao Qian,Zhendong Wang,Zhengyuan Yang,Hung-yi Lee,Lijuan Wang

Main category: eess.AS

TL;DR: The paper explores using audio-aware large language models (ALLMs) as automatic judges of the speaking styles produced by speech generation models. Comparing ALLM and human judgments shows that Gemini-2.5-pro agrees closely with humans, indicating that ALLMs can serve as effective judges. The paper also notes that current speech generation models still have room to improve in speaking style control.

Details Motivation: Use the ability of ALLMs to understand textual and non-textual information in audio to judge the speaking styles of spoken language models (SLMs), offsetting the high cost of traditional human evaluation. Method: Four SLMs complete two tasks (voice style instruction following and role-playing), and their outputs are judged by humans and by two ALLMs (GPT-4o-audio and Gemini-2.5-pro). Agreement between ALLM and human judgments is compared to validate ALLMs as judges. Result: Agreement between Gemini-2.5-pro and human judges is close to the agreement among human judges, indicating that ALLMs can serve as judges. Current SLMs, including GPT-4o-audio, still fall short in speaking style control and natural dialogue generation. Conclusion: ALLMs, especially Gemini-2.5-pro, can serve as effective judges of speech generation models, but existing SLMs need further improvement in speaking style control. Abstract: Audio-aware large language models (ALLMs) can understand the textual and non-textual information in the audio input. In this paper, we explore using ALLMs as an automatic judge to assess the speaking styles of speeches. We use ALLM judges to evaluate the speeches generated by SLMs on two tasks: voice style instruction following and role-playing. The speaking style we consider includes emotion, volume, speaking pace, word emphasis, pitch control, and non-verbal elements. We use four spoken language models (SLMs) to complete the two tasks and use humans and ALLMs to judge the SLMs' responses. We compare two ALLM judges, GPT-4o-audio and Gemini-2.5-pro, with human evaluation results and show that the agreement between Gemini and human judges is comparable to the agreement between human evaluators. These promising results show that ALLMs can be used as a judge to evaluate SLMs. Our results also reveal that current SLMs, even GPT-4o-audio, still have room for improvement in controlling the speaking style and generating natural dialogues.

[202] CO-VADA: A Confidence-Oriented Voice Augmentation Debiasing Approach for Fair Speech Emotion Recognition

Yun-Shao Tsai,Yi-Cheng Lin,Huang-Cheng Chou,Hung-yi Lee

Main category: eess.AS

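TL;DR: CO-VADA is a debiasing method that requires neither changes to the model architecture nor demographic information; it uses voice conversion to generate diverse training samples and thereby reduce bias in speech emotion recognition systems.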

Details Motivation: Speech emotion recognition (SER) systems often become biased because of spurious correlations between speaker characteristics and emotion labels; most existing debiasing methods require model modifications or demographic information, which limits their practicality. Method: CO-VADA identifies training samples that reflect bias and applies voice conversion to generate diverse samples, guiding the model toward emotion-relevant features. Result: CO-VADA is compatible with various SER models and voice conversion tools, making it scalable and practical. Conclusion: CO-VADA offers an effective and general solution for improving the fairness of SER systems. Abstract: Bias in speech emotion recognition (SER) systems often stems from spurious correlations between speaker characteristics and emotional labels, leading to unfair predictions across demographic groups. Many existing debiasing methods require model-specific changes or demographic annotations, limiting their practical use. We present CO-VADA, a Confidence-Oriented Voice Augmentation Debiasing Approach that mitigates bias without modifying model architecture or relying on demographic information. CO-VADA identifies training samples that reflect bias patterns present in the training data and then applies voice conversion to alter irrelevant attributes and generate samples. These augmented samples introduce speaker variations that differ from dominant patterns in the data, guiding the model to focus more on emotion-relevant features. Our framework is compatible with various SER models and voice conversion tools, making it a scalable and practical solution for improving fairness in SER systems.

cs.RO [Back]

[203] Object Navigation with Structure-Semantic Reasoning-Based Multi-level Map and Multimodal Decision-Making LLM

Chongshang Yan,Jiaxuan He,Delun Li,Yi Yang,Wenjie Song

Main category: cs.RO

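TL;DR: The paper proposes an active object navigation framework that combines an Environmental Attributes Map (EAM) with an MLLM Hierarchical Reasoning module (MHR), significantly improving the success rate and efficiency of zero-shot object navigation.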

Details Motivation: Zero-shot object navigation degrades significantly in unknown open-ended environments, mainly because high-dimensional implicit scene information and the long-range nature of the target search are neglected. Method: The EAM is built by reasoning about observed environments with SBERT and predicting unobserved regions with a diffusion model; the MHR makes frontier exploration decisions based on the EAM, avoiding circuitous trajectories in long-range scenarios. Result: The EAM achieves 64.5% scene mapping accuracy on the MP3D dataset, and the reported SPL improvements on the HM3D and MP3D benchmarks are 21.4% and 46.0%, respectively. Conclusion: The proposed EAM and MHR modules address the performance problems of zero-shot object navigation and significantly improve navigation efficiency and success rate. Abstract: Zero-shot object navigation (ZSON) in unknown open-ended environments with semantically novel targets often suffers a significant decline in performance due to the neglect of high-dimensional implicit scene information and the long-range target searching task. To address this, we propose an active object navigation framework with Environmental Attributes Map (EAM) and MLLM Hierarchical Reasoning module (MHR) to improve its success rate and efficiency. EAM is constructed by reasoning observed environments with SBERT and predicting unobserved ones with Diffusion, utilizing human space regularities that underlie object-room correlations and area adjacencies. MHR is inspired by EAM to perform frontier exploration decision-making, avoiding the circuitous trajectories in long-range scenarios to improve path efficiency. Experimental results demonstrate that the EAM module achieves 64.5% scene mapping accuracy on the MP3D dataset, while the navigation task attains SPLs of 28.4% and 26.3% on HM3D and MP3D benchmarks respectively, representing absolute improvements of 21.4% and 46.0% over baseline methods.
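
The SPL figures above refer to Success weighted by Path Length, the standard efficiency metric in embodied navigation benchmarks. As a minimal sketch of how that metric is usually computed (the function name and the episode tuple layout are illustrative assumptions, not taken from the paper), each successful episode contributes the ratio of the shortest path length to the length of the path the agent actually took, and failed episodes contribute zero:

```python
def spl(episodes):
    """Success weighted by Path Length.

    `episodes` is a list of (success, shortest_len, taken_len) tuples.
    This layout is a hypothetical convention chosen for illustration.
    """
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for success, shortest_len, taken_len in episodes:
        if shortest_len <= 0:
            raise ValueError("shortest path length must be positive")
        if success:
            # An agent cannot beat the shortest path, so the ratio is capped at 1.
            total += shortest_len / max(taken_len, shortest_len)
    return total / len(episodes)


if __name__ == "__main__":
    runs = [(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)]
    print(spl(runs))
```

A detour halves an episode's contribution when the agent walks twice the optimal distance, which is why avoiding circuitous trajectories raises SPL even when the success rate is unchanged.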

[204] 3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model

Hongyan Zhi,Peihao Chen,Siyuan Zhou,Yubo Dong,Quanxi Wu,Lei Han,Mingkui Tan

Main category: cs.RO

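TL;DR: The paper proposes a 3D-flow-based world model that learns to predict object motion from human and robot data in order to guide robot manipulation tasks.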

Details Motivation: Robot manipulation lacks a large, uniform dataset, whereas humans complete tasks by understanding how objects should move in 3D space. The authors therefore aim to learn a 3D flow world model that guides manipulation across embodiments. Method: Synthesize a large-scale 3D optical flow dataset, ManiFlow-110k; use a video diffusion model to learn manipulation physics and generate 3D optical flow trajectories conditioned on language instructions; propose a flow-guided rendering mechanism that uses GPT-4o to assess the predictions; and finally derive robot actions through an optimization policy. Result: Experiments show strong generalization across diverse robotic manipulation tasks and cross-embodiment adaptation without hardware-specific training. Conclusion: The 3D flow model offers a unified, cross-embodiment solution for robot manipulation that significantly improves generalization and adaptability. Abstract: Manipulation has long been a challenging task for robots, while humans can effortlessly perform complex interactions with objects, such as hanging a cup on the mug rack. A key reason is the lack of a large and uniform dataset for teaching robots manipulation skills. Current robot datasets often record robot action in different action spaces within a simple scene. This hinders robots from learning a unified and robust action representation for different robots within diverse scenes. Observing how humans understand a manipulation task, we find that understanding how the objects should move in the 3D space is a critical clue for guiding actions. This clue is embodiment-agnostic and suitable for both humans and different robots. Motivated by this, we aim to learn a 3D flow world model from both human and robot manipulation data. This model predicts the future movement of the interacting objects in 3D space, guiding action planning for manipulation. Specifically, we synthesize a large-scale 3D optical flow dataset, named ManiFlow-110k, through a moving object auto-detect pipeline. A video diffusion-based world model then learns manipulation physics from these data, generating 3D optical flow trajectories conditioned on language instructions. With the generated 3D object optical flow, we propose a flow-guided rendering mechanism, which renders the predicted final state and leverages GPT-4o to assess whether the predicted flow aligns with the task description. This equips the robot with a closed-loop planning ability. Finally, we consider the predicted 3D optical flow as constraints for an optimization policy to determine a chunk of robot actions for manipulation. Extensive experiments demonstrate strong generalization across diverse robotic manipulation tasks and reliable cross-embodiment adaptation without hardware-specific training.

cs.HC [Back]

[205] QualitEye: Public and Privacy-preserving Gaze Data Quality Verification

Mayar Elfares,Pascal Reisert,Ralf Küsters,Andreas Bulling

Main category: cs.HC

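TL;DR: QualitEye is a new method for verifying the quality of image-based gaze data; using a semantic representation and privacy-preserving protocols, it performs well in both public and privacy-preserving settings.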

Details Motivation: As gaze datasets grow, data quality and privacy become challenges, calling for a method that can verify quality while protecting privacy. Method: QualitEye uses a new semantic representation combined with privacy-preserving protocols (adapted private set intersection) and is evaluated on the MPIIFaceGaze and GazeCapture datasets. Result: It achieves high verification performance in both the public and privacy-preserving settings, with only a small runtime overhead for the privacy-preserving versions. Conclusion: QualitEye opens new directions for gaze analysis at the intersection of machine learning, human-computer interaction, and cryptography. Abstract: Gaze-based applications are increasingly advancing with the availability of large datasets but ensuring data quality presents a substantial challenge when collecting data at scale. It further requires different parties to collaborate; therefore, privacy concerns arise. We propose QualitEye, the first method for verifying image-based gaze data quality. QualitEye employs a new semantic representation of eye images that contains the information required for verification while excluding irrelevant information for better domain adaptation. QualitEye covers a public setting where parties can freely exchange data and a privacy-preserving setting where parties cannot reveal their raw data nor derive gaze features/labels of others with adapted private set intersection protocols. We evaluate QualitEye on the MPIIFaceGaze and GazeCapture datasets and achieve a high verification performance (with a small overhead in runtime for privacy-preserving versions). Hence, QualitEye paves the way for new gaze analysis methods at the intersection of machine learning, human-computer interaction, and cryptography.

[206] WoundAIssist: A Patient-Centered Mobile App for AI-Assisted Wound Care With Physicians in the Loop

Vanessa Borst,Anna Riedmann,Tassilo Dege,Konstantin Müller,Astrid Schmieder,Birgit Lugrin,Samuel Kounev

Main category: cs.HC

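TL;DR: WoundAIssist is an AI-driven mobile app for remote chronic wound care. Patients document wounds through photos and questionnaires, and a deep learning model segments and monitors the wounds, improving user experience and care efficiency.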

Details Motivation: Chronic wounds are increasingly common in aging populations, and traditional care is resource-intensive and inefficient, so remote solutions are needed. Method: The authors developed the WoundAIssist app, which integrates a lightweight deep learning model, lets patients document wounds themselves, and supports remote monitoring and video consultations by physicians. Result: The usability study reports excellent usability, good app quality, and favorable perceptions of the AI-driven wound recognition. Conclusion: WoundAIssist bridges the remote-care gap between patients and healthcare professionals and provides design guidance for similar digital health tools. Abstract: The rising prevalence of chronic wounds, especially in aging populations, presents a significant healthcare challenge due to prolonged hospitalizations, elevated costs, and reduced patient quality of life. Traditional wound care is resource-intensive, requiring frequent in-person visits that strain both patients and healthcare professionals (HCPs). Therefore, we present WoundAIssist, a patient-centered, AI-driven mobile application designed to support telemedical wound care. WoundAIssist enables patients to regularly document wounds at home via photographs and questionnaires, while physicians remain actively engaged in the care process through remote monitoring and video consultations. A distinguishing feature is an integrated lightweight deep learning model for on-device wound segmentation, which, combined with patient-reported data, enables continuous monitoring of wound healing progression. Developed through an iterative, user-centered process involving both patients and domain experts, WoundAIssist prioritizes a user-friendly design, particularly for elderly patients. A conclusive usability study with patients and dermatologists reported excellent usability, good app quality, and favorable perceptions of the AI-driven wound recognition. Our main contribution is two-fold: (I) the implementation and (II) evaluation of WoundAIssist, an easy-to-use yet comprehensive telehealth solution designed to bridge the gap between patients and HCPs. Additionally, we synthesize design insights for remote patient monitoring apps, derived from over three years of interdisciplinary research, that may inform the development of similar digital health tools across clinical domains.

eess.IV [Back]

[207] Enhancing Neural Autoregressive Distribution Estimators for Image Reconstruction

Ambrose Emmett-Iwaniw,Nathan Kirk

Main category: eess.IV

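TL;DR: The paper studies predicting the unobserved parts of an image from an observed subset of pixels, proposes an improved ConvNADE model, and examines how random versus low-discrepancy pixel patches affect reconstruction quality.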

Details Motivation: To study how the unobserved parts of an image can be predicted from a small number of observed pixels, improving the performance and efficiency of autoregressive models. Method: Proposes an improved ConvNADE model suited to real-valued and color images, and compares the effect of random and low-discrepancy pixel patches on reconstruction. Result: Experiments show that choosing pixels according to a low-discrepancy sequence reduces test loss and produces more realistic images. Conclusion: A low-discrepancy pixel patch selection strategy improves image reconstruction quality and model performance. Abstract: Autoregressive models are often employed to learn distributions of image data by decomposing the $D$-dimensional density function into a product of one-dimensional conditional distributions. Each conditional depends on preceding variables (pixels, in the case of image data), making the order in which variables are processed fundamental to the model performance. In this paper, we study the problem of observing a small subset of image pixels (referred to as a pixel patch) to predict the unobserved parts of the image. As our prediction mechanism, we propose a generalized and computationally efficient version of the convolutional neural autoregressive distribution estimator (ConvNADE) model adapted for real-valued and color images. Moreover, we investigate the quality of image reconstruction when observing both random pixel patches and low-discrepancy pixel patches inspired by quasi-Monte Carlo theory. Experiments on benchmark datasets demonstrate that choosing the pixels akin to a low-discrepancy sequence reduces test loss and produces more realistic reconstructed images.
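
To make the contrast between random and low-discrepancy pixel patches concrete, the sketch below selects observed pixel coordinates in two ways: uniformly at random, and from a 2D Halton sequence (bases 2 and 3), a common quasi-Monte Carlo construction. The abstract does not say which low-discrepancy sequence the authors use, so the Halton choice and all function names here are assumptions for illustration only.

```python
import random


def radical_inverse(index, base):
    """Van der Corput radical inverse of a positive integer in the given base."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        result += (index % base) * fraction
        index //= base
        fraction /= base
    return result


def halton_pixels(count, height, width):
    """First `count` distinct (row, col) pixels visited by a 2D Halton sequence."""
    if count > height * width:
        raise ValueError("cannot select more pixels than the image contains")
    pixels, seen, i = [], set(), 1
    while len(pixels) < count:
        pixel = (int(radical_inverse(i, 2) * height),
                 int(radical_inverse(i, 3) * width))
        if pixel not in seen:  # skip repeats caused by rounding to the grid
            seen.add(pixel)
            pixels.append(pixel)
        i += 1
    return pixels


def random_pixels(count, height, width, seed=0):
    """Baseline: `count` distinct pixels drawn uniformly at random."""
    grid = [(r, c) for r in range(height) for c in range(width)]
    return random.Random(seed).sample(grid, count)


if __name__ == "__main__":
    print(halton_pixels(4, 8, 8))
    print(random_pixels(4, 8, 8))
```

Halton points fill the image more evenly than independent random draws, which tend to cluster and leave gaps; that even coverage is the property the low-discrepancy patches are meant to exploit.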

[208] Deep histological synthesis from mass spectrometry imaging for multimodal registration

Kimberley M. Bird,Xujiong Ye,Alan M. Race,James M. Brown

Main category: eess.IV

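TL;DR: Proposes a pix2pix-based method that synthesizes histological images from mass spectrometry imaging (MSI) to enable unimodal registration; preliminary results show good synthetic image quality.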

Details Motivation: Histology and MSI differ greatly in image formation and dimensionality, which makes registration difficult and calls for an effective solution. Method: Use a pix2pix model to synthesize histological images from MSI, enabling unimodal registration. Result: The synthetic images are of good quality; compared with a baseline U-Net model, mutual information (MI) and the structural similarity index measure (SSIM) increase by +0.924 and +0.419, respectively. Conclusion: The method shows potential for histology-MSI registration, and the code is open source. Abstract: Registration of histological and mass spectrometry imaging (MSI) allows for more precise identification of structural changes and chemical interactions in tissue. With histology and MSI having entirely different image formation processes and dimensionalities, registration of the two modalities remains an ongoing challenge. This work proposes a solution that synthesises histological images from MSI, using a pix2pix model, to effectively enable unimodal registration. Preliminary results show promising synthetic histology images with limited artifacts, achieving increases in mutual information (MI) and structural similarity index measures (SSIM) of +0.924 and +0.419, respectively, compared to a baseline U-Net model. Our source code is available on GitHub: https://github.com/kimberley/MIUA2025.

[209] FPDANet: A Multi-Section Classification Model for Intelligent Screening of Fetal Ultrasound

Minglang Chen,Jie He,Caixu Xu,Bocheng Liang,Shengli Li,Guannan He,Xiongjie Tao

Main category: eess.IV

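TL;DR: FPDANet is a bilateral multi-scale information fusion network that uses a positional attention mechanism (DAN) and a multi-scale information fusion module (FPAN) to address ResNet's limitations in fetal ultrasound image classification.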

Details Motivation: ResNet performs poorly on fetal ultrasound image classification because it transfers features unidirectionally and does not correlate contextual information; such images also suffer from low contrast, high similarity, and high noise. Method: Design a DAN module that establishes dependencies between features at different spatial positions, and an FPAN module that captures contextual and global feature dependencies at different scales. Result: FPDANet reaches 91.05% Top-1 and 100% Top-5 classification accuracy. Conclusion: FPDANet is effective and robust on fetal ultrasound image classification. Abstract: ResNet has been widely used in image classification tasks due to its ability to model the residual dependence of constant mappings for linear computation. However, ResNet transfers features unidirectionally and lacks an effective way to correlate contextual information, which makes it ineffective at classifying fetal ultrasound images; these images also suffer from low contrast, high similarity, and high noise. Therefore, we propose FPDANet, a bilateral multi-scale information fusion network, to address the above challenges. Specifically, we design the positional attention mechanism (DAN) module, which utilizes the similarity of features to establish the dependency of different spatial positional features and enhance the feature representation. In addition, we design a bilateral multi-scale (FPAN) information fusion module to capture contextual and global feature dependencies at different feature scales, thereby further improving the model representation. FPDANet achieves 91.05% and 100% on the Top-1 and Top-5 metrics, respectively, and the experimental results demonstrate its effectiveness and robustness.

[210] LinGuinE: Longitudinal Guidance Estimation for Volumetric Lung Tumour Segmentation

Nadine Garibli,Mayank Patwari,Bence Csiba,Yi Wei,Kostas Sidiropoulos

Main category: eess.IV

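TL;DR: LinGuinE is a method for automatically segmenting lung tumours across longitudinal CT scans; starting from an initial input and propagating points between time points, it markedly improves segmentation accuracy.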

Details Motivation: Few automated or semi-automated methods exist for longitudinal tumour segmentation, even though it matters for radiotherapy, surgery, and chemotherapy assessment. Method: LinGuinE propagates initial tumour points via rigid registration, uses a click validity classifier to keep the points that remain valid, and generates a segmentation at the new time point. Result: On two test datasets, LinGuinE improves the Dice score by over 20% (p < 0.05). Conclusion: LinGuinE performs well on longitudinal tumour segmentation, and because any time point can serve as the starting point, it is suitable for clinical practice. Abstract: Segmentation of lung gross tumour volumes is an important first step in radiotherapy and surgical intervention, and is starting to play a role in assessing chemotherapy response. Response to a drug is measured by tracking the tumour volumes over a series of CT scans over a time period i.e. a longitudinal study. However, there currently exist few solutions for automated or semi-automated longitudinal tumour segmentation. This paper introduces LinGuinE, an automated method to segment a longitudinal series of lung tumours. A radiologist must provide an initial input, indicating the location of the tumour in a CT scan at an arbitrary time point. LinGuinE samples points inside this tumour and propagates them to another time point using rigid registration. A click validity classifier selects points which still fall within the tumour; these are used to automatically create a segmentation in the new time point. We test LinGuinE on a dataset acquired from a phase 3 clinical trial for lung tumours and the publicly available 4-D lung CBCT dataset. We find that LinGuinE improves the Dice on both test sets by over 20% (p < 0.05) across 63 longitudinal studies. We show that any time point can be used as a starting point, conduct ablation experiments, and find that our LinGuinE setup yields the best results on both test datasets.
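
The Dice score reported above measures the overlap between a predicted segmentation and the reference mask. A minimal sketch of the standard definition follows; it operates on flattened binary masks so that it needs no external libraries, and treating two empty masks as a perfect match is a convention assumed here rather than something the abstract specifies.

```python
def dice(pred, truth):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two flat binary masks."""
    pred = [bool(v) for v in pred]
    truth = [bool(v) for v in truth]
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    overlap = sum(p and t for p, t in zip(pred, truth))
    size = sum(pred) + sum(truth)
    if size == 0:
        return 1.0  # assumed convention: both masks empty counts as agreement
    return 2.0 * overlap / size


if __name__ == "__main__":
    print(dice([1, 1, 0, 0], [1, 0, 1, 0]))
```

For a 3D CT volume, the same function applies after flattening the voxel grid into a single sequence.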

[211] DermaCon-IN: A Multi-concept Annotated Dermatological Image Dataset of Indian Skin Disorders for Clinical AI Research

Shanawaj S Madarkar,Mahajabeen Madarkar,Madhumitha V,Teli Prakash,Konda Reddy Mopuri,Vinaykumar MV,KVL Sathwika,Adarsh Kasturi,Gandla Dilip Raj,PVN Supranitha,Harsh Udai

Main category: eess.IV

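TL;DR: DermaCon-IN is a prospectively curated dermatology dataset from outpatient clinics in South India, with over 5,450 clinical images covering more than 240 diagnoses; it aims to address gaps in existing datasets and advance fair, robust dermatology AI models.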

Details Motivation: Existing dermatology datasets fail to capture real-world clinical and demographic complexity, and non-Western populations in particular are underrepresented, which limits the fairness and usefulness of AI models. Method: The dataset contains over 5,450 images annotated by board-certified dermatologists under a taxonomy adapted from Rook's classification. Several architectures (convolutional models, Transformer models, and Concept Bottleneck Models) are benchmarked. Result: The study provides baseline performance for dermatology AI models and explores how anatomical and concept-level cues can be integrated. Conclusion: DermaCon-IN provides a representative foundation for developing interpretable, clinically realistic dermatology AI models. Abstract: Artificial intelligence is poised to augment dermatological care by enabling scalable image-based diagnostics. Yet, the development of robust and equitable models remains hindered by datasets that fail to capture the clinical and demographic complexity of real-world practice. This complexity stems from region-specific disease distributions, wide variation in skin tones, and the underrepresentation of outpatient scenarios from non-Western populations. We introduce DermaCon-IN, a prospectively curated dermatology dataset comprising over 5,450 clinical images from approximately 3,000 patients across outpatient clinics in South India. Each image is annotated by board-certified dermatologists with over 240 distinct diagnoses, structured under a hierarchical, etiology-based taxonomy adapted from Rook's classification. The dataset captures a wide spectrum of dermatologic conditions and tonal variation commonly seen in Indian outpatient care. We benchmark a range of architectures including convolutional models (ResNet, DenseNet, EfficientNet), transformer-based models (ViT, MaxViT, Swin), and Concept Bottleneck Models to establish baseline performance and explore how anatomical and concept-level cues may be integrated. These results are intended to guide future efforts toward interpretable and clinically realistic models. DermaCon-IN provides a scalable and representative foundation for advancing dermatology AI in real-world settings.

cs.SE [Back]

[212] Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety

Seongmin Lee,Aeree Cho,Grace C. Kim,ShengYun Peng,Mansi Phute,Duen Horng Chau

Main category: cs.SE

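TL;DR: This is the first survey to connect interpretation techniques with large language model (LLM) safety; it proposes a unified framework, summarizes nearly 70 works, and discusses open challenges.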

Details Motivation: As LLMs are widely deployed, understanding and mitigating their unsafe behaviors is critical, yet prior surveys often overlook the connection between interpretation techniques and safety. Method: Proposes a unified framework that connects safety-focused interpretation methods, the safety enhancements they inform, and the tools that operationalize them, organized by LLM workflow stage. Result: Summarizes nearly 70 works in a new taxonomy that helps researchers and practitioners understand key advances. Conclusion: The survey fills a gap in the literature, offers a guide toward safer and more interpretable LLMs, and points out future directions. Abstract: As large language models (LLMs) see wider real-world use, understanding and mitigating their unsafe behaviors is critical. Interpretation techniques can reveal causes of unsafe outputs and guide safety, but such connections with safety are often overlooked in prior surveys. We present the first survey that bridges this gap, introducing a unified framework that connects safety-focused interpretation methods, the safety enhancements they inform, and the tools that operationalize them. Our novel taxonomy, organized by LLM workflow stages, summarizes nearly 70 works at their intersections. We conclude with open challenges and future directions. This timely survey helps researchers and practitioners navigate key advancements for safer, more interpretable LLMs.

[213] Deployability-Centric Infrastructure-as-Code Generation: An LLM-based Iterative Framework

Tianyi Zhang,Shidong Pan,Zejun Zhang,Zhenchang Xing,Xiaoyu Sun

Main category: cs.SE

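TL;DR: The paper proposes the IaCGen framework and the DPIaC-Eval benchmark. An iterative feedback mechanism improves the deployability of IaC templates and greatly raises the deployment success rate of LLM-generated templates, but user intent alignment and security still need improvement.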

Details Motivation: Current evaluation of IaC generation focuses only on syntactic correctness and ignores deployability, a key measure of template utility; the paper aims to fill this gap. Method: Proposes IaCGen, an LLM-based framework with an iterative feedback mechanism, and DPIaC-Eval, a benchmark of 153 real-world scenarios. Result: LLMs initially perform poorly (roughly 27-30% first-attempt deployment success), but IaCGen raises all evaluated models above 90% passItr@25; user intent alignment and security compliance remain low. Conclusion: The paper provides the first comprehensive assessment of deployability-centric IaC generation and lays a foundation for future research. Abstract: Infrastructure-as-Code (IaC) generation holds significant promise for automating cloud infrastructure provisioning. Recent advances in Large Language Models (LLMs) present a promising opportunity to democratize IaC development by generating deployable infrastructure templates from natural language descriptions, but current evaluation focuses on syntactic correctness while ignoring deployability, the key measure of IaC template utility. We address this gap through two contributions: (1) IaCGen, an LLM-based deployability-centric framework that uses an iterative feedback mechanism to generate IaC templates, and (2) DPIaC-Eval, a deployability-centric IaC template benchmark consisting of 153 real-world scenarios that can evaluate syntax, deployment, user intent, and security. Our evaluation reveals that state-of-the-art LLMs initially performed poorly, with Claude-3.5 and Claude-3.7 achieving only 30.2% and 26.8% deployment success on the first attempt respectively. However, IaCGen transforms this performance dramatically: all evaluated models reach over 90% passItr@25, with Claude-3.5 and Claude-3.7 achieving 98% success rate. Despite these improvements, critical challenges remain in user intent alignment (25.2% accuracy) and security compliance (8.4% pass rate), highlighting areas requiring continued research. Our work provides the first comprehensive assessment of deployability-centric IaC template generation and establishes a foundation for future research.
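
The iterative feedback idea can be pictured as a generate-validate-retry loop with a fixed iteration budget. The sketch below is a generic illustration of that pattern, not IaCGen's actual implementation: `generate` and `validate` are hypothetical callables standing in for an LLM call and a deployment check, and the budget of 25 is taken from the passItr@25 metric named in the abstract on the assumption that it counts iterations.

```python
def iterative_generate(prompt, generate, validate, max_iterations=25):
    """Retry generation, feeding each validation error back into the next attempt.

    generate(prompt, feedback) -> template string   (hypothetical LLM wrapper)
    validate(template) -> (ok, message)             (hypothetical deployment check)
    Returns (template, attempts_used), or (None, max_iterations) on failure.
    """
    feedback = None
    for attempt in range(1, max_iterations + 1):
        template = generate(prompt, feedback)
        ok, message = validate(template)
        if ok:
            return template, attempt
        feedback = message  # the error text guides the next attempt
    return None, max_iterations


if __name__ == "__main__":
    # Stub components for demonstration only: the first draft fails,
    # and any draft produced after feedback passes.
    def fake_generate(prompt, feedback):
        return "fixed-template" if feedback else "broken-template"

    def fake_validate(template):
        ok = template == "fixed-template"
        return ok, "" if ok else "missing required property"

    print(iterative_generate("create a storage bucket", fake_generate, fake_validate))
```

Under this reading, a first-attempt success rate corresponds to the loop returning with one attempt used, while a pass-within-budget rate counts every scenario for which the loop returns a template at all.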

[214] CodeContests+: High-Quality Test Case Generation for Competitive Programming

Zihan Wang,Siyao Liu,Yang Sun,Hongyan Li,Kai Shen

Main category: cs.SE

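TL;DR: This paper presents an LLM-based agent system for generating high-quality test cases for competitive programming and builds CodeContests+, an improved version of the dataset, which significantly improves evaluation accuracy and the effectiveness of reinforcement learning.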

Details Motivation: Competitive programming, with its high reasoning difficulty and precise feedback, is a key task for evaluating LLM reasoning, but test cases are hard to obtain; generating high-quality test cases is a necessary step toward large-scale datasets. Method: An LLM-based agent system generates test cases and is applied to the CodeContests dataset to build the improved CodeContests+; its quality is evaluated with 1.72 million labeled submissions and with reinforcement learning experiments. Result: CodeContests+ significantly improves evaluation accuracy, particularly the True Positive Rate, and better test cases bring clear benefits for reinforcement learning. Conclusion: LLM-generated test cases improve the evaluation quality of competitive programming datasets and provide considerable advantages for reinforcement learning. Abstract: Competitive programming, due to its high reasoning difficulty and precise correctness feedback, has become a key task for both training and evaluating the reasoning capabilities of large language models (LLMs). However, while a large amount of public problem data, such as problem statements and solutions, is available, the test cases of these problems are often difficult to obtain. Therefore, test case generation is a necessary task for building large-scale datasets, and the quality of the test cases directly determines the accuracy of the evaluation. In this paper, we introduce an LLM-based agent system that creates high-quality test cases for competitive programming problems. We apply this system to the CodeContests dataset and propose a new version with improved test cases, named CodeContests+. We evaluated the quality of test cases in CodeContests+. First, we used 1.72 million submissions with pass/fail labels to examine the accuracy of these test cases in evaluation. The results indicated that CodeContests+ achieves significantly higher accuracy than CodeContests, particularly with a notably higher True Positive Rate (TPR). Subsequently, our experiments in LLM Reinforcement Learning (RL) further confirmed that improvements in test case quality yield considerable advantages for RL.

q-bio.NC [Back]

[215] Noninvasive precision modulation of high-level neural population activity via natural vision perturbations

Guy Gaziv,Sarah Goulding,Ani Ayvazian-Hancock,Yoon Bai,James J. DiCarlo

Main category: q-bio.NC

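TL;DR: The study examines whether perturbations of natural visual input can noninvasively and precisely modulate neural activity in the high-level primate ventral visual stream, and shows in macaque experiments that model predictions agree quantitatively with the biologically realized effects.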

Details Motivation: Precise control of neural activity is a major challenge in neuroscience, and traditional methods are usually invasive. The study explores a noninvasive approach based on perturbing visual input. Method: Perturbations of natural visual input are used to modulate macaque inferior temporal (IT) neural populations, and model predictions are compared with the measured neural activity. Result: Model predictions agree with the biologically realized effects; target neural sites can be modulated precisely, and experimenter-chosen neural patterns can be injected via subtle visual perturbations. Conclusion: Current machine-executable models of the ventral stream can design noninvasive, visually delivered neural interventions at the resolution of individual neurons. Abstract: Precise control of neural activity (modulating target neurons deep in the brain while leaving nearby neurons unaffected) is an outstanding challenge in neuroscience, generally achieved through invasive techniques. This study investigates the possibility of precisely and noninvasively modulating neural activity in the high-level primate ventral visual stream via perturbations on one's natural visual feed. When tested on macaque inferior temporal (IT) neural populations, we found quantitative agreement between the model-predicted and biologically realized effect: strong modulation concentrated on targeted neural sites. We extended this to demonstrate accurate injection of experimenter-chosen neural population patterns via subtle perturbations applied on the background of typical natural visual feeds. These results highlight that current machine-executable models of the ventral stream can now design noninvasive, visually-delivered, possibly imperceptible neural interventions at the resolution of individual neurons.

cs.NE [Back]

[216] Integer Binary-Range Alignment Neuron for Spiking Neural Networks

Binghao Ye,Wenjuan Li,Dong Wang,Man Yao,Bing Li,Weiming Hu,Dong Liang,Kun Shang

Main category: cs.NE

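TL;DR: Proposes a new spiking neuron, Integer Binary-Range Alignment Leaky Integrate-and-Fire, which uses Integer Binary Leaky Integrate-and-Fire and a range alignment strategy to substantially increase the representational capacity of spiking neural networks at only a slight energy cost.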

Details Motivation: Spiking Neural Networks (SNNs) attract attention for their brain-like computation and energy efficiency, but they lag behind Artificial Neural Networks (ANNs) on tasks such as image classification and object detection, mainly because of limited representational capacity. Method: Combines Integer Binary Leaky Integrate-and-Fire, which allows integer-valued activations during training, with a range alignment strategy, which addresses the failure of neurons to activate high integer values. Result: The method reaches 74.19% accuracy on ImageNet and 66.2% mAP@50 and 49.1% mAP@50:95 on COCO, surpassing the previous best SNNs, with a 6.3× improvement in energy efficiency. Conclusion: The method markedly improves SNN performance, matching or exceeding ANNs with the same architecture while remaining energy-efficient. Abstract: Spiking Neural Networks (SNNs) are noted for their brain-like computation and energy efficiency, but their performance lags behind Artificial Neural Networks (ANNs) in tasks like image classification and object detection due to the limited representational capacity. To address this, we propose a novel spiking neuron, Integer Binary-Range Alignment Leaky Integrate-and-Fire, to exponentially expand the information expression capacity of spiking neurons with only a slight energy increase. This is achieved through Integer Binary Leaky Integrate-and-Fire and a range alignment strategy. The Integer Binary Leaky Integrate-and-Fire allows integer value activation during training and maintains spike-driven dynamics with binary conversion that expands virtual timesteps during inference. The range alignment strategy is designed to solve the spike activation limitation problem where neurons fail to activate high integer values. Experiments show our method outperforms previous SNNs, achieving 74.19% accuracy on ImageNet and 66.2% mAP@50 and 49.1% mAP@50:95 on COCO, surpassing previous bests with the same architecture by +3.45% and +1.6% and +1.8%, respectively. Notably, our SNNs match or exceed ANNs' performance with the same architecture, and the energy efficiency is improved by 6.3×.

cs.SD [Back]

[217] NAT: Neural Acoustic Transfer for Interactive Scenes in Real Time

Xutong Jin,Bo Pang,Chenxi Xu,Xinyun Hou,Guoping Wang,Sheng Li

Main category: cs.SD

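TL;DR: Proposes Neural Acoustic Transfer, a new method based on an implicit neural representation for real-time prediction of sound fields in dynamically changing environments, addressing the inefficiency of traditional methods in complex scenes.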

Details Motivation: Traditional acoustic transfer methods depend on large amounts of precomputed data and struggle with dynamically changing environments (changes in object position, material, and size), which makes real-time interaction and auditory feedback inefficient. Method: An implicit neural representation encodes precomputed acoustic transfer and its variations; training data come from a fast Monte-Carlo-based boundary element method (BEM) approximation and a GPU-accelerated standard BEM, enabling real-time sound field prediction. Result: The method shows strong numerical accuracy and runtime efficiency (within several milliseconds for 30 s of audio) and suits acoustic modeling in dynamic environments. Conclusion: Neural Acoustic Transfer provides efficient, accurate acoustic modeling for interactive applications such as virtual and augmented reality. Abstract: Previous acoustic transfer methods rely on extensive precomputation and storage of data to enable real-time interaction and auditory feedback. However, these methods struggle with complex scenes, especially when dynamic changes in object position, material, and size significantly alter sound effects. These continuous variations lead to fluctuating acoustic transfer distributions, making it challenging to represent with basic data structures and render efficiently in real time. To address this challenge, we present Neural Acoustic Transfer, a novel approach that utilizes an implicit neural representation to encode precomputed acoustic transfer and its variations, allowing for real-time prediction of sound fields under varying conditions. To efficiently generate the training data required for the neural acoustic field, we developed a fast Monte-Carlo-based boundary element method (BEM) approximation for general scenarios with smooth Neumann conditions. Additionally, we implemented a GPU-accelerated version of standard BEM for scenarios requiring higher precision. These methods provide the necessary training data, enabling our neural network to accurately model the sound radiation space. We demonstrate our method's numerical accuracy and runtime efficiency (within several milliseconds for 30s audio) through comprehensive validation and comparisons in diverse acoustic transfer scenarios. Our approach allows for efficient and accurate modeling of sound behavior in dynamically changing environments, which can benefit a wide range of interactive applications such as virtual reality, augmented reality, and advanced audio production.

[218] Voice Impression Control in Zero-Shot TTS

Keinichi Fujita,Shota Horiguchi,Yusuke Ijima

Main category: cs.SD

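TL;DR: The paper proposes a method for controlling voice impressions in zero-shot TTS: a low-dimensional vector represents voice impression intensities, and a large language model generates the target impression vector.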

Details Motivation: Although zero-shot TTS achieves high speaker fidelity, controlling voice impressions by modulating para-/non-linguistic information remains challenging. Method: A low-dimensional vector represents the intensities of voice impression pairs (e.g., dark-bright), and a large language model generates the target impression vector from a natural language description. Result: Both objective and subjective evaluations demonstrate the method's effectiveness in impression control. Conclusion: The method generates a target voice impression from a natural language description without manual optimization. Abstract: Para-/non-linguistic information in speech is pivotal in shaping the listeners' impression. Although zero-shot text-to-speech (TTS) has achieved high speaker fidelity, modulating subtle para-/non-linguistic information to control perceived voice characteristics, i.e., impressions, remains challenging. We have therefore developed a voice impression control method in zero-shot TTS that utilizes a low-dimensional vector to represent the intensities of various voice impression pairs (e.g., dark-bright). The results of both objective and subjective evaluations have demonstrated our method's effectiveness in impression control. Furthermore, generating this vector via a large language model enables target-impression generation from a natural language description of the desired impression, thus eliminating the need for manual optimization.

[219] Label-Context-Dependent Internal Language Model Estimation for CTC

Zijian Yang,Minh-Nghia Phan,Ralf Schlüter,Hermann Ney

Main category: cs.SD

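TL;DR: CTC implicitly learns a context-dependent internal language model (ILM); the paper proposes new knowledge-distillation-based estimation methods and shows experimentally that they outperform context-independent models.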

Details Motivation: To investigate the implicit context dependency in CTC and improve ILM estimation. Method: Proposes context-dependent ILM estimation methods based on knowledge distillation and introduces two regularization methods. Result: Context-dependent ILMs perform better in cross-domain evaluation, and the label-level knowledge distillation method yields more than 13% relative improvement in word error rate over shallow fusion. Conclusion: CTC does learn a context-dependent ILM, and the new methods outperform other ILM estimation approaches. Abstract: Although connectionist temporal classification (CTC) has the label context independence assumption, it can still implicitly learn a context-dependent internal language model (ILM) due to modern powerful encoders. In this work, we investigate the implicit context dependency modeled in the ILM of CTC. To this end, we propose novel context-dependent ILM estimation methods for CTC based on knowledge distillation (KD) with theoretical justifications. Furthermore, we introduce two regularization methods for KD. We conduct experiments on LibriSpeech and TED-LIUM Release 2 datasets for in-domain and cross-domain evaluation, respectively. Experimental results show that context-dependent ILMs outperform the context-independent priors in cross-domain evaluation, indicating that CTC learns a context-dependent ILM. The proposed label-level KD with smoothing method surpasses other ILM estimation approaches, with more than 13% relative improvement in word error rate compared to shallow fusion.

cs.LG [Back]

[220] Projectable Models: One-Shot Generation of Small Specialized Transformers from Large Ones

Andrey Zhmoginov,Jihwan Lee,Mark Sandler

Main category: cs.LG

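TL;DR: Proposes a technique for mapping the parameters of a large Transformer to a smaller specialized model, improving task-specific performance while reducing computational cost.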

Details Motivation: Modern foundation models are computationally expensive and hold broad knowledge, but a specific task may need only part of it. Method: A task-specific parameter mapping compresses the large model's knowledge into a smaller specialized model. Result: On image modeling tasks, the generated small models outperform universal conditional models. Conclusion: The method reduces computational cost and improves task-specific performance. Abstract: Modern Foundation Models (FMs) are typically trained on corpora spanning a wide range of different data modalities, topics and downstream tasks. Utilizing these models can be very computationally expensive and is out of reach for most consumer devices. Furthermore, most of the broad FM knowledge may actually be irrelevant for a specific task at hand. Here we explore a technique for mapping parameters of a large Transformer to parameters of a smaller specialized model. By making this transformation task-specific, we aim to capture a narrower scope of the knowledge needed for performing a specific task by a smaller model. We study our method on image modeling tasks, showing that performance of generated models exceeds that of universal conditional models.

[221] BAQ: Efficient Bit Allocation Quantization for Large Language Models

Chao Zhang,Li Wang,Samson Lasaulce,Merouane Debbah

Main category: cs.LG

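TL;DR: Proposes BAQ, a bitwidth allocation framework for quantization based on a Hessian proxy; it casts allocation as a convex optimization task that minimizes quantization loss and significantly outperforms existing methods.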

Details Motivation: Existing quantization methods rely on uniform or heuristic bitwidth assignments and ignore the nonuniform sensitivity of weights to quantization noise. Method: Introduces sensitivity metrics derived from a Hessian proxy, formulates bit allocation as a convex optimization task, and designs the BAQ algorithm. Result: BAQ outperforms GPTQ on LLMs ranging from 125M to 30B parameters, with up to 56× lower perplexity. Conclusion: By optimizing bit allocation, BAQ achieves a good trade-off between quantization loss and complexity, and comes with a theoretical explanation. Abstract: Post-training model quantization is a widely adopted technique for reducing the memory and computational costs of large language models (LLMs). However, most existing methods rely on uniform or heuristic bitwidth assignments, failing to account for the nonuniform sensitivity of weights to quantization noise. In this paper, we propose a novel framework for allocating quantization bitwidths based on sensitivity metrics derived from a Hessian proxy. We make key assumptions, which allow the layer/component-wise loss function to be expressed as an explicit function of the bitwidths. This enables a neat formulation of the bit allocation problem as a convex optimization task, whose closed-form solution adapts precision across weights to minimize the layer-wise quantization loss. Inspecting the solution provides several insights (such as the equal-loss structure), which are then exploited to design the proposed BAQ (Bit Allocation Quantization) algorithm. The proposed algorithm achieves a good trade-off between loss minimization and complexity and allows BAQ to be integrated into standard quantization pipelines with minimal overhead. Experimental results show that BAQ consistently outperforms GPTQ, achieving up to 56× lower perplexity at the same bitwidth on large language models ranging from 125M to 30B parameters. Leveraging our analytical results derived from solving the optimal bit allocation problem, we also provide a theoretical explanation for the observed gains. All code for this paper is available at https://github.com/CSU-ModelCompression/BAQ.
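The "closed-form solution" and "equal-loss structure" mentioned in the abstract can be illustrated with the textbook high-resolution bit allocation result. The sketch below is not the BAQ algorithm: it assumes a hypothetical loss model `loss_i = s_i * 2**(-2 * b_i)` for each weight group, with made-up sensitivities `s_i`, and allows real-valued bitwidths with a fixed average.

```python
import math

def allocate_bits(sensitivities, avg_bits):
    """Closed-form allocation for the ASSUMED loss model
    loss_i = s_i * 2**(-2 * b_i), with the mean bitwidth fixed to avg_bits.
    Bitwidths are real-valued here; rounding and clipping are left out."""
    n = len(sensitivities)
    # log2 of the geometric mean of the sensitivities
    log_gm = sum(math.log2(s) for s in sensitivities) / n
    # More sensitive groups receive more bits; the offsets sum to zero.
    return [avg_bits + 0.5 * (math.log2(s) - log_gm) for s in sensitivities]

def group_losses(sensitivities, bits):
    """Per-group loss under the same assumed model."""
    return [s * 2.0 ** (-2.0 * b) for s, b in zip(sensitivities, bits)]

# Hypothetical sensitivities for four weight groups.
sens = [16.0, 4.0, 1.0, 0.25]
bits = allocate_bits(sens, avg_bits=4.0)
losses = group_losses(sens, bits)
print(bits)    # [5.5, 4.5, 3.5, 2.5]
print(losses)  # every entry equals 2**-7 = 0.0078125
```

Under this model the optimum equalizes the per-group loss, which is one way to read the equal-loss structure the abstract refers to; the paper's own loss expression and assumptions may differ.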

[222] Contextually Guided Transformers via Low-Rank Adaptation

Andrey Zhmoginov,Jihwan Lee,Max Vladymyrov,Mark Sandler

Main category: cs.LG

TL;DR: Proposes CGT, a modified Transformer that needs no explicit prompts and adapts dynamically by encoding context into the model's weights.

Details Motivation: Address the computational overhead that comes from conventional LLMs relying on explicit prompts. Method: Proposes the Contextually Guided Transformer (CGT), which maintains a contextual summary at each sequence position and updates weights on the fly. Result: Effectiveness is demonstrated on synthetic tasks and language modeling benchmarks, and the interpretability of the contextual representations is improved. Conclusion: Offers a new direction for efficient, adaptable language modeling. Abstract: Large Language Models (LLMs) based on Transformers excel at text processing, but their reliance on prompts for specialized behavior introduces computational overhead. We propose a modification to a Transformer architecture that eliminates the need for explicit prompts by learning to encode context into the model's weights. Our Contextually Guided Transformer (CGT) model maintains a contextual summary at each sequence position, allowing it to update the weights on the fly based on the preceding context. This approach enables the model to self-specialize, effectively creating a tailored model for processing information following a given prefix. We demonstrate the effectiveness of our method on synthetic in-context learning tasks and language modeling benchmarks. Furthermore, we introduce techniques for enhancing the interpretability of the learned contextual representations, drawing connections to Variational Autoencoders and promoting smoother, more consistent context encoding. This work offers a novel direction for efficient and adaptable language modeling by integrating context directly into the model's architecture.

[223] Table-r1: Self-supervised and Reinforcement Learning for Program-based Table Reasoning in Small Language Models

Rihui Jin,Zheyu Xin,Xing Xie,Zuoyi Li,Guilin Qi,Yongrui Chen,Xinbang Dai,Tongtong Wu,Gholamreza Haffari

Main category: cs.LG

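TL;DR: Proposes Table-r1, a two-stage program-based table reasoning method that addresses the limitations of small language models in table reasoning through Layout Transformation Inference and mix-paradigm optimization.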

Details Motivation: Small language models (SLMs) perform poorly at table reasoning, especially numerical reasoning and code generation, so a method is needed to narrow their gap with large language models (LLMs). Method: Table-r1 has two stages: Stage 1 uses a self-supervised learning task (Layout Transformation Inference) to improve generalization across table layouts; Stage 2 uses a mix-paradigm variant of Group Relative Policy Optimization to improve reasoning consistency, falling back dynamically to text-based reasoning when needed. Result: On four table reasoning benchmarks, Table-r1 clearly outperforms other SLM-based methods, improves accuracy by at least 15%, and approaches LLM performance. Conclusion: Table-r1 gives SLMs an effective program-based table reasoning method that substantially improves performance and narrows the gap with LLMs. Abstract: Table reasoning (TR) requires structured reasoning over semi-structured tabular data and remains challenging, particularly for small language models (SLMs, e.g., LLaMA-8B) due to their limited capacity compared to large LMs (LLMs, e.g., GPT-4o). To narrow this gap, we explore program-based TR (P-TR), which circumvents key limitations of text-based TR (T-TR), notably in numerical reasoning, by generating executable programs. However, applying P-TR to SLMs introduces two challenges: (i) vulnerability to heterogeneity in table layouts, and (ii) inconsistency in reasoning due to limited code generation capability. We propose Table-r1, a two-stage P-TR method designed for SLMs. Stage 1 introduces an innovative self-supervised learning task, Layout Transformation Inference, to improve tabular layout generalization from a programmatic view. Stage 2 adopts a mix-paradigm variant of Group Relative Policy Optimization, enhancing P-TR consistency while allowing dynamic fallback to T-TR when needed. Experiments on four TR benchmarks demonstrate that Table-r1 outperforms all SLM-based methods, achieving at least a 15% accuracy improvement over the base model (LLaMA-8B) across all datasets and reaching performance competitive with LLMs.

[224] The Lock-in Hypothesis: Stagnation by Algorithm

Tianyi Alex Qiu,Zhonghao He,Tejasveer Chugh,Max Kleiman-Weiner

Main category: cs.LG

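TL;DR: Studies how the feedback loop between large language models (LLMs) and human users entrenches beliefs and reduces diversity, and tests the hypothesis with simulations and real-world data.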

Details Motivation: Examine how interaction between LLMs and human users forms an echo-chamber-like feedback loop that entrenches existing beliefs and may lock in false ones. Method: Formalizes the feedback-loop hypothesis and tests it with agent-based LLM simulations and real-world GPT usage data. Result: The analysis shows sudden but sustained drops in diversity after new GPT iterations are released, consistent with the hypothesis. Conclusion: The LLM-human feedback loop may entrench beliefs and reduce diversity, and its potential impact deserves attention. Abstract: The training and deployment of large language models (LLMs) create a feedback loop with human users: models learn human beliefs from data, reinforce these beliefs with generated content, reabsorb the reinforced beliefs, and feed them back to users again and again. This dynamic resembles an echo chamber. We hypothesize that this feedback loop entrenches the existing values and beliefs of users, leading to a loss of diversity and potentially the lock-in of false beliefs. We formalize this hypothesis and test it empirically with agent-based LLM simulations and real-world GPT usage data. Analysis reveals sudden but sustained drops in diversity after the release of new GPT iterations, consistent with the hypothesized human-AI feedback loop. Code and data available at https://thelockinhypothesis.com

[225] Corrector Sampling in Language Models

Itai Gat,Neta Shaul,Uriel Singer,Yaron Lipman

Main category: cs.LG

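TL;DR: Proposes RPT, a new sampling method that reduces error accumulation in autoregressive language models by iteratively revisiting and replacing previously generated text.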

Details Motivation: Autoregressive language models accumulate errors because of their fixed left-to-right token generation, and a way to mitigate this is needed. Method: Proposes Resample-Previous-Tokens (RPT), which reduces errors by iteratively revisiting and replacing previously generated text and can be integrated into existing autoregressive models. Result: Fine-tuning an 8B-parameter model with RPT on only 100B of data yields roughly 10% relative improvement over standard sampling on reasoning and coding benchmarks. Conclusion: RPT effectively reduces error accumulation in autoregressive models while preserving their speed and prediction quality. Abstract: Autoregressive language models accumulate errors due to their fixed, irrevocable left-to-right token generation. To address this, we propose a new sampling method called Resample-Previous-Tokens (RPT). RPT mitigates error accumulation by iteratively revisiting and potentially replacing tokens in a window of previously generated text. This method can be integrated into existing autoregressive models, preserving their next-token-prediction quality and speed. Fine-tuning a pretrained 8B parameter model with RPT for only 100B resulted in ~10% relative improvements on reasoning and coding benchmarks compared to standard sampling.

[226] Learning to Weight Parameters for Data Attribution

Shuangqi Li,Hieu Le,Jingyi Xu,Mathieu Salzmann

Main category: cs.LG

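TL;DR: Proposes a data attribution method based on learned parameter importance weights that identifies which training data influence a generative model's output, improving attribution accuracy.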

Details Motivation: Existing methods treat all network parameters as equally important and ignore that different layers encode different information, which hurts attribution quality. Method: Learns parameter importance weights without labeled data so that attribution adapts to the model's structure and captures how training data contribute to semantic aspects of an output such as subject, style, or background. Result: Improves attribution accuracy across diffusion models and gives fine-grained insight into how outputs borrow from training data. Conclusion: By modeling parameter importance, the method improves data attribution in generative models and offers a new view of model behavior. Abstract: We study data attribution in generative models, aiming to identify which training examples most influence a given output. Existing methods achieve this by tracing gradients back to training data. However, they typically treat all network parameters uniformly, ignoring the fact that different layers encode different types of information and may thus draw information differently from the training set. We propose a method that models this by learning parameter importance weights tailored for attribution, without requiring labeled data. This allows the attribution process to adapt to the structure of the model, capturing which training examples contribute to specific semantic aspects of an output, such as subject, style, or background. Our method improves attribution accuracy across diffusion models and enables fine-grained insights into how outputs borrow from training data.

[227] Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery

Sajjad Abdoli,Freeman Lewin,Gediminas Vasiliauskas,Fabian Schonholz

Main category: cs.LG

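TL;DR: Discusses the shift in AI model development from a "Model-Centric" to a "Data-Centric" approach and introduces the high-quality DSD dataset and the model improvements it yields.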

Details Motivation: Traditional AI development leans heavily on complex model architectures and neglects data quality, so a shift toward data-centric development is needed. Method: Introduces the DataSeeds.AI DSD dataset, with approximately 10,610 high-quality images and multi-tier annotations, for improving model performance. Result: The DSD yields quantitative improvements on specific models, and the code and trained models are released publicly to support the evaluation. Conclusion: Data-centric development is key to future AI progress, and the DSD sets a new standard for high-quality data. Abstract: The development of modern Artificial Intelligence (AI) models, particularly diffusion-based models employed in computer vision and image generation tasks, is undergoing a paradigmatic shift in development methodologies. Traditionally dominated by a "Model-Centric" approach, in which performance gains were primarily pursued through increasingly complex model architectures and hyperparameter optimization, the field is now recognizing a more nuanced "Data-Centric" approach. This emergent framework foregrounds the quality, structure, and relevance of training data as the principal driver of model performance. To operationalize this paradigm shift, we introduce the DataSeeds.AI sample dataset (the "DSD"), initially comprised of approximately 10,610 high-quality human peer-ranked photography images accompanied by extensive multi-tier annotations. The DSD is a foundational computer vision dataset designed to usher in a new standard for commercial image datasets. Representing a small fraction of DataSeeds.AI's 100 million-plus image catalog, the DSD provides a scalable foundation necessary for robust commercial and multimodal AI development. Through this in-depth exploratory analysis, we document the quantitative improvements generated by the DSD on specific models against known benchmarks and make the code and the trained models used in our evaluation publicly available.

[228] Any-Class Presence Likelihood for Robust Multi-Label Classification with Abundant Negative Data

Dumindu Tissera,Omar Awadallah,Muhammad Umair Danish,Ayan Sadhu,Katarina Grolinger

Main category: cs.LG

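TL;DR: Proposes redesigned multi-label classification (MLC) loss functions built on a normalized weighted geometric mean of predicted class probabilities, addressing the way abundant negative data disrupt learning.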

Details Motivation: In multi-label classification, an abundance of negative data (instances with no assigned class) disrupts learning and hurts accurate classification of positive instances, while assigning a separate class to negative data adds redundancy. Method: Redesigns standard MLC loss functions with an any-class presence likelihood formulated as a normalized weighted geometric mean of the predicted class probabilities, plus a regularization parameter controlling how much absent-class probabilities contribute in positive instances. Result: Across several large-scale datasets, the redesigned losses improve performance, with gains of up to 6.01 percentage points in F1, 8.06 in F2, and 3.11 in mean average precision. Conclusion: The method mitigates interference from negative data and improves multi-label classification without additional parameters or computational complexity. Abstract: Multi-label Classification (MLC) assigns an instance to one or more non-exclusive classes. A challenge arises when the dataset contains a large proportion of instances with no assigned class, referred to as negative data, which can overwhelm the learning process and hinder the accurate identification and classification of positive instances. Nevertheless, it is common in MLC applications such as industrial defect detection, agricultural disease identification, and healthcare diagnosis to encounter large amounts of negative data. Assigning a separate negative class to these instances further complicates the learning objective and introduces unnecessary redundancies. To address this challenge, we redesign standard MLC loss functions by deriving a likelihood of any class being present, formulated by a normalized weighted geometric mean of the predicted class probabilities. We introduce a regularization parameter that controls the relative contribution of the absent class probabilities to the any-class presence likelihood in positive instances. The any-class presence likelihood complements the multi-label learning by encouraging the network to become more aware of implicit positive instances and improve the label classification within those positive instances. Experiments on large-scale datasets with negative data: SewerML, modified COCO, and ChestX-ray14, across various networks and base loss functions show that our loss functions consistently improve MLC performance of their standard loss counterparts, achieving gains of up to 6.01 percentage points in F1, 8.06 in F2, and 3.11 in mean average precision, all without additional parameters or computational complexity.
Code available at: https://github.com/ML-for-Sensor-Data-Western/gmean-mlc

Han Ji,Yuqi Feng,Jiahao Fan,Yanan Sun

Main category: cs.LG

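TL;DR: Presents the first comprehensive study of loss functions in performance predictors, grouping them into regression, ranking, and weighted losses and showing experimentally that combining them can improve NAS.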

Details Motivation: Evaluation in neural architecture search (NAS) is costly, and the effectiveness of performance predictors depends heavily on the choice of loss function, yet the characteristics and effectiveness of existing loss functions have not been studied in depth. Method: Groups loss functions into regression, ranking, and weighted types and evaluates eight loss functions on 13 tasks across five search spaces. Result: Specific categories of loss functions can be combined to improve predictor performance, and the findings offer practical guidance for choosing loss functions for different tasks. Conclusion: The study gives the NAS community a practical guide to loss function selection and points to directions for future loss function design. Abstract: Evaluation is a critical but costly procedure in neural architecture search (NAS). Performance predictors have been widely adopted to reduce evaluation costs by directly estimating architecture performance. The effectiveness of predictors is heavily influenced by the choice of loss functions. While traditional predictors employ regression loss functions to evaluate the absolute accuracy of architectures, recent approaches have explored various ranking-based loss functions, such as pairwise and listwise ranking losses, to focus on the ranking of architecture performance. Despite their success in NAS, the effectiveness and characteristics of these loss functions have not been thoroughly investigated. In this paper, we conduct the first comprehensive study on loss functions in performance predictors, categorizing them into three main types: regression, ranking, and weighted loss functions. Specifically, we assess eight loss functions using a range of NAS-relevant metrics on 13 tasks across five search spaces. Our results reveal that specific categories of loss functions can be effectively combined to enhance predictor-based NAS. Furthermore, our findings could provide practical guidance for selecting appropriate loss functions for various tasks. We hope this work provides meaningful insights to guide the development of loss functions for predictor-based methods in the NAS community.
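To make the regression-versus-ranking distinction in the abstract concrete, here is a minimal sketch contrasting a mean-squared-error loss with a pairwise hinge ranking loss. The margin value and the toy accuracies are illustrative assumptions, and these two functions are generic textbook forms, not necessarily among the eight losses the paper evaluates.

```python
def mse_loss(pred, target):
    """Regression loss: penalizes errors in the predicted accuracy values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def pairwise_hinge_loss(pred, target, margin=0.1):
    """Ranking loss: for every pair where architecture i is truly better
    than j, penalize predictions that do not rank i above j by `margin`."""
    total, pairs = 0.0, 0
    for i in range(len(pred)):
        for j in range(len(pred)):
            if target[i] > target[j]:
                total += max(0.0, margin - (pred[i] - pred[j]))
                pairs += 1
    return total / pairs if pairs else 0.0

# Hypothetical ground-truth accuracies of three architectures.
true_acc = [0.90, 0.92, 0.94]
shifted = [0.40, 0.60, 0.80]  # badly calibrated values, correct ordering
close = [0.93, 0.92, 0.91]    # nearly correct values, reversed ordering

for name, pred in [("shifted", shifted), ("close", close)]:
    print(name, mse_loss(pred, true_acc), pairwise_hinge_loss(pred, true_acc))
```

The `shifted` predictor has a large regression loss but zero ranking loss, while `close` shows the opposite pattern, which is why the choice of loss matters when a search procedure only needs the ordering of candidates.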

[230] TRUST: Test-time Resource Utilization for Superior Trustworthiness

Haripriya Harikumar,Santu Rana

Main category: cs.LG

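TL;DR: Proposes a new test-time optimization method that accounts for noise in classifier weights to produce more reliable confidence estimates, markedly improving uncertainty estimation.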

Details Motivation: Standard uncertainty estimation techniques such as dropout struggle to separate reliable from unreliable predictions, mainly because noisy classifier weights make finer-level statistics less informative. Method: Proposes a test-time optimization method that accounts for the impact of this noise and yields a score defining a monotonic subset-selection function, producing more reliable confidence estimates. Result: The method performs strongly on risk-based metrics such as AUSE and AURC, identifies discrepancies between training and test distributions, separates in-distribution from out-of-distribution samples, and reveals key differences between CNN and ViT classifiers. Conclusion: The method makes uncertainty estimates considerably more reliable across a range of vision datasets and provides a more accurate tool for model evaluation. Abstract: Standard uncertainty estimation techniques, such as dropout, often struggle to clearly distinguish reliable predictions from unreliable ones. We attribute this limitation to noisy classifier weights, which, while not impairing overall class-level predictions, render finer-level statistics less informative. To address this, we propose a novel test-time optimization method that accounts for the impact of such noise to produce more reliable confidence estimates. This score defines a monotonic subset-selection function, where population accuracy consistently increases as samples with lower scores are removed, and it demonstrates superior performance in standard risk-based metrics such as AUSE and AURC. Additionally, our method effectively identifies discrepancies between training and test distributions, reliably differentiates in-distribution from out-of-distribution samples, and elucidates key differences between CNN and ViT classifiers across various vision datasets.
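As a rough illustration of the risk-based metrics named in the abstract, the sketch below computes AURC (area under the risk-coverage curve) in its common discrete form: samples are sorted by confidence, and the error rate among the k most confident samples is averaged over all k. The confidence scores and correctness flags are made-up toy values, and the paper's exact evaluation protocol is not given here.

```python
def aurc(confidences, correct):
    """Area under the risk-coverage curve (lower is better).
    Assumes the common discrete definition: mean selective risk over
    all coverage levels k/n, keeping the k most confident samples."""
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    errors = 0
    risks = []
    for k, idx in enumerate(order, start=1):
        errors += 0 if correct[idx] else 1
        risks.append(errors / k)  # error rate among the top-k samples
    return sum(risks) / len(risks)

conf = [0.9, 0.8, 0.7, 0.6]
# Same accuracy (3 of 4 correct), but the single error sits at a
# different confidence rank in each case.
well_ranked = aurc(conf, [True, True, True, False])
badly_ranked = aurc(conf, [False, True, True, True])
print(well_ranked, badly_ranked)
```

A confidence score that places its errors at low confidence gets a lower AURC even when overall accuracy is identical, which matches the abstract's description of accuracy rising as low-score samples are removed.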

[231] Gradient Similarity Surgery in Multi-Task Deep Learning

Thomas Borsani,Andrea Rosani,Giuseppe Nicosia,Giuseppe Di Fatta

Main category: cs.LG

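TL;DR: Proposes SAM-GS, a new gradient surgery method that uses a gradient magnitude similarity measure to resolve gradient conflicts in multi-task deep learning and guide optimization.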

Details Motivation: In multi-task deep learning (MTDL), task gradients can conflict because of differing magnitudes or opposite directions, which hinders convergence, and existing gradient surgery methods leave room for handling such conflicts more effectively. Method: Proposes SAM-GS, which uses a gradient magnitude similarity measure and adjusts the gradient trajectory through gradient equalisation and modulation of the first-order momentum. Result: Experiments show that SAM-GS is effective on synthetic problems and MTL benchmarks, and that gradient magnitude similarity plays a crucial role in optimising the learning process. Conclusion: SAM-GS offers an effective and scalable way to resolve gradient conflicts in MTDL, improving training stability and convergence speed. Abstract: The multi-task learning (MTL) paradigm aims to simultaneously learn multiple tasks within a single model capturing higher-level, more general hidden patterns that are shared by the tasks. In deep learning, a significant challenge in the backpropagation training process is the design of advanced optimisers to improve the convergence speed and stability of the gradient descent learning rule. In particular, in multi-task deep learning (MTDL) the multitude of tasks may generate potentially conflicting gradients that would hinder the concurrent convergence of the diverse loss functions. This challenge arises when the gradients of the task objectives have either different magnitudes or opposite directions, causing one or a few to dominate or to interfere with each other, thus degrading the training process. Gradient surgery methods address the problem explicitly dealing with conflicting gradients by adjusting the overall gradient trajectory. This work introduces a novel gradient surgery method, the Similarity-Aware Momentum Gradient Surgery (SAM-GS), which provides an effective and scalable approach based on a gradient magnitude similarity measure to guide the optimisation process. The SAM-GS surgery adopts gradient equalisation and modulation of the first-order momentum. A series of experimental tests have shown the effectiveness of SAM-GS on synthetic problems and MTL benchmarks. Gradient magnitude similarity plays a crucial role in regularising gradient aggregation in MTDL for the optimisation of the learning process.
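The two kinds of conflict the abstract describes (different magnitudes, opposite directions) can each be measured with a simple quantity. The sketch below uses one common definition of gradient magnitude similarity from earlier gradient surgery work, 2·‖g1‖·‖g2‖ / (‖g1‖² + ‖g2‖²), alongside cosine similarity. The abstract does not state the exact measure SAM-GS uses, so this formula is an assumption, and the equalisation and momentum steps are not reproduced.

```python
import math

def norm(g):
    """Euclidean norm of a gradient given as a flat list of floats."""
    return math.sqrt(sum(x * x for x in g))

def magnitude_similarity(g1, g2):
    """ASSUMED measure: 2*|g1|*|g2| / (|g1|^2 + |g2|^2).
    Equals 1 when the norms match and tends to 0 as they diverge."""
    n1, n2 = norm(g1), norm(g2)
    denom = n1 * n1 + n2 * n2
    return 1.0 if denom == 0.0 else 2.0 * n1 * n2 / denom

def cosine_similarity(g1, g2):
    """Direction agreement: negative values indicate opposing gradients."""
    n1, n2 = norm(g1), norm(g2)
    if n1 == 0.0 or n2 == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(g1, g2)) / (n1 * n2)

# Two hypothetical task gradients: same direction, 10x different scale.
g_task_a = [3.0, 4.0]
g_task_b = [30.0, 40.0]
print(magnitude_similarity(g_task_a, g_task_b))  # well below 1: one task dominates
print(cosine_similarity(g_task_a, g_task_b))     # close to 1: directions agree
```

Separating the two measures shows that gradients can agree in direction and still interfere through scale, which is the case a magnitude-based criterion is meant to catch.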

[232] Towards an Explainable Comparison and Alignment of Feature Embeddings

Mohammad Jalali,Bahar Dibaei Nia,Farzan Farnia

Main category: cs.LG

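TL;DR: Proposes SPEC, a framework for comparing and aligning feature embedding models that analyzes clustering differences through the difference of kernel matrices and comes with a scalable implementation.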

Details Motivation: Comparisons of embedding models focus mainly on numerical performance and lack an interpretable analysis of clustering differences. Method: Uses the eigendecomposition of the difference kernel matrix to detect sample clusters captured differently by two embeddings, with a scalable implementation whose complexity grows linearly with sample size. Result: SPEC is demonstrated on large-scale datasets such as ImageNet and MS-COCO. Conclusion: The SPEC framework can compare and align embedding models effectively and improves clustering consistency. Abstract: While several feature embedding models have been developed in the literature, comparisons of these embeddings have largely focused on their numerical performance in classification-related downstream applications. However, an interpretable comparison of different embeddings requires identifying and analyzing mismatches between sample groups clustered within the embedding spaces. In this work, we propose the Spectral Pairwise Embedding Comparison (SPEC) framework to compare embeddings and identify their differences in clustering a reference dataset. Our approach examines the kernel matrices derived from two embeddings and leverages the eigendecomposition of the difference kernel matrix to detect sample clusters that are captured differently by the two embeddings. We present a scalable implementation of this kernel-based approach, with computational complexity that grows linearly with the sample size. Furthermore, we introduce an optimization problem using this framework to align two embeddings, ensuring that clusters identified in one embedding are also captured in the other model. We provide numerical results demonstrating the SPEC's application to compare and align embeddings on large-scale datasets such as ImageNet and MS-COCO. The code is available at https://github.com/mjalali/embedding-comparison.

cs.AI [Back]

[233] When Models Know More Than They Can Explain: Quantifying Knowledge Transfer in Human-AI Collaboration

Quan Shi,Carlos E. Jimenez,Shunyu Yao,Nick Haber,Diyi Yang,Karthik Narasimhan

Main category: cs.AI

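TL;DR: Asks whether advances in AI reasoning improve knowledge transfer, introduces the KITE framework, and finds experimentally that model performance correlates with collaborative outcomes only inconsistently, so knowledge transfer needs dedicated optimization.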

Details Motivation: Study whether AI models can convey their reasoning in ways humans can understand, in support of human-AI collaboration. Method: Introduces the KITE framework and runs a two-phase human study (N=118): in the first phase humans ideate strategies with an AI, and in the second they implement solutions independently. Result: Model performance correlates with collaborative outcomes but inconsistently, with significant outliers. Conclusion: Knowledge transfer requires dedicated optimization, behavioral and strategic factors mediate successful transfer, and the code and framework are released to support future work. Abstract: Recent advancements in AI reasoning have driven substantial improvements across diverse tasks. A critical open question is whether these improvements also yield better knowledge transfer: the ability of models to communicate reasoning in ways humans can understand, apply, and learn from. To investigate this, we introduce Knowledge Integration and Transfer Evaluation (KITE), a conceptual and experimental framework for Human-AI knowledge transfer capabilities and conduct the first large-scale human study (N=118) explicitly designed to measure it. In our two-phase setup, humans first ideate with an AI on problem-solving strategies, then independently implement solutions, isolating model explanations' influence on human understanding. Our findings reveal that although model benchmark performance correlates with collaborative outcomes, this relationship is notably inconsistent, featuring significant outliers, indicating that knowledge transfer requires dedicated optimization. Our analysis identifies behavioral and strategic factors mediating successful knowledge transfer. We release our code, dataset, and evaluation framework to support future work on communicatively aligned models.

[234] MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark

Junjie Xing,Yeye He,Mengyu Zhou,Haoyu Dong,Shi Han,Lingjiao Chen,Dongmei Zhang,Surajit Chaudhuri,H. V. Jagadish

Main category: cs.AI

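TL;DR: MMTU is a large-scale benchmark with over 30K questions across 25 real-world table tasks, designed to comprehensively evaluate expert-level table understanding, reasoning, and manipulation.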

Details Motivation: Evaluations of table-related tasks are scarce and focus narrowly on tasks like NL-to-SQL and Table-QA, overlooking the broader range of tasks professional users face and limiting model progress in this area. Method: Builds the MMTU benchmark of over 30K questions by drawing complex table tasks from decades of computer science research, covering skills such as understanding, reasoning, and coding. Result: Frontier models such as OpenAI o4-mini and DeepSeek R1 score only around 60% on MMTU, leaving significant room for improvement. Conclusion: MMTU provides an important benchmark for foundation models aimed at structured data processing and analysis and should drive further research in the area. Abstract: Tables and table-based use cases play a crucial role in many important real-world applications, such as spreadsheets, databases, and computational notebooks, which traditionally require expert-level users like data engineers, data analysts, and database administrators to operate. Although LLMs have shown remarkable progress in working with tables (e.g., in spreadsheet and database copilot scenarios), comprehensive benchmarking of such capabilities remains limited. In contrast to an extensive and growing list of NLP benchmarks, evaluations of table-related tasks are scarce, and narrowly focus on tasks like NL-to-SQL and Table-QA, overlooking the broader spectrum of real-world tasks that professional users face. This gap limits our understanding and model progress in this important area. In this work, we introduce MMTU, a large-scale benchmark with over 30K questions across 25 real-world table tasks, designed to comprehensively evaluate models' ability to understand, reason, and manipulate real tables at the expert level. These tasks are drawn from decades' worth of computer science research on tabular data, with a focus on complex table tasks faced by professional users. We show that MMTU requires a combination of skills -- including table understanding, reasoning, and coding -- that remain challenging for today's frontier models, where even frontier reasoning models like OpenAI o4-mini and DeepSeek R1 score only around 60%, suggesting significant room for improvement. We highlight key findings in our evaluation using MMTU and hope that this benchmark drives further advances in understanding and developing foundation models for structured data processing and analysis.
Our code and data are available at https://github.com/MMTU-Benchmark/MMTU and https://huggingface.co/datasets/MMTU-benchmark/MMTU.

[235] Constrained Sampling for Language Models Should Be Easy: An MCMC Perspective

Emmanuel Anaya Gonzalez,Sairam Vaidya,Kanghee Park,Ruyi Ji,Taylor Berg-Kirkpatrick,Loris D'Antoni

Main category: cs.AI

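TL;DR: Proposes an MCMC-based constrained sampling framework that is constraint satisfying, monotonically converging, and efficient, and that outperforms existing methods.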

Details Motivation: Existing constrained decoding methods distort the model's distribution and reduce diversity, which is especially problematic in program fuzzing. Method: Constructs a proposal distribution over valid outputs and applies a Metropolis-Hastings acceptance criterion to explore the constrained space efficiently. Result: Outperforms existing methods on synthetic benchmarks and real-world program fuzzing tasks. Conclusion: The framework satisfies the constraints while staying faithful to the model's distribution and preserving efficiency and diversity. Abstract: Constrained decoding enables Language Models (LMs) to produce samples that provably satisfy hard constraints. However, existing constrained-decoding approaches often distort the underlying model distribution, a limitation that is especially problematic in applications like program fuzzing, where one wants to generate diverse and valid program inputs for testing purposes. We propose a new constrained sampling framework based on Markov Chain Monte Carlo (MCMC) that simultaneously satisfies three core desiderata: constraint satisfying (every sample satisfies the constraint), monotonically converging (the sampling process converges to the true conditional distribution), and efficient (high-quality samples emerge in few steps). Our method constructs a proposal distribution over valid outputs and applies a Metropolis-Hastings acceptance criterion based on the LM's likelihood, ensuring principled and efficient exploration of the constrained space. Empirically, our sampler outperforms existing methods on both synthetic benchmarks and real-world program fuzzing tasks.
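The Metropolis-Hastings step described in the abstract can be sketched on a toy problem. Everything concrete below is a stand-in chosen for illustration: the three "valid outputs", their unnormalized LM weights, and the fixed proposal distribution are hypothetical, and the paper's actual proposal construction is not described in the abstract.

```python
import random

def metropolis_hastings(target_weight, proposal_sample, proposal_prob,
                        init, steps, rng):
    """Independence Metropolis-Hastings over a set of valid outputs.
    target_weight(x): unnormalized target probability (e.g. LM likelihood).
    proposal_sample(rng) / proposal_prob(x): a proposal that only ever
    produces valid outputs, so every state in the chain is valid."""
    x = init
    chain = []
    for _ in range(steps):
        y = proposal_sample(rng)
        # Acceptance ratio p(y) q(x) / (p(x) q(y)) for an independence proposal.
        ratio = (target_weight(y) * proposal_prob(x)) / (
            target_weight(x) * proposal_prob(y))
        if rng.random() < min(1.0, ratio):
            x = y
        chain.append(x)
    return chain

# Hypothetical valid outputs with unnormalized LM weights (target 0.6/0.3/0.1).
lm_weight = {"a+b": 6.0, "a*b": 3.0, "(a)": 1.0}
# A deliberately mismatched proposal, standing in for a distorted sampler.
proposal = {"a+b": 0.2, "a*b": 0.3, "(a)": 0.5}
valid = list(lm_weight)

rng = random.Random(0)
chain = metropolis_hastings(
    target_weight=lm_weight.__getitem__,
    proposal_sample=lambda r: r.choices(
        valid, weights=[proposal[v] for v in valid])[0],
    proposal_prob=proposal.__getitem__,
    init="(a)", steps=50000, rng=rng)
freq = {v: chain.count(v) / len(chain) for v in valid}
print(freq)
```

Although the proposal favors `"(a)"`, the acceptance step corrects for the mismatch, so the empirical frequencies drift toward the LM-weighted target as the chain runs; that correction is the property that plain constrained decoding lacks.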

[236] Proactive Assistant Dialogue Generation from Streaming Egocentric Videos

Yichi Zhang,Xin Luna Dong,Zhaojiang Lin,Andrea Madotto,Anuj Kumar,Babak Damavandi,Joyce Chai,Seungwhan Moon

Main category: cs.AI

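TL;DR: Presents a conversational AI framework for real-time perceptual task guidance, comprising data synthesis, automatic evaluation, and an end-to-end model.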

Details Motivation: Building real-time perceptual task guidance systems is held back by costly data collection and evaluation, and such systems must provide interactive, proactive assistance based on visual input. Method: (1) Synthesizes dialogue data from annotated egocentric videos; (2) develops automatic evaluation metrics; (3) proposes an end-to-end model that handles data imbalance and long-duration videos. Result: Produces a large-scale synthetic dialogue dataset, validates the automatic evaluation metrics, and develops a model that responds in real time. Conclusion: The framework lays the foundation for real-time, proactive AI assistants that can guide users through diverse tasks. Abstract: Recent advances in conversational AI have been substantial, but developing real-time systems for perceptual task guidance remains challenging. These systems must provide interactive, proactive assistance based on streaming visual inputs, yet their development is constrained by the costly and labor-intensive process of data collection and system evaluation. To address these limitations, we present a comprehensive framework with three key contributions. First, we introduce a novel data curation pipeline that synthesizes dialogues from annotated egocentric videos, resulting in \dataset, a large-scale synthetic dialogue dataset spanning multiple domains. Second, we develop a suite of automatic evaluation metrics, validated through extensive human studies. Third, we propose an end-to-end model that processes streaming video inputs to generate contextually appropriate responses, incorporating novel techniques for handling data imbalance and long-duration videos. This work lays the foundation for developing real-time, proactive AI assistants capable of guiding users through diverse tasks. Project page: https://pro-assist.github.io/

[237] PersonaAgent: When Large Language Model Agents Meet Personalization at Test Time

Weizhi Zhang,Xinyang Zhang,Chenwei Zhang,Liangwei Yang,Jingbo Shang,Zhepei Wei,Henry Peng Zou,Zijie Huang,Zhengyang Wang,Yifan Gao,Xiaoman Pan,Lian Xiong,Jingguo Liu,Philip S. Yu,Xian Li

Main category: cs.AI

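TL;DR: PersonaAgent is a personalized LLM agent framework that addresses diverse user needs with personalized memory and action modules and optimizes itself through a test-time user-preference alignment strategy.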

Details Motivation: Current LLM agents lack the flexibility to meet users' varying needs, which motivates PersonaAgent for personalized task handling. Method: PersonaAgent combines a personalized memory module and a personalized action module, coordinated by a user-specific system prompt (the persona), and uses a test-time preference alignment strategy to optimize that prompt. Result: Experiments show that PersonaAgent significantly outperforms baseline methods, both in personalizing the action space and in scaling to real-world test-time use. Conclusion: PersonaAgent demonstrates the feasibility and potential of delivering dynamic, personalized user experiences. Abstract: Large Language Model (LLM) empowered agents have recently emerged as advanced paradigms that exhibit impressive capabilities in a wide range of domains and tasks. Despite their potential, current LLM agents often adopt a one-size-fits-all approach, lacking the flexibility to respond to users' varying needs and preferences. This limitation motivates us to develop PersonaAgent, the first personalized LLM agent framework designed to address versatile personalization tasks. Specifically, PersonaAgent integrates two complementary components: a personalized memory module that includes episodic and semantic memory mechanisms, and a personalized action module that enables the agent to perform tool actions tailored to the user. At the core, the persona (defined as a unique system prompt for each user) functions as an intermediary: it leverages insights from personalized memory to control agent actions, while the outcomes of these actions in turn refine the memory. Based on the framework, we propose a test-time user-preference alignment strategy that simulates the latest n interactions to optimize the persona prompt, ensuring real-time user preference alignment through textual loss feedback between simulated and ground-truth responses. Experimental evaluations demonstrate that PersonaAgent significantly outperforms other baseline methods by not only personalizing the action space effectively but also scaling during test-time real-world applications. These results underscore the feasibility and potential of our approach in delivering tailored, dynamic user experiences.

cs.SI [Back]

[238] Masked Language Models are Good Heterogeneous Graph Generalizers

Jinyu Yang,Cheng Yang,Shanyuan Cui,Zeyuan Guo,Liangwei Yang,Muhan Zhang,Chuan Shi

Main category: cs.SI

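TL;DR: MLM4HG is a masked-language-modeling method that uses metapath-based text sequences and unified task templates to improve cross-domain and multi-task generalization in heterogeneous graph learning.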

Details Motivation: Existing approaches that combine HGNNs with large language models (LLMs) suffer from mismatched embedding spaces and limited generalization across tasks. Method: MLM4HG converts heterogeneous graphs into metapath-based text sequences, combines them with unified task templates, and fine-tunes a pretrained language model under a mask-prediction paradigm. Result: In cross-domain and multi-task experiments on four real-world datasets, MLM4HG outperforms existing methods in both few-shot and zero-shot scenarios. Conclusion: By turning heterogeneous graphs into text and unifying the task paradigm, MLM4HG substantially improves generalization. Abstract: Heterogeneous graph neural networks (HGNNs) excel at capturing structural and semantic information in heterogeneous graphs (HGs), while struggling to generalize across domains and tasks. Recently, some researchers have turned to integrating HGNNs with large language models (LLMs) for more generalizable heterogeneous graph learning. However, these approaches typically extract structural information via HGNNs as HG tokens, and disparities in embedding spaces between HGNNs and LLMs have been shown to bias the LLM's comprehension of HGs. Moreover, as these HG tokens are often derived from node-level tasks, the model's ability to generalize across tasks remains limited. To this end, we propose a simple yet effective Masked Language Modeling-based method, called MLM4HG. MLM4HG introduces metapath-based textual sequences instead of HG tokens to extract structural and semantic information inherent in HGs, and designs customized textual templates to unify different graph tasks into a coherent cloze-style "mask" token prediction paradigm. Specifically, MLM4HG first converts HGs from various domains to texts based on metapaths, and subsequently combines them with the unified task texts to form a HG-based corpus. Moreover, the corpus is fed into a pretrained LM for fine-tuning with a constrained target vocabulary, enabling the fine-tuned LM to generalize to unseen target HGs. Extensive cross-domain and multi-task experiments on four real-world datasets demonstrate the superior generalization performance of MLM4HG over state-of-the-art methods in both few-shot and zero-shot scenarios. Our code is available at https://github.com/BUPT-GAMMA/MLM4HG.