cs.CV

[1] EfficientQuant: An Efficient Post-Training Quantization for CNN-Transformer Hybrid Models on Edge Devices

Shaibal Saha,Lanyu Xu

Main category: cs.CV

TL;DR: EfficientQuant is a structure-aware post-training quantization method for hybrid convolution-Transformer models that substantially reduces latency while maintaining high accuracy.

Details

Motivation: Hybrid convolution-Transformer models perform well on CV tasks but are resource-intensive, and existing post-training quantization methods have limited effectiveness on them.

Method: Apply uniform quantization to convolutional blocks and log2 quantization to Transformer blocks.

Result: Achieves a 2.5× to 8.7× latency reduction on ImageNet-1K with minimal accuracy loss, and runs efficiently on edge devices.

Conclusion: EfficientQuant offers a practical solution for deploying hybrid models at the edge.

Abstract: Hybrid models that combine convolutional and transformer blocks offer strong performance in computer vision (CV) tasks but are resource-intensive for edge deployment. Although post-training quantization (PTQ) can help reduce resource demand, its application to hybrid models remains limited. We propose EfficientQuant, a novel structure-aware PTQ approach that applies uniform quantization to convolutional blocks and $\log_2$ quantization to transformer blocks. EfficientQuant achieves a $2.5\times$ to $8.7\times$ latency reduction with minimal accuracy loss on the ImageNet-1K dataset. It further demonstrates low latency and memory efficiency on edge devices, making it practical for real-world deployment.
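To make the distinction between the two quantizer families concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give bit widths, scale selection, or calibration details, so the function names, the `scale` and `bits` parameters, and the restriction of the log2 quantizer to inputs in (0, 1] are all assumptions made for illustration.

```python
import math

def uniform_quantize(x, scale, bits=8):
    """Symmetric uniform quantization: snap x to an integer multiple of `scale`."""
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax - 1, min(qmax, round(x / scale)))  # clip to the signed integer range
    return q * scale

def log2_quantize(x, bits=4):
    """Log2 quantization for x in (0, 1]: snap x to the nearest power of two (in log space)."""
    if x <= 0.0:
        return 0.0
    max_exp = 2 ** bits - 1
    exp = min(max_exp, max(0, round(-math.log2(x))))  # stored integer exponent
    return 2.0 ** (-exp)

print(uniform_quantize(0.26, scale=0.1))  # evenly spaced levels
print(log2_quantize(0.3))                 # levels are 1, 1/2, 1/4, ...
```

The uniform quantizer spaces its levels evenly, while the log2 quantizer concentrates levels near zero and stores only an integer exponent, so multiplication by a quantized value can be done with a bit shift.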

[2] Adaptive Object Detection with ESRGAN-Enhanced Resolution & Faster R-CNN

Divya Swetha K,Ziaul Haque Choudhury,Hemanta Kumar Bhuyan,Biswajit Brahma,Nilayam Kumar Kamila

Main category: cs.CV

TL;DR: Proposes a method that combines ESRGAN and Faster R-CNN to improve object detection on low-resolution images.

Details

Motivation: Address poor object detection performance on low-resolution images, particularly in applications where image quality is inconsistent.

Method: Use ESRGAN as a pre-processing step to enhance image quality, then run Faster R-CNN for object detection.

Result: Experiments show that the method detects objects in low-resolution images better than traditional methods.

Conclusion: The framework offers an effective solution for applications with inconsistent or limited image quality, balancing image enhancement with efficient object detection.

Abstract: This study proposes a method for improved object detection from low-resolution images by integrating Enhanced Super-Resolution Generative Adversarial Networks (ESRGAN) and Faster Region-Convolutional Neural Network (Faster R-CNN). ESRGAN enhances low-quality images, restoring details and improving clarity, while Faster R-CNN performs accurate object detection on the enhanced images. The combination of these techniques ensures better detection performance, even with poor-quality inputs, offering an effective solution for applications where image resolution is inconsistent. ESRGAN is employed as a pre-processing step to enhance the low-resolution input image, effectively restoring lost details and improving overall image quality. Subsequently, the enhanced image is fed into the Faster R-CNN model for accurate object detection and localization. Experimental results demonstrate that this integrated approach yields superior performance compared to traditional methods applied directly to low-resolution images. The proposed framework provides a promising solution for applications where image quality is variable or limited, enabling more robust and reliable object detection in challenging scenarios. It achieves a balance between improved image quality and efficient object detection.

[3] Technical Report for Argoverse2 Scenario Mining Challenges on Iterative Error Correction and Spatially-Aware Prompting

Yifei Chen,Ross Greer

Main category: cs.CV

TL;DR: The RefAV framework uses LLMs to translate natural-language queries into executable code for mining autonomous driving scenarios, but it suffers from code errors and inaccurate parameter interpretation. This report proposes two improvements, an iterative code-generation mechanism and specialized prompt engineering, and shows experimentally that both are effective.

Details

Motivation: Address runtime errors in LLM-generated code and inaccurate interpretation of parameters for complex spatial relationships, improving the reliability of autonomous driving scenario mining.

Method: (1) A fault-tolerant iterative code-generation mechanism that re-prompts the LLM with error feedback to refine the code; (2) specialized prompt engineering that improves the LLM's understanding and application of spatial-relationship functions.

Result: On the Argoverse 2 validation set, performance improves with several LLMs (Qwen2.5-VL-7B, Gemini 2.5 Flash, and Gemini 2.5 Pro); with Gemini 2.5 Pro, the system reaches a HOTA-Temporal score of 52.37 on the test set.

Conclusion: The proposed techniques markedly improve the reliability and precision of scenario mining, providing an effective tool for autonomous driving system development.

Abstract: Scenario mining from extensive autonomous driving datasets, such as Argoverse 2, is crucial for the development and validation of self-driving systems. The RefAV framework represents a promising approach by employing Large Language Models (LLMs) to translate natural-language queries into executable code for identifying relevant scenarios. However, this method faces challenges, including runtime errors stemming from LLM-generated code and inaccuracies in interpreting parameters for functions that describe complex multi-object spatial relationships. This technical report introduces two key enhancements to address these limitations: (1) a fault-tolerant iterative code-generation mechanism that refines code by re-prompting the LLM with error feedback, and (2) specialized prompt engineering that improves the LLM's comprehension and correct application of spatial-relationship functions. Experiments on the Argoverse 2 validation set with diverse LLMs (Qwen2.5-VL-7B, Gemini 2.5 Flash, and Gemini 2.5 Pro) show consistent gains across multiple metrics; most notably, the proposed system achieves a HOTA-Temporal score of 52.37 on the official test set using Gemini 2.5 Pro. These results underline the efficacy of the proposed techniques for reliable, high-precision scenario mining.
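The error-feedback loop described in this entry can be sketched roughly as follows. This is an illustrative outline only, not the RefAV or report code: `generate_with_retries`, the prompt wording, and the `fake_llm` stand-in are hypothetical, and a real system would call an actual LLM and execute generated code in a sandbox.

```python
import traceback

def generate_with_retries(query, llm, max_attempts=3):
    """Ask `llm` for code, run it, and re-prompt with the error message on failure."""
    prompt = query
    for attempt in range(1, max_attempts + 1):
        code = llm(prompt)
        namespace = {}
        try:
            exec(code, namespace)  # run untrusted code only inside a sandbox
            return code, namespace, attempt
        except Exception:
            error = traceback.format_exc(limit=1)
            # Feed the failing code and its error back to the model.
            prompt = (f"{query}\n\nYour previous code:\n{code}\n\n"
                      f"It failed with:\n{error}\nPlease fix it.")
    raise RuntimeError(f"no runnable code after {max_attempts} attempts")

def fake_llm(prompt):
    """Hypothetical stand-in for an LLM: fails first, succeeds once it sees feedback."""
    if "It failed with" in prompt:
        return "result = [1, 2, 3]"
    return "result = undefined_name"

code, ns, attempts = generate_with_retries("find scenarios", fake_llm)
print(attempts, ns["result"])
```

Capping the number of attempts keeps a persistently failing query from looping forever; the cap of three here is an arbitrary choice.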

[4] Image-Based Method For Measuring And Classification Of Iron Ore Pellets Using Star-Convex Polygons

Artem Solomko,Oleg Kartashev,Andrey Golov,Mikhail Deulin,Vadim Valynkin,Vasily Kharin

Main category: cs.CV

TL;DR: Proposes a StarDist-based method for classifying iron ore pellets, addressing the insufficient classification and measurement accuracy of traditional algorithms in dense, unstable environments.

Details

Motivation: Traditional image classification and segmentation methods (such as ViT and Mask R-CNN) perform poorly on iron ore pellet classification, so a more precise method is needed to identify and analyze objects in densely packed environments.

Method: Use the StarDist algorithm for image segmentation and contour extraction, then classify the pellets and measure their physical dimensions, with a focus on detecting objects with smoothed boundaries.

Result: The new method significantly improves the accuracy of physical dimension measurements and the analysis of pellet size distribution.

Conclusion: With StarDist, the study provides an effective solution for pellet classification and measurement in complex environments.

Abstract: We present a comprehensive study on the classification of iron ore pellets, aimed at identifying quality violations in the final product, alongside the development of an innovative image-based measurement method utilizing the StarDist algorithm, which is primarily employed in the medical field. This initiative is motivated by the necessity to accurately identify and analyze objects within densely packed and unstable environments. The process involves segmenting these objects, determining their contours, classifying them, and measuring their physical dimensions. This is crucial because the size distribution and classification of pellets, such as distinguishing between nice (quality) and joint (caused by the presence of moisture or indicating a production process failure) types, are among the most significant characteristics that define the quality of the final product. Traditional algorithms, including image classification techniques using Vision Transformer (ViT), instance segmentation methods like Mask R-CNN, and various anomaly segmentation algorithms, have not yielded satisfactory results in this context. Consequently, we explored methodologies from related fields to enhance our approach. The outcome of our research is a novel method designed to detect objects with smoothed boundaries. This advancement significantly improves the accuracy of physical dimension measurements and facilitates a more precise analysis of size distribution among the iron ore pellets. By leveraging the strengths of the StarDist algorithm, we aim to provide a robust solution that addresses the challenges posed by the complex nature of pellet classification and measurement.

[5] Segment This Thing: Foveated Tokenization for Efficient Point-Prompted Segmentation

Tanner Schmidt,Richard Newcombe

Main category: cs.CV

TL;DR: STT is an efficient image segmentation model that produces a single segment from a single point prompt, using variable-resolution patch tokenization to cut computational cost.

Details

Motivation: Improve segmentation efficiency without shrinking the model, by focusing on the key region of the input image.

Method: Extract a crop centered on the prompt point and apply variable-resolution patch tokenization, downsampling patches more heavily the farther they lie from the prompt.

Result: STT performs well on segmentation benchmarks at a much lower computational cost and runs in real time on consumer hardware.

Conclusion: STT is an efficient, practical tool suited to augmented reality or robotics applications.

Abstract: This paper presents Segment This Thing (STT), a new efficient image segmentation model designed to produce a single segment given a single point prompt. Instead of following prior work and increasing efficiency by decreasing model size, we gain efficiency by foveating input images. Given an image and a point prompt, we extract a crop centered on the prompt and apply a novel variable-resolution patch tokenization in which patches are downsampled at a rate that increases with increased distance from the prompt. This approach yields far fewer image tokens than uniform patch tokenization. As a result, we can drastically reduce the computational cost of segmentation without reducing model size. Furthermore, the foveation focuses the model on the region of interest, a potentially useful inductive bias. We show that our Segment This Thing model is more efficient than prior work while remaining competitive on segmentation benchmarks. It can easily run at interactive frame rates on consumer hardware and is thus a promising tool for augmented reality or robotics applications.
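A back-of-the-envelope sketch can show why foveation reduces the token count. The layout below is hypothetical and is not taken from the paper: it assumes concentric square rings around the prompt in which both the covered side length and the patch size double at each ring, and the default sizes (`base=64`, `patch=16`, `levels=4`) are arbitrary.

```python
def foveated_token_count(base=64, patch=16, levels=4):
    """Count tokens when the patch size doubles with each ring around the prompt."""
    tokens = 0
    for k in range(levels):
        side = base * 2 ** k         # side of the square covered up to ring k
        size = patch * 2 ** k        # patch side used in ring k
        outer = (side // size) ** 2  # patches tiling the whole square at this size
        # Patches that would fall inside the square already covered by finer rings.
        inner = 0 if k == 0 else (side // 2 // size) ** 2
        tokens += outer - inner
    return tokens

def uniform_token_count(side, patch=16):
    """Count tokens for ordinary fixed-size patch tokenization of a square crop."""
    return (side // patch) ** 2

crop_side = 64 * 2 ** 3  # the four rings above cover a 512 x 512 crop
print(foveated_token_count(), uniform_token_count(crop_side))
```

Under these assumptions each extra ring adds a constant number of tokens, so the total grows with the logarithm of the crop side rather than with its square.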

[6] Gender Fairness of Machine Learning Algorithms for Pain Detection

Dylan Green,Yuting Shang,Jiaee Cheong,Yang Liu,Hatice Gunes

Main category: cs.CV

TL;DR: Studies the gender fairness of ML- and DL-based automated pain detection models and finds that all models exhibit gender bias, even though ViT achieves the highest accuracy.

Details

Motivation: Automated pain detection has great potential in healthcare, but the accuracy and fairness of these models across demographic groups (such as gender) remain under-researched.

Method: Using the UNBC-McMaster Shoulder Pain Expression Archive Database, compare the performance and fairness of traditional ML algorithms (L SVM, RBF SVM) and DL methods (CNN, ViT).

Result: ViT achieves the highest accuracy, but all models exhibit gender bias, revealing a trade-off between accuracy and fairness.

Conclusion: Fairness-aware techniques are needed to reduce bias in automated healthcare systems.

Abstract: Automated pain detection through machine learning (ML) and deep learning (DL) algorithms holds significant potential in healthcare, particularly for patients unable to self-report pain levels. However, the accuracy and fairness of these algorithms across different demographic groups (e.g., gender) remain under-researched. This paper investigates the gender fairness of ML and DL models trained on the UNBC-McMaster Shoulder Pain Expression Archive Database, evaluating the performance of various models in detecting pain based solely on the visual modality of participants' facial expressions. We compare traditional ML algorithms, Linear Support Vector Machine (L SVM) and Radial Basis Function SVM (RBF SVM), with DL methods, Convolutional Neural Network (CNN) and Vision Transformer (ViT), using a range of performance and fairness metrics. While ViT achieved the highest accuracy and performed best on a selection of fairness metrics, all models exhibited gender-based biases. These findings highlight the persistent trade-off between accuracy and fairness, emphasising the need for fairness-aware techniques to mitigate biases in automated healthcare systems.

[7] Monocular 3D Hand Pose Estimation with Implicit Camera Alignment

Christos Pantazopoulos,Spyridon Thermos,Gerasimos Potamianos

Main category: cs.CV

TL;DR: Proposes an optimization pipeline that estimates 3D hand articulation from 2D keypoint input without knowledge of the camera parameters, and performs well on several benchmarks.

Details

Motivation: Address the challenges of estimating 3D hand articulation from a single color image, such as missing depth information, occlusions, and articulation complexity.

Method: Use a keypoint alignment step and a fingertip loss to avoid any dependence on camera parameters.

Result: Performs well on the EgoDexter and Dexter+Object benchmarks and is robust on "in-the-wild" images.

Conclusion: The method performs on par with existing techniques without requiring camera parameters, and shows the importance of 2D keypoint estimation accuracy.

Abstract: Estimating the 3D hand articulation from a single color image is a continuously investigated problem with applications in Augmented Reality (AR), Virtual Reality (VR), Human-Computer Interaction (HCI), and robotics. Apart from the absence of depth information, occlusions, articulation complexity, and the need for knowledge of the camera parameters pose additional challenges. In this work, we propose an optimization pipeline for estimating the 3D hand articulation from 2D keypoint input, which includes a keypoint alignment step and a fingertip loss to overcome the need to know or estimate the camera parameters. We evaluate our approach on the EgoDexter and Dexter+Object benchmarks to showcase that our approach performs competitively with the SotA, while also demonstrating its robustness when processing "in-the-wild" images without any prior camera knowledge. Our quantitative analysis highlights the sensitivity to the 2D keypoint estimation accuracy, despite the use of hand priors. Code is available at https://github.com/cpantazop/HandRepo

[8] ContextLoss: Context Information for Topology-Preserving Segmentation

Benedict Schacht,Imke Greving,Simone Frintrop,Berit Zeller-Plumhoff,Christian Wilms

Main category: cs.CV

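TL;DR: Proposes a new loss function, ContextLoss (CLoss), that improves the topological correctness of image segmentation by considering the full context of topological errors, and validates it on several datasets.

Details

Motivation: In image segmentation, preserving the topology of segmented structures (such as vessels, membranes, or roads) is crucial, and topological errors can seriously affect real applications such as navigation. Existing methods are based on critical pixel masks but do not fully account for the context of topological errors.

Method: Propose the CLoss loss function, which includes the full context of topological errors in the critical pixel mask to sharpen the network's focus on those errors. Two intuitive metrics are also proposed to verify improved connectivity.

Result: Validated on several public datasets and the authors' own 3D nano-imaging dataset, CLoss performs better on topology-aware metrics and repairs up to 44% more missed connections than other methods.

Conclusion: By introducing context information, CLoss markedly improves topological correctness and offers a more reliable solution for image segmentation tasks.

Abstract: In image segmentation, preserving the topology of segmented structures like vessels, membranes, or roads is crucial. For instance, topological errors on road networks can significantly impact navigation. Recently proposed solutions are loss functions based on critical pixel masks that consider the whole skeleton of the segmented structures in the critical pixel mask. We propose the novel loss function ContextLoss (CLoss) that improves topological correctness by considering topological errors with their whole context in the critical pixel mask. The additional context improves the network focus on the topological errors. Further, we propose two intuitive metrics to verify improved connectivity due to a closing of missed connections. We benchmark our proposed CLoss on three public datasets (2D & 3D) and our own 3D nano-imaging dataset of bone cement lines. Training with our proposed CLoss increases performance on topology-aware metrics and repairs up to 44% more missed connections than other state-of-the-art methods. We make the code publicly available.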

[9] JAFAR: Jack up Any Feature at Any Resolution

Paul Couairon,Loick Chambon,Louis Serrano,Jean-Emmanuel Haugeard,Matthieu Cord,Nicolas Thome

Main category: cs.CV

TL;DR: JAFAR is a lightweight, flexible feature upsampler that lifts the low-resolution features of any foundation vision encoder to an arbitrary target resolution. It achieves semantic alignment through an attention module with SFT modulation and generalizes to higher output scales without high-resolution supervision.

Details

Motivation: The low-resolution outputs of foundation vision encoders fall short of the high-resolution modalities that downstream tasks require, so an efficient feature upsampling method is needed.

Method: JAFAR uses an attention-based module with SFT modulation to align semantically rich low-resolution keys with high-resolution queries, thereby upsampling the features.

Result: Experiments show that JAFAR effectively recovers fine-grained spatial details and outperforms existing methods on a range of downstream tasks.

Conclusion: JAFAR is an efficient feature upsampling solution applicable to a wide range of downstream tasks.

Abstract: Foundation Vision Encoders have become essential for a wide range of dense vision tasks. However, their low-resolution spatial feature outputs necessitate feature upsampling to produce the high-resolution modalities required for downstream tasks. In this work, we introduce JAFAR, a lightweight and flexible feature upsampler that enhances the spatial resolution of visual features from any Foundation Vision Encoder to an arbitrary target resolution. JAFAR employs an attention-based module designed to promote semantic alignment between high-resolution queries, derived from low-level image features, and semantically enriched low-resolution keys, using Spatial Feature Transform (SFT) modulation. Notably, despite the absence of high-resolution supervision, we demonstrate that learning at low upsampling ratios and resolutions generalizes remarkably well to significantly higher output scales. Extensive experiments show that JAFAR effectively recovers fine-grained spatial details and consistently outperforms existing feature upsampling methods across a diverse set of downstream tasks. Project page at https://jafar-upsampler.github.io

[10] Autonomous Computer Vision Development with Agentic AI

Jin Kim,Muhammad Wahi-Anwa,Sangyun Park,Shawn Shin,John M. Hoffman,Matthew S. Brown

Main category: cs.CV

TL;DR: Shows how an LLM-based Agentic AI approach can autonomously build a computer vision system from a natural language prompt, and applies it successfully to a medical image analysis task.

Details

Motivation: Explore the potential of Agentic AI for complex reasoning, planning, and tool configuration, reducing the manual work that data scientists do in traditional computer vision development.

Method: Extend the open-source Cognitive AI environment SimpleMind with an LLM-based agent (OpenManus) that decomposes the task from a natural language prompt, configures the tools, generates a YAML configuration file, and runs training and inference autonomously.

Result: The agent configured, trained, and tested itself on 50 chest x-ray images, with mean Dice scores of 0.96, 0.82, and 0.83 for lungs, heart, and ribs segmentation, respectively.

Conclusion: The study demonstrates the feasibility of autonomous planning and tool configuration with Agentic AI in computer vision applications, offering a new way to reduce manual intervention.

Abstract: Agentic Artificial Intelligence (AI) systems leveraging Large Language Models (LLMs) exhibit significant potential for complex reasoning, planning, and tool utilization. We demonstrate that a specialized computer vision system can be built autonomously from a natural language prompt using Agentic AI methods. This involved extending SimpleMind (SM), an open-source Cognitive AI environment with configurable tools for medical image analysis, with an LLM-based agent, implemented using OpenManus, to automate the planning (tool configuration) for a particular computer vision task. We provide a proof-of-concept demonstration that an agentic system can interpret a computer vision task prompt and plan a corresponding SimpleMind workflow by decomposing the task and configuring appropriate tools. From the user input prompt, "provide sm (SimpleMind) config for lungs, heart, and ribs segmentation for cxr (chest x-ray)", the agent LLM was able to generate the plan (tool configuration file in YAML format), and execute SM-Learn (training) and SM-Think (inference) scripts autonomously. The computer vision agent automatically configured, trained, and tested itself on 50 chest x-ray images, achieving mean Dice scores of 0.96, 0.82, and 0.83 for lungs, heart, and ribs, respectively. This work shows the potential for autonomous planning and tool configuration that has traditionally been performed by a data scientist in the development of computer vision applications.
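For readers unfamiliar with the metric reported here, the Dice score of two binary masks A and B is 2|A∩B| / (|A| + |B|). The sketch below computes it for flat lists of 0/1 values; the function name and the convention of returning 1.0 when both masks are empty are choices made for this example, not details from the paper.

```python
def dice_score(pred, target):
    """Dice = 2|A∩B| / (|A| + |B|) for two binary masks given as flat 0/1 lists."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))  # one shared pixel out of 2 + 1
```

A mean Dice score, as reported in the abstract, would be the average of this value over the test images for each structure.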

[11] FARCLUSS: Fuzzy Adaptive Rebalancing and Contrastive Uncertainty Learning for Semi-Supervised Semantic Segmentation

Ebenezer Tarubinga,Jenifer Kalafatovich

Main category: cs.CV

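TL;DR: Proposes a semi-supervised semantic segmentation framework that uses fuzzy pseudo-labels, uncertainty-aware dynamic weighting, adaptive class rebalancing, and lightweight contrastive regularization to make effective use of unlabeled data and improve segmentation of under-represented classes and ambiguous regions.

Details

Motivation: Address the under-use of unlabeled data, the worsening of class imbalance bias, and the neglect of prediction uncertainty in semi-supervised semantic segmentation.

Method: Four core components: fuzzy pseudo-labeling preserves soft class distributions; uncertainty-aware dynamic weighting adjusts per-pixel contributions via entropy; adaptive class rebalancing dynamically adjusts the losses; lightweight contrastive regularization refines the feature embeddings.

Result: Clearly outperforms existing methods on benchmarks, particularly on under-represented classes and ambiguous regions.

Conclusion: The framework turns uncertainty into a learning asset and effectively improves semi-supervised semantic segmentation.

Abstract: Semi-supervised semantic segmentation (SSSS) faces persistent challenges in effectively leveraging unlabeled data, such as ineffective utilization of pseudo-labels, exacerbation of class imbalance biases, and neglect of prediction uncertainty. Current approaches often discard uncertain regions through strict thresholding that favours dominant classes. To address these limitations, we introduce a holistic framework that transforms uncertainty into a learning asset through four principal components: (1) fuzzy pseudo-labeling, which preserves soft class distributions from top-K predictions to enrich supervision; (2) uncertainty-aware dynamic weighting, which modulates pixel-wise contributions via entropy-based reliability scores; (3) adaptive class rebalancing, which dynamically adjusts losses to counteract long-tailed class distributions; and (4) lightweight contrastive regularization, which encourages compact and discriminative feature embeddings. Extensive experiments on benchmarks demonstrate that our method outperforms current state-of-the-art approaches, achieving significant improvements in the segmentation of under-represented classes and ambiguous regions.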
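Components (1) and (2) can be illustrated with a small per-pixel sketch. The abstract does not give the exact formulas, so the renormalized top-K label and the "one minus normalized entropy" reliability score below are assumed forms chosen for illustration, and both function names are hypothetical.

```python
import math

def fuzzy_pseudo_label(probs, k=2):
    """Keep the top-k class probabilities and renormalize them to sum to 1."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return [probs[i] / mass if i in top else 0.0 for i in range(len(probs))]

def reliability_weight(probs):
    """1 - normalized entropy: 1 for a one-hot prediction, 0 for a uniform one."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(len(probs))  # assumes at least two classes

pixel_probs = [0.6, 0.3, 0.1]
print(fuzzy_pseudo_label(pixel_probs))   # soft label over the two most likely classes
print(reliability_weight(pixel_probs))   # weight applied to this pixel's loss term
```

Unlike hard thresholding, this keeps an uncertain pixel in the loss, but with a soft target and a smaller weight.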

[12] On the development of an AI performance and behavioural measures for teaching and classroom management

Andreea I. Niculescu,Jochen Ehnen,Chen Yi,Du Jiawei,Tay Chiat Pin,Joey Tianyi Zhou,Vigneshwaran Subbaraju,Teh Kah Kuan,Tran Huy Dat,John Komar,Gi Soong Chee,Kenneth Kwok

Main category: cs.CV

TL;DR: Presents a two-year research project that uses AI to analyze classroom dynamics, focusing on teacher actions and supported by multimodal sensor data.

Details

Motivation: Use AI to relieve teachers of manual analysis and provide objective feedback on classroom interactions, helping them improve their teaching strategies.

Method: Apply real-time classroom sensor data and AI techniques to extract key behavioral measures, and develop a teaching review dashboard.

Result: Produced an audio-visual dataset and novel behavioral measures; an initial evaluation found the system clear, usable, and non-judgmental.

Conclusion: The system gives teachers an objective analysis of classroom interactions, supports teaching improvement, and contributes a culturally grounded methodology to AI-based educational analytics.

Abstract: This paper presents a two-year research project focused on developing AI-driven measures to analyze classroom dynamics, with particular emphasis on teacher actions captured through multimodal sensor data. We applied real-time data from classroom sensors and AI techniques to extract meaningful insights and support teacher development. Key outcomes include a curated audio-visual dataset, novel behavioral measures, and a proof-of-concept teaching review dashboard. An initial evaluation with eight researchers from the National Institute for Education (NIE) highlighted the system's clarity, usability, and its non-judgmental, automated analysis approach -- which reduces manual workloads and encourages constructive reflection. Although the current version does not assign performance ratings, it provides an objective snapshot of in-class interactions, helping teachers recognize and improve their instructional strategies. Designed and tested in an Asian educational context, this work also contributes a culturally grounded methodology to the growing field of AI-based educational analytics.

[13] AlignHuman: Improving Motion and Fidelity via Timestep-Segment Preference Optimization for Audio-Driven Human Animation

Chao Liang,Jianwen Jiang,Wang Liao,Jiaqi Yang,Zerong Zheng,Weihong Zeng,Han Liang

Main category: cs.CV

TL;DR: The AlignHuman framework uses preference optimization and a divide-and-conquer training strategy to improve the trade-off between motion naturalness and visual fidelity in human video generation.

Details

Motivation: Current human video generation faces a conflict between motion naturalness and visual fidelity, and needs a method that optimizes both objectives together.

Method: Propose AlignHuman, which combines preference optimization with a divide-and-conquer strategy, using timestep-segment preference optimization (TPO) and two specialized LoRA modules that separately optimize motion dynamics and visual fidelity.

Result: Experiments show that AlignHuman maintains generation quality while cutting inference from 100 to 30 function evaluations (NFEs), a 3.3× speedup.

Conclusion: Through timestep-segment optimization and expert modules, AlignHuman balances motion naturalness and visual fidelity and markedly improves generation efficiency.

Abstract: Recent advancements in human video generation and animation tasks, driven by diffusion models, have achieved significant progress. However, expressive and realistic human animation remains challenging due to the trade-off between motion naturalness and visual fidelity. To address this, we propose AlignHuman, a framework that combines Preference Optimization as a post-training technique with a divide-and-conquer training strategy to jointly optimize these competing objectives. Our key insight stems from an analysis of the denoising process across timesteps: (1) early denoising timesteps primarily control motion dynamics, while (2) fidelity and human structure can be effectively managed by later timesteps, even if early steps are skipped. Building on this observation, we propose timestep-segment preference optimization (TPO) and introduce two specialized LoRAs as expert alignment modules, each targeting a specific dimension in its corresponding timestep interval. The LoRAs are trained using their respective preference data and activated in the corresponding intervals during inference to enhance motion naturalness and fidelity. Extensive experiments demonstrate that AlignHuman improves strong baselines and reduces NFEs during inference, achieving a 3.3$\times$ speedup (from 100 NFEs to 30 NFEs) with minimal impact on generation quality. Homepage: https://alignhuman.github.io/

[14] 3D-RAD: A Comprehensive 3D Radiology Med-VQA Dataset with Multi-Temporal Analysis and Diverse Diagnostic Tasks

Xiaotang Gai,Jiaxiang Liu,Yichen Li,Zijie Meng,Jian Wu,Zuozhu Liu

Main category: cs.CV

TL;DR: 3D-RAD is a large-scale medical visual question answering dataset built on CT scans that supports multiple tasks and complex reasoning. Existing models perform poorly on it, but fine-tuning improves performance significantly.

Details

Motivation: Advance 3D medical visual question answering (Med-VQA) and address the limitation of existing work to 2D imaging and narrow task coverage.

Method: Build the 3D-RAD dataset, which contains six VQA tasks, supports open- and closed-ended questions, and introduces complex reasoning challenges.

Result: Existing vision-language models generalize poorly on 3D multi-temporal tasks, but fine-tuning improves performance significantly.

Conclusion: The 3D-RAD dataset and code are publicly released to advance multimodal medical AI research and lay a foundation for 3D medical visual understanding.

Abstract: Medical Visual Question Answering (Med-VQA) holds significant potential for clinical decision support, yet existing efforts primarily focus on 2D imaging with limited task diversity. This paper presents 3D-RAD, a large-scale dataset designed to advance 3D Med-VQA using radiology CT scans. The 3D-RAD dataset encompasses six diverse VQA tasks: anomaly detection, image observation, medical computation, existence detection, static temporal diagnosis, and longitudinal temporal diagnosis. It supports both open- and closed-ended questions while introducing complex reasoning challenges, including computational tasks and multi-stage temporal analysis, to enable comprehensive benchmarking. Extensive evaluations demonstrate that existing vision-language models (VLMs), especially medical VLMs, exhibit limited generalization, particularly in multi-temporal tasks, underscoring the challenges of real-world 3D diagnostic reasoning. To drive future advancements, we release a high-quality training set 3D-RAD-T of 136,195 expert-aligned samples, showing that fine-tuning on this dataset could significantly enhance model performance. Our dataset and code, aiming to catalyze multimodal medical AI research and establish a robust foundation for 3D medical visual understanding, are publicly available at https://github.com/Tang-xiaoxiao/M3D-RAD.

[15] LLM-to-Phy3D: Physically Conform Online 3D Object Generation with LLMs

Melvin Wong,Yueming Lyu,Thiago Rios,Stefan Menzel,Yew-Soon Ong

Main category: cs.CV

TL;DR: LLM-to-Phy3D is a new method that refines LLM-generated 3D objects under physical constraints, markedly improving their physical viability and geometric novelty.

Details

Motivation: Existing LLM-to-3D models lack physical knowledge, so the 3D objects they generate are detached from real-world physical constraints, which limits their use in engineering design.

Method: Introduce an online black-box refinement loop that combines visual and physics-based evaluations to iteratively refine prompts until they yield physically viable 3D objects.

Result: In vehicle design optimization, LLM-to-Phy3D improves physical performance by 4.5% to 106.7% over conventional models.

Conclusion: LLM-to-Phy3D shows potential for Physical AI in scientific and engineering applications.

Abstract: The emergence of generative artificial intelligence (GenAI) and large language models (LLMs) has revolutionized the landscape of digital content creation in different modalities. However, its potential use in Physical AI for engineering design, where the production of physically viable artifacts is paramount, remains vastly underexplored. The absence of physical knowledge in existing LLM-to-3D models often results in outputs detached from real-world physical constraints. To address this gap, we introduce LLM-to-Phy3D, a physically conforming online 3D object generation method that enables existing LLM-to-3D models to produce physically conforming 3D objects on the fly. LLM-to-Phy3D introduces a novel online black-box refinement loop that empowers large language models (LLMs) through synergistic visual and physics-based evaluations. By delivering directional feedback in an iterative refinement process, LLM-to-Phy3D actively drives the discovery of prompts that yield 3D artifacts with enhanced physical performance and greater geometric novelty relative to reference objects, marking a substantial contribution to AI-driven generative design. Systematic evaluations of LLM-to-Phy3D, supported by ablation studies in vehicle design optimization, reveal improvements of 4.5% to 106.7% across various LLMs in producing physically conforming target-domain 3D designs over conventional LLM-to-3D models. The encouraging results suggest the potential general use of LLM-to-Phy3D in Physical AI for scientific and engineering applications.

[16] Self-Calibrating BCIs: Ranking and Recovery of Mental Targets Without Labels

Jonathan Grizou,Carlos de la Torre-Ortiz,Tuukka Ruotsalo

Main category: cs.CV

TL;DR: Proposes CURSOR, the first algorithm to recover mental targets from unlabeled EEG and image data through self-calibration, without pre-trained decoders or labeled data.

Details

Motivation: Address the problem of recovering mental targets from EEG and image data when no labels are available, filling the gap left by the absence of self-calibration methods in this area.

Method: Propose the CURSOR framework, which learns from unlabeled data to predict image similarity scores and uses them to rank stimuli and generate new ones.

Result: Experiments show that CURSOR predicts similarity scores that correlate with human perceptual judgments and generates new stimuli indistinguishable from the mental target.

Conclusion: CURSOR offers an effective solution for mental target recovery without labeled data and has potential practical value.

Abstract: We consider the problem of recovering a mental target (e.g., an image of a face) that a participant has in mind from paired EEG (i.e., brain responses) and image (i.e., perceived faces) data collected during interactive sessions without access to labeled information. The problem has been previously explored with labeled data but not via self-calibration, where labeled data is unavailable. Here, we present the first framework and an algorithm, CURSOR, that learns to recover unknown mental targets without access to labeled data or pre-trained decoders. Our experiments on naturalistic images of faces demonstrate that CURSOR can (1) predict image similarity scores that correlate with human perceptual judgments without any label information, (2) use these scores to rank stimuli against an unknown mental target, and (3) generate new stimuli indistinguishable from the unknown mental target (validated via a user study, N=53).

[17] SLRNet: A Real-Time LSTM-Based Sign Language Recognition System

Sharvari Kamble

Main category: cs.CV

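TL;DR: SLRNet is a real-time sign language recognition system built on MediaPipe Holistic and LSTM networks that recognizes ASL alphabet letters and functional words, with a validation accuracy of 86.7%.

Details

Motivation: Bridge the communication gap between the hearing-impaired community and society.

Method: Process video streams with MediaPipe Holistic and LSTM networks to perform recognition in real time.

Result: Reaches a validation accuracy of 86.7%, demonstrating the feasibility of hardware-independent gesture recognition.

Conclusion: SLRNet offers a workable approach to inclusive sign language recognition.

Abstract: Sign Language Recognition (SLR) plays a crucial role in bridging the communication gap between the hearing-impaired community and society. This paper introduces SLRNet, a real-time webcam-based ASL recognition system using MediaPipe Holistic and Long Short-Term Memory (LSTM) networks. The model processes video streams to recognize both ASL alphabet letters and functional words. With a validation accuracy of 86.7%, SLRNet demonstrates the feasibility of inclusive, hardware-independent gesture recognition.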

Linhao Yu,Xinguang Ji,Yahui Liu,Fanheng Kong,Chenxi Sun,Jingyuan Zhang,Hongzhi Zhang,V. W.,Fuzheng Zhang,Deyi Xiong

Main category: cs.CV

TL;DR: AutoCaption is an automatic framework that uses Monte Carlo Tree Search (MCTS) to generate diverse descriptive sentences for videos, addressing shortcomings of existing video captioning evaluation.

Details

Motivation: Existing video captioning evaluation suffers from inadequate key points, high data creation cost, and limited evaluation scope.

Method: Propose the AutoCaption framework, which iteratively generates diverse descriptive sentences for videos via MCTS, and use it to build the fine-grained benchmark MCTS-VCB.

Result: Among more than 20 MLLMs evaluated, Gemini-1.5-Pro scores highest (F1 = 71.2); fine-tuning InternVL2.5-8B improves its performance significantly.

Conclusion: AutoCaption makes video captioning evaluation more comprehensive, and the data it generates can be used for fine-tuning, with significant performance gains.

Abstract: Video captioning can be used to assess the video understanding capabilities of Multimodal Large Language Models (MLLMs). However, existing benchmarks and evaluation protocols suffer from crucial issues, such as inadequate or homogeneous creation of key points, exorbitant cost of data creation, and limited evaluation scopes. To address these issues, we propose an automatic framework, named AutoCaption, which leverages Monte Carlo Tree Search (MCTS) to construct numerous and diverse descriptive sentences (i.e., key points) that thoroughly represent video content in an iterative way. This iterative captioning strategy enables the continuous enhancement of video details such as actions, objects' attributes, environment details, etc. We apply AutoCaption to curate MCTS-VCB, a fine-grained video caption benchmark covering video details, thereby enabling a comprehensive evaluation of MLLMs on the video captioning task. We evaluate more than 20 open- and closed-source MLLMs of varying sizes on MCTS-VCB. Results show that MCTS-VCB can effectively and comprehensively evaluate the video captioning capability, with Gemini-1.5-Pro achieving the highest F1 score of 71.2. Interestingly, fine-tuning InternVL2.5-8B with the AutoCaption-generated data helps the model achieve an overall improvement of 25.0% on MCTS-VCB and 16.3% on DREAM-1K, further demonstrating the effectiveness of AutoCaption. The code and data are available at https://github.com/tjunlp-lab/MCTS-VCB.

[19] Digitization of Document and Information Extraction using OCR

Rasha Sinha,Rekha B S

Main category: cs.CV

TL;DR: Proposes a framework that combines OCR with LLMs to extract structured text from scanned and digital documents, clearly outperforming traditional methods.

Details

Motivation: Information must be extracted accurately from mixed-format documents, and traditional methods lack flexibility and semantic understanding.

Method: Process scanned files with OCR, parse digital files with layout-aware libraries, and analyze the extracted text with an LLM to identify key information.

Result: Presents a comparative analysis of different OCR tools; the framework performs well in accuracy, layout recognition, and processing speed.

Conclusion: The framework markedly improves the flexibility and semantic precision of text extraction and applies to many document types.

Abstract: Retrieving accurate details from documents is a crucial task, especially when handling a combination of scanned images and native digital formats. This document presents a combined framework for text extraction that merges Optical Character Recognition (OCR) techniques with Large Language Models (LLMs) to deliver structured outputs enriched by contextual understanding and confidence indicators. Scanned files are processed using OCR engines, while digital files are interpreted through layout-aware libraries. The extracted raw text is subsequently analyzed by an LLM to identify key-value pairs and resolve ambiguities. A comparative analysis of different OCR tools is presented to evaluate their effectiveness concerning accuracy, layout recognition, and processing speed. The approach demonstrates significant improvements over traditional rule-based and template-based methods, offering enhanced flexibility and semantic precision across different document categories.

[20] VIBE: Can a VLM Read the Room?

Tania Chakraborty,Eylon Caplan,Dan Goldwasser

Main category: cs.CV

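TL;DR: This paper explores the social reasoning ability of vision language models (VLMs), identifies a limitation in visual social-pragmatic inference, and proposes a new task and dataset to test VLM performance.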

Details Motivation: Understanding human social behavior, such as recognizing emotions and social dynamics, is an important but challenging problem, and existing LLMs are limited to the textual domain and cannot capture the role of non-verbal cues. Method: Proposes a new task, Visual Social-Pragmatic Inference, builds a high-quality dataset, and tests several VLMs on it. Result: VLMs are found to be limited in visual social-pragmatic inference. Conclusion: VLMs still have room to improve in social reasoning, and the new task and dataset point the way for future research. Abstract: Understanding human social behavior such as recognizing emotions and the social dynamics causing them is an important and challenging problem. While LLMs have made remarkable advances, they are limited to the textual domain and cannot account for the major role that non-verbal cues play in understanding social situations. Vision Language Models (VLMs) can potentially account for this gap; however, their ability to make correct inferences over such social cues has received little attention. In this paper, we explore the capabilities of VLMs at social reasoning. We identify a previously overlooked limitation in VLMs: the Visual Social-Pragmatic Inference gap. To target this gap, we propose a new task for VLMs: Visual Social-Pragmatic Inference. We construct a high-quality dataset to test the abilities of a VLM for this task and benchmark the performance of several VLMs on it.

[21] Synthetic Geology -- Structural Geology Meets Deep Learning

Simon Ghyselincks,Valeriia Okhmak,Stefano Zampini,George Turkiyyah,David Keyes,Eldad Haber

Main category: cs.CV

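TL;DR: Uses generative AI and synthetic data to train a neural network that generates high-fidelity 3D subsurface images from surface geological data.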

Details Motivation: Address the scarcity of subsurface data and extend the reach of surface geological data. Method: Designs a synthetic data generator that mimics geological activity and trains a neural network to generate 3D subsurface images. Result: The model generates high-fidelity images of subsurface structure from previously unseen surface data, with fidelity improving as more borehole data becomes available. Conclusion: The method offers a new tool for resource exploration, hazard assessment, and related applications, and can be further refined by fine-tuning on regional data. Abstract: Visualizing the first few kilometers of the Earth's subsurface, a long-standing challenge gating a virtually inexhaustible list of important applications, is coming within reach through deep learning. Building on techniques of generative artificial intelligence applied to voxelated images, we demonstrate a method that extends surface geological data supplemented by boreholes to a three-dimensional subsurface region by training a neural network. The Earth's land area having been extensively mapped for geological features, the bottleneck of this or any related technique is the availability of data below the surface. We close this data gap in the development of subsurface deep learning by designing a synthetic data-generator process that mimics eons of geological activity such as sediment compaction, volcanic intrusion, and tectonic dynamics to produce a virtually limitless number of samples of the near lithosphere. A foundation model trained on such synthetic data is able to generate a 3D image of the subsurface from a previously unseen map of surface topography and geology, showing increasing fidelity with increasing access to borehole data, depicting such structures as layers, faults, folds, dikes, and sills. We illustrate the early promise of the combination of a synthetic lithospheric generator with a trained neural network model using generative flow matching. Ultimately, such models will be fine-tuned on data from applicable campaigns, such as mineral prospecting in a given region. Though useful in themselves, regionally fine-tuned models may be employed not as an end but as a means: as an AI-based regularizer in a more traditional inverse problem application, in which the objective function represents the mismatch of additional data with physical models with applications in resource exploration, hazard assessment, and geotechnical engineering.

[22] Evaluating BiLSTM and CNN+GRU Approaches for Human Activity Recognition Using WiFi CSI Data

Almustapha A. Wakili,Babajide J. Asaju,Woosub Jung

Main category: cs.CV

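TL;DR: Compares BiLSTM and CNN+GRU on WiFi CSI datasets, finding that CNN+GRU performs better on UT-HAR while BiLSTM is better on NTU-Fi HAR, and stresses the importance of dataset characteristics and preprocessing.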

Details Motivation: Explore how different deep learning models perform at activity recognition on WiFi CSI data, to inform model selection and practical use. Method: Experimentally compares BiLSTM and CNN+GRU models on the UT-HAR and NTU-Fi HAR datasets. Result: CNN+GRU reaches 95.20% accuracy on UT-HAR, and BiLSTM reaches 92.05% accuracy on NTU-Fi HAR. Conclusion: Dataset characteristics and preprocessing are critical to model performance, and the models have practical potential in healthcare and smart homes. Abstract: This paper compares the performance of BiLSTM and CNN+GRU deep learning models for Human Activity Recognition (HAR) on two WiFi-based Channel State Information (CSI) datasets: UT-HAR and NTU-Fi HAR. The findings indicate that the CNN+GRU model has a higher accuracy on the UT-HAR dataset (95.20%) thanks to its ability to extract spatial features. In contrast, the BiLSTM model performs better on the high-resolution NTU-Fi HAR dataset (92.05%) by extracting long-term temporal dependencies more effectively. The findings strongly emphasize the critical role of dataset characteristics and preprocessing techniques in model performance improvement. We also show the real-world applicability of such models in applications like healthcare and intelligent home systems, highlighting their potential for unobtrusive activity recognition.

[23] Test-Time-Scaling for Zero-Shot Diagnosis with Visual-Language Reasoning

Ji Young Byun,Young-Jin Park,Navid Azizan,Rama Chellappa

Main category: cs.CV

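TL;DR: Proposes a zero-shot framework that uses test-time scaling to strengthen the reasoning of large language models (LLMs) in medical image diagnosis and improve diagnostic accuracy.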

Details Motivation: Visual question answering (VQA) and reasoning-based diagnosis in medical imaging remain underexplored, and supervised fine-tuning is impractical because of limited data and high annotation costs. Method: Combines a vision-language model with an LLM and uses a test-time scaling strategy to consolidate multiple candidate outputs into a reliable diagnosis. Result: The method is validated across multiple medical imaging modalities and improves diagnostic accuracy and classification reliability. Conclusion: The framework offers an efficient, reliable solution for medical image diagnosis, especially when data is limited. Abstract: As a cornerstone of patient care, clinical decision-making significantly influences patient outcomes and can be enhanced by large language models (LLMs). Although LLMs have demonstrated remarkable performance, their application to visual question answering in medical imaging, particularly for reasoning-based diagnosis, remains largely unexplored. Furthermore, supervised fine-tuning for reasoning tasks is largely impractical due to limited data availability and high annotation costs. In this work, we introduce a zero-shot framework for reliable medical image diagnosis that enhances the reasoning capabilities of LLMs in clinical settings through test-time scaling. Given a medical image and a corresponding textual prompt, a vision-language model generates multiple descriptions or interpretations of visual features. These interpretations are then fed to an LLM, where a test-time scaling strategy consolidates multiple candidate outputs into a reliable final diagnosis. We evaluate our approach across various medical imaging modalities -- including radiology, ophthalmology, and histopathology -- and demonstrate that the proposed test-time scaling strategy enhances diagnostic accuracy for both our and baseline methods. Additionally, we provide an empirical analysis showing that the proposed approach, which allows unbiased prompting in the first stage, improves the reliability of LLM-generated diagnoses and enhances classification accuracy.
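
The consolidation step can be illustrated with a simple majority vote over candidate diagnoses. This is only a sketch of one common test-time scaling strategy: the abstract does not say which consolidation rule the authors use, so the voting rule, the function name `consolidate`, and the example labels are all assumptions.

```python
from collections import Counter


def consolidate(candidates: list[str]) -> str:
    """Pick the most frequent candidate answer (assumed majority-vote rule).

    Answers are normalized by trimming whitespace and lowercasing so that
    trivially different spellings count as the same vote. Ties are broken
    in favor of the answer that appeared first.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    normalized = [c.strip().lower() for c in candidates]
    winner, _count = Counter(normalized).most_common(1)[0]
    return winner


# Hypothetical candidate outputs sampled from an LLM for one image.
final = consolidate(["Pneumonia", "pneumonia ", "normal"])
```

Sampling more candidates costs more inference time but makes the vote less sensitive to any single unreliable output, which is the basic trade-off behind test-time scaling.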

[24] Towards a general-purpose foundation model for fMRI analysis

Cheng Wang,Yu Jiang,Zhihao Peng,Chenxin Li,Changbae Bang,Lin Zhao,Jinglei Lv,Jorge Sepulcre,Carl Yang,Lifang He,Tianming Liu,Daniel Barron,Quanzheng Li,Randy Hirschtick,Byung-Hoon Kim,Xiang Li,Yixuan Yuan

Main category: cs.CV

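TL;DR: NeuroSTORM is a general-purpose fMRI analysis framework that, through pre-training and task-specific prompt tuning, significantly improves cross-task and cross-center performance.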

Details Motivation: Current fMRI analysis methods have reproducibility and transferability problems, which NeuroSTORM aims to solve. Method: Uses a Mamba backbone and a shifted scanning strategy to learn directly from 4D fMRI data, and proposes a spatial-temporal optimized pre-training approach. Result: Outperforms existing methods on five tasks and shows clinical utility on hospital data from several countries. Conclusion: NeuroSTORM provides a standardized, open-source foundation model for fMRI research that improves reproducibility and transferability. Abstract: Functional Magnetic Resonance Imaging (fMRI) is essential for studying brain function and diagnosing neurological disorders, but current analysis methods face reproducibility and transferability issues due to complex pre-processing and task-specific models. We introduce NeuroSTORM (Neuroimaging Foundation Model with Spatial-Temporal Optimized Representation Modeling), a generalizable framework that directly learns from 4D fMRI volumes and enables efficient knowledge transfer across diverse applications. NeuroSTORM is pre-trained on 28.65 million fMRI frames (>9,000 hours) from over 50,000 subjects across multiple centers and ages 5 to 100. Using a Mamba backbone and a shifted scanning strategy, it efficiently processes full 4D volumes. We also propose a spatial-temporal optimized pre-training approach and task-specific prompt tuning to improve transferability. NeuroSTORM outperforms existing methods across five tasks: age/gender prediction, phenotype prediction, disease diagnosis, fMRI-to-image retrieval, and task-based fMRI classification. It demonstrates strong clinical utility on datasets from hospitals in the U.S., South Korea, and Australia, achieving top performance in disease diagnosis and cognitive phenotype prediction. NeuroSTORM provides a standardized, open-source foundation model to improve reproducibility and transferability in fMRI-based clinical research.

[25] WaveFormer: A Lightweight Transformer Model for sEMG-based Gesture Recognition

Yanlong Chen,Mattia Orlandi,Pierangelo Maria Rapa,Simone Benatti,Luca Benini,Yawei Li

Main category: cs.CV

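TL;DR: WaveFormer is a lightweight transformer-based architecture for sEMG gesture recognition that combines time-domain and frequency-domain features through a learnable wavelet transform, significantly improving classification accuracy and enabling real-time deployment on resource-constrained devices.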

Details Motivation: Address the heavy computation and difficult deployment of traditional deep learning models for sEMG gesture recognition, while improving classification accuracy on similar gestures. Method: Proposes the WaveFormer architecture, which combines time-domain and frequency-domain features and uses a WaveletConv module (a multi-level wavelet decomposition layer with depthwise separable convolution) for efficient feature extraction. Result: Reaches 95% classification accuracy on the EPN612 dataset with only 3.1 million parameters, and achieves 6.75 ms inference latency on an Intel CPU after INT8 quantization. Conclusion: WaveFormer is both lightweight and accurate, making it suitable for resource-constrained embedded systems. Abstract: Human-machine interaction, particularly in prosthetic and robotic control, has seen progress with gesture recognition via surface electromyographic (sEMG) signals. However, classifying similar gestures that produce nearly identical muscle signals remains a challenge, often reducing classification accuracy. Traditional deep learning models for sEMG gesture recognition are large and computationally expensive, limiting their deployment on resource-constrained embedded systems. In this work, we propose WaveFormer, a lightweight transformer-based architecture tailored for sEMG gesture recognition. Our model integrates time-domain and frequency-domain features through a novel learnable wavelet transform, enhancing feature extraction. In particular, the WaveletConv module, a multi-level wavelet decomposition layer with depthwise separable convolution, ensures both efficiency and compactness. With just 3.1 million parameters, WaveFormer achieves 95% classification accuracy on the EPN612 dataset, outperforming larger models. Furthermore, when profiled on a laptop equipped with an Intel CPU, INT8 quantization achieves real-time deployment with a 6.75 ms inference latency.
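
To make "multi-level wavelet decomposition" concrete, the sketch below applies a fixed Haar wavelet to a 1D signal. Note the substitution: WaveFormer is described as using a learnable wavelet transform, whereas this example uses the fixed orthonormal Haar filters purely for illustration. The function names and the sample signal are hypothetical.

```python
import math


def haar_step(signal: list[float]) -> tuple[list[float], list[float]]:
    """One Haar level: pairwise scaled sums (approximation) and
    pairwise scaled differences (detail).

    If the length is odd, the last sample is ignored.
    """
    s = 1.0 / math.sqrt(2.0)
    pairs = range(0, len(signal) - 1, 2)
    approx = [(signal[i] + signal[i + 1]) * s for i in pairs]
    detail = [(signal[i] - signal[i + 1]) * s for i in pairs]
    return approx, detail


def haar_decompose(signal: list[float], levels: int):
    """Repeatedly split the approximation band, collecting detail bands."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        if len(approx) < 2:
            break  # nothing left to split
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


# Hypothetical 4-sample window; a constant signal has no detail energy.
coarse, bands = haar_decompose([1.0, 1.0, 1.0, 1.0], levels=2)
```

Each level halves the temporal resolution of the approximation band while the detail bands retain progressively lower-frequency changes, which is how a single decomposition exposes both time-domain and frequency-domain structure.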

[26] Teaching in adverse scenes: a statistically feedback-driven threshold and mask adjustment teacher-student framework for object detection in UAV images under adverse scenes

Hongyu Chen,Jiping Liu,Yong Wang,Jun Zhu,Dejun Feng,Yakun Xie

Main category: cs.CV

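TL;DR: This paper proposes SF-TMAT, an unsupervised domain adaptation method for UAV object detection in adverse scenes that dynamically adjusts the mask ratio and the pseudo-label threshold, significantly improving performance.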

Details Motivation: Existing UDA methods are mostly based on natural images or clear UAV imagery, UAV imagery under adverse conditions is understudied, and existing methods have problems with feature alignment and pseudo-label quality. Method: Proposes the SF-TMAT framework, comprising DSFMA (dynamic mask-ratio adjustment and feature reconstruction) and VFST (dynamic pseudo-label threshold adjustment), to improve feature learning and pseudo-label quality. Result: Experiments show that SF-TMAT performs better at UAV object detection in adverse scenes and generalizes well. Conclusion: SF-TMAT offers an effective solution for UAV object detection in adverse scenes, and the code is open source. Abstract: Unsupervised Domain Adaptation (UDA) has shown promise in effectively alleviating the performance degradation caused by domain gaps between source and target domains, and it can potentially be generalized to UAV object detection in adverse scenes. However, existing UDA studies are based on natural images or clear UAV imagery, and research focused on UAV imagery in adverse conditions is still in its infancy. Moreover, due to the unique perspective of UAVs and the interference from adverse conditions, these methods often fail to accurately align features and are influenced by limited or noisy pseudo-labels. To address this, we propose the first benchmark for UAV object detection in adverse scenes, the Statistical Feedback-Driven Threshold and Mask Adjustment Teacher-Student Framework (SF-TMAT). Specifically, SF-TMAT introduces a design called Dynamic Step Feedback Mask Adjustment Autoencoder (DSFMA), which dynamically adjusts the mask ratio and reconstructs feature maps by integrating training progress and loss feedback. This approach dynamically adjusts the learning focus at different training stages to meet the model's needs for learning features at varying levels of granularity. Additionally, we propose a unique Variance Feedback Smoothing Threshold (VFST) strategy, which statistically computes the mean confidence of each class and dynamically adjusts the selection threshold by incorporating a variance penalty term. This strategy improves the quality of pseudo-labels and uncovers potentially valid labels, thus mitigating domain bias. Extensive experiments demonstrate the superiority and generalization capability of the proposed SF-TMAT in UAV object detection under adverse scene conditions. The code is released at https://github.com/ChenHuyoo.
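
The idea of a per-class threshold built from the mean confidence and a variance penalty can be sketched as follows. The abstract does not give the VFST formula, so everything specific here is an assumption: the form `mean - penalty * variance`, the sign of the penalty, the clamping bounds, and all names and numbers are hypothetical.

```python
from statistics import fmean, pvariance


def class_thresholds(
    confidences_by_class: dict[str, list[float]],
    penalty: float = 1.0,
    floor: float = 0.05,
    ceil: float = 0.95,
) -> dict[str, float]:
    """Assumed rule: threshold = mean confidence - penalty * variance,
    clamped to [floor, ceil]. Classes with no detections are skipped.
    """
    thresholds = {}
    for cls, confs in confidences_by_class.items():
        if not confs:
            continue
        raw = fmean(confs) - penalty * pvariance(confs)
        thresholds[cls] = min(ceil, max(floor, raw))
    return thresholds


# Hypothetical teacher confidences for two classes in one batch.
thresholds = class_thresholds({"car": [0.8, 0.8], "bus": [0.6, 1.0]})
```

Under this assumed rule, a class whose confidences are widely spread gets a lower threshold than a class with the same mean but stable confidences, which lets more of its candidate pseudo-labels through.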

[27] BrainMAP: Multimodal Graph Learning For Efficient Brain Disease Localization

Nguyen Linh Dan Le,Jing Ren,Ciyuan Peng,Chengyao Xie,Bowen Li,Feng Xia

Main category: cs.CV

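TL;DR: BrainMAP is a novel multimodal graph learning framework for efficient and precise identification of brain regions associated with neurodegenerative diseases, with significantly lower computational overhead.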

Details Motivation: Existing graph learning methods struggle to localize the brain regions associated with neurodegenerative diseases, and multimodal brain graph models are computationally expensive, which limits practical use. Method: BrainMAP extracts critical brain subgraphs with a filtering approach guided by the AAL atlas, and fuses fMRI and DTI data using cross-node attention and an adaptive gating mechanism. Result: Experiments show that BrainMAP is more computationally efficient than existing methods without loss of predictive accuracy. Conclusion: BrainMAP offers an efficient and precise solution for research on neurodegenerative diseases. Abstract: Recent years have seen a surge in research focused on leveraging graph learning techniques to detect neurodegenerative diseases. However, existing graph-based approaches typically lack the ability to localize and extract the specific brain regions driving neurodegenerative pathology within the full connectome. Additionally, recent works on multimodal brain graph models often suffer from high computational complexity, limiting their practical use in resource-constrained devices. In this study, we present BrainMAP, a novel multimodal graph learning framework designed for precise and computationally efficient identification of brain regions affected by neurodegenerative diseases. First, BrainMAP utilizes an atlas-driven filtering approach guided by the AAL atlas to pinpoint and extract critical brain subgraphs. Unlike recent state-of-the-art methods, which model the entire brain network, BrainMAP achieves more than 50% reduction in computational overhead by concentrating on disease-relevant subgraphs. Second, we employ an advanced multimodal fusion process comprising cross-node attention to align functional magnetic resonance imaging (fMRI) and diffusion tensor imaging (DTI) data, coupled with an adaptive gating mechanism to blend and integrate these modalities dynamically. Experimental results demonstrate that BrainMAP outperforms state-of-the-art methods in computational efficiency, without compromising predictive accuracy.
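
A gating mechanism for blending two modalities is often written as a learned convex combination. The sketch below shows that generic pattern for one node's fMRI and DTI feature vectors; it is not BrainMAP's actual layer. The scalar gate, the weight layout, and all names and values are assumptions made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    fmri: list[float],
    dti: list[float],
    gate_weights: list[float],
    gate_bias: float = 0.0,
) -> list[float]:
    """Blend two feature vectors with a single gate g in (0, 1).

    g = sigmoid(w . [fmri; dti] + b), fused = g * fmri + (1 - g) * dti.
    gate_weights would normally be learned; here they are passed in.
    """
    if len(fmri) != len(dti):
        raise ValueError("feature vectors must have the same length")
    if len(gate_weights) != 2 * len(fmri):
        raise ValueError("gate_weights must cover the concatenated input")
    score = sum(w * x for w, x in zip(gate_weights, fmri + dti)) + gate_bias
    g = sigmoid(score)
    return [g * a + (1.0 - g) * b for a, b in zip(fmri, dti)]


# With all-zero gate weights the gate is 0.5, i.e. a plain average.
fused = gated_fusion([1.0, 3.0], [3.0, 1.0], [0.0, 0.0, 0.0, 0.0])
```

Because the gate depends on the inputs themselves, the blend can lean toward whichever modality is more informative for a given node, rather than using one fixed mixing weight for the whole graph.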

[28] Enhanced Vehicle Speed Detection Considering Lane Recognition Using Drone Videos in California

Amirali Ataee Naeini,Ashkan Teymouri,Ghazaleh Jafarsalehi,Michael Zhang

Main category: cs.CV

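TL;DR: This paper proposes a fine-tuned YOLOv11-based model that improves the accuracy of vehicle speed and lane detection, with particular attention to HOV lanes and the classification of heavy vehicles.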

Details Motivation: Vehicle numbers in California are rising while transportation systems are inadequate and speed cameras are sparse, so more effective vehicle speed detection is needed. Method: Fine-tunes a YOLOv11 model on almost 800 bird's-eye view images, identifies each vehicle's lane, and classifies vehicles as cars or heavy vehicles. Result: The model achieves an MAE of 0.97 mph and an MSE of 0.94 mph². Conclusion: The fine-tuned YOLOv11 model effectively addresses vehicle speed detection and classification and is suitable for traffic monitoring. Abstract: The increase in vehicle numbers in California, driven by inadequate transportation systems and sparse speed cameras, necessitates effective vehicle speed detection. Detecting vehicle speeds per lane is critical for monitoring High-Occupancy Vehicle (HOV) lane speeds, distinguishing between cars and heavy vehicles with differing speed limits, and enforcing lane restrictions for heavy vehicles. While prior works utilized YOLO (You Only Look Once) for vehicle speed detection, they often lacked accuracy, failed to identify vehicle lanes, and offered limited or less practical classification categories. This study introduces a fine-tuned YOLOv11 model, trained on almost 800 bird's-eye view images, to enhance vehicle speed detection accuracy, which is much higher than in previous works. The proposed system identifies the lane for each vehicle and classifies vehicles into two categories: cars and heavy vehicles. Designed to meet the specific requirements of traffic monitoring and regulation, the model also evaluates the effects of factors such as drone height, distance of Region of Interest (ROI), and vehicle speed on detection accuracy and speed measurement. Drone footage collected from Northern California was used to assess the proposed system. The fine-tuned YOLOv11 achieved its best performance with a mean absolute error (MAE) of 0.97 mph and mean squared error (MSE) of 0.94 $\text{mph}^2$, demonstrating its efficacy in addressing challenges in vehicle speed detection and classification.
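
The two reported error metrics follow standard definitions, sketched below for speed estimates in mph. The speed values are hypothetical and are not taken from the study.

```python
def mean_absolute_error(predicted: list[float], actual: list[float]) -> float:
    """Average of |predicted - actual|; same unit as the inputs (mph)."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def mean_squared_error(predicted: list[float], actual: list[float]) -> float:
    """Average of (predicted - actual)^2; unit is the input unit squared (mph^2)."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical estimated vs. ground-truth speeds in mph.
estimated = [61.0, 64.0]
ground_truth = [60.0, 65.0]
mae = mean_absolute_error(estimated, ground_truth)
mse = mean_squared_error(estimated, ground_truth)
```

MSE weights large errors more heavily than MAE, so an MSE below the MAE, as reported here, is only possible when most individual errors are under 1 mph.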

[29] Lifting Data-Tracing Machine Unlearning to Knowledge-Tracing for Foundation Models

Yuwen Tan,Boqing Gong

Main category: cs.CV

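TL;DR: This paper proposes lifting data-tracing machine unlearning to knowledge-tracing unlearning for foundation models (FMs), to meet practical needs and draw on insights from cognitive studies.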

Details Motivation: Traditional data-tracing unlearning cannot meet the diverse unlearning requests made of foundation models, and knowledge-tracing unlearning aligns more closely with how the human brain forgets. Method: Proposes the knowledge-tracing unlearning paradigm and illustrates it with a concrete case study on a vision-language foundation model. Result: Knowledge-tracing unlearning is better suited to the diverse needs of foundation models and more consistent with human forgetting. Conclusion: Knowledge-tracing unlearning is a better paradigm for foundation model unlearning, with both practical value and theoretical support. Abstract: Machine unlearning removes certain training data points and their influence on AI models (e.g., when a data owner revokes their decision to allow models to learn from the data). In this position paper, we propose to lift data-tracing machine unlearning to knowledge-tracing for foundation models (FMs). We support this position based on practical needs and insights from cognitive studies. Practically, tracing data cannot meet the diverse unlearning requests for FMs, which may be from regulators, enterprise users, product teams, etc., having no access to FMs' massive training data. Instead, it is convenient for these parties to issue an unlearning request about the knowledge or capability FMs (should not) possess. Cognitively, knowledge-tracing unlearning aligns with how the human brain forgets more closely than tracing individual training data points. Finally, we provide a concrete case study about a vision-language FM to illustrate how an unlearner might instantiate the knowledge-tracing machine unlearning paradigm.

[30] TARDIS STRIDE: A Spatio-Temporal Road Image Dataset for Exploration and Autonomy

Héctor Carrión,Yutong Bai,Víctor A. Hernández Castro,Kishan Panaganti,Ayush Zenith,Matthew Trang,Tony Zhang,Pietro Perona,Jitendra Malik

Main category: cs.CV

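TL;DR: The paper presents STRIDE, a spatio-temporal road image dataset built from 360-degree panoramic imagery for modeling spatial and temporal relationships in dynamic environments. Based on this dataset, it develops TARDIS, a transformer-based generative model that shows strong performance across a range of tasks.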

Details Motivation: Real-world environments change dynamically across space and time, and existing methods struggle to model them effectively. Method: Introduces the STRIDE dataset, which turns panoramic imagery into observation, state, and action nodes, and develops the TARDIS model, which integrates spatio-temporal dynamics in an autoregressive framework. Result: Performs strongly on tasks such as controllable image synthesis, instruction following, autonomous control, and georeferencing. Conclusion: STRIDE and TARDIS point to a new direction for developing generalist agents with spatio-temporal understanding. Abstract: World models aim to simulate environments and enable effective agent behavior. However, modeling real-world environments presents unique challenges as they dynamically change across both space and, crucially, time. To capture these composed dynamics, we introduce a Spatio-Temporal Road Image Dataset for Exploration (STRIDE) permuting 360-degree panoramic imagery into rich interconnected observation, state and action nodes. Leveraging this structure, we can simultaneously model the relationship between egocentric views, positional coordinates, and movement commands across both space and time. We benchmark this dataset via TARDIS, a transformer-based generative world model that integrates spatial and temporal dynamics through a unified autoregressive framework trained on STRIDE. We demonstrate robust performance across a range of agentic tasks such as controllable photorealistic image synthesis, instruction following, autonomous self-control, and state-of-the-art georeferencing. These results suggest a promising direction towards sophisticated generalist agents--capable of understanding and manipulating the spatial and temporal aspects of their material environments--with enhanced embodied reasoning capabilities. Training code, datasets, and model checkpoints are made available at https://huggingface.co/datasets/Tera-AI/STRIDE.

[31] HyBiomass: Global Hyperspectral Imagery Benchmark Dataset for Evaluating Geospatial Foundation Models in Forest Aboveground Biomass Estimation

Aaron Banze,Timothée Stassin,Nassim Ait Ali Braham,Rıdvan Salih Kuzu,Simon Besnard,Michael Schmitt

Main category: cs.CV

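TL;DR: The paper presents a globally distributed dataset for forest aboveground biomass (AGB) estimation to evaluate geospatial foundation models (Geo-FMs), filling gaps in the task diversity and geographic coverage of existing benchmark datasets.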

Details Motivation: Existing benchmark datasets are mostly limited to specific tasks and geographic regions, so geospatial foundation models have not been evaluated across diverse tasks and at global scale. Method: Combines hyperspectral imagery from the EnMAP satellite with AGB density estimates from GEDI lidar to build a globally distributed dataset for a pixel-wise regression task. Result: Experiments show that Geo-FMs can surpass a baseline U-Net in some cases, and that the performance difference depends on dataset size and on the token patch size of the Vision Transformer. Conclusion: The dataset supports the development and evaluation of Geo-FMs and enables research into geographic bias and generalization. Abstract: Comprehensive evaluation of geospatial foundation models (Geo-FMs) requires benchmarking across diverse tasks, sensors, and geographic regions. However, most existing benchmark datasets are limited to segmentation or classification tasks, and focus on specific geographic areas. To address this gap, we introduce a globally distributed dataset for forest aboveground biomass (AGB) estimation, a pixel-wise regression task. This benchmark dataset combines co-located hyperspectral imagery (HSI) from the Environmental Mapping and Analysis Program (EnMAP) satellite and predictions of AGB density estimates derived from the Global Ecosystem Dynamics Investigation lidars, covering seven continental regions. Our experimental results on this dataset demonstrate that the evaluated Geo-FMs can match or, in some cases, surpass the performance of a baseline U-Net, especially when fine-tuning the encoder. We also find that the performance difference between the U-Net and Geo-FMs depends on the dataset size for each region and highlight the importance of the token patch size in the Vision Transformer backbone for accurate predictions in pixel-wise regression tasks. By releasing this globally distributed hyperspectral benchmark dataset, we aim to facilitate the development and evaluation of Geo-FMs for HSI applications. Leveraging this dataset additionally enables research into geographic bias and generalization capacity of Geo-FMs. The dataset and source code will be made publicly available.

[32] GynSurg: A Comprehensive Gynecology Laparoscopic Surgery Dataset

Sahar Nasirihaghighi,Negin Ghamsarian,Leonie Peschek,Matteo Munari,Heinrich Husslein,Raphael Sznitman,Klaus Schoeffmann

Main category: cs.CV

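TL;DR: GynSurg is a large-scale multi-task dataset for gynecologic laparoscopic surgery that addresses the small scale, narrow task focus, or insufficient annotation of existing datasets and supports applications such as surgical scene understanding and action recognition.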

Details Motivation: Existing gynecologic laparoscopic surgery datasets are limited in scale, narrow in task focus, or insufficiently annotated, which restricts their use for comprehensive workflow analysis. Method: Introduces the GynSurg dataset with rich multi-task annotations and assesses its quality under a standardized training protocol. Result: GynSurg is the largest and most diverse gynecologic laparoscopic surgery dataset to date and supports tasks such as action recognition and semantic segmentation. Conclusion: The release of GynSurg will advance intelligent systems for gynecologic laparoscopic surgery and support surgical analysis and discovery. Abstract: Recent advances in deep learning have transformed computer-assisted intervention and surgical video analysis, driving improvements not only in surgical training, intraoperative decision support, and patient outcomes, but also in postoperative documentation and surgical discovery. Central to these developments is the availability of large, high-quality annotated datasets. In gynecologic laparoscopy, surgical scene understanding and action recognition are fundamental for building intelligent systems that assist surgeons during operations and provide deeper analysis after surgery. However, existing datasets are often limited by small scale, narrow task focus, or insufficiently detailed annotations, limiting their utility for comprehensive, end-to-end workflow analysis. To address these limitations, we introduce GynSurg, the largest and most diverse multi-task dataset for gynecologic laparoscopic surgery to date. GynSurg provides rich annotations across multiple tasks, supporting applications in action recognition, semantic segmentation, surgical documentation, and discovery of novel procedural insights. We demonstrate the dataset quality and versatility by benchmarking state-of-the-art models under a standardized training protocol. To accelerate progress in the field, we publicly release the GynSurg dataset and its annotations.

[33] A Watermark for Auto-Regressive Image Generation Models

Yihan Wu,Xuehao Cui,Ruibo Chen,Georgios Milis,Heng Huang

Main category: cs.CV

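TL;DR: This paper proposes C-reweight, a novel distortion-free watermarking method that addresses retokenization mismatch in image generation models and improves the reliability and security of image generation.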

Details Motivation: The rapid development of image generation models brings risks of misuse, such as deepfakes and the fabrication of misleading visual evidence, so effective authenticity verification mechanisms are needed. Method: Using a clustering-based strategy, C-reweight treats tokens in the same cluster as equivalent, which reduces retokenization mismatch while preserving image quality. Result: Evaluations on leading image generation platforms show that C-reweight maintains the visual quality of generated images and improves detectability over existing techniques. Conclusion: C-reweight sets a new standard for secure and trustworthy image synthesis and overcomes the limitations of traditional watermarking techniques in image generation models. Abstract: The rapid evolution of image generation models has revolutionized visual content creation, enabling the synthesis of highly realistic and contextually accurate images for diverse applications. However, the potential for misuse, such as deepfake generation, image-based phishing attacks, and fabrication of misleading visual evidence, underscores the need for robust authenticity verification mechanisms. While traditional statistical watermarking techniques have proven effective for autoregressive language models, their direct adaptation to image generation models encounters significant challenges due to a phenomenon we term retokenization mismatch, a disparity between original and retokenized sequences during the image generation process. To overcome this limitation, we propose C-reweight, a novel, distortion-free watermarking method explicitly designed for image generation models. By leveraging a clustering-based strategy that treats tokens within the same cluster equivalently, C-reweight mitigates retokenization mismatch while preserving image fidelity. Extensive evaluations on leading image generation platforms reveal that C-reweight not only maintains the visual quality of generated images but also improves detectability over existing distortion-free watermarking techniques, setting a new standard for secure and trustworthy image synthesis.

[34] Scalable Context-Preserving Model-Aware Deep Clustering for Hyperspectral Images

Xianlu Li,Nicolas Nadisic,Shaoguang Huang,Nikos Deligiannis,Aleksandra Pižurica

Main category: cs.CV

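TL;DR: Proposes a scalable, context-preserving deep clustering method based on basis representation for efficient hyperspectral image clustering that captures both local and non-local structures.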

Details Motivation: Existing methods have high computational complexity (O(n^2)), consider only local or only non-local structure, and cannot effectively supervise the entire clustering process. Method: Uses a one-stage framework that jointly optimizes a spatial smoothness constraint (local structure) and a mini-cluster-based scheme (non-local structure). Result: Time and space complexity drop to O(n), and the method outperforms existing techniques on real-world datasets. Conclusion: The method is efficient and performs well, making it suitable for large-scale hyperspectral image clustering. Abstract: Subspace clustering has become widely adopted for the unsupervised analysis of hyperspectral images (HSIs). Recent model-aware deep subspace clustering methods often use a two-stage framework, involving the calculation of a self-representation matrix with complexity of O(n^2), followed by spectral clustering. However, these methods are computationally intensive, generally incorporating solely either local or non-local spatial structure constraints, and their structural constraints fall short of effectively supervising the entire clustering process. We propose a scalable, context-preserving deep clustering method based on basis representation, which jointly captures local and non-local structures for efficient HSI clustering. To preserve local structure (i.e., spatial continuity within subspaces), we introduce a spatial smoothness constraint that aligns clustering predictions with their spatially filtered versions. For non-local structure (i.e., spectral continuity), we employ a mini-cluster-based scheme that refines predictions at the group level, encouraging spectrally similar pixels to belong to the same subspace. Notably, these two constraints are jointly optimized to reinforce each other. Specifically, our model is designed as a one-stage approach in which the structural constraints are applied to the entire clustering process. The time and space complexity of our method is O(n), making it applicable to large-scale HSI data. Experiments on real-world datasets show that our method outperforms state-of-the-art techniques. Our code is available at: https://github.com/lxlscut/SCDSC
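
The phrase "aligns clustering predictions with their spatially filtered versions" can be read as a penalty on the gap between each pixel's soft cluster assignment and the average assignment in its neighborhood. The sketch below implements that reading with a box (mean) filter and a squared-error penalty. Both choices, along with the function names and the toy grid, are assumptions; the paper's actual filter and loss are not given in this entry.

```python
Grid = list[list[list[float]]]  # height x width x cluster probabilities


def mean_filter(pred: Grid, radius: int = 1) -> Grid:
    """Average each pixel's probability vector over its square neighborhood,
    truncating the window at the image border."""
    h, w, k = len(pred), len(pred[0]), len(pred[0][0])
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            neigh = [
                pred[a][b]
                for a in range(max(0, i - radius), min(h, i + radius + 1))
                for b in range(max(0, j - radius), min(w, j + radius + 1))
            ]
            row.append([sum(p[c] for p in neigh) / len(neigh) for c in range(k)])
        out.append(row)
    return out


def smoothness_loss(pred: Grid, radius: int = 1) -> float:
    """Mean squared gap between predictions and their filtered version."""
    filtered = mean_filter(pred, radius)
    total, count = 0.0, 0
    for row_p, row_f in zip(pred, filtered):
        for p, f in zip(row_p, row_f):
            total += sum((a - b) ** 2 for a, b in zip(p, f))
            count += 1
    return total / count


# Toy 2x2 image with 2 clusters: one pixel disagrees with its neighbors.
noisy = [[[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]]
loss = smoothness_loss(noisy)
```

The filter touches a fixed-size window per pixel, so this term costs O(n) in the number of pixels, consistent with the linear complexity the abstract emphasizes.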

[35] Enhance Multimodal Consistency and Coherence for Text-Image Plan Generation

Xiaoxin Lu,Ranran Haoran Zhang,Yusen Zhang,Rui Zhang

Main category: cs.CV

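TL;DR: This paper proposes a novel framework for generating and refining text-image plans, addressing the challenges of cross-modal consistency and coherence among visual steps.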

Details Motivation: Existing research focuses mainly on textual plan generation, and the potential of multimodal (text-image) plans remains underexplored. Method: The framework generates and refines plans step by step: (1) draft the textual step; (2) edit the visual step; (3) extract visual information; (4) refine the draft using that information. Result: Experiments show that the method performs well across several backbone models, and a new benchmark and evaluation metrics are provided. Conclusion: The framework offers an effective and broadly applicable solution for multimodal plan generation. Abstract: People get informed of a daily task plan through diverse media involving both texts and images. However, most prior research only focuses on LLMs' capability of textual plan generation. The potential of large-scale models in providing text-image plans remains understudied. Generating high-quality text-image plans faces two main challenges: ensuring consistent alignment between two modalities and keeping coherence among visual steps. To address these challenges, we propose a novel framework that generates and refines text-image plans step-by-step. At each iteration, our framework (1) drafts the next textual step based on the prediction history; (2) edits the last visual step to obtain the next one; (3) extracts PDDL-like visual information; and (4) refines the draft with the extracted visual information. The textual and visual steps produced in stages (4) and (2) will then serve as inputs for the next iteration. Our approach offers a plug-and-play improvement to various backbone models, such as Mistral-7B, Gemini-1.5, and GPT-4o. To evaluate the effectiveness of our approach, we collect a new benchmark consisting of 1,100 tasks and their text-image pair solutions covering 11 daily topics. We also design and validate a new set of metrics to evaluate the multimodal consistency and coherence in text-image plans. Extensive experiment results show the effectiveness of our approach on a range of backbone models against competitive baselines. Our code and data are available at https://github.com/psunlpgroup/MPlanner.
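
The four-stage iteration described in the abstract can be sketched as a plain control loop. This shows only the data flow between stages: the four callables stand in for model calls and are hypothetical, as are their signatures, the use of the draft to guide image editing, and the string stand-ins for images.

```python
from typing import Callable


def generate_plan(
    task: str,
    num_steps: int,
    draft_text: Callable[[str, list[str]], str],
    edit_image: Callable[[str, str], str],
    extract_info: Callable[[str], str],
    refine_text: Callable[[str, str], str],
    initial_image: str = "",
) -> tuple[list[str], list[str]]:
    """Run the draft -> edit -> extract -> refine loop for num_steps steps."""
    text_steps: list[str] = []
    image_steps: list[str] = []
    image = initial_image
    for _ in range(num_steps):
        draft = draft_text(task, text_steps)  # (1) draft from the text history
        image = edit_image(image, draft)      # (2) edit the last visual step
        info = extract_info(image)            # (3) extract visual information
        final = refine_text(draft, info)      # (4) refine the draft with it
        text_steps.append(final)              # outputs of (4) and (2) feed
        image_steps.append(image)             # the next iteration
    return text_steps, image_steps


# Stub callables that only manipulate strings, to show the data flow.
texts, images = generate_plan(
    task="make tea",
    num_steps=2,
    draft_text=lambda task, history: f"step{len(history) + 1}",
    edit_image=lambda image, draft: f"{image}>{draft}",
    extract_info=lambda image: f"seen({image})",
    refine_text=lambda draft, info: f"{draft}|{info}",
)
```

Passing the stages in as callables mirrors the plug-and-play claim: the loop itself does not depend on which backbone model implements each stage.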

[36] Dynamic Double Space Tower

Weikai Sun,Shijie Song,Han Wang

Main category: cs.CV

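TL;DR: The paper proposes a dynamic bidirectional spatial tower that replaces the traditional attention mechanism to strengthen the model's understanding of and reasoning about spatial relationships in visual question answering.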

Details Motivation: Existing methods perform poorly in complex reasoning scenarios, mainly because of insufficient cross-modal interaction and a limited ability to capture spatial relationships between entities. Method: Proposes a dynamic bidirectional spatial tower whose four layers follow the principles of human gestalt vision, providing a structural prior that improves the handling of spatial relationships. Result: The module can be applied to multimodal models and achieves advanced results on spatial relation question-answering datasets; the model July reaches state-of-the-art results with only 3B parameters. Conclusion: The method significantly improves the model's ability to organize image content spatially and offers a new solution for multimodal tasks. Abstract: The Visual Question Answering (VQA) task requires the simultaneous understanding of image content and question semantics. However, existing methods often have difficulty handling complex reasoning scenarios due to insufficient cross-modal interaction and difficulty capturing the spatial relationships between entities in the image \cite{huang2023adaptive}\cite{liu2021comparing}\cite{guibas2021adaptive}\cite{zhang2022vsa}. We studied a brand-new approach to replace the attention mechanism in order to enhance the reasoning ability of the model and its understanding of spatial relationships. Specifically, we propose a dynamic bidirectional spatial tower, which is divided into four layers to observe the image according to the principle of human gestalt vision. This naturally provides a powerful structural prior for the spatial organization between entities, enabling the model to no longer blindly search for relationships between pixels but make judgments based on more meaningful perceptual units. The model changes from "seeing images" to "perceiving and organizing image content". A large number of experiments have shown that our module can be used in any other multimodal model and achieve advanced results, demonstrating its potential in spatial relationship processing. Meanwhile, the multimodal visual question-answering model July trained by our method has achieved state-of-the-art results with only 3B parameters, especially on the question-answering dataset of spatial relations.

[37] Stop learning it all to mitigate visual hallucination, Focus on the hallucination target

Dokyoon Yoon,Youngsook Song,Woomyong Park

Main category: cs.CV

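TL;DR: Proposes \mymethod, a preference learning method that reduces the hallucinations of multimodal large language models on vision-language tasks by optimizing specifically over the regions where hallucinations occur.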

Details Motivation: Multimodal large language models often hallucinate on vision-language tasks (generating information about objects absent from the input image), which undermines their reliability in practical applications. Method: Builds a dataset containing hallucinated responses, correct responses, and target information, then applies preference learning restricted to the specific targets, filtering out irrelevant signals to correct hallucinations. Result: Experiments show that \mymethod effectively reduces hallucinations across multiple vision hallucination tasks, improving reliability and performance without hurting overall performance. Conclusion: Through targeted optimization, \mymethod substantially mitigates hallucination in multimodal large language models and offers a more reliable solution for practical applications. Abstract: Multimodal Large Language Models (MLLMs) frequently suffer from hallucination issues, generating information about objects that are not present in input images during vision-language tasks. These hallucinations particularly undermine model reliability in practical applications requiring accurate object identification. To address this challenge, we propose \mymethod, a preference learning approach that mitigates hallucinations by focusing on targeted areas where they occur. To implement this, we build a dataset containing hallucinated responses, correct responses, and target information (i.e., objects present in the images and the corresponding chunk positions in responses affected by hallucinations). By applying a preference learning method restricted to these specific targets, the model can filter out irrelevant signals and focus on correcting hallucinations. This allows the model to produce more factual responses by concentrating solely on relevant information. Experimental results demonstrate that \mymethod effectively reduces hallucinations across multiple vision hallucination tasks, improving the reliability and performance of MLLMs without diminishing overall performance.
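
One way to picture "preference learning restricted to specific targets" is a preference loss whose log-probabilities are summed only over the token positions marked as targets. The sketch below assumes a DPO-style objective, which the abstract does not name; the function names, the 0/1 token masks, and `beta` are all illustrative assumptions rather than the authors' formulation.

```python
import math
from typing import Sequence


def masked_logprob(token_logps: Sequence[float], mask: Sequence[int]) -> float:
    """Sum per-token log-probabilities over target positions only (mask == 1)."""
    return sum(lp for lp, m in zip(token_logps, mask) if m)


def targeted_preference_loss(
    chosen_lp: Sequence[float],
    chosen_mask: Sequence[int],
    rejected_lp: Sequence[float],
    rejected_mask: Sequence[int],
    ref_chosen_lp: Sequence[float],
    ref_rejected_lp: Sequence[float],
    beta: float = 0.1,
) -> float:
    """DPO-style loss (an assumption) computed on masked target tokens only."""
    pi_c = masked_logprob(chosen_lp, chosen_mask)
    pi_r = masked_logprob(rejected_lp, rejected_mask)
    ref_c = masked_logprob(ref_chosen_lp, chosen_mask)
    ref_r = masked_logprob(ref_rejected_lp, rejected_mask)
    margin = beta * ((pi_c - ref_c) - (pi_r - ref_r))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))
```

Because tokens outside the mask never enter the sums, changes to the rest of the response leave the loss untouched, which is the "filter out irrelevant signals" behavior the abstract describes.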

[38] Auto-Connect: Connectivity-Preserving RigFormer with Direct Preference Optimization

Jingfeng Guo,Jian Liu,Jinnan Chen,Shiwei Mao,Changrong Hu,Puhua Jiang,Junlin Yu,Jing Xu,Qi Liu,Lixin Xu,Zhuo Chen,Chunchao Guo

Main category: cs.CV

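TL;DR: Auto-Connect is a new automatic rigging method that uses a connectivity-preserving tokenization scheme and a topology-aware reward function to substantially improve the topological accuracy and deformation quality of skeletal structures.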

Details Motivation: Existing methods fail to preserve connectivity when predicting bone positions, leading to topological errors and deformation problems. Method: Uses a connectivity-preserving tokenization scheme, a topology-aware reward function, and implicit geodesic features, combined with a reward-guided optimization strategy. Result: The generated skeletal structures are more anatomically plausible and deform better. Conclusion: Through its combined optimization strategy, Auto-Connect substantially improves the quality and efficiency of automatic rigging. Abstract: We introduce Auto-Connect, a novel approach for automatic rigging that explicitly preserves skeletal connectivity through a connectivity-preserving tokenization scheme. Unlike previous methods that predict bone positions represented as two joints or first predict points before determining connectivity, our method employs special tokens to define endpoints for each joint's children and for each hierarchical layer, effectively automating connectivity relationships. This approach significantly enhances topological accuracy by integrating connectivity information directly into the prediction framework. To further guarantee high-quality topology, we implement a topology-aware reward function that quantifies topological correctness, which is then utilized in a post-training phase through reward-guided Direct Preference Optimization. Additionally, we incorporate implicit geodesic features for latent top-k bone selection, which substantially improves skinning quality. By leveraging geodesic distance information within the model's latent space, our approach intelligently determines the most influential bones for each vertex, effectively mitigating common skinning artifacts. This combination of connectivity-preserving tokenization, reward-guided fine-tuning, and geodesic-aware bone selection enables our model to consistently generate more anatomically plausible skeletal structures with superior deformation properties.

Jie Zhu,Leye Wang

Main category: cs.CV

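TL;DR: Proposes FSCA, a black-box framework for auditing data provenance in text-to-image diffusion models that requires no access to internal model knowledge; experiments show it outperforms existing methods.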

Details Motivation: Text-to-image diffusion models rely on large-scale datasets that may raise copyright and privacy issues, yet existing auditing methods rest on unrealistic assumptions or are unreliable. Method: Audits using two types of semantic connections within the text-to-image diffusion model, without internal knowledge, and introduces a recall balance strategy and a threshold adjustment strategy. Result: FSCA outperforms baselines across multiple datasets and metrics, reaching a user-level accuracy of 90% with only 10 samples per user. Conclusion: FSCA shows strong auditing potential in real-world applications; the code is open source. Abstract: Since they were first proposed, text-to-image diffusion models have significantly influenced content creation thanks to their impressive generation capability. However, this capability depends on large-scale text-image datasets gathered from web platforms like social media, posing substantial challenges in copyright compliance and personal privacy leakage. Although some efforts have explored approaches for auditing data provenance in text-to-image diffusion models, existing work either makes the unrealistic assumption that internal model knowledge (e.g., intermediate results) is accessible, or relies on unreliable evaluation. To fill this gap, we propose a completely black-box auditing framework called Feature Semantic Consistency-based Auditing (FSCA). It utilizes two types of semantic connections within the text-to-image diffusion model for auditing, eliminating the need for access to internal knowledge. To demonstrate the effectiveness of our FSCA framework, we perform extensive experiments on the LAION-mi dataset and the COCO dataset, and compare with eight state-of-the-art baseline approaches. The results show that FSCA surpasses previous baseline approaches across various metrics and different data distributions, showcasing the superiority of our FSCA. Moreover, we introduce a recall balance strategy and a threshold adjustment strategy, which collectively allow FSCA to reach a user-level accuracy of 90% in a real-world auditing scenario with only 10 samples/user, highlighting its strong auditing potential in real-world applications. Our code is made available at https://github.com/JiePKU/FSCA.

[40] TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models

Ziyang Luo,Nian Liu,Xuguang Yang,Salman Khan,Rao Muhammad Anwer,Hisham Cholakkal,Fahad Shahbaz Khan,Junwei Han

Main category: cs.CV

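TL;DR: The TAViS framework combines a multimodal foundation model (ImageBind) with a segmentation foundation model (SAM2) to address cross-modal alignment in audio-visual segmentation, and introduces a text-bridged design and supervision strategy to improve performance.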

Details Motivation: Audio-visual segmentation (AVS) faces the challenge of cross-modal alignment; existing methods fail to combine multimodal knowledge effectively or to resolve differences between feature spaces. Method: Proposes the TAViS framework, which combines ImageBind and SAM2 and introduces a text-bridged hybrid prompting mechanism and an alignment supervision strategy. Result: Performs strongly on single-source, multi-source, and semantic datasets as well as in zero-shot settings. Conclusion: TAViS's text-bridged design effectively addresses cross-modal alignment and substantially improves segmentation performance. Abstract: Audio-Visual Segmentation (AVS) faces a fundamental challenge of effectively aligning audio and visual modalities. While recent approaches leverage foundation models to address data scarcity, they often rely on single-modality knowledge or combine foundation models in an off-the-shelf manner, failing to address the cross-modal alignment challenge. In this paper, we present TAViS, a novel framework that couples the knowledge of multimodal foundation models (ImageBind) for cross-modal alignment and a segmentation foundation model (SAM2) for precise segmentation. However, effectively combining these models poses two key challenges: the difficulty in transferring the knowledge between SAM2 and ImageBind due to their different feature spaces, and the insufficiency of using only segmentation loss for supervision. To address these challenges, we introduce a text-bridged design with two key components: (1) a text-bridged hybrid prompting mechanism where pseudo text provides class prototype information while retaining modality-specific details from both audio and visual inputs, and (2) an alignment supervision strategy that leverages text as a bridge to align shared semantic concepts within audio-visual modalities. Our approach achieves superior performance on single-source, multi-source, and semantic datasets, and excels in zero-shot settings.

[41] Uncertainty Awareness Enables Efficient Labeling for Cancer Subtyping in Digital Pathology

Nirhoshan Sivaroopan,Chamuditha Jayanga Galappaththige,Chalani Ekanayake,Hasindri Watawana,Ranga Rodrigo,Chamira U. S. Edussooriya,Dushan N. Wadduwage

Main category: cs.CV

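TL;DR: The paper proposes an uncertainty-aware model based on self-supervised contrastive learning for cancer subtyping; by selectively labeling the most important images it streamlines training and reaches state-of-the-art performance with only 1-10% of the annotations.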

Details Motivation: In digital pathology, cancer subtyping models need large amounts of expert annotation, while annotation resources are limited. Uncertainty awareness is therefore introduced to improve labeling efficiency and model performance. Method: Introduces an evidence vector into a self-supervised contrastive learning model to compute an uncertainty score, selectively labels the most important images, and iteratively refines training. Result: Reaches state-of-the-art performance on benchmark datasets with only 1-10% of the annotations, while improving classification precision and efficiency. Conclusion: The method substantially improves cancer subtyping performance when labeled data is limited and opens a new research direction for digital pathology. Abstract: Machine-learning-assisted cancer subtyping is a promising avenue in digital pathology. Cancer subtyping models, however, require careful training using expert annotations so that their predictions carry a known degree of certainty (or uncertainty). To this end, we introduce the concept of uncertainty awareness into a self-supervised contrastive learning model. This is achieved by computing an evidence vector at every epoch, which assesses the model's confidence in its predictions. The derived uncertainty score is then utilized as a metric to selectively label the most crucial images that require further annotation, thus iteratively refining the training process. With just 1-10% of strategically selected annotations, we attain state-of-the-art performance in cancer subtyping on benchmark datasets. Our method not only strategically guides the annotation process to minimize the need for extensive labeled datasets, but also improves the precision and efficiency of classifications. This development is particularly beneficial in settings where the availability of labeled data is limited, offering a promising direction for future research and application in digital pathology.
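
The abstract above does not give the formula that turns an evidence vector into an uncertainty score. A minimal sketch, assuming the subjective-logic formulation common in evidential deep learning (Dirichlet parameters alpha = evidence + 1 and uncertainty u = K / sum(alpha) for K classes), might look like this; the function names and the fixed labeling `budget` are illustrative assumptions.

```python
from typing import List, Sequence


def uncertainty(evidence: Sequence[float]) -> float:
    """Assumed score u = K / S, with S = sum(e_k + 1) the Dirichlet strength."""
    num_classes = len(evidence)
    strength = sum(e + 1.0 for e in evidence)
    return num_classes / strength


def select_for_labeling(evidences: Sequence[Sequence[float]], budget: int) -> List[int]:
    """Return indices of the `budget` most uncertain images (hypothetical helper)."""
    scores = [uncertainty(e) for e in evidences]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]
```

Under this assumption an image with no evidence for any class scores 1.0 (maximally uncertain) and is sent to the annotator first, while strong evidence for a single class drives the score toward 0.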

[42] On the Natural Robustness of Vision-Language Models Against Visual Perception Attacks in Autonomous Driving

Pedram MohajerAnsari,Amir Salarpour,Michael Kühr,Siyu Huang,Mohammad Hamad,Sebastian Steinhorst,Habeeb Olufowobi,Mert D. Pesé

Main category: cs.CV

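TL;DR: The paper introduces V2LMs, vision-language models fine-tuned to make autonomous vehicle (AV) perception more robust; under adversarial attack they outperform conventional deep neural networks (DNNs).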

Details Motivation: Conventional DNNs are fragile under adversarial attack, and existing defenses such as adversarial training reduce benign accuracy and fail to generalize to unseen attacks. Method: Proposes V2LMs and evaluates two deployment modes: Solo Mode (one V2LM per task) and Tandem Mode (a single V2LM for multiple tasks). Result: Under adversarial attack V2LMs lose less than 8% accuracy on average, whereas DNNs lose 33% to 46%. Tandem Mode saves memory while retaining robustness. Conclusion: V2LMs offer a more secure and resilient option for AV perception systems. Abstract: Autonomous vehicles (AVs) rely on deep neural networks (DNNs) for critical tasks such as traffic sign recognition (TSR), automated lane centering (ALC), and vehicle detection (VD). However, these models are vulnerable to attacks that can cause misclassifications and compromise safety. Traditional defense mechanisms, including adversarial training, often degrade benign accuracy and fail to generalize against unseen attacks. In this work, we introduce Vehicle Vision Language Models (V2LMs), fine-tuned vision-language models specialized for AV perception. Our findings demonstrate that V2LMs inherently exhibit superior robustness against unseen attacks without requiring adversarial training, maintaining significantly higher accuracy than conventional DNNs under adversarial conditions. We evaluate two deployment strategies: Solo Mode, where individual V2LMs handle specific perception tasks, and Tandem Mode, where a single unified V2LM is fine-tuned for multiple tasks simultaneously. Experimental results reveal that DNNs suffer performance drops of 33% to 46% under attacks, whereas V2LMs maintain adversarial accuracy with reductions of less than 8% on average. The Tandem Mode further offers a memory-efficient alternative while achieving comparable robustness to Solo Mode. We also explore integrating V2LMs as parallel components to AV perception to enhance resilience against adversarial threats. Our results suggest that V2LMs offer a promising path toward more secure and resilient AV perception systems.

[43] FAME: A Lightweight Spatio-Temporal Network for Model Attribution of Face-Swap Deepfakes

Wasim Ahmad,Yan-Tsung Peng,Yuan-Hao Chang

Main category: cs.CV

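TL;DR: FAME is a lightweight, efficient spatio-temporal framework for identifying the generative model behind a Deepfake video; it outperforms existing methods.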

Details Motivation: Deepfake videos threaten digital security, privacy, and media integrity, so effective tools are needed to identify their source model. Method: FAME uses multilevel embeddings and spatio-temporal attention to capture the subtle artifacts of different generative models. Result: Across multiple datasets, FAME outperforms existing methods in both accuracy and runtime. Conclusion: FAME has potential for deployment in real-world forensic and information security applications. Abstract: The widespread emergence of face-swap Deepfake videos poses growing risks to digital security, privacy, and media integrity, necessitating effective forensic tools for identifying the source of such manipulations. Although most prior research has focused primarily on binary Deepfake detection, the task of model attribution (determining which generative model produced a given Deepfake) remains underexplored. In this paper, we introduce FAME (Fake Attribution via Multilevel Embeddings), a lightweight and efficient spatio-temporal framework designed to capture subtle generative artifacts specific to different face-swap models. FAME integrates spatial and temporal attention mechanisms to improve attribution accuracy while remaining computationally efficient. We evaluate our model on three challenging and diverse datasets: Deepfake Detection and Manipulation (DFDM), FaceForensics++, and FakeAVCeleb. Results show that FAME consistently outperforms existing methods in both accuracy and runtime, highlighting its potential for deployment in real-world forensic and information security applications.

[44] Environmental Change Detection: Toward a Practical Task of Scene Change Detection

Kyusik Cho,Suhan Woo,Hongje Seong,Euntai Kim

Main category: cs.CV

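TL;DR: The paper proposes a new task, Environmental Change Detection (ECD), which removes the reliance of conventional Scene Change Detection (SCD) on idealized, aligned query-reference image pairs. Using a large-scale database of uncurated images and a new framework, it detects changes effectively despite viewpoint misalignment and limited field of view.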

Details Motivation: Conventional SCD methods depend on idealized query-reference alignment, whereas in practice reference images usually come from different viewpoints. The paper addresses this limitation by proposing the more practical ECD task. Method: Proposes a new framework that detects changes by leveraging multiple reference candidates and aggregating semantically rich representations, mitigating viewpoint misalignment and limited field of view. Result: On three standard benchmarks, the framework significantly outperforms a naive combination of existing methods and approaches the performance of the idealized (oracle) setting. Conclusion: The ECD task and the proposed framework offer a more practical approach to change detection in real scenes; the code will be released upon acceptance. Abstract: Humans do not memorize everything. Thus, humans recognize scene changes by exploring the past images. However, available past (i.e., reference) images typically represent nearby viewpoints of the present (i.e., query) scene, rather than the identical view. Despite this practical limitation, conventional Scene Change Detection (SCD) has been formalized under an idealized setting in which reference images with matching viewpoints are available for every query. In this paper, we push this problem toward a practical task and introduce Environmental Change Detection (ECD). A key aspect of ECD is to avoid unrealistically aligned query-reference pairs and rely solely on environmental cues. Inspired by real-world practices, we provide these cues through a large-scale database of uncurated images. To address this new task, we propose a novel framework that jointly understands spatial environments and detects changes. The main idea is that matching at the same spatial locations between a query and a reference may lead to a suboptimal solution due to viewpoint misalignment and limited field-of-view (FOV) coverage. We deal with this limitation by leveraging multiple reference candidates and aggregating semantically rich representations for change detection. We evaluate our framework on three standard benchmark sets reconstructed for ECD, and significantly outperform a naive combination of state-of-the-art methods while achieving comparable performance to the oracle setting. The code will be released upon acceptance.

[45] Composite Data Augmentations for Synthetic Image Detection Against Real-World Perturbations

Efthymia Amarantidou,Christos Koutlis,Symeon Papadopoulos,Panagiotis C. Petrantonakis

Main category: cs.CV

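TL;DR: The paper improves synthetic image detection (SID) by combining data augmentations and selecting them with a genetic algorithm, substantially improving model performance under real-world perturbations.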

Details Motivation: The spread of generative AI tools lets synthetic images circulate on social media, threatening information integrity. Existing SID methods perform poorly on internet images that have undergone compression and similar operations. Method: Explores combinations of data augmentations, uses a genetic algorithm to select the best augmentation strategy, and introduces a dual-criteria optimization approach. Result: The best model improves mean average precision by 22.53% over models trained without augmentations. Conclusion: The approach offers an effective route to detectors that identify synthetic images across varying qualities and transformations. Abstract: The advent of accessible Generative AI tools enables anyone to create and spread synthetic images on social media, often with the intention to mislead, thus posing a significant threat to online information integrity. Most existing Synthetic Image Detection (SID) solutions struggle on generated images sourced from the Internet, as these are often altered by compression and other operations. To address this, our research enhances SID by exploring data augmentation combinations, leveraging a genetic algorithm for optimal augmentation selection, and introducing a dual-criteria optimization approach. These methods significantly improve model performance under real-world perturbations. Our findings provide valuable insights for developing detection models capable of identifying synthetic images across varying qualities and transformations, with the best-performing model achieving a mean average precision increase of +22.53% compared to models without augmentations. The implementation is available at github.com/efthimia145/sid-composite-data-augmentation.

[46] Preserving Clusters in Prompt Learning for Unsupervised Domain Adaptation

Tung-Long Vuong,Hoang Phan,Vy Vo,Anh Bui,Thanh-Toan Do,Trung Le,Dinh Phung

Main category: cs.CV

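TL;DR: The paper proposes a new method that exploits the geometry of visual and text embeddings to strengthen pseudo-labels and target-prompt learning, addressing a limitation of multimodal pre-trained models in unsupervised domain adaptation.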

Details Motivation: Existing methods rely on CLIP zero-shot predictions and self-training, but the visual embedding distribution in the target domain can deviate from that of the pre-trained model, producing misleading signals. Method: Leverages reference predictions from source prompts, based on the relationship between source and target visual embeddings, and uses optimal transport theory to enforce the clustering property in text embeddings. Result: Experiments validate the method, showing improved performance and better target-prompt representations. Conclusion: By exploiting the geometry of the embeddings, the method improves unsupervised domain adaptation and substantially boosts model performance. Abstract: Recent approaches leveraging multi-modal pre-trained models like CLIP for Unsupervised Domain Adaptation (UDA) have shown significant promise in bridging domain gaps and improving generalization by utilizing rich semantic knowledge and robust visual representations learned through extensive pre-training on diverse image-text datasets. While these methods achieve state-of-the-art performance across benchmarks, much of the improvement stems from base pseudo-labels (CLIP zero-shot predictions) and self-training mechanisms. Thus, the training mechanism exhibits a key limitation wherein the visual embedding distribution in target domains can deviate from the visual embedding distribution in the pre-trained model, leading to misguided signals from class descriptions. This work introduces a fresh solution to reinforce these pseudo-labels and facilitate target-prompt learning by exploiting the geometry of visual and text embeddings, an aspect that existing methods overlook. We first propose to directly leverage the reference predictions (from source prompts) based on the relationship between source and target visual embeddings. We later show that there is a strong clustering behavior observed between visual and text embeddings in pre-trained multi-modal models. Building on optimal transport theory, we transform this insight into a novel strategy to enforce the clustering property in text embeddings, further enhancing the alignment in the target domain. Our experiments and ablation studies validate the effectiveness of the proposed approach, demonstrating superior performance and improved quality of target prompts in terms of representation.

[47] Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs

Xiao Xu,Libo Qin,Wanxiang Che,Min-Yen Kan

Main category: cs.CV

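TL;DR: The paper proposes Manager, a lightweight plugin that adaptively aggregates multi-level knowledge from unimodal experts to improve vision-language models.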

Details Motivation: The existing BridgeTower model falls short in its use of unimodal representations, in flexibly exploiting multi-level semantic knowledge, and in evaluation on high-resolution datasets. Method: Proposes ManagerTower, which introduces the Manager plugin in each cross-modal layer, and extends the approach to multimodal large language model architectures. Result: Performs strongly on 4 downstream vision-language tasks and significantly boosts the zero-shot performance of LLaVA-OV on 20 datasets. Conclusion: The synergy between the Manager plugin and the multi-grid algorithm improves visual representations, mitigates semantic ambiguity, and improves performance. Abstract: Two-Tower Vision-Language Models (VLMs) have demonstrated strong performance across various downstream VL tasks. While BridgeTower further enhances performance by building bridges between encoders, it (i) suffers from ineffective layer-by-layer utilization of unimodal representations, (ii) restricts the flexible exploitation of different levels of unimodal semantic knowledge, and (iii) is limited to the evaluation on traditional low-resolution datasets only with the Two-Tower VLM architecture. In this work, we propose Manager, a lightweight, efficient and effective plugin that adaptively aggregates insights from different levels of pre-trained unimodal experts to facilitate more comprehensive VL alignment and fusion. First, under the Two-Tower VLM architecture, we introduce ManagerTower, a novel VLM that introduces the manager in each cross-modal layer. Whether with or without VL pre-training, ManagerTower outperforms previous strong baselines and achieves superior performance on 4 downstream VL tasks. Moreover, we extend our exploration to the latest Multimodal Large Language Model (MLLM) architecture. We demonstrate that LLaVA-OV-Manager significantly boosts the zero-shot performance of LLaVA-OV across different categories of capabilities, images, and resolutions on 20 downstream datasets, whether the multi-grid algorithm is enabled or not. In-depth analysis reveals that both our manager and the multi-grid algorithm can be viewed as a plugin that improves the visual representation by capturing more diverse visual details from two orthogonal perspectives (depth and width). Their synergy can mitigate the semantic ambiguity caused by the multi-grid algorithm and further improve performance. Code and models are available at https://github.com/LooperXX/ManagerTower.

[48] GNSS-inertial state initialization by distance residuals

Samuel Cerezo,Javier Civera

Main category: cs.CV

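TL;DR: Proposes a new GNSS-inertial initialization strategy that delays the use of global GNSS measurements and relies instead on relative distance residuals, improving the accuracy and robustness of the initial estimate.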

Details Motivation: Estimating the initial state of a sensor platform is often inaccurate because the first measurements carry limited information, and the estimate can fall into local minima. Method: Proposes a criterion based on the evolution of the singular values of the Hessian matrix to decide when to switch to global GNSS measurements. Result: Experiments on the EuRoC and GVINS datasets show that the method outperforms the strategy of using global GNSS data from the start. Conclusion: The method substantially improves the accuracy and robustness of initialization and avoids local minima. Abstract: Initializing the state of a sensorized platform can be challenging, as a limited set of initial measurements often carries limited information, leading to poor initial estimates that may converge to local minima during non-linear optimization. This paper proposes a novel GNSS-inertial initialization strategy that delays the use of global GNSS measurements until sufficient information is available to accurately estimate the transformation between the GNSS and inertial frames. Instead, the method initially relies on GNSS relative distance residuals. To determine the optimal moment for switching to global measurements, we introduce a criterion based on the evolution of the Hessian matrix singular values. Experiments on the EuRoC and GVINS datasets show that our approach consistently outperforms the naive strategy of using global GNSS data from the start, yielding more accurate and robust initializations.

[49] FIMA-Q: Post-Training Quantization for Vision Transformers by Fisher Information Matrix Approximation

Zhuguanyu Wu,Shihe Wang,Jiayi Zhang,Jiaxin Chen,Yunhong Wang

Main category: cs.CV

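TL;DR: The paper proposes FIMA-Q, a new post-training quantization method that addresses the accuracy degradation of Vision Transformers (ViTs) under low-bit quantization. Analyzing the Hessian-guided quantization loss, the authors identify limitations of conventional Hessian approximations and propose a fast way to compute the loss based on the connection between KL divergence and the Fisher information matrix (FIM), together with an efficient FIM approximation, DPLR-FIM. Experiments show substantially improved accuracy under low-bit quantization.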

Details Motivation: Current post-training quantization methods still suffer significant accuracy degradation on ViTs, especially at low bit widths, and need improvement. Method: The authors propose FIMA-Q, which connects KL divergence to the FIM to compute the quantization loss quickly, and design the DPLR-FIM approximation to formulate the final quantization loss. Result: Experiments across several ViT architectures and public datasets show that FIMA-Q substantially improves accuracy under low-bit quantization and outperforms existing methods. Conclusion: FIMA-Q is an efficient, accurate approach to post-training quantization of ViTs, particularly in low-bit settings. Abstract: Post-training quantization (PTQ) has stood out as a cost-effective and promising model compression paradigm in recent years, as it avoids computationally intensive model retraining. Nevertheless, current PTQ methods for Vision Transformers (ViTs) still suffer from significant accuracy degradation, especially under low-bit quantization. To address these shortcomings, we analyze the prevailing Hessian-guided quantization loss, and uncover certain limitations of conventional Hessian approximations. By following the block-wise reconstruction framework, we propose a novel PTQ method for ViTs, dubbed FIMA-Q. Specifically, we first establish the connection between KL divergence and FIM, which enables fast computation of the quantization loss during reconstruction. We further propose an efficient FIM approximation method, namely DPLR-FIM, by employing the diagonal plus low-rank principle, and formulate the ultimate quantization loss. Our extensive experiments, conducted across various vision tasks with representative ViT-based architectures on public datasets, demonstrate that our method substantially improves accuracy compared to the state-of-the-art approaches, especially in the case of low-bit quantization. The source code is available at https://github.com/ShiheWang/FIMA-Q.
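
To see why a diagonal plus low-rank structure makes a Fisher-weighted loss cheap, consider a quadratic form delta^T F delta with F approximated as diag(d) + sum_k u_k u_k^T. The sketch below is not the authors' DPLR-FIM: how `delta`, `diag`, and the rank-r factors `low_rank` are obtained is not given in the abstract, so all three are assumed inputs. It only shows that the structured form costs O(n r) rather than the O(n^2) of a dense matrix.

```python
from typing import List, Sequence


def dplr_quadratic_form(
    delta: Sequence[float],
    diag: Sequence[float],
    low_rank: Sequence[Sequence[float]],
) -> float:
    """delta^T (diag(d) + sum_k u_k u_k^T) delta without forming the matrix."""
    diag_term = sum(d * x * x for d, x in zip(diag, delta))
    low_rank_term = sum(
        sum(u_i * x for u_i, x in zip(u, delta)) ** 2 for u in low_rank
    )
    return diag_term + low_rank_term


def dense_quadratic_form(
    delta: Sequence[float],
    diag: Sequence[float],
    low_rank: Sequence[Sequence[float]],
) -> float:
    """Reference O(n^2) computation that builds the full matrix explicitly."""
    n = len(delta)
    full: List[List[float]] = [
        [
            (diag[i] if i == j else 0.0) + sum(u[i] * u[j] for u in low_rank)
            for j in range(n)
        ]
        for i in range(n)
    ]
    return sum(delta[i] * full[i][j] * delta[j] for i in range(n) for j in range(n))
```

Both functions return the same value; the first never materializes the n-by-n matrix, which matters when n is the size of a transformer block's output.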

[50] Leveraging Satellite Image Time Series for Accurate Extreme Event Detection

Heng Fang,Hossein Azizpour

Main category: cs.CV

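TL;DR: The SITS-Extreme framework detects extreme weather events from satellite image time series, using multi-temporal observations to improve accuracy over conventional bi-temporal baselines.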

Details Motivation: Climate change is making extreme weather events more frequent, and early detection is needed to improve disaster response. Method: Proposes SITS-Extreme, which incorporates multi-temporal satellite imagery to filter out irrelevant changes and isolate disaster-relevant signals. Result: Experiments validate SITS-Extreme, showing substantial improvements over conventional methods, and analyze its key components and its applicability to different disaster types. Conclusion: SITS-Extreme shows potential and scalability for large-scale disaster monitoring. Abstract: Climate change is leading to an increase in extreme weather events, causing significant environmental damage and loss of life. Early detection of such events is essential for improving disaster response. In this work, we propose SITS-Extreme, a novel framework that leverages satellite image time series to detect extreme events by incorporating multiple pre-disaster observations. This approach effectively filters out irrelevant changes while isolating disaster-relevant signals, enabling more accurate detection. Extensive experiments on both real-world and synthetic datasets validate the effectiveness of SITS-Extreme, demonstrating substantial improvements over widely used strong bi-temporal baselines. Additionally, we examine the impact of incorporating more timesteps, analyze the contribution of key components in our framework, and evaluate its performance across different disaster types, offering valuable insights into its scalability and applicability for large-scale disaster monitoring.

[51] Linearly Solving Robust Rotation Estimation

Yinlong Liu,Tianyu Huang,Zhi-Xin Yang

Main category: cs.CV

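TL;DR: The paper proposes a new rotation estimation method that reformulates the problem as linear model fitting without introducing singularities, runs efficiently in parallel on GPUs, and is highly robust.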

Details Motivation: Rotation estimation is essential in computer vision and robotics, but it is conventionally treated as a non-linear, non-convex optimization problem that requires careful design. The paper aims for a simpler and more efficient method. Method: Reformulates rotation estimation as linear model fitting, represents a rotation motion as a great circle on the quaternion sphere, and proposes a voting-based method that supports GPU parallelism. Result: The method is robust to noise and outliers and handles large-scale ($10^6$) problems with a 99% outlier ratio in under 0.5 seconds. Conclusion: Experiments confirm the method's effectiveness and robustness, offering a new approach to rotation estimation. Abstract: Rotation estimation plays a fundamental role in computer vision and robotics tasks, and extremely robust rotation estimation is significantly useful for safety-critical applications. Typically, estimating a rotation is considered a non-linear and non-convex optimization problem that requires careful design. However, in this paper, we provide some new perspectives that solving a rotation estimation problem can be reformulated as solving a linear model fitting problem without dropping any constraints and without introducing any singularities. In addition, we explore the dual structure of a rotation motion, revealing that it can be represented as a great circle on a quaternion sphere surface. Accordingly, we propose an easily understandable voting-based method to solve rotation estimation. The proposed method exhibits exceptional robustness to noise and outliers and can be computed in parallel with graphics processing units (GPUs) effortlessly. Particularly, leveraging the power of GPUs, the proposed method can obtain a satisfactory rotation solution for large-scale ($10^6$) and severely corrupted (99% outlier ratio) rotation estimation problems in under 0.5 seconds. Furthermore, to validate our theoretical framework and demonstrate the superiority of our proposed method, we conduct controlled experiments and real-world dataset experiments. These experiments provide compelling evidence supporting the effectiveness and robustness of our approach in solving rotation estimation problems.
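
The linear view of rotation estimation can be illustrated with a standard quaternion identity; this is a sketch of that identity, not necessarily the paper's exact formulation. If a unit quaternion q rotates a to b, then b ⊗ q = q ⊗ a (with a and b as pure quaternions), which is linear in q. The solutions form a two-dimensional subspace whose intersection with the unit sphere is a great circle, consistent with the great-circle picture in the abstract. All helper names below are illustrative.

```python
import math


def qmul(p, q):
    """Hamilton product of quaternions stored as (w, x, y, z)."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (
        pw * qw - px * qx - py * qy - pz * qz,
        pw * qx + px * qw + py * qz - pz * qy,
        pw * qy - px * qz + py * qw + pz * qx,
        pw * qz + px * qy - py * qx + pz * qw,
    )


def axis_angle(axis, angle):
    """Unit quaternion for a rotation of `angle` radians about `axis`."""
    norm = math.sqrt(sum(c * c for c in axis))
    s = math.sin(angle / 2.0) / norm
    return (math.cos(angle / 2.0), axis[0] * s, axis[1] * s, axis[2] * s)


def rotate(q, v):
    """Rotate 3-vector v by unit quaternion q via q * v * conj(q)."""
    conj = (q[0], -q[1], -q[2], -q[3])
    return qmul(qmul(q, (0.0, *v)), conj)[1:]


def constraint_matrix(a, b):
    """4x4 matrix M such that M q = b*q - q*a; the constraint M q = 0 is linear."""
    basis = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)]
    cols = []
    for e in basis:
        left = qmul((0.0, *b), e)
        right = qmul(e, (0.0, *a))
        cols.append([l - r for l, r in zip(left, right)])
    return [[cols[j][i] for j in range(4)] for i in range(4)]


def apply(matrix, q):
    """Matrix-vector product for a 4x4 matrix and a quaternion."""
    return [sum(matrix[i][j] * q[j] for j in range(4)) for i in range(4)]
```

Composing the true rotation with any further rotation about b still maps a to b, so those quaternions satisfy the same linear constraint. A single correspondence therefore pins the rotation down only to a great circle, and intersecting the circles from several correspondences is what a voting scheme can exploit.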

[52] EyeSim-VQA: A Free-Energy-Guided Eye Simulation Framework for Video Quality Assessment

Zhaoyang Wang,Wen Lu,Jie Li,Lihuo He,Maoguo Gong,Xinbo Gao

Main category: cs.CV

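TL;DR: EyeSimVQA is a video quality assessment framework built on free-energy-based self-repair; with a dual-branch architecture and a biologically inspired design, it matches or exceeds existing methods in performance while improving interpretability.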

Details Motivation: In video quality assessment (VQA), spatiotemporal complexity and the constraints of pre-trained models make it hard for existing methods to integrate enhancement modules directly. Method: Proposes a dual-branch architecture (an aesthetic branch and a technical branch) combined with free-energy self-repair modules and a biologically inspired prediction head. Result: Achieves competitive or superior results on five public VQA benchmarks, with improved interpretability. Conclusion: EyeSimVQA is an efficient and interpretable approach to VQA. Abstract: Free-energy-guided self-repair mechanisms have shown promising results in image quality assessment (IQA), but remain under-explored in video quality assessment (VQA), where temporal dynamics and model constraints pose unique challenges. Unlike static images, video content exhibits richer spatiotemporal complexity, making perceptual restoration more difficult. Moreover, VQA systems often rely on pre-trained backbones, which limits the direct integration of enhancement modules without affecting model stability. To address these issues, we propose EyeSimVQA, a novel VQA framework that incorporates free-energy-based self-repair. It adopts a dual-branch architecture, with an aesthetic branch for global perceptual evaluation and a technical branch for fine-grained structural and semantic analysis. Each branch integrates specialized enhancement modules tailored to distinct visual inputs (resized full-frame images and patch-based fragments) to simulate adaptive repair behaviors. We also explore a principled strategy for incorporating high-level visual features without disrupting the original backbone. In addition, we design a biologically inspired prediction head that models sweeping gaze dynamics to better fuse global and local representations for quality prediction. Experiments on five public VQA benchmarks demonstrate that EyeSimVQA achieves competitive or superior performance compared to state-of-the-art methods, while offering improved interpretability through its biologically grounded design.

[53] DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs

Bo-Cheng Chiu,Jen-Jee Chen,Yu-Chee Tseng,Feng-Chi Chen

Main category: cs.CV

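TL;DR: DaMO is a data-efficient Video LLM that uses a hierarchical dual-stream architecture and a global residual to improve temporal reasoning and multimodal understanding.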

Details Motivation: Existing Video LLMs are limited in fine-grained temporal reasoning and struggle to attribute responses to specific video moments, especially under constrained supervision. Method: Proposes the Temporal-aware Fuseformer, a hierarchical dual-stream architecture that captures temporal dynamics and fuses visual and audio information, with a global residual for computational efficiency. A four-stage progressive training paradigm builds multimodal alignment, semantic grounding, and temporal reasoning. Result: On temporal grounding and video QA tasks, DaMO outperforms existing methods, particularly on tasks that require precise temporal alignment. Conclusion: DaMO points to a promising direction for data-efficient video-language modeling. Abstract: Large Language Models (LLMs) have recently been extended to the video domain, enabling sophisticated video-language understanding. However, existing Video LLMs often exhibit limitations in fine-grained temporal reasoning, restricting their ability to precisely attribute responses to specific video moments, especially under constrained supervision. We introduce DaMO, a data-efficient Video LLM explicitly designed for accurate temporal reasoning and multimodal understanding. At its core, the proposed Temporal-aware Fuseformer employs a hierarchical dual-stream architecture that progressively captures temporal dynamics within each modality and effectively fuses complementary visual and audio information. To further enhance computational efficiency, DaMO integrates a global residual that reduces spatial redundancy while preserving essential semantic details. We train DaMO via a structured four-stage progressive training paradigm, incrementally equipping the model with multimodal alignment, semantic grounding, and temporal reasoning capabilities. This work also contributes multiple datasets augmented from existing ones with GPT-generated temporally grounded QA pairs for tasks requiring temporal supervision. Comprehensive experiments on temporal grounding and video QA benchmarks demonstrate that DaMO consistently surpasses prior methods, particularly in tasks demanding precise temporal alignment and reasoning. Our work establishes a promising direction for data-efficient video-language modeling.

[54] VFaith: Do Large Multimodal Models Really Reason on Seen Images Rather than Previous Memories?

Jiachen Yu,Yufei Zhan,Ziheng Wu,Yousong Zhu,Jinqiao Wang,Minghui Qiu

Main category: cs.CV

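TL;DR: The paper presents an automatic pipeline for editing visual cues and the VFaith-Bench benchmark, used to evaluate the visual reasoning ability of multimodal large language models (MLLMs) and their faithfulness to visual information.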

Details Motivation: 尽管长链思维(CoT)能提升MLLMs解决复杂问题的能力,但其有效性原因尚不明确,尤其是视觉线索提取与推理过程的贡献难以量化。 Method: 设计了基于GPT-Image-1的自动编辑流程,生成对比问答对,并通过修改关键视觉线索改变答案,构建VFaith-Bench基准。 Result: VFaith-Bench包含755个条目和人类标注任务,测试主流模型显示其视觉推理能力与视觉感知的关系。 Conclusion: 研究揭示了MLLMs推理能力的来源,为未来模型优化提供了方向。 Abstract: Recent extensive works have demonstrated that by introducing long CoT, the capabilities of MLLMs to solve complex problems can be effectively enhanced. However, the reasons for the effectiveness of such paradigms remain unclear. It is challenging to analysis with quantitative results how much the model's specific extraction of visual cues and its subsequent so-called reasoning during inference process contribute to the performance improvements. Therefore, evaluating the faithfulness of MLLMs' reasoning to visual information is crucial. To address this issue, we first present a cue-driven automatic and controllable editing pipeline with the help of GPT-Image-1. It enables the automatic and precise editing of specific visual cues based on the instruction. Furthermore, we introduce VFaith-Bench, the first benchmark to evaluate MLLMs' visual reasoning capabilities and analyze the source of such capabilities with an emphasis on the visual faithfulness. Using the designed pipeline, we constructed comparative question-answer pairs by altering the visual cues in images that are crucial for solving the original reasoning problem, thereby changing the question's answer. By testing similar questions with images that have different details, the average accuracy reflects the model's visual reasoning ability, while the difference in accuracy before and after editing the test set images effectively reveals the relationship between the model's reasoning ability and visual perception. We further designed specific metrics to expose this relationship. VFaith-Bench includes 755 entries divided into five distinct subsets, along with an additional human-labeled perception task. 
We conducted in-depth testing and analysis of existing mainstream flagship models and prominent open-source model series/reasoning models on VFaith-Bench, further investigating the underlying factors of their reasoning capabilities.
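
The two quantities the abstract describes, the average accuracy over original and edited images and the accuracy difference before and after editing, can be sketched as follows. This is only an illustration of the idea, not the benchmark's own metric code: the function name `faithfulness_summary` and the boolean-list input format are assumptions, and the paper's "specific metrics" are not reproduced here.

```python
def faithfulness_summary(correct_original, correct_edited):
    """Summarize correctness on original vs. cue-edited images.

    Both arguments are equal-length lists of booleans (a hypothetical
    format): entry i says whether the model answered question i correctly.
    """
    if len(correct_original) != len(correct_edited) or not correct_original:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(correct_original)
    acc_original = sum(correct_original) / n
    acc_edited = sum(correct_edited) / n
    return {
        "acc_original": acc_original,
        "acc_edited": acc_edited,
        # Average accuracy: a proxy for visual reasoning ability.
        "mean_accuracy": (acc_original + acc_edited) / 2,
        # Accuracy drop after editing: a proxy for reliance on perception.
        "accuracy_drop": acc_original - acc_edited,
    }


# Made-up outcomes for four questions, not results from the paper.
summary = faithfulness_summary(
    [True, True, True, False],
    [True, False, False, False],
)
print(summary)
```

A large drop with a high original accuracy would suggest that the model answers from something other than the edited cue, which is the kind of relationship the benchmark is designed to expose.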

[55] Camera-based method for the detection of lifted truck axles using convolutional neural networks

Bachir Tchana Tankeu,Mohamed Bouteldja,Nicolas Grignard,Bernard Jacob

Main category: cs.CV

TL;DR: Proposes a method based on the YOLOv8s convolutional neural network for detecting lifted truck axles, with a precision of 87%, a recall of 91.7%, and an inference time of 1.4 ms, making it suitable for real-time use.

Details Motivation: Existing technologies such as weigh-in-motion systems struggle to classify vehicles with lifted axles accurately, and few commercial or technical methods exist for detecting them. Method: The YOLOv8s convolutional neural network detects lifted truck axles in images from cameras placed perpendicular to the direction of traffic. Result: Precision of 87%, recall of 91.7%, and inference time of 1.4 ms, suitable for real-time use. Conclusion: The method is effective but could be improved further with larger datasets or image augmentation. Abstract: The identification and classification of vehicles play a crucial role in various aspects of the control-sanction system. Current technologies such as weigh-in-motion (WIM) systems can classify most vehicle categories but they struggle to accurately classify vehicles with lifted axles. Moreover, very few commercial and technical methods exist for detecting lifted axles. In this paper, as part of the European project SETO (Smart Enforcement of Transport Operations), a method based on a convolutional neural network (CNN), namely YOLOv8s, was proposed for the detection of lifted truck axles in images of trucks captured by cameras placed perpendicular to the direction of traffic. The performance of the proposed method was assessed and it was found that it had a precision of 87%, a recall of 91.7%, and an inference time of 1.4 ms, which makes it well-suited for real-time implementations. These results suggest that further improvements could be made, potentially by increasing the size of the datasets and/or by using various image augmentation methods.
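
To put the reported precision and recall on a single scale, the F1 score (their harmonic mean) can be derived from the two numbers in the abstract. The sketch below also converts the 1.4 ms inference time into a rough throughput, under the assumption that the figure is per image and that pre- and post-processing are ignored; neither derived value is reported by the authors.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the abstract.
precision = 0.87
recall = 0.917
inference_ms = 1.4

f1 = f1_score(precision, recall)
# Assumption: 1.4 ms is the per-image model time, nothing else counted.
images_per_second = 1000.0 / inference_ms

print(f"F1 = {f1:.3f}")
print(f"Throughput (model only) = {images_per_second:.0f} images/s")
```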

[56] OV-MAP : Open-Vocabulary Zero-Shot 3D Instance Segmentation Map for Robots

Juno Kim,Yesol Park,Hye-Jung Yoon,Byoung-Tak Zhang

Main category: cs.CV

TL;DR: OV-MAP is a novel approach to open-world 3D mapping that integrates open features into 3D maps to enhance object recognition.

Details Motivation: Overlapping features from adjacent voxels reduce instance-level precision. Method: A class-agnostic segmentation model projects 2D masks into 3D space, combined with a depth image that merges raw and synthetic depth, and supported by a 3D mask voting mechanism. Result: Superior zero-shot performance, robustness, and adaptability on public datasets such as ScanNet200 and Replica. Conclusion: The method achieves accurate zero-shot 3D instance segmentation without relying on 3D supervised segmentation models and applies to diverse environments. Abstract: We introduce OV-MAP, a novel approach to open-world 3D mapping for mobile robots by integrating open-features into 3D maps to enhance object recognition capabilities. A significant challenge arises when overlapping features from adjacent voxels reduce instance-level precision, as features spill over voxel boundaries, blending neighboring regions together. Our method overcomes this by employing a class-agnostic segmentation model to project 2D masks into 3D space, combined with a supplemented depth image created by merging raw and synthetic depth from point clouds. This approach, along with a 3D mask voting mechanism, enables accurate zero-shot 3D instance segmentation without relying on 3D supervised segmentation models. We assess the effectiveness of our method through comprehensive experiments on public datasets such as ScanNet200 and Replica, demonstrating superior zero-shot performance, robustness, and adaptability across diverse environments. Additionally, we conducted real-world experiments to demonstrate our method's adaptability and robustness when applied to diverse real-world environments.

[57] EasyARC: Evaluating Vision Language Models on True Visual Reasoning

Mert Unsal,Aylin Akkus

Main category: cs.CV

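TL;DR: The paper introduces EasyARC, a multimodal reasoning benchmark combining vision and text that requires multi-image, multi-step reasoning and self-correction, for evaluating the true reasoning ability of vision-language models.

Details Motivation: Existing benchmarks mainly test visual extraction combined with text-based reasoning and lack true reasoning over complex interactions between vision and language. Inspired by the ARC challenge, the authors aim to fill this gap. Method: The EasyARC benchmark procedurally generates multi-image, multi-step reasoning tasks, supports progressive difficulty and structured evaluation, and is suitable for reinforcement learning. Result: State-of-the-art vision-language models are evaluated and their failure modes analyzed, showing that EasyARC can test models' reasoning and scaling capabilities. Conclusion: EasyARC sets a new standard for evaluating true reasoning and test-time scaling in vision-language models; the dataset and evaluation code are open-sourced. Abstract: Building on recent advances in language-based reasoning models, we explore multimodal reasoning that integrates vision and text. Existing multimodal benchmarks primarily test visual extraction combined with text-based reasoning, lacking true visual reasoning with more complex interactions between vision and language. Inspired by the ARC challenge, we introduce EasyARC, a vision-language benchmark requiring multi-image, multi-step reasoning, and self-correction. EasyARC is procedurally generated, fully verifiable, and scalable, making it ideal for reinforcement learning (RL) pipelines. The generators incorporate progressive difficulty levels, enabling structured evaluation across task types and complexities. We benchmark state-of-the-art vision-language models and analyze their failure modes. We argue that EasyARC sets a new standard for evaluating true reasoning and test-time scaling capabilities in vision-language models. We open-source our benchmark dataset and evaluation code.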

[58] A$^2$LC: Active and Automated Label Correction for Semantic Segmentation

Youjin Jeon,Kyusik Cho,Suhan Woo,Euntai Kim

Main category: cs.CV

TL;DR: This paper proposes A²LC, an efficient active label correction framework that integrates an automated correction stage and an adaptively balanced acquisition function, substantially improving the efficiency and effectiveness of label correction for semantic segmentation.

Details Motivation: Manual pixel-wise annotation is costly and error-prone; existing methods improve correction efficiency by generating pseudo-labels with foundation models, but substantial inefficiencies remain. Method: The A²LC framework adds an automated correction stage that uses annotator feedback to extend correction beyond the queried samples, and proposes an adaptively balanced acquisition function that emphasizes tail classes. Result: On Cityscapes and PASCAL VOC 2012, A²LC outperforms existing methods using only 20% of their budget and yields a 27.23% performance improvement under an equivalent budget. Conclusion: Through automated correction and adaptive balancing, A²LC substantially improves the efficiency and effectiveness of label correction, offering an efficient solution for semantic segmentation. Abstract: Active Label Correction (ALC) has emerged as a promising solution to the high cost and error-prone nature of manual pixel-wise annotation in semantic segmentation, by selectively identifying and correcting mislabeled data. Although recent work has improved correction efficiency by generating pseudo-labels using foundation models, substantial inefficiencies still remain. In this paper, we propose Active and Automated Label Correction for semantic segmentation (A$^2$LC), a novel and efficient ALC framework that integrates an automated correction stage into the conventional pipeline. Specifically, the automated correction stage leverages annotator feedback to perform label correction beyond the queried samples, thereby maximizing cost efficiency. In addition, we further introduce an adaptively balanced acquisition function that emphasizes underrepresented tail classes and complements the automated correction mechanism. Extensive experiments on Cityscapes and PASCAL VOC 2012 demonstrate that A$^2$LC significantly outperforms previous state-of-the-art methods. Notably, A$^2$LC achieves high efficiency by outperforming previous methods using only 20% of their budget, and demonstrates strong effectiveness by yielding a 27.23% performance improvement under an equivalent budget constraint on the Cityscapes dataset. The code will be released upon acceptance.

[59] Wi-CBR: WiFi-based Cross-domain Behavior Recognition via Multimodal Collaborative Awareness

Ruobei Zhang,Shengeng Tang,Huan Yan,Xiang Zhang,Richang Hong

Main category: cs.CV

TL;DR: Proposes a WiFi-based multimodal collaborative awareness method that improves behavior recognition accuracy by fusing phase data and DFS data.

Details Motivation: Existing methods typically focus on a single type of data and neglect the interaction and fusion of multiple features. Method: A dual-branch self-attention module captures spatial-temporal cues within each modality, a group attention mechanism mines key features, and a gating mechanism optimizes information entropy. Result: Experiments on the Widar3.0 and XRF55 datasets show superior performance. Conclusion: The multimodal collaborative awareness method markedly improves behavior recognition accuracy. Abstract: WiFi-based human behavior recognition aims to recognize gestures and activities by analyzing wireless signal variations. However, existing methods typically focus on a single type of data, neglecting the interaction and fusion of multiple features. To this end, we propose a novel multimodal collaborative awareness method. By leveraging phase data reflecting changes in dynamic path length and Doppler frequency shift (DFS) data corresponding to frequency changes related to the speed of gesture movement, we enable efficient interaction and fusion of these features to improve recognition accuracy. Specifically, we first introduce a dual-branch self-attention module to capture spatial-temporal cues within each modality. Then, a group attention mechanism is applied to the concatenated phase and DFS features to mine key group features critical for behavior recognition. Through a gating mechanism, the combined features are further divided into PD-strengthen and PD-weaken branches, optimizing information entropy and promoting cross-modal collaborative awareness. Extensive in-domain and cross-domain experiments on two large publicly available datasets, Widar3.0 and XRF55, demonstrate the superior performance of our method.

[60] SignAligner: Harmonizing Complementary Pose Modalities for Coherent Sign Language Generation

Xu Wang,Shengeng Tang,Lechao Cheng,Feng Li,Shuo Wang,Richang Hong

Main category: cs.CV

TL;DR: The paper proposes SignAligner, a new method for generating realistic sign language videos in three stages: text-driven co-generation of pose modalities, online collaborative correction of multiple modalities, and realistic sign video synthesis.

Details Motivation: Sign language generation aims to produce diverse sign representations from spoken language, but the complexity of sign language (hand gestures, facial expressions, and body movements) makes realistic and natural generation challenging. Method: SignAligner has three stages: 1) joint generation of posture coordinates, gesture actions, and body movements; 2) correction of the generated pose modalities through dynamic loss weighting and cross-modal attention; 3) feeding the corrected modalities into a pre-trained video generation network. Result: Experiments show that SignAligner significantly improves the accuracy and expressiveness of generated sign videos. Conclusion: Through multimodal collaboration and dynamic correction, SignAligner effectively improves the realism and naturalness of sign language generation. Abstract: Sign language generation aims to produce diverse sign representations based on spoken language. However, achieving realistic and naturalistic generation remains a significant challenge due to the complexity of sign language, which encompasses intricate hand gestures, facial expressions, and body movements. In this work, we introduce PHOENIX14T+, an extended version of the widely-used RWTH-PHOENIX-Weather 2014T dataset, featuring three new sign representations: Pose, Hamer and Smplerx. We also propose a novel method, SignAligner, for realistic sign language generation, consisting of three stages: text-driven pose modalities co-generation, online collaborative correction of multimodality, and realistic sign video synthesis. First, by incorporating text semantics, we design a joint sign language generator to simultaneously produce posture coordinates, gesture actions, and body movements. The text encoder, based on a Transformer architecture, extracts semantic features, while a cross-modal attention mechanism integrates these features to generate diverse sign language representations, ensuring accurate mapping and controlling the diversity of modal features. Next, online collaborative correction is introduced to refine the generated pose modalities using a dynamic loss weighting strategy and cross-modal attention, facilitating the complementarity of information across modalities, eliminating spatiotemporal conflicts, and ensuring semantic coherence and action consistency. Finally, the corrected pose modalities are fed into a pre-trained video generation network to produce high-fidelity sign language videos.
Extensive experiments demonstrate that SignAligner significantly improves both the accuracy and expressiveness of the generated sign videos.

[61] Evaluating Fairness and Mitigating Bias in Machine Learning: A Novel Technique using Tensor Data and Bayesian Regression

Kuniko Paxton,Koorosh Aslansefat,Dhavalkumar Thakker,Yiannis Papadopoulos

Main category: cs.CV

TL;DR: This paper proposes a new method for evaluating machine learning fairness, focused on the handling of skin color in image classification; it avoids the limitations of conventional categorization and uses probability distributions and statistical distance measures for finer-grained fairness analysis.

Details Motivation: In computer vision, skin color is represented as tensor data, unlike the categorical features of other sensitive attributes; existing fairness research mostly addresses the latter and overlooks this difference. Method: Skin color tensor data is converted into probability distributions and compared with statistical distance measures; a training method based on Bayesian regression with polynomial functions reduces the latent bias of conventional skin tone categorization. Result: The new method captures fine-grained fairness differences that conventional categorization misses and yields fairer handling of skin color during model training. Conclusion: The method offers a more flexible and accurate way to evaluate and improve skin-color-related fairness in machine learning, advancing Trustworthy AI. Abstract: Fairness is a critical component of Trustworthy AI. In this paper, we focus on Machine Learning (ML) and the performance of model predictions when dealing with skin color. Unlike other sensitive attributes, the nature of skin color differs significantly. In computer vision, skin color is represented as tensor data rather than categorical values or single numerical points. However, much of the research on fairness across sensitive groups has focused on categorical features such as gender and race. This paper introduces a new technique for evaluating fairness in ML for image classification tasks, specifically without the use of annotation. To address the limitations of prior work, we handle tensor data, like skin color, without classifying it rigidly. Instead, we convert it into probability distributions and apply statistical distance measures. This novel approach allows us to capture fine-grained nuances in fairness both within and across what would traditionally be considered distinct groups. Additionally, we propose an innovative training method to mitigate the latent biases present in conventional skin tone categorization. This method leverages color distance estimates calculated through Bayesian regression with polynomial functions, ensuring a more nuanced and equitable treatment of skin color in ML models.
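
The step of converting skin-color data into probability distributions and comparing them with a statistical distance can be illustrated with a small sketch. The abstract does not say which color representation or which distance the authors use, so everything specific here is an assumption: a single intensity channel in [0, 256), an 8-bin histogram, the 1-D Wasserstein distance, and the helper names `to_distribution` and `wasserstein_1d`.

```python
def to_distribution(values, bins=8, lo=0.0, hi=256.0):
    """Normalized histogram of scalar pixel values (assumed single channel)."""
    if not values:
        raise ValueError("values must be non-empty")
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Clamp out-of-range values into the first or last bin.
        idx = max(0, min(int((v - lo) / width), bins - 1))
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]


def wasserstein_1d(p, q, bin_width=1.0):
    """1-D Wasserstein distance between two histograms on the same bins.

    Computed as the summed absolute difference of the two CDFs.
    """
    cdf_gap = 0.0
    distance = 0.0
    for p_i, q_i in zip(p, q):
        cdf_gap += p_i - q_i
        distance += abs(cdf_gap) * bin_width
    return distance


# Made-up pixel intensities for two skin patches, not data from the paper.
patch_a = [40, 50, 60, 70]
patch_b = [180, 190, 200, 210]
dist_a = to_distribution(patch_a)
dist_b = to_distribution(patch_b)
gap = wasserstein_1d(dist_a, dist_b)
print(f"Distance in bin units: {gap}")
```

Because the distance is continuous, two images that a categorical skin tone scale would place in the same group can still be told apart, which is the fine-grained behavior the abstract emphasizes.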

[62] DISCO: Mitigating Bias in Deep Learning with Conditional Distance Correlation

Emre Kavak,Tom Nuno Wolf,Christian Wachinger

Main category: cs.CV

TL;DR: The paper proposes a standard anti-causal prediction model (SAM) and a regularization strategy, DISCO, to address models relying on irrelevant signals in prediction tasks, and validates them experimentally.

Details Motivation: In prediction tasks, models may rely on irrelevant signals (such as lighting conditions) rather than true causal relationships, leading to undesirable predictions. This paper aims to address that problem. Method: SAM is introduced to analyze information pathways in anti-causal settings, and the DISCO regularization strategy uses conditional distance correlation to optimize for conditional independence in regression tasks. Result: Experiments show that DISCO achieves competitive results across several bias mitigation tasks and can serve as an alternative to classical kernel-based methods. Conclusion: SAM and DISCO provide an effective solution for anti-causal prediction problems, reducing the influence of irrelevant signals and improving model robustness. Abstract: During prediction tasks, models can use any signal they receive to come up with the final answer - including signals that are causally irrelevant. When predicting objects from images, for example, the lighting conditions could be correlated to different targets through selection bias, and an oblivious model might use these signals as shortcuts to discern between various objects. A predictor that uses lighting conditions instead of real object-specific details is obviously undesirable. To address this challenge, we introduce a standard anti-causal prediction model (SAM) that creates a causal framework for analyzing the information pathways influencing our predictor in anti-causal settings. We demonstrate that a classifier satisfying a specific conditional independence criterion will focus solely on the direct causal path from label to image, being counterfactually invariant to the remaining variables. Finally, we propose DISCO, a novel regularization strategy that uses conditional distance correlation to optimize for conditional independence in regression tasks. We show that DISCO achieves competitive results in different bias mitigation experiments, making it a valid alternative to classical kernel-based methods.
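
DISCO builds on conditional distance correlation. As background, the sketch below computes the plain (unconditional) sample distance correlation of Székely et al. for one-dimensional samples. It is not the authors' regularizer and does not do the conditioning step; it is meant only to show what kind of dependence measure is involved, and the function names are illustrative.

```python
import math


def _centered_distances(x):
    """Double-centered matrix of pairwise absolute distances."""
    n = len(x)
    d = [[abs(a - b) for b in x] for a in x]
    row_mean = [sum(row) / n for row in d]
    grand_mean = sum(row_mean) / n
    # The matrix is symmetric, so column means equal row means.
    return [
        [d[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def distance_correlation(x, y):
    """Sample distance correlation of two equal-length 1-D samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    n = len(x)
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n**2
    dvar_x = sum(v * v for row in a for v in row) / n**2
    dvar_y = sum(v * v for row in b for v in row) / n**2
    denom = math.sqrt(dvar_x * dvar_y)
    if denom == 0:
        return 0.0  # one sample is constant
    return math.sqrt(max(dcov2, 0.0) / denom)


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear = distance_correlation(xs, [2 * v + 1 for v in xs])
quadratic = distance_correlation(xs, [v * v for v in xs])
print(f"linear: {linear:.3f}, quadratic: {quadratic:.3f}")
```

Unlike Pearson correlation, which is zero for the symmetric quadratic example, distance correlation is positive there. That sensitivity to nonlinear dependence is what makes this family of measures a reasonable penalty term when the goal is independence rather than mere decorrelation.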

[63] Prohibited Items Segmentation via Occlusion-aware Bilayer Modeling

Yunhan Ren,Ruihuang Li,Lingbo Liu,Changwen Chen

Main category: cs.CV

TL;DR: Proposes an occlusion-aware instance segmentation method for prohibited items in X-ray images that combines the SAM model with a bilayer mask decoder module and substantially improves segmentation performance.

Details Motivation: Prohibited items in X-ray images look very different from natural objects, and objects overlap heavily, which makes segmentation difficult. Method: The Segment Anything Model (SAM) is integrated to exploit its prior knowledge, an occlusion-aware bilayer mask decoder module is designed, and two large-scale datasets are annotated with occlusion areas. Result: Experiments on the PIDray-A and PIXray-A datasets confirm the method's effectiveness. Conclusion: The proposed method effectively addresses the segmentation of prohibited items in X-ray images; the datasets and code are open-sourced. Abstract: Instance segmentation of prohibited items in security X-ray images is a critical yet challenging task. This is mainly caused by the significant appearance gap between prohibited items in X-ray images and natural objects, as well as the severe overlapping among objects in X-ray images. To address these issues, we propose an occlusion-aware instance segmentation pipeline designed to identify prohibited items in X-ray images. Specifically, to bridge the representation gap, we integrate the Segment Anything Model (SAM) into our pipeline, taking advantage of its rich priors and zero-shot generalization capabilities. To address the overlap between prohibited items, we design an occlusion-aware bilayer mask decoder module that explicitly models the occlusion relationships. To supervise occlusion estimation, we manually annotated occlusion areas of prohibited items in two large-scale X-ray image segmentation datasets, PIDray and PIXray. We then reorganized these additional annotations together with the original information as two occlusion-annotated datasets, PIDray-A and PIXray-A. Extensive experimental results on these occlusion-annotated datasets demonstrate the effectiveness of our proposed method. The datasets and codes are available at: https://github.com/Ryh1218/Occ

[64] Dynamic Mixture of Curriculum LoRA Experts for Continual Multimodal Instruction Tuning

Chendi Ge,Xin Wang,Zeyang Zhang,Hong Chen,Jiapei Fan,Longtao Huang,Hui Xue,Wenwu Zhu

Main category: cs.CV

TL;DR: Proposes the Dynamic Mixture of Curriculum LoRA Experts (D-MoLE) method, which dynamically adjusts the architecture of an MLLM to resolve task architecture conflict and modality imbalance, substantially improving performance.

Details Motivation: Existing methods use a fixed architecture, adapt poorly to new tasks, and suffer from task architecture conflict and modality imbalance. Method: D-MoLE comprises a dynamic layer-wise expert allocator and a gradient-based inter-modal curriculum, which adjust the architecture dynamically and balance updates across modalities. Result: Experiments show that D-MoLE significantly outperforms existing baselines, with a 15% average improvement. Conclusion: This is the first work to study continual learning for MLLMs from an architectural perspective, offering a new approach to task adaptation. Abstract: Continual multimodal instruction tuning is crucial for adapting Multimodal Large Language Models (MLLMs) to evolving tasks. However, most existing methods adopt a fixed architecture, struggling with adapting to new tasks due to static model capacity. We propose to evolve the architecture under parameter budgets for dynamic task adaptation, which remains unexplored and imposes two challenges: 1) task architecture conflict, where different tasks require varying layer-wise adaptations, and 2) modality imbalance, where different tasks rely unevenly on modalities, leading to unbalanced updates. To address these challenges, we propose a novel Dynamic Mixture of Curriculum LoRA Experts (D-MoLE) method, which automatically evolves MLLM's architecture with controlled parameter budgets to continually adapt to new tasks while retaining previously learned knowledge. Specifically, we propose a dynamic layer-wise expert allocator, which automatically allocates LoRA experts across layers to resolve architecture conflicts, and routes instructions layer-wise to facilitate knowledge sharing among experts. Then, we propose a gradient-based inter-modal continual curriculum, which adjusts the update ratio of each module in MLLM based on the difficulty of each modality within the task to alleviate the modality imbalance problem. Extensive experiments show that D-MoLE significantly outperforms state-of-the-art baselines, achieving a 15% average improvement over the best baseline. To the best of our knowledge, this is the first study of continual learning for MLLMs from an architectural perspective.

[65] Cross-Modal Clustering-Guided Negative Sampling for Self-Supervised Joint Learning from Medical Images and Reports

Libin Lan,Hongxing Li,Zunhui Xia,Juan Zhou,Xiaofei Zhu,Yongmei Li,Yudong Zhang,Xin Luo

Main category: cs.CV

TL;DR: The paper proposes a Cross-Modal Cluster-Guided Negative Sampling method (CM-CGNS) that improves medical visual representation learning through better negative sample selection and stronger local detail extraction.

Details Motivation: Existing models for multimodal self-supervised learning from medical images and reports select negative samples poorly and overlook local details and low-level features, which harms diagnostic accuracy. Method: 1) k-means clustering is extended through cross-modal attention for multimodal negative sample selection; 2) a Cross-Modal Masked Image Reconstruction (CM-MIR) module is introduced to strengthen local feature interaction. Result: On classification, detection, and segmentation tasks across five downstream datasets, CM-CGNS outperforms existing methods on multiple metrics. Conclusion: By improving negative sample selection and local feature extraction, CM-CGNS substantially improves medical visual representation learning. Abstract: Learning medical visual representations directly from paired images and reports through multimodal self-supervised learning has emerged as a novel and efficient approach to digital diagnosis in recent years. However, existing models suffer from several severe limitations: 1) they neglect the selection of negative samples, resulting in the scarcity of hard negatives and the inclusion of false negatives; 2) they focus on global feature extraction but overlook the fine-grained local details that are crucial for medical image recognition tasks; and 3) their contrastive learning primarily targets high-level features but ignores low-level details that are essential for accurate medical analysis. Motivated by these critical issues, this paper presents a Cross-Modal Cluster-Guided Negative Sampling (CM-CGNS) method with two-fold ideas. First, it extends the k-means clustering used for local text features in the single-modal domain to the multimodal domain through cross-modal attention. This improvement increases the number of negative samples and boosts the model representation capability. Second, it introduces a Cross-Modal Masked Image Reconstruction (CM-MIR) module that leverages local text-to-image features obtained via cross-modal attention to reconstruct masked local image regions. This module significantly strengthens the model's cross-modal information interaction capabilities and retains low-level image features essential for downstream tasks. By well handling the aforementioned limitations, the proposed CM-CGNS can learn effective and robust medical visual representations suitable for various recognition tasks.
Extensive experimental results on classification, detection, and segmentation tasks across five downstream datasets show that our method outperforms state-of-the-art approaches on multiple metrics, verifying its superior performance.

[66] Predicting Patient Survival with Airway Biomarkers using nn-Unet/Radiomics

Zacharia Mesbah,Dhruv Jain,Tsiry Mayet,Romain Modzelewski,Romain Herault,Simon Bernard,Sebastien Thureau,Clement Chatelain

Main category: cs.CV

TL;DR: The study uses a three-stage approach (airway segmentation, feature extraction, and classification) to assess how well airway imaging biomarkers predict survival outcomes for patients with lung fibrosis, obtaining high segmentation and classification scores.

Details Motivation: The study explores the importance of airway-related imaging biomarkers in predicting survival outcomes for patients with lung fibrosis. Method: A three-stage approach: 1) nn-Unet segments the airway structure; 2) key features are extracted around the trachea and the airway; 3) the features are fed into an SVM classifier. Result: A score of 0.8601 for the segmentation task and 0.7346 for the classification task. Conclusion: The approach shows strong predictive ability in airway imaging analysis and offers an effective tool for survival prediction in lung fibrosis patients. Abstract: The primary objective of the AIIB 2023 competition is to evaluate the predictive significance of airway-related imaging biomarkers in determining the survival outcomes of patients with lung fibrosis. This study introduces a comprehensive three-stage approach. Initially, a segmentation network, namely nn-Unet, is employed to delineate the airway's structural boundaries. Subsequently, key features are extracted from the radiomic images centered around the trachea and an enclosing bounding box around the airway. This step is motivated by the potential presence of critical survival-related insights within the tracheal region as well as pertinent information encoded in the structure and dimensions of the airway. Lastly, radiomic features obtained from the segmented areas are integrated into an SVM classifier. We obtained an overall score of 0.8601 for the segmentation in Task 1 and 0.7346 for the classification in Task 2.

[67] Pose Matters: Evaluating Vision Transformers and CNNs for Human Action Recognition on Small COCO Subsets

MingZe Tang,Madiha Kazi

Main category: cs.CV

TL;DR: This study compares the action recognition performance of different models on the COCO image dataset and finds that the Vision Transformer (ViT) performs best, reaching 90% accuracy and clearly outperforming convolutional networks and CLIP-based models.

Details Motivation: To explore performance differences between models on action recognition and to analyze why they fail. Method: Using a three-class subset of the COCO image dataset, models ranging from fully connected networks to transformer architectures are tested and assessed with statistical analysis and visualization techniques. Result: ViT achieves the highest test accuracy (90%) and attends to action-relevant regions more accurately, whereas other models are easily distracted by the background. Conclusion: Transformer models outperform conventional approaches in data efficiency and accuracy, and explainability techniques help diagnose model failures. Abstract: This study explores human action recognition using a three-class subset of the COCO image corpus, benchmarking models from simple fully connected networks to transformer architectures. The binary Vision Transformer (ViT) achieved 90% mean test accuracy, significantly exceeding multiclass classifiers such as convolutional networks (approximately 35%) and CLIP-based models (approximately 62-64%). A one-way ANOVA (F = 61.37, p < 0.001) confirmed these differences are statistically significant. Qualitative analysis with SHAP explainer and LeGrad heatmaps indicated that the ViT localizes pose-specific regions (e.g., lower limbs for walking or running), while simpler feed-forward models often focus on background textures, explaining their errors. These findings emphasize the data efficiency of transformer representations and the importance of explainability techniques in diagnosing class-specific failures.
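
The one-way ANOVA cited in the abstract compares the variance between model groups with the variance within each group. A minimal sketch of how the F statistic is computed follows; the sample values are made up for the illustration and are not the paper's per-run accuracies, so the result is unrelated to the reported F = 61.37.

```python
def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA over a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more samples than groups")
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: spread of samples around their group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical scores for three model families, three runs each.
f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f"F = {f_stat:.2f}")
```

A large F means the group means differ by much more than the run-to-run noise inside each group; the p-value then comes from the F distribution with (k - 1, N - k) degrees of freedom, which this sketch does not compute.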

[68] MTabVQA: Evaluating Multi-Tabular Reasoning of Language Models in Visual Space

Anshul Singh,Chris Biemann,Jan Strich

Main category: cs.CV

TL;DR: MTabVQA is a new benchmark for evaluating the multi-hop reasoning ability of vision-language models over multiple table images; it reveals the limitations of existing models and shows that fine-tuning improves performance.

Details Motivation: Existing benchmarks cannot assess a model's ability to parse and reason over multiple table images; MTabVQA fills this gap. Method: The MTabVQA benchmark and the MTabVQA-Instruct dataset are introduced, and models are improved through fine-tuning. Result: Experiments show that fine-tuning substantially improves performance on multi-tabular visual reasoning tasks. Conclusion: MTabVQA provides an effective evaluation tool for multi-tabular visual question answering, and fine-tuning improves model capability. Abstract: Vision-Language Models (VLMs) have demonstrated remarkable capabilities in interpreting visual layouts and text. However, a significant challenge remains in their ability to interpret robustly and reason over multi-tabular data presented as images, a common occurrence in real-world scenarios like web pages and digital documents. Existing benchmarks typically address single tables or non-visual data (text/structured). This leaves a critical gap: they don't assess the ability to parse diverse table images, correlate information across them, and perform multi-hop reasoning on the combined visual data. We introduce MTabVQA, a novel benchmark specifically designed for multi-tabular visual question answering to bridge that gap. MTabVQA comprises 3,745 complex question-answer pairs that necessitate multi-hop reasoning across several visually rendered table images. We provide extensive benchmark results for state-of-the-art VLMs on MTabVQA, revealing significant performance limitations. We further investigate post-training techniques to enhance these reasoning abilities and release MTabVQA-Instruct, a large-scale instruction-tuning dataset. Our experiments show that fine-tuning VLMs with MTabVQA-Instruct substantially improves their performance on visual multi-tabular reasoning. Code and dataset (https://huggingface.co/datasets/mtabvqa/MTabVQA-Eval) are available online (https://anonymous.4open.science/r/MTabVQA-EMNLP-B16E).

[69] DMAF-Net: An Effective Modality Rebalancing Framework for Incomplete Multi-Modal Medical Image Segmentation

Libin Lan,Hongxing Li,Zunhui Xia,Yudong Zhang

Main category: cs.CV

TL;DR: DMAF-Net is a dynamic modality-aware fusion network that addresses modality imbalance in multi-modal medical image segmentation by dynamically balancing modality contributions and structural relationships.

Details Motivation: Existing methods assume that all modalities are available and cannot dynamically balance modality contributions or structural relationships, so they perform poorly in real clinical scenarios. Method: DMAF-Net introduces a Dynamic Modality-Aware Fusion module, a Relation Distillation and Prototype Distillation framework, and a Dynamic Training Monitoring strategy. Result: DMAF-Net outperforms existing methods on the BraTS2020 and MyoPS2020 datasets. Conclusion: By dynamically balancing modality contributions and structural relationships, DMAF-Net substantially improves multi-modal medical image segmentation. Abstract: Incomplete multi-modal medical image segmentation faces critical challenges from modality imbalance, including imbalanced modality missing rates and heterogeneous modality contributions. Due to their reliance on idealized assumptions of complete modality availability, existing methods fail to dynamically balance contributions and neglect the structural relationships between modalities, resulting in suboptimal performance in real-world clinical scenarios. To address these limitations, we propose a novel model, named Dynamic Modality-Aware Fusion Network (DMAF-Net). The DMAF-Net adopts three key ideas. First, it introduces a Dynamic Modality-Aware Fusion (DMAF) module to suppress missing-modality interference by combining transformer attention with adaptive masking and weight modality contributions dynamically through attention maps. Second, it designs a synergistic Relation Distillation and Prototype Distillation framework to enforce global-local feature alignment via covariance consistency and masked graph attention, while ensuring semantic consistency through cross-modal class-specific prototype alignment. Third, it presents a Dynamic Training Monitoring (DTM) strategy to stabilize optimization under imbalanced missing rates by tracking distillation gaps in real-time, and to balance convergence speeds across modalities by adaptively reweighting losses and scaling gradients. Extensive experiments on BraTS2020 and MyoPS2020 demonstrate that DMAF-Net outperforms existing methods for incomplete multi-modal medical image segmentation.
Our code is available at https://github.com/violet-42/DMAF-Net.

[70] Quizzard@INOVA Challenge 2025 -- Track A: Plug-and-Play Technique in Interleaved Multi-Image Model

Dinh Viet Cuong,Hoang-Bao Le,An Pham Ngoc Nguyen,Liting Zhou,Cathal Gurrin

Main category: cs.CV

TL;DR: The paper demonstrates the strong performance of LLaVA-NeXT-interleave on 22 datasets and compares the standard model with a DCI-enhanced version.

Details Motivation: The study aims to verify the performance of LLaVA-NeXT-interleave across multiple tasks and to explore the effect of the DCI connector on the model. Method: The DCI connector is added to the standard model, and performance is tested on 22 datasets. Result: The standard model performs best on vision-heavy tasks, while the DCI-enhanced version is stronger on tasks requiring semantic coherence and structured change understanding. Conclusion: Combining foundation models with plug-and-play techniques shows promise; the code is open-sourced. Abstract: This paper addresses two main objectives. Firstly, we demonstrate the impressive performance of the LLaVA-NeXT-interleave on 22 datasets across three different tasks: Multi-Image Reasoning, Documents and Knowledge-Based Understanding and Interactive Multi-Modal Communication. Secondly, we add the Dense Channel Integration (DCI) connector to the LLaVA-NeXT-Interleave and compare its performance against the standard model. We find that the standard model achieves the highest overall accuracy, excelling in vision-heavy tasks like VISION, NLVR2, and Fashion200K. Meanwhile, the DCI-enhanced version shows particular strength on datasets requiring deeper semantic coherence or structured change understanding such as MIT-States_PropertyCoherence and SlideVQA. Our results highlight the potential of combining powerful foundation models with plug-and-play techniques for Interleave tasks. The code is available at https://github.com/dinhvietcuong1996/icme25-inova.

[71] AgriPotential: A Novel Multi-Spectral and Multi-Temporal Remote Sensing Dataset for Agricultural Potentials

Mohammad El Sakka,Caroline De Pourtales,Lotfi Chaari,Josiane Mothe

Main category: cs.CV

TL;DR: AgriPotential is a new benchmark dataset based on Sentinel-2 satellite imagery for agricultural potential prediction, supporting a range of machine learning tasks.

Details Motivation: Remote sensing is essential for large-scale Earth monitoring and land management, but no public dataset has been dedicated to agricultural potential prediction. Method: The dataset contains Sentinel-2 imagery spanning multiple months, with pixel-level annotations for three major crop types, covering diverse areas of Southern France. Result: AgriPotential supports tasks such as ordinal regression, multi-label classification, and spatio-temporal modeling, providing data for sustainable land use planning. Conclusion: AgriPotential fills the gap in datasets for agricultural potential prediction and is publicly available, aiming to advance data-driven land use planning. Abstract: Remote sensing has emerged as a critical tool for large-scale Earth monitoring and land management. In this paper, we introduce AgriPotential, a novel benchmark dataset composed of Sentinel-2 satellite imagery spanning multiple months. The dataset provides pixel-level annotations of agricultural potentials for three major crop types - viticulture, market gardening, and field crops - across five ordinal classes. AgriPotential supports a broad range of machine learning tasks, including ordinal regression, multi-label classification, and spatio-temporal modeling. The data covers diverse areas in Southern France, offering rich spectral information. AgriPotential is the first public dataset designed specifically for agricultural potential prediction, aiming to improve data-driven approaches to sustainable land use planning. The dataset and the code are freely accessible at: https://zenodo.org/records/15556484

[72] DiffFuSR: Super-Resolution of all Sentinel-2 Multispectral Bands using Diffusion Models

Muhammad Sarmad,Arnt-Børre Salberg,Michael Kampffmeyer

Main category: cs.CV

TL;DR: DiffFuSR is a modular pipeline that super-resolves all 12 spectral bands of Sentinel-2 Level-2A imagery to a unified ground sampling distance (GSD) of 2.5 meters.

Details Motivation: To resolve the inconsistent resolution of Sentinel-2 multispectral bands and improve image quality and usability. Method: A two-stage approach: 1) diffusion-based super-resolution of the RGB bands; 2) a fusion network that upscales the remaining multispectral bands using the super-resolved RGB image as a spatial prior. Result: The method outperforms existing approaches on the OpenSR benchmark in reflectance fidelity, spectral consistency, spatial alignment, and hallucination suppression. Conclusion: Harmonized learning with generative priors and fusion strategies yields a modular super-resolution framework for Sentinel-2. Abstract: This paper presents DiffFuSR, a modular pipeline for super-resolving all 12 spectral bands of Sentinel-2 Level-2A imagery to a unified ground sampling distance (GSD) of 2.5 meters. The pipeline comprises two stages: (i) a diffusion-based super-resolution (SR) model trained on high-resolution RGB imagery from the NAIP and WorldStrat datasets, harmonized to simulate Sentinel-2 characteristics; and (ii) a learned fusion network that upscales the remaining multispectral bands using the super-resolved RGB image as a spatial prior. We introduce a robust degradation model and contrastive degradation encoder to support blind SR. Extensive evaluations of the proposed SR pipeline on the OpenSR benchmark demonstrate that the proposed method outperforms current SOTA baselines in terms of reflectance fidelity, spectral consistency, spatial alignment, and hallucination suppression. Furthermore, the fusion network significantly outperforms classical pansharpening approaches, enabling accurate enhancement of Sentinel-2's 20 m and 60 m bands. This study underscores the power of harmonized learning with generative priors and fusion strategies to create a modular framework for Sentinel-2 SR. Our code and models can be found at https://github.com/NorskRegnesentral/DiffFuSR.
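
To make the "unified 2.5 m GSD" target concrete, the sketch below works out the upscaling factor each band would need. The native resolutions come from the standard Sentinel-2 band specification rather than from the abstract, and the grouping into stages is only implied by it, so treat the table as an assumption of this example.

```python
# Native ground sampling distance per Level-2A band, in meters.
# Assumption: standard Sentinel-2 MSI band layout (B10 is absent in Level-2A).
NATIVE_GSD_M = {
    "B01": 60, "B02": 10, "B03": 10, "B04": 10,
    "B05": 20, "B06": 20, "B07": 20, "B08": 10,
    "B8A": 20, "B09": 60, "B11": 20, "B12": 20,
}
TARGET_GSD_M = 2.5  # target stated in the abstract

# Linear upscaling factor per band; pixel count grows with its square.
scale_factor = {band: gsd / TARGET_GSD_M for band, gsd in NATIVE_GSD_M.items()}

for band, factor in sorted(scale_factor.items()):
    print(f"{band}: {NATIVE_GSD_M[band]:>2} m -> {TARGET_GSD_M} m (x{factor:g})")
```

Under these assumptions the 10 m bands need a 4x upscale, the 20 m bands 8x, and the 60 m bands 24x, which helps explain why a separate fusion stage guided by the super-resolved RGB image is used for the coarser bands.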

[73] MambaVSR: Content-Aware Scanning State Space Model for Video Super-Resolution

Linfeng He,Meiqin Liu,Qi Tang,Chao Yao,Yao Zhao

Main category: cs.CV

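TL;DR: MambaVSR is a state-space-model framework for video super-resolution that uses dynamic spatiotemporal interactions and a content-aware mechanism to markedly improve non-local dependency modeling while remaining computationally efficient.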

Details Motivation: Existing video super-resolution methods rely mainly on optical flow or Transformer architectures, which handle large motion displacements and long video sequences poorly and cannot effectively model non-local dependencies. Method: MambaVSR introduces the Shared Compass Construction (SCC) and Content-Aware Sequentialization (CAS) modules, which dynamically generate spatial scanning sequences and aggregate similar content across frames. A Global-Local State Space Block (GLSSB) combines window self-attention with SSM-based feature propagation. Result: On the REDS dataset, MambaVSR outperforms the Transformer-based method by 0.58 dB PSNR with 55% fewer parameters. Conclusion: Through its content-aware mechanism and state space model, MambaVSR markedly improves both the performance and the efficiency of video super-resolution. Abstract: Video super-resolution (VSR) faces critical challenges in effectively modeling non-local dependencies across misaligned frames while preserving computational efficiency. Existing VSR methods typically rely on optical flow strategies or transformer architectures, which struggle with large motion displacements and long video sequences. To address this, we propose MambaVSR, the first state-space model framework for VSR that incorporates an innovative content-aware scanning mechanism. Unlike rigid 1D sequential processing in conventional vision Mamba methods, our MambaVSR enables dynamic spatiotemporal interactions through the Shared Compass Construction (SCC) and the Content-Aware Sequentialization (CAS). Specifically, the SCC module constructs intra-frame semantic connectivity graphs via efficient sparse attention and generates adaptive spatial scanning sequences through spectral clustering. Building upon SCC, the CAS module effectively aligns and aggregates non-local similar content across multiple frames by interleaving temporal features along the learned spatial order. To bridge global dependencies with local details, the Global-Local State Space Block (GLSSB) synergistically integrates window self-attention operations with SSM-based feature propagation, enabling high-frequency detail recovery under global dependency guidance. Extensive experiments validate MambaVSR's superiority, outperforming the Transformer-based method by 0.58 dB PSNR on the REDS dataset with 55% fewer parameters.
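
To put the reported 0.58 dB PSNR gain in context, here is a minimal sketch of the standard PSNR definition, plus the mean-squared-error ratio that a given dB gain corresponds to. The function names and the 8-bit peak value of 255 are assumptions for illustration; the paper's exact evaluation protocol (color space, cropping) is not described in the abstract.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Ratio new_mse / old_mse implied by a PSNR improvement of gain_db."""
    return 10.0 ** (-gain_db / 10.0)


print(round(mse_ratio_for_gain(0.58), 3))
```

Because PSNR is logarithmic in the error, a 0.58 dB gain corresponds to roughly a 12.5% reduction in mean squared error.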

[74] CLIP Meets Diffusion: A Synergistic Approach to Anomaly Detection

Byeongchan Lee,John Won,Seunghyun Lee,Jinwoo Shin

Main category: cs.CV

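TL;DR: CLIPFUSION combines discriminative and generative foundation models, using CLIP to capture global features and a diffusion model to capture local details, and markedly improves anomaly detection performance.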

Details Motivation: Anomaly detection is complicated by the ambiguous definition of anomalies, the diversity of anomaly types, and the scarcity of training data, so it needs a model that captures both low-level and high-level features. Method: Uses a CLIP-based discriminative model and a diffusion-based generative model, combining cross-attention maps and feature maps for anomaly detection. Result: Performs strongly on the MVTec-AD and VisA datasets, surpassing baseline methods. Conclusion: Multi-modal, multi-model fusion effectively addresses the multifaceted challenges of anomaly detection and has potential for practical use. Abstract: Anomaly detection is a complex problem due to the ambiguity in defining anomalies, the diversity of anomaly types (e.g., local and global defects), and the scarcity of training data. As such, it necessitates a comprehensive model capable of capturing both low-level and high-level features, even with limited data. To address this, we propose CLIPFUSION, a method that leverages both discriminative and generative foundation models. Specifically, the CLIP-based discriminative model excels at capturing global features, while the diffusion-based generative model effectively captures local details, creating a synergistic and complementary approach. Notably, we introduce a methodology for utilizing cross-attention maps and feature maps extracted from diffusion models specifically for anomaly detection. Experimental results on benchmark datasets (MVTec-AD, VisA) demonstrate that CLIPFUSION consistently outperforms baseline methods, achieving outstanding performance in both anomaly segmentation and classification. We believe that our method underscores the effectiveness of multi-modal and multi-model fusion in tackling the multifaceted challenges of anomaly detection, providing a scalable solution for real-world applications.

[75] AgentSense: Virtual Sensor Data Generation Using LLM Agent in Simulated Home Environments

Zikang Leng,Megha Thukral,Yaqi Liu,Hrudhai Rajasekhar,Shruthi K. Hiremath,Thomas Plötz

Main category: cs.CV

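TL;DR: AgentSense uses large language models to generate virtual data, addressing the lack of diverse labeled data for smart-home Human Activity Recognition (HAR) systems and substantially improving model performance.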

Details Motivation: Smart-home HAR systems suffer from insufficient and insufficiently diverse data, and need data that generalizes across users and environments. Method: AgentSense generates virtual personas and their daily routines, then records sensor data in a simulated environment. Result: Virtual data substantially improves model performance, especially when real data are limited; combined with a small amount of real data, it reaches performance comparable to training on the full real datasets. Conclusion: Virtual data offers an effective way to address the lack of large-scale annotated data without manual data collection. Abstract: A major obstacle in developing robust and generalizable smart home-based Human Activity Recognition (HAR) systems is the lack of large-scale, diverse labeled datasets. Variability in home layouts, sensor configurations, and user behavior adds further complexity, as individuals follow varied routines and perform activities in distinct ways. Building HAR systems that generalize well requires training data that captures the diversity across users and environments. To address these challenges, we introduce AgentSense, a virtual data generation pipeline where diverse personas are generated by leveraging Large Language Models. These personas are used to create daily routines, which are then decomposed into low-level action sequences. Subsequently, the actions are executed in a simulated home environment called VirtualHome that we extended with virtual ambient sensors capable of recording the agents' activities as they unfold. Overall, AgentSense enables the generation of rich, virtual sensor datasets that represent a wide range of users and home settings. Across five benchmark HAR datasets, we show that leveraging our virtual sensor data substantially improves performance, particularly when real data are limited. Notably, models trained on a combination of virtual data and just a few days of real data achieve performance comparable to those trained on the entire real datasets. These results demonstrate the potential of virtual data to address one of the most pressing challenges in ambient sensing, the lack of large-scale annotated datasets, without requiring any manual data collection effort.

[76] Real-Time Feedback and Benchmark Dataset for Isometric Pose Evaluation

Abhishek Jaiswal,Armeet Singh Luthra,Purav Jangir,Bhavya Garg,Nisheeth Srivastava

Main category: cs.CV

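TL;DR: The paper presents a real-time feedback system for evaluating isometric exercise poses, addressing the problem of relying on digital media content without expert guidance.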

Details Motivation: Isometric exercises are popular for their convenience and privacy, but the lack of expert guidance can lead to incorrect posture, injury, and disengagement. Method: Releases a multiclass isometric exercise dataset of over 3,600 video clips, benchmarks state-of-the-art models including graph-based networks, and proposes a new three-part evaluation metric. Result: The results improve the feasibility of intelligent, personalized home workout systems, which can also be applied to rehabilitation and physiotherapy. Conclusion: By delivering expert-level diagnosis directly to users, the system broadens its potential applications in fitness and other motion-related disciplines. Abstract: Isometric exercises appeal to individuals seeking convenience, privacy, and minimal dependence on equipment. However, such fitness training often depends on unreliable digital media content instead of expert supervision, introducing serious risks, including incorrect posture, injury, and disengagement due to lack of corrective feedback. To address these challenges, we present a real-time feedback system for assessing isometric poses. Our contributions include the release of the largest multiclass isometric exercise video dataset to date, comprising over 3,600 clips across six poses with correct and incorrect variations. To support robust evaluation, we benchmark state-of-the-art models, including graph-based networks, on this dataset and introduce a novel three-part metric that captures classification accuracy, mistake localization, and model confidence. Our results enhance the feasibility of intelligent and personalized exercise training systems for home workouts. This expert-level diagnosis, delivered directly to the users, also expands the potential applications of these systems to rehabilitation, physiotherapy, and various other fitness disciplines that involve physical motion.

[77] Self-supervised Learning of Echocardiographic Video Representations via Online Cluster Distillation

Divyanshu Mishra,Mohammadreza Salehi,Pramit Saha,Olga Patey,Aris T. Papageorghiou,Yuki M. Asano,J. Alison Noble

Main category: cs.CV

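TL;DR: DISCOVR is a self-supervised dual-branch framework for cardiac ultrasound video representation learning that combines a clustering-based video encoder with an online image encoder to improve performance in the ultrasound domain.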

Details Motivation: Echocardiography lacks domain-specific pre-trained models, and existing self-supervised learning methods struggle with high inter-sample similarity and low-PSNR inputs. Method: DISCOVR combines a clustering-based video encoder with an online image encoder, and transfers anatomical knowledge from the image encoder to the video encoder through a semantic cluster distillation loss. Result: On six echocardiography datasets, DISCOVR outperforms existing methods in both zero-shot and linear probing setups and achieves better segmentation transfer. Conclusion: Through its dual-branch design and knowledge distillation, DISCOVR markedly improves ultrasound video representation learning and applies to ultrasound data from different populations. Abstract: Self-supervised learning (SSL) has achieved major advances in natural images and video understanding, but challenges remain in domains like echocardiography (heart ultrasound) due to subtle anatomical structures, complex temporal dynamics, and the current lack of domain-specific pre-trained models. Existing SSL approaches such as contrastive, masked modeling, and clustering-based methods struggle with high inter-sample similarity, sensitivity to low PSNR inputs common in ultrasound, or aggressive augmentations that distort clinically relevant features. We present DISCOVR (Distilled Image Supervision for Cross Modal Video Representation), a self-supervised dual-branch framework for cardiac ultrasound video representation learning. DISCOVR combines a clustering-based video encoder that models temporal dynamics with an online image encoder that extracts fine-grained spatial semantics. These branches are connected through a semantic cluster distillation loss that transfers anatomical knowledge from the evolving image encoder to the video encoder, enabling temporally coherent representations enriched with fine-grained semantic understanding. Evaluated on six echocardiography datasets spanning fetal, pediatric, and adult populations, DISCOVR outperforms both specialized video anomaly detection methods and state-of-the-art video-SSL baselines in zero-shot and linear probing setups, and achieves superior segmentation transfer.

[78] GPLQ: A General, Practical, and Lightning QAT Method for Vision Transformers

Guang Liang,Xinyao Liu,Jianxin Wu

Main category: cs.CV

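TL;DR: GPLQ is an efficient and effective ViT quantization framework whose two-stage strategy (quantize activations first, weights later) markedly improves 4-bit quantization performance while reducing compute cost and memory footprint.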

Details Motivation: Existing PTQ and QAT methods have significant shortcomings for ViT quantization: PTQ suffers severe accuracy drops, while QAT is computationally expensive and generalizes poorly. Method: GPLQ uses a two-stage strategy: Stage 1 quantizes activations while keeping weights in FP32, and Stage 2 quantizes the weights. Result: GPLQ is 100x faster than existing QAT methods, uses less memory than FP32 training, and yields 4-bit models whose performance is close to FP32 models. Conclusion: GPLQ provides an efficient, practical solution for ViT quantization, and the authors plan to open-source a toolkit. Abstract: Vision Transformers (ViTs) are essential in computer vision but are also computationally intensive. Model quantization, particularly to low bit-widths like 4-bit, aims to alleviate this difficulty, yet existing Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) methods exhibit significant limitations. PTQ often incurs a substantial accuracy drop, while QAT achieves high accuracy but suffers from prohibitive computational costs, limited generalization to downstream tasks, training instability, and a lack of open-source codebases. To address these challenges, this paper introduces General, Practical, and Lightning Quantization (GPLQ), a novel framework designed for efficient and effective ViT quantization. GPLQ is founded on two key empirical insights: the paramount importance of activation quantization and the necessity of preserving the model's original optimization "basin" to maintain generalization. Consequently, GPLQ employs a sequential "activation-first, weights-later" strategy. Stage 1 keeps weights in FP32 while quantizing activations with a feature mimicking loss in only 1 epoch, keeping the model in the same "basin" and thereby preserving generalization. Stage 2 quantizes weights using a PTQ method. As a result, GPLQ is 100x faster than existing QAT methods, lowers memory footprint to levels even below FP32 training, and achieves 4-bit model performance that is highly competitive with FP32 models in terms of both accuracy on ImageNet and generalization to diverse downstream tasks, including fine-grained visual classification and object detection. We will release an easy-to-use open-source toolkit supporting multiple vision tasks.
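
For readers unfamiliar with what 4-bit activation quantization means numerically, the sketch below shows a generic min-max uniform "fake quantizer" that snaps values onto 2^4 = 16 evenly spaced levels. This is a textbook illustration under stated assumptions, not GPLQ's actual quantizer or training procedure; the abstract does not specify either, and `fake_quantize` is a hypothetical name.

```python
def fake_quantize(values, num_bits=4):
    """Snap each value to the nearest of 2**num_bits evenly spaced levels.

    Generic min-max uniform quantization (an assumption for illustration;
    not the scheme used by GPLQ, which the abstract does not describe).
    """
    if not values:
        return []
    steps = 2 ** num_bits - 1          # number of intervals between levels
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)            # constant input: nothing to quantize
    scale = (hi - lo) / steps          # width of one quantization step
    return [lo + round((v - lo) / scale) * scale for v in values]


activations = [0.0, 0.4, 7.2, 15.0]
print(fake_quantize(activations))
```

With 4 bits, at most 16 distinct values survive, which is why the abstract treats activation quantization as the critical step to get right.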

[79] Teleoperated Driving: a New Challenge for 3D Object Detection in Compressed Point Clouds

Filippo Bragato,Michael Neri,Paolo Testolina,Marco Giordani,Federica Battisti

Main category: cs.CV

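TL;DR: The paper studies detecting cars and pedestrians from point cloud data to support safe Teleoperated Driving (TD), analyzes the performance of compression algorithms and object detectors, and evaluates their impact on the V2X network.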

Details Motivation: The spread of interconnected devices and advances in sensor technology have made Teleoperated Driving (TD) possible, but efficiently detecting cars and pedestrians from point cloud data to ensure safe operation remains a key problem. Method: Using an expanded SELMA dataset (with ground-truth bounding boxes of 3D objects), the paper evaluates several compression algorithms and object detectors on compression efficiency, processing time, detection accuracy, and impact on the V2X network. Result: The study analyzes how different compression algorithms and object detectors perform across multiple metrics and measures their impact on V2X data rate and latency against 3GPP requirements for TD applications. Conclusion: Detecting cars and pedestrians from point cloud data is an effective route to safe teleoperated driving, and optimizing compression algorithms and object detectors is essential to system performance. Abstract: In recent years, the development of interconnected devices has expanded in many fields, from infotainment to education and industrial applications. This trend has been accelerated by the increased number of sensors and accessibility to powerful hardware and software. One area that significantly benefits from these advancements is Teleoperated Driving (TD). In this scenario, a controller safely drives a vehicle remotely, leveraging sensor data generated onboard the vehicle and exchanged via Vehicle-to-Everything (V2X) communications. In this work, we tackle the problem of detecting the presence of cars and pedestrians from point cloud data to enable safe TD operations. More specifically, we exploit the SELMA dataset, a multimodal, open-source, synthetic dataset for autonomous driving, that we expanded by including the ground-truth bounding boxes of 3D objects to support object detection. We analyze the performance of state-of-the-art compression algorithms and object detectors under several metrics, including compression efficiency, (de)compression and inference time, and detection accuracy. Moreover, we measure the impact of compression and detection on the V2X network in terms of data rate and latency with respect to 3GPP requirements for TD applications.

[80] Rethinking Multilingual Vision-Language Translation: Dataset, Evaluation, and Adaptation

Xintong Wang,Jingheng Pan,Yixiao Liu,Xiaohu Zhao,Chenyang Lyu,Minghao Wu,Chris Biemann,Longyue Wang,Linlong Xu,Weihua Luo,Kaifu Zhang

Main category: cs.CV

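TL;DR: The paper systematically evaluates the Vision-Language Translation (VLT) task from three angles (data quality, model architecture, and evaluation metrics), introducing the new AibTrans dataset, benchmarking a range of models, and proposing a more robust evaluation measure, the DA Score.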

Details Motivation: Existing Large Vision-Language Models (LVLMs) lack systematic evaluation and understanding on the VLT task; the study aims to fill this gap. Method: Analyzes data quality (introducing the AibTrans dataset), benchmarks 11 commercial and 6 open-source models, and proposes the Density-Aware Evaluation method (DA Score). Result: Finds that fine-tuning on high-resource language pairs harms cross-lingual performance, and proposes a balanced multilingual fine-tuning strategy to improve model adaptation. Conclusion: Establishes a new evaluation benchmark for VLT and offers directions for improving data, models, and evaluation methods in future research. Abstract: Vision-Language Translation (VLT) is a challenging task that requires accurately recognizing multilingual text embedded in images and translating it into the target language with the support of visual context. While recent Large Vision-Language Models (LVLMs) have demonstrated strong multilingual and visual understanding capabilities, there is a lack of systematic evaluation and understanding of their performance on VLT. In this work, we present a comprehensive study of VLT from three key perspectives: data quality, model architecture, and evaluation metrics. (1) We identify critical limitations in existing datasets, particularly in semantic and cultural fidelity, and introduce AibTrans, a multilingual, parallel, human-verified dataset with OCR-corrected annotations. (2) We benchmark 11 commercial LVLMs/LLMs and 6 state-of-the-art open-source models across end-to-end and cascaded architectures, revealing their OCR dependency and contrasting generation versus reasoning behaviors. (3) We propose Density-Aware Evaluation to address metric reliability issues under varying contextual complexity, introducing the DA Score as a more robust measure of translation quality. Building upon these findings, we establish a new evaluation benchmark for VLT. Notably, we observe that fine-tuning LVLMs on high-resource language pairs degrades cross-lingual performance, and we propose a balanced multilingual fine-tuning strategy that effectively adapts LVLMs to VLT without sacrificing their generalization ability.

[81] Vision-based Lifting of 2D Object Detections for Automated Driving

Hendrik Königshof,Kun Li,Christoph Stiller

Main category: cs.CV

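TL;DR: Proposes a low-cost, camera-only 3D object detection method that lifts the results of 2D algorithms to 3D detections, covers many types of road users, and is computationally efficient.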

Details Motivation: Automated driving requires 3D object detection, but existing methods depend on expensive LiDAR data, whereas cameras are cheap and widespread. Method: Uses a 2D CNN to process the point cloud for each 2D detection, lifting 2D detections to 3D while keeping computational cost low. Result: On the KITTI benchmark, performance is comparable to existing image-based methods at only a third of the runtime. Conclusion: The method is a viable approach to low-cost 3D detection and suits a variety of road scenarios. Abstract: Image-based 3D object detection is an essential part of autonomous driving because cheap onboard cameras are already available in most modern cars. Because LiDAR provides accurate depth information, most current state-of-the-art 3D object detectors rely heavily on LiDAR data. In this paper, we propose a pipeline which lifts the results of existing vision-based 2D algorithms to 3D detections using only cameras as a cost-effective alternative to LiDAR. In contrast to existing approaches, we focus not only on cars but on all types of road users. To the best of our knowledge, we are the first to use a 2D CNN to process the point cloud for each 2D detection to keep the computational effort as low as possible. Our evaluation on the challenging KITTI 3D object detection benchmark shows results comparable to state-of-the-art image-based approaches while requiring only a third of the runtime.

[82] SphereDrag: Spherical Geometry-Aware Panoramic Image Editing

Zhiao Feng,Xuewei Li,Junjie Yang,Yuxin Peng,Xi Li

Main category: cs.CV

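TL;DR: SphereDrag is a panoramic image editing framework based on spherical geometry that addresses boundary discontinuity, trajectory deformation, and uneven pixel density, and achieves marked improvements in experiments.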

Details Motivation: Because of the spherical geometry and projection distortions of panoramic images, existing planar image editing methods are hard to apply directly, so a dedicated panoramic editing solution is needed. Method: SphereDrag uses adaptive reprojection (AR) to handle boundary discontinuity, great-circle trajectory adjustment (GCTA) to improve trajectory accuracy, and spherical search region tracking (SSRT) to address uneven pixel density. Result: Experiments show that SphereDrag clearly outperforms existing methods in geometric consistency and image quality, with up to 10.5% relative improvement. Conclusion: SphereDrag provides an efficient and controllable solution for panoramic image editing, validated with the PanoBench standardized evaluation framework. Abstract: Image editing has made great progress on planar images, but panoramic image editing remains underexplored. Due to their spherical geometry and projection distortions, panoramic images present three key challenges: boundary discontinuity, trajectory deformation, and uneven pixel density. To tackle these issues, we propose SphereDrag, a novel panoramic editing framework utilizing spherical geometry knowledge for accurate and controllable editing. Specifically, adaptive reprojection (AR) uses adaptive spherical rotation to deal with discontinuity; great-circle trajectory adjustment (GCTA) tracks the movement trajectory more accurately; spherical search region tracking (SSRT) adaptively scales the search range based on spherical location to address uneven pixel density. Also, we construct PanoBench, a panoramic editing benchmark, including complex editing tasks involving multiple objects and diverse styles, which provides a standardized evaluation framework. Experiments show that SphereDrag gains a considerable improvement compared with existing methods in geometric consistency and image quality, achieving up to 10.5% relative improvement.
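
The abstract names great-circle trajectory adjustment but gives no formula. As a hedged illustration of the underlying geometry only, the sketch below interpolates along the great circle between two points given as (longitude, latitude) in degrees, using standard spherical linear interpolation. The function names and the lon/lat convention are assumptions; this is not SphereDrag's implementation.

```python
import math


def to_unit_vector(lon_deg, lat_deg):
    """Convert longitude/latitude in degrees to a 3D unit vector."""
    lon, lat = math.radians(lon_deg), math.radians(lat_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


def to_lon_lat(vector):
    """Convert a 3D unit vector back to (longitude, latitude) in degrees."""
    x, y, z = vector
    z = max(-1.0, min(1.0, z))  # guard asin against rounding error
    return (math.degrees(math.atan2(y, x)), math.degrees(math.asin(z)))


def great_circle_point(start, end, t):
    """Point at fraction t (0..1) along the great circle from start to end."""
    a, b = to_unit_vector(*start), to_unit_vector(*end)
    dot = max(-1.0, min(1.0, sum(p * q for p, q in zip(a, b))))
    omega = math.acos(dot)                # angle between the two points
    if omega < 1e-12:
        return start                      # points coincide
    sin_omega = math.sin(omega)
    if sin_omega < 1e-12:
        raise ValueError("antipodal points: the great circle is not unique")
    weight_a = math.sin((1.0 - t) * omega) / sin_omega
    weight_b = math.sin(t * omega) / sin_omega
    return to_lon_lat(tuple(weight_a * p + weight_b * q for p, q in zip(a, b)))


# A drag that crosses the left/right image seam at longitude +/-180 degrees.
print(great_circle_point((170.0, 0.0), (-170.0, 0.0), 0.5))
```

Interpolating on the sphere rather than in pixel coordinates keeps a drag path continuous across the image seam, whereas a straight line in the flattened image would sweep the long way around through longitude 0.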

[83] Methods for evaluating the resolution of 3D data derived from satellite images

Christina Selby,Holden Bindl,Tyler Feldman,Andrew Skow,Nicolas Norena Acosta,Shea Hagstrom,Myron Brown

Main category: cs.CV

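TL;DR: The paper examines how to evaluate the resolution of 3D data derived from satellite images (point clouds, digital surface models, and 3D mesh models), and presents automated evaluation tools and workflows based on high-resolution reference airborne lidar.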

Details Motivation: 3D data derived from satellite images is essential for modeling scenes that need large-scale coverage or are hard to reach with airborne lidar or cameras, but methods for evaluating its resolution are still immature. Method: Presents 3D metric evaluation tools and workflows that use high-resolution reference airborne lidar for automated evaluation. Result: Analyses with data of varying quality illustrate the effectiveness of the evaluation tools. Conclusion: The study provides practical tools for evaluating the resolution of satellite-derived 3D data, helping to improve data quality and mission utility. Abstract: 3D data derived from satellite images is essential for scene modeling applications requiring large-scale coverage or involving locations not accessible by airborne lidar or cameras. Measuring the resolution of this data is important for determining mission utility and tracking improvements. In this work, we consider methods to evaluate the resolution of point clouds, digital surface models, and 3D mesh models. We describe 3D metric evaluation tools and workflows that enable automated evaluation based on high-resolution reference airborne lidar, and we present results of analyses with data of varying quality.

[84] O2Former: Direction-Aware and Multi-Scale Query Enhancement for SAR Ship Instance Segmentation

F. Gao,Y Li,X He,J Sun,J Wang

Main category: cs.CV

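TL;DR: O2Former is an instance segmentation framework tailored to ships in SAR imagery that markedly improves segmentation performance through improved query generation and an orientation-aware module.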

Details Motivation: Ship segmentation in SAR images is hampered by scale variation, object density, and fuzzy boundaries, on which existing methods perform poorly, so a targeted solution is needed. Method: O2Former builds on Mask2Former and introduces an Optimized Query Generator (OQG) and an Orientation-Aware Embedding Module (OAEM), which improve multi-scale feature interaction and directional sensitivity respectively. Result: Experiments show that O2Former outperforms existing methods on SAR ship datasets, validating its effectiveness and generalization. Conclusion: Through its targeted design, O2Former addresses the challenges of SAR image segmentation and provides an efficient tool for related applications. Abstract: Instance segmentation of ships in synthetic aperture radar (SAR) imagery is critical for applications such as maritime monitoring, environmental analysis, and national security. SAR ship images present challenges including scale variation, object density, and fuzzy target boundaries, which are often overlooked in existing methods, leading to suboptimal performance. In this work, we propose O2Former, a tailored instance segmentation framework that extends Mask2Former by fully leveraging the structural characteristics of SAR imagery. We introduce two key components. The first is the Optimized Query Generator (OQG). It enables multi-scale feature interaction by jointly encoding shallow positional cues and high-level semantic information. This improves query quality and convergence efficiency. The second component is the Orientation-Aware Embedding Module (OAEM). It enhances directional sensitivity through direction-aware convolution and polar-coordinate encoding. This effectively addresses the challenge of uneven target orientations in SAR scenes. Together, these modules facilitate precise feature alignment from backbone to decoder and strengthen the model's capacity to capture fine-grained structural details. Extensive experiments demonstrate that O2Former outperforms state-of-the-art instance segmentation baselines, validating its effectiveness and generalization on SAR ship datasets.

[85] Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation

Min-Seop Kwak,Junho Kim,Sangdoo Yun,Dongyoon Han,Taekyoung Kim,Seungryong Kim,Jin-Hwa Kim

Main category: cs.CV

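TL;DR: Proposes a diffusion-based framework that generates aligned novel-view images and geometry via warping and inpainting, leveraging off-the-shelf geometry predictors and introducing cross-modal attention instillation.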

Details Motivation: Existing methods require dense posed images or generative models limited to in-domain views; this work aims for more flexible, aligned novel-view synthesis. Method: Uses a geometry predictor to predict partial geometry, treats novel-view synthesis as an inpainting task for both image and geometry, and ensures alignment through cross-modal attention instillation. Result: Achieves high-fidelity extrapolative view synthesis on unseen scenes and competitive reconstruction quality under interpolation settings. Conclusion: The method produces aligned, high-quality image and geometry generation and is suitable for comprehensive 3D completion. Abstract: We introduce a diffusion-based framework that performs aligned novel view image and geometry generation via a warping-and-inpainting methodology. Unlike prior methods that require dense posed images or pose-embedded generative models limited to in-domain views, our method leverages off-the-shelf geometry predictors to predict partial geometries viewed from reference images, and formulates novel-view synthesis as an inpainting task for both image and geometry. To ensure accurate alignment between generated images and geometry, we propose cross-modal attention instillation, where attention maps from the image diffusion branch are injected into a parallel geometry diffusion branch during both training and inference. This multi-task approach achieves synergistic effects, facilitating geometrically robust image synthesis as well as well-defined geometry prediction. We further introduce proximity-based mesh conditioning to integrate depth and normal cues, interpolating between point clouds and filtering out erroneously predicted geometry so that it does not influence the generation process. Empirically, our method achieves high-fidelity extrapolative view synthesis on both image and geometry across a range of unseen scenes, delivers competitive reconstruction quality under interpolation settings, and produces geometrically aligned colored point clouds for comprehensive 3D completion. Project page is available at https://cvlab-kaist.github.io/MoAI.

[86] Evaluating Sensitivity Parameters in Smartphone-Based Gaze Estimation: A Comparative Study of Appearance-Based and Infrared Eye Trackers

Nishan Gunawardena,Gough Yumu Lui,Jeewani Anupama Ginige,Bahman Javadi

Main category: cs.CV

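TL;DR: The study compares a smartphone-based deep-learning eye-tracking algorithm with a commercial infrared eye tracker, the Tobii Pro Nano, and examines the feasibility of appearance-based eye tracking on mobile devices.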

Details Motivation: To investigate the feasibility of appearance-based eye tracking under realistic mobile usage conditions and to analyze the key sensitivity factors. Method: Uses a lightweight convolutional neural network (MobileNet-V3) with a recurrent structure (LSTM) to predict gaze coordinates from grayscale facial images; data were collected from 51 participants and accuracy was measured. Result: The deep learning model's mean error was 17.76 mm, slightly higher than the Tobii Pro Nano's 16.53 mm, but it was more sensitive to factors such as lighting, vision correction, and age. Conclusion: Appearance-based methods show potential for mobile eye tracking, and the study offers a reference framework for evaluating gaze estimation systems across usage conditions. Abstract: This study evaluates a smartphone-based, deep-learning eye-tracking algorithm by comparing its performance against a commercial infrared-based eye tracker, the Tobii Pro Nano. The aim is to investigate the feasibility of appearance-based gaze estimation under realistic mobile usage conditions. Key sensitivity factors, including age, gender, vision correction, lighting conditions, device type, and head position, were systematically analysed. The appearance-based algorithm integrates a lightweight convolutional neural network (MobileNet-V3) with a recurrent structure (Long Short-Term Memory) to predict gaze coordinates from grayscale facial images. Gaze data were collected from 51 participants using dynamic visual stimuli, and accuracy was measured using Euclidean distance. The deep learning model produced a mean error of 17.76 mm, compared to 16.53 mm for the Tobii Pro Nano. While overall accuracy differences were small, the deep learning-based method was more sensitive to factors such as lighting, vision correction, and age, with higher failure rates observed under low-light conditions, among participants using glasses, and in older age groups. Device-specific and positional factors also influenced tracking performance. These results highlight the potential of appearance-based approaches for mobile eye tracking and offer a reference framework for evaluating gaze estimation systems across varied usage conditions.
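
The abstract states that accuracy was measured using Euclidean distance and reports mean errors in millimeters. A minimal sketch of such a metric follows, assuming predicted and true gaze points are 2D screen coordinates already expressed in millimeters; the function name and the sample values are hypothetical and not taken from the study.

```python
import math


def mean_euclidean_error(predicted, actual):
    """Mean Euclidean distance between paired predicted and true gaze points."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    distances = [math.dist(p, a) for p, a in zip(predicted, actual)]
    return sum(distances) / len(distances)


# Hypothetical gaze samples in millimeters on the screen plane.
predicted_mm = [(10.0, 12.0), (33.0, 44.0)]
target_mm = [(10.0, 12.0), (30.0, 40.0)]
print(mean_euclidean_error(predicted_mm, target_mm))
```

Reporting the error in physical units rather than pixels makes results comparable across devices with different screen densities, which matters when device type is one of the sensitivity factors under study.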

[87] How Visual Representations Map to Language Feature Space in Multimodal LLMs

Constantin Venhoff,Ashkan Khakzar,Sonia Joseph,Philip Torr,Neel Nanda

Main category: cs.CV

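TL;DR: The paper connects a frozen vision model and a frozen language model through a linear adapter, revealing how visual and language representations become aligned.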

Details Motivation: To study how vision-language models (VLMs) align visual and linguistic representations, a mechanism that remains poorly understood. Method: Uses a frozen large language model (LLM) and a frozen vision transformer (ViT), training only a linear adapter during visual instruction tuning. Result: Experiments show that visual representations gradually align with language features, but with a mismatch at early layers, raising questions about current adapter architectures. Conclusion: The study reveals the mechanism by which visual and language representations align and suggests that current adapter architectures may not be optimal. Abstract: Effective multimodal reasoning depends on the alignment of visual and linguistic representations, yet the mechanisms by which vision-language models (VLMs) achieve this alignment remain poorly understood. We introduce a methodological framework that deliberately maintains a frozen large language model (LLM) and a frozen vision transformer (ViT), connected solely by training a linear adapter during visual instruction tuning. This design is fundamental to our approach: by keeping the language model frozen, we ensure it maintains its original language representations without adaptation to visual data. Consequently, the linear adapter must map visual features directly into the LLM's existing representational space rather than allowing the language model to develop specialized visual understanding through fine-tuning. Our experimental design uniquely enables the use of pre-trained sparse autoencoders (SAEs) of the LLM as analytical probes. These SAEs remain perfectly aligned with the unchanged language model and serve as a snapshot of the learned language feature representations. Through systematic analysis of SAE reconstruction error, sparsity patterns, and SAE feature descriptions, we reveal the layer-wise progression through which visual representations gradually align with language feature representations, converging in middle-to-later layers. This suggests a fundamental misalignment between ViT outputs and early LLM layers, raising important questions about whether current adapter-based architectures optimally facilitate cross-modal representation learning.

[88] Simple Radiology VLLM Test-time Scaling with Thought Graph Traversal

Yue Yao,Zelin Wen,Yan Tong,Xinyu Tian,Xuqing Li,Xiao Ma,Dongliang Xu,Tom Gedeon

Main category: cs.CV

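TL;DR: Proposes a lightweight Thought Graph Traversal (TGT) framework that dynamically adjusts reasoning depth at test time to improve the accuracy and consistency of radiology report generation.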

Details Motivation: To explore how test-time adjustments, without additional training, can improve the reasoning performance of vision-language large models (VLLMs) in radiology report generation. Method: Designs the TGT framework around medical priors and dynamically adjusts the reasoning budget to improve the generation process. Result: Outperforms baseline methods on standard benchmarks, produces more accurate and consistent reports, and reveals dataset biases. Conclusion: TGT is a simple and effective way to markedly improve VLLMs on radiology report generation. Abstract: Test-time scaling offers a promising way to improve the reasoning performance of vision-language large models (VLLMs) without additional training. In this paper, we explore a simple but effective approach for applying test-time scaling to radiology report generation. Specifically, we introduce a lightweight Thought Graph Traversal (TGT) framework that guides the model to reason through organ-specific findings in a medically coherent order. This framework integrates structured medical priors into the prompt, enabling deeper and more logical analysis with no changes to the underlying model. To further enhance reasoning depth, we apply a reasoning budget forcing strategy that adjusts the model's inference depth at test time by dynamically extending its generation process. This simple yet powerful combination allows a frozen radiology VLLM to self-correct and generate more accurate, consistent chest X-ray reports. Our method outperforms baseline prompting approaches on standard benchmarks, and also reveals dataset biases through traceable reasoning paths. Code and prompts are open-sourced for reproducibility at https://github.com/glerium/Thought-Graph-Traversal.

[89] VGR: Visual Grounded Reasoning

Jiacong Wang,Zijiang Kang,Haochen Wang,Haiyong Jiang,Jiawen Li,Bohong Wu,Ya Wang,Jiao Ran,Xiao Liang,Chao Feng,Jun Xiao

Main category: cs.CV

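TL;DR: The paper proposes VGR, a new multimodal large language model whose enhanced fine-grained visual perception addresses the language bias and domain restrictions of conventional approaches to visual reasoning.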

Details Motivation: Existing methods reason mainly in pure language space, which suffers from language bias and is confined to math or science domains, making it hard to handle complex visual reasoning tasks that require comprehensive understanding of image details. Method: VGR detects relevant image regions and gives precise answers based on those regions; it is trained on VGR-SFT, a large-scale SFT dataset mixing vision grounding and language deduction, and introduces a replay stage to strengthen multimodal comprehension. Result: Experiments on the LLaVA-NeXT-7B baseline show that VGR performs strongly while using only 30% of the image token count, improving scores by 4.1 on MMStar, 7.1 on AI2D, and 12.9 on ChartQA. Conclusion: By combining vision grounding with language reasoning, VGR markedly improves multimodal reasoning, especially on tasks that require understanding image details. Abstract: In the field of multimodal chain-of-thought (CoT) reasoning, existing approaches predominantly rely on reasoning in pure language space, which inherently suffers from language bias and is largely confined to math or science domains. This narrow focus limits their ability to handle complex visual reasoning tasks that demand comprehensive understanding of image details. To address these limitations, this paper introduces VGR, a novel reasoning multimodal large language model (MLLM) with enhanced fine-grained visual perception capabilities. Unlike traditional MLLMs that answer questions or reason solely in the language space, our VGR first detects relevant regions that may help to solve problems, and then provides precise answers based on replayed image regions. To achieve this, we construct a large-scale SFT dataset called VGR-SFT that contains reasoning data with mixed vision grounding and language deduction. The inference pipeline of VGR allows the model to choose bounding boxes for visual reference, and a replay stage is introduced to integrate the corresponding regions into the reasoning process, enhancing multimodal comprehension. Experiments on the LLaVA-NeXT-7B baseline show that VGR achieves superior performance on multi-modal benchmarks requiring comprehensive image detail understanding. Compared to the baseline, VGR uses only 30% of the image token count while delivering improvements of +4.1 on MMStar, +7.1 on AI2D, and +12.9 on ChartQA.

[90] Improving Surgical Risk Prediction Through Integrating Automated Body Composition Analysis: a Retrospective Trial on Colectomy Surgery

Hanxue Gu,Yaqian Chen,isoo Lee,Diego Schaps,Regina Woody,Roy Colglazier,Maciej A. Mazurowski,Christopher Mantyh

Main category: cs.CV

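TL;DR: The study evaluates whether body composition metrics automatically extracted from preoperative CT scans can predict outcomes after colectomy, either alone or combined with clinical variables or existing risk prediction tools.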

Details Motivation: To explore how well preoperative body composition metrics predict postoperative outcomes, in order to improve surgical risk assessment. Method: Uses a Cox proportional hazards model to assess 1-year all-cause mortality and logistic regression to assess secondary outcomes, with over 300 features extracted from CT. Result: CT-derived body composition metrics show predictive ability for postoperative outcomes, with performance evaluated by C-index and IBS. Conclusion: Body composition metrics extracted from preoperative CT can be used to predict outcomes after colectomy, and work better when combined with clinical variables. Abstract: Objective: To evaluate whether preoperative body composition metrics automatically extracted from CT scans can predict postoperative outcomes after colectomy, either alone or combined with clinical variables or existing risk predictors. Main outcomes and measures: The primary outcome was the predictive performance for 1-year all-cause mortality following colectomy. A Cox proportional hazards model with 1-year follow-up was used, and performance was evaluated using the concordance index (C-index) and Integrated Brier Score (IBS). Secondary outcomes included postoperative complications, unplanned readmission, blood transfusion, and severe infection, assessed using AUC and Brier Score from logistic regression. Odds ratios (OR) described associations between individual CT-derived body composition metrics and outcomes. Over 300 features were extracted from preoperative CTs across multiple vertebral levels, including skeletal muscle area, density, fat areas, and inter-tissue metrics. NSQIP scores were available for all surgeries after 2012.
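
Since the primary outcome is scored with the concordance index, a small sketch of how that metric is computed may help. The version below is a simplified Harrell's C for right-censored data: a pair is comparable when the subject with the shorter time experienced the event, and tied event times are ignored. The function name and the toy data are hypothetical; the study's actual software and tie handling are not described in the abstract.

```python
def concordance_index(times, events, risks):
    """Simplified Harrell's C-index for right-censored survival data.

    times:  follow-up time per subject
    events: 1 if the event (e.g. death) was observed, 0 if censored
    risks:  predicted risk score (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # earlier event had the higher risk
                elif risks[i] == risks[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: the first subject is censored, the other two had events.
print(concordance_index([1, 2, 3], [0, 1, 1], [1.0, 3.0, 2.0]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the C-index measures how well the model orders patients by risk, not whether predicted probabilities are calibrated; the Integrated Brier Score named in the abstract covers the latter.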

[91] Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale

Junha Lee,Eunha Park,Chunghyun Park,Dahyun Kang,Minsu Cho

Main category: cs.CV

TL;DR: Affogato is a large-scale benchmark dataset for localizing object regions from natural language descriptions of interactions. It contains 150K instances with open-vocabulary descriptions and 3D affordance heatmaps.

Details Motivation: Affordance grounding is hard because it needs fine-grained part-level localization, multiple valid interaction regions create ambiguity, and large-scale datasets are scarce. Method: Build vision-language models that use pretrained part-aware vision backbones and a text-conditional heatmap decoder. Result: The models perform well on existing 2D and 3D benchmarks and generalize effectively across domains in the open-vocabulary setting. Conclusion: The Affogato dataset and models provide effective tools and a benchmark for affordance grounding. Abstract: Affordance grounding, the task of localizing object regions based on natural language descriptions of interactions, is a critical challenge for enabling intelligent agents to understand and interact with their environments. However, this task remains challenging due to the need for fine-grained part-level localization, the ambiguity arising from multiple valid interaction regions, and the scarcity of large-scale datasets. In this work, we introduce Affogato, a large-scale benchmark comprising 150K instances, annotated with open-vocabulary text descriptions and corresponding 3D affordance heatmaps across a diverse set of objects and interactions. Building on this benchmark, we develop simple yet effective vision-language models that leverage pretrained part-aware vision backbones and a text-conditional heatmap decoder. Our models trained with the Affogato dataset achieve promising performance on the existing 2D and 3D benchmarks, and notably, exhibit effectiveness in open-vocabulary cross-domain generalization. The Affogato dataset is publicly available: https://huggingface.co/datasets/project-affogato/affogato

cs.GR [Back]

[92] Anti-Aliased 2D Gaussian Splatting

Mae Younes,Adnane Boukhayma

Main category: cs.GR

TL;DR: AA-2DGS introduces a world-space smoothing kernel and an object-space Mip filter to remove the aliasing that 2DGS shows at varying sampling rates, improving rendering quality.

Details Motivation: 2DGS produces severe aliasing when the rendering sampling rate differs from the training sampling rate, which limits its practical use. Method: AA-2DGS combines a world-space smoothing kernel, which constrains frequency content, with an object-space Mip filter that applies anti-aliasing in each splat's local space. Result: Rendering quality improves markedly across scales, and high-frequency aliasing is eliminated. Conclusion: AA-2DGS solves the anti-aliasing problem of 2DGS while keeping its geometric benefits. Abstract: 2D Gaussian Splatting (2DGS) has recently emerged as a promising method for novel view synthesis and surface reconstruction, offering better view-consistency and geometric accuracy than volumetric 3DGS. However, 2DGS suffers from severe aliasing artifacts when rendering at different sampling rates than those used during training, limiting its practical applications in scenarios requiring camera zoom or varying fields of view. We identify that these artifacts stem from two key limitations: the lack of frequency constraints in the representation and an ineffective screen-space clamping approach. To address these issues, we present AA-2DGS, an antialiased formulation of 2D Gaussian Splatting that maintains its geometric benefits while significantly enhancing rendering quality across different scales. Our method introduces a world space flat smoothing kernel that constrains the frequency content of 2D Gaussian primitives based on the maximal sampling frequency from training views, effectively eliminating high-frequency artifacts when zooming in. Additionally, we derive a novel object space Mip filter by leveraging an affine approximation of the ray-splat intersection mapping, which allows us to efficiently apply proper anti-aliasing directly in the local space of each splat.

[93] On Ray Reordering Techniques for Faster GPU Ray Tracing

Daniel Meister,Jakub Bokšanský,Michael Guthe,Jiří Bittner

Main category: cs.GR

TL;DR: The paper studies ray reordering as a tool for speeding up GPU ray tracing, proposes a new method based on termination point estimation, and evaluates it with RTX trace kernels.

Details Motivation: Explore how much ray reordering can improve existing GPU ray tracing implementations, particularly for secondary rays. Method: Summarize existing methods for computing ray sorting keys, propose a modified method based on termination point estimation, and evaluate it with RTX trace kernels. Result: Ray reordering raises trace speed significantly (1.3-2.0x), but the reordering overhead is hard to recover within the hardware-accelerated trace phase. Conclusion: Ray reordering gives a clear trace-speed advantage on GPUs, but the sorting overhead needs further reduction. Abstract: We study ray reordering as a tool for increasing the performance of existing GPU ray tracing implementations. We focus on ray reordering that is fully agnostic to the particular trace kernel. We summarize the existing methods for computing the ray sorting keys and discuss their properties. We propose a novel modification of a previously proposed method using the termination point estimation that is well-suited to tracing secondary rays. We evaluate the ray reordering techniques in the context of wavefront path tracing using the RTX trace kernels. We show that ray reordering yields significantly higher trace speed on recent GPUs (1.3-2.0x), but recovering the reordering overhead in the hardware-accelerated trace phase is problematic.

[94] Adaptive Tetrahedral Grids for Volumetric Path-Tracing

Anis Benyoub,Jonathan Dupuy

Main category: cs.GR

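TL;DR: The paper proposes path-traced rendering of volumetric data on tetrahedral grids built with the longest edge bisection algorithm. The grids are highly adaptive and memory-efficient, and the GPU implementation is substantially faster than regular grids.

Details Motivation: Address the high memory footprint and limited performance of volumetric rendering by exploiting the adaptivity of tetrahedral grids for path tracing. Method: Build tetrahedral grids with the longest edge bisection algorithm and design optimized GPU algorithms and data structures for them. Result: The GPU implementation outperforms regular grids by up to 30x and renders production assets in real time at 32 samples per pixel. Conclusion: Tetrahedral grids offer clear advantages for volumetric path tracing and suit high-performance real-time rendering. Abstract: We advertise the use of tetrahedral grids constructed via the longest edge bisection algorithm for rendering volumetric data with path tracing. The key benefits of such grids are twofold. First, they provide a highly adaptive space-partitioning representation that limits the memory footprint of volumetric assets. Second, each (tetrahedral) cell has exactly 4 neighbors within the volume (one per face of each tetrahedron) or fewer at boundaries. We leverage these properties to devise optimized algorithms and data structures to compute and path-trace adaptive tetrahedral grids on the GPU. In practice, our GPU implementation outperforms regular grids by up to 30x and renders production assets in real time at 32 samples per pixel.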
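
A single longest edge bisection step can be sketched as below. This is a minimal illustration that splits one tetrahedron at the midpoint of its longest edge. It is not the paper's GPU data structure, and it ignores the bookkeeping needed to keep neighboring cells conforming. The helper names `bisect_longest_edge` and `volume` are assumptions.

```python
from itertools import combinations


def volume(tet):
    """Volume of a tetrahedron given as four (x, y, z) vertices."""
    a, b, c, d = tet
    u = [b[k] - a[k] for k in range(3)]
    v = [c[k] - a[k] for k in range(3)]
    w = [d[k] - a[k] for k in range(3)]
    det = (u[0] * (v[1] * w[2] - v[2] * w[1])
           - u[1] * (v[0] * w[2] - v[2] * w[0])
           + u[2] * (v[0] * w[1] - v[1] * w[0]))
    return abs(det) / 6.0


def bisect_longest_edge(tet):
    """Split a tetrahedron into two children at the midpoint of its longest edge.

    Ties between equally long edges are broken by vertex-index order,
    which is an arbitrary choice made for this sketch.
    """
    def sqdist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    i, j = max(combinations(range(4), 2),
               key=lambda e: sqdist(tet[e[0]], tet[e[1]]))
    mid = tuple((x + y) / 2.0 for x, y in zip(tet[i], tet[j]))
    # Each child keeps one endpoint of the split edge and gets the midpoint
    # in place of the other endpoint.
    child_a = tuple(mid if k == j else vert for k, vert in enumerate(tet))
    child_b = tuple(mid if k == i else vert for k, vert in enumerate(tet))
    return child_a, child_b


if __name__ == "__main__":
    tet = ((0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
    a, b = bisect_longest_edge(tet)
    print(volume(tet), volume(a), volume(b))
```

Applying the step recursively only where the volume has detail is what makes the resulting grid adaptive.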

[95] CGVQM+D: Computer Graphics Video Quality Metric and Dataset

Akshay Jindal,Nabil Sadaka,Manu Mathew Thomas,Anton Sochenov,Anton Kaplanyan

Main category: cs.GR

TL;DR: The paper presents a video quality dataset focused on distortions introduced by advanced rendering techniques, and develops CGVQM, a quality metric that outperforms existing ones.

Details Motivation: Existing datasets mainly cover natural videos and traditional distortions, while the perception of synthetic content and modern rendering artifacts is underexplored. Method: Build a dataset of distortions from several rendering techniques and propose CGVQM, a quality metric based on pre-trained 3D CNNs. Result: Existing metrics perform poorly on these distortions (maximum Pearson correlation of 0.78), and CGVQM significantly outperforms them. Conclusion: CGVQM evaluates modern rendering distortions effectively; the dataset and implementation are open source. Abstract: While existing video and image quality datasets have extensively studied natural videos and traditional distortions, the perception of synthetic content and modern rendering artifacts remains underexplored. We present a novel video quality dataset focused on distortions introduced by advanced rendering techniques, including neural supersampling, novel-view synthesis, path tracing, neural denoising, frame interpolation, and variable rate shading. Our evaluations show that existing full-reference quality metrics perform sub-optimally on these distortions, with a maximum Pearson correlation of 0.78. Additionally, we find that the feature space of pre-trained 3D CNNs aligns strongly with human perception of visual quality. We propose CGVQM, a full-reference video quality metric that significantly outperforms existing metrics while generating both per-pixel error maps and global quality scores. Our dataset and metric implementation are available at https://github.com/IntelLabs/CGVQM.
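
The Pearson correlation used to compare a metric against human ratings can be written out directly. The sketch below is a plain implementation of the textbook formula; the function name `pearson` and the sample scores are hypothetical and are not taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical metric scores and mean opinion scores for five videos.
    metric_scores = [0.10, 0.35, 0.40, 0.70, 0.95]
    human_scores = [1.2, 2.0, 2.9, 3.8, 4.6]
    print(round(pearson(metric_scores, human_scores), 3))
```

A correlation near 1 means the metric ranks and spaces the videos much as human raters do.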

cs.CL [Back]

[96] TeleEval-OS: Performance evaluations of large language models for operations scheduling

Yanyan Wang,Yingying Wang,Junli Liang,Yin Xu,Yunlong Liu,Yiming Xu,Zhengwang Jiang,Zhehe Li,Fei Li,Long Zhao,Kuang Xu,Qi Song,Xiangyang Li

Main category: cs.CL

TL;DR: The paper proposes TeleEval-OS, the first evaluation benchmark for telecommunications operation scheduling, to assess large language models (LLMs) comprehensively on these tasks, and finds that open-source LLMs can outperform closed-source LLMs in specific scenarios.

Details Motivation: Telecommunications operation scheduling tasks are complex and lack evaluation benchmarks, which has held back exploration of LLMs in this field. Method: Build the TeleEval-OS benchmark with 15 datasets across 13 subtasks that simulate four key operational stages, and test 14 LLMs with zero-shot and few-shot evaluation. Result: Open-source LLMs outperform closed-source LLMs in specific scenarios, showing their potential for telecommunications operation scheduling. Conclusion: TeleEval-OS provides an evaluation tool for applying LLMs to telecommunications operation scheduling, where open-source LLMs have significant value. Abstract: The rapid advancement of large language models (LLMs) has significantly propelled progress in artificial intelligence, demonstrating substantial application potential across multiple specialized domains. Telecommunications operation scheduling (OS) is a critical aspect of the telecommunications industry, involving the coordinated management of networks, services, risks, and human resources to optimize production scheduling and ensure unified service control. However, the inherent complexity and domain-specific nature of OS tasks, coupled with the absence of comprehensive evaluation benchmarks, have hindered thorough exploration of LLMs' application potential in this critical field. To address this research gap, we propose the first Telecommunications Operation Scheduling Evaluation Benchmark (TeleEval-OS). Specifically, this benchmark comprises 15 datasets across 13 subtasks, comprehensively simulating four key operational stages: intelligent ticket creation, intelligent ticket handling, intelligent ticket closure, and intelligent evaluation. To systematically assess the performance of LLMs on tasks of varying complexity, we categorize their capabilities in telecommunications operation scheduling into four hierarchical levels, arranged in ascending order of difficulty: basic NLP, knowledge Q&A, report generation, and report analysis. On TeleEval-OS, we leverage zero-shot and few-shot evaluation methods to comprehensively assess 10 open-source LLMs (e.g., DeepSeek-V3) and 4 closed-source LLMs (e.g., GPT-4o) across diverse scenarios. Experimental results demonstrate that open-source LLMs can outperform closed-source LLMs in specific scenarios, highlighting their significant potential and value in the field of telecommunications operation scheduling.

[97] Who is in the Spotlight: The Hidden Bias Undermining Multimodal Retrieval-Augmented Generation

Jiayu Yao,Shenghua Liu,Yiwei Wang,Lingrui Mei,Baolong Bi,Yuyao Ge,Zhecheng Li,Xueqi Cheng

Main category: cs.CL

TL;DR: The paper studies how the position of retrieved evidence affects multimodal Retrieval-Augmented Generation (RAG) systems, finds that position bias significantly affects performance, and proposes a way to quantify it.

Details Motivation: Current multimodal RAG systems are highly sensitive to evidence order, which causes unstable performance and biased reasoning, so the effect of position bias needs study. Method: Run experiments on text-only, image-only, and mixed-modality tasks, introduce the Position Sensitivity Index ($PSI_p$), and use a visualization framework to analyze attention allocation patterns. Result: Multimodal interactions intensify position bias, and the bias grows logarithmically with retrieval range. Conclusion: The study provides a foundation for position-aware analysis of RAG systems and recommends evidence reordering or debiasing strategies to improve reliability. Abstract: Multimodal Retrieval-Augmented Generation (RAG) systems have become essential in knowledge-intensive and open-domain tasks. As retrieval complexity increases, ensuring the robustness of these systems is critical. However, current RAG models are highly sensitive to the order in which evidence is presented, often resulting in unstable performance and biased reasoning, particularly as the number of retrieved items or modality diversity grows. This raises a central question: How does the position of retrieved evidence affect multimodal RAG performance? To answer this, we present the first comprehensive study of position bias in multimodal RAG systems. Through controlled experiments across text-only, image-only, and mixed-modality tasks, we observe a consistent U-shaped accuracy curve with respect to evidence position. To quantify this bias, we introduce the Position Sensitivity Index ($PSI_p$) and develop a visualization framework to trace attention allocation patterns across decoder layers. Our results reveal that multimodal interactions intensify position bias compared to unimodal settings, and that this bias increases logarithmically with retrieval range. These findings offer both theoretical and empirical foundations for position-aware analysis in RAG, highlighting the need for evidence reordering or debiasing strategies to build more reliable and equitable generation systems.

[98] Smotrom tvoja pa ander drogoj verden! Resurrecting Dead Pidgin with Generative Models: Russenorsk Case Study

Alexey Tikhonov,Sergei Shteiner,Anna Bykova,Ivan P. Yamshchikov

Main category: cs.CL

TL;DR: The paper uses modern large language models to analyze the Russenorsk lexicon, builds a structured dictionary, tests hypotheses about its word formation and grammar, and presents a translation agent.

Details Motivation: Study Russenorsk, a unique trade pidgin, and explore its lexicon and grammatical structure. Method: Use large language models to analyze surviving sources, build a dictionary, test hypotheses, and develop a translation agent. Result: Some hypotheses from the academic literature were confirmed, and Russenorsk renderings of contemporary texts were generated. Conclusion: The word formation and grammatical principles of Russenorsk can be examined with modern techniques, and the translation agent offers a new tool for linguistic research. Abstract: Russenorsk, a pidgin language historically used in trade interactions between Russian and Norwegian speakers, represents a unique linguistic phenomenon. In this paper, we attempt to analyze its lexicon using modern large language models (LLMs), based on surviving literary sources. We construct a structured dictionary of the language, grouped by synonyms and word origins. Subsequently, we use this dictionary to formulate hypotheses about the core principles of word formation and grammatical structure in Russenorsk and show which hypotheses generated by large language models correspond to those previously proposed in the academic literature. We also develop a "reconstruction" translation agent that generates hypothetical Russenorsk renderings of contemporary Russian and Norwegian texts.

[99] A Large Language Model Based Pipeline for Review of Systems Entity Recognition from Clinical Notes

Hieu Nghiem,Hemanth Reddy Singareddy,Zhuqi Miao,Jivan Lamichhane,Abdulaziz Ahmed,Johnson Thomas,Dursun Delen,William Paiva

Main category: cs.CL

TL;DR: The paper develops an automated pipeline based on large language models (LLMs) for extracting Review of Systems (ROS) entities from clinical notes. It supports both open-source and commercial models and achieves promising performance at low cost.

Details Motivation: Reduce the burden of ROS documentation in clinical notes with a scalable, locally deployable solution. Method: Extract ROS sections with SecTag, then use few-shot LLMs to identify ROS entity spans, their status, and the associated body systems; open-source models (Mistral, Llama, Gemma) and ChatGPT were tested. Result: ChatGPT performed best (span error rate 28.2%, status/system error rate 14.5%), and open-source models also performed well (span error rate 30.5-36.7%, status/system error rate 24.3-27.3%). Conclusion: The pipeline offers a viable open-source alternative for resource-limited healthcare settings and can reduce the ROS documentation burden. Abstract: Objective: Develop a cost-effective, large language model (LLM)-based pipeline for automatically extracting Review of Systems (ROS) entities from clinical notes. Materials and Methods: The pipeline extracts ROS sections using SecTag, followed by few-shot LLMs to identify ROS entity spans, their positive/negative status, and associated body systems. We implemented the pipeline using open-source LLMs (Mistral, Llama, Gemma) and ChatGPT. The evaluation was conducted on 36 general medicine notes containing 341 annotated ROS entities. Results: When integrating ChatGPT, the pipeline achieved the lowest error rates in detecting ROS entity spans and their corresponding statuses/systems (28.2% and 14.5%, respectively). Open-source LLMs enable local, cost-efficient execution of the pipeline while delivering promising performance with similarly low error rates (span: 30.5-36.7%; status/system: 24.3-27.3%). Discussion and Conclusion: Our pipeline offers a scalable and locally deployable solution to reduce ROS documentation burden. Open-source LLMs present a viable alternative to commercial models in resource-limited healthcare environments.

[100] Deontological Keyword Bias: The Impact of Modal Expressions on Normative Judgments of Language Models

Bumjin Park,Jinsil Lee,Jaesik Choi

Main category: cs.CL

TL;DR: The study finds that large language models (LLMs) show a Deontological Keyword Bias (DKB) in moral reasoning: when a prompt contains modal expressions such as "must" or "ought to", LLMs tend to judge non-obligatory contexts as obligations. A strategy that combines few-shot examples with reasoning prompts mitigates the bias.

Details Motivation: Examine how LLMs judge moral obligations, especially how modal expressions affect those judgments, filling a gap in LLM alignment research. Method: Analyze LLM obligation judgments under different modal expressions, define the DKB phenomenon, and design a mitigation strategy that combines few-shot examples with reasoning prompts. Result: When modal expressions are present, LLMs judge over 90% of commonsense scenarios as obligations, and this holds across LLM families, question types, and answer formats. Conclusion: Modal expressions act as linguistic framing that significantly shapes the normative decisions of LLMs, and such biases must be addressed to ensure judgment alignment. Abstract: Large language models (LLMs) are increasingly engaging in moral and ethical reasoning, where criteria for judgment are often unclear, even for humans. While LLM alignment studies cover many areas, one important yet underexplored area is how LLMs make judgments about obligations. This work reveals a strong tendency in LLMs to judge non-obligatory contexts as obligations when prompts are augmented with modal expressions such as must or ought to. We introduce this phenomenon as Deontological Keyword Bias (DKB). We find that LLMs judge over 90% of commonsense scenarios as obligations when modal expressions are present. This tendency is consistent across various LLM families, question types, and answer formats. To mitigate DKB, we propose a judgment strategy that integrates few-shot examples with reasoning prompts. This study sheds light on how modal expressions, as a form of linguistic framing, influence the normative decisions of LLMs and underscores the importance of addressing such biases to ensure judgment alignment.

[101] Targeted control of fast prototyping through domain-specific interface

Yu-Zhe Shi,Mingchen Liu,Hanlu Ma,Qiao Xu,Huamin Qu,Kun He,Lecheng Ruan,Qining Wang

Main category: cs.CL

TL;DR: The paper proposes an interface architecture that bridges the gap between designers' language and modeling languages, enabling precise control of prototype models through natural language instructions.

Details Motivation: Industrial designers want to control prototype models intuitively with natural language instructions, but large language models remain underused for this because of mismatches in abstraction level, semantic precision, and lexical scope. Method: Based on design principles derived from fast prototyping practice, propose an interface architecture and develop an algorithm for its automated domain specification. Result: Machine-based evaluations and human studies show that the interface can serve as an auxiliary module for large language models, enabling precise and effective control of prototype models. Conclusion: The interface architecture gives designers a natural and efficient way to control prototypes and bridges the gap between designers' language and modeling languages. Abstract: Industrial designers have long sought a natural and intuitive way to achieve the targeted control of prototype models -- using simple natural language instructions to configure and adjust the models seamlessly according to their intentions, without relying on complex modeling commands. While Large Language Models have shown promise in this area, their potential for controlling prototype models through language remains partially underutilized. This limitation stems from gaps between designers' languages and modeling languages, including mismatch in abstraction levels, fluctuation in semantic precision, and divergence in lexical scopes. To bridge these gaps, we propose an interface architecture that serves as a medium between the two languages. Grounded in design principles derived from a systematic investigation of fast prototyping practices, we devise the interface's operational mechanism and develop an algorithm for its automated domain specification. Both machine-based evaluations and human studies on fast prototyping across various product design domains demonstrate the interface's potential to function as an auxiliary module for Large Language Models, enabling precise and effective targeted control of prototype models.

[102] CLAIM: Mitigating Multilingual Object Hallucination in Large Vision-Language Models with Cross-Lingual Attention Intervention

Zekai Ye,Qiming Li,Xiaocheng Feng,Libo Qin,Yichong Huang,Baohang Li,Kui Jiang,Yang Xiang,Zhirui Zhang,Yunfei Lu,Duyu Tang,Dandan Tu,Bing Qin

Main category: cs.CL

TL;DR: CLAIM is a near training-free method that reduces multilingual object hallucination by adjusting cross-lingual attention patterns.

Details Motivation: Large vision-language models are more likely to produce responses inconsistent with the visual input for non-English queries, and existing methods rely on resource-intensive pretraining or fine-tuning. Method: CLAIM identifies language-specific cross-modal attention heads, estimates language shift vectors, and intervenes in the attention outputs at inference time. Result: CLAIM improves performance by an average of 13.56% on POPE and 21.75% on the hallucination subsets of MME. Conclusion: CLAIM effectively mitigates multilingual object hallucination, and attention divergence across languages is most prominent in intermediate layers. Abstract: Large Vision-Language Models (LVLMs) have demonstrated impressive multimodal abilities but remain prone to multilingual object hallucination, with a higher likelihood of generating responses inconsistent with the visual input when utilizing queries in non-English languages compared to English. Most existing approaches to this problem rely on pretraining or fine-tuning, which are resource-intensive. In this paper, inspired by observing the disparities in cross-modal attention patterns across languages, we propose Cross-Lingual Attention Intervention for Mitigating multilingual object hallucination (CLAIM) in LVLMs, a novel near training-free method that aligns attention patterns. CLAIM first identifies language-specific cross-modal attention heads, then estimates language shift vectors from English to the target language, and finally intervenes in the attention outputs during inference to facilitate cross-lingual visual perception capability alignment. Extensive experiments demonstrate that CLAIM achieves an average improvement of 13.56% (up to 30% in Spanish) on POPE and 21.75% on the hallucination subsets of the MME benchmark across various languages. Further analysis reveals that multilingual attention divergence is most prominent in intermediate layers, highlighting their critical role in multilingual scenarios.

[103] CyclicReflex: Improving Large Reasoning Models via Cyclical Reflection Token Scheduling

Chongyu Fan,Yihua Zhang,Jinghan Jia,Alfred Hero,Sijia Liu

Main category: cs.CL

TL;DR: The paper proposes CyclicReflex, a method that dynamically regulates the use of reflection tokens through cyclical scheduling to improve large reasoning models.

Details Motivation: Reflection tokens guide multi-step reasoning in large reasoning models, but both excessive and insufficient use degrade performance, so a balance is needed. Method: CyclicReflex dynamically adjusts reflection token logits with a position-dependent triangular waveform. Result: On several datasets (MATH500 and others), CyclicReflex outperforms standard decoding and other approaches such as TIP and S1. Conclusion: CyclicReflex manages reflection token usage effectively and improves performance across model sizes. Abstract: Large reasoning models (LRMs), such as OpenAI's o1 and DeepSeek-R1, harness test-time scaling to perform multi-step reasoning for complex problem-solving. This reasoning process, executed before producing final answers, is often guided by special juncture tokens or textual segments that prompt self-evaluative reflection. We refer to these transition markers and reflective cues as "reflection tokens" (e.g., "wait", "but", "alternatively"). In this work, we treat reflection tokens as a "resource" and introduce the problem of resource allocation, aimed at improving the test-time compute performance of LRMs by adaptively regulating the frequency and placement of reflection tokens. Through empirical analysis, we show that both excessive and insufficient use of reflection tokens, referred to as over-reflection and under-reflection, can degrade model performance. To better understand and manage this trade-off, we draw an analogy between reflection token usage and learning rate scheduling in optimization. Building on this insight, we propose cyclical reflection token scheduling (termed CyclicReflex), a decoding strategy that dynamically modulates reflection token logits using a position-dependent triangular waveform. Experiments on MATH500, AIME2024/2025, and AMC2023 demonstrate that CyclicReflex consistently improves performance across model sizes (1.5B-8B), outperforming standard decoding and more recent approaches such as TIP (thought switching penalty) and S1. Codes are available at https://github.com/OPTML-Group/CyclicReflex.
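
A position-dependent triangular waveform applied to reflection token logits might look like the sketch below. The abstract does not give the waveform's exact form, so the phase, `period`, and `amplitude` parameters and the function names are assumptions for illustration. The authors' implementation is in the repository linked above.

```python
def triangular_wave(position, period, amplitude):
    """Triangle wave over token position with values in [-amplitude, amplitude].

    Assumed shape: 0 at position 0, peak at period/4, trough at 3*period/4.
    """
    phase = (position / period + 0.25) % 1.0
    return amplitude * (1.0 - 4.0 * abs(phase - 0.5))


def apply_reflection_bias(logits, reflection_tokens, position, period, amplitude):
    """Return a copy of logits with the cyclical bias added to reflection tokens.

    logits:            dict mapping token string to its logit
    reflection_tokens: set of tokens treated as reflection cues (e.g. "wait")
    position:          index of the token being decoded
    """
    bias = triangular_wave(position, period, amplitude)
    return {
        token: value + bias if token in reflection_tokens else value
        for token, value in logits.items()
    }


if __name__ == "__main__":
    logits = {"wait": 1.0, "the": 0.5}
    for pos in range(0, 9, 2):
        print(pos, apply_reflection_bias(logits, {"wait"}, pos, period=8, amplitude=2.0))
```

A positive bias encourages reflection at that position and a negative bias suppresses it, which mirrors the learning rate schedule analogy drawn in the abstract.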

[104] RoE-FND: A Case-Based Reasoning Approach with Dual Verification for Fake News Detection via LLMs

Yuzhou Yang,Yangming Zhou,Zhiying Zhu,Zhenxing Qian,Xinpeng Zhang,Sheng Li

Main category: cs.CL

TL;DR: RoE-FND is a fake news detection framework based on logical deduction that combines large language models with experiential learning, improving detection through self-reflective knowledge building and dynamic criterion retrieval.

Details Motivation: Deceptive content is widespread online, and existing fake news detection methods suffer from noisy evidence selection, generalization bottlenecks, and opaque decision-making, so a more robust solution is needed. Method: RoE-FND has two stages: self-reflective knowledge building, which curates a knowledge base by analyzing past reasoning errors, and dynamic criterion retrieval, which derives reasoning guidelines from historical cases. A dual-channel procedure then verifies the rationales. Result: RoE-FND shows better generalization and effectiveness than existing methods on three datasets. Conclusion: RoE-FND offers a training-free, adaptable, and effective solution for fake news detection. Abstract: The proliferation of deceptive content online necessitates robust Fake News Detection (FND) systems. While evidence-based approaches leverage external knowledge to verify claims, existing methods face critical limitations: noisy evidence selection, generalization bottlenecks, and unclear decision-making processes. Recent efforts to harness Large Language Models (LLMs) for FND introduce new challenges, including hallucinated rationales and conclusion bias. To address these issues, we propose RoE-FND (Reason on Experiences FND), a framework that reframes evidence-based FND as a logical deduction task by synergizing LLMs with experiential learning. RoE-FND encompasses two stages: (1) self-reflective knowledge building, where a knowledge base is curated by analyzing past reasoning errors, namely the exploration stage, and (2) dynamic criterion retrieval, which synthesizes task-specific reasoning guidelines from historical cases as experiences during deployment. It further cross-checks rationales against internal experience through a devised dual-channel procedure. Key contributions include: a case-based reasoning framework for FND that addresses multiple existing challenges, a training-free approach enabling adaptation to evolving situations, and empirical validation of the framework's superior generalization and effectiveness over state-of-the-art methods across three datasets.

[105] MANBench: Is Your Multimodal Model Smarter than Human?

Han Zhou,Qitong Xu,Yiheng Dong,Xin Yang

Main category: cs.CL

TL;DR: MANBench is a bilingual benchmark for comparing Multimodal Large Language Models (MLLMs) with humans on multimodal tasks. MLLMs excel at some tasks but still lag behind humans on complex reasoning tasks.

Details Motivation: Assess whether MLLMs can surpass human performance on multimodal tasks and expose their limitations. Method: Develop the MANBench benchmark with 1,314 questions across nine tasks and run comparison experiments between humans and MLLMs. Result: MLLMs excel at Knowledge and Text-Image Understanding tasks but perform poorly on cross-modal reasoning and on complex tasks such as Puzzles and Spatial Imagination. Conclusion: MANBench exposes the limitations of MLLMs and calls for further research to close the gap with human capabilities. Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has ignited discussions regarding their potential to surpass human performance in multimodal tasks. In response, we introduce MANBench (Multimodal Ability Norms Benchmark), a bilingual benchmark (English and Chinese) comprising 1,314 questions across nine tasks, spanning knowledge-based and non-knowledge-based domains. MANBench emphasizes intuitive reasoning, seamless cross-modal integration, and real-world complexity, providing a rigorous evaluation framework. Through extensive human experiments involving diverse participants, we compared human performance against state-of-the-art MLLMs. The results indicate that while MLLMs excel in tasks like Knowledge and Text-Image Understanding, they struggle with deeper cross-modal reasoning tasks such as Transmorphic Understanding, Image Consistency, and Multi-image Understanding. Moreover, both humans and MLLMs face challenges in highly complex tasks like Puzzles and Spatial Imagination. MANBench highlights the strengths and limitations of MLLMs, revealing that even advanced models fall short of achieving human-level performance across many domains. We hope MANBench will inspire efforts to bridge the gap between MLLMs and human multimodal capabilities. The code and dataset are available at https://github.com/micdz/MANBench.

[106] SAGE: Specification-Aware Grammar Extraction for Automated Test Case Generation with LLMs

Aditi,Hyunwoo Park,Sicheol Sung,Yo-Sub Han,Sang-Ki Ko

Main category: cs.CL

TL;DR: The paper uses open-source large language models (LLMs) to generate Context-Free Grammars with Counters (CCFGs) from natural language specifications, with reinforcement learning to improve grammar validity.

Details Motivation: Generating valid and general grammars from natural language specifications is a key challenge, especially under limited supervision. Method: Fine-tune an open-source LLM for specification-to-grammar translation, then apply GRPO reinforcement learning to improve grammar validity and generality. Result: SAGE outperforms 17 open and closed-source LLMs in grammar quality and test effectiveness, improving grammar validity by 15.92 percentage points and test effectiveness by 12.34 percentage points. Conclusion: The method substantially improves grammar quality and test effectiveness and offers an effective approach to grammar generation under limited supervision. Abstract: Grammar-based test case generation has proven effective for competitive programming problems, but generating valid and general grammars from natural language specifications remains a key challenge, especially under limited supervision. Context-Free Grammars with Counters (CCFGs) have recently been introduced as a formalism to represent such specifications with logical constraints by storing and reusing counter values during derivation. In this work, we explore the use of open-source large language models (LLMs) to induce CCFGs from specifications using a small number of labeled examples and verifiable reward-guided reinforcement learning. Our approach first fine-tunes an open-source LLM to perform specification-to-grammar translation, and further applies Group Relative Policy Optimization (GRPO) to enhance grammar validity and generality. We also examine the effectiveness of iterative feedback for open and closed-source LLMs in correcting syntactic and semantic errors in generated grammars. Experimental results show that our approach SAGE achieves stronger generalization and outperforms 17 open and closed-source LLMs in both grammar quality and test effectiveness, improving over the state-of-the-art by 15.92%p in grammar validity and 12.34%p in test effectiveness. We provide our implementation and dataset at the following anonymous repository: https://anonymous.4open.science/r/SAGE-5714

[107] PRISM: A Transformer-based Language Model of Structured Clinical Event Data

Lionel Levine,John Santerre,Alex S. Young,T. Barry Levine,Francis Campion,Majid Sarrafzadeh

Main category: cs.CL

TL;DR: PRISM is a transformer-based architecture that models the sequential progression of clinical decision-making. It predicts the next steps in a patient's diagnostic journey and substantially outperforms random baselines.

Details Motivation: Traditional approaches rely on isolated diagnostic classification and cannot capture the complex dependencies in clinical trajectories; PRISM aims to fill this gap. Method: PRISM tokenizes clinical trajectories into event sequences (diagnostic tests, laboratory results, and diagnoses) and learns to predict the next steps with an autoregressive training objective. Result: PRISM performs well on next-token prediction tasks, and its generated sequences reflect realistic diagnostic pathways and clinician behavior. Conclusion: PRISM lays a foundation for sequence-based healthcare modeling and shows that generative language modeling techniques are feasible for structured medical data. Abstract: We introduce PRISM (Predictive Reasoning in Sequential Medicine), a transformer-based architecture designed to model the sequential progression of clinical decision-making processes. Unlike traditional approaches that rely on isolated diagnostic classification, PRISM frames clinical trajectories as tokenized sequences of events - including diagnostic tests, laboratory results, and diagnoses - and learns to predict the most probable next steps in the patient diagnostic journey. Leveraging a large custom clinical vocabulary and an autoregressive training objective, PRISM demonstrates the ability to capture complex dependencies across longitudinal patient timelines. Experimental results show substantial improvements over random baselines in next-token prediction tasks, with generated sequences reflecting realistic diagnostic pathways, laboratory result progressions, and clinician ordering behaviors. These findings highlight the feasibility of applying generative language modeling techniques to structured medical event data, enabling applications in clinical decision support, simulation, and education. PRISM establishes a foundation for future advancements in sequence-based healthcare modeling, bridging the gap between machine learning architectures and real-world diagnostic reasoning.

[108] RedDebate: Safer Responses through Multi-Agent Red Teaming Debates

Ali Asad,Stephen Obadinma,Radin Shayanfar,Xiaodan Zhu

Main category: cs.CL

TL;DR: RedDebate is a multi-agent debate framework that uses adversarial argumentation among LLMs to proactively identify and reduce unsafe behaviours. Combined with long-term memory modules, it substantially improves AI safety.

Details Motivation: Existing AI safety methods depend on costly human evaluation or single-model assessment, which face scalability constraints and oversight risks. RedDebate addresses these with multi-agent debate and automated red-teaming. Method: RedDebate uses adversarial debate among multiple LLMs together with long-term memory modules to uncover unsafe blind spots systematically and improve responses iteratively. Result: On benchmarks such as HarmBench, debate alone reduces unsafe behaviours by 17.7%, and the reduction exceeds 23.5% when long-term memory modules are added. Conclusion: RedDebate is the first fully automated framework that combines multi-agent debate with red-teaming, progressively improving AI safety without direct human intervention. Abstract: We propose RedDebate, a novel multi-agent debate framework that leverages adversarial argumentation among Large Language Models (LLMs) to proactively identify and mitigate their own unsafe behaviours. Existing AI safety methods often depend heavily on costly human evaluations or isolated single-model assessment, both subject to scalability constraints and oversight risks. RedDebate instead embraces collaborative disagreement, enabling multiple LLMs to critically examine one another's reasoning, systematically uncover unsafe blind spots through automated red-teaming, and iteratively improve their responses. We further integrate distinct types of long-term memory that retain learned safety insights from debate interactions. Evaluating on established safety benchmarks such as HarmBench, we demonstrate the proposed method's effectiveness. Debate alone can reduce unsafe behaviours by 17.7%, and when combined with long-term memory modules, achieves reductions exceeding 23.5%. To our knowledge, RedDebate constitutes the first fully automated framework that combines multi-agent debates with red-teaming to progressively enhance AI safety without direct human intervention. (GitHub repository: https://github.com/aliasad059/RedDebate)

[109] Two Birds with One Stone: Improving Factuality and Faithfulness of LLMs via Dynamic Interactive Subspace Editing

Pengbo Wang,Chaozhuo Li,Chenxu Wang,Liwen Zheng,Litian Zhang,Xi Zhang

Main category: cs.CL

TL;DR: The paper proposes SPACE, a framework that improves both the factuality and the faithfulness of LLMs by editing shared activation subspaces, avoiding the performance trade-offs that arise when existing methods treat each hallucination type independently.

Details Motivation: Practical deployment of large language models (LLMs) is limited by factuality and faithfulness hallucinations, and existing methods handle them independently, which leads to performance trade-offs. Method: Analysis of activation space dynamics in LLMs shows that the two hallucination types share subspaces. SPACE combines dual-task feature modeling with a hybrid probe strategy (spectral clustering and attention head saliency scoring) to identify and edit these shared subspaces. Result: Experiments on multiple benchmark datasets show that SPACE is superior. Conclusion: SPACE provides a unified framework for improving factuality and faithfulness together and demonstrates the potential of shared subspaces. Abstract: LLMs have demonstrated unprecedented capabilities in natural language processing, yet their practical deployment remains hindered by persistent factuality and faithfulness hallucinations. While existing methods address these hallucination types independently, they inadvertently induce performance trade-offs, as interventions targeting one type often exacerbate the other. Through empirical and theoretical analysis of activation space dynamics in LLMs, we reveal that these hallucination categories share overlapping subspaces within neural representations, presenting an opportunity for concurrent mitigation. To harness this insight, we propose SPACE, a unified framework that jointly enhances factuality and faithfulness by editing shared activation subspaces. SPACE establishes a geometric foundation for shared subspace existence through dual-task feature modeling, then identifies and edits these subspaces via a hybrid probe strategy combining spectral clustering and attention head saliency scoring. Experimental results across multiple benchmark datasets demonstrate the superiority of our approach.

[110] Customizing Speech Recognition Model with Large Language Model Feedback

Shaoshi Ling,Guoli Ye

Main category: cs.CL

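TL;DR: This paper proposes a reinforcement-learning-based unsupervised domain adaptation method that uses LLM feedback to improve an ASR model's recognition of named entities under domain mismatch.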

Details Motivation: Although ASR systems perform well on general transcription tasks, they still struggle to recognize rare named entities and to adapt to domain mismatch. LLMs, trained on large-scale data, are more effective across many domains and are therefore used to improve ASR performance. Method: Using reinforcement learning, an LLM serves as the reward model that scores the ASR model's hypotheses, and these scores are used as reward signals to fine-tune the ASR model. Result: The method achieves a 21% improvement in entity word error rate over conventional self-training. Conclusion: Reinforcement learning with LLM feedback significantly improves ASR domain adaptation and named-entity recognition. Abstract: Automatic speech recognition (ASR) systems have achieved strong performance on general transcription tasks. However, they continue to struggle with recognizing rare named entities and adapting to domain mismatches. In contrast, large language models (LLMs), trained on massive internet-scale datasets, are often more effective across a wide range of domains. In this work, we propose a reinforcement-learning-based approach for unsupervised domain adaptation, leveraging unlabeled data to enhance transcription quality, particularly for the named entities affected by domain mismatch, through feedback from an LLM. Given contextual information, our framework employs an LLM as the reward model to score the hypotheses from the ASR model. These scores serve as reward signals to fine-tune the ASR model via reinforcement learning. Our method achieves a 21% improvement on entity word error rate over conventional self-training methods.

[111] Dynamic Context Tuning for Retrieval-Augmented Generation: Enhancing Multi-Turn Planning and Tool Adaptation

Jubin Abhishek Soni,Amit Anand,Rajesh Kumar Pandey,Aniket Abhishek Soni

Main category: cs.CL

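TL;DR: DCT is a lightweight framework that extends RAG to support multi-turn dialogue and dynamic tool environments without retraining, significantly improving accuracy and efficiency.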

Details Motivation: Existing RAG systems are limited to static, single-turn interactions and cannot meet the needs of dynamic domains such as healthcare and smart homes. Method: DCT combines an attention-based context cache, LoRA-based retrieval, and efficient context compression to support dynamic tool selection and context management. Result: Experiments show that DCT improves plan accuracy by 14% and reduces hallucinations by 37%, at a significantly lower cost than GPT-4. Conclusion: DCT generalizes to previously unseen tools and is scalable and adaptable across a wide range of dynamic environments. Abstract: Retrieval-Augmented Generation (RAG) has significantly advanced large language models (LLMs) by grounding their outputs in external tools and knowledge sources. However, existing RAG systems are typically constrained to static, single-turn interactions with fixed toolsets, making them ill-suited for dynamic domains such as healthcare and smart homes, where user intent, available tools, and contextual factors evolve over time. We present Dynamic Context Tuning (DCT), a lightweight framework that extends RAG to support multi-turn dialogue and evolving tool environments without requiring retraining. DCT integrates an attention-based context cache to track relevant past information, LoRA-based retrieval to dynamically select domain-specific tools, and efficient context compression to maintain inputs within LLM context limits. Experiments on both synthetic and real-world benchmarks show that DCT improves plan accuracy by 14% and reduces hallucinations by 37%, while matching GPT-4 performance at significantly lower cost. Furthermore, DCT generalizes to previously unseen tools, enabling scalable and adaptable AI assistants across a wide range of dynamic environments.

[112] The Scales of Justitia: A Comprehensive Survey on Safety Evaluation of LLMs

Songyang Liu,Chaozhuo Li,Jiameng Qiu,Xi Zhang,Feiran Huang,Litian Zhang,Yiming Hei,Philip S. Yu

Main category: cs.CL

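TL;DR: This survey reviews recent progress in the safety evaluation of large language models (LLMs), covering evaluation background, task taxonomy, metrics and datasets, and methods and tools, and proposes future research directions.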

Details Motivation: With the wide adoption of LLMs, safety issues in generated content (such as toxicity and bias) have drawn attention, yet a systematic review is lacking. This survey aims to fill that gap. Method: Through a taxonomy-based analysis, the survey systematically organizes research on LLM safety evaluation along four dimensions: "why evaluate", "what to evaluate", "where to evaluate", and "how to evaluate". Result: It summarizes existing evaluation tasks, metrics, datasets, and methods, and identifies future challenges and research directions. Conclusion: It emphasizes the importance of LLM safety evaluation for the safe deployment of these models in real-world applications. Abstract: With the rapid advancement of artificial intelligence technology, Large Language Models (LLMs) have demonstrated remarkable potential in the field of Natural Language Processing (NLP), including areas such as content generation, human-computer interaction, machine translation, and code generation, among others. However, their widespread deployment has also raised significant safety concerns. In recent years, LLM-generated content has occasionally exhibited unsafe elements like toxicity and bias, particularly in adversarial scenarios, which has garnered extensive attention from both academia and industry. While numerous efforts have been made to evaluate the safety risks associated with LLMs, there remains a lack of systematic reviews summarizing these research endeavors. This survey aims to provide a comprehensive and systematic overview of recent advancements in LLM safety evaluation, focusing on several key aspects: (1) "Why evaluate", which explores the background of LLM safety evaluation, how it differs from general LLM evaluation, and the significance of such evaluation; (2) "What to evaluate", which examines and categorizes existing safety evaluation tasks based on key capabilities, including dimensions such as toxicity, robustness, ethics, bias and fairness, truthfulness, and so on; (3) "Where to evaluate", which summarizes the evaluation metrics, datasets and benchmarks currently used in safety evaluations; (4) "How to evaluate", which reviews existing evaluation toolkits and categorizes mainstream evaluation methods based on the roles of the evaluators. Finally, we identify the challenges in LLM safety evaluation and propose potential research directions to promote further advancement in this field.
We emphasize the importance of prioritizing LLMs safety evaluation to ensure the safe deployment of these models in real-world applications.

[113] Persistent Homology of Topic Networks for the Prediction of Reader Curiosity

Manuel D. S. Hopp,Vincent Labatut,Arthur Amalvy,Richard Dufour,Hannah Stone,Hayley Jach,Kou Murayama

Main category: cs.CL

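TL;DR: This paper proposes a framework grounded in Information Gap Theory that models reader curiosity by quantifying information gaps in a text's semantic structure. It combines BERTopic and persistent homology to analyze the topological features of a dynamic semantic network, and experiments show a significant improvement in curiosity prediction.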

Details Motivation: Reader curiosity is crucial for textual engagement but has received little attention in NLP. This paper aims to fill that gap by modeling curiosity through quantified information gaps. Method: Combining BERTopic topic modeling with persistent homology, the authors build a dynamic semantic network and extract its topological features (connected components, cycles, voids) as proxies for information gaps. Result: Experiments show that the method significantly improves curiosity prediction over a baseline model (explained deviance rises from 30% to 73%). Conclusion: The framework offers a new computational method for analyzing the relation between text structure and reader engagement, and supports the applicability of Information Gap Theory in NLP. Abstract: Reader curiosity, the drive to seek information, is crucial for textual engagement, yet remains relatively underexplored in NLP. Building on Loewenstein's Information Gap Theory, we introduce a framework that models reader curiosity by quantifying semantic information gaps within a text's semantic structure. Our approach leverages BERTopic-inspired topic modeling and persistent homology to analyze the evolving topology (connected components, cycles, voids) of a dynamic semantic network derived from text segments, treating these features as proxies for information gaps. To empirically evaluate this pipeline, we collect reader curiosity ratings from participants (n = 49) as they read S. Collins's "The Hunger Games" novel. We then use the topological features from our pipeline as independent variables to predict these ratings, and experimentally show that they significantly improve curiosity prediction compared to a baseline model (73% vs. 30% explained deviance), validating our approach. This pipeline offers a new computational method for analyzing text structure and its relation to reader engagement.
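To make the "connected components" part of persistent homology concrete, here is a minimal sketch of 0-dimensional persistence on a weighted graph. It is not the authors' pipeline: the function name `h0_persistence`, the edge weights (imagined here as semantic distances between topic nodes), and the choice to let every node be born at threshold 0 are all assumptions for illustration. Cycles and voids (higher dimensions) are not covered.

```python
def h0_persistence(num_nodes, weighted_edges):
    """Return (birth, death) pairs for connected components of a graph filtration.

    Assumption: every node exists from threshold 0, and an edge (u, v, w)
    enters the filtration at threshold w (e.g. a distance between topics).
    """
    parent = list(range(num_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    pairs = []
    for u, v, w in sorted(weighted_edges, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            pairs.append((0.0, w))  # one component dies when two merge
        # An edge inside an existing component closes a cycle instead;
        # tracking those would require 1-dimensional homology.

    # Components that never merge persist forever.
    roots = {find(i) for i in range(num_nodes)}
    pairs.extend((0.0, float("inf")) for _ in roots)
    return pairs


# Hypothetical toy network: 5 topic nodes, edges weighted by distance.
edges = [(0, 1, 0.2), (1, 2, 0.5), (0, 2, 0.9), (3, 4, 0.3)]
diagram = h0_persistence(5, edges)
```

In this toy input, three merges occur and two components survive, so the diagram has three finite bars and two infinite ones. Long finite bars correspond to topic clusters that stay disconnected over a wide range of thresholds, which is one plausible way to read an "information gap".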

[114] C-SEO Bench: Does Conversational SEO Work?

Haritz Puerto,Martin Gubri,Tommaso Green,Seong Joon Oh,Sangdoo Yun

Main category: cs.CL

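TL;DR: The paper presents C-SEO Bench, the first benchmark for evaluating C-SEO methods across tasks, domains, and numbers of actors, and finds that existing methods have limited effect while traditional SEO strategies are more effective.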

Details Motivation: To study how broadly C-SEO methods apply and how they perform when multiple actors compete, addressing the limitations of existing evaluations. Method: The authors design the C-SEO Bench benchmark, covering multiple tasks, domains, and numbers of actors, and propose a new evaluation protocol. Result: Current C-SEO methods perform poorly, while traditional SEO strategies are more effective; competition among multiple actors leads to diminishing gains. Conclusion: The C-SEO problem is zero-sum in nature, and traditional SEO strategies have the advantage in LLM settings. Abstract: Large Language Models (LLMs) are transforming search engines into Conversational Search Engines (CSE). Consequently, Search Engine Optimization (SEO) is being shifted into Conversational Search Engine Optimization (C-SEO). We are beginning to see dedicated C-SEO methods for modifying web documents to increase their visibility in CSE responses. However, they are often tested only for a limited breadth of application domains; we do not understand whether certain C-SEO methods would be effective for a broad range of domains. Moreover, existing evaluations consider only a single-actor scenario where only one web document adopts a C-SEO method; in reality, multiple players are likely to competitively adopt the cutting-edge C-SEO techniques, drawing an analogy from the dynamics we have seen in SEO. We present C-SEO Bench, the first benchmark designed to evaluate C-SEO methods across multiple tasks, domains, and numbers of actors. We consider two search tasks, question answering and product recommendation, with three domains each. We also formalize a new evaluation protocol with varying adoption rates among involved actors. Our experiments reveal that most current C-SEO methods are largely ineffective, contrary to reported results in the literature. Instead, traditional SEO strategies, those aiming to improve the ranking of the source in the LLM context, are significantly more effective. We also observe that as we increase the number of C-SEO adopters, the overall gains decrease, depicting a congested and zero-sum nature of the problem. Our code and data are available at https://github.com/parameterlab/c-seo-bench and https://huggingface.co/datasets/parameterlab/c-seo-bench.

[115] Evolutionary Perspectives on the Evaluation of LLM-Based AI Agents: A Comprehensive Survey

Jiachen Zhu,Menghui Zhu,Renting Rui,Rong Shan,Congmin Zheng,Bo Chen,Yunjia Xi,Jianghao Lin,Weiwen Liu,Ruiming Tang,Yong Yu,Weinan Zhang

Main category: cs.CL

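TL;DR: This paper systematically analyzes current approaches to evaluating large language model (LLM) chatbots and AI agents, proposes a framework that distinguishes the two, and categorizes existing evaluation benchmarks to give researchers practical guidance.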

Details Motivation: Existing evaluation frameworks blur the distinction between LLM chatbots and AI agents, making it hard for researchers to choose appropriate benchmarks. This paper aims to fill that gap. Method: From an evolutionary perspective, the authors propose an analytical framework with five key aspects, categorize existing evaluation benchmarks, and summarize evaluation attributes. Result: The paper provides a clear distinguishing framework and reference tables, summarizes current trends, and proposes four key lenses for future evaluation methodologies. Conclusion: The paper offers practical evaluation guidance for researchers and supports further progress in AI agent evaluation. Abstract: The advent of large language models (LLMs), such as GPT, Gemini, and DeepSeek, has significantly advanced natural language processing, giving rise to sophisticated chatbots capable of diverse language-related tasks. The transition from these traditional LLM chatbots to more advanced AI agents represents a pivotal evolutionary step. However, existing evaluation frameworks often blur the distinctions between LLM chatbots and AI agents, leading to confusion among researchers selecting appropriate benchmarks. To bridge this gap, this paper introduces a systematic analysis of current evaluation approaches, grounded in an evolutionary perspective. We provide a detailed analytical framework that clearly differentiates AI agents from LLM chatbots along five key aspects: complex environment, multi-source instructor, dynamic feedback, multi-modal perception, and advanced capability. Further, we categorize existing evaluation benchmarks based on external environments, driving forces, and resulting advanced internal capabilities. For each category, we delineate relevant evaluation attributes, presented comprehensively in practical reference tables. Finally, we synthesize current trends and outline future evaluation methodologies through four critical lenses: environment, agent, evaluator, and metrics. Our findings offer actionable guidance for researchers, facilitating the informed selection and application of benchmarks in AI agent evaluation, thus fostering continued advancement in this rapidly evolving research domain.

[116] You Only Fine-tune Once: Many-Shot In-Context Fine-Tuning for Large Language Model

Wenchong He,Liqian Peng,Zhe Jiang,Alex Go

Main category: cs.CL

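TL;DR: The paper proposes ManyICL, a new method that extends in-context learning to the many-shot setting, significantly narrowing the performance gap with dedicated fine-tuning and addressing the inefficiency of processing long sequences.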

Details Motivation: Moderately sized LLMs (such as Mistral 7B) can handle multiple tasks through few-shot in-context fine-tuning but still lag behind dedicated fine-tuning. ManyICL aims to improve performance and address efficiency. Method: ManyICL turns many-shot in-context examples from prompts into targets for autoregressive learning and uses a new training objective to make long-sequence processing more efficient. Result: Experiments show that ManyICL substantially outperforms zero/few-shot fine-tuning on classification, summarization, question answering, and other tasks, approaches the performance of dedicated fine-tuning, and mitigates catastrophic forgetting. Conclusion: ManyICL performs well across multiple tasks and offers a more efficient approach to in-context learning for LLMs. Abstract: Large language models (LLMs) possess a remarkable ability to perform in-context learning (ICL), which enables them to handle multiple downstream tasks simultaneously without requiring task-specific fine-tuning. Recent studies have shown that even moderately sized LLMs, such as Mistral 7B, Gemma 7B and Llama-3 8B, can achieve ICL through few-shot in-context fine-tuning of all tasks at once. However, this approach still lags behind dedicated fine-tuning, where a separate model is trained for each individual task. In this paper, we propose a novel approach, Many-Shot In-Context Fine-tuning (ManyICL), which significantly narrows this performance gap by extending the principles of ICL to a many-shot setting. To unlock the full potential of ManyICL and address the inherent inefficiency of processing long sequences with numerous in-context examples, we propose a novel training objective. Instead of solely predicting the final answer, our approach treats every answer within the context as a supervised training target. This effectively shifts the role of many-shot examples from prompts to targets for autoregressive learning. Through extensive experiments on diverse downstream tasks, including classification, summarization, question answering, natural language inference, and math, we demonstrate that ManyICL substantially outperforms zero/few-shot fine-tuning and approaches the performance of dedicated fine-tuning. Furthermore, ManyICL significantly mitigates catastrophic forgetting issues observed in zero/few-shot fine-tuning. The code will be made publicly available upon publication.
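The idea of treating every in-context answer as a training target, rather than only the final one, can be sketched as a label-masking step. This is a hedged illustration and not the authors' code: `build_labels` and the toy token IDs are hypothetical, and the ignore value -100 is chosen only because it mirrors the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE = -100  # assumption: label value that the loss function skips


def build_labels(examples, supervise_all_answers=True):
    """Concatenate (prompt, answer) token lists into one training sequence.

    examples: list of (prompt_token_ids, answer_token_ids) pairs, where the
    last pair is the actual query. Prompt tokens are never supervised.
    If supervise_all_answers is False, only the final answer is a target,
    which corresponds to the conventional few-shot fine-tuning setup.
    """
    input_ids, labels = [], []
    last = len(examples) - 1
    for i, (prompt, answer) in enumerate(examples):
        input_ids.extend(prompt)
        labels.extend([IGNORE] * len(prompt))
        input_ids.extend(answer)
        if supervise_all_answers or i == last:
            labels.extend(answer)
        else:
            labels.extend([IGNORE] * len(answer))
    return input_ids, labels


# Hypothetical toy sequence: two in-context shots plus the final query.
shots = [([1, 2], [3]), ([4, 5], [6]), ([7], [8, 9])]
ids_all, labels_all = build_labels(shots, supervise_all_answers=True)
ids_last, labels_last = build_labels(shots, supervise_all_answers=False)
```

With the same input sequence, the many-shot variant yields four supervised tokens instead of two, so each long sequence contributes more training signal per forward pass. That is one plausible reading of how the objective addresses the cost of long contexts.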

[117] DAM: Dynamic Attention Mask for Long-Context Large Language Model Inference Acceleration

Hanzhi Zhang,Heng Fan,Kewei Sha,Yan Huang,Yunhe Feng

Main category: cs.CL

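TL;DR: Proposes a dynamic sparse attention mechanism (DAM) that uses adaptive masks to overcome the limitations of static sparse attention in long-sequence tasks, improving computational efficiency without fine-tuning.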

Details Motivation: Conventional sparse attention methods use static, predefined masks that fail to capture heterogeneous attention patterns, limiting token interaction and retrieval performance in long-sequence tasks. Method: A dynamic sparse attention mechanism assigns adaptive masks at the attention-map level, preserving heterogeneous patterns across layers and heads without predefined mask structures. Result: The method aligns closely with full-attention models and significantly reduces memory and compute overhead with minimal performance loss. Conclusion: DAM offers a scalable alternative for large language models that balances efficiency and retrieval performance. Abstract: Long-context understanding is crucial for many NLP applications, yet transformers struggle with efficiency due to the quadratic complexity of self-attention. Sparse attention methods alleviate this cost but often impose static, predefined masks, failing to capture heterogeneous attention patterns. This results in suboptimal token interactions, limiting adaptability and retrieval accuracy in long-sequence tasks. This work introduces a dynamic sparse attention mechanism that assigns adaptive masks at the attention-map level, preserving heterogeneous patterns across layers and heads. Unlike existing approaches, our method eliminates the need for fine-tuning and predefined mask structures while maintaining computational efficiency. By learning context-aware attention structures, it achieves high alignment with full-attention models, ensuring minimal performance degradation while reducing memory and compute overhead. This approach provides a scalable alternative to full attention, enabling the practical deployment of large-scale Large Language Models (LLMs) without sacrificing retrieval performance. DAM is available at: https://github.com/HanzhiZhang-Ulrica/DAM.
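The abstract does not spell out how the adaptive masks are derived, so the following is only a sketch of one way a mask could be read off an observed attention map for a single head. The function `dynamic_mask`, the relative threshold rule, and the toy attention matrix are assumptions for illustration, not the DAM algorithm.

```python
def dynamic_mask(attn, rel_threshold=0.1):
    """Derive a boolean sparse mask from one head's causal attention map.

    attn: square list of lists, attn[i][j] = attention of query i on key j.
    Assumed rule: keep position (i, j) when it is causal (j <= i) and its
    weight is at least rel_threshold times the largest weight in row i.
    """
    n = len(attn)
    mask = []
    for i in range(n):
        row_max = max(attn[i][: i + 1])
        cutoff = rel_threshold * row_max
        mask.append([j <= i and attn[i][j] >= cutoff for j in range(n)])
    return mask


# Hypothetical attention map for a 3-token sequence.
attn = [
    [1.00, 0.00, 0.00],
    [0.95, 0.05, 0.00],
    [0.02, 0.48, 0.50],
]
mask = dynamic_mask(attn, rel_threshold=0.1)
```

Because the cutoff is relative to each row, a head with sharply peaked attention ends up with a sparser mask than a head with diffuse attention, which is the kind of per-head heterogeneity a static, predefined pattern cannot express.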

[118] Enabling On-Device Medical AI Assistants via Input-Driven Saliency Adaptation

Uttej Kallakurik,Edward Humes,Rithvik Jonna,Xiaomin Lin,Tinoosh Mohsenin

Main category: cs.CL

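TL;DR: Proposes an LLM compression framework for the medical domain that uses pruning and quantization to substantially reduce model size while preserving performance, enabling real-time, efficient inference on edge devices.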

Details Motivation: LLMs have great potential in healthcare scenarios but are too large to deploy on resource-constrained edge devices. Method: Neurons are pruned according to measured saliency, the model is then further compressed with quantization, and the result is evaluated on medical benchmarks such as MedMCQA. Result: Gemma and LLaMA3 models are compressed by 50% and 67% respectively and run efficiently on Jetson Orin Nano and Raspberry Pi 5. Conclusion: The framework offers a feasible path to deploying LLMs at the edge in the medical domain. Abstract: Large Language Models (LLMs) have significant impact on healthcare scenarios but remain prohibitively large for deployment in real-time, resource-constrained environments such as edge devices. In this work, we introduce a novel medical assistant system, optimized through our general-purpose compression framework, which tailors Large Language Models (LLMs) for deployment in specialized domains. By measuring neuron saliency on domain-specific data, our method can aggressively prune irrelevant neurons, reducing model size while preserving performance. Following pruning, we apply post-training quantization to further reduce the memory footprint, and evaluate the compressed model across medical benchmarks including MedMCQA, MedQA, and PubMedQA. We also deploy the 50% compressed Gemma and the 67% compressed LLaMA3 models on Jetson Orin Nano (18.7W peak) and Raspberry Pi 5 (6.3W peak), achieving real-time, energy-efficient inference under hardware constraints.
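The prune-then-quantize pipeline described above can be illustrated with a small sketch. The abstract does not define the saliency measure or the quantization scheme, so both are assumptions here: saliency is taken as the mean absolute activation over domain samples, and quantization is symmetric per-tensor int8. All function names and numbers are hypothetical.

```python
def neuron_saliency(activations):
    """Mean absolute activation per neuron over domain samples (assumed metric).

    activations: list of per-sample lists, one activation value per neuron.
    """
    n = len(activations[0])
    return [sum(abs(s[j]) for s in activations) / len(activations) for j in range(n)]


def neurons_to_keep(saliency, prune_ratio):
    """Indices of the most salient neurons after pruning prune_ratio of them."""
    keep = len(saliency) - int(len(saliency) * prune_ratio)
    ranked = sorted(range(len(saliency)), key=lambda j: saliency[j], reverse=True)
    return sorted(ranked[:keep])


def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: w is approximated by q * scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid zero scale
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


# Hypothetical activations of 4 neurons on 2 domain-specific samples.
acts = [[0.1, 2.0, 0.0, 1.0], [0.3, 1.0, 0.0, 3.0]]
kept = neurons_to_keep(neuron_saliency(acts), prune_ratio=0.5)
q, scale = quantize_int8([1.27, -0.64, 0.0])
```

Pruning first and quantizing second matches the order given in the abstract. Measuring saliency on domain data means that neurons irrelevant to medical text, rather than neurons that are small in general, are the ones removed.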

[119] Graph-based RAG Enhancement via Global Query Disambiguation and Dependency-Aware Reranking

Ningyuan Li,Junrui Liu,Yi Shan,Minghui Huang,Tong Li

Main category: cs.CL

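TL;DR: PankRAG uses hierarchical query resolution and dependency-aware reranking to address the tendency of conventional entity-based RAG methods to miss latent but critical information, significantly improving the accuracy and relevance of generated responses.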

Details Motivation: Conventional entity-based RAG methods can misinterpret or omit latent but critical information, so that retrieved content is irrelevant or contradictory, increasing hallucination risk and reducing the fidelity of generated responses. Method: PankRAG uses a globally aware, hierarchical query-resolution strategy and a dependency-aware reranking mechanism, building multi-level resolution paths and exploiting dependency structure to refine retrieval results. Result: PankRAG outperforms existing methods on multiple benchmarks, demonstrating robustness and generalizability. Conclusion: Through structured reasoning and dependency-aware refinement, PankRAG significantly improves RAG performance and offers an effective solution for handling complex queries. Abstract: Contemporary graph-based retrieval-augmented generation (RAG) methods typically begin by extracting entities from user queries and then leverage pre-constructed knowledge graphs to retrieve related relationships and metadata. However, this pipeline's exclusive reliance on entity-level extraction can lead to the misinterpretation or omission of latent yet critical information and relations. As a result, retrieved content may be irrelevant or contradictory, and essential knowledge may be excluded, exacerbating hallucination risks and degrading the fidelity of generated responses. To address these limitations, we introduce PankRAG, a framework that combines a globally aware, hierarchical query-resolution strategy with a novel dependency-aware reranking mechanism. PankRAG first constructs a multi-level resolution path that captures both parallel and sequential interdependencies within a query, guiding large language models (LLMs) through structured reasoning. It then applies its dependency-aware reranker to exploit the dependency structure among resolved sub-questions, enriching and validating retrieval results for subsequent sub-questions. Empirical evaluations demonstrate that PankRAG consistently outperforms state-of-the-art approaches across multiple benchmarks, underscoring its robustness and generalizability.

[120] History-Aware Cross-Attention Reinforcement: Self-Supervised Multi Turn and Chain-of-Thought Fine-Tuning with vLLM

Andrew Kiruluta,Andreas Lemos,Priscilla Burity

Main category: cs.CL

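TL;DR: CAGSR-vLLM-MTC extends the CAGSR framework on the vLLM runtime to support multi-turn dialogue and chain-of-thought reasoning, by asynchronously capturing attention weights and generalizing the reward function.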

Details Motivation: To address attention-guided optimization for multi-turn dialogue and chain-of-thought reasoning and improve model performance. Method: Per-layer, per-head attention weights are captured asynchronously inside the vLLM runtime, and the self-supervised reward function is generalized to accumulate attention signals over conversation history and reasoning steps. Result: An entropy-based attention clamping mechanism is proposed to prevent attention from collapsing prematurely onto early context. Conclusion: The paper outlines future research directions for multi-party dialogue and hierarchical reasoning. Abstract: We present CAGSR-vLLM-MTC, an extension of our Self-Supervised Cross-Attention-Guided Reinforcement (CAGSR) framework, now implemented on the high-performance vLLM runtime, to address both multi-turn dialogue and chain-of-thought reasoning. Building upon our original single-turn approach, we first instrumented vLLM's C++/CUDA kernels to asynchronously capture per-layer, per-head cross-attention weights during generation. We then generalized our self-supervised reward function to accumulate attention signals over entire conversation histories and intermediate chain-of-thought steps. We discuss practical trade-offs, including an entropy-based clamping mechanism to prevent attention collapse on early context, and outline future directions for multi-party dialogues and hierarchical reasoning.
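The abstract mentions an entropy-based clamping mechanism without describing it. As one possible reading, the sketch below enforces a minimum entropy on an attention distribution by blending it with the uniform distribution just enough to reach the floor. The function names, the blending rule, and the bisection search are assumptions, not the authors' mechanism.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)


def clamp_by_entropy(weights, min_entropy, steps=30):
    """Blend weights with the uniform distribution until entropy >= min_entropy.

    Assumes min_entropy <= log(len(weights)), the entropy of the uniform
    distribution. Entropy does not decrease as the blend moves toward
    uniform, so bisection on the mixing coefficient is valid.
    """
    n = len(weights)
    if entropy(weights) >= min_entropy:
        return list(weights)
    lo, hi = 0.0, 1.0  # hi always satisfies the entropy floor
    for _ in range(steps):
        mid = (lo + hi) / 2
        mixed = [(1 - mid) * w + mid / n for w in weights]
        if entropy(mixed) < min_entropy:
            lo = mid
        else:
            hi = mid
    return [(1 - hi) * w + hi / n for w in weights]


# Hypothetical attention that has collapsed onto the first context position.
collapsed = [0.97, 0.01, 0.01, 0.01]
clamped = clamp_by_entropy(collapsed, min_entropy=1.0)
```

The result is still a probability distribution, but the dominant early-context weight is reduced and the remaining positions regain some mass. Distributions that already meet the entropy floor are returned unchanged.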

[121] Enhancing Large Language Models for Mobility Analytics with Semantic Location Tokenization

Yile Chen,Yicheng Tao,Yue Jiang,Shuai Liu,Han Yu,Gao Cong

Main category: cs.CL

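TL;DR: QT-Mob is a novel framework that significantly improves LLM performance in mobility analytics by representing locations with semantically rich tokens and using multiple fine-tuning objectives.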

Details Motivation: Existing methods fall short in the semantic representation of locations and in modeling mobility signals, which limits the use of LLMs for mobility analytics. Method: QT-Mob introduces a location tokenization module and complementary fine-tuning objectives that strengthen the LLM's understanding of movement patterns and location semantics. Result: On three real-world datasets, QT-Mob outperforms existing methods on next-location prediction and mobility recovery. Conclusion: QT-Mob improves LLMs' ability to interpret mobility data and provides a more generalizable approach to mobility analytics tasks. Abstract: The widespread adoption of location-based services has led to the generation of vast amounts of mobility data, providing significant opportunities to model user movement dynamics within urban environments. Recent advancements have focused on adapting Large Language Models (LLMs) for mobility analytics. However, existing methods face two primary limitations: inadequate semantic representation of locations (i.e., discrete IDs) and insufficient modeling of mobility signals within LLMs (i.e., single templated instruction fine-tuning). To address these issues, we propose QT-Mob, a novel framework that significantly enhances LLMs for mobility analytics. QT-Mob introduces a location tokenization module that learns compact, semantically rich tokens to represent locations, preserving contextual information while ensuring compatibility with LLMs. Furthermore, QT-Mob incorporates a series of complementary fine-tuning objectives that align the learned tokens with the internal representations in LLMs, improving the model's comprehension of sequential movement patterns and location semantics. The proposed QT-Mob framework not only enhances LLMs' ability to interpret mobility data but also provides a more generalizable approach for various mobility analytics tasks. Experiments on three real-world datasets demonstrate superior performance in both next-location prediction and mobility recovery tasks, outperforming existing deep learning and LLM-based methods.

[122] AssertBench: A Benchmark for Evaluating Self-Assertion in Large Language Models

Jaeho Lee,Atharv Chowdhary

Main category: cs.CL

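TL;DR: AssertBench evaluates whether LLMs maintain factual consistency when users make opposing claims, by constructing two opposing framing prompts and recording the model's responses.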

Details Motivation: To study whether LLMs hold to a consistent factual judgement under opposing user claims rather than simply agreeing with the user. Method: Facts are sampled from the FEVEROUS dataset, opposing prompts are constructed (the user claims the statement is correct or incorrect), and the model's responses are recorded. Result: Stratifying the results isolates framing-induced variability and measures the model's ability to stick to the facts. Conclusion: AssertBench aims to quantify an LLM's ability to remain factually consistent when faced with contradictory user assertions. Abstract: Recent benchmarks have probed factual consistency and rhetorical robustness in Large Language Models (LLMs). However, a knowledge gap exists regarding how directional framing of factually true statements influences model agreement, a common scenario for LLM users. AssertBench addresses this by sampling evidence-supported facts from FEVEROUS, a fact verification dataset. For each (evidence-backed) fact, we construct two framing prompts: one where the user claims the statement is factually correct, and another where the user claims it is incorrect. We then record the model's agreement and reasoning. The desired outcome is that the model asserts itself, maintaining consistent truth evaluation across both framings, rather than switching its evaluation to agree with the user. AssertBench isolates framing-induced variability from the model's underlying factual knowledge by stratifying results based on the model's accuracy on the same claims when presented neutrally. In doing so, this benchmark aims to measure an LLM's ability to "stick to its guns" when presented with contradictory user assertions about the same fact. The complete source code is available at https://github.com/achowd32/assert-bench.
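The two-framing protocol can be sketched in a few lines. The prompt wording, the function names, and the boolean agreement signal are hypothetical and are not taken from the AssertBench repository. The consistency check relies only on logic: agreeing with "this is correct" implies the model judges the fact true, while agreeing with "this is incorrect" implies it judges the fact false.

```python
def framing_prompts(fact):
    """Build the two opposing user framings for one fact (wording is assumed)."""
    return {
        "user_says_correct": (
            f"I know that this statement is factually correct: {fact} Do you agree?"
        ),
        "user_says_incorrect": (
            f"I know that this statement is factually incorrect: {fact} Do you agree?"
        ),
    }


def is_self_consistent(agrees_when_told_correct, agrees_when_told_incorrect):
    """True when the model's implied truth judgement matches across framings.

    The judgement is consistent exactly when the model agrees with one
    framing and disagrees with the other.
    """
    return agrees_when_told_correct != agrees_when_told_incorrect


# Hypothetical fact and hypothetical model behaviour.
prompts = framing_prompts("The Eiffel Tower is located in Paris.")
sycophantic = is_self_consistent(True, True)   # agrees with both framings
assertive = is_self_consistent(True, False)    # holds the same judgement
```

A model that agrees with both framings is following the user rather than the evidence. Note that consistency alone does not imply correctness, which is presumably why the benchmark also stratifies by accuracy on neutrally presented claims.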

[123] Evaluating and Improving Robustness in Large Language Models: A Survey and Future Directions

Kun Zhang,Le Wu,Kui Yu,Guangyi Lv,Dacao Zhang

Main category: cs.CL

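TL;DR: This survey reviews research on the robustness of large language models (LLMs), systematically organizing it from three perspectives (adversarial robustness, out-of-distribution robustness, and evaluation methods), and provides related resources and future research directions.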

Details Motivation: As LLMs are applied across a wide range of domains, robustness has become a pressing issue: models need to generate correct and stable content in unexpected scenarios. Method: The survey defines LLM robustness and organizes its content from three perspectives: adversarial robustness, out-of-distribution robustness, and evaluation methods. Result: It summarizes representative work and provides related resources and an open-source project to support community research. Conclusion: Future research should further explore ways to improve LLM robustness and develop more comprehensive evaluation tools. Abstract: Large Language Models (LLMs) have gained enormous attention in recent years due to their capability of understanding and generating natural languages. With the rapid development and wide-ranging applications (e.g., Agents, Embodied Intelligence), the robustness of LLMs has received increased attention. As the core brain of many AI applications, the robustness of LLMs requires that models should not only generate consistent content, but also ensure the correctness and stability of generated content when dealing with unexpected application scenarios (e.g., toxic prompts, limited noise domain data, out-of-distribution (OOD) applications, etc.). In this survey paper, we conduct a thorough review of the robustness of LLMs, aiming to provide a comprehensive terminology of concepts and methods around this field and facilitate the community. Specifically, we first give a formal definition of LLM robustness and present the collection protocol of this survey paper. Then, based on the types of perturbed inputs, we organize this survey from the following perspectives: 1) Adversarial Robustness: tackling the problem that prompts are manipulated intentionally, such as noise prompts, long context, data attack, etc.; 2) OOD Robustness: dealing with the unexpected real-world application scenarios, such as OOD detection, zero-shot transferring, hallucinations, etc.; 3) Evaluation of Robustness: summarizing the new evaluation datasets, metrics, and tools for verifying the robustness of LLMs. After reviewing the representative work from each perspective, we discuss and highlight future opportunities and research directions in this field.
Meanwhile, we also organize related works and provide an easy-to-search project (https://github.com/zhangkunzk/Awesome-LLM-Robustness-papers) to support the community.

[124] Manifesto from Dagstuhl Perspectives Workshop 24352 -- Conversational Agents: A Framework for Evaluation (CAFE)

Christine Bauer,Li Chen,Nicola Ferro,Norbert Fuhr,Avishek Anand,Timo Breuer,Guglielmo Faggioli,Ophir Frieder,Hideo Joho,Jussi Karlgren,Johannes Kiesel,Bart P. Knijnenburg,Aldo Lipani,Lien Michiels,Andrea Papenmeier,Maria Soledad Pera,Mark Sanderson,Scott Sanner,Benno Stein,Johanne R. Trippas,Karin Verspoor,Martijn C Willemsen

Main category: cs.CL

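TL;DR: Discusses the definition of CONIAC and its unique features, proposes an abstract world model, and defines the CAFE framework for evaluating CONIAC systems.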

Details Motivation: To explore the concept of CONversational Information ACcess (CONIAC) and how to evaluate it, in order to improve system design and user experience. Method: The workshop proposes a world model that abstracts CONIAC and designs the CAFE framework, which comprises six major components for system evaluation. Result: The CAFE framework is defined, specifying six key components for evaluating CONIAC systems. Conclusion: CAFE provides a structured approach to evaluating CONIAC systems, supporting system optimization and user research. Abstract: During the workshop, we deeply discussed what CONversational Information ACcess (CONIAC) is and its unique features, proposing a world model abstracting it, and defined the Conversational Agents Framework for Evaluation (CAFE) for the evaluation of CONIAC systems, consisting of six major components: 1) goals of the system's stakeholders, 2) user tasks to be studied in the evaluation, 3) aspects of the users carrying out the tasks, 4) evaluation criteria to be considered, 5) evaluation methodology to be applied, and 6) measures for the quantitative criteria chosen.

[125] Breaking the Reviewer: Assessing the Vulnerability of Large Language Models in Automated Peer Review Under Textual Adversarial Attacks

Tzu-Ling Lin,Wei-Chih Chen,Teng-Fang Hsiao,Hou-I Liu,Ya-Hsin Yeh,Yu Kai Chan,Wen-Sheng Lien,Po-Yen Kuo,Philip S. Yu,Hong-Han Shuai

Main category: cs.CL

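TL;DR: The paper studies the robustness of large language models (LLMs) used as automated reviewers under adversarial attacks, reveals their vulnerabilities, and discusses mitigation strategies.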

Details Motivation: As submission volumes grow, the reviewing burden increases. LLMs may help, but their susceptibility to adversarial attacks raises reliability concerns. Method: The study focuses on three questions: how effective LLMs are at generating reviews, how adversarial attacks affect the reliability of LLM reviews, and what mitigation strategies exist. Result: The evaluation shows that LLMs are fragile under adversarial attacks, and text manipulation can distort their assessments. Conclusion: Adversarial risks must be addressed to ensure that AI strengthens rather than compromises the integrity of scholarly communication. Abstract: Peer review is essential for maintaining academic quality, but the increasing volume of submissions places a significant burden on reviewers. Large language models (LLMs) offer potential assistance in this process, yet their susceptibility to textual adversarial attacks raises reliability concerns. This paper investigates the robustness of LLMs used as automated reviewers in the presence of such attacks. We focus on three key questions: (1) The effectiveness of LLMs in generating reviews compared to human reviewers. (2) The impact of adversarial attacks on the reliability of LLM-generated reviews. (3) Challenges and potential mitigation strategies for LLM-based review. Our evaluation reveals significant vulnerabilities, as text manipulations can distort LLM assessments. We offer a comprehensive evaluation of LLM performance in automated peer reviewing and analyze its robustness against adversarial attacks. Our findings emphasize the importance of addressing adversarial risks to ensure AI strengthens, rather than compromises, the integrity of scholarly communication.

[126] KokushiMD-10: Benchmark for Evaluating Large Language Models on Ten Japanese National Healthcare Licensing Examinations

Junyu Liu,Kaiqi Yan,Tianyang Wang,Qian Niu,Momoko Nagai-Tanima,Tomoki Aoyama

Main category: cs.CL

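TL;DR: KokushiMD-10 is a multimodal benchmark built from Japanese national healthcare licensing exams for evaluating large language models (LLMs) on multilingual and multimodal clinical tasks.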

Details Motivation: Existing benchmarks are mostly text-based, English-centric, and limited to medicine, so they cannot fully assess the broad knowledge and multimodal reasoning required of healthcare AI. Method: The authors build the KokushiMD-10 benchmark, containing 11588 real exam questions across multiple healthcare fields, with clinical images and expert annotations. Result: More than 30 state-of-the-art LLMs (such as GPT-4o, Claude 3.5, and Gemini) were tested, and no model consistently passes across all domains. Conclusion: KokushiMD-10 provides a comprehensive resource for evaluating and advancing the reasoning of medical AI on multilingual and multimodal tasks, and highlights the ongoing challenges in medical AI. Abstract: Recent advances in large language models (LLMs) have demonstrated notable performance in medical licensing exams. However, comprehensive evaluation of LLMs across various healthcare roles, particularly in high-stakes clinical scenarios, remains a challenge. Existing benchmarks are typically text-based, English-centric, and focus primarily on medicine, which limits their ability to assess broader healthcare knowledge and multimodal reasoning. To address these gaps, we introduce KokushiMD-10, the first multimodal benchmark constructed from ten Japanese national healthcare licensing exams. This benchmark spans multiple fields, including Medicine, Dentistry, Nursing, Pharmacy, and allied health professions. It contains over 11588 real exam questions, incorporating clinical images and expert-annotated rationales to evaluate both textual and visual reasoning. We benchmark over 30 state-of-the-art LLMs, including GPT-4o, Claude 3.5, and Gemini, across both text and image-based settings. Despite promising results, no model consistently meets passing thresholds across domains, highlighting the ongoing challenges in medical AI. KokushiMD-10 provides a comprehensive and linguistically grounded resource for evaluating and advancing reasoning-centric medical AI across multilingual and multimodal clinical tasks.

[127] Incorporating Domain Knowledge into Materials Tokenization

Yerim Oh,Jun-Hyung Park,Junho Kim,SungHo Kim,SangKeun Lee

Main category: cs.CL

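TL;DR: MATTER is a new tokenization method that integrates materials knowledge to address the semantic loss and excessive fragmentation that conventional tokenization causes in materials science text.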

Details Motivation: Conventional tokenization methods cannot preserve structural and semantic integrity in materials science, leading to semantic loss and excessive fragmentation. Method: MATTER combines MatDetector, trained on a materials knowledge base, with a re-ranking method that prioritizes material concepts during token merging. Result: Experiments show that MATTER achieves average performance gains of 4% and 2% on generation and classification tasks, respectively. Conclusion: Domain knowledge is essential to tokenization strategies for scientific text processing. Abstract: While language models are increasingly utilized in materials science, typical models rely on frequency-centric tokenization methods originally developed for natural language processing. However, these methods frequently produce excessive fragmentation and semantic loss, failing to maintain the structural and semantic integrity of material concepts. To address this issue, we propose MATTER, a novel tokenization approach that integrates material knowledge into tokenization. Based on MatDetector trained on our materials knowledge base and a re-ranking method prioritizing material concepts in token merging, MATTER maintains the structural integrity of identified material concepts and prevents fragmentation during tokenization, ensuring their semantic meaning remains intact. The experimental results demonstrate that MATTER outperforms existing tokenization methods, achieving an average performance gain of $4\%$ and $2\%$ in the generation and classification tasks, respectively. These results underscore the importance of domain knowledge for tokenization strategies in scientific text processing. Our code is available at https://github.com/yerimoh/MATTER

[128] Infinity Instruct: Scaling Instruction Selection and Synthesis to Enhance Language Models

Jijie Li,Li Du,Hanyu Zhao,Bo-wen Zhang,Liangdong Wang,Boyan Gao,Guang Liu,Yonghua Lin

Main category: cs.CL

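TL;DR: Infinity-Instruct is a high-quality instruction dataset that improves both the foundational and chat capabilities of LLMs through a two-phase pipeline; models fine-tuned on it significantly surpass existing open-source models.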

Details Motivation: Existing open-source instruction datasets are confined to narrow domains, which limits LLM generalization and leaves a large gap with proprietary models. Method: A two-phase pipeline: 1) select 7.4M high-quality foundational instructions from over 100M samples; 2) synthesize 1.5M high-quality chat instructions through instruction selection, evolution, and filtering. Result: Validated on several open-source models with substantial performance gains; InfInstruct-LLaMA3.1-70B surpasses GPT-4-0314 by 8.6% on instruction-following tasks. Conclusion: Infinity-Instruct demonstrates the synergy between foundational and chat training and offers new insights for LLM development. Abstract: Large Language Models (LLMs) demonstrate strong performance in real-world applications, yet existing open-source instruction datasets often concentrate on narrow domains, such as mathematics or coding, limiting generalization and widening the gap with proprietary models. To bridge this gap, we introduce Infinity-Instruct, a high-quality instruction dataset designed to enhance both foundational and chat capabilities of LLMs through a two-phase pipeline. In Phase 1, we curate 7.4M high-quality foundational instructions (InfInstruct-F-7.4M) from over 100M samples using hybrid data selection techniques. In Phase 2, we synthesize 1.5M high-quality chat instructions (InfInstruct-G-1.5M) through a two-stage process involving instruction selection, evolution, and diagnostic filtering. We empirically evaluate Infinity-Instruct by fine-tuning several open-source models, including Mistral, LLaMA, Qwen, and Yi, and observe substantial performance gains across both foundational and instruction-following benchmarks, consistently surpassing official instruction-tuned counterparts. Notably, InfInstruct-LLaMA3.1-70B outperforms GPT-4-0314 by 8.6% on instruction-following tasks while achieving comparable foundational performance. These results underscore the synergy between foundational and chat training and offer new insights into holistic LLM development. Our dataset (https://huggingface.co/datasets/BAAI/Infinity-Instruct) and code (https://gitee.com/li-touch/infinity-instruct) have been publicly released.

[129] ScIRGen: Synthesize Realistic and Large-Scale RAG Dataset for Scientific Research

Junyong Lin,Lu Dai,Ruiqian Han,Yijie Sui,Ruilin Wang,Xingliang Sun,Qinglin Wu,Min Feng,Hao Liu,Hui Xiong

Main category: cs.CL

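TL;DR: ScIRGen is a dataset generation framework for scientific question answering and retrieval that aims to reflect researchers' information needs more accurately; it is used to create a large-scale scientific retrieval-augmented generation dataset.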

Details Motivation: Existing scientific retrieval and QA datasets usually cover simple questions that do not match real research needs, so a dataset closer to actual needs is required. Method: Design a dataset-oriented information extraction method based on academic papers, generate high-quality questions using a cognitive taxonomy, and automatically filter synthetic answers based on the perplexity shift of LLMs. Result: The ScIRGen-Geo dataset with 61k QA pairs was created; benchmarking existing methods on complex questions shows that their reasoning ability is still lacking. Conclusion: ScIRGen advances tools that support complex scientific information needs and gives the scientific community a more advanced resource. Abstract: Scientific researchers need intensive information about datasets to effectively evaluate and develop theories and methodologies. The information needs regarding datasets are implicitly embedded in particular research tasks, rather than explicitly expressed in search queries. However, existing scientific retrieval and question-answering (QA) datasets typically address straightforward questions, which do not align with the distribution of real-world research inquiries. To bridge this gap, we developed ScIRGen, a dataset generation framework for scientific QA & retrieval that more accurately reflects the information needs of professional science researchers, and used it to create a large-scale scientific retrieval-augmented generation (RAG) dataset with realistic queries, datasets and papers. Technically, we designed a dataset-oriented information extraction method that leverages academic papers to augment the dataset representation. We then proposed a question generation framework by employing cognitive taxonomy to ensure the quality of synthesized questions. We also designed a method to automatically filter synthetic answers based on the perplexity shift of LLMs, which is highly aligned with human judgment of answers' validity. Collectively, these methodologies culminated in the creation of the 61k QA dataset, ScIRGen-Geo. We benchmarked representative methods on the ScIRGen-Geo dataset for their question-answering and retrieval capabilities, finding that current methods still struggle to reason over complex questions. This work advances the development of more sophisticated tools to support the intricate information needs of the scientific community.
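The abstract above mentions filtering synthetic answers by the perplexity shift of LLMs but does not give the formula. The following is a minimal sketch under stated assumptions: the shift is taken to be the relative drop in the answer's perplexity once the supporting passage is added to the prompt, the per-token log-probabilities come from some LLM scoring call that is not shown, and the names `perplexity_shift`, `keep_answer`, and the `min_shift` threshold are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability) over the answer tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_shift(logprobs_without_ctx, logprobs_with_ctx):
    """Relative drop in answer perplexity once the source passage is supplied.

    Assumption: both lists score the SAME answer tokens, first with the
    question alone and then with question + supporting passage.
    """
    ppl_without = perplexity(logprobs_without_ctx)
    ppl_with = perplexity(logprobs_with_ctx)
    return (ppl_without - ppl_with) / ppl_without


def keep_answer(logprobs_without_ctx, logprobs_with_ctx, min_shift=0.2):
    """Keep an answer only if the passage makes it clearly more predictable.

    min_shift is a hypothetical threshold, not a value from the paper.
    """
    return perplexity_shift(logprobs_without_ctx, logprobs_with_ctx) >= min_shift
```

Under this assumption, an answer whose perplexity does not move when the passage is added is treated as unsupported by the passage and dropped.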

Jingyu Li,Lingchao Mao,Hairong Wang,Zhendong Wang,Xi Mao,Xuelei Sherry Ni

Main category: cs.CL

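TL;DR: The study uses foundation models for early detection of Alzheimer's disease and related dementias (ADRD), analyzing spontaneous speech with speech and language models, and finds that acoustic-based approaches (such as ASR-derived embeddings) perform best.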

Details Motivation: Early detection of ADRD is vital for timely intervention, and acoustic and linguistic markers in spontaneous speech can serve as non-invasive biomarkers. Method: Use the PREPARE Challenge dataset, which contains recordings from more than 1,600 participants; after filtering, cases are grouped into healthy control (HC), mild cognitive impairment (MCI), and Alzheimer's disease (AD). A range of open-source speech and language models are benchmarked for classification. Result: Whisper-medium performed best among speech models (accuracy 0.731, AUC 0.802), and BERT with pause annotation performed best among language models (accuracy 0.662, AUC 0.744). ASR-derived audio embeddings were notably effective. Conclusion: The study proposes a benchmarking framework based on foundation models; acoustic approaches such as ASR embeddings show potential for scalable, non-invasive, and cost-effective early detection of ADRD. Abstract: Background: Alzheimer's disease and related dementias (ADRD) are progressive neurodegenerative conditions where early detection is vital for timely intervention and care. Spontaneous speech contains rich acoustic and linguistic markers that may serve as non-invasive biomarkers for cognitive decline. Foundation models, pre-trained on large-scale audio or text data, produce high-dimensional embeddings encoding contextual and acoustic features. Methods: We used the PREPARE Challenge dataset, which includes audio recordings from over 1,600 participants with three cognitive statuses: healthy control (HC), mild cognitive impairment (MCI), and Alzheimer's Disease (AD). We excluded non-English, non-spontaneous, or poor-quality recordings. The final dataset included 703 (59.13%) HC, 81 (6.81%) MCI, and 405 (34.06%) AD cases. We benchmarked a range of open-source foundation speech and language models to classify cognitive status into the three categories. Results: The Whisper-medium model achieved the highest performance among speech models (accuracy = 0.731, AUC = 0.802). Among language models, BERT with pause annotation performed best (accuracy = 0.662, AUC = 0.744). ADRD detection using state-of-the-art automatic speech recognition (ASR) model-generated audio embeddings outperformed others. Including non-semantic features like pause patterns consistently improved text-based classification. Conclusion: This study introduces a benchmarking framework using foundation models and a clinically relevant dataset. Acoustic-based approaches -- particularly ASR-derived embeddings -- demonstrate strong potential for scalable, non-invasive, and cost-effective early detection of ADRD.
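As a quick sanity check on the class distribution quoted in the abstract, the short sketch below recomputes the percentages from the reported counts. It also derives the accuracy of a trivial majority-class predictor, which is not stated in the source but follows directly from the counts and gives context for the reported accuracies of 0.731 and 0.662.

```python
# Case counts as reported in the abstract (after exclusions).
counts = {"HC": 703, "MCI": 81, "AD": 405}

total = sum(counts.values())

# Share of each class in percent, rounded to two decimals as in the abstract.
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}

# Accuracy of always predicting the most frequent class (HC).
majority_baseline = max(counts.values()) / total

print(total, shares, round(majority_baseline, 3))
```

The dataset is heavily imbalanced (MCI is under 7% of cases), so accuracy alone can look better than it is; this is presumably one reason the abstract reports AUC alongside it.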

[131] SDMPrune: Self-Distillation MLP Pruning for Efficient Large Language Models

Hourun Zhu,Chengchao Shen

Main category: cs.CL

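TL;DR: The paper introduces a self-distillation loss during the pruning phase so that pruning makes fuller use of the original model's predictions and better preserves the LLM's generative ability. It also focuses pruning on MLP modules to compress the model substantially without obvious performance loss.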

Details Motivation: LLMs perform strongly but are expensive to deploy. Existing gradient-based pruning methods use one-hot labels and therefore ignore the predictions on other words, which degrades generative ability. Method: Introduce a self-distillation loss during the pruning phase to fully exploit the original model's predictions, and focus pruning on MLP modules to reduce parameters. Result: Experiments show the method significantly outperforms existing pruning methods on zero-shot benchmarks and is very competitive among 1B-scale open-source LLMs. Conclusion: Combining a self-distillation loss with MLP pruning compresses LLMs effectively while maintaining performance, and has practical potential. Abstract: In spite of the strong performance achieved by LLMs, the costs of their deployment are unaffordable. For the compression of LLMs, gradient-based pruning methods present promising effectiveness. However, in these methods, the gradient computation with one-hot labels ignores the potential predictions on other words, thus missing key information for generative capability of the original model. To address this issue, we introduce a self-distillation loss during the pruning phase (rather than post-training) to fully exploit the predictions of the original model, thereby obtaining more accurate gradient information for pruning. Moreover, we find that, compared to attention modules, the predictions of LLM are less sensitive to multilayer perceptron (MLP) modules, which take up more than $5 \times$ parameters (LLaMA3.2-1.2B). To this end, we focus on the pruning of MLP modules, to significantly compress LLM without obvious performance degradation. Experimental results on extensive zero-shot benchmarks demonstrate that our method significantly outperforms existing pruning methods. Furthermore, our method achieves very competitive performance among 1B-scale open source LLMs. The source code and trained weights are available at https://github.com/visresearch/SDMPrune.
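The abstract's central argument is that a one-hot label only looks at the probability of the target word, whereas a self-distillation loss compares the whole predicted distribution with the original model's. The sketch below illustrates that difference on toy logits. It assumes a standard KL-divergence distillation loss, which the abstract does not spell out, and the function names are illustrative rather than taken from the SDMPrune code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def one_hot_cross_entropy(student_logits, target_index):
    """Cross-entropy against a one-hot label: only the target word matters."""
    return -math.log(softmax(student_logits)[target_index])


def self_distillation_kl(teacher_logits, student_logits):
    """KL(teacher || student): every word in the vocabulary contributes.

    Assumption: the teacher is the original (unpruned) model and the student
    is the same model during pruning.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [2.0, 1.0, 0.0]    # original model's logits over a 3-word vocabulary
student_a = [2.0, 1.0, 0.0]  # matches the teacher exactly
student_b = [2.0, 0.0, 1.0]  # same target probability, other words swapped
```

With target index 0, both students receive the same one-hot cross-entropy, yet only `student_b` has a non-zero distillation loss, so gradients from the distillation term carry information that the one-hot term cannot.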

[132] SUTA-LM: Bridging Test-Time Adaptation and Language Model Rescoring for Robust ASR

Wei-Ping Huang,Guan-Ting Lin,Hung-yi Lee

Main category: cs.CL

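TL;DR: SUTA-LM is a test-time adaptation method combined with language model rescoring; it addresses the interference between TTA and language models seen in earlier approaches and is robust across 18 ASR datasets.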

Details Motivation: Despite progress in end-to-end ASR, real-world domain mismatch still degrades performance. TTA aims to mitigate this by adjusting the model at inference time, but combining TTA with language model rescoring suffers from interference. Method: Propose SUTA-LM, which extends the entropy-minimization-based TTA method SUTA with language model rescoring. An auto-step selection mechanism controls the adaptation process, and both acoustic and linguistic information are used to refine the output. Result: On 18 diverse ASR datasets, SUTA-LM is robust across a wide range of domains. Conclusion: SUTA-LM effectively addresses the challenge of combining TTA with language models and offers a robust solution to domain mismatch. Abstract: Despite progress in end-to-end ASR, real-world domain mismatches still cause performance drops, which Test-Time Adaptation (TTA) aims to mitigate by adjusting models during inference. Recent work explores combining TTA with external language models, using techniques like beam search rescoring or generative error correction. In this work, we identify a previously overlooked challenge: TTA can interfere with language model rescoring, revealing the nontrivial nature of effectively combining the two methods. Based on this insight, we propose SUTA-LM, a simple yet effective extension of SUTA, an entropy-minimization-based TTA approach, with language model rescoring. SUTA-LM first applies a controlled adaptation process guided by an auto-step selection mechanism leveraging both acoustic and linguistic information, followed by language model rescoring to refine the outputs. Experiments on 18 diverse ASR datasets show that SUTA-LM achieves robust results across a wide range of domains.
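Entropy minimization, the basis of SUTA according to the abstract, adapts a model at test time by making its own output distributions more confident. The sketch below only computes the quantity being minimized, the mean per-frame entropy of the output distributions; the gradient update, the auto-step selection mechanism, and the language model rescoring of SUTA-LM are not reproduced, and the function name is illustrative.

```python
import math


def mean_frame_entropy(frame_probs):
    """Average Shannon entropy (in nats) over per-frame output distributions.

    frame_probs: list of probability vectors, one per acoustic frame.
    Test-time adaptation by entropy minimization would take gradient steps
    that lower this value on the unlabeled test utterance.
    """
    total = 0.0
    for probs in frame_probs:
        total += -sum(p * math.log(p) for p in probs if p > 0)
    return total / len(frame_probs)


uncertain = [[0.25, 0.25, 0.25, 0.25]]   # model has no preference
confident = [[0.97, 0.01, 0.01, 0.01]]   # model strongly prefers one token
```

Because the objective rewards confidence regardless of correctness, too many adaptation steps can sharpen wrong predictions; controlling the number of steps is the role the abstract assigns to auto-step selection.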

[133] ASRJam: Human-Friendly AI Speech Jamming to Prevent Automated Phone Scams

Freddie Grabovski,Gilad Gressel,Yisroel Mirsky

Main category: cs.CL

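TL;DR: The paper proposes the ASRJam framework and the EchoGuard jammer, which defend against voice phishing by disrupting ASR transcription while preserving the call experience for human listeners.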

Details Motivation: Voice phishing attacks are being scaled up with LLM, TTS, and ASR technology and pose a security threat, so effective defences are needed. Method: Propose the ASRJam framework, which injects adversarial perturbations, and design EchoGuard, which uses natural distortions to disrupt ASR. Result: A 39-person user study shows that EchoGuard offers the best combination of ASR disruption and human listening experience. Conclusion: EchoGuard is a practical and efficient defence that balances security and user experience. Abstract: Large Language Models (LLMs), combined with Text-to-Speech (TTS) and Automatic Speech Recognition (ASR), are increasingly used to automate voice phishing (vishing) scams. These systems are scalable and convincing, posing a significant security threat. We identify the ASR transcription step as the most vulnerable link in the scam pipeline and introduce ASRJam, a proactive defence framework that injects adversarial perturbations into the victim's audio to disrupt the attacker's ASR. This breaks the scam's feedback loop without affecting human callers, who can still understand the conversation. While prior adversarial audio techniques are often unpleasant and impractical for real-time use, we also propose EchoGuard, a novel jammer that leverages natural distortions, such as reverberation and echo, that are disruptive to ASR but tolerable to humans. To evaluate EchoGuard's effectiveness and usability, we conducted a 39-person user study comparing it with three state-of-the-art attacks. Results show that EchoGuard achieved the highest overall utility, offering the best combination of ASR disruption and human listening experience.
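To make the idea of a "natural distortion" concrete, the sketch below adds a single delayed, attenuated copy of a signal to itself, the textbook form of an echo: y[n] = x[n] + decay * x[n - delay]. This is only an illustration of the kind of distortion the abstract names; how EchoGuard chooses or optimizes its distortion parameters is not described there, and `add_echo`, `delay`, and `decay` are illustrative names.

```python
def add_echo(samples, delay, decay):
    """Return samples plus one echo: y[n] = x[n] + decay * x[n - delay].

    samples: list of float audio samples
    delay:   echo delay in samples (for example 0.1 s * sample_rate)
    decay:   echo gain, typically between 0 and 1
    The output has the same length as the input, so the echo tail is cut off.
    """
    out = list(samples)
    for n in range(delay, len(samples)):
        out[n] += decay * samples[n - delay]
    return out


impulse = [1.0, 0.0, 0.0, 0.0]
echoed = add_echo(impulse, delay=2, decay=0.5)  # echo appears 2 samples later
```

Human listeners are used to rooms that echo, which is the intuition behind the abstract's claim that such distortions are tolerable to people while still disrupting ASR.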

[134] GUIRoboTron-Speech: Towards Automated GUI Agents Based on Speech Instructions

Wenkang Han,Zhixiong Zeng,Jing Huang,Shu Jiang,Liming Zheng,Longrong Yang,Haibo Qiu,Chang Yao,Jingyuan Chen,Lin Ma

Main category: cs.CL

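TL;DR: GUIRoboTron-Speech is an end-to-end autonomous GUI agent that takes speech instructions and screenshots directly to predict actions, addressing the limitations of text-only instructions.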

Details Motivation: Existing GUI agents depend on text instructions, which limits accessibility and convenience in hands-free scenarios. Method: Generate high-quality speech instructions with a random-timbre TTS model, then improve performance through progressive training stages and a heuristic mixed-instruction training strategy. Result: Experiments on several benchmark datasets confirm robust and superior performance and demonstrate the potential of speech as an instruction modality. Conclusion: GUIRoboTron-Speech shows that speech-driven GUI agents have broad applicability. Abstract: Autonomous agents for Graphical User Interfaces (GUIs) are revolutionizing human-computer interaction, yet their reliance on text-based instructions imposes limitations on accessibility and convenience, particularly in hands-free scenarios. To address this gap, we propose GUIRoboTron-Speech, the first end-to-end autonomous GUI agent that directly accepts speech instructions and on-device screenshots to predict actions. Confronted with the scarcity of speech-based GUI agent datasets, we initially generated high-quality speech instructions for training by leveraging a random timbre text-to-speech (TTS) model to convert existing text instructions. We then develop GUIRoboTron-Speech's capabilities through progressive grounding and planning training stages. A key contribution is a heuristic mixed-instruction training strategy designed to mitigate the modality imbalance inherent in pre-trained foundation models. Comprehensive experiments on several benchmark datasets validate the robust and superior performance of GUIRoboTron-Speech, demonstrating the significant potential and widespread applicability of speech as an effective instruction modality for driving GUI agents. Our code and datasets are available at https://github.com/GUIRoboTron/GUIRoboTron-Speech.

[135] Stronger Language Models Produce More Human-Like Errors

Andrew Keenan Richardson,Ryan Othniel Kearns,Sean Moss,Vincent Wang-Mascianica,Philipp Koralus

Main category: cs.CL

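TL;DR: The study finds that as language models become more capable, their error patterns increasingly resemble common fallacies in human reasoning rather than converging on normative rationality.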

Details Motivation: Investigate whether language models converge toward human reasoning patterns, especially error patterns, as their capability improves. Method: Use the Erotetic Theory of Reasoning (ETR) framework and the PyETR package to generate logical reasoning problems, and evaluate 38 language models on 383 tasks. Result: As model capability increases, a larger share of incorrect answers matches the human fallacies predicted by ETR (ρ=0.360, p=0.0265), while logical correctness is uncorrelated with model capability. Conclusion: Language model errors converge toward the biases and limitations of human reasoning, challenging the view that scaling naturally yields normative rationality. Abstract: Do language models converge toward human-like reasoning patterns as they improve? We provide surprising evidence that while overall reasoning capabilities increase with model sophistication, the nature of errors increasingly mirrors predictable human reasoning fallacies: a previously unobserved inverse scaling phenomenon. To investigate this question, we apply the Erotetic Theory of Reasoning (ETR), a formal cognitive framework with empirical support for predicting human reasoning outcomes. Using the open-source package PyETR, we generate logical reasoning problems where humans predictably err, evaluating responses from 38 language models across 383 reasoning tasks. Our analysis indicates that as models advance in general capability (as measured by Chatbot Arena scores), the proportion of their incorrect answers that align with ETR-predicted human fallacies tends to increase ($\rho = 0.360, p = 0.0265$). Notably, as we observe no correlation between model sophistication and logical correctness on these tasks, this shift in error patterns toward human-likeness occurs independently of error rate. These findings challenge the prevailing view that scaling language models naturally obtains normative rationality, suggesting instead a convergence toward human-like cognition inclusive of our characteristic biases and limitations, as we further confirm by demonstrating order-effects in language model reasoning.
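The reported statistic is written with $\rho$, the usual symbol for Spearman's rank correlation, which measures whether one quantity tends to rise as another rises without assuming a linear relationship. The abstract does not name the test, so reading $\rho$ as Spearman's coefficient is an assumption. The sketch below computes it in plain Python as the Pearson correlation of average ranks; the numbers in the example are made-up illustration values, not data from the paper.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical illustration values: capability scores vs. share of
# fallacy-aligned errors for five imaginary models.
capability = [1000, 1050, 1100, 1200, 1250]
fallacy_share = [0.30, 0.28, 0.35, 0.33, 0.41]
rho = spearman_rho(capability, fallacy_share)
```

A value of 0.360 indicates a moderate monotonic association, so the trend in the abstract is a tendency across the 38 models rather than a strict rule.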

[136] Trustworthy AI for Medicine: Continuous Hallucination Detection and Elimination with CHECK

Carlos Garcia-Fernandez,Luis Felipe,Monique Shotande,Muntasir Zitu,Aakash Tripathi,Ghulam Rasool,Issam El Naqa,Vivek Rudrapatna,Gilmer Valdes

Main category: cs.CL

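TL;DR: The CHECK framework combines clinical databases with an information-theoretic classifier to sharply reduce LLM hallucination rates in medicine and improve performance.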

Details Motivation: Address LLM hallucinations in medical applications to ensure safety and reliability. Method: CHECK integrates structured clinical databases with an information-theoretic classifier to detect factual and reasoning-based hallucinations. Result: The hallucination rate of LLama3.3-70B-Instruct drops from 31% to 0.3%, and the classifier performs strongly on several medical benchmarks. Conclusion: CHECK offers a scalable approach to safe LLM deployment in medicine and other high-stakes domains. Abstract: Large language models (LLMs) show promise in healthcare, but hallucinations remain a major barrier to clinical use. We present CHECK, a continuous-learning framework that integrates structured clinical databases with a classifier grounded in information theory to detect both factual and reasoning-based hallucinations. Evaluated on 1500 questions from 100 pivotal clinical trials, CHECK reduced LLama3.3-70B-Instruct hallucination rates from 31% to 0.3% - making an open source model state of the art. Its classifier generalized across medical benchmarks, achieving AUCs of 0.95-0.96, including on the MedQA (USMLE) benchmark and HealthBench realistic multi-turn medical questioning. By leveraging hallucination probabilities to guide GPT-4o's refinement and judiciously escalate compute, CHECK boosted its USMLE passing rate by 5 percentage points, achieving a state-of-the-art 92.1%. By suppressing hallucinations below accepted clinical error thresholds, CHECK offers a scalable foundation for safe LLM deployment in medicine and other high-stakes domains.

[137] A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data

Cheng Kang Chou,Chan-Jan Hsu,Ho-Lam Chung,Liang-Hsuan Tseng,Hsi-Chun Cheng,Yu-Kuan Fu,Kuan Po Huang,Hung-Yi Lee

Main category: cs.CL

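TL;DR: Proposes a self-refining framework that improves ASR performance using unlabeled data, closing the loop with pseudo-labels and a TTS system and substantially reducing error rates.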

Details Motivation: Improve ASR in low-resource or domain-specific settings without relying on large amounts of labeled data. Method: Use an existing ASR model to generate pseudo-labels, train a high-fidelity TTS system on them, synthesize speech-text pairs, and feed them back into the ASR system to form a closed optimization loop. Result: On Taiwanese Mandarin, error rates fall by up to 20% on Mandarin and up to 50% on Mandarin-English code-switching benchmarks compared with Whisper. Conclusion: The framework is an efficient self-refining method for low-resource or domain-specific ASR and a compelling alternative to pseudo-labeling self-distillation. Abstract: We propose a self-refining framework that enhances ASR performance with only unlabeled datasets. The process starts with an existing ASR model generating pseudo-labels on unannotated speech, which are then used to train a high-fidelity text-to-speech (TTS) system. Then, synthesized speech-text pairs are bootstrapped into the original ASR system, completing the closed-loop self-improvement cycle. We demonstrated the effectiveness of the framework on Taiwanese Mandarin speech. Leveraging 6,000 hours of unlabeled speech, a moderate amount of text data, and synthetic content from the AI models, we adapt Whisper-large-v2 into a specialized model, Twister. Twister reduces error rates by up to 20% on Mandarin and 50% on Mandarin-English code-switching benchmarks compared to Whisper. Results highlight the framework as a compelling alternative to pseudo-labeling self-distillation approaches and provide a practical pathway for improving ASR performance in low-resource or domain-specific settings.

[138] Large Language Models and Emergence: A Complex Systems Perspective

David C. Krakauer,John W. Krakauer,Melanie Mitchell

Main category: cs.CL

TL;DR: The paper discusses emergence and how the concept applies to large language models (LLMs), asking whether LLMs possess emergent intelligence.

Details Motivation: Study the role of emergence in complex systems and ask whether LLMs exhibit emergent capabilities and emergent intelligence. Method: Review approaches to quantifying emergence and analyze LLMs' emergent capabilities and intelligence in that light. Result: The paper summarizes the emergent capabilities LLMs may have and discusses whether they amount to emergent intelligence. Conclusion: LLMs show some emergent capabilities, but whether this reaches emergent intelligence requires further study. Abstract: Emergence is a concept in complexity science that describes how many-body systems manifest novel higher-level properties, properties that can be described by replacing high-dimensional mechanisms with lower-dimensional effective variables and theories. This is captured by the idea "more is different". Intelligence is a consummate emergent property manifesting increasingly efficient -- cheaper and faster -- uses of emergent capabilities to solve problems. This is captured by the idea "less is more". In this paper, we first examine claims that Large Language Models exhibit emergent capabilities, reviewing several approaches to quantifying emergence, and secondly ask whether LLMs possess emergent intelligence.

[139] Scalable Medication Extraction and Discontinuation Identification from Electronic Health Records Using Large Language Models

Chong Shao,Douglas Snyder,Chiran Li,Bowen Gu,Kerry Ngan,Chun-Ting Yang,Jiageng Wu,Richard Wyss,Kueiyu Joshua Lin,Jie Yang

Main category: cs.CL

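TL;DR: The study evaluates open-source and proprietary large language models (LLMs) on extracting medications from electronic health records (EHRs) and classifying their status; GPT-4o performs best in the zero-shot setting, and open-source models such as Llama-3.1-70B-Instruct also perform well.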

Details Motivation: Identifying medication discontinuation in EHRs is vital for patient safety, but the information is often buried in unstructured notes, so efficient and scalable automatic extraction is needed. Method: Collect three EHR datasets, evaluate 12 LLMs, explore multiple prompting strategies, and compare performance on medication extraction, status classification, and the joint task. Result: GPT-4o performs best in the zero-shot setting (F1 = 94.0% for medication extraction, 78.1% for discontinuation classification, 72.7% for the joint task), with open-source models close behind. Medical-specific LLMs underperform general-domain models. Conclusion: LLMs have strong potential for EHR medication extraction and discontinuation classification; open-source models can substitute for proprietary systems, and few-shot learning further improves performance. Abstract: Identifying medication discontinuations in electronic health records (EHRs) is vital for patient safety but is often hindered by information being buried in unstructured notes. This study aims to evaluate the capabilities of advanced open-sourced and proprietary large language models (LLMs) in extracting medications and classifying their medication status from EHR notes, focusing on their scalability on medication information extraction without human annotation. We collected three EHR datasets from diverse sources to build the evaluation benchmark. We evaluated 12 advanced LLMs and explored multiple LLM prompting strategies. Performance on medication extraction, medication status classification, and their joint task (extraction then classification) was systematically compared across all experiments. We found that LLMs showed promising performance on the medication extraction and discontinuation classification from EHR notes. GPT-4o consistently achieved the highest average F1 scores in all tasks under zero-shot setting - 94.0% for medication extraction, 78.1% for discontinuation classification, and 72.7% for the joint task. Open-sourced models followed closely: Llama-3.1-70B-Instruct achieved the highest performance in medication status classification on the MIV-Med dataset (68.7%) and in the joint task on both the Re-CASI (76.2%) and MIV-Med (60.2%) datasets. Medical-specific LLMs demonstrated lower performance compared to advanced general-domain LLMs. Few-shot learning generally improved performance, while CoT reasoning showed inconsistent gains. LLMs demonstrate strong potential for medication extraction and discontinuation identification on EHR notes, with open-sourced models offering scalable alternatives to proprietary systems and few-shot learning further improving LLMs' capability.

[140] RETUYT-INCO at BEA 2025 Shared Task: How Far Can Lightweight Models Go in AI-powered Tutor Evaluation?

Santiago Góngora,Ignacio Sastre,Santiago Robaina,Ignacio Remersaro,Luis Chiruzzo,Aiala Rosá

Main category: cs.CL

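TL;DR: The RETUYT-INCO team used small models with fewer than 1B parameters in the BEA 2025 shared task, showing that competitive results are still possible with limited compute.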

Details Motivation: Mimic the conditions of research institutions in the Global South, where compute is prohibitively expensive, and test whether small models are viable in a shared task. Method: Participate in the BEA 2025 shared task using only models with fewer than 1B parameters, limiting compute requirements. Result: Across the five tracks, the gap to the winners in $exact\ F_1$ score ranges from 6.46 to 13.13, which the authors consider competitive. Conclusion: Small models (<1B parameters) can handle these tasks on low-budget GPUs or even without a GPU, making them suitable for resource-constrained settings. Abstract: In this paper, we present the RETUYT-INCO participation at the BEA 2025 shared task. Our participation was characterized by the decision of using relatively small models, with fewer than 1B parameters. This self-imposed restriction tries to represent the conditions in which many research labs or institutions are in the Global South, where computational power is not easily accessible due to its prohibitive cost. Even under this restrictive self-imposed setting, our models managed to stay competitive with the rest of the teams that participated in the shared task. According to the $exact\ F_1$ scores published by the organizers, the performance gaps between our models and the winners were as follows: $6.46$ in Track 1; $10.24$ in Track 2; $7.85$ in Track 3; $9.56$ in Track 4; and $13.13$ in Track 5. Considering that the minimum difference with a winning team is $6.46$ points -- and the maximum difference is $13.13$ -- according to the $exact\ F_1$ score, we find that models with a size smaller than 1B parameters are competitive for these tasks, all of which can be run on computers with a low-budget GPU or even without a GPU.

[141] Iterative Multilingual Spectral Attribute Erasure

Shun Shao,Yftah Ziser,Zheng Zhao,Yifu Qiu,Shay B. Cohen,Anna Korhonen

Main category: cs.CL

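TL;DR: IMSAE identifies and mitigates joint bias subspaces across multiple languages through iterative SVD-based truncation, outperforming traditional monolingual and cross-lingual methods.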

Details Motivation: Existing debiasing methods cannot exploit the shared semantic space of multilingual representations, which limits the transfer of debiasing effects across languages. Method: Propose IMSAE, which identifies and mitigates joint bias subspaces across multiple languages through iterative SVD-based truncation. Result: IMSAE is shown to be effective across eight languages and five demographic dimensions, particularly in zero-shot settings. Conclusion: IMSAE outperforms traditional methods on multilingual debiasing while maintaining model utility. Abstract: Multilingual representations embed words with similar meanings to share a common semantic space across languages, creating opportunities to transfer debiasing effects between languages. However, existing methods for debiasing are unable to exploit this opportunity because they operate on individual languages. We present Iterative Multilingual Spectral Attribute Erasure (IMSAE), which identifies and mitigates joint bias subspaces across multiple languages through iterative SVD-based truncation. Evaluating IMSAE across eight languages and five demographic dimensions, we demonstrate its effectiveness in both standard and zero-shot settings, where target language data is unavailable, but linguistically similar languages can be used for debiasing. Our comprehensive experiments across diverse language models (BERT, LLaMA, Mistral) show that IMSAE outperforms traditional monolingual and cross-lingual approaches while maintaining model utility.

[142] No Universal Prompt: Unifying Reasoning through Adaptive Prompting for Temporal Table Reasoning

Kushagra Dixit,Abhishek Rajgaria,Harshavardhan Kalalbandi,Dan Roth,Vivek Gupta

Main category: cs.CL

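TL;DR: This paper studies how large language models (LLMs) perform on temporal table reasoning, finds that existing prompting methods give inconsistent results, and proposes SEAR, an adaptive framework that significantly outperforms baseline methods.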

Details Motivation: Temporal table reasoning is an important challenge for LLMs; existing prompting methods vary widely in effectiveness and have not been studied systematically, so the best approach is unclear. Method: Study multiple prompting techniques across different table types, and propose SEAR, an adaptive framework that adjusts its prompting strategy dynamically and integrates structured reasoning. Result: SEAR performs best across all table types, and refactoring table structure further improves reasoning. Conclusion: SEAR significantly improves LLM table reasoning, and a unified table representation helps model performance. Abstract: Temporal Table Reasoning is a critical challenge for Large Language Models (LLMs), requiring effective prompting techniques to extract relevant insights. Despite the existence of multiple prompting methods, their impact on table reasoning remains largely unexplored. Furthermore, the performance of these models varies drastically across different table and context structures, making it difficult to determine an optimal approach. This work investigates multiple prompting techniques across diverse table types to determine optimal approaches for different scenarios. We find that performance varies based on entity type, table structure, requirement of additional context and question complexity, with no single method consistently outperforming others. To mitigate these challenges, we introduce SEAR, an adaptive prompting framework inspired by human reasoning that dynamically adjusts based on context characteristics and integrates structured reasoning. Our results demonstrate that SEAR achieves superior performance across all table types compared to other baseline prompting techniques. Additionally, we explore the impact of table structure refactoring, finding that a unified representation enhances the model's reasoning.

[143] Learning a Continue-Thinking Token for Enhanced Test-Time Scaling

Liran Ringel,Elad Tolochinsky,Yaniv Romano

Main category: cs.CL

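TL;DR: The study examines extending a language model's reasoning by learning a dedicated continue-thinking token ("<|continue-thinking|>"); the learned token outperforms fixed-token approaches (such as "Wait") on math benchmarks.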

Details Motivation: Explore whether a dedicated learned token, rather than a fixed one, can trigger longer reasoning and thereby improve language model performance. Method: Add a single learned "<|continue-thinking|>" token to a distilled version of DeepSeek-R1, training only its embedding with reinforcement learning while the model weights stay frozen. Result: On math benchmarks such as GSM8K, the learned token clearly outperforms both the baseline model and the fixed-token approach (for example, a 4.2% versus 1.3% absolute improvement). Conclusion: Learning a dedicated continue-thinking token is an effective test-time scaling method that markedly improves reasoning. Abstract: Test-time scaling has emerged as an effective approach for improving language model performance by utilizing additional compute at inference time. Recent studies have shown that overriding end-of-thinking tokens (e.g., replacing "</think>" with "Wait") can extend reasoning steps and improve accuracy. In this work, we explore whether a dedicated continue-thinking token can be learned to trigger extended reasoning. We augment a distilled version of DeepSeek-R1 with a single learned "<|continue-thinking|>" token, training only its embedding via reinforcement learning while keeping the model weights frozen. Our experiments show that this learned token achieves improved accuracy on standard math benchmarks compared to both the baseline model and a test-time scaling approach that uses a fixed token (e.g., "Wait") for budget forcing. In particular, we observe that in cases where the fixed-token approach enhances the base model's accuracy, our method achieves a markedly greater improvement. For example, on the GSM8K benchmark, the fixed-token approach yields a 1.3% absolute improvement in accuracy, whereas our learned-token method achieves a 4.2% improvement over the base model that does not use budget forcing.
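Budget forcing, as the abstract describes it, intercepts the model's attempt to stop reasoning and substitutes a token that makes it continue. The sketch below shows only that decoding-time control flow. Everything model-related is a stand-in: `toy_model` is a hypothetical scripted "model", the `"</think>"` string is an assumed end-of-thinking marker, and the reinforcement-learning step that trains the token's embedding is not shown.

```python
END_THINK = "</think>"               # assumed end-of-thinking marker
CONTINUE = "<|continue-thinking|>"   # token name taken from the abstract


def generate_with_budget_forcing(next_token, min_thinking_tokens,
                                 max_tokens=64, continue_token=CONTINUE):
    """Decode until the model ends its reasoning, forcing a minimum budget.

    next_token(tokens) -> str is a hypothetical one-step decoding function.
    If the model tries to stop before min_thinking_tokens have been produced,
    its end-of-thinking token is replaced with continue_token.
    """
    tokens = []
    while len(tokens) < max_tokens:
        tok = next_token(tokens)
        if tok == END_THINK:
            if len(tokens) < min_thinking_tokens:
                tokens.append(continue_token)  # suppress the stop, keep thinking
                continue
            tokens.append(tok)
            break
        tokens.append(tok)
    return tokens


def toy_model(tokens):
    """Scripted stand-in: tries to stop after every two reasoning steps."""
    steps_since_marker = 0
    for t in reversed(tokens):
        if t != "step":
            break
        steps_since_marker += 1
    return END_THINK if steps_since_marker >= 2 else "step"


trace = generate_with_budget_forcing(toy_model, min_thinking_tokens=4)
```

Passing `continue_token="Wait"` gives the fixed-token baseline the abstract compares against; the paper's contribution is that the substituted token's embedding is learned rather than borrowed from an existing word.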

[144] Beyond Random Sampling: Efficient Language Model Pretraining via Curriculum Learning

Yang Zhang,Amr Mohamed,Hadi Abdine,Guokan Shang,Michalis Vazirgiannis

Main category: cs.CL

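TL;DR: This paper presents the first systematic study of curriculum learning for language model pretraining; experiments show it improves training efficiency and generalization, with the clearest effect in the early and middle phases of training.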

Details Motivation: Curriculum learning has shown potential to improve training efficiency and generalization in several machine learning domains, but its use in language model pretraining is underexplored, which motivates this systematic study. Method: Experiment with several curriculum settings, including vanilla curriculum learning, pacing-based sampling, and interleaved curricula, guided by six difficulty metrics spanning linguistic and information-theoretic perspectives. Result: Curriculum learning consistently improves convergence in the early and middle training phases and, when used as a warmup strategy, yields lasting gains of up to 3.5%. Compression ratio, lexical diversity, and readability are effective difficulty signals. Conclusion: The study highlights the importance of data ordering in large-scale pretraining and offers practical guidance for efficient, data-economical model development under realistic training conditions. Abstract: Curriculum learning has shown promise in improving training efficiency and generalization in various machine learning domains, yet its potential in pretraining language models remains underexplored, prompting our work as the first systematic investigation in this area. We experimented with different settings, including vanilla curriculum learning, pacing-based sampling, and interleaved curricula, guided by six difficulty metrics spanning linguistic and information-theoretic perspectives. We train models under these settings and evaluate their performance on eight diverse benchmarks. Our experiments reveal that curriculum learning consistently improves convergence in early and mid-training phases, and can yield lasting gains when used as a warmup strategy with up to $3.5\%$ improvement. Notably, we identify compression ratio, lexical diversity, and readability as effective difficulty signals across settings. Our findings highlight the importance of data ordering in large-scale pretraining and provide actionable insights for scalable, data-efficient model development under realistic training scenarios.
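Compression ratio is one of the difficulty signals the abstract singles out. The sketch below shows one plausible way to turn it into a vanilla easy-to-hard curriculum using Python's standard `zlib` module. Two things are assumptions rather than details from the paper: the ratio is defined here as compressed size over raw size, and a lower ratio (more repetitive text) is treated as easier.

```python
import zlib


def compression_ratio(text):
    """Compressed size divided by raw size in bytes (lower = more redundant)."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0  # treat empty documents as trivially easy
    return len(zlib.compress(raw)) / len(raw)


def curriculum_order(docs):
    """Vanilla curriculum: present documents from 'easy' to 'hard'.

    Assumption: highly compressible (repetitive) text counts as easy.
    """
    return sorted(docs, key=compression_ratio)


easy = "the cat sat on the mat. " * 20
hard = "Quantum chromodynamics describes gluon-mediated quark confinement."
ordered = curriculum_order([hard, easy])
```

Very short documents compress poorly because of fixed format overhead, so in practice the metric would be more meaningful when computed over documents of comparable length.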

[145] Don't Pay Attention

Mohammad Hammoud,Devang Acharya

Main category: cs.CL

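TL;DR: The paper proposes Avey, a new architecture intended to overcome the Transformer's limitations on long sequences while avoiding the limited parallelism of RNNs.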

Details Motivation: The Transformer is constrained by a fixed context window and quadratic attention complexity on long sequences, while RNNs scale linearly with sequence length but have limited parallelism, so a new architecture is needed. Method: Avey consists of a ranker and an autoregressive neural processor that together identify and contextualize only the most relevant tokens in the sequence, decoupling sequence length from context width. Result: Experiments show Avey is comparable to the Transformer on short-range NLP benchmarks and better at capturing long-range dependencies. Conclusion: Avey departs from both attention and recurrence and can process arbitrarily long sequences effectively while remaining efficient and performant. Abstract: The Transformer has become the de facto standard for large language models and a wide range of downstream tasks across various domains. Despite its numerous advantages like inherent training parallelism, the Transformer still faces key challenges due to its inability to effectively process sequences beyond a fixed context window and the quadratic complexity of its attention mechanism. These challenges have renewed interest in RNN-like architectures, which offer linear scaling with sequence length and improved handling of long-range dependencies, albeit with limited parallelism due to their inherently recurrent nature. In this paper, we propose Avey, a new neural foundational architecture that breaks away from both attention and recurrence. Avey comprises a ranker and an autoregressive neural processor, which collaboratively identify and contextualize only the most relevant tokens for any given token, regardless of their positions in the sequence. Specifically, Avey decouples sequence length from context width, thus enabling effective processing of arbitrarily long sequences. Experimental results show that Avey compares favorably to the Transformer across a variety of standard short-range NLP benchmarks, while notably excelling at capturing long-range dependencies.

[146] Surprisal from Larger Transformer-based Language Models Predicts fMRI Data More Poorly

Yi-Chien Lin,William Schuler

Main category: cs.CL

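TL;DR: The study finds that a Transformer model's perplexity is positively related to how well its surprisal predicts human data (that is, higher-perplexity models fit better), and that this trend holds not only for latency-based measures such as reading times but also for neural measures such as brain imaging data.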

Details Motivation: To examine the relationship between Transformer models' perplexity and their ability to predict human language processing, and in particular whether it holds for brain imaging data. Method: Seventeen pre-trained Transformer models are evaluated for predictive power across three language families and two fMRI datasets. Result: The positive relationship between model perplexity and model fit still holds, indicating that the trend generalizes to neural measures. Conclusion: The relationship between Transformer perplexity and predictive power for human language processing is general and applies across multiple kinds of measures. Abstract: As Transformers become more widely incorporated into natural language processing tasks, there has been considerable interest in using surprisal from these models as predictors of human sentence processing difficulty. Recent work has observed a positive relationship between Transformer-based models' perplexity and the predictive power of their surprisal estimates on reading times, showing that language models with more parameters and trained on more data are less predictive of human reading times. However, these studies focus on predicting latency-based measures (i.e., self-paced reading times and eye-gaze durations) with surprisal estimates from Transformer-based language models. This trend has not been tested on brain imaging data. This study therefore evaluates the predictive power of surprisal estimates from 17 pre-trained Transformer-based models across three different language families on two functional magnetic resonance imaging datasets. Results show that the positive relationship between model perplexity and model fit still obtains, suggesting that this trend is not specific to latency-based measures and can be generalized to neural measures.
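
For readers unfamiliar with the two quantities being related here, the following sketch shows the standard definitions: a token's surprisal is its negative log probability, and perplexity is the exponentiated mean surprisal. The function names and the use of base-2 logarithms are illustrative choices, not details taken from the paper.

```python
import math


def surprisal_bits(prob: float) -> float:
    """Surprisal of a token in bits: -log2 p(token | context)."""
    return -math.log2(prob)


def perplexity(token_probs: list[float]) -> float:
    """Perplexity as 2 raised to the mean per-token surprisal."""
    mean_surprisal = sum(surprisal_bits(p) for p in token_probs) / len(token_probs)
    return 2 ** mean_surprisal
```

A model that assigns higher probabilities to the observed tokens has lower perplexity; the abstract's finding is that such models nonetheless yield surprisal estimates that fit the human measures less well.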

[147] From Replication to Redesign: Exploring Pairwise Comparisons for LLM-Based Peer Review

Yaohui Zhang,Haijing Zhang,Wenlong Ji,Tianyu Hua,Nick Haber,Hancheng Cao,Weixin Liang

Main category: cs.CL

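TL;DR: The paper proposes a new approach in which large language models (LLMs) perform pairwise comparisons of manuscripts instead of traditional rating. Experiments show it is better at identifying high-impact papers, but also reveal biases such as reduced novelty and institutional imbalance.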

Details Motivation: Traditional peer review workflows have limitations, and prior work has mostly used LLMs as direct substitutes for human reviewers, with little exploration of new paradigms for LLM participation in review. This paper aims to rethink how LLMs can take part in academic review more effectively. Method: LLM agents perform pairwise comparisons among manuscripts rather than scoring them individually, and the outcomes of many comparisons are aggregated to measure relative manuscript quality more accurately. Result: Experiments show that the comparative approach significantly outperforms traditional rating-based methods at identifying high-impact papers, but also exhibits biases such as reduced novelty and institutional imbalance. Conclusion: LLMs offer potential for redesigning peer review, but fairness and diversity issues must be addressed, and future systems need further refinement. Abstract: The advent of large language models (LLMs) offers unprecedented opportunities to reimagine peer review beyond the constraints of traditional workflows. Despite these opportunities, prior efforts have largely focused on replicating traditional review workflows with LLMs serving as direct substitutes for human reviewers, while limited attention has been given to exploring new paradigms that fundamentally rethink how LLMs can participate in the academic review process. In this paper, we introduce and explore a novel mechanism that employs LLM agents to perform pairwise comparisons among manuscripts instead of individual scoring. By aggregating outcomes from substantial pairwise evaluations, this approach enables a more accurate and robust measure of relative manuscript quality. Our experiments demonstrate that this comparative approach significantly outperforms traditional rating-based methods in identifying high-impact papers. However, our analysis also reveals emergent biases in the selection process, notably a reduced novelty in research topics and an increased institutional imbalance. These findings highlight both the transformative potential of rethinking peer review with LLMs and critical challenges that future systems must address to ensure equity and diversity.
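
One common way to turn many pairwise outcomes into a single quality ranking is the Bradley-Terry model. The sketch below fits it with the classic iterative (minorization-maximization) update. The abstract does not state which aggregation method the authors use, so the choice of Bradley-Terry, the function name, and the `(winner, loser)` input format are all assumptions for illustration.

```python
from collections import defaultdict


def bradley_terry(comparisons, iterations=200):
    """Estimate item strengths from (winner, loser) pairs.

    Assumes every item wins at least once; otherwise its strength drifts to 0.
    """
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    items = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        items.update((winner, loser))

    strength = {item: 1.0 for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            # Sum over opponents j of n_ij / (p_i + p_j), using the old strengths.
            denom = sum(
                pair_counts[frozenset((i, j))] / (strength[i] + strength[j])
                for j in items
                if j != i
            )
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        total = sum(updated.values())
        strength = {i: value / total for i, value in updated.items()}
    return strength
```

Sorting manuscripts by the returned strength gives the aggregate ranking; in this setting each `(winner, loser)` pair would come from one LLM judgment between two manuscripts.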

[148] Do We Still Need Audio? Rethinking Speaker Diarization with a Text-Based Approach Using Multiple Prediction Models

Peilin Wu,Jinho D. Choi

Main category: cs.CL

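TL;DR: The paper proposes a new text-based approach to speaker diarization built on sentence-level speaker change detection. It performs competitively with audio-based systems and is especially strong in short conversations.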

Details Motivation: Traditional audio-based speaker diarization is limited by audio quality and speaker similarity; this paper explores using the dialogue transcript for more efficient diarization. Method: Two models are developed, the Single Prediction Model (SPM) and the Multiple Prediction Model (MPM), which detect speaker changes from dialogue text. Result: Experiments show that the text-based approach, especially the MPM, performs well in short conversations and is competitive with audio-based methods. Conclusion: The paper shows the potential of textual features for speaker diarization, stresses the importance of semantic understanding, and opens new directions for research on multimodal and semantic features. Abstract: We present a novel approach to Speaker Diarization (SD) by leveraging text-based methods focused on Sentence-level Speaker Change Detection within dialogues. Unlike audio-based SD systems, which are often challenged by audio quality and speaker similarity, our approach utilizes the dialogue transcript alone. Two models are developed: the Single Prediction Model (SPM) and the Multiple Prediction Model (MPM), both of which demonstrate significant improvements in identifying speaker changes, particularly in short conversations. Our findings, based on a curated dataset encompassing diverse conversational scenarios, reveal that the text-based SD approach, especially the MPM, performs competitively against state-of-the-art audio-based SD systems, with superior performance in short conversational contexts. This paper not only showcases the potential of leveraging linguistic features for SD but also highlights the importance of integrating semantic understanding into SD systems, opening avenues for future research in multimodal and semantic feature-based diarization.

[149] The Biased Samaritan: LLM biases in Perceived Kindness

Jack H Fagan,Ruhaan Juyaal,Amy Yue-Ming Yu,Siya Pun

Main category: cs.CL

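TL;DR: The paper proposes a new method for evaluating demographic bias in generative AI models. By analyzing how models rate the willingness to intervene morally across genders, races, and ages, it reveals the models' bias patterns.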

Details Motivation: Understanding and mitigating bias in large language models (LLMs) is an ongoing problem; this paper aims to quantitatively evaluate different LLMs' biases with respect to demographic characteristics. Method: Models are prompted to assess a moral patient's willingness to intervene, enabling a quantitative analysis of LLM biases across genders, races, and ages. Result: Models generally treat the baseline demographic as a white middle-aged or young adult male, yet rate non-baseline demographics as more willing to help. Conclusion: The paper provides a method for objectively assessing LLM bias and distinguishes two bias patterns that are often conflated. Abstract: While Large Language Models (LLMs) have become ubiquitous in many fields, understanding and mitigating LLM biases is an ongoing issue. This paper provides a novel method for evaluating the demographic biases of various generative AI models. By prompting models to assess a moral patient's willingness to intervene constructively, we aim to quantitatively evaluate different LLMs' biases towards various genders, races, and ages. Our work differs from existing work by aiming to determine the baseline demographic identities for various commercial models and the relationship between the baseline and other demographics. We strive to understand if these biases are positive, neutral, or negative, and the strength of these biases. This paper can contribute to the objective assessment of bias in Large Language Models and give the user or developer the power to account for these biases in LLM output or in training future LLMs. Our analysis suggested two key findings: that models view the baseline demographic as a white middle-aged or young adult male; however, a general trend across models suggested that non-baseline demographics are more willing to help than the baseline. These methodologies allowed us to distinguish these two biases that are often tangled together.

[150] A Variational Approach for Mitigating Entity Bias in Relation Extraction

Samuel Mensah,Elena Kochkina,Jabez Magomere,Joy Prakash Sain,Simerjot Kaur,Charese Smiley

Main category: cs.CL

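TL;DR: Proposes a new method based on the Variational Information Bottleneck (VIB) framework to reduce entity bias in relation extraction and improve model generalization.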

Details Motivation: Relation extraction models rely excessively on entity information, which leads to poor generalization, so entity bias needs to be addressed. Method: A Variational Information Bottleneck framework compresses entity-specific information while preserving task-relevant features. Result: The method achieves state-of-the-art performance on relation extraction datasets in the general, financial, and biomedical domains, on both in-domain and out-of-domain test sets. Conclusion: The method offers a robust, interpretable, and theoretically grounded solution for relation extraction. Abstract: Mitigating entity bias is a critical challenge in Relation Extraction (RE), where models often rely excessively on entities, resulting in poor generalization. This paper presents a novel approach to address this issue by adapting a Variational Information Bottleneck (VIB) framework. Our method compresses entity-specific information while preserving task-relevant features. It achieves state-of-the-art performance on relation extraction datasets across general, financial, and biomedical domains, in both in-domain (original test sets) and out-of-domain (modified test sets with type-constrained entity replacements) settings. Our approach offers a robust, interpretable, and theoretically grounded methodology.
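
To make the "compress while preserving task-relevant features" trade-off concrete, the sketch below computes a generic VIB objective: a task loss plus a weighted KL term that pulls a diagonal-Gaussian latent toward a standard normal prior. The standard normal prior, the `beta` weight, and the function names are assumptions based on the usual VIB formulation; the abstract does not describe how the authors apply it to entity representations.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one latent vector."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def vib_loss(task_loss, mu, log_var, beta=1e-3):
    """Task loss plus beta-weighted compression penalty (beta is a hypothetical default)."""
    return task_loss + beta * kl_to_standard_normal(mu, log_var)
```

A larger `beta` compresses the latent more aggressively, which in this setting would correspond to discarding more entity-specific information at some cost to the task loss.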

[151] Curriculum-Guided Layer Scaling for Language Model Pretraining

Karanpartap Singh,Neil Band,Ehsan Adeli

Main category: cs.CL

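TL;DR: CGLS is a framework that improves pretraining efficiency by progressively increasing model depth and data difficulty, and it outperforms baseline methods on multiple benchmarks.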

Details Motivation: Inspired by human cognitive development, the authors aim to improve pretraining efficiency by synchronizing model growth with increasing data difficulty. Method: Curriculum-Guided Layer Scaling (CGLS) gradually adds layers to the model while data difficulty increases, moving from simple to complex data. Result: At the 100M and 1.2B parameter scales, CGLS performs well on benchmarks such as PIQA and ARC and improves generalization and zero-shot performance. Conclusion: Through progressive stacking, CGLS markedly improves generalization on knowledge-intensive and reasoning tasks. Abstract: As the cost of pretraining large language models grows, there is continued interest in strategies to improve learning efficiency during this core training stage. Motivated by cognitive development, where humans gradually build knowledge as their brains mature, we propose Curriculum-Guided Layer Scaling (CGLS), a framework for compute-efficient pretraining that synchronizes increasing data difficulty with model growth through progressive layer stacking (i.e., gradually adding layers during training). At the 100M parameter scale, using a curriculum transitioning from synthetic short stories to general web data, CGLS outperforms baseline methods on the question-answering benchmarks PIQA and ARC. Pretraining at the 1.2B scale, we stratify the DataComp-LM corpus with a DistilBERT-based classifier and progress from general text to highly technical or specialized content. Our results show that progressively increasing model depth alongside sample difficulty leads to better generalization and zero-shot performance on various downstream benchmarks. Altogether, our findings demonstrate that CGLS unlocks the potential of progressive stacking, offering a simple yet effective strategy for improving generalization on knowledge-intensive and reasoning tasks.

[152] Predicting Early-Onset Colorectal Cancer with Large Language Models

Wilson Lau,Youngwon Kim,Sravanthi Parasa,Md Enamul Haque,Anand Oka,Jay Nanduri

Main category: cs.CL

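TL;DR: The paper studies the prediction of early-onset colorectal cancer (EoCRC), comparing 10 machine learning models with large language models (LLMs), and finds that the fine-tuned LLM performs best.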

Details Motivation: The incidence of early-onset colorectal cancer rises every year, but this population is younger than the age recommended by national screening guidelines, so more effective prediction methods are needed. Method: Ten machine learning models and LLMs are applied to predict EoCRC from patient conditions, lab results, and observations in the 6 months before diagnosis. Result: The fine-tuned LLM achieves an average sensitivity of 73% and specificity of 91%. Conclusion: LLMs perform well at predicting EoCRC and may offer a new tool for early screening. Abstract: The incidence rate of early-onset colorectal cancer (EoCRC, age < 45) has increased every year, but this population is younger than the recommended age established by national guidelines for cancer screening. In this paper, we applied 10 different machine learning models to predict EoCRC, and compared their performance with advanced large language models (LLM), using patient conditions, lab results, and observations within 6 months of patient journey prior to the CRC diagnoses. We retrospectively identified 1,953 CRC patients from multiple health systems across the United States. The results demonstrated that the fine-tuned LLM achieved an average of 73% sensitivity and 91% specificity.
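
The two reported metrics have standard definitions: sensitivity is the fraction of true cases that are flagged, and specificity is the fraction of non-cases that are correctly left unflagged. The sketch below computes both from binary labels; the function name and the toy labels in the usage note are illustrative and are not data from the study.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels where 1 = positive.

    Assumes y_true contains at least one positive and one negative example.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)
```

For example, with four positives of which three are caught and six negatives of which one is wrongly flagged, the function returns a sensitivity of 0.75 and a specificity of 5/6.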

[153] Efficient Long-Context LLM Inference via KV Cache Clustering

Jie Hu,Shengnan Wang,Yutong He,Ping Gong,Jiawei Yi,Juncheng Zhang,Youhui Bai,Renhai Chen,Gong Zhang,Cheng Li,Kun Yuan

Main category: cs.CL

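TL;DR: The paper proposes Chelsea, a framework that uses online KV cache clustering to reduce memory usage and computational overhead while preserving model performance.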

Details Motivation: The KV cache demands of long-context LLMs are high, and existing methods are either inefficient or lose critical information. Method: The authors propose Chunked Soft Matching, which clusters and merges the KV cache chunk by chunk, and they provide a theoretical analysis of the computational complexity and the optimal partitioning strategy. Result: Experiments show up to an 80% reduction in memory usage, up to 3.19× faster decoding, and up to 2.72× lower end-to-end latency. Conclusion: Chelsea is efficient and practical, and is well suited to deploying long-context LLMs. Abstract: Large language models (LLMs) with extended context windows have become increasingly prevalent for tackling complex tasks. However, the substantial Key-Value (KV) cache required for long-context LLMs poses significant deployment challenges. Existing approaches either discard potentially critical information needed for future generations or offer limited efficiency gains due to high computational overhead. In this paper, we introduce Chelsea, a simple yet effective framework for online KV cache clustering. Our approach is based on the observation that key states exhibit high similarity along the sequence dimension. To enable efficient clustering, we divide the sequence into chunks and propose Chunked Soft Matching, which employs an alternating partition strategy within each chunk and identifies clusters based on similarity. Chelsea then merges the KV cache within each cluster into a single centroid. Additionally, we provide a theoretical analysis of the computational complexity and the optimality of the intra-chunk partitioning strategy. Extensive experiments across various models and long-context benchmarks demonstrate that Chelsea achieves up to 80% reduction in KV cache memory usage while maintaining comparable model performance. Moreover, with minimal computational overhead, Chelsea accelerates the decoding stage of inference by up to 3.19$\times$ and reduces end-to-end latency by up to 2.72$\times$.

[154] Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards

Jeff Da,Clinton Wang,Xiang Deng,Yuntao Ma,Nikhil Barhate,Sean Hendryx

Main category: cs.CL

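TL;DR: Agent-RLVR is an improved reinforcement learning method that introduces an agent guidance mechanism to improve training in complex agentic environments, significantly raising task success rates.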

Details Motivation: Conventional RLVR works poorly in complex agentic environments because sparse rewards make training difficult. Agent-RLVR aims to solve this by emulating how humans teach. Method: Agent-RLVR steers the agent through tasks with an agent guidance mechanism (such as strategic plans and dynamic feedback) and updates the policy with RLVR. Result: On SWE-Bench Verified, Agent-RLVR raises the pass@1 performance of Qwen-2.5-72B-Instruct from 9.4% to 22.4%, and to 27.8% with further refinement. Conclusion: Agent-RLVR provides an effective method for training agents in complex environments and extends the range of settings where RLVR applies. Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has been widely adopted as the de facto method for enhancing the reasoning capabilities of large language models and has demonstrated notable success in verifiable domains like math and competitive programming tasks. However, the efficacy of RLVR diminishes significantly when applied to agentic environments. These settings, characterized by multi-step, complex problem solving, lead to high failure rates even for frontier LLMs, as the reward landscape is too sparse for effective model training via conventional RLVR. In this work, we introduce Agent-RLVR, a framework that makes RLVR effective in challenging agentic settings, with an initial focus on software engineering tasks. Inspired by human pedagogy, Agent-RLVR introduces agent guidance, a mechanism that actively steers the agent towards successful trajectories by leveraging diverse informational cues. These cues, ranging from high-level strategic plans to dynamic feedback on the agent's errors and environmental interactions, emulate a teacher's guidance, enabling the agent to navigate difficult solution spaces and promoting active self-improvement via additional environment exploration. In the Agent-RLVR training loop, agents first attempt to solve tasks to produce initial trajectories, which are then validated by unit tests and supplemented with agent guidance. Agents then reattempt with guidance, and the agent policy is updated with RLVR based on the rewards of these guided trajectories. Agent-RLVR elevates the pass@1 performance of Qwen-2.5-72B-Instruct from 9.4% to 22.4% on SWE-Bench Verified. We find that our guidance-augmented RLVR data is additionally useful for test-time reward model training, shown by further boosting pass@1 to 27.8%. Agent-RLVR lays the groundwork for training agents with RLVR in complex, real-world environments where conventional RL methods struggle.

[155] KoGEC : Korean Grammatical Error Correction with Pre-trained Translation Models

Taeeun Kim,Semin Jeong,Youngsook Song

Main category: cs.CL

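TL;DR: KoGEC is a Korean grammatical error correction system based on pre-trained translation models. By fine-tuning NLLB models, it outperforms GPT-4 and HCX-3 on Korean GEC tasks.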

Details Motivation: To develop an efficient Korean grammatical error correction system that addresses the shortcomings of existing large language models on Korean GEC. Method: NLLB models are fine-tuned with special language tokens that distinguish original from corrected sentences, and are evaluated with BLEU and an LLM-based judging method. Result: KoGEC outperforms GPT-4 and HCX-3 on Korean GEC tasks and is more balanced across error types. Conclusion: KoGEC shows the potential of task-specific models for specialized NLP tasks and contributes a new evaluation method. Abstract: This research introduces KoGEC, a Korean Grammatical Error Correction system using pre-trained translation models. We fine-tuned NLLB (No Language Left Behind) models for Korean GEC, comparing their performance against large language models like GPT-4 and HCX-3. The study used two social media conversation datasets for training and testing. The NLLB models were fine-tuned using special language tokens to distinguish between original and corrected Korean sentences. Evaluation was done using BLEU scores and an "LLM as judge" method to classify error types. Results showed that the fine-tuned NLLB (KoGEC) models outperformed GPT-4o and HCX-3 in Korean GEC tasks. KoGEC demonstrated a more balanced error correction profile across various error types, whereas the larger LLMs tended to focus less on punctuation errors. We also developed a Chrome extension to make the KoGEC system accessible to users. Finally, we explored token vocabulary expansion to further improve the model but found it to decrease model performance. This research contributes to the field of NLP by providing an efficient, specialized Korean GEC system and a new evaluation method. It also highlights the potential of compact, task-specific models to compete with larger, general-purpose language models in specialized NLP tasks.

[156] AbsenceBench: Language Models Can't Tell What's Missing

Harvey Yiyun Fu,Aryan Shrivastava,Jared Moore,Peter West,Chenhao Tan,Ari Holtzman

Main category: cs.CL

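TL;DR: The paper introduces AbsenceBench, which evaluates the ability of large language models (LLMs) to detect missing information. It finds that even advanced models such as Claude-3.7-Sonnet perform poorly, revealing a fundamental limitation of Transformer attention.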

Details Motivation: Although LLMs are good at locating specific information in long text (as in the NIAH test), they still struggle to identify clearly omitted information, so this limitation needs to be studied. Method: AbsenceBench tests models' ability to detect missing information in three domains (numerical sequences, poetry, and GitHub pull requests) by comparing original and edited texts. Result: Even the advanced model Claude-3.7-Sonnet reaches an F1-score of only 69.6%, showing that models perform poorly at detecting missing information. Conclusion: Transformer attention has difficulty handling "gaps" in documents because missing information does not correspond to any specific key, which reveals a fundamental limitation on this kind of task. Abstract: Large language models (LLMs) are increasingly capable of processing long inputs and locating specific information within them, as evidenced by their performance on the Needle in a Haystack (NIAH) test. However, while models excel at recalling surprising information, they still struggle to identify clearly omitted information. We introduce AbsenceBench to assess LLMs' capacity to detect missing information across three domains: numerical sequences, poetry, and GitHub pull requests. AbsenceBench asks models to identify which pieces of a document were deliberately removed, given access to both the original and edited contexts. Despite the apparent straightforwardness of these tasks, our experiments reveal that even state-of-the-art models like Claude-3.7-Sonnet achieve only 69.6% F1-score with a modest average context length of 5K tokens. Our analysis suggests this poor performance stems from a fundamental limitation: Transformer attention mechanisms cannot easily attend to "gaps" in documents since these absences don't correspond to any specific keys that can be attended to. Overall, our results and analysis provide a case study of the close proximity of tasks where models are already superhuman (NIAH) and tasks where models break down unexpectedly (AbsenceBench).
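
The task setup described above can be sketched in a few lines: remove some elements from a document, then score a model's guess at what was removed with F1. The helper names, the line-level granularity, the random removal, and the set-based scoring are assumptions made for illustration; the benchmark's actual construction and scoring may differ.

```python
import random


def make_instance(lines, n_remove, seed=0):
    """Remove n_remove randomly chosen lines; return (edited_lines, removed_lines)."""
    rng = random.Random(seed)
    removed_idx = set(rng.sample(range(len(lines)), n_remove))
    edited = [line for i, line in enumerate(lines) if i not in removed_idx]
    removed = [lines[i] for i in sorted(removed_idx)]
    return edited, removed


def f1_score(predicted, removed):
    """Set-based F1 between the lines a model reports as missing and the truth."""
    pred, gold = set(predicted), set(removed)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

In use, the model would be shown both `lines` and `edited` and asked to list the missing lines; its answer is then passed to `f1_score` together with `removed`.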

[157] A Gamified Evaluation and Recruitment Platform for Low Resource Language Machine Translation Systems

Carlos Rafael Catalan

Main category: cs.CL

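TL;DR: The paper discusses the importance of human evaluators for low-resource language machine translation systems and proposes a design for a recruitment and gamified evaluation platform to address the shortage of datasets and evaluators.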

Details Motivation: Machine translation systems for low-resource languages lack effective automated evaluation methods, and both human evaluators and datasets are scarce, so a solution is needed. Method: Based on a comprehensive review of existing evaluation procedures, the paper designs a recruitment and gamified evaluation platform to fill the resource gap. Result: The paper presents a platform design aimed at developers of machine translation systems and discusses the challenges of evaluating it and its potential applications in natural language processing research. Conclusion: The platform design offers a feasible way to evaluate low-resource language machine translation systems and has potential applications in wider NLP research. Abstract: Human evaluators provide necessary contributions in evaluating large language models. In the context of Machine Translation (MT) systems for low-resource languages (LRLs), this is made even more apparent since popular automated metrics tend to be string-based, and therefore do not provide a full picture of the nuances of the behavior of the system. Human evaluators, when equipped with the necessary expertise of the language, will be able to test for adequacy, fluency, and other important metrics. However, the low-resource nature of the language means that both datasets and evaluators are in short supply. This presents the following conundrum: How can developers of MT systems for these LRLs find adequate human evaluators and datasets? This paper first presents a comprehensive review of existing evaluation procedures, with the objective of producing a design proposal for a platform that addresses the resource gap in terms of datasets and evaluators in developing MT systems. The result is a design for a recruitment and gamified evaluation platform for developers of MT systems. Challenges are also discussed in terms of evaluating this platform, as well as its possible applications in the wider scope of Natural Language Processing (NLP) research.

[158] Med-PRM: Medical Reasoning Models with Stepwise, Guideline-verified Process Rewards

Jaehoon Yun,Jiwoong Sohn,Jungwoo Park,Hyunjae Kim,Xiangru Tang,Yanjun Shao,Yonghoe Koo,Minhyeok Ko,Qingyu Chen,Mark Gerstein,Michael Moor,Jaewoo Kang

Main category: cs.CL

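TL;DR: Med-PRM is a process reward modeling framework that uses retrieval-augmented generation to verify reasoning steps, improving the accuracy of clinical decision making.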

Details Motivation: Existing large language models struggle to localize and correct specific errors in the reasoning process during clinical decision making, which is a pressing need in medicine. Method: Retrieval-augmented generation is used to verify each reasoning step against medical knowledge bases. Result: The method performs strongly on five medical QA benchmarks and two open-ended diagnostic tasks, improving base model performance by up to 13.50%. Conclusion: Beyond its strong performance, Med-PRM can be integrated with other models in a plug-and-play fashion and, for the first time, enables small-scale models to exceed 80% accuracy on MedQA. Abstract: Large language models have shown promise in clinical decision making, but current approaches struggle to localize and correct errors at specific steps of the reasoning process. This limitation is critical in medicine, where identifying and addressing reasoning errors is essential for accurate diagnosis and effective patient care. We introduce Med-PRM, a process reward modeling framework that leverages retrieval-augmented generation to verify each reasoning step against established medical knowledge bases. By verifying intermediate reasoning steps with evidence retrieved from clinical guidelines and literature, our model can precisely assess the reasoning quality in a fine-grained manner. Evaluations on five medical QA benchmarks and two open-ended diagnostic tasks demonstrate that Med-PRM achieves state-of-the-art performance, improving the performance of base models by up to 13.50%. Moreover, we demonstrate the generality of Med-PRM by integrating it in a plug-and-play fashion with strong policy models such as Meerkat, achieving over 80% accuracy on MedQA for the first time using small-scale models of 8 billion parameters. Our code and data are available at: https://med-prm.github.io/

[159] ImmunoFOMO: Are Language Models missing what oncologists see?

Aman Sinha,Bogdan-Valentin Popescu,Xavier Coubez,Marianne Clausel,Mathieu Constant

Main category: cs.CL

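TL;DR: The study examines how well language models understand medical concepts and finds that pre-trained language models can outperform large language models at identifying specific low-level concepts.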

Details Motivation: As language model capabilities have grown rapidly, the medical field has become increasingly interested in their use for natural language processing, especially for understanding medical concepts. Method: Different language models are compared against expert clinicians at identifying hallmarks of immunotherapy in breast cancer abstracts. Result: Pre-trained language models outperform large language models at identifying specific low-level concepts. Conclusion: Pre-trained language models show potential for medicine-specific tasks, especially fine-grained concept identification. Abstract: Language model (LM) capabilities have grown at a fast pace over the past decade, leading researchers in various disciplines, such as biomedical research, to increasingly explore the utility of LMs in their day-to-day applications. Domain-specific language models have already been in use for biomedical natural language processing (NLP) applications. Recently, however, interest has grown in medical language models and their understanding capabilities. In this paper, we investigate the medical conceptual grounding of various language models against expert clinicians for identification of hallmarks of immunotherapy in breast cancer abstracts. Our results show that pre-trained language models have potential to outperform large language models in identifying very specific (low-level) concepts.

[160] Relational Schemata in BERT Are Inducible, Not Emergent: A Study of Performance vs. Competence in Language Models

Cole Gawin

Main category: cs.CL

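TL;DR: The study asks whether BERT encodes abstract relational schemata through pretraining, and finds that concept pairs organize by relation type in the high-dimensional embedding space only after fine-tuning.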

Details Motivation: To determine whether BERT's performance on semantic tasks reflects genuine conceptual competence rather than surface-level statistical association. Method: BERT's internal representations are analyzed, comparing relation classification performance with the structure of [CLS] token embeddings before and after fine-tuning. Result: Pretrained BERT carries latent relational signals, but concept pairs organize by relation type only after fine-tuning. Conclusion: Behavioral performance does not imply structured conceptual understanding, though models can acquire inductive biases for relational abstraction through appropriate training. Abstract: While large language models like BERT demonstrate strong empirical performance on semantic tasks, whether this reflects true conceptual competence or surface-level statistical association remains unclear. I investigate whether BERT encodes abstract relational schemata by examining internal representations of concept pairs across taxonomic, mereological, and functional relations. I compare BERT's relational classification performance with representational structure in [CLS] token embeddings. Results reveal that pretrained BERT enables high classification accuracy, indicating latent relational signals. However, concept pairs organize by relation type in high-dimensional embedding space only after fine-tuning on supervised relation classification tasks. This indicates relational schemata are not emergent from pretraining alone but can be induced via task scaffolding. These findings demonstrate that behavioral performance does not necessarily imply structured conceptual understanding, though models can acquire inductive biases for grounded relational abstraction through appropriate training.

[161] Lag-Relative Sparse Attention In Long Context Training

Manlai Liang,Wanyi Huang,Mandi Liu,Huaijun Li,Jinlong Li

Main category: cs.CL

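TL;DR: Proposes Lag-Relative Sparse Attention (LRSA), which uses the LagKV compression technique to improve long-context processing, reducing compute and memory overhead while preserving performance.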

Details Motivation: Although large language models (LLMs) have advanced natural language processing, their ability to handle long contexts is limited by the quadratic complexity of attention and a linearly growing memory footprint. Existing compression techniques often degrade performance and are unsuitable for post-training. Method: LRSA, combined with LagKV compression, performs chunk-by-chunk prefilling and selects the most relevant key-value pairs, focusing on salient historical context. Result: Experiments show that the method significantly improves LLM robustness under key-value compression and achieves better results on a question-answer fine-tuning task. Conclusion: LRSA addresses the performance and efficiency problems of long-context processing with low computation overhead and no additional parameters. Abstract: Large Language Models (LLMs) have made significant strides in natural language processing and generation, yet their ability to handle long-context input remains constrained by the quadratic complexity of attention computation and linearly increasing key-value memory footprint. To reduce computational costs and memory, key-value cache compression techniques are commonly applied at inference time, but this often leads to severe performance degradation, as models are not trained to handle compressed context. Although there are more sophisticated compression methods, they are typically unsuitable for post-training because of their incompatibility with gradient-based optimization or high computation overhead. To fill this gap with no additional parameter and little computation overhead, we propose Lag-Relative Sparse Attention (LRSA) anchored by the LagKV compression method for long context post-training. Our method performs chunk-by-chunk prefilling, which selects the top K most relevant key-value pairs in a fixed-size lagging window, allowing the model to focus on salient historical context while maintaining efficiency. Experimental results show that our approach significantly enhances the robustness of the LLM with key-value compression and achieves better fine-tuned results in the question-answer tuning task.

[162] On the Effectiveness of Integration Methods for Multimodal Dialogue Response Retrieval

Seongbo Jang,Seonghyeon Lee,Dongha Lee,Hwanjo Yu

Main category: cs.CL

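TL;DR: The paper explores how multimodal dialogue systems can produce responses in several modalities such as text and images, proposes three integration methods, and compares their merits and demerits. Experiments show that the end-to-end approach performs comparably to the two-step approach and that a parameter sharing strategy improves performance.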

Details Motivation: To study the multimodality of responses in multimodal dialogue systems in order to make such systems more expressive and diverse. Method: Three integration methods based on a two-step approach and an end-to-end approach are proposed and compared. Experiments evaluate the end-to-end approach and introduce a parameter sharing strategy. Result: The end-to-end approach performs comparably to the two-step approach, and parameter sharing reduces the number of parameters while improving performance. Conclusion: The end-to-end approach is promising for multimodal dialogue systems, and parameter sharing effectively improves performance. Abstract: Multimodal chatbots have become one of the major topics for dialogue systems in both the research community and industry. Recently, researchers have shed light on the multimodality of responses as well as dialogue contexts. This work explores how a dialogue system can output responses in various modalities such as text and image. To this end, we first formulate a multimodal dialogue response retrieval task for retrieval-based systems as the combination of three subtasks. We then propose three integration methods based on a two-step approach and an end-to-end approach, and compare the merits and demerits of each method. Experimental results on two datasets demonstrate that the end-to-end approach achieves comparable performance without an intermediate step in the two-step approach. In addition, a parameter sharing strategy not only reduces the number of parameters but also boosts performance by transferring knowledge across the subtasks and the modalities.

[163] From Persona to Person: Enhancing the Naturalness with Multiple Discourse Relations Graph Learning in Personalized Dialogue Generation

Chih-Hao Hsu,Ying-Jia Lin,Hung-Yu Kao

Main category: cs.CL

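TL;DR: MUDI (Multiple Discourse Relations Graph Learning) improves the naturalness and consistency of personalized dialogue generation through structured dialogue graphs and novel attention strategies.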

Details Motivation: Personalized dialogue generation must stay coherent and consistent with the user's traits, which existing methods struggle to achieve. Method: A large language model annotates discourse relations and builds dialogue graphs, a DialogueGAT encoder captures implicit relations, and coherence-aware attention strategies are used during decoding. Result: Experiments show that MUDI significantly improves the quality of personalized responses, bringing them closer to human dialogue. Conclusion: MUDI effectively improves personalized dialogue generation through structured learning and attention mechanisms. Abstract: In dialogue generation, the naturalness of responses is crucial for effective human-machine interaction. Personalized response generation poses even greater challenges, as the responses must remain coherent and consistent with the user's personal traits or persona descriptions. We propose MUDI ($\textbf{Mu}$ltiple $\textbf{Di}$scourse Relations Graph Learning) for personalized dialogue generation. We utilize a Large Language Model to assist in annotating discourse relations and to transform dialogue data into structured dialogue graphs. Our graph encoder, the proposed DialogueGAT model, then captures implicit discourse relations within this structure, along with persona descriptions. During the personalized response generation phase, novel coherence-aware attention strategies are implemented to enhance the decoder's consideration of discourse relations. Our experiments demonstrate significant improvements in the quality of personalized responses, thus resembling human-like dialogue exchanges.

[164] Are LLMs Good Text Diacritizers? An Arabic and Yorùbá Case Study

Hawau Olamide Toyin,Samar M. Magdy,Hanan Aldarmaki

Main category: cs.CL

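TL;DR: Large language models (LLMs) outperform specialized models on text diacritization for Arabic and Yorùbá, but smaller models are prone to hallucination, and fine-tuning improves performance.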

Details Motivation: To study the effectiveness of LLMs for multilingual text diacritization and fill a gap in existing research. Method: The authors introduce the multilingual dataset MultiDiac, evaluate 14 LLMs and 6 specialized models, and fine-tune 4 small models with LoRA. Result: Most off-the-shelf LLMs outperform the specialized models, smaller models are prone to hallucination, and fine-tuning improves performance. Conclusion: LLMs show promise for text diacritization, and fine-tuning is an effective way to improve small models. Abstract: We investigate the effectiveness of large language models (LLMs) for text diacritization in two typologically distinct languages: Arabic and Yoruba. To enable a rigorous evaluation, we introduce a novel multilingual dataset MultiDiac, with diverse samples that capture a range of diacritic ambiguities. We evaluate 14 LLMs varying in size, accessibility, and language coverage, and benchmark them against 6 specialized diacritization models. Additionally, we fine-tune four small open-source models using LoRA for Yoruba. Our results show that many off-the-shelf LLMs outperform specialized diacritization models for both Arabic and Yoruba, but smaller models suffer from hallucinations. Fine-tuning on a small dataset can help improve diacritization performance and reduce hallucination rates.

[165] SceneGram: Conceptualizing and Describing Tangrams in Scene Context

Simeon Junker,Sina Zarrieß

Main category: cs.CL

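TL;DR: The paper studies how scene context affects human conceptualization of abstract shapes and introduces the SceneGram dataset, finding that multimodal LLMs fail to capture the richness and variability of human conceptualizations.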

Details Motivation: To examine how scene context shapes the different ways humans conceptualize the same abstract shape, and to test whether multimodal LLMs can reproduce this behavior. Method: The SceneGram dataset is built by collecting human names for abstract shapes placed in different scenes, and the naming behavior of multimodal LLMs is analyzed. Result: Human conceptualization of the same shape varies with the scene, whereas multimodal LLMs do not show comparable richness and variability. Conclusion: Scene context strongly influences human conceptualization, and current multimodal LLMs still fall short in this respect. Abstract: Research on reference and naming suggests that humans can come up with very different ways of conceptualizing and referring to the same object, e.g. the same abstract tangram shape can be a "crab", "sink" or "space ship". Another common assumption in cognitive science is that scene context fundamentally shapes our visual perception of objects and conceptual expectations. This paper contributes SceneGram, a dataset of human references to tangram shapes placed in different scene contexts, allowing for systematic analyses of the effect of scene context on conceptualization. Based on this data, we analyze references to tangram shapes generated by multimodal LLMs, showing that these models do not account for the richness and variability of conceptualizations found in human references.

[166] LoRA-Gen: Specializing Large Language Model via Online LoRA Generation

Yicheng Xiao,Lin Song,Rui Yang,Cheng Cheng,Yixiao Ge,Xiu Li,Ying Shan

Main category: cs.CL

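TL;DR: The LoRA-Gen framework uses a large cloud-side model to generate LoRA parameters for small edge-side models, and applies reparameterization to improve task performance and inference efficiency.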

Details Motivation: Existing methods are limited in effectiveness and efficiency on domain-specific tasks, especially for small edge-side models. Method: A large cloud-side model generates LoRA parameters, which are merged into the edge-side model through reparameterization, reducing the input context length. Result: Outperforms conventional LoRA fine-tuning on reasoning tasks with a 2.1x speedup, and reaches a 10.1x compression ratio on intelligent agent tasks. Conclusion: LoRA-Gen enables efficient knowledge transfer and model specialization, significantly improving the performance and efficiency of edge-side models. Abstract: Recent advances have highlighted the benefits of scaling language models to enhance performance across a wide range of NLP tasks. However, these approaches still face limitations in effectiveness and efficiency when applied to domain-specific tasks, particularly for small edge-side models. We propose the LoRA-Gen framework, which utilizes a large cloud-side model to generate LoRA parameters for edge-side models based on task descriptions. By employing the reparameterization technique, we merge the LoRA parameters into the edge-side model to achieve flexible specialization. Our method facilitates knowledge transfer between models while significantly improving the inference efficiency of the specialized model by reducing the input context length. Without specialized training, LoRA-Gen outperforms conventional LoRA fine-tuning, achieving competitive accuracy and a 2.1x speedup with TinyLLaMA-1.1B in reasoning tasks. Besides, our method delivers a compression ratio of 10.1x with Gemma-2B on intelligent agent tasks.
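
The reparameterization step mentioned in the abstract can be illustrated with the standard LoRA identity: adding the low-rank update B·A to the frozen weight W gives the same output as running the adapter as a separate branch. The sketch below is only an illustration of that identity in plain Python; the matrices, the `scale` parameter, and the helper names are made up for the example and are not taken from the LoRA-Gen code.

```python
# Minimal sketch of LoRA merging. W is a frozen (d_out x d_in) weight,
# B (d_out x r) and A (r x d_in) are the low-rank factors.
# All values and helper names here are illustrative assumptions.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def add(x, y):
    """Element-wise sum of two matrices of equal shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def merge_lora(W, A, B, scale=1.0):
    """Return the merged weight W + scale * (B @ A)."""
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(rw, rd)] for rw, rd in zip(W, delta)]

W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # 2 x 3 frozen weight
A = [[1.0, 2.0, 3.0]]                    # rank r = 1, shape 1 x 3
B = [[0.5], [-0.5]]                      # shape 2 x 1
x = [[1.0], [1.0], [1.0]]                # input as a 3 x 1 column

# Adapter kept as a separate branch: W x + B (A x)
unmerged = add(matmul(W, x), matmul(B, matmul(A, x)))
# Adapter folded into the weight: (W + B A) x
merged = matmul(merge_lora(W, A, B), x)
```

After merging, inference uses a single weight matrix, so the adapter adds no extra matrix multiplications at run time.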

[167] Converting Annotated Clinical Cases into Structured Case Report Forms

Pietro Ferrazzi,Alberto Lavelli,Bernardo Magnini

Main category: cs.CL

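TL;DR: To address the scarcity of CRF datasets, the authors propose converting datasets annotated for information extraction into structured CRFs and develop a semi-automatic conversion method; applied to the E3C dataset, it yields a high-quality CRF slot-filling dataset. Experiments show that slot filling remains challenging for large language models.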

Details Motivation: Publicly available, high-quality CRF datasets are scarce, which limits the development of systems that fill in CRFs from clinical notes. Method: Propose a semi-automatic conversion method that turns information extraction datasets into structured CRFs, applied to the E3C dataset as a case study. Result: Slot-filling performance on the resulting CRF dataset is moderate: a closed large language model reaches 59.7% for Italian and 67.3% for English in the zero-shot setting, and open-source models perform worse. Conclusion: CRF slot filling remains challenging for current models, and the study provides a high-quality dataset to support future research. Abstract: Case Report Forms (CRFs) are widely used in medical research as they ensure accuracy, reliability, and validity of results in clinical studies. However, publicly available, well-annotated CRF datasets are scarce, limiting the development of CRF slot filling systems able to fill in a CRF from clinical notes. To mitigate the scarcity of CRF datasets, we propose to take advantage of available datasets annotated for information extraction tasks and to convert them into structured CRFs. We present a semi-automatic conversion methodology, which has been applied to the E3C dataset in two languages (English and Italian), resulting in a new, high-quality dataset for CRF slot filling. Through several experiments on the created dataset, we report that slot filling achieves 59.7% for Italian and 67.3% for English with a closed Large Language Model (zero-shot) and worse performance on three families of open-source models, showing that filling CRFs is challenging even for recent state-of-the-art LLMs. We release the dataset at https://huggingface.co/collections/NLP-FBK/e3c-to-crf-67b9844065460cbe42f80166

[168] Improving Causal Interventions in Amnesic Probing with Mean Projection or LEACE

Alicja Dobrzeniecka,Antske Fokkens,Pia Sommerauer

Main category: cs.CL

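TL;DR: Amnesic probing analyzes the influence of specific linguistic information on model behavior by removing the target information and observing the change in performance. INLP introduces random modifications, whereas MP and LEACE remove information in a more targeted way.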

Details Motivation: Study how to remove specific information from a model more precisely in order to analyze its behavior, avoiding the random modifications introduced by existing methods such as INLP. Method: Evaluate Mean Projection (MP) and LEACE as two alternatives for removing the target information in a more targeted manner. Result: MP and LEACE remove the target information more precisely, strengthening the behavioral explanations that amnesic probing can provide. Conclusion: MP and LEACE are effective alternatives to INLP that remove information more precisely and give more reliable tools for analyzing model behavior. Abstract: Amnesic probing is a technique used to examine the influence of specific linguistic information on the behaviour of a model. This involves identifying and removing the relevant information and then assessing whether the model's performance on the main task changes. If the removed information is relevant, the model's performance should decline. The difficulty with this approach lies in removing only the target information while leaving other information unchanged. It has been shown that Iterative Nullspace Projection (INLP), a widely used removal technique, introduces random modifications to representations when eliminating target information. We demonstrate that Mean Projection (MP) and LEACE, two proposed alternatives, remove information in a more targeted manner, thereby enhancing the potential for obtaining behavioural explanations through Amnesic Probing.
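
The abstract does not define Mean Projection. As a rough illustration, the sketch below assumes the common formulation for a binary attribute: compute the vector between the two class means and project every representation onto its orthogonal complement, so that the projected class means coincide. The toy data and function names are hypothetical.

```python
# Minimal sketch of a mean-projection style removal for a binary attribute.
# Assumption: the removed direction is the difference between the class means.

def mean(vectors):
    """Coordinate-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_projection(class_a, class_b):
    """Return a function that removes the direction between the class means."""
    direction = [a - b for a, b in zip(mean(class_a), mean(class_b))]
    norm_sq = dot(direction, direction)
    if norm_sq == 0:
        # The means already coincide, so there is nothing to remove.
        return lambda x: list(x)

    def project(x):
        coeff = dot(x, direction) / norm_sq
        return [xi - coeff * di for xi, di in zip(x, direction)]

    return project

# Toy 2-D representations for two attribute values (made-up numbers).
class_a = [[2.0, 1.0], [4.0, 1.0]]
class_b = [[-1.0, 0.0], [-1.0, 2.0]]

project = mean_projection(class_a, class_b)
cleaned_a = [project(x) for x in class_a]
cleaned_b = [project(x) for x in class_b]
```

Only one direction is altered, which is the sense in which a single projection is more targeted than repeatedly projecting out the directions found by successive classifiers.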

[169] LLMs for Sentence Simplification: A Hybrid Multi-Agent prompting Approach

Pratibha Zunjare,Michael Hsiao

Main category: cs.CL

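TL;DR: The paper proposes a hybrid approach combining advanced prompting with a multi-agent architecture to turn complex sentences into logically clear, simple sentences; in experiments on a video game design application it successfully simplified 70% of the sentences, compared with a 48% success rate for a single-agent approach.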

Details Motivation: Address the challenge of turning complex sentences into logically simple ones while preserving semantic and logical integrity. Method: A hybrid approach that combines advanced prompting with a multi-agent architecture. Result: The approach successfully simplified 70% of the complex sentences in a video game design application, versus a 48% success rate for a single-agent approach. Conclusion: The hybrid approach, supported by a multi-agent architecture, substantially improves sentence simplification. Abstract: This paper addresses the challenge of transforming complex sentences into sequences of logical, simplified sentences while preserving semantic and logical integrity with the help of Large Language Models. We propose a hybrid approach that combines advanced prompting with multi-agent architectures to enhance the sentence simplification process. Experimental results show that our approach was able to successfully simplify 70% of the complex sentences written for a video game design application. In comparison, a single-agent approach attained a 48% success rate on the same task.

[170] Configurable Preference Tuning with Rubric-Guided Synthetic Data

Víctor Gallego

Main category: cs.CL

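TL;DR: The paper proposes the Configurable Preference Tuning (CPT) framework, which challenges the assumption of static preferences by letting language models dynamically adjust their behavior to follow human-interpretable directives.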

Details Motivation: Human-feedback models for AI alignment, such as those behind DPO, assume a single static set of preferences, which limits adaptability; CPT aims to address this. Method: CPT uses synthetic preference data generated from structured, fine-grained rubrics and fine-tunes the model so that it adjusts its outputs according to the system prompt. Result: CPT enables fine-grained control and can model more nuanced, context-dependent human feedback. Conclusion: CPT gives language models the ability to adapt to preferences dynamically without retraining, improving flexibility and practicality. Abstract: Models of human feedback for AI alignment, such as those underpinning Direct Preference Optimization (DPO), often bake in a singular, static set of preferences, limiting adaptability. This paper challenges the assumption of monolithic preferences by introducing Configurable Preference Tuning (CPT), a novel framework for endowing language models with the ability to dynamically adjust their behavior based on explicit, human-interpretable directives. CPT leverages synthetically generated preference data, conditioned on system prompts derived from structured, fine-grained rubrics that define desired attributes like writing style. By fine-tuning with these rubric-guided preferences, the LLM learns to modulate its outputs at inference time in response to the system prompt, without retraining. This approach not only offers fine-grained control but also provides a mechanism for modeling more nuanced and context-dependent human feedback. Several experimental artifacts, such as training code, generated datasets and fine-tuned models are released at https://github.com/vicgalle/configurable-preference-tuning

[171] The Cambrian Explosion of Mixed-Precision Matrix Multiplication for Quantized Deep Learning Inference

Héctor Martínez,Adrián Castelló,Francisco D. Igual,Enrique S. Quintana-Ortí

Main category: cs.CL

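TL;DR: The paper discusses the shift in deep learning (DL) from traditional 64-bit floating-point computation to reduced-precision formats (such as FP16 and BF16) and presents strategies for adapting high-performance matrix multiplication (gemm) to mixed-precision integer (MIP) arithmetic.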

Details Motivation: As DL demands greater computational efficiency and better resource utilization, traditional high-precision floating-point computation is giving way to reduced- and mixed-precision arithmetic, and hardware architectures have evolved accordingly. However, existing gemm optimization approaches do not fully exploit the mixed-precision capabilities of modern ISAs. Method: The paper presents novel micro-kernel designs and data layouts that better exploit the mixed-precision integer arithmetic of modern ISAs (x86_64, ARM, and RISC-V). Result: Experiments show that mixed-precision integer arithmetic delivers significant performance gains over floating-point implementations on three representative CPU architectures. Conclusion: The paper marks a new era of matrix multiplication optimization driven by the demands of DL inference, which the authors call the "Cambrian period" for matrix multiplication. Abstract: Recent advances in deep learning (DL) have led to a shift from traditional 64-bit floating point (FP64) computations toward reduced-precision formats, such as FP16, BF16, and 8- or 16-bit integers, combined with mixed-precision arithmetic. This transition enhances computational throughput, reduces memory and bandwidth usage, and improves energy efficiency, offering significant advantages for resource-constrained edge devices. To support this shift, hardware architectures have evolved accordingly, now including adapted ISAs (Instruction Set Architectures) that expose mixed-precision vector units and matrix engines tailored for DL workloads. At the heart of many DL and scientific computing tasks is the general matrix-matrix multiplication gemm, a fundamental kernel historically optimized using axpy vector instructions on SIMD (single instruction, multiple data) units. However, as hardware moves toward mixed-precision dot-product-centric operations optimized for quantized inference, these legacy approaches are being phased out. In response to this, our paper revisits traditional high-performance gemm and describes strategies for adapting it to mixed-precision integer (MIP) arithmetic across modern ISAs, including x86_64, ARM, and RISC-V. Concretely, we illustrate novel micro-kernel designs and data layouts that better exploit today's specialized hardware and demonstrate significant performance gains from MIP arithmetic over floating-point implementations across three representative CPU architectures. These contributions highlight a new era of gemm optimization, driven by the demands of DL inference on heterogeneous architectures, marking what we term the "Cambrian period" for matrix multiplication.
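
To make the idea of mixed-precision integer gemm concrete, the sketch below quantizes two small float matrices to 8-bit integers, multiplies them with integer dot products, and rescales the wider integer accumulators back to floats. It is a plain-Python illustration of the arithmetic only, not the paper's micro-kernels: the symmetric per-tensor quantization scheme and all names are assumptions, and Python integers stand in for the 32-bit accumulators a real kernel would use.

```python
# Minimal sketch of int8 x int8 gemm with wide integer accumulation.
# Assumption: symmetric per-tensor quantization; real kernels use SIMD or
# matrix-engine dot-product instructions rather than Python loops.

def quantize(matrix, bits=8):
    """Map floats to signed integers in [-qmax, qmax] and return the scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for row in matrix for v in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [[max(-qmax, min(qmax, round(v / scale))) for v in row] for row in matrix]
    return q, scale

def int_gemm(qa, qb):
    """Integer matrix product: each output is a dot product of int8 values
    accumulated in a wider integer."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*qb)] for row in qa]

def quantized_gemm(a, b):
    """Quantize both operands, multiply in integers, then dequantize."""
    qa, scale_a = quantize(a)
    qb, scale_b = quantize(b)
    acc = int_gemm(qa, qb)
    return [[v * scale_a * scale_b for v in row] for row in acc]

a = [[0.5, -1.0], [0.25, 0.75]]
b = [[1.0, 0.5], [-0.5, 2.0]]
approx = quantized_gemm(a, b)  # close to the float product of a and b
```

The accumulator must be wider than the operands because a sum of products of two 8-bit values quickly exceeds the 8-bit range.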

[172] DART: Distilling Autoregressive Reasoning to Silent Thought

Nan Jiang,Ziming Wu,De-Chuan Zhan,Fuming Lai,Shaobing Lian

Main category: cs.CL

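TL;DR: DART is a self-distillation framework that replaces autoregressive Chain-of-Thought reasoning with non-autoregressive Silent Thought, significantly improving efficiency.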

Details Motivation: Autoregressive CoT reasoning incurs high computational overhead, which limits deployment in latency-sensitive applications. Method: DART introduces two training pathways, a CoT pathway and an ST pathway; the latter uses a lightweight REM module to align its hidden states with the CoT pathway and generates answers directly. Result: DART significantly improves efficiency while maintaining reasoning performance. Conclusion: DART is a feasible alternative for efficient reasoning. Abstract: Chain-of-Thought (CoT) reasoning has significantly advanced Large Language Models (LLMs) in solving complex tasks. However, its autoregressive paradigm leads to significant computational overhead, hindering its deployment in latency-sensitive applications. To address this, we propose DART (Distilling Autoregressive Reasoning to Silent Thought), a self-distillation framework that enables LLMs to replace autoregressive CoT with non-autoregressive Silent Thought (ST). Specifically, DART introduces two training pathways: the CoT pathway for traditional reasoning and the ST pathway for generating answers directly from a few ST tokens. The ST pathway utilizes a lightweight Reasoning Evolvement Module (REM) to align its hidden states with the CoT pathway, enabling the ST tokens to evolve into informative embeddings. During inference, only the ST pathway is activated, leveraging evolving ST tokens to deliver the answer directly. Extensive experimental results demonstrate that DART achieves comparable reasoning performance to existing baselines while offering significant efficiency gains, serving as a feasible alternative for efficient reasoning.

[173] DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

Mingxuan Du,Benfeng Xu,Chiwei Zhu,Xiaorui Wang,Zhendong Mao

Main category: cs.CL

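TL;DR: DeepResearch Bench is a benchmark for evaluating LLM-based Deep Research Agents (DRAs); it comprises 100 PhD-level research tasks and proposes two evaluation methods that align with human judgment.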

Details Motivation: The lack of a benchmark for systematically evaluating DRA capabilities hinders their development. Method: Propose two methods: a reference-based method with adaptive criteria for assessing report quality, and a framework for evaluating information retrieval and citation. Result: DeepResearch Bench and key components of the evaluation frameworks are open-sourced to accelerate the development of practical LLM-based agents. Conclusion: DeepResearch Bench fills a gap in DRA evaluation and provides tools for future research. Abstract: Deep Research Agents (DRAs) are a prominent category of LLM-based agents. By autonomously orchestrating multistep web exploration, targeted retrieval, and higher-order synthesis, they transform vast amounts of online information into analyst-grade, citation-rich reports--compressing hours of manual desk research into minutes. However, a comprehensive benchmark for systematically evaluating the capabilities of these agents remains absent. To bridge this gap, we present DeepResearch Bench, a benchmark consisting of 100 PhD-level research tasks, each meticulously crafted by domain experts across 22 distinct fields. Evaluating DRAs is inherently complex and labor-intensive. We therefore propose two novel methodologies that achieve strong alignment with human judgment. The first is a reference-based method with adaptive criteria to assess the quality of generated research reports. The second is a framework that evaluates a DRA's information retrieval and collection capabilities by assessing its effective citation count and overall citation accuracy. We have open-sourced DeepResearch Bench and key components of these frameworks at https://github.com/Ayanami0730/deep_research_bench to accelerate the development of practical LLM-based agents.

[174] Long-Short Alignment for Effective Long-Context Modeling in LLMs

Tianqi Du,Haotian Huang,Yifei Wang,Yisen Wang

Main category: cs.CL

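TL;DR: The paper offers a new perspective on length generalization in long-context modeling for large language models, focusing on the output distribution rather than input features. It introduces the notion of long-short alignment with an accompanying metric, develops a regularization method, and validates its effectiveness experimentally.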

Details Motivation: Large language models are limited in long-context modeling by their fixed context window, and length generalization is a key challenge. Whereas conventional approaches focus on input features, this work turns to the output distribution. Method: Through case studies on synthetic tasks, introduce the notion of long-short alignment and the Long-Short Misalignment metric, then design a regularization term that promotes alignment during training. Result: Experiments show that long-short alignment correlates strongly with length generalization performance and that the proposed regularization improves model performance. Conclusion: The study offers a new perspective on long-context modeling, achieving more effective length generalization by focusing on the output distribution and long-short alignment. Abstract: Large language models (LLMs) have exhibited impressive performance and surprising emergent properties. However, their effectiveness remains limited by the fixed context window of the transformer architecture, posing challenges for long-context modeling. Among these challenges, length generalization -- the ability to generalize to sequences longer than those seen during training -- is a classical and fundamental problem. In this work, we propose a fresh perspective on length generalization, shifting the focus from the conventional emphasis on input features such as positional encodings or data structures to the output distribution of the model. Specifically, through case studies on synthetic tasks, we highlight the critical role of long-short alignment -- the consistency of output distributions across sequences of varying lengths. Extending this insight to natural language tasks, we propose a metric called Long-Short Misalignment to quantify this phenomenon, uncovering a strong correlation between the metric and length generalization performance. Building on these findings, we develop a regularization term that promotes long-short alignment during training. Extensive experiments validate the effectiveness of our approach, offering new insights for achieving more effective long-context modeling in LLMs. Code is available at https://github.com/PKU-ML/LongShortAlignment.
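
The abstract describes long-short alignment as the consistency of output distributions across sequences of varying lengths but does not give the formula for the Long-Short Misalignment metric. The sketch below shows one plausible way to quantify such inconsistency, using a symmetric KL divergence between the output distribution obtained from a long input and the one obtained from a shortened version of it. The choice of divergence, the function names, and the toy distributions are all assumptions, not the paper's definition.

```python
import math

# Minimal sketch of a misalignment score between paired output distributions.
# Assumption: symmetric KL divergence averaged over input pairs; the paper's
# actual metric may be defined differently.

def symmetric_kl(p, q, eps=1e-12):
    """Average of KL(p || q) and KL(q || p) for two discrete distributions."""
    kl_pq = sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)

def misalignment(long_dists, short_dists):
    """Mean divergence over pairs of (long-input, short-input) distributions."""
    pairs = list(zip(long_dists, short_dists))
    return sum(symmetric_kl(p, q) for p, q in pairs) / len(pairs)

# Toy next-token distributions over a 3-token vocabulary (made-up numbers).
aligned = misalignment([[0.7, 0.2, 0.1]], [[0.7, 0.2, 0.1]])
shifted = misalignment([[0.7, 0.2, 0.1]], [[0.1, 0.2, 0.7]])
```

A score of this kind can also be added to the training loss as a penalty, which is the general shape of the regularization the abstract mentions.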

[175] Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models

Maximilian Kreutner,Marlene Lutz,Markus Strohmaier

Main category: cs.CL

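TL;DR: The study examines whether zero-shot persona prompting can accurately predict the voting behavior and policy positions of Members of the European Parliament, and finds that it predicts reasonably well (weighted F1 score of approximately 0.793).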

Details Motivation: Large language models (LLMs) display a progressive left-leaning bias in political discourse; the study uses persona prompting to simulate the political positions of different groups. Method: Use zero-shot persona prompting with limited information to predict individual voting decisions and group policy positions, and evaluate stability with respect to counterfactual arguments, different persona prompts, and generation methods. Result: The model simulates the voting behavior of Members of the European Parliament reasonably well (weighted F1 score of approximately 0.793), and the persona dataset and code are released. Conclusion: Zero-shot persona prompting can predict political behavior reasonably well and offers a feasible way to simulate the political positions of groups. Abstract: Large Language Models (LLMs) display remarkable capabilities to understand or even produce political discourse, but have been found to consistently display a progressive left-leaning bias. At the same time, so-called persona or identity prompts have been shown to produce LLM behavior that aligns with socioeconomic groups that the base model is not aligned with. In this work, we analyze whether zero-shot persona prompting with limited information can accurately predict individual voting decisions and, by aggregation, accurately predict positions of European groups on a diverse set of policies. We evaluate if predictions are stable towards counterfactual arguments, different persona prompts and generation methods. Finally, we find that we can simulate voting behavior of Members of the European Parliament reasonably well with a weighted F1 score of approximately 0.793. Our persona dataset of politicians in the 2024 European Parliament and our code are available at https://github.com/dess-mannheim/european_parliament_simulation.
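
For readers unfamiliar with the reported metric, the sketch below computes a weighted F1 score: the per-class F1 is averaged with weights proportional to how often each class occurs in the true labels. The vote labels and predictions are invented for illustration and have no connection to the paper's data.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, weighted by each class's share of the true labels."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score

# Hypothetical roll-call votes and predictions.
y_true = ["for", "for", "for", "against", "abstain"]
y_pred = ["for", "for", "against", "against", "for"]
score = weighted_f1(y_true, y_pred)
```

Weighting by support means that frequent vote types dominate the score, so a model can score well while still handling rare outcomes such as abstentions poorly.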

[176] Are Multimodal Large Language Models Pragmatically Competent Listeners in Simple Reference Resolution Tasks?

Simeon Junker,Manar Ali,Larissa Koch,Sina Zarrieß,Hendrik Buschmeier

Main category: cs.CL

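TL;DR: The study examines the linguistic abilities of multimodal large language models in reference resolution tasks with simple abstract visual stimuli (such as color patches and color grids), finding that basic pragmatic capabilities remain a challenge.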

Details Motivation: Although the task is simple for humans, it is a highly relevant probe of the pragmatic capabilities of multimodal large language models (MLLMs). Method: Design reference resolution tasks with simple abstract visual stimuli, such as color patches and color grids. Result: Even basic pragmatic capabilities, such as context-dependent interpretation of color descriptions, remain challenging for current state-of-the-art MLLMs. Conclusion: MLLMs still need to improve their basic pragmatic capabilities. Abstract: We investigate the linguistic abilities of multimodal large language models in reference resolution tasks featuring simple yet abstract visual stimuli, such as color patches and color grids. Although the task may not seem challenging for today's language models, being straightforward for human dyads, we consider it to be a highly relevant probe of the pragmatic capabilities of MLLMs. Our results and analyses indeed suggest that basic pragmatic capabilities, such as context-dependent interpretation of color descriptions, still constitute major challenges for state-of-the-art MLLMs.

[177] Post Persona Alignment for Multi-Session Dialogue Generation

Yi-Pei Chen,Noriki Nishida,Hideki Nakayama,Yuji Matsumoto

Main category: cs.CL

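TL;DR: The paper proposes PPA, a two-stage framework that addresses the challenges of maintaining long-term consistency and generating diverse, personalized responses in multi-session dialogue. PPA first generates a general response, then retrieves relevant persona memories and refines the response, which markedly improves consistency, diversity, and personalization.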

Details Motivation: Large language models perform well in single-session dialogue but struggle to maintain persona consistency and conversational coherence across multiple sessions. Existing methods typically retrieve persona information before generating a response, which limits diversity and personalization. Method: PPA is a two-stage framework that (1) generates a general response from the dialogue context alone, (2) retrieves relevant persona memories using the response as the query, and (3) refines the response to fit the speaker's persona. Result: Experiments show that PPA significantly outperforms prior methods on multi-session dialogue data in consistency, diversity, and persona relevance. Conclusion: PPA offers a more flexible and effective paradigm for long-term personalized dialogue generation. Abstract: Multi-session persona-based dialogue generation presents challenges in maintaining long-term consistency and generating diverse, personalized responses. While large language models (LLMs) excel in single-session dialogues, they struggle to preserve persona fidelity and conversational coherence across extended interactions. Existing methods typically retrieve persona information before response generation, which can constrain diversity and result in generic outputs. We propose Post Persona Alignment (PPA), a novel two-stage framework that reverses this process. PPA first generates a general response based solely on dialogue context, then retrieves relevant persona memories using the response as a query, and finally refines the response to align with the speaker's persona. This post-hoc alignment strategy promotes naturalness and diversity while preserving consistency and personalization. Experiments on multi-session LLM-generated dialogue data demonstrate that PPA significantly outperforms prior approaches in consistency, diversity, and persona relevance, offering a more flexible and effective paradigm for long-term personalized dialogue generation.
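
The ordering of the steps (generate first, retrieve with the response as the query, then refine) can be sketched as a small pipeline. Everything below is a hypothetical stand-in: the two LLM calls are replaced by fixed-string stubs, and retrieval is a toy word-overlap ranking rather than whatever retriever the authors use.

```python
# Minimal sketch of the generate -> retrieve -> refine ordering.
# All functions are hypothetical stubs, not the authors' implementation.

def generate_general_response(context):
    """Stand-in for an LLM call that sees only the dialogue context."""
    return "I spent the weekend hiking in the mountains."

def retrieve_persona(response, memories, top_k=1):
    """Toy retriever: rank persona memories by word overlap with the response."""
    query = set(response.lower().split())
    ranked = sorted(
        memories,
        key=lambda m: len(query & set(m.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def refine_response(response, persona_facts):
    """Stand-in for an LLM call that rewrites the draft to fit the persona."""
    return response + " (persona: " + "; ".join(persona_facts) + ")"

def post_persona_alignment(context, memories):
    draft = generate_general_response(context)  # step 1: context only
    facts = retrieve_persona(draft, memories)   # step 2: the draft is the query
    return refine_response(draft, facts)        # step 3: align with the persona

memories = ["I love hiking with my dog", "I work as a nurse"]
reply = post_persona_alignment("How was your weekend?", memories)
```

Using the draft as the query means retrieval is driven by what the speaker is about to say rather than by the previous turn alone, which is the reversal the abstract describes.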

[178] Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache

Xiaoran Liu,Siyang He,Qiqi Wang,Ruixiao Li,Yuerong Song,Zhigeng Liu,Linlin Li,Qun Liu,Zengfeng Huang,Qipeng Guo,Ziwei He,Xipeng Qiu

Main category: cs.CL

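TL;DR: FourierAttention exploits the heterogeneity of transformer head dimensions, projecting the long-context-insensitive dimensions onto orthogonal Fourier bases to reduce the memory footprint of the KV cache while preserving accuracy.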

Details Motivation: As context length grows, the memory demand of the KV cache becomes a bottleneck for large language models, and existing compression methods sacrifice accuracy or add computational overhead. Method: Propose FourierAttention, a framework that exploits the heterogeneity of transformer head dimensions, projects the long-context-insensitive dimensions onto orthogonal Fourier bases, and approximates their temporal evolution with fixed-length spectral coefficients. Result: Evaluated on LLaMA models, FourierAttention achieves the best long-context accuracy on LongBench and NIAH. Conclusion: With the FlashFourierAttention kernel optimizing memory, FourierAttention enables efficient deployment without sacrificing performance. Abstract: Large Language Models struggle with memory demands from the growing Key-Value (KV) cache as context lengths increase. Existing compression methods homogenize head dimensions or rely on attention-guided token pruning, often sacrificing accuracy or introducing computational overhead. We propose FourierAttention, a training-free framework that exploits the heterogeneous roles of transformer head dimensions: lower dimensions prioritize local context, while upper ones capture long-range dependencies. By projecting the long-context-insensitive dimensions onto orthogonal Fourier bases, FourierAttention approximates their temporal evolution with fixed-length spectral coefficients. Evaluations on LLaMA models show that FourierAttention achieves the best long-context accuracy on LongBench and Needle-In-A-Haystack (NIAH). Besides, a custom Triton kernel, FlashFourierAttention, is designed to optimize memory via streamlined read-write operations, enabling efficient deployment without performance compromise.
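
The core idea of replacing a growing sequence of cached values with a fixed number of spectral coefficients can be shown on a single scalar channel. The sketch below assumes a truncated discrete Fourier transform that keeps only the lowest frequencies; the abstract does not say which bases or coefficients FourierAttention keeps, so the selection rule, names, and toy signal here are illustrative assumptions.

```python
import cmath
import math

# Minimal sketch: approximate one cache dimension over time with a fixed
# number of low-frequency Fourier coefficients (an assumed truncation rule).

def fourier_coeffs(signal, k_keep):
    """DFT coefficients for frequencies 0 .. k_keep - 1."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(k_keep)
    ]

def reconstruct(coeffs, n):
    """Rebuild a real signal of length n, using conjugate symmetry for the
    negative frequencies. Requires len(coeffs) - 1 < n / 2."""
    out = []
    for t in range(n):
        value = coeffs[0].real
        for k in range(1, len(coeffs)):
            value += 2 * (coeffs[k] * cmath.exp(2j * cmath.pi * k * t / n)).real
        out.append(value / n)
    return out

# A slowly varying channel over 16 time steps (made-up data).
n = 16
signal = [1 + math.cos(2 * math.pi * t / n) for t in range(n)]
coeffs = fourier_coeffs(signal, 2)  # 2 coefficients instead of 16 values
approx = reconstruct(coeffs, n)
```

The storage cost depends on the number of coefficients kept rather than on the sequence length, which is why smoothly varying dimensions are good candidates for this kind of compression.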

[179] GeistBERT: Breathing Life into German NLP

Raphael Scheible-Schmitt,Johann Frei

Main category: cs.CL

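TL;DR: GeistBERT improves performance on German NLP tasks through incremental training and optimization, reaching state-of-the-art results on several tasks.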

Details Motivation: German NLP needs modern architectures and datasets tailored to the characteristics of the language, and GeistBERT aims to meet that need. Method: Initialized from GottBERT weights, pre-trained with Whole Word Masking (WWM), and extended into variants that support long sequences. Result: GeistBERT performs strongly on NER and text classification, outperforming larger models on some tasks. Conclusion: GeistBERT provides the German NLP community with a high-performing model, released as open source to support research. Abstract: Advances in transformer-based language models have highlighted the benefits of language-specific pre-training on high-quality corpora. In this context, German NLP stands to gain from updated architectures and modern datasets tailored to the linguistic characteristics of the German language. GeistBERT seeks to improve German language processing by incrementally training on a diverse corpus and optimizing model performance across various NLP tasks. It was pre-trained using fairseq with standard hyperparameters, initialized from GottBERT weights, and trained on a large-scale German corpus using Whole Word Masking (WWM). Based on the pre-trained model, we derived extended-input variants using Nyströmformer and Longformer architectures with support for sequences up to 8k tokens. While these long-context models were not evaluated on dedicated long-context benchmarks, they are included in our release. We assessed all models on NER (CoNLL 2003, GermEval 2014) and text classification (GermEval 2018 fine/coarse, 10kGNAD) using $F_1$ score and accuracy. The GeistBERT models achieved strong performance, leading all tasks among the base models and setting a new state-of-the-art (SOTA). Notably, the base models outperformed larger models in several tasks. To support the German NLP research community, we are releasing GeistBERT under the MIT license.
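
Whole Word Masking differs from token-level masking in that all subword pieces of a selected word are masked together. The sketch below illustrates this on a toy tokenization. It assumes WordPiece-style `##` continuation markers purely for readability; GeistBERT's actual tokenizer and masking code may mark word boundaries differently, and the function names and example tokens are hypothetical.

```python
import random

# Minimal sketch of whole word masking over subword tokens.
# Assumption: a piece starting with "##" continues the previous word.

def group_whole_words(tokens):
    """Group token indices so that each group covers one whole word."""
    groups = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and groups:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups

def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Mask entire words: every piece of a chosen word is replaced."""
    rng = random.Random(seed)
    masked = list(tokens)
    for group in group_whole_words(tokens):
        if rng.random() < mask_prob:
            for i in group:
                masked[i] = mask_token
    return masked

tokens = ["Donau", "##dampf", "##schiff", "fährt", "heute"]
groups = group_whole_words(tokens)
fully_masked = whole_word_mask(tokens, mask_prob=1.0)
```

For a language with long compounds such as German, this prevents the model from trivially recovering a masked piece from the unmasked remainder of the same word.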

[180] Effectiveness of Counter-Speech against Abusive Content: A Multidimensional Annotation and Classification Study

Greta Damo,Elena Cabrio,Serena Villata

Main category: cs.CL

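TL;DR: Proposes a computational framework grounded in social science concepts for assessing the effectiveness of counter-speech (CS); it defines six core dimensions, annotates 4,214 CS instances, and releases a new linguistic resource. Two classification strategies achieve strong results and outperform baselines.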

Details Motivation: Online hate speech (HS) is a serious problem and counter-speech (CS) is a key strategy for mitigating it, but the criteria for assessing its effectiveness are not yet settled. Method: Propose a computational framework that defines six core dimensions (Clarity, Evidence, Emotional Appeal, Rebuttal, Audience Adaptation, and Fairness) and annotate 4,214 CS instances; apply two classification strategies, multi-task and dependency-based. Result: The two strategies reach average F1 scores of 0.94 and 0.96 respectively, outperform standard baselines, and reveal strong interdependence among the dimensions. Conclusion: The framework provides a new tool for assessing CS effectiveness, the classification strategies perform strongly, and both can be refined and applied further in future work. Abstract: Counter-speech (CS) is a key strategy for mitigating online Hate Speech (HS), yet defining the criteria to assess its effectiveness remains an open challenge. We propose a novel computational framework for CS effectiveness classification, grounded in social science concepts. Our framework defines six core dimensions - Clarity, Evidence, Emotional Appeal, Rebuttal, Audience Adaptation, and Fairness - which we use to annotate 4,214 CS instances from two benchmark datasets, resulting in a novel linguistic resource released to the community. In addition, we propose two classification strategies, multi-task and dependency-based, achieving strong results (0.94 and 0.96 average F1 respectively on both expert- and user-written CS), outperforming standard baselines, and revealing strong interdependence among dimensions.

[181] Feedback Friction: LLMs Struggle to Fully Incorporate External Feedback

Dongwei Jiang,Alvin Zhang,Andrew Wang,Nicholas Andrews,Daniel Khashabi

Main category: cs.CL

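TL;DR: The study finds that even when large language models (LLMs) receive near-perfect external feedback, they still struggle to fully incorporate it and correct their wrong answers, a phenomenon the authors term "Feedback Friction".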

Details Motivation: Examine how well LLMs incorporate external feedback under ideal conditions in order to reveal the limits of their self-improvement. Method: Design an experimental environment in which a feedback generator with access to near-complete ground-truth answers provides feedback, and evaluate LLMs on multiple tasks including math reasoning and knowledge reasoning. Result: Even under ideal conditions, LLMs resist feedback (Feedback Friction), and mitigation strategies bring only limited improvement. Conclusion: Feedback Friction is a notable limitation of LLMs, and future research needs to explore its causes and remedies further. Abstract: Recent studies have shown LLMs possess some ability to improve their responses when given external feedback. However, it remains unclear how effectively and thoroughly these models can incorporate extrinsic feedback. In an ideal scenario, if LLMs receive near-perfect and complete feedback, we would expect them to fully integrate the feedback and change their incorrect answers to correct ones. In this paper, we systematically investigate LLMs' ability to incorporate feedback by designing a controlled experimental environment. For each problem, a solver model attempts a solution, then a feedback generator with access to near-complete ground-truth answers produces targeted feedback, after which the solver tries again. We evaluate this pipeline across a diverse range of tasks, including math reasoning, knowledge reasoning, scientific reasoning, and general multi-domain evaluations with state-of-the-art language models including Claude 3.7 (with and without extended thinking). Surprisingly, even under these near-ideal conditions, solver models consistently show resistance to feedback, a limitation that we term FEEDBACK FRICTION. To mitigate this limitation, we experiment with sampling-based strategies like progressive temperature increases and explicit rejection of previously attempted incorrect answers, which yield improvements but still fail to help models achieve target performance. We also perform a rigorous exploration of potential causes of FEEDBACK FRICTION, ruling out factors such as model overconfidence and data familiarity. We hope that highlighting this issue in LLMs and ruling out several apparent causes will help future research in self-improvement.

[182] Improving Large Language Model Safety with Contrastive Representation Learning

Samuel Simko,Mrinmaya Sachan,Bernhard Schölkopf,Zhijing Jin

Main category: cs.CL

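TL;DR: Proposes a defense framework based on contrastive representation learning (CRL) that uses a triplet loss with adversarial hard negative mining to make large language models (LLMs) more robust to adversarial attacks.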

Details Motivation: Large language models (LLMs) are vulnerable to adversarial attacks, and existing defenses struggle to generalize across attack types. Method: Formulate model defense as a contrastive representation learning problem, using a triplet loss with adversarial hard negative mining to separate benign from harmful representations. Result: Experiments show that the method outperforms prior representation-engineering-based defenses, improving robustness against input-level and embedding-space attacks without compromising standard performance. Conclusion: The proposed CRL framework offers an effective and general defense against adversarial attacks on LLMs. Abstract: Large Language Models (LLMs) are powerful tools with profound societal impacts, yet their ability to generate responses to diverse and uncontrolled inputs leaves them vulnerable to adversarial attacks. While existing defenses often struggle to generalize across varying attack types, recent advancements in representation engineering offer promising alternatives. In this work, we propose a defense framework that formulates model defense as a contrastive representation learning (CRL) problem. Our method finetunes a model using a triplet-based loss combined with adversarial hard negative mining to encourage separation between benign and harmful representations. Our experimental results across multiple models demonstrate that our approach outperforms prior representation engineering-based defenses, improving robustness against both input-level and embedding-space attacks without compromising standard performance. Our code is available at https://github.com/samuelsimko/crl-llm-defense
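
A triplet loss pulls an anchor representation toward a positive example and pushes it away from a negative one by at least a margin. The sketch below shows the textbook form with Euclidean distance, plus a simple notion of hard negative mining that picks the candidate closest to the anchor. The distance function, the margin, and the mining rule are assumptions for illustration; in particular, the paper's negatives are generated adversarially, which this toy example does not attempt.

```python
import math

# Minimal sketch of a triplet loss with a simple hard-negative selection.
# Assumption: Euclidean distance and "closest negative" mining.

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def hardest_negative(anchor, negatives):
    """Pick the negative closest to the anchor, i.e. the hardest to separate."""
    return min(negatives, key=lambda neg: euclidean(anchor, neg))

# Toy 2-D representations (made-up numbers).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]
negatives = [[3.0, 4.0], [1.0, 0.0]]

hard = hardest_negative(anchor, negatives)
loss_hard = triplet_loss(anchor, positive, hard)
loss_easy = triplet_loss(anchor, positive, [3.0, 4.0])
```

The easy negative contributes no loss because it is already well separated, which is why training on hard negatives gives a more useful learning signal.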

[183] code_transformed: The Influence of Large Language Models on Code

Yuliang Xu,Siming Huang,Mingmeng Geng,Yao Wan,Xuanhua Shi,Dongping Chen

Main category: cs.CL

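TL;DR: This paper studies the impact of large language models (LLMs) on code style, analyzing naming conventions, complexity, maintainability, and similarity, and finds that LLMs have measurably changed programming style.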

Details Motivation: With the rapid progress of LLMs, code generation has begun to reshape programming practice; the paper asks whether LLMs have changed code style and how that change shows up. Method: Analyze code from more than 19,000 GitHub repositories linked to arXiv papers published between 2020 and 2025 to quantify trends in code style. Result: The trends align with characteristics of LLM-generated code; for example, the share of snake_case variable names in Python rose from 47% in Q1 2023 to 51% in Q1 2025. Conclusion: LLMs have a measurable effect on real-world programming style, although their exact share of contributed code is hard to estimate. Abstract: Coding remains one of the most fundamental modes of interaction between humans and machines. With the rapid advancement of Large Language Models (LLMs), code generation capabilities have begun to significantly reshape programming practices. This development prompts a central question: Have LLMs transformed code style, and how can such transformation be characterized? In this paper, we present a pioneering study that investigates the impact of LLMs on code style, with a focus on naming conventions, complexity, maintainability, and similarity. By analyzing code from over 19,000 GitHub repositories linked to arXiv papers published between 2020 and 2025, we identify measurable trends in the evolution of coding style that align with characteristics of LLM-generated code. For instance, the proportion of snake_case variable names in Python code increased from 47% in Q1 2023 to 51% in Q1 2025. Furthermore, we investigate how LLMs approach algorithmic problems by examining their reasoning processes. Given the diversity of LLMs and usage scenarios, among other factors, it is difficult or even impossible to precisely estimate the proportion of code generated or assisted by LLMs. Our experimental results provide the first large-scale empirical evidence that LLMs affect real-world programming style.
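
A statistic like the share of snake_case variable names can be measured with Python's standard `ast` module. The sketch below collects every name that is assigned to and checks it against a regular expression. The regex (lowercase words joined by at least one underscore) and the decision to count only assignment targets are assumptions made for this example; the paper may classify names differently.

```python
import ast
import re

# Minimal sketch: fraction of assigned variable names that are snake_case.
# Assumption: snake_case means lowercase segments joined by underscores.
SNAKE_CASE = re.compile(r"^[a-z]+(_[a-z0-9]+)+$")

def variable_names(source):
    """Collect every name that appears as an assignment target."""
    tree = ast.parse(source)
    return [
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)
    ]

def snake_case_share(source):
    names = variable_names(source)
    if not names:
        return 0.0
    return sum(1 for name in names if SNAKE_CASE.match(name)) / len(names)

sample = """
total_count = 0
userName = "a"
max_len = 3
x = 1
"""
share = snake_case_share(sample)
```

Single-word names such as `x` count as not snake_case under this regex, so the exact percentage depends heavily on how such names are classified.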