cs.CV

[1] Multilinear subspace learning for person re-identification based fusion of high order tensor features

Ammar Chouchane,Mohcene Bessaoudi,Hamza Kheddar,Abdelmalik Ouamane,Tiago Vieira,Mahmoud Hassaballah

Main category: cs.CV

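TL;DR: The paper proposes High-Dimensional Feature Fusion (HDFF), which combines CNN and LOMO features through tensor fusion and TXQDA learning to improve the accuracy of person re-identification (PRe-ID).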

Details

Motivation: Person re-identification is a key challenge in video surveillance, and existing methods still leave room for improvement in feature extraction and representation. Method: HDFF fuses CNN and LOMO features with a tensor fusion scheme, applies TXQDA for multilinear subspace learning, and matches with cosine similarity. Result: Experiments on the VIPeR, GRID, and PRID450S datasets show that the method outperforms recent state-of-the-art methods. Conclusion: Combining HDFF with TXQDA significantly improves person re-identification performance, which validates the approach. Abstract: Video surveillance image analysis and processing is a challenging field in computer vision, with one of its most difficult tasks being Person Re-Identification (PRe-ID). PRe-ID aims to identify and track target individuals who have already been detected in a network of cameras, using a robust description of their pedestrian images. The success of recent research in PRe-ID is largely due to effective feature extraction and representation, as well as the powerful learning of these features to reliably discriminate between pedestrian images. To this end, two powerful features, Convolutional Neural Networks (CNN) and Local Maximal Occurrence (LOMO), are modeled on multidimensional data using the proposed method, High-Dimensional Feature Fusion (HDFF). Specifically, a new tensor fusion scheme is introduced to leverage and combine these two types of features in a single tensor, even though their dimensions are not identical. To enhance the system's accuracy, we employ Tensor Cross-View Quadratic Analysis (TXQDA) for multilinear subspace learning, followed by cosine similarity for matching. TXQDA efficiently facilitates learning while reducing the high dimensionality inherent in high-order tensor data. The effectiveness of our approach is verified through experiments on three widely used PRe-ID datasets: VIPeR, GRID, and PRID450S. Extensive experiments demonstrate that our approach outperforms recent state-of-the-art methods.
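
The cosine-similarity matching step named in the abstract above can be illustrated with a minimal sketch. It assumes each pedestrian image has already been reduced to a flat feature vector (for example, after subspace projection); the function names and the toy gallery are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def rank_gallery(probe, gallery):
    """Return gallery identities ordered from most to least similar to the probe.

    `gallery` maps a (hypothetical) identity label to its feature vector.
    """
    scored = [(cosine_similarity(probe, feat), pid) for pid, feat in gallery.items()]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pid for _, pid in scored]


# Toy example: identity "a" points almost the same way as the probe.
toy_gallery = {"a": [2.0, 0.1], "b": [0.0, 1.0], "c": [-1.0, 0.0]}
ranking = rank_gallery([1.0, 0.0], toy_gallery)
```

Because cosine similarity ignores vector length, two descriptors that differ only in scale are treated as a perfect match, which is one common reason to prefer it over Euclidean distance after projection.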

[2] Generative AI for Autonomous Driving: A Review

Katharina Winter,Abhishek Vivekanandan,Rupert Polley,Yinzhe Shen,Christian Schlauch,Mohamed-Khalil Bouzidi,Bojan Derajic,Natalie Grabowsky,Annajoyce Mariani,Dennis Rochau,Giovanni Lucente,Harsh Yadav,Firas Mualla,Adam Molin,Sebastian Bernhard,Christian Wirth,Ömer Şahin Taş,Nadja Klein,Fabian B. Flohr,Hanno Gottschalk

Main category: cs.CV

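TL;DR: Generative AI (GenAI) in autonomous driving (AD) extends beyond traditional text, image, and video generation to tasks such as static map creation, dynamic scenario generation, trajectory forecasting, and vehicle motion planning.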

Details

Motivation: To explore how generative models can enhance autonomous driving tasks and to compare the strengths and weaknesses of different generative approaches. Method: The review analyzes several generative approaches (VAEs, GANs, INNs, GTs, and DMs) and their hybrids, and discusses relevant datasets and future research directions. Result: Generative AI shows promise for autonomous driving, but challenges in safety, interpretability, and real-time capability must still be addressed. Conclusion: Generative AI offers new tools for autonomous driving, but further research is needed to resolve the core challenges. Abstract: Generative AI (GenAI) is rapidly advancing the field of Autonomous Driving (AD), extending beyond traditional applications in text, image, and video generation. We explore how generative models can enhance automotive tasks, such as static map creation, dynamic scenario generation, trajectory forecasting, and vehicle motion planning. By examining multiple generative approaches ranging from Variational Autoencoders (VAEs) through Generative Adversarial Networks (GANs) and Invertible Neural Networks (INNs) to Generative Transformers (GTs) and Diffusion Models (DMs), we highlight and compare their capabilities and limitations for AD-specific applications. Additionally, we discuss hybrid methods integrating conventional techniques with generative approaches, and emphasize their improved adaptability and robustness. We also identify relevant datasets and outline open research questions to guide future developments in GenAI. Finally, we discuss three core challenges: safety, interpretability, and real-time capabilities, and present recommendations for image generation, dynamic scenario generation, and planning.

[3] How Do Large Vision-Language Models See Text in Image? Unveiling the Distinctive Role of OCR Heads

Ingeol Baek,Hwan Chang,Sunghyun Ryu,Hwanhee Lee

Main category: cs.CV

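TL;DR: The paper studies the specific attention heads in Large Vision-Language Models (LVLMs) that recognize text in images (OCR Heads), finds that they are less sparse, qualitatively distinct, and statically activated, and validates these findings on downstream tasks.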

Details

Motivation: Despite significant progress in LVLMs, how they locate and interpret textual information in images remains poorly understood. Method: The authors examine several LVLMs, identify the OCR Heads responsible for text recognition, and verify their properties through Chain-of-Thought (CoT) and masking experiments. Result: OCR Heads are less sparse, qualitatively distinct, and statically activated, and redistributing their sink-token values improves performance. Conclusion: The study reveals the internal mechanisms LVLMs use to process text embedded in images, providing a basis for further optimization. Abstract: Despite significant advancements in Large Vision Language Models (LVLMs), a gap remains, particularly regarding their interpretability and how they locate and interpret textual information within images. In this paper, we explore various LVLMs to identify the specific heads responsible for recognizing text from images, which we term the Optical Character Recognition Head (OCR Head). Our findings regarding these heads are as follows: (1) Less Sparse: Unlike previous retrieval heads, a large number of heads are activated to extract textual information from images. (2) Qualitatively Distinct: OCR heads possess properties that differ significantly from general retrieval heads, exhibiting low similarity in their characteristics. (3) Statically Activated: The frequency of activation for these heads closely aligns with their OCR scores. We validate our findings in downstream tasks by applying Chain-of-Thought (CoT) to both OCR and conventional retrieval heads and by masking these heads. We also demonstrate that redistributing sink-token values within the OCR heads improves performance. These insights provide a deeper understanding of the internal mechanisms LVLMs employ in processing embedded textual information in images.

[4] SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval

Nikolaos Chaidos,Angeliki Dimitriou,Maria Lymperaiou,Giorgos Stamou

Main category: cs.CV

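TL;DR: The paper proposes SCENIR, a scene graph-based image retrieval framework that uses an unsupervised graph autoencoder to remove the reliance on annotated data and the lack of semantic understanding in conventional methods, and that outperforms existing methods in both performance and efficiency.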

Details

Motivation: Existing convolutional and transformer-based architectures for image retrieval are prone to biases from low-level visual features such as color and lack semantic understanding. In addition, scene graph retrieval methods based on supervised graph neural networks depend on inconsistent annotated data. Method: SCENIR is an unsupervised retrieval framework built on a graph autoencoder that needs no labeled data; the paper also introduces Graph Edit Distance (GED), for the first time, as a deterministic measure of scene graph similarity. Result: SCENIR outperforms existing vision-based, multimodal, and supervised GNN methods in both performance and runtime efficiency, and automated scene graph generation confirms that it generalizes to unannotated datasets. Conclusion: Through unsupervised learning and the GED measure, SCENIR markedly improves the semantic understanding and reliability of image retrieval and advances the state of the art in counterfactual image retrieval. Abstract: Despite the dominance of convolutional and transformer-based architectures in image-to-image retrieval, these models are prone to biases arising from low-level visual features, such as color. Recognizing the lack of semantic understanding as a key limitation, we propose a novel scene graph-based retrieval framework that emphasizes semantic content over superficial image characteristics. Prior approaches to scene graph retrieval predominantly rely on supervised Graph Neural Networks (GNNs), which require ground truth graph pairs derived from image captions. However, the inconsistency of caption-based supervision stemming from variable text encodings undermines retrieval reliability. To address these, we present SCENIR, a Graph Autoencoder-based unsupervised retrieval framework, which eliminates the dependence on labeled training data. Our model demonstrates superior performance across metrics and runtime efficiency, outperforming existing vision-based, multimodal, and supervised GNN approaches. We further advocate for Graph Edit Distance (GED) as a deterministic and robust ground truth measure for scene graph similarity, replacing the inconsistent caption-based alternatives for the first time in image-to-image retrieval evaluation. Finally, we validate the generalizability of our method by applying it to unannotated datasets via automated scene graph generation, while substantially contributing to advancing the state of the art in counterfactual image retrieval.

[5] Satellites Reveal Mobility: A Commuting Origin-destination Flow Generator for Global Cities

Can Rong,Xin Zhang,Yanxin Xi,Hongjie Sui,Jingtao Ding,Yong Li

Main category: cs.CV

TL;DR: GlODGen uses satellite imagery and population data to generate OD flow data for cities worldwide, replacing costly traditional surveys.

Details

Motivation: Traditional ways of obtaining OD flow data are costly and raise privacy concerns, whereas satellite imagery provides rich semantic information. Method: GlODGen combines Vision-Language Geo-Foundation Models with graph diffusion models to extract semantic features from satellite imagery and generate OD flows. Result: In experiments on 6 cities across 4 continents, the OD data generated by GlODGen is highly consistent with real-world data. Conclusion: GlODGen is an efficient, general-purpose OD data generation tool applicable to cities worldwide. Abstract: Commuting Origin-destination (OD) flows, capturing daily population mobility of citizens, are vital for sustainable development across cities around the world. However, it is challenging to obtain the data due to the high cost of travel surveys and privacy concerns. Surprisingly, we find that satellite imagery, publicly available across the globe, contains rich urban semantic signals to support high-quality OD flow generation, with over 98% expressiveness of traditional multisource hard-to-collect urban sociodemographic, economics, land use, and point of interest data. This inspires us to design a novel data generator, GlODGen, which can generate OD flow data for any city of interest around the world. Specifically, GlODGen first leverages Vision-Language Geo-Foundation Models to extract urban semantic signals related to human mobility from satellite imagery. These features are then combined with population data to form region-level representations, which are used to generate OD flows via graph diffusion models. Extensive experiments on 4 continents and 6 representative cities show that GlODGen has great generalizability across diverse urban environments on different continents and can generate OD flow data for global cities highly consistent with real-world mobility data. We implement GlODGen as an automated tool, seamlessly integrating data acquisition and curation, urban semantic feature extraction, and OD flow generation together. It has been released at https://github.com/tsinghua-fib-lab/generate-od-pubtools.

[6] Decouple and Orthogonalize: A Data-Free Framework for LoRA Merging

Shenghe Zheng,Hongzhi Wang,Chenyu Huang,Xiaohui Wang,Tao Chen,Jiayuan Fan,Shuyue Hu,Peng Ye

Main category: cs.CV

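TL;DR: The paper proposes a decoupled and orthogonal merging method for LoRA models (DO-Merging). It addresses the poor performance of existing merging methods on LoRA by separating parameters into magnitude and direction components and merging them independently, which significantly improves merging performance.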

Details

Motivation: As open-source models multiply, model merging has become an important way to cut training, storage, and inference costs. Existing research, however, focuses on merging fully fine-tuned models and overlooks the popular LoRA. Empirical analysis shows that existing methods perform poorly on LoRA and that large variance in parameter magnitude degrades merging performance. Method: DO-Merging decouples parameters into magnitude and direction components and merges them independently, and adds a data-free, layer-wise gradient descent method with orthogonal constraints to reduce interference among direction components. Result: Experiments show that DO-Merging significantly outperforms existing methods on vision, language, and multi-modal tasks at minimal cost. Conclusion: Through decoupling and orthogonal merging, DO-Merging effectively addresses LoRA model merging; each component can be flexibly integrated with existing methods, giving near-free improvements across tasks. Abstract: With more open-source models available for diverse tasks, model merging has gained attention by combining models into one, reducing training, storage, and inference costs. Current research mainly focuses on model merging for full fine-tuning, overlooking the popular LoRA. However, our empirical analysis reveals that: a) existing merging methods designed for full fine-tuning perform poorly on LoRA; b) LoRA modules show much larger parameter magnitude variance than full fine-tuned weights; c) greater parameter magnitude variance correlates with worse merging performance. Considering that large magnitude variances cause deviations in the distribution of the merged parameters, resulting in information loss and performance degradation, we propose a Decoupled and Orthogonal merging approach (DO-Merging). By separating parameters into magnitude and direction components and merging them independently, we reduce the impact of magnitude differences on the directional alignment of the merged models, thereby preserving task information. Furthermore, we introduce a data-free, layer-wise gradient descent method with orthogonal constraints to mitigate interference during the merging of direction components. We provide theoretical guarantees for both the decoupling and orthogonal components. We validate through extensive experiments across vision, language, and multi-modal domains that our proposed DO-Merging can achieve significantly higher performance than existing merging methods at a minimal cost. Notably, each component can be flexibly integrated with existing methods, offering near free-lunch improvements across tasks.
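
The magnitude/direction decoupling described above can be illustrated with a toy sketch. It is not the authors' algorithm: it assumes the magnitude is the L2 norm of a flattened parameter vector, averages magnitudes and unit directions separately, and leaves out the orthogonality-constrained gradient descent step entirely. All names are hypothetical.

```python
import math


def decouple(vec):
    """Split a vector into (magnitude, unit direction); assumes L2 norm as magnitude."""
    mag = math.sqrt(sum(x * x for x in vec))
    if mag == 0.0:
        return 0.0, [0.0] * len(vec)
    return mag, [x / mag for x in vec]


def merge_decoupled(vectors):
    """Average magnitudes and directions independently, then recombine.

    Simplification for illustration only: directions are merged by summing
    unit vectors and renormalizing, with no orthogonal constraint.
    """
    parts = [decouple(v) for v in vectors]
    mean_mag = sum(mag for mag, _ in parts) / len(parts)
    dim = len(vectors[0])
    summed_dir = [sum(direction[i] for _, direction in parts) for i in range(dim)]
    _, merged_dir = decouple(summed_dir)
    return [mean_mag * x for x in merged_dir]


def merge_plain(vectors):
    """Baseline: element-wise average, where large-magnitude vectors dominate."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Two toy "task vectors" with very different magnitudes.
task_vectors = [[10.0, 0.0], [0.0, 1.0]]
plain = merge_plain(task_vectors)          # [5.0, 0.5]: dominated by the first task
decoupled = merge_decoupled(task_vectors)  # both directions weighted equally
```

In the plain average the second task contributes only a tenth as much as the first; in the decoupled merge both directions contribute equally while the overall scale is the mean of the two magnitudes.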

[7] Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval

Siting Li,Xiang Gao,Simon Shaolei Du

Main category: cs.CV

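TL;DR: The paper introduces the COCO-Facet benchmark to evaluate text-to-image retrievers on attribute-focused queries, finds that existing approaches (CLIP-like and MLLM-based retrievers) perform poorly, and proposes promptable image embeddings to improve performance.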

Details

Motivation: Existing text-to-image retrievers, notably CLIP-like and MLLM-based ones, perform poorly on attribute-focused queries because their image embeddings capture global semantics and overlook details. Method: The authors propose promptable image embeddings generated by multimodal retrievers, together with two acceleration strategies: pre-processing promptable embeddings and linear approximation. Result: Experiments show that pre-processed promptable embeddings improve Recall@5 by 15% when prompts are predefined, while linear approximation yields an 8% improvement when prompts are only available at inference time. Conclusion: Promptable image embeddings effectively improve performance on attribute-focused queries and are both general and practical. Abstract: While an image is worth more than a thousand words, only a few provide crucial information for a given task and thus should be focused on. In light of this, ideal text-to-image (T2I) retrievers should prioritize specific visual attributes relevant to queries. To evaluate current retrievers on handling attribute-focused queries, we build COCO-Facet, a COCO-based benchmark with 9,112 queries about diverse attributes of interest. We find that CLIP-like retrievers, which are widely adopted due to their efficiency and zero-shot ability, have poor and imbalanced performance, possibly because their image embeddings focus on global semantics and subjects while leaving out other details. Notably, we reveal that even recent Multimodal Large Language Model (MLLM)-based, stronger retrievers with a larger output dimension struggle with this limitation. Hence, we hypothesize that retrieving with general image embeddings is suboptimal for performing such queries. As a solution, we propose to use promptable image embeddings enabled by these multimodal retrievers, which boost performance by highlighting required attributes. Our pipeline for deriving such embeddings generalizes across query types, image pools, and base retriever architectures. To enhance real-world applicability, we offer two acceleration strategies: pre-processing promptable embeddings and using linear approximations. We show that the former yields a 15% improvement in Recall@5 when prompts are predefined, while the latter achieves an 8% improvement when prompts are only available during inference.
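
For readers unfamiliar with the Recall@5 metric reported above, here is a minimal sketch of one common definition: the fraction of a query's relevant images that appear among the top K retrieved results, averaged over queries. Whether COCO-Facet uses exactly this definition is an assumption; the image IDs are made up.

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of relevant items that appear in the top-k of a ranked list."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def mean_recall_at_k(results, k=5):
    """Average Recall@k over (ranked_ids, relevant_ids) pairs, one per query."""
    if not results:
        return 0.0
    return sum(recall_at_k(ranked, relevant, k) for ranked, relevant in results) / len(results)


# Toy example with two queries over hypothetical image IDs.
toy_results = [
    (["a", "b", "c", "d", "e", "f"], {"b", "f"}),  # one of two relevant items in top 5
    (["x", "y", "z"], {"x"}),                      # the single relevant item is ranked first
]
score = mean_recall_at_k(toy_results, k=5)
```

When every query has exactly one relevant image, this definition coincides with the simpler "hit rate" (did the correct image appear in the top K at all).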

[8] GRIT: Teaching MLLMs to Think with Images

Yue Fan,Xuehai He,Diji Yang,Kaizhi Zheng,Ching-Chen Kuo,Yuting Zheng,Sravana Jyothi Narayanaraju,Xinze Guan,Xin Eric Wang

Main category: cs.CV

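TL;DR: GRIT proposes a reasoning method that combines natural language with visual information, training multimodal language models with reinforcement learning to generate visually grounded reasoning chains.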

Details

Motivation: Existing visual reasoning models generate reasoning content in natural language only, without explicitly integrating visual information, which limits their ability to produce clear and visually grounded reasoning chains. Method: GRIT introduces a grounded reasoning paradigm in which the model generates reasoning chains that interleave natural language with explicit bounding box coordinates, trained with the GRPO-GR reinforcement learning algorithm. Result: GRIT trains models efficiently with very little data (as few as 20 image-question-answer triplets) and produces coherent, visually grounded reasoning chains. Conclusion: GRIT successfully unifies reasoning and grounding abilities, offering an effective solution for visual reasoning in multimodal language models. Abstract: Recent studies have demonstrated the efficacy of using Reinforcement Learning (RL) in building reasoning models that articulate chains of thought prior to producing final answers. However, despite ongoing advances that aim at enabling reasoning for vision-language tasks, existing open-source visual reasoning models typically generate reasoning content with pure natural language, lacking explicit integration of visual information. This limits their ability to produce clearly articulated and visually grounded reasoning chains. To this end, we propose Grounded Reasoning with Images and Texts (GRIT), a novel method for training MLLMs to think with images. GRIT introduces a grounded reasoning paradigm, in which models generate reasoning chains that interleave natural language and explicit bounding box coordinates. These coordinates point to regions of the input image that the model consults during its reasoning process. Additionally, GRIT is equipped with a reinforcement learning approach, GRPO-GR, built upon the GRPO algorithm. GRPO-GR employs robust rewards focused on the final answer accuracy and format of the grounded reasoning output, which eliminates the need for data with reasoning chain annotations or explicit bounding box labels. As a result, GRIT achieves exceptional data efficiency, requiring as few as 20 image-question-answer triplets from existing datasets. Comprehensive evaluations demonstrate that GRIT effectively trains MLLMs to produce coherent and visually grounded reasoning chains, showing a successful unification of reasoning and grounding abilities.
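
To make the idea of a format-focused reward on grounded reasoning output concrete, here is a hypothetical sketch. The abstract does not specify the output syntax or the reward formula, so everything below is an assumption: boxes are taken to be written as `[x1, y1, x2, y2]` in integer pixels, and the reward is 1.0 only if at least one box is present and every box lies inside the image.

```python
import re

# Assumed (hypothetical) syntax for a box embedded in reasoning text: [x1, y1, x2, y2]
BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def extract_boxes(text):
    """Return every [x1, y1, x2, y2] group found in the text as integer tuples."""
    return [tuple(int(v) for v in match.groups()) for match in BOX_PATTERN.finditer(text)]


def format_reward(text, width, height):
    """Toy format reward: 1.0 if the text contains only well-formed, in-bounds boxes."""
    boxes = extract_boxes(text)
    if not boxes:
        return 0.0  # no grounding at all
    for x1, y1, x2, y2 in boxes:
        if not (0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height):
            return 0.0  # degenerate or out-of-image box
    return 1.0


grounded = "The sign at [10, 20, 110, 80] reads STOP, so the car must halt."
ungrounded = "The sign reads STOP, so the car must halt."
```

A reward of this kind needs no bounding box labels, which matches the abstract's point that the training signal comes from the answer and the output format rather than from annotated reasoning chains.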

[9] Challenger: Affordable Adversarial Driving Video Generation

Zhiyuan Xu,Bohan Li,Huan-ang Gao,Mingju Gao,Yong Chen,Ming Liu,Chenxu Yan,Hang Zhao,Shuo Feng,Hao Zhao

Main category: cs.CV

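TL;DR: The Challenger framework generates physically plausible, photorealistic adversarial driving videos that significantly raise the collision rate of autonomous driving models.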

Details

Motivation: Current methods focus mainly on ordinary driving scenarios and lack realistic adversarial sensor data for testing autonomous driving systems. Method: Challenger uses physics-aware multi-round trajectory refinement and a tailored trajectory scoring function to generate realistic adversarial driving videos. Result: On the nuScenes dataset, Challenger generates diverse adversarial scenarios that significantly increase the collision rate of several state-of-the-art autonomous driving models. Conclusion: Challenger effectively generates realistic adversarial driving videos, and the adversarial behaviors transfer across models. Abstract: Generating photorealistic driving videos has seen significant progress recently, but current methods largely focus on ordinary, non-adversarial scenarios. Meanwhile, efforts to generate adversarial driving scenarios often operate on abstract trajectory or BEV representations, falling short of delivering realistic sensor data that can truly stress-test autonomous driving (AD) systems. In this work, we introduce Challenger, a framework that produces physically plausible yet photorealistic adversarial driving videos. Generating such videos poses a fundamental challenge: it requires jointly optimizing over the space of traffic interactions and high-fidelity sensor observations. Challenger makes this affordable through two techniques: (1) a physics-aware multi-round trajectory refinement process that narrows down candidate adversarial maneuvers, and (2) a tailored trajectory scoring function that encourages realistic yet adversarial behavior while maintaining compatibility with downstream video synthesis. As tested on the nuScenes dataset, Challenger generates a diverse range of aggressive driving scenarios (including cut-ins, sudden lane changes, tailgating, and blind spot intrusions) and renders them into multiview photorealistic videos. Extensive evaluations show that these scenarios significantly increase the collision rate of state-of-the-art end-to-end AD models (UniAD, VAD, SparseDrive, and DiffusionDrive), and importantly, adversarial behaviors discovered for one model often transfer to others.

[10] ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation

Tony Montes,Fernando Lozano

Main category: cs.CV

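TL;DR: The paper presents an LLM-brained agent that combines a Chain-of-Thought framework with YOLO-World for zero-shot video question answering (VideoQA), substantially improving object tracking and alignment and achieving state-of-the-art results on several benchmarks.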

Details

Motivation: Current VideoQA systems still have room to improve in object tracking and reasoning-based decision-making, particularly in aligning object references with language model outputs. Method: An LLM-brained agent combines a Chain-of-Thought framework with YOLO-World to strengthen object tracking and alignment. Result: The approach sets a new state of the art on the NExT-QA, iVQA, and ActivityNet-QA benchmarks and improves output reliability. Conclusion: The method offers a more effective solution for VideoQA and video understanding, supporting cross-domain verification and greater reliability. Abstract: Recent advancements in Video Question Answering (VideoQA) have introduced LLM-based agents, modular frameworks, and procedural solutions, yielding promising results. These systems use dynamic agents and memory-based mechanisms to break down complex tasks and refine answers. However, significant room for improvement remains in tracking objects for grounding over time and in reasoning-based decision-making to better align object references with language model outputs, as newer models get better at both tasks. This work presents an LLM-brained agent for zero-shot VideoQA that combines a Chain-of-Thought framework with grounding reasoning alongside YOLO-World to enhance object tracking and alignment. This approach establishes a new state of the art in VideoQA and Video Understanding, showing enhanced performance on NExT-QA, iVQA, and ActivityNet-QA benchmarks. Our framework also enables cross-checking of grounding timeframes, improving accuracy and providing valuable support for verification and increased output reliability across multiple video domains. The code is available at https://github.com/t-montes/viqagent.

[11] VideoGameQA-Bench: Evaluating Vision-Language Models for Video Game Quality Assurance

Mohammad Reza Taesiri,Abhijay Ghildyal,Saman Zadtootaghaj,Nabajeet Barman,Cor-Paul Bezemer

Main category: cs.CV

TL;DR: The paper introduces VideoGameQA-Bench, a comprehensive benchmark for evaluating vision-language models on video game QA tasks.

Details

Motivation: The games industry generates high revenue, yet its QA process is labor-intensive with limited automation options, so a standardized benchmark is needed to evaluate VLMs on game QA. Method: VideoGameQA-Bench covers a range of game QA tasks, including visual unit testing, visual regression testing, and glitch detection. Result: The benchmark provides a comprehensive evaluation suite for game QA tasks on both images and videos. Conclusion: VideoGameQA-Bench fills a gap left by existing benchmarks and provides a standardized tool for automating game QA. Abstract: With video games now generating the highest revenues in the entertainment industry, optimizing game development workflows has become essential for the sector's sustained growth. Recent advancements in Vision-Language Models (VLMs) offer considerable potential to automate and enhance various aspects of game development, particularly Quality Assurance (QA), which remains one of the industry's most labor-intensive processes with limited automation options. To accurately evaluate the performance of VLMs in video game QA tasks and determine their effectiveness in handling real-world scenarios, there is a clear need for standardized benchmarks, as existing benchmarks are insufficient to address the specific requirements of this domain. To bridge this gap, we introduce VideoGameQA-Bench, a comprehensive benchmark that covers a wide array of game QA activities, including visual unit testing, visual regression testing, needle-in-a-haystack tasks, glitch detection, and bug report generation for both images and videos of various games. Code and data are available at: https://asgaardlab.github.io/videogameqa-bench/

[12] Super-Resolution with Structured Motion

Gabby Litterio,Juan-David Lizarazo-Ferro,Pedro Felzenszwalb,Rashid Zia

Main category: cs.CV

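TL;DR: With high-precision motion information, sparse image priors, and convex optimization, resolution can be increased by large factors; the paper also shows that motion blur can help super-resolution.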

Details

Motivation: To examine the theoretical limits of super-resolution, in particular the effect of motion blur on resolution gains. Method: The approach uses high-precision motion information, sparse image priors, and convex optimization, together with pseudo-random motion, for super-resolution reconstruction. Result: Experiments show that a high-resolution target can be reconstructed from a single low-resolution image, with the method validated on both simulated and real data. Conclusion: Motion blur can aid super-resolution, and optimization-based methods can achieve large increases in resolution. Abstract: We consider the limits of super-resolution using imaging constraints. Due to various theoretical and practical limitations, reconstruction-based methods have been largely restricted to small increases in resolution. In addition, motion blur is usually seen as a nuisance that impedes super-resolution. We show that by using high-precision motion information, sparse image priors, and convex optimization, it is possible to increase resolution by large factors. A key operation in super-resolution is deconvolution with a box. In general, convolution with a box is not invertible. However, we obtain perfect reconstructions of sparse signals using convex optimization. We also show that motion blur can be helpful for super-resolution. We demonstrate that using pseudo-random motion it is possible to reconstruct a high-resolution target using a single low-resolution image. We present numerical experiments with simulated data and results with real data captured by a camera mounted on a computer-controlled stage.
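
The abstract's remark that convolution with a box is generally not invertible can be seen in a small toy case. The sketch below assumes a one-dimensional signal with circular (wrap-around) boundaries and a box of width 2; these choices, the signals, and the function names are illustrative and are not taken from the paper.

```python
def circular_box_blur(signal, width):
    """Average each sample with the next `width - 1` samples, wrapping around."""
    n = len(signal)
    return [sum(signal[(i + j) % n] for j in range(width)) / width for i in range(n)]


def l1_norm(signal):
    """Sum of absolute values, a standard convex surrogate for sparsity."""
    return sum(abs(x) for x in signal)


# A sparse signal with a single spike.
sparse = [0, 0, 3, 0, 0, 0]

# An alternating signal lies in the null space of a width-2 circular box blur:
# every pair of neighbours sums to zero.
alternating = [1, -1, 1, -1, 1, -1]

# Adding a null-space component changes the signal but not its blurred version.
dense = [s + a for s, a in zip(sparse, alternating)]

blur_sparse = circular_box_blur(sparse, 2)
blur_dense = circular_box_blur(dense, 2)
```

Here `blur_sparse` and `blur_dense` are identical, so the blur alone cannot tell the two signals apart. Among the candidates consistent with the observation, the sparse one has the smaller L1 norm (3 versus 9), which hints at why a sparsity prior enforced through convex optimization can single out the intended signal.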

[13] OViP: Online Vision-Language Preference Learning

Shujun Liu,Siyuan Wang,Zejun Li,Jianxiang Wang,Cheng Zeng,Zhongyu Wei

Main category: cs.CV

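TL;DR: The OViP framework dynamically constructs contrastive training data to reduce hallucination in large vision-language models while preserving their multi-modal capabilities.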

Details

Motivation: Large vision-language models (LVLMs) tend to hallucinate content that does not match the visual input, and existing methods rely on predefined or randomly edited negative samples, which limits their effectiveness. Method: The Online Vision-language Preference Learning (OViP) framework dynamically constructs contrastive data from the model's own hallucinated outputs and uses a diffusion model to synthesize negative images. Result: Experiments show that OViP effectively reduces hallucination while retaining core multi-modal capabilities. Conclusion: Through failure-driven training and dynamic data construction, OViP markedly improves model alignment. Abstract: Large vision-language models (LVLMs) remain vulnerable to hallucination, often generating content misaligned with visual inputs. While recent approaches advance multi-modal Direct Preference Optimization (DPO) to mitigate hallucination, they typically rely on predefined or randomly edited negative samples that fail to reflect actual model errors, limiting training efficacy. In this work, we propose an Online Vision-language Preference Learning (OViP) framework that dynamically constructs contrastive training data based on the model's own hallucinated outputs. By identifying semantic differences between sampled response pairs and synthesizing negative images using a diffusion model, OViP generates more relevant supervision signals in real time. This failure-driven training enables adaptive alignment of both textual and visual preferences. Moreover, we refine existing evaluation protocols to better capture the trade-off between hallucination suppression and expressiveness. Experiments on hallucination and general benchmarks demonstrate that OViP effectively reduces hallucinations while preserving core multi-modal capabilities.
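
As background for the Direct Preference Optimization (DPO) objective mentioned in the abstract above, the following sketch computes the standard per-pair DPO loss from sequence log-probabilities. It shows the generic text-only form; OViP's actual training objective, which also covers image preferences, is not specified in the abstract and is not reproduced here. The value of `beta` and the toy log-probabilities are assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are summed log-probabilities of each response under the policy
    being trained and under a frozen reference model.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: no preference expressed yet.
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)

# Policy favours the chosen response more than the reference does.
aligned = dpo_loss(-1.0, -5.0, -2.0, -2.0)

# Policy favours the rejected (e.g. hallucinated) response instead.
misaligned = dpo_loss(-5.0, -1.0, -2.0, -2.0)
```

The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, which is why the quality of the rejected samples matters: negatives that do not resemble real model errors give a weak training signal.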

[14] Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

Alex Su,Haozhe Wang,Weimin Ren,Fangzhen Lin,Wenhu Chen

Main category: cs.CV

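TL;DR: The paper proposes a new framework for reasoning in pixel space, using visual operations to strengthen the reasoning of vision-language models (VLMs) and significantly improving performance on visual tasks.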

Details

Motivation: Existing chain-of-thought reasoning is confined to textual space, which limits its effectiveness on visually intensive tasks, so reasoning needs to be extended into pixel space. Method: A two-phase training approach, instruction tuning followed by reinforcement learning, introduces visual operations (such as zoom-in and select-frame) to strengthen the visual reasoning of VLMs. Result: The 7B model performs strongly on several visual reasoning benchmarks, including V* bench (84%), TallyQA-Complex (74%), and InfographicsVQA (84%). Conclusion: Pixel-space reasoning is important for improving VLM performance, and the proposed framework effectively addresses the challenges of visual reasoning. Abstract: Chain-of-thought reasoning has significantly improved the performance of Large Language Models (LLMs) across various domains. However, this reasoning process has been confined exclusively to textual space, limiting its effectiveness in visually intensive tasks. To address this limitation, we introduce the concept of reasoning in the pixel-space. Within this novel framework, Vision-Language Models (VLMs) are equipped with a suite of visual reasoning operations, such as zoom-in and select-frame. These operations enable VLMs to directly inspect, interrogate, and infer from visual evidence, thereby enhancing reasoning fidelity for visual tasks. Cultivating such pixel-space reasoning capabilities in VLMs presents notable challenges, including the model's initially imbalanced competence and its reluctance to adopt the newly introduced pixel-space operations. We address these challenges through a two-phase training approach. The first phase employs instruction tuning on synthesized reasoning traces to familiarize the model with the novel visual operations. Following this, a reinforcement learning (RL) phase leverages a curiosity-driven reward scheme to balance exploration between pixel-space reasoning and textual reasoning. With these visual operations, VLMs can interact with complex visual inputs, such as information-rich images or videos, to proactively gather necessary information. We demonstrate that this approach significantly improves VLM performance across diverse visual reasoning benchmarks. Our 7B model, Pixel Reasoner, achieves 84% on V* bench, 74% on TallyQA-Complex, and 84% on InfographicsVQA, marking the highest accuracy achieved by any open-source model to date. These results highlight the importance of pixel-space reasoning and the effectiveness of our framework.

[15] Analyzing Hierarchical Structure in Vision Models with Sparse Autoencoders

Matthew Lyle Olson,Musashi Hinck,Neale Ratzlaff,Changbai Li,Phillip Howard,Vasudev Lal,Shao-Yen Tseng

Main category: cs.CV

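TL;DR: Sparse Autoencoders (SAEs) are used to analyze how vision models encode the ImageNet hierarchy; the study finds implicit hierarchical relationships in model activations and examines the consistency of representations across layers of DINOv2.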

Details

Motivation: To investigate whether vision models learn and encode the hierarchical structure of the ImageNet taxonomy, and how SAEs can reveal that structure. Method: Sparse Autoencoders (SAEs) are used to probe the internal representations of vision models, with a focus on the different layers of DINOv2. Result: SAEs uncover hierarchical relationships in model activations, indicating that vision models implicitly encode taxonomic structure and that category information in the class token increases layer by layer. Conclusion: The study provides a systematic framework for hierarchical analysis of vision model representations and shows the potential of SAEs for revealing semantic structure in deep networks. Abstract: The ImageNet hierarchy provides a structured taxonomy of object categories, offering a valuable lens through which to analyze the representations learned by deep vision models. In this work, we conduct a comprehensive analysis of how vision models encode the ImageNet hierarchy, leveraging Sparse Autoencoders (SAEs) to probe their internal representations. SAEs have been widely used as an explanation tool for large language models (LLMs), where they enable the discovery of semantically meaningful features. Here, we extend their use to vision models to investigate whether learned representations align with the ontological structure defined by the ImageNet taxonomy. Our results show that SAEs uncover hierarchical relationships in model activations, revealing an implicit encoding of taxonomic structure. We analyze the consistency of these representations across different layers of the popular vision foundation model DINOv2 and provide insights into how deep vision models internalize hierarchical category information by increasing information in the class token through each layer. Our study establishes a framework for systematic hierarchical analysis of vision model representations and highlights the potential of SAEs as a tool for probing semantic structure in deep networks.

[16] Domain Adaptive Skin Lesion Classification via Conformal Ensemble of Vision Transformers

Mehran Zoravar,Shadi Alijani,Homayoun Najjaran

Main category: cs.CV

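TL;DR: The paper proposes CE-ViTs, a framework that combines vision transformer models with ensemble learning to improve image classification, particularly with respect to domain adaptation and uncertainty handling.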

Details

Motivation: In critical domains such as medical imaging, the reliability of deep learning models is essential. Conventional approaches perform poorly under domain shift, so a more robust method is needed. Method: An ensemble of vision transformer models is trained on the HAM10000, Dermofit, and ISIC datasets, with conformal learning used to strengthen domain adaptation. Result: Experiments show a coverage rate of 90.38%, an improvement of 9.95% over the HAM10000 model, and the average prediction set size for hard, misclassified samples grows from 1.86 to 3.075. Conclusion: Through ensemble learning and conformal prediction, CE-ViTs markedly improves model robustness and domain adaptation, making it suitable for image classification in critical domains. Abstract: Exploring the trustworthiness of deep learning models is crucial, especially in critical domains such as medical imaging decision support systems. Conformal prediction has emerged as a rigorous means of providing deep learning models with reliable uncertainty estimates and safety guarantees. However, conformal prediction results face challenges due to the backbone model's struggles in domain-shifted scenarios, such as variations in different sources. To address this challenge, this paper proposes a novel framework termed Conformal Ensemble of Vision Transformers (CE-ViTs) designed to enhance image classification performance by prioritizing domain adaptation and model robustness, while accounting for uncertainty. The proposed method leverages an ensemble of vision transformer models in the backbone, trained on diverse datasets including HAM10000, Dermofit, and Skin Cancer ISIC datasets. This ensemble learning approach, calibrated on the combination of these datasets, aims to enhance domain adaptation through conformal learning. Experimental results underscore that the framework achieves a high coverage rate of 90.38%, representing an improvement of 9.95% compared to the HAM10000 model. This indicates a stronger likelihood that the prediction set includes the true label than with single models. Ensemble learning in CE-ViTs significantly improves conformal prediction performance, increasing the average prediction set size for challenging misclassified samples from 1.86 to 3.075.
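
The notions of prediction set and coverage used above come from conformal prediction. The sketch below shows the textbook split-conformal recipe for classification, using one minus the softmax probability of the true class as the nonconformity score. It is a generic illustration, not the CE-ViTs pipeline; the calibration probabilities and labels are made up.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Score threshold from a held-out calibration set (split conformal prediction).

    The nonconformity score is 1 - p(true label). The threshold is the
    ceil((n + 1) * (1 - alpha))-th smallest calibration score.
    """
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return 1.0  # too few calibration points: fall back to including every label
    return scores[rank - 1]


def prediction_set(probs, threshold):
    """All class indices whose nonconformity score is within the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= threshold]


# Made-up calibration data: softmax outputs over 3 classes and their true labels.
cal_probs = [
    [0.90, 0.05, 0.05],
    [0.20, 0.70, 0.10],
    [0.10, 0.30, 0.60],
    [0.50, 0.40, 0.10],
    [0.80, 0.10, 0.10],
]
cal_labels = [0, 1, 2, 1, 0]

threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.3)
uncertain_case = prediction_set([0.50, 0.45, 0.05], threshold)
```

For an ambiguous input the set contains more than one label, so a larger set on hard samples signals uncertainty rather than silently committing to a wrong class, which is how the set-size increase reported in the abstract should be read.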

[17] Image-to-Image Translation with Diffusion Transformers and CLIP-Based Image Conditioning

Qiang Zhu,Kuan Lu,Menghao Huo,Yuxiao Li

Main category: cs.CV

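TL;DR: The paper presents an image-to-image translation framework based on diffusion models and transformers, using a CLIP encoder to guide the translation process and achieving high-quality, semantically consistent results.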

Details

Motivation: To explore an image translation method that combines diffusion models with transformers as an alternative to conventional GAN models, using a CLIP encoder to achieve fine-grained translation without text or labels. Method: The approach adapts Diffusion Transformers (DiT), conditions on CLIP image embeddings, and trains with a CLIP similarity loss and an LPIPS perceptual loss. Result: The method is validated on two datasets, face2comics and edges2shoes, achieving high-quality and semantically consistent image translation. Conclusion: DiT combined with CLIP conditioning and perceptual similarity objectives offers a promising alternative for image translation tasks. Abstract: Image-to-image translation aims to learn a mapping between a source and a target domain, enabling tasks such as style transfer, appearance transformation, and domain adaptation. In this work, we explore a diffusion-based framework for image-to-image translation by adapting Diffusion Transformers (DiT), which combine the denoising capabilities of diffusion models with the global modeling power of transformers. To guide the translation process, we condition the model on image embeddings extracted from a pre-trained CLIP encoder, allowing for fine-grained and structurally consistent translations without relying on text or class labels. We incorporate both a CLIP similarity loss to enforce semantic consistency and an LPIPS perceptual loss to enhance visual fidelity during training. We validate our approach on two benchmark datasets: face2comics, which translates real human faces to comic-style illustrations, and edges2shoes, which translates edge maps to realistic shoe images. Experimental results demonstrate that DiT, combined with CLIP-based conditioning and perceptual similarity objectives, achieves high-quality, semantically faithful translations, offering a promising alternative to GAN-based models for paired image-to-image translation tasks.

[18] Position: Agentic Systems Constitute a Key Component of Next-Generation Intelligent Image Processing

Jinjin Gu

Main category: cs.CV

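TL;DR: This paper argues that the image processing field should expand from purely model-centric development to include agentic system design, in order to address the limitations of current methods.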

Details

Motivation: Deep learning has made significant progress on image processing tasks but falls short in generalization, adaptability, and flexibility. Method: Proposes developing intelligent agentic systems that dynamically select and optimize existing tools, emulating the strategic workflow of human experts. Result: Analyzes the limitations of the model-centric paradigm and establishes design principles and capability levels for agentic systems. Conclusion: Intelligent agentic systems are the next step for the image processing field. Abstract: This position paper argues that the image processing community should broaden its focus from purely model-centric development to include agentic system design as an essential complementary paradigm. While deep learning has significantly advanced capabilities for specific image processing tasks, current approaches face critical limitations in generalization, adaptability, and real-world problem-solving flexibility. We propose that developing intelligent agentic systems, capable of dynamically selecting, combining, and optimizing existing image processing tools, represents the next evolutionary step for the field. Such systems would emulate human experts' ability to strategically orchestrate different tools to solve complex problems, overcoming the brittleness of monolithic models. The paper analyzes key limitations of model-centric paradigms, establishes design principles for agentic image processing systems, and outlines different capability levels for such agents.

[19] CP-LLM: Context and Pixel Aware Large Language Model for Video Quality Assessment

Wen Wen,Yaohong Wu,Yue Sheng,Neil Birkbeck,Balu Adsumilli,Yilin Wang

Main category: cs.CV

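TL;DR: CP-LLM is a new multimodal large language model that uses dual vision encoders to analyze video context and pixel-level distortion separately, combined with a language decoder that produces quality scores and descriptions, substantially improving video quality assessment performance.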

Details

Motivation: Traditional VQA models lack an understanding of video context, while existing LLM-based models are insensitive to small distortions or cannot handle quality scoring and description together. CP-LLM aims to address these issues. Method: CP-LLM uses dual vision encoders to process high-level and low-level information separately, combines them with a language decoder for reasoning, and is trained with a multi-task pipeline covering score prediction, description generation, and pairwise comparison. Result: Experiments show CP-LLM achieves state-of-the-art performance on VQA benchmarks and is more robust to small distortions. Conclusion: CP-LLM provides a comprehensive and practical solution for video quality assessment. Abstract: Video quality assessment (VQA) is a challenging research topic with broad applications. Effective VQA necessitates sensitivity to pixel-level distortions and a comprehensive understanding of video context to accurately determine the perceptual impact of distortions. Traditional hand-crafted and learning-based VQA models mainly focus on pixel-level distortions and lack contextual understanding, while recent LLM-based models struggle with sensitivity to small distortions or handle quality scoring and description as separate tasks. To address these shortcomings, we introduce CP-LLM: a Context and Pixel aware Large Language Model. CP-LLM is a novel multimodal LLM architecture featuring dual vision encoders designed to independently analyze perceptual quality at both high-level (video context) and low-level (pixel distortion) granularity, along with a language decoder that subsequently reasons about the interplay between these aspects. This design enables CP-LLM to simultaneously produce robust quality scores and interpretable quality descriptions, with enhanced sensitivity to pixel distortions (e.g., compression artifacts). The model is trained via a multi-task pipeline optimizing for score prediction, description generation, and pairwise comparisons. Experimental results demonstrate that CP-LLM achieves state-of-the-art cross-dataset performance on established VQA benchmarks and superior robustness to pixel distortions, confirming its efficacy for comprehensive and practical video quality assessment in real-world scenarios.

[20] Learning better representations for crowded pedestrians in offboard LiDAR-camera 3D tracking-by-detection

Shichao Li,Peiliang Li,Qing Lian,Peng Yun,Xiaozhi Chen

Main category: cs.CV

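TL;DR: This paper proposes a 3D auto-labeling system for highly crowded pedestrian scenes that reconstructs pedestrian trajectories from multi-view LiDAR and camera data and learns high-resolution representations to improve performance.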

Details

Motivation: Address the difficulty of perceiving pedestrians in highly crowded urban environments, in particular the speed and accuracy of 3D ground truth generation. Method: Collect a multi-view LiDAR-camera 3D multiple-object-tracking benchmark, build an offboard auto-labeling system, and learn density-aware and relationship-aware high-resolution representations. Result: Experiments show the method significantly improves 3D pedestrian tracking performance and raises auto-labeling efficiency. Conclusion: The proposed method effectively addresses 3D labeling for highly crowded pedestrian scenes; the code will be released publicly. Abstract: Perceiving pedestrians in highly crowded urban environments is a difficult long-tail problem for learning-based autonomous perception. Speeding up 3D ground truth generation for such challenging scenes is performance-critical yet very challenging. The difficulties include the sparsity of the captured pedestrian point cloud and a lack of suitable benchmarks for a specific system design study. To tackle the challenges, we first collect a new multi-view LiDAR-camera 3D multiple-object-tracking benchmark of highly crowded pedestrians for in-depth analysis. We then build an offboard auto-labeling system that reconstructs pedestrian trajectories from LiDAR point clouds and multi-view images. To improve the generalization power for crowded scenes and the performance for small objects, we propose to learn high-resolution representations that are density-aware and relationship-aware. Extensive experiments validate that our approach significantly improves the 3D pedestrian tracking performance towards higher auto-labeling efficiency. The code will be publicly available at this HTTP URL.

[21] An Approach Towards Identifying Bangladeshi Leaf Diseases through Transfer Learning and XAI

Faika Fairuj Preotee,Shuvashis Sarker,Shamim Rahim Refat,Tashreef Muhammad,Shifat Islam

Main category: cs.CV

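TL;DR: This study uses deep learning models (such as VGG19 and Xception) and explainable AI techniques (such as GradCAM) to identify plant leaf diseases in Bangladesh, reaching accuracy of up to 98.90% and helping farmers manage crop health.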

Details

Motivation: Help farmers in Bangladesh, who often lack access to expert knowledge, identify and manage plant leaf diseases, and thereby raise agricultural productivity. Method: Use a CNN and transfer learning models (VGG16, VGG19, and others) to classify 21 leaf diseases, combined with XAI techniques to improve model transparency. Result: VGG19 and Xception perform best, with accuracies of 98.90% and 98.66% respectively. Conclusion: The approach improves disease detection accuracy and, by making model decisions transparent, supports farmers in making better-informed management decisions and promotes sustainable agriculture. Abstract: Leaf diseases are harmful conditions that affect the health, appearance and productivity of plants, leading to significant plant loss and negatively impacting farmers' livelihoods. These diseases cause visible symptoms such as lesions, color changes, and texture variations, making it difficult for farmers to manage plant health, especially in large or remote farms where expert knowledge is limited. The main motivation of this study is to provide an efficient and accessible solution for identifying plant leaf diseases in Bangladesh, where agriculture plays a critical role in food security. The objective of our research is to classify 21 distinct leaf diseases across six plants using deep learning models, improving disease detection accuracy while reducing the need for expert involvement. Deep Learning (DL) techniques, including CNN and Transfer Learning (TL) models like VGG16, VGG19, MobileNetV2, InceptionV3, ResNet50V2 and Xception are used. VGG19 and Xception achieve the highest accuracies, with 98.90% and 98.66% respectively. Additionally, Explainable AI (XAI) techniques such as GradCAM, GradCAM++, LayerCAM, ScoreCAM and FasterScoreCAM are used to enhance transparency by highlighting the regions the models focus on during disease classification. This transparency ensures that farmers can understand the model's predictions and take necessary action. This approach not only improves disease management but also supports farmers in making informed decisions, leading to better plant protection and increased agricultural productivity.

[22] An Exploratory Approach Towards Investigating and Explaining Vision Transformer and Transfer Learning for Brain Disease Detection

Shuvashis Sarker,Shamim Rahim Refat,Faika Fairuj Preotee,Shifat Islam,Tashreef Muhammad,Mohammad Ashraful Hoque

Main category: cs.CV

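TL;DR: This study compares Vision Transformer (ViT) with several transfer learning models (such as VGG16 and VGG19) for classifying brain diseases from MRI data; ViT performs best with 94.39% accuracy, and XAI methods are used to improve model interpretability.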

Details

Motivation: Brain diseases are complex to diagnose and treat, and MRI scans are hard to interpret; the study aims to improve diagnostic accuracy with advanced models and interpretability methods. Method: Classify a Bangladesh-based MRI dataset with ViT and several transfer learning models, and explain model predictions with XAI methods (such as GradCAM). Result: ViT performs best, with a classification accuracy of 94.39%, and the XAI methods improve model interpretability. Conclusion: ViT combined with XAI methods performs strongly on brain disease classification and provides a more precise tool for medical diagnosis. Abstract: The brain is a highly complex organ that manages many important tasks, including movement, memory and thinking. Brain-related conditions, like tumors and degenerative disorders, can be hard to diagnose and treat. Magnetic Resonance Imaging (MRI) serves as a key tool for identifying these conditions, offering high-resolution images of brain structures. Despite this, interpreting MRI scans can be complicated. This study tackles this challenge by conducting a comparative analysis of Vision Transformer (ViT) and Transfer Learning (TL) models such as VGG16, VGG19, ResNet50V2, and MobileNetV2 for classifying brain diseases using MRI data from a Bangladesh-based dataset. ViTs, known for their ability to capture global relationships in images, are particularly effective for medical imaging tasks. Transfer learning helps to mitigate data constraints by fine-tuning pre-trained models. Furthermore, Explainable AI (XAI) methods such as GradCAM, GradCAM++, LayerCAM, ScoreCAM, and Faster-ScoreCAM are employed to interpret model predictions. The results demonstrate that ViT surpasses transfer learning models, achieving a classification accuracy of 94.39%. The integration of XAI methods enhances model transparency, offering crucial insights to aid medical professionals in diagnosing brain diseases with greater precision.

[23] GMatch: Geometry-Constrained Feature Matching for RGB-D Object Pose Estimation

Ming Yang,Haoran Li

Main category: cs.CV

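TL;DR: GMatch is a learning-free feature matching method for robust 6DoF object pose estimation that uses geometric consistency to resolve local ambiguities in sparse feature matching.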

Details

Motivation: Traditional methods rely solely on descriptor similarity and are vulnerable to local ambiguity; GMatch aims to make matching more robust through geometric consistency. Method: GMatch performs an incremental search under SE(3)-invariant geometric consistency constraints, using geometric features that uniquely determine 3D keypoint configurations, with no training or GPU support required. Result: On the HOPE and YCB-Video datasets, GMatch-SIFT outperforms traditional and learning-based methods, matching or surpassing instance-level pose networks. Conclusion: GMatch-SIFT is validated as effective for object pose estimation, and GMatch shows broad applicability as a general-purpose feature matcher. Abstract: We present GMatch, a learning-free feature matcher designed for robust 6DoF object pose estimation, addressing common local ambiguities in sparse feature matching. Unlike traditional methods that rely solely on descriptor similarity, GMatch performs a guided, incremental search, enforcing SE(3)-invariant geometric consistency throughout the matching process. It leverages a provably complete set of geometric features that uniquely determine 3D keypoint configurations, ensuring globally consistent correspondences without the need for training or GPU support. When combined with classical descriptors such as SIFT, GMatch-SIFT forms a general-purpose pose estimation pipeline that offers strong interpretability and generalization across diverse objects and scenes. Experiments on the HOPE dataset show that GMatch outperforms both traditional and learning-based matchers, with GMatch-SIFT achieving or surpassing the performance of instance-level pose networks. On the YCB-Video dataset, GMatch-SIFT demonstrates high accuracy and low variance on texture-rich objects. These results not only validate the effectiveness of GMatch-SIFT for object pose estimation but also highlight the broader applicability of GMatch as a general-purpose feature matcher. Code will be released upon acceptance.

[24] Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation

Zhenglin Hua,Jinghan He,Zijun Yao,Tianxu Han,Haiyun Guo,Yuheng Jia,Junfeng Fang

Main category: cs.CV

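TL;DR: The paper proposes SSL, a method based on sparse autoencoders (SAEs) that identifies directions associated with hallucination and with faithful semantics, effectively reducing hallucinations in large vision-language models (LVLMs) without additional training.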

Details

Motivation: Large vision-language models perform well on multimodal tasks but suffer from hallucination, that is, generating text inconsistent with the visual input. Existing methods are computationally expensive and of limited effectiveness. Method: Use sparse autoencoders to identify directions associated with hallucination or with faithful semantics, and propose SSL, which suppresses hallucinations by steering along these directions. Result: Experiments show SSL outperforms existing decoding methods at reducing hallucinations, transfers across model architectures, and adds little time overhead. Conclusion: SSL is an efficient, training-free method that significantly reduces hallucinations in LVLMs while preserving semantic integrity. Abstract: Large vision-language models (LVLMs) have achieved remarkable performance on multimodal tasks such as visual question answering (VQA) and image captioning. However, they still suffer from hallucinations, generating text inconsistent with visual input, posing significant risks in real-world applications. Existing approaches to address this issue focus on incorporating external knowledge bases, alignment training, or decoding strategies, all of which require substantial computational cost and time. Recent works try to explore more efficient alternatives by adjusting LVLMs' internal representations. Although promising, these methods may cause hallucinations to be insufficiently suppressed or lead to excessive interventions that negatively affect normal semantics. In this work, we leverage sparse autoencoders (SAEs) to identify semantic directions closely associated with either hallucinations or actuality, realizing more precise and direct hallucination-related representations. Our analysis demonstrates that interventions along the faithful direction we identified can mitigate hallucinations, while those along the hallucinatory direction can exacerbate them. Building on these insights, we propose Steering LVLMs via SAE Latent Directions (SSL), a training-free method based on SAE-derived latent directions to mitigate hallucinations in LVLMs. Extensive experiments demonstrate that SSL significantly outperforms existing decoding approaches in mitigating hallucinations, while maintaining transferability across different model architectures with negligible additional time overhead.

[25] When VLMs Meet Image Classification: Test Sets Renovation via Missing Label Identification

Zirui Pang,Haosheng Tan,Yuhan Pu,Zhijie Deng,Zhouan Shen,Keyu Hu,Jiaheng Wei

Main category: cs.CV

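TL;DR: The REVEAL framework combines vision-language models with label cleaning methods to systematically address noisy and missing labels in image classification datasets, significantly improving the quality of 6 benchmark test sets.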

Details

Motivation: Existing label cleaning methods focus mainly on noisy labels and overlook missing labels, which leads to inaccurate model evaluation. Method: Integrate pre-trained vision-language models (such as LLaVA and BLIP) with label cleaning methods (such as Cleanlab), detecting noisy and missing labels through confidence-informed predictions and consensus-based filtering. Result: REVEAL significantly improves label quality on 6 benchmark test sets, and human verification shows close agreement with human judgment. Conclusion: REVEAL provides more accurate labels for image classification datasets and supports fairer model evaluation. Abstract: Image classification benchmark datasets such as CIFAR, MNIST, and ImageNet serve as critical tools for model evaluation. However, despite the cleaning efforts, these datasets still suffer from pervasive noisy labels and often contain missing labels due to the co-existing image pattern where multiple classes appear in an image sample. This results in misleading model comparisons and unfair evaluations. Existing label cleaning methods focus primarily on noisy labels, but the issue of missing labels remains largely overlooked. Motivated by these challenges, we present a comprehensive framework named REVEAL, integrating state-of-the-art pre-trained vision-language models (e.g., LLaVA, BLIP, Janus, Qwen) with advanced machine/human label curation methods (e.g., Docta, Cleanlab, MTurk), to systematically address both noisy labels and missing label detection in widely-used image classification test sets. REVEAL detects potential noisy labels and omissions, aggregates predictions from various methods, and refines label accuracy through confidence-informed predictions and consensus-based filtering. Additionally, we provide a thorough analysis of state-of-the-art vision-language models and pre-trained image classifiers, highlighting their strengths and limitations within the context of dataset renovation by revealing 10 observations. Our method effectively reveals missing labels from public datasets and provides soft-labeled results with likelihoods. Through human verification, REVEAL significantly improves the quality of 6 benchmark test sets, aligning closely with human judgments and enabling more accurate and meaningful comparisons in image classification.

[26] Training-Free Reasoning and Reflection in MLLMs

Hongchen Wei,Zhenzhong Chen

Main category: cs.CV

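TL;DR: This paper proposes FRANK, a training-free MLLM that combines a visual-pretrained MLLM with a reasoning-specialized LLM through hierarchical weight merging, significantly improving multimodal reasoning.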

Details

Motivation: Extending reasoning LLMs to multimodal LLMs (MLLMs) is held back by high retraining costs and the scarcity of high-quality multimodal reasoning data. Method: Use hierarchical weight merging to combine a visual-pretrained MLLM with a reasoning-specialized LLM, with a Taylor-derived closed-form fusion mechanism. Result: On the MMMU benchmark, FRANK-38B reaches an accuracy of 69.2, beating the baseline InternVL2.5-38B by +5.3 and even surpassing GPT-4o. Conclusion: Through layer-wise fusion, FRANK effectively improves the reasoning ability of MLLMs without additional training or supervision. Abstract: Recent advances in Reasoning LLMs (e.g., DeepSeek-R1 and OpenAI-o1) have showcased impressive reasoning capabilities via reinforcement learning. However, extending these capabilities to Multimodal LLMs (MLLMs) is hampered by the prohibitive costs of retraining and the scarcity of high-quality, verifiable multimodal reasoning datasets. This paper introduces FRANK Model, a training-FRee ANd r1-liKe MLLM that imbues off-the-shelf MLLMs with reasoning and reflection abilities, without any gradient updates or extra supervision. Our key insight is to decouple perception and reasoning across MLLM decoder layers. Specifically, we observe that compared to the deeper decoder layers, the shallow decoder layers allocate more attention to visual tokens, while the deeper decoder layers concentrate on textual semantics. This observation motivates a hierarchical weight merging approach that combines a visual-pretrained MLLM with a reasoning-specialized LLM. To this end, we propose a layer-wise, Taylor-derived closed-form fusion mechanism that integrates reasoning capacity into deep decoder layers while preserving visual grounding in shallow decoder layers. Extensive experiments on challenging multimodal reasoning benchmarks demonstrate the effectiveness of our approach. On the MMMU benchmark, our model FRANK-38B achieves an accuracy of 69.2, outperforming the strongest baseline InternVL2.5-38B by +5.3, and even surpasses the proprietary GPT-4o model. Our project homepage is at: http://iip.whu.edu.cn/frank/index.html
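
To make the idea of layer-wise merging concrete, here is a deliberately simplified sketch. It does not reproduce the Taylor-derived closed-form fusion named in the abstract, whose details are not given here; it swaps in plain linear interpolation with a coefficient that grows with depth, so shallow layers stay close to the visual model and deep layers move toward the reasoning model. The function names, the linear schedule, and the toy weights are all assumptions.

```python
def layer_coefficients(num_layers, low=0.0, high=1.0):
    """Share of reasoning-model weights per layer, rising linearly with depth (assumed schedule)."""
    if num_layers == 1:
        return [high]
    step = (high - low) / (num_layers - 1)
    return [low + i * step for i in range(num_layers)]


def merge_layers(visual_layers, reasoning_layers, coeffs):
    """Interpolate each layer's flattened weights: (1 - lam) * visual + lam * reasoning."""
    merged = []
    for w_visual, w_reason, lam in zip(visual_layers, reasoning_layers, coeffs):
        merged.append([(1.0 - lam) * a + lam * b for a, b in zip(w_visual, w_reason)])
    return merged


# Toy "models": three decoder layers, each reduced to a two-element weight vector.
visual = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
reasoning = [[3.0, 3.0], [3.0, 3.0], [3.0, 3.0]]

coeffs = layer_coefficients(len(visual))
merged = merge_layers(visual, reasoning, coeffs)
print(coeffs)
print(merged)
```

The only point carried over from the abstract is the direction of the schedule: visual grounding is preserved in shallow layers and reasoning capacity is injected into deep ones. Merging like this also presumes both models share the same layer shapes.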

[27] BadDepth: Backdoor Attacks Against Monocular Depth Estimation in the Physical World

Ji Guo,Long Zhou,Zhijin Wang,Jiaming He,Qiyang Song,Aiguo Chen,Wenbo Jiang

Main category: cs.CV

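TL;DR: The paper proposes BadDepth, the first backdoor attack on monocular depth estimation (MDE) models, which selectively manipulates the depth of a target object to generate poisoned datasets, overcoming the inapplicability of existing methods to MDE.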

Details

Motivation: Although MDE models are widely used in areas such as autonomous driving, their vulnerability to backdoor attacks has not been studied; the paper aims to fill this gap. Method: Proposes BadDepth, which uses an image segmentation model to selectively manipulate the depth of the target object and restores the surrounding regions via depth completion to generate poisoned datasets. Digital-to-physical augmentation is introduced to improve robustness in physical-world scenarios. Result: Experiments on multiple models validate the effectiveness of BadDepth in both the digital domain and the physical world, unaffected by environmental factors. Conclusion: BadDepth is the first backdoor attack targeting MDE models; it overcomes the limitations of existing methods and performs well in experiments. Abstract: In recent years, deep learning-based Monocular Depth Estimation (MDE) models have been widely applied in fields such as autonomous driving and robotics. However, their vulnerability to backdoor attacks remains unexplored. To fill the gap in this area, we conduct a comprehensive investigation of backdoor attacks against MDE models. Typically, existing backdoor attack methods cannot be applied to MDE models. This is because the label used in MDE is in the form of a depth map. To address this, we propose BadDepth, the first backdoor attack targeting MDE models. BadDepth overcomes this limitation by selectively manipulating the target object's depth using an image segmentation model and restoring the surrounding areas via depth completion, thereby generating poisoned datasets for object-level backdoor attacks. To improve robustness in physical world scenarios, we further introduce digital-to-physical augmentation to adapt to the domain gap between the physical world and the digital domain. Extensive experiments on multiple models validate the effectiveness of BadDepth in both the digital domain and the physical world, without being affected by environmental factors.

[28] Breaking Complexity Barriers: High-Resolution Image Restoration with Rank Enhanced Linear Attention

Yuang Ai,Huaibo Huang,Tao Wu,Qihang Fan,Ran He

Main category: cs.CV

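TL;DR: The paper proposes RELA, a method for enhancing linear attention, and builds on it an efficient image restoration Transformer, LAformer, addressing the computational complexity of conventional Transformers on high-resolution images.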

Details

Motivation: Transformers perform well on image restoration, but the quadratic complexity of self-attention limits their use on high-resolution images. Existing methods ease the problem with sparse or window-based attention at the cost of global context modeling. Method: Proposes Rank Enhanced Linear Attention (RELA), which enriches feature representations with a lightweight depthwise convolution; builds LAformer on RELA, combining linear attention and channel attention for global perception and using a convolutional gated feed-forward network to strengthen local fitting. Result: Across 7 image restoration tasks and 21 benchmarks, LAformer outperforms existing SOTA methods and is significantly more computationally efficient. Conclusion: Through linear attention and convolutional enhancement, LAformer delivers efficient, high-performing image restoration suited to high-resolution images. Abstract: Transformer-based models have made remarkable progress in image restoration (IR) tasks. However, the quadratic complexity of self-attention in Transformer hinders its applicability to high-resolution images. Existing methods mitigate this issue with sparse or window-based attention, yet inherently limit global context modeling. Linear attention, a variant of softmax attention, demonstrates promise in global context modeling while maintaining linear complexity, offering a potential solution to the above challenge. Despite its efficiency benefits, vanilla linear attention suffers from a significant performance drop in IR, largely due to the low-rank nature of its attention map. To counter this, we propose Rank Enhanced Linear Attention (RELA), a simple yet effective method that enriches feature representations by integrating a lightweight depthwise convolution. Building upon RELA, we propose an efficient and effective image restoration Transformer, named LAformer. LAformer achieves effective global perception by integrating linear attention and channel attention, while also enhancing local fitting capabilities through a convolutional gated feed-forward network. Notably, LAformer eliminates hardware-inefficient operations such as softmax and window shifting, enabling efficient processing of high-resolution images. Extensive experiments across 7 IR tasks and 21 benchmarks demonstrate that LAformer outperforms SOTA methods and offers significant computational advantages.
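
The complexity argument above rests on a reordering trick that is easy to show in a few lines. The sketch below is vanilla linear attention only, without the depthwise convolution that RELA adds, and the `elu(x) + 1` feature map is a common choice assumed for illustration rather than taken from the paper. It computes the same output two ways: through the explicit N x N attention map, and by forming the small d x d product `phi(K)^T V` first.

```python
import math


def feature_map(x):
    """elu(x) + 1: keeps every entry positive (assumed feature map)."""
    return x + 1.0 if x > 0 else math.exp(x)


def transpose(a):
    return [list(row) for row in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_map(m):
    return [[feature_map(x) for x in row] for row in m]


def linear_attention(q, k, v):
    """Cost linear in sequence length N: build phi(K)^T V once, then apply phi(Q)."""
    phi_q, phi_k = apply_map(q), apply_map(k)
    kv = matmul(transpose(phi_k), v)            # d x d_v summary of keys and values
    k_sum = [sum(col) for col in zip(*phi_k)]   # length-d normalizer term
    out = []
    for row_q, numerator in zip(phi_q, matmul(phi_q, kv)):
        z = sum(a * b for a, b in zip(row_q, k_sum))
        out.append([x / z for x in numerator])
    return out


def quadratic_reference(q, k, v):
    """Same result via the explicit N x N attention map (quadratic in N)."""
    phi_q, phi_k = apply_map(q), apply_map(k)
    scores = matmul(phi_q, transpose(phi_k))    # N x N
    weights = [[s / sum(row) for s in row] for row in scores]
    return matmul(weights, v)


q = [[0.5, -1.0], [1.0, 2.0], [-0.5, 0.3]]
k = [[0.2, 0.1], [-1.0, 0.4], [0.7, -0.3]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
fast = linear_attention(q, k, v)
slow = quadratic_reference(q, k, v)
print(max(abs(a - b) for rf, rs in zip(fast, slow) for a, b in zip(rf, rs)))
```

The N x N map in `quadratic_reference` is a product of an N x d and a d x N matrix, so its rank is at most d. That is the low-rank limitation the abstract attributes to vanilla linear attention.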

[29] Deep Learning-Driven Ultra-High-Definition Image Restoration: A Survey

Liyan Wang,Weixiang Zhou,Cong Wang,Kin-Man Lam,Zhixun Su,Jinshan Pan

Main category: cs.CV

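TL;DR: This paper systematically surveys recent progress in ultra-high-definition (UHD) image restoration, covering dataset construction, algorithm design, and more, and proposes a classification framework and future research directions.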

Details

Motivation: Address quality degradation in UHD images, summarize deep learning innovations in the area, and provide a resource for researchers. Method: Summarize degradation models, organize datasets, classify network architectures and sampling strategies, and analyze technical developments. Result: Proposes a classification framework, organizes existing methods, and offers directions for future research. Conclusion: UHD image restoration still has room to grow and calls for further work on datasets, algorithms, and evaluation methods. Abstract: Ultra-high-definition (UHD) image restoration aims to specifically solve the problem of quality degradation in ultra-high-resolution images. Recent advancements in this field are predominantly driven by deep learning-based innovations, including enhancements in dataset construction, network architecture, sampling strategies, prior knowledge integration, and loss functions. In this paper, we systematically review recent progress in UHD image restoration, covering various aspects ranging from dataset construction to algorithm design. This serves as a valuable resource for understanding state-of-the-art developments in the field. We begin by summarizing degradation models for various image restoration subproblems, such as super-resolution, low-light enhancement, deblurring, dehazing, deraining, and desnowing, and emphasizing the unique challenges of their application to UHD image restoration. We then highlight existing UHD benchmark datasets and organize the literature according to degradation types and dataset construction methods. Following this, we showcase major milestones in deep learning-driven UHD image restoration, reviewing the progression of restoration tasks, technological developments, and evaluations of existing methods. We further propose a classification framework based on network architectures and sampling strategies, helping to clearly organize existing methods. Finally, we share insights into the current research landscape and propose directions for further advancements. A related repository is available at https://github.com/wlydlut/UHD-Image-Restoration-Survey.

[30] RE-TRIP : Reflectivity Instance Augmented Triangle Descriptor for 3D Place Recognition

Yechan Park,Gyuhyeon Pak,Euntai Kim

Main category: cs.CV

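TL;DR: This paper proposes RE-TRIP, a new descriptor for 3D place recognition that combines geometric measurements with reflectivity information to improve robustness in challenging scenes.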

Details

Motivation: Existing LiDAR place recognition methods rely mainly on geometric information and ignore reflectivity data; RE-TRIP aims to use this additional information to improve performance. Method: Proposes the RE-TRIP descriptor, together with a keypoint extraction method, an instance segmentation method, a matching method, and a reflectivity-based verification method. Result: Experiments on several public datasets show RE-TRIP outperforms existing methods (such as Scan Context). Conclusion: By combining geometric and reflectivity information, RE-TRIP significantly improves the robustness and accuracy of place recognition. Abstract: While most people associate LiDAR primarily with its ability to measure distances and provide geometric information about the environment (via point clouds), LiDAR also captures additional data, including reflectivity or intensity values. Unfortunately, when LiDAR is applied to Place Recognition (PR) in mobile robotics, most previous works on LiDAR-based PR rely only on geometric measurements, neglecting the additional reflectivity information that LiDAR provides. In this paper, we propose a novel descriptor for 3D PR, named RE-TRIP (REflectivity-instance augmented TRIangle descriPtor). This new descriptor leverages both geometric measurements and reflectivity to enhance robustness in challenging scenarios such as geometric degeneracy, high geometric similarity, and the presence of dynamic objects. To implement RE-TRIP in real-world applications, we further propose (1) a keypoint extraction method, (2) a key instance segmentation method, (3) a RE-TRIP matching method, and (4) a reflectivity-combined loop verification method. Finally, we conduct a series of experiments to demonstrate the effectiveness of RE-TRIP. On public datasets (i.e., HELIPR, FusionPortable) containing diverse scenarios such as long corridors, bridges, large-scale urban areas, and highly dynamic environments, our experimental results show that the proposed method outperforms existing state-of-the-art methods such as Scan Context, Intensity Scan Context, and STD.

[31] TRAIL: Transferable Robust Adversarial Images via Latent diffusion

Yuhao Xue,Zhifei Zhang,Xinyang Jiang,Yifei Shen,Junyao Gao,Wentao Gu,Jiale Zhao,Miaojing Shi,Cairong Zhao

Main category: cs.CV

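TL;DR: TRAIL is a test-time adaptation framework based on latent diffusion models that combines adversarial objectives with perceptual constraints to generate distribution-aligned adversarial samples, significantly improving cross-model attack transferability.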

Details

Motivation: Existing adversarial attack methods generate adversarial features that do not match the real data distribution, which limits cross-model transferability. Method: TRAIL uses a latent diffusion model and updates the U-Net weights during the attack, combining adversarial objectives with perceptual constraints to generate adversarial samples. Result: Experiments show TRAIL significantly outperforms existing methods in cross-model attack transferability. Conclusion: Distribution-aligned adversarial feature synthesis is critical for practical black-box attacks. Abstract: Adversarial attacks exploiting unrestricted natural perturbations present severe security risks to deep learning systems, yet their transferability across models remains limited due to distribution mismatches between generated adversarial features and real-world data. While recent works utilize pre-trained diffusion models as adversarial priors, they still encounter challenges due to the shift between the distribution of ideal adversarial samples and the natural image distribution learned by the diffusion model. To address the challenge, we propose Transferable Robust Adversarial Images via Latent Diffusion (TRAIL), a test-time adaptation framework that enables the model to generate images that carry adversarial features and closely resemble the target images. To mitigate the distribution shift, during attacks, TRAIL updates the diffusion U-Net's weights by combining adversarial objectives (to mislead victim models) and perceptual constraints (to preserve image realism). The adapted model then generates adversarial samples through iterative noise injection and denoising guided by these objectives. Experiments demonstrate that TRAIL significantly outperforms state-of-the-art methods in cross-model attack transferability, validating that distribution-aligned adversarial feature synthesis is critical for practical black-box attacks.

[32] Erased or Dormant? Rethinking Concept Erasure Through Reversibility

Ping Liu,Chi Zhang

Main category: cs.CV

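TL;DR: The paper examines whether current concept erasure techniques truly remove a diffusion model's ability to generate targeted concepts or merely achieve superficial, prompt-specific suppression. A systematic evaluation of the robustness and reversibility of two representative methods finds that erased concepts often reemerge after slight adaptation, exposing the limits of existing methods.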

Details

Motivation: Verify whether concept erasure techniques truly remove the ability to generate targeted concepts rather than merely suppressing them superficially under specific prompts. Method: Adopt an instance-level evaluation strategy that uses lightweight fine-tuning to test the reactivation potential of erased concepts, with quantitative and qualitative analysis of two representative methods (Unified Concept Editing and Erased Stable Diffusion). Result: Erased concepts often reemerge after slight adaptation, indicating that existing methods only suppress latent generative representations without fully eliminating them. Conclusion: Existing concept erasure methods have critical limitations; deeper, representation-level interventions and stricter evaluation standards are needed to achieve truly irreversible concept removal. Abstract: To what extent does concept erasure eliminate generative capacity in diffusion models? While prior evaluations have primarily focused on measuring concept suppression under specific textual prompts, we explore a complementary and fundamental question: do current concept erasure techniques genuinely remove the ability to generate targeted concepts, or do they merely achieve superficial, prompt-specific suppression? We systematically evaluate the robustness and reversibility of two representative concept erasure methods, Unified Concept Editing and Erased Stable Diffusion, by probing their ability to eliminate targeted generative behaviors in text-to-image models. These methods attempt to suppress undesired semantic concepts by modifying internal model parameters, either through targeted attention edits or model-level fine-tuning strategies. To rigorously assess whether these techniques truly erase generative capacity, we propose an instance-level evaluation strategy that employs lightweight fine-tuning to explicitly test the reactivation potential of erased concepts. Through quantitative metrics and qualitative analyses, we show that erased concepts often reemerge with substantial visual fidelity after minimal adaptation, indicating that current methods suppress latent generative representations without fully eliminating them. Our findings reveal critical limitations in existing concept erasure approaches and highlight the need for deeper, representation-level interventions and more rigorous evaluation standards to ensure genuine, irreversible removal of concepts from generative models.

[33] QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design

Benjamin Schneider,Dongfu Jiang,Chao Du,Tianyu Pang,Wenhu Chen

Main category: cs.CV

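TL;DR: QuickVideo uses system-algorithm co-design to address the computational bottlenecks of long-video understanding, with parallel decoding, efficient prefilling, and overlapped CPU-GPU processing, significantly reducing latency and memory use.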

Details

Motivation: Long-video understanding is essential in real-world applications, but existing methods are limited by the high computational cost of decoding and prefilling. Method: Proposes QuickVideo, comprising QuickDecoder (parallel decoding), QuickPrefill (KV-cache pruning), and overlapped CPU-GPU processing. Result: Experiments show QuickVideo significantly reduces inference time and supports efficient processing of long videos. Conclusion: QuickVideo offers a scalable and efficient solution for long-video understanding. Abstract: Long-video understanding has emerged as a crucial capability in real-world applications such as video surveillance, meeting summarization, educational lecture analysis, and sports broadcasting. However, it remains computationally prohibitive for VideoLLMs, primarily due to two bottlenecks: 1) sequential video decoding, where converting the raw bit stream to RGB frames can take up to a minute for hour-long video inputs, and 2) costly prefilling of up to several million tokens for LLM inference, resulting in high latency and memory use. To address these challenges, we propose QuickVideo, a system-algorithm co-design that substantially accelerates long-video understanding to support real-time downstream applications. It comprises three key innovations: QuickDecoder, a parallelized CPU-based video decoder that achieves 2-3 times speedup by splitting videos into keyframe-aligned intervals processed concurrently; QuickPrefill, a memory-efficient prefilling method using KV-cache pruning to support more frames with less GPU memory; and an overlapping scheme that overlaps CPU video decoding with GPU inference. Together, these components reduce inference time by a minute on long video inputs, enabling scalable, high-quality video understanding even on limited hardware. Experiments show that QuickVideo generalizes across durations and sampling rates, making long video processing feasible in practice.
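
The keyframe-aligned splitting described for QuickDecoder can be pictured with a small scheduling sketch. This is not the authors' decoder: `keyframe_intervals`, `decode_interval`, and `parallel_decode` are hypothetical names, the "decoder" is a stand-in that only returns frame indices, and the balancing rule is an assumption. The point it illustrates is that each worker's interval must begin at a keyframe so it can be decoded independently of the others.

```python
from concurrent.futures import ThreadPoolExecutor


def keyframe_intervals(keyframes, total_frames, num_workers):
    """Split [0, total_frames) into at most num_workers intervals that each start at a keyframe."""
    if not keyframes or keyframes[0] != 0:
        raise ValueError("expected sorted keyframe indices starting at 0")
    target = total_frames / num_workers  # ideal interval length
    starts = [0]
    for kf in keyframes[1:]:
        # Open a new interval at the first keyframe at or past the next ideal boundary.
        if len(starts) < num_workers and kf >= len(starts) * target:
            starts.append(kf)
    ends = starts[1:] + [total_frames]
    return list(zip(starts, ends))


def decode_interval(interval):
    """Stand-in for a real decoder: returns the frame indices it would decode."""
    start, end = interval
    return list(range(start, end))


def parallel_decode(keyframes, total_frames, num_workers):
    """Decode all intervals concurrently and stitch the results back in order."""
    intervals = keyframe_intervals(keyframes, total_frames, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        chunks = list(pool.map(decode_interval, intervals))  # map preserves input order
    return [frame for chunk in chunks for frame in chunk]


keyframes = [0, 30, 60, 90, 120, 150]
print(keyframe_intervals(keyframes, total_frames=180, num_workers=3))
frames = parallel_decode(keyframes, total_frames=180, num_workers=3)
print(len(frames))
```

A thread pool keeps the example short. CPU-bound decoding in pure Python would not speed up under threads, so a real implementation would rely on processes or on a native decoder that releases the interpreter lock.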

[34] Redemption Score: An Evaluation Framework to Rank Image Captions While Redeeming Image Semantics and Language Pragmatics

Ashim Dahal,Ankit Ghimire,Saydul Akbar Murad,Nick Rahimi

Main category: cs.CV

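TL;DR: Redemption Score is a new hybrid framework that evaluates image captions by combining three complementary signals (MID, DINO, and BERTScore) and outperforms existing methods.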

Details

Motivation: Existing image caption metrics do not fully capture visual semantics and language pragmatics, so a more comprehensive evaluation method is needed. Method: Proposes the Redemption Score framework, a calibrated fusion of MID, DINO-based perceptual similarity, and BERTScore. Result: On the Flickr8k benchmark it reaches a Kendall-τ of 56.43, outperforming 12 existing methods without task-specific training. Conclusion: By effectively combining visual and linguistic signals, Redemption Score provides a more robust and nuanced evaluation. Abstract: Evaluating image captions requires cohesive assessment of both visual semantics and language pragmatics, which is often not entirely captured by most metrics. We introduce Redemption Score, a novel hybrid framework that ranks image captions by triangulating three complementary signals: (1) Mutual Information Divergence (MID) for global image-text distributional alignment, (2) DINO-based perceptual similarity of cycle-generated images for visual grounding, and (3) BERTScore for contextual text similarity against human references. A calibrated fusion of these signals allows Redemption Score to offer a more holistic assessment. On the Flickr8k benchmark, Redemption Score achieves a Kendall-$\tau$ of 56.43, outperforming twelve prior methods and demonstrating superior correlation with human judgments without requiring task-specific training. Our framework provides a more robust and nuanced evaluation by effectively redeeming image semantics and linguistic interpretability, as indicated by strong transfer of knowledge in the Conceptual Captions and MS COCO datasets.
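
Since the headline result is a Kendall-τ correlation with human judgments, a small worked example may help. The sketch below implements the simplest variant, tau-a, which counts concordant and discordant pairs and ignores ties. Which variant the authors use is not stated in the abstract, and the scores and ratings here are made up. The reported 56.43 is presumably τ scaled by 100.

```python
def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total pairs; tied pairs count as neither."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical metric scores for four captions and the matching human ratings.
metric_scores = [0.9, 0.4, 0.7, 0.1]
human_ratings = [4, 3, 2, 1]
tau = kendall_tau(metric_scores, human_ratings)
print(round(100 * tau, 2))
```

Here the metric and the human raters disagree on one of the six caption pairs (the second and third captions), so five pairs are concordant and one is discordant.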

[35] Understanding Generative AI Capabilities in Everyday Image Editing Tasks

Mohammad Reza Taesiri,Brandon Collins,Logan Bolton,Viet Dac Lai,Franck Dernoncourt,Trung Bui,Anh Totti Nguyen

Main category: cs.CV

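TL;DR: The study analyzes 83k real-world image editing requests and finds that current AI editors (such as GPT-4o) can fulfill only about 33% of them and perform worse on low-creativity requests that require precise edits.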

Details

Motivation: To explore the practical potential of generative AI in image editing and to understand the gap between user needs and AI capabilities. Method: Analyzes 83k editing requests posted to a Reddit community over 12 years, together with 305k PSR-wizard edits, using both human and VLM ratings. Result: AI editors perform poorly on low-creativity tasks, often fail to preserve the identity of people or animals, and make non-requested touch-ups. VLM ratings differ from human ratings. Conclusion: AI editors do better on creative tasks but need stronger precise-editing ability; VLM judges may favor AI edits. Abstract: Generative AI (GenAI) holds significant promise for automating everyday image editing tasks, especially following the recent release of GPT-4o on March 25, 2025. However, what subjects do people most often want edited? What kinds of editing actions do they want to perform (e.g., removing or stylizing the subject)? Do people prefer precise edits with predictable outcomes or highly creative ones? By understanding the characteristics of real-world requests and the corresponding edits made by freelance photo-editing wizards, can we draw lessons for improving AI-based editors and determine which types of requests can currently be handled successfully by AI editors? In this paper, we present a unique study addressing these questions by analyzing 83k requests from the past 12 years (2013-2025) on the Reddit community, which collected 305k PSR-wizard edits. According to human ratings, only about 33% of requests can be fulfilled by the best AI editors (including GPT-4o, Gemini-2.0-Flash, SeedEdit). Interestingly, AI editors perform worse on low-creativity requests that require precise editing than on more open-ended tasks. They often struggle to preserve the identity of people and animals, and frequently make non-requested touch-ups. On the other side of the table, VLM judges (e.g., o1) perform differently from human judges and may prefer AI edits more than human edits. Code and qualitative examples are available at: https://psrdataset.github.io

Chaoya Jiang,Yongrui Heng,Wei Ye,Han Yang,Haiyang Xu,Ming Yan,Ji Zhang,Fei Huang,Shikun Zhang

Main category: cs.CV

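TL;DR: VLM-R³ is a vision-language model framework that dynamically focuses on and revisits visual regions so that textual reasoning on complex tasks is grounded more precisely in visual evidence.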

Details

Motivation: Existing reasoning-based multimodal language models have had some success in generating long textual reasoning chains, but they struggle on complex tasks that require dynamic, iterative focusing on visual regions. Method: Proposes the VLM-R³ framework with Region-Conditioned Reinforcement Policy Optimization (R-GRPO), which trains the model to select informative regions, formulate transformations (e.g., crop, zoom), and integrate the resulting visual context into the reasoning chain. Result: On benchmarks such as MathVista and ScienceQA, VLM-R³ sets a new state of the art in zero-shot and few-shot settings, especially on tasks that require fine-grained spatial reasoning. Conclusion: Through dynamic region selection and integration of visual context, VLM-R³ significantly improves performance on complex visual reasoning tasks. Abstract: Recently, reasoning-based MLLMs have achieved a degree of success in generating long-form textual reasoning chains. However, they still struggle with complex tasks that necessitate dynamic and iterative focusing on and revisiting of visual regions to achieve precise grounding of textual reasoning in visual evidence. We introduce \textbf{VLM-R$^3$} (\textbf{V}isual \textbf{L}anguage \textbf{M}odel with \textbf{R}egion \textbf{R}ecognition and \textbf{R}easoning), a framework that equips an MLLM with the ability to (i) decide \emph{when} additional visual evidence is needed, (ii) determine \emph{where} to ground within the image, and (iii) seamlessly weave the relevant sub-image content back into an interleaved chain-of-thought. The core of our method is \textbf{Region-Conditioned Reinforcement Policy Optimization (R-GRPO)}, a training paradigm that rewards the model for selecting informative regions, formulating appropriate transformations (e.g.\ crop, zoom), and integrating the resulting visual context into subsequent reasoning steps. To bootstrap this policy, we compile a modest but carefully curated Visuo-Lingual Interleaved Rationale (VLIR) corpus that provides step-level supervision on region selection and textual justification. Extensive experiments on MathVista, ScienceQA, and other benchmarks show that VLM-R$^3$ sets a new state of the art in zero-shot and few-shot settings, with the largest gains appearing on questions demanding subtle spatial reasoning or fine-grained visual cue extraction.

[37] A Causal Approach to Mitigate Modality Preference Bias in Medical Visual Question Answering

Shuchang Ye,Usman Naseem,Mingyuan Meng,Dagan Feng,Jinman Kim

Main category: cs.CV

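TL;DR: The MedCFVQA model uses causal graphs to eliminate modality preference bias and significantly outperforms its non-causal counterpart, including on reconstructed datasets.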

Details

Motivation: Existing MedVQA models suffer from modality preference bias and therefore fail to make full use of multimodal knowledge. Method: Proposes the MedCFVQA model, which uses causal graphs to eliminate modality preference bias, and reconstructs datasets to reduce prior dependencies. Result: MedCFVQA significantly outperforms its non-causal counterpart on SLAKE, RadVQA, and their reconstructed versions. Conclusion: MedCFVQA effectively addresses modality preference bias and improves MedVQA performance. Abstract: Medical Visual Question Answering (MedVQA) is crucial for enhancing the efficiency of clinical diagnosis by providing accurate and timely responses to clinicians' inquiries regarding medical images. Existing MedVQA models suffer from modality preference bias, where predictions are heavily dominated by one modality while overlooking the other (in MedVQA, usually questions dominate the answer but images are overlooked), thereby failing to learn multimodal knowledge. To overcome the modality preference bias, we propose a Medical CounterFactual VQA (MedCFVQA) model, which trains with bias and leverages causal graphs to eliminate the modality preference bias during inference. Existing MedVQA datasets exhibit substantial prior dependencies between questions and answers, which results in acceptable performance even if the model significantly suffers from the modality preference bias. To address this issue, we reconstruct new datasets by leveraging existing MedVQA datasets and Changing their Prior dependencies (CP) between questions and their answers in the training and test sets. Extensive experiments demonstrate that MedCFVQA significantly outperforms its non-causal counterpart on the SLAKE and RadVQA datasets as well as on their SLAKE-CP and RadVQA-CP counterparts.

[38] A Shape-Aware Total Body Photography System for In-focus Surface Coverage Optimization

Wei-Lun Huang,Joshua Liu,Davood Tashayyod,Jun Kang,Amir Gandjbakhche,Misha Kazhdan,Mehran Armand

Main category: cs.CV

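TL;DR: Proposes a novel shape-aware total body photography system that optimizes image resolution and sharpness to improve automatic detection of suspicious lesions in skin cancer screening.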

Details

Motivation: Existing total body photography systems still have room for improvement in automatically detecting and analyzing suspicious skin lesions, particularly in image resolution and sharpness. Method: The system combines depth and RGB cameras, 3D body shape estimation, and an in-focus surface optimization method to select the optimal focus distance for each camera pose. Result: On simulation data and a real scan, the system achieves average resolutions of 0.068 mm/pixel and 0.0566 mm/pixel, with about 85% and 95% of the surface area in focus, respectively. Conclusion: The system's high-fidelity imaging is expected to improve the accuracy of automated skin lesion analysis and support skin cancer screening. Abstract: Total Body Photography (TBP) is becoming a useful screening tool for patients at high risk for skin cancer. While much progress has been made, existing TBP systems can be further improved for automatic detection and analysis of suspicious skin lesions, which is in part related to the resolution and sharpness of acquired images. This paper proposes a novel shape-aware TBP system automatically capturing full-body images while optimizing image quality in terms of resolution and sharpness over the body surface. The system uses depth and RGB cameras mounted on a 360-degree rotary beam, along with 3D body shape estimation and an in-focus surface optimization method to select the optimal focus distance for each camera pose. This allows for optimizing the focused coverage over the complex 3D geometry of the human body given the calibrated camera poses. We evaluate the effectiveness of the system in capturing high-fidelity body images. The proposed system achieves an average resolution of 0.068 mm/pixel and 0.0566 mm/pixel with approximately 85% and 95% of surface area in-focus, evaluated on simulation data of diverse body shapes and poses as well as a real scan of a mannequin respectively. Furthermore, the proposed shape-aware focus method outperforms existing focus protocols (e.g. auto-focus). We believe the high-fidelity imaging enabled by the proposed system will improve automated skin lesion analysis for skin cancer screening.

[39] CT-Agent: A Multimodal-LLM Agent for 3D CT Radiology Question Answering

Yuren Mao,Wenyi Xu,Yuyang Qin,Yunjun Gao

Main category: cs.CV

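TL;DR: This paper proposes CT-Agent, a multimodal agentic framework that addresses anatomic complexity and cross-slice spatial relationships in CT radiology question answering.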

Details

Motivation: To give radiologists an efficient visual question answering system that eases the burden of writing CT radiology reports, and to address the shortcomings of existing systems in understanding CT images and capturing spatial relationships. Method: Uses anatomically independent tools to break down anatomic complexity and a global-local token compression strategy to efficiently capture cross-slice spatial relationships. Result: Experiments on two 3D chest CT datasets (CT-RATE and RadGenome-ChestCT) verify CT-Agent's superior performance. Conclusion: CT-Agent effectively addresses the key challenges of CT radiology question answering and has potential for practical use. Abstract: A Computed Tomography (CT) scan, which produces 3D volumetric medical data that can be viewed as hundreds of cross-sectional images (a.k.a. slices), provides detailed anatomical information for diagnosis. For radiologists, creating CT radiology reports is time-consuming and error-prone. A visual question answering (VQA) system that can answer radiologists' questions about some anatomical regions on the CT scan and even automatically generate a radiology report is urgently needed. However, existing VQA systems cannot adequately handle the CT radiology question answering (CTQA) task for two reasons: (1) anatomic complexity makes CT images difficult to understand; (2) spatial relationships across hundreds of slices are difficult to capture. To address these issues, this paper proposes CT-Agent, a multimodal agentic framework for CTQA. CT-Agent adopts anatomically independent tools to break down the anatomic complexity; furthermore, it efficiently captures the across-slice spatial relationship with a global-local token compression strategy. Experimental results on two 3D chest CT datasets, CT-RATE and RadGenome-ChestCT, verify the superior performance of CT-Agent.

[40] DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution

Zheng Chen,Zichen Zou,Kewei Zhang,Xiongfei Su,Xin Yuan,Yong Guo,Yulun Zhang

Main category: cs.CV

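TL;DR: DOVE is an efficient one-step diffusion model for real-world video super-resolution (VSR); by fine-tuning a pretrained model and introducing a latent-pixel training strategy, it substantially speeds up inference.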

Details

Motivation: Diffusion models perform well in video super-resolution but are slow at inference; single-step sampling can accelerate them, but achieving one step in VSR remains challenging. Method: Proposes DOVE, obtained by fine-tuning a pretrained video diffusion model (CogVideoX) with a latent-pixel training strategy and the high-quality HQ-VSR dataset. Result: DOVE matches or surpasses multi-step diffusion methods while being far more efficient at inference, up to 28 times faster than existing methods. Conclusion: DOVE offers an efficient one-step solution for video super-resolution that combines strong performance with fast inference. Abstract: Diffusion models have demonstrated promising performance in real-world video super-resolution (VSR). However, the dozens of sampling steps they require make inference extremely slow. Sampling acceleration techniques, particularly single-step, provide a potential solution. Nonetheless, achieving one step in VSR remains challenging, due to the high training overhead on video data and stringent fidelity demands. To tackle the above issues, we propose DOVE, an efficient one-step diffusion model for real-world VSR. DOVE is obtained by fine-tuning a pretrained video diffusion model (*i.e.*, CogVideoX). To effectively train DOVE, we introduce the latent-pixel training strategy. The strategy employs a two-stage scheme to gradually adapt the model to the video super-resolution task. Meanwhile, we design a video processing pipeline to construct a high-quality dataset tailored for VSR, termed HQ-VSR. Fine-tuning on this dataset further enhances the restoration capability of DOVE. Extensive experiments show that DOVE exhibits comparable or superior performance to multi-step diffusion-based VSR methods. It also offers outstanding inference efficiency, achieving up to a **28$\times$** speed-up over existing methods such as MGLD-VSR. Code is available at: https://github.com/zhengchen1999/DOVE.

[41] Swin Transformer for Robust CGI Images Detection: Intra- and Inter-Dataset Analysis across Multiple Color Spaces

Preeti Mehta,Aman Sagar,Suchi Kumari

Main category: cs.CV

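TL;DR: This study proposes a Swin Transformer-based model for distinguishing computer-generated imagery (CGI) from authentic digital images in three color spaces (RGB, YCbCr, and HSV) and validates it on multiple datasets.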

Details

Motivation: Existing classification methods are limited in handling the complexity and variability of CGI, so a more accurate way to tell CGI from authentic images is needed. Method: Uses the Swin Transformer's hierarchical architecture to capture local and global features, applies data augmentation to address dataset imbalance, and visualizes feature separability with t-SNE. Result: The RGB color space performs best; the model shows high accuracy and robustness in cross-dataset testing and outperforms VGG-19 and ResNet-50. Conclusion: The Swin Transformer model shows promise for digital image forensics, particularly for distinguishing CGI from natural images, and suits scenarios that require high-precision classification. Abstract: This study aims to address the growing challenge of distinguishing computer-generated imagery (CGI) from authentic digital images across three different color spaces: RGB, YCbCr, and HSV. Given the limitations of existing classification methods in handling the complexity and variability of CGI, this research proposes a Swin Transformer based model for accurate differentiation between natural and synthetic images. The proposed model leverages the Swin Transformer's hierarchical architecture to capture local and global features for distinguishing CGI from natural images. Its performance was assessed through intra- and inter-dataset testing across three datasets: CiFAKE, JSSSTU, and Columbia. The model was evaluated individually on each dataset (D1, D2, D3) and on the combined datasets (D1+D2+D3) to test its robustness and domain generalization. To address dataset imbalance, data augmentation techniques were applied. Additionally, t-SNE visualization was used to demonstrate the feature separability achieved by the Swin Transformer across the selected color spaces. The model's performance was tested across all color schemes, with the RGB color scheme yielding the highest accuracy for each dataset. As a result, RGB was selected for domain generalization analysis and compared with the CNN-based models VGG-19 and ResNet-50. The comparative results demonstrate the proposed model's effectiveness in detecting CGI, highlighting its robustness and reliability in both intra-dataset and inter-dataset evaluations. The findings of this study highlight the Swin Transformer model's potential as an advanced tool for digital image forensics, particularly in distinguishing CGI from natural images. The model's strong performance indicates its capability for domain generalization, making it a valuable asset in scenarios requiring precise and reliable image classification.

[42] DualComp: End-to-End Learning of a Unified Dual-Modality Lossless Compressor

Yan Zhao,Zhengxue Cheng,Junxuan Zhang,Qunshan Gu,Qi Wang,Li Song

Main category: cs.CV

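TL;DR: DualComp is a lightweight, unified dual-modality (image and text) lossless compressor. Through three structural enhancements (modality-unified tokenization, modality-switching contextual learning, and modality-routing mixture-of-experts), it achieves efficient parameter utilization and near real-time inference.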

Details

Motivation: Existing learning-based lossless compressors are mostly designed for a single modality and lack flexibility and adaptability for multi-modal data, while multi-modal large language models are too complex for practical deployment. Method: DualComp is built on a lightweight backbone with three structural enhancements (modality-unified tokenization, modality-switching contextual learning, and modality-routing mixture-of-experts), combined with a reparameterization training strategy. Result: DualComp's compression performance on text and image datasets is on par with SOTA LLM-based methods, and its single-modality variant surpasses the previous best image compressor on the Kodak dataset by about 9% with only 1.2% of the model size. Conclusion: DualComp offers an efficient, lightweight solution for multi-modal lossless compression that balances performance and practicality. Abstract: Most learning-based lossless compressors are designed for a single modality, requiring separate models for multi-modal data and lacking flexibility. However, different modalities vary significantly in format and statistical properties, making it ineffective to use compressors that lack modality-specific adaptations. While multi-modal large language models (MLLMs) offer a potential solution for modality-unified compression, their excessive complexity hinders practical deployment. To address these challenges, we focus on the two most common modalities, image and text, and propose DualComp, the first unified and lightweight learning-based dual-modality lossless compressor. Built on a lightweight backbone, DualComp incorporates three key structural enhancements to handle modality heterogeneity: modality-unified tokenization, modality-switching contextual learning, and modality-routing mixture-of-experts. A reparameterization training strategy is also used to boost compression performance. DualComp integrates both modality-specific and shared parameters for efficient parameter utilization, enabling near real-time inference (200KB/s) on desktop CPUs. With far fewer parameters, DualComp achieves compression performance on par with the SOTA LLM-based methods for both text and image datasets. Its simplified single-modality variant surpasses the previous best image compressor on the Kodak dataset by about 9% using just 1.2% of the model size.

[43] LINEA: Fast and Accurate Line Detection Using Scalable Transformers

Sebastian Janampa,Marios Pattichis

Main category: cs.CV

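TL;DR: Proposes LINEA, a new transformer-based method built on Deformable Line Attention (DLA) that needs no pretraining of the attention mechanism on large datasets and significantly improves line-detection speed and performance.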

Details

Motivation: Current transformer-based line detection methods are accurate but slow at inference and require pretraining on large datasets, which limits their use in low-latency video analysis. Method: Uses a Deformable Line Attention (DLA) mechanism that removes the need to pretrain the attention mechanism, yielding a new method called LINEA. Result: Experiments show that LINEA is significantly faster and outperforms previous models on sAP in out-of-distribution dataset testing. Conclusion: Through DLA, LINEA achieves efficient line detection without pretraining and offers a practical solution for low-latency applications. Abstract: Line detection is a basic digital image processing operation used by higher-level processing methods. Recently, transformer-based methods for line detection have proven to be more accurate than methods based on CNNs, at the expense of significantly lower inference speeds. As a result, video analysis methods that require low latencies cannot benefit from current transformer-based methods for line detection. In addition, current transformer-based models require pretraining attention mechanisms on large datasets (e.g., COCO or Object360). This paper develops a new transformer-based method that is significantly faster without requiring pretraining the attention mechanism on large datasets. We eliminate the need to pre-train the attention mechanism using a new mechanism, Deformable Line Attention (DLA). We use the term LINEA to refer to our new transformer-based method based on DLA. Extensive experiments show that LINEA is significantly faster and outperforms previous models on sAP in out-of-distribution dataset testing.

[44] DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

Zhenjie Yang,Yilin Chai,Xiaosong Jia,Qifeng Li,Yuqian Shao,Xuekai Zhu,Haisheng Su,Junchi Yan

Main category: cs.CV

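TL;DR: DriveMoE is an end-to-end autonomous driving framework based on a Mixture-of-Experts (MoE) architecture; with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE, it dynamically handles multi-view data and complex driving scenarios and achieves SOTA performance on Bench2Drive.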

Details

Motivation: End-to-end autonomous driving must efficiently process multi-view data and cope with complex scenarios, especially rare driving maneuvers. The success of MoE architectures in large language models shows that parameter specialization enables strong scalability. Method: On top of the Drive-π0 baseline, adds a Vision MoE (dynamically selecting relevant cameras) and an Action MoE (activating expert modules for different driving behaviors), mirroring human driving cognition. Result: DriveMoE achieves the best performance in the Bench2Drive evaluation and avoids the mode-averaging problem. Conclusion: By combining vision and action MoE, DriveMoE effectively improves performance on autonomous driving tasks; code and models will be released. Abstract: End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. Recent success of Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that specialization of parameters enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework, with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is built upon our $\pi_0$ Vision-Language-Action (VLA) baseline (originally from the embodied AI field), called Drive-$\pi_0$. Specifically, we add Vision MoE to Drive-$\pi_0$ by training a router to select relevant cameras according to the driving context dynamically. This design mirrors human driving cognition, where drivers selectively attend to crucial visual cues rather than exhaustively processing all visual information. In addition, we add Action MoE by training another router to activate specialized expert modules for different driving behaviors. Through explicit behavioral specialization, DriveMoE is able to handle diverse scenarios without suffering from mode averaging like existing models. In Bench2Drive closed-loop evaluation experiments, DriveMoE achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of combining vision and action MoE in autonomous driving tasks. We will release our code and models of DriveMoE and Drive-$\pi_0$.

[45] ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay

Fanbin Lu,Zhisheng Zhong,Shu Liu,Chi-Wing Fu,Jiaya Jia

Main category: cs.CV

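TL;DR: This paper proposes ARPO, a reinforcement learning method for improving vision-language GUI agents on complex tasks, using experience replay and a task selection strategy to stabilize training.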

Details

Motivation: Training large language models as interactive GUI agents faces the challenges of optimizing long-horizon action sequences and handling multimodal feedback, and existing methods have seen little application in GUI environments. Method: Proposes ARPO, which combines GRPO with an experience replay buffer and adds a task selection strategy to stabilize training. Result: On the OSWorld benchmark, ARPO performs strongly and sets a new performance baseline for GUI agents trained with reinforcement learning. Conclusion: Reinforcement learning is effective for training multi-turn vision-language GUI agents that can handle complex real-world UI interactions. Abstract: Training large language models (LLMs) as interactive agents for controlling graphical user interfaces (GUIs) presents a unique challenge to optimize long-horizon action sequences with multimodal feedback from complex environments. While recent works have advanced multi-turn reinforcement learning (RL) for reasoning and tool-using capabilities in LLMs, their application to GUI-based agents remains relatively underexplored due to the difficulty of sparse rewards, delayed feedback, and high rollout costs. In this paper, we investigate end-to-end policy optimization for vision-language-based GUI agents with the aim of improving performance on complex, long-horizon computer tasks. We propose Agentic Replay Policy Optimization (ARPO), an end-to-end RL approach that augments Group Relative Policy Optimization (GRPO) with a replay buffer to reuse the successful experience across training iterations. To further stabilize the training process, we propose a task selection strategy that filters tasks based on baseline agent performance, allowing the agent to focus on learning from informative interactions. Additionally, we compare ARPO with offline preference optimization approaches, highlighting the advantages of policy-based methods in GUI environments. Experiments on the OSWorld benchmark demonstrate that ARPO achieves competitive results, establishing a new performance baseline for LLM-based GUI agents trained via reinforcement learning. Our findings underscore the effectiveness of reinforcement learning for training multi-turn, vision-language GUI agents capable of managing complex real-world UI interactions. Code and models: https://github.com/dvlab-research/ARPO.git.
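To make the combination of GRPO and a replay buffer concrete, here is a minimal sketch under stated assumptions. The group-relative advantage (reward standardized within a group of rollouts for the same task) is the standard GRPO idea. The buffer class `SuccessReplayBuffer`, the function `build_training_group`, and the rule "swap in one cached success when every rollout in the group failed" are hypothetical choices for illustration; the abstract only says that successful experience is reused across training iterations.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same task."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


class SuccessReplayBuffer:
    """Hypothetical buffer that keeps the most recent successful rollout per task."""

    def __init__(self):
        self._latest = {}

    def add(self, task_id, trajectory, reward):
        if reward > 0:
            self._latest[task_id] = (trajectory, reward)

    def get(self, task_id):
        return self._latest.get(task_id)


def build_training_group(task_id, rollouts, buffer):
    """rollouts is a list of (trajectory, reward) pairs for one task.

    Assumed rule: if every rollout failed, all advantages would be zero and
    the group would carry no learning signal, so one rollout is replaced by
    a cached success when the buffer has one.
    """
    for trajectory, reward in rollouts:
        buffer.add(task_id, trajectory, reward)
    if all(reward == 0 for _, reward in rollouts):
        cached = buffer.get(task_id)
        if cached is not None:
            rollouts = rollouts[:-1] + [cached]
    advantages = group_relative_advantages([reward for _, reward in rollouts])
    return rollouts, advantages


if __name__ == "__main__":
    buffer = SuccessReplayBuffer()
    buffer.add("open-settings", "cached-success", 1.0)
    failed = [("try-1", 0.0), ("try-2", 0.0), ("try-3", 0.0), ("try-4", 0.0)]
    group, advantages = build_training_group("open-settings", failed, buffer)
    print(group[-1], advantages)
```

The sketch shows why sparse rewards are a problem for group-relative methods: without the cached success, a group of four failures yields four zero advantages.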

[46] Efficient Prototype Consistency Learning in Medical Image Segmentation via Joint Uncertainty and Data Augmentation

Lijian Li,Yuanpeng He,Chi-Man Pun

Main category: cs.CV

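TL;DR: The paper proposes EPCL-JUDA, which strengthens the expressiveness of prototype learning through joint uncertainty quantification and data augmentation and performs well in medical image segmentation.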

Details

Motivation: Because labeled data is scarce, existing prototype learning methods produce prototypes with limited expressiveness that cannot fully represent class embeddings. Method: Building on the Mean-Teacher framework, generates prototypes from original and augmented labeled data, refines pseudo-labels with joint uncertainty quantification, fuses labeled and unlabeled prototypes into high-quality global prototypes, and introduces a prototype network to reduce memory requirements. Result: Outperforms existing methods on multiple datasets, confirming the framework's effectiveness. Conclusion: By strengthening prototype expressiveness and refining pseudo-labels, EPCL-JUDA significantly improves semi-supervised medical image segmentation. Abstract: Recently, prototype learning has emerged in semi-supervised medical image segmentation and achieved remarkable performance. However, the scarcity of labeled data limits the expressiveness of prototypes in previous methods, potentially hindering the complete representation of prototypes for class embedding. To overcome this issue, we propose an efficient prototype consistency learning via joint uncertainty quantification and data augmentation (EPCL-JUDA) to enhance the semantic expression of prototypes based on the framework of Mean-Teacher. The concatenation of original and augmented labeled data is fed into the student network to generate expressive prototypes. Then, a joint uncertainty quantification method is devised to optimize pseudo-labels and generate reliable prototypes for original and augmented unlabeled data separately. High-quality global prototypes for each class are formed by fusing labeled and unlabeled prototypes, which are utilized to generate prototype-to-features to conduct consistency learning. Notably, a prototype network is proposed to reduce high memory requirements brought by the introduction of augmented data. Extensive experiments on the Left Atrium, Pancreas-NIH, and Type B Aortic Dissection datasets demonstrate EPCL-JUDA's superiority over previous state-of-the-art approaches, confirming the effectiveness of our framework. The code will be released soon.

[47] Self-Classification Enhancement and Correction for Weakly Supervised Object Detection

Yufei Yin,Lechao Cheng,Wengang Zhou,Jiajun Deng,Zhou Yu,Houqiang Li

Main category: cs.CV

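TL;DR: This paper proposes a new weakly supervised object detection framework that introduces a self-classification enhancement module and a self-classification correction algorithm to resolve classification ambiguities between multi-class classification tasks, and it performs strongly on the VOC datasets.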

Details

Motivation: Weakly supervised object detection (WSOD) has attracted attention for its low labeling cost, but existing methods suffer from classification ambiguities between their multi-class classification tasks and fail to exploit the strengths of each. Method: Proposes a self-classification enhancement module that integrates intra-class binary classification (ICBC) to bridge the gap between the multi-class classification tasks, and designs a self-classification correction algorithm to reduce misclassified predictions. Result: Experiments on the VOC 2007 and 2012 datasets show that the framework performs strongly. Conclusion: Through ICBC and the correction algorithm, the new framework effectively improves weakly supervised object detection. Abstract: In recent years, weakly supervised object detection (WSOD) has attracted much attention due to its low labeling cost. The success of recent WSOD models is often ascribed to the two-stage multi-class classification (MCC) task, i.e., multiple instance learning and online classification refinement. Despite achieving non-trivial progress, these methods overlook potential classification ambiguities between these two MCC tasks and fail to leverage their unique strengths. In this work, we introduce a novel WSOD framework to ameliorate these two issues. For one thing, we propose a self-classification enhancement module that integrates intra-class binary classification (ICBC) to bridge the gap between the two distinct MCC tasks. The ICBC task enhances the network's discrimination between positive and mis-located samples in a class-wise manner and forges a mutually reinforcing relationship with the MCC task. For another, we propose a self-classification correction algorithm during inference, which combines the results of both MCC tasks to effectively reduce the mis-classified predictions. Extensive experiments on the prevalent VOC 2007 & 2012 datasets demonstrate the superior performance of our framework.

[48] SAMba-UNet: Synergizing SAM2 and Mamba in UNet with Heterogeneous Aggregation for Cardiac MRI Segmentation

Guohao Huo,Ruiting Dai,Hao Tang

Main category: cs.CV

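TL;DR: Proposes SAMba-UNet, a dual-encoder architecture for cardiac MRI segmentation that improves extraction of complex pathological features through dynamic feature fusion and a heterogeneous omni-attention module.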

Details

Motivation: To address the challenge of extracting complex pathological features in cardiac MRI segmentation, especially small lesions and boundary localization. Method: Combines SAM2, Mamba, and UNet, and designs a Dynamic Feature Fusion Refiner and a Heterogeneous Omni-Attention Convergence Module to achieve cross-modal feature learning. Result: On the ACDC dataset, the model reaches a Dice coefficient of 0.9103 and an HD95 boundary error of 1.0859 mm, significantly outperforming existing methods. Conclusion: The model provides an efficient and reliable solution for cardiac disease diagnosis, and the code will be open-sourced. Abstract: To address the challenge of complex pathological feature extraction in automated cardiac MRI segmentation, this study proposes an innovative dual-encoder architecture named SAMba-UNet. The framework achieves cross-modal feature collaborative learning by integrating the vision foundation model SAM2, the state-space model Mamba, and the classical UNet. To mitigate domain discrepancies between medical and natural images, a Dynamic Feature Fusion Refiner is designed, which enhances small lesion feature extraction through multi-scale pooling and a dual-path calibration mechanism across channel and spatial dimensions. Furthermore, a Heterogeneous Omni-Attention Convergence Module (HOACM) is introduced, combining global contextual attention with branch-selective emphasis mechanisms to effectively fuse SAM2's local positional semantics and Mamba's long-range dependency modeling capabilities. Experiments on the ACDC cardiac MRI dataset demonstrate that the proposed model achieves a Dice coefficient of 0.9103 and an HD95 boundary error of 1.0859 mm, significantly outperforming existing methods, particularly in boundary localization for complex pathological structures such as right ventricular anomalies. This work provides an efficient and reliable solution for automated cardiac disease diagnosis, and the code will be open-sourced.
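The Dice coefficient reported above measures overlap between a predicted mask and the ground truth. The sketch below shows the standard definition on flattened binary masks; the function name `dice_coefficient` and the convention of returning 1.0 when both masks are empty are assumptions, and the paper's exact evaluation protocol (per-class averaging, 3D volumes) is not described in the abstract.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for 0/1 masks.

    `pred` and `target` are equally long flat sequences of 0/1 labels,
    for example a segmentation mask flattened pixel by pixel.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


if __name__ == "__main__":
    prediction = [1, 1, 0, 0]
    ground_truth = [1, 0, 0, 0]
    # One overlapping pixel, three foreground pixels in total.
    print(dice_coefficient(prediction, ground_truth))
```

Dice rewards overlap but says little about boundary quality, which is why segmentation papers commonly pair it with a distance-based measure such as HD95.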

[49] Paired and Unpaired Image to Image Translation using Generative Adversarial Networks

Gaurav Kumar,Soham Satyadharma,Harpreet Singh

Main category: cs.CV

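TL;DR: The paper studies GAN-based paired and unpaired image translation using a conditional GAN and cycle consistency loss, and evaluates the results quantitatively and qualitatively.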

Details

Motivation: To explore image translation, in particular paired and unpaired tasks, for generating new images with different styles or textures. Method: Uses a conditional GAN for the paired task and cycle consistency loss for the unpaired task, and tests different loss functions, Patch-GAN sizes, and architectures. Result: Results of the different experiments are evaluated with quantitative metrics (precision, recall, FID score) and qualitative analysis. Conclusion: GANs perform well on image translation tasks, and the quantitative and qualitative analyses provide a basis for model optimization. Abstract: Image to image translation is an active area of research in the field of computer vision, enabling the generation of new images with different styles, textures, or resolutions while preserving their characteristic properties. Recent architectures leverage Generative Adversarial Networks (GANs) to transform input images from one domain to another. In this work, we focus on the study of both paired and unpaired image translation across multiple image domains. For the paired task, we used a conditional GAN model, and for the unpaired task, we trained it using cycle consistency loss. We experimented with different types of loss functions, multiple Patch-GAN sizes, and model architectures. New quantitative metrics - precision, recall, and FID score - were used for analysis. In addition, a qualitative study of the results of different experiments was conducted.
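Cycle consistency, used above for the unpaired task, asks that translating an image to the other domain and back recovers the original. The sketch below uses the common L1 formulation with two mappings `G` (domain X to Y) and `F` (Y to X). The L1 choice, the function names, and the toy list-of-floats "images" are assumptions for illustration; the abstract does not specify the exact loss form or weighting.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally long vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(x, y, G, F):
    """L1 cycle loss: ||F(G(x)) - x|| + ||G(F(y)) - y||.

    G maps domain X to domain Y and F maps Y back to X. In a real model
    both would be neural networks; here they are plain Python callables.
    """
    forward_cycle = l1_distance(F(G(x)), x)   # X -> Y -> X
    backward_cycle = l1_distance(G(F(y)), y)  # Y -> X -> Y
    return forward_cycle + backward_cycle


if __name__ == "__main__":
    def G(v):
        return [2.0 * e for e in v]          # toy "translator" X -> Y

    def F_exact(v):
        return [e / 2.0 for e in v]          # exact inverse of G

    def F_biased(v):
        return [e / 2.0 + 0.1 for e in v]    # imperfect inverse

    x, y = [1.0, 2.0, 3.0], [4.0, 6.0]
    print(cycle_consistency_loss(x, y, G, F_exact))   # perfect cycle
    print(cycle_consistency_loss(x, y, G, F_biased))  # penalized cycle
```

In training, this term is added to the adversarial losses so that the two generators cannot satisfy the discriminators with outputs unrelated to their inputs.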

[50] Accelerating Targeted Hard-Label Adversarial Attacks in Low-Query Black-Box Settings

Arjhun Swaminathan,Mete Akgün

Main category: cs.CV

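TL;DR: The paper proposes TEA, a new adversarial attack that uses edge information from the target image to generate adversarial examples and outperforms existing methods in low-query settings.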

Details

Motivation: Targeted attacks in black-box settings are challenging, and existing methods rely mainly on the geometry of the decision boundary while ignoring information in the images themselves. Method: Proposes TEA, which uses edge information from the target image to generate adversarial examples that are closer to the source image while still achieving the target classification. Result: TEA outperforms existing methods in low-query settings (using nearly 70% fewer queries) and provides a better initialization for geometry-based attacks. Conclusion: By incorporating image edge information, TEA significantly improves the efficiency of adversarial attacks and suits practical black-box scenarios. Abstract: Deep neural networks for image classification remain vulnerable to adversarial examples: small, imperceptible perturbations that induce misclassifications. In black-box settings, where only the final prediction is accessible, crafting targeted attacks that aim to misclassify into a specific target class is particularly challenging due to narrow decision regions. Current state-of-the-art methods often exploit the geometric properties of the decision boundary separating a source image and a target image rather than incorporating information from the images themselves. In contrast, we propose Targeted Edge-informed Attack (TEA), a novel attack that utilizes edge information from the target image to carefully perturb it, thereby producing an adversarial image that is closer to the source image while still achieving the desired target classification. Our approach consistently outperforms current state-of-the-art methods across different models in low query settings (nearly 70% fewer queries are used), a scenario especially relevant in real-world applications with limited queries and black-box access. Furthermore, by efficiently generating a suitable adversarial example, TEA provides an improved target initialization for established geometry-based attacks.

[51] NTIRE 2025 challenge on Text to Image Generation Model Quality Assessment

Shuhao Han,Haotian Fan,Fangyuan Kong,Wenjie Liao,Chunle Guo,Chongyi Li,Radu Timofte,Liang Li,Tao Li,Junhui Cui,Yunqiu Wang,Yang Tai,Jingwei Sun,Jianhui Sun,Xinli Yue,Tianyi Wang,Huan Hou,Junda Lu,Xinyang Huang,Zitang Zhou,Zijian Zhang,Xuhui Zheng,Xuecheng Wu,Chong Peng,Xuezhi Cao,Trong-Hieu Nguyen-Mau,Minh-Hoang Le,Minh-Khoa Le-Phan,Duy-Nam Ly,Hai-Dang Nguyen,Minh-Triet Tran,Yukang Lin,Yan Hong,Chuanbiao Song,Siyuan Li,Jun Lan,Zhichao Zhang,Xinyue Li,Wei Sun,Zicheng Zhang,Yunhao Li,Xiaohong Liu,Guangtao Zhai,Zitong Xu,Huiyu Duan,Jiarui Wang,Guangji Ma,Liu Yang,Lu Liu,Qiang Hu,Xiongkuo Min,Zichuan Wang,Zhenchen Tang,Bo Peng,Jing Dong,Fengbin Guan,Zihao Yu,Yiting Lu,Wei Luo,Xin Li,Minhao Lin,Haofeng Chen,Xuanxuan He,Kele Xu,Qisheng Xu,Zijian Gao,Tianjiao Wan,Bo-Cheng Qiu,Chih-Chung Hsu,Chia-ming Lee,Yu-Fan Lin,Bo Yu,Zehao Wang,Da Mu,Mingxiu Chen,Junkang Fang,Huamei Sun,Wending Zhao,Zhiyu Wang,Wang Liu,Weikang Yu,Puhong Duan,Bin Sun,Xudong Kang,Shutao Li,Shuai He,Lingzhi Fu,Heng Cong,Rongyu Zhang,Jiarong He,Zhishan Qiao,Yongqing Huang,Zewen Chen,Zhe Pang,Juan Wang,Jian Guo,Zhizhuo Shao,Ziyu Feng,Bing Li,Weiming Hu,Hesong Li,Dehua Liu,Zeming Liu,Qingsong Xie,Ruichen Wang,Zhihao Li,Yuqi Liang,Jianqi Bi,Jun Luo,Junfeng Yang,Can Li,Jing Fu,Hongwei Xu,Mingrui Long,Lulin Tang

Main category: cs.CV

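TL;DR: The NTIRE 2025 challenge focuses on quality assessment of text-to-image generation models; it is split into an alignment track and a structure track, drew many participants, and the winning methods performed strongly.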

Details

Motivation: To address fine-grained quality assessment of text-to-image generation models, in particular image-text alignment and detection of structural distortions. Method: The challenge has an alignment track (using the EvalMuse-40K dataset) and a structure track (using the EvalMuse-Structure dataset); participants submit models for evaluation. Result: The alignment track had 371 registrants, 1,883 development-phase submissions, and 507 test-phase submissions; the structure track had 211 registrants, 1,155 development-phase submissions, and 487 test-phase submissions. Almost all methods beat the baselines, and the winning methods performed best. Conclusion: The challenge advanced research on quality assessment of text-to-image generation models, and the winning methods demonstrated superior prediction performance. Abstract: This paper reports on the NTIRE 2025 challenge on Text to Image (T2I) generation model quality assessment, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2025. The aim of this challenge is to address the fine-grained quality assessment of text-to-image generation models. This challenge evaluates text-to-image models from two aspects: image-text alignment and image structural distortion detection, and is divided into the alignment track and the structure track. The alignment track uses EvalMuse-40K, which contains around 40K AI-Generated Images (AIGIs) generated by 20 popular generative models. The alignment track has a total of 371 registered participants. A total of 1,883 submissions are received in the development phase, and 507 submissions are received in the test phase. Finally, 12 participating teams submitted their models and fact sheets. The structure track uses EvalMuse-Structure, which contains 10,000 AI-Generated Images (AIGIs) with corresponding structural distortion masks. A total of 211 participants have registered in the structure track. A total of 1,155 submissions are received in the development phase, and 487 submissions are received in the test phase. Finally, 8 participating teams submitted their models and fact sheets. Almost all methods have achieved better results than baseline methods, and the winning methods in both tracks have demonstrated superior prediction performance on T2I model quality assessment.

[52] SuperPure: Efficient Purification of Localized and Distributed Adversarial Patches via Super-Resolution GAN Models

Hossein Khalili,Seongbin Park,Venkat Bollapragada,Nader Sehatbakhsh

Main category: cs.CV

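TL;DR: The paper proposes SuperPure, a new defense against both distributed and localized adversarial patch attacks that uses pixel-wise masking and GAN-based super-resolution to substantially improve robustness and efficiency.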

Details

Motivation: Existing defenses are ineffective against distributed patch attacks and are computationally expensive, which makes them hard to use under real-time constraints. Method: Pixel-wise masking combined with GAN-based super-resolution gradually purifies adversarial patches from the image. Result: SuperPure improves robustness against localized patch attacks by more than 20%, reaches 58% robustness against distributed patch attacks, and cuts latency by over 98%. Conclusion: SuperPure outperforms existing methods in robustness, efficiency, and practicality, and is suitable for real-time systems. Abstract: As vision-based machine learning models are increasingly integrated into autonomous and cyber-physical systems, concerns about (physical) adversarial patch attacks are growing. While state-of-the-art defenses can achieve certified robustness with minimal impact on utility against highly-concentrated localized patch attacks, they fall short in two important areas: (i) State-of-the-art methods are vulnerable to low-noise distributed patches where perturbations are subtly dispersed to evade detection or masking, as shown recently by the DorPatch attack; (ii) Achieving high robustness with state-of-the-art methods is extremely time- and resource-consuming, rendering them impractical for latency-sensitive applications in many cyber-physical systems. To address both robustness and latency issues, this paper proposes a new defense strategy for adversarial patch attacks called SuperPure. The key novelty is developing a pixel-wise masking scheme that is robust against both distributed and localized patches. The masking involves leveraging a GAN-based super-resolution scheme to gradually purify the image from adversarial patches. Our extensive evaluations using ImageNet and two standard classifiers, ResNet and EfficientNet, show that SuperPure advances the state-of-the-art in three major directions: (i) it improves the robustness against conventional localized patches by more than 20%, on average, while also improving top-1 clean accuracy by almost 10%; (ii) it achieves 58% robustness against distributed patch attacks (as opposed to 0% for the state-of-the-art method, PatchCleanser); (iii) it decreases the defense end-to-end latency by over 98% compared to PatchCleanser. Our further analysis shows that SuperPure is robust against white-box attacks and different patch sizes. Our code is open-source.

[53] Efficient Motion Prompt Learning for Robust Visual Tracking

Jie Zhao,Xin Chen,Yongsheng Yuan,Michael Felsberg,Dong Wang,Huchuan Lu

Main category: cs.CV

TL;DR: Proposes a lightweight, plug-and-play motion prompt tracking method that combines motion and visual cues to improve tracking robustness.

Details

Motivation: Existing trackers rely mainly on visual discriminability and overlook the temporal coherence of video data. Method: A motion encoder, a fusion decoder, and an adaptive weight mechanism encode motion trajectories into the visual embedding space and fuse the features dynamically. Result: On seven challenging tracking benchmarks, the module significantly improves the robustness of vision-based trackers, with low training cost and little loss of speed. Conclusion: The motion prompt module effectively enhances existing vision-based trackers and is efficient and practical. Abstract: Due to the challenges of processing temporal information, most trackers depend solely on visual discriminability and overlook the unique temporal coherence of video data. In this paper, we propose a lightweight and plug-and-play motion prompt tracking method. It can be easily integrated into existing vision-based trackers to build a joint tracking framework leveraging both motion and vision cues, thereby achieving robust tracking through efficient prompt learning. A motion encoder with three different positional encodings is proposed to encode the long-term motion trajectory into the visual embedding space, while a fusion decoder and an adaptive weight mechanism are designed to dynamically fuse visual and motion features. We integrate our motion module into three different trackers with five models in total. Experiments on seven challenging tracking benchmarks demonstrate that the proposed motion module significantly improves the robustness of vision-based trackers, with minimal training costs and negligible speed sacrifice. Code is available at https://github.com/zj5559/Motion-Prompt-Tracking.

Cheng Cheng,Lin Song,Yicheng Xiao,Yuxin Chen,Xuchong Zhang,Hongbin Sun,Ying Shan

Main category: cs.CV

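TL;DR: TensorAR is a new autoregressive paradigm that moves from next-token prediction to next-tensor prediction, enabling iterative refinement of generated content.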

Details

Motivation: Existing autoregressive image generators lack a mechanism for refining earlier predictions, which limits generation quality. Method: TensorAR generates overlapping windows of image patches (tensors) in a sliding fashion and proposes a discrete tensor noising scheme to prevent information leakage. Result: Experiments show that TensorAR significantly improves the generation performance of autoregressive models. Conclusion: As a plug-and-play module, TensorAR offers an effective improvement for autoregressive models. Abstract: Autoregressive (AR) image generators offer a language-model-friendly approach to image generation by predicting discrete image tokens in a causal sequence. However, unlike diffusion models, AR models lack a mechanism to refine previous predictions, limiting their generation quality. In this paper, we introduce TensorAR, a new AR paradigm that reformulates image generation from next-token prediction to next-tensor prediction. By generating overlapping windows of image patches (tensors) in a sliding fashion, TensorAR enables iterative refinement of previously generated content. To prevent information leakage during training, we propose a discrete tensor noising scheme, which perturbs input tokens via codebook-indexed noise. TensorAR is implemented as a plug-and-play module compatible with existing AR models. Extensive experiments on LlamaGEN, Open-MAGVIT2, and RAR demonstrate that TensorAR significantly improves the generation performance of autoregressive models.
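The sliding-window idea in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the window size `k`, the function names, and the uniform-replacement noising rule are all assumptions made for illustration. It only shows how consecutive windows overlap, so that a token is predicted more than once and can be revised, and how input tokens might be perturbed with codebook-indexed noise.

```python
import random

def sliding_windows(tokens, k):
    """Return overlapping length-k windows, one starting at each position."""
    return [tokens[i:i + k] for i in range(len(tokens) - k + 1)]

def noise_tokens(tokens, codebook_size, p, rng):
    """Replace each token by a random codebook index with probability p.

    Uniform replacement is an assumed stand-in for the paper's noising scheme.
    """
    return [rng.randrange(codebook_size) if rng.random() < p else t
            for t in tokens]

tokens = [3, 7, 1, 4, 9]
windows = sliding_windows(tokens, k=3)
# Interior tokens appear in several windows, so later steps can revise them.
times_seen = {t: sum(t in w for w in windows) for t in tokens}
noisy = noise_tokens(tokens, codebook_size=16, p=0.5, rng=random.Random(0))
```

With a window of three tokens, the middle token of this five-token sequence is covered by three windows, which is what gives the model repeated chances to refine it.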

[55] Panoptic Captioning: Seeking An Equivalency Bridge for Image and Text

Kun-Yu Lin,Hongjun Wang,Weining Ren,Kai Han

Main category: cs.CV

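TL;DR: The paper introduces panoptic captioning, a task that aims to produce the minimum text equivalent of an image, and proposes the data engine PancapEngine and the method PancapChain to improve performance. Experiments show that its model surpasses existing open-source and proprietary models.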

Details

Motivation: Current multi-modal large language models perform poorly on panoptic captioning and need to improve to produce more comprehensive image descriptions. Method: PancapEngine generates high-quality data, PancapChain generates captions in stages, and a new evaluation metric, PancapScore, is introduced. Result: The PancapChain-13B model surpasses models such as InternVL-2.5-78B, GPT-4o, and Gemini-2.0-Pro. Conclusion: PancapEngine and PancapChain effectively improve panoptic captioning performance and provide a reliable evaluation benchmark for future research. Abstract: This work introduces panoptic captioning, a novel task striving to seek the minimum text equivalence of images. We take the first step towards panoptic captioning by formulating it as a task of generating a comprehensive textual description for an image, which encapsulates all entities, their respective locations and attributes, relationships among entities, as well as global image state. Through an extensive evaluation, our work reveals that state-of-the-art Multi-modal Large Language Models (MLLMs) have limited performance in solving panoptic captioning. To address this, we propose an effective data engine named PancapEngine to produce high-quality data and a novel method named PancapChain to improve panoptic captioning. Specifically, our PancapEngine first detects diverse categories of entities in images by an elaborate detection suite, and then generates required panoptic captions using entity-aware prompts. Additionally, our PancapChain explicitly decouples the challenging panoptic captioning task into multiple stages and generates panoptic captions step by step. More importantly, we contribute a comprehensive metric named PancapScore and a human-curated test set for reliable model evaluation. Experiments show that our PancapChain-13B model can beat state-of-the-art open-source MLLMs like InternVL-2.5-78B and even surpass proprietary models like GPT-4o and Gemini-2.0-Pro, demonstrating the effectiveness of our data engine and method. Project page: https://visual-ai.github.io/pancap/

[56] FPQVAR: Floating Point Quantization for Visual Autoregressive Model with FPGA Hardware Co-design

Renjie Wei,Songqiang Xu,Qingyu Guo,Meng Li

Main category: cs.CV

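TL;DR: FPQVAR is an efficient floating-point quantization framework that, through algorithm and hardware co-design, substantially reduces the memory and computation cost of VAR models while improving image generation quality and inference speed.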

Details

Motivation: VAR models perform well in image generation, but their parameter size and computation cost limit deployment on edge devices, so an efficient quantization method is needed. Method: The FPQVAR framework comprises Dual Format Quantization, Group-wise Hadamard Transformation, and GHT-Aware Learnable Transformation, and, at the hardware level, a low-bit floating-point quantizer and multiplier. Result: FPQVAR significantly improves FID and IS under 4-bit quantization, brings 6-bit quantization close to the FP16 model, and its FPGA accelerator outperforms the integer-based accelerator and the GPU baseline in throughput and energy efficiency. Conclusion: Through joint algorithm and hardware optimization, FPQVAR addresses the deployment difficulties of VAR models and offers a practical route to efficient image generation on edge devices. Abstract: Visual autoregressive (VAR) modeling has marked a paradigm shift in image generation from next-token prediction to next-scale prediction. VAR predicts a set of tokens at each step from coarse to fine scale, leading to better image quality and faster inference speed compared to existing diffusion models. However, the large parameter size and computation cost hinder its deployment on edge devices. To reduce the memory and computation cost, we propose FPQVAR, an efficient post-training floating-point (FP) quantization framework for VAR featuring algorithm and hardware co-design. At the algorithm level, we first identify the challenges of quantizing VAR. To address them, we propose Dual Format Quantization for the highly imbalanced input activation. We further propose Group-wise Hadamard Transformation and GHT-Aware Learnable Transformation to address the time-varying outlier channels. At the hardware level, we design the first low-bit FP quantizer and multiplier with lookup tables on FPGA and propose the first FPGA-based VAR accelerator featuring low-bit FP computation and an elaborate two-level pipeline. Extensive experiments show that compared to the state-of-the-art quantization method, our proposed FPQVAR significantly improves Fréchet Inception Distance (FID) from 10.83 to 3.58, Inception Score (IS) from 175.9 to 241.5 under 4-bit quantization. FPQVAR also significantly improves the performance of 6-bit quantized VAR, bringing it on par with the FP16 model. Our accelerator on AMD-Xilinx VCK190 FPGA achieves a throughput of 1.1 image/s, which is 3.1x higher than the integer-based accelerator. It also demonstrates 3.6x and 2.8x higher energy efficiency compared to the integer-based accelerator and GPU baseline, respectively.
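To make the role of a group-wise Hadamard transform concrete, here is a minimal sketch. It assumes contiguous channel groups whose size is a power of two and an input length divisible by the group size; the function names are hypothetical and the paper's actual transformation (and its learnable variant) may differ. The point is only that an orthonormal Hadamard rotation spreads an outlier channel across its group, which tends to make the values easier to quantize, and that the rotation can be undone exactly.

```python
import math

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h

def groupwise_hadamard(x, group_size):
    """Apply an orthonormal Hadamard transform to each contiguous group of x.

    Assumes len(x) is a multiple of group_size.
    """
    h = hadamard(group_size)
    scale = 1.0 / math.sqrt(group_size)
    out = []
    for start in range(0, len(x), group_size):
        g = x[start:start + group_size]
        out.extend(scale * sum(h[i][j] * g[j] for j in range(group_size))
                   for i in range(group_size))
    return out

x = [8.0, 0.0, 0.0, 0.0, 0.1, 0.2, -0.1, 0.3]  # channel 0 is an outlier
y = groupwise_hadamard(x, group_size=4)
# The normalized Sylvester matrix is symmetric and orthogonal, so applying
# the same transform again recovers the input.
x_back = groupwise_hadamard(y, group_size=4)
```

In this toy input the outlier of magnitude 8 becomes four values of magnitude 4 after the transform, halving the dynamic range that a low-bit format has to cover within that group.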

[57] Fusion of Foundation and Vision Transformer Model Features for Dermatoscopic Image Classification

Amirreza Mahbod,Rupert Ecker,Ramona Woitek

Main category: cs.CV

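TL;DR: The study compares the dermatology-specific foundation model PanDerm with two ViT architectures on skin lesion classification. PanDerm with an MLP classifier performs comparably to the Swin Transformer, and fusing their predictions improves results further.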

Details

Motivation: Accurate classification of skin lesions is essential for diagnosing and treating skin cancer; the study examines how useful foundation models and ViT architectures are for this task. Method: Frozen features extracted with PanDerm are fed to MLP, XGBoost, and TabNet classifiers; the ViT models are fully fine-tuned; experiments use the HAM10000 and MSKCC datasets. Result: PanDerm with an MLP performs comparably to the fine-tuned Swin Transformer, and fusing the predictions further improves classification. Conclusion: Future work will explore additional foundation models, fine-tuning strategies, and advanced fusion techniques. Abstract: Accurate classification of skin lesions from dermatoscopic images is essential for diagnosis and treatment of skin cancer. In this study, we investigate the utility of a dermatology-specific foundation model, PanDerm, in comparison with two Vision Transformer (ViT) architectures (ViT base and Swin Transformer V2 base) for the task of skin lesion classification. Using frozen features extracted from PanDerm, we apply non-linear probing with three different classifiers, namely, multi-layer perceptron (MLP), XGBoost, and TabNet. For the ViT-based models, we perform full fine-tuning to optimize classification performance. Our experiments on the HAM10000 and MSKCC datasets demonstrate that the PanDerm-based MLP model performs comparably to the fine-tuned Swin Transformer model, while fusion of PanDerm and Swin Transformer predictions leads to further performance improvements. Future work will explore additional foundation models, fine-tuning strategies, and advanced fusion techniques.

[58] Style Transfer with Diffusion Models for Synthetic-to-Real Domain Adaptation

Estelle Chigot,Dennis G. Wilson,Meriem Ghrib,Thomas Oberlin

Main category: cs.CV

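TL;DR: The paper proposes two diffusion-based methods for semantically consistent style transfer (CACTI and CACTIF) that improve the real-world performance of vision models trained on synthetic data, narrowing the synthetic-to-real domain gap.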

Details

Motivation: Semantic segmentation models trained on synthetic data perform poorly on real scenes, especially in adverse conditions where labeled data is scarce. Method: The paper proposes CACTI (Class-wise Adaptive Instance Normalization and Cross-Attention) and CACTIF (an extension with selective attention filtering), which preserve semantic boundaries and structural coherence through selective style transfer. Result: In experiments from GTA5 to Cityscapes/ACDC, the generated images are of higher quality (lower FID scores) and preserve content better. Conclusion: Class-aware diffusion-based style transfer effectively narrows the synthetic-to-real domain gap and advances robust perception systems. Abstract: Semantic segmentation models trained on synthetic data often perform poorly on real-world images due to domain gaps, particularly in adverse conditions where labeled data is scarce. Yet, recent foundation models make it possible to generate realistic images without any training. This paper proposes to leverage such diffusion models to improve the performance of vision models when learned on synthetic data. We introduce two novel techniques for semantically consistent style transfer using diffusion models: Class-wise Adaptive Instance Normalization and Cross-Attention (CACTI) and its extension with selective attention Filtering (CACTIF). CACTI applies statistical normalization selectively based on semantic classes, while CACTIF further filters cross-attention maps based on feature similarity, preventing artifacts in regions with weak cross-attention correspondences. Our methods transfer style characteristics while preserving semantic boundaries and structural coherence, unlike approaches that apply global transformations or generate content without constraints. Experiments using GTA5 as source and Cityscapes/ACDC as target domains show that our approach produces higher quality images with lower FID scores and better content preservation. Our work demonstrates that class-aware diffusion-based style transfer effectively bridges the synthetic-to-real domain gap even with minimal target domain data, advancing robust perception systems for challenging real-world applications. The source code is available at: https://github.com/echigot/cactif.
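The phrase "applies statistical normalization selectively based on semantic classes" can be read as adaptive instance normalization (AdaIN) computed per class rather than over the whole image. The sketch below illustrates that reading on scalar per-pixel features; it is a simplification under stated assumptions (one channel, flat pixel lists, a hypothetical function name) and not the authors' diffusion-based implementation, which operates on feature maps inside the model.

```python
import statistics

def classwise_adain(content, content_labels, style, style_labels, eps=1e-5):
    """Per semantic class, shift content features to the style mean and std."""
    out = list(content)
    for cls in set(content_labels):
        c_idx = [i for i, l in enumerate(content_labels) if l == cls]
        s_vals = [v for v, l in zip(style, style_labels) if l == cls]
        if not s_vals:
            continue  # class absent from the style image: leave it unchanged
        c_vals = [content[i] for i in c_idx]
        c_mu, c_sd = statistics.fmean(c_vals), statistics.pstdev(c_vals)
        s_mu, s_sd = statistics.fmean(s_vals), statistics.pstdev(s_vals)
        for i in c_idx:
            out[i] = (content[i] - c_mu) / (c_sd + eps) * s_sd + s_mu
    return out

content = [1.0, 3.0, 10.0, 14.0]
content_labels = [0, 0, 1, 1]   # e.g. 0 = road, 1 = sky (hypothetical classes)
style = [5.0, 7.0, 0.0, 2.0]
style_labels = [0, 0, 1, 1]
stylized = classwise_adain(content, content_labels, style, style_labels)
```

Because the statistics are matched class by class, the "road" pixels take on the road statistics of the style image and the "sky" pixels take on its sky statistics, instead of both being pulled toward one global mean.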

[59] Temporal and Spatial Feature Fusion Framework for Dynamic Micro Expression Recognition

Feng Liu,Bingyu Nan,Xuezhong Qian,Xiaolan Fu

Main category: cs.CV

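TL;DR: The paper proposes TSFmicro, a new framework that fuses temporal and spatial features to improve micro-expression recognition accuracy; experiments show it outperforms existing methods.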

Details

Motivation: Micro-expressions are brief and highly localized, and recognition accuracy is low (as low as 50%), so multimodal fusion techniques are explored to improve recognition. Method: The framework combines a Retention Network with a transformer-based DMER network and proposes a parallel time-space fusion method that fuses spatio-temporal information in a high-dimensional feature space. Result: TSFmicro outperforms other state-of-the-art methods on three well-known micro-expression datasets. Conclusion: Through efficient spatio-temporal feature fusion, the TSFmicro framework significantly improves the accuracy and semantic richness of micro-expression recognition. Abstract: When emotions are repressed, an individual's true feelings may be revealed through micro-expressions. Consequently, micro-expressions are regarded as a genuine source of insight into an individual's authentic emotions. However, the transient and highly localised nature of micro-expressions poses a significant challenge to their accurate recognition, with the accuracy rate of micro-expression recognition being as low as 50%, even for professionals. In order to address these challenges, it is necessary to explore the field of dynamic micro expression recognition (DMER) using multimodal fusion techniques, with special attention to the diverse fusion of temporal and spatial modal features. In this paper, we propose a novel Temporal and Spatial feature Fusion framework for DMER (TSFmicro). This framework integrates a Retention Network (RetNet) and a transformer-based DMER network, with the objective of efficient micro-expression recognition through the capture and fusion of temporal and spatial relations. Meanwhile, we propose a novel parallel time-space fusion method from the perspective of modal fusion, which fuses spatio-temporal information in high-dimensional feature space, resulting in complementary "where-how" relationships at the semantic level and providing richer semantic information for the model. The experimental results demonstrate the superior performance of the TSFmicro method in comparison to other contemporary state-of-the-art methods. This is evidenced by its effectiveness on three well-recognised micro-expression datasets.

[60] DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos

Zijia Lu,A S M Iftekhar,Gaurav Mittal,Tianjian Meng,Xiawei Wang,Cheng Zhao,Rohith Kukkala,Ehsan Elhamifar,Mei Chen

Main category: cs.CV

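TL;DR: DeCafNet uses a "delegate-and-conquer" strategy and a dual-encoder design to substantially reduce the computational cost of temporal grounding in long videos while improving performance.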

Details

Motivation: Existing methods are hard to scale because of their high computational cost; the large number of clips in long videos must be processed efficiently. Method: A lightweight sidekick encoder extracts features and generates a saliency map, combined with an expert encoder and multi-scale temporal refinement. Result: Computation is reduced by up to 47% while outperforming existing methods, setting a new state of the art. Conclusion: DeCafNet performs well in both efficiency and accuracy and offers an efficient solution for temporal grounding in long videos. Abstract: Long Video Temporal Grounding (LVTG) aims at identifying specific moments within lengthy videos based on user-provided text queries for effective content retrieval. The approach taken by existing methods of dividing video into clips and processing each clip via a full-scale expert encoder is challenging to scale due to prohibitive computational costs of processing a large number of clips in long videos. To address this issue, we introduce DeCafNet, an approach employing a "delegate-and-conquer" strategy to achieve computation efficiency without sacrificing grounding performance. DeCafNet introduces a sidekick encoder that performs dense feature extraction over all video clips in a resource-efficient manner, while generating a saliency map to identify the most relevant clips for full processing by the expert encoder. To effectively leverage features from sidekick and expert encoders that exist at different temporal resolutions, we introduce DeCaf-Grounder, which unifies and refines them via query-aware temporal aggregation and multi-scale temporal refinement for accurate grounding. Experiments on two LVTG benchmark datasets demonstrate that DeCafNet reduces computation by up to 47% while still outperforming existing methods, establishing a new state-of-the-art for LVTG in terms of both efficiency and performance. Our code is available at https://github.com/ZijiaLewisLu/CVPR2025-DeCafNet.
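The delegation step described in this abstract (a cheap encoder scores every clip, and only the most salient clips go to the expensive encoder) can be sketched as a top-k selection. The abstract does not say how saliency is computed, so the cosine similarity between cheap clip features and a query embedding used here is an assumption, as are the function names and the toy features.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def select_clips(clip_feats, query, k):
    """Score every clip cheaply, then return the k most salient clip indices.

    Indices are returned in temporal order so downstream stages see a timeline.
    """
    saliency = [cosine(f, query) for f in clip_feats]
    ranked = sorted(range(len(saliency)), key=saliency.__getitem__, reverse=True)
    return sorted(ranked[:k]), saliency

# Hypothetical sidekick features for five clips and one text-query embedding.
clip_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.7, 0.7]]
query = [0.0, 1.0]
selected, saliency = select_clips(clip_feats, query, k=2)
```

Only the clips in `selected` would be passed to the full-scale encoder in this sketch; the remaining clips keep their cheap features, which is where the compute saving comes from.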

[61] MAGE: A Multi-task Architecture for Gaze Estimation with an Efficient Calibration Module

Haoming Huang,Musen Zhang,Jianxin Yang,Zhen Li,Jinkai Li,Yao Guo

Main category: cs.CV

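TL;DR: The paper proposes MAGE, a method for 6-DoF gaze estimation that addresses the limitations of existing methods with a multi-task architecture and an efficient calibration module.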

Details

Motivation: Existing gaze estimation methods predict only the gaze direction or the on-screen point of gaze, lack a comprehensive analysis in 3D space, and generalize poorly because of differences between individuals. Method: MAGE uses a multi-task architecture that encodes directional and positional features from facial images, and reduces the effect of individual differences with the Easy-Calibration module. Result: Experiments show that MAGE achieves state-of-the-art performance on the MPIIFaceGaze, EYEDIAP, and IMRGaze datasets. Conclusion: MAGE provides an efficient 6-DoF gaze estimation solution for real-world human-robot interaction. Abstract: Eye gaze can provide rich information on human psychological activities, and has garnered significant attention in the field of Human-Robot Interaction (HRI). However, existing gaze estimation methods merely predict either the gaze direction or the Point-of-Gaze (PoG) on the screen, failing to provide sufficient information for a comprehensive six Degree-of-Freedom (DoF) gaze analysis in 3D space. Moreover, the variations of eye shape and structure among individuals also impede the generalization capability of these methods. In this study, we propose MAGE, a Multi-task Architecture for Gaze Estimation with an efficient calibration module, to predict the 6-DoF gaze information that is applicable for the real-world HRI. Our basic model encodes both the directional and positional features from facial images, and predicts gaze results with dedicated information flow and multiple decoders. To reduce the impact of individual variations, we propose a novel calibration module, namely Easy-Calibration, to fine-tune the basic model with subject-specific data, which is efficient to implement without the need of a screen. Experimental results demonstrate that our method achieves state-of-the-art performance on the public MPIIFaceGaze, EYEDIAP, and our built IMRGaze datasets.

[62] Sketchy Bounding-box Supervision for 3D Instance Segmentation

Qian Deng,Le Hui,Jin Xie,Jian Yang

Main category: cs.CV

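TL;DR: The paper proposes Sketchy-3DIS, a weakly supervised 3D instance segmentation framework that jointly trains a pseudo labeler and a segmentator to improve performance under inaccurate bounding-box supervision.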

Details

Motivation: In weakly supervised 3D instance segmentation, bounding-box supervision reduces the need for point-level annotation, but obtaining accurate bounding boxes in practice remains difficult. The paper therefore studies the effect of inaccurate bounding boxes, which it calls sketchy bounding boxes. Method: An adaptive box-to-point pseudo labeler resolves the assignment of points in the overlapping parts of sketchy bounding boxes; a coarse-to-fine instance segmentator first predicts coarse instances and then learns fine instances; joint training produces high-quality instances. Result: State-of-the-art performance on the ScanNetV2 and S3DIS benchmarks, even outperforming some fully supervised methods. Conclusion: The Sketchy-3DIS framework effectively improves 3D instance segmentation under inaccurate bounding-box supervision and shows the potential of weakly supervised methods. Abstract: Bounding box supervision has gained considerable attention in weakly supervised 3D instance segmentation. While this approach alleviates the need for extensive point-level annotations, obtaining accurate bounding boxes in practical applications remains challenging. To this end, we explore the inaccurate bounding box, named sketchy bounding box, which is imitated through perturbing ground truth bounding box by adding scaling, translation, and rotation. In this paper, we propose Sketchy-3DIS, a novel weakly supervised 3D instance segmentation framework, which jointly learns pseudo labeler and segmentator to improve the performance under the sketchy bounding-box supervisions. Specifically, we first propose an adaptive box-to-point pseudo labeler that adaptively learns to assign points located in the overlapped parts between two sketchy bounding boxes to the correct instance, resulting in compact and pure pseudo instance labels. Then, we present a coarse-to-fine instance segmentator that first predicts coarse instances from the entire point cloud and then learns fine instances based on the region of coarse instances. Finally, by using the pseudo instance labels to supervise the instance segmentator, we can gradually generate high-quality instances through joint training. Extensive experiments show that our method achieves state-of-the-art performance on both the ScanNetV2 and S3DIS benchmarks, and even outperforms several fully supervised methods using sketchy bounding boxes. Code is available at https://github.com/dengq7/Sketchy-3DIS.
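The abstract states that sketchy boxes are imitated by perturbing ground-truth boxes with scaling, translation, and rotation. A minimal sketch of such a perturbation is given here; the box parameterization (center, size, yaw only), the perturbation ranges, and the function name are assumptions for illustration and are not taken from the paper.

```python
import random

def sketchy_box(box, rng, max_scale=0.1, max_shift=0.1, max_rot=0.1):
    """Perturb a box (cx, cy, cz, dx, dy, dz, yaw) to imitate a sloppy annotation.

    Scale changes by up to +/- max_scale, the center shifts by up to
    max_shift times the box size per axis, and yaw changes by up to
    max_rot radians. All ranges are illustrative assumptions.
    """
    cx, cy, cz, dx, dy, dz, yaw = box
    s = 1.0 + rng.uniform(-max_scale, max_scale)
    shift = [rng.uniform(-max_shift, max_shift) * d for d in (dx, dy, dz)]
    return (cx + shift[0], cy + shift[1], cz + shift[2],
            dx * s, dy * s, dz * s,
            yaw + rng.uniform(-max_rot, max_rot))

gt_box = (1.0, 2.0, 0.5, 2.0, 1.0, 1.0, 0.0)
noisy_box = sketchy_box(gt_box, random.Random(0))
```

Neighbouring boxes perturbed this way will often overlap, which is the situation the box-to-point pseudo labeler in the abstract is meant to resolve.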

[63] AdvReal: Adversarial Patch Generation Framework with Application to Adversarial Safety Evaluation of Object Detection Systems

Yuanhao Huang,Yilong Ren,Jinlei Wang,Lujia Huo,Xuesong Bai,Jinchuan Zhang,Haiyan Yu

Main category: cs.CV

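TL;DR: Proposes a unified adversarial training framework for generating 2D and 3D adversarial samples and validates its effectiveness in physical environments through experiments.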

Details

Motivation: Deep-learning perception methods are vulnerable to adversarial samples, which threatens the safety of autonomous vehicles; diversity and environmental variation in real-world scenarios need to be addressed. Method: A joint adversarial training framework that combines non-rigid surface modeling with a 3D matching mechanism to generate adversarial textures. Result: The generated adversarial samples effectively mislead object detection models and remain robust under multi-angle attacks, lighting changes, and different distances. Conclusion: The method shows excellent robustness and transferability in physical environments and offers a new perspective on autonomous driving safety. Abstract: Autonomous vehicles are typical complex intelligent systems with artificial intelligence at their core. However, perception methods based on deep learning are extremely vulnerable to adversarial samples, resulting in safety accidents. How to generate effective adversarial examples in the physical world and evaluate object detection systems is a huge challenge. In this study, we propose a unified joint adversarial training framework for both 2D and 3D samples to address the challenges of intra-class diversity and environmental variations in real-world scenarios. Building upon this framework, we introduce an adversarial sample reality enhancement approach that incorporates non-rigid surface modeling and a realistic 3D matching mechanism. We compare with 5 advanced adversarial patches and evaluate their attack performance on 8 object detectors, including single-stage, two-stage, and transformer-based models. Extensive experiment results in digital and physical environments demonstrate that the adversarial textures generated by our method can effectively mislead the target detection model. Moreover, the proposed method demonstrates excellent robustness and transferability under multi-angle attacks, varying lighting conditions, and different distances in the physical world. The demo video and code can be obtained at https://github.com/Huangyh98/AdvReal.git.

[64] Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression

Sreetama Sarkar,Yue Che,Alex Gavin,Peter A. Beerel,Souvik Kundu

Main category: cs.CV

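TL;DR: SPIN is a task-agnostic, attention-guided head suppression strategy that reduces hallucinations in large vision-language models (LVLMs) without adding significant compute or latency overhead.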

Details

Motivation: LVLMs hallucinate text that does not match the visual context, and existing mitigation methods reduce hallucinations at the cost of significantly higher latency. Method: Analysis links hallucinations to specific attention heads; SPIN selectively suppresses heads with low attention to image tokens and keeps the top-K heads. Result: On visual question answering and image description tasks, SPIN reduces hallucination scores by up to 2.7x while maintaining F1, and improves throughput by 1.8x compared to existing alternatives. Conclusion: SPIN effectively reduces hallucinations in LVLMs while preserving performance and efficiency. Abstract: Despite their remarkable progress in multimodal understanding tasks, large vision language models (LVLMs) often suffer from "hallucinations", generating texts misaligned with the visual context. Existing methods aimed at reducing hallucinations through inference time intervention incur a significant increase in latency. To mitigate this, we present SPIN, a task-agnostic attention-guided head suppression strategy that can be seamlessly integrated during inference, without incurring any significant compute or latency overhead. We investigate whether hallucination in LVLMs can be linked to specific model components. Our analysis suggests that hallucinations can be attributed to a dynamic subset of attention heads in each layer. Leveraging this insight, for each text query token, we selectively suppress attention heads that exhibit low attention to image tokens, keeping the top-K attention heads intact. Extensive evaluations on visual question answering and image description tasks demonstrate the efficacy of SPIN in reducing hallucination scores up to 2.7x while maintaining F1, and improving throughput by 1.8x compared to existing alternatives. Code is available at https://github.com/YUECHE77/SPIN.
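The selection rule in this abstract (per query token, keep the top-K heads by attention to image tokens and suppress the rest) can be sketched for a single token and a single layer as follows. The suppression factor, the data layout, and the function name are assumptions; the released SPIN code may implement the suppression differently.

```python
def suppress_heads(image_attention, head_outputs, k, factor=0.0):
    """For one query token, keep the k heads that attend most to image tokens.

    image_attention[h] is head h's total attention mass on image tokens and
    head_outputs[h] is that head's output vector. Heads outside the top k are
    scaled by `factor` (0.0 removes their contribution entirely).
    """
    ranked = sorted(range(len(image_attention)),
                    key=image_attention.__getitem__, reverse=True)
    keep = set(ranked[:k])
    return [list(out) if h in keep else [factor * v for v in out]
            for h, out in enumerate(head_outputs)]

# Hypothetical values for one text token in a layer with four heads.
image_attention = [0.05, 0.60, 0.30, 0.10]
head_outputs = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
suppressed = suppress_heads(image_attention, head_outputs, k=2)
```

Because the ranking is recomputed for every query token, the set of suppressed heads changes from token to token, which matches the abstract's description of a "dynamic subset" of heads.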

[65] Pose-invariant face recognition via feature-space pose frontalization

Nikolay Stanishev,Yuhang Lu,Touradj Ebrahimi

Main category: cs.CV

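TL;DR: The paper proposes a new method that performs face frontalization and recognition in feature space; a feature-space pose frontalization module (FSPFM) and a new training paradigm significantly improve pose-invariant face recognition.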

Details

Motivation: Pose-invariant face recognition is a challenging problem for modern AI-based face recognition systems; existing methods frontalize faces with generative models or by learning pose-robust features, but leave room for improvement. Method: An FSPFM module transforms profile images at arbitrary angles into frontal counterparts, trained with a paradigm consisting of pre-training and attention-guided fine-tuning. Result: On five popular face recognition benchmarks, the method outperforms the state of the art on the pose-invariant task and maintains superior performance in other standard scenarios. Conclusion: The method achieves efficient face frontalization and recognition in feature space and offers a new solution for pose-invariant face recognition. Abstract: Pose-invariant face recognition has become a challenging problem for modern AI-based face recognition systems. It aims at matching a profile face captured in the wild with a frontal face registered in a database. Existing methods perform face frontalization via either generative models or learning a pose robust feature representation. In this paper, a new method is presented to perform face frontalization and recognition within the feature space. First, a novel feature space pose frontalization module (FSPFM) is proposed to transform profile images with arbitrary angles into frontal counterparts. Second, a new training paradigm is proposed to maximize the potential of FSPFM and boost its performance. The latter consists of a pre-training and an attention-guided fine-tuning stage. Moreover, extensive experiments have been conducted on five popular face recognition benchmarks. Results show that our method not only outperforms the state-of-the-art in the pose-invariant face recognition task but also maintains superior performance in other standard scenarios.

[66] Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models

Chengcheng Wang,Jianyuan Guo,Hongguang Li,Yuchuan Tian,Ying Nie,Chang Xu,Kai Han

Main category: cs.CV

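TL;DR: The paper proposes Circle-RoPE, a new positional encoding scheme that addresses cross-modal positional bias in vision-language models, and validates its effectiveness experimentally.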

Details

Motivation: Rotary Position Embedding (RoPE) introduces cross-modal positional bias in vision-language models, causing spurious alignments between images and text. Method: The paper proposes Per-Token Distance (PTD) to measure the independence of positional encodings across modalities and designs Circle-RoPE, which maps image tokens onto a circular trajectory orthogonal to the text tokens. It also applies different RoPE variants across layers in a staggered strategy. Result: Experiments show that Circle-RoPE effectively reduces cross-modal bias while preserving spatial information within the image. Conclusion: Circle-RoPE provides a more robust and flexible positional encoding framework for vision-language models. Abstract: Rotary Position Embedding (RoPE) is a widely adopted technique for encoding relative positional information in large language models (LLMs). However, when extended to large vision-language models (LVLMs), its variants introduce unintended cross-modal positional biases. Specifically, they enforce relative positional dependencies between text token indices and image tokens, causing spurious alignments. This issue arises because image tokens representing the same content but located at different spatial positions are assigned distinct positional biases, leading to inconsistent cross-modal associations. To address this, we propose Per-Token Distance (PTD) - a simple yet effective metric for quantifying the independence of positional encodings across modalities. Informed by this analysis, we introduce Circle-RoPE, a novel encoding scheme that maps image token indices onto a circular trajectory orthogonal to the linear path of text token indices, forming a cone-like structure. This configuration ensures that each text token maintains an equal distance to all image tokens, reducing artificial cross-modal biases while preserving intra-image spatial information. To further enhance performance, we propose a staggered layer strategy that applies different RoPE variants across layers. This design leverages the complementary strengths of each RoPE variant, thereby enhancing the model's overall performance. Our experimental results demonstrate that our method effectively preserves spatial information from images while reducing relative positional bias, offering a more robust and flexible positional encoding framework for LVLMs. The code is available at https://github.com/lose4578/CircleRoPE.
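The geometric claim in this abstract (a circle of image positions orthogonal to the line of text positions gives every text token the same distance to all image tokens) can be checked with a short sketch. This is only the geometry in 3D Euclidean space under assumed coordinates: text indices are placed on the z-axis and image tokens on a circle in the z = 0 plane centred on that axis. The function names and the radius are hypothetical, and the actual method operates on RoPE position indices.

```python
import math

def text_position(t):
    """Place text token index t on a line (here the z-axis)."""
    return (0.0, 0.0, float(t))

def image_positions(n, radius=1.0):
    """Place n image tokens on a circle in the z = 0 plane, centred on the line."""
    return [(radius * math.cos(2 * math.pi * i / n),
             radius * math.sin(2 * math.pi * i / n),
             0.0) for i in range(n)]

image_pts = image_positions(8, radius=2.0)
# Distances from each of three text tokens to every image token.
distances = {t: [math.dist(text_position(t), p) for p in image_pts]
             for t in (1, 2, 3)}
```

For a text token at height t, every distance equals sqrt(radius^2 + t^2), so the line from the text token to the circle sweeps out a cone; distances between image tokens on the circle still differ from one another, which is how spatial information inside the image is kept.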

[67] Investigating Fine- and Coarse-grained Structural Correspondences Between Deep Neural Networks and Human Object Image Similarity Judgments Using Unsupervised Alignment

Soh Takahashi,Masaru Sasaki,Ken Takeda,Masafumi Oizumi

Main category: cs.CV

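TL;DR: The paper studies how similar human and deep learning model object representations are, and finds that CLIP models match human representations well at both fine and coarse grain, whereas self-supervised models capture only the coarse-grained structure.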

Details

Motivation: Examine how similar human and deep learning model object representations are, in particular how well they match at fine and coarse grain. Method: An unsupervised alignment method based on Gromov-Wasserstein Optimal Transport is used to compare how well human and model object representations match. Result: CLIP models match human representations well at both fine and coarse grain, whereas self-supervised models capture only the coarse-grained structure. Conclusion: Linguistic information is important for acquiring precise object representations, and self-supervised learning has potential for capturing coarse-grained structure. Abstract: The learning mechanisms by which humans acquire internal representations of objects are not fully understood. Deep neural networks (DNNs) have emerged as a useful tool for investigating this question, as they have internal representations similar to those of humans as a byproduct of optimizing their objective functions. While previous studies have shown that models trained with various learning paradigms - such as supervised, self-supervised, and CLIP - acquire human-like representations, it remains unclear whether their similarity to human representations is primarily at a coarse category level or extends to finer details. Here, we employ an unsupervised alignment method based on Gromov-Wasserstein Optimal Transport to compare human and model object representations at both fine-grained and coarse-grained levels. The unique feature of this method compared to conventional representational similarity analysis is that it estimates optimal fine-grained mappings between the representation of each object in human and model representations. We used this unsupervised alignment method to assess the extent to which the representation of each object in humans is correctly mapped to the corresponding representation of the same object in models. Using human similarity judgments of 1,854 objects from the THINGS dataset, we find that models trained with CLIP consistently achieve strong fine- and coarse-grained matching with human object representations. In contrast, self-supervised models showed limited matching at both fine- and coarse-grained levels, but still formed object clusters that reflected human coarse category structure. Our results offer new insights into the role of linguistic information in acquiring precise object representations and the potential of self-supervised learning to capture coarse categorical structures.
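One way to quantify whether each human object representation is "correctly mapped to the corresponding representation of the same object" is a top-1 matching rate over the transport plan returned by the alignment. The abstract does not name the metric, so the measure below is an assumption; it also assumes that row i (human) and column i (model) refer to the same object, and the toy plan is made up for illustration rather than taken from the paper.

```python
def top1_matching_rate(plan):
    """Fraction of rows whose largest transport mass lies on the diagonal.

    plan[i][j] is the mass moved from human object i to model object j.
    """
    hits = sum(1 for i, row in enumerate(plan)
               if max(range(len(row)), key=row.__getitem__) == i)
    return hits / len(plan)

# Toy 3 x 3 transport plan (illustrative values only).
plan = [
    [0.30, 0.03, 0.00],
    [0.10, 0.20, 0.03],
    [0.20, 0.05, 0.08],
]
rate = top1_matching_rate(plan)
```

A coarse-grained variant of the same idea would count a row as a hit when its largest mass falls on any object in the same category, which separates category-level agreement from object-level agreement.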

[68] Unlocking Smarter Device Control: Foresighted Planning with a World Model-Driven Code Execution Approach

Xiaoran Yin,Xu Luo,Hao Wu,Lianli Gao,Jingkuan Song

Main category: cs.CV

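TL;DR: The paper proposes FPWC, a framework that uses natural language understanding and structured reasoning to strengthen the agent's global understanding of the environment and thereby improve automatic control of mobile devices.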

Details

Motivation: Automatic control of mobile devices must handle complex tasks with limited environmental information, and existing methods rely on immediate observations, which leads to suboptimal decisions. Method: The FPWC framework builds a task-oriented world model and uses iterative planning to generate foresighted actions, which are carried out as executable code. Result: Experiments in simulated environments and on real devices show that FPWC achieves a 44.4% relative improvement in task success rate over the state of the art in the simulated environment. Conclusion: Through global understanding and foresighted planning, FPWC significantly improves automatic control of mobile devices. Abstract: The automatic control of mobile devices is essential for efficiently performing complex tasks that involve multiple sequential steps. However, these tasks pose significant challenges due to the limited environmental information available at each step, primarily through visual observations. As a result, current approaches, which typically rely on reactive policies, focus solely on immediate observations and often lead to suboptimal decision-making. To address this problem, we propose Foresighted Planning with World Model-Driven Code Execution (FPWC), a framework that prioritizes natural language understanding and structured reasoning to enhance the agent's global understanding of the environment by developing a task-oriented, refinable world model at the outset of the task. Foresighted actions are subsequently generated through iterative planning within this world model, executed in the form of executable code. Extensive experiments conducted in simulated environments and on real mobile devices demonstrate that our method outperforms previous approaches, particularly achieving a 44.4% relative improvement in task success rate compared to the state-of-the-art in the simulated environment. Code and demo are provided in the supplementary material.

Ranjith Merugu,Mohammad Sameer Suhail,Akshay P Sarashetti,Venkata Bharath Reddy Reddem,Pankaj Kumar Bajpai,Amit Satish Unde

Main category: cs.CV

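TL;DR: Proposes JFFRA, a video restoration framework that jointly refines optical flow and restored features, yielding substantially better restoration performance.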

Details

Motivation: Existing video restoration methods struggle to preserve temporal consistency when exploiting temporal information, and they depend on accurate optical flow estimation. Method: JFFRA iteratively refines flow and restored features, combined with multi-scale processing and an occlusion-aware loss function. Result: Across tasks such as denoising, deblurring, and super-resolution, performance improves by up to 1.62 dB. Conclusion: By jointly refining flow and feature restoration, JFFRA markedly improves restoration quality and temporal consistency. Abstract: Recent advancements in video restoration have focused on recovering high-quality video frames from low-quality inputs. Compared with static images, the performance of video restoration significantly depends on efficient exploitation of temporal correlations among successive video frames. Numerous techniques make use of temporal information via flow-based strategies or recurrent architectures. However, these methods often encounter difficulties in preserving temporal consistency as they utilize degraded input video frames. To resolve this issue, we propose a novel video restoration framework named Joint Flow and Feature Refinement using Attention (JFFRA). The proposed JFFRA is based on the key philosophy of iteratively enhancing data through the synergistic collaboration of flow (alignment) and restoration. By leveraging previously enhanced features to refine flow and vice versa, JFFRA enables efficient feature enhancement using temporal information. This interplay between flow and restoration is executed at multiple scales, reducing the dependence on precise flow estimation. Moreover, we incorporate an occlusion-aware temporal loss function to enhance the network's capability in eliminating flickering artifacts. Comprehensive experiments validate the versatility of JFFRA across various restoration tasks such as denoising, deblurring, and super-resolution. Our method demonstrates a remarkable performance improvement of up to 1.62 dB compared to state-of-the-art approaches.

[70] Ranked Entropy Minimization for Continual Test-Time Adaptation

Jisu Han,Jaemin Na,Wonjun Hwang

Main category: cs.CV

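TL;DR: The paper proposes ranked entropy minimization to address the stability problem in test-time adaptation, particularly in the continual test-time adaptation setting.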

Details

Motivation: Entropy minimization is efficient and adaptable for test-time adaptation, but in continual test-time adaptation it is prone to model collapse (predicting a single class). Method: A progressive masking strategy explicitly structures prediction difficulty, gradually aligning the model's probability distributions across difficulty levels while preserving the rank order of entropy. Result: The method's effectiveness is validated on multiple benchmarks. Conclusion: Ranked entropy minimization markedly improves the stability of entropy minimization in continual test-time adaptation. Abstract: Test-time adaptation aims to adapt to realistic environments in an online manner by learning during test time. Entropy minimization has emerged as a principal strategy for test-time adaptation due to its efficiency and adaptability. Nevertheless, it remains underexplored in continual test-time adaptation, where stability is more important. We observe that the entropy minimization method often suffers from model collapse, where the model converges to predicting a single class for all images due to a trivial solution. We propose ranked entropy minimization to mitigate the stability problem of the entropy minimization method and extend its applicability to continuous scenarios. Our approach explicitly structures the prediction difficulty through a progressive masking strategy. Specifically, it gradually aligns the model's probability distributions across different levels of prediction difficulty while preserving the rank order of entropy. The proposed method is extensively evaluated across various benchmarks, demonstrating its effectiveness through empirical results. Our code is available at https://github.com/pilsHan/rem
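
The "trivial solution" mentioned in the abstract can be illustrated with a small sketch. It is not the paper's ranked method; it only shows, on made-up logits, that a plain mean-entropy objective scores a collapsed predictor (every input pushed to one class) lower than a model making diverse predictions. The function names and the example logits are assumptions chosen for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(batch_logits):
    """The quantity a plain entropy-minimization objective drives down."""
    return sum(entropy(softmax(l)) for l in batch_logits) / len(batch_logits)

# Hypothetical logits: three inputs, each confidently assigned a different class.
diverse = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
# Hypothetical collapsed model: every input is pushed to class 0.
collapsed = [[20.0, 0.0, 0.0]] * 3

print("diverse  :", mean_entropy(diverse))
print("collapsed:", mean_entropy(collapsed))
```

Because the collapsed batch has the lower loss, the objective alone does not rule out collapse, which is consistent with the stability concern the abstract raises.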

[71] MAFE R-CNN: Selecting More Samples to Learn Category-aware Features for Small Object Detection

Yichen Li,Qiankun Liu,Zhenchao Jin,Jiuzhe Wei,Jing Nie,Ying Fu

Main category: cs.CV

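TL;DR: The paper proposes MAFE R-CNN, which addresses feature learning and sample selection in small object detection through multi-clue sample selection and a category-aware feature enhancement mechanism.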

Details

Motivation: Small object detection is challenging because detectors learn insufficiently discriminative features and high-quality samples are hard to select. Method: A Multi-Clue Sample Selection (MCSS) strategy and a Category-aware Feature Enhancement Mechanism (CFEM). Result: The method's effectiveness is validated on the SODA dataset. Conclusion: By improving sample selection and feature enhancement, MAFE R-CNN improves small object detection performance. Abstract: Small object detection in intricate environments has consistently represented a major challenge in the field of object detection. In this paper, we identify that this difficulty stems from the detectors' inability to effectively learn discriminative features for objects of small size, compounded by the complexity of selecting high-quality small object samples during training, which motivates the proposal of the Multi-Clue Assignment and Feature Enhancement R-CNN. Specifically, MAFE R-CNN integrates two pivotal components. The first is the Multi-Clue Sample Selection (MCSS) strategy, in which the Intersection over Union (IoU) distance, predicted category confidence, and ground truth region sizes are leveraged as informative clues in the sample selection process. This methodology facilitates the selection of diverse positive samples and ensures a balanced distribution of object sizes during training, thereby promoting effective model learning. The second is the Category-aware Feature Enhancement Mechanism (CFEM), where we propose a simple yet effective category-aware memory module to explore the relationships among object features. Subsequently, we enhance the object feature representation by facilitating the interaction between category-aware features and candidate box features. Comprehensive experiments conducted on the large-scale small object dataset SODA validate the effectiveness of the proposed method. The code will be made publicly available.
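
One of the clues named above is IoU. The sketch below computes plain box IoU and shows, on hypothetical boxes, how the same 2-pixel offset costs a small box far more overlap than a large one, which is one reason IoU alone can be a weak selection signal for small objects. The abstract's "IoU distance" is not defined there, so this uses standard IoU as an assumption; the box coordinates are invented.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Hypothetical boxes: the prediction is shifted by 2 pixels in x and y.
small_gt, small_pred = (0, 0, 8, 8), (2, 2, 10, 10)
large_gt, large_pred = (0, 0, 80, 80), (2, 2, 82, 82)

print("small object IoU:", box_iou(small_gt, small_pred))  # 36 / 92
print("large object IoU:", box_iou(large_gt, large_pred))  # 6084 / 6716
```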

[72] TAT-VPR: Ternary Adaptive Transformer for Dynamic and Efficient Visual Place Recognition

Oliver Grainge,Michael Milford,Indu Bodala,Sarvapali D. Ramchurn,Shoaib Ehsan

Main category: cs.CV

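TL;DR: TAT-VPR is a ternary-quantized transformer that brings dynamic accuracy-efficiency trade-offs to visual SLAM loop closure. By combining ternary weights with a learned activation-sparsity gate, the model can cut computation by up to 40% at run time without degrading performance (Recall@1). A two-stage distillation pipeline preserves descriptor quality, making it suitable for micro-UAV and embedded SLAM while reaching SOTA localization accuracy.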

Details

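Motivation: To provide a dynamic accuracy-efficiency trade-off for visual SLAM loop closure while fitting resource-constrained devices such as micro-UAVs and embedded systems. Method: A ternary-quantized transformer is combined with a learned activation-sparsity gate that controls computation dynamically; a two-stage distillation pipeline preserves descriptor quality. Result: Run-time computation is reduced by up to 40% with no drop in performance (Recall@1), and localization accuracy matches the state of the art on micro-UAV and embedded SLAM stacks. Conclusion: TAT-VPR achieves a dynamic accuracy-efficiency trade-off that suits resource-constrained devices while maintaining high performance. Abstract: TAT-VPR is a ternary-quantized transformer that brings dynamic accuracy-efficiency trade-offs to visual SLAM loop-closure. By fusing ternary weights with a learned activation-sparsity gate, the model can control computation by up to 40% at run-time without degrading performance (Recall@1). The proposed two-stage distillation pipeline preserves descriptor quality, letting it run on micro-UAV and embedded SLAM stacks while matching state-of-the-art localization accuracy.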
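
For readers unfamiliar with ternary weights, the following is a minimal sketch of one common ternarization rule: a threshold set to 0.7 times the mean absolute weight, with a single scale shared by the surviving weights. The abstract does not describe TAT-VPR's own quantizer, so this rule, the function name, and the example weights are all assumptions for illustration only.

```python
def ternarize(weights, delta_factor=0.7):
    """Map real weights to codes in {-1, 0, +1} plus one shared scale.

    Assumed rule (not taken from the TAT-VPR abstract): weights whose magnitude
    is at most delta_factor * mean(|w|) become 0; the rest keep their sign.
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = delta_factor * mean_abs
    codes = [0 if abs(w) <= delta else (1 if w > 0 else -1) for w in weights]
    kept = [abs(w) for w, c in zip(weights, codes) if c != 0]
    # The scale is the mean magnitude of the weights that were not zeroed.
    alpha = sum(kept) / len(kept) if kept else 0.0
    return codes, alpha

# Hypothetical weight vector.
weights = [0.9, -0.05, 0.4, -0.7, 0.02, -0.3]
codes, alpha = ternarize(weights)
dequantized = [alpha * c for c in codes]

print(codes)
print(alpha)
print(dequantized)
```

Each weight then needs only a sign-or-zero code, and multiplications by the codes reduce to additions, subtractions, or skips.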

[73] CMRINet: Joint Groupwise Registration and Segmentation for Cardiac Function Quantification from Cine-MRI

Mohamed S. Elmahdy,Marius Staring,Patrick J. H. de Koning,Samer Alabed,Mahan Salehi,Faisal Alandejani,Michael Sharkey,Ziad Aldabbagh,Andrew J. Swift,Rob J. van der Geest

Main category: cs.CV

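TL;DR: Proposes an end-to-end deep learning model that jointly estimates groupwise registration and segmentation of cardiac cine-MRI images, improving the efficiency and accuracy of cardiac function assessment.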

Details

Motivation: Left ventricular ejection fraction (LVEF), a key indicator of cardiac function, is affected by many factors and is insensitive in some diseases, so it should be combined with myocardial strain for a more complete assessment. Conventional pipelines run registration and segmentation separately, which limits the assessment. Method: An anatomically-guided deep groupwise registration network (Deep GW) is trained and validated on a dataset of 4-chamber view cine-MRI image series from 374 subjects. Result: Compared with conventional registration and two deep-learning-based methods, the proposed model improves performance and substantially reduces computation time. Conclusion: The model offers an efficient and accurate end-to-end solution for cardiac function assessment. Abstract: Accurate and efficient quantification of cardiac function is essential for the estimation of prognosis of cardiovascular diseases (CVDs). One of the most commonly used metrics for evaluating cardiac pumping performance is left ventricular ejection fraction (LVEF). However, LVEF can be affected by factors such as inter-observer variability and varying pre-load and after-load conditions, which can reduce its reproducibility. Additionally, cardiac dysfunction may not always manifest as alterations in LVEF, such as in heart failure and cardiotoxicity diseases. An alternative measure that can provide a relatively load-independent quantitative assessment of myocardial contractility is myocardial strain and strain rate. By using LVEF in combination with myocardial strain, it is possible to obtain a thorough description of cardiac function. Automated estimation of LVEF and other volumetric measures from cine-MRI sequences can be achieved through segmentation models, while strain calculation requires the estimation of tissue displacement between sequential frames, which can be accomplished using registration models. These tasks are often performed separately, potentially limiting the assessment of cardiac function. To address this issue, in this study we propose an end-to-end deep learning (DL) model that jointly estimates groupwise (GW) registration and segmentation for cardiac cine-MRI images. The proposed anatomically-guided Deep GW network was trained and validated on a large dataset of 4-chamber view cine-MRI image series of 374 subjects. A quantitative comparison with conventional GW registration using elastix and two DL-based methods showed that the proposed model improved performance and substantially reduced computation time.

[74] MAGIC: Motion-Aware Generative Inference via Confidence-Guided LLM

Siwei Meng,Yawei Luo,Ping Liu

Main category: cs.CV

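TL;DR: MAGIC is a training-free framework for dynamic 3D content generation that combines pretrained image-to-video diffusion models with LLM reasoning and uses a physics feedback loop to generate physically plausible dynamic videos.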

Details

Motivation: Existing video generation models emphasize visual realism but neglect physical plausibility, and they rely on large-scale annotated data or model fine-tuning, which imposes heavy computational and data burdens. Method: MAGIC integrates pretrained diffusion models with LLM reasoning, generates physically plausible dynamic videos through a confidence-driven feedback loop, and uses a differentiable MPM simulator to control physical behavior. Result: MAGIC outperforms existing physics-aware generative methods in inference accuracy and achieves greater temporal coherence than video diffusion models. Conclusion: MAGIC offers an approach that needs neither training nor supervision and effectively addresses physical plausibility in dynamic 3D content generation. Abstract: Recent advances in static 3D generation have intensified the demand for physically consistent dynamic 3D content. However, existing video generation models, including diffusion-based methods, often prioritize visual realism while neglecting physical plausibility, resulting in implausible object dynamics. Prior approaches for physics-aware dynamic generation typically rely on large-scale annotated datasets or extensive model fine-tuning, which imposes significant computational and data collection burdens and limits scalability across scenarios. To address these challenges, we present MAGIC, a training-free framework for single-image physical property inference and dynamic generation, integrating pretrained image-to-video diffusion models with iterative LLM-based reasoning. Our framework generates motion-rich videos from a static image and closes the visual-to-physical gap through a confidence-driven LLM feedback loop that adaptively steers the diffusion model toward physics-relevant motion. To translate visual dynamics into controllable physical behavior, we further introduce a differentiable MPM simulator operating directly on 3D Gaussians reconstructed from the single image, enabling physically grounded, simulation-ready outputs without any supervision or model tuning. Experiments show that MAGIC outperforms existing physics-aware generative methods in inference accuracy and achieves greater temporal coherence than state-of-the-art video diffusion models.

[75] AnchorFormer: Differentiable Anchor Attention for Efficient Vision Transformer

Jiquan Shan,Junxiao Wang,Lifeng Zhao,Liang Cai,Hongyuan Zhang,Ioannis Liritzis

Main category: cs.CV

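TL;DR: AnchorFormer introduces anchor tokens to reduce the complexity of ViTs from O(n²) to O(mn), and performs well on classification, detection, and segmentation tasks.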

Details

Motivation: To address the O(n²) complexity that global self-attention imposes on ViTs, while focusing on the key regions of an image to improve efficiency. Method: Anchor tokens learn the pivotal information, bipartite attention between anchors and tokens lowers the complexity, and the anchors are represented by neurons so that their distributions can be learned differentiably. Result: On ImageNet classification, up to 9.0% higher accuracy or a 46.7% FLOPs reduction; on COCO detection, 81.3% higher mAP under comparable FLOPs. Conclusion: AnchorFormer markedly reduces computational complexity while maintaining performance and applies to a range of vision tasks. Abstract: Recently, vision transformers (ViTs) have achieved excellent performance on vision tasks by measuring the global self-attention among the image patches. Given $n$ patches, they have quadratic complexity, $\mathcal{O}(n^2)$, and the time cost is high when splitting the input image with a small granularity. Meanwhile, the pivotal information is often randomly gathered in a few regions of an input image, so some tokens may not be helpful for the downstream tasks. To handle this problem, we introduce an anchor-based efficient vision transformer (AnchorFormer), which employs the anchor tokens to learn the pivotal information and accelerate the inference. Firstly, by estimating the bipartite attention between the anchors and tokens, the complexity will be reduced from $\mathcal{O}(n^2)$ to $\mathcal{O}(mn)$, where $m$ is an anchor number and $m < n$. Notably, by representing the anchors with the neurons in a neural layer, we can differentiably learn these distributions and approximate global self-attention through the Markov process. Moreover, we extend the proposed model to three downstream tasks including classification, detection, and segmentation. Extensive experiments show the effectiveness of our AnchorFormer, e.g., achieving up to a 9.0% higher accuracy or 46.7% FLOPs reduction on ImageNet classification, 81.3% higher mAP on COCO detection under comparable FLOPs, as compared to the current baselines.
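
The reduction from $\mathcal{O}(n^2)$ to $\mathcal{O}(mn)$ can be sketched as a two-hop attention: tokens attend to anchors, and anchors attend to tokens, so only $n \times m$ score matrices are ever formed. This is a rough pure-Python illustration under stated assumptions, not the AnchorFormer implementation: the anchors are fixed vectors rather than learned neurons, the usual scaling by the square root of the feature dimension is omitted, and all names and inputs are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def anchor_attention(tokens, anchors, values):
    """Two-hop attention through m anchors instead of n x n token pairs."""
    # n x m: how strongly each token attends to each anchor.
    token_to_anchor = [softmax(r) for r in matmul(tokens, transpose(anchors))]
    # m x n: how strongly each anchor attends to each token.
    anchor_to_token = [softmax(r) for r in matmul(anchors, transpose(tokens))]
    # m x d_v: each anchor summarizes the token values it attends to.
    anchor_summary = matmul(anchor_to_token, values)
    # n x d_v: each token reads back a mixture of the anchor summaries.
    return matmul(token_to_anchor, anchor_summary)

# Hypothetical inputs: n = 4 tokens, m = 2 anchors, feature dimension 2.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
anchors = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0], [1.0], [-1.0], [-1.0]]

out = anchor_attention(tokens, anchors, values)
print(out)
```

Both softmax matrices have row sums of one, so their product behaves like a two-step transition between tokens, which is one way to read the abstract's reference to a Markov process.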

[76] Consistent World Models via Foresight Diffusion

Yu Zhang,Xingzhuo Guo,Haoran Xu,Mingsheng Long

Main category: cs.CV

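TL;DR: The paper proposes ForeDiff, a framework that decouples condition understanding from target denoising to improve diffusion models for consistent world modeling.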

Details

Motivation: Diffusion models excel at generation tasks but are limited in world modeling by poor sample consistency, mainly because condition understanding and target denoising are entangled. Method: ForeDiff separates condition understanding from target denoising by adding a deterministic predictive stream and a pretrained predictor. Result: In robot video prediction and scientific spatiotemporal forecasting experiments, ForeDiff beats the baselines in predictive accuracy and sample consistency. Conclusion: ForeDiff offers a new direction for applying diffusion models to world modeling. Abstract: Diffusion and flow-based models have enabled significant progress in generation tasks across various modalities and have recently found applications in world modeling. However, unlike typical generation tasks that encourage sample diversity, world models entail different sources of uncertainty and require consistent samples aligned with the ground-truth trajectory, which is a limitation we empirically observe in diffusion models. We argue that a key bottleneck in learning consistent diffusion-based world models lies in the suboptimal predictive ability, which we attribute to the entanglement of condition understanding and target denoising within shared architectures and co-training schemes. To address this, we propose Foresight Diffusion (ForeDiff), a diffusion-based world modeling framework that enhances consistency by decoupling condition understanding from target denoising. ForeDiff incorporates a separate deterministic predictive stream to process conditioning inputs independently of the denoising stream, and further leverages a pretrained predictor to extract informative representations that guide generation. Extensive experiments on robot video prediction and scientific spatiotemporal forecasting show that ForeDiff improves both predictive accuracy and sample consistency over strong baselines, offering a promising direction for diffusion-based world models.

[77] Clear Nights Ahead: Towards Multi-Weather Nighttime Image Restoration

Yuetong Liu,Yunqiu Xu,Yang Wei,Xiuli Bi,Bin Xiao

Main category: cs.CV

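TL;DR: The paper proposes ClearNight, a unified nighttime image restoration framework for degradations caused by multiple weather conditions at night, and contributes the AllWeatherNight dataset.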

Details

Motivation: Real-world nighttime images are often affected jointly by several weather conditions and lighting effects, yet the problem has received little study. Method: ClearNight removes complex degradations using Retinex-based dual prior extraction and a weather-aware dynamic collaboration method. Result: ClearNight achieves state-of-the-art performance on both synthetic and real-world images. Conclusion: Experiments validate the usefulness of the AllWeatherNight dataset and the effectiveness of the ClearNight framework. Abstract: Restoring nighttime images affected by multiple adverse weather conditions is a practical yet under-explored research problem, as multiple weather conditions often coexist in the real world alongside various lighting effects at night. This paper first explores the challenging multi-weather nighttime image restoration task, where various types of weather degradations are intertwined with flare effects. To support the research, we contribute the AllWeatherNight dataset, featuring large-scale high-quality nighttime images with diverse compositional degradations, synthesized using our introduced illumination-aware degradation generation. Moreover, we present ClearNight, a unified nighttime image restoration framework, which effectively removes complex degradations in one go. Specifically, ClearNight extracts Retinex-based dual priors and explicitly guides the network to focus on uneven illumination regions and intrinsic texture contents respectively, thereby enhancing restoration effectiveness in nighttime scenarios. In order to better represent the common and unique characteristics of multiple weather degradations, we introduce a weather-aware dynamic specific-commonality collaboration method, which identifies weather degradations and adaptively selects optimal candidate units associated with specific weather types. Our ClearNight achieves state-of-the-art performance on both synthetic and real-world images. Comprehensive ablation experiments validate the necessity of the AllWeatherNight dataset as well as the effectiveness of ClearNight. Project page: https://henlyta.github.io/ClearNight/mainpage.html

[78] InspectionV3: Enhancing Tobacco Quality Assessment with Deep Convolutional Neural Networks for Automated Workshop Management

Yao Wei,Muhammad Usman,Hazrat Bilal

Main category: cs.CV

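TL;DR: The paper proposes InspectionV3, an automated tobacco grading solution built on a customized deep convolutional neural network, which addresses the high cost and low efficiency of traditional methods and achieves high-accuracy grading.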

Details

Motivation: Inefficiency, inconsistency, and lack of oversight in tobacco processing raise costs and lower quality, and traditional manual inspection is unreliable and expensive, so an automated solution is needed. Method: A customized deep convolutional neural network architecture is trained on 21,113 labelled images covering 20 quality classes, with preprocessing and batch normalization, to enable real-time grading. Result: The system reaches 97% accuracy, 95% precision and recall, 96% F1-score and AUC, and 95% specificity, supporting its practical viability. Conclusion: InspectionV3 streamlines tobacco processing through automated grading and data analytics, demonstrating the potential of deep learning in industrial applications. Abstract: The problems that tobacco workshops encounter include poor curing, inconsistencies in supplies, irregular scheduling, and a lack of oversight, all of which drive up expenses and worsen quality. Large quantities make manual examination costly, sluggish, and unreliable. Deep convolutional neural networks have recently made strides in capabilities that transcend those of conventional methods. To effectively enhance them, nevertheless, extensive customization is needed to account for subtle variations in tobacco grade. This study introduces InspectionV3, an integrated solution for automated flue-cured tobacco grading that makes use of a customized deep convolutional neural network architecture. A scope that covers color, maturity, and curing subtleties is established via a labelled dataset consisting of 21,113 images spanning 20 quality classes. Expert annotators performed preprocessing on the tobacco leaf images, including cleaning, labelling, and augmentation. Multi-layer CNN factors use batch normalization to describe domain properties such as permeability and moisture spots, and so account for the subtleties of the workshop. Its expertise lies in converting visual patterns into useful information for enhancing workflow. Fast notifications are made possible by real-time, on-the-spot grading that matches human expertise. Image-powered analytics dashboards facilitate the tracking of yield projections, inventories, bottlenecks, and the optimization of data-driven choices. More labelled images are assimilated after further retraining, improving representational capacities and enabling adaptations for seasonal variability. Metrics demonstrate 97% accuracy, 95% precision and recall, 96% F1-score and AUC, 95% specificity; validating real-world viability.
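
For reference, the metrics named above follow from confusion-matrix counts as sketched below. The counts are hypothetical and treat a single grade one-vs-rest; the abstract does not say how the 20 classes were averaged, and AUC is left out because it needs ranked scores rather than counts.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard metrics from one-vs-rest confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }

# Hypothetical counts for one grade, not taken from the paper.
metrics = classification_metrics(tp=90, fp=10, fn=30, tn=870)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```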

[79] ALTo: Adaptive-Length Tokenizer for Autoregressive Mask Generation

Lingfeng Wang,Hualing Lin,Senda Chen,Tao Wang,Changxu Cheng,Yangyang Zhong,Dong Zheng,Wuyue Zhao

Main category: cs.CV

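TL;DR: ALTo is an adaptive-length tokenizer that improves mask generation in multimodal large language models (MLLMs) by allocating tokens dynamically according to object complexity.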

Details

Motivation: Humans allocate attention adaptively according to object complexity, whereas existing MLLMs are constrained by fixed token representations, so a more flexible approach is needed. Method: A token length predictor, a length regularization term, and a differentiable token chunking strategy are designed and integrated into ALToLLM, with GRPO used to tune the trade-off between mask quality and efficiency. Result: ALToLLM achieves state-of-the-art performance on popular segmentation benchmarks with adaptive token cost. Conclusion: ALToLLM markedly improves MLLM mask generation through adaptive tokenization; code and models are released. Abstract: While humans effortlessly draw visual objects and shapes by adaptively allocating attention based on their complexity, existing multimodal large language models (MLLMs) remain constrained by rigid token representations. Bridging this gap, we propose ALTo, an adaptive length tokenizer for autoregressive mask generation. To achieve this, a novel token length predictor is designed, along with a length regularization term and a differentiable token chunking strategy. We further build ALToLLM that seamlessly integrates ALTo into MLLM. Preferences on the trade-offs between mask quality and efficiency are implemented by group relative policy optimization (GRPO). Experiments demonstrate that ALToLLM achieves state-of-the-art performance with adaptive token cost on popular segmentation benchmarks. Code and models are released at https://github.com/yayafengzi/ALToLLM.

[80] Beyond Face Swapping: A Diffusion-Based Digital Human Benchmark for Multimodal Deepfake Detection

Jiaxin Liu,Jia Wang,Saihui Hou,Min Ren,Huijia Wu,Zhaofeng He

Main category: cs.CV

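TL;DR: The paper introduces the DigiFakeAV dataset and the DigiShield detection baseline to tackle the detection of highly realistic digital human videos generated by diffusion models.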

Details

Motivation: Digital human videos generated by diffusion models are highly realistic and covert, posing a severe challenge to existing detection methods. Method: The authors build DigiFakeAV, the first large-scale multimodal digital human forgery dataset based on diffusion models, and propose DigiShield, a detection baseline based on spatiotemporal and cross-modal fusion. Result: User studies show a 68% confusion rate between forged and real videos, and existing detection models degrade significantly; DigiShield reaches SOTA performance on the DigiFakeAV and DF-TIMIT datasets. Conclusion: By analyzing the temporal evolution of facial features in synthetic videos at a fine-grained level, DigiShield effectively identifies covert artifacts and offers a new approach to digital human forgery detection. Abstract: In recent years, the rapid development of deepfake technology has given rise to an emerging and serious threat to public security: diffusion model-based digital human generation. Unlike traditional face manipulation methods, such models can generate highly realistic videos with consistency through multimodal control signals. Their flexibility and covertness pose severe challenges to existing detection strategies. To bridge this gap, we introduce DigiFakeAV, the first large-scale multimodal digital human forgery dataset based on diffusion models. Employing five of the latest digital human generation methods (Sonic, Hallo, etc.) and a voice cloning method, we systematically produce a dataset comprising 60,000 videos (8.4 million frames), covering multiple nationalities, skin tones, genders, and real-world scenarios, significantly enhancing data diversity and realism. User studies show that the confusion rate between forged and real videos reaches 68%, and existing state-of-the-art (SOTA) detection models exhibit large drops in AUC values on DigiFakeAV, highlighting the challenge of the dataset. To address this problem, we further propose DigiShield, a detection baseline based on spatiotemporal and cross-modal fusion. By jointly modeling the 3D spatiotemporal features of videos and the semantic-acoustic features of audio, DigiShield achieves SOTA performance on both the DigiFakeAV and DF-TIMIT datasets. Experiments show that this method effectively identifies covert artifacts through fine-grained analysis of the temporal evolution of facial features in synthetic videos.

[81] Detailed Evaluation of Modern Machine Learning Approaches for Optic Plastics Sorting

Vaishali Maheshkar,Aadarsh Anantha Ramakrishnan,Charuvahan Adhivarahan,Karthik Dantu

Main category: cs.CV

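TL;DR: The paper examines the problem of low plastic recycling rates, argues that automated sorting is key, and evaluates the limitations of optical recognition methods in realistic settings.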

Details

Motivation: The plastic recycling rate is only 8%, mainly because of contamination, weak economic incentives, and technical difficulties. Automated sorting, including optical recognition, is seen as promising, but its real-world effectiveness needs further verification. Method: Using datasets of more than 20,000 images together with public and custom machine learning pipelines, the study assesses the capabilities and limits of optical recognition, interpreting model behavior with Grad-CAM, saliency maps, and confusion matrices. Result: Optical recognition has limited success in sorting real-world plastics because it relies on physical properties such as color and shape. Conclusion: Optical recognition performs poorly for plastics sorting and needs further improvement or alternative technologies. Abstract: According to the EPA, only 25% of waste is recycled, and just 60% of U.S. municipalities offer curbside recycling. Plastics fare worse, with a recycling rate of only 8%; an additional 16% is incinerated, while the remaining 76% ends up in landfills. The low plastic recycling rate stems from contamination, poor economic incentives, and technical difficulties, making efficient recycling a challenge. To improve recovery, automated sorting plays a critical role. Companies like AMP Robotics and Greyparrot utilize optical systems for sorting, while Materials Recovery Facilities (MRFs) employ Near-Infrared (NIR) sensors to detect plastic types. Modern optical sorting uses advances in computer vision such as object recognition and instance segmentation, powered by machine learning. Two-stage detectors like Mask R-CNN use region proposals and classification with deep backbones like ResNet. Single-stage detectors like YOLO handle detection in one pass, trading some accuracy for speed. While such methods excel under ideal conditions with a large volume of labeled training data, challenges arise in realistic scenarios, emphasizing the need to further examine the efficacy of optic detection for automated sorting. In this study, we compiled novel datasets totaling 20,000+ images from varied sources. Using both public and custom machine learning pipelines, we assessed the capabilities and limitations of optical recognition for sorting. Grad-CAM, saliency maps, and confusion matrices were employed to interpret model behavior. We perform this analysis on our custom trained models from the compiled datasets. To conclude, our findings are that optic recognition methods have limited success in accurate sorting of real-world plastics at MRFs, primarily because they rely on physical properties such as color and shape.

[82] CodeMerge: Codebook-Guided Model Merging for Robust Test-Time Adaptation in Autonomous Driving

Huitong Yang,Zhuoxiao Chen,Fengyi Zhang,Zi Huang,Yadan Luo

Main category: cs.CV

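TL;DR: CodeMerge is a lightweight model merging framework that merges models efficiently using low-dimensional fingerprints and a key-value codebook, improving 3D perception performance.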

Details

Motivation: Existing test-time adaptation methods are unstable in dynamic environments and computationally expensive. Method: Low-dimensional fingerprints are derived from the source model's penultimate-layer features, a key-value codebook is built from them, and merging coefficients are computed with ridge leverage scores. Result: Gains of 14.9% NDS on nuScenes-C and over 7.6% mAP on nuScenes-to-KITTI. Conclusion: CodeMerge is efficient, performs strongly, and benefits downstream tasks. Abstract: Maintaining robust 3D perception under dynamic and unpredictable test-time conditions remains a critical challenge for autonomous driving systems. Existing test-time adaptation (TTA) methods often fail in high-variance tasks like 3D object detection due to unstable optimization and sharp minima. While recent model merging strategies based on linear mode connectivity (LMC) offer improved stability by interpolating between fine-tuned checkpoints, they are computationally expensive, requiring repeated checkpoint access and multiple forward passes. In this paper, we introduce CodeMerge, a lightweight and scalable model merging framework that bypasses these limitations by operating in a compact latent space. Instead of loading full models, CodeMerge represents each checkpoint with a low-dimensional fingerprint derived from the source model's penultimate features and constructs a key-value codebook. We compute merging coefficients using ridge leverage scores on these fingerprints, enabling efficient model composition without compromising adaptation quality. Our method achieves strong performance across challenging benchmarks, improving end-to-end 3D detection by 14.9% NDS on nuScenes-C and LiDAR-based detection by over 7.6% mAP on nuScenes-to-KITTI, while benefiting downstream tasks such as online mapping, motion prediction and planning even without training. Code and pretrained models are released in the supplementary material.
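
A ridge leverage score measures how much a row contributes directions that other rows do not already cover: for fingerprint rows $x_i$ of a matrix $X$, the score is $x_i^\top (X^\top X + \lambda I)^{-1} x_i$. The sketch below computes these scores in pure Python on invented fingerprints. How CodeMerge turns scores into merging coefficients is not stated in the abstract, so the final normalization step is only an assumed placeholder, as are all names and values here.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ridge_leverage_scores(rows, lam):
    """Score of row x: x^T (X^T X + lam * I)^{-1} x."""
    d = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) + (lam if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    return [sum(x * y for x, y in zip(r, solve(gram, r))) for r in rows]

# Hypothetical 2-D fingerprints: two identical checkpoints and one distinct one.
fingerprints = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
scores = ridge_leverage_scores(fingerprints, lam=1.0)

# Assumed placeholder: normalize the scores so they sum to one.
coefficients = [s / sum(scores) for s in scores]

print(scores)
print(coefficients)
```

The two duplicated fingerprints share their influence and each score lower than the distinct one, which is the redundancy-aware behavior that makes leverage scores a natural weighting signal.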

[83] Motion Matters: Compact Gaussian Streaming for Free-Viewpoint Video Reconstruction

Jiacong Chen,Qingyu Mao,Youneng Bao,Xiandong Meng,Fanyang Meng,Ronggang Wang,Yongsheng Liang

Main category: cs.CV

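TL;DR: The paper proposes Compact Gaussian Streaming (ComGS), a framework that exploits the locality and consistency of motion in dynamic scenes to cut storage requirements sharply while maintaining visual fidelity and rendering speed.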

Details

Motivation: Existing online free-viewpoint video (FVV) reconstruction methods, such as those based on 3D Gaussian Splatting, have prohibitive storage requirements because point-wise modeling fails to exploit motion properties. Method: ComGS models object-consistent Gaussian point motion through a keypoint-driven motion representation and transmits only keypoint attributes; a viewspace gradient difference strategy locates keypoints in motion regions, an adaptive motion-driven mechanism predicts a spatial influence field, and an error-aware correction strategy limits error accumulation. Result: ComGS reduces storage by over 159x compared with 3DGStream and 14x compared with QUEEN, while maintaining visual fidelity and rendering speed. Conclusion: ComGS offers an efficient, storage-friendly solution for online FVV reconstruction of dynamic scenes. Abstract: 3D Gaussian Splatting (3DGS) has emerged as a high-fidelity and efficient paradigm for online free-viewpoint video (FVV) reconstruction, offering viewers rapid responsiveness and immersive experiences. However, existing online methods face the challenge of prohibitive storage requirements, primarily due to point-wise modeling that fails to exploit the motion properties. To address this limitation, we propose a novel Compact Gaussian Streaming (ComGS) framework, leveraging the locality and consistency of motion in dynamic scenes, that models object-consistent Gaussian point motion through keypoint-driven motion representation. By transmitting only the keypoint attributes, this framework provides a more storage-efficient solution. Specifically, we first identify a sparse set of motion-sensitive keypoints localized within motion regions using a viewspace gradient difference strategy. Equipped with these keypoints, we propose an adaptive motion-driven mechanism that predicts a spatial influence field for propagating keypoint motion to neighboring Gaussian points with similar motion. Moreover, ComGS adopts an error-aware correction strategy for key frame reconstruction that selectively refines erroneous regions and mitigates error accumulation without unnecessary overhead. Overall, ComGS achieves a remarkable storage reduction of over 159 X compared to 3DGStream and 14 X compared to the SOTA method QUEEN, while maintaining competitive visual fidelity and rendering speed. Our code will be released.

[84] SHaDe: Compact and Consistent Dynamic 3D Reconstruction via Tri-Plane Deformation and Latent Diffusion

Asrar Alruwayqi

Main category: cs.CV

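TL;DR: Proposes a new framework for dynamic 3D scene reconstruction that combines an explicit tri-plane deformation field, a view-conditioned canonical radiance field, and a temporally-aware latent diffusion prior to obtain an efficient and compact spatiotemporal representation.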

Details

Motivation: Conventional methods model motion with MLPs, which is inefficient and generalizes poorly. The new method aims to improve efficiency and interpretability with an explicit deformation field and a structured SH rendering head, and to improve robustness with a latent diffusion module. Method: A tri-plane deformation field encodes the 4D scene, an SH attention mechanism synthesizes view-dependent color, and a latent diffusion module refines the features. Training has two stages: pre-training the diffusion module, then joint fine-tuning. Result: The method performs strongly on synthetic benchmarks, surpassing HexPlane and 4D Gaussian Splatting in visual quality, temporal coherence, and robustness to sparse-view dynamic inputs. Conclusion: With an explicit deformation field and a latent diffusion module, the framework improves the efficiency, quality, and generalization of dynamic 3D scene reconstruction. Abstract: We present a novel framework for dynamic 3D scene reconstruction that integrates three key components: an explicit tri-plane deformation field, a view-conditioned canonical radiance field with spherical harmonics (SH) attention, and a temporally-aware latent diffusion prior. Our method encodes 4D scenes using three orthogonal 2D feature planes that evolve over time, enabling efficient and compact spatiotemporal representation. These features are explicitly warped into a canonical space via a deformation offset field, eliminating the need for MLP-based motion modeling. In canonical space, we replace traditional MLP decoders with a structured SH-based rendering head that synthesizes view-dependent color via attention over learned frequency bands, improving both interpretability and rendering efficiency. To further enhance fidelity and temporal consistency, we introduce a transformer-guided latent diffusion module that refines the tri-plane and deformation features in a compressed latent space. This generative module denoises scene representations under ambiguous or out-of-distribution (OOD) motion, improving generalization. Our model is trained in two stages: the diffusion module is first pre-trained independently, and then fine-tuned jointly with the full pipeline using a combination of image reconstruction, diffusion denoising, and temporal consistency losses. We demonstrate state-of-the-art results on synthetic benchmarks, surpassing recent methods such as HexPlane and 4D Gaussian Splatting in visual quality, temporal coherence, and robustness to sparse-view dynamic inputs.

[85] TextureSAM: Towards a Texture Aware Foundation Model for Segmentation

Inbal Cohen,Boaz Meivar,Peihan Tu,Shai Avidan,Gal Oren

Main category: cs.CV

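TL;DR: TextureSAM is a segmentation model optimized for texture-dominant scenes; it adapts SAM with texture augmentation and markedly improves texture segmentation performance.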

Details

Motivation: SAM performs well on segmentation tasks but is biased toward shape rather than texture, which limits its use in domains where texture defines boundaries, such as medical imaging and material classification. Method: TextureSAM is fine-tuned with texture augmentation techniques and a texture-modified version of the ADE20K dataset to emphasize texture features. Result: TextureSAM improves on SAM-2 by 0.2 mIoU on natural and 0.18 mIoU on synthetic texture-based datasets. Conclusion: TextureSAM addresses SAM's limitations in texture segmentation and provides a better option for texture-dominant scenes. Abstract: Segment Anything Models (SAM) have achieved remarkable success in object segmentation tasks across diverse datasets. However, these models are predominantly trained on large-scale semantic segmentation datasets, which introduce a bias toward object shape rather than texture cues in the image. This limitation is critical in domains such as medical imaging, material classification, and remote sensing, where texture changes define object boundaries. In this study, we investigate SAM's bias toward semantics over textures and introduce a new texture-aware foundation model, TextureSAM, which performs superior segmentation in texture-dominant scenarios. To achieve this, we employ a novel fine-tuning approach that incorporates texture augmentation techniques, incrementally modifying training images to emphasize texture features. By leveraging a novel texture-alternation of the ADE20K dataset, we guide TextureSAM to prioritize texture-defined regions, thereby mitigating the inherent shape bias present in the original SAM model. Our extensive experiments demonstrate that TextureSAM significantly outperforms SAM-2 on both natural (+0.2 mIoU) and synthetic (+0.18 mIoU) texture-based segmentation datasets. The code and texture-augmented dataset will be publicly available.

[86] Auto-nnU-Net: Towards Automated Medical Image Segmentation

Jannis Becktepe,Leona Hennig,Steffen Oeltze-Jafra,Marius Lindauer

Main category: cs.CV

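TL;DR: Auto-nnU-Net is an improved nnU-Net framework that raises medical image segmentation performance through hyperparameter optimization (HPO), neural architecture search (NAS), and hierarchical NAS (HNAS), while balancing computational resource requirements.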

Details

Motivation: The existing nnU-Net framework is constrained in its hyperparameters and design choices and cannot fully adapt to the diverse requirements of medical image segmentation tasks. Method: Proposes the Auto-nnU-Net framework, which combines HPO, NAS, and HNAS, and introduces Regularized PriorBand to balance accuracy against training resources. Result: Performance improves substantially on 6 of 10 datasets and is on par on the rest, while resource requirements remain practical. Conclusion: Auto-nnU-Net effectively raises the level of automation and the performance of medical image segmentation and is suited to real-world medical settings. Abstract: Medical Image Segmentation (MIS) includes diverse tasks, from bone to organ segmentation, each with its own challenges in finding the best segmentation model. The state-of-the-art AutoML-related MIS-framework nnU-Net automates many aspects of model configuration but remains constrained by fixed hyperparameters and heuristic design choices. As a full-AutoML framework for MIS, we propose Auto-nnU-Net, a novel nnU-Net variant enabling hyperparameter optimization (HPO), neural architecture search (NAS), and hierarchical NAS (HNAS). Additionally, we propose Regularized PriorBand to balance model accuracy with the computational resources required for training, addressing the resource constraints often faced in real-world medical settings that limit the feasibility of extensive training procedures. We evaluate our approach across diverse MIS datasets from the well-established Medical Segmentation Decathlon, analyzing the impact of AutoML techniques on segmentation performance, computational efficiency, and model design choices. The results demonstrate that our AutoML approach substantially improves the segmentation performance of nnU-Net on 6 out of 10 datasets and is on par on the other datasets while maintaining practical resource requirements. Our code is available at https://github.com/LUH-AI/AutonnUNet.

Nina Shvetsova,Goutam Bhat,Prune Truong,Hilde Kuehne,Federico Tombari

Main category: cs.CV

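TL;DR: Proposes a new architecture for monocular-to-stereo video conversion that generates a high-quality right view with an extended Stable Video Diffusion model and outperforms existing methods.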

Details

Motivation: Addresses inpainting and refinement of the right view in monocular-to-stereo video conversion. Method: Extends the Stable Video Diffusion model to take the left view, the warped right view, and the disocclusion masks as input, and modifies the attention layers to exploit information from neighboring frames. Result: Achieves an average rank of 1.43 among 4 methods in a user study and is 6x faster than the second-placed method. Conclusion: The method outperforms existing techniques in both quality and efficiency. Abstract: We tackle the problem of monocular-to-stereo video conversion and propose a novel architecture for inpainting and refinement of the warped right view obtained by depth-based reprojection of the input left view. We extend the Stable Video Diffusion (SVD) model to utilize the input left video, the warped right video, and the disocclusion masks as conditioning input to generate a high-quality right camera view. In order to effectively exploit information from neighboring frames for inpainting, we modify the attention layers in SVD to compute full attention for disoccluded pixels. Our model is trained to generate the right view video in an end-to-end manner by minimizing image space losses to ensure high-quality generation. Our approach outperforms previous state-of-the-art methods, obtaining an average rank of 1.43 among the 4 compared methods in a user study, while being 6x faster than the second-placed method.

[88] Temporal Object Captioning for Street Scene Videos from LiDAR Tracks

Vignesh Gopinathan,Urs Zimmermann,Michael Arnold,Matthias Rottmann

Main category: cs.CV

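TL;DR: The paper proposes an automated LiDAR-based video captioning procedure that focuses on the temporal dynamics of traffic participants. Captions are generated with a rule-based system and templates, and training on them significantly improves the temporal understanding of video captioning models.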

Details

Motivation: Existing video captioning models fall short in capturing and using temporal semantics, especially in the context of Advanced Driver Assistance Systems (ADAS). Method: Uses a LiDAR-based rule system to extract key information such as lane position and relative motion, generates captions from templates, and uses them to train the SwinBERT model. Result: Experiments show consistently improved understanding of temporal dynamics across three datasets. Conclusion: Caption supervision derived from LiDAR effectively reduces the visual/static bias of existing models and significantly strengthens temporal understanding. Abstract: Video captioning models have seen notable advancements in recent years, especially with regard to their ability to capture temporal information. While many research efforts have focused on architectural advancements, such as temporal attention mechanisms, there remains a notable gap in understanding how models capture and utilize temporal semantics for effective temporal feature extraction, especially in the context of Advanced Driver Assistance Systems. We propose an automated LiDAR-based captioning procedure that focuses on the temporal dynamics of traffic participants. Our approach uses a rule-based system to extract essential details such as lane position and relative motion from object tracks, followed by a template-based caption generation. Our findings show that training SwinBERT, a video captioning model, using only front camera images and supervised with our template-based captions, specifically designed to encapsulate fine-grained temporal behavior, leads to improved temporal understanding consistently across three datasets. In conclusion, our results clearly demonstrate that integrating LiDAR-based caption supervision significantly enhances temporal understanding, effectively addressing and reducing the inherent visual/static biases prevalent in current state-of-the-art model architectures.

[89] Decoupled Geometric Parameterization and its Application in Deep Homography Estimation

Yao Huang,Si-Yuan Cao,Yaqing Ding,Hao Yin,Shibin Xie,Shuting Wang,Zhijun Fang,Jiachun Wang,Shen Cai,Junchi Yan,Shuhan Shen

Main category: cs.CV

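TL;DR: The paper proposes a new geometric parameterization of planar homographies that uses the SKS decomposition to decouple the similarity and kernel transformation parameters, simplifying computation of the homography matrix.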

Details

Motivation: The conventional parameterization by four-corner positional offsets lacks geometric interpretability and requires solving a linear system, so a more intuitive and efficient alternative is needed. Method: Uses the similarity-kernel-similarity (SKS) decomposition to decouple the homography into two independent parameter sets, one for a similarity transformation and one for the kernel transformation, and derives a linear relation between the kernel transformation parameters and angular offsets. Result: The proposed parameterization estimates the homography directly by matrix multiplication, without solving a linear system, and performs comparably to the four-corner offset parameterization. Conclusion: The new method keeps performance while improving the geometric intuitiveness and computational efficiency of homography estimation. Abstract: Planar homography, with eight degrees of freedom (DOFs), is fundamental in numerous computer vision tasks. While the positional offsets of four corners are widely adopted (especially in neural network predictions), this parameterization lacks geometric interpretability and typically requires solving a linear system to compute the homography matrix. This paper presents a novel geometric parameterization of homographies, leveraging the similarity-kernel-similarity (SKS) decomposition for projective transformations. Two independent sets of four geometric parameters are decoupled: one for a similarity transformation and the other for the kernel transformation. Additionally, the geometric interpretation linearly relating the four kernel transformation parameters to angular offsets is derived. Our proposed parameterization allows for direct homography estimation through matrix multiplication, eliminating the need for solving a linear system, and achieves performance comparable to the four-corner positional offsets in deep homography estimation.
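
For context, the conventional four-corner parameterization that this entry contrasts against can be sketched as follows. This is not the paper's SKS method: it is the standard direct linear solve with the bottom-right entry fixed to 1, written in plain Python. The helper names `solve_linear`, `homography_from_offsets`, and `apply_h` are hypothetical.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate corner configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for k in range(col, n + 1):
                    m[r][k] -= f * m[col][k]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography_from_offsets(corners, offsets):
    """Build the 8x8 system from four corners and their offsets (h33 fixed to 1)."""
    rows, rhs = [], []
    for (x, y), (dx, dy) in zip(corners, offsets):
        u, v = x + dx, y + dy  # displaced corner
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    h = solve_linear(rows, rhs)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_h(H, x, y):
    """Map a point through H with the projective division."""
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)


corners = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
offsets = [(0.1, 0.0), (-0.05, 0.1), (0.0, -0.1), (0.05, 0.05)]
H = homography_from_offsets(corners, offsets)
print(apply_h(H, 1.0, 0.0))  # should land near the displaced corner (0.95, 0.1)
```

The 8x8 solve in `solve_linear` is the step that, according to the abstract, the SKS-based parameterization replaces with matrix multiplication.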

[90] MEgoHand: Multimodal Egocentric Hand-Object Interaction Motion Generation

Bohan Zhou,Yi Zhan,Zhongbin Zhang,Zongqing Lu

Main category: cs.CV

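TL;DR: MEgoHand is a multimodal framework that generates physically plausible hand-object interactions from RGB, text, and an initial hand pose, substantially reducing wrist translation and joint rotation errors.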

Details

Motivation: Addresses three problems in hand-object interaction generation: existing methods depend on predefined 3D object priors, multimodal methods generate ambiguously, and modeling 3D hand-object correlation is complex. Method: Uses a bi-level architecture. The high level uses a vision language model to infer motion priors and a monocular depth estimator for spatial reasoning; the low level is a DiT-based flow-matching policy that generates fine-grained trajectories. Result: Validated on multiple datasets, with wrist translation error reduced by 86.9% and joint rotation error reduced by 34.1%. Conclusion: MEgoHand accurately models fine-grained hand joint structures and is robust across diverse scenarios. Abstract: Egocentric hand-object motion generation is crucial for immersive AR/VR and robotic imitation but remains challenging due to unstable viewpoints, self-occlusions, perspective distortion, and noisy ego-motion. Existing methods rely on predefined 3D object priors, which restricts their generalizability to novel objects. Meanwhile, recent multimodal approaches suffer from ambiguous generation from abstract textual cues, intricate pipelines for modeling 3D hand-object correlation, and compounding errors in open-loop prediction. We propose MEgoHand, a multimodal framework that synthesizes physically plausible hand-object interactions from egocentric RGB, text, and initial hand pose. MEgoHand introduces a bi-level architecture: a high-level "cerebrum" leverages a vision language model (VLM) to infer motion priors from visual-textual context and a monocular depth estimator for object-agnostic spatial reasoning, while a low-level DiT-based flow-matching policy generates fine-grained trajectories with temporal orthogonal filtering to enhance stability. To address dataset inconsistency, we design a dataset curation paradigm with an Inverse MANO Retargeting Network and Virtual RGB-D Renderer, curating a unified dataset of 3.35M RGB-D frames, 24K interactions, and 1.2K objects. Extensive experiments across five in-domain and two cross-domain datasets demonstrate the effectiveness of MEgoHand, achieving substantial reductions in wrist translation error (86.9%) and joint rotation error (34.1%), highlighting its capacity to accurately model fine-grained hand joint structures and generalize robustly across diverse scenarios.

[91] Grounding Chest X-Ray Visual Question Answering with Generated Radiology Reports

Francesco Dalla Serra,Patrick Schrempf,Chaoyang Wang,Zaiqiao Meng,Fani Deligianni,Alison Q. O'Neil

Main category: cs.CV

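TL;DR: Proposes a new approach to CXR visual question answering that handles both single-image and image-difference questions and uses radiology reports to improve model performance.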

Details

Motivation: Addresses single-image and image-difference questions in CXR visual question answering and explores how radiology reports can improve model performance. Method: Proposes a unified method for both question types in two steps, report generation and answer generation, using the predicted radiology report as evidence. Result: Achieves the best performance on the Medical-Diff-VQA dataset. Conclusion: Integrating radiology reports significantly improves the performance of CXR visual question answering models. Abstract: We present a novel approach to Chest X-ray (CXR) Visual Question Answering (VQA), addressing both single-image and image-difference questions. Single-image questions focus on abnormalities within a specific CXR ("What abnormalities are seen in image X?"), while image-difference questions compare two longitudinal CXRs acquired at different time points ("What are the differences between image X and Y?"). We further explore how the integration of radiology reports can enhance the performance of VQA models. While previous approaches have demonstrated the utility of radiology reports during the pre-training phase, we extend this idea by showing that the reports can also be leveraged as additional input to improve the VQA model's predicted answers. First, we propose a unified method that handles both types of questions and auto-regressively generates the answers. For single-image questions, the model is provided with a single CXR. For image-difference questions, the model is provided with two CXRs from the same patient, captured at different time points, enabling the model to detect and describe temporal changes. Taking inspiration from 'Chain-of-Thought reasoning', we demonstrate that performance on the CXR VQA task can be improved by grounding the answer generator module with a radiology report predicted for the same CXR. In our approach, the VQA model is divided into two steps: i) Report Generation (RG) and ii) Answer Generation (AG). Our results demonstrate that incorporating predicted radiology reports as evidence to the AG model enhances performance on both single-image and image-difference questions, achieving state-of-the-art results on the Medical-Diff-VQA dataset.

[92] Background Matters: A Cross-view Bidirectional Modeling Framework for Semi-supervised Medical Image Segmentation

Luyang Cao,Jianwei Li,Yinghuan Shi

Main category: cs.CV

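TL;DR: The paper proposes CVBM, a semi-supervised medical image segmentation framework that improves foreground segmentation by explicitly modeling the background region, reaching SOTA results on several datasets.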

Details

Motivation: Current semi-supervised medical image segmentation methods focus mainly on foreground modeling and overlook the potential value of background modeling. The study shows that background modeling can raise the confidence of foreground modeling. Method: Proposes the CVBM framework, which introduces background modeling as an auxiliary perspective and uses a bidirectional consistency mechanism to keep foreground and background predictions aligned. Result: Performs strongly on the LA, Pancreas, ACDC, and HRF datasets. On Pancreas it surpasses fully supervised methods with only 20% of the labeled data (DSC: 84.57% vs. 83.89%). Conclusion: Through background modeling and the bidirectional consistency mechanism, CVBM significantly improves semi-supervised medical image segmentation and demonstrates the importance of background modeling. Abstract: Semi-supervised medical image segmentation (SSMIS) leverages unlabeled data to reduce reliance on manually annotated images. However, current SOTA approaches predominantly focus on foreground-oriented modeling (i.e., segmenting only the foreground region) and have largely overlooked the potential benefits of explicitly modeling the background region. Our study theoretically and empirically demonstrates that highly certain predictions in background modeling enhance the confidence of corresponding foreground modeling. Building on this insight, we propose the Cross-view Bidirectional Modeling (CVBM) framework, which introduces a novel perspective by incorporating background modeling to improve foreground modeling performance. Within CVBM, background modeling serves as an auxiliary perspective, providing complementary supervisory signals to enhance the confidence of the foreground model. Additionally, CVBM introduces an innovative bidirectional consistency mechanism, which ensures mutual alignment between foreground predictions and background-guided predictions. Extensive experiments demonstrate that our approach achieves SOTA performance on the LA, Pancreas, ACDC, and HRF datasets. Notably, on the Pancreas dataset, CVBM outperforms fully supervised methods (i.e., DSC: 84.57% vs. 83.89%) while utilizing only 20% of the labeled data. Our code is publicly available at https://github.com/caoluyang0830/CVBM.git.
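
The DSC values in this entry are Dice similarity coefficients. The sketch below computes Dice on binary masks and, to connect with the foreground/background framing, also scores the complementary background masks. It illustrates the metric only, not the CVBM training procedure; the name `dice` and the convention of returning 1.0 for two empty masks are assumptions.

```python
def dice(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flat binary masks of 0/1."""
    inter = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    total = sum(pred) + sum(target)
    return 2 * inter / total if total else 1.0  # assumption: empty vs. empty counts as perfect


fg_pred = [1, 1, 0, 0]
fg_true = [1, 0, 0, 0]
# The background view is the complement of the foreground masks.
bg_pred = [1 - p for p in fg_pred]
bg_true = [1 - t for t in fg_true]

print(dice(fg_pred, fg_true))  # foreground Dice: 2*1 / (2+1)
print(dice(bg_pred, bg_true))  # background Dice: 2*2 / (2+3)
```

The same prediction scores differently from the two views (2/3 versus 0.8 here), which gives a rough sense of why a background view can carry a complementary signal.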

[93] SoccerChat: Integrating Multimodal Data for Enhanced Soccer Game Understanding

Sushant Gautam,Cise Midoglu,Vajira Thambawita,Michael A. Riegler,Pål Halvorsen,Mubarak Shah

Main category: cs.CV

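TL;DR: SoccerChat is a multimodal conversational AI framework that integrates visual and textual data to improve soccer video understanding, and it performs well on action classification and referee decision-making tasks.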

Details

Motivation: Traditional soccer analysis relies on isolated data streams and cannot capture the full dynamics of a match, so a more integrated approach is needed. Method: Uses the SoccerNet dataset (with jersey color annotations and ASR transcripts) and fine-tunes SoccerChat on a structured video instruction dataset for game understanding, event classification, and referee decision making. Result: SoccerChat performs well on general soccer event comprehension while maintaining competitive accuracy in referee decision making. Conclusion: Multimodal integration is essential for advancing soccer analytics and paves the way for interactive, explainable AI-driven sports analysis. Abstract: The integration of artificial intelligence in sports analytics has transformed soccer video understanding, enabling real-time, automated insights into complex game dynamics. Traditional approaches rely on isolated data streams, limiting their effectiveness in capturing the full context of a match. To address this, we introduce SoccerChat, a multimodal conversational AI framework that integrates visual and textual data for enhanced soccer video comprehension. Leveraging the extensive SoccerNet dataset, enriched with jersey color annotations and automatic speech recognition (ASR) transcripts, SoccerChat is fine-tuned on a structured video instruction dataset to facilitate accurate game understanding, event classification, and referee decision making. We benchmark SoccerChat on action classification and referee decision-making tasks, demonstrating its performance in general soccer event comprehension while maintaining competitive accuracy in referee decision making. Our findings highlight the importance of multimodal integration in advancing soccer analytics, paving the way for more interactive and explainable AI-driven sports analysis. https://github.com/simula/SoccerChat

[94] Towards Texture- And Shape-Independent 3D Keypoint Estimation in Birds

Valentin Schmuker,Alex Hoi Hang Chan,Bastian Goldluecke,Urs Waldmann

Main category: cs.CV

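TL;DR: Proposes a texture-independent approach to estimating and tracking the 3D joint positions of multiple pigeons. It extends the 3D-MuPPET framework with a segmentation method that produces individual silhouettes for 2D keypoint estimation, and reaches accuracy comparable to the original texture-dependent method.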

Details

Motivation: Addresses the texture dependence of existing 3D pose estimation methods and explores whether the approach applies to other bird species. Method: Builds on the 3D-MuPPET framework: segmentation produces individual silhouettes, 2D keypoints are estimated from them and triangulated into 3D poses, with no texture information required. Result: The texture-independent approach reaches accuracy comparable to the texture-dependent method, and preliminary results support its feasibility on other bird species. Conclusion: The approach provides a foundation for developing more robust and accurate texture-independent pose estimation frameworks. Abstract: In this paper, we present a texture-independent approach to estimate and track 3D joint positions of multiple pigeons. For this purpose, we build upon the existing 3D-MuPPET framework, which estimates and tracks the 3D poses of up to 10 pigeons using a multi-view camera setup. We extend this framework by using a segmentation method that generates silhouettes of the individuals, which are then used to estimate 2D keypoints. Following 3D-MuPPET, these 2D keypoints are triangulated to infer 3D poses, and identities are matched in the first frame and tracked in 2D across subsequent frames. Our proposed texture-independent approach achieves comparable accuracy to the original texture-dependent 3D-MuPPET framework. Additionally, we explore our approach's applicability to other bird species. To do that, we infer the 2D joint positions of four bird species without additionally fine-tuning the model trained on pigeons and obtain promising preliminary results. Thus, we think that our approach serves as a solid foundation and inspires the development of more robust and accurate texture-independent pose estimation frameworks.

[95] From Evaluation to Defense: Advancing Safety in Video Large Language Models

Yiwei Sun,Peiqi Jiang,Chuanbin Liu,Luohao Lin,Zhiying Lu,Hongtao Xie

Main category: cs.CV

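TL;DR: The paper introduces VideoSafetyBench (VSB-77k), the first large-scale, culturally diverse safety benchmark for Video LLMs, and shows that the video modality degrades safety performance by 42.3%. To address this, the authors propose VideoSafety-R1, a dual-stage framework whose two components significantly improve safety.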

Details

Motivation: The safety risks of Video LLMs have not been studied sufficiently. The authors aim to fill this gap and to address the systemic safety vulnerabilities introduced by the video modality. Method: 1. Introduces the VideoSafetyBench (VSB-77k) benchmark; 2. Develops the dual-stage framework VideoSafety-R1, comprising Alarm Token-Guided Safety Fine-Tuning (AT-SFT) and Safety-Guided GRPO. Result: VideoSafety-R1 improves by 65.1% on VSB-Eval-HH and also improves markedly on image safety datasets. Conclusion: The video modality has a significant negative effect on LLM safety performance, but an active-reasoning framework can improve safety substantially. Abstract: While the safety risks of image-based large language models have been extensively studied, their video-based counterparts (Video LLMs) remain critically under-examined. To systematically study this problem, we introduce VideoSafetyBench (VSB-77k) - the first large-scale, culturally diverse benchmark for Video LLM safety, which comprises 77,646 video-query pairs and spans 19 principal risk categories across 10 language communities. We reveal that integrating video modality degrades safety performance by an average of 42.3%, exposing systemic risks in multimodal attack exploitation. To address this vulnerability, we propose VideoSafety-R1, a dual-stage framework achieving unprecedented safety gains through two innovations: (1) Alarm Token-Guided Safety Fine-Tuning (AT-SFT) injects learnable alarm tokens into visual and textual sequences, enabling explicit harm perception across modalities via multitask objectives. (2) Then, Safety-Guided GRPO enhances defensive reasoning through dynamic policy optimization with rule-based rewards derived from dual-modality verification. These components synergize to shift safety alignment from passive harm recognition to active reasoning. The resulting framework achieves a 65.1% improvement on VSB-Eval-HH, and improves by 59.1%, 44.3%, and 15.0% on the image safety datasets MMBench, VLGuard, and FigStep, respectively. Our codes are available in the supplementary materials. Warning: This paper contains examples of harmful language and videos, and reader discretion is recommended.

[96] Point, Detect, Count: Multi-Task Medical Image Understanding with Instruction-Tuned Vision-Language Models

Sushant Gautam,Michael A. Riegler,Pål Halvorsen

Main category: cs.CV

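TL;DR: The study instruction-tunes a vision-language model (VLM) for multi-task medical image understanding (detection, localization, and counting) using the MedMultiPoints dataset and LoRA. Multi-task training improves performance, at the cost of reduced reliability in edge cases.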

Details

Motivation: Evaluates whether an instruction-tuned VLM can improve multi-task understanding of medical images simultaneously, with the goal of better diagnostic accuracy and efficiency. Method: Uses the MedMultiPoints dataset, reformulates the tasks as instruction prompts, and fine-tunes Qwen2.5-VL-7B-Instruct with LoRA. Result: Multi-task training improves robustness and accuracy, for example reducing the Count MAE and raising Matching Accuracy, but reliability in edge cases decreases. Conclusion: General-purpose VLMs can be adapted to medical tasks through prompt-driven fine-tuning and produce structured outputs, a step toward explainable medical AI. Abstract: We investigate fine-tuning Vision-Language Models (VLMs) for multi-task medical image understanding, focusing on detection, localization, and counting of findings in medical images. Our objective is to evaluate whether instruction-tuned VLMs can simultaneously improve these tasks, with the goal of enhancing diagnostic accuracy and efficiency. Using MedMultiPoints, a multimodal dataset with annotations from endoscopy (polyps and instruments) and microscopy (sperm cells), we reformulate each task into instruction-based prompts suitable for vision-language reasoning. We fine-tune Qwen2.5-VL-7B-Instruct using Low-Rank Adaptation (LoRA) across multiple task combinations. Results show that multi-task training improves robustness and accuracy. For example, it reduces the Count Mean Absolute Error (MAE) and increases Matching Accuracy in the Counting + Pointing task. However, trade-offs emerge, such as more zero-case point predictions, indicating reduced reliability in edge cases despite overall performance gains. Our study highlights the potential of adapting general-purpose VLMs to specialized medical tasks via prompt-driven fine-tuning. This approach mirrors clinical workflows, where radiologists simultaneously localize, count, and describe findings - demonstrating how VLMs can learn composite diagnostic reasoning patterns. The model produces interpretable, structured outputs, offering a promising step toward explainable and versatile medical AI. Code, model weights, and scripts will be released for reproducibility at https://github.com/simula/PointDetectCount.

[97] Unsupervised Network Anomaly Detection with Autoencoders and Traffic Images

Michael Neri,Sara Baldoni

Main category: cs.CV

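TL;DR: Proposes an image-based representation of network traffic for prompt detection of security anomalies, and detects anomalies efficiently with unsupervised learning.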

Details

Motivation: As the number of connected devices grows, the need to detect security issues promptly and to process large amounts of data becomes more pressing. Device heterogeneity adds further complexity. Method: Uses an image-based representation of network traffic that produces a compact summary of network conditions over 1-second time windows, reducing the need for complex processing architectures. Result: The representation highlights anomalies, and the unsupervised learning approach detects them successfully. Conclusion: The method offers an efficient and lightweight solution for network traffic anomaly detection. Abstract: Due to the recent increase in the number of connected devices, the need to promptly detect security issues is emerging. Moreover, the high number of communication flows creates the necessity of processing huge amounts of data. Furthermore, the connected devices are heterogeneous in nature, having different computational capacities. For this reason, in this work we propose an image-based representation of network traffic which makes it possible to build a compact summary of the current network conditions with 1-second time windows. The proposed representation highlights the presence of anomalies, thus reducing the need for complex processing architectures. Finally, we present an unsupervised learning approach which effectively detects the presence of anomalies. The code and the dataset are available at https://github.com/michaelneri/image-based-network-traffic-anomaly-detection.

[98] Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding

Feilong Tang,Chengzhi Liu,Zhongxing Xu,Ming Hu,Zelin Peng,Zhiwei Yang,Jionglong Su,Minquan Lin,Yifan Peng,Xuelian Cheng,Imran Razzak,Zongyuan Ge

Main category: cs.CV

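TL;DR: FarSight is a decoding strategy that reduces hallucinations by optimizing the causal mask. Using an attention register structure and a positional awareness encoding method, it significantly reduces hallucinations in multimodal large language models.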

Details

Motivation: Multimodal large language models hallucinate in visual question answering. Hallucinations fall into two types, initial and snowball, and the decoding strategy needs to be improved to reduce interference. Method: Proposes the FarSight strategy, which optimizes the causal mask, allocates attention dynamically through an attention register structure, and adds a positional awareness encoding method. Result: Experiments show that FarSight significantly reduces hallucinations on image and video benchmarks. Conclusion: FarSight is an effective plug-and-play decoding strategy that significantly improves the inference performance of multimodal large language models. Abstract: Recent advancements in multimodal large language models (MLLMs) have significantly improved performance in visual question answering. However, they often suffer from hallucinations. In this work, hallucinations are categorized into two main types: initial hallucinations and snowball hallucinations. We argue that adequate contextual information can be extracted directly from the token interaction process. Inspired by causal inference in the decoding strategy, we propose to leverage causal masks to establish information propagation between multimodal tokens. The hypothesis is that insufficient interaction between those tokens may lead the model to rely on outlier tokens, overlooking dense and rich contextual cues. Therefore, we propose to intervene in the propagation process by tackling outlier tokens to enhance in-context inference. With this goal, we present FarSight, a versatile plug-and-play decoding strategy to reduce attention interference from outlier tokens merely by optimizing the causal mask. The heart of our method is effective token propagation. We design an attention register structure within the upper triangular matrix of the causal mask, dynamically allocating attention to capture attention diverted to outlier tokens. Moreover, a positional awareness encoding method with a diminishing masking rate is proposed, allowing the model to attend to further preceding tokens, especially for video sequence tasks. With extensive experiments, FarSight demonstrates significant hallucination-mitigating performance across different MLLMs on both image and video benchmarks, proving its effectiveness.

[99] Zero-Shot Hyperspectral Pansharpening Using Hysteresis-Based Tuning for Spectral Quality Control

Giuseppe Guarino,Matteo Ciotola,Gemine Vivone,Giovanni Poggi,Giuseppe Scarpa

Main category: cs.CV

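TL;DR: The paper proposes a lightweight neural network method for hyperspectral pansharpening that ensures uniform spectral quality across all bands and requires no training on external data.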

Details

Motivation: Hyperspectral pansharpening poses unique challenges, such as a very large number of bands, heavy noise, and spectral mismatch, and existing methods struggle to keep quality consistent across all bands. Method: Uses a single lightweight neural network whose weights adapt to each band, switches the spatial loss on and off dynamically during fine-tuning, and redefines the spatial loss to account for nonlinear dependencies, all in a fully unsupervised setting. Result: Experiments show excellent sharpening quality across all bands, competitive with the state of the art. Conclusion: The method is flexible and low-complexity and suits hyperspectral pansharpening; the code and results are public. Abstract: Hyperspectral pansharpening has received much attention in recent years due to technological and methodological advances that open the door to new application scenarios. However, research on this topic is only now gaining momentum. The most popular methods are still borrowed from the more mature field of multispectral pansharpening and often overlook the unique challenges posed by hyperspectral data fusion, such as i) the very large number of bands, ii) the overwhelming noise in selected spectral ranges, iii) the significant spectral mismatch between panchromatic and hyperspectral components, iv) a typically high resolution ratio. Imprecise data modeling especially affects spectral fidelity. Even state-of-the-art methods perform well in certain spectral ranges and much worse in others, failing to ensure consistent quality across all bands, with the risk of generating unreliable results. Here, we propose a hyperspectral pansharpening method that explicitly addresses this problem and ensures uniform spectral quality. To this end, a single lightweight neural network is used, with weights that adapt on the fly to each band. During fine-tuning, the spatial loss is turned on and off to ensure a fast convergence of the spectral loss to the desired level, according to a hysteresis-like dynamic. Furthermore, the spatial loss itself is appropriately redefined to account for nonlinear dependencies between panchromatic and spectral bands. Overall, the proposed method is fully unsupervised, with no prior training on external data, flexible, and low-complexity. Experiments on a recently published benchmarking toolbox show that it ensures excellent sharpening quality, competitive with the state-of-the-art, consistently across all bands. The software code and the full set of results are shared online on https://github.com/giu-guarino/rho-PNN.
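
The abstract says only that the spatial loss is turned on and off according to a hysteresis-like dynamic. One plausible reading is a two-threshold switch, sketched below. The thresholds `low` and `high`, the switching direction, and the function name `hysteresis_schedule` are all assumptions made for illustration and are not taken from the paper or its code.

```python
def hysteresis_schedule(spectral_losses, low, high, spatial_on=True):
    """Return, per step, whether the spatial loss is active (two-threshold switch)."""
    states = []
    for loss in spectral_losses:
        if spatial_on and loss > high:
            spatial_on = False  # assumed: spectral error too large, optimize it alone
        elif not spatial_on and loss < low:
            spatial_on = True   # assumed: spectral target met, bring the spatial term back
        states.append(spatial_on)
    return states


# Hypothetical spectral-loss trace over six fine-tuning steps.
trace = [0.5, 0.9, 0.7, 0.3, 0.6, 0.9]
print(hysteresis_schedule(trace, low=0.4, high=0.8))
```

With two thresholds, the switch does not flip while the loss stays between `low` and `high`, which avoids rapid toggling around a single cutoff.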

[100] SD-MAD: Sign-Driven Few-shot Multi-Anomaly Detection in Medical Images

Kaiyu Guo,Tan Pan,Chen Jiang,Zijian Wang,Brian C. Lovell,Limei Han,Yuan Cheng,Mahsa Baktashmotlagh

Main category: cs.CV

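TL;DR: Proposes SD-MAD, a framework for few-shot medical anomaly detection that combines textual descriptions with a two-stage approach to identify multiple anomaly categories.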

Details

Motivation: Medical anomaly detection is constrained by data privacy and limited samples, and existing methods often ignore the distinction among multiple anomaly categories. Method: The SD-MAD framework has two stages: 1) radiological signs are aligned with anomaly categories through textual descriptions; 2) signs are selected automatically to reduce the impact of limited data. Result: Experiments demonstrate the method's effectiveness for multi-anomaly detection. Conclusion: SD-MAD offers a new approach to few-shot medical anomaly detection and performs especially well when multiple categories must be identified. Abstract: Medical anomaly detection (AD) is crucial for early clinical intervention, yet it faces challenges due to limited access to high-quality medical imaging data, caused by privacy concerns and data silos. Few-shot learning has emerged as a promising approach to alleviate these limitations by leveraging the large-scale prior knowledge embedded in vision-language models (VLMs). Recent advancements in few-shot medical AD have treated normal and abnormal cases as a one-class classification problem, often overlooking the distinction among multiple anomaly categories. Thus, in this paper, we propose a framework tailored for few-shot medical anomaly detection in the scenario where the identification of multiple anomaly categories is required. To capture the detailed radiological signs of medical anomaly categories, our framework incorporates diverse textual descriptions for each category generated by a large language model, under the assumption that different anomalies in medical images may share common radiological signs in each category. Specifically, we introduce SD-MAD, a two-stage Sign-Driven few-shot Multi-Anomaly Detection framework: (i) Radiological signs are aligned with anomaly categories by amplifying inter-anomaly discrepancy; (ii) Aligned signs are selected further to mitigate the effect of the under-fitting and uncertain-sample issues caused by limited medical data, employing an automatic sign selection strategy at inference. Moreover, we propose three protocols to comprehensively quantify the performance of multi-anomaly detection. Extensive experiments illustrate the effectiveness of our method.

Haihong Hao,Mingfei Han,Changlin Li,Zhihui Li,Xiaojun Chang

Main category: cs.CV

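TL;DR: CoNav is a collaborative cross-modal reasoning framework in which a 3D-text model guides an image-text navigation agent to resolve ambiguities during navigation, significantly improving performance on several navigation and spatial reasoning benchmarks.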

Details

Motivation: Addresses the challenges of unified fusion of 2D images, 3D point clouds, and textual instructions, such as scarce multimodal data and conflicting beliefs among modalities. Method: Introduces Cross-Modal Belief Alignment, which shares textual hypotheses from the 3D-text model with the navigation agent, and integrates visual and spatial-semantic knowledge through lightweight fine-tuning. Result: Significantly improves performance on four navigation benchmarks and two spatial reasoning benchmarks, with shorter paths (as measured by SPL). Conclusion: CoNav demonstrates the potential of multimodal data fusion in navigation, though challenges remain. Abstract: Embodied navigation demands comprehensive scene understanding and precise spatial reasoning. While image-text models excel at interpreting pixel-level color and lighting cues, 3D-text models capture volumetric structure and spatial relationships. However, unified fusion approaches that jointly fuse 2D images, 3D point clouds, and textual instructions face challenges in limited availability of triple-modality data and difficulty resolving conflicting beliefs among modalities. In this work, we introduce CoNav, a collaborative cross-modal reasoning framework where a pretrained 3D-text model explicitly guides an image-text navigation agent by providing structured spatial-semantic knowledge to resolve ambiguities during navigation. Specifically, we introduce Cross-Modal Belief Alignment, which operationalizes this cross-modal guidance by simply sharing textual hypotheses from the 3D-text model to the navigation agent. Through lightweight fine-tuning on a small 2D-3D-text corpus, the navigation agent learns to integrate visual cues with spatial-semantic knowledge derived from the 3D-text model, enabling effective reasoning in embodied navigation. CoNav achieves significant improvements on four standard embodied navigation benchmarks (R2R, CVDN, REVERIE, SOON) and two spatial reasoning benchmarks (ScanQA, SQA3D). Moreover, at a similar navigation Success Rate, CoNav often generates shorter paths compared to other methods (as measured by SPL), showcasing the potential and challenges of fusing data from different modalities in embodied navigation. Project Page: https://oceanhao.github.io/CoNav/
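
SPL, the path-efficiency metric mentioned in this entry, is commonly defined as Success weighted by Path Length: the mean over episodes of S_i * l_i / max(p_i, l_i), where S_i is the success flag, l_i the shortest-path length, and p_i the length of the path actually taken. A minimal sketch under that standard definition follows; the function name `spl` and the tuple layout are illustrative choices.

```python
def spl(episodes):
    """episodes: list of (success, shortest_len, path_len) tuples."""
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)  # failed episodes contribute 0
    return total / len(episodes)


episodes = [
    (True, 10.0, 10.0),   # optimal path: contributes 1.0
    (True, 10.0, 20.0),   # success with a path twice as long: contributes 0.5
    (False, 10.0, 5.0),   # failure: contributes 0.0
]
print(spl(episodes))  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```

Two agents with the same success rate can therefore differ in SPL, which is the comparison the abstract draws.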

Huanjin Yao,Qixiang Yin,Jingyi Zhang,Min Yang,Yibo Wang,Wenhao Wu,Fei Su,Li Shen,Minghui Qiu,Dacheng Tao,Jiaxing Huang

Main category: cs.CV

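TL;DR: Proposes Share-GRPO, a reinforcement learning method that improves the reasoning ability of multimodal large language models and addresses the sparse reward and advantage vanishing problems.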

Details

Motivation: To incentivize the reasoning ability of multimodal large language models (MLLMs) and to address sparse rewards and advantage vanishing in reinforcement learning. Method: Proposes Share-GRPO, which expands the question space through data transformation techniques, explores and shares diverse reasoning trajectories, and shares reward information during advantage computation. Result: Performs strongly on six widely used reasoning benchmarks. Conclusion: Share-GRPO effectively improves the reasoning ability of MLLMs and addresses key problems in reinforcement learning. Abstract: In this work, we aim to incentivize the reasoning ability of Multimodal Large Language Models (MLLMs) via reinforcement learning (RL) and develop an effective approach that mitigates the sparse reward and advantage vanishing issues during RL. To this end, we propose Share-GRPO, a novel RL approach that tackles these issues by exploring and sharing diverse reasoning trajectories over an expanded question space. Specifically, Share-GRPO first expands the question space for a given question via data transformation techniques, and then encourages the MLLM to effectively explore diverse reasoning trajectories over the expanded question space and shares the discovered reasoning trajectories across the expanded questions during RL. In addition, Share-GRPO also shares reward information during advantage computation, which estimates solution advantages hierarchically across and within question variants, allowing more accurate estimation of relative advantages and improving the stability of policy training. Extensive evaluations over six widely-used reasoning benchmarks showcase the superior performance of our method. Code will be available at https://github.com/HJYao00/R1-ShareVL.
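
The "advantage vanishing" issue named in this entry can be seen in the usual GRPO-style group-relative advantage, where each reward is normalized by the mean and standard deviation of its sampling group. The sketch below shows that baseline and a deliberately simplified pooling across question variants. The pooling is an intuition aid only, not the hierarchical estimator described in the abstract; `group_advantages`, the use of the population standard deviation, and `eps` are assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantage: (r - mean) / (std + eps) within one sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# All sampled answers to one question fail: every advantage is zero, so no gradient signal.
print(group_advantages([0, 0, 0, 0]))

# Simplified illustration: pooling rewards from two variants of the same question
# restores a non-zero signal when at least one variant yields a success.
variant_a = [0, 0]
variant_b = [1, 0]
print(group_advantages(variant_a + variant_b))
```

When every reward in a group is identical, the numerator is zero for all samples, which is the vanishing case; pooling changes the group statistics so that the lone success stands out.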

[103] Zero-Shot Anomaly Detection in Battery Thermal Images Using Visual Question Answering with Prior Knowledge

Marcella Astrid,Abdelrahman Shabayek,Djamila Aouada

Main category: cs.CV

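TL;DR: Proposes a zero-shot anomaly detection method based on Visual Question Answering (VQA) models for detecting anomalies in battery thermal images without battery-specific training data.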

Details

Motivation: Battery safety and efficiency are critical, but traditional deep learning methods require large amounts of labeled data, and anomalous data are hard to obtain. Method: Use pretrained VQA models (ChatGPT-4o, LLaVa-13b, and BLIP-2) with text prompts that incorporate knowledge of normal battery behavior to perform zero-shot anomaly detection. Result: Despite no fine-tuning on battery data, the method is competitive with state-of-the-art models trained on battery data. Conclusion: VQA-based zero-shot learning shows potential for battery anomaly detection, and its effectiveness can be improved further in future work. Abstract: Batteries are essential for various applications, including electric vehicles and renewable energy storage, making safety and efficiency critical concerns. Anomaly detection in battery thermal images helps identify failures early, but traditional deep learning methods require extensive labeled data, which is difficult to obtain, especially for anomalies due to safety risks and high data collection costs. To overcome this, we explore zero-shot anomaly detection using Visual Question Answering (VQA) models, which leverage pretrained knowledge and text-based prompts to generalize across vision tasks. By incorporating prior knowledge of normal battery thermal behavior, we design prompts to detect anomalies without battery-specific training data. We evaluate three VQA models (ChatGPT-4o, LLaVa-13b, and BLIP-2) analyzing their robustness to prompt variations, repeated trials, and qualitative outputs. Despite the lack of fine-tuning on battery data, our approach demonstrates competitive performance compared to state-of-the-art models that are trained with the battery data. Our findings highlight the potential of VQA-based zero-shot learning for battery anomaly detection and suggest future directions for improving its effectiveness.

[104] Semantic Compression of 3D Objects for Open and Collaborative Virtual Worlds

Jordan Dotzel,Tony Montes,Mohamed S. Abdelfattah,Zhiru Zhang

Main category: cs.CV

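TL;DR: Traditional 3D object compression handles only structural information such as vertices, polygons, and textures, whereas semantic compression operates directly on core concepts to reach higher compression rates and uses natural language as its storage format.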

Details

Motivation: Traditional methods perform poorly at high compression rates, whereas semantic compression ignores structural information and uses generative models to fill in what is missing, which may push past current compression limits. Method: Build a 3D semantic compression pipeline from public generative models and explore the trade-off between quality and compression rate. Result: Achieves compression rates as high as 105x on the Objaverse dataset and outperforms traditional methods around 100x compression. Conclusion: Semantic compression performs well at high compression rates and suits large-scale collaborative projects, especially in augmented and virtual reality applications. Abstract: Traditional methods for 3D object compression operate only on structural information within the object vertices, polygons, and textures. These methods are effective at compression rates up to 10x for standard object sizes but quickly deteriorate at higher compression rates with texture artifacts, low-polygon counts, and mesh gaps. In contrast, semantic compression ignores structural information and operates directly on the core concepts to push to extreme levels of compression. In addition, it uses natural language as its storage format, which makes it natively human-readable and a natural fit for emerging applications built around large-scale, collaborative projects within augmented and virtual reality. It deprioritizes structural information like location, size, and orientation and predicts the missing information with state-of-the-art deep generative models. In this work, we construct a pipeline for 3D semantic compression from public generative models and explore the quality-compression frontier for 3D object compression. We apply this pipeline to achieve rates as high as 105x for 3D objects taken from the Objaverse dataset and show that semantic compression can outperform traditional methods in the important quality-preserving region around 100x compression.

[105] On the use of Graphs for Satellite Image Time Series

Corentin Dufourg,Charlotte Pelletier,Stéphane May,Sébastien Lefèvre

Main category: cs.CV

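TL;DR: Examines graph-based methods for spatio-temporal remote-sensing analysis, presents a versatile graph-based pipeline, and uses case studies to show its potential for land cover mapping and water resource forecasting.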

Details

Motivation: Processes on the Earth's surface are complex and dynamic. Satellite image time series (SITS) enable global monitoring, but the data are large and complex. Graph-based methods move beyond the regular Euclidean structure and model spatio-temporal interactions, which suits pattern detection and classification tasks. Method: Present a versatile graph-based pipeline that constructs spatio-temporal graphs from SITS and applies them to downstream tasks, together with a comprehensive review and two case studies. Result: The case studies show the potential of graph-based methods for land cover mapping and water resource forecasting. Conclusion: The paper summarizes the advantages of graph-based methods and discusses current limitations and future directions. Abstract: The Earth's surface is subject to complex and dynamic processes, ranging from large-scale phenomena such as tectonic plate movements to localized changes associated with ecosystems, agriculture, or human activity. Satellite images enable global monitoring of these processes with extensive spatial and temporal coverage, offering advantages over in-situ methods. In particular, resulting satellite image time series (SITS) datasets contain valuable information. To handle their large volume and complexity, some recent works focus on the use of graph-based techniques that abandon the regular Euclidean structure of satellite data to work at an object level. Besides, graphs enable modelling spatial and temporal interactions between identified objects, which are crucial for pattern detection, classification and regression tasks. This paper is an effort to examine the integration of graph-based methods in spatio-temporal remote-sensing analysis. In particular, it aims to present a versatile graph-based pipeline to tackle SITS analysis. It focuses on the construction of spatio-temporal graphs from SITS and their application to downstream tasks. The paper includes a comprehensive review and two case studies, which highlight the potential of graph-based approaches for land cover mapping and water resource forecasting. It also discusses numerous perspectives to resolve current limitations and encourage future developments.

[106] One-Step Diffusion-Based Image Compression with Semantic Distillation

Naifu Xue,Zhaoyang Jia,Jiahao Li,Bin Li,Yuan Zhang,Yan Lu

Main category: cs.CV

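TL;DR: Proposes OneDC, a one-step diffusion-based generative image codec that combines a latent compression module with a one-step diffusion generator and substantially reduces latency.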

Details

Motivation: Existing diffusion-based generative codecs suffer high latency from iterative sampling, and the authors argue that multi-step sampling is not necessary for generative compression. Method: Propose OneDC, which uses the hyperprior as a semantic signal, introduces a semantic distillation mechanism to strengthen its semantic capability, and adopts hybrid pixel- and latent-domain optimization. Result: OneDC achieves SOTA perceptual quality with one-step generation, with over 40% bitrate reduction and 20x faster decoding. Conclusion: OneDC shows that one-step diffusion generation is effective for image compression, combining high quality with low latency. Abstract: While recent diffusion-based generative image codecs have shown impressive performance, their iterative sampling process introduces unpleasing latency. In this work, we revisit the design of a diffusion-based codec and argue that multi-step sampling is not necessary for generative compression. Based on this insight, we propose OneDC, a One-step Diffusion-based generative image Codec that integrates a latent compression module with a one-step diffusion generator. Recognizing the critical role of semantic guidance in one-step diffusion, we propose using the hyperprior as a semantic signal, overcoming the limitations of text prompts in representing complex visual content. To further enhance the semantic capability of the hyperprior, we introduce a semantic distillation mechanism that transfers knowledge from a pretrained generative tokenizer to the hyperprior codec. Additionally, we adopt a hybrid pixel- and latent-domain optimization to jointly enhance both reconstruction fidelity and perceptual realism. Extensive experiments demonstrate that OneDC achieves SOTA perceptual quality even with one-step generation, offering over 40% bitrate reduction and 20x faster decoding compared to prior multi-step diffusion-based codecs. Code will be released later.

[107] KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models

Yongliang Wu,Zonghui Li,Xinting Hu,Xinyu Ye,Xianfang Zeng,Gang Yu,Wenbo Zhu,Bernt Schiele,Ming-Hsuan Yang,Xu Yang

Main category: cs.CV

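TL;DR: KRIS-Bench is a knowledge-based reasoning benchmark for image editing that evaluates models across three knowledge types (Factual, Conceptual, Procedural) and 22 tasks, and finds significant gaps in the reasoning ability of existing models.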

Details

Motivation: Current multi-modal generative models perform well at instruction-based image editing, but their ability on knowledge-based reasoning tasks remains under-explored. Method: Propose the KRIS-Bench benchmark with 22 tasks and 1,267 annotated instances, and design a Knowledge Plausibility metric for fine-grained evaluation. Result: Experiments on 10 state-of-the-art models show that they perform poorly on knowledge-based reasoning tasks. Conclusion: Knowledge-centric benchmarks are essential for advancing intelligent image editing systems. Abstract: Recent advances in multi-modal generative models have enabled significant progress in instruction-based image editing. However, while these models produce visually plausible outputs, their capacity for knowledge-based reasoning editing tasks remains under-explored. In this paper, we introduce KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark), a diagnostic benchmark designed to assess models through a cognitively informed lens. Drawing from educational theory, KRIS-Bench categorizes editing tasks across three foundational knowledge types: Factual, Conceptual, and Procedural. Based on this taxonomy, we design 22 representative tasks spanning 7 reasoning dimensions and release 1,267 high-quality annotated editing instances. To support fine-grained evaluation, we propose a comprehensive protocol that incorporates a novel Knowledge Plausibility metric, enhanced by knowledge hints and calibrated through human studies. Empirical results on 10 state-of-the-art models reveal significant gaps in reasoning performance, highlighting the need for knowledge-centric benchmarks to advance the development of intelligent image editing systems.

[108] SEDD-PCC: A Single Encoder-Dual Decoder Framework For End-To-End Learned Point Cloud Compression

Kai Hsiang Hsieh,Monyneath Yim,Jui Chiu Chiang

Main category: cs.CV

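TL;DR: SEDD-PCC is an end-to-end learned framework that jointly compresses point cloud geometry and attributes, improving efficiency through shared feature extraction and knowledge distillation.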

Details

Motivation: Existing methods code geometry and attributes separately, which increases computational complexity and fails to fully exploit shared features. Method: Use a single encoder to extract shared features into a unified latent space, then reconstruct geometry and attributes sequentially with dual decoders, and add knowledge distillation. Result: In comparative evaluations, SEDD-PCC shows competitive performance against both rule-based and learning-based methods. Conclusion: SEDD-PCC is an efficient and practical solution for point cloud compression and shows the promise of AI-driven approaches. Abstract: To encode point clouds containing both geometry and attributes, most learning-based compression schemes treat geometry and attribute coding separately, employing distinct encoders and decoders. This not only increases computational complexity but also fails to fully exploit shared features between geometry and attributes. To address this limitation, we propose SEDD-PCC, an end-to-end learning-based framework for lossy point cloud compression that jointly compresses geometry and attributes. SEDD-PCC employs a single encoder to extract shared geometric and attribute features into a unified latent space, followed by dual specialized decoders that sequentially reconstruct geometry and attributes. Additionally, we incorporate knowledge distillation to enhance feature representation learning from a teacher model, further improving coding efficiency. With its simple yet effective design, SEDD-PCC provides an efficient and practical solution for point cloud compression. Comparative evaluations against both rule-based and learning-based methods demonstrate its competitive performance, highlighting SEDD-PCC as a promising AI-driven compression approach.

[109] Robust Vision-Based Runway Detection through Conformal Prediction and Conformal mAP

Alya Zouzou,Léo Andéol,Mélanie Ducoffe,Ryma Boumazouza

Main category: cs.CV

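TL;DR: Explores conformal prediction to provide statistical uncertainty guarantees for runway detection in vision-based landing systems (VLS) and introduces a new metric, C-mAP.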

Details

Motivation: Improve the reliability of runway detection and support the certification of machine learning systems in the aerospace domain. Method: Use fine-tuned YOLOv5 and YOLOv6 models and apply conformal prediction to quantify localization reliability. Result: Conformal prediction quantifies uncertainty in a statistically sound way and improves the reliability of runway detection. Conclusion: Conformal prediction increases the safety of runway detection and lays groundwork for certifying machine learning systems in aerospace. Abstract: We explore the use of conformal prediction to provide statistical uncertainty guarantees for runway detection in vision-based landing systems (VLS). Using fine-tuned YOLOv5 and YOLOv6 models on aerial imagery, we apply conformal prediction to quantify localization reliability under user-defined risk levels. We also introduce Conformal mean Average Precision (C-mAP), a novel metric aligning object detection performance with conformal guarantees. Our results show that conformal prediction can improve the reliability of runway detection by quantifying uncertainty in a statistically sound way, increasing safety on-board and paving the way for certification of ML systems in the aerospace domain.
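
As an illustration of how conformal prediction can turn a user-defined risk level into a box margin, here is a minimal split-conformal sketch. It is a generic textbook construction rather than the procedure used in the paper: the box format `(x1, y1, x2, y2)`, the nonconformity score, and all function names are assumptions made for this example.

```python
import math

def box_score(pred, true):
    """Smallest margin by which pred must grow on every side to contain true."""
    return max(pred[0] - true[0], pred[1] - true[1],
               true[2] - pred[2], true[3] - pred[3])

def conformal_quantile(scores, alpha):
    """Split-conformal quantile: the ceil((n + 1)(1 - alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration samples for this risk level: no finite margin.
        return float("inf")
    return sorted(scores)[k - 1]

def expand(box, margin):
    """Grow a predicted box by the calibrated margin on every side."""
    return (box[0] - margin, box[1] - margin, box[2] + margin, box[3] + margin)

# Calibration: scores from held-out (prediction, ground truth) pairs.
calibration_scores = [3.0, 1.0, 2.0]
margin = conformal_quantile(calibration_scores, alpha=0.5)

# Inference: a new predicted runway box is expanded by the margin.
safe_box = expand((10.0, 10.0, 20.0, 20.0), margin)
```

Under the usual exchangeability assumption, a box expanded this way contains the true box with probability at least 1 - alpha. A lower alpha demands more calibration data before a finite margin exists.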

[110] Representation Discrepancy Bridging Method for Remote Sensing Image-Text Retrieval

Hailong Ning,Siying Wang,Tao Lei,Xiaopeng Cao,Huanmin Dou,Bin Zhao,Asoke K. Nandi,Petia Radeva

Main category: cs.CV

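TL;DR: Proposes a Representation Discrepancy Bridging (RDB) method for Remote Sensing Image-Text Retrieval (RSITR) that addresses modality imbalance with a Cross-Modal Asymmetric Adapter (CMAA) and a Dual-Task Consistency Loss (DTCL), yielding clear performance gains.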

Details

Motivation: Existing Parameter-Efficient Fine-Tuning (PEFT) methods typically use symmetric adapter structures to explore cross-modal correlations, but the strong discriminative nature of the text modality may suppress image representation learning and cause imbalanced cross-modal optimization. Method: Propose the RDB method, consisting of a Cross-Modal Asymmetric Adapter (CMAA) and a Dual-Task Consistency Loss (DTCL). CMAA enables modality-specific optimization through a Visual Enhancement Adapter (VEA) and a Text Semantic Adapter (TSA); DTCL improves the robustness of cross-modal alignment through an adaptive weighted combination of constraints. Result: On the RSICD and RSITMD datasets, RDB improves mR by 6%-11% over existing PEFT methods and by 1.15%-2% over the fully fine-tuned GeoRSCLIP model. Conclusion: RDB effectively addresses imbalanced cross-modal optimization and improves performance on remote sensing image-text retrieval. Abstract: Remote Sensing Image-Text Retrieval (RSITR) plays a critical role in geographic information interpretation, disaster monitoring, and urban planning by establishing semantic associations between images and textual descriptions. Existing Parameter-Efficient Fine-Tuning (PEFT) methods for Vision-and-Language Pre-training (VLP) models typically adopt symmetric adapter structures for exploring cross-modal correlations. However, the strong discriminative nature of text modality may dominate the optimization process and inhibit image representation learning. The nonnegligible imbalanced cross-modal optimization remains a bottleneck to enhancing the model performance. To address this issue, this study proposes a Representation Discrepancy Bridging (RDB) method for the RSITR task. On the one hand, a Cross-Modal Asymmetric Adapter (CMAA) is designed to enable modality-specific optimization and improve feature alignment. The CMAA comprises a Visual Enhancement Adapter (VEA) and a Text Semantic Adapter (TSA). VEA mines fine-grained image features by Differential Attention (DA) mechanism, while TSA identifies key textual semantics through Hierarchical Attention (HA) mechanism. On the other hand, this study extends the traditional single-task retrieval framework to a dual-task optimization framework and develops a Dual-Task Consistency Loss (DTCL). The DTCL improves cross-modal alignment robustness through an adaptive weighted combination of cross-modal, classification, and exponential moving average consistency constraints. Experiments on RSICD and RSITMD datasets show that the proposed RDB method achieves a 6%-11% improvement in mR metrics compared to state-of-the-art PEFT methods and a 1.15%-2% improvement over the fully fine-tuned GeoRSCLIP model.

[111] Mesh-RFT: Enhancing Mesh Generation via Fine-grained Reinforcement Fine-Tuning

Jian Liu,Jing Xu,Song Guo,Jing Li,Jingfeng Guo,Jiaao Yu,Haohan Weng,Biwen Lei,Xianghui Yang,Zhuo Chen,Fangqi Zhu,Tao Han,Chunchao Guo

Main category: cs.CV

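TL;DR: Mesh-RFT is a fine-grained reinforcement fine-tuning framework that uses Masked Direct Preference Optimization (M-DPO) for localized refinement, substantially improving the quality of generated 3D meshes.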

Details

Motivation: Existing pretrained models for 3D mesh generation suffer from data biases and low-quality output, while global reinforcement learning methods struggle to capture local structural details. Method: Use M-DPO for localized refinement, introduce Boundary Edge Ratio (BER) and Topology Score (TS) as evaluation metrics, and optimize mesh quality with a fine-grained RL strategy. Result: Experiments show that Mesh-RFT reduces Hausdorff Distance by 24.6% and improves Topology Score by 3.8% over pre-trained models, and outperforms global DPO methods. Conclusion: Mesh-RFT performs strongly on geometric integrity and topological regularity, setting a new standard for production-ready mesh generation. Abstract: Existing pretrained models for 3D mesh generation often suffer from data biases and produce low-quality results, while global reinforcement learning (RL) methods rely on object-level rewards that struggle to capture local structure details. To address these challenges, we present Mesh-RFT, a novel fine-grained reinforcement fine-tuning framework that employs Masked Direct Preference Optimization (M-DPO) to enable localized refinement via quality-aware face masking. To facilitate efficient quality evaluation, we introduce an objective topology-aware scoring system to evaluate geometric integrity and topological regularity at both object and face levels through two metrics: Boundary Edge Ratio (BER) and Topology Score (TS). By integrating these metrics into a fine-grained RL strategy, Mesh-RFT becomes the first method to optimize mesh quality at the granularity of individual faces, resolving localized errors while preserving global coherence. Experiment results show that our M-DPO approach reduces Hausdorff Distance (HD) by 24.6% and improves Topology Score (TS) by 3.8% over pre-trained models, while outperforming global DPO methods with a 17.4% HD reduction and 4.9% TS gain. These results demonstrate Mesh-RFT's ability to improve geometric integrity and topological regularity, achieving new state-of-the-art performance in production-ready mesh generation. Project Page: https://hitcslj.github.io/mesh-rft/
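
The abstract names Boundary Edge Ratio (BER) without defining it. One plausible reading, assumed here and not confirmed by the source, is the fraction of mesh edges that belong to exactly one face. The sketch below computes that quantity for faces given as tuples of vertex indices; the function name and input format are hypothetical.

```python
from collections import Counter

def boundary_edge_ratio(faces):
    """Fraction of unique edges used by exactly one face (assumed BER definition)."""
    edge_counts = Counter()
    for face in faces:
        n = len(face)
        for i in range(n):
            a, b = face[i], face[(i + 1) % n]
            # Store edges undirected so (a, b) and (b, a) count as the same edge.
            edge_counts[(min(a, b), max(a, b))] += 1
    if not edge_counts:
        return 0.0
    boundary = sum(1 for count in edge_counts.values() if count == 1)
    return boundary / len(edge_counts)

# A closed tetrahedron has no boundary edges; a lone triangle is all boundary.
tetrahedron = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (0, 2, 3)]
ber_closed = boundary_edge_ratio(tetrahedron)
ber_open = boundary_edge_ratio([(0, 1, 2)])
```

Under this reading, a watertight mesh scores 0 and holes or dangling faces push the ratio toward 1, which matches the intuition of a metric for geometric integrity.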

[112] Self-Rewarding Large Vision-Language Models for Optimizing Prompts in Text-to-Image Generation

Hongji Yang,Yucheng Zhou,Wencheng Han,Jianbing Shen

Main category: cs.CV

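TL;DR: Proposes a prompt optimization framework based on large vision-language models (LVLMs) that uses AI feedback instead of manually annotated data to optimize prompts for text-to-image models.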

Details

Motivation: Existing methods rely on large amounts of manually annotated data and trained models, which brings dependence on data scale and model bias; a more efficient, less biased prompt optimization method is needed. Method: Use an LVLM as both the prompt rewriter and the reward model, and iterate with reinforcement learning to achieve self-improvement without human feedback. Result: Outperforms other strong competitors on two popular datasets. Conclusion: The framework reduces the dependence on manual annotation while improving prompt optimization performance. Abstract: Text-to-image models are powerful for producing high-quality images based on given text prompts, but crafting these prompts often requires specialized vocabulary. To address this, existing methods train rewriting models with supervision from large amounts of manually annotated data and trained aesthetic assessment models. To alleviate the dependence on data scale for model training and the biases introduced by trained models, we propose a novel prompt optimization framework, designed to rephrase a simple user prompt into a sophisticated prompt to a text-to-image model. Specifically, we employ the large vision language models (LVLMs) as the solver to rewrite the user prompt, and concurrently, employ LVLMs as a reward model to score the aesthetics and alignment of the images generated by the optimized prompt. Instead of laborious human feedback, we exploit the prior knowledge of the LVLM to provide rewards, i.e., AI feedback. Simultaneously, the solver and the reward model are unified into one model and iterated in reinforcement learning to achieve self-improvement by giving a solution and judging itself. Results on two popular datasets demonstrate that our method outperforms other strong competitors.

Meng-Hao Guo,Xuanyu Chu,Qianrui Yang,Zhe-Han Mo,Yiqing Shen,Pei-lin Li,Xinjie Lin,Jinnian Zhang,Xin-Sheng Chen,Yi Zhang,Kiyohiro Nakayama,Zhengyang Geng,Houwen Peng,Han Hu,Shi-Nin Hu

Main category: cs.CV

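TL;DR: Presents RBench-V, a benchmark for evaluating the visual reasoning of multimodal models, and finds that existing models perform poorly at reasoning through multi-modal outputs.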

Details

Motivation: Existing benchmarks focus mainly on multi-modal inputs and text-only reasoning and neglect reasoning through multi-modal outputs, so a new evaluation tool is needed. Method: Build the RBench-V benchmark from 803 hand-picked questions covering math, physics, counting, and games, which require models to manipulate images to support reasoning. Result: Even the best-performing model, o3, reaches only 25.8% accuracy, far below the human score of 82.3%. Conclusion: Current models still fall well short in multi-modal reasoning, and RBench-V provides an important reference for future research. Abstract: The rapid advancement of native multi-modal models and omni-models, exemplified by GPT-4o, Gemini, and o3, with their capability to process and generate content across modalities such as text and images, marks a significant milestone in the evolution of intelligence. Systematic evaluation of their multi-modal output capabilities in visual thinking processes (also known as multi-modal chain of thought, M-CoT) becomes critically important. However, existing benchmarks for evaluating multi-modal models primarily focus on assessing multi-modal inputs and text-only reasoning while neglecting the importance of reasoning through multi-modal outputs. In this paper, we present a benchmark, dubbed RBench-V, designed to assess models' vision-indispensable reasoning abilities. To construct RBench-V, we carefully hand-pick 803 questions covering math, physics, counting, and games. Unlike previous benchmarks that typically specify certain input modalities, RBench-V presents problems centered on multi-modal outputs, which require image manipulation such as generating novel images and constructing auxiliary lines to support the reasoning process. We evaluate numerous open- and closed-source models on RBench-V, including o3, Gemini 2.5 Pro, Qwen2.5-VL, etc. Even the best-performing model, o3, achieves only 25.8% accuracy on RBench-V, far below the human score of 82.3%, highlighting that current models struggle to leverage multi-modal reasoning. Data and code are available at https://evalmodels.github.io/rbenchv

[114] Mitigating Overfitting in Medical Imaging: Self-Supervised Pretraining vs. ImageNet Transfer Learning for Dermatological Diagnosis

Iván Matas,Carmen Serrano,Miguel Nogales,David Moreno,Lara Ferrándiz,Teresa Ojeda,Begoña Acha

Main category: cs.CV

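TL;DR: Proposes an unsupervised learning framework for extracting dermatological features that generalizes better than ImageNet-based pretrained models.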

Details

Motivation: Models pretrained on natural images may fail to capture domain-specific characteristics in medical imaging. Method: Train a Variational Autoencoder (VAE) from scratch on a dermatological dataset to learn a structured, clinically relevant latent space. Result: The self-supervised model lowers validation loss by 33.33% and raises accuracy by 44.44%, and generalizes better than the ImageNet-pretrained model. Conclusion: Domain-specific feature extraction is essential in medical imaging, and self-supervised learning generalizes better. Abstract: Deep learning has transformed computer vision but relies heavily on large labeled datasets and computational resources. Transfer learning, particularly fine-tuning pretrained models, offers a practical alternative; however, models pretrained on natural image datasets such as ImageNet may fail to capture domain-specific characteristics in medical imaging. This study introduces an unsupervised learning framework that extracts high-value dermatological features instead of relying solely on ImageNet-based pretraining. We employ a Variational Autoencoder (VAE) trained from scratch on a proprietary dermatological dataset, allowing the model to learn a structured and clinically relevant latent space. This self-supervised feature extractor is then compared to an ImageNet-pretrained backbone under identical classification conditions, highlighting the trade-offs between general-purpose and domain-specific pretraining. Our results reveal distinct learning patterns. The self-supervised model achieves a final validation loss of 0.110 (-33.33%), while the ImageNet-pretrained model stagnates at 0.100 (-16.67%), indicating overfitting. Accuracy trends confirm this: the self-supervised model improves from 45% to 65% (+44.44%) with a near-zero overfitting gap, whereas the ImageNet-pretrained model reaches 87% (+50.00%) but plateaus at 75% (+19.05%), with its overfitting gap increasing to +0.060. These findings suggest that while ImageNet pretraining accelerates convergence, it also amplifies overfitting on non-clinically relevant features. In contrast, self-supervised learning achieves steady improvements, stronger generalization, and superior adaptability, underscoring the importance of domain-specific feature extraction in medical imaging.
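
The percentages in parentheses read as relative changes from a starting value, not differences in percentage points. The short sketch below checks the one case where the abstract gives both endpoints (accuracy rising from 45% to 65%). The helper name is hypothetical, and the starting values behind the other percentages are not stated in the source, so they are not reproduced here.

```python
def relative_change(start, end):
    """Relative change from start to end, expressed in percent."""
    return (end - start) / start * 100.0

# Accuracy of the self-supervised model: 45% -> 65%.
accuracy_gain = round(relative_change(45.0, 65.0), 2)
```

A 20-point absolute gain on a 45% base is a 44.44% relative gain, which matches the figure quoted in the abstract.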

[115] Single Domain Generalization for Few-Shot Counting via Universal Representation Matching

Xianing Chen,Si Huo,Borui Jiang,Hailin Hu,Xinghao Chen

Main category: cs.CV

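TL;DR: Proposes URM, a few-shot counting model that addresses domain shift by incorporating universal vision-language representations, substantially improving generalization.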

Details

Motivation: Existing few-shot counting methods generalize poorly under domain shift, and this paper aims to address that. Method: Propose URM, which uses universal representations from a large-scale pretrained vision-language model when constructing the correlation map to improve generalization. Result: URM achieves state-of-the-art performance in both the in-domain and the new domain generalization settings. Conclusion: Incorporating universal representations substantially improves the generalization of few-shot counting models. Abstract: Few-shot counting estimates the number of target objects in an image using only a few annotated exemplars. However, domain shift severely prevents existing methods from generalizing to unseen scenarios. This falls into the realm of single domain generalization that remains unexplored in few-shot counting. To solve this problem, we begin by analyzing the main limitations of current methods, which typically follow a standard pipeline that extracts the object prototypes from exemplars and then matches them with image features to construct the correlation map. We argue that existing methods overlook the significance of learning highly generalized prototypes. Building on this insight, we propose the first single domain generalization few-shot counting model, Universal Representation Matching, termed URM. Our primary contribution is the discovery that incorporating universal vision-language representations distilled from a large-scale pretrained vision-language model into the correlation construction process substantially improves robustness to domain shifts without compromising in-domain performance. As a result, URM achieves state-of-the-art performance on both in-domain and the newly introduced domain generalization setting.
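
The "standard pipeline" described in the abstract matches an exemplar prototype against image features to build a correlation map. A minimal sketch of that matching step follows, assuming cosine similarity as the matching function and plain Python lists as features. Both are illustrative choices, not details taken from the paper, and URM's distilled vision-language representations are not modeled here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def correlation_map(features, prototype):
    """Match one prototype against an H x W grid of feature vectors."""
    return [[cosine(cell, prototype) for cell in row] for row in features]

# A 1 x 2 feature grid: the first cell matches the prototype, the second does not.
grid = [[[1.0, 0.0], [0.0, 1.0]]]
corr = correlation_map(grid, prototype=[1.0, 0.0])
```

High values in the map mark locations that resemble the exemplar; a counting head would then regress a density map or count from it.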

[116] Four Eyes Are Better Than Two: Harnessing the Collaborative Potential of Large Models via Differentiated Thinking and Complementary Ensembles

Jun Xie,Xiongjun Guan,Yingjian Zhu,Zhaoran Zhao,Xinming Wang,Feng Chen,Zhepeng Wang

Main category: cs.CV

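TL;DR: Presents the runner-up solution for the Ego4D EgoSchema Challenge at CVPR 2025, which uses few-shot learning and model ensemble strategies with multimodal large models to improve video understanding performance.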

Details

Motivation: Inspired by the success of large models, explore how to use multimodal large models effectively for video understanding tasks. Method: Apply few-shot learning and model ensemble strategies, and systematically evaluate diverse prompt styles and processing paradigms to guide the attention of large models. Result: Experiments show that a single multimodal model already outperforms the previous SOTA method, and an added stage for cooperation and ensembling of results brings further clear gains. Conclusion: The work offers a useful reference for the practical application of large models and may inspire future research. Abstract: In this paper, we present the runner-up solution for the Ego4D EgoSchema Challenge at CVPR 2025 (Confirmed on May 20, 2025). Inspired by the success of large models, we evaluate and leverage leading accessible multimodal large models and adapt them to video understanding tasks via few-shot learning and model ensemble strategies. Specifically, diversified prompt styles and process paradigms are systematically explored and evaluated to effectively guide the attention of large models, fully unleashing their powerful generalization and adaptability abilities. Experimental results demonstrate that, with our carefully designed approach, directly utilizing an individual multimodal model already outperforms the previous state-of-the-art (SOTA) method which includes several additional processes. Besides, an additional stage is further introduced that facilitates the cooperation and ensemble of periodic results, which achieves impressive performance improvements. We hope this work serves as a valuable reference for the practical application of large models and inspires future research in the field.

[117] REPA Works Until It Doesn't: Early-Stopped, Holistic Alignment Supercharges Diffusion Training

Ziqiao Wang,Wangbo Zhao,Yuhao Zhou,Zekai Li,Zhiyuan Liang,Mingjia Shi,Xuanlei Zhao,Pengfei Zhou,Kaipeng Zhang,Zhangyang Wang,Kai Wang,Yang You

Main category: cs.CV

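TL;DR: HASTE is a two-phase training method that speeds up DiT training by aligning with a teacher model's features and attention maps early on, then terminating the alignment to free the model's generative capacity, substantially improving training efficiency.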

Details

Motivation: Address the slow training of DiTs while avoiding the late-stage performance degradation of existing methods such as REPA. Method: HASTE has two phases. Phase I distills the teacher's attention maps and feature projections into the DiT through a holistic alignment loss; Phase II deactivates the alignment loss once a trigger condition is met and focuses on denoising. Result: On ImageNet 256x256, HASTE reaches the baseline FID in only 50 epochs and matches REPA's best FID in 500 epochs, a 28x reduction in optimization steps. Conclusion: HASTE is a simple yet efficient training recipe for diffusion models that applies across tasks. Abstract: Diffusion Transformers (DiTs) deliver state-of-the-art image quality, yet their training remains notoriously slow. A recent remedy -- representation alignment (REPA) that matches DiT hidden features to those of a non-generative teacher (e.g. DINO) -- dramatically accelerates the early epochs but plateaus or even degrades performance later. We trace this failure to a capacity mismatch: once the generative student begins modelling the joint data distribution, the teacher's lower-dimensional embeddings and attention patterns become a straitjacket rather than a guide. We then introduce HASTE (Holistic Alignment with Stage-wise Termination for Efficient training), a two-phase schedule that keeps the help and drops the hindrance. Phase I applies a holistic alignment loss that simultaneously distills attention maps (relational priors) and feature projections (semantic anchors) from the teacher into mid-level layers of the DiT, yielding rapid convergence. Phase II then performs one-shot termination that deactivates the alignment loss, once a simple trigger such as a fixed iteration is hit, freeing the DiT to focus on denoising and exploit its generative capacity. HASTE speeds up training of diverse DiTs without architecture changes. On ImageNet 256x256, it reaches the vanilla SiT-XL/2 baseline FID in 50 epochs and matches REPA's best FID in 500 epochs, amounting to a 28x reduction in optimization steps. HASTE also improves text-to-image DiTs on MS-COCO, proving to be a simple yet principled recipe for efficient diffusion training across various tasks. Our code is available at https://github.com/NUS-HPC-AI-Lab/HASTE .
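
The two-phase schedule amounts to a loss that includes alignment terms until a trigger fires and drops them afterward. The sketch below uses the fixed-iteration trigger mentioned in the abstract. The function name, the single shared `weight`, and the value of `stop_step` are assumptions for illustration and do not come from the released code.

```python
def haste_style_loss(denoise_loss, attn_align_loss, feat_align_loss,
                     step, stop_step, weight=0.5):
    """Total loss under a two-phase schedule with one-shot termination.

    Phase I (step < stop_step): denoising plus weighted alignment terms.
    Phase II (step >= stop_step): denoising only.
    """
    if step < stop_step:
        return denoise_loss + weight * (attn_align_loss + feat_align_loss)
    return denoise_loss

# With hypothetical loss values, alignment contributes early and vanishes later.
early = haste_style_loss(1.0, 0.5, 0.5, step=0, stop_step=10)
late = haste_style_loss(1.0, 0.5, 0.5, step=10, stop_step=10)
```

The switch is abrupt by design: the abstract describes a one-shot termination, not a gradual decay of the alignment weight.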

[118] REOBench: Benchmarking Robustness of Earth Observation Foundation Models

Xiang Li,Yong Tao,Siyuan Zhang,Siwei Liu,Zhitong Xiong,Chunbo Luo,Lu Liu,Mykola Pechenizkiy,Xiao Xiang Zhu,Tianjin Huang

Main category: cs.CV

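TL;DR: REOBench is the first comprehensive benchmark for evaluating the robustness of Earth observation foundation models across six tasks and twelve types of image corruption; it reveals performance degradation under input corruptions and points to directions for improvement.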

Details

Motivation: The robustness of Earth observation foundation models under real-world perturbations is underexplored, and REOBench aims to fill this gap. Method: Systematically evaluate a range of models trained with masked image modeling, contrastive learning, and vision-language pre-training on high-resolution optical remote sensing images. Result: Existing models degrade significantly under input corruptions, and vision-language models are more robust in multimodal tasks. Conclusion: REOBench exposes the vulnerability of current models and offers actionable insights for developing more robust ones. Abstract: Earth observation foundation models have shown strong generalization across multiple Earth observation tasks, but their robustness under real-world perturbations remains underexplored. To bridge this gap, we introduce REOBench, the first comprehensive benchmark for evaluating the robustness of Earth observation foundation models across six tasks and twelve types of image corruptions, including both appearance-based and geometric perturbations. To ensure realistic and fine-grained evaluation, our benchmark focuses on high-resolution optical remote sensing images, which are widely used in critical applications such as urban planning and disaster response. We conduct a systematic evaluation of a broad range of models trained using masked image modeling, contrastive learning, and vision-language pre-training paradigms. Our results reveal that (1) existing Earth observation foundation models experience significant performance degradation when exposed to input corruptions; (2) the severity of degradation varies across tasks, model architectures, backbone sizes, and types of corruption, with performance drop varying from less than 1% to over 20%; and (3) vision-language models show enhanced robustness, particularly in multimodal tasks. REOBench underscores the vulnerability of current Earth observation foundation models to real-world corruptions and provides actionable insights for developing more robust and reliable models.

[119] V2V: Scaling Event-Based Vision through Efficient Video-to-Voxel Simulation

Hanyue Lou,Jinxiu Liang,Minggui Teng,Yi Wang,Boxin Shi

Main category: cs.CV

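TL;DR: Proposes Video-to-Voxel (V2V), which converts conventional video frames directly into event-based voxel grid representations, sharply reducing storage requirements and improving model training efficiency.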

Details

Motivation: Existing event camera data generation pipelines have high storage requirements and real data are scarce, which limits the development and generalization of event vision models. Method: V2V bypasses event stream generation and converts video frames directly into event voxel grids, reducing storage requirements by 150 times and supporting on-the-fly parameter randomization. Result: Models are trained on 10,000 videos totaling 52 hours, a scale far beyond existing event datasets, with substantial improvements in video reconstruction and optical flow estimation. Conclusion: V2V addresses the storage and scale problems of event data and provides an efficient way to train event vision models. Abstract: Event-based cameras offer unique advantages such as high temporal resolution, high dynamic range, and low power consumption. However, the massive storage requirements and I/O burdens of existing synthetic data generation pipelines and the scarcity of real data prevent event-based training datasets from scaling up, limiting the development and generalization capabilities of event vision models. To address this challenge, we introduce Video-to-Voxel (V2V), an approach that directly converts conventional video frames into event-based voxel grid representations, bypassing the storage-intensive event stream generation entirely. V2V enables a 150 times reduction in storage requirements while supporting on-the-fly parameter randomization for enhanced model robustness. Leveraging this efficiency, we train several video reconstruction and optical flow estimation model architectures on 10,000 diverse videos totaling 52 hours, an order of magnitude larger than existing event datasets, yielding substantial improvements.

[120] SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving

Xuesong Chen,Linjiang Huang,Tao Ma,Rongyao Fang,Shaoshuai Shi,Hongsheng Li

Main category: cs.CV

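TL;DR: The SOLVE framework combines Vision-Language Models (VLMs) with end-to-end (E2E) models to improve autonomous driving planning, using a Trajectory Chain-of-Thought (T-CoT) paradigm and feature-level knowledge sharing to improve trajectory prediction accuracy.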

Details

Motivation: Existing approaches struggle with efficient integration and real-time decision-making because of high computational demands, and SOLVE aims to address this. Method: Share knowledge at the feature level through a shared visual encoder, refine trajectory predictions progressively with T-CoT, and coordinate the VLM and E2E models with a temporal decoupling strategy. Result: On the nuScenes dataset, SOLVE significantly improves trajectory prediction accuracy. Conclusion: SOLVE offers a new direction toward more robust and reliable autonomous driving systems. Abstract: The integration of Vision-Language Models (VLMs) into autonomous driving systems has shown promise in addressing key challenges such as learning complexity, interpretability, and common-sense reasoning. However, existing approaches often struggle with efficient integration and real-time decision-making due to computational demands. In this paper, we introduce SOLVE, an innovative framework that synergizes VLMs with end-to-end (E2E) models to enhance autonomous vehicle planning. Our approach emphasizes knowledge sharing at the feature level through a shared visual encoder, enabling comprehensive interaction between VLM and E2E components. We propose a Trajectory Chain-of-Thought (T-CoT) paradigm, which progressively refines trajectory predictions, reducing uncertainty and improving accuracy. By employing a temporal decoupling strategy, SOLVE achieves efficient cooperation by aligning high-quality VLM outputs with E2E real-time performance. Evaluated on the nuScenes dataset, our method demonstrates significant improvements in trajectory prediction accuracy, paving the way for more robust and reliable autonomous driving systems.

[121] Hypergraph Tversky-Aware Domain Incremental Learning for Brain Tumor Segmentation with Missing Modalities

Junze Wang,Lei Fan,Weipeng Jing,Donglin Di,Yang Song,Sidong Liu,Cong Cong

Main category: cs.CV

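TL;DR: Proposes ReHyDIL, which uses domain incremental learning and a hypergraph network to handle missing modalities in multimodal MRI segmentation, with a notable performance gain.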

Details

Motivation: In clinical practice MRI modalities may be missing, yet existing methods assume all modalities are available during training, which degrades performance. Retraining models is inefficient and prone to overfitting. Method: Combines Domain Incremental Learning (DIL) with a hypergraph network (CHSNet) and introduces a Tversky-Aware Contrastive loss. Result: On the BraTS2019 dataset, the Dice Similarity Coefficient improves by more than 2%. Conclusion: ReHyDIL effectively handles missing modalities and outperforms existing methods. Abstract: Existing methods for multimodal MRI segmentation with missing modalities typically assume that all MRI modalities are available during training. However, in clinical practice, some modalities may be missing due to the sequential nature of MRI acquisition, leading to performance degradation. Furthermore, retraining models to accommodate newly available modalities can be inefficient and may cause overfitting, potentially compromising previously learned knowledge. To address these challenges, we propose Replay-based Hypergraph Domain Incremental Learning (ReHyDIL) for brain tumor segmentation with missing modalities. ReHyDIL leverages Domain Incremental Learning (DIL) to enable the segmentation model to learn from newly acquired MRI modalities without forgetting previously learned information. To enhance segmentation performance across diverse patient scenarios, we introduce the Cross-Patient Hypergraph Segmentation Network (CHSNet), which utilizes hypergraphs to capture high-order associations between patients. Additionally, we incorporate Tversky-Aware Contrastive (TAC) loss to effectively mitigate information imbalance both across and within different modalities. Extensive experiments on the BraTS2019 dataset demonstrate that ReHyDIL outperforms state-of-the-art methods, achieving an improvement of over 2% in the Dice Similarity Coefficient across various tumor regions. Our code is available at ReHyDIL.
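
For reference, the Dice Similarity Coefficient reported for ReHyDIL and the Tversky index that the TAC loss is named after can both be computed from overlap counts. The sketch below is a minimal illustration in plain Python and is not the paper's TAC loss; the function names, the flat binary-mask representation, and the empty-mask convention are assumptions made for this example.

```python
def overlap_counts(pred, target):
    """Count true positives, false positives and false negatives
    for two equal-length binary masks given as flat sequences of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    return tp, fp, fn


def tversky_index(pred, target, alpha=0.5, beta=0.5):
    """Tversky index: TP / (TP + alpha * FP + beta * FN).

    Returns 1.0 when both masks are empty (a convention assumed here).
    """
    tp, fp, fn = overlap_counts(pred, target)
    denom = tp + alpha * fp + beta * fn
    return 1.0 if denom == 0 else tp / denom


def dice_coefficient(pred, target):
    """Dice = 2*TP / (2*TP + FP + FN), i.e. Tversky with alpha = beta = 0.5."""
    return tversky_index(pred, target, alpha=0.5, beta=0.5)
```

Setting `beta` above `alpha` penalizes false negatives more heavily than false positives, which is a common reason to prefer a Tversky-style measure over Dice when foreground regions are small.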

[122] Semi-Supervised State-Space Model with Dynamic Stacking Filter for Real-World Video Deraining

Shangquan Sun,Wenqi Ren,Juxiang Zhou,Shu Wang,Jianhou Gan,Xiaochun Cao

Main category: cs.CV

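TL;DR: Proposes a dual-branch spatio-temporal state-space model for rain streak removal in video, improving performance through a dynamic stacking filter and semi-supervised learning, and introduces a real-world benchmark.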

Details

Motivation: Existing methods that depend on paired data generalize poorly to real scenes because synthetic and real rain effects differ substantially. Method: Spatial and temporal state-space model layers extract features, a dynamic stacking filter improves multi-frame fusion, and semi-supervised learning generates pseudo-clean patches. Result: The method performs strongly on synthetic and real-world rainy video benchmarks, leading in quantitative metrics, visual quality, and utility for downstream tasks. Conclusion: The method markedly improves rainy video restoration and provides practical support for downstream tasks. Abstract: Significant progress has been made in video restoration under rainy conditions over the past decade, largely propelled by advancements in deep learning. Nevertheless, existing methods that depend on paired data struggle to generalize effectively to real-world scenarios, primarily due to the disparity between synthetic and authentic rain effects. To address these limitations, we propose a dual-branch spatio-temporal state-space model to enhance rain streak removal in video sequences. Specifically, we design spatial and temporal state-space model layers to extract spatial features and incorporate temporal dependencies across frames, respectively. To improve multi-frame feature fusion, we derive a dynamic stacking filter, which adaptively approximates statistical filters for superior pixel-wise feature refinement. Moreover, we develop a median stacking loss to enable semi-supervised learning by generating pseudo-clean patches based on the sparsity prior of rain. To further explore the capacity of deraining models in supporting other vision-based tasks in rainy environments, we introduce a novel real-world benchmark focused on object detection and tracking in rainy conditions. Our method is extensively evaluated across multiple benchmarks containing numerous synthetic and real-world rainy videos, consistently demonstrating its superiority in quantitative metrics, visual quality, efficiency, and its utility for downstream tasks.

[123] Perceptual Quality Assessment for Embodied AI

Chunyi Li,Jiaohao Xiao,Jianbo Zhang,Farong Wen,Zicheng Zhang,Yuan Tian,Xiangyang Zhu,Xiaohong Liu,Zhengxue Cheng,Weisi Lin,Guangtao Zhai

Main category: cs.CV

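TL;DR: The paper proposes Image Quality Assessment (IQA) for Embodied AI, targeting the effect of real-world image quality on robot perception, and builds a database with 36k image pairs and 5m annotations.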

Details

Motivation: Real-world deployment of Embodied AI is limited by image quality, and traditional IQA methods cannot assess perceptual quality for robots, so new assessment methods are needed. Method: Based on the Mertonian system and meta-cognitive theory, the authors construct a perception-cognition-decision-execution pipeline, build the Embodied-IQA database, and validate mainstream IQA methods on it. Result: Experiments show that existing IQA methods fall short on embodied tasks and that more accurate quality indicators are needed. Conclusion: Evaluating image quality for Embodied AI should help its application in complex real-world environments. Abstract: Embodied AI has developed rapidly in recent years, but it is still mainly deployed in laboratories, with various distortions in the real world limiting its application. Traditionally, Image Quality Assessment (IQA) methods are applied to predict human preferences for distorted images; however, there is no IQA method to assess the usability of an image in embodied tasks, namely, the perceptual quality for robots. To provide accurate and reliable quality indicators for future embodied scenarios, we first propose the topic: IQA for Embodied AI. Specifically, we (1) based on the Mertonian system and meta-cognitive theory, constructed a perception-cognition-decision-execution pipeline and defined a comprehensive subjective score collection process; (2) established the Embodied-IQA database, containing over 36k reference/distorted image pairs, with more than 5m fine-grained annotations provided by Vision Language Models/Vision Language Action-models/real-world robots; (3) trained and validated the performance of mainstream IQA methods on Embodied-IQA, demonstrating the need to develop more accurate quality indicators for Embodied AI. We sincerely hope that through evaluation, we can promote the application of Embodied AI under complex distortions in the real world. Project page: https://github.com/lcysyzxdxc/EmbodiedIQA

[124] Action2Dialogue: Generating Character-Centric Narratives from Scene-Level Prompts

Taewon Kang,Ming C. Lin

Main category: cs.CV

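TL;DR: Proposes a modular pipeline that turns action-level prompts into visually and auditorily grounded narrative dialogue, making visual storytelling more expressive.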

Details

Motivation: Current scene-based video generation systems leave character dialogue and speech underexplored; this paper aims to fill that gap. Method: A pretrained vision-language encoder extracts scene semantic features, a large language model generates character-consistent dialogue, and a Recursive Narrative Bank maintains contextual consistency. Result: The pipeline generates character-consistent, expressive voiced narratives without additional training and applies to a variety of story settings. Conclusion: The framework combines visual narratives with character dialogue, improving the expressiveness and coherence of stories. Abstract: Recent advances in scene-based video generation have enabled systems to synthesize coherent visual narratives from structured prompts. However, a crucial dimension of storytelling -- character-driven dialogue and speech -- remains underexplored. In this paper, we present a modular pipeline that transforms action-level prompts into visually and auditorily grounded narrative dialogue, enriching visual storytelling with natural voice and character expression. Our method takes as input a pair of prompts per scene, where the first defines the setting and the second specifies a character's behavior. While a story generation model such as Text2Story generates the corresponding visual scene, we focus on generating expressive character utterances from these prompts and the scene image. We apply a pretrained vision-language encoder to extract a high-level semantic feature from the representative frame, capturing salient visual context. This feature is then combined with the structured prompts and used to guide a large language model in synthesizing natural, character-consistent dialogue. To ensure contextual consistency across scenes, we introduce a Recursive Narrative Bank that conditions each dialogue generation on the accumulated dialogue history from prior scenes. This approach enables characters to speak in ways that reflect their evolving goals and interactions throughout a story. Finally, we render each utterance as expressive, character-consistent speech, resulting in fully-voiced video narratives. Our framework requires no additional training and demonstrates applicability across a variety of story settings, from fantasy adventures to slice-of-life episodes.

[125] Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning

Fanrui Zhang,Dian Li,Qiang Zhang,Chenjun,sinbadliu,Junxiong Lin,Jiahong Yan,Jiawei Liu,Zheng-Jun Zha

Main category: cs.CV

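TL;DR: The paper introduces the FakeVV dataset and the Fact-R1 framework for video misinformation detection, combining deep reasoning with rule-based reinforcement learning to significantly improve detection.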

Details

Motivation: Multimodal misinformation spreads quickly on social media, and existing methods are limited by inadequate datasets and a lack of deep reasoning. Method: Proposes the Fact-R1 framework, which achieves deep reasoning through three-stage training (CoT instruction tuning, DPO preference alignment, GRPO policy optimization). Result: Fact-R1 shows reasoning ability in multimodal misinformation detection comparable to that of text-based reinforcement learning systems. Conclusion: The work offers a new paradigm for misinformation detection that combines large-scale video understanding with interpretable verification. Abstract: The rapid spread of multimodal misinformation on social media has raised growing concerns, while research on video misinformation detection remains limited due to the lack of large-scale, diverse datasets. Existing methods often overfit to rigid templates and lack deep reasoning over deceptive content. To address these challenges, we introduce FakeVV, a large-scale benchmark comprising over 100,000 video-text pairs with fine-grained, interpretable annotations. In addition, we propose Fact-R1, a novel framework that integrates deep reasoning with collaborative rule-based reinforcement learning. Fact-R1 is trained through a three-stage process: (1) misinformation long-Chain-of-Thought (CoT) instruction tuning, (2) preference alignment via Direct Preference Optimization (DPO), and (3) Group Relative Policy Optimization (GRPO) using a novel verifiable reward function. This enables Fact-R1 to exhibit emergent reasoning behaviors comparable to those observed in advanced text-based reinforcement learning systems, but in the more complex multimodal misinformation setting. Our work establishes a new paradigm for misinformation detection, bridging large-scale video understanding, reasoning-guided alignment, and interpretable verification.

[126] LaViDa: A Large Diffusion Language Model for Multimodal Understanding

Shufan Li,Konstantinos Kallidromitis,Hritik Bansal,Akash Gokul,Yusuke Kato,Kazuki Kozuka,Jason Kuen,Zhe Lin,Kai-Wei Chang,Aditya Grover

Main category: cs.CV

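TL;DR: LaViDa is a vision-language model (VLM) built on discrete diffusion models (DMs) that addresses the weaknesses of existing autoregressive models such as LLaVA in inference speed and controllable generation, and performs well on multimodal tasks.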

Details

Motivation: Existing autoregressive vision-language models such as LLaVA fall short in inference speed and controllable generation, while the potential of discrete diffusion models (DMs) for multimodal tasks is underexplored. Method: LaViDa equips DMs with a vision encoder and fine-tunes them jointly, using complementary masking, prefix KV cache, and timestep shifting to improve training and inference. Result: LaViDa matches or outperforms autoregressive models on multimodal benchmarks, gaining 4.1 CIDEr on COCO captioning with a 1.92x inference speedup. Conclusion: LaViDa demonstrates the advantages of discrete diffusion models on multimodal tasks and is a strong alternative to autoregressive models. Abstract: Modern Vision-Language Models (VLMs) can solve a wide range of tasks requiring visual reasoning. In real-world scenarios, desirable properties for VLMs include fast inference and controllable generation (e.g., constraining outputs to adhere to a desired format). However, existing autoregressive (AR) VLMs like LLaVA struggle in these aspects. Discrete diffusion models (DMs) offer a promising alternative, enabling parallel decoding for faster inference and bidirectional context for controllable generation through text-infilling. While effective in language-only settings, DMs' potential for multimodal tasks is underexplored. We introduce LaViDa, a family of VLMs built on DMs. We build LaViDa by equipping DMs with a vision encoder and jointly fine-tuning the combined parts for multimodal instruction following. To address challenges encountered, LaViDa incorporates novel techniques such as complementary masking for effective training, prefix KV cache for efficient inference, and timestep shifting for high-quality sampling. Experiments show that LaViDa achieves competitive or superior performance to AR VLMs on multi-modal benchmarks such as MMMU, while offering unique advantages of DMs, including flexible speed-quality tradeoff, controllability, and bidirectional reasoning. On COCO captioning, LaViDa surpasses Open-LLaVa-Next-8B by +4.1 CIDEr with 1.92x speedup. On bidirectional tasks, it achieves +59% improvement on Constrained Poem Completion. These results demonstrate LaViDa as a strong alternative to AR VLMs. Code and models will be released in the camera-ready version.

[127] Conditional Panoramic Image Generation via Masked Autoregressive Modeling

Chaoyang Wang,Xiangtai Li,Lu Qi,Xiaofan Lin,Jinbin Bai,Qianyu Zhou,Yunhai Tong

Main category: cs.CV

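TL;DR: The PAR framework uses masked autoregressive modeling to address the i.i.d. assumption problem and the task separation problem in panoramic image generation, and adds circular padding and a consistency alignment strategy to improve generation quality.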

Details

Motivation: Existing methods are inefficient because diffusion models are ill-suited to ERP panoramas and because tasks are handled separately. Method: Proposes the PAR framework, which uses masked autoregressive modeling together with circular padding and a consistency alignment strategy. Result: Performs well on text-to-image generation and panorama outpainting, with good scalability and generalization. Conclusion: PAR offers a unified and efficient solution for panoramic image generation. Abstract: Recent progress in panoramic image generation has underscored two critical limitations in existing approaches. First, most methods are built upon diffusion models, which are inherently ill-suited for equirectangular projection (ERP) panoramas due to the violation of the identically and independently distributed (i.i.d.) Gaussian noise assumption caused by their spherical mapping. Second, these methods often treat text-conditioned generation (text-to-panorama) and image-conditioned generation (panorama outpainting) as separate tasks, relying on distinct architectures and task-specific data. In this work, we propose a unified framework, Panoramic AutoRegressive model (PAR), which leverages masked autoregressive modeling to address these challenges. PAR avoids the i.i.d. assumption constraint and integrates text and image conditioning into a cohesive architecture, enabling seamless generation across tasks. To address the inherent discontinuity in existing generative models, we introduce circular padding to enhance spatial coherence and propose a consistency alignment strategy to improve generation quality. Extensive experiments demonstrate competitive performance in text-to-image generation and panorama outpainting tasks while showcasing promising scalability and generalization capabilities.
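
Circular padding can be illustrated on a toy ERP grid: the left and right edges of an equirectangular image meet on the sphere, so columns are wrapped around horizontally instead of being zero-padded. The following is a minimal sketch in plain Python; the function name and the list-of-rows image representation are assumptions for illustration, and the way PAR applies the padding inside its network is not described in the abstract.

```python
def circular_pad_width(rows, pad):
    """Wrap-pad every row of a 2D grid by `pad` columns on each side.

    The rightmost `pad` columns are copied to the left border and the
    leftmost `pad` columns to the right border, so the seam of an
    equirectangular image sees its true neighbours.
    """
    if pad < 0:
        raise ValueError("pad must be non-negative")
    if not rows:
        return []
    width = len(rows[0])
    if pad > width:
        raise ValueError("pad must not exceed the row width")
    padded = []
    for row in rows:
        left = list(row[width - pad:])   # columns wrapped from the right edge
        right = list(row[:pad])          # columns wrapped from the left edge
        padded.append(left + list(row) + right)
    return padded
```

Only the horizontal axis is wrapped here, since the top and bottom rows of an ERP image correspond to the poles and are not adjacent to each other.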

[128] Training-Free Efficient Video Generation via Dynamic Token Carving

Yuechen Zhang,Jinbo Xing,Bin Xia,Shaoteng Liu,Bohao Peng,Xin Tao,Pengfei Wan,Eric Lo,Jiaya Jia

Main category: cs.CV

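TL;DR: Jenga is a new inference pipeline that uses dynamic attention carving and progressive resolution generation to substantially speed up inference for video Diffusion Transformer (DiT) models while preserving generation quality.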

Details

Motivation: The high computational demands of video Diffusion Transformer models limit their practical deployment. The main causes are the quadratic complexity of self-attention and the multi-step nature of diffusion models. Method: Jenga combines dynamic attention carving (using 3D space-filling curves to dynamically select relevant token interactions) with progressive resolution generation (gradually increasing latent resolution). Result: Experiments show that Jenga achieves a substantial speedup (8.83x) on several state-of-the-art video diffusion models with almost no loss in generation quality (a 0.01% performance drop on VBench). Conclusion: As a plug-and-play solution, Jenga cuts inference time from minutes to seconds without model retraining, enabling efficient, high-quality video generation. Abstract: Despite the remarkable generation quality of video Diffusion Transformer (DiT) models, their practical deployment is severely hindered by extensive computational requirements. This inefficiency stems from two key challenges: the quadratic complexity of self-attention with respect to token length and the multi-step nature of diffusion models. To address these limitations, we present Jenga, a novel inference pipeline that combines dynamic attention carving with progressive resolution generation. Our approach leverages two key insights: (1) early denoising steps do not require high-resolution latents, and (2) later steps do not require dense attention. Jenga introduces a block-wise attention mechanism that dynamically selects relevant token interactions using 3D space-filling curves, alongside a progressive resolution strategy that gradually increases latent resolution during generation. Experimental results demonstrate that Jenga achieves substantial speedups across multiple state-of-the-art video diffusion models while maintaining comparable generation quality (8.83x speedup with 0.01% performance drop on VBench). As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware by reducing inference time from minutes to seconds -- without requiring model retraining. Code: https://github.com/dvlab-research/Jenga

[129] T2I-ConBench: Text-to-Image Benchmark for Continual Post-training

Zhehao Huang,Yuhang Liu,Yixin Lou,Zhengbao He,Mingzhen He,Wenxing Zhou,Tao Li,Kehan Li,Zeyi Huang,Xiaolin Huang

Main category: cs.CV

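TL;DR: The paper introduces T2I-ConBench, a unified benchmark for continual post-training of text-to-image models, analyzes four dimensions, and evaluates ten methods, finding that none excels across the board.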

Details

Motivation: Continual post-training avoids the cost of separate models, but the lack of a standardized evaluation protocol has held back research. Method: Introduces the T2I-ConBench benchmark, which combines automated metrics, human-preference modeling, and vision-language QA to assess four dimensions. Result: Across ten evaluated methods, none performs well on every front, and cross-task generalization remains unsolved. Conclusion: T2I-ConBench provides a standardized evaluation tool for research on continual post-training of text-to-image models. Abstract: Continual post-training adapts a single text-to-image diffusion model to learn new tasks without incurring the cost of separate models, but naive post-training causes forgetting of pretrained knowledge and undermines zero-shot compositionality. We observe that the absence of a standardized evaluation protocol hampers related research for continual post-training. To address this, we introduce T2I-ConBench, a unified benchmark for continual post-training of text-to-image models. T2I-ConBench focuses on two practical scenarios, item customization and domain enhancement, and analyzes four dimensions: (1) retention of generality, (2) target-task performance, (3) catastrophic forgetting, and (4) cross-task generalization. It combines automated metrics, human-preference modeling, and vision-language QA for comprehensive assessment. We benchmark ten representative methods across three realistic task sequences and find that no approach excels on all fronts. Even joint "oracle" training does not succeed for every task, and cross-task generalization remains unsolved. We release all datasets, code, and evaluation tools to accelerate research in continual post-training for text-to-image models.

[130] Tracking the Flight: Exploring a Computational Framework for Analyzing Escape Responses in Plains Zebra (Equus quagga)

Isla Duporge,Sofia Minano,Nikoloz Sirmpilatze,Igor Tatarnikov,Scott Wolf,Adam L. Tyson,Daniel Rubenstein

Main category: cs.CV

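TL;DR: The paper studies how to use drones to capture high-resolution video of animal movement and computer vision to separate animal movement from drone motion, evaluating three methods on a zebra escape event.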

Details

Motivation: Widely available drones give animal behavior research high-resolution video, but analysis requires solving the technical challenge of separating animal movement from drone motion. Method: Evaluates three methods (a bioimaging-based registration technique, an SfM pipeline, and a hybrid interpolation method) on video of an escape event involving 44 zebras. Result: The best method extracts individual trajectories and reveals increased alignment during escape, a brief widening of spacing just before stopping, and tighter coordination near the group's center. Conclusion: The method is effective and scalable, and supports broader studies of collective animal behavior. Abstract: Ethological research increasingly benefits from the growing affordability and accessibility of drones, which enable the capture of high-resolution footage of animal movement at fine spatial and temporal scales. However, analyzing such footage presents the technical challenge of separating animal movement from drone motion. While non-trivial, computer vision techniques such as image registration and Structure-from-Motion (SfM) offer practical solutions. For conservationists, open-source tools that are user-friendly, require minimal setup, and deliver timely results are especially valuable for efficient data interpretation. This study evaluates three approaches: a bioimaging-based registration technique, an SfM pipeline, and a hybrid interpolation method. We apply these to a recorded escape event involving 44 plains zebras, captured in a single drone video. Using the best-performing method, we extract individual trajectories and identify key behavioral patterns: increased alignment (polarization) during escape, a brief widening of spacing just before stopping, and tighter coordination near the group's center. These insights highlight the method's effectiveness and its potential to scale to larger datasets, contributing to broader investigations of collective animal behavior.
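
The alignment (polarization) pattern mentioned in this abstract is commonly quantified as the length of the mean unit heading vector of the group. The sketch below shows that standard order parameter in plain Python; the abstract does not state the exact definition the authors use, so the formula, the function name, and the treatment of stationary individuals are assumptions for illustration.

```python
import math


def polarization(velocities):
    """Group polarization from per-individual 2D velocity vectors.

    Each velocity is normalised to a unit heading; the result is the
    length of the average heading, ranging from 0 (headings cancel out)
    to 1 (all individuals move in the same direction).
    """
    sum_x = 0.0
    sum_y = 0.0
    count = 0
    for vx, vy in velocities:
        speed = math.hypot(vx, vy)
        if speed == 0.0:
            continue  # a stationary individual has no defined heading
        sum_x += vx / speed
        sum_y += vy / speed
        count += 1
    if count == 0:
        return 0.0
    return math.hypot(sum_x, sum_y) / count
```

Computed per frame from trajectory differences, this value would be expected to rise toward 1 during a coordinated escape, which matches the qualitative pattern the abstract reports.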

[131] RealEngine: Simulating Autonomous Driving in Realistic Context

Junzhe Jiang,Nan Song,Jingyu Li,Xiatian Zhu,Li Zhang

Main category: cs.CV

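TL;DR: RealEngine is a new driving simulation framework that integrates 3D scene reconstruction and novel view synthesis to achieve realistic and flexible closed-loop driving simulation.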

Details

Motivation: Existing simulators and benchmarks do not fully meet the key requirements of high-quality driving simulation, such as multi-modal sensing, closed-loop evaluation, diverse scenarios, and multi-agent cooperation. Method: Using real-world multi-modal sensor data, RealEngine reconstructs background scenes and foreground traffic participants separately and achieves high diversity and realism through flexible scene composition. Result: RealEngine supports non-reactive simulation, safety testing, and multi-agent interaction, forming a comprehensive and reliable benchmark for evaluating driving agents. Conclusion: RealEngine fills gaps in existing simulators and offers a realistic and flexible solution for evaluating driving agents. Abstract: Driving simulation plays a crucial role in developing reliable driving agents by providing controlled, evaluative environments. To enable meaningful assessments, a high-quality driving simulator must satisfy several key requirements: multi-modal sensing capabilities (e.g., camera and LiDAR) with realistic scene rendering to minimize observational discrepancies; closed-loop evaluation to support free-form trajectory behaviors; highly diverse traffic scenarios for thorough evaluation; multi-agent cooperation to capture interaction dynamics; and high computational efficiency to ensure affordability and scalability. However, existing simulators and benchmarks fail to comprehensively meet these fundamental criteria. To bridge this gap, this paper introduces RealEngine, a novel driving simulation framework that holistically integrates 3D scene reconstruction and novel view synthesis techniques to achieve realistic and flexible closed-loop simulation in the driving context. By leveraging real-world multi-modal sensor data, RealEngine reconstructs background scenes and foreground traffic participants separately, allowing for highly diverse and realistic traffic scenarios through flexible scene composition. This synergistic fusion of scene reconstruction and view synthesis enables photorealistic rendering across multiple sensor modalities, ensuring both perceptual fidelity and geometric accuracy. Building upon this environment, RealEngine supports three essential driving simulation categories: non-reactive simulation, safety testing, and multi-agent interaction, collectively forming a reliable and comprehensive benchmark for evaluating the real-world performance of driving agents.

[132] DetailMaster: Can Your Text-to-Image Model Handle Long Prompts?

Qirui Jiao,Daoyuan Chen,Yilun Huang,Xika Lin,Ying Shen,Yaliang Li

Main category: cs.CV

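TL;DR: DetailMaster is the first benchmark designed specifically to evaluate how text-to-image (T2I) models handle long text prompts, and it exposes the limits of existing models on complex details and spatial reasoning.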

Details

Motivation: Existing T2I models perform poorly on long, detail-intensive prompts, and there is no evaluation standard aimed at professional applications. Method: Proposes the DetailMaster benchmark, with four key evaluation dimensions and long prompts averaging 284.89 tokens, and evaluates 12 T2I models. Result: Existing models reach only about 50% accuracy on key dimensions such as attribute binding and spatial reasoning, and performance degrades as prompt length increases. Conclusion: The study reveals systemic failures in structural comprehension and detail handling, calls for improved architectures in future work, and open-sources the dataset and tools. Abstract: While recent text-to-image (T2I) models show impressive capabilities in synthesizing images from brief descriptions, their performance significantly degrades when confronted with long, detail-intensive prompts required in professional applications. We present DetailMaster, the first comprehensive benchmark specifically designed to evaluate T2I models' systematic abilities to handle extended textual inputs that contain complex compositional requirements. Our benchmark introduces four critical evaluation dimensions: Character Attributes, Structured Character Locations, Multi-Dimensional Scene Attributes, and Explicit Spatial/Interactive Relationships. The benchmark comprises long and detail-rich prompts averaging 284.89 tokens, with high quality validated by expert annotators. Evaluation on 7 general-purpose and 5 long-prompt-optimized T2I models reveals critical performance limitations: state-of-the-art models achieve merely ~50% accuracy in key dimensions like attribute binding and spatial reasoning, while all models show progressive performance degradation as prompt length increases. Our analysis highlights systemic failures in structural comprehension and detail overload handling, motivating future research into architectures with enhanced compositional reasoning. We open-source the dataset, data curation code, and evaluation tools to advance detail-rich T2I generation and enable broad applications that would otherwise be infeasible due to the lack of a dedicated benchmark.

[133] Efficient Correlation Volume Sampling for Ultra-High-Resolution Optical Flow Estimation

Karlis Martins Briedis,Markus Gross,Christopher Schroers

Main category: cs.CV

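TL;DR: Proposes a more efficient implementation of all-pairs correlation volume sampling that significantly reduces memory use and computational cost while preserving accuracy at high resolution.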

Details

Motivation: Existing optical flow methods are inefficient on high-resolution images because of high computational and memory complexity, so a more efficient yet accurate approach is needed. Method: Proposes a more efficient implementation of all-pairs correlation volume sampling that matches the mathematical operator defined by RAFT while significantly reducing memory and compute requirements. Result: The new method uses up to 95% less memory than the default implementation and is up to 90% faster than on-demand sampling, and it achieves state-of-the-art accuracy and efficiency on high-resolution datasets. Conclusion: The method substantially improves the efficiency of optical flow estimation in memory-constrained environments while maintaining performance at high resolution. Abstract: Recent optical flow estimation methods often employ local cost sampling from a dense all-pairs correlation volume. This results in quadratic computational and memory complexity in the number of pixels. Although an alternative memory-efficient implementation with on-demand cost computation exists, this is slower in practice and therefore prior methods typically process images at reduced resolutions, missing fine-grained details. To address this, we propose a more efficient implementation of the all-pairs correlation volume sampling, still matching the exact mathematical operator as defined by RAFT. Our approach outperforms on-demand sampling by up to 90% while maintaining low memory usage, and performs on par with the default implementation with up to 95% lower memory usage. As cost sampling makes up a significant portion of the overall runtime, this can translate to up to 50% savings for the total end-to-end model inference in memory-constrained environments. Our evaluation of existing methods includes an 8K ultra-high-resolution dataset and an additional inference-time modification of the recent SEA-RAFT method. With this, we achieve state-of-the-art results at high resolutions both in accuracy and efficiency.
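
The quadratic cost described in this abstract comes from storing one feature dot product for every pair of pixels. The toy sketch below, in plain Python, contrasts a dense all-pairs volume with on-demand computation of only the entries that are needed. It is not the paper's implementation: the function names are hypothetical, features are flattened lists of per-pixel vectors, and RAFT's multi-scale pyramid and bilinear window lookup are left out.

```python
def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def all_pairs_correlation(feat1, feat2):
    """Dense volume: entry [i][j] correlates pixel i of image 1
    with pixel j of image 2, so storage grows as N * N."""
    return [[dot(f1, f2) for f2 in feat2] for f1 in feat1]


def on_demand_correlation(feat1, feat2, i, j_indices):
    """Compute only the requested entries for pixel i, never
    materialising the full volume."""
    return [dot(feat1[i], feat2[j]) for j in j_indices]


def volume_entries(height, width):
    """Number of entries in the dense volume of an H x W feature map."""
    n = height * width
    return n * n
```

As a rough illustration, a 960 x 540 feature map (an 8K frame downsampled by a factor of 8, a figure assumed here for the example) gives `volume_entries(540, 960)` of about 2.7e11 entries, which shows why dense volumes become impractical at ultra-high resolution.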

[134] MedFrameQA: A Multi-Image Medical VQA Benchmark for Clinical Reasoning

Suhao Yu,Haojin Wang,Juncheng Wu,Cihang Xie,Yuyin Zhou

Main category: cs.CV

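TL;DR: MedFrameQA is the first benchmark dataset focused on multi-image reasoning in medical visual question answering (VQA), designed to mirror the multi-image comparison workflow of clinical diagnosis.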

Details

Motivation: Existing medical VQA benchmarks mostly focus on single-image analysis, whereas clinical diagnosis usually requires comparing multiple images. MedFrameQA was developed to fill this gap. Method: An automated pipeline extracts temporally coherent frames from medical videos and constructs logically coherent VQA items, and a multi-stage filtering strategy ensures data quality. Result: The dataset contains 2,851 VQA pairs covering nine human body systems and 43 organs. The ten multimodal LLMs tested perform poorly, with most accuracies below 50%. Conclusion: MedFrameQA exposes the shortcomings of current models in multi-image reasoning and should help advance diagnostic AI. Abstract: Existing medical VQA benchmarks mostly focus on single-image analysis, yet clinicians almost always compare a series of images before reaching a diagnosis. To better approximate this workflow, we introduce MedFrameQA -- the first benchmark that explicitly evaluates multi-image reasoning in medical VQA. To build MedFrameQA both at scale and with high quality, we develop 1) an automated pipeline that extracts temporally coherent frames from medical videos and constructs VQA items whose content evolves logically across images, and 2) a multiple-stage filtering strategy, including model-based and manual review, to preserve data clarity, difficulty, and medical relevance. The resulting dataset comprises 2,851 VQA pairs (gathered from 9,237 high-quality frames in 3,420 videos), covering nine human body systems and 43 organs; every question is accompanied by two to five images. We comprehensively benchmark ten advanced Multimodal LLMs -- both proprietary and open source, with and without explicit reasoning modules -- on MedFrameQA. The evaluation reveals that all models perform poorly, with most accuracies below 50%, and accuracy fluctuates as the number of images per question increases. Error analysis further shows that models frequently ignore salient findings, mis-aggregate evidence across images, and propagate early mistakes through their reasoning chains; results also vary substantially across body systems, organs, and modalities. We hope this work can catalyze research on clinically grounded, multi-image reasoning and accelerate progress toward more capable diagnostic AI systems.

[135] UniPhy: Learning a Unified Constitutive Model for Inverse Physics Simulation

Himangi Mittal,Peiye Zhuang,Hsin-Ying Lee,Shubham Tulsiani

Main category: cs.CV

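TL;DR: UniPhy is a general latent-conditioned neural constitutive model that encodes the physical properties of diverse materials and infers material properties via differentiable simulation.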

Details

Motivation: Existing methods rely on user-specified material type information; UniPhy needs no such information, and shared training improves the robustness and accuracy of its estimates. Method: UniPhy is trained on simulated trajectories across different geometries and materials (elastic, plasticine, sand, fluids) and infers the properties of unknown materials through latent optimization. Result: UniPhy outperforms existing methods at inferring material properties and enables more accurate replay and re-simulation of object motion under novel conditions. Conclusion: UniPhy offers a general approach that needs no prior material information and markedly improves the accuracy of material property inference and simulation. Abstract: We propose UniPhy, a common latent-conditioned neural constitutive model that can encode the physical properties of diverse materials. At inference UniPhy allows 'inverse simulation', i.e. inferring material properties by optimizing the scene-specific latent to match the available observations via differentiable simulation. In contrast to existing methods that treat such inference as system identification, UniPhy does not rely on user-specified material type information. Compared to prior neural constitutive modeling approaches which learn instance specific networks, the shared training across materials improves both the robustness and accuracy of the estimates. We train UniPhy using simulated trajectories across diverse geometries and materials -- elastic, plasticine, sand, and fluids (Newtonian & non-Newtonian). At inference, given an object with unknown material properties, UniPhy can infer the material properties via latent optimization to match the motion observations, and can then allow re-simulating the object under diverse scenarios. We compare UniPhy against prior inverse simulation methods, and show that the inference from UniPhy enables more accurate replay and re-simulation under novel conditions.

[136] OpenSeg-R: Improving Open-Vocabulary Segmentation via Step-by-Step Visual Reasoning

Zongyan Han,Jiale Cao,Shuo Chen,Tong Wang,Jorma Laaksonen,Rao Muhammad Anwer

Main category: cs.CV

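TL;DR: OpenSeg-R is an open-vocabulary segmentation framework based on step-by-step visual reasoning that uses large multimodal models to generate hierarchical reasoning, significantly improving segmentation accuracy and interpretability.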

Details

Motivation: Existing open-vocabulary segmentation methods lack explicit reasoning and contextual understanding, which makes similar categories hard to tell apart. Method: OpenSeg-R generates generic and image-specific reasoning steps that form structured triplets, then composes detailed description prompts from this reasoning to guide segmentation. Result: It significantly outperforms existing methods on five benchmark datasets and achieves consistent gains on open-vocabulary panoptic segmentation. Conclusion: OpenSeg-R is the first to bring step-by-step visual reasoning into open-vocabulary segmentation, improving segmentation accuracy and model interpretability. Abstract: Open-Vocabulary Segmentation (OVS) has drawn increasing attention for its capacity to generalize segmentation beyond predefined categories. However, existing methods typically predict segmentation masks with simple forward inference, lacking explicit reasoning and interpretability. This makes it challenging for OVS models to distinguish similar categories in open-world settings due to the lack of contextual understanding and discriminative visual cues. To address this limitation, we propose a step-by-step visual reasoning framework for open-vocabulary segmentation, named OpenSeg-R. The proposed OpenSeg-R leverages Large Multimodal Models (LMMs) to perform hierarchical visual reasoning before segmentation. Specifically, we generate both generic and image-specific reasoning for each image, forming structured triplets that explain the visual reason for objects in a coarse-to-fine manner. Based on these reasoning steps, we can compose detailed description prompts, and feed them to the segmentor to produce more accurate segmentation masks. To the best of our knowledge, OpenSeg-R is the first framework to introduce explicit step-by-step visual reasoning into OVS. Experimental results demonstrate that OpenSeg-R significantly outperforms state-of-the-art methods on open-vocabulary semantic segmentation across five benchmark datasets. Moreover, it achieves consistent gains across all metrics on open-vocabulary panoptic segmentation. Qualitative results further highlight the effectiveness of our reasoning-guided framework in improving both segmentation precision and interpretability. Our code is publicly available at https://github.com/Hanzy1996/OpenSeg-R.

[137] Creatively Upscaling Images with Global-Regional Priors

Yurui Qian,Qi Cai,Yingwei Pan,Ting Yao,Tao Mei

Main category: cs.CV

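TL;DR: C-Upscale is a tuning-free image upscaling method that combines global and regional priors to generate ultra-high-resolution images while preserving global semantic consistency and rich regional detail.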

Details

Motivation: Current diffusion models excel at text-to-image generation but are limited in resolution (e.g., 1,024 X 1,024). Existing methods struggle to preserve global structure and produce creative regional details at the same time when generating high-resolution images. Method: C-Upscale uses the global prompt and regional prompts estimated by a multimodal LLM to extract a global structure prior and a regional attention prior, and applies regional attention control to reduce object repetition. Result: C-Upscale generates ultra-high-resolution images (e.g., 4,096 X 4,096 and 8,192 X 8,192) with higher visual fidelity and richer regional details. Conclusion: By combining global and regional priors, C-Upscale addresses semantic consistency and creative detail in high-resolution image generation. Abstract: Contemporary diffusion models show remarkable capability in text-to-image generation, while still being limited to restricted resolutions (e.g., 1,024 X 1,024). Recent advances enable tuning-free higher-resolution image generation by recycling pre-trained diffusion models and extending them via regional denoising or dilated sampling/convolutions. However, these models struggle to simultaneously preserve global semantic structure and produce creative regional details in higher-resolution images. To address this, we present C-Upscale, a new recipe of tuning-free image upscaling that pivots on global-regional priors derived from given global prompt and estimated regional prompts via Multimodal LLM. Technically, the low-frequency component of low-resolution image is recognized as global structure prior to encourage global semantic consistency in high-resolution generation. Next, we perform regional attention control to screen cross-attention between global prompt and each region during regional denoising, leading to regional attention prior that alleviates object repetition issue. The estimated regional prompts containing rich descriptive details further act as regional semantic prior to fuel the creativity of regional detail generation. Both quantitative and qualitative evaluations demonstrate that our C-Upscale manages to generate ultra-high-resolution images (e.g., 4,096 X 4,096 and 8,192 X 8,192) with higher visual fidelity and more creative regional details.

[138] Incorporating Visual Correspondence into Diffusion Model for Virtual Try-On

Siqi Wan,Jingwen Chen,Yingwei Pan,Ting Yao,Tao Mei

Main category: cs.CV

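TL;DR: The paper proposes a diffusion-based method for virtual try-on (VTON) that exploits visual correspondence, using semantic point matching and 3D-aware cues to better preserve garment details.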

Details

Motivation: The intrinsic stochasticity of diffusion models makes it hard to preserve garment shape and detail in VTON. Method: Visual correspondence is used as a prior: garment details are represented as structured semantic points, matched to semantic points on the target person through local flow warping, and lifted to 3D-aware cues using the person's depth/normal maps. Result: The method achieves state-of-the-art VTON performance on VITON-HD and DressCode, with markedly better preservation of garment details. Conclusion: Semantic point matching and 3D-aware cues improve diffusion models on the VTON task; the code is open source. Abstract: Diffusion models have shown preliminary success in the virtual try-on (VTON) task. The typical dual-branch architecture comprises two UNets for implicit garment deformation and synthesized image generation respectively, and has emerged as the recipe for the VTON task. Nevertheless, it remains challenging to preserve the shape and every detail of the given garment due to the intrinsic stochasticity of the diffusion model. To alleviate this issue, we propose to explicitly capitalize on visual correspondence as the prior to tame the diffusion process instead of simply feeding the whole garment into the UNet as the appearance reference. Specifically, we interpret the fine-grained appearance and texture details as a set of structured semantic points, and match the semantic points rooted in the garment to the ones over the target person through local flow warping. Such 2D points are then augmented into 3D-aware cues with the depth/normal map of the target person. The correspondence mimics the way of putting clothing on a human body and the 3D-aware cues act as semantic point matching to supervise diffusion model training. A point-focused diffusion loss is further devised to take full advantage of semantic point matching. Extensive experiments demonstrate strong garment detail preservation of our approach, evidenced by state-of-the-art VTON performance on both the VITON-HD and DressCode datasets. Code is publicly available at: https://github.com/HiDream-ai/SPM-Diff.

[139] Pursuing Temporal-Consistent Video Virtual Try-On via Dynamic Pose Interaction

Dong Li,Wenqi Zhong,Wei Yu,Yingwei Pan,Dingwen Zhang,Ting Yao,Junwei Han,Tao Mei

Main category: cs.CV

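TL;DR: The paper proposes DPIDM, a dynamic pose interaction diffusion model for video virtual try-on that combines spatiotemporal pose interactions with diffusion models to substantially improve try-on results.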

Details

Motivation: Existing video virtual try-on methods focus mainly on temporal modules but overlook the spatiotemporal pose interactions between human and garment. DPIDM addresses this through dynamic pose interaction. Method: DPIDM introduces a skeleton-based pose adapter and a hierarchical attention module that models intra-frame human-garment pose interactions and long-term human pose dynamics across frames. Result: DPIDM performs strongly on the VITON-HD, VVT and ViViD datasets; on VVT it reaches a VFID score of 0.506, a 60.5% improvement over the state-of-the-art GPD-VVTO approach. Conclusion: Dynamic pose interaction and temporal-consistency optimization substantially improve video virtual try-on. Abstract: Video virtual try-on aims to seamlessly dress a subject in a video with a specific garment. The primary challenge involves preserving the visual authenticity of the garment while dynamically adapting to the pose and physique of the subject. While existing methods have predominantly focused on image-based virtual try-on, extending these techniques directly to videos often results in temporal inconsistencies. Most current video virtual try-on approaches alleviate this challenge by incorporating temporal modules, yet still overlook the critical spatiotemporal pose interactions between human and garment. Effective pose interactions in videos should not only consider spatial alignment between human and garment poses in each frame but also account for the temporal dynamics of human poses throughout the entire video. With such motivation, we propose a new framework, namely Dynamic Pose Interaction Diffusion Models (DPIDM), to leverage diffusion models to delve into dynamic pose interactions for video virtual try-on. Technically, DPIDM introduces a skeleton-based pose adapter to integrate synchronized human and garment poses into the denoising network. A hierarchical attention module is then designed to model intra-frame human-garment pose interactions and long-term human pose dynamics across frames through pose-aware spatial and temporal attention mechanisms. Moreover, DPIDM capitalizes on a temporal regularized attention loss between consecutive frames to enhance temporal consistency. Extensive experiments conducted on the VITON-HD, VVT and ViViD datasets demonstrate the superiority of our DPIDM against the baseline methods. Notably, DPIDM achieves a VFID score of 0.506 on the VVT dataset, leading to a 60.5% improvement over the state-of-the-art GPD-VVTO approach.

[140] Extremely Simple Multimodal Outlier Synthesis for Out-of-Distribution Detection and Segmentation

Moru Liu,Hao Dong,Jessica Kelly,Olga Fink,Mario Trapp

Main category: cs.CV

TL;DR: The paper proposes Feature Mixing, a simple and fast method for multimodal outlier synthesis that improves OOD detection, and introduces a new dataset, CARLA-OOD.

Details

Motivation: Real-world applications are mostly multimodal, whereas prior work focuses on unimodal image data; the lack of supervision signals from unknown data leads to overconfident predictions on OOD samples. Method: Feature Mixing synthesizes outliers by mixing multimodal features; it comes with theoretical support and applies to various modality combinations. Result: Validated on multiple datasets, Feature Mixing achieves state-of-the-art performance with a 10x to 370x speedup. Conclusion: Feature Mixing is an efficient, general OOD detection method suited to multimodal settings. Abstract: Out-of-distribution (OOD) detection and segmentation are crucial for deploying machine learning models in safety-critical applications such as autonomous driving and robot-assisted surgery. While prior research has primarily focused on unimodal image data, real-world applications are inherently multimodal, requiring the integration of multiple modalities for improved OOD detection. A key challenge is the lack of supervision signals from unknown data, leading to overconfident predictions on OOD samples. To address this challenge, we propose Feature Mixing, an extremely simple and fast method for multimodal outlier synthesis with theoretical support, which can be further optimized to help the model better distinguish between in-distribution (ID) and OOD data. Feature Mixing is modality-agnostic and applicable to various modality combinations. Additionally, we introduce CARLA-OOD, a novel multimodal dataset for OOD segmentation, featuring synthetic OOD objects across diverse scenes and weather conditions. Extensive experiments on SemanticKITTI, nuScenes, CARLA-OOD datasets, and the MultiOOD benchmark demonstrate that Feature Mixing achieves state-of-the-art performance with a $10 \times$ to $370 \times$ speedup. Our source code and dataset will be available at https://github.com/mona4399/FeatureMixing.
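The abstract does not spell out how features are mixed. One plausible reading, shown here purely as an assumption, is to swap a random subset of feature dimensions between the two modalities so that the resulting vectors no longer match either in-distribution modality. The function name, `n_swap`, and the toy vectors are hypothetical.

```python
import random


def feature_mixing(feat_a, feat_b, n_swap, rng):
    """Assumed mixing rule: exchange n_swap randomly chosen entries
    between two modality feature vectors to form synthetic outliers."""
    idx_a = rng.sample(range(len(feat_a)), n_swap)
    idx_b = rng.sample(range(len(feat_b)), n_swap)
    out_a, out_b = list(feat_a), list(feat_b)
    for i, j in zip(idx_a, idx_b):
        out_a[i], out_b[j] = feat_b[j], feat_a[i]
    return out_a, out_b


# Toy "image" and "LiDAR" features; real features would come from encoders.
rng = random.Random(0)
image_feat = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
lidar_feat = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
outlier_image, outlier_lidar = feature_mixing(image_feat, lidar_feat, 2, rng)
```

A swap of this kind needs no extra forward passes or sampling loops, which would be consistent with the reported speedup, though the paper should be consulted for the actual procedure.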

[141] Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding

Runpeng Yu,Xinyin Ma,Xinchao Wang

Main category: cs.CV

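TL;DR: Dimple is a discrete diffusion multimodal large language model (DMLLM) that combines autoregressive and diffusion training to overcome the instability and weak performance of purely discrete diffusion training. It outperforms LLaVA-NEXT by 3.9% and introduces an efficient decoding strategy and structured response control.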

Details

Motivation: Address the training instability, suboptimal performance and length bias of purely discrete diffusion training, and explore the feasibility and advantages of DMLLMs. Method: A training paradigm that combines an initial autoregressive phase with a subsequent diffusion phase, plus a confident decoding strategy and a prefilling technique. Result: Dimple-7B surpasses LLaVA-NEXT by 3.9%; confident decoding cuts the number of generation iterations to about one third of the response length, and prefilling yields a 1.5x to 7x speedup. Conclusion: Dimple validates the feasibility and advantages of DMLLMs and improves inference efficiency and controllability. Abstract: In this work, we propose Dimple, the first Discrete Diffusion Multimodal Large Language Model (DMLLM). We observe that training with a purely discrete diffusion approach leads to significant training instability, suboptimal performance, and severe length bias issues. To address these challenges, we design a novel training paradigm that combines an initial autoregressive phase with a subsequent diffusion phase. This approach yields the Dimple-7B model, trained on the same dataset and using a similar training pipeline as LLaVA-NEXT. Dimple-7B ultimately surpasses LLaVA-NEXT in performance by 3.9%, demonstrating that DMLLM can achieve performance comparable to that of autoregressive models. To improve inference efficiency, we propose a decoding strategy termed confident decoding, which dynamically adjusts the number of tokens generated at each step, significantly reducing the number of generation iterations. In autoregressive models, the number of forward iterations during generation equals the response length. With confident decoding, however, the number of iterations needed by Dimple is only $\frac{\text{response length}}{3}$. We also re-implement the prefilling technique in autoregressive models and demonstrate that it does not significantly impact performance on most benchmark evaluations, while offering a speedup of 1.5x to 7x. Additionally, we explore Dimple's capability to precisely control its response using structure priors. These priors enable structured responses in a manner distinct from instruction-based or chain-of-thought prompting, and allow fine-grained control over response format and length, which is difficult to achieve in autoregressive models. Overall, this work validates the feasibility and advantages of DMLLM and enhances its inference efficiency and controllability. Code and models are available at https://github.com/yu-rp/Dimple.
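Confident decoding is described as dynamically adjusting how many tokens are committed per step. A minimal sketch of that idea follows, assuming a threshold rule: commit every masked position whose confidence reaches the threshold, and fall back to the single most confident position otherwise. The threshold value, the `predict` interface and the toy predictor are all hypothetical stand-ins for the real model.

```python
def confident_decode(length, predict, threshold=0.9):
    """Fill `length` masked positions; returns (tokens, number_of_steps).
    `predict(tokens)` must return {position: (token, confidence)} for
    every position that is still masked (None)."""
    tokens = [None] * length
    steps = 0
    while any(t is None for t in tokens):
        proposals = predict(tokens)
        chosen = [p for p, (_, conf) in proposals.items() if conf >= threshold]
        if not chosen:
            # Always make progress: commit the most confident position.
            chosen = [max(proposals, key=lambda p: proposals[p][1])]
        for p in chosen:
            tokens[p] = proposals[p][0]
        steps += 1
    return tokens, steps


def toy_predict(tokens):
    """Toy model: confident about the first three masked positions only."""
    masked = [i for i, t in enumerate(tokens) if t is None]
    return {p: (f"t{p}", 0.95 if rank < 3 else 0.5)
            for rank, p in enumerate(masked)}


decoded, n_steps = confident_decode(9, toy_predict, threshold=0.9)
```

With this toy predictor three tokens are committed per step, so nine tokens take three steps, which mirrors the response-length-over-three figure only by construction; raising the threshold above 0.95 degrades it to one token per step.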

[142] An Effective Training Framework for Light-Weight Automatic Speech Recognition Models

Abdul Hannan,Alessio Brutti,Shah Nawaz,Mubashir Noman

Main category: cs.CV

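TL;DR: Proposes an efficient two-step representation learning approach that produces several small models from a single large model, with markedly better performance and fast training.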

Details

Motivation: Existing approaches (pruning, distillation, etc.) that compress large models into small ones either degrade performance noticeably or need long training, which makes deployment on low-resource devices difficult. Method: A two-step representation learning approach that produces several small models from a single large model and delivers better performance within a limited number of training epochs. Result: On ASR benchmarks, the approach achieves a three-fold training speed-up and up to 12.54% word error rate improvement. Conclusion: The approach offers an efficient route to deployment on low-resource devices and clearly outperforms existing methods. Abstract: Recent advances in deep learning have encouraged the development of large automatic speech recognition (ASR) models that achieve promising results while ignoring computational and memory constraints. However, deploying such models on low-resource devices is impractical despite their favorable performance. Existing approaches (pruning, distillation, layer skipping, etc.) transform the large models into smaller ones at the cost of significant performance degradation or require prolonged training of smaller models for better performance. To address these issues, we introduce an efficacious two-step representation-learning-based approach capable of producing several small-sized models from a single large model, ensuring considerably better performance in a limited number of epochs. Comprehensive experimentation on ASR benchmarks reveals the efficacy of our approach, achieving a three-fold training speed-up and up to 12.54% word error rate improvement.

[143] Native Segmentation Vision Transformers

Guillem Brasó,Aljoša Ošep,Laura Leal-Taixé

Main category: cs.CV

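TL;DR: Proposes a vision transformer design built around a content-aware spatial grouping layer that reduces resolution by dynamically assigning tokens, producing strong segmentation masks without additional segmentation heads.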

Details

Motivation: Uniform downsampling is still the standard in vision backbones, but it does not adapt to image boundaries or semantic content. Method: A content-aware spatial grouping layer dynamically assigns tokens to a reduced set; stacking it yields a hierarchical segmentation and the Native Segmentation Vision Transformer. Result: Strong segmentation masks emerge from the grouping layers alone, enabling zero-shot segmentation and efficient designs for downstream tasks. Conclusion: The work introduces a native segmentation paradigm that is efficient and requires no mask supervision. Abstract: Uniform downsampling remains the de facto standard for reducing spatial resolution in vision backbones. In this work, we propose an alternative design built around a content-aware spatial grouping layer that dynamically assigns tokens to a reduced set based on image boundaries and their semantic content. Stacking our grouping layer across consecutive backbone stages results in hierarchical segmentation that arises natively in the feature extraction process, yielding what we call the Native Segmentation Vision Transformer. We show that a careful design of our architecture enables the emergence of strong segmentation masks solely from grouping layers, that is, without additional segmentation-specific heads. This sets the foundation for a new paradigm of native, backbone-level segmentation, which enables strong zero-shot results without mask supervision, as well as a minimal and efficient standalone model design for downstream segmentation tasks. Our project page is https://research.nvidia.com/labs/dvl/projects/native-segmentation.

[144] Seeing through Satellite Images at Street Views

Ming Qian,Bin Tan,Qiuyu Wang,Xianwei Zheng,Hanjiang Xiong,Gui-Song Xia,Yujun Shen,Nan Xue

Main category: cs.CV

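TL;DR: The paper proposes Sat2Density++, which learns a neural radiance field from paired satellite and street-view images to synthesize photorealistic street-view panorama images and videos.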

Details

Motivation: Tackle the large viewpoint change and sparse views between satellite and street-view images, and render street-view-specific elements (such as the sky and illumination effects) realistically. Method: Sat2Density++ builds on neural radiance field learning and explicitly models street-view-specific elements. Result: On urban and suburban scene datasets, the method renders photorealistic street-view panoramas that are consistent across views and faithful to the satellite image. Conclusion: Sat2Density++ addresses the main challenges of satellite-to-street-view synthesis and achieves high-quality rendering. Abstract: This paper studies the task of SatStreet-view synthesis, which aims to render photorealistic street-view panorama images and videos given any satellite image and specified camera positions or trajectories. We formulate the task as learning a neural radiance field from paired images captured from satellite and street viewpoints, which is a challenging learning problem due to the sparse-view nature and the extremely large viewpoint changes between satellite and street-view images. We tackle the challenges based on a task-specific observation that street-view-specific elements, including the sky and illumination effects, are only visible in street-view panoramas, and present a novel approach, Sat2Density++, to accomplish the goal of photorealistic street-view panorama rendering by modeling these street-view-specific elements in neural networks. In the experiments, our method is tested on both urban and suburban scene datasets, demonstrating that Sat2Density++ is capable of rendering photorealistic street-view panoramas that are consistent across multiple views and faithful to the satellite image.

[145] PAEFF: Precise Alignment and Enhanced Gated Feature Fusion for Face-Voice Association

Abdul Hannan,Muhammad Arslan Manzoor,Shah Nawaz,Muhammad Irzam Liaqat,Markus Schedl,Mubashir Noman

Main category: cs.CV

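TL;DR: The paper proposes an improved method for learning face-voice association that aligns the embedding spaces and applies an enhanced gated fusion to boost performance.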

Details

Motivation: Current face-voice association methods depend on hand-crafted negative mining and on a distance margin parameter, and need improvement. Method: First align the face and voice embedding spaces, then build the joint embedding through an enhanced gated fusion. Result: Experiments on the VoxCeleb dataset confirm the effectiveness of the method. Conclusion: Space alignment and improved fusion markedly raise face-voice association performance. Abstract: We study the task of learning association between faces and voices, which has recently been gaining interest in the multimodal community. Existing methods suffer from the deliberate crafting of negative mining procedures as well as the reliance on the distant margin parameter. These issues are addressed by learning a joint embedding space in which orthogonality constraints are applied to the fused embeddings of faces and voices. However, embedding spaces of faces and voices possess different characteristics and need to be aligned before they are fused. To this end, we propose a method that accurately aligns the embedding spaces and fuses them with an enhanced gated fusion, thereby improving the performance of face-voice association. Extensive experiments on the VoxCeleb dataset reveal the merits of the proposed approach.
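For readers unfamiliar with gated fusion, the sketch below shows the standard form only: a sigmoid gate computed from the concatenated face and voice embeddings decides, per dimension, how much of each modality to keep. It is not the paper's "enhanced" variant, whose details the abstract does not give; the weight layout and names are assumptions, and both embeddings are assumed to be already aligned to the same dimensionality.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(face, voice, weights, bias):
    """Standard gated fusion (illustrative, not the paper's variant).
    gate_i  = sigmoid(weights[i] . [face; voice] + bias[i])
    fused_i = gate_i * face_i + (1 - gate_i) * voice_i"""
    concat = list(face) + list(voice)
    fused = []
    for i in range(len(face)):
        gate = sigmoid(sum(w * x for w, x in zip(weights[i], concat)) + bias[i])
        fused.append(gate * face[i] + (1.0 - gate) * voice[i])
    return fused


# With zero weights and bias the gate is 0.5 and fusion is a plain average.
face_emb = [1.0, 0.0]
voice_emb = [0.0, 1.0]
zero_w = [[0.0] * 4 for _ in range(2)]
fused_emb = gated_fusion(face_emb, voice_emb, zero_w, [0.0, 0.0])
```

In a trained model the gate weights would be learned, so the network can lean on whichever modality is more informative for each dimension.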

[146] CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning

Jiange Yang,Yansong Shi,Haoyi Zhu,Mingyu Liu,Kaijing Ma,Yating Wang,Gangshan Wu,Tong He,Limin Wang

Main category: cs.CV

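TL;DR: CoMo learns continuous motion representations from internet videos, overcoming the limitations of existing discrete methods, and improves performance through new metrics and zero-shot generalization.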

Details

Motivation: Existing discrete latent action methods lose information and struggle with complex dynamics, so a more effective way to learn continuous motion representations is needed. Method: CoMo uses an early temporal feature difference mechanism to prevent model collapse, constrains the embedding dimensionality following the information bottleneck principle, and introduces new evaluation metrics. Result: CoMo generalizes well zero-shot, generating pseudo actions for unseen video domains and improving joint policy learning. Conclusion: With continuous motion representations and new metrics, CoMo markedly improves motion learning across video domains and tasks. Abstract: Learning latent motion from Internet videos is crucial for building generalist robots. However, existing discrete latent action methods suffer from information loss and struggle with complex and fine-grained dynamics. We propose CoMo, which aims to learn more informative continuous motion representations from diverse, internet-scale videos. CoMo employs an early temporal feature difference mechanism to prevent model collapse and suppress static appearance noise, effectively discouraging shortcut learning. Furthermore, guided by the information bottleneck principle, we constrain the latent motion embedding dimensionality to achieve a better balance between retaining sufficient action-relevant information and minimizing the inclusion of action-irrelevant appearance noise. We also introduce two new metrics for more robustly and affordably evaluating motion and guiding the development of motion learning methods: (i) the linear probing MSE of action prediction, and (ii) the cosine similarity between past-to-current and future-to-current motion embeddings. Critically, CoMo exhibits strong zero-shot generalization, enabling it to generate continuous pseudo actions for previously unseen video domains. This capability facilitates unified policy joint learning using pseudo actions derived from various action-less video datasets (such as cross-embodiment videos and, notably, human demonstration videos), potentially augmented with limited labeled robot data. Extensive experiments show that policies co-trained with CoMo pseudo actions achieve superior performance with both diffusion and autoregressive architectures in simulated and real-world settings.
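The two metrics named in the abstract reduce to simple formulas once the embeddings and probe predictions exist. The sketch below computes only those formulas; the embeddings are toy values, and the linear probe itself (fitted on motion embeddings to predict actions) is assumed to have been trained elsewhere. How to read the cosine value is also an assumption: the abstract does not state which range indicates a good representation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mse(predicted, target):
    """Mean squared error, e.g. between probe-predicted and true actions."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


# Toy embeddings: motion from the past frame to the current one, and from
# the future frame to the current one (hypothetical values).
past_to_current = [1.0, 0.0]
future_to_current = [-1.0, 0.0]
similarity = cosine_similarity(past_to_current, future_to_current)

# Toy linear-probe output versus ground-truth action values.
probe_error = mse([1.0, 2.0], [1.0, 4.0])
```

Both metrics need only embeddings and a small labeled set, which fits the abstract's point that they are cheaper than full policy rollouts.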

[147] Deep mineralogical segmentation of thin section images based on QEMSCAN maps

Jean Pablo Vieira de Mello,Matheus Augusto Alves Cuglieri,Leandro P. de Figueiredo,Fernando Bordignon,Marcelo Ramalho Albuquerque,Rodrigo Surmas,Bruno Cavalcanti de Paula

Main category: cs.CV

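TL;DR: Proposes a convolutional neural network model for automatic mineralogical segmentation of thin section images of carbonate rocks, mimicking QEMSCAN mineral mapping in a low-cost, efficient manner.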

Details

Motivation: Manual mineralogical analysis of rock thin sections is subjective and time-consuming, and existing technologies such as QEMSCAN are costly and slow, so a cheaper and more efficient automated method is needed. Method: A U-Net semantic segmentation architecture is trained on plane- and cross-polarized thin section images, with QEMSCAN maps as targets, to classify minerals. Result: The model delineates mineral boundaries and estimates mineral distributions well, with R² above 0.97 on seen facies and above 0.88 on unseen facies. Conclusion: The model shows promise for mineral segmentation, although segmentation quality depends on image resolution and on the variety of rock textures. Abstract: Interpreting the mineralogical aspects of rock thin sections is an important task for oil and gas reservoir evaluation. However, human analysis tends to be subjective and laborious. Technologies like QEMSCAN® are designed to automate the mineralogical mapping process, but also suffer from limitations like high monetary costs and time-consuming analysis. This work proposes a Convolutional Neural Network model for automatic mineralogical segmentation of thin section images of carbonate rocks. The model is able to mimic the QEMSCAN mapping itself in a low-cost, generalized and efficient manner. For this, the U-Net semantic segmentation architecture is trained on plane- and cross-polarized thin section images using the corresponding QEMSCAN maps as target, which is an approach not widely explored. The model was trained to differentiate occurrences of Calcite, Dolomite, Mg-Clay Minerals, Quartz, Pores and the remaining mineral phases as a single class named "Others", while it was validated on rock facies both seen and unseen during training, in order to assess its generalization capability. Since the images and maps are provided in different resolutions, image registration was applied to align them spatially. The study reveals that the quality of the segmentation is very much dependent on these resolution differences and on the variety of learnable rock textures. However, it shows promising results, especially with regard to the proper delineation of mineral boundaries on solid textures and precise estimation of the mineral distributions, describing a nearly linear relationship between expected and predicted distributions, with a coefficient of determination (R^2) above 0.97 for seen facies and 0.88 for unseen.
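The reported R^2 values compare expected (QEMSCAN) and predicted mineral distributions. As a reminder of what that number measures, here is the usual definition, R^2 = 1 - SS_res / SS_tot, in a few lines. The mineral fractions in the example are made-up toy values, not results from the paper, and the paper may aggregate over samples differently.

```python
def r_squared(expected, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(expected) / len(expected)
    ss_res = sum((e - p) ** 2 for e, p in zip(expected, predicted))
    ss_tot = sum((e - mean) ** 2 for e in expected)
    return 1.0 - ss_res / ss_tot


# Toy area fractions for Calcite, Dolomite, Quartz and Pores in one sample
# (hypothetical values for illustration only).
qemscan_fractions = [0.55, 0.25, 0.05, 0.15]
predicted_fractions = [0.52, 0.27, 0.06, 0.15]
score = r_squared(qemscan_fractions, predicted_fractions)
```

A value of 1 means predictions match the reference exactly, while 0 means they do no better than always predicting the mean fraction.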

[148] Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space

Yan Li,Changyao Tian,Renqiu Xia,Ning Liao,Weiwei Guo,Junchi Yan,Hongsheng Li,Jifeng Dai,Hao Li,Xue Yang

Main category: cs.CV

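TL;DR: AdapTok is an adaptive video tokenization method that allocates tokens dynamically to improve the efficiency of video reconstruction and generation.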

Details

Motivation: Decide how many tokens each frame of a video should receive, improving token efficiency and generation quality. Method: A block-wise masking strategy and a block causal scorer, combined with integer linear programming to allocate tokens dynamically. Result: Reconstruction quality and generation performance improve markedly on UCF-101 and Kinetics-600. Conclusion: AdapTok enables efficient, dynamic video modeling under a controllable token budget. Abstract: We propose AdapTok, an adaptive temporal causal video tokenizer that can flexibly allocate tokens for different frames based on video content. AdapTok is equipped with a block-wise masking strategy that randomly drops tail tokens of each block during training, and a block causal scorer to predict the reconstruction quality of video frames using different numbers of tokens. During inference, an adaptive token allocation strategy based on integer linear programming is further proposed to adjust token usage given predicted scores. Such design allows for sample-wise, content-aware, and temporally dynamic token allocation under a controllable overall budget. Extensive experiments for video reconstruction and generation on UCF-101 and Kinetics-600 demonstrate the effectiveness of our approach. Without additional image data, AdapTok consistently improves reconstruction quality and generation performance under different token budgets, allowing for more scalable and token-efficient generative video modeling.
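The allocation step can be read as a multiple-choice knapsack: pick one token count per block so that the summed predicted quality is maximal and the total stays within the budget. The paper solves this with integer linear programming; the sketch below swaps in an exact dynamic program over the used budget, which finds an optimal allocation for small instances without an ILP solver. The score tables and names are hypothetical.

```python
def allocate_tokens(scores, budget):
    """scores[b] maps a candidate token count for block b to its predicted
    quality. Returns one token count per block maximizing total quality
    with sum(counts) <= budget (dynamic program, not an ILP solver)."""
    best = {0: (0.0, [])}  # tokens used -> (total quality, chosen counts)
    for options in scores:
        nxt = {}
        for used, (total, picks) in best.items():
            for n, quality in options.items():
                u = used + n
                if u > budget:
                    continue
                if u not in nxt or total + quality > nxt[u][0]:
                    nxt[u] = (total + quality, picks + [n])
        best = nxt
    if not best:
        raise ValueError("budget too small for any allocation")
    return max(best.values(), key=lambda entry: entry[0])[1]


# Block 0 is nearly static (extra tokens help little); block 1 has motion.
predicted_quality = [
    {1: 0.5, 2: 0.6, 4: 0.65},
    {1: 0.2, 2: 0.7, 4: 0.9},
]
allocation = allocate_tokens(predicted_quality, budget=5)
```

With a budget of five tokens the static block gets one token and the dynamic block four, which is the content-aware behavior the abstract describes.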

[149] SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding

Haoning Wu,Xiao Huang,Yaohui Chen,Ya Zhang,Yanfeng Wang,Weidi Xie

Main category: cs.CV

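TL;DR: The paper studies the 3D spatial perception and understanding abilities of multimodal large language models (MLLMs), proposes two benchmarks, VGBench and SpatialScore, and develops the SpatialAgent system for evaluation.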

Details

Motivation: Investigate whether existing MLLMs possess 3D spatial perception and understanding abilities, filling a gap in current research. Method: (1) Propose VGBench to assess visual geometry perception; (2) integrate 11 datasets to build the SpatialScore benchmark; (3) develop SpatialAgent, a multi-agent system supporting two reasoning paradigms. Result: The evaluation reveals persistent challenges for MLLMs in spatial reasoning and demonstrates the effectiveness of SpatialAgent. Conclusion: SpatialScore offers valuable insights and a rigorous benchmark for the further development of MLLMs. Abstract: Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding are less explored. This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities? Concretely, we make the following contributions in this paper: (i) we introduce VGBench, a benchmark specifically designed to assess MLLMs for visual geometry perception, e.g., camera pose and motion estimation; (ii) we propose SpatialScore, the most comprehensive and diverse multimodal spatial understanding benchmark to date, integrating VGBench with relevant data from the other 11 existing datasets. This benchmark comprises 28K samples across various spatial understanding tasks, modalities, and QA formats, along with a carefully curated challenging subset, SpatialScore-Hard; (iii) we develop SpatialAgent, a novel multi-agent system incorporating 9 specialized tools for spatial understanding, supporting both Plan-Execute and ReAct reasoning paradigms; (iv) we conduct extensive evaluations to reveal persistent challenges in spatial reasoning while demonstrating the effectiveness of SpatialAgent. We believe SpatialScore will offer valuable insights and serve as a rigorous benchmark for the next evolution of MLLMs.

Runsen Xu,Weiyao Wang,Hao Tang,Xingyu Chen,Xiaodong Wang,Fu-Jen Chu,Dahua Lin,Matt Feiszli,Kevin J. Liang

Main category: cs.CV

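TL;DR: The paper proposes a framework that strengthens the multi-frame spatial understanding of multimodal large language models (MLLMs) by integrating depth perception, visual correspondence and dynamic perception, and introduces the MultiSPA dataset and benchmark.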

Details

Motivation: MLLMs perform well on visual tasks, but their spatial understanding is limited to single images and cannot serve applications such as robotics that require multi-frame reasoning. Method: A framework combining depth perception, visual correspondence and dynamic perception, trained and evaluated on the newly built MultiSPA dataset (more than 27 million samples). Result: The resulting Multi-SpatialMLLM model significantly outperforms baselines and proprietary systems on the benchmark, showing scalable and generalizable multi-frame reasoning. Conclusion: The model shows multi-task benefits and potential as a multi-frame reward annotator in robotics and related fields. Abstract: Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for robotics and other real-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with robust multi-frame spatial understanding by integrating depth perception, visual correspondence, and dynamic perception. Central to our approach is the MultiSPA dataset, a novel, large-scale collection of more than 27 million samples spanning diverse 3D and 4D scenes. Alongside MultiSPA, we introduce a comprehensive benchmark that tests a wide spectrum of spatial tasks under uniform metrics. Our resulting model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable, generalizable multi-frame reasoning. We further observe multi-task benefits and early indications of emergent capabilities in challenging scenarios, and showcase how our model can serve as a multi-frame reward annotator for robotics.

[151] Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO

Chengzhuo Tong,Ziyu Guo,Renrui Zhang,Wenyu Shan,Xinyu Wei,Zhenghao Xing,Hongsheng Li,Pheng-Ann Heng

Main category: cs.CV

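TL;DR: The paper examines the role of reinforcement learning (RL) in improving the Chain-of-Thought (CoT) reasoning of large language models (LLMs), compares the DPO and GRPO algorithms for autoregressive image generation, and analyzes how reward models affect their generalization.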

Details

Motivation: Fill the gap in applying RL to autoregressive image generation by analyzing the strengths and weaknesses of different RL strategies and the domain-specific challenges. Method: Comprehensively evaluate GRPO and DPO for autoregressive image generation and study how different reward models affect their capabilities. Result: GRPO and DPO each have distinct advantages, and reward models with stronger generalization can raise the generalization potential of the RL algorithms; three scaling strategies are explored to improve performance. Conclusion: The study suggests new directions for developing more effective RL algorithms for robust CoT reasoning in autoregressive image generation. Abstract: Recent advancements underscore the significant role of Reinforcement Learning (RL) in enhancing the Chain-of-Thought (CoT) reasoning capabilities of large language models (LLMs). Two prominent RL algorithms, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO), are central to these developments, showcasing different pros and cons. Autoregressive image generation, also interpretable as a sequential CoT reasoning process, presents unique challenges distinct from LLM-based CoT reasoning. These encompass ensuring text-image consistency, improving image aesthetic quality, and designing sophisticated reward models, rather than relying on simpler rule-based rewards. While recent efforts have extended RL to this domain, these explorations typically lack an in-depth analysis of the domain-specific challenges and the characteristics of different RL strategies. To bridge this gap, we provide the first comprehensive investigation of the GRPO and DPO algorithms in autoregressive image generation, evaluating their in-domain performance and out-of-domain generalization, while scrutinizing the impact of different reward models on their respective capabilities. Our findings reveal that GRPO and DPO exhibit distinct advantages, and crucially, that reward models possessing stronger intrinsic generalization capabilities potentially enhance the generalization potential of the applied RL algorithms. Furthermore, we systematically explore three prevalent scaling strategies to enhance both their in-domain and out-of-domain proficiency, deriving unique insights into efficiently scaling performance for each paradigm. We hope our study paves a new path for inspiring future work on developing more effective RL algorithms to achieve robust CoT reasoning in the realm of autoregressive image generation. Code is released at https://github.com/ZiyuGuo99/Image-Generation-CoT
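To make the contrast between the two algorithms concrete, the sketch below writes out their core quantities in the commonly published form: the DPO loss on one preference pair, and the group-relative advantage used by GRPO. It is a scalar toy, not the training code from the paper; the log-probabilities and rewards are made-up numbers, and `beta` and `eps` are typical but assumed defaults.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin), where
    margin compares policy-vs-reference log-ratios of chosen and rejected."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


def grpo_advantages(rewards, eps=1e-8):
    """GRPO advantage: each sampled output's reward, standardized within
    the group of outputs generated for the same prompt."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# DPO needs a preferred/rejected image pair scored offline.
pair_loss = dpo_loss(-10.0, -12.0, -11.0, -11.0)

# GRPO needs a reward for every image sampled online for one prompt.
group_adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

The difference in required inputs is the practical point: DPO consumes fixed preference pairs, whereas GRPO queries a reward model on freshly sampled outputs, which is why the choice of reward model matters for both in different ways.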

[152] SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward

Kaixuan Fan,Kaituo Feng,Haoming Lyu,Dongzhan Zhou,Xiangyu Yue

Main category: cs.CV

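TL;DR: The paper proposes SophiaVL-R1, which improves the reasoning of multimodal large language models by adding a reward signal for the thinking process and an improved Trust-GRPO method.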

Details

Motivation: Existing MLLMs trained with rule-based reinforcement learning lack supervision over the thinking process, so they may learn sub-optimal reasoning strategies that hurt generalization. Method: A thinking reward model scores the quality of the thinking process; Trust-GRPO weights that reward dynamically by its trustworthiness; and an annealing training strategy gradually reduces the thinking reward. Result: SophiaVL-R1 performs strongly on multiple benchmarks and even outperforms models with far more parameters. Conclusion: Supervising the thinking process and adjusting rewards dynamically markedly improves reasoning and generalization. Abstract: Recent advances have shown success in eliciting strong reasoning abilities in multimodal large language models (MLLMs) through rule-based reinforcement learning (RL) with outcome rewards. However, this paradigm typically lacks supervision over the thinking process leading to the final outcome. As a result, the model may learn sub-optimal reasoning strategies, which can hinder its generalization ability. In light of this, we propose SophiaVL-R1, as an attempt to add reward signals for the thinking process in this paradigm. To achieve this, we first train a thinking reward model that evaluates the quality of the entire thinking process. Given that the thinking reward may be unreliable for certain samples due to reward hacking, we propose the Trust-GRPO method, which assigns a trustworthiness weight to the thinking reward during training. This weight is computed based on the thinking reward comparison of responses leading to correct answers versus incorrect answers, helping to mitigate the impact of potentially unreliable thinking rewards. Moreover, we design an annealing training strategy that gradually reduces the thinking reward over time, allowing the model to rely more on the accurate rule-based outcome reward in later training stages. Experiments show that our SophiaVL-R1 surpasses a series of reasoning MLLMs on various benchmarks (e.g., MathVista, MMMU), demonstrating strong reasoning and generalization capabilities. Notably, our SophiaVL-R1-7B even outperforms LLaVA-OneVision-72B on most benchmarks, despite the latter having 10 times more parameters. All code, models, and datasets are made publicly available at https://github.com/kxfan2002/SophiaVL-R1.
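The abstract says the trustworthiness weight comes from comparing thinking rewards of responses with correct versus incorrect answers, and that the thinking reward is annealed over training, but it gives no formulas. The sketch below is one illustrative way to realize both ideas and should not be read as the paper's definition: the exponential penalty, the exponential decay schedule, and the constants `alpha` and `decay_steps` are all assumptions.

```python
import math


def trust_weight(thinking_correct, thinking_wrong):
    """Assumed rule: full trust when responses with correct answers receive
    thinking rewards at least as high as those with wrong answers;
    otherwise trust shrinks exponentially with the gap."""
    if not thinking_correct or not thinking_wrong:
        return 1.0  # nothing to compare against within this group
    gap = (sum(thinking_correct) / len(thinking_correct)
           - sum(thinking_wrong) / len(thinking_wrong))
    return 1.0 if gap >= 0 else math.exp(gap)


def combined_reward(outcome, thinking, trust, step, alpha=0.3, decay_steps=100.0):
    """Outcome reward plus a trust-weighted thinking reward whose influence
    is annealed toward zero as training proceeds (assumed schedule)."""
    return outcome + trust * alpha * math.exp(-step / decay_steps) * thinking


# Group where the thinking reward looks unreliable: wrong answers score higher.
suspicious = trust_weight([0.2], [0.6])
early = combined_reward(1.0, 0.5, trust=1.0, step=0)
late = combined_reward(1.0, 0.5, trust=1.0, step=500)
```

Whatever the exact formulas, the intended effect is the same: the thinking reward counts less when it disagrees with answer correctness and less again late in training, leaving the rule-based outcome reward dominant.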

[153] Let Androids Dream of Electric Sheep: A Human-like Image Implication Understanding and Reasoning Framework

Chenhao Zhang,Yazhe Niu

Main category: cs.CV

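TL;DR: The paper proposes the Let Androids Dream (LAD) framework, which tackles image implication understanding with a three-stage approach and outperforms existing models.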

Details

Motivation: Existing AI systems miss the cultural, emotional and contextual cues needed to understand image implications; LAD aims to fill this gap. Method: LAD uses a three-stage framework: Perception (converting visual information into text), Search (integrating cross-domain knowledge) and Reasoning (generating context-aligned image implications). Result: LAD performs strongly on English and Chinese image implication benchmarks and surpasses GPT-4o on some tasks. Conclusion: LAD offers a new approach to image implication understanding and advances vision-language reasoning and human-AI interaction. Abstract: Metaphorical comprehension in images remains a critical challenge for AI systems, as existing models struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. While multimodal large language models (MLLMs) excel in basic Visual Question Answering (VQA) tasks, they struggle with a fundamental limitation on image implication tasks: contextual gaps that obscure the relationships between different visual elements and their abstract meanings. Inspired by the human cognitive process, we propose Let Androids Dream (LAD), a novel framework for image implication understanding and reasoning. LAD addresses missing context through a three-stage framework: (1) Perception: converting visual information into rich and multi-level textual representations, (2) Search: iteratively searching and integrating cross-domain knowledge to resolve ambiguity, and (3) Reasoning: generating context-aligned image implications via explicit reasoning. Our framework with the lightweight GPT-4o-mini model achieves SOTA performance compared to 15+ MLLMs on the English image implication benchmark and a huge improvement on the Chinese benchmark, performing comparably to the GPT-4o model on Multiple-Choice Questions (MCQ) and outperforming it by 36.7% on Open-Style Questions (OSQ). Additionally, our work provides new insights into how AI can more effectively interpret image implications, advancing the field of vision-language reasoning and human-AI interaction. Our project is publicly available at https://github.com/MING-ZCH/Let-Androids-Dream-of-Electric-Sheep.

[154] CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms

Shilin Yan,Jiaming Han,Joey Tsai,Hongwei Xue,Rongyao Fang,Lingyi Hong,Ziyu Guo,Ray Zhang

Main category: cs.CV

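TL;DR: The paper proposes CrossLMM, which uses a dual cross-attention mechanism to reduce the number of video tokens and lower computational cost while maintaining performance.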

Details

Motivation: As inputs grow more complex (e.g., long video sequences), the token count surges and computational cost grows quadratically, so efficient video token compression is urgently needed. Method: Reduce visual tokens by pooling, and add visual-to-visual and text-to-visual cross-attention inside the LLM layers to improve token efficiency and preserve information. Result: Comparable or better performance on a range of video LMM benchmarks with substantially fewer computational resources. Conclusion: CrossLMM addresses video token compression effectively and suggests a new direction for efficient multimodal models. Abstract: The advent of Large Multimodal Models (LMMs) has significantly enhanced Large Language Models (LLMs) to process and interpret diverse data modalities (e.g., image and video). However, as input complexity increases, particularly with long video sequences, the number of required tokens has grown significantly, leading to quadratically growing computational costs. This has made the efficient compression of video tokens in LMMs, while maintaining performance integrity, a pressing research challenge. In this paper, we introduce CrossLMM, decoupling long video sequences from LMMs via a dual cross-attention mechanism, which substantially reduces visual token quantity with minimal performance degradation. Specifically, we first implement a significant token reduction from pretrained visual encoders through a pooling methodology. Then, within LLM layers, we employ a visual-to-visual cross-attention mechanism, wherein the pooled visual tokens function as queries against the original visual token set. This module enables more efficient token utilization while retaining fine-grained informational fidelity. In addition, we introduce a text-to-visual cross-attention mechanism, in which the text tokens are enhanced through interaction with the original visual tokens, enriching the visual comprehension of the text tokens. Comprehensive empirical evaluation demonstrates that our approach achieves comparable or superior performance across diverse video-based LMM benchmarks, despite utilizing substantially fewer computational resources.
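The visual-to-visual step can be pictured as follows: pool the visual tokens into a short sequence, then let each pooled token attend over the full original set so fine detail can flow back in. The sketch below is a bare single-head version with no learned projections and average pooling over consecutive tokens; those simplifications, the pooling window `k`, and the function names are assumptions for illustration.

```python
import math


def avg_pool(tokens, k):
    """Average every k consecutive tokens into one pooled token."""
    groups = [tokens[i:i + k] for i in range(0, len(tokens), k)]
    return [[sum(col) / len(group) for col in zip(*group)] for group in groups]


def cross_attention(queries, keys_values):
    """Scaled dot-product attention with queries from one set and
    keys/values from another (no learned projections in this sketch)."""
    d = len(queries[0])
    outputs = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, kv)) / math.sqrt(d)
                  for kv in keys_values]
        top = max(logits)
        weights = [math.exp(l - top) for l in logits]  # stable softmax
        total = sum(weights)
        outputs.append([sum(w * kv[j] for w, kv in zip(weights, keys_values)) / total
                        for j in range(d)])
    return outputs


# Four original visual tokens are reduced to two pooled tokens, which then
# query the original four to recover fine-grained information.
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
pooled = avg_pool(visual_tokens, 2)
refined = cross_attention(pooled, visual_tokens)
```

Only the pooled tokens enter the LLM's sequence, so the expensive self-attention runs over the short sequence while this cross-attention scales with the product of the pooled and original lengths.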

[155] ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark

Sara Ghaboura,Ketan More,Wafa Alghallabi,Omkar Thawakar,Jorma Laaksonen,Hisham Cholakkal,Salman Khan,Rao Muhammad Anwer

Main category: cs.CV

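TL;DR: The paper introduces ARB, the first multimodal reasoning benchmark for Arabic, designed to evaluate the step-by-step reasoning of LMMs in Arabic across 11 domains; it finds that existing models struggle with coherence, faithfulness, and cultural grounding.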

Details

Motivation: Existing benchmarks focus mostly on English and overlook languages with rich linguistic and cultural contexts such as Arabic; this gap needs to be filled. Method: Builds the ARB benchmark, comprising 1,356 multimodal samples and 5,119 human-curated reasoning steps, and evaluates 12 state-of-the-art LMMs. Result: Models perform poorly on coherence, faithfulness, and cultural grounding. Conclusion: ARB provides a structured framework for diagnosing multimodal reasoning and supports the development of inclusive, transparent, and culturally aware AI. Abstract: As Large Multimodal Models (LMMs) become more capable, there is growing interest in evaluating their reasoning processes alongside their final outputs. However, most benchmarks remain focused on English, overlooking languages with rich linguistic and cultural contexts, such as Arabic. To address this gap, we introduce the Comprehensive Arabic Multimodal Reasoning Benchmark (ARB), the first benchmark designed to evaluate step-by-step reasoning in Arabic across both textual and visual modalities. ARB spans 11 diverse domains, including visual reasoning, document understanding, OCR, scientific analysis, and cultural interpretation. It comprises 1,356 multimodal samples paired with 5,119 human-curated reasoning steps and corresponding actions. We evaluated 12 state-of-the-art open- and closed-source LMMs and found persistent challenges in coherence, faithfulness, and cultural grounding. ARB offers a structured framework for diagnosing multimodal reasoning in underrepresented languages and marks a critical step toward inclusive, transparent, and culturally aware AI systems. We release the benchmark, rubric, and evaluation suite to support future research and reproducibility. Code available at: https://github.com/mbzuai-oryx/ARB

[156] GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

Chengqi Duan,Rongyao Fang,Yuqing Wang,Kun Wang,Linjiang Huang,Xingyu Zeng,Hongsheng Li,Xihui Liu

Main category: cs.CV

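TL;DR: GoT-R1 is a visual generation framework that uses reinforcement learning to strengthen semantic-spatial reasoning, significantly improving image generation quality for complex text prompts.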

Details

Motivation: Existing visual generation models struggle with complex prompts that specify multiple objects with precise spatial relationships and attributes. Method: Uses a reinforcement learning framework with a dual-stage, multi-dimensional reward mechanism in which MLLMs evaluate both the reasoning process and the final output. Result: Performs strongly on the T2I-CompBench benchmark, especially on tasks involving precise spatial relationships and attribute binding. Conclusion: GoT-R1 successfully brings sophisticated reasoning capabilities to visual generation and advances image generation. Abstract: Visual generation models have made remarkable progress in creating realistic images from text prompts, yet struggle with complex prompts that specify multiple objects with precise spatial relationships and attributes. Effective handling of such prompts requires explicit reasoning about the semantic content and spatial layout. We present GoT-R1, a framework that applies reinforcement learning to enhance semantic-spatial reasoning in visual generation. Building upon the Generation Chain-of-Thought approach, GoT-R1 enables models to autonomously discover effective reasoning strategies beyond predefined templates through carefully designed reinforcement learning. To achieve this, we propose a dual-stage multi-dimensional reward framework that leverages MLLMs to evaluate both the reasoning process and final output, enabling effective supervision across the entire generation pipeline. The reward system assesses semantic alignment, spatial accuracy, and visual quality in a unified approach. Experimental results demonstrate significant improvements on the T2I-CompBench benchmark, particularly in compositional tasks involving precise spatial relationships and attribute binding. GoT-R1 advances the state-of-the-art in image generation by successfully transferring sophisticated reasoning capabilities to the visual generation domain. To facilitate future research, we make our code and pretrained models publicly available at https://github.com/gogoduan/GoT-R1.

cs.GR

[157] Dynamic Caustics by Ultrasonically Modulated Liquid Surface

Koki Nagakura,Tatsuki Fushimi,Ayaka Tsutsui,Yoichi Ochiai

Main category: cs.GR

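TL;DR: Proposes a method for generating dynamic caustic patterns using dual-optimised holographic fields and a Phased Array Transducer (PAT), with a Digital Twin framework providing real-time feedback and optimisation.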

Details

Motivation: Extends research on static caustic optimisation and ultrasonic manipulation by exploring dynamic caustic generation with a liquid surface as the refractive medium, closing a gap in practical realisation. Method: Uses computational techniques to shape the fluid surface dynamically, combines them with a Digital Twin framework for iterative feedback and optimisation, and manipulates the liquid surface directly with ultrasound. Result: Experiments show that the method can generate continuous animations and complex caustic patterns at high frequencies; contrast and resolution fall short of solid-surface methods, but the approach offers real-time adaptability and scalability. Conclusion: The method has potential applications in interactive displays, artistic installations, and educational tools; future research will focus on improving pattern resolution and complexity. Abstract: This paper presents a method for generating dynamic caustic patterns by utilising dual-optimised holographic fields with Phased Array Transducer (PAT). Building on previous research in static caustic optimisation and ultrasonic manipulation, this approach employs computational techniques to dynamically shape fluid surfaces, thereby creating controllable and real-time caustic images. The system employs a Digital Twin framework, which enables iterative feedback and refinement, thereby improving the accuracy and quality of the caustic patterns produced. This paper extends the foundational work in caustic generation by integrating liquid surfaces as refractive media. This concept has previously been explored in simulations but not fully realised in practical applications. The utilisation of ultrasound to directly manipulate these surfaces enables the generation of dynamic caustics with a high degree of flexibility. The Digital Twin approach further enhances this process by allowing for precise adjustments and optimisation based on real-time feedback. Experimental results demonstrate the technique's capacity to generate continuous animations and complex caustic patterns at high frequencies. Although there are limitations in contrast and resolution compared to solid-surface methods, this approach offers advantages in terms of real-time adaptability and scalability. This technique has the potential to be applied in a number of areas, including interactive displays, artistic installations and educational tools. This research builds upon the work of previous researchers in the fields of caustics optimisation, ultrasonic manipulation, and computational displays. Future research will concentrate on enhancing the resolution and intricacy of the generated patterns.

[158] From Reality to Virtual Worlds: The Role of Photogrammetry in Game Development

Santiago Berrezueta-Guzman,Andrei Koshelev,Stefan Wagner

Main category: cs.GR

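TL;DR: The paper examines the efficiency, accuracy, and integration advantages of RealityCapture in VR game development; users slightly preferred manually designed models for small objects, while developers found that the tool saves time and maintains high quality.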

Details

Motivation: To study the use of RealityCapture in VR game development and assess its strengths and weaknesses relative to traditional modeling. Method: Evaluates RealityCapture's efficiency, reconstruction accuracy, and integration with Unreal Engine, and compares user preferences between designed models and photogrammetry-generated models. Result: Photogrammetry enhances realism, but users preferred manually designed small objects; RealityCapture significantly reduces development time while maintaining high quality. Conclusion: RealityCapture is a practical tool for VR development, and future AI-driven optimization and cloud processing could broaden its applications. Abstract: Photogrammetry is transforming digital content creation by enabling the rapid conversion of real-world objects into highly detailed 3D models. This paper evaluates the role of RealityCapture, a GPU-accelerated photogrammetry tool, in game development for Virtual Reality (VR). We assess its efficiency, reconstruction accuracy, and integration with Unreal Engine, comparing its advantages and limitations against traditional modeling workflows. Additionally, we examined user preferences between designed 3D assets and photogrammetry-generated models. The results revealed that while photogrammetry enhances realism and interactivity, users slightly preferred manually designed models for small, manipulable elements because of the level of detail. However, from a developer perspective, RealityCapture significantly reduces development time while maintaining geometric precision and photorealistic textures. Despite its reliance on high-performance hardware, its automation, scalability, and seamless integration with real-time rendering engines make it a valuable tool for game developers and VR creators. Future improvements in AI-driven optimization and cloud-based processing could enhance accessibility, broadening its applications in gaming, cultural heritage preservation, and simulation.

cs.CL

[159] BR-TaxQA-R: A Dataset for Question Answering with References for Brazilian Personal Income Tax Law, including case law

Juvenal Domingos Júnior,Augusto Faria,E. Seiti de Oliveira,Erick de Brito,Matheus Teotonio,Andre Assumpção,Diedre Carmo,Roberto Lotufo,Jayr Pereira

Main category: cs.CL

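TL;DR: BR-TaxQA-R is a new dataset supporting question answering on Brazilian personal income tax law; a Retrieval-Augmented Generation (RAG) pipeline built on it outperforms commercial tools in response relevancy, but expert validation of legal validity remains necessary.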

Details

Motivation: To provide high-quality question-answering support for Brazilian personal income tax law by combining legal texts with AI techniques. Method: Uses a RAG pipeline (OpenAI embeddings for search and GPT-4o-mini for generation), compares different text segmentation strategies, and evaluates with RAGAS-based metrics. Result: The custom RAG pipeline outperforms commercial systems in Response Relevancy, while commercial models score higher on Factual Correctness and fluency. Conclusion: AI-generated answers in the legal domain require expert validation; the BR-TaxQA-R dataset is publicly available. Abstract: This paper presents BR-TaxQA-R, a novel dataset designed to support question answering with references in the context of Brazilian personal income tax law. The dataset contains 715 questions from the 2024 official Q&A document published by Brazil's Internal Revenue Service, enriched with statutory norms and administrative rulings from the Conselho Administrativo de Recursos Fiscais (CARF). We implement a Retrieval-Augmented Generation (RAG) pipeline using OpenAI embeddings for searching and GPT-4o-mini for answer generation. We compare different text segmentation strategies and benchmark our system against commercial tools such as ChatGPT and Perplexity.ai using RAGAS-based metrics. Results show that our custom RAG pipeline outperforms commercial systems in Response Relevancy, indicating stronger alignment with user queries, while commercial models achieve higher scores in Factual Correctness and fluency. These findings highlight a trade-off between legally grounded generation and linguistic fluency. Crucially, we argue that human expert evaluation remains essential to ensure the legal validity of AI-generated answers in high-stakes domains such as taxation. BR-TaxQA-R is publicly available at https://huggingface.co/datasets/unicamp-dl/BR-TaxQA-R.

[160] Extracting Probabilistic Knowledge from Large Language Models for Bayesian Network Parameterization

Aliakbar Nafar,Kristen Brent Venable,Zijun Cui,Parisa Kordjamshidi

Main category: cs.CL

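TL;DR: The paper investigates how to use the probabilistic knowledge in Large Language Models (LLMs) to produce probability estimates for real-world events via Bayesian Networks (BNs), and explores the potential of LLMs for parameterizing BNs and reducing systematic biases.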

Details

Motivation: The potential of LLMs as factual knowledge bases has been demonstrated, but their ability to generate probabilistic knowledge about real-world events remains understudied. Method: Queries LLMs for the conditional probabilities of events, uses them to parameterize Bayesian Networks, and compares against baselines such as random and uniform distributions. Result: Experiments show that LLM-provided probability estimates are meaningful across domains (e.g., healthcare and finance) and can serve as expert priors that refine distributions when data is scarce. Conclusion: The study proposes a strategy for automatically constructing Bayesian Networks by combining probabilistic knowledge from LLMs with small amounts of real-world data, and establishes a baseline for assessing how well LLMs extract probabilistic knowledge. Abstract: Large Language Models (LLMs) have demonstrated potential as factual knowledge bases; however, their capability to generate probabilistic knowledge about real-world events remains understudied. This paper investigates using probabilistic knowledge inherent in LLMs to derive probability estimates for statements concerning events and their interrelationships captured via a Bayesian Network (BN). Using LLMs in this context allows for the parameterization of BNs, enabling probabilistic modeling within specific domains. Experiments on eighty publicly available Bayesian Networks, from healthcare to finance, demonstrate that querying LLMs about the conditional probabilities of events provides meaningful results when compared to baselines, including random and uniform distributions, as well as approaches based on next-token generation probabilities. We explore how these LLM-derived distributions can serve as expert priors to refine distributions extracted from minimal data, significantly reducing systematic biases. Overall, this work introduces a promising strategy for automatically constructing Bayesian Networks by combining probabilistic knowledge extracted from LLMs with small amounts of real-world data. Additionally, we evaluate several prompting strategies for eliciting probabilistic knowledge from LLMs and establish the first comprehensive baseline for assessing LLM performance in extracting probabilistic knowledge.
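One common way to use an elicited distribution as an "expert prior" is to treat it as Dirichlet pseudo-counts and add the observed counts. The abstract does not say that the paper uses exactly this scheme, so the sketch below is an assumption; the function name `posterior_cpt_row` and the `prior_strength` parameter are hypothetical.

```python
def posterior_cpt_row(llm_prior, counts, prior_strength=10.0):
    """Blend an LLM-elicited distribution with observed counts for one CPT row.

    Assumption: the prior is interpreted as Dirichlet pseudo-counts
    alpha_k = prior_strength * p_k, and the posterior mean is returned.
    """
    total_prior = sum(llm_prior)
    alphas = [prior_strength * p / total_prior for p in llm_prior]
    total = sum(alphas) + sum(counts)
    return [(a + c) / total for a, c in zip(alphas, counts)]

# LLM says P(state 0) = 0.8, but four observations favour state 1 (1 vs 3).
row = posterior_cpt_row([0.8, 0.2], counts=[1, 3], prior_strength=6.0)
```

With few observations the result stays close to the elicited prior; as counts grow, the data dominates, which matches the abstract's description of refining distributions extracted from minimal data.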

[161] Aligning Dialogue Agents with Global Feedback via Large Language Model Reward Decomposition

Dong Won Lee,Hae Won Park,Cynthia Breazeal,Louis-Philippe Morency

Main category: cs.CL

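TL;DR: Proposes an LLM-based reward decomposition framework that aligns dialogue agents from a single session-level feedback signal: a pretrained LLM infers fine-grained local rewards, which are distilled into a lightweight reward model for RL fine-tuning.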

Details

Motivation: Traditional methods require manually designed rewards and fine-grained human feedback; the reasoning ability of LLMs can be used to decompose rewards automatically. Method: Proposes text-only and multimodal variants that decompose rewards from the dialogue transcript and from behavioral cues (such as pitch, gaze, and facial affect), respectively, and distills the result into a lightweight model for RL fine-tuning. Result: Outperforms existing reward decomposition methods in human evaluations of conversation quality, indicating that LLMs are strong reward decomposers. Conclusion: LLMs can effectively replace manual reward shaping and granular feedback, improving the alignment of dialogue agents. Abstract: We propose a large language model based reward decomposition framework for aligning dialogue agents using only a single session-level feedback signal. We leverage the reasoning capabilities of a frozen, pretrained large language model (LLM) to infer fine-grained local implicit rewards by decomposing global, session-level feedback. Our first text-only variant prompts the LLM to perform reward decomposition using only the dialogue transcript. The second multimodal variant incorporates additional behavioral cues, such as pitch, gaze, and facial affect, expressed as natural language descriptions. These inferred turn-level rewards are distilled into a lightweight reward model, which we utilize for RL-based fine-tuning for dialogue generation. We evaluate both text-only and multimodal variants against state-of-the-art reward decomposition methods and demonstrate notable improvements in human evaluations of conversation quality, suggesting that LLMs are strong reward decomposers that obviate the need for manual reward shaping and granular human feedback.

[162] Citation Parsing and Analysis with Language Models

Parth Sarin,Juan Pablo Alperin

Main category: cs.CL

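TL;DR: The paper proposes a tool based on open-weight language models for parsing and marking up citations, aiming to reduce inequalities in global knowledge-sharing networks.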

Details

Motivation: To address inequalities in global knowledge production and dissemination, in particular the marginalization of Global South scholars in indexing services. Method: Uses open-weight language models to mark up and parse citations, and evaluates several models on the annotation task. Result: Current language models parse citations with high accuracy, and the small model Qwen3-0.6B performs especially well. Conclusion: Such a tool could substantially improve the fidelity of citation networks, improve research indexing and discovery, and advance metascientific research. Abstract: A key type of resource needed to address global inequalities in knowledge production and dissemination is a tool that can support journals in understanding how knowledge circulates. The absence of such a tool has resulted in comparatively less information about networks of knowledge sharing in the Global South. In turn, this gap authorizes the exclusion of researchers and scholars from the South in indexing services, reinforcing colonial arrangements that de-center and minoritize those scholars. In order to support citation network tracking on a global scale, we investigate the capacity of open-weight language models to mark up manuscript citations in an indexable format. We assembled a dataset of matched plaintext and annotated citations from preprints and published research papers. Then, we evaluated a number of open-weight language models on the annotation task. We find that, even out of the box, today's language models achieve high levels of accuracy on identifying the constituent components of each citation, outperforming state-of-the-art methods. Moreover, the smallest model we evaluated, Qwen3-0.6B, can parse all fields with high accuracy in $2^5$ passes, suggesting that post-training is likely to be effective in producing small, robust citation parsing models. Such a tool could greatly improve the fidelity of citation networks and thus meaningfully improve research indexing and discovery, as well as further metascientific research.

[163] Training Step-Level Reasoning Verifiers with Formal Verification Tools

Ryo Kamoi,Yusen Zhang,Nan Zhang,Sarkar Snigdha Sarathi Das,Rui Zhang

Main category: cs.CL

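TL;DR: The paper proposes FoVer, which uses formal verification tools to automatically annotate step-level error labels for training Process Reward Models (PRMs), addressing the high cost of human annotation and limited task generalization.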

Details

Motivation: Existing PRMs depend on costly human annotation and are limited to math reasoning tasks; automatic dataset creation and generalization across tasks need to be addressed. Method: Uses formal verification tools (such as Z3 and Isabelle) to annotate step-level error labels automatically, synthesizes a training dataset, and trains LLM-based PRMs. Result: PRMs trained with FoVer perform well across diverse reasoning tasks, significantly outperform baseline models, and match or exceed state-of-the-art models trained on human-annotated labels. Conclusion: Through automatic annotation and cross-task generalization, FoVer broadens the applicability of PRMs and offers an efficient solution for reasoning tasks. Abstract: Process Reward Models (PRMs), which provide step-by-step feedback on the reasoning generated by Large Language Models (LLMs), are receiving increasing attention. However, two key research gaps remain: collecting accurate step-level error labels for training typically requires costly human annotation, and existing PRMs are limited to math reasoning problems. In response to these gaps, this paper aims to address the challenges of automatic dataset creation and the generalization of PRMs to diverse reasoning tasks. To achieve this goal, we propose FoVer, an approach for training PRMs on step-level error labels automatically annotated by formal verification tools, such as Z3 for formal logic and Isabelle for theorem proof, which provide automatic and accurate verification for symbolic tasks. Using this approach, we synthesize a training dataset with error labels on LLM responses for formal logic and theorem proof tasks without human annotation. Although this data synthesis is feasible only for tasks compatible with formal verification, we observe that LLM-based PRMs trained on our dataset exhibit cross-task generalization, improving verification across diverse reasoning tasks. Specifically, PRMs trained with FoVer significantly outperform baseline PRMs based on the original LLMs and achieve competitive or superior results compared to state-of-the-art PRMs trained on labels annotated by humans or stronger models, as measured by step-level verification on ProcessBench and Best-of-K performance across 12 reasoning benchmarks, including MATH, AIME, ANLI, MMLU, and BBH. The datasets, models, and code are provided at https://github.com/psunlpgroup/FoVer.

[164] Pre-training Large Memory Language Models with Internal and External Knowledge

Linxi Zhao,Sofian Zalouk,Christian K. Belardi,Justin Lovelace,Jin Peng Zhou,Kilian Q. Weinberger,Yoav Artzi,Jennifer J. Sun

Main category: cs.CL

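TL;DR: The paper proposes a new class of language model, LMLM, which stores factual knowledge in both internal weights and an external database, reducing reliance on memorization and enabling editable, verifiable knowledge management.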

Details

Motivation: In conventional neural language models, knowledge is distributed opaquely and is hard to inspect or update. Method: Proposes LMLM, whose pre-training recipe stores factual knowledge in both internal weights and an external database and masks externally retrieved factual values from the training loss to reduce reliance on memorization. Result: LMLMs perform comparably to significantly larger, knowledge-dense models on standard benchmarks while offering an explicit, editable, and verifiable knowledge base. Conclusion: LMLM represents a fundamental shift in how language models interact with and manage factual knowledge. Abstract: Neural language models are black-boxes -- both linguistic patterns and factual knowledge are distributed across billions of opaque parameters. This entangled encoding makes it difficult to reliably inspect, verify, or update specific facts. We propose a new class of language models, Large Memory Language Models (LMLM) with a pre-training recipe that stores factual knowledge in both internal weights and an external database. Our approach strategically masks externally retrieved factual values from the training loss, thereby teaching the model to perform targeted lookups rather than relying on memorization in model weights. Our experiments demonstrate that LMLMs achieve competitive performance compared to significantly larger, knowledge-dense LLMs on standard benchmarks, while offering the advantages of explicit, editable, and verifiable knowledge bases. This work represents a fundamental shift in how language models interact with and manage factual knowledge.
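The idea of masking externally retrieved values from the training loss can be sketched as a per-token loss mask. This is only an illustration under stated assumptions: the function name `masked_nll` is hypothetical, the probabilities are made-up example values, and the paper's actual token marking and loss are not specified in the abstract.

```python
import math

def masked_nll(token_probs, is_retrieved_value):
    """Mean negative log-likelihood over tokens NOT flagged as retrieved values.

    token_probs: model probability assigned to each target token.
    is_retrieved_value: True where the token came from an external lookup
    and should therefore contribute nothing to the loss.
    """
    kept = [-math.log(p)
            for p, masked in zip(token_probs, is_retrieved_value)
            if not masked]
    if not kept:
        return 0.0  # every token was masked, so there is nothing to learn from
    return sum(kept) / len(kept)

# The third token is a retrieved fact value; its low probability is ignored.
probs = [0.5, 0.25, 0.01, 0.5]
mask = [False, False, True, False]
loss = masked_nll(probs, mask)
```

Because the masked token never contributes a gradient, the model is not pushed to memorize the fact itself, only to produce the surrounding text and the lookup.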

[165] Explaining Puzzle Solutions in Natural Language: An Exploratory Study on 6x6 Sudoku

Anirudh Maiya,Razan Alghamdi,Maria Leonor Pacheco,Ashutosh Trivedi,Fabio Somenzi

Main category: cs.CL

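TL;DR: Evaluates five Large Language Models (LLMs) on solving and explaining Sudoku puzzles and finds significant shortcomings in their ability to explain.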

Details

Motivation: To study the ability of LLMs to give trustworthy, gradual, and tailored explanations in human-AI collaborative decision-making, using Sudoku as a case study. Method: Evaluates five LLMs on solving and explaining 6x6 Sudoku puzzles. Result: One LLM shows limited success in solving puzzles, but no model can provide explanations that reflect strategic reasoning or intuitive problem-solving. Conclusion: LLMs must overcome major challenges in explanation before they can become effective collaborative partners. Abstract: The success of Large Language Models (LLMs) in human-AI collaborative decision-making hinges on their ability to provide trustworthy, gradual, and tailored explanations. Solving complex puzzles, such as Sudoku, offers a canonical example of this collaboration, where clear and customized explanations often hold greater importance than the final solution. In this study, we evaluate the performance of five LLMs in solving and explaining 6x6 Sudoku puzzles. While one LLM demonstrates limited success in solving puzzles, none can explain the solution process in a manner that reflects strategic reasoning or intuitive problem-solving. These findings underscore significant challenges that must be addressed before LLMs can become effective partners in human-AI collaborative decision-making.

[166] Leveraging Online Data to Enhance Medical Knowledge in a Small Persian Language Model

Mehrdad ghassabi,Pedram Rostami,Hamidreza Baradaran Kashani,Amirhossein Poursina,Zahra Kazemi,Milad Tavakoli

Main category: cs.CL

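TL;DR: The study fine-tunes a small language model on Persian online medical data (including medical magazines and doctor-patient QA pairs) to improve its performance in the medical domain.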

Details

Motivation: Small language models perform poorly in the medical domain for low-resource languages such as Persian, and no Persian medical dataset has been available. Method: Crawls Persian medical magazines and doctor-patient QA pairs to build the first Persian medical dataset and uses it to fine-tune a baseline model. Result: The fine-tuned model outperforms the baseline on medical question answering, with improved accuracy. Conclusion: Open-access online data can strengthen small language models in the medical domain, offering a new solution for Persian medical AI applications in resource-constrained environments. Abstract: The rapid advancement of language models has demonstrated the potential of artificial intelligence in the healthcare industry. However, small language models struggle with specialized domains in low-resource languages like Persian. While numerous medical-domain websites exist in Persian, no curated dataset or corpus has been available, making ours the first of its kind. This study explores the enhancement of medical knowledge in a small language model by leveraging accessible online data, including a crawled corpus from medical magazines and a dataset of real doctor-patient QA pairs. We fine-tuned a baseline model using our curated data to improve its medical knowledge. Benchmark evaluations demonstrate that the fine-tuned model achieves improved accuracy in medical question answering and provides better responses compared to its baseline. This work highlights the potential of leveraging open-access online data to enrich small language models in medical fields, providing a novel solution for Persian medical AI applications suitable for resource-constrained environments.

[167] Causal Interventions Reveal Shared Structure Across English Filler-Gap Constructions

Sasha Boguraev,Christopher Potts,Kyle Mahowald

Main category: cs.CL

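TL;DR: The paper explores how causal interpretability methods applied to Large Language Models (LLMs) can reveal the abstract mechanisms they learn, thereby advancing linguistic theory.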

Details

Motivation: Linguistic theory needs deeper evidence about syntactic structure, and LLMs offer a rich source of it. Analyzing their internal mechanisms can clarify how they process English filler-gap dependency constructions (e.g., questions, relative clauses). Method: Uses experiments based on Distributed Interchange Interventions to analyze the abstract mechanisms LLMs apply to filler-gap dependencies. Result: Experiments show that LLMs converge on similar abstract analyses of these constructions and reveal overlooked factors such as frequency, filler type, and context, which could motivate revisions to linguistic theory. Conclusion: Mechanistic analysis of LLM internals can provide new insights and momentum for linguistic theory. Abstract: Large Language Models (LLMs) have emerged as powerful sources of evidence for linguists seeking to develop theories of syntax. In this paper, we argue that causal interpretability methods, applied to LLMs, can greatly enhance the value of such evidence by helping us characterize the abstract mechanisms that LLMs learn to use. Our empirical focus is a set of English filler-gap dependency constructions (e.g., questions, relative clauses). Linguistic theories largely agree that these constructions share many properties. Using experiments based on Distributed Interchange Interventions, we show that LLMs converge on similar abstract analyses of these constructions. These analyses also reveal previously overlooked factors -- relating to frequency, filler type, and surrounding context -- that could motivate changes to standard linguistic theory. Overall, these results suggest that mechanistic, internal analyses of LLMs can push linguistic theory forward.

[168] SLMEval: Entropy-Based Calibration for Human-Aligned Evaluation of Large Language Models

Roland Daynauth,Christopher Clarke,Krisztian Flautner,Lingjia Tang,Jason Mars

Main category: cs.CL

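TL;DR: SLMEval is a new calibration method based on entropy maximization that significantly improves the correlation between language model evaluation and human judgment on real-world tasks while substantially reducing cost.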

Details

Motivation: Existing calibration techniques perform poorly on open-ended tasks, correlating weakly or even negatively with human judgment, so a more general calibration method is needed. Method: Proposes SLMEval, which estimates a latent quality distribution through entropy maximization over a small amount of human preference data and reweights evaluator scores accordingly. Result: SLMEval markedly improves correlation with human judgment on real-world tasks and a public benchmark (e.g., a Spearman correlation of 0.57) and reduces cost by 5-30x. Conclusion: SLMEval is an efficient and general calibration method suited to evaluating open-ended tasks. Abstract: The LLM-as-a-Judge paradigm offers a scalable, reference-free approach for evaluating language models. Although several calibration techniques have been proposed to better align these evaluators with human judgment, prior studies focus primarily on narrow, well-structured benchmarks. As a result, it remains unclear whether such calibrations generalize to real-world, open-ended tasks. In this work, we show that SOTA calibrated evaluators often fail in these settings, exhibiting weak or even negative correlation with human judgments. To address this, we propose SLMEval, a novel and efficient calibration method based on entropy maximization over a small amount of human preference data. By estimating a latent distribution over model quality and reweighting evaluator scores accordingly, SLMEval achieves strong correlation with human evaluations across two real-world production use cases and the public benchmark. For example, on one such task, SLMEval achieves a Spearman correlation of 0.57 with human judgments, while G-Eval yields a negative correlation. In addition, SLMEval reduces evaluation costs by 5-30x compared to GPT-4-based calibrated evaluators such as G-Eval.
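The headline metric here is the Spearman rank correlation between evaluator scores and human judgments. As a reminder of what that number measures, the following sketch computes it from scratch; it illustrates the metric only, not the SLMEval calibration procedure, and the helper names are hypothetical.

```python
def ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

# Hypothetical evaluator scores versus human ratings for four systems.
agreement = spearman([1, 2, 3, 4], [10, 20, 30, 40])
disagreement = spearman([1, 2, 3, 4], [4, 3, 2, 1])
```

A value near 1 means the evaluator orders systems the way humans do; a negative value, as reported for the uncalibrated baseline, means the ordering tends to be reversed.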

[169] LAGO: Few-shot Crosslingual Embedding Inversion Attacks via Language Similarity-Aware Graph Optimization

Wenrui Yu,Yiyi Chen,Johannes Bjerva,Sokol Kosta,Qiongxiu Li

Main category: cs.CL

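TL;DR: LAGO is a graph optimization method based on language similarity for few-shot cross-lingual embedding inversion attacks, and it significantly improves attack transferability.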

Details

Motivation: To address privacy vulnerabilities in multilingual NLP systems that arise when languages are treated as independent. Method: Models linguistic relationships with a graph-based constrained distributed optimization framework, using syntactic and lexical similarity as edge constraints to enable collaborative parameter learning across languages. Result: Experiments show that LAGO substantially improves attack transferability with very little data (as few as 10 samples per language), raising the Rouge-L score by 10-20%. Conclusion: Language similarity is a critical factor in the transferability of inversion attacks, which calls for attention to language-aware, privacy-preserving multilingual embeddings. Abstract: We propose LAGO - Language Similarity-Aware Graph Optimization - a novel approach for few-shot cross-lingual embedding inversion attacks, addressing critical privacy vulnerabilities in multilingual NLP systems. Unlike prior work in embedding inversion attacks that treat languages independently, LAGO explicitly models linguistic relationships through a graph-based constrained distributed optimization framework. By integrating syntactic and lexical similarity as edge constraints, our method enables collaborative parameter learning across related languages. Theoretically, we show this formulation generalizes prior approaches, such as ALGEN, which emerges as a special case when similarity constraints are relaxed. Our framework uniquely combines Frobenius-norm regularization with linear inequality or total variation constraints, ensuring robust alignment of cross-lingual embedding spaces even with extremely limited data (as few as 10 samples per language). Extensive experiments across multiple languages and embedding models demonstrate that LAGO substantially improves the transferability of attacks with 10-20% increase in Rouge-L score over baselines. This work establishes language similarity as a critical factor in inversion attack transferability, urging renewed focus on language-aware privacy-preserving multilingual embeddings.

[170] Ranking Free RAG: Replacing Re-ranking with Selection in RAG for Sensitive Domains

Yash Saxena,Anpur Padia,Mandar S Chaudhary,Kalpa Gunaratna,Srinivasan Parthasarathy,Manas Gaur

Main category: cs.CL

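TL;DR: METEORA is a new RAG method that replaces conventional re-ranking with rationale-driven selection, improving generation accuracy, explainability, and robustness to adversarial content.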

Details

Motivation: Traditional RAG relies on similarity-based retrieval and re-ranking and lacks explainability and robustness to adversarial content. Method: METEORA has two stages: (1) a preference-tuned LLM generates rationales; (2) evidence chunks are selected based on those rationales, and a verifier filters out misleading content. Result: Across six datasets, METEORA improves generation accuracy by 33.34%, uses about 50% fewer chunks, and raises the adversarial F1 score from 0.10 to 0.44. Conclusion: METEORA significantly improves the explainability, accuracy, and adversarial robustness of RAG. Abstract: Traditional Retrieval-Augmented Generation (RAG) pipelines rely on similarity-based retrieval and re-ranking, which depend on heuristics such as top-k, and lack explainability, interpretability, and robustness against adversarial content. To address this gap, we propose a novel method METEORA that replaces re-ranking in RAG with a rationale-driven selection approach. METEORA operates in two stages. First, a general-purpose LLM is preference-tuned to generate rationales conditioned on the input query using direct preference optimization. These rationales guide the evidence chunk selection engine, which selects relevant chunks in three stages: pairing individual rationales with corresponding retrieved chunks for local relevance, global selection with elbow detection for adaptive cutoff, and context expansion via neighboring chunks. This process eliminates the need for top-k heuristics. The rationales are also used for a consistency check using a Verifier LLM to detect and filter poisoned or misleading content for safe generation. The framework provides explainable and interpretable evidence flow by using rationales consistently across both selection and verification. Our evaluation across six datasets spanning legal, financial, and academic research domains shows that METEORA improves generation accuracy by 33.34% while using approximately 50% fewer chunks than state-of-the-art re-ranking methods. In adversarial settings, METEORA significantly improves the F1 score from 0.10 to 0.44 over the state-of-the-art perplexity-based defense baseline, demonstrating strong resilience to poisoning attacks. Code available at: https://anonymous.4open.science/r/METEORA-DC46/README.md
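The "elbow detection for adaptive cutoff" step replaces a fixed top-k with a data-dependent number of chunks. The abstract does not give the exact rule, so the sketch below uses one simple heuristic as an assumption: cut at the largest drop between consecutive sorted scores. The function names and example scores are hypothetical.

```python
def elbow_cutoff(scores):
    """Number of top-scoring items to keep, cutting at the largest score gap.

    Assumption: the 'elbow' is the biggest drop between neighbours once the
    scores are sorted in descending order; other definitions are possible.
    """
    ordered = sorted(scores, reverse=True)
    if len(ordered) < 2:
        return len(ordered)
    gaps = [ordered[i] - ordered[i + 1] for i in range(len(ordered) - 1)]
    return gaps.index(max(gaps)) + 1

def select_chunks(chunks, scores):
    """Keep the chunks that rank above the elbow, best first."""
    keep = elbow_cutoff(scores)
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:keep]]

# Three chunks score clearly higher than the other two, so three are kept.
selected = select_chunks(["a", "b", "c", "d", "e"],
                         [0.40, 0.91, 0.35, 0.88, 0.85])
```

Unlike top-k, the number of retained chunks adapts to each query: a query with one dominant chunk keeps one, while a query with many similarly relevant chunks keeps more.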

[171] NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning

Wei Liu,Siya Qi,Xinyu Wang,Chen Qian,Yali Du,Yulan He

Main category: cs.CL

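TL;DR: NOVER is a reinforcement learning framework that needs no external verifier and only standard supervised fine-tuning data; it applies to a wide range of text-to-text tasks and outperforms comparable models.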

Details

Motivation: Existing methods depend on external verifiers, which limits their applicability, and reward models are costly to train. Method: Proposes the NOVER framework, which requires only standard supervised fine-tuning data and no external verifier. Result: NOVER outperforms a same-size model distilled from large reasoning models such as DeepSeek R1 671B by 7.7 percent. Conclusion: NOVER opens new possibilities for optimizing large language models, such as inverse incentive training. Abstract: Recent advances such as DeepSeek R1-Zero highlight the effectiveness of incentive training, a reinforcement learning paradigm that computes rewards solely based on the final answer part of a language model's output, thereby encouraging the generation of intermediate reasoning steps. However, these methods fundamentally rely on external verifiers, which limits their applicability to domains like mathematics and coding where such verifiers are readily available. Although reward models can serve as verifiers, they require high-quality annotated data and are costly to train. In this work, we propose NOVER, NO-VERifier Reinforcement Learning, a general reinforcement learning framework that requires only standard supervised fine-tuning data with no need for an external verifier. NOVER enables incentive training across a wide range of text-to-text tasks and outperforms the model of the same size distilled from large reasoning models such as DeepSeek R1 671B by 7.7 percent. Moreover, the flexibility of NOVER enables new possibilities for optimizing large language models, such as inverse incentive training.

[172] Prototypical Human-AI Collaboration Behaviors from LLM-Assisted Writing in the Wild

Sheshera Mysore,Debarati Das,Hancheng Cao,Bahareh Sarrafzadeh

Main category: cs.CL

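TL;DR: The paper studies how users collaborate with Large Language Models (LLMs) over multi-turn interactions, introduces Prototypical Human-AI Collaboration Behaviors (PATHs), and analyzes how these behaviors relate to users' writing intents.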

Details

Motivation: To examine how users actively steer LLM generations in complex writing tasks rather than passively accepting output. Method: Analyzes user interaction data from Bing Copilot and WildChat, identifies prototypical collaboration behaviors (PATHs), and measures their association with writing intents. Result: A small number of PATHs explain most of the interaction behavior, and specific writing intents correlate significantly with specific PATHs. Conclusion: The findings have important implications for optimizing LLMs and aligning them with user intent. Abstract: As large language models (LLMs) are used in complex writing workflows, users engage in multi-turn interactions to steer generations to better fit their needs. Rather than passively accepting output, users actively refine, explore, and co-construct text. We conduct a large-scale analysis of this collaborative behavior for users engaged in writing tasks in the wild with two popular AI assistants, Bing Copilot and WildChat. Our analysis goes beyond simple task classification or satisfaction estimation common in prior work and instead characterizes how users interact with LLMs through the course of a session. We identify prototypical behaviors in how users interact with LLMs in prompts following their original request. We refer to these as Prototypical Human-AI Collaboration Behaviors (PATHs) and find that a small group of PATHs explain a majority of the variation seen in user-LLM interaction. These PATHs span users revising intents, exploring texts, posing questions, adjusting style or injecting new content. Next, we find statistically significant correlations between specific writing intents and PATHs, revealing how users' intents shape their collaboration behaviors. We conclude by discussing the implications of our findings on LLM alignment.

[173] OpenEthics: A Comprehensive Ethical Evaluation of Open-Source Generative Large Language Models

Burak Erinç Çetin,Yıldırım Özen,Elif Naz Demiryılmaz,Kaan Engür,Cagri Toraman

Main category: cs.CL

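TL;DR: The paper conducts a broad ethical evaluation of 29 open-source large language models covering robustness, reliability, safety, and fairness, filling gaps in evaluation breadth, language coverage, and model diversity.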

Details

Motivation: Most existing studies focus on narrow ethical dimensions with limited diversity of languages and models; a more comprehensive ethical assessment is needed to guide safer model development. Method: Uses LLM-as-a-Judge to evaluate model behavior in English and Turkish across four ethical aspects. Result: Open-source models perform well on safety and fairness and show good robustness, while reliability remains a concern; models with more parameters tend to behave more ethically, with Gemma and Qwen performing best. Conclusion: Ethical evaluation can be conducted independently of the language used, and larger models tend to perform better, which offers guidance for future model development. Abstract: Generative large language models present significant potential but also raise critical ethical concerns. Most studies focus on narrow ethical dimensions, with limited diversity of languages and models. To address these gaps, we conduct a broad ethical evaluation of 29 recent open-source large language models using a novel data collection including four ethical aspects: robustness, reliability, safety, and fairness. We analyze model behavior in both a commonly used language, English, and a low-resource language, Turkish. Our aim is to provide a comprehensive ethical assessment and guide safer model development by filling existing gaps in evaluation breadth, language coverage, and model diversity. Our experimental results, based on LLM-as-a-Judge, reveal that optimization efforts for many open-source models appear to have prioritized safety and fairness, and demonstrated good robustness while reliability remains a concern. We demonstrate that ethical evaluation can be effectively conducted independently of the language used. In addition, models with larger parameter counts tend to exhibit better ethical performance, with Gemma and Qwen models demonstrating the most ethical behavior among those evaluated.

[174] Internal and External Impacts of Natural Language Processing Papers

Yu Zhang

Main category: cs.CL

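TL;DR: Analyzes the impact of papers from top-tier NLP conferences from 1979 to 2024 and finds that language modeling has the widest influence among both academics and the broader public, while linguistic foundations have lower impact. Ethics and fairness topics receive attention in policy documents but far fewer academic citations.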

Details

Motivation: To study how NLP research is taken up by, and influences, both the academic community and broader society (patents, media, policy). Method: Analyzes citation data from research articles and external sources (patents, media, and policy documents) to assess the impact of different NLP topics. Result: Language modeling has the widest influence and linguistic foundations have lower impact; ethics and fairness topics receive attention in policy documents but fewer academic citations; external domains show distinct preferences between practical NLP applications and societal implications. Conclusion: The impact of NLP research differs between academia and external domains, with language modeling and ethics standing out as key topics. Abstract: We investigate the impacts of NLP research published in top-tier conferences (i.e., ACL, EMNLP, and NAACL) from 1979 to 2024. By analyzing citations from research articles and external sources such as patents, media, and policy documents, we examine how different NLP topics are consumed both within the academic community and by the broader public. Our findings reveal that language modeling has the widest internal and external influence, while linguistic foundations have lower impacts. We also observe that internal and external impacts generally align, but topics like ethics, bias, and fairness show significant attention in policy documents with far fewer academic citations. Additionally, external domains exhibit distinct preferences, with patents focusing on practical NLP applications and media and policy documents engaging more with the societal implications of NLP models.

[175] Small Language Models in the Real World: Insights from Industrial Text Classification

Lujun Li,Lama Sleem,Niccolo' Gentile,Geoffrey Nichil,Radu State

Main category: cs.CL

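TL;DR: This paper examines the potential of small language models for text classification, focusing on prompt engineering and supervised fine-tuning, and evaluates their performance and efficiency in industrial scenarios.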

Details

Motivation: With the rise of ChatGPT and other Transformer models, decoder-only models such as Llama perform well in text classification, but they are inefficient at inference, depend on prompt quality, and consume substantial resources, so whether smaller models can be effective is an important question. Method: Comprehensively evaluates prompt engineering and supervised fine-tuning in industrial scenarios (email classification, legal document categorization, and long-text classification), analyzing small models' performance and VRAM efficiency. Result: The study reports that small models perform well on specific tasks with efficient resource use, making them suitable for local deployment. Conclusion: Small language models show promise in industrial settings, especially resource-constrained ones, but model selection and methodology need further exploration. Abstract: With the emergence of ChatGPT, Transformer models have significantly advanced text classification and related tasks. Decoder-only models such as Llama exhibit strong performance and flexibility, yet they suffer from inefficiency at inference due to token-by-token generation, and their effectiveness in text classification tasks heavily depends on prompt quality. Moreover, their substantial GPU resource requirements often limit widespread adoption. Thus, the question of whether smaller language models are capable of effectively handling text classification tasks emerges as a topic of significant interest. However, the selection of appropriate models and methodologies remains largely underexplored. In this paper, we conduct a comprehensive evaluation of prompt engineering and supervised fine-tuning methods for transformer-based text classification. Specifically, we focus on practical industrial scenarios, including email classification, legal document categorization, and the classification of extremely long academic texts. We examine the strengths and limitations of smaller models, with particular attention to both their performance and their efficiency in Video Random-Access Memory (VRAM) utilization, thereby providing valuable insights for the local deployment and application of compact models in industrial settings.

[176] BiasLab: Toward Explainable Political Bias Detection with Dual-Axis Annotations and Rationale Indicators

KMA Solaiman

Main category: cs.CL

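TL;DR: BiasLab is a dataset of 300 political news articles annotated for perceived ideological bias and enriched with rationale annotations, supporting explainable modeling of political bias.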

Details

Motivation: To build a dataset that supports transparent, socially aware NLP systems and enables research on bias detection and the effectiveness of explanations. Method: Crowdworkers label each article's sentiment toward the Democratic and Republican parties; the annotation pipeline is refined with worker qualification and pilot-phase analysis, and annotation is also simulated with schema-constrained GPT-4o for comparison with human labels. Result: Quantifies inter-annotator agreement, analyzes misalignment with source-level outlet bias, and reports baseline performance that illustrates the challenge of explainable bias detection. Conclusion: BiasLab provides a practical resource for explainable modeling of political bias, and the dataset is released to encourage further research. Abstract: We present BiasLab, a dataset of 300 political news articles annotated for perceived ideological bias. These articles were selected from a curated 900-document pool covering diverse political events and source biases. Each article is labeled by crowdworkers along two independent scales, assessing sentiment toward the Democratic and Republican parties, and enriched with rationale indicators. The annotation pipeline incorporates targeted worker qualification and was refined through pilot-phase analysis. We quantify inter-annotator agreement, analyze misalignment with source-level outlet bias, and organize the resulting labels into interpretable subsets. Additionally, we simulate annotation using schema-constrained GPT-4o, enabling direct comparison to human labels and revealing mirrored asymmetries, especially in misclassifying subtly right-leaning content. We define two modeling tasks: perception drift prediction and rationale type classification, and report baseline performance to illustrate the challenge of explainable bias detection. BiasLab's rich rationale annotations provide actionable interpretations that facilitate explainable modeling of political bias, supporting the development of transparent, socially aware NLP systems. We release the dataset, annotation schema, and modeling code to encourage research on human-in-the-loop interpretability and the evaluation of explanation effectiveness in real-world settings.

[177] Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning

Gagan Bhatia,Maxime Peyrard,Wei Zhao

Main category: cs.CL

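TL;DR: The paper introduces the date fragmentation ratio, a metric for how faithfully BPE tokenizers preserve date components, together with the DateAugBench benchmark. Experiments show that excessive fragmentation hurts reasoning about uncommon dates, and analysis reveals how LLMs abstract over date fragments.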

Details

Motivation: Modern BPE tokenizers often split dates into meaningless fragments, which undermines robust temporal reasoning, so this fragmentation needs to be quantified and the way LLMs handle date fragments needs to be understood. Method: (1) Proposes the date fragmentation ratio metric; (2) releases the DateAugBench test suite; (3) uses layer-wise probing and causal attention-hop analyses to uncover the date-abstraction mechanism in LLMs. Result: Excessive fragmentation correlates with accuracy drops of up to 10 points on uncommon dates (such as historical or future dates); the larger the model, the faster the date abstraction emerges; the path LLMs follow to assemble date fragments typically differs from human interpretation. Conclusion: Date fragmentation significantly affects temporal reasoning; LLMs repair fragments through an emergent abstraction mechanism, and larger models do so faster. Abstract: Modern BPE tokenizers often split calendar dates into meaningless fragments, e.g., 20250312 $\rightarrow$ 202, 503, 12, inflating token counts and obscuring the inherent structure needed for robust temporal reasoning. In this work, we (1) introduce a simple yet interpretable metric, termed date fragmentation ratio, that measures how faithfully a tokenizer preserves multi-digit date components; (2) release DateAugBench, a suite of 6500 examples spanning three temporal reasoning tasks: context-based date resolution, format-invariance puzzles, and date arithmetic across historical, contemporary, and future regimes; and (3) through layer-wise probing and causal attention-hop analyses, uncover an emergent date-abstraction mechanism whereby large language models stitch together the fragments of month, day, and year components for temporal reasoning. Our experiments show that excessive fragmentation correlates with accuracy drops of up to 10 points on uncommon dates like historical and futuristic dates. Further, we find that the larger the model, the faster the emergent date abstraction that heals date fragments is accomplished. Lastly, we observe a reasoning path that LLMs follow to assemble date fragments, typically differing from human interpretation (year $\rightarrow$ month $\rightarrow$ day).
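
The abstract does not give the formula for the date fragmentation ratio, so the following is only a toy illustration of the idea, not the authors' metric. It assumes a date is "fragmented" whenever one of its components (year, month, day) does not coincide exactly with a single token; the function names `spans` and `toy_fragmentation` are hypothetical.

```python
def spans(pieces):
    """Return (start, end) character offsets for consecutive string pieces."""
    out, pos = [], 0
    for piece in pieces:
        out.append((pos, pos + len(piece)))
        pos += len(piece)
    return out


def toy_fragmentation(tokens, components):
    """Fraction of date components that are NOT preserved as a single token.

    Assumption (not from the paper): a component counts as preserved only
    if its character span is exactly the span of one token.
    """
    if "".join(tokens) != "".join(components):
        raise ValueError("tokens and components must spell the same string")
    token_spans = set(spans(tokens))
    broken = sum(1 for span in spans(components) if span not in token_spans)
    return broken / len(components)


# The abstract's example: 20250312 is split into 202, 503, 12.
# Only the day "12" survives as a whole token, so 2 of 3 components break.
ratio = toy_fragmentation(["202", "503", "12"], ["2025", "03", "12"])
```

A tokenizer that emitted `2025`, `03`, `12` would score 0.0 under this toy definition, which matches the intuition that lower fragmentation preserves more of the date's structure.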

[178] Continually Self-Improving Language Models for Bariatric Surgery Question-Answering

Yash Kumar Atri,Thomas H Shin,Thomas Hartvigsen

Main category: cs.CL

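TL;DR: The paper introduces bRAGgen, an adaptive retrieval-augmented generation (RAG)-based model that addresses barriers patients face in accessing information about bariatric and metabolic surgery (MBS), and validates it with the bRAGq dataset.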

Details

Motivation: The multidisciplinary care required for bariatric and metabolic surgery (MBS) is often hindered by healthcare disparities, leaving patients without timely, accurate information. Method: Develops bRAGgen, which dynamically integrates real-time medical evidence, and introduces the bRAGq dataset for evaluation. Result: bRAGgen substantially outperforms existing models at generating clinically accurate and relevant responses. Conclusion: bRAGgen offers efficient, accurate information support for MBS care and may improve the patient experience. Abstract: While bariatric and metabolic surgery (MBS) is considered the gold standard treatment for severe and morbid obesity, its therapeutic efficacy hinges upon active and longitudinal engagement with multidisciplinary providers, including surgeons, dietitians/nutritionists, psychologists, and endocrinologists. This engagement spans the entire patient journey, from preoperative preparation to long-term postoperative management. However, this process is often hindered by numerous healthcare disparities, such as logistical and access barriers, which impair easy patient access to timely, evidence-based, clinician-endorsed information. To address these gaps, we introduce bRAGgen, a novel adaptive retrieval-augmented generation (RAG)-based model that autonomously integrates real-time medical evidence when response confidence dips below dynamic thresholds. This self-updating architecture ensures that responses remain current and accurate, reducing the risk of misinformation. Additionally, we present bRAGq, a curated dataset of 1,302 bariatric surgery-related questions, validated by an expert bariatric surgeon. bRAGq constitutes the first large-scale, domain-specific benchmark for comprehensive MBS care. In a two-phase evaluation, bRAGgen is benchmarked against state-of-the-art models using both large language model (LLM)-based metrics and expert surgeon review. Across all evaluation dimensions, bRAGgen demonstrates substantially superior performance in generating clinically accurate and relevant responses.

[179] Hierarchical Safety Realignment: Lightweight Restoration of Safety in Pruned Large Vision-Language Models

Yue Li,Xin Yi,Dongsheng Shi,Gerard de Melo,Xiaoling Wang,Linlin Wang

Main category: cs.CL

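TL;DR: Proposes HSR, a lightweight method that restores the safety of pruned large vision-language models by hierarchically realigning attention heads and neurons.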

Details

Motivation: Pruning compresses models but degrades their safety, a problem that needs to be addressed. Method: HSR quantifies each attention head's contribution to safety, identifies the most critical heads, and selectively restores neurons within them, realigning safety hierarchically. Result: HSR notably improves safety performance across different models and pruning strategies. Conclusion: HSR is the first work explicitly focused on restoring safety in LVLMs after pruning, and it is effective. Abstract: With the increasing size of Large Vision-Language Models (LVLMs), network pruning techniques aimed at compressing models for deployment in resource-constrained environments have garnered significant attention. However, we observe that pruning often leads to a degradation in safety performance. To address this issue, we present a novel and lightweight approach, termed Hierarchical Safety Realignment (HSR). HSR operates by first quantifying the contribution of each attention head to safety, identifying the most critical ones, and then selectively restoring neurons directly within these attention heads that play a pivotal role in maintaining safety. This process hierarchically realigns the safety of pruned LVLMs, progressing from the attention head level to the neuron level. We validate HSR across various models and pruning strategies, consistently achieving notable improvements in safety performance. To our knowledge, this is the first work explicitly focused on restoring safety in LVLMs post-pruning.

[180] MPL: Multiple Programming Languages with Large Language Models for Information Extraction

Bo Li,Gexiang Fang,Wei Ye,Zhenghua Xu,Jinglei Zhang,Hao Cheng,Shikun Zhang

Main category: cs.CL

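TL;DR: The paper proposes MPL, a framework that improves information extraction by incorporating multiple programming languages (such as C++ and Java) during supervised fine-tuning, and introduces function-prompt with virtual running to simulate code-style inputs more effectively.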

Details

Motivation: Existing work mainly uses Python to simulate code-style inputs and overlooks the potential of other widely used programming languages (such as C++ and Java) during supervised fine-tuning. Method: Proposes the MPL framework, which combines multiple programming languages, and introduces function-prompt with virtual running. Result: Experiments on a wide range of datasets demonstrate the effectiveness of MPL. Conclusion: By combining multiple programming languages with the new techniques, MPL improves performance on information extraction tasks; the code has been released. Abstract: Recent research in information extraction (IE) focuses on utilizing code-style inputs to enhance structured output generation. The intuition behind this is that the programming languages (PLs) inherently exhibit greater structural organization than natural languages (NLs). This structural advantage makes PLs particularly suited for IE tasks. Nevertheless, existing research primarily focuses on Python for code-style simulation, overlooking the potential of other widely-used PLs (e.g., C++ and Java) during the supervised fine-tuning (SFT) phase. In this research, we propose Multiple Programming Languages with large language models for information extraction (abbreviated as MPL), a novel framework that explores the potential of incorporating different PLs in the SFT phase. Additionally, we introduce function-prompt with virtual running to simulate code-style inputs more effectively and efficiently. Experimental results on a wide range of datasets demonstrate the effectiveness of MPL. Furthermore, we conduct extensive experiments to provide a comprehensive analysis. We have released our code for future research.

Haotian Lan,Yao Gao,Yujun Cheng,Wei Yuan,Kun Wang

Main category: cs.CL

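TL;DR: The paper proposes an LLM framework combining unsupervised and supervised learning to quantify travel expectations in user-generated content, finding that leisure/social expectations drive engagement more than natural/emotional factors.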

Details

Motivation: User-generated content on social media is pivotal for travel decisions, yet existing analytical methods lack scalability. Method: A dual-method LLM framework that pairs unsupervised expectation extraction with survey-informed supervised fine-tuning. Result: Leisure/social expectations drive engagement more than natural/emotional factors. Conclusion: The framework provides a precise tool for tourism analytics and shows the potential of computational social science for marketing optimization. Abstract: Social media's rise establishes user-generated content (UGC) as pivotal for travel decisions, yet analytical methods lack scalability. This study introduces a dual-method LLM framework: unsupervised expectation extraction from UGC paired with survey-informed supervised fine-tuning. Findings reveal leisure/social expectations drive engagement more than foundational natural/emotional factors. By establishing LLMs as precision tools for expectation quantification, we advance tourism analytics methodology and propose targeted strategies for experience personalization and social travel promotion. The framework's adaptability extends to consumer behavior research, demonstrating computational social science's transformative potential in marketing optimization.

[182] KoBALT: Korean Benchmark For Advanced Linguistic Tasks

Hyopil Shin,Sangah Lee,Dongjun Jang,Wooseok Song,Jaeyoon Kim,Chaeyoung Oh,Hyemi Jo,Youngchae Ahn,Sihyun Oh,Hyohyeong Chang,Sunkyoung Kim,Jinsik Lee

Main category: cs.CL

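TL;DR: KoBALT is a comprehensive linguistic benchmark for Korean comprising 700 multiple-choice questions across five linguistic domains, designed to evaluate large language models in Korean.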

Details

Motivation: Conventional benchmarks lack linguistic depth and typological grounding; KoBALT fills this gap with a more thorough evaluation tool for Korean, a morphologically rich language. Method: KoBALT consists of 700 expert-curated linguistic questions with minimal n-gram overlap with standard Korean corpora, which lowers the risk of data contamination. Result: Across 20 contemporary LLMs, the best model reaches 61% accuracy, with large differences between domains: semantics is strongest (66%), while phonology (31%) and morphology (36%) are weak. A human preference evaluation validates KoBALT's effectiveness. Conclusion: KoBALT provides a robust evaluation framework for Korean language models and fills a gap in linguistic evaluation. Abstract: We introduce KoBALT (Korean Benchmark for Advanced Linguistic Tasks), a comprehensive linguistically-motivated benchmark comprising 700 multiple-choice questions spanning 24 phenomena across five linguistic domains: syntax, semantics, pragmatics, phonetics/phonology, and morphology. KoBALT is designed to advance the evaluation of large language models (LLMs) in Korean, a morphologically rich language, by addressing the limitations of conventional benchmarks that often lack linguistic depth and typological grounding. It introduces a suite of expert-curated, linguistically motivated questions with minimal n-gram overlap with standard Korean corpora, substantially mitigating the risk of data contamination and allowing a more robust assessment of true language understanding. Our evaluation of 20 contemporary LLMs reveals significant performance disparities, with the highest-performing model achieving 61% general accuracy but showing substantial variation across linguistic domains - from stronger performance in semantics (66%) to considerable weaknesses in phonology (31%) and morphology (36%). Through human preference evaluation with 95 annotators, we demonstrate a strong correlation between KoBALT scores and human judgments, validating our benchmark's effectiveness as a discriminative measure of Korean language understanding. KoBALT addresses critical gaps in linguistic evaluation for typologically diverse languages and provides a robust framework for assessing genuine linguistic competence in Korean language models.

[183] Veracity Bias and Beyond: Uncovering LLMs' Hidden Beliefs in Problem-Solving Reasoning

Yue Zhou,Barbara Di Eugenio

Main category: cs.CL

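TL;DR: The study finds that although LLMs avoid demographic stereotypes on the surface, they still exhibit biases against certain groups when handling math, coding, commonsense, and writing problems, in the form of attribution bias and evaluation bias.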

Details

Motivation: Although LLMs are explicitly aligned against demographic stereotypes, they still exhibit biases in various social contexts. This study aims to reveal demographic biases in how LLMs judge solution veracity. Method: Runs experiments with five human value-aligned LLMs on mathematics, coding, commonsense, and writing problems to identify attribution bias and evaluation bias. Result: In math and coding, LLMs attribute fewer correct solutions to African-American groups, while Asian authorship is least preferred in writing evaluation. LLMs also automatically assign stereotypical colors to demographic groups in visualization code. Conclusion: Demographic bias is not limited to surface-level stereotypes but reaches into the model's reasoning, which is a warning for deployment in educational and evaluation settings. Abstract: Despite LLMs' explicit alignment against demographic stereotypes, they have been shown to exhibit biases under various social contexts. In this work, we find that LLMs exhibit concerning biases in how they associate solution veracity with demographics. Through experiments across five human value-aligned LLMs on mathematics, coding, commonsense, and writing problems, we reveal two forms of such veracity biases: Attribution Bias, where models disproportionately attribute correct solutions to certain demographic groups, and Evaluation Bias, where models' assessment of identical solutions varies based on perceived demographic authorship. Our results show pervasive biases: LLMs consistently attribute fewer correct solutions and more incorrect ones to African-American groups in math and coding, while Asian authorships are least preferred in writing evaluation. In additional studies, we show LLMs automatically assign racially stereotypical colors to demographic groups in visualization code, suggesting these biases are deeply embedded in models' reasoning processes. Our findings indicate that demographic bias extends beyond surface-level stereotypes and social context provocations, raising concerns about LLMs' deployment in educational and evaluation settings.

[184] LLMs Are Not Scorers: Rethinking MT Evaluation with Generation-Based Methods

Hyang Cui

Main category: cs.CL

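TL;DR: The paper proposes a generation-based method for machine translation quality estimation that uses decoder-only LLMs to generate high-quality references and then scores translations by semantic similarity.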

Details

Motivation: Existing direct scoring methods correlate poorly with human judgments at the segment level, so the evaluation paradigm needs improvement. Method: Uses decoder-only LLMs to generate reference translations, then scores semantic similarity with sentence embeddings. Result: Validated on 8 LLMs and 8 language pairs, the method outperforms direct scoring and non-LLM reference-free metrics. Conclusion: Generation-based evaluation combined with semantic assessment is a promising direction and supports a shift toward hybrid approaches. Abstract: Recent studies have applied large language models (LLMs) to machine translation quality estimation (MTQE) by prompting models to assign numeric scores. Nonetheless, these direct scoring methods tend to show low segment-level correlation with human judgments. In this paper, we propose a generation-based evaluation paradigm that leverages decoder-only LLMs to produce high-quality references, followed by semantic similarity scoring using sentence embeddings. We conduct the most extensive evaluation to date in MTQE, covering 8 LLMs and 8 language pairs. Empirical results show that our method outperforms both intra-LLM direct scoring baselines and external non-LLM reference-free metrics from MTME. These findings demonstrate the strength of generation-based evaluation and support a shift toward hybrid approaches that combine fluent generation with accurate semantic assessment.
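
The scoring step described above can be sketched with cosine similarity, a common choice for comparing sentence embeddings; the abstract does not name the similarity function or the embedding model, so both are assumptions here. The vectors stand in for embeddings of the system translation and the LLM-generated reference, and `score_translation` is a hypothetical helper name.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for degenerate zero vectors
    return dot / (norm_u * norm_v)


def score_translation(hypothesis_vec, reference_vec):
    """Hypothetical scorer: similarity between the embedding of a system
    translation and the embedding of an LLM-generated reference."""
    return cosine_similarity(hypothesis_vec, reference_vec)
```

In a real pipeline the two vectors would come from a sentence-embedding model applied to the hypothesis and the generated reference; that step is omitted because the abstract does not specify it.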

[185] Position of Uncertainty: A Cross-Linguistic Study of Positional Bias in Large Language Models

Menschikov Mikhail,Alexander Kharitonov,Maiia Kotyga,Vadim Porvatov,Anna Zhukovskaya,David Kagramanyan,Egor Shvetsov,Evgeny Burnaev

Main category: cs.CL

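TL;DR: The study finds that large language models exhibit positional bias that is related to linguistic diversity, model uncertainty, and syntax. Experiments across five languages show that the bias is model-driven and examine its implications for prompt engineering.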

Details

Motivation: To examine how positional bias in large language models relates to linguistic diversity and what it means for model behavior and prompt engineering. Method: A cross-linguistic study (English, Russian, German, Hindi, Vietnamese) analyzing how positional bias interacts with model uncertainty, syntax, and prompting. Result: (1) Positional bias is model-driven, with variation across languages; (2) explicit positional guidance reduces accuracy; (3) aligning context with positional bias increases entropy, yet minimal entropy does not predict accuracy; (4) LLMs impose a dominant word order in free-word-order languages. Conclusion: Positional bias is an inherent model property; prompt engineering must account for linguistic diversity, and explicit positional guidance can backfire. Abstract: Large language models exhibit positional bias -- systematic neglect of information at specific context positions -- yet its interplay with linguistic diversity remains poorly understood. We present a cross-linguistic study across five typologically distinct languages (English, Russian, German, Hindi, Vietnamese), examining how positional bias interacts with model uncertainty, syntax, and prompting. Key findings: (1) Positional bias is model-driven, with language-specific variations -- Qwen2.5-7B favors late positions, challenging assumptions of early-token bias; (2) Explicit positional guidance (e.g., correct context is at position X) reduces accuracy across languages, undermining prompt-engineering practices; (3) Aligning context with positional bias increases entropy, yet minimal entropy does not predict accuracy; (4) We further uncover that LLMs differently impose dominant word order in free-word-order languages like Hindi.

[186] Distilling the Implicit Multi-Branch Structure in LLMs' Reasoning via Reinforcement Learning

Shicheng Xu,Liang Pang,Yunchang Zhu,Jia Gu,Zihao Wei,Jingcheng Deng,Feiyang Pan,Huawei Shen,Xueqi Cheng

Main category: cs.CL

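TL;DR: The paper proposes RLKD, a framework that uses reinforcement learning and a Generative Structure Reward Model (GSRM) to improve the reasoning ability of small language models, avoiding the limitations of conventional supervised fine-tuning (SFT).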

Details

Motivation: Conventional SFT captures only surface features of the teacher's reasoning paths and cannot transfer their implicit multi-branch structure, which limits gains in the student's reasoning ability. Method: Proposes the RLKD framework, which combines reinforcement learning with GSRM; reasoning paths are decomposed into meta-reasoning and solving steps, and a reward measures how well the student's reasoning structure aligns with the teacher's. Result: Experiments show that RLKD surpasses standard SFT-RL pipelines even when trained on only 0.1% of the data, unlocking more of the student's reasoning potential. Conclusion: RLKD uses reinforcement learning to transfer the teacher's implicit reasoning structure, offering a new approach to improving the reasoning of small language models. Abstract: Distilling reasoning paths from teacher to student models via supervised fine-tuning (SFT) provides a shortcut for improving the reasoning ability of smaller Large Language Models (LLMs). However, the reasoning paths generated by teacher models often reflect only surface-level traces of their underlying authentic reasoning. Insights from cognitive neuroscience suggest that authentic reasoning involves a complex interweaving between meta-reasoning (which selects appropriate sub-problems from multiple candidates) and solving (which addresses the sub-problem). This implies authentic reasoning has an implicit multi-branch structure. Supervised fine-tuning collapses this rich structure into a flat sequence of token prediction in the teacher's reasoning path, preventing effective distillation of this structure to students. To address this limitation, we propose RLKD, a reinforcement learning (RL)-based distillation framework guided by a novel Generative Structure Reward Model (GSRM). Our GSRM converts reasoning paths into multiple meta-reasoning-solving steps and computes rewards to measure structural alignment between student and teacher reasoning. RLKD combines this reward with RL, enabling student LLMs to internalize the teacher's implicit multi-branch reasoning structure rather than merely mimicking fixed output paths. Experiments show RLKD surpasses standard SFT-RL pipelines even when trained on 0.1% of data under an RL-only regime, unlocking greater student reasoning potential than SFT-based distillation.

[187] EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios

Bin Xu,Yu Bai,Huashan Sun,Yiguan Lin,Siming Liu,Xinyue Liang,Yaolin Li,Yang Gao,Heyan Huang

Main category: cs.CL

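TL;DR: The paper presents EduBench, the first diverse benchmark for educational scenarios, comprising 9 major scenarios and over 4,000 educational contexts, along with multi-dimensional evaluation metrics. Human annotation is used to validate model-generated evaluations, and a small-scale model trained on the data performs close to state-of-the-art large models.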

Details

Motivation: The application of large language models in education remains underexplored and under-optimized; the paper aims to fill this gap. Method: Builds a dataset of diverse educational scenarios, designs multi-dimensional evaluation metrics, validates model outputs with human annotation, and trains a small-scale model. Result: The trained small-scale model performs close to state-of-the-art large models (e.g., Deepseek V3, Qwen Max) on the test set. Conclusion: The work provides a practical foundation for developing and evaluating education-oriented language models; code and data are released. Abstract: As large language models continue to advance, their application in educational contexts remains underexplored and under-optimized. In this paper, we address this gap by introducing the first diverse benchmark tailored for educational scenarios, incorporating synthetic data containing 9 major scenarios and over 4,000 distinct educational contexts. To enable comprehensive assessment, we propose a set of multi-dimensional evaluation metrics that cover 12 critical aspects relevant to both teachers and students. We further apply human annotation to ensure the effectiveness of the model-generated evaluation responses. Additionally, we successfully train a relatively small-scale model on our constructed dataset and demonstrate that it can achieve performance comparable to state-of-the-art large models (e.g., Deepseek V3, Qwen Max) on the test set. Overall, this work provides a practical foundation for the development and evaluation of education-oriented language models. Code and data are released at https://github.com/ybai-nlp/EduBench.

[188] KNN-SSD: Enabling Dynamic Self-Speculative Decoding via Nearest Neighbor Layer Set Optimization

Mingbo Song,Heming Xia,Jun Zhang,Chak Tou Leong,Qiancheng Xu,Wenjie Li,Sujian Li

Main category: cs.CL

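TL;DR: KNN-SSD uses KNN search to match different skipped-layer sets to inputs from different domains, improving the domain generalization of self-speculative decoding and achieving a 1.3x-1.6x speedup in LLM inference.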

Details

Motivation: Self-speculative decoding loses much of its acceleration under domain shift, so its domain generalization needs to improve. Method: Proposes KNN-SSD, which uses KNN search to dynamically match skipped layers to the input domain. Result: Across various models and tasks, KNN-SSD achieves a 1.3x-1.6x inference speedup. Conclusion: KNN-SSD improves the domain adaptability of self-speculative decoding and accelerates LLM inference. Abstract: Speculative Decoding (SD) has emerged as a widely used paradigm to accelerate the inference of large language models (LLMs) without compromising generation quality. It works by efficiently drafting multiple tokens using a compact model and then verifying them in parallel using the target LLM. Notably, Self-Speculative Decoding proposes skipping certain layers to construct the draft model, which eliminates the need for additional parameters or training. Despite its strengths, we observe in this work that drafting with layer skipping exhibits significant sensitivity to domain shifts, leading to a substantial drop in acceleration performance. To enhance the domain generalizability of this paradigm, we introduce KNN-SSD, an algorithm that leverages K-Nearest Neighbor (KNN) search to match different skipped layers with various domain inputs. We evaluated our algorithm in various models and multiple tasks, observing that its application leads to 1.3x-1.6x speedup in LLM inference.
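
The matching step can be pictured as a nearest-neighbor lookup from an input representation to a stored skip-layer configuration. The sketch below is only an illustration under stated assumptions, not the paper's algorithm: it assumes each stored example pairs an embedding with a tuple of layer indices to skip, uses Euclidean distance, and resolves the k neighbors by majority vote. The name `knn_layer_set` and the data layout are hypothetical.

```python
import math
from collections import Counter


def knn_layer_set(query, bank, k=3):
    """Pick a skip-layer set for `query` by majority vote over its k
    nearest stored examples.

    bank: list of (embedding, skip_layers) pairs, where skip_layers is a
    tuple of layer indices (a hypothetical layout, not from the paper).
    """
    nearest = sorted(bank, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(layers for _, layers in nearest)
    return votes.most_common(1)[0][0]


# Toy bank: two clusters of inputs, each associated with its own layer set.
bank = [
    ([0.0, 0.0], (2, 5)),
    ([0.1, 0.0], (2, 5)),
    ([0.0, 0.2], (3, 4)),
    ([5.0, 5.0], (7, 9)),
    ([5.1, 5.0], (7, 9)),
]
chosen = knn_layer_set([0.05, 0.05], bank, k=3)
```

The draft model would then skip the chosen layers for that input; how the layer sets are found in the first place is outside what this sketch covers.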

[189] Can LLMs Simulate Human Behavioral Variability? A Case Study in the Phonemic Fluency Task

Mengyang Qiu,Zoe Brisebois,Siena Sun

Main category: cs.CL

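TL;DR: The study examines whether large language models (LLMs) can simulate individual differences among humans in the phonemic fluency task, finding that LLMs can match average human performance but cannot reproduce the diversity of human behavior.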

Details

Motivation: To test whether LLMs can stand in for human participants in simulating individual differences in cognitive tasks. Method: Evaluates 34 model configurations and compares LLM outputs with responses from 106 human participants in the phonemic fluency task. Result: LLMs can match human averages and lexical preferences but cannot reproduce the diversity and structural flexibility of human behavior. Conclusion: LLMs have key limitations in simulating human cognition and behavior. Abstract: Large language models (LLMs) are increasingly explored as substitutes for human participants in cognitive tasks, but their ability to simulate human behavioral variability remains unclear. This study examines whether LLMs can approximate individual differences in the phonemic fluency task, where participants generate words beginning with a target letter. We evaluated 34 model configurations, varying prompt specificity, sampling temperature, and model type, and compared outputs to responses from 106 human participants. While some configurations, especially Claude 3.7 Sonnet, matched human averages and lexical preferences, none reproduced the scope of human variability. LLM outputs were consistently less diverse and structurally rigid, and LLM ensembles failed to increase diversity. Network analyses further revealed fundamental differences in retrieval structure between humans and models. These results highlight key limitations in using LLMs to simulate human cognition and behavior.

[190] When Do LLMs Admit Their Mistakes? Understanding the Role of Model Belief in Retraction

Yuqing Yang,Robin Jia

Main category: cs.CL

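TL;DR: The paper studies whether large language models (LLMs) admit mistakes when they should know better (termed "retraction") and when and why they do so. LLMs can retract wrong answers but do so infrequently, and retraction is closely tied to the model's internal belief. Experiments show that internal belief causally influences retraction and that simple supervised fine-tuning significantly improves retraction performance.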

Details

Motivation: To understand when and why LLMs retract their wrong answers, and thereby reveal the relationship between a model's internal belief and its behavior. Method: Constructs model-specific datasets to evaluate whether an LLM retracts incorrect answers that contradict its own parametric knowledge, and runs experiments to test how internal belief affects retraction. Result: LLMs can retract wrong answers but do so infrequently; retraction is tied to internal belief; supervised fine-tuning significantly improves retraction performance. Conclusion: Retraction in LLMs is driven by internal belief and can be improved through supervised fine-tuning, suggesting a new route to model self-correction. Abstract: Can large language models (LLMs) admit their mistakes when they should know better? In this work, we define the behavior of acknowledging errors in previously generated answers as "retraction" and aim to understand when and why LLMs choose to retract. We first construct model-specific datasets to evaluate whether a model will retract an incorrect answer that contradicts its own parametric knowledge. While LLMs are capable of retraction, they do so only infrequently. We demonstrate that retraction is closely tied to previously identified indicators of models' internal belief: models fail to retract wrong answers that they "believe" to be factually correct. Steering experiments further demonstrate that internal belief causally influences model retraction. In particular, when the model does not believe its answer, this not only encourages the model to attempt to verify the answer, but also alters attention behavior during self-verification. Finally, we demonstrate that simple supervised fine-tuning significantly improves retraction performance by helping the model learn more accurate internal beliefs. Code and datasets are available at https://github.com/ayyyq/llm-retraction.

[191] Automated Feedback Loops to Protect Text Simplification with Generative AI from Information Loss

Abhay Kumara Sri Krishna Nandiraju,Gondy Leroy,David Kauchak,Arif Ahmed

Main category: cs.CL

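TL;DR: The study examines content that goes missing when generative AI simplifies health information and compares five approaches for restoring it. Results show that adding back missing entities improves the text.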

Details

Motivation: Simplifying health information aids understanding, but generative AI may drop crucial content, so these omissions need to be assessed and repaired. Method: Collects 50 health information texts, simplifies them with gpt-4-0613, compares five approaches for restoring missing elements (entities, words, ranked entities, and random controls), and evaluates with similarity metrics. Result: Adding all missing entities works best and outperforms the other approaches. Current tools can identify entities but are not useful for ranking them. Conclusion: Restoring missing entities improves the quality of simplified text, but entity-ranking tools need improvement. Abstract: Understanding health information is essential in achieving and maintaining a healthy life. We focus on simplifying health information for better understanding. With the availability of generative AI, the simplification process has become efficient and of reasonable quality; however, the algorithms remove information that may be crucial for comprehension. In this study, we compare generative AI to detect missing information in simplified text, evaluate its importance, and fix the text with the missing information. We collected 50 health information texts and simplified them using gpt-4-0613. We compare five approaches to identify missing elements and regenerate the text by inserting the missing elements. These five approaches involve adding missing entities and missing words in various ways: 1) adding all the missing entities, 2) adding all missing words, 3) adding the top-3 entities ranked by gpt-4-0613, and 4, 5) serving as controls for comparison, adding randomly chosen entities. We use cosine similarity and ROUGE scores to evaluate the semantic similarity and content overlap between the original, simplified, and reconstructed simplified text. We do this for both summaries and full text. Overall, we find that adding missing entities improves the text. Adding all the missing entities resulted in better text regeneration, which was better than adding the top-ranked entities or words, or random words. Current tools can identify these entities, but are not valuable in ranking them.
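
To make the content-overlap measurement concrete, here is a minimal sketch of ROUGE-1 recall, the share of the original text's unigrams that also appear in the simplified text. It assumes lowercase whitespace tokenization, which is a simplification; the abstract does not say which ROUGE variant or implementation the authors used, and the function name `rouge1_recall` is chosen here for illustration.

```python
from collections import Counter


def rouge1_recall(reference, candidate):
    """Clipped unigram overlap divided by the number of reference unigrams.

    Assumption: tokens are lowercase whitespace-separated words; standard
    ROUGE tooling typically applies its own tokenization and stemming.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum(min(count, cand_counts[tok]) for tok, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0


original = "take the pill with food daily"
simplified = "take the pill daily"
recall = rouge1_recall(original, simplified)  # 4 of 6 original words kept
```

A drop in recall after simplification flags words from the original that no longer appear, which is the kind of loss the five restoration approaches try to undo.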

[192] Understanding Fact Recall in Language Models: Why Two-Stage Training Encourages Memorization but Mixed Training Teaches Knowledge

Ying Zhang,Benjamin Heinzerling,Dongyuan Li,Ryoma Ishigaki,Yuta Hitomi,Kentaro Inui

Main category: cs.CL

TL;DR: The study examines training strategies for fact recall in language models and finds that mixed training promotes the formation of shared parameters, which improves generalization.

Details

Motivation: Language models perform well on general tasks, yet fact recall remains challenging. Existing strategies such as two-stage training tend to produce rote memorization rather than generalizable recall, and mixed training works but its mechanism is unclear. Method: Analyze shared parameters with cross-task gradient trace, and validate the effect of mixed training on synthetic datasets using the Llama-3.2B and Pythia-2.8B models. Result: Mixed training yields a larger and more centralized set of shared parameters, and these parameters strongly affect both fact-storing and fact-recalling tasks. Conclusion: The emergence of shared parameters may be a key factor in how language models generalize factual knowledge across tasks. Abstract: Fact recall, the ability of language models (LMs) to retrieve specific factual knowledge, remains a challenging task despite their impressive general capabilities. Common training strategies often struggle to promote robust recall behavior: two-stage training, which first trains a model with fact-storing examples (e.g., factual statements) and then with fact-recalling examples (question-answer pairs), tends to encourage rote memorization rather than generalizable fact retrieval. In contrast, mixed training, which jointly uses both types of examples, has been empirically shown to improve the ability to recall facts, but the underlying mechanisms are still poorly understood. In this work, we investigate how these training strategies affect how model parameters are shaped during training and how these differences relate to their ability to recall facts. We introduce cross-task gradient trace to identify shared parameters, those strongly influenced by both fact-storing and fact-recalling examples. Our analysis on synthetic fact recall datasets with the Llama-3.2B and Pythia-2.8B models reveals that mixed training encourages a larger and more centralized set of shared parameters. These findings suggest that the emergence of shared parameters may play a key role in enabling LMs to generalize factual knowledge across task formulations.

[193] SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models

Zirui He,Mingyu Jin,Bo Shen,Ali Payani,Yongfeng Zhang,Mengnan Du

Main category: cs.CL

TL;DR: The paper proposes a supervised steering method based on sparse autoencoders that controls large language model behavior through a sparse latent representation space. Experiments show it outperforms existing methods on multiple tasks.

Details

Motivation: Controlling large language model behavior is difficult in open-ended generation, so a reliable and interpretable steering method is needed. Method: Use sparse autoencoders to obtain sparse latent representations, train linear classifiers to identify task-relevant dimensions, and learn supervised steering vectors constrained to that subspace. Result: On sentiment, truthfulness, and political polarity tasks, the method achieves high success rates with minimal loss in generation quality, and a very small subspace is enough for effective steering. Conclusion: Supervised steering in sparse latent representation spaces gives more targeted and interpretable control over model behavior. Abstract: Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but controlling their behavior reliably remains challenging, especially in open-ended generation settings. This paper introduces a novel supervised steering approach that operates in sparse, interpretable representation spaces. We employ sparse autoencoders (SAEs) to obtain sparse latent representations that aim to disentangle semantic attributes from model activations. Then we train linear classifiers to identify a small subspace of task-relevant dimensions in latent representations. Finally, we learn supervised steering vectors constrained to this subspace, optimized to align with target behaviors. Experiments across sentiment, truthfulness, and politics polarity steering tasks with multiple LLMs demonstrate that our supervised steering vectors achieve higher success rates with minimal degradation in generation quality compared to existing methods. Further analysis reveals that a notably small subspace is sufficient for effective steering, enabling more targeted and interpretable interventions.

[194] The Language of Interoception: Examining Embodiment and Emotion Through a Corpus of Body Part Mentions

Sophie Wu,Jan Philip Wahle,Saif M. Mohammad

Main category: cs.CL

TL;DR: The paper is the first to study the connection between emotion, embodiment, and everyday language in large-scale natural language data. It finds that body part mentions (BPMs) are common in text and significantly associated with emotion and health outcomes.

Details

Motivation: To explore the relationship between emotion, embodiment, and language, filling a gap in large-scale research on how body part mentions relate to emotion and health. Method: Build corpora of body part mentions in online English text (blog posts and tweets), annotate a subset for emotion, and analyze emotional intensity with word-emotion association lexicons. Result: BPMs are common in personal narratives and tweets (5% to 10% of posts) and their usage varies by time and location; text containing BPMs is more emotionally charged and is significantly correlated with poorer health outcomes. Conclusion: Studying body-part-related language can open new directions for NLP, the affective sciences, and research on human wellbeing. Abstract: This paper is the first investigation of the connection between emotion, embodiment, and everyday language in a large sample of natural language data. We created corpora of body part mentions (BPMs) in online English text (blog posts and tweets). This includes a subset featuring human annotations for the emotions of the person whose body part is mentioned in the text. We show that BPMs are common in personal narratives and tweets (~5% to 10% of posts include BPMs) and that their usage patterns vary markedly by time and geographic location. Using word-emotion association lexicons and our annotated data, we show that text containing BPMs tends to be more emotionally charged, even when the BPM is not explicitly used to describe a physical reaction to the emotion in the text. Finally, we discover a strong and statistically significant correlation between body-related language and a variety of poorer health outcomes. In sum, we argue that investigating the role of body-part related words in language can open up valuable avenues of future research at the intersection of NLP, the affective sciences, and the study of human wellbeing.

[195] An Empirical Study on Configuring In-Context Learning Demonstrations for Unleashing MLLMs' Sentimental Perception Capability

Daiqing Wu,Dongbao Yang,Sicheng Zhao,Can Ma,Yu Zhou

Main category: cs.CL

TL;DR: The paper examines the shortcomings of Multimodal Large Language Models (MLLMs) on Multimodal Sentiment Analysis (MSA) under the zero-shot paradigm and substantially improves performance by optimizing demonstration configuration for In-Context Learning (ICL).

Details

Motivation: The zero-shot paradigm performs poorly on multimodal sentiment analysis, casting doubt on whether MLLMs can perceive sentiment, so their potential needs to be verified. Method: Extend the zero-shot paradigm to in-context learning, study three factors of demonstrations (retrieval, presentation, and distribution), and identify and counteract a sentiment prediction bias in MLLMs. Result: The optimized strategies raise average accuracy on six MSA datasets by 15.9% over the zero-shot paradigm and 11.2% over the random ICL baseline. Conclusion: MLLMs do possess sentiment perception capability, and optimizing demonstration configuration substantially improves MSA performance. Abstract: The advancements in Multimodal Large Language Models (MLLMs) have enabled various multimodal tasks to be addressed under a zero-shot paradigm. This paradigm sidesteps the cost of model fine-tuning, emerging as a dominant trend in practical application. Nevertheless, Multimodal Sentiment Analysis (MSA), a pivotal challenge in the quest for general artificial intelligence, fails to accommodate this convenience. The zero-shot paradigm exhibits undesirable performance on MSA, casting doubt on whether MLLMs can perceive sentiments as competently as supervised models. By extending the zero-shot paradigm to In-Context Learning (ICL) and conducting an in-depth study on configuring demonstrations, we validate that MLLMs indeed possess such capability. Specifically, three key factors that cover demonstrations' retrieval, presentation, and distribution are comprehensively investigated and optimized. A sentimental predictive bias inherent in MLLMs is also discovered and later effectively counteracted. By complementing each other, the devised strategies for three factors result in average accuracy improvements of 15.9% on six MSA datasets against the zero-shot paradigm and 11.2% against the random ICL baseline.

[196] Large Language Models based ASR Error Correction for Child Conversations

Anfeng Xu,Tiantian Feng,So Hyun Kim,Somer Bishop,Catherine Lord,Shrikanth Narayanan

Main category: cs.CL

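TL;DR: LLMs show promise in correcting ASR errors in children's conversational speech, but still struggle when incorporating contextual information or when working with autoregressive ASR outputs.

Details

Motivation: ASR transcription accuracy for children's speech is low, and the use of LLMs for it is underexplored, especially in conversational settings. Method: Experimentally evaluate LLMs on correcting ASR errors in two children's conversational speech datasets, using both zero-shot and fine-tuned ASR outputs. Result: LLMs effectively correct zero-shot ASR outputs and fine-tuned CTC-based ASR outputs, but bring limited improvement when contextual information is added or when the outputs come from fine-tuned autoregressive ASR (e.g., Whisper). Conclusion: LLMs have potential for ASR error correction on children's speech, but further research is needed to overcome the remaining challenges. Abstract: Automatic Speech Recognition (ASR) has recently shown remarkable progress, but accurately transcribing children's speech remains a significant challenge. Recent developments in Large Language Models (LLMs) have shown promise in improving ASR transcriptions. However, their applications in child speech including conversational scenarios are underexplored. In this study, we explore the use of LLMs in correcting ASR errors for conversational child speech. We demonstrate the promises and challenges of LLMs through experiments on two children's conversational speech datasets with both zero-shot and fine-tuned ASR outputs. We find that while LLMs are helpful in correcting zero-shot ASR outputs and fine-tuned CTC-based ASR outputs, it remains challenging for LLMs to improve ASR performance when incorporating contextual information or when using fine-tuned autoregressive ASR (e.g., Whisper) outputs.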

[197] Memorization or Reasoning? Exploring the Idiom Understanding of LLMs

Jisu Kim,Youngwoo Shin,Uiji Hwang,Jihun Choi,Richeng Xuan,Taeuk Kim

Main category: cs.CL

TL;DR: The paper studies how large language models (LLMs) process idioms, introduces the multilingual idiom dataset MIDAS, and finds that LLMs handle idioms by combining memorization with reasoning.

Details

Motivation: Idioms challenge LLMs because of their unique linguistic properties, yet little is known about how LLMs process them, especially in multilingual settings. Method: Introduce the MIDAS dataset, which contains idioms in six languages paired with their meanings, and comprehensively evaluate LLMs' idiom processing ability. Result: LLMs rely not only on memorization but also on contextual cues and reasoning, especially for compositional idioms. Conclusion: Idiom understanding in LLMs results from the interplay between internal knowledge retrieval and reasoning-based inference. Abstract: Idioms have long posed a challenge due to their unique linguistic properties, which set them apart from other common expressions. While recent studies have leveraged large language models (LLMs) to handle idioms across various tasks, e.g., idiom-containing sentence generation and idiomatic machine translation, little is known about the underlying mechanisms of idiom processing in LLMs, particularly in multilingual settings. To this end, we introduce MIDAS, a new large-scale dataset of idioms in six languages, each paired with its corresponding meaning. Leveraging this resource, we conduct a comprehensive evaluation of LLMs' idiom processing ability, identifying key factors that influence their performance. Our findings suggest that LLMs rely not only on memorization, but also adopt a hybrid approach that integrates contextual cues and reasoning, especially when processing compositional idioms. This implies that idiom understanding in LLMs emerges from an interplay between internal knowledge retrieval and reasoning-based inference.

[198] Don't Judge Code by Its Cover: Exploring Biases in LLM Judges for Code Evaluation

Jiwon Moon,Yerin Hwang,Dongryeol Lee,Taegwan Kang,Yongil Kim,Kyomin Jung

Main category: cs.CL

TL;DR: The study finds that large language models (LLMs) are biased when evaluating code and cannot fairly judge code that is semantically identical but superficially different.

Details

Motivation: To examine the fairness and robustness of LLMs as code evaluators, particularly on code that is semantically identical but differs in form. Method: Define six types of potential bias and run an empirical study across five programming languages and multiple LLMs. Result: Every LLM judge tested exhibits both positive and negative biases, producing inflated or unfairly low scores. Conclusion: More robust code evaluation methods are needed to reduce bias in LLM-based evaluation. Abstract: With the growing use of large language models (LLMs) as evaluators, their application has expanded to code evaluation tasks, where they assess the correctness of generated code without relying on reference implementations. While this offers scalability and flexibility, it also raises a critical, unresolved question: Can LLM judges fairly and robustly evaluate semantically equivalent code with superficial variations? Functionally correct code often exhibits variations (such as differences in variable names, comments, or formatting) that should not influence its correctness. Yet, whether LLM judges can reliably handle these variations remains unclear. We present the first comprehensive study of this issue, defining six types of potential bias in code evaluation and revealing their systematic impact on LLM judges. Across five programming languages and multiple LLMs, we empirically demonstrate that all tested LLM judges are susceptible to both positive and negative biases, resulting in inflated or unfairly low scores. Moreover, we observe that LLM judges remain vulnerable to these biases even when prompted to generate test cases before scoring, highlighting the need for more robust code evaluation methods.

[199] Explain Less, Understand More: Jargon Detection via Personalized Parameter-Efficient Fine-tuning

Bohao Wu,Qingyun Wang,Yue Guo

Main category: cs.CL

TL;DR: The paper studies how to make personalized jargon detection efficient and scalable. It proposes two strategies, lightweight fine-tuning and personalized prompting, and achieves significant gains under resource constraints.

Details

Motivation: Personalized jargon detection and explanation is essential for making technical documents accessible to readers from different disciplines, but traditional approaches require substantial annotation and compute. Method: Explore two personalization strategies: (1) lightweight fine-tuning with LoRA; (2) personalized prompting that requires no retraining. Also study hybrid approaches that combine limited annotated data with unsupervised user background signals. Result: The personalized LoRA model exceeds GPT-4 by 21.4% in F1 score and the best-performing baseline by 8.3%, and reaches comparable performance with only 10% of the annotated data. Conclusion: The study is the first systematic exploration of efficient, low-resource personalized jargon detection with open-source language models, and it offers a practical path to scalable, user-adaptive NLP systems. Abstract: Personalizing jargon detection and explanation is essential for making technical documents accessible to readers with diverse disciplinary backgrounds. However, tailoring models to individual users typically requires substantial annotation efforts and computational resources due to user-specific finetuning. To address this, we present a systematic study of personalized jargon detection, focusing on methods that are both efficient and scalable for real-world deployment. We explore two personalization strategies: (1) lightweight fine-tuning using Low-Rank Adaptation (LoRA) on open-source models, and (2) personalized prompting, which tailors model behavior at inference time without retraining. To reflect realistic constraints, we also investigate hybrid approaches that combine limited annotated data with unsupervised user background signals. Our personalized LoRA model outperforms GPT-4 by 21.4% in F1 score and exceeds the best performing oracle baseline by 8.3%. Remarkably, our method achieves comparable performance using only 10% of the annotated training data, demonstrating its practicality for resource-constrained settings. Our study is the first to systematically explore efficient, low-resource personalization of jargon detection using open-source language models, offering a practical path toward scalable, user-adaptive NLP systems.

[200] MuseRAG: Idea Originality Scoring At Scale

Ali Sarosh Bangash,Krish Veera,Ishfat Abrar Islam,Raiyan Abdul Baten

Main category: cs.CL

TL;DR: MuseRAG is an automated method that uses LLMs and a RAG framework to assess the originality of ideas. By clustering on semantic similarity with zero-shot prompting, it produces efficient results that agree with human scoring.

Details

Motivation: Traditional methods assess originality by manually tabulating idea frequencies, which is inefficient and error-prone, so an automated solution is needed. Method: Combine LLMs with a RAG framework, clustering and scoring ideas through semantic retrieval and zero-shot prompting. Result: Across five datasets, MuseRAG agrees closely with human scoring (r = 0.89) and clusters ideas well (AMI = 0.59). Conclusion: MuseRAG delivers efficient, accurate originality assessment and gives creativity research a scalable tool. Abstract: An objective, face-valid way to assess the originality of creative ideas is to measure how rare each idea is within a population -- an approach long used in creativity research but difficult to automate at scale. Tabulating response frequencies via manual bucketing of idea rephrasings is labor-intensive, error-prone, and brittle under large corpora. We introduce a fully automated, psychometrically validated pipeline for frequency-based originality scoring. Our method, MuseRAG, combines large language models (LLMs) with an externally orchestrated retrieval-augmented generation (RAG) framework. Given a new idea, the system retrieves semantically similar prior idea buckets and zero-shot prompts the LLM to judge whether the new idea belongs to an existing bucket or forms a new one. The resulting buckets enable computation of frequency-based originality metrics. Across five datasets (N=1143, n_ideas=16294), MuseRAG matches human annotators in idea clustering structure and resolution (AMI = 0.59) and in participant-level scoring (r = 0.89) -- while exhibiting strong convergent and external validity. Our work enables intent-sensitive, human-aligned originality scoring at scale to aid creativity research.
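
The last step of the pipeline described above, turning buckets into frequency-based originality scores, can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not give the exact metric, so the rarity formula (one minus the bucket's share of all ideas) and the function name `originality_scores` are hypothetical, and the LLM-driven bucketing step is taken as already done.

```python
from collections import Counter

def originality_scores(bucket_of_idea):
    """Score each idea by how rare its bucket is.

    bucket_of_idea: list of bucket ids, one per idea (assumed to come
    from an upstream bucketing step, which is not shown here).
    Returns 1 - (bucket size / total ideas) per idea. This rarity
    formula is an assumption, not the paper's stated metric.
    """
    counts = Counter(bucket_of_idea)
    total = len(bucket_of_idea)
    return [1.0 - counts[b] / total for b in bucket_of_idea]

# Four ideas: three land in the same bucket, one is unique.
buckets = ["doorstop", "doorstop", "sculpture", "doorstop"]
scores = originality_scores(buckets)
```

Ideas in crowded buckets score low and ideas in singleton buckets score high, which is the rarity intuition the abstract describes.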

[201] LIFEBench: Evaluating Length Instruction Following in Large Language Models

Wei Zhang,Zhenhong Zhou,Junfeng Fang,Rongwu Xu,Kun Wang,Yuanhe Zhang,Rui Wang,Ge Zhang,Xinfeng Li,Li Sun,Lingjuan Lyu,Yang Liu,Sen Su

Main category: cs.CL

TL;DR: The paper introduces LIFEBench, a benchmark that evaluates how well large language models (LLMs) follow length instructions, and finds that current models fall well short on long-text generation.

Details

Motivation: LLMs can solve complex problems, yet they perform poorly when asked to follow explicit length instructions (such as producing text of a given word count), and existing benchmarks overlook this ability. Method: Propose the LIFEBench benchmark with 10,800 instances covering 4 task categories and lengths from 16 to 8192 words, and evaluate 26 widely used LLMs. Result: Most models do reasonably well on short outputs but deteriorate sharply beyond a threshold; almost no model reaches its vendor-claimed maximum output length. Conclusion: LIFEBench exposes fundamental limitations in how current LLMs follow length instructions and provides an important reference for future improvement. Abstract: While large language models (LLMs) can solve PhD-level reasoning problems over long context inputs, they still struggle with a seemingly simpler task: following explicit length instructions, e.g., write a 10,000-word novel. Additionally, models often generate far too short outputs, terminate prematurely, or even refuse the request. Existing benchmarks focus primarily on evaluating generation quality, but often overlook whether the generations meet length constraints. To this end, we introduce Length Instruction Following Evaluation Benchmark (LIFEBench) to comprehensively evaluate LLMs' ability to follow length instructions across diverse tasks and a wide range of specified lengths. LIFEBench consists of 10,800 instances across 4 task categories in both English and Chinese, covering length constraints ranging from 16 to 8192 words. We evaluate 26 widely-used LLMs and find that most models reasonably follow short-length instructions but deteriorate sharply beyond a certain threshold. Surprisingly, almost all models fail to reach the vendor-claimed maximum output lengths in practice, as further confirmed by our evaluations extending up to 32K words. Even long-context LLMs, despite their extended input-output windows, counterintuitively fail to improve length instruction following. Notably, Reasoning LLMs outperform even specialized long-text generation models, achieving state-of-the-art length following. Overall, LIFEBench uncovers fundamental limitations in current LLMs' ability to follow length instructions, offering critical insights for future progress.
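
To make the idea of checking a length constraint concrete, here is a minimal sketch of a signed relative length deviation. It is not LIFEBench's metric, which the abstract does not specify; the function name `length_deviation` and whitespace-based word counting are assumptions, and whitespace splitting would not be adequate for the benchmark's Chinese instances.

```python
def length_deviation(text, target_words):
    """Signed relative deviation of the output length from the target.

    Negative values mean the output is too short, positive too long.
    Words are counted by whitespace splitting (an assumption that only
    suits space-delimited languages such as English).
    """
    actual = len(text.split())
    return (actual - target_words) / target_words

# An output of 4 words against a requested 8 words is 50% short.
deviation = length_deviation("one two three four", 8)
```

A benchmark could aggregate such deviations per target length to show where models start to fall short, which is the pattern the abstract reports beyond a certain threshold.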

[202] Align-GRAG: Reasoning-Guided Dual Alignment for Graph Retrieval-Augmented Generation

Derong Xu,Pengyue Jia,Xiaopeng Li,Yingyi Zhang,Maolin Wang,Qidong Liu,Xiangyu Zhao,Yichao Wang,Huifeng Guo,Ruiming Tang,Enhong Chen,Tong Xu

Main category: cs.CL

TL;DR: Align-GRAG is a graph-based dual alignment framework that retrieves subgraphs and jointly optimizes a graph encoder with LLM reasoning, addressing the information redundancy and representation gap in graph RAG.

Details

Motivation: Large language models (LLMs) suffer from hallucinations and outdated information. Graph RAG provides more comprehensive context but faces the challenges of redundant information and a representation gap. Method: Propose the Align-GRAG framework, which retrieves subgraphs and jointly optimizes a graph encoder with LLM reasoning to achieve dual alignment of nodes and representations, improving generation. Result: On the GraphQA benchmark, the method performs strongly on common sense reasoning, scene graph understanding, and knowledge graph reasoning. Conclusion: Align-GRAG effectively addresses the challenges of graph RAG and gives LLMs more efficient contextual support. Abstract: Large language models (LLMs) have demonstrated remarkable capabilities, but still struggle with issues like hallucinations and outdated information. Retrieval-augmented generation (RAG) addresses these issues by grounding LLM outputs in external knowledge with an Information Retrieval (IR) system. Building on this foundation, graph-based RAG systems go a step further by retrieving subgraphs, which preserve the relationships between knowledge entities and provide more comprehensive context. However, graph RAG faces two challenges: (1) Retrieving relevant information introduces irrelevant nodes (especially in dense graph databases, where retrieval usually extends to adjacent nodes), and leads to overly lengthy inputs that hinder efficiency; (2) The representation gap between graph and language during generation with LLMs limits the ability to fully leverage graph structures for enhanced understanding. To address these limitations, we propose Align-GRAG, a novel reasoning-guided dual alignment framework for the post-retrieval phase. It first formulates a subgraph by retrieving nodes and edges. Then an Aligner is proposed to jointly optimize a graph encoder with LLM-summarized reasoning. It achieves dual alignment of graph node and representation by leveraging KL divergence loss and contrastive loss, facilitating efficient pruning of irrelevant knowledge and establishing a unified semantic space. The Generator integrates the aligned graph data with an LLM to produce coherent and accurate answers. Experiments on the GraphQA benchmark across three tasks (including common sense reasoning, scene graph understanding, and knowledge graph reasoning) validate the effectiveness of our method. The code will be available upon acceptance.

[203] Three Minds, One Legend: Jailbreak Large Reasoning Model with Adaptive Stacked Ciphers

Viet-Anh Nguyen,Shiqian Zhao,Gia Dao,Runyi Hu,Yi Xie,Luu Anh Tuan

Main category: cs.CL

TL;DR: SEAL is a new jailbreak attack on Large Reasoning Models (LRMs) that uses an adaptive encryption pipeline to get around their reasoning processes and safety mechanisms, reaching an attack success rate of 80.8%.

Details

Motivation: Large Reasoning Models (LRMs) show strong logical capabilities, but whether they introduce more severe security vulnerabilities is underexplored. Existing jailbreak methods struggle to balance effectiveness and robustness. Method: SEAL uses a stacked encryption approach that combines multiple ciphers, with dynamic strategies (random and adaptive) that adjust the cipher length, order, and combination to bypass safety mechanisms. Result: Experiments on models including DeepSeek-R1, Claude Sonnet, and OpenAI GPT-o4 show that SEAL reaches an attack success rate of 80.8%, well above existing methods. Conclusion: SEAL demonstrates potential security vulnerabilities in LRMs and underlines the need for further research on model safety. Abstract: Recently, Large Reasoning Models (LRMs) have demonstrated superior logical capabilities compared to traditional Large Language Models (LLMs), gaining significant attention. Despite their impressive performance, the potential for stronger reasoning abilities to introduce more severe security vulnerabilities remains largely underexplored. Existing jailbreak methods often struggle to balance effectiveness with robustness against adaptive safety mechanisms. In this work, we propose SEAL, a novel jailbreak attack that targets LRMs through an adaptive encryption pipeline designed to override their reasoning processes and evade potential adaptive alignment. Specifically, SEAL introduces a stacked encryption approach that combines multiple ciphers to overwhelm the model's reasoning capabilities, effectively bypassing built-in safety mechanisms. To further prevent LRMs from developing countermeasures, we incorporate two dynamic strategies - random and adaptive - that adjust the cipher length, order, and combination. Extensive experiments on real-world reasoning models, including DeepSeek-R1, Claude Sonnet, and OpenAI GPT-o4, validate the effectiveness of our approach. Notably, SEAL achieves an attack success rate of 80.8% on GPT o4-mini, outperforming state-of-the-art baselines by a significant margin of 27.2%. Warning: This paper contains examples of inappropriate, offensive, and harmful content.

[204] Diverse, not Short: A Length-Controlled Self-Learning Framework for Improving Response Diversity of Language Models

Vijeta Deshpande,Debasmita Ghose,John D. Patterson,Roger Beaty,Anna Rumshisky

Main category: cs.CL

TL;DR: The paper proposes the Diverse-NS framework, which improves the diversity of language model outputs while controlling for length. It trains effectively with only 3,000 preference pairs, and its effect is validated on multiple tasks.

Details

Motivation: Existing diversity metrics and reward models favor short outputs, which limits expressive diversity. Method: Introduce the Diverse-NS framework, which generates and filters preference data balancing diversity, quality, and length, and uses it for self-learning. Result: Lexical and semantic diversity improve substantially on LLaMA-3.1-8B and the Olmo-2 family, and smaller models can serve as diversity teachers for larger ones. Conclusion: By addressing length bias, Diverse-NS efficiently improves model diversity and expressiveness. Abstract: Diverse language model responses are crucial for creative generation, open-ended tasks, and self-improvement training. We show that common diversity metrics, and even reward models used for preference optimization, systematically bias models toward shorter outputs, limiting expressiveness. To address this, we introduce Diverse, not Short (Diverse-NS), a length-controlled self-learning framework that improves response diversity while maintaining length parity. By generating and filtering preference data that balances diversity, quality, and length, Diverse-NS enables effective training using only 3,000 preference pairs. Applied to LLaMA-3.1-8B and the Olmo-2 family, Diverse-NS substantially enhances lexical and semantic diversity. We show consistent improvement in diversity with minor reduction or gains in response quality on four creative generation tasks: Divergent Associations, Persona Generation, Alternate Uses, and Creative Writing. Surprisingly, experiments with the Olmo-2 model family (7B and 13B) show that smaller models like Olmo-2-7B can serve as effective "diversity teachers" for larger models. By explicitly addressing length bias, our method efficiently pushes models toward more diverse and expressive outputs.

[205] Does Localization Inform Unlearning? A Rigorous Examination of Local Parameter Attribution for Knowledge Unlearning in Language Models

Hwiyeong Lee,Uiji Hwang,Hyelim Lim,Taeuk Kim

Main category: cs.CL

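TL;DR: The paper examines knowledge unlearning in large language models, argues that the effectiveness of existing localized unlearning methods is in doubt, and uses experiments to challenge the core assumption linking parameter locality to effective unlearning.

Details

Motivation: Large language models often retain unintended content, which has driven interest in knowledge unlearning. Existing localized unlearning methods try to remove target knowledge through local parameter updates, but their effectiveness has not been validated. Method: Revisit existing localized unlearning methods and run controlled experiments to rigorously evaluate whether local parameter updates causally contribute to unlearning. Result: The set of parameters that must be modified for effective unlearning is not strictly determined, which challenges the core assumption of localized unlearning. Conclusion: Parameter locality is not inherently indicative of effective knowledge removal, and the effectiveness of localized unlearning methods needs to be reassessed. Abstract: Large language models often retain unintended content, prompting growing interest in knowledge unlearning. Recent approaches emphasize localized unlearning, which restricts parameter updates to specific regions in an effort to remove target knowledge while preserving unrelated general knowledge. However, their effectiveness remains uncertain due to the lack of robust and thorough evaluation of the trade-off between the competing goals of unlearning. In this paper, we begin by revisiting existing localized unlearning approaches. We then conduct controlled experiments to rigorously evaluate whether local parameter updates causally contribute to unlearning. Our findings reveal that the set of parameters that must be modified for effective unlearning is not strictly determined, challenging the core assumption of localized unlearning that parameter locality is inherently indicative of effective knowledge removal.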

Aashish Anantha Ramakrishnan,Aadarsh Anantha Ramakrishnan,Dongwon Lee

Main category: cs.CL

TL;DR: The IRONIC framework detects multi-modal sarcasm zero-shot by analyzing multi-modal coherence relations, and achieves leading performance.

Details

Motivation: Existing methods do not efficiently exploit the cognitive processes humans use to recognize sarcasm. Method: Analyze image-text linkages using multi-modal coherence relations. Result: Achieves state-of-the-art performance on zero-shot multi-modal sarcasm detection. Conclusion: Linguistic and cognitive insights need to be incorporated into the design of multi-modal reasoning strategies. Abstract: Interpreting figurative language such as sarcasm across multi-modal inputs presents unique challenges, often requiring task-specific fine-tuning and extensive reasoning steps. However, current Chain-of-Thought approaches do not efficiently leverage the same cognitive processes that enable humans to identify sarcasm. We present IRONIC, an in-context learning framework that leverages Multi-modal Coherence Relations to analyze referential, analogical and pragmatic image-text linkages. Our experiments show that IRONIC achieves state-of-the-art performance on zero-shot Multi-modal Sarcasm Detection across different baselines. This demonstrates the need for incorporating linguistic and cognitive insights into the design of multi-modal reasoning strategies. Our code is available at: https://github.com/aashish2000/IRONIC

[207] Transformer Copilot: Learning from The Mistake Log in LLM Fine-tuning

Jiaru Zou,Yikun Ban,Zihao Li,Yunzhe Qi,Ruizhong Qiu,Ling Yang,Jingrui He

Main category: cs.CL

TL;DR: The paper proposes a new framework called Transformer Copilot, which records the model's mistakes and designs a Copilot model to improve generation. Experiments show substantial performance gains.

Details

Motivation: Conventional fine-tuning focuses only on minimizing generation loss and ignores the model's own learning signals. This work aims to improve performance by recording and analyzing the model's mistakes. Method: Introduce a Mistake Log to record errors, design a Copilot model that refines the Pilot model's inference through logits rectification, and propose joint training and fused inference paradigms. Result: Across 12 benchmarks, performance improves by up to 34.5%, with low computational overhead and strong scalability and transferability. Conclusion: By using the model's own learning signals, the Transformer Copilot framework substantially improves performance on generation tasks while remaining efficient. Abstract: Large language models are typically adapted to downstream tasks through supervised fine-tuning on domain-specific data. While standard fine-tuning focuses on minimizing generation loss to optimize model parameters, we take a deeper step by retaining and leveraging the model's own learning signals, analogous to how human learners reflect on past mistakes to improve future performance. We first introduce the concept of Mistake Log to systematically track the model's learning behavior and recurring errors throughout fine-tuning. Treating the original transformer-based model as the Pilot, we correspondingly design a Copilot model to refine the Pilot's inference performance via logits rectification. We name the overall Pilot-Copilot framework the Transformer Copilot, which introduces (i) a novel Copilot model design, (ii) a joint training paradigm where the Copilot continuously learns from the evolving Mistake Log alongside the Pilot, and (iii) a fused inference paradigm where the Copilot rectifies the Pilot's logits for enhanced generation. We provide both theoretical and empirical analyses on our new learning framework. Experiments on 12 benchmarks spanning commonsense, arithmetic, and recommendation tasks demonstrate that Transformer Copilot consistently improves performance by up to 34.5%, while introducing marginal computational overhead to Pilot models and exhibiting strong scalability and transferability.
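
The fused inference step, in which the Copilot rectifies the Pilot's logits, might look roughly like the sketch below. The abstract does not say how the correction is combined with the logits, so the additive form, the scaling factor `lam`, and the names `rectify_logits` and `argmax` are all assumptions made for illustration; the toy numbers are invented.

```python
def rectify_logits(pilot_logits, copilot_correction, lam=1.0):
    """Add a scaled per-token correction to the Pilot's logits.

    The additive combination and the weight `lam` are assumptions;
    the abstract only states that the Copilot rectifies the logits.
    """
    return [z + lam * c for z, c in zip(pilot_logits, copilot_correction)]

def argmax(values):
    """Index of the largest value (the greedy next-token choice)."""
    return max(range(len(values)), key=values.__getitem__)

# Toy vocabulary of three tokens: the Pilot prefers token 0, but the
# Copilot's correction shifts the greedy choice to token 1.
pilot = [2.0, 1.0, 0.0]
correction = [-1.5, 1.0, 0.0]
fused = rectify_logits(pilot, correction)
```

In this toy case the correction flips the greedy prediction, which is the kind of inference-time adjustment the abstract attributes to the Copilot.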

[208] Spontaneous Speech Variables for Evaluating LLMs Cognitive Plausibility

Sheng-Fu Wang,Laurent Prevot,Jou-an Chi,Ri-Sheng Huang,Shu-Kai Hsieh

Main category: cs.CL

TL;DR: The paper explores using spontaneous speech corpora to evaluate how well large language models predict language production variables (speech reductions and prosodic prominences), and finds that spoken training data yields more accurate predictions.

Details

Motivation: To study the characteristics of large language models from a cognitive perspective, in particular by evaluating models on their ability to predict behavioral and physiological variables during language processing. Method: Extract production variables (speech reductions, prosodic prominences) from spontaneous speech corpora and test how well models trained on different pretraining datasets (written, spoken, and mixed genres) predict them. Result: After fine-tuning, the models predict these variables well above baselines, and spoken training data predicts better than written data. Conclusion: High-quality speech corpora can serve as effective benchmarks for evaluating large language models, and spoken data matters particularly for this predictive performance. Abstract: The achievements of Large Language Models in Natural Language Processing, especially for high-resource languages, call for a better understanding of their characteristics from a cognitive perspective. Researchers have attempted to evaluate artificial models by testing their ability to predict behavioral (e.g., eye-tracking fixations) and physiological (e.g., brain responses) variables during language processing (e.g., reading/listening). In this paper, we propose using spontaneous speech corpora to derive production variables (speech reductions, prosodic prominences) and applying them in a similar fashion. More precisely, we extract these two variables, then test models trained with a standard procedure on different pretraining datasets (written, spoken, and mixed genres) for their ability to predict them. Our results show that, after some fine-tuning, the models can predict these production variables well above baselines. We also observe that spoken genre training data provides more accurate predictions than written genres. These results contribute to the broader effort of using high-quality speech corpora as benchmarks for LLMs.

[209] HiMATE: A Hierarchical Multi-Agent Framework for Machine Translation Evaluation

Shijie Zhang,Renhao Li,Songsheng Wang,Philipp Koehn,Min Yang,Derek F. Wong

Main category: cs.CL

TL;DR: HiMATE is a machine translation evaluation method built on a hierarchical multi-agent framework. By using the fine-grained information in the MQM error typology, it substantially improves error detection and severity assessment.

Details

Motivation: Existing LLM-based machine translation evaluation methods fall short in identifying error spans and assessing severity, and do not fully exploit the fine-grained information in the MQM hierarchy. Method: Propose the HiMATE framework, a hierarchical multi-agent system that combines the model's self-reflection capability with agent discussion involving asymmetric information to improve the evaluation of error subtypes. Result: HiMATE outperforms baselines across multiple datasets, with an average F1-score improvement of 89% in error span detection and severity assessment. Conclusion: Through its hierarchical multi-agent framework, HiMATE improves the accuracy and human alignment of machine translation evaluation; code and data are publicly available. Abstract: The advancement of Large Language Models (LLMs) enables flexible and interpretable automatic evaluations. In the field of machine translation evaluation, utilizing LLMs with translation error annotations based on Multidimensional Quality Metrics (MQM) yields more human-aligned judgments. However, current LLM-based evaluation methods still face challenges in accurately identifying error spans and assessing their severity. In this paper, we propose HiMATE, a Hierarchical Multi-Agent Framework for Machine Translation Evaluation. We argue that existing approaches inadequately exploit the fine-grained structural and semantic information within the MQM hierarchy. To address this, we develop a hierarchical multi-agent system grounded in the MQM error typology, enabling granular evaluation of subtype errors. Two key strategies are incorporated to further mitigate systemic hallucinations within the framework: the utilization of the model's self-reflection capability and the facilitation of agent discussion involving asymmetric information. Empirically, HiMATE outperforms competitive baselines across different datasets in conducting human-aligned evaluations. Further analyses underscore its significant advantage in error span detection and severity assessment, achieving an average F1-score improvement of 89% over the best-performing baseline. We make our code and data publicly available at https://anonymous.4open.science/r/HiMATE-Anony.

[210] Augmenting LLM Reasoning with Dynamic Notes Writing for Complex QA

Rishabh Maheshwary,Masoud Hashemi,Khyati Mahajan,Shiva Krishna Reddy Malay,Sai Rajeswar,Sathwik Tejaswi Madhusudhan,Spandana Gella,Vikas Yadav

Main category: cs.CL

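TL;DR: The paper proposes a method called Notes Writing to address the performance problems that iterative RAG faces in multi-hop question answering because of lengthy contexts and accumulated irrelevant information. By generating concise notes, the method reduces noise and indirectly increases the effective context length of LLMs, for an average improvement of 15.6 percentage points.

Details

Motivation: Iterative RAG for multi-hop question answering struggles with lengthy contexts and the buildup of irrelevant information, which limits a model's ability to process retrieved content. Existing methods are mostly restricted to single-round RAG or lack scalability. Method: Propose Notes Writing, which generates concise notes from the retrieved documents at each step, reducing noise while retaining key information and indirectly increasing the effective context length of LLMs. Result: Validated with three iterative RAG methods, two models, and four datasets, with an average improvement of 15.6 percentage points and a minimal increase in output tokens. Conclusion: Notes Writing is framework agnostic, effectively improves iterative RAG, and suits multi-hop question answering. Abstract: Iterative RAG for multi-hop question answering faces challenges with lengthy contexts and the buildup of irrelevant information. This hinders a model's capacity to process and reason over retrieved content and limits performance. While recent methods focus on compressing retrieved information, they are either restricted to single-round RAG, require finetuning or lack scalability in iterative RAG. To address these challenges, we propose Notes Writing, a method that generates concise and relevant notes from retrieved documents at each step, thereby reducing noise and retaining only essential information. This indirectly increases the effective context length of Large Language Models (LLMs), enabling them to reason and plan more effectively while processing larger volumes of input text. Notes Writing is framework agnostic and can be integrated with different iterative RAG methods. We demonstrate its effectiveness with three iterative RAG methods, across two models and four evaluation datasets. Notes Writing yields an average improvement of 15.6 percentage points overall, with minimal increase in output tokens.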

[211] ToDi: Token-wise Distillation via Fine-Grained Divergence Control

Seongryong Jung,Suwan Yoon,DongGeon Kim,Hwanhee Lee

Main category: cs.CL

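TL;DR: The paper proposes ToDi, a method that dynamically combines the complementary effects of FKL and RKL to achieve more precise knowledge distillation, outperforming existing methods.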

Details

Motivation: The high latency and energy consumption of large language models (LLMs) limit their deployment in resource-constrained environments, and conventional knowledge distillation methods such as FKL and RKL ignore prediction discrepancies between individual tokens in the vocabulary. Method: A gradient analysis reveals the complementary roles of FKL and RKL; ToDi then uses a sigmoid-based weighting function to adjust the combination of FKL and RKL for each token. Result: ToDi consistently outperforms existing distillation methods on instruction-following benchmarks, and ablation studies and an efficiency analysis support its effectiveness and practicality. Conclusion: By dynamically adjusting the combination of FKL and RKL, ToDi achieves more precise knowledge distillation and offers an efficient route to deploying LLMs in resource-constrained environments. Abstract: Large language models (LLMs) offer impressive performance but are impractical for resource-constrained deployment due to high latency and energy consumption. Knowledge distillation (KD) addresses this by transferring knowledge from a large teacher to a smaller student model. However, conventional KD, notably approaches like Forward KL (FKL) and Reverse KL (RKL), applies a uniform divergence loss across the entire vocabulary, neglecting token-level prediction discrepancies. By investigating these representative divergences via gradient analysis, we reveal that FKL boosts underestimated tokens, while RKL suppresses overestimated ones, showing their complementary roles. Based on this observation, we propose Token-wise Distillation (ToDi), a novel method that adaptively combines FKL and RKL per token using a sigmoid-based weighting function derived from the teacher-student probability log-ratio. ToDi dynamically emphasizes the appropriate divergence for each token, enabling precise distribution alignment. We demonstrate that ToDi consistently outperforms recent distillation baselines using uniform or less granular strategies across instruction-following benchmarks. Extensive ablation studies and efficiency analysis further validate ToDi's effectiveness and practicality.
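To make the per-token blend concrete, here is a minimal sketch for a single next-token distribution. The abstract only states that the weight is a sigmoid of the teacher-student log-ratio; the exact form used below (weight `alpha = sigmoid(log p - log q)` on the forward-KL term, `1 - alpha` on the reverse-KL term, no temperature or scaling) is an assumption chosen so that underestimated tokens lean on FKL and overestimated tokens lean on RKL.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def token_weights(teacher_probs, student_probs, eps=1e-12):
    """Assumed weighting: alpha_v = sigmoid(log p_v - log q_v) per vocabulary entry."""
    return [sigmoid(math.log(max(p, eps)) - math.log(max(q, eps)))
            for p, q in zip(teacher_probs, student_probs)]

def todi_style_loss(teacher_probs, student_probs, eps=1e-12):
    """Blend forward-KL and reverse-KL terms entry by entry.

    teacher_probs (p) and student_probs (q) are distributions over the same vocabulary.
    """
    total = 0.0
    for p, q in zip(teacher_probs, student_probs):
        p, q = max(p, eps), max(q, eps)
        log_ratio = math.log(p) - math.log(q)
        alpha = sigmoid(log_ratio)      # > 0.5 when the student underestimates
        fkl_term = p * log_ratio        # p * log(p / q)
        rkl_term = -q * log_ratio       # q * log(q / p)
        total += alpha * fkl_term + (1.0 - alpha) * rkl_term
    return total
```

With this choice, an entry where the teacher assigns 0.7 and the student 0.1 gets weight 7/8 on the forward-KL term, while the mirrored entry gets 1/8.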

[212] INFERENCEDYNAMICS: Efficient Routing Across LLMs through Structured Capability and Knowledge Profiling

Haochen Shi,Tianshi Zheng,Weiqi Wang,Baixuan Xu,Chunyang Li,Chunkit Chan,Tao Fan,Yangqiu Song,Qiang Yang

Main category: cs.CL

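TL;DR: InferenceDynamics is a flexible and scalable multi-dimensional routing framework that addresses the limited scalability and adaptability of current LLM routing methods. It models the capability and knowledge of each model to achieve efficient resource use and better task performance.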

Details

Motivation: Current LLM routing methods scale and adapt poorly when handling a large pool of specialized LLMs, so a more flexible and scalable solution is needed. Method: The paper proposes the InferenceDynamics framework, which performs multi-dimensional routing by modeling the capability and knowledge of models, and validates it on the RouteMix dataset. Result: On benchmarks including MMLU-Pro, GPQA, BigGenBench, and LiveBench, InferenceDynamics shows efficiency and generalizability and is able to identify and leverage top-performing models. Conclusion: InferenceDynamics enables more efficient resource use and better task performance across the LLM ecosystem; the code will be released to encourage further research. Abstract: Large Language Model (LLM) routing is a pivotal technique for navigating a diverse landscape of LLMs, aiming to select the best-performing LLMs tailored to the domains of user queries, while managing computational resources. However, current routing approaches often face limitations in scalability when dealing with a large pool of specialized LLMs, or in their adaptability to extending model scope and evolving capability domains. To overcome these challenges, we propose InferenceDynamics, a flexible and scalable multi-dimensional routing framework that models the capability and knowledge of models. We operate it on our comprehensive dataset RouteMix, and demonstrate its effectiveness and generalizability in group-level routing using modern benchmarks including MMLU-Pro, GPQA, BigGenBench, and LiveBench, showcasing its ability to identify and leverage top-performing models for given tasks, leading to superior outcomes with efficient resource utilization. The broader adoption of InferenceDynamics can empower users to harness the full specialized potential of the LLM ecosystem, and our code will be made publicly available to encourage further research.

[213] PMPO: Probabilistic Metric Prompt Optimization for Small and Large Language Models

Chenzhuo Zhao,Ziqian Liu,Xingda Wang,Junting Lu,Chaoyi Ruan

Main category: cs.CL

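TL;DR: PMPO is a lightweight prompt optimization framework that optimizes prompts directly with a cross-entropy loss, requires no output generation or human evaluation, and substantially improves model performance.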

Details

Motivation: Existing prompt optimization methods rely on costly output generation or human annotation, which limits their scalability, especially for small or non-instruction-tuned models. Method: PMPO uses masking and cross-entropy loss to identify low-quality prompt segments and optimizes the prompt by minimizing the loss, requiring only forward passes and likelihood computation. Result: PMPO performs strongly on tasks such as BBH and GSM8K and improves AlpacaEval 2.0 win rates by over 19 points. Conclusion: PMPO is efficient and general, and applies across tasks and model sizes. Abstract: Prompt optimization offers a practical and broadly applicable alternative to fine-tuning for improving large language model (LLM) performance. However, existing methods often rely on costly output generation, self-critiquing abilities, or human-annotated preferences, which limit their scalability, especially for smaller or non-instruction-tuned models. We introduce PMPO (Probabilistic Metric Prompt Optimization), a unified framework that refines prompts using token-level cross-entropy loss as a direct, lightweight evaluation signal. PMPO identifies low-quality prompt segments by masking and measuring their impact on loss, then rewrites and selects improved variants by minimizing loss over positive and negative examples. Unlike prior methods, it requires no output sampling or human evaluation during optimization, relying only on forward passes and log-likelihoods. PMPO supports both supervised and preference-based tasks through a closely aligned loss-based evaluation strategy. Experiments show that PMPO consistently outperforms prior methods across model sizes and tasks: it achieves the highest average accuracy on BBH, performs strongly on GSM8K and AQUA-RAT, and improves AlpacaEval 2.0 win rates by over 19 points. These results highlight PMPO's effectiveness, efficiency, and broad applicability.
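The loss-as-signal idea in the abstract can be illustrated with a small sketch. Everything here is hypothetical scaffolding: `score_fn(prompt, x, y)` stands in for a forward pass that returns the log-probabilities a model assigns to the tokens of target `y`, segments are joined with spaces, and the handling of negative examples and the rewriting step are omitted.

```python
def mean_cross_entropy(token_logprobs):
    """Average negative log-likelihood over the target tokens."""
    return -sum(token_logprobs) / len(token_logprobs)

def prompt_loss(prompt, examples, score_fn):
    """Mean token-level loss of a prompt over (input, target) examples."""
    losses = [mean_cross_entropy(score_fn(prompt, x, y)) for x, y in examples]
    return sum(losses) / len(losses)

def segment_impacts(segments, examples, score_fn, sep=" "):
    """Loss change when each segment is masked out (removed).

    A negative impact means the loss drops without the segment, which flags
    that segment as a candidate for rewriting.
    """
    base = prompt_loss(sep.join(segments), examples, score_fn)
    impacts = []
    for i in range(len(segments)):
        masked = sep.join(s for j, s in enumerate(segments) if j != i)
        impacts.append(prompt_loss(masked, examples, score_fn) - base)
    return impacts

def select_prompt(candidates, examples, score_fn):
    """Pick the candidate prompt with the lowest mean loss."""
    return min(candidates, key=lambda p: prompt_loss(p, examples, score_fn))
```

Both steps need only log-likelihoods of known targets, so no text is sampled from the model during optimization.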

[214] CLEAR: A Clinically-Grounded Tabular Framework for Radiology Report Evaluation

Yuyang Jiang,Chacha Chen,Shengyuan Wang,Feng Li,Zecong Tang,Benjamin M. Mervak,Lydia Chelala,Christopher M Straus,Reve Chahine,Samuel G. Armato III,Chenhao Tan

Main category: cs.CL

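TL;DR: CLEAR is a clinically grounded framework for radiology report evaluation that provides a more comprehensive and interpretable assessment through multi-dimensional, attribute-level comparison.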

Details

Motivation: Existing metrics lack the granularity and interpretability needed to capture clinical differences between candidate and ground-truth reports. Method: The paper introduces the CLEAR framework, which combines expert-curated labels with attribute-level comparison to assess the clinical accuracy of reports. Result: CLEAR extracts clinical attributes accurately and provides automated metrics that align closely with clinical judgment. Conclusion: CLEAR offers a more comprehensive and clinically interpretable approach to radiology report evaluation. Abstract: Existing metrics often lack the granularity and interpretability to capture nuanced clinical differences between candidate and ground-truth radiology reports, resulting in suboptimal evaluation. We introduce a Clinically-grounded tabular framework with Expert-curated labels and Attribute-level comparison for Radiology report evaluation (CLEAR). CLEAR not only examines whether a report can accurately identify the presence or absence of medical conditions, but also assesses whether it can precisely describe each positively identified condition across five key attributes: first occurrence, change, severity, descriptive location, and recommendation. Compared to prior works, CLEAR's multi-dimensional, attribute-level outputs enable a more comprehensive and clinically interpretable evaluation of report quality. Additionally, to measure the clinical alignment of CLEAR, we collaborate with five board-certified radiologists to develop CLEAR-Bench, a dataset of 100 chest X-ray reports from MIMIC-CXR, annotated across 6 curated attributes and 13 CheXpert conditions. Our experiments show that CLEAR achieves high accuracy in extracting clinical attributes and provides automated metrics that are strongly aligned with clinical judgment.

[215] SC4ANM: Identifying Optimal Section Combinations for Automated Novelty Prediction in Academic Papers

Wenqing Wu,Chengzhi Zhang,Tong Bao,Yi Zhao

Main category: cs.CL

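TL;DR: This paper investigates which combination of sections works best for assessing the novelty of academic papers and finds that the introduction, results, and discussion are the most suitable inputs for predicting novelty scores.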

Details

Motivation: Existing methods assess novelty mostly through word or entity combinations, but novelty-related content is spread across different sections of a paper, so the best section combination needs to be identified to improve automated assessment. Method: Natural language processing techniques identify the IMRaD structure of each paper, and different section combinations are fed to pretrained language models and large language models to predict novelty scores. Result: The introduction, results, and discussion are the most suitable for novelty assessment, while using the full text does not yield significant results; the introduction and results are the most important sections for the prediction task. Conclusion: The combination of introduction, results, and discussion is the best choice for assessing a paper's novelty and offers a new direction for automated novelty assessment. Abstract: Novelty is a core component of academic papers, and there are multiple perspectives on the assessment of novelty. Existing methods often focus on word or entity combinations, which provide limited insights. The content related to a paper's novelty is typically distributed across different core sections, e.g., Introduction, Methodology and Results. Therefore, exploring the optimal combination of sections for evaluating the novelty of a paper is important for advancing automated novelty assessment. In this paper, we utilize different combinations of sections from academic papers as inputs to drive language models to predict novelty scores. We then analyze the results to determine the optimal section combinations for novelty score prediction. We first employ natural language processing techniques to identify the sectional structure of academic papers, categorizing them into introduction, methods, results, and discussion (IMRaD). Subsequently, we use different combinations of these sections (e.g., introduction and methods) as inputs for pretrained language models (PLMs) and large language models (LLMs), employing novelty scores provided by human expert reviewers as ground truth labels to obtain prediction results. The results indicate that using introduction, results and discussion is most appropriate for assessing the novelty of a paper, while the use of the entire text does not yield significant results. Furthermore, based on the results of the PLMs and LLMs, the introduction and results appear to be the most important sections for the task of novelty score prediction. The code and dataset for this paper can be accessed at https://github.com/njust-winchy/SC4ANM.

[216] Embodied Agents Meet Personalization: Exploring Memory Utilization for Personalized Assistance

Taeyoon Kwon,Dongwook Choi,Sunghwan Kim,Hyojun Kim,Seungjun Moon,Beong-woo Kwak,Kuan-Hao Huang,Jinyoung Yeo

Main category: cs.CL

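TL;DR: The paper presents MEMENTO, a framework for evaluating how well personalized embodied agents utilize memory, and finds that existing models such as GPT-4o degrade substantially on tasks that require multiple memories.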

Details

Motivation: Current embodied agents perform well on household object rearrangement tasks but lack an understanding of user-specific semantics, so they cannot provide truly meaningful assistance. Method: The paper proposes the MEMENTO framework, whose two-stage memory evaluation design quantifies the impact of memory utilization on task performance, focusing on target object identification and user pattern inference. Result: Experiments show that existing models degrade substantially on multi-memory tasks (for example, GPT-4o suffers a 30.5% performance drop), particularly on user pattern tasks. Conclusion: The study provides insights for developing more effective personalized embodied agents; memory utilization needs further improvement. Abstract: Embodied agents empowered by large language models (LLMs) have shown strong performance in household object rearrangement tasks. However, these tasks primarily focus on single-turn interactions with simplified instructions, which do not truly reflect the challenges of providing meaningful assistance to users. To provide personalized assistance, embodied agents must understand the unique semantics that users assign to the physical world (e.g., favorite cup, breakfast routine) by leveraging prior interaction history to interpret dynamic, real-world instructions. Yet, the effectiveness of embodied agents in utilizing memory for personalized assistance remains largely underexplored. To address this gap, we present MEMENTO, a personalized embodied agent evaluation framework designed to comprehensively assess memory utilization capabilities to provide personalized assistance. Our framework consists of a two-stage memory evaluation process design that enables quantifying the impact of memory utilization on task performance. This process enables the evaluation of agents' understanding of personalized knowledge in object rearrangement tasks by focusing on its role in goal interpretation: (1) the ability to identify target objects based on personal meaning (object semantics), and (2) the ability to infer object-location configurations from consistent user patterns, such as routines (user patterns). Our experiments across various LLMs reveal significant limitations in memory utilization, with even frontier models like GPT-4o experiencing a 30.5% performance drop when required to reference multiple memories, particularly in tasks involving user patterns. These findings, along with our detailed analyses and case studies, provide valuable insights for future research in developing more effective personalized embodied agents. Project website: https://connoriginal.github.io/MEMENTO

[217] Ask, Retrieve, Summarize: A Modular Pipeline for Scientific Literature Summarization

Pierre Achkar,Tim Gollub,Martin Potthast

Main category: cs.CL

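TL;DR: XSum is a modular pipeline for multi-document summarization (MDS) in the scientific domain, built on Retrieval-Augmented Generation (RAG). It comprises a question-generation module and an editor module and considerably improves summary quality.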

Details

Motivation: The rapid growth of scientific literature makes it hard for researchers to track and synthesize knowledge effectively; XSum aims to address this problem. Method: XSum uses a question-generation module that dynamically generates questions to retrieve relevant information, and an editor module that synthesizes summaries meeting academic standards. Result: On the SurveySum dataset, XSum outperforms existing approaches on metrics such as CheckEval, G-Eval, and Ref-F1. Conclusion: XSum provides a transparent, adaptable framework for scientific summarization with broad application potential. Abstract: The exponential growth of scientific publications has made it increasingly difficult for researchers to stay updated and synthesize knowledge effectively. This paper presents XSum, a modular pipeline for multi-document summarization (MDS) in the scientific domain using Retrieval-Augmented Generation (RAG). The pipeline includes two core components: a question-generation module and an editor module. The question-generation module dynamically generates questions adapted to the input papers, ensuring the retrieval of relevant and accurate information. The editor module synthesizes the retrieved content into coherent and well-structured summaries that adhere to academic standards for proper citation. Evaluated on the SurveySum dataset, XSum demonstrates strong performance, achieving considerable improvements in metrics such as CheckEval, G-Eval and Ref-F1 compared to existing approaches. This work provides a transparent, adaptable framework for scientific summarization with potential applications in a wide range of domains. Code available at https://github.com/webis-de/scolia25-xsum

[218] PaTH Attention: Position Encoding via Accumulating Householder Transformations

Songlin Yang,Yikang Shen,Kaiyue Wen,Shawn Tan,Mayank Mishra,Liliang Ren,Rameswar Panda,Yoon Kim

Main category: cs.CL

TL;DR: PaTH is a position encoding method based on data-dependent Householder transformations that outperforms RoPE and other baselines.

Details

Motivation: RoPE's position encoding depends only on relative position, which limits expressivity, so a more flexible, data-dependent encoding scheme is needed. Method: The paper proposes PaTH, which is based on data-dependent Householder transformations, and designs an efficient parallel training algorithm together with FlashAttention-style blockwise processing to reduce I/O cost. Result: On synthetic benchmarks and real-world language modeling experiments, PaTH outperforms RoPE and other baselines. Conclusion: PaTH improves expressivity and performance through data-dependent position encoding and is an effective alternative. Abstract: The attention mechanism is a core primitive in modern large language models (LLMs) and AI more broadly. Since attention by itself is permutation-invariant, position encoding is essential for modeling structured domains such as language. Rotary position encoding (RoPE) has emerged as the de facto standard approach for position encoding and is part of many modern LLMs. However, in RoPE the key/query transformation between two elements in a sequence is only a function of their relative position and otherwise independent of the actual input. This limits the expressivity of RoPE-based transformers. This paper describes PaTH, a flexible data-dependent position encoding scheme based on accumulated products of Householder(like) transformations, where each transformation is data-dependent, i.e., a function of the input. We derive an efficient parallel algorithm for training through exploiting a compact representation of products of Householder matrices, and implement a FlashAttention-style blockwise algorithm that minimizes I/O cost. Across both targeted synthetic benchmarks and moderate-scale real-world language modeling experiments, we find that PaTH demonstrates superior performance compared to RoPE and other recent baselines.
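As a rough illustration of "accumulated products of Householder(like) transformations", the sketch below computes one attention logit naively. The matrix form `H = I - beta * w w^T`, the index convention (the transforms for positions `j+1` through `i` sit between key `j` and query `i`), and the way `w` and `beta` would be produced from the input are all assumptions; the abstract does not spell them out, and the paper's efficient parallel algorithm is not reproduced here.

```python
def householder_like(w, beta):
    """Return H = I - beta * w w^T as a list-of-lists matrix (assumed form)."""
    d = len(w)
    return [[(1.0 if i == j else 0.0) - beta * w[i] * w[j] for j in range(d)]
            for i in range(d)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def path_style_logit(q_i, k_j, transforms):
    """Compute q_i^T (H_i ... H_{j+1}) k_j.

    transforms is the list [H_{j+1}, ..., H_i]; each matrix would be a
    function of the token at that position (data-dependent).
    """
    v = list(k_j)
    for h in transforms:        # H_{j+1} is applied first, H_i last
        v = matvec(h, v)
    return sum(a * b for a, b in zip(q_i, v))
```

With `beta = 0` every transform is the identity and the logit reduces to a plain dot product; with `beta = 2` and a unit-norm `w`, each transform is an exact Householder reflection.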

[219] Semantic Pivots Enable Cross-Lingual Transfer in Large Language Models

Kaiyu He,Tong Zhou,Yubo Chen,Delai Qiu,Shengping Liu,Kang Liu,Jun Zhao

Main category: cs.CL

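TL;DR: The study proposes a word-level cross-lingual translation task to quantify the cross-lingual ability of large language models (LLMs) and analyzes how that ability is learned by tracing the outputs of intermediate layers. It identifies two behaviors, co-occurrence behavior and semantic pivot behavior, and improves cross-lingual ability by reconstructing the pre-training dataset.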

Details

Motivation: Understanding how LLMs acquire cross-lingual ability is crucial for model interpretability. Method: The paper proposes a word-level cross-lingual translation task, traces the outputs of intermediate layers, analyzes co-occurrence frequency and semantic pivot behavior, and reconstructs the pre-training dataset. Result: Experiments confirm that a semantic pivot-aware pre-training dataset improves the cross-lingual ability of LLMs. Conclusion: The study sheds light on the mechanism behind LLMs' cross-lingual ability and proposes an improvement method, contributing to both interpretability and capability. Abstract: Large language models (LLMs) demonstrate remarkable ability in cross-lingual tasks. Understanding how LLMs acquire this ability is crucial for their interpretability. To quantify the cross-lingual ability of LLMs accurately, we propose a Word-Level Cross-Lingual Translation Task. To find how LLMs learn cross-lingual ability, we trace the outputs of LLMs' intermediate layers in the word translation task. We identify and distinguish two distinct behaviors in the forward pass of LLMs: co-occurrence behavior and semantic pivot behavior. We attribute LLMs' two distinct behaviors to the co-occurrence frequency of words and find the semantic pivot from the pre-training dataset. Finally, to apply our findings to improve the cross-lingual ability of LLMs, we reconstruct a semantic pivot-aware pre-training dataset using documents with a high proportion of semantic pivots. Our experiments validate the effectiveness of our approach in enhancing cross-lingual ability. Our research contributes insights into the interpretability of LLMs and offers a method for improving LLMs' cross-lingual ability.

[220] Resource for Error Analysis in Text Simplification: New Taxonomy and Test Collection

Benjamin Vendeville,Liana Ermakova,Pierre De Loor

Main category: cs.CL

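TL;DR: The paper introduces a new test collection and error taxonomy for evaluating errors in Automatic Text Simplification (ATS), filling a gap in current evaluation methods.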

Details

Motivation: The public often struggles to understand complex texts, which leads to the spread of misinformation, and existing ATS evaluation methods have not kept pace with advances in text generation, especially large language models (LLMs). Method: (1) Propose an error taxonomy focused on information distortion; (2) introduce a parallel dataset of automatically simplified scientific texts with human annotations; (3) analyze the quality of the dataset and the performance of existing models at detecting and classifying errors. Result: The paper provides tools for evaluating errors in ATS more accurately, supporting the development of more reliable models. Conclusion: The study lays a foundation for improving the quality of automatically simplified texts and advances the ATS field. Abstract: The general public often encounters complex texts but does not have the time or expertise to fully understand them, leading to the spread of misinformation. Automatic Text Simplification (ATS) helps make information more accessible, but its evaluation methods have not kept up with advances in text generation, especially with Large Language Models (LLMs). In particular, recent studies have shown that current ATS metrics do not correlate with the presence of errors. Manual inspections have further revealed a variety of errors, underscoring the need for a more nuanced evaluation framework, which is currently lacking. This resource paper addresses this gap by introducing a test collection for detecting and classifying errors in simplified texts. First, we propose a taxonomy of errors, with a formal focus on information distortion. Next, we introduce a parallel dataset of automatically simplified scientific texts. This dataset has been human-annotated with labels based on our proposed taxonomy. Finally, we analyze the quality of the dataset, and we study the performance of existing models to detect and classify errors from that taxonomy. These contributions give researchers the tools to better evaluate errors in ATS, develop more reliable models, and ultimately improve the quality of automatically simplified texts.

[221] On the reliability of feature attribution methods for speech classification

Gaofei Shen,Hosein Mohebbi,Arianna Bisazza,Afra Alishahi,Grzegorz Chrupała

Main category: cs.CL

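TL;DR: The paper studies the reliability of feature attribution methods in speech processing and finds that standard methods are generally unreliable in the speech domain, except for word-aligned perturbation methods applied to word-based classification tasks.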

Details

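Motivation: As large-scale pre-trained models become more capable, understanding what determines their outputs becomes more important. Feature attribution aims to reveal which parts of the input contribute most to model outputs, but the unique characteristics of speech signals make these methods hard to apply. Method: The paper studies how factors such as input type and aggregation and perturbation timespan affect the reliability of standard feature attribution methods, and how these factors interact with the characteristics of each classification task. Result: Standard feature attribution methods are generally unreliable in the speech domain, with word-aligned perturbation methods on word-based classification tasks as the exception. Conclusion: The reliability of standard feature attribution methods in speech processing is limited, and methods should be chosen to suit the task. Abstract: As the capabilities of large-scale pre-trained models evolve, understanding the determinants of their outputs becomes more important. Feature attribution aims to reveal which parts of the input contribute the most to model outputs. In speech processing, the unique characteristics of the input signal make the application of feature attribution methods challenging. We study how factors such as input type and aggregation and perturbation timespan impact the reliability of standard feature attribution methods, and how these factors interact with characteristics of each classification task. We find that standard approaches to feature attribution are generally unreliable when applied to the speech domain, with the exception of word-aligned perturbation methods when applied to word-based classification tasks.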

[222] From Surveys to Narratives: Rethinking Cultural Value Adaptation in LLMs

Muhammad Farid Adilazuarda,Chen Cecilia Liu,Iryna Gurevych,Alham Fikri Aji

Main category: cs.CL

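TL;DR: The study finds that adapting the cultural values of Large Language Models (LLMs) using World Values Survey (WVS) data alone can homogenize cultures and interfere with factual knowledge. Adding encyclopedic and scenario-based cultural narratives from Wikipedia and NormAd improves cultural distinctiveness.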

Details

Motivation: To examine the limitations of WVS data for cultural value adaptation in LLMs and how supplementary data can improve cultural distinctiveness. Method: The paper systematically studies WVS-based training and augments it with cultural narrative data from Wikipedia and NormAd. Result: Adding narrative data improves cultural distinctiveness more than WVS data alone, although the effect on downstream tasks varies. Conclusion: Cultural value alignment is complex, and multiple data sources are needed to guide task-specific behavior. Abstract: Adapting cultural values in Large Language Models (LLMs) presents significant challenges, particularly due to biases and limited training data. Prior work primarily aligns LLMs with different cultural values using World Values Survey (WVS) data. However, it remains unclear whether this approach effectively captures cultural nuances or produces distinct cultural representations for various downstream tasks. In this paper, we systematically investigate WVS-based training for cultural value adaptation and find that relying solely on survey data can homogenize cultural norms and interfere with factual knowledge. To investigate these issues, we augment WVS with encyclopedic and scenario-based cultural narratives from Wikipedia and NormAd. While these narratives may have variable effects on downstream tasks, they consistently improve cultural distinctiveness more than survey data alone. Our work highlights the inherent complexity of aligning cultural values with the goal of guiding task-specific behavior.

[223] Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning

Guanting Dong,Yifei Chen,Xiaoxi Li,Jiajie Jin,Hongjin Qian,Yutao Zhu,Hangyu Mao,Guorui Zhou,Zhicheng Dou,Ji-Rong Wen

Main category: cs.CL

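TL;DR: Tool-Star is a reinforcement learning-based framework that strengthens the ability of large language models (LLMs) to autonomously invoke multiple tools during stepwise reasoning. Its systematic design for data synthesis and training addresses the scarcity of tool-use data, and it performs strongly on multiple reasoning benchmarks.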

Details

Motivation: Although large language models show strong reasoning capabilities, using reinforcement learning to achieve multi-tool collaborative reasoning remains a challenge. Method: The paper proposes the Tool-Star framework, which includes a tool-integrated reasoning data synthesis pipeline and a two-stage training framework (cold-start fine-tuning and a multi-tool self-critic RL algorithm). Result: The effectiveness and efficiency of Tool-Star are demonstrated on more than 10 challenging reasoning benchmarks. Conclusion: Through systematic design and reinforcement learning, Tool-Star improves LLMs' ability to reason collaboratively with multiple tools. Abstract: Recently, large language models (LLMs) have shown remarkable reasoning capabilities via large-scale reinforcement learning (RL). However, leveraging the RL algorithm to empower effective multi-tool collaborative reasoning in LLMs remains an open challenge. In this paper, we introduce Tool-Star, an RL-based framework designed to empower LLMs to autonomously invoke multiple external tools during stepwise reasoning. Tool-Star integrates six types of tools and incorporates systematic designs in both data synthesis and training. To address the scarcity of tool-use data, we propose a general tool-integrated reasoning data synthesis pipeline, which combines tool-integrated prompting with hint-based sampling to automatically and scalably generate tool-use trajectories. A subsequent quality normalization and difficulty-aware classification process filters out low-quality samples and organizes the dataset from easy to hard. Furthermore, we propose a two-stage training framework to enhance multi-tool collaborative reasoning by: (1) cold-start fine-tuning, which guides LLMs to explore reasoning patterns via tool-invocation feedback; and (2) a multi-tool self-critic RL algorithm with hierarchical reward design, which reinforces reward understanding and promotes effective tool collaboration. Experimental analyses on over 10 challenging reasoning benchmarks highlight the effectiveness and efficiency of Tool-Star. The code is available at https://github.com/dongguanting/Tool-Star.

[224] Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation

Ruizhe Li,Chen Chen,Yuchen Hu,Yanjun Gao,Xi Wang,Emine Yilmaz

Main category: cs.CL

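TL;DR: The paper proposes ARC-JSD, a Jensen-Shannon Divergence-based method that efficiently and accurately identifies essential context sentences in RAG models without additional fine-tuning or surrogate modelling.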

Details

Motivation: Current RAG models struggle to reliably attribute generated content to specific context, and existing methods are computationally expensive. Method: ARC-JSD is a Jensen-Shannon Divergence-driven method that identifies essential context directly, without fine-tuning or surrogate modelling. Result: The method achieves higher accuracy and computational efficiency on multiple RAG benchmarks and reveals specific internal components responsible for context attribution. Conclusion: ARC-JSD provides an efficient and accurate context attribution method for RAG models and offers insight into their internal mechanisms. Abstract: Retrieval-Augmented Generation (RAG) leverages large language models (LLMs) combined with external contexts to enhance the accuracy and reliability of generated responses. However, reliably attributing generated content to specific context segments, context attribution, remains challenging due to the computationally intensive nature of current methods, which often require extensive fine-tuning or human annotation. In this work, we introduce a novel Jensen-Shannon Divergence driven method to Attribute Response to Context (ARC-JSD), enabling efficient and accurate identification of essential context sentences without additional fine-tuning or surrogate modelling. Evaluations on a wide range of RAG benchmarks, such as TyDi QA, Hotpot QA, and Musique, using instruction-tuned LLMs in different scales demonstrate superior accuracy and significant computational efficiency improvements compared to the previous surrogate-based method. Furthermore, our mechanistic analysis reveals specific attention heads and multilayer perceptron (MLP) layers responsible for context attribution, providing valuable insights into the internal workings of RAG models.
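For readers unfamiliar with the divergence, the sketch below implements the standard Jensen-Shannon divergence and uses it to rank context sentences. The ranking procedure is one plausible reading rather than the paper's confirmed algorithm: it assumes the response distribution is recomputed with each context sentence removed in turn, and that the sentence whose removal shifts the distribution most is the most essential. The distributions themselves are assumed to be supplied by the caller.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; zero-probability terms are skipped."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def rank_sentences(full_dist, ablated_dists):
    """Rank sentence indices by how much removing each one shifts the output.

    full_dist: response distribution given the full context.
    ablated_dists[i]: response distribution with sentence i removed (assumed setup).
    """
    scores = {i: jsd(full_dist, d) for i, d in enumerate(ablated_dists)}
    return sorted(scores, key=scores.get, reverse=True)
```

Because the mixture `m` is positive wherever either input is, the divergence stays finite even when the two distributions have disjoint support.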

[225] Exploring the Relationship Between Diversity and Quality in Ad Text Generation

Yoichi Aoki,Soichiro Murakami,Ukyo Honda,Akihiko Kato

Main category: cs.CL

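TL;DR: The study explores the relationship between diversity and ad quality in ad text generation, analyzing the effect of diversity-enhancing methods and their parameters.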

Details

Motivation: Ad text generation requires diversity and appeal, but existing diversity-enhancing methods were developed mainly for other tasks such as summarization and translation, and their suitability for advertising has not been fully explored. Method: The paper studies the relationship between diversity and ad quality by analyzing diversity-enhancing methods, hyperparameters, input-output formats, and models. Result: No specific results are stated, but the distinct nature of ad text generation is emphasized. Conclusion: Diversity-enhancing methods for ad text generation need to be tuned to the task's particular requirements. Abstract: In natural language generation for advertising, creating diverse and engaging ad texts is crucial for capturing a broad audience and avoiding advertising fatigue. Despite the importance of diversity, the impact of diversity-enhancing methods in ad text generation -- mainly tested on tasks such as summarization and machine translation -- has not been thoroughly explored. Ad text generation significantly differs from these tasks owing to the text style and requirements. This research explores the relationship between diversity and ad quality in ad text generation by considering multiple factors, such as diversity-enhancing methods, their hyperparameters, input-output formats, and the models.

[226] WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning

Zhepei Wei,Wenlin Yao,Yao Liu,Weizhi Zhang,Qin Lu,Liang Qiu,Changlong Yu,Puyang Xu,Chao Zhang,Bing Yin,Hyokun Yun,Lihong Li

Main category: cs.CL

TL;DR: WebAgent-R1 is an end-to-end multi-turn reinforcement learning framework for training web agents that substantially improves task success rates.

Details

Motivation: Existing reinforcement learning methods focus mainly on single-turn tasks, while training web agents for multi-turn interaction remains challenging because of the complexity of long-horizon decision-making. Method: The framework learns online by asynchronously generating diverse trajectories, guided entirely by binary rewards based on task success. Result: On the WebArena-Lite benchmark, WebAgent-R1 substantially raises the task success rates of Qwen-2.5-3B and Llama-3.1-8B. Conclusion: The thinking-based prompting strategy and test-time scaling through more interactions are effective, and behavior cloning and long chain-of-thought reasoning matter for the initialization policy. Abstract: While reinforcement learning (RL) has demonstrated remarkable success in enhancing large language models (LLMs), it has primarily focused on single-turn tasks such as solving math problems. Training effective web agents for multi-turn interactions remains challenging due to the complexity of long-horizon decision-making across dynamic web interfaces. In this work, we present WebAgent-R1, a simple yet effective end-to-end multi-turn RL framework for training web agents. It learns directly from online interactions with web environments by asynchronously generating diverse trajectories, entirely guided by binary rewards depending on task success. Experiments on the WebArena-Lite benchmark demonstrate the effectiveness of WebAgent-R1, boosting the task success rate of Qwen-2.5-3B from 6.1% to 33.9% and Llama-3.1-8B from 8.5% to 44.8%, significantly outperforming existing state-of-the-art methods and strong proprietary models such as OpenAI o3. In-depth analyses reveal the effectiveness of the thinking-based prompting strategy and test-time scaling through increased interactions for web tasks. We further investigate different RL initialization policies by introducing two variants, namely WebAgent-R1-Zero and WebAgent-R1-CoT, which highlight the importance of the warm-up training stage (i.e., behavior cloning) and provide insights on incorporating long chain-of-thought (CoT) reasoning in web agents.

[227] $I^2G$: Generating Instructional Illustrations via Text-Conditioned Diffusion

Jing Bi,Pinxin Liu,Ali Vosoughi,Jiarui Wu,Jinxi He,Chenliang Xu

Main category: cs.CL

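TL;DR: The paper proposes a language-driven framework that translates procedural text into visual instructions. It decomposes instructions into goal statements and steps, introduces three innovations, and significantly outperforms existing baselines.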

Details

Motivation: To address the limitations of purely textual instructions in conveying complex physical actions and spatial relationships. Method: The approach translates procedural text into visual instructions using a constituency parser-based text encoding mechanism, a pairwise discourse coherence model, and a new evaluation protocol. Result: Experiments on three datasets show that the method significantly outperforms baselines in generating visuals that accurately reflect the content and order of instructions. Conclusion: The study offers a new approach to grounding procedural language in visual content, with applications in education, task guidance, and multimodal language understanding. Abstract: The effective communication of procedural knowledge remains a significant challenge in natural language processing (NLP), as purely textual instructions often fail to convey complex physical actions and spatial relationships. We address this limitation by proposing a language-driven framework that translates procedural text into coherent visual instructions. Our approach models the linguistic structure of instructional content by decomposing it into goal statements and sequential steps, then conditioning visual generation on these linguistic elements. We introduce three key innovations: (1) a constituency parser-based text encoding mechanism that preserves semantic completeness even with lengthy instructions, (2) a pairwise discourse coherence model that maintains consistency across instruction sequences, and (3) a novel evaluation protocol specifically designed for procedural language-to-image alignment. Our experiments across three instructional datasets (HTStep, CaptainCook4D, and WikiAll) demonstrate that our method significantly outperforms existing baselines in generating visuals that accurately reflect the linguistic content and sequential nature of instructions. This work contributes to the growing body of research on grounding procedural language in visual content, with applications spanning education, task guidance, and multimodal language understanding.

[228] Beyond Static Testbeds: An Interaction-Centric Agent Simulation Platform for Dynamic Recommender Systems

Song Jin,Juntian Zhang,Yuhan Liu,Xun Zhang,Yufei Zhang,Guojun Yin,Fei Jiang,Wei Lin,Rui Yan

Main category: cs.CL

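TL;DR: RecInter is an agent-based simulation platform for recommender systems whose dynamic interaction mechanism and advanced agent architecture significantly improve the realism and credibility of simulation.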

Details

Motivation: Traditional A/B testing is resource-intensive, offline methods struggle to capture dynamic user-platform interactions, and existing simulation platforms lack a mechanism for dynamically reshaping the environment. Method: The paper introduces the RecInter platform, which supports dynamic updates of item attributes and interaction with Merchant Agents, combined with multidimensional user profiling and an LLM fine-tuned on CoT-enriched data. Result: The platform significantly improves simulation credibility and reproduces emergent phenomena such as Brand Loyalty and the Matthew Effect. Conclusion: RecInter provides a credible simulation testbed for recommender systems research. Abstract: Evaluating and iterating upon recommender systems is crucial, yet traditional A/B testing is resource-intensive, and offline methods struggle with dynamic user-platform interactions. While agent-based simulation is promising, existing platforms often lack a mechanism for user actions to dynamically reshape the environment. To bridge this gap, we introduce RecInter, a novel agent-based simulation platform for recommender systems featuring a robust interaction mechanism. In the RecInter platform, simulated user actions (e.g., likes, reviews, purchases) dynamically update item attributes in real time, and the newly introduced Merchant Agents can reply, fostering a more realistic and evolving ecosystem. High-fidelity simulation is ensured through a Multidimensional User Profiling module, an Advanced Agent Architecture, and an LLM fine-tuned on Chain-of-Thought (CoT) enriched interaction data. Our platform achieves significantly improved simulation credibility and successfully replicates emergent phenomena like Brand Loyalty and the Matthew Effect. Experiments demonstrate that this interaction mechanism is pivotal for simulating realistic system evolution, establishing our platform as a credible testbed for recommender systems research.

[229] University of Indonesia at SemEval-2025 Task 11: Evaluating State-of-the-Art Encoders for Multi-Label Emotion Detection

Ikhlasul Akmal Hanif,Eryawan Presma Yulianrifat,Jaycent Gunawan Ongris,Eduardus Tjitrahardja,Muhammad Falensi Azmi,Rahmat Bryan Naufal,Alfan Farizki Wicaksono

Main category: cs.CL

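TL;DR: This paper presents an approach to multi-label emotion classification for SemEval 2025 Task 11 Track A, comparing full fine-tuning of transformer models with classifier-only training, and finds that prompt-based encoders such as mE5 and BGE perform better. The best model is an ensemble of multiple BGE models, with an average F1-macro score of 56.58.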

Details

Motivation: Explore the best approach to multi-label emotion classification in a multilingual setting and compare the effectiveness of different strategies. Method: Compares full fine-tuning of transformer models with classifier-only training, evaluating different settings such as fine-tuning strategies, model architectures, and loss functions. Result: Prompt-based encoders such as mE5 and BGE outperform fully fine-tuned XLMR and mBERT. The best model is an ensemble of multiple BGE models, with an average F1-macro score of 56.58. Conclusion: For multi-label emotion classification across 28 languages, training a classifier on top of prompt-based encoders is the better strategy, and the ensemble model performs best. Abstract: This paper presents our approach for SemEval 2025 Task 11 Track A, focusing on multilabel emotion classification across 28 languages. We explore two main strategies: fully fine-tuning transformer models and classifier-only training, evaluating different settings such as fine-tuning strategies, model architectures, loss functions, encoders, and classifiers. Our findings suggest that training a classifier on top of prompt-based encoders such as mE5 and BGE yields significantly better results than fully fine-tuning XLMR and mBERT. Our best-performing model on the final leaderboard is an ensemble combining multiple BGE models, where CatBoost serves as the classifier, with different configurations. This ensemble achieves an average F1-macro score of 56.58 across all languages.
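
For readers unfamiliar with the reported metric, the following is a minimal sketch of macro-averaged F1 for multi-label predictions, written in plain Python. The function name `macro_f1` and the toy labels are illustrative assumptions and are not taken from the paper; the paper additionally averages the score across languages, which is not shown here.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1 for multi-label data.

    y_true, y_pred: lists of equal-length 0/1 lists, one row per example.
    F1 is computed per label and then averaged with equal weight.
    """
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        # Convention assumed here: a label with no positives at all scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_labels


# Toy example with three emotion labels (hypothetical data).
gold = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
score = macro_f1(gold, pred)
```

Because every label contributes equally, a rare emotion that is never predicted pulls the average down as much as a frequent one, which is why macro-F1 is a demanding metric for imbalanced multi-label tasks.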

[230] Reading Between the Prompts: How Stereotypes Shape LLM's Implicit Personalization

Vera Neplenbroek,Arianna Bisazza,Raquel Fernández

Main category: cs.CL

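TL;DR: The study finds that large language models (LLMs) infer users' demographic information from implicit cues in the conversation, which can lead to lower-quality responses for minority groups; intervening on the model's internal representations mitigates the problem.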

Details

Motivation: Examine how LLMs infer user identity from stereotypical cues and how this affects response quality. Method: Uses controlled synthetic conversations to analyze LLMs' latent user representations through both model internals and generated answers. Result: LLMs do infer demographic attributes from stereotypical signals, and intervening on internal representations effectively mitigates the bias. Conclusion: Greater transparency and control are needed in how LLMs represent user identity. Abstract: Generative Large Language Models (LLMs) infer users' demographic information from subtle cues in the conversation -- a phenomenon called implicit personalization. Prior work has shown that such inferences can lead to lower quality responses for users assumed to be from minority groups, even when no demographic information is explicitly provided. In this work, we systematically explore how LLMs respond to stereotypical cues using controlled synthetic conversations, by analyzing the models' latent user representations through both model internals and generated answers to targeted user questions. Our findings reveal that LLMs do infer demographic attributes based on these stereotypical signals, which for a number of groups even persists when the user explicitly identifies with a different demographic group. Finally, we show that this form of stereotype-driven implicit personalization can be effectively mitigated by intervening on the model's internal representations using a trained linear probe to steer them toward the explicitly stated identity. Our results highlight the need for greater transparency and control in how LLMs represent user identity.

[231] Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning

Shuzheng Si,Haozhe Zhao,Cheng Gao,Yuzhuo Bai,Zhitong Wang,Bofei Gao,Kangyang Luo,Wenhao Li,Yufei Huang,Gang Chen,Fanchao Qi,Minjia Zhang,Baobao Chang,Maosong Sun

Main category: cs.CL

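TL;DR: The paper proposes CANOE, a systematic framework that improves LLM faithfulness in both short-form and long-form generation tasks without human annotation, using synthesized short-form QA data and the Dual-GRPO reinforcement learning method.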

Details

Motivation: Improve the contextual faithfulness of large language models (LLMs) in order to build reliable information-seeking systems. Method: (1) Synthesize short-form QA data; (2) propose Dual-GRPO, a reinforcement learning method that combines three rule-based rewards and optimizes short-form and long-form generation simultaneously. Result: CANOE greatly improves LLM faithfulness across 11 downstream tasks, even outperforming advanced models such as GPT-4o. Conclusion: The CANOE framework effectively improves LLM faithfulness without human annotation and has broad application potential. Abstract: Teaching large language models (LLMs) to be faithful in the provided context is crucial for building reliable information-seeking systems. Therefore, we propose a systematic framework, CANOE, to improve the faithfulness of LLMs in both short-form and long-form generation tasks without human annotations. Specifically, we first synthesize short-form question-answering (QA) data with four diverse tasks to construct high-quality and easily verifiable training data without human annotation. Also, we propose Dual-GRPO, a rule-based reinforcement learning method that includes three tailored rule-based rewards derived from synthesized short-form QA data, while simultaneously optimizing both short-form and long-form response generation. Notably, Dual-GRPO eliminates the need to manually label preference data to train reward models and avoids over-optimizing short-form generation when relying only on the synthesized short-form QA data. Experimental results show that CANOE greatly improves the faithfulness of LLMs across 11 different downstream tasks, even outperforming the most advanced LLMs, e.g., GPT-4o and OpenAI o1.

[232] LLaMAs Have Feelings Too: Unveiling Sentiment and Emotion Representations in LLaMA Models Through Probing

Dario Di Palma,Alessandro De Bellis,Giovanni Servedio,Vito Walter Anelli,Fedelucio Narducci,Tommaso Di Noia

Main category: cs.CL

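TL;DR: The study probes the hidden layers of Llama models to locate where sentiment features are represented and to assess their effect on sentiment analysis. Sentiment information is most concentrated in the middle layers, detection accuracy improves by up to 14% over prompting techniques, and memory requirements fall by 57% on average.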

Details

Motivation: Although large language models (LLMs) perform well on tasks such as sentiment analysis, how they capture sentiment information is still poorly understood. This study aims to fill that gap. Method: Uses probe classifiers to analyze sentiment encoding across the layers of Llama models, identifying the layers and pooling methods that best capture sentiment signals. Result: Sentiment information is most concentrated in the middle layers, with detection accuracy improving by up to 14%; in decoder-only models the last token is not always the most informative; memory requirements fall by 57% on average. Conclusion: Layer-specific probing is an effective approach for sentiment tasks that can improve model utility and reduce resource consumption. Abstract: Large Language Models (LLMs) have rapidly become central to NLP, demonstrating their ability to adapt to various tasks through prompting techniques, including sentiment analysis. However, we still have a limited understanding of how these models capture sentiment-related information. This study probes the hidden layers of Llama models to pinpoint where sentiment features are most represented and to assess how this affects sentiment analysis. Using probe classifiers, we analyze sentiment encoding across layers and scales, identifying the layers and pooling methods that best capture sentiment signals. Our results show that sentiment information is most concentrated in mid-layers for binary polarity tasks, with detection accuracy increasing up to 14% over prompting techniques. Additionally, we find that in decoder-only models, the last token is not consistently the most informative for sentiment encoding. Finally, this approach enables sentiment tasks to be performed with memory requirements reduced by an average of 57%. These insights contribute to a broader understanding of sentiment in LLMs, suggesting layer-specific probing as an effective approach for sentiment tasks beyond prompting, with potential to enhance model utility and reduce memory requirements.

[233] Sparse Activation Editing for Reliable Instruction Following in Narratives

Runcong Zhao,Chengyu Cao,Qinglin Zhu,Xiucheng Lv,Shun Shao,Lin Gui,Ruifeng Xu,Yulan He

Main category: cs.CL

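TL;DR: Concise-SAE is a training-free framework that identifies and edits instruction-relevant neurons using only natural language instructions, significantly improving language models' instruction following in complex narrative contexts.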

Details

Motivation: Existing benchmarks fail to capture the difficulty language models have in following instructions within complex narrative contexts. Method: Proposes the Concise-SAE framework, which identifies and edits instruction-relevant neurons using natural language instructions, without labelled data. Result: Performs strongly on the diverse FreeInstruct benchmark without compromising generation quality. Conclusion: Concise-SAE performs well on complex narratives and other tasks without additional training. Abstract: Complex narrative contexts often challenge language models' ability to follow instructions, and existing benchmarks fail to capture these difficulties. To address this, we propose Concise-SAE, a training-free framework that improves instruction following by identifying and editing instruction-relevant neurons using only natural language instructions, without requiring labelled data. To thoroughly evaluate our method, we introduce FreeInstruct, a diverse and realistic benchmark of 1,212 examples that highlights the challenges of instruction following in narrative-rich settings. While initially motivated by complex narratives, Concise-SAE demonstrates state-of-the-art instruction adherence across varied tasks without compromising generation quality.

[234] AppealCase: A Dataset and Benchmark for Civil Case Appeal Scenarios

Yuting Huang,Meitong Guo,Yiquan Wu,Ang Li,Xiaozhong Liu,Keting Yin,Changlong Sun,Fei Wu,Kun Kuang

Main category: cs.CL

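TL;DR: The paper introduces the AppealCase dataset, which fills a gap in appellate case analysis within LegalAI, and proposes five new tasks; evaluation shows that existing models perform poorly in appeal scenarios.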

Details

Motivation: Existing LegalAI research focuses mostly on first-instance cases and overlooks the appellate process, which is a core mechanism for error correction and ensuring fairness in the judicial system. Method: Builds the AppealCase dataset of 10,000 matched pairs of first-instance and second-instance documents, annotated along five key dimensions, proposes five new tasks based on it, and evaluates 20 mainstream models. Result: All existing models score below 50% F1 on the judgment reversal prediction task, indicating the complexity and difficulty of the appeal scenario. Conclusion: The AppealCase dataset is expected to advance LegalAI research on appellate case analysis and improve consistency in judicial decision-making. Abstract: Recent advances in LegalAI have primarily focused on individual case judgment analysis, often overlooking the critical appellate process within the judicial system. Appeals serve as a core mechanism for error correction and ensuring fair trials, making them highly significant both in practice and in research. To address this gap, we present the AppealCase dataset, consisting of 10,000 pairs of real-world, matched first-instance and second-instance documents across 91 categories of civil cases. The dataset also includes detailed annotations along five dimensions central to appellate review: judgment reversals, reversal reasons, cited legal provisions, claim-level decisions, and whether there is new information in the second instance. Based on these annotations, we propose five novel LegalAI tasks and conduct a comprehensive evaluation across 20 mainstream models. Experimental results reveal that all current models achieve less than 50% F1 scores on the judgment reversal prediction task, highlighting the complexity and challenge of the appeal scenario. We hope that the AppealCase dataset will spur further research in LegalAI for appellate case analysis and contribute to improving consistency in judicial decision-making.

[235] CUB: Benchmarking Context Utilisation Techniques for Language Models

Lovisa Hagström,Youna Kim,Haeun Yu,Sang-goo Lee,Richard Johansson,Hyunsoo Cho,Isabelle Augenstein

Main category: cs.CL

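TL;DR: This paper proposes the CUB benchmark for systematically comparing context utilisation manipulation techniques (CMTs) in retrieval-augmented generation (RAG), and finds that existing CMTs struggle with the diverse contexts found in real-world scenarios.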

Details

Motivation: Language models (LMs) may ignore or misuse contextual information in knowledge-intensive tasks, and existing CMTs have seen little systematic comparison. Method: Develops the CUB benchmark and evaluates seven representative CMTs across three context types and diverse datasets. Result: Most CMTs struggle to handle the diverse contexts found in real-world scenarios, and their performance is inflated on synthesised datasets. Conclusion: CMTs that can handle multiple context types are needed, along with more holistic testing. Abstract: Incorporating external knowledge is crucial for knowledge-intensive tasks, such as question answering and fact checking. However, language models (LMs) may ignore relevant information that contradicts outdated parametric memory or be distracted by irrelevant contexts. While many context utilisation manipulation techniques (CMTs) that encourage or suppress context utilisation have recently been proposed to alleviate these issues, few have seen systematic comparison. In this paper, we develop CUB (Context Utilisation Benchmark) to help practitioners within retrieval-augmented generation (RAG) identify the best CMT for their needs. CUB allows for rigorous testing on three distinct context types, observed to capture key challenges in realistic context utilisation scenarios. With this benchmark, we evaluate seven state-of-the-art methods, representative of the main categories of CMTs, across three diverse datasets and tasks, applied to nine LMs. Our results show that most of the existing CMTs struggle to handle the full set of types of contexts that may be encountered in real-world retrieval-augmented scenarios. Moreover, we find that many CMTs display an inflated performance on simple synthesised datasets, compared to more realistic datasets with naturally occurring samples. Altogether, our results show the need for holistic tests of CMTs and the development of CMTs that can handle multiple context types.

[236] Are the Hidden States Hiding Something? Testing the Limits of Factuality-Encoding Capabilities in LLMs

Giovanni Servedio,Alessandro De Bellis,Dario Di Palma,Vito Walter Anelli,Tommaso Di Noia

Main category: cs.CL

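TL;DR: The paper examines factual hallucination in LLMs, challenging prior work by generating more realistic datasets, and finds that generalization to LLM-generated data remains difficult.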

Details

Motivation: Address the problem of LLMs generating false content in order to improve model reliability and user trust. Method: Samples plausible true-false factoid sentences from tabular data and generates realistic, LLM-dependent true-false datasets from question answering collections. Result: Prior findings are partially validated, but generalization to LLM-generated datasets remains difficult. Conclusion: The study lays the groundwork for research on factuality in LLMs and offers practical guidelines for more effective evaluation. Abstract: Factual hallucinations are a major challenge for Large Language Models (LLMs). They undermine reliability and user trust by generating inaccurate or fabricated content. Recent studies suggest that when generating false statements, the internal states of LLMs encode information about truthfulness. However, these studies often rely on synthetic datasets that lack realism, which limits generalization when evaluating the factual accuracy of text generated by the model itself. In this paper, we challenge the findings of previous work by investigating truthfulness encoding capabilities, leading to the generation of a more realistic and challenging dataset. Specifically, we extend previous work by introducing: (1) a strategy for sampling plausible true-false factoid sentences from tabular data and (2) a procedure for generating realistic, LLM-dependent true-false datasets from Question Answering collections. Our analysis of two open-source LLMs reveals that while the findings from previous studies are partially validated, generalization to LLM-generated datasets remains challenging. This study lays the groundwork for future research on factuality in LLMs and offers practical guidelines for more effective evaluation.

[237] Benchmarking and Pushing the Multi-Bias Elimination Boundary of LLMs via Causal Effect Estimation-guided Debiasing

Zhouhao Sun,Zhiyuan Kan,Xiao Ding,Li Du,Yang Zhao,Bing Qin,Ting Liu

Main category: cs.CL

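TL;DR: The paper proposes a multi-bias benchmark in which each piece of data contains five types of bias, and develops a causal effect estimation-guided multi-bias elimination method (CMBE) to improve the generalizability of large language models.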

Details

Motivation: Current large language models (LLMs) may still exploit bias during inference, leading to poor generalizability. Existing benchmarks typically contain a single type of bias, whereas data in practical applications may contain several. Method: Proposes a multi-bias benchmark and develops CMBE, which estimates the causal effects of multiple bias types simultaneously and removes them from the total causal effect. Result: Experiments show that existing LLMs and debiasing methods perform poorly on the multi-bias benchmark, while CMBE effectively eliminates multiple biases at once. Conclusion: CMBE improves the generalizability of LLMs and addresses the challenge of eliminating multiple types of bias simultaneously. Abstract: Despite significant progress, recent studies have indicated that current large language models (LLMs) may still utilize bias during inference, leading to the poor generalizability of LLMs. Some benchmarks are proposed to investigate the generalizability of LLMs, with each piece of data typically containing one type of controlled bias. However, a single piece of data may contain multiple types of biases in practical applications. To bridge this gap, we propose a multi-bias benchmark where each piece of data contains five types of biases. The evaluations conducted on this benchmark reveal that the performance of existing LLMs and debiasing methods is unsatisfying, highlighting the challenge of eliminating multiple types of biases simultaneously. To overcome this challenge, we propose a causal effect estimation-guided multi-bias elimination method (CMBE). This method first estimates the causal effect of multiple types of biases simultaneously. Subsequently, we eliminate the causal effect of biases from the total causal effect exerted by both the semantic information and biases during inference. Experimental results show that CMBE can effectively eliminate multiple types of bias simultaneously to enhance the generalizability of LLMs.

[238] EnSToM: Enhancing Dialogue Systems with Entropy-Scaled Steering Vectors for Topic Maintenance

Heejae Suh,Yejin Jeon,Deokhyung Kang,Taehee Park,Yejin Min,Gary Geunbae Lee

Main category: cs.CL

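TL;DR: EnSToM is a new method that dynamically adjusts steering intensity based on input uncertainty to improve the topic consistency of small language models in task-oriented dialogue.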

Details

Motivation: Small language models are efficient in resource-constrained environments but lack topic consistency, which undermines the reliability of task-oriented dialogue systems. Method: Proposes Entropy-scaled Steering vectors for Topic Maintenance (EnSToM), which dynamically adjusts steering intensity based on input uncertainty. Result: Experiments show that EnSToM achieves significant performance gains with a small amount of data, outperforming fine-tuning approaches. Conclusion: EnSToM strengthens topic consistency without sacrificing efficiency, providing a robust solution for dialogue systems based on small language models. Abstract: Small large language models (sLLMs) offer the advantage of being lightweight and efficient, which makes them suitable for resource-constrained environments. However, sLLMs often struggle to maintain topic consistency in task-oriented dialogue systems, which is critical for scenarios such as service chatbots. Specifically, it is important to ensure that the model denies off-topic or malicious inputs and adheres to its intended functionality so as to prevent potential misuse and uphold reliability. Towards this, existing activation engineering approaches have been proposed to manipulate internal activations during inference. While these methods are effective in certain scenarios, our preliminary experiments reveal their limitations in ensuring topic adherence. Therefore, to address this, we propose a novel approach termed Entropy-scaled Steering vectors for Topic Maintenance (EnSToM). EnSToM dynamically adjusts the steering intensity based on input uncertainty, which allows the model to handle off-topic distractors effectively while preserving on-topic accuracy. Our experiments demonstrate that EnSToM achieves significant performance gain with a relatively small data size compared to fine-tuning approaches. By improving topic adherence without compromising efficiency, our approach provides a robust solution for enhancing sLLM-based dialogue systems.
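
The idea of scaling a steering vector by input uncertainty can be sketched as follows. This is only an illustration of the general technique, not the paper's formulation: the abstract does not give the scaling function, so the normalized-entropy rule, the direction of the scaling (exposed as the `increase_with_entropy` flag), and all function names are assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def steering_coefficient(probs, base_strength=1.0, increase_with_entropy=True):
    """Scale a base steering strength by normalized entropy in [0, 1].

    Assumption: uncertainty is measured on a probability vector such as a
    next-token distribution. Whether more uncertainty should mean stronger
    or weaker steering is a design choice left open by this sketch.
    """
    h_max = math.log(len(probs))
    norm = entropy(probs) / h_max if h_max > 0 else 0.0
    scale = norm if increase_with_entropy else 1.0 - norm
    return base_strength * scale


def steer(hidden, steering_vector, coeff):
    """Add the scaled steering vector to one hidden-state vector."""
    return [h + coeff * v for h, v in zip(hidden, steering_vector)]


# Hypothetical usage: a flat distribution yields full-strength steering.
coeff = steering_coefficient([0.25, 0.25, 0.25, 0.25], base_strength=2.0)
steered = steer([1.0, 2.0], [0.5, -0.5], coeff)
```

In a real model the addition would be applied to the activations of a chosen layer during inference; here plain lists stand in for tensors to keep the sketch dependency-free.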

[239] Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models

Ercong Nie,Helmut Schmid,Hinrich Schütze

Main category: cs.CL

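TL;DR: The paper studies language confusion in large language models (LLMs), uses mechanistic interpretability (MI) methods to show the central role of confusion points (CPs), and proposes editing critical neurons to substantially reduce confusion.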

Details

Motivation: Language confusion is a critical challenge for English-centric models; the study aims to understand and address it through mechanistic interpretability. Method: Combines behavioral benchmarking with neuron-level analysis, using the Language Confusion Benchmark (LCB) to identify confusion points and TunedLens with neuron attribution to reveal the underlying mechanism. Result: Confusion is driven mainly by transition failures in the final layers, and editing critical neurons substantially reduces confusion without harming model performance. Conclusion: The study provides new insights into the internal dynamics of LLMs, and neuron-level intervention is a promising direction for more robust and interpretable multilingual modeling. Abstract: Language confusion -- where large language models (LLMs) generate unintended languages against the user's need -- remains a critical challenge, especially for English-centric models. We present the first mechanistic interpretability (MI) study of language confusion, combining behavioral benchmarking with neuron-level analysis. Using the Language Confusion Benchmark (LCB), we show that confusion points (CPs) -- specific positions where language switches occur -- are central to this phenomenon. Through layer-wise analysis with TunedLens and targeted neuron attribution, we reveal that transition failures in the final layers drive confusion. We further demonstrate that editing a small set of critical neurons, identified via comparative analysis with multilingual-tuned models, substantially mitigates confusion without harming general competence or fluency. Our approach matches multilingual alignment in confusion reduction for most languages and yields cleaner, higher-quality outputs. These findings provide new insights into the internal dynamics of LLMs and highlight neuron-level interventions as a promising direction for robust, interpretable multilingual language modeling.

[240] Think Silently, Think Fast: Dynamic Latent Compression of LLM Reasoning Chains

Wenhui Tan,Jiaze Li,Jianzhong Ju,Zhenbo Luo,Jian Luan,Ruihua Song

Main category: cs.CL

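TL;DR: The CoLaR framework dynamically compresses reasoning processes in latent space, substantially shortening reasoning chains and improving efficiency while maintaining strong performance.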

Details

Motivation: Explicit reasoning chains in LLMs are computationally expensive and inefficient. Method: Two-stage training: a supervised fine-tuning stage that predicts compressed embeddings, followed by a reinforcement learning stage that explores diverse reasoning paths. Result: On mathematical reasoning tasks, CoLaR achieves 14.1% higher accuracy than latent-based baselines at comparable compression ratios and reduces reasoning chain length by 53.3% relative to explicit CoT. Conclusion: CoLaR reasons efficiently in latent space, adjusts its reasoning speed dynamically, and is suitable for complex tasks. Abstract: Large Language Models (LLMs) achieve superior performance through Chain-of-Thought (CoT) reasoning, but these token-level reasoning chains are computationally expensive and inefficient. In this paper, we introduce Compressed Latent Reasoning (CoLaR), a novel framework that dynamically compresses reasoning processes in latent space through a two-stage training approach. First, during supervised fine-tuning, CoLaR extends beyond next-token prediction by incorporating an auxiliary next compressed embedding prediction objective. This process merges embeddings of consecutive tokens using a compression factor randomly sampled from a predefined range, and trains a specialized latent head to predict distributions of subsequent compressed embeddings. Second, we enhance CoLaR through reinforcement learning (RL) that leverages the latent head's non-deterministic nature to explore diverse reasoning paths and exploit more compact ones. This approach enables CoLaR to: i) perform reasoning at a dense latent level (i.e., silently), substantially reducing reasoning chain length, and ii) dynamically adjust reasoning speed at inference time by simply prompting the desired compression factor. Extensive experiments across four mathematical reasoning datasets demonstrate that CoLaR achieves 14.1% higher accuracy than latent-based baseline methods at comparable compression ratios, and reduces reasoning chain length by 53.3% with only 4.8% performance degradation compared to the explicit CoT method. Moreover, when applied to more challenging mathematical reasoning tasks, our RL-enhanced CoLaR demonstrates performance gains of up to 5.4% while dramatically reducing latent reasoning chain length by 82.8%. The code and models will be released upon acceptance.
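
The compression step described in the abstract, merging the embeddings of consecutive tokens with a randomly sampled compression factor, might look roughly like the sketch below. The abstract does not say how embeddings are merged, so averaging is an assumption made here, as are the function names, the sampling range, and the handling of a shorter trailing group.

```python
import random


def merge_consecutive(embeddings, factor):
    """Merge every `factor` consecutive embeddings into one vector.

    Assumption: merging is done by element-wise averaging; a trailing
    group shorter than `factor` is averaged over its actual length.
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")
    merged = []
    for start in range(0, len(embeddings), factor):
        group = embeddings[start:start + factor]
        dim = len(group[0])
        merged.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return merged


def sample_factor(rng, low, high):
    """Draw a compression factor uniformly from the inclusive range [low, high]."""
    return rng.randint(low, high)


# Hypothetical usage with five 2-dimensional token embeddings.
rng = random.Random(0)
tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0], [9.0, 8.0]]
factor = sample_factor(rng, 1, 4)
compressed = merge_consecutive(tokens, factor)
```

A factor of 1 leaves the sequence unchanged, while larger factors shorten it proportionally, which is the lever the abstract describes for trading reasoning length against accuracy.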

[241] ScholarBench: A Bilingual Benchmark for Abstraction, Comprehension, and Reasoning Evaluation in Academic Contexts

Dongwon Noh,Donghyeok Koh,Junghun Yuk,Gyuwan Kim,Jaeyong Lee,Kyungtae Lim,Cheoneum Park

Main category: cs.CL

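TL;DR: The paper introduces ScholarBench, a benchmark focused on deep expert knowledge and complex academic problem-solving, for evaluating the academic reasoning ability of large language models.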

Details

Motivation: Existing benchmarks cannot assess how large language models perform on complex academic tasks, so a more specialized evaluation tool is needed. Method: Uses a three-step construction process to build a bilingual (English and Korean) dataset covering five problem types and eight research domains. Result: The benchmark contains 5,031 Korean and 5,309 English examples; even advanced models such as o3-mini reach an average score of only 0.543, showing how challenging it is. Conclusion: ScholarBench provides a high-quality and challenging tool for evaluating the academic reasoning ability of large language models. Abstract: Prior benchmarks for evaluating the domain-specific knowledge of large language models (LLMs) lack the scalability to handle complex academic tasks. To address this, we introduce ScholarBench, a benchmark centered on deep expert knowledge and complex academic problem-solving, which evaluates the academic reasoning ability of LLMs and is constructed through a three-step process. ScholarBench targets more specialized and logically complex contexts derived from academic literature, encompassing five distinct problem types. Unlike prior benchmarks, ScholarBench evaluates the abstraction, comprehension, and reasoning capabilities of LLMs across eight distinct research domains. To ensure high-quality evaluation data, we define category-specific example attributes and design questions that are aligned with the characteristic research methodologies and discourse structures of each domain. Additionally, this benchmark operates as an English-Korean bilingual dataset, facilitating simultaneous evaluation for linguistic capabilities of LLMs in both languages. The benchmark comprises 5,031 examples in Korean and 5,309 in English, with even state-of-the-art models like o3-mini achieving an average evaluation score of only 0.543, demonstrating the challenging nature of this benchmark.

[242] URLs Help, Topics Guide: Understanding Metadata Utility in LLM Training

Dongyang Fan,Vinko Sabolčec,Martin Jaggi

Main category: cs.CL

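TL;DR: The study finds that not all metadata types contribute equally to LLM training and performance: only URL context speeds up training, while quality scores and topic/format information offer no clear benefit. URL context improves downstream performance only when longer prompts are used at inference time. In addition, context-aware pretraining enables more controllable generation than context-free pretraining.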

Details

Motivation: Examine how different types of metadata (URL, quality scores, topic/format) affect LLM training efficiency and downstream performance, filling a gap in existing research. Method: Systematically evaluates the role of several metadata types (URL, quality scores, topic/format) in LLM pretraining, analyzing their effect on training speed and downstream task performance. Result: Only URL context speeds up training; quality scores and topic/format information offer no clear benefit for training. URL context improves performance only with longer prompts at inference time. Context-aware pretraining allows more controllable generation. Conclusion: URL context is the only metadata type that accelerates training, while topic and format information, although they do not accelerate training, can be used to steer generation. Context-aware pretraining provides more controllable generation. Abstract: Large Language Models (LLMs) are commonly pretrained on vast corpora of text without utilizing contextual metadata such as source, quality, or topic, leading to a context-free learning paradigm. While recent studies suggest that adding metadata like URL information as context (i.e., auxiliary inputs not used in the loss calculation) can improve training efficiency and downstream performance, they offer limited understanding of which types of metadata are truly effective and under what conditions. In this work, we conduct a systematic evaluation and find that not all metadata types contribute equally. Only URL context speeds up training, whereas quality scores and topic/format domain information offer no clear benefit. Furthermore, the improved downstream performances of URL conditioning emerge only when longer prompts are used at inference time. In addition, we demonstrate that context-aware pretraining enables more controllable generation than context-free pretraining, in a classifier-free guidance fashion. Although topic and format metadata do not accelerate training, they are effective for steering outputs, offering human-interpretable control over generation.
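
Classifier-free guidance, which the abstract mentions as the steering mechanism, is commonly applied to next-token logits by extrapolating from the context-free prediction toward the context-conditioned one. The sketch below shows that standard combination rule in plain Python; the function names and toy logits are illustrative assumptions, and the paper's exact setup may differ.

```python
import math


def guided_logits(cond, uncond, scale):
    """Combine conditional and unconditional logits.

    scale = 0 returns the unconditional logits, scale = 1 the conditional
    ones, and scale > 1 pushes further in the direction of the condition.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits over a 2-token vocabulary: `cond` is computed with a
# metadata prefix (for example a topic tag), `uncond` without it.
cond = [2.0, 0.0]
uncond = [1.0, 1.0]
probs = softmax(guided_logits(cond, uncond, scale=2.0))
```

With a scale above 1 the token favoured by the metadata-conditioned pass gains more probability mass than under plain conditioning, which is what makes the guidance weight a simple control knob at inference time.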

[243] EMULATE: A Multi-Agent Framework for Determining the Veracity of Atomic Claims by Emulating Human Actions

Spencer Hong,Meng Luo,Xinyi Wan

Main category: cs.CL

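TL;DR: The paper proposes EMULATE, a new fact-checking system that emulates human actions through a multi-agent framework and shows clear improvements over prior work.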

Details

Motivation: Existing fact-checking systems retrieve evidence and then classify, which differs from how humans work; an approach closer to human behavior is needed. Method: Uses a multi-agent framework in which each agent handles a specific subtask (such as ranking search results or evaluating webpage content), emulating human actions. Result: Outperforms existing methods on several benchmarks, validating the multi-agent framework. Conclusion: By emulating human actions, EMULATE improves claim verification performance over prior systems. Abstract: Determining the veracity of atomic claims is an imperative component of many recently proposed fact-checking systems. Many approaches tackle this problem by first retrieving evidence by querying a search engine and then performing classification by providing the evidence set and atomic claim to a large language model, but this process deviates from what a human would do in order to perform the task. Recent work attempted to address this issue by proposing iterative evidence retrieval, allowing for evidence to be collected several times and only when necessary. Continuing along this line of research, we propose a novel claim verification system, called EMULATE, which is designed to better emulate human actions through the use of a multi-agent framework where each agent performs a small part of the larger task, such as ranking search results according to predefined criteria or evaluating webpage content. Extensive experiments on several benchmarks show clear improvements over prior work, demonstrating the efficacy of our new multi-agent framework.

[244] O$^2$-Searcher: A Searching-based Agent Model for Open-Domain Open-Ended Question Answering

Jianbiao Mei,Tao Hu,Daocheng Fu,Licheng Wen,Xuemeng Yang,Rong Wu,Pinlong Cai,Xing Gao,Yu Yang,Chengjun Xie,Botian Shi,Yong Liu,Yu Qiao

Main category: cs.CL

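TL;DR: O²-Searcher is a reinforcement learning-based search agent designed to tackle both open-ended and closed-ended questions in the open domain, achieving significant performance gains through dynamic knowledge acquisition and a unified training mechanism.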

Details

Motivation: Large language models (LLMs) are limited by static parametric knowledge and struggle with tasks that require up-to-date open-domain information, especially open-ended questions. Method: Proposes O²-Searcher, which uses reinforcement learning to acquire knowledge dynamically in a locally simulated search environment, with a unified training mechanism and purpose-built reward functions. Result: O²-Searcher significantly outperforms other LLM agents on the O²-QA benchmark and reaches SOTA results on closed-ended QA benchmarks among similarly sized models. Conclusion: Through dynamic knowledge acquisition and adaptive strategies, O²-Searcher handles both open-ended and closed-ended questions effectively and outperforms comparable models. Abstract: Large Language Models (LLMs), despite their advancements, are fundamentally limited by their static parametric knowledge, hindering performance on tasks requiring open-domain up-to-date information. While enabling LLMs to interact with external knowledge environments is a promising solution, current efforts primarily address closed-end problems. Open-ended questions, which are characterized by lacking a standard answer or providing non-unique and diverse answers, remain underexplored. To bridge this gap, we present O$^2$-Searcher, a novel search agent leveraging reinforcement learning to effectively tackle both open-ended and closed-ended questions in the open domain. O$^2$-Searcher leverages an efficient, locally simulated search environment for dynamic knowledge acquisition, effectively decoupling the external world knowledge from the model's sophisticated reasoning processes. It employs a unified training mechanism with meticulously designed reward functions, enabling the agent to identify problem types and adapt different answer generation strategies. Furthermore, to evaluate performance on complex open-ended tasks, we construct O$^2$-QA, a high-quality benchmark featuring 300 manually curated, multi-domain open-ended questions with associated web page caches. Extensive experiments show that O$^2$-Searcher, using only a 3B model, significantly surpasses leading LLM agents on O$^2$-QA. It also achieves SOTA results on various closed-ended QA benchmarks against similarly-sized models, while performing on par with much larger ones.

[245] Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering

Bowen Jiang,Runchuan Zhu,Jiang Wu,Zinco Jiang,Yifan He,Junyuan Gao,Jia Yu,Rui Min,Yinfan Wang,Haote Yang,Songyang Zhang,Dahua Lin,Lijun Wu,Conghui He

Main category: cs.CL

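TL;DR: KoLasSimpleQA is the first benchmark to evaluate the multilingual factual ability of large language models (LLMs), covering 9 languages with a dual-domain design, and aims to efficiently test LLMs' factual memory and self-awareness.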

Details

Motivation: Existing research lacks a comprehensive evaluation of LLMs' multilingual factual ability; KoLasSimpleQA fills this gap, supporting evaluation of global applicability and in-depth assessment. Method: Designs a question set with single knowledge point coverage, absolute objectivity, unique answers, and temporal stability, and evaluates with the LLM-as-judge paradigm. Result: Mainstream LLMs differ significantly between the general domain and the language-specific domain, particularly in performance metrics, ranking, calibration, and robustness. Conclusion: KoLasSimpleQA helps identify LLM capability boundaries in multilingual contexts and provides guidance for model optimization. Abstract: We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs). Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability. These questions enable efficient evaluation using the LLM-as-judge paradigm, testing both the LLMs' factual memory and self-awareness ("know what they don't know"). KoLasSimpleQA expands existing research in two key dimensions: (1) Breadth (Multilingual Coverage): It includes 9 languages, supporting global applicability evaluation. (2) Depth (Dual Domain Design): It covers both the general domain (global facts) and the language-specific domain (such as history, culture, and regional traditions) for a comprehensive assessment of multilingual capabilities. We evaluated mainstream LLMs, including traditional LLM and emerging Large Reasoning Models. Results show significant performance differences between the two domains, particularly in performance metrics, ranking, calibration, and robustness. This highlights the need for targeted evaluation and optimization in multilingual contexts. We hope KoLasSimpleQA will help the research community better identify LLM capability boundaries in multilingual contexts and provide guidance for model optimization. We will release KoLasSimpleQA at https://github.com/opendatalab/KoLasSimpleQA .

[246] What Media Frames Reveal About Stance: A Dataset and Study about Memes in Climate Change Discourse

Shijia Zhou,Siyao Peng,Simon Luebke,Jörg Haßler,Mario Haim,Saif M. Mohammad,Barbara Plank

Main category: cs.CL

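TL;DR: The paper studies the interaction between media frames and stance in climate-change internet memes, introduces the CLIMATEMEMES dataset, and evaluates LLaVA-NeXT and Molmo on stance detection and media frame detection.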

Details

Motivation: Explore the interaction between media frames and stance, particularly as it appears in climate-change internet memes. Method: Takes an interdisciplinary approach, builds the CLIMATEMEMES dataset of 1,184 memes, and evaluates LLaVA-NeXT and Molmo on stance and frame detection. Result: Vision-language models (VLMs) perform well on stance detection but fall behind language models (LLMs) on frame detection. Human-written captions improve performance. Conclusion: VLMs are limited in handling nuanced frames and stance expressions in climate-change memes, and LLMs perform better on frame detection. Abstract: Media framing refers to the emphasis on specific aspects of perceived reality to shape how an issue is defined and understood. Its primary purpose is to shape public perceptions often in alignment with the authors' opinions and stances. However, the interaction between stance and media frame remains largely unexplored. In this work, we apply an interdisciplinary approach to conceptualize and computationally explore this interaction with internet memes on climate change. We curate CLIMATEMEMES, the first dataset of climate-change memes annotated with both stance and media frames, inspired by research in communication science. CLIMATEMEMES includes 1,184 memes sourced from 47 subreddits, enabling analysis of frame prominence over time and communities, and sheds light on the framing preferences of different stance holders. We propose two meme understanding tasks: stance detection and media frame detection. We evaluate LLaVA-NeXT and Molmo in various setups, and report the corresponding results on their LLM backbone. Human captions consistently enhance performance. Synthetic captions and human-corrected OCR also help occasionally. Our findings highlight that VLMs perform well on stance, but struggle on frames, where LLMs outperform VLMs. Finally, we analyze VLMs' limitations in handling nuanced frames and stance expressions on climate change internet memes.

[247] From Generic Empathy to Personalized Emotional Support: A Self-Evolution Framework for User Preference Alignment

Jing Ye,Lu Xiang,Yaping Zhang,Chengqing Zong

Main category: cs.CL

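TL;DR: The paper proposes a self-evolution framework that helps large language models (LLMs) provide more personalized emotional support in multi-turn interactions, using self-reflection and self-refinement to reduce generic responses.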

Details

Motivation: Existing LLMs often give generic responses in emotional support and fail to meet users' individual needs; they need to understand user preferences better. Method: The framework has two phases: (1) emotional support experience acquisition, where fine-tuning provides basic support ability; and (2) self-improvement, where self-reflection and self-refinement produce personalized responses. Result: Experiments show that the method significantly improves emotional support quality, reduces unhelpful responses, and narrows the gap between user preferences and model outputs. Conclusion: The self-evolution framework effectively improves the personalization of LLMs in emotional support and better meets user needs. Abstract: Effective emotional support hinges on understanding users' emotions and needs to provide meaningful comfort during multi-turn interactions. Large Language Models (LLMs) show great potential for expressing empathy; however, they often deliver generic and one-size-fits-all responses that fail to address users' specific needs. To tackle this issue, we propose a self-evolution framework designed to help LLMs improve their responses to better align with users' implicit preferences concerning user profiles (personalities), emotional states, and specific situations. Our framework consists of two distinct phases: (1) Emotional Support Experience Acquisition, where LLMs are fine-tuned on limited emotional support conversation data to provide basic support, and (2) Self-Improvement for Personalized Emotional Support, where LLMs leverage self-reflection and self-refinement to generate personalized responses. Through iterative direct preference optimization between the pre- and post-refined responses, our model generates responses that reflect a better understanding of the user's implicit preferences. Extensive experiments and evaluations demonstrate that our method significantly enhances the model's performance in emotional support, reducing unhelpful responses and minimizing discrepancies between user preferences and model outputs.
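The abstract above mentions direct preference optimization between pre- and post-refined responses. As a rough illustration only, the standard DPO loss for a single preference pair can be sketched as follows. The sketch assumes the post-refined response is treated as the preferred one, which is an inference from the abstract; the argument names, the `beta` value, and the example log-probabilities are hypothetical, and the paper's exact objective may differ.

```python
import math


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed token log-probabilities of each response under the
    policy being trained and under a frozen reference model.
    `beta` is an assumed value, not taken from the paper.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((policy_chosen_lp - ref_chosen_lp)
                     - (policy_rejected_lp - ref_rejected_lp))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# Hypothetical log-probabilities: chosen = post-refined, rejected = pre-refined.
no_preference = dpo_loss(-5.0, -5.0, -5.0, -5.0)   # policy equals reference
learned_preference = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

When the policy matches the reference the loss equals log 2, and it decreases as the policy assigns relatively more probability to the chosen response.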

[248] Steering Large Language Models for Machine Translation Personalization

Daniel Scalena,Gabriele Sarti,Arianna Bisazza,Elisabetta Fersini,Malvina Nissim

Main category: cs.CL

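TL;DR: The paper explores strategies for personalizing LLM-generated translations in low-resource settings and proposes a contrastive framework that uses latent concepts extracted from sparse autoencoders, achieving strong personalization while preserving translation quality.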

Details

Motivation: High-quality LLM-based machine translation systems still struggle when stylistic requirements are not explicit, especially in literary translation. Method: The study examines prompting strategies and inference-time interventions, and proposes a contrastive framework that uses latent concepts extracted from sparse autoencoders. Result: Experiments show that the method achieves strong personalization without harming translation quality, and that multi-shot prompting and the intervention method appear to work through similar mechanisms. Conclusion: The method addresses the challenge of personalized translation in low-resource settings and suggests a new direction for applying LLMs to literary translation. Abstract: High-quality machine translation systems based on large language models (LLMs) have simplified the production of personalized translations reflecting specific stylistic constraints. However, these systems still struggle in settings where stylistic requirements are less explicit and might be harder to convey via prompting. We explore various strategies for personalizing LLM-generated translations in low-resource settings, focusing on the challenging literary translation domain. We explore prompting strategies and inference-time interventions for steering model generations towards a personalized style, and propose a contrastive framework exploiting latent concepts extracted from sparse autoencoders to identify salient personalization properties. Our results show that steering achieves strong personalization while preserving translation quality. We further examine the impact of steering on LLM representations, finding that model layers with a relevant impact for personalization are impacted similarly by multi-shot prompting and our steering method, suggesting a similar mechanism at play.

[249] SSR-Zero: Simple Self-Rewarding Reinforcement Learning for Machine Translation

Wenjie Yang,Mao Zheng,Mingyang Song,Zheng Li

Main category: cs.CL

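TL;DR: The paper proposes SSR, a self-rewarding reinforcement learning framework for machine translation that needs no external supervision signal, relies solely on self-judging rewards, and performs strongly on English-Chinese translation.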

Details

Motivation: Current advanced MT-specific large language models (LLMs) depend on external supervision signals (such as human-annotated data or reward models) that are expensive and hard to scale, which limits their use. Method: The Simple Self-Rewarding (SSR) reinforcement learning framework is proposed; it is fully online, reference-free, and trained solely with self-judging rewards. Result: SSR-Zero-7B outperforms existing models on the WMT23, WMT24, and Flores200 benchmarks; with added COMET supervision, SSR-X-Zero-7B reaches state-of-the-art performance. Conclusion: The self-rewarding mechanism is effective for machine translation and compares favorably with external LLM-as-a-judge approaches; it is complementary to trained reward models and points to a new direction for self-improving RL methods. Abstract: Large language models (LLMs) have recently demonstrated remarkable capabilities in machine translation (MT). However, most advanced MT-specific LLMs heavily rely on external supervision signals during training, such as human-annotated reference data or trained reward models (RMs), which are often expensive to obtain and challenging to scale. To overcome this limitation, we propose a Simple Self-Rewarding (SSR) Reinforcement Learning (RL) framework for MT that is reference-free, fully online, and relies solely on self-judging rewards. Training with SSR using 13K monolingual examples and Qwen-2.5-7B as the backbone, our model SSR-Zero-7B outperforms existing MT-specific LLMs, e.g., TowerInstruct-13B and GemmaX-28-9B, as well as larger general LLMs like Qwen2.5-32B-Instruct in English $\leftrightarrow$ Chinese translation tasks from WMT23, WMT24, and Flores200 benchmarks. Furthermore, by augmenting SSR with external supervision from COMET, our strongest model, SSR-X-Zero-7B, achieves state-of-the-art performance in English $\leftrightarrow$ Chinese translation, surpassing all existing open-source models under 72B parameters and even outperforming closed-source models, e.g., GPT-4o and Gemini 1.5 Pro. Our analysis highlights the effectiveness of the self-rewarding mechanism compared to the external LLM-as-a-judge approach in MT and demonstrates its complementary benefits when combined with trained RMs. Our findings provide valuable insight into the potential of self-improving RL methods. We have publicly released our code, data and models.

[250] Collaboration among Multiple Large Language Models for Medical Question Answering

Kexin Shang,Chia-Hsuan Chang,Christopher C. Yang

Main category: cs.CL

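TL;DR: Proposes a multi-LLM collaboration framework for a medical multiple-choice question dataset that improves the LLMs' reasoning ability and reduces their divergence.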

Details

Motivation: The potential of large language models (LLMs) on medical tasks is not fully exploited, and the synergy among multiple LLMs has received little study. Method: A multi-LLM collaboration framework is designed around a medical multiple-choice question dataset, and its effect is verified through post-hoc analysis. Result: The framework improves the reasoning ability of all LLMs, reduces their divergence across questions, and reveals a correlation between an LLM's confidence and its prediction accuracy. Conclusion: The multi-LLM collaboration framework is effective and offers a new direction for LLM collaboration on medical tasks. Abstract: Empowered by a vast internal knowledge reservoir, the new generation of large language models (LLMs) demonstrates untapped potential to tackle medical tasks. However, insufficient effort has been made towards eliciting a synergistic effect from multiple LLMs' expertise and background. In this study, we propose a multi-LLM collaboration framework tailored to a medical multiple-choice questions dataset. Through post-hoc analysis on 3 pre-trained LLM participants, our framework is shown to boost all LLMs' reasoning ability as well as alleviate their divergence among questions. We also measure an LLM's confidence when it is confronted with opposing opinions from other LLMs and observe a concurrence between an LLM's confidence and its prediction accuracy.

[251] Can reasoning models comprehend mathematical problems in Chinese ancient texts? An empirical study based on data from Suanjing Shishu

Liu Chang,Wang Dongbo,Liu liu,Zhao Zhixiao

Main category: cs.CL

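TL;DR: The study builds the Guji_MATH benchmark, based on Suanjing Shishu, to evaluate mathematical problem solving in ancient Chinese texts; it finds that mainstream reasoning models underperform under the constraints of classical Chinese and need better classical Chinese comprehension and cultural knowledge.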

Details

Motivation: To address the challenges of intelligent processing of ancient Chinese mathematical classics, mine the mathematical knowledge in ancient texts, and disseminate traditional culture. Method: Through machine-assisted annotation and manual verification, 538 mathematical problems are extracted from 8 canonical texts; closed-book and open-book evaluation modes are designed to test six reasoning models. Result: Reasoning models can partially understand and solve the problems, but they perform worse than on modern mathematical benchmarks and need better classical Chinese comprehension and cultural knowledge. Conclusion: The study provides methodological support for mining mathematical knowledge from ancient texts and disseminating traditional culture, and offers a new perspective for evaluating the cross-linguistic and cross-cultural capabilities of reasoning models. Abstract: This study addresses the challenges in intelligent processing of Chinese ancient mathematical classics by constructing Guji_MATH, a benchmark for evaluating classical texts based on Suanjing Shishu. It systematically assesses the mathematical problem-solving capabilities of mainstream reasoning models under the unique linguistic constraints of classical Chinese. Through machine-assisted annotation and manual verification, 538 mathematical problems were extracted from 8 canonical texts, forming a structured dataset centered on the "Question-Answer-Solution" framework, supplemented by problem types and difficulty levels. Dual evaluation modes--closed-book (autonomous problem-solving) and open-book (reproducing classical solution methods)--were designed to evaluate the performance of six reasoning models on ancient Chinese mathematical problems. Results indicate that reasoning models can partially comprehend and solve these problems, yet their overall performance remains inferior to benchmarks on modern mathematical tasks. Enhancing models' classical Chinese comprehension and cultural knowledge should be prioritized for optimization. This study provides methodological support for mining mathematical knowledge from ancient texts and disseminating traditional culture, while offering new perspectives for evaluating cross-linguistic and cross-cultural capabilities of reasoning models.

[252] A Japanese Language Model and Three New Evaluation Benchmarks for Pharmaceutical NLP

Issey Sukeda,Takuro Fujii,Kosei Buma,Shunsuke Sasaki,Shinnosuke Ono

Main category: cs.CL

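TL;DR: This paper presents a domain-specific language model for the Japanese pharmaceutical field, developed through continual pretraining, which performs well on three new benchmarks.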

Details

Motivation: To develop a language model suited to the Japanese pharmaceutical field and to address the weaknesses of existing models on specialized terminology and logical consistency tasks. Method: The model is built through continual pretraining on 2 billion Japanese pharmaceutical tokens and 8 billion English biomedical tokens, and three new benchmarks (YakugakuQA, NayoseQA, SogoCheck) are introduced for rigorous evaluation. Result: The model outperforms open-source models on terminology-heavy and knowledge-based tasks and is competitive with commercial models such as GPT-4o; GPT-4o itself performs poorly on SogoCheck. Conclusion: The study shows that building practical, secure, and cost-effective language models for Japanese domain-specific applications is feasible, and it provides reusable evaluation resources for pharmaceutical and healthcare NLP. Abstract: We present a Japanese domain-specific language model for the pharmaceutical field, developed through continual pretraining on 2 billion Japanese pharmaceutical tokens and 8 billion English biomedical tokens. To enable rigorous evaluation, we introduce three new benchmarks: YakugakuQA, based on national pharmacist licensing exams; NayoseQA, which tests cross-lingual synonym and terminology normalization; and SogoCheck, a novel task designed to assess consistency reasoning between paired statements. We evaluate our model against both open-source medical LLMs and commercial models, including GPT-4o. Results show that our domain-specific model outperforms existing open models and achieves competitive performance with commercial ones, particularly on terminology-heavy and knowledge-based tasks. Interestingly, even GPT-4o performs poorly on SogoCheck, suggesting that cross-sentence consistency reasoning remains an open challenge. Our benchmark suite offers a broader diagnostic lens for pharmaceutical NLP, covering factual recall, lexical variation, and logical consistency. This work demonstrates the feasibility of building practical, secure, and cost-effective language models for Japanese domain-specific applications, and provides reusable evaluation resources for future research in pharmaceutical and healthcare NLP. Our model, code, and datasets are released at https://github.com/EQUES-Inc/pharma-LLM-eval.

[253] Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence

Gouki Minegishi,Hiroki Furuta,Shohei Taniguchi,Yusuke Iwasawa,Yutaka Matsuo

Main category: cs.CL

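TL;DR: The paper examines how Transformer language models acquire, during training, the meta-learning ability to solve in-context tasks rather than simply copying answers.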

Details

Motivation: Large language models can meta-learn how to solve tasks from context rather than only copying answers, but how this ability forms during training is still unclear. Method: The copy task is extended to an in-context meta-learning setting, and the dynamics of the model's circuits are analyzed during training. Result: Meta-learning ability is acquired in multiple phases, each with its own distinct circuit, unlike the single-phase change seen with induction heads. Conclusion: The work clarifies the source of the Transformer's in-context learning ability and deepens the understanding of model dynamics. Abstract: Transformer-based language models exhibit In-Context Learning (ICL), where predictions are made adaptively based on context. While prior work links induction heads to ICL through a sudden jump in accuracy, this can only account for ICL when the answer is included within the context. However, an important property of practical ICL in large language models is the ability to meta-learn how to solve tasks from context, rather than just copying answers from context; how such an ability is obtained during training is largely unexplored. In this paper, we experimentally clarify how such meta-learning ability is acquired by analyzing the dynamics of the model's circuit during training. Specifically, we extend the copy task from previous research into an In-Context Meta Learning setting, where models must infer a task from examples to answer queries. Interestingly, in this setting, we find that there are multiple phases in the process of acquiring such abilities, and that a unique circuit emerges in each phase, contrasting with the single-phase change in induction heads. The emergence of such circuits can be related to several phenomena known in large language models, and our analysis leads to a deeper understanding of the source of the transformer's ICL ability.

[254] Locate-then-Merge: Neuron-Level Parameter Fusion for Mitigating Catastrophic Forgetting in Multimodal LLMs

Zeping Yu,Sophia Ananiadou

Main category: cs.CL

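TL;DR: The paper proposes Locate-then-Merge, a training-free parameter fusion framework that locates important parameters and merges them selectively, mitigating the catastrophic forgetting of language ability that multimodal LLMs suffer during instruction tuning.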

Details

Motivation: The instruction tuning stage of multimodal large language models (MLLMs) often causes catastrophic forgetting of the base LLM's language ability, even in strong models such as Llama3. Method: The Locate-then-Merge framework locates important parameters and merges them selectively; the Neuron-Fusion strategy operates at the neuron level, preserving parameters tied to visual capabilities while attenuating changes that affect general language skills. Result: On 13 language and visual benchmarks, Neuron-Fusion consistently outperforms existing model merging methods and reduces context hallucination in generation. Conclusion: The method retains visual adaptation while markedly mitigating language degradation, offering a new direction for optimizing multimodal models. Abstract: Although multimodal large language models (MLLMs) have achieved impressive performance, the multimodal instruction tuning stage often causes catastrophic forgetting of the base LLM's language ability, even in strong models like Llama3. To address this, we propose Locate-then-Merge, a training-free parameter fusion framework that first locates important parameters and then selectively merges them. We further introduce Neuron-Fusion, a neuron-level strategy that preserves the influence of neurons with large parameter shifts--neurons likely responsible for newly acquired visual capabilities--while attenuating the influence of neurons with smaller changes that likely encode general-purpose language skills. This design enables better retention of visual adaptation while mitigating language degradation. Experiments on 13 benchmarks across both language and visual tasks show that Neuron-Fusion consistently outperforms existing model merging methods. Further analysis reveals that our method effectively reduces context hallucination in generation.
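To make the neuron-level idea in the abstract above more concrete, here is a minimal sketch of shift-based merging. It is an illustration under stated assumptions, not the authors' algorithm: each row of a weight matrix is treated as one neuron, the shift is measured with an L2 norm, and the `keep_ratio` and `attenuation` parameters are hypothetical knobs that do not come from the paper.

```python
import math


def neuron_fusion(base, tuned, keep_ratio=0.5, attenuation=0.0):
    """Merge tuned weights into base weights, one neuron (row) at a time.

    Rows whose parameters shifted most during tuning keep their full
    update; the remaining rows have their update scaled by `attenuation`.
    Both `keep_ratio` and `attenuation` are illustrative assumptions.
    """
    # Per-neuron update and its L2 norm (the "parameter shift").
    deltas = [[t - b for b, t in zip(b_row, t_row)]
              for b_row, t_row in zip(base, tuned)]
    shifts = [math.sqrt(sum(d * d for d in row)) for row in deltas]

    # Select the neurons with the largest shifts.
    n_keep = max(1, int(len(base) * keep_ratio))
    ranked = sorted(range(len(base)), key=lambda i: shifts[i], reverse=True)
    keep = set(ranked[:n_keep])

    merged = []
    for i, (b_row, d_row) in enumerate(zip(base, deltas)):
        scale = 1.0 if i in keep else attenuation
        merged.append([b + scale * d for b, d in zip(b_row, d_row)])
    return merged


# Toy 2-neuron layer: the first neuron moved a lot, the second barely.
base_w = [[1.0, 1.0], [2.0, 2.0]]
tuned_w = [[1.0, 3.0], [2.1, 2.0]]
merged_w = neuron_fusion(base_w, tuned_w, keep_ratio=0.5, attenuation=0.0)
```

With `attenuation=0.0`, the low-shift neuron reverts to its base weights while the high-shift neuron keeps its tuned weights.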

[255] Breaking mBad! Supervised Fine-tuning for Cross-Lingual Detoxification

Himanshu Beniwal,Youngwoo Kim,Maarten Sap,Soham Dan,Thomas Hartvigsen

Main category: cs.CL

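TL;DR: The paper studies cross-lingual detoxification, evaluating toxicity reduction under limited data across 504 settings and analyzing the effect on performance on non-toxic tasks.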

Details

Motivation: As large language models spread through global applications, keeping them toxicity-free across languages has become a key challenge. Method: A cross-lingual detoxification paradigm is proposed; its toxicity reduction is evaluated between high- and low-resource languages, and its impact on model performance is analyzed. Result: The study reveals trade-offs between safety and knowledge preservation; the code and dataset are publicly available. Conclusion: Cross-lingual detoxification is effective at reducing toxicity, but safety must be balanced against model performance. Abstract: As large language models (LLMs) become increasingly prevalent in global applications, ensuring that they are toxicity-free across diverse linguistic contexts remains a critical challenge. We explore "Cross-lingual Detoxification", a cross-lingual paradigm that mitigates toxicity, enabling detoxification capabilities to transfer between high and low-resource languages across different script families. We analyze cross-lingual detoxification's effectiveness through 504 extensive settings to evaluate toxicity reduction in cross-distribution settings with limited data and investigate how mitigation impacts model performance on non-toxic tasks, revealing trade-offs between safety and knowledge preservation. Our code and dataset are publicly available at https://github.com/himanshubeniwal/Breaking-mBad.

[256] TRIM: Achieving Extreme Sparsity with Targeted Row-wise Iterative Metric-driven Pruning

Florentin Beck,William Rudman,Carsten Eickhoff

Main category: cs.CL

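TL;DR: TRIM is a new pruning method for large language models (LLMs) that allocates sparsity per output dimension, substantially improving model quality at high sparsity ratios.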

Details

Motivation: Existing one-size-fits-all pruning methods perform poorly at high sparsity ratios, so a finer-grained pruning strategy is needed to preserve critical information. Method: TRIM uses an iterative adjustment process that assigns a different sparsity ratio to each output dimension (row) according to quality metrics, reducing the variance in quality retention. Result: Across several LLM families (Qwen2.5, LLaMA-2, and OPT), TRIM at 80% sparsity reduces perplexity by 48% to over 90% and sets a new state of the art. Conclusion: Dimension-wise sparsity adaptation is key to extreme LLM compression. Abstract: Large Language Models (LLMs) present significant computational and memory challenges due to their extensive size, making pruning essential for their efficient deployment. Existing one-shot pruning methods often apply uniform sparsity constraints across layers or within each layer, resulting in suboptimal performance, especially at high sparsity ratios. This work introduces TRIM (Targeted Row-wise Iterative Metric-driven pruning), a novel approach that applies varying sparsity ratios to individual output dimensions (rows) within each layer. TRIM employs an iterative adjustment process guided by quality metrics to optimize dimension-wise sparsity allocation, focusing on reducing variance in quality retention across outputs to preserve critical information. TRIM can be seamlessly integrated with existing layer-wise pruning strategies. Our evaluations on perplexity and zero-shot tasks across diverse LLM families (Qwen2.5, LLaMA-2, and OPT) and sparsity levels demonstrate that TRIM achieves new state-of-the-art results and enhances stability. For instance, at 80% sparsity, TRIM reduces perplexity by 48% for Qwen2.5-14B and over 90% for OPT-13B compared to baseline methods. We conclude that fine-grained, dimension-wise sparsity adaptation is crucial for pushing the limits of extreme LLM compression. Code available at: https://github.com/flobk/TRIM
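The core contrast in the abstract above is between one uniform sparsity ratio per layer and a separate ratio per output row. The sketch below shows only how per-row ratios could be applied once they are known. It does not implement TRIM's iterative, metric-driven allocation; weight magnitude is used as a stand-in pruning criterion, and the function names and toy values are hypothetical.

```python
def prune_row(row, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of one row."""
    n_prune = int(len(row) * sparsity)
    # Indices ordered by ascending magnitude; the first n_prune are dropped.
    order = sorted(range(len(row)), key=lambda i: abs(row[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(row)]


def prune_rowwise(weights, row_sparsities):
    """Apply a different sparsity ratio to each output row of a layer."""
    return [prune_row(row, s) for row, s in zip(weights, row_sparsities)]


# Toy layer with two output rows; the second row is pruned harder.
layer = [[0.1, -2.0, 0.5, 3.0],
         [1.0, -0.2, 0.3, -4.0]]
pruned = prune_rowwise(layer, row_sparsities=[0.5, 0.75])
```

In this toy case the first row keeps its two largest weights and the second keeps only one, while the layer-level average sparsity is 62.5%.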

[257] IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models

Yiming Gao,Bin Wang,Chengwei Wei,Shuo Sun,AiTi Aw

Main category: cs.CL

TL;DR: The paper introduces the IFEval-Audio dataset for evaluating the instruction-following ability of audio large language models, filling a gap in this area.

Details

Motivation: Large language models follow instructions well on text tasks, but this ability degrades in multimodal models (for example, with audio), and the instruction-following ability of audio LLMs has not been studied in depth. Method: The authors design IFEval-Audio, a dataset of 280 audio-instruction-answer triples across six dimensions, to evaluate how well models follow audio-related instructions. Result: Using the dataset, the authors benchmark existing audio LLMs on audio instruction-following tasks. Conclusion: The release of IFEval-Audio supports future research on audio LLMs. Abstract: Large language models (LLMs) have demonstrated strong instruction-following capabilities in text-based tasks. However, this ability often deteriorates in multimodal models after alignment with non-text modalities such as images or audio. While several recent efforts have investigated instruction-following performance in text and vision-language models, instruction-following in audio-based large language models remains largely unexplored. To bridge this gap, we introduce IFEval-Audio, a novel evaluation dataset designed to assess the ability to follow instructions in an audio LLM. IFEval-Audio contains 280 audio-instruction-answer triples across six diverse dimensions: Content, Capitalization, Symbol, List Structure, Length, and Format. Each example pairs an audio input with a text instruction, requiring the model to generate an output that follows a specified structure. We benchmark state-of-the-art audio LLMs on their ability to follow audio-involved instructions. The dataset is released publicly to support future research in this emerging area.

[258] Reasoning Beyond Language: A Comprehensive Survey on Latent Chain-of-Thought Reasoning

Xinghao Chen,Anhao Zhao,Heming Xia,Xuan Lu,Hanlin Wang,Yanjun Chen,Wei Zhang,Jian Wang,Wenjie Li,Xiaoyu Shen

Main category: cs.CL

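TL;DR: This survey covers the latent Chain-of-Thought (Latent CoT) reasoning paradigm, proposes a unified taxonomy, and analyzes the strengths and weaknesses of representative methods, aiming to give LLM reasoning research a structured foundation.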

Details

Motivation: Conventional Chain-of-Thought (CoT) relies on reasoning steps expressed explicitly in language, which is inefficient and hard to apply to abstract reasoning. By decoupling reasoning from language, latent CoT promises richer cognitive representations and more flexible reasoning. Method: A unified taxonomy is proposed from four perspectives: token-wise strategies, internal mechanisms, analysis, and applications. Representative methods are discussed and compared in depth in terms of design patterns, strengths, and challenges. Result: The survey reviews research progress on latent CoT and provides a structured analysis and future research directions for LLM reasoning. Conclusion: Latent CoT is an emerging and promising research direction, and the survey provides a systematic foundation and resources for its development. Abstract: Large Language Models (LLMs) have achieved impressive performance on complex reasoning tasks with Chain-of-Thought (CoT) prompting. However, conventional CoT relies on reasoning steps explicitly verbalized in natural language, introducing inefficiencies and limiting its applicability to abstract reasoning. To address this, there has been growing research interest in latent CoT reasoning, where inference occurs within latent spaces. By decoupling reasoning from language, latent reasoning promises richer cognitive representations and more flexible, faster inference. Researchers have explored various directions in this promising field, including training methodologies, structural innovations, and internal reasoning mechanisms. This paper presents a comprehensive overview and analysis of this reasoning paradigm. We begin by proposing a unified taxonomy from four perspectives: token-wise strategies, internal mechanisms, analysis, and applications. We then provide in-depth discussions and comparative analyses of representative methods, highlighting their design patterns, strengths, and open challenges. We aim to provide a structured foundation for advancing this emerging direction in LLM reasoning. The relevant papers will be regularly updated at https://github.com/EIT-NLP/Awesome-Latent-CoT.

[259] Accidental Misalignment: Fine-Tuning Language Models Induces Unexpected Vulnerability

Punya Syon Pandey,Samuel Simko,Kellin Pelrine,Zhijing Jin

Main category: cs.CL

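TL;DR: The study investigates Accidental Misalignment, the unexpected vulnerabilities that characteristics of fine-tuning data introduce into large language models; it analyzes how dataset factors relate to attack success rates and discusses defense strategies.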

Details

Motivation: Fine-tuning large language models can introduce vulnerabilities that stem from dataset characteristics, so their causes and defenses need study. Method: Potential correlation factors in the fine-tuning data (such as linguistic features, semantic similarity, and toxicity) are identified, the adversarial performance of the fine-tuned models is evaluated, and the relationship between dataset factors and attack success rates is analyzed. Result: The study reveals links between fine-tuning data characteristics and model vulnerability, offering new insights for defense strategies. Conclusion: Dataset design is crucial for preserving model alignment, and data characteristics deserve attention to avoid accidental vulnerabilities. Abstract: As large language models gain popularity, their vulnerability to adversarial attacks remains a primary concern. While fine-tuning models on domain-specific datasets is often employed to improve model performance, it can introduce vulnerabilities within the underlying model. In this work, we investigate Accidental Misalignment, unexpected vulnerabilities arising from characteristics of fine-tuning data. We begin by identifying potential correlation factors such as linguistic features, semantic similarity, and toxicity within our experimental datasets. We then evaluate the adversarial performance of these fine-tuned models and assess how dataset factors correlate with attack success rates. Lastly, we explore potential causal links, offering new insights into adversarial defense strategies and highlighting the crucial role of dataset design in preserving model alignment. Our code is available at https://github.com/psyonp/accidental_misalignment.

[260] Learning Beyond Limits: Multitask Learning and Synthetic Data for Low-Resource Canonical Morpheme Segmentation

Changbing Yang,Garrett Nicolai

Main category: cs.CL

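TL;DR: Proposes a transformer-based morpheme segmentation system that strengthens a low-resource training signal through multitask learning and LLM-generated synthetic data.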

Details

Motivation: To address data scarcity for morpheme segmentation and glossing in low-resource languages. Method: The system jointly predicts morphological segments and glosses, exploiting shared linguistic representations, and adds synthetic data generated by LLMs. Result: On the SIGMORPHON 2023 dataset, word-level segmentation accuracy and morpheme-level F1 score improve significantly. Conclusion: The approach effectively improves morpheme segmentation for low-resource languages. Abstract: We introduce a transformer-based morpheme segmentation system that augments a low-resource training signal through multitask learning and LLM-generated synthetic data. Our framework jointly predicts morphological segments and glosses from orthographic input, leveraging shared linguistic representations obtained through a common documentary process to enhance model generalization. To further address data scarcity, we integrate synthetic training data generated by large language models (LLMs) using in-context learning. Experimental results on the SIGMORPHON 2023 dataset show that our approach significantly improves word-level segmentation accuracy and morpheme-level F1-score across multiple low-resource languages.

[261] Two-way Evidence self-Alignment based Dual-Gated Reasoning Enhancement

Kexin Zhang,Junlan Chen,Daifeng Li,Yuxuan Zhang,Yangyang Feng,Bowen Deng,Weixu Chen

Main category: cs.CL

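TL;DR: The paper proposes ESA-DGR, a unified framework combining a two-way evidence self-alignment (TW-ESA) module and a dual-gated reasoning enhancement (DGR) module, which addresses the difficulties of large language models on knowledge-intensive multi-step reasoning and substantially improves performance.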

Details

Motivation: Large language models struggle to extract and use evidence in knowledge-intensive multi-step reasoning tasks, which leads to inaccurate reasoning. Method: The TW-ESA module improves the model's understanding of the logic of evidence by aligning strict reasoning with LLM reasoning, and the DGR module improves reasoning accuracy by gradually fusing the LLM's knowledge. Result: On three datasets, ESA-DGR clearly outperforms existing methods, with average gains of 4% in EM and 5% in F1. Conclusion: The ESA-DGR framework addresses the problems LLMs face in multi-step reasoning and improves both accuracy and robustness. Abstract: Large language models (LLMs) encounter difficulties in knowledge-intensive multi-step reasoning (KIMSR) tasks. One challenge is how to effectively extract and represent rationale evidence. The current methods often extract semantically relevant but logically irrelevant evidence, resulting in flawed reasoning and inaccurate responses. We propose a two-way evidence self-alignment (TW-ESA) module, which utilizes the mutual alignment between strict reasoning and LLM reasoning to enhance its understanding of the causal logic of evidence, thereby addressing the first challenge. Another challenge is how to utilize the rationale evidence and the LLM's intrinsic knowledge for accurate reasoning when the evidence contains uncertainty. We propose a dual-gated reasoning enhancement (DGR) module to gradually fuse useful knowledge of the LLM within strict reasoning, which can enable the model to perform accurate reasoning by focusing on causal elements in the evidence and exhibit greater robustness. The two modules are collaboratively trained in a unified framework ESA-DGR. Extensive experiments on three diverse and challenging KIMSR datasets reveal that ESA-DGR significantly surpasses state-of-the-art LLM-based fine-tuning methods, with remarkable average improvements of 4% in exact match (EM) and 5% in F1 score. The implementation code is available at https://anonymous.4open.science/r/ESA-DGR-2BF8.
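The results above are reported in exact match (EM) and F1. For readers unfamiliar with these metrics, the following is a minimal sketch of how they are commonly computed for short-answer QA. It assumes simple lowercasing and whitespace tokenization; the paper's exact answer normalization is not stated in the abstract, and the example strings are made up.

```python
from collections import Counter


def exact_match(prediction, gold):
    """1.0 if the answers match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())


def token_f1(prediction, gold):
    """Token-overlap F1 between a predicted and a gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical prediction with one extra token relative to the gold answer.
em = exact_match("the Eiffel Tower", "Eiffel Tower")
f1 = token_f1("the Eiffel Tower", "Eiffel Tower")
```

Here EM is 0 because the strings differ, while F1 still gives partial credit for the two shared tokens, which is why the two metrics are usually reported together.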

[262] Does Synthetic Data Help Named Entity Recognition for Low-Resource Languages?

Gaurav Kamath,Sowmya Vajjala

Main category: cs.CL

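TL;DR: The study examines the role of synthetic data in named entity recognition (NER) for low-resource languages across 11 languages; synthetic data shows promise, but its effect varies by language.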

Details

Motivation: NER for low-resource languages is challenging because labeled data is limited; data augmentation is a common remedy, but the role of synthetic data is still unclear. Method: NER experiments with synthetic data are run on 11 low-resource languages from different language families. Result: Synthetic data helps low-resource NER, but the effect differs significantly between languages. Conclusion: Synthetic data is a promising tool for low-resource NER, but language-specific characteristics must be considered. Abstract: Named Entity Recognition (NER) for low-resource languages aims to produce robust systems for languages where there is limited labeled training data available, and has been an area of increasing interest within NLP. Data augmentation for increasing the amount of low-resource labeled data is a common practice. In this paper, we explore the role of synthetic data in the context of multilingual, low-resource NER, considering 11 languages from diverse language families. Our results suggest that synthetic data does in fact hold promise for low-resource language NER, though we see significant variation between languages.

[263] Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs

Xiaoyu Xu,Xiang Yue,Yang Liu,Qingqing Ye,Haibo Hu,Minxin Du

Main category: cs.CL

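TL;DR: The paper argues that current unlearning evaluations for large language models (LLMs), which rely on token-level metrics such as accuracy and perplexity, can be misleading; it proposes a representation-level evaluation framework that distinguishes reversible from irreversible forgetting.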

Details

Motivation: Existing unlearning evaluations may reflect information being obscured rather than truly erased, so more reliable diagnostic tools are needed. Method: A representation-level evaluation framework based on PCA similarity, centered kernel alignment, and Fisher information is introduced and applied to six unlearning methods, three domains, and two open-source LLMs. Result: In reversible forgetting the model retains latent features, while in irreversible forgetting the representations are damaged; task type and hyperparameters modulate reversibility. Conclusion: The work exposes a fundamental gap in current evaluation practice, establishes a new diagnostic foundation for trustworthy unlearning in LLMs, and releases an analysis toolkit. Abstract: Unlearning in large language models (LLMs) is intended to remove the influence of specific data, yet current evaluations rely heavily on token-level metrics such as accuracy and perplexity. We show that these metrics can be misleading: models often appear to forget, but their original behavior can be rapidly restored with minimal fine-tuning, revealing that unlearning may obscure information rather than erase it. To diagnose this phenomenon, we introduce a representation-level evaluation framework using PCA-based similarity and shift, centered kernel alignment, and Fisher information. Applying this toolkit across six unlearning methods, three domains (text, code, math), and two open-source LLMs, we uncover a critical distinction between reversible and irreversible forgetting. In reversible cases, models suffer token-level collapse yet retain latent features; in irreversible cases, deeper representational damage occurs. We further provide a theoretical account linking shallow weight perturbations near output layers to misleading unlearning signals, and show that reversibility is modulated by task type and hyperparameters. Our findings reveal a fundamental gap in current evaluation practices and establish a new diagnostic foundation for trustworthy unlearning in LLMs. We provide a unified toolkit for analyzing LLM representation changes under unlearning and relearning: https://github.com/XiaoyuXU1/Representational_Analysis_Tools.git.
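Centered kernel alignment (CKA) is one of the representation-level measures named in the abstract above. The sketch below implements the linear-kernel variant in plain Python to show what such a comparison computes. Whether the paper uses the linear or another kernel is not stated in the abstract, so that choice is an assumption, and the toy activation matrices are hypothetical.

```python
import math


def center_columns(matrix):
    """Subtract each column's mean (rows = examples, columns = features)."""
    n, d = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in matrix]


def cross_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for matrices with the same row count."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            s = sum(a[k][i] * b[k][j] for k in range(len(a)))
            total += s * s
    return total


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_sq_norm(yc, xc)
    denominator = math.sqrt(cross_sq_norm(xc, xc) * cross_sq_norm(yc, yc))
    return numerator / denominator


# Toy activations for 4 inputs, e.g. one layer before and after unlearning.
before = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
after = [[3.0 * v for v in row] for row in before]  # same geometry, rescaled
similarity = linear_cka(before, after)
```

A value near 1 indicates that two sets of activations share the same structure up to rotation and isotropic scaling, which is the kind of signal that can show latent features surviving even when token-level metrics collapse.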

[264] SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis

Shuang Sun,Huatong Song,Yuhao Wang,Ruiyang Ren,Jinhao Jiang,Junjie Zhang,Fei Bai,Jia Deng,Wayne Xin Zhao,Zheng Liu,Lei Fang,Zhongyuan Wang,Ji-Rong Wen

Main category: cs.CL

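TL;DR: SimpleDeepSearcher uses data engineering rather than complex training paradigms to address the bottlenecks RAG systems face in high-quality training data and computational cost.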

Details

Motivation: Existing RAG systems for multi-step reasoning and iterative retrieval suffer from low-quality training data, distribution mismatch in simulated environments, and high computational cost. Method: High-quality training data is generated by simulating real user interactions and applying multi-criteria curation; supervised fine-tuning (SFT) then needs only 871 samples. Result: On five benchmarks, the SFT approach clearly outperforms reinforcement-learning baselines. Conclusion: SFT is an effective way to overcome the data-scarcity bottleneck and offers a practical path to efficient deep search systems. Abstract: Retrieval-augmented generation (RAG) systems have advanced large language models (LLMs) in complex deep search scenarios requiring multi-step reasoning and iterative information retrieval. However, existing approaches face critical limitations: they lack high-quality training trajectories, or they suffer from distributional mismatches in simulated environments and prohibitive computational costs for real-world deployment. This paper introduces SimpleDeepSearcher, a lightweight yet effective framework that bridges this gap through strategic data engineering rather than complex training paradigms. Our approach synthesizes high-quality training data by simulating realistic user interactions in live web search environments, coupled with a multi-criteria curation strategy that optimizes diversity and quality on both the input and output sides. Experiments on five benchmarks across diverse domains demonstrate that SFT on only 871 curated samples yields significant improvements over RL-based baselines. Our work establishes SFT as a viable pathway by systematically addressing the data-scarce bottleneck, offering practical insights for efficient deep search systems. Our code is available at https://github.com/RUCAIBox/SimpleDeepSearcher.

[265] R1-Compress: Long Chain-of-Thought Compression via Chunk Compression and Search

Yibo Wang,Li Shen,Huanjin Yao,Tiansheng Huang,Rui Liu,Naiqiang Tan,Jiaxing Huang,Kai Zhang,Dacheng Tao

Main category: cs.CL

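TL;DR: R1-Compress is a two-stage, chunk-level compression framework that reduces the computational overhead of Long-CoT while preserving reasoning accuracy and coherence.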

Details

Motivation: Existing compression methods for Long-CoT either sacrifice local reasoning signals or produce incoherent outputs. Method: Long-CoT is split into chunks, LLM-driven inner-chunk compression is applied, and an inter-chunk search selects a short and coherent sequence. Result: On MATH500, R1-Compress cuts token usage by about 20% with only a 0.6% drop in accuracy. Conclusion: R1-Compress reduces computational overhead while maintaining reasoning accuracy and coherence. Abstract: Chain-of-Thought (CoT) reasoning enhances large language models (LLMs) by enabling step-by-step problem-solving, yet its extension to Long-CoT introduces substantial computational overhead due to increased token length. Existing compression approaches -- instance-level and token-level -- either sacrifice essential local reasoning signals like reflection or yield incoherent outputs. To address these limitations, we propose R1-Compress, a two-stage chunk-level compression framework that preserves both local information and coherence. Our method segments Long-CoT into manageable chunks, applies LLM-driven inner-chunk compression, and employs an inter-chunk search mechanism to select the short and coherent sequence. Experiments on Qwen2.5-Instruct models across MATH500, AIME24, and GPQA-Diamond demonstrate that R1-Compress significantly reduces token usage while maintaining comparable reasoning accuracy. On MATH500, R1-Compress achieves an accuracy of 92.4%, with only a 0.6% drop compared to the Long-CoT baseline, while reducing token usage by about 20%. Source code will be available at https://github.com/w-yibo/R1-Compress

[266] Understanding and Analyzing Inappropriately Targeting Language in Online Discourse: A Comparative Annotation Study

Baran Barbarestani,Isa Maks,Piek Vossen

Main category: cs.CL

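TL;DR: The paper presents a method that combines crowd annotation, expert annotation, and ChatGPT to detect inappropriately targeting language in online conversations, focusing on English conversations from Reddit.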

Details

Motivation: The study aims to identify hate speech and discriminatory language in online conversations using multi-source annotation (experts, crowd, ChatGPT) in order to improve content moderation strategies. Method: A comprehensive annotation framework is applied to a diverse dataset, and the performance of experts, crowd annotators, and ChatGPT in recognizing explicit and implicit hateful language is compared. Result: Context proves essential for identifying hate speech, and new target categories emerge (such as social belief and body image). ChatGPT is limited in understanding nuanced language. Conclusion: The study offers insights for improving automated content moderation and underlines the importance of multi-source annotation and contextual analysis. Abstract: This paper introduces a method for detecting inappropriately targeting language in online conversations by integrating crowd and expert annotations with ChatGPT. We focus on English conversation threads from Reddit, examining comments that target individuals or groups. Our approach involves a comprehensive annotation framework that labels a diverse data set for various target categories and specific target words within the conversational context. We perform a comparative analysis of annotations from human experts, crowd annotators, and ChatGPT, revealing strengths and limitations of each method in recognizing both explicit hate speech and subtler discriminatory language. Our findings highlight the significant role of contextual factors in identifying hate speech and uncover new categories of targeting, such as social belief and body image. We also address the challenges and subjective judgments involved in annotation and the limitations of ChatGPT in grasping nuanced language. This study provides insights for improving automated content moderation strategies to enhance online safety and inclusivity.

[267] Nested Named Entity Recognition as Single-Pass Sequence Labeling

Alberto Muñoz-Ortiz,David Vilares,Caio Corro,Carlos Gómez-Rodríguez

Main category: cs.CL

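TL;DR: Casts nested named entity recognition (NNER) as a sequence labeling task, simplifying the problem by linearizing constituency structures and pairing the linearizations with pretrained encoders for efficient performance.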

Details

Motivation: Reduce the complexity of nested named entity recognition while staying efficient. Method: Use constituency linearization combined with pretrained encoders to turn the problem into sequence labeling. Result: Performance is competitive with less efficient systems while requiring exactly n tagging actions. Conclusion: The method is efficient, easy to implement, and can be trained with any off-the-shelf sequence labeling library. Abstract: We cast nested named entity recognition (NNER) as a sequence labeling task by leveraging prior work that linearizes constituency structures, effectively reducing the complexity of this structured prediction problem to straightforward token classification. By combining these constituency linearizations with pretrained encoders, our method captures nested entities while performing exactly $n$ tagging actions. Our approach achieves competitive performance compared to less efficient systems, and it can be trained using any off-the-shelf sequence labeling library.

[268] Comparative analysis of subword tokenization approaches for Indian languages

Sudhansu Bala Das,Samujjal Choudhury,Tapas Kumar Mishra,Bidyut Kr. Patra

Main category: cs.CL

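TL;DR: The paper studies how different subword tokenization techniques (SentencePiece, BPE, and WordPiece) affect machine translation for Indian languages (ILs), finding that SentencePiece performs best for statistical and neural MT models, while BPE is better for multilingual neural MT models.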

Details

Motivation: Indian languages have complex morphology, so a suitable subword tokenization technique is needed to capture prefixes, suffixes, and other morphological variations and thereby improve machine translation quality. Method: The study compares SentencePiece, BPE, and WordPiece in statistical, neural, and multilingual neural machine translation models, using standard evaluation metrics such as BLEU and TER. Result: SentencePiece performs best for statistical and neural MT models, while BPE is better for multilingual neural MT models. Translation from ILs to English is of higher quality than translation from English to ILs. Conclusion: Choosing a suitable subword tokenization technique is essential for improving machine translation of Indian languages, and SentencePiece and BPE each have advantages in different settings. Abstract: Tokenization is the act of breaking down text into smaller parts, or tokens, that are easier for machines to process. This is a key phase in machine translation (MT) models. Subword tokenization enhances this process by breaking down words into smaller subword units, which is especially beneficial in languages with complicated morphology or a vast vocabulary. It is useful in capturing the intricate structure of words in Indian languages (ILs), such as prefixes, suffixes, and other morphological variations. These languages frequently use agglutinative structures, in which words are formed by the combination of multiple morphemes such as suffixes, prefixes, and stems. As a result, a suitable tokenization strategy must be chosen to address these scenarios. This paper examines how different subword tokenization techniques, such as SentencePiece, Byte Pair Encoding (BPE), and WordPiece Tokenization, affect ILs. The effectiveness of these subword tokenization techniques is investigated in statistical, neural, and multilingual neural machine translation models. All models are examined using standard evaluation metrics, such as the Bilingual Evaluation Understudy (BLEU) score, TER, METEOR, CHRF, RIBES, and COMET. Based on the results, it appears that for the majority of language pairs for the Statistical and Neural MT models, the SentencePiece tokenizer continuously performed better than other tokenizers in terms of BLEU score. However, BPE tokenization outperformed other tokenization techniques in the context of Multilingual Neural Machine Translation model. The results show that, despite using the same tokenizer and dataset for each model, translations from ILs to English surpassed translations from English to ILs.

[269] MPO: Multilingual Safety Alignment via Reward Gap Optimization

Weixiang Zhao,Yulin Hu,Yang Deng,Tongtong Wu,Wenxuan Zhang,Jiahe Guo,An Zhang,Yanyan Zhao,Bing Qin,Tat-Seng Chua,Ting Liu

Main category: cs.CL

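TL;DR: The paper proposes MPO, a new method for multilingual safety alignment that minimizes the reward gap between the dominant language and target languages, improving the safety of multilingual models.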

Details

Motivation: Large language models are increasingly central to AI applications worldwide, but existing safety alignment methods (such as RLHF and DPO) are primarily monolingual and struggle with noisy multilingual data. Method: Proposes Multilingual reward gaP Optimization (MPO), which leverages the well-aligned safety capabilities of the dominant language (English) and minimizes the reward gap to improve multilingual safety alignment. Result: Experiments on three models, LLaMA-3.1, Gemma-2, and Qwen2.5, validate MPO's effectiveness for multilingual safety alignment without degrading general multilingual utility. Conclusion: MPO is an effective multilingual safety alignment method that improves safety across languages without harming the model's general capabilities. Abstract: Large language models (LLMs) have become increasingly central to AI applications worldwide, necessitating robust multilingual safety alignment to ensure secure deployment across diverse linguistic contexts. Existing preference learning methods for safety alignment, such as RLHF and DPO, are primarily monolingual and struggle with noisy multilingual data. To address these limitations, we introduce Multilingual reward gaP Optimization (MPO), a novel approach that leverages the well-aligned safety capabilities of the dominant language (English) to improve safety alignment across multiple languages. MPO directly minimizes the reward gap difference between the dominant language and target languages, effectively transferring safety capabilities while preserving the original strengths of the dominant language. Extensive experiments on three LLMs, LLaMA-3.1, Gemma-2 and Qwen2.5, validate MPO's efficacy in multilingual safety alignment without degrading general multilingual utility.

[270] CASTILLO: Characterizing Response Length Distributions of Large Language Models

Daniel F. Perez-Ramirez,Dejan Kostic,Magnus Boman

Main category: cs.CL

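TL;DR: The CASTILLO dataset characterizes the response length distributions of 13 open-source LLMs on 7 instruction-following corpora, revealing significant inter- and intra-model variability and providing a basis for predictive resource scheduling.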

Details

Motivation: The stochastic, variable-length nature of autoregressive text generation makes it hard to manage compute resources for LLM inference efficiently. Accurate response length prediction would help resource allocation, but existing approaches either bias generation or ignore model- and prompt-specific variability. Method: For each ⟨prompt, model⟩ sample pair, generate 10 independent completions with fixed decoding parameters, record the response lengths, and publish summary statistics (mean, standard deviation, percentiles) together with the shortest and longest completions. Result: The analysis shows significant inter- and intra-model variability in response lengths, along with model-specific behaviors and partial text degeneration. Conclusion: CASTILLO provides data for developing predictive models and a systematic analysis of model generation behavior, supporting research at the intersection of generative language modeling and systems. Abstract: Efficiently managing compute resources for Large Language Model (LLM) inference remains challenging due to the inherently stochastic and variable lengths of autoregressive text generation. Accurately estimating response lengths in advance enables proactive resource allocation, yet existing approaches either bias text generation towards certain lengths or rely on assumptions that ignore model- and prompt-specific variability. We introduce CASTILLO, a dataset characterizing response length distributions across 13 widely-used open-source LLMs evaluated on seven distinct instruction-following corpora. For each $\langle$prompt, model$\rangle$ sample pair, we generate 10 independent completions using fixed decoding hyper-parameters, record the token length of each response, and publish summary statistics (mean, std-dev, percentiles), along with the shortest and longest completions, and the exact generation settings. Our analysis reveals significant inter- and intra-model variability in response lengths (even under identical generation settings), as well as model-specific behaviors and occurrences of partial text degeneration in only subsets of responses. CASTILLO enables the development of predictive models for proactive scheduling and provides a systematic framework for analyzing model-specific generation behaviors. We publicly release the dataset and code to foster research at the intersection of generative language modeling and systems.
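
The per-pair summary described in this entry can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `summarize_lengths`, the choice of percentiles (25, 50, 75, 99), the sample standard deviation, and the inclusive quantile method are all hypothetical and are not taken from the CASTILLO release.

```python
import statistics

def summarize_lengths(lengths):
    """Summarize the token lengths of repeated completions for one (prompt, model) pair.

    Assumption: the percentile set and the sample standard deviation are
    illustrative choices, not the dataset's documented schema.
    """
    ordered = sorted(lengths)
    # n=100 yields 99 cut points; cuts[k - 1] is the k-th percentile.
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    return {
        "mean": statistics.mean(ordered),
        "std": statistics.stdev(ordered),
        "p25": cuts[24],
        "p50": cuts[49],
        "p75": cuts[74],
        "p99": cuts[98],
        "min": ordered[0],
        "max": ordered[-1],
    }

# Ten hypothetical completion lengths for a single prompt and model.
summary = summarize_lengths([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
```

A scheduler could, for instance, reserve memory according to a high percentile rather than the mean when the spread between `min` and `max` is large.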

[271] Shadows in the Attention: Contextual Perturbation and Representation Drift in the Dynamics of Hallucination in LLMs

Zeyu Wei,Shuo Wang,Xiaohui Rong,Xuemin Liu,He Li

Main category: cs.CL

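TL;DR: The paper studies the relationship between hallucination in large language models (LLMs) and internal-state drift, showing how hallucination frequency relates to context injection and characterizing its dynamics.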

Details

Motivation: Hallucinations (plausible yet erroneous outputs) are a major barrier to reliable LLM deployment, and the study aims to reveal how they relate to internal-state drift. Method: Using TruthfulQA, construct two 16-round "titration" tracks and track hallucination rates and hidden-state dynamics across six open-source LLMs. Result: Hallucination frequency grows monotonically with context injection, and relevant and irrelevant context lead to different types of errors. Conclusion: The findings provide an empirical basis for hallucination prediction and context-aware mitigation mechanisms. Abstract: Hallucinations -- plausible yet erroneous outputs -- remain a critical barrier to reliable deployment of large language models (LLMs). We present the first systematic study linking hallucination incidence to internal-state drift induced by incremental context injection. Using TruthfulQA, we construct two 16-round "titration" tracks per question: one appends relevant but partially flawed snippets, the other injects deliberately misleading content. Across six open-source LLMs, we track overt hallucination rates with a tri-perspective detector and covert dynamics via cosine, entropy, JS and Spearman drifts of hidden states and attention maps. Results reveal (1) monotonic growth of hallucination frequency and representation drift that plateaus after 5--7 rounds; (2) relevant context drives deeper semantic assimilation, producing high-confidence "self-consistent" hallucinations, whereas irrelevant context induces topic-drift errors anchored by attention re-routing; and (3) convergence of JS-Drift ($\sim0.69$) and Spearman-Drift ($\sim0$) marks an "attention-locking" threshold beyond which hallucinations solidify and become resistant to correction. Correlation analyses expose a seesaw between assimilation capacity and attention diffusion, clarifying size-dependent error modes. These findings supply empirical foundations for intrinsic hallucination prediction and context-aware mitigation mechanisms.
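
One way to read the reported JS-Drift value: a Jensen-Shannon divergence computed with natural logarithms is bounded above by ln 2 ≈ 0.693, and it reaches that bound when the two distributions have disjoint support. The sketch below illustrates this with plain Python. It assumes that JS-Drift is an ordinary Jensen-Shannon divergence in nats between two attention distributions, which the abstract does not state; the function names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in nats, bounded by ln 2."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)

# Attention mass moved entirely to different positions: divergence equals ln 2.
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
# Identical distributions: divergence is zero.
identical = js_divergence([0.5, 0.5], [0.5, 0.5])
```

Under that assumption, a drift near 0.69 would mean the attention distribution after context injection shares almost no mass with the original one.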

[272] Power-Law Decay Loss for Large Language Model Finetuning: Focusing on Information Sparsity to Enhance Generation Quality

Jintian Shao,Hongyi Huang,Jiayi Wu,Beiwen Zhang,ZhiYu Wu,You Shan,MingKai Zheng

Main category: cs.CL

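TL;DR: The paper proposes a new loss function, Power-Law Decay Loss (PDL), for fine-tuning text generation models. It re-weights the importance of high- and low-frequency tokens to improve the quality and diversity of generated text.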

Details

Motivation: Standard cross-entropy loss treats all tokens equally during fine-tuning, so models over-attend to high-frequency, low-information tokens and neglect low-frequency, high-information ones. PDL is motivated by an observation from information theory and linguistics: a token's informativeness is often inversely proportional to its frequency. Method: PDL re-weights each token's contribution to the standard cross-entropy loss according to its frequency in the training corpus, following a power-law decay that lowers the weight of high-frequency tokens and raises the weight of low-frequency tokens. Result: PDL steers the model during fine-tuning toward tokens that carry specific and unique information, which is expected to improve the quality, diversity, and informativeness of generated text. Conclusion: PDL has potential applications and advantages in text generation fine-tuning tasks such as abstractive summarization, dialogue systems, and style transfer. Abstract: During the finetuning stage of text generation tasks, standard cross-entropy loss treats all tokens equally. This can lead models to overemphasize high-frequency, low-information tokens, neglecting lower-frequency tokens crucial for specificity and informativeness in generated content. This paper introduces a novel loss function, Power-Law Decay Loss (PDL), specifically designed to optimize the finetuning process for text generation. The core motivation for PDL stems from observations in information theory and linguistics: the informativeness of a token is often inversely proportional to its frequency of occurrence. PDL re-weights the contribution of each token in the standard cross-entropy loss based on its frequency in the training corpus, following a power-law decay. Specifically, the weights for high-frequency tokens are reduced, while low-frequency, information-dense tokens are assigned higher weights. This mechanism guides the model during finetuning to focus more on learning and generating tokens that convey specific and unique information, thereby enhancing the quality, diversity, and informativeness of the generated text. We theoretically elaborate on the motivation and construction of PDL and discuss its potential applications and advantages across various text generation finetuning tasks, such as abstractive summarization, dialogue systems, and style transfer.
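
As a rough illustration of the re-weighting this entry describes, the sketch below scales each token's cross-entropy term by an inverse power of its relative corpus frequency. The abstract does not give the exact formula, so the weight form `1 / (freq + eps) ** alpha`, the names `power_law_weights` and `weighted_cross_entropy`, the `alpha` and `eps` parameters, and the normalization by the sum of weights are all assumptions made for this example, not the authors' definition.

```python
import math
from collections import Counter

def power_law_weights(corpus_tokens, alpha=1.0, eps=1e-8):
    """Map each token to a weight that decays as a power of its relative frequency.

    Assumed form: w(t) = 1 / (freq(t) + eps) ** alpha. alpha = 0 gives uniform weights.
    """
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {tok: 1.0 / ((c / total) + eps) ** alpha for tok, c in counts.items()}

def weighted_cross_entropy(target_tokens, target_probs, weights):
    """Weighted mean of per-token negative log-likelihoods.

    target_probs[i] is the probability the model assigned to target_tokens[i].
    Dividing by the sum of weights (an assumption) keeps the loss scale
    comparable to unweighted cross-entropy.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for tok, prob in zip(target_tokens, target_probs):
        w = weights[tok]
        weighted_sum += w * -math.log(prob)
        weight_total += w
    return weighted_sum / weight_total

# Hypothetical toy corpus: "the" is frequent, "tensor" is rare.
corpus = ["the"] * 8 + ["tensor"] * 2
weights = power_law_weights(corpus, alpha=1.0)
loss = weighted_cross_entropy(["the", "tensor"], [0.5, 0.25], weights)
```

With `alpha=0.0` every weight is 1 and the function reduces to ordinary mean cross-entropy, which makes the exponent a convenient knob for interpolating between the two behaviors.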

[273] UNCLE: Uncertainty Expressions in Long-Form Generation

Ruihan Yang,Caiqi Zhang,Zhisong Zhang,Xinting Huang,Dong Yu,Nigel Collier,Deqing Yang

Main category: cs.CL

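TL;DR: The paper introduces UNCLE, a benchmark for evaluating how well large language models express uncertainty in long- and short-form question answering. It finds that current models perform poorly in long-form QA and explores ways to improve them.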

Details

Motivation: Large language models are prone to hallucination in long-form generation, and existing work lacks a direct evaluation of their ability to express uncertainty. Method: Proposes the UNCLE benchmark, which includes long- and short-form QA datasets and new evaluation metrics, and tries prompt-based and training-based methods to improve models. Result: Current models fail to express uncertainty effectively in long-form QA, and training-based methods yield larger gains. Conclusion: UNCLE points to directions for future research and exposes the gap between short- and long-form uncertainty expression. Abstract: Large Language Models (LLMs) are prone to hallucination, particularly in long-form generations. A promising direction to mitigate hallucination is to teach LLMs to express uncertainty explicitly when they lack sufficient knowledge. However, existing work lacks direct and fair evaluation of LLMs' ability to express uncertainty effectively in long-form generation. To address this gap, we first introduce UNCLE, a benchmark designed to evaluate uncertainty expression in both long- and short-form question answering (QA). UNCLE spans five domains and comprises 4k long-form QA instances and over 20k short-form QA pairs. Our dataset is the first to directly bridge short- and long-form QA with paired questions and gold-standard answers. Along with the benchmark, we propose a suite of new metrics to assess the models' capabilities to selectively express uncertainty. Using UNCLE, we then demonstrate that current models fail to convey uncertainty appropriately in long-form generation. We further explore both prompt-based and training-based methods to improve models' performance, with the training-based methods yielding greater gains. Further analysis of alignment gaps between short- and long-form uncertainty expression highlights promising directions for future research using UNCLE.

[274] Latent Principle Discovery for Language Model Self-Improvement

Keshav Ramji,Tahira Naseem,Ramón Fernandez Astudillo

Main category: cs.CL

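TL;DR: The paper proposes an automated method that elicits latent behavioral attributes from a language model in a self-correction setting and compresses them into an interpretable set via clustering, yielding clear performance gains.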

Details

Motivation: Improving the quality of language model generations requires specifying behavioral attributes, and curating them by hand is labor-intensive. The work aims to automate this process and reduce human effort. Method: Uses an approximation of posterior-regularized Monte Carlo Expectation-Maximization to mine latent behavioral attributes from the model, compresses them into an interpretable set via clustering, and uses them to guide self-correction. Result: Smaller models (7-8B parameters) improve substantially: +8-10% in AlpacaEval win-rate, +0.3 on MT-Bench on average, and +19-23% in principle-following win-rate on IFEval. Conclusion: Automatically extracting and compressing behavioral attributes is effective and shows potential for continual self-improvement of language models. Abstract: When language model (LM) users aim to improve the quality of its generations, it is crucial to specify concrete behavioral attributes that the model should strive to reflect. However, curating such principles across many domains, even non-exhaustively, requires a labor-intensive annotation process. To automate this process, we propose eliciting these latent attributes guiding model reasoning towards human-preferred responses by explicitly modeling them in a self-correction setting. Our approach mines new principles from the LM itself and compresses the discovered elements to an interpretable set via clustering. Specifically, we employ an approximation of posterior-regularized Monte Carlo Expectation-Maximization to both identify a condensed set of the most effective latent principles and teach the LM to strategically invoke them in order to intrinsically refine its responses. We demonstrate that bootstrapping our algorithm over multiple iterations enables smaller language models (7-8B parameters) to self-improve, achieving +8-10% in AlpacaEval win-rate, an average of +0.3 on MT-Bench, and +19-23% in principle-following win-rate on IFEval. We also show that clustering the principles yields interpretable and diverse model-generated constitutions while retaining model performance. The gains our method achieves highlight the potential of automated, principle-driven post-training recipes toward continual self-improvement.

[275] PIIvot: A Lightweight NLP Anonymization Framework for Question-Anchored Tutoring Dialogues

Matthew Zent,Digory Smith,Simon Woodhead

Main category: cs.CL

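TL;DR: PIIvot is a lightweight framework that simplifies PII anonymization by leveraging knowledge of the data context. The authors also release the QATD-2k dataset to meet the demand for educational dialogue data.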

Details

Motivation: PII anonymization is a major barrier to open-science data sharing, and existing methods remain limited by error thresholds and the recall/precision trade-off. Method: Proposes the PIIvot framework, which uses knowledge of the data context to simplify the PII detection problem. Result: The authors develop PIIvot and contribute QATD-2k, the largest open-source tutoring dialogue dataset of its kind. Conclusion: PIIvot offers a lighter-weight approach to PII anonymization while supporting the need for shareable educational data. Abstract: Personally identifiable information (PII) anonymization is a high-stakes task that poses a barrier to many open-science data sharing initiatives. While PII identification has made large strides in recent years, in practice, error thresholds and the recall/precision trade-off still limit the uptake of these anonymization pipelines. We present PIIvot, a lighter-weight framework for PII anonymization that leverages knowledge of the data context to simplify the PII detection problem. To demonstrate its effectiveness, we also contribute QATD-2k, the largest open-source real-world tutoring dataset of its kind, to support the demand for quality educational dialogue data.

[276] In-Context Watermarks for Large Language Models

Yepeng Liu,Xuandong Zhao,Christopher Kruegel,Dawn Song,Yuheng Bu

Main category: cs.CL

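TL;DR: The paper proposes In-Context Watermarking (ICW), a technique that embeds watermarks into generated text through prompt engineering, removing the need for access to the decoding process that existing methods require.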

Details

Motivation: With large language models (LLMs) widely used in sensitive applications, effective watermarking is needed to ensure the provenance and accountability of AI-generated text. Most existing methods require access to the decoding process, which limits their practical use. Method: Proposes ICW, which embeds watermarks through prompt engineering by exploiting LLMs' in-context learning and instruction-following abilities, and studies four ICW strategies at different levels of granularity together with their detection methods. Result: Experiments validate ICW as a feasible, model-agnostic, practical watermarking approach and suggest that, as LLMs become more capable, ICW is a promising route to scalable and accessible content attribution. Conclusion: ICW opens a new direction for watermarking without access to model internals, and it is especially suited to practical scenarios such as academic peer review. Abstract: The growing use of large language models (LLMs) for sensitive applications has highlighted the need for effective watermarking techniques to ensure the provenance and accountability of AI-generated text. However, most existing watermarking methods require access to the decoding process, limiting their applicability in real-world settings. One illustrative example is the use of LLMs by dishonest reviewers in the context of academic peer review, where conference organizers have no access to the model used but still need to detect AI-generated reviews. Motivated by this gap, we introduce In-Context Watermarking (ICW), which embeds watermarks into generated text solely through prompt engineering, leveraging LLMs' in-context learning and instruction-following abilities. We investigate four ICW strategies at different levels of granularity, each paired with a tailored detection method. We further examine the Indirect Prompt Injection (IPI) setting as a specific case study, in which watermarking is covertly triggered by modifying input documents such as academic manuscripts. Our experiments validate the feasibility of ICW as a model-agnostic, practical watermarking approach. Moreover, our findings suggest that as LLMs become more capable, ICW offers a promising direction for scalable and accessible content attribution.

[277] On Multilingual Encoder Language Model Compression for Low-Resource Languages

Daniil Gurgurov,Michal Gregor,Josef van Genabith,Simon Ostermann

Main category: cs.CL

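TL;DR: The paper combines two-step knowledge distillation, structured pruning, truncation, and vocabulary trimming to compress multilingual encoder models efficiently for low-resource languages.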

Details

Motivation: Low-resource languages need compressed multilingual models. Systematically combining existing techniques can shrink the model substantially while retaining essential language knowledge. Method: Combines two-step knowledge distillation, structured pruning, truncation, and vocabulary trimming to reduce layer depth, feed-forward hidden size, and intermediate layer embedding size. Result: Compression rates reach up to 92% with a downstream performance drop of only 2-10%, and the performance loss correlates with the amount of language-specific data in the teacher model. Conclusion: The approach offers effective practices for multilingual model compression in low-resource language settings. Abstract: In this paper, we combine two-step knowledge distillation, structured pruning, truncation, and vocabulary trimming for extremely compressing multilingual encoder-only language models for low-resource languages. Our novel approach systematically combines existing techniques and takes them to the extreme, reducing layer depth, feed-forward hidden size, and intermediate layer embedding size to create significantly smaller monolingual models while retaining essential language-specific knowledge. We achieve compression rates of up to 92% with only a marginal performance drop of 2-10% in four downstream tasks, including sentiment analysis, topic classification, named entity recognition, and part-of-speech tagging, across three low-resource languages. Notably, the performance degradation correlates with the amount of language-specific data in the teacher model, with larger datasets resulting in smaller performance losses. Additionally, we conduct extensive ablation studies to identify best practices for multilingual model compression using these techniques.

[278] BP-Seg: A graphical model approach to unsupervised and non-contiguous text segmentation using belief propagation

Fengyi Li,Kayhan Behdin,Natesh Pillai,Xiaofeng Wang,Zhipeng Wang,Ercan Yildiz

Main category: cs.CL

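TL;DR: Proposes BP-Seg, an unsupervised graphical-model approach to efficient text segmentation that accounts for both local coherence and long-distance semantic similarity.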

Details

Motivation: Text segmentation based on sentence semantics is a fundamental task for many downstream applications and calls for an efficient method. Method: Runs belief propagation on a graphical model to capture both the local coherence of adjacent sentences and the semantic similarity of distant sentences. Result: Experiments on an illustrative example and a long-form document dataset show that the method compares favorably with competing approaches. Conclusion: BP-Seg is an effective text segmentation method applicable to a range of use cases. Abstract: Text segmentation based on the semantic meaning of sentences is a fundamental task with broad utility in many downstream applications. In this paper, we propose a graphical model-based unsupervised learning approach, named BP-Seg for efficient text segmentation. Our method not only considers local coherence, capturing the intuition that adjacent sentences are often more related, but also effectively groups sentences that are distant in the text yet semantically similar. This is achieved through belief propagation on the carefully constructed graphical models. Experimental results on both an illustrative example and a dataset with long-form documents demonstrate that our method performs favorably compared to competing approaches.

[279] From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition

Tianduo Wang,Lu Xu,Wei Lu,Shanbo Cheng

Main category: cs.CL

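TL;DR: The paper proposes Speech Back-Translation, a scalable method that improves multilingual ASR models by converting large-scale text corpora into synthetic speech. Experiments show that a small amount of real speech suffices to generate large volumes of high-quality synthetic speech, substantially reducing transcription errors.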

Details

Motivation: Extending multilingual ASR coverage to languages with limited resources remains challenging, and an efficient way to exploit existing text resources is needed. Method: Uses off-the-shelf TTS models to convert large-scale text corpora into synthetic speech and validates synthetic speech quality with an intelligibility-based assessment framework. Result: More than 500,000 hours of synthetic speech are generated in ten languages, and Whisper-large-v3 transcription errors fall by over 30% on average. Conclusion: Speech Back-Translation is a scalable and effective way to improve multilingual ASR systems. Abstract: Recent advances in Automatic Speech Recognition (ASR) have been largely fueled by massive speech corpora. However, extending coverage to diverse languages with limited resources remains a formidable challenge. This paper introduces Speech Back-Translation, a scalable pipeline that improves multilingual ASR models by converting large-scale text corpora into synthetic speech via off-the-shelf text-to-speech (TTS) models. We demonstrate that just tens of hours of real transcribed speech can effectively train TTS models to generate synthetic speech at hundreds of times the original volume while maintaining high quality. To evaluate synthetic speech quality, we develop an intelligibility-based assessment framework and establish clear thresholds for when synthetic data benefits ASR training. Using Speech Back-Translation, we generate more than 500,000 hours of synthetic speech in ten languages and continue pre-training Whisper-large-v3, achieving average transcription error reductions of over 30%. These results highlight the scalability and effectiveness of Speech Back-Translation for enhancing multilingual ASR systems.

[280] VeriFastScore: Speeding up long-form factuality evaluation

Rishanth Rajendhran,Amir Zadeh,Matthew Sarte,Chuan Li,Mohit Iyyer

Main category: cs.CL

TL;DR: VeriFastScore fine-tunes Llama3.1 8B on synthetic data and uses Google Search evidence to evaluate long-form factuality quickly, achieving a 6.6x speedup.

Details

Motivation: Existing methods such as FactScore and VeriScore are effective but slow, which limits their use in large-scale evaluation and training. Method: Fine-tunes Llama3.1 8B on synthetic data to extract and verify all verifiable claims in a text simultaneously. Result: VeriFastScore correlates strongly with the original pipeline at the example level (r=0.80) and the system level (r=0.94), with a 6.6x speedup. Conclusion: VeriFastScore provides an efficient tool for factuality research, and the model and datasets are publicly released. Abstract: Metrics like FactScore and VeriScore that evaluate long-form factuality operate by decomposing an input response into atomic claims and then individually verifying each claim. While effective and interpretable, these methods incur numerous LLM calls and can take upwards of 100 seconds to evaluate a single response, limiting their practicality in large-scale evaluation and training scenarios. To address this, we propose VeriFastScore, which leverages synthetic data to fine-tune Llama3.1 8B for simultaneously extracting and verifying all verifiable claims within a given text based on evidence from Google Search. We show that this task cannot be solved via few-shot prompting with closed LLMs due to its complexity: the model receives ~4K tokens of evidence on average and needs to concurrently decompose claims, judge their verifiability, and verify them against noisy evidence. However, our fine-tuned VeriFastScore model demonstrates strong correlation with the original VeriScore pipeline at both the example level (r=0.80) and system level (r=0.94) while achieving an overall speedup of 6.6x (9.9x excluding evidence retrieval) over VeriScore. To facilitate future factuality research, we publicly release our VeriFastScore model and synthetic datasets.

[281] LLM as Effective Streaming Processor: Bridging Streaming-Batch Mismatches with Group Position Encoding

Junlong Tong,Jinlan Fu,Zixuan Lin,Yingqi Fan,Anhao Zhao,Hui Su,Xiaoyu Shen

Main category: cs.CL

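TL;DR: The paper proposes a group position encoding paradigm built on batch architectures that addresses three key mismatches when large language models (LLMs) are used for streaming, improving performance without architectural changes.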

Details

Motivation: Existing methods for adapting LLMs to streaming rely on expensive re-encoding or specialized architectures and scale poorly. The paper targets the input-attention, output-attention, and position-ID mismatches. Method: Analyzes the impact of position encoding on LLMs and proposes a group position encoding paradigm that keeps relative positions consistent within the source and target contexts. Result: Experiments show the method outperforms existing approaches on cross-lingual and cross-modal tasks and works well in both streaming and batch modes. Conclusion: Group position encoding resolves the mismatches in streaming without architectural modifications and is broadly applicable. Abstract: Large Language Models (LLMs) are primarily designed for batch processing. Existing methods for adapting LLMs to streaming rely either on expensive re-encoding or specialized architectures with limited scalability. This work identifies three key mismatches in adapting batch-oriented LLMs to streaming: (1) input-attention, (2) output-attention, and (3) position-ID mismatches. While it is commonly assumed that the latter two mismatches require frequent re-encoding, our analysis reveals that only the input-attention mismatch significantly impacts performance, indicating re-encoding outputs is largely unnecessary. To better understand this discrepancy with the common assumption, we provide the first comprehensive analysis of the impact of position encoding on LLMs in streaming, showing that preserving relative positions within source and target contexts is more critical than maintaining absolute order. Motivated by the above analysis, we introduce a group position encoding paradigm built on batch architectures to enhance consistency between streaming and batch modes. Extensive experiments on cross-lingual and cross-modal tasks demonstrate that our method outperforms existing approaches. Our method requires no architectural modifications and exhibits strong generalization in both streaming and batch modes. The code is available at repository https://github.com/EIT-NLP/StreamingLLM.

[282] T1: A Tool-Oriented Conversational Dataset for Multi-Turn Agentic Planning

Amartya Chakraborty,Paresh Dashore,Nadia Bathaee,Anmol Jain,Anirban Das,Shi-Xiong Zhang,Sambit Sahu,Milind Naphade,Genta Indra Winata

Main category: cs.CL

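TL;DR: T1 is a multi-domain, multi-turn conversational dataset designed to address LLM planning over dependencies between tool calls, and it also serves as an evaluation benchmark.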

Details

Motivation: LLMs handle dependencies between tool calls in multi-turn conversations poorly, and a dataset is needed to support research and evaluation. Method: Introduces the T1 dataset, which covers multi-domain tool-dependency scenarios and supports a caching mechanism and dynamic replanning. Result: T1-Agent demonstrates planning and reasoning ability in complex, tool-dependent scenarios. Conclusion: T1 is a valuable resource for research on tool use and planning and can serve as a benchmark for open-source models. Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities as intelligent agents capable of solving complex problems. However, effective planning in scenarios involving dependencies between API or tool calls, particularly in multi-turn conversations, remains a significant challenge. To address this, we introduce T1, a tool-augmented, multi-domain, multi-turn conversational dataset specifically designed to capture and manage inter-tool dependencies across diverse domains. T1 enables rigorous evaluation of agents' ability to coordinate tool use across nine distinct domains (4 single domain and 5 multi-domain) with the help of an integrated caching mechanism for both short- and long-term memory, while supporting dynamic replanning, such as deciding whether to recompute or reuse cached results. Beyond facilitating research on tool use and planning, T1 also serves as a benchmark for evaluating the performance of open-source language models. We present results powered by T1-Agent, highlighting their ability to plan and reason in complex, tool-dependent scenarios.

[283] MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems

Rui Ye,Keduan Huang,Qimin Wu,Yuzhu Cai,Tian Jin,Xianghe Pang,Xiangrui Liu,Jiaqi Su,Chen Qian,Bohan Tang,Kaiqu Liang,Jiaao Chen,Yue Hu,Zhenfei Yin,Rongye Shi,Bo An,Yang Gao,Wenjun Wu,Lei Bai,Siheng Chen

Main category: cs.CL

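TL;DR: MASLab is a unified codebase that integrates more than 20 LLM-based multi-agent system methods, enabling fair comparison and a low-barrier research environment.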

Details

Motivation: The field lacks a unified codebase, which leads to redundant re-implementation, unfair comparisons, and high entry barriers. Method: Integrates and validates many methods and provides a unified environment and shared structure that support standardized evaluation. Result: Experiments covering 10+ benchmarks and 8 models give a comprehensive view of the current state of MAS methods. Conclusion: MASLab will keep evolving and welcomes contributions from the open-source community to advance the field. Abstract: LLM-based multi-agent systems (MAS) have demonstrated significant potential in enhancing single LLMs to address complex and diverse tasks in practical applications. Despite considerable advancements, the field lacks a unified codebase that consolidates existing methods, resulting in redundant re-implementation efforts, unfair comparisons, and high entry barriers for researchers. To address these challenges, we introduce MASLab, a unified, comprehensive, and research-friendly codebase for LLM-based MAS. (1) MASLab integrates over 20 established methods across multiple domains, each rigorously validated by comparing step-by-step outputs with its official implementation. (2) MASLab provides a unified environment with various benchmarks for fair comparisons among methods, ensuring consistent inputs and standardized evaluation protocols. (3) MASLab implements methods within a shared streamlined structure, lowering the barriers for understanding and extension. Building on MASLab, we conduct extensive experiments covering 10+ benchmarks and 8 models, offering researchers a clear and comprehensive view of the current landscape of MAS methods. MASLab will continue to evolve, tracking the latest developments in the field, and we invite contributions from the broader open-source community.

[284] DecoupledESC: Enhancing Emotional Support Generation via Strategy-Response Decoupled Preference Optimization

Chao Zhang,Xin Shi,Xueqiao Zhang,Yifan Zhu,Yi Yang,Yawei Luo

Main category: cs.CL

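TL;DR: The paper proposes a decoupled Emotional Support Conversation (ESC) framework. It builds high-quality preference data with Inferential Preference Mining (IPM) and optimizes strategy planning and empathic response generation as separate steps, improving the quality of emotional support conversations.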

Details

Motivation: Existing emotional support conversation models are optimized with supervised fine-tuning (SFT) yet still make psychological errors, and Direct Preference Optimization (DPO) is limited by entangled data structure and optimization ambiguity. Method: Proposes IPM to build high-quality preference data (IPM-PrefDial) and, following Gross's Extended Process Model of Emotion Regulation, decouples the ESC task into two subtasks, strategy planning and empathic response generation, each optimized with SFT and then DPO. Result: Experiments show the decoupled framework outperforms joint optimization baselines, reducing preference bias and improving response quality. Conclusion: The decoupled framework and IPM address the preference data quality and optimization ambiguity problems in ESC and improve emotional support generation. Abstract: Recent advances in Emotional Support Conversation (ESC) have improved emotional support generation by fine-tuning Large Language Models (LLMs) via Supervised Fine-Tuning (SFT). However, common psychological errors still persist. While Direct Preference Optimization (DPO) shows promise in reducing such errors through pairwise preference learning, its effectiveness in ESC tasks is limited by two key challenges: (1) Entangled data structure: Existing ESC data inherently entangles psychological strategies and response content, making it difficult to construct high-quality preference pairs; and (2) Optimization ambiguity: Applying vanilla DPO to such entangled pairwise data leads to ambiguous training objectives. To address these issues, we introduce Inferential Preference Mining (IPM) to construct high-quality preference data, forming the IPM-PrefDial dataset. Building upon this data, we propose a Decoupled ESC framework inspired by Gross's Extended Process Model of Emotion Regulation, which decomposes the ESC task into two sequential subtasks: strategy planning and empathic response generation. Each was trained via SFT and subsequently enhanced by DPO to align with the psychological preference. Extensive experiments demonstrate that our Decoupled ESC framework outperforms joint optimization baselines, reducing preference bias and improving response quality.

[285] Do Large Language Models Excel in Complex Logical Reasoning with Formal Language?

Jin Jiang,Jianing Wang,Yuchen Yan,Yang Liu,Jianhua Zhu,Mengdi Zhang,Xunliang Cai,Liangcai Gao

Main category: cs.CL

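TL;DR: The paper comprehensively evaluates large language models (LLMs) on logical reasoning tasks with a focus on formal languages. It finds that Thinking models outperform Instruct models, but all models are limited in inductive reasoning.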

Details

Motivation: Existing research mostly uses formal language to guide LLM reasoning, while systematic evaluation of this capability is lacking. The paper aims to fill that gap. Method: Evaluates along three dimensions (spectrum of LLMs, taxonomy of tasks, and format of trajectories) and uses formal-language training data to enhance small models. Result: Thinking models perform better, but all models do poorly on inductive reasoning. Data in PoT format generalizes best. Conclusion: A simple rejected fine-tuning method improves LLMs' generalization across formal languages and achieves the best overall performance in the experiments. Abstract: Large Language Models (LLMs) have been shown to achieve breakthrough performance on complex logical reasoning tasks. Nevertheless, most existing research focuses on employing formal language to guide LLMs to derive reliable reasoning paths, while systematic evaluations of these capabilities are still limited. In this paper, we aim to conduct a comprehensive evaluation of LLMs across various logical reasoning problems utilizing formal languages. From the perspective of three dimensions, i.e., spectrum of LLMs, taxonomy of tasks, and format of trajectories, our key findings are: 1) Thinking models significantly outperform Instruct models, especially when formal language is employed; 2) All LLMs exhibit limitations in inductive reasoning capability, irrespective of whether they use a formal language; 3) Data with PoT format achieves the best generalization performance across other languages. Additionally, we also curate the formal-relative training data to further enhance the small language models, and the experimental results indicate that a simple rejected fine-tuning method can better enable LLMs to generalize across formal languages and achieve the best overall performance. Our code and reports are available at https://github.com/jiangjin1999/FormalEval.

[286] R1-Searcher++: Incentivizing the Dynamic Knowledge Acquisition of LLMs via Reinforcement Learning

Huatong Song,Jinhao Jiang,Wenqing Tian,Zhipeng Chen,Yuhuan Wu,Jiahao Zhao,Yingqian Min,Wayne Xin Zhao,Lei Fang,Ji-Rong Wen

Main category: cs.CL

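TL;DR: R1-Searcher++ is a novel framework that uses a two-stage training strategy (SFT cold-start followed by RL for dynamic knowledge acquisition) so that LLMs adaptively use internal and external knowledge sources, improving retrieval-augmented reasoning.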

Details

Motivation: LLMs hallucinate because their knowledge is static, and existing RAG methods are costly, generalize poorly, or ignore the model's internal knowledge. Method: Two-stage training: an SFT cold-start phase for preliminary format learning, then an RL phase for dynamic knowledge acquisition that uses outcome supervision, a reward mechanism, and a memorization mechanism to optimize the use of internal and external knowledge. Result: Experiments show that R1-Searcher++ outperforms existing RAG and reasoning methods and achieves efficient retrieval. Conclusion: By combining internal and external knowledge sources, R1-Searcher++ improves the retrieval-augmented reasoning ability of LLMs. Abstract: Large Language Models (LLMs) are powerful but prone to hallucinations due to static knowledge. Retrieval-Augmented Generation (RAG) helps by injecting external information, but current methods are often costly, generalize poorly, or ignore the internal knowledge of the model. In this paper, we introduce R1-Searcher++, a novel framework designed to train LLMs to adaptively leverage both internal and external knowledge sources. R1-Searcher++ employs a two-stage training strategy: an initial SFT Cold-start phase for preliminary format learning, followed by RL for Dynamic Knowledge Acquisition. The RL stage uses outcome-supervision to encourage exploration, incorporates a reward mechanism for internal knowledge utilization, and integrates a memorization mechanism to continuously assimilate retrieved information, thereby enriching the model's internal knowledge. By leveraging internal knowledge and an external search engine, the model continuously improves its capabilities, enabling efficient retrieval-augmented reasoning. Our experiments demonstrate that R1-Searcher++ outperforms previous RAG and reasoning methods and achieves efficient retrieval. The code is available at https://github.com/RUCAIBox/R1-Searcher-plus.

eess.AS

[287] Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey

Chih-Kai Yang,Neo S. Ho,Hung-yi Lee

Main category: eess.AS

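TL;DR: This paper proposes a systematic taxonomy for evaluating large audio-language models (LALMs), addressing the fragmentation of existing evaluation benchmarks.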

Details

Motivation: Existing LALM evaluation benchmarks lack a unified taxonomy, which hinders comprehensive evaluation and further development of these models. Method: Through a comprehensive survey, LALM evaluations are grouped into four dimensions: general auditory awareness and processing; knowledge and reasoning; dialogue-oriented ability; and fairness, safety, and trustworthiness. Result: Presents the first systematic taxonomy for LALM evaluation and summarizes current challenges and future directions. Conclusion: The taxonomy gives clear guidelines for LALM evaluation, and the authors plan to keep the collection of surveyed papers updated to support the field. Abstract: With advancements in large audio-language models (LALMs), which enhance large language models (LLMs) with auditory capabilities, these models are expected to demonstrate universal proficiency across various auditory tasks. While numerous benchmarks have emerged to assess LALMs' performance, they remain fragmented and lack a structured taxonomy. To bridge this gap, we conduct a comprehensive survey and propose a systematic taxonomy for LALM evaluations, categorizing them into four dimensions based on their objectives: (1) General Auditory Awareness and Processing, (2) Knowledge and Reasoning, (3) Dialogue-oriented Ability, and (4) Fairness, Safety, and Trustworthiness. We provide detailed overviews within each category and highlight challenges in this field, offering insights into promising future directions. To the best of our knowledge, this is the first survey specifically focused on the evaluations of LALMs, providing clear guidelines for the community. We will release the collection of the surveyed papers and actively maintain it to support ongoing advancements in the field.

[288] Meta-PerSER: Few-Shot Listener Personalized Speech Emotion Recognition via Meta-learning

Liang-Yeh Shen,Shi-Xin Fang,Yi-Cheng Lin,Huang-Cheng Chou,Hung-yi Lee

Main category: eess.AS

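TL;DR: Meta-PerSER is a meta-learning framework for personalized speech emotion recognition that adapts quickly to an individual's annotation style and significantly improves recognition performance.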

Details

Motivation: Conventional speech emotion recognition systems rely on aggregated annotations and ignore individual differences, which leads to inconsistent predictions; Meta-PerSER targets personalized emotion recognition instead. Method: Uses MAML-based meta-learning with Combined-Set Meta-Training, Derivative Annealing, and per-layer per-step learning rates; pre-trained self-supervised models extract general emotional cues, and the model is then fine-tuned to personal annotation styles. Result: On the IEMOCAP corpus, Meta-PerSER significantly outperforms baseline methods in both seen and unseen data scenarios. Conclusion: Meta-PerSER shows promise for personalized emotion recognition and suggests a direction for future work. Abstract: This paper introduces Meta-PerSER, a novel meta-learning framework that personalizes Speech Emotion Recognition (SER) by adapting to each listener's unique way of interpreting emotion. Conventional SER systems rely on aggregated annotations, which often overlook individual subtleties and lead to inconsistent predictions. In contrast, Meta-PerSER leverages a Model-Agnostic Meta-Learning (MAML) approach enhanced with Combined-Set Meta-Training, Derivative Annealing, and per-layer per-step learning rates, enabling rapid adaptation with only a few labeled examples. By integrating robust representations from pre-trained self-supervised models, our framework first captures general emotional cues and then fine-tunes itself to personal annotation styles. Experiments on the IEMOCAP corpus demonstrate that Meta-PerSER significantly outperforms baseline methods in both seen and unseen data scenarios, highlighting its promise for personalized emotion recognition.
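
The MAML adaptation loop mentioned in the abstract can be illustrated on a toy problem. The following is only a minimal sketch, assuming a scalar linear model `y = w * x` with a squared-error loss, where each "task" stands in for one listener; it is not the Meta-PerSER implementation, and it omits Combined-Set Meta-Training, Derivative Annealing, and per-layer per-step learning rates. The function names and hyperparameters are hypothetical.

```python
import numpy as np

def grad(w, x, y):
    """Gradient of the mean squared error of y ~ w * x with respect to w."""
    return 2.0 * np.mean((w * x - y) * x)

def hessian(x):
    """Second derivative of the same loss with respect to w (constant in w)."""
    return 2.0 * np.mean(x * x)

def maml_step(w, tasks, inner_lr=0.1, outer_lr=0.05):
    """One meta-update. Each task is (x_support, y_support, x_query, y_query)."""
    meta_grad = 0.0
    for xs, ys, xq, yq in tasks:
        # Inner loop: one gradient step on the task's support set.
        w_adapted = w - inner_lr * grad(w, xs, ys)
        # Outer gradient: chain rule through the inner step (second-order MAML).
        meta_grad += grad(w_adapted, xq, yq) * (1.0 - inner_lr * hessian(xs))
    return w - outer_lr * meta_grad / len(tasks)

# Toy usage: two "listeners" whose labels follow slopes 2 and 4.
x = np.array([1.0, 2.0])
tasks = [(x, 2.0 * x, x, 2.0 * x), (x, 4.0 * x, x, 4.0 * x)]
w = 0.0
for _ in range(200):
    w = maml_step(w, tasks)
```

In this toy setting the meta-learned initialization settles between the two task optima, so that a single inner step moves it toward either listener.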

cs.DB

[289] MAPS: A Multilingual Benchmark for Global Agent Performance and Security

Omer Hofman,Oren Rachmil,Shamik Bose,Vikas Pahuja,Jonathan Brokman,Toshiya Shimizu,Trisha Starostina,Kelly Marchisio,Seraphina Goldfarb-Tarrant,Roman Vainshtein

Main category: cs.DB

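TL;DR: MAPS is a multilingual benchmark suite for evaluating LLM-based agentic AI systems across languages and tasks, filling the gap left by existing English-only benchmarks.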

Details

Motivation: LLMs perform worse in non-English settings, which may limit the global accessibility of agentic AI systems, yet multilingual evaluation tools are lacking. Method: Builds on four widely used agentic benchmarks (GAIA, SWE-bench, MATH, and the Agent Security Benchmark) and translates them into ten languages, yielding the MAPS suite of 805 tasks. Result: Experiments show that agent performance and security generally degrade when moving from English to other languages, and that the degradation correlates with the amount of translated input. Conclusion: MAPS provides a standardized framework for developing and evaluating multilingual agentic AI systems, together with recommendations for improving global accessibility. Abstract: Agentic AI systems, which build on Large Language Models (LLMs) and interact with tools and memory, have rapidly advanced in capability and scope. Yet, since LLMs have been shown to struggle in multilingual settings, typically resulting in lower performance and reduced safety, agentic systems risk inheriting these limitations. This raises concerns about the global accessibility of such systems, as users interacting in languages other than English may encounter unreliable or security-critical agent behavior. Despite growing interest in evaluating agentic AI, existing benchmarks focus exclusively on English, leaving multilingual settings unexplored. To address this gap, we propose MAPS, a multilingual benchmark suite designed to evaluate agentic AI systems across diverse languages and tasks. MAPS builds on four widely used agentic benchmarks: GAIA (real-world tasks), SWE-bench (code generation), MATH (mathematical reasoning), and the Agent Security Benchmark (security). We translate each dataset into ten diverse languages, resulting in 805 unique tasks and 8,855 total language-specific instances. Our benchmark suite enables a systematic analysis of how multilingual contexts affect agent performance and robustness. Empirically, we observe consistent degradation in both performance and security when transitioning from English to other languages, with severity varying by task and correlating with the amount of translated input. Building on these findings, we provide actionable recommendations to guide agentic AI systems development and assessment under multilingual settings.
This work establishes a standardized evaluation framework, encouraging future research towards equitable, reliable, and globally accessible agentic AI. MAPS benchmark suite is publicly available at https://huggingface.co/datasets/Fujitsu-FRE/MAPS

math.NA

[290] Implicit Neural Shape Optimization for 3D High-Contrast Electrical Impedance Tomography

Junqing Chen,Haibo Liu

Main category: math.NA

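TL;DR: Proposes an implicit neural shape optimization framework for 3D high-contrast Electrical Impedance Tomography (EIT), targeting sharp conductivity changes across material interfaces.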

Details

Motivation: High-contrast scenarios, such as metallic implant monitoring and industrial defect detection, challenge traditional reconstruction methods because of severe ill-posedness. Method: Combines shape optimization with implicit neural representations, introducing a shape derivative-based optimization scheme and an efficient latent space representation. Result: Theoretical analysis and numerical experiments show substantial performance improvements. Conclusion: The framework is promising for practical use in medical imaging and industrial non-destructive testing. Abstract: We present a novel implicit neural shape optimization framework for 3D high-contrast Electrical Impedance Tomography (EIT), addressing scenarios where conductivity exhibits sharp discontinuities across material interfaces. These high-contrast cases, prevalent in metallic implant monitoring and industrial defect detection, challenge traditional reconstruction methods due to severe ill-posedness. Our approach synergizes shape optimization with implicit neural representations, introducing key innovations including a shape derivative-based optimization scheme that explicitly incorporates high-contrast interface conditions and an efficient latent space representation that reduces variable dimensionality. Through rigorous theoretical analysis of algorithm convergence and extensive numerical experiments, we demonstrate substantial performance improvements, establishing our framework as promising for practical applications in medical imaging with metallic implants and industrial non-destructive testing.

cs.MM

Junjie Zheng,Zihao Chen,Chaofan Ding,Yunming Liang,Yihan Fan,Huan Yang,Lei Xie,Xinhan Di

Main category: cs.MM

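TL;DR: Proposes a multi-modal generative framework that addresses style adaptation, dialogue handling, and fine-grained details in movie dubbing by combining a vision-language model with speech generation models; experiments show it outperforms existing methods.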

Details

Motivation: Current movie dubbing technology falls short on style adaptation, dialogue handling, and details such as speaker age and gender, so a more comprehensive solution is needed. Method: A multi-modal vision-language model analyzes the visual input, large speech generation models produce high-quality dubbing, and an annotated movie dubbing dataset is constructed. Result: Outperforms existing methods on multiple benchmark datasets, with improvements of up to 1.09%, 8.80%, 19.08%, and 18.74% in LSE-D, SPK-SIM, EMO-SIM, and MCD, respectively. Conclusion: The multi-modal framework improves the quality and adaptability of movie dubbing and suggests a direction for future work. Abstract: Current movie dubbing technology can produce the desired speech using a reference voice and input video, maintaining perfect synchronization with the visuals while effectively conveying the intended emotions. However, crucial aspects of movie dubbing, including adaptation to various dubbing styles, effective handling of dialogue, narration, and monologues, as well as consideration of subtle details such as speaker age and gender, remain insufficiently explored. To tackle these challenges, we introduce a multi-modal generative framework. First, it utilizes a multi-modal large vision-language model (VLM) to analyze visual inputs, enabling the recognition of dubbing types and fine-grained attributes. Second, it produces high-quality dubbing using large speech generation models, guided by multi-modal inputs. Additionally, a movie dubbing dataset with annotations for dubbing types and subtle details is constructed to enhance movie understanding and improve dubbing quality for the proposed multi-modal framework. Experimental results across multiple benchmark datasets show superior performance compared to state-of-the-art (SOTA) methods. In detail, the LSE-D, SPK-SIM, EMO-SIM, and MCD exhibit improvements of up to 1.09%, 8.80%, 19.08%, and 18.74%, respectively.

cs.SE

[292] SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development

Yaxin Du,Yuzhu Cai,Yifan Zhou,Cheng Wang,Yu Qian,Xianghe Pang,Qian Liu,Yue Hu,Siheng Chen

Main category: cs.SE

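TL;DR: Introduces SWE-Dev, a dataset for evaluating and training AI systems on real-world feature development tasks; it shows how challenging the task is for current AI and demonstrates the dataset's usefulness through fine-tuning.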

Details

Motivation: Large language models (LLMs) perform well on software engineering tasks, but feature-driven development (FDD) remains underexplored. Method: Proposes the SWE-Dev dataset, with 14,000 training and 500 test samples, each with a runnable environment and unit tests, supporting supervised fine-tuning and reinforcement learning. Result: Evaluation shows FDD is very challenging for current AI (e.g., Claude-3.7-Sonnet reaches only 22.45% Pass@3), but after fine-tuning a 7B model approaches GPT-4o. Conclusion: SWE-Dev is an effective platform for improving models, and its high-quality data is valuable for progress on FDD tasks. Abstract: Large Language Models (LLMs) have shown strong capability in diverse software engineering tasks, e.g. code completion, bug fixing, and document generation. However, feature-driven development (FDD), a highly prevalent real-world task that involves developing new functionalities for large, existing codebases, remains underexplored. We therefore introduce SWE-Dev, the first large-scale dataset (with 14,000 training and 500 test samples) designed to evaluate and train autonomous coding systems on real-world feature development tasks. To ensure verifiable and diverse training, SWE-Dev uniquely provides all instances with a runnable environment and its developer-authored executable unit tests. This collection not only provides high-quality data for Supervised Fine-Tuning (SFT), but also enables Reinforcement Learning (RL) by delivering accurate reward signals from executable unit tests. Our extensive evaluations on SWE-Dev, covering 17 chatbot LLMs, 10 reasoning models, and 10 Multi-Agent Systems (MAS), reveal that FDD is a profoundly challenging frontier for current AI (e.g., Claude-3.7-Sonnet achieves only 22.45% Pass@3 on the hard test split). Crucially, we demonstrate that SWE-Dev serves as an effective platform for model improvement: fine-tuning on the training set enables a 7B model to perform comparably to GPT-4o on the hard split, underscoring the value of its high-quality training data. Code is available at https://github.com/justLittleWhite/SWE-Dev.

cs.IR

[293] InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation

Yunjia Xi,Jianghao Lin,Menghui Zhu,Yongzhao Xiao,Zhuoying Ou,Jiaqi Liu,Tong Wan,Bo Chen,Weiwen Liu,Yasheng Wang,Ruiming Tang,Weinan Zhang,Yong Yu

Main category: cs.IR

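TL;DR: InfoDeepSeek is a new benchmark for evaluating the information-seeking ability of autonomous LLM agents in dynamic web environments, addressing shortcomings of existing benchmarks.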

Details

Motivation: Existing benchmarks cannot evaluate autonomous LLM agents in dynamic web environments because they are confined to static retrieval environments and simple queries. Method: Proposes a systematic methodology for constructing queries that satisfy determinacy, difficulty, and diversity, and develops the first evaluation framework for dynamic agentic information seeking. Result: Extensive experiments with InfoDeepSeek reveal nuanced agent behaviors and offer practical insights for future research. Conclusion: InfoDeepSeek provides an effective evaluation tool for agentic information seeking in dynamic web environments. Abstract: Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by grounding responses with retrieved information. As an emerging paradigm, Agentic RAG further enhances this process by introducing autonomous LLM agents into the information seeking process. However, existing benchmarks fall short in evaluating such systems, as they are confined to a static retrieval environment with a fixed, limited corpus and simple queries that fail to elicit agentic behavior. Moreover, their evaluation protocols assess information seeking effectiveness by pre-defined gold sets of documents, making them unsuitable for the open-ended and dynamic nature of real-world web environments. To bridge this gap, we present InfoDeepSeek, a new benchmark with challenging questions designed for assessing agentic information seeking in real-world, dynamic web environments. We propose a systematic methodology for constructing challenging queries satisfying the criteria of determinacy, difficulty, and diversity. Based on this, we develop the first evaluation framework tailored to dynamic agentic information seeking, including fine-grained metrics about the accuracy, utility, and compactness of information seeking outcomes. Through extensive experiments across LLMs, search engines, and question types, InfoDeepSeek reveals nuanced agent behaviors and offers actionable insights for future research.

[294] Aug2Search: Enhancing Facebook Marketplace Search with LLM-Generated Synthetic Data Augmentation

Ruijie Xi,He Ba,Hao Yuan,Rishu Agrawal,Arul Prakash

Main category: cs.IR

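TL;DR: The Aug2Search framework uses generative AI to produce high-quality synthetic data that improves embedding-based retrieval models; in the experiments, training on synthetic data often beats training on original data.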

Details

Motivation: Platform search logs lack the diversity and detail needed to optimize query-product relevance. Method: Uses LLMs to generate synthetic data (queries, enhanced product listings, and queries generated from enhanced listings) and trains EBR models on different datasets. Result: Synthetic data improves model performance (up to 4% in ROC_AUC), and models trained only on synthetic data often perform best. Conclusion: Synthetic data from generative AI can effectively strengthen EBR models, especially where data diversity is lacking. Abstract: Embedding-Based Retrieval (EBR) is an important technique in modern search engines, enabling semantic match between search queries and relevant results. However, search logging data on platforms like Facebook Marketplace lacks the diversity and details needed for effective EBR model training, limiting the models' ability to capture nuanced search patterns. To address this challenge, we propose Aug2Search, an EBR-based framework leveraging synthetic data generated by Generative AI (GenAI) models, in a multimodal and multitask approach to optimize query-product relevance. This paper investigates the capabilities of GenAI, particularly Large Language Models (LLMs), in generating high-quality synthetic data, and analyzes its impact on enhancing EBR models. We conducted experiments using eight Llama models and 100 million data points from Facebook Marketplace logs. Our synthetic data generation follows three strategies: (1) generate queries, (2) enhance product listings, and (3) generate queries from enhanced listings. We train EBR models on three different datasets: sampled engagement data or original data (e.g., "Click" and "Listing Interactions"), synthetic data, and a mixture of both engagement and synthetic data to assess their performance across various training sets. Our findings underscore the robustness of Llama models in producing synthetic queries and listings with high coherence, relevance, and diversity, while maintaining low levels of hallucination. Aug2Search achieves an improvement of up to 4% in ROC_AUC with 100 million synthetic data samples, demonstrating the effectiveness of our approach.
Moreover, our experiments reveal that with the same volume of training data, models trained exclusively on synthetic data often outperform those trained on original data only or a mixture of original and synthetic data.

[295] Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering

Kuicai Dong,Yujing Chang,Shijie Huang,Yasheng Wang,Ruiming Tang,Yong Liu

Main category: cs.IR

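TL;DR: MMDocRAG is a new DocVQA benchmark with 4,055 expert-annotated QA pairs and multi-page, cross-modal evidence chains, together with new metrics for evaluating multimodal quote selection. Experiments show that proprietary LVMs outperform open-source models and that the effect of multimodal inputs differs markedly between the two groups.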

Details

Motivation: Current DocVQA methods are text-centric and make poor use of visual information, and there is no benchmark for assessing multimodal evidence selection and integration. Method: Introduces the MMDocRAG benchmark, with QA pairs backed by multi-page, cross-modal evidence chains, and designs new metrics for multimodal quote selection; experiments cover 60 VLM/LLM models and 14 retrieval systems. Result: Proprietary LVMs outperform open-source models and gain moderately from multimodal inputs over text-only inputs, whereas open-source models degrade significantly; fine-tuned LLMs improve substantially when given detailed image descriptions. Conclusion: MMDocRAG provides a rigorous testing ground and practical insights for building more robust multimodal DocVQA systems. Abstract: Document Visual Question Answering (DocVQA) faces dual challenges in processing lengthy multimodal documents (text, images, tables) and performing cross-modal reasoning. Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches, frequently missing critical visual information. The field also lacks robust benchmarks for assessing multimodal evidence selection and integration. We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains. Our framework introduces innovative metrics for evaluating multimodal quote selection and enables answers that interleave text with relevant visual elements. Through large-scale experiments with 60 VLM/LLM models and 14 retrieval systems, we identify persistent challenges in multimodal evidence retrieval, selection, and integration. Key findings reveal that advanced proprietary LVMs outperform open-sourced alternatives. Also, they show moderate advantages using multimodal inputs over text-only inputs, while open-source alternatives show significant performance degradation. Notably, fine-tuned LLMs achieve substantial improvements when using detailed image descriptions. MMDocRAG establishes a rigorous testing ground and provides actionable insights for developing more robust multimodal DocVQA systems. Our benchmark and code are available at https://mmdocrag.github.io/MMDocRAG/.

[296] MiLQ: Benchmarking IR Models for Bilingual Web Search with Mixed Language Queries

Jonghwi Kim,Deokhyung Kang,Seonjeong Hwang,Yunsu Kim,Jungseul Ok,Gary Lee

Main category: cs.IR

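TL;DR: MiLQ is the first public test set of mixed-language queries for evaluating multilingual information retrieval models; results show moderate and inconsistent model performance, and code-switched training data may improve robustness.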

Details

Motivation: Bilingual users often issue mixed-language queries, but research on them is scarce, so a benchmark test set is needed. Method: Introduces the MiLQ test set, evaluates multilingual IR models on mixed-language queries, and analyzes the effect of code-switched training data. Result: Models perform moderately and inconsistently on mixed-language queries, while code-switched data may improve performance; mixing English into queries is more effective for bilingual users searching English documents. Conclusion: Mixed-language queries deserve more research attention; code-switched training data may improve model robustness, and intentional English mixing is an effective strategy. Abstract: Despite bilingual speakers frequently using mixed-language queries in web searches, Information Retrieval (IR) research on them remains scarce. To address this, we introduce MiLQ, a Mixed-Language Query test set and the first public benchmark of mixed-language queries, confirmed as realistic and highly preferred. Experiments show that multilingual IR models perform moderately on MiLQ and inconsistently across native, English, and mixed-language queries, also suggesting code-switched training data's potential for robust IR models handling such queries. Meanwhile, intentional English mixing in queries proves an effective strategy for bilinguals searching English documents, which our analysis attributes to enhanced token matching compared to native queries.

[297] Don't "Overthink" Passage Reranking: Is Reasoning Truly Necessary?

Nour Jedidi,Yung-Sung Chuang,James Glass,Jimmy Lin

Main category: cs.IR

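TL;DR: The study finds that a reasoning-based passage reranker (ReasonRR) performs worse than a standard non-reasoning reranker (StandardRR) under identical training conditions, and that disabling its reasoning (ReasonRR-NoReason) makes it more effective.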

Details

Motivation: Investigates whether reasoning improves the accuracy of LLM-based passage reranking. Method: Compares a reasoning reranker (ReasonRR) with a non-reasoning reranker (StandardRR) and analyzes the effect of the reasoning process. Result: StandardRR outperforms ReasonRR, and ReasonRR-NoReason performs better than ReasonRR; the reasoning process pushes the model toward polarized scores. Conclusion: The reasoning process limits reranking accuracy because it fails to account for the partial relevance of passages. Abstract: With the growing success of reasoning models across complex natural language tasks, researchers in the Information Retrieval (IR) community have begun exploring how similar reasoning capabilities can be integrated into passage rerankers built on Large Language Models (LLMs). These methods typically employ an LLM to produce an explicit, step-by-step reasoning process before arriving at a final relevance prediction. But does reasoning actually improve reranking accuracy? In this paper, we dive deeper into this question, studying the impact of the reasoning process by comparing reasoning-based pointwise rerankers (ReasonRR) to standard, non-reasoning pointwise rerankers (StandardRR) under identical training conditions, and observe that StandardRR generally outperforms ReasonRR. Building on this observation, we then study the importance of reasoning to ReasonRR by disabling its reasoning process (ReasonRR-NoReason), and find that ReasonRR-NoReason is surprisingly more effective than ReasonRR. Examining the cause of this result, our findings reveal that reasoning-based rerankers are limited by the LLM's reasoning process, which pushes it toward polarized relevance scores and thus fails to consider the partial relevance of passages, a key factor for the accuracy of pointwise rerankers.

[298] Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval

Nandan Thakur,Crystina Zhang,Xueguang Ma,Jimmy Lin

Main category: cs.IR

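TL;DR: The study finds that some datasets can hurt retrieval model effectiveness and proposes a simple LLM-based method for relabeling false negatives, which improves model performance.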

Details

Motivation: Some datasets can hurt model effectiveness, especially through false negatives, so training data quality needs to improve. Method: Uses cascading LLM prompts to identify and relabel false negatives. Result: After relabeling, E5 and Qwen2.5-7B improve by 0.7-1.8 nDCG@10 across BEIR and AIR-Bench. Conclusion: Better label quality improves retrieval and reranker models, and the cascading LLM approach is cost-effective and reliable. Abstract: Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs sourced from various data sources. However, we find that certain datasets can negatively impact model effectiveness: pruning 8 out of 15 datasets from the BGE collection reduces the training set size by 2.35$\times$ and increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on "false negatives", where relevant passages are incorrectly labeled as irrelevant. We propose a simple, cost-effective approach using cascading LLM prompts to identify and relabel hard negatives. Experimental results show that relabeling false negatives with true positives improves both E5 (base) and Qwen2.5-7B retrieval models by 0.7-1.4 nDCG@10 on BEIR and by 1.7-1.8 nDCG@10 on zero-shot AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of the cascading design is further supported by human annotation results, where we find judgment by GPT-4o shows much higher agreement with humans than GPT-4o-mini.
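
One way to read "cascading LLM prompts" is as a two-stage filter: a cheap judge flags candidate false negatives, and a stronger judge confirms them before relabeling. The sketch below assumes that reading; the judges are passed in as plain callables so that no specific LLM API is implied, and `relabel_hard_negatives`, `cheap_judge`, and `strong_judge` are hypothetical names rather than anything taken from the paper.

```python
from typing import Callable, List, Tuple

# A judge takes (query, passage) and returns True if it deems the passage relevant.
Judge = Callable[[str, str], bool]

def relabel_hard_negatives(
    query: str,
    negatives: List[str],
    cheap_judge: Judge,
    strong_judge: Judge,
) -> Tuple[List[str], List[str]]:
    """Split mined negatives into (kept negatives, passages relabeled as positives)."""
    kept, relabeled = [], []
    for passage in negatives:
        # Short-circuit: the strong judge only runs on passages the cheap judge flags.
        if cheap_judge(query, passage) and strong_judge(query, passage):
            relabeled.append(passage)
        else:
            kept.append(passage)
    return kept, relabeled
```

The short-circuit keeps the expensive judge off the bulk of the negatives, which is what would make such a cascade cost-effective under this assumption.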

[299] $\text{R}^2\text{ec}$: Towards Large Recommender Models with Reasoning

Runyang You,Yongqi Li,Xinyu Lin,Xin Zhang,Wenjie Wang,Wenjie Li,Liqiang Nie

Main category: cs.IR

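TL;DR: Proposes $\text{R}^2\text{ec}$, a unified large recommender model with intrinsic reasoning capabilities; a redesigned architecture and a reinforcement learning framework (RecPO) optimize reasoning and recommendation together, and experiments confirm its effectiveness.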

Details

Motivation: Existing studies use LLMs as external reasoning modules, which leads to high resource cost and suboptimal joint optimization; a unified model is needed. Method: Proposes $\text{R}^2\text{ec}$, with a redesigned architecture that interleaves reasoning and recommendation, and RecPO, a reinforcement learning framework that optimizes the model with a fused reward scheme. Result: Experiments on three datasets show relative improvements of 68.67% in Hit@5 and 45.21% in NDCG@20. Conclusion: With its unified architecture and the RecPO framework, $\text{R}^2\text{ec}$ improves both reasoning and recommendation without relying on specialized reasoning annotations. Abstract: Large recommender models have extended LLMs as powerful recommenders via encoding or item generation, and recent breakthroughs in LLM reasoning synchronously motivate the exploration of reasoning in recommendation. Current studies usually position LLMs as external reasoning modules to yield auxiliary thought for augmenting conventional recommendation pipelines. However, such decoupled designs suffer from significant resource cost and suboptimal joint optimization. To address these issues, we propose $\text{R}^2\text{ec}$, a unified large recommender model with intrinsic reasoning capabilities. Initially, we reconceptualize the model architecture to facilitate interleaved reasoning and recommendation in the autoregressive process. Subsequently, we propose RecPO, a corresponding reinforcement learning framework that optimizes both the reasoning and recommendation capabilities of $\text{R}^2\text{ec}$ simultaneously in a single policy update; RecPO introduces a fused reward scheme that solely leverages recommendation labels to simulate the reasoning capability, eliminating dependency on specialized reasoning annotations. Experiments on three datasets with various baselines verify the effectiveness of $\text{R}^2\text{ec}$, showing relative improvements of 68.67% in Hit@5 and 45.21% in NDCG@20. Code available at https://github.com/YRYangang/RRec.
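
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of Hit@K and NDCG@K under the common single-target convention (one held-out relevant item per user). The abstract does not spell out the paper's exact evaluation protocol, so this convention and the helper names `hit_at_k` and `ndcg_at_k` are assumptions.

```python
import math
from typing import List

def hit_at_k(ranked: List[str], target: str, k: int) -> float:
    """1.0 if the target item appears among the top-k ranked items, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0

def ndcg_at_k(ranked: List[str], target: str, k: int) -> float:
    """NDCG@k with a single relevant item: 1 / log2(rank + 1), or 0.0 if missed."""
    top_k = ranked[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target) + 1  # 1-based position of the target
    return 1.0 / math.log2(rank + 1)
```

Per-user values are typically averaged over all test users, and a relative improvement compares that average against a baseline's average.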

cs.NE

[300] TDFormer: A Top-Down Attention-Controlled Spiking Transformer

Zizheng Zhu,Yingchao Yu,Zeqi Zheng,Zhaofei Yu,Yaochu Jin

Main category: cs.NE

TL;DR: TDFormer is a new spiking neural network model that adds a top-down feedback structure and significantly improves performance.

Details

Motivation: In traditional SNNs the membrane potential is the only information link between time steps, which limits model performance. Method: Proposes TDFormer, which uses high-order representations to modulate low-order information processing and increases mutual information across time steps. Result: Significantly improves performance on multiple datasets, reaching 86.83% accuracy on ImageNet. Conclusion: The feedback structure alleviates vanishing gradients along the time dimension and improves the transmission and integration of temporal information. Abstract: Traditional spiking neural networks (SNNs) can be viewed as a combination of multiple subnetworks with each running for one time step, where the parameters are shared, and the membrane potential serves as the only information link between them. However, the implicit nature of the membrane potential limits its ability to effectively represent temporal information. As a result, each time step cannot fully leverage information from previous time steps, seriously limiting the model's performance. Inspired by the top-down mechanism in the brain, we introduce TDFormer, a novel model with a top-down feedback structure that functions hierarchically and leverages high-order representations from earlier time steps to modulate the processing of low-order information at later stages. The feedback structure plays a role from two perspectives: 1) During forward propagation, our model increases the mutual information across time steps, indicating that richer temporal information is being transmitted and integrated in different time steps. 2) During backward propagation, we theoretically prove that the feedback structure alleviates the problem of vanishing gradients along the time dimension. We find that these mechanisms together significantly and consistently improve the model performance on multiple datasets. In particular, our model achieves state-of-the-art performance on ImageNet with an accuracy of 86.83%.

cs.LG

[301] Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations

Aaron J. Li,Suraj Srinivas,Usha Bhalla,Himabindu Lakkaraju

Main category: cs.LG

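TL;DR: The paper argues that concept representations from sparse autoencoders (SAEs) are not robust to input perturbations and develops an evaluation framework to quantify this robustness. Experiments show that tiny adversarial perturbations can manipulate SAE concept interpretations while barely affecting the outputs of the base LLM.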

Details

Motivation: Existing SAE evaluations overlook the robustness of concept representations to input perturbations, although robustness is a key indicator of the fidelity of concept labels. Method: Formulates robustness quantification as input-space optimization problems and develops an evaluation framework with scenarios in which adversarial perturbations manipulate SAE representations. Result: Tiny adversarial perturbations effectively manipulate SAE concept interpretations while having little effect on the base LLM's outputs. Conclusion: SAE concept representations are fragile and may be ill-suited for model monitoring and oversight. Abstract: Sparse autoencoders (SAEs) are commonly used to interpret the internal activations of large language models (LLMs) by mapping them to human-interpretable concept representations. While existing evaluations of SAEs focus on metrics such as the reconstruction-sparsity tradeoff, human (auto-)interpretability, and feature disentanglement, they overlook a critical aspect: the robustness of concept representations to input perturbations. We argue that robustness must be a fundamental consideration for concept representations, reflecting the fidelity of concept labeling. To this end, we formulate robustness quantification as input-space optimization problems and develop a comprehensive evaluation framework featuring realistic scenarios in which adversarial perturbations are crafted to manipulate SAE representations. Empirically, we find that tiny adversarial input perturbations can effectively manipulate concept-based interpretations in most scenarios without notably affecting the outputs of the base LLMs themselves. Overall, our results suggest that SAE concept representations are fragile and may be ill-suited for applications in model monitoring and oversight.

[302] Merge to Mix: Mixing Datasets via Model Merging

Zhixu Silvia Tao,Kasper Vinken,Hao-Wei Yeh,Avi Cooper,Xavier Boix

Main category: cs.LG

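TL;DR: Proposes Merge to Mix, a method that uses model merging to speed up the composition of dataset mixtures and avoids the repeated fine-tuning runs of trial-and-error approaches.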

Details

Motivation: Composing dataset mixtures traditionally relies on heuristics and trial and error, which is inefficient and requires many fine-tuning runs. Method: Uses model merging: models fine-tuned individually on each dataset are combined with simple arithmetic operations and serve as a surrogate for a model fine-tuned on the whole mixture. Result: Experiments show that Merge to Mix outperforms existing methods for dataset selection. Conclusion: Merge to Mix offers an efficient way to compose dataset mixtures. Abstract: Mixing datasets for fine-tuning large models (LMs) has become critical for maximizing performance on downstream tasks. However, composing effective dataset mixtures typically relies on heuristics and trial-and-error, often requiring multiple fine-tuning runs to achieve the desired outcome. We propose a novel method, Merge to Mix, that accelerates composing dataset mixtures through model merging. Model merging is a recent technique that combines the abilities of multiple individually fine-tuned LMs into a single LM by using a few simple arithmetic operations. Our key insight is that merging models individually fine-tuned on each dataset in a mixture can effectively serve as a surrogate for a model fine-tuned on the entire mixture. Merge to Mix leverages this insight to accelerate selecting dataset mixtures without requiring full fine-tuning on each candidate mixture. Our experiments demonstrate that Merge to Mix surpasses state-of-the-art methods in dataset selection for fine-tuning LMs.
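
The surrogate idea in the abstract can be sketched as follows: fine-tune once per dataset, then score every candidate mixture by merging the corresponding models instead of fine-tuning again. This sketch assumes uniform parameter averaging as the merging operation and flat NumPy vectors as model weights; both are illustrative assumptions, and `evaluate` is a hypothetical callable that stands in for a downstream validation metric.

```python
from itertools import combinations
from typing import Callable, Dict, Tuple

import numpy as np

def merge_models(weights) -> np.ndarray:
    """Merge models by uniform parameter averaging (an assumed merging rule)."""
    return np.mean(weights, axis=0)

def select_mixture(
    finetuned: Dict[str, np.ndarray],
    evaluate: Callable[[np.ndarray], float],
) -> Tuple[Tuple[str, ...], float]:
    """Score every non-empty dataset subset via its merged surrogate; return the best."""
    names = list(finetuned)
    best_subset, best_score = (), float("-inf")
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            surrogate = merge_models([finetuned[n] for n in subset])
            score = evaluate(surrogate)  # higher is better
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score
```

Exhaustive enumeration is only practical for a small number of candidate datasets; with N datasets it evaluates 2^N - 1 merged surrogates but needs only N fine-tuning runs.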

[303] A Survey of Large Language Models for Text-Guided Molecular Discovery: from Molecule Generation to Optimization

Ziqing Wang,Kexin Zhang,Zihan Zhao,Yibo Wen,Abhishek Pandey,Han Liu,Kaize Ding

Main category: cs.LG

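TL;DR: This survey reviews the use of large language models (LLMs) in molecular discovery, focusing on molecule generation and optimization, and provides an up-to-date overview of techniques and resources.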

Details

Motivation: To advance the use of LLMs in molecular science and fill a knowledge gap in this emerging field. Method: Proposes a taxonomy, analyzes representative techniques in each category, and summarizes datasets and evaluation protocols. Result: Reviews the potential of LLMs for molecular discovery and provides a continuously updated reading list. Conclusion: Discusses key challenges and future directions as a reference for researchers. Abstract: Large language models (LLMs) are introducing a paradigm shift in molecular discovery by enabling text-guided interaction with chemical spaces through natural language and symbolic notations, with emerging extensions to incorporate multi-modal inputs. To advance the new field of LLM for molecular discovery, this survey provides an up-to-date and forward-looking review of the emerging use of LLMs for two central tasks: molecule generation and molecule optimization. Based on our proposed taxonomy for both problems, we analyze representative techniques in each category, highlighting how LLM capabilities are leveraged across different learning settings. In addition, we include the commonly used datasets and evaluation protocols. We conclude by discussing key challenges and future directions, positioning this survey as a resource for researchers working at the intersection of LLMs and molecular science. A continuously updated reading list is available at https://github.com/REAL-Lab-NU/Awesome-LLM-Centric-Molecular-Discovery.

[304] NAN: A Training-Free Solution to Coefficient Estimation in Model Merging

Chongjie Si,Kangtao Lv,Jingjing Jiang,Yadao Wang,Yongwei Wang,Xiaokang Yang,Wenbo Su,Bo Zheng,Wei Shen

Main category: cs.LG

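TL;DR: This paper proposes NAN, a model merging method grounded in least-squares optimization that estimates merging coefficients via the inverse of parameter norm and consistently improves baseline methods.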

Details

Motivation: Existing model merging methods rely on heuristics to determine the merging coefficients, which limits their scalability and generality. Method: Revisits model merging through least-squares optimization and proposes NAN, which estimates merging coefficients via the inverse of parameter norm. Result: Experiments show that NAN consistently improves the performance of baseline methods. Conclusion: NAN is a simple, effective, training-free method applicable to a wide range of merging strategies. Abstract: Model merging offers a training-free alternative to multi-task learning by combining independently fine-tuned models into a unified one without access to raw data. However, existing approaches often rely on heuristics to determine the merging coefficients, limiting their scalability and generality. In this work, we revisit model merging through the lens of least-squares optimization and show that the optimal merging weights should scale with the amount of task-specific information encoded in each model. Based on this insight, we propose NAN, a simple yet effective method that estimates model merging coefficients via the inverse of parameter norm. NAN is training-free, plug-and-play, and applicable to a wide range of merging strategies. Extensive experiments show that NAN consistently improves the performance of baseline methods.
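As a rough illustration of estimating coefficients "via the inverse of parameter norm", the sketch below weights each model by the reciprocal of the Euclidean norm of its flattened parameter vector. The abstract does not say which norm is used, whether it is taken over full parameters or task vectors, or how the weights are normalized; the L2 norm, the sum-to-one normalization, and the name `nan_coefficients` are assumptions.

```python
import math


def nan_coefficients(vectors):
    """Merging coefficients proportional to 1 / ||v||_2 for each vector.

    vectors: one flattened parameter (or task) vector per fine-tuned model.
    Normalizing the weights to sum to 1 is an assumption of this sketch.
    """
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    if any(n == 0.0 for n in norms):
        raise ValueError("zero-norm vector has no defined inverse-norm weight")
    inverse = [1.0 / n for n in norms]
    total = sum(inverse)
    return [w / total for w in inverse]
```

For example, vectors with norms 5 and 10 would receive weights 2/3 and 1/3 under this normalization, with no training or data access required.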

[305] NQKV: A KV Cache Quantization Scheme Based on Normal Distribution Characteristics

Zhihang Cai,Xingjun Zhang,Zhendong Tan,Zheng Wei

Main category: cs.LG

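TL;DR: The paper proposes the NQKV algorithm, which quantizes the KV cache to lower bit widths to reduce memory consumption and improve LLM inference efficiency.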

Details

Motivation: The memory consumed by the KV cache has become a bottleneck in LLM inference; existing activation quantization methods are limited to 8-bit, and going lower causes accuracy drops. Method: Analyzes the element distribution of the KV cache and designs NQKV, which exploits the normal distribution of elements within each block to achieve information-theoretically optimal quantization error. Result: NQKV lets the OPT model run inference with a 2x larger batch size or a 4x longer context length and improves throughput by 9.3x. Conclusion: NQKV effectively reduces KV cache memory requirements and improves LLM inference efficiency without significantly compromising output quality. Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency across a wide range of tasks. However, LLMs often require larger batch sizes to enhance throughput or longer context lengths to meet task demands, which significantly increases the memory resource consumption of the Key-Value (KV) cache during inference, becoming a major bottleneck in LLM deployment. To address this issue, quantization is a common and straightforward approach. Currently, quantization methods for activations are limited to 8-bit, and quantization to even lower bits can lead to substantial accuracy drops. To further save space by quantizing the KV cache to even lower bits, we analyzed the element distribution of the KV cache and designed the NQKV algorithm. Since the elements within each block of the KV cache follow a normal distribution, NQKV employs per-block quantile quantization to achieve information-theoretically optimal quantization error. Without significantly compromising model output quality, NQKV enables the OPT model to perform inference with a 2x larger batch size or a 4x longer context length, and it improves throughput by 9.3x compared to when the KV cache is not used.
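Per-block quantile quantization under a normality assumption might look roughly like the following. The abstract does not give the codebook construction or block layout, so the mid-quantile codebook, the mean and standard-deviation normalization, the 4-bit default, and all function names are illustrative assumptions.

```python
from statistics import NormalDist, fmean, pstdev


def normal_codebook(bits):
    """2**bits levels placed at the mid-quantiles of a standard normal."""
    levels = 2 ** bits
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((i + 0.5) / levels) for i in range(levels)]


def quantize_block(block, bits=4):
    """Map each element to the index of its nearest codebook level.

    The block is standardized with its own mean and standard deviation,
    which are stored alongside the integer codes.
    """
    mu = fmean(block)
    sigma = pstdev(block) or 1.0  # constant block: avoid division by zero
    codebook = normal_codebook(bits)
    codes = [
        min(range(len(codebook)), key=lambda j: abs((x - mu) / sigma - codebook[j]))
        for x in block
    ]
    return codes, mu, sigma


def dequantize_block(codes, mu, sigma, bits=4):
    """Reconstruct approximate values from codes and per-block statistics."""
    codebook = normal_codebook(bits)
    return [mu + sigma * codebook[c] for c in codes]
```

Placing levels at quantiles puts roughly equal probability mass in each bin when the block really is normally distributed, which is the property the abstract appeals to.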

[306] AdaSTaR: Adaptive Data Sampling for Training Self-Taught Reasoners

Woosung Koh,Wonbeen Oh,Jaein Jang,MinHyung Lee,Hyeongjin Kim,Ah Yeon Kim,Joonkee Kim,Junghyun Lee,Taehyeon Kim,Se-Young Yun

Main category: cs.LG

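TL;DR: AdaSTaR is an adaptive sampling algorithm that balances training-data difficulty and diversity, markedly improving the reasoning ability and training efficiency of language models.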

Details

Motivation: Conventional methods such as STaR suffer from imbalanced training data, over-training on easy examples while under-training on hard ones. Method: AdaSTaR combines two adaptive sampling principles: (1) sampling for diversity to balance the training data, and (2) sampling for curriculum to adjust data difficulty dynamically. Result: Across six benchmarks, AdaSTaR achieves the best test accuracy in every case and reduces training compute by 58.6% on average. Conclusion: AdaSTaR points to a new direction for efficient self-improving language models and generalizes to different pre-trained models and larger models. Abstract: Self-Taught Reasoners (STaR), synonymously known as Rejection sampling Fine-Tuning (RFT), is an integral part of the training pipeline of self-improving reasoning Language Models (LMs). The self-improving mechanism often employs random observation (data) sampling. However, this results in trained observation imbalance; inefficiently over-training on solved examples while under-training on challenging ones. In response, we introduce Adaptive STaR (AdaSTaR), a novel algorithm that rectifies this by integrating two adaptive sampling principles: (1) Adaptive Sampling for Diversity: promoting balanced training across observations, and (2) Adaptive Sampling for Curriculum: dynamically adjusting data difficulty to match the model's evolving strength. Across six benchmarks, AdaSTaR achieves best test accuracy in all instances (6/6) and reduces training FLOPs by an average of 58.6% against an extensive list of baselines. These improvements in performance and efficiency generalize to different pre-trained LMs and larger models, paving the way for more efficient and effective self-improving LMs.

[307] AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning

Yang Chen,Zhuolin Yang,Zihan Liu,Chankyu Lee,Peng Xu,Mohammad Shoeybi,Bryan Catanzaro,Wei Ping

Main category: cs.LG

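TL;DR: Large-scale reinforcement learning (RL) can substantially improve the reasoning ability of small- and mid-sized models, surpassing distillation-based methods, with gains obtained through staged training on math prompts and then code prompts.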

Details

Motivation: Despite progress in large-scale RL for reasoning, the training recipe remains unclear, and implementation details of frontier models are often omitted. This work aims to show how RL can improve the reasoning ability of small- and mid-sized models. Method: Staged training, first on math prompts and then on code prompts; a data pipeline that collects high-quality prompts; curriculum learning and on-policy updates to stabilize training. Result: Math-only RL substantially improves both math and code performance (e.g., +14.6% / +17.2% on AIME 2025), and code-only RL further improves code performance without hurting math results. Conclusion: RL not only elicits the reasoning capabilities acquired during pretraining and fine-tuning but also pushes the model's limits, solving problems that were previously unsolvable. Abstract: Despite recent progress in large-scale reinforcement learning (RL) for reasoning, the training recipe for building high-performing reasoning models remains elusive. Key implementation details of frontier models, such as DeepSeek-R1, including data curation strategies and RL training recipe, are often omitted. Moreover, recent research indicates distillation remains more effective than RL for smaller models. In this work, we demonstrate that large-scale RL can significantly enhance the reasoning capabilities of strong, small- and mid-sized models, achieving results that surpass those of state-of-the-art distillation-based models. We systematically study the RL training process through extensive ablations and propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts. Notably, we find that math-only RL not only significantly enhances the performance of strong distilled models on math benchmarks (e.g., +14.6% / +17.2% on AIME 2025 for the 7B / 14B models), but also code reasoning tasks (e.g., +6.8% / +5.8% on LiveCodeBench for the 7B / 14B models). In addition, extended code-only RL iterations further improve performance on code benchmarks with minimal or no degradation in math results. We develop a robust data curation pipeline to collect challenging prompts with high-quality, verifiable answers and test cases to enable verification-based RL across both domains. Finally, we identify key experimental insights, including curriculum learning with progressively increasing response lengths and the stabilizing effect of on-policy parameter updates.
We find that RL not only elicits the foundational reasoning capabilities acquired during pretraining and supervised fine-tuning (e.g., distillation), but also pushes the limits of the model's reasoning ability, enabling it to solve problems that were previously unsolvable.

[308] MoRE-Brain: Routed Mixture of Experts for Interpretable and Generalizable Cross-Subject fMRI Visual Decoding

Yuxiang Wei,Yanteng Zhang,Xi Xiao,Tianyang Wang,Xiao Wang,Vince D. Calhoun

Main category: cs.LG

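TL;DR: MoRE-Brain proposes a Mixture-of-Experts architecture grounded in brain network principles for high-fidelity, adaptable, and interpretable visual reconstruction, using a dynamic routing mechanism to improve the generalizability and interpretability of neural decoding.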

Details

Motivation: Current fMRI visual decoding research prioritizes reconstruction fidelity over interpretability, which limits the neuroscientific insight it can yield. Method: A hierarchical Mixture-of-Experts architecture in which experts process fMRI signals from functionally related voxel groups; experts encode fMRI into the CLIP space, and a diffusion model generates images while a dual-stage router dynamically weighs expert contributions. Result: Experiments confirm MoRE-Brain's high reconstruction fidelity and show that it makes effective use of fMRI signals rather than over-relying on generative priors. Conclusion: MoRE-Brain makes substantial progress in generalizability and interpretability, offering a new direction for fMRI visual decoding. Abstract: Decoding visual experiences from fMRI offers a powerful avenue to understand human perception and develop advanced brain-computer interfaces. However, current progress often prioritizes maximizing reconstruction fidelity while overlooking interpretability, an essential aspect for deriving neuroscientific insight. To address this gap, we propose MoRE-Brain, a neuro-inspired framework designed for high-fidelity, adaptable, and interpretable visual reconstruction. MoRE-Brain uniquely employs a hierarchical Mixture-of-Experts architecture where distinct experts process fMRI signals from functionally related voxel groups, mimicking specialized brain networks. The experts are first trained to encode fMRI into the frozen CLIP space. A finetuned diffusion model then synthesizes images, guided by expert outputs through a novel dual-stage routing mechanism that dynamically weighs expert contributions across the diffusion process. MoRE-Brain offers three main advancements: First, it introduces a novel Mixture-of-Experts architecture grounded in brain network principles for neuro-decoding. Second, it achieves efficient cross-subject generalization by sharing core expert networks while adapting only subject-specific routers. Third, it provides enhanced mechanistic insight, as the explicit routing reveals precisely how different modeled brain regions shape the semantic and spatial attributes of the reconstructed image. Extensive experiments validate MoRE-Brain's high reconstruction fidelity, with bottleneck analyses further demonstrating its effective utilization of fMRI signals, distinguishing genuine neural decoding from over-reliance on generative priors.
Consequently, MoRE-Brain marks a substantial advance towards more generalizable and interpretable fMRI-based visual decoding. Code will be publicly available soon: https://github.com/yuxiangwei0808/MoRE-Brain.

[309] GradPCA: Leveraging NTK Alignment for Reliable Out-of-Distribution Detection

Mariia Seleznova,Hung-Hsu Chou,Claudio Mayrink Verdun,Gitta Kutyniok

Main category: cs.LG

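TL;DR: GradPCA is an OOD detection method based on principal component analysis of gradients that exploits the low-rank structure induced by NTK alignment and performs more consistently than existing methods.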

Details

Motivation: Existing OOD detection methods perform inconsistently; GradPCA offers a more stable solution through gradient PCA and NTK alignment. Method: Applies PCA to gradient class-means, supported by a theoretical analysis of the feature-space properties that arise from NTK alignment. Result: Experiments confirm GradPCA's strong performance and show that feature quality (e.g., pretrained representations) is crucial to detection. Conclusion: GradPCA provides theoretical and practical guidance for designing more principled spectral OOD detectors. Abstract: We introduce GradPCA, an Out-of-Distribution (OOD) detection method that exploits the low-rank structure of neural network gradients induced by Neural Tangent Kernel (NTK) alignment. GradPCA applies Principal Component Analysis (PCA) to gradient class-means, achieving more consistent performance than existing methods across standard image classification benchmarks. We provide a theoretical perspective on spectral OOD detection in neural networks to support GradPCA, highlighting feature-space properties that enable effective detection and naturally emerge from NTK alignment. Our analysis further reveals that feature quality -- particularly the use of pretrained versus non-pretrained representations -- plays a crucial role in determining which detectors will succeed. Extensive experiments validate the strong performance of GradPCA, and our theoretical framework offers guidance for designing more principled spectral OOD detectors.

[310] Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization

Chengcan Wu,Zhixin Zhang,Zeming Wei,Yihao Zhang,Meng Sun

Main category: cs.LG

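TL;DR: This paper examines the degradation of safety in large language models (LLMs) during fine-tuning and proposes a safety-aware probing (SAP) optimization framework that reduces harmful content generation while preserving task performance.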

Details

Motivation: Although safety alignment techniques are applied to LLMs during pre-training, fine-tuning can still degrade safety because of the data involved; this paper aims to address that problem. Method: Proposes the SAP framework, which introduces a safety-aware probe into gradient propagation to identify potentially risky directions and thereby guide fine-tuning. Result: Experiments show that SAP effectively reduces harmful content generation while achieving test loss comparable to standard fine-tuning. Conclusion: The SAP framework offers an effective solution to safety problems that arise when fine-tuning LLMs. Abstract: The significant progress of large language models (LLMs) has led to remarkable achievements across numerous applications. However, their ability to generate harmful content has sparked substantial safety concerns. Despite the implementation of safety alignment techniques during the pre-training phase, recent research indicates that fine-tuning LLMs on adversarial or even benign data can inadvertently compromise their safety. In this paper, we re-examine the fundamental issue of why fine-tuning on non-harmful data still results in safety degradation. We introduce a safety-aware probing (SAP) optimization framework designed to mitigate the safety risks of fine-tuning LLMs. Specifically, SAP incorporates a safety-aware probe into the gradient propagation process, mitigating the model's risk of safety degradation by identifying potential pitfalls in gradient directions, thereby enhancing task-specific performance while successfully preserving model safety. Our extensive experimental results demonstrate that SAP effectively reduces harmfulness below the original fine-tuned model and achieves comparable test loss to standard fine-tuning methods. Our code is available at https://github.com/ChengcanWu/SAP.

[311] ATR-Bench: A Federated Learning Benchmark for Adaptation, Trust, and Reasoning

Tajamul Ashraf,Mohammed Mohsen Peerzada,Moloud Abdar,Yutong Xie,Yuyin Zhou,Xiaofeng Liu,Iqra Altaf Gillani,Janibul Bashir

Main category: cs.LG

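TL;DR: ATR-Bench is a unified framework for analyzing federated learning along three dimensions (adaptation, trust, and reasoning), filling the gap in standardized evaluation.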

Details

Motivation: Federated learning lacks standardized evaluation, which hampers systematic progress and fair comparison. Method: Proposes the ATR-Bench framework, which analyzes federated learning along the dimensions of adaptation, trust, and reasoning, and carries out extensive benchmarking. Result: Provides benchmarks for adaptation to heterogeneous clients and for trust in adversarial environments; the reasoning dimension is covered with literature-driven insights only. Conclusion: ATR-Bench lays the groundwork for systematic evaluation of federated learning, and the authors plan to release the code and a continuously updated research repository. Abstract: Federated Learning (FL) has emerged as a promising paradigm for collaborative model training while preserving data privacy across decentralized participants. As FL adoption grows, numerous techniques have been proposed to tackle its practical challenges. However, the lack of standardized evaluation across key dimensions hampers systematic progress and fair comparison of FL methods. In this work, we introduce ATR-Bench, a unified framework for analyzing federated learning through three foundational dimensions: Adaptation, Trust, and Reasoning. We provide an in-depth examination of the conceptual foundations, task formulations, and open research challenges associated with each theme. We have extensively benchmarked representative methods and datasets for adaptation to heterogeneous clients and trustworthiness in adversarial or unreliable environments. Due to the lack of reliable metrics and models for reasoning in FL, we only provide literature-driven insights for this dimension. ATR-Bench lays the groundwork for a systematic and holistic evaluation of federated learning with real-world relevance. We will publicly release our complete codebase and a curated repository that continuously tracks new developments and research in the FL literature.

[312] The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm

Noah Amsel,David Persson,Christopher Musco,Robert Gower

Main category: cs.LG

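TL;DR: The paper proposes Polar Express, a GPU-friendly algorithm for computing the polar decomposition and the matrix sign function for use in the Muon optimization framework in deep learning; it converges faster than classical alternatives while using only matrix multiplications.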

Details

Motivation: Classical polar decomposition algorithms from numerical analysis are poorly suited to deep learning: Newton-Schulz converges slowly at first, and rational-function methods rely on QR decompositions or matrix inverses. Deep learning needs methods that are highly efficient and GPU-compatible, while high accuracy is often unnecessary. Method: Polar Express adapts the polynomial update rule at each iteration by solving a minimax optimization problem and uses only matrix-matrix multiplications, which keeps it GPU-compatible; it converges quickly and is stable in bfloat16. Result: Experiments show consistent improvements in validation loss on large-scale models such as GPT-2, outperforming recent alternatives across a range of learning rates. Conclusion: Polar Express is an efficient, GPU-compatible, and stable polar decomposition algorithm for deep learning optimization that performs especially well within the Muon framework. Abstract: Computing the polar decomposition and the related matrix sign function has been a well-studied problem in numerical analysis for decades. More recently, it has emerged as an important subroutine in deep learning, particularly within the Muon optimization framework. However, the requirements in this setting differ significantly from those of traditional numerical analysis. In deep learning, methods must be highly efficient and GPU-compatible, but high accuracy is often unnecessary. As a result, classical algorithms like Newton-Schulz (which suffers from slow initial convergence) and methods based on rational functions (which rely on QR decompositions or matrix inverses) are poorly suited to this context. In this work, we introduce Polar Express, a GPU-friendly algorithm for computing the polar decomposition. Like classical polynomial methods such as Newton-Schulz, our approach uses only matrix-matrix multiplications, making it GPU-compatible. Motivated by earlier work of Chen & Chow and Nakatsukasa & Freund, Polar Express adapts the polynomial update rule at each iteration by solving a minimax optimization problem, and we prove that it enjoys a strong worst-case optimality guarantee. This property ensures both rapid early convergence and fast asymptotic convergence. We also address finite-precision issues, making it stable in bfloat16 in practice. We apply Polar Express within the Muon optimization framework and show consistent improvements in validation loss on large-scale models such as GPT-2, outperforming recent alternatives across a range of learning rates.
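For context, the classical Newton-Schulz baseline that the abstract contrasts against can be sketched in a few lines. This is not Polar Express: the abstract does not give the adapted per-iteration polynomial coefficients, so the sketch uses the fixed cubic update X <- 1.5 X - 0.5 X X^T X. The Frobenius-norm scaling, the step count, and the pure-Python list-of-lists matrices are assumptions chosen to keep the example self-contained.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(row) for row in zip(*a)]


def newton_schulz_polar(m, steps=30):
    """Approximate the orthogonal polar factor of a full-rank matrix m.

    Uses only matrix-matrix products. Scaling by the Frobenius norm puts
    every singular value in (0, 1], inside the iteration's convergence region.
    """
    frobenius = sum(v * v for row in m for v in row) ** 0.5
    x = [[v / frobenius for v in row] for row in m]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [
            [1.5 * a - 0.5 * b for a, b in zip(row_x, row_c)]
            for row_x, row_c in zip(x, xxtx)
        ]
    return x
```

Small singular values grow by only about a factor of 1.5 per step under this fixed update, which illustrates the slow initial convergence the abstract mentions.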

[313] LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

Zebin You,Shen Nie,Xiaolu Zhang,Jun Hu,Jun Zhou,Zhiwu Lu,Ji-Rong Wen,Chongxuan Li

Main category: cs.LG

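TL;DR: LLaDA-V is a diffusion-based multimodal large language model that achieves multimodal alignment through visual instruction tuning and masked diffusion models, outperforming existing hybrid autoregressive-diffusion models.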

Details

Motivation: To explore the potential of diffusion models for multimodal tasks and move beyond the autoregressive paradigm that dominates current multimodal approaches. Method: Builds on LLaDA and adds a vision encoder and an MLP connector that projects visual features into the language embedding space. Result: Performs strongly on multimodal tasks, approaching or surpassing existing models. Conclusion: Diffusion models show promise for multimodal tasks and warrant further study. Abstract: In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models, representing a departure from the autoregressive paradigms dominant in current multimodal approaches. Built upon LLaDA, a representative large language diffusion model, LLaDA-V incorporates a vision encoder and MLP connector that projects visual features into the language embedding space, enabling effective multimodal alignment. Our empirical investigation reveals several intriguing results: First, LLaDA-V demonstrates promising multimodal performance despite its language model being weaker on purely textual tasks than counterparts like LLaMA3-8B and Qwen2-7B. When trained on the same instruction data, LLaDA-V is highly competitive with LLaMA3-V across multimodal tasks with better data scalability. It also narrows the performance gap to Qwen2-VL, suggesting the effectiveness of its architecture for multimodal tasks. Second, LLaDA-V achieves state-of-the-art performance in multimodal understanding compared to existing hybrid autoregressive-diffusion and purely diffusion-based MLLMs. Our findings suggest that large language diffusion models show promise in multimodal contexts and warrant further investigation in future research. Project page and codes: https://ml-gsai.github.io/LLaDA-V-demo/.

[314] UFT: Unifying Supervised and Reinforcement Fine-Tuning

Mingyang Liu,Gabriele Farina,Asuman Ozdaglar

Main category: cs.LG

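TL;DR: The paper proposes Unified Fine-Tuning (UFT), which combines the strengths of supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), addresses the limitations of existing methods, and markedly improves model reasoning.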

Details

Motivation: Existing post-training methods each have limitations: SFT may overfit and limit the reasoning ability of larger models, while RFT depends on the strength of the base model and has high sample complexity. Method: Proposes UFT, which unifies SFT and RFT into a single integrated process that combines supervision signals with exploration. Result: UFT outperforms both SFT and RFT across model sizes and is proven to break RFT's exponential sample complexity bottleneck. Conclusion: UFT is an efficient post-training paradigm that markedly improves model reasoning and addresses the limitations of existing methods. Abstract: Post-training has demonstrated its importance in enhancing the reasoning capabilities of large language models (LLMs). The primary post-training methods can be categorized into supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). SFT is efficient and well-suited for small language models, but it may lead to overfitting and limit the reasoning abilities of larger models. In contrast, RFT generally yields better generalization but depends heavily on the strength of the base model. To address the limitations of SFT and RFT, we propose Unified Fine-Tuning (UFT), a novel post-training paradigm that unifies SFT and RFT into a single, integrated process. UFT enables the model to effectively explore solutions while incorporating informative supervision signals, bridging the gap between memorizing and thinking underlying existing methods. Notably, UFT outperforms both SFT and RFT in general, regardless of model size. Furthermore, we theoretically prove that UFT breaks RFT's inherent exponential sample complexity bottleneck, showing for the first time that unified training can exponentially accelerate convergence on long-horizon reasoning tasks.

[315] Masked Conditioning for Deep Generative Models

Phillip Mueller,Jannik Wiese,Sebastian Mueller,Lars Mikelsons

Main category: cs.LG

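TL;DR: Proposes a new masked-conditioning approach that lets generative models handle sparse, mixed-type data, suited to resource-constrained engineering tasks.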

Details

Motivation: Engineering datasets are typically small, sparsely labeled, and contain both numerical and categorical conditions, and computational resources are limited, which restricts the use of generative models. Method: Masks conditions during training to simulate sparse conditions, explores several sparsity schedules, and introduces a flexible embedding that handles categorical and numerical conditions. Result: The method is validated on 2D point cloud and image datasets, and coupling small models with large pretrained models is shown to improve generation quality. Conclusion: The method is promising for resource-constrained engineering tasks while retaining conditional control. Abstract: Datasets in engineering domains are often small, sparsely labeled, and contain numerical as well as categorical conditions. Additionally, computational resources are typically limited in practical applications, which hinders the adoption of generative models for engineering tasks. We introduce a novel masked-conditioning approach that enables generative models to work with sparse, mixed-type data. We mask conditions during training to simulate sparse conditions at inference time. For this purpose, we explore the use of various sparsity schedules that show different strengths and weaknesses. In addition, we introduce a flexible embedding that deals with categorical as well as numerical conditions. We integrate our method into an efficient variational autoencoder as well as a latent diffusion model and demonstrate the applicability of our approach on two engineering-related datasets of 2D point clouds and images. Finally, we show that small models trained on limited data can be coupled with large pretrained foundation models to improve generation quality while retaining the controllability induced by our conditioning scheme.
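Masking conditions during training under a sparsity schedule could be sketched as follows. The abstract does not specify the schedules or the mask representation; the linear schedule, the use of `None` as the mask token, and the names `linear_mask_prob` and `mask_conditions` are hypothetical.

```python
import random


def linear_mask_prob(step, total_steps, start=0.0, end=0.8):
    """One possible sparsity schedule: mask probability rises linearly."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * fraction


def mask_conditions(conditions, mask_prob, rng):
    """Independently replace each condition with None with probability mask_prob.

    conditions: name -> numerical or categorical value.
    A downstream embedding would map None to a learned "missing" vector.
    """
    return {
        name: (None if rng.random() < mask_prob else value)
        for name, value in conditions.items()
    }
```

A training loop would call `mask_conditions(sample_conditions, linear_mask_prob(step, total_steps), rng)` on each sample so the model sees the partially specified inputs it will meet at inference time.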

[316] When Are Concepts Erased From Diffusion Models?

Kevin Lu,Nicky Kriplani,Rohit Gandikota,Minh Pham,David Bau,Chinmay Hegde,Niv Cohen

Main category: cs.LG

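TL;DR: The paper studies concept erasure methods in diffusion models, proposes two conceptual models of the erasure mechanism, and introduces an evaluation framework to verify whether a concept has been thoroughly erased.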

Details

Motivation: To examine how thoroughly existing concept erasure methods work and to expose the trade-off between erasure mechanisms and model robustness. Method: Proposes two conceptual models of the erasure mechanism and designs an evaluation framework comprising adversarial attacks, novel probing techniques, and analysis of alternative generations. Result: Reveals the tension between minimizing side effects and maintaining robustness to adversarial prompts. Conclusion: Underlines the importance of comprehensive evaluation of concept erasure in diffusion models. Abstract: Concept erasure, the ability to selectively prevent a model from generating specific concepts, has attracted growing interest, with various approaches emerging to address the challenge. However, it remains unclear how thoroughly these methods erase the target concept. We begin by proposing two conceptual models for the erasure mechanism in diffusion models: (i) reducing the likelihood of generating the target concept, and (ii) interfering with the model's internal guidance mechanisms. To thoroughly assess whether a concept has been truly erased from the model, we introduce a suite of independent evaluations. Our evaluation framework includes adversarial attacks, novel probing techniques, and analysis of the model's alternative generations in place of the erased concept. Our results shed light on the tension between minimizing side effects and maintaining robustness to adversarial prompts. Broadly, our work underlines the importance of comprehensive evaluation for erasure in diffusion models.

[317] Interactive Post-Training for Vision-Language-Action Models

Shuhan Tan,Kairan Dou,Yue Zhao,Philipp Krähenbühl

Main category: cs.LG

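TL;DR: RIPT-VLA is a reinforcement-learning-based interactive post-training method that fine-tunes pretrained vision-language-action models using sparse binary success rewards, markedly improving performance and data efficiency.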

Details

Motivation: Existing vision-language-action model training relies on offline expert data and supervised imitation, which makes it hard to adapt to new tasks and environments in low-data regimes. Method: Interactive post-training with a stable policy optimization algorithm based on dynamic rollout sampling and leave-one-out advantage estimation. Result: Improves the lightweight QueST model by 21.2% and brings the 7B OpenVLA-OFT model to a 97.5% success rate; with a single demonstration, the success rate rises from 4% to 97%. Conclusion: RIPT-VLA is a practical and efficient post-training paradigm that improves performance and generalization with minimal supervision. Abstract: We introduce RIPT-VLA, a simple and scalable reinforcement-learning-based interactive post-training paradigm that fine-tunes pretrained Vision-Language-Action (VLA) models using only sparse binary success rewards. Existing VLA training pipelines rely heavily on offline expert demonstration data and supervised imitation, limiting their ability to adapt to new tasks and environments under low-data regimes. RIPT-VLA addresses this by enabling interactive post-training with a stable policy optimization algorithm based on dynamic rollout sampling and leave-one-out advantage estimation. RIPT-VLA has the following characteristics. First, it applies to various VLA models, resulting in an improvement on the lightweight QueST model by 21.2%, and the 7B OpenVLA-OFT model to an unprecedented 97.5% success rate. Second, it is computationally efficient and data-efficient: with only one demonstration, RIPT-VLA enables an unworkable SFT model (4%) to succeed with a 97% success rate within 15 iterations. Furthermore, we demonstrate that the policy learned by RIPT-VLA generalizes across different tasks and scenarios and is robust to the initial state context. These results highlight RIPT-VLA as a practical and effective paradigm for post-training VLA models through minimal supervision.
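Leave-one-out advantage estimation with binary success rewards is commonly computed as each rollout's reward minus the mean reward of the other rollouts from the same context. The sketch below follows that standard definition; the abstract does not give the exact estimator, and the function name is an assumption.

```python
def leave_one_out_advantages(rewards):
    """Advantage of rollout i = r_i minus the mean reward of the other rollouts.

    rewards: success indicators (0 or 1) for K >= 2 rollouts that share
    the same task and initial state.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out needs at least two rollouts")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]
```

When every rollout in a group succeeds or every rollout fails, all advantages are zero and the group carries no learning signal.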

cs.AI

[318] Causal LLM Routing: End-to-End Regret Minimization from Observational Data

Asterios Tsiourvas,Wei Sun,Georgia Perakis

Main category: cs.AI

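TL;DR: The paper proposes a causal end-to-end framework for LLM routing that learns routing policies from observational data by minimizing decision-making regret, avoiding the compounding errors of conventional approaches.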

Details

Motivation: Conventional LLM routing methods use a decoupled strategy that is prone to compounding errors and depends on costly full-feedback data. This paper aims to learn routing policies from observational data instead. Method: Proposes a causal end-to-end framework with two theoretically grounded surrogate objectives: a classification-based upper bound and a softmax-weighted regret approximation. The framework is also extended to handle heterogeneous cost preferences. Result: On public benchmarks, the method outperforms existing baselines and achieves state-of-the-art performance across different embedding models. Conclusion: Learning routing policies from observational data avoids compounding errors and the need for costly data, making the method practical. Abstract: LLM routing aims to select the most appropriate model for each query, balancing competing performance metrics such as accuracy and cost across a pool of language models. Prior approaches typically adopt a decoupled strategy, where the metrics are first predicted and the model is then selected based on these estimates. This setup is prone to compounding errors and often relies on full-feedback data, where each query is evaluated by all candidate models, which is costly to obtain and maintain in practice. In contrast, we learn from observational data, which records only the outcome of the model actually deployed. We propose a causal end-to-end framework that learns routing policies by minimizing decision-making regret from observational data. To enable efficient optimization, we introduce two theoretically grounded surrogate objectives: a classification-based upper bound, and a softmax-weighted regret approximation shown to recover the optimal policy at convergence. We further extend our framework to handle heterogeneous cost preferences via an interval-conditioned architecture. Experiments on public benchmarks show that our method outperforms existing baselines, achieving state-of-the-art performance across different embedding models.
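A softmax-weighted regret surrogate can be illustrated as the gap between the best achievable utility and the expected utility under a softmax routing policy. This sketch assumes a per-model utility estimate is already available for the query; how such estimates are obtained from observational data is the causal part of the paper and is not reproduced here. The function names and the temperature parameter are assumptions.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over router scores."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_weighted_regret(scores, utilities, temperature=1.0):
    """Best utility minus expected utility under the softmax routing policy.

    scores:    router logits, one per candidate model.
    utilities: assumed per-model utility estimates (e.g. accuracy minus cost).
    """
    probs = softmax(scores, temperature)
    expected = sum(p * u for p, u in zip(probs, utilities))
    return max(utilities) - expected
```

The quantity is differentiable in the scores and approaches zero as the policy concentrates on the highest-utility model.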

[319] Optimizing LLM-Based Multi-Agent System with Textual Feedback: A Case Study on Software Development

Ming Shen,Raphael Shu,Anurag Pratik,James Gung,Yubin Ge,Monica Sunkara,Yi Zhang

Main category: cs.AI

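TL;DR: The paper proposes a two-step optimization method based on natural language feedback to improve role-based multi-agent systems on software development tasks, and studies how different optimization settings affect system behavior.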

Details

Motivation: Although large language models (LLMs) have driven notable progress in multi-agent systems, optimizing these systems remains challenging. This paper aims to optimize role-based multi-agent systems using natural language feedback. Method: Proposes a two-step optimization pipeline: (1) use textual feedback to identify underperforming agents and explain their failures; (2) use the failure explanations to optimize system prompts. Compares online with offline optimization and individual with group optimization, along with one-pass and multi-pass prompting strategies. Result: The method is effective on software development tasks, and the different optimization settings shed light on the group behavior of multi-agent systems. Conclusion: The study offers practical insights for future multi-agent system development and confirms the effectiveness of the optimization method. Abstract: We have seen remarkable progress in large language model (LLM)-empowered multi-agent systems solving complex tasks necessitating cooperation among experts with diverse skills. However, optimizing LLM-based multi-agent systems remains challenging. In this work, we perform an empirical case study on group optimization of role-based multi-agent systems utilizing natural language feedback for challenging software development tasks under various evaluation dimensions. We propose a two-step agent prompt optimization pipeline: identifying underperforming agents with their failure explanations utilizing textual feedback and then optimizing system prompts of identified agents utilizing failure explanations. We then study the impact of various optimization settings on system performance with two comparison groups: online against offline optimization and individual against group optimization. For group optimization, we study two prompting strategies: one-pass and multi-pass prompting optimizations. Overall, we demonstrate the effectiveness of our optimization method for role-based multi-agent systems tackling software development tasks evaluated on diverse evaluation dimensions, and we investigate the impact of diverse optimization settings on group behaviors of the multi-agent systems to provide practical insights for future development.

[320] Can AI Read Between The Lines? Benchmarking LLMs On Financial Nuance

Dominick Kubica,Dylan T. Gordon,Nanami Emura,Derleen Saini,Charlie Goldenberg

Main category: cs.AI

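TL;DR: This paper evaluates the reliability of generative AI for sentiment analysis in finance, compares several large language models (such as Copilot, ChatGPT, and Gemini), and examines the effect of prompt engineering on the results.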

Details

Motivation: As generative AI spreads across industries, its accuracy and reliability in high-stakes domains such as finance need to be verified, especially when handling nuanced emotional language. Method: Using Microsoft earnings call transcripts, the study assesses how well LLM-derived sentiment correlates with market sentiment and stock movements, and tests prompt engineering techniques. Result: LLMs struggle with the nuanced emotional language in financial text; prompt engineering partially improves the results. Conclusion: Generative AI still has limitations in financial sentiment analysis, and both models and prompt engineering techniques need further refinement. Abstract: As of 2025, Generative Artificial Intelligence (GenAI) has become a central tool for productivity across industries. Beyond text generation, GenAI now plays a critical role in coding, data analysis, and research workflows. As large language models (LLMs) continue to evolve, it is essential to assess the reliability and accuracy of their outputs, especially in specialized, high-stakes domains like finance. Most modern LLMs transform text into numerical vectors, which are used in operations such as cosine similarity searches to generate responses. However, this abstraction process can lead to misinterpretation of emotional tone, particularly in nuanced financial contexts. While LLMs generally excel at identifying sentiment in everyday language, these models often struggle with the nuanced, strategically ambiguous language found in earnings call transcripts. Financial disclosures frequently embed sentiment in hedged statements, forward-looking language, and industry-specific jargon, making it difficult even for human analysts to interpret consistently, let alone AI models. This paper presents findings from the Santa Clara Microsoft Practicum Project, led by Professor Charlie Goldenberg, which benchmarks the performance of Microsoft's Copilot, OpenAI's ChatGPT, Google's Gemini, and traditional machine learning models for sentiment analysis of financial text. Using Microsoft earnings call transcripts, the analysis assesses how well LLM-derived sentiment correlates with market sentiment and stock movements and evaluates the accuracy of model outputs. Prompt engineering techniques are also examined to improve sentiment analysis results.
Visualizations of sentiment consistency are developed to evaluate alignment between tone and stock performance, with sentiment trends analyzed across Microsoft's lines of business to determine which segments exert the greatest influence.

[321] BioDSA-1K: Benchmarking Data Science Agents for Biomedical Research

Zifeng Wang,Benjamin Danek,Jimeng Sun

Main category: cs.AI

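TL;DR: BioDSA-1K is a benchmark for evaluating AI agents on biomedical hypothesis validation tasks, comprising 1,029 tasks and 1,177 analysis plans and supporting multi-dimensional evaluation.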

Details

Motivation: Validating scientific hypotheses is a central challenge in biomedical research, yet AI agents struggle with it because of the complexity of the data and the difficulty of interpreting evidence. Method: BioDSA-1K draws tasks and analysis plans from more than 300 studies and provides structured hypotheses with data-grounded evidence that can be tested with statistical or machine learning methods. Result: The benchmark supports evaluation along four axes: hypothesis decision accuracy, alignment between evidence and conclusion, correctness of the reasoning process, and executability of AI-generated code. Conclusion: BioDSA-1K provides a foundation for building and evaluating trustworthy biomedical AI agents, with particular attention to non-verifiable hypotheses where the data are insufficient. Abstract: Validating scientific hypotheses is a central challenge in biomedical research, and remains difficult for artificial intelligence (AI) agents due to the complexity of real-world data analysis and evidence interpretation. In this work, we present BioDSA-1K, a benchmark designed to evaluate AI agents on realistic, data-driven biomedical hypothesis validation tasks. BioDSA-1K consists of 1,029 hypothesis-centric tasks paired with 1,177 analysis plans, curated from over 300 published biomedical studies to reflect the structure and reasoning found in authentic research workflows. Each task includes a structured hypothesis derived from the original study's conclusions, expressed in the affirmative to reflect the language of scientific reporting, and one or more pieces of supporting evidence grounded in empirical data tables. While these hypotheses mirror published claims, they remain testable using standard statistical or machine learning methods. The benchmark enables evaluation along four axes: (1) hypothesis decision accuracy, (2) alignment between evidence and conclusion, (3) correctness of the reasoning process, and (4) executability of the AI-generated analysis code. Importantly, BioDSA-1K includes non-verifiable hypotheses: cases where the available data are insufficient to support or refute a claim, reflecting a common yet underexplored scenario in real-world science. We propose BioDSA-1K as a foundation for building and evaluating generalizable, trustworthy AI agents for biomedical discovery.

[322] Dynamic Sampling that Adapts: Iterative DPO for Self-Aware Mathematical Reasoning

Jun Rao,Xuebo Liu,Hexuan Deng,Zepeng Lin,Zixiong Yu,Jiansheng Wei,Xiaojun Meng,Min Zhang

Main category: cs.AI

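TL;DR: SAI-DPO is a dynamic data selection algorithm that assesses the model's reasoning ability in real time at each training stage and adjusts data selection accordingly, significantly improving task performance.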

Details

Motivation: Existing data selection methods rely on static metrics and cannot adapt to a dynamic training process, which limits sustained improvement of model capability. Method: The paper proposes SAI-DPO, an algorithm that selects training data dynamically and uses real-time model performance feedback to adapt its data selection strategy. Result: Across multiple mathematical reasoning benchmarks, SAI-DPO achieves an average performance boost of up to 21.3 percentage points, with gains of 10 and 15 points on AIME24 and AMC23, respectively. Conclusion: Dynamic, model-adaptive data selection outperforms static strategies and significantly improves performance on reasoning tasks. Abstract: In the realm of data selection for reasoning tasks, existing approaches predominantly rely on externally predefined static metrics such as difficulty and diversity, which are often designed for supervised fine-tuning (SFT) and lack adaptability to continuous training processes. A critical limitation of these methods is their inability to dynamically align with the evolving capabilities of models during online training, a gap that becomes increasingly pronounced with the rise of dynamic training paradigms and online reinforcement learning (RL) frameworks (e.g., R1 models). To address this, we introduce SAI-DPO, an algorithm that dynamically selects training data by continuously assessing a model's stage-specific reasoning abilities across different training phases. By integrating real-time model performance feedback, SAI-DPO adapts data selection to the evolving strengths and weaknesses of the model, thus enhancing both data utilization efficiency and final task performance. Extensive experiments on three state-of-the-art models and eight mathematical reasoning benchmarks, including challenging competition-level datasets (e.g., AIME24 and AMC23), demonstrate that SAI-DPO achieves an average performance boost of up to 21.3 percentage points, with particularly notable improvements of 10 and 15 points on AIME24 and AMC23, respectively. These results highlight the superiority of dynamic, model-adaptive data selection over static, externally defined strategies in advancing reasoning.

[323] SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning

Kaiwen Zhou,Xuandong Zhao,Gaowen Liu,Jayanth Srinivasa,Aosong Feng,Dawn Song,Xin Eric Wang

Main category: cs.AI

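TL;DR: SafeKey activates the safety "aha moment" of LRMs, significantly improving their safety against harmful queries and adversarial attacks while preserving general abilities.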

Details

Motivation: Existing SFT-aligned LRMs generalize poorly to unseen jailbreak prompts, leaving safety risks. Method: The paper proposes SafeKey, which combines a Dual-Path Safety Head with a Query-Mask Modeling objective to strengthen the model's safety reasoning. Result: Experiments show that SafeKey lowers the average harmfulness rate by 9.6%, significantly improving safety. Conclusion: SafeKey effectively strengthens the safety of LRMs by reshaping internal attention and improving the quality of hidden representations. Abstract: Large Reasoning Models (LRMs) introduce a new generation paradigm of explicitly reasoning before answering, leading to remarkable improvements in complex tasks. However, they pose great safety risks against harmful queries and adversarial attacks. While recent mainstream safety efforts on LRMs, supervised fine-tuning (SFT), improve safety performance, we find that SFT-aligned models struggle to generalize to unseen jailbreak prompts. After thorough investigation of LRMs' generation, we identify a safety aha moment that can activate safety reasoning and lead to a safe response. This aha moment typically appears in the 'key sentence', which follows the model's query understanding process and can indicate whether the model will proceed safely. Based on these insights, we propose SafeKey, including two complementary objectives to better activate the safety aha moment in the key sentence: (1) a Dual-Path Safety Head to enhance the safety signal in the model's internal representations before the key sentence, and (2) a Query-Mask Modeling objective to improve the model's attention on its query understanding, which has important safety hints. Experiments across multiple safety benchmarks demonstrate that our methods significantly improve safety generalization to a wide range of jailbreak attacks and out-of-distribution harmful prompts, lowering the average harmfulness rate by 9.6%, while maintaining general abilities. Our analysis reveals how SafeKey enhances safety by reshaping internal attention and improving the quality of hidden representations.

[324] How do Scaling Laws Apply to Knowledge Graph Engineering Tasks? The Impact of Model Size on Large Language Model Performance

Desiree Heim,Lars-Peter Meyer,Markus Schröder,Johannes Frey,Andreas Dengel

Main category: cs.AI

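TL;DR: The paper examines how model size relates to performance and cost when using large language models (LLMs) for Knowledge Graph Engineering (KGE), evaluating 26 open LLMs with the LLM-KG-Bench framework.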

Details

Motivation: The study aims to verify how model size affects performance on KGE tasks and to explore the cost-effectiveness trade-off, so as to guide model selection in practice. Method: The LLM-KG-Bench framework is used to evaluate 26 open LLMs, analyzing how performance changes across model sizes and how it relates to model families. Result: Model size and performance are generally positively correlated, but plateau or ceiling effects also occur, and in some cases smaller models are more cost-effective; within a family, larger models occasionally perform worse than smaller ones. Conclusion: The effect of model size on KGE task performance follows the general scaling trend, but local anomalies occur, so it is advisable to also test the neighboring model sizes within the same family. Abstract: When using Large Language Models (LLMs) to support Knowledge Graph Engineering (KGE), one of the first indications when searching for an appropriate model is its size. According to the scaling laws, larger models typically show higher capabilities. However, in practice, resource costs are also an important factor and thus it makes sense to consider the ratio between model performance and costs. The LLM-KG-Bench framework enables the comparison of LLMs in the context of KGE tasks and assesses their capabilities of understanding and producing KGs and KG queries. Based on a dataset created in an LLM-KG-Bench run covering 26 open state-of-the-art LLMs, we explore the model size scaling laws specific to KGE tasks. In our analyses, we assess how benchmark scores evolve between different model size categories. Additionally, we inspect how the general score development of single models and families of models correlates to their size. Our analyses revealed that, with a few exceptions, the model size scaling laws generally also apply to the selected KGE tasks. However, in some cases, plateau or ceiling effects occurred, i.e., the task performance did not change much between a model and the next larger model. In these cases, smaller models could be considered to achieve high cost-effectiveness. Regarding models of the same family, sometimes larger models performed worse than smaller models of the same family. These effects occurred only locally. Hence it is advisable to additionally test the next smallest and largest model of the same family.
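The plateau and local-regression effects described in this abstract can be illustrated with a small sketch. It is only an illustration and not part of LLM-KG-Bench: the function name `classify_steps`, the tolerance `tol`, and the example scores are all hypothetical.

```python
def classify_steps(size_to_score, tol=0.02):
    """Label each step from one model size to the next larger one.

    size_to_score: {parameter count in billions: benchmark score}
    tol: hypothetical threshold below which a change counts as "no change".
    """
    ordered = sorted(size_to_score.items())
    labels = {}
    for (small, s_small), (large, s_large) in zip(ordered, ordered[1:]):
        delta = s_large - s_small
        if delta < -tol:
            labels[(small, large)] = "regression"  # larger model scored worse
        elif delta <= tol:
            labels[(small, large)] = "plateau"     # score barely changed
        else:
            labels[(small, large)] = "gain"        # usual scaling behaviour
    return labels


# Made-up scores for one hypothetical model family.
family = {1: 0.40, 7: 0.60, 14: 0.61, 32: 0.55}
steps = classify_steps(family)
```

A step labelled "plateau" or "regression" is the kind of case in which, following the abstract's advice, the neighboring sizes of the same family would be worth testing as well.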

[325] Incentivizing Dual Process Thinking for Efficient Large Language Model Reasoning

Xiaoxue Cheng,Junyi Li,Zhenduo Zhang,Xinyu Tang,Wayne Xin Zhao,Xinyu Kong,Zhiqiang Zhang

Main category: cs.AI

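TL;DR: The paper proposes ACPO, a reinforcement learning framework that reduces redundant reasoning in large reasoning models through adaptive cognitive allocation and dynamic system switching.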

Details

Motivation: Large reasoning models perform well on complex reasoning tasks but often overthink and generate redundant content. Inspired by the dual process theory in cognitive science, the authors aim to improve reasoning efficiency through an adaptive approach. Method: ACPO has two key components: system-aware reasoning tokens, and online difficulty estimation with a token length budget. Training proceeds in two stages: supervised fine-tuning to cold start the model, followed by ACPO to optimize adaptive system switching. Result: Experiments show that ACPO effectively reduces redundant reasoning and adapts cognitive allocation to task complexity, achieving efficient hybrid reasoning. Conclusion: By making the cognitive process transparent and switching systems dynamically, ACPO significantly improves the efficiency of large reasoning models. Abstract: Large reasoning models (LRMs) have demonstrated strong performance on complex reasoning tasks, but often suffer from overthinking, generating redundant content regardless of task difficulty. Inspired by the dual process theory in cognitive science, we propose Adaptive Cognition Policy Optimization (ACPO), a reinforcement learning framework that enables LRMs to achieve efficient reasoning through adaptive cognitive allocation and dynamic system switch. ACPO incorporates two key components: (1) introducing system-aware reasoning tokens to explicitly represent the thinking modes, thereby making the model's cognitive process transparent, and (2) integrating online difficulty estimation and token length budget to guide adaptive system switch and reasoning during reinforcement learning. To this end, we propose a two-stage training strategy. The first stage begins with supervised fine-tuning to cold start the model, enabling it to generate reasoning paths with explicit thinking modes. In the second stage, we apply ACPO to further enhance adaptive system switch for difficulty-aware reasoning. Experimental results demonstrate that ACPO effectively reduces redundant reasoning while adaptively adjusting cognitive allocation based on task complexity, achieving efficient hybrid reasoning.

[326] SPaRC: A Spatial Pathfinding Reasoning Challenge

Lars Benedikt Kaesberg,Jan Philip Wahle,Terry Ruas,Bela Gipp

Main category: cs.AI

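TL;DR: SPaRC is a new dataset of 2D grid pathfinding puzzles for evaluating spatial and symbolic reasoning. Humans perform very well, while current models perform poorly, especially on hard puzzles.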

Details

Motivation: Existing reasoning datasets saturate and fail to test abstract, multi-step problems, especially pathfinding and complex rule constraint satisfaction. Method: The paper introduces SPaRC, a dataset of 1,000 2D grid pathfinding puzzles that require step-by-step planning with arithmetic and geometric rules. Result: Humans achieve near-perfect accuracy (98.0%; 94.5% on hard puzzles), while the best models, such as o4-mini, struggle (15.8%; 1.1% on hard puzzles). Models often generate invalid paths and make errors in navigation and spatial logic. Conclusion: SPaRC exposes the limits of models' spatial reasoning and motivates research on better training and more efficient test-time scaling for abstract, multi-step problem-solving. Abstract: Existing reasoning datasets saturate and fail to test abstract, multi-step problems, especially pathfinding and complex rule constraint satisfaction. We introduce SPaRC (Spatial Pathfinding Reasoning Challenge), a dataset of 1,000 2D grid pathfinding puzzles to evaluate spatial and symbolic reasoning, requiring step-by-step planning with arithmetic and geometric rules. Humans achieve near-perfect accuracy (98.0%; 94.5% on hard puzzles), while the best reasoning models, such as o4-mini, struggle (15.8%; 1.1% on hard puzzles). Models often generate invalid paths (>50% of puzzles for o4-mini), and reasoning tokens reveal they make errors in navigation and spatial logic. Unlike humans, who take longer on hard puzzles, models fail to scale test-time compute with difficulty. Allowing models to make multiple solution attempts improves accuracy, suggesting potential for better spatial reasoning with improved training and efficient test-time scaling methods. SPaRC can be used as a window into models' spatial reasoning limitations and drive research toward new methods that excel in abstract, multi-step problem-solving.

[327] KTAE: A Model-Free Algorithm to Key-Tokens Advantage Estimation in Mathematical Reasoning

Wei Sun,Wen Yang,Pu Jian,Qianlong Du,Fuwei Cui,Shuo Ren,Jiajun Zhang

Main category: cs.AI

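TL;DR: The paper proposes KTAE, a new algorithm that estimates fine-grained, token-level advantages, addressing the coarse granularity of advantage computation in existing reinforcement learning algorithms such as GRPO and DAPO and significantly improving model reasoning.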

Details

Motivation: Existing reinforcement learning algorithms such as GRPO and DAPO compute advantages at a coarse granularity and cannot capture token-level contributions, which limits learning. Method: The paper proposes KTAE, which uses the correctness of sampled rollouts and statistical analysis to quantify token-level importance, then combines it with the rollout-level advantage to obtain a fine-grained token-level advantage estimate. Result: Experiments show that GRPO+KTAE and DAPO+KTAE outperform baseline methods on five mathematical reasoning benchmarks and reach higher accuracy with shorter responses. Conclusion: KTAE makes reinforcement learning for language models more effective and offers a new approach to fine-grained advantage estimation. Abstract: Recent advances have demonstrated that integrating reinforcement learning with rule-based rewards can significantly enhance the reasoning capabilities of large language models, even without supervised fine-tuning. However, prevalent reinforcement learning algorithms such as GRPO and its variants like DAPO suffer from a coarse granularity issue when computing the advantage. Specifically, they compute rollout-level advantages that assign identical values to every token within a sequence, failing to capture token-specific contributions and hindering effective learning. To address this limitation, we propose Key-token Advantage Estimation (KTAE), a novel algorithm that estimates fine-grained, token-level advantages without introducing additional models. KTAE leverages the correctness of sampled rollouts and applies statistical analysis to quantify the importance of individual tokens within a sequence to the final outcome. This quantified token-level importance is then combined with the rollout-level advantage to obtain a more fine-grained token-level advantage estimation. Empirical results show that models trained with GRPO+KTAE and DAPO+KTAE outperform baseline methods across five mathematical reasoning benchmarks. Notably, they achieve higher accuracy with shorter responses and even surpass R1-Distill-Qwen-1.5B using the same base model.
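The difference between a rollout-level and a token-level advantage can be sketched as follows. This is a minimal illustration under stated assumptions, not the KTAE algorithm itself: the group-normalized advantage is a common GRPO-style formulation, the per-token importance scores are placeholder inputs (the abstract does not say how KTAE computes them), and the additive combination in `token_advantages` is a hypothetical choice.

```python
from statistics import mean, pstdev


def rollout_advantages(rewards):
    """Group-normalized advantage: one scalar per sampled rollout."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All rollouts got the same reward, so there is no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def token_advantages(rollout_adv, token_importance):
    """Give each token its own advantage.

    Assumption: importance is simply added to the rollout-level value;
    the paper's actual combination rule is not given in the abstract.
    """
    return [rollout_adv + w for w in token_importance]


# Four rollouts for one prompt, rewarded 1.0 if the final answer is correct.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = rollout_advantages(rewards)

# Hypothetical importance scores for a three-token rollout.
per_token = token_advantages(adv[0], [0.0, 0.5, -0.25])
```

With a rollout-level scheme every token in the first rollout would receive the same value `adv[0]`, whereas `per_token` differs from token to token.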

[328] From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Pedagogical Visualization

Haonian Ji,Shi Qiu,Siyang Xin,Siwei Han,Zhaorun Chen,Hongyi Wang,Dake Zhang,Huaxiu Yao

Main category: cs.AI

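TL;DR: The paper proposes the EduVisBench benchmark and the EduVisAgent framework to evaluate and improve the ability of foundation models to generate visual explanations in educational settings.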

Details

Motivation: Existing foundation models have limited ability to generate pedagogically effective visual explanations and overlook the importance of structured visualizations for conceptual understanding. Method: The paper introduces EduVisBench, a multi-domain, multi-level benchmark, and proposes EduVisAgent, a multi-agent collaborative framework that decomposes reasoning and designs educationally aligned visualizations. Result: EduVisAgent substantially outperforms baseline models, achieving a 40.2% improvement and producing visualizations better suited to educational needs. Conclusion: EduVisBench and EduVisAgent provide effective tools for evaluating and improving visual reasoning in educational settings. Abstract: While foundation models (FMs), such as diffusion models and large vision-language models (LVLMs), have been widely applied in educational contexts, their ability to generate pedagogically effective visual explanations remains limited. Most existing approaches focus primarily on textual reasoning, overlooking the critical role of structured and interpretable visualizations in supporting conceptual understanding. To better assess the visual reasoning capabilities of FMs in educational settings, we introduce EduVisBench, a multi-domain, multi-level benchmark. EduVisBench features diverse STEM problem sets requiring visually grounded solutions, along with a fine-grained evaluation rubric informed by pedagogical theory. Our empirical analysis reveals that existing models frequently struggle with the inherent challenge of decomposing complex reasoning and translating it into visual representations aligned with human cognitive processes. To address these limitations, we propose EduVisAgent, a multi-agent collaborative framework that coordinates specialized agents for instructional planning, reasoning decomposition, metacognitive prompting, and visualization design. Experimental results show that EduVisAgent substantially outperforms all baselines, achieving a 40.2% improvement and delivering more educationally aligned visualizations. EduVisBench and EduVisAgent are available at https://github.com/aiming-lab/EduVisBench and https://github.com/aiming-lab/EduVisAgent.

[329] NovelSeek: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification

NovelSeek Team,Bo Zhang,Shiyang Feng,Xiangchao Yan,Jiakang Yuan,Zhiyin Yu,Xiaohan He,Songtao Huang,Shaowei Hou,Zheng Nie,Zhilong Wang,Jinyao Liu,Runmin Ma,Tianshuo Peng,Peng Ye,Dongzhan Zhou,Shufei Zhang,Xiaosong Wang,Yilan Zhang,Meng Li,Zhongying Tu,Xiangyu Yue,Wangli Ouyang,Bowen Zhou,Lei Bai

Main category: cs.AI

TL;DR: NovelSeek is a unified multi-agent framework for autonomous research across scientific fields, offering scalability, interactivity, and efficiency.

Details

Motivation: To accelerate the transformation of scientific research paradigms and improve research efficiency and innovation. Method: NovelSeek is a closed-loop multi-agent framework that supports cross-domain tasks, human-expert interaction, and automated end-to-end processes. Result: It delivers clear gains on several scientific tasks, for example raising reaction yield prediction from 27.6% to 35.4% and enhancer activity prediction accuracy from 0.52 to 0.79. Conclusion: NovelSeek shows the potential for efficient, innovative, and interactive scientific research. Abstract: Artificial Intelligence (AI) is accelerating the transformation of scientific research paradigms, not only enhancing research efficiency but also driving innovation. We introduce NovelSeek, a unified closed-loop multi-agent framework to conduct Autonomous Scientific Research (ASR) across various scientific research fields, enabling researchers to tackle complicated problems in these fields with unprecedented speed and precision. NovelSeek highlights three key advantages: 1) Scalability: NovelSeek has demonstrated its versatility across 12 scientific research tasks, capable of generating innovative ideas to enhance the performance of baseline code. 2) Interactivity: NovelSeek provides an interface for human expert feedback and multi-agent interaction in automated end-to-end processes, allowing for the seamless integration of domain expert knowledge. 3) Efficiency: NovelSeek has achieved promising performance gains in several scientific fields with significantly less time cost compared to human efforts. For instance, in reaction yield prediction, it increased from 27.6% to 35.4% in just 12 hours; in enhancer activity prediction, accuracy rose from 0.52 to 0.79 with only 4 hours of processing; and in 2D semantic segmentation, precision advanced from 78.8% to 81.0% in a mere 30 hours.

[330] AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios

Yunjia Qi,Hao Peng,Xiaozhi Wang,Amy Xin,Youfeng Liu,Bin Xu,Lei Hou,Juanzi Li

Main category: cs.AI

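TL;DR: The paper introduces AgentIF, the first benchmark for systematically evaluating LLM instruction following in agentic scenarios, and finds that current models perform poorly, especially with complex constraints and tool specifications.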

Details

Motivation: Agentic scenarios involve long and complex instructions, yet whether LLMs can reliably follow them remains underexplored. Method: The authors build the AgentIF benchmark from 707 human-annotated instructions drawn from 50 real-world agentic applications, covering long instructions with complex constraints. Result: Current LLMs perform poorly on AgentIF, particularly on complex constraints and tool specifications. Conclusion: AgentIF provides a benchmark for future research and exposes the limitations of LLMs in agentic scenarios. Abstract: Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications. Growing research efforts aim to develop LLM-based agents to address practical demands, introducing a new challenge: agentic scenarios often involve lengthy instructions with complex constraints, such as extended system prompts and detailed tool specifications. While adherence to such instructions is crucial for agentic applications, whether LLMs can reliably follow them remains underexplored. In this paper, we introduce AgentIF, the first benchmark for systematically evaluating LLM instruction following ability in agentic scenarios. AgentIF features three key characteristics: (1) Realistic, constructed from 50 real-world agentic applications. (2) Long, averaging 1,723 words with a maximum of 15,630 words. (3) Complex, averaging 11.9 constraints per instruction, covering diverse constraint types, such as tool specifications and condition constraints. To construct AgentIF, we collect 707 human-annotated instructions across 50 agentic tasks from industrial application agents and open-source agentic systems. For each instruction, we annotate the associated constraints and corresponding evaluation metrics, including code-based evaluation, LLM-based evaluation, and hybrid code-LLM evaluation. We use AgentIF to systematically evaluate existing advanced LLMs. We observe that current models generally perform poorly, especially in handling complex constraint structures and tool specifications. We further conduct error analysis and analytical experiments on instruction length and meta constraints, providing some findings about the failure modes of existing LLMs. We have released the code and data to facilitate future research.

[331] Bridging the Dynamic Perception Gap: Training-Free Draft Chain-of-Thought for Dynamic Multimodal Spatial Reasoning

Siqu Ou,Hongcheng Liu,Pingjie Wang,Yusheng Liao,Chuan Xuan,Yanfeng Wang,Yu Wang

Main category: cs.AI

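TL;DR: GRASSLAND is a novel maze navigation benchmark for evaluating dynamic spatial reasoning. The D2R framework combines textual reasoning chains with dynamic visual drafts and significantly improves multimodal large language models on dynamic spatial reasoning tasks.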

Details

Motivation: Existing methods perform poorly on dynamic spatial reasoning tasks because they are confined to text or static visual domains. Method: The paper proposes D2R, a framework that combines textual reasoning chains with dynamic visual drafts and requires no model fine-tuning. Result: D2R performs well across diverse tasks and establishes a robust baseline for dynamic spatial reasoning. Conclusion: D2R significantly improves dynamic spatial reasoning and opens a new direction for future research. Abstract: While chains-of-thought (CoT) have advanced complex reasoning in multimodal large language models (MLLMs), existing methods remain confined to text or static visual domains, often faltering in dynamic spatial reasoning tasks. To bridge this gap, we present GRASSLAND, a novel maze navigation benchmark designed to evaluate dynamic spatial reasoning. Our experiments show that augmenting textual reasoning chains with dynamic visual drafts, overlaid on input images, significantly outperforms conventional approaches, offering new insights into spatial reasoning in evolving environments. To generalize this capability, we propose D2R (Dynamic Draft-Augmented Reasoning), a training-free framework that seamlessly integrates textual CoT with corresponding visual drafts into MLLMs. Extensive evaluations demonstrate that D2R consistently enhances performance across diverse tasks, establishing a robust baseline for dynamic spatial reasoning without requiring model fine-tuning. Project is open at https://github.com/Cratileo/D2R.

[332] X-MAS: Towards Building Multi-Agent Systems with Heterogeneous LLMs

Rui Ye,Xiangrui Liu,Qimin Wu,Xianghe Pang,Zhenfei Yin,Lei Bai,Siheng Chen

Main category: cs.AI

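TL;DR: The paper proposes heterogeneous LLM-driven multi-agent systems (X-MAS), which improve system performance by using diverse LLMs, and introduces the X-MAS-Bench testbed to validate the effect.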

Details

Motivation: Existing MAS frameworks rely on a single LLM, which caps the system's intelligence. The paper explores heterogeneous LLM-driven MAS to lift this limit. Method: The authors introduce the X-MAS-Bench testbed and assess 27 LLMs across 5 domains and 5 functions, running over 1.7 million evaluations. Result: Heterogeneous LLM-driven MAS significantly improves performance, with gains of up to 8.4% in a chatbot-only scenario and 47% in a mixed chatbot-reasoner scenario. Conclusion: Heterogeneous LLMs have transformative potential in MAS and offer a new direction for scalable, collaborative AI systems. Abstract: LLM-based multi-agent systems (MAS) extend the capabilities of single LLMs by enabling cooperation among multiple specialized agents. However, most existing MAS frameworks rely on a single LLM to drive all agents, constraining the system's intelligence to the limit of that model. This paper explores the paradigm of heterogeneous LLM-driven MAS (X-MAS), where agents are powered by diverse LLMs, elevating the system's potential to the collective intelligence of diverse LLMs. We introduce X-MAS-Bench, a comprehensive testbed designed to evaluate the performance of various LLMs across different domains and MAS-related functions. As an extensive empirical study, we assess 27 LLMs across 5 domains (encompassing 21 test sets) and 5 functions, conducting over 1.7 million evaluations to identify optimal model selections for each domain-function combination. Building on these findings, we demonstrate that transitioning from homogeneous to heterogeneous LLM-driven MAS can significantly enhance system performance without requiring structural redesign. Specifically, in a chatbot-only MAS scenario, the heterogeneous configuration yields up to 8.4% performance improvement on the MATH dataset. In a mixed chatbot-reasoner scenario, the heterogeneous MAS could achieve a remarkable 47% performance boost on the AIME dataset. Our results underscore the transformative potential of heterogeneous LLMs in MAS, highlighting a promising avenue for advancing scalable, collaborative AI systems.
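Choosing a model for each domain-function combination, as the abstract describes, amounts to an argmax over a score table. The sketch below shows only that idea: the function `best_model_per_slot`, the model names, the domain and function labels, and the scores are all hypothetical and are not taken from X-MAS-Bench.

```python
def best_model_per_slot(scores):
    """Pick the highest-scoring model for every (domain, function) pair.

    scores: {(domain, function): {model_name: score}}
    """
    return {
        slot: max(by_model, key=by_model.get)
        for slot, by_model in scores.items()
    }


# Made-up evaluation results for two placeholder models.
scores = {
    ("math", "answering"): {"model-a": 0.62, "model-b": 0.71},
    ("math", "revising"): {"model-a": 0.58, "model-b": 0.49},
    ("coding", "answering"): {"model-a": 0.80, "model-b": 0.77},
}
assignment = best_model_per_slot(scores)
```

When `assignment` maps different slots to different models, the resulting system is heterogeneous in the sense used in the abstract; a homogeneous system would map every slot to the same model.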

[333] Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models

Jiaqi Wang,Kevin Qinghong Lin,James Cheng,Mike Zheng Shou

Main category: cs.AI

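TL;DR: TON is a two-stage training strategy that reduces computational cost through selective reasoning while maintaining or improving performance.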

Details

Motivation: Inspired by the way humans think selectively, the paper explores how to enable selective reasoning in vision-language models to cut unnecessary computation. Method: The paper proposes TON: (1) a supervised fine-tuning stage that introduces a 'thought dropout' operation; (2) a GRPO stage in which the model freely explores when to think. Result: TON reduces completion length by up to 90% compared with vanilla GRPO, without sacrificing performance and in some cases improving it. Conclusion: TON points toward human-like reasoning patterns in reinforcement learning approaches. Abstract: Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision-language models (VLMs). Group Relative Policy Optimization (GRPO) is a recent prominent method that encourages models to generate complete reasoning traces before answering, leading to increased token usage and computational cost. Inspired by the human-like thinking process, where people skip reasoning for easy questions but think carefully when needed, we explore how to enable VLMs to first decide when reasoning is necessary. To realize this, we propose TON, a two-stage training strategy: (i) a supervised fine-tuning (SFT) stage with a simple yet effective 'thought dropout' operation, where reasoning traces are randomly replaced with empty thoughts. This introduces a think-or-not format that serves as a cold start for selective reasoning; (ii) a GRPO stage that enables the model to freely explore when to think or not, while maximizing task-aware outcome rewards. Experimental results show that TON can reduce the completion length by up to 90% compared to vanilla GRPO, without sacrificing performance or even improving it. Further evaluations across diverse vision-language tasks, covering a range of reasoning difficulties under both 3B and 7B models, consistently reveal that the model progressively learns to bypass unnecessary reasoning steps as training advances. These findings shed light on the path toward human-like reasoning patterns in reinforcement learning approaches. Our code is available at https://github.com/kokolerk/TON.
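The 'thought dropout' operation from stage (i) can be sketched as a data preprocessing step. This is a minimal illustration, assuming each SFT example is a dictionary with a `thought` field; the field names, the `drop_prob` parameter, and the use of an empty string as the empty thought are assumptions and may differ from the released TON code.

```python
import random


def thought_dropout(examples, drop_prob, seed=0):
    """Randomly replace reasoning traces with empty thoughts.

    examples: list of dicts with (hypothetical) keys
              "question", "thought", and "answer".
    drop_prob: probability of blanking the reasoning trace of an example.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    dropped = []
    for ex in examples:
        # rng.random() lies in [0, 1), so drop_prob=1.0 always drops
        # and drop_prob=0.0 never drops.
        thought = "" if rng.random() < drop_prob else ex["thought"]
        dropped.append({**ex, "thought": thought})
    return dropped


examples = [
    {"question": "q1", "thought": "count the objects one by one", "answer": "3"},
    {"question": "q2", "thought": "compare the two regions", "answer": "left"},
]
all_dropped = thought_dropout(examples, drop_prob=1.0)
none_dropped = thought_dropout(examples, drop_prob=0.0)
```

Training on a mixture of examples with and without a reasoning trace is what gives the model the think-or-not format that the GRPO stage then builds on.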

cs.SD [Back]

[334] AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models

Kai Li,Can Shen,Yile Liu,Jirui Han,Kelong Zheng,Xuechao Zou,Zhe Wang,Xingjian Du,Shun Zhang,Hanjun Luo,Yingbin Jin,Xinxin Xing,Ziyang Ma,Yue Liu,Xiaojun Jia,Yifan Zhang,Junfeng Fang,Kun Wang,Yibo Yan,Haoyang Li,Yiming Li,Xiaobin Zhuang,Yang Liu,Haibo Hu,Zhuo Chen,Zhizheng Wu,Xiaolin Hu,Eng-Siong Chng,XiaoFeng Wang,Wenyuan Xu,Wei Dong,Xinfeng Li

Main category: cs.SD

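TL;DR: AudioTrust is the first multifaceted trustworthiness evaluation framework and benchmark designed specifically for Audio Large Language Models (ALLMs), covering six dimensions: fairness, hallucination, safety, privacy, robustness, and authentication.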

Details

Motivation: Existing evaluation frameworks mainly target the text modality or cover only a limited set of safety dimensions, and do not adequately account for the characteristics and application scenarios of the audio modality. Method: AudioTrust is built on 18 experimental setups and a dataset of over 4,420 audio/text samples, with 9 audio-specific evaluation metrics and an automated scoring pipeline. Result: Experiments reveal the trustworthiness boundaries and limitations of current open-source and closed-source ALLMs in high-risk audio scenarios. Conclusion: AudioTrust provides an important reference for the secure and trustworthy deployment of future audio models. Abstract: The rapid advancement and expanding applications of Audio Large Language Models (ALLMs) demand a rigorous understanding of their trustworthiness. However, systematic research on evaluating these models, particularly concerning risks unique to the audio modality, remains largely unexplored. Existing evaluation frameworks primarily focus on the text modality or address only a restricted set of safety dimensions, failing to adequately account for the unique characteristics and application scenarios inherent to the audio modality. We introduce AudioTrust, the first multifaceted trustworthiness evaluation framework and benchmark specifically designed for ALLMs. AudioTrust facilitates assessments across six key dimensions: fairness, hallucination, safety, privacy, robustness, and authentication. To comprehensively evaluate these dimensions, AudioTrust is structured around 18 distinct experimental setups. Its core is a meticulously constructed dataset of over 4,420 audio/text samples, drawn from real-world scenarios (e.g., daily conversations, emergency calls, voice assistant interactions), specifically designed to probe the multifaceted trustworthiness of ALLMs. For assessment, the benchmark carefully designs 9 audio-specific evaluation metrics, and we employ a large-scale automated pipeline for objective and scalable scoring of model outputs. Experimental results reveal the trustworthiness boundaries and limitations of current state-of-the-art open-source and closed-source ALLMs when confronted with various high-risk audio scenarios, offering valuable insights for the secure and trustworthy deployment of future audio models. Our platform and benchmark are available at https://github.com/JusperLee/AudioTrust.

cs.RO [Back]

[335] UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning

Xiangyu Wang,Donglin Yang,Yue Liao,Wenhao Zheng,Wenjun Wu,Bin Dai,Hongsheng Li,Si Liu

Main category: cs.RO

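TL;DR: The paper proposes the 'Flying-on-a-Word' task, which uses imitation learning to give UAVs fine-grained trajectory control in response to language instructions, and releases the UAV-Flow benchmark.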

Details

Motivation: Prior work has focused mainly on high-level planning and long-horizon navigation for UAVs; this paper turns to language-guided fine-grained trajectory control to make human-drone interaction more intuitive. Method: Using imitation learning, UAVs learn control policies by mimicking expert flight trajectories paired with atomic language instructions; the authors also design the UAV-Flow benchmark. Result: Experiments show that VLA models outperform VLN baselines and that spatial grounding plays a critical role in fine-grained control. Conclusion: UAV-Flow supports direct deployment without a sim-to-real gap and provides an effective framework for language-guided UAV control. Abstract: Unmanned Aerial Vehicles (UAVs) are evolving into language-interactive platforms, enabling more intuitive forms of human-drone interaction. While prior works have primarily focused on high-level planning and long-horizon navigation, we shift attention to language-guided fine-grained trajectory control, where UAVs execute short-range, reactive flight behaviors in response to language instructions. We formalize this problem as the Flying-on-a-Word (Flow) task and introduce UAV imitation learning as an effective approach. In this framework, UAVs learn fine-grained control policies by mimicking expert pilot trajectories paired with atomic language instructions. To support this paradigm, we present UAV-Flow, the first real-world benchmark for language-conditioned, fine-grained UAV control. It includes a task formulation, a large-scale dataset collected in diverse environments, a deployable control framework, and a simulation suite for systematic evaluation. Our design enables UAVs to closely imitate the precise, expert-level flight trajectories of human pilots and supports direct deployment without sim-to-real gap. We conduct extensive experiments on UAV-Flow, benchmarking VLN and VLA paradigms. Results show that VLA models are superior to VLN baselines and highlight the critical role of spatial grounding in the fine-grained Flow setting.

[336] VERDI: VLM-Embedded Reasoning for Autonomous Driving

Bowen Feng,Zhiting Mei,Baiang Li,Julian Ost,Roger Girgis,Anirudha Majumdar,Felix Heide

Main category: cs.RO

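TL;DR: VERDI is a training-time framework that distills the reasoning process and commonsense knowledge of vision-language models into the autonomous driving stack, improving performance while preserving inference speed.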

Details

Motivation: Existing VLM-based trajectory planning methods are impractical to deploy and do not allow safety decomposition. Method: VERDI distills the reasoning process by aligning the VLM's text features with the outputs of the intermediate modules of the autonomous driving stack. Result: On the NuScenes dataset, VERDI outperforms end-to-end methods that do not embed reasoning by 10% in ℓ2 distance, while maintaining high inference speed. Conclusion: VERDI embeds structured reasoning into a modular autonomous driving stack and avoids the high inference cost of large vision-language models. Abstract: While autonomous driving (AD) stacks struggle with decision making under partial observability and real-world complexity, human drivers are capable of commonsense reasoning to make near-optimal decisions with limited information. Recent work has attempted to leverage finetuned Vision-Language Models (VLMs) for trajectory planning at inference time to emulate human behavior. Despite their success in benchmark evaluations, these methods are often impractical to deploy (a 70B parameter VLM inference at merely 8 tokens per second requires more than 160G of memory), and their monolithic network structure prohibits safety decomposition. To bridge this gap, we propose VLM-Embedded Reasoning for autonomous Driving (VERDI), a training-time framework that distills the reasoning process and commonsense knowledge of VLMs into the AD stack. VERDI augments modular differentiable end-to-end (e2e) AD models by aligning intermediate module outputs at the perception, prediction, and planning stages with text features explaining the driving reasoning process produced by VLMs. By encouraging alignment in latent space, VERDI enables the modular AD stack to internalize structured reasoning, without incurring the inference-time costs of large VLMs. We demonstrate the effectiveness of our method on the NuScenes dataset and find that VERDI outperforms existing e2e methods that do not embed reasoning by 10% in $\ell_{2}$ distance, while maintaining high inference speed.
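Aligning module outputs with text features in latent space can be sketched with a cosine-similarity loss averaged over the three stages named in the abstract. This is an assumption made for illustration: the abstract does not specify VERDI's loss, and the function names, the dictionary layout, and the toy vectors below are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(module_feats, text_feats):
    """Mean of (1 - cosine similarity) over the three stages.

    Assumption: both feature sets are already projected to the same
    dimension; a real system would need learned projection layers.
    """
    stages = ("perception", "prediction", "planning")
    losses = [
        1.0 - cosine_similarity(module_feats[s], text_feats[s]) for s in stages
    ]
    return sum(losses) / len(losses)


# Toy 2-D features: perception and planning are aligned, prediction is not.
module_feats = {
    "perception": [1.0, 0.0],
    "prediction": [0.0, 2.0],
    "planning": [3.0, 4.0],
}
text_feats = {
    "perception": [2.0, 0.0],
    "prediction": [1.0, 0.0],
    "planning": [3.0, 4.0],
}
loss = alignment_loss(module_feats, text_feats)
```

Because such a term is used only during training, the VLM that produces the text features is not needed at inference time, which matches the training-time framing in the abstract.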

[337] SEM: Enhancing Spatial Understanding for Robust Robot Manipulation

Xuewu Lin,Tianwei Lin,Lichao Huang,Hongyu Xie,Yiwei Jin,Keyu Li,Zhizhong Su

Main category: cs.RO

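TL;DR: The paper proposes SEM, a novel diffusion-based policy framework that significantly improves robot manipulation by enhancing spatial understanding and robot state encoding.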

Details

Motivation: Existing methods fall short in spatial understanding and semantic abstraction: 3D point cloud models lack semantic information, while 2D image encoders struggle with spatial reasoning. Method: SEM combines a spatial enhancer, which augments visual representations with 3D geometric context, and a robot state encoder, which captures joint dependencies through graph-based modeling. Result: SEM performs well across diverse tasks and outperforms existing baselines. Conclusion: By enhancing spatial understanding and robot state modeling, SEM achieves more robust and generalizable robot manipulation. Abstract: A key challenge in robot manipulation lies in developing policy models with strong spatial understanding, the ability to reason about 3D geometry, object relations, and robot embodiment. Existing methods often fall short: 3D point cloud models lack semantic abstraction, while 2D image encoders struggle with spatial reasoning. To address this, we propose SEM (Spatial Enhanced Manipulation model), a novel diffusion-based policy framework that explicitly enhances spatial understanding from two complementary perspectives. A spatial enhancer augments visual representations with 3D geometric context, while a robot state encoder captures embodiment-aware structure through graph-based modeling of joint dependencies. By integrating these modules, SEM significantly improves spatial understanding, leading to robust and generalizable manipulation across diverse tasks that outperforms existing baselines.

[338] Raw2Drive: Reinforcement Learning with Aligned World Models for End-to-End Autonomous Driving (in CARLA v2)

Zhenjie Yang,Xiaosong Jia,Qifeng Li,Xue Yang,Maoqing Yao,Junchi Yan

Main category: cs.RO

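TL;DR: Raw2Drive is a dual-stream model-based reinforcement learning (MBRL) approach that addresses the difficulty of training reinforcement learning (RL) for end-to-end autonomous driving (E2E-AD), using a guidance mechanism to keep the raw sensor world model consistent with the privileged world model.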

Details

Motivation: Imitation learning (IL) remains the mainstream paradigm in end-to-end autonomous driving but suffers from causal confusion and distribution shift. Reinforcement learning (RL) can mitigate these problems, but its training difficulty has limited its use. Method: Raw2Drive comprises a privileged world model and a raw sensor world model; a guidance mechanism keeps the two consistent, and the prior knowledge in the privileged world model guides the training of the raw sensor policy. Result: Raw2Drive is the only RL-based end-to-end method on CARLA Leaderboard 2.0 and Bench2Drive, and it achieves state-of-the-art performance. Conclusion: Raw2Drive fills the gap for MBRL in end-to-end autonomous driving and shows the potential of RL in this area. Abstract: Reinforcement Learning (RL) can mitigate the causal confusion and distribution shift inherent to imitation learning (IL). However, applying RL to end-to-end autonomous driving (E2E-AD) remains an open problem for its training difficulty, and IL is still the mainstream paradigm in both academia and industry. Recently, Model-based Reinforcement Learning (MBRL) has demonstrated promising results in neural planning; however, these methods typically require privileged information as input rather than raw sensor data. We fill this gap by designing Raw2Drive, a dual-stream MBRL approach. Initially, we efficiently train an auxiliary privileged world model paired with a neural planner that uses privileged information as input. Subsequently, we introduce a raw sensor world model trained via our proposed Guidance Mechanism, which ensures consistency between the raw sensor world model and the privileged world model during rollouts. Finally, the raw sensor world model combines the prior knowledge embedded in the heads of the privileged world model to effectively guide the training of the raw sensor policy. Raw2Drive is so far the only RL-based end-to-end method on CARLA Leaderboard 2.0 and Bench2Drive, and it achieves state-of-the-art performance.

[339] ManipLVM-R1: Reinforcement Learning for Reasoning in Embodied Manipulation with Large Vision-Language Models

Zirui Song,Guangxian Ouyang,Mingzhe Li,Yuheng Ji,Chenxi Wang,Zixiang Xu,Zeyu Zhang,Xiaoqing Zhang,Qian Jiang,Zhenhao Chen,Zhongzhi Li,Rui Yan,Xiuying Chen

Main category: cs.RO

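TL;DR: Proposes ManipLVM-R1, a reinforcement learning framework that replaces traditional supervised learning with verifiable rewards (RLVR) to improve generalization and physical reasoning in robotic manipulation.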

Details

Motivation: Existing large vision-language models (LVLMs) depend on costly human-annotated data, generalize poorly, and struggle to adapt to out-of-domain (OOD) scenarios. Method: Designs two rule-based reward functions: an Affordance Perception Reward (to improve localization of interaction regions) and a Trajectory Match Reward (to ensure the physical plausibility of action paths). Result: Directly optimizing for task-aligned outcomes reduces the dependence on annotations and improves generalization and physical reasoning. Conclusion: Through reinforcement learning with verifiable rewards, the ManipLVM-R1 framework significantly improves the adaptability and systematic reasoning of robotic manipulation. Abstract: Large Vision-Language Models (LVLMs) have recently advanced robotic manipulation by leveraging vision for scene perception and language for instruction following. However, existing methods rely heavily on costly human-annotated training datasets, which limits their generalization and causes them to struggle in out-of-domain (OOD) scenarios, reducing real-world adaptability. To address these challenges, we propose ManipLVM-R1, a novel reinforcement learning framework that replaces traditional supervision with Reinforcement Learning using Verifiable Rewards (RLVR). By directly optimizing for task-aligned outcomes, our method enhances generalization and physical reasoning while removing the dependence on costly annotations. Specifically, we design two rule-based reward functions targeting key robotic manipulation subtasks: an Affordance Perception Reward to enhance localization of interaction regions, and a Trajectory Match Reward to ensure the physical plausibility of action paths. These rewards provide immediate feedback and impose spatial-logical constraints, encouraging the model to go beyond shallow pattern matching and instead learn deeper, more systematic reasoning about physical interactions.
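
The two rule-based rewards named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the affordance reward is taken to be the intersection-over-union (IoU) of a predicted and a reference box, and the trajectory reward is taken to be a decreasing function of the mean point-wise distance between two equal-length 2D paths. The function names `iou_reward` and `trajectory_reward` and the `scale` parameter are hypothetical.

```python
from math import hypot


def iou_reward(pred_box, gt_box):
    """Assumed affordance reward: IoU of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred_box[0], gt_box[0]), max(pred_box[1], gt_box[1])
    ix2, iy2 = min(pred_box[2], gt_box[2]), min(pred_box[3], gt_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    area_gt = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0


def trajectory_reward(pred_path, gt_path, scale=1.0):
    """Assumed trajectory reward: maps mean point-wise distance to (0, 1]."""
    if not gt_path or len(pred_path) != len(gt_path):
        return 0.0  # malformed predictions earn no reward
    mean_dist = sum(
        hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred_path, gt_path)
    ) / len(gt_path)
    return 1.0 / (1.0 + mean_dist / scale)


if __name__ == "__main__":
    print(iou_reward((0, 0, 2, 2), (1, 1, 3, 3)))
    print(trajectory_reward([(0, 0), (1, 1)], [(0, 0), (1, 2)]))
```

Both functions are computed from the model output and a reference annotation alone, which is what makes a reward "verifiable" in the sense of needing no learned reward model.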

cs.CR [Back]

[340] All You Need is "Leet": Evading Hate-speech Detection AI

Sampanna Yashwant Kahu,Naman Ahuja

Main category: cs.CR

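TL;DR: Proposes a black-box perturbation technique that fools deep learning hate-speech detection models, reducing their effectiveness while preserving the original meaning.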

Details

Motivation: Hate speech is widespread on social media and online forums, and users need to be protected from it. Method: Designs black-box perturbation techniques that generate perturbations capable of fooling state-of-the-art hate-speech detection models. Result: The best perturbation attack evades detection for 86.8% of hateful text. Conclusion: The method effectively reduces the efficiency of hate-speech detection while keeping semantic change minimal. Abstract: Social media and online forums are increasingly becoming popular. Unfortunately, these platforms are being used for spreading hate speech. In this paper, we design black-box techniques to protect users from hate-speech on online platforms by generating perturbations that can fool state-of-the-art deep-learning-based hate speech detection models, thereby decreasing their efficiency. We also ensure a minimal change in the original meaning of hate-speech. Our best perturbation attack is successfully able to evade hate-speech detection for 86.8% of hateful text.

[341] DuFFin: A Dual-Level Fingerprinting Framework for LLMs IP Protection

Yuliang Yan,Haochun Tang,Shuo Yan,Enyan Dai

Main category: cs.CR

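TL;DR: DuFFin is a novel dual-level fingerprinting framework for ownership verification in the black-box setting that accurately identifies variants of a protected large language model.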

Details

Motivation: Protect the intellectual property of large language models against malicious stealing or unauthorized deployment. Method: Proposes the DuFFin framework, which extracts trigger-pattern and knowledge-level fingerprints to identify the source of a model. Result: Experiments show that DuFFin accurately verifies the copyright of the base model, with an IP-ROC metric above 0.95. Conclusion: DuFFin is an efficient and practical black-box fingerprinting method suited to protecting the intellectual property of large language models. Abstract: Large language models (LLMs) are considered valuable Intellectual Properties (IP) for legitimate owners due to the enormous computational cost of training. It is crucial to protect the IP of LLMs from malicious stealing or unauthorized deployment. Despite existing efforts in watermarking and fingerprinting LLMs, these methods either impact the text generation process or are limited to white-box access to the suspect model, making them impractical. Hence, we propose DuFFin, a novel Dual-Level Fingerprinting Framework for black-box setting ownership verification. DuFFin extracts the trigger pattern and the knowledge-level fingerprints to identify the source of a suspect model. We conduct experiments on a variety of models collected from the open-source website, including four popular base models as protected LLMs and their fine-tuning, quantization, and safety alignment versions, which are released by large companies, start-ups, and individual users. Results show that our method can accurately verify the copyright of the base protected LLM on their model variants, achieving the IP-ROC metric greater than 0.95. Our code is available at https://github.com/yuliangyan0807/llm-fingerprint.

[342] CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning

Biao Yi,Tiansheng Huang,Baolei Zhang,Tong Li,Lihai Nie,Zheli Liu,Li Shen

Main category: cs.CR

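TL;DR: The paper proposes CTRAP, a new method that defends against harmful fine-tuning attacks by inducing model collapse rather than relying on selective unlearning.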

Details

Motivation: Existing selective unlearning methods cannot effectively stop LLMs from rapidly relearning or repurposing their capabilities for malicious tasks, so a more thorough defense mechanism is needed. Method: Proposes the Collapse Trap (CTRAP), which pre-configures the model's reaction during alignment; if harmful fine-tuning is detected, it triggers a collapse of the model's core capabilities. Result: Experiments show that CTRAP effectively withstands harmful fine-tuning attacks across a variety of LLMs while maintaining high performance in benign scenarios. Conclusion: CTRAP addresses the limitations of selective unlearning through a model collapse mechanism, offering a new approach to LLM safety. Abstract: Fine-tuning-as-a-service, while commercially successful for Large Language Model (LLM) providers, exposes models to harmful fine-tuning attacks. As a widely explored defense paradigm against such attacks, unlearning attempts to remove malicious knowledge from LLMs, thereby essentially preventing them from being used to perform malicious tasks. However, we highlight a critical flaw: the powerful general adaptability of LLMs allows them to easily bypass selective unlearning by rapidly relearning or repurposing their capabilities for harmful tasks. To address this fundamental limitation, we propose a paradigm shift: instead of selective removal, we advocate for inducing model collapse--effectively forcing the model to "unlearn everything"--specifically in response to updates characteristic of malicious adaptation. This collapse directly neutralizes the very general capabilities that attackers exploit, tackling the core issue unaddressed by selective unlearning. We introduce the Collapse Trap (CTRAP) as a practical mechanism to implement this concept conditionally. Embedded during alignment, CTRAP pre-configures the model's reaction to subsequent fine-tuning dynamics. If updates during fine-tuning constitute a persistent attempt to reverse safety alignment, the pre-configured trap triggers a progressive degradation of the model's core language modeling abilities, ultimately rendering it inert and useless for the attacker. Crucially, this collapse mechanism remains dormant during benign fine-tuning, ensuring the model's utility and general capabilities are preserved for legitimate users.
Extensive empirical results demonstrate that CTRAP effectively counters harmful fine-tuning risks across various LLMs and attack settings, while maintaining high performance in benign scenarios. Our code is available at https://anonymous.4open.science/r/CTRAP.

[343] CAIN: Hijacking LLM-Humans Conversations via a Two-Stage Malicious System Prompt Generation and Refining Framework

Viet Pham,Thai Le

Main category: cs.CR

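TL;DR: The paper introduces a new security threat: manipulating an LLM's system prompt so that it outputs malicious answers to specific targeted questions while behaving normally on others. The authors develop the CAIN algorithm, which achieves significant attack impact on both open-source and commercial LLMs.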

Details

Motivation: LLMs are vulnerable to adversarial attacks, but existing work has not fully explored the threat of steering AI-human conversations through system prompts. The paper aims to expose the harm of such attacks and their potential for large-scale information manipulation. Method: Develops CAIN, an algorithm that automatically generates harmful system prompts for a specific target question in a black-box setting, without access to the LLM's parameters. Result: CAIN causes up to 40% F1 degradation on targeted questions in untargeted attacks and achieves over 70% F1 in targeted attacks, with minimal impact on benign questions. Conclusion: The study underscores the urgency of strengthening LLM robustness to ensure safety and integrity in real-world applications. Abstract: Large language models (LLMs) have advanced many applications, but are also known to be vulnerable to adversarial attacks. In this work, we introduce a novel security threat: hijacking AI-human conversations by manipulating LLMs' system prompts to produce malicious answers only to specific targeted questions (e.g., "Who should I vote for US President?", "Are Covid vaccines safe?"), while behaving benignly on others. This attack is detrimental as it can enable malicious actors to exercise large-scale information manipulation by spreading harmful but benign-looking system prompts online. To demonstrate such an attack, we develop CAIN, an algorithm that can automatically curate such harmful system prompts for a specific target question in a black-box setting or without the need to access the LLM's parameters. Evaluated on both open-source and commercial LLMs, CAIN demonstrates significant adversarial impact. In untargeted attacks or forcing LLMs to output incorrect answers, CAIN achieves up to 40% F1 degradation on targeted questions while preserving high accuracy on benign inputs. For targeted attacks or forcing LLMs to output specific harmful answers, CAIN achieves over 70% F1 scores on these targeted responses with minimal impact on benign questions. Our results highlight the critical need for enhanced robustness measures to safeguard the integrity and safety of LLMs in real-world applications. All source code will be publicly available.

[344] Backdoor Cleaning without External Guidance in MLLM Fine-tuning

Xuankun Rong,Wenke Huang,Jian Liang,Jinhe Bi,Xun Xiao,Yiming Li,Bo Du,Mang Ye

Main category: cs.CR

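TL;DR: BYE is a data filtering framework based on attention entropy patterns that identifies and filters backdoor samples in multimodal large language models without requiring clean supervision or model modifications.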

Details

Motivation: Multimodal large language models face security risks in fine-tuning-as-a-service (FTaaS) settings, where malicious fine-tuning can implant backdoors. Method: BYE filters backdoor samples through a three-stage pipeline: extracting attention maps, computing entropy scores and profiling sensitive layers, and unsupervised clustering. Result: BYE achieves near-zero attack success rates across multiple datasets and trigger types while maintaining clean-task performance. Conclusion: BYE offers a robust and generalizable solution to backdoor threats in MLLMs. Abstract: Multimodal Large Language Models (MLLMs) are increasingly deployed in fine-tuning-as-a-service (FTaaS) settings, where user-submitted datasets adapt general-purpose models to downstream tasks. This flexibility, however, introduces serious security risks, as malicious fine-tuning can implant backdoors into MLLMs with minimal effort. In this paper, we observe that backdoor triggers systematically disrupt cross-modal processing by causing abnormal attention concentration on non-semantic regions--a phenomenon we term attention collapse. Based on this insight, we propose Believe Your Eyes (BYE), a data filtering framework that leverages attention entropy patterns as self-supervised signals to identify and filter backdoor samples. BYE operates via a three-stage pipeline: (1) extracting attention maps using the fine-tuned model, (2) computing entropy scores and profiling sensitive layers via bimodal separation, and (3) performing unsupervised clustering to remove suspicious samples. Unlike prior defenses, BYE requires no clean supervision, auxiliary labels, or model modifications. Extensive experiments across various datasets, models, and diverse trigger types validate BYE's effectiveness: it achieves near-zero attack success rates while maintaining clean-task performance, offering a robust and generalizable solution against backdoor threats in MLLMs.
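
The entropy-then-cluster idea in the pipeline above can be sketched in a few lines. This is not the authors' implementation: it assumes that "attention collapse" shows up as low Shannon entropy of a sample's attention distribution over image tokens, and it stands in a plain one-dimensional two-means split for the unsupervised clustering step. The names `attention_entropy` and `low_entropy_cluster` are hypothetical, and the per-layer profiling stage is omitted.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one non-negative attention vector."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


def low_entropy_cluster(scores, iters=20):
    """Split 1-D scores into two clusters; return indices of the lower one."""
    lo, hi = min(scores), max(scores)
    if lo == hi:
        return []  # no separation, so nothing is flagged
    for _ in range(iters):
        mid = (lo + hi) / 2
        low = [s for s in scores if s <= mid]
        high = [s for s in scores if s > mid]
        if not low or not high:
            break
        lo, hi = sum(low) / len(low), sum(high) / len(high)
    mid = (lo + hi) / 2
    return [i for i, s in enumerate(scores) if s <= mid]


if __name__ == "__main__":
    spread = [0.25, 0.25, 0.25, 0.25]  # attention spread over four tokens
    peaked = [0.97, 0.01, 0.01, 0.01]  # attention concentrated on one token
    scores = [attention_entropy(w) for w in (spread, spread, peaked, spread, peaked)]
    print(low_entropy_cluster(scores))  # indices of the suspicious samples
```

A real pipeline would first pick the layers whose entropy scores separate most clearly into two modes, which is what the abstract calls profiling sensitive layers via bimodal separation.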

eess.IV [Back]

[345] MambaStyle: Efficient StyleGAN Inversion for Real Image Editing with State-Space Models

Jhon Lopez,Carlos Hinojosa,Henry Arguello,Bernard Ghanem

Main category: eess.IV

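TL;DR: MambaStyle proposes an efficient single-stage encoder-based method built on vision state-space models (VSSMs) for GAN inversion and editing, balancing reconstruction quality, editability, and computational efficiency.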

Details

Motivation: Existing GAN inversion methods struggle to achieve high-quality reconstruction, effective editing, and computational efficiency at the same time; MambaStyle aims to solve this. Method: By integrating VSSMs into the architecture, MambaStyle achieves high-quality image inversion and flexible editing while reducing parameters and computational complexity. Result: Experiments show that MambaStyle outperforms existing methods in inversion accuracy, editing quality, and computational efficiency, with lower model complexity and faster inference. Conclusion: MambaStyle is an efficient GAN inversion and editing method suitable for real-time applications. Abstract: The task of inverting real images into StyleGAN's latent space to manipulate their attributes has been extensively studied. However, existing GAN inversion methods struggle to balance high reconstruction quality, effective editability, and computational efficiency. In this paper, we introduce MambaStyle, an efficient single-stage encoder-based approach for GAN inversion and editing that leverages vision state-space models (VSSMs) to address these challenges. Specifically, our approach integrates VSSMs within the proposed architecture, enabling high-quality image inversion and flexible editing with significantly fewer parameters and reduced computational complexity compared to state-of-the-art methods. Extensive experiments show that MambaStyle achieves a superior balance among inversion accuracy, editing quality, and computational efficiency. Notably, our method achieves superior inversion and editing results with reduced model complexity and faster inference, making it suitable for real-time applications.

[346] P3Net: Progressive and Periodic Perturbation for Semi-Supervised Medical Image Segmentation

Zhenyan Yao,Miao Zhang,Lanhu Wu,Yongri Piao,Feng Tian,Weibing Sun,Huchuan Lu

Main category: eess.IV

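TL;DR: The paper proposes a progressive and periodic perturbation mechanism (P3M) and a boundary-focused loss for semi-supervised medical image segmentation, dynamically adjusting perturbations and improving boundary prediction accuracy.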

Details

Motivation: Existing perturbation techniques are not well understood, and excessive or inappropriate perturbation can have negative effects. The paper addresses how to use perturbation mechanisms to guide learning and how to improve prediction accuracy in boundary regions. Method: Proposes P3M to dynamically adjust perturbations, combined with a boundary-focused loss that increases the model's sensitivity to boundary regions. Result: Achieves state-of-the-art performance on 2D and 3D datasets; P3M extends to other methods, and the boundary-focused loss is generally applicable. Conclusion: P3M and the boundary-focused loss effectively improve semi-supervised medical image segmentation and are broadly applicable. Abstract: Perturbation with diverse unlabeled data has proven beneficial for semi-supervised medical image segmentation (SSMIS). While many works have successfully used various perturbation techniques, a deeper understanding of learning perturbations is needed. Excessive or inappropriate perturbation can have negative effects, so we aim to address two challenges: how to use perturbation mechanisms to guide the learning of unlabeled data through labeled data, and how to ensure accurate predictions in boundary regions. Inspired by human progressive and periodic learning, we propose a progressive and periodic perturbation mechanism (P3M) and a boundary-focused loss. P3M enables dynamic adjustment of perturbations, allowing the model to gradually learn them. Our boundary-focused loss encourages the model to concentrate on boundary regions, enhancing sensitivity to intricate details and ensuring accurate predictions. Experimental results demonstrate that our method achieves state-of-the-art performance on two 2D and 3D datasets. Moreover, P3M is extendable to other methods, and the proposed loss serves as a universal tool for improving existing methods, highlighting the scalability and applicability of our approach.

[347] Benchmarking Chest X-ray Diagnosis Models Across Multinational Datasets

Qinmei Xu,Yiheng Li,Xianghao Zhan,Ahmet Gorkem Er,Brittany Dashevsky,Chuanjun Xu,Mohammed Alawad,Mengya Yang,Liu Ya,Changsheng Zhou,Xiao Li,Haruka Itakura,Olivier Gevaert

Main category: eess.IV

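TL;DR: This study evaluates the diagnostic performance and generalizability of vision-language pretrained foundation models versus traditional CNNs on multinational CXR datasets, finding that foundation models outperform CNNs in accuracy and task coverage but that all models perform worse on pediatric cases.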

Details

Motivation: Evaluate the performance and generalizability of foundation models versus traditional CNNs in CXR diagnosis, especially across different populations and tasks. Method: Compares eight CXR diagnostic models (five foundation models and three CNN architectures) on 37 standardized classification tasks, using six public and three private datasets. Result: Foundation models outperform CNNs on both public and private datasets, with MAVL performing best. All models show markedly lower performance on pediatric cases. Conclusion: The study highlights the value of structured supervision and prompt design in radiologic AI and suggests future directions including geographic expansion and ensemble modeling for clinical deployment. Abstract: Foundation models leveraging vision-language pretraining have shown promise in chest X-ray (CXR) interpretation, yet their real-world performance across diverse populations and diagnostic tasks remains insufficiently evaluated. This study benchmarks the diagnostic performance and generalizability of foundation models versus traditional convolutional neural networks (CNNs) on multinational CXR datasets. We evaluated eight CXR diagnostic models - five vision-language foundation models and three CNN-based architectures - across 37 standardized classification tasks using six public datasets from the USA, Spain, India, and Vietnam, and three private datasets from hospitals in China. Performance was assessed using AUROC, AUPRC, and other metrics across both shared and dataset-specific tasks. Foundation models outperformed CNNs in both accuracy and task coverage. MAVL, a model incorporating knowledge-enhanced prompts and structured supervision, achieved the highest performance on public (mean AUROC: 0.82; AUPRC: 0.32) and private (mean AUROC: 0.95; AUPRC: 0.89) datasets, ranking first in 14 of 37 public and 3 of 4 private tasks. All models showed reduced performance on pediatric cases, with average AUROC dropping from 0.88 +/- 0.18 in adults to 0.57 +/- 0.29 in children (p = 0.0202). These findings highlight the value of structured supervision and prompt design in radiologic AI and suggest future directions including geographic expansion and ensemble modeling for clinical deployment. Code for all evaluated models is available at https://drive.google.com/drive/folders/1B99yMQm7bB4h1sVMIBja0RfUu8gLktCE
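
Since the benchmark reports mean AUROC across tasks, a small worked example of the metric may help. The sketch below uses the standard pairwise definition of AUROC (the probability that a randomly chosen positive case scores higher than a randomly chosen negative one, with ties counted as one half). The labels and scores are made-up toy values, not data from the study, and `auroc` and `mean_auroc` are illustrative names.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def mean_auroc(tasks):
    """Unweighted mean AUROC over a list of (labels, scores) tasks."""
    return sum(auroc(labels, scores) for labels, scores in tasks) / len(tasks)


if __name__ == "__main__":
    # Toy task: 3 of the 4 positive/negative pairs are ranked correctly.
    print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise form is quadratic in the number of cases, so it suits small illustrations; rank-based implementations compute the same value more efficiently on large test sets.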

[348] Comprehensive Lung Disease Detection Using Deep Learning Models and Hybrid Chest X-ray Data with Explainable AI

Shuvashis Sarker,Shamim Rahim Refat,Faika Fairuj Preotee,Tanvir Rouf Shawon,Raihan Tanvir

Main category: eess.IV

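TL;DR: The study combines deep learning models with a hybrid dataset to significantly improve diagnostic accuracy for lung conditions such as COVID-19 and pneumonia; several models reach 99% accuracy on the hybrid dataset, and explainable AI techniques increase model transparency.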

Details

Motivation: Lung diseases affect millions of people worldwide, and advanced diagnostic tools are needed to improve detection and treatment accuracy. Method: Applies a range of deep learning models, including CNN, VGG16, and VGG19, to a hybrid dataset (drawn from Bangladesh and global sources) and uses LIME to improve model interpretability. Result: The hybrid dataset significantly improves model performance; VGG16, Xception, and other models reach 99% accuracy on it. Conclusion: The hybrid dataset and explainable AI techniques provide accurate and transparent AI solutions for medical imaging. Abstract: Advanced diagnostic instruments are crucial for the accurate detection and treatment of lung diseases, which affect millions of individuals globally. This study examines the effectiveness of deep learning and transfer learning models using a hybrid dataset, created by merging four individual datasets from Bangladesh and global sources. The hybrid dataset significantly enhances model accuracy and generalizability, particularly in detecting COVID-19, pneumonia, lung opacity, and normal lung conditions from chest X-ray images. A range of models, including CNN, VGG16, VGG19, InceptionV3, Xception, ResNet50V2, InceptionResNetV2, MobileNetV2, and DenseNet121, were applied to both individual and hybrid datasets. The results showed superior performance on the hybrid dataset, with VGG16, Xception, ResNet50V2, and DenseNet121 each achieving an accuracy of 99%. This consistent performance across the hybrid dataset highlights the robustness of these models in handling diverse data while maintaining high accuracy. To understand the models' implicit behavior, explainable AI techniques were employed to illuminate their black-box nature. Specifically, LIME was used to enhance the interpretability of model predictions, especially in cases of misclassification, contributing to the development of reliable and interpretable AI-driven solutions for medical imaging.

[349] OSCAR: One-Step Diffusion Codec Across Multiple Bit-rates

Jinpei Guo,Yifei Ji,Zheng Chen,Kai Liu,Min Liu,Wang Rao,Wenbo Li,Yong Guo,Yulun Zhang

Main category: eess.IV

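TL;DR: OSCAR is a one-step diffusion codec built on a pretrained latent diffusion model that supports image compression at multiple bit-rates and significantly improves inference efficiency.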

Details

Motivation: Existing diffusion-based methods require multi-step sampling and a separate model for each bit-rate, leading to high computation and storage costs. Method: Treats compressed latents as noisy variants of the original latents, supports multi-bit-rate reconstruction through a mapping to pseudo diffusion timesteps, and uses one-step denoising. Result: OSCAR performs strongly on quantitative and visual quality metrics and significantly improves inference efficiency. Conclusion: OSCAR offers a practical solution for efficient multi-bit-rate image compression. Abstract: Pretrained latent diffusion models have shown strong potential for lossy image compression, owing to their powerful generative priors. Most existing diffusion-based methods reconstruct images by iteratively denoising from random noise, guided by compressed latent representations. While these approaches have achieved high reconstruction quality, their multi-step sampling process incurs substantial computational overhead. Moreover, they typically require training separate models for different compression bit-rates, leading to significant training and storage costs. To address these challenges, we propose a one-step diffusion codec across multiple bit-rates, termed OSCAR. Specifically, our method views compressed latents as noisy variants of the original latents, where the level of distortion depends on the bit-rate. This perspective allows them to be modeled as intermediate states along a diffusion trajectory. By establishing a mapping from the compression bit-rate to a pseudo diffusion timestep, we condition a single generative model to support reconstructions at multiple bit-rates. Meanwhile, we argue that the compressed latents retain rich structural information, thereby making one-step denoising feasible. Thus, OSCAR replaces iterative sampling with a single denoising pass, significantly improving inference efficiency. Extensive experiments demonstrate that OSCAR achieves superior performance in both quantitative and visual quality metrics. The code and models will be released at https://github.com/jp-guo/OSCAR.
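
One way to picture the mapping from bit-rate to pseudo diffusion timestep is sketched below. The abstract does not say how the mapping is built, so this is only a plausible construction under stated assumptions: a standard linear-beta DDPM schedule, and a hypothetical per-bit-rate "signal fraction" (how much of the original latent survives compression, measured offline). The timestep is then the one whose cumulative alpha is closest to that fraction. The names `linear_alpha_bars` and `pseudo_timestep`, and the fractions in the example, are illustrative and not taken from the paper.

```python
def linear_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def pseudo_timestep(signal_fraction, alpha_bars):
    """Timestep whose cumulative alpha best matches the surviving signal."""
    return min(
        range(len(alpha_bars)),
        key=lambda t: abs(alpha_bars[t] - signal_fraction),
    )


if __name__ == "__main__":
    alpha_bars = linear_alpha_bars()
    # Hypothetical calibration: lower bit-rates keep less of the latent.
    calibration = {"high": 0.9, "medium": 0.6, "low": 0.3}
    table = {name: pseudo_timestep(f, alpha_bars) for name, f in calibration.items()}
    print(table)  # lower bit-rate maps to a later (noisier) timestep
```

With such a lookup table, a single denoiser conditioned on the timestep could in principle serve every bit-rate, which matches the single-model goal described in the abstract.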

[350] Compressing Human Body Video with Interactive Semantics: A Generative Approach

Bolin Chen,Shanzhi Yin,Hanwei Zhu,Lingyu Zhu,Zihan Zhang,Jie Chen,Ru-Ling Liao,Shiqi Wang,Yan Ye

Main category: eess.IV

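TL;DR: Proposes a human body video compression method based on interactive semantics, embedding semantic-level representations to make video coding interactive and controllable.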

Details

Motivation: Traditional video coding performs poorly at ultra-low bitrates and lacks interactivity, making it hard to meet the needs of digital human communication in the metaverse. Method: Uses a 3D human model to decompose complex human body signals into configurable embeddings that support controllable editing, compact compression, and efficient transmission. The decoder reconstructs high-quality video from these semantics. Result: At ultra-low bitrates, the method outperforms the VVC standard and generative compression schemes, and it enables interactivity without additional pre-processing. Conclusion: The method provides an efficient and highly interactive video compression solution for digital human communication in the metaverse. Abstract: In this paper, we propose to compress human body video with interactive semantics, which can facilitate video coding to be interactive and controllable by manipulating semantic-level representations embedded in the coded bitstream. In particular, the proposed encoder employs a 3D human model to disentangle nonlinear dynamics and complex motion of human body signal into a series of configurable embeddings, which are controllably edited, compactly compressed, and efficiently transmitted. Moreover, the proposed decoder can evolve the mesh-based motion fields from these decoded semantics to realize the high-quality human body video reconstruction. Experimental results illustrate that the proposed framework can achieve promising compression performance for human body videos at ultra-low bitrate ranges compared with the state-of-the-art video coding standard Versatile Video Coding (VVC) and the latest generative compression schemes. Furthermore, the proposed framework enables interactive human body video coding without any additional pre-/post-manipulation processes, which is expected to shed light on metaverse-related digital human communication in the future.

[351] Generative Latent Coding for Ultra-Low Bitrate Image and Video Compression

Linfeng Qi,Zhaoyang Jia,Jiahao Li,Bin Li,Houqiang Li,Yan Lu

Main category: eess.IV

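TL;DR: Proposes generative latent coding (GLC) for image and video compression, which performs transform coding in a latent space to achieve high realism and high fidelity at ultra-low bitrates.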

Details

Motivation: Existing pixel-space compression methods are misaligned with human perception and therefore struggle to achieve both high realism and high fidelity at ultra-low bitrates. Method: Performs transform coding in the latent space of a generative vector-quantized variational auto-encoder (VQ-VAE), and introduces a spatial categorical hyper module and a spatio-temporal categorical hyper module to improve performance. Result: On the CLIC 2020 test set, GLC-image operates below 0.04 bpp and saves 45% bitrate relative to MS-ILLM; GLC-video saves 65.3% bitrate over PLVC in terms of DISTS. Conclusion: The GLC models significantly improve the visual quality of image and video compression at ultra-low bitrates. Abstract: Most existing approaches for image and video compression perform transform coding in the pixel space to reduce redundancy. However, due to the misalignment between the pixel-space distortion and human perception, such schemes often face difficulties in achieving both high-realism and high-fidelity at ultra-low bitrate. To solve this problem, we propose Generative Latent Coding (GLC) models for image and video compression, termed GLC-image and GLC-video. The transform coding of GLC is conducted in the latent space of a generative vector-quantized variational auto-encoder (VQ-VAE). Compared to the pixel-space, such a latent space offers greater sparsity, richer semantics and better alignment with human perception, and shows its advantages in achieving high-realism and high-fidelity compression. To further enhance performance, we improve the hyper prior by introducing a spatial categorical hyper module in GLC-image and a spatio-temporal categorical hyper module in GLC-video. Additionally, the code-prediction-based loss function is proposed to enhance the semantic consistency. Experiments demonstrate that our scheme shows high visual quality at ultra-low bitrate for both image and video compression. For image compression, GLC-image achieves an impressive bitrate of less than 0.04 bpp, achieving the same FID as previous SOTA model MS-ILLM while using 45% less bitrate on the CLIC 2020 test set. For video compression, GLC-video achieves 65.3% bitrate saving over PLVC in terms of DISTS.
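
To make the reported figures concrete, the arithmetic behind "bits per pixel" (bpp) and "bitrate saving" can be written out. This is a small illustrative sketch using the standard definitions; the image size and byte counts in the example are hypothetical and are not taken from the paper.

```python
def bits_per_pixel(num_bytes, width, height):
    """Compressed size in bits divided by the number of pixels."""
    return num_bytes * 8 / (width * height)


def bitrate_saving(bpp_ours, bpp_reference):
    """Fraction of bitrate saved relative to a reference codec."""
    return 1.0 - bpp_ours / bpp_reference


if __name__ == "__main__":
    # A hypothetical 768x512 image stored in 1,966 bytes is just under 0.04 bpp.
    print(bits_per_pixel(1966, 768, 512))
    # A codec at 0.022 bpp saves about 45% relative to one at 0.04 bpp.
    print(bitrate_saving(0.022, 0.04))
```

At this operating point the whole image fits in roughly two kilobytes, which is why perceptual metrics such as FID and DISTS, rather than pixel-wise error, are used for comparison.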

Ge Meng,Zhongnan Cai,Jingyan Tu,Yingying Wang,Chenxin Li,Yue Huang,Xinghao Ding

Main category: eess.IV

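TL;DR: The paper proposes a physics-informed cross-modal state space model network (PCMamba) for dual-camera compressive hyperspectral imaging (DCCHI), which combines the physical imaging process with the linear-complexity Mamba model to achieve lightweight, high-quality hyperspectral image reconstruction.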

Details

Motivation: Existing research mainly extracts spectral and spatial information explicitly from 2D compressive measurements and PAN images, which creates a bottleneck in hyperspectral reconstruction. The paper analyzes the interrelationships between physical properties (such as temperature, emissivity, and multiple reflections between objects) to provide deeper theoretical support for reconstruction. Method: Proposes PCMamba, which incorporates the physical imaging process of hyperspectral thermal signals into the linear complexity of Mamba, and designs a Cross-Modal Scanning Mamba Block (CSMB) that exploits the latent information in 2D measurements and PAN images through cross-modal pixel-wise interaction with positional inductive bias. Result: Experiments on real and simulated datasets show that the method significantly outperforms state-of-the-art methods in both quantitative and qualitative metrics. Conclusion: Through a physics-driven synthesis process and cross-modal interaction, PCMamba achieves efficient, high-quality hyperspectral image reconstruction and offers new theoretical and technical support for the field. Abstract: Panchromatic (PAN)-assisted Dual-Camera Compressive Hyperspectral Imaging (DCCHI) is a key technology in snapshot hyperspectral imaging. Existing research primarily focuses on exploring spectral information from 2D compressive measurements and spatial information from PAN images in an explicit manner, leading to a bottleneck in HSI reconstruction. Various physical factors, such as temperature, emissivity, and multiple reflections between objects, play a critical role in the process of a sensor acquiring hyperspectral thermal signals. Inspired by this, we attempt to investigate the interrelationships between physical properties to provide deeper theoretical insights for HSI reconstruction. In this paper, we propose a Physics-Informed Cross-Modal State Space Model Network (PCMamba) for DCCHI, which incorporates the forward physical imaging process of HSI into the linear complexity of Mamba to facilitate lightweight and high-quality HSI reconstruction. Specifically, we analyze the imaging process of hyperspectral thermal signals to enable the network to disentangle the three key physical properties: temperature, emissivity, and texture. By fully exploiting the potential information embedded in 2D measurements and PAN images, the HSIs are reconstructed through a physics-driven synthesis process. Furthermore, we design a Cross-Modal Scanning Mamba Block (CSMB) that introduces inter-modal pixel-wise interaction with positional inductive bias by cross-scanning the backbone features and PAN features.
Extensive experiments conducted on both real and simulated datasets demonstrate that our method significantly outperforms SOTA methods in both quantitative and qualitative metrics.

cs.AR [Back]

[353] CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark

Ahmed Heakl,Sarim Hashmi,Gustavo Bertolo Stahl,Seung Hun Eddie Han,Salman Khan,Abdulrahman Mahmoud

Main category: cs.AR

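TL;DR: CASS is a large-scale dataset and model suite for cross-architecture GPU code transpilation that supports both source-level and assembly-level translation and substantially outperforms commercial baselines.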

Details

Motivation: Address a critical gap in the portability of low-level GPU code. Method: Trains domain-specific language models on 70k verified code pairs and introduces CASS-Bench for rigorous evaluation. Result: 95% source translation accuracy and 37.5% assembly translation accuracy; generated code matches native performance in over 85% of test cases. Conclusion: CASS provides open-source resources for GPU compiler tooling and hardware translation, fostering further progress. Abstract: We introduce CASS, the first large-scale dataset and model suite for cross-architecture GPU code transpilation, targeting both source-level (CUDA $\leftrightarrow$ HIP) and assembly-level (Nvidia SASS $\leftrightarrow$ AMD RDNA3) translation. The dataset comprises 70k verified code pairs across host and device, addressing a critical gap in low-level GPU code portability. Leveraging this resource, we train the CASS family of domain-specific language models, achieving 95% source translation accuracy and 37.5% assembly translation accuracy, substantially outperforming commercial baselines such as GPT-4o, Claude, and Hipify. Our generated code matches native performance in over 85% of test cases, preserving runtime and memory behavior. To support rigorous evaluation, we introduce CASS-Bench, a curated benchmark spanning 16 GPU domains with ground-truth execution. All data, models, and evaluation tools are released as open source to foster progress in GPU compiler tooling, binary compatibility, and LLM-guided hardware translation. Dataset and benchmark are on HuggingFace (https://huggingface.co/datasets/MBZUAI/cass), with code on GitHub (https://github.com/GustavoStahl/CASS).

Table of Contents

cs.CV [Back]

[1] Multilinear subspace learning for person re-identification based fusion of high order tensor features

[2] Generative AI for Autonomous Driving: A Review

[3] How Do Large Vision-Language Models See Text in Image? Unveiling the Distinctive Role of OCR Heads

[4] SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval

[5] Satellites Reveal Mobility: A Commuting Origin-destination Flow Generator for Global Cities

[6] Decouple and Orthogonalize: A Data-Free Framework for LoRA Merging

[7] Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval

[8] GRIT: Teaching MLLMs to Think with Images

[9] Challenger: Affordable Adversarial Driving Video Generation

[10] ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation

[11] VideoGameQA-Bench: Evaluating Vision-Language Models for Video Game Quality Assurance

[12] Super-Resolution with Structured Motion

[13] OViP: Online Vision-Language Preference Learning

[14] Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

[15] Analyzing Hierarchical Structure in Vision Models with Sparse Autoencoders

[16] Domain Adaptive Skin Lesion Classification via Conformal Ensemble of Vision Transformers

[17] Image-to-Image Translation with Diffusion Transformers and CLIP-Based Image Conditioning

[18] Position: Agentic Systems Constitute a Key Component of Next-Generation Intelligent Image Processing

[19] CP-LLM: Context and Pixel Aware Large Language Model for Video Quality Assessment

[20] Learning better representations for crowded pedestrians in offboard LiDAR-camera 3D tracking-by-detection

[21] An Approach Towards Identifying Bangladeshi Leaf Diseases through Transfer Learning and XAI

[22] An Exploratory Approach Towards Investigating and Explaining Vision Transformer and Transfer Learning for Brain Disease Detection

[23] GMatch: Geometry-Constrained Feature Matching for RGB-D Object Pose Estimation

[24] Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation

[25] When VLMs Meet Image Classification: Test Sets Renovation via Missing Label Identification

[26] Training-Free Reasoning and Reflection in MLLMs

[27] BadDepth: Backdoor Attacks Against Monocular Depth Estimation in the Physical World

[28] Breaking Complexity Barriers: High-Resolution Image Restoration with Rank Enhanced Linear Attention

[29] Deep Learning-Driven Ultra-High-Definition Image Restoration: A Survey

[30] RE-TRIP : Reflectivity Instance Augmented Triangle Descriptor for 3D Place Recognition

[31] TRAIL: Transferable Robust Adversarial Images via Latent diffusion

[32] Erased or Dormant? Rethinking Concept Erasure Through Reversibility

[33] QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design

[34] Redemption Score: An Evaluation Framework to Rank Image Captions While Redeeming Image Semantics and Language Pragmatics

[35] Understanding Generative AI Capabilities in Everyday Image Editing Tasks

[36] VLM-R$^3$: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought

[37] A Causal Approach to Mitigate Modality Preference Bias in Medical Visual Question Answering

[38] A Shape-Aware Total Body Photography System for In-focus Surface Coverage Optimization

[39] CT-Agent: A Multimodal-LLM Agent for 3D CT Radiology Question Answering

[40] DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution

[41] Swin Transformer for Robust CGI Images Detection: Intra- and Inter-Dataset Analysis across Multiple Color Spaces

[42] DualComp: End-to-End Learning of a Unified Dual-Modality Lossless Compressor

[43] LINEA: Fast and Accurate Line Detection Using Scalable Transformers

[44] DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

[45] ARPO:End-to-End Policy Optimization for GUI Agents with Experience Replay

[46] Efficient Prototype Consistency Learning in Medical Image Segmentation via Joint Uncertainty and Data Augmentation

[47] Self-Classification Enhancement and Correction for Weakly Supervised Object Detection

[48] SAMba-UNet: Synergizing SAM2 and Mamba in UNet with Heterogeneous Aggregation for Cardiac MRI Segmentation

[49] Paired and Unpaired Image to Image Translation using Generative Adversarial Networks

[50] Accelerating Targeted Hard-Label Adversarial Attacks in Low-Query Black-Box Settings

[51] NTIRE 2025 challenge on Text to Image Generation Model Quality Assessment

[52] SuperPure: Efficient Purification of Localized and Distributed Adversarial Patches via Super-Resolution GAN Models

[53] Efficient Motion Prompt Learning for Robust Visual Tracking

[54] TensorAR: Refinement is All You Need in Autoregressive Image Generation

[55] Panoptic Captioning: Seeking An Equivalency Bridge for Image and Text

[56] FPQVAR: Floating Point Quantization for Visual Autoregressive Model with FPGA Hardware Co-design

[57] Fusion of Foundation and Vision Transformer Model Features for Dermatoscopic Image Classification

[58] Style Transfer with Diffusion Models for Synthetic-to-Real Domain Adaptation

[59] Temporal and Spatial Feature Fusion Framework for Dynamic Micro Expression Recognition

[60] DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos

[61] MAGE: A Multi-task Architecture for Gaze Estimation with an Efficient Calibration Module

[62] Sketchy Bounding-box Supervision for 3D Instance Segmentation

[63] AdvReal: Adversarial Patch Generation Framework with Application to Adversarial Safety Evaluation of Object Detection Systems

[64] Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression

[65] Pose-invariant face recognition via feature-space pose frontalization

[66] Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models

[67] Investigating Fine- and Coarse-grained Structural Correspondences Between Deep Neural Networks and Human Object Image Similarity Judgments Using Unsupervised Alignment

[68] Unlocking Smarter Device Control: Foresighted Planning with a World Model-Driven Code Execution Approach

[69] Joint Flow And Feature Refinement Using Attention For Video Restoration

[70] Ranked Entropy Minimization for Continual Test-Time Adaptation

[71] MAFE R-CNN: Selecting More Samples to Learn Category-aware Features for Small Object Detection

[72] TAT-VPR: Ternary Adaptive Transformer for Dynamic and Efficient Visual Place Recognition

[73] CMRINet: Joint Groupwise Registration and Segmentation for Cardiac Function Quantification from Cine-MRI

[74] MAGIC: Motion-Aware Generative Inference via Confidence-Guided LLM

[75] AnchorFormer: Differentiable Anchor Attention for Efficient Vision Transformer

[76] Consistent World Models via Foresight Diffusion

[77] Clear Nights Ahead: Towards Multi-Weather Nighttime Image Restoration

[78] InspectionV3: Enhancing Tobacco Quality Assessment with Deep Convolutional Neural Networks for Automated Workshop Management